Hadoop: Running WordCount Locally
This post records how to set up a Hadoop development environment on Windows and write a WordCount MapReduce job that runs in the local environment.
Main contents:
1. Set up the local environment
2. Write WordCount and run it locally
1. Set up the local environment
1.1. Unpack
Download the desired Hadoop version from the official site:
hadoop-2.7.3
Extract the downloaded Hadoop archive to any directory.
Copy winutils.exe into the hadoop-2.7.3/bin directory.
1.2 Configure environment variables
Create a new environment variable pointing to the Hadoop extraction path:
HADOOP_HOME: D:\soft\dev\hadoop-2.7.3
Then append to Path:
%HADOOP_HOME%\bin;
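Before moving on, it can help to verify that a fresh JVM actually sees the variable. A minimal sketch (EnvCheck is a hypothetical helper, not from the original post):

import java.io.File;

public class EnvCheck {
    public static void main(String[] args) {
        // Print what the JVM sees; null means the variable has not taken effect yet
        String home = System.getenv("HADOOP_HOME");
        System.out.println("HADOOP_HOME = " + home);
        if (home != null) {
            // On Windows, Hadoop's shell layer looks for %HADOOP_HOME%\bin\winutils.exe
            File winutils = new File(home, "bin\\winutils.exe");
            System.out.println("winutils.exe present: " + winutils.exists());
        }
    }
}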
2. Write WordCount
The input file looks like this:
hello java
hello hadoop
The output (MapReduce sorts results by key) looks like this:
hadoop 1
hello 2
java 1
The project uses a standard Maven layout (original screenshot of the project tree not reproduced).
2.1. Add the Maven dependencies

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.3</version>
</dependency>
2.2. Add a log4j.properties configuration file

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n
2.3. Write the Mapper
The mapper reads each line of the input text, splits it into words, and emits each word with a count of 1. The output types are Text and IntWritable, e.g. (java, 1).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println("--->Map-->" + Thread.currentThread().getName());
        // Split the line on spaces and emit (word, 1) for each word
        String[] words = StringUtils.split(value.toString(), ' ');
        for (String w : words) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}
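A small aside, not from the original post: since map() runs once per input line, a common Hadoop idiom is to reuse the writable output objects instead of allocating new ones per record; the framework serializes the key and value immediately inside context.write, so reuse is safe. A sketch of that variant (the class name WcMapperReusing is illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

// Sketch: same logic as WcMapper, but reusing the output objects across map() calls
public class WcMapperReusing extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String w : StringUtils.split(value.toString(), ' ')) {
            word.set(w);
            context.write(word, ONE); // key/value are copied on write, so reuse is safe
        }
    }
}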
2.4. Write the Reducer
The reducer receives the mapper's output, accumulates the counts for each word, and emits the total. The input types are the mapper's output types, Text and Iterable<IntWritable>, e.g. java -> (1, 1); the output types are Text and IntWritable, e.g. (java, 2).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        System.out.println("--->Reducer-->" + Thread.currentThread().getName());
        // Sum all the 1s emitted by the mappers for this word
        int sum = 0;
        for (IntWritable i : values) {
            sum = sum + i.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
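One optional tweak, not in the original post: because this reduction is associative and commutative, the same class can also be registered as a combiner, so partial sums are computed on the map side before the shuffle. In the job setup of the next section this would be a single extra line:

// Optional (sketch): add to the job setup in section 2.5 to pre-aggregate map output
job.setCombinerClass(WcReducer.class);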
2.5. Write the Job
Assemble the Mapper and Reducer into a Job, which is the unit of execution; computing WordCount is one Job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RunWcJob {
    public static void main(String[] args) throws Exception {
        // Create the job instance for this MapReduce program
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Specify the main class of this job
        job.setJarByClass(RunWcJob.class);
        // Specify the concrete mapper and reducer implementations for this job
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        // Specify the output data types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Specify the output data types of the reduce phase, i.e. the final output of the whole job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Specify the directory of the data to process and the directory for the results
        FileInputFormat.setInputPaths(job, "D:\\hadoop\\input");
        FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output"));
        // Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
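One practical caveat: FileOutputFormat refuses to run if the output directory already exists, so a second local run fails until D:\hadoop\output is deleted. A minimal sketch, assuming the same conf and paths as above, placed before FileOutputFormat.setOutputPath:

// Delete a leftover output directory before submitting the job (sketch)
// requires: import org.apache.hadoop.fs.FileSystem;
FileSystem fs = FileSystem.get(conf);
Path output = new Path("D:\\hadoop\\output");
if (fs.exists(output)) {
    fs.delete(output, true); // true = recursive
}
FileOutputFormat.setOutputPath(job, output);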
Create a file under the local folder D:\hadoop\input whose content is the input given above (the file name was lost from the original post). With the output folder set to D:\hadoop\output as in the code, run the program directly.
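For convenience, a sketch that creates the sample input file (the name words.txt is illustrative, since the original post's file name was lost):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class MakeInput {
    public static void main(String[] args) throws Exception {
        // Create D:\hadoop\input and write the two sample lines into it
        Path dir = Paths.get("D:\\hadoop\\input");
        Files.createDirectories(dir);
        Files.write(dir.resolve("words.txt"), Arrays.asList("hello java", "hello hadoop"));
    }
}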
Errors you may run into:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Cause:
winutils.exe was not copied into the hadoop-2.7.3/bin directory, or the HADOOP_HOME environment variable is not configured, or the configured HADOOP_HOME has not taken effect.
Fix:
1. Download winutils.exe and copy it into the hadoop-2.7.3/bin directory.
2. Check whether the environment variable is configured.
3. If the environment variable is already configured, restart the IDE or the machine; the variable may simply not have taken effect yet.
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
Cause:
Not entirely clear.
Fix:
Copy the org.apache.hadoop.io.nativeio.NativeIO source into your project and change the return value of its access method.
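A sketch of that workaround as it is commonly circulated (the exact method body varies by Hadoop version): copy the full NativeIO source into your project under the identical package org.apache.hadoop.io.nativeio so it shadows the class in the jar, then make the Windows access check succeed unconditionally:

// Inside the copied NativeIO.Windows class (sketch):
public static boolean access(String path, AccessRight desiredAccess)
        throws IOException {
    return true; // bypass the native access0() call that fails without hadoop.dll
}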
2.6 Run results
If output like the following appears when you run the job, it executed correctly.
14:40:01,813 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for using builtin-java classes where applicable
14:40:02,058 INFO deprecation:1173 - session.id is deprecated. Instead, use dfs.metrics.session-id
14:40:02,060 INFO JvmMetrics:76 - Initializing JVM Metrics with processName=JobTracker, sessionId=
14:40:02,355 WARN JobResourceUploader:64 - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14:40:02,387 WARN JobResourceUploader:171 - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14:40:02,422 INFO FileInputFormat:283 - Total input paths to process : 1
14:40:02,685 INFO JobSubmitter:198 - number of splits:1
14:40:02,837 INFO JobSubmitter:287 - Submitting tokens for job: job_local866013445_0001
14:40:03,042 INFO Job:1339 - Running job: job_local866013445_0001
14:40:03,044 INFO LocalJobRunner:471 - OutputCommitter set in config null
14:40:03,110 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,115 INFO LocalJobRunner:489 - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14:40:03,211 INFO LocalJobRunner:448 - Waiting for map tasks
14:40:03,211 INFO LocalJobRunner:224 - Starting task: attempt_local866013445_0001_m_000000_0
14:40:03,238 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,383 INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.
14:40:03,439 INFO Task:612 - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4d11cc8c
14:40:03,445 INFO MapTask:756 - Processing split: file:/D:/hadoop/:0+24
14:40:03,509 INFO MapTask:1205 - (EQUATOR) 0 kvi 26214396(104857584)
14:40:03,509 INFO MapTask:998 - mapreduce.task.io.sort.mb: 100
14:40:03,509 INFO MapTask:999 - soft limit at 83886080
14:40:03,509 INFO MapTask:1000 - bufstart = 0; bufvoid = 104857600
14:40:03,510 INFO MapTask:1001 - kvstart = 26214396; length = 6553600
14:40:03,515 INFO MapTask:403 - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
--->Map-->LocalJobRunner Map Task Executor #0
--->Map-->LocalJobRunner Map Task Executor #0
14:40:03,522 INFO LocalJobRunner:591 -
14:40:03,522 INFO MapTask:1460 - Starting flush of map output
14:40:03,522 INFO MapTask:1482 - Spilling map output
14:40:03,522 INFO MapTask:1483 - bufstart = 0; bufend = 40; bufvoid = 104857600
14:40:03,522 INFO MapTask:1485 - kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
14:40:03,573 INFO MapTask:1667 - Finished spill 0
14:40:03,583 INFO Task:1038 - Task:attempt_local866013445_0001_m_000000_0 is done. And is in the process of committing
14:40:03,589 INFO LocalJobRunner:591 - map
14:40:03,589 INFO Task:1158 - Task 'attempt_local866013445_0001_m_000000_0' done.
14:40:03,589 INFO LocalJobRunner:249 - Finishing task: attempt_local866013445_0001_m_000000_0
14:40:03,590 INFO LocalJobRunner:456 - map task executor complete.
14:40:03,593 INFO LocalJobRunner:448 - Waiting for reduce tasks
14:40:03,593 INFO LocalJobRunner:302 - Starting task: attempt_local866013445_0001_r_000000_0
14:40:03,597 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,597 INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.
14:40:03,627 INFO Task:612 - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@2ae5eb6
14:40:03,658 INFO ReduceTask:362 - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@72ddfb0b
14:40:03,686 INFO MergeManagerImpl:197 - MergerManager: memoryLimit=1314232704,
maxSingleShuffleLimit=328558176, mergeThreshold=867393600, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14:40:03,688 INFO EventFetcher:61 - attempt_local866013445_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14:40:03,720 INFO LocalFetcher:144 - localfetcher#1 about to shuffle output of map attempt_local866013445_0001_m_000000_0 decomp: 50 len: 54 to MEMORY
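The log above breaks off at the shuffle stage. Once the job completes, standard MapReduce naming puts the (single) reducer's results in part-r-00000 under the output directory. A minimal sketch to print them, assuming the paths configured in the job above:

import java.nio.file.Files;
import java.nio.file.Paths;

public class ShowOutput {
    public static void main(String[] args) throws Exception {
        // part-r-00000 is the standard name for the first reducer's output file
        Files.lines(Paths.get("D:\\hadoop\\output\\part-r-00000"))
             .forEach(System.out::println);
    }
}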