1. Source code:
package com.mapred.core;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        //FileSystem fs = FileSystem.get(new URI("hdfs://192.168.70.128:9000"), conf); // conceptually, a client-to-server connection
        // Job.getInstance is preferred over the deprecated Job(conf) constructor
        Job job = Job.getInstance(conf);
        // set the entry-point class so Hadoop can locate the jar
        job.setJarByClass(WordCount.class);
        // input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // wire up the mapper and reducer
        // mapper
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // reducer
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // output path (must not already exist on HDFS)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // submit the job, wait for completion, and exit with its status
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
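The driver above works, but Hadoop will print a warning about command-line option parsing (visible in the run log further below) unless the job implements the `Tool` interface and is launched through `ToolRunner`. A hedged sketch of that variant, with an illustrative class name `WordCountDriver` that is not part of the original source, might look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative ToolRunner-based variant of the driver above.
public class WordCountDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
    }
}
```

With this structure, generic options such as `-D mapreduce.job.reduces=2` are parsed before `run()` is called, which is what the warning is about.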
class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String val = value.toString();
        // split on a single space; note that a blank input line emits one empty-string "word"
        String[] words = val.split(" ");
        for (String word : words) {
            // emit (word, 1) for each token
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum the 1s emitted by the mapper for this word
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
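The end-to-end flow of the two classes above can be sanity-checked without a cluster by simulating map, shuffle/sort, and reduce in plain Java. This is only a sketch of the logic; the class and method names (`WordCountSim`, `simulate`) are illustrative, not part of the original code:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the MapReduce flow above (no Hadoop needed).
public class WordCountSim {
    static Map<String, Long> simulate(List<String> lines) {
        // TreeMap mirrors the shuffle phase, which sorts keys lexicographically
        // (this is why the real job's output file is in alphabetical order)
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // same logic as the mapper: split on a single space;
            // a blank line contributes one empty-string "word"
            for (String word : line.split(" ")) {
                counts.merge(word, 1L, Long::sum); // reducer: sum the 1s
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // the sample input from the run notes below
        List<String> input = List.of(
            "lengend", "i", "am", "a", "hero", "i", "am", "a", "fool",
            "i", "am", "a", "apple", "but", "you", "are", "a", "bastard");
        for (Map.Entry<String, Long> e : simulate(input).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Running this on the sample input reproduces the counts shown at the end of the notes (a=4, am=3, i=3, everything else 1).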
2. Notes on running it:
Steps to run WordCount:
1. Package the project as a jar, e.g. mapredProject.jar.
2. Upload mapredProject.jar to the /soft directory.
3. Create the input data file input.txt in /soft. View its contents with more /soft/input.txt:
lengend
i
am
a
hero
i
am
a
fool
i
am
a
apple
but
you
are
a
bastard
4. Create the input directory in HDFS: hadoop fs -mkdir -p /wordcount/in
5. Upload /soft/input.txt into that HDFS directory: hadoop fs -put /soft/input.txt /wordcount/in
6. Run the WordCount job: hadoop jar /soft/mapredProject.jar /wordcount/in /wordcount/output
7. Wait for the job to finish; the console shows its progress. As an example, the output of one run might look like this:
[hadoop@node1 soft]$ hadoop jar /soft/mapredProject.jar /wordcount/in /wordcount/output
18/01/10 21:22:13 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.209.129:8032
18/01/10 21:22:14 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/01/10 21:22:17 INFO input.FileInputFormat: Total input paths to process : 1
18/01/10 21:22:17 INFO mapreduce.JobSubmitter: number of splits:1
18/01/10 21:22:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515211219380_0005
18/01/10 21:22:21 INFO impl.YarnClientImpl: Submitted application application_1515211219380_0005
18/01/10 21:22:21 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1515211219380_0005/
18/01/10 21:22:21 INFO mapreduce.Job: Running job: job_1515211219380_0005
18/01/10 21:23:18 INFO mapreduce.Job: Job job_1515211219380_0005 running in uber mode : false
18/01/10 21:23:18 INFO mapreduce.Job: map 0% reduce 0%
18/01/10 21:23:51 INFO mapreduce.Job: map 100% reduce 0%
18/01/10 21:24:10 INFO mapreduce.Job: map 100% reduce 100%
18/01/10 21:24:12 INFO mapreduce.Job: Job job_1515211219380_0005 completed successfully
18/01/10 21:24:13 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=264
FILE: Number of bytes written=242009
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=171
HDFS: Number of bytes written=76
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30472
Total time spent by all reduces in occupied slots (ms)=15714
Total time spent by all map tasks (ms)=30472
Total time spent by all reduce tasks (ms)=15714
Total vcore-milliseconds taken by all map tasks=30472
Total vcore-milliseconds taken by all reduce tasks=15714
Total megabyte-milliseconds taken by all map tasks=31203328
Total megabyte-milliseconds taken by all reduce tasks=16091136
Map-Reduce Framework
Map input records=19
Map output records=19
Map output bytes=220
Map output materialized bytes=264
Input split bytes=103
Combine input records=0
Combine output records=0
Reduce input groups=12
Reduce shuffle bytes=264
Reduce input records=19
Reduce output records=12
Spilled Records=38
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=249
CPU time spent (ms)=2860
Physical memory (bytes) snapshot=288591872
Virtual memory (bytes) snapshot=4164571136
Total committed heap usage (bytes)=141230080
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=68
File Output Format Counters
Bytes Written=76
8. List the output directory: hadoop fs -ls /wordcount/output. The result might look like:
Found 2 items
-rw-r--r-- 2 hadoop supergroup 0 2018-01-10 21:24 /wordcount/output/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 76 2018-01-10 21:24 /wordcount/output/part-r-00000
9. View the output file listed in the previous step: hadoop fs -cat /wordcount/output/part-r-00000. The words and their counts appear as shown below. (The lone "1" on the first line is the count for the empty string: the mapper splits on a single space, so a blank line in the input produces one empty-string "word".)
1
a 4
am 3
apple 1
are 1
bastard 1
but 1
fool 1
hero 1
i 3
lengend 1
you 1
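For downstream processing, each part-r-00000 line is key, then a tab, then the count (TextOutputFormat's default separator, which the listing above renders as a space). Such lines can be parsed back in plain Java; the class name `OutputParser` here is illustrative:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Parse WordCount output lines ("word<TAB>count") back into a map.
public class OutputParser {
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            // the key may be the empty string (see the first output line above)
            counts.put(line.substring(0, tab),
                       Long.parseLong(line.substring(tab + 1)));
        }
        return counts;
    }
}
```

In a real pipeline the lines would come from `hadoop fs -cat` or an HDFS client rather than an in-memory list.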