Assume our cluster (or pseudo-distributed) Hadoop system is already set up. Most of us verify the installation by running the WordCount example from the official site or other tutorials. If WordCount runs without problems, we can start writing MapReduce programs and begin analyzing data.
A Hadoop cluster also involves other components that need to be installed; they are not covered here, so we set them aside for now. All you need to do is upload the data to be analyzed into HDFS; the remaining components can be learned as you encounter them. This article does not dwell on concepts, but the essential concepts and the program's execution steps must be understood.
Task requirements: analyze call records and, for each phone number, find which numbers have called it.
— Given a phone call log, where each record means user A called user B
— For each phone number, find all the numbers that called it
Example: 120 13688893333 13733883333 1393302942
Explanation: 120 received calls from the 3 numbers that follow it
The above can also be implemented with traditional approaches, but once the data grows large enough they hit a bottleneck:
1. Extract the data directly with a C or Java program.
2. Import the data into a database; a simple SELECT also solves the problem.
3. However, at the TB scale even simple computations overwhelm traditional methods. This is where MapReduce comes in: Hadoop automatically distributes the data we upload to HDFS across the machines in the cluster.
4. The mapper reads the input line by line. For each line it splits the raw record, emits the fields we need, and handles malformed data (records that would otherwise crash the job). The mapper runs for every line of input, and the final results are written back to HDFS.
5. A job may have no reduce function, depending on the situation. Without a reducer, the mapper's output is sent straight to the output files, so the map output format must match the program's output format.
6. For a job with a reduce function, the framework first sends all mapper outputs sharing the same key to the same reducer, and then writes out the reducer's results; the map output format must match the reduce input format.
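The map step described above (split each record, swap caller and callee, skip malformed lines while counting them) can be sketched as a local simulation in plain Java, independent of Hadoop. The class name `MapSketch` and the `parts.length != 2` validation are my own illustration, not part of the original code, which catches `ArrayIndexOutOfBoundsException` instead:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class MapSketch {
    static int lineSkip = 0; // stands in for the LINESKIP counter

    // For each input line "caller callee", emit the pair (callee, caller);
    // malformed lines are skipped and counted, mirroring step 4 above.
    static List<Entry<String, String>> mapLines(List<String> lines) {
        List<Entry<String, String>> out = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            if (parts.length != 2) {
                lineSkip++; // malformed record: count it and move on
                continue;
            }
            out.add(new SimpleEntry<>(parts[1], parts[0])); // key = callee, value = caller
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("13599999999 10086", "13633333333 120", "badline");
        for (Entry<String, String> e : mapLines(lines)) {
            System.out.println(e.getKey() + "\t" + e.getValue()); // callee \t caller
        }
        System.out.println("skipped=" + lineSkip);
    }
}
```

Running this prints `10086⟶13599999999` and `120⟶13633333333` (tab-separated) and reports one skipped line, which is exactly the shape of data the shuffle phase will group by key.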
What reduce does:
1. Each reducer is responsible for a subset of the keys; the reduce function aggregates the values for each key and writes out the result:
Source data:
13599999999 10086
13633333333 120
12678998778 dada13690378933
13833238233 13690378933
12678998778 120
19837366873 10086
16389323733 10086
18373984487 120
19877367388 13690378933
18937933444 13840238422
Output data (key and value separated by a tab):
10086 13599999999|19837366873|16389323733|
120 13633333333|12678998778|18373984487|
13690378933 13833238233|19877367388|
13840238422 18937933444|
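The grouping shown above can be reproduced end to end with a small local simulation in plain Java, no Hadoop required. The `TreeMap` gives the sorted key order Hadoop's shuffle produces, and values keep their input order as in a single-mapper local run. The digit-only regex that drops the malformed `dada…` record is my assumption, added so the sample output above is reproduced exactly:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    // Group callers by callee the way shuffle + reduce would:
    // keys come out sorted (TreeMap), values keep input order.
    static Map<String, String> groupCallers(List<String> lines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            if (!line.matches("\\d+ \\d+")) continue; // assumption: skip malformed records
            String[] p = line.split(" ");
            groups.computeIfAbsent(p[1], k -> new ArrayList<>()).add(p[0]);
        }
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            StringBuilder sb = new StringBuilder();
            for (String caller : e.getValue()) sb.append(caller).append("|");
            out.put(e.getKey(), sb.toString()); // same "a|b|c|" format as the reducer
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = List.of(
            "13599999999 10086", "13633333333 120", "12678998778 dada13690378933",
            "13833238233 13690378933", "12678998778 120", "19837366873 10086",
            "16389323733 10086", "18373984487 120", "19877367388 13690378933",
            "18937933444 13840238422");
        groupCallers(data).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

On the ten sample records this prints the four output lines listed above, with the malformed record dropped, which matches the LINESKIP=1 and "Reduce output records=4" counters in the job log below.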
Test code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import java.io.IOException;

public class PhoneTest extends Configured implements Tool {
    enum Counter {
        LINESKIP // counts lines skipped because of malformed data
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "PhoneTest"); // job name
        job.setJarByClass(PhoneTest.class); // class used to locate the job jar
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/home/zhangzhen/input")); // input path
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/home/zhangzhen/output")); // output path
        job.setMapperClass(Map.class); // use the Map class as the mapper
        job.setReducerClass(Reduce.class); // use the Reduce class as the reducer
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class); // output key type (KEYOUT)
        job.setOutputValueClass(Text.class); // output value type (VALUEOUT)
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static class Map extends Mapper<LongWritable, Text, Text, Text> { // <KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                // key - byte offset of the line in the file; value - the line's text
                String line = value.toString(); // e.g. "13510000000 10086" (13510000000 called 10086)
                // parse the record
                String[] lineSplit = line.split(" ");
                String phone1 = lineSplit[0];
                String phone2 = lineSplit[1];
                context.write(new Text(phone2), new Text(phone1)); // emit: callee \t caller
            } catch (ArrayIndexOutOfBoundsException e) {
                context.getCounter(Counter.LINESKIP).increment(1); // bad line: bump the counter
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder out = new StringBuilder();
            for (Text value : values) {
                out.append(value.toString()).append("|");
            }
            // emit: key \t value (if the desired result is not in "key \t value" form,
            // the key can be declared NullWritable and the value can carry the combined key/value string)
            context.write(key, new Text(out.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // run the job
        int res = ToolRunner.run(new Configuration(), new PhoneTest(), args);
        System.exit(res);
    }
}
Eclipse console output:
14/01/22 10:44:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/22 10:44:47 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/01/22 10:44:47 INFO input.FileInputFormat: Total input paths to process : 1
14/01/22 10:44:47 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/22 10:44:48 INFO mapred.JobClient: Running job: job_local1140955208_0001
14/01/22 10:44:48 INFO mapred.LocalJobRunner: Waiting for map tasks
14/01/22 10:44:48 INFO mapred.LocalJobRunner: Starting task: attempt_local1140955208_0001_m_000000_0
14/01/22 10:44:48 INFO util.ProcessTree: setsid exited with exit code 0
14/01/22 10:44:48 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@99548b
14/01/22 10:44:48 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/home/zhangzhen/input/tonghua.txt:0+202
14/01/22 10:44:48 INFO mapred.MapTask: io.sort.mb = 100
14/01/22 10:44:48 INFO mapred.MapTask: data buffer = 79691776/99614720
14/01/22 10:44:48 INFO mapred.MapTask: record buffer = 262144/327680
14/01/22 10:44:48 INFO mapred.MapTask: Starting flush of map output
14/01/22 10:44:48 INFO mapred.MapTask: Finished spill 0
14/01/22 10:44:48 INFO mapred.Task: Task:attempt_local1140955208_0001_m_000000_0 is done. And is in the process of commiting
14/01/22 10:44:48 INFO mapred.LocalJobRunner:
14/01/22 10:44:48 INFO mapred.Task: Task 'attempt_local1140955208_0001_m_000000_0' done.
14/01/22 10:44:48 INFO mapred.LocalJobRunner: Finishing task: attempt_local1140955208_0001_m_000000_0
14/01/22 10:44:48 INFO mapred.LocalJobRunner: Map task executor complete.
14/01/22 10:44:48 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1457048
14/01/22 10:44:48 INFO mapred.LocalJobRunner:
14/01/22 10:44:48 INFO mapred.Merger: Merging 1 sorted segments
14/01/22 10:44:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 194 bytes
14/01/22 10:44:48 INFO mapred.LocalJobRunner:
14/01/22 10:44:48 INFO mapred.Task: Task:attempt_local1140955208_0001_r_000000_0 is done. And is in the process of commiting
14/01/22 10:44:48 INFO mapred.LocalJobRunner:
14/01/22 10:44:48 INFO mapred.Task: Task attempt_local1140955208_0001_r_000000_0 is allowed to commit now
14/01/22 10:44:48 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1140955208_0001_r_000000_0' to hdfs://localhost:9000/home/zhangzhen/output
14/01/22 10:44:49 INFO mapred.LocalJobRunner: reduce > reduce
14/01/22 10:44:49 INFO mapred.Task: Task 'attempt_local1140955208_0001_r_000000_0' done.
14/01/22 10:44:49 INFO mapred.JobClient: map 100% reduce 100%
14/01/22 10:44:49 INFO mapred.JobClient: Job complete: job_local1140955208_0001
14/01/22 10:44:49 INFO mapred.JobClient: Counters: 23
14/01/22 10:44:49 INFO mapred.JobClient: PhoneAnalyzer$Counter
14/01/22 10:44:49 INFO mapred.JobClient: LINESKIP=1
14/01/22 10:44:49 INFO mapred.JobClient: File Output Format Counters
14/01/22 10:44:49 INFO mapred.JobClient: Bytes Written=146
14/01/22 10:44:49 INFO mapred.JobClient: File Input Format Counters
14/01/22 10:44:49 INFO mapred.JobClient: Bytes Read=202
14/01/22 10:44:49 INFO mapred.JobClient: FileSystemCounters
14/01/22 10:44:49 INFO mapred.JobClient: FILE_BYTES_READ=568
14/01/22 10:44:49 INFO mapred.JobClient: HDFS_BYTES_READ=404
14/01/22 10:44:49 INFO mapred.JobClient: FILE_BYTES_WRITTEN=136400
14/01/22 10:44:49 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=146
14/01/22 10:44:49 INFO mapred.JobClient: Map-Reduce Framework
14/01/22 10:44:49 INFO mapred.JobClient: Reduce input groups=4
14/01/22 10:44:49 INFO mapred.JobClient: Map output materialized bytes=198
14/01/22 10:44:49 INFO mapred.JobClient: Combine output records=0
14/01/22 10:44:49 INFO mapred.JobClient: Map input records=10
14/01/22 10:44:49 INFO mapred.JobClient: Reduce shuffle bytes=0
14/01/22 10:44:49 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/01/22 10:44:49 INFO mapred.JobClient: Reduce output records=4
14/01/22 10:44:49 INFO mapred.JobClient: Spilled Records=18
14/01/22 10:44:49 INFO mapred.JobClient: Map output bytes=174
14/01/22 10:44:49 INFO mapred.JobClient: Total committed heap usage (bytes)=340262912
14/01/22 10:44:49 INFO mapred.JobClient: CPU time spent (ms)=0
14/01/22 10:44:49 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/01/22 10:44:49 INFO mapred.JobClient: SPLIT_RAW_BYTES=119
14/01/22 10:44:49 INFO mapred.JobClient: Map output records=9
14/01/22 10:44:49 INFO mapred.JobClient: Combine input records=0
14/01/22 10:44:49 INFO mapred.JobClient: Reduce input records=9
After the job finishes, you can find the output file under the path you configured and inspect it there. Alternatively, in Eclipse you can refresh the DFS directory as shown in the figure below to see the input and output directories; the files under them can also be opened and viewed in Eclipse. Test screenshot:
Copyright©BUAA