Hadoop Learning (4): Map/Reduce Data Analysis Basics - Example: Phone Call Log

    Assume our cluster or pseudo-distributed Hadoop system has already been set up. Most of us verify a new system with the WordCount example from the official site or other tutorials. If WordCount runs without problems, we are ready to write M/R programs and start analyzing data.

   A Hadoop cluster also involves other components that need to be installed; they are not covered here, so set them aside for now. All you need to do is upload the data to be analyzed into HDFS; the remaining components can be learned as you encounter them. This article does not dwell on concepts, but the essential concepts and the steps a job goes through are things you must understand.
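As a minimal sketch of that upload step, using the HDFS Java API (the NameNode address matches the job further down; the local path /tmp/tonghua.txt is an assumption for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // Hadoop 1.x default-FS key
        FileSystem fs = FileSystem.get(conf);
        // Copy the local call log into the directory the job will read as input.
        fs.copyFromLocalFile(new Path("/tmp/tonghua.txt"),
                new Path("/home/zhangzhen/input/tonghua.txt"));
        fs.close();
    }
}

The shell equivalent is hadoop fs -put tonghua.txt /home/zhangzhen/input.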

Task: analyze the call log and find, for each phone number, which numbers have called it.

  — Given a phone call log, where each record states that user A called user B.

  — For each phone number, find out which numbers called it.

For example: 120   13688893333 13733883333 1393302942

Meaning: 120 was called by the 3 numbers that follow it.

Traditional approaches can implement this too, but once the data grows large enough they hit a bottleneck:

1. Extract the fields directly from the data with a C or Java program.

2. Load the data into a database; a simple SELECT also solves the problem.

3. But at the TB scale, even simple computations push traditional methods to their limits. This is where M/R comes in: Hadoop automatically distributes the data we upload to HDFS across the machines in the cluster.

4. The mapper program reads the data line by line. For each line it splits the raw record, emits the fields we need, and handles malformed records (which would otherwise crash the job). The mapper runs on every line, and its output is eventually written back to HDFS.

5. A job may have no reduce function, depending on the situation. In a job without a reducer, the map output is sent straight to the output files, so the map output format must match the job's output format (see the sketch after this list).

6. In a job with a reduce function, the framework first sends all map outputs that share the same key to the same reducer, then writes the reducer's results out; the map output format must match the reduce input format.
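As a sketch for point 5 (a driver fragment only, reusing the class names from the full example further down): with zero reduce tasks, the mapper's output goes straight to the output files.

// Map-only variant of the driver: no Reducer is set and the reduce phase
// is disabled, so each mapper's (KEYOUT, VALUEOUT) pairs are written
// directly to the output files and must match the formats declared here.
job.setMapperClass(Map.class);
job.setNumReduceTasks(0); // no reduce phase; map output becomes job output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);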


What the Mapper does:
1. Splits the raw data
2. Emits the fields we need
3. Handles malformed records

What the Reducer does:
1. Each reduce task is responsible for a subset of the keys; the reduce function aggregates the values for each key and writes the result.

Source data:

13599999999 10086
13633333333 120
12678998778    dada13690378933
13833238233 13690378933
12678998778 120
19837366873 10086
16389323733 10086
18373984487 120
19877367388 13690378933
18937933444 13840238422
Output data (key and value separated by a tab):

10086    13599999999|19837366873|16389323733|
120    13633333333|12678998778|18373984487|
13690378933    13833238233|19877367388|
13840238422    18937933444|
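Before the M/R version, here is the same grouping done in plain Java with no Hadoop at all; conceptually this is exactly what the shuffle and the reduce function do for this job (a standalone sketch over the sample rows above):

import java.util.Map;
import java.util.TreeMap;

public class GroupByCalleeSketch {
    public static void main(String[] args) {
        // The sample rows above, verbatim (including the malformed one).
        String[] lines = {
            "13599999999 10086",
            "13633333333 120",
            "12678998778    dada13690378933",
            "13833238233 13690378933",
            "12678998778 120",
            "19837366873 10086",
            "16389323733 10086",
            "18373984487 120",
            "19877367388 13690378933",
            "18937933444 13840238422"
        };
        // "Shuffle": group callers by callee, as the framework does before reduce.
        Map<String, StringBuilder> grouped = new TreeMap<String, StringBuilder>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            if (parts.length != 2) continue; // skip malformed records
            if (!grouped.containsKey(parts[1])) {
                grouped.put(parts[1], new StringBuilder());
            }
            grouped.get(parts[1]).append(parts[0]).append('|');
        }
        // "Reduce": one output line per key, key \t concatenated callers.
        for (Map.Entry<String, StringBuilder> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}

Running this prints exactly the output data shown above; the M/R program below distributes the same logic across a cluster.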
Test code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import java.io.IOException;


public class PhoneTest extends Configured implements Tool {

    enum Counter {
        LINESKIP; // counter for malformed lines that were skipped
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "PhoneTest"); // job name
        job.setJarByClass(PhoneTest.class); // class that contains this job
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/home/zhangzhen/input")); // input path
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/home/zhangzhen/output")); // output path

        job.setMapperClass(Map.class); // use the Map class as the mapper
        job.setReducerClass(Reduce.class); // use the Reduce class as the reducer
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class); // output key type (KEYOUT)
        job.setOutputValueClass(Text.class); // output value type (VALUEOUT)
        job.waitForCompletion(true);
        
        return job.isSuccessful() ? 0 : 1;
    }

    public static class Map extends
            Mapper<LongWritable, Text, Text, Text> {    //<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                // key: the byte offset of the line in the file; value: the line's text
                String line = value.toString();    // e.g. "13510000000 10086" (13510000000 called 10086)
                // parse the record: the two fields are separated by a single space
                String[] lineSplit = line.split(" ");
                String phone1 = lineSplit[0]; // caller
                String phone2 = lineSplit[1]; // callee
                
                context.write(new Text(phone2), new Text(phone1));    // emit: callee \t caller
            } catch (ArrayIndexOutOfBoundsException e) { // the line did not split into two fields
                context.getCounter(Counter.LINESKIP).increment(1); // count the malformed line and skip it
            }
        }

    }
    
    public static class Reduce extends Reducer<Text, Text, Text, Text> {    
        
        @Override
        protected void reduce(Text key, Iterable<Text> values,
                Context context)
                throws IOException, InterruptedException {
            StringBuilder out = new StringBuilder();
            for (Text value : values) {
                out.append(value.toString()).append('|');
            }
            // Emit key \t value. (If the desired result were not in key \t value form,
            // the key could be declared NullWritable and the value could carry both fields.)
            context.write(key, new Text(out.toString()));
        }
    }
    
    public static void main(String[] args) throws Exception {
        // run the job through ToolRunner
        int res = ToolRunner.run(new Configuration(), new PhoneTest(), args);
        System.exit(res);
    }
}
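One practical note on the driver above: FileOutputFormat fails the job if the output directory already exists, so a second run aborts until you delete it. A sketch of driver lines (inside run(), with an extra org.apache.hadoop.fs.FileSystem import) that clear a stale directory first, if automatic deletion is acceptable for your use case:

        Path out = new Path("hdfs://localhost:9000/home/zhangzhen/output");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true); // true = delete recursively
        }
        FileOutputFormat.setOutputPath(job, out); // then set it as before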

Eclipse console output:
14/01/22 10:44:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/22 10:44:47 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/01/22 10:44:47 INFO input.FileInputFormat: Total input paths to process : 1
14/01/22 10:44:47 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/22 10:44:48 INFO mapred.JobClient: Running job: job_local1140955208_0001
14/01/22 10:44:48 INFO mapred.LocalJobRunner: Waiting for map tasks
14/01/22 10:44:48 INFO mapred.LocalJobRunner: Starting task: attempt_local1140955208_0001_m_000000_0
14/01/22 10:44:48 INFO util.ProcessTree: setsid exited with exit code 0
14/01/22 10:44:48 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@99548b
14/01/22 10:44:48 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/home/zhangzhen/input/tonghua.txt:0+202
14/01/22 10:44:48 INFO mapred.MapTask: io.sort.mb = 100
14/01/22 10:44:48 INFO mapred.MapTask: data buffer = 79691776/99614720
14/01/22 10:44:48 INFO mapred.MapTask: record buffer = 262144/327680
14/01/22 10:44:48 INFO mapred.MapTask: Starting flush of map output
14/01/22 10:44:48 INFO mapred.MapTask: Finished spill 0
14/01/22 10:44:48 INFO mapred.Task: Task:attempt_local1140955208_0001_m_000000_0 is done. And is in the process of commiting
14/01/22 10:44:48 INFO mapred.LocalJobRunner: 
14/01/22 10:44:48 INFO mapred.Task: Task 'attempt_local1140955208_0001_m_000000_0' done.
14/01/22 10:44:48 INFO mapred.LocalJobRunner: Finishing task: attempt_local1140955208_0001_m_000000_0
14/01/22 10:44:48 INFO mapred.LocalJobRunner: Map task executor complete.
14/01/22 10:44:48 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1457048
14/01/22 10:44:48 INFO mapred.LocalJobRunner: 
14/01/22 10:44:48 INFO mapred.Merger: Merging 1 sorted segments
14/01/22 10:44:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 194 bytes
14/01/22 10:44:48 INFO mapred.LocalJobRunner: 
14/01/22 10:44:48 INFO mapred.Task: Task:attempt_local1140955208_0001_r_000000_0 is done. And is in the process of commiting
14/01/22 10:44:48 INFO mapred.LocalJobRunner: 
14/01/22 10:44:48 INFO mapred.Task: Task attempt_local1140955208_0001_r_000000_0 is allowed to commit now
14/01/22 10:44:48 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1140955208_0001_r_000000_0' to hdfs://localhost:9000/home/zhangzhen/output
14/01/22 10:44:49 INFO mapred.LocalJobRunner: reduce > reduce
14/01/22 10:44:49 INFO mapred.Task: Task 'attempt_local1140955208_0001_r_000000_0' done.
14/01/22 10:44:49 INFO mapred.JobClient:  map 100% reduce 100%
14/01/22 10:44:49 INFO mapred.JobClient: Job complete: job_local1140955208_0001
14/01/22 10:44:49 INFO mapred.JobClient: Counters: 23
14/01/22 10:44:49 INFO mapred.JobClient:   PhoneAnalyzer$Counter
14/01/22 10:44:49 INFO mapred.JobClient:     LINESKIP=1
14/01/22 10:44:49 INFO mapred.JobClient:   File Output Format Counters 
14/01/22 10:44:49 INFO mapred.JobClient:     Bytes Written=146
14/01/22 10:44:49 INFO mapred.JobClient:   File Input Format Counters 
14/01/22 10:44:49 INFO mapred.JobClient:     Bytes Read=202
14/01/22 10:44:49 INFO mapred.JobClient:   FileSystemCounters
14/01/22 10:44:49 INFO mapred.JobClient:     FILE_BYTES_READ=568
14/01/22 10:44:49 INFO mapred.JobClient:     HDFS_BYTES_READ=404
14/01/22 10:44:49 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=136400
14/01/22 10:44:49 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=146
14/01/22 10:44:49 INFO mapred.JobClient:   Map-Reduce Framework
14/01/22 10:44:49 INFO mapred.JobClient:     Reduce input groups=4
14/01/22 10:44:49 INFO mapred.JobClient:     Map output materialized bytes=198
14/01/22 10:44:49 INFO mapred.JobClient:     Combine output records=0
14/01/22 10:44:49 INFO mapred.JobClient:     Map input records=10
14/01/22 10:44:49 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/01/22 10:44:49 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/01/22 10:44:49 INFO mapred.JobClient:     Reduce output records=4
14/01/22 10:44:49 INFO mapred.JobClient:     Spilled Records=18
14/01/22 10:44:49 INFO mapred.JobClient:     Map output bytes=174
14/01/22 10:44:49 INFO mapred.JobClient:     Total committed heap usage (bytes)=340262912
14/01/22 10:44:49 INFO mapred.JobClient:     CPU time spent (ms)=0
14/01/22 10:44:49 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/01/22 10:44:49 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
14/01/22 10:44:49 INFO mapred.JobClient:     Map output records=9
14/01/22 10:44:49 INFO mapred.JobClient:     Combine input records=0
14/01/22 10:44:49 INFO mapred.JobClient:     Reduce input records=9
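Note the counters: Map input records=10 but Map output records=9, and LINESKIP=1. The malformed line (12678998778 dada13690378933) never produced a map output; it was caught and counted by the exception handler.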
When the job finishes, you can find the output file under the path you configured and inspect it there. You can also refresh the DFS tree in Eclipse, as in the screenshot below, and the input and output directories will appear; the files under them can be opened and viewed directly in Eclipse.
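From a terminal, hadoop fs -cat /home/zhangzhen/output/part-r-00000 prints the result directly (part-r-00000 is the default name of the first reduce task's output file).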

Test screenshots:
