Hadoop in Practice: Data Deduplication (Dedup)

Hadoop Cluster (Part 9): Introductory MapReduce Examples

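1. Data Deduplication

"Data deduplication" is mainly an exercise in applying parallel thinking to filter data in a meaningful way. Seemingly complex tasks, such as counting the number of distinct kinds of data in a large data set or working out visitor locations from website logs, all involve deduplication. The rest of this section walks through the MapReduce program design for this example.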

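1.1 Example Description

Deduplicate the data in the data files. Each line of a data file is one record.

Sample input is shown below:

(Note: there are no blank lines between the rows.)

1) file1: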

2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-7 c
2012-3-3 c

2) file2:

2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c

Sample output is shown below:

2012-3-1 a
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d
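
As a quick check on the sample output, here is a minimal single-machine sketch in plain Java (no Hadoop involved). The class and method names are hypothetical; the only assumption is that sorting the distinct lines as Java strings matches the key order in which the job writes them, which holds for this ASCII-only sample.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper class, for illustration only; not part of the Hadoop job.
class DedupExpected {
    static final List<String> FILE1 = Arrays.asList(
            "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
            "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c");
    static final List<String> FILE2 = Arrays.asList(
            "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
            "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c");

    // Collect every line of every file into a sorted set, which drops duplicates.
    static SortedSet<String> distinctLines(List<List<String>> files) {
        SortedSet<String> distinct = new TreeSet<>();
        for (List<String> file : files) {
            distinct.addAll(file);
        }
        return distinct;
    }

    public static void main(String[] args) {
        // Prints the 12 distinct lines, one per line, in sorted order.
        distinctLines(Arrays.asList(FILE1, FILE2)).forEach(System.out::println);
    }
}
```

The 16 input lines collapse to 12 distinct records, which is the list shown above.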

1.2 Design Approach

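The goal of data deduplication is that any record appearing more than once in the raw data appears exactly once in the output file. The natural approach is to send every occurrence of the same record to the same reduce task; however many times the record occurs, it only needs to be written once in the final result. Concretely, the reduce input should use the record itself as the key, with no requirement on the value-list. When reduce receives a <key, value-list>, it copies the key straight to the output key and sets the output value to empty.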
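
The reduce rule described above can be sketched in plain Java, without any Hadoop types. This is only an illustration under stated assumptions: `ReduceSketch` and its `reduce` method are hypothetical names, and strings stand in for Hadoop's `Text`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for the reduce step; strings replace Hadoop's Text.
class ReduceSketch {
    // Emits exactly one <key, ""> pair, no matter how many values arrive.
    static Map.Entry<String, String> reduce(String key, Iterable<String> values) {
        // The value-list is deliberately ignored: only the key matters.
        return new SimpleEntry<>(key, "");
    }

    public static void main(String[] args) {
        // A record seen three times and a record seen once both yield one output pair.
        Map.Entry<String, String> repeated = reduce("2012-3-3 c", Arrays.asList("", "", ""));
        Map.Entry<String, String> single = reduce("2012-3-1 a", Collections.singletonList(""));
        System.out.println(repeated.getKey());
        System.out.println(single.getKey());
    }
}
```

Because the output depends only on the key, the length of the value-list has no effect on the result.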

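In the MapReduce flow, the <key, value> pairs output by map are gathered by the shuffle into <key, value-list> and handed to reduce. Working backwards from the reduce input designed above, the map output key must be the record, and the value can be anything. In this example each record is one line of an input file, so with Hadoop's default job input format (which passes each line of text to map as the input value), the map stage only has to emit its input value as the output key; the output value is arbitrary. After the shuffle, the results reach reduce. The reduce stage does not care how many values each key has: it copies the input key to the output key and writes it out, with the output value set to empty.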
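
To make the map output and the shuffle grouping concrete, the following plain-Java sketch simulates both steps in memory for file1. It is not how Hadoop implements the shuffle; `MapShuffleSketch`, `map`, `shuffle` and `run` are hypothetical names, the offset argument only mimics the input key that the default input format supplies, and a `TreeMap` stands in for the sorted grouping.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory simulation of map output and shuffle grouping.
class MapShuffleSketch {
    // map: ignore the input key (offset) and emit <line, "">.
    static Map.Entry<String, String> map(long offset, String line) {
        return new SimpleEntry<>(line, "");
    }

    // shuffle: group all emitted values under their key, with keys kept sorted.
    static SortedMap<String, List<String>> shuffle(List<Map.Entry<String, String>> mapOutput) {
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Run map over every line, then shuffle the results.
    static SortedMap<String, List<String>> run(List<String> lines) {
        List<Map.Entry<String, String>> mapOutput = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapOutput.add(map(offset, line));
            offset += line.length() + 1; // +1 for the newline
        }
        return shuffle(mapOutput);
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList(
                "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c");
        // Each key is listed once; "2012-3-3 c" carries two values, the rest carry one.
        run(file1).forEach((key, values) ->
                System.out.println(key + " -> " + values.size() + " value(s)"));
    }
}
```

The eight lines of file1 become seven groups, one per distinct record, which is exactly the input shape the reduce stage expects.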

1.3 Program Code

The program code is as follows:

package com.hebut.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedup {

    // Map: copy each input value (one line of text) to the output key and emit it
    public static class Map extends Mapper<Object,Text,Text,Text>{
        // Placeholder output value; its content is irrelevant
        private static final Text EMPTY = new Text("");

        // map function: the input key is ignored
        @Override
        public void map(Object key,Text value,Context context)
                throws IOException,InterruptedException{
            context.write(value, EMPTY);
        }

    }

    // Reduce: copy the input key to the output key and emit it once
    public static class Reduce extends Reducer<Text,Text,Text,Text>{
        // reduce function: the values are never read, so each key is written exactly once
        @Override
        public void reduce(Text key,Iterable<Text> values,Context context)
                throws IOException,InterruptedException{
            context.write(key, new Text(""));
        }

    }

    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        // This line is essential: it tells the client which JobTracker to submit to.
        // The address is the author's cluster; replace it with your own.
        conf.set("mapred.job.tracker", "192.168.1.2:9001");

        // Input and output directories are hard-coded here; command-line arguments are not read
        String[] ioArgs = new String[]{"dedup_in","dedup_out"};
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Data Deduplication <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "Data Deduplication");
        job.setJarByClass(Dedup.class);

        // Set the Map, Combine and Reduce classes
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
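
The job registers Reduce as the combiner as well as the reducer. The plain-Java sketch below illustrates why that is safe here: removing duplicates locally and then again globally gives the same result as removing them once globally. The names `CombinerSketch`, `distinct` and `concat` are hypothetical, and the sketch assumes one map task per input file purely for illustration; how Hadoop actually splits input and when it runs the combiner can differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation of running the same dedup logic as combiner and reducer.
class CombinerSketch {
    static final List<String> FILE1 = Arrays.asList(
            "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
            "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c");
    static final List<String> FILE2 = Arrays.asList(
            "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
            "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c");

    // Keep each key once, in sorted order. Used for both the combine and reduce steps.
    static List<String> distinct(List<String> keys) {
        return new ArrayList<>(new TreeSet<>(keys));
    }

    static List<String> concat(List<String> a, List<String> b) {
        List<String> all = new ArrayList<>(a);
        all.addAll(b);
        return all;
    }

    public static void main(String[] args) {
        // Without a combiner: every map output record reaches the reduce side.
        List<String> withoutCombiner = distinct(concat(FILE1, FILE2));

        // With a combiner: each (assumed) map task removes its own duplicates first.
        List<String> combined1 = distinct(FILE1);
        List<String> combined2 = distinct(FILE2);
        List<String> withCombiner = distinct(concat(combined1, combined2));

        System.out.println("records sent without combiner: " + (FILE1.size() + FILE2.size()));
        System.out.println("records sent with combiner: " + (combined1.size() + combined2.size()));
        System.out.println("same final output: " + withoutCombiner.equals(withCombiner));
    }
}
```

On this tiny sample the combiner saves a single record, but on data with many repeated lines the reduction in shuffled records can be much larger.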