Three Methods for Global Sorting in Hadoop (MapReduce) (1)

Sometimes we need the output of a MapReduce job to be globally sorted, meaning that the keys are ordered across the entire output, not just within each file. By default, however, MapReduce only guarantees that keys are sorted within each partition; it does not guarantee global order. This article presents three methods for producing globally sorted MapReduce output.

Table of Contents
1. Generating test data
2. Sorting with a single Reduce
3. Global ordering with a custom partitioner

1. Generating test data

Before looking at the implementations, let's generate some test data. The following script does the job:

#!/bin/sh
 
for i in {1..100000};do
        echo $RANDOM
done;

Save the code above to a file named iteblog.sh, then run:

$ sh iteblog.sh > data1
$ sh iteblog.sh > data2
$ hadoop fs -put data1 /user/iteblog/input
$ hadoop fs -put data2 /user/iteblog/input

$RANDOM is a shell built-in variable that yields a random non-negative integer between 0 and 32767, i.e. at most five digits. We run the script twice, which gives us two files of random numbers, data1 and data2, and then upload both files to HDFS. Now we can write a program to sort the data in these two files.

2. Sorting with a single Reduce

As noted above, MapReduce by default only guarantees that keys are sorted within a partition, not globally. But if we send all of the data to a single Reduce, the result is automatically globally sorted. This approach is very simple to implement:

package com.iteblog.mapreduce.sort;
 
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
 
import java.io.IOException;
 
public class TotalSortV1 extends Configured implements Tool {
    static class SimpleMapper extends
            Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                           Context context) throws IOException, InterruptedException {
            IntWritable intWritable = new IntWritable(Integer.parseInt(value.toString()));
            context.write(intWritable, intWritable);
        }
    }
 
    static class SimpleReducer extends
            Reducer<IntWritable, IntWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                              Context context) throws IOException, InterruptedException {
            for (IntWritable value : values)
                context.write(value, NullWritable.get());
        }
    }
 
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("<input> <output>");
            System.exit(127);
        }
 
        Job job = Job.getInstance(getConf());
        job.setJarByClass(TotalSortV1.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
 
        job.setMapperClass(SimpleMapper.class);
        job.setReducerClass(SimpleReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
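        // With a single reduce task, every key ends up in the same partition, so the output is globally sorted.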
        job.setNumReduceTasks(1);
        job.setJobName("TotalSort");
        return job.waitForCompletion(true) ? 0 : 1;
    }
 
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new TotalSortV1(), args);
        System.exit(exitCode);
    }
}

The program above is straightforward: we rely on the default TextInputFormat to read the random-number files generated earlier (data1 and data2). Since the files contain plain integers, SimpleMapper parses each value into an int and wraps it in an IntWritable. By the time the data reaches SimpleReducer, all keys within that reducer are already sorted, and because the job is configured with exactly one reducer, the output is globally sorted. Run it as follows:

[[email protected] /home/iteblog]$ hadoop jar total-sort-0.1.jar com.iteblog.mapreduce.sort.TotalSortV1 /user/iteblog/input /user/iteblog/output
 
[[email protected] /home/iteblog]$ hadoop fs -ls /user/iteblog/output
Found 2 items
-rw-r--r--   3 iteblog supergroup          0 2017-05-09 11:41 /user/iteblog/output/_SUCCESS
-rw-r--r--   3 iteblog supergroup    1131757 2017-05-09 11:41 /user/iteblog/output/part-r-00000
 
[[email protected] /home/iteblog]$ hadoop fs -cat /user/iteblog/output/part-r-00000 | head -n 10
0
0
0
0
1
1
1
1
1
1
 
[[email protected] /home/iteblog]$ hadoop fs -cat /user/iteblog/output/part-r-00000 | tail -n 10
32766
32766
32766
32766
32767
32767
32767
32767
32767
32767

As the test results show, only one output file was produced, and the data in it is globally sorted.

3. Global ordering with a custom partitioner

The approach above has a serious limitation: all of the data is sent to a single reducer for sorting, which makes poor use of the cluster's resources and can easily cause OOM errors when the data volume is large. Let's look at why. MapReduce's default partitioner is HashPartitioner, which computes the hashCode of each map output key and takes it modulo the number of reducers, so every key with the same remainder is sent to the same reducer (a sketch of this default logic follows the list below). If we instead implement a partitioner such that:

  • all keys < 10000 go to Reduce 0;
  • all keys in the range [10000, 20000) go to Reduce 1;
  • all remaining keys go to Reduce 2;

then every key handled by Reduce 0 is guaranteed to be smaller than every key handled by Reduce 1, and every key handled by Reduce 1 smaller than every key handled by Reduce 2. Combined with the fact that keys within each reducer are already sorted, this gives us globally sorted output.
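
For reference, the default partitioner's behavior is essentially the following; this is a minimal sketch of the logic in org.apache.hadoop.mapreduce.lib.partition.HashPartitioner, and the class name HashLikePartitioner is only illustrative:

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of Hadoop's default hash partitioning: mask off the sign bit of
// hashCode() and take the remainder modulo the number of reduce tasks, so
// keys with equal remainders always land on the same reducer, regardless of
// how the keys compare to each other.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Because a key's hash value says nothing about its position in the sort order, neighbouring key ranges get scattered across reducers, which is exactly why hash partitioning cannot produce global order. The full implementation of the range-based approach is as follows: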

package com.iteblog.mapreduce.sort;
 
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
 
import java.io.IOException;
 
public class TotalSortV2 extends Configured implements Tool {
    static class SimpleMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                           Context context) throws IOException, InterruptedException {
            IntWritable intWritable = new IntWritable(Integer.parseInt(value.toString()));
            context.write(intWritable, intWritable);
        }
    }
 
    static class SimpleReducer extends Reducer<IntWritable, IntWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                              Context context) throws IOException, InterruptedException {
            for (IntWritable value : values)
                context.write(value, NullWritable.get());
        }
    }
 
    public static class IteblogPartitioner extends Partitioner<IntWritable, IntWritable> {
        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            int keyInt = key.get();
            if (keyInt < 10000) {
                return 0;
            } else if (keyInt < 20000) {
                return 1;
            } else {
                return 2;
            }
        }
    }
 
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("<input> <output>");
            System.exit(127);
        }
 
        Job job = Job.getInstance(getConf());
        job.setJarByClass(TotalSortV2.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
 
        job.setMapperClass(SimpleMapper.class);
        job.setReducerClass(SimpleReducer.class);
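        // Route keys to reducers by value range (IteblogPartitioner) rather than by hash,
        // so that reducer i only receives keys from its own range.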
        job.setPartitionerClass(IteblogPartitioner.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(3);
        job.setJobName("dw_subject");
        return job.waitForCompletion(true) ? 0 : 1;
    }
 
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new TotalSortV2(), args);
        System.exit(exitCode);
    }
}

Apart from the custom IteblogPartitioner and the three reduce tasks, this second version is identical to the first. Let's run it:

[[email protected] /home/iteblog]$ hadoop jar total-sort-0.1.jar com.iteblog.mapreduce.sort.TotalSortV2 /user/iteblog/input /user/iteblog/output1
 
[[email protected] /home/iteblog]$ hadoop fs -ls /user/iteblog/output1
Found 4 items
-rw-r--r--   3 iteblog supergroup          0 2017-05-09 13:53 /user/iteblog/output1/_SUCCESS
-rw-r--r--   3 iteblog supergroup     299845 2017-05-09 13:53 /user/iteblog/output1/part-r-00000
-rw-r--r--   3 iteblog supergroup     365190 2017-05-09 13:53 /user/iteblog/output1/part-r-00001
-rw-r--r--   3 iteblog supergroup     466722 2017-05-09 13:53 /user/iteblog/output1/part-r-00002
 
[[email protected] /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00000 | head -n 10
0
0
0
0
1
1
1
1
1
1
 
[[email protected] /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00000 | tail -n 10
9998
9998
9998
9999
9999
9999
9999
9999
9999
9999
 
[[email protected] /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00001 | head -n 10
10000
10000
10000
10000
10000
10000
10001
10001
10001
10001
 
[[email protected] /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00001 | tail -n 10
19997
19997
19998
19998
19998
19998
19999
19999
19999
19999
 
[[email protected] /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00002 | head -n 10
20000
20000
20000
20000
20000
20000
20001
20001
20001
20001
 
[[email protected] /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00002 | tail -n 10
32766
32766
32766
32766
32767
32767
32767
32767
32767
32767

As expected, the program produced three output files (because we set the number of reducers to 3), and each file is sorted internally. All values below 10000 are in part-r-00000, all values from 10000 up to but not including 20000 are in part-r-00001, and all values of 20000 and above are in part-r-00002. Taken in order, part-r-00000, part-r-00001 and part-r-00002 together form a globally sorted result.
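
If you want a quick sanity check that the three part files really do form one globally sorted sequence, one option (assuming the output paths used above) is to concatenate them in partition order and let sort -c verify the ordering; sort -c -n exits silently with status 0 when its input is already numerically sorted:

$ hadoop fs -cat /user/iteblog/output1/part-r-00000 \
                 /user/iteblog/output1/part-r-00001 \
                 /user/iteblog/output1/part-r-00002 | sort -c -n && echo "globally sorted"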

