We sometimes need the output of a MapReduce job to be in total order, meaning the keys are globally sorted across all output files. By default, however, MapReduce only guarantees that keys are sorted within each partition; it makes no guarantee across partitions. This article presents two approaches to producing globally sorted MapReduce output.
Contents:

1. Generating test data
2. Sorting with a single Reducer
3. Total order with a custom partitioner
1. Generating test data
Before looking at the implementations, let's first generate some test data:
#!/bin/bash
# emit 100,000 random integers (bash's $RANDOM: 0..32767), one per line
for i in {1..100000}; do
  echo $RANDOM
done
Save the code above into a file named iteblog.sh, then run it twice and upload the results ($RANDOM and {1..100000} are bash features, so invoke it with bash):
$ bash iteblog.sh > data1
$ bash iteblog.sh > data2
$ hadoop fs -put data1 /user/iteblog/input
$ hadoop fs -put data2 /user/iteblog/input
$RANDOM is a shell built-in that expands to a random non-negative integer of at most five digits (0 to 32767). Running the script twice gives us two files of random numbers, data1 and data2, which we then upload to HDFS. Now we can write a program to sort the data in these two files.
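Note that $RANDOM is a 15-bit value, so every number it produces lies in 0..32767; that is why the sorted output later in this article tops out at 32767. A quick sanity check:

```shell
# Sample $RANDOM many times and track the observed bounds;
# every sample must fall within 0..32767 (a 15-bit value)
min=32767; max=0
for i in {1..1000}; do
  r=$RANDOM
  if [ "$r" -lt "$min" ]; then min=$r; fi
  if [ "$r" -gt "$max" ]; then max=$r; fi
done
echo "observed range: $min..$max"
```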
2. Sorting with a single Reducer
As noted above, MapReduce only guarantees that keys are sorted within a single partition, not across partitions. But if we send all of the data to a single Reducer, the result is trivially in total order. This approach is simple to implement:
package com.iteblog.mapreduce.sort;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class TotalSortV1 extends Configured implements Tool {
    // Map each input line to (value, value) so the shuffle sorts by the number itself
    static class SimpleMapper extends
            Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                           Context context) throws IOException, InterruptedException {
            IntWritable intWritable = new IntWritable(Integer.parseInt(value.toString()));
            context.write(intWritable, intWritable);
        }
    }

    // Emit every value once per occurrence; keys arrive at the reducer already sorted
    static class SimpleReducer extends
            Reducer<IntWritable, IntWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                              Context context) throws IOException, InterruptedException {
            for (IntWritable value : values)
                context.write(value, NullWritable.get());
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TotalSortV1 <input> <output>");
            System.exit(127);
        }
        Job job = Job.getInstance(getConf());
        job.setJarByClass(TotalSortV1.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(SimpleMapper.class);
        job.setReducerClass(SimpleReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        // A single reducer means a single output file with all keys in order
        job.setNumReduceTasks(1);
        job.setJobName("TotalSort");
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new TotalSortV1(), args);
        System.exit(exitCode);
    }
}
The program is straightforward. We use the default TextInputFormat to read the generated random-number files (data1 and data2). Since every line holds a non-negative integer, SimpleMapper simply parses the value into an int and wraps it in an IntWritable. By the time the data reaches SimpleReducer, all keys within that one Reducer are already sorted, and because we set the number of reduce tasks to 1, that order is global. Run it like this:
[iteblog@iteblog /home/iteblog]$ hadoop jar total-sort-0.1.jar com.iteblog.mapreduce.sort.TotalSortV1 /user/iteblog/input /user/iteblog/output
[iteblog@iteblog /home/iteblog]$ hadoop fs -ls /user/iteblog/output
Found 2 items
-rw-r--r-- 3 iteblog supergroup 0 2017-05-09 11:41 /user/iteblog/output/_SUCCESS
-rw-r--r-- 3 iteblog supergroup 1131757 2017-05-09 11:41 /user/iteblog/output/part-r-00000
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output/part-r-00000 | head -n 10
0
0
0
0
1
1
1
1
1
1
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output/part-r-00000 | tail -n 10
32766
32766
32766
32766
32767
32767
32767
32767
32767
32767
The listing above confirms that the job produced a single output file, and that the data in it is in total order.
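Rather than eyeballing head and tail, you can have sort verify the order for you: `sort -c -n` exits non-zero at the first out-of-order pair. A sketch using a small local file standing in for the output (on a real cluster you would pipe `hadoop fs -cat` into `sort -c -n` instead):

```shell
# Build a small sorted sample standing in for part-r-00000
printf '0\n0\n1\n5\n32767\n' > /tmp/sorted.sample

# sort -c -n succeeds only if the file is already in numeric order
if sort -c -n /tmp/sorted.sample 2>/dev/null; then
  result="sorted"
else
  result="not sorted"
fi
echo "$result"
```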
3. Total order with a custom partitioner
The approach above has a serious limitation: all of the data is sent to a single Reducer for sorting. This fails to exploit the cluster's compute resources, and with a large data set the lone Reducer can easily run out of memory (OOM). Consider how MapReduce's default partitioner, HashPartitioner, works: it computes the hashCode of each map output key and takes it modulo the number of reducers, so all keys with the same remainder go to the same Reducer.
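For reference, the default HashPartitioner's rule boils down to the following (a simplified sketch rather than the exact Hadoop source; the class name here is illustrative):

```java
// Simplified sketch of the default HashPartitioner's partitioning rule.
// Masking with Integer.MAX_VALUE clears the sign bit, so the modulo
// result is never negative even for keys with negative hashCodes.
public class HashPartitionerSketch {
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Equal keys always share a partition, but nothing relates the
        // ordering of keys to the ordering of partitions.
        System.out.println(getPartition(12345, 3)); // Integer hashes to itself: 12345 % 3 = 0
        System.out.println(getPartition(12346, 3)); // 12346 % 3 = 1
    }
}
```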
If we instead implement a partitioner such that:

- all keys < 10000 go to Reducer 0;
- all keys with 10000 ≤ key < 20000 go to Reducer 1;
- all remaining keys (≥ 20000) go to Reducer 2;

then every value in Reducer 0's output is smaller than every value in Reducer 1's, and every value in Reducer 1's output is smaller than every value in Reducer 2's. Combined with the sorting within each Reducer, this yields a total order. The implementation:
package com.iteblog.mapreduce.sort;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class TotalSortV2 extends Configured implements Tool {
    static class SimpleMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                           Context context) throws IOException, InterruptedException {
            IntWritable intWritable = new IntWritable(Integer.parseInt(value.toString()));
            context.write(intWritable, intWritable);
        }
    }

    static class SimpleReducer extends Reducer<IntWritable, IntWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                              Context context) throws IOException, InterruptedException {
            for (IntWritable value : values)
                context.write(value, NullWritable.get());
        }
    }

    // Route keys to reducers by value range:
    // [0, 10000) -> 0, [10000, 20000) -> 1, [20000, ...) -> 2
    public static class IteblogPartitioner extends Partitioner<IntWritable, IntWritable> {
        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            int keyInt = key.get();
            if (keyInt < 10000) {
                return 0;
            } else if (keyInt < 20000) {
                return 1;
            } else {
                return 2;
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TotalSortV2 <input> <output>");
            System.exit(127);
        }
        Job job = Job.getInstance(getConf());
        job.setJarByClass(TotalSortV2.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(SimpleMapper.class);
        job.setReducerClass(SimpleReducer.class);
        job.setPartitionerClass(IteblogPartitioner.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        // One reducer per key range; each output file is sorted internally
        job.setNumReduceTasks(3);
        job.setJobName("TotalSortV2");
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new TotalSortV2(), args);
        System.exit(exitCode);
    }
}
Apart from the custom IteblogPartitioner, this second version is identical to the first. Let's run it:
[iteblog@iteblog /home/iteblog]$ hadoop jar total-sort-0.1.jar com.iteblog.mapreduce.sort.TotalSortV2 /user/iteblog/input /user/iteblog/output1
[iteblog@iteblog /home/iteblog]$ hadoop fs -ls /user/iteblog/output1
Found 4 items
-rw-r--r-- 3 iteblog supergroup 0 2017-05-09 13:53 /user/iteblog/output1/_SUCCESS
-rw-r--r-- 3 iteblog supergroup 299845 2017-05-09 13:53 /user/iteblog/output1/part-r-00000
-rw-r--r-- 3 iteblog supergroup 365190 2017-05-09 13:53 /user/iteblog/output1/part-r-00001
-rw-r--r-- 3 iteblog supergroup 466722 2017-05-09 13:53 /user/iteblog/output1/part-r-00002
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00000 | head -n 10
0
0
0
0
1
1
1
1
1
1
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00000 | tail -n 10
9998
9998
9998
9999
9999
9999
9999
9999
9999
9999
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00001 | head -n 10
10000
10000
10000
10000
10000
10000
10001
10001
10001
10001
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00001 | tail -n 10
19997
19997
19998
19998
19998
19998
19999
19999
19999
19999
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00002 | head -n 10
20000
20000
20000
20000
20000
20000
20001
20001
20001
20001
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00002 | tail -n 10
32766
32766
32766
32766
32767
32767
32767
32767
32767
32767
As expected, the job produced three output files (we set the number of reducers to 3), each sorted internally. All values below 10000 are in part-r-00000, all values from 10000 up to but not including 20000 are in part-r-00001, and all values of 20000 and above are in part-r-00002. Read in order, part-r-00000, part-r-00001, and part-r-00002 together form a totally ordered data set.
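Because the partitions cover disjoint, increasing key ranges, concatenating the part files in index order yields one globally sorted stream, which `sort -c -n` can confirm. A sketch using small local stand-ins for the three part files (on the cluster, `hadoop fs -cat /user/iteblog/output1/part-r-0000{0,1,2}` would supply the real ones):

```shell
# Three files with disjoint, increasing ranges, mimicking part-r-0000{0,1,2}
printf '3\n9999\n'      > /tmp/part-r-00000
printf '10000\n19999\n' > /tmp/part-r-00001
printf '20000\n32767\n' > /tmp/part-r-00002

# Concatenated in partition order, the stream must be globally sorted
global="not sorted"
if cat /tmp/part-r-00000 /tmp/part-r-00001 /tmp/part-r-00002 | sort -c -n 2>/dev/null; then
  global="globally sorted"
fi
echo "$global"
```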