We sometimes need the output of a MapReduce job to be in total order, meaning the keys are globally sorted across all output files. By default, however, MapReduce only guarantees that keys are sorted within each partition; it makes no guarantee across partitions. This article presents two approaches to producing globally sorted MapReduce output.
Contents:

1. Generating test data
2. Sorting with a single Reducer
3. Total order with a custom partitioner
1. Generating test data
Before looking at the implementations, let's first generate some test data:
#!/bin/bash
# emit 100,000 random integers (bash's $RANDOM: 0..32767), one per line
for i in {1..100000}; do
  echo $RANDOM
done
Save the code above into a file named iteblog.sh, then run it twice and upload the results ($RANDOM and {1..100000} are bash features, so invoke it with bash):
$ bash iteblog.sh > data1
$ bash iteblog.sh > data2
$ hadoop fs -put data1 /user/iteblog/input
$ hadoop fs -put data2 /user/iteblog/input
$RANDOM is a shell built-in that expands to a random non-negative integer of at most five digits (0 to 32767). Running the script twice gives us two files of random numbers, data1 and data2, which we then upload to HDFS. Now we can write a program to sort the data in these two files.
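Note that $RANDOM is a 15-bit value, so every number it produces lies in 0..32767; that is why the sorted output later in this article tops out at 32767. A quick sanity check:

```shell
# Sample $RANDOM many times and track the observed bounds;
# every sample must fall within 0..32767 (a 15-bit value)
min=32767; max=0
for i in {1..1000}; do
  r=$RANDOM
  if [ "$r" -lt "$min" ]; then min=$r; fi
  if [ "$r" -gt "$max" ]; then max=$r; fi
done
echo "observed range: $min..$max"
```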
2. Sorting with a single Reducer
As noted above, MapReduce only guarantees that keys are sorted within a single partition, not across partitions. But if we send all of the data to a single Reducer, the result is trivially in total order. This approach is simple to implement:
package com.iteblog.mapreduce.sort;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class TotalSortV1 extends Configured implements Tool {
    // Map each input line to (value, value) so the shuffle sorts by the number itself
    static class SimpleMapper extends
            Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                           Context context) throws IOException, InterruptedException {
            IntWritable intWritable = new IntWritable(Integer.parseInt(value.toString()));
            context.write(intWritable, intWritable);
        }
    }

    // Emit every value once per occurrence; keys arrive at the reducer already sorted
    static class SimpleReducer extends
            Reducer<IntWritable, IntWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                              Context context) throws IOException, InterruptedException {
            for (IntWritable value : values)
                context.write(value, NullWritable.get());
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TotalSortV1 <input> <output>");
            System.exit(127);
        }
        Job job = Job.getInstance(getConf());
        job.setJarByClass(TotalSortV1.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(SimpleMapper.class);
        job.setReducerClass(SimpleReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        // A single reducer means a single output file with all keys in order
        job.setNumReduceTasks(1);
        job.setJobName("TotalSort");
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new TotalSortV1(), args);
        System.exit(exitCode);
    }
}
The program is straightforward. We use the default TextInputFormat to read the generated random-number files (data1 and data2). Since every line holds a non-negative integer, SimpleMapper simply parses the value into an int and wraps it in an IntWritable. By the time the data reaches SimpleReducer, all keys within that one Reducer are already sorted, and because we set the number of reduce tasks to 1, that order is global. Run it like this:
[iteblog@iteblog /home/iteblog]$ hadoop jar total-sort-0.1.jar com.iteblog.mapreduce.sort.TotalSortV1 /user/iteblog/input /user/iteblog/output
[iteblog@iteblog /home/iteblog]$ hadoop fs -ls /user/iteblog/output
Found 2 items
-rw-r--r-- 3 iteblog supergroup 0 2017-05-09 11:41 /user/iteblog/output/_SUCCESS
-rw-r--r-- 3 iteblog supergroup 1131757 2017-05-09 11:41 /user/iteblog/output/part-r-00000
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output/part-r-00000 | head -n 10
0
0
0
0
1
1
1
1
1
1
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output/part-r-00000 | tail -n 10
32766
32766
32766
32766
32767
32767
32767
32767
32767
32767
The listing above confirms that the job produced a single output file, and that the data in it is in total order.
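Rather than eyeballing head and tail, you can have sort verify the order for you: `sort -c -n` exits non-zero at the first out-of-order pair. A sketch using a small local file standing in for the output (on a real cluster you would pipe `hadoop fs -cat` into `sort -c -n` instead):

```shell
# Build a small sorted sample standing in for part-r-00000
printf '0\n0\n1\n5\n32767\n' > /tmp/sorted.sample

# sort -c -n succeeds only if the file is already in numeric order
if sort -c -n /tmp/sorted.sample 2>/dev/null; then
  result="sorted"
else
  result="not sorted"
fi
echo "$result"
```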
3. Total order with a custom partitioner
The approach above has a serious limitation: all of the data is sent to a single Reducer for sorting. This fails to exploit the cluster's compute resources, and with a large data set the lone Reducer can easily run out of memory (OOM). Consider how MapReduce's default partitioner, HashPartitioner, works: it computes the hashCode of each map output key and takes it modulo the number of reducers, so all keys with the same remainder go to the same Reducer.
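For reference, the default HashPartitioner's rule boils down to the following (a simplified sketch rather than the exact Hadoop source; the class name here is illustrative):

```java
// Simplified sketch of the default HashPartitioner's partitioning rule.
// Masking with Integer.MAX_VALUE clears the sign bit, so the modulo
// result is never negative even for keys with negative hashCodes.
public class HashPartitionerSketch {
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Equal keys always share a partition, but nothing relates the
        // ordering of keys to the ordering of partitions.
        System.out.println(getPartition(12345, 3)); // Integer hashes to itself: 12345 % 3 = 0
        System.out.println(getPartition(12346, 3)); // 12346 % 3 = 1
    }
}
```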
If we instead implement a partitioner such that:

- all keys < 10000 go to Reducer 0;
- all keys with 10000 ≤ key < 20000 go to Reducer 1;
- all remaining keys (≥ 20000) go to Reducer 2;

then every value in Reducer 0's output is smaller than every value in Reducer 1's, and every value in Reducer 1's output is smaller than every value in Reducer 2's. Combined with the sorting within each Reducer, this yields a total order. The implementation:
package com.iteblog.mapreduce.sort;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class TotalSortV2 extends Configured implements Tool {
    static class SimpleMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                           Context context) throws IOException, InterruptedException {
            IntWritable intWritable = new IntWritable(Integer.parseInt(value.toString()));
            context.write(intWritable, intWritable);
        }
    }

    static class SimpleReducer extends Reducer<IntWritable, IntWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                              Context context) throws IOException, InterruptedException {
            for (IntWritable value : values)
                context.write(value, NullWritable.get());
        }
    }

    // Route keys to reducers by value range:
    // [0, 10000) -> 0, [10000, 20000) -> 1, [20000, ...) -> 2
    public static class IteblogPartitioner extends Partitioner<IntWritable, IntWritable> {
        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            int keyInt = key.get();
            if (keyInt < 10000) {
                return 0;
            } else if (keyInt < 20000) {
                return 1;
            } else {
                return 2;
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TotalSortV2 <input> <output>");
            System.exit(127);
        }
        Job job = Job.getInstance(getConf());
        job.setJarByClass(TotalSortV2.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(SimpleMapper.class);
        job.setReducerClass(SimpleReducer.class);
        job.setPartitionerClass(IteblogPartitioner.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        // One reducer per key range; each output file is sorted internally
        job.setNumReduceTasks(3);
        job.setJobName("TotalSortV2");
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new TotalSortV2(), args);
        System.exit(exitCode);
    }
}
Apart from the custom IteblogPartitioner, this second version is identical to the first. Let's run it:
[iteblog@iteblog /home/iteblog]$ hadoop jar total-sort-0.1.jar com.iteblog.mapreduce.sort.TotalSortV2 /user/iteblog/input /user/iteblog/output1
[iteblog@iteblog /home/iteblog]$ hadoop fs -ls /user/iteblog/output1
Found 4 items
-rw-r--r-- 3 iteblog supergroup 0 2017-05-09 13:53 /user/iteblog/output1/_SUCCESS
-rw-r--r-- 3 iteblog supergroup 299845 2017-05-09 13:53 /user/iteblog/output1/part-r-00000
-rw-r--r-- 3 iteblog supergroup 365190 2017-05-09 13:53 /user/iteblog/output1/part-r-00001
-rw-r--r-- 3 iteblog supergroup 466722 2017-05-09 13:53 /user/iteblog/output1/part-r-00002
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00000 | head -n 10
0
0
0
0
1
1
1
1
1
1
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00000 | tail -n 10
9998
9998
9998
9999
9999
9999
9999
9999
9999
9999
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00001 | head -n 10
10000
10000
10000
10000
10000
10000
10001
10001
10001
10001
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00001 | tail -n 10
19997
19997
19998
19998
19998
19998
19999
19999
19999
19999
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00002 | head -n 10
20000
20000
20000
20000
20000
20000
20001
20001
20001
20001
[iteblog@iteblog /home/iteblog]$ hadoop fs -cat /user/iteblog/output1/part-r-00002 | tail -n 10
32766
32766
32766
32766
32767
32767
32767
32767
32767
32767
As expected, the job produced three output files (we set the number of reducers to 3), each sorted internally. All values below 10000 are in part-r-00000, all values from 10000 up to but not including 20000 are in part-r-00001, and all values of 20000 and above are in part-r-00002. Read in order, part-r-00000, part-r-00001, and part-r-00002 together form a totally ordered data set.
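Because the partitions cover disjoint, increasing key ranges, concatenating the part files in index order yields one globally sorted stream, which `sort -c -n` can confirm. A sketch using small local stand-ins for the three part files (on the cluster, `hadoop fs -cat /user/iteblog/output1/part-r-0000{0,1,2}` would supply the real ones):

```shell
# Three files with disjoint, increasing ranges, mimicking part-r-0000{0,1,2}
printf '3\n9999\n'      > /tmp/part-r-00000
printf '10000\n19999\n' > /tmp/part-r-00001
printf '20000\n32767\n' > /tmp/part-r-00002

# Concatenated in partition order, the stream must be globally sorted
global="not sorted"
if cat /tmp/part-r-00000 /tmp/part-r-00001 /tmp/part-r-00002 | sort -c -n 2>/dev/null; then
  global="globally sorted"
fi
echo "$global"
```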