Sorting with MapReduce

Original

2020-06-27 08:12

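I used MapReduce sorting at work a while ago. My understanding of the platform was still shallow at the time, so I took the simplest route: run the job with a single Reduce task. Between the Map and Reduce phases, the MapReduce implementation sorts records by key on its own to make merging easier, so with one reducer the output is totally ordered. With more than one Reduce task, each output part is sorted internally, but the parts still have to be merged to obtain a globally sorted result. Today I read Chapter 8 of "Hadoop: The Definitive Guide" and learned how to sort with Hadoop, a Java implementation of MapReduce. This post is a brief summary.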
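To make the "merge the parts afterwards" step concrete, here is a minimal sketch in plain Java with no Hadoop classes. It assumes each inner list stands in for one already-sorted reducer output (a part file); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-alone sketch: each inner list plays the role of one
// sorted reducer output (part-00000, part-00001, ...).
class MergeSortedParts {

    // K-way merge: repeatedly take the smallest head element among all parts.
    static List<Integer> merge(List<List<Integer>> parts) {
        // Queue entries are {value, partIndex, offsetInPart}, ordered by value.
        PriorityQueue<int[]> heads =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int p = 0; p < parts.size(); p++) {
            if (!parts.get(p).isEmpty()) {
                heads.add(new int[] {parts.get(p).get(0), p, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            merged.add(head[0]);
            // Advance within the part that the smallest element came from.
            int next = head[2] + 1;
            List<Integer> part = parts.get(head[1]);
            if (next < part.size()) {
                heads.add(new int[] {part.get(next), head[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = Arrays.asList(
                Arrays.asList(3, 12, 40),
                Arrays.asList(1, 25, 33),
                Arrays.asList(7, 8, 50));
        System.out.println(merge(parts)); // [1, 3, 7, 8, 12, 25, 33, 40, 50]
    }
}
```

This extra pass over all the data is exactly what the rest of the post tries to avoid.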

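First, let's look at the sorting that Hadoop performs internally in the Map and Reduce phases, which the figure below illustrates.

[Figure: sorting in the Map and Reduce phases]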

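Anyone familiar with MapReduce or Hadoop probably knows that the sort by key actually takes place in the Map phase. The Reduce phase merges the sorted outputs of the individual Map tasks, so each reducer sees its input in key order. To get a globally sorted result with multiple Reduce tasks, we can control the partition step in the figure above so that each Reduce task receives a disjoint range of key values. For example, if all values fall between 0 and 50 and there are 5 Reduce tasks, values in [0, 10) go to the first Reduce task, values in [10, 20) to the second, and so on. The problem is that the resulting partitions may be uneven: some Reduce tasks process a lot of data while others process very little. Hadoop provides the RandomSampler class (nested in the InputSampler class) to take a random sample of the input, and the key range is then divided according to that sample. Sample code follows: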

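import java.net.URI;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// JobBuilder is a helper class from the book's example code, not part of Hadoop.
public class SortByTemperatureUsingTotalOrderPartitioner extends Configured
        implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
        if (conf == null) {
            return -1;
        }
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(conf, true);
        SequenceFileOutputFormat
                .setOutputCompressorClass(conf, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(conf,
                CompressionType.BLOCK);
        conf.setPartitionerClass(TotalOrderPartitioner.class);
        // Sample keys with probability 0.1, keeping at most 10000 samples
        // taken from at most 10 input splits.
        InputSampler.Sampler<IntWritable, Text> sampler =
                new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
        Path input = FileInputFormat.getInputPaths(conf)[0];
        input = input.makeQualified(input.getFileSystem(conf));
        // The sampled split points are written to this partition file.
        Path partitionFile = new Path(input, "_partitions");
        TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
        InputSampler.writePartitionFile(conf, sampler);
        // Add to DistributedCache so that every task can read the partition file
        URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
        DistributedCache.addCacheFile(partitionUri, conf);
        DistributedCache.createSymlink(conf);
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(
                new SortByTemperatureUsingTotalOrderPartitioner(), args);
        System.exit(exitCode);
    }
}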

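Running this program produces multiple partitions. Each partition is sorted internally, and no key in partition i is larger than any key in partition i+1. The output files can therefore simply be read in partition order, with no further merge step, to obtain a globally sorted result.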
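The idea behind sampling-based range partitioning can be sketched without Hadoop. The following plain Java is only an illustration of the principle, not Hadoop's implementation: the class name, the skewed test data, and the way split points are picked from the sorted sample are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of range partitioning driven by a random sample.
class SampledRangePartitioner {

    // Pick numPartitions - 1 split points at evenly spaced positions
    // of the sorted sample.
    static int[] splitPoints(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0; a key equal to a split point
    // goes to the partition on its right.
    static int getPartition(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            // Skewed data: squaring concentrates keys near the low end of 0..49.
            double u = random.nextDouble();
            keys.add((int) (u * u * 50));
        }
        // Keep roughly 10% of the keys as the sample.
        List<Integer> sample = new ArrayList<>();
        for (int key : keys) {
            if (random.nextDouble() < 0.1) {
                sample.add(key);
            }
        }
        int[] splits = splitPoints(sample, 5);
        int[] counts = new int[5];
        for (int key : keys) {
            counts[getPartition(key, splits)]++;
        }
        System.out.println("splits = " + Arrays.toString(splits));
        System.out.println("counts = " + Arrays.toString(counts));
    }
}
```

With fixed-width ranges such as [0, 10), [10, 20), ... the first partition would receive most of this skewed data. Split points taken from a sample follow the data distribution instead, so the partition sizes come out closer to each other, although with few distinct key values the balance is only approximate.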

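A more interesting application of sorting is the so-called Secondary Sort: while keeping the first key in order, the second key must be sorted as well (ascending or descending). One way to implement this is to make both fields part of the key, stored in an IntPair, and then write a comparison class named KeyComparator that extends WritableComparator. Its code is as follows: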

public static class KeyComparator extends WritableComparator {
    protected KeyComparator() {
        super(IntPair.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        // Ascending on the first field
        int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());
        if (cmp != 0) {
            return cmp;
        }
        // Descending on the second field: negate to reverse the order
        return -IntPair.compare(ip1.getSecond(), ip2.getSecond());
    }
}

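This yields descending order on the second field. The comparator is registered with conf.setOutputKeyComparatorClass(KeyComparator.class);. Running the job like this has a problem, though. Because the two int values together form the key, the partitioning after the Map phase is also based on the whole composite key, so two keys such as (1900,20) and (1900,23) are not guaranteed to reach the same Reduce task. We therefore need our own implementation of the Partitioner interface: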

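public static class FirstPartitioner implements
        Partitioner<IntPair, NullWritable> {
    @Override
    public void configure(JobConf job) {
    }

    @Override
    public int getPartition(IntPair key, NullWritable value, int numPartitions) {
        // Partition on the first field only. The sign bit is masked instead of
        // calling Math.abs, because Math.abs(Integer.MIN_VALUE) is negative
        // and would produce an invalid partition number.
        return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
}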
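To see why partitioning on the first field matters, here is a small stand-alone sketch in plain Java with no Hadoop classes. partitionByPair is a hypothetical hash of the whole pair, used only for contrast. partitionByFirst follows the same first * 127 idea as above and masks the sign bit, since Math.abs(Integer.MIN_VALUE) is itself negative.

```java
// Hypothetical sketch comparing two ways of choosing a partition
// for a composite (first, second) key.
class FirstFieldPartitioning {

    // Depends on the first field only, so keys that share `first`
    // always land in the same partition.
    static int partitionByFirst(int first, int numPartitions) {
        return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    // Depends on both fields (an assumed hash, for illustration only),
    // so keys that share `first` may be scattered across partitions.
    static int partitionByPair(int first, int second, int numPartitions) {
        return ((first * 163 + second) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        System.out.println(partitionByPair(1900, 20, numPartitions)); // 0
        System.out.println(partitionByPair(1900, 23, numPartitions)); // 3
        System.out.println(partitionByFirst(1900, numPartitions));    // 0

        // Math.abs does not help for Integer.MIN_VALUE: the result stays negative.
        System.out.println(Math.abs(Integer.MIN_VALUE) < 0);          // true
        System.out.println(partitionByFirst(Integer.MIN_VALUE, 7));   // 0
    }
}
```

With the whole-pair hash, (1900,20) and (1900,23) end up in partitions 0 and 3, so no single reducer would see all records for 1900.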

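The partitioner is registered in the job configuration with conf.setPartitionerClass(FirstPartitioner.class);. Beyond that, the grouping (group-by) step inside Reduce also needs to be controlled. This is done by implementing a GroupComparator class that compares only the first field of the key, as follows: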

public static class GroupComparator extends WritableComparator {
    protected GroupComparator() {
        super(IntPair.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        return IntPair.compare(ip1.getFirst(), ip2.getFirst());
    }
}

It is set with conf.setOutputValueGroupingComparator(GroupComparator.class);. With the sort comparator, the partitioner, and the grouping comparator in place, the Secondary Sort is complete.
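The combined effect of the sort comparator and the grouping comparator can be simulated in plain Java. This is only a sketch with no Hadoop classes: pairs are modelled as two-element int arrays, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of a secondary sort on (first, second) pairs.
class SecondarySortSketch {

    // Plays the role of KeyComparator: first ascending, then second descending.
    static final Comparator<int[]> KEY_ORDER = (a, b) -> {
        int cmp = Integer.compare(a[0], b[0]);
        return cmp != 0 ? cmp : -Integer.compare(a[1], b[1]);
    };

    // Sort all pairs, then collect the second fields under their first field.
    // Grouping on `first` alone plays the role of GroupComparator.
    static Map<Integer, List<Integer>> sortAndGroup(List<int[]> pairs) {
        List<int[]> sorted = new ArrayList<>(pairs);
        sorted.sort(KEY_ORDER);
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] pair : sorted) {
            groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> pairs = Arrays.asList(
                new int[] {1901, 5}, new int[] {1900, 20}, new int[] {1900, 23},
                new int[] {1901, 31}, new int[] {1900, -4});
        System.out.println(sortAndGroup(pairs));
        // prints {1900=[23, 20, -4], 1901=[31, 5]}
    }
}
```

Each group starts with its largest second value, which is handy when a reducer only needs the maximum per first key: it can read the first record of the group and ignore the rest.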

Reposted from: http://www.cnblogs.com/funnydavid/archive/2010/11/24/1886974.html
