I. Overview
The MapReduce framework sorts its output by key by default. This default sort satisfies some needs, but it is quite limited; in practice we often need a secondary sort on the reduce output. Many secondary-sort implementations have been shared online, but their accounts of how the implementation works and of the overall MapReduce processing flow differ considerably, and some of those analyses were never verified. This article works through a concrete MapReduce secondary-sort example, explains the implementation and the full MapReduce processing flow, and uses the results together with the map-side and reduce-side logs to verify that the described flow is correct.
II. Requirements
1. Input data:
zhangsan,3
lisi,7
wangwu,11
lisi,4
wangwu,66
lisi,7
wangwu,12
zhangsan,45
lisi,72
zhangsan,34
lisi,89
zhangsan,34
lisi,77
2. Target output:
zhangsan 3
zhangsan 34
zhangsan 34
zhangsan 45
lisi 4
lisi 7
lisi 7
lisi 72
lisi 77
lisi 89
wangwu 11
wangwu 12
wangwu 66
III. Solution
1. Before working out a solution, we should first understand how MapReduce processes data end to end; this is the foundation, and without it no workable approach will emerge. In brief, the flow is as follows: the framework splits the input files via the getSplit method, and each split corresponds to one map task, whose InputSplit is fed to the map function. The intermediate map output goes into a circular memory buffer, where it is partitioned and sorted (applying the custom secondary-sort comparator, if one is set) and merged into spill files. The shuffle then transfers each partition's data to its reduce task. The reduce side buffers the incoming data as well, merging and sorting it between memory and disk; the merged data is then grouped by key, and the reduce function is called once for each group, producing the final output.
2. Concrete approach
(1) Building a composite key for the secondary sort
Given the requirement above, the goal is clear: bring together all records that share the same first column and sort the numbers within each group. Whether the sort is the default one or a custom one, the MapReduce framework only ever sorts by key; here the numbers to be sorted are values, not keys. What to do? We can combine the original key and its corresponding value into a new composite key, while the value remains the same number as before. There are two common ways to do this: the first is to write a class implementing the WritableComparable interface whose fields are the two sort criteria, here the first and secondary properties; the second is to concatenate the key and value into a single key on the map side, again leaving the value unchanged. This article uses the first approach. The custom key class is as follows:
package com.ibeifeng.sort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/**
 * Composite key: sorts first by the "first" field (the name),
 * then by the "secondary" field (the number).
 */
public class FirstSecondary implements WritableComparable<FirstSecondary> {

    private String first;
    private Integer secondary;

    public String getFirst() {
        return first;
    }

    public void setFirst(String first) {
        this.first = first;
    }

    public Integer getSecondary() {
        return secondary;
    }

    public void setSecondary(Integer secondary) {
        this.secondary = secondary;
    }

    @Override
    public String toString() {
        return "first=" + first + ", secondary=" + secondary;
    }

    // Serialized layout: a UTF string (2-byte length prefix) followed by a 4-byte int.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.first);
        out.writeInt(this.secondary);
    }

    public void readFields(DataInput in) throws IOException {
        this.first = in.readUTF();
        this.secondary = in.readInt();
    }

    // Two-level comparison: by first, then by secondary, both ascending.
    public int compareTo(FirstSecondary o) {
        int comp = this.first.compareTo(o.getFirst());
        if (comp != 0) {
            return comp;
        }
        return this.secondary.compareTo(o.getSecondary());
    }
}
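Before wiring this class into a job, it can help to sanity-check the two-level ordering that compareTo defines. The following is a minimal standalone sketch, not part of the job itself (the class name CompareToCheck is ours), that simply sorts a few keys in memory:

package com.ibeifeng.sort;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical standalone check of the ordering (not part of the job).
public class CompareToCheck {
    public static void main(String[] args) {
        String[][] rows = { { "lisi", "7" }, { "zhangsan", "3" }, { "lisi", "4" } };
        List<FirstSecondary> keys = new ArrayList<FirstSecondary>();
        for (String[] row : rows) {
            FirstSecondary k = new FirstSecondary();
            k.setFirst(row[0]);
            k.setSecondary(Integer.valueOf(row[1]));
            keys.add(k);
        }
        // Collections.sort uses FirstSecondary.compareTo, so this prints:
        // first=lisi, secondary=4
        // first=lisi, secondary=7
        // first=zhangsan, secondary=3
        Collections.sort(keys);
        for (FirstSecondary k : keys) {
            System.out.println(k);
        }
    }
}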
(2) The partitioner and grouping comparator needed for the secondary sort
To make the effect of the secondary sort visible, the reduce stage does not iterate and accumulate the values; it simply writes them out one by one. Because the map output key is a custom type, we need a custom partitioner and a custom grouping comparator: the partitioner hashes on the first attribute of FirstSecondary, with the number of partitions set to 2, and grouping is likewise based on the first attribute. The code follows.
Driver (main program) code:
package com.ibeifeng.sort;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SecondarySortMapReduce extends Configured implements Tool {

    // Mapper: parses each "name,number" line into a composite key
    // plus the number as the value
    public static class map extends Mapper<LongWritable, Text, FirstSecondary, IntWritable> {

        private IntWritable outputValue = new IntWritable();
        private FirstSecondary outputKey = new FirstSecondary();

        @Override
        protected void map(LongWritable key, Text values, Context context)
                throws IOException, InterruptedException {
            // 1. Split the input line
            String str = values.toString();
            String[] split = str.split(",");
            // 2. Fill the composite key
            outputKey.setFirst(split[0]);
            outputKey.setSecondary(Integer.valueOf(split[1]));
            // 3. Emit the key and the number as the value
            outputValue.set(Integer.valueOf(split[1]));
            context.write(outputKey, outputValue);
        }
    }

    // Reducer: writes every value of a group on its own line,
    // keyed by the first field
    public static class reduce extends Reducer<FirstSecondary, IntWritable, Text, IntWritable> {

        private Text text = new Text();

        @Override
        protected void reduce(FirstSecondary key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            for (IntWritable in : values) {
                text.set(key.getFirst());
                context.write(text, in);
            }
        }
    }

    // Driver configuration
    public int run(String[] args) throws Exception {
        // 1. Get the configuration object (injected by ToolRunner)
        Configuration configuration = this.getConf();
        // 2. Create the job
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        // 3. Set the input path
        Path source_path = new Path(args[0]);
        FileInputFormat.addInputPath(job, source_path);
        // 4. Set the output path
        Path des_path = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, des_path);
        // Package the job as a jar for execution
        job.setJarByClass(SecondarySortMapReduce.class);
        // 5. Configure the map side
        job.setMapperClass(map.class);
        job.setMapOutputKeyClass(FirstSecondary.class);
        job.setMapOutputValueClass(IntWritable.class);
        // ================ shuffle ========================
        // 1. Partitioning
        job.setPartitionerClass(MyPartitioner.class);
        // 2. Sorting (defaults to the key's compareTo, so nothing is set here)
        // job.setSortComparatorClass(cls);
        // 3. Grouping
        job.setGroupingComparatorClass(MyGroup.class);
        // 4. Optional combiner: a map-side reduce, as an optimization
        // job.setCombinerClass(Combiner.class);
        // Number of reduce tasks
        job.setNumReduceTasks(2);
        // ================ shuffle ========================
        // 6. Configure the reduce side
        job.setReducerClass(reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 7. Submit the job to YARN and wait for completion
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : -1;
    }

    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        // 1. Input and output paths
        String[] args1 = new String[] {
            "hdfs://hadoop-senior01.ibeifeng.com:8020/user/beifeng/wordcount/input",
            "hdfs://hadoop-senior01.ibeifeng.com:8020/user/beifeng/wordcount/output"
        };
        // 2. The user the job runs as
        System.setProperty("HADOOP_USER_NAME", "beifeng");
        // 3. Run the job
        int status = 0;
        try {
            status = ToolRunner.run(configuration, new SecondarySortMapReduce(), args1);
        } catch (Exception e) {
            e.printStackTrace();
        }
        // 4. Exit with the job's status
        System.exit(status);
    }
}
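One detail in the driver deserves a note: step 2 of the shuffle configuration (sorting) is commented out because, when no sort comparator is registered, Hadoop falls back to the comparator associated with the key class, which for a WritableComparable deserializes the keys and calls compareTo. If you wanted to register one explicitly, a minimal sketch could look like the following (MySortComparator is a hypothetical name, not part of the code above; it simply delegates to FirstSecondary.compareTo):

package com.ibeifeng.sort;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical sketch: an explicit sort comparator that delegates to
// FirstSecondary.compareTo. Registering it with
// job.setSortComparatorClass(MySortComparator.class) would reproduce
// the default behavior.
public class MySortComparator extends WritableComparator {

    protected MySortComparator() {
        // true: let the framework instantiate FirstSecondary keys
        // so the object-based compare() below can be used
        super(FirstSecondary.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((FirstSecondary) a).compareTo((FirstSecondary) b);
    }
}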
Custom partitioner code:
package com.ibeifeng.sort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<FirstSecondary, IntWritable> {

    @Override
    public int getPartition(FirstSecondary key, IntWritable value, int numPartitions) {
        // Hash only on the first field, so all records with the same name
        // go to the same reduce task. The & Integer.MAX_VALUE masks the
        // sign bit so the modulo result is never negative. numPartitions
        // is 2 here, since the driver sets two reduce tasks.
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
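To see the partitioner's guarantee in isolation, you can call getPartition directly. This minimal sketch (PartitionCheck is a hypothetical helper, not part of the job) only asserts that two keys sharing the same first field land in the same partition, without assuming which of the two partitions that is:

package com.ibeifeng.sort;

import org.apache.hadoop.io.IntWritable;

// Hypothetical standalone check (not part of the job itself).
public class PartitionCheck {
    public static void main(String[] args) {
        MyPartitioner partitioner = new MyPartitioner();

        FirstSecondary k1 = new FirstSecondary();
        k1.setFirst("lisi");
        k1.setSecondary(4);

        FirstSecondary k2 = new FirstSecondary();
        k2.setFirst("lisi");
        k2.setSecondary(89);

        // Both keys share first == "lisi", so they must land in the same
        // partition and therefore reach the same reduce task.
        int p1 = partitioner.getPartition(k1, new IntWritable(4), 2);
        int p2 = partitioner.getPartition(k2, new IntWritable(89), 2);
        System.out.println(p1 == p2); // prints: true
    }
}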
Custom grouping comparator code:
package com.ibeifeng.sort;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.RawComparator;

public class MyGroup implements RawComparator<FirstSecondary> {

    // Object-based comparison: group only by the first field.
    public int compare(FirstSecondary o1, FirstSecondary o2) {
        return o1.getFirst().compareTo(o2.getFirst());
    }

    // Byte-based comparison on the serialized key. FirstSecondary.write()
    // produces: a 2-byte UTF length prefix, the bytes of `first`, then a
    // 4-byte int for `secondary`. So `first` occupies the bytes from
    // s + 2 up to s + l - 4 of each key. (writeUTF emits modified UTF-8,
    // which matches UTF-8 for the ASCII names used here.)
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        String first1 = new String(b1, s1 + 2, l1 - 6, StandardCharsets.UTF_8);
        String first2 = new String(b2, s2 + 2, l2 - 6, StandardCharsets.UTF_8);
        return first1.compareTo(first2);
    }
}
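One point worth spelling out: the grouping comparator is only ever applied to adjacent keys in the already-sorted reduce input, so it merely has to decide whether two neighboring keys belong to the same group; it never has to define a full ordering on its own. Comparing just the first field is therefore sufficient: all records with the same name form one contiguous run, and within that run the values already arrive in ascending secondary order, which is exactly the target output.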
IV. Packaging and test results
Package the jar, upload the test data to HDFS, and run:
bin/yarn jar datas/sort.jar /user/beifeng/wordcount/input/ /user/beifeng/wordcount/output
HDFS now contains one output file per reduce task (part-r-00000 and part-r-00001). View them with:
bin/hdfs dfs -text /user/beifeng/wordcount/output/part*
The results:
zhangsan 3
zhangsan 34
zhangsan 34
zhangsan 45
lisi 4
lisi 7
lisi 7
lisi 72
lisi 77
lisi 89
wangwu 11
wangwu 12
wangwu 66