I. Overview
The MapReduce framework sorts its output by key by default. This default sort satisfies some needs, but it is quite limited; in practice we often need a secondary sort on the reduce output. Many secondary-sort implementations have been shared online, but their accounts of how the implementation works and of the overall MapReduce processing flow differ considerably, and some of those analyses were never verified. This article works through a concrete MapReduce secondary-sort example, explains the implementation and the full MapReduce processing flow, and uses the results together with the map-side and reduce-side logs to verify that the described flow is correct.
II. Requirements
1. Input data:
zhangsan,3
lisi,7
wangwu,11
lisi,4
wangwu,66
lisi,7
wangwu,12
zhangsan,45
lisi,72
zhangsan,34
lisi,89
zhangsan,34
lisi,77
2. Target output:
zhangsan 3
zhangsan 34
zhangsan 34
zhangsan 45
lisi 4
lisi 7
lisi 7
lisi 72
lisi 77
lisi 89
wangwu 11
wangwu 12
wangwu 66
III. Solution
1. Before working out a solution, we should first understand how MapReduce processes data end to end; this is the foundation, and without it no workable approach will emerge. In brief, the flow is as follows: the framework splits the input files via the getSplit method, and each split corresponds to one map task, whose InputSplit is fed to the map function. The intermediate map output goes into a circular memory buffer, where it is partitioned and sorted (applying the custom secondary-sort comparator, if one is set) and merged into spill files. The shuffle then transfers each partition's data to its reduce task. The reduce side buffers the incoming data as well, merging and sorting it between memory and disk; the merged data is then grouped by key, and the reduce function is called once for each group, producing the final output.
2. Concrete approach
(1) Building a composite key for the secondary sort
Given the requirement above, the goal is clear: bring together all records that share the same first column and sort the numbers within each group. Whether the sort is the default one or a custom one, the MapReduce framework only ever sorts by key; here the numbers to be sorted are values, not keys. What to do? We can combine the original key and its corresponding value into a new composite key, while the value remains the same number as before. There are two common ways to do this: the first is to write a class implementing the WritableComparable interface whose fields are the two sort criteria, here the first and secondary properties; the second is to concatenate the key and value into a single key on the map side, again leaving the value unchanged. This article uses the first approach. The custom key class is as follows:
package com.ibeifeng.sort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/**
 * Composite key: sorts first by the "first" field (the name),
 * then by the "secondary" field (the number).
 */
public class FirstSecondary implements WritableComparable<FirstSecondary> {

    private String first;
    private Integer secondary;

    public String getFirst() {
        return first;
    }

    public void setFirst(String first) {
        this.first = first;
    }

    public Integer getSecondary() {
        return secondary;
    }

    public void setSecondary(Integer secondary) {
        this.secondary = secondary;
    }

    @Override
    public String toString() {
        return "first=" + first + ", secondary=" + secondary;
    }

    // Serialized layout: a UTF string (2-byte length prefix) followed by a 4-byte int.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.first);
        out.writeInt(this.secondary);
    }

    public void readFields(DataInput in) throws IOException {
        this.first = in.readUTF();
        this.secondary = in.readInt();
    }

    // Two-level comparison: by first, then by secondary, both ascending.
    public int compareTo(FirstSecondary o) {
        int comp = this.first.compareTo(o.getFirst());
        if (comp != 0) {
            return comp;
        }
        return this.secondary.compareTo(o.getSecondary());
    }
}
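Before wiring this class into a job, it can help to sanity-check the two-level ordering that compareTo defines. The following is a minimal standalone sketch, not part of the job itself (the class name CompareToCheck is ours), that simply sorts a few keys in memory:

package com.ibeifeng.sort;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical standalone check of the ordering (not part of the job).
public class CompareToCheck {
    public static void main(String[] args) {
        String[][] rows = { { "lisi", "7" }, { "zhangsan", "3" }, { "lisi", "4" } };
        List<FirstSecondary> keys = new ArrayList<FirstSecondary>();
        for (String[] row : rows) {
            FirstSecondary k = new FirstSecondary();
            k.setFirst(row[0]);
            k.setSecondary(Integer.valueOf(row[1]));
            keys.add(k);
        }
        // Collections.sort uses FirstSecondary.compareTo, so this prints:
        // first=lisi, secondary=4
        // first=lisi, secondary=7
        // first=zhangsan, secondary=3
        Collections.sort(keys);
        for (FirstSecondary k : keys) {
            System.out.println(k);
        }
    }
}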
(2) The partitioner and grouping comparator needed for the secondary sort
To make the effect of the secondary sort visible, the reduce stage does not iterate and accumulate the values; it simply writes them out one by one. Because the map output key is a custom type, we need a custom partitioner and a custom grouping comparator: the partitioner hashes on the first attribute of FirstSecondary, with the number of partitions set to 2, and grouping is likewise based on the first attribute. The code follows.
Driver (main program) code:
package com.ibeifeng.sort;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SecondarySortMapReduce extends Configured implements Tool {

    // Mapper: parses each "name,number" line into a composite key
    // plus the number as the value
    public static class map extends Mapper<LongWritable, Text, FirstSecondary, IntWritable> {

        private IntWritable outputValue = new IntWritable();
        private FirstSecondary outputKey = new FirstSecondary();

        @Override
        protected void map(LongWritable key, Text values, Context context)
                throws IOException, InterruptedException {
            // 1. Split the input line
            String str = values.toString();
            String[] split = str.split(",");
            // 2. Fill the composite key
            outputKey.setFirst(split[0]);
            outputKey.setSecondary(Integer.valueOf(split[1]));
            // 3. Emit the key and the number as the value
            outputValue.set(Integer.valueOf(split[1]));
            context.write(outputKey, outputValue);
        }
    }

    // Reducer: writes every value of a group on its own line,
    // keyed by the first field
    public static class reduce extends Reducer<FirstSecondary, IntWritable, Text, IntWritable> {

        private Text text = new Text();

        @Override
        protected void reduce(FirstSecondary key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            for (IntWritable in : values) {
                text.set(key.getFirst());
                context.write(text, in);
            }
        }
    }

    // Driver configuration
    public int run(String[] args) throws Exception {
        // 1. Get the configuration object (injected by ToolRunner)
        Configuration configuration = this.getConf();
        // 2. Create the job
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        // 3. Set the input path
        Path source_path = new Path(args[0]);
        FileInputFormat.addInputPath(job, source_path);
        // 4. Set the output path
        Path des_path = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, des_path);
        // Package the job as a jar for execution
        job.setJarByClass(SecondarySortMapReduce.class);
        // 5. Configure the map side
        job.setMapperClass(map.class);
        job.setMapOutputKeyClass(FirstSecondary.class);
        job.setMapOutputValueClass(IntWritable.class);
        // ================ shuffle ========================
        // 1. Partitioning
        job.setPartitionerClass(MyPartitioner.class);
        // 2. Sorting (defaults to the key's compareTo, so nothing is set here)
        // job.setSortComparatorClass(cls);
        // 3. Grouping
        job.setGroupingComparatorClass(MyGroup.class);
        // 4. Optional combiner: a map-side reduce, as an optimization
        // job.setCombinerClass(Combiner.class);
        // Number of reduce tasks
        job.setNumReduceTasks(2);
        // ================ shuffle ========================
        // 6. Configure the reduce side
        job.setReducerClass(reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 7. Submit the job to YARN and wait for completion
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : -1;
    }

    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        // 1. Input and output paths
        String[] args1 = new String[] {
            "hdfs://hadoop-senior01.ibeifeng.com:8020/user/beifeng/wordcount/input",
            "hdfs://hadoop-senior01.ibeifeng.com:8020/user/beifeng/wordcount/output"
        };
        // 2. The user the job runs as
        System.setProperty("HADOOP_USER_NAME", "beifeng");
        // 3. Run the job
        int status = 0;
        try {
            status = ToolRunner.run(configuration, new SecondarySortMapReduce(), args1);
        } catch (Exception e) {
            e.printStackTrace();
        }
        // 4. Exit with the job's status
        System.exit(status);
    }
}
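One detail in the driver deserves a note: step 2 of the shuffle configuration (sorting) is commented out because, when no sort comparator is registered, Hadoop falls back to the comparator associated with the key class, which for a WritableComparable deserializes the keys and calls compareTo. If you wanted to register one explicitly, a minimal sketch could look like the following (MySortComparator is a hypothetical name, not part of the code above; it simply delegates to FirstSecondary.compareTo):

package com.ibeifeng.sort;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical sketch: an explicit sort comparator that delegates to
// FirstSecondary.compareTo. Registering it with
// job.setSortComparatorClass(MySortComparator.class) would reproduce
// the default behavior.
public class MySortComparator extends WritableComparator {

    protected MySortComparator() {
        // true: let the framework instantiate FirstSecondary keys
        // so the object-based compare() below can be used
        super(FirstSecondary.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((FirstSecondary) a).compareTo((FirstSecondary) b);
    }
}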
Custom partitioner code:
package com.ibeifeng.sort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<FirstSecondary, IntWritable> {

    @Override
    public int getPartition(FirstSecondary key, IntWritable value, int numPartitions) {
        // Hash only on the first field, so all records with the same name
        // go to the same reduce task. The & Integer.MAX_VALUE masks the
        // sign bit so the modulo result is never negative. numPartitions
        // is 2 here, since the driver sets two reduce tasks.
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
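To see the partitioner's guarantee in isolation, you can call getPartition directly. This minimal sketch (PartitionCheck is a hypothetical helper, not part of the job) only asserts that two keys sharing the same first field land in the same partition, without assuming which of the two partitions that is:

package com.ibeifeng.sort;

import org.apache.hadoop.io.IntWritable;

// Hypothetical standalone check (not part of the job itself).
public class PartitionCheck {
    public static void main(String[] args) {
        MyPartitioner partitioner = new MyPartitioner();

        FirstSecondary k1 = new FirstSecondary();
        k1.setFirst("lisi");
        k1.setSecondary(4);

        FirstSecondary k2 = new FirstSecondary();
        k2.setFirst("lisi");
        k2.setSecondary(89);

        // Both keys share first == "lisi", so they must land in the same
        // partition and therefore reach the same reduce task.
        int p1 = partitioner.getPartition(k1, new IntWritable(4), 2);
        int p2 = partitioner.getPartition(k2, new IntWritable(89), 2);
        System.out.println(p1 == p2); // prints: true
    }
}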
Custom grouping comparator code:
package com.ibeifeng.sort;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.RawComparator;

public class MyGroup implements RawComparator<FirstSecondary> {

    // Object-based comparison: group only by the first field.
    public int compare(FirstSecondary o1, FirstSecondary o2) {
        return o1.getFirst().compareTo(o2.getFirst());
    }

    // Byte-based comparison on the serialized key. FirstSecondary.write()
    // produces: a 2-byte UTF length prefix, the bytes of `first`, then a
    // 4-byte int for `secondary`. So `first` occupies the bytes from
    // s + 2 up to s + l - 4 of each key. (writeUTF emits modified UTF-8,
    // which matches UTF-8 for the ASCII names used here.)
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        String first1 = new String(b1, s1 + 2, l1 - 6, StandardCharsets.UTF_8);
        String first2 = new String(b2, s2 + 2, l2 - 6, StandardCharsets.UTF_8);
        return first1.compareTo(first2);
    }
}
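One point worth spelling out: the grouping comparator is only ever applied to adjacent keys in the already-sorted reduce input, so it merely has to decide whether two neighboring keys belong to the same group; it never has to define a full ordering on its own. Comparing just the first field is therefore sufficient: all records with the same name form one contiguous run, and within that run the values already arrive in ascending secondary order, which is exactly the target output.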
IV. Packaging and test results
Package the jar, upload the test data to HDFS, and run:
bin/yarn jar datas/sort.jar /user/beifeng/wordcount/input/ /user/beifeng/wordcount/output
HDFS now contains one output file per reduce task (part-r-00000 and part-r-00001). View them with:
bin/hdfs dfs -text /user/beifeng/wordcount/output/part*
The results:
zhangsan 3
zhangsan 34
zhangsan 34
zhangsan 45
lisi 4
lisi 7
lisi 7
lisi 72
lisi 77
lisi 89
wangwu 11
wangwu 12
wangwu 66