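Preface

Although MapReduce programs are rarely written by hand in day-to-day development any more, MapReduce is one of the three core components of Hadoop, and many of its ideas laid the groundwork for later frameworks. In this post I (Nanguo) review sorting in MapReduce. Plenty of material on this topic already exists online, and I did not originally plan to write about it. I recently found time to read most of Hadoop: The Definitive Guide, though, so I am writing this post as an interview-oriented review.
Without further ado, here is the good stuff~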

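Sorting

By default, MapReduce sorts the data set by record key (more precisely, by the map output key).

Sometimes, however, the application calls for more sophisticated sorting.
Examples are total sort and auxiliary sort (also called secondary sort).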

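Total sort

There are two ways to make MapReduce produce a globally sorted file:

The simplest way is to use a single partition. This is acceptable for small files, but it is extremely inefficient for large ones: all the data is sent to one reducer for sorting, so the cluster's compute resources are not used in parallel, and with a large data volume an OOM error is quite likely.
The second way is to first create a series of sorted files, then concatenate them, which yields one globally sorted file. The main idea is to use a partitioner that imposes the global order on the output: every key in partition i must sort before every key in partition i+1. The crux of this approach is therefore the partitioning method. By default, records are partitioned by hash value (the default partition function is HashPartitioner, which computes the hashCode of the map output key and takes it modulo the number of reducers, so keys with the same remainder go to the same reducer). Hash partitioning ignores key order, so you can instead supply your own partitioner: define a class that extends Partitioner and override its getPartition method.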
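The default hash partitioning described above can be sketched in plain Java, without any Hadoop dependency. The class name HashPartitionSketch and the sample keys are assumptions made for illustration. The formula clears the sign bit before taking the remainder, which is the approach HashPartitioner takes so that a negative hashCode can never produce a negative partition number.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of hash partitioning (hypothetical helper, not a Hadoop class). */
final class HashPartitionSketch {

    /** Clears the sign bit so the result is never negative, then takes the remainder. */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "huawei", "xiaomi"};
        Map<Integer, String> byPartition = new TreeMap<>();
        for (String k : keys) {
            byPartition.merge(partitionFor(k, 3), k, (a, b) -> a + "," + b);
        }
        // Hash placement ignores key order, so concatenating the reducer
        // outputs in partition order is generally not globally sorted.
        byPartition.forEach((p, ks) -> System.out.println(p + " -> " + ks));
    }
}
```

Each reducer's output is still sorted internally; what is missing for a total sort is an ordering between partitions, which is what a range-based partitioner provides.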
Here are two simple examples of custom partitioners:

// Custom partitioner: route each brand to a fixed partition.
// It returns 0..3, so the job must run with at least 4 reduce tasks.
public static class Partition extends Partitioner<Text, LongWritable> {

    @Override
    public int getPartition(Text key, LongWritable value, int num) {
        String k = key.toString();
        if (k.equals("apple")) {
            return 0;
        }
        if (k.equals("xiaomi")) {
            return 1;
        }
        if (k.equals("huawei")) {
            return 2;
        }
        return 3; // every other key
    }
}

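import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Range partitioner; assumes keys start with lowercase letters a-z.
class GlobalSortPartitioner extends Partitioner<Text, LongWritable> implements Configurable {
    private Configuration configuration = null;
    private int indexRange = 0;

    @Override
    public int getPartition(Text text, LongWritable longWritable, int numPartitions) {
        char[] chars = text.toString().toCharArray();
        int index = 0;
        if (indexRange == 26 && chars.length > 0) {
            // A range of 26 means the index depends only on the first letter.
            index = chars[0] - 'a';
        } else if (indexRange == 26 * 26 && chars.length > 0) {
            // Here the index is derived from the first two letters.
            index = (chars[0] - 'a') * 26;
            if (chars.length > 1) {
                index += chars[1] - 'a';
            }
        }
        // Clamp so that keys outside a-z cannot yield an illegal partition number.
        index = Math.max(0, Math.min(index, indexRange - 1));

        if (indexRange < numPartitions) {
            // More partitions than index values: give each index its own partition.
            return index;
        }

        // Each partition covers perReducerCount consecutive index values. The leftover
        // indices at the top all go to the last partition, which is simple but not
        // very scientific, because that partition can end up overloaded.
        int perReducerCount = indexRange / numPartitions;
        return Math.min(index / perReducerCount, numPartitions - 1);
    }

    @Override
    public void setConf(Configuration conf) {
        this.configuration = conf;
        indexRange = configuration.getInt("key.indexRange", 26 * 26);
    }

    @Override
    public Configuration getConf() {
        return configuration;
    }
}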

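Total sort with TotalOrderPartitioner
Hadoop also ships with a built-in partitioner class called TotalOrderPartitioner, which solves the total sort problem. What it does is very similar to the second partitioner above (GlobalSortPartitioner): it sends each key to the partition determined by a set of key split points. For example, the demo below contains:
// Set the partition file; TotalOrderPartitioner requires one
Path partitionFile = new Path("_partitions");
TotalOrderPartitioner.setPartitionFile(conf, partitionFile);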

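import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSort {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // User name used to access HDFS
        System.setProperty("HADOOP_USER_NAME", "root");

        Configuration conf = new Configuration();
        conf.set("mapred.jar", "D:\\MyDemo\\MapReduce\\Sort\\out\\artifacts\\TotalSort\\TotalSort.jar");

        FileSystem fs = FileSystem.get(conf);

        /* RandomSampler parameters
         * @param freq Probability with which a key will be chosen.
         * @param numSamples Total number of samples to obtain from all selected splits.
         * @param maxSplitsSampled The maximum number of splits to examine.
         */
        InputSampler.RandomSampler<Text, Text> sampler = new InputSampler.RandomSampler<>(0.1, 10, 10);

        // Set the partition file; TotalOrderPartitioner requires one.
        // This must happen before Job.getInstance(conf), which copies the configuration.
        Path partitionFile = new Path("_partitions");
        TotalOrderPartitioner.setPartitionFile(conf, partitionFile);

        Job job = Job.getInstance(conf);
        job.setJarByClass(TotalSort.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class); // key and value are split on \t by default
        job.setMapperClass(Mapper.class);   // identity mapper
        job.setReducerClass(Reducer.class); // identity reducer
        job.setNumReduceTasks(4); // the partition file splits the key space into one range per reduce task

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setPartitionerClass(TotalOrderPartitioner.class);

        FileInputFormat.addInputPath(job, new Path("/test/sort"));

        Path path = new Path("/test/wc/output");

        // Delete the output directory if it already exists
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        // Sample the input and write the sampled split points to the partition file
        InputSampler.writePartitionFile(job, sampler);

        boolean b = job.waitForCompletion(true);
        if (b) {
            System.out.println("OK");
        }
    }
}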
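To make the role of the partition file more concrete, the lookup from key to partition can be sketched in plain Java. This is a simplified illustration of the idea, not Hadoop's actual implementation; the class name SplitPointSketch and the split points "g", "n", "t" are assumptions. With 4 reduce tasks, as in the demo above, the partition file holds 3 split points, and partition i receives the keys that fall between split point i-1 (inclusive) and split point i (exclusive).

```java
import java.util.Arrays;

/** Plain-Java sketch of a split-point lookup (hypothetical helper, not a Hadoop class). */
final class SplitPointSketch {

    /**
     * Returns the partition for a key given sorted split points.
     * n split points define n + 1 partitions.
     */
    static int partitionFor(String key, String[] sortedSplitPoints) {
        int pos = Arrays.binarySearch(sortedSplitPoints, key);
        // Found at index i: the key equals split point i and belongs to partition i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the insertion
        // point is exactly the partition number.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "n", "t"}; // assumed sample split points
        for (String key : new String[] {"apple", "g", "hello", "table", "zebra"}) {
            System.out.println(key + " -> partition " + partitionFor(key, splitPoints));
        }
    }
}
```

Because the split points come from sampling the input, the resulting ranges tend to hold similar numbers of records, which is the advantage over fixed ranges such as the first-letter scheme shown earlier.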

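Auxiliary sort (secondary sort)

Secondary sort is a very frequent interview question for Hadoop, and for MapReduce in particular. When the data has two dimensions, we need to sort by value as well as by key.

How secondary sort works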

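Start of the map phase
In the map phase, the InputFormat set via job.setInputFormatClass() divides the input data set into small chunks (splits) and also provides a RecordReader implementation. Here we use TextInputFormat: the key is the byte offset of the line within the file, and the value is the text of that line. This is why the input of the custom Mapper is <LongWritable,Text>. The map method of the custom Mapper is then called with these <LongWritable,Text> key-value pairs one at a time.
Note: many sources say that the line number is the key, which is inaccurate. As Hadoop: The Definitive Guide points out, the key is the byte offset of the start of the line within the file.
End of the map phase
At the end of the map phase, the partitioner class set via job.setPartitionerClass() first partitions the Mapper's output, and each partition maps to one Reducer. Within each partition, records are then sorted with the key comparator class set via job.setSortComparatorClass(). As you can see, this is already a two-level sort. If no key comparator class has been set via job.setSortComparatorClass(), the compareTo() method implemented by the key is used.
Reduce phase
In the reduce phase, once the Reducer has received all the map outputs assigned to it, it again uses the key comparator class set via job.setSortComparatorClass() to sort all the data. It then starts building the value iterator for each key. This is where grouping comes in, using the grouping comparator class set via job.setGroupingComparatorClass(). Whenever this comparator considers two keys equal, they belong to the same group and their values go into the same value iterator; the key for that iterator is the first key of the group. Finally the Reducer's reduce() method is called, once per group, with the key and its value iterator. Again, note that the input and output types must match those declared in the custom Reducer.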
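The sort-then-group flow above can be sketched in plain Java on a handful of (left, right) integer pairs. This is a minimal simulation under stated assumptions, not Hadoop code: the class name SecondarySortSketch is hypothetical, the full comparison of both fields stands in for the sort comparator, and grouping on the first field alone stands in for the grouping comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of secondary sort (hypothetical helper, not a Hadoop class). */
final class SecondarySortSketch {

    /** Sorts pairs by left, then right, and groups the right values by left. */
    static Map<Integer, List<Integer>> sortAndGroup(List<int[]> pairs) {
        List<int[]> sorted = new ArrayList<>(pairs);
        // Stand-in for the sort comparator: compares the whole composite key.
        sorted.sort(Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]));
        // Stand-in for the grouping comparator: only the first field decides the group,
        // so each group's values arrive already ordered by the second field.
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] p : sorted) {
            groups.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> input = Arrays.asList(
                new int[] {3, 2}, new int[] {1, 9}, new int[] {3, 1}, new int[] {1, 4});
        System.out.println(sortAndGroup(input)); // prints {1=[4, 9], 3=[1, 2]}
    }
}
```

Each map entry corresponds to one reduce() call: the key is the first field, and the list plays the role of the value iterator.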

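Implementing secondary sort:

Define a custom key. Suppose the data set consists of two-dimensional integer data. Here we build an IntPair type to represent the composite key.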

package com.xjh.sort_twice;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/**
 * Custom composite key.
 * In MR every key must be comparable and sortable, and ordering happens in two steps:
 * records are first routed by the partitioner and then sorted by key within each partition.
 * This example also needs two levels of ordering: sort by the first field, then sort the
 * records that share the same first field by the second field.
 * So we build a composite class IntPair with two fields: the partitioner routes on the
 * first field, and the in-partition comparison orders by the first field, then the second.
 * @author xjh
 */
// Custom IntPair class, implementing WritableComparable
public class IntPair implements WritableComparable<IntPair> {
    int left;
    int right;

    public void set(int left, int right) {
        this.left = left;
        this.right = right;
    }

    public int getLeft() {
        return left;
    }

    public int getRight() {
        return right;
    }

    // Deserialization: read the binary form from the stream into this IntPair
    @Override
    public void readFields(DataInput in) throws IOException {
        this.left = in.readInt();
        this.right = in.readInt();
    }

    // Serialization: write this IntPair out in binary form
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(left);
        out.writeInt(right);
    }

    /*
     * Why override equals?
     * Object.equals compares references by default: two objects are equal only if they
     * are the same object in memory.
     * To decide equality from the values held inside the object, override equals.
     */
    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj instanceof IntPair) { // instanceof is false for null
            IntPair r = (IntPair) obj;
            return r.left == left && r.right == right;
        }
        return false;
    }

    /*
     * Why must hashCode be overridden together with equals?
     * The contract is: if two objects are equal according to equals, calling hashCode on
     * them must return the same integer. The reverse is not required; unequal objects
     * may share a hash code. Overriding hashCode keeps it consistent with equals.
     */
    @Override
    public int hashCode() {
        return left * 157 + right;
    }

    // Key comparison: by left first, then by right
    @Override
    public int compareTo(IntPair o) {
        if (left != o.left) {
            return Integer.compare(left, o.left);
        }
        return Integer.compare(right, o.right);
    }
}

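Define a custom partitioner. The custom partitioner class MyPartitioner looks only at the first field of the composite key, so all records that share the same first field are sent to the same reducer.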

public static class MyPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value, int numOfReduce) {
        // Mask off the sign bit rather than calling Math.abs: the multiplication can
        // overflow, and Math.abs(Integer.MIN_VALUE) is still negative.
        return ((key.getLeft() * 127) & Integer.MAX_VALUE) % numOfReduce;
    }
}

Then register it on the Job in main(): job.setPartitionerClass(MyPartitioner.class);

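Define a custom SortComparator that sorts IntPair by its first and second fields (left and right). Here the IntPair class already implements compareTo(), so no separate comparator class is needed.
Define a custom GroupingComparator class that groups the data within a partition, and register it with job.setGroupingComparatorClass().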

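/**
 * Grouping compares only the original key (the first field), not the whole composite key.
 */
public static class MyGroupParator implements RawComparator<IntPair> {

    @Override
    public int compare(IntPair o1, IntPair o2) {
        // Compare the left field of both keys
        return Integer.compare(o1.getLeft(), o2.getLeft());
    }

    // Compares the serialized keys byte by byte over their first four bytes, which hold
    // the left field; the first differing byte decides the result. s1/s2 are the start
    // offsets and l1/l2 the lengths of the serialized keys. Grouping only needs to know
    // whether two keys are equal, so an unsigned byte comparison is sufficient.
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return WritableComparator.compareBytes(b1, s1, Integer.BYTES, b2, s2, Integer.BYTES);
    }
}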
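Why comparing only the first four bytes is enough for grouping can be checked with a small plain-Java sketch. It assumes the same layout that IntPair.write produces (left as a 4-byte big-endian int, followed by right); the class name RawGroupingSketch and its helper methods are hypothetical and use only java.io, not Hadoop.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Plain-Java sketch of grouping on serialized bytes (hypothetical helper). */
final class RawGroupingSketch {

    /** Serializes a pair in the same field order as IntPair.write: left, then right. */
    static byte[] serialize(int left, int right) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(left);
            out.writeInt(right);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return bytes.toByteArray();
    }

    /** Two serialized keys are in the same group when their first four bytes match. */
    static boolean sameGroup(byte[] a, byte[] b) {
        return Arrays.equals(Arrays.copyOf(a, Integer.BYTES), Arrays.copyOf(b, Integer.BYTES));
    }

    public static void main(String[] args) {
        System.out.println(sameGroup(serialize(7, 1), serialize(7, 99))); // prints true
        System.out.println(sameGroup(serialize(7, 1), serialize(8, 1)));  // prints false
    }
}
```

Working on the raw bytes avoids deserializing every key into an IntPair object just to decide whether two adjacent records belong to the same group.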


提燈尋夢在南國
