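Preface

MapReduce programs are rarely written by hand in day-to-day development anymore, but as one of the three core components of Hadoop, MapReduce carries many ideas that later frameworks build on. In this post I (南国) review how sorting works in MR. Plenty has already been written online about this topic, and I had not planned to add to it. Recently, though, I finally found time to read most of Hadoop: The Definitive Guide, so I am writing this post from an interview-review angle.
Enough talk, here is the substance.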

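Sorting

By default, MapReduce sorts the dataset by the keys of the input records.

Sometimes, however, the application calls for a more elaborate sort.
For example: a total sort, or an auxiliary sort (also called a secondary sort).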

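Total sort

To make MapReduce produce globally sorted output:

The simplest approach is to use a single partition. This is acceptable for small files, but very inefficient for large ones: all the data is sent to one Reduce task to be sorted, which wastes the cluster's parallelism and, with large volumes, is likely to run into OOM problems.
The second approach is to first produce a series of sorted files, then concatenate them to obtain one globally sorted file. The idea is to use a partitioner that respects the global order of the output, so the crux is the partitioning scheme. By default, partitioning is done by hash value: the default partitioner is HashPartitioner, which computes the hashCode of the map output key and takes it modulo the number of Reduce tasks, so keys with the same remainder go to the same Reduce task. Hash partitioning gives no ordering between partitions, so for a total sort you supply your own partitioner instead (define a class that extends Partitioner and override its getPartition method).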
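To make the default behaviour concrete, here is a minimal plain-Java sketch of the arithmetic described above. It is not a Hadoop class: HashPartitionSketch and its method names are made up for illustration, and it hashes a java.lang.String, whereas a real job would hash the Writable key (for example Text, whose hashCode differs from String's).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of hash partitioning; hypothetical names, not Hadoop API.
class HashPartitionSketch {

    // hashCode modulo the number of reduce tasks. The sign bit is cleared first
    // so that a negative hashCode cannot produce a negative partition number.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys by the partition they would be sent to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            int p = partitionFor(key, numReduceTasks);
            byPartition.computeIfAbsent(p, unused -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "xiaomi", "huawei", "apple", "oppo");
        // Equal keys always share a partition, but neighbouring keys in sort
        // order may land anywhere, so the partitions are not globally ordered.
        System.out.println(assign(keys, 3));
    }
}
```

The point to take away is that equal keys always meet in the same partition, while nothing relates the key ranges of different partitions to each other.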
Here is a simple example of a custom partitioner:

// Custom partitioner: routes each known key to a fixed partition.
// Returning 3 for all other keys means the job needs at least 4 reduce tasks.
public static class Partition extends Partitioner<Text, LongWritable> {

	@Override
	public int getPartition(Text key, LongWritable value, int num) {
		if (key.toString().equals("apple")) {
			return 0;
		}
		if (key.toString().equals("xiaomi")) {
			return 1;
		}
		if (key.toString().equals("huawei")) {
			return 2;
		}
		return 3;
	}
}

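import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

class GlobalSortPartitioner extends Partitioner<Text, LongWritable> implements Configurable {
    private Configuration configuration = null;
    private int indexRange = 0;

    @Override
    public int getPartition(Text text, LongWritable longWritable, int numPartitions) {
        // Assumes keys are non-empty and made of lowercase letters a-z.
        // With a range of 26, the index is derived from the first letter only.
        int index = 0;
        if (indexRange == 26) {
            index = text.toString().toCharArray()[0] - 'a';
        } else if (indexRange == 26 * 26) {
            // Here the index is derived from the first two letters.
            char[] chars = text.toString().toCharArray();
            if (chars.length == 1) {
                index = (chars[0] - 'a') * 26;
            } else {
                index = (chars[0] - 'a') * 26 + (chars[1] - 'a');
            }
        }
        if (indexRange < numPartitions) {
            // More partitions than index values: give each index its own partition.
            // A valid partition number must lie in [0, numPartitions).
            return index;
        }
        int perReducerCount = indexRange / numPartitions;

        for (int i = 0; i < numPartitions; i++) {
            int min = i * perReducerCount;
            int max = (i + 1) * perReducerCount - 1;
            if (index >= min && index <= max) {
                return i;
            }
        }
        // Indexes left over by the integer division all go to the last
        // partition. This is crude and can overload that partition.
        return numPartitions - 1;
    }

    @Override
    public void setConf(Configuration conf) {
        this.configuration = conf;
        indexRange = configuration.getInt("key.indexRange", 26 * 26);
    }

    @Override
    public Configuration getConf() {
        return configuration;
    }
}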
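Why does a range-based partitioner give a total order? The following plain-Java sketch simulates the whole pipeline in memory: route each word to a contiguous first-letter range, sort each partition independently, then concatenate the partitions in order. RangePartitionSketch is a hypothetical name, the bucket formula is a simplification chosen for this sketch rather than the formula used above, and keys are assumed to start with a lowercase letter a-z.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// In-memory simulation of range partitioning; hypothetical names, not Hadoop API.
class RangePartitionSketch {

    // Maps the first letter (0..25) to one of numPartitions contiguous ranges.
    // The mapping never decreases as the letter increases, which is what
    // keeps the partitions ordered relative to each other.
    static int partitionFor(String word, int numPartitions) {
        int index = word.charAt(0) - 'a';
        return index * numPartitions / 26;
    }

    // Partition, sort each partition on its own, then concatenate in partition order.
    static List<String> totalSort(List<String> words, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String word : words) {
            partitions.get(partitionFor(word, numPartitions)).add(word);
        }
        List<String> result = new ArrayList<>();
        for (List<String> partition : partitions) {
            Collections.sort(partition); // what each reducer does locally
            result.addAll(partition);    // concatenating part files in order
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "apple", "zebra", "mango", "banana", "kiwi");
        System.out.println(totalSort(words, 3));
    }
}
```

Because every key in partition i sorts before every key in partition i+1, no merge step is needed at the end. The weakness is skew: if most words start with the same few letters, one partition does most of the work.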

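Total sort with TotalOrderPartitioner
Hadoop also ships with a partitioner called TotalOrderPartitioner, which addresses the total-sort problem. What it does is very similar to the second partitioner shown above: it sends each key to a partition according to a set of key boundary points. For example, the demo below contains:
// Set the partition file; TotalOrderPartitioner requires one
Path partitionFile = new Path("_partitions");
TotalOrderPartitioner.setPartitionFile(conf, partitionFile);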

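import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSort {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // User name used to access HDFS
        System.setProperty("HADOOP_USER_NAME","root");

        Configuration conf = new Configuration();
        conf.set("mapred.jar", "D:\\MyDemo\\MapReduce\\Sort\\out\\artifacts\\TotalSort\\TotalSort.jar");

        FileSystem fs = FileSystem.get(conf);

        /* RandomSampler parameters
        * @param freq Probability with which a key will be chosen.
        * @param numSamples Total number of samples to obtain from all selected splits.
        * @param maxSplitsSampled The maximum number of splits to examine.
        */
        InputSampler.RandomSampler<Text, Text> sampler = new InputSampler.RandomSampler<>(0.1, 10, 10);

        // Set the partition file; TotalOrderPartitioner requires one
        Path partitionFile = new Path("_partitions");
        TotalOrderPartitioner.setPartitionFile(conf, partitionFile);

        Job job = Job.getInstance(conf);
        job.setJarByClass(TotalSort.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class); // key and value are separated by \t by default
        job.setMapperClass(Mapper.class);   // identity mapper
        job.setReducerClass(Reducer.class); // identity reducer
        job.setNumReduceTasks(4);  // number of reduce tasks; the partition file splits the key space into this many ranges

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setPartitionerClass(TotalOrderPartitioner.class);

        FileInputFormat.addInputPath(job, new Path("/test/sort"));

        Path path = new Path("/test/wc/output");

        if(fs.exists(path)) // delete the output directory if it already exists
        {
            fs.delete(path,true);
        }
        FileOutputFormat.setOutputPath(job, path);

        // Sample the input and write the sampled boundary keys to the partition file
        InputSampler.writePartitionFile(job, sampler);

        boolean b = job.waitForCompletion(true);
        if(b)
        {
            System.out.println("OK");
        }

    }
}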
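Conceptually, the sampler and the partition file boil down to two steps: pick numPartitions - 1 boundary keys from a sorted sample, then look up each key's partition by searching those boundaries. The plain-Java sketch below illustrates that idea only. SplitPointSketch and its methods are hypothetical names, and the way boundaries are picked here is a simplification, not a copy of what InputSampler does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration of boundary-key partitioning; hypothetical names, not Hadoop API.
class SplitPointSketch {

    // Picks numPartitions - 1 evenly spaced keys from the sorted sample.
    static List<String> chooseSplitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            splits.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0; keys in [splits[i-1], splits[i])
    // go to partition i; keys at or above the last split go to the last partition.
    static int partitionFor(String key, List<String> splits) {
        int pos = Collections.binarySearch(splits, key);
        // Found: the key equals splits[pos], so it belongs to partition pos + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("d", "h", "m", "r", "w", "a", "k", "t");
        List<String> splits = chooseSplitPoints(sample, 4);
        System.out.println(splits);
        System.out.println(partitionFor("kiwi", splits));
    }
}
```

Compared with a hard-coded alphabet range, boundaries taken from a sample follow the real key distribution, so the partitions tend to be more even in size, provided the sample is representative.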

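Auxiliary sort (secondary sort)

Secondary sort is a frequent interview question for Hadoop, and for MapReduce in particular. When the data itself has two dimensions, we need to sort by value as well as by key.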

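How secondary sort works

Start of the map phase
At the start of the map phase, the InputFormat set via job.setInputFormatClass() divides the input dataset into splits and provides a RecordReader implementation. Here we use TextInputFormat, for which the key is the byte offset of the line within the file and the value is the text of that line. This is why the input of the custom Mapper is <LongWritable,Text>. Each <LongWritable,Text> pair is then passed to the map method of the custom Mapper.
Note: many write-ups describe the key as the line number, which is inaccurate. As Hadoop: The Definitive Guide explains, the key is the byte offset of the start of the line within the file, not the line number.
End of the map phase
At the end of the map phase, the partitioner set via job.setPartitionerClass() first partitions the Mapper's output, with each partition mapping to one Reducer. Within each partition, the records are then sorted with the key comparator set via job.setSortComparatorClass(). As you can see, this is already a two-level ordering in itself. If no key comparator was set via job.setSortComparatorClass(), the compareTo() method implemented by the key is used.
Reduce phase
In the reduce phase, once the Reducer has received all the map output assigned to it, it again uses the key comparator set via job.setSortComparatorClass() to sort all of that data. It then builds one value iterator per key. This is where grouping comes in, using the grouping comparator set via job.setGroupingComparatorClass(). Keys that this comparator considers equal belong to the same group: their values are placed in a single value iterator, and the key paired with that iterator is the first key of the group. Finally the Reducer's reduce() method is called, once per group, with the key and its value iterator as input. As before, the input and output types must match those declared in the custom Reducer.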
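The interplay between the sort comparator and the grouping comparator can be simulated in a few lines of plain Java. In this sketch a composite key is an int[] of (left, right); the sort comparator orders by both fields, the grouping comparator looks at left only, and each group corresponds to one reduce() call. SecondarySortSketch is a hypothetical name and nothing here touches the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory simulation of secondary sort; hypothetical names, not Hadoop API.
class SecondarySortSketch {

    // Returns, for each distinct left field, the right fields in the order
    // a reducer would iterate over them.
    static Map<Integer, List<Integer>> run(int[][] records) {
        // "Sort comparator": order by left, then by right.
        Comparator<int[]> sortComparator = Comparator
                .<int[]>comparingInt(k -> k[0])
                .thenComparingInt(k -> k[1]);
        // "Grouping comparator": keys with the same left belong to one group.
        Comparator<int[]> groupingComparator = Comparator.comparingInt(k -> k[0]);

        int[][] sorted = records.clone();
        Arrays.sort(sorted, sortComparator);

        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        int[] groupKey = null;
        for (int[] key : sorted) {
            if (groupKey == null || groupingComparator.compare(groupKey, key) != 0) {
                groupKey = key; // first key of a new group: one reduce() call starts here
                groups.put(groupKey[0], new ArrayList<>());
            }
            groups.get(groupKey[0]).add(key[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        int[][] records = { {2, 9}, {1, 5}, {2, 3}, {1, 1}, {2, 7} };
        System.out.println(run(records));
    }
}
```

The values inside each group come out already ordered because the ordering was done on the composite key before grouping; the reducer itself never sorts anything.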

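Implementing secondary sort:

Define a custom key. Assume the dataset is two-dimensional integer data. Here we build an IntPair type to represent the composite key.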

package com.xjh.sort_twice;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
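/**
 * Custom composite key.
 * In MR every key is compared and ordered in two stages: it is first routed by the partitioner, then sorted within its partition. This example also compares twice:
 * records are ordered by the first field, and records with the same first field are then ordered by the second field.
 * We therefore build a composite class IntPair with two fields: the partitioner routes on the first field, and the comparison within a partition orders by the first field and then the second.
 * @author xjh
 *
 */
// Custom IntPair class implementing WritableComparable
public class IntPair implements WritableComparable<IntPair>{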
	int left;
	int right;
	
	public void set(int left, int right) {
		// TODO Auto-generated method stub
		this.left = left;
		this.right = right;
	}
	public int getLeft() {
		return left;
	}

	public int getRight() {
		return right;
	}
	
	// Deserialization: read the binary form from the stream into this IntPair
	@Override
	public void readFields(DataInput in) throws IOException {
		this.left = in.readInt();
		this.right = in.readInt();
	}
	// Serialization: write this IntPair out in binary form
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(left);
		out.writeInt(right);
	}
	
	/*
	 * Why override equals?
	 * Object.equals compares references by default: two variables are equal only if they point to the same object.
	 * To decide equality from the values held in the object, equals has to be overridden.
	 */
	@Override
	public boolean equals(Object obj) {
		if(obj == null)
			return false;
		if(this == obj)
			return true;
		if (obj instanceof IntPair){
			IntPair r = (IntPair) obj;
			return r.left == left && r.right==right;
		}
		else{
			return false;
		}
			
	}
	
	/*
	 * Why must hashCode be overridden together with equals?
	 * By contract, two objects that are equal according to equals must return the same hashCode.
	 * Unequal objects are allowed to share a hashCode, although distinct values are better for hashing.
	 * The default Object.hashCode is identity-based, so without this override two IntPairs with the
	 * same fields could hash differently and, for example, be sent to different reducers by HashPartitioner.
	 */
	@Override
	public int hashCode() {
		return left*157 +right;
	}
	
	// Key comparison: order by left, then by right
	@Override
	public int compareTo(IntPair o) {
		if(left != o.left)
			return left<o.left? -1:1;
		else if (right != o.right)
			return right<o.right? -1:1;
		else
			return 0;
	}	
}

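Define a custom partitioner. The partitioner class MyPartitioner looks only at the first field of the composite key, so all keys with the same first field are sent to the same reducer.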

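public static class MyPartitioner extends Partitioner<IntPair, IntWritable>{
		@Override
		public int getPartition(IntPair key, IntWritable value, int numOfReduce) {
			// Partition on the first field only. Masking the sign bit keeps the result in
			// [0, numOfReduce) even when the multiplication overflows to a negative value.
			return (key.getLeft()*127 & Integer.MAX_VALUE) % numOfReduce;
		}
	}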

Then register it on the Job in main(): job.setPartitionerClass(MyPartitioner.class);
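One subtlety when deriving a partition number from an int: the result must lie in [0, numOfReduce), and Math.abs alone does not guarantee that, because Math.abs(Integer.MIN_VALUE) is still negative. The small sketch below contrasts the two idioms; NonNegativeIndexSketch and its method names are made up for illustration.

```java
// Contrasts two ways of turning an int into a partition number; hypothetical names.
class NonNegativeIndexSketch {

    // Math.abs(Integer.MIN_VALUE) overflows and stays negative,
    // so this can return a negative partition number.
    static int withAbs(int hash, int numPartitions) {
        return Math.abs(hash) % numPartitions;
    }

    // Clearing the sign bit always yields a value in [0, numPartitions).
    static int withMask(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(withAbs(Integer.MIN_VALUE, 3));
        System.out.println(withMask(Integer.MIN_VALUE, 3));
    }
}
```

A multiplication such as left*127 can overflow into exactly this corner case, so the masking form is the safer habit inside getPartition.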

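Define a SortComparator that orders IntPair by its left and then its right field. Here the IntPair class already implements compareTo(), which is what gets used when no sort comparator is set.
Define a GroupingComparator class to group the data within a partition.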

/**
	 * When grouping, compare only the original key (the left field), not the whole composite key.
	 */
	public static class MyGroupParator implements RawComparator<IntPair>{

		@Override
		public int compare(IntPair o1 , IntPair o2) {
			// Both sides must use the left field
			int l = o1.getLeft();
			int r = o2.getLeft();
			return l == r ? 0:(l<r ?-1:1);
		}
		// Raw comparison on the serialized keys (s = start offset, l = length). Only the first
		// Integer.SIZE/8 bytes, i.e. the left field, are compared byte by byte; the first
		// differing byte decides the result.
		@Override
		public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return WritableComparator.compareBytes(b1, s1, Integer.SIZE/8, b2, s2, Integer.SIZE/8);
		}
		
	}
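Why is it enough for the raw comparison to look at the first four bytes? IntPair.write() emits left and then right with writeInt, so a serialized key is eight bytes with left in the first four. The plain-Java sketch below reproduces that layout with java.io only; LeftPrefixSketch and its methods are hypothetical names, not part of Hadoop.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Shows the serialized layout of an (left, right) pair; hypothetical names.
class LeftPrefixSketch {

    // Same layout as IntPair.write(): two big-endian 4-byte ints, left first.
    static byte[] serialize(int left, int right) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(left);
            out.writeInt(right);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return bytes.toByteArray();
    }

    // Two keys are in the same group when their first Integer.BYTES bytes match.
    static boolean sameGroup(byte[] a, byte[] b) {
        for (int i = 0; i < Integer.BYTES; i++) {
            if (a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(serialize(7, 1).length);
        System.out.println(sameGroup(serialize(7, 1), serialize(7, 99)));
        System.out.println(sameGroup(serialize(7, 1), serialize(8, 1)));
    }
}
```

One caveat worth keeping in mind: a byte-wise comparison treats bytes as unsigned, so it does not order negative ints the way compareTo does. For grouping that is harmless, because only equality of adjacent keys matters, but a raw sort comparator would need to decode the ints instead.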

提灯寻梦在南国
