MapReduce Secondary Sort

Requirement:

Suppose we have a pile of data like this:

22      12
22      13
22      6
22      17
21      5
28      79
28      63
28      100
1       79
23      84
1       63
67      45
18      23
19      74
1       100
21      41
57      21
23      79
12      13
22      12
22      13
.......

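The requirement is to group records with the same key together, output the groups in descending key order, and within each key sort the values in ascending order. The output should look like this: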

100:1 1 1 28 28 28 
84:23 23 23 
79:1 1 1 23 23 23 28 28 28 
74:19 19 19 
67:23 23 45 45 45 79 
63:1 1 1 28 28 28 
57:21 21 21 22 22 
45:67 67 67 
41:21 21 21 
28:18 18 19 19 63 63 63 67 67 79 79 79 100 100 100 
23:18 18 18 21 21 21 21 41 79 79 79 84 84 84 
22:1 1 6 6 6 12 12 12 13 13 13 17 17 17 23 23 28 28 28 28 
21:1 1 5 5 5 22 22 41 41 41 57 57 57 
19:22 22 74 74 74 
18:12 12 13 23 23 23 
.........

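How can this simple requirement be implemented with MapReduce?

Approach 1

Sort in memory. In the map phase, emit the key and value as they are. The reduce side fetches the data and merges the values that share a key, so each reduce call receives <key, Iterable>. Inside the reduce method, copy all the values into a sortable collection, sort it, and write it out. This approach is simple and easy to understand, but as the data volume grows it risks running out of memory, so it is not recommended.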
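
To make the memory concern concrete, here is a minimal plain-Java sketch of Approach 1 that runs without Hadoop. The class and method names are made up for illustration. A TreeMap stands in for the shuffle's grouping by key, and the per-key sort corresponds to what the reduce method would do with its values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for Approach 1; not part of the original job.
final class InMemorySortSketch {

    /** Groups values by key (keys descending) and sorts each group's values ascending. */
    static Map<Long, List<Long>> groupAndSort(long[][] pairs) {
        // The reversed TreeMap plays the role of the shuffle: one entry per key, keys descending.
        Map<Long, List<Long>> grouped = new TreeMap<>(Comparator.reverseOrder());
        for (long[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        // This is the reduce-side step: every value of a key sits in memory while it is sorted.
        for (List<Long> values : grouped.values()) {
            Collections.sort(values);
        }
        return grouped;
    }

    public static void main(String[] args) {
        long[][] pairs = {{22, 12}, {22, 13}, {22, 6}, {22, 17}, {21, 5}, {28, 79}, {28, 63}, {28, 100}};
        // Prints 28:[63, 79, 100], then 22:[6, 12, 13, 17], then 21:[5]
        groupAndSort(pairs).forEach((k, v) -> System.out.println(k + ":" + v));
    }
}
```

The list built for each key grows with the number of values that key has, which is exactly where a large or skewed data set would exhaust the reducer's heap.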

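Approach 2

We know that the shuffle phase partitions and sorts the map output. We can take advantage of this and let the MapReduce framework do the sorting for us. The steps are:

1. Use both the key and the value from the file as the map output key, and the value from the file as the map output value. This requires a class to serve as the map output key, holding both the file's key and the file's value as fields. To avoid confusion, the file's key becomes the class's first field and the file's value becomes its second field. The class must implement the WritableComparable interface, and its compareTo method compares first and, only when first is equal, goes on to compare second.
2. With step 1 done, we still need a grouping operation, configured through job.setGroupingComparatorClass. It decides which of the sorted map output keys arriving at a reducer are treated as equal and are therefore handed to the same reduce call. The method takes a Class of type RawComparator. Hadoop already provides WritableComparator, which implements RawComparator, so we can write a class that extends WritableComparator (call it the grouping comparator class for now) and override its compare method. The trick is to compare only the first field of the key built in step 1: all keys with the same first go into one reduce call, while within that group the records stay in the order produced by the compareTo method from step 1, that is, sorted by second. This gives us what we need, namely a secondary sort. This part is a little hard to grasp; reading it alongside the code a few times helps. Something to think about: what would the result look like without this step? Comment out job.setGroupingComparatorClass and look at the output.
3. Because this is a distributed computation, guaranteeing a global order also requires work on partitioning (or setting the number of reducers to 1, which is not recommended). For the requirement above I used range partitioning: each partition is assigned a range of keys based on the key's size and the number of partitions. The ranges are ordered relative to each other, and steps 1 and 2 guarantee order within each partition, so the result can be treated as globally ordered.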
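
The interplay of the sort order and the grouping comparator can be tried out without a cluster. The following is a small plain-Java simulation under stated assumptions: long[] pairs stand in for the composite key, and the class, field, and method names are invented for this sketch. It also shows what the "something to think about" question leads to, since Hadoop falls back to the sort comparator for grouping when no grouping comparator is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of sort + grouping; it does not use any Hadoop classes.
final class SecondarySortSketch {

    /** Sort order of the composite key: first descending, then second ascending. */
    static final Comparator<long[]> SORT = Comparator
            .<long[]>comparingLong(k -> k[0]).reversed()
            .thenComparingLong(k -> k[1]);

    /** Grouping order: only first decides whether two keys share a reduce call. */
    static final Comparator<long[]> GROUP_BY_FIRST = Comparator.comparingLong(k -> k[0]);

    /** Sorts the keys, then starts a new output line whenever the grouping comparator sees a change. */
    static List<String> reduceAll(long[][] keys, Comparator<long[]> grouping) {
        long[][] sorted = keys.clone();
        Arrays.sort(sorted, SORT);
        List<String> lines = new ArrayList<>();
        StringBuilder line = null;
        long[] previous = null;
        for (long[] key : sorted) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                if (line != null) lines.add(line.toString().trim());
                line = new StringBuilder(key[0] + ":");
            }
            line.append(key[1]).append(' ');
            previous = key;
        }
        if (line != null) lines.add(line.toString().trim());
        return lines;
    }

    public static void main(String[] args) {
        long[][] keys = {{22, 12}, {22, 13}, {22, 6}, {21, 5}, {28, 79}, {28, 63}};
        // Grouping on first only: [28:63 79, 22:6 12 13, 21:5]
        System.out.println(reduceAll(keys, GROUP_BY_FIRST));
        // Grouping with the full sort order: [28:63, 28:79, 22:6, 22:12, 22:13, 21:5]
        System.out.println(reduceAll(keys, SORT));
    }
}
```

In the second call every distinct (first, second) pair forms its own group, so each value ends up on a line of its own instead of being collected under its key.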

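Code

The Key class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

import org.apache.hadoop.io.WritableComparable;

/** Composite map output key: first is the file's key column, second is its value column. */
class Key implements WritableComparable<Key> {
    private Long first;
    private Long second;

    /** Orders by first descending, then by second ascending. */
    @Override
    public int compareTo(Key o) {
        int res = o.first.compareTo(first); // operands swapped: descending on first
        if (res != 0) {
            return res;
        }
        return second.compareTo(o.second); // ascending on second
    }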

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(first);
        dataOutput.writeLong(second);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.first = dataInput.readLong();
        this.second = dataInput.readLong();
    }

    public Long getFirst() {
        return first;
    }

    public void setFirst(Long first) {
        this.first = first;
    }

    public Long getSecond() {
        return second;
    }

    public void setSecond(Long second) {
        this.second = second;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Key key = (Key) o;
        return Objects.equals(first, key.first) && Objects.equals(second,key.second);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first,second);
    }
}
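
The write and readFields pair is what lets the framework move a Key between map and reduce tasks. As a rough illustration that runs without Hadoop, the sketch below performs the same round trip with the JDK's DataOutputStream and DataInputStream; the class and method names are hypothetical and only mirror what Key does with its two fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper that mimics Key.write() and Key.readFields() with JDK streams only.
final class KeyRoundTripSketch {

    /** Mirrors Key.write(): first, then second, each as an 8-byte long. */
    static byte[] serialize(long first, long second) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(first);
            out.writeLong(second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Mirrors Key.readFields(): fields are read back in the order they were written. */
    static long[] deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long first = in.readLong();
            long second = in.readLong();
            return new long[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(22L, 13L);
        long[] pair = deserialize(data);
        // Prints: 16 bytes -> 22, 13
        System.out.println(data.length + " bytes -> " + pair[0] + ", " + pair[1]);
    }
}
```

Swapping the two reads in deserialize would silently exchange first and second, which is why write and readFields have to agree on field order.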

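Grouping comparator class:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/** Groups map output keys by first only, so all keys sharing a first reach one reduce call. */
class PairGroupComparator extends WritableComparator {

    public PairGroupComparator() {
        // true: let WritableComparator create Key instances to deserialize into before comparing
        super(Key.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        Key pa = (Key) a;
        Key pb = (Key) b;
        // Grouping only checks for equality, so the direction of this comparison does not matter.
        return pa.getFirst().compareTo(pb.getFirst());
    }
}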

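Partitioner:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

class PairSortPartitioner extends Partitioner<Key, LongWritable> {
    private static final int MAX = 100;

    /**
     * My keys all lie between 0 and 100, so the range is simply cut into as many
     * equal-width sub-ranges as there are partitions, and each key goes to the
     * partition whose sub-range contains it.
     * This has serious limitations:
     * 1. It is prone to severe hot spots when keys are unevenly distributed.
     * 2. It cannot adapt if the keys fall outside 0-100.
     * Note that lower keys land in lower-numbered partitions while keys inside a
     * partition sort in descending order, so the output files have to be read from
     * the last partition to the first to get one descending sequence.
     *
     * If you know a better algorithm, please let me know. Thanks.
     */
    @Override
    public int getPartition(Key key, LongWritable value, int numPartitions) {
        long first = key.getFirst();
        if (first < 0 || first > MAX) {
            throw new IllegalArgumentException("key is not between 0 and " + MAX);
        }
        int step = Math.max(1, MAX / numPartitions);
        // (0, step] -> 0, (step, 2 * step] -> 1, ...; key 0 joins the first range and any
        // remainder left by the integer division joins the last one.
        int partition = first == 0 ? 0 : (int) ((first - 1) / step);
        return Math.min(partition, numPartitions - 1);
    }
}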
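
One common way around both limitations is to derive the range boundaries from a sample of the keys instead of hard-coding 0-100. As far as I know, Hadoop's own TotalOrderPartitioner works along similar lines with split points prepared ahead of the job. The plain-Java sketch below only illustrates the idea: the class and method names are hypothetical, and how the sample is collected and shipped to the partitioner is left out.

```java
import java.util.Arrays;

// Hypothetical sketch of sample-based range partitioning; not wired into the job above.
final class SplitPointPartitionSketch {

    /** Picks numPartitions - 1 split points so each range holds about the same number of sampled keys. */
    static long[] splitPoints(long[] sample, int numPartitions) {
        long[] sorted = sample.clone();
        Arrays.sort(sorted);
        long[] splits = new long[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return splits;
    }

    /** Keys below splits[0] go to partition 0; keys in [splits[k-1], splits[k]) go to partition k. */
    static int partitionFor(long key, long[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // Found: the key equals a split point and opens the next range. Not found: use the insertion point.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        long[] sample = {10, 20, 30, 40, 50, 60, 70, 80};
        long[] splits = splitPoints(sample, 4);
        // Prints: [30, 50, 70]
        System.out.println(Arrays.toString(splits));
        // Prints: 0 1 2 3
        System.out.println(partitionFor(5, splits) + " " + partitionFor(30, splits) + " "
                + partitionFor(50, splits) + " " + partitionFor(100, splits));
    }
}
```

A single very frequent key still ends up in one partition, because all of its records have to reach the same reduce call. Since Key sorts first in descending order, a job that wants its output files to follow that order could return splits.length - partitionFor(key, splits) instead.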

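Mapper class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class PairSortMapper extends Mapper<LongWritable, Text, Key, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split on any run of whitespace so both tab- and space-separated lines parse.
        String[] pair = value.toString().trim().split("\\s+");
        if (pair.length < 2) {
            return; // skip blank or malformed lines
        }
        Key sortKey = new Key();
        sortKey.setFirst(Long.parseLong(pair[0]));
        long second = Long.parseLong(pair[1]);
        sortKey.setSecond(second);
        // second travels both inside the key (for sorting) and as the value (for output).
        context.write(sortKey, new LongWritable(second));
    }
}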

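Reducer class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class PairSortReducer extends Reducer<Key, LongWritable, NullWritable, Text> {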

    private Text out = new Text();

    @Override
    protected void reduce(Key key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

        StringBuilder sb = new StringBuilder();
        sb.append(key.getFirst()).append(":");
        for (LongWritable value : values) {
            sb.append(value.get()).append(" ");
        }
        String outline = sb.toString();
        out.set(outline);
        context.write(NullWritable.get(), out);
        System.err.println(outline);
    }
}

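Driver class:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PairSecondarySortDriver extends Configured implements Tool {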

    private final static Path input = new Path("/tmp/pair/in/*");
    private final static Path output = new Path("/tmp/pair/out");

    @Override
    public int run(String[] strings) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJarByClass(this.getClass());
        job.setJobName(this.getClass().getSimpleName());

        job.setMapperClass(PairSortMapper.class);
        job.setMapOutputKeyClass(Key.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setReducerClass(PairSortReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setNumReduceTasks(4);
        job.setPartitionerClass(PairSortPartitioner.class);
        job.setGroupingComparatorClass(PairGroupComparator.class);

        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, input);

        FileSystem fs = FileSystem.get(getConf());
        if (fs.exists(output)) {
            fs.delete(output, true);
        }

        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, output);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Pass args through so generic options such as -D key=value reach the job configuration.
        int run = ToolRunner.run(new PairSecondarySortDriver(), args);
        System.exit(run);
    }
}

The partitioning algorithm above is not a good one. If you know a better partitioning algorithm, feel free to @ me. Thanks.
