Custom Partitioning, Sorting, and Grouping in Hadoop MapReduce

Original

2019-07-30 16:03

Partitioning:
In an MR job, the default partitioner class is HashPartitioner.class.
Its source code is:

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public HashPartitioner() {
    }

    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & 2147483647) % numReduceTasks;
    }
}

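As the code shows, HashPartitioner.class computes the partition number by clearing the sign bit of the key's hash code (the bitwise AND with 2147483647, which is Integer.MAX_VALUE) and then taking the remainder modulo the number of reduce tasks set on the job. The result is a non-negative integer between 0 and numReduceTasks-1.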

It follows that HashPartitioner.class always sends identical keys to the same partition, while a single partition may contain several different keys.
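To make the arithmetic concrete, here is a minimal sketch that applies the same expression to raw hash codes without any Hadoop dependency. The class name HashPartitionSketch and the helper unmasked are made up for illustration and are not part of Hadoop.

```java
class HashPartitionSketch {
    // Same arithmetic as HashPartitioner.getPartition, applied to a raw hash code.
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical variant without the mask: Java's % keeps the sign of the
    // dividend, so a negative hash code would yield a negative partition number.
    static int unmasked(int hashCode, int numReduceTasks) {
        return hashCode % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partition(-1, 3));                 // prints 1
        System.out.println(unmasked(-1, 3));                  // prints -1
        System.out.println(partition(Integer.MIN_VALUE, 3));  // prints 0
    }
}
```

The mask is used instead of Math.abs because Math.abs(Integer.MIN_VALUE) is still negative, whereas clearing the sign bit always gives a non-negative value.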

Custom partitioner class: extend the class Partitioner<KEY, VALUE> and override the getPartition method.

public static class GlobleSortPartitioner extends Partitioner<IntWritable,Text>{
        @Override
        public int getPartition(IntWritable key,Text value,int numPartitions){
            int a = key.get();
            if(a>=-100 && a<=0 ){
                return 0;
            }
            else if(a>0 && a<=50){
                return 1;
            }
            else {
                return 2;
            }
        }
    }
    
// Set on the job
job.setPartitionerClass(GlobleSortPartitioner.class);

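This custom partitioner keeps the partitions themselves in order: every key in one partition is smaller than every key in the next, so reading the reducer outputs in partition order gives a globally sorted result. Two conditions follow from the code: the job must run with three reduce tasks, because getPartition returns 0, 1, or 2, and the keys must not be below -100, because any such key falls through to the else branch and lands in partition 2.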
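The effect can be checked without a cluster. The following is a small simulation, not Hadoop code: it buckets plain ints with the same boundaries as GlobleSortPartitioner, sorts each bucket the way a reducer would see its keys, and concatenates the buckets in partition order. The class name RangePartitionSketch and the method simulate are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RangePartitionSketch {
    // Same boundaries as GlobleSortPartitioner.getPartition, on plain ints.
    static int partition(int key) {
        if (key >= -100 && key <= 0) {
            return 0;
        } else if (key > 0 && key <= 50) {
            return 1;
        } else {
            return 2; // also catches keys below -100
        }
    }

    // Simulates three reducers: bucket by partition, sort each bucket,
    // then concatenate the buckets in partition order.
    static List<Integer> simulate(List<Integer> keys) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int key : keys) {
            buckets.get(partition(key)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket);
            out.addAll(bucket);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [-100, -5, 1, 30, 51, 60]
        System.out.println(simulate(Arrays.asList(60, -5, 30, -100, 51, 1)));
    }
}
```

Adding a key such as -200 to the input would place it in the last bucket, after 60, which is why the boundaries have to cover the whole key range for the output to be globally sorted.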

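Sorting:
Rules:
1. If job.setSortComparatorClass(A.class) is set, that class is used for sorting. A must implement RawComparator; usually it is a subclass of WritableComparator that overrides compare().
2. If 1 is not set, the comparator registered for the key class through WritableComparator.define() is used. (The key class must implement WritableComparable.)
3. If no comparator is registered for the key class in 2, a generic WritableComparator is used, which deserializes the keys and calls their compareTo() method.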

Custom sort class (the key type is IntWritable, sorted from largest to smallest):

public static class KeyComparator extends WritableComparator{
        protected KeyComparator(){
            super(IntWritable.class,true);

        }
        @Override
        public int compare(WritableComparable w1,WritableComparable w2) {
            IntWritable it1 = (IntWritable) w1;
            IntWritable it2 = (IntWritable) w2;
            int cmp = it1.compareTo(it2);
            return -cmp;

        }
    }
  // Set on the job
  job.setSortComparatorClass(KeyComparator.class);

Result: within each reducer's output, the keys are sorted in descending order.
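The sign flip in KeyComparator can be tried on plain Integer values. This is only a sketch of the comparison logic with java.util.Comparator, not a Hadoop comparator; the names DescendingSortSketch and sortKeys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DescendingSortSketch {
    // Mirrors KeyComparator.compare: take the natural comparison and flip its sign.
    // Integer.compareTo only returns -1, 0, or 1, so the negation cannot overflow.
    static final Comparator<Integer> DESCENDING = (a, b) -> -a.compareTo(b);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Integer> sortKeys(List<Integer> keys) {
        List<Integer> sorted = new ArrayList<>(keys);
        sorted.sort(DESCENDING);
        return sorted;
    }

    public static void main(String[] args) {
        // prints [2017, 1950, -11, -52]
        System.out.println(sortKeys(Arrays.asList(-52, 2017, -11, 1950)));
    }
}
```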

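Grouping:

What is grouping? By default, each call to reduce() receives a single key together with an iterable over all the values for that key.
Suppose we have the following records: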
-11 1950
-11 1950
-52 2017
-52 2017

My reducer joins the values that share a key with "|", for example:
-11 1950|1950
-52 2017|2017

Now we want the values of -11 and -52 joined together, so that the reducer emits a value such as 1950|1950|2017|2017| .

Custom grouping:
Extend WritableComparator and override compare().

public static class GroupComparator extends WritableComparator{
        protected GroupComparator(){
            super(IntWritable.class,true);
        }

        @Override
        public int compare(WritableComparable w1,WritableComparable w2){
            IntWritable it1 = (IntWritable) w1;
            IntWritable it2 = (IntWritable) w2;
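            // Treat -11 and -52 as one group regardless of argument order,
            // so the result does not depend on the sort direction.
            int a = it1.get(), b = it2.get();
            if((a == -11 || a == -52) && (b == -11 || b == -52)){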
                return 0;
            }
            else{
                return it1.compareTo(it2);
            }
        }
    }

// Use it on the job
job.setGroupingComparatorClass(GroupComparator.class);

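In the final result, the key of the merged group is the one that sorts last within the group.
Note on grouping: only keys that are adjacent in the sorted order can be merged into one group, because the grouping comparator only compares each key with the key that follows it.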
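Both observations can be reproduced with a small simulation of how a reducer walks its sorted input. This is a sketch under stated assumptions, not Hadoop's implementation: the class GroupingSketch and its methods are hypothetical, the grouping rule is a symmetric variant of the one in GroupComparator, and the values are joined without a trailing "|".

```java
import java.util.ArrayList;
import java.util.List;

class GroupingSketch {
    // Symmetric variant of the grouping rule: -11 and -52 belong to one group.
    static int groupCompare(int k1, int k2) {
        boolean first = (k1 == -11 || k1 == -52);
        boolean second = (k2 == -11 || k2 == -52);
        if (first && second) {
            return 0;
        }
        return Integer.compare(k1, k2);
    }

    // Input must already be in sort order. A group is extended only while the
    // current key and the NEXT key compare as equal under groupCompare.
    static List<String> reduce(int[] keys, String[] values) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < keys.length) {
            StringBuilder joined = new StringBuilder(values[i]);
            int lastKey = keys[i];
            while (i + 1 < keys.length && groupCompare(keys[i], keys[i + 1]) == 0) {
                i++;
                joined.append('|').append(values[i]);
                lastKey = keys[i]; // the visible key follows the iteration
            }
            out.add(lastKey + "\t" + joined);
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Keys in descending order, as produced by the custom sort comparator.
        // Prints one merged record whose key is -52, the last key in the group.
        System.out.println(reduce(new int[]{-11, -11, -52, -52},
                new String[]{"1950", "1950", "2017", "2017"}));
        // A key between -11 and -52 breaks the run, so three groups come out.
        System.out.println(reduce(new int[]{-11, -30, -52},
                new String[]{"a", "b", "c"}));
    }
}
```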
