HADOOP(2)__MapReduce Partitioning, Sorting, and Grouping

In Hadoop MapReduce, the map phase can partition a large dataset or file so that the reduce phase processes the partitions in parallel; the number of partitions normally matches the number of reduce tasks. Having a bean implement Hadoop's WritableComparable interface (serialization plus comparison) lets MapReduce sort and group it, so the reduce phase can process records grouped by a user-defined grouping field.
This article combines these points in a demo that finds the largest amount in each order.

Order bean: serialization and sorting

Implement the methods of the WritableComparable interface: Writable is Hadoop's serialization interface, and Comparable is the comparison interface the sort calls. Also override toString, which MapReduce calls when writing the key to the text output.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Order bean: implements WritableComparable, i.e. Hadoop serialization (Writable)
 * plus comparison (Comparable) for sorting.
 */
public class OrderBean implements WritableComparable<OrderBean> {
    private Text orderId;
    private DoubleWritable amount;

    public Text getOrderId() {
        return orderId;
    }
    public void setOrderId(Text orderId) {
        this.orderId = orderId;
    }
    public DoubleWritable getAmount() {
        return amount;
    }
    public void setAmount(DoubleWritable amount) {
        this.amount = amount;
    }

    // Serialize the bean fields in declaration order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId.toString());
        out.writeDouble(amount.get());

    }
    // Deserialize the fields in the same order they were written
    public void readFields(DataInput in) throws IOException {
        String readUTF = in.readUTF();
        double readDouble = in.readDouble();

        this.orderId = new Text(readUTF);
        this.amount = new DoubleWritable(readDouble);
    }

    public int compareTo(OrderBean o) {
        // primary sort: by order id
        int res = this.getOrderId().compareTo(o.getOrderId());
        if (res == 0) {
            // secondary sort: within one order id, amounts from largest to smallest
            res = -this.getAmount().compareTo(o.getAmount());
        }
        return res;
    }

    // Override toString; MapReduce calls it when writing the key to the output file
    @Override
    public String toString() {
        return orderId.toString() + "\t" + amount.get();
    }
}
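
To see the ordering that compareTo produces without running a full job, here is a minimal local sketch (the class OrderBeanSortDemo is mine, not part of the original post): beans with the same orderId end up adjacent, and within one order the amounts run from largest to smallest.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class OrderBeanSortDemo {
    public static void main(String[] args) {
        List<OrderBean> orders = new ArrayList<OrderBean>();
        orders.add(order("Order_0000001", 25.8));
        orders.add(order("Order_0000002", 522.8));
        orders.add(order("Order_0000001", 222.8));

        // Uses OrderBean.compareTo: orderId ascending, then amount descending
        Collections.sort(orders);
        for (OrderBean o : orders) {
            System.out.println(o);
        }
        // Expected output order:
        // Order_0000001  222.8
        // Order_0000001  25.8
        // Order_0000002  522.8
    }

    private static OrderBean order(String id, double amount) {
        OrderBean bean = new OrderBean();
        bean.setOrderId(new Text(id));
        bean.setAmount(new DoubleWritable(amount));
        return bean;
    }
}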

Partitioning

Partitioning in Hadoop is a hash-and-modulo scheme: the partition field's hashCode, taken modulo the number of reduce tasks, decides the partition, so records with the same partition field always land in the same partition. In the reduce phase, each partition is then handled by one reduce task. Here the partition field is the bean's orderId.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Map-side partitioner: partitions records by order id.
 */
public class OrderIdPartitioner extends Partitioner<OrderBean, NullWritable> {

    /**
     * The number of partitions is determined by the number of reduce tasks,
     * i.e. numPartitions equals the configured number of reduceTasks.
     */
    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        return (key.getOrderId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

}
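
As a quick local sanity check (again a sketch of mine; the class name PartitionDemo is hypothetical), the same calculation the framework performs between map and reduce can be printed for the sample order ids with two reduce tasks:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class PartitionDemo {
    public static void main(String[] args) {
        OrderIdPartitioner partitioner = new OrderIdPartitioner();
        String[] ids = {"Order_0000001", "Order_0000002", "Order_0000003", "Order_0000004"};
        for (String id : ids) {
            OrderBean bean = new OrderBean();
            bean.setOrderId(new Text(id));
            bean.setAmount(new DoubleWritable(0));
            // Partition index for this order id when numReduceTasks = 2
            int partition = partitioner.getPartition(bean, NullWritable.get(), 2);
            System.out.println(id + " -> partition " + partition);
        }
    }
}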

Grouping

Grouping in Hadoop is implemented with a WritableComparator. In practice you write a subclass of WritableComparator that names the key class and compares only the grouping field (by overriding compare), and you register that class on the job when submitting it to YARN. Because the keys inside one group arrive sorted by amount in descending order (see compareTo above), the first key of each group already carries that order's maximum amount.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Grouping comparator: groups records by orderId on the reduce side.
 */
public class OrderGroupingComparator extends WritableComparator {

    // Tell the parent which key class to compare; 'true' means create instances of it
    public OrderGroupingComparator() {
        super(OrderBean.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean orderA = (OrderBean) a;
        OrderBean orderB = (OrderBean) b;
        return orderA.getOrderId().compareTo(orderB.getOrderId());
    }
}

MapReduce driver, mapper, and reducer

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Driver class: submits the job that aggregates orders on the reduce side
 * using the custom grouping comparator.
 */
public class RunMain {

    static class OrderGroupMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {

        OrderBean order = new OrderBean();
        NullWritable NullCons = NullWritable.get();

        // Parse each tab-separated input line into the bean and emit it as the map output key
        @Override
        protected void map(
                LongWritable key,
                Text value,
                Mapper<LongWritable, Text, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            order.setOrderId(new Text(fields[0]));
            order.setAmount(new DoubleWritable(Double.parseDouble(fields[2])));
            context.write(order, NullCons);
        }
    }

    static class OrderGroupReduce extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
        // No extra processing: Hadoop has already grouped keys with the custom comparator,
        // and the first key in each group holds the largest amount, so writing it once per
        // group outputs the maximum for each order
        @Override
        protected void reduce(
                OrderBean arg0,
                Iterable<NullWritable> arg1,
                Reducer<OrderBean, NullWritable, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            context.write(arg0, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setJarByClass(RunMain.class);

        job.setMapperClass(OrderGroupMapper.class);
        job.setReducerClass(OrderGroupReduce.class);

        // Output key/value types for map and reduce (identical here, so setting the job output types is enough)
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        // Input and output locations on HDFS
        FileInputFormat.setInputPaths(job, new Path("hdfs://server1:9000/grouptest/order.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://server1:9000/grouptest/result"));
        // Number of reduce tasks, which is also the number of partitions
        job.setNumReduceTasks(2);

        // Grouping comparator used on the reduce side of the shuffle
        job.setGroupingComparatorClass(OrderGroupingComparator.class);
        // Partitioner used on the map side of the shuffle
        job.setPartitionerClass(OrderIdPartitioner.class);
        // Submit the job and print progress until it finishes
        boolean completion = job.waitForCompletion(true);
        System.exit(completion ? 0 : 1);
    }
}
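
The reducer above works because, inside one grouped reduce call, Hadoop re-fills the key object as the values are iterated. A variant sketch (not in the original post) that exploits this to emit the top two amounts per order instead of only the maximum could look like the following; it would replace OrderGroupReduce inside RunMain under the same job setup.

    // Illustrative variant: emit the two largest amounts per order.
    // Relies on Hadoop re-populating the key while iterating the grouped values.
    static class TopNOrderReduce extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
        private static final int TOP_N = 2; // assumption: two largest amounts per order

        @Override
        protected void reduce(
                OrderBean key,
                Iterable<NullWritable> values,
                Reducer<OrderBean, NullWritable, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            int emitted = 0;
            for (NullWritable value : values) {
                context.write(key, value); // key now holds the record that owns this value
                if (++emitted >= TOP_N) {
                    break;
                }
            }
        }
    }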

Running the test

  • Start the Hadoop cluster and upload the test data to HDFS. The order data (tab-separated fields: order id, product id, amount) looks like this:

    Order_0000001 Pdt_01 222.8
    Order_0000001 Pdt_05 25.8
    Order_0000002 Pdt_03 522.8
    Order_0000002 Pdt_04 122.4
    Order_0000002 Pdt_05 722.4
    Order_0000003 Pdt_01 223.8
    Order_0000003 Pdt_01 23.8
    Order_0000003 Pdt_01 322.8
    Order_0000004 Pdt_01 701.9
    Order_0000004 Pdt_01 120.8

  • Package the Java classes above into testgroup.jar, upload it to the server, and run:
    hadoop jar testgroup.jar com.spark.mapreduce.group.RunMain

  • Check the results
    The job's progress and status can be monitored in the web UI at http://192.168.10.121:8088/
    (screenshot: job status in the web UI)
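    The output files can also be inspected from the command line; assuming the default filesystem is the hdfs://server1:9000 used above, something like the following should work:
    hadoop fs -cat /grouptest/result/part-r-00000
    hadoop fs -cat /grouptest/result/part-r-00001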

Because the number of reduce tasks is set to two, the job produces two result files. Each order id should appear exactly once with its largest amount; for the sample data above that is Order_0000001 222.8, Order_0000002 722.4, Order_0000003 322.8, and Order_0000004 701.9.
(screenshot: job output files)
