MapReduce on Avro Data Files

Time: 2014-03-10 13:11:41  Source: Architects Zone  Original: http://java.dzone.com/articles/mapreduce-avro-data-files

In this post we write a MapReduce program that consumes Avro input data and produces its output in Avro format as well.

The program calculates the average marks for each student.

Data Preparation

The schema for the records is:

student.avsc
{
  "type" : "record",
  "name" : "student_marks",
  "namespace" : "com.rishav.avro",
  "fields" : [ {
  "name" : "student_id",
  "type" : "int"
  }, {
  "name" : "subject_id",
  "type" : "int"
  }, {
  "name" : "marks",
  "type" : "int"
  } ]
}

Some sample records:

student.json
{"student_id":1,"subject_id":63,"marks":19}
{"student_id":2,"subject_id":64,"marks":74}
{"student_id":3,"subject_id":10,"marks":94}
{"student_id":4,"subject_id":79,"marks":27}
{"student_id":1,"subject_id":52,"marks":95}
{"student_id":2,"subject_id":34,"marks":16}
{"student_id":3,"subject_id":81,"marks":17}
{"student_id":4,"subject_id":60,"marks":52}
{"student_id":1,"subject_id":11,"marks":66}
{"student_id":2,"subject_id":84,"marks":39}
{"student_id":3,"subject_id":24,"marks":39}
{"student_id":4,"subject_id":16,"marks":0}
{"student_id":1,"subject_id":65,"marks":75}
{"student_id":2,"subject_id":5,"marks":52}
{"student_id":3,"subject_id":86,"marks":50}
{"student_id":4,"subject_id":55,"marks":42}
{"student_id":1,"subject_id":30,"marks":21}

Now we convert the sample records above to Avro format and upload the Avro data file to HDFS:

java -jar avro-tools-1.7.5.jar fromjson student.json --schema-file student.avsc > student.avro
hadoop fs -put student.avro student.avro

Avro MapReduce Program

The program uses an Avro-generated Java class for the student_marks schema. To generate the Java class from the schema file, run the command below:

java -jar avro-tools-1.7.5.jar compile schema student.avsc .

Then add the generated Java class to your IDE project.

The MapReduce program below reads the Avro data file student.avro (passed as an argument), calculates the average marks for each student, and stores the output in Avro format as well:

package com.rishav.avro.mapreduce;

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.rishav.avro.IntPair;
import com.rishav.avro.student_marks;

public class AvroAverageDriver extends Configured implements Tool{

  public static class AvroAverageMapper extends 
  Mapper<AvroKey<student_marks>, NullWritable, IntWritable, IntPair> {
    protected void map(AvroKey<student_marks> key, NullWritable value, Context context) 
        throws IOException, InterruptedException {
      IntWritable s_id = new IntWritable(key.datum().getStudentId());
      IntPair marks_one = new IntPair(key.datum().getMarks(), 1);
      context.write(s_id, marks_one);
    }
  } // end of mapper class

  public static class AvroAverageCombiner extends 
  Reducer<IntWritable, IntPair, IntWritable, IntPair> {
    IntPair p_sum_count = new IntPair();
    int p_sum = 0;    // running sum of marks for the current key
    int p_count = 0;  // number of marks included in that sum
    protected void reduce(IntWritable key, Iterable<IntPair> values, Context context) 
        throws IOException, InterruptedException {
      p_sum = 0;
      p_count = 0;
      for (IntPair value : values) {
        p_sum += value.getFirstInt();
        p_count += value.getSecondInt();
      }
      p_sum_count.set(p_sum, p_count);
      context.write(key, p_sum_count);
    }
  } // end of combiner class 

  public static class AvroAverageReducer extends 
  Reducer<IntWritable, IntPair, AvroKey<Integer>, AvroValue<Float>> {
    int f_sum = 0;    // total marks for the current student
    int f_count = 0;  // total number of marks for the current student
    
    protected void reduce(IntWritable key, Iterable<IntPair> values, Context context) 
        throws IOException, InterruptedException {
      f_sum = 0;
      f_count = 0;
      for (IntPair value : values) {
        f_sum += value.getFirstInt();
        f_count += value.getSecondInt();
      }
      Float average = (float)f_sum/f_count;
      Integer s_id = key.get();  // read the int directly instead of parsing key.toString()
      context.write(new AvroKey<Integer>(s_id), new AvroValue<Float>(average));
    }
  } // end of reducer class 

  @Override
  public int run(String[] rawArgs) throws Exception {
    if (rawArgs.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }
    
    Job job = new Job(super.getConf());
    job.setJarByClass(AvroAverageDriver.class);
    job.setJobName("Avro Average");
    
    String[] args = new GenericOptionsParser(rawArgs).getRemainingArgs();
    Path inPath = new Path(args[0]);
    Path outPath = new Path(args[1]);

    FileInputFormat.setInputPaths(job, inPath);
    FileOutputFormat.setOutputPath(job, outPath);
    outPath.getFileSystem(super.getConf()).delete(outPath, true);

    job.setInputFormatClass(AvroKeyInputFormat.class);
    job.setMapperClass(AvroAverageMapper.class);
    AvroJob.setInputKeySchema(job, student_marks.getClassSchema());
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(IntPair.class);
    
    job.setCombinerClass(AvroAverageCombiner.class);
    
    job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
    job.setReducerClass(AvroAverageReducer.class);
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.INT));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.FLOAT));

    return (job.waitForCompletion(true) ? 0 : 1);
  }

  public static void main(String[] args) throws Exception {
    int result = ToolRunner.run(new AvroAverageDriver(), args);
    System.exit(result);
  }
}
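
The listing imports com.rishav.avro.IntPair, which the post does not show. The following is only a guess at what such a class might look like, built around the constructor and the set, getFirstInt and getSecondInt calls used above. To keep the sketch runnable without Hadoop on the classpath it does not declare the Writable interface; in the actual job the class would presumably be public, live in its own file, and implement org.apache.hadoop.io.Writable. The roundTrip helper is an addition for demonstration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for com.rishav.avro.IntPair (not shown in the post).
// In the real job this class would also declare
// "implements org.apache.hadoop.io.Writable"; write() and readFields()
// below already have the signatures that interface requires.
class IntPair {
  private int first;   // sum of marks
  private int second;  // number of marks in the sum

  IntPair() {}  // no-arg constructor, needed when Hadoop deserializes values

  IntPair(int first, int second) {
    set(first, second);
  }

  void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  int getFirstInt() { return first; }

  int getSecondInt() { return second; }

  // Serialize both ints in a fixed order.
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }

  // Read the ints back in the same order they were written.
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }

  // Demonstration helper (an assumption of this sketch): serializes a pair to
  // bytes and reads it back, mimicking the trip between map and reduce tasks.
  static IntPair roundTrip(IntPair p) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      p.write(new DataOutputStream(bytes));
      IntPair copy = new IntPair();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With this sketch, IntPair.roundTrip(new IntPair(95, 1)) returns a new pair whose getFirstInt() is 95 and getSecondInt() is 1.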

In this program the mapper's input key is AvroKey<student_marks> and its input value is NullWritable, because AvroKeyInputFormat delivers each record as the key. The map method emits student_id as the output key and an IntPair holding the marks and a count of 1 as the output value.
A combiner then adds up the partial sums and counts for each student_id on the map side.
Finally, the reducer receives each student_id with its partial sums and counts, totals them, and divides the total sum by the total count to get the average for that student_id. The reducer writes its output in Avro format.
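
The reason for passing (sum, count) pairs around rather than averages is that a combiner may run on any subset of a key's values, so whatever it emits must still merge correctly later. The plain-Java sketch below illustrates this with student 1's marks from the sample data; the class PartialAverages and the way the marks are split across two map tasks are assumptions made for the illustration.

```java
// Plain-Java illustration; the class and method names are made up for this sketch.
class PartialAverages {

  // Merges {sum, count} pairs the same way the combiner and reducer do.
  static int[] merge(int[]... partials) {
    int sum = 0;
    int count = 0;
    for (int[] p : partials) {
      sum += p[0];
      count += p[1];
    }
    return new int[] {sum, count};
  }

  static float average(int[] sumCount) {
    return (float) sumCount[0] / sumCount[1];
  }

  public static void main(String[] args) {
    // Student 1's marks, assumed to be split across two map tasks
    // as {19, 95, 66} and {75, 21}.
    int[] a = merge(new int[] {19, 1}, new int[] {95, 1}, new int[] {66, 1});  // {180, 3}
    int[] b = merge(new int[] {75, 1}, new int[] {21, 1});                     // {96, 2}

    // Merging partial sums and counts gives the true average: 276 / 5 = 55.2
    System.out.println(average(merge(a, b)));

    // Averaging the two partial averages (60.0 and 48.0) gives 54.0, which is wrong.
    System.out.println((average(a) + average(b)) / 2);
  }
}
```

Because summing sums and summing counts is associative, the result is the same whether the combiner runs zero, one, or several times per key.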

For the Avro job setup we added these properties:

// set InputFormatClass to AvroKeyInputFormat and define the input schema
job.setInputFormatClass(AvroKeyInputFormat.class);
AvroJob.setInputKeySchema(job, student_marks.getClassSchema());

// set OutputFormatClass to AvroKeyValueOutputFormat, with the key as INT type and the value as FLOAT type
job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.INT));
AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.FLOAT));

Job Execution

We package the Java program as avro_mr.jar and add the Avro jars to libjars and the Hadoop classpath with the commands below:

export LIBJARS=avro-1.7.5.jar,avro-mapred-1.7.5-hadoop1.jar,paranamer-2.6.jar
export HADOOP_CLASSPATH=avro-1.7.5.jar:avro-mapred-1.7.5-hadoop1.jar:paranamer-2.6.jar
hadoop jar avro_mr.jar com.rishav.avro.mapreduce.AvroAverageDriver -libjars ${LIBJARS} student.avro output

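You can verify the output with avro-tools, for example by running its tojson command on the files in the output directory.

To enable Snappy compression for the output, add the lines below to the run method (SnappyCodec is org.apache.hadoop.io.compress.SnappyCodec and needs its own import) and add the snappy-java jar to libjars and the Hadoop classpath:

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);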

Published at DZone with permission of Rishav Rohit, author and DZone MVB.

