InputFormat in MapReduce (2): Customization

1 Overview

Hadoop's built-in input format classes include:
1) FileInputFormat<K,V>: the common base class; a custom format extends it directly.
2) TextInputFormat<LongWritable,Text>: the default format. The key is the byte offset of the current line from the beginning of the file, and the value is the content of the current line.
3) SequenceFileInputFormat<K,V>: reads sequence files. Sequence files improve efficiency but are not convenient for inspecting results, so it is advisable to use them for intermediate data and switch to a human-readable output format for the final result.
4) KeyValueTextInputFormat<Text,Text>: reads data separated by a Tab (that is, \t). If each line is split by \t, this format automatically takes the part before the \t as the key and the part after it as the value.
5) CombineFileInputFormat<K,V>: used when combining large numbers of small files.
6) MultipleInputs: multiple inputs, where each input can be given its own Mapper for its processing logic; see the sketch after this list.
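
To show how one of these built-in formats is wired into a job, here is a minimal driver fragment (a sketch only; the paths and the TabMapper/PlainMapper classes are hypothetical placeholders, not part of this article):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Inside a driver's run() method, after Job job = Job.getInstance(conf, "..."):

// Option A: a single input format for the whole job
job.setInputFormatClass(KeyValueTextInputFormat.class);

// Option B: a different input format and Mapper per input path
MultipleInputs.addInputPath(job, new Path("/data/tab"),
		KeyValueTextInputFormat.class, TabMapper.class);    // TabMapper: hypothetical
MultipleInputs.addInputPath(job, new Path("/data/plain"),
		TextInputFormat.class, PlainMapper.class);          // PlainMapper: hypothetical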


2 Execution Trace

2.1 Mapper

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
}

Step into the context.nextKeyValue() method, which leads into the WrappedMapper class.


2.2 WrappedMapper

public boolean nextKeyValue() throws IOException, InterruptedException {
    return mapContext.nextKeyValue();
}

Step into that nextKeyValue() call, which leads into the MapContextImpl class.


2.3 MapContextImpl

public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
}
To find out the concrete type of reader, first look at its declaration and assignment.

public class MapContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
    extends TaskInputOutputContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
    implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  private RecordReader<KEYIN,VALUEIN> reader;
  private InputSplit split;
  public MapContextImpl(Configuration conf, TaskAttemptID taskid,
                        RecordReader<KEYIN,VALUEIN> reader,
                        RecordWriter<KEYOUT,VALUEOUT> writer,
                        OutputCommitter committer,
                        StatusReporter reporter,
                        InputSplit split) {
    super(conf, taskid, writer, committer, reporter);
    this.reader = reader;
    this.split = split;
  }
}
Here we can see that reader is assigned in the MapContextImpl constructor, so the next step is to find where MapContextImpl is constructed: right-click MapContextImpl > Open Call Hierarchy. Following the call into a method named runNewMapper and stepping through the variable declarations shows that inputFormat is exactly the InputFormat class we configured in our job code (a simplified sketch follows).
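
A heavily simplified sketch of what runNewMapper does (not verbatim Hadoop source; details differ between versions; it only illustrates the chain described above):

// The InputFormat set via job.setInputFormatClass(...) is instantiated by reflection
InputFormat<K, V> inputFormat =
		ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
// Its createRecordReader(...) supplies the RecordReader that is wrapped into the
// MapContextImpl shown above, so context.nextKeyValue() in Mapper.run() ends up
// calling our own RecordReader
RecordReader<K, V> reader = inputFormat.createRecordReader(split, taskContext);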


3 Custom InputFormat

The design idea behind the file-based FileInputFormat is:
A. The common base class FileInputFormat splits the input files into InputSplits in a uniform way (for example, by a fixed size); this is the getSplits() method.
B. Each derived class parses the InputSplits according to its own needs; this is the createRecordReader() method implemented by each subclass.
So a custom InputFormat only needs to implement its own createRecordReader() method.


3.1 MyInputFormat

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MyInputFormat extends FileInputFormat<Text, Text> {
	@Override
	public RecordReader<Text, Text> createRecordReader(InputSplit split,
			TaskAttemptContext context) throws IOException,
			InterruptedException {
		return new MyRecordReader();
	}
}

3.2 MyRecordReader

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class MyRecordReader extends RecordReader<Text, Text> {
	private LineReader lr;
	private Text key = new Text();
	private Text value = new Text();
	private long start;
	private long end;
	private long currentPos;
	private Text line = new Text();

	@Override
	public void initialize(InputSplit inputSplit, TaskAttemptContext cxt)
			throws IOException, InterruptedException {

		FileSplit split = (FileSplit) inputSplit;
		Configuration conf = cxt.getConfiguration();
		// Get the file that this split belongs to
		Path path = split.getPath();
		FileSystem fs = path.getFileSystem(conf);
		FSDataInputStream is = fs.open(path);
		lr = new LineReader(is, conf);
		// Start offset of this split within the file
		start = split.getStart();
		// End offset of this split within the file
		end = start + split.getLength();
		is.seek(start);
		// If the split does not start at the beginning of the file, the first
		// (partial) line belongs to the previous split, so skip it here
		if (start != 0) {
			start += lr.readLine(new Text(), 0,
					(int) Math.min(Integer.MAX_VALUE, end - start));
		}
		currentPos = start;
	}

	// Process one record (line) per call
	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException{
		if (currentPos > end) {
			return false;
		}
		currentPos += lr.readLine(line);
		if (line.getLength() == 0) {
			return false;
		}
		// Lines beginning with "ignore" are skipped; keep reading until a
		// non-ignored line (or the end of the data) is reached
		while (line.toString().startsWith("ignore")) {
			currentPos += lr.readLine(line);
			if (line.getLength() == 0) {
				return false;
			}
		}
		String[] words = line.toString().split(",");
		if (words.length < 2) {
			System.err.println("line:" + line.toString() + ".");
			return false;
		}
		key.set(words[0]);
		value.set(words[1]);
		return true;
	}

	@Override
	public Text getCurrentKey() throws IOException, InterruptedException {
		return key;
	}

	@Override
	public Text getCurrentValue() throws IOException, InterruptedException{
		return value;
	}

	@Override
	public float getProgress() throws IOException, InterruptedException {
		if (start == end) {
			return 0.0f;
		} else {
			return Math.min(1.0f, (currentPos - start) / (float) (end - start));
		}
	}

	@Override
	public void close() throws IOException {
		lr.close();
	}
} 
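
To make the expected input concrete: each line should have the form key,value, and any line beginning with ignore is skipped. A hypothetical input file (not from the article) and the records this reader would hand to the Mapper:

input line: hello,world    ->  key = hello, value = world
input line: ignore me      ->  skipped, the next line is read instead
input line: foo,bar        ->  key = foo, value = bar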

3.3 TestFormat

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestFormat extends Configured implements Tool {

	public static void main(String[] args) throws Exception {
		System.exit(ToolRunner.run(new Configuration(), new TestFormat(), args));
	}

	@Override
	public int run(String[] args) throws Exception {
		Configuration conf = getConf();
		String inPath = "hdfs://192.XXX.XXX.XXX:9000/test/bigFile.txt";
		String outPath = "hdfs://192.XXX.XXX.XXX:9000/test/out/";
		Path in = new Path(inPath);
		Path out = new Path(outPath);
		out.getFileSystem(conf).delete(out, true);

		Job job = Job.getInstance(conf, "fileinputformat test job");
		job.setJarByClass(getClass());
		job.setInputFormatClass(MyInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		job.setMapperClass(Mapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		job.setNumReduceTasks(0);

		FileInputFormat.setInputPaths(job, in);
		FileOutputFormat.setOutputPath(job, out);
		return job.waitForCompletion(true) ? 0 : -1;
	}
}
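
Assuming the three classes above are packaged into a jar (the jar name below is just a placeholder), the job can be submitted and its map-only output inspected as usual:

hadoop jar myinputformat.jar TestFormat
hdfs dfs -cat /test/out/part-m-00000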

Reference: http://www.cnblogs.com/hyl8218/p/5198030.html
