1 Overview
Hadoop's built-in input file format classes include:
1) FileInputFormat<K,V>: the basic parent class; a custom input format usually extends it directly.
2) TextInputFormat<LongWritable,Text>: the default format. The key is the byte offset of the current line from the start of the file, and the value is the content of the current line.
3) SequenceFileInputFormat<K,V>: the sequence-file input format. Sequence files improve efficiency but are awkward to inspect, so they are best used for intermediate stages, with a human-readable format for the final output.
4) KeyValueTextInputFormat<Text,Text>: reads data separated by a Tab (i.e. \t). If each line contains a \t, this format automatically treats the part before the \t as the key and the part after it as the value.
5) CombineFileInputFormat<K,V>: used to combine large numbers of small files.
6) MultipleInputs: multiple inputs, where each input can be given its own InputFormat and Mapper (see the sketch after this list).
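To make items 4 and 6 concrete, here is a minimal sketch of how each would be wired into a job. The paths and the mapper classes LogMapper/OrderMapper are hypothetical, introduced only for illustration; the two options are alternatives, not meant to be combined on the same job.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatWiring {
    // Hypothetical mappers, shown only for the key/value types each format feeds them.
    public static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {}
    public static class OrderMapper extends Mapper<Text, Text, Text, Text> {}

    // Option A: one input format for the whole job (tab-separated key/value lines).
    public static void useKeyValueInput(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }

    // Option B: MultipleInputs, a dedicated InputFormat and Mapper per input path.
    public static void useMultipleInputs(Job job) {
        MultipleInputs.addInputPath(job, new Path("/data/logs"),
                TextInputFormat.class, LogMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/orders"),
                KeyValueTextInputFormat.class, OrderMapper.class);
    }
}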
2 Execution Trace
2.1 Mapper
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}
Step into context.nextKeyValue(), which takes us to the WrappedMapper class.
2.2 WrappedMapper
public boolean nextKeyValue() throws IOException, InterruptedException {
    return mapContext.nextKeyValue();
}
Step into this nextKeyValue() in turn, which takes us to the MapContextImpl class.
2.3 MapContextImpl
public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
}
Now we want to know the concrete type of reader, so first look at its declaration and where it is assigned.
public class MapContextImpl<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        extends TaskInputOutputContextImpl<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    private RecordReader<KEYIN, VALUEIN> reader;
    private InputSplit split;

    public MapContextImpl(Configuration conf, TaskAttemptID taskid,
                          RecordReader<KEYIN, VALUEIN> reader,
                          RecordWriter<KEYOUT, VALUEOUT> writer,
                          OutputCommitter committer,
                          StatusReporter reporter,
                          InputSplit split) {
        super(conf, taskid, writer, committer, reporter);
        this.reader = reader;
        this.split = split;
    }
}
Here we can see that reader is assigned in the MapContextImpl constructor, so the next step is to find out where MapContextImpl is constructed. Right-click MapContextImpl > Open Call Hierarchy. Following the call into a method named runNewMapper (in MapTask) and stepping through the variable declarations, we can see that inputFormat is exactly the InputFormat class we set in our job code, and it is this inputFormat that supplies the reader.
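For orientation, the relevant part of MapTask.runNewMapper looks roughly like the following (abridged and paraphrased from the Hadoop 2.x source; exact details vary by version):

// Inside org.apache.hadoop.mapred.MapTask#runNewMapper (abridged):
// the InputFormat is created reflectively from the class configured on the job,
org.apache.hadoop.mapreduce.InputFormat<INKEY, INVALUE> inputFormat =
    (org.apache.hadoop.mapreduce.InputFormat<INKEY, INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
// and the record reader handed to MapContextImpl is built from
// inputFormat.createRecordReader(split, taskContext) -- this is the "reader"
// whose nextKeyValue() we traced above.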
3 Custom InputFormat
The design of the file-based FileInputFormat is:
A. The common base class FileInputFormat splits input files into InputSplits in a uniform way (for example, by a uniform size); this is the getSplits() method.
B. Each subclass parses the InputSplit according to its own needs; this is the createRecordReader() method implemented by each subclass.
So a custom InputFormat only needs to provide its own createRecordReader() implementation.
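For the "uniform size" part, the new-API FileInputFormat picks the split size roughly as follows (shown only for orientation, simplified from the library source):

// splitSize = max(minSize, min(maxSize, blockSize)), where minSize/maxSize come from
// mapreduce.input.fileinputformat.split.minsize / .maxsize and blockSize is the
// HDFS block size of the file.
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}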
3.1 MyInputFormat
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MyInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new MyRecordReader();
    }
}
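One optional addition, not part of the original example: if records cannot be parsed safely starting from an arbitrary byte offset, the format can refuse to be split, so each file is read by a single RecordReader. A minimal sketch of the override, to be placed inside MyInputFormat (it also needs imports for org.apache.hadoop.mapreduce.JobContext and org.apache.hadoop.fs.Path):

// Optional: disable splitting so a whole file always goes to one mapper.
@Override
protected boolean isSplitable(JobContext context, Path filename) {
    return false;
}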
3.2 MyRecordReader
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
public class MyRecordReader extends RecordReader<Text, Text> {
    private LineReader lr;
    private Text key = new Text();
    private Text value = new Text();
    private long start;
    private long end;
    private long currentPos;
    private Text line = new Text();

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cxt)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = cxt.getConfiguration();
        // Open the file that this split belongs to
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream is = fs.open(path);
        lr = new LineReader(is, conf);
        // Start position of the split within the file
        start = split.getStart();
        // End position of the split within the file
        end = start + split.getLength();
        is.seek(start);
        // If the split does not begin at the start of the file, skip the first
        // (possibly partial) line: it belongs to the previous split.
        if (start != 0) {
            start += lr.readLine(new Text(), 0,
                    (int) Math.min(Integer.MAX_VALUE, end - start));
        }
        currentPos = start;
    }

    // Process one line per call
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (currentPos > end) {
            return false;
        }
        currentPos += lr.readLine(line);
        if (line.getLength() == 0) {
            return false;
        }
        // If this line should be ignored, read the next line instead
        if (line.toString().startsWith("ignore")) {
            currentPos += lr.readLine(line);
        }
        String[] words = line.toString().split(",");
        if (words.length < 2) {
            System.err.println("line:" + line.toString() + ".");
            return false;
        }
        key.set(words[0]);
        value.set(words[1]);
        return true;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (currentPos - start) / (float) (end - start));
        }
    }

    @Override
    public void close() throws IOException {
        lr.close();
    }
}
3.3 TestFormat
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class TestFormat extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new TestFormat(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        String inPath = "hdfs://192.XXX.XXX.XXX:9000/test/bigFile.txt";
        String outPath = "hdfs://192.XXX.XXX.XXX:9000/test/out/";
        Path in = new Path(inPath);
        Path out = new Path(outPath);
        // Delete the output directory if it already exists
        out.getFileSystem(conf).delete(out, true);
        Job job = Job.getInstance(conf, "fileinputformat test job");
        job.setJarByClass(getClass());
        // Use the custom input format; write results as plain text
        job.setInputFormatClass(MyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Identity Mapper and no reducers: the job just emits what the RecordReader produced
        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true) ? 0 : -1;
    }
}
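With zero reducers and the identity Mapper, the job simply writes out whatever MyRecordReader produced, and TextOutputFormat separates key and value with a tab. As a hypothetical illustration (not data from the original article): an input line "a,1" would appear in the output as "a" followed by a tab and "1", while a line starting with "ignore" would be skipped.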
Reference: http://www.cnblogs.com/hyl8218/p/5198030.html