Hadoop's NLineInputFormat Explained

Hadoop's commonly used InputFormat implementations derive from FileInputFormat<K,V>, which has five notable subclasses: CombineFileInputFormat<K,V>, TextInputFormat<K,V>, KeyValueTextInputFormat<K,V>, NLineInputFormat<K,V>, and SequenceFileInputFormat<K,V>. The default among them is TextInputFormat<K,V>, which turns each line of the input into one record: the line's byte offset in the file becomes the key and the line's content becomes the value (the splits themselves are cut along HDFS block boundaries). This article looks at another FileInputFormat subclass: NLineInputFormat.
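To make that record shape concrete, here is a minimal sketch of a mapper run under the default TextInputFormat; the class name EchoOffsetMap is illustrative, not from the original post:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Under TextInputFormat the framework calls map() once per line:
//   key   = byte offset of the line within the file
//   value = the content of the line
class EchoOffsetMap extends Mapper<LongWritable, Text, LongWritable, Text> {
	@Override
	protected void map(LongWritable offset, Text line, Context context)
			throws IOException, InterruptedException {
		context.write(offset, line);
	}
}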

NLineInputFormat is a non-default input format in Hadoop. Unlike the default TextInputFormat, it cuts splits by a fixed number of lines, which in certain situations can noticeably improve a job's efficiency. For example: the data file has one floating-point number per line, 100 lines in total; the number of reducers is set to 5 and the lines-per-split to 20. At split time, every 20 lines of the text then become one split, and each split is fed to its own mapper. Since the number of mappers in Hadoop equals the number of splits, and NLineInputFormat computes mappers = splits = (total lines) / (lines per split), the example above yields 100 / 20 = 5 mappers. The source code implementing this example is shown below.
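One note before the listing: the lines-per-split value can also be set as a plain configuration property instead of through the helper method. A short fragment meant to slot into the driver below, assuming the Hadoop 2.x property name (the constant NLineInputFormat.LINES_PER_MAP):

		// Two equivalent ways to ask for 20-line splits (Hadoop 2.x):
		NLineInputFormat.setNumLinesPerSplit(job, 20);
		// ...or set the underlying property directly:
		job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 20);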

package Hadoop_InputFormat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineSplite {
	
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// Local input (one float per line, 100 lines) and output paths for this example.
		String input = "E:/Document/Study/Data/ComparatorNum.txt";
		String output = "E:/Document/Study/Data/output";

		// Point at a local Hadoop installation so the job can run from the IDE on Windows.
		System.setProperty("hadoop.home.dir", "D:/Tools/Office/hadoop-2.6.0");
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "NLineSplite");

		// Read the input with NLineInputFormat, cutting a split every 20 lines;
		// 100 lines / 20 lines per split = 5 splits = 5 mappers, matched by 5 reducers.
		job.setInputFormatClass(NLineInputFormat.class);
		NLineInputFormat.setNumLinesPerSplit(job, 20);
		job.setNumReduceTasks(5);
		
		job.setJarByClass(NLineSplite.class);
		job.setMapperClass(NLineSpliteMap.class);
		job.setPartitionerClass(PartitionerByIndex.class);
		job.setReducerClass(NLineSpliteReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);
		
		FileInputFormat.setInputPaths(job, new Path(input));
		FileOutputFormat.setOutputPath(job, new Path(output));
		
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}

class NLineSpliteMap extends
		Mapper<LongWritable, Text, Text, LongWritable> {
	// One mapper instance is created per split, so every record in a split
	// carries the same random tag in [0, 10).
	int index = (int) (Math.random() * 10);

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// Append the split's tag to the line; the byte offset rides along as the value.
		String line = value.toString();
		context.write(new Text(line + " " + index), key);
	}
}

class PartitionerByIndex extends Partitioner<Text, LongWritable> {

	@Override
	public int getPartition(Text key, LongWritable value, int numPartitions) {
		// Keys look like "<line> <index>"; partition by the split's tag so that
		// all records from one split land on the same reducer.
		int index = Integer.parseInt(key.toString().split(" ")[1]);
		return Math.abs(index % numPartitions);
	}
}

class NLineSpliteReduce extends
		Reducer<Text, LongWritable, Text, LongWritable> {
	@Override
	protected void reduce(Text key, Iterable<LongWritable> value, Context context)
			throws IOException, InterruptedException {
		// Emit each tagged line once; the constant 1 simply marks it as seen.
		context.write(key, new LongWritable(1));
	}
}
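For reproducing the example, here is a minimal sketch that writes a test file of the assumed shape (100 random floats, one per line); the class name GenerateComparatorNum is illustrative:

import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper that produces the 100-line input file used by the driver above.
public class GenerateComparatorNum {
	public static void main(String[] args) throws IOException {
		try (PrintWriter out = new PrintWriter("E:/Document/Study/Data/ComparatorNum.txt")) {
			for (int i = 0; i < 100; i++) {
				out.println(Math.random());
			}
		}
	}
}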
When each mapper instance is constructed, it draws one random index and uses it to tag every record of its split; the custom partitioner then reads the tag off the key and routes all of a split's records to the same reducer. A quick check of that routing is sketched below, followed by the run results.
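A minimal standalone check of the partitioner; the class PartitionerCheck and the sample key "3.14 7" are made up for illustration, and the class sits in the same package as the job classes:

package Hadoop_InputFormat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class PartitionerCheck {
	public static void main(String[] args) {
		PartitionerByIndex p = new PartitionerByIndex();
		// A map output key looks like "<line> <index>", e.g. "3.14 7".
		// With 5 reducers, tag 7 maps to partition 7 % 5 = 2.
		System.out.println(p.getPartition(new Text("3.14 7"), new LongWritable(0), 5));
	}
}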
[Figure: run results of the NLineSplite job]