Hadoop's base InputFormat implementation is FileInputFormat<K,V>, which has the following five subclasses: CombineFileInputFormat<K,V>, TextInputFormat<K,V>, KeyValueTextInputFormat<K,V>, NLineInputFormat<K,V>, and SequenceFileInputFormat<K,V>. Among these, TextInputFormat<K,V> is the default. Its input handling works as follows: each line of the input text becomes one record, with the line's byte offset as the key and the line's content as the value. This article covers another subclass of FileInputFormat: NLineInputFormat.
NLineInputFormat is an input format that Hadoop does not use by default. Unlike the default TextInputFormat, it builds each InputSplit from a fixed, user-specified number of lines, which in certain situations can noticeably improve job efficiency. For example: suppose the data file contains one floating-point number per line, 100 lines in total; the number of reducers is set to 5 and the lines per split to 20. Then every 20 lines of the text become one split, and each mapper is fed the content of one split. Since the number of mappers in Hadoop is determined by the number of splits, and for NLineInputFormat the mapper count is computed as mappers = splits = total lines / lines per split, the example above is divided into 5 mappers in total. The source code implementing this example is given below.
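The example assumes a 100-line input file of floats (the ComparatorNum.txt path referenced in the driver below). A minimal sketch to generate such a test file; the relative output path, file name, and fixed seed here are illustrative assumptions, adjust them to your environment:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class GenTestData {
    public static void main(String[] args) throws IOException {
        // 100 lines, one random float per line, matching the example above
        Path out = Paths.get("ComparatorNum.txt");
        Random rnd = new Random(42); // fixed seed for reproducibility
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            lines.add(String.valueOf(rnd.nextDouble()));
        }
        Files.write(out, lines);
        // With 20 lines per split, this file yields 100 / 20 = 5 splits
        System.out.println("lines written: " + lines.size());
    }
}
```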
package Hadoop_InputFormat;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class NLineSplite {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        String input = "E:/Document/Study/Data/ComparatorNum.txt";
        String output = "E:/Document/Study/Data/output";
        System.setProperty("hadoop.home.dir", "D:/Tools/Office/hadoop-2.6.0");
        Configuration conf = new Configuration();
        // Job(Configuration, String) is deprecated; use Job.getInstance instead
        Job job = Job.getInstance(conf, "NLineSplite");
        // Use NLineInputFormat with 20 lines per split: a 100-line input
        // yields 5 splits and hence 5 mappers
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 20);
        job.setNumReduceTasks(5);
        job.setJarByClass(NLineSplite.class);
        job.setMapperClass(NLineSpliteMap.class);
        job.setPartitionerClass(PartitionerByIndex.class);
        job.setReducerClass(NLineSpliteReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
class NLineSpliteMap extends Mapper<LongWritable, Text, Text, LongWritable> {
    // One random index per mapper instance, used to tag every record of
    // this split so the custom partitioner routes the whole split to a
    // single reducer
    private final int index = (int) (Math.random() * 10);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Emit "line index" as the key and the line's byte offset as the value
        context.write(new Text(line + " " + index), key);
    }
}
class PartitionerByIndex extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        // The key has the form "line index"; extract the index and map it
        // onto one of the numPartitions reducers
        int index = Integer.parseInt(key.toString().split(" ")[1]);
        return Math.abs(index % numPartitions);
    }
}
class NLineSpliteReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> value, Context context)
            throws IOException, InterruptedException {
        // Emit each tagged line once with a count of 1
        context.write(new Text(key.toString()), new LongWritable(1));
    }
}
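The routing arithmetic in the partitioner can be sketched standalone (PartitionDemo is a hypothetical illustration class, not part of the job): each mapper's random tag in [0, 10) is reduced modulo the reducer count.

```java
public class PartitionDemo {
    // Same arithmetic as PartitionerByIndex.getPartition above
    static int getPartition(int index, int numPartitions) {
        return Math.abs(index % numPartitions);
    }

    public static void main(String[] args) {
        int numReducers = 5;
        // A mapper's random tag is drawn from 0..9; show where each lands
        for (int index = 0; index < 10; index++) {
            System.out.println("index " + index + " -> reducer "
                    + getPartition(index, numReducers));
        }
    }
}
```

Note that with 5 reducers, tags 5 to 9 collide with 0 to 4, and because each tag is drawn at random per mapper instance, two mappers may land on the same reducer; the 5 splits are therefore not guaranteed to spread evenly across the 5 reducers.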
When each mapper is instantiated, it generates a random index that serves as the tag for its split; the custom partitioner then uses this tag to send each split's output to a different reducer. The run results are shown below: