Using a Partitioner in MapReduce: A Source Code Example

I. Environment

1. Hadoop 0.20.2

2. Operating system: Linux

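II. Background

1. Why use a Partitioner? Mainly so that the reduce results can be classified a second time by key and written to different output files.

2. The results are easier to read, and the split also gives a simple statistical breakdown of the data.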

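III. Implementation

1. The input data file contains the following records, with fields separated by tabs (one record is too short, one is too long, and three are normal):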
kaka 1 28
hua 0 26
chao 1
tao 1 22
mao 0 29 22

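2. The goal is to write the results separately: normal records go to one file, records that are too short go to a second file, and records that are too long go to a third, for three output files in total.

3. The code is as follows: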

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyPartitioner {
	/**
	 * Input text, fields separated by tabs:
	 * kaka    1       28
	 * hua     0       26
	 * chao    1
	 * tao     1       22
	 * mao     0       29      22
	 */

	
	public static class MyPartitionerMap extends Mapper<LongWritable, Text, Text, Text> {

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws java.io.IOException, InterruptedException {
			// Classify each record by its number of tab-separated fields.
			String[] fields = value.toString().split("\t");
			if (fields.length > 3) {
				context.write(new Text("long"), value);
			} else if (fields.length < 3) {
				context.write(new Text("short"), value);
			} else {
				context.write(new Text("right"), value);
			}
		}
	}

	/**
	 * The partitioner's input is the map output.
	 *
	 * @author Administrator
	 */
	public static class MyPartitionerPar extends Partitioner<Text, Text> {
	
		@Override
		public int getPartition(Text key, Text value, int numPartitions) {
			int result = 0;
			// Convert the key with toString() before comparing it to a String literal:
			// key.equals("long") compares a Text with a String and is always false.
			// I left out toString() at first, so every record fell into one partition
			// and all results ended up in a single reduce output file.
			if (key.toString().equals("long")) {
				result = 0 % numPartitions;
			} else if (key.toString().equals("short")) {
				result = 1 % numPartitions;
			} else if (key.toString().equals("right")) {
				result = 2 % numPartitions;
			}
			return result;
		}
	}

	public static class MyPartitionerReduce extends Reducer<Text, Text, Text, Text> {
		@Override
		protected void reduce(Text key, Iterable<Text> values, Context context)
				throws java.io.IOException, InterruptedException {
			// Write every record unchanged; the partitioner has already chosen the reduce task.
			for (Text val : values) {
				context.write(key, val);
			}
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if (otherArgs.length != 2) {
			System.err.println("Usage: MyPartitioner <in> <out>");
			System.exit(2);
		}
		Job job = new Job(conf, "MyPartitioner");
		job.setNumReduceTasks(5);
		
		job.setJarByClass(MyPartitioner.class);
		
		job.setMapperClass(MyPartitionerMap.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		
		job.setPartitionerClass(MyPartitionerPar.class);
		job.setReducerClass(MyPartitionerReduce.class);
		
		// The reducer emits Text keys, so declare Text (not NullWritable) as the output key class.
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}

4. The output is split according to the key. Although 5 reduce tasks were configured, only 3 of the reduce output files contain data; the other files are empty.
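The routing of the five sample records can be traced without a cluster. The following is a minimal plain-Java sketch, not part of the original job: the class name PartitionTraceSketch and its helper methods are hypothetical, the keys are plain String values instead of Text, and the part-r-NNNNN file naming is assumed to be the default for reduce outputs.

```java
// Plain-Java trace of the job above; no Hadoop classes are used.
class PartitionTraceSketch {

    // Same rule as the mapper: the field count decides the routing key.
    static String classify(String line) {
        int fields = line.split("\t").length;
        if (fields > 3) return "long";
        if (fields < 3) return "short";
        return "right";
    }

    // Same branches as getPartition, written with plain String keys.
    static int partitionFor(String key, int numPartitions) {
        if (key.equals("short")) return 1 % numPartitions;
        if (key.equals("right")) return 2 % numPartitions;
        return 0; // "long", and any key the branches do not recognize
    }

    // Assumption: reduce outputs use the default part-r-NNNNN naming.
    static String outputFileFor(int partition) {
        return String.format("part-r-%05d", partition);
    }

    public static void main(String[] args) {
        String[] lines = {
            "kaka\t1\t28", "hua\t0\t26", "chao\t1", "tao\t1\t22", "mao\t0\t29\t22"
        };
        int numReduceTasks = 5; // matches job.setNumReduceTasks(5)
        for (String line : lines) {
            String key = classify(line);
            int partition = partitionFor(key, numReduceTasks);
            System.out.println(key + " -> " + outputFileFor(partition) + " : " + line);
        }
    }
}
```

Under these assumptions, partitions 3 and 4 never receive a record, which is why two of the five files stay empty. With fewer than 3 reduce tasks the modulo would make categories share a file; for example, with 2 tasks "right" maps to partition 0 together with "long".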

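IV. Summary

1. The main purpose of a partitioner is to classify the output by key. In the example above, three kinds of records were written to three separate result files.

2. The partitioner's input <k,v> is the map output <k,v>.

3. Note that the partitioner decides which reduce task receives each map output record, so it partitions the data that flows into the reducers and not merely the output files. The code in the partitioner can be replaced with the following hash-based rule, which is the same rule Hadoop's default HashPartitioner uses: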

return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;

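4. Writing the partitioner as in the code above becomes verbose, redundant, and inefficient when there are many conditions. If there are many conditions and there is no strict requirement that each one goes to its own file, the hash-based rule above can be used instead. Several keys may then share one reduce partition and therefore one file, but the output within that file is still sorted by key.

5. Pay special attention to the point stressed in the code comment: the key must be converted with toString() before it is compared with a String, as in key.toString().equals("long"). I left this out at first, so every record landed in the same partition and all results ended up in one reduce output file.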
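The hash-based rule from point 3 masks the hash code before taking the modulo, and a small experiment shows why. This is a minimal plain-Java sketch: the class name HashPartitionSketch is hypothetical, and it uses String and Integer keys, whose hash codes are not the same as those of Hadoop's Text, so the partition numbers printed for the string keys are only illustrative.

```java
// Plain-Java illustration of the hash-based partition rule; no Hadoop classes are used.
class HashPartitionSketch {

    // Clears the sign bit so that the result always lies in [0, numPartitions).
    static int hashPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 5;

        // Assumption: String keys stand in for Text keys here.
        for (String key : new String[] {"long", "short", "right"}) {
            System.out.println(key + " -> partition " + hashPartition(key, numPartitions));
        }

        // Integer.hashCode() returns the value itself, which gives a negative hash to test with.
        Integer negative = -7;
        System.out.println("plain modulo:  " + (negative.hashCode() % numPartitions)); // -2
        System.out.println("masked modulo: " + hashPartition(negative, numPartitions)); // 1
    }
}
```

A plain modulo on a negative hash code yields a negative number, which is not a valid partition index; the mask avoids that. The trade-off against the explicit if/else version is that the key-to-file mapping is no longer chosen by hand, and different keys may land in the same partition.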
