[Hadoop] Replicated Join with DistributedCache

Using DistributedCache for a join has one precondition: one of the two data sets must be small enough to fit in memory. The code below shows exactly how that small side is loaded into memory, which also means we can filter records during loading and keep only what the join actually needs. It is worth pointing out, though, that if the file is large, loading it into memory is itself time-consuming.
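If only part of the side file is relevant to the join, the load loop inside setup() (shown in the full listing below) is the natural place to filter. A minimal sketch, assuming a hypothetical rule that only records whose value starts with "VIP" are worth keeping:

// Sketch only: filter while loading the cached file into memory (hypothetical "VIP" rule).
while ((line = br.readLine()) != null) {
    String[] parts = line.split(",", 2);
    if (parts.length == 2 && parts[1].startsWith("VIP")) {
        joinData.put(parts[0], parts[1]);
    }
}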

DistributedCache works by copying the small file to every node, so each map task can read it from the local disk.

In the driver we call DistributedCache.addCacheFile() to register the file to be distributed; then, in the mapper's setup() method, we call DistributedCache.getLocalCacheFiles() to locate the local copy and load it into memory.
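The listing below sticks with the DistributedCache static methods used in Hadoop in Action. On Hadoop 2.x those statics are deprecated in favor of methods on Job and the task context; a rough sketch of the newer equivalents (assuming Hadoop 2.x, not used in the listing below):

// Driver side (Hadoop 2.x): register the side file directly on the Job.
job.addCacheFile(new Path(args[0]).toUri());

// Mapper side: the registered cache files are exposed as URIs.
java.net.URI[] cacheFiles = context.getCacheFiles();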


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class DistributedCacheJoin extends Configured implements Tool{

	
	public static class MyMapper extends Mapper<Text,Text,Text,Text>{
		private HashMap<String,String> joinData = new HashMap<String,String>();
		@Override
		public void map(Text key, Text value, Context context)
			throws IOException, InterruptedException{
			// Look up with toString(): joinData is keyed by String, so a Text key would never match.
			String joinValue = joinData.get(key.toString());
			if(null != joinValue){
				context.write(key, new Text(joinValue + ","+ value.toString()));
			}
		}
		
		@Override
		public void setup(Context context){
			try {
				Path [] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
				if(null != cacheFiles  && cacheFiles.length > 0){
					String line;
					String []tokens;
					BufferedReader br = new BufferedReader(new FileReader(cacheFiles[0].toString()));
					try{
						while((line = br.readLine()) != null){
							tokens = line.split(",", 2);
							if(tokens.length == 2){ // skip malformed lines that have no separator
								joinData.put(tokens[0], tokens[1]);
							}
						}
					}finally{
						br.close();
					}
				}
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}
	@Override
	public int run(String[] args) throws Exception {
		Configuration conf = getConf();
		Job job = new Job(conf,"DistributedCacheJoin");
		job.setJarByClass(DistributedCacheJoin.class);
		
		DistributedCache.addCacheFile(new Path(args[0]).toUri(), job.getConfiguration());
		Path in = new Path(args[1]);
		Path out = new Path(args[2]);
		
		
		FileInputFormat.setInputPaths(job, in);
		FileOutputFormat.setOutputPath(job, out);
		
		job.setMapperClass(MyMapper.class);
		job.setNumReduceTasks(0);
		job.setInputFormatClass(KeyValueTextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		job.getConfiguration()
			.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
		//In the new API the property is no longer key.value.separator.in.input.line; see KeyValueLineRecordReader.java in the Hadoop source.
		return job.waitForCompletion(true) ? 0 : 1;
	}
	
	public static void main(String args[]) throws Exception{
		int res = ToolRunner.run(new Configuration(), new DistributedCacheJoin(), args);
		System.exit(res);
	}
	
}
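For a quick sanity check, assume a hypothetical small file customers.txt (the cached side, args[0]) and a larger file orders.txt (the map input, args[1]), both comma-separated with the join key first:

customers.txt (loaded into memory via DistributedCache):
1,Alice
2,Bob

orders.txt (streamed through the mapper):
1,order-001
2,order-002
1,order-003

expected output (key and joined value separated by TextOutputFormat's default tab):
1	Alice,order-001
2	Bob,order-002
1	Alice,order-003

A hypothetical invocation, with the jar name and HDFS paths as placeholders:

hadoop jar join.jar DistributedCacheJoin /data/customers.txt /data/orders.txt /data/out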

References:

[1] Chuck Lam, Hadoop in Action (Manning).
