Analysis of a MapReduce-Based Hadoop Join Implementation (Part 2)

2009-11-22 17:00

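Last time we discussed a MapReduce-based join implementation and closed with a summary of its weaknesses. The main one is scalability: on the reduce side we used a List to hold every employee record that belongs to a given foreign key. A List can hold at most Integer.MAX_VALUE elements, and in practice the reducer's heap runs out long before that limit, so the implementation fails once the data set is large enough. Optimizing it is therefore well worth the effort.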

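Let us look at the example again. We have two data sets. The first holds the employees of an organization:

Employee ID   Employee name   Address ID
1             Zhang San       1
2             Li Si           2
3             Wang Wu         1
4             Zhao Liu        3
5             Ma Qi           3

The second holds the addresses:

Address ID   Address name
1            Beijing
2            Shanghai
3            Guangzhou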

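Looking back at the first implementation, the point that most needs improving is this: for the values iterator of a given address ID, if the first element were the address record, we would not need to cache all the employee records. Once the address record has been read, everything that follows is an employee record, so we can set each employee's address as it streams past.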
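The idea can be sketched in plain Java, without any Hadoop classes. The Row type, the joinGroup method, and the use of a List as output are assumptions made for illustration only; the sketch relies on the address row arriving before every employee row in the group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch: join one address group without buffering the employee rows. */
class StreamingJoinSketch {

    /** Hypothetical row type: type 2 = address row, type 1 = employee row. */
    static final class Row {
        final int type;
        final String text;

        Row(int type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /**
     * Assumes the address row, if present, comes before all employee rows.
     * Only the current address name is kept in memory.
     */
    static List<String> joinGroup(Iterator<Row> values) {
        List<String> out = new ArrayList<>(); // stands in for the output collector
        String addressName = null;
        while (values.hasNext()) {
            Row row = values.next();
            if (row.type == 2) {
                addressName = row.text;                 // remember the address
            } else {
                out.add(row.text + "," + addressName);  // emit immediately
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> group = Arrays.asList(
                new Row(2, "Beijing"), new Row(1, "Zhang San"), new Row(1, "Wang Wu"));
        System.out.println(joinGroup(group.iterator()));
    }
}
```

Memory use no longer grows with the number of employees per address; the price is that the input order must be guaranteed, which is what the rest of this post arranges.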

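Now let us look back at the partition and shuffle phases of MapReduce. The partitioner splits the map output into chunks according to the number of reducers and routes each record to its reducer. Every partitioner must implement the Partitioner interface and its getPartition method, which returns an int in the range 0 to numOfReducer-1, so that each map output record reaches the right reducer. By default the Hadoop framework uses HashPartitioner, which uses the key's hashCode to decide which reducer a key is sent to.

The shuffle then sorts and groups the records of each partition by key: values that share a key are gathered into one values iterator, and the developer's reduce method is called once per key, in key order, to merge them. Shuffle therefore has to compare keys, both to sort them and to tell whether two keys are equal. This is why every MapReduce key must be comparable; in practice the key class implements WritableComparable and its compareTo() method.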
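As an illustration of the partitioning step, here is a minimal plain-Java sketch of hash partitioning. The class name is made up and no Hadoop class is used; the formula follows what the default HashPartitioner is generally understood to do, so treat it as a sketch rather than the framework's source.

```java
/** Sketch of hash partitioning: map a key to a reducer index in [0, numReduceTasks - 1]. */
class HashPartitionSketch {

    /**
     * Masks off the sign bit so that a negative hash code still yields a
     * non-negative index, then takes the remainder by the reducer count.
     */
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (int locId = 1; locId <= 3; locId++) {
            // Integer.hashCode() is the int value itself
            System.out.println("locId " + locId + " -> reducer "
                    + getPartition(locId, numReduceTasks));
        }
    }
}
```

Because the result depends only on hashCode, two keys with the same hash code always land on the same reducer, a property the custom key in this post depends on.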

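Back to our problem. We want the address record to be sorted to the front. Comparing on locId alone, as we did before, is no longer enough, because locId cannot tell whether a record comes from the address table or the employee table. So we need our own key data structure in which, for the same locId, the address-table record compares as smaller. Because the intermediate map output is written to disk, the key must also implement the Writable interface. The implementation is as follows: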

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    /** Composite key: the join column plus a flag that marks address rows. */
    public class RecordKey implements WritableComparable<RecordKey> {

        int keyId;          // locId, the join column
        boolean isPrimary;  // true for address rows, false for employee rows

        public void readFields(DataInput in) throws IOException {
            this.keyId = in.readInt();
            this.isPrimary = in.readBoolean();
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(keyId);
            out.writeBoolean(isPrimary);
        }

        // Order by keyId first; within one keyId the address row sorts first.
        public int compareTo(RecordKey k) {
            if (this.keyId == k.keyId) {
                if (k.isPrimary == this.isPrimary)
                    return 0;
                return this.isPrimary ? -1 : 1;
            } else
                return this.keyId > k.keyId ? 1 : -1;
        }

        // Hash on keyId only, so all rows for one locId reach the same reducer.
        @Override
        public int hashCode() {
            return this.keyId;
        }
    }

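The method in this key class that needs explaining is compareTo: when two keys have the same keyId, it makes sure the address record compares as smaller than the employee record.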
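To see the effect of this ordering, here is a small plain-Java sketch. The Key class is a stand-in that keeps only the comparison logic of RecordKey, and the "A"/"E" labels in toString are made up for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class KeyOrderSketch {

    /** Plain-Java stand-in for RecordKey, keeping only the comparison logic. */
    static final class Key implements Comparable<Key> {
        final int keyId;
        final boolean isPrimary;

        Key(int keyId, boolean isPrimary) {
            this.keyId = keyId;
            this.isPrimary = isPrimary;
        }

        @Override
        public int compareTo(Key k) {
            if (keyId != k.keyId) {
                return Integer.compare(keyId, k.keyId);
            }
            // Arguments are swapped on purpose: isPrimary == true sorts first.
            return Boolean.compare(k.isPrimary, isPrimary);
        }

        @Override
        public String toString() {
            return keyId + (isPrimary ? "A" : "E"); // A = address, E = employee
        }
    }

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Key> sorted(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(
                new Key(2, false), new Key(1, false), new Key(1, true), new Key(2, true))));
    }
}
```

For every keyId the address key ends up directly in front of the employee keys, which is the order the reducer needs.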

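With this key in place we run into a new problem: the grouping step of the shuffle. By default, grouping uses the key's compareTo() method, so our custom key cannot put the address and the employees that share a locId into the same group (compareTo reports them as unequal). Fortunately the Hadoop framework lets the user define how keys are grouped, through setOutputValueGroupingComparator on JobConf. All we need is our own grouping comparator. Here it is: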

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    /** Grouping comparator: keys with the same keyId go to one reduce() call. */
    public class PkFkComparator extends WritableComparator {

        public PkFkComparator() {
            // createInstances = true: the base class needs key instances to
            // deserialize into before it calls compare(); newer Hadoop releases
            // do not create them with the one-argument constructor.
            super(RecordKey.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            RecordKey key1 = (RecordKey) a;
            RecordKey key2 = (RecordKey) b;
            // isPrimary is ignored on purpose
            if (key1.keyId == key2.keyId) {
                return 0;
            } else
                return key1.keyId > key2.keyId ? 1 : -1;
        }
    }

Here we override the compare method so that two keys with the same keyId are treated as equal.
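The difference between the sort order and the grouping order can be shown with a plain-Java sketch. All names here (GroupingSketch, Key, SORT_ORDER, GROUP_BY_ID, group, sampleKeys) are hypothetical; the group method only imitates how consecutive sorted keys are cut into reduce calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {

    /** Stand-in for RecordKey. */
    static final class Key {
        final int keyId;
        final boolean isPrimary;

        Key(int keyId, boolean isPrimary) {
            this.keyId = keyId;
            this.isPrimary = isPrimary;
        }
    }

    /** Same order as RecordKey.compareTo: keyId first, address row first. */
    static final Comparator<Key> SORT_ORDER = (a, b) ->
            a.keyId != b.keyId ? Integer.compare(a.keyId, b.keyId)
                               : Boolean.compare(b.isPrimary, a.isPrimary);

    /** Same idea as PkFkComparator: only keyId matters. */
    static final Comparator<Key> GROUP_BY_ID = (a, b) -> Integer.compare(a.keyId, b.keyId);

    /** Cuts an already sorted list into runs the comparator treats as equal. */
    static <T> List<List<T>> group(List<T> sorted, Comparator<? super T> grouping) {
        List<List<T>> groups = new ArrayList<>();
        List<T> current = null;
        T previous = null;
        for (T item : sorted) {
            if (current == null || grouping.compare(previous, item) != 0) {
                current = new ArrayList<>();   // a new group, i.e. a new reduce call
                groups.add(current);
            }
            current.add(item);
            previous = item;
        }
        return groups;
    }

    /** Keys for the sample data, already in sort order. */
    static List<Key> sampleKeys() {
        return Arrays.asList(
                new Key(1, true), new Key(1, false), new Key(1, false),
                new Key(2, true), new Key(2, false),
                new Key(3, true), new Key(3, false), new Key(3, false));
    }

    public static void main(String[] args) {
        System.out.println(group(sampleKeys(), SORT_ORDER).size());   // address and employees split
        System.out.println(group(sampleKeys(), GROUP_BY_ID).size());  // one group per locId
    }
}
```

Grouping with the full sort order separates each address from its employees, while grouping by keyId alone keeps them together, with the address still in first position.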

With these two helpers in place, the rest is fairly simple: write the mapper, the reducer, and the main program.
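The mapper and reducer use a Record value class that is not listed in this post. What follows is only a guess at its shape, inferred from the fields and calls used in the code (locId, empId, empName, locationName, type, a copy constructor, toString). The name RecordSketch, the serialization order, the toString format, and the employee and roundTrip helpers are all assumptions. In a real job the class would also implement org.apache.hadoop.io.Writable; that is left out so the sketch compiles without Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for the Record value class used by the mapper and reducer. */
class RecordSketch {
    String empId = "";
    String empName = "";
    String locId = "";
    String locationName = "";
    int type; // 1 = employee row, 2 = address row

    RecordSketch() { }

    /** Copy constructor: Hadoop reuses the value object while iterating. */
    RecordSketch(RecordSketch other) {
        this.empId = other.empId;
        this.empName = other.empName;
        this.locId = other.locId;
        this.locationName = other.locationName;
        this.type = other.type;
    }

    /** Hypothetical factory for an employee row. */
    static RecordSketch employee(String empId, String empName, String locId) {
        RecordSketch r = new RecordSketch();
        r.empId = empId;
        r.empName = empName;
        r.locId = locId;
        r.type = 1;
        return r;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(type);
        out.writeUTF(empId);
        out.writeUTF(empName);
        out.writeUTF(locId);
        out.writeUTF(locationName);
    }

    /** Reads the fields in the same order write() emits them. */
    public void readFields(DataInput in) throws IOException {
        type = in.readInt();
        empId = in.readUTF();
        empName = in.readUTF();
        locId = in.readUTF();
        locationName = in.readUTF();
    }

    @Override
    public String toString() {
        return empId + "," + empName + "," + locationName;
    }

    /** Serializes and deserializes a record, roughly what the shuffle does to it. */
    static RecordSketch roundTrip(RecordSketch r) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            r.write(new DataOutputStream(bytes));
            RecordSketch copy = new RecordSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The string fields default to the empty string because writeUTF rejects null.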

    import java.io.IOException;

    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, RecordKey, Record> {

        public void map(LongWritable key, Text value,
                OutputCollector<RecordKey, Record> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            String[] values = line.split(",");
            // The field count tells address rows from employee rows; other
            // signals (such as the input file name) would work as well.
            if (values.length == 2) {
                // address row: locId,locationName
                Record reco = new Record();
                reco.locId = values[0];
                reco.type = 2;
                reco.locationName = values[1];
                RecordKey recoKey = new RecordKey();
                recoKey.keyId = Integer.parseInt(values[0]);
                recoKey.isPrimary = true;
                output.collect(recoKey, reco);
            } else {
                // employee row: empId,empName,locId
                Record reco = new Record();
                reco.locId = values[2];
                reco.empId = values[0];
                reco.empName = values[1];
                reco.type = 1;
                RecordKey recoKey = new RecordKey();
                recoKey.keyId = Integer.parseInt(values[2]);
                recoKey.isPrimary = false;
                output.collect(recoKey, reco);
            }
        }
    }

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinReducer extends MapReduceBase implements
            Reducer<RecordKey, Record, LongWritable, Text> {

        public void reduce(RecordKey key, Iterator<Record> values,
                OutputCollector<LongWritable, Text> output,
                Reporter reporter) throws IOException {
            // The sort order puts the address row first, so only this one
            // record is held in memory.
            Record thisLocation = new Record();
            while (values.hasNext()) {
                Record reco = values.next();
                if (reco.type == 2) { // 2 is the location
                    // copy it: the framework reuses the value object
                    thisLocation = new Record(reco);
                    System.out.println("location is " + thisLocation.locationName);
                } else { // 1 is employee
                    reco.locationName = thisLocation.locationName;
                    System.out.println("emp is " + reco.toString());
                    output.collect(new LongWritable(0), new Text(reco.toString()));
                }
            }
        }
    }

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class Join {

        /**
         * @param args
         */
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(Join.class);
            conf.setJobName("Join");

            // remove the output of any previous run
            FileSystem fstm = FileSystem.get(conf);
            Path outDir = new Path("/Users/outputtest");
            fstm.delete(outDir, true);

            conf.setOutputFormat(SequenceFileOutputFormat.class);
            conf.setMapOutputKeyClass(RecordKey.class);
            conf.setMapOutputValueClass(Record.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.setMapperClass(JoinMapper.class);
            conf.setReducerClass(JoinReducer.class);
            // group by keyId only, so address and employees share one reduce call
            conf.setOutputValueGroupingComparator(PkFkComparator.class);

            FileInputFormat.setInputPaths(conf, new Path(
                    "/user/input/join"));
            FileOutputFormat.setOutputPath(conf, outDir);
            JobClient.runJob(conf);

            // read the result back and print it
            Path outPutFile = new Path(outDir, "part-00000");
            SequenceFile.Reader reader = new SequenceFile.Reader(fstm, outPutFile,
                    conf);
            org.apache.hadoop.io.Text numInside = new Text();
            LongWritable numOutside = new LongWritable();
            while (reader.next(numOutside, numInside)) {
                System.out.println(numInside.toString() + " "
                        + numOutside.toString());
            }
            reader.close();
        }
    }

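That is the basic program, and it is a fairly complete join implementation. It does nothing about noise in the data: malformed records may cause runtime errors, so the robustness of the program still needs work.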
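One possible direction for handling noise, sketched in plain Java: validate each line before building a key, and skip lines that do not fit either layout. The class and method names are hypothetical, and the rule "two fields means address, three fields means employee" is taken from the mapper above, except that the sketch is stricter and rejects any other field count.

```java
import java.util.Optional;

class LineParserSketch {

    /** Returns the join key (locId) of a well-formed line, or empty for a noisy one. */
    static Optional<Integer> parseLocId(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String[] fields = line.split(",");
        String rawId;
        if (fields.length == 2) {
            rawId = fields[0];          // address row: locId,locationName
        } else if (fields.length == 3) {
            rawId = fields[2];          // employee row: empId,empName,locId
        } else {
            return Optional.empty();    // wrong number of fields
        }
        try {
            return Optional.of(Integer.parseInt(rawId.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();    // locId is not a number
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLocId("1,Beijing"));
        System.out.println(parseLocId("not a record"));
    }
}
```

A mapper could call such a method first and return early, or increment a counter, when the result is empty, instead of letting Integer.parseInt throw.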

Reposted from: http://labs.chinamobile.com/mblog/4110_32505
