MapReduce Jobs with Multiple Data Sources (Part 1): Reduce-Side Join

Scenario: joining multiple tables, equivalent to the following SQL:

select customers.*, orders.* from customers
join orders
on customers.id = orders.id


We use the Hadoop DataJoin contrib package to implement the join.

Three classes need to be extended:

1. DataJoinMapperBase
2. DataJoinReducerBase
3. TaggedMapOutput


How it works:

1. On the mapper side, each input record is wrapped in a TaggedMapOutput, which carries both the data source (the tag) and the value.
2. The map output is therefore no longer a bare value but a record: record = data source (tag) + data value (value).
3. combine() receives a combination: values that come from different data sources but share the same group key.
4. Within a single combine() call, each data source contributes at most one record.

(The original post included a figure illustrating the reduce-side join data flow; a small worked example is given below instead.)
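To make the flow concrete, here is an illustrative trace for group key 3, derived from the sample data below. Each mapper tags its records with the source file name, the shuffle groups the tagged records by the join key, and combine() then merges one record from each source per call.

Tagged map output grouped under key 3:

    tag=Customers.txt   value=3,wangbo,15986854789
    tag=Orders.txt      value=3,A,99,2013-03-05
    tag=Orders.txt      value=3,D,56,2013-06-07

combine() is called once per cross-source combination, producing:

    3    wangbo,15986854789,A,99,2013-03-05
    3    wangbo,15986854789,D,56,2013-06-07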





The example code follows.

Data sources:

Customers.txt

1,wuminggang,13575468248
2,liujiannan,18965235874
3,wangbo,15986854789
4,tom,15698745862

Orders.txt

3,A,99,2013-03-05
1,B,89,2013-02-05
2,C,69,2013-03-09
3,D,56,2013-06-07

Custom class: TaggedWritable

package com.hadoop.data.join;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/*
 * TaggedMapOutput is an abstract type that bundles a tag with the record contents.
 * It is used here as the output value type of DataJoinMapperBase, so it must
 * implement the Writable interface, i.e. the two serialization methods below.
 */
public class TaggedWritable extends TaggedMapOutput {

    private Writable data;

    // Default no-arg constructor: required because Hadoop creates this class
    // via reflection during deserialization (see the note at the end).
    public TaggedWritable() {
        this.tag = new Text();
    }

    public TaggedWritable(Writable data) {
        this.tag = new Text(); // the tag is set later via setTag()
        this.data = data;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        tag.readFields(in);
        // Read the class name written by write() and instantiate the matching
        // Writable, since the runtime class of data is not known here.
        String dataClz = in.readUTF();
        if (this.data == null
                || !this.data.getClass().getName().equals(dataClz)) {
            try {
                this.data = (Writable) ReflectionUtils.newInstance(
                        Class.forName(dataClz), null);
            } catch (ClassNotFoundException e) {
                e.printStackTrace();
            }
        }
        data.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        tag.write(out);
        // Write the concrete class name of data so readFields() can recreate it.
        out.writeUTF(this.data.getClass().getName());
        data.write(out);
    }

    @Override
    public Writable getData() {
        return data;
    }
}


Mapper class: JoinMapper

package com.hadoop.data.join;

import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;

public class JoinMapper extends DataJoinMapperBase {

    // Called once when the task starts, to produce the tag for this input.
    // Here we simply use the input file name as the tag.
    @Override
    protected Text generateInputTag(String inputFile) {
        System.out.println("inputFile = " + inputFile);
        return new Text(inputFile);
    }

    // Extract the group (join) key. The separator is hard-coded as ',' here;
    // more generally, the separator and key position should be configurable.
    @Override
    protected Text generateGroupKey(TaggedMapOutput record) {
        String tag = ((Text) record.getTag()).toString();
        System.out.println("tag = " + tag);
        String line = ((Text) record.getData()).toString();
        String[] tokens = line.split(",");
        return new Text(tokens[0]);
    }

    // Wrap the raw value in a TaggedWritable carrying whatever Text tag we want.
    @Override
    protected TaggedMapOutput generateTaggedMapOutput(Object value) {
        TaggedWritable retv = new TaggedWritable((Text) value);
        retv.setTag(this.inputTag); // do not forget to tag the current value
        return retv;
    }
}

Reducer class: JoinReducer

package com.hadoop.data.join;

import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;

public class JoinReducer extends DataJoinReducerBase {

    // The two argument arrays always have the same length,
    // which is at most the number of data sources.
    @Override
    protected TaggedMapOutput combine(Object[] tags, Object[] values) {
        if (tags.length < 2)
            return null; // this check is what makes it an inner join
        String joinedStr = "";
        for (int i = 0; i < values.length; i++) {
            if (i > 0)
                joinedStr += ","; // separate the two source records with a comma
            TaggedWritable tw = (TaggedWritable) values[i];
            String line = ((Text) tw.getData()).toString();
            // Split the record in two and drop the leading group key.
            String[] tokens = line.split(",", 2);
            joinedStr += tokens[1];
        }
        TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
        retv.setTag((Text) tags[0]); // set the tag on the combined output record
        return retv;
    }
}

Driver class: DataJoinDriver

package com.hadoop.data.join;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DataJoinDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        if (args.length != 2) {
            System.err.println("Usage: DataJoin <input path> <output path>");
            System.exit(-1);
        }
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        JobConf job = new JobConf(conf, DataJoinDriver.class);
        job.setJobName("DataJoin");
        FileSystem hdfs = FileSystem.get(conf);
        FileInputFormat.setInputPaths(job, in);
        // Delete the output directory if it already exists.
        if (hdfs.exists(out)) {
            hdfs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TaggedWritable.class);
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DataJoinDriver(),
                args);
        System.exit(res);
    }
}
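With the three classes packaged into a jar and Customers.txt / Orders.txt uploaded to an HDFS input directory (for example with hadoop fs -put), the job can be launched from the command line. The jar name and paths below are illustrative, and the DataJoin contrib jar that provides DataJoinMapperBase, DataJoinReducerBase and TaggedMapOutput must be on the classpath:

    hadoop jar datajoin-example.jar com.hadoop.data.join.DataJoinDriver /user/hadoop/datajoin/in /user/hadoop/datajoin/out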

OK, that's all there is to it.

The output is as follows (note that customer 4, tom, has no matching order and is therefore dropped by the inner join):

1       wuminggang,13575468248,B,89,2013-02-05
2       liujiannan,18965235874,C,69,2013-03-09
3       wangbo,15986854789,A,99,2013-03-05
3       wangbo,15986854789,D,56,2013-06-07

Note: two parts of TaggedWritable (marked in red in the original post) must not be omitted, even though some reference books leave them out: the no-argument constructor and the writing/reading of the data class name in write()/readFields(). Without them, the reduce-side join with DataJoin fails with the following exception:

java.lang.RuntimeException: java.lang.NoSuchMethodException: com.hadoop.reducedatajoin.ReduceDataJoin$TaggedWritable.<init>()
   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:62)
   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
   at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1271)
   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1211)
   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:249)
   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:245)
   at org.apache.hadoop.contrib.utils.join.DataJoinReducerBase.regroup(DataJoinReducerBase.java:106)
   at org.apache.hadoop.contrib.utils.join.DataJoinReducerBase.reduce(DataJoinReducerBase.java:129)
   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.lang.NoSuchMethodException: com.hadoop.reducedatajoin.ReduceDataJoin$TaggedWritable.<init>()
   at java.lang.Class.getConstructor0(Unknown Source)
   at java.lang.Class.getDeclaredConstructor(Unknown Source)
   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
   ... 11 more

Solution:

http://stackoverflow.com/questions/10201500/hadoop-reduce-side-join-using-datajoin

  You need a default constructor for TaggedWritable (Hadoop uses reflection to create this object, and requires a default constructor (no args)).

  You also have a problem in your readFields method: you call data.readFields(in) on the writable interface - but it has no knowledge of the actual runtime class of data.

  I suggest you either write out the data class name before outputting the data object itself, or look into the GenericWritable class (you'll need to extend it to define the set of allowable writable classes that can be used).

  So you could amend as follows:

In short: TaggedWritable needs a default no-argument constructor, because Hadoop instantiates it via reflection. In addition, readFields() cannot know the runtime class of data on its own, so write() must record the data class name before the data itself (as done in the code above), or the data field can be wrapped in a GenericWritable subclass that declares the set of allowed payload classes; a sketch of that alternative follows.
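For completeness, here is a minimal sketch of the GenericWritable alternative suggested in the answer, assuming the payload is always Text. The class name DataField is illustrative and not part of the original code; TaggedWritable's data field could be declared as this type, and its write()/readFields() would then simply delegate to it instead of writing the class name by hand.

package com.hadoop.data.join;

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Sketch only: GenericWritable serializes a compact index into getTypes()
// instead of the full class name, so only the listed classes can be carried.
public class DataField extends GenericWritable {

    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] TYPES =
            new Class[] { Text.class };

    public DataField() {
        // no-arg constructor, required for reflection-based instantiation
    }

    public DataField(Writable data) {
        set(data); // wrap the actual payload
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return TYPES;
    }
}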

Advantages: a common, straightforward approach that works well when the data to be processed is relatively small.

Disadvantages: not very efficient. All records are shuffled and re-sorted, yet most of them are discarded on the reduce side after the shuffle; filtering out unneeded data already in the map phase would improve efficiency. See MapReduce Jobs with Multiple Data Sources (Part 2).

