hadoop join, part 2

Before going through this example, please first read: http://bjyjtdj.iteye.com/blog/1453410

A reduce-side join is the simplest way to implement a join. The main idea is as follows:

In the map phase, the map function reads both files, File1 and File2. To distinguish key/value pairs coming from the two sources, each record is given a tag, e.g. tag=0 means the record comes from File1 and tag=2 means it comes from File2; in other words, the map phase's main job is to tag records according to the file they came from. In the reduce phase, the reduce function receives, for each key, the value list gathered from both File1 and File2, and joins the File1 and File2 records for that key (a cartesian product); that is, the reduce phase performs the actual join. A minimal sketch of such a tagging mapper follows the sample data below. In this example we assume the following two data files:

The user.csv file:

"ID","NAME","SEX"
"1","user1","0"
"2","user2","0"
"3","user3","0"
"4","user4","1"
"5","user5","0"
"6","user6","0"
"7","user7","1"
"8","user8","0"
"9","user9","0"

The order.csv file:

"USER_ID","NAME"
"1","order1"
"2","order2"
"3","order3"
"4","order4"
"7","order7"
"8","order8"
"9","order9"


Most of the examples available online are written against the pre-0.20 API, so here we use the new API instead. The complete code is as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyJoin
{
    public static class MapClass extends 
        Mapper<LongWritable, Text, Text, Text>
    {

        //define reusable Text objects outside the map method to avoid creating new objects on every call
        private Text key = new Text();
        private Text value = new Text();
        private String[] keyValue = null;
        
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
        {
            //the input format is TextInputFormat: the file is split into lines
            //terminated by a newline; the key is the byte offset of each line
            //(LongWritable) and the value is the line's content (Text),
            //so we have to parse the join key out of the value ourselves
            keyValue = value.toString().split(",", 2);
            this.key.set(keyValue[0]);
            this.value.set(keyValue[1]);
            context.write(this.key, this.value);
        }
        
    }
    
    public static class Reduce extends Reducer<Text, Text, Text, Text>
    {

        //define reusable Text objects outside the reduce method to avoid creating new objects on every call
        private Text value = new Text();
        
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException
        {
            StringBuilder valueStr = new StringBuilder();
            
            //each element of values is a value with the same key coming from
            //one of the input files, i.e. the collection of map output values
            //grouped under this key (the iteration order is not guaranteed)
            for(Text val : values)
            {
                valueStr.append(val);
                valueStr.append(",");
            }
            
            this.value.set(valueStr.deleteCharAt(valueStr.length()-1).toString());
            context.write(key, this.value);
        }
        
    }
    
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "MyJoin");
        
        job.setJarByClass(MyJoin.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        //job.setCombinerClass(Reduce.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        
        //use TextInputFormat and TextOutputFormat as the input and output formats;
        //these are also Hadoop's defaults when nothing is set explicitly
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
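
To try the job, package it into a jar and submit it with the hadoop command. The jar name and HDFS paths below are just placeholders:

hadoop jar myjoin.jar MyJoin /user/hadoop/join/input /user/hadoop/join/output

With the sample files above, each output line contains the key and, separated by a tab (the TextOutputFormat default), the values from both files joined with commas. The order of the values on a line is not guaranteed, so for key "1" the result might look like:

"1"	"order1","user1","0"

Note that the CSV header lines ("ID",... and "USER_ID",...) are treated as ordinary records and show up in the output too, and keys such as "5" and "6", which only exist in user.csv, still produce a line containing just the user columns.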


 
