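With Hive a join is a single statement, but sometimes it has to be written directly in MapReduce.

Methods

1. Overview

In traditional databases (e.g. MySQL), JOIN is a very common and very expensive operation. The same holds for JOIN on Hadoop, and because of Hadoop's particular design, joins there call for some special techniques.
This article first describes the usual ways to implement JOIN on Hadoop, then gives several optimizations for different kinds of input datasets.

2. Common join methods

Assume the data to be joined comes from two files, File1 and File2.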

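2.1 reduce side join

Reduce side join is the simplest join method. The main idea is as follows:
In the map phase, the map function reads both File1 and File2. To tell the key/value pairs from the two sources apart, it attaches a tag to each record, e.g. tag=0 for records from File1 and tag=2 for records from File2. In other words, the main job of the map phase is to tag records by source file.
In the reduce phase, the reduce function receives, for each key, the value list containing the records from both File1 and File2, and joins the File1 records with the File2 records for that key (a Cartesian product). In other words, the actual join happens in the reduce phase.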
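The two phases can be sketched in plain Python without Hadoop. This is only an illustration of the idea under stated assumptions: the function names, the in-memory "shuffle" dictionary, and the sample rows are made up for this sketch, and the tag values 0 and 2 simply follow the example in the text.

```python
from collections import defaultdict
from itertools import product

TAG_FILE1 = 0  # tag for records from File1 (value is arbitrary)
TAG_FILE2 = 2  # tag for records from File2 (value is arbitrary)

def map_phase(records, tag):
    """Emit (key, (tag, rest)) pairs; the tag records which file a row came from."""
    for key, *rest in records:
        yield key, (tag, tuple(rest))

def reduce_side_join(file1, file2):
    # "Shuffle": group all tagged values by key, as the framework would.
    groups = defaultdict(list)
    tagged_pairs = list(map_phase(file1, TAG_FILE1)) + list(map_phase(file2, TAG_FILE2))
    for key, tagged in tagged_pairs:
        groups[key].append(tagged)

    # "Reduce": split each group by tag, then take the Cartesian product.
    joined = []
    for key in sorted(groups):
        left = [rest for tag, rest in groups[key] if tag == TAG_FILE1]
        right = [rest for tag, rest in groups[key] if tag == TAG_FILE2]
        for l, r in product(left, right):
            joined.append((key,) + l + r)
    return joined

# Hypothetical sample rows: (key, name) and (key, course, grade)
file1 = [("01", "name1"), ("02", "name2"), ("03", "name3")]
file2 = [("01", "01", "80"), ("01", "02", "90"), ("02", "01", "82")]
result = reduce_side_join(file1, file2)
```

Key "03" has no File2 records, so its Cartesian product is empty and it produces no output row.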

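2.2 map side join

Reduce side join exists because the map phase cannot see all the fields needed for the join: the records for one key may be spread across different map tasks. Reduce side join is very inefficient, because the shuffle phase has to move a large amount of data.
Map side join is an optimization for the following scenario: of the two tables to be joined, one is very large and the other is small enough to fit in memory. The small table can then be replicated so that every map task holds a copy in memory (for example in a hash table), and only the large table is scanned: for each key/value record of the large table, look the key up in the hash table, and if a record with the same key exists, output the joined result.
To support replicating files, Hadoop provides the DistributedCache class. It is used as follows:
(1) The user calls the static method DistributedCache.addCacheFile() to name the file to replicate. Its argument is the file's URI (for a file on HDFS it can look like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port you configured). Before the job starts, the JobTracker reads this URI list and copies the files to the local disk of each TaskTracker.
(2) The user calls DistributedCache.getLocalCacheFiles() to get the local paths of the files, and reads them with the standard file I/O API.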
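The probe-the-hash-table idea can be sketched in plain Python as follows. This is a minimal illustration rather than Hadoop code: the function names and sample rows are hypothetical, and the small table is assumed to have unique keys.

```python
def load_small_table(rows):
    """Build the in-memory hash table that every map task would hold."""
    table = {}
    for key, *rest in rows:
        table[key] = tuple(rest)  # assumption: keys in the small table are unique
    return table

def map_side_join(big_rows, small_table):
    """Scan the big table once, probing the hash table for each record."""
    for key, *rest in big_rows:
        match = small_table.get(key)
        if match is not None:
            yield (key,) + match + tuple(rest)

# Hypothetical sample rows
small = load_small_table([("1", "Beijing"), ("2", "Tianjin"), ("4", "Shanxi")])
big = [("1", "2010", "1962"), ("2", "2011", "1355"), ("9", "2010", "2303")]
joined = list(map_side_join(big, small))
```

No shuffle or reduce step is involved: each big-table record is joined where it is read, which is why this approach avoids the network cost of a reduce side join.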

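2.3 SemiJoin

SemiJoin, also called semi-join, is a technique borrowed from distributed databases. Its motivation: in a reduce side join the volume of data moved between machines is very large and becomes the bottleneck of the join. If records that will not take part in the join can be filtered out on the map side, a great deal of network I/O is saved.
The implementation is simple. Pick the small table, say File1, extract the keys that take part in the join, and save them to a file File3. File3 is usually small enough to fit in memory. In the map phase, use DistributedCache to copy File3 to every TaskTracker, then drop the File2 records whose keys are not in File3. The remaining reduce-phase work is the same as in reduce side join.
For more on semi-joins, see: http://wenku.baidu.com/view/ae7442db7f1922791688e877.html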
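The filtering step can be sketched in plain Python. This is an illustrative sketch only: the function names and sample rows are assumptions, and File3 is modeled as an in-memory set rather than a file shipped through DistributedCache.

```python
def extract_join_keys(small_rows):
    """Build File3: the set of join keys that occur in the small table."""
    return {row[0] for row in small_rows}

def semi_join_filter(big_rows, join_keys):
    """Map-side filter: keep only big-table rows whose key appears in File3."""
    return [row for row in big_rows if row[0] in join_keys]

# Hypothetical sample rows
file1 = [("1", "Beijing"), ("2", "Tianjin")]
file2 = [
    ("1", "2010", "1962"),
    ("9", "2010", "2303"),
    ("2", "2011", "1355"),
    ("9", "2011", "2347"),
]
file3 = extract_join_keys(file1)
survivors = semi_join_filter(file2, file3)  # only these rows would enter the shuffle
```

Here half of File2 is dropped before the shuffle; the surviving rows then go through the ordinary reduce side join.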

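2.4 reduce side join + BloomFilter

In some cases the key set that SemiJoin extracts from the small table still does not fit in memory. A BloomFilter can then be used to save space.
The most common use of a BloomFilter is to test whether an element belongs to a set. Its two most important methods are add() and contains(). Its defining property is that it has no false negatives: if contains() returns false, the element is definitely not in the set. It does have some false positives: if contains() returns true, the element is only probably in the set.
The keys of the small table can therefore be stored in a BloomFilter and used to filter the large table in the map phase. Some records that are not in the small table may slip through (records that are in the small table are never filtered out), but that is harmless; it only adds a small amount of network I/O.
For more on BloomFilter, see: http://blog.csdn.net/jiaomeng/article/details/1495500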
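A toy Bloom filter with the add() and contains() methods described above might look like the sketch below. The bit-array size, the number of hash functions, and the use of seeded SHA-256 digests are arbitrary choices made for this illustration; a production job would normally use a library implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Hypothetical small-table keys and big-table rows
small_keys = ["1", "2", "4"]
bf = BloomFilter()
for k in small_keys:
    bf.add(k)

big_rows = [("1", "2010"), ("9", "2010"), ("4", "2011"), ("7", "2011")]
kept = [row for row in big_rows if bf.contains(row[0])]
```

Rows with keys "1" and "4" are guaranteed to be kept. Rows with keys "9" and "7" are dropped unless they happen to be false positives, in which case the reduce phase simply finds no match for them.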

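3. Secondary sort

By default Hadoop sorts by key. What if the values need to be sorted too, so that for a given key the value list received by reduce() is ordered by value? This need is common in joins; for example, within one key you may want the small table's values to come first.
There are two ways to do a secondary sort: buffer and in memory sort, and value-to-key conversion.
With buffer and in memory sort, the idea is to collect all values for a key inside reduce() and then sort them. The main drawback is that it may run out of memory.
With value-to-key conversion, the idea is to concatenate the key and part of the value into a composite key (by implementing the WritableComparable interface or by calling setSortComparatorClass), so that what reduce receives is sorted first by key and then by value. Note that the user must implement a custom Partitioner so that data is still partitioned by the original key only. Hadoop supports secondary sort explicitly: the Job class has a setGroupingComparatorClass() method that sets the comparator used to group keys into a single reduce() call.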
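The interplay of the three pieces in value-to-key conversion (partition on the natural key, sort on the composite key, group on the natural key) can be simulated in plain Python. Everything below is a sketch under stated assumptions: the function names, the CRC32-based partitioner, and the sample records are invented for illustration and are not Hadoop APIs.

```python
import zlib
from itertools import groupby

def partition(composite_key, num_reducers):
    # Partitioner: look at the natural key only, so all records of one key
    # reach the same reducer regardless of the tag.
    natural_key, _tag = composite_key
    return zlib.crc32(natural_key.encode("utf-8")) % num_reducers

def shuffle(map_output, num_reducers):
    buckets = [[] for _ in range(num_reducers)]
    for composite_key, value in map_output:
        buckets[partition(composite_key, num_reducers)].append((composite_key, value))
    for bucket in buckets:
        # Sort comparator: order by the full composite key (natural key, then tag).
        bucket.sort(key=lambda kv: kv[0])
    return buckets

def reduce_groups(bucket):
    # Grouping comparator: one reduce call per natural key.
    for natural_key, group in groupby(bucket, key=lambda kv: kv[0][0]):
        yield natural_key, [value for _composite_key, value in group]

# Composite key = (natural key, tag); tag 0 marks the small table.
map_output = [
    (("02", 1), "grade:82"),
    (("01", 1), "grade:80"),
    (("01", 0), "name1"),
    (("02", 0), "name2"),
    (("01", 1), "grade:90"),
]
grouped = {}
for bucket in shuffle(map_output, 2):
    grouped.update(reduce_groups(bucket))
```

Within each group the small-table value (tag 0) arrives first, so a reducer can read it once and stream the remaining values without buffering the whole group.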

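reduce-side join in Python

Hadoop has a tool called streaming that supports Python, shell, C++, PHP, and any other language that can read standard input (stdin) and write standard output (stdout). How it works is easiest to explain by comparing it with a standard Java map-reduce program:

A Map-reduce program written in native Java

Hadoop prepares the data and passes it to the Java map program
The Java map program processes the data and outputs O1
Hadoop partitions and sorts O1, then sends it to the various reduce machines
Each reduce machine passes the data it receives to the reduce program
The reduce program processes the data and outputs the final data O2

A Map-reduce program written in Python using hadoop streaming

Hadoop prepares the data and passes it to the Java map program
The Java map program turns the data into key/value pairs and passes them to the Python map program
The Python map program processes the data and returns the result to the Java map program
The Java map program outputs the data as O1
Hadoop partitions and sorts O1, then sends it to the various reduce machines
Each reduce machine turns the data it receives into key/value pairs and passes them to the Python reduce program
The Python reduce program processes the data and returns the result to the Java reduce program
The Java reduce program processes the data and outputs the final data O2

Comparing the map steps and the reduce steps of the two flows shows that the streaming program has an extra intermediate hop on each side, so in principle a streaming program should be less efficient than the Java version. However, Python's development efficiency, and occasionally even its runtime performance, can exceed Java's; that is the advantage of streaming.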

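The requirement: joining datasets on Hadoop

Hadoop is used for data analysis, which mostly means operating on datasets. Joining datasets so that one dataset picks up the matching information from another is therefore a very common need.

Take the following requirement. There are two datasets: student info (student ID, name) and student grades (student ID, course, grade). They share the key "student ID", and we want to combine them into (student ID, name, course, grade). As a formula:

(student ID, name) join (student ID, course, grade) = (student ID, name, course, grade)

Sample data 1, student info:

sno	name
01	name1
02	name2
03	name3
04	name4

Sample data 2, student grades:

sno	courseno	grade
01	01	80
01	02	90
02	01	82
02	02	95

Expected final output:

sno	name	courseno	grade
01	name1	01	80
01	name1	02	90
02	name2	01	82
02	name2	02	95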

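Things to watch for and common pitfalls when implementing a join

If you want to write a complete, robust map reduce program, I suggest you first pin down the input format and the output format, then build input data by hand and work out the output by hand. In the process you will discover the places that need special handling in the program:

Which key the join uses, and whether it is one field or two. In this example the key is sno, a single field.
Whether a key may repeat within each dataset. In this example keys in dataset 1 are unique, while keys in dataset 2 may repeat.
Whether a key may have no matching record in a dataset. In this example some students have no grades, so a key may be absent from dataset 2.

The first point affects the key.fields and partition settings in the Hadoop launch script; the second affects how the map-reduce code itself is written; the third likewise affects how the code is written.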
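One way to answer the second and third questions before writing the job is to profile a sample of the inputs. The helper below is a hypothetical sketch (the function name and report fields are made up for this illustration) that reports duplicate keys and keys present on only one side.

```python
from collections import Counter

def profile_join_inputs(info_rows, grade_rows, key_index=0):
    """Report duplicate keys and unmatched keys for two datasets to be joined."""
    info_keys = Counter(row[key_index] for row in info_rows)
    grade_keys = Counter(row[key_index] for row in grade_rows)
    return {
        "duplicate_info_keys": sorted(k for k, n in info_keys.items() if n > 1),
        "duplicate_grade_keys": sorted(k for k, n in grade_keys.items() if n > 1),
        "info_without_grade": sorted(set(info_keys) - set(grade_keys)),
        "grade_without_info": sorted(set(grade_keys) - set(info_keys)),
    }

# The sample data from this article
info = [("01", "name1"), ("02", "name2"), ("03", "name3"), ("04", "name4")]
grades = [("01", "01", "80"), ("01", "02", "90"), ("02", "01", "82"), ("02", "02", "95")]
report = profile_join_inputs(info, grades)
```

For the sample data the report confirms the assumptions stated above: student keys are unique, grade keys repeat, and students 03 and 04 have no grades.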

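The approach to implementing a join on Hadoop

The idea is to attach a numeric label to each data source. After Hadoop sorts the records, records with the same key sit next to each other and are ordered by label, so merging adjacent records that share a key gives the result directly.

1. map phase: label table 1 and table 2, which just means emitting one extra field. For example, table 1 gets label 0 and table 2 gets label 1;

2. partition phase: sort and partition using the student ID key as the primary key and the label as the secondary key

3. reduce phase: since the records are already sorted by primary and secondary key, merge adjacent records with the same key and output them

Python map and reduce code for the join

mapper.py: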

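# -*- coding: utf-8 -*-
# mapper.py
import os
import sys

# Mapper script
def mapper():
    # Get the name of the file currently being processed. There are two
    # input files, so we need to tell them apart. Hadoop streaming exposes
    # the path as map_input_file (mapreduce_map_input_file on newer releases).
    filepath = os.environ.get("map_input_file") or os.environ.get("mapreduce_map_input_file", "")
    filename = os.path.split(filepath)[-1]
    for line in sys.stdin:
        if line.strip() == "":
            continue
        fields = line.rstrip("\n").split("\t")
        sno = fields[0]
        # Branch on filename: each file has different fields and gets a different label
        if filename == 'data_info':
            name = fields[1]
            # '0' is the label attached to every record from data source 1
            print('\t'.join((sno, '0', name)))
        elif filename == 'data_grade':
            courseno = fields[1]
            grade = fields[2]
            # '1' is the label attached to every record from data source 2
            print('\t'.join((sno, '1', courseno, grade)))

if __name__ == '__main__':
    mapper()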

reducer.py:

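# -*- coding: utf-8 -*-
# reducer.py
import sys

def reducer():
    # Remember the previous sno so we can tell when a new key starts
    lastsno = ""
    name = ""
    for line in sys.stdin:
        if line.strip() == "":
            continue
        fields = line.rstrip("\n").split("\t")
        sno = fields[0]
        # Approach:
        # when the current key differs from the previous one and label == "0",
        # remember the name; when the current key equals the previous one and
        # label == "1", output this record's courseno and grade together with
        # the remembered name as a final result row.
        if sno != lastsno:
            name = ""
            # label == "1" is not handled here: sno != lastsno with label == "1"
            # means this key has no record from data source 1
            if fields[1] == "0":
                name = fields[2]
        elif sno == lastsno:
            # label == "0" is not handled here: sno == lastsno with label == "0"
            # would mean a duplicate key in data source 1, which this example rules out
            if fields[1] == "1":
                courseno = fields[2]
                grade = fields[3]
                if name:
                    print('\t'.join((lastsno, name, courseno, grade)))
        lastsno = sno

if __name__ == '__main__':
    reducer()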

Launching the Hadoop job from a shell script:

# Remove the output directory first
~/hadoop-client/hadoop/bin/hadoop fs -rmr /hdfs/jointest/output

# Note: the environment-specific values below differ from machine to machine
~/hadoop-client/hadoop/bin/hadoop streaming \
    -D mapred.map.tasks=10 \
    -D mapred.reduce.tasks=5 \
    -D mapred.job.map.capacity=10 \
    -D mapred.job.reduce.capacity=5 \
    -D mapred.job.name="join--sno_name-sno_courseno_grade" \
    -D num.key.fields.for.partition=1 \
    -D stream.num.map.output.key.fields=2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input "/hdfs/jointest/input/*" \
    -output "/hdfs/jointest/output" \
    -mapper "python26/bin/python26.sh mapper.py" \
    -reducer "python26/bin/python26.sh reducer.py" \
    -file "mapper.py" \
    -file "reducer.py" \
    -cacheArchive "/share/python26.tar.gz#python26"

# Check whether the job succeeded; an exit status of 0 means success
echo $?

You can build input and output data by hand to test it; this program has been verified.
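A quick way to run such a hand-built test without a cluster is to imitate the map, sort, and reduce steps locally. The sketch below is an assumption-laden stand-in, not the scripts themselves: map_lines and reduce_lines are hypothetical helpers that mirror the labeling and adjacent-record merging logic, and Python's plain string sort stands in for Hadoop's sort on the first two fields.

```python
def map_lines(lines, label):
    """Mimic the mapper: insert the source label right after the key field."""
    for line in lines:
        fields = line.split("\t")
        yield "\t".join([fields[0], label] + fields[1:])

def reduce_lines(sorted_lines):
    """Mimic the reducer: merge adjacent records that share a key."""
    last_sno, name = None, ""
    for line in sorted_lines:
        fields = line.split("\t")
        sno, label = fields[0], fields[1]
        if sno != last_sno:
            name = ""          # new key: forget the previous student's name
            last_sno = sno
        if label == "0":
            name = fields[2]   # student-info record: remember the name
        elif label == "1" and name:
            yield "\t".join((sno, name, fields[2], fields[3]))

# The sample data from this article
info = ["01\tname1", "02\tname2", "03\tname3", "04\tname4"]
grades = ["01\t01\t80", "01\t02\t90", "02\t01\t82", "02\t02\t95"]

# Stand-in for the shuffle: sort all mapper output lines.
shuffled = sorted(list(map_lines(info, "0")) + list(map_lines(grades, "1")))
output = list(reduce_lines(shuffled))
```

The resulting rows can be compared line by line with the expected final output listed earlier; students 03 and 04 produce nothing because they have no grade records.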

Data preparation

First, prepare the data. This is a routine step by now: set up the sample data and note its path and field delimiter.

Prepare the following two tables:

(1) m_ys_lab_jointest_a (table A below)

Table creation statement:

create table if not exists m_ys_lab_jointest_a (
     id bigint,
     name string
)
row format delimited
fields terminated by '9'
lines terminated by '10'
stored as textfile;

Data:

id     name
1     北京
2     天津
3     河北
4     山西
5     內蒙古
6     遼寧
7     吉林
8     黑龍江

(2) m_ys_lab_jointest_b (table B below)

Table creation statement:

create table if not exists m_ys_lab_jointest_b (
     id bigint,
     statyear bigint,
     num bigint
)
row format delimited
fields terminated by '9'
lines terminated by '10'
stored as textfile;

Data:

id statyear     num
1     2010     1962
1     2011     2019
2     2010     1299
2     2011     1355
4     2010     3574
4     2011     3593
9     2010     2303
9     2011     2347

The goal is to join on id as the key and obtain the following table:

m_ys_lab_jointest_ab

id     name statyear     num
1       北京    2011    2019
1       北京    2010    1962
2       天津    2011    1355
2       天津    2010    1299
4       山西    2011    3593
4       山西    2010    3574

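Computation model

The whole computation proceeds as follows:

(1) In the map phase, every record is turned into a <key, value> pair. The key is id; the value depends on the source: for a record from table A it is "a#"+name; for a record from table B it is "b#"+statyear+DELIMITER+num.

(2) In the reduce phase, first split the value list under each key into the part from table A and the part from table B and put them into two vectors. Then iterate over the two vectors to form their Cartesian product, which yields the final result rows one by one.

Code

The code is as follows: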

import java.io.IOException;
import java.util.Iterator;
import java.util.Vector;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/**
 * Join implemented with MapReduce
 */
public class MapRedJoin {
    public static final String DELIMITER = "\u0009"; // field delimiter (tab)

    // map phase
    public static class MapClass extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, Text> {
        public void configure(JobConf job) {
            super.configure(job);
        }

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException, ClassCastException {
            // Full path and name of the input file
            String filePath = ((FileSplit)reporter.getInputSplit()).getPath().toString();
            // The record as a string
            String line = value.toString();
            // Skip empty records
            if (line == null || line.equals("")) return;
            // Records from table A
            if (filePath.contains("m_ys_lab_jointest_a")) {
                String[] values = line.split(DELIMITER); // split into fields
                if (values.length < 2) return;
                String id = values[0]; // id
                String name = values[1]; // name
                output.collect(new Text(id), new Text("a#"+name));
            }
            // Records from table B
            else if (filePath.contains("m_ys_lab_jointest_b")) {
                String[] values = line.split(DELIMITER); // split into fields
                if (values.length < 3) return;
                String id = values[0]; // id
                String statyear = values[1]; // statyear
                String num = values[2]; // num
                output.collect(new Text(id), new Text("b#"+statyear+DELIMITER+num));
            }
        }
    }

    // reduce phase
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            Vector<String> vecA = new Vector<String>(); // values from table A
            Vector<String> vecB = new Vector<String>(); // values from table B
            while (values.hasNext()) {
                String value = values.next().toString();
                if (value.startsWith("a#")) {
                    vecA.add(value.substring(2));
                } else if (value.startsWith("b#")) {
                    vecB.add(value.substring(2));
                }
            }
            int sizeA = vecA.size();
            int sizeB = vecB.size();
            // Iterate over both vectors (Cartesian product)
            int i, j;
            for (i = 0; i < sizeA; i ++) {
                for (j = 0; j < sizeB; j ++) {
                    output.collect(key, new Text(vecA.get(i) + DELIMITER +vecB.get(j)));
                }
            }
        }
    }

    protected void configJob(JobConf conf) {
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // ReportOutFormat is the original author's output format class; it is not shown in this article
        conf.setOutputFormat(ReportOutFormat.class);
    }
}

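Technical details

A few technical details are worth noting:

(1) Because the input involves two tables, we need to determine whether the record being processed comes from table A or table B. The getInputSplit() method of the Reporter class gives the path of the input data. The code is:

String filePath = ((FileSplit)reporter.getInputSplit()).getPath().toString();

(2) In the map output, all records with the same id (whether from table A or table B) are stored in one list under the same key. In the reduce phase they have to be separated and expanded into the m x n records of the Cartesian product. Since m and n are not known in advance, two vectors (growable arrays) hold the records from table A and table B respectively, and a two-level nested loop then assembles the final result.

(3) System.out.println() can be used in MapReduce for debugging. Its output does not appear in the terminal, however; it goes to the stdout and stderr files located in the logs/userlogs/attempt_xxx directory. Their contents can be viewed through "Analyse This Job" in the job history on the web UI.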

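Java code for all the methods (very long)

Reposted from another author.

1. Join on the reduce side.

Joining on the reduce side is the most common pattern for joining tables in the MapReduce framework. It works as follows:

Main work on the map side: tag the key/value pairs from the different tables (files) so that records from different sources can be told apart. Then use the join field as the key and the remaining fields plus the new tag as the value, and emit the pair.

Main work on the reduce side: by the time reduce runs, grouping by the join field is already done. Within each group we only need to separate the records that came from different files (they were tagged in the map phase) and then take their Cartesian product. The principle is very simple; here is an example:

(1) Define a custom value type: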

package com.mr.reduceSizeJoin;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class CombineValues implements WritableComparable<CombineValues>{
//private static final Logger logger = LoggerFactory.getLogger(CombineValues.class);
private Text joinKey;//join key
private Text flag;//source-file flag
private Text secondPart;//everything except the join key
public void setJoinKey(Text joinKey) {
this.joinKey = joinKey;
}
public void setFlag(Text flag) {
this.flag = flag;
}
public void setSecondPart(Text secondPart) {
this.secondPart = secondPart;
}
public Text getFlag() {
return flag;
}
public Text getSecondPart() {
return secondPart;
}
public Text getJoinKey() {
return joinKey;
}
public CombineValues() {
this.joinKey = new Text();
this.flag = new Text();
this.secondPart = new Text();
}
@Override
public void write(DataOutput out) throws IOException {
this.joinKey.write(out);
this.flag.write(out);
this.secondPart.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
this.joinKey.readFields(in);
this.flag.readFields(in);
this.secondPart.readFields(in);
}
@Override
public int compareTo(CombineValues o) {
return this.joinKey.compareTo(o.getJoinKey());
}
@Override
public String toString() {
// TODO Auto-generated method stub
return "[flag="+this.flag.toString()+",joinKey="+this.joinKey.toString()+",secondPart="+this.secondPart.toString()+"]";
}
}

(2) Main map and reduce code

package com.mr.reduceSizeJoin;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* @author zengzhaozheng
* Purpose:
* left outer join implemented as a reduce side join
* (note: the reducer below emits only matched left/right pairs, so the result shown is that of an inner join)
* Two files stand for two tables; the join columns are the id field of table1 and the cityID field of table2
* table1 (left table): tb_dim_city(id int,name string,orderid int,city_code,is_show)
* Contents of tb_dim_city.dat, delimiter "|":
* id name orderid city_code is_show
* 0 其他 9999 9999 0
* 1 長春 1 901 1
* 2 吉林 2 902 1
* 3 四平 3 903 1
* 4 松原 4 904 1
* 5 通化 5 905 1
* 6 遼源 6 906 1
* 7 白城 7 907 1
* 8 白山 8 908 1
* 9 延吉 9 909 1
* ------------------------- divider -------------------------------
* table2 (right table): tb_user_profiles(userID int,network string,flow double,cityID int)
* Contents of tb_user_profiles.dat, delimiter "|":
* userID network flow cityID
* 1 2G 123 1
* 2 3G 333 2
* 3 3G 555 1
* 4 2G 777 3
* 5 3G 666 4
*
* ------------------------- divider -------------------------------
* Result:
* 1 長春 1 901 1 1 2G 123
* 1 長春 1 901 1 3 3G 555
* 2 吉林 2 902 1 2 3G 333
* 3 四平 3 903 1 4 2G 777
* 4 松原 4 904 1 5 3G 666
*/
public class ReduceSideJoin_LeftOuterJoin extends Configured implements Tool{
private static final Logger logger = LoggerFactory.getLogger(ReduceSideJoin_LeftOuterJoin.class);
public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, CombineValues> {
private CombineValues combineValues = new CombineValues();
private Text flag = new Text();
private Text joinKey = new Text();
private Text secondPart = new Text();
@Override
protected void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
//Get the path of the input file
String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
//Records from tb_dim_city.dat get the flag "0"
if(pathName.endsWith("tb_dim_city.dat")){
String[] valueItems = value.toString().split("\\|");
//Skip malformed records
if(valueItems.length != 5){
return;
}
flag.set("0");
joinKey.set(valueItems[0]);
secondPart.set(valueItems[1]+"\t"+valueItems[2]+"\t"+valueItems[3]+"\t"+valueItems[4]);
combineValues.setFlag(flag);
combineValues.setJoinKey(joinKey);
combineValues.setSecondPart(secondPart);
context.write(combineValues.getJoinKey(), combineValues);
}//Records from tb_user_profiles.dat get the flag "1"
else if(pathName.endsWith("tb_user_profiles.dat")){
String[] valueItems = value.toString().split("\\|");
//Skip malformed records
if(valueItems.length != 4){
return;
}
flag.set("1");
joinKey.set(valueItems[3]);
secondPart.set(valueItems[0]+"\t"+valueItems[1]+"\t"+valueItems[2]);
combineValues.setFlag(flag);
combineValues.setJoinKey(joinKey);
combineValues.setSecondPart(secondPart);
context.write(combineValues.getJoinKey(), combineValues);
}
}
}
public static class LeftOutJoinReducer extends Reducer<Text, CombineValues, Text, Text> {
//Left-table records of the current group
private ArrayList<Text> leftTable = new ArrayList<Text>();
//Right-table records of the current group
private ArrayList<Text> rightTable = new ArrayList<Text>();
private Text secondPar = null;
private Text output = new Text();
/**
* reduce() is called once per group
*/
@Override
protected void reduce(Text key, Iterable<CombineValues> value, Context context)
throws IOException, InterruptedException {
leftTable.clear();
rightTable.clear();
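/**
* Split the elements of the group by source file.
* Caveat with this approach:
* if a group holds too many elements, the reduce phase may run out of memory (OOM).
* Before tackling a distributed problem it is best to understand how the data is
* distributed and choose the approach that fits that distribution; this helps
* prevent OOM and severe data skew.
*/
for(CombineValues cv : value){
secondPar = new Text(cv.getSecondPart().toString());
//Left table tb_dim_city
if("0".equals(cv.getFlag().toString().trim())){
leftTable.add(secondPar);
}
//Right table tb_user_profiles
else if("1".equals(cv.getFlag().toString().trim())){
rightTable.add(secondPar);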
}
}
logger.info("tb_dim_city:"+leftTable.toString());
logger.info("tb_user_profiles:"+rightTable.toString());
for(Text leftPart : leftTable){
for(Text rightPart : rightTable){
output.set(leftPart+ "\t" + rightPart);
context.write(key, output);
}
}
}
}
@Override
public int run(String[] args) throws Exception {
Configuration conf=getConf(); //Get the configuration object
Job job=new Job(conf,"LeftOutJoinMR");
job.setJarByClass(ReduceSideJoin_LeftOuterJoin.class);
FileInputFormat.addInputPath(job, new Path(args[0])); //Map input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); //Reduce output path
job.setMapperClass(LeftOutJoinMapper.class);
job.setReducerClass(LeftOutJoinReducer.class);
job.setInputFormatClass(TextInputFormat.class); //Input format
job.setOutputFormatClass(TextOutputFormat.class);//Use the default output format
//Map output key and value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(CombineValues.class);
//Reduce output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
return job.isSuccessful()?0:1;
}
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
try {
int returnCode = ToolRunner.run(new ReduceSideJoin_LeftOuterJoin(),args);
System.exit(returnCode);
} catch (Exception e) {
// TODO Auto-generated catch block
logger.error(e.getMessage());
}
}
}

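The detailed analysis and the input and output data are explained clearly enough in the code comments, so this part focuses on the shortcomings of reduce join. The reason reduce join exists is easy to see: the data as a whole is split up, each map task processes only part of it and cannot see all the fields needed for the join, so the join key has to be used as the grouping key on the reduce side to bring all records with the same join key together for processing. That is how reduce join came about. Its drawback is just as obvious: it causes a large amount of data transfer between the map and reduce sides, that is, in the shuffle phase, so it is inefficient.

2. Join on the map side.

When to use it: one table is very small and the other is very large.

How: when submitting the job, put the small table's file into the job's DistributedCache. Each map task then reads the small table from the DistributedCache, splits each record into join key / value, and loads it into memory (for example into a HashMap). It then scans the large table and checks whether each record's join key has a matching record in memory; if so, it outputs the joined result directly.

Straight to the code, which is fairly simple: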

package com.mr.mapSideJoin;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* @author zengzhaozheng
*
* Purpose:
* left outer join implemented as a map side join
* (note: the mapper below emits only records that find a match, so the result shown is that of an inner join)
* Two files stand for two tables; the join columns are the id field of table1 and the cityID field of table2
* table1 (left table): tb_dim_city(id int,name string,orderid int,city_code,is_show),
* assuming tb_dim_city has very few records. Contents of tb_dim_city.dat, delimiter "|":
* id name orderid city_code is_show
* 0 其他 9999 9999 0
* 1 長春 1 901 1
* 2 吉林 2 902 1
* 3 四平 3 903 1
* 4 松原 4 904 1
* 5 通化 5 905 1
* 6 遼源 6 906 1
* 7 白城 7 907 1
* 8 白山 8 908 1
* 9 延吉 9 909 1
* ------------------------- divider -------------------------------
* table2 (right table): tb_user_profiles(userID int,network string,flow double,cityID int)
* Contents of tb_user_profiles.dat, delimiter "|":
* userID network flow cityID
* 1 2G 123 1
* 2 3G 333 2
* 3 3G 555 1
* 4 2G 777 3
* 5 3G 666 4
* ------------------------- divider -------------------------------
* Result:
* 1 長春 1 901 1 1 2G 123
* 1 長春 1 901 1 3 3G 555
* 2 吉林 2 902 1 2 3G 333
* 3 四平 3 903 1 4 2G 777
* 4 松原 4 904 1 5 3G 666
*/
public class MapSideJoinMain extends Configured implements Tool{
private static final Logger logger = LoggerFactory.getLogger(MapSideJoinMain.class);
public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, Text> {
private HashMap<String, String> city_info = new HashMap<String, String>();
private Text outPutKey = new Text();
private Text outPutValue = new Text();
private String mapInputStr = null;
private String mapInputSpit[] = null;
private String city_secondPart = null;
/**
* Runs once before each task starts. Here it fetches the tb_dim_city file
* from the DistributedCache and loads its records into memory.
*/
@Override
protected void setup(Context context)
throws IOException, InterruptedException {
BufferedReader br = null;
//Get the DistributedCache files of the current job
Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
String cityInfo = null;
for(Path p : distributePaths){
if(p.toString().endsWith("tb_dim_city.dat")){
//Read the cached file into memory
br = new BufferedReader(new FileReader(p.toString()));
while(null!=(cityInfo=br.readLine())){
String[] cityPart = cityInfo.split("\\|",5);
if(cityPart.length ==5){
city_info.put(cityPart[0], cityPart[1]+"\t"+cityPart[2]+"\t"+cityPart[3]+"\t"+cityPart[4]);
}
}
br.close();
}
}
}
/**
* The map side is simple: check whether the cityID from tb_user_profiles.dat
* exists in the in-memory map. That is all a map join needs.
*/
@Override
protected void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
//Skip empty lines
if(value == null || value.toString().equals("")){
return;
}
mapInputStr = value.toString();
mapInputSpit = mapInputStr.split("\\|",4);
//Skip malformed records
if(mapInputSpit.length != 4){
return;
}
//Check whether the join field exists in the in-memory map
city_secondPart = city_info.get(mapInputSpit[3]);
if(city_secondPart != null){
this.outPutKey.set(mapInputSpit[3]);
this.outPutValue.set(city_secondPart+"\t"+mapInputSpit[0]+"\t"+mapInputSpit[1]+"\t"+mapInputSpit[2]);
context.write(outPutKey, outPutValue);
}
}
}
@Override
public int run(String[] args) throws Exception {
Configuration conf=getConf(); //Get the configuration object
DistributedCache.addCacheFile(new Path(args[1]).toUri(), conf);//Add the cache file to this job
Job job=new Job(conf,"MapJoinMR");
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(args[0])); //Map input path
FileOutputFormat.setOutputPath(job, new Path(args[2])); //Output path (map-only job)
job.setJarByClass(MapSideJoinMain.class);
job.setMapperClass(LeftOutJoinMapper.class);
job.setInputFormatClass(TextInputFormat.class); //Input format
job.setOutputFormatClass(TextOutputFormat.class);//Use the default output format
//Map output key type
job.setMapOutputKeyClass(Text.class);
//Job output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
return job.isSuccessful()?0:1;
}
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
try {
int returnCode = ToolRunner.run(new MapSideJoinMain(),args);
System.exit(returnCode);
} catch (Exception e) {
// TODO Auto-generated catch block
logger.error(e.getMessage());
}
}
}

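A word about DistributedCache. DistributedCache is an implementation of a distributed cache. It plays an important role in the MapReduce framework and makes it possible to write fairly complex, efficient distributed programs. Back to the topic: before a job starts, the JobTracker obtains the list of DistributedCache resource URIs and distributes the corresponding files to every TaskTracker that runs tasks of the job. The relationship between DistributedCache and jobs, such as permissions, separation of storage paths, and the public and private attributes, is left for a later blog post and is not covered in detail here.

There is also a more extreme way to do a map join: combine it with HBase. This approach removes the memory limit entirely, so map join can be used without worrying about table size, and its efficiency is quite good as well.

3. SemiJoin.

SemiJoin is the so-called semi-join. On closer inspection it is a variant of reduce join: some data is filtered out on the map side so that only the records that take part in the join travel over the network, while records that do not take part are never transferred. This reduces the shuffle's network traffic and improves overall efficiency. Everything else is exactly the same as reduce join. Put more concretely: the keys of the small table that take part in the join are extracted and distributed to the relevant nodes through DistributedCache, then loaded into memory (for example into a HashSet). In the map phase the joined tables are scanned and records whose join key is not in the in-memory HashSet are dropped; the records that do take part in the join go through the shuffle to the reduce side, where the join is performed as in reduce join. Here is the code: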

package com.mr.SemiJoin;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* @author zengzhaozheng
*
* Purpose:
* SemiJoin variant of the reduce side join (only matched left/right pairs are emitted)
* Two files stand for two tables; the join columns are the id field of table1 and the cityID field of table2
* table1 (left table): tb_dim_city(id int,name string,orderid int,city_code,is_show)
* Contents of tb_dim_city.dat, delimiter "|":
* id name orderid city_code is_show
* 0 其他 9999 9999 0
* 1 長春 1 901 1
* 2 吉林 2 902 1
* 3 四平 3 903 1
* 4 松原 4 904 1
* 5 通化 5 905 1
* 6 遼源 6 906 1
* 7 白城 7 907 1
* 8 白山 8 908 1
* 9 延吉 9 909 1
* ------------------------- divider -------------------------------
* table2 (right table): tb_user_profiles(userID int,network string,flow double,cityID int)
* Contents of tb_user_profiles.dat, delimiter "|":
* userID network flow cityID
* 1 2G 123 1
* 2 3G 333 2
* 3 3G 555 1
* 4 2G 777 3
* 5 3G 666 4
* ------------------------- divider -------------------------------
* Contents of joinKey.dat:
* city_code
* 1
* 2
* 3
* 4
* ------------------------- divider -------------------------------
* Result:
* 1 長春 1 901 1 1 2G 123
* 1 長春 1 901 1 3 3G 555
* 2 吉林 2 902 1 2 3G 333
* 3 四平 3 903 1 4 2G 777
* 4 松原 4 904 1 5 3G 666
*/
public class SemiJoin extends Configured implements Tool{
private static final Logger logger = LoggerFactory.getLogger(SemiJoin.class);
public static class SemiJoinMapper extends Mapper<Object, Text, Text, CombineValues> {
private CombineValues combineValues = new CombineValues();
private HashSet<String> joinKeySet = new HashSet<String>();
private Text flag = new Text();
private Text joinKey = new Text();
private Text secondPart = new Text();
/**
* Load the keys that take part in the join from the DistributedCache into memory,
* so that the map side can keep only the records with those keys.
*/
@Override
protected void setup(Context context)
throws IOException, InterruptedException {
BufferedReader br = null;
//Get the DistributedCache files of the current job
Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
String joinKeyStr = null;
for(Path p : distributePaths){
if(p.toString().endsWith("joinKey.dat")){
//Read the cached file into memory
br = new BufferedReader(new FileReader(p.toString()));
while(null!=(joinKeyStr=br.readLine())){
joinKeySet.add(joinKeyStr);
}
br.close();
}
}
}
@Override
protected void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
//Get the path of the input file
String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
//Records from tb_dim_city.dat get the flag "0"
if(pathName.endsWith("tb_dim_city.dat")){
String[] valueItems = value.toString().split("\\|");
//Skip malformed records
if(valueItems.length != 5){
return;
}
//Drop records that do not take part in the join
if(joinKeySet.contains(valueItems[0])){
flag.set("0");
joinKey.set(valueItems[0]);
secondPart.set(valueItems[1]+"\t"+valueItems[2]+"\t"+valueItems[3]+"\t"+valueItems[4]);
combineValues.setFlag(flag);
combineValues.setJoinKey(joinKey);
combineValues.setSecondPart(secondPart);
context.write(combineValues.getJoinKey(), combineValues);
}else{
return ;
}
}//Records from tb_user_profiles.dat get the flag "1"
else if(pathName.endsWith("tb_user_profiles.dat")){
String[] valueItems = value.toString().split("\\|");
//Skip malformed records
if(valueItems.length != 4){
return;
}
//Drop records that do not take part in the join
if(joinKeySet.contains(valueItems[3])){
flag.set("1");
joinKey.set(valueItems[3]);
secondPart.set(valueItems[0]+"\t"+valueItems[1]+"\t"+valueItems[2]);
combineValues.setFlag(flag);
combineValues.setJoinKey(joinKey);
combineValues.setSecondPart(secondPart);
context.write(combineValues.getJoinKey(), combineValues);
}else{
return ;
}
}
}
}
public static class SemiJoinReducer extends Reducer<Text, CombineValues, Text, Text> {
//Left-table records of the current group
private ArrayList<Text> leftTable = new ArrayList<Text>();
//Right-table records of the current group
private ArrayList<Text> rightTable = new ArrayList<Text>();
private Text secondPar = null;
private Text output = new Text();
/**
* reduce() is called once per group
*/
@Override
protected void reduce(Text key, Iterable<CombineValues> value, Context context)
throws IOException, InterruptedException {
leftTable.clear();
rightTable.clear();
/**
* Split the elements of the group by source file.
* Caveat with this approach:
* if a group holds too many elements, the reduce phase may run out of memory (OOM).
* Before tackling a distributed problem it is best to understand how the data is
* distributed and choose the approach that fits that distribution; this helps
* prevent OOM and severe data skew.
*/
for(CombineValues cv : value){
secondPar = new Text(cv.getSecondPart().toString());
//Left table tb_dim_city
if("0".equals(cv.getFlag().toString().trim())){
leftTable.add(secondPar);
}
//Right table tb_user_profiles
else if("1".equals(cv.getFlag().toString().trim())){
rightTable.add(secondPar);
}
}
logger.info("tb_dim_city:"+leftTable.toString());
logger.info("tb_user_profiles:"+rightTable.toString());
for(Text leftPart : leftTable){
for(Text rightPart : rightTable){
output.set(leftPart+ "\t" + rightPart);
context.write(key, output);
}
}
}
}
@Override
public int run(String[] args) throws Exception {
Configuration conf=getConf(); //Get the configuration object
DistributedCache.addCacheFile(new Path(args[2]).toUri(), conf);
Job job=new Job(conf,"LeftOutJoinMR");
job.setJarByClass(SemiJoin.class);
FileInputFormat.addInputPath(job, new Path(args[0])); //Map input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); //Reduce output path
job.setMapperClass(SemiJoinMapper.class);
job.setReducerClass(SemiJoinReducer.class);
job.setInputFormatClass(TextInputFormat.class); //Input format
job.setOutputFormatClass(TextOutputFormat.class);//Use the default output format
//Map output key and value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(CombineValues.class);
//Reduce output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
return job.isSuccessful()?0:1;
}
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
try {
int returnCode = ToolRunner.run(new SemiJoin(),args);
System.exit(returnCode);
} catch (Exception e) {
logger.error(e.getMessage());
}
}
}

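Note that SemiJoin also has a limited range of use: the keys extracted for the join have to be held in memory, so the key set must not be too large, or it can easily cause OOM on the map side.

Summary

This post covered three join methods. They suit different scenarios, and their efficiency differs considerably, mainly because of network transfer. Map join is the most efficient, SemiJoin comes next, and reduce join is the least efficient. In addition, when writing distributed big-data programs it is best to understand the overall distribution of the data to be processed first; this makes the code more efficient and keeps data skew to a minimum.

Factor	What it affects
Number of key fields	1. The num.key.fields.for.partition setting in the launch script; 2. The stream.num.map.output.key.fields setting in the launch script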

