MapReduce Quick Start Series (13) | Implementing the reduce-side join and map-side join algorithms in MapReduce

In this post I'll walk you through the different ways joins can be implemented in MapReduce.


1. Reduce Join

1.1 How Reduce Join Works

  The main work on the Map side: tag the key/value pairs coming from the different tables or files so records from different sources can be told apart. Then use the join field as the key, the rest of the record plus the new tag as the value, and emit the pair.

  The main work on the Reduce side: by the time records reach the reducer, grouping by the join field (the key) has already been done. Within each group we only need to separate the records that came from different files (tagged in the Map phase) and then merge them.
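Before the full example in 1.2, here is a minimal sketch of that tag-based Map output, assuming two tab-separated input files named roughly order.txt and pd.txt. The class name, the "order"/"pd" tag strings, and the field layout are my own illustration; the example below takes a slightly different route and merges both tables into a single OrderBean instead of tagging string values.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

// Illustrative tag-based join mapper: key = join field (pid), value = source tag + payload.
public class TagJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String filename;               // which input file the current split comes from
    private Text joinKey = new Text();
    private Text taggedValue = new Text();

    @Override
    protected void setup(Context context) {
        filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (filename.startsWith("order")) {
            // Order record: id \t pid \t amount  ->  key = pid, value = "order" tag + id + amount
            joinKey.set(fields[1]);
            taggedValue.set("order\t" + fields[0] + "\t" + fields[2]);
        } else {
            // Product record: pid \t pname       ->  key = pid, value = "pd" tag + pname
            joinKey.set(fields[0]);
            taggedValue.set("pd\t" + fields[1]);
        }
        context.write(joinKey, taggedValue);
    }
}

The matching reducer would simply split each value on its tag, remember the pname from the "pd" record, and attach it to every "order" record in the group.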

1.2 Reduce Join Example

1. Requirement

Merge the data from the product information table into the order table, matching on the product pid.
The final output should look like the following table:

id pname amount
1001 小米 1
1004 小米 4
1002 華爲 2
1005 華爲 5
1003 格力 3
1006 格力 6
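The original post shows the two input files as an image. Based on the mapper code below, both are tab-separated; their contents would look roughly like this (the 01/02/03 pid values are purely illustrative):

order.txt (id, pid, amount):
1001  01  1
1002  02  2
1003  03  3
1004  01  4
1005  02  5
1006  03  6

pd.txt (pid, pname):
01  小米
02  華爲
03  格力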

2. Analysis

  We use the join condition as the key of the Map output, so that the rows of both tables that satisfy the join condition, each carrying a marker of which file it came from, are sent to the same ReduceTask; the actual joining of the data then happens in the reducer, as shown in the figure below.
[Figure: Reduce Join data flow, omitted]

3. Code Implementation

  • 1. Create the OrderBean class that holds the merged order and product fields
package com.buwenbuhuo.reducejoin;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
 * @author 卜溫不火
 * @create 2020-04-25 17:24
 * com.buwenbuhuo.reducejoin - the name of the target package where the new class or interface will be created.
 * mapreduce0422 - the name of the current project.
 */
public class OrderBean implements WritableComparable<OrderBean> {
    private String id;
    private String pid;
    private int amount;
    private String pname;

    @Override
    public String toString() {
        return id + "\t" + pname + "\t" + amount;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    @Override
    public int compareTo(OrderBean o) {
        // Primary sort: by pid, so all records that share a pid end up next to each other.
        int compare = this.pid.compareTo(o.pid);

        if (compare == 0) {
            // Secondary sort: pname in descending order, so within a pid group the product
            // record (non-empty pname) comes before the order records (empty pname).
            return o.pname.compareTo(this.pname);
        } else {
            return compare;
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.pid = in.readUTF();
        this.amount = in.readInt();
        this.pname = in.readUTF();
    }
}
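Note that OrderBean is not only the Map output key but also the job's final output key (paired with NullWritable), so TextOutputFormat writes each result line by calling toString(). The tab-separated "id pname amount" layout defined above is therefore exactly what ends up in the output file.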

  • 2. Write the RJMapper class
package com.buwenbuhuo.reducejoin;

import com.buwenbuhuo.reducejoin.OrderBean;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;
/**
 * @author 卜溫不火
 * @create 2020-04-25 17:24
 * com.buwenbuhuo.reducejoin - the name of the target package where the new class or interface will be created.
 * mapreduce0422 - the name of the current project.
 */
public class RJMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {

    private OrderBean orderBean = new OrderBean();

    private String filename;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Remember which input file this split belongs to, so map() can tell
        // order records and product records apart.
        FileSplit fs = (FileSplit) context.getInputSplit();
        filename = fs.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (filename.equals("order.txt")) {
            // Order record: id \t pid \t amount; the product name is not known yet.
            orderBean.setId(fields[0]);
            orderBean.setPid(fields[1]);
            orderBean.setAmount(Integer.parseInt(fields[2]));
            orderBean.setPname("");
        } else {
            // Product record (pd.txt): pid \t pname; id and amount stay empty.
            orderBean.setPid(fields[0]);
            orderBean.setPname(fields[1]);
            orderBean.setId("");
            orderBean.setAmount(0);
        }
        context.write(orderBean, NullWritable.get());
    }
}
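Both branches deliberately set all four fields. The same orderBean object is reused for every input line, so a field that was not reset would silently keep its value from the previous record.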

  • 3. Write the RJReducer class
package com.buwenbuhuo.reducejoin;

import com.buwenbuhuo.reducejoin.OrderBean;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Iterator;
/**
 * @author 卜溫不火
 * @create 2020-04-25 17:24
 * com.buwenbuhuo.reducejoin - the name of the target package where the new class or interface will be created.
 * mapreduce0422 - the name of the current project.
 */
public class RJReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {

    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Get the iterator over this pid group.
        Iterator<NullWritable> iterator = values.iterator();
        // Consume the first record of the group. Thanks to the secondary sort in
        // OrderBean.compareTo it is the product record, and because Hadoop reuses the
        // key object, "key" holds that record's fields right now.
        iterator.next();
        // Take the product name from the product record.
        String pname = key.getPname();

        // Walk through the remaining records (the order records): as the iterator advances,
        // the key is refilled with each order record, so set its pname and write it out.
        while (iterator.hasNext()) {
            iterator.next();
            key.setPname(pname);
            context.write(key, NullWritable.get());
        }
    }
}

  • 4. Write the RJComparator class (the grouping comparator)
package com.buwenbuhuo.reducejoin;

import com.buwenbuhuo.reducejoin.OrderBean;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
/**
 * @author 卜溫不火
 * @create 2020-04-25 17:24
 * com.buwenbuhuo.reducejoin - the name of the target package where the new class or interface will be created.
 * mapreduce0422 - the name of the current project.
 */
public class RJComparator extends WritableComparator {

    protected RJComparator() {
        // Register OrderBean and let WritableComparator create key instances for comparison.
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group by pid only, so the product record and all of its order records
        // arrive in a single reduce() call even though their pnames differ.
        OrderBean oa = (OrderBean) a;
        OrderBean ob = (OrderBean) b;
        return oa.getPid().compareTo(ob.getPid());
    }
}

  • 5. Write the RJDriver class
package com.buwenbuhuo.reducejoin;

import com.buwenbuhuo.reducejoin.OrderBean;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
/**
 * @author 卜溫不火
 * @create 2020-04-25 17:24
 * com.buwenbuhuo.reducejoin - the name of the target package where the new class or interface will be created.
 * mapreduce0422 - the name of the current project.
 */
public class RJDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(RJDriver.class);

        job.setMapperClass(RJMapper.class);
        job.setReducerClass(RJReducer.class);

        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        // Group map output by pid only (see RJComparator above).
        job.setGroupingComparatorClass(RJComparator.class);

        FileInputFormat.setInputPaths(job, new Path("d:\\input"));
        FileOutputFormat.setOutputPath(job, new Path("d:\\output"));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
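One caveat about the driver: it relies on the default single ReduceTask. OrderBean does not override hashCode(), so with more than one reducer the default HashPartitioner could scatter records that share a pid across different ReduceTasks and break the join. If you ever need several reducers, a partitioner keyed on pid fixes that; the class below (PidPartitioner is my own name, not part of the original code) is a minimal sketch:

package com.buwenbuhuo.reducejoin;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Minimal sketch (not in the original post): partition map output by pid so that all
// records sharing a pid reach the same ReduceTask even when there are multiple reducers.
public class PidPartitioner extends Partitioner<OrderBean, NullWritable> {
    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        return (key.getPid().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(PidPartitioner.class) alongside job.setNumReduceTasks(n).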

4. Run and Check the Results

  • 1. Run

[screenshot of the job run omitted]

  • 2. Result

[screenshot of the output omitted]
The result is correct, which means our reduce-side join has been implemented successfully!

2. Map Join

2.1 When to Use It

Map Join is suited to the case where one table is very small and the other is very large.

2.2 Advantages

Think about it: when too much of the join work is pushed to the Reduce side, data skew becomes very likely. What can we do about it?
Cache the small table(s) on the Map side and do the join logic there in advance. This moves work to the Map side, takes pressure off the Reduce side, and keeps data skew to a minimum.

2.3 Approach: Use the DistributedCache

  • (1) In the Mapper's setup phase, read the file into an in-memory collection.
  • (2) In the driver, load the file into the cache:
// Cache an ordinary file so it gets shipped to the nodes where the tasks run.
job.addCacheFile(new URI("file://d:/cache/pd.txt"));
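A small detail: the java.net.URI constructor used above declares a checked URISyntaxException, which is why the MJDriver below calls URI.create(...) instead; it does the same parsing but throws an unchecked exception on a malformed URI.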

2.4 Map Join Example

1. Requirement

Merge the data from the product information table into the order table, matching on the product pid. The expected output:

id pname amount
1001 小米 1
1004 小米 4
1002 華爲 2
1005 華爲 5
1003 格力 3
1006 格力 6

2. Analysis

Map Join fits the case where one of the joined tables is small enough to be cached in memory on every MapTask.

3. Code Implementation

  • 1. Create the MJMapper class
package com.buwenbuhuo.mapjoin;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
/**
 * @author 卜溫不火
 * @create 2020-04-25 17:54
 * com.buwenbuhuo.mapjoin - the name of the target package where the new class or interface will be created.
 * mapreduce0422 - the name of the current project.
 */
public class MJMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Map<String, String> pMap = new HashMap<>();

    private Text k = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the cached product file into memory as pid -> pname.
        URI[] cacheFiles = context.getCacheFiles();
        String path = cacheFiles[0].getPath();
        FileSystem fileSystem = FileSystem.get(context.getConfiguration());
        // Wrap the stream in a BufferedReader so each line is decoded as text
        // (assuming pd.txt is UTF-8) instead of using the deprecated byte-based readLine().
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fileSystem.open(new Path(path)), StandardCharsets.UTF_8));
        String line;
        while (StringUtils.isNotEmpty(line = reader.readLine())) {
            String[] fields = line.split("\t");
            pMap.put(fields[0], fields[1]);
        }
        IOUtils.closeStream(reader);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line is an order record: id \t pid \t amount.
        String[] fields = value.toString().split("\t");
        // Look the product name up in the cached map; fall back to "NULL" for unknown pids.
        String pname = pMap.get(fields[1]);
        if (pname == null) {
            pname = "NULL";
        }
        k.set(fields[0] + "\t" + pname + "\t" + fields[2]);
        context.write(k, NullWritable.get());
    }
}

  • 2. Create the MJDriver class
package com.buwenbuhuo.mapjoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
/**
 * @author 卜溫不火
 * @create 2020-04-25 17:54
 * com.buwenbuhuo.mapjoin - the name of the target package where the new class or interface will be created.
 * mapreduce0422 - the name of the current project.
 */
public class MJDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(MJDriver.class);

        job.setMapperClass(MJMapper.class);
        // Map-only job: the join happens entirely on the Map side, so no reducers are needed.
        job.setNumReduceTasks(0);

        // Ship the small product table to every task via the distributed cache.
        job.addCacheFile(URI.create("file:///d:/input/pd.txt"));

        FileInputFormat.setInputPaths(job, new Path("d:\\input\\order.txt"));
        FileOutputFormat.setOutputPath(job, new Path("d:\\output"));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
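A side note: the file:/// URI and the d:\ paths above only work when the job runs in local mode on a developer machine. On a real cluster you would upload pd.txt to HDFS first and cache it with an HDFS URI; roughly like this (the namenode address and the path are placeholders):

// Hypothetical cluster variant: pd.txt has been uploaded to HDFS beforehand.
job.addCacheFile(URI.create("hdfs://namenode:9000/cache/pd.txt"));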

4. Run and Check the Results

  • 1. Run

[screenshot of the job run omitted]

  • 2. Check the result

[screenshot of the output omitted]
The result is correct, which means our map-side join has been implemented successfully!

That's all for this post. I'll be bringing you more Hadoop content in follow-up posts, so if you enjoyed it, don't forget to follow the blog.
