Hadoop joins: semi join

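A semi join (SemiJoin) is a technique borrowed from distributed databases. The motivation: in a reduce-side join, a large amount of data is transferred between machines, and that transfer becomes the bottleneck of the join. If records that cannot take part in the join are filtered out on the map side, network I/O drops considerably.

The implementation is simple. Pick the small table, say File1, extract the keys it joins on, and save them to a file, File3. File3 is usually small enough to fit in memory. In the map phase, use DistributedCache to copy File3 to every TaskTracker, then drop the records of File2 whose keys do not appear in File3. The remaining work in the reduce phase is the same as in a reduce-side join.

This example reuses the data from the first example. Suppose we keep only the users whose sex is 1 and store their keys in the file user_id. Note that every line must keep its double quotes, since the mapper matches keys by exact string comparison. The file looks like this: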

"ID"
"1"
"2"
"3"
"5"
"6"
"8"
"9"
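The filtering step described above can be sketched without Hadoop at all. The following is a minimal illustration, not the job itself: the class name `SemiJoinFilterSketch`, its methods, and the sample records are all assumptions made up for this sketch, and the real record layout comes from the first example's data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, Hadoop-free sketch of the map-side filter used by a semi join.
final class SemiJoinFilterSketch
{
    // Build the in-memory key set (the role File3 plays) from the lines of the key file.
    static Set<String> loadKeys(List<String> keyFileLines)
    {
        return new HashSet<String>(keyFileLines);
    }

    // Keep only the records whose first comma-separated field is a known key.
    static List<String> filter(List<String> records, Set<String> keys)
    {
        List<String> kept = new ArrayList<String>();
        for (String record : records)
        {
            String[] fields = record.split(",");
            if (fields.length > 1 && keys.contains(fields[0]))
            {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args)
    {
        // Keys keep their double quotes, exactly as they appear in the key file.
        Set<String> keys = loadKeys(Arrays.asList("\"ID\"", "\"1\"", "\"2\"", "\"3\""));

        // Made-up records for illustration only.
        List<String> records = Arrays.asList(
            "\"1\",\"login\"",
            "\"4\",\"logout\"",
            "\"2\",\"login\"",
            "1,login");

        for (String record : filter(records, keys))
        {
            System.out.println(record);
        }
    }
}
```

In this sketch the record keyed `"4"` is dropped because its key is not in the set, and the record `1,login` is dropped as well: without the quotes, `1` is a different string from `"1"`. That is why the quotes in the key file matter.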

The complete code for the job is as follows; this example is written against the new API:

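import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SemiJoin extends Configured implements Tool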
{
    public static class MapClass extends Mapper<LongWritable, Text, Text, Text>
    {

        // Holds the keys loaded from the cached user_id file
        private Set<String> userIds = new HashSet<String>();
        
        private Text key = new Text();
        private Text value = new Text();
        private String[] keyValue;

        // Called once per task, before any call to map()
        @Override
        protected void setup(Context context) throws IOException, InterruptedException
        {
            BufferedReader in = null;

            try
            {
                // Get the local paths of the files cached for this job
                Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                String userId = null;

                for (Path path : paths)
                {
                    if (path.toString().contains("user_id"))
                    {
                        in = new BufferedReader(new FileReader(path.toString()));
                        while (null != (userId = in.readLine()))
                        {
                            userIds.add(userId);
                        }
                    }
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
            finally
            {
                try
                {
                    if(in != null)
                    {
                        in.close(); 
                    }
                }
                catch (IOException e)
                {
                    e.printStackTrace();
                }
            }
        }

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
        {
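            // Filter out unneeded records in the map phase
            this.keyValue = value.toString().split(",");
            
            // Skip lines without a value field so keyValue[1] below cannot go out of bounds
            if(keyValue.length > 1 && userIds.contains(keyValue[0]))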
            {
                this.key.set(keyValue[0]);
                this.value.set(keyValue[1]);
                context.write(this.key, this.value);
            }
        }

    }

    public static class Reduce extends Reducer<Text, Text, Text, Text>
    {

        private Text value = new Text();
        private StringBuilder sb;
        
        public void reduce(Text key, Iterable<Text> values, Context context) 
            throws IOException, InterruptedException
        {
            
            sb = new StringBuilder();
            for(Text val : values)
            {
                sb.append(val.toString());
                sb.append(",");
            }
            
            this.value.set(sb.deleteCharAt(sb.length()-1).toString());
            context.write(key, this.value);
        }
        
    }
    
    public int run(String[] args) throws Exception
    {
        Job job = new Job(getConf(), "SemiJoin");

        job.setJobName("SemiJoin");
        job.setJarByClass(SemiJoin.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        String[] otherArgs = new GenericOptionsParser(job.getConfiguration(), args).getRemainingArgs();
        
        // The first argument is the path of the file to cache
        DistributedCache.addCacheFile(new Path(otherArgs[0]).toUri(), job.getConfiguration());
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int res = ToolRunner.run(new Configuration(), new SemiJoin(), args);
        System.exit(res);
    }

}
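To see what shape of output the job above produces, the map, shuffle and reduce steps can be imitated locally. This is only a rough sketch under stated assumptions: the class `SemiJoinLocalRun`, its `run` method, and the sample records are hypothetical, and a `TreeMap` stands in for Hadoop's shuffle. In a real job the order of values within one key is not guaranteed, whereas this simulation keeps input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical local imitation of the SemiJoin job's data flow (no Hadoop involved).
final class SemiJoinLocalRun
{
    static Map<String, String> run(List<String> records, Set<String> userIds)
    {
        // Map + shuffle: emit (field 0, field 1) for records whose key is cached,
        // grouping the emitted values by key.
        Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (String record : records)
        {
            String[] keyValue = record.split(",");
            if (keyValue.length > 1 && userIds.contains(keyValue[0]))
            {
                List<String> values = grouped.get(keyValue[0]);
                if (values == null)
                {
                    values = new ArrayList<String>();
                    grouped.put(keyValue[0], values);
                }
                values.add(keyValue[1]);
            }
        }

        // Reduce: join all values of one key with commas, as the Reduce class does.
        Map<String, String> output = new TreeMap<String, String>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet())
        {
            output.put(entry.getKey(), String.join(",", entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args)
    {
        // Made-up keys and records for illustration only.
        Set<String> userIds = new HashSet<String>(Arrays.asList("1", "2"));
        List<String> records = Arrays.asList("1,a", "2,b", "1,c", "7,d");

        for (Map.Entry<String, String> entry : run(records, userIds).entrySet())
        {
            // TextOutputFormat separates key and value with a tab by default.
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The record keyed `7` never reaches the reduce step, which is the point of the semi join: it is discarded before any data would cross the network.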

畫蛇添足
