Hadoop Study Notes: MultipleOutputs, Writing Results to Multiple Specified Files or Folders

Original

李承锦MJ

2018-09-03 06:21

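Using MultipleOutputs in MapReduce to write results to multiple specified files or folders

There are three main steps:

1. Create a MultipleOutputs object in the reduce or map class and use it to write the results: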

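import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

class TestReducer extends Reducer<Text, Text, Text, Text> {

    // Writes the results to multiple files or folders
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        mos = new MultipleOutputs<>(context);  // initialise mos once per task
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();  // close the object so every output is flushed
    }
}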

2. In the map or reduce method, write the data through the MultipleOutputs object instead of calling context.write():

@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    .... // compute key and value
    // Write the data through the MultipleOutputs object;
    // the first argument is the named output to write to
    if (key.toString().equals("file1")) {
        mos.write("file1", key, value);
    } else if (key.toString().equals("file2")) {
        mos.write("file2", key, value);
    }
}
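To make it easier to see where the records written in step 2 end up, here is a small sketch of the file names that the named outputs produce. It is not Hadoop code: `NamedOutputFileNames` and `fileName` are hypothetical helper names, and the sketch assumes the `name-r-00000` pattern described in the MultipleOutputs documentation (`m` instead of `r` for map-side output). The `file1/part` case further assumes the four-argument `mos.write(namedOutput, key, value, baseOutputPath)` overload, which the steps in this post do not use; with it, a `/` in the base path becomes a subfolder of the job output directory.

```java
import java.util.Locale;

/** Illustrative helper (not part of Hadoop): predicts MultipleOutputs file names. */
final class NamedOutputFileNames {

    /**
     * @param baseOutputPath the named output (e.g. "file1") or a base path such as "file1/part"
     * @param reduceSide     true when written from a reducer, false when written from a mapper
     * @param partition      the partition number of the task doing the writing
     */
    static String fileName(String baseOutputPath, boolean reduceSide, int partition) {
        String taskType = reduceSide ? "r" : "m";
        // Partition numbers are zero-padded to five digits.
        return String.format(Locale.ROOT, "%s-%s-%05d", baseOutputPath, taskType, partition);
    }

    public static void main(String[] args) {
        System.out.println(fileName("file1", true, 0));      // file1-r-00000
        System.out.println(fileName("file2", true, 3));      // file2-r-00003
        System.out.println(fileName("file1/part", true, 0)); // file1/part-r-00000
    }
}
```

So with the reducer above and a single reduce task, the job output directory would be expected to contain `file1-r-00000` and `file2-r-00000` next to the usual `part-r-00000`.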

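3. When creating the job, define the additional output files. The names here must match the names used in step 2.

Note that Hadoop does not accept a namedOutput that has not been registered. Each one must be registered in the main function before it can be written to; otherwise the job fails at run time with a "not defined" error. So call MultipleOutputs.addNamedOutput in the main function to register each namedOutput.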

// Define the additional output files (name, output format, key class, value class)
MultipleOutputs.addNamedOutput(job, "file1", TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "file2", TextOutputFormat.class, Text.class, Text.class);
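The register-then-write rule can be pictured with a toy model. The sketch below is plain Java, not Hadoop's implementation: `NamedOutputRegistry` is a hypothetical class, its `addNamedOutput` stands in for the call made in the driver, its `write` stands in for `mos.write` in the task, and the exception messages are only illustrative.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the register-then-write rule; not Hadoop's actual implementation. */
final class NamedOutputRegistry {

    private final Set<String> registered = new HashSet<>();

    /** Stands in for MultipleOutputs.addNamedOutput(job, name, ...) in the driver. */
    void addNamedOutput(String name) {
        if (!registered.add(name)) {
            throw new IllegalArgumentException("Named output '" + name + "' already defined");
        }
    }

    /** Stands in for mos.write(name, key, value) inside map or reduce. */
    String write(String name, String key, String value) {
        if (!registered.contains(name)) {
            // The situation step 3 warns about: writing to a name nobody registered.
            throw new IllegalArgumentException("Named output '" + name + "' not defined");
        }
        return name + ": " + key + "\t" + value;
    }
}
```

In this model, registering `file1` and then writing to `file1` succeeds, while writing to an unregistered `file2` throws. That mirrors why the names passed to `addNamedOutput` have to match the names used in step 2.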
