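[Repost] Hadoop Streaming Usage Summary

Hadoop Streaming is a utility that ships with Hadoop. It lets users create and run a special kind of MapReduce job in which any executable or script can serve as the mapper and reducer.

For example, a simple word count job can be written with Hadoop Streaming as follows: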

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /tmp/wordcount/input \
    -output /tmp/wordcount/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

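How It Works

Hadoop Streaming creates an MR job, submits it to the cluster, and monitors the job for its entire run. When the mapper and reducer are executables, the streaming program uses PipeMapper and PipeReducer as proxy Mapper and Reducer implementations. These launch the actual mapper and reducer executables, read the input data from HDFS, and write it line by line to the standard input of the executable's process. At the same time they read whatever the process writes to standard output after handling the data and emit it as the real output of the Mapper or Reducer.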

Taking a Streaming program with no reduce phase as an example, the Mapper's execution flow is outlined in the figure below:

[Figure: Hadoop Streaming Mapper execution flow]

After starting mapper.sh, PipeMapper repeats steps 2-7 of the figure (one map call) until all the data has been processed.

Like PipeMapper, PipeReducer writes the data shuffled from the map side line by line to the standard input of the reducer.sh process, collects that process's standard output, and finally writes it to the HDFS output.

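That is the basic processing flow between the MapReduce framework and a streaming mapper/reducer. When writing the mapper and reducer of a Streaming program, the user therefore only needs to keep reading lines from stdin, process them, and write the results to standard output.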
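
To make this stdin/stdout contract concrete, here is a minimal sketch of a word-count mapper in Python. The function names map_line and main are made up for illustration and do not come from the original post. The output uses a tab between key and value, which is the separator Streaming expects by default.

```python
#!/usr/bin/env python3
# Hypothetical word-count mapper: emits "<word><TAB>1" for every word on stdin.
import sys


def map_line(line):
    """Split one input line on whitespace and return a list of (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def main(stdin=sys.stdin, stdout=sys.stdout):
    # PipeMapper feeds the input to this process one line at a time via stdin.
    for line in stdin:
        for word, count in map_line(line):
            # Key and value are separated by a tab, Streaming's default separator.
            stdout.write(f"{word}\t{count}\n")


if __name__ == "__main__":
    main()
```

Keeping the per-line logic in map_line makes it possible to test the mapper without a cluster or even a pipe.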

Users can also use a Java class as the mapper or reducer. The example above is equivalent to the following:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /tmp/wordcount/input \
    -output /tmp/wordcount/output \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /usr/bin/wc

Users can also set stream.non.zero.exit.is.failure to true or false to indicate whether a non-zero exit status from a streaming task counts as Failure or Success. By default, a non-zero exit status means the task failed.
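
The sketch below shows one way a Python mapper might use its exit status to signal failure. The names process and main, and the rule that every line must contain a tab, are assumptions made for illustration. Only the exit-status behaviour itself comes from the text above.

```python
import sys


def process(line):
    """Hypothetical record handler: expects "key<TAB>value" and swaps the fields."""
    key, value = line.rstrip("\n").split("\t", 1)  # raises ValueError without a tab
    return f"{value}\t{key}"


def main(stdin, stdout, stderr):
    """Return 0 on success, or 1 as soon as a malformed line is seen."""
    for line in stdin:
        try:
            stdout.write(process(line) + "\n")
        except ValueError:
            # Diagnostics go to stderr so they stay out of the job output.
            stderr.write(f"malformed line: {line!r}\n")
            return 1
    return 0


if __name__ == "__main__":
    exit_code = main(sys.stdin, sys.stdout, sys.stderr)
    if exit_code:
        # With stream.non.zero.exit.is.failure left at its default,
        # a non-zero status marks the task as failed.
        sys.exit(exit_code)
```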

Parameter Configuration

Hadoop Streaming supports both Streaming command options and Hadoop's generic options. The general command-line syntax is shown below.

hadoop command [genericOptions] [streamingOptions]

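Note: make sure the generic options come before the Streaming options, otherwise the command will fail. See the examples later in this post.

Streaming Options

The Streaming command options are listed in the table below: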

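Option	Required	Meaning
-input directoryname or filename	Yes	Input data path, either a file or a directory. Repeat the option to add multiple input paths
-output directoryname	Yes	Output data path, which must be a directory. Only one output path is allowed
-mapper executable or JavaClassName	Yes	The mapper executable
-reducer executable or JavaClassName	Yes	The reducer executable
-file filename	No	A file to ship to the compute nodes, so that the mapper, reducer, or combiner executable is available locally there. Repeat the option to ship multiple files
-inputformat JavaClassName	No	The InputFormat used to read input files into key/value pairs. Defaults to TextInputFormat
-outputformat JavaClassName	No	The OutputFormat used to write output key/value pairs to the output files. Defaults to TextOutputFormat
-partitioner JavaClassName	No	The Partitioner, which decides from the key which reduce a record is sent to
-combiner streamingCommand or JavaClassName	No	The Combiner, which merges the mapper's output on the map side
-cmdenv name=value	No	An environment variable that the mapper or reducer can read at run time
-inputreader	No	The InputReader class used to read input data in place of an InputFormat class
-verbose	No	Enable verbose log output
-lazyOutput	No	Create output lazily
-numReduceTasks	No	Number of reducer tasks
-mapdebug	No	Mapper debug script, run when a mapper task fails
-reducedebug	No	Reducer debug script, run when a reducer task fails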

Generic Options

Streaming jobs also support Hadoop's generic options. The main ones are:

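Option	Required	Meaning
-conf configuration_file	No	Specify a configuration file
-D property=value	No	Set the value of a property
-fs host:port or local	No	Specify a namenode
-files	No	Files to copy to the cluster, separated by commas
-libjars	No	Jar files to add to the job's classpath, separated by commas
-archives	No	Archives to unpack on the compute nodes, separated by commas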

Configuration Examples

The examples below show how Streaming job options are used in practice.

Setting the Mapper and Reducer

Using executables as the mapper and reducer

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input myInputDir \
  -output myOutputDir \
  -mapper /usr/bin/cat \
  -reducer /usr/bin/wc

Using a Java class as the mapper and specifying an InputFormat

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input myInputDir \
  -output myOutputDir \
  -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
  -reducer /usr/bin/wc

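Packaging Files at Submission Time

As described above, any executable can be specified as the mapper or reducer. When a Hadoop Streaming job is submitted, the mapper or reducer executable does not need to exist already on any machine in the Hadoop cluster. If it does not, just pass the required files with the -file option at submission time. Hadoop then packages these files, uploads them to HDFS, and distributes them to every compute node.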

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input myInputDir \
  -output myOutputDir \
  -mapper wordcount.py \
  -reducer /usr/bin/wc \
  -file wordcount.py

You can also specify dependency files to be packaged and uploaded to the cluster for the mapper or reducer tasks to use:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input myInputDir \
  -output myOutputDir \
  -mapper wordcount.py \
  -reducer /usr/bin/wc \
  -file wordcount.py \
  -file dictionary.txt
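
As a rough sketch of how a mapper such as wordcount.py might use the shipped dictionary.txt, the code below keeps only the words found in the dictionary. It assumes dictionary.txt holds one word per line and that the shipped file is readable from the task's working directory under its original name. The function names are hypothetical.

```python
def load_dictionary(path="dictionary.txt"):
    """Read one word per line (an assumed format) into a set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def map_lines(lines, dictionary):
    """Yield "word<TAB>1" for every input word that appears in the dictionary."""
    for line in lines:
        for word in line.split():
            if word in dictionary:
                yield f"{word}\t1"


def main(stdin, stdout, dictionary_path="dictionary.txt"):
    # A relative path is enough when the file sits in the working directory.
    dictionary = load_dictionary(dictionary_path)
    for record in map_lines(stdin, dictionary):
        stdout.write(record + "\n")
```

A real script would finish by calling main(sys.stdin, sys.stdout) after importing sys.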

Specifying Other Job Plugins

As with ordinary MR jobs, we can specify pluggable components for the job at run time:

-inputformat JavaClassName
-outputformat JavaClassName
-partitioner JavaClassName
-combiner streamingCommand or JavaClassName

Setting Environment Variables

We can also set environment variables for the job through options and read their values in the mapper or reducer at run time.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input myInputDir \
  -output myOutputDir \
  -mapper wordcount.py \
  -reducer /usr/bin/wc \
  -file wordcount.py \
  -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ \
  -cmdenv LOG_LEVEL=debug
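
On the script side, the two variables from the command above could be read roughly as follows. This is a minimal sketch: the function names and the fallback defaults are assumptions, and only the variable names EXAMPLE_DIR and LOG_LEVEL come from the example.

```python
import os
import sys


def read_settings(environ=os.environ):
    """Collect the two variables passed with -cmdenv, with fallback defaults."""
    return {
        # The defaults are assumptions for running outside a Streaming job.
        "example_dir": environ.get("EXAMPLE_DIR", "."),
        "log_level": environ.get("LOG_LEVEL", "info"),
    }


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr, environ=os.environ):
    settings = read_settings(environ)
    if settings["log_level"] == "debug":
        # Diagnostics go to stderr so they do not mix with the mapper's output.
        stderr.write(f"using dictionaries in {settings['example_dir']}\n")
    for line in stdin:
        stdout.write(line)  # pass records through unchanged


if __name__ == "__main__":
    main()
```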

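Setting the Number of Reducers

We usually need to set a suitable number of reducers for a job. The default is 1. The actual number should be chosen based on business needs, available resources, and similar factors.
The number of reducers can be set with the Streaming option -numReduceTasks 10 or the generic option -D mapreduce.job.reduces=10. Below, we set the job's reducer count to 10.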

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapreduce.job.reduces=10 \
  -input myInputDir \
  -output myOutputDir \
  -mapper /usr/bin/cat \
  -reducer /usr/bin/wc

A map-only job writes its results directly when the map phase finishes and needs no reduce phase, so the number of reducers must be set to 0. Setting -D mapreduce.job.reduces=0 or -reducer NONE does this.

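Depending on Large Files or Archives

Many Streaming jobs depend on particular files or environments at run time. For example, some word-segmentation jobs depend on dictionary files, and a machine-learning job implemented in Python may depend on specific modules. In that case, upload the files or archives to HDFS and pass their HDFS paths with the -files and -archives options. When the job runs, the specified files and archives are distributed to the compute nodes, where the tasks can use them.

Note: -files and -archives are both generic options and must come before the Streaming options, otherwise the job will fail to start.

Files specified with -files are distributed to the compute node before the task starts. Once distribution finishes, a symlink pointing to the file is created in the task's working directory.
In the example below, Hadoop automatically creates a symlink named dict.txt in the task's current working directory, pointing to the local path where dict.txt was actually copied. The task can reference the file directly through the symlink.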

-files hdfs://host:port/tmp/cache/dict.txt

Users can also choose the symlink name themselves. The example below creates a symlink named dict.

-files hdfs://host:port/tmp/cache/dict.txt#dict

Multiple dependency files are separated by commas:

-files hdfs://host:port/tmp/cache/dict.txt,hdfs://host:port/tmp/cache/test.txt

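Archives specified with -archives are distributed to the compute node before the task starts. For compressed archives (tar or jar), Hadoop automatically unpacks them into a directory after distribution and creates a symlink to that directory in the task's working directory.
In the example below, Hadoop automatically creates a symlink named dict.tar.gz in the task's current working directory, pointing to the directory where dict.tar.gz was unpacked.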

-archives hdfs://host:port/tmp/cache/dict.tar.gz

Users can also choose the symlink name themselves. The example below creates a symlink named mydict.

-archives hdfs://host:port/tmp/cache/dict.tar.gz#mydict

Multiple archives are separated by commas.

-archives hdfs://host:port/tmp/cache/dict.tar.gz,hdfs://host:port/tmp/cache/test.tar.gz

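When a job depends on a directory containing many files, such as a third-party Python module that is not installed on the cluster, we can archive and compress the whole directory, upload it to HDFS, and pass the uploaded path with -archives. The Python mapper or reducer can then import the module at run time.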

tar -czvf test_moudle.tar.gz -C test_moudle/ .
hadoop fs -put test_moudle.tar.gz /tmp/cache/test_moudle.tar.gz
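# Once the archive is uploaded, later jobs need not re-package and re-upload it as long as the dependency is unchanged
# In mapper.py or reducer.py, read or import it directly via the ./tmodule path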

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -files mapper.py,reducer.py \
  -archives hdfs://host:port/tmp/cache/test_moudle.tar.gz#tmodule \
  -D mapreduce.job.reduces=1 \
  -D mapreduce.job.name="TestArchives" \
  -input myInputDir \
  -output myOutputDir \
  -mapper 'python mapper.py' \
  -reducer 'python reducer.py'
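
Inside mapper.py or reducer.py, the unpacked archive can be made importable by putting the ./tmodule symlink on the module search path. The helper below is a sketch under that assumption. The name add_archive_to_path is hypothetical, and my_package stands in for whatever module the archive really contains.

```python
import os
import sys


def add_archive_to_path(link_name="tmodule"):
    """Put the unpacked archive directory on sys.path if it exists.

    Returns True when the directory was found, False otherwise.
    """
    archive_dir = os.path.abspath(link_name)
    if not os.path.isdir(archive_dir):
        return False
    if archive_dir not in sys.path:
        sys.path.insert(0, archive_dir)
    return True


# Typical use at the top of mapper.py (my_package is a hypothetical module
# that would live inside test_moudle.tar.gz):
#
#     if add_archive_to_path("tmodule"):
#         import my_package
```

Returning a boolean instead of raising lets the same script run locally, where the symlink does not exist, with a fallback of the author's choosing.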

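Testing Streaming Jobs Locally

Because debugging failed tasks on the cluster is rather cumbersome, it is advisable to run a simple local test before submitting a job, and to submit it to the cluster only after the test passes.

A simple test command looks like this: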

cat inputfile | sh mapper.sh | sort | sh reducer.sh > output
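
For mappers and reducers written in Python, the same map, sort, reduce sequence can be mimicked in-process. The sketch below is an illustration only: mapper, reducer, and run_local are hypothetical names, and the word-count logic is a stand-in for real job code. The sort step plays the role of the shuffle by bringing equal keys together before the reducer sees them.

```python
from itertools import groupby


def mapper(line):
    """Example mapper: emit (word, 1) for each whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def reducer(key, values):
    """Example reducer: sum the counts for one key."""
    return key, sum(values)


def run_local(lines):
    """Mimic `cat | mapper | sort | reducer` on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=lambda pair: pair[0])  # stands in for `sort` / the shuffle
    return [
        reducer(key, (count for _, count in group))
        for key, group in groupby(mapped, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    for word, total in run_local(["a b a", "b c"]):
        print(f"{word}\t{total}")
```

For the two sample lines, run_local returns [("a", 2), ("b", 2), ("c", 1)].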
