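Hadoop Streaming

Hadoop Streaming is a utility that ships with the Hadoop distribution. It lets you create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.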

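Example Using Python

As the Hadoop Streaming example we use the word-count problem. Any job in Hadoop must have two phases: mapper and reducer. The mapper and reducer below are written as Python scripts so that they can run under Hadoop; the same could be written in Perl or Ruby.

Mapper Phase Code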

#!/usr/bin/python
import sys

# Input comes from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Break the line into words
    words = myline.split()
    # Iterate over the words list
    for myword in words:
        # Write the result to standard output as word<TAB>1
        print('%s\t%s' % (myword, 1))

Make sure this file has execute permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).

Reducer Phase Code

#!/usr/bin/python
import sys

current_word = ""
current_count = 0
word = ""

# Input comes from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Split the tab-separated input we got from mapper.py
    word, count = myline.split('\t', 1)
    # Convert count variable to integer
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

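Save the mapper and reducer code as mapper.py and reducer.py in the Hadoop home directory. Make sure both files have execute permission (chmod +x mapper.py and chmod +x reducer.py). Python is indentation-sensitive, so keep the indentation exactly as shown when copying the code.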
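
Before submitting a job, it can help to check the word-count logic locally. The sketch below is not part of the original scripts: it mimics, in a single process, what happens between the two phases (map, sort by key, reduce). The function names map_lines, reduce_lines, and run_local are hypothetical and chosen only for illustration, and the tab separator is assumed to match the streaming default.

```python
def map_lines(lines):
    """Mimic mapper.py: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)


def reduce_lines(sorted_lines):
    """Mimic reducer.py: sum the counts of consecutive lines sharing a key."""
    current_word = None
    current_count = 0
    for line in sorted_lines:
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # skip lines whose count is not a number
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield '%s\t%s' % (current_word, current_count)
            current_word = word
            current_count = count
    if current_word is not None:
        yield '%s\t%s' % (current_word, current_count)


def run_local(lines):
    """Map, sort (standing in for Hadoop's shuffle/sort), then reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == '__main__':
    # Hypothetical sample input, used only to exercise the functions.
    for out in run_local(['the quick fox', 'the lazy dog', 'the fox']):
        print(out)
```

The sort step matters: the reducer only compares each word with the previous one, so identical words must arrive on adjacent lines.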

Executing the WordCount Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path>/mapper.py \
   -reducer <path>/reducer.py

Here "\" is used for line continuation, to keep the command readable.

For example:

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py

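How Streaming Works

In the example above, both the mapper and the reducer are Python scripts that read their input from standard input and write their output to standard output. The utility creates a Map/Reduce job, submits it to the appropriate cluster, and monitors its progress until it completes.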

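When a script is specified for mappers, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its input into lines and feeds them to the process's standard input (STDIN). Meanwhile, the mapper collects the line-oriented output from the process's standard output (STDOUT) and converts each line into a key/value pair, which is collected as the mapper's output. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. If a line contains no tab character, the entire line is treated as the key and the value is null. This behavior can be customized as needed.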
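
The default splitting rule just described can be illustrated in a few lines of Python. This is only a model of the rule, not Hadoop's own code; the function name split_key_value is hypothetical, and Python's None stands in for the null value.

```python
def split_key_value(line, separator='\t'):
    """Split one output line at the first separator into (key, value)."""
    line = line.rstrip('\n')
    if separator in line:
        # Only the first separator counts; later ones stay in the value.
        key, value = line.split(separator, 1)
        return key, value
    # No separator: the whole line is the key and the value is null.
    return line, None


if __name__ == '__main__':
    for sample in ['hadoop\t1', 'a\tb\tc', 'no-tab-here']:
        print(split_key_value(sample))
```

Note that only the first tab separates the key from the value, so a value may itself contain tabs.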

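When a script is specified for reducers, each reducer task launches the script as a separate process, and then the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds them to the process's standard input (STDIN). Meanwhile, the reducer collects the line-oriented output from the process's standard output (STDOUT) and converts each line into a key/value pair, which is collected as the reducer's output. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. This behavior can be customized to suit specific requirements.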
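
Since the lines reaching a reducer script arrive sorted by key, consecutive lines with the same key can be treated as one group. The following is a minimal sketch of that idea using itertools.groupby from the Python standard library. It is an alternative to the hand-written loop in reducer.py, not the method used in the example, and the names parse and sum_by_key are hypothetical.

```python
from itertools import groupby


def parse(lines, separator='\t'):
    """Yield (key, count) pairs, skipping lines without a numeric count."""
    for line in lines:
        key, sep, value = line.rstrip('\n').partition(separator)
        if not sep:
            continue  # no separator, so there is no count to add
        try:
            count = int(value)
        except ValueError:
            continue  # count was not a number
        yield key, count


def sum_by_key(lines):
    """Sum the counts for each run of identical keys (input sorted by key)."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


if __name__ == '__main__':
    # Hypothetical, already sorted sample input.
    sample = ['fox\t1\n', 'fox\t1\n', 'the\t1\n', 'the\t1\n', 'the\t1\n']
    for key, total in sum_by_key(sample):
        print('%s\t%d' % (key, total))
```

Passing sys.stdin instead of the sample list would let the same functions serve as a reducer script, under the same assumption that the input is sorted by key.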

Important Commands

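Parameter	Description
-input directory/file-name	Input location for the mapper. (Required)
-output directory-name	Output location for the reducer. (Required)
-mapper executable or script or JavaClassName	Mapper executable. (Required)
-reducer executable or script or JavaClassName	Reducer executable. (Required)
-file file-name	Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName	The class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName	The class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName	Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName	Combiner executable for map output.
-cmdenv name=value	Passes the environment variable to streaming commands.
-inputreader	For backwards compatibility: specifies a record reader class (instead of an input format class).
-verbose	Verbose output.
-lazyOutput	Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks	Specifies the number of reducers.
-mapdebug	Script to call when a map task fails.
-reducedebug	Script to call when a reduce task fails.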
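
To show how several of these options combine on one command line, here is a small Python sketch that only assembles the argument list and does not run anything. The helper build_streaming_command and its parameter names are hypothetical; the flags are the ones listed in the table above, and the paths are taken from the earlier example.

```python
def build_streaming_command(hadoop_bin, streaming_jar, input_path, output_path,
                            mapper, reducer, files=(), num_reduce_tasks=None,
                            cmdenv=None):
    """Return the argument list for a streaming job (nothing is executed)."""
    cmd = [hadoop_bin, 'jar', streaming_jar,
           '-input', input_path,
           '-output', output_path,
           '-mapper', mapper,
           '-reducer', reducer]
    # Ship each listed file to the compute nodes.
    for path in files:
        cmd += ['-file', path]
    if num_reduce_tasks is not None:
        cmd += ['-numReduceTasks', str(num_reduce_tasks)]
    # Pass environment variables to the streaming commands.
    for name, value in sorted((cmdenv or {}).items()):
        cmd += ['-cmdenv', '%s=%s' % (name, value)]
    return cmd


if __name__ == '__main__':
    cmd = build_streaming_command(
        './bin/hadoop',
        'contrib/streaming/hadoop-streaming-1.2.1.jar',
        'myinput', 'myoutput',
        '/home/expert/hadoop-1.2.1/mapper.py',
        '/home/expert/hadoop-1.2.1/reducer.py',
        num_reduce_tasks=2)
    print(' '.join(cmd))
```

On a machine where Hadoop is installed, the resulting list could be handed to something like subprocess.call to launch the job; that step is left out here because it depends on the local setup.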
