MapReduce Tutorial: Notes and Summary

Prerequisites

A Hadoop cluster must be installed, configured, and running normally.

Overview

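An MR (short for MapReduce, used throughout) job splits the input data-set into independent chunks, which are processed by map tasks in parallel. The framework sorts the map outputs, which then become the input to the reduce tasks. Both the input and the output of an MR job are stored in a file system such as HDFS.
Typically the compute nodes and the storage nodes are the same, that is, the NodeManager and the DataNode run on the same machine. This lets the framework schedule tasks on the nodes where the data already resides, which results in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and one MRAppMaster per application.
At a minimum, an application specifies the input/output locations and supplies the map and reduce functions (by implementing the appropriate interfaces). These and the other job parameters together make up the job configuration.
The client submits the job (for example a runnable jar) and the configuration to the ResourceManager, which then distributes the work to the slaves, schedules and monitors the tasks, and provides status and diagnostic information to the client.
Although the Hadoop framework is written in Java, MR applications need not be written in Java.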

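Hadoop Streaming is a utility that lets users create and run jobs with any executable (for example a shell or Python script) as the Mapper/Reducer.
Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (not JNI based).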

Inputs and Outputs

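The MR framework operates exclusively on <k,v> pairs: both the input and the output of a job are sets of <k,v> pairs.
The key and value classes must be serializable by the framework, so they must implement the Writable interface (they are sent over the network). In addition, the key classes must implement WritableComparable, because the framework sorts by key.
The input and output types of a MapReduce job flow as follows: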

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
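To make the type flow above concrete, here is a minimal in-memory imitation of it in plain Java, using word count as the job. It is only a sketch and does not use the Hadoop API: the class name FlowSketch and its methods are made up for illustration, the combine step is left out, and a TreeMap stands in for the framework's sort and group step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of <k1,v1> -> map -> <k2,v2> -> reduce -> <k3,v3> (hypothetical names). */
final class FlowSketch {

    /** "map": one input line (v1) becomes a list of <word, 1> pairs (<k2, v2>). */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    /** "sort + group": collect the values of each key; the TreeMap keeps the keys sorted. */
    static TreeMap<String, List<Integer>> sortAndGroup(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** "reduce": <k2, (list of v2)> becomes <k3, v3>, here the sum of the counts. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three steps over all input lines. */
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        sortAndGroup(intermediate).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello world", "hello hadoop")));
    }
}
```

In a real job the intermediate pairs are spread over many machines; the sketch only shows how the types change from step to step.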

MapReduce - User Interfaces

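The first interfaces to know are Mapper and Reducer. Applications typically implement only their map and reduce methods.
The other interfaces include Job, Partitioner, InputFormat, OutputFormat and a few more.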

Mapper

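Mappers transform input <k,v> pairs into intermediate <k,v> pairs, i.e. the map output, which is stored on local disk.
The MapReduce framework creates one map task for each InputSplit generated by the job's InputFormat.
The Mapper implementation is set on the Job with job.setMapperClass(Class). The framework then calls map(WritableComparable, Writable, Context) for every <k,v> pair in the map task's InputSplit. An application can override cleanup(Context), which is called once before the map task ends, to perform any required cleanup, and setup(Context), which is called once before the map task starts processing records, to perform initialization.
The <k,v> pairs emitted by a map task are written to a buffer through context.write(WritableComparable, Writable).
Applications can report statistics through Counters, which are described in detail later.
The Mapper outputs are sorted and then partitioned per Reducer. The number of partitions equals the number of reduce tasks. Users can control which keys go to which Reducer by implementing a custom Partitioner; the default is HashPartitioner.
The intermediate <k,v> pairs, i.e. the sorted map outputs, are stored in a simple (key-len, key, value-len, value) format. Whether and how the map output is compressed is configurable. In the older JobConf API this is done with JobConf.setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass), which amounts to setting the following two properties: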

mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=<Class>
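The (key-len, key, value-len, value) layout of the intermediate records mentioned above can be pictured with a small plain-Java sketch. It is a simplification under stated assumptions: the lengths are written as fixed 4-byte integers and keys and values are UTF-8 strings, which is not necessarily how Hadoop encodes them on disk. The class name RecordLayoutSketch is made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Simplified picture of a (key-len, key, value-len, value) record (hypothetical names). */
final class RecordLayoutSketch {

    /** Encodes one record as: 4-byte key length, key bytes, 4-byte value length, value bytes. */
    static byte[] encode(String key, String value) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        byte[] valueBytes = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + keyBytes.length + 4 + valueBytes.length)
                .putInt(keyBytes.length)
                .put(keyBytes)
                .putInt(valueBytes.length)
                .put(valueBytes)
                .array();
    }

    public static void main(String[] args) {
        byte[] record = encode("hello", "1");
        System.out.println("record length in bytes: " + record.length);
    }
}
```

Because every field is preceded by its length, a reader can walk through a file of such records without any separator characters.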

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

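In short, the number of map tasks is usually determined by the total size of the input, that is, by the number of blocks in the input files.
A reasonable level of parallelism is around 10-100 map tasks per node, and up to 300 for very CPU-light tasks. Because task setup takes a while, each map task should ideally run for at least a minute.
A higher number of maps can be requested with Configuration.set(MRJobConfig.NUM_MAPS, int), but this is only a hint to the framework and is generally not recommended.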

Reducer

Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The framework then calls reduce(WritableComparable, Iterable, Context) method for each <key, (list of values)> pair in the grouped inputs

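A Reducer processes the intermediate <k,v> pairs that share a key.
The number of reduce tasks can be set manually with Job.setNumReduceTasks(int).
The Reducer implementation is set with Job.setReducerClass(Class).
The framework calls reduce(WritableComparable, Iterable<Writable>, Context) once for each <key, (list of values)> pair in the grouped inputs. How the grouping works is explained below.
An application can override cleanup(Context), which is called once before the reduce task ends, to perform any required cleanup, and setup(Context), which is called once before the reduce task starts processing, to perform initialization.
A Reducer has three main phases: shuffle, sort and reduce.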

shuffle

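The input to a Reducer is the sorted output of the Mappers. In the shuffle phase the framework fetches the relevant partition of every Mapper's output over HTTP. In other words, each reduce task copies the data of its own partition from all map tasks to local storage. Every reduce task has a partition number and can only fetch the map output that belongs to that partition. The output of each map task is divided into partitions by the Partitioner, which is described later.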

sort

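In the sort phase the framework groups the Reducer inputs by key, since different Mappers may have emitted the same key.
The shuffle and sort phases occur simultaneously: map outputs are merged while they are being fetched.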

Secondary Sort

If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via Job.setSortComparatorClass(Class). Since Job.setGroupingComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.

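In other words, if the rules for grouping intermediate keys before reduction need to differ from the rules used to sort them, a Comparator can be specified via Job.setSortComparatorClass(Class). Since Job.setGroupingComparatorClass(Class) controls how intermediate keys are grouped, the two comparators can be used together to simulate a secondary sort on values.
For more on this mechanism, see my other blog post on the MapReduce Secondary Sort mechanism.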
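The interplay of the two comparators can be sketched in plain Java without the Hadoop API. The sketch assumes a composite key made of the natural key plus the value part to sort on: one comparator orders by both fields (the role of the sort comparator), the other compares only the natural key (the role of the grouping comparator). The names SecondarySortSketch and CompositeKey are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java imitation of sort comparator plus grouping comparator (hypothetical names). */
final class SecondarySortSketch {

    /** Hypothetical composite key: the natural key plus the value part to sort on. */
    static final class CompositeKey {
        final String natural;
        final int secondary;

        CompositeKey(String natural, int secondary) {
            this.natural = natural;
            this.secondary = secondary;
        }
    }

    /** Role of the sort comparator: natural key first, then the secondary field. */
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.natural)
                      .thenComparingInt(k -> k.secondary);

    /** Role of the grouping comparator: only the natural key matters. */
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.natural);

    /** Sorts with SORT, then starts a new group whenever GROUP sees a different key. */
    static List<List<CompositeKey>> sortThenGroup(List<CompositeKey> input) {
        List<CompositeKey> sorted = new ArrayList<>(input);
        sorted.sort(SORT);
        List<List<CompositeKey>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = List.of(
                new CompositeKey("b", 2), new CompositeKey("a", 3),
                new CompositeKey("a", 1), new CompositeKey("b", 1));
        for (List<CompositeKey> group : sortThenGroup(keys)) {
            StringBuilder line = new StringBuilder(group.get(0).natural).append(":");
            for (CompositeKey key : group) {
                line.append(' ').append(key.secondary);
            }
            System.out.println(line);
        }
    }
}
```

Each group corresponds to one reduce call, and inside a group the secondary field arrives already sorted, which is the effect a secondary sort is after.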

Reduce

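In this phase reduce(WritableComparable, Iterable, Context) is called for each grouped <key, (list of values)> pair.
The reduce task writes its output to the file system through Context.write(WritableComparable, Writable).
Applications can report statistics through Counters, which are described in detail later.
The output of the Reducer is not sorted.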

How Many Reduces?

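The right number of reduce tasks seems to be 0.95 or 1.75 multiplied by (number of nodes * maximum number of containers per node).
With 0.95, all reduces can launch immediately and start transferring map outputs as the maps finish.
With 1.75, the faster nodes finish their first round of reduces and launch a second wave, which gives much better load balancing.
Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers in order to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
In practice, the number of reduce tasks also depends on the business requirement. For example, if the input is phone call records and the goal is to view all records per carrier (China Mobile / China Telecom / China Unicom), the number of reduce tasks should be set to 3, with a custom Partitioner that produces one partition per carrier. This can lead to data skew, which is left for a later discussion: if China Mobile records dominate the input, the reduce task responsible for China Mobile carries a much heavier load.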
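As a worked example of the 0.95 and 1.75 rule of thumb, the following sketch computes both suggestions for a cluster size that is purely an assumption (10 nodes with at most 8 containers each). The factors are passed as hundredths (95 and 175) so that the arithmetic stays in integers; the class and method names are made up for illustration.

```java
/** Worked example of the 0.95 / 1.75 rule of thumb (hypothetical names and cluster size). */
final class ReduceCountSketch {

    /**
     * Suggested number of reduce tasks.
     * factorHundredths is the scaling factor times 100, e.g. 95 for 0.95 or 175 for 1.75.
     */
    static int suggestedReduces(int factorHundredths, int nodes, int maxContainersPerNode) {
        return factorHundredths * nodes * maxContainersPerNode / 100;
    }

    public static void main(String[] args) {
        int nodes = 10;                // assumed cluster size
        int maxContainersPerNode = 8;  // assumed per-node container limit
        System.out.println("0.95 rule: " + suggestedReduces(95, nodes, maxContainersPerNode));
        System.out.println("1.75 rule: " + suggestedReduces(175, nodes, maxContainersPerNode));
    }
}
```

Whichever number is chosen would then be passed to Job.setNumReduceTasks(int), unless the business logic dictates a fixed count as in the carrier example.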

Reducer NONE

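It is legal to set the number of reduce tasks to zero when no reduction is needed.
In this case the map tasks write their output directly to the output path that the job specifies on the file system, and the framework does not sort the output before writing it. This makes Reducer NONE well suited to data cleansing: the raw data is cleaned and written straight to the file system without being sorted. The figure below shows the output of the official secondarySort example run with the number of reduce tasks set to 0 (mapreduce.job.reduces=0); the output keeps the same order as the original data.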
hadoop jar secondarysort.jar -Dmapreduce.job.reduces=0 /user/root/input1 /user/root/output12

Partitioner

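The Partitioner partitions the key space: it assigns the keys of the intermediate map outputs to partitions. By default the partition is derived from the hash code of the key. The total number of partitions equals the number of reduce tasks of the job, so each reduce task fetches the map output that belongs to its own partition.
HashPartitioner is the default Partitioner.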
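The hash-based assignment can be sketched in a few lines of plain Java. The formula below (clear the sign bit of the hash code, then take it modulo the number of reduce tasks) is my understanding of what HashPartitioner does. Note that plain Java strings are used here, and a Hadoop Text key may produce a different hash code than the equivalent String. The class name is hypothetical.

```java
/** Sketch of hash partitioning over plain Java objects (hypothetical class name). */
final class HashPartitionSketch {

    /** Maps a key to a partition in [0, numReduceTasks) based on its hash code. */
    static int partitionFor(Object key, int numReduceTasks) {
        // Clearing the sign bit keeps the result non-negative even for negative hash codes.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String key : new String[] {"a", "b", "c", "hadoop"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Equal keys always land in the same partition, which is what guarantees that a single reduce task sees all values of a key.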

Counter

Counter is a facility for MapReduce applications to report their statistics.
Mapper and Reducer implementations can use Counters to report statistics.
See also my blog post on the MapReduce Counter.

Job Configuration

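Job represents a MapReduce job configuration. It is the primary interface through which a user describes a MapReduce job to the Hadoop framework for execution. The framework tries to execute the job faithfully as described. However, some parameters are marked as final (Final Parameters) and cannot be altered, while others can be set directly, such as the number of reduce tasks via job.setNumReduceTasks(int tasks) mentioned above.
Job is typically used to specify the Mapper, Partitioner, Reducer, InputFormat and OutputFormat implementations.
Input paths are set with FileInputFormat.setInputPaths(Job, Path…).
The output path is set with FileOutputFormat.setOutputPath(Job, Path).
Job can also be used to set the Comparator (see the Secondary Sort section above), add files to the DistributedCache, decide whether and how map/reduce outputs are compressed, decide whether map/reduce tasks may run speculatively, set the maximum number of attempts per map/reduce task, and so on.
Arbitrary parameters can be set and read with Configuration.set(String, String) and Configuration.get(String). For large read-only data such as dictionaries, use the DistributedCache instead.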

Task Execution & Environment

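The MRAppMaster executes each Mapper/Reducer task as a child process in a separate JVM. The child task inherits the environment of the parent MRAppMaster.
Additional options can be passed to the child JVM through mapreduce.{map|reduce}.java.opts. If the value of mapreduce.{map|reduce}.java.opts contains the symbol @taskid@, it is replaced with the task id of the MapReduce task. The official example is as follows: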

<property>
  <name>mapreduce.map.java.opts</name>
  <value> -Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false </value>
</property>

<property>
  <name>mapreduce.reduce.java.opts</name>
  <value> -Xmx1024M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false </value>
</property>

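Here -Xmx sets the maximum heap size, and -verbose:gc -Xloggc:/tmp/@taskid@.gc enables GC logging and sets the path of the GC log.
com.sun.management.jmxremote.authenticate=false disables password authentication for remote JMX connections.
com.sun.management.jmxremote.ssl=false disables SSL for JMX.
For more details about JMX, see the Oracle JDK documentation.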
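The @taskid@ substitution described above amounts to a plain string replacement, which the following sketch imitates. It only illustrates the idea and is not the framework's code; the class name is hypothetical and the attempt id is just an example value.

```java
/** Imitation of the @taskid@ interpolation in java.opts values (hypothetical class name). */
final class TaskIdInterpolationSketch {

    /** Replaces every @taskid@ placeholder in the JVM options with the given task id. */
    static String interpolate(String javaOpts, String taskId) {
        return javaOpts.replace("@taskid@", taskId);
    }

    public static void main(String[] args) {
        String opts = "-Xmx512M -verbose:gc -Xloggc:/tmp/@taskid@.gc";
        // Example attempt id, used here only for illustration.
        System.out.println(interpolate(opts, "attempt_200709221812_0001_m_000000_0"));
    }
}
```

The result is that every task writes its GC log to its own file instead of all tasks on a node overwriting the same one.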
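To inspect the memory of a MapTask on a remote server from a local machine, for example with jmc or jvisualvm,
the following options must also be added to specify the JMX port of the MapTask. com.sun.management.jmxremote is already true by default and does not need to be set explicitly.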

-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=6666

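Note that if mapreduce.map.java.opts specifies jmxremote.port and several MapTask processes run on the same node, containers will fail to start because the port is already in use. It is therefore best to run a single MapTask and inspect it remotely.
The figure below shows the resource usage of a MapTask process as seen in jmc and jvisualvm.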

Memory Management

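Users can specify the maximum virtual memory of the launched child task, and of any sub-process it launches, with mapreduce.{map|reduce}.memory.mb. Note that this is a per-process limit. The value is in megabytes (MB), and it should be greater than or equal to the -Xmx passed to the JVM, otherwise the VM might not start.
Note: mapreduce.{map|reduce}.java.opts only configures the child tasks launched by the MRAppMaster.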

Map Parameters

A record emitted from a map will be serialized into a buffer and metadata will be stored into accounting buffers. As described in the following options, when either the serialization buffer or the metadata exceed a threshold, the contents of the buffers will be sorted and written to disk in the background while the map continues to output records. If either buffer fills completely while the spill is in progress, the map thread will block. When the map is finished, any remaining records are written to disk and all on-disk segments are merged into a single file. Minimizing the number of spills to disk can decrease map time, but a larger buffer also decreases the memory available to the mapper.

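In other words, records emitted by the map through context.write are serialized into a buffer (the OutputCollector), and their metadata is stored in accounting buffers. When either the serialization buffer or the metadata exceeds a threshold, the buffer contents are sorted and written to disk in the background while the map keeps emitting records. If either buffer fills up completely while a spill is in progress, the map thread blocks. When the map task finishes, the remaining records are written to disk and all on-disk segments are merged into a single file, so the final output of each MapTask is one merged file. Fewer spills mean less map time, but a larger buffer also leaves less memory for the mapper.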

Name	Type	Description
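mapreduce.task.io.sort.mb	int	The cumulative size, in megabytes, of the serialization and accounting buffers that store the records emitted from the map
mapreduce.map.sort.spill.percent	float	The soft limit in the serialization buffer. Once it is reached, a spiller thread begins to spill the contents to disk in the background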

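For example, if mapreduce.map.sort.spill.percent is set to 0.33 and the map keeps filling the rest of the buffer through context.write while the spill runs, the next spill will include all the collected records, i.e. 0.66 of the buffer, and no additional spill is generated. In other words, the thresholds define triggers, not blocking.
A record larger than the serialization buffer first triggers a spill and is then spilled to a separate file.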

Shuffle/Reduce Parameters

As described previously, each reduce fetches the output assigned to it by the Partitioner via HTTP into memory and periodically merges these outputs to disk. If intermediate compression of map outputs is turned on, each output is decompressed into memory. The following options affect the frequency of these merges to disk prior to the reduce and the memory allocated to map output during the reduce.

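In short, each reduce fetches the output assigned to it by the Partitioner over HTTP into memory and periodically merges these outputs to disk. If compression of the map outputs is enabled, each output is decompressed into memory. The following options affect how often these merges to disk happen before the reduce, and how much memory is allocated to map output during the reduce.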

Name	Type	Description
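mapreduce.task.io.sort.factor	int	The number of on-disk segments to merge at the same time. It limits the number of open files and compression codecs during a merge. If the number of files exceeds this limit, the merge proceeds in several passes. The default is 10
mapreduce.reduce.merge.inmem.threshold	int	The threshold, in number of files, for the in-memory merge process. Once this many files have accumulated, the in-memory merge is started and the result is spilled to disk. A value of 0 or less means that no threshold is wanted and that the merge is triggered only by the memory consumption of the ramfs. The default is 1000
mapreduce.reduce.shuffle.merge.percent	float	The usage threshold at which an in-memory merge is started, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent. The default is 0.66
mapreduce.reduce.shuffle.input.buffer.percent	float	The percentage of the maximum heap size allocated to storing map outputs during the shuffle. Although some memory should be left for the framework, it is generally advantageous to set this high enough to store large and numerous map outputs. The default is 0.70
mapreduce.reduce.input.buffer.percent	float	The percentage of memory, relative to the maximum heap size, in which map outputs may be retained during the reduce. When the shuffle has finished, any map outputs remaining in memory must consume less than this threshold before the reduce can begin. The default is 0.0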
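Since two of the percentages in the table build on each other, a short calculation may help. The sketch below assumes a reduce task with a 1024 MB maximum heap (an example value) and the defaults listed above: 0.70 of the heap is reserved for map outputs during the shuffle, and an in-memory merge starts once 0.66 of that reserved part is used. The class and method names are made up for illustration.

```java
/** Worked example of how the shuffle memory percentages combine (hypothetical names). */
final class ShuffleMemorySketch {

    /** Memory reserved for map outputs during the shuffle, in MB. */
    static double shuffleBufferMb(double maxHeapMb, double inputBufferPercent) {
        return maxHeapMb * inputBufferPercent;
    }

    /** Usage level, in MB, at which an in-memory merge is started. */
    static double mergeTriggerMb(double maxHeapMb, double inputBufferPercent, double mergePercent) {
        return shuffleBufferMb(maxHeapMb, inputBufferPercent) * mergePercent;
    }

    public static void main(String[] args) {
        double maxHeapMb = 1024;  // assumed -Xmx of the reduce task
        System.out.printf("shuffle buffer: %.1f MB%n", shuffleBufferMb(maxHeapMb, 0.70));
        System.out.printf("merge trigger:  %.1f MB%n", mergeTriggerMb(maxHeapMb, 0.70, 0.66));
    }
}
```

With these example numbers roughly 717 MB are set aside for fetched map outputs, and merging starts at about 473 MB of usage.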


Configured Parameters

The following properties are localized in the job configuration for each task’s execution

Name	Type	Description
mapreduce.job.id	String	The job id
mapreduce.job.jar	String	job.jar location in job directory
mapreduce.job.local.dir	String	The job specific shared scratch space
mapreduce.task.id	String	The task id
mapreduce.task.attempt.id	String	The task attempt id
mapreduce.task.is.map	boolean	Is this a map task
mapreduce.task.partition	int	The id of the task within the job
mapreduce.map.input.file	String	The filename that the map is reading from
mapreduce.map.input.start	long	The offset of the start of the map input split
mapreduce.map.input.length	long	The number of bytes in the map input split
mapreduce.task.output.dir	String	The task’s temporary output directory

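When a job is submitted with Hadoop Streaming (covered in another blog post), the parameters above are available with every dot (.) replaced by an underscore (_). For example, mapreduce.job.id becomes mapreduce_job_id.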
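The renaming rule for Streaming jobs is simple enough to show in code. The sketch below converts a property name into the underscore form and looks it up in the process environment; outside of a streaming task the lookup simply returns null. The class and method names are hypothetical.

```java
/** Sketch of the dot-to-underscore renaming used for Streaming jobs (hypothetical names). */
final class StreamingNameSketch {

    /** Converts a job property name such as mapreduce.job.id into mapreduce_job_id. */
    static String toStreamingName(String propertyName) {
        return propertyName.replace('.', '_');
    }

    public static void main(String[] args) {
        String name = toStreamingName("mapreduce.job.id");
        // Inside a streaming task the value would come from the environment; elsewhere this prints null.
        System.out.println(name + " = " + System.getenv(name));
    }
}
```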

Task Logs

The NodeManager reads the task's stdout, stderr and syslog and stores them under ${HADOOP_LOG_DIR}/userlogs.

Job Submission and Monitoring

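Job is the primary interface through which a user job interacts with the ResourceManager.
The job submission process involves the following steps:

Checking the input and output specifications of the job
Computing the InputSplits for the job (i.e. how many splits there are)
Setting up the DistributedCache if necessary, for example to cache a dictionary
Copying the job's jar and configuration to the MapReduce system directory on the file system
Submitting the job to the ResourceManager and optionally monitoring its status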

Job history files are written to the default locations given by mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir:

yarn.app.mapreduce.am.staging-dir=/tmp/hadoop-yarn/staging
mapreduce.jobhistory.intermediate-done-dir=${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate	
mapreduce.jobhistory.done-dir=${yarn.app.mapreduce.am.staging-dir}/history/done

The details of a job can be viewed with mapred job -history output.jhist, where output.jhist is the job history file, given as a path on HDFS. For example:
mapred job -history /tmp/hadoop-yarn/staging/history/done/2020/02/15/000000/job_1581753465874_0002-1581753958635-root-secondary+sort-1581754265093-1-0-SUCCEEDED-default-1581753960803.jhist

Job Control

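Users may need to chain several MapReduce jobs to accomplish a complex task that cannot be done with a single job. This is fairly easy, since the output of a job typically goes to the distributed file system and can in turn be used as the input of the next job.
However, this also means that the responsibility for ensuring that jobs complete (success/failure) lies entirely with the client. In such cases, the job control options are:
Job.submit(): submits the job to the cluster and returns immediately.
Job.waitForCompletion(boolean): submits the job to the cluster and waits for it to finish.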

Job Input

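InputFormat describes the input specification of a MapReduce job.
The MapReduce framework relies on the job's InputFormat to:

Validate the input specification of the job.
Split the input files into logical InputSplit instances, each of which is then assigned to an individual map task.
Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the map task.

The default behavior of file-based InputFormat implementations (typically subclasses of FileInputFormat) is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. The file system block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set with mapreduce.input.fileinputformat.split.minsize; the default lower bound is 0.
TextInputFormat is the default InputFormat.
If TextInputFormat is the InputFormat of a job, the framework detects input files with the .gz extension and decompresses them automatically with the appropriate CompressionCodec. Note, however, that compressed files with this extension cannot be split: each compressed file is processed in its entirety by a single map task.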
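The bounds on the split size can be turned into a small calculation. The sketch below is modeled on my understanding of FileInputFormat and is not its actual code: the split size is the block size clamped between a lower and an upper bound, and a file is cut into splits of that size, with a final remainder of up to 10% over the split size tolerated (the 1.1 slack factor is an assumption of this sketch). Only files with a positive length are considered, and the class and method names are made up.

```java
/** Sketch of split-size arithmetic for a single splittable file (hypothetical names). */
final class SplitSizeSketch {

    /** Assumed slack: the last split may be up to 10% larger than the split size. */
    private static final double SPLIT_SLOP = 1.1;

    /** The block size, clamped between the configured lower and upper bounds. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits (and therefore map tasks) for one file of the given positive length. */
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++;  // the final, possibly smaller or slightly larger, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        System.out.println("split size in MB: " + splitSize / mb);
        System.out.println("splits for a 300 MB file: " + countSplits(300 * mb, splitSize));
    }
}
```

Raising the lower bound above the block size is the usual way to get fewer, larger splits; a .gz file, as noted above, always ends up as a single split regardless of these numbers.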

InputSplit

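One InputSplit corresponds to one map task: the InputSplit is the data that the map task processes.
Typically an InputSplit presents a byte-oriented view of the input, and the RecordReader is responsible for processing it and presenting a record-oriented view.
FileSplit is the default InputSplit.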

RecordReader

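RecordReader reads <key, value> pairs from an InputSplit.
Typically the RecordReader converts the byte-oriented view of the input provided by the InputSplit into a record-oriented view and presents it to the Mapper implementation for processing.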

Job Output

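OutputFormat describes the output specification of a MapReduce job.
The MapReduce framework relies on the job's OutputFormat to:

Validate the output specification of the job; for example, check that the output directory does not already exist.
Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a file system such as HDFS.

TextOutputFormat is the default OutputFormat.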

OutputCommitter

OutputCommitter describes the commit of task output for a MapReduce job.
The MapReduce framework relies on the OutputCommitter of the job to:

Setup the job during initialization. For example, create the temporary output directory for the job during the initialization of the job. Job setup is done by a separate task when the job is in PREP state and after initializing tasks. Once the setup task completes, the job will be moved to RUNNING state.

Cleanup the job after the job completion. For example, remove the temporary output directory after the job completion. Job cleanup is done by a separate task at the end of the job. Job is declared SUCCEEDED/FAILED/KILLED after the cleanup task completes.

Setup the task temporary output. Task setup is done as part of the same task, during task initialization.

Check whether a task needs a commit. This is to avoid the commit procedure if a task does not need commit.

Commit of the task output. Once task is done, the task will commit its output if required.

Discard the task commit. If the task has been failed/killed, the output will be cleaned-up. If task could not cleanup (in exception block), a separate task will be launched with same attempt-id to do the cleanup.

FileOutputCommitter is the default OutputCommitter. Job setup/cleanup tasks occupy map or reduce containers, whichever is available on the NodeManager. And JobCleanup task, TaskCleanup tasks and JobSetup task have the highest priority, and in that order.

Task Side-Effect Files

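In some cases two instances of the same map/reduce task run simultaneously (for example under speculative execution) and try to create or write the same side-effect files.
The application writer therefore has to pick names that are unique per task-attempt, not just per task.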

Hence the application-writer will have to pick unique names per task-attempt (using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per task.

RecordWriter

RecordWriter writes the output <key, value> pairs to an output file.
RecordWriter implementations write the job outputs to the FileSystem.

Other Useful Features

Submitting Jobs to Queues

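Users submit jobs to queues. A queue, as a collection of jobs, allows the system to provide specific functionality. For example, queues use ACLs to control which users can submit jobs to them. Queues are used primarily by Hadoop schedulers.
Hadoop comes configured with a single mandatory queue called default. The queue a job is assigned to can be set with the mapreduce.job.queuename property or with Configuration.set(MRJobConfig.QUEUE_NAME, String); if it is not set explicitly, the job goes to the default queue. Some job schedulers, such as the Capacity Scheduler, support multiple queues.
For more on schedulers, see my other blog post on the Yarn Scheduler (to be written).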

DistributedCache

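DistributedCache efficiently distributes large, application-specific, read-only files such as dictionaries.
DistributedCache is a facility provided by the MapReduce framework to cache the files (text files, archives, jars and so on) that applications need.
DistributedCache assumes that the files specified via hdfs:// URLs are already present on the file system (e.g. HDFS).
The framework copies the necessary files to a node before any task of the job runs on that node. Its efficiency stems from the fact that the files are copied only once per job, and from its ability to cache archives, which are automatically un-archived into the task working directory on that node.
DistributedCache can distribute simple read-only data/text files as well as more complex types such as archives and jars. Archives (zip, tar, tgz and tar.gz files) are un-archived on the slave nodes. Files have execution permissions set.
The files to distribute can be set with mapreduce.job.cache.{files|archives}; multiple files are separated by commas. They can also be set with Job.addCacheFile(URI), Job.addCacheArchive(URI), Job.setCacheFiles(URI[]) and Job.setCacheArchives(URI[]). The URI has the form hdfs://host:port/absolute-path#link-name, where link-name is an alias (symlink name) that the program can use to refer to the file. For jobs submitted through Hadoop Streaming, files can be distributed with the command-line options -cacheFile/-cacheArchive.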
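The link-name is simply the fragment of the URI, so it can be read with the standard java.net.URI class. The sketch below picks the name under which a cached file would be opened: the fragment when one is given, otherwise the last path segment. The fallback is a choice made for this sketch, and the host name, port and class name are made up for illustration.

```java
import java.net.URI;

/** Sketch of reading the link-name (fragment) of a cache URI (hypothetical names). */
final class CacheUriSketch {

    /** Returns the fragment (link-name) if present, otherwise the file name from the path. */
    static String localName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Example URIs; "namenode:8020" is a placeholder, not a real cluster address.
        System.out.println(localName(URI.create("hdfs://namenode:8020/user/root/patterns.txt#pp")));
        System.out.println(localName(URI.create("hdfs://namenode:8020/user/root/patterns.txt")));
    }
}
```

A task would then open the file through that local name, for example with new FileReader(localName(uri)), rather than through the full HDFS path.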

Private and Public DistributedCache Files

DistributedCache files can be private or public, that determines how they can be shared on the slave nodes.

“Private” DistributedCache files are cached in a local directory private to the user whose jobs need these files. These files are shared by all tasks and jobs of the specific user only and cannot be accessed by jobs of other users on the slaves. A DistributedCache file becomes private by virtue of its permissions on the file system where the files are uploaded, typically HDFS. If the file has no world readable access, or if the directory path leading to the file has no world executable access for lookup, then the file becomes private.

“Public” DistributedCache files are cached in a global directory and the file access is setup such that they are publicly visible to all users. These files can be shared by tasks and jobs of all users on the slaves. A DistributedCache file becomes public by virtue of its permissions on the file system where the files are uploaded, typically HDFS. If the file has world readable access, AND if the directory path leading to the file has world executable access for lookup, then the file becomes public. In other words, if the user intends to make a file publicly available to all users, the file permissions must be set to be world readable, and the directory permissions on the path leading to the file must be world executable.

Profiling

Profiling is a utility to get a representative (2 or 3) sample of built-in java profiler for a sample of maps and reduces

In other words, profiling is a utility for obtaining a representative sample (2 or 3) of the built-in Java profiler's output for a sample of maps and reduces.

Name	Value	Description
mapreduce.task.profile	true	Whether the system should collect profiler information for some of the tasks of the job. The information is stored in the user log directory
mapreduce.task.profile.params	-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=y,thread=y,verbose=n,file=%s	The JVM profiler parameters used to profile map and reduce tasks. %s is replaced with the path of the profile.out file. More specific parameters for maps and reduces can be set with the properties below
mapreduce.task.profile.map.params	${mapreduce.task.profile.params}	The JVM profiler parameters for map tasks
mapreduce.task.profile.maps	0-2	The ranges of map tasks to profile
mapreduce.task.profile.reduce.params	${mapreduce.task.profile.params}	The JVM profiler parameters for reduce tasks
mapreduce.task.profile.reduces	0-2	The ranges of reduce tasks to profile

For how to write the value of mapreduce.task.profile.maps, see the comments on org.apache.hadoop.conf.Configuration.IntegerRanges in the source code.
For example, 2-3,5,7- means 2, 3, 5, and 7, 8, 9, ...

A class that represents a set of positive integer ranges.
It parses strings of the form: “2-3,5,7-” where ranges are separated by comma and the lower/upper bounds are separated by dash.
Either the lower or upper bound may be omitted meaning all values up to or over.
So the string above means 2, 3, 5, and 7, 8, 9, …
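The range syntax described above can be checked with a small stand-alone parser. This is a sketch written from the description, not the code of Configuration.IntegerRanges: an omitted lower bound is treated as 0 and an omitted upper bound as Integer.MAX_VALUE, and the class and method names are hypothetical.

```java
/** Stand-alone sketch of the "2-3,5,7-" range syntax (hypothetical names). */
final class IntegerRangesSketch {

    /** Returns true if value falls into one of the comma-separated ranges. */
    static boolean isIncluded(String ranges, int value) {
        for (String part : ranges.split(",")) {
            part = part.trim();
            if (part.isEmpty()) {
                continue;
            }
            int dash = part.indexOf('-');
            int low;
            int high;
            if (dash < 0) {
                // A single number such as "5".
                low = Integer.parseInt(part);
                high = low;
            } else {
                // A range such as "2-3", "7-" (no upper bound) or "-4" (no lower bound).
                String lowText = part.substring(0, dash).trim();
                String highText = part.substring(dash + 1).trim();
                low = lowText.isEmpty() ? 0 : Integer.parseInt(lowText);
                high = highText.isEmpty() ? Integer.MAX_VALUE : Integer.parseInt(highText);
            }
            if (value >= low && value <= high) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (int task = 0; task <= 8; task++) {
            System.out.println("task " + task + " profiled: " + isIncluded("2-3,5,7-", task));
        }
    }
}
```

With the value 0-2 from the table, only the tasks numbered 0, 1 and 2 would be profiled.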

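The figure below shows the profile.out file I obtained after enabling profiling.

Because I was using Hadoop 2.6.5, the profile.out file produced after adding the profiling parameters was truncated, i.e. incomplete. The two links below describe the problem in detail. The bug was fixed in Hadoop 2.8.0, so upgrade Hadoop if you want to collect profiling information.
Stack Overflow question about Hadoop hprof producing no CPU samples
The bug report and its resolution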

HPROF is a heap and CPU profiling tool that shipped with the JDK up to JDK 8. For more details, see the Oracle JDK documentation on HPROF.

Debugging

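The MapReduce framework provides a facility to run user-provided scripts for debugging. When a MapReduce task fails, the user can run a debug script, for example to process the task logs. The script has access to the task's stdout, stderr, syslog and jobconf. The stdout and stderr of the debug script are displayed as part of the job UI.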

How to distribute the script file

The user needs to use the DistributedCache to distribute the script and create a symlink (i.e. an alias) to it.

How to submit the script

The debug scripts for map and reduce tasks are set with the properties mapreduce.map.debug.script and mapreduce.reduce.debug.script respectively. They can also be set through the API with Configuration.set(MRJobConfig.MAP_DEBUG_SCRIPT, String) and Configuration.set(MRJobConfig.REDUCE_DEBUG_SCRIPT, String).
The stdout, stderr, syslog and jobconf files are the arguments of the debug script. The debug command runs on the node where the map/reduce task failed, and has the form $script $stdout $stderr $syslog $jobconf.

Pipes programs have the c++ program name as a fifth argument for the command. Thus for the pipes programs the command is
$script $stdout $stderr $syslog $jobconf $program

Data Compression

Hadoop MapReduce can compress both the intermediate map outputs and the job outputs, i.e. the output of the reduces. The gzip, bzip2, snappy and lz4 codecs are supported.

Intermediate Outputs

Whether the map output is compressed is controlled with Configuration.set(MRJobConfig.MAP_OUTPUT_COMPRESS, boolean).
The codec used for the compression is set with Configuration.set(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC, Class).

Job Outputs

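Whether the job output is compressed is controlled with FileOutputFormat.setCompressOutput(Job, boolean).
The codec used for the compression is set with FileOutputFormat.setOutputCompressorClass(Job, Class).
If the job output is written with SequenceFileOutputFormat, the compression type must be one of the SequenceFile.CompressionType values (NONE, RECORD or BLOCK).
It is set with SequenceFileOutputFormat.setOutputCompressionType(Job, SequenceFile.CompressionType).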

Skipping Bad Records

See the Skipping Bad Records section of the official documentation.

Example: WordCount v2.0

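The code below is adapted from the official example. The main change is the handling of link-name, which is explained in the code comments, so I will not repeat it here.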

package com.utstar.patrick.hadoop.mr.wcdemo;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;

public class WordCountV2 {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable>{

        static enum CountersEnum { INPUT_WORDS }

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        private boolean caseSensitive;
        private Set<String> patternsToSkip = new HashSet<String>();

        private Configuration conf;
        private BufferedReader fis;

        @Override
        public void setup(Context context) throws IOException,
                InterruptedException {
            // There is no explicit API for getting the container id of the running task attempt,
            // so print the task attempt id and match it to the container id via the job UI and the logs.
            System.out.println(context.getTaskAttemptID().toString());
            conf = context.getConfiguration();
            caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
            if (conf.getBoolean("wordcount.skip.patterns", true)) {
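                // Get the cache files (getCacheFiles() returns null when none were distributed)
                URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
                if (patternsURIs == null) {
                    patternsURIs = new URI[0];
                }
                // Get the cache archives (may also be null)
                URI[] archiveURIs = Job.getInstance(conf).getCacheArchives();
                if (archiveURIs != null) {
                    Arrays.stream(archiveURIs).forEach(x -> {
                        System.out.println("URI=" + x.toString());
                        System.out.println("fragment=" + x.getFragment());
                    });
                }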
                for (URI patternsURI : patternsURIs) {
                    Path patternsPath = new Path(patternsURI.getPath());
                    System.out.println("URI="+patternsURI.toString());
                    System.out.println(patternsPath.toString());
                    String patternsFileName = patternsPath.getName().toString();
//                    parseSkipFile(patternsFileName);
                    System.out.println("pathName="+patternsFileName+", fragment="+patternsURI.getFragment());
                    System.out.println("------------------------");

                    // Prefer the fragment: if the URI carries a link-name (symlink), the file must be read via {link-name}.
                    // If no link-name is given, the fragment defaults to the file name.
                    // hdfs://host:port/absolute-path#link-name
                    // See URI.java: [<scheme>:]<scheme-specific-part>[#<fragment>]
                    parseSkipFile(patternsURI.getFragment());
                }
            }
        }

        private void parseSkipFile(String fileName) {
            try {
                fis = new BufferedReader(new FileReader(fileName));
                String pattern = null;
                while ((pattern = fis.readLine()) != null) {
                    patternsToSkip.add(pattern);
                }
            } catch (IOException ioe) {
                System.err.println("Caught exception while parsing the cached file '"
                        + StringUtils.stringifyException(ioe));
            }
        }

        @Override
        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            String line = (caseSensitive) ?
                    value.toString() : value.toString().toLowerCase();
            for (String pattern : patternsToSkip) {
                line = line.replaceAll(pattern, "");
            }
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
                Counter counter = context.getCounter(CountersEnum.class.getName(),
                        CountersEnum.INPUT_WORDS.toString());
                counter.increment(1);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
        String[] remainingArgs = optionParser.getRemainingArgs();
        // Exactly 2 arguments (<in> <out>) or 4 (<in> <out> -skip <file>) are accepted.
        if (remainingArgs.length != 2 && remainingArgs.length != 4) {
            System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountV2.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        List<String> otherArgs = new ArrayList<String>();
        for (int i=0; i < remainingArgs.length; ++i) {
            if ("-skip".equals(remainingArgs[i])) {
                job.addCacheFile(new Path(remainingArgs[++i]).toUri());
                job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
            } else {
                otherArgs.add(remainingArgs[i]);
            }
        }
        FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path(otherArgs.get(1)),true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

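The command I ran is shown below. Instead of distributing files through the API as in the official code, I distributed them with the -archives and -files command-line options, which has the same effect. I also set a link-name, i.e. an alias, for each of them.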
hadoop jar wordcountV2.jar -archives ../test/a.zip#abc -files patterns.txt#pp -Dwordcount.skip.patterns=true -Dwordcount.case.sensitive=false /user/root/input /user/root/wd_output

Important: to use GENERIC_OPTIONS, the code must parse them with GenericOptionsParser; otherwise options such as -files are not recognized.

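The job produces the same result as the official code.
The figure below shows the distributed files found under ${hadoop.tmp.dir}/nm-local-dir/usercache on a NodeManager machine. The corresponding directory on every NodeManager machine contains the distributed files, which confirms the description of DistributedCache above.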

The figure below shows the stdout found in the log directory of a specific container, which corresponds to the System.out output in the code.

References

Official Hadoop MapReduce Tutorial