MapReduce Tutorial: Notes and Reflections

Prerequisites

A Hadoop cluster must be installed, configured, and running.

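Overview

An MR (short for MapReduce, used throughout) job splits the input data-set into independent chunks, which map tasks process in parallel. The framework sorts the map outputs, which then become the input to the reduce tasks. Both the input and the output of an MR job are stored in a file system such as HDFS.
Typically the compute nodes and the storage nodes are the same machines, that is, the NodeManager and the DataNode run on the same host. This lets the framework schedule tasks on the nodes where the data already resides, which makes good use of the aggregate bandwidth of the cluster.
The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and one MRAppMaster per application.
At minimum, an application specifies the input/output locations and supplies the map and reduce functions (by implementing the corresponding map/reduce interfaces). These, together with the other job parameters, make up the job configuration.
The client submits the job (for example, a runnable jar) and the configuration to the ResourceManager, which then distributes the work to the worker nodes, schedules and monitors the tasks, and provides status and diagnostic information to the client.
Although the Hadoop framework is written in Java, MR applications need not be.

Hadoop Streaming is a utility that lets users create and run jobs with any executable (such as a shell or Python script) as the Mapper/Reducer.
Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (it is not based on JNI).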

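Inputs and Outputs

The MR framework operates exclusively on <k,v> pairs: both the input and the output of a job are sets of <k,v> pairs.
Key and value classes must be serializable by the framework, i.e. they must implement the Writable interface (they are sent over the network). In addition, key classes must implement WritableComparable, because the framework sorts by key.
The input and output types of a MapReduce job flow as follows: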

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
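The flow above can be modelled in a few lines of plain Java. This is only a sketch of the data flow, not Hadoop code: the class name KvFlowSketch is hypothetical, a TreeMap stands in for the framework's sort and grouping, and the combine step is left out.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of <k1,v1> -> map -> <k2,v2> -> reduce -> <k3,v3>; not Hadoop code.
class KvFlowSketch {

    // map: one input line becomes a list of <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // The TreeMap plays the role of the framework: it sorts by key and groups values per key.
    static Map<String, Integer> run(List<String> lines) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // reduce: sum the list of values of each key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) {
                sum += v;
            }
            result.put(group.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```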

MapReduce - User Interfaces

First come the Mapper and Reducer interfaces. Applications typically implement only their map and reduce methods.
The other interfaces include Job, Partitioner, InputFormat, OutputFormat, and others.

Mapper

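A Mapper transforms input <k,v> pairs into intermediate <k,v> pairs, i.e. the map output (stored on local disk).
The framework creates one map task for each InputSplit produced by the job's InputFormat.
Set the Mapper implementation with Job.setMapperClass(Class). The framework then calls map(WritableComparable, Writable, Context) for every <k,v> pair in the task's InputSplit. Applications can override setup(Context), which is called once before the map task starts processing, to perform initialization, and cleanup(Context), which is called once before the map task ends, to release resources.
Output <k,v> pairs are written to an in-memory buffer through context.write(WritableComparable, Writable).
Applications can use Counters to report statistics; Counters are covered below.
Mapper output is sorted and then partitioned per Reducer. The number of partitions equals the number of reduce tasks. Users can control which keys go to which Reducer by implementing a custom Partitioner; the default is HashPartitioner.
The intermediate <k,v> pairs, i.e. the sorted map output, are stored in a simple format: (key-len, key, value-len, value). Whether and how the map output is compressed is configurable. With the old API this is done through JobConf.setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass), which amounts to setting the following two properties: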

mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=<Class>

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

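In other words, the number of map tasks is usually determined by the total input size, i.e. the number of blocks in the input files.
A good level of parallelism is roughly 10-100 maps per node, and up to 300 for very CPU-light map tasks. Because task setup takes a while, each map should ideally run for at least a minute.
Configuration.setInt(MRJobConfig.NUM_MAPS, int) can be used to ask for more maps, but this is generally not recommended.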

Reducer

Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The framework then calls reduce(WritableComparable, Iterable, Context) method for each <key, (list of values)> pair in the grouped inputs

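A Reducer processes the intermediate <k,v> pairs that share a key.
Set the number of reduce tasks manually with Job.setNumReduceTasks(int).
Set the Reducer implementation with Job.setReducerClass(Class).
The framework calls reduce(WritableComparable, Iterable<Writable>, Context) once for each <key, (list of values)> pair in the grouped inputs. How the grouping works is described below.
Applications can override setup(Context), which is called once before the reduce task starts processing, to perform initialization, and cleanup(Context), which is called once before the reduce task ends, to release resources.
A Reducer has three main phases: shuffle, sort, and reduce.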

shuffle

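The input to the Reducer is the sorted output of the mappers. In the shuffle phase, the framework fetches the relevant partition of each mapper's output via HTTP. That is, each reduce task copies its own partition from every map task to local storage; every reduce task has its own partition number and can only fetch map output that belongs to that partition. The Partitioner assigns each map output record to a partition; it is described below.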

sort

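In the sort phase, the framework groups the Reducer inputs by key (different mappers may have emitted the same key).
Shuffle and sort occur simultaneously: map outputs are merged as they are fetched.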

Secondary Sort

If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via Job.setSortComparatorClass(Class). Since Job.setGroupingComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.

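If the intermediate records need a sort order that differs from the rule used to group keys before reduction, specify a Comparator with Job.setSortComparatorClass(Class). Because Job.setGroupingComparatorClass(Class) controls how intermediate keys are grouped, the two comparators can be used together to simulate a secondary sort on values.
For more on the mechanism, see my other blog post on MapReduce Secondary Sort.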
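The interplay of the two comparators can be illustrated without Hadoop. In the sketch below, which is a plain-Java model under stated assumptions, the class SecondarySortSketch and its Pair key are hypothetical names. SORT plays the role of the comparator passed to Job.setSortComparatorClass (it orders by the natural key, then by the value carried in the key), and GROUP plays the role of the comparator passed to Job.setGroupingComparatorClass (it looks at the natural key only).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of secondary sort; not Hadoop code.
class SecondarySortSketch {

    // Composite key: the natural key plus the value that should arrive sorted.
    static final class Pair {
        final String first;
        final int second;

        Pair(String first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    // Sort comparator: natural key first, then the secondary field.
    static final Comparator<Pair> SORT =
            Comparator.comparing((Pair p) -> p.first).thenComparingInt(p -> p.second);

    // Grouping comparator: only the natural key decides group membership.
    static final Comparator<Pair> GROUP = Comparator.comparing((Pair p) -> p.first);

    // Returns one string per reduce call: "<key>: <values in sorted order>".
    static List<String> run(List<Pair> mapOutput) {
        List<Pair> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<String> groups = new ArrayList<>();
        StringBuilder current = null;
        Pair previous = null;
        for (Pair p : sorted) {
            // A new group starts whenever the grouping comparator sees a different key.
            if (previous == null || GROUP.compare(previous, p) != 0) {
                if (current != null) {
                    groups.add(current.toString());
                }
                current = new StringBuilder(p.first + ":");
            }
            current.append(' ').append(p.second);
            previous = p;
        }
        if (current != null) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Pair> mapOutput = Arrays.asList(
                new Pair("b", 2), new Pair("a", 3), new Pair("a", 1), new Pair("b", 1));
        System.out.println(run(mapOutput));
    }
}
```

Each group corresponds to a single reduce call, and within that call the values already arrive in ascending order, which is the effect secondary sort is after.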

Reduce

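In this phase, reduce(WritableComparable, Iterable, Context) is called for each <key, (list of values)> pair in the grouped inputs.
The reduce task writes its output to the file system via Context.write(WritableComparable, Writable).
Applications can use Counters to report statistics; Counters are covered below.
The output of the Reducer is not sorted.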

How Many Reduces?

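The right number of reduce tasks seems to be 0.95 or 1.75 multiplied by (number of nodes * maximum number of containers per node).
With 0.95, all reduces can launch immediately and start transferring map outputs as the maps finish.
With 1.75, the faster nodes finish their first round of reduces and launch a second wave, which balances the load better.
Increasing the number of reduces increases framework overhead, but improves load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers so that a few reduce slots stay reserved in the framework for speculative and failed tasks.
In practice, the number of reduce tasks also depends on the business requirement. For example, if the input is phone call records and the desired result is the records grouped by carrier (China Mobile / China Telecom / China Unicom), set the number of reduce tasks to 3 and write a custom Partitioner that produces one partition per carrier. Data skew is a separate problem to revisit later: if China Mobile records dominate the input, the reduce task responsible for that partition carries a much heavier load.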
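As a worked example of the rule of thumb, the sketch below evaluates both factors for an assumed cluster of 10 nodes with 8 containers each. The class and method names are hypothetical, the cluster size is made up for illustration, and rounding to the nearest integer is a choice made here rather than something the tutorial prescribes.

```java
// Arithmetic sketch of the 0.95 / 1.75 rule of thumb; not a Hadoop API.
class ReduceCountSketch {

    // factor * (nodes * containers per node), rounded to the nearest integer.
    static int suggestedReduces(double factor, int nodes, int maxContainersPerNode) {
        return (int) Math.round(factor * (nodes * maxContainersPerNode));
    }

    public static void main(String[] args) {
        int nodes = 10;              // assumed cluster size
        int containersPerNode = 8;   // assumed maximum containers per node
        System.out.println("single wave: " + suggestedReduces(0.95, nodes, containersPerNode));
        System.out.println("two waves:   " + suggestedReduces(1.75, nodes, containersPerNode));
    }
}
```

The resulting number would then be passed to Job.setNumReduceTasks(int).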

Reducer NONE

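It is legal to set the number of reduce tasks to zero when no reduction is needed.
In this case the map tasks write their output directly to the job's output path on the file system, and the framework does not sort that output before writing it. This makes Reducer NONE a good fit for data cleansing: the cleaned records go straight to the file system without being sorted. For example, running the official secondary sort example with mapreduce.job.reduces=0 produces output in the same order as the original input: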
hadoop jar secondarysort.jar -Dmapreduce.job.reduces=0 /user/root/input1 /user/root/output12

Partitioner

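The Partitioner partitions the key space: it assigns each key of the map output to a partition, by default using the hash code of the key.
The number of partitions equals the number of reduce tasks of the job, so the reduce task for a given partition fetches the map output that belongs to it.
HashPartitioner is the default Partitioner.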
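The hash-based assignment can be sketched as follows. The expression mirrors the one HashPartitioner uses (mask off the sign bit, then take the remainder), but the class below is a standalone illustration with a hypothetical name, not the Hadoop class itself.

```java
// Standalone illustration of hash partitioning; not the Hadoop HashPartitioner class.
class HashPartitionSketch {

    // Masking with Integer.MAX_VALUE keeps the result non-negative for negative hash codes.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"hello", "world", "hadoop"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Equal keys always land in the same partition, which is what guarantees that a single reduce task sees every value of a given key.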

Counter

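Counters are a facility for MapReduce applications to report statistics.
Mapper and Reducer implementations can use Counters to report their statistics.
See my blog post on MapReduce program Counters for details.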

Job Configuration

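Job represents the configuration of a MapReduce job. It is the primary interface through which a user describes a MapReduce job to the Hadoop framework for execution. The framework tries to execute the job faithfully as described; however, some parameters are final parameters and cannot be changed, while others can be set directly, such as the number of reduce tasks via Job.setNumReduceTasks(int) mentioned above.
Job is typically used to specify the Mapper, Partitioner, Reducer, InputFormat, and OutputFormat implementations.
The input paths are set with FileInputFormat.setInputPaths(Job, Path…).
The output path is set with FileOutputFormat.setOutputPath(Job, Path).
Job can also be used to set the Comparator (see the Secondary Sort section above), to add files to the DistributedCache, to decide whether and how map/reduce outputs are compressed, to enable or disable speculative execution of map/reduce tasks, to set the maximum number of attempts per map/reduce task, and so on.
Arbitrary parameters can also be set and read with Configuration.set(String, String) / Configuration.get(String). Use the DistributedCache for large read-only files such as dictionaries.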

Task Execution & Environment

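The MRAppMaster executes each Mapper/Reducer task as a child process in a separate JVM. The child task inherits the environment of the parent MRAppMaster.
Users can pass additional options to the child JVM via mapreduce.{map|reduce}.java.opts. If the value contains the symbol @taskid@, it is replaced with the taskid of the MapReduce task. The official example is: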

<property>
  <name>mapreduce.map.java.opts</name>
  <value> -Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false </value>
</property>

<property>
  <name>mapreduce.reduce.java.opts</name>
  <value> -Xmx1024M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false </value>
</property>

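Here -Xmx sets the maximum heap size, and -verbose:gc -Xloggc:/tmp/@taskid@.gc enables GC logging and sets the GC log path.
com.sun.management.jmxremote.authenticate=false disables password authentication for remote JMX connections.
com.sun.management.jmxremote.ssl=false disables SSL.
For more on JMX, see the Oracle JDK documentation.
To inspect the memory of a MapTask running on a remote server from a local machine, for example with jmc or jvisualvm,
also add the following options to set the JMX port of the MapTask. com.sun.management.jmxremote does not have to be set explicitly once a port is given.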

-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=6666

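Note that if mapreduce.map.java.opts fixes jmxremote.port and several MapTask processes run on the same node, containers fail to start because the port is already in use. It is therefore best to run a single MapTask and inspect that one remotely.
(Figure: resource usage of a MapTask process as shown by jmc and jvisualvm.)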

Memory Management

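Users can use mapreduce.{map|reduce}.memory.mb to specify the maximum virtual memory of the child task and of any sub-process it launches. Note that the value is a per-process limit, expressed in MB. It should be greater than or equal to the -Xmx passed to the JVM, otherwise the VM may fail to start.
Note: mapreduce.{map|reduce}.java.opts only configures the child tasks launched by the MRAppMaster.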

Map Parameters

A record emitted from a map will be serialized into a buffer and metadata will be stored into accounting buffers. As described in the following options, when either the serialization buffer or the metadata exceed a threshold, the contents of the buffers will be sorted and written to disk in the background while the map continues to output records. If either buffer fills completely while the spill is in progress, the map thread will block. When the map is finished, any remaining records are written to disk and all on-disk segments are merged into a single file. Minimizing the number of spills to disk can decrease map time, but a larger buffer also decreases the memory available to the mapper.

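In other words: each record the map emits through context.write is serialized into a buffer (the OutputCollector), and its metadata is stored in accounting buffers. As the options below describe, when either the serialization buffer or the metadata exceeds a threshold, the buffer contents are sorted and written to disk in the background while the map keeps emitting records. If either buffer fills completely while a spill is in progress, the map thread blocks. When the map finishes, all remaining records are written to disk and all on-disk segments are merged into a single file (so the final output of each MapTask is one merged file). Fewer spills to disk shorten map time, but a larger buffer leaves less memory for the mapper itself.

Name	Type	Description
mapreduce.task.io.sort.mb	int	Cumulative size, in MB, of the serialization and accounting buffers that hold the records emitted by the map
mapreduce.map.sort.spill.percent	float	Soft limit in the serialization buffer. Once it is reached, the spiller thread starts spilling the contents to disk in the background

If a spill threshold is exceeded while a spill is in progress, collection continues until the spill finishes. For example, if mapreduce.map.sort.spill.percent is set to 0.33 and the map fills the rest of the buffer through context.write while the spill runs, the next spill includes all the collected records, i.e. 0.66 of the buffer, and no additional spill is generated. In other words, the thresholds define triggers; they do not block.
A record larger than the serialization buffer first triggers a spill and is then spilled to a separate file.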
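To put numbers on the trigger, the sketch below computes the point at which a spill would start for a given buffer size and spill percentage. It is a simplified arithmetic model with hypothetical names: it treats the whole of mapreduce.task.io.sort.mb as one buffer, and the values 100 MB and 0.80 are assumptions chosen for the example.

```java
// Arithmetic model of the spill trigger; not Hadoop's MapOutputBuffer.
class SpillThresholdSketch {

    // Number of buffered bytes at which a background spill would be triggered.
    static long spillThresholdBytes(int sortMb, double spillPercent) {
        long bufferBytes = (long) sortMb * 1024 * 1024;
        return (long) (bufferBytes * spillPercent);
    }

    public static void main(String[] args) {
        int sortMb = 100;            // assumed value of mapreduce.task.io.sort.mb
        double spillPercent = 0.80;  // assumed value of mapreduce.map.sort.spill.percent
        System.out.println("spill starts at " + spillThresholdBytes(sortMb, spillPercent) + " bytes");
    }
}
```

The map keeps writing into the remaining part of the buffer while the spill runs, which is why the threshold acts as a trigger and not as a hard limit.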

Shuffle/Reduce Parameters

As described previously, each reduce fetches the output assigned to it by the Partitioner via HTTP into memory and periodically merges these outputs to disk. If intermediate compression of map outputs is turned on, each output is decompressed into memory. The following options affect the frequency of these merges to disk prior to the reduce and the memory allocated to map output during the reduce.

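In other words: as described earlier, each reduce fetches the output that the Partitioner assigned to it into memory via HTTP and periodically merges these outputs to disk. If compression of the map output is turned on, each output is decompressed into memory. The options below affect how often data is merged to disk before the reduce, and how much memory is allocated to map output during the reduce.

Name	Type	Description
mapreduce.task.io.sort.factor	int	Number of on-disk segments merged at the same time. It limits the number of open files and compression codecs during a merge. If the number of files exceeds this limit, the merge proceeds in several passes. Default: 10
mapreduce.reduce.merge.inmem.threshold	int	Threshold, in number of files, for the in-memory merge. When this many files have accumulated, the in-memory merge starts and spills to disk. A value of 0 or less means no threshold is wanted; the merge is then triggered only by the memory consumption of the ramfs. Default: 1000
mapreduce.reduce.shuffle.merge.percent	float	Usage threshold at which an in-memory merge is started, expressed as a percentage of the total memory allocated to storing map outputs in memory, which is defined by mapreduce.reduce.shuffle.input.buffer.percent. Default: 0.66
mapreduce.reduce.shuffle.input.buffer.percent	float	Percentage of the maximum heap size allocated to storing map outputs during the shuffle. Although some memory should be left for the framework, it is generally advantageous to set this high enough to hold a large number of map outputs. Default: 0.70
mapreduce.reduce.input.buffer.percent	float	Percentage of memory, relative to the maximum heap size, in which map outputs may be retained during the reduce. When the shuffle ends, any map outputs remaining in memory must consume less than this threshold before the reduce can begin. Default: 0.0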


Configured Parameters

The following properties are localized in the job configuration for each task’s execution

Name	Type	Description
mapreduce.job.id	String	The job id
mapreduce.job.jar	String	job.jar location in job directory
mapreduce.job.local.dir	String	The job specific shared scratch space
mapreduce.task.id	String	The task id
mapreduce.task.attempt.id	String	The task attempt id
mapreduce.task.is.map	boolean	Is this a map task
mapreduce.task.partition	int	The id of the task within the job
mapreduce.map.input.file	String	The filename that the map is reading from
mapreduce.map.input.start	long	The offset of the start of the map input split
mapreduce.map.input.length	long	The number of bytes in the map input split
mapreduce.task.output.dir	String	The task’s temporary output directory

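When an MR job is submitted with Hadoop Streaming (covered in another post), these parameters are available under names in which . is replaced by _; for example, mapreduce.job.id becomes mapreduce_job_id.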
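The renaming rule is small enough to write down. The helper below is a hypothetical illustration that only implements the dot-to-underscore replacement described here; it is not part of Hadoop.

```java
// Hypothetical helper that applies the renaming rule described above.
class StreamingNameSketch {

    // mapreduce.job.id -> mapreduce_job_id
    static String toStreamingName(String propertyName) {
        return propertyName.replace('.', '_');
    }

    public static void main(String[] args) {
        String name = toStreamingName("mapreduce.task.attempt.id");
        // Inside a streaming task the value would be read from the environment under this name.
        System.out.println(name + " = " + System.getenv(name));
    }
}
```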

Task Logs

The NodeManager reads the stdout, stderr, and syslog of each task and stores them under ${HADOOP_LOG_DIR}/userlogs.

Job Submission and Monitoring

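Job is the primary interface through which a user job interacts with the ResourceManager.
The job submission process involves the following steps:

Checking the input and output specifications of the job.
Computing the InputSplits of the job (i.e. how many splits there are).
Setting up the DistributedCache if necessary, for example to cache a dictionary.
Copying the job's jar and configuration to the MapReduce system directory on the file system.
Submitting the job to the ResourceManager and optionally monitoring its status.

Job history files are written to the locations given by mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir, whose defaults are: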

yarn.app.mapreduce.am.staging-dir=/tmp/hadoop-yarn/staging
mapreduce.jobhistory.intermediate-done-dir=${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate	
mapreduce.jobhistory.done-dir=${yarn.app.mapreduce.am.staging-dir}/history/done

Use mapred job -history output.jhist to view the details of a job, where output.jhist is the path of the job's history file on HDFS, for example:
mapred job -history /tmp/hadoop-yarn/staging/history/done/2020/02/15/000000/job_1581753465874_0002-1581753958635-root-secondary+sort-1581754265093-1-0-SUCCEEDED-default-1581753960803.jhist

Job Control

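Users may need to chain several MapReduce jobs to accomplish a complex task that cannot be done in a single job. This is fairly easy, because the output of a job typically goes to the distributed file system and can in turn serve as the input of the next job.
However, it also means that the responsibility for making sure each job has completed (successfully or not) lies entirely with the client. In such cases, the job-control options are:
Job.submit(): submits the job to the cluster and returns immediately.
Job.waitForCompletion(boolean): submits the job to the cluster and waits for it to finish.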

Job Input

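InputFormat describes the input specification of a MapReduce job.
The MapReduce framework relies on the InputFormat of the job to:

Validate the input specification of the job.
Split the input files into logical InputSplit instances, each of which is then assigned to a single map task.
Provide the RecordReader implementation used to read input records from the logical InputSplit for the map task to process.

The default behavior of file-based InputFormat implementations (typically subclasses of FileInputFormat) is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. The file system block size of the input files is treated as the upper bound of an InputSplit. A lower bound on the split size can be set with mapreduce.input.fileinputformat.split.minsize; it defaults to 0.
TextInputFormat is the default InputFormat.
If TextInputFormat is the InputFormat of a job, the framework detects input files with the .gz extension and decompresses them with the appropriate CompressionCodec. Note, however, that compressed files with this extension cannot be split; each compressed file is processed in its entirety by a single map task.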
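How the block size and the lower bound interact can be sketched as a one-line formula. The expression below follows the max/min form FileInputFormat uses to pick a split size, but the class is a standalone illustration with hypothetical names; the maxSize argument corresponds to an upper-bound setting that is not discussed in this post and is included only so the formula is complete.

```java
// Standalone sketch of how a split size is derived; not the Hadoop FileInputFormat class.
class SplitSizeSketch {

    // The block size caps the split unless the configured minimum is larger.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;   // assumed HDFS block size
        // No explicit bounds: the split size equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) / mb + " MB");
        // Raising the lower bound above the block size yields larger (and fewer) splits.
        System.out.println(computeSplitSize(blockSize, 256 * mb, Long.MAX_VALUE) / mb + " MB");
    }
}
```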

InputSplit

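One InputSplit corresponds to one map task: the InputSplit is the data that the map task processes.
An InputSplit typically presents a byte-oriented view of the input; the RecordReader is responsible for processing it and presenting a record-oriented view.
FileSplit is the default InputSplit.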

RecordReader

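A RecordReader reads <key, value> pairs from an InputSplit.
Typically the RecordReader converts the byte-oriented view of the input provided by the InputSplit into a record-oriented view for the Mapper implementation to process.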

Job Output

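OutputFormat describes the output specification of a MapReduce job.
The MapReduce framework relies on the OutputFormat of the job to:

Validate the output specification of the job; for example, check that the output directory does not already exist.
Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a file system such as HDFS.

TextOutputFormat is the default OutputFormat.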

OutputCommitter

OutputCommitter describes the commit of task output for a MapReduce job.
The MapReduce framework relies on the OutputCommitter of the job to:

Setup the job during initialization. For example, create the temporary output directory for the job during the initialization of the job. Job setup is done by a separate task when the job is in PREP state and after initializing tasks. Once the setup task completes, the job will be moved to RUNNING state.

Cleanup the job after the job completion. For example, remove the temporary output directory after the job completion. Job cleanup is done by a separate task at the end of the job. Job is declared SUCCEEDED/FAILED/KILLED after the cleanup task completes.

Setup the task temporary output. Task setup is done as part of the same task, during task initialization.

Check whether a task needs a commit. This is to avoid the commit procedure if a task does not need commit.

Commit of the task output. Once task is done, the task will commit its output if required.

Discard the task commit. If the task has been failed/killed, the output will be cleaned-up. If task could not cleanup (in exception block), a separate task will be launched with same attempt-id to do the cleanup.

FileOutputCommitter is the default OutputCommitter. Job setup/cleanup tasks occupy map or reduce containers, whichever is available on the NodeManager. And JobCleanup task, TaskCleanup tasks and JobSetup task have the highest priority, and in that order.

Task Side-Effect Files

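In some cases two instances of the same map or reduce task run at the same time (for example, under speculative execution), and both may try to create the same side-effect files.
The unit of uniqueness is therefore the task-attempt, not the task: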

Hence the application-writer will have to pick unique names per task-attempt (using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per task.

RecordWriter

A RecordWriter writes the output <key, value> pairs to an output file.
RecordWriter implementations write the job outputs to the FileSystem.

Other Useful Features

Submitting Jobs to Queues

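Users submit jobs to Queues. Queues, as collections of jobs, allow the system to provide specific functionality. For example, Queues use ACLs to control which users can submit jobs to them. Queues are primarily used by the Hadoop schedulers.
Hadoop comes configured with a single mandatory queue, named default. The queue a job is assigned to can be set with mapreduce.job.queuename, or through Configuration.set(MRJobConfig.QUEUE_NAME, String). If no queue is set explicitly, the job goes to the default queue. Some job schedulers, such as the Capacity Scheduler, support multiple queues.
For more on schedulers, see my other blog post, Yarn Scheduler (to be written).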

DistributedCache

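DistributedCache efficiently distributes large, application-specific, read-only files such as dictionaries.
DistributedCache is a facility provided by the MapReduce framework to cache the files (text files, archives, jars, and so on) that applications need.
DistributedCache assumes that the files specified via hdfs:// URLs are already present on the file system (such as HDFS).
Before any task of the job runs on a node, the framework copies the necessary files to that node. Its efficiency stems from the fact that the files are copied only once per job, and from its ability to cache archives, which are un-archived on the worker nodes.
DistributedCache can distribute simple read-only data or text files as well as more complex types such as archives and jars. Archives (zip, tar, tgz, and tar.gz files) are un-archived on the worker nodes. Files have execution permissions set.
The files to distribute can be set with mapreduce.job.cache.{files|archives}; separate multiple files with commas. They can also be set through Job.addCacheFile(URI), Job.addCacheArchive(URI), Job.setCacheFiles(URI[]), and Job.setCacheArchives(URI[]). The URI has the form hdfs://host:port/absolute-path#link-name, where link-name is an alias that the program can use to refer to the file. For jobs submitted with Hadoop Streaming, files can be distributed with the command-line options -cacheFile/-cacheArchive.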
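The link-name is simply the fragment of the URI, so it can be extracted with java.net.URI alone. In the sketch below the host namenode:8020 and the file paths are made-up examples, and falling back to the file name when no fragment is present is an assumption of this sketch, not documented framework behavior.

```java
import java.net.URI;

// Extracts the link-name (URI fragment) from a cache-file URI; standard library only.
class LinkNameSketch {

    // Returns the fragment if present, otherwise the last path segment (assumed fallback).
    static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null) {
            return fragment;
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Hypothetical URIs, following the hdfs://host:port/absolute-path#link-name form.
        System.out.println(linkName(URI.create("hdfs://namenode:8020/user/root/patterns.txt#pp")));
        System.out.println(linkName(URI.create("hdfs://namenode:8020/user/root/patterns.txt")));
    }
}
```

A task would then open the file through the returned name, relative to its working directory.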

Private and Public DistributedCache Files

DistributedCache files can be private or public, that determines how they can be shared on the slave nodes.

“Private” DistributedCache files are cached in a local directory private to the user whose jobs need these files. These files are shared by all tasks and jobs of the specific user only and cannot be accessed by jobs of other users on the slaves. A DistributedCache file becomes private by virtue of its permissions on the file system where the files are uploaded, typically HDFS. If the file has no world readable access, or if the directory path leading to the file has no world executable access for lookup, then the file becomes private.

“Public” DistributedCache files are cached in a global directory and the file access is setup such that they are publicly visible to all users. These files can be shared by tasks and jobs of all users on the slaves. A DistributedCache file becomes public by virtue of its permissions on the file system where the files are uploaded, typically HDFS. If the file has world readable access, AND if the directory path leading to the file has world executable access for lookup, then the file becomes public. In other words, if the user intends to make a file publicly available to all users, the file permissions must be set to be world readable, and the directory permissions on the path leading to the file must be world executable.

Profiling

Profiling is a utility to get a representative (2 or 3) sample of built-in java profiler for a sample of maps and reduces

That is, profiling collects the output of the built-in Java profiler for a small, representative sample (2 or 3) of the map and reduce tasks.

Name	Value	Description
mapreduce.task.profile	true	Whether the system should collect profiler information for some of the tasks of the job; the information is stored in the user log directory
mapreduce.task.profile.params	-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=y,thread=y,verbose=n,file=%s	JVM profiler parameters used to profile map/reduce tasks; %s is replaced with the path of the profile.out file. For parameters specific to map or reduce tasks, see the properties below
mapreduce.task.profile.map.params	${mapreduce.task.profile.params}	JVM profiler parameters for map tasks
mapreduce.task.profile.maps	0-2	The range of map tasks to profile
mapreduce.task.profile.reduce.params	${mapreduce.task.profile.params}	JVM profiler parameters for reduce tasks
mapreduce.task.profile.reduces	0-2	The range of reduce tasks to profile

For how to write the value of mapreduce.task.profile.maps, see the class comment of org.apache.hadoop.conf.Configuration.IntegerRanges in the source code.
For example, 2-3,5,7- means 2, 3, 5, and 7, 8, 9, ...

A class that represents a set of positive integer ranges.
It parses strings of the form: “2-3,5,7-” where ranges are separated by comma and the lower/upper bounds are separated by dash.
Either the lower or upper bound may be omitted meaning all values up to or over.
So the string above means 2, 3, 5, and 7, 8, 9, …
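The range syntax described above can be checked with a small parser. The class below is a simplified re-implementation written for illustration, with hypothetical names; it is not the Hadoop IntegerRanges class and skips its validation.

```java
// Simplified parser for range strings such as "2-3,5,7-"; not Hadoop's IntegerRanges.
class IntegerRangesSketch {

    // Returns true if value falls inside any of the comma-separated ranges.
    static boolean isIncluded(String ranges, int value) {
        for (String part : ranges.split(",")) {
            part = part.trim();
            if (part.isEmpty()) {
                continue;
            }
            int dash = part.indexOf('-');
            int low;
            int high;
            if (dash < 0) {
                // A single number such as "5".
                low = Integer.parseInt(part);
                high = low;
            } else {
                // "a-b", "-b" (everything up to b) or "a-" (everything from a).
                String lowText = part.substring(0, dash).trim();
                String highText = part.substring(dash + 1).trim();
                low = lowText.isEmpty() ? 0 : Integer.parseInt(lowText);
                high = highText.isEmpty() ? Integer.MAX_VALUE : Integer.parseInt(highText);
            }
            if (value >= low && value <= high) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (int task = 0; task <= 8; task++) {
            System.out.println(task + " profiled: " + isIncluded("2-3,5,7-", task));
        }
    }
}
```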

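(Figure: the profile.out file I obtained after enabling profiling.)

I am using Hadoop 2.6.5, and with the profile parameters added the resulting profile.out file is truncated, i.e. incomplete. The two links below discuss the problem in detail. The bug is fixed in Hadoop 2.8.0, so upgrade Hadoop if you need to collect profiling information.
Stack Overflow question about Hadoop hprof producing no CPU samples
The bug report and its fix

HPROF is a heap and CPU profiling tool that ships with every JDK release. For more on hprof, see the Oracle JDK documentation on hprof.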

Debugging

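The MapReduce framework provides a facility to run user-provided scripts for debugging. When a MapReduce task fails, the user can run a debug script to process the task logs, for example. The script has access to the task's stdout, stderr, syslog, and jobconf. The stdout and stderr of the debug script are displayed as part of the job UI.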

How to distribute the script file

Users need to use the DistributedCache to distribute the script and to create a symlink for it (i.e. give it an alias).

How to submit the script

Set mapreduce.map.debug.script and mapreduce.reduce.debug.script to debug map tasks and reduce tasks respectively. The same can be done through the API with Configuration.set(MRJobConfig.MAP_DEBUG_SCRIPT, String) and Configuration.set(MRJobConfig.REDUCE_DEBUG_SCRIPT, String).
The stdout, stderr, syslog, and jobconf files are the arguments of the debug script. The debug command runs on the node where the map/reduce task failed, and has the form $script $stdout $stderr $syslog $jobconf.

Pipes programs have the c++ program name as a fifth argument for the command. Thus for the pipes programs the command is
$script $stdout $stderr $syslog $jobconf $program

Data Compression

Hadoop MapReduce can compress both the intermediate map outputs and the job outputs, i.e. the output of the reduces. The gzip, bzip2, snappy, and lz4 codecs are supported.

Intermediate Outputs

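Use Configuration.setBoolean(MRJobConfig.MAP_OUTPUT_COMPRESS, boolean) to turn compression of the map output on or off.
Use Configuration.setClass(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC, Class, CompressionCodec.class) to choose the codec used for the compression.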

Job Outputs

Use FileOutputFormat.setCompressOutput(Job, boolean) to turn compression of the job output on or off.
Use FileOutputFormat.setOutputCompressorClass(Job, Class) to choose the codec used for the compression.
If the job output is written with SequenceFileOutputFormat, the compression type can only be one of the values of SequenceFile.CompressionType.
It is set with SequenceFileOutputFormat.setOutputCompressionType(Job, SequenceFile.CompressionType).

Skipping Bad Records

See the Skipping Bad Records section of the official documentation.

Example: WordCount v2.0

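The code below is adapted from the official example; the main addition is the handling of link-name. The comments in the code explain it clearly, so I will not repeat it here.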

package com.utstar.patrick.hadoop.mr.wcdemo;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;

public class WordCountV2 {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable>{

        static enum CountersEnum { INPUT_WORDS }

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        private boolean caseSensitive;
        private Set<String> patternsToSkip = new HashSet<String>();

        private Configuration conf;
        private BufferedReader fis;

        @Override
        public void setup(Context context) throws IOException,
                InterruptedException {
            // There is no direct API for the container ID of the running task attempt,
            // so print the task attempt ID; together with the job UI and the logs
            // it can be matched to the container ID.
            System.out.println(context.getTaskAttemptID().toString());
            conf = context.getConfiguration();
            caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
            if (conf.getBoolean("wordcount.skip.patterns", false)) {
                // Fetch the cache files.
                URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
                // Fetch the cache archives; the result is null when none were distributed.
                URI[] archiveURIs = Job.getInstance(conf).getCacheArchives();
                if (archiveURIs != null) {
                    for (URI x : archiveURIs) {
                        System.out.println("URI="+x.toString());
                        System.out.println("fragment="+x.getFragment());
                    }
                }
                for (URI patternsURI : patternsURIs) {
                    Path patternsPath = new Path(patternsURI.getPath());
                    System.out.println("URI="+patternsURI.toString());
                    System.out.println(patternsPath.toString());
                    String patternsFileName = patternsPath.getName().toString();
//                    parseSkipFile(patternsFileName);
                    System.out.println("pathName="+patternsFileName+", fragment="+patternsURI.getFragment());
                    System.out.println("------------------------");

                    // Prefer the fragment: if the URI carries a link-name for the symlink,
                    // the file has to be read through {link-name}.
                    // When no link-name is given, the fragment defaults to the file name.
                    // hdfs://host:port/absolute-path#link-name
                    // See URI.java: [<scheme>:]<scheme-specific-part>[#<fragment>]
                    parseSkipFile(patternsURI.getFragment());
                }
            }
        }

        private void parseSkipFile(String fileName) {
            try {
                fis = new BufferedReader(new FileReader(fileName));
                String pattern = null;
                while ((pattern = fis.readLine()) != null) {
                    patternsToSkip.add(pattern);
                }
            } catch (IOException ioe) {
                System.err.println("Caught exception while parsing the cached file '"
                        + StringUtils.stringifyException(ioe));
            }
        }

        @Override
        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            String line = (caseSensitive) ?
                    value.toString() : value.toString().toLowerCase();
            for (String pattern : patternsToSkip) {
                line = line.replaceAll(pattern, "");
            }
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
                Counter counter = context.getCounter(CountersEnum.class.getName(),
                        CountersEnum.INPUT_WORDS.toString());
                counter.increment(1);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
        String[] remainingArgs = optionParser.getRemainingArgs();
        // Expect either <in> <out> or <in> <out> -skip <skipPatternFile>.
        if (remainingArgs.length != 2 && remainingArgs.length != 4) {
            System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountV2.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        List<String> otherArgs = new ArrayList<String>();
        for (int i=0; i < remainingArgs.length; ++i) {
            if ("-skip".equals(remainingArgs[i])) {
                job.addCacheFile(new Path(remainingArgs[++i]).toUri());
                job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
            } else {
                otherArgs.add(remainingArgs[i]);
            }
        }
        FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path(otherArgs.get(1)),true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

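The command I ran is shown below. Instead of distributing the files through the API as the official code does, I distributed them with the command-line options -archives and -files, which has the same effect. I also set a link-name, i.e. an alias, for each of them.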
hadoop jar wordcountV2.jar -archives ../test/a.zip#abc -files patterns.txt#pp -Dwordcount.skip.patterns=true -Dwordcount.case.sensitive=false /user/root/input /user/root/wd_output

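Important: GENERIC_OPTIONS only take effect if the code parses them with GenericOptionsParser; otherwise settings such as -files are not picked up.

The code produces the same result as the official code.
(Figure: the distributed files found under ${hadoop.tmp.dir}/nm-local-dir/usercache on a NodeManager machine.) The corresponding directory on every NodeManager machine contains the distributed files, which confirms the description of DistributedCache above.

(Figure: the stdout found in the container's log directory, matching the standard output printed by the code.)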

References

MapReduce Tutorial on the official Hadoop site