Hadoop: Distributed Cache Deploy
1. Introduction
2. How to configure it?
3. How it works
4. Troubleshooting notes
1. Introduction
Hadoop supports deploying different versions of the MapReduce framework via the distributed cache. With this mechanism, users can easily run MR jobs against different framework versions on YARN. Without it, rolling out custom changes to the MR framework (new features, bug fixes, etc.) is cumbersome; the distributed cache provides a clean solution to that problem.
2. How to configure it?
There are three main steps:
- Package the new framework version as a tarball and upload it to the HDFS cluster.
- Set mapreduce.application.framework.path to the location of the file from step 1; an alias for the path can be appended after a #, e.g. hdfs:///data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz#mr-opt.
- Set mapreduce.application.classpath accordingly, based on the path and alias from step 2.
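The packaging and upload in step 1 can be sketched as shell commands (the local build directory name hadoop-mapreduce275.0.1 is an assumption; the HDFS path matches the configuration below; these require a running Hadoop client and cluster):

```shell
# Package the custom MR build as a gzip-compressed tarball
tar -zcvf hadoop-mapreduce275.0.1.tar.gz hadoop-mapreduce275.0.1/
# Upload it to the HDFS cluster
hadoop fs -mkdir -p /data/hadoop/mr
hadoop fs -put hadoop-mapreduce275.0.1.tar.gz /data/hadoop/mr/
# The tarball must be world-readable so any user's containers can localize it
hadoop fs -chmod 755 /data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz
```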
<!-- new properties in mapred-site.xml -->
<property>
<name>mapreduce.application.framework.path</name>
<value>hdfs:///data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz#mr-opt</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/mr-opt/*,$PWD/mr-opt/lib/*</value>
</property>
3. How it works
When a container is launched, its CLASSPATH is built from mapreduce.application.classpath, replacing the default CLASSPATH. The container launch flow in YARN is roughly as follows: when the NodeManager receives a container launch request, the container transitions from NEW to LOCALIZING. This is the resource localization phase (e.g. downloading the tarball referenced by mapreduce.application.framework.path above), handled mainly by ResourceLocalizationService. The NodeManager downloads each resource into a different directory according to its "visibility", for use by submitted jobs. Resources fall into three visibility classes:
Visibility | Description | Local directory
---|---|---
PUBLIC | Accessible to apps submitted by any user on this NodeManager. | ${yarn.nodemanager.local-dirs}/filecache
PRIVATE | Accessible to all apps submitted by the same user. | ${yarn.nodemanager.local-dirs}/usercache/${user}/filecache
APPLICATION | Accessible to all containers of the same app. | ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/${appId}/filecache
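To see this layout on a live NodeManager, the three localization directories can be listed directly (a sketch only; it assumes yarn.nodemanager.local-dirs=/home/disk0/yarn/local, the value visible in the logs later in this post, and a user named hadoop):

```shell
LOCAL_DIRS=/home/disk0/yarn/local
ls "$LOCAL_DIRS/filecache"                    # PUBLIC resources
ls "$LOCAL_DIRS/usercache/hadoop/filecache"   # PRIVATE resources for user "hadoop"
ls "$LOCAL_DIRS/usercache/hadoop/appcache"    # per-app dirs holding APPLICATION resources
```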
Within the NodeManager, the LocalResourcesTracker class maintains the lifecycle of each resource (download, remove, recover, etc.). The relevant fields in ResourceLocalizationService:
// tracks PUBLIC resources
private LocalResourcesTracker publicRsrc;
// tracks PRIVATE resources, one tracker per user
private final ConcurrentMap<String/*username*/,LocalResourcesTracker> privateRsrc =
new ConcurrentHashMap<String,LocalResourcesTracker>();
// tracks APPLICATION resources, one tracker per application
private final ConcurrentMap<String/*appid*/,LocalResourcesTracker> appRsrc =
new ConcurrentHashMap<String,LocalResourcesTracker>();
For PRIVATE and APPLICATION resources, one LocalResourcesTracker is maintained per username or appid, mainly because resources of different visibility have different concurrency requirements. Since the resource referenced by mapreduce.application.framework.path has PUBLIC visibility (i.e. jobs submitted by any user on the node may access it), it ends up in the ${yarn.nodemanager.local-dirs}/filecache directory.
For PUBLIC resources, the wider the visibility, the more potential readers. Their HDFS replication factor should therefore be set relatively high, to avoid contention when many tasks download the resource during LOCALIZING, which would hurt job startup performance. See the mapreduce.client.submit.file.replication setting for reference.
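For example, the replication factor of the uploaded tarball can also be raised explicitly on an existing file (the value 10 is illustrative, not a measured recommendation; requires a running cluster):

```shell
# Raise the replication factor of the PUBLIC framework tarball;
# -w waits until the new replication level is reached
hadoop fs -setrep -w 10 /data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz
```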
4. Troubleshooting notes
I tested with the wordcount example that ships with Hadoop and hit the following problems.
Problem 1
Job submission
# mapreduce.application.classpath specified via -D
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount -Dmapreduce.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce275.0.1.tar.gz/*,
$PWD/hadoop-mapreduce275.0.1.tar.gz/lib/* /tmp/input /tmp/output
Exception log
18/11/06 20:04:43 INFO mapreduce.JobSubmitter: number of splits:4
18/11/06 20:04:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1540972651914_0011
18/11/06 20:04:44 INFO mapreduce.JobSubmitter: Cleaning up the staging area /staging/hadoop/.staging/job_1540972651914_0011
java.lang.IllegalArgumentException: Could not locate MapReduce framework name 'mr-opt' in mapreduce.application.classpath
at org.apache.hadoop.mapreduce.v2.util.MRApps.setMRFrameworkClasspath(MRApps.java:231)
at org.apache.hadoop.mapreduce.v2.util.MRApps.setClasspath(MRApps.java:258)
at org.apache.hadoop.mapred.YARNRunner.createApplicationSubmissionContext(YARNRunner.java:468)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:296)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:244)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
- Analysis: if mapreduce.application.framework.path specifies an alias for the path, then mapreduce.application.classpath must reference the jars through that alias.
- Solution: replace hadoop-mapreduce275.0.1.tar.gz in $PWD/hadoop-mapreduce275.0.1.tar.gz with the alias, i.e. $PWD/mr-opt/.
Problem 2
Job submission
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount
-Dmapreduce.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/mr-opt/*,$PWD/mr-opt/lib/* /tmp/input /tmp/output
Exception log
18/11/07 15:28:50 INFO mapreduce.Job: Job job_1540972651914_0042 failed with state FAILED due to: Application application_1540972651914_0042 failed 4 times due to AM Container for appattempt_1540972651914_0042_000004 exited with exitCode: -1000
For more detailed output, check application tracking page:http://yq01-sw-backup-hds01.yq01.baidu.com:8088/cluster/app/application_1540972651914_0042Then, click on links to logs of each attempt.
Diagnostics: Permission denied: user=work, access=READ, inode="/data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz":hadoop:hadoop:-rwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:308)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:220)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1808)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1792)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1765)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1844)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1814)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1729)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
- Analysis: this is clearly a permissions problem; the diagnostic shows the tarball is owned hadoop:hadoop with mode -rwx------, but it must be readable by all users.
- Solution: change the permissions on hadoop-mapreduce275.0.1.tar.gz to 755.
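The fix as a command (path taken from the diagnostic message above; requires a running cluster):

```shell
# Make the framework tarball readable by every user, then verify
hadoop fs -chmod 755 /data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz
hadoop fs -ls /data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz
```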
Problem 3
Job submission
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount
-Dmapreduce.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/mr-opt/*,$PWD/mr-opt/lib/* /tmp/input /tmp/output
Exception log
18/11/06 20:06:09 INFO mapreduce.Job: Running job: job_1540972651914_0012
18/11/06 20:06:14 INFO mapreduce.Job: Job job_1540972651914_0012 running in uber mode : false
18/11/06 20:06:14 INFO mapreduce.Job: map 0% reduce 0%
18/11/06 20:06:14 INFO mapreduce.Job: Job job_1540972651914_0012 failed with state FAILED due to: Application application_1540972651914_0012 failed 4 times due to AM Container for appattempt_1540972651914_0012_000004 exited with exitCode: -1000
For more detailed output, check application tracking page:http://yq01-sw-backup-hds01.yq01.baidu.com:8088/cluster/app/application_1540972651914_0012Then, click on links to logs of each attempt.
Diagnostics: ExitCodeException exitCode=2:
gzip: /home/disk0/yarn/local/filecache/10_tmp/tmp_hadoop-mapreduce275.0.1.tar.gz: not in gzip format
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
Failing this attempt. Failing the application.
18/11/06 20:06:14 INFO mapreduce.Job: Counters: 0
- Analysis: the archive was packaged incorrectly: tar -cvf was used, producing a plain (uncompressed) tar, while the NodeManager expects a gzip-compressed archive.
- Solution: package with tar -zcvf instead.
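The mistake and its fix can be reproduced locally before uploading anything (the directory contents here are a placeholder; only the tar flags matter):

```shell
# Create a placeholder build directory for illustration
mkdir -p hadoop-mapreduce275.0.1/lib
# Wrong: -cvf produces a plain, uncompressed tar
tar -cvf plain.tar hadoop-mapreduce275.0.1
# Right: -zcvf gzip-compresses the archive, matching the .tar.gz suffix
tar -zcvf hadoop-mapreduce275.0.1.tar.gz hadoop-mapreduce275.0.1
# Verify before uploading; should report "gzip compressed data"
file hadoop-mapreduce275.0.1.tar.gz
```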
Problem 4
Job submission
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount
-Dmapreduce.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/mr-opt/*,$PWD/mr-opt/lib/* /tmp/input /tmp/output
Exception log
Log Type: stderr
Log Upload Time: Tue Nov 06 20:09:02 +0800 2018
Log Length: 88
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
- Analysis: every AM retry failed; the logs show the MRAppMaster class could not be found or loaded.
Debugging steps
- Enable NodeManager debugging: set yarn.nodemanager.delete.debug-delay-sec to a longer interval, e.g. 3600, so localized container files are kept around for inspection.
- Inspect the AM launch script, located on the AM's NodeManager at ${yarn.nodemanager.local-dirs}/nmPrivate/${applicationId}/${containerID}/launch_container.sh. Its CLASSPATH section looks like this:
# excerpt from launch_container.sh
export PWD="/home/disk0/yarn/local/usercache/hadoop/appcache/application_1540972651914_0043/container_e27_1540972651914_0043_03_000001"
...
export CLASSPATH="$PWD:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/home/hadoop/mr-opt/*:
/home/hadoop/mr-opt/lib/*:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:$PWD/*"
...
ln -sf "/home/disk0/yarn/local/filecache/11/hadoop-mapreduce275.0.1.tar.gz" "mr-opt"
The MR-related classpath entries here are /home/hadoop/mr-opt/* and /home/hadoop/mr-opt/lib/*, which is clearly wrong; they should be $PWD/mr-opt/* and $PWD/mr-opt/lib/*. The root cause turned out to be that $PWD in the -D option was expanded by the shell at submission time into the then-current working directory, and I had submitted the job from /home/hadoop.
- Solution: do not pass mapreduce.application.classpath via -D on the command line; put it in mapred-site.xml instead (alternatively, single-quote the -D value so the client shell does not expand $PWD).
# below is the correct CLASSPATH
export PWD="/home/disk0/yarn/local/usercache/hadoop/appcache/application_1540972651914_0043/container_e27_1540972651914_0043_03_000001"
export CLASSPATH="$PWD:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:$PWD/mr-opt/*:
$PWD/mr-opt/lib/*:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:$PWD/*"
ln -sf "/home/disk0/yarn/local/filecache/11/hadoop-mapreduce275.0.1.tar.gz" "mr-opt"