Hadoop: Distributed Cache Deploy

1. Introduction
2. How to configure it?
3. How it works
4. Problems encountered

1. Introduction

Hadoop supports deploying different versions of the MapReduce framework through the distributed cache. With this mechanism, users can conveniently run MR jobs against different framework versions on the same YARN cluster. If you customize the existing MR framework (adding features, fixing bugs, etc.), shipping a new build the usual way is cumbersome, since it typically means rolling new libraries out across the whole cluster; the distributed cache offers a clean solution to this problem.

2. How to configure it?

Setting this up takes three main steps:

  1. Package the new MR build as a tarball and upload it to HDFS (a sketch of this step follows the list).
  2. Set mapreduce.application.framework.path to the location of the file from step 1; the path may also carry an alias, e.g. hdfs:///data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz#mr-opt.
  3. Set mapreduce.application.classpath so that it is consistent with the information from step 2.
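
A minimal sketch of step 1, assuming the custom build sits in a local directory named hadoop-mapreduce275.0.1 (the directory name and HDFS path follow the examples in this post; adjust them for your cluster):

# Package the build as a gzip-compressed tarball (gzip matters; see problem 3 below)
tar -zcvf hadoop-mapreduce275.0.1.tar.gz hadoop-mapreduce275.0.1/
# Upload to the HDFS location that mapreduce.application.framework.path will reference
hdfs dfs -mkdir -p /data/hadoop/mr
hdfs dfs -put hadoop-mapreduce275.0.1.tar.gz /data/hadoop/mr/
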
# Steps 2 and 3: configuration added to mapred-site.xml
    <property>
        <name>mapreduce.application.framework.path</name>
        <value>hdfs:///data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz#mr-opt</value>
    </property>
    
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/mr-opt/*,$PWD/mr-opt/lib/*</value>
    </property>

3. How it works

When a container starts, its CLASSPATH is built from the contents of mapreduce.application.classpath, replacing the default CLASSPATH. The container startup flow in YARN is roughly as follows: after the NodeManager receives a container launch request, the container transitions from the NEW state to LOCALIZING. This phase performs resource localization (for example, downloading the tarball referenced by mapreduce.application.framework.path above), and the work is handled by the ResourceLocalizationService. The NodeManager downloads each resource into a different directory depending on its "visibility", for use by submitted jobs. Resources fall into three visibility classes:

Visibility    Description                                              Local directory
PUBLIC        Accessible to apps from all users on the NodeManager.   ${yarn.nodemanager.local-dirs}/filecache
PRIVATE       Accessible to apps submitted by the same user.          ${yarn.nodemanager.local-dirs}/usercache/${user}/filecache
APPLICATION   Accessible to containers of the same app.               ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/${appId}/filecache

Within the NodeManager, the LocalResourcesTracker class maintains the lifecycle of each resource (download, remove, recover, etc.). Below are the relevant fields in ResourceLocalizationService:

// Tracks PUBLIC resources
private LocalResourcesTracker publicRsrc;

// Tracks PRIVATE resources, one tracker per user
private final ConcurrentMap<String/*username*/,LocalResourcesTracker> privateRsrc =
    new ConcurrentHashMap<String,LocalResourcesTracker>();

// Tracks APPLICATION resources, one tracker per application
private final ConcurrentMap<String/*appid*/,LocalResourcesTracker> appRsrc =
    new ConcurrentHashMap<String,LocalResourcesTracker>();

For PRIVATE and APPLICATION resources, a separate LocalResourcesTracker is kept per username or appid, mainly because resources of different visibility have different concurrency requirements. The resource referenced by mapreduce.application.framework.path has PUBLIC visibility, i.e. jobs submitted to the node by any user may access it, so it ends up in the ${yarn.nodemanager.local-dirs}/filecache directory.
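
This is easy to verify on a NodeManager after a job has run; a quick check, assuming yarn.nodemanager.local-dirs is /home/disk0/yarn/local as in the logs later in this post:

# Each PUBLIC resource is localized into a numbered subdirectory of the shared filecache
ls -l /home/disk0/yarn/local/filecache/
# e.g. /home/disk0/yarn/local/filecache/11/hadoop-mapreduce275.0.1.tar.gz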

For PUBLIC resources, the wider the visibility, the more potential readers there are. Their HDFS replication factor should therefore be set relatively high, so that tasks do not contend for the blocks while downloading the resource during LOCALIZING, which would slow down job startup. The value of mapreduce.client.submit.file.replication is a useful reference point.
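
Because the framework tarball is uploaded by hand rather than by the job client, its replication has to be raised explicitly; a sketch (the factor 10 is only an example, size it to your cluster):

# Raise the replication factor of the framework tarball
hdfs dfs -setrep -w 10 /data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz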

4. Problems encountered

The following problems came up while testing with the wordcount example that ships with Hadoop.

Problem 1

Job submission
# mapreduce.application.classpath passed via -D
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount \
  -Dmapreduce.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,\
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,\
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,\
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce275.0.1.tar.gz/*,\
$PWD/hadoop-mapreduce275.0.1.tar.gz/lib/* \
  /tmp/input /tmp/output
Exception log
18/11/06 20:04:43 INFO mapreduce.JobSubmitter: number of splits:4
18/11/06 20:04:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1540972651914_0011
18/11/06 20:04:44 INFO mapreduce.JobSubmitter: Cleaning up the staging area /staging/hadoop/.staging/job_1540972651914_0011
java.lang.IllegalArgumentException: Could not locate MapReduce framework name 'mr-opt' in mapreduce.application.classpath
  at org.apache.hadoop.mapreduce.v2.util.MRApps.setMRFrameworkClasspath(MRApps.java:231)
  at org.apache.hadoop.mapreduce.v2.util.MRApps.setClasspath(MRApps.java:258)
  at org.apache.hadoop.mapred.YARNRunner.createApplicationSubmissionContext(YARNRunner.java:468)
  at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:296)
  at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:244)
  at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
  at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
  • Analysis: when mapreduce.application.framework.path specifies an alias for the path, mapreduce.application.classpath must reference the jars through that alias.
  • Fix: replace hadoop-mapreduce275.0.1.tar.gz in $PWD/hadoop-mapreduce275.0.1.tar.gz with its alias, i.e. $PWD/mr-opt/.

Problem 2

Job submission
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount \
  -Dmapreduce.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,\
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,\
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,\
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/mr-opt/*,$PWD/mr-opt/lib/* \
  /tmp/input /tmp/output
Exception log
18/11/07 15:28:50 INFO mapreduce.Job: Job job_1540972651914_0042 failed with state FAILED due to: Application application_1540972651914_0042 failed 4 times due to AM Container for appattempt_1540972651914_0042_000004 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://yq01-sw-backup-hds01.yq01.baidu.com:8088/cluster/app/application_1540972651914_0042Then, click on links to logs of each attempt.
Diagnostics: Permission denied: user=work, access=READ, inode="/data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz":hadoop:hadoop:-rwx------
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:308)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:220)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1808)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1792)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1765)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1844)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1814)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1729)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
  • Analysis: this is clearly a permissions problem; the uploaded tarball must be readable by all users, but it is owned by hadoop with mode -rwx------.
  • Fix: change the permissions of hadoop-mapreduce275.0.1.tar.gz to 755, as shown below.
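
For example, assuming the HDFS path from the configuration above:

# Make the framework tarball world-readable
hdfs dfs -chmod 755 /data/hadoop/mr/hadoop-mapreduce275.0.1.tar.gz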

Problem 3

Job submission
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount \
  -Dmapreduce.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,\
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,\
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,\
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/mr-opt/*,$PWD/mr-opt/lib/* \
  /tmp/input /tmp/output
Exception log
18/11/06 20:06:09 INFO mapreduce.Job: Running job: job_1540972651914_0012
18/11/06 20:06:14 INFO mapreduce.Job: Job job_1540972651914_0012 running in uber mode : false
18/11/06 20:06:14 INFO mapreduce.Job:  map 0% reduce 0%
18/11/06 20:06:14 INFO mapreduce.Job: Job job_1540972651914_0012 failed with state FAILED due to: Application application_1540972651914_0012 failed 4 times due to AM Container for appattempt_1540972651914_0012_000004 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://yq01-sw-backup-hds01.yq01.baidu.com:8088/cluster/app/application_1540972651914_0012Then, click on links to logs of each attempt.
Diagnostics: ExitCodeException exitCode=2:
gzip: /home/disk0/yarn/local/filecache/10_tmp/tmp_hadoop-mapreduce275.0.1.tar.gz: not in gzip format
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
Failing this attempt. Failing the application.
18/11/06 20:06:14 INFO mapreduce.Job: Counters: 0
  • Analysis: a packaging slip; the archive was created with tar -cvf, producing a plain (uncompressed) tar, while the NodeManager expects a gzip-compressed archive.
  • Fix: package with tar -zcvf, as in the sketch below.
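
A quick way to repackage and sanity-check the archive before re-uploading (the local directory name is assumed):

# Re-create the archive with gzip compression
tar -zcvf hadoop-mapreduce275.0.1.tar.gz hadoop-mapreduce275.0.1/
# Verify the result really is gzip-compressed
file hadoop-mapreduce275.0.1.tar.gz        # should report "gzip compressed data"
tar -ztvf hadoop-mapreduce275.0.1.tar.gz | head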

Problem 4

Job submission
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount \
  -Dmapreduce.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,\
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,\
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,\
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$PWD/mr-opt/*,$PWD/mr-opt/lib/* \
  /tmp/input /tmp/output
Exception log
Log Type: stderr
Log Upload Time: Tue Nov 06 20:09:02 +0800 2018
Log Length: 88
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
  • Analysis: every AM attempt failed; the log shows that the MRAppMaster class could not be found or loaded.

Debugging

  1. Enable NodeManager debugging: set yarn.nodemanager.delete.debug-delay-sec to a longer interval, e.g. 3600, so that localized files and launch scripts are retained after the container exits.
  2. Inspect the AM launch script, located on the AM's NodeManager at ${yarn.nodemanager.local-dirs}/nmPrivate/${applicationId}/${containerID}/launch_container.sh. Its CLASSPATH section looked like this:
# launch_container.sh (excerpt)
export PWD="/home/disk0/yarn/local/usercache/hadoop/appcache/application_1540972651914_0043/container_e27_1540972651914_0043_03_000001"
...
export CLASSPATH="$PWD:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/home/hadoop/mr-opt/*:
/home/hadoop/mr-opt/lib/*:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:$PWD/*"
...
ln -sf "/home/disk0/yarn/local/filecache/11/hadoop-mapreduce275.0.1.tar.gz" "mr-opt"

The MR part of the classpath is /home/hadoop/mr-opt/* and /home/hadoop/mr-opt/lib/*, which is clearly wrong; it should be $PWD/mr-opt/* and $PWD/mr-opt/lib/*. The root cause is that $PWD was expanded by the submitting shell into the working directory at submit time, and the job happened to be submitted from /home/hadoop.

  • Fix: do not pass mapreduce.application.classpath via -D; configure it directly in mapred-site.xml instead.
# The correct CLASSPATH, for comparison:
export PWD="/home/disk0/yarn/local/usercache/hadoop/appcache/application_1540972651914_0043/container_e27_1540972651914_0043_03_000001"

export CLASSPATH="$PWD:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:$PWD/mr-opt/*:
$PWD/mr-opt/lib/*:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:$PWD/*"

ln -sf "/home/disk0/yarn/local/filecache/11/hadoop-mapreduce275.0.1.tar.gz" "mr-opt"
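
If the property really must be passed on the command line, single-quoting the value should also work, since it keeps the submitting shell from expanding $PWD and the wildcards (untested here, and the classpath below is abridged to the MR entries; use the full list from mapred-site.xml above):

# Single quotes keep $PWD and * literal until the container evaluates them
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount \
  -D'mapreduce.application.classpath=$HADOOP_CONF_DIR,$PWD/mr-opt/*,$PWD/mr-opt/lib/*' \
  /tmp/input /tmp/output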
