Hadoop job submission failure: java.net.ConnectException: Connection refused

Notes on a failed Hadoop job submission.

Problem

INFO log

2019-07-18 11:40:50 386 [QuartzScheduler_Worker-1:203538] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-18 11:40:51 388 [QuartzScheduler_Worker-1:204540] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-18 11:40:52 389 [QuartzScheduler_Worker-1:205541] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

ERROR log

2019-07-18 10:51:12 160 [QuartzScheduler_Worker-1:669550199] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=923, name='TRAJECTORY@HOUR@1562281200000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/07","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/07/TRAJECTORY","configMap":{},"gran":"HOUR","monitorTime":1562281200000,"mrTypeName":"TRAJECTORY"}', status=2}]  run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-07-18 11:11:22 520 [QuartzScheduler_Worker-1:670760559] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=925, name='TRAJECTORY@HOUR@1562284800000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/08","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/08/TRAJECTORY","configMap":{},"gran":"HOUR","monitorTime":1562284800000,"mrTypeName":"TRAJECTORY"}', status=2}]  run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-07-18 11:31:32 854 [QuartzScheduler_Worker-1:671970893] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=924, name='LAUNCH@HOUR@1562284800000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/08","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/08/LAUNCH","configMap":{},"gran":"HOUR","monitorTime":1562284800000,"mrTypeName":"LAUNCH"}', status=2}]  run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Similar problems can show ConnectionRefused on 8020 or other ports; in almost all cases it means the service behind that port is down. Here 8032 is the YARN ResourceManager port, so that is what needs checking. If you do not know which service a port like 8032 belongs to, grep the Hadoop configuration files to see which property it is configured under.
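Before digging into configuration, a quick TCP probe from the submitting host confirms that nothing is answering on that port (a minimal check, assuming nc or telnet is available; the host name sparka comes from the log above):

# probe the ResourceManager port from the client side
nc -zv sparka 8032
# or, if nc is not installed
telnet sparka 8032

A "Connection refused" here points at the service itself being down, while a timeout would instead suggest a firewall or routing problem.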

Troubleshooting and resolution

First I grepped all the configuration files for 8032, then checked the individual files to pin down which configuration file and property it lives in:

[root@sparka hadoop]# cat * | grep 8032
                <value>sparka:8032</value>
[root@sparka hadoop]# cat hdfs-site.xml | grep 8032 
[root@sparka hadoop]# cat core-site.xml | grep 8032    
[root@sparka hadoop]# cat yarn-site.xml | grep 8032                
                <value>sparka:8032</value>
<property>
                <name>yarn.resourcemanager.address</name>
                <value>sparka:8032</value>
        </property>
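To print the whole property rather than just the matching line, and to confirm on this host that nothing is actually listening on 8032, something like the following works (a sketch; on older systems ss can be replaced with netstat -lntp):

# show the matching property with some surrounding context
grep -B1 -A2 8032 yarn-site.xml

# check whether any process is bound to 8032 on this host
ss -lntp | grep 8032

An empty result from the second command is consistent with the Connection refused errors above: yarn.resourcemanager.address points at sparka:8032, but nothing is listening there.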

Check the YARN ResourceManager process with jps

[root@sparka hadoop]# jps
24384 jar
17312 jar
15937 Main
12746 NameNode
7370 SDK_WEB_AUTOREPORT.jar
31436 ZEUS_MANAGERSERVER.jar
26637 jar
24078 Jps
21903 jar
29427 Bootstrap
31027 jar
12947 JournalNode
18580 jar
23541 RunJar
24314 jar
13117 DFSZKFailoverController
22333 jar
15358 Main
4287 machineagent.jar
[root@sparka hadoop]#

The ResourceManager is indeed missing. The next step is to find out when and why it stopped or failed to start, by checking its log under the Hadoop logs directory.
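With the default layout, the ResourceManager log sits under $HADOOP_HOME/logs and is named yarn-<user>-resourcemanager-<host>.log. A quick way to jump to the interesting entries instead of scrolling is to grep for FATAL or Halt messages (a sketch assuming that default location):

cd $HADOOP_HOME/logs
ls -lt yarn-*-resourcemanager-*.log*
# jump straight to fatal / halt entries
grep -n -i -E 'fatal|halt' yarn-root-resourcemanager-sparka.log | tail -n 20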

Check the ResourceManager log

[root@sparka logs]# tail -F yarn-root-resourcemanager-sparka.log
2019-06-22 00:08:54,794 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1555906617082_6791,name=sdk_data_action_day-determine_partitions_groupby-Optional.of([2019-06-21T00:00:00.000Z/2019-06-22T00:00:00.000Z]),user=root,queue=root.root,state=FAILED,trackingUrl=http://sparka:8088/cluster/app/application_1555906617082_6791,appMasterHost=N/A,startTime=1561132960264,finishTime=1561133311607,finalStatus=FAILED
2019-06-22 00:08:58,664 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : sparkb:32859 for container : container_1555906617082_6794_01_000001
2019-06-22 00:09:00,142 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1555906617082_6794_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
2019-06-22 00:09:00,142 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Clear node set for appattempt_1555906617082_6794_000001
2019-06-22 00:09:00,859 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1555906617082_6794 AttemptId: appattempt_1555906617082_6794_000001 MasterContainer: Container: [ContainerId: container_1555906617082_6794_01_000001, NodeId: sparkb:32859, NodeHttpAddress: sparkb:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.47.153:32859 }, ]
2019-06-22 00:09:03,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1555906617082_6794_000001 State change from SCHEDULED to ALLOCATED_SAVING
2019-06-22 00:09:06,446 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1555906617082_6794_000001 State change from ALLOCATED_SAVING to ALLOCATED
2019-06-22 00:09:07,251 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1555906617082_6794_000001
2019-06-22 00:09:13,245 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
2019-06-22 00:09:13,960 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException

The last two lines show this exception; let's take a look at the source.

  /**
   * Forcibly terminates the currently running Java virtual machine.
   *
   * @param status
   *          exit code
   * @param msg
   *          message used to create the {@code HaltException}
   * @throws HaltException
   *           if Runtime.getRuntime().halt() is disabled for test purposes
   */
  public static void halt(int status, String msg) throws HaltException {
    LOG.info("Halt with status " + status + " Message: " + msg);
    if (systemHaltDisabled) {
      HaltException ee = new HaltException(status, msg);
      LOG.fatal("Halt called", ee);
      if (null == firstHaltException) {
        firstHaltException = ee;
      }
      throw ee;
    }
    Runtime.getRuntime().halt(status);
  }

According to the method's Javadoc, the JVM process was forcibly terminated. A forced shutdown like this usually means the server could not provide the resources the process needed to keep running; my guess is that it ran out of memory. However, since the process had died some time earlier, I could not find any log evidence to confirm it.
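If memory pressure is the suspect, two places worth checking are the kernel OOM-killer traces and the current memory headroom on the node (a sketch; the log path assumes a typical CentOS/RHEL host):

# look for OOM-killer activity around the time the ResourceManager disappeared
dmesg | grep -i -E 'out of memory|killed process'
grep -i -E 'out of memory|oom-killer' /var/log/messages

# current memory headroom on the node
free -h

If the kernel did kill the process, these traces normally name the victim explicitly.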

Resolution

Restart YARN.
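A minimal restart sketch, assuming a Hadoop 2.x layout like the one in the logs above (on Hadoop 3.x the equivalent is yarn --daemon start resourcemanager):

# start only the ResourceManager daemon
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager

# or bounce all YARN daemons
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh

# verify the process is back
jps | grep ResourceManager

Once the ResourceManager is listening on sparka:8032 again, the retried job submissions should go through.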

If you have better troubleshooting ideas or approaches, please point them out. Many thanks.
