1: During a Spark run, the connection to the driver fails and a disk read/write error appears:
java.io.IOException: Failed to delete: /mnt/sd04/yarn/nm/usercache/hdfs/appcache/application_1570683010624_24827/blockmgr-24356fee-b578-49a1-8e97-9588d2d1180e
Directions to resolve: 1: the driver died; check the driver machine itself. 2: the executor's machine is out of disk space. 3: at heart this is a memory problem: increase executor memory and raise spark.executor.heartbeatInterval (default 10s; try 60s). Leave executor-cores as-is, or try reducing it. An oversized broadcast can also trigger this error; repartition the data and replace the broadcast with an ordinary join. Likewise, calling partitionBy(new HashPartitioner(partition-num)) with a partition count that is far too large or too small can produce the same error. That too is ultimately a memory issue, but it is best fixed by tuning the number of partitions.
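The memory-related knobs above can all be set at submit time. A minimal sketch, assuming YARN deployment; the jar name and resource sizes are placeholders to adapt to your cluster:

```shell
# Hypothetical spark-submit illustrating the tuning discussed above:
# larger executor memory, a longer heartbeat interval, and modest cores.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --executor-cores 2 \
  --conf spark.executor.heartbeatInterval=60s \
  your-app.jar
```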
2:ERROR TransportChannelHandler:144 - Connection to hadoop1/192.168.32.1:18053 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
Directions to resolve: 1: increase executor-memory or reduce executor-cores. 2: set spark.network.timeout to a sensible value; the default is 120s, which can be raised to 600s.
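A sketch of the timeout change, with placeholder jar name. Note that spark.executor.heartbeatInterval must stay well below spark.network.timeout, or executors will be marked dead spuriously:

```shell
# Raise the network timeout for slow or heavily loaded clusters.
spark-submit \
  --master yarn \
  --conf spark.network.timeout=600s \
  your-app.jar
```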
3:java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
Direction to resolve: increase the port retry count: spark.port.maxRetries 100
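This typically happens when many drivers run on the same host, each probing for a free SparkUI port. A sketch of the fix (jar name is a placeholder):

```shell
# Let the driver try up to 100 successive ports before giving up
# (by default the SparkUI starts probing from port 4040).
spark-submit --conf spark.port.maxRetries=100 your-app.jar
```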
4: Spark on YARN fails with a jersey error
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.core.util.FeaturesAndProperties
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 37 more
Direction to resolve: likely a version mismatch between YARN's jersey jars and Spark's. In this case YARN's lib directory carried the 1.9 jars while Spark shipped 2.22.2. Copy jersey-core-1.9.jar and jersey-client-1.9.jar into $SPARK_HOME/jars, and rename the existing jersey-client-2.22.2.jar in that directory; Spark in YARN mode will then start normally.
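The jar swap described above, sketched as shell commands. The Hadoop lib path is illustrative; confirm the actual jersey versions in your $SPARK_HOME/jars and Hadoop lib directories before copying:

```shell
# Copy YARN's jersey 1.9 jars into Spark's classpath directory.
cp /path/to/hadoop/lib/jersey-core-1.9.jar   "$SPARK_HOME/jars/"
cp /path/to/hadoop/lib/jersey-client-1.9.jar "$SPARK_HOME/jars/"
# Rename (rather than delete) the conflicting 2.22.2 client jar,
# so the change is easy to revert.
mv "$SPARK_HOME/jars/jersey-client-2.22.2.jar" \
   "$SPARK_HOME/jars/jersey-client-2.22.2.jar.bak"
```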
5:org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
Direction to resolve: this error usually occurs in the shuffle stage when the shuffle volume is too large. The fix is to reduce the amount of data being shuffled where possible, increase the job's parallelism during the shuffle, and increase executor memory as appropriate.
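A sketch of the shuffle tuning above for a Spark SQL job; the partition count and memory size are examples, not recommendations, and the jar name is a placeholder:

```shell
# More shuffle partitions spread the shuffle across smaller tasks;
# more executor memory reduces spills and lost map outputs.
spark-submit \
  --conf spark.sql.shuffle.partitions=800 \
  --executor-memory 8g \
  your-app.jar
```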
6:WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Direction to resolve: the YARN queue has no free resources. The job will start once other jobs finish and release their resources.
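Before waiting, it is worth confirming the queue really is saturated. The queue name below is a placeholder:

```shell
# List running and pending applications to see what is holding resources.
yarn application -list -appStates RUNNING,ACCEPTED
# Inspect capacity and usage of the queue the job was submitted to.
yarn queue -status root.default
```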
7:File does not exist. Holder DFSClient_NONMAPREDUCE_311350472_53 does not have any open files.
Direction to resolve: usually caused by several tasks operating on the same directory at once. For a MapReduce program, try closing the outputs in the Reducer's cleanup method:
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Close MultipleOutputs so all open streams are flushed before the task ends.
    multipleOutputs.close();
}
If that does not help, try splitting the output across several directories.
8: INFO DAGScheduler:54 - ShuffleMapStage 26 (insertInto at App.scala:288) failed in 425.039 s due to Stage cancelled because SparkContext was shut down
ERROR FileFormatWriter:91 - Aborting job 405d0906-e853-4ccd-84e9-b5d660d08e13.
org.apache.spark.SparkException: Job 8 cancelled because SparkContext was shut down
Direction to resolve: plainly, the job was killed by someone.
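To confirm who or what shut the context down, the aggregated YARN logs usually show the kill event; substitute your own application id:

```shell
# Pull the application's aggregated logs and look for the shutdown cause.
yarn logs -applicationId application_1570683010624_24827
```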