Spark common runtime errors - continuously updated

1: While a Spark job is running, the connection to the driver fails and disk read/write errors appear:

java.io.IOException: Failed to delete: /mnt/sd04/yarn/nm/usercache/hdfs/appcache/application_1570683010624_24827/blockmgr-24356fee-b578-49a1-8e97-9588d2d1180e

Possible fixes: 1: the driver died; in that case look at the driver host itself. 2: the machine hosting the executor has run out of disk space. 3: at its root this is a memory problem: increase executor memory and raise spark.executor.heartbeatInterval (the default is 10s; try 60s). executor-cores normally does not need to change, though reducing it can be tried. An oversized broadcast can also trigger this error; repartition the data and replace the broadcast with an ordinary join. Calling partitionBy(new HashPartitioner(numPartitions)) with a numPartitions that is far too large or far too small can raise the same error; that is still ultimately a memory issue, but the cleanest fix is to tune the partition count.
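
A minimal sketch of point 3 above, assuming a Spark SQL job; the paths, table names, join key and partition count are made up for illustration:

    import org.apache.spark.sql.SparkSession

    // Raise executor memory and the heartbeat interval so long GC pauses
    // are no longer mistaken for a dead driver connection.
    val spark = SparkSession.builder()
      .appName("heartbeat-tuning-sketch")
      .config("spark.executor.memory", "8g")               // larger executor heap
      .config("spark.executor.heartbeatInterval", "60s")   // default is 10s
      .getOrCreate()

    // Hypothetical inputs; replace with real tables.
    val orders       = spark.read.parquet("/data/orders")
    val userProfiles = spark.read.parquet("/data/user_profiles")

    // Instead of broadcasting a large userProfiles table, repartition both sides
    // on the join key and let Spark run an ordinary shuffle join.
    val joined = orders
      .repartition(400, orders("user_id"))
      .join(userProfiles.repartition(400, userProfiles("user_id")), "user_id")

    // Same idea for RDDs: keep the HashPartitioner count sane, e.g.
    //   pairs.partitionBy(new org.apache.spark.HashPartitioner(400))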

2:ERROR TransportChannelHandler:144 - Connection to hadoop1/192.168.32.1:18053 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.

Possible fixes: 1: increase executor-memory or reduce executor-cores. 2: set spark.network.timeout appropriately; the default is 120s and it can be raised to 600s.
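
A hedged sketch of those two directions, assuming the values are set programmatically before the context starts (the numbers are illustrative, not universal recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.network.timeout", "600s")   // default 120s; idle timeout for RPC/shuffle connections
      .set("spark.executor.memory", "8g")     // more heap shortens the GC pauses that delay replies
      .set("spark.executor.cores", "2")       // fewer concurrent tasks competing for that heap

    val spark = SparkSession.builder().config(conf).getOrCreate()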

3:java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

at sun.nio.ch.Net.bind0(Native Method)

at sun.nio.ch.Net.bind(Net.java:444)

at sun.nio.ch.Net.bind(Net.java:436)

at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)

Possible fixes: increase the number of port retries, e.g. spark.port.maxRetries=100 (set via --conf on spark-submit or in spark-defaults.conf).
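
A short sketch of that setting, assuming it is applied in code before the SparkContext starts; the port numbers are illustrative:

    import org.apache.spark.sql.SparkSession

    // Each Spark app on a host probes 4040, 4041, ... for its UI; with many
    // concurrent apps the default 16 retries can all fail, hence the error.
    val spark = SparkSession.builder()
      .config("spark.port.maxRetries", "100")   // try up to 100 successive ports
      .config("spark.ui.port", "4050")          // optional: start from a less crowded base port
      .getOrCreate()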

4: Spark on YARN fails with a jersey error

Caused by: java.lang.ClassNotFoundException: com.sun.jersey.core.util.FeaturesAndProperties

        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

        at java.security.AccessController.doPrivileged(Native Method)

        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)

        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

        ... 37 more

Possible fixes: the jersey jar version used by YARN probably does not match the one shipped with Spark. In this case YARN's lib directory contained the 1.9 jars while Spark shipped 2.22.2. Copy jersey-core-1.9.jar and jersey-client-1.9.jar into $SPARK_HOME/jars, rename the existing jersey-client-2.22.2.jar in that directory, and Spark in YARN mode starts normally again.

5:org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1 

Possible fixes: this error usually occurs in the shuffle stage when the amount of shuffled data is too large. Reduce the data that has to be shuffled, increase the job's parallelism during the shuffle, and moderately increase executor memory.
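
A minimal sketch of those directions for a Spark SQL job; the input path, column names and partition count are assumptions made for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.sql.shuffle.partitions", "800")   // default 200; more but smaller shuffle blocks
      .config("spark.executor.memory", "8g")
      .getOrCreate()

    import spark.implicits._

    // Project and filter before the shuffle so less data is written to shuffle files.
    val events  = spark.read.parquet("/data/events")
    val slimmed = events.select("user_id", "amount").filter($"amount" > 0)
    val totals  = slimmed.groupBy("user_id").sum("amount")   // the aggregation is what shuffles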

6:WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Possible fixes: the YARN queue has no free resources; the job will start once other tasks finish and release theirs.

7:File does not exist. Holder DFSClient_NONMAPREDUCE_311350472_53 does not have any open files.

Possible fixes: this is usually caused by several tasks operating on the same directory at the same time. For a MapReduce program, try adding a cleanup method to the Reducer:

 @Override
 protected void cleanup(Context context) throws IOException, InterruptedException {
     // multipleOutputs is the MultipleOutputs instance created in setup();
     // closing it flushes and releases the open HDFS files before the task ends.
     multipleOutputs.close();
 }

If that does not help, try changing the job to write to several different output directories.

8: INFO DAGScheduler:54 - ShuffleMapStage 26 (insertInto at App.scala:288) failed in 425.039 s due to Stage cancelled because SparkContext was shut down
ERROR FileFormatWriter:91 - Aborting job 405d0906-e853-4ccd-84e9-b5d660d08e13.
org.apache.spark.SparkException: Job 8 cancelled because SparkContext was shut down

Possible fixes: quite simply, the job was killed from outside (for example with yarn application -kill or by a cluster administrator).

 
