1: During a Spark run, the connection to the driver fails and a disk read/write error appears:
java.io.IOException: Failed to delete: /mnt/sd04/yarn/nm/usercache/hdfs/appcache/application_1570683010624_24827/blockmgr-24356fee-b578-49a1-8e97-9588d2d1180e
Directions to resolve: 1: the driver died; check the driver machine itself. 2: the executor's machine is out of disk space. 3: at heart this is a memory problem: increase executor memory and raise spark.executor.heartbeatInterval (default 10s; try 60s). Leave executor-cores as-is, or try reducing it. An oversized broadcast can also trigger this error; repartition the data and replace the broadcast with an ordinary join. Likewise, calling partitionBy(new HashPartitioner(partition-num)) with a partition count that is far too large or too small can produce the same error. That too is ultimately a memory issue, but it is best fixed by tuning the number of partitions.
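The memory-related knobs above can all be set at submit time. A minimal sketch, assuming YARN deployment; the jar name and resource sizes are placeholders to adapt to your cluster:

```shell
# Hypothetical spark-submit illustrating the tuning discussed above:
# larger executor memory, a longer heartbeat interval, and modest cores.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --executor-cores 2 \
  --conf spark.executor.heartbeatInterval=60s \
  your-app.jar
```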
2:ERROR TransportChannelHandler:144 - Connection to hadoop1/192.168.32.1:18053 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
Directions to resolve: 1: increase executor-memory or reduce executor-cores. 2: set spark.network.timeout to a sensible value; the default is 120s, which can be raised to 600s.
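A sketch of the timeout change, with placeholder jar name. Note that spark.executor.heartbeatInterval must stay well below spark.network.timeout, or executors will be marked dead spuriously:

```shell
# Raise the network timeout for slow or heavily loaded clusters.
spark-submit \
  --master yarn \
  --conf spark.network.timeout=600s \
  your-app.jar
```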
3:java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
Direction to resolve: increase the port retry count: spark.port.maxRetries 100
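This typically happens when many drivers run on the same host, each probing for a free SparkUI port. A sketch of the fix (jar name is a placeholder):

```shell
# Let the driver try up to 100 successive ports before giving up
# (by default the SparkUI starts probing from port 4040).
spark-submit --conf spark.port.maxRetries=100 your-app.jar
```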
4: Spark on YARN fails with a jersey error
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.core.util.FeaturesAndProperties
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 37 more
Direction to resolve: likely a version mismatch between YARN's jersey jars and Spark's. In this case YARN's lib directory carried the 1.9 jars while Spark shipped 2.22.2. Copy jersey-core-1.9.jar and jersey-client-1.9.jar into $SPARK_HOME/jars, and rename the existing jersey-client-2.22.2.jar in that directory; Spark in YARN mode will then start normally.
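The jar swap described above, sketched as shell commands. The Hadoop lib path is illustrative; confirm the actual jersey versions in your $SPARK_HOME/jars and Hadoop lib directories before copying:

```shell
# Copy YARN's jersey 1.9 jars into Spark's classpath directory.
cp /path/to/hadoop/lib/jersey-core-1.9.jar   "$SPARK_HOME/jars/"
cp /path/to/hadoop/lib/jersey-client-1.9.jar "$SPARK_HOME/jars/"
# Rename (rather than delete) the conflicting 2.22.2 client jar,
# so the change is easy to revert.
mv "$SPARK_HOME/jars/jersey-client-2.22.2.jar" \
   "$SPARK_HOME/jars/jersey-client-2.22.2.jar.bak"
```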
5:org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
Direction to resolve: this error usually occurs in the shuffle stage when the shuffle volume is too large. The fix is to reduce the amount of data being shuffled where possible, increase the job's parallelism during the shuffle, and increase executor memory as appropriate.
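A sketch of the shuffle tuning above for a Spark SQL job; the partition count and memory size are examples, not recommendations, and the jar name is a placeholder:

```shell
# More shuffle partitions spread the shuffle across smaller tasks;
# more executor memory reduces spills and lost map outputs.
spark-submit \
  --conf spark.sql.shuffle.partitions=800 \
  --executor-memory 8g \
  your-app.jar
```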
6:WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Direction to resolve: the YARN queue has no free resources. The job will start once other jobs finish and release their resources.
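Before waiting, it is worth confirming the queue really is saturated. The queue name below is a placeholder:

```shell
# List running and pending applications to see what is holding resources.
yarn application -list -appStates RUNNING,ACCEPTED
# Inspect capacity and usage of the queue the job was submitted to.
yarn queue -status root.default
```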
7:File does not exist. Holder DFSClient_NONMAPREDUCE_311350472_53 does not have any open files.
Direction to resolve: usually caused by several tasks operating on the same directory at once. For a MapReduce program, try closing the outputs in the Reducer's cleanup method:
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Close MultipleOutputs so all open streams are flushed before the task ends.
    multipleOutputs.close();
}
If that does not help, try splitting the output across several directories.
8: INFO DAGScheduler:54 - ShuffleMapStage 26 (insertInto at App.scala:288) failed in 425.039 s due to Stage cancelled because SparkContext was shut down
ERROR FileFormatWriter:91 - Aborting job 405d0906-e853-4ccd-84e9-b5d660d08e13.
org.apache.spark.SparkException: Job 8 cancelled because SparkContext was shut down
Direction to resolve: plainly, the job was killed by someone.
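To confirm who or what shut the context down, the aggregated YARN logs usually show the kill event; substitute your own application id:

```shell
# Pull the application's aggregated logs and look for the shutdown cause.
yarn logs -applicationId application_1570683010624_24827
```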