Common Spark errors and fixes (continuously updated)

1: During a Spark run, the connection to the driver fails and a disk read/write exception appears:

java.io.IOException: Failed to delete: /mnt/sd04/yarn/nm/usercache/hdfs/appcache/application_1570683010624_24827/blockmgr-24356fee-b578-49a1-8e97-9588d2d1180e

Directions to fix: 1: the driver died; check that machine. 2: the machine hosting the executor is out of disk space. 3: at heart it is a memory problem: increase executor memory and raise spark.executor.heartbeatInterval (the default is 10s; try 60s). The executor-cores count need not change, though reducing it can help. An oversized broadcast can also cause this error; repartition the data and replace the broadcast with an ordinary join. The same error appears when partitionBy(new HashPartitioner(partitionNum)) is called with a partitionNum that is far too large or too small, which again comes down to memory, but it is best fixed by adjusting the number of partitions. A sketch of these settings follows below.
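
A minimal Scala sketch of point 3 above (the app name, paths, column name "key", and the partition count 400 are all illustrative assumptions; on a real cluster these configs are usually passed to spark-submit with --conf instead):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("heartbeat-tuning-sketch")                 // hypothetical name
      .config("spark.executor.heartbeatInterval", "60s")  // default is 10s
      .config("spark.executor.memory", "8g")              // example value; size to your job
      .getOrCreate()

    // Instead of broadcasting an oversized table, repartition both sides on the
    // join key and let Spark run an ordinary shuffle join:
    val largeDf = spark.read.parquet("/path/to/large")    // hypothetical paths
    val smallDf = spark.read.parquet("/path/to/small")
    val joined  = largeDf.repartition(400, largeDf("key"))
      .join(smallDf.repartition(400, smallDf("key")), "key")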

2:ERROR TransportChannelHandler:144 - Connection to hadoop1/192.168.32.1:18053 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.

Directions to fix: 1: increase executor-memory, or reduce executor-cores. 2: set spark.network.timeout to a sensible value; the default is 120s and it can be raised to 600s (sketch below).
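
A minimal sketch of raising the timeout at session build time (the app name is hypothetical; --conf spark.network.timeout=600s on spark-submit achieves the same):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("network-timeout-sketch")
      .config("spark.network.timeout", "600s")  // default is 120s
      .getOrCreate()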

3:java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)

Directions to fix: increase the number of port retries: spark.port.maxRetries 100 (sketch below).
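
spark.port.maxRetries must be in place before the SparkContext starts, since the SparkUI binds its port during startup; a minimal sketch (app name hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("port-retries-sketch")
      .set("spark.port.maxRetries", "100")  // try up to 100 successive ports for the SparkUI
    val sc = new SparkContext(conf)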

4: Spark on YARN fails with a jersey error

Caused by: java.lang.ClassNotFoundException: com.sun.jersey.core.util.FeaturesAndProperties

        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 37 more

Directions to fix: likely a mismatch between the jersey jar versions used by YARN and by Spark. YARN's lib directory ships jersey 1.9 jars while Spark ships 2.22.2. Copy jersey-core-1.9.jar and jersey-client-1.9.jar into $SPARK_HOME/jars, and rename the existing jersey-client-2.22.2.jar in that directory; Spark in YARN mode then starts normally.

5:org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1 

Directions to fix: this error usually appears in the shuffle stage when too much data is being shuffled. Try to shuffle less data, raise the job's parallelism during the shuffle, and increase executor memory if needed (sketch below).
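
A sketch of raising shuffle parallelism (the value 800 is an illustrative assumption, not a recommendation; tune it to your data volume):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-tuning-sketch")                 // hypothetical name
      .config("spark.sql.shuffle.partitions", "800")    // DataFrame/SQL shuffles, default 200
      .config("spark.default.parallelism", "800")       // RDD shuffles without an explicit count
      .getOrCreate()

    // For RDD code, an explicit partition count on the shuffle operator also works:
    // rdd.reduceByKey(_ + _, 800)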

6:WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Directions to fix: the YARN queue is out of resources; the job will be scheduled once other jobs finish and release their resources.

7:File does not exist. Holder DFSClient_NONMAPREDUCE_311350472_53 does not have any open files.

Directions to fix: usually caused by several tasks operating on the same directory at the same time. For a MapReduce program, try adding a cleanup method to the Reducer:

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();  // release the open files so the DFSClient lease is freed
    }

If that does not help, try splitting the output across several directories.

8: INFO DAGScheduler:54 - ShuffleMapStage 26 (insertInto at App.scala:288) failed in 425.039 s due to Stage cancelled because SparkContext was shut down
ERROR FileFormatWriter:91 - Aborting job 405d0906-e853-4ccd-84e9-b5d660d08e13.
org.apache.spark.SparkException: Job 8 cancelled because SparkContext was shut down

Directions to fix: quite plainly, the job was killed by someone; the SparkContext was shut down from outside.
