10: WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set — a resolved case

1. Symptom:

When submitting a job on Spark on YARN, the following output appears:

WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

INFO yarn.Client: Uploading resource file:/tmp/spark-27a2d9ca-106c-4f4a-baff-c96ef5081c51/__spark_libs__808575299793112451.zip -> hdfs://weizhonggui/user/hadoop/.sparkStaging/application_1543886353459_0001/__spark_libs__808575299793112451.zip
18/12/24 23:32:36 INFO yarn.Client: Uploading resource file:/tmp/spark-27a2d9ca-106c-4f4a-baff-c96ef5081c51/__spark_conf__4031622468796062240.zip -> hdfs://weizhonggui/user/hadoop/.sparkStaging/application_1543886353459_0001/__spark_conf__4031622468796062240.zip


2. Root-cause analysis:

The Running on YARN documentation (https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties) explains:

To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.

Looking further at the specific Spark Properties:

spark.yarn.jars (default: none): List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn’t need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

spark.yarn.archive (default: none): An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application’s containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

In other words, by default Spark on YARN uses the locally installed Spark jars (under the Spark installation directory), but those jars can also live in a world-readable location on HDFS. YARN can then cache them on each node, so they do not have to be uploaded every time an application runs.
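As a hedged illustration of the spark.yarn.jars alternative (the HDFS path below is hypothetical; globs are allowed per the docs quoted above):

```
# spark-defaults.conf — hypothetical HDFS location for the Spark jars
spark.yarn.jars  hdfs:///some/path/jars/*.jar
```

Only one of spark.yarn.jars and spark.yarn.archive needs to be set; if both are set, spark.yarn.archive takes precedence.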

3. Resolution:

3.1. Create the archive: jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
3.2. Create the target directory on HDFS: hdfs dfs -mkdir -p /system/SparkJars/jar
then upload the jar: hdfs dfs -put spark-libs.jar /system/SparkJars/jar
3.3. In spark-defaults.conf, set spark.yarn.archive=hdfs:///system/SparkJars/jar/spark-libs.jar
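The steps above can be sketched as one shell session (paths follow this article; run on a node with Hadoop and Spark installed, so this is a cluster-dependent sketch rather than a locally runnable script):

```shell
# 3.1 Pack everything under $SPARK_HOME/jars into one uncompressed archive
#     (cv0f = verbose, no compression, output file spark-libs.jar)
jar cv0f spark-libs.jar -C "$SPARK_HOME/jars/" .

# 3.2 Create the HDFS directory first, then upload the archive into it
hdfs dfs -mkdir -p /system/SparkJars/jar
hdfs dfs -put spark-libs.jar /system/SparkJars/jar

# 3.3 Point Spark at the cached archive (append to conf/spark-defaults.conf)
echo "spark.yarn.archive hdfs:///system/SparkJars/jar/spark-libs.jar" \
  >> "$SPARK_HOME/conf/spark-defaults.conf"
```

Note that the mkdir must come before the put, otherwise the file lands at /system/SparkJars/jar itself rather than inside it.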

4. Summary:

This is a Spark on YARN tuning technique: it saves the time each application would otherwise spend uploading the Spark jars to HDFS on every submission. Whether it takes effect can be checked in the client log of the next submission.
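One hedged way to verify: resubmit a job and filter the client log. With spark.yarn.archive set, the WARN should no longer appear and yarn.Client should no longer upload a __spark_libs__*.zip (exact log wording varies by Spark version; the example application below is the stock SparkPi, with a version-specific jar name you would adjust):

```shell
# Hypothetical verification run — replace the examples jar with your own build's
spark-submit --master yarn --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar" 10 2>&1 |
  grep -E "Neither spark.yarn.jars|__spark_libs__|spark-libs.jar"
```

If the fix worked, the grep shows a reference to the HDFS-hosted spark-libs.jar instead of the warning and the libs upload.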
