SparkConf
If you write it like this:
new SparkConf().setMaster("yarn-client")
debugging in IDEA fails with:
Exception in thread "main" java.lang.IllegalStateException: Library directory '....../data-platform-task/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
    at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
Consulting the official Spark documentation, either spark.yarn.jars or spark.yarn.archive needs to be set:
- spark.yarn.jars: the Spark jars to ship to YARN containers; both local jars and HDFS paths are supported (a sketch follows below).
- spark.yarn.archive: an archive containing the Spark jars.
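For comparison, a minimal sketch of the spark.yarn.jars variant (the HDFS glob below is hypothetical; glob patterns are allowed, and this article uses the archive approach instead):

new SparkConf().setMaster("yarn-client")
        .set("spark.yarn.jars", "hdfs://localhost:9000/user/.../spark-libs/*.jar"); // hypothetical HDFS glob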
Modify the program:
new SparkConf().setMaster("yarn-client")
        .set("spark.yarn.archive", getProperty(HDFS_SPARK_ARCHIVE))
Debugging in IDEA again fails with:
Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.x$334 of type org.apache.spark.api.java.function.PairFunction in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
This happens because the executors cannot find the classes the task depends on: the archive only ships Spark's own jars, so when an executor tries to deserialize the task's lambda (a PairFunction here), the application classes are missing.
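Any lambda-based pair transformation triggers this. For illustration (lines is assumed to be a JavaRDD<String>; it is not part of the original code):

JavaPairRDD<String, Integer> pairs =
        lines.mapToPair(s -> new Tuple2<>(s, 1)); // the executor must load the task's classes to deserialize this lambda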
Reading the documentation further, there is also:
- spark.yarn.dist.jars: a comma-separated list of jars to be placed in the working directory of each executor.
Modify the program again:
new SparkConf().setMaster("yarn-client")
        .set("spark.yarn.archive", getProperty(HDFS_SPARK_ARCHIVE))
        .set("spark.yarn.dist.jars", getProperty(TASK_JARS))
Debugging again, the job now runs:
19/06/27 12:53:12 INFO yarn.YarnAllocator: Will request 2 executor container(s), each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
19/06/27 12:53:12 INFO yarn.YarnAllocator: Submitted 2 unlocalized container requests.
19/06/27 12:53:12 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
19/06/27 12:53:12 INFO impl.AMRMClientImpl: Received new token for : leishu-OptiPlex-7060:39105
19/06/27 12:53:12 INFO yarn.YarnAllocator: Launching container container_1561543784696_0031_01_000002 on host leishu-OptiPlex-7060 for executor with ID 1
19/06/27 12:53:13 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
19/06/27 12:53:13 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/06/27 12:53:13 INFO impl.ContainerManagementProtocolProxy: Opening proxy : leishu-OptiPlex-7060:39105
19/06/27 12:53:13 INFO yarn.YarnAllocator: Launching container container_1561543784696_0031_01_000003 on host leishu-OptiPlex-7060 for executor with ID 2
19/06/27 12:53:13 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
19/06/27 12:53:13 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/06/27 12:53:13 INFO impl.ContainerManagementProtocolProxy: Opening proxy : leishu-OptiPlex-7060:39105
19/06/27 12:53:16 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 0 of them.
19/06/27 12:53:18 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
19/06/27 12:53:18 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.16.209.105:33251
19/06/27 12:53:18 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.16.209.105:33251
19/06/27 12:53:18 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
19/06/27 12:53:18 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
19/06/27 12:53:18 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/06/27 12:53:18 INFO yarn.ApplicationMaster: Deleting staging directory file:/home/.../.sparkStaging/application_1561543784696_0031
19/06/27 12:53:18 INFO util.ShutdownHookManager: Shutdown hook called
Uploading the files to HDFS
For the spark.yarn.archive parameter, I compressed the required jars (all files under /spark-2.4.3-bin-hadoop2.7/jars) into a single zip file and uploaded it to HDFS.
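Building that archive can itself be scripted. Below is a minimal sketch (the Spark home path is the one above; the output file name is an assumption; note that spark.yarn.archive expects the jars at the root of the archive):

import java.io.File;
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.util.Objects;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class SparkJarsZipper {
    public static void main(String[] args) throws Exception {
        File jarsDir = new File("/spark-2.4.3-bin-hadoop2.7/jars");
        try (ZipOutputStream zos = new ZipOutputStream(
                new FileOutputStream("spark-2.4.3-hadoop2.7.7.zip"))) {
            for (File jar : Objects.requireNonNull(jarsDir.listFiles())) {
                // entry names carry no directory prefix, so the jars
                // end up at the root of the zip as Spark requires
                zos.putNextEntry(new ZipEntry(jar.getName()));
                Files.copy(jar.toPath(), zos);
                zos.closeEntry();
            }
        }
    }
}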
The upload to HDFS is implemented with the following code:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SparkJar2Hdfs {

    public static void main(String[] args) throws Exception {
        // local path of the zip to upload, and the target directory on HDFS;
        // getProperty(...) and the constants come from the project's config helper (not shown)
        Path src = new Path(getProperty(SPARK_JARS_ZIP));
        Path dst = new Path(getProperty(HDFS_SPARK_JARS_PATH));
        removeDir(dst);
        if (createDir(dst) && uploadPath(src, dst)) {
            listStatus(dst);
        }
    }

    private static FileSystem getCorSys() {
        Configuration conf = new Configuration();
        try {
            return FileSystem.get(URI.create(getProperty(HDFS_SPARK_ROOT)), conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    // create the target directory if it does not exist yet
    private static boolean createDir(Path path) {
        try (FileSystem coreSys = getCorSys()) {
            if (coreSys.exists(path)) {
                return true;
            } else {
                return coreSys.mkdirs(path);
            }
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    // delete the directory recursively if it exists
    private static boolean removeDir(Path path) {
        try (FileSystem coreSys = getCorSys()) {
            if (coreSys.exists(path)) {
                return coreSys.delete(path, true);
            } else {
                return true;
            }
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    // upload the local file into the HDFS directory
    private static boolean uploadPath(Path srcPath, Path desPath) {
        try (FileSystem coreSys = getCorSys()) {
            if (coreSys.isDirectory(desPath)) {
                coreSys.copyFromLocalFile(srcPath, desPath);
                return true;
            } else {
                throw new IOException("desPath does not exist or is not a directory");
            }
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    // list the files in the target directory
    private static void listStatus(Path desPath) {
        try (FileSystem coreSys = getCorSys()) {
            for (FileStatus file : coreSys.listStatus(desPath)) {
                System.out.println(file.getPath());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Running it prints the uploaded file's URL, which is the value HDFS_SPARK_ARCHIVE should resolve to:
Connected to the target VM, address: '127.0.0.1:39539', transport: 'socket'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
hdfs://localhost:9000/user/.../spark-libs/spark-2.4.3-hadoop2.7.7.zip
Disconnected from the target VM, address: '127.0.0.1:39539', transport: 'socket'
Process finished with exit code 0
maven-shade
For the spark.yarn.dist.jars parameter, the maven-shade-plugin can be used to build a single task jar:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version>
<configuration>
<shadedArtifactAttached>false</shadedArtifactAttached>
<outputFile>${project.build.directory}/shaded/data-platform-task-${project.version}-shaded.jar
</outputFile>
<artifactSet>
<includes>
<include>com.alibaba:druid</include>
<include>com.aliyun:emr-core</include>
<include>com.google.inject:guice</include>
<include>log4j:log4j</include>
<include>org.postgresql:postgresql</include>
<include>org.slf4j:slf4j-api</include>
<include>org.slf4j:slf4j-log4j12</include>
<include>org.projectlombok.lombok</include>
<include>org.springframework:spring-jdbc</include>
</includes>
</artifactSet>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
This way, the project's own classes, together with the third-party dependencies listed in artifactSet, are packaged into a single jar during mvn package; that shaded jar is what the TASK_JARS property (and hence spark.yarn.dist.jars) points to.