Spark version: 2.0.0
1. Introduction
For a quick functional test, spark-shell is usually all we need, but for production projects the standard workflow is to package the application first and then deploy it with the spark-submit script:
./bin/spark-submit \
  --class com.example.spark.Test \
  --master yarn \
  --deploy-mode client \
  /home/hadoop/data/test.jar
We run the spark-submit command all the time, yet few of us understand what actually happens underneath. Let's walk through the machinery behind it.
2. The spark-submit script
Reading the spark-submit script itself reveals very little: it essentially just runs exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@", so it ultimately executes the main method of org.apache.spark.deploy.SparkSubmit. That is where we start.
def main(args: Array[String]): Unit = {
  // Parse the command-line arguments
  val appArgs = new SparkSubmitArguments(args)
  // Print the parsed arguments when --verbose is set
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  // Three kinds of actions are supported; SUBMIT is by far the most common,
  // the other two are rarely used
  appArgs.action match {
    // Submit the application (the core path)
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    // Only supported for Standalone and Mesos cluster modes
    case SparkSubmitAction.KILL => kill(appArgs)
    // Only supported for Standalone and Mesos cluster modes
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
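Which of the three actions gets chosen is decided while the arguments are parsed: --kill <submissionId> selects KILL, --status <submissionId> selects REQUEST_STATUS, and everything else defaults to SUBMIT. Below is a simplified, self-contained sketch of that decision, not the real SparkSubmitArguments code, which also validates the arguments:

object ActionSketch {
  object SparkSubmitAction extends Enumeration {
    val SUBMIT, KILL, REQUEST_STATUS = Value
  }

  // Simplified mapping from the command-line flags to an action
  def chooseAction(kill: Option[String], status: Option[String]): SparkSubmitAction.Value =
    (kill, status) match {
      case (Some(_), _) => SparkSubmitAction.KILL            // --kill <submissionId>
      case (_, Some(_)) => SparkSubmitAction.REQUEST_STATUS  // --status <submissionId>
      case _            => SparkSubmitAction.SUBMIT          // default: submit the application
    }
}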
The submit method is the entry point for submitting an application. It validates the arguments, appends extra arguments where needed, and resolves user parameters, system configuration, cross-language compatibility, and execution permissions.
private def submit(args: SparkSubmitArguments): Unit = {
  // Resolve user parameters, system configuration and cross-language compatibility
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
  // Handle execution permissions; both branches ultimately call
  // runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })
      } catch {
        case e: Exception =>
          if (e.getStackTrace().length == 0) {
            printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
            exitFn(1)
          } else {
            throw e
          }
      }
    } else {
      runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    }
  }
  // Log a hint when the REST submission gateway is used
  if (args.isStandaloneCluster && args.useRest) {
    try {
      printStream.println("Running Spark using the REST application submission protocol.")
      doRunMain()
    } catch {
      // If the master is not a REST server, fall back to the legacy gateway and resubmit
      case e: SubmitRestConnectionException =>
        printWarning(s"Master endpoint ${args.master} was not a REST server. " +
          "Falling back to legacy submission gateway instead.")
        args.useRest = false
        submit(args)
    }
  } else {
    doRunMain()
  }
}
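The proxyUser branch above is nothing Spark-specific: it is the standard Hadoop impersonation pattern. A minimal standalone sketch of it follows; the user name "alice" and the body of run() are made-up placeholders, and real impersonation additionally requires the cluster's hadoop.proxyuser.* settings to allow it:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

object ProxyUserSketch {
  def main(args: Array[String]): Unit = {
    // Impersonate "alice" on top of the real, currently logged-in user
    val proxyUser = UserGroupInformation.createProxyUser(
      "alice", UserGroupInformation.getCurrentUser())
    // Everything inside run() executes as "alice"; submit wraps runMain the same way
    proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
      override def run(): Unit = {
        println(s"Running as: ${UserGroupInformation.getCurrentUser().getUserName}")
      }
    })
  }
}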
There is not much substantive code above beyond the final call to runMain, but prepareSubmitEnvironment deserves a short introduction. If you have read the source, you know this method is very long; most of it resolves argument conflicts and multi-language concerns, and only a small part matters here, so we will skim through it.
// 1. The four values being prepared
private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
    : (Seq[String], Seq[String], Map[String, String], String) = {
  // Arguments for the class that will actually be launched (childMainClass)
  val childArgs = new ArrayBuffer[String]()
  // Entries to append to the classpath
  val childClasspath = new ArrayBuffer[String]()
  // System properties derived from the parsed/default arguments
  // (computed values; not yet written into the JVM's system properties)
  val sysProps = new HashMap[String, String]()
  // The main class to launch. It does not necessarily match --class: in cluster mode
  // the user-specified class is wrapped, and the wrapper is launched instead
  var childMainClass = ""
  // 2. Choosing the main class
  if (deployMode == CLIENT) {
    // Client mode: the main class is exactly the user-specified one
    childMainClass = args.mainClass
  ....
  // yarn-cluster mode: org.apache.spark.deploy.yarn.Client wraps the user-specified class
  if (isYarnCluster) {
    childMainClass = "org.apache.spark.deploy.yarn.Client"
    if (args.isPython) {
      childArgs += ("--primary-py-file", args.primaryResource)
      childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
    } else if (args.isR) {
      val mainFile = new Path(args.primaryResource).getName
      childArgs += ("--primary-r-file", mainFile)
      childArgs += ("--class", "org.apache.spark.deploy.RRunner")
    } else {
      if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
        childArgs += ("--jar", args.primaryResource)
      }
      childArgs += ("--class", args.mainClass)
    }
    if (args.childArgs != null) {
      args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
    }
  }
  ......
  // The other cluster modes substitute the main class in the same way; see the source for details
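To make the substitution concrete: if the command from the introduction were submitted with --deploy-mode cluster, the yarn-cluster branch above would produce roughly the following values (hand-written for illustration based on that branch, not actual program output):

// childMainClass is the wrapper, not the user's class
val childMainClass = "org.apache.spark.deploy.yarn.Client"
// the user's jar and class are handed to the wrapper as arguments
val childArgs = Seq(
  "--jar",   "/home/hadoop/data/test.jar",
  "--class", "com.example.spark.Test")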
Having covered how the arguments are parsed and augmented, let's return to the runMain method.
private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sysProps: Map[String, String],
    childMainClass: String,
    verbose: Boolean): Unit = {
  .....
  // Choose the class loader
  val loader =
    if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
      new ChildFirstURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    } else {
      new MutableURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    }
  Thread.currentThread.setContextClassLoader(loader)
  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }
  // Write the computed configuration into the JVM's system properties
  for ((key, value) <- sysProps) {
    System.setProperty(key, value)
  }
  var mainClass: Class[_] = null
  try {
    // Look up the main class
    mainClass = Utils.classForName(childMainClass)
  } catch {
    case e: ClassNotFoundException =>
      e.printStackTrace(printStream)
      if (childMainClass.contains("thriftserver")) {
        printStream.println(s"Failed to load main class $childMainClass.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    case e: NoClassDefFoundError =>
      e.printStackTrace(printStream)
      if (e.getMessage.contains("org/apache/hadoop/hive")) {
        printStream.println(s"Failed to load hive class.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
  }
  // SPARK-4170
  if (classOf[scala.App].isAssignableFrom(mainClass)) {
    printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
  }
  // Get the main method of the main class and check that it is static
  val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    throw new IllegalStateException("The main method in the given main class must be static")
  }
  // Unwrap reflection wrappers to find the real cause of a failure
  @tailrec
  def findCause(t: Throwable): Throwable = t match {
    case e: UndeclaredThrowableException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: InvocationTargetException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: Throwable =>
      e
  }
  try {
    // Invoke the main method with the prepared arguments
    mainMethod.invoke(null, childArgs.toArray)
  } catch {
    case t: Throwable =>
      findCause(t) match {
        case SparkUserAppException(exitCode) =>
          System.exit(exitCode)
        case t: Throwable =>
          throw t
      }
  }
}
When it runs, runMain mainly performs the following steps (see the sketch after this list):
- choose the class loader
- append entries to the classpath
- update the system properties
- run the (possibly substituted) main class
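Stripped of Spark's error handling, those steps boil down to classloading plus reflection. Here is a minimal, self-contained sketch of the same flow; the jar path and class name reuse the example from the introduction and are placeholders:

import java.lang.reflect.{InvocationTargetException, Modifier}
import java.net.{URL, URLClassLoader}

object MiniRunMain {
  def main(args: Array[String]): Unit = {
    // 1. Class loader: a plain URLClassLoader standing in for Spark's
    //    MutableURLClassLoader / ChildFirstURLClassLoader
    val loader = new URLClassLoader(
      Array(new URL("file:/home/hadoop/data/test.jar")),
      Thread.currentThread.getContextClassLoader)
    Thread.currentThread.setContextClassLoader(loader)

    // 2. Configuration: write the derived settings into system properties
    System.setProperty("spark.master", "yarn")

    // 3. Find the main class and its static main(Array[String]) method
    val mainClass = Class.forName("com.example.spark.Test", true, loader)
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers), "main must be static")

    // 4. Invoke it. Reflection wraps any exception thrown by the invoked code
    //    in an InvocationTargetException, which is why runMain needs findCause
    try mainMethod.invoke(null, Array.empty[String])
    catch { case e: InvocationTargetException => throw e.getCause }
  }
}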