Spark Source Code Analysis: SparkSubmit

Spark version: 2.0.0

1. Introduction

For a quick functional test, spark-shell is usually enough; for production projects, however, we package the application first and then deploy it with the spark-submit script, for example:

./bin/spark-submit \
--class com.example.spark.Test \
--master yarn \
--deploy-mode client \
/home/hadoop/data/test.jar

Although spark-submit is used all the time, few people understand how it actually works. Let's take a look at what happens under the hood.

2. The spark-submit script

Looking at the spark-submit script, it is easy to see that it ultimately runs the main method of org.apache.spark.deploy.SparkSubmit, so that is where we start.

  def main(args: Array[String]): Unit = {
    // Parse the user-supplied arguments
    val appArgs = new SparkSubmitArguments(args)
    // Print the parsed arguments when verbose output is requested
    if (appArgs.verbose) {
      // scalastyle:off println
      printStream.println(appArgs)
      // scalastyle:on println
    }
    // submit supports three kinds of actions; SparkSubmitAction.SUBMIT is the
    // main one, the other two are rarely used
    appArgs.action match {
      // Submit the application (the core path)
      case SparkSubmitAction.SUBMIT => submit(appArgs)
      // Only supported for Standalone and Mesos cluster mode
      case SparkSubmitAction.KILL => kill(appArgs)
      // Only supported for Standalone and Mesos cluster mode
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    }
  }
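
To make the parsing step concrete, here is an illustrative sketch, not Spark source (SparkSubmitArguments is private[deploy], so it can only be constructed from code inside that package), of how the flags from section 1 end up as fields on the parsed argument object:

// Illustrative only: the flags from the spark-submit command in section 1,
// fed straight into SparkSubmitArguments (assumes we are inside the
// org.apache.spark.deploy package, because the class is private[deploy]).
val appArgs = new SparkSubmitArguments(Seq(
  "--class", "com.example.spark.Test",
  "--master", "yarn",
  "--deploy-mode", "client",
  "/home/hadoop/data/test.jar"))

println(appArgs.mainClass)   // com.example.spark.Test
println(appArgs.master)      // yarn
println(appArgs.deployMode)  // client
// Neither --kill nor --status was given, so the action defaults to SUBMIT
println(appArgs.action)      // SUBMIT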

The submit method is the entry point for submitting a job. It validates the arguments, adds extra parameters where required, and deals with user arguments, system configuration, compatibility between the supported languages, and execution permissions.

 private def submit(args: SparkSubmitArguments): Unit = {
    // Resolve user arguments, system configuration and cross-language compatibility
    val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

    // Handle execution permissions; both branches eventually call runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    def doRunMain(): Unit = {
      if (args.proxyUser != null) {
        val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
          UserGroupInformation.getCurrentUser())
        try {
          proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
            override def run(): Unit = {
              runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
            }
          })
        } catch {
          case e: Exception =>
            if (e.getStackTrace().length == 0) {
              printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
              exitFn(1)
            } else {
              throw e
            }
        }
      } else {
        runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
      }
    }
    
    // Log which submission gateway is used, and fall back to the legacy gateway if the REST endpoint is unavailable
    if (args.isStandaloneCluster && args.useRest) {
      try {
        printStream.println("Running Spark using the REST application submission protocol.")
        doRunMain()
      } catch {
        case e: SubmitRestConnectionException =>
          printWarning(s"Master endpoint ${args.master} was not a REST server. " +
            "Falling back to legacy submission gateway instead.")
          args.useRest = false
          submit(args)
      }
    } else {
      doRunMain()
    }
  }
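
The proxy-user branch above is nothing Spark-specific: it is the standard Hadoop impersonation pattern. Here is a minimal standalone sketch of the same doAs call (the user name "alice" is a placeholder; the real one comes from --proxy-user):

import java.security.PrivilegedExceptionAction

import org.apache.hadoop.security.UserGroupInformation

// Run a block of code under the identity of a proxy user, exactly the
// pattern doRunMain uses around runMain when --proxy-user is set.
val proxyUser = UserGroupInformation.createProxyUser(
  "alice", UserGroupInformation.getCurrentUser())

proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // Everything in here executes as the proxy user
    println(s"Running as ${UserGroupInformation.getCurrentUser().getShortUserName}")
  }
})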

There is actually not much essential code above beyond the final call to runMain, but prepareSubmitEnvironment deserves a brief introduction. If you have read the source, you will know this method is very long; most of it resolves parameter conflicts and multi-language issues, and only a small part matters here, so we will only skim the interesting bits.

// 1. The returned values
  private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
      : (Seq[String], Seq[String], Map[String, String], String) = {
    // Arguments for the class that will be run (childMainClass)
    val childArgs = new ArrayBuffer[String]()
    // Extra classpath entries to add
    val childClasspath = new ArrayBuffer[String]()
    // System properties derived from the parsed/default arguments (computed here, not yet written into the JVM system properties)
    val sysProps = new HashMap[String, String]()
    // The main class to run; not necessarily the class given with --class (in cluster mode it changes, and the user-specified class is wrapped and run by it)
    var childMainClass = ""
    
// 2. Choosing the main class
    if (deployMode == CLIENT) {
      // Client mode: the main class is exactly the user-specified class
      childMainClass = args.mainClass
    ....
    
    // yarn cluster mode: org.apache.spark.deploy.yarn.Client wraps the user-specified main class
    if (isYarnCluster) {
      childMainClass = "org.apache.spark.deploy.yarn.Client"
      if (args.isPython) {
        childArgs += ("--primary-py-file", args.primaryResource)
        childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
      } else if (args.isR) {
        val mainFile = new Path(args.primaryResource).getName
        childArgs += ("--primary-r-file", mainFile)
        childArgs += ("--class", "org.apache.spark.deploy.RRunner")
      } else {
        if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
          childArgs += ("--jar", args.primaryResource)
        }
        childArgs += ("--class", args.mainClass)
      }
      if (args.childArgs != null) {
        args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
      }
    }
    ......
    // The other cluster modes also change the main class that is run; see the source for details
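
To make this concrete, here is an assumed example (the values are illustrative, not taken from a real run) of what the method would return if the jar from section 1 were submitted with --deploy-mode cluster on YARN:

// Hypothetical return values of prepareSubmitEnvironment for
//   --master yarn --deploy-mode cluster --class com.example.spark.Test \
//   /home/hadoop/data/test.jar arg1
// The user class is no longer run directly: it becomes an argument of
// org.apache.spark.deploy.yarn.Client, which submits it to YARN.
val childMainClass = "org.apache.spark.deploy.yarn.Client"
val childArgs = Seq(
  "--jar", "/home/hadoop/data/test.jar",
  "--class", "com.example.spark.Test",
  "--arg", "arg1")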


Having covered how the arguments are parsed and supplemented, let's go straight back to the runMain method.

private def runMain(
      childArgs: Seq[String],
      childClasspath: Seq[String],
      sysProps: Map[String, String],
      childMainClass: String,
      verbose: Boolean): Unit = {
    .....
    // Choose the class loader
    val loader =
      if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
        new ChildFirstURLClassLoader(new Array[URL](0),
          Thread.currentThread.getContextClassLoader)
      } else {
        new MutableURLClassLoader(new Array[URL](0),
          Thread.currentThread.getContextClassLoader)
      }
    Thread.currentThread.setContextClassLoader(loader)

    for (jar <- childClasspath) {
      addJarToClasspath(jar, loader)
    }

    // Update the JVM system properties
    for ((key, value) <- sysProps) {
      System.setProperty(key, value)
    }

    var mainClass: Class[_] = null

    try {
      // Look up the main class
      mainClass = Utils.classForName(childMainClass)
    } catch {
      case e: ClassNotFoundException =>
        e.printStackTrace(printStream)
        if (childMainClass.contains("thriftserver")) {
          printStream.println(s"Failed to load main class $childMainClass.")
          printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
        }
        System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
      case e: NoClassDefFoundError =>
        e.printStackTrace(printStream)
        if (e.getMessage.contains("org/apache/hadoop/hive")) {
          printStream.println(s"Failed to load hive class.")
          printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
        }
        System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    }

    // SPARK-4170
    if (classOf[scala.App].isAssignableFrom(mainClass)) {
      printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
    }
    // Get the main method of the main class and make sure it is static
    val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }

    // Unwrap reflective exceptions thrown while running the main class
    @tailrec
    def findCause(t: Throwable): Throwable = t match {
      case e: UndeclaredThrowableException =>
        if (e.getCause() != null) findCause(e.getCause()) else e
      case e: InvocationTargetException =>
        if (e.getCause() != null) findCause(e.getCause()) else e
      case e: Throwable =>
        e
    }

    try {
      // Invoke the main method of the main class with the prepared arguments
      mainMethod.invoke(null, childArgs.toArray)
    } catch {
      case t: Throwable =>
        findCause(t) match {
          case SparkUserAppException(exitCode) =>
            System.exit(exitCode)

          case t: Throwable =>
            throw t
        }
    }
  }
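
If we strip away the class loader and the error handling, the core of runMain is plain Java reflection. A minimal standalone sketch (reusing the example class name from section 1):

// Minimal sketch of what runMain boils down to: load a class by name and
// reflectively invoke its static main(Array[String]) method.
// "com.example.spark.Test" is just the example class from section 1.
val mainClass  = Class.forName("com.example.spark.Test")
val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
// The first argument is null because main is static; the second is the
// argument array that childArgs is converted into.
mainMethod.invoke(null, Array("arg1", "arg2"))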

To summarize, when a job is submitted, runMain mainly performs the following steps:

  1. Choose the class loader
  2. Add the jars to the classpath
  3. Update the system properties
  4. Run the (possibly rewritten) main class