Big Data: Spark Standalone Cluster Scheduling (1): Starting from Remote Debugging to Explain Application Creation

Remote debugging, especially against a cluster, is a very convenient way to understand how the code actually runs, and it is the approach most developers prefer.

Although Scala's syntax differs from Java's, Scala runs on the JVM: Scala code is ultimately compiled to bytecode and executed by the JVM, so remote debugging of Scala is simply ordinary JVM remote debugging.

On the server side, add the following JVM options:

-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=7001,suspend=y 

The client can then attach over a socket and debug the code remotely.
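
To attach, point your IDE's remote-debug configuration at the server's host and port, or use jdb from the command line, for example (assuming the JVM above runs on the host raintungmaster):

jdb -attach raintungmaster:7001

On JDK 5 and later, the equivalent modern form of the server-side options is -agentlib:jdwp=transport=dt_socket,server=y,address=7001,suspend=y.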

1. Debugging the submit, master, and worker code

1.1 Debugging Submit

The client runs the submit step; I won't describe that in detail here. A Spark job is usually submitted with

spark-submit

which launches a Spark application.

Under the hood this is essentially equivalent to a command like the following:

/usr/java/jdk1.8.0_111/bin/java -cp /work/spark-2.1.0-bin-hadoop2.7/conf/:/work/spark-2.1.0-bin-hadoop2.7/jars/* -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=7000,suspend=y -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://raintungmaster:7077 --class rfcexample --jars /work/spark-2.1.0-bin-hadoop2.7/examples/jars/scopt_2.11-3.3.0.jar,/work/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar /tmp/machinelearning.jar

It invokes the org.apache.spark.deploy.SparkSubmit class to submit the job, so to debug it you simply append the debug JVM options to this command, as done above with -Xdebug -Xrunjdwp on port 7000.

1.2 Debug settings for the master and worker

export SPARK_WORKER_OPTS="-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=8000,suspend=n"
export SPARK_MASTER_OPTS="-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=8001,suspend=n"
Just set these environment variables (typically in conf/spark-env.sh) before starting the master and worker.

2. Debugging the executor code

After setting the worker's debug options, however, I still could not debug the code running inside the Spark executor. Since the executor runs on the worker, remote debugging should of course be possible, so why can't the executor be debugged this way?

3. Spark Standalone cluster scheduling

Since the executor cannot be debugged like this, we need to work out the scheduling relationship between submit, master, and worker.

3.1 Submit: submitting the job

As described above, submit actually bootstraps the SparkSubmit class, and SparkSubmit's main method calls runMain:

try {
      mainMethod.invoke(null, childArgs.toArray)
    } catch {
      case t: Throwable =>
        findCause(t) match {
          case SparkUserAppException(exitCode) =>
            System.exit(exitCode)

          case t: Throwable =>
            throw t
        }
    }

The key step is invoking the main method of the class we submitted, specified in the example above by the argument
--class rfcexample
so the main method of rfcexample is called.

When writing the class that a Spark job runs, we usually start by initializing the Spark context:
val sc = new SparkContext(conf)
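
For reference, a minimal driver class might look like the sketch below; the object name rfcexample and the job body are placeholders standing in for whatever was passed via --class:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the class passed to --class. Any Spark driver
// follows the same pattern: build a SparkConf, create the SparkContext
// (which starts the scheduler), run some jobs, then stop the context.
object rfcexample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rfcexample")
    val sc = new SparkContext(conf)   // the scheduler described below is created here
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println(s"even numbers: $evens")
    sc.stop()
  }
}
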
When the SparkContext is initialized, it creates and starts the task scheduler:
// Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()

In standalone mode this eventually calls the start method of StandaloneSchedulerBackend.scala:

    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    client.start()
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
    waitForRegistration()
    launcherBackend.setState(SparkAppHandle.State.RUNNING)

It builds an ApplicationDescription and starts a StandaloneAppClient, which connects to the master.

3.2 Master: assigning the application

The submit side created a client, built an application description, and registered the application with the master. The master's dispatcher receives the RegisterApplication message:
case RegisterApplication(description, driver) =>
      // TODO Prevent repeated registrations from some driver
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
      } else {
        logInfo("Registering app " + description.name)
        val app = createApplication(description, driver)
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        persistenceEngine.addApplication(app)
        driver.send(RegisteredApplication(app.id, self))
        schedule()
      }

The master creates a new application ID and registers the application. An application is bound to a single client endpoint: the same client ip:port can register only one application. Then, in schedule(), the master computes the application's memory and core requirements and assigns executors to the eligible workers (a simplified sketch of this allocation idea follows the launchExecutor snippet below):
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
    worker.addExecutor(exec)
    worker.endpoint.send(LaunchExecutor(masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
    exec.application.driver.send(
      ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
  }

launchExecutor sends a serialized LaunchExecutor message to the worker's endpoint.
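
The allocation performed by schedule() boils down to spreading the application's requested cores over workers that still have enough free cores and memory. The snippet below is a self-contained simplification of that idea, not the actual Master code; the names FreeWorker and assignExecutors are made up for illustration:

// Simplified sketch of the executor-allocation idea behind Master.schedule().
// NOT the real Spark code: it only shows how requested cores are spread across
// workers that still have enough free cores and memory.
case class FreeWorker(id: String, var freeCores: Int, var freeMemoryMb: Int)

def assignExecutors(workers: Seq[FreeWorker],
                    coresWanted: Int,
                    coresPerExecutor: Int,
                    memoryPerExecutorMb: Int): Map[String, Int] = {
  val assigned = scala.collection.mutable.Map[String, Int]().withDefaultValue(0)
  var remaining = coresWanted
  var progress = true
  // Round-robin over the workers so executors are spread out across the cluster.
  while (remaining >= coresPerExecutor && progress) {
    progress = false
    for (w <- workers if remaining >= coresPerExecutor &&
                         w.freeCores >= coresPerExecutor &&
                         w.freeMemoryMb >= memoryPerExecutorMb) {
      w.freeCores -= coresPerExecutor
      w.freeMemoryMb -= memoryPerExecutorMb
      assigned(w.id) += coresPerExecutor
      remaining -= coresPerExecutor
      progress = true
    }
  }
  assigned.toMap
}

// Example: request 6 cores, with 2 cores and 1024 MB per executor, on two workers.
val plan = assignExecutors(
  Seq(FreeWorker("worker-1", 4, 4096), FreeWorker("worker-2", 4, 4096)),
  coresWanted = 6, coresPerExecutor = 2, memoryPerExecutorMb = 1024)
// plan == Map(worker-1 -> 4, worker-2 -> 2)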

3.3 Worker: launching the executor

In Worker.scala the dispatcher receives the LaunchExecutor message:
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
      if (masterUrl != activeMasterUrl) {
        logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
      } else {
        try {
          logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

          // Create the executor's working directory
          val executorDir = new File(workDir, appId + "/" + execId)
          if (!executorDir.mkdirs()) {
            throw new IOException("Failed to create directory " + executorDir)
          }

          // Create local dirs for the executor. These are passed to the executor via the
          // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
          // application finishes.
          val appLocalDirs = appDirectories.getOrElse(appId,
            Utils.getOrCreateLocalRootDirs(conf).map { dir =>
              val appDir = Utils.createDirectory(dir, namePrefix = "executor")
              Utils.chmod700(appDir)
              appDir.getAbsolutePath()
            }.toSeq)
          appDirectories(appId) = appLocalDirs
          val manager = new ExecutorRunner(
            appId,
            execId,
            appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
            cores_,
            memory_,
            self,
            workerId,
            host,
            webUi.boundPort,
            publicAddress,
            sparkHome,
            executorDir,
            workerUri,
            conf,
            appLocalDirs, ExecutorState.RUNNING)
          executors(appId + "/" + execId) = manager
          manager.start()
          coresUsed += cores_
          memoryUsed += memory_
          sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
        } catch {
          case e: Exception =>
            logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
            if (executors.contains(appId + "/" + execId)) {
              executors(appId + "/" + execId).kill()
              executors -= appId + "/" + execId
            }
            sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
              Some(e.toString), None))
        }
      }
The worker creates a working directory for the executor and starts an ExecutorRunner:
private[worker] def start() {
    workerThread = new Thread("ExecutorRunner for " + fullId) {
      override def run() { fetchAndRunExecutor() }
    }
    workerThread.start()
    // Shutdown hook that kills actors on shutdown.
    shutdownHook = ShutdownHookManager.addShutdownHook { () =>
      // It's possible that we arrive here before calling `fetchAndRunExecutor`, then `state` will
      // be `ExecutorState.RUNNING`. In this case, we should set `state` to `FAILED`.
      if (state == ExecutorState.RUNNING) {
        state = ExecutorState.FAILED
      }
      killProcess(Some("Worker shutting down")) }
  }

In ExecutorRunner.scala's start method, a thread named "ExecutorRunner for xxx" is started to run the executor. Does that mean the application's code runs inside this thread?
private def fetchAndRunExecutor() {
    try {
      // Launch the process
      val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
        memory, sparkHome.getAbsolutePath, substituteVariables)
      val command = builder.command()
      val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
      .....
      process = builder.start()
      ......
      val exitCode = process.waitFor()
      state = ExecutorState.EXITED
      val message = "Command exited with code " + exitCode
      worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
    } catch {
      ......
    }
  }

Looking at fetchAndRunExecutor, we find builder.start(). The builder is a ProcessBuilder, which means the current thread launches a child process to run the command.

This is why we cannot debug the executor by attaching to the worker: the executor is a separate process.
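
A tiny stand-alone example (not Spark code) illustrates the point: the thread only builds the command and calls start(), and the resulting executor runs in a brand-new OS process with its own JVM, which a debugger attached to the parent never sees. Here "java -version" stands in for the real executor command built by CommandUtils.buildProcessBuilder:

import scala.collection.JavaConverters._

// Minimal illustration of what ExecutorRunner effectively does: launch a
// separate JVM as a child process of the current one.
object ChildProcessDemo {
  def main(args: Array[String]): Unit = {
    val builder = new ProcessBuilder(Seq("java", "-version").asJava)
    builder.inheritIO()                // forward the child's output to ours
    val process = builder.start()      // new OS process, new JVM
    val exitCode = process.waitFor()
    println(s"Command exited with code $exitCode")
  }
}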

4. Debugging the executor process

Following the code path just now, we saw that from the master receiving the RegisterApplication message to it sending the LaunchExecutor message that schedules the worker, the command itself is never modified: the command that the child process eventually runs comes from the command field of the ApplicationDescription. We also know that the ApplicationDescription is created by the submit side in section 3.1, so let's go back to the start method of StandaloneSchedulerBackend.scala:

val driverUrl = RpcEndpointAddress(
      sc.conf.get("spark.driver.host"),
      sc.conf.get("spark.driver.port").toInt,
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
    val args = Seq(
      "--driver-url", driverUrl,
      "--executor-id", "{{EXECUTOR_ID}}",
      "--hostname", "{{HOSTNAME}}",
      "--cores", "{{CORES}}",
      "--app-id", "{{APP_ID}}",
      "--worker-url", "{{WORKER_URL}}")
    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
      .map(Utils.splitCommandString).getOrElse(Seq.empty)
    val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
    val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)

    // When testing, expose the parent class path to the child. This is processed by
    // compute-classpath.{cmd,sh} and makes all needed jars available to child processes
    // when the assembly is built with the "*-provided" profiles enabled.
    val testingClassPath =
      if (sys.props.contains("spark.testing")) {
        sys.props("java.class.path").split(java.io.File.pathSeparator).toSeq
      } else {
        Nil
      }

    // Start executors with a few necessary configs for registering with the scheduler
    val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
    val javaOpts = sparkJavaOpts ++ extraJavaOpts
We can see that the executor's JVM options are controlled by javaOpts, and in particular by
val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
so they come from the spark.executor.extraJavaOptions parameter. Going back to the Spark documentation (a bit late, admittedly):
spark.executor.extraJavaOptions (default: none): A string of extra JVM options to pass to executors, for instance GC settings or other logging. Note that it is illegal to set Spark properties or maximum heap size (-Xmx) settings with this option. Spark properties should be set using a SparkConf object or the spark-defaults.conf file used with the spark-submit script. Maximum heap size settings can be set with spark.executor.memory.

So according to the documentation, we can set JVM options for the executors by passing a --conf flag to spark-submit:
--conf "spark.executor.extraJavaOptions=-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=7001,suspend=y"
The full spark-submit command then becomes:
/usr/java/jdk1.8.0_111/bin/java -cp /work/spark-2.1.0-bin-hadoop2.7/conf/:/work/spark-2.1.0-bin-hadoop2.7/jars/* -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=7000,suspend=y -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://raintungmaster:7077 --class rfcexample --jars /work/spark-2.1.0-bin-hadoop2.7/examples/jars/scopt_2.11-3.3.0.jar,/work/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar --conf "spark.executor.extraJavaOptions=-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=7001,suspend=y" /tmp/machinelearning.jar
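
If you prefer not to pass this on the command line, the documentation quoted above says the same property can also be set in spark-defaults.conf, or programmatically on the SparkConf in the driver before the SparkContext is created. A minimal sketch using the same debug options:

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to the --conf flag above. The property must be set before the
// SparkContext is created, because that is when executors are requested.
val conf = new SparkConf()
  .setAppName("rfcexample")
  .set("spark.executor.extraJavaOptions",
       "-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=7001,suspend=y")
val sc = new SparkContext(conf)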

Note: with this setting a worker cannot launch more than one executor, since only one process on a given machine can listen on the same debug port.
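
If you do need several executors on the same machine, one possible workaround (my own suggestion, not something the documentation above covers) is to let each executor JVM pick its own port by omitting the address and not suspending on startup:

--conf "spark.executor.extraJavaOptions=-Xdebug -Xrunjdwp:server=y,transport=dt_socket,suspend=n"

The JDWP agent then chooses a free port and prints a line like "Listening for transport dt_socket at address: <port>" in that executor's stderr log, and you can attach to that port.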



