Spark Source Code Reading (4): The Startup Flow of Master and Worker

The launch script ultimately invokes Master's main method, so let's start from Master.main:

private[spark] object Master extends Logging {
  val systemName = "sparkMaster"
  private val actorName = "Master"

  def main(argStrings: Array[String]) {
    SignalLogger.register(log)
    val conf = new SparkConf
    val args = new MasterArguments(argStrings, conf)
    val (actorSystem, _, _, _) = startSystemAndActor(args.host, args.port, args.webUiPort, conf)
    actorSystem.awaitTermination()
  }
}

This is Master's companion object. A companion object has the same name as its companion class and lives in the same source file; it is a singleton, so its members behave like static members, and the class and object can access each other's private members.
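As a quick refresher, here is a minimal standalone sketch (not Spark code) of that companion relationship; the Counter names are made up for illustration:

class Counter private (private val n: Int) {          // private constructor
  def show(): Unit = println(s"count = $n, instances created = ${Counter.created}")
}

object Counter {
  private var created = 0
  def apply(n: Int): Counter = {   // lets callers write Counter(3) without `new`
    created += 1
    new Counter(n)                 // the object may use the class's private constructor
  }
}

object CompanionDemo extends App {
  Counter(3).show()   // prints: count = 3, instances created = 1
}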

We will leave the SparkConf object aside for now.

Next, main creates a MasterArguments object. As the name suggests, it encapsulates the Master's startup parameters; it is constructed from the arguments passed in by the launch script plus the SparkConf object.

Stepping into MasterArguments:


private[spark] class MasterArguments(args: Array[String], conf: SparkConf) {
  // Set some default values
  var host = Utils.localHostName()
  var port = 7077
  var webUiPort = 8080
  var propertiesFile: String = null

  // Check for settings in environment variables
  // If the corresponding environment variables are already set, their values override the defaults
  if (System.getenv("SPARK_MASTER_HOST") != null) {
    host = System.getenv("SPARK_MASTER_HOST")
  }
  if (System.getenv("SPARK_MASTER_PORT") != null) {
    port = System.getenv("SPARK_MASTER_PORT").toInt
  }
  if (System.getenv("SPARK_MASTER_WEBUI_PORT") != null) {
    webUiPort = System.getenv("SPARK_MASTER_WEBUI_PORT").toInt
  }

  // Invoke the parse method defined below
  parse(args.toList)

  // This mutates the SparkConf, so all accesses to it must be made after this line
  propertiesFile = Utils.loadDefaultSparkProperties(conf, propertiesFile)

  if (conf.contains("spark.master.ui.port")) {
    webUiPort = conf.get("spark.master.ui.port").toInt
  }

  // Parse the arguments by pattern matching on the list; `tail` captures whatever remains
  // after the matched prefix, and parse recurses on it.
  // args has a shape like List("--ip", "192.168.0.1", "--host", "mini1")
  def parse(args: List[String]): Unit = args match {
    case ("--ip" | "-i") :: value :: tail =>
      Utils.checkHost(value, "ip no longer supported, please use hostname " + value)
      host = value
      parse(tail)

    case ("--host" | "-h") :: value :: tail =>
      Utils.checkHost(value, "Please use hostname " + value)
      host = value
      parse(tail)

    case ("--port" | "-p") :: IntParam(value) :: tail =>
      port = value
      parse(tail)

    case "--webui-port" :: IntParam(value) :: tail =>
      webUiPort = value
      parse(tail)

    case ("--properties-file") :: value :: tail =>
      propertiesFile = value
      parse(tail)

    case ("--help") :: tail =>
      printUsageAndExit(0)

    case Nil => {}

    case _ =>
      printUsageAndExit(1)
  }

  /**
   * Print usage and exit JVM with the given exit code.
   */
  def printUsageAndExit(exitCode: Int) {
    System.err.println(
      "Usage: Master [options]\n" +
      "\n" +
      "Options:\n" +
      "  -i HOST, --ip HOST     Hostname to listen on (deprecated, please use --host or -h) \n" +
      "  -h HOST, --host HOST   Hostname to listen on\n" +
      "  -p PORT, --port PORT   Port to listen on (default: 7077)\n" +
      "  --webui-port PORT      Port for web UI (default: 8080)\n" +
      "  --properties-file FILE Path to a custom Spark properties file.\n" +
      "                         Default is conf/spark-defaults.conf.")
    System.exit(exitCode)
  }
}

The pattern match above references IntParam without `new`. At first glance that looks like a call to an apply method, but let's check:

private[spark] object IntParam {
  def unapply(str: String): Option[Int] = {
    try {
      Some(str.toInt)
    } catch {
      case e: NumberFormatException => None
    }
  }
}

What actually gets called is the unapply method of the singleton object IntParam: because IntParam appears inside a pattern rather than in an ordinary expression, the compiler calls unapply. That is the difference from apply: unapply is the extractor method used for pattern matching.
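Here is a minimal standalone sketch (not Spark code) of the same idea; PortParam is a made-up extractor with the same shape as IntParam:

object PortParam {
  def unapply(str: String): Option[Int] =
    try Some(str.toInt) catch { case _: NumberFormatException => None }
}

object ExtractorDemo extends App {
  List("--port", "7077") match {
    // The compiler rewrites the pattern PortParam(p) into a call to PortParam.unapply("7077")
    case "--port" :: PortParam(p) :: Nil => println(s"port = $p")
    case _                               => println("no match")
  }
}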

Taken together, the MasterArguments code shows exactly how the Master's configuration values (host, port, webUiPort, and so on) are assigned: defaults first, then environment variables, then command-line arguments, and finally the default properties file (with spark.master.ui.port, if present, overriding the web UI port). Back in Master.main:


  def main(argStrings: Array[String]) {
    SignalLogger.register(log)
    val conf = new SparkConf
    val args = new MasterArguments(argStrings, conf)
    val (actorSystem, _, _, _) = startSystemAndActor(args.host, args.port, args.webUiPort, conf)
    actorSystem.awaitTermination()
  }

Once MasterArguments has packaged up the parameters, main calls startSystemAndActor.

This method takes the host, port, webUiPort and SparkConf prepared in the previous step, and returns a four-element tuple: the ActorSystem, its bound port, the web UI port, and an optional REST server port.

  def startSystemAndActor(
      host: String,
      port: Int,
      webUiPort: Int,
      conf: SparkConf): (ActorSystem, Int, Int, Option[Int]) = {
    val securityMgr = new SecurityManager(conf)
    // As the method name suggests, this creates an ActorSystem
    val (actorSystem, boundPort) = AkkaUtils.createActorSystem(systemName, host, port, conf = conf,
      securityManager = securityMgr)
    val actor = actorSystem.actorOf(
      Props(classOf[Master], host, boundPort, webUiPort, securityMgr, conf), actorName)
    val timeout = AkkaUtils.askTimeout(conf)
    val portsRequest = actor.ask(BoundPortsRequest)(timeout)
    val portsResponse = Await.result(portsRequest, timeout).asInstanceOf[BoundPortsResponse]
    (actorSystem, boundPort, portsResponse.webUIPort, portsResponse.restPort)
  }

The code above calls AkkaUtils.createActorSystem to create an ActorSystem. Roughly speaking, an ActorSystem plays the role of a connection/thread pool: it owns the dispatcher threads and is the factory used to create actors.
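For orientation, here is a minimal, self-contained sketch (not Spark code) of the same classic-Akka pattern used in startSystemAndActor: create an ActorSystem, create an actor with actorOf, then ask it a question and wait for the reply with a timeout. ToyMaster, PortsRequest and PortsResponse are made-up stand-ins:

import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

case object PortsRequest                       // stand-in for BoundPortsRequest
case class PortsResponse(webUIPort: Int)       // stand-in for BoundPortsResponse

class ToyMaster extends Actor {
  def receive = {
    case PortsRequest => sender() ! PortsResponse(8080)
  }
}

object AskDemo extends App {
  val system = ActorSystem("sparkMasterDemo")              // the "connection/thread pool"
  val actor  = system.actorOf(Props[ToyMaster], "Master")  // creates the actor (its preStart runs here)
  implicit val timeout: Timeout = Timeout(5.seconds)
  val reply = Await.result(actor ? PortsRequest, timeout.duration).asInstanceOf[PortsResponse]
  println(s"web UI port: ${reply.webUIPort}")
  system.shutdown()                                        // Akka 2.3-era API; newer Akka uses terminate()
}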

Let's see how the ActorSystem gets created:


  def createActorSystem(
      name: String,
      host: String,
      port: Int,
      conf: SparkConf,
      securityManager: SecurityManager): (ActorSystem, Int) = {
    val startService: Int => (ActorSystem, Int) = { actualPort =>
      doCreateActorSystem(name, host, actualPort, conf, securityManager)
    }
    // Pass startService in as the function argument
    Utils.startServiceOnPort(port, startService, conf, name)
  }

This method defines a function value named startService that takes a port (Int) and returns an (ActorSystem, Int) pair, then hands that function to Utils.startServiceOnPort:


  def startServiceOnPort[T](
      startPort: Int,
      startService: Int => (T, Int),
      conf: SparkConf,
      serviceName: String = ""): (T, Int) = {
   
    // Validate startPort
    require(startPort == 0 || (1024 <= startPort && startPort < 65536),
      "startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port.")

    val serviceString = if (serviceName.isEmpty) "" else s" '$serviceName'"
    val maxRetries = portMaxRetries(conf)
    for (offset <- 0 to maxRetries) {
      // Do not increment port if startPort is 0, which is treated as a special port
      val tryPort = if (startPort == 0) {
        startPort
      } else {
        // If the new port wraps around, do not try a privilege port
        ((startPort + offset - 1024) % (65536 - 1024)) + 1024
      }
      try {
        // Invoke the function that was passed in as a parameter
        val (service, port) = startService(tryPort)
        logInfo(s"Successfully started service$serviceString on port $port.")
        return (service, port)
      } catch {
        case e: Exception if isBindCollision(e) =>
          if (offset >= maxRetries) {
            val exceptionMessage =
              s"${e.getMessage}: Service$serviceString failed after $maxRetries retries!"
            val exception = new BindException(exceptionMessage)
            // restore original stack trace
            exception.setStackTrace(e.getStackTrace)
            throw exception
          }
          logWarning(s"Service$serviceString could not bind on port $tryPort. " +
            s"Attempting port ${tryPort + 1}.")
      }
    }
    // Should never happen
    throw new SparkException(s"Failed to start service$serviceString on port $startPort")
  }

The key line in the code above is:

        val (service, port) = startService(tryPort)

that is, it invokes the function that was passed in as a parameter. Going back to createActorSystem, we can see how that function was defined:


    val startService: Int => (ActorSystem, Int) = { actualPort =>
      doCreateActorSystem(name, host, actualPort, conf, securityManager)
    }
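The same pattern can be shown on its own in a few lines; this is a hedged sketch (not Spark code) where the startOnPort names are made up and the "service" is just a ServerSocket: a start function is passed in as a parameter and retried on successive ports until it binds.

import java.net.{BindException, ServerSocket}

object PortRetryDemo extends App {

  // Retry `start` on successive ports, the way Utils.startServiceOnPort retries startService.
  def startOnPort[T](startPort: Int, maxRetries: Int)(start: Int => (T, Int)): (T, Int) = {
    for (offset <- 0 to maxRetries) {
      val tryPort = startPort + offset
      try {
        return start(tryPort)                 // invoke the function that was passed in
      } catch {
        case e: BindException =>
          if (offset >= maxRetries) throw e   // out of retries: give up
          println(s"port $tryPort in use, trying ${tryPort + 1}")
      }
    }
    throw new IllegalStateException("unreachable")
  }

  val (socket, boundPort) = startOnPort(8080, 3) { p => (new ServerSocket(p), p) }
  println(s"bound on port $boundPort")
  socket.close()
}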

The implementation of doCreateActorSystem:

private def doCreateActorSystem(
      name: String,
      host: String,
      port: Int,
      conf: SparkConf,
      securityManager: SecurityManager): (ActorSystem, Int) = {

    val akkaThreads   = conf.getInt("spark.akka.threads", 4)
    val akkaBatchSize = conf.getInt("spark.akka.batchSize", 15)
    val akkaTimeout = conf.getInt("spark.akka.timeout", conf.getInt("spark.network.timeout", 120))
    val akkaFrameSize = maxFrameSizeBytes(conf)
    val akkaLogLifecycleEvents = conf.getBoolean("spark.akka.logLifecycleEvents", false)
    val lifecycleEvents = if (akkaLogLifecycleEvents) "on" else "off"
    if (!akkaLogLifecycleEvents) {
      // As a workaround for Akka issue #3787, we coerce the "EndpointWriter" log to be silent.
      // See: https://www.assembla.com/spaces/akka/tickets/3787#/
      Option(Logger.getLogger("akka.remote.EndpointWriter")).map(l => l.setLevel(Level.FATAL))
    }

    val logAkkaConfig = if (conf.getBoolean("spark.akka.logAkkaConfig", false)) "on" else "off"

    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
    val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)

    val secretKey = securityManager.getSecretKey()
    val isAuthOn = securityManager.isAuthenticationEnabled()
    if (isAuthOn && secretKey == null) {
      throw new Exception("Secret key is null with authentication on")
    }
    val requireCookie = if (isAuthOn) "on" else "off"
    val secureCookie = if (isAuthOn) secretKey else ""
    logDebug(s"In createActorSystem, requireCookie is: $requireCookie")

    val akkaSslConfig = securityManager.akkaSSLOptions.createAkkaConfig
        .getOrElse(ConfigFactory.empty())

    val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
      .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
      s"""
      |akka.daemonic = on
      |akka.loggers = [""akka.event.slf4j.Slf4jLogger""]
      |akka.stdout-loglevel = "ERROR"
      |akka.jvm-exit-on-fatal-error = off
      |akka.remote.require-cookie = "$requireCookie"
      |akka.remote.secure-cookie = "$secureCookie"
      |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
      |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
      |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
      |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
      |akka.remote.netty.tcp.hostname = "$host"
      |akka.remote.netty.tcp.port = $port
      |akka.remote.netty.tcp.tcp-nodelay = on
      |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
      |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
      |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
      |akka.actor.default-dispatcher.throughput = $akkaBatchSize
      |akka.log-config-on-start = $logAkkaConfig
      |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
      |akka.log-dead-letters = $lifecycleEvents
      |akka.log-dead-letters-during-shutdown = $lifecycleEvents
      """.stripMargin))

    val actorSystem = ActorSystem(name, akkaConf)
    val provider = actorSystem.asInstanceOf[ExtendedActorSystem].provider
    val boundPort = provider.getDefaultAddress.port.get
    (actorSystem, boundPort)
  }

Like the code before it, the first part of this method gathers configuration values, falling back to defaults where nothing is set, and assembles them into an Akka Config. The final line, val actorSystem = ActorSystem(name, akkaConf), creates the ActorSystem; the port it actually bound to is then read back from the system's provider and returned alongside it.

That is how the ActorSystem gets created.
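A side note on how that akkaConf is layered: Typesafe Config merges configurations with withFallback, and the config on the left wins. A minimal sketch with made-up values (not Spark code):

import com.typesafe.config.ConfigFactory

object ConfigLayeringDemo extends App {
  // Settings pulled from SparkConf (conf.getAkkaConf) sit in the top layer...
  val fromSparkConf = ConfigFactory.parseString("akka.remote.netty.tcp.port = 7077")

  // ...and the string built inside doCreateActorSystem acts as the fallback layer.
  val generated = ConfigFactory.parseString(
    """
    |akka.daemonic = on
    |akka.remote.netty.tcp.port = 0
    """.stripMargin)

  val akkaConf = fromSparkConf.withFallback(generated)    // left-hand side takes precedence

  println(akkaConf.getInt("akka.remote.netty.tcp.port"))  // 7077
  println(akkaConf.getBoolean("akka.daemonic"))           // true ("on" is a HOCON boolean)
}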

With the ActorSystem in place, the next step is to create an Actor; as part of the actor's lifecycle, Akka then runs its preStart method:

val actor = actorSystem.actorOf(
  Props(classOf[Master], host, boundPort, webUiPort, securityMgr, conf), actorName)

Let's look at the preStart method in Master:

override def preStart() {
  logInfo("Starting Spark master at " + masterUrl)
  logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
  // Listen for remote client disconnection events, since they don't go through Akka's watch()
  context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
  webUi.bind()
  masterWebUiUrl = "http://" + masterPublicAddress + ":" + webUi.boundPort
  context.system.scheduler.schedule(0 millis, WORKER_TIMEOUT millis, self, CheckForWorkerTimeOut)

  masterMetricsSystem.registerSource(masterSource)
  masterMetricsSystem.start()
  applicationMetricsSystem.start()
  // Attach the master and app metrics servlet handler to the web ui after the metrics systems are
  // started.
  masterMetricsSystem.getServletHandlers.foreach(webUi.attachHandler)
  applicationMetricsSystem.getServletHandlers.foreach(webUi.attachHandler)

  ....
  persistenceEngine = persistenceEngine_
  leaderElectionAgent = leaderElectionAgent_
}
Notice that the code above sets up a scheduler: from the moment the Master starts, it sends itself a CheckForWorkerTimeOut message every WORKER_TIMEOUT milliseconds.
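As a hedged illustration of that scheduling pattern (not Spark code; TimeoutChecker and CheckTimeouts are made-up names), an actor can arrange in preStart for a message to be delivered to itself periodically:

import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.duration._

case object CheckTimeouts   // stand-in for CheckForWorkerTimeOut

class TimeoutChecker extends Actor {
  import context.dispatcher   // the ExecutionContext the scheduler needs

  override def preStart(): Unit = {
    // Fire immediately, then once per interval (shortened to 5 s for the demo)
    context.system.scheduler.schedule(0.millis, 5.seconds, self, CheckTimeouts)
  }

  def receive = {
    case CheckTimeouts => println("checking for timed-out workers...")  // the Master calls timeOutDeadWorkers() here
  }
}

object SchedulerDemo extends App {
  val system = ActorSystem("demo")
  system.actorOf(Props[TimeoutChecker], "checker")
}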
On the receiving side, the Master's message handler reacts to CheckForWorkerTimeOut by calling timeOutDeadWorkers(), shown below:
def timeOutDeadWorkers() {
  // Copy the workers into an array so we don't modify the hashset while iterating through it
  val currentTime = System.currentTimeMillis()
  val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT).toArray
  for (worker <- toRemove) {
    if (worker.state != WorkerState.DEAD) {
      logWarning("Removing %s because we got no heartbeat in %d seconds".format(
        worker.id, WORKER_TIMEOUT/1000))
      removeWorker(worker)
    } else {
      if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT)) {
        workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
      }
    }
  }
}
The code above filters out the workers whose last heartbeat is older than WORKER_TIMEOUT milliseconds and removes them; a worker already marked DEAD is kept visible for a while longer and only dropped from the set once it has been silent for (REAPER_ITERATIONS + 1) timeout periods.
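The cull-by-heartbeat logic looks like this in isolation; a tiny sketch with made-up records, assuming the 60-second default of spark.worker.timeout:

object HeartbeatCullDemo extends App {
  case class WorkerRecord(id: String, lastHeartbeat: Long)

  val timeoutMs = 60 * 1000L   // spark.worker.timeout defaults to 60 seconds
  val now = System.currentTimeMillis()

  val workers = Set(
    WorkerRecord("worker-1", now - 5 * 1000),    // heartbeat 5 s ago: healthy
    WorkerRecord("worker-2", now - 90 * 1000))   // silent for 90 s: timed out

  val timedOut = workers.filter(_.lastHeartbeat < now - timeoutMs)
  timedOut.foreach(w =>
    println(s"Removing ${w.id} because we got no heartbeat in ${timeoutMs / 1000} seconds"))
}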



Now for the Worker's startup flow. The part we care about is, again, the actor's preStart method:
override def preStart() {
  assert(!registered)
  logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
    host, port, cores, Utils.megabytesToString(memory)))
  logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
  logInfo("Spark home: " + sparkHome)
  createWorkDir()
  context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
  shuffleService.startIfEnabled()
  webUi = new WorkerWebUI(this, workDir, webUiPort)
  webUi.bind()
  registerWithMaster()

  metricsSystem.registerSource(workerSource)
  metricsSystem.start()
  // Attach the worker metrics servlet handler to the web ui after the metrics system is started.
  metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
}
Let's look at registerWithMaster:
def registerWithMaster() {
  // DisassociatedEvent may be triggered multiple times, so don't attempt registration
  // if there are outstanding registration attempts scheduled.
  registrationRetryTimer match {
    case None =>
      registered = false
      tryRegisterAllMasters()
      connectionAttemptCount = 0
      registrationRetryTimer = Some {
        context.system.scheduler.schedule(INITIAL_REGISTRATION_RETRY_INTERVAL,
          INITIAL_REGISTRATION_RETRY_INTERVAL, self, ReregisterWithMaster)
      }
    case Some(_) =>
      logInfo("Not spawning another attempt to register with the master, since there is an" +
        " attempt scheduled already.")
  }
}
private def tryRegisterAllMasters() {
  for (masterAkkaUrl <- masterAkkaUrls) {
    logInfo("Connecting to master " + masterAkkaUrl + "...")
    val actor = context.actorSelection(masterAkkaUrl)
    actor ! RegisterWorker(workerId, host, port, cores, memory, webUi.boundPort, publicAddress)
  }
}

tryRegisterAllMasters resolves an ActorSelection from each master URL and sends it a RegisterWorker case-class message (fire-and-forget with !). Here is how the Master handles that message:

case RegisterWorker(id, workerHost, workerPort, cores, memory, workerUiPort, publicAddress) =>
{
  logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
    workerHost, workerPort, cores, Utils.megabytesToString(memory)))
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else if (idToWorker.contains(id)) {
    sender ! RegisterWorkerFailed("Duplicate worker ID")
  } else {
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
      sender, workerUiPort, publicAddress)
    if (registerWorker(worker)) {
      persistenceEngine.addWorker(worker)
      sender ! RegisteredWorker(masterUrl, masterWebUiUrl)
      schedule()
    } else {
      val workerAddress = worker.actor.path.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " +
        "address: " + workerAddress)
      sender ! RegisterWorkerFailed("Attempted to re-register worker at same address: "
        + workerAddress)
    }
  }
}

The code above checks whether a worker with this ID (or one at the same address) is already registered. If not, the Master records a WorkerInfo, persists it, replies with RegisteredWorker and calls schedule(); otherwise it replies with RegisterWorkerFailed.
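To make the shape of that handshake concrete, here is a hedged, self-contained sketch (not Spark code; RegistryMaster, RegisteringWorker and the simplified messages are made up) of the same round trip: the worker fires a case-class message at the master, and the master replies with success or failure:

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case class RegisterWorker(id: String)
case class RegisteredWorker(masterUrl: String)
case class RegisterWorkerFailed(reason: String)

class RegistryMaster extends Actor {
  private var idToWorker = Map.empty[String, ActorRef]
  def receive = {
    case RegisterWorker(id) if idToWorker.contains(id) =>
      sender() ! RegisterWorkerFailed("Duplicate worker ID")
    case RegisterWorker(id) =>
      idToWorker += id -> sender()
      sender() ! RegisteredWorker("spark://master:7077")
  }
}

class RegisteringWorker(masterPath: String) extends Actor {
  override def preStart(): Unit =
    context.actorSelection(masterPath) ! RegisterWorker("worker-1")   // fire-and-forget, like tryRegisterAllMasters
  def receive = {
    case RegisteredWorker(url)        => println(s"registered with $url")
    case RegisterWorkerFailed(reason) => println(s"registration failed: $reason")
  }
}

object HandshakeDemo extends App {
  val system = ActorSystem("demo")
  system.actorOf(Props[RegistryMaster], "Master")
  system.actorOf(Props(classOf[RegisteringWorker], "/user/Master"), "Worker")
}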



