Spark Source Code Analysis (3): Scheduling Management, Part 1

Spark Scheduling Concepts

  • Task: the smallest unit of processing, operating on the data of a single partition
  • TaskSet: a set of related tasks that have no shuffle dependencies among each other
  • Stage: a scheduling stage; each stage corresponds to exactly one TaskSet
  • Job: a computation consisting of one or more stages, generated by a single RDD action
  • Application: a user-written Spark application, consisting of one or more jobs (the sketch below shows how these concepts nest)
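A minimal sketch of that nesting, using illustrative case classes rather than Spark's actual scheduler types:
case class Task(stageId: Int, partitionId: Int)          // smallest unit: processes one partition
case class TaskSet(stageId: Int, tasks: Seq[Task])       // the tasks of one stage, no shuffle dependencies among them
case class Stage(id: Int, parentStages: Seq[Stage])      // one stage corresponds to one TaskSet; parents are shuffle dependencies
case class Job(id: Int, stages: Seq[Stage])              // produced by a single RDD action
case class Application(name: String, jobs: Seq[Job])     // the user-written program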

Job Execution

The previous chapter covered RDDs; so how do they relate to the Driver? Let's pick up the example from that chapter and follow the story on the Driver side.
val sc = new SparkContext()
val hdfsFile = sc.textFile(args(1))  
val flatMapRdd = hdfsFile.flatMap(s => s.split(" "))  
val filterRdd = flatMapRdd.filter(_.length == 2)  
val mapRdd = filterRdd.map(word => (word, 1))  
val reduce = mapRdd.reduceByKey(_ + _)  
reduce.cache()  
reduce.count()  
This code lives in the main function of a Spark application, so it runs in the Driver.
The first thing that happens is the initialization of SparkContext.
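For completeness, a sketch of the usual driver-program skeleton around that code; the object name and master URL are illustrative, not from the original example:
import org.apache.spark.{SparkConf, SparkContext}

object WordLengthCount {                                  // hypothetical name for the example program
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("WordLengthCount")
      .setMaster("spark://master:7077")                   // standalone mode; master URL is illustrative
    val sc = new SparkContext(conf)                       // the no-arg constructor above does the same, reading spark.* system properties
    // ... the transformations and the count() action from the example ...
    sc.stop()
  }
}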

SparkContext

SparkContext is the bridge into a Spark program. It initializes LiveListenerBus, which listens for Spark application events and handles them accordingly, and it calls SparkEnv.create; the other place SparkEnv.create is invoked is when an Executor is initialized.

SparkEnv 

Scala has no static methods; the companion object plays that role, so create can be called as SparkEnv.create, like a static factory.
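As a minimal illustration of the companion-object pattern itself (plain Scala, nothing Spark-specific):
class Connection private (val url: String)        // constructor is private to the companion

object Connection {                                // the companion object holds the "static" members
  def create(url: String): Connection =            // called as Connection.create(...), like a static factory
    new Connection(url)
}
SparkEnv follows the same shape; here is the relevant excerpt of its create method: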
object SparkEnv extends Logging {
  private val env = new ThreadLocal[SparkEnv] // ThreadLocal gives each thread its own SparkEnv reference
  @volatile private var lastSetSparkEnv : SparkEnv = _ // caches the most recently set SparkEnv; volatile so other threads can see it

  ......
  private[spark] def create(
      conf: SparkConf,
      executorId: String,
      hostname: String,
      port: Int,
      isDriver: Boolean,
      isLocal: Boolean,
      listenerBus: LiveListenerBus = null): SparkEnv = {
    ......
    val (actorSystem, boundPort) = AkkaUtils.createActorSystem("spark", hostname, port, conf = conf,
      securityManager = securityManager)
    ......
    def registerOrLookup(name: String, newActor: => Actor): ActorRef = { // on the Driver, create the actor; otherwise look up a remote reference to it
      if (isDriver) {
        logInfo("Registering " + name)
        actorSystem.actorOf(Props(newActor), name = name)
      } else {
        val driverHost: String = conf.get("spark.driver.host", "localhost")
        val driverPort: Int = conf.getInt("spark.driver.port", 7077)
        Utils.checkHost(driverHost, "Expected hostname")
        val url = s"akka.tcp://spark@$driverHost:$driverPort/user/$name"
        val timeout = AkkaUtils.lookupTimeout(conf)
        logInfo(s"Connecting to $name: $url")
        Await.result(actorSystem.actorSelection(url).resolveOne(timeout), timeout)
      }
    }

    val mapOutputTracker =  if (isDriver) {
      new MapOutputTrackerMaster(conf)
    } else {
      new MapOutputTrackerWorker(conf)
    }

    // Have to assign trackerActor after initialization as MapOutputTrackerActor
    // requires the MapOutputTracker itself
    mapOutputTracker.trackerActor = registerOrLookup(
      "MapOutputTracker",
      new MapOutputTrackerMasterActor(mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))

    val blockManagerMaster = new BlockManagerMaster(registerOrLookup(
      "BlockManagerMaster",
      new BlockManagerMasterActor(isLocal, conf, listenerBus)), conf)

    val blockManager = new BlockManager(executorId, actorSystem, blockManagerMaster,
      serializer, conf, securityManager, mapOutputTracker)

    val connectionManager = blockManager.connectionManager

    val broadcastManager = new BroadcastManager(isDriver, conf, securityManager)

    val cacheManager = new CacheManager(blockManager)

    val shuffleFetcher = instantiateClass[ShuffleFetcher](
      "spark.shuffle.fetcher", "org.apache.spark.BlockStoreShuffleFetcher")

    val httpFileServer = new HttpFileServer(securityManager)
    httpFileServer.initialize()
    conf.set("spark.fileserver.uri",  httpFileServer.serverUri)

    val metricsSystem = if (isDriver) {
      MetricsSystem.createMetricsSystem("driver", conf, securityManager)
    } else {
      MetricsSystem.createMetricsSystem("executor", conf, securityManager)
    }
    metricsSystem.start()
    ......
    new SparkEnv(
      executorId,
      actorSystem,
      serializer,
      closureSerializer,
      cacheManager,
      mapOutputTracker,
      shuffleFetcher,
      broadcastManager,
      blockManager,
      connectionManager,
      securityManager,
      httpFileServer,
      sparkFilesDir,
      metricsSystem,
      conf)
  }
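The two fields at the top of the object imply accessors roughly like the following (a sketch of the pattern, continuing the excerpt above, not a verbatim copy of the source):
  def set(e: SparkEnv) {
    lastSetSparkEnv = e       // remember the most recent env for threads that never called set
    env.set(e)                // and store it in this thread's ThreadLocal slot
  }

  def get: SparkEnv =
    Option(env.get()).getOrElse(lastSetSparkEnv)   // fall back to the last set env if this thread has none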
As the code above shows, SparkEnv creates a number of components, such as blockManager, cacheManager, and mapOutputTracker.
Back in SparkContext, the most important initialization work is creating the TaskScheduler and the DAGScheduler; these two are the heart of Spark.
TaskScheduler: responsible only for task execution, i.e. resource management, task assignment, and reporting of execution status.
DAGScheduler: parses the Spark program, generates stages to form the DAG, and finally splits each stage into tasks that are submitted to the TaskScheduler; it performs only this static analysis.
  private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
  @volatile private[spark] var dagScheduler: DAGScheduler = _
  try {
    dagScheduler = new DAGScheduler(this)
  } catch {
    case e: Exception => throw
      new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
  }

  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
  // constructor
  taskScheduler.start()

TaskSchedulerImpl

  def initialize(backend: SchedulerBackend) {
    this.backend = backend
    // temporarily set rootPool name to empty
    rootPool = new Pool("", schedulingMode, 0, 0)
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
      }
    }
    schedulableBuilder.buildPools()
  }
When a TaskSchedulerImpl is created, its initialize method is called and a SchedulerBackend is created along with it; in standalone mode this is SparkDeploySchedulerBackend, whose main job is to talk to the CoarseGrainedExecutorBackend processes on the workers. initialize also creates a rootPool (root scheduling pool) according to the SchedulingMode the user configured, and then a matching SchedulableBuilder; that builder's buildPools method constructs the full scheduling pool hierarchy on top of rootPool. A single SparkContext may have several runnable TaskSets at the same time, and the scheduling among them is decided by rootPool.
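The schedulingMode matched on above is driven by configuration; a typical way to select it (these are standard Spark settings, the file path is illustrative):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")                                      // default is "FIFO"
  .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")   // pool definitions for FAIR mode; path is illustrative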
Finally TaskSchedulerImpl.start is called, which mainly delegates to SparkDeploySchedulerBackend.start:
  override def start() {
    super.start()//CoarseGrainedSchedulerBackend.start

    // The endpoint for executors to talk to us
    val driverUrl = "akka.tcp://spark@%s:%s/user/%s".format(
      conf.get("spark.driver.host"), conf.get("spark.driver.port"),
      CoarseGrainedSchedulerBackend.ACTOR_NAME)
    val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
    val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath").toSeq.flatMap { cp =>
      cp.split(java.io.File.pathSeparator)
    }
    val libraryPathEntries =
      sc.conf.getOption("spark.executor.extraLibraryPath").toSeq.flatMap { cp =>
        cp.split(java.io.File.pathSeparator)
      }

    val command = Command( // this command goes to the Master, which forwards it to workers to launch CoarseGrainedExecutorBackend processes
      "org.apache.spark.executor.CoarseGrainedExecutorBackend", args, sc.executorEnvs,
      classPathEntries, libraryPathEntries, extraJavaOpts)
    val sparkHome = sc.getSparkHome()
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      sparkHome, sc.ui.appUIAddress, sc.eventLogger.map(_.logDir))

    client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf) // registers the application with the Master
    client.start()
  }
1. CoarseGrainedSchedulerBackend creates the DriverActor.
The DriverActor periodically sends itself ReviveOffers messages (see the sketch after this list). It communicates with the CoarseGrainedExecutorBackend processes on the workers, offers their resources to the TaskScheduler, dispatches tasks to the appropriate CoarseGrainedExecutorBackend for execution, and collects the task results.
2. AppClient creates the ClientActor.
The ClientActor talks to the Master: it registers the Spark application and receives and handles the events the Master sends back. For how the Master allocates resources and launches the Driver and Executors, you can refer to another blog post on that topic, which I found very well written.
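A rough sketch of the "periodically send ReviveOffers to itself" pattern from point 1, using plain Akka with simplified names; the real DriverActor also registers executors and launches tasks:
import scala.concurrent.duration._
import akka.actor.Actor

case object ReviveOffers

class DriverActorSketch(reviveIntervalMs: Long) extends Actor {
  import context.dispatcher                        // execution context for the scheduler

  override def preStart() {
    // keep sending ReviveOffers to ourselves at a fixed interval
    context.system.scheduler.schedule(0.millis, reviveIntervalMs.millis, self, ReviveOffers)
  }

  def receive = {
    case ReviveOffers =>
      // the real backend offers free executor cores to the TaskScheduler here
      // and launches whatever tasks it gets back
  }
}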

DAGScheduler

class DAGScheduler(
    private[scheduler] val sc: SparkContext,
    private[scheduler] val taskScheduler: TaskScheduler,
    listenerBus: LiveListenerBus,
    mapOutputTracker: MapOutputTrackerMaster,
    blockManagerMaster: BlockManagerMaster,
    env: SparkEnv)
DAGScheduler takes the references above because its main job, once it has divided the work into stages, is to hand them to the TaskScheduler; and when tasks finish running it must also register the shuffle output information with MapOutputTrackerMaster.
Internally, DAGScheduler maintains the state of tasks, stages, and jobs, together with the mappings between them, so that on task status updates, cluster status updates, and other events it can keep the job's execution logic correct.
The various mappings:
  private[scheduler] val jobIdToStageIds = new HashMap[Int, HashSet[Int]]
  private[scheduler] val stageIdToJobIds = new HashMap[Int, HashSet[Int]]
  private[scheduler] val stageIdToStage = new HashMap[Int, Stage]
  private[scheduler] val shuffleToMapStage = new HashMap[Int, Stage]
  private[scheduler] val jobIdToActiveJob = new HashMap[Int, ActiveJob]
  private[scheduler] val resultStageToJob = new HashMap[Stage, ActiveJob]
  private[scheduler] val stageToInfos = new HashMap[Stage, StageInfo]

  // stages waiting to run
  private[scheduler] val waitingStages = new HashSet[Stage]

  // stages currently running
  private[scheduler] val runningStages = new HashSet[Stage]

  // stages that failed and are waiting to be resubmitted
  private[scheduler] val failedStages = new HashSet[Stage]

  // pending tasks for each stage
  private[scheduler] val pendingTasks = new HashMap[Stage, HashSet[Task[_]]]
DAGScheduler also creates a DAGSchedulerEventProcessActor, which converts synchronous function calls into asynchronous event handling.
This actor mainly handles events such as job submission, executor management, and task completion.
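A sketch of that synchronous-to-asynchronous conversion, with simplified event types (the real actor handles many more events and carries richer payloads):
import akka.actor.Actor

sealed trait SchedulerEvent
case class JobSubmitted(jobId: Int) extends SchedulerEvent
case class TaskCompleted(stageId: Int, partitionId: Int) extends SchedulerEvent

class EventProcessActorSketch extends Actor {
  def receive = {
    case JobSubmitted(jobId) =>
      // build the stages for this job and submit the ones with no missing parents
    case TaskCompleted(stageId, partitionId) =>
      // update the stage's pending tasks; when a stage finishes, submit its waiting children
  }
}
// A caller such as runJob just fires an event and returns immediately:
//   eventProcessActor ! JobSubmitted(nextJobId)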
In my view, the point of dividing the DAG into stages is to avoid a lot of unnecessary tasks: writing and reading intermediate results and launching and tearing down tasks all have a cost, and within a stage the computation is pipelined, which greatly improves execution efficiency.
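To make the pipelining point concrete with the running example: the narrow transformations before reduceByKey collapse into a single pass over each partition's iterator, so nothing is materialized between them; only reduceByKey introduces a shuffle dependency and hence a stage boundary. A sketch of what one task in that first stage effectively computes:
def runFirstStageTask(partition: Iterator[String]): Iterator[(String, Int)] =
  partition
    .flatMap(_.split(" "))      // hdfsFile.flatMap
    .filter(_.length == 2)      // filterRdd
    .map(word => (word, 1))     // mapRdd; the shuffle for reduceByKey starts after this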

Back to the example: once the SparkContext has been created and the various RDD transformations applied, the final action reduce.count() ends up in SparkContext.runJob.
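For reference, that handoff from the action into the scheduler looks roughly like this, simplified from RDD.count in this Spark version; sc and rdd stand for the SparkContext and the RDD the action is called on:
val partialCounts: Array[Long] = sc.runJob(rdd, (iter: Iterator[_]) => iter.size.toLong)  // one partial count per partition
val total: Long = partialCounts.sum                                                       // summed on the driver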
This post has grown long enough, so the rest of the process is left for the next chapter.


