Yarn Source Code Analysis (4) -- AM Registration, Requesting Containers Through Resource Scheduling, and Container Launch

Registering the AM with the RM

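1. As shown in Yarn Source Code Analysis (3) -- Launching the ApplicationMaster, submitting an application to Yarn ends with the ApplicationMaster class being started, so we go straight to the main method of that class (this is Spark's own AM implementation). Spark uses the AMRMClient client API both to register the AM and to request resources.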

amClient = AMRMClient.createAMRMClient()  // create a client instance
amClient.init(conf)  // initialize it
amClient.start()  // start the client
this.uiHistoryAddress = uiHistoryAddress

val trackingUrl = uiAddress.getOrElse {
  if (sparkConf.get(ALLOW_HISTORY_SERVER_TRACKING_URL)) uiHistoryAddress else ""
}

logInfo("Registering the ApplicationMaster")
synchronized {
  amClient.registerApplicationMaster(Utils.localHostName(), 0, trackingUrl)  // register the AM
  registered = true
}

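2. Next we follow these APIs into the Hadoop source to see how the AM is registered and started (the Hadoop code is version 2.7.4). Starting with AMRMClient.createAMRMClient() in the class org.apache.hadoop.yarn.client.api.AMRMClient, we can see that it simply creates an instance of the implementation class AMRMClientImpl.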

public static <T extends ContainerRequest> AMRMClient<T> createAMRMClient() {
  AMRMClient<T> client = new AMRMClientImpl<T>();
  return client;
}

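3. Next comes amClient.init(conf). Here it is implemented by AMRMClientImpl, so we look at its serviceInit(), which only initializes the configuration and needs no further analysis. amClient.start() works the same way as init() (the init and start mechanism is described in Yarn Source Code Analysis (1) --- RM and NM Service Startup and Heartbeat Communication) and is also implemented by AMRMClientImpl: it creates a ResourceManager proxy for the given protocol. At this point the AM client service is initialized and started. The next step is to register the AM with the RM, which the Spark code does by calling AMRMClient.registerApplicationMaster().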

synchronized {
  amClient.registerApplicationMaster(Utils.localHostName(), 0, trackingUrl)
  registered = true
}

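4. Now look at AMRMClient.registerApplicationMaster(), which is implemented by AMRMClientImpl. The public three-argument overload stores the host, port, and tracking URL in fields and then calls the private no-argument method below.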

  private RegisterApplicationMasterResponse registerApplicationMaster()
      throws YarnException, IOException {
    // build the registration request
    RegisterApplicationMasterRequest request =
        RegisterApplicationMasterRequest.newInstance(this.appHostName,
            this.appHostPort, this.appTrackingUrl);
    // get the response from the RM
    RegisterApplicationMasterResponse response =
        rmClient.registerApplicationMaster(request);  // the key call
    synchronized (this) {
      lastResponseId = 0;
      if (!response.getNMTokensFromPreviousAttempts().isEmpty()) {
        populateNMTokens(response.getNMTokensFromPreviousAttempts());
      }
    }
    return response;
  }

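5. The key call here is clearly rmClient.registerApplicationMaster(request). Following it shows that the method is implemented by the server side of the AM protocol (inside the RM), and that is where the registration really happens. The snippet below is the part that does the work. It drives the Yarn state machine by dispatching events that trigger the corresponding handlers. That topic could fill a long article of its own, so interested readers can read the source or look up further material.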

LOG.info("AM registration " + applicationAttemptId);
this.rmContext
  .getDispatcher()
  .getEventHandler()
  .handle(
    new RMAppAttemptRegistrationEvent(applicationAttemptId, request
      .getHost(), request.getRpcPort(), request.getTrackingUrl()));

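6. Once the event is dispatched, registration is handled by the transition() method of the AMRegisteredTransition class in RMAppAttemptImpl. It moves the attempt to RUNNING and then notifies the application with an ATTEMPT_REGISTERED event (shown below), which takes the application's state machine from ACCEPTED to RUNNING. At this point the AM is registered.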

  // Let the app know
  appAttempt.eventHandler.handle(new RMAppEvent(appAttempt
      .getAppAttemptId().getApplicationId(),
     RMAppEventType.ATTEMPT_REGISTERED));
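
To make the event-driven transition concrete, here is a minimal sketch in plain Java of an application state machine that moves from ACCEPTED to RUNNING when it receives ATTEMPT_REGISTERED. The class AppStateMachineSketch and its transition table are hypothetical and only illustrate the idea; Yarn's real state machines have many more states and events and are not implemented this way.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy model of an event-driven state machine (hypothetical, not Yarn code). */
final class AppStateMachineSketch {
    enum State { ACCEPTED, RUNNING }
    enum Event { ATTEMPT_REGISTERED }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<>(State.class);
    static {
        Map<Event, State> fromAccepted = new EnumMap<>(Event.class);
        fromAccepted.put(Event.ATTEMPT_REGISTERED, State.RUNNING);
        TRANSITIONS.put(State.ACCEPTED, fromAccepted);
    }

    private State current = State.ACCEPTED;

    State current() {
        return current;
    }

    /** Applies the event, or rejects it if the current state has no such transition. */
    State handle(Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException(
                "Invalid event " + event + " in state " + current);
        }
        current = row.get(event);
        return current;
    }

    public static void main(String[] args) {
        AppStateMachineSketch app = new AppStateMachineSketch();
        System.out.println(app.handle(Event.ATTEMPT_REGISTERED));
    }
}
```

The point of the table is that the handler never decides the next state with ad hoc if/else logic: an event is either valid for the current state or rejected.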

The AM Requests Containers

1. As the code below shows, Spark calls AMRMClient's allocate method to request resources, so we look at how the Hadoop code handles resource scheduling for the AM.

  // builds the resource requests internally
  updateResourceRequests()
  
  val progressIndicator = 0.1f
  // requests.
  // sends the pending requests to Yarn and returns the containers allocated so far
  val allocateResponse = amClient.allocate(progressIndicator)
  
  val allocatedContainers = allocateResponse.getAllocatedContainers()

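2. First look inside updateResourceRequests(). It calls the method below, which hands each container request to the AMRMClient so that it is sent to the RM on the next allocate() call. I will not analyze it in detail.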

  newLocalityRequests.foreach { request =>
    amClient.addContainerRequest(request)
  }

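3. Next is the key method amClient.allocate(progressIndicator). It is also implemented by AMRMClientImpl, which holds an ApplicationMasterProtocol instance created during its own setup and calls allocate on that object; on the RM side the call is implemented by ApplicationMasterService. The method is fairly long, so we go through the key parts in pieces. The first piece dispatches a STATUS_UPDATE event, which records the progress reported by the AM for this attempt.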

  // Send the status update to the appAttempt.
  this.rmContext.getDispatcher().getEventHandler().handle(
      new RMAppAttemptStatusupdateEvent(appAttemptId, request
          .getProgress()));

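4. Next, the requests are normalized and validated: the memory and CPU of each request are checked against the scheduler's maximum allocation and the application's queue.

  // normalize the requests and check their memory and CPU against the maximum allocation and the queue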
  RMServerUtils.normalizeAndValidateRequests(ask,
      rScheduler.getMaximumResourceCapability(), app.getQueue(),
      rScheduler, rmContext);

5. Then the scheduler itself is asked to allocate. We analyze the default, the Capacity Scheduler.

  // Send new requests to appAttempt.
  Allocation allocation =
      this.rScheduler.allocate(appAttemptId, ask, release, 
          blacklistAdditions, blacklistRemovals);

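6. Each request goes through resource normalization. Every scheduler has a normalization factor. For the Capacity and FIFO schedulers it cannot be configured separately and is determined by Yarn's minimum allocatable resource; for the Fair scheduler it can be configured. The idea is simple: if the normalization factor is 1 GB and a request asks for 800 MB, Yarn still schedules a Container with 1 GB of memory for the task.

  // Sanity check
  // normalize the resource requests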
  SchedulerUtils.normalizeRequests(
      ask, getResourceCalculator(), getClusterResource(),
      getMinimumResourceCapability(), getMaximumResourceCapability());
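
The rounding described in step 6 can be sketched as follows. This is an illustration in plain Java, not the SchedulerUtils code: the class NormalizeSketch, the method normalizeMemory, and the sample limits of 1024 MB and 8192 MB are assumptions chosen for the example.

```java
/** Hypothetical helper that mimics rounding a memory request up to a step size. */
final class NormalizeSketch {

    /**
     * Raises the request to at least minMb, rounds it up to the next multiple
     * of stepMb, and caps the result at maxMb.
     */
    static int normalizeMemory(int requestedMb, int minMb, int maxMb, int stepMb) {
        int atLeastMin = Math.max(requestedMb, minMb);
        int roundedUp = ((atLeastMin + stepMb - 1) / stepMb) * stepMb;
        return Math.min(roundedUp, maxMb);
    }

    public static void main(String[] args) {
        // An 800 MB request with a 1024 MB factor becomes a 1024 MB container.
        System.out.println(normalizeMemory(800, 1024, 8192, 1024));
        // A 1500 MB request is rounded up to the next multiple, 2048 MB.
        System.out.println(normalizeMemory(1500, 1024, 8192, 1024));
    }
}
```

With a 1024 MB factor, normalizeMemory(800, 1024, 8192, 1024) returns 1024, which matches the 800 MB example in step 6.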

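7. At the end, allocate returns the result of getAllocation, which contains an important method, pullNewlyAllocatedContainersAndNMTokens(). Every scheduler ends up calling it. It handles the authorization tokens for the containers and returns the Containers that have been granted. Inside it there is a newlyAllocatedContainers collection, which raises a question at this point in the code: where does this collection come from, and why are the allocated Containers already available here?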

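8. The answer lies in the heartbeat between the NM and the RM. The requests sent by the AM are only recorded by the scheduler; the Containers are actually assigned when an NM heartbeat reaches the RM and the scheduler places pending requests on that node. The NM and RM heartbeat was introduced in Yarn Source Code Analysis (1) --- RM and NM Service Startup and Heartbeat Communication, so here we go straight to StatusUpdateWhenHealthyTransition(), the transition triggered by the heartbeat's STATUS_UPDATE event. It contains the code below, which fires a SchedulerEventType.NODE_UPDATE event.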

  if(rmNode.nextHeartBeat) {
    rmNode.nextHeartBeat = false;
    rmNode.context.getDispatcher().getEventHandler().handle(
        new NodeUpdateSchedulerEvent(rmNode));
  }

9. The event is handled by the default scheduler and ends up in allocateContainersToNode(). We analyze the case where the node has no reserved Container.

    root.assignContainers(
        clusterResource,
        node,
        // TODO, now we only consider limits for parent for non-labeled
        // resources, should consider labeled resources as well.
        new ResourceLimits(labelManager.getResourceByLabel(
            RMNodeLabelsManager.NO_LABEL, clusterResource)));
  }

10. The call first goes to the root queue, a ParentQueue. Only the key parts of the code are shown.

  while (canAssign(clusterResource, node)) {  // check whether the node has enough resources
    if (LOG.isDebugEnabled()) {
      LOG.debug("Trying to assign containers to child-queue of "
        + getQueueName());
    }
    
    // check whether this queue accepts the node and is still within its resource limits
    if (!super.canAssignToThisQueue(clusterResource, nodeLabels, resourceLimits,
        minimumAllocation, Resources.createResource(getMetrics()
            .getReservedMB(), getMetrics().getReservedVirtualCores()))) {
      break;
    }
    
    // descend into the child queues to schedule
    CSAssignment assignedToChild = 
        assignContainersToChildQueues(clusterResource, node, resourceLimits);
    assignment.setType(assignedToChild.getType());
  }
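
The descent from the root queue to a child queue can be pictured with the sketch below. It is a simplified model in plain Java: the CsQueue class, its counters, and the rule of trying the least-used child first are assumptions made for illustration, and the capacity, label, and limit checks of the real ParentQueue are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical queue tree: parents delegate to children, leaves hand out containers. */
final class QueueTreeSketch {

    static final class CsQueue {
        final String name;
        final List<CsQueue> children = new ArrayList<>();
        int pendingContainers; // outstanding demand, only meaningful on leaf queues
        int usedContainers;    // containers already handed out below this queue

        CsQueue(String name) {
            this.name = name;
        }

        /** Returns the name of the leaf queue that took one container, or null if none did. */
        String assignContainer() {
            if (children.isEmpty()) {
                if (pendingContainers == 0) {
                    return null;
                }
                pendingContainers--;
                usedContainers++;
                return name;
            }
            // Assumption for this sketch: offer the container to the least-used child first.
            List<CsQueue> ordered = new ArrayList<>(children);
            ordered.sort(Comparator.comparingInt((CsQueue q) -> q.usedContainers));
            for (CsQueue child : ordered) {
                String leaf = child.assignContainer();
                if (leaf != null) {
                    usedContainers++;
                    return leaf;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        CsQueue root = new CsQueue("root");
        CsQueue a = new CsQueue("a");
        a.pendingContainers = 1;
        root.children.add(a);
        System.out.println(root.assignContainer());
    }
}
```

Each call models one pass triggered by a node heartbeat: the parent only chooses which child to try, and the leaf queue is where a container is actually handed out.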

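11. After the queue-level resource checks, the call reaches the child queue's assignContainers method. It is long but consists mostly of validation, so I will not analyze it in detail. It tries to allocate Containers for the applications in order and eventually reaches the method below.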

  // Try to schedule
  CSAssignment assignment =  
    assignContainersOnNode(clusterResource, node, application, priority, 
        null, currentResourceLimits);

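12. Inside this method there are three scheduling cases: NodeType.NODE_LOCAL, NodeType.RACK_LOCAL, and NodeType.OFF_SWITCH. We analyze NodeType.NODE_LOCAL and therefore enter assignNodeLocalContainers. This path is also long; ignoring reserved Containers, the key code is:

  // resources have already been checked, so a Container object is created directly from the request
  Container container = getContainer(rmContainer, application, node, capability, priority);
  // record the allocation with the application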
  RMContainer allocatedContainer = 
      application.allocate(type, node, priority, request, container);
  
  // Does the application need this resource?
  if (allocatedContainer == null) {
    return Resources.none();
  }
  
  // Inform the node
  node.allocateContainer(allocatedContainer); // tell the node to record the Container that is about to be launched
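
The three locality levels can be illustrated with a small classifier. This is a sketch under stated assumptions, not the scheduler's code: the class LocalitySketch, the method classify, and the host and rack names are all hypothetical, and the real scheduler also applies delay and priority rules that are ignored here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical classifier for how well a node matches a request's locality wishes. */
final class LocalitySketch {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    /** Best match first: requested host, then requested rack, then anywhere. */
    static NodeType classify(String host, String rack,
                             Set<String> wantedHosts, Set<String> wantedRacks) {
        if (wantedHosts.contains(host)) {
            return NodeType.NODE_LOCAL;
        }
        if (wantedRacks.contains(rack)) {
            return NodeType.RACK_LOCAL;
        }
        return NodeType.OFF_SWITCH;
    }

    public static void main(String[] args) {
        Set<String> hosts = new HashSet<>(Arrays.asList("host1"));
        Set<String> racks = new HashSet<>(Arrays.asList("/rack1"));
        System.out.println(classify("host1", "/rack1", hosts, racks));
        System.out.println(classify("host2", "/rack1", hosts, racks));
        System.out.println(classify("host9", "/rack7", hosts, racks));
    }
}
```

The order of the checks mirrors the preference order of the three cases: a Container on the requested host is best, one on the same rack is second best, and any other node is the fallback.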

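13. Now look at the key method application.allocate(). It wraps the newly created Container in a new RMContainer and adds it to the newlyAllocatedContainers collection, which is where the collection mentioned earlier comes from. It then sends a START event to the Container's state machine.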

  // Create RMContainer
  RMContainer rmContainer = new RMContainerImpl(container, this
      .getApplicationAttemptId(), node.getNodeID(),
      appSchedulingInfo.getUser(), this.rmContext);
  
  // Add it to allContainers list.
  newlyAllocatedContainers.add(rmContainer);
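
The two paths that meet at newlyAllocatedContainers can be summarized in a small model: the heartbeat path fills the collection and the AM's allocate path drains it. The sketch below is plain Java with hypothetical names (NewlyAllocatedSketch, onNodeHeartbeat, pullNewlyAllocated) and uses strings in place of RMContainer objects; it shows the hand-off only, not the real scheduler logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a collection filled by heartbeats and drained by allocate calls. */
final class NewlyAllocatedSketch {
    private final List<String> newlyAllocated = new ArrayList<>();
    private int nextId = 1;

    /** Scheduler side: a node heartbeat led to one container being assigned on that node. */
    synchronized void onNodeHeartbeat(String nodeId) {
        newlyAllocated.add("container_" + nextId + "@" + nodeId);
        nextId++;
    }

    /** AM side: return everything assigned since the last call and reset the collection. */
    synchronized List<String> pullNewlyAllocated() {
        List<String> result = new ArrayList<>(newlyAllocated);
        newlyAllocated.clear();
        return result;
    }

    public static void main(String[] args) {
        NewlyAllocatedSketch attempt = new NewlyAllocatedSketch();
        System.out.println(attempt.pullNewlyAllocated().size()); // nothing assigned yet
        attempt.onNodeHeartbeat("node1");
        attempt.onNodeHeartbeat("node2");
        System.out.println(attempt.pullNewlyAllocated());
    }
}
```

This is why an allocate call can return an empty list at first: the AM only receives containers after some node heartbeat has caused the scheduler to assign them.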

14. Step 12 also calls node.allocateContainer(allocatedContainer). As shown below, it adds the Container to a launchedContainers collection, which is used when the Containers are launched.

  public synchronized void allocateContainer(RMContainer rmContainer) {
    Container container = rmContainer.getContainer();
    deductAvailableResource(container.getResource());
    ++numContainers;
  
    launchedContainers.put(container.getId(), rmContainer);
  }

Launching Containers

1. As shown above, Spark obtains the matching Containers through the Yarn API. The next step is to launch them, so look at handleAllocatedContainers(allocatedContainers.asScala).

  val allocateResponse = amClient.allocate(progressIndicator)
  
  val allocatedContainers = allocateResponse.getAllocatedContainers()
  handleAllocatedContainers(allocatedContainers.asScala)

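2. Inside this method there is a call to runAllocatedContainers(containersToUse). The key part of that method is shown below: it passes the required values into a task that runs on a launcher thread.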

  launcherPool.execute(new Runnable {
    override def run(): Unit = {
      try {
        new ExecutorRunnable(
          Some(container),
          conf,
          sparkConf,
          driverUrl,
          executorId,
          executorHostname,
          executorMemory,
          executorCores,
          appAttemptId.getApplicationId.toString,
          securityMgr,
          localResources
        ).run()
        updateInternalState()
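
The pattern used here, one launch task per Container submitted to a thread pool, can be sketched in plain Java as follows. The class LauncherPoolSketch, the method launchAll, and the string results are hypothetical stand-ins; a real launch would contact the NM rather than append to a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical launcher: runs one task per container id on a fixed-size thread pool. */
final class LauncherPoolSketch {

    static List<String> launchAll(List<String> containerIds, int threads) {
        ExecutorService launcherPool = Executors.newFixedThreadPool(threads);
        // Tasks run concurrently, so the shared result list must be thread-safe.
        List<String> started = Collections.synchronizedList(new ArrayList<>());
        for (String id : containerIds) {
            launcherPool.execute(() -> started.add("started " + id));
        }
        launcherPool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            launcherPool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return started;
    }

    public static void main(String[] args) {
        List<String> started = launchAll(Arrays.asList("c1", "c2", "c3"), 2);
        System.out.println(started.size());
    }
}
```

Running the launches on a pool keeps a slow or failing launch from blocking the others; the order in which the tasks finish is not guaranteed.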

3. Now look at the run method executed on that thread: it creates, initializes, and starts an NMClient and then calls startContainer().

  def run(): Unit = {
    logDebug("Starting Executor Container")
    nmClient = NMClient.createNMClient()
    nmClient.init(conf)
    nmClient.start()
    startContainer()
  }

4. Now look at startContainer(). Only the most important call is shown.

  // Send the start request to the ContainerManager
  try {
    nmClient.startContainer(container.get, ctx)
  } catch {
    case ex: Exception =>
      throw new SparkException(s"Exception while starting container ${container.get.getId}" +
        s" on host $hostname", ex)
  }

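5. This calls the NM interface for starting a Container. What happens after that is essentially the same as the AM Container launch described in the previous post, so it is not analyzed again here.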

Summary

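This completes the walk through the flowchart in Yarn Source Code Analysis (0) --- The Flow of Submitting a Spark Job to Yarn, from submitting a Spark job to running it, and with it this introduction to Yarn resource scheduling. While reading the source I was still left with many open questions, such as the state machine transitions and Hadoop RPC communication. I will write posts on these topics as I study them further. If you spot mistakes in my posts, please point them out; I would be very grateful.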

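I consulted many blog posts while studying Yarn resource scheduling, so my line of analysis was inevitably influenced by them. If this infringes on any author's rights, please let me know and I will remove the relevant material.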

Author: 蛋撻

Date: 2018.09.05
