Hadoop Yarn 3.1.0 源碼分析（03 容器分配和投運）

接下來我們看容器的分配和在NM節點上的投運過程，接着上一篇：
FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling() -> queueMgr.getRootQueue().assignContainer():

public Resource assignContainer(FSSchedulerNode node) {
    Resource assigned = Resources.none();

    // If this queue is over its limit, reject
    if (!assignContainerPreCheck(node)) {
      return assigned;
    }

    // 根據FairScheduler的調度規則對子隊列進行排序
    writeLock.lock();
    try {
      Collections.sort(childQueues, policy.getComparator());
    } finally {
      writeLock.unlock();
    }

    readLock.lock();
    try {
      for (FSQueue child : childQueues) {
        assigned = child.assignContainer(node);
        if (!Resources.equals(assigned, Resources.none())) {
          break;
        }
      }
    } finally {
      readLock.unlock();
    }
    return assigned;
  }

FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling() -> queueMgr.getRootQueue().assignContainer() -> FSLeafQueue.assignContainer()

 public Resource assignContainer(FSSchedulerNode node) {
    Resource assigned = none();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Node " + node.getNodeName() + " offered to queue: " +
          getName() + " fairShare: " + getFairShare());
    }

    if (!assignContainerPreCheck(node)) {
      return assigned;
    }

    for (FSAppAttempt sched : fetchAppsWithDemand(true)) {
      if (SchedulerAppUtils.isPlaceBlacklisted(sched, node, LOG)) {
        continue;
      }
      assigned = sched.assignContainer(node);
      if (!assigned.equals(none())) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Assigned container in queue:" + getName() + " " +
              "container:" + assigned);
        }
        break;
      }
    }
    return assigned;
  }

從根隊列然後到子隊列，最後到子隊列中按照調度規則，應該被調度用於分配容器的應用程序的實例進行容器分配。
FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling() -> queueMgr.getRootQueue().assignContainer() -> FSLeafQueue.assignContainer() -> FSAppAttempt.assignContainer():

 public Resource assignContainer(FSSchedulerNode node) {
    if (isOverAMShareLimit()) {
      PendingAsk amAsk = appSchedulingInfo.getNextPendingAsk();
      updateAMDiagnosticMsg(amAsk.getPerAllocationResource(),
          " exceeds maximum AM resource allowed).");
      if (LOG.isDebugEnabled()) {
        LOG.debug("AM resource request: " + amAsk.getPerAllocationResource()
            + " exceeds maximum AM resource allowed, "
            + getQueue().dumpState());
      }
      return Resources.none();
    }
    return assignContainer(node, false);
  }

 private Resource assignContainer(FSSchedulerNode node, boolean reserved) {
    if (LOG.isTraceEnabled()) {
      LOG.trace("Node offered to app: " + getName() + " reserved: " + reserved);
    }

    Collection<SchedulerRequestKey> keysToTry = (reserved) ?
        Collections.singletonList(
            node.getReservedContainer().getReservedSchedulerKey()) :
        getSchedulerKeys();
    //對於應用中的每一種優先級請求，看是否滿足NODE_LOCAL,RACK_LOCAL,OFF_SWITCH這三種本地性需求
    //請求可能被延遲調度，如果設置了延期調度參數來提升本地性
    try {
      writeLock.lock();
      for (SchedulerRequestKey schedulerKey : keysToTry) {

        if (!reserved && !hasContainerForNode(schedulerKey, node)) {
          continue;
        }

        addSchedulingOpportunity(schedulerKey);

        PendingAsk rackLocalPendingAsk = getPendingAsk(schedulerKey,
            node.getRackName());
        PendingAsk nodeLocalPendingAsk = getPendingAsk(schedulerKey,
            node.getNodeName());

        if (nodeLocalPendingAsk.getCount() > 0
            && !appSchedulingInfo.canDelayTo(schedulerKey,
            node.getNodeName())) {
          LOG.warn("Relax locality off is not supported on local request: "
              + nodeLocalPendingAsk);
        }

        NodeType allowedLocality;
        if (scheduler.isContinuousSchedulingEnabled()) {
          allowedLocality = getAllowedLocalityLevelByTime(schedulerKey,
              scheduler.getNodeLocalityDelayMs(),
              scheduler.getRackLocalityDelayMs(),
              scheduler.getClock().getTime());
        } else {
          allowedLocality = getAllowedLocalityLevel(schedulerKey,
              scheduler.getNumClusterNodes(),
              scheduler.getNodeLocalityThreshold(),
              scheduler.getRackLocalityThreshold());
        }

        if (rackLocalPendingAsk.getCount() > 0
            && nodeLocalPendingAsk.getCount() > 0) {
          if (LOG.isTraceEnabled()) {
            LOG.trace("Assign container on " + node.getNodeName()
                + " node, assignType: NODE_LOCAL" + ", allowedLocality: "
                + allowedLocality + ", priority: " + schedulerKey.getPriority()
                + ", app attempt id: " + this.attemptId);
          }
          return assignContainer(node, nodeLocalPendingAsk, NodeType.NODE_LOCAL,
              reserved, schedulerKey);
        }

        if (!appSchedulingInfo.canDelayTo(schedulerKey, node.getRackName())) {
          continue;
        }

        if (rackLocalPendingAsk.getCount() > 0
            && (allowedLocality.equals(NodeType.RACK_LOCAL) || allowedLocality
            .equals(NodeType.OFF_SWITCH))) {
          if (LOG.isTraceEnabled()) {
            LOG.trace("Assign container on " + node.getNodeName()
                + " node, assignType: RACK_LOCAL" + ", allowedLocality: "
                + allowedLocality + ", priority: " + schedulerKey.getPriority()
                + ", app attempt id: " + this.attemptId);
          }
          return assignContainer(node, rackLocalPendingAsk, NodeType.RACK_LOCAL,
              reserved, schedulerKey);
        }

        PendingAsk offswitchAsk = getPendingAsk(schedulerKey,
            ResourceRequest.ANY);
        if (!appSchedulingInfo.canDelayTo(schedulerKey, ResourceRequest.ANY)) {
          continue;
        }

        if (offswitchAsk.getCount() > 0) {
          if (getAppPlacementAllocator(schedulerKey).getUniqueLocationAsks()
              <= 1 || allowedLocality.equals(NodeType.OFF_SWITCH)) {
            if (LOG.isTraceEnabled()) {
              LOG.trace("Assign container on " + node.getNodeName()
                  + " node, assignType: OFF_SWITCH" + ", allowedLocality: "
                  + allowedLocality + ", priority: "
                  + schedulerKey.getPriority()
                  + ", app attempt id: " + this.attemptId);
            }
            return assignContainer(node, offswitchAsk, NodeType.OFF_SWITCH,
                reserved, schedulerKey);
          }
        }

        if (LOG.isTraceEnabled()) {
          LOG.trace("Can't assign container on " + node.getNodeName()
              + " node, allowedLocality: " + allowedLocality + ", priority: "
              + schedulerKey.getPriority() + ", app attempt id: "
              + this.attemptId);
        }
      }
    } finally {
      writeLock.unlock();
    }

    return Resources.none();
  }

有關Delay Schedulering 提高本地性的分析，詳見Hadoop Yarn延遲調度分析（Delay Schedulering）。
根據不同的本地性需求調用：

private Resource assignContainer(
      FSSchedulerNode node, PendingAsk pendingAsk, NodeType type,
      boolean reserved, SchedulerRequestKey schedulerKey) {

    // 該請求需要的資源量
    Resource capability = pendingAsk.getPerAllocationResource();

    // 調度節點上可以被分配的資源量
    Resource available = node.getUnallocatedResource();
    //是否是被預留的容器
    Container reservedContainer = null;
    if (reserved) {
      reservedContainer = node.getReservedContainer().getContainer();
    }

    // 如果需求的資源量小於節點可以被分配的資源量
    if (Resources.fitsIn(capability, available)) {
      // 把新分配的容器通知給應用程序
      RMContainer allocatedContainer =
          allocate(type, node, schedulerKey, pendingAsk,
              reservedContainer);
      if (allocatedContainer == null) {
        if (reserved) {
          unreserve(schedulerKey, node);
        }
        return Resources.none();
      }
      //預留的情況下取消預留
      if (reserved) {
        unreserve(schedulerKey, node);
      }

      // 把新分配的容器通知給節點
      node.allocateContainer(allocatedContainer);

      if (!isAmRunning() && !getUnmanagedAM()) {
        setAMResource(capability);
        getQueue().addAMResourceUsage(capability);
        setAmRunning(true);
      }

      return capability;
    }

    if (LOG.isDebugEnabled()) {
      LOG.debug("Resource request: " + capability + " exceeds the available"
          + " resources of the node.");
    }

把新分配容器通知給application是我們關注的重點：
FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling() -> queueMgr.getRootQueue().assignContainer() -> FSLeafQueue.assignContainer() -> FSAppAttempt.assignContainer() -> FSAppAttempt.allocate():

public RMContainer allocate(NodeType type, FSSchedulerNode node,
      SchedulerRequestKey schedulerKey, PendingAsk pendingAsk,
      Container reservedContainer) {
    RMContainer rmContainer;
    Container container;

    try {
      writeLock.lock();
      // 根據實際調度的本地性需求，對allowedLocalityLevel進行更新
      NodeType allowed = allowedLocalityLevel.get(schedulerKey);
      if (allowed != null) {
        if (allowed.equals(NodeType.OFF_SWITCH) && (type.equals(
            NodeType.NODE_LOCAL) || type.equals(NodeType.RACK_LOCAL))) {
          this.resetAllowedLocalityLevel(schedulerKey, type);
        } else if (allowed.equals(NodeType.RACK_LOCAL) && type.equals(
            NodeType.NODE_LOCAL)) {
          this.resetAllowedLocalityLevel(schedulerKey, type);
        }
      }

      if (getOutstandingAsksCount(schedulerKey) <= 0) {
        return null;
      }

      container = reservedContainer;
      if (container == null) {
        container = createContainer(node, pendingAsk.getPerAllocationResource(),
            schedulerKey);
      }

      // 創建一個RMContainer對象，是容器在RM中的表現形式
      rmContainer = new RMContainerImpl(container, schedulerKey,
          getApplicationAttemptId(), node.getNodeID(),
          appSchedulingInfo.getUser(), rmContext);
      ((RMContainerImpl) rmContainer).setQueueName(this.getQueueName());

      // 把容器加入到已分配容器的列表中
      addToNewlyAllocatedContainers(node, rmContainer);
      liveContainers.put(container.getId(), rmContainer);

      // 把容器的信息更新到appSchedulingInfo中
      ContainerRequest containerRequest = appSchedulingInfo.allocate(
          type, node, schedulerKey, container);
      this.attemptResourceUsage.incUsed(container.getResource());
      getQueue().incUsedResource(container.getResource());

    RMContainer
      ((RMContainerImpl) rmContainer).setContainerRequest(containerRequest);

      // 做完相關工作後，觸發RMContainer的狀態機
      rmContainer.handle(
          new RMContainerEvent(container.getId(), RMContainerEventType.START));

      if (LOG.isDebugEnabled()) {
        LOG.debug("allocate: applicationAttemptId=" + container.getId()
            .getApplicationAttemptId() + " container=" + container.getId()
            + " host=" + container.getNodeId().getHost() + " type=" + type);
      }
      RMAuditLogger.logSuccess(getUser(), AuditConstants.ALLOC_CONTAINER,
          "SchedulerApp", getApplicationId(), container.getId(),
          container.getResource());
    } finally {
      writeLock.unlock();
    }

    return rmContainer;
  }

我們看到這個函數主要是創建了Container在RM中的表示形式RMContainer，然後更新了一些相關信息後，觸發了RMContainer的狀態機，向其發送了RMContainerEventType.START狀態。對應的狀態機轉移操作爲：

addTransition(RMContainerState.NEW, RMContainerState.ALLOCATED,
        RMContainerEventType.START, new ContainerStartedTransition())

private static final class ContainerStartedTransition extends
      BaseTransition {

    public void transition(RMContainerImpl container, RMContainerEvent event) {
      container.rmContext.getAllocationTagsManager().addContainer(
          container.getNodeId(), container.getContainerId(),
          container.getAllocationTags());

      container.eventHandler.handle(new RMAppAttemptEvent(
          container.appAttemptId, RMAppAttemptEventType.CONTAINER_ALLOCATED));
    }
  }

這個狀態機跳變的伴隨操作，終於推動了RMAppAttemptEventType操作，之前沒有分配到容器的RMAppAttempt一直處於RMAppAttemptState.SCHEDULED狀態，現在終於可能可以脫離前進了。

addTransition(RMAppAttemptState.SCHEDULED,
          EnumSet.of(RMAppAttemptState.ALLOCATED_SAVING,
            RMAppAttemptState.SCHEDULED),
          RMAppAttemptEventType.CONTAINER_ALLOCATED,
          new AMContainerAllocatedTransition())

可以看到這是一個多弧跳變，相應的狀態轉移對應跳函數爲：

private static final class AMContainerAllocatedTransition
      implements
      MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {

    public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {
      // 想從獲取AM所需的那個容器
      Allocation amContainerAllocation =
          appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
            EMPTY_CONTAINER_REQUEST_LIST, null, EMPTY_CONTAINER_RELEASE_LIST, null,
            null, new ContainerUpdates());
      //既然創建完一個RMContainer之後，CONTAINER_ALLOCATED觸發了，就說明至少存在一個容器可以獲取，並且存放在了       //SchedulerApplication#newlyAllocatedContainers 中。
      //但是對應調度器FairScheduler的allocate函數不能保證一定能夠拉到容器，因爲容器可能不能被拉過來因爲某些原因，比如DNS不可達，就會回到上一個狀態也就是SCHEDULED狀態，然後再重新去獲取AM所需的一個容器。
      if (amContainerAllocation.getContainers().size() == 0) {
        appAttempt.retryFetchingAMContainer(appAttempt);
        return RMAppAttemptState.SCHEDULED;
      }

      // 有容器獲取就分配給AM使用
      appAttempt.setMasterContainer(amContainerAllocation.getContainers()
          .get(0));
      RMContainerImpl rmMasterContainer = (RMContainerImpl)appAttempt.scheduler
          .getRMContainer(appAttempt.getMasterContainer().getId());
      rmMasterContainer.setAMContainer(true);
      appAttempt.rmContext.getNMTokenSecretManager()
        .clearNodeSetForAttempt(appAttempt.applicationAttemptId);
      appAttempt.getSubmissionContext().setResource(
        appAttempt.getMasterContainer().getResource());
      appAttempt.storeAttempt();
      return RMAppAttemptState.ALLOCATED_SAVING;
    }
  }

在Hadoop Yarn 3.1.0 源碼分析（02 作業調度）我們看到，當一個RMAppAttempt處於SUBMITTED狀態時，收到ATTEMPT_ADDED事件觸發的時候，會執行ScheduleTransition.transition()然後調用appAttempt.scheduler.allocate()，企圖從newlyAllocatedContainers集合中收攬已經分配的容器，若沒有則會停留在SCHEDULED狀態，若收攬到容器，則RMContainer會收到ACQUIRED事件，此時在心跳過後allocate中創建的RMContainer受到RMContainerEventType.START事件處於RMContainerState.ALLOCATED狀態。那麼對應的RMContainer的狀態機就能繼續推進。

addTransition(RMContainerState.ALLOCATED, RMContainerState.ACQUIRED,
        RMContainerEventType.ACQUIRED, new AcquiredTransition())

我們回到上面AMContainerAllocatedTransition

if (amContainerAllocation.getContainers().size() == 0) {
        appAttempt.retryFetchingAMContainer(appAttempt);
        //沒有收攬到容器繼續停留在SCHEDULED狀態
        return RMAppAttemptState.SCHEDULED;
}

 appAttempt.storeAttempt();

 return RMAppAttemptState.ALLOCATED_SAVING;

如果現在還是收攬不到容器，那麼會開一個線程週期性的去獲取。
appAttempt.retryFetchingAMContainer():

private void retryFetchingAMContainer(final RMAppAttemptImpl appAttempt) {
    new Thread() {
      public void run() {
        try {
          //500ms試一下
          Thread.sleep(500);
        } catch (InterruptedException e) {
          LOG.warn("Interrupted while waiting to resend the"
              + " ContainerAllocated Event.");
        }
        //再次觸發這個事件進行嘗試
        appAttempt.eventHandler.handle(
            new RMAppAttemptEvent(appAttempt.applicationAttemptId,
                RMAppAttemptEventType.CONTAINER_ALLOCATED));
      }
    }.start();
  }

若RMAppAttempt收攬到了容器之後，那麼當前RMAppAttempt處於RMAppAttemptState.ALLOCATED_SAVING狀態，而appAttempt.storeAttempt()是異步的過程，結束了會給RMAppAttempt發送事件：
appAttempt.storeAttempt() -> RMStateStore.storeNewApplicationAttempt() :

public void storeNewApplicationAttempt(RMAppAttempt appAttempt) {
    //....
    getRMStateStoreEventHandler().handle(
      new RMStateStoreAppAttemptEvent(attemptState));
  }
   public RMStateStoreAppAttemptEvent(ApplicationAttemptStateData attemptState) {
    super(RMStateStoreEventType.STORE_APP_ATTEMPT);
    this.attemptState = attemptState;
  }

對應的狀態機轉移爲：

addTransition(RMStateStoreState.ACTIVE,
          EnumSet.of(RMStateStoreState.ACTIVE, RMStateStoreState.FENCED),
          RMStateStoreEventType.STORE_APP_ATTEMPT,
          new StoreAppAttemptTransition())

private static class StoreAppAttemptTransition implements
      MultipleArcTransition<RMStateStore, RMStateStoreEvent,
          RMStateStoreState> {
    public RMStateStoreState transition(RMStateStore store,
        RMStateStoreEvent event) {
       //.....
        store.notifyApplicationAttempt(new RMAppAttemptEvent
               (attemptState.getAttemptId(),
               RMAppAttemptEventType.ATTEMPT_NEW_SAVED));

      return finalState(isFenced);
    };
  }

可以看到保存完AppAttemp信息後像RMAppAttempt發送了RMAppAttemptEventType.ATTEMPT_NEW_SAVED事件：

addTransition(RMAppAttemptState.ALLOCATED_SAVING, 
          RMAppAttemptState.ALLOCATED,
          RMAppAttemptEventType.ATTEMPT_NEW_SAVED, new AttemptStoredTransition())

RMAppAttempt進入了RMAppAttemptEventType.ALLOCATED狀態，狀態機對應的伴隨操作是：

 private static final class AttemptStoredTransition extends BaseTransition {
    @Override
    public void transition(RMAppAttemptImpl appAttempt,
                                                    RMAppAttemptEvent event) {
      //安全認證相關
      appAttempt.registerClientToken();
      appAttempt.launchAttempt();
    }
  }

private void launchAttempt(){
    launchAMStartTime = System.currentTimeMillis();
    // Send event to launch the AM Container
    eventHandler.handle(new AMLauncherEvent(AMLauncherEventType.LAUNCH, this));
  }

Hadoop Yarn 3.1.0 源碼分析（03 容器分配和投運）

【SQL進階】CASE語句的使用

npm error Cannot read properties of null (reading 'isDescendantOf')

HDFS 讀寫分離 (源碼分析1 ：支持從Standby Namenode讀)

趣頭條百PB規模 Hadoop實踐(HDFS篇)

HDFS 讀寫分離（總體架構介紹）

深入理解YARN全局調度和源碼分析

應用程序在YARN中的整體運行流程和狀態機（NodeManager端）

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

Hadoop Yarn 3.1.0 源碼分析 （03 容器分配和投運）

Hadoop Yarn 3.1.0 源碼分析（03 容器分配和投運）