Hadoop YARN 3.1.0 Source Code Analysis (02: Job Scheduling)

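In the previous section we analyzed in detail how a job is submitted. The job has now reached the RM side and been handed to RMAppManager for further processing, so we keep following how it moves through YARN.
Server side:
ApplicationClientProtocolPBServiceImpl.submitApplication() -> ClientRMService.submitApplication() -> RMAppManager.submitApplication():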

 protected void submitApplication(
      ApplicationSubmissionContext submissionContext, long submitTime,
      String user) throws YarnException {
    ApplicationId applicationId = submissionContext.getApplicationId();
    // Create an RMAppImpl instance from the submitted submissionContext and cache the
    // applicationId -> RMAppImpl mapping so that it can be looked up later
    RMAppImpl application = createAndPopulateNewRMApp(
        submissionContext, submitTime, user, false, -1);
    try {
      // Security enabled: hand the credentials to the DelegationTokenRenewer asynchronously
      if (UserGroupInformation.isSecurityEnabled()) {
        this.rmContext.getDelegationTokenRenewer()
            .addApplicationAsync(applicationId,
                BuilderUtils.parseCredentials(submissionContext),
                submissionContext.getCancelTokensWhenComplete(),
                application.getUser(),
                BuilderUtils.parseTokensConf(submissionContext));
      } else {
        // Security disabled: get the dispatcher and pass an RMAppEventType.START event to its
        // EventHandler, which triggers the state machine of the RMAppImpl object
        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));
      }
    } catch (Exception e) {
      LOG.warn("Unable to parse credentials for " + applicationId, e);
      // On failure, fire an RMAppEventType.APP_REJECTED event
      // Sending APP_REJECTED is fine, since we assume that the
      // RMApp is in NEW state and thus we haven't yet informed the
      // scheduler about the existence of the application
      this.rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppEvent(applicationId,
              RMAppEventType.APP_REJECTED, e.getMessage()));
      throw RPCUtil.getRemoteException(e);
    }
  }

Let us look at the main operations in createAndPopulateNewRMApp.
ApplicationClientProtocolPBServiceImpl.submitApplication() -> ClientRMService.submitApplication() -> RMAppManager.submitApplication() -> RMAppManager.createAndPopulateNewRMApp():

  private RMAppImpl createAndPopulateNewRMApp(
      ApplicationSubmissionContext submissionContext, long submitTime,
      String user, boolean isRecovery, long startTime) throws YarnException {
    // Get the applicationId of the current job
    ApplicationId applicationId = submissionContext.getApplicationId();
    // Validate the resource request
    List<ResourceRequest> amReqs = validateAndCreateResourceRequest(
        submissionContext, isRecovery);
    // Create the RMApp instance, that is, an RMAppImpl
    RMAppImpl application =
        new RMAppImpl(applicationId, rmContext, this.conf,
            submissionContext.getApplicationName(), user,
            submissionContext.getQueue(),
            submissionContext, this.scheduler, this.masterService,
            submitTime, submissionContext.getApplicationType(),
            submissionContext.getApplicationTags(), amReqs, placementContext,
            startTime);
    // If the cache in rmContext has no entry for this applicationId, add one so that the
    // application can be looked up directly later; otherwise reject the duplicate
    if (rmContext.getRMApps().putIfAbsent(applicationId, application) !=
        null) {
      String message = "Application with id " + applicationId
          + " is already present! Cannot add a duplicate!";
      LOG.warn(message);
      throw new YarnException(message);
    }

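Creating the RMAppImpl is the most important step: it is how a job is represented on the RM side. The constructor arguments fall into two groups.

  • Information from the submissionContext sent by the client:
    job name: getApplicationName(); job ID: applicationId; target queue: getQueue(); job type: getApplicationType(); job tags: getApplicationTags(); resource requests: amReqs; and so on.

  • Information supplied by RMAppManager on the RM side:
    RM context: rmContext; configuration: this.conf; the scheduler in use: this.scheduler; the service that manages AMs: this.masterService; and so on.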
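
The duplicate check in createAndPopulateNewRMApp relies on putIfAbsent returning the previous value when the key already exists. The following is a minimal sketch of that idea only; AppRegistrySketch and its String-based fields are hypothetical stand-ins, not the real rmContext.getRMApps() map.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the RM's applicationId -> RMApp cache.
class AppRegistrySketch {
  private final ConcurrentMap<String, String> apps = new ConcurrentHashMap<>();

  // Registers an application and rejects a second registration under the same id.
  void register(String applicationId, String appName) {
    // putIfAbsent returns the existing value, or null if the key was absent.
    if (apps.putIfAbsent(applicationId, appName) != null) {
      throw new IllegalStateException("Application with id " + applicationId
          + " is already present! Cannot add a duplicate!");
    }
  }

  String lookup(String applicationId) {
    return apps.get(applicationId);
  }

  public static void main(String[] args) {
    AppRegistrySketch registry = new AppRegistrySketch();
    registry.register("application_1_0001", "wordcount");
    System.out.println(registry.lookup("application_1_0001"));
    try {
      registry.register("application_1_0001", "wordcount-again");
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Because the check and the insert are a single atomic call, two concurrent submissions with the same id cannot both succeed.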

Back to the RMAppManager.submitApplication() flow above:

        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));

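On the RM side, submitting a job fires an RMAppEventType.START event. For every job that arrives at the RM over RPC, RMAppManager creates an RMAppImpl object, which is the RM's representation of that job. A job's progress is modeled by a state machine; state machines are covered in detail in another chapter, and the one at work here is the RMAppImpl state machine.
this.rmContext.getDispatcher() returns an asynchronous dispatcher, the AsyncDispatcher object rmDispatcher. rmDispatcher.getEventHandler() then returns an EventHandler; calling handle() on it routes the event to the handler registered for that event type, which runs the matching transition function and moves the state machine to its next state. The event sent here is RMAppEventType.START.
The rmContext in RMAppManager is the rmContext object passed in from ResourceManager, so let us first see how rmContext is created in the ResourceManager class: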

protected void serviceInit(Configuration conf) throws Exception {
    this.conf = conf;
    // First create the RMContextImpl instance
    this.rmContext = new RMContextImpl();
    // Then populate it through setters; here we only look at the Dispatcher-related ones
    rmDispatcher = setupDispatcher();
    addIfService(rmDispatcher);
    rmContext.setDispatcher(rmDispatcher);
    super.serviceInit(this.conf);
  }

 private Dispatcher setupDispatcher() {
    Dispatcher dispatcher = createDispatcher();
    dispatcher.register(RMFatalEventType.class,
        new ResourceManager.RMFatalEventDispatcher());
    return dispatcher;
  }

  protected Dispatcher createDispatcher() {
    return new AsyncDispatcher("RM Event dispatcher");
  }

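As we can see, the AsyncDispatcher object rmDispatcher is created during ResourceManager.serviceInit(). In serviceInit() of RMActiveServices, an inner class of ResourceManager, many EventHandlers are also registered with rmDispatcher so that each kind of event is later routed to its own handler: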

// The class that holds all of the RM's active services
public class RMActiveServices extends CompositeService {
    // Handles security-related delegation tokens
    private DelegationTokenRenewer delegationTokenRenewer;
    // Event handler for scheduler events
    private EventHandler<SchedulerEvent> schedulerDispatcher;
    // Launches AMs from the RM side
    private ApplicationMasterLauncher applicationMasterLauncher;
    // Tracks expiry of container allocations
    private ContainerAllocationExpirer containerAllocationExpirer;
    private ResourceManager rm;
    private boolean fromActive = false;
    private StandByTransitionRunnable standByTransitionRunnable;
 // serviceInit() of RMActiveServices; we only care about the rmDispatcher registrations
 protected void serviceInit(Configuration configuration) throws Exception {
    // Register the event handler for NodesListManager with rmDispatcher
    rmDispatcher.register(NodesListManagerEventType.class, nodesListManager);
    // Register the event handler for the scheduler with rmDispatcher
    rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher);
    // Register the event handler for RMApp, the RM's representation of an application
    rmDispatcher.register(RMAppEventType.class, new ApplicationEventDispatcher(rmContext));
    // Register the event handler for RMAppAttempt, the object for one attempt of an RMApp
    rmDispatcher.register(RMAppAttemptEventType.class, new ApplicationAttemptEventDispatcher(rmContext));
    // Register the event handler for RMNode, the RM's representation of a node
    rmDispatcher.register(RMNodeEventType.class, new NodeEventDispatcher(rmContext));

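Take one registration as an example, the one that registers the scheduler's event handler with rmDispatcher: rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher).
SchedulerEventType is the event type class. It is an enum, and it lists every scheduler-related event type: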

public enum SchedulerEventType {

  // Scheduler event types for Node operations
  NODE_ADDED,
  NODE_REMOVED,
  NODE_UPDATE,
  NODE_RESOURCE_UPDATE,
  NODE_LABELS_UPDATE,

  // Scheduler event types for RMApp operations
  APP_ADDED,
  APP_REMOVED,

  // Scheduler event types for RMAppAttempt operations
  APP_ATTEMPT_ADDED,
  APP_ATTEMPT_REMOVED,

  // Event type for ContainerAllocationExpirer
  CONTAINER_EXPIRED,

  // Source: SchedulerAppAttempt::pullNewlyUpdatedContainer.
  RELEASE_CONTAINER,

  /* Source: SchedulingEditPolicy */
  KILL_RESERVED_CONTAINER,

  // Mark a container for preemption
  MARK_CONTAINER_FOR_PREEMPTION,

  // Mark a for-preemption container killable
  MARK_CONTAINER_FOR_KILLABLE,

  // Cancel a killable container
  MARK_CONTAINER_FOR_NONKILLABLE,

  //Queue Management Change
  MANAGE_QUEUE
}

schedulerDispatcher is the event handler for scheduler-related events. It is considerably more complex than the other event handlers, so let us look closer: schedulerDispatcher = createSchedulerEventDispatcher() -> ResourceManager.createSchedulerEventDispatcher()

protected ResourceScheduler scheduler;
protected EventHandler<SchedulerEvent> createSchedulerEventDispatcher() {
    return new EventDispatcher(this.scheduler, "SchedulerEventDispatcher");
  }

public interface ResourceScheduler extends YarnScheduler, Recoverable { }

public interface YarnScheduler extends EventHandler<SchedulerEvent> {}

// The handler passed in is a ResourceScheduler, which extends YarnScheduler, which in turn extends EventHandler<SchedulerEvent>
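
The registration pattern described above, one handler per event-type enum class, can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: DispatcherSketch, SimpleEvent and the two small enums are hypothetical names, and the real AsyncDispatcher queues events and delivers them on its own thread rather than calling the handler directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, synchronous sketch; not the real AsyncDispatcher.
class DispatcherSketch {
  interface Event { Enum<?> getType(); }
  interface EventHandler { void handle(Event event); }

  enum AppEventType { START, APP_REJECTED }
  enum SchedulerEventType { APP_ADDED }

  static final class SimpleEvent implements Event {
    private final Enum<?> type;
    SimpleEvent(Enum<?> type) { this.type = type; }
    @Override public Enum<?> getType() { return type; }
  }

  // One handler per event-type enum class.
  private final Map<Class<?>, EventHandler> handlers = new HashMap<>();

  void register(Class<? extends Enum<?>> eventTypeClass, EventHandler handler) {
    handlers.put(eventTypeClass, handler);
  }

  // Routes the event by the enum class that its type constant belongs to.
  void dispatch(Event event) {
    EventHandler handler = handlers.get(event.getType().getDeclaringClass());
    if (handler == null) {
      throw new IllegalArgumentException("No handler registered for " + event.getType());
    }
    handler.handle(event);
  }

  static List<String> demo() {
    List<String> log = new ArrayList<>();
    DispatcherSketch dispatcher = new DispatcherSketch();
    dispatcher.register(AppEventType.class, e -> log.add("app:" + e.getType()));
    dispatcher.register(SchedulerEventType.class, e -> log.add("scheduler:" + e.getType()));
    dispatcher.dispatch(new SimpleEvent(AppEventType.START));
    dispatcher.dispatch(new SimpleEvent(SchedulerEventType.APP_ADDED));
    return log;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Keying the table by the enum class means the sender only needs to know the event type; it never holds a reference to the component that will handle it.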

With a rough picture of event types and event handlers, we return once more to the RMAppManager.submitApplication() flow above:

        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));

We have already covered this.rmContext.getDispatcher().getEventHandler(). Once rmDispatcher hands the event to the handler registered earlier for RMApp, handle() is where that handler acts on the specific event type:
ApplicationEventDispatcher.handle():

public void handle(RMAppEvent event) {
      ApplicationId appID = event.getApplicationId();
      RMApp rmApp = this.rmContext.getRMApps().get(appID);
      if (rmApp != null) {
        try {
          rmApp.handle(event);
        } catch (Throwable t) {
          LOG.error("Error in handling event type " + event.getType()
              + " for application " + appID, t);
        }
      }
    }

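The event handler passes the event on to the RMApp by calling rmApp.handle(event), which in practice is RMAppImpl.handle(event). RMAppImpl can take the event because it is itself an EventHandler: the RMApp interface that it implements extends EventHandler.
[Figure: class diagram of RMAppImpl]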

ApplicationEventDispatcher.handle() -> RMAppImpl.handle() :

public void handle(RMAppEvent event) {

    this.writeLock.lock();

    try {
      ApplicationId appID = event.getApplicationId();
      LOG.debug("Processing event for " + appID + " of type "
          + event.getType());
      final RMAppState oldState = getState();
      try {
        /* keep the master in sync with the state machine */
        // Perform the state machine transition that matches the event type
        this.stateMachine.doTransition(event.getType(), event);
      } catch (InvalidStateTransitionException e) {
        LOG.error("App: " + appID
            + " can't handle this event at current state", e);
        onInvalidStateTransition(event.getType(), oldState);
      }

      // Log at INFO if we're not recovering or not in a terminal state.
      // Log at DEBUG otherwise.
      if ((oldState != getState()) &&
          (((recoveredFinalState == null)) ||
            (event.getType() != RMAppEventType.RECOVER))) {
        LOG.info(String.format(STATE_CHANGE_MESSAGE, appID, oldState,
            getState(), event.getType()));
      } else if ((oldState != getState()) && LOG.isDebugEnabled()) {
        LOG.debug(String.format(STATE_CHANGE_MESSAGE, appID, oldState,
            getState(), event.getType()));
      }
    } finally {
      this.writeLock.unlock();
    }
  }

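The event type passed in now, event.getType(), is RMAppEventType.START. RMAppImpl builds its transition table through StateMachineFactory. The excerpt below shows that an RMApp starts in the RMAppState.NEW state, followed by the transitions registered out of NEW. The full StateMachineFactory is not limited to transitions from NEW; it also holds the transition table for every other state.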

private static final StateMachineFactory<RMAppImpl,
                                           RMAppState,
                                           RMAppEventType,
                                           RMAppEvent> stateMachineFactory
                               = new StateMachineFactory<RMAppImpl,
                                           RMAppState,
                                           RMAppEventType,
                                           RMAppEvent>(RMAppState.NEW)


     // Transitions from NEW state
    .addTransition(RMAppState.NEW, RMAppState.NEW,
        RMAppEventType.NODE_UPDATE, new RMAppNodeUpdateTransition())
    .addTransition(RMAppState.NEW, RMAppState.NEW_SAVING,
        RMAppEventType.START, new RMAppNewlySavingTransition())
    .addTransition(RMAppState.NEW, EnumSet.of(RMAppState.SUBMITTED,
            RMAppState.ACCEPTED, RMAppState.FINISHED, RMAppState.FAILED,
            RMAppState.KILLED, RMAppState.FINAL_SAVING),
        RMAppEventType.RECOVER, new RMAppRecoveredTransition())
    .addTransition(RMAppState.NEW, RMAppState.KILLED, RMAppEventType.KILL,
        new AppKilledTransition())
    .addTransition(RMAppState.NEW, RMAppState.FINAL_SAVING,
        RMAppEventType.APP_REJECTED,
        new FinalSavingTransition(new AppRejectedTransition(),
          RMAppState.FAILED))

Looking up the transition in the table above, starting from the initial RMAppState.NEW state there is exactly one transition for the RMAppEventType.START event type:

addTransition(RMAppState.NEW, RMAppState.NEW_SAVING,
        RMAppEventType.START, new RMAppNewlySavingTransition())
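
The lookup from (current state, event type) to (hook, post-state) can be illustrated with a small table-driven state machine. This is only a sketch under stated assumptions: StateMachineSketch and its nested types are hypothetical, hooks are plain Runnables, and unlike the real StateMachineFactory this version mutates one instance instead of building an immutable factory.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical table-driven state machine with single-arc transitions only.
class StateMachineSketch {
  enum State { NEW, NEW_SAVING, SUBMITTED, ACCEPTED }
  enum EventType { START, APP_NEW_SAVED, APP_ACCEPTED }

  // One single-arc entry: a fixed post-state plus the hook to run.
  private static final class Arc {
    final State postState;
    final Runnable hook;
    Arc(State postState, Runnable hook) {
      this.postState = postState;
      this.hook = hook;
    }
  }

  private final Map<State, Map<EventType, Arc>> table = new EnumMap<>(State.class);
  private State current;

  StateMachineSketch(State initial) {
    this.current = initial;
  }

  StateMachineSketch addTransition(State pre, State post, EventType on, Runnable hook) {
    table.computeIfAbsent(pre, s -> new EnumMap<EventType, Arc>(EventType.class))
        .put(on, new Arc(post, hook));
    return this;
  }

  // Runs the hook for (current state, event type), then moves to the post-state.
  void doTransition(EventType type) {
    Map<EventType, Arc> arcs = table.get(current);
    Arc arc = (arcs == null) ? null : arcs.get(type);
    if (arc == null) {
      throw new IllegalStateException("Invalid event " + type + " at " + current);
    }
    arc.hook.run();
    current = arc.postState;
  }

  State getState() {
    return current;
  }

  static StateMachineSketch newAppMachine() {
    return new StateMachineSketch(State.NEW)
        .addTransition(State.NEW, State.NEW_SAVING, EventType.START, () -> { })
        .addTransition(State.NEW_SAVING, State.SUBMITTED, EventType.APP_NEW_SAVED, () -> { })
        .addTransition(State.SUBMITTED, State.ACCEPTED, EventType.APP_ACCEPTED, () -> { });
  }

  public static void main(String[] args) {
    StateMachineSketch machine = newAppMachine();
    machine.doTransition(EventType.START);
    System.out.println(machine.getState());
  }
}
```

An event that has no entry for the current state is rejected, which corresponds to the InvalidStateTransitionException branch in RMAppImpl.handle().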

The transition class for this arc is RMAppNewlySavingTransition, and the work is done in RMAppNewlySavingTransition.transition()

// A SingleArcTransition: it moves from one state to exactly one other state, one-to-one
private static class RMAppTransition implements
      SingleArcTransition<RMAppImpl, RMAppEvent> {
    public void transition(RMAppImpl app, RMAppEvent event) {
    };
  }
// Extends the single-arc transition class RMAppTransition above and overrides transition()
private static final class RMAppNewlySavingTransition extends RMAppTransition {
    @Override
    public void transition(RMAppImpl app, RMAppEvent event) {
      // Application lifetime
      long applicationLifetime =
          app.getApplicationLifetime(ApplicationTimeoutType.LIFETIME);
      applicationLifetime = app.scheduler
          .checkAndGetApplicationLifetime(app.queue, applicationLifetime);
      if (applicationLifetime > 0) {
        // calculate next timeout value
        Long newTimeout =
            Long.valueOf(app.submitTime + (applicationLifetime * 1000));
        app.rmContext.getRMAppLifetimeMonitor().registerApp(app.applicationId,
            ApplicationTimeoutType.LIFETIME, newTimeout);

        // Update the application's timeout
        app.applicationTimeouts.put(ApplicationTimeoutType.LIFETIME,
            newTimeout);

        LOG.info("Application " + app.applicationId
            + " is registered for timeout monitor, type="
            + ApplicationTimeoutType.LIFETIME + " value=" + applicationLifetime
            + " seconds");
      }

      // If recovery is enabled then store the application information in a
      // non-blocking call so make sure that RM has stored the information
      // needed to restart the AM after RM restart without further client
      // communication
      LOG.info("Storing application with id " + app.applicationId);
      app.rmContext.getStateStore().storeNewApplication(app);
    }
  }
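
The timeout arithmetic in RMAppNewlySavingTransition can be isolated in a short sketch: the lifetime is given in seconds, the submit time in milliseconds, and the registered timeout is their sum in milliseconds. LifetimeTimeoutSketch is a hypothetical helper, and the -1 return value for "no lifetime limit" is an assumption made for this example only.

```java
// Hypothetical helper that isolates the timeout calculation shown above.
class LifetimeTimeoutSketch {
  // Returns the absolute expiry time in milliseconds, or -1 (an assumed sentinel)
  // when the lifetime is not positive and no timeout is registered.
  static long computeTimeoutMillis(long submitTimeMillis, long lifetimeSeconds) {
    if (lifetimeSeconds <= 0) {
      return -1L;
    }
    return submitTimeMillis + lifetimeSeconds * 1000L;
  }

  public static void main(String[] args) {
    long submitTime = System.currentTimeMillis();
    System.out.println(computeTimeoutMillis(submitTime, 3600));
  }
}
```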

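In YARN, most state machines complement and drive one another. The RMApp state machine's move from NEW to NEW_SAVING drives the RMStateStore state machine, and the RMStateStore state machine's change in turn pushes the RMApp state machine forward.
RMAppNewlySavingTransition.transition() -> RMStateStore.storeNewApplication():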

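The ResourceManager service uses this class's storeNewApplication() to persist the application's state. This also goes through an AsyncDispatcher and is non-blocking; once the store completes, the corresponding event is sent to the RMApp, triggering its state machine.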

  public void storeNewApplication(RMApp app) {
    ApplicationSubmissionContext context = app
                                            .getApplicationSubmissionContext();
    assert context instanceof ApplicationSubmissionContextPBImpl;
    ApplicationStateData appState =
        ApplicationStateData.newInstance(app.getSubmitTime(),
            app.getStartTime(), context, app.getUser(), app.getCallerContext());
    appState.setApplicationTimeouts(app.getApplicationTimeouts());
    getRMStateStoreEventHandler().handle(new RMStateStoreAppEvent(appState));
  }

The event type is:

 public RMStateStoreAppEvent(ApplicationStateData appState) {
    super(RMStateStoreEventType.STORE_APP);
    this.appState = appState;
  }

The corresponding state machine transition and its transition class are:

addTransition(RMStateStoreState.ACTIVE,
          EnumSet.of(RMStateStoreState.ACTIVE, RMStateStoreState.FENCED),
          RMStateStoreEventType.STORE_APP, new StoreAppTransition())

Now look at the transition operation in this transition class:

private static class StoreAppTransition
      implements MultipleArcTransition<RMStateStore, RMStateStoreEvent,
          RMStateStoreState> {
    public RMStateStoreState transition(RMStateStore store,
        RMStateStoreEvent event) {
      if (!(event instanceof RMStateStoreAppEvent)) {
        // should never happen
        LOG.error("Illegal event type: " + event.getClass());
        return RMStateStoreState.ACTIVE;
      }
      boolean isFenced = false;
      ApplicationStateData appState =
          ((RMStateStoreAppEvent) event).getAppState();
      ApplicationId appId =
          appState.getApplicationSubmissionContext().getApplicationId();
      LOG.info("Storing info for app: " + appId);
      try {
        store.storeApplicationStateInternal(appId, appState);
        store.notifyApplication(
            new RMAppEvent(appId, RMAppEventType.APP_NEW_SAVED));
      } catch (Exception e) {
        LOG.error("Error storing app: " + appId, e);
        if (e instanceof StoreLimitException) {
          store.notifyApplication(
              new RMAppEvent(appId, RMAppEventType.APP_SAVE_FAILED,
                  e.getMessage()));
        } else {
          isFenced = store.notifyStoreOperationFailedInternal(e);
        }
      }
      return finalState(isFenced);
    };

  }
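
StoreAppTransition is a multi-arc transition: the hook itself returns the post-state, which must be one of the states declared in the EnumSet. A minimal sketch of that contract follows; MultiArcSketch, runMultiArc and storeApp are hypothetical names, and reducing the outcome to a single boolean is a simplification of the real fencing logic.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of a multi-arc transition whose hook picks the post-state.
class MultiArcSketch {
  enum StoreState { ACTIVE, FENCED }

  // Runs the hook and checks its result against the declared post-states.
  static StoreState runMultiArc(Set<StoreState> validPostStates,
      Supplier<StoreState> hook) {
    StoreState postState = hook.get();
    if (!validPostStates.contains(postState)) {
      throw new IllegalStateException("Undeclared post-state: " + postState);
    }
    return postState;
  }

  // Simplified stand-in for finalState(isFenced) in the excerpt above.
  static StoreState storeApp(boolean isFenced) {
    return runMultiArc(EnumSet.of(StoreState.ACTIVE, StoreState.FENCED),
        () -> isFenced ? StoreState.FENCED : StoreState.ACTIVE);
  }

  public static void main(String[] args) {
    System.out.println(storeApp(false));
    System.out.println(storeApp(true));
  }
}
```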

It clearly does one main thing: after the application's information has been saved, it notifies the RMApp:

store.notifyApplication(
            new RMAppEvent(appId, RMAppEventType.APP_NEW_SAVED));

 private void notifyApplication(RMAppEvent event) {
    rmDispatcher.getEventHandler().handle(event);
  }

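This fires the RMApp's RMAppEventType.APP_NEW_SAVED event; the state machines really do advance by helping each other. Next, back in RMAppImpl, here is the transition for this event and the operation it runs: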

addTransition(RMAppState.NEW_SAVING, RMAppState.SUBMITTED,
        RMAppEventType.APP_NEW_SAVED, new AddApplicationToSchedulerTransition())

As its name suggests, the AddApplicationToSchedulerTransition class adds the application to the scheduler. Let us look at it in detail:

 private static final class AddApplicationToSchedulerTransition extends
      RMAppTransition {
    public void transition(RMAppImpl app, RMAppEvent event) {
      app.handler.handle(
          new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
              app.applicationPriority, app.placementContext));
      // send the ATS create Event
      app.sendATSCreateEvent();
    }
  }

public AppAddedSchedulerEvent(ApplicationId applicationId, String queue,
      String user, boolean isAppRecovering, ReservationId reservationID,
      Priority appPriority, ApplicationPlacementContext placementContext) {
    super(SchedulerEventType.APP_ADDED);
    this.applicationId = applicationId;
    this.queue = queue;
    this.user = user;
    this.reservationID = reservationID;
    this.isAppRecovering = isAppRecovering;
    this.appPriority = appPriority;
    this.placementContext = placementContext;
  }

AddApplicationToSchedulerTransition.transition() in effect fires a SchedulerEventType.APP_ADDED event, which is then processed by the schedulerDispatcher event handler registered by ResourceManager:
ResourceManager.schedulerDispatcher.handle():

 public void handle(T event) {
    try {
      int qSize = eventQueue.size();
      if (qSize !=0 && qSize %1000 == 0) {
        LOG.info("Size of " + getName() + " event-queue is " + qSize);
      }
      int remCapacity = eventQueue.remainingCapacity();
      if (remCapacity < 1000) {
        LOG.info("Very low remaining capacity on " + getName() + "" +
            "event queue: " + remCapacity);
      }
      // Put the event on the scheduler's event queue
      this.eventQueue.put(event);
    } catch (InterruptedException e) {
      LOG.info("Interrupted. Trying to exit gracefully.");
    }
  }

EventProcessor takes events off the queue and processes them:

 private final class EventProcessor implements Runnable {
    @Override
    public void run() {

      T event;

      while (!stopped && !Thread.currentThread().isInterrupted()) {
        try {
          event = eventQueue.take();
        } catch (InterruptedException e) {
          LOG.error("Returning, interrupted : " + e);
          return; // TODO: Kill RM.
        }

        try {
          // The handler here is the concrete scheduler that was passed in
          handler.handle(event);
        } catch (Throwable t) {
          // An error occurred, but we are shutting down anyway.
          // If it was an InterruptedException, the very act of
          // shutdown could have caused it and is probably harmless.
          if (stopped) {
            LOG.warn("Exception during shutdown: ", t);
            break;
          }
          LOG.fatal("Error in handling event type " + event.getType()
              + " to the Event Dispatcher", t);
          if (shouldExitOnError
              && !ShutdownHookManager.get().isShutdownInProgress()) {
            LOG.info("Exiting, bbye..");
            System.exit(-1);
          }
        }
      }
    }
  }
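
The producer/consumer split shown in handle() and EventProcessor can be reduced to a blocking queue plus one worker thread. The sketch below is illustrative only: QueuedDispatcherSketch is a hypothetical class, it uses a Consumer in place of YARN's EventHandler, and it omits the queue-size logging and the exit-on-error handling of the real EventDispatcher.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical queue-backed dispatcher: handle() enqueues, a worker thread delivers.
class QueuedDispatcherSketch<T> {
  private final BlockingQueue<T> eventQueue = new LinkedBlockingQueue<>();
  private final Consumer<T> handler;
  private final Thread processor;
  private volatile boolean stopped = false;

  QueuedDispatcherSketch(Consumer<T> handler) {
    this.handler = handler;
    this.processor = new Thread(this::drain, "sketch-event-processor");
    this.processor.setDaemon(true);
  }

  void start() {
    processor.start();
  }

  // Producer side: enqueue the event and return without running the handler.
  void handle(T event) {
    try {
      eventQueue.put(event);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  void stop() {
    stopped = true;
    processor.interrupt();
  }

  // Consumer side: take events one by one and pass them to the handler.
  private void drain() {
    while (!stopped && !Thread.currentThread().isInterrupted()) {
      T event;
      try {
        event = eventQueue.take();
      } catch (InterruptedException e) {
        return;
      }
      handler.accept(event);
    }
  }

  static List<String> demo() {
    List<String> seen = new CopyOnWriteArrayList<>();
    CountDownLatch done = new CountDownLatch(2);
    QueuedDispatcherSketch<String> dispatcher =
        new QueuedDispatcherSketch<String>(e -> { seen.add(e); done.countDown(); });
    dispatcher.start();
    dispatcher.handle("APP_ADDED");
    dispatcher.handle("APP_ATTEMPT_ADDED");
    try {
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    dispatcher.stop();
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

With a single consumer thread, events reach the handler in the order they were enqueued, so the scheduler never has to handle two events concurrently.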

The handler is passed in when ResourceManager initializes schedulerDispatcher:

 schedulerDispatcher = createSchedulerEventDispatcher();

 protected EventHandler<SchedulerEvent> createSchedulerEventDispatcher() {
    return new EventDispatcher(this.scheduler, "SchedulerEventDispatcher");
  }

There are three scheduler choices for the ResourceManager; FairScheduler is the one analyzed here:
ResourceManager.schedulerDispatcher.handle() -> FairScheduler.handle():

public void handle(SchedulerEvent event) {
    switch (event.getType()) {
    case NODE_ADDED:
    ...
    case NODE_REMOVED:
    ...
    case NODE_UPDATE:
    ... 
    case APP_ADDED:
      if (!(event instanceof AppAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAddedSchedulerEvent appAddedEvent = (AppAddedSchedulerEvent) event;
      String queueName =
          resolveReservationQueueName(appAddedEvent.getQueue(),
              appAddedEvent.getApplicationId(),
              appAddedEvent.getReservationID(),
              appAddedEvent.getIsAppRecovering());
      if (queueName != null) {
        addApplication(appAddedEvent.getApplicationId(),
            queueName, appAddedEvent.getUser(),
            appAddedEvent.getIsAppRecovering());
      }
      break;
    case APP_REMOVED:
    ... 
    case NODE_RESOURCE_UPDATE:
    ...
    case APP_ATTEMPT_ADDED:
    ...
    case APP_ATTEMPT_REMOVED:
    ...  
    case RELEASE_CONTAINER:
    ... 
    case CONTAINER_EXPIRED:
    ...  
    }
  }

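Since the incoming event is of type SchedulerEventType.APP_ADDED, we follow the APP_ADDED case and set the other event types aside for now. Note that this dispatch uses a switch rather than a state machine transition.
ResourceManager.schedulerDispatcher.handle() -> FairScheduler.handle() -> FairScheduler.addApplication():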

protected void addApplication(ApplicationId applicationId,
      String queueName, String user, boolean isAppRecovering) {
    // Validate the queue name
    if (queueName == null || queueName.isEmpty()) {
      String message =
          "Reject application " + applicationId + " submitted by user " + user
              + " with an empty queue name.";
      LOG.info(message);
      rmContext.getDispatcher().getEventHandler().handle(
          new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED,
              message));
      return;
    }

    if (queueName.startsWith(".") || queueName.endsWith(".")) {
      String message =
          "Reject application " + applicationId + " submitted by user " + user
              + " with an illegal queue name " + queueName + ". "
              + "The queue name cannot start/end with period.";
      LOG.info(message);
      rmContext.getDispatcher().getEventHandler().handle(
          new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED,
              message));
      return;
    }

    try {
      writeLock.lock();
      // Assign the application to a specific queue
      RMApp rmApp = rmContext.getRMApps().get(applicationId);
      FSLeafQueue queue = assignToQueue(rmApp, queueName, user);
      if (queue == null) {
        return;
      }

      // ACL checks and related operations
      UserGroupInformation userUgi = UserGroupInformation.createRemoteUser(
          user);

      if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi) && !queue
          .hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
        String msg = "User " + userUgi.getUserName()
            + " cannot submit applications to queue " + queue.getName()
            + "(requested queuename is " + queueName + ")";
        LOG.info(msg);
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED, msg));
        return;
      }
      // Create the SchedulerApplication, the application's representation inside the scheduler,
      // and map applicationId to it so that it can later be found by applicationId
      SchedulerApplication<FSAppAttempt> application =
          new SchedulerApplication<FSAppAttempt>(queue, user);
      applications.put(applicationId, application);
      // Record application metrics
      queue.getMetrics().submitApp(user);

      LOG.info("Accepted application " + applicationId + " from user: " + user
          + ", in queue: " + queue.getName()
          + ", currently num of applications: " + applications.size());
      // Recovery-related handling
      if (isAppRecovering) {
        if (LOG.isDebugEnabled()) {
          LOG.debug(applicationId
              + " is recovering. Skip notifying APP_ACCEPTED");
        }
      } else {
        if (rmApp != null && rmApp.getApplicationSubmissionContext() != null) {
          // Set the queue name on the ASC
          rmApp.getApplicationSubmissionContext().setQueue(queue.getName());
        }
        // Drive the RMApp state machine forward
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
      }
    } finally {
      writeLock.unlock();
    }
  }
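
The two queue-name checks at the top of addApplication() can be expressed as one small function. This is a sketch for illustration: QueueNameCheckSketch and rejectionReason are hypothetical names, and returning null for an acceptable name is a convention chosen for this example, whereas the real method sends an APP_REJECTED event and returns.

```java
// Hypothetical helper mirroring the queue-name checks in addApplication().
class QueueNameCheckSketch {
  // Returns the reason for rejecting the name, or null when both checks pass.
  static String rejectionReason(String queueName) {
    if (queueName == null || queueName.isEmpty()) {
      return "empty queue name";
    }
    if (queueName.startsWith(".") || queueName.endsWith(".")) {
      return "The queue name cannot start/end with period.";
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(rejectionReason("root.default"));
    System.out.println(rejectionReason(".default"));
    System.out.println(rejectionReason(""));
  }
}
```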

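The main work is to create the SchedulerApplication, the application's representation inside the scheduler, and record the mapping from applicationId to it for later lookups. It then sends an RMAppEventType.APP_ACCEPTED event, which in turn drives the RMApp state machine. In RMAppImpl, the transition and the operation that accompanies it are: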

addTransition(RMAppState.SUBMITTED, RMAppState.ACCEPTED,
        RMAppEventType.APP_ACCEPTED, new StartAppAttemptTransition())

private static final class StartAppAttemptTransition extends RMAppTransition {
    public void transition(RMAppImpl app, RMAppEvent event) {
      app.createAndStartNewAttempt(false);
    };
  }

 private void
      createAndStartNewAttempt(boolean transferStateFromPreviousAttempt) {
    createNewAttempt();
    handler.handle(new RMAppStartAttemptEvent(currentAttempt.getAppAttemptId(),
      transferStateFromPreviousAttempt));
  }
// Generate the current ApplicationAttemptId
private void createNewAttempt() {
    ApplicationAttemptId appAttemptId =
        ApplicationAttemptId.newInstance(applicationId, nextAttemptId++);
    createNewAttempt(appAttemptId);
  }
// Then create a new AppAttempt from this ApplicationAttemptId
private void createNewAttempt(ApplicationAttemptId appAttemptId) {
    BlacklistManager currentAMBlacklistManager;
    if (currentAttempt != null) {
      currentAMBlacklistManager = currentAttempt.getAMBlacklistManager();
    } else {
      if (amBlacklistingEnabled && !submissionContext.getUnmanagedAM()) {
        currentAMBlacklistManager = new SimpleBlacklistManager(
            RMServerUtils.getApplicableNodeCountForAM(rmContext, conf,
                getAMResourceRequests()),
            blacklistDisableThreshold);
      } else {
        currentAMBlacklistManager = new DisabledBlacklistManager();
      }
    }
    // Create a new RMAppAttempt
    RMAppAttempt attempt =
        new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
          submissionContext, conf, amReqs, this, currentAMBlacklistManager);
    // Store the appAttemptId -> attempt mapping in attempts for later lookups
    attempts.put(appAttemptId, attempt);
    currentAttempt = attempt;
  }
 public RMAppStartAttemptEvent(ApplicationAttemptId appAttemptId,
      boolean transferStateFromPreviousAttempt) {
    super(appAttemptId, RMAppAttemptEventType.START);
    this.transferStateFromPreviousAttempt = transferStateFromPreviousAttempt;
  }
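
The bookkeeping in createNewAttempt() amounts to a counter, a map of attempts and a pointer to the current one. The sketch below shows only that bookkeeping; AttemptTrackerSketch is hypothetical, the id format and the "NEW" placeholder value are invented for illustration, and starting the counter at 1 is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-application attempt bookkeeping.
class AttemptTrackerSketch {
  private final String applicationId;
  // Insertion-ordered map from attempt id to a placeholder attempt state.
  private final Map<String, String> attempts = new LinkedHashMap<>();
  private int nextAttemptId = 1; // assumed starting value
  private String currentAttemptId;

  AttemptTrackerSketch(String applicationId) {
    this.applicationId = applicationId;
  }

  // Allocates the next attempt id, records the attempt and makes it current.
  String createNewAttempt() {
    String attemptId = applicationId + "_attempt_" + nextAttemptId++;
    attempts.put(attemptId, "NEW");
    currentAttemptId = attemptId;
    return attemptId;
  }

  String getCurrentAttemptId() {
    return currentAttemptId;
  }

  int attemptCount() {
    return attempts.size();
  }

  public static void main(String[] args) {
    AttemptTrackerSketch tracker = new AttemptTrackerSketch("application_1_0001");
    System.out.println(tracker.createNewAttempt());
    System.out.println(tracker.createNewAttempt());
  }
}
```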

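Now that the scheduler has accepted the application, the RMAppAttemptEventType.START event can be fired to begin one run attempt of the application. Here are the transition and the transition operation for this event in RMAppAttemptImpl: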

addTransition(RMAppAttemptState.NEW, RMAppAttemptState.SUBMITTED,
          RMAppAttemptEventType.START, new AttemptStartedTransition())

private static final class AttemptStartedTransition extends BaseTransition {
    public void transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {
      // Whether to carry over state from the previous attempt
      boolean transferStateFromPreviousAttempt = false;
      if (event instanceof RMAppStartAttemptEvent) {
        transferStateFromPreviousAttempt =
            ((RMAppStartAttemptEvent) event)
              .getTransferStateFromPreviousAttempt();
      }
      appAttempt.startTime = System.currentTimeMillis();

      // Register the appAttempt with ApplicationMasterService, the service dedicated to talking to AMs
      appAttempt.masterService
          .registerAppAttempt(appAttempt.applicationAttemptId);

      if (UserGroupInformation.isSecurityEnabled()) {
        appAttempt.clientTokenMasterKey =
            appAttempt.rmContext.getClientToAMTokenSecretManager()
              .createMasterKey(appAttempt.applicationAttemptId);
      }

      // Notify the scheduler so that it adds this application attempt
      appAttempt.eventHandler.handle(new AppAttemptAddedSchedulerEvent(
        appAttempt.applicationAttemptId, transferStateFromPreviousAttempt));
    }
  }

  public AppAttemptAddedSchedulerEvent(
      ApplicationAttemptId applicationAttemptId,
      boolean transferStateFromPreviousAttempt) {
    this(applicationAttemptId, transferStateFromPreviousAttempt, false);
  }

  public AppAttemptAddedSchedulerEvent(
      ApplicationAttemptId applicationAttemptId,
      boolean transferStateFromPreviousAttempt,
      boolean isAttemptRecovering) {
    super(SchedulerEventType.APP_ATTEMPT_ADDED);
    this.applicationAttemptId = applicationAttemptId;
    this.transferStateFromPreviousAttempt = transferStateFromPreviousAttempt;
    this.isAttemptRecovering = isAttemptRecovering;
  }

The main work is to add this application attempt to the scheduler and have the scheduler react to it. The event is SchedulerEventType.APP_ATTEMPT_ADDED, and FairScheduler handles it as follows:

public void handle(SchedulerEvent event) {
    switch (event.getType()) {
     case APP_ATTEMPT_ADDED:
      if (!(event instanceof AppAttemptAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAttemptAddedSchedulerEvent appAttemptAddedEvent =
          (AppAttemptAddedSchedulerEvent) event;
      addApplicationAttempt(appAttemptAddedEvent.getApplicationAttemptId(),
        appAttemptAddedEvent.getTransferStateFromPreviousAttempt(),
        appAttemptAddedEvent.getIsAttemptRecovering());
      break;
    }
 }

FairScheduler.addApplicationAttempt() is then called to add the application attempt to the scheduler:

  protected void addApplicationAttempt(
      ApplicationAttemptId applicationAttemptId,
      boolean transferStateFromPreviousAttempt,
      boolean isAttemptRecovering) {
    try {
      writeLock.lock();
      // Look up the scheduler's SchedulerApplication by applicationId
      SchedulerApplication<FSAppAttempt> application = applications.get(
          applicationAttemptId.getApplicationId());
      String user = application.getUser();
      FSLeafQueue queue = (FSLeafQueue) application.getQueue();
      // Create FSAppAttempt, the AppAttempt's representation inside the scheduler
      FSAppAttempt attempt = new FSAppAttempt(this, applicationAttemptId, user,
          queue, new ActiveUsersManager(getRootQueueMetrics()), rmContext);
      if (transferStateFromPreviousAttempt) {
        attempt.transferStateFromPreviousAttempt(
            application.getCurrentAppAttempt());
      }
      // Make the newly created FSAppAttempt the current attempt
      application.setCurrentAppAttempt(attempt);

      boolean runnable = maxRunningEnforcer.canAppBeRunnable(queue, attempt);
      queue.addApp(attempt, runnable);
      if (runnable) {
        maxRunningEnforcer.trackRunnableApp(attempt);
      } else{
        maxRunningEnforcer.trackNonRunnableApp(attempt);
      }
      // Record the attempt in the queue metrics
      queue.getMetrics().submitAppAttempt(user);

      LOG.info("Added Application Attempt " + applicationAttemptId
          + " to scheduler from user: " + user);

      if (isAttemptRecovering) {
        // No notification is needed when the attempt is being recovered
        if (LOG.isDebugEnabled()) {
          LOG.debug(applicationAttemptId
              + " is recovering. Skipping notifying ATTEMPT_ADDED");
        }
      } else{
        // Send an RMAppAttemptEventType.ATTEMPT_ADDED event to the RMAppAttempt
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppAttemptEvent(applicationAttemptId,
                RMAppAttemptEventType.ATTEMPT_ADDED));
      }
    } finally {
      writeLock.unlock();
    }
  }

The corresponding state machine transition in RMAppAttemptImpl is:

addTransition(RMAppAttemptState.SUBMITTED, 
          EnumSet.of(RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING,
                     RMAppAttemptState.SCHEDULED),
          RMAppAttemptEventType.ATTEMPT_ADDED,
          new ScheduleTransition())

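This is a multi-arc transition: the post-state depends on whether the AM runs in a container requested from the RM and is managed by it, or is an unmanaged AM (getUnmanagedAM()) that is started outside the RM's control.
ScheduleTransition.transition: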

 public static final class ScheduleTransition
      implements
      MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {

    public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {
      ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
      // Managed AM: it runs in a container requested from the RM and is controlled by the RM
      if (!subCtx.getUnmanagedAM()) {
        for (ResourceRequest amReq : appAttempt.amReqs) {
          // The AM needs exactly one container, requested at the AM container priority
          amReq.setNumContainers(1);
          amReq.setPriority(AM_CONTAINER_PRIORITY);
        }

        int numNodes =
            RMServerUtils.getApplicableNodeCountForAM(appAttempt.rmContext,
                appAttempt.conf, appAttempt.amReqs);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Setting node count for blacklist to " + numNodes);
        }
        appAttempt.getAMBlacklistManager().refreshNodeHostCount(numNodes);

        ResourceBlacklistRequest amBlacklist =
            appAttempt.getAMBlacklistManager().getBlacklistUpdates();
        if (LOG.isDebugEnabled()) {
          LOG.debug("Using blacklist for AM: additions(" +
              amBlacklist.getBlacklistAdditions() + ") and removals(" +
              amBlacklist.getBlacklistRemovals() + ")");
        }
        // Call the concrete scheduler; here this is FairScheduler's allocation path
        Allocation amContainerAllocation =
            appAttempt.scheduler.allocate(
                appAttempt.applicationAttemptId,
                appAttempt.amReqs, null, EMPTY_CONTAINER_RELEASE_LIST,
                amBlacklist.getBlacklistAdditions(),
                amBlacklist.getBlacklistRemovals(),
                new ContainerUpdates());
        if (amContainerAllocation != null
            && amContainerAllocation.getContainers() != null) {
          assert (amContainerAllocation.getContainers().size() == 0);
        }
        // No container is expected back from this first call (see the assert above), so move to RMAppAttemptState.SCHEDULED
        return RMAppAttemptState.SCHEDULED;
      } else {
        // Unmanaged AM: the RM does not launch or manage it, the client starts it itself. Store the attempt and move to the RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING state
        appAttempt.storeAttempt();
        return RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING;
      }
    }
  }
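
To make the multiple-arc idea concrete, here is a minimal, self-contained sketch. It does not use the YARN state machine classes: `MultiArc`, `Attempt` and `fire` are hypothetical names invented for this illustration. The point it shows is that the transition hook itself picks the next state, and the state machine only checks that the result is one of the declared targets.

```java
import java.util.EnumSet;

public class MultiArcSketch {
    enum AttemptState { SUBMITTED, SCHEDULED, LAUNCHED_UNMANAGED_SAVING }

    /** Hypothetical stand-in for a multiple-arc transition hook. */
    interface MultiArc<O> {
        AttemptState transition(O operand);
    }

    /** Hypothetical operand that only carries the unmanaged-AM flag. */
    static final class Attempt {
        final boolean unmanagedAM;
        AttemptState state = AttemptState.SUBMITTED;

        Attempt(boolean unmanagedAM) {
            this.unmanagedAM = unmanagedAM;
        }
    }

    /** The set of target states declared for the arc. */
    static final EnumSet<AttemptState> ALLOWED = EnumSet.of(
        AttemptState.LAUNCHED_UNMANAGED_SAVING, AttemptState.SCHEDULED);

    /** The hook decides the target state from the operand. */
    static final MultiArc<Attempt> SCHEDULE = a -> a.unmanagedAM
        ? AttemptState.LAUNCHED_UNMANAGED_SAVING
        : AttemptState.SCHEDULED;

    /** Runs the hook and rejects any state outside the declared targets. */
    static AttemptState fire(Attempt a) {
        AttemptState next = SCHEDULE.transition(a);
        if (!ALLOWED.contains(next)) {
            throw new IllegalStateException("Invalid target state: " + next);
        }
        a.state = next;
        return next;
    }

    public static void main(String[] args) {
        System.out.println(fire(new Attempt(false)));
        System.out.println(fire(new Attempt(true)));
    }
}
```

A single-arc transition would have a fixed target state and a hook that returns nothing; the multiple-arc form is only needed when the outcome depends on data carried by the operand or the event.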

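Note that amContainerAllocation is expected to contain no containers at this point (the assert checks exactly that), so the managed branch returns RMAppAttemptState.SCHEDULED and the attempt stays in that state. How does the state machine move forward from there? Let's follow the scheduler's allocation path and see how the RMAppAttempt state machine changes once a container has been assigned.
appAttempt.scheduler.allocate() actually calls FairScheduler.allocate(); we look directly at the container-related part: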

 public Allocation allocate(ApplicationAttemptId appAttemptId,
      List<ResourceRequest> ask, List<SchedulingRequest> schedulingRequests,
      List<ContainerId> release, List<String> blacklistAdditions,
      List<String> blacklistRemovals, ContainerUpdates updateRequests) {

    List<Container> newlyAllocatedContainers =
        application.pullNewlyAllocatedContainers();

    return new Allocation(newlyAllocatedContainers, headroom,
        preemptionContainerIds, null, null,
        application.pullUpdatedNMTokens(), null, null,
        application.pullNewlyPromotedContainers(),
        application.pullNewlyDemotedContainers(),
        application.pullPreviousAttemptContainers());
  }

Containers are collected by calling application.pullNewlyAllocatedContainers():

 public List<Container> pullNewlyAllocatedContainers() {
    try {
      writeLock.lock();
      List<Container> returnContainerList = new ArrayList<Container>(
          newlyAllocatedContainers.size());

      Iterator<RMContainer> i = newlyAllocatedContainers.iterator();
      while (i.hasNext()) {
        RMContainer rmContainer = i.next();
        Container updatedContainer =
            updateContainerAndNMToken(rmContainer, null);
        // A container was collected: add it to returnContainerList to be returned
        if (updatedContainer != null) {
          returnContainerList.add(updatedContainer);
          i.remove();
        }
      }
      return returnContainerList;
    } finally {
      writeLock.unlock();
    }
  }
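
The "pull" pattern used here can be sketched in isolation. This is a simplified illustration, not the FSAppAttempt code: `PullSketch` and `onContainerAssigned` are hypothetical names, and containers are reduced to plain strings. One path appends to a buffer when a container is assigned, and the allocate path drains whatever has accumulated, which is why a call made before any assignment comes back empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class PullSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> newlyAllocated = new ArrayList<>();

    /** Hypothetical hook: called when the scheduler assigns a container. */
    void onContainerAssigned(String containerId) {
        lock.lock();
        try {
            newlyAllocated.add(containerId);
        } finally {
            lock.unlock();
        }
    }

    /** Called from the allocate path: returns the buffered ids and clears the buffer. */
    List<String> pullNewlyAllocated() {
        lock.lock();
        try {
            List<String> out = new ArrayList<>(newlyAllocated);
            newlyAllocated.clear();
            return out;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        PullSketch app = new PullSketch();
        System.out.println(app.pullNewlyAllocated()); // nothing assigned yet
        app.onContainerAssigned("container_1");
        System.out.println(app.pullNewlyAllocated()); // now contains container_1
    }
}
```

Unlike this sketch, the excerpt above removes a container from the list only when updateContainerAndNMToken returns a non-null result, so a container whose update fails stays queued for a later pull.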

Next, updateContainerAndNMToken(rmContainer, null):

 private Container updateContainerAndNMToken(RMContainer rmContainer,
      ContainerUpdateType updateType) {
    // Key part only; the code before this point is omitted
    ......
    if (updateType == null) {
      // This is a newly allocated container
      rmContainer.handle(new RMContainerEvent(
          rmContainer.getContainerId(), RMContainerEventType.ACQUIRED));
    } else {
    }
    return container;
  }

The state machine transition that RMContainerImpl makes in response to this event:

addTransition(RMContainerState.NEW, RMContainerState.ACQUIRED,
        RMContainerEventType.ACQUIRED, new AcquiredTransition())

The transition function that accompanies this state change is:

 private static final class AcquiredTransition extends BaseTransition {

    public void transition(RMContainerImpl container, RMContainerEvent event) {
      // Register the container with the container allocation expirer service
      container.containerAllocationExpirer.register(
          new AllocationExpirationInfo(container.getContainerId()));

      // Notify the app so that its state machine can advance
      container.eventHandler.handle(new RMAppRunningOnNodeEvent(container
          .getApplicationAttemptId().getApplicationId(), container.nodeId));
    }
  }

 public RMAppRunningOnNodeEvent(ApplicationId appId, NodeId node) {
    super(appId, RMAppEventType.APP_RUNNING_ON_NODE);
    this.node = node;
  }

This tells the RMApp that the app has now acquired a container on a particular NM node.

private static final class AppRunningOnNodeTransition extends RMAppTransition {
    public void transition(RMAppImpl app, RMAppEvent event) {
      RMAppRunningOnNodeEvent nodeAddedEvent = (RMAppRunningOnNodeEvent) event;

      // If the app is already in a final state, tell the RMNode it ran on to clean up this app
      if (isAppInFinalState(app)) {
        app.handler.handle(
            new RMNodeCleanAppEvent(nodeAddedEvent.getNodeId(), nodeAddedEvent
                .getApplicationId()));
        return;
      }

      // Otherwise record the node so it can be handled later
      app.ranNodes.add(nodeAddedEvent.getNodeId());
    }
  }

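Once a container has been collected, the application can use it to run. That brings us back to the other question: when is there a container to collect at all? This is driven by NM node heartbeats. In the JVM that hosts the NodeManager there is a NodeStatusUpdaterImpl thread, which periodically gathers the node's status and reports it to the RM:
NodeStatusUpdaterImpl.statusUpdaterRunnable.run() ->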

private class StatusUpdaterRunnable implements Runnable {

    public void run() {
      int lastHeartbeatID = 0;
      while (!isStopped) {
        // Send heartbeat
        try {
          NodeHeartbeatResponse response = null;
          Set<NodeLabel> nodeLabelsForHeartbeat =
              nodeLabelsHandler.getNodeLabelsForHeartbeat();
          // Gather this node's status information
          NodeStatus nodeStatus = getNodeStatus(lastHeartbeatID);
          // Build a heartbeat request
          NodeHeartbeatRequest request =
              NodeHeartbeatRequest.newInstance(nodeStatus,
                  NodeStatusUpdaterImpl.this.context
                      .getContainerTokenSecretManager().getCurrentKey(),
                  NodeStatusUpdaterImpl.this.context
                      .getNMTokenSecretManager().getCurrentKey(),
                  nodeLabelsForHeartbeat,
                  NodeStatusUpdaterImpl.this.context
                      .getRegisteringCollectors());

          // Send the heartbeat request to the RM and receive the response
          response = resourceTracker.nodeHeartbeat(request);
          //get next heartbeat interval from response
          nextHeartBeatInterval = response.getNextHeartBeatInterval();
          updateMasterKeys(response);

          if (!handleShutdownOrResyncCommand(response)) {
            nodeLabelsHandler.verifyRMHeartbeatResponseForNodeLabels(
                response);

            // As instructed by the RM, remove containers and applications that no longer need to run
            removeOrTrackCompletedContainersFromContext(response
                .getContainersToBeRemovedFromNM());

            logAggregationReportForAppsTempList.clear();
            lastHeartbeatID = response.getResponseId();
            List<ContainerId> containersToCleanup = response
                .getContainersToCleanup();
            if (!containersToCleanup.isEmpty()) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedContainersEvent(containersToCleanup,
                      CMgrCompletedContainersEvent.Reason
                          .BY_RESOURCEMANAGER));
            }
            List<ApplicationId> appsToCleanup =
                response.getApplicationsToCleanup();
            //Only start tracking for keepAlive on FINISH_APP
            trackAppsForKeepAlive(appsToCleanup);
            if (!appsToCleanup.isEmpty()) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedAppsEvent(appsToCleanup,
                      CMgrCompletedAppsEvent.Reason.BY_RESOURCEMANAGER));
            }
            Map<ApplicationId, ByteBuffer> systemCredentials =
                response.getSystemCredentialsForApps();
            if (systemCredentials != null && !systemCredentials.isEmpty()) {
              ((NMContext) context).setSystemCrendentialsForApps(
                  parseCredentials(systemCredentials));
            }
            List<org.apache.hadoop.yarn.api.records.Container>
                containersToUpdate = response.getContainersToUpdate();
            if (!containersToUpdate.isEmpty()) {
              dispatcher.getEventHandler().handle(
                  new CMgrUpdateContainersEvent(containersToUpdate));
            }


            List<SignalContainerRequest> containersToSignal = response
                .getContainersToSignalList();
            if (!containersToSignal.isEmpty()) {
              dispatcher.getEventHandler().handle(
                  new CMgrSignalContainersEvent(containersToSignal));
            }

            // Update QueuingLimits if ContainerManager supports queuing
            ContainerQueuingLimit queuingLimit =
                response.getContainerQueuingLimit();
            if (queuingLimit != null) {
              context.getContainerManager().updateQueuingLimit(queuingLimit);
            }
          }
          // Handling node resource update case.
          Resource newResource = response.getResource();
          if (newResource != null) {
            updateNMResource(newResource);
            if (LOG.isDebugEnabled()) {
              LOG.debug("Node's resource is updated to " +
                  newResource.toString());
            }
          }
          if (YarnConfiguration.timelineServiceV2Enabled(context.getConf())) {
            updateTimelineCollectorData(response);
          }

        } catch (ConnectException e) {
          //catch and throw the exception if tried MAX wait time to connect RM
          dispatcher.getEventHandler().handle(
              new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
          // failed to connect to RM.
          failedToConnect = true;
          throw new YarnRuntimeException(e);
        } catch (Exception e) {

          // TODO Better error handling. Thread can die with the rest of the
          // NM still running.
          LOG.error("Caught exception in status-updater", e);
        } finally {
          synchronized (heartbeatMonitor) {
            nextHeartBeatInterval = nextHeartBeatInterval <= 0 ?
                YarnConfiguration.DEFAULT_RM_NM_HEARTBEAT_INTERVAL_MS :
                nextHeartBeatInterval;
            try {
              // Sleep for the heartbeat interval, then start the next iteration
              heartbeatMonitor.wait(nextHeartBeatInterval);
            } catch (InterruptedException e) {
              // Do Nothing
            }
          }
        }
      }
    }
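
Stripped of all the response handling, the loop above has a simple shape: send a heartbeat, take the next interval from the response, fall back to a default if that interval is not positive, then wait on a monitor. The sketch below shows just that shape. `Tracker`, `runHeartbeats` and the default value are assumptions made for the illustration, and the loop runs a fixed number of rounds instead of checking an isStopped flag.

```java
public class HeartbeatLoopSketch {
    /** Hypothetical default, standing in for the configured heartbeat interval. */
    static final long DEFAULT_INTERVAL_MS = 1000L;

    /** Hypothetical RM stub: takes the last response id, returns the next interval in ms. */
    interface Tracker {
        long heartbeat(int lastResponseId);
    }

    /** Falls back to the default when the response carries no positive interval. */
    static long effectiveInterval(long fromResponse) {
        return fromResponse <= 0 ? DEFAULT_INTERVAL_MS : fromResponse;
    }

    /** Waits on the monitor; returns false if the thread was interrupted. */
    private static boolean pause(Object monitor, long millis) {
        synchronized (monitor) {
            try {
                monitor.wait(millis);
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Sends up to {@code rounds} heartbeats and returns how many succeeded. */
    static int runHeartbeats(Tracker tracker, int rounds) {
        final Object monitor = new Object();
        int sent = 0;
        int lastResponseId = 0;
        for (int i = 0; i < rounds; i++) {
            long next = 0;
            boolean keepGoing;
            try {
                next = tracker.heartbeat(lastResponseId);
                lastResponseId++;
                sent++;
            } catch (RuntimeException e) {
                // A failed heartbeat is logged; the loop still waits and retries.
                System.err.println("Heartbeat failed: " + e);
            } finally {
                keepGoing = pause(monitor, effectiveInterval(next));
            }
            if (!keepGoing) {
                break;
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        int sent = runHeartbeats(id -> 10L, 3);
        System.out.println("Heartbeats sent: " + sent);
    }
}
```

Waiting on a monitor rather than calling Thread.sleep means another thread can call notify on the same object to trigger the next heartbeat early.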

Now let's look at what the RM does when it receives an NM heartbeat:
ResourceTracker.nodeHeartbeat() -> ResourceTrackerService.nodeHeartbeat():

 public NodeHeartbeatResponse nodeHeartbeat(NodeHeartbeatRequest request)
      throws YarnException, IOException {

    NodeStatus remoteNodeStatus = request.getNodeStatus();

    NodeId nodeId = remoteNodeStatus.getNodeId();
    // Some validity checks omitted

    // Record that this node is alive
    this.nmLivelinessMonitor.receivedPing(nodeId);
    this.decommissioningWatcher.update(rmNode, remoteNodeStatus);


    // Some logging-related code omitted
    // Prepare the heartbeat response
    NodeHeartbeatResponse nodeHeartBeatResponse =
        YarnServerBuilderUtils.newNodeHeartbeatResponse(
            getNextResponseId(lastNodeHeartbeatResponse.getResponseId()),
            NodeAction.NORMAL, null, null, null, null, nextHeartBeatInterval);
    rmNode.setAndUpdateNodeHeartbeatResponse(nodeHeartBeatResponse);

    populateKeys(request, nodeHeartBeatResponse);

    ConcurrentMap<ApplicationId, ByteBuffer> systemCredentials =
        rmContext.getSystemCredentialsForApps();
    if (!systemCredentials.isEmpty()) {
      nodeHeartBeatResponse.setSystemCredentialsForApps(systemCredentials);
    }


    // Send a status update to the RMNode; the most recent heartbeat response was saved on it above
    RMNodeStatusEvent nodeStatusEvent =
        new RMNodeStatusEvent(nodeId, remoteNodeStatus);
    if (request.getLogAggregationReportsForApps() != null
        && !request.getLogAggregationReportsForApps().isEmpty()) {
      nodeStatusEvent.setLogAggregationReportsForApps(request
        .getLogAggregationReportsForApps());
    }

    // Use this event to drive the RMNodeImpl state machine
    this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);


   //...
    return nodeHeartBeatResponse;
  }

The RM keeps an RMNodeImpl object for every registered NM node. When it receives a heartbeat from an NM, it sends a STATUS_UPDATE event to the corresponding RMNodeImpl:

addTransition(NodeState.RUNNING,
          EnumSet.of(NodeState.RUNNING, NodeState.UNHEALTHY),
          RMNodeEventType.STATUS_UPDATE,
          new StatusUpdateWhenHealthyTransition())

  public static class StatusUpdateWhenHealthyTransition implements
      MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {

    public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {

      RMNodeStatusEvent statusEvent = (RMNodeStatusEvent) event;
      rmNode.setOpportunisticContainersStatus(
          statusEvent.getOpportunisticContainersStatus());
      NodeHealthStatus remoteNodeHealthStatus = updateRMNodeFromStatusEvents(
          rmNode, statusEvent);
      NodeState initialState = rmNode.getState();
      boolean isNodeDecommissioning =
          initialState.equals(NodeState.DECOMMISSIONING);
      if (isNodeDecommissioning) {
        List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
        if (rmNode.runningApplications.isEmpty() &&
            (keepAliveApps == null || keepAliveApps.isEmpty())) {
          RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
          return NodeState.DECOMMISSIONED;
        }
      }
      // The node is no longer healthy
      if (!remoteNodeHealthStatus.getIsNodeHealthy()) {
        LOG.info("Node " + rmNode.nodeId +
            " reported UNHEALTHY with details: " +
            remoteNodeHealthStatus.getHealthReport());
        // if a node in decommissioning receives an unhealthy report,
        // it will stay in decommissioning.
        if (isNodeDecommissioning) {
          return NodeState.DECOMMISSIONING;
        } else {
          reportNodeUnusable(rmNode, NodeState.UNHEALTHY);
          return NodeState.UNHEALTHY;
        }
      }
      // The node is healthy
      rmNode.handleContainerStatus(statusEvent.getContainers());
      rmNode.handleReportedIncreasedContainers(
          statusEvent.getNMReportedIncreasedContainers());

      List<LogAggregationReport> logAggregationReportsForApps =
          statusEvent.getLogAggregationReportsForApps();
      if (logAggregationReportsForApps != null
          && !logAggregationReportsForApps.isEmpty()) {
        rmNode.handleLogAggregationStatus(logAggregationReportsForApps);
      }

      if(rmNode.nextHeartBeat) {
        rmNode.nextHeartBeat = false;
        rmNode.context.getDispatcher().getEventHandler().handle(
            new NodeUpdateSchedulerEvent(rmNode));
      }

      // Update DTRenewer in secure mode to keep these apps alive. Today this is
      // needed for log-aggregation to finish long after the apps are gone.
      if (UserGroupInformation.isSecurityEnabled()) {
        rmNode.context.getDelegationTokenRenewer().updateKeepAliveApplications(
          statusEvent.getKeepAliveAppIds());
      }

      return initialState;
    }
  }

The healthy-node case is our main path, so let's look at it:
rmNode.handleContainerStatus():

 private void handleContainerStatus(List<ContainerStatus> containerStatuses) {
    // Collect the containers newly launched on this node and those newly completed
    List<ContainerStatus> newlyLaunchedContainers =
        new ArrayList<ContainerStatus>();
    List<ContainerStatus> newlyCompletedContainers =
        new ArrayList<ContainerStatus>();
    int numRemoteRunningContainers = 0;
    for (ContainerStatus remoteContainer : containerStatuses) {
      ContainerId containerId = remoteContainer.getContainerId();

      // Containers already being cleaned up because their application was killed need no further processing
      if (containersToClean.contains(containerId)) {
        LOG.info("Container " + containerId + " already scheduled for "
            + "cleanup, no further processing");
        continue;
      }

      ApplicationId containerAppId =
          containerId.getApplicationAttemptId().getApplicationId();

      if (finishedApplications.contains(containerAppId)) {
        LOG.info("Container " + containerId
            + " belongs to an application that is already killed,"
            + " no further processing");
        continue;
      } else if (!runningApplications.contains(containerAppId)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Container " + containerId
              + " is the first container get launched for application "
              + containerAppId);
        }
        handleRunningAppOnNode(this, context, containerAppId, nodeId);
      }

      // Handle containers in the RUNNING state
      if (remoteContainer.getState() == ContainerState.RUNNING) {
        ++numRemoteRunningContainers;
        if (!launchedContainers.contains(containerId)) {
          // A container that has just been launched: let the RM know about it
          launchedContainers.add(containerId);
          newlyLaunchedContainers.add(remoteContainer);
          // Unregister from containerAllocationExpirer.
          containerAllocationExpirer
              .unregister(new AllocationExpirationInfo(containerId));
        }
      } else {
        // A container that has just finished: remove it from the launched set
        launchedContainers.remove(containerId);
        if (completedContainers.add(containerId)) {
          newlyCompletedContainers.add(remoteContainer);
        }
        // Unregister from containerAllocationExpirer.
        containerAllocationExpirer
            .unregister(new AllocationExpirationInfo(containerId));
      }
    }
    // Lost containers are treated as containers that have just completed
    List<ContainerStatus> lostContainers =
        findLostContainers(numRemoteRunningContainers, containerStatuses);
    for (ContainerStatus remoteContainer : lostContainers) {
      ContainerId containerId = remoteContainer.getContainerId();
      if (completedContainers.add(containerId)) {
        newlyCompletedContainers.add(remoteContainer);
      }
    }
    // Finally, add the newly launched and newly completed containers to nodeUpdateQueue, the queue of container changes
    if (newlyLaunchedContainers.size() != 0
        || newlyCompletedContainers.size() != 0) {
      nodeUpdateQueue.add(new UpdatedContainerInfo(newlyLaunchedContainers,
          newlyCompletedContainers));
    }
  }
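
The core of handleContainerStatus is a diff between what the NM reports and what the RM has already seen. The sketch below isolates that idea. It is a simplified illustration with hypothetical names (`ContainerDiffSketch`, `Status`), and it leaves out the cleanup checks, the expirer and lost-container handling: a container is "newly launched" the first time it is reported RUNNING, and "newly completed" the first time it is reported in any other state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainerDiffSketch {
    enum State { RUNNING, COMPLETE }

    /** Hypothetical, minimal container status: an id plus a state. */
    static final class Status {
        final String id;
        final State state;

        Status(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    // What has been seen across earlier heartbeats.
    private final Set<String> launched = new HashSet<>();
    private final Set<String> completed = new HashSet<>();

    // Result of the most recent heartbeat only.
    final List<String> newlyLaunched = new ArrayList<>();
    final List<String> newlyCompleted = new ArrayList<>();

    /** Classifies one heartbeat's statuses against the remembered sets. */
    void handle(List<Status> reported) {
        newlyLaunched.clear();
        newlyCompleted.clear();
        for (Status s : reported) {
            if (s.state == State.RUNNING) {
                // Set.add returns true only the first time the id is seen.
                if (launched.add(s.id)) {
                    newlyLaunched.add(s.id);
                }
            } else {
                launched.remove(s.id);
                if (completed.add(s.id)) {
                    newlyCompleted.add(s.id);
                }
            }
        }
    }

    public static void main(String[] args) {
        ContainerDiffSketch node = new ContainerDiffSketch();
        List<Status> report = new ArrayList<>();
        report.add(new Status("c1", State.RUNNING));
        node.handle(report);
        System.out.println("launched: " + node.newlyLaunched);
        node.handle(report); // same report again: nothing is new
        System.out.println("launched: " + node.newlyLaunched);
    }
}
```

Because only the differences are queued, repeated heartbeats that report the same running containers produce no new work for the scheduler.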

Back in StatusUpdateWhenHealthyTransition:

  public static class StatusUpdateWhenHealthyTransition implements
      MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
    public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {

     //....

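      // The container changes were queued above; notify the scheduler if the nextHeartBeat flag is set
      if(rmNode.nextHeartBeat) {
        // Clear the flag so that the event is sent only once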
        rmNode.nextHeartBeat = false;
        rmNode.context.getDispatcher().getEventHandler().handle(
            new NodeUpdateSchedulerEvent(rmNode));
      }



      return initialState;
    }
  }
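
The nextHeartBeat flag acts as a gate: several heartbeats may queue container changes, but only one NODE_UPDATE event should be outstanding at a time. The sketch below illustrates that gating under one explicit assumption that the excerpt does not show: the scheduler re-arms the flag when it pulls the queued updates. `UpdateGateSketch`, `onHeartbeat` and `pullUpdates` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class UpdateGateSketch {
    private boolean nextHeartBeat = true;
    private final Queue<String> nodeUpdateQueue = new ArrayDeque<>();

    /** Counts how many scheduler events would have been dispatched. */
    int schedulerEventsSent = 0;

    /** Node side: queue the change, and notify the scheduler at most once. */
    void onHeartbeat(String update) {
        nodeUpdateQueue.add(update);
        if (nextHeartBeat) {
            nextHeartBeat = false;
            schedulerEventsSent++;
        }
    }

    /** Scheduler side (assumed behaviour): drain the queue and re-arm the gate. */
    List<String> pullUpdates() {
        List<String> out = new ArrayList<>(nodeUpdateQueue);
        nodeUpdateQueue.clear();
        nextHeartBeat = true;
        return out;
    }

    public static void main(String[] args) {
        UpdateGateSketch node = new UpdateGateSketch();
        node.onHeartbeat("u1");
        node.onHeartbeat("u2");
        System.out.println("events: " + node.schedulerEventsSent);
        System.out.println("pulled: " + node.pullUpdates());
    }
}
```

With this arrangement a slow scheduler is not flooded with one event per heartbeat; it receives a single event and then picks up every change queued in the meantime.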

new NodeUpdateSchedulerEvent():

 public NodeUpdateSchedulerEvent(RMNode rmNode) {
    super(SchedulerEventType.NODE_UPDATE);
    this.rmNode = rmNode;
  }

This raises a SchedulerEventType.NODE_UPDATE event for the scheduler; the scheduler in our case is FairScheduler:
FairScheduler.handle():

 public void handle(SchedulerEvent event) {
    switch (event.getType()) {

     case NODE_UPDATE:
      if (!(event instanceof NodeUpdateSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent)event;
      nodeUpdate(nodeUpdatedEvent.getRMNode());
      break;

    //.........
    }
  }

FairScheduler.handle() -> FairScheduler.nodeUpdate():

 protected void nodeUpdate(RMNode nm) {
    try {
      writeLock.lock();
      long start = getClock().getTime();
      super.nodeUpdate(nm);

      FSSchedulerNode fsNode = getFSSchedulerNode(nm.getNodeID());
      attemptScheduling(fsNode);

      long duration = getClock().getTime() - start;
      fsOpDurations.addNodeUpdateDuration(duration);
    } finally {
      writeLock.unlock();
    }
  }

FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling()

 void attemptScheduling(FSSchedulerNode node) {
    try {
      writeLock.lock();
      // Some validation omitted
      // First, assign the containers freed by preemption
      assignPreemptedContainers(node);
      FSAppAttempt reservedAppSchedulable = node.getReservedAppSchedulable();
      boolean validReservation = false;
      // Then try the container reserved on this node, if any
      if (reservedAppSchedulable != null) {
        validReservation = reservedAppSchedulable.assignReservedContainer(node);
      }
      if (!validReservation) {
        // No valid reservation: schedule, starting with the queue furthest below its fair share
        int assignedContainers = 0;
        Resource assignedResource = Resources.clone(Resources.none());
        // Cap on what may be assigned in this pass: half of the node's unallocated resources
        Resource maxResourcesToAssign = Resources.multiply(
            node.getUnallocatedResource(), 0.5f);
        while (node.getReservedContainer() == null) {
          // Starting from the root queue, schedule according to the scheduling rules and assign a container
          Resource assignment = queueMgr.getRootQueue().assignContainer(node);
          if (assignment.equals(Resources.none())) {
            break;
          }

          assignedContainers++;
          Resources.addTo(assignedResource, assignment);
          if (!shouldContinueAssigning(assignedContainers, maxResourcesToAssign,
              assignedResource)) {
            break;
          }
        }
      }
      // Update the root queue metrics
      updateRootQueueMetrics();
    } finally {
      writeLock.unlock();
    }
  }
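
The assignment loop in attemptScheduling can be sketched with resources reduced to a single integer. This is an illustration only: the excerpt does not show shouldContinueAssigning, so the stop rule used here (keep going while the total assigned so far is within the cap) is an assumption, and `AssignLoopSketch` and its parameters are hypothetical.

```java
public class AssignLoopSketch {
    /**
     * Assigns fixed-size containers on one node and returns how many were assigned.
     * All quantities are plain integers, for example MB of memory.
     */
    static int assignOnNode(int unallocated, int containerSize) {
        // Cap for this pass: half of what is currently unallocated.
        int maxResourcesToAssign = unallocated / 2;
        int assignedContainers = 0;
        int assignedResource = 0;
        int free = unallocated;

        while (true) {
            // Stands in for the root queue returning "none": nothing fits any more.
            if (containerSize <= 0 || containerSize > free) {
                break;
            }
            free -= containerSize;
            assignedContainers++;
            assignedResource += containerSize;
            if (!shouldContinue(maxResourcesToAssign, assignedResource)) {
                break;
            }
        }
        return assignedContainers;
    }

    /** Assumed stop rule: continue while the assigned total is within the cap. */
    static boolean shouldContinue(int maxResourcesToAssign, int assignedResource) {
        return assignedResource <= maxResourcesToAssign;
    }

    public static void main(String[] args) {
        System.out.println(assignOnNode(8, 3));
        System.out.println(assignOnNode(8, 10));
    }
}
```

Because the check runs after each assignment, the total can exceed the cap by one container; a rule that checked before assigning would stop one container earlier.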