Source-Code Walkthrough: HDFS Automatic Active/Standby Switching (HA Automatic Failover) with QJM

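I recently looked into NameNode HA Automatic Failover. When the Active NN can no longer serve requests, because of a fault or for some other reason, the NN in Standby state automatically switches to Active, which is what makes the cluster truly highly available.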

NN HA Automatic Failover architecture diagram

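Automatic failover depends on ZooKeeper and the ZKFC component. ZooKeeper mainly records state information about the NNs, and ZKFC runs as a separate JVM process on each node that hosts an NN. Let's first look at how the NN starts. While the NameNode object is being instantiated, if more than one NameNode is configured under dfs.ha.namenodes.${dfs.nameservices} in hdfs-site.xml, the haEnabled property is true and the initial value of state is STANDBY.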

HAUtil.isHAEnabled(Configuration conf, String nsId) {
    // Look up dfs.ha.namenodes.${dfs.nameservices} for each ${dfs.nameservices}
    Map<String, Map<String, InetSocketAddress>> addresses =
      DFSUtil.getHaNnRpcAddresses(conf);
    if (addresses == null) return false;
    Map<String, InetSocketAddress> nnMap = addresses.get(nsId);
    // Return true if nnMap is not null and nsId has more than one ha.namenodes entry
    return nnMap != null && nnMap.size() > 1;
  }
  
  // At NN startup, haEnabled is checked first; if it is true, the initial state is STANDBY
  NameNode.createHAState() {
    return !haEnabled ? ACTIVE_STATE : STANDBY_STATE;
  }
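
To make the rule concrete, here is a minimal, self-contained sketch that models the same decision with plain Java maps instead of Hadoop's configuration classes. The class name `HaStateSketch`, the nameservice ID `mycluster`, and the host addresses are hypothetical and chosen only for illustration; this is not the Hadoop implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the HAUtil / NameNode logic above; not Hadoop code.
final class HaStateSketch {
    enum HaState { ACTIVE, STANDBY }

    // Mirrors HAUtil.isHAEnabled: HA is on when the nameservice lists more than one NN.
    static boolean isHaEnabled(Map<String, Map<String, String>> addresses, String nsId) {
        if (addresses == null) {
            return false;
        }
        Map<String, String> nnMap = addresses.get(nsId);
        return nnMap != null && nnMap.size() > 1;
    }

    // Mirrors NameNode.createHAState: a non-HA NN starts ACTIVE, an HA NN starts STANDBY.
    static HaState initialState(boolean haEnabled) {
        return haEnabled ? HaState.STANDBY : HaState.ACTIVE;
    }

    public static void main(String[] args) {
        // Assumed example layout: one nameservice with two NameNode IDs.
        Map<String, String> nns = new LinkedHashMap<>();
        nns.put("nn1", "host1:8020");
        nns.put("nn2", "host2:8020");
        Map<String, Map<String, String>> addresses = new LinkedHashMap<>();
        addresses.put("mycluster", nns);

        boolean ha = isHaEnabled(addresses, "mycluster");
        System.out.println("haEnabled=" + ha + ", initial state=" + initialState(ha));
    }
}
```

With a single NameNode in the map, the same call would report HA as disabled and the NN would start directly in ACTIVE.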

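Suppose the cluster has two NNs. Both are in STANDBY state when they first start. If Automatic Failover is not configured, one of them has to be switched to ACTIVE manually with the following command: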

hdfs haadmin -transitionToActive nn1

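If ZKFC is configured and started, it automatically switches the NameNode on its own node to ACTIVE; which NN ends up ACTIVE depends on which NN's ZKFC is started first. There is no mandatory start order between ZKFC and the NN. Next, let's look at what ZKFC does at startup once both configured NN nodes are up. The ZKFC entry class is DFSZKFailoverController, which extends ZKFailoverController. We start with its main method.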

DFSZKFailoverController.main(String args[]){
    if (DFSUtil.parseHelpArgument(args, 
        ZKFailoverController.USAGE, System.out, true)) {
      System.exit(0);
    }
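    // Load the HDFS configuration
    GenericOptionsParser parser = new GenericOptionsParser(
        new HdfsConfiguration(), args);
    // Build the DFSZKFailoverController from the configuration:
    // instantiate an NNHAServiceTarget from nsId and nnId and set it on the zkfc object.
    // The HAServiceTarget object holds the network parameters used to reach the given NN.
    DFSZKFailoverController zkfc = DFSZKFailoverController.create(
        parser.getConfiguration());
    // Connect to ZK (the ZK directory is formatted only when -formatZK is passed),
    // instantiate ActiveStandbyElector and HealthMonitor,
    // and start the RPC server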
    System.exit(zkfc.run(parser.getRemainingArgs()));
  }

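During startup, ZKFC creates two very important in-process components: HealthMonitor and ActiveStandbyElector. ZKFC subscribes to events from both, manages the NN's state, and is responsible for fencing. HealthMonitor checks the NN's health periodically; when a check fails, it catches the exception and reports the change to ZKFailoverController through a callback. ActiveStandbyElector manages the NN's state in ZK, including creating and watching znodes. At initial startup, if the local NN is healthy, HealthMonitor triggers a chain of events that makes the current NN join the election and, if it wins, switch to ACTIVE. The code is as follows.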

 HealthMonitor.doHealthChecks() {
    while (shouldRun) {
      HAServiceStatus status = null;
      boolean healthy = false;
      try {
        // Fetch the NN's status through the proxy
        status = proxy.getServiceStatus();
        proxy.monitorHealth();
        healthy = true;
      } catch (HealthCheckFailedException e) {
        LOG.warn("Service health check failed for " + targetToMonitor
            + ": " + e.getMessage());
        // Invoke the callbacks in the callbacks list, notifying ZKFC as an event
        enterState(State.SERVICE_UNHEALTHY);
      } catch (Throwable t) {
        RPC.stopProxy(proxy);
        proxy = null;
        enterState(State.SERVICE_NOT_RESPONDING);
        Thread.sleep(sleepAfterDisconnectMillis);
        return;
      }
      
      if (status != null) {
        setLastServiceStatus(status);
      }
      if (healthy) {
        // At startup State=INITIALIZING; when the new state differs from it, a state transition fires
        enterState(State.SERVICE_HEALTHY);
      }

      Thread.sleep(checkIntervalMillis);
    }
  }
  
  
  HealthMonitor.enterState(State newState) {
    // Fire the callbacks only when the new state differs from the previous one
    if (newState != state) {
      LOG.info("Entering state " + newState);
      state = newState;
      synchronized (callbacks) {
        for (Callback cb : callbacks) {
          cb.enteredState(newState);
        }
      }
    }
  }
  
  
  HealthCallbacks.enteredState(HealthMonitor.State newState) {
      setLastHealthState(newState);
      // Decide from the current state whether this node may join the election
      recheckElectability();
   }
   
  ZKFailoverController.recheckElectability() {
    synchronized (elector) {
      synchronized (this) {
        boolean healthy = lastHealthState == State.SERVICE_HEALTHY;
    
        long remainingDelay = delayJoiningUntilNanotime - System.nanoTime(); 
        if (remainingDelay > 0) {
          if (healthy) {
            LOG.info("Would have joined master election, but this node is " +
                "prohibited from doing so for " +
                TimeUnit.NANOSECONDS.toMillis(remainingDelay) + " more ms");
          }
          scheduleRecheck(remainingDelay);
          return;
        }
    
        switch (lastHealthState) {
        // State is SERVICE_HEALTHY: eligible to join the election
        case SERVICE_HEALTHY:
          elector.joinElection(targetToData(localTarget));
          break;
        // State is INITIALIZING: not eligible for election
        case INITIALIZING:
          elector.quitElection(false);
          break;
    	
        // State is SERVICE_UNHEALTHY or SERVICE_NOT_RESPONDING: not eligible for election
        case SERVICE_UNHEALTHY:
        case SERVICE_NOT_RESPONDING:
          elector.quitElection(true);
          break;
          
        case HEALTH_MONITOR_FAILED:
          fatalError("Health monitor failed!");
          break;
          
        default:
          throw new IllegalArgumentException("Unhandled state:" + lastHealthState);
        }
      }
    }
  }
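
The change-only notification in enterState and the state-to-action mapping in recheckElectability can be tried out in isolation. The following is a minimal sketch under stated assumptions: `HealthCallbackSketch` and `electionAction` are hypothetical names, the action is returned as a string instead of calling a real elector, and the delay handling and the HEALTH_MONITOR_FAILED case are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the HealthMonitor -> ZKFC callback flow; not Hadoop code.
final class HealthCallbackSketch {
    enum State { INITIALIZING, SERVICE_HEALTHY, SERVICE_UNHEALTHY, SERVICE_NOT_RESPONDING }

    interface Callback {
        void enteredState(State newState);
    }

    private State state = State.INITIALIZING;
    private final List<Callback> callbacks = new ArrayList<>();

    void addCallback(Callback cb) {
        callbacks.add(cb);
    }

    // Notify listeners only when the state actually changes, as enterState does above.
    void enterState(State newState) {
        if (newState == state) {
            return;
        }
        state = newState;
        for (Callback cb : callbacks) {
            cb.enteredState(newState);
        }
    }

    // Maps a health state to the elector call chosen in recheckElectability.
    static String electionAction(State s) {
        switch (s) {
            case SERVICE_HEALTHY:
                return "joinElection";
            case INITIALIZING:
                return "quitElection(false)";
            default:
                return "quitElection(true)";
        }
    }

    public static void main(String[] args) {
        HealthCallbackSketch monitor = new HealthCallbackSketch();
        monitor.addCallback(s -> System.out.println(s + " -> " + electionAction(s)));
        monitor.enterState(State.SERVICE_HEALTHY);        // state changed: callback fires
        monitor.enterState(State.SERVICE_HEALTHY);        // unchanged: nothing happens
        monitor.enterState(State.SERVICE_NOT_RESPONDING); // state changed: callback fires
    }
}
```

Because repeated healthy checks do not change the state, a healthy NN joins the election once rather than on every check interval.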
  
  
  ActiveStandbyElector.joinElectionInternal() {
    Preconditions.checkState(appData != null,
        "trying to join election without any app data");
    if (zkClient == null) {
      if (!reEstablishSession()) {
        fatalError("Failed to reEstablish connection with ZooKeeper");
        return;
      }
    }

    createRetryCount = 0;
    wantToBeInElection = true;
    // Write an EPHEMERAL znode to ZK; it is deleted automatically once the ZK session that created it ends, e.g. when the node goes down
    createLockNodeAsync();
  }
  
  
  ActiveStandbyElector.createLockNodeAsync() {
    // Asynchronous call; when it completes, the callback method is invoked:
    // processResult(int rc, String path, Object ctx, String name)
    zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
        this, zkClient);
  }
  
  
  ActiveStandbyElector.processResult(int rc, String path, Object ctx,
      String name) {
    Code code = Code.get(rc);
    if (isSuccess(code)) {
      // Creation succeeded: try to make this node ACTIVE.
      // becomeActive() talks to the local NN over RPC and sets its state to ACTIVE,
      // so the NN on this node becomes the primary.
      if (becomeActive()) {
        // Watch the lock znode
        monitorActiveStatus();
      } else {
        reJoinElectionAfterFailureToBecomeActive();
      }
      return;
    }
    // The znode already exists, so another NN is already the primary. This node can
    // only act as a hot standby, so its state is set to STANDBY.
    if (isNodeExists(code)) {
      if (createRetryCount == 0) {
        becomeStandby();
      }
      // Watch the znode written by the ACTIVE NN; when it changes, the watch fires a callback
      monitorActiveStatus();
      return;
    }
    // (handling of the remaining result codes is not shown in this excerpt)
  }
  
  
  ActiveStandbyElector.becomeActive() {
    if (state == State.ACTIVE) {
      // already active
      return true;
    }
    try {
      // Read the breadcrumb znode left by the previous ACTIVE NN and fence that NN
      Stat oldBreadcrumbStat = fenceOldActive();
      // Update the breadcrumb znode's data
      writeBreadCrumbNode(oldBreadcrumbStat);
    
      // Talk to the local NN over RPC so that it becomes ACTIVE
      appClient.becomeActive();
      state = State.ACTIVE;
      return true;
    } catch (Exception e) {
      return false;
    }
  }
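
The ordering inside becomeActive() matters: the old active is fenced and the breadcrumb is rewritten before the local NN is asked to transition. Below is a minimal sketch of that ordering. It assumes the breadcrumb is a persistent record naming the last active node and that fencing is skipped when the breadcrumb is absent or already names this node; `BecomeActiveSketch` and its string log are hypothetical and stand in for the real ZooKeeper and RPC calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the becomeActive() ordering; not Hadoop code.
final class BecomeActiveSketch {
    // Stand-in for the breadcrumb znode, assumed to outlive the ephemeral lock.
    private String breadcrumb;
    // Records the steps taken, in order, so the sequence can be inspected.
    final List<String> log = new ArrayList<>();

    void becomeActive(String myId) {
        // Step 1: fence the previous active if the breadcrumb names another node.
        if (breadcrumb != null && !breadcrumb.equals(myId)) {
            log.add("fence " + breadcrumb);
        }
        // Step 2: record this node as the latest active.
        breadcrumb = myId;
        log.add("breadcrumb=" + myId);
        // Step 3: only now ask the local NN to transition (an RPC in the real code).
        log.add(myId + " -> ACTIVE");
    }

    public static void main(String[] args) {
        BecomeActiveSketch zk = new BecomeActiveSketch();
        zk.becomeActive("nn1"); // no breadcrumb yet, so nothing to fence
        zk.becomeActive("nn2"); // breadcrumb names nn1, so nn1 is fenced first
        zk.log.forEach(System.out::println);
    }
}
```

Fencing before the transition is what keeps two NNs from both acting as ACTIVE if the previous one is still running but has lost its lock.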

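The ZKFC on the STANDBY NN's node watches the znode that the ACTIVE NN wrote to ZK. When that znode changes, ZK fires the watch callback and the STANDBY NN rejoins the election, which completes the automatic switch. The code is as follows.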

 ActiveStandbyElector.processWatchEvent(ZooKeeper zk, WatchedEvent event) {
   Event.EventType eventType = event.getType();
   String path = event.getPath();
   if (path != null) {
     switch (eventType) {
     case NodeDeleted:
       if (state == State.ACTIVE) {
         enterNeutralMode();
       }
       // Rejoin the election
       joinElectionInternal();
       break;
     case NodeDataChanged:
       monitorActiveStatus();
       break;
     default:
       monitorActiveStatus();
     }
   }
 }
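
The whole cycle (create the lock, become ACTIVE or STANDBY, react to the lock's deletion) can be simulated without ZooKeeper. The sketch below is an in-memory model under stated assumptions: `ElectionSketch`, `Node`, and `expireOwner` are hypothetical names, the lock is a plain field rather than an ephemeral znode, and fencing, retries, and session handling are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the lock-znode election; it does not use ZooKeeper.
final class ElectionSketch {
    // Stand-in for the ephemeral lock znode: the node that holds it, or null when absent.
    private Node owner;
    // Standby nodes that asked to be told when the lock goes away.
    private final List<Node> watchers = new ArrayList<>();

    final class Node {
        final String id;
        String state = "INIT";

        Node(String id) {
            this.id = id;
        }

        // Like joinElectionInternal + processResult: try to create the lock.
        void joinElection() {
            if (owner == null) {
                owner = this;          // creation succeeded
                state = "ACTIVE";
            } else {
                state = "STANDBY";     // lock already exists
                if (!watchers.contains(this)) {
                    watchers.add(this); // like monitorActiveStatus
                }
            }
        }
    }

    Node newNode(String id) {
        return new Node(id);
    }

    // Simulates the active's session ending: the lock disappears and every
    // watcher receives a NodeDeleted-style notification and rejoins the election.
    void expireOwner() {
        if (owner != null) {
            owner.state = "DOWN";
        }
        owner = null;
        List<Node> toNotify = new ArrayList<>(watchers);
        watchers.clear();
        for (Node n : toNotify) {
            n.joinElection();
        }
    }

    String ownerId() {
        return owner == null ? null : owner.id;
    }

    public static void main(String[] args) {
        ElectionSketch zk = new ElectionSketch();
        Node nn1 = zk.newNode("nn1");
        Node nn2 = zk.newNode("nn2");
        nn1.joinElection();
        nn2.joinElection();
        System.out.println("nn1=" + nn1.state + ", nn2=" + nn2.state);
        zk.expireOwner();
        System.out.println("nn1=" + nn1.state + ", nn2=" + nn2.state + ", lock held by " + zk.ownerId());
    }
}
```

Whichever node creates the lock first wins, which matches the earlier observation that the NN whose ZKFC starts first becomes ACTIVE.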

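The above walks through the rough flow of NN automatic failover from the source code's point of view. Compared with manual switching, automatic failover greatly improves availability and lightens the operations burden.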
