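I recently looked into NameNode HA Automatic Failover. When the Active NN can no longer serve requests because of a failure or some other reason, the NN in Standby state automatically switches to Active, which achieves real high availability.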
Figure: NN HA Automatic Failover architecture
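Automatic switching depends on ZooKeeper and the ZKFC component. ZooKeeper mainly records NN state information, and ZKFC runs as a separate JVM process on each node that hosts an NN. Let's first look at the NN startup flow. While the NameNode object is being instantiated, if more than one NameNode is configured under dfs.ha.namenodes.${dfs.nameservices} in hdfs-site.xml, the haEnabled property is true and the initial state is STANDBY.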
HAUtil.isHAEnabled(Configuration conf, String nsId) {
  // Look up dfs.ha.namenodes.${dfs.nameservices} for each nameservice
  Map<String, Map<String, InetSocketAddress>> addresses =
      DFSUtil.getHaNnRpcAddresses(conf);
  if (addresses == null) return false;
  Map<String, InetSocketAddress> nnMap = addresses.get(nsId);
  // Return true if nnMap is not null and nsId has more than one NameNode
  return nnMap != null && nnMap.size() > 1;
}

// During startup the NN checks haEnabled first; if true, the initial state is STANDBY
NameNode.createHAState() {
  return !haEnabled ? ACTIVE_STATE : STANDBY_STATE;
}
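The check above can be illustrated without a Hadoop dependency. The following is a minimal sketch, assuming the configuration is a plain Map and counting NameNode IDs directly instead of resolving RPC addresses as the real code does. The class name HaCheckSketch, its methods, and the nameservice name "mycluster" are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the HA check; reads a plain Map, not a Hadoop Configuration.
final class HaCheckSketch {
    enum State { ACTIVE, STANDBY }

    // Parse the comma-separated NameNode IDs stored under dfs.ha.namenodes.<nsId>.
    static List<String> nameNodeIds(Map<String, String> conf, String nsId) {
        String raw = conf.getOrDefault("dfs.ha.namenodes." + nsId, "");
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // HA counts as enabled only when more than one NameNode ID is configured.
    static boolean isHaEnabled(Map<String, String> conf, String nsId) {
        return nameNodeIds(conf, nsId).size() > 1;
    }

    // Same rule as createHAState(): non-HA starts ACTIVE, HA starts STANDBY.
    static State initialState(boolean haEnabled) {
        return haEnabled ? State.STANDBY : State.ACTIVE;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("dfs.nameservices", "mycluster");
        conf.put("dfs.ha.namenodes.mycluster", "nn1,nn2");
        boolean ha = isHaEnabled(conf, "mycluster");
        System.out.println("haEnabled=" + ha + ", initial state=" + initialState(ha));
    }
}
```

With two IDs configured the sketch reports HA as enabled and an initial state of STANDBY; with a single ID (or none) it falls back to ACTIVE, which matches the non-HA case described above.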
Suppose the cluster has two NNs. Both are in STANDBY state right after startup. Without Automatic Failover, one of them has to be switched to ACTIVE manually with the following command:
hdfs haadmin -transitionToActive nn1
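If ZKFC is configured and started, it automatically switches the NameNode on its own node to ACTIVE; which NN becomes ACTIVE depends on which node's ZKFC is started first. There is no required startup order between ZKFC and the NN. Next, let's look at what ZKFC does at startup once both configured NNs are running. The ZKFC entry class is DFSZKFailoverController, which extends ZKFailoverController. We start with the main method: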
DFSZKFailoverController.main(String args[]) {
  if (DFSUtil.parseHelpArgument(args,
      ZKFailoverController.USAGE, System.out, true)) {
    System.exit(0);
  }
  // Load the HDFS configuration
  GenericOptionsParser parser = new GenericOptionsParser(
      new HdfsConfiguration(), args);
  // Build the DFSZKFailoverController from the configuration.
  // An NNHAServiceTarget is instantiated from nsId and nnId and set on the zkfc object.
  // The HAServiceTarget holds the network parameters needed to reach the given NN.
  DFSZKFailoverController zkfc = DFSZKFailoverController.create(
      parser.getConfiguration());
  // Connect to ZK (and format the ZK parent znode when run with -formatZK),
  // instantiate ActiveStandbyElector and HealthMonitor,
  // and start the RPC server
  System.exit(zkfc.run(parser.getRemainingArgs()));
}
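During startup, ZKFC starts two important in-process components: HealthMonitor and ActiveStandbyElector. ZKFC subscribes to events from both, manages the NN's state, and is responsible for fencing. HealthMonitor checks the NN's health periodically; it detects problems by catching exceptions and reports the state change to ZKFailoverController through callbacks. ActiveStandbyElector manages the NN's state in ZK, including creating znodes and watching them. On first startup, if the local node is healthy, HealthMonitor triggers a chain of events that makes the current NN join the election and switch to ACTIVE. The code is as follows: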
HealthMonitor.doHealthChecks() {
  while (shouldRun) {
    HAServiceStatus status = null;
    boolean healthy = false;
    try {
      // Query the NN's status through the proxy
      status = proxy.getServiceStatus();
      proxy.monitorHealth();
      healthy = true;
    } catch (HealthCheckFailedException e) {
      LOG.warn("Service health check failed for " + targetToMonitor
          + ": " + e.getMessage());
      // Invoke the registered callbacks to notify ZKFC of the event
      enterState(State.SERVICE_UNHEALTHY);
    } catch (Throwable t) {
      RPC.stopProxy(proxy);
      proxy = null;
      enterState(State.SERVICE_NOT_RESPONDING);
      Thread.sleep(sleepAfterDisconnectMillis);
      return;
    }
    if (status != null) {
      setLastServiceStatus(status);
    }
    if (healthy) {
      // The state is INITIALIZING at first startup; because the new state
      // differs from it, a state transition is triggered
      enterState(State.SERVICE_HEALTHY);
    }
    Thread.sleep(checkIntervalMillis);
  }
}

HealthMonitor.enterState(State newState) {
  // Fire the callbacks only when the new state differs from the previous one
  if (newState != state) {
    LOG.info("Entering state " + newState);
    state = newState;
    synchronized (callbacks) {
      for (Callback cb : callbacks) {
        cb.enteredState(newState);
      }
    }
  }
}

HealthCallbacks.enteredState(HealthMonitor.State newState) {
  setLastHealthState(newState);
  // Decide from the current state whether this node may join the election
  recheckElectability();
}

ZKFailoverController.recheckElectability() {
  synchronized (elector) {
    synchronized (this) {
      boolean healthy = lastHealthState == State.SERVICE_HEALTHY;
      long remainingDelay = delayJoiningUntilNanotime - System.nanoTime();
      if (remainingDelay > 0) {
        if (healthy) {
          LOG.info("Would have joined master election, but this node is "
              + "prohibited from doing so for "
              + TimeUnit.NANOSECONDS.toMillis(remainingDelay) + " more ms");
        }
        scheduleRecheck(remainingDelay);
        return;
      }
      switch (lastHealthState) {
      // SERVICE_HEALTHY: eligible to join the election
      case SERVICE_HEALTHY:
        elector.joinElection(targetToData(localTarget));
        break;
      // INITIALIZING: not eligible
      case INITIALIZING:
        elector.quitElection(false);
        break;
      // SERVICE_UNHEALTHY or SERVICE_NOT_RESPONDING: not eligible
      case SERVICE_UNHEALTHY:
      case SERVICE_NOT_RESPONDING:
        elector.quitElection(true);
        break;
      case HEALTH_MONITOR_FAILED:
        fatalError("Health monitor failed!");
        break;
      default:
        throw new IllegalArgumentException("Unhandled state:" + lastHealthState);
      }
    }
  }
}

ActiveStandbyElector.joinElectionInternal() {
  Preconditions.checkState(appData != null,
      "trying to join election without any app data");
  if (zkClient == null) {
    if (!reEstablishSession()) {
      fatalError("Failed to reEstablish connection with ZooKeeper");
      return;
    }
  }
  createRetryCount = 0;
  wantToBeInElection = true;
  // Write an ephemeral znode to ZK; it is deleted automatically
  // once the session that created it ends
  createLockNodeAsync();
}

ActiveStandbyElector.createLockNodeAsync() {
  // Asynchronous call; when it completes, the callback
  // processResult(int rc, String path, Object ctx, String name) is invoked
  zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
      this, zkClient);
}

ActiveStandbyElector.processResult(int rc, String path, Object ctx, String name) {
  Code code = Code.get(rc);
  if (isSuccess(code)) {
    // The znode was created; try to make this node ACTIVE.
    // becomeActive() talks to the local NN over RPC and sets its state to ACTIVE,
    // so the NN on this node becomes the primary
    if (becomeActive()) {
      // Watch the znode
      monitorActiveStatus();
    } else {
      reJoinElectionAfterFailureToBecomeActive();
    }
    return;
  }
  // The znode already exists, so another NN is already the primary.
  // This node can only act as the hot standby, so its state is set to STANDBY
  if (isNodeExists(code)) {
    if (createRetryCount == 0) {
      becomeStandby();
    }
    // Watch the znode written for the ACTIVE NN; when it changes,
    // the watch mechanism fires a callback
    monitorActiveStatus();
    return;
  }
  // (the rest of the method is not shown in this excerpt)
}

ActiveStandbyElector.becomeActive() {
  if (state == State.ACTIVE) {
    // already active
    return true;
  }
  try {
    // Read the breadcrumb znode left by the previous ACTIVE NN and fence that NN
    Stat oldBreadcrumbStat = fenceOldActive();
    // Update the breadcrumb znode's data
    writeBreadCrumbNode(oldBreadcrumbStat);
    // Talk to the local NN over RPC so that it becomes ACTIVE
    appClient.becomeActive();
    state = State.ACTIVE;
    return true;
  } catch (Exception e) {
    return false;
  }
}
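The core of the election is that only one candidate can create the lock znode. The following is a minimal in-memory sketch of that idea, assuming a single JVM and no real ZooKeeper: an AtomicReference stands in for the ephemeral lock znode, and the class and method names (ElectionSketch, LockNode, Candidate, onHealthCheck) are hypothetical. It leaves out fencing, the breadcrumb znode, retries, and the RPC call to the NN.

```java
import java.util.concurrent.atomic.AtomicReference;

// In-memory model of the election; not the ZooKeeper API.
final class ElectionSketch {
    enum State { NOT_IN_ELECTION, ACTIVE, STANDBY }

    // Hypothetical stand-in for the lock znode: holds the owner's ID, or null if absent.
    static final class LockNode {
        private final AtomicReference<String> owner = new AtomicReference<>();

        // Like creating the znode: succeeds only if it does not exist yet.
        boolean tryCreate(String candidateId) {
            return owner.compareAndSet(null, candidateId);
        }

        // Like the ephemeral znode vanishing when its owner leaves.
        boolean release(String candidateId) {
            return owner.compareAndSet(candidateId, null);
        }

        String owner() {
            return owner.get();
        }
    }

    static final class Candidate {
        final String id;
        State state = State.NOT_IN_ELECTION;

        Candidate(String id) {
            this.id = id;
        }

        // Healthy candidates try to take the lock; unhealthy ones quit the election.
        void onHealthCheck(boolean healthy, LockNode lock) {
            if (!healthy) {
                lock.release(id);
                state = State.NOT_IN_ELECTION;
                return;
            }
            boolean holdsLock = lock.tryCreate(id) || id.equals(lock.owner());
            state = holdsLock ? State.ACTIVE : State.STANDBY;
        }
    }

    public static void main(String[] args) {
        LockNode lock = new LockNode();
        Candidate nn1 = new Candidate("nn1");
        Candidate nn2 = new Candidate("nn2");
        nn1.onHealthCheck(true, lock); // first healthy candidate wins the lock
        nn2.onHealthCheck(true, lock); // lock already exists, so nn2 stays on standby
        System.out.println("nn1=" + nn1.state + ", nn2=" + nn2.state);
    }
}
```

The compare-and-set mirrors the property the elector relies on: creation of the lock node is atomic, so whichever candidate's ZKFC gets there first becomes ACTIVE and every later candidate sees "node exists" and becomes STANDBY.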
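The STANDBY side watches the znode that was written to ZK for the ACTIVE NN. When that znode changes, ZK fires the watch callback and the STANDBY NN rejoins the election, which completes the automatic switch. The code is as follows: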
ActiveStandbyElector.processWatchEvent(ZooKeeper zk, WatchedEvent event) {
  String path = event.getPath();
  if (path != null) {
    switch (event.getType()) {
    case NodeDeleted:
      if (state == State.ACTIVE) {
        enterNeutralMode();
      }
      // Rejoin the election
      joinElectionInternal();
      break;
    case NodeDataChanged:
      monitorActiveStatus();
      break;
    default:
      monitorActiveStatus();
    }
  }
}
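The watch-driven takeover can also be simulated in memory. The following is a rough sketch under stated assumptions: WatchedLock, Elector, and the alive flag are hypothetical names, the watch is modeled as a one-shot listener list, and a crashed node is modeled simply as one that ignores notifications. It does not use the ZooKeeper client API.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory simulation of "lock node deleted -> standby rejoins the election".
final class FailoverSketch {
    enum State { STANDBY, ACTIVE }

    // Hypothetical lock node that notifies its watchers when it is deleted.
    static final class WatchedLock {
        private String owner;
        private final List<Elector> watchers = new ArrayList<>();

        boolean tryCreate(String id) {
            if (owner != null) {
                return false; // node already exists
            }
            owner = id;
            return true;
        }

        void watch(Elector elector) {
            if (!watchers.contains(elector)) {
                watchers.add(elector);
            }
        }

        // Simulates the ephemeral node disappearing after its owner's session ends.
        void delete() {
            owner = null;
            List<Elector> toNotify = new ArrayList<>(watchers);
            watchers.clear(); // watches fire once and must be set again
            for (Elector elector : toNotify) {
                elector.onNodeDeleted(this);
            }
        }

        String owner() {
            return owner;
        }
    }

    static final class Elector {
        final String id;
        State state = State.STANDBY;
        boolean alive = true;

        Elector(String id) {
            this.id = id;
        }

        // Try to create the lock; either way, keep watching it afterwards.
        void joinElection(WatchedLock lock) {
            state = lock.tryCreate(id) ? State.ACTIVE : State.STANDBY;
            lock.watch(this);
        }

        // Same reaction as the NodeDeleted branch: rejoin the election.
        void onNodeDeleted(WatchedLock lock) {
            if (alive) {
                joinElection(lock);
            }
        }
    }

    public static void main(String[] args) {
        WatchedLock lock = new WatchedLock();
        Elector nn1 = new Elector("nn1");
        Elector nn2 = new Elector("nn2");
        nn1.joinElection(lock);
        nn2.joinElection(lock);
        nn1.alive = false; // nn1 crashes
        lock.delete();     // its ephemeral node goes away
        System.out.println("lock owner=" + lock.owner() + ", nn2=" + nn2.state);
    }
}
```

In this run nn2 starts as STANDBY, receives the deletion notification after nn1 stops responding, and takes the lock. A real elector would also fence the old active before completing the transition, as becomeActive() does.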
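The above walks through the overall flow of automatic NN failover from the code's point of view. Compared with manual switching, it greatly improves availability and reduces the operations burden.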