Contents

Preface
Background and requirements of HDFS Standby Read
Controlling consistent reads on the Standby NameNode
References

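Preface

HDFS has a mature HA mechanism that keeps the service highly available. In HA mode there is an Active NameNode and a Standby NameNode. The Active NameNode serves external data requests, while the Standby NameNode performs checkpoints and stays ready to take over as Active if the current Active NameNode becomes unavailable unexpectedly. In practice the Standby NameNode has little day-to-day work: apart from periodic checkpoints and near-real-time synchronization of metadata, it does not handle read or write requests from external clients, so its load is relatively low. As the load on the Active NameNode keeps growing, can we let the Standby NameNode take over part of the read traffic? The Hadoop community proposed this idea quite early and has implemented it. In this article I walk through parts of the code to analyze how HDFS Standby Read works.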

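Background and requirements of HDFS Standby Read

Let us start with the background and requirements of HDFS Standby Read. As a cluster grows, the load on the Active NameNode keeps increasing. The usual answer is to build an HDFS Federation and scale the service horizontally. That approach, however, does nothing to get more out of the serving capacity of a NameNode itself, and this is the gap that HDFS Standby Read fills.

In Standby Read mode, write requests are still handled by the Active NameNode. The Standby only gains the ability to serve reads, so the core NameNode logic does not need major changes. The hardest problem to solve is consistent reads on the Standby.

We know that the Standby NameNode synchronizes transactions by reading the editlog from the JournalNodes. There is a time gap between the Active NameNode writing a batch of editlog entries and the Standby NameNode reading that batch. So implementing Standby Read is not just a matter of redirecting read requests to the Standby NN; it also involves waiting for transactions to be synchronized. Below I describe in detail how the community handles this.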

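Controlling consistent reads on the Standby NameNode

How it works

Given the state synchronization problem described in the previous section, the Standby NameNode may handle a client's read request only after it has reached the txid that the client last saw.

What does this mean? When a client issues an RPC, the Active NameNode and the Standby NameNode each have their own current txid, and the Active's txid is never behind the Standby's and is usually ahead of it. Call the Active's txid "ann txid" and the Standby's "snn txid". If the client sends its subsequent requests to the Active, there is no staleness problem because the Active always has the latest state. But if we want the Standby NameNode to handle the client's request as well, it must at least reach the state the Active NameNode had at the moment the client issued the RPC, that is, snn txid must reach that ann txid value.

The process can be summarized as follows:

1) Before issuing the RPC, the client obtains the current txid of the Active NameNode; we call it lastSeenTxid.
2) The client then sends the read request to the Standby NameNode, carrying the lastSeenTxid from the previous step.
3) When the Standby NameNode handles this RPC, it checks whether its own current txid has reached the client's lastSeenTxid. If so, it processes the request normally; otherwise it puts the request back into the RPC call queue to be handled later. From the client's point of view, a re-queued request simply means that the RPC call has not finished yet.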
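
The three steps above can be sketched as a toy model. This is only an illustration of the comparison logic: ToyClient, ToyNameNode, applyEdits and canServeRead are hypothetical names invented here, not Hadoop classes or methods, and the txid values are made up.

```java
/** Toy model of the three steps; every name in this class is hypothetical. */
class StandbyReadSketch {

  /** Client side: remembers the txid it last saw on the Active. */
  static final class ToyClient {
    private long lastSeenTxid = Long.MIN_VALUE;

    void onResponse(long serverTxid) {
      lastSeenTxid = Math.max(lastSeenTxid, serverTxid);
    }

    long lastSeenTxid() {
      return lastSeenTxid;
    }
  }

  /** Server side: a NameNode reduced to its last applied txid. */
  static final class ToyNameNode {
    private long txid;

    ToyNameNode(long txid) {
      this.txid = txid;
    }

    /** Stand-in for the Standby catching up by tailing the edit log. */
    void applyEdits(long upTo) {
      txid = Math.max(txid, upTo);
    }

    long txid() {
      return txid;
    }

    /** Step 3: a read may be served only once the server has caught up. */
    boolean canServeRead(long clientLastSeenTxid) {
      return txid >= clientLastSeenTxid;
    }
  }

  public static void main(String[] args) {
    ToyNameNode active = new ToyNameNode(100);
    ToyNameNode standby = new ToyNameNode(90);
    ToyClient client = new ToyClient();

    client.onResponse(active.txid());                                 // step 1
    System.out.println(standby.canServeRead(client.lastSeenTxid()));  // steps 2 and 3: false
    standby.applyEdits(100);                                          // Standby catches up
    System.out.println(standby.canServeRead(client.lastSeenTxid()));  // true
  }
}
```

In the real system the "false" case does not fail the call; the request is held back on the server until the Standby catches up, as described in step 3.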

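To keep clients from waiting a long time for the Standby NameNode to reach lastSeenTxid, the community also improved how the Standby NameNode synchronizes the editlog, including reading transactions from editlog_inprogress segments and serving editlog data from memory.

Next I map the process above onto the actual code.

Code analysis

The implementation defines two classes that hold the txid values the client and the server can each "see".

The client-side class is ClientGSIContext; the server-side (NameNode) class is GlobalStateIdContext.

Let us start with the client-side class, ClientGSIContext. It maintains a lastSeenStateId value, as shown below: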

/**
 * Global State Id context for the client.
 * <p>
 * This is the client side implementation responsible for receiving
 * state alignment info from server(s).
 */
@InterfaceAudience.Private
@InterfaceStability.Evolving
public class ClientGSIContext implements AlignmentContext {

  private final LongAccumulator lastSeenStateId =
      new LongAccumulator(Math::max, Long.MIN_VALUE);
  ...
}

lastSeenStateId is updated when an RPC response is received, and it is read when an RPC request is sent so that the current value can be stamped on the request. The code, from the same class, is shown below.

  /**
   * Updates the current lastSeenStateId when the client receives an RPC response.
   */
  @Override
  public void receiveResponseState(RpcResponseHeaderProto header) {
    lastSeenStateId.accumulate(header.getStateId());
  }

  /**
   * Writes the current lastSeenStateId into the RPC request when the client sends it.
   */
  @Override
  public void updateRequestState(RpcRequestHeaderProto.Builder header) {
    header.setStateId(lastSeenStateId.longValue());
  }
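
A small stand-alone sketch may help show why a LongAccumulator with Math::max suits lastSeenStateId: the retained value never moves backwards, even if responses carrying older state ids arrive later. LastSeenStateIdDemo and accumulateAll are hypothetical names used only for this illustration, and the ids are made up.

```java
import java.util.concurrent.atomic.LongAccumulator;

class LastSeenStateIdDemo {

  /** Feeds state ids in the given order and returns the value that is retained. */
  static long accumulateAll(long... stateIds) {
    // Same construction as the lastSeenStateId field in the listing above.
    LongAccumulator lastSeen = new LongAccumulator(Math::max, Long.MIN_VALUE);
    for (long id : stateIds) {
      lastSeen.accumulate(id);
    }
    return lastSeen.longValue();
  }

  public static void main(String[] args) {
    // A reply carrying 95 arrives after a reply carrying 100; the maximum is kept.
    System.out.println(accumulateAll(100, 95, 98)); // prints 100
  }
}
```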

Now let us look at how the server-side GlobalStateIdContext works. First, the class definition:

/**
 * This is the server side implementation responsible for passing
 * state alignment info to clients.
 */
@InterfaceAudience.Private
@InterfaceStability.Evolving
class GlobalStateIdContext implements AlignmentContext {
  /**
   * Estimated number of journal transactions a typical NameNode can execute
   * per second. The number is used to estimate how long a client's
   * RPC request will wait in the call queue before the Observer catches up
   * with its state id.
   */
  private static final long ESTIMATED_TRANSACTIONS_PER_SECOND = 10000L;

  /**
   * The client wait time on an RPC request is composed of
   * the server execution time plus the communication time.
   * This is an expected fraction of the total wait time spent on
   * server execution.
   */
  private static final float ESTIMATED_SERVER_TIME_MULTIPLIER = 0.8f;
  /** The FSNamesystem used to obtain the current latest txid. */
  private final FSNamesystem namesystem;
  private final HashSet<String> coordinatedMethods;
  ...
}

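When handling a client RPC, GlobalStateIdContext mainly does two things:

1) On receiving an RPC request, it extracts the client's lastSeenTxid from the request and compares it with its own latest txid.
2) After the RPC has been processed and the response is being built, it writes its own latest txid into the response as the client's new lastSeenTxid, meaning the client has now seen a newer txid state.

The methods for these two steps are shown below: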

  /**
   * After the server has processed the RPC, writes the server's latest txid
   * into the RPC response as the client's new lastSeenTxid.
   */
  @Override
  public void updateResponseState(RpcResponseHeaderProto.Builder header) {
    // Using getCorrectLastAppliedOrWrittenTxId will acquire the lock on
    // FSEditLog. This is needed so that ANN will return the correct state id
    // it currently has. But this may not be necessary for Observer, may want
    // revisit for optimization. Same goes to receiveRequestState.
    header.setStateId(getLastSeenStateId());
  }

  /**
   * Server-side check of the state id carried by an incoming request.
   */
  @Override
  public long receiveRequestState(RpcRequestHeaderProto header,
      long clientWaitTime) throws IOException {
    if (!header.hasStateId() &&
        HAServiceState.OBSERVER.equals(namesystem.getState())) {
      // This could happen if client configured with non-observer proxy provider
      // (e.g., ConfiguredFailoverProxyProvider) is accessing a cluster with
      // observers. In this case, we should let the client failover to the
      // active node, rather than potentially serving stale result (client
      // stateId is 0 if not set).
      throw new StandbyException("Observer Node received request without "
          + "stateId. This mostly likely is because client is not configured "
          + "with " + ObserverReadProxyProvider.class.getSimpleName());
    }
    long serverStateId = getLastSeenStateId();
    long clientStateId = header.getStateId();
    FSNamesystem.LOG.trace("Client State ID= {} and Server State ID= {}",
        clientStateId, serverStateId);

    if (clientStateId > serverStateId &&
        HAServiceState.ACTIVE.equals(namesystem.getState())) {
      FSNamesystem.LOG.warn("The client stateId: {} is greater than "
          + "the server stateId: {} This is unexpected. "
          + "Resetting client stateId to server stateId",
          clientStateId, serverStateId);
      return serverStateId;
    }

    // If the client's lastSeenTxid is far ahead of the server's current txid,
    // throw an exception. If it is behind serverStateId or within the
    // tolerated range, continue processing.
    if (HAServiceState.OBSERVER.equals(namesystem.getState()) &&
        clientStateId - serverStateId >
        ESTIMATED_TRANSACTIONS_PER_SECOND
            * TimeUnit.MILLISECONDS.toSeconds(clientWaitTime)
            * ESTIMATED_SERVER_TIME_MULTIPLIER) {
      throw new RetriableException(
          "Observer Node is too far behind: serverStateId = "
              + serverStateId + " clientStateId = " + clientStateId);
    }
    return clientStateId;
  }

  // Returns the server's own current latest txid.
  @Override
  public long getLastSeenStateId() {
    // Should not need to call getCorrectLastAppliedOrWrittenTxId()
    // see HDFS-14822.
    return namesystem.getFSImage().getLastAppliedOrWrittenTxId();
  }
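
To get a feel for the "too far behind" check in receiveRequestState, the following sketch re-computes the same threshold expression outside Hadoop. The two constants are copied from the listing above; the class name, the method names and the sample wait time and state ids are assumptions made for illustration only.

```java
import java.util.concurrent.TimeUnit;

class ObserverLagThreshold {
  // Same values as the constants shown in GlobalStateIdContext above.
  static final long ESTIMATED_TRANSACTIONS_PER_SECOND = 10000L;
  static final float ESTIMATED_SERVER_TIME_MULTIPLIER = 0.8f;

  /** Largest lag, in transactions, that is still accepted for this wait time. */
  static float maxAllowedLag(long clientWaitTimeMs) {
    return ESTIMATED_TRANSACTIONS_PER_SECOND
        * TimeUnit.MILLISECONDS.toSeconds(clientWaitTimeMs)
        * ESTIMATED_SERVER_TIME_MULTIPLIER;
  }

  /** Mirrors the comparison that leads to the RetriableException. */
  static boolean tooFarBehind(long clientStateId, long serverStateId,
      long clientWaitTimeMs) {
    return clientStateId - serverStateId > maxAllowedLag(clientWaitTimeMs);
  }

  public static void main(String[] args) {
    long waitMs = 60_000L; // hypothetical client wait time of 60 seconds
    // 10000 tx/s * 60 s * 0.8 is roughly 480000 transactions of tolerated lag.
    System.out.println(maxAllowedLag(waitMs));
    System.out.println(tooFarBehind(1_500_000L, 1_000_000L, waitMs)); // lag 500000
    System.out.println(tooFarBehind(1_400_000L, 1_000_000L, waitMs)); // lag 400000
  }
}
```

One consequence of the expression as written: TimeUnit.MILLISECONDS.toSeconds truncates, so a wait time below 1000 ms yields a threshold of 0, and any positive lag then counts as too far behind.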

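Note that receiveRequestState above is only a preliminary check performed before the client request enters the RPC queue. When a Handler later takes the call from the RPC queue, it compares the client's lastSeenTxid with the server's txid once more.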

  /** Handles queued calls . */
  private class Handler extends Thread {
    public Handler(int instanceNumber) {
      this.setDaemon(true);
      this.setName("IPC Server handler "+ instanceNumber +
          " on default port " + port);
    }

    @Override
    public void run() {
      LOG.debug(Thread.currentThread().getName() + ": starting");
      SERVER.set(Server.this);
      while (running) {
        TraceScope traceScope = null;
        Call call = null;
        long startTimeNanos = 0;
        // True iff the connection for this call has been dropped.
        // Set to true by default and update to false later if the connection
        // can be successfully read.
        boolean connDropped = true;

        try {
          // 1) Take a call from the call queue for processing.
          call = callQueue.take(); // pop the queue; maybe blocked here
          startTimeNanos = Time.monotonicNowNanos();
          // If this call supports Standby Read and the txid seen by the client
          // is greater than the server's txid, re-queue the call. Its
          // processing is postponed until the server txid reaches the
          // client's txid.
          if (alignmentContext != null && call.isCallCoordinated() &&
              call.getClientStateId() > alignmentContext.getLastSeenStateId()) {
            /*
             * The call processing should be postponed until the client call's
             * state id is aligned (<=) with the server state id.

             * NOTE:
             * Inserting the call back to the queue can change the order of call
             * execution comparing to their original placement into the queue.
             * This is not a problem, because Hadoop RPC does not have any
             * constraints on ordering the incoming rpc requests.
             * In case of Observer, it handles only reads, which are
             * commutative.
             */
            // Re-queue the call and continue
            requeueCall(call);
            continue;
          }
          ...
  }
}
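
The re-queue behaviour in the Handler loop, including the reordering that the NOTE in the comment mentions, can be reproduced with a small single-threaded sketch. RequeueSketch, ToyCall and drain are hypothetical names, and bumping the server state id by one on every postponement is only a stand-in for the edit log tailer making progress.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class RequeueSketch {

  /** A queued call reduced to a name and the state id sent by the client. */
  static final class ToyCall {
    final String name;
    final long clientStateId;

    ToyCall(String name, long clientStateId) {
      this.name = name;
      this.clientStateId = clientStateId;
    }
  }

  /** Drains the queue and returns the call names in the order they were processed. */
  static List<String> drain(Queue<ToyCall> queue, long serverStateId) {
    List<String> processed = new ArrayList<>();
    while (!queue.isEmpty()) {
      ToyCall call = queue.poll();
      if (call.clientStateId > serverStateId) {
        queue.add(call);   // postpone: the server has not caught up yet
        serverStateId++;   // assumption: the server applies one more transaction
        continue;
      }
      processed.add(call.name);
    }
    return processed;
  }

  public static void main(String[] args) {
    Queue<ToyCall> queue = new ArrayDeque<>();
    queue.add(new ToyCall("A", 102));
    queue.add(new ToyCall("B", 100));
    queue.add(new ToyCall("C", 101));
    // A is postponed twice, so B and C overtake it.
    System.out.println(drain(queue, 100)); // prints [B, C, A]
  }
}
```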

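At this point the server-side logic of Standby Read is mostly covered. Now go back to the steps listed earlier:

1) Before issuing the RPC, the client obtains the current txid of the Active NameNode; we call it lastSeenTxid.
2) The client then sends the read request to the Standby NameNode, carrying the lastSeenTxid from the previous step.
…

How does the client manage to first query the Active NameNode for the latest txid and then send the subsequent read request to the Standby NameNode? Two RPC calls are involved here.

The community implemented a new ProxyProvider class, ObserverReadProxyProvider, to encapsulate this logic. In ObserverReadInvocationHandler, before a read request is sent to the Standby NameNode, an msync call is first sent to the Active NameNode when needed (initializeMsync on the first read, autoMsyncIfNecessary afterwards) to bring lastSeenStateId in the client's ClientGSIContext up to date (the client's response handling calls ClientGSIContext#receiveResponseState).

This logic lives in the ObserverReadProxyProvider class, as shown below.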

  private class ObserverReadInvocationHandler implements RpcInvocationHandler {

    @Override
    public Object invoke(Object proxy, final Method method, final Object[] args)
        throws Throwable {
      lastProxy = null;
      Object retVal;

      // If Standby Read is enabled and the RPC method is a read
      if (observerReadEnabled && shouldFindObserver() && isRead(method)) {
        if (!msynced) {
          // An msync() must first be performed to ensure that this client is
          // up-to-date with the active's state. This will only be done once.
          initializeMsync();
        } else {
          // On each later request, autoMsyncIfNecessary decides whether to
          // msync against the Active NameNode to refresh the client's
          // lastSeenTxid.
          autoMsyncIfNecessary();
        }

        int failedObserverCount = 0;
        int activeCount = 0;
        int standbyCount = 0;
        int unreachableCount = 0;
        // Then send the read request to a Standby (Observer) NameNode
        for (int i = 0; i < nameNodeProxies.size(); i++) {
          NNProxyInfo<T> current = getCurrentProxy();
          HAServiceState currState = current.getCachedState();
          if (currState != HAServiceState.OBSERVER) {
            if (currState == HAServiceState.ACTIVE) {
              activeCount++;
            } else if (currState == HAServiceState.STANDBY) {
              standbyCount++;
            } else if (currState == null) {
              unreachableCount++;
            }
            LOG.debug("Skipping proxy {} for {} because it is in state {}",
                current.proxyInfo, method.getName(),
                currState == null ? "unreachable" : currState);
            changeProxy(current);
            continue;
          }
          ...
      }

      // All other (non-read) requests still go to the Active NameNode
      LOG.debug("Using failoverProxy to service {}", method.getName());
      ProxyInfo<T> activeProxy = failoverProxy.getProxy();
      try {
        retVal = method.invoke(activeProxy.proxy, args);
      } catch (InvocationTargetException e) {
        // This exception will be handled by higher layers
        throw e.getCause();
      }
      // If this was reached, the request reached the active, so the
      // state is up-to-date with active and no further msync is needed.
      msynced = true;
      lastMsyncTimeMs = Time.monotonicNow();
      lastProxy = activeProxy;
      return retVal;
    }
}
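
The routing decision in the handler above can be condensed into a small sketch. It only models which NameNode is chosen; it performs no RPC, msync or retries. ObserverRoutingSketch and pickTarget are hypothetical names, a null entry stands for an unreachable NameNode as in the listing, and falling back to the Active when no Observer is found is an assumption about the part of the listing that is elided.

```java
import java.util.Arrays;
import java.util.List;

class ObserverRoutingSketch {
  enum State { ACTIVE, STANDBY, OBSERVER }

  /**
   * Returns the index of the NameNode that should receive the call,
   * given the cached HA state of every configured NameNode.
   */
  static int pickTarget(boolean observerReadEnabled, boolean isRead,
      List<State> cachedStates) {
    if (observerReadEnabled && isRead) {
      for (int i = 0; i < cachedStates.size(); i++) {
        // Skip anything that is not an Observer (Active, Standby, unreachable).
        if (cachedStates.get(i) == State.OBSERVER) {
          return i;
        }
      }
    }
    // Writes, and (by assumption) reads that found no Observer, go to the Active.
    return cachedStates.indexOf(State.ACTIVE);
  }

  public static void main(String[] args) {
    List<State> states =
        Arrays.asList(State.ACTIVE, State.STANDBY, null, State.OBSERVER);
    System.out.println(pickTarget(true, true, states));   // read: prints 3
    System.out.println(pickTarget(true, false, states));  // write: prints 0
    System.out.println(pickTarget(false, true, states));  // feature off: prints 0
  }
}
```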

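Flow diagram

Combining the code logic and the process analysis above, the flow of HDFS Standby Read is shown in the figure below:

The Observer NameNode in the figure above is a new role introduced by the Standby Read feature. It is essentially a lighter-weight Standby NameNode; the main difference from the original Standby is that it does not perform checkpoints and similar operations. A NameNode can switch between the Observer and Standby states, but an Observer NameNode cannot transition directly to or from the Active state.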
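
The transition rule just described can be written down as a small lookup table. This is a sketch of the rule as stated in the text, plus the ordinary Active/Standby failover of HDFS HA; HaStateTransitions and canTransition are hypothetical names, not part of Hadoop.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

class HaStateTransitions {
  enum State { ACTIVE, STANDBY, OBSERVER }

  private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);

  static {
    // Ordinary HA failover: Active <-> Standby.
    ALLOWED.put(State.ACTIVE, EnumSet.of(State.STANDBY));
    // Standby can become Active or Observer.
    ALLOWED.put(State.STANDBY, EnumSet.of(State.ACTIVE, State.OBSERVER));
    // Observer can only go back to Standby, never straight to Active.
    ALLOWED.put(State.OBSERVER, EnumSet.of(State.STANDBY));
  }

  static boolean canTransition(State from, State to) {
    return ALLOWED.get(from).contains(to);
  }

  public static void main(String[] args) {
    System.out.println(canTransition(State.STANDBY, State.OBSERVER)); // true
    System.out.println(canTransition(State.OBSERVER, State.ACTIVE));  // false
  }
}
```

Under this rule, promoting an Observer to Active takes two steps: Observer to Standby, then Standby to Active.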

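A large part of the HDFS Standby Read implementation lies in the optimization that lets the SNN read the editlog quickly; interested readers can follow the reference link.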

References

[1]. https://issues.apache.org/jira/browse/HDFS-12943 . Consistent Reads from Standby Node

