Table of Contents

Preface
Background and Requirements of HDFS Standby Read
Consistent-Read Control on the Standby NameNode
References

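Preface

HDFS has a mature HA mechanism to keep the service highly available. In HA mode there is an Active NameNode and a Standby NameNode. The Active NameNode serves client requests, while the Standby NameNode performs checkpoints and stays ready to take over as Active if the current Active NameNode becomes unavailable unexpectedly. In day-to-day operation the Standby NameNode has little to do: apart from periodic checkpoints and near-real-time synchronization of metadata, it handles no read or write requests from external clients, so its load is fairly low. As the pressure on the Active NameNode keeps growing, could the Standby NameNode take over part of the read load? The Hadoop community proposed this idea long ago and has since implemented it. In this article I walk through parts of the code to give a brief analysis of how HDFS Standby Read works.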

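Background and Requirements of HDFS Standby Read

Let us start with the background and requirements. As a cluster grows, the load on the Active NameNode keeps increasing. The usual answer is HDFS Federation, which scales the service horizontally. Federation, however, does nothing to extract more capacity from an individual NameNode service, and that is the gap HDFS Standby Read fills.

In Standby Read mode, write requests are still handled by the Active NameNode. The Standby only gains the ability to serve reads, so no major change to the NameNode's core logic is involved. The hardest problem to solve is consistent reads from the Standby.

As we know, the Standby NameNode synchronizes transactions by reading the editlog from the JournalNodes. There is a time gap between the Active NameNode writing out a batch of editlog entries and the Standby NameNode reading that batch. So implementing Standby Read is not just a matter of redirecting read requests to the Standby NN; the Standby has to wait until the relevant transactions have been synchronized. The following sections describe how the community handled this.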

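Consistent-Read Control on the Standby NameNode

How It Works

Given the state synchronization problem described in the previous section, the Standby NameNode may serve a client's read request only after it has reached the txid that the client most recently saw.

What does that mean? When a client issues an RPC, the Active NameNode and the Standby NameNode each have their own current txid, and the Active's txid is never smaller than the Standby's (it is usually ahead). Call the Active's value ann txid and the Standby's value snn txid. If the client sends its subsequent requests to the Active, there is no staleness problem, because the Active always has the latest state. But if we want the Standby NameNode to serve the client as well, it must at least reach the state the Active NameNode had at the moment the client issued its RPC, that is, snn txid must reach that ann txid value.

The process can be summarized as follows:

1) Before issuing the read, the client obtains the current txid of the Active NameNode. We call this value lastSeenTxid.
2) The client then sends the read request to the Standby NameNode and attaches the lastSeenTxid from the previous step.
3) While handling that RPC, the Standby NameNode checks whether its own current txid has reached the client's lastSeenTxid. If it has, the request is processed normally; otherwise the request is put back into the RPC call queue to be picked up again later. From the client's point of view, a requeued request simply means that the RPC call has not finished yet.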
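The three steps above can be sketched as a small simulation. This is only an illustration under simplifying assumptions: `Node`, `canServe` and `roundsUntilServed` are hypothetical names that do not exist in Hadoop, and the Standby is assumed to apply a fixed batch of transactions each time the call is deferred, whereas a real Standby tails the editlog independently of request handling.

```java
final class StandbyReadSketch {
    // Hypothetical stand-in for a NameNode: it only tracks the last applied txid.
    static final class Node {
        long txid;
        Node(long txid) { this.txid = txid; }
    }

    // Step 3: the Standby may serve the read once it has reached lastSeenTxid.
    static boolean canServe(Node standby, long lastSeenTxid) {
        return standby.txid >= lastSeenTxid;
    }

    // Returns how many times the read is deferred before the Standby catches up.
    // Each deferral lets the Standby apply `batch` more transactions (assumption).
    static int roundsUntilServed(Node active, Node standby, long batch) {
        if (batch <= 0) {
            throw new IllegalArgumentException("batch must be positive");
        }
        long lastSeenTxid = active.txid;            // step 1: learn the Active's txid
        int deferred = 0;
        while (!canServe(standby, lastSeenTxid)) {  // steps 2 and 3: send, compare
            deferred++;                             // the call goes back to the queue
            standby.txid = Math.min(active.txid, standby.txid + batch);
        }
        return deferred;
    }

    public static void main(String[] args) {
        Node active = new Node(1000);
        Node standby = new Node(940);
        // 940 -> 965 -> 990 -> 1000, so the call is deferred three times.
        System.out.println(roundsUntilServed(active, standby, 25)); // prints 3
    }
}
```

The only point of the sketch is the comparison in `canServe`: the read is never answered from a state older than the one the client has already observed.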

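To keep clients from waiting a long time for the Standby NameNode to reach lastSeenTxid, the community also improved how the Standby NameNode synchronizes the editlog, including support for reading transactions from in-progress editlog segments (the edits_inprogress files) and for reading editlog data from memory.

Next, let us map this process onto the actual code.

Code Walkthrough

The implementation defines two classes that hold the txid each side can currently "see", one for the client and one for the server.

The client-side class is ClientGSIContext; the server-side (NameNode) class is GlobalStateIdContext.

Let us start with the client-side class, ClientGSIContext. It maintains a lastSeenStateId value internally, as shown below: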

/**
 * Global State Id context for the client.
 * <p>
 * This is the client side implementation responsible for receiving
 * state alignment info from server(s).
 */
@InterfaceAudience.Private
@InterfaceStability.Evolving
public class ClientGSIContext implements AlignmentContext {

  private final LongAccumulator lastSeenStateId =
      new LongAccumulator(Math::max, Long.MIN_VALUE);
  ...
}

lastSeenStateId is updated when an RPC response is received, and it is read when an RPC request is sent, at which point its current value is written into the request header. Both operations live in the same class:

  /**
   * When the client receives an RPC response, update the current
   * lastSeenStateId.
   */
  @Override
  public void receiveResponseState(RpcResponseHeaderProto header) {
    lastSeenStateId.accumulate(header.getStateId());
  }

  /**
   * When the client sends a request, write the current lastSeenStateId
   * into the RPC request header.
   */
  @Override
  public void updateRequestState(RpcRequestHeaderProto.Builder header) {
    header.setStateId(lastSeenStateId.longValue());
  }
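The choice of `LongAccumulator` with `Math::max` means the stored state id can only move forward, even if a response carrying an older state id arrives late. The following standalone sketch shows that behaviour with the JDK class alone; `LastSeenStateIdDemo`, `onResponse` and `stateIdForNextRequest` are hypothetical names standing in for the Hadoop methods above.

```java
import java.util.concurrent.atomic.LongAccumulator;

final class LastSeenStateIdDemo {
    // Same construction as in ClientGSIContext: keep the largest value seen so far.
    private final LongAccumulator lastSeenStateId =
        new LongAccumulator(Math::max, Long.MIN_VALUE);

    // Stand-in for receiveResponseState: record the state id carried by a response.
    void onResponse(long stateId) {
        lastSeenStateId.accumulate(stateId);
    }

    // Stand-in for updateRequestState: the value attached to the next request.
    long stateIdForNextRequest() {
        return lastSeenStateId.longValue();
    }

    public static void main(String[] args) {
        LastSeenStateIdDemo client = new LastSeenStateIdDemo();
        client.onResponse(120); // response from an up-to-date server
        client.onResponse(95);  // older state id: max() leaves the value at 120
        client.onResponse(130);
        System.out.println(client.stateIdForNextRequest()); // prints 130
    }
}
```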

Now let us look at how the server-side GlobalStateIdContext works. First, the class definition:

/**
 * This is the server side implementation responsible for passing
 * state alignment info to clients.
 */
@InterfaceAudience.Private
@InterfaceStability.Evolving
class GlobalStateIdContext implements AlignmentContext {
  /**
   * Estimated number of journal transactions a typical NameNode can execute
   * per second. The number is used to estimate how long a client's
   * RPC request will wait in the call queue before the Observer catches up
   * with its state id.
   */
  private static final long ESTIMATED_TRANSACTIONS_PER_SECOND = 10000L;

  /**
   * The client wait time on an RPC request is composed of
   * the server execution time plus the communication time.
   * This is an expected fraction of the total wait time spent on
   * server execution.
   */
  private static final float ESTIMATED_SERVER_TIME_MULTIPLIER = 0.8f;
  /** FSNamesystem, used to obtain the current latest txid. */
  private final FSNamesystem namesystem;
  private final HashSet<String> coordinatedMethods;
  ...
}

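When handling a client RPC, GlobalStateIdContext does two main things:

1) On receiving an RPC request, it extracts the client's lastSeenTxid from the request header and compares it with its own latest txid.
2) After the RPC has been processed, while building the RPC response, it writes its own latest txid into the response header. The client stores it as its lastSeenTxid, meaning the client has now seen a newer txid state.

The methods corresponding to these two steps are shown below: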

  /**
   * After the server has processed an RPC, it writes its own latest txid into
   * the RPC response so that the client can record it as its lastSeenTxid.
   */
  @Override
  public void updateResponseState(RpcResponseHeaderProto.Builder header) {
    // Using getCorrectLastAppliedOrWrittenTxId will acquire the lock on
    // FSEditLog. This is needed so that ANN will return the correct state id
    // it currently has. But this may not be necessary for Observer, may want
    // revisit for optimization. Same goes to receiveRequestState.
    header.setStateId(getLastSeenStateId());
  }

  /**
   * Server-side check of the state id carried by an incoming request.
   */
  @Override
  public long receiveRequestState(RpcRequestHeaderProto header,
      long clientWaitTime) throws IOException {
    if (!header.hasStateId() &&
        HAServiceState.OBSERVER.equals(namesystem.getState())) {
      // This could happen if client configured with non-observer proxy provider
      // (e.g., ConfiguredFailoverProxyProvider) is accessing a cluster with
      // observers. In this case, we should let the client failover to the
      // active node, rather than potentially serving stale result (client
      // stateId is 0 if not set).
      throw new StandbyException("Observer Node received request without "
          + "stateId. This mostly likely is because client is not configured "
          + "with " + ObserverReadProxyProvider.class.getSimpleName());
    }
    long serverStateId = getLastSeenStateId();
    long clientStateId = header.getStateId();
    FSNamesystem.LOG.trace("Client State ID= {} and Server State ID= {}",
        clientStateId, serverStateId);

    if (clientStateId > serverStateId &&
        HAServiceState.ACTIVE.equals(namesystem.getState())) {
      FSNamesystem.LOG.warn("The client stateId: {} is greater than "
          + "the server stateId: {} This is unexpected. "
          + "Resetting client stateId to server stateId",
          clientStateId, serverStateId);
      return serverStateId;
    }
    
    // If the client's lastSeenTxid is far ahead of the server's current txid,
    // throw an exception. If it is not ahead, or the gap is within the
    // expected range, continue processing.
    if (HAServiceState.OBSERVER.equals(namesystem.getState()) &&
        clientStateId - serverStateId >
        ESTIMATED_TRANSACTIONS_PER_SECOND
            * TimeUnit.MILLISECONDS.toSeconds(clientWaitTime)
            * ESTIMATED_SERVER_TIME_MULTIPLIER) {
      throw new RetriableException(
          "Observer Node is too far behind: serverStateId = "
              + serverStateId + " clientStateId = " + clientStateId);
    }
    return clientStateId;
  }

  // Returns this server's current latest txid.
  @Override
  public long getLastSeenStateId() {
    // Should not need to call getCorrectLastAppliedOrWrittenTxId()
    // see HDFS-14822.
    return namesystem.getFSImage().getLastAppliedOrWrittenTxId();
  }
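To get a feel for the "too far behind" check in receiveRequestState, the arithmetic can be isolated in a few lines. The two constants are copied from GlobalStateIdContext above; the class and method names (`ObserverLagCheck`, `maxTolerableLag`, `tooFarBehind`) are hypothetical, and the 60-second wait time in the example is only an illustrative input, not a value taken from Hadoop's configuration.

```java
import java.util.concurrent.TimeUnit;

final class ObserverLagCheck {
    // Constants as they appear in GlobalStateIdContext.
    static final long ESTIMATED_TRANSACTIONS_PER_SECOND = 10000L;
    static final float ESTIMATED_SERVER_TIME_MULTIPLIER = 0.8f;

    // Largest txid gap the check tolerates for a given client wait time.
    static float maxTolerableLag(long clientWaitTimeMs) {
        return ESTIMATED_TRANSACTIONS_PER_SECOND
            * TimeUnit.MILLISECONDS.toSeconds(clientWaitTimeMs)
            * ESTIMATED_SERVER_TIME_MULTIPLIER;
    }

    // Same comparison as in receiveRequestState for a node in OBSERVER state.
    static boolean tooFarBehind(long clientStateId, long serverStateId,
                                long clientWaitTimeMs) {
        return clientStateId - serverStateId > maxTolerableLag(clientWaitTimeMs);
    }

    public static void main(String[] args) {
        // 10000 tx/s * 60 s * 0.8 = 480000 transactions of tolerated lag.
        System.out.println(maxTolerableLag(60_000));
        System.out.println(tooFarBehind(1_480_001L, 1_000_000L, 60_000)); // true
        System.out.println(tooFarBehind(1_480_000L, 1_000_000L, 60_000)); // false
    }
}
```

One consequence of `TimeUnit.MILLISECONDS.toSeconds` truncating to whole seconds is that, in this arithmetic, a wait time below one second gives a threshold of zero, so any positive gap counts as too far behind.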

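Note that receiveRequestState above is only a preliminary check performed before the client request enters the RPC call queue. Later, when a Handler takes the call from the queue for processing, the client's lastSeenTxid is compared with the server's txid once more.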

  /** Handles queued calls . */
  private class Handler extends Thread {
    public Handler(int instanceNumber) {
      this.setDaemon(true);
      this.setName("IPC Server handler "+ instanceNumber +
          " on default port " + port);
    }

    @Override
    public void run() {
      LOG.debug(Thread.currentThread().getName() + ": starting");
      SERVER.set(Server.this);
      while (running) {
        TraceScope traceScope = null;
        Call call = null;
        long startTimeNanos = 0;
        // True iff the connection for this call has been dropped.
        // Set to true by default and update to false later if the connection
        // can be succesfully read.
        boolean connDropped = true;

        try {
          // 1) Take a call from the call queue for processing.
          call = callQueue.take(); // pop the queue; maybe blocked here
          startTimeNanos = Time.monotonicNowNanos();
          // If this call supports Standby Read (it is coordinated) and the txid
          // the client has seen is greater than the server's txid, requeue the
          // call and defer it until the server's txid reaches the client's.
          if (alignmentContext != null && call.isCallCoordinated() &&
              call.getClientStateId() > alignmentContext.getLastSeenStateId()) {
            /*
             * The call processing should be postponed until the client call's
             * state id is aligned (<=) with the server state id.

             * NOTE:
             * Inserting the call back to the queue can change the order of call
             * execution comparing to their original placement into the queue.
             * This is not a problem, because Hadoop RPC does not have any
             * constraints on ordering the incoming rpc requests.
             * In case of Observer, it handles only reads, which are
             * commutative.
             */
            // Re-queue the call and continue
            requeueCall(call);
            continue;
          }
          ...
  }
}
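The NOTE in the Handler code says that requeueing can change the order in which calls are executed. A minimal sketch of that effect follows, assuming a single handler and a plain `ArrayDeque` in place of Hadoop's call queue. `RequeueSketch`, `Call` and `serveOrder` are hypothetical names, and the server state id is advanced by a fixed step on every requeue purely to keep the example deterministic; in the real server the state advances as edits are applied, independently of the handler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class RequeueSketch {
    // Hypothetical call: a name plus the state id the client attached.
    static final class Call {
        final String name;
        final long clientStateId;
        Call(String name, long clientStateId) {
            this.name = name;
            this.clientStateId = clientStateId;
        }
    }

    // Returns the names of the calls in the order they end up being served.
    static List<String> serveOrder(List<Call> incoming, long serverStateId, long step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        Deque<Call> queue = new ArrayDeque<>(incoming);
        List<String> served = new ArrayList<>();
        while (!queue.isEmpty()) {
            Call call = queue.pollFirst();
            if (call.clientStateId > serverStateId) {
                queue.addLast(call);   // postpone: put the call back at the tail
                serverStateId += step; // assumption: the server catches up meanwhile
                continue;
            }
            served.add(call.name);     // state ids aligned: process the call
        }
        return served;
    }

    public static void main(String[] args) {
        List<Call> calls = Arrays.asList(
            new Call("A", 105), new Call("B", 100), new Call("C", 90));
        // A is ahead of the server (100), so B and C are served before it.
        System.out.println(serveOrder(calls, 100, 5)); // prints [B, C, A]
    }
}
```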

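That covers most of the server-side logic of Standby Read. Now go back to the steps listed earlier:

1) Before issuing the read, the client obtains the current txid of the Active NameNode. We call this value lastSeenTxid.
2) The client then sends the read request to the Standby NameNode and attaches the lastSeenTxid from the previous step.
…

How does the client first obtain the latest txid from the Active NameNode and then send the read to the Standby NameNode? Two RPC calls are involved.

The community added a new ProxyProvider class, ObserverReadProxyProvider, to encapsulate this logic. In ObserverReadInvocationHandler, before a read is sent to the Standby NameNode, an msync call to the Active NameNode brings the lastSeenStateId in the client's ClientGSIContext up to date (the client's response handling calls ClientGSIContext#receiveResponseState). The first read always triggers this msync through initializeMsync(); on later reads, autoMsyncIfNecessary() decides whether another msync is needed.

The relevant logic, from the ObserverReadProxyProvider class, is as follows: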

  private class ObserverReadInvocationHandler implements RpcInvocationHandler {

    @Override
    public Object invoke(Object proxy, final Method method, final Object[] args)
        throws Throwable {
      lastProxy = null;
      Object retVal;

      // Taken when Standby Read is enabled and the RPC method is a read.
      if (observerReadEnabled && shouldFindObserver() && isRead(method)) {
        if (!msynced) {
          // An msync() must first be performed to ensure that this client is
          // up-to-date with the active's state. This will only be done once.
          initializeMsync();
        } else {
          // On later requests, run msync against the Active NameNode when
          // necessary to refresh the client's lastSeenTxid.
          autoMsyncIfNecessary();
        }

        int failedObserverCount = 0;
        int activeCount = 0;
        int standbyCount = 0;
        int unreachableCount = 0;
        // Then send the read request to a Standby (Observer) NameNode.
        for (int i = 0; i < nameNodeProxies.size(); i++) {
          NNProxyInfo<T> current = getCurrentProxy();
          HAServiceState currState = current.getCachedState();
          if (currState != HAServiceState.OBSERVER) {
            if (currState == HAServiceState.ACTIVE) {
              activeCount++;
            } else if (currState == HAServiceState.STANDBY) {
              standbyCount++;
            } else if (currState == null) {
              unreachableCount++;
            }
            LOG.debug("Skipping proxy {} for {} because it is in state {}",
                current.proxyInfo, method.getName(),
                currState == null ? "unreachable" : currState);
            changeProxy(current);
            continue;
          }
          ...
      }
       
      // All other (non-read) requests still go to the Active NameNode.
      LOG.debug("Using failoverProxy to service {}", method.getName());
      ProxyInfo<T> activeProxy = failoverProxy.getProxy();
      try {
        retVal = method.invoke(activeProxy.proxy, args);
      } catch (InvocationTargetException e) {
        // This exception will be handled by higher layers
        throw e.getCause();
      }
      // If this was reached, the request reached the active, so the
      // state is up-to-date with active and no further msync is needed.
      msynced = true;
      lastMsyncTimeMs = Time.monotonicNow();
      lastProxy = activeProxy;
      return retVal;
    }
}
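The routing decision made by the invocation handler can be reduced to a small standalone sketch: reads look for a proxy whose cached state is OBSERVER and skip everything else, while other calls go to the Active. `ObserverSelectionSketch`, `pickObserver` and `route` are hypothetical names; the real code iterates with getCurrentProxy() and changeProxy() and contains retry handling that is elided in the excerpt above. Falling back to the Active when no Observer is found is an assumption of this sketch about that elided part.

```java
import java.util.Arrays;
import java.util.List;

final class ObserverSelectionSketch {
    // A null entry in the list stands for an unreachable NameNode.
    enum State { ACTIVE, STANDBY, OBSERVER }

    // Index of the first proxy in OBSERVER state, scanning from startIndex and
    // wrapping around; -1 if none of the proxies is an Observer.
    static int pickObserver(List<State> cachedStates, int startIndex) {
        int n = cachedStates.size();
        for (int i = 0; i < n; i++) {
            int idx = (startIndex + i) % n;
            if (cachedStates.get(idx) == State.OBSERVER) {
                return idx;
            }
            // Otherwise skip this proxy, as the loop in invoke() does.
        }
        return -1;
    }

    // Reads go to an Observer when one is available; everything else goes to
    // the Active (assumed fallback when no Observer is found).
    static String route(boolean isRead, List<State> cachedStates, int startIndex) {
        if (isRead) {
            int idx = pickObserver(cachedStates, startIndex);
            if (idx >= 0) {
                return "observer#" + idx;
            }
        }
        return "active";
    }

    public static void main(String[] args) {
        List<State> states =
            Arrays.asList(State.ACTIVE, null, State.OBSERVER, State.STANDBY);
        System.out.println(route(true, states, 0));  // prints observer#2
        System.out.println(route(false, states, 0)); // prints active
    }
}
```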

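Flow Diagram

Putting the code and the analysis above together, the flow of the HDFS Standby Read feature is shown in the following figure:

[Figure: flow diagram of the HDFS Standby Read feature]

The Observer NameNode in the figure above is a new role introduced by the Standby Read feature. It is essentially a more lightweight Standby NameNode; the main difference from a regular Standby is that it does not perform operations such as checkpointing. A NameNode can be switched between the Observer and Standby states in either direction, but an Observer NameNode cannot switch directly to or from the Active state.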
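The transition rule described above can be written down as a tiny predicate. This is only a model of the rule as stated in this article, not Hadoop's actual state machine; `HaTransitionSketch` and `canTransition` are hypothetical names, and treating a transition to the same state as not allowed is a choice made for this sketch.

```java
final class HaTransitionSketch {
    enum State { ACTIVE, STANDBY, OBSERVER }

    // Observer <-> Standby and Active <-> Standby are allowed;
    // Observer <-> Active must go through Standby.
    static boolean canTransition(State from, State to) {
        if (from == to) {
            return false; // sketch-level choice: not counted as a transition
        }
        boolean involvesObserver = from == State.OBSERVER || to == State.OBSERVER;
        boolean involvesActive = from == State.ACTIVE || to == State.ACTIVE;
        return !(involvesObserver && involvesActive);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(State.OBSERVER, State.STANDBY)); // true
        System.out.println(canTransition(State.OBSERVER, State.ACTIVE));  // false
    }
}
```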

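More than half of the HDFS Standby Read implementation lies in the optimizations that let the SNN read the editlog quickly. Readers interested in that part can follow the reference below.

References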

[1]. https://issues.apache.org/jira/browse/HDFS-12943 . Consistent Reads from Standby Node
