On the HDFS NameNode crash caused by a datacenter switch failure (continued)

The process was painful, and the conclusion at the end is unsettling.


The previous post's analysis established at least two conclusions:
1. If the active NN's writes to the JNs fail as a whole, the active NN proactively calls terminate() and the process exits.
2. Every JN that appears in the configuration item dfs.namenode.shared.edits.dir is, from the NN's point of view, "required".
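
For context, here is a minimal sketch of what that configuration looks like when set programmatically via Hadoop's Configuration API (the journal ID "mycluster" is a placeholder of mine; the JN addresses are the ones from the logs below; in a real deployment this lives in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;

public class SharedEditsDirSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // A qjournal:// URI naming every JournalNode; each JN listed here is
    // "required" as far as the NameNode is concerned.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://192.168.146.66:8485;192.168.146.67:8485;192.168.146.68:8485/mycluster");
    System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
  }
}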


This follow-up explains what "the active NN's writes to the JNs failing as a whole" actually means. Working from the opposite direction of the previous post, it traces how the problem came about, and checks whether the code is consistent with QJM's quorum mechanism (the answer is, of course, yes).


Once again, let's start from the active NN's FATAL log.


2015-11-16 07:36:50,478 INFO  namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 11830 Total time for transactions(ms): 394 Number of transactions batched in Syncs: 7342 Number of syncs: 350 SyncTimes(ms): 735 30792 26555
2015-11-16 07:36:50,481 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(364)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.146.66:8485, 192.168.146.67:8485, 192.168.146.68:8485], stream=QuorumOutputStream starting at txid 4776804880))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:499)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:495)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:623)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3001)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:647)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)


Supplementary related log entries:
2015-11-16 07:36:26,770 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 4670ms to send a batch of 78 edits (12198 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:50,383 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 21267ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:29,116 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2345ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.68:8485
2015-11-16 07:36:29,115 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2344ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.66:8485
2015-11-16 07:36:50,459 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 23689 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [192.168.146.67:8485]

Note the last entry: by 07:36:50 only 192.168.146.67 had acknowledged the sendEdits call, while a quorum of 2 out of 3 JNs was required, so the wait timed out.




In JournalSet.java, JournalSet.mapJournalsAndReportErrors():
  /**
   * Apply the given operation across all of the journal managers, disabling
   * any for which the closure throws an IOException.
   * @param closure {@link JournalClosure} object encapsulating the operation.
   * @param status message used for logging errors (e.g. "opening journal")
   * @throws IOException If the operation fails on all the journals.
   */
  private void mapJournalsAndReportErrors(JournalClosure closure, String status)
      throws IOException {


    List<JournalAndStream> badJAS = Lists.newLinkedList();
    for (JournalAndStream jas : journals) {
      try {
        closure.apply(jas); // Note: if JournalClosure.apply() throws an Exception or Error, it ultimately leads to the terminate() below
      } catch (Throwable t) {
        if (jas.isRequired()) {
          final String msg = "Error: " + status + " failed for required journal ("
            + jas + ")";
          LOG.fatal(msg, t);
          // If we fail on *any* of the required journals, then we must not
          // continue on any of the other journals. Abort them to ensure that
          // retry behavior doesn't allow them to keep going in any way.
          abortAllJournals();
          // the current policy is to shutdown the NN on errors to shared edits
          // dir. There are many code paths to shared edits failures - syncs,
          // roll of edits etc. All of them go through this common function
          // where the isRequired() check is made. Applying exit policy here
          // to catch all code paths.
          terminate(1, msg);
        } else {
          LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
          badJAS.add(jas);
        }
      }
    }
    disableAndReportErrorOnJournals(badJAS);
    if (!NameNodeResourcePolicy.areResourcesAvailable(journals,
        minimumRedundantJournals)) {
      String message = status + " failed for too many journals";
      LOG.error("Error: " + message);
      throw new IOException(message);
    }
  }
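
For reference, the terminate() called above is the statically imported org.apache.hadoop.util.ExitUtil.terminate(int, String): it logs the message and then exits the JVM with the given status, which is exactly how the active NN process died. A minimal sketch of that behavior (the class name is mine; ExitUtil.disableSystemExit() is normally a test hook):

import org.apache.hadoop.util.ExitUtil;

public class TerminateSketch {
  public static void main(String[] args) {
    // With system exits disabled, terminate() throws ExitException instead
    // of calling System.exit(), so we can observe the status it would use.
    ExitUtil.disableSystemExit();
    try {
      ExitUtil.terminate(1, "Error: flush failed for required journal (...)");
    } catch (ExitUtil.ExitException e) {
      System.out.println("would have exited with status " + e.status);
    }
  }
}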


JournalSet.JournalSetOutputStream.flush():
    @Override
    public void flush() throws IOException {
      mapJournalsAndReportErrors(new JournalClosure() {
        @Override
        public void apply(JournalAndStream jas) throws IOException {
          if (jas.isActive()) {
            jas.getCurrentStream().flush(); // Note: in this apply() implementation, EditLogOutputStream.flush() is the only call that can throw an IOException
          }
        }
      }, "flush");
    }


JournalSet.JournalAndStream.getCurrentStream():
    private EditLogOutputStream stream;
    EditLogOutputStream getCurrentStream() {
      return stream;
    }


In EditLogOutputStream.java:
  /**
   * Flush and sync all data that is ready to be flush
   * {@link #setReadyToFlush()} into underlying persistent store.
   * @param durable if true, the edits should be made truly durable before
   * returning
   * @throws IOException
   */
  abstract protected void flushAndSync(boolean durable) throws IOException;


  /**
   * Flush data to persistent store.
   * Collect sync metrics.
   */
  public void flush() throws IOException {
    flush(true);
  }


  public void flush(boolean durable) throws IOException {
    numSync++;
    long start = monotonicNow();
    flushAndSync(durable); // Note: in this flush() implementation, flushAndSync() is the only call that can throw an IOException
    long end = monotonicNow();
    totalTimeSync += (end - start);
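    // These counters surface in FSEditLog.printStatistics(): numSync becomes
    // the "Number of syncs" field and totalTimeSync the per-journal
    // "SyncTimes(ms)" values in the INFO line quoted at the top.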
  }


In QuorumOutputStream.java:
/**
 * EditLogOutputStream implementation that writes to a quorum of
 * remote journals.
 */
class QuorumOutputStream extends EditLogOutputStream {


  private final int writeTimeoutMs;


  @Override
  protected void flushAndSync(boolean durable) throws IOException {
    int numReadyBytes = buf.countReadyBytes();
    if (numReadyBytes > 0) {
      int numReadyTxns = buf.countReadyTxns();
      long firstTxToFlush = buf.getFirstReadyTxId();


      assert numReadyTxns > 0;


      // Copy from our double-buffer into a new byte array. This is for
      // two reasons:
      // 1) The IPC code has no way of specifying to send only a slice of
      //    a larger array.
      // 2) because the calls to the underlying nodes are asynchronous, we
      //    need a defensive copy to avoid accidentally mutating the buffer
      //    before it is sent.
      DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
      buf.flushTo(bufToSend);
      assert bufToSend.getLength() == numReadyBytes;
      byte[] data = bufToSend.getData();
      assert data.length == bufToSend.getLength();


      // Note: AsyncLoggerSet.sendEdits() does not throw exceptions
      QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
          segmentTxId, firstTxToFlush,
          numReadyTxns, data);
      // Note: AsyncLoggerSet.waitForWriteQuorum() can throw an IOException.
      // Still to confirm: what is the value of writeTimeoutMs, and where does it come from?
      loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");


      // Since we successfully wrote this batch, let the loggers know. Any future
      // RPCs will thus let the loggers know of the most recent transaction, even
      // if a logger has fallen behind.
      loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
    }
  }


}
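
To make the fan-out-then-wait pattern concrete, here is a minimal, self-contained sketch of the same mechanism (my own illustrative code, not Hadoop's; QJM uses Guava ListenableFutures, this uses a plain thread pool and a latch): the write is sent to all journals asynchronously, and the caller blocks until a majority acks or the timeout elapses, throwing the same kind of IOException seen in the FATAL log.

import java.io.IOException;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QuorumWriteSketch {

  /** Fan one batch of edits out to every journal; wait for a majority of acks. */
  static void writeToQuorum(List<Callable<Void>> journalWrites, long timeoutMs)
      throws IOException, InterruptedException {
    int majority = journalWrites.size() / 2 + 1; // same arithmetic as getMajoritySize()
    CountDownLatch acks = new CountDownLatch(majority);
    ExecutorService pool = Executors.newFixedThreadPool(journalWrites.size());
    try {
      for (Callable<Void> write : journalWrites) {
        pool.submit(() -> {
          try {
            write.call();     // the RPC to one JournalNode
            acks.countDown(); // only successes count toward the quorum
          } catch (Exception e) {
            // a failed or hung journal simply never contributes an ack
          }
        });
      }
      // Mirrors AsyncLoggerSet.waitForWriteQuorum(): block until a majority
      // acks, or give up after timeoutMs; the give-up path is what produced
      // the FATAL log above.
      if (!acks.await(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new IOException("Timed out waiting " + timeoutMs
            + "ms for a quorum of nodes to respond.");
      }
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Three fake journals: two ack quickly, one hangs (say, behind a bad switch).
    List<Callable<Void>> journals = List.of(
        () -> null,
        () -> null,
        () -> { Thread.sleep(60_000); return null; });
    writeToQuorum(journals, 20_000); // succeeds: 2 of 3 is a quorum
    System.out.println("quorum write succeeded");
  }
}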


In AsyncLoggerSet.java:
  public QuorumCall<AsyncLogger, Void> sendEdits(
      long segmentTxId, long firstTxnId, int numTxns, byte[] data) {
    Map<AsyncLogger, ListenableFuture<Void>> calls = Maps.newHashMap();
    for (AsyncLogger logger : loggers) {
      ListenableFuture<Void> future =
        logger.sendEdits(segmentTxId, firstTxnId, numTxns, data);
      calls.put(logger, future);
    }
    return QuorumCall.create(calls);
  }


  /**
   * Wait for a quorum of loggers to respond to the given call. If a quorum
   * can't be achieved, throws a QuorumException.
   * @param q the quorum call
   * @param timeoutMs the number of millis to wait
   * @param operationName textual description of the operation, for logging
   * @return a map of successful results
   * @throws QuorumException if a quorum doesn't respond with success
   * @throws IOException if the thread is interrupted or times out
   */
  <V> Map<AsyncLogger, V> waitForWriteQuorum(QuorumCall<AsyncLogger, V> q,
      int timeoutMs, String operationName) throws IOException {
    // Note: this is the crux of the quorum mechanism; see AsyncLoggerSet.getMajoritySize() below
    int majority = getMajoritySize();
    try {
      // Note: QuorumCall.waitFor() may emit a WARN log
      q.waitFor(
          loggers.size(), // either all respond
          majority, // or we get a majority successes
          majority, // or we get a majority failures,
          timeoutMs, operationName);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted waiting " + timeoutMs + "ms for a " +
          "quorum of nodes to respond.");
    } catch (TimeoutException e) {
      // Note: this exception message matches the one in the FATAL log
      throw new IOException("Timed out waiting " + timeoutMs + "ms for a " +
          "quorum of nodes to respond.");
    }


    if (q.countSuccesses() < majority) {
      q.rethrowException("Got too many exceptions to achieve quorum size " +
          getMajorityString());
    }


    return q.getResults();
  }


  /**
   * @return the number of nodes which are required to obtain a quorum.
   */
  int getMajoritySize() {
    // Note: still to confirm where loggers ultimately comes from
    return loggers.size() / 2 + 1;
  }
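
With integer division this yields the expected majorities; a throwaway check (my snippet, not Hadoop code):

public class MajoritySketch {
  public static void main(String[] args) {
    // majority = size / 2 + 1, using integer division
    for (int n : new int[] {3, 5, 7}) {
      System.out.println(n + " loggers -> majority of " + (n / 2 + 1));
    }
    // prints:
    // 3 loggers -> majority of 2
    // 5 loggers -> majority of 3
    // 7 loggers -> majority of 4
  }
}

For the three JNs in this cluster, a quorum is therefore 2 acks; the WARN log above shows only one success when the timeout expired.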


In QuorumCall.java:
  /**
   * Wait for the quorum to achieve a certain number of responses.
   *
   * Note that, even after this returns, more responses may arrive,
   * causing the return value of other methods in this class to change.
   *
   * @param minResponses return as soon as this many responses have been
   * received, regardless of whether they are successes or exceptions
   * @param minSuccesses return as soon as this many successful (non-exception)
   * responses have been received
   * @param maxExceptions return as soon as this many exception responses
   * have been received. Pass 0 to return immediately if any exception is
   * received.
   * @param millis the number of milliseconds to wait for
   * @throws InterruptedException if the thread is interrupted while waiting
   * @throws TimeoutException if the specified timeout elapses before
   * achieving the desired conditions
   */
  public synchronized void waitFor(
      int minResponses, int minSuccesses, int maxExceptions,
      int millis, String operationName)
      throws InterruptedException, TimeoutException {
    long st = Time.monotonicNow();
    long nextLogTime = st + (long)(millis * WAIT_PROGRESS_INFO_THRESHOLD);
    long et = st + millis;
    while (true) {
      checkAssertionErrors();
      if (minResponses > 0 && countResponses() >= minResponses) return;
      if (minSuccesses > 0 && countSuccesses() >= minSuccesses) return;
      if (maxExceptions >= 0 && countExceptions() > maxExceptions) return;
      long now = Time.monotonicNow();


      if (now > nextLogTime) {
        long waited = now - st;
        // Note: the content of this msg matches the WARN log shown earlier.
        // Reaching this point also means none of the three return conditions
        // above were satisfied, i.e. the majority required by QJM's quorum
        // mechanism had not yet been reached.
        String msg = String.format(
            "Waited %s ms (timeout=%s ms) for a response for %s",
            waited, millis, operationName);
        if (!successes.isEmpty()) {
          msg += ". Succeeded so far: [" + Joiner.on(",").join(successes.keySet()) + "]";
        }
        if (!exceptions.isEmpty()) {
          msg += ". Exceptions so far: [" + getExceptionMapString() + "]";
        }
        if (successes.isEmpty() && exceptions.isEmpty()) {
          msg += ". No responses yet.";
        }
        if (waited > millis * WAIT_PROGRESS_WARN_THRESHOLD) {
          QuorumJournalManager.LOG.warn(msg);
        } else {
          QuorumJournalManager.LOG.info(msg);
        }
        nextLogTime = now + WAIT_PROGRESS_INTERVAL_MILLIS;
      }
      long rem = et - now;
      if (rem <= 0) {
        // Note: the TimeoutException thrown here is what ultimately terminates the active NN process
        throw new TimeoutException();
      }
      rem = Math.min(rem, nextLogTime - now);
      rem = Math.max(rem, 1);
      wait(rem);
    }
  }


In QuorumJournalManager.java:
/**
 * A JournalManager that writes to a set of remote JournalNodes,
 * requiring a quorum of nodes to ack each write.
 */
@InterfaceAudience.Private
public class QuorumJournalManager implements JournalManager {


    private final int writeTxnsTimeoutMs;


    // Note (excerpted from the constructor): this is where the timeout value above comes from
    this.writeTxnsTimeoutMs = conf.getInt(
        DFSConfigKeys.DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY,
        DFSConfigKeys.DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT);


}


In DFSConfigKeys.java:
    public static final String  DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY = "dfs.qjournal.write-txns.timeout.ms";
    public static final int     DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT = 20000;




At this point, combined with the conclusions above, the picture is clear:
While writing edit log to the JNs, the active NN could not complete the write on a quorum in time (the write timed out), so an exception (TimeoutException) was thrown.
The deeper cause may be the network, or a JVM GC pause; the GC logs are gone, so this cannot be confirmed. One hint, though: the WARN log shows a wait of 23689 ms against a 20000 ms timeout, and since QuorumCall.waitFor() sleeps in short bounded increments, overshooting the deadline by several seconds suggests the NN thread itself was stalled, which is consistent with a GC pause.
As for the network, it is not that the network was outright bad; rather, network conditions failed to meet the level demanded by the configuration item above (dfs.qjournal.write-txns.timeout.ms).
So the most direct fix is to increase that configuration value, currently at its default of 20000 ms; a sketch follows below.
To be safe, it is best to also increase the other timeout-related configuration items involved in QuorumJournalManager.java.
At the same time, be aware that in extreme cases both HA NNs may exit for this same reason, leaving the entire HDFS cluster unavailable.
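
A minimal sketch of that change, using Hadoop's Configuration API (the 60000 ms value is an arbitrary illustration, not a tuned recommendation; in a real deployment the property is set in hdfs-site.xml on both NameNodes):

import org.apache.hadoop.conf.Configuration;

public class QjmTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Equivalent to a <property> entry in hdfs-site.xml. 60000 ms is purely
    // illustrative; size it against the observed network and GC stall times.
    conf.setInt("dfs.qjournal.write-txns.timeout.ms", 60000);
    System.out.println(conf.getInt("dfs.qjournal.write-txns.timeout.ms", 20000));
  }
}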