On the HDFS NameNode crash caused by a datacenter switch failure (continued)

The process was painful, and the conclusion at the end is unsettling.


The previous post's analysis established at least two conclusions:
1. If the active NN's writes to the JNs fail as a whole, the active NN proactively calls terminate() and the process exits.
2. Every JN that appears in the configuration item dfs.namenode.shared.edits.dir is, from the NN's point of view, "required".
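
For context, here is a minimal sketch of what that configuration looks like when set programmatically via Hadoop's Configuration API (the journal ID "mycluster" is a placeholder of mine; the JN addresses are the ones from the logs below; in a real deployment this lives in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;

public class SharedEditsDirSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // A qjournal:// URI naming every JournalNode; each JN listed here is
    // "required" as far as the NameNode is concerned.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://192.168.146.66:8485;192.168.146.67:8485;192.168.146.68:8485/mycluster");
    System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
  }
}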


This follow-up explains what "the active NN's writes to the JNs failing as a whole" actually means. Working from the opposite direction of the previous post, it traces how the problem came about, and checks whether the code is consistent with QJM's quorum mechanism (the answer is, of course, yes).


Once again, let's start from the active NN's FATAL log.


2015-11-16 07:36:50,478 INFO  namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 11830 Total time for transactions(ms): 394 Number of transactions batched in Syncs: 7342 Number of syncs: 350 SyncTimes(ms): 735 30792 26555
2015-11-16 07:36:50,481 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(364)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.146.66:8485, 192.168.146.67:8485, 192.168.146.68:8485], stream=QuorumOutputStream starting at txid 4776804880))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:499)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:495)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:623)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3001)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:647)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)


Supplementary related log entries:
2015-11-16 07:36:26,770 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 4670ms to send a batch of 78 edits (12198 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:50,383 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 21267ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:29,116 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2345ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.68:8485
2015-11-16 07:36:29,115 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2344ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.66:8485
2015-11-16 07:36:50,459 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 23689 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [192.168.146.67:8485]

Note the last entry: by 07:36:50 only 192.168.146.67 had acknowledged the sendEdits call, while a quorum of 2 out of 3 JNs was required, so the wait timed out.




In JournalSet.java, JournalSet.mapJournalsAndReportErrors():
  /**
   * Apply the given operation across all of the journal managers, disabling
   * any for which the closure throws an IOException.
   * @param closure {@link JournalClosure} object encapsulating the operation.
   * @param status message used for logging errors (e.g. "opening journal")
   * @throws IOException If the operation fails on all the journals.
   */
  private void mapJournalsAndReportErrors(JournalClosure closure, String status)
      throws IOException {


    List<JournalAndStream> badJAS = Lists.newLinkedList();
    for (JournalAndStream jas : journals) {
      try {
        closure.apply(jas); // Note: if JournalClosure.apply() throws an Exception or Error, it ultimately leads to the terminate() below
      } catch (Throwable t) {
        if (jas.isRequired()) {
          final String msg = "Error: " + status + " failed for required journal ("
            + jas + ")";
          LOG.fatal(msg, t);
          // If we fail on *any* of the required journals, then we must not
          // continue on any of the other journals. Abort them to ensure that
          // retry behavior doesn't allow them to keep going in any way.
          abortAllJournals();
          // the current policy is to shutdown the NN on errors to shared edits
          // dir. There are many code paths to shared edits failures - syncs,
          // roll of edits etc. All of them go through this common function
          // where the isRequired() check is made. Applying exit policy here
          // to catch all code paths.
          terminate(1, msg);
        } else {
          LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
          badJAS.add(jas);
        }
      }
    }
    disableAndReportErrorOnJournals(badJAS);
    if (!NameNodeResourcePolicy.areResourcesAvailable(journals,
        minimumRedundantJournals)) {
      String message = status + " failed for too many journals";
      LOG.error("Error: " + message);
      throw new IOException(message);
    }
  }
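
For reference, the terminate() called above is the statically imported org.apache.hadoop.util.ExitUtil.terminate(int, String): it logs the message and then exits the JVM with the given status, which is exactly how the active NN process died. A minimal sketch of that behavior (the class name is mine; ExitUtil.disableSystemExit() is normally a test hook):

import org.apache.hadoop.util.ExitUtil;

public class TerminateSketch {
  public static void main(String[] args) {
    // With system exits disabled, terminate() throws ExitException instead
    // of calling System.exit(), so we can observe the status it would use.
    ExitUtil.disableSystemExit();
    try {
      ExitUtil.terminate(1, "Error: flush failed for required journal (...)");
    } catch (ExitUtil.ExitException e) {
      System.out.println("would have exited with status " + e.status);
    }
  }
}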


JournalSet.JournalSetOutputStream.flush():
    @Override
    public void flush() throws IOException {
      mapJournalsAndReportErrors(new JournalClosure() {
        @Override
        public void apply(JournalAndStream jas) throws IOException {
          if (jas.isActive()) {
            jas.getCurrentStream().flush(); // Note: in this apply() implementation, EditLogOutputStream.flush() is the only call that can throw an IOException
          }
        }
      }, "flush");
    }


JournalSet.JournalAndStream.getCurrentStream():
    private EditLogOutputStream stream;
    EditLogOutputStream getCurrentStream() {
      return stream;
    }


In EditLogOutputStream.java:
  /**
   * Flush and sync all data that is ready to be flush
   * {@link #setReadyToFlush()} into underlying persistent store.
   * @param durable if true, the edits should be made truly durable before
   * returning
   * @throws IOException
   */
  abstract protected void flushAndSync(boolean durable) throws IOException;


  /**
   * Flush data to persistent store.
   * Collect sync metrics.
   */
  public void flush() throws IOException {
    flush(true);
  }


  public void flush(boolean durable) throws IOException {
    numSync++;
    long start = monotonicNow();
    flushAndSync(durable); // Note: in this flush() implementation, flushAndSync() is the only call that can throw an IOException
    long end = monotonicNow();
    totalTimeSync += (end - start);
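    // These counters surface in FSEditLog.printStatistics(): numSync becomes
    // the "Number of syncs" field and totalTimeSync the per-journal
    // "SyncTimes(ms)" values in the INFO line quoted at the top.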
  }


In QuorumOutputStream.java:
/**
 * EditLogOutputStream implementation that writes to a quorum of
 * remote journals.
 */
class QuorumOutputStream extends EditLogOutputStream {


  private final int writeTimeoutMs;


  @Override
  protected void flushAndSync(boolean durable) throws IOException {
    int numReadyBytes = buf.countReadyBytes();
    if (numReadyBytes > 0) {
      int numReadyTxns = buf.countReadyTxns();
      long firstTxToFlush = buf.getFirstReadyTxId();


      assert numReadyTxns > 0;


      // Copy from our double-buffer into a new byte array. This is for
      // two reasons:
      // 1) The IPC code has no way of specifying to send only a slice of
      //    a larger array.
      // 2) because the calls to the underlying nodes are asynchronous, we
      //    need a defensive copy to avoid accidentally mutating the buffer
      //    before it is sent.
      DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
      buf.flushTo(bufToSend);
      assert bufToSend.getLength() == numReadyBytes;
      byte[] data = bufToSend.getData();
      assert data.length == bufToSend.getLength();


      // Note: AsyncLoggerSet.sendEdits() does not throw exceptions
      QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
          segmentTxId, firstTxToFlush,
          numReadyTxns, data);
      // Note: AsyncLoggerSet.waitForWriteQuorum() can throw an IOException.
      // Still to confirm: what is the value of writeTimeoutMs, and where does it come from?
      loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");


      // Since we successfully wrote this batch, let the loggers know. Any future
      // RPCs will thus let the loggers know of the most recent transaction, even
      // if a logger has fallen behind.
      loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
    }
  }


}
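
To make the fan-out-then-wait pattern concrete, here is a minimal, self-contained sketch of the same mechanism (my own illustrative code, not Hadoop's; QJM uses Guava ListenableFutures, this uses a plain thread pool and a latch): the write is sent to all journals asynchronously, and the caller blocks until a majority acks or the timeout elapses, throwing the same kind of IOException seen in the FATAL log.

import java.io.IOException;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QuorumWriteSketch {

  /** Fan one batch of edits out to every journal; wait for a majority of acks. */
  static void writeToQuorum(List<Callable<Void>> journalWrites, long timeoutMs)
      throws IOException, InterruptedException {
    int majority = journalWrites.size() / 2 + 1; // same arithmetic as getMajoritySize()
    CountDownLatch acks = new CountDownLatch(majority);
    ExecutorService pool = Executors.newFixedThreadPool(journalWrites.size());
    try {
      for (Callable<Void> write : journalWrites) {
        pool.submit(() -> {
          try {
            write.call();     // the RPC to one JournalNode
            acks.countDown(); // only successes count toward the quorum
          } catch (Exception e) {
            // a failed or hung journal simply never contributes an ack
          }
        });
      }
      // Mirrors AsyncLoggerSet.waitForWriteQuorum(): block until a majority
      // acks, or give up after timeoutMs; the give-up path is what produced
      // the FATAL log above.
      if (!acks.await(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new IOException("Timed out waiting " + timeoutMs
            + "ms for a quorum of nodes to respond.");
      }
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Three fake journals: two ack quickly, one hangs (say, behind a bad switch).
    List<Callable<Void>> journals = List.of(
        () -> null,
        () -> null,
        () -> { Thread.sleep(60_000); return null; });
    writeToQuorum(journals, 20_000); // succeeds: 2 of 3 is a quorum
    System.out.println("quorum write succeeded");
  }
}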


In AsyncLoggerSet.java:
  public QuorumCall<AsyncLogger, Void> sendEdits(
      long segmentTxId, long firstTxnId, int numTxns, byte[] data) {
    Map<AsyncLogger, ListenableFuture<Void>> calls = Maps.newHashMap();
    for (AsyncLogger logger : loggers) {
      ListenableFuture<Void> future =
        logger.sendEdits(segmentTxId, firstTxnId, numTxns, data);
      calls.put(logger, future);
    }
    return QuorumCall.create(calls);
  }


  /**
   * Wait for a quorum of loggers to respond to the given call. If a quorum
   * can't be achieved, throws a QuorumException.
   * @param q the quorum call
   * @param timeoutMs the number of millis to wait
   * @param operationName textual description of the operation, for logging
   * @return a map of successful results
   * @throws QuorumException if a quorum doesn't respond with success
   * @throws IOException if the thread is interrupted or times out
   */
  <V> Map<AsyncLogger, V> waitForWriteQuorum(QuorumCall<AsyncLogger, V> q,
      int timeoutMs, String operationName) throws IOException {
    // Note: this is the crux of the quorum mechanism; see AsyncLoggerSet.getMajoritySize() below
    int majority = getMajoritySize();
    try {
      // Note: QuorumCall.waitFor() may emit a WARN log
      q.waitFor(
          loggers.size(), // either all respond
          majority, // or we get a majority successes
          majority, // or we get a majority failures,
          timeoutMs, operationName);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted waiting " + timeoutMs + "ms for a " +
          "quorum of nodes to respond.");
    } catch (TimeoutException e) {
      // Note: this exception message matches the one in the FATAL log
      throw new IOException("Timed out waiting " + timeoutMs + "ms for a " +
          "quorum of nodes to respond.");
    }


    if (q.countSuccesses() < majority) {
      q.rethrowException("Got too many exceptions to achieve quorum size " +
          getMajorityString());
    }


    return q.getResults();
  }


  /**
   * @return the number of nodes which are required to obtain a quorum.
   */
  int getMajoritySize() {
    // Note: still to confirm where loggers ultimately comes from
    return loggers.size() / 2 + 1;
  }
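
With integer division this yields the expected majorities; a throwaway check (my snippet, not Hadoop code):

public class MajoritySketch {
  public static void main(String[] args) {
    // majority = size / 2 + 1, using integer division
    for (int n : new int[] {3, 5, 7}) {
      System.out.println(n + " loggers -> majority of " + (n / 2 + 1));
    }
    // prints:
    // 3 loggers -> majority of 2
    // 5 loggers -> majority of 3
    // 7 loggers -> majority of 4
  }
}

For the three JNs in this cluster, a quorum is therefore 2 acks; the WARN log above shows only one success when the timeout expired.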


In QuorumCall.java:
  /**
   * Wait for the quorum to achieve a certain number of responses.
   *
   * Note that, even after this returns, more responses may arrive,
   * causing the return value of other methods in this class to change.
   *
   * @param minResponses return as soon as this many responses have been
   * received, regardless of whether they are successes or exceptions
   * @param minSuccesses return as soon as this many successful (non-exception)
   * responses have been received
   * @param maxExceptions return as soon as this many exception responses
   * have been received. Pass 0 to return immediately if any exception is
   * received.
   * @param millis the number of milliseconds to wait for
   * @throws InterruptedException if the thread is interrupted while waiting
   * @throws TimeoutException if the specified timeout elapses before
   * achieving the desired conditions
   */
  public synchronized void waitFor(
      int minResponses, int minSuccesses, int maxExceptions,
      int millis, String operationName)
      throws InterruptedException, TimeoutException {
    long st = Time.monotonicNow();
    long nextLogTime = st + (long)(millis * WAIT_PROGRESS_INFO_THRESHOLD);
    long et = st + millis;
    while (true) {
      checkAssertionErrors();
      if (minResponses > 0 && countResponses() >= minResponses) return;
      if (minSuccesses > 0 && countSuccesses() >= minSuccesses) return;
      if (maxExceptions >= 0 && countExceptions() > maxExceptions) return;
      long now = Time.monotonicNow();


      if (now > nextLogTime) {
        long waited = now - st;
        // Note: the content of this msg matches the WARN log shown earlier.
        // Reaching this point also means none of the three return conditions
        // above were satisfied, i.e. the majority required by QJM's quorum
        // mechanism had not yet been reached.
        String msg = String.format(
            "Waited %s ms (timeout=%s ms) for a response for %s",
            waited, millis, operationName);
        if (!successes.isEmpty()) {
          msg += ". Succeeded so far: [" + Joiner.on(",").join(successes.keySet()) + "]";
        }
        if (!exceptions.isEmpty()) {
          msg += ". Exceptions so far: [" + getExceptionMapString() + "]";
        }
        if (successes.isEmpty() && exceptions.isEmpty()) {
          msg += ". No responses yet.";
        }
        if (waited > millis * WAIT_PROGRESS_WARN_THRESHOLD) {
          QuorumJournalManager.LOG.warn(msg);
        } else {
          QuorumJournalManager.LOG.info(msg);
        }
        nextLogTime = now + WAIT_PROGRESS_INTERVAL_MILLIS;
      }
      long rem = et - now;
      if (rem <= 0) {
        // Note: the TimeoutException thrown here is what ultimately terminates the active NN process
        throw new TimeoutException();
      }
      rem = Math.min(rem, nextLogTime - now);
      rem = Math.max(rem, 1);
      wait(rem);
    }
  }


In QuorumJournalManager.java:
/**
 * A JournalManager that writes to a set of remote JournalNodes,
 * requiring a quorum of nodes to ack each write.
 */
@InterfaceAudience.Private
public class QuorumJournalManager implements JournalManager {


    private final int writeTxnsTimeoutMs;


    // Note (excerpted from the constructor): this is where the timeout value above comes from
    this.writeTxnsTimeoutMs = conf.getInt(
        DFSConfigKeys.DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY,
        DFSConfigKeys.DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT);


}


In DFSConfigKeys.java:
    public static final String  DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY = "dfs.qjournal.write-txns.timeout.ms";
    public static final int     DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT = 20000;




At this point, combined with the conclusions above, the picture is clear:
While writing edit log to the JNs, the active NN could not complete the write on a quorum in time (the write timed out), so an exception (TimeoutException) was thrown.
The deeper cause may be the network, or a JVM GC pause; the GC logs are gone, so this cannot be confirmed. One hint, though: the WARN log shows a wait of 23689 ms against a 20000 ms timeout, and since QuorumCall.waitFor() sleeps in short bounded increments, overshooting the deadline by several seconds suggests the NN thread itself was stalled, which is consistent with a GC pause.
As for the network, it is not that the network was outright bad; rather, network conditions failed to meet the level demanded by the configuration item above (dfs.qjournal.write-txns.timeout.ms).
So the most direct fix is to increase that configuration value, currently at its default of 20000 ms; a sketch follows below.
To be safe, it is best to also increase the other timeout-related configuration items involved in QuorumJournalManager.java.
At the same time, be aware that in extreme cases both HA NNs may exit for this same reason, leaving the entire HDFS cluster unavailable.
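
A minimal sketch of that change, using Hadoop's Configuration API (the 60000 ms value is an arbitrary illustration, not a tuned recommendation; in a real deployment the property is set in hdfs-site.xml on both NameNodes):

import org.apache.hadoop.conf.Configuration;

public class QjmTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Equivalent to a <property> entry in hdfs-site.xml. 60000 ms is purely
    // illustrative; size it against the observed network and GC stall times.
    conf.setInt("dfs.qjournal.write-txns.timeout.ms", 60000);
    System.out.println(conf.getInt("dfs.qjournal.write-txns.timeout.ms", 20000));
  }
}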