On an HDFS NameNode crash caused by a machine-room switch failure

Our HDFS HA setup has one active NN, one standby NN, and three JNs, spread over three machines: 146.66, 146.67, and 146.68. 66 hosts a JN; 67 hosts a JN plus the active NN; 68 hosts a JN plus the standby NN. 67 and 68 sit in the same machine room, 66 in a different one.


The failed switch left both 67 and 68 unable to communicate with 66. As a result, the active NN on 67 could only write to the JNs on 67 and 68, and the standby NN on 68 could only read from the JNs on 67 and 68.
According to the official Hadoop documentation (http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html), the NNs and the JN quorum should keep working in this situation: "There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally."
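
To make the documented guarantee concrete, here is a minimal sketch of the quorum arithmetic (our own illustration, not HDFS code). With N = 3 JNs, a write needs to reach a majority of 2, so losing the single JN on 66 is exactly the one failure the system is documented to tolerate:

public class QuorumMath {
  public static void main(String[] args) {
    int n = 3;                    // number of JournalNodes
    int majority = n / 2 + 1;     // a write must reach this many JNs
    int tolerated = (n - 1) / 2;  // failures the system should survive
    System.out.println("majority = " + majority + ", tolerated = " + tolerated);
    // Prints: majority = 2, tolerated = 1. In our outage, only the JN on
    // 146.66 was unreachable from the active NN: exactly one failure.
  }
}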


In reality, though, the NN went down. Relevant log excerpt:
2015-11-05 03:01:37,135 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(364)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.146.66:8485, 192.168.146.67:8485, 192.168.146.68:8485], stream=QuorumOutputStream starting at txid 4354654650))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:499)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:495)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:623)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2748)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:590)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-11-05 03:01:37,135 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 4354654650
2015-11-05 03:01:37,139 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-11-05 03:01:37,145 INFO  namenode.NameNode (StringUtils.java:run(640)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at storm14667/192.168.146.67


Reading the HDFS source code, its behavior matches the log above. So the question becomes: is the official documentation wrong ("at most"!?), or is our understanding off, and therefore our configuration wrong?


From the call stack, the NN was allocating a new block to serve a client request. FSNamesystem.java:
  /**
   * The client would like to obtain an additional block for the indicated
   * filename (which is being written-to).  Return an array that consists
   * of the block, plus a set of machines.  The first on this list should
   * be where the client writes data.  Subsequent items in the list must
   * be provided in the connection to the first datanode.
   *
   * Make sure the previous blocks have been reported by datanodes and
   * are replicated.  Will return an empty 2-elt array if we want the
   * client to "try again later".
   */
  LocatedBlock getAdditionalBlock(String src, long fileId, String clientName,
      ExtendedBlock previous, Set<Node> excludedNodes,
      List<String> favoredNodes) throws IOException {
    LocatedBlock[] onRetryBlock = new LocatedBlock[1];
    DatanodeStorageInfo targets[] = getNewBlockTargets(src, fileId,
        clientName, previous, excludedNodes, favoredNodes, onRetryBlock);
    if (targets == null) {
      assert onRetryBlock[0] != null : "Retry block is null";
      // This is a retry. Just return the last block.
      return onRetryBlock[0];
    }
    LocatedBlock newBlock = storeAllocatedBlock(
        src, fileId, clientName, previous, targets);
    return newBlock;
  }


  /**
   * Part II of getAdditionalBlock().
   * Should repeat the same analysis of the file state as in Part 1,
   * but under the write lock.
   * If the conditions still hold, then allocate a new block with
   * the new targets, add it to the INode and to the BlocksMap.
   */
  LocatedBlock storeAllocatedBlock(String src, long fileId, String clientName,
      ExtendedBlock previous, DatanodeStorageInfo[] targets) throws IOException {
    Block newBlock = null;
    long offset;
    checkOperation(OperationCategory.WRITE);
    waitForLoadingFSImage();
    writeLock();
    try {
        // a few dozen lines omitted here
    } finally {
      writeUnlock();
    }
    getEditLog().logSync(); // Note: this syncs the edit log, to the NN's local dirs and/or the JNs


    // Return located block
    return makeLocatedBlock(newBlock, targets, offset);
  }


In FSNamesystem.storeAllocatedBlock, the block is allocated under the write lock and the corresponding edit is then synced via logSync() before the result is returned to the client. In other words, HDFS applies a WAL (Write-Ahead Log) discipline to write operations: no mutation is acknowledged to the client until its edit log entry has been persisted.
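
To isolate the pattern, here is a minimal, self-contained sketch of that WAL discipline (our own illustration; names like allocate and editBuffer are made up, not HDFS APIs):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WalSketch {
  private final List<String> state = new ArrayList<>();      // the "namespace"
  private final List<String> editBuffer = new ArrayList<>(); // in-memory edits

  public synchronized String allocate(String name) throws IOException {
    state.add(name);                // 1. apply the change in memory (under lock)
    editBuffer.add("ADD " + name);  // 2. record the corresponding edit
    logSync();                      // 3. persist the edit; throws on failure
    return name;                    // 4. only now acknowledge the caller
  }

  private void logSync() throws IOException {
    // Stand-in for FSEditLog.logSync(): flush editBuffer to local dirs
    // and/or JNs. If this fails, the caller never sees a success response.
    editBuffer.clear();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(new WalSketch().allocate("/user/a/file1"));
  }
}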




FSEditLog.java:
  /**
   * Sync all modifications done by this thread.
   *
   * The internal concurrency design of this class is as follows:
   *   - Log items are written synchronized into an in-memory buffer,
   *     and each assigned a transaction ID.
   *   - When a thread (client) would like to sync all of its edits, logSync()
   *     uses a ThreadLocal transaction ID to determine what edit number must
   *     be synced to.
   *   - The isSyncRunning volatile boolean tracks whether a sync is currently
   *     under progress.
   *
   * The data is double-buffered within each edit log implementation so that
   * in-memory writing can occur in parallel with the on-disk writing.
   *
   * Each sync occurs in three steps:
   *   1. synchronized, it swaps the double buffer and sets the isSyncRunning
   *      flag.
   *   2. unsynchronized, it flushes the data to storage
   *   3. synchronized, it resets the flag and notifies anyone waiting on the
   *      sync.
   *
   * The lack of synchronization on step 2 allows other threads to continue
   * to write into the memory buffer while the sync is in progress.
   * Because this step is unsynchronized, actions that need to avoid
   * concurrency with sync() should be synchronized and also call
   * waitForSyncToFinish() before assuming they are running alone.
   */
  public void logSync() {
    long syncStart = 0;


    // Fetch the transactionId of this thread.
    long mytxid = myTransactionId.get().txid;


    boolean sync = false;
    try {
      EditLogOutputStream logStream = null;
      //
      // over a hundred lines omitted here
      //
      // do the sync
      long start = monotonicNow();
      try {
        if (logStream != null) {
          logStream.flush(); // Note: this writes the edit log; depending on configuration, to local dirs and/or the JournalNodes
        }
      } catch (IOException ex) {
        synchronized (this) {
          final String msg =
              "Could not sync enough journals to persistent storage. "
              + "Unsynced transactions: " + (txid - synctxid);
          LOG.fatal(msg, new Exception());
          synchronized(journalSetLock) {
            IOUtils.cleanup(LOG, journalSet);
          }
          terminate(1, msg);
        }
      }
    } finally {
      // a dozen or so lines omitted here
    }
  }
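
The three-step scheme in the javadoc above is worth restating in runnable form. A self-contained sketch of the double-buffered sync (our own illustration, not HDFS code); step 2, the slow flush, deliberately runs outside the lock so other threads can keep appending:

import java.io.IOException;

public class DoubleBufferSketch {
  private StringBuilder bufCurrent = new StringBuilder(); // written by loggers
  private StringBuilder bufReady = new StringBuilder();   // being synced
  private boolean isSyncRunning = false;

  public synchronized void logEdit(String edit) {
    bufCurrent.append(edit).append('\n');
  }

  public void logSync() throws IOException, InterruptedException {
    String toFlush;
    synchronized (this) {
      while (isSyncRunning) {
        wait();                          // only one sync at a time
      }
      StringBuilder tmp = bufCurrent;    // step 1 (synchronized): swap the
      bufCurrent = bufReady;             // double buffer and set the flag
      bufReady = tmp;
      isSyncRunning = true;
      toFlush = bufReady.toString();
    }
    flushToStorage(toFlush);             // step 2 (unsynchronized): slow I/O
    synchronized (this) {
      bufReady.setLength(0);             // step 3 (synchronized): reset the
      isSyncRunning = false;             // flag and wake waiting threads
      notifyAll();
    }
  }

  private void flushToStorage(String data) throws IOException {
    // Stand-in for EditLogOutputStream.flush(): in HDFS this writes to the
    // local edits dirs and/or the JNs, and it is where the IOException in
    // our log ("Timed out waiting 20000ms ...") originated.
    System.out.print(data);
  }

  public static void main(String[] args) throws Exception {
    DoubleBufferSketch log = new DoubleBufferSketch();
    log.logEdit("OP_ADD_BLOCK txid=4354654650");
    log.logSync();
  }
}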




Back to JournalSet.java:


  /**
   * An implementation of EditLogOutputStream that applies a requested method on
   * all the journals that are currently active.
   */
  private class JournalSetOutputStream extends EditLogOutputStream {
    //
    // over a hundred lines omitted here
    //
    @Override
    public void flush() throws IOException {
      mapJournalsAndReportErrors(new JournalClosure() {
        @Override
        public void apply(JournalAndStream jas) throws IOException {
          if (jas.isActive()) {
            jas.getCurrentStream().flush();
          }
        }
      }, "flush");
    }
    //
    // over a hundred lines omitted here
    //
  }


/**
 * Manages a collection of Journals. None of the methods are synchronized, it is
 * assumed that FSEditLog methods, that use this class, use proper
 * synchronization.
 */
public class JournalSet implements JournalManager {
  //
  // several hundred lines omitted here
  //
  private final List<JournalAndStream> journals =
      new CopyOnWriteArrayList<JournalSet.JournalAndStream>();
  //
  // nearly a hundred lines omitted here
  //
  /**
   * Apply the given operation across all of the journal managers, disabling
   * any for which the closure throws an IOException.
   * @param closure {@link JournalClosure} object encapsulating the operation.
   * @param status message used for logging errors (e.g. "opening journal")
   * @throws IOException If the operation fails on all the journals.
   */
  // Note: what the javadoc above describes is not consistent with the wording in the official Hadoop documentation; then again, maybe our understanding is incomplete or off
  private void mapJournalsAndReportErrors(JournalClosure closure, String status)
      throws IOException {


    List<JournalAndStream> badJAS = Lists.newLinkedList();
    for (JournalAndStream jas : journals) {
      try {
        closure.apply(jas);
      } catch (Throwable t) {
        if (jas.isRequired()) {
          final String msg = "Error: " + status + " failed for required journal ("
            + jas + ")";
          LOG.fatal(msg, t); // Note: this is the FATAL line that appears in our log
          // If we fail on *any* of the required journals, then we must not
          // continue on any of the other journals. Abort them to ensure that
          // retry behavior doesn't allow them to keep going in any way.
          abortAllJournals();
          // the current policy is to shutdown the NN on errors to shared edits
          // dir. There are many code paths to shared edits failures - syncs,
          // roll of edits etc. All of them go through this common function
          // where the isRequired() check is made. Applying exit policy here
          // to catch all code paths.
          terminate(1, msg);
        } else {
          // a few dozen lines omitted here
        }
      }
    }
  }
  //
  // several hundred lines omitted here
  //
}
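
Distilled, the policy is: a failure on a required journal kills the NN immediately, while a failure on an optional journal merely disables that journal, until fewer than dfs.namenode.edits.dir.minimum healthy ones remain. A self-contained summary sketch (our reading, including of the elided else branch; not HDFS code):

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class JournalPolicySketch {
  static class Journal {
    final String name;
    final boolean required;
    boolean disabled;
    Journal(String name, boolean required) { this.name = name; this.required = required; }
    void flush() throws IOException { /* may throw on a network partition */ }
  }

  static void flushAll(List<Journal> journals, int minimumRedundant)
      throws IOException {
    int healthy = 0;
    for (Journal j : journals) {
      try {
        j.flush();
        healthy++;
      } catch (IOException t) {
        if (j.required) {
          // A required journal failed: the real code aborts all journals
          // and calls terminate(1), i.e. the whole NN process exits.
          throw new RuntimeException("required journal " + j.name + " failed", t);
        }
        j.disabled = true; // optional journal: disable it and carry on
      }
    }
    if (healthy < minimumRedundant) {
      throw new IOException("too many journals failed: " + healthy
          + " healthy < minimum " + minimumRedundant);
    }
  }

  public static void main(String[] args) throws IOException {
    List<Journal> js = Arrays.asList(
        new Journal("qjournal://.../hadoopcluster", true),
        new Journal("file:///data/edits", false));
    flushAll(js, 1); // succeeds here: both flushes are no-ops in this sketch
  }
}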




Back in FSEditLog.java, this is where JournalSet.journals gets initialized:


/**
 * FSEditLog maintains a log of the namespace modifications.
 *
 */
@InterfaceAudience.Private
@InterfaceStability.Evolving
public class FSEditLog implements LogsPurgeable {
  //
  // several hundred lines omitted here
  //
  private final List<URI> editsDirs; // Note: this field is tied directly to our configuration (hdfs-site.xml)


  /**
   * The edit directories that are shared between primary and secondary.
   */
  private final List<URI> sharedEditsDirs; // Note: this field is tied directly to our configuration (hdfs-site.xml)
  //
  // several hundred lines omitted here
  //
  /**
   * Constructor for FSEditLog. Underlying journals are constructed, but
   * no streams are opened until open() is called.
   *
   * @param conf The namenode configuration
   * @param storage Storage object used by namenode
   * @param editsDirs List of journals to use
   */
  FSEditLog(Configuration conf, NNStorage storage, List<URI> editsDirs) {
    isSyncRunning = false;
    this.conf = conf;
    this.storage = storage;
    metrics = NameNode.getNameNodeMetrics();
    lastPrintTime = monotonicNow();


    // If this list is empty, an error will be thrown on first use
    // of the editlog, as no journals will exist
    this.editsDirs = Lists.newArrayList(editsDirs);


    this.sharedEditsDirs = FSNamesystem.getSharedEditsDirs(conf);
  }


  public synchronized void initJournalsForWrite() {
    Preconditions.checkState(state == State.UNINITIALIZED ||
        state == State.CLOSED, "Unexpected state: %s", state);


    initJournals(this.editsDirs);
    state = State.BETWEEN_LOG_SEGMENTS;
  }


  public synchronized void initSharedJournalsForRead() {
    if (state == State.OPEN_FOR_READING) {
      LOG.warn("Initializing shared journals for READ, already open for READ",
          new Exception());
      return;
    }
    Preconditions.checkState(state == State.UNINITIALIZED ||
        state == State.CLOSED);


    initJournals(this.sharedEditsDirs);
    state = State.OPEN_FOR_READING;
  }


  private synchronized void initJournals(List<URI> dirs) {
    int minimumRedundantJournals = conf.getInt(
        DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_MINIMUM_KEY,
        DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_MINIMUM_DEFAULT);


    synchronized(journalSetLock) {
      journalSet = new JournalSet(minimumRedundantJournals);


      for (URI u : dirs) {
        // Note: this boolean is the crux; it decides whether the active NN can tolerate a failed write to a JN
        boolean required = FSNamesystem.getRequiredNamespaceEditsDirs(conf)
            .contains(u);
        // Note: our configuration (hdfs-site.xml) does not explicitly set any local-scheme (file://) entries, so what actually executes is the corresponding else branch
        if (u.getScheme().equals(NNStorage.LOCAL_URI_SCHEME)) {
          StorageDirectory sd = storage.getStorageDirectory(u);
          if (sd != null) {
            journalSet.add(new FileJournalManager(conf, sd, storage),
                required, sharedEditsDirs.contains(u));
          }
        } else {
          journalSet.add(createJournal(u), required,
              sharedEditsDirs.contains(u));
        }
      }
    }


    if (journalSet.isEmpty()) {
      LOG.error("No edits directories configured!");
    }
  }
  //
  // several hundred lines omitted here
  //
}




On to FSNamesystem.java:


  /**
   * Get all edits dirs which are required. If any shared edits dirs are
   * configured, these are also included in the set of required dirs.
   *
   * @param conf the HDFS configuration.
   * @return all required dirs.
   */
  public static Collection<URI> getRequiredNamespaceEditsDirs(Configuration conf) {
    Set<URI> ret = new HashSet<URI>();
    ret.addAll(getStorageDirs(conf, DFS_NAMENODE_EDITS_DIR_REQUIRED_KEY)); // Note: we did not configure dfs.namenode.edits.dir.required, so this first addAll contributes nothing here
    ret.addAll(getSharedEditsDirs(conf));
    return ret;
  }


  private static Collection<URI> getStorageDirs(Configuration conf,
                                                String propertyName) {
    Collection<String> dirNames = conf.getTrimmedStringCollection(propertyName);
    StartupOption startOpt = NameNode.getStartupOption(conf);
    if(startOpt == StartupOption.IMPORT) {
      // In case of IMPORT this will get rid of default directories
      // but will retain directories specified in hdfs-site.xml
      // When importing image from a checkpoint, the name-node can
      // start with empty set of storage directories.
      Configuration cE = new HdfsConfiguration(false);
      cE.addResource("core-default.xml");
      cE.addResource("core-site.xml");
      cE.addResource("hdfs-default.xml");
      Collection<String> dirNames2 = cE.getTrimmedStringCollection(propertyName);
      dirNames.removeAll(dirNames2);
      if(dirNames.isEmpty())
        LOG.warn("!!! WARNING !!!" +
          "\n\tThe NameNode currently runs without persistent storage." +
          "\n\tAny changes to the file system meta-data may be lost." +
          "\n\tRecommended actions:" +
          "\n\t\t- shutdown and restart NameNode with configured \""
          + propertyName + "\" in hdfs-site.xml;" +
          "\n\t\t- use Backup Node as a persistent and up-to-date storage " +
          "of the file system meta-data.");
    } else if (dirNames.isEmpty()) {
      dirNames = Collections.singletonList(
          DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_DEFAULT);
    }
    return Util.stringCollectionAsURIs(dirNames);
  }


  /**
   * Returns edit directories that are shared between primary and secondary.
   * @param conf configuration
   * @return collection of edit directories from {@code conf}
   */
  public static List<URI> getSharedEditsDirs(Configuration conf) {
    // don't use getStorageDirs here, because we want an empty default
    // rather than the dir in /tmp
    Collection<String> dirNames = conf.getTrimmedStringCollection(
        DFS_NAMENODE_SHARED_EDITS_DIR_KEY);
    return Util.stringCollectionAsURIs(dirNames);
  }
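
Tracing these two methods against our configuration (our reading; only the keys shown below are set explicitly):

// getSharedEditsDirs(conf)
//   -> [ qjournal://hadoop146066.ysc.com:8485;storm14667:8485;
//        storm14668:8485/hadoopcluster ]
//
// getRequiredNamespaceEditsDirs(conf)
//   -> contains that same qjournal URI, because the shared edits dirs are
//      always added to the required set, whether or not
//      dfs.namenode.edits.dir.required is configured.
//
// Back in initJournals(), for u == qjournal://.../hadoopcluster this makes
// required == true, so the QJM journal is registered as a *required*
// journal, and any flush failure on it hits the fatal branch of
// mapJournalsAndReportErrors() shown earlier.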




Finally, back to our hdfs-site.xml. The relevant configuration keys are:
dfs.namenode.shared.edits.dir
dfs.journalnode.edits.dir
dfs.namenode.name.dir
dfs.namenode.edits.dir
dfs.namenode.edits.dir.required
dfs.namenode.edits.dir.minimum
Their exact meanings are documented at http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml.


We did not configure the last three at all, so each takes its default: dfs.namenode.edits.dir.required defaults to empty, dfs.namenode.edits.dir falls back to the value of dfs.namenode.name.dir, and dfs.namenode.edits.dir.minimum defaults to 1.


    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://hadoop146066.ysc.com:8485;storm14667:8485;storm14668:8485/hadoopcluster</value>
    </property>


In this configuration all three JNs are listed explicitly, so all three are required (via the critical boolean flagged in the code above). Going by the code, the effect of this configuration is that if any one JN becomes unreachable, the active NN shuts down.
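
For contrast, dfs.namenode.edits.dir.required exists to mark additional directories as required; directories listed there become required in addition to the shared edits dir, which is always required. A hypothetical example (the local path below is made up):

    <property>
      <name>dfs.namenode.edits.dir.required</name>
      <!-- hypothetical local path; entries here are added to the required
           set on top of dfs.namenode.shared.edits.dir -->
      <value>file:///data/hdfs/namenode/edits</value>
    </property>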


Here are a few discussions related to these configuration keys:
https://issues.apache.org/jira/browse/HDFS-4342
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/I4YRcmiVcBY