Elasticsearch Index Recovery (Part 1)

This article, based on ES 6.7, introduces Elasticsearch index recovery: the major recovery stages, the synced flush mechanism, and the recovery flows for primary and replica shards.

ES Index Recovery

  • Index recovery (indices.recovery) is the process ES uses to restore shard data, and it is the slowest part of a cluster startup. It runs after a full cluster restart, and a newly elected master may also run index recovery after the previous master fails.
  • Which data needs to be recovered? Data that was acknowledged to the client but had not yet been flushed to disk as Lucene segments (see the toy sketch after this list).
  • What does index recovery buy us?
  1. Data integrity: when a node restarts abnormally, data that was "written" may still be sitting in the filesystem cache and may never have been fsynced. Without some way of getting back the data that was never flushed, those writes would be lost.
  2. Replica consistency: a write may not have completed on every shard copy, so replica shards have to be brought back into exact sync with the primary.
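To make "acknowledged but not yet flushed" concrete, here is a deliberately simplified, self-contained sketch of the relationship between the in-memory index buffer, the translog and flush. This is a toy model, not Elasticsearch code; ToyShard and all of its methods are hypothetical.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the write / flush / recovery relationship; this is NOT Elasticsearch code.
public class ToyShard {
    private final Map<String, String> persistedSegments = new HashMap<>(); // "on disk"
    private final Map<String, String> indexBuffer = new HashMap<>();       // in memory, lost on crash
    private final List<String[]> translog = new ArrayList<>();             // durable operation log

    // A write is acknowledged only after it has been appended to the translog.
    public void index(String id, String doc) {
        translog.add(new String[] {id, doc});
        indexBuffer.put(id, doc);
    }

    // Flush: persist the buffered data as segments and trim the translog.
    public void flush() {
        persistedSegments.putAll(indexBuffer);
        indexBuffer.clear();
        translog.clear();
    }

    // Crash: everything that lived only in memory is gone.
    public void crash() {
        indexBuffer.clear();
    }

    // Recovery: replay the translog to rebuild what was acknowledged but never flushed.
    public void recoverFromTranslog() {
        for (String[] op : translog) {
            indexBuffer.put(op[0], op[1]);
        }
    }

    public boolean contains(String id) {
        return persistedSegments.containsKey(id) || indexBuffer.containsKey(id);
    }

    public static void main(String[] args) {
        ToyShard shard = new ToyShard();
        shard.index("1", "flushed doc");
        shard.flush();
        shard.index("2", "acknowledged but never flushed");
        shard.crash();
        shard.recoverFromTranslog();
        System.out.println(shard.contains("1") + " " + shard.contains("2")); // true true
    }
}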

Types of recovery flows

Depending on the role of the shard, index recovery splits into a primary shard recovery flow and a replica shard recovery flow.

  • Primary shard recovery: the primary recovers from its own translog; Lucene segments that were never flushed to disk are rebuilt by replaying the translog (how long does that take?).
  • Replica shard recovery: the replica pulls Lucene segments and translog operations from the primary, but it has a chance to skip pulling the Lucene segments entirely (so there is still hope!).

Triggering recovery
Recovery is triggered by a clusterChanged event. The call chain from the trigger to the start of recovery is:

indicesClusterStateService#applyClusterState()
->createOrUpdateShards()
->createShard()
->indicesService#createShard()
->indexShard#startRecovery() 

IndexShard#startRecovery() performs the recovery of one specific shard and dispatches on the shard's recovery type. We focus on the following two types:

  • EXISTING_STORE: a primary shard recovering from its local store
  • PEER: a replica shard recovering from a remote primary
public void startRecovery(...) {
        switch (recoveryState.getRecoverySource().getType()) {
            case EMPTY_STORE:
            case EXISTING_STORE: // primary shard recovers from the local store
                markAsRecovering("from store", recoveryState); // mark the shard as recovering on the cluster state thread
                threadPool.generic().execute(() -> {
                    try {
                        if (recoverFromStore()) {
                            recoveryListener.onRecoveryDone(recoveryState); // recovery succeeded
                        }
                    } catch (Exception e) { // recovery failed
                        recoveryListener.onRecoveryFailure(recoveryState,
                            new RecoveryFailedException(recoveryState, null, e), true);
                    }
                });
                break;
            case PEER: // replica shard recovers from the remote primary
                try {
                    markAsRecovering("from " + recoveryState.getSourceNode(), recoveryState);
                    recoveryTargetService.startRecovery(this, recoveryState.getSourceNode(), recoveryListener);
                } catch (Exception e) {
                    failShard("corrupted preexisting index", e);
                    recoveryListener.onRecoveryFailure(recoveryState,
                        new RecoveryFailedException(recoveryState, null, e), true);
                }
                break;
            case SNAPSHOT:
                ...
                break;
            case LOCAL_SHARDS:
                ...
                break;
            default:
                throw new IllegalArgumentException("Unknown recovery source " + recoveryState.getRecoverySource());
        }
    }

The six stages of index recovery

  • Recovery generally goes through the following six stages (see the RecoveryState class): INIT -> INDEX -> VERIFY_INDEX -> TRANSLOG -> FINALIZE -> DONE
  • Both primary and replica recovery pass through all of these stages, but sometimes the actual work of a stage is skipped and the stage only appears briefly in the flow.
    For example, replica recovery skips the local translog replay, and the INDEX stage of primary recovery does not copy any data between nodes.
/**
 * Keeps track of state related to shard recovery.
 */
public class RecoveryState implements ToXContentFragment, Streamable, Writeable {

    public enum Stage {
        INIT((byte) 0),

        /**
         * recovery of lucene files, either reusing local ones or copying new ones
         */
        INDEX((byte) 1),

        /**
         * potentially running check index
         */
        VERIFY_INDEX((byte) 2),

        /**
         * starting up the engine, replaying the translog
         */
        TRANSLOG((byte) 3),

        /**
         * performing final task after all translog ops have been done
         */
        FINALIZE((byte) 4),

        DONE((byte) 5);

        ...
    }

    private Stage stage;
    ...

    public RecoveryState(ShardRouting shardRouting, DiscoveryNode targetNode, @Nullable DiscoveryNode sourceNode) {
        assert shardRouting.initializing() : "only allow initializing shard routing to be recovered: " + shardRouting;
        RecoverySource recoverySource = shardRouting.recoverySource();
        assert (recoverySource.getType() == RecoverySource.Type.PEER) == (sourceNode != null) :
            "peer recovery requires source node, recovery type: " + recoverySource.getType() + " source node: " + sourceNode;
        this.shardId = shardRouting.shardId();
        this.primary = shardRouting.primary();
        this.recoverySource = recoverySource;
        this.sourceNode = sourceNode;
        this.targetNode = targetNode;
        stage = Stage.INIT;
        index = new Index();
        translog = new Translog();
        verifyIndex = new VerifyIndex();
        timer = new Timer();
        timer.start();
    }
    
    ...
}
Stage overview:

  • INIT (recovery not yet started): the shard is marked INIT from the moment recovery begins. Stage.INIT lives in the RecoveryState object that is passed into IndexShard#startRecovery(); the concrete recovery type is determined, some simple validation is performed, and the flow moves on to the INDEX stage.
  • INDEX (recover Lucene files, and copy index data between nodes for peer recovery): reads the segment info of the last Lucene commit, obtains the version, and updates the current index version.
  • VERIFY_INDEX (verify that the index is not corrupted): controlled by a setting; by default the verification is skipped and recovery moves straight to the most important stage, TRANSLOG.
  • TRANSLOG (start the engine, replay the translog, rebuild the Lucene index): the most important stage. A snapshot based on the last commit determines which translog operations still need to be replayed; the unflushed operations are replayed, and the resulting Lucene data is then flushed to disk.
  • FINALIZE (cleanup): executes a refresh, which moves buffered data into segment files that live in the operating system cache, without fsyncing them.
  • DONE (recovery finished): performs one more refresh before entering DONE, then updates the shard state.
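As a toy illustration of this ordering (this is not the actual RecoveryState implementation; ToyRecoveryState and moveTo are hypothetical), each stage may only be entered from the stage that comes immediately before it, which is roughly what RecoveryState enforces when it advances stages:

// A toy model of the stage ordering; ToyRecoveryState is hypothetical, not ES code.
public class ToyRecoveryState {

    enum Stage { INIT, INDEX, VERIFY_INDEX, TRANSLOG, FINALIZE, DONE }

    private Stage stage = Stage.INIT;

    // Each stage may only be entered from the stage that immediately precedes it,
    // roughly mirroring the checks RecoveryState performs when advancing stages.
    synchronized void moveTo(Stage next) {
        if (next.ordinal() != stage.ordinal() + 1) {
            throw new IllegalStateException("can't move recovery from " + stage + " to " + next);
        }
        stage = next;
    }

    public static void main(String[] args) {
        ToyRecoveryState state = new ToyRecoveryState();
        state.moveTo(Stage.INDEX);
        state.moveTo(Stage.VERIFY_INDEX);
        state.moveTo(Stage.TRANSLOG);
        state.moveTo(Stage.FINALIZE);
        state.moveTo(Stage.DONE);
        System.out.println("final stage: " + state.stage); // DONE
    }
}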

Primary shard recovery flow

From the moment recovery begins, the shard is marked as being in the INIT stage. After some simple validation it enters the INDEX stage, where the segment info of the last commit is read from the local Lucene store (a minimal sketch of reading the last commit with plain Lucene follows).
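The sketch below is not the ES code path; it simply shows what "read the last commit's segment info" means in plain Lucene, assuming a Lucene dependency matching the ES version and that args[0] points at a shard's index directory.

import java.nio.file.Paths;

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LastCommitInfo {
    public static void main(String[] args) throws Exception {
        // e.g. <data-path>/nodes/0/indices/<index-uuid>/<shard-id>/index
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            // Read the segment metadata of the most recent commit point.
            SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
            System.out.println("generation=" + infos.getGeneration()
                + ", version=" + infos.getVersion()
                + ", segments=" + infos.size());
        }
    }
}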

IndexShard#innerOpenEngineAndTranslog() completes the VERIFY_INDEX stage (the verification is disabled by default) and then kicks off the TRANSLOG stage.
The TRANSLOG stage

/**
 * Replays translog operations from the provided translog {@code snapshot} to the current engine using the given {@code origin}.
 * The callback {@code onOperationRecovered} is notified after each translog operation is replayed successfully.
 */
int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin,
						Runnable onOperationRecovered) throws IOException {
	int opsRecovered = 0;
	Translog.Operation operation;
	while ((operation = snapshot.next()) != null) { // iterate over the translog operations that need to be replayed
		try {
			logger.trace("[translog] recover op {}", operation);
			Engine.Result result = applyTranslogOperation(engine, operation, origin); // apply the concrete operation: index write, delete, etc.
			switch (result.getResultType()) {
				case FAILURE:
					throw result.getFailure();
				case MAPPING_UPDATE_REQUIRED:
					throw new IllegalArgumentException("unexpected mapping update: " + result.getRequiredMappingUpdate());
				case SUCCESS:
					break;
				default:
					throw new AssertionError("Unknown result type [" + result.getResultType() + "]");
			}

			opsRecovered++;
			onOperationRecovered.run();
		} catch (Exception e) {
			...
		}
	}
	return opsRecovered;
}
private Engine.Result applyTranslogOperation(Engine engine, Translog.Operation operation,
											 Engine.Operation.Origin origin) throws IOException {
	// If a translog op is replayed on the primary (eg. ccr), we need to use external instead of null for its version type.
	final VersionType versionType = (origin == Engine.Operation.Origin.PRIMARY) ? VersionType.EXTERNAL : null;
	final Engine.Result result;
	switch (operation.opType()) { // dispatch the replay on the operation type: index write, delete, no-op
		case INDEX:
			final Translog.Index index = (Translog.Index) operation;
			// we set canHaveDuplicates to true all the time such that we de-optimze the translog case and ensure that all
			// autoGeneratedID docs that are coming from the primary are updated correctly.
			result = applyIndexOperation(engine, index.seqNo(), index.primaryTerm(), index.version(),
				index.versionType().versionTypeForReplicationAndRecovery(), UNASSIGNED_SEQ_NO, 0, index.getAutoGeneratedIdTimestamp(),
				true, origin, source(shardId.getIndexName(), index.type(), index.id(), index.source(),
					XContentHelper.xContentType(index.source())).routing(index.routing()).parent(index.parent()));
			break;
		case DELETE:
			final Translog.Delete delete = (Translog.Delete) operation;
			result = applyDeleteOperation(engine, delete.seqNo(), delete.primaryTerm(), delete.version(), delete.type(), delete.id(),
				delete.versionType().versionTypeForReplicationAndRecovery(), UNASSIGNED_SEQ_NO, 0, origin);
			break;
		case NO_OP:
			final Translog.NoOp noOp = (Translog.NoOp) operation;
			result = markSeqNoAsNoop(engine, noOp.seqNo(), noOp.primaryTerm(), noOp.reason(), origin);
			break;
		default:
			throw new IllegalStateException("No operation defined for [" + operation + "]");
	}
	return result;
}

The final FINALIZE and DONE stages
The FINALIZE stage performs a refresh; before moving into DONE one more refresh is executed and the shard state is updated (see the stage overview above).


Replica shard recovery flow

The core idea of replica recovery is to pull Lucene segments and translog operations from the primary shard. Following the direction of the data flow, the node holding the primary is called the Source and the node holding the replica is called the Target.
The VERIFY_INDEX, TRANSLOG and FINALIZE stages of replica recovery are driven by RPC calls sent from the primary (source) node.
Replica recovery involves a target node (the node whose shard is being recovered) and a source node (the node providing the data). The target sends a recovery request to the source, and the source handles it in two phases; the overall flow is shown in the figure below.

  • phase1: the source takes a snapshot of the shard to be recovered and compares it with the metadata carried in the request. If the sync ids match and the doc counts are equal, phase1 is skipped. Otherwise the source computes the segment-file diff against the target and sends the differing segment files to the target node.
  • phase2: to make the data on the target complete, the source sends its local translog to the target, and the target replays the received operations (a rough sketch of the batching idea is given below the figure).

[Figure: overall replica shard recovery flow between the source (primary) node and the target (replica) node]
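Phase2 itself is not shown in the excerpts in this article. Conceptually, the source node takes a snapshot of its translog, ships the operations to the target in batches over the recovery channel, and the target indexes each batch through the same translog-replay path shown earlier. The sketch below only illustrates the batching idea; OpSender and the String operations are hypothetical stand-ins for the real RPC and Translog.Operation types.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Conceptual sketch of the phase2 batching; OpSender is a hypothetical stand-in for the RPC that
// ships translog operations to the recovery target, and String stands in for Translog.Operation.
public class Phase2Sketch {

    public interface OpSender {
        void sendBatch(List<String> operations); // the target replays each batch it receives
    }

    // Ship all operations from the translog snapshot in fixed-size chunks and return how many were sent.
    public static int sendOperations(Iterator<String> translogSnapshot, int chunkSize, OpSender sender) {
        int totalSent = 0;
        List<String> batch = new ArrayList<>(chunkSize);
        while (translogSnapshot.hasNext()) {
            batch.add(translogSnapshot.next());
            if (batch.size() == chunkSize) {
                sender.sendBatch(new ArrayList<>(batch));
                totalSent += batch.size();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sender.sendBatch(new ArrayList<>(batch));
            totalSent += batch.size();
        }
        return totalSent;
    }
}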

ES 6.7 offers two chances to skip phase1 quickly:

  1. If recovery can be based on the sequence number carried in the recovery request, phase1 is skipped entirely.
final SendFileResult sendFileResult; // the point of this block is to obtain sendFileResult
if (isSequenceNumberBasedRecovery) { // recovery can be based on the sequence number in the request, so skip phase1
	logger.trace("performing sequence numbers based recovery. starting at [{}]", request.startingSeqNo());
	startingSeqNo = request.startingSeqNo();
	sendFileResult = SendFileResult.EMPTY;
} else { // enter phase1; there is still a chance to bail out early based on the sync id and doc count
	...
	sendFileResult = phase1(phase1Snapshot.getIndexCommit(), () -> estimateNumOps);
	...
}
  2. If the primary and the replica share the same sync_id and have the same doc count, phase1 bails out early.
/**
 * Perform phase1 of the recovery operations. Once this {@link IndexCommit}
 * snapshot has been performed no commit operations (files being fsync'd)
 * are effectively allowed on this index until all recovery phases are done
 * <p>
 * Phase1 examines the segment files on the target node and copies over the
 * segments that are missing. Only segments that have the same size and
 * checksum can be reused
 */
public SendFileResult phase1(final IndexCommit snapshot, final Supplier<Integer> translogOps) {
	cancellableThreads.checkForCancel();
	// Total size of segment files that are recovered
	long totalSize = 0;
	// Total size of segment files that were able to be re-used
	long existingTotalSize = 0;
	final List<String> phase1FileNames = new ArrayList<>();
	final List<Long> phase1FileSizes = new ArrayList<>();
	final List<String> phase1ExistingFileNames = new ArrayList<>();
	final List<Long> phase1ExistingFileSizes = new ArrayList<>();
	final Store store = shard.store(); // get hold of the shard's store
	store.incRef();
	try {
		StopWatch stopWatch = new StopWatch().start();
		final Store.MetadataSnapshot recoverySourceMetadata;
		try {
			recoverySourceMetadata = store.getMetadata(snapshot); // read the metadata of the snapshot (the commit being recovered)
		} catch (CorruptIndexException | IndexFormatTooOldException | IndexFormatTooNewException ex) {
			shard.failShard("recovery", ex);
			throw ex;
		}
		for (String name : snapshot.getFileNames()) {
			final StoreFileMetaData md = recoverySourceMetadata.get(name);
			if (md == null) {
				logger.info("Snapshot differs from actual index for file: {} meta: {}", name, recoverySourceMetadata.asMap());
				throw new CorruptIndexException("Snapshot differs from actual index - maybe index was removed metadata has " +
						recoverySourceMetadata.asMap().size() + " files", name);
			}
		}
		// Generate a "diff" of all the identical, different, and missing
		// segment files on the target node, using the existing files on
		// the source node
		String recoverySourceSyncId = recoverySourceMetadata.getSyncId();
		String recoveryTargetSyncId = request.metadataSnapshot().getSyncId();
		final boolean recoverWithSyncId = recoverySourceSyncId != null &&
				recoverySourceSyncId.equals(recoveryTargetSyncId);
		if (recoverWithSyncId) { // the sync ids match; if the doc counts also match, phase1 is skipped
			final long numDocsTarget = request.metadataSnapshot().getNumDocs();
			final long numDocsSource = recoverySourceMetadata.getNumDocs();
			if (numDocsTarget != numDocsSource) {
				throw new IllegalStateException("try to recover " + request.shardId() + " from primary shard with sync id but number " +
						"of docs differ: " + numDocsSource + " (" + request.sourceNode().getName() + ", primary) vs " + numDocsTarget
						+ "(" + request.targetNode().getName() + ")");
			}
			// we shortcut recovery here because we have nothing to copy. but we must still start the engine on the target.
			// so we don't return here
			logger.trace("skipping [phase1]- identical sync id [{}] found on both source and target", recoverySourceSyncId);
		} else { // the sync ids differ: compute the segment diff between target and source and send the files that need recovering to the target node, which costs network bandwidth and time
			...
		}
		final TimeValue took = stopWatch.totalTime();
		logger.trace("recovery [phase1]: took [{}]", took);
		return new SendFileResult(phase1FileNames, phase1FileSizes, totalSize, phase1ExistingFileNames,
			phase1ExistingFileSizes, existingTotalSize, took);
	} catch (Exception e) {
		throw new RecoverFilesRecoveryException(request.shardId(), phase1FileNames.size(), new ByteSizeValue(totalSize), e);
	} finally {
		store.decRef();
	}
}
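The elided else branch above boils down to a file-level diff between the source commit and the metadata snapshot sent by the target. The sketch below captures the idea only: FileMeta is a hypothetical stand-in for StoreFileMetaData, and, as the phase1 javadoc says, a file can be reused on the target only when its size and checksum match; everything else has to be copied over the network.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Conceptual sketch of the phase1 file diff; FileMeta is a hypothetical stand-in for StoreFileMetaData.
public class SegmentDiff {

    public static class FileMeta {
        final String name;
        final long length;
        final String checksum;

        public FileMeta(String name, long length, String checksum) {
            this.name = name;
            this.length = length;
            this.checksum = checksum;
        }
    }

    final List<String> reusable = new ArrayList<>();  // same name, length and checksum: keep on the target
    final List<String> toCopy = new ArrayList<>();    // missing or different: must be sent to the target

    public static SegmentDiff diff(Map<String, FileMeta> source, Map<String, FileMeta> target) {
        SegmentDiff d = new SegmentDiff();
        for (FileMeta src : source.values()) {
            FileMeta tgt = target.get(src.name);
            boolean same = tgt != null && tgt.length == src.length && Objects.equals(tgt.checksum, src.checksum);
            (same ? d.reusable : d.toCopy).add(src.name);
        }
        return d;
    }
}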

Synced flush lets cold indices skip phase1 quickly and can shorten recovery time considerably.

Elasticsearch tracks the indexing activity of each shard. Shards that have not received any indexing operations for 5 minutes are automatically marked as inactive. This presents an opportunity for Elasticsearch to reduce shard resources and also perform a special kind of flush, called synced flush. A synced flush performs a normal flush, then adds a generated unique marker (sync_id) to all shards.
Since the sync id marker was added when there were no ongoing indexing operations, it can be used as a quick way to check if the two shards’ lucene indices are identical. This quick sync id comparison (if present) is used during recovery or restarts to skip the first and most costly phase of the process. In that case, no segment files need to be copied and the transaction log replay phase of the recovery can start immediately. Note that since the sync id marker was applied together with a flush, it is very likely that the transaction log will be empty, speeding up recoveries even more.
This is particularly useful for use cases having lots of indices which are never or very rarely updated, such as time based data. This use case typically generates lots of indices whose recovery without the synced flush marker would take a long time.

  • Synced flush was introduced precisely because the first phase of replica recovery can take so long. By default, an index that has received no writes for 5 minutes is marked inactive and a synced flush is performed, generating a unique sync_id that is written to every copy of the shard.
  • The sync_id is per shard copy: copies carrying the same sync_id have identical Lucene indices, which gives a cheap way to check whether two Lucene indices are the same. If the primary and the replica share the same sync_id and have the same doc count, phase1 is skipped.
  • Obviously nothing may be written while a synced flush is in progress. If a write request arrives during a synced flush, ES chooses write availability: the synced flush fails and the write succeeds.
  • A normal flush on a shard removes any existing sync_id; as long as no flush happens, an existing sync_id does not expire.
    Conclusion: synced flush is a best-effort operation and is only really useful for cold indices, i.e. indices whose data is rarely updated. Any shards that succeeded will have faster recovery times.
    curl -X POST "localhost:9200/index1,index2/_flush/synced"
    curl -X POST "localhost:9200/_flush/synced"

Open questions

  1. What is a sequence number? Where is it recorded? Under what conditions can recovery be based on the sequence number in the recovery request?
  2. Does a synced flush need both the primary and the replicas to be alive in order to succeed?
  3. In which file is the sync_id stored?
  4. During a rolling restart, if curl -X POST "localhost:9200/_flush/synced" is issued once per instance, could that leave the shard copies with mismatched sync ids and actually make recovery take longer?

(A quick ad break, dear readers.)
The next article will focus on how primary/replica consistency is guaranteed, the likely causes of slow recovery, recovery speed tuning, and the monitoring commands related to recovery.


Reference
《ElasticSearch源碼解析與優化實戰》
Elasticsearch 底層系列之分片恢復解析
談談ES 的Recovery
RecoverySourceHandler類的所有改動
