Elasticsearch Index Recovery Flow (Part 1)

Based on ES 6.7, this article covers Elasticsearch index recovery: the major recovery stages, the synced flush mechanism, and the recovery flows for primary and replica shards.

ES Index Recovery

  • Index recovery (indices.recovery) is the process by which ES recovers data and is the slowest part of cluster startup. It runs after a full cluster restart, and a newly elected master may also go through index recovery after the previous master node dies.
  • What data needs to be recovered? Data that clients wrote successfully but that has not yet been flushed (flush) to disk as Lucene segments.
  • What does index recovery achieve?
  1. Data integrity: when a node restarts unexpectedly, data written to disk may still be sitting in the filesystem cache, not yet fsynced. Without some way of getting that un-flushed data back, part of it would be lost.
  2. Replica consistency: a write may not have been applied to every shard copy before the failure, so replica shards have to be brought back exactly in sync with the primary.

Types of Recovery Flows

Depending on the nature of the shard being recovered, index recovery splits into a "primary shard recovery flow" and a "replica shard recovery flow" (the sketch after this list shows how to see which flow each shard used).

  • Primary shard recovery flow: the primary recovers from its own translog; Lucene segments that had not yet been flushed to disk can be rebuilt from the translog. (How long does that take?!)
  • Replica shard recovery flow: the replica pulls Lucene segments and the translog from the primary, but it has a chance to skip pulling the Lucene segments. (There is still hope!)
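For a quick view of which flow a given shard actually went through, the recovery type and current stage are exposed by the _cat/recovery API. A minimal sketch (the host and the column list are just examples; column names are those of the 6.x cat API):

# Show, per shard, the recovery type (e.g. existing_store for a primary recovering
# locally, peer for a replica recovering from its primary) and the current stage.
curl -s "localhost:9200/_cat/recovery?v&h=index,shard,type,stage,source_node,target_node"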

How Recovery Is Triggered
Recovery is triggered by a clusterChanged event. The call chain from the trigger to the start of recovery is:

indicesClusterStateService#applyClusterState()
->createOrUpdateShards()
->createShard()
->indicesService#createShard()
->indexShard#startRecovery() 

The startRecovery() method in IndexShard.java runs the recovery flow for a given shard, dispatching on the shard's recovery type. Here we focus on the following two types:

  • EXISTING_STORE: the primary shard recovers from its local store
  • PEER: a replica shard recovers from the remote primary shard
public void startRecovery(...) {
        switch (recoveryState.getRecoverySource().getType()) {
            case EMPTY_STORE:
            case EXISTING_STORE: // primary shard recovers from the local store
                markAsRecovering("from store", recoveryState); // mark the shard as recovering on the cluster state thread
                threadPool.generic().execute(() -> {
                    try {
                        if (recoverFromStore()) {
                            recoveryListener.onRecoveryDone(recoveryState); // recovery succeeded
                        }
                    } catch (Exception e) { // recovery failed
                        recoveryListener.onRecoveryFailure(recoveryState,
                            new RecoveryFailedException(recoveryState, null, e), true);
                    }
                });
                break;
            case PEER: // replica shard recovers from the remote primary
                try {
                    markAsRecovering("from " + recoveryState.getSourceNode(), recoveryState);
                    recoveryTargetService.startRecovery(this, recoveryState.getSourceNode(), recoveryListener);
                } catch (Exception e) {
                    failShard("corrupted preexisting index", e);
                    recoveryListener.onRecoveryFailure(recoveryState,
                        new RecoveryFailedException(recoveryState, null, e), true);
                }
                break;
            case SNAPSHOT:
                ...
                break;
            case LOCAL_SHARDS:
                ...
                break;
            default:
                throw new IllegalArgumentException("Unknown recovery source " + recoveryState.getRecoverySource());
        }
    }

The Six Stages of Index Recovery

  • Recovery generally goes through the following six stages (see the RecoveryState class): INIT -> INDEX -> VERIFY_INDEX -> TRANSLOG -> FINALIZE -> DONE
  • Both primary and replica recovery pass through all of these stages, but sometimes the actual work of a stage is skipped and the stage only shows up briefly in the flow.
    For example, replica recovery skips replaying its own local translog, and the INDEX stage of primary recovery does not copy any data between nodes. (A way to observe the stage from outside is sketched after the stage summary below.)
/**
 * Keeps track of state related to shard recovery.
 */
public class RecoveryState implements ToXContentFragment, Streamable, Writeable {

    public enum Stage {
        INIT((byte) 0),

        /**
         * recovery of lucene files, either reusing local ones or copying new ones
         */
        INDEX((byte) 1),

        /**
         * potentially running check index
         */
        VERIFY_INDEX((byte) 2),

        /**
         * starting up the engine, replaying the translog
         */
        TRANSLOG((byte) 3),

        /**
         * performing final task after all translog ops have been done
         */
        FINALIZE((byte) 4),

        DONE((byte) 5);

        ...
        }
    }
	
	private Stage stage;
	...
	
	public RecoveryState(ShardRouting shardRouting, DiscoveryNode targetNode, @Nullable DiscoveryNode sourceNode) {
        assert shardRouting.initializing() : "only allow initializing shard routing to be recovered: " + shardRouting;
        RecoverySource recoverySource = shardRouting.recoverySource();
        assert (recoverySource.getType() == RecoverySource.Type.PEER) == (sourceNode != null) :
            "peer recovery requires source node, recovery type: " + recoverySource.getType() + " source node: " + sourceNode;
        this.shardId = shardRouting.shardId();
        this.primary = shardRouting.primary();
        this.recoverySource = recoverySource;
        this.sourceNode = sourceNode;
        this.targetNode = targetNode;
        stage = Stage.INIT;
        index = new Index();
        translog = new Translog();
        verifyIndex = new VerifyIndex();
        timer = new Timer();
        timer.start();
    }
    
    ...
}
Stage-by-stage summary:

  • INIT (recovery not started yet): the shard is marked INIT the moment recovery begins. Stage.INIT is held in RecoveryState, which is passed into IndexShard#startRecovery(); after the recovery type is determined and some simple validation is done, the shard moves on to INDEX.
  • INDEX (recover Lucene files, copy index data between nodes): reads the last committed segment info from Lucene, obtains the version, and updates the current index version.
  • VERIFY_INDEX (check the index for corruption): configurable; by default no verification is performed and recovery proceeds straight to the most important stage, TRANSLOG.
  • TRANSLOG (start the engine, replay the translog, build the Lucene index): the most important stage. A snapshot is taken based on the last commit to determine which translog operations need replaying; the un-flushed operations are replayed, and the newly generated Lucene data is flushed to disk afterwards.
  • FINALIZE (cleanup): performs a refresh, writing buffered data out to files without fsyncing; the data sits in the OS cache.
  • DONE (recovery finished): performs one more refresh before entering DONE, then updates the shard state.
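These stages can also be observed from outside via the index recovery API. A minimal sketch, assuming an index named my_index (a placeholder):

# The response lists each shard's recovery with a "stage" field
# (INIT/INDEX/VERIFY_INDEX/TRANSLOG/FINALIZE/DONE), plus "index", "translog"
# and "verify_index" sections carrying sizes, operation counts and timings.
curl -s "localhost:9200/my_index/_recovery?human&pretty"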

Primary Shard Recovery Flow

INIT and INDEX stages: from the moment recovery starts the shard is marked INIT; after some simple validation it moves on to the INDEX stage.

IndexShard#innerOpenEngineAndTranslog() takes care of the VERIFY_INDEX stage (the check is disabled by default; the settings sketch below shows how to enable it) and then kicks off the TRANSLOG stage.
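The VERIFY_INDEX check is governed by the index.shard.check_on_startup setting. A minimal sketch, assuming an index named my_index (a placeholder); the setting is static, so it has to be applied at index creation or on a closed index, and anything other than the default noticeably slows recovery of large shards:

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.shard.check_on_startup": "checksum"
  }
}'
# "false" (default): skip verification; "checksum": verify file checksums;
# "true": full segment check, the slowest option.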
TRANSLOG stage

/**
 * Replays translog operations from the provided translog {@code snapshot} to the current engine using the given {@code origin}.
 * The callback {@code onOperationRecovered} is notified after each translog operation is replayed successfully.
 */
int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin,
						Runnable onOperationRecovered) throws IOException {
	int opsRecovered = 0;
	Translog.Operation operation;
	while ((operation = snapshot.next()) != null) { // iterate over the translog operations that need to be replayed
		try {
			logger.trace("[translog] recover op {}", operation);
			Engine.Result result = applyTranslogOperation(engine, operation, origin); // apply the concrete operation: index write, delete, etc.
			switch (result.getResultType()) {
				case FAILURE:
					throw result.getFailure();
				case MAPPING_UPDATE_REQUIRED:
					throw new IllegalArgumentException("unexpected mapping update: " + result.getRequiredMappingUpdate());
				case SUCCESS:
					break;
				default:
					throw new AssertionError("Unknown result type [" + result.getResultType() + "]");
			}

			opsRecovered++;
			onOperationRecovered.run();
		} catch (Exception e) {
			...
		}
	}
	return opsRecovered;
}
private Engine.Result applyTranslogOperation(Engine engine, Translog.Operation operation,
											 Engine.Operation.Origin origin) throws IOException {
	// If a translog op is replayed on the primary (eg. ccr), we need to use external instead of null for its version type.
	final VersionType versionType = (origin == Engine.Operation.Origin.PRIMARY) ? VersionType.EXTERNAL : null;
	final Engine.Result result;
	switch (operation.opType()) { // dispatch the replay logic on the operation type: index write, delete, no-op
		case INDEX:
			final Translog.Index index = (Translog.Index) operation;
			// we set canHaveDuplicates to true all the time such that we de-optimize the translog case and ensure that all
			// autoGeneratedID docs that are coming from the primary are updated correctly.
			result = applyIndexOperation(engine, index.seqNo(), index.primaryTerm(), index.version(),
				index.versionType().versionTypeForReplicationAndRecovery(), UNASSIGNED_SEQ_NO, 0, index.getAutoGeneratedIdTimestamp(),
				true, origin, source(shardId.getIndexName(), index.type(), index.id(), index.source(),
					XContentHelper.xContentType(index.source())).routing(index.routing()).parent(index.parent()));
			break;
		case DELETE:
			final Translog.Delete delete = (Translog.Delete) operation;
			result = applyDeleteOperation(engine, delete.seqNo(), delete.primaryTerm(), delete.version(), delete.type(), delete.id(),
				delete.versionType().versionTypeForReplicationAndRecovery(), UNASSIGNED_SEQ_NO, 0, origin);
			break;
		case NO_OP:
			final Translog.NoOp noOp = (Translog.NoOp) operation;
			result = markSeqNoAsNoop(engine, noOp.seqNo(), noOp.primaryTerm(), noOp.reason(), origin);
			break;
		default:
			throw new IllegalStateException("No operation defined for [" + operation + "]");
	}
	return result;
}
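How much work the TRANSLOG stage has to do is bounded by how much un-flushed translog has accumulated since the last commit. A hedged sketch of the dynamic index settings that influence this in 6.x (my_index and the values are illustrative only):

curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.translog.flush_threshold_size": "512mb",
  "index.translog.durability": "request"
}'
# A smaller flush_threshold_size triggers flushes earlier, capping the translog a
# primary would have to replay on recovery, at the cost of more frequent flushes.
# durability=request fsyncs the translog on every request (the default); "async"
# trades durability for throughput.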

FINALIZE stage: performs a refresh so that the buffered data is written out to files.
DONE stage: performs one more refresh, then updates the shard state, completing primary shard recovery.


Replica Shard Recovery Flow

The core idea of replica recovery is to pull Lucene segments and the translog from the primary shard. Following the direction of data transfer, the node holding the primary shard acts as the Source and the node holding the replica acts as the Target.
The VERIFY_INDEX, TRANSLOG and FINALIZE stages of replica recovery are driven by RPC calls sent from the primary shard's node.
Replica recovery therefore involves a target node (the node being recovered) and a source node (the node providing the data). The target node sends a shard recovery request to the source node, and the source handles the request in two phases, as shown in the figure below (a small monitoring sketch follows the figure).

  • phase1: take a snapshot of the shard to be recovered and compare it against the metadata in the request. If the sync_ids match and the doc counts match, phase1 is skipped. Otherwise, diff the shard's segment files and send the differing segments to the target node.
  • phase2: to make the target node's data complete, send the local translog to the target node, which replays the received translog operations.

[Figure: replica shard recovery flow between the source node (primary) and the target node (replica)]
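While a replica is recovering, the two phases can be followed roughly from the command line: the file columns track phase1 (segment copying) and the translog columns track phase2 (translog replay). A sketch using the 6.x _cat/recovery API:

# active_only=true limits the output to recoveries that are still running.
curl -s "localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,stage,files_percent,bytes_percent,translog_ops_percent"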

ES 6.7 offers two opportunities to skip phase1 quickly:

  1. If recovery can be driven by the SequenceNumber carried in the recovery request, phase1 is skipped entirely
final SendFileResult sendFileResult; // the whole point of this block is to obtain sendFileResult
if (isSequenceNumberBasedRecovery) { // recovery can be based on the request's SequenceNumber, so skip phase1
	logger.trace("performing sequence numbers based recovery. starting at [{}]", request.startingSeqNo());
	startingSeqNo = request.startingSeqNo();
	sendFileResult = SendFileResult.EMPTY;
} else { // enter phase1; there is still a chance to bail out early if the sync_ids and doc counts match
	...
	sendFileResult = phase1(phase1Snapshot.getIndexCommit(), () -> estimateNumOps);
	...
}
  2. If the primary and replica shards carry the same sync_id and the same doc count, phase1 bails out early
/**
 * Perform phase1 of the recovery operations. Once this {@link IndexCommit}
 * snapshot has been performed no commit operations (files being fsync'd)
 * are effectively allowed on this index until all recovery phases are done
 * <p>
 * Phase1 examines the segment files on the target node and copies over the
 * segments that are missing. Only segments that have the same size and
 * checksum can be reused
 */
public SendFileResult phase1(final IndexCommit snapshot, final Supplier<Integer> translogOps) {
	cancellableThreads.checkForCancel();
	// Total size of segment files that are recovered
	long totalSize = 0;
	// Total size of segment files that were able to be re-used
	long existingTotalSize = 0;
	final List<String> phase1FileNames = new ArrayList<>();
	final List<Long> phase1FileSizes = new ArrayList<>();
	final List<String> phase1ExistingFileNames = new ArrayList<>();
	final List<Long> phase1ExistingFileSizes = new ArrayList<>();
	final Store store = shard.store(); // get the shard's store
	store.incRef();
	try {
		StopWatch stopWatch = new StopWatch().start();
		final Store.MetadataSnapshot recoverySourceMetadata;
		try {
			recoverySourceMetadata = store.getMetadata(snapshot); // get the snapshot's metadata
		} catch (CorruptIndexException | IndexFormatTooOldException | IndexFormatTooNewException ex) {
			shard.failShard("recovery", ex);
			throw ex;
		}
		for (String name : snapshot.getFileNames()) {
			final StoreFileMetaData md = recoverySourceMetadata.get(name);
			if (md == null) {
				logger.info("Snapshot differs from actual index for file: {} meta: {}", name, recoverySourceMetadata.asMap());
				throw new CorruptIndexException("Snapshot differs from actual index - maybe index was removed metadata has " +
						recoverySourceMetadata.asMap().size() + " files", name);
			}
		}
		// Generate a "diff" of all the identical, different, and missing
		// segment files on the target node, using the existing files on
		// the source node
		String recoverySourceSyncId = recoverySourceMetadata.getSyncId();
		String recoveryTargetSyncId = request.metadataSnapshot().getSyncId();
		final boolean recoverWithSyncId = recoverySourceSyncId != null &&
				recoverySourceSyncId.equals(recoveryTargetSyncId);
		if (recoverWithSyncId) { // if the sync ids match, go on to compare doc counts; if both match, bail out of phase1 early
			final long numDocsTarget = request.metadataSnapshot().getNumDocs();
			final long numDocsSource = recoverySourceMetadata.getNumDocs();
			if (numDocsTarget != numDocsSource) {
				throw new IllegalStateException("try to recover " + request.shardId() + " from primary shard with sync id but number " +
						"of docs differ: " + numDocsSource + " (" + request.sourceNode().getName() + ", primary) vs " + numDocsTarget
						+ "(" + request.targetNode().getName() + ")");
			}
			// we shortcut recovery here because we have nothing to copy. but we must still start the engine on the target.
			// so we don't return here
			logger.trace("skipping [phase1]- identical sync id [{}] found on both source and target", recoverySourceSyncId);
		} else { // if the sync ids differ, work out which segments differ between source and target and send the files that need recovery to the target node (eats network bandwidth; this is the expensive part!)
			...
		}
		final TimeValue took = stopWatch.totalTime();
		logger.trace("recovery [phase1]: took [{}]", took);
		return new SendFileResult(phase1FileNames, phase1FileSizes, totalSize, phase1ExistingFileNames,
			phase1ExistingFileSizes, existingTotalSize, took);
	} catch (Exception e) {
		throw new RecoverFilesRecoveryException(request.shardId(), phase1FileNames.size(), new ByteSizeValue(totalSize), e);
	} finally {
		store.decRef();
	}
}

Synced flush lets cold indices skip phase1 quickly and can shorten recovery time significantly!

Elasticsearch tracks the indexing activity of each shard. Shards that have not received any indexing operations for 5 minutes are automatically marked as inactive. This presents an opportunity for Elasticsearch to reduce shard resources and also perform a special kind of flush, called synced flush. A synced flush performs a normal flush, then adds a generated unique marker (sync_id) to all shards.
Since the sync id marker was added when there were no ongoing indexing operations, it can be used as a quick way to check if the two shards’ lucene indices are identical. This quick sync id comparison (if present) is used during recovery or restarts to skip the first and most costly phase of the process. In that case, no segment files need to be copied and the transaction log replay phase of the recovery can start immediately. Note that since the sync id marker was applied together with a flush, it is very likely that the transaction log will be empty, speeding up recoveries even more.
This is particularly useful for use cases having lots of indices which are never or very rarely updated, such as time based data. This use case typically generates lots of indices whose recovery without the synced flush marker would take a long time.

  • Synced flush was introduced because phase1 of replica recovery can take far too long. By default, an index shard that has received no writes for 5 minutes is marked inactive and a synced flush is performed, generating a unique sync_id that is written to every copy of the shard.
  • Note that the sync_id is per shard: shard copies with the same sync_id hold identical Lucene indices, which provides a fast way to check whether two Lucene indices are the same. If the primary and replica share a sync_id and have the same doc count, phase1 is skipped quickly.
  • Obviously nothing may be written during a synced flush. If a write request arrives while a synced flush is in progress, ES chooses write availability: the synced flush fails and the write succeeds.
  • An ordinary flush on a shard removes any existing sync_id; as long as no flush happens, an existing sync_id stays valid.
    Conclusion: synced flush is an unreliable operation and is only suitable for cold indices, i.e. indices whose data is rarely updated. (A way to check for the sync_id is sketched after the commands below.)
    Any shards that succeeded will have faster recovery times.
    curl -X POST "localhost:9200/index1,index2/_flush/synced"
    curl -X POST "localhost:9200/_flush/synced"

Open Questions

  1. What is the SequenceNumber? Where is it recorded? Under what conditions can recovery be based on the SequenceNumber in the recovery request?
  2. Does a synced flush need both the primary and the replica shards to be alive in order to succeed?
  3. Which file is the sync_id recorded in?
  4. During a rolling restart, if curl -X POST "localhost:9200/_flush/synced" is run once per instance, could that leave the sync_ids widely inconsistent and actually lengthen recovery?

(Dear onlookers, a quick plug for what comes next.)
The next article will focus on how primary/replica consistency is guaranteed, possible causes of slow recovery, recovery speed tuning, and the monitoring commands related to recovery.


Reference
《ElasticSearch源码解析与优化实战》
Elasticsearch 底层系列之分片恢复解析
谈谈ES 的Recovery
RecoverySourceHandler类的所有改动
