We are the infrastructure department, the team supporting the backend services of Tencent Cloud's CES/CTSDB products. With professional Elasticsearch development and operations expertise, we provide stable, high-performance services. Anyone who needs them is welcome to onboard, and we also welcome discussions on Elasticsearch and Lucene internals!
1. Preface
In production, node failures are common in large ES clusters: network partitions, machine failures, cluster overload, and so on can all take a node down. Once the environment recovers, the node rejoins the cluster, and ES's rebalancing strategy then needs to recover certain shards onto the rejoined node. So how does ES's shard recovery flow actually work? How do you handle its pitfalls? (One of our online users hit this: setting the recovery concurrency too high triggers an ES bug that causes a distributed deadlock.) And how are the completeness and consistency of shard recovery guaranteed? This article digs into the ES source code for answers. Note: there are several shard recovery scenarios; this article only dissects the most complex one, peer recovery.
2. Overall Shard Recovery Flow
Replica shard recovery in ES involves a target node and a source node: the target node is the one recovering from failure, and the source node is the one providing the data. The target node sends a shard recovery request to the source node, which handles it in two phases. In phase 1, the source node creates a snapshot of the shard to be recovered; based on the metadata in the request, if the sync ids match and the doc counts are equal, this phase is skipped, otherwise the source compares the shards' segment files and sends the differing segment files to the target node. In phase 2, to guarantee the completeness of the target node's data, the source sends its local translog to the target node, which replays the received translog operations. The overall flow is shown in the figure below.
That is the overall recovery flow; the implementation details are analyzed below against the source code.
3. Replica Shard Recovery Flow
3.1 Target Node Requests Recovery
In this section we walk through the detailed replica shard recovery flow in the source code. ES drives its modules from metadata changes, and the entry point for replica shard recovery is IndicesClusterStateService.createOrUpdateShards. It first checks whether the local node appears in routingNodes: if it does, the local node has shards to create or update; otherwise the method returns. The logic is as follows:
private void createOrUpdateShards(final ClusterState state) {
    RoutingNode localRoutingNode = state.getRoutingNodes().node(state.nodes().getLocalNodeId());
    if (localRoutingNode == null) {
        return;
    }
    DiscoveryNodes nodes = state.nodes();
    RoutingTable routingTable = state.routingTable();
    for (final ShardRouting shardRouting : localRoutingNode) {
        ShardId shardId = shardRouting.shardId();
        if (failedShardsCache.containsKey(shardId) == false) {
            AllocatedIndex<? extends Shard> indexService = indicesService.indexService(shardId.getIndex());
            Shard shard = indexService.getShardOrNull(shardId.id());
            if (shard == null) {
                // the shard does not exist locally yet, so create it
                createShard(nodes, routingTable, shardRouting, state);
            } else {
                // the shard exists, so update it
                updateShard(nodes, shardRouting, shard, routingTable, state);
            }
        }
    }
}
Replica shard recovery takes the createShard branch. That method first inspects the shardRouting's recovery type: if it is PEER, the shard must be fetched from a remote node, so a source node has to be found, after which IndicesService.createShard is called:
private void createShard(DiscoveryNodes nodes, RoutingTable routingTable, ShardRouting shardRouting, ClusterState state) {
    DiscoveryNode sourceNode = null;
    if (shardRouting.recoverySource().getType() == Type.PEER) {
        // for peer recovery, locate the source node holding the shard to recover from
        sourceNode = findSourceNodeForPeerRecovery(logger, routingTable, nodes, shardRouting);
        if (sourceNode == null) {
            return;
        }
    }
    RecoveryState recoveryState = new RecoveryState(shardRouting, nodes.getLocalNode(), sourceNode);
    indicesService.createShard(shardRouting, recoveryState, recoveryTargetService,
            new RecoveryListener(shardRouting), repositoriesService, failedShardHandler);
    ... ...
}

private static DiscoveryNode findSourceNodeForPeerRecovery(Logger logger, RoutingTable routingTable,
        DiscoveryNodes nodes, ShardRouting shardRouting) {
    DiscoveryNode sourceNode = null;
    if (!shardRouting.primary()) {
        ShardRouting primary = routingTable.shardRoutingTable(shardRouting.shardId()).primaryShard();
        if (primary.active()) {
            sourceNode = nodes.get(primary.currentNodeId()); // the node holding the primary shard
        }
    } else if (shardRouting.relocatingNodeId() != null) {
        sourceNode = nodes.get(shardRouting.relocatingNodeId()); // the node the shard is relocating from
    } else {
        ... ...
    }
    return sourceNode;
}
The source node is determined in one of two ways: if the current shard is not a primary shard, the source node is the node holding the primary shard; otherwise, if the shard is relocating (from another node to this one), the source node is the node it is moving from. With the source node determined, IndicesService.createShard is called, which in turn calls IndexShard.startRecovery to begin recovery. For a PEER-type recovery, the actual work is done by PeerRecoveryTargetService.doRecovery. That method first obtains the shard's metadataSnapshot, a structure holding the shard's segment information (sync id, checksums, doc count, and so on), wraps it in a StartRecoveryRequest, and sends it to the source node over RPC:
... ...
metadataSnapshot = recoveryTarget.indexShard().snapshotStoreMetadata();
... ...
// build the recovery request
request = new StartRecoveryRequest(recoveryTarget.shardId(),
        recoveryTarget.indexShard().routingEntry().allocationId().getId(),
        recoveryTarget.sourceNode(), clusterService.localNode(), metadataSnapshot,
        recoveryTarget.state().getPrimary(), recoveryTarget.recoveryId());
... ...
// send the request to the source node, asking it to start the recovery
cancellableThreads.execute(() -> responseHolder.set(
        transportService.submitRequest(request.sourceNode(), PeerRecoverySourceService.Actions.START_RECOVERY,
                request, new FutureTransportResponseHandler<RecoveryResponse>() {
                    @Override
                    public RecoveryResponse newInstance() {
                        return new RecoveryResponse();
                    }
                }).txGet()));
Note that although the request is sent asynchronously, the code then calls PlainTransportFuture.txGet(), which blocks until the peer replies. At this point the target node has sent its request to the source node; the source node's handling is analyzed in detail next.
3.2 Source Node Handles the Recovery Request
When the source node receives the request, it invokes the recovery entry function recover:
class StartRecoveryTransportRequestHandler implements TransportRequestHandler<StartRecoveryRequest> {
    @Override
    public void messageReceived(final StartRecoveryRequest request, final TransportChannel channel) throws Exception {
        RecoveryResponse response = recover(request);
        channel.sendResponse(response);
    }
}
The recover method resolves the shard from the request, constructs a RecoverySourceHandler, and calls handler.recoverToTarget, the body of the recovery work:
public RecoveryResponse recoverToTarget() throws IOException {
    // recovery runs in two phases
    try (Translog.View translogView = shard.acquireTranslogView()) {
        final IndexCommit phase1Snapshot;
        try {
            phase1Snapshot = shard.acquireIndexCommit(false);
        } catch (Exception e) {
            IOUtils.closeWhileHandlingException(translogView);
            throw new RecoveryEngineException(shard.shardId(), 1, "Snapshot failed", e);
        }
        try {
            // phase 1: compare sync ids and segments, then push the differing files to the requester
            phase1(phase1Snapshot, translogView);
        } catch (Exception e) {
            throw new RecoveryEngineException(shard.shardId(), 1, "phase1 failed", e);
        } finally {
            try {
                shard.releaseIndexCommit(phase1Snapshot);
            } catch (IOException ex) {
                logger.warn("releasing snapshot caused exception", ex);
            }
        }
        // engine was just started at the end of phase 1
        if (shard.state() == IndexShardState.RELOCATED) {
            throw new IndexShardRelocatedException(request.shardId());
        }
        try {
            // phase 2: send the translog
            phase2(translogView.snapshot());
        } catch (Exception e) {
            throw new RecoveryEngineException(shard.shardId(), 2, "phase2 failed", e);
        }
        finalizeRecovery();
    }
    return response;
}
As the code shows, recovery has two phases: phase 1 recovers segment files, and phase 2 ships the translog. One key detail: before recovering, the source acquires both a translog view and a segment snapshot. The translog view guarantees that translog entries from this point until recovery finishes are not deleted; the segment snapshot guarantees that segment files existing at this point are not deleted. Now let's look at the two phases in detail. phase1:
public void phase1(final IndexCommit snapshot, final Translog.View translogView) {
    final Store store = shard.store(); // the shard's store
    recoverySourceMetadata = store.getMetadata(snapshot); // the snapshot's metadata
    String recoverySourceSyncId = recoverySourceMetadata.getSyncId();
    String recoveryTargetSyncId = request.metadataSnapshot().getSyncId();
    final boolean recoverWithSyncId = recoverySourceSyncId != null
            && recoverySourceSyncId.equals(recoveryTargetSyncId);
    if (recoverWithSyncId) {
        // sync ids match: also compare the doc counts, and skip file copying if those match too
        final long numDocsTarget = request.metadataSnapshot().getNumDocs();
        final long numDocsSource = recoverySourceMetadata.getNumDocs();
        if (numDocsTarget != numDocsSource) {
            throw new IllegalStateException("... ...");
        }
    } else {
        // compute the segment files that differ between target and source
        final Store.RecoveryDiff diff = recoverySourceMetadata.recoveryDiff(request.metadataSnapshot());
        List<StoreFileMetaData> phase1Files = new ArrayList<>(diff.different.size() + diff.missing.size());
        phase1Files.addAll(diff.different);
        phase1Files.addAll(diff.missing);
        ... ...
        final Function<StoreFileMetaData, OutputStream> outputStreamFactories =
                md -> new BufferedOutputStream(new RecoveryOutputStream(md, translogView), chunkSizeInBytes);
        // send the files that need recovering to the target node
        sendFiles(store, phase1Files.toArray(new StoreFileMetaData[phase1Files.size()]), outputStreamFactories);
        ... ...
    }
    prepareTargetForTranslog(translogView.totalOperations(),
            shard.segmentStats(false).getMaxUnsafeAutoIdTimestamp());
}
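The phase1 skip decision above boils down to a small predicate: copying can be skipped only when the sync ids match AND the doc counts agree, and matching sync ids with diverging doc counts is treated as an illegal state. The following is a minimal standalone sketch of that rule; the class `Phase1SkipDecision` and its method name are hypothetical, not ES source:

```java
// Hypothetical sketch (not ES source) of the phase1 skip rule.
public class Phase1SkipDecision {

    // Returns true when segment file copying can be skipped entirely.
    public static boolean canSkipPhase1(String sourceSyncId, long sourceNumDocs,
                                        String targetSyncId, long targetNumDocs) {
        boolean sameSyncId = sourceSyncId != null && sourceSyncId.equals(targetSyncId);
        if (!sameSyncId) {
            return false; // the segment diff must be computed and shipped
        }
        if (sourceNumDocs != targetNumDocs) {
            // same sync id but different doc counts: the real code throws here too
            throw new IllegalStateException("same sync id but doc counts differ: "
                    + sourceNumDocs + " vs " + targetNumDocs);
        }
        return true;
    }
}
```

This is why a synced flush (which stamps identical sync ids on idle shard copies) makes recoveries of cold indices nearly free: the whole file-copy phase is skipped.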
The logic of phase1 is as follows: obtain the metadataSnapshot of the shard being recovered to get recoverySourceSyncId, get recoveryTargetSyncId from the request, and compare the two sync ids. If they match, compare the source and target doc counts as well; if those also match, the source and target shards hold identical segments up to the current commit point, so no segment files need to be recovered. If the sync ids differ, the segment files diverge, and all differing files must be identified for recovery. Comparing recoverySourceMetadata against recoveryTargetSnapshot yields every differing segment file. That logic looks like this:
public RecoveryDiff recoveryDiff(MetadataSnapshot recoveryTargetSnapshot) {
    final List<StoreFileMetaData> identical = new ArrayList<>(); // identical files
    final List<StoreFileMetaData> different = new ArrayList<>(); // differing files
    final List<StoreFileMetaData> missing = new ArrayList<>();   // files missing on the target
    final Map<String, List<StoreFileMetaData>> perSegment = new HashMap<>();
    final List<StoreFileMetaData> perCommitStoreFiles = new ArrayList<>();
    ... ...
    for (List<StoreFileMetaData> segmentFiles : Iterables.concat(perSegment.values(),
            Collections.singleton(perCommitStoreFiles))) {
        identicalFiles.clear();
        boolean consistent = true;
        for (StoreFileMetaData meta : segmentFiles) {
            StoreFileMetaData storeFileMetaData = recoveryTargetSnapshot.get(meta.name());
            if (storeFileMetaData == null) {
                consistent = false;
                missing.add(meta); // the segment file does not exist on the target node
            } else if (storeFileMetaData.isSame(meta) == false) {
                consistent = false;
                different.add(meta); // the file exists but differs
            } else {
                identicalFiles.add(meta); // the file exists and is identical
            }
        }
        if (consistent) {
            identical.addAll(identicalFiles);
        } else {
            // make sure all files are added - this can happen if only the deletes are different
            different.addAll(identicalFiles);
        }
    }
    RecoveryDiff recoveryDiff = new RecoveryDiff(Collections.unmodifiableList(identical),
            Collections.unmodifiableList(different), Collections.unmodifiableList(missing));
    return recoveryDiff;
}
Segment files are sorted into three groups here: identical, different, and missing (absent on the target). The different and missing files are what phase 1 sends to the target node. After sending the segment files, the source node notifies the target node to clean up its temporary files, then notifies it to open its engine in preparation for receiving the translog. Note that both of these network calls also use PlainTransportFuture.txGet() to block waiting for the peer's reply. This completes phase 1.
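The subtle part of the grouping rule above is the demotion step: if any file in a segment's file group is missing or different, the group's matching files are demoted to "different" too, so the segment is re-sent whole. Here is a minimal, self-contained sketch of that rule; it treats its entire input as one segment file group and uses plain checksum strings instead of StoreFileMetaData, so `RecoveryDiffSketch` and its map-based API are illustrative assumptions rather than ES code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not ES source) of the per-group classification in recoveryDiff.
public class RecoveryDiffSketch {

    // Maps file name -> checksum on each side; returns the three classification lists.
    public static Map<String, List<String>> diff(Map<String, String> sourceChecksums,
                                                 Map<String, String> targetChecksums) {
        List<String> identical = new ArrayList<>();
        List<String> different = new ArrayList<>();
        List<String> missing = new ArrayList<>();
        List<String> identicalInGroup = new ArrayList<>();
        boolean consistent = true;
        for (Map.Entry<String, String> e : sourceChecksums.entrySet()) {
            String targetChecksum = targetChecksums.get(e.getKey());
            if (targetChecksum == null) {
                consistent = false;
                missing.add(e.getKey());          // absent on the target
            } else if (!targetChecksum.equals(e.getValue())) {
                consistent = false;
                different.add(e.getKey());        // present but differs
            } else {
                identicalInGroup.add(e.getKey()); // present and identical
            }
        }
        if (consistent) {
            identical.addAll(identicalInGroup);
        } else {
            different.addAll(identicalInGroup);   // demote: re-send the whole group
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("identical", identical);
        result.put("different", different);
        result.put("missing", missing);
        return result;
    }
}
```

The demotion looks wasteful, but it guarantees that a segment is always transferred as a consistent unit, which matters when, as the original comment notes, only the per-segment delete files differ.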
Phase 2 is simpler: every translog operation recorded between the creation of the translog view and the current moment is sent to the target node.
3.3 Target Node Performs Recovery
- Receiving segments
This corresponds to phase 1 on the source node: the source sends all differing segments to the target node, which persists the received segment files to disk. The write path is RecoveryTarget.writeFileChunk:
public void writeFileChunk(StoreFileMetaData fileMetaData, long position, BytesReference content,
        boolean lastChunk, int totalTranslogOps) throws IOException {
    final Store store = store();
    final String name = fileMetaData.name();
    ... ...
    if (position == 0) {
        // opens a temporary file: the original name with a prefix added
        indexOutput = openAndPutIndexOutput(name, fileMetaData, store);
    } else {
        indexOutput = getOpenIndexOutput(name);
    }
    ... ...
    while ((scratch = iterator.next()) != null) {
        indexOutput.writeBytes(scratch.bytes, scratch.offset, scratch.length); // write to the temporary file
    }
    ... ...
    store.directory().sync(Collections.singleton(temporaryFileName)); // fsync the file to disk
}
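The temporary-file pattern above (append chunks to a prefixed temp file, fsync, promote it to the final name only when complete) can be approximated with plain NIO. `ChunkedFileWriter`, the `recovery.` prefix constant, and the rename-on-last-chunk step are assumptions for illustration; the real implementation works through Lucene's Store/IndexOutput and additionally verifies checksums:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch (not ES source) of receiving file chunks into a temp file.
public class ChunkedFileWriter {
    private static final String TEMP_PREFIX = "recovery."; // assumed prefix for illustration

    public static void writeChunk(Path dir, String name, long position,
                                  byte[] content, boolean lastChunk) throws IOException {
        Path tempFile = dir.resolve(TEMP_PREFIX + name);
        try (FileChannel ch = FileChannel.open(tempFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(content), position); // write at the chunk's offset
            ch.force(true);                               // fsync, like directory().sync(...)
        }
        if (lastChunk) {
            // promote the temp file to its final segment name in one atomic step
            Files.move(tempFile, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        }
    }
}
```

Writing to a prefixed temp file means a half-received segment can never be mistaken for a valid one: if recovery is cancelled, the prefixed files are simply deleted, which is exactly the cleanup step the source node requests after phase 1.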
- Opening the engine
With the above, the target node has completed the first step of catching up. After receiving the segments, it opens the shard's engine in preparation for the translog. Note that once the engine is open, the recovering shard can already serve index and delete operations (both requests replicated from the primary shard and operations replayed from the translog). The engine is opened as follows:
private void internalPerformTranslogRecovery(boolean skipTranslogRecovery, boolean indexExists,
        long maxUnsafeAutoIdTimestamp) throws IOException {
    ... ...
    recoveryState.setStage(RecoveryState.Stage.TRANSLOG);
    final EngineConfig.OpenMode openMode;
    if (indexExists == false) {
        openMode = EngineConfig.OpenMode.CREATE_INDEX_AND_TRANSLOG;
    } else if (skipTranslogRecovery) {
        openMode = EngineConfig.OpenMode.OPEN_INDEX_CREATE_TRANSLOG;
    } else {
        openMode = EngineConfig.OpenMode.OPEN_INDEX_AND_TRANSLOG;
    }
    final EngineConfig config = newEngineConfig(openMode, maxUnsafeAutoIdTimestamp);
    // we disable deletes since we allow for operations to be executed against the shard while recovering
    // but we need to make sure we don't lose deletes until we are done recovering
    config.setEnableGcDeletes(false); // do not GC deletes while recovery is in progress
    Engine newEngine = createNewEngine(config); // create the engine
    ... ...
}
- Receiving and replaying the translog
With the engine open, the commands in the translog can be replayed. Replay resembles a normal index or delete: the operation type and payload are restored from the translog entry, the corresponding data objects are built, and the engine opened in the previous step executes the operation. The logic is as follows:
private void performRecoveryOperation(Engine engine, Translog.Operation operation,
        boolean allowMappingUpdates, Engine.Operation.Origin origin) throws IOException {
    // restore the operation type and payload, then have the engine execute the matching action
    switch (operation.opType()) {
        case INDEX:
            Translog.Index index = (Translog.Index) operation;
            // ... build the engineIndex object from index ...
            maybeAddMappingUpdate(engineIndex.type(), engineIndex.parsedDoc().dynamicMappingsUpdate(),
                    engineIndex.id(), allowMappingUpdates);
            index(engine, engineIndex); // execute the index operation
            break;
        case DELETE:
            Translog.Delete delete = (Translog.Delete) operation;
            // ... build the engineDelete object from delete ...
            delete(engine, engineDelete); // execute the delete operation
            break;
        default:
            throw new IllegalStateException("No operation defined for [" + operation + "]");
    }
}
With these steps the translog replay is finished. Some finishing work remains: a refresh to make the replayed data visible, and re-enabling the GC of deletes:
public void finalizeRecovery() {
    recoveryState().setStage(RecoveryState.Stage.FINALIZE);
    Engine engine = getEngine();
    engine.refresh("recovery_finalization");
    engine.config().setEnableGcDeletes(true);
}
At this point the two phases of replica shard recovery are complete, but because the shard is still in the INITIALIZING state, the master node must be notified to start the recovered shard:
private class RecoveryListener implements PeerRecoveryTargetService.RecoveryListener {
    @Override
    public void onRecoveryDone(RecoveryState state) {
        if (state.getRecoverySource().getType() == Type.SNAPSHOT) {
            SnapshotRecoverySource snapshotRecoverySource = (SnapshotRecoverySource) state.getRecoverySource();
            restoreService.indexShardRestoreCompleted(snapshotRecoverySource.snapshot(), shardRouting.shardId());
        }
        shardStateAction.shardStarted(shardRouting, "after " + state.getRecoverySource(), SHARD_STATE_ACTION_LISTENER);
    }
}
With that, the entire shard recovery flow is complete.
4. Answering the Opening Questions
Having walked through the source code, this section answers the questions raised at the beginning of the article, to deepen the understanding of shard recovery.
- Distributed deadlock. As highlighted at the end of sections 3.1 and 3.2, both the source node and the target node call PlainTransportFuture.txGet() to block a thread while synchronously waiting for a result; this is the crux of the deadlock. The detailed problem description and fix are at https://cloud.tencent.com/developer/article/1370318; combined with the source analysis in this article, the cause of the deadlock should now be clear.
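The deadlock mechanism, sync-over-async on a bounded threadpool, can be reproduced in miniature: when every worker thread is blocked in a synchronous wait for a task that needs one of those same threads, no progress is possible. In this illustrative sketch (not ES code), a single-thread executor stands in for a fully occupied recovery threadpool, and a blocking `Future.get()` stands in for `PlainTransportFuture.txGet()`; the timeout merely lets us observe the stall instead of hanging forever:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch of sync-over-async exhaustion on a bounded threadpool.
public class SyncOverAsyncDeadlock {

    // Returns true if the nested blocking wait stalls (the deadlock condition).
    public static boolean demonstrateStall() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Future<String> outer = pool.submit(() -> {
                // the inner task can never start: the pool's only thread is busy
                // running us, and we now block waiting on it
                Future<String> inner = pool.submit(() -> "reply");
                return inner.get(); // analogous to PlainTransportFuture.txGet()
            });
            try {
                outer.get(500, TimeUnit.MILLISECONDS);
                return false; // completed: no stall
            } catch (TimeoutException e) {
                return true;  // stalled: every thread is blocked waiting
            } catch (ExecutionException e) {
                return false;
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In the real cluster the pool is the recovery threadpool on each node; with a high recovery concurrency, source and target can each have all recovery threads parked in txGet() waiting for work that must run on the other side's equally exhausted pool.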
- Completeness. Phase 1 guarantees that the existing historical data reaches the replica shard. After phase 1, the replica's engine is open and can serve index and delete requests, while the translog view covers the entirety of phase 1, so every index/delete executed during phase 1 is recorded. During phase 2's translog replay, the replica's normal index and delete traffic runs in parallel with the replay. Together this guarantees that both the data present before recovery began and the data written during recovery are written to the replica shard in full, ensuring completeness. As shown in the figure below:
- Consistency
Because the replica shard serves writes normally once phase 1 completes, and those writes run in parallel with phase 2's translog replay, a replayed translog operation that lags behind a normal write could land later and overwrite newer data, causing an inconsistency. To guarantee consistency, ES compares the version of each incoming write against the version of the Lucene document: if the incoming version is not newer, the operation is stale and is not written to Lucene. The relevant code:
final OpVsLuceneDocStatus opVsLucene = compareOpToLuceneDocBasedOnVersions(index);
if (opVsLucene == OpVsLuceneDocStatus.OP_STALE_OR_EQUAL) {
    plan = IndexingStrategy.skipAsStale(false, index.version());
}
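A minimal sketch of this version comparison follows. The class and method names here are hypothetical; only the status name `OP_STALE_OR_EQUAL` comes from the snippet above, and the rule it encodes is simply that an operation is applied only when its version is strictly greater than the version already indexed in Lucene:

```java
// Hypothetical sketch (not ES source) of the stale-operation check used
// to keep replayed translog operations from clobbering newer writes.
public class VersionConflictCheck {

    enum OpVsLuceneDocStatus { OP_NEWER, OP_STALE_OR_EQUAL }

    static OpVsLuceneDocStatus compare(long opVersion, long luceneDocVersion) {
        return opVersion > luceneDocVersion
                ? OpVsLuceneDocStatus.OP_NEWER
                : OpVsLuceneDocStatus.OP_STALE_OR_EQUAL;
    }

    // Returns true when the operation should actually be applied to Lucene.
    static boolean shouldApply(long opVersion, long luceneDocVersion) {
        return compare(opVersion, luceneDocVersion) == OpVsLuceneDocStatus.OP_NEWER;
    }
}
```

Under this rule the interleaving of replay and live writes becomes harmless: whichever copy of an operation arrives second is skipped as stale, so the replica converges to the same per-document versions as the primary.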
5. Summary
This article analyzed the replica shard recovery flow in detail against the ES source code and, building on that understanding, answered the questions posed at the start. More Elasticsearch articles are on the way; we look forward to your attention and feedback.