A Deep Dive into Flink's Operator Chain Mechanism

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“Why does the Web UI show only a single box for my Flink job, with both the Records Sent and Records Received metrics at 0? Is something wrong with my program?”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Introduction to Flink operator chains"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I often see questions like this in Flink community chat groups. In almost every case the program is fine, and the cause is Flink's operator chain mechanism: in the execution plan of the submitted job, the parallel instances (sub-tasks) of all operators met the chaining conditions and were fused into a single unit of execution, so no data traffic between operators can be observed. That is a special case, of course. More commonly only some of the operators are optimized by chaining, as in the figure below, which appears several times in the official documentation; note the Source and map() operators."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/72/720b4e0e2709778ba105ae601308d74d.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The benefit of operator chaining is clear: all sub-tasks that are chained together run in the same thread, as a single task inside a TaskManager slot, which reduces unnecessary data exchange, serialization, and context switching and so improves the job's execution efficiency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4b/4b9122f479e9ddeb5b03394bca2d367a.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With that background, let's walk through the source code to see the conditions under which operator chains form and how they are implemented in the Flink Runtime."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Operator chains in the logical plan"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Readers with some knowledge of the Flink Runtime will know that the execution plan of a Flink job is represented by three layers of graphs:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" StreamGraph: the original logical execution plan"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" JobGraph: the optimized logical execution plan (this is what the Web UI shows)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ExecutionGraph: 
the physical execution plan"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The operator chain is introduced while the logical plan is being optimized, that is, while the JobGraph is generated from the StreamGraph. So we turn to the class responsible for generating the JobGraph, o.a.f.streaming.api.graph.StreamingJobGraphGenerator, and look at the source of its core method createJobGraph()."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"private JobGraph createJobGraph() {\n    // make sure that all vertices start immediately\n    jobGraph.setScheduleMode(streamGraph.getScheduleMode());\n    // Generate deterministic hashes for the nodes in order to identify them across\n    // submission iff they didn't change.\n    Map<Integer, byte[]> hashes = defaultStreamGraphHasher.traverseStreamGraphAndGenerateHashes(streamGraph);\n    // Generate legacy version hashes for backwards compatibility\n    List<Map<Integer, byte[]>> legacyHashes = new ArrayList<>(legacyStreamGraphHashers.size());\n    for (StreamGraphHasher hasher : legacyStreamGraphHashers) {\n        legacyHashes.add(hasher.traverseStreamGraphAndGenerateHashes(streamGraph));\n    }\n    Map<Integer, List<Tuple2<byte[], byte[]>>> chainedOperatorHashes = new HashMap<>();\n    setChaining(hashes, legacyHashes, chainedOperatorHashes);\n\n    setPhysicalEdges();\n    // remainder omitted ...\n\n    return jobGraph;\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As shown, the method first computes a hash for each node in the StreamGraph to serve as its unique identifier, creates an empty Map to hold the hashes of the operators that are about to be chained together, and then calls setChaining(), shown below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"private void setChaining(Map<Integer, byte[]> hashes, List<Map<Integer, byte[]>> legacyHashes, Map<Integer, List<Tuple2<byte[], byte[]>>> chainedOperatorHashes) {\n    for (Integer sourceNodeId : 
streamGraph.getSourceIDs()) {\n        createChain(sourceNodeId, sourceNodeId, hashes, legacyHashes, 0, chainedOperatorHashes);\n    }\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So the method iterates over the Source nodes of the StreamGraph one by one and calls createChain() on each. createChain() is the core method that builds operator chains at the logical-plan level; its full source follows, and it is fairly long."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"private List<StreamEdge> createChain(\n        Integer startNodeId,\n        Integer currentNodeId,\n        Map<Integer, byte[]> hashes,\n        List<Map<Integer, byte[]>> legacyHashes,\n        int chainIndex,\n        Map<Integer, List<Tuple2<byte[], byte[]>>> chainedOperatorHashes) {\n    if (!builtVertices.contains(startNodeId)) {\n        List<StreamEdge> transitiveOutEdges = new ArrayList<StreamEdge>();\n        List<StreamEdge> chainableOutputs = new ArrayList<StreamEdge>();\n        List<StreamEdge> nonChainableOutputs = new ArrayList<StreamEdge>();\n\n        StreamNode currentNode = streamGraph.getStreamNode(currentNodeId);\n        for (StreamEdge outEdge : currentNode.getOutEdges()) {\n            if (isChainable(outEdge, streamGraph)) {\n                chainableOutputs.add(outEdge);\n            } else {\n                nonChainableOutputs.add(outEdge);\n            }\n        }\n\n        for (StreamEdge chainable : chainableOutputs) {\n            transitiveOutEdges.addAll(\n                createChain(startNodeId, chainable.getTargetId(), hashes, legacyHashes, chainIndex + 1, chainedOperatorHashes));\n        }\n\n        for (StreamEdge nonChainable : nonChainableOutputs) {\n            transitiveOutEdges.add(nonChainable);\n            createChain(nonChainable.getTargetId(), nonChainable.getTargetId(), hashes, legacyHashes, 0, chainedOperatorHashes);\n        }\n\n        List<Tuple2<byte[], byte[]>> operatorHashes =\n            chainedOperatorHashes.computeIfAbsent(startNodeId, k -> new ArrayList<>());\n\n        byte[] primaryHashBytes = hashes.get(currentNodeId);\n        OperatorID currentOperatorId = new OperatorID(primaryHashBytes);\n\n        for (Map<Integer, byte[]> legacyHash : legacyHashes) {\n            operatorHashes.add(new Tuple2<>(primaryHashBytes, legacyHash.get(currentNodeId)));\n        }\n\n        chainedNames.put(currentNodeId, 
createChainedName(currentNodeId, chainableOutputs));\n        chainedMinResources.put(currentNodeId, createChainedMinResources(currentNodeId, chainableOutputs));\n        chainedPreferredResources.put(currentNodeId, createChainedPreferredResources(currentNodeId, chainableOutputs));\n\n        if (currentNode.getInputFormat() != null) {\n            getOrCreateFormatContainer(startNodeId).addInputFormat(currentOperatorId, currentNode.getInputFormat());\n        }\n        if (currentNode.getOutputFormat() != null) {\n            getOrCreateFormatContainer(startNodeId).addOutputFormat(currentOperatorId, currentNode.getOutputFormat());\n        }\n\n        StreamConfig config = currentNodeId.equals(startNodeId)\n            ? createJobVertex(startNodeId, hashes, legacyHashes, chainedOperatorHashes)\n            : new StreamConfig(new Configuration());\n\n        setVertexConfig(currentNodeId, config, chainableOutputs, nonChainableOutputs);\n\n        if (currentNodeId.equals(startNodeId)) {\n            config.setChainStart();\n            config.setChainIndex(0);\n            config.setOperatorName(streamGraph.getStreamNode(currentNodeId).getOperatorName());\n            config.setOutEdgesInOrder(transitiveOutEdges);\n            config.setOutEdges(streamGraph.getStreamNode(currentNodeId).getOutEdges());\n            for (StreamEdge edge : transitiveOutEdges) {\n                connect(startNodeId, edge);\n            }\n            config.setTransitiveChainedTaskConfigs(chainedConfigs.get(startNodeId));\n        } else {\n            chainedConfigs.computeIfAbsent(startNodeId, k -> new HashMap<Integer, StreamConfig>());\n            config.setChainIndex(chainIndex);\n            StreamNode node = streamGraph.getStreamNode(currentNodeId);\n            config.setOperatorName(node.getOperatorName());\n            chainedConfigs.get(startNodeId).put(currentNodeId, config);\n        }\n\n        config.setOperatorID(currentOperatorId);\n        if (chainableOutputs.isEmpty()) {\n            config.setChainEnd();\n        }\n        return transitiveOutEdges;\n    } else {\n        return new ArrayList<>();\n    
}\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, an explanation of the three Lists created at the top of the method:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" transitiveOutEdges: the list of out-edges of the current operator chain in the JobGraph, and also the final return value of createChain();"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" chainableOutputs: the StreamGraph edges of the current node that can be chained;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" nonChainableOutputs: the StreamGraph edges of the current node that cannot be chained."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, starting from a Source, the method walks all out-edges of the current node in the StreamGraph and calls isChainable() to decide whether each edge can be chained (the logic of this check is covered later). Chainable out-edges go into the chainableOutputs list, and the rest go into nonChainableOutputs. For each edge in chainableOutputs, createChain() is called recursively, continuing from the edge's direct downstream node to extend the chain. For each edge in nonChainableOutputs, the current chain cannot extend any further, so createChain() is called recursively with that “break point” as the start node in an attempt to build a new chain. In other words, the whole chain-building process in the logical plan is recursive, which means the calls actually return starting from the Sink end. The method then checks whether the current node is the start node of the chain. If it is, createJobVertex() is called to create a JobVertex (a node in the JobGraph) for the chain, which produces the JobGraph view we see in the Web UI:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e6/e68f44b05c728c84a067425221bc196c.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, the chain information of each node is written into its own StreamConfig, and the start node of the chain additionally stores transitiveOutEdges. StreamConfig comes up again later, in the physical execution phase."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conditions for forming an operator chain"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's look at the code of the isChainable() method."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"public static boolean isChainable(StreamEdge edge, StreamGraph streamGraph) {\n    StreamNode upStreamVertex = streamGraph.getSourceVertex(edge);\n    StreamNode downStreamVertex = streamGraph.getTargetVertex(edge);\n\n    StreamOperatorFactory<?> headOperator = upStreamVertex.getOperatorFactory();\n    StreamOperatorFactory<?> outOperator = downStreamVertex.getOperatorFactory();\n\n    return downStreamVertex.getInEdges().size() == 1\n        && outOperator != null\n        && headOperator != null\n        && upStreamVertex.isSameSlotSharingGroup(downStreamVertex)\n        && outOperator.getChainingStrategy() == ChainingStrategy.ALWAYS\n        && (headOperator.getChainingStrategy() == ChainingStrategy.HEAD ||\n            headOperator.getChainingStrategy() == ChainingStrategy.ALWAYS)\n        && (edge.getPartitioner() instanceof ForwardPartitioner)\n        && 
edge.getShuffleMode() != ShuffleMode.BATCH\n        && upStreamVertex.getParallelism() == downStreamVertex.getParallelism()\n        && streamGraph.isChainingEnabled();\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From this we can see that the conditions under which an upstream and a downstream operator can be chained are quite strict (a familiar point). They are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The downstream operator has exactly one in-edge, and the operator factories on both sides are non-null;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The upstream and downstream operator instances are in the same SlotSharingGroup (more on this another time);"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The downstream operator's chaining strategy (ChainingStrategy) is ALWAYS, meaning it can chain with both its upstream and its downstream. Common operators such as map() and filter() fall into this category;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The upstream operator's chaining strategy is HEAD or ALWAYS. HEAD means the operator can only chain with its downstream, which is normally the preserve of Source operators;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The physical partitioning between the two operators is ForwardPartitioner; see my earlier article 《聊聊Flink DataStream 的八種物理分區邏輯》 (on the eight physical partitioning strategies of Flink DataStream);"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The shuffle mode between the two operators is not batch mode;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The upstream and downstream operators have the same parallelism;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
Operator chaining has not been disabled."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Disabling operator chains"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On an individual operator, users can call startNewChain() to force a new chain to begin at that operator, or call disableChaining() to keep the operator out of any chain. Both methods live in the SingleOutputStreamOperator class and work by changing the operator's chaining strategy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"@PublicEvolving\npublic SingleOutputStreamOperator<T> disableChaining() {\n    return setChainingStrategy(ChainingStrategy.NEVER);\n}\n\n@PublicEvolving\npublic SingleOutputStreamOperator<T> startNewChain() {\n    return setChainingStrategy(ChainingStrategy.HEAD);\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To disable operator chaining for the entire execution environment, call StreamExecutionEnvironment.disableOperatorChaining()."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Operator chains in the physical plan"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After the JobGraph has been converted into an ExecutionGraph and handed to the TaskManagers for execution, the basic unit of scheduling and execution, the StreamTask, is created to run the actual StreamOperator logic. In StreamTask.invoke(), after the state backend, checkpoint storage, and timer service have been initialized, we find:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"operatorChain = new OperatorChain<>(this, recordWriters);\nheadOperator = 
operatorChain.getHeadOperator();"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An OperatorChain instance is constructed here; this is the form an operator chain takes at actual execution time. The main fields of OperatorChain are explained below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"private final StreamOperator<?>[] allOperators;\nprivate final RecordWriterOutput<?>[] streamOutputs;\nprivate final WatermarkGaugeExposingOutput<StreamRecord<OUT>> chainEntryPoint;\nprivate final OP headOperator;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" headOperator: the first operator of the chain, corresponding to the chain's start node in the JobGraph;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" allOperators: all operators in the chain, in reverse order, so headOperator sits at the end of the array;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" streamOutputs: the outputs of the chain, of which there can be several;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" chainEntryPoint: the “entry point” of the chain; its meaning is explained later."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the code above shows, every StreamTask creates an OperatorChain. An operator that cannot join any chain still ends up in an OperatorChain of its own, containing only the headOperator. OperatorChain 
construction comes down to the following core code."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"for (int i = 0; i < outEdgesInOrder.size(); i++) {\n    StreamEdge outEdge = outEdgesInOrder.get(i);\n    RecordWriterOutput<?> streamOutput = createStreamOutput(\n        recordWriters.get(i),\n        outEdge,\n        chainedConfigs.get(outEdge.getSourceId()),\n        containingTask.getEnvironment());\n    this.streamOutputs[i] = streamOutput;\n    streamOutputMap.put(outEdge, streamOutput);\n}\n\n// we create the chain of operators and grab the collector that leads into the chain\nList<StreamOperator<?>> allOps = new ArrayList<>(chainedConfigs.size());\nthis.chainEntryPoint = createOutputCollector(\n    containingTask,\n    configuration,\n    chainedConfigs,\n    userCodeClassloader,\n    streamOutputMap,\n    allOps);\n\nif (operatorFactory != null) {\n    WatermarkGaugeExposingOutput<StreamRecord<OUT>> output = getChainEntryPoint();\n    headOperator = operatorFactory.createStreamOperator(containingTask, configuration, output);\n    headOperator.getMetricGroup().gauge(MetricNames.IO_CURRENT_OUTPUT_WATERMARK, output.getWatermarkGauge());\n} else {\n    headOperator = null;\n}\n\n// add head operator to end of chain\nallOps.add(headOperator);\nthis.allOperators = allOps.toArray(new StreamOperator<?>[allOps.size()]);"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, the constructor iterates over all out-edges of the chain as a whole and calls createStreamOutput() to create the corresponding downstream output, a RecordWriterOutput, for each one. It then calls createOutputCollector() to build the physical operator chain and return chainEntryPoint. This method is important; part of its code is shown below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"private <T> WatermarkGaugeExposingOutput<StreamRecord<T>> createOutputCollector(\n        StreamTask<?, ?> containingTask,\n        StreamConfig operatorConfig,\n        Map<Integer, StreamConfig> chainedConfigs,\n        ClassLoader userCodeClassloader,\n        
Map<StreamEdge, RecordWriterOutput<?>> streamOutputs,\n        List<StreamOperator<?>> allOperators) {\n    List<Tuple2<WatermarkGaugeExposingOutput<StreamRecord<T>>, StreamEdge>> allOutputs = new ArrayList<>(4);\n\n    // create collectors for the network outputs\n    for (StreamEdge outputEdge : operatorConfig.getNonChainedOutputs(userCodeClassloader)) {\n        @SuppressWarnings(\"unchecked\")\n        RecordWriterOutput<T> output = (RecordWriterOutput<T>) streamOutputs.get(outputEdge);\n        allOutputs.add(new Tuple2<>(output, outputEdge));\n    }\n\n    // Create collectors for the chained outputs\n    for (StreamEdge outputEdge : operatorConfig.getChainedOutputs(userCodeClassloader)) {\n        int outputId = outputEdge.getTargetId();\n        StreamConfig chainedOpConfig = chainedConfigs.get(outputId);\n        WatermarkGaugeExposingOutput<StreamRecord<T>> output = createChainedOperator(\n            containingTask,\n            chainedOpConfig,\n            chainedConfigs,\n            userCodeClassloader,\n            streamOutputs,\n            allOperators,\n            outputEdge.getOutputTag());\n        allOutputs.add(new Tuple2<>(output, outputEdge));\n    }\n    // remainder omitted ...\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The method reads the non-chained out-edges and the chained edges from the StreamConfig described in the logical-plan section and creates an Output for each. The Output of an out-edge is the RecordWriterOutput that sends data to the downstream outside the chain, whereas the Output of a chained edge is produced by createChainedOperator()."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"private <IN, OUT> WatermarkGaugeExposingOutput<StreamRecord<IN>> createChainedOperator(\n        StreamTask<?, ?> containingTask,\n        StreamConfig operatorConfig,\n        Map<Integer, StreamConfig> chainedConfigs,\n        ClassLoader userCodeClassloader,\n        Map<StreamEdge, RecordWriterOutput<?>> streamOutputs,\n        List<StreamOperator<?>> allOperators,\n        OutputTag<IN> outputTag) {\n    // create the output that the operator writes to first. 
this may recursively create more operators\n    WatermarkGaugeExposingOutput<StreamRecord<OUT>> chainedOperatorOutput = createOutputCollector(\n        containingTask,\n        operatorConfig,\n        chainedConfigs,\n        userCodeClassloader,\n        streamOutputs,\n        allOperators);\n\n    // now create the operator and give it the output collector to write its output to\n    StreamOperatorFactory<OUT> chainedOperatorFactory = operatorConfig.getStreamOperatorFactory(userCodeClassloader);\n    OneInputStreamOperator<IN, OUT> chainedOperator = chainedOperatorFactory.createStreamOperator(\n        containingTask, operatorConfig, chainedOperatorOutput);\n\n    allOperators.add(chainedOperator);\n\n    WatermarkGaugeExposingOutput<StreamRecord<IN>> currentOperatorOutput;\n    if (containingTask.getExecutionConfig().isObjectReuseEnabled()) {\n        currentOperatorOutput = new ChainingOutput<>(chainedOperator, this, outputTag);\n    }\n    else {\n        TypeSerializer<IN> inSerializer = operatorConfig.getTypeSerializerIn1(userCodeClassloader);\n        currentOperatorOutput = new CopyingChainingOutput<>(chainedOperator, inSerializer, outputTag, this);\n    }\n\n    // wrap watermark gauges since registered metrics must be unique\n    chainedOperator.getMetricGroup().gauge(MetricNames.IO_CURRENT_INPUT_WATERMARK, currentOperatorOutput.getWatermarkGauge()::getValue);\n    chainedOperator.getMetricGroup().gauge(MetricNames.IO_CURRENT_OUTPUT_WATERMARK, chainedOperatorOutput.getWatermarkGauge()::getValue);\n    return currentOperatorOutput;\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At a glance we can see that this method recursively calls createOutputCollector() from above. Much like the logical-plan phase, it produces the chainedOperators (the operators in the chain other than headOperator) by extending the Output step by step, and they are returned in reverse order, which is why the operators in the allOperators array are in reverse order."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the chainedOperators have been created, they are linked together through ChainingOutput instances, forming the structure shown in the figure below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3c/3cd92803db185884b01cb455295dabbe.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Image source: http://wuchong.me/blog/2016/05/09/flink-internals-understanding-execution-resources/"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, let's see how ChainingOutput.collect() emits the data stream."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"@Override\npublic void collect(StreamRecord<T> record) {\n    if (this.outputTag != null) {\n        // we are only responsible for emitting to the main input\n        return;\n    }\n    pushToOperator(record);\n}\n\n@Override\npublic <X> void collect(OutputTag<X> outputTag, StreamRecord<X> record) {\n    if (this.outputTag == null || !this.outputTag.equals(outputTag)) {\n        // we are only responsible for emitting to the side-output specified by our\n        // OutputTag.\n        return;\n    }\n    pushToOperator(record);\n}\n\nprotected <X> void pushToOperator(StreamRecord<X> record) {\n    try {\n        // we know that the given outputTag matches our OutputTag so the record\n        // must be of the type that our operator expects.\n        @SuppressWarnings(\"unchecked\")\n        StreamRecord<T> castRecord = (StreamRecord<T>) record;\n        numRecordsIn.inc();\n        operator.setKeyContextElement1(castRecord);\n        operator.processElement(castRecord);\n    }\n    catch (Exception e) {\n        throw new 
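ExceptionInChainedOperatorException(e);\n    }\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As shown, the record is pushed straight to the downstream operator by calling the chained operator's processElement() method. In other words, an OperatorChain can be treated as a single operator made up of headOperator and streamOutputs, with its internal chainedOperators and ChainingOutputs hidden as if inside a black box. Handing a record from one chained operator to the next costs only a direct method call, with no network transfer or thread switch (when object reuse is disabled, CopyingChainingOutput still copies each record through the serializer). Having traced the operator chain through the execution layer, the meaning of chainEntryPoint should now be clear. Because it sits at the end point of the recursive return, it is the first Output leading into the chain, namely the Output that headOperator writes to in the figure above."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reposted from Jianshu (簡書); author: LittleMagic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: https://www.jianshu.com/p/799744e347c7"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}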