Flink Job Execution Analysis
Every introduction to Flink job execution covers the pipeline below; in this post we get hands-on and trace how these transformations are actually carried out.
StreamGraph
"Class representing the streaming topology. It contains all the information necessary to build the jobgraph for the execution." This class represents the streaming topology and holds all the information needed to construct the JobGraph for execution.
JobGraph
The JobGraph represents a Flink dataflow program at the low level accepted by the JobManager. All programs coming from higher-level APIs are transformed into JobGraphs. Up to this point everything runs inside the client, and the execution plan can be inspected (e.g. via getExecutionPlan).
ExecutionGraph
The ExecutionGraph is the parallelized version of the JobGraph and the core data structure of the scheduling layer (Scheduler): the physical execution plan.
- StreamGraph: topology graph of stream-processing nodes.
- JobGraph: Flink's dataflow graph.
- JobGraph attributes:
  - Operator: an operator, understood as the function definition.
  - Transformation: a transformation, containing input, operator, and output; understood as one complete flow, the function at runtime.
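Before diving into the source, it helps to see the client-side plan directly. Below is a minimal sketch (the class name and input data are mine, not from the original post) that prints the StreamGraph as JSON via StreamExecutionEnvironment#getExecutionPlan:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExecutionPlanDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // a trivial pipeline: source -> map -> sink
        env.fromElements("to", "be", "or", "not", "to", "be")
           .map(String::toUpperCase)
           .print();

        // getExecutionPlan() serializes the client-side StreamGraph to JSON;
        // paste the output into https://flink.apache.org/visualizer/ to view it
        System.out.println(env.getExecutionPlan());
    }
}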
Example Program
org.apache.flink.streaming.examples.wordcount.WordCount under the flink-examples-streaming module:
public static void main(String[] args) throws Exception {
    // Checking input parameters
    final MultipleParameterTool params = MultipleParameterTool.fromArgs(args);

    // set up the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // make parameters available in the web interface
    env.getConfig().setGlobalJobParameters(params);

    // get input data
    DataStream<String> text = null;
    if (params.has("input")) {
        // union all the inputs from text files
        for (String input : params.getMultiParameterRequired("input")) {
            if (text == null) {
                text = env.readTextFile(input);
            } else {
                text = text.union(env.readTextFile(input));
            }
        }
        Preconditions.checkNotNull(text, "Input DataStream should not be null.");
    } else {
        System.out.println("Executing WordCount example with default input data set.");
        System.out.println("Use --input to specify file input.");
        // get default test text data
        text = env.fromElements(WordCountData.WORDS);
    }

    DataStream<Tuple2<String, Integer>> counts =
        // split up the lines in pairs (2-tuples) containing: (word,1)
        text.flatMap(new Tokenizer())
            // group by the tuple field "0" and sum up tuple field "1"
            .keyBy(0).sum(1);

    // emit result
    if (params.has("output")) {
        counts.writeAsText(params.get("output"));
    } else {
        System.out.println("Printing result to stdout. Use --output to specify output path.");
        counts.print();
    }

    // execute program
    env.execute("Streaming WordCount");
}
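For reference, the Tokenizer used above (not shown in the snippet) is the plain FlatMapFunction shipped with the Flink examples, sketched here (imports of org.apache.flink.api.common.functions.FlatMapFunction, org.apache.flink.api.java.tuple.Tuple2, and org.apache.flink.util.Collector assumed):

// Splits each line into lowercase words and emits one (word, 1) pair per word.
public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");

        // emit the (word, 1) pairs
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}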
Flink Client
The WordCount program:
text = env.fromElements(WordCountData.WORDS);

DataStream<Tuple2<String, Integer>> counts =
    // split up the lines in pairs (2-tuples) containing: (word,1)
    text.flatMap(new Tokenizer())
        // group by the tuple field "0" and sum up tuple field "1"
        .keyBy(0).sum(1);
Initialization
Initialization is the definition phase of the program: it mainly collects the metadata that defines the entire dataflow.
Source
FlatMap
- (1) First, DataStream#flatMap is called, which delegates to the transform method (DataStream#doTransform).
- (2) A Transformation is created (containing the input instance, the operator, and the output).
- (3) The resulting stream is created.
- (4) The operator is added to the current context:

  public void addOperator(Transformation<?> transformation) {
      Preconditions.checkNotNull(transformation, "transformation must not be null.");
      this.transformations.add(transformation);
  }

- (5) The resulting stream is returned. (The five steps are sketched together below.)
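Putting steps (1) through (5) together, here is an abridged paraphrase of DataStream#doTransform (simplified from the Flink source; annotations and generics plumbing elided, so it is not compilable standalone):

protected <R> SingleOutputStreamOperator<R> doTransform(
        String operatorName,
        TypeInformation<R> outTypeInfo,
        StreamOperatorFactory<R> operatorFactory) {

    // read the output type of the input to trigger MissingTypeInfo errors early
    transformation.getOutputType();

    // (2) create the Transformation: input + operator factory + output type
    OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
            this.transformation, operatorName, operatorFactory, outTypeInfo,
            environment.getParallelism());

    // (3) create the result stream wrapping the new transformation
    SingleOutputStreamOperator<R> returnStream =
            new SingleOutputStreamOperator<>(environment, resultTransform);

    // (4) register the transformation with the environment
    getExecutionEnvironment().addOperator(resultTransform);

    // (5) return the result stream
    return returnStream;
}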
keyBy
- Creates and returns a KeyedStream.
sum
sum still operates on the current DataStream.
Note that some transform operations do not generate a StreamNode; a PartitionTransformation, for example, produces a virtual node instead.
Calling transform works just like flatMap: the transformation registers itself in the transformations list.
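For example, keyBy wraps its input in a PartitionTransformation, which the generator handles roughly as follows (paraphrased from the Flink 1.10-era StreamGraphGenerator#transformPartition): no StreamNode is added; a virtual node records the partitioner and is later folded into the StreamEdge of the next physical node.

private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
    Transformation<T> input = partition.getInput();
    List<Integer> resultIds = new ArrayList<>();

    // transform the upstream node first
    Collection<Integer> transformedIds = transform(input);

    for (Integer transformedId : transformedIds) {
        // register a virtual node instead of a real StreamNode
        int virtualId = Transformation.getNewNodeId();
        streamGraph.addVirtualPartitionNode(
                transformedId, virtualId, partition.getPartitioner(), partition.getShuffleMode());
        resultIds.add(virtualId);
    }

    return resultIds;
}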
Final configuration overview
Execution and Submission
env.execute("Streaming WordCount");
===>
public JobExecutionResult execute(String jobName) throws Exception {
    Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");
    return execute(getStreamGraph(jobName));
}
Generating the StreamGraph (Pipeline)
public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
    StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
    if (clearTransformations) {
        this.transformations.clear();
    }
    return streamGraph;
}
Generating the full graph
The graph generation logic loops over every transformation, processing one node at a time.
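Abridged from StreamGraphGenerator#generate (configuration details elided): the generator simply iterates over the transformations collected during initialization and hands each one to transform():

public StreamGraph generate() {
    // a fresh StreamGraph; chaining, state backend, etc. are configured here (elided)
    streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
    alreadyTransformed = new HashMap<>();

    // loop over every registered transformation
    for (Transformation<?> transformation : transformations) {
        transform(transformation);
    }

    final StreamGraph builtStreamGraph = streamGraph;

    // reset the generator state before returning
    alreadyTransformed.clear();
    alreadyTransformed = null;
    streamGraph = null;

    return builtStreamGraph;
}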
Generate Transformation
Here the transformation's concrete type is checked and the corresponding handling logic is dispatched. In short, the core of the processing is to recursively add each node, together with its upstream nodes, to the graph:
private Collection<Integer> transform(Transformation<?> transform) {

    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }

    LOG.debug("Transforming " + transform);

    if (transform.getMaxParallelism() <= 0) {
        // if the max parallelism hasn't been set, then first use the job wide max parallelism
        // from the ExecutionConfig.
        int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
        if (globalMaxParallelismFromConfig > 0) {
            transform.setMaxParallelism(globalMaxParallelismFromConfig);
        }
    }

    // call at least once to trigger exceptions about MissingTypeInfo
    transform.getOutputType();

    Collection<Integer> transformedIds;
    if (transform instanceof OneInputTransformation<?, ?>) {
        transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
    } else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
        transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
    } else if (transform instanceof SourceTransformation<?>) {
        // .......... handling of the remaining transformation types omitted
    }

    // need this check because the iterate transformation adds itself before
    // transforming the feedback edges
    if (!alreadyTransformed.containsKey(transform)) {
        alreadyTransformed.put(transform, transformedIds);
    }

    if (transform.getBufferTimeout() >= 0) {
        streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());
    } else {
        streamGraph.setBufferTimeout(transform.getId(), defaultBufferTimeout);
    }

    if (transform.getUid() != null) {
        streamGraph.setTransformationUID(transform.getId(), transform.getUid());
    }
    if (transform.getUserProvidedNodeHash() != null) {
        streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());
    }

    if (!streamGraph.getExecutionConfig().hasAutoGeneratedUIDsEnabled()) {
        if (transform instanceof PhysicalTransformation &&
                transform.getUserProvidedNodeHash() == null &&
                transform.getUid() == null) {
            throw new IllegalStateException("Auto generated UIDs have been disabled " +
                "but no UID or hash has been assigned to operator " + transform.getName());
        }
    }

    if (transform.getMinResources() != null && transform.getPreferredResources() != null) {
        streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());
    }

    streamGraph.setManagedMemoryWeight(transform.getId(), transform.getManagedMemoryWeight());

    return transformedIds;
}
Creating the node and adding it to the graph
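Taking transformOneInputTransform as the representative case, here is an abridged paraphrase (slot-sharing, state-key, and parallelism wiring mostly elided): the upstream transformation is recursed into first, then the operator is added as a StreamNode and connected with StreamEdges.

private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {

    // recursively transform the upstream transformation first
    Collection<Integer> inputIds = transform(transform.getInput());

    // the recursive call above might already have transformed this one
    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }

    // add this operator to the StreamGraph as a StreamNode
    streamGraph.addOperator(
            transform.getId(),
            determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds),
            transform.getCoLocationGroupKey(),
            transform.getOperatorFactory(),
            transform.getInputType(),
            transform.getOutputType(),
            transform.getName());

    // connect it to every upstream node with a StreamEdge
    for (Integer inputId : inputIds) {
        streamGraph.addEdge(inputId, transform.getId(), 0);
    }

    return Collections.singleton(transform.getId());
}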
Generated result
Generating the JobGraph
Execution
PipelineExecutor
execute
Generating the jobGraph
- Use the PipelineTranslator to generate the JobGraph.
JobGraph generation logic
Generated result
As shown above, the keyed aggregation and sink have been merged into a single chain.
The JobGraph object structure is shown in the figure above: taskVertices contains only three TaskVertex entries, because the Sink operator has been chained into the Keyed operator.
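The chaining decision is made per StreamEdge. A paraphrase of the isChainable check in StreamingJobGraphGenerator (condition list abridged; the operator ChainingStrategy checks are elided) shows why the forward-connected sink can be merged while the hash-partitioned edge in front of the keyed aggregation cannot:

public static boolean isChainable(StreamEdge edge, StreamGraph streamGraph) {
    StreamNode upStreamVertex = streamGraph.getSourceVertex(edge);
    StreamNode downStreamVertex = streamGraph.getTargetVertex(edge);

    return downStreamVertex.getInEdges().size() == 1                 // single input
            && upStreamVertex.isSameSlotSharingGroup(downStreamVertex)
            // both operators' ChainingStrategy must also permit chaining (elided)
            && (edge.getPartitioner() instanceof ForwardPartitioner)  // no shuffle
            && upStreamVertex.getParallelism() == downStreamVertex.getParallelism()
            && streamGraph.isChainingEnabled();
}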
Submitting the Job
This step effectively submits the local job to the cluster; it boils down to uploading the job metadata. The underlying RestClient is built on Netty.
ClientUtils#submitJob
public static JobExecutionResult submitJob(
        ClusterClient<?> client,
        JobGraph jobGraph) throws ProgramInvocationException {
    checkNotNull(client);
    checkNotNull(jobGraph);
    try {
        return client
            .submitJob(jobGraph)
            .thenApply(DetachedJobExecutionResult::new)
            .get();
    } catch (InterruptedException | ExecutionException e) {
        ExceptionUtils.checkInterrupted(e);
        throw new ProgramInvocationException("Could not run job in detached mode.", jobGraph.getJobID(), e);
    }
}
Submitting to the cluster via RestClusterClient:
public CompletableFuture<JobID> submitJob(@Nonnull JobGraph jobGraph) {
    // serialize the JobGraph into a binary file for upload
    CompletableFuture<java.nio.file.Path> jobGraphFileFuture = CompletableFuture.supplyAsync(() -> {
        try {
            final java.nio.file.Path jobGraphFile = Files.createTempFile("flink-jobgraph", ".bin");
            try (ObjectOutputStream objectOut = new ObjectOutputStream(Files.newOutputStream(jobGraphFile))) {
                objectOut.writeObject(jobGraph);
            }
            return jobGraphFile;
        } catch (IOException e) {
            throw new CompletionException(new FlinkException("Failed to serialize JobGraph.", e));
        }
    }, executorService);

    // collect all files that need to be uploaded
    CompletableFuture<Tuple2<JobSubmitRequestBody, Collection<FileUpload>>> requestFuture = jobGraphFileFuture.thenApply(jobGraphFile -> {
        List<String> jarFileNames = new ArrayList<>(8);
        List<JobSubmitRequestBody.DistributedCacheFile> artifactFileNames = new ArrayList<>(8);
        Collection<FileUpload> filesToUpload = new ArrayList<>(8);

        filesToUpload.add(new FileUpload(jobGraphFile, RestConstants.CONTENT_TYPE_BINARY));

        for (Path jar : jobGraph.getUserJars()) {
            jarFileNames.add(jar.getName());
            filesToUpload.add(new FileUpload(Paths.get(jar.toUri()), RestConstants.CONTENT_TYPE_JAR));
        }

        for (Map.Entry<String, DistributedCache.DistributedCacheEntry> artifacts : jobGraph.getUserArtifacts().entrySet()) {
            final Path artifactFilePath = new Path(artifacts.getValue().filePath);
            try {
                // Only local artifacts need to be uploaded.
                if (!artifactFilePath.getFileSystem().isDistributedFS()) {
                    artifactFileNames.add(new JobSubmitRequestBody.DistributedCacheFile(artifacts.getKey(), artifactFilePath.getName()));
                    filesToUpload.add(new FileUpload(Paths.get(artifacts.getValue().filePath), RestConstants.CONTENT_TYPE_BINARY));
                }
            } catch (IOException e) {
                throw new CompletionException(
                    new FlinkException("Failed to get the FileSystem of artifact " + artifactFilePath + ".", e));
            }
        }

        final JobSubmitRequestBody requestBody = new JobSubmitRequestBody(
            jobGraphFile.getFileName().toString(),
            jarFileNames,
            artifactFileNames);

        return Tuple2.of(requestBody, Collections.unmodifiableCollection(filesToUpload));
    });

    // upload the jars and submit the job
    final CompletableFuture<JobSubmitResponseBody> submissionFuture = requestFuture.thenCompose(
        requestAndFileUploads -> sendRetriableRequest(
            JobSubmitHeaders.getInstance(),
            EmptyMessageParameters.getInstance(),
            requestAndFileUploads.f0,
            requestAndFileUploads.f1,
            isConnectionProblemOrServiceUnavailable())
    );

    // delete the generated jobGraph file
    submissionFuture
        .thenCombine(jobGraphFileFuture, (ignored, jobGraphFile) -> jobGraphFile)
        .thenAccept(jobGraphFile -> {
            try {
                Files.delete(jobGraphFile);
            } catch (IOException e) {
                LOG.warn("Could not delete temporary file {}.", jobGraphFile, e);
            }
        });

    // return the jobId
    return submissionFuture
        .thenApply(ignore -> jobGraph.getJobID())
        .exceptionally(
            (Throwable throwable) -> {
                throw new CompletionException(new JobSubmissionException(jobGraph.getJobID(), "Failed to submit JobGraph.", ExceptionUtils.stripCompletionException(throwable)));
            });
}

// send the request to the cluster (with retries)
private <M extends MessageHeaders<R, P, U>, U extends MessageParameters, R extends RequestBody, P extends ResponseBody> CompletableFuture<P>
        sendRetriableRequest(M messageHeaders, U messageParameters, R request, Collection<FileUpload> filesToUpload, Predicate<Throwable> retryPredicate) {
    return retry(() -> getWebMonitorBaseUrl().thenCompose(webMonitorBaseUrl -> {
        try {
            return restClient.sendRequest(webMonitorBaseUrl.getHost(), webMonitorBaseUrl.getPort(), messageHeaders, messageParameters, request, filesToUpload);
        } catch (IOException e) {
            throw new CompletionException(e);
        }
    }), retryPredicate);
}
Flink Cluster Server
The Dispatcher receives the request
Internal execution
Persisting and starting the job
Persisting the job
Creating the JobManagerRunner (JobManagerRunnerImpl)
Attributes
Translating the JobGraph => ExecutionGraph
ExecutionGraphBuilder#buildGraph logic
- Create a new execution graph, if none exists so far.
- Set the basic properties.
- Initialize the vertices that have a master initialization hook; file output formats create directories here, input formats create splits.
- Topologically sort the job vertices and attach the graph to the existing one:
  - The JobVertex instances in the JobGraph are sorted starting from the Source nodes.
  - In executionGraph.attachJobGraph(sortedTopology), an ExecutionJobVertex is generated for each JobVertex (see the sketch below). In the ExecutionJobVertex constructor, an IntermediateResult is built from each IntermediateDataSet of the JobVertex, and one ExecutionVertex is built per parallel subtask; each ExecutionVertex in turn builds its IntermediateResultPartitions (one IntermediateResultPartition per IntermediateResult). The newly created ExecutionJobVertex is then connected to the upstream IntermediateResults.
  - ExecutionEdges are built and connected to the upstream IntermediateResultPartitions, finally taking the ExecutionGraph down to the physical execution plan.
- Configure the state checkpointing.
- Create all the metrics for the ExecutionGraph.
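As referenced in the list above, here is a heavily abridged paraphrase of ExecutionGraph#attachJobGraph (bookkeeping and validity checks elided), showing where the per-vertex construction and the upstream wiring happen:

public void attachJobGraph(List<JobVertex> topologicallySorted) throws JobException {
    for (JobVertex jobVertex : topologicallySorted) {

        // the constructor builds the ExecutionVertex instances (one per parallel
        // subtask) and the IntermediateResult/IntermediateResultPartition objects
        ExecutionJobVertex ejv = new ExecutionJobVertex(
                this, jobVertex, 1, rpcTimeout, globalModVersion, createTimestamp);

        // wires ExecutionEdges back to the upstream IntermediateResultPartitions
        ejv.connectToPredecessors(this.intermediateResults);

        this.verticesInCreationOrder.add(ejv);
        this.numVerticesTotal += ejv.getParallelism();
    }
}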
Actually starting the job
Executing the job
JobMaster
Scheduler execution
resetAndStartScheduler =>
Summary
(Figure: images/image-20200621172505926.png, overview of the StreamGraph => JobGraph => ExecutionGraph transformation)
- Layer 1: the StreamGraph starts from the Source node; every transform generates a StreamNode, and two StreamNodes are connected by a StreamEdge, together forming a DAG of StreamNodes and StreamEdges.
- Layer 2: the JobGraph, again starting from the Source node, traverses the graph looking for operators that can be chained together. Chainable operators are merged; operators that cannot be chained each get their own JobVertex. Upstream and downstream JobVertex instances are linked by JobEdges, finally forming a DAG at the JobVertex level.
- After the JobVertex DAG is submitted, the vertices are sorted starting from the Source node; an ExecutionJobVertex is generated per JobVertex, an IntermediateResult is built from each IntermediateDataSet of the JobVertex, and the IntermediateResults establish the upstream/downstream dependencies, forming the DAG at the ExecutionJobVertex level, i.e. the ExecutionGraph.
- Finally, the ExecutionGraph layer is mapped down to the physical execution layer.
References
- Flink 作業執行深度解析 (In-Depth Analysis of Flink Job Execution): https://ververica.cn/developers/advanced-tutorial-2-flink-job-execution-depth-analysis/