Flink Job Execution Analysis
Every introduction to Flink job execution covers the pipeline below; in this post we get hands-on and trace how these transformations are actually carried out.
StreamGraph
"Class representing the streaming topology. It contains all the information necessary to build the jobgraph for the execution." This class represents the streaming topology and holds all the information needed to construct the JobGraph for execution.
JobGraph
The JobGraph represents a Flink dataflow program at the low level accepted by the JobManager. All programs coming from higher-level APIs are transformed into JobGraphs. Up to this point everything runs inside the client, and the execution plan can be inspected (e.g. via getExecutionPlan).
ExecutionGraph
The ExecutionGraph is the parallelized version of the JobGraph and the core data structure of the scheduling layer (Scheduler): the physical execution plan.
- StreamGraph: topology graph of stream-processing nodes.
- JobGraph: Flink's dataflow graph.
- JobGraph attributes:
  - Operator: an operator, understood as the function definition.
  - Transformation: a transformation, containing input, operator, and output; understood as one complete flow, the function at runtime.
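Before diving into the source, it helps to see the client-side plan directly. Below is a minimal sketch (the class name and input data are mine, not from the original post) that prints the StreamGraph as JSON via StreamExecutionEnvironment#getExecutionPlan:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExecutionPlanDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // a trivial pipeline: source -> map -> sink
        env.fromElements("to", "be", "or", "not", "to", "be")
           .map(String::toUpperCase)
           .print();

        // getExecutionPlan() serializes the client-side StreamGraph to JSON;
        // paste the output into https://flink.apache.org/visualizer/ to view it
        System.out.println(env.getExecutionPlan());
    }
}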
Example Program
org.apache.flink.streaming.examples.wordcount.WordCount under the flink-examples-streaming module:
public static void main(String[] args) throws Exception {
    // Checking input parameters
    final MultipleParameterTool params = MultipleParameterTool.fromArgs(args);

    // set up the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // make parameters available in the web interface
    env.getConfig().setGlobalJobParameters(params);

    // get input data
    DataStream<String> text = null;
    if (params.has("input")) {
        // union all the inputs from text files
        for (String input : params.getMultiParameterRequired("input")) {
            if (text == null) {
                text = env.readTextFile(input);
            } else {
                text = text.union(env.readTextFile(input));
            }
        }
        Preconditions.checkNotNull(text, "Input DataStream should not be null.");
    } else {
        System.out.println("Executing WordCount example with default input data set.");
        System.out.println("Use --input to specify file input.");
        // get default test text data
        text = env.fromElements(WordCountData.WORDS);
    }

    DataStream<Tuple2<String, Integer>> counts =
        // split up the lines in pairs (2-tuples) containing: (word,1)
        text.flatMap(new Tokenizer())
            // group by the tuple field "0" and sum up tuple field "1"
            .keyBy(0).sum(1);

    // emit result
    if (params.has("output")) {
        counts.writeAsText(params.get("output"));
    } else {
        System.out.println("Printing result to stdout. Use --output to specify output path.");
        counts.print();
    }

    // execute program
    env.execute("Streaming WordCount");
}
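For reference, the Tokenizer used above (not shown in the snippet) is the plain FlatMapFunction shipped with the Flink examples, sketched here (imports of org.apache.flink.api.common.functions.FlatMapFunction, org.apache.flink.api.java.tuple.Tuple2, and org.apache.flink.util.Collector assumed):

// Splits each line into lowercase words and emits one (word, 1) pair per word.
public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");

        // emit the (word, 1) pairs
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}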
Flink Client
The WordCount program:
text = env.fromElements(WordCountData.WORDS);

DataStream<Tuple2<String, Integer>> counts =
    // split up the lines in pairs (2-tuples) containing: (word,1)
    text.flatMap(new Tokenizer())
        // group by the tuple field "0" and sum up tuple field "1"
        .keyBy(0).sum(1);
Initialization
Initialization is the definition phase of the program: it mainly collects the metadata that defines the entire dataflow.
Source
FlatMap
- (1) First, DataStream#flatMap is called, which delegates to the transform method (DataStream#doTransform).
- (2) A Transformation is created (containing the input instance, the operator, and the output).
- (3) The resulting stream is created.
- (4) The operator is added to the current context:

  public void addOperator(Transformation<?> transformation) {
      Preconditions.checkNotNull(transformation, "transformation must not be null.");
      this.transformations.add(transformation);
  }

- (5) The resulting stream is returned. (The five steps are sketched together below.)
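Putting steps (1) through (5) together, here is an abridged paraphrase of DataStream#doTransform (simplified from the Flink source; annotations and generics plumbing elided, so it is not compilable standalone):

protected <R> SingleOutputStreamOperator<R> doTransform(
        String operatorName,
        TypeInformation<R> outTypeInfo,
        StreamOperatorFactory<R> operatorFactory) {

    // read the output type of the input to trigger MissingTypeInfo errors early
    transformation.getOutputType();

    // (2) create the Transformation: input + operator factory + output type
    OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
            this.transformation, operatorName, operatorFactory, outTypeInfo,
            environment.getParallelism());

    // (3) create the result stream wrapping the new transformation
    SingleOutputStreamOperator<R> returnStream =
            new SingleOutputStreamOperator<>(environment, resultTransform);

    // (4) register the transformation with the environment
    getExecutionEnvironment().addOperator(resultTransform);

    // (5) return the result stream
    return returnStream;
}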
keyBy
- Creates and returns a KeyedStream.
sum
sum still operates on the current DataStream.
Note that some transform operations do not generate a StreamNode; a PartitionTransformation, for example, produces a virtual node instead.
Calling transform works just like flatMap: the transformation registers itself in the transformations list.
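For example, keyBy wraps its input in a PartitionTransformation, which the generator handles roughly as follows (paraphrased from the Flink 1.10-era StreamGraphGenerator#transformPartition): no StreamNode is added; a virtual node records the partitioner and is later folded into the StreamEdge of the next physical node.

private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
    Transformation<T> input = partition.getInput();
    List<Integer> resultIds = new ArrayList<>();

    // transform the upstream node first
    Collection<Integer> transformedIds = transform(input);

    for (Integer transformedId : transformedIds) {
        // register a virtual node instead of a real StreamNode
        int virtualId = Transformation.getNewNodeId();
        streamGraph.addVirtualPartitionNode(
                transformedId, virtualId, partition.getPartitioner(), partition.getShuffleMode());
        resultIds.add(virtualId);
    }

    return resultIds;
}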
Final configuration overview
Execution and Submission
env.execute("Streaming WordCount");
===>
public JobExecutionResult execute(String jobName) throws Exception {
    Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");
    return execute(getStreamGraph(jobName));
}
Generating the StreamGraph (Pipeline)
public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
    StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
    if (clearTransformations) {
        this.transformations.clear();
    }
    return streamGraph;
}
Generating the full graph
The graph generation logic loops over every transformation, processing one node at a time.
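Abridged from StreamGraphGenerator#generate (configuration details elided): the generator simply iterates over the transformations collected during initialization and hands each one to transform():

public StreamGraph generate() {
    // a fresh StreamGraph; chaining, state backend, etc. are configured here (elided)
    streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
    alreadyTransformed = new HashMap<>();

    // loop over every registered transformation
    for (Transformation<?> transformation : transformations) {
        transform(transformation);
    }

    final StreamGraph builtStreamGraph = streamGraph;

    // reset the generator state before returning
    alreadyTransformed.clear();
    alreadyTransformed = null;
    streamGraph = null;

    return builtStreamGraph;
}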
Generate Transformation
Here the transformation's concrete type is checked and the corresponding handling logic is dispatched. In short, the core of the processing is to recursively add each node, together with its upstream nodes, to the graph:
private Collection<Integer> transform(Transformation<?> transform) {

    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }

    LOG.debug("Transforming " + transform);

    if (transform.getMaxParallelism() <= 0) {
        // if the max parallelism hasn't been set, then first use the job wide max parallelism
        // from the ExecutionConfig.
        int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
        if (globalMaxParallelismFromConfig > 0) {
            transform.setMaxParallelism(globalMaxParallelismFromConfig);
        }
    }

    // call at least once to trigger exceptions about MissingTypeInfo
    transform.getOutputType();

    Collection<Integer> transformedIds;
    if (transform instanceof OneInputTransformation<?, ?>) {
        transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
    } else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
        transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
    } else if (transform instanceof SourceTransformation<?>) {
        // .......... handling of the remaining transformation types omitted
    }

    // need this check because the iterate transformation adds itself before
    // transforming the feedback edges
    if (!alreadyTransformed.containsKey(transform)) {
        alreadyTransformed.put(transform, transformedIds);
    }

    if (transform.getBufferTimeout() >= 0) {
        streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());
    } else {
        streamGraph.setBufferTimeout(transform.getId(), defaultBufferTimeout);
    }

    if (transform.getUid() != null) {
        streamGraph.setTransformationUID(transform.getId(), transform.getUid());
    }
    if (transform.getUserProvidedNodeHash() != null) {
        streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());
    }

    if (!streamGraph.getExecutionConfig().hasAutoGeneratedUIDsEnabled()) {
        if (transform instanceof PhysicalTransformation &&
                transform.getUserProvidedNodeHash() == null &&
                transform.getUid() == null) {
            throw new IllegalStateException("Auto generated UIDs have been disabled " +
                "but no UID or hash has been assigned to operator " + transform.getName());
        }
    }

    if (transform.getMinResources() != null && transform.getPreferredResources() != null) {
        streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());
    }

    streamGraph.setManagedMemoryWeight(transform.getId(), transform.getManagedMemoryWeight());

    return transformedIds;
}
Creating the node and adding it to the graph
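Taking transformOneInputTransform as the representative case, here is an abridged paraphrase (slot-sharing, state-key, and parallelism wiring mostly elided): the upstream transformation is recursed into first, then the operator is added as a StreamNode and connected with StreamEdges.

private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {

    // recursively transform the upstream transformation first
    Collection<Integer> inputIds = transform(transform.getInput());

    // the recursive call above might already have transformed this one
    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }

    // add this operator to the StreamGraph as a StreamNode
    streamGraph.addOperator(
            transform.getId(),
            determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds),
            transform.getCoLocationGroupKey(),
            transform.getOperatorFactory(),
            transform.getInputType(),
            transform.getOutputType(),
            transform.getName());

    // connect it to every upstream node with a StreamEdge
    for (Integer inputId : inputIds) {
        streamGraph.addEdge(inputId, transform.getId(), 0);
    }

    return Collections.singleton(transform.getId());
}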
Generated result
Generating the JobGraph
Execution
PipelineExecutor
execute
Generating the jobGraph
- Use the PipelineTranslator to generate the JobGraph.
JobGraph generation logic
Generated result
As shown above, the keyed aggregation and sink have been merged into a single chain.
The JobGraph object structure is shown in the figure above: taskVertices contains only three TaskVertex entries, because the Sink operator has been chained into the Keyed operator.
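The chaining decision is made per StreamEdge. A paraphrase of the isChainable check in StreamingJobGraphGenerator (condition list abridged; the operator ChainingStrategy checks are elided) shows why the forward-connected sink can be merged while the hash-partitioned edge in front of the keyed aggregation cannot:

public static boolean isChainable(StreamEdge edge, StreamGraph streamGraph) {
    StreamNode upStreamVertex = streamGraph.getSourceVertex(edge);
    StreamNode downStreamVertex = streamGraph.getTargetVertex(edge);

    return downStreamVertex.getInEdges().size() == 1                 // single input
            && upStreamVertex.isSameSlotSharingGroup(downStreamVertex)
            // both operators' ChainingStrategy must also permit chaining (elided)
            && (edge.getPartitioner() instanceof ForwardPartitioner)  // no shuffle
            && upStreamVertex.getParallelism() == downStreamVertex.getParallelism()
            && streamGraph.isChainingEnabled();
}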
Submitting the Job
This step effectively submits the local job to the cluster; it boils down to uploading the job metadata. The underlying RestClient is built on Netty.
ClientUtils#submitJob
public static JobExecutionResult submitJob(
        ClusterClient<?> client,
        JobGraph jobGraph) throws ProgramInvocationException {
    checkNotNull(client);
    checkNotNull(jobGraph);
    try {
        return client
            .submitJob(jobGraph)
            .thenApply(DetachedJobExecutionResult::new)
            .get();
    } catch (InterruptedException | ExecutionException e) {
        ExceptionUtils.checkInterrupted(e);
        throw new ProgramInvocationException("Could not run job in detached mode.", jobGraph.getJobID(), e);
    }
}
Submitting to the cluster via RestClusterClient:
public CompletableFuture<JobID> submitJob(@Nonnull JobGraph jobGraph) {
    // serialize the JobGraph into a binary file for upload
    CompletableFuture<java.nio.file.Path> jobGraphFileFuture = CompletableFuture.supplyAsync(() -> {
        try {
            final java.nio.file.Path jobGraphFile = Files.createTempFile("flink-jobgraph", ".bin");
            try (ObjectOutputStream objectOut = new ObjectOutputStream(Files.newOutputStream(jobGraphFile))) {
                objectOut.writeObject(jobGraph);
            }
            return jobGraphFile;
        } catch (IOException e) {
            throw new CompletionException(new FlinkException("Failed to serialize JobGraph.", e));
        }
    }, executorService);

    // collect all files that need to be uploaded
    CompletableFuture<Tuple2<JobSubmitRequestBody, Collection<FileUpload>>> requestFuture = jobGraphFileFuture.thenApply(jobGraphFile -> {
        List<String> jarFileNames = new ArrayList<>(8);
        List<JobSubmitRequestBody.DistributedCacheFile> artifactFileNames = new ArrayList<>(8);
        Collection<FileUpload> filesToUpload = new ArrayList<>(8);

        filesToUpload.add(new FileUpload(jobGraphFile, RestConstants.CONTENT_TYPE_BINARY));

        for (Path jar : jobGraph.getUserJars()) {
            jarFileNames.add(jar.getName());
            filesToUpload.add(new FileUpload(Paths.get(jar.toUri()), RestConstants.CONTENT_TYPE_JAR));
        }

        for (Map.Entry<String, DistributedCache.DistributedCacheEntry> artifacts : jobGraph.getUserArtifacts().entrySet()) {
            final Path artifactFilePath = new Path(artifacts.getValue().filePath);
            try {
                // Only local artifacts need to be uploaded.
                if (!artifactFilePath.getFileSystem().isDistributedFS()) {
                    artifactFileNames.add(new JobSubmitRequestBody.DistributedCacheFile(artifacts.getKey(), artifactFilePath.getName()));
                    filesToUpload.add(new FileUpload(Paths.get(artifacts.getValue().filePath), RestConstants.CONTENT_TYPE_BINARY));
                }
            } catch (IOException e) {
                throw new CompletionException(
                    new FlinkException("Failed to get the FileSystem of artifact " + artifactFilePath + ".", e));
            }
        }

        final JobSubmitRequestBody requestBody = new JobSubmitRequestBody(
            jobGraphFile.getFileName().toString(),
            jarFileNames,
            artifactFileNames);

        return Tuple2.of(requestBody, Collections.unmodifiableCollection(filesToUpload));
    });

    // upload the jars and submit the job
    final CompletableFuture<JobSubmitResponseBody> submissionFuture = requestFuture.thenCompose(
        requestAndFileUploads -> sendRetriableRequest(
            JobSubmitHeaders.getInstance(),
            EmptyMessageParameters.getInstance(),
            requestAndFileUploads.f0,
            requestAndFileUploads.f1,
            isConnectionProblemOrServiceUnavailable())
    );

    // delete the generated jobGraph file
    submissionFuture
        .thenCombine(jobGraphFileFuture, (ignored, jobGraphFile) -> jobGraphFile)
        .thenAccept(jobGraphFile -> {
            try {
                Files.delete(jobGraphFile);
            } catch (IOException e) {
                LOG.warn("Could not delete temporary file {}.", jobGraphFile, e);
            }
        });

    // return the jobId
    return submissionFuture
        .thenApply(ignore -> jobGraph.getJobID())
        .exceptionally(
            (Throwable throwable) -> {
                throw new CompletionException(new JobSubmissionException(jobGraph.getJobID(), "Failed to submit JobGraph.", ExceptionUtils.stripCompletionException(throwable)));
            });
}

// send the request to the cluster (with retries)
private <M extends MessageHeaders<R, P, U>, U extends MessageParameters, R extends RequestBody, P extends ResponseBody> CompletableFuture<P>
        sendRetriableRequest(M messageHeaders, U messageParameters, R request, Collection<FileUpload> filesToUpload, Predicate<Throwable> retryPredicate) {
    return retry(() -> getWebMonitorBaseUrl().thenCompose(webMonitorBaseUrl -> {
        try {
            return restClient.sendRequest(webMonitorBaseUrl.getHost(), webMonitorBaseUrl.getPort(), messageHeaders, messageParameters, request, filesToUpload);
        } catch (IOException e) {
            throw new CompletionException(e);
        }
    }), retryPredicate);
}
Flink Cluster Server
The Dispatcher receives the request
Internal execution
Persisting and starting the job
Persisting the job
Creating the JobManagerRunner (JobManagerRunnerImpl)
Attributes
Translating the JobGraph => ExecutionGraph
ExecutionGraphBuilder#buildGraph logic
- Create a new execution graph, if none exists so far.
- Set the basic properties.
- Initialize the vertices that have a master initialization hook; file output formats create directories here, input formats create splits.
- Topologically sort the job vertices and attach the graph to the existing one:
  - The JobVertex instances in the JobGraph are sorted starting from the Source nodes.
  - In executionGraph.attachJobGraph(sortedTopology), an ExecutionJobVertex is generated for each JobVertex (see the sketch below). In the ExecutionJobVertex constructor, an IntermediateResult is built from each IntermediateDataSet of the JobVertex, and one ExecutionVertex is built per parallel subtask; each ExecutionVertex in turn builds its IntermediateResultPartitions (one IntermediateResultPartition per IntermediateResult). The newly created ExecutionJobVertex is then connected to the upstream IntermediateResults.
  - ExecutionEdges are built and connected to the upstream IntermediateResultPartitions, finally taking the ExecutionGraph down to the physical execution plan.
- Configure the state checkpointing.
- Create all the metrics for the ExecutionGraph.
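As referenced in the list above, here is a heavily abridged paraphrase of ExecutionGraph#attachJobGraph (bookkeeping and validity checks elided), showing where the per-vertex construction and the upstream wiring happen:

public void attachJobGraph(List<JobVertex> topologicallySorted) throws JobException {
    for (JobVertex jobVertex : topologicallySorted) {

        // the constructor builds the ExecutionVertex instances (one per parallel
        // subtask) and the IntermediateResult/IntermediateResultPartition objects
        ExecutionJobVertex ejv = new ExecutionJobVertex(
                this, jobVertex, 1, rpcTimeout, globalModVersion, createTimestamp);

        // wires ExecutionEdges back to the upstream IntermediateResultPartitions
        ejv.connectToPredecessors(this.intermediateResults);

        this.verticesInCreationOrder.add(ejv);
        this.numVerticesTotal += ejv.getParallelism();
    }
}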
Actually starting the job
Executing the job
JobMaster
Scheduler execution
resetAndStartScheduler =>
Summary
(Figure: images/image-20200621172505926.png, overview of the StreamGraph => JobGraph => ExecutionGraph transformation)
- Layer 1: the StreamGraph starts from the Source node; every transform generates a StreamNode, and two StreamNodes are connected by a StreamEdge, together forming a DAG of StreamNodes and StreamEdges.
- Layer 2: the JobGraph, again starting from the Source node, traverses the graph looking for operators that can be chained together. Chainable operators are merged; operators that cannot be chained each get their own JobVertex. Upstream and downstream JobVertex instances are linked by JobEdges, finally forming a DAG at the JobVertex level.
- After the JobVertex DAG is submitted, the vertices are sorted starting from the Source node; an ExecutionJobVertex is generated per JobVertex, an IntermediateResult is built from each IntermediateDataSet of the JobVertex, and the IntermediateResults establish the upstream/downstream dependencies, forming the DAG at the ExecutionJobVertex level, i.e. the ExecutionGraph.
- Finally, the ExecutionGraph layer is mapped down to the physical execution layer.
References
- Flink 作業執行深度解析 (In-Depth Analysis of Flink Job Execution): https://ververica.cn/developers/advanced-tutorial-2-flink-job-execution-depth-analysis/