15. Flink Source Code Reading: The StreamGraph Generation Process

This article walks through how a StreamGraph is generated.

Source Code Analysis

We use WordCount as the example.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

... (intermediate steps omitted)

// execute the program
env.execute("Streaming WordCount");

Next, step into the execute method of StreamExecutionEnvironment:

public abstract JobExecutionResult execute(String jobName) throws Exception;

This method is abstract. Debugging shows that the call ultimately lands in the execute method of the subclass StreamContextEnvironment, shown below:

public JobExecutionResult execute(String jobName) throws Exception {
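	// Note: as quoted, this call null-checks the message literal itself, so jobName is never validated.
	Preconditions.checkNotNull("Streaming Job name should not be null.");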

	StreamGraph streamGraph = this.getStreamGraph(); // obtain the StreamGraph
	streamGraph.setJobName(jobName);

	transformations.clear();

	// execute the programs
	if (ctx instanceof DetachedEnvironment) {
		LOG.warn("Job was executed in detached mode, the results will be available on completion.");
		((DetachedEnvironment) ctx).setDetachedPlan(streamGraph);
		return DetachedEnvironment.DetachedJobExecutionResult.INSTANCE;
	} else {
		return ctx
			.getClient()
			.run(streamGraph, ctx.getJars(), ctx.getClasspaths(), ctx.getUserCodeClassLoader(), ctx.getSavepointRestoreSettings())
			.getJobExecutionResult();
	}
}

The key line is StreamGraph streamGraph = this.getStreamGraph(); so let's follow it:

@Internal
public StreamGraph getStreamGraph() {
	if (transformations.size() <= 0) {
		throw new IllegalStateException("No operators defined in streaming topology. Cannot execute.");
	}
	return StreamGraphGenerator.generate(this, transformations); // generate the StreamGraph
}

Here StreamGraphGenerator.generate is called with transformations as an argument. The transformations variable is a List<StreamTransformation<?>>. So what is a StreamTransformation? It is an abstract class with many implementations, shown in the figure below:

How does a concrete transformation implementation end up in the transformations list? Let's trace DataStream.map as an example.

public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper) {

	TypeInformation<R> outType = TypeExtractor.getMapReturnTypes(clean(mapper), getType(),
			Utils.getCallLocationName(), true);

	return transform("Map", outType, new StreamMap<>(clean(mapper))); // then calls transform
}

public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {

	// read the output type of the input Transform to coax out errors about MissingTypeInfo
	transformation.getOutputType();

	OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
			this.transformation,
			operatorName,
			operator,
			outTypeInfo,
			environment.getParallelism());

	@SuppressWarnings({ "unchecked", "rawtypes" })
	SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);

	getExecutionEnvironment().addOperator(resultTransform); // added to the transformations list

	return returnStream;
}

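The two methods above show that map calls transform. The MapFunction is wrapped in a StreamMap, which is a OneInputStreamOperator; that operator is wrapped in a OneInputTransformation; and the OneInputTransformation is wrapped in a SingleOutputStreamOperator. SingleOutputStreamOperator is a subclass of DataStream, so map returns a DataStream again and you can go on to call filter and other methods. This is what makes chained calls possible in Flink.
During each DataStream-to-DataStream conversion, a OneInputTransformation is created and the following call is executed:
getExecutionEnvironment().addOperator(resultTransform)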

public void addOperator(StreamTransformation<?> transformation) {
	Preconditions.checkNotNull(transformation, "transformation must not be null.");
	this.transformations.add(transformation);
}
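
The recording mechanism described above can be sketched as a small self-contained toy model. ToyEnv, ToyStream and ToyTransformation are hypothetical stand-ins invented for this illustration (they are not Flink classes), and the user functions are never executed; the sketch only shows how each chained call creates a transformation that points at its input and registers it with the environment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model only: ToyEnv, ToyStream and ToyTransformation are hypothetical
// stand-ins for StreamExecutionEnvironment, DataStream and StreamTransformation.
class TransformationRecordingSketch {

    // A transformation remembers its name and its upstream (input) transformation.
    static final class ToyTransformation {
        final String name;
        final ToyTransformation input; // null for a source

        ToyTransformation(String name, ToyTransformation input) {
            this.name = name;
            this.input = input;
        }
    }

    // The environment only collects transformations; nothing is executed here.
    static final class ToyEnv {
        final List<ToyTransformation> transformations = new ArrayList<>();

        ToyStream<String> fromSource(String name) {
            // In this toy model the source is not registered in the list;
            // it is only reachable through the input reference of later transformations.
            return new ToyStream<>(this, new ToyTransformation(name, null));
        }

        void addOperator(ToyTransformation t) {
            transformations.add(t);
        }
    }

    static final class ToyStream<T> {
        final ToyEnv env;
        final ToyTransformation transformation;

        ToyStream(ToyEnv env, ToyTransformation transformation) {
            this.env = env;
            this.transformation = transformation;
        }

        // The mapper is accepted but never applied: only the plan is recorded.
        <R> ToyStream<R> map(Function<T, R> mapper) {
            return transform("Map");
        }

        ToyStream<T> filter(Predicate<T> predicate) {
            return transform("Filter");
        }

        // Create a transformation pointing at this stream's transformation,
        // register it with the environment, and return a new stream for chaining.
        private <R> ToyStream<R> transform(String operatorName) {
            ToyTransformation result = new ToyTransformation(operatorName, this.transformation);
            env.addOperator(result);
            return new ToyStream<>(env, result);
        }
    }

    // Builds source -> map -> filter and lists each recorded transformation with its input.
    static List<String> recordedNames() {
        ToyEnv env = new ToyEnv();
        env.fromSource("Source").map(String::length).filter(n -> n > 3);
        List<String> names = new ArrayList<>();
        for (ToyTransformation t : env.transformations) {
            names.add(t.name + "<-" + t.input.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(recordedNames()); // prints [Map<-Source, Filter<-Map]
    }
}
```

Because every call returns a new stream object that wraps the newly created transformation, the chain can be extended indefinitely while the environment accumulates the plan.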

So as DataStreams are transformed, the transformation behind each operator is added to the transformations list. Next we pick up at return StreamGraphGenerator.generate(this, transformations) and keep digging:

public static StreamGraph generate(StreamExecutionEnvironment env, List<StreamTransformation<?>> transformations) {
	return new StreamGraphGenerator(env).generateInternal(transformations);
}


private StreamGraph generateInternal(List<StreamTransformation<?>> transformations) {
	for (StreamTransformation<?> transformation: transformations) {
		transform(transformation); // core logic
	}
	return streamGraph;
}

The core logic lives in transform(transformation). Let's look at it in detail:

private Collection<Integer> transform(StreamTransformation<?> transform) {

	if (alreadyTransformed.containsKey(transform)) {
		return alreadyTransformed.get(transform);
	}

	LOG.debug("Transforming " + transform);

	if (transform.getMaxParallelism() <= 0) {

		// if the max parallelism hasn't been set, then first use the job wide max parallelism
		// from theExecutionConfig.
		int globalMaxParallelismFromConfig = env.getConfig().getMaxParallelism();
		if (globalMaxParallelismFromConfig > 0) {
			transform.setMaxParallelism(globalMaxParallelismFromConfig);
		}
	}

	// call at least once to trigger exceptions about MissingTypeInfo
	transform.getOutputType();

	Collection<Integer> transformedIds;
	if (transform instanceof OneInputTransformation<?, ?>) {
		transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
	} else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
		transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
	} else if (transform instanceof SourceTransformation<?>) {
		transformedIds = transformSource((SourceTransformation<?>) transform);
	} else if (transform instanceof SinkTransformation<?>) {
		transformedIds = transformSink((SinkTransformation<?>) transform);
	} else if (transform instanceof UnionTransformation<?>) {
		transformedIds = transformUnion((UnionTransformation<?>) transform);
	} else if (transform instanceof SplitTransformation<?>) {
		transformedIds = transformSplit((SplitTransformation<?>) transform);
	} else if (transform instanceof SelectTransformation<?>) {
		transformedIds = transformSelect((SelectTransformation<?>) transform);
	} else if (transform instanceof FeedbackTransformation<?>) {
		transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
	} else if (transform instanceof CoFeedbackTransformation<?>) {
		transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
	} else if (transform instanceof PartitionTransformation<?>) {
		transformedIds = transformPartition((PartitionTransformation<?>) transform);
	} else if (transform instanceof SideOutputTransformation<?>) {
		transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
	} else {
		throw new IllegalStateException("Unknown transformation: " + transform);
	}

	// need this check because the iterate transformation adds itself before
	// transforming the feedback edges
	if (!alreadyTransformed.containsKey(transform)) {
		alreadyTransformed.put(transform, transformedIds);
	}

	if (transform.getBufferTimeout() >= 0) {
		streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());
	}
	if (transform.getUid() != null) {
		streamGraph.setTransformationUID(transform.getId(), transform.getUid());
	}
	if (transform.getUserProvidedNodeHash() != null) {
		streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());
	}

	if (transform.getMinResources() != null && transform.getPreferredResources() != null) {
		streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());
	}

	return transformedIds;
}

Depending on the type of the transformation, a different transformXXX method is called. Taking OneInputTransformation as an example, here is how transformOneInputTransform is implemented:

private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {

	Collection<Integer> inputIds = transform(transform.getInput()); // recursively transform this transformation's upstream input first

	// the recursive call might have already transformed this
	if (alreadyTransformed.containsKey(transform)) {
		return alreadyTransformed.get(transform);
	}

	String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);

	streamGraph.addOperator(transform.getId(),
			slotSharingGroup,
			transform.getCoLocationGroupKey(),
			transform.getOperator(),
			transform.getInputType(),
			transform.getOutputType(),
			transform.getName()); // add the StreamNode

	if (transform.getStateKeySelector() != null) {
		TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(env.getConfig());
		streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
	}

	streamGraph.setParallelism(transform.getId(), transform.getParallelism());
	streamGraph.setMaxParallelism(transform.getId(), transform.getMaxParallelism());

	for (Integer inputId: inputIds) {
		streamGraph.addEdge(inputId, transform.getId(), 0); // add the StreamEdge
	}

	return Collections.singleton(transform.getId());
}
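
The upstream-first recursion and the alreadyTransformed cache used by transform and transformOneInputTransform can be sketched as a toy model. Everything here is hypothetical and simplified for illustration: Node stands in for a transformation, each transformation yields exactly one id (the real method returns a Collection<Integer>), and nodes and edges are kept as plain strings rather than StreamNode and StreamEdge objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: Node and MemoizedGraphSketch are hypothetical names, not Flink classes.
class MemoizedGraphSketch {

    // A transformation with an id, a name, and references to its upstream inputs.
    static final class Node {
        final int id;
        final String name;
        final List<Node> inputs;

        Node(int id, String name, Node... inputs) {
            this.id = id;
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    final Map<Node, Integer> alreadyTransformed = new HashMap<>();
    final List<String> nodes = new ArrayList<>();
    final List<String> edges = new ArrayList<>();

    // Transform the upstream inputs first, then add this node and its incoming edges.
    int transform(Node t) {
        Integer cached = alreadyTransformed.get(t);
        if (cached != null) {
            return cached; // shared upstream nodes are only converted once
        }
        List<Integer> inputIds = new ArrayList<>();
        for (Node input : t.inputs) {
            inputIds.add(transform(input)); // recursion towards the sources
        }
        nodes.add(t.name);
        for (int inputId : inputIds) {
            edges.add(inputId + "->" + t.id);
        }
        alreadyTransformed.put(t, t.id);
        return t.id;
    }

    // Source -> Map, with two sinks that both read from Map.
    static MemoizedGraphSketch build() {
        Node source = new Node(1, "Source");
        Node map = new Node(2, "Map", source);
        Node sinkA = new Node(3, "SinkA", map);
        Node sinkB = new Node(4, "SinkB", map);

        // Only the non-source transformations are in the registered list,
        // the source is discovered through the recursion.
        List<Node> registered = Arrays.asList(map, sinkA, sinkB);

        MemoizedGraphSketch graph = new MemoizedGraphSketch();
        for (Node t : registered) {
            graph.transform(t);
        }
        return graph;
    }

    public static void main(String[] args) {
        MemoizedGraphSketch graph = build();
        System.out.println(graph.nodes); // prints [Source, Map, SinkA, SinkB]
        System.out.println(graph.edges); // prints [1->2, 2->3, 2->4]
    }
}
```

Map is reached three times (once from the registered list and once from each sink) but appears only once in the result, which is the job of the cache lookup at the top of transform.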

transformOneInputTransform first recursively transforms its upstream transformation, then adds the StreamNode and the StreamEdges. Now look at the streamGraph.addOperator method:

public <IN, OUT> void addOperator(
			Integer vertexID,
			String slotSharingGroup,
			@Nullable String coLocationGroup,
			StreamOperator<OUT> operatorObject,
			TypeInformation<IN> inTypeInfo,
			TypeInformation<OUT> outTypeInfo,
			String operatorName) {

	if (operatorObject instanceof StoppableStreamSource) { // add the node
		addNode(vertexID, slotSharingGroup, coLocationGroup, StoppableSourceStreamTask.class, operatorObject, operatorName);
	} else if (operatorObject instanceof StreamSource) {
		addNode(vertexID, slotSharingGroup, coLocationGroup, SourceStreamTask.class, operatorObject, operatorName);
	} else {
		addNode(vertexID, slotSharingGroup, coLocationGroup, OneInputStreamTask.class, operatorObject, operatorName);
	}

	TypeSerializer<IN> inSerializer = inTypeInfo != null && !(inTypeInfo instanceof MissingTypeInfo) ? inTypeInfo.createSerializer(executionConfig) : null;

	TypeSerializer<OUT> outSerializer = outTypeInfo != null && !(outTypeInfo instanceof MissingTypeInfo) ? outTypeInfo.createSerializer(executionConfig) : null;

	setSerializers(vertexID, inSerializer, null, outSerializer);

	if (operatorObject instanceof OutputTypeConfigurable && outTypeInfo != null) {
		@SuppressWarnings("unchecked")
		OutputTypeConfigurable<OUT> outputTypeConfigurable = (OutputTypeConfigurable<OUT>) operatorObject;
		// sets the output type which must be know at StreamGraph creation time
		outputTypeConfigurable.setOutputType(outTypeInfo, executionConfig);
	}

	if (operatorObject instanceof InputTypeConfigurable) {
		InputTypeConfigurable inputTypeConfigurable = (InputTypeConfigurable) operatorObject;
		inputTypeConfigurable.setInputType(inTypeInfo, executionConfig);
	}

	if (LOG.isDebugEnabled()) {
		LOG.debug("Vertex: {}", vertexID);
	}
}

// the StreamNode itself is created and registered in the addNode method


protected StreamNode addNode(Integer vertexID,
		String slotSharingGroup,
		@Nullable String coLocationGroup,
		Class<? extends AbstractInvokable> vertexClass,
		StreamOperator<?> operatorObject,
		String operatorName) {

	if (streamNodes.containsKey(vertexID)) {
		throw new RuntimeException("Duplicate vertexID " + vertexID);
	}

	StreamNode vertex = new StreamNode(environment,
		vertexID,
		slotSharingGroup,
		coLocationGroup,
		operatorObject,
		operatorName,
		new ArrayList<OutputSelector<?>>(),
		vertexClass);

	streamNodes.put(vertexID, vertex);

	return vertex;
}

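Summary of the conversion: each transformation in the List<StreamTransformation<?>> has its own input field, which works like a pointer to its parent transformation, so all the transformations are linked together. Each transformation that carries an operator (such as OneInputTransformation) is then converted into a StreamNode, and the links between them are converted into StreamEdges. A StreamEdge holds a StreamPartitioner, which determines how data is partitioned and distributed between StreamNodes. The available strategies include broadcast, forward, shuffle, rebalance, and others.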
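
To make the partitioning strategies named above more concrete, here is a simplified sketch of how each one might pick downstream channels for a record. The ChannelSelector interface and the four classes are hypothetical and written only for this illustration; they mimic the usual semantics of forward, rebalance, shuffle and broadcast and are not Flink's StreamPartitioner implementations, which differ in detail.

```java
import java.util.Arrays;
import java.util.Random;

// Simplified model only: these classes are hypothetical stand-ins that mimic the
// usual semantics of each strategy; they are not Flink's StreamPartitioner classes.
class PartitionerSketch {

    // Returns the indices of the downstream channels that should receive one record.
    interface ChannelSelector {
        int[] select(int numChannels);
    }

    // forward: each upstream subtask sends to its single local downstream channel.
    static final class Forward implements ChannelSelector {
        public int[] select(int numChannels) {
            return new int[] {0};
        }
    }

    // rebalance: round-robin over all downstream channels.
    static final class Rebalance implements ChannelSelector {
        private int next = -1;

        public int[] select(int numChannels) {
            next = (next + 1) % numChannels;
            return new int[] {next};
        }
    }

    // shuffle: one uniformly random downstream channel per record.
    static final class Shuffle implements ChannelSelector {
        private final Random random = new Random();

        public int[] select(int numChannels) {
            return new int[] {random.nextInt(numChannels)};
        }
    }

    // broadcast: every downstream channel receives the record.
    static final class Broadcast implements ChannelSelector {
        public int[] select(int numChannels) {
            int[] all = new int[numChannels];
            for (int i = 0; i < numChannels; i++) {
                all[i] = i;
            }
            return all;
        }
    }

    public static void main(String[] args) {
        ChannelSelector rebalance = new Rebalance();
        for (int record = 0; record < 4; record++) {
            // With 3 channels the selected channel cycles 0, 1, 2, 0.
            System.out.println("rebalance -> " + Arrays.toString(rebalance.select(3)));
        }
        System.out.println("broadcast -> " + Arrays.toString(new Broadcast().select(3)));
    }
}
```

The point of the model is that the strategy belongs to the edge rather than to either node: the same two StreamNodes can be connected with any of these behaviours depending on which partitioner the edge carries.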
