Preface
This article looks at how Flink provides compatibility with Storm topologies (StormTopology).
Example
@Test
public void testStormWordCount() throws Exception {
    //NOTE 1 build Topology the Storm way
    final TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomWordSpout(), 1);
    builder.setBolt("count", new WordCountBolt(), 5)
            .fieldsGrouping("spout", new Fields("word"));
    builder.setBolt("print", new PrintBolt(), 1)
            .shuffleGrouping("count");

    //NOTE 2 convert StormTopology to FlinkTopology
    FlinkTopology flinkTopology = FlinkTopology.createTopology(builder);

    //NOTE 3 execute program locally using FlinkLocalCluster
    Config conf = new Config();
    // only required to stabilize integration test
    conf.put(FlinkLocalCluster.SUBMIT_BLOCKING, true);

    final FlinkLocalCluster cluster = FlinkLocalCluster.getLocalCluster();
    cluster.submitTopology("stormWordCount", conf, flinkTopology);
    cluster.shutdown();
}
- Here FlinkLocalCluster.getLocalCluster() is used to create or obtain a FlinkLocalCluster, FlinkLocalCluster.submitTopology is then called to submit the topology, and at the end the cluster is shut down via FlinkLocalCluster.shutdown
- The RandomWordSpout built here extends Storm's BaseRichSpout, WordCountBolt extends Storm's BaseBasicBolt, and PrintBolt extends Storm's BaseRichBolt (since Flink relies on its own checkpoint mechanism and does not translate Storm's ack operations, it makes no real difference whether BaseBasicBolt or BaseRichBolt is used here); minimal sketches of these classes follow below
- The topology passed to FlinkLocalCluster.submitTopology is the FlinkTopology converted from the StormTopology
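The spout and bolt implementations are not part of the snippet above; a minimal sketch of what they might look like, written against the plain org.apache.storm 1.x API (the class bodies are illustrative guesses, not the actual flink-storm test sources):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// emits a random word on every call to nextTuple()
class RandomWordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] words = {"flink", "storm", "compatibility"};
    private final Random random = new Random();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        collector.emit(new Values(words[random.nextInt(words.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

// keeps a running count per word; BaseBasicBolt would normally handle ack/fail automatically in Storm
class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String word = input.getStringByField("word");
        int count = counts.merge(word, 1, Integer::sum);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

// terminal bolt that just prints the incoming tuple and declares no output streams
class PrintBolt extends BaseRichBolt {
    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    }

    @Override
    public void execute(Tuple input) {
        System.out.println(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }
}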
LocalClusterFactory
flink-release-1.6.2/flink-contrib/flink-storm/src/main/java/org/apache/flink/storm/api/FlinkLocalCluster.java
// ------------------------------------------------------------------------
//  Access to default local cluster
// ------------------------------------------------------------------------

// A different {@link FlinkLocalCluster} to be used for execution of ITCases
private static LocalClusterFactory currentFactory = new DefaultLocalClusterFactory();

/**
 * Returns a {@link FlinkLocalCluster} that should be used for execution. If no cluster was set by
 * {@link #initialize(LocalClusterFactory)} in advance, a new {@link FlinkLocalCluster} is returned.
 *
 * @return a {@link FlinkLocalCluster} to be used for execution
 */
public static FlinkLocalCluster getLocalCluster() {
    return currentFactory.createLocalCluster();
}

/**
 * Sets a different factory for FlinkLocalClusters to be used for execution.
 *
 * @param clusterFactory
 *      The LocalClusterFactory to create the local clusters for execution.
 */
public static void initialize(LocalClusterFactory clusterFactory) {
    currentFactory = Objects.requireNonNull(clusterFactory);
}

// ------------------------------------------------------------------------
//  Cluster factory
// ------------------------------------------------------------------------

/**
 * A factory that creates local clusters.
 */
public interface LocalClusterFactory {

    /**
     * Creates a local Flink cluster.
     * @return A local Flink cluster.
     */
    FlinkLocalCluster createLocalCluster();
}

/**
 * A factory that instantiates a FlinkLocalCluster.
 */
public static class DefaultLocalClusterFactory implements LocalClusterFactory {

    @Override
    public FlinkLocalCluster createLocalCluster() {
        return new FlinkLocalCluster();
    }
}
- Flink provides the static method getLocalCluster in FlinkLocalCluster for obtaining a FlinkLocalCluster; it creates the FlinkLocalCluster through a LocalClusterFactory
- The LocalClusterFactory used here is the DefaultLocalClusterFactory implementation, whose createLocalCluster method simply creates a new FlinkLocalCluster
- With the current implementation, every call to FlinkLocalCluster.getLocalCluster creates a brand-new FlinkLocalCluster, something callers need to be aware of; a different factory can be installed via initialize, as sketched below
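Because getLocalCluster delegates to the factory on every call, a test that wants to reuse one cluster can install its own factory via FlinkLocalCluster.initialize before submitting; a minimal sketch (the ReusingLocalClusterFactory name is made up for illustration):

import org.apache.flink.storm.api.FlinkLocalCluster;

// hands back the same FlinkLocalCluster instance on every createLocalCluster() call
public class ReusingLocalClusterFactory implements FlinkLocalCluster.LocalClusterFactory {
    private final FlinkLocalCluster instance = new FlinkLocalCluster();

    @Override
    public FlinkLocalCluster createLocalCluster() {
        return instance;
    }
}

// usage: install the factory once, after which getLocalCluster() always returns the same cluster
// FlinkLocalCluster.initialize(new ReusingLocalClusterFactory());
// FlinkLocalCluster cluster = FlinkLocalCluster.getLocalCluster();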
FlinkTopology
flink-release-1.6.2/flink-contrib/flink-storm/src/main/java/org/apache/flink/storm/api/FlinkTopology.java
/**
 * Creates a Flink program that uses the specified spouts and bolts.
 * @param stormBuilder The Storm topology builder to use for creating the Flink topology.
 * @return A {@link FlinkTopology} which contains the translated Storm topology and may be executed.
 */
public static FlinkTopology createTopology(TopologyBuilder stormBuilder) {
    return new FlinkTopology(stormBuilder);
}

private FlinkTopology(TopologyBuilder builder) {
    this.builder = builder;
    this.stormTopology = builder.createTopology();
    // extract the spouts and bolts
    this.spouts = getPrivateField("_spouts");
    this.bolts = getPrivateField("_bolts");

    this.env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Kick off the translation immediately
    translateTopology();
}
- FlinkTopology provides the static factory method createTopology for creating a FlinkTopology
- FlinkTopology first keeps a reference to the TopologyBuilder, then uses getPrivateField, which reflectively calls getDeclaredField, to read the private _spouts and _bolts fields and stores them for the later topology translation (a sketch of this kind of reflective access follows after this list)
- It then obtains the StreamExecutionEnvironment and finally calls translateTopology to translate the entire StormTopology
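getPrivateField itself is not shown above; a minimal sketch of that kind of reflective access, assuming the private fields are HashMaps keyed by component id (the readPrivateField name and the simplified generics are illustrative, not the actual method):

import java.lang.reflect.Field;
import java.util.HashMap;

import org.apache.storm.topology.TopologyBuilder;

// reads a private field such as "_spouts" or "_bolts" from a Storm TopologyBuilder
@SuppressWarnings("unchecked")
static <T> HashMap<String, T> readPrivateField(TopologyBuilder builder, String fieldName) throws Exception {
    Field field = builder.getClass().getDeclaredField(fieldName);
    field.setAccessible(true); // the fields are private, so make them accessible first
    return (HashMap<String, T>) field.get(builder);
}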
translateTopology
flink-release-1.6.2/flink-contrib/flink-storm/src/main/java/org/apache/flink/storm/api/FlinkTopology.java
/**
 * Creates a Flink program that uses the specified spouts and bolts.
 */
private void translateTopology() {

    unprocessdInputsPerBolt.clear();
    outputStreams.clear();
    declarers.clear();
    availableInputs.clear();

    // Storm defaults to parallelism 1
    env.setParallelism(1);

    /* Translation of topology */

    for (final Entry<String, IRichSpout> spout : spouts.entrySet()) {
        final String spoutId = spout.getKey();
        final IRichSpout userSpout = spout.getValue();

        final FlinkOutputFieldsDeclarer declarer = new FlinkOutputFieldsDeclarer();
        userSpout.declareOutputFields(declarer);
        final HashMap<String, Fields> sourceStreams = declarer.outputStreams;
        this.outputStreams.put(spoutId, sourceStreams);
        declarers.put(spoutId, declarer);

        final HashMap<String, DataStream<Tuple>> outputStreams = new HashMap<String, DataStream<Tuple>>();
        final DataStreamSource<?> source;

        if (sourceStreams.size() == 1) {
            final SpoutWrapper<Tuple> spoutWrapperSingleOutput = new SpoutWrapper<Tuple>(userSpout, spoutId, null, null);
            spoutWrapperSingleOutput.setStormTopology(stormTopology);

            final String outputStreamId = (String) sourceStreams.keySet().toArray()[0];

            DataStreamSource<Tuple> src = env.addSource(spoutWrapperSingleOutput, spoutId,
                    declarer.getOutputType(outputStreamId));

            outputStreams.put(outputStreamId, src);
            source = src;
        } else {
            final SpoutWrapper<SplitStreamType<Tuple>> spoutWrapperMultipleOutputs = new SpoutWrapper<SplitStreamType<Tuple>>(
                    userSpout, spoutId, null, null);
            spoutWrapperMultipleOutputs.setStormTopology(stormTopology);

            @SuppressWarnings({ "unchecked", "rawtypes" })
            DataStreamSource<SplitStreamType<Tuple>> multiSource = env.addSource(
                    spoutWrapperMultipleOutputs, spoutId,
                    (TypeInformation) TypeExtractor.getForClass(SplitStreamType.class));

            SplitStream<SplitStreamType<Tuple>> splitSource = multiSource
                    .split(new StormStreamSelector<Tuple>());
            for (String streamId : sourceStreams.keySet()) {
                SingleOutputStreamOperator<Tuple> outStream = splitSource.select(streamId)
                        .map(new SplitStreamMapper<Tuple>());
                outStream.getTransformation().setOutputType(declarer.getOutputType(streamId));
                outputStreams.put(streamId, outStream);
            }
            source = multiSource;
        }
        availableInputs.put(spoutId, outputStreams);

        final ComponentCommon common = stormTopology.get_spouts().get(spoutId).get_common();
        if (common.is_set_parallelism_hint()) {
            int dop = common.get_parallelism_hint();
            source.setParallelism(dop);
        } else {
            common.set_parallelism_hint(1);
        }
    }

    /**
     * 1. Connect all spout streams with bolts streams
     * 2. Then proceed with the bolts stream already connected
     *
     * <p>Because we do not know the order in which an iterator steps over a set, we might process a consumer before
     * its producer
     * ->thus, we might need to repeat multiple times
     */
    boolean makeProgress = true;
    while (bolts.size() > 0) {
        if (!makeProgress) {
            StringBuilder strBld = new StringBuilder();
            strBld.append("Unable to build Topology. Could not connect the following bolts:");
            for (String boltId : bolts.keySet()) {
                strBld.append("\n ");
                strBld.append(boltId);
                strBld.append(": missing input streams [");
                for (Entry<GlobalStreamId, Grouping> streams : unprocessdInputsPerBolt
                        .get(boltId)) {
                    strBld.append("'");
                    strBld.append(streams.getKey().get_streamId());
                    strBld.append("' from '");
                    strBld.append(streams.getKey().get_componentId());
                    strBld.append("'; ");
                }
                strBld.append("]");
            }

            throw new RuntimeException(strBld.toString());
        }
        makeProgress = false;

        final Iterator<Entry<String, IRichBolt>> boltsIterator = bolts.entrySet().iterator();
        while (boltsIterator.hasNext()) {

            final Entry<String, IRichBolt> bolt = boltsIterator.next();
            final String boltId = bolt.getKey();
            final IRichBolt userBolt = copyObject(bolt.getValue());

            final ComponentCommon common = stormTopology.get_bolts().get(boltId).get_common();

            Set<Entry<GlobalStreamId, Grouping>> unprocessedBoltInputs = unprocessdInputsPerBolt.get(boltId);
            if (unprocessedBoltInputs == null) {
                unprocessedBoltInputs = new HashSet<>();
                unprocessedBoltInputs.addAll(common.get_inputs().entrySet());
                unprocessdInputsPerBolt.put(boltId, unprocessedBoltInputs);
            }

            // check if all inputs are available
            final int numberOfInputs = unprocessedBoltInputs.size();
            int inputsAvailable = 0;
            for (Entry<GlobalStreamId, Grouping> entry : unprocessedBoltInputs) {
                final String producerId = entry.getKey().get_componentId();
                final String streamId = entry.getKey().get_streamId();
                final HashMap<String, DataStream<Tuple>> streams = availableInputs.get(producerId);
                if (streams != null && streams.get(streamId) != null) {
                    inputsAvailable++;
                }
            }

            if (inputsAvailable != numberOfInputs) {
                // traverse other bolts first until inputs are available
                continue;
            } else {
                makeProgress = true;
                boltsIterator.remove();
            }

            final Map<GlobalStreamId, DataStream<Tuple>> inputStreams = new HashMap<>(numberOfInputs);

            for (Entry<GlobalStreamId, Grouping> input : unprocessedBoltInputs) {
                final GlobalStreamId streamId = input.getKey();
                final Grouping grouping = input.getValue();

                final String producerId = streamId.get_componentId();

                final Map<String, DataStream<Tuple>> producer = availableInputs.get(producerId);

                inputStreams.put(streamId, processInput(boltId, userBolt, streamId, grouping, producer));
            }

            final SingleOutputStreamOperator<?> outputStream = createOutput(boltId, userBolt, inputStreams);

            if (common.is_set_parallelism_hint()) {
                int dop = common.get_parallelism_hint();
                outputStream.setParallelism(dop);
            } else {
                common.set_parallelism_hint(1);
            }
        }
    }
}
- The translation converts the spouts first and then the bolts; the spouts and bolts maps it works from were extracted in the constructor via reflection from Storm's TopologyBuilder object
- Flink uses FlinkOutputFieldsDeclarer (which implements Storm's OutputFieldsDeclarer interface) to capture the declareOutputFields information configured on Storm's IRichSpout and IRichBolt; note, however, that Flink does not support direct emit. The userSpout.declareOutputFields call copies the original spout's declarations into the FlinkOutputFieldsDeclarer
- Flink wraps the spout in a SpoutWrapper, turning it into a RichParallelSourceFunction, with separate handling depending on whether the spout declares more than one output stream; the RichParallelSourceFunction is then passed to StreamExecutionEnvironment.addSource to create a Flink DataStreamSource, which is added to availableInputs, and the DataStreamSource's parallelism is set according to the spout's parallelism hint
- For the bolt translation, unprocessdInputsPerBolt is maintained, keyed by boltId with the GlobalStreamIds and Groupings the bolt connects to as value. Because the bolts are iterated over from a map, they may be visited out of order: when all GlobalStreamIds a bolt connects to are already available it is translated and removed from bolts, and when some of them are not yet in availableInputs the bolt is skipped and left in bolts for a later pass. Since the outer loop keeps running while bolts is non-empty, this mechanism is what copes with the out-of-order traversal
- An important method in the bolt translation is processInput, which turns the bolt's grouping into the corresponding operation on the producer's DataStream (for example, shuffleGrouping becomes a rebalance on the DataStream, fieldsGrouping becomes keyBy, globalGrouping becomes global, and allGrouping becomes broadcast; see the sketch after this list). The createOutput method is then called to translate the bolt's execution logic: it uses BoltWrapper or MergedInputsBoltWrapper to turn the bolt into a Flink OneInputStreamOperator, which is passed to the stream's transform operation to produce a Flink SingleOutputStreamOperator; that SingleOutputStreamOperator is added to availableInputs, and its parallelism is set according to the bolt's parallelism hint
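A condensed sketch of that grouping translation, paraphrasing what processInput does for the common groupings (this is not the actual method body; in Storm's thrift model globalGrouping is represented as a fields grouping with an empty field list, which is why it is handled inside the fields branch):

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.storm.generated.Grouping;

// maps a Storm grouping onto the equivalent DataStream operation (simplified paraphrase of processInput)
static DataStream<Tuple> applyGrouping(DataStream<Tuple> input, Grouping grouping, int[] groupingKeyIndexes) {
    if (grouping.is_set_shuffle()) {
        return input.rebalance();                  // shuffleGrouping -> round-robin redistribution
    } else if (grouping.is_set_fields()) {
        if (grouping.get_fields().isEmpty()) {
            return input.global();                 // globalGrouping -> route everything to one subtask
        }
        return input.keyBy(groupingKeyIndexes);    // fieldsGrouping -> partition by the declared key fields
    } else if (grouping.is_set_all()) {
        return input.broadcast();                  // allGrouping -> broadcast to every subtask
    }
    return input;                                  // other groupings (e.g. localOrShuffle) keep the default
}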
FlinkLocalCluster
flink-storm_2.11-1.6.2-sources.jar!/org/apache/flink/storm/api/FlinkLocalCluster.java
/**
 * {@link FlinkLocalCluster} mimics a Storm {@link LocalCluster}.
 */
public class FlinkLocalCluster {

    /** The log used by this mini cluster. */
    private static final Logger LOG = LoggerFactory.getLogger(FlinkLocalCluster.class);

    /** The Flink mini cluster on which to execute the programs. */
    private FlinkMiniCluster flink;

    /** Configuration key to submit topology in blocking mode if flag is set to {@code true}. */
    public static final String SUBMIT_BLOCKING = "SUBMIT_STORM_TOPOLOGY_BLOCKING";

    public FlinkLocalCluster() {
    }

    public FlinkLocalCluster(FlinkMiniCluster flink) {
        this.flink = Objects.requireNonNull(flink);
    }

    @SuppressWarnings("rawtypes")
    public void submitTopology(final String topologyName, final Map conf, final FlinkTopology topology)
            throws Exception {
        this.submitTopologyWithOpts(topologyName, conf, topology, null);
    }

    @SuppressWarnings("rawtypes")
    public void submitTopologyWithOpts(final String topologyName, final Map conf, final FlinkTopology topology, final SubmitOptions submitOpts) throws Exception {
        LOG.info("Running Storm topology on FlinkLocalCluster");

        boolean submitBlocking = false;
        if (conf != null) {
            Object blockingFlag = conf.get(SUBMIT_BLOCKING);
            if (blockingFlag instanceof Boolean) {
                submitBlocking = ((Boolean) blockingFlag).booleanValue();
            }
        }

        FlinkClient.addStormConfigToTopology(topology, conf);

        StreamGraph streamGraph = topology.getExecutionEnvironment().getStreamGraph();
        streamGraph.setJobName(topologyName);

        JobGraph jobGraph = streamGraph.getJobGraph();

        if (this.flink == null) {
            Configuration configuration = new Configuration();
            configuration.addAll(jobGraph.getJobConfiguration());

            configuration.setString(TaskManagerOptions.MANAGED_MEMORY_SIZE, "0");
            configuration.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, jobGraph.getMaximumParallelism());

            this.flink = new LocalFlinkMiniCluster(configuration, true);
            this.flink.start();
        }

        if (submitBlocking) {
            this.flink.submitJobAndWait(jobGraph, false);
        } else {
            this.flink.submitJobDetached(jobGraph);
        }
    }

    public void killTopology(final String topologyName) {
        this.killTopologyWithOpts(topologyName, null);
    }

    public void killTopologyWithOpts(final String name, final KillOptions options) {
    }

    public void activate(final String topologyName) {
    }

    public void deactivate(final String topologyName) {
    }

    public void rebalance(final String name, final RebalanceOptions options) {
    }

    public void shutdown() {
        if (this.flink != null) {
            this.flink.stop();
            this.flink = null;
        }
    }

    //......
}
- FlinkLocalCluster's submitTopology method delegates to submitTopologyWithOpts, which mainly sets a few parameters, calls topology.getExecutionEnvironment().getStreamGraph() to build the StreamGraph from the transformations, obtains the JobGraph from it, then creates and starts a LocalFlinkMiniCluster, and finally submits the JobGraph via LocalFlinkMiniCluster's submitJobAndWait or submitJobDetached (a sketch of handing in a pre-started mini cluster follows below)
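Because the code above only creates its own LocalFlinkMiniCluster when the flink field is null, a caller can also hand in a pre-started mini cluster through the second constructor, for example to share it across several submissions; a rough sketch under that assumption (the slot count is chosen arbitrarily):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.runtime.minicluster.LocalFlinkMiniCluster;
import org.apache.flink.storm.api.FlinkLocalCluster;

// builds a FlinkLocalCluster on top of a mini cluster the caller starts (and owns) itself
static FlinkLocalCluster newSharedLocalCluster() {
    Configuration configuration = new Configuration();
    configuration.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, 8); // assumed slot count, size it to the job
    LocalFlinkMiniCluster miniCluster = new LocalFlinkMiniCluster(configuration, true);
    miniCluster.start();
    // submitTopologyWithOpts will skip creating its own mini cluster since flink != null
    return new FlinkLocalCluster(miniCluster);
}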
Summary
- Through FlinkTopology, Flink offers a degree of compatibility with Storm, which is very helpful when migrating Storm workloads to Flink
- Running a Storm topology on Flink takes a few steps: build the native Storm TopologyBuilder, convert the StormTopology into a FlinkTopology via FlinkTopology.createTopology(builder), and finally submit the FlinkTopology through the submitTopology method of FlinkLocalCluster (local mode) or FlinkSubmitter (remote submission)
- FlinkTopology is the core of Flink's Storm compatibility; it translates the StormTopology into the corresponding Flink constructs, for example wrapping spouts in SpoutWrapper to turn them into RichParallelSourceFunctions that are added to the StreamExecutionEnvironment to create DataStreams, translating a bolt's grouping into the corresponding operation on the producer's DataStream (for example shuffleGrouping becomes a rebalance on the DataStream, fieldsGrouping becomes keyBy, globalGrouping becomes global, and allGrouping becomes broadcast), and then using BoltWrapper or MergedInputsBoltWrapper to turn the bolt into a Flink OneInputStreamOperator that is passed to the stream's transform operation
- Once the FlinkTopology has been built, it is submitted with FlinkLocalCluster for local execution or with FlinkSubmitter for remote execution (sketched after this list)
- FlinkLocalCluster's submitTopology method essentially generates the StreamGraph from the StreamExecutionEnvironment the FlinkTopology worked against, obtains the JobGraph from it, then creates and starts a LocalFlinkMiniCluster, and finally submits the JobGraph through the LocalFlinkMiniCluster
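For the remote case mentioned above, a minimal sketch of submitting through FlinkSubmitter (the jar packaging and the cluster address configuration are assumed to be handled externally, e.g. via flink run, so only the shape of the call is shown):

import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

import org.apache.flink.storm.api.FlinkSubmitter;
import org.apache.flink.storm.api.FlinkTopology;

// converts the Storm builder and hands the translated job to a remote Flink cluster
static void submitRemotely(TopologyBuilder builder) throws Exception {
    FlinkTopology flinkTopology = FlinkTopology.createTopology(builder);
    Config conf = new Config();
    // FlinkSubmitter plays the role of Storm's StormSubmitter for the translated topology
    FlinkSubmitter.submitTopology("stormWordCount", conf, flinkTopology);
}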
doc
- Storm Compatibility Beta