Storm Guide

Tutorial

In this tutorial, you'll learn how to create Storm topologies and deploy them to a Storm cluster. Java will be the main language used, but a few examples will use Python to illustrate Storm's multi-language capabilities.

 

Preliminaries

This tutorial uses examples from the storm-starter project. It's recommended that you clone the project and follow along with the examples. Read "Setting up a development environment" and "Creating a new Storm project" to get your machine set up.

Components of a Storm cluster

A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run "MapReduce jobs", on Storm you run "topologies". "Jobs" and "topologies" themselves are very different: one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).

There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called "Nimbus" that is similar to Hadoop's "JobTracker". Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.

Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.

All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk. This means you can kill -9 Nimbus or the Supervisors and they'll start back up like nothing happened. This design leads to Storm clusters being incredibly stable.

Topologies

To do realtime computation on Storm, you create what are called "topologies". A topology is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes.

Running a topology is straightforward. First, you package all your code and dependencies into a single jar. Then, you run a command like the following:

storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2

 

This runs the class backtype.storm.MyTopology with the arguments arg1 and arg2. The main function of the class defines the topology and submits it to Nimbus. The storm jar part takes care of connecting to Nimbus and uploading the jar.

Since topology definitions are just Thrift structs, and Nimbus is a Thrift service, you can create and submit topologies using any programming language. The above example is the easiest way to do it from a JVM-based language. See "Running topologies on a production cluster" for more information on starting and stopping topologies.

Streams

The core abstraction in Storm is the "stream". A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you may transform a stream of tweets into a stream of trending topics.

The basic primitives Storm provides for doing stream transformations are "spouts" and "bolts". Spouts and bolts have interfaces that you implement to run your application-specific logic.

A spout is a source of streams. For example, a spout may read tuples off of a Kestrel queue and emit them as a stream. Or a spout may connect to the Twitter API and emit a stream of tweets.

A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Complex stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. Bolts can do anything: run functions, filter tuples, do streaming aggregations, do streaming joins, talk to databases, and more.

Networks of spouts and bolts are packaged into a "topology", which is the top-level abstraction that you submit to Storm clusters for execution. A topology is a graph of stream transformations where each node is a spout or bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream.

 

Links between nodes in your topology indicate how tuples should be passed around. For example, if there is a link between Spout A and Bolt B, a link from Spout A to Bolt C, and a link from Bolt B to Bolt C, then every time Spout A emits a tuple, it will send the tuple to both Bolt B and Bolt C. All of Bolt B's output tuples will go to Bolt C as well.
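The routing rule in this example can be sketched in plain Python, independent of Storm. The names `links` and `deliver` are hypothetical and exist only for illustration; the sketch models which components receive a tuple, not how Storm actually transports it.

```python
# Hypothetical model of the example graph: each component emits to
# every component listed as one of its subscribers.
links = {
    "Spout A": ["Bolt B", "Bolt C"],
    "Bolt B": ["Bolt C"],
    "Bolt C": [],
}


def deliver(source, links):
    """Return the components that directly receive a tuple emitted by `source`."""
    return list(links[source])


print(deliver("Spout A", links))  # ['Bolt B', 'Bolt C']
print(deliver("Bolt B", links))   # ['Bolt C']
```

Bolt C therefore sees tuples from two sources: those sent directly by Spout A and those produced by Bolt B.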

 

Each node in a Storm topology executes in parallel. In your topology, you can specify how much parallelism you want for each node, and then Storm will spawn that number of threads across the cluster to do the execution.

A topology runs forever, or until you kill it. Storm will automatically reassign any failed tasks. Additionally, Storm guarantees that there will be no data loss, even if machines go down and messages are dropped.

Data model

Storm uses tuples as its data model. A tuple is a named list of values, and a field in a tuple can be an object of any type. Out of the box, Storm supports all the primitive types, strings, and byte arrays as tuple field values. To use an object of another type, you just need to implement a serializer for the type.

Every node in a topology must declare the output fields for the tuples it emits. For example, this bolt declares that it emits 2-tuples with the fields "double" and "triple":

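// Package names assume a pre-1.0 Storm release (backtype.storm.*), matching the
// storm jar command above; Storm 1.0 and later use org.apache.storm.* instead.
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DoubleAndTripleBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // Keep the collector so execute() can emit and ack tuples.
        _collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        int val = input.getInteger(0);
        // Emit a 2-tuple anchored to the input, then acknowledge the input.
        _collector.emit(input, new Values(val * 2, val * 3));
        _collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("double", "triple"));
    }
}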

The declareOutputFields function declares the output fields ["double", "triple"] for the component. The rest of the bolt will be explained in the upcoming sections.
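To make the idea of a "named list of values" concrete, here is a minimal Python sketch that pairs declared field names with emitted values. The function `double_and_triple` is a hypothetical stand-in for the bolt above and does not use any Storm API.

```python
def double_and_triple(val):
    """Mimic DoubleAndTripleBolt for one input value, without Storm."""
    fields = ["double", "triple"]   # what declareOutputFields declares
    values = [val * 2, val * 3]     # what execute emits
    # A tuple is a named list of values: pair each field name with its value.
    return dict(zip(fields, values))


print(double_and_triple(5))  # {'double': 10, 'triple': 15}
```

The field names are what downstream components use to refer to individual values, for example when grouping a stream by a field.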

A simple topology

Let's take a look at a simple topology to explore the concepts more and see how the code shapes up. Here is the ExclamationTopology definition from storm-starter:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("words", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
        .shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
        .shuffleGrouping("exclaim1");

This topology contains a spout and two bolts. The spout emits words, and each bolt appends the string "!!!" to its input. The nodes are arranged in a line: the spout emits to the first bolt, which then emits to the second bolt. If the spout emits the tuples ["bob"] and ["john"], then the second bolt will emit the words ["bob!!!!!!"] and ["john!!!!!!"].
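The data flow through this linear topology can be checked with a few lines of plain Python. The functions `exclaim` and `run_chain` are hypothetical helpers that only imitate the two bolts; they ignore parallelism and Storm's runtime entirely.

```python
def exclaim(word):
    """What one ExclamationBolt does to a single word."""
    return word + "!!!"


def run_chain(words):
    """Pass each word through exclaim1 and then exclaim2, in order."""
    return [exclaim(exclaim(w)) for w in words]


print(run_chain(["bob", "john"]))  # ['bob!!!!!!', 'john!!!!!!']
```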

 

This code defines the nodes using the setSpout and setBolt methods. These methods take as input a user-specified id, an object containing the processing logic, and the amount of parallelism you want for the node. In this example, the spout is given id "words" and the bolts are given ids "exclaim1" and "exclaim2".

The object containing the processing logic implements the IRichSpout interface for spouts and the IRichBolt interface for bolts.

The last parameter, how much parallelism you want for the node, is optional. It indicates how many threads should execute that component across the cluster. If you omit it, Storm will only allocate one thread for that node.

setBolt returns an InputDeclarer object that is used to define the inputs to the Bolt. Here, component "exclaim1" declares that it wants to read all the tuples emitted by component "words" using a shuffle grouping, and component "exclaim2" declares that it wants to read all the tuples emitted by component "exclaim1" using a shuffle grouping. "Shuffle grouping" means that tuples should be randomly distributed from the input tasks to the bolt's tasks. There are many ways to group data between components. These will be explained in a few sections.

If you wanted component "exclaim2" to read all the tuples emitted by both component "words" and component "exclaim1", you would write component "exclaim2"'s definition like this:

builder.setBolt("exclaim2", new ExclamationBolt(), 5)
        .shuffleGrouping("words")
        .shuffleGrouping("exclaim1");

As you can see, input declarations can be chained to specify multiple sources for the Bolt.

Let's dig into the implementations of the spouts and bolts in this topology. Spouts are responsible for emitting new messages into the topology. TestWordSpout in this topology emits a random word from the list ["nathan", "mike", "jackson", "golda", "bertels"] as a 1-tuple every 100ms. The implementation of nextTuple() in TestWordSpout looks like this:

 

public void nextTuple() {
    // Wait 100 ms between emits.
    Utils.sleep(100);
    final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
    final Random rand = new Random();
    // Pick one word at random and emit it as a 1-tuple.
    final String word = words[rand.nextInt(words.length)];
    _collector.emit(new Values(word));
}

As you can see, the implementation is very straightforward.

ExclamationBolt appends the string "!!!" to its input. Let's take a look at the full implementation for ExclamationBolt:

public static class ExclamationBolt implements IRichBolt {
    OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        _collector.ack(tuple);
    }

    @Override
    public void cleanup() {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map getComponentConfiguration() {
        return null;
    }
}

The prepare method provides the bolt with an OutputCollector that is used for emitting tuples from this bolt. Tuples can be emitted at any time from the bolt: in the prepare, execute, or cleanup methods, or even asynchronously in another thread. This prepare implementation simply saves the OutputCollector as an instance variable to be used later on in the execute method.

The execute method receives a tuple from one of the bolt's inputs. The ExclamationBolt grabs the first field from the tuple and emits a new tuple with the string "!!!" appended to it. If you implement a bolt that subscribes to multiple input sources, you can find out which component the Tuple came from by using the Tuple#getSourceComponent method.

There are a few other things going on in the execute method, namely that the input tuple is passed as the first argument to emit and the input tuple is acked on the final line. These are part of Storm's reliability API for guaranteeing no data loss and will be explained later in this tutorial.

The cleanup method is called when a Bolt is being shut down and should clean up any resources that were opened. There's no guarantee that this method will be called on the cluster: for example, if the machine the task is running on blows up, there's no way to invoke the method. The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated in process), and you want to be able to run and kill many topologies without suffering any resource leaks.

The declareOutputFields method declares that the ExclamationBolt emits 1-tuples with one field called "word".

The getComponentConfiguration method allows you to configure various aspects of how this component runs. This is a more advanced topic that is explained further on the "Configuration" page.

Methods like cleanup and getComponentConfiguration are often not needed in a bolt implementation. You can define bolts more succinctly by using a base class that provides default implementations where appropriate. ExclamationBolt can be written more succinctly by extending BaseRichBolt, like so:

public static class ExclamationBolt extends BaseRichBolt {
    OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        _collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

Running ExclamationTopology in local mode

Let's see how to run the ExclamationTopology in local mode and see that it's working.

Storm has two modes of operation: local mode and distributed mode. In local mode, Storm executes completely in process by simulating worker nodes with threads. Local mode is useful for testing and development of topologies. When you run the topologies in storm-starter, they'll run in local mode and you'll be able to see what messages each component is emitting. You can read more about running topologies in local mode on the "Local mode" page.

In distributed mode, Storm operates as a cluster of machines. When you submit a topology to the master, you also submit all the code necessary to run the topology. The master will take care of distributing your code and allocating workers to run your topology. If workers go down, the master will reassign them somewhere else. You can read more about running topologies on a cluster on the "Running topologies on a production cluster" page.

Here's the code that runs ExclamationTopology in local mode:

Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(2);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();

First, the code defines an in-process cluster by creating a LocalCluster object. Submitting topologies to this virtual cluster is identical to submitting topologies to distributed clusters. It submits a topology to the LocalCluster by calling submitTopology, which takes as arguments a name for the running topology, a configuration for the topology, and then the topology itself.

The name is used to identify the topology so that you can kill it later on. A topology will run indefinitely until you kill it.

The configuration is used to tune various aspects of the running topology. The two configurations specified here are very common:

TOPOLOGY_WORKERS (set with setNumWorkers) specifies how many processes you want allocated around the cluster to execute the topology. Each component in the topology will execute as many threads. The number of threads allocated to a given component is configured through the setBolt and setSpout methods. Those threads exist within worker processes. Each worker process contains within it some number of threads for some number of components. For instance, you may have 300 threads specified across all your components and 50 worker processes specified in your config. Each worker process will execute 6 threads, each of which could belong to a different component. You tune the performance of Storm topologies by tweaking the parallelism for each component and the number of worker processes those threads should run within.
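The arithmetic in the 300-thread example is simply the total thread count divided by the number of worker processes. The sketch below assumes threads are spread as evenly as possible; `threads_per_worker` is a hypothetical helper, and Storm's real scheduler may place threads differently.

```python
def threads_per_worker(total_threads, num_workers):
    """Spread threads over workers as evenly as possible (an assumption)."""
    base, extra = divmod(total_threads, num_workers)
    # The first `extra` workers each take one additional thread.
    return [base + 1 if i < extra else base for i in range(num_workers)]


layout = threads_per_worker(300, 50)
print(layout[0], len(layout))     # 6 50
print(threads_per_worker(7, 2))   # [4, 3]
```

With 300 threads and 50 workers the division is exact, so every worker runs 6 threads, matching the figure in the text.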

TOPOLOGY_DEBUG (set with setDebug), when set to true, tells Storm to log every message emitted by a component. This is useful in local mode when testing topologies, but you probably want to keep this turned off when running topologies on the cluster.

There are many other configurations you can set for the topology. The various configurations are detailed in the Javadoc for Config.

To learn about how to set up your development environment so that you can run topologies in local mode (such as in Eclipse), see "Creating a new Storm project".

Stream groupings

A stream grouping tells a topology how to send tuples between two components. Remember, spouts and bolts execute in parallel as many tasks across the cluster. If you look at how a topology is executing at the task level, it looks something like this:

 

When a task for Bolt A emits a tuple to Bolt B, which task should it send the tuple to?

A "stream grouping" answers this question by telling Storm how to send tuples between sets of tasks. Before we dig into the different kinds of stream groupings, let's take a look at another topology from storm-starter. This WordCountTopology reads sentences from a spout and, for each word, streams out of the WordCount bolt the total number of times it has seen that word so far:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("sentences", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
        .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 12)
        .fieldsGrouping("split", new Fields("word"));

SplitSentence emits a tuple for each word in each sentence it receives, and WordCount keeps a map in memory from word to count. Each time WordCount receives a word, it updates its state and emits the new word count.

There are a few different kinds of stream groupings.

The simplest kind of grouping is called a "shuffle grouping", which sends the tuple to a random task. A shuffle grouping is used in the WordCountTopology to send tuples from RandomSentenceSpout to the SplitSentence bolt. It has the effect of evenly distributing the work of processing the tuples across all of the SplitSentence bolt's tasks.
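As a rough illustration of why random assignment spreads work evenly, the following sketch picks a random task index for each tuple and counts how many tuples each of 8 tasks receives. `shuffle_grouping` is a hypothetical function; it assumes a uniform random choice and is not Storm's actual implementation.

```python
import random


def shuffle_grouping(num_tasks, rng):
    """Pick a target task index uniformly at random (an assumption)."""
    return rng.randrange(num_tasks)


rng = random.Random(42)        # fixed seed so runs are repeatable
counts = [0] * 8               # 8 tasks, as for the "split" bolt above
for _ in range(8000):
    counts[shuffle_grouping(8, rng)] += 1

print(counts)  # each task receives roughly 1000 tuples
```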

A more interesting kind of grouping is the "fields grouping". A fields grouping is used between the SplitSentence bolt and the WordCount bolt. It is critical for the functioning of the WordCount bolt that the same word always goes to the same task. Otherwise, more than one task will see the same word, and they'll each emit incorrect values for the count since each has incomplete information. A fields grouping lets you group a stream by a subset of its fields. This causes equal values for that subset of fields to go to the same task. Since WordCount subscribes to SplitSentence's output stream using a fields grouping on the "word" field, the same word always goes to the same task and the bolt produces the correct output.

Fields groupings are the basis of implementing streaming joins and streaming aggregations as well as a plethora of other use cases. Under the hood, fields groupings are implemented using mod hashing.
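Mod hashing can be sketched as: hash the values of the grouping fields, then take the result modulo the number of target tasks. The example below uses `zlib.crc32` purely as a stand-in hash that is stable across Python runs; Storm's own hash function is different, and `fields_grouping` is a hypothetical name.

```python
import zlib


def fields_grouping(values, field_indexes, num_tasks):
    """Choose a task index from the values of the grouping fields."""
    # Build a key from only the selected fields of the tuple.
    key = "\x00".join(str(values[i]) for i in field_indexes)
    # crc32 is an illustrative stand-in, not the hash Storm uses.
    return zlib.crc32(key.encode("utf-8")) % num_tasks


first = fields_grouping(["storm"], [0], 12)
second = fields_grouping(["storm"], [0], 12)
print(first == second)  # True
```

Because the task index depends only on the field values, every occurrence of the same word lands on the same WordCount task, which is what keeps the in-memory counts correct.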

There are a few other kinds of stream groupings. You can read more about them on the "Concepts" page.

Defining Bolts in other languages

Bolts can be defined in any language. Bolts written in another language are executed as subprocesses, and Storm communicates with those subprocesses with JSON messages over stdin/stdout. The communication protocol just requires an approximately 100 line adapter library, and Storm ships with adapter libraries for Ruby, Python, and Fancy.
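The following sketch shows the general shape of exchanging JSON messages over a text stream. The framing (each JSON message followed by a line containing only `end`) and the `command` and `tuple` keys are assumptions based on Storm's multi-language protocol documentation and should be verified there; `write_message` and `read_message` are hypothetical helpers, and an in-memory buffer stands in for stdin/stdout.

```python
import io
import json


def write_message(out, msg):
    """Serialize one message as JSON, then mark its end (assumed framing)."""
    out.write(json.dumps(msg) + "\nend\n")


def read_message(inp):
    """Read lines up to the 'end' marker and parse them as one JSON message."""
    lines = []
    for line in inp:
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))


buf = io.StringIO()            # stands in for the subprocess pipe
write_message(buf, {"command": "emit", "tuple": ["hello"]})
buf.seek(0)
print(read_message(buf))  # {'command': 'emit', 'tuple': ['hello']}
```

An adapter library is essentially this read/write loop plus a small API such as the `storm.emit` call used in the Python bolt shown later.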

Here's the definition of the SplitSentence bolt from WordCountTopology:

public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

SplitSentence overrides ShellBolt and declares it as running using python with the argument splitsentence.py. Here's the implementation of splitsentence.py:

 

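import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        # Split the sentence (first field of the tuple) on single spaces.
        words = tup.values[0].split(" ")
        for word in words:
            # Emit each word as its own 1-tuple.
            storm.emit([word])

SplitSentenceBolt().run()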

 

For more information on writing spouts and bolts in other languages, and to learn about how to create topologies in other languages (and avoid the JVM completely), see "Using non-JVM languages with Storm".

Guaranteeing message processing

Earlier on in this tutorial, we skipped over a few aspects of how tuples are emitted. Those aspects were part of Storm's reliability API: how Storm guarantees that every message coming off a spout will be fully processed. See "Guaranteeing message processing" for information on how this works and what you have to do as a user to take advantage of Storm's reliability capabilities.

Transactional topologies

Storm guarantees that every message will be played through the topology at least once. A common question asked is "how do you do things like counting on top of Storm? Won't you overcount?" Storm has a feature called transactional topologies that lets you achieve exactly-once messaging semantics for most computations. Read more in the transactional topologies documentation.
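The overcounting concern can be shown with a small sketch. Under at-least-once delivery a message may be replayed, so a naive counter counts it twice, while a counter that remembers message ids does not. This only illustrates the problem; it is not how transactional topologies are implemented, and `naive_count` and `dedup_count` are hypothetical names.

```python
def naive_count(messages):
    """Count every delivery, including replays."""
    return len(messages)


def dedup_count(messages):
    """Count each message id once, ignoring replayed deliveries."""
    seen = set()
    for msg_id, _payload in messages:
        seen.add(msg_id)
    return len(seen)


# Message 2 was delivered twice, as at-least-once delivery allows.
delivered = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
print(naive_count(delivered))  # 4
print(dedup_count(delivered))  # 3
```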

Distributed RPC

This tutorial showed how to do basic stream processing on top of Storm. There are many more things you can do with Storm's primitives. One of the most interesting applications of Storm is Distributed RPC, where you parallelize the computation of intense functions on the fly. Read more in the Distributed RPC documentation.

Conclusion

This tutorial gave a broad overview of developing, testing, and deploying Storm topologies. The rest of the documentation dives deeper into all the aspects of using Storm.


Copyright © 2015 Apache Software Foundation. All Rights Reserved.
Apache Storm, Apache, the Apache feather logo, and the Apache Storm project logos are trademarks of The Apache Software Foundation.
All other marks mentioned may be trademarks or registered trademarks of their respective owners.

 
