Storm (2): Basic Storm Terminology

Topologies


The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.

All of an application's logic is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job; one key difference is that a MapReduce job eventually finishes, whereas a topology runs forever unless it is killed. A topology defines the spouts, the bolts, and the groupings of the data streams.
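To make "a topology is a graph of spouts and bolts connected by stream groupings" concrete, here is a minimal plain-Java sketch. The types and method names (TopologyGraph, setSpout, setBolt) are stand-ins for illustration only; the real API is Storm's TopologyBuilder, listed in the resources below.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch of a topology as a directed graph: each bolt records
// which upstream component it subscribes to and with which grouping.
public class TopologyGraph {
    // component name -> list of "parent component (grouping)" input edges
    private final Map<String, List<String>> inputs = new LinkedHashMap<>();

    void setSpout(String name) { inputs.put(name, new ArrayList<>()); }

    void setBolt(String name, String parent, String grouping) {
        inputs.computeIfAbsent(name, k -> new ArrayList<>())
              .add(parent + " (" + grouping + ")");
    }

    List<String> inputsOf(String name) { return inputs.get(name); }

    public static void main(String[] args) {
        TopologyGraph t = new TopologyGraph();
        t.setSpout("tweets");                      // spout: source of the stream
        t.setBolt("count", "tweets", "fields");    // bolt: rolling count
        t.setBolt("rank", "count", "global");      // bolt: top-X ranking
        System.out.println(t.inputsOf("rank"));    // prints [count (global)]
    }
}
```

The graph never "finishes": Storm keeps invoking the spouts and bolts on these edges until the topology is killed.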

Resources:

  • TopologyBuilder: use this class to construct topologies in Java
  • Cluster: Running topologies on a production cluster
  • Local mode: Read this to learn how to develop and test topologies in local mode.

Streams


The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream’s tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

The stream is Storm's core abstraction. A stream is an unbounded sequence of tuples that can be created and processed in parallel in a distributed setting. A schema names the fields of the tuples carried by a stream. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used in tuples.

Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of “default”.

Every stream is given an id when it is declared. Because single-stream spouts and bolts are so common, OutputFieldsDeclarer provides convenience methods for declaring a stream without specifying an id; in that case the stream gets the default id of "default".
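The schema idea can be sketched in a few lines of plain Java. MiniTuple below is a stand-in, not Storm's real Tuple class: the stream's schema is a list of field names, and each emitted tuple's values are matched positionally against that schema, which is what makes lookup by field name possible.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for Storm's named-field tuples: values are positional, and the
// stream's schema (a list of field names) gives each position a name.
public class MiniTuple {
    private final List<String> fields; // the stream's schema
    private final List<Object> values; // one emitted tuple

    public MiniTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size())
            throw new IllegalArgumentException("tuple does not match schema");
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by field name, analogous to Tuple.getValueByField.
    public Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }

    public static void main(String[] args) {
        MiniTuple t = new MiniTuple(
            Arrays.asList("user-id", "tweet"),
            Arrays.asList(42L, "hello storm"));
        System.out.println(t.getValueByField("tweet")); // prints hello storm
    }
}
```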

Resources:

  • Tuple: streams are composed of tuples
  • OutputFieldsDeclarer: used to declare streams and their schemas
  • Serialization: Information about Storm’s dynamic typing of tuples and declaring custom serializations
  • ISerialization: custom serializers must implement this interface
  • CONFIG.TOPOLOGY_SERIALIZATIONS: custom serializers can be registered using this configuration

Spouts

A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.

A spout is the source of the streams in a topology. Usually a spout reads tuples from an external data source and emits them to the bolts in the topology. A spout can be reliable or unreliable: a reliable spout can replay a tuple that failed to be processed, whereas an unreliable spout cannot.

Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.

A spout can emit to more than one stream: declare multiple streams with the declareStream method, and specify which stream to emit to when calling emit.

The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.

The main method on a spout is nextTuple, which either emits a new tuple into the topology or simply returns if there is nothing to emit. No spout implementation may block inside nextTuple, because Storm calls all of a spout's methods on the same thread.
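The non-blocking contract of nextTuple can be sketched with a queue-backed spout. This is plain Java with stand-in names, not the real IRichSpout interface; the point is the use of poll() (returns immediately) rather than a blocking take().

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Stand-in spout fed from an in-memory queue (a proxy for an external source).
public class QueueSpout {
    private final Queue<String> source = new ArrayDeque<>();
    private int emitted = 0;

    public void feed(String msg) { source.add(msg); }

    // nextTuple must never block: if there is nothing to emit, return at
    // once, because Storm calls nextTuple, ack, and fail on the same thread.
    public void nextTuple() {
        String msg = source.poll();   // poll(), not a blocking take()
        if (msg == null) return;      // nothing to do; do NOT wait here
        emitted++;                    // a real spout would call collector.emit(...)
    }

    public int emittedCount() { return emitted; }

    public static void main(String[] args) {
        QueueSpout spout = new QueueSpout();
        spout.nextTuple();            // empty source: returns immediately
        spout.feed("tuple-1");
        spout.nextTuple();
        System.out.println(spout.emittedCount()); // prints 1
    }
}
```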

The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.

The other main spout methods are ack and fail, which are called when Storm detects that a tuple emitted by the spout has either completed successfully or failed. ack and fail are only called for reliable spouts.

Resources:

  • IRichSpout: this is the interface that spouts must implement.
  • Guaranteeing message processing

Bolts

All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).

All of a topology's tuple processing happens in bolts. Bolts can do anything: filtering, functions, aggregations, joins, talking to databases, and more. A bolt can perform a simple stream transformation; complex transformations usually take multiple steps and therefore multiple bolts.

Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.

A bolt can emit to more than one stream: declare multiple streams with the declareStream method, and specify which stream to emit to when calling emit.

When you declare a bolt’s input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).

When you declare a bolt's input streams, you subscribe to specific streams of another component. To subscribe to all of another component's streams, you must subscribe to each one individually. InputDeclarer offers syntactic sugar for subscribing to a component's default stream: declarer.shuffleGrouping("1") subscribes to the default stream of component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).

The main method in bolts is the execute method which takes in as input a new tuple. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples). For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.

The main method in a bolt is execute, which takes a tuple as input. Bolts emit new tuples through an OutputCollector. A bolt must call ack on the OutputCollector for every tuple it processes so that Storm knows when tuples are completed. For the common case of processing an input tuple, emitting zero or more tuples based on it, and then acking it, Storm provides the IBasicBolt interface, which does the acking automatically.
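The reason IBasicBolt exists can be sketched in plain Java. The Collector and BasicBolt types below are stand-ins, not Storm's real interfaces: the wrapper runs the bolt's logic and then acks the input tuple on its behalf, which is the pattern the common case always needs.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch: a wrapper that acks the input tuple automatically after
// the bolt's execute logic runs, mirroring what IBasicBolt provides.
public class BasicBoltDemo {
    static class Collector {
        final List<String> emitted = new ArrayList<>();
        final List<String> acked = new ArrayList<>();
        void emit(String tuple) { emitted.add(tuple); }
        void ack(String tuple)  { acked.add(tuple); }
    }

    interface BasicBolt { void execute(String input, Collector out); }

    // Run the bolt, then ack the input so the user code never has to.
    static void runWithAutoAck(BasicBolt bolt, String input, Collector out) {
        bolt.execute(input, out);  // emit zero or more tuples
        out.ack(input);            // the wrapper acks automatically
    }

    public static void main(String[] args) {
        Collector out = new Collector();
        BasicBolt exclaim = (tuple, c) -> c.emit(tuple + "!");
        runWithAutoAck(exclaim, "hello", out);
        System.out.println(out.emitted + " " + out.acked); // [hello!] [hello]
    }
}
```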

Please note that OutputCollector is not thread-safe, and all emits, acks, and fails must happen on the same thread. Please refer to Troubleshooting for more details.

OutputCollector is not thread-safe, so all emits, acks, and fails must happen on the same thread.

Resources:

  • IRichBolt: this is general interface for bolts.
  • IBasicBolt: this is a convenience interface for defining bolts that do filtering or simple functions.
  • OutputCollector: bolts emit tuples to their output streams using an instance of this class
  • Guaranteeing message processing

Stream groupings

Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt’s tasks.

When defining a topology you must specify, for each bolt, exactly which streams it receives as input. A stream grouping defines how the stream is partitioned among the bolt's tasks.

There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:

Below are the eight built-in stream groupings; you can also define your own by implementing the CustomStreamGrouping interface.

  • Shuffle grouping: Tuples are randomly distributed across the bolt’s tasks in a way such that each bolt is guaranteed to get an equal number of tuples.

    Tuples are distributed randomly across the bolt's tasks, in a way that guarantees each task receives an equal number of tuples.

  • Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the “user-id” field, tuples with the same “user-id” will always go to the same task, but tuples with different “user-id”s may go to different tasks.

    The stream is partitioned by the fields named in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" always go to the same task, while tuples with different "user-id"s may go to different tasks.

  • Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.

    Like the fields grouping, the stream is partitioned by the fields named in the grouping, but the load is balanced between two downstream tasks, which gives better resource utilization when the incoming data is skewed.

  • All grouping: The stream is replicated across all the bolt’s tasks. Use this grouping with care.

    The stream is replicated to every task of the bolt, i.e. every task receives the same tuples. Use this grouping with care.

  • Global grouping: The entire stream goes to a single one of the bolt’s tasks. Specifically, it goes to the task with the lowest id.

    The entire stream goes to a single one of the bolt's tasks, specifically the task with the lowest id.

  • None grouping: This grouping specifies that you don’t care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).

    Currently, the none grouping is equivalent to the shuffle grouping.

  • Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).

    This is a special kind of grouping: the producer of a tuple decides which task of the consumer receives it. Direct groupings can only be declared on streams that were declared as direct streams, and tuples must be emitted to them with one of the emitDirect methods. A bolt can obtain the task ids of its consumers from the TopologyContext or from the return value of OutputCollector's emit method.

  • Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.

    If the target bolt has one or more tasks in the same worker process, tuples are shuffled only among those in-process tasks; otherwise this behaves like a normal shuffle grouping.
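To make the partitioning behavior of the groupings above concrete, here is a plain-Java sketch of how three of them could pick a target task index. This is illustrative only, with assumed method names; Storm's real partitioning lives in its grouping implementations.

```java
// Illustrative task selection for three of the built-in groupings.
public class GroupingDemo {
    // Fields grouping: hash the grouping field, so equal values always map
    // to the same task while different values may map to different tasks.
    static int fieldsGrouping(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Global grouping: the whole stream goes to the lowest-id task.
    static int globalGrouping(int numTasks) {
        return 0;
    }

    // Shuffle grouping: round-robin gives each task an equal share of tuples.
    private static int rr = -1;
    static int shuffleGrouping(int numTasks) {
        rr = (rr + 1) % numTasks;
        return rr;
    }

    public static void main(String[] args) {
        int tasks = 4;
        // same "user-id" -> same task, every time
        System.out.println(fieldsGrouping("user-42", tasks)
                        == fieldsGrouping("user-42", tasks)); // prints true
        System.out.println(globalGrouping(tasks));            // prints 0
        System.out.println(shuffleGrouping(tasks) + ","
                         + shuffleGrouping(tasks));           // prints 0,1
    }
}
```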

Resources:

  • TopologyBuilder: use this class to define topologies
  • InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt’s input streams and how those streams should be grouped
  • CoordinatedBolt: this bolt is useful for distributed RPC topologies and makes heavy use of direct streams and direct groupings

Reliability

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a “message timeout” associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.

Storm guarantees that every spout tuple is fully processed by the topology. It tracks every tuple in the tuple tree to determine whether processing succeeded. Every topology has a "message timeout" associated with it: if Storm does not detect that a spout tuple has completed within that timeout, it fails the tuple and replays it later.

To take advantage of Storm’s reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you’ve finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you’re finished with a tuple using the ack method.

To use Storm's reliability features, you must tell Storm when a new tuple is created in the tuple tree and when each individual tuple finishes processing. Both are done through the OutputCollector object that bolts use to emit tuples: anchoring happens in the emit method, and you declare that a tuple is finished with the ack method.
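The idea behind the tuple-tree tracking can be sketched with an XOR checksum, which is the trick Storm's acker is known to use: every tuple id is XORed in twice, once when the edge is anchored and once when it is acked, so the checksum returns to zero exactly when every tuple in the tree has been both emitted and acked. AckerDemo below is a simplified stand-in, not the real acker.

```java
// Simplified sketch of an acker tracking one spout tuple's tree.
public class AckerDemo {
    private long checksum = 0;

    void anchored(long tupleId) { checksum ^= tupleId; } // new edge in the tree
    void acked(long tupleId)    { checksum ^= tupleId; } // tuple finished

    // x ^ x == 0, so the checksum is 0 once every anchored id is also acked.
    boolean treeComplete() { return checksum == 0; }

    public static void main(String[] args) {
        AckerDemo acker = new AckerDemo();
        long a = 0x1234L, b = 0x5678L;   // ids; Storm uses random 64-bit ids
        acker.anchored(a);               // spout emits tuple a
        acker.anchored(b);               // a bolt emits b anchored on a
        acker.acked(a);
        System.out.println(acker.treeComplete()); // prints false: b pending
        acker.acked(b);
        System.out.println(acker.treeComplete()); // prints true: tree done
    }
}
```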

This is all explained in much more detail in Guaranteeing message processing.

Tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.

Each spout or bolt runs as multiple tasks across the cluster. Each task is a thread of execution, and the stream groupings determine how tuples are sent from one set of tasks to another. You set the parallelism of each spout or bolt with TopologyBuilder's setSpout and setBolt methods.

Workers

Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.

A topology runs in one or more workers. Each worker is a separate JVM that executes a subset of the topology's tasks. For example, if the topology's combined parallelism is 300 and 50 workers are allocated, each worker runs 6 tasks (as threads within the worker). Storm tries to distribute tasks evenly across the workers.
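The 300-tasks-over-50-workers example above can be checked with a few lines of Java. Dealing tasks round-robin is one simple way to get an even spread (an assumption for illustration, not Storm's actual scheduler):

```java
// Even spread of tasks over workers: 300 tasks / 50 workers = 6 each.
public class WorkerSpread {
    // Deal tasks round-robin and return how many each worker runs.
    static int[] spread(int totalTasks, int numWorkers) {
        int[] perWorker = new int[numWorkers];
        for (int task = 0; task < totalTasks; task++)
            perWorker[task % numWorkers]++;
        return perWorker;
    }

    public static void main(String[] args) {
        int[] counts = spread(300, 50);
        System.out.println(counts[0]); // prints 6
    }
}
```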

Resources:

  • Config.TOPOLOGY_WORKERS: this config sets the number of workers to allocate for executing the topology


Reference: http://storm.apache.org/documentation/Concepts.html
