Storm Parallelism

Understanding the Parallelism of a Storm Topology

What makes a running topology: worker processes, executors and tasks

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

1.      Worker processes

2.      Executors (threads)

3.      Tasks

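Here is a simple illustration of their relationships:

A machine in the cluster may run one or more worker processes for one or more topologies. Each worker process runs executors for one specific topology.

One or more executors may run within a single worker process. Each executor is a thread spawned by the worker process, and each executor runs one or more tasks of the same component (spout or bolt).

A task performs the actual data processing.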

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing: each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.

Configuring the parallelism of a topology

Note that in Storm's terminology "parallelism" is specifically used to describe the so-called parallelism hint, which means the initial number of executors (threads) of a component. In this document though we use the term "parallelism" in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. We will specifically call out when "parallelism" is used in the normal, narrow definition of Storm.

The following sections give an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the sections below list only some of them. Storm currently has the following order of precedence for configuration settings, where a setting further to the right overrides one further to the left: defaults.yaml < storm.yaml < topology-specific configuration < internal component-specific configuration < external component-specific configuration.
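The order of precedence can be pictured as merging configuration layers from lowest to highest priority. The following is only a minimal sketch of that idea in plain Java, not Storm's own implementation: the class ConfigPrecedence, the method resolve and the sample values are hypothetical and chosen purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics "the later layer wins" merging, not Storm's actual code.
final class ConfigPrecedence {

    // Merge layers given in order of increasing precedence;
    // a key set in a later layer overrides the same key from an earlier layer.
    static Map<String, Object> resolve(List<Map<String, Object>> layersLowToHigh) {
        Map<String, Object> effective = new LinkedHashMap<>();
        for (Map<String, Object> layer : layersLowToHigh) {
            effective.putAll(layer);
        }
        return effective;
    }

    public static void main(String[] args) {
        // Hypothetical values for the same key at three precedence levels.
        Map<String, Object> defaultsYaml = Map.of("topology.workers", 1);
        Map<String, Object> stormYaml = Map.of("topology.workers", 2);
        Map<String, Object> topologyConf = Map.of("topology.workers", 4);

        Map<String, Object> effective =
                resolve(List.of(defaultsYaml, stormYaml, topologyConf));
        System.out.println(effective.get("topology.workers")); // prints 4
    }
}
```

Keys that a higher layer does not set simply fall through from the lower layers, which is why a cluster-wide default stays in effect until a topology overrides it.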

Number of worker processes

·        Description: How many worker processes to create for the topology across machines in the cluster.

·        Configuration option: TOPOLOGY_WORKERS

·        How to set in your code (examples):

o   Config#setNumWorkers

Number of executors (threads)

·        Description: How many executors to spawn per component.

·        Configuration option: None (pass the parallelism_hint parameter to setSpout or setBolt)

·        How to set in your code (examples):

o   TopologyBuilder#setSpout()

o   TopologyBuilder#setBolt()

o   Note that as of Storm 0.8 the parallelism_hint parameter specifies the initial number of executors (not tasks!) for that bolt.

Number of tasks

·        Description: How many tasks to create per component.

·        Configuration option: TOPOLOGY_TASKS

·        How to set in your code (examples):

o   ComponentConfigurationDeclarer#setNumTasks()

Here is an example code snippet to show these settings in practice:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

In the above code we configured Storm to run the bolt GreenBolt with an initial number of two executors and four associated tasks. Storm will run two tasks per executor (thread). If you do not explicitly configure the number of tasks, Storm will run by default one task per executor.
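The "two tasks per executor" arithmetic can be sketched in plain Java. This is a minimal illustration that assumes tasks are spread as evenly as possible over executors; the class TaskLayout and the method tasksPerExecutor are hypothetical names, and the code does not reproduce Storm's actual assignment logic.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: assumes an even spread of tasks over executors.
final class TaskLayout {

    // Returns how many tasks each executor would run.
    // Enforces the condition #threads <= #tasks described above.
    static List<Integer> tasksPerExecutor(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numExecutors > numTasks) {
            throw new IllegalArgumentException("need 1 <= #executors <= #tasks");
        }
        int base = numTasks / numExecutors;      // every executor gets at least this many
        int remainder = numTasks % numExecutors; // the first 'remainder' executors get one more
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < numExecutors; i++) {
            counts.add(i < remainder ? base + 1 : base);
        }
        return counts;
    }

    public static void main(String[] args) {
        // GreenBolt from the snippet above: 4 tasks on 2 executors.
        System.out.println(tasksPerExecutor(4, 2)); // prints [2, 2]
        // Default case: as many tasks as executors, so one task per executor.
        System.out.println(tasksPerExecutor(6, 6)); // prints [1, 1, 1, 1, 1, 1]
    }
}
```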

Example of a running topology

The following illustration shows what a simple topology would look like in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.

The GreenBolt was configured as per the code snippet above, whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
    );
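To get a feel for what this configuration adds up to, the numbers can be totalled in a few lines of plain Java. This is a rough sketch: the class ParallelismSummary and its total method are hypothetical, and the per-worker figure assumes that executors are spread evenly across the worker processes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: adds up the numbers from the topology configured above.
final class ParallelismSummary {

    // Sums the per-component counts (executors or tasks).
    static int total(Map<String, Integer> perComponent) {
        int sum = 0;
        for (int count : perComponent.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        int workers = 2; // conf.setNumWorkers(2)

        Map<String, Integer> executors = new LinkedHashMap<>();
        executors.put("blue-spout", 2);  // parallelism hint 2
        executors.put("green-bolt", 2);  // parallelism hint 2
        executors.put("yellow-bolt", 6); // parallelism hint 6

        Map<String, Integer> tasks = new LinkedHashMap<>(executors);
        tasks.put("green-bolt", 4);      // setNumTasks(4); the others default to one task per executor

        System.out.println("executors: " + total(executors));                     // prints executors: 10
        System.out.println("tasks: " + total(tasks));                             // prints tasks: 12
        System.out.println("executors per worker: " + total(executors) / workers); // prints executors per worker: 5
    }
}
```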

And of course Storm comes with additional configuration settings to control the parallelism of a topology, including:

·        TOPOLOGY_MAX_TASK_PARALLELISM: This setting puts a ceiling on the number of executors that can be spawned for a single component. It is typically used during testing to limit the number of threads spawned when running a topology in local mode. You can set this option via e.g. Config#setMaxTaskParallelism().
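The effect of such a ceiling can be sketched as taking the smaller of the requested parallelism hint and the configured maximum. The helper below is hypothetical and only illustrates the idea of a cap; it is not how Storm applies the setting internally.

```java
// Illustrative only: a cap on the number of executors per component.
final class ExecutorCap {

    // Hypothetical helper: the requested hint, limited by the configured ceiling.
    static int effectiveExecutors(int parallelismHint, int maxTaskParallelism) {
        return Math.min(parallelismHint, maxTaskParallelism);
    }

    public static void main(String[] args) {
        // With a ceiling of 3, a hint of 6 is reduced while a hint of 2 is left alone.
        System.out.println(effectiveExecutors(6, 3)); // prints 3
        System.out.println(effectiveExecutors(2, 3)); // prints 2
    }
}
```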

How to change the parallelism of a running topology

A nifty feature of Storm is that you can increase or decrease the number of worker processes and/or executors without being required to restart the cluster or the topology. The act of doing so is called rebalancing.

You have two options to rebalance a topology:

1.      Use the Storm web UI to rebalance the topology.

2.      Use the CLI tool storm rebalance as described below.

Here is an example of using the CLI tool:

## Reconfigure the topology "mytopology" to use 5 worker processes,
## the spout "blue-spout" to use 3 executors and
## the bolt "yellow-bolt" to use 10 executors.

$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
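If you want to script rebalancing, the command line shown above can be assembled programmatically. The sketch below only builds the argument list and uses just the -n and -e options from the example; the class RebalanceCommand and its build method are hypothetical names, and actually running the command is left to the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: assembles the arguments of a "storm rebalance" call.
final class RebalanceCommand {

    // Builds: storm rebalance <topology> -n <workers> -e <component>=<executors> ...
    static List<String> build(String topology, int workers, Map<String, Integer> executors) {
        List<String> cmd = new ArrayList<>(
                List.of("storm", "rebalance", topology, "-n", String.valueOf(workers)));
        for (Map.Entry<String, Integer> entry : executors.entrySet()) {
            cmd.add("-e");
            cmd.add(entry.getKey() + "=" + entry.getValue());
        }
        return cmd;
    }

    public static void main(String[] args) {
        Map<String, Integer> executors = new LinkedHashMap<>(); // keeps insertion order
        executors.put("blue-spout", 3);
        executors.put("yellow-bolt", 10);

        // prints: storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
        System.out.println(String.join(" ", build("mytopology", 5, executors)));
    }
}
```

The resulting list could be handed to java.lang.ProcessBuilder on a machine where the storm client is installed.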

