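Twitter Storm's New Weapon: the Pluggable Scheduler [Repost]

Reposted from: http://www.51studyit.com/html/notes/20140403/51.html

Versions:

Storm 0.9.1, Kafka 0.8.1

The pluggable task scheduler (Pluggable Scheduler) has been implemented and will ship in version 0.8.0. This article offers an early look at the new feature.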

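Before the Pluggable Scheduler, Nimbus was responsible for assigning tasks for every topology a user submitted to Twitter Storm. Nimbus's assignment algorithm is quite capable; its main characteristics are:

When slots are plentiful, it ensures that the tasks of all topologies are spread evenly across all machines in the cluster.
When slots are scarce, it packs all of a topology's tasks into the few slots that exist. This is not the ideal state, so:
- When Nimbus later finds spare slots, it reassigns the topology's tasks onto them to reach the ideal state.
When there are no slots at all, it does nothing.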
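
The "spread evenly" behaviour in the first point can be pictured with a small round-robin sketch. This is not Nimbus's actual algorithm, only an illustration of the idea; the class `EvenSpreadSketch`, the method `spread`, and the use of plain integers for task ids and strings for slots are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is not Storm's real scheduler code.
final class EvenSpreadSketch {

    // Deals task ids out to slots in round-robin order, so slot sizes differ by at most one.
    static Map<String, List<Integer>> spread(List<Integer> taskIds, List<String> slots) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<String, List<Integer>>();
        if (slots.isEmpty()) {
            return assignment; // no slots at all: do nothing
        }
        for (String slot : slots) {
            assignment.put(slot, new ArrayList<Integer>());
        }
        for (int i = 0; i < taskIds.size(); i++) {
            String slot = slots.get(i % slots.size());
            assignment.get(slot).add(taskIds.get(i));
        }
        return assignment;
    }
}
```

With five tasks and two slots, the first slot would receive tasks 1, 3 and 5 and the second would receive tasks 2 and 4.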

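In most cases the default task assignment mechanism is enough. Some scenarios, however, are beyond it, for example:

You want a spout assigned to a fixed machine, for instance because that is where your data lives.
You have two CPU-hungry topologies and do not want them running on the same machine.
…

In cases like these the default mechanism, powerful as it is, cannot help, because it simply does not take such constraints into account. So we designed the new Pluggable Scheduler mechanism, which lets users write their own task assignment algorithm, a Scheduler, to meet their specific needs. Let's get hands-on and implement the first scenario above. To make the rest of the discussion concrete, we refine the requirement: assign the component named special-spout to the supervisor named special-supervisor.

Implementing a Scheduler is simple: just implement IScheduler.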


public interface IScheduler {
    /**
     * Set assignments for the topologies which needs scheduling. The new assignments is available
     * through <code>cluster.getAssignments()</code>
     *
     *@param topologies all the topologies in the cluster, some of them need schedule. Topologies object here
     *       only contain static information about topologies. Information like assignments, slots are all in
     *       the <code>cluster</code> object.
     *@param cluster the cluster these topologies are running in. <code>cluster</code> contains everything user
     *       need to develop a new scheduling logic. e.g. supervisors information, available slots, current
     *       assignments for all the topologies etc. User can set the new assignment for topologies using
     *       <code>cluster.setAssignmentById</code>
     */
    public void schedule(Topologies topologies, Cluster cluster);
}

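The interface receives two parameters. Topologies holds information about every topology running in the cluster: the StormTopology object, the configuration, and the mapping from task to component (bolt or spout) id. Cluster holds the full state of the cluster: the task assignments of every topology, all supervisor information, and so on. That is enough to implement the requirement above, so let's get started.

Finding our target topology

First we need to determine whether our topology has been submitted to the cluster. That is easy: look it up in the topologies object; if it is found, it has been submitted.

// Gets the topology which we want to schedule
TopologyDetails topology = topologies.getByName("special-topology");

If the returned topology is not null, the topology has been submitted.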

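Does the target topology need scheduling?

Next we check whether this topology needs task assignment at all, since it may already have been assigned. The Cluster object provides an API for this:

boolean needsScheduling = cluster.needsScheduling(topology);

Note that almost every API relevant to writing a Scheduler is defined in the Cluster class; once you are familiar with that class, writing a Scheduler should come easily. If the topology does need scheduling, we still have to find out which tasks need it, because some of them may already have been assigned:

// find out all the needs-scheduling components of this topology
Map<String, List<Integer>> componentToTasks = cluster.getNeedsSchedulingComponentToTasks(topology);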

Does our target spout need scheduling?

Our goal is to run the component named special-spout on the supervisor named special-supervisor, so we need to see whether any of these tasks belong to special-spout. The componentToTasks map returned above maps component ids to task ids, so looking up special-spout is straightforward:

List<Integer> tasks = componentToTasks.get("special-spout");
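
One detail worth guarding against: if none of the special-spout tasks currently needs scheduling, the component presumably has no entry in the map, and `Map.get` then returns null. A minimal sketch of a defensive lookup over a plain `Map` follows; the class `ComponentLookupSketch` and the helper `tasksFor` are hypothetical names for this example and are not part of Storm.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Storm API.
final class ComponentLookupSketch {

    // Returns the task ids for a component, or an empty list when the map has no entry for it.
    static List<Integer> tasksFor(Map<String, List<Integer>> componentToTasks, String componentId) {
        List<Integer> tasks = componentToTasks.get(componentId);
        return tasks != null ? tasks : Collections.<Integer>emptyList();
    }
}
```

Callers can then test `isEmpty()` on the result without a separate null check.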

Finding the target supervisor

Having found the tasks to assign, we still need to find our special-supervisor. Cluster again provides a convenient method:


// find out our "special-supervisor" from the supervisor metadata
Collection<SupervisorDetails> supervisors = cluster.getSupervisors().values();
SupervisorDetails specialSupervisor = null;
for (SupervisorDetails supervisor : supervisors) {
    Map meta = (Map) supervisor.getSchedulerMeta();
    if (meta.get("name").equals("special-supervisor")) {
        specialSupervisor = supervisor;
        break;
    }
}

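The line Map meta = (Map) supervisor.getSchedulerMeta(); deserves a note. We have been talking about a supervisor named special-supervisor, but supervisors in Storm do not actually have names. The name here comes from supervisor.getSchedulerMeta: schedulerMeta is metadata configured on the supervisor for the scheduler's use, and you can put anything you like in it. In this example, I configured the following in storm.yaml:

supervisor.scheduler.meta:
  name: "special-supervisor"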
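
The lookup loop calls `meta.get("name").equals(...)` directly, which would throw a NullPointerException for any supervisor whose metadata is missing or has no name entry. A null-safe variant of the same matching logic can be sketched over plain maps; here the supervisor-id-to-meta map, the class `SupervisorMetaSketch`, and the method `findByName` are stand-ins invented for this example, not Storm types.

```java
import java.util.Map;

// Hypothetical stand-in for the supervisor lookup; uses plain maps instead of Storm classes.
final class SupervisorMetaSketch {

    // Returns the id of the first supervisor whose meta has the given "name", or null if none matches.
    static String findByName(Map<String, Map<String, Object>> metaBySupervisorId, String wantedName) {
        for (Map.Entry<String, Map<String, Object>> entry : metaBySupervisorId.entrySet()) {
            Map<String, Object> meta = entry.getValue();
            // Calling equals on the wanted name tolerates a meta map without a "name" key.
            if (meta != null && wantedName.equals(meta.get("name"))) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```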

With this in place, meta.get("name").equals("special-supervisor") finds our special-supervisor. Keep in mind, though, that the cluster runs many topologies, and this supervisor's slots may well have been taken by other topologies, so we need to check whether any slot is still free:

List<WorkerSlot> availableSlots = cluster.getAvailableSlots(specialSupervisor);

If availableSlots is empty, there are no free slots. What then? What if other topologies have taken them all? Simple: evict them.


// if there are no available slots on this supervisor, free some.
if (availableSlots.isEmpty() && !tasks.isEmpty()) {
    // getAllPorts() returns port numbers, one per slot on this supervisor
    for (Integer port : specialSupervisor.getAllPorts()) {
        cluster.freeSlot(new WorkerSlot(specialSupervisor.getId(), port));
    }
}

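Final step: assignment

At this point we have the tasks to assign and the slot to assign them to; all that is left is the assignment itself (note that the code is simplified to keep the example short):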


// re-get the availableSlots
availableSlots = cluster.getAvailableSlots(specialSupervisor);
// since it is just a demo, to keep things simple, we assign all the
// tasks into one slot.
cluster.assign(availableSlots.get(0), topology.getId(), tasks);

Goal achieved! With the call to cluster.assign, our special-spout has been assigned to special-supervisor. Not hard, is it?
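
Because the demo is simplified, `availableSlots.get(0)` would throw if the supervisor had no ports at all. The evict-then-pick logic, with that guard added, can be sketched with a tiny in-memory model. Everything here (the class `SlotPickSketch`, its fields, and `claimFirstPort`) is a hypothetical stand-in for the cluster state, not Storm's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of one supervisor's slots; not Storm code.
final class SlotPickSketch {
    private final List<Integer> allPorts;
    private final Set<Integer> usedPorts;

    SlotPickSketch(List<Integer> allPorts, List<Integer> usedPorts) {
        this.allPorts = new ArrayList<Integer>(allPorts);
        this.usedPorts = new TreeSet<Integer>(usedPorts);
    }

    // Ports that no worker currently occupies, in configuration order.
    List<Integer> availablePorts() {
        List<Integer> free = new ArrayList<Integer>();
        for (Integer port : allPorts) {
            if (!usedPorts.contains(port)) {
                free.add(port);
            }
        }
        return free;
    }

    // Mirrors the walkthrough: evict everything if nothing is free, then take the first free port.
    // Returns null when the supervisor has no ports at all.
    Integer claimFirstPort() {
        if (availablePorts().isEmpty()) {
            usedPorts.clear(); // plays the role of calling freeSlot on every port
        }
        List<Integer> free = availablePorts();
        if (free.isEmpty()) {
            return null;
        }
        Integer port = free.get(0);
        usedPorts.add(port);
        return port;
    }
}
```

Returning null rather than throwing lets the caller skip the special placement and leave the tasks to whatever scheduler runs next.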

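Who assigns the other tasks?

There is one thing not to forget: we only assigned the tasks of special-spout. Who assigns the rest? If you do not care where they go, hand them to the system's default scheduler. We have wrapped the default scheduler in backtype.storm.scheduler.EvenScheduler, so a simple call is enough:

new backtype.storm.scheduler.EvenScheduler().schedule(topologies, cluster);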
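
The pattern here is "place what you care about first, then let a default step handle whatever is still unassigned". A plain-Java sketch of that ordering follows, assuming (as the text implies) that the default step only touches tasks the custom step left alone. The class `FallbackSketch` and its parameters are hypothetical and only model the idea; they do not reflect how EvenScheduler is implemented.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "custom placement first, default placement for the rest".
final class FallbackSketch {

    static Map<Integer, String> schedule(List<Integer> allTasks, List<Integer> pinnedTasks,
                                         String pinnedSlot, List<String> otherSlots) {
        Map<Integer, String> taskToSlot = new LinkedHashMap<Integer, String>();
        // Custom step: place only the tasks we care about.
        for (Integer task : pinnedTasks) {
            taskToSlot.put(task, pinnedSlot);
        }
        // Default step: deal whatever is still unassigned round-robin over the other slots.
        int next = 0;
        for (Integer task : allTasks) {
            if (taskToSlot.containsKey(task) || otherSlots.isEmpty()) {
                continue;
            }
            taskToSlot.put(task, otherSlots.get(next % otherSlots.size()));
            next++;
        }
        return taskToSlot;
    }
}
```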

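Letting Storm know about our Scheduler

One thing we have not mentioned yet: once our custom Scheduler is written, how do we get Storm to know about it and use it? Two steps:

Put the jar containing the Scheduler into $STORM_HOME/lib
Add the following to storm.yaml: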

storm.scheduler: "storm.DemoScheduler"

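Storm will then use your storm.DemoScheduler for task assignment instead of the default system Scheduler.

Summary:

The above is only an overview for version 0.8. For version 0.9, we add the following notes.

1. First, the class: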

package com.wy.storm.topology;

import java.util.Collection;
import java.util.List;
import java.util.Map;

import backtype.storm.scheduler.Cluster;
import backtype.storm.scheduler.ExecutorDetails;
import backtype.storm.scheduler.IScheduler;
import backtype.storm.scheduler.SupervisorDetails;
import backtype.storm.scheduler.Topologies;
import backtype.storm.scheduler.TopologyDetails;
import backtype.storm.scheduler.WorkerSlot;

public class TestScheduler implements IScheduler {

    @Override
    public void prepare(Map conf) {
        // Nothing to configure for this demo.
    }

    @Override
    public void schedule(Topologies topologies, Cluster cluster) {
        // Get the topology we want to place ourselves.
        TopologyDetails topology = topologies.getByName("special-topology");
        // Only act if the topology is submitted and still needs scheduling.
        if (topology != null && cluster.needsScheduling(topology)) {
            // All executors that still need scheduling, grouped by component id.
            Map<String, List<ExecutorDetails>> componentToExecutors =
                    cluster.getNeedsSchedulingComponentToExecutors(topology);
            // Executors of the target spout; null if none of them needs scheduling.
            List<ExecutorDetails> executors = componentToExecutors.get("special-spout");

            // Find the supervisor whose storm.yaml contains:
            //   supervisor.scheduler.meta:
            //     name: "special-supervisor"
            Collection<SupervisorDetails> supervisors = cluster.getSupervisors().values();
            SupervisorDetails specialSupervisor = null;
            for (SupervisorDetails supervisor : supervisors) {
                Map meta = (Map) supervisor.getSchedulerMeta();
                // Null-safe match: a supervisor may have no meta or no "name" entry.
                if (meta != null && "special-supervisor".equals(meta.get("name"))) {
                    System.out.println("---------special-supervisor success---------");
                    specialSupervisor = supervisor;
                    break;
                }
            }

            if (specialSupervisor != null && executors != null && !executors.isEmpty()) {
                System.out.println("---------specialSupervisor!=null success---------");
                // Check whether the supervisor has a free slot.
                List<WorkerSlot> availableSlots = cluster.getAvailableSlots(specialSupervisor);
                // No free slot: evict whatever occupies this supervisor's slots.
                if (availableSlots.isEmpty()) {
                    for (Integer port : specialSupervisor.getAllPorts()) {
                        cluster.freeSlot(new WorkerSlot(specialSupervisor.getId(), port));
                    }
                    availableSlots = cluster.getAvailableSlots(specialSupervisor);
                }
                // Assign all of the spout's executors to the first free slot, if there is one.
                if (!availableSlots.isEmpty()) {
                    cluster.assign(availableSlots.get(0), topology.getId(), executors);
                }
            }
        }
        // Leave everything else to Storm's default scheduler. This runs on every call,
        // so other topologies are still scheduled when special-topology is absent.
        // Enable this class in storm.yaml with:
        //   storm.scheduler: "com.wy.storm.topology.TestScheduler"
        new backtype.storm.scheduler.EvenScheduler().schedule(topologies, cluster);
    }

}

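To write the code above, first add the Storm jar as a dependency. When the code is finished, package it as a jar and put it in the lib directory of every machine in the Storm cluster.

Configure storm.yaml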

x00:

supervisor.scheduler.meta:
  name: "special-supervisor0"
storm.scheduler: "com.wy.storm.topology.TestScheduler"

x01:

supervisor.scheduler.meta:
  name: "special-supervisor"
storm.scheduler: "com.wy.storm.topology.TestScheduler"

x02:

supervisor.scheduler.meta:
  name: "special-supervisor2"
storm.scheduler: "com.wy.storm.topology.TestScheduler"

Note that each supervisor has a different name.

2. Testing

Topology

package com.wy.storm.topology;

import storm.kafka.Broker;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StaticHosts;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.GlobalPartitionInformation;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;

import com.google.common.collect.ImmutableList;
import com.wy.storm.bolt.Counter;

public class CounterTopology {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            /*GlobalPartitionInformation gi = new GlobalPartitionInformation();
            gi.addPartition(0, new Broker("x00",9092));
            gi.addPartition(1, new Broker("x01",9092));
            gi.addPartition(2, new Broker("x02",9092));*/
            String kafkaZookeeper = "x00:2181,x01:2181,x02:2181";
            BrokerHosts brokerHosts = new ZkHosts(kafkaZookeeper);
            SpoutConfig kafkaConf = new SpoutConfig(brokerHosts, "test3", "/newkafka", "id");

            kafkaConf.scheme = new SchemeAsMultiScheme(new StringScheme());

            kafkaConf.zkServers = ImmutableList.of("x00", "x01", "x02");
            kafkaConf.zkPort = 2181;

            kafkaConf.forceFromStart = true;

            KafkaSpout kafkaSpout = new KafkaSpout(kafkaConf);

            TopologyBuilder builder = new TopologyBuilder();
            // The component id must match the one TestScheduler looks up.
            builder.setSpout("special-spout", kafkaSpout, 10);
            builder.setBolt("printer", new Counter(), 45).shuffleGrouping("special-spout");

            Config config = new Config();
            config.setDebug(true);

            if (args != null && args.length > 0) {
                config.setNumWorkers(8);

                // TestScheduler finds the topology by name, so pass special-topology as args[0].
                StormSubmitter.submitTopology(args[0], config, builder.createTopology());
            } else {
                config.setMaxTaskParallelism(3);

                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("special-topology", config, builder.createTopology());

                Thread.sleep(500000);

                cluster.shutdown();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

}

Bolt

package com.wy.storm.bolt;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

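public class Counter extends BaseBasicBolt {

    // Counts tuples seen by this bolt's executors in the current worker JVM.
    private static long i = 0;

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        System.out.println("i=" + i++);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer arg0) {
        // This bolt emits nothing, so there are no output fields to declare.
    }

}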

3. Package the topology above and run it.
