Storm Example: Trident

Storm 0.7 introduced transactional topologies for scenarios with extremely strict message-processing requirements, such as statistics applications where the counts must be exactly right: no tuple may be counted twice or dropped. You can think of the ACID properties of database transactions to deepen your understanding of transactional topologies. Since Storm 0.8, transactional processing has been encapsulated in Trident. Trident provides a mature batch-oriented API for processing tuples in batches: you can group them (group by), join them, aggregate them, and run functions and filters over them. Trident also encapsulates DRPC functionality and supports remote DRPC calls. This article introduces Trident topologies through an example.

Trident Example
Create a new project, create a lib/ folder next to src/ in the project, and copy all the jars from the lib/ directory under the Storm installation root into the new project's lib/ folder. In Eclipse, select all the jars, right-click, and choose Build Path -> Add to Build Path.
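If you build with Maven instead of managing the jars by hand, the equivalent is a storm-core dependency (the version below is only an example; use the one matching your cluster):

<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>1.0.2</version>
    <scope>provided</scope>
</dependency>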

Spout
The FixedBatchSpout used below ships with the Storm source code. It emits sentences into the topology; its setCycle() method configures whether the spout cycles through the sentence list repeatedly.

FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3, new Values("the cow jumped over the moon"),...);

The FixedBatchSpout parameter "sentence" names the output field, the number 3 is the maximum batch size, i.e. each emitted batch contains at most 3 tuples, and the remaining parameters are the list of contents the spout will emit.
To stress the point: 3 is only the upper bound on the batch size; the actual batch is also limited by how many entries remain in the content list, as the emitBatch() method in the source below shows. A batch here means that tuples are processed in groups: each transaction processes one batch, and separate batches can be processed in parallel, which improves resource utilization and reduces waiting between dependent transactions.
FixedBatchSpout implements the IBatchSpout interface; IBatchSpout is a non-transactional spout that emits one batch of tuples at a time.
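To make the slicing concrete, the following standalone sketch mimics the loop used by emitBatch() (plain Java for illustration, not Storm code): with five sentences and maxBatchSize = 3, it prints one batch of 3 entries followed by one batch of 2.

//BatchSliceDemo.java -- simulation of the slicing loop in emitBatch(), for illustration only
public class BatchSliceDemo {
    public static void main(String[] args) {
        String[] outputs = {"s1", "s2", "s3", "s4", "s5"}; // stands in for the 5 sentences
        int maxBatchSize = 3;
        int index = 0;
        while (index < outputs.length) {
            StringBuilder batch = new StringBuilder("batch:");
            // same loop condition as FixedBatchSpout.emitBatch()
            for (int i = 0; index < outputs.length && i < maxBatchSize; index++, i++) {
                batch.append(' ').append(outputs[index]);
            }
            System.out.println(batch); // prints "batch: s1 s2 s3", then "batch: s4 s5"
        }
    }
}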
The source of FixedBatchSpout is as follows:

//FixedBatchSpout.java
package org.apache.storm.trident.testing;

import org.apache.storm.Config;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.tuple.Fields;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.HashMap;

import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.spout.IBatchSpout;

public class FixedBatchSpout implements IBatchSpout {

    Fields fields;
    List<Object>[] outputs;
    int maxBatchSize;
    HashMap<Long, List<List<Object>>> batches = new HashMap<Long, List<List<Object>>>();

    public FixedBatchSpout(Fields fields, int maxBatchSize, List<Object>... outputs) {
        this.fields = fields;
        this.outputs = outputs;
        this.maxBatchSize = maxBatchSize;
    }

    int index = 0;
    boolean cycle = false;

    public void setCycle(boolean cycle) {
        this.cycle = cycle;
    }

    @Override
    public void open(Map conf, TopologyContext context) {
        index = 0;
    }

    @Override
    public void emitBatch(long batchId, TridentCollector collector) {
        // Batches are cached by batchId so that a failed batch can be
        // replayed later with exactly the same tuples.
        List<List<Object>> batch = this.batches.get(batchId);
        if(batch == null){
            batch = new ArrayList<List<Object>>();
            if(index>=outputs.length && cycle) {
                index = 0; // wrap around when cycling is enabled
            }
            // Take at most maxBatchSize entries from the remaining outputs.
            for(int i=0; index < outputs.length && i < maxBatchSize; index++, i++) {
                batch.add(outputs[index]);
            }
            this.batches.put(batchId, batch);
        }
        for(List<Object> list : batch){
            collector.emit(list);
        }
    }

    @Override
    public void ack(long batchId) {
        // The batch completed successfully; the cached copy is no longer needed.
        this.batches.remove(batchId);
    }

    @Override
    public void close() {
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.setMaxTaskParallelism(1);
        return conf;
    }

    @Override
    public Fields getOutputFields() {
        return fields;
    }

}

Batch Processing API
In Trident, operations such as each() and stateQuery() play a role similar to bolts in a plain Storm topology: they implement grouping, aggregation, queries, and so on over the data.

TridentState wordCounts = topology.newStream("spout1", spout) 
    .parallelismHint(16)
    .each(new Fields("sentence"),new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(),new Count(), new Fields("count"))
    .parallelismHint(16);

In this example the TridentState object represents the counts of all words, and it is what the DRPC service queries.
newStream(): creates a new stream in the topology, reading data from the input source
parallelismHint(): sets the degree of parallelism
each() combined with a function (or a filter):
For example: each(new Fields("sentence"), new Split(), new Fields("word")) // for each input "sentence" field, call the Split() function and emit the result as a new field "word"
For example: each(new Fields("sentence"), new MyFilter()) // for each input "sentence" field, apply the MyFilter() filter (a sketch of such a filter appears at the end of the Filters section below)
groupBy(): groups the stream by the given fields
persistentAggregate(): an aggregation that persists its result into a given storage backend
stateQuery(): queries a TridentState object that was built earlier
Filters
The implementation of FilterNull is as follows:

//FilterNull.java
package org.apache.storm.trident.operation.builtin;

import org.apache.storm.trident.operation.BaseFilter;
import org.apache.storm.trident.tuple.TridentTuple;

public class FilterNull extends BaseFilter {
    @Override
    public boolean isKeep(TridentTuple tuple) {
        // Drop the tuple if any of its values is null.
        for(Object o: tuple) {
            if(o==null) return false;
        }
        return true;
    }
}
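The MyFilter mentioned in the each() example earlier is not a Storm built-in; it stands for any user-defined filter. A minimal sketch of such a filter might look like this (the class name and the keep-condition are illustrative assumptions):

//MyFilter.java -- hypothetical custom filter, not part of Storm
import org.apache.storm.trident.operation.BaseFilter;
import org.apache.storm.trident.tuple.TridentTuple;

public class MyFilter extends BaseFilter {
    @Override
    public boolean isKeep(TridentTuple tuple) {
        // Keep only sentences containing at least two words (an arbitrary example rule).
        String sentence = tuple.getString(0);
        return sentence != null && sentence.trim().split(" ").length >= 2;
    }
}

It would be attached to a stream with each(new Fields("sentence"), new MyFilter()).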

Complete Example Code

//TridentWordCount.java
package org.apache.storm.starter.trident;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.LocalDRPC;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.operation.builtin.FilterNull;
import org.apache.storm.trident.operation.builtin.MapGet;
import org.apache.storm.trident.operation.builtin.Sum;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;

public class TridentWordCount {

    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology buildTopology(LocalDRPC drpc) {
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3, new Values("the cow jumped over the moon"),
            new Values("the man went to the store and bought some candy"), new Values("four score and seven years ago"),
            new Values("how many apples can you eat"), new Values("to be or not to be the person"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        TridentState wordCounts = topology.newStream("spout1", spout) // newStream creates a stream in the topology, reading from the input source (FixedBatchSpout)
            .parallelismHint(16)
            .each(new Fields("sentence"),new Split(), new Fields("word")) // each(): for every input "sentence" field, call Split() to produce "word" tuples
            .groupBy(new Fields("word"))
            .persistentAggregate(new MemoryMapState.Factory(),new Count(), new Fields("count"))
            .parallelismHint(16);

        topology.newDRPCStream("words", drpc) // "words" is the DRPC function name, matching the "words" in drpc.execute("words", "cat the dog jumped")
            .each(new Fields("args"), new Split(), new Fields("word")) // split the input argument "args" with Split() and emit the pieces as field "word"
            .groupBy(new Fields("word")) // repartition by "word" so that identical words land in the same partition
            .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
            .each(new Fields("count"),new FilterNull()) // drop null counts with FilterNull() (words that were never counted)
            .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

        return topology.build();
    }

    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setMaxSpoutPending(20);
        if (args.length == 0) {
            LocalDRPC drpc = new LocalDRPC();
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("wordCounter", conf, buildTopology(drpc));
            for (int i = 0; i < 100; i++) {
                System.out.println("DRPC RESULT: " + drpc.execute("words", "cat the dog jumped"));
                Thread.sleep(1000);
            }
        } else {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, buildTopology(null));
        }
    }
}

Package the code into a jar and submit the topology to Storm; you can then watch the live counts in the Storm logs under ./storm/logs/workers-artifacts/*.
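For example, the topology can be submitted with the storm jar command (the jar name here is an assumption; use whatever your build produces):

storm jar trident-demo.jar org.apache.storm.starter.trident.TridentWordCount wordCounter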
You can also query the per-word counts remotely through DRPC.
Querying with a DRPC client

//TestDRPC.java
import org.apache.storm.Config;
import org.apache.storm.utils.DRPCClient;
import org.apache.storm.utils.Utils;

public class TestDRPC {

    public static void main(String[] args) throws Exception {
        // Load the cluster configuration (defaults.yaml plus storm.yaml) for the client.
        Config conf = new Config();
        conf.putAll(Utils.readStormConfig());
        DRPCClient client = new DRPCClient(conf, "localhost", 3772); // 3772 is the DRPC server's default port
        System.out.println("DRPC result: " + client.execute("words", "the man storm"));
    }
}

Storm splits the query argument "the man storm" (via each(new Fields("args"), new Split(), new Fields("word"))) and looks each word up through stateQuery().
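Note that remote DRPC queries require a running DRPC server. A minimal setup sketch (host names depend on your cluster): list the DRPC hosts in storm.yaml and start the daemon with the storm drpc command.

# storm.yaml
drpc.servers:
    - "localhost"

After the configuration is in place, run storm drpc on those hosts before issuing client queries.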

Finally, on top of the Trident API you can implement functions (e.g. Split), filters (e.g. FilterNull), partition aggregations (e.g. persistentAggregate), state queries (e.g. stateQuery), projections, repartitioning, merges, joins, and so on; a few of these are sketched below. See the corresponding interfaces for details; reference [1] also gives a brief introduction.
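As a rough illustration of some of these remaining operations (a sketch under stated assumptions: s1 is a Stream with fields "key" and "a", s2 a Stream with fields "key" and "b"; none of this is part of the word-count example):

//TridentOpsSketch.java -- illustrative fragment only
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.tuple.Fields;

public class TridentOpsSketch {
    // s1 is assumed to carry fields ("key", "a"); s2 carries ("key", "b").
    static void sketch(TridentTopology topology, Stream s1, Stream s2) {
        Stream projected = s1.project(new Fields("key"));  // projection: keep only the "key" field
        Stream shuffled  = s1.shuffle();                   // repartitioning: redistribute tuples randomly
        Stream joined    = topology.join(s1, new Fields("key"),
                                         s2, new Fields("key"),
                                         new Fields("key", "a", "b")); // inner join on "key"
    }
}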


1. Zhao Bixia, Cheng Liming. 從零開始學Storm (Learning Storm from Scratch), 2nd ed. Tsinghua University Press, 2016.
