Storm's High-Level Micro-Batch API: Trident

Storm Trident Overview
1.1. Apply Locally: operations run on each Batch on the local node, with no network transfer
1.2. Functions
1.3. Filters
1.4. PartitionAggregate
1.5. Aggregation
1.6. Grouped streams
1.7. Merge and Joins
What is Storm Trident?

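Trident is a high-level abstraction for real-time stream processing built on Storm. It provides operations such as aggregation, projection, and filtering on real-time streams, which greatly reduces the work of writing Storm programs. Trident also provides primitives for stateful, incremental updates against databases or other persistent stores.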

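Suppose we want to write a program that counts word frequencies in text. With the core Storm API we would need three components:

1. A Spout that receives data from Kafka

2. A Bolt that parses the data and sends the result downstream in a word field

3. A Bolt that counts word frequencies, records the result in a count field, and stores it

With Trident, the same pipeline can be expressed in a single class. The code below reads from Kafka, parses each message into word and cnt fields, and logs every tuple; the counting step would be added with the aggregation operations described in section 1.5: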

import com.alibaba.fastjson.JSONObject;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.kafka.trident.TransactionalTridentKafkaSpout;
import org.apache.storm.kafka.trident.TridentKafkaConfig;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFilter;
import org.apache.storm.trident.operation.MapFunction;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StormTridentTest {

    // Static logger shared by the anonymous operations below, so they capture no local state
    private static final Logger LOG = LoggerFactory.getLogger(StormTridentTest.class);

    public static void main(String[] args) throws Exception {
        TridentTopology topology = new TridentTopology();
        BrokerHosts brokerHosts = new ZkHosts("localhost:2181");

        // Kafka spout configuration
        String topic = "testtest";
        TridentKafkaConfig config = new TridentKafkaConfig(brokerHosts, topic, "trid");
        config.scheme = new SchemeAsMultiScheme(new StringScheme());

        // Trident Kafka spout
        TransactionalTridentKafkaSpout spout = new TransactionalTridentKafkaSpout(config);

        // Create the Trident Kafka stream, parse each message, and log the parsed tuples
        Stream parsed = topology.newStream("spout", spout)
                .map(new MapFunction() {
                    @Override
                    public Values execute(TridentTuple tridentTuple) {
                        String data = tridentTuple.getString(0);
                        JSONObject dataJson = JSONObject.parseObject(data);
                        if (dataJson == null) {
                            // Nothing to parse: log the raw message and emit a placeholder tuple
                            LOG.info("empty message: " + data);
                            return new Values("", 0);
                        }
                        String word = dataJson.getString("word");
                        int cnt = dataJson.getIntValue("cnt");
                        return new Values(word, cnt);
                    }
                }, new Fields("word", "cnt"))
                .each(new Fields("word", "cnt"), new BaseFilter() {
                    @Override
                    public boolean isKeep(TridentTuple tridentTuple) {
                        // Pass-through filter: log the tuple and keep it
                        LOG.info("cnt:" + tridentTuple);
                        return true;
                    }
                });

        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("mytopology", conf, topology.build());
    }
}

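Storm Trident Overview
The core data model in Storm Trident is the Stream: a Stream is what Trident processes. In practice, though, a Stream is processed in batches. It is split into Batches that are distributed across the cluster, and every function applied to the Stream is ultimately applied to the Batches on each node, which gives parallel computation, as shown in the figure below: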

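Trident has five kinds of operations:

Apply Locally: local operations applied to the data on the local node, with no network transfer
Repartitioning: redirects the stream; it only changes where tuples flow, not their content, and it involves network transfer
Aggregation: aggregation operations, which involve network transfer
Operations on grouped streams
Merge and Join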
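1.1. Apply Locally: operations run on each Batch on the local node, with no network transfer
1.2. Functions
A function takes a tuple (you specify which of its fields the function receives) and emits zero or more tuples. The fields it emits are appended to the original input tuple. If a function emits nothing for a tuple, that tuple is filtered out. The MapFunction in the example above is a closely related operation: it visits and parses every tuple, emitting exactly one output per input.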
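The append-and-drop behaviour described above can be sketched without a Storm cluster. The block below is a plain-Java model, not the Storm API: the class `FunctionModel` and its methods are hypothetical names, tuples are modelled as lists of integers, and the function is assumed to emit the values 0 to n-1 for an input field value n. In Trident itself the counterpart would be a class extending `BaseFunction` and passed to `each`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of how a Trident function treats tuples; not the Storm API. */
class FunctionModel {

    /** Hypothetical function body: for an input value n, emits 0, 1, ..., n-1. */
    static List<Integer> emitRange(int n) {
        List<Integer> emitted = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            emitted.add(i);
        }
        return emitted;
    }

    /**
     * Applies emitRange to the field at inputIndex of every tuple. Each emitted
     * value is appended to a copy of the input tuple; a tuple for which nothing
     * is emitted does not appear in the result.
     */
    static List<List<Integer>> each(List<List<Integer>> batch, int inputIndex) {
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> tuple : batch) {
            for (Integer emitted : emitRange(tuple.get(inputIndex))) {
                List<Integer> extended = new ArrayList<>(tuple);
                extended.add(emitted);
                result.add(extended);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> batch = new ArrayList<>();
        batch.add(Arrays.asList(1, 2, 3));
        batch.add(Arrays.asList(4, 1, 6));
        batch.add(Arrays.asList(3, 0, 8));
        // The function reads the second field (index 1) of each tuple.
        System.out.println(each(batch, 1));
        // prints [[1, 2, 3, 0], [1, 2, 3, 1], [4, 1, 6, 0]]
    }
}
```

The third tuple has 0 in the field the function reads, so nothing is emitted for it and it disappears from the output, which is the "filtered out" case.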

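1.3. Filters
A Filter is simple: it receives a tuple and decides whether to keep it. The BaseFilter in the example above is one such Filter; its isKeep method logs each tuple and returns true, so every tuple is kept.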
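A filter that actually drops tuples may make the contrast with functions clearer. The sketch below is again a plain-Java model rather than Storm code: `FilterModel` is a hypothetical name, and the `Predicate` stands in for the `isKeep` method of a Trident `BaseFilter`. The keep condition (first field equals 1 and second field equals 2) is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Plain-Java model of a Trident filter; not the Storm API. */
class FilterModel {

    /** Keeps exactly the tuples for which isKeep returns true; tuples are never modified. */
    static List<List<Integer>> filter(List<List<Integer>> batch, Predicate<List<Integer>> isKeep) {
        List<List<Integer>> kept = new ArrayList<>();
        for (List<Integer> tuple : batch) {
            if (isKeep.test(tuple)) {
                kept.add(tuple);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<Integer>> batch = new ArrayList<>();
        batch.add(Arrays.asList(1, 2, 3));
        batch.add(Arrays.asList(2, 1, 1));
        batch.add(Arrays.asList(2, 3, 4));
        // Illustrative condition: keep tuples whose first field is 1 and second field is 2.
        Predicate<List<Integer>> isKeep = t -> t.get(0) == 1 && t.get(1) == 2;
        System.out.println(filter(batch, isKeep)); // prints [[1, 2, 3]]
    }
}
```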
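1.4. PartitionAggregate

PartitionAggregate aggregates the tuples within each Partition. Unlike a function, which appends data to the original tuple, the output of PartitionAggregate replaces the input tuples entirely: only the tuples emitted by the aggregator remain. For example, a summing PartitionAggregate:

stream.partitionAggregate(new Fields("b"), new Sum(), new Fields("sum"));

This statement sums field b over the tuples in each partition and emits the result in a field named sum.

The Trident API provides three aggregator interfaces: CombinerAggregator, ReducerAggregator, and Aggregator.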
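To make the per-partition behaviour concrete, here is a small plain-Java model of a summing aggregator. It is a sketch under stated assumptions, not Storm code: the nested `Combiner` interface only mirrors the three-method shape of Trident's `CombinerAggregator` (`init`, `combine`, `zero`) and takes a plain `long` instead of a `TridentTuple`; `PartitionAggregateModel` and the sample partitions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of partitionAggregate with a combiner-style sum; not the Storm API. */
class PartitionAggregateModel {

    /** Local stand-in shaped like Trident's CombinerAggregator, over plain long field values. */
    interface Combiner<T> {
        T init(long fieldValue);     // turn one input value into a partial result
        T combine(T left, T right);  // merge two partial results
        T zero();                    // result for an empty partition
    }

    static final Combiner<Long> SUM = new Combiner<Long>() {
        @Override public Long init(long fieldValue) { return fieldValue; }
        @Override public Long combine(Long left, Long right) { return left + right; }
        @Override public Long zero() { return 0L; }
    };

    /** Emits exactly one aggregate per partition; the input values are not carried over. */
    static <T> List<T> partitionAggregate(List<List<Long>> partitions, Combiner<T> aggregator) {
        List<T> output = new ArrayList<>();
        for (List<Long> partition : partitions) {
            T accumulated = aggregator.zero();
            for (long value : partition) {
                accumulated = aggregator.combine(accumulated, aggregator.init(value));
            }
            output.add(accumulated);
        }
        return output;
    }

    public static void main(String[] args) {
        // Values of field "b" in three partitions.
        List<List<Long>> partitions = new ArrayList<>();
        partitions.add(Arrays.asList(1L, 2L));
        partitions.add(Arrays.asList(3L, 8L));
        partitions.add(Arrays.asList(1L, 9L, 10L));
        System.out.println(partitionAggregate(partitions, SUM)); // prints [3, 11, 20]
    }
}
```

Seven input values go in and three sums come out, one per partition, which is the "output replaces input" behaviour described above.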

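1.5. Aggregation
Trident provides the aggregate and persistentAggregate methods for aggregation. aggregate runs on each Batch of the Stream independently, whereas persistentAggregate aggregates across all Batches of the Stream and stores the result in a state source. Calling aggregate on a Stream performs a global aggregation. With a ReducerAggregator or an Aggregator, the Stream is first repartitioned into a single partition and the aggregation function then runs on that partition. With a CombinerAggregator, Trident first computes partial aggregates in each partition and then repartitions the partial results into a single partition. A CombinerAggregator is therefore more efficient and should be preferred whenever possible. For example, a global count with aggregate:
stream.aggregate(new Count(), new Fields("count"));
As with partitionAggregate, the aggregators passed to aggregate can be chained. However, if you chain a CombinerAggregator with a non-CombinerAggregator, Trident cannot apply the partial-aggregation optimization.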
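The efficiency argument for CombinerAggregator can be illustrated with a rough plain-Java model. This is a sketch, not Storm code: `GlobalCountModel` is a hypothetical name, and "network transfer" is simplified to a count of how many values are moved to the single target partition, assuming every source partition lives on a different node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model comparing the two global-aggregation strategies; not the Storm API. */
class GlobalCountModel {

    /**
     * Reducer/Aggregator style: every tuple is moved to one partition and counted there.
     * Returns {count, valuesMoved}.
     */
    static long[] reducerStyle(List<List<String>> partitions) {
        long count = 0;
        long moved = 0;
        for (List<String> partition : partitions) {
            moved += partition.size(); // each tuple travels to the target partition
            count += partition.size();
        }
        return new long[] {count, moved};
    }

    /**
     * Combiner style: each partition computes a partial count locally and only that
     * partial result is moved. Returns {count, valuesMoved}.
     */
    static long[] combinerStyle(List<List<String>> partitions) {
        long count = 0;
        long moved = 0;
        for (List<String> partition : partitions) {
            long partial = partition.size(); // computed locally, nothing sent yet
            moved += 1;                      // one partial result per partition
            count += partial;
        }
        return new long[] {count, moved};
    }

    public static void main(String[] args) {
        List<List<String>> partitions = new ArrayList<>();
        partitions.add(Arrays.asList("a", "b"));
        partitions.add(Arrays.asList("c", "d"));
        partitions.add(Arrays.asList("e", "f", "g"));
        System.out.println(Arrays.toString(reducerStyle(partitions)));  // prints [7, 7]
        System.out.println(Arrays.toString(combinerStyle(partitions))); // prints [7, 3]
    }
}
```

Both strategies arrive at the same count; the combiner version moves one value per partition instead of one per tuple, and the gap grows with batch size.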

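1.6. Grouped streams

The groupBy operation repartitions the stream by the specified fields, and within each partition it groups together the tuples whose group fields are equal. If you run aggregators on a grouped Stream, the aggregation runs within each group rather than over the whole Batch. persistentAggregate can also run on a GroupedStream, in which case the results are stored in a MapState keyed by the grouping fields. Aggregators can be chained on a GroupedStream as well.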
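The difference between a per-batch grouped aggregate and a persistent one can be modelled in a few lines of plain Java. This is a sketch under assumptions, not Storm code: `GroupedCountModel` is a hypothetical name, the grouping field is a single word, and an ordinary `TreeMap` stands in for the MapState that Trident would keep.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of counting on a grouped stream; not the Storm API. */
class GroupedCountModel {

    /** Per-batch grouped aggregate: one count per distinct word in this batch only. */
    static Map<String, Long> countPerGroup(List<String> batch) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : batch) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    /** Persistent grouped aggregate: folds the batch's group counts into state keyed by word. */
    static void persistentCount(Map<String, Long> state, List<String> batch) {
        for (Map.Entry<String, Long> entry : countPerGroup(batch).entrySet()) {
            state.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
    }

    public static void main(String[] args) {
        List<String> batch1 = Arrays.asList("a", "b", "a");
        List<String> batch2 = Arrays.asList("b", "c");

        System.out.println(countPerGroup(batch1)); // prints {a=2, b=1}
        System.out.println(countPerGroup(batch2)); // prints {b=1, c=1}

        Map<String, Long> state = new TreeMap<>(); // stand-in for a MapState
        persistentCount(state, batch1);
        persistentCount(state, batch2);
        System.out.println(state);                 // prints {a=2, b=2, c=1}
    }
}
```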
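1.7. Merge and Joins
The last part of the API covers combining different streams. The simplest way is to merge them into one stream:
topology.merge(stream1, stream2, stream3);
The other way to combine streams is a join. A standard join, like the one in SQL, requires finite input, so it makes no sense on an unbounded stream. A Trident join therefore applies only within each small Batch that comes off the Spouts. For example, when stream1 with fields key, val1, val2 is joined with stream2 with fields x, val1 on key = x, and the output fields are named key, a, b, c, each joined tuple contains:

The join field first, here key from stream1, which matches x from stream2
Then all non-join fields, in the order the streams were passed to the join method: a and b are val1 and val2 from stream1, and c is val1 from stream2
When the joined streams come from different Spouts, those Spouts are synchronized in how they emit batches, so a single Batch contains tuples from each Spout.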
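The output layout of a join can be checked with a small plain-Java model. This is a sketch, not Storm code: `BatchJoinModel` is a hypothetical name, tuples are lists of objects, the first stream's tuples are assumed to be laid out as [key, val1, val2] and the second stream's as [x, val1], and only matching pairs are emitted (an inner join).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of a join within one batch; not the Storm API. */
class BatchJoinModel {

    /**
     * Joins tuples [key, val1, val2] from stream1 with tuples [x, val1] from stream2
     * on key == x. Each result is [key, a, b, c]: the join field first, then the
     * non-join fields of stream1, then the non-join field of stream2.
     */
    static List<List<Object>> join(List<List<Object>> stream1Batch, List<List<Object>> stream2Batch) {
        List<List<Object>> joined = new ArrayList<>();
        for (List<Object> left : stream1Batch) {
            for (List<Object> right : stream2Batch) {
                if (left.get(0).equals(right.get(0))) {
                    joined.add(Arrays.asList(left.get(0), left.get(1), left.get(2), right.get(1)));
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<List<Object>> stream1Batch = new ArrayList<>();
        stream1Batch.add(Arrays.<Object>asList("k1", 1, 2));
        stream1Batch.add(Arrays.<Object>asList("k2", 3, 4));

        List<List<Object>> stream2Batch = new ArrayList<>();
        stream2Batch.add(Arrays.<Object>asList("k1", 10));
        stream2Batch.add(Arrays.<Object>asList("k3", 30));

        // Only k1 appears in both batches, so only one joined tuple is produced.
        System.out.println(join(stream1Batch, stream2Batch)); // prints [[k1, 1, 2, 10]]
    }
}
```

Because the join only sees the tuples of one batch at a time, a key that arrives in stream1 in one batch and in stream2 in a later batch would not be matched by this model.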
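What is Storm Trident?

Trident is a high-level framework for writing Storm topologies, a wrapper over the traditional Spout and Bolt. Before learning Trident, we wrote topologies with the Spout and Bolt APIs; with Trident, we write them with the Trident API.

The relationship between StormTopology and TridentTopology is comparable to that between JDBC and ORM frameworks (MyBatis, Hibernate): the latter is a higher-level wrapper over the former, offering the same functionality while greatly reducing development effort.
New concepts usually mean new APIs, but underneath everything is still implemented with Storm primitives. In Trident a topology is represented by TridentTopology, while in core Storm it is represented by StormTopology. A TridentTopology is ultimately converted into a StormTopology.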
