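Installation, deployment, and usage

The ack mechanism

How the ack mechanism works

This post does not explain what the ack mechanism is; see the "Ack mechanism" page in the official documentation.
All we need to know is that it is built on XOR: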

A xor A = 0
A xor B xor B xor A = 0
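
To make the XOR idea concrete, here is a minimal sketch. It assumes every tuple has a 64-bit id that is XORed into a running value once when the tuple is emitted and once when it is acked, so the value returns to 0 when every emitted tuple has been acked. The class XorAckSketch and its method names are hypothetical; they are not Storm or JStorm APIs.

```java
// Hypothetical helper, not part of Storm or JStorm: tracks one tuple tree.
class XorAckSketch {
    private long ackVal = 0L;

    // XOR the id in when a tuple with this id is emitted.
    void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    // XOR the same id in again when that tuple is acked; the two cancel out.
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    // True when every emitted id has been acked (A xor A = 0).
    boolean fullyProcessed() {
        return ackVal == 0L;
    }

    public static void main(String[] args) {
        XorAckSketch tracker = new XorAckSketch();
        long a = 0x1234L;
        long b = 0x5678L;
        tracker.emitted(a);
        tracker.emitted(b);
        tracker.acked(b);
        System.out.println(tracker.fullyProcessed()); // false: a is still outstanding
        tracker.acked(a);
        System.out.println(tracker.fullyProcessed()); // true: A xor B xor B xor A = 0
    }
}
```

The order of the acks does not matter, which is why a single long is enough to track a whole tuple tree.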

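Using the ack mechanism

To use the ack mechanism, the following needs to be done:

Topology setup

When building the topology, set the number of ackers to a non-zero value: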

config.setNumAckers(1);

This method actually sets the value stored under the key Config.TOPOLOGY_ACKER_EXECUTORS, which is documented as follows:

     /**
     * How many executors to spawn for ackers.
     * <p/>
     * <p>
     * If this is set to 0, then Storm will immediately ack tuples as soon as they come off the spout, effectively disabling reliability.
     * </p>
     */
    public static final String TOPOLOGY_ACKER_EXECUTORS = "topology.acker.executors";

Spout setup

When the spout emits data, attach a msgId. The interface is documented as follows:

    /**
     * Emits a new tuple to the default output stream with the given message ID.
     * When Storm detects that this tuple has been fully processed, or has
     * failed to be fully processed, the spout will receive an ack or fail
     * callback respectively with the messageId as long as the messageId was not
     * null. If the messageId was null, Storm will not track the tuple and no
     * callback will be received. The emitted values must be immutable.
     *
     * @return the list of task ids that this tuple was sent to
     */
    public List<Integer> emit(List<Object> tuple, Object messageId) {
        return emit(Utils.DEFAULT_STREAM_ID, tuple, messageId);
    }

Let's look at how KafkaSpout does it:

    @Override
    public void nextTuple() {
        List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
        for (int i = 0; i < managers.size(); i++) {

            try {
                // in case the number of managers decreased
                _currPartitionIndex = _currPartitionIndex % managers.size();
                EmitState state = managers.get(_currPartitionIndex).next(_collector);
                if (state != EmitState.EMITTED_MORE_LEFT) {
                    _currPartitionIndex = (_currPartitionIndex + 1) % managers.size();
                }
                if (state != EmitState.NO_EMITTED) {
                    break;
                }
            } catch (FailedFetchException e) {
                LOG.warn("Fetch failed", e);
                _coordinator.refresh();
            }
        }

        long now = System.currentTimeMillis();
        if ((now - _lastUpdateMs) > _spoutConfig.stateUpdateIntervalMs) {
            commit();
        }
    }

Note the line EmitState state = managers.get(_currPartitionIndex).next(_collector); above. Let's step into it:

     public EmitState next(SpoutOutputCollector collector) {
        if (_waitingToEmit.isEmpty()) {
            fill();
        }
        while (true) {
            MessageAndRealOffset toEmit = _waitingToEmit.pollFirst();
            if (toEmit == null) {
                return EmitState.NO_EMITTED;
            }
            Iterable<List<Object>> tups = KafkaUtils.generateTuples(_spoutConfig, toEmit.msg);
            if ((tups != null) && tups.iterator().hasNext()) {
                for (List<Object> tup : tups) {
                    collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset));
                }
                break;
            } else {
                ack(toEmit.offset);
            }
        }
        if (!_waitingToEmit.isEmpty()) {
            return EmitState.EMITTED_MORE_LEFT;
        } else {
            return EmitState.EMITTED_END;
        }
    }

As you can see, collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset)); passes a messageId when emitting. KafkaMessageId is a static nested class with two fields, the partition and the offset:

    static class KafkaMessageId {
        public Partition partition;
        public long offset;

        public KafkaMessageId(Partition partition, long offset) {
            this.partition = partition;
            this.offset = offset;
        }
    }

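Bolt setup

There are generally two ways to write a bolt: implement the IRichBolt interface (or extend its abstract implementation BaseRichBolt), or implement IBasicBolt (or extend its abstract implementation BaseBasicBolt). The two differ mainly in how they interact with the ack mechanism.

Using IRichBolt

Using IRichBolt means the method you implement is: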

void execute(Tuple input);

It also means the class you work with is OutputCollector.
When using OutputCollector to emit a tuple to the next bolt, you must emit in the anchored way. The interfaces are:

    /**
     * Emits a new tuple to the default stream anchored on a single tuple. The
     * emitted values must be immutable.
     *
     * @param anchor the tuple to anchor to
     * @param tuple the new output tuple from this bolt
     * @return the list of task ids that this new tuple was sent to
     */
    public List<Integer> emit(Tuple anchor, List<Object> tuple) {
        return emit(Utils.DEFAULT_STREAM_ID, anchor, tuple);
    }

    /**
     * Emits a new tuple to the default stream anchored on a group of input
     * tuples. The emitted values must be immutable.
     *
     * @param anchors the tuples to anchor to
     * @param tuple the new output tuple from this bolt
     * @return the list of task ids that this new tuple was sent to
     */
    public List<Integer> emit(Collection<Tuple> anchors, List<Object> tuple) {
        return emit(Utils.DEFAULT_STREAM_ID, anchors, tuple);
    }

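The anchor is the tuple passed to the bolt's execute method, that is, the tuple sent to you from upstream.
Note that you must not use the unanchored form, which is documented as follows: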

    /**
     * Emits a new unanchored tuple to the default stream. Because it's
     * unanchored, if a failure happens downstream, this new tuple won't affect
     * whether any spout tuples are considered failed or not. The emitted values
     * must be immutable.
     *
     * @param tuple the new output tuple from this bolt
     * @return the list of task ids that this new tuple was sent to
     */
    public List<Integer> emit(List<Object> tuple) {
        return emit(Utils.DEFAULT_STREAM_ID, tuple);
    }

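In addition, after emitting you must call collector.ack(tuple); yourself.

Using IBasicBolt

Programming with IBasicBolt is much simpler because it does a lot of the work for us; all we need to do is call emit. First, the method to implement: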

    /**
     * Process the input tuple and optionally emit new tuples based on the input tuple.
     * 
     * All acking is managed for you. Throw a FailedException if you want to fail the tuple.
     */
    void execute(Tuple input, BasicOutputCollector collector);

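This execute method differs from the one above: it hands us a BasicOutputCollector, and that is the class we work with. Internally it holds the OutputCollector described earlier in a field named out, and the inputTuple is injected into it automatically. Use it to emit tuples; it exposes only two emit methods: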

    public List<Integer> emit(String streamId, List<Object> tuple) {
        return out.emit(streamId, inputTuple, tuple);
    }

    public List<Integer> emit(List<Object> tuple) {
        return emit(Utils.DEFAULT_STREAM_ID, tuple);
    }

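As the code above shows, it actually calls OutputCollector's emit method and anchors for us automatically. This is the familiar proxy design pattern.
You may have noticed that collector.ack(tuple); is never called explicitly here. One can guess that the template method pattern is at work: whoever calls execute also calls ack once execute returns. A look at the code confirms it. In the BasicBoltExecutor class: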

    public void execute(Tuple input) {
        _collector.setContext(input);
        try {
            _bolt.execute(input, _collector);
            _collector.getOutputter().ack(input);
        } catch (FailedException e) {
            if (e instanceof ReportedFailedException) {
                _collector.reportError(e);
            }
            _collector.getOutputter().fail(input);
        }
    }

Looking closely, it also handles exceptions for us: as long as we throw a FailedException, it calls fail automatically.
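
The wrapper behaviour described above can be sketched without any Storm dependency. Every type in this block (MiniCollector, MiniBasicBolt, MiniFailedException, RecordingCollector) is a hypothetical stand-in invented for illustration, and the anchor is passed explicitly for brevity, whereas the real BasicOutputCollector has it injected. It is a sketch of the pattern, not the library's code.

```java
import java.util.ArrayList;
import java.util.List;

// Every type below is a hypothetical stand-in, not a real Storm class.
class BasicWrapperSketch {
    static class MiniFailedException extends RuntimeException {
        MiniFailedException(String message) { super(message); }
    }

    interface MiniCollector {
        void emit(String anchor, String value);
        void ack(String input);
        void fail(String input);
    }

    interface MiniBasicBolt {
        void execute(String input, MiniCollector collector);
    }

    // Records every call so the order of emit/ack/fail can be inspected.
    static class RecordingCollector implements MiniCollector {
        final List<String> log = new ArrayList<>();
        @Override public void emit(String anchor, String value) {
            log.add("emit " + value + " anchored on " + anchor);
        }
        @Override public void ack(String input) { log.add("ack " + input); }
        @Override public void fail(String input) { log.add("fail " + input); }
    }

    // Template method: run the user's logic, ack on success,
    // fail when the user throws MiniFailedException.
    static void run(MiniBasicBolt bolt, String input, MiniCollector collector) {
        try {
            bolt.execute(input, collector);
            collector.ack(input);
        } catch (MiniFailedException e) {
            collector.fail(input);
        }
    }

    static List<String> demo() {
        RecordingCollector collector = new RecordingCollector();
        MiniBasicBolt upperCase = (input, out) -> {
            if (input.equals("bad")) {
                throw new MiniFailedException("cannot process " + input);
            }
            out.emit(input, input.toUpperCase());
        };
        run(upperCase, "hello", collector);
        run(upperCase, "bad", collector);
        return collector.log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The user-written bolt never calls ack or fail; the surrounding run method decides which one applies.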

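Disabling ack

The ack mechanism is not mandatory, and it costs some performance. If you can tolerate losing some data and want higher performance, you can turn it off.

How

Have the spout emit data without a msgId
Set the number of ackers to 0
Emit in the unanchored way

Any one of these works; the second is recommended.

Performance and transactions

Transactions

JStorm supports transactional processing. A transaction here means that tuples are processed in order: until the current tuple has been fully processed, the next one is not started. This greatly reduces concurrency, so performance suffers. Batching helps: treat a batch as the unit of a transaction, and only start the next batch once the current one has finished. Processing can also be split into phases, with the processing phase running concurrently and the actual commit happening in order.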
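
The "process concurrently, commit in order" idea can be sketched with plain java.util.concurrent. This is not JStorm's transaction API; OrderedCommitSketch, the pool size of 4, and the use of a sum as the "processing" step are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration, not JStorm's transaction implementation.
class OrderedCommitSketch {
    // Processing phase runs in parallel; commit phase follows batch order.
    static List<Integer> processAndCommit(List<List<Integer>> batches) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Processing phase: every batch is summed on the pool concurrently.
            List<Future<Integer>> results = new ArrayList<>();
            for (List<Integer> batch : batches) {
                results.add(pool.submit(() -> {
                    int sum = 0;
                    for (int value : batch) {
                        sum += value;
                    }
                    return sum;
                }));
            }
            // Commit phase: wait for batch 0, then batch 1, and so on,
            // so commits happen in submission order regardless of finish order.
            List<Integer> committed = new ArrayList<>();
            for (Future<Integer> result : results) {
                committed.add(result.get());
            }
            return committed;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = Arrays.asList(
            Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5, 6));
        System.out.println(processAndCommit(batches)); // [3, 3, 15]
    }
}
```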

Trident

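Trident is a high-level abstraction on top of Storm that provides joins, grouping, aggregations, functions, and filters. Trident splits the tuples in a stream into batches, and its API wraps the processing of those batches so that each tuple is processed only once. Intermediate results of batch processing are stored in TridentState objects.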

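Performance

Clearly, in terms of performance: Trident < transactions < the plain API with ack enabled < the plain API with ack disabled
We can also increase the parallelism of the ackers so that more threads handle acking

ack and fail

The ack and fail callbacks exist only on the spout
ack: invoked when the spout receives an ack message
fail: invoked when the spout receives a fail message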

    @Override
    public void ack(Object msgId) {
    }

    @Override
    public void fail(Object msgId) {
    }

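The parameter is the msgId, the same msgId the spout attached when it emitted the data. Whether a failed tuple is re-emitted is entirely up to your implementation. KafkaSpout, for example, has its own implementation, which is not reproduced here.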
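
One simple way to implement re-emission, sketched under stated assumptions: the spout remembers each payload under its msgId until it is acked, and on fail queues that msgId for replay. ReplaySketch and all of its members are hypothetical, and this is not how KafkaSpout does it.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical spout-side bookkeeping; not KafkaSpout's implementation.
class ReplaySketch {
    private final Map<Long, String> pending = new HashMap<>();
    private final Queue<Long> toReplay = new ArrayDeque<>();
    private long nextId = 0L;

    // Emit a payload and remember it under a fresh msgId.
    long emit(String payload) {
        long id = nextId++;
        pending.put(id, payload);
        return id;
    }

    // Fully processed: the payload no longer needs to be kept.
    void ack(Object msgId) {
        pending.remove(msgId);
    }

    // Failed: schedule the still-pending payload for another attempt.
    void fail(Object msgId) {
        if (pending.containsKey(msgId)) {
            toReplay.add((Long) msgId);
        }
    }

    // Returns the next payload to re-emit, or null if nothing is queued.
    String nextReplay() {
        Long id = toReplay.poll();
        return id == null ? null : pending.get(id);
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ReplaySketch spout = new ReplaySketch();
        long id = spout.emit("order-1");
        spout.fail(id);
        System.out.println(spout.nextReplay());   // order-1
        spout.ack(id);
        System.out.println(spout.pendingCount()); // 0
    }
}
```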
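Note that a topology usually has several bolts, and a failure in any bolt along the processing path causes the spout's fail method to run. If your fail method re-emits the tuple, that tuple is processed by all the bolts again. For example:
Suppose the topology is SpoutA->BoltB->BoltC->BoltD. If BoltC fails, SpoutA re-emits the tuple and it travels the whole topology again. BoltB therefore processes the tuple twice, and if BoltB inserts into a database, that causes problems.
Fortunately, we usually write to the database only in the last bolt. The bolts before it mostly compute in memory without persisting anything, so running them more than once does no harm.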
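
The SpoutA->BoltB->BoltC->BoltD scenario can be simulated with simple counters. This is a toy model, assuming BoltC fails exactly once and SpoutA re-emits on every failure; ReplayCountSketch and its fields are hypothetical names.

```java
// Toy model of SpoutA->BoltB->BoltC->BoltD; all names are hypothetical.
class ReplayCountSketch {
    int boltBRuns = 0;
    int boltCRuns = 0;
    int boltDRuns = 0;
    private boolean boltCFailsNext = true; // BoltC fails on its first attempt only

    // One pass through the topology; returns true if the tuple reached the end.
    boolean runOnce(String tuple) {
        boltBRuns++;
        boltCRuns++;
        if (boltCFailsNext) {
            boltCFailsNext = false;
            return false; // failure reported back to SpoutA
        }
        boltDRuns++;
        return true;
    }

    // SpoutA keeps re-emitting the tuple until the whole path succeeds.
    int emitUntilAcked(String tuple) {
        int attempts = 0;
        boolean acked;
        do {
            attempts++;
            acked = runOnce(tuple);
        } while (!acked);
        return attempts;
    }

    public static void main(String[] args) {
        ReplayCountSketch topology = new ReplayCountSketch();
        System.out.println(topology.emitUntilAcked("t1")); // 2
        System.out.println(topology.boltBRuns);            // 2
        System.out.println(topology.boltDRuns);            // 1
    }
}
```

BoltB runs twice for a single input tuple, while BoltD, the last bolt, runs once.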

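Multithreading

In JStorm, a spout's nextTuple and ack/fail run in different threads, which encourages users to perform blocking operations in nextTuple. In stock Storm, nextTuple and ack/fail share one thread, so none of them may block, otherwise tuples time out. The downside is that when there is no data the spout spins idly and wastes a great deal of CPU. JStorm therefore changed Storm's spout design and encourages blocking operations (such as take on a queue) to save CPU.
In more detail:
When topology.max.spout.pending is not 1 (including when it is null), the spout starts an extra thread dedicated to ack and fail, so nextTuple runs in a thread of its own and is allowed to block. In stock Storm, nextTuple/ack/fail all run in one thread; when there is little data, nextTuple returns immediately and ack/fail often have nothing to do either, so the CPU spins and is wasted. In JStorm, nextTuple can fetch data in a blocking way, for example from a disruptor or a BlockingQueue; when there is no data it simply blocks, which saves a lot of CPU.
This does introduce one problem: you must take care of thread safety between ack/fail and nextTuple.
When topology.max.spout.pending is 1, the spout goes back to a single thread, that is, nextTuple/ack/fail all run in one thread.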
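
A minimal sketch of what "thread-safe" can look like when nextTuple blocks on one thread while acks arrive on another. It assumes a BlockingQueue as the data source and a ConcurrentHashMap for in-flight tuples; ThreadSafeSpoutSketch is a hypothetical class that does not implement any JStorm interface.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical class; it only mimics the two-thread layout described above.
class ThreadSafeSpoutSketch {
    private final BlockingQueue<String> source = new LinkedBlockingQueue<>();
    private final ConcurrentHashMap<Long, String> pending = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    void offer(String payload) {
        source.add(payload);
    }

    // Runs on the nextTuple thread; take() blocks instead of spinning.
    long nextTuple() {
        try {
            String payload = source.take();
            long id = nextId.getAndIncrement();
            pending.put(id, payload);
            return id;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for data", e);
        }
    }

    // Runs on the ack/fail thread; the concurrent map makes this safe.
    void ack(long msgId) {
        pending.remove(msgId);
    }

    int pendingCount() {
        return pending.size();
    }

    // Emits n tuples on one thread and acks them on another.
    static int runDemo(int n) {
        ThreadSafeSpoutSketch spout = new ThreadSafeSpoutSketch();
        BlockingQueue<Long> emitted = new LinkedBlockingQueue<>();
        Thread emitter = new Thread(() -> {
            for (int i = 0; i < n; i++) {
                emitted.add(spout.nextTuple());
            }
        });
        Thread acker = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) {
                    spout.ack(emitted.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        emitter.start();
        acker.start();
        for (int i = 0; i < n; i++) {
            spout.offer("msg-" + i);
        }
        try {
            emitter.join();
            acker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return spout.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(1000)); // 0 once every tuple has been acked
    }
}
```

A plain HashMap in place of the ConcurrentHashMap would be unsafe here, because put and remove are called from different threads.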

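Miscellaneous

Restarting

It is recommended to force-restart the supervisor at least once a month. The supervisor is a daemon that keeps spawning child processes; after it has run for a long time it holds a very large number of open file handles, which slows down worker startup. Force-restarting the supervisor about once a week is therefore advisable.

Writing to Kafka

To write data to Kafka you can use the KafkaBolt class, which already does the work for us; we only need to supply a few parameters.
Both KafkaSpout and KafkaBolt discussed above live in the storm-kafka module. The Maven configuration is: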

 <dependency>
        <groupId>org.apache.storm</groupId>
        <artifactId>storm-kafka</artifactId>
        <version>0.10.2</version>
        <exclusions>
                <exclusion>
                    <groupId>org.apache.zookeeper</groupId>
                    <artifactId>zookeeper</artifactId>
                </exclusion>
        </exclusions>
 </dependency>

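Note that the version is 0.10.2, not the latest. From 1.0.0 onward the package structure changed and is incompatible with JStorm, so those versions cannot be used.
This is an official Storm plugin project; its home is: Storm Kafka

-------------------------------------- Divider: added 2017-05-10 16:29:33 --------------------------------------
In KafkaSpout's nextTuple method, every call invokes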

List<PartitionManager> managers = _coordinator.getMyManagedPartitions();

to obtain the partition information. That method is:

@Override
    public List<PartitionManager> getMyManagedPartitions() {
        if (_lastRefreshTime == null || (System.currentTimeMillis() - _lastRefreshTime) > _refreshFreqMs) {
            refresh();
            _lastRefreshTime = System.currentTimeMillis();
        }
        return _cachedList;
    }

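On each call it checks whether a certain interval has elapsed, and if so it fetches the partition information again. The interval defaults to 60s and comes from the refreshFreqSecs property of ZkHosts.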
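
The same time-based refresh can be sketched in isolation. TimedRefreshSketch is a hypothetical class; the injectable clock and loader are assumptions added so the behaviour can be exercised without waiting 60 seconds, and they are not part of storm-kafka.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical stand-in for a coordinator that caches a partition list.
class TimedRefreshSketch<T> {
    private final long refreshFreqMs;
    private final LongSupplier clock;        // assumed: returns "now" in milliseconds
    private final Supplier<List<T>> loader;  // assumed: fetches the current partitions
    private Long lastRefreshTime = null;
    private List<T> cached;
    int refreshCount = 0;

    TimedRefreshSketch(long refreshFreqMs, LongSupplier clock, Supplier<List<T>> loader) {
        this.refreshFreqMs = refreshFreqMs;
        this.clock = clock;
        this.loader = loader;
    }

    // Reloads on the first call or once more than refreshFreqMs has elapsed.
    List<T> get() {
        long now = clock.getAsLong();
        if (lastRefreshTime == null || (now - lastRefreshTime) > refreshFreqMs) {
            cached = loader.get();
            refreshCount++;
            lastRefreshTime = now;
        }
        return cached;
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock that the demo advances by hand
        TimedRefreshSketch<Integer> partitions = new TimedRefreshSketch<Integer>(
            60_000L, () -> now[0], () -> Arrays.asList(0, 1, 2));
        partitions.get();
        now[0] = 59_000L;
        partitions.get();
        System.out.println(partitions.refreshCount); // 1: still inside the 60s window
        now[0] = 60_001L;
        partitions.get();
        System.out.println(partitions.refreshCount); // 2: window elapsed, reloaded
    }
}
```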

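So is there a problem when partitions are added or removed? No: KafkaSpout already takes care of a lot for us.
If the number of partitions increases, then within those 60s I keep consuming the original partitions; after 60s the partition list is refreshed and the new partitions are consumed as well. No problem at all.
If the number of partitions decreases, say from five (0,1,2,3,4) to three (0,1,2), then consuming partition 3 throws an exception, and the exception handler refreshes the partitions. At that point the partition count becomes 3, so the loop exits right away and partition 4 is never attempted. Again, no problem.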

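-------------------------------------- Divider: added 2017-05-22 15:48:31 --------------------------------------
Do not use static variables or static {} blocks in a bolt. Do not use static variables or static {} blocks in a bolt. Do not use static variables or static {} blocks in a bolt. Important things are worth saying three times.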
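
The post does not say why. One general Java hazard, shown here as a plain sketch rather than as the author's reason, is that a static field has a single copy per JVM, so every instance of the class in the same process shares it. StaticStateSketch is a hypothetical class, not a real bolt.

```java
// Hypothetical class, not a real bolt: contrasts static and instance state.
class StaticStateSketch {
    static int sharedCount = 0; // one copy for all instances in this JVM
    int ownCount = 0;           // one copy per instance

    void execute() {
        sharedCount++;
        ownCount++;
    }

    public static void main(String[] args) {
        StaticStateSketch first = new StaticStateSketch();
        StaticStateSketch second = new StaticStateSketch();
        first.execute();
        second.execute();
        System.out.println(StaticStateSketch.sharedCount); // 2: both instances wrote to it
        System.out.println(first.ownCount);                // 1: private to this instance
    }
}
```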

