A Brief Look at the Mini-Batch Optimization for Flink SQL Streaming Aggregation

Preface

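Streaming aggregation is one of the most common scenarios when writing real-time business logic, and it is also prone to a variety of performance problems. Flink SQL lets users implement streaming aggregation with plain aggregate functions and a GROUP BY clause, and it ships with several built-in optimizations that address bottlenecks arising in some of these cases. This post gives a brief introduction to one of the commonly used ones, Mini-Batch, and walks through the source code to see how it is implemented.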

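Note: as of the version current at the time of writing, Flink SQL's streaming aggregation optimizations do not apply to window aggregations (that is, GROUP BY TUMBLE/HOP/SESSION); they only take effect for aggregations over plain unbounded streams.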

Mini-Batch Overview

The Mini-Batch concept in Flink SQL is somewhat similar to Spark Streaming's: records are processed in micro-batches.

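By default, the aggregation operator performs a "read accumulator state → update state → write state back" cycle for every record it ingests. When traffic is heavy, the overhead of these state operations grows with it and hurts efficiency, especially with a backend such as RocksDB where serialization is expensive. With Mini-Batch enabled, incoming records are collected in a buffer inside the operator, and the aggregation logic runs only once the buffer reaches a configured size or time threshold. As a result, each key in a batch needs only one state read and one state write. The fewer distinct keys a batch contains relative to its record count, the more pronounced the gain.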
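To make the saving concrete, here is a minimal sketch that counts state accesses in both modes for a simple per-key COUNT. It is not Flink code: the class StateAccessDemo, the HashMap standing in for keyed state, and the count-only flush (no time threshold) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StateAccessDemo {
    /** Per-record mode: every record costs one state read and one state write. */
    static int perRecordAccesses(List<String> keys) {
        Map<String, Long> state = new HashMap<>(); // stand-in for keyed state
        int accesses = 0;
        for (String key : keys) {
            long acc = state.getOrDefault(key, 0L); // read state
            state.put(key, acc + 1);                // write state back
            accesses += 2;
        }
        return accesses;
    }

    /** Mini-batch mode: buffer batchSize records, then touch state once per distinct key. */
    static int miniBatchAccesses(List<String> keys, int batchSize) {
        Map<String, Long> state = new HashMap<>();
        Map<String, Long> buffer = new HashMap<>();
        int accesses = 0;
        int buffered = 0;
        for (String key : keys) {
            buffer.merge(key, 1L, Long::sum); // in-memory only, no state access
            if (++buffered == batchSize) {
                accesses += flush(buffer, state);
                buffered = 0;
            }
        }
        accesses += flush(buffer, state); // flush the trailing partial batch, if any
        return accesses;
    }

    private static int flush(Map<String, Long> buffer, Map<String, Long> state) {
        int accesses = 0;
        for (Map.Entry<String, Long> e : buffer.entrySet()) {
            long acc = state.getOrDefault(e.getKey(), 0L); // one read per key
            state.put(e.getKey(), acc + e.getValue());     // one write per key
            accesses += 2;
        }
        buffer.clear();
        return accesses;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "a", "a", "b", "a", "c", "a");
        System.out.println("per-record: " + perRecordAccesses(keys));
        System.out.println("mini-batch: " + miniBatchAccesses(keys, 4));
    }
}
```

For these 8 records, per-record mode makes 16 state accesses, while a batch size of 4 makes 10: the first batch holds 2 distinct keys and the second holds 3.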

The following diagram contrasts aggregation with Mini-Batch disabled and enabled.

Clearly, Mini-Batch adds some latency to processing, so users have to weigh timeliness against throughput before deciding to enable it.

Mini-Batch aggregation is disabled by default. To enable it, set the following three parameters.

val tEnv: TableEnvironment = ...
val configuration = tEnv.getConfig().getConfiguration()

configuration.setString("table.exec.mini-batch.enabled", "true")         // enable Mini-Batch
configuration.setString("table.exec.mini-batch.allow-latency", "5 s")    // maximum time records may be buffered
configuration.setString("table.exec.mini-batch.size", "5000")            // maximum number of buffered records

With Mini-Batch enabled, run a simple aggregation query over an unbounded stream; the JobGraph shown in the Web UI looks like this.

Note that LocalGroupAggregate and GlobalGroupAggregate are the result of the Local-Global optimization, which builds on Mini-Batch; it is described briefly after the analysis of plain Mini-Batch.

How Mini-Batch Works

Generating Watermarks

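Under the hood, Mini-Batch is driven by an optimizer rule named MiniBatchIntervalInferRule (code omitted here). It produces a physical node, StreamExecMiniBatchAssigner, attached directly after the Source node. The source of its translateToPlanInternal() method is shown below.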

@SuppressWarnings("unchecked")
@Override
protected Transformation<RowData> translateToPlanInternal(PlannerBase planner) {
    final Transformation<RowData> inputTransform =
            (Transformation<RowData>) getInputEdges().get(0).translateToPlan(planner);
    final OneInputStreamOperator<RowData, RowData> operator;

    if (miniBatchInterval.mode() == MiniBatchMode.ProcTime()) {
        operator = new ProcTimeMiniBatchAssignerOperator(miniBatchInterval.interval());
    } else if (miniBatchInterval.mode() == MiniBatchMode.RowTime()) {
        operator = new RowTimeMiniBatchAssginerOperator(miniBatchInterval.interval());
    } else {
        throw new TableException(
                String.format(
                        "MiniBatchAssigner shouldn't be in %s mode this is a bug, please file an issue.",
                        miniBatchInterval.mode()));
    }

    return new OneInputTransformation<>(
            inputTransform,
            getDescription(),
            operator,
            InternalTypeInfo.of(getOutputType()),
            inputTransform.getParallelism());
}

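As we can see, the operator produced depends on the job's time semantics (both are OneInputStreamOperator implementations at heart). Let's first look at the relevant methods of ProcTimeMiniBatchAssignerOperator, the operator produced under processing-time semantics.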

@Override
public void processElement(StreamRecord<RowData> element) throws Exception {
    long now = getProcessingTimeService().getCurrentProcessingTime();
    long currentBatch = now - now % intervalMs;
    if (currentBatch > currentWatermark) {
        currentWatermark = currentBatch;
        // emit
        output.emitWatermark(new Watermark(currentBatch));
    }
    output.collect(element);
}

@Override
public void onProcessingTime(long timestamp) throws Exception {
    long now = getProcessingTimeService().getCurrentProcessingTime();
    long currentBatch = now - now % intervalMs;
    if (currentBatch > currentWatermark) {
        currentWatermark = currentBatch;
        // emit
        output.emitWatermark(new Watermark(currentBatch));
    }
    getProcessingTimeService().registerTimer(currentBatch + intervalMs, this);
}

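Processing-time semantics do not normally need watermarks, but the approach here is clever: watermarks are borrowed as markers that separate batches. For every record processed, the operator checks whether the current processing time still falls within the current batch; if a new batch has started, it emits a new watermark. A timer is also registered to emit watermarks, which guarantees that they are emitted at the interval specified by table.exec.mini-batch.allow-latency above.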
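The boundary arithmetic is easy to check in isolation. The sketch below reproduces only the `now - now % intervalMs` rounding and the "emit when a new batch starts" test from the operator above; the class ProcTimeBatchBoundaryDemo, the -1 return value for "nothing emitted", and the initial watermark of 0 are assumptions for illustration, and timers are left out.

```java
class ProcTimeBatchBoundaryDemo {
    private final long intervalMs;
    private long currentWatermark = 0L; // assumed initial value

    ProcTimeBatchBoundaryDemo(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    /** Rounds a timestamp down to the start of the interval it falls into. */
    static long batchStart(long now, long intervalMs) {
        return now - now % intervalMs;
    }

    /**
     * Simulates a record arriving at processing time `now`.
     * Returns the watermark that would be emitted, or -1 if the batch is unchanged.
     */
    long onElement(long now) {
        long currentBatch = batchStart(now, intervalMs);
        if (currentBatch > currentWatermark) {
            currentWatermark = currentBatch;
            return currentBatch;
        }
        return -1L;
    }

    public static void main(String[] args) {
        ProcTimeBatchBoundaryDemo demo = new ProcTimeBatchBoundaryDemo(5000L);
        System.out.println(demo.onElement(12345L)); // new batch starting at 10000
        System.out.println(demo.onElement(14999L)); // same batch, nothing emitted
        System.out.println(demo.onElement(15000L)); // next batch starting at 15000
    }
}
```

With a 5-second interval, records arriving at 12345 and 14999 belong to the batch that starts at 10000, so only the first of them produces a watermark; the record at 15000 opens the next batch.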

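The idea is the same under event-time semantics: the operator only needs to inspect the timestamps of the watermarks produced by the Source and forward those that fit the batch period; watermarks that do not fit are not passed downstream. The corresponding code in the RowTimeMiniBatchAssginerOperator class is as follows.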

@Override
public void processWatermark(Watermark mark) throws Exception {
    // if we receive a Long.MAX_VALUE watermark we forward it since it is used
    // to signal the end of input and to not block watermark progress downstream
    if (mark.getTimestamp() == Long.MAX_VALUE && currentWatermark != Long.MAX_VALUE) {
        currentWatermark = Long.MAX_VALUE;
        output.emitWatermark(mark);
        return;
    }
    currentWatermark = Math.max(currentWatermark, mark.getTimestamp());
    if (currentWatermark >= nextWatermark) {
        advanceWatermark();
    }
}

private void advanceWatermark() {
    output.emitWatermark(new Watermark(currentWatermark));
    long start = getMiniBatchStart(currentWatermark, minibatchInterval);
    long end = start + minibatchInterval - 1;
    nextWatermark = end > currentWatermark ? end : end + minibatchInterval;
}
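To see which watermarks survive the filter, the logic above can be replayed on a list of timestamps. This is a sketch under stated assumptions: the body of getMiniBatchStart() is not shown in the excerpt, so the miniBatchStart() helper here is a guess that rounds down to the interval start, and the initial value of nextWatermark (the end of the batch containing timestamp 0) is likewise assumed. The Long.MAX_VALUE end-of-input branch is omitted.

```java
import java.util.ArrayList;
import java.util.List;

class RowTimeWatermarkFilterDemo {
    private final long interval;
    private long currentWatermark = 0L;
    private long nextWatermark;
    private final List<Long> forwarded = new ArrayList<>();

    RowTimeWatermarkFilterDemo(long interval) {
        this.interval = interval;
        // Assumption: the first boundary is the end of the batch containing timestamp 0.
        this.nextWatermark = miniBatchStart(0L, interval) + interval - 1;
    }

    /** Assumed helper: start of the batch that contains the given watermark. */
    static long miniBatchStart(long watermark, long interval) {
        return watermark - (watermark + interval) % interval;
    }

    /** Mirrors processWatermark()/advanceWatermark() above, recording what is forwarded. */
    void processWatermark(long timestamp) {
        currentWatermark = Math.max(currentWatermark, timestamp);
        if (currentWatermark >= nextWatermark) {
            forwarded.add(currentWatermark);
            long end = miniBatchStart(currentWatermark, interval) + interval - 1;
            nextWatermark = end > currentWatermark ? end : end + interval;
        }
    }

    List<Long> forwarded() {
        return forwarded;
    }

    public static void main(String[] args) {
        RowTimeWatermarkFilterDemo demo = new RowTimeWatermarkFilterDemo(1000L);
        for (long wm : new long[] {100, 500, 999, 1200, 1999, 2500, 4100}) {
            demo.processWatermark(wm);
        }
        System.out.println(demo.forwarded()); // [999, 1999, 4100]
    }
}
```

Of the seven input watermarks, only 999, 1999 and 4100 pass through: each one is the first to reach or cross the pending batch boundary, and the rest are swallowed.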

Buffering Batches

StreamExecGroupAggregate, the physical node that implements group aggregation, handles the Mini-Batch-enabled case specially.

final OneInputStreamOperator<RowData, RowData> operator;
if (isMiniBatchEnabled) {
    MiniBatchGroupAggFunction aggFunction =
            new MiniBatchGroupAggFunction(
                    aggsHandler,
                    recordEqualiser,
                    accTypes,
                    inputRowType,
                    inputCountIndex,
                    generateUpdateBefore,
                    tableConfig.getIdleStateRetention().toMillis());
    operator =
            new KeyedMapBundleOperator<>(
                    aggFunction, AggregateUtil.createMiniBatchTrigger(tableConfig));
} else {
    GroupAggFunction aggFunction = new GroupAggFunction(/*...*/);
    operator = new KeyedProcessOperator<>(aggFunction);
}

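So the operator responsible for buffering is KeyedMapBundleOperator, and the corresponding function is MiniBatchGroupAggFunction. Let's look at the former first. Its abstract base class has the following three important fields.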

/** The map in heap to store elements. */
private transient Map<K, V> bundle;
/** The trigger that determines how many elements should be put into a bundle. */
private final BundleTrigger<IN> bundleTrigger;
/** The function used to process when receiving element. */
private final MapBundleFunction<K, V, IN, OUT> function;

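bundle: the buffer that temporarily holds the data.
bundleTrigger: similar to CountTrigger; it triggers computation when the amount of data in the bundle reaches the threshold (the table.exec.mini-batch.size mentioned above). Its source is simple and is not shown here.
function: the MiniBatchGroupAggFunction, which carries the actual computation logic.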

The operator's processing methods are as follows.

@Override
public void processElement(StreamRecord<IN> element) throws Exception {
    // get the key and value for the map bundle
    final IN input = element.getValue();
    final K bundleKey = getKey(input);
    final V bundleValue = bundle.get(bundleKey);
    // get a new value after adding this element to bundle
    final V newBundleValue = function.addInput(bundleValue, input);
    // update to map bundle
    bundle.put(bundleKey, newBundleValue);
    numOfElements++;
    bundleTrigger.onElement(input);
}

@Override
public void finishBundle() throws Exception {
    if (!bundle.isEmpty()) {
        numOfElements = 0;
        function.finishBundle(bundle, collector);
        bundle.clear();
    }
    bundleTrigger.reset();
}

@Override
public void processWatermark(Watermark mark) throws Exception {
    finishBundle();
    super.processWatermark(mark);
}

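Each incoming record is added to the bundle and the counter is incremented; then BundleTrigger#onElement() is called to check whether the trigger threshold has been reached. If so, it calls back finishBundle() to process the completed batch and clear the bundle. The same happens when a watermark arrives, which is how the batch timeout setting is honored.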
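The interplay of the count trigger and the watermark can be mimicked with a few lines of plain Java. This is a simplified stand-in, not the Flink operator: MiniBundleDemo, its String keys, the Long sum used as the bundle value, and the `flushed` list that records each emitted batch are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MiniBundleDemo {
    private final Map<String, Long> bundle = new HashMap<>(); // in-heap buffer
    private final int maxCount;                               // plays the role of the size threshold
    private int numOfElements = 0;
    final List<Map<String, Long>> flushed = new ArrayList<>(); // batches handed downstream

    MiniBundleDemo(int maxCount) {
        this.maxCount = maxCount;
    }

    /** Folds the record into the bundle and flushes once the count threshold is hit. */
    void processElement(String key, long value) {
        bundle.merge(key, value, Long::sum);
        if (++numOfElements >= maxCount) {
            finishBundle();
        }
    }

    /** A watermark marks the end of a batch interval, so it also flushes. */
    void processWatermark() {
        finishBundle();
    }

    private void finishBundle() {
        if (!bundle.isEmpty()) {
            flushed.add(new HashMap<>(bundle)); // snapshot of the finished batch
            bundle.clear();
        }
        numOfElements = 0;
    }

    public static void main(String[] args) {
        MiniBundleDemo op = new MiniBundleDemo(3);
        op.processElement("a", 1);
        op.processElement("b", 2);
        op.processElement("a", 3); // third record: count trigger fires
        op.processElement("c", 5);
        op.processWatermark();     // timeout path: flushes the partial batch
        op.processWatermark();     // empty bundle: nothing to flush
        System.out.println(op.flushed.size()); // 2
    }
}
```

The first batch is closed by the count trigger and contains a=4 and b=2; the second is closed by the watermark and contains only c=5. A watermark arriving on an empty bundle produces nothing.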

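The operator's finishBundle() method simply delegates to MiniBatchGroupAggFunction#finishBundle(). The code is fairly long and readers can look it up themselves, but the flow is simple: first create the accumulator instance, then accumulate or retract according to the RowKind of each input row (maintaining the state for each key along the way), and finally emit the changelog of the batch's aggregation results. It is worth noting that MiniBatchGroupAggFunction relies on code generation to produce the underlying aggregation handler (AggsHandleFunction) automatically, a technique that is very common in the Flink Table module.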
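The flow just described can be sketched for a single SUM aggregate. Everything here is illustrative rather than taken from Flink: ChangelogSumDemo, its Kind enum (a two-value stand-in for RowKind), the Row class, and the string form of the emitted changelog are assumptions, and the real function additionally handles cases such as deleting a key whose rows have all been retracted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ChangelogSumDemo {
    enum Kind { INSERT, DELETE } // simplified stand-in for RowKind

    static final class Row {
        final Kind kind;
        final long value;
        Row(Kind kind, long value) {
            this.kind = kind;
            this.value = value;
        }
    }

    private final Map<String, Long> state = new HashMap<>(); // stand-in for keyed state

    /** Processes one finished bundle and returns the changelog it produces. */
    List<String> finishBundle(Map<String, List<Row>> bundle) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Row>> entry : bundle.entrySet()) {
            String key = entry.getKey();
            Long previous = state.get(key);              // single state read per key
            long acc = previous == null ? 0L : previous; // create or restore the accumulator
            for (Row row : entry.getValue()) {
                // accumulate on INSERT, retract on DELETE
                acc += row.kind == Kind.INSERT ? row.value : -row.value;
            }
            state.put(key, acc);                         // single state write per key
            if (previous == null) {
                out.add("+I(" + key + "," + acc + ")");
            } else if (previous.longValue() != acc) {
                out.add("-U(" + key + "," + previous + ")");
                out.add("+U(" + key + "," + acc + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ChangelogSumDemo agg = new ChangelogSumDemo();
        Map<String, List<Row>> first = new LinkedHashMap<>();
        first.put("k1", List.of(new Row(Kind.INSERT, 5), new Row(Kind.INSERT, 3)));
        first.put("k2", List.of(new Row(Kind.INSERT, 1)));
        System.out.println(agg.finishBundle(first));  // [+I(k1,8), +I(k2,1)]

        Map<String, List<Row>> second = new LinkedHashMap<>();
        second.put("k1", List.of(new Row(Kind.DELETE, 3), new Row(Kind.INSERT, 10)));
        System.out.println(agg.finishBundle(second)); // [-U(k1,8), +U(k1,15)]
    }
}
```

Note that however many rows a key receives within a bundle, downstream sees at most one update pair for it, which is the second benefit of batching besides fewer state accesses.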

A Brief Note on Local-Global

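Local-Global is an optimization that automatically applies the two-phase aggregation idea to mitigate data skew (handy, isn't it?), much like the Combiner in MapReduce. Without further ado, here is the diagram from the official documentation.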

To enable Local-Global aggregation, set the following parameter on top of enabling Mini-Batch.

configuration.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE")

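Under the hood, Local-Global is driven by an optimizer rule named TwoStageOptimizedAggregateRule. It produces two physical nodes, StreamExecLocalGroupAggregate (local aggregation) and StreamExecGlobalGroupAggregate (global aggregation). Their translateToPlanInternal() methods likewise use code generation to create the aggregation functions MiniBatchLocalGroupAggFunction and MiniBatchGlobalGroupAggFunction. There is a lot of code, but the idea is equally clear, and readers can look it up themselves.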
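The two-phase idea itself fits in a short sketch. The code below is a conceptual illustration for a per-key COUNT, not the generated Flink functions: LocalGlobalDemo, the two hypothetical upstream subtasks, and the plain maps used as partial accumulators are all assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LocalGlobalDemo {
    /** Local phase: each upstream subtask pre-aggregates its own records before the shuffle. */
    static Map<String, Long> localAggregate(List<String> records) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : records) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    /** Global phase: partial accumulators for the same key are merged after the shuffle. */
    static Map<String, Long> globalAggregate(List<Map<String, Long>> partials) {
        Map<String, Long> result = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> result.merge(key, count, Long::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        // "hot" is a skewed key that dominates both subtasks.
        Map<String, Long> p1 = localAggregate(List.of("hot", "hot", "hot", "cold"));
        Map<String, Long> p2 = localAggregate(List.of("hot", "hot", "warm"));
        System.out.println(p1.size() + p2.size());            // 4 partial rows shuffled
        System.out.println(globalAggregate(List.of(p1, p2))); // hot=5, cold=1, warm=1 in some order
    }
}
```

The 7 input records shrink to 4 partial rows before the shuffle, and the skewed key "hot" reaches the global aggregator as 2 rows instead of 5, which is where the relief for data skew comes from.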

The End

Good night, everyone.
