The Hash Join Recursion Limit Problem in Flink Batch Mode
With the rapid growth of Flink's unified stream-batch processing capabilities and the improved usability of Flink SQL, more and more companies have begun to adopt Flink as an offline batch processing engine. When running large-scale join operations on Flink, you may encounter the following exception, which causes the job to fail:
```
Hash join exceeded maximum number of recursions, without reducing partitions enough to be memory resident.
```
As the message literally says, the hash join exceeded its maximum number of recursions. Flink's batch mode offers two join algorithms: Hybrid Hash Join and Sort-Merge Join. As its name suggests, Hybrid Hash Join combines the Simple Hash Join and Grace Hash Join algorithms (readers unfamiliar with them can refer to this article). A hand-drawn figure from an official Flink blog post illustrates the idea.
During the build phase, Flink's Hybrid Hash Join aggressively uses the TaskManager's managed memory and spills the hash partitions that do not fit into memory to disk. During the probe phase, once the in-memory hash partitions have been processed, their MemorySegments are released and the previously spilled partitions are read back in to speed up probing. Note in particular that if a spilled partition is still too large for the free managed memory (especially under data skew), it is recursively split into smaller partitions, as illustrated in the figure below.
Of course, recursive splitting cannot go on without bound. In the Blink runtime, if three rounds of recursive splitting still cannot satisfy the memory requirement, the exception quoted above is thrown.
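To make the mechanism concrete, here is a minimal, self-contained sketch of the pre-1.16 behavior. All names and constants are invented for illustration; this is not Flink's actual code. For determinism, partitioning here is radix-style, with each recursion level consuming the next two bits of the key (a real implementation re-hashes with a level-dependent seed): distinct keys get redistributed on every re-split, while a single hot key never does and eventually hits the depth limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal model of Hybrid Hash Join build-side partitioning with recursive
// re-splitting. All names and constants are invented; this is not Flink's code.
public class HybridHashJoinSketch {
    static final int NUM_PARTITIONS = 4;
    static final int MEMORY_BUDGET = 8;       // max build rows a partition may keep in memory
    static final int MAX_RECURSION_DEPTH = 3; // max rounds of recursive re-splitting

    // Radix-style partitioning: each recursion level consumes the next two key bits,
    // so re-splitting redistributes distinct keys (Flink re-hashes with a level seed).
    static int partitionOf(int key, int level) {
        return (key >>> (2 * level)) & (NUM_PARTITIONS - 1);
    }

    // Build phase: partitions that fit the budget become in-memory hash tables
    // (modeled as key -> row count); oversized ones are "spilled" and recursively re-split.
    static List<Map<Integer, Integer>> build(List<Integer> buildKeys, int level) {
        if (level > MAX_RECURSION_DEPTH) {
            // Pre-1.16 behavior: give up after the maximum number of recursions.
            throw new IllegalStateException("Hash join exceeded maximum number of recursions");
        }
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) parts.add(new ArrayList<>());
        for (int k : buildKeys) parts.get(partitionOf(k, level)).add(k);

        List<Map<Integer, Integer>> resident = new ArrayList<>();
        for (List<Integer> p : parts) {
            if (p.isEmpty()) continue;
            if (p.size() <= MEMORY_BUDGET) {
                Map<Integer, Integer> table = new HashMap<>();
                for (int k : p) table.merge(k, 1, Integer::sum);
                resident.add(table);
            } else {
                resident.addAll(build(p, level + 1)); // spill + recursive re-split
            }
        }
        return resident;
    }
}
```

With 64 distinct keys and a budget of 8 rows, the first split yields four partitions of 16 rows each, and one more recursive split makes every partition memory-resident; with 64 copies of a single key, no amount of re-splitting helps and the depth limit is hit.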
In my talk at the stream processing track of ApacheCon Asia 2022 this July, I discussed this problem and blamed it on the cost model of Flink SQL's CBO optimizer, which is not very well calibrated and strongly favors Hash Join. Since changing the optimizer is difficult, our interim workaround was to automatically set the `table.exec.disabled-operators` parameter after such a failure to disable the `ShuffleHashJoin` operator, thereby forcing Sort-Merge Join.
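For reference, in the SQL client this workaround boils down to a single setting (the option accepts a comma-separated list of operator names to disable):

```sql
-- Disable shuffled hash joins so the planner falls back to Sort-Merge Join
SET 'table.exec.disabled-operators' = 'ShuffleHashJoin';
```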
Still, this is hardly an elegant solution, so let's take a brief look at the better approach introduced in Flink 1.16: Adaptive Hash Join.
The Implementation of Adaptive Hash Join
Adaptive here means that when a hash join exceeds the recursion limit, instead of failing the job, the oversized partitions are automatically handed over to Sort-Merge Join.
The Blink runtime has two hash table implementations: `BinaryHashTable` (keys of type `BinaryRowData`) and `LongHybridHashTable` (keys of type `Long`). Taking the former as an example, look at its `prepareNextPartition()` method, which recursively fetches the next hash partition to process.
```java
private boolean prepareNextPartition() throws IOException {
    // finalize and cleanup the partitions of the current table
    // ......
    // there are pending partitions
    final BinaryHashPartition p = this.partitionsPending.get(0);
    // ......
    final int nextRecursionLevel = p.getRecursionLevel() + 1;
    if (nextRecursionLevel == 2) {
        LOG.info("Recursive hash join: partition number is " + p.getPartitionNumber());
    } else if (nextRecursionLevel > MAX_RECURSION_DEPTH) {
        LOG.info(
                "Partition number [{}] recursive level more than {}, process the partition using SortMergeJoin later.",
                p.getPartitionNumber(),
                MAX_RECURSION_DEPTH);
        // if the partition has spilled to disk more than three times, process it by sort merge
        // join later
        this.partitionsPendingForSMJ.add(p);
        // also need to remove it from pending list
        this.partitionsPending.remove(0);
        // recursively get the next partition
        return prepareNextPartition();
    }
    // build the next table; memory must be allocated after this call
    buildTableFromSpilledPartition(p, nextRecursionLevel);
    // set the probe side
    setPartitionProbeReader(p);
    // unregister the pending partition
    this.partitionsPending.remove(0);
    this.currentRecursionDepth = p.getRecursionLevel() + 1;
    // recursively get the next
    return nextMatching();
}
```
Note that when the recursion depth exceeds `MAX_RECURSION_DEPTH` (a constant defined as 3), the partition is placed into a container named `partitionsPendingForSMJ`, where it waits to be processed by Sort-Merge Join. Furthermore, the recursion-limit check at the beginning of `buildTableFromSpilledPartition()` (the method called above, which runs the build phase on a spilled partition) has been removed, which means the `Hash join exceeded maximum number of recursions` exception is now history.
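Stripped of Flink's internals, the decision above can be modeled with a small self-contained sketch (all names are invented for illustration): each pending partition is either rebuilt at the next recursion level or, once past the depth limit, routed to the sort-merge-join fallback list instead of failing the job.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of the adaptive decision in prepareNextPartition(); invented names.
public class AdaptivePartitionRouter {
    static final int MAX_RECURSION_DEPTH = 3;

    static final class Partition {
        final int number;
        final int recursionLevel;
        Partition(int number, int recursionLevel) {
            this.number = number;
            this.recursionLevel = recursionLevel;
        }
    }

    // Drains the pending queue: partitions whose next recursion level exceeds the
    // limit go to the SMJ fallback list; the rest would be rebuilt as hash tables.
    static List<Partition> route(Deque<Partition> pending, List<Partition> rebuilt) {
        List<Partition> pendingForSMJ = new ArrayList<>();
        while (!pending.isEmpty()) {
            Partition p = pending.poll();
            int nextRecursionLevel = p.recursionLevel + 1;
            if (nextRecursionLevel > MAX_RECURSION_DEPTH) {
                pendingForSMJ.add(p); // 1.16+: fall back instead of throwing
            } else {
                rebuilt.add(p); // buildTableFromSpilledPartition(p, nextRecursionLevel)
            }
        }
        return pendingForSMJ;
    }
}
```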
So how are the partitions waiting for Sort-Merge Join processed? Looking at the `HashJoinOperator` in the Blink runtime, constructing this operator now requires one more argument than before: an instance of `SortMergeJoinFunction`:
```java
private final SortMergeJoinFunction sortMergeJoinFunction;
```
`SortMergeJoinFunction` is essentially the processing logic of the old `SortMergeJoinOperator` extracted into a separate class; the algorithm itself is unchanged. The operator then reads the aforementioned `partitionsPendingForSMJ` container from the hash table and performs a Sort-Merge Join on the build side and the probe side of each partition.
```java
/**
 * If here also exists partitions which spilled to disk more than three time when hash join end,
 * means that the key in these partitions is very skewed, so fallback to sort merge join
 * algorithm to process it.
 */
private void fallbackSMJProcessPartition() throws Exception {
    if (!table.getPartitionsPendingForSMJ().isEmpty()) {
        // release memory to MemoryManager first that is used to sort merge join operator
        table.releaseMemoryCacheForSMJ();
        // initialize sort merge join operator
        LOG.info("Fallback to sort merge join to process spilled partitions.");
        initialSortMergeJoinFunction();
        fallbackSMJ = true;
        for (BinaryHashPartition p : table.getPartitionsPendingForSMJ()) {
            // process build side
            RowIterator<BinaryRowData> buildSideIter =
                    table.getSpilledPartitionBuildSideIter(p);
            while (buildSideIter.advanceNext()) {
                processSortMergeJoinElement1(buildSideIter.getRow());
            }
            // process probe side
            ProbeIterator probeIter = table.getSpilledPartitionProbeSideIter(p);
            BinaryRowData probeNext;
            while ((probeNext = probeIter.next()) != null) {
                processSortMergeJoinElement2(probeNext);
            }
        }
        // close the HashTable
        closeHashTable();
        // finish build and probe
        sortMergeJoinFunction.endInput(1);
        sortMergeJoinFunction.endInput(2);
        LOG.info("Finish sort merge join for spilled partitions.");
    }
}

private void initialSortMergeJoinFunction() throws Exception {
    sortMergeJoinFunction.open(
            true,
            this.getContainingTask(),
            this.getOperatorConfig(),
            (StreamRecordCollector) this.collector,
            this.computeMemorySize(),
            this.getRuntimeContext(),
            this.getMetricGroup());
}

private void processSortMergeJoinElement1(RowData rowData) throws Exception {
    if (leftIsBuild) {
        sortMergeJoinFunction.processElement1(rowData);
    } else {
        sortMergeJoinFunction.processElement2(rowData);
    }
}

private void processSortMergeJoinElement2(RowData rowData) throws Exception {
    if (leftIsBuild) {
        sortMergeJoinFunction.processElement2(rowData);
    } else {
        sortMergeJoinFunction.processElement1(rowData);
    }
}
```
Unlike `BinaryHashTable`, the join logic for `LongHybridHashTable` is entirely code-generated. In the corresponding generator, `LongHashJoinGenerator`, you can find code similar to what we saw above; interested readers can look it up themselves.
The End
Good night, everyone!