A Brief Look at Adaptive Hash Join in Flink Batch Mode

The Hash Join Recursion Limit Problem in Flink Batch

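With Flink's unified stream and batch capabilities maturing quickly and Flink SQL becoming easier to use, more and more vendors are adopting Flink as an offline batch processing engine. When running large-scale joins in Flink, a job may fail with the following exception: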

Hash join exceeded maximum number of recursions, without reducing partitions enough to be memory resident.

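Taken literally, the Hash Join recursed more times than allowed. For equi-joins, Flink's batch mode has two main join algorithms: Hybrid Hash Join and Sort-Merge Join. As the name suggests, Hybrid Hash Join combines Simple Hash Join and Grace Hash Join (see this article for background on both). A hand-drawn diagram from the official Flink blog illustrates it.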

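During the build phase, Flink's Hybrid Hash Join makes aggressive use of the TaskManager's managed memory and spills the hash partitions that do not fit in memory to disk. During the probe phase, once the in-memory partitions have been processed, their MemorySegments are released and the previously spilled partitions are read back in, which improves probe efficiency. Note in particular that if a spilled partition is still too large for the free managed memory (especially under data skew), it is recursively split into smaller partitions, as the figure below shows.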

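Recursive splitting cannot go on forever, of course. In the Blink runtime, if a partition still does not fit in memory after 3 rounds of recursive splitting, the exception quoted above is thrown.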
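To make the failure mode concrete, here is a minimal, self-contained sketch of recursive partitioning with a depth limit. It is not Flink code: the class name, FAN_OUT, the row-count "memory capacity", and the digit-based bucket function (a stand-in for using a different hash function at each level) are all assumptions made for illustration. Only the limit of 3 and the exception message come from the text above.

```java
import java.util.ArrayList;
import java.util.List;

class RecursiveHashPartitionSketch {
    static final int MAX_RECURSION_DEPTH = 3; // the limit cited in the article
    static final int FAN_OUT = 4;             // hypothetical sub-partitions per split

    // Stand-in for "a different hash function per level": each level looks at
    // a different base-FAN_OUT digit of the key.
    static int bucket(long key, int level) {
        long divisor = 1;
        for (int i = 1; i < level; i++) {
            divisor *= FAN_OUT;
        }
        return (int) Math.floorMod(key / divisor, (long) FAN_OUT);
    }

    // Returns the deepest recursion level reached before every partition holds
    // at most memoryCapacity rows; throws once the depth limit is exceeded.
    static int deepestLevel(List<Long> keys, int memoryCapacity, int level) {
        if (keys.size() <= memoryCapacity) {
            return level; // partition is memory resident, no further split
        }
        int nextLevel = level + 1;
        if (nextLevel > MAX_RECURSION_DEPTH) {
            throw new IllegalStateException(
                    "Hash join exceeded maximum number of recursions, "
                            + "without reducing partitions enough to be memory resident.");
        }
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i < FAN_OUT; i++) {
            buckets.add(new ArrayList<>());
        }
        for (long key : keys) {
            buckets.get(bucket(key, nextLevel)).add(key);
        }
        int deepest = nextLevel;
        for (List<Long> b : buckets) {
            deepest = Math.max(deepest, deepestLevel(b, memoryCapacity, nextLevel));
        }
        return deepest;
    }

    public static void main(String[] args) {
        List<Long> uniform = new ArrayList<>();
        List<Long> skewed = new ArrayList<>();
        for (long i = 0; i < 1000; i++) {
            uniform.add(i); // 1000 distinct keys
            skewed.add(7L); // 1000 copies of one hot key
        }
        System.out.println("uniform keys settle at level " + deepestLevel(uniform, 100, 0));
        try {
            deepestLevel(skewed, 100, 0);
        } catch (IllegalStateException e) {
            System.out.println("skewed keys: " + e.getMessage());
        }
    }
}
```

With 1000 distinct keys and a capacity of 100 rows, two rounds of splitting are enough. With 1000 copies of a single key, every split puts all rows back into one bucket, so no number of rounds helps and the limit is hit. That is why skewed data is the usual trigger for this exception.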

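I covered this problem in my talk in the stream processing track at ApacheCon Asia 2022 this July, where I attributed it to the cost model of Flink SQL's CBO optimizer, which is not well calibrated and strongly favors Hash Join. Because changing the cost model is difficult, the interim workaround is to set the table.exec.disabled-operators option automatically after a job fails so that the ShuffleHashJoin operator is disabled, which forces Sort-Merge Join.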
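The retry logic of that workaround could look roughly like the sketch below. It deliberately has no Flink dependency: the BatchJob interface, the class name, and the way the configuration is passed as a plain map are hypothetical stand-ins for whatever submission layer is actually in use. Only the option key table.exec.disabled-operators, the value ShuffleHashJoin, and the exception message come from the text above.

```java
import java.util.HashMap;
import java.util.Map;

class HashJoinFallbackRetry {
    static final String OPTION_KEY = "table.exec.disabled-operators";
    static final String RECURSION_ERROR = "Hash join exceeded maximum number of recursions";

    // Hypothetical abstraction of "submit the batch job with this configuration".
    interface BatchJob {
        void run(Map<String, String> conf);
    }

    // Walks the cause chain, since the error is usually wrapped by the runtime.
    static boolean causedByRecursionLimit(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && msg.contains(RECURSION_ERROR)) {
                return true;
            }
        }
        return false;
    }

    // Runs the job once; on the recursion error, retries with ShuffleHashJoin
    // disabled. Returns the configuration the job finally ran with.
    static Map<String, String> runWithFallback(BatchJob job, Map<String, String> baseConf) {
        try {
            job.run(baseConf);
            return baseConf;
        } catch (RuntimeException e) {
            if (!causedByRecursionLimit(e)) {
                throw e; // unrelated failure, do not mask it
            }
            Map<String, String> retryConf = new HashMap<>(baseConf);
            // The option takes a comma-separated list, so append to any existing value.
            retryConf.merge(OPTION_KEY, "ShuffleHashJoin", (a, b) -> a + "," + b);
            job.run(retryConf);
            return retryConf;
        }
    }

    public static void main(String[] args) {
        BatchJob flaky = conf -> {
            if (!conf.getOrDefault(OPTION_KEY, "").contains("ShuffleHashJoin")) {
                throw new RuntimeException(RECURSION_ERROR + ", without reducing partitions ...");
            }
        };
        System.out.println(runWithFallback(flaky, new HashMap<>()));
    }
}
```

The cost of this approach is that the whole job runs twice in the failure case, which is part of why it is only a stopgap.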

That is still not an elegant fix. Next, a brief look at the somewhat better solution introduced in Flink 1.16: Adaptive Hash Join.

How Adaptive Hash Join Is Implemented

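"Adaptive" here means that when a Hash Join exceeds the recursion limit, the job no longer has to fail; the oversized partitions are automatically handed over to Sort-Merge Join instead.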

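The Blink runtime has two hash tables: BinaryHashTable (keys of type BinaryRowData) and LongHybridHashTable (keys of type long). Taking the former as an example, look at its prepareNextPartition() method, which recursively fetches the next hash partition to process.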

    private boolean prepareNextPartition() throws IOException {
        // finalize and cleanup the partitions of the current table
        // ......

        // there are pending partitions
        final BinaryHashPartition p = this.partitionsPending.get(0);
        // ......

        final int nextRecursionLevel = p.getRecursionLevel() + 1;
        if (nextRecursionLevel == 2) {
            LOG.info("Recursive hash join: partition number is " + p.getPartitionNumber());
        } else if (nextRecursionLevel > MAX_RECURSION_DEPTH) {
            LOG.info(
                    "Partition number [{}] recursive level more than {}, process the partition using SortMergeJoin later.",
                    p.getPartitionNumber(),
                    MAX_RECURSION_DEPTH);
            // if the partition has spilled to disk more than three times, process it by sort merge
            // join later
            this.partitionsPendingForSMJ.add(p);
            // also need to remove it from pending list
            this.partitionsPending.remove(0);
            // recursively get the next partition
            return prepareNextPartition();
        }

        // build the next table; memory must be allocated after this call
        buildTableFromSpilledPartition(p, nextRecursionLevel);

        // set the probe side
        setPartitionProbeReader(p);

        // unregister the pending partition
        this.partitionsPending.remove(0);
        this.currentRecursionDepth = p.getRecursionLevel() + 1;

        // recursively get the next
        return nextMatching();
    }

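Note that when the recursion depth exceeds MAX_RECURSION_DEPTH (a constant defined as 3), the partition is put straight into a container named partitionsPendingForSMJ, where it waits for Sort-Merge Join. In addition, the recursion-limit check at the start of buildTableFromSpilledPartition(), which this method calls to run the build on a spilled partition, has been removed. The "Hash join exceeded maximum number of recursions" exception is therefore a thing of the past.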
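The control flow can be imitated outside Flink with a toy model. The sketch below is an assumption-laden simplification, not the BinaryHashTable implementation: partitions are plain lists of long keys, "memory" is a row-count capacity, FAN_OUT and the digit-based bucket function are hypothetical, and the recursion of prepareNextPartition() is replaced by a loop over a pending queue. What it keeps from the source is the rule itself: a partition whose next recursion level would exceed MAX_RECURSION_DEPTH is moved to a separate list instead of failing the job.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class AdaptiveFallbackSketch {
    static final int MAX_RECURSION_DEPTH = 3; // same constant value as in the source
    static final int FAN_OUT = 4;             // hypothetical sub-partitions per split

    // A spilled partition: its keys plus how often it has been re-partitioned.
    static final class Partition {
        final List<Long> keys;
        final int recursionLevel;

        Partition(List<Long> keys, int recursionLevel) {
            this.keys = keys;
            this.recursionLevel = recursionLevel;
        }
    }

    // Stand-in for a level-dependent hash: a different base-FAN_OUT digit per level.
    static int bucket(long key, int level) {
        long divisor = 1;
        for (int i = 1; i < level; i++) {
            divisor *= FAN_OUT;
        }
        return (int) Math.floorMod(key / divisor, (long) FAN_OUT);
    }

    // Returns the partitions that must be deferred to sort-merge join.
    static List<Partition> partitionsForSortMergeJoin(List<Long> keys, int memoryCapacity) {
        Deque<Partition> partitionsPending = new ArrayDeque<>();
        List<Partition> partitionsPendingForSMJ = new ArrayList<>();
        partitionsPending.add(new Partition(keys, 0));

        while (!partitionsPending.isEmpty()) {
            Partition p = partitionsPending.poll();
            if (p.keys.size() <= memoryCapacity) {
                continue; // fits in memory: would be hash-joined right away
            }
            int nextRecursionLevel = p.recursionLevel + 1;
            if (nextRecursionLevel > MAX_RECURSION_DEPTH) {
                partitionsPendingForSMJ.add(p); // defer instead of throwing
                continue;
            }
            List<List<Long>> buckets = new ArrayList<>();
            for (int i = 0; i < FAN_OUT; i++) {
                buckets.add(new ArrayList<>());
            }
            for (long key : p.keys) {
                buckets.get(bucket(key, nextRecursionLevel)).add(key);
            }
            for (List<Long> b : buckets) {
                if (!b.isEmpty()) {
                    partitionsPending.add(new Partition(b, nextRecursionLevel));
                }
            }
        }
        return partitionsPendingForSMJ;
    }

    public static void main(String[] args) {
        List<Long> keys = new ArrayList<>();
        for (long i = 0; i < 200; i++) {
            keys.add(i);  // 200 distinct keys
        }
        for (int i = 0; i < 500; i++) {
            keys.add(7L); // plus one heavily skewed key
        }
        for (Partition p : partitionsForSortMergeJoin(keys, 100)) {
            System.out.println("deferred: " + p.keys.size() + " rows at level " + p.recursionLevel);
        }
    }
}
```

In this toy run only the partition containing the hot key is deferred; every other partition still goes through the hash join path. The fallback is therefore applied per partition, not to the whole join.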

So how are the partitions waiting for Sort-Merge Join processed? Look at the HashJoinOperator in the Blink runtime: constructing it now requires one extra argument, a SortMergeJoinFunction instance:

    private final SortMergeJoinFunction sortMergeJoinFunction;

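SortMergeJoinFunction is simply the processing logic of the old SortMergeJoinOperator extracted into its own class; the algorithm itself is unchanged. The operator then reads the partitionsPendingForSMJ container from the hash table and feeds each partition's build-side and probe-side rows into the Sort-Merge Join.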
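For readers who want a reminder of why sort-merge copes with skew, here is a minimal in-memory sketch of the algorithm on bare long keys. It is not SortMergeJoinFunction: the class and method names are hypothetical, rows are reduced to their join key, and the external sorting and spilling that the real implementation relies on are left out entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortMergeJoinSketch {
    // Inner join on the key itself. The result contains one entry per matching
    // (build row, probe row) pair, in ascending key order.
    static List<Long> join(List<Long> buildKeys, List<Long> probeKeys) {
        List<Long> build = new ArrayList<>(buildKeys);
        List<Long> probe = new ArrayList<>(probeKeys);
        Collections.sort(build); // sort phase; a real engine sorts externally
        Collections.sort(probe);

        List<Long> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < build.size() && j < probe.size()) {
            int cmp = Long.compare(build.get(i), probe.get(j));
            if (cmp < 0) {
                i++; // build key too small, advance build side
            } else if (cmp > 0) {
                j++; // probe key too small, advance probe side
            } else {
                long key = build.get(i);
                int iEnd = i;
                while (iEnd < build.size() && build.get(iEnd) == key) {
                    iEnd++;
                }
                int jEnd = j;
                while (jEnd < probe.size() && probe.get(jEnd) == key) {
                    jEnd++;
                }
                // Every build row with this key matches every probe row with it.
                int pairs = (iEnd - i) * (jEnd - j);
                for (int n = 0; n < pairs; n++) {
                    out.add(key);
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> build = List.of(7L, 7L, 3L, 9L);
        List<Long> probe = List.of(7L, 5L, 7L, 7L, 3L);
        System.out.println(join(build, probe)); // prints [3, 7, 7, 7, 7, 7, 7]
    }
}
```

The merge phase only ever scans both sorted inputs forward, so a hot key does not require the whole partition to fit into a hash table, which is the property the fallback relies on.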

    /**
     * If here also exists partitions which spilled to disk more than three time when hash join end,
     * means that the key in these partitions is very skewed, so fallback to sort merge join
     * algorithm to process it.
     */
    private void fallbackSMJProcessPartition() throws Exception {
        if (!table.getPartitionsPendingForSMJ().isEmpty()) {
            // release memory to MemoryManager first that is used to sort merge join operator
            table.releaseMemoryCacheForSMJ();
            // initialize sort merge join operator
            LOG.info("Fallback to sort merge join to process spilled partitions.");
            initialSortMergeJoinFunction();
            fallbackSMJ = true;

            for (BinaryHashPartition p : table.getPartitionsPendingForSMJ()) {
                // process build side
                RowIterator<BinaryRowData> buildSideIter =
                        table.getSpilledPartitionBuildSideIter(p);
                while (buildSideIter.advanceNext()) {
                    processSortMergeJoinElement1(buildSideIter.getRow());
                }

                // process probe side
                ProbeIterator probeIter = table.getSpilledPartitionProbeSideIter(p);
                BinaryRowData probeNext;
                while ((probeNext = probeIter.next()) != null) {
                    processSortMergeJoinElement2(probeNext);
                }
            }

            // close the HashTable
            closeHashTable();

            // finish build and probe
            sortMergeJoinFunction.endInput(1);
            sortMergeJoinFunction.endInput(2);
            LOG.info("Finish sort merge join for spilled partitions.");
        }
    }

    private void initialSortMergeJoinFunction() throws Exception {
        sortMergeJoinFunction.open(
                true,
                this.getContainingTask(),
                this.getOperatorConfig(),
                (StreamRecordCollector) this.collector,
                this.computeMemorySize(),
                this.getRuntimeContext(),
                this.getMetricGroup());
    }

    private void processSortMergeJoinElement1(RowData rowData) throws Exception {
        if (leftIsBuild) {
            sortMergeJoinFunction.processElement1(rowData);
        } else {
            sortMergeJoinFunction.processElement2(rowData);
        }
    }

    private void processSortMergeJoinElement2(RowData rowData) throws Exception {
        if (leftIsBuild) {
            sortMergeJoinFunction.processElement2(rowData);
        } else {
            sortMergeJoinFunction.processElement1(rowData);
        }
    }

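Unlike BinaryHashTable, the join logic for LongHybridHashTable is entirely code-generated. The corresponding generator, LongHashJoinGenerator, contains code similar to the above, which interested readers can look up for themselves.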

The End

Good night, everyone.
