The Hash Join Recursion Limit Problem in Flink Batch Mode
With the rapid growth of Flink's unified stream-batch processing capabilities and the improved usability of Flink SQL, more and more companies have begun to adopt Flink as an offline batch processing engine. When running large-scale join operations on Flink, you may encounter the following exception, which causes the job to fail:
```
Hash join exceeded maximum number of recursions, without reducing partitions enough to be memory resident.
```
As the message literally says, the hash join exceeded its maximum number of recursions. Flink's batch mode offers two join algorithms: Hybrid Hash Join and Sort-Merge Join. As its name suggests, Hybrid Hash Join combines the Simple Hash Join and Grace Hash Join algorithms (readers unfamiliar with them can refer to this article). A hand-drawn figure from an official Flink blog post illustrates the idea.
During the build phase, Flink's Hybrid Hash Join aggressively uses the TaskManager's managed memory and spills the hash partitions that do not fit into memory to disk. During the probe phase, once the in-memory hash partitions have been processed, their MemorySegments are released and the previously spilled partitions are read back in to speed up probing. Note in particular that if a spilled partition is still too large for the free managed memory (especially under data skew), it is recursively split into smaller partitions, as illustrated in the figure below.
Of course, recursive splitting cannot go on without bound. In the Blink runtime, if three rounds of recursive splitting still cannot satisfy the memory requirement, the exception quoted above is thrown.
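To make the mechanism concrete, here is a minimal, self-contained sketch of the pre-1.16 behavior. All names and constants are invented for illustration; this is not Flink's actual code. For determinism, partitioning here is radix-style, with each recursion level consuming the next two bits of the key (a real implementation re-hashes with a level-dependent seed): distinct keys get redistributed on every re-split, while a single hot key never does and eventually hits the depth limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal model of Hybrid Hash Join build-side partitioning with recursive
// re-splitting. All names and constants are invented; this is not Flink's code.
public class HybridHashJoinSketch {
    static final int NUM_PARTITIONS = 4;
    static final int MEMORY_BUDGET = 8;       // max build rows a partition may keep in memory
    static final int MAX_RECURSION_DEPTH = 3; // max rounds of recursive re-splitting

    // Radix-style partitioning: each recursion level consumes the next two key bits,
    // so re-splitting redistributes distinct keys (Flink re-hashes with a level seed).
    static int partitionOf(int key, int level) {
        return (key >>> (2 * level)) & (NUM_PARTITIONS - 1);
    }

    // Build phase: partitions that fit the budget become in-memory hash tables
    // (modeled as key -> row count); oversized ones are "spilled" and recursively re-split.
    static List<Map<Integer, Integer>> build(List<Integer> buildKeys, int level) {
        if (level > MAX_RECURSION_DEPTH) {
            // Pre-1.16 behavior: give up after the maximum number of recursions.
            throw new IllegalStateException("Hash join exceeded maximum number of recursions");
        }
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) parts.add(new ArrayList<>());
        for (int k : buildKeys) parts.get(partitionOf(k, level)).add(k);

        List<Map<Integer, Integer>> resident = new ArrayList<>();
        for (List<Integer> p : parts) {
            if (p.isEmpty()) continue;
            if (p.size() <= MEMORY_BUDGET) {
                Map<Integer, Integer> table = new HashMap<>();
                for (int k : p) table.merge(k, 1, Integer::sum);
                resident.add(table);
            } else {
                resident.addAll(build(p, level + 1)); // spill + recursive re-split
            }
        }
        return resident;
    }
}
```

With 64 distinct keys and a budget of 8 rows, the first split yields four partitions of 16 rows each, and one more recursive split makes every partition memory-resident; with 64 copies of a single key, no amount of re-splitting helps and the depth limit is hit.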
In my talk at the stream processing track of ApacheCon Asia 2022 this July, I discussed this problem and blamed it on the cost model of Flink SQL's CBO optimizer, which is not very well calibrated and strongly favors Hash Join. Since changing the optimizer is difficult, our interim workaround was to automatically set the `table.exec.disabled-operators` parameter after such a failure to disable the `ShuffleHashJoin` operator, thereby forcing Sort-Merge Join.
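For reference, in the SQL client this workaround boils down to a single setting (the option accepts a comma-separated list of operator names to disable):

```sql
-- Disable shuffled hash joins so the planner falls back to Sort-Merge Join
SET 'table.exec.disabled-operators' = 'ShuffleHashJoin';
```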
Still, this is hardly an elegant solution, so let's take a brief look at the better approach introduced in Flink 1.16: Adaptive Hash Join.
The Implementation of Adaptive Hash Join
Adaptive here means that when a hash join exceeds the recursion limit, instead of failing the job, the oversized partitions are automatically handed over to Sort-Merge Join.
The Blink runtime has two hash table implementations: `BinaryHashTable` (keys of type `BinaryRowData`) and `LongHybridHashTable` (keys of type `Long`). Taking the former as an example, look at its `prepareNextPartition()` method, which recursively fetches the next hash partition to process.
```java
private boolean prepareNextPartition() throws IOException {
    // finalize and cleanup the partitions of the current table
    // ......
    // there are pending partitions
    final BinaryHashPartition p = this.partitionsPending.get(0);
    // ......
    final int nextRecursionLevel = p.getRecursionLevel() + 1;
    if (nextRecursionLevel == 2) {
        LOG.info("Recursive hash join: partition number is " + p.getPartitionNumber());
    } else if (nextRecursionLevel > MAX_RECURSION_DEPTH) {
        LOG.info(
                "Partition number [{}] recursive level more than {}, process the partition using SortMergeJoin later.",
                p.getPartitionNumber(),
                MAX_RECURSION_DEPTH);
        // if the partition has spilled to disk more than three times, process it by sort merge
        // join later
        this.partitionsPendingForSMJ.add(p);
        // also need to remove it from pending list
        this.partitionsPending.remove(0);
        // recursively get the next partition
        return prepareNextPartition();
    }
    // build the next table; memory must be allocated after this call
    buildTableFromSpilledPartition(p, nextRecursionLevel);
    // set the probe side
    setPartitionProbeReader(p);
    // unregister the pending partition
    this.partitionsPending.remove(0);
    this.currentRecursionDepth = p.getRecursionLevel() + 1;
    // recursively get the next
    return nextMatching();
}
```
Note that when the recursion depth exceeds `MAX_RECURSION_DEPTH` (a constant defined as 3), the partition is placed into a container named `partitionsPendingForSMJ`, where it waits to be processed by Sort-Merge Join. Furthermore, the recursion-limit check at the beginning of `buildTableFromSpilledPartition()` (the method called above, which runs the build phase on a spilled partition) has been removed, which means the `Hash join exceeded maximum number of recursions` exception is now history.
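Stripped of Flink's internals, the decision above can be modeled with a small self-contained sketch (all names are invented for illustration): each pending partition is either rebuilt at the next recursion level or, once past the depth limit, routed to the sort-merge-join fallback list instead of failing the job.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of the adaptive decision in prepareNextPartition(); invented names.
public class AdaptivePartitionRouter {
    static final int MAX_RECURSION_DEPTH = 3;

    static final class Partition {
        final int number;
        final int recursionLevel;
        Partition(int number, int recursionLevel) {
            this.number = number;
            this.recursionLevel = recursionLevel;
        }
    }

    // Drains the pending queue: partitions whose next recursion level exceeds the
    // limit go to the SMJ fallback list; the rest would be rebuilt as hash tables.
    static List<Partition> route(Deque<Partition> pending, List<Partition> rebuilt) {
        List<Partition> pendingForSMJ = new ArrayList<>();
        while (!pending.isEmpty()) {
            Partition p = pending.poll();
            int nextRecursionLevel = p.recursionLevel + 1;
            if (nextRecursionLevel > MAX_RECURSION_DEPTH) {
                pendingForSMJ.add(p); // 1.16+: fall back instead of throwing
            } else {
                rebuilt.add(p); // buildTableFromSpilledPartition(p, nextRecursionLevel)
            }
        }
        return pendingForSMJ;
    }
}
```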
So how are the partitions waiting for Sort-Merge Join processed? Looking at the `HashJoinOperator` in the Blink runtime, constructing this operator now requires one more argument than before: an instance of `SortMergeJoinFunction`:
```java
private final SortMergeJoinFunction sortMergeJoinFunction;
```
`SortMergeJoinFunction` is essentially the processing logic of the old `SortMergeJoinOperator` extracted into a separate class; the algorithm itself is unchanged. The operator then reads the aforementioned `partitionsPendingForSMJ` container from the hash table and performs a Sort-Merge Join on the build side and the probe side of each partition.
```java
/**
 * If here also exists partitions which spilled to disk more than three time when hash join end,
 * means that the key in these partitions is very skewed, so fallback to sort merge join
 * algorithm to process it.
 */
private void fallbackSMJProcessPartition() throws Exception {
    if (!table.getPartitionsPendingForSMJ().isEmpty()) {
        // release memory to MemoryManager first that is used to sort merge join operator
        table.releaseMemoryCacheForSMJ();
        // initialize sort merge join operator
        LOG.info("Fallback to sort merge join to process spilled partitions.");
        initialSortMergeJoinFunction();
        fallbackSMJ = true;
        for (BinaryHashPartition p : table.getPartitionsPendingForSMJ()) {
            // process build side
            RowIterator<BinaryRowData> buildSideIter =
                    table.getSpilledPartitionBuildSideIter(p);
            while (buildSideIter.advanceNext()) {
                processSortMergeJoinElement1(buildSideIter.getRow());
            }
            // process probe side
            ProbeIterator probeIter = table.getSpilledPartitionProbeSideIter(p);
            BinaryRowData probeNext;
            while ((probeNext = probeIter.next()) != null) {
                processSortMergeJoinElement2(probeNext);
            }
        }
        // close the HashTable
        closeHashTable();
        // finish build and probe
        sortMergeJoinFunction.endInput(1);
        sortMergeJoinFunction.endInput(2);
        LOG.info("Finish sort merge join for spilled partitions.");
    }
}

private void initialSortMergeJoinFunction() throws Exception {
    sortMergeJoinFunction.open(
            true,
            this.getContainingTask(),
            this.getOperatorConfig(),
            (StreamRecordCollector) this.collector,
            this.computeMemorySize(),
            this.getRuntimeContext(),
            this.getMetricGroup());
}

private void processSortMergeJoinElement1(RowData rowData) throws Exception {
    if (leftIsBuild) {
        sortMergeJoinFunction.processElement1(rowData);
    } else {
        sortMergeJoinFunction.processElement2(rowData);
    }
}

private void processSortMergeJoinElement2(RowData rowData) throws Exception {
    if (leftIsBuild) {
        sortMergeJoinFunction.processElement2(rowData);
    } else {
        sortMergeJoinFunction.processElement1(rowData);
    }
}
```
Unlike `BinaryHashTable`, the join logic for `LongHybridHashTable` is entirely code-generated. In the corresponding generator, `LongHashJoinGenerator`, you can find code similar to what we saw above; interested readers can look it up themselves.
The End
Good night, everyone!