7x Faster: Deep Practice and Improvements of Apache Spark Adaptive Query Execution at NetEase

> Based on Apache Spark 3.1.1, this article explains how Adaptive Query Execution (AQE) works, along with the pain points NetEase Shufan ran into while applying AQE in depth and the thinking behind our improvements.

## Preface

Adaptive Query Execution (AQE) is one of the major features introduced in Spark 3.0. It dynamically re-optimizes a user's SQL execution plan at runtime, greatly improving the performance and stability of Spark jobs. AQE comprises several sub-features, including dynamic partition coalescing, automatic skew-join optimization, and dynamic join-strategy selection. These spare users the painful process of hand-tuning each workload, or even rewriting business logic, and make Spark considerably easier and more flexible to use.

As the creator of NetEase's big-data infrastructure software, the NetEase Youdata team under NetEase Shufan has followed AQE since its inception. The first system to adopt it was [Kyuubi](https://github.com/NetEase/kyuubi), an enterprise-grade data lake exploration platform open-sourced by NetEase that implements a multi-tenant SQL-on-Hadoop query engine on top of Spark SQL. Thanks to Kyuubi's client/server architecture, the server side can upgrade Spark smoothly while preserving SQL compatibility, quickly delivering the latest community and internal optimizations and enhancements to users. Starting with Spark 3.0.2, NetEase Youdata gradually trialed and promoted AQE in production, and since the release of Spark 3.1.1, AQE has been the default execution mode for users in Kyuubi's production environment. Along the way we also helped one business migrate 1,500+ legacy Hive jobs to Spark 3.1.1 end to end, halving resource usage while cutting total execution time by more than 70%, an overall performance improvement of more than 7x.

Of course, AQE is still a "new" feature, and in practice we found it lacking in many respects, with plenty of room for optimization. In keeping with our open-source strategy, NetEase Youdata strives to share the problems we hit with the Spark community and to merge our optimizations upstream. The following sections present our experience with and improvements to AQE over the past six months or so.

## How AQE Is Designed

First, a core premise: AQE's design and optimizations revolve entirely around [shuffle](http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations). In other words, if an execution plan contains no shuffle, AQE has no effect. Common operators that may produce a shuffle include Aggregate (group by), Join, and Repartition.

Unlike the traditional approach of scheduling the execution plan as a whole, AQE splits the plan into several sub-plans along shuffle boundaries and wraps each one in a new leaf node, refining scheduling granularity to the stage level (stages are likewise delimited by shuffles). After this decomposition, AQE can collect a sub-plan's shuffle statistics once it finishes and use them to dynamically optimize the next sub-plan.

![AQE stage-by-stage execution](https://static001.infoq.cn/resource/image/6e/e3/6ed455618bd93f92fd7f3e91bde69de3.png)

*Image from the [Databricks blog](https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html)*

With this scheduling flow in place, AQE can apply its optimization strategies, which fall into two broad categories: dynamically revising the execution plan, and dynamically generating shuffle readers.
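Conceptually, the stage-by-stage loop can be sketched as follows (a toy model with made-up function names, not Spark's actual internals):

```python
# Toy model of AQE's adaptive loop (hypothetical names, not Spark internals):
# run the next ready sub-plan (query stage), collect its shuffle statistics,
# then re-optimize the remaining plan before scheduling the next stage.

def adaptively_execute(plan, optimize, run_stage, next_ready_stage):
    """Repeatedly materialize shuffle stages and re-plan with fresh stats."""
    stats = {}
    while True:
        stage = next_ready_stage(plan)     # a sub-plan ending in a shuffle
        if stage is None:                  # no shuffle left: run the final plan
            return run_stage(plan, stats)
        stats[stage] = run_stage(stage, stats)  # e.g. bytes per partition
        plan = optimize(plan, stats)       # e.g. rewrite SMJ into broadcast join
```

Spark's actual implementation (`AdaptiveSparkPlanExec`) is more sophisticated, for instance materializing independent query stages in parallel rather than one at a time.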
### Dynamically Revising the Execution Plan

Dynamic plan revision has two parts: re-optimizing the logical plan, and generating a new physical plan. The usual SQL flow is logical plan -> physical plan, whereas AQE's flow is child physical plan -> parent logical plan -> parent physical plan, which opens up far more room for optimization. For example, when choosing how to execute a Join, a plan that originally used Sort Merge Join may be re-optimized into Broadcast Hash Join. At the plan level it looks like this:

![Sort Merge Join re-optimized into Broadcast Hash Join](https://static001.infoq.cn/resource/image/90/f7/90fabae4b78e9fbe609110b6a8de3af7.png)

### Dynamically Generating Shuffle Readers

First, a simple distinction: the map side writes shuffle data, and the reduce side reads it. A shuffle reader can be understood as the component inside a reduce task that pulls shuffle data. The standard shuffle reader pulls the data assigned to each reduce task according to a preset partition count (the familiar **spark.sql.shuffle.partitions**).
A dynamically generated shuffle reader, by contrast, decides the number of reduce tasks from runtime shuffle statistics. Two examples follow: partition coalescing and skew-join optimization.

- Partition coalescing is a general-purpose optimization. The idea is to merge several reduce tasks that each read only a small amount of shuffle data into a single one. In an extreme case, the shuffle data may total only tens of KB while thousands of partitions were declared, wasting an enormous amount of scheduling resource. AQE detects this after the map stage completes and dynamically merges the reduce tasks; in this case the reduce count collapses to 1. The optimization drastically cuts the number of reduce tasks and raises per-task throughput.

- Skew-join optimization, unlike partition coalescing, targets only Join scenarios. If some join key is skewed, the corresponding reduce partition in Spark ends up skewed as well. After the map stage completes, AQE pre-computes the shuffle volume each reduce task would read, then splits the oversized reduce partitions: shuffle data previously read by one reduce task is instead read by n of them. Each reduce task then processes a comparable amount of data, resolving the skew.

The AQE optimization rules are implemented very cleverly; rather than expand on further details here, we recommend reading [Kyuubi and AQE](https://kyuubi.readthedocs.io/en/latest/deployment/spark/aqe.html).
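The two shuffle-reader optimizations just described boil down to simple ideas. Below is a minimal pure-Python sketch of both (illustrative logic only, not Spark's implementation): coalescing greedily merges consecutive small partitions up to a target size, and skew handling splits one oversized partition into roughly equal slices.

```python
def coalesce_partitions(sizes, target):
    """Greedily merge consecutive shuffle partitions until each group
    reaches roughly `target` bytes. Returns lists of partition indices,
    one list per post-merge reduce task."""
    groups, current, acc = [], [], 0
    for i, size in enumerate(sizes):
        current.append(i)
        acc += size
        if acc >= target:
            groups.append(current)
            current, acc = [], 0
    if current:                      # leftover small partitions
        groups.append(current)
    return groups

def split_skewed_partition(size, target):
    """Split one skewed partition into n slices of at most `target` bytes,
    so that n reduce tasks can read it in parallel."""
    n = max(1, -(-size // target))   # ceiling division
    return [size // n + (1 if i < size % n else 0) for i in range(n)]
```

For example, partitions of 10, 10, 10, and 500 KB with a 30 KB target coalesce into two groups, while a single 500 KB skewed partition with a 200 KB target splits into three slices.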
## Problems with the Community-Native AQE

AQE sounds like a silver bullet that covers all the problems we commonly hit, but is it really that smooth in practice? Here are some pain points NetEase encountered while using it.

### Limited Coverage

Take skew-join optimization. It is a genuinely great idea, with one flaw: the scenarios it covers are limited. In our deep practice we regularly ran into joins that were visibly skewed yet were not optimized as expected. That is deeply confusing for users: in a SQL statement hundreds or thousands of lines long, which joins will be optimized and which will not? Verifying this eats a large amount of time.

### Broadcast Joins Are Irreversible

The broadcast threshold **spark.sql.autoBroadcastJoinThreshold** is one of the configurations we change most often. Its advantage is implementing a join as a broadcast and avoiding a data shuffle. But broadcasting has a serious problem: whether a table can be broadcast is decided from static statistics, and especially after a series of filter operations, even the best cost estimate is inaccurate. Job failures caused by this are common: driver-side OOMs, broadcast timeouts, and so on. Moreover, broadcast in AQE is irreversible: once a join has been planned as a broadcast join before entering AQE optimization, AQE cannot convert it into another join type (such as Sort Merge Join). For a table broadcast because of a wrong size estimate this is fatal, and it was one of the main factors hurting our job stability.

### Inflexible Configuration

As good as AQE is, its configuration is not flexible enough. Consider stage-level configuration isolation: since AQE schedules at the stage level, SQL configuration could likewise be stage-level, tuning each shuffle at the finest granularity. That may sound like overkill, but the most common need is simply to configure the last stage separately. The last stage is special: it performs the write, so it determines how many files are ultimately produced. Hence the conflict and the pain point: the last stage cares about storage and file counts, while the intermediate stages care about compute performance and concurrency.
## NetEase Shufan's Improvements to AQE

As a heavy user of AQE, NetEase was not about to leave these pain points alone. On top of the community branch we made a series of optimizations and enhancements, some of which we have already pushed upstream. On the topic of open source, [NetEase holds an open philosophy](https://mp.weixin.qq.com/s/4L55LVWTJHNshyW-ynqtLg).

### Backporting Community Patches

Spark's release cadence is not that frequent; even a minor release typically takes close to half a year, and we cannot simply watch a string of bugs linger in older branches. NetEase's branch-management strategy for Spark is therefore to maintain minor versions ourselves and follow major versions promptly (a minor bump being, say, 3.0.1 to 3.0.2, a major one 3.0 to 3.1). Under this strategy we can promptly backport newly discovered fixes. Take the AQE-related patch [SPARK-33933](https://issues.apache.org/jira/browse/SPARK-33933): when executing child physical plans it runs broadcast stages before shuffle stages, reducing the chance of broadcast timeouts when scheduling resources are scarce. Upstream, this patch ships only in the 3.2.0 branch, but for stability we backported it to our 3.1.1 branch.

### Contributing Back to the Community

#### Improving the Stability of Broadcast Joins

To address both the inaccuracy of statically estimated plan statistics and the irreversibility of broadcast inside AQE, we contributed a broadcast configuration owned by AQE itself, [SPARK-35264](https://issues.apache.org/jira/browse/SPARK-35264). The idea is to add a new configuration, **spark.sql.adaptive.autoBroadcastJoinThreshold**, isolated from the existing one, and to decide from AQE's runtime statistics whether a join can be executed as a broadcast, so that the size of the broadcast table is trustworthy. We can then disable statically estimated broadcast joins and enable only AQE's, enjoying the performance of broadcast joins while preserving stability.

#### Widening the Coverage of Skew-Join Optimization

We made many enhancements to skew-join optimization; this case is one of them. First, two abbreviations: Shuffled Hash Join (SHJ) and Sort Merge Join (SMJ). SMJ works by shuffling rows with the same key to the same reduce task, sorting within each partition, and then performing the join. SHJ has better time complexity than SMJ: it builds a hash map to match rows and skips the sort, but its drawback is equally obvious, namely that it OOMs easily.

SHJ has long been an easily forgotten join implementation, owing to the default configuration **spark.sql.join.preferSortMergeJoin** and to the truly harsh conditions for triggering SHJ in the community version. But since Spark 3.0 fully supported every type of join hint ([SPARK-27225](https://issues.apache.org/jira/browse/SPARK-27225)), SHJ has gradually come back into view. Back to the point: the community version of AQE applies skew optimization only to SMJ, which is unfriendly to jobs that explicitly declare an SHJ join hint.
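To make the SHJ versus SMJ contrast concrete, here is a toy pure-Python sketch of the two strategies (illustrative only, not Spark's implementation): SHJ builds an in-memory hash table on one side and probes with the other, while SMJ sorts both sides and merges.

```python
def shuffled_hash_join(build_rows, probe_rows):
    """SHJ: build a hash map on the (smaller) build side, then probe it.
    No sort needed, but the whole build side must fit in memory."""
    table = {}
    for key, value in build_rows:
        table.setdefault(key, []).append(value)
    return [(key, bv, pv) for key, pv in probe_rows for bv in table.get(key, [])]

def sort_merge_join(left_rows, right_rows):
    """SMJ: sort both sides by key, then merge with two cursors.
    Costs a sort, but streams without holding a whole side in memory."""
    left, right = sorted(left_rows), sorted(right_rows)
    out, j = [], 0
    for key, lv in left:
        while j < len(right) and right[j][0] < key:  # skip smaller keys
            j += 1
        i = j
        while i < len(right) and right[i][0] == key:  # emit all matches
            out.append((key, lv, right[i][1]))
            i += 1
    return out
```

Both produce the same join result; the difference is purely in time complexity and memory behavior, which is exactly the trade-off described above.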
Against this background, we added AQE support for skew optimization on SHJ ([SPARK-35214](https://issues.apache.org/jira/browse/SPARK-35214)), widening the dimensions that skew-join optimization covers.

#### Assorted Minor Fixes

Spark is used in a great many scenarios inside NetEase, including but not limited to data warehousing, ETL, and ad hoc queries, so we need to minimize cases that hurt or mislead users.

- [SPARK-35239](https://issues.apache.org/jira/browse/SPARK-35239): when an input RDD partition is empty, its shuffle partitions cannot be coalesced. That sounds harmless; for an empty table, even running some no-op tasks is fast. But in ad hoc scenarios where the default **spark.sql.shuffle.partitions** is tuned very high, this severely wastes task resources and adds load on the driver.

- [SPARK-34899](https://issues.apache.org/jira/browse/SPARK-34899): we found that after certain shuffle partitions were successfully optimized by AQE's partition-coalescing rule, the partition count did not actually drop, and for a while we suspected we simply had not found the right way to use AQE.

- [SPARK-35168](https://issues.apache.org/jira/browse/SPARK-35168): an issue users coming from Hive may hit. In theory the number of reduces in MapReduce is equivalent to Spark's shuffle partition count, so Spark maps some configurations accordingly; a bug in that mapping is clearly intolerable.

### Internal Optimizations (Open-Sourced)

Besides staying in communication with the community, NetEase Shufan also built a number of optimizations on top of AQE; they are all available in our open-source project Kyuubi.

#### Supporting Skew-Join Optimization in Complex Scenarios
The community version of AQE optimizes cautiously, applying skew optimization only to the standard Sort Merge Join pattern: every child operator of the Join must contain a Sort and a Shuffle. This policy greatly limits the rule's coverage. For example, take a plan that runs an Aggregate and then a Join with no shuffle between the two operators. We can infer that, without AQE's involvement, the shuffle between the Aggregate and the Join was pruned away, a common optimization that typically occurs when the Aggregate keys overlap with the Join keys. But because the pattern fails to match, AQE cannot optimize the Join in this scenario. There are workarounds, such as manually inserting a shuffle between the Aggregate and the Join, which yields a plan like this:

![Plan with a shuffle inserted between Aggregate and Join](https://static001.infoq.cn/resource/image/3a/73/3a56cf6acb760cef3d78d142e18f1673.png)

Following this idea, we added a rule that automatically inserts the shuffle needed to satisfy the skew-join trigger conditions, without intruding on AQE's own code. We chose this approach for three reasons:

- Adding a shuffle has an excellent side effect: it also enables optimization in multi-table join scenarios, killing two birds with one stone
- It avoids hacking AQE's code, so it can iterate quickly, independent of our internal Spark branch
- It is of course not the final solution, and the conversation with the community continues
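The manual workaround mentioned above, inserting a shuffle between the Aggregate and the Join, can be approximated with Spark's REPARTITION hint. A hypothetical sketch (table and column names are made up):

```sql
-- Hypothetical example: force a shuffle on the aggregated side so the
-- skew-join rule sees a Shuffle under the Join again.
SELECT t.user_id, t.cnt, d.dim_value
FROM (
  SELECT /*+ REPARTITION(user_id) */ user_id, count(*) AS cnt
  FROM events
  GROUP BY user_id
) t
JOIN dims d ON t.user_id = d.user_id;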
#### Small-File Merging and Stage-Level Configuration Isolation

Spark's small-file problem has existed for many years, and there are many solutions. AQE's arrival looks like a natural fix, so internally, building on AQE's partition-coalescing rule, for every SQL statement that performs a write we dynamically insert a shuffle node at the top of its execution plan. From the execution plan's perspective it looks like this:

![Shuffle node inserted at the top of a write plan](https://static001.infoq.cn/resource/image/3a/0f/3a43c441e19a73ee46feecd633998f0f.jpg)

Combined with the configurations that control per-partition size, everything looks rosy. But problems arrived all the same; the two most prominent:

- A naively added shuffle node cannot handle dynamic partition writes. Suppose the job ends with 1,000 reduce partitions and the dynamically inserted partition column also has 1,000 distinct values; the files ultimately produced could number 1,000 x 1,000 = 1,000,000, which is clearly unacceptable. We therefore repartition by the dynamic partition columns, so that rows sharing a partition value land in the same reduce partition, capping the file count from 1,000 partitions at 1,000. One potential risk remains after this treatment: partition values are not evenly distributed, so data skew can reappear. For that case we additionally introduce a business-agnostic repartition column, exposed through configuration to help users adapt quickly to different workloads.

- An oversized single partition becomes a performance bottleneck. What giveth also taketh away: raising **spark.sql.adaptive.advisoryPartitionSizeInBytes** solves the small-file problem, but every partition along the way then processes more data, so intermediate concurrency falls short of expectations. Hence stage-level configuration isolation. We split a SQL job's configuration into two parts, the final stage and all stages before it, and isolate the configurations between the two. For the setting above, the final-stage form is **spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes**. With this isolation we can finally resolve the tension between small files and compute performance, and users get a more graceful AQE experience.
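The dynamic-partition file-count arithmetic above reduces to a small model. The sketch below (an illustrative model with hypothetical parameter names, not Spark's implementation) shows why repartitioning by the partition column caps the file count:

```python
def max_output_files(num_reducers, num_partition_values,
                     repartition_by_partition_col=False, salt_buckets=1):
    """Worst-case output file count under dynamic partition writes.

    Without repartitioning, every reducer may hold rows for every dynamic
    partition value, so files = reducers x partition values. After
    repartitioning by the partition column (optionally salted into
    `salt_buckets` sub-buckets to spread skewed values), each partition
    value is written by at most `salt_buckets` reducers."""
    if not repartition_by_partition_col:
        return num_reducers * num_partition_values
    return num_partition_values * salt_buckets
```

With 1,000 reducers and 1,000 partition values this reproduces the article's worst case of 1,000,000 files, versus 1,000 after repartitioning (or 2,000 with a two-way salt).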
### Case Studies

#### Multi-Table Join Skew

The two figures below show the execution plan of a three-table join; due to length limits only the join-related fragments are shown, and the unoptimized job failed to complete because of the data skew. It is plain that the community version cannot apply skew optimization to this kind of multi-table join, whereas after we dynamically insert shuffles, both joins are successfully optimized. With this feature, skew-join optimization covers noticeably more scenarios than the community version.

![Three-table join plan, community version](https://static001.infoq.cn/resource/image/f6/2c/f6b0e1310b76a9133095835cdf16832c.png)

*Community version*

![Three-table join plan, internal version](https://static001.infoq.cn/resource/image/5e/0a/5ec7b1c988dbcbf36080609083aca30a.png)

*Internal version*

#### Stage-Level Configuration Isolation

With stage-level configuration isolation in place, we set separate parameters for the last stage. The two figures below show the same production job on two consecutive days, before and after. With isolation enabled, and the final file count held constant, the concurrency of the intermediate stages rose, improving job performance.
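As an illustration of such isolation, the configuration might look like the following (values are examples; the `finalStage`-prefixed key comes from the NetEase extension shipped in Kyuubi):

```properties
# Intermediate stages: a small advisory partition size keeps concurrency high
spark.sql.adaptive.advisoryPartitionSizeInBytes=64m
# Final (write) stage only: larger partitions mean fewer output files
spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=512m
```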
![Stage metrics before configuration isolation](https://static001.infoq.cn/resource/image/4f/10/4f00bb73558cde9c2705bd43f968a510.png)

*Before configuration isolation*

![Stage metrics after configuration isolation](https://static001.infoq.cn/resource/image/8d/cb/8d739c57d06991fdde86391cc255c0cb.png)

*After configuration isolation*

#### Job Performance Comparison

This chart shows the resource cost and performance comparison for some of our migrated jobs; the blue line is before migration and the red line after. It is quite clear that resource costs drop sharply while job performance improves to varying degrees.

![Resource cost and performance before and after migration](https://static001.infoq.cn/resource/image/c1/76/c1a7a66478d12f042dc726d3b6156476.jpg)
## Summary and Outlook

First, our thanks to the Apache Spark community. Since AQE was introduced, our production jobs have gained performance to varying degrees, and we have more angles from which to attack problems when they arise. In the course of deep practice we also identified areas worth improving:

- At the level of optimization detail, we can increase the number of cases AQE's optimizations hit, for example the skew-join enhancements, so that users no longer have to inspect each plan that failed to be optimized
- At the level of business usage, we can serve scenarios with different priorities, such as ETL and ad hoc, at the same time; the stage-level configuration isolation feature, for instance, gives both write-focused and read-focused workloads a good experience

With this phase of optimization complete, we will keep working on AQE's coverage, for example supporting fine-grained optimization of the Union operator and strengthening AQE's cost-estimation algorithms. Beyond that, some potential performance regressions also deserve attention, such as partition coalescing amplifying the bottleneck of operators with high time complexity.

As possibly the earliest user to run Apache Spark 3.1.1 in production, NetEase both enjoys the community's technical dividends and gives back to the community. This reflects NetEase's thinking about and philosophy of technology:

- Because we are open, we embrace open source and engage deeply with the community
- Because we love it, we quickly absorb new theory and put new technology into practice

**About the author:** Youxiduo works in NetEase Shufan's Youdata division, focusing on open-source big data. He is a Kyuubi Committer and an Apache Spark Contributor.