數據傾斜?Spark 3.0 AQE專治各種不服

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark3.0已經發布半年之久,這次大版本的升級主要是集中在性能優化和文檔豐富上,其中46%的優化都集中在Spark SQL上,SQL優化裏最引人注意的非Adaptive Query Execution莫屬了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:","attrs":{}},{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247498247&idx=1&sn=d16e3cfb5a424d4350270d1d6e28f95f&chksm=fd3ebc92ca493584044dfd97f5d2c6f010e83916045d2518f26c003ff3de6e9ee1dffaa2a1f0&token=1490675160&lang=zh_CN#rd","title":""},"content":[{"type":"text","text":"《數據傾斜?Spark 3.0 AQE專治各種不服》","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/19/19368ebe4af7d319ec631eda02e71e0d.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Adaptive Query Execution(AQE)是英特爾大數據技術團隊和百度大數據基礎架構部工程師在Spark 社區版本的基礎上,改進並實現的自適應執行引擎。近些年來,Spark SQL 一直在針對CBO 特性進行優化,而且做得十分成功。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"CBO基本原理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,我們先來介紹另一個基於規則優化(Rule-Based Optimization,簡稱RBO)的優化器,這是一種經驗式、啓發式的優化思路,優化規則都已經預先定義好,只需要將SQL往這些規則上套就可以。簡單的說,RBO就像是一個經驗豐富的老司機,基本套路全都知道。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而世界上有一種東西叫做 – 不按套路來。與其說它不按套路來,倒不如說它本身並沒有什麼套路。最典型的莫過於複雜Join算子優化,對於這些Join來說,通常有兩個選擇題要做:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Join應該選擇哪種算法策略來執行?BroadcastJoin or ShuffleHashJoin or SortMergeJoin?不同的執行策略對系統的資源要求不同,執行效率也有天壤之別,同一個SQL,選擇到合適的策略執行可能只需要幾秒鐘,而如果沒有選擇到合適的執行策略就可能會導致系統OOM。","attrs":{}}]}],"attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"2","normalizeStart":"2"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"對於雪花模型或者星型模型來講,多表Join應該選擇什麼樣的順序執行?不同的Join順序意味着不同的執行效率,比如A join B join C,A、B表都很大,C表很小,那A join B很顯然需要大量的系統資源來運算,執行時間必然不會短。而如果使用A join C join B的執行順序,因爲C表很小,所以A join C會很快得到結果,而且結果集會很小,再使用小的結果集 join B,性能顯而易見會好於前一種方案。","attrs":{}}]}],"attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家想想,這有什麼固定的優化規則麼?並沒有。說白了,你需要知道更多關於表的基礎信息(表大小、表記錄總條數等),再通過一定規則代價評估才能從中選擇一條最優的執行計劃。所以,CBO 意爲基於代價優化策略,它需要計算所有可能執行計劃的代價,並挑選出代價最小的執行計劃。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE對於整體的Spark SQL的執行過程做了相應的調整和優化,它最大的亮點是可以根據已經完成的計劃結點真實且精確的執行統計結果來不停的反饋並重新優化剩下的執行計劃。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"CBO這麼難實現,Spark怎麼解決?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CBO 會計算一些和業務數據相關的統計數據,來優化查詢,例如行數、去重後的行數、空值、最大最小值等。Spark會根據這些數據,自動選擇BHJ或者SMJ,對於多Join場景下的Cost-based Join Reorder,來達到優化執行計劃的目的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是,由於這些統計數據是需要預先處理的,會過時,所以我們在用過時的數據進行判斷,在某些情況下反而會變成負面效果,拉低了SQL執行效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark3.0的AQE框架用了三招解決這個問題:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"動態合併shuffle分區(Dynamically coalescing shuffle partitions)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"動態調整Join策略(Dynamically switching join strategies)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"動態優化數據傾斜Join(Dynamically optimizing skew joins) ","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面我們來詳細介紹這三個特性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"動態合併 shuffle 的分區","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們處理的數據量級非常大時,shuffle通常來說是最影響性能的。因爲shuffle是一個非常耗時的算子,它需要通過網絡移動數據,分發給下游算子。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在shuffle中,partition的數量十分關鍵。partition的最佳數量取決於數據,而數據大小在不同的query不同stage都會有很大的差異,所以很難去確定一個具體的數目:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果partition過少,每個partition數據量就會過多,可能就會導致大量數據要落到磁盤上,從而拖慢了查詢。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果partition過多,每個partition數據量就會很少,就會產生很多額外的網絡開銷,並且影響Spark task scheduler,從而拖慢查詢。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決該問題,我們在最開始設置相對較大的shuffle partition個數,通過執行過程中shuffle文件的數據來合併相鄰的小partitions。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,假設我們執行SELECT max(i) FROM tbl GROUP BY j,表tbl只有2個partition並且數據量非常小。我們將初始shuffle partition設爲5,因此在分組後會出現5個partitions。若不進行AQE優化,會產生5個tasks來做聚合結果,事實上有3個partitions數據量是非常小的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f1/f105e8190724e92a3612cadd5deb45df.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而在這種情況下,AQE只會生成3個reduce task。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c8ab0cf52b62fb2f538b5c92ce8b67ed.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"動態切換join策略","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark 支持許多 Join 策略,其中 broadcast hash join 通常是性能最好的,前提是參加 join 的一張表的數據能夠裝入內存。由於這個原因,當 Spark 估計參加 join 的表數據量小於廣播大小的閾值時,其會將 Join 策略調整爲 broadcast hash join。但是,很多情況都可能導致這種大小估計出錯——例如存在一個非常有選擇性的過濾器。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於AQE擁有精確的上游統計數據,因此可以解決該問題。比如下面這個例子,右表的實際大小爲15M,而在該場景下,經過filter過濾後,實際參與join的數據大小爲8M,小於了默認broadcast閾值10M,應該被廣播。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e1/e1f2b05f9c0d84cdfef30f2de091d877.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們執行過程中轉化爲BHJ的同時,我們甚至可以將傳統shuffle優化爲本地shuffle(例如shuffle讀在mapper而不是基於reducer)來減小網絡開銷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"動態優化數據傾斜","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Join裏如果出現某個key的數據傾斜問題,那麼基本上就是這個任務的性能殺手了。在AQE之前,用戶沒法自動處理Join中遇到的這個棘手問題,需要藉助外部手動收集數據統計信息,並做額外的加鹽,分批處理數據等相對繁瑣的方法來應對數據傾斜問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據傾斜本質上是由於集羣上數據在分區之間分佈不均勻所導致的,它會拉慢join場景下整個查詢。AQE根據shuffle文件統計數據自動檢測傾斜數據,將那些傾斜的分區打散成小的子分區,然後各自進行join。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們可以看下這個場景,Table A join Table B,其中Table A的partition A0數據遠大於其他分區。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/00/006c0cd49b30db9c31b6e78ba1b46ca7.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE會將partition A0切分成2個子分區,並且讓他們獨自和Table B的partition B0進行join。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ee/ee53c687235ab5b6c7d9f89d1abda4ae.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果不做這個優化,SMJ將會產生4個tasks並且其中一個執行時間遠大於其他。經優化,這個join將會有5個tasks,但每個task執行耗時差不多相同,因此個整個查詢帶來了更好的性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"如何開啓AQE","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們可以設置參數spark.sql.adaptive.enabled爲true來開啓AQE,在Spark 3.0中默認是false,並滿足以下條件:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"非流式查詢","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"包含至少一個exchange(如join、聚合、窗口算子)或者一個子查詢","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE通過減少了對靜態統計數據的依賴,成功解決了Spark CBO的一個難以處理的trade off(生成統計數據的開銷和查詢耗時)以及數據精度問題。相比之前具有侷限性的CBO,現在就顯得非常靈活。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Spark CBO源碼實現","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Adaptive Execution 模式是在使用Spark物理執行計劃注入生成的。在QueryExecution類中有 preparations 一組優化器來對物理執行計劃進行優化, InsertAdaptiveSparkPlan 就是第一個優化器。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"InsertAdaptiveSparkPlan 使用 PlanAdaptiveSubqueries Rule對部分SubQuery處理後,將當前 Plan 包裝成 AdaptiveSparkPlanExec 。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當執行 AdaptiveSparkPlanExec 的 collect() 或 take() 方法時,全部會先執行 getFinalPhysicalPlan() 方法生成新的SparkPlan,再執行對應的SparkPlan對應的方法。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"// QueryExecution類\nlazy val executedPlan: SparkPlan = {\n executePhase(QueryPlanningTracker.PLANNING) {\n QueryExecution.prepareForExecution(preparations, sparkPlan.clone())\n }\n }\n\n protected def preparations: Seq[Rule[SparkPlan]] = {\n QueryExecution.preparations(sparkSession,\n Option(InsertAdaptiveSparkPlan(AdaptiveExecutionContext(sparkSession, this))))\n }\n\n private[execution] def preparations(\n sparkSession: SparkSession,\n adaptiveExecutionRule: Option[InsertAdaptiveSparkPlan] = None): Seq[Rule[SparkPlan]] = {\n // `AdaptiveSparkPlanExec` is a leaf node. If inserted, all the following rules will be no-op\n // as the original plan is hidden behind `AdaptiveSparkPlanExec`.\n adaptiveExecutionRule.toSeq ++\n Seq(\n PlanDynamicPruningFilters(sparkSession),\n PlanSubqueries(sparkSession),\n EnsureRequirements(sparkSession.sessionState.conf),\n ApplyColumnarRulesAndInsertTransitions(sparkSession.sessionState.conf,\n sparkSession.sessionState.columnarRules),\n CollapseCodegenStages(sparkSession.sessionState.conf),\n ReuseExchange(sparkSession.sessionState.conf),\n ReuseSubquery(sparkSession.sessionState.conf)\n )\n }\n\n\n// InsertAdaptiveSparkPlan \n override def apply(plan: SparkPlan): SparkPlan = applyInternal(plan, false)\n\n private def applyInternal(plan: SparkPlan, isSubquery: Boolean): SparkPlan = plan match {\n // ...some checking\n case _ if shouldApplyAQE(plan, isSubquery) =>\n if (supportAdaptive(plan)) {\n try {\n // Plan sub-queries recursively and pass in the shared stage cache for exchange reuse.\n // Fall back to non-AQE mode if AQE is not supported in any of the sub-queries.\n val subqueryMap = buildSubqueryMap(plan)\n val planSubqueriesRule = PlanAdaptiveSubqueries(subqueryMap)\n val preprocessingRules = Seq(\n planSubqueriesRule)\n // Run pre-processing rules.\n val newPlan = AdaptiveSparkPlanExec.applyPhysicalRules(plan, preprocessingRules)\n logDebug(s\"Adaptive execution enabled for plan: $plan\")\n AdaptiveSparkPlanExec(newPlan, adaptiveExecutionContext, preprocessingRules, isSubquery)\n } catch {\n case SubqueryAdaptiveNotSupportedException(subquery) =>\n logWarning(s\"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} is enabled \" +\n s\"but is not supported for sub-query: $subquery.\")\n plan\n }\n } else {\n logWarning(s\"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} is enabled \" +\n s\"but is not supported for query: $plan.\")\n plan\n }\n case _ => plan\n }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE對Stage 分階段提交執行和優化過程如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":" private def getFinalPhysicalPlan(): SparkPlan = lock.synchronized {\n // 第一次調用 getFinalPhysicalPlan方法時爲false,等待該方法執行完畢,全部Stage不會再改變,直接返回最終plan\n if (isFinalPlan) return currentPhysicalPlan\n\n // In case of this adaptive plan being executed out of `withActive` scoped functions, e.g.,\n // `plan.queryExecution.rdd`, we need to set active session here as new plan nodes can be\n // created in the middle of the execution.\n context.session.withActive {\n val executionId = getExecutionId\n var currentLogicalPlan = currentPhysicalPlan.logicalLink.get\n var result = createQueryStages(currentPhysicalPlan)\n val events = new LinkedBlockingQueue[StageMaterializationEvent]()\n val errors = new mutable.ArrayBuffer[Throwable]()\n var stagesToReplace = Seq.empty[QueryStageExec]\n while (!result.allChildStagesMaterialized) {\n currentPhysicalPlan = result.newPlan\n // 接下來有哪些Stage要執行,參考 createQueryStages(plan: SparkPlan) 方法\n if (result.newStages.nonEmpty) {\n stagesToReplace = result.newStages ++ stagesToReplace\n // onUpdatePlan 通過listener更新UI\n executionId.foreach(onUpdatePlan(_, result.newStages.map(_.plan)))\n\n // Start materialization of all new stages and fail fast if any stages failed eagerly\n result.newStages.foreach { stage =>\n try {\n // materialize() 方法對Stage的作爲一個單獨的Job提交執行,並返回 SimpleFutureAction 來接收執行結果\n // QueryStageExec: materialize() -> doMaterialize() ->\n // ShuffleExchangeExec: -> mapOutputStatisticsFuture -> ShuffleExchangeExec\n // SparkContext: -> submitMapStage(shuffleDependency)\n stage.materialize().onComplete { res =>\n if (res.isSuccess) {\n events.offer(StageSuccess(stage, res.get))\n } else {\n events.offer(StageFailure(stage, res.failed.get))\n }\n }(AdaptiveSparkPlanExec.executionContext)\n } catch {\n case e: Throwable =>\n cleanUpAndThrowException(Seq(e), Some(stage.id))\n }\n }\n }\n\n // Wait on the next completed stage, which indicates new stats are available and probably\n // new stages can be created. There might be other stages that finish at around the same\n // time, so we process those stages too in order to reduce re-planning.\n // 等待,直到有Stage執行完畢\n val nextMsg = events.take()\n val rem = new util.ArrayList[StageMaterializationEvent]()\n events.drainTo(rem)\n (Seq(nextMsg) ++ rem.asScala).foreach {\n case StageSuccess(stage, res) =>\n stage.resultOption = Some(res)\n case StageFailure(stage, ex) =>\n errors.append(ex)\n }\n\n // In case of errors, we cancel all running stages and throw exception.\n if (errors.nonEmpty) {\n cleanUpAndThrowException(errors, None)\n }\n\n // Try re-optimizing and re-planning. Adopt the new plan if its cost is equal to or less\n // than that of the current plan; otherwise keep the current physical plan together with\n // the current logical plan since the physical plan's logical links point to the logical\n // plan it has originated from.\n // Meanwhile, we keep a list of the query stages that have been created since last plan\n // update, which stands for the \"semantic gap\" between the current logical and physical\n // plans. And each time before re-planning, we replace the corresponding nodes in the\n // current logical plan with logical query stages to make it semantically in sync with\n // the current physical plan. Once a new plan is adopted and both logical and physical\n // plans are updated, we can clear the query stage list because at this point the two plans\n // are semantically and physically in sync again.\n // 對前面的Stage替換爲 LogicalQueryStage 節點\n val logicalPlan = replaceWithQueryStagesInLogicalPlan(currentLogicalPlan, stagesToReplace)\n // 再次調用optimizer 和planner 進行優化\n val (newPhysicalPlan, newLogicalPlan) = reOptimize(logicalPlan)\n val origCost = costEvaluator.evaluateCost(currentPhysicalPlan)\n val newCost = costEvaluator.evaluateCost(newPhysicalPlan)\n if (newCost < origCost ||\n (newCost == origCost && currentPhysicalPlan != newPhysicalPlan)) {\n logOnLevel(s\"Plan changed from $currentPhysicalPlan to $newPhysicalPlan\")\n cleanUpTempTags(newPhysicalPlan)\n currentPhysicalPlan = newPhysicalPlan\n currentLogicalPlan = newLogicalPlan\n stagesToReplace = Seq.empty[QueryStageExec]\n }\n // Now that some stages have finished, we can try creating new stages.\n // 進入下一輪循環,如果存在Stage執行完畢, 對應的resultOption 會有值,對應的allChildStagesMaterialized 屬性 = true\n result = createQueryStages(currentPhysicalPlan)\n }\n\n // Run the final plan when there's no more unfinished stages.\n // 所有前置stage全部執行完畢,根據stats信息優化物理執行計劃,確定最終的 physical plan\n currentPhysicalPlan = applyPhysicalRules(result.newPlan, queryStageOptimizerRules)\n isFinalPlan = true\n executionId.foreach(onUpdatePlan(_, Seq(currentPhysicalPlan)))\n currentPhysicalPlan\n }\n }\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"// SparkContext\n /**\n * Submit a map stage for execution. This is currently an internal API only, but might be\n * promoted to DeveloperApi in the future.\n */\n private[spark] def submitMapStage[K, V, C](dependency: ShuffleDependency[K, V, C])\n : SimpleFutureAction[MapOutputStatistics] = {\n assertNotStopped()\n val callSite = getCallSite()\n var result: MapOutputStatistics = null\n val waiter = dagScheduler.submitMapStage(\n dependency,\n (r: MapOutputStatistics) => { result = r },\n callSite,\n localProperties.get)\n new SimpleFutureAction[MapOutputStatistics](waiter, result)\n }\n\n\n// DAGScheduler\n def submitMapStage[K, V, C](\n dependency: ShuffleDependency[K, V, C],\n callback: MapOutputStatistics => Unit,\n callSite: CallSite,\n properties: Properties): JobWaiter[MapOutputStatistics] = {\n\n val rdd = dependency.rdd\n val jobId = nextJobId.getAndIncrement()\n if (rdd.partitions.length == 0) {\n throw new SparkException(\"Can't run submitMapStage on RDD with 0 partitions\")\n }\n\n // We create a JobWaiter with only one \"task\", which will be marked as complete when the whole\n // map stage has completed, and will be passed the MapOutputStatistics for that stage.\n // This makes it easier to avoid race conditions between the user code and the map output\n // tracker that might result if we told the user the stage had finished, but then they queries\n // the map output tracker and some node failures had caused the output statistics to be lost.\n val waiter = new JobWaiter[MapOutputStatistics](\n this, jobId, 1,\n (_: Int, r: MapOutputStatistics) => callback(r))\n eventProcessLoop.post(MapStageSubmitted(\n jobId, dependency, callSite, waiter, Utils.cloneProperties(properties)))\n waiter\n }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當前,AdaptiveSparkPlanExec 中對物理執行的優化器列表如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"// AdaptiveSparkPlanExec\n @transient private val queryStageOptimizerRules: Seq[Rule[SparkPlan]] = Seq(\n ReuseAdaptiveSubquery(conf, context.subqueryCache),\n CoalesceShufflePartitions(context.session),\n // The following two rules need to make use of 'CustomShuffleReaderExec.partitionSpecs'\n // added by `CoalesceShufflePartitions`. So they must be executed after it.\n OptimizeSkewedJoin(conf),\n OptimizeLocalShuffleReader(conf),\n ApplyColumnarRulesAndInsertTransitions(conf, context.session.sessionState.columnarRules),\n CollapseCodegenStages(conf)\n )","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中 OptimizeSkewedJoin方法就是針對最容易出現數據傾斜的Join進行的優化:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE模式下,每個Stage執行之前,前置依賴Stage已經全部執行完畢,那麼就可以獲取到每個Stage的stats信息。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當發現shuffle partition的輸出超過partition size的中位數的5倍,且partition的輸出大於 256M 會被判斷產生數據傾斜, 將partition 數據按照targetSize進行切分爲N份。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"targetSize = max(64M, 非數據傾斜partition的平均大小)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"優化前 shuffle 如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bd/bde89544e483c935e37f950aa2f9ec01.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"優化後 shuffle:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a4/a4d264a2656e7d1c09bfb326a026aae1.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Spark3.0AQE在FreeWheel的應用與實踐","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FreeWheel團隊通過高效的敏捷開發趕在 2020 年聖誕廣告季之前在生產環境順利發佈上線,整體性能提升高達 40%(對於大 batch)的數據,AWS Cost 平均節省 25%~30%之間,大約每年至少能爲公司節省百萬成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"主要升級改動","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"打開 Spark 3.0 AQE 的新特性,主要配置如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":" \"spark.sql.adaptive.enabled\": true,\n \"spark.sql.adaptive.coalescePartitions.enabled\": true,\n \"spark.sql.adaptive.coalescePartitions.minPartitionNum\": 1,\n \"spark.sql.adaptive.advisoryPartitionSizeInBytes\": \"128MB\"","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是,AQE 特性只是在 reducer 階段不用指定 reducer 的個數,但並不代表你不再需要指定任務的並行度了。因爲 map 階段仍然需要將數據劃分爲合適的分區進行處理,如果沒有指定並行度會使用默認的 200,當數據量過大時,很容易出現 OOM。建議還是按照任務之前的並行度設置來配置參數spark.sql.shuffle.partitions和spark.default.parallelism。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們來仔細看一下爲什麼升級到 3.0 以後可以減少運行時間,又能節省集羣的成本。 以 Optimus 數據建模裏的一張表的運行情況爲例:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 reduce 階段從沒有 AQE 的40320個 tasks 銳減到4580個 tasks,減少了一個數量級。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖裏下半部分是沒有 AQE 的 Spark 2.x 的 task 情況,上半部分是打開 AQE 特性後的 Spark 3.x 的情況。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8a/8a76704fd1984a77ef144e588a33d241.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從更詳細的運行時間圖來看,shuffler reader後同樣的 aggregate 的操作等時間也從4.44h到2.56h,節省將近一半。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"左邊是 spark 2.x 的運行指標明細,右邊是打開 AQE 後通過custom shuffler reader後的運行指標情況。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1f/1f80a838377ddcd6c74432d04af022fe.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"性能提升","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"AQE性能","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE對於整體的 Spark SQL 的執行過程做了相應的調整和優化(如下圖),它最大的亮點是可以根據已經完成的計劃結點真實且精確的執行統計結果來不停的反饋並重新優化剩下的執行計劃。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6f/6fc2587f2ff867187c84e6a81d0fd145.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE 自動調整 reducer 的數量,減小 partition 數量。Spark 任務的並行度一直是讓用戶比較困擾的地方。如果並行度太大的話,會導致 task 過多,overhead 比較大,整體拉慢任務的運行。而如果並行度太小的,數據分區會比較大,容易出現 OOM 的問題,並且資源也得不到合理的利用,並行運行任務優勢得不到最大的發揮。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而且由於 Spark Context 整個任務的並行度,需要一開始設定好且沒法動態修改,這就很容易出現任務剛開始的時候數據量大需要大的並行度,而運行的過程中通過轉化過濾可能最終的數據集已經變得很小,最初設定的分區數就顯得過大了。AQE 能夠很好的解決這個問題,在 reducer 去讀取數據時,會根據用戶設定的分區數據的大小(spark.sql.adaptive.advisoryPartitionSizeInBytes)來自動調整和合並(Coalesce)小的 partition,自適應地減小 partition 的數量,以減少資源浪費和 overhead,提升任務的性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由上面單張表可以看到,打開 AQE 的時候極大的降低了 task 的數量,除了減輕了 Driver 的負擔,也減少啓動 task 帶來的 schedule,memory,啓動管理等 overhead,減少 cpu 的佔用,提升的 I/O 性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拿歷史 Data Pipelines 爲例,同時會並行有三十多張表在 Spark 裏運行,每張表都有極大的性能提升,那麼也使得其他的表能夠獲得資源更早更多,互相受益,那麼最終整個的數據建模過程會自然而然有一個加速的結果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大 batch(>200G)相對小 batch(< 100G )有比較大的提升,有高達 40%提升,主要是因爲大 batch 本身數據量大,需要機器數多,設置併發度也更大,那麼 AQE 展現特性的時刻會更多更明顯。而小 batch 併發度相對較低,那麼提升也就相對會少一些,不過也是有 27.5%左右的加速。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"內存優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了因爲 AQE 的打開,減少過碎的 task 對於 memory 的佔用外,Spark 3.0 也在其他地方做了很多內存方面的優化,比如 Aggregate 部分指標瘦身、Netty 的共享內存 Pool 功能、Task Manager 死鎖問題、避免某些場景下從網絡讀取 shuffle block等等,來減少內存的壓力。一系列內存的優化加上 AQE 特性疊加從前文內存實踐圖中可以看到集羣的內存使用同時有30%左右的下降。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"實踐成果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"升級主要的實踐成果如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"性能提升明顯","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歷史數據 Pipeline 對於大 batch 的數據(200~400G/每小時)性能提升高達40%, 對於小 batch(小於 100G/每小時)提升效果沒有大 batch 提升的那麼明顯,每天所有 batches平均提升水平27.5%左右。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"預測數據性能平均提升30%。由於數據輸入源不一樣,目前是分別兩個 pipelines 在跑歷史和預測數據,產生的表的數目也不太一樣,因此做了分別的評估。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以歷史數據上線後的端到端到運行時間爲例(如下圖),肉眼可見上線後整體 pipeline 的運行時間有了明顯的下降,能夠更快的輸出數據供下游使用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/83/83df84f40d696fbf295fa5147c6f7cf0.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"集羣內存使用降低","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣內存使用對於大 batch 達降低30%左右,每天平均平均節省25%左右。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以歷史數據上線後的運行時集羣的 memory 在 ganglia 上的截圖爲例(如下圖),整體集羣的內存使用從 41.2T 降到 30.1T,這意味着我們可以用更少的機器花更少的錢來跑同樣的 Spark 任務。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4e/4e16f96ab77fe63803966e2e48d0a1a1.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"AWS Cost 降低","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pipelines 做了自動的 Scale In/Scale Out 策略: 在需要資源的時候擴集羣的 Task 結點,在任務結束後自動去縮集羣的 Task 結點,且會根據每次 batch 數據的大小通過算法學習得到最佳的機器數。通過升級到 Spark 3.0 後,由於現在任務跑的更快並且需要的機器更少,上線後統計 AWS Cost 每天節省30%左右,大約一年能爲公司節省百萬成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247485217&idx=1&sn=3ce9fa8ad179c008754873129e51fbe7&chksm=fd3d41b4ca4ac8a25957fc9541437e6546fc2926df2908bad5adbd30a6cecf9f797fef74f894&scene=21#wechat_redirect","title":null},"content":[{"type":"text","text":"SparkSQL的3種Join實現","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247494892&idx=1&sn=1e2d5ae8beb129236b1e1a1f9e997422&chksm=fd3eaa79ca49236facdde16d22a3d197b89b06af4fc019a435512cc88bfa20054fcb03d31dc5&scene=21#wechat_redirect","title":null},"content":[{"type":"text","text":"MongoDB + Spark: 完整的大數據解決方案","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247487480&idx=1&sn=ec2325cb27dca653269b68f7d6101a07&chksm=fd3d496dca4ac07bbf25fea5a6dc4416be0a19d44ea5aeedc956c640e4acc74066898c195325&scene=21#wechat_redirect","title":null},"content":[{"type":"text","text":"Spark SQL是如何選擇join策略的?","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歡迎關注,","attrs":{}},{"type":"link","attrs":{"href":"https://shimo.im/docs/jdPhrtFwVCAMkoWv","title":""},"content":[{"type":"text","text":"《大數據成神之路》","attrs":{}}]},{"type":"text","text":"系列文章","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章