數據傾斜？Spark 3.0 AQE專治各種不服

原創

2021-01-21 19:33

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark3.0已經發布半年之久，這次大版本的升級主要是集中在性能優化和文檔豐富上，其中46%的優化都集中在Spark SQL上，SQL優化裏最引人注意的非Adaptive Query Execution莫屬了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接：","attrs":{}},{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247498247&idx=1&sn=d16e3cfb5a424d4350270d1d6e28f95f&chksm=fd3ebc92ca493584044dfd97f5d2c6f010e83916045d2518f26c003ff3de6e9ee1dffaa2a1f0&token=1490675160&lang=zh_CN#rd","title":""},"content":[{"type":"text","text":"《數據傾斜？Spark 3.0 AQE專治各種不服》","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/19/19368ebe4af7d319ec631eda02e71e0d.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Adaptive Query Execution(AQE)是英特爾大數據技術團隊和百度大數據基礎架構部工程師在Spark 社區版本的基礎上，改進並實現的自適應執行引擎。近些年來，Spark SQL 一直在針對CBO 特性進行優化，而且做得十分成功。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"CBO基本原理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先，我們先來介紹另一個基於規則優化（Rule-Based Optimization，簡稱RBO）的優化器，這是一種經驗式、啓發式的優化思路，優化規則都已經預先定義好，只需要將SQL往這些規則上套就可以。簡單的說，RBO就像是一個經驗豐富的老司機，基本套路全都知道。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而世界上有一種東西叫做 – 不按套路來。與其說它不按套路來，倒不如說它本身並沒有什麼套路。最典型的莫過於複雜Join算子優化，對於這些Join來說，通常有兩個選擇題要做：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Join應該選擇哪種算法策略來執行？BroadcastJoin or ShuffleHashJoin or SortMergeJoin？不同的執行策略對系統的資源要求不同，執行效率也有天壤之別，同一個SQL，選擇到合適的策略執行可能只需要幾秒鐘，而如果沒有選擇到合適的執行策略就可能會導致系統OOM。","attrs":{}}]}],"attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"2","normalizeStart":"2"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"對於雪花模型或者星型模型來講，多表Join應該選擇什麼樣的順序執行？不同的Join順序意味着不同的執行效率，比如A join B join C，A、B表都很大，C表很小，那A join B很顯然需要大量的系統資源來運算，執行時間必然不會短。而如果使用A join C join B的執行順序，因爲C表很小，所以A join C會很快得到結果，而且結果集會很小，再使用小的結果集 join B，性能顯而易見會好於前一種方案。","attrs":{}}]}],"attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家想想，這有什麼固定的優化規則麼？並沒有。說白了，你需要知道更多關於表的基礎信息（表大小、表記錄總條數等），再通過一定規則代價評估才能從中選擇一條最優的執行計劃。所以，CBO 意爲基於代價優化策略，它需要計算所有可能執行計劃的代價，並挑選出代價最小的執行計劃。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE對於整體的Spark SQL的執行過程做了相應的調整和優化，它最大的亮點是可以根據已經完成的計劃結點真實且精確的執行統計結果來不停的反饋並重新優化剩下的執行計劃。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"CBO這麼難實現，Spark怎麼解決？","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CBO 會計算一些和業務數據相關的統計數據，來優化查詢，例如行數、去重後的行數、空值、最大最小值等。Spark會根據這些數據，自動選擇BHJ或者SMJ，對於多Join場景下的Cost-based Join Reorder，來達到優化執行計劃的目的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是，由於這些統計數據是需要預先處理的，會過時，所以我們在用過時的數據進行判斷，在某些情況下反而會變成負面效果，拉低了SQL執行效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark3.0的AQE框架用了三招解決這個問題：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"動態合併shuffle分區（Dynamically coalescing shuffle partitions）","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"動態調整Join策略（Dynamically switching join strategies）","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"動態優化數據傾斜Join（Dynamically optimizing skew joins） ","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面我們來詳細介紹這三個特性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"動態合併 shuffle 的分區","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們處理的數據量級非常大時，shuffle通常來說是最影響性能的。因爲shuffle是一個非常耗時的算子，它需要通過網絡移動數據，分發給下游算子。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在shuffle中，partition的數量十分關鍵。partition的最佳數量取決於數據，而數據大小在不同的query不同stage都會有很大的差異，所以很難去確定一個具體的數目：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果partition過少，每個partition數據量就會過多，可能就會導致大量數據要落到磁盤上，從而拖慢了查詢。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果partition過多，每個partition數據量就會很少，就會產生很多額外的網絡開銷，並且影響Spark task scheduler，從而拖慢查詢。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決該問題，我們在最開始設置相對較大的shuffle partition個數，通過執行過程中shuffle文件的數據來合併相鄰的小partitions。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如，假設我們執行SELECT max(i) FROM tbl GROUP BY j，表tbl只有2個partition並且數據量非常小。我們將初始shuffle partition設爲5，因此在分組後會出現5個partitions。若不進行AQE優化，會產生5個tasks來做聚合結果，事實上有3個partitions數據量是非常小的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f1/f105e8190724e92a3612cadd5deb45df.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而在這種情況下，AQE只會生成3個reduce task。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c8ab0cf52b62fb2f538b5c92ce8b67ed.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"動態切換join策略","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark 支持許多 Join 策略，其中 broadcast hash join 通常是性能最好的，前提是參加 join 的一張表的數據能夠裝入內存。由於這個原因，當 Spark 估計參加 join 的表數據量小於廣播大小的閾值時，其會將 Join 策略調整爲 broadcast hash join。但是，很多情況都可能導致這種大小估計出錯——例如存在一個非常有選擇性的過濾器。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於AQE擁有精確的上游統計數據，因此可以解決該問題。比如下面這個例子，右表的實際大小爲15M，而在該場景下，經過filter過濾後，實際參與join的數據大小爲8M，小於了默認broadcast閾值10M，應該被廣播。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e1/e1f2b05f9c0d84cdfef30f2de091d877.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們執行過程中轉化爲BHJ的同時，我們甚至可以將傳統shuffle優化爲本地shuffle（例如shuffle讀在mapper而不是基於reducer）來減小網絡開銷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"動態優化數據傾斜","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Join裏如果出現某個key的數據傾斜問題，那麼基本上就是這個任務的性能殺手了。在AQE之前，用戶沒法自動處理Join中遇到的這個棘手問題，需要藉助外部手動收集數據統計信息，並做額外的加鹽，分批處理數據等相對繁瑣的方法來應對數據傾斜問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據傾斜本質上是由於集羣上數據在分區之間分佈不均勻所導致的，它會拉慢join場景下整個查詢。AQE根據shuffle文件統計數據自動檢測傾斜數據，將那些傾斜的分區打散成小的子分區，然後各自進行join。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們可以看下這個場景，Table A join Table B，其中Table A的partition A0數據遠大於其他分區。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/00/006c0cd49b30db9c31b6e78ba1b46ca7.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE會將partition A0切分成2個子分區，並且讓他們獨自和Table B的partition B0進行join。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ee/ee53c687235ab5b6c7d9f89d1abda4ae.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果不做這個優化，SMJ將會產生4個tasks並且其中一個執行時間遠大於其他。經優化，這個join將會有5個tasks，但每個task執行耗時差不多相同，因此個整個查詢帶來了更好的性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"如何開啓AQE","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們可以設置參數spark.sql.adaptive.enabled爲true來開啓AQE，在Spark 3.0中默認是false，並滿足以下條件：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"非流式查詢","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"包含至少一個exchange（如join、聚合、窗口算子）或者一個子查詢","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE通過減少了對靜態統計數據的依賴，成功解決了Spark CBO的一個難以處理的trade off（生成統計數據的開銷和查詢耗時）以及數據精度問題。相比之前具有侷限性的CBO，現在就顯得非常靈活。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Spark CBO源碼實現","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Adaptive Execution 模式是在使用Spark物理執行計劃注入生成的。在QueryExecution類中有 preparations 一組優化器來對物理執行計劃進行優化， InsertAdaptiveSparkPlan 就是第一個優化器。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"InsertAdaptiveSparkPlan 使用 PlanAdaptiveSubqueries Rule對部分SubQuery處理後，將當前 Plan 包裝成 AdaptiveSparkPlanExec 。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當執行 AdaptiveSparkPlanExec 的 collect() 或 take() 方法時，全部會先執行 getFinalPhysicalPlan() 方法生成新的SparkPlan，再執行對應的SparkPlan對應的方法。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"// QueryExecution類\nlazy val executedPlan: SparkPlan = {\n executePhase(QueryPlanningTracker.PLANNING) {\n QueryExecution.prepareForExecution(preparations, sparkPlan.clone())\n }\n }\n\n protected def preparations: Seq[Rule[SparkPlan]] = {\n QueryExecution.preparations(sparkSession,\n Option(InsertAdaptiveSparkPlan(AdaptiveExecutionContext(sparkSession, this))))\n }\n\n private[execution] def preparations(\n sparkSession: SparkSession,\n adaptiveExecutionRule: Option[InsertAdaptiveSparkPlan] = None): Seq[Rule[SparkPlan]] = {\n // `AdaptiveSparkPlanExec` is a leaf node. If inserted, all the following rules will be no-op\n // as the original plan is hidden behind `AdaptiveSparkPlanExec`.\n adaptiveExecutionRule.toSeq ++\n Seq(\n PlanDynamicPruningFilters(sparkSession),\n PlanSubqueries(sparkSession),\n EnsureRequirements(sparkSession.sessionState.conf),\n ApplyColumnarRulesAndInsertTransitions(sparkSession.sessionState.conf,\n sparkSession.sessionState.columnarRules),\n CollapseCodegenStages(sparkSession.sessionState.conf),\n ReuseExchange(sparkSession.sessionState.conf),\n ReuseSubquery(sparkSession.sessionState.conf)\n )\n }\n\n\n// InsertAdaptiveSparkPlan \n override def apply(plan: SparkPlan): SparkPlan = applyInternal(plan, false)\n\n private def applyInternal(plan: SparkPlan, isSubquery: Boolean): SparkPlan = plan match {\n // ...some checking\n case _ if shouldApplyAQE(plan, isSubquery) =>\n if (supportAdaptive(plan)) {\n try {\n // Plan sub-queries recursively and pass in the shared stage cache for exchange reuse.\n // Fall back to non-AQE mode if AQE is not supported in any of the sub-queries.\n val subqueryMap = buildSubqueryMap(plan)\n val planSubqueriesRule = PlanAdaptiveSubqueries(subqueryMap)\n val preprocessingRules = Seq(\n planSubqueriesRule)\n // Run pre-processing rules.\n val newPlan = AdaptiveSparkPlanExec.applyPhysicalRules(plan, preprocessingRules)\n logDebug(s\"Adaptive execution enabled for plan: $plan\")\n AdaptiveSparkPlanExec(newPlan, adaptiveExecutionContext, preprocessingRules, isSubquery)\n } catch {\n case SubqueryAdaptiveNotSupportedException(subquery) =>\n logWarning(s\"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} is enabled \" +\n s\"but is not supported for sub-query: $subquery.\")\n plan\n }\n } else {\n logWarning(s\"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} is enabled \" +\n s\"but is not supported for query: $plan.\")\n plan\n }\n case _ => plan\n }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE對Stage 分階段提交執行和優化過程如下：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":" private def getFinalPhysicalPlan(): SparkPlan = lock.synchronized {\n // 第一次調用 getFinalPhysicalPlan方法時爲false，等待該方法執行完畢，全部Stage不會再改變，直接返回最終plan\n if (isFinalPlan) return currentPhysicalPlan\n\n // In case of this adaptive plan being executed out of `withActive` scoped functions, e.g.,\n // `plan.queryExecution.rdd`, we need to set active session here as new plan nodes can be\n // created in the middle of the execution.\n context.session.withActive {\n val executionId = getExecutionId\n var currentLogicalPlan = currentPhysicalPlan.logicalLink.get\n var result = createQueryStages(currentPhysicalPlan)\n val events = new LinkedBlockingQueue[StageMaterializationEvent]()\n val errors = new mutable.ArrayBuffer[Throwable]()\n var stagesToReplace = Seq.empty[QueryStageExec]\n while (!result.allChildStagesMaterialized) {\n currentPhysicalPlan = result.newPlan\n // 接下來有哪些Stage要執行，參考 createQueryStages(plan: SparkPlan) 方法\n if (result.newStages.nonEmpty) {\n stagesToReplace = result.newStages ++ stagesToReplace\n // onUpdatePlan 通過listener更新UI\n executionId.foreach(onUpdatePlan(_, result.newStages.map(_.plan)))\n\n // Start materialization of all new stages and fail fast if any stages failed eagerly\n result.newStages.foreach { stage =>\n try {\n // materialize() 方法對Stage的作爲一個單獨的Job提交執行，並返回 SimpleFutureAction 來接收執行結果\n // QueryStageExec: materialize() -> doMaterialize() ->\n // ShuffleExchangeExec: -> mapOutputStatisticsFuture -> ShuffleExchangeExec\n // SparkContext: -> submitMapStage(shuffleDependency)\n stage.materialize().onComplete { res =>\n if (res.isSuccess) {\n events.offer(StageSuccess(stage, res.get))\n } else {\n events.offer(StageFailure(stage, res.failed.get))\n }\n }(AdaptiveSparkPlanExec.executionContext)\n } catch {\n case e: Throwable =>\n cleanUpAndThrowException(Seq(e), Some(stage.id))\n }\n }\n }\n\n // Wait on the next completed stage, which indicates new stats are available and probably\n // new stages can be created. There might be other stages that finish at around the same\n // time, so we process those stages too in order to reduce re-planning.\n // 等待，直到有Stage執行完畢\n val nextMsg = events.take()\n val rem = new util.ArrayList[StageMaterializationEvent]()\n events.drainTo(rem)\n (Seq(nextMsg) ++ rem.asScala).foreach {\n case StageSuccess(stage, res) =>\n stage.resultOption = Some(res)\n case StageFailure(stage, ex) =>\n errors.append(ex)\n }\n\n // In case of errors, we cancel all running stages and throw exception.\n if (errors.nonEmpty) {\n cleanUpAndThrowException(errors, None)\n }\n\n // Try re-optimizing and re-planning. Adopt the new plan if its cost is equal to or less\n // than that of the current plan; otherwise keep the current physical plan together with\n // the current logical plan since the physical plan's logical links point to the logical\n // plan it has originated from.\n // Meanwhile, we keep a list of the query stages that have been created since last plan\n // update, which stands for the \"semantic gap\" between the current logical and physical\n // plans. And each time before re-planning, we replace the corresponding nodes in the\n // current logical plan with logical query stages to make it semantically in sync with\n // the current physical plan. Once a new plan is adopted and both logical and physical\n // plans are updated, we can clear the query stage list because at this point the two plans\n // are semantically and physically in sync again.\n // 對前面的Stage替換爲 LogicalQueryStage 節點\n val logicalPlan = replaceWithQueryStagesInLogicalPlan(currentLogicalPlan, stagesToReplace)\n // 再次調用optimizer 和planner 進行優化\n val (newPhysicalPlan, newLogicalPlan) = reOptimize(logicalPlan)\n val origCost = costEvaluator.evaluateCost(currentPhysicalPlan)\n val newCost = costEvaluator.evaluateCost(newPhysicalPlan)\n if (newCost < origCost ||\n (newCost == origCost && currentPhysicalPlan != newPhysicalPlan)) {\n logOnLevel(s\"Plan changed from $currentPhysicalPlan to $newPhysicalPlan\")\n cleanUpTempTags(newPhysicalPlan)\n currentPhysicalPlan = newPhysicalPlan\n currentLogicalPlan = newLogicalPlan\n stagesToReplace = Seq.empty[QueryStageExec]\n }\n // Now that some stages have finished, we can try creating new stages.\n // 進入下一輪循環，如果存在Stage執行完畢，對應的resultOption 會有值，對應的allChildStagesMaterialized 屬性 = true\n result = createQueryStages(currentPhysicalPlan)\n }\n\n // Run the final plan when there's no more unfinished stages.\n // 所有前置stage全部執行完畢，根據stats信息優化物理執行計劃，確定最終的 physical plan\n currentPhysicalPlan = applyPhysicalRules(result.newPlan, queryStageOptimizerRules)\n isFinalPlan = true\n executionId.foreach(onUpdatePlan(_, Seq(currentPhysicalPlan)))\n currentPhysicalPlan\n }\n }\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"// SparkContext\n /**\n * Submit a map stage for execution. This is currently an internal API only, but might be\n * promoted to DeveloperApi in the future.\n */\n private[spark] def submitMapStage[K, V, C](dependency: ShuffleDependency[K, V, C])\n : SimpleFutureAction[MapOutputStatistics] = {\n assertNotStopped()\n val callSite = getCallSite()\n var result: MapOutputStatistics = null\n val waiter = dagScheduler.submitMapStage(\n dependency,\n (r: MapOutputStatistics) => { result = r },\n callSite,\n localProperties.get)\n new SimpleFutureAction[MapOutputStatistics](waiter, result)\n }\n\n\n// DAGScheduler\n def submitMapStage[K, V, C](\n dependency: ShuffleDependency[K, V, C],\n callback: MapOutputStatistics => Unit,\n callSite: CallSite,\n properties: Properties): JobWaiter[MapOutputStatistics] = {\n\n val rdd = dependency.rdd\n val jobId = nextJobId.getAndIncrement()\n if (rdd.partitions.length == 0) {\n throw new SparkException(\"Can't run submitMapStage on RDD with 0 partitions\")\n }\n\n // We create a JobWaiter with only one \"task\", which will be marked as complete when the whole\n // map stage has completed, and will be passed the MapOutputStatistics for that stage.\n // This makes it easier to avoid race conditions between the user code and the map output\n // tracker that might result if we told the user the stage had finished, but then they queries\n // the map output tracker and some node failures had caused the output statistics to be lost.\n val waiter = new JobWaiter[MapOutputStatistics](\n this, jobId, 1,\n (_: Int, r: MapOutputStatistics) => callback(r))\n eventProcessLoop.post(MapStageSubmitted(\n jobId, dependency, callSite, waiter, Utils.cloneProperties(properties)))\n waiter\n }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當前，AdaptiveSparkPlanExec 中對物理執行的優化器列表如下：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"// AdaptiveSparkPlanExec\n @transient private val queryStageOptimizerRules: Seq[Rule[SparkPlan]] = Seq(\n ReuseAdaptiveSubquery(conf, context.subqueryCache),\n CoalesceShufflePartitions(context.session),\n // The following two rules need to make use of 'CustomShuffleReaderExec.partitionSpecs'\n // added by `CoalesceShufflePartitions`. So they must be executed after it.\n OptimizeSkewedJoin(conf),\n OptimizeLocalShuffleReader(conf),\n ApplyColumnarRulesAndInsertTransitions(conf, context.session.sessionState.columnarRules),\n CollapseCodegenStages(conf)\n )","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中 OptimizeSkewedJoin方法就是針對最容易出現數據傾斜的Join進行的優化：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE模式下，每個Stage執行之前，前置依賴Stage已經全部執行完畢，那麼就可以獲取到每個Stage的stats信息。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當發現shuffle partition的輸出超過partition size的中位數的5倍，且partition的輸出大於 256M 會被判斷產生數據傾斜，將partition 數據按照targetSize進行切分爲N份。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"targetSize = max(64M, 非數據傾斜partition的平均大小)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"優化前 shuffle 如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bd/bde89544e483c935e37f950aa2f9ec01.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"優化後 shuffle：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a4/a4d264a2656e7d1c09bfb326a026aae1.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Spark3.0AQE在FreeWheel的應用與實踐","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FreeWheel團隊通過高效的敏捷開發趕在 2020 年聖誕廣告季之前在生產環境順利發佈上線，整體性能提升高達 40%（對於大 batch）的數據，AWS Cost 平均節省 25%~30%之間，大約每年至少能爲公司節省百萬成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"主要升級改動","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"打開 Spark 3.0 AQE 的新特性，主要配置如下：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":" \"spark.sql.adaptive.enabled\": true,\n \"spark.sql.adaptive.coalescePartitions.enabled\": true,\n \"spark.sql.adaptive.coalescePartitions.minPartitionNum\": 1,\n \"spark.sql.adaptive.advisoryPartitionSizeInBytes\": \"128MB\"","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是，AQE 特性只是在 reducer 階段不用指定 reducer 的個數，但並不代表你不再需要指定任務的並行度了。因爲 map 階段仍然需要將數據劃分爲合適的分區進行處理，如果沒有指定並行度會使用默認的 200，當數據量過大時，很容易出現 OOM。建議還是按照任務之前的並行度設置來配置參數spark.sql.shuffle.partitions和spark.default.parallelism。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們來仔細看一下爲什麼升級到 3.0 以後可以減少運行時間，又能節省集羣的成本。以 Optimus 數據建模裏的一張表的運行情況爲例：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 reduce 階段從沒有 AQE 的40320個 tasks 銳減到4580個 tasks，減少了一個數量級。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖裏下半部分是沒有 AQE 的 Spark 2.x 的 task 情況，上半部分是打開 AQE 特性後的 Spark 3.x 的情況。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8a/8a76704fd1984a77ef144e588a33d241.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從更詳細的運行時間圖來看，shuffler reader後同樣的 aggregate 的操作等時間也從4.44h到2.56h，節省將近一半。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"左邊是 spark 2.x 的運行指標明細，右邊是打開 AQE 後通過custom shuffler reader後的運行指標情況。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1f/1f80a838377ddcd6c74432d04af022fe.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"性能提升","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"AQE性能","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE對於整體的 Spark SQL 的執行過程做了相應的調整和優化(如下圖)，它最大的亮點是可以根據已經完成的計劃結點真實且精確的執行統計結果來不停的反饋並重新優化剩下的執行計劃。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6f/6fc2587f2ff867187c84e6a81d0fd145.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AQE 自動調整 reducer 的數量，減小 partition 數量。Spark 任務的並行度一直是讓用戶比較困擾的地方。如果並行度太大的話，會導致 task 過多，overhead 比較大，整體拉慢任務的運行。而如果並行度太小的，數據分區會比較大，容易出現 OOM 的問題，並且資源也得不到合理的利用，並行運行任務優勢得不到最大的發揮。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而且由於 Spark Context 整個任務的並行度，需要一開始設定好且沒法動態修改，這就很容易出現任務剛開始的時候數據量大需要大的並行度，而運行的過程中通過轉化過濾可能最終的數據集已經變得很小，最初設定的分區數就顯得過大了。AQE 能夠很好的解決這個問題，在 reducer 去讀取數據時，會根據用戶設定的分區數據的大小(spark.sql.adaptive.advisoryPartitionSizeInBytes)來自動調整和合並(Coalesce)小的 partition，自適應地減小 partition 的數量，以減少資源浪費和 overhead，提升任務的性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由上面單張表可以看到，打開 AQE 的時候極大的降低了 task 的數量，除了減輕了 Driver 的負擔，也減少啓動 task 帶來的 schedule，memory，啓動管理等 overhead，減少 cpu 的佔用，提升的 I/O 性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拿歷史 Data Pipelines 爲例，同時會並行有三十多張表在 Spark 裏運行，每張表都有極大的性能提升，那麼也使得其他的表能夠獲得資源更早更多，互相受益，那麼最終整個的數據建模過程會自然而然有一個加速的結果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大 batch（>200G）相對小 batch（< 100G ）有比較大的提升，有高達 40%提升，主要是因爲大 batch 本身數據量大，需要機器數多，設置併發度也更大，那麼 AQE 展現特性的時刻會更多更明顯。而小 batch 併發度相對較低，那麼提升也就相對會少一些，不過也是有 27.5%左右的加速。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"內存優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了因爲 AQE 的打開，減少過碎的 task 對於 memory 的佔用外，Spark 3.0 也在其他地方做了很多內存方面的優化，比如 Aggregate 部分指標瘦身、Netty 的共享內存 Pool 功能、Task Manager 死鎖問題、避免某些場景下從網絡讀取 shuffle block等等，來減少內存的壓力。一系列內存的優化加上 AQE 特性疊加從前文內存實踐圖中可以看到集羣的內存使用同時有30%左右的下降。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"實踐成果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"升級主要的實踐成果如下：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"性能提升明顯","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歷史數據 Pipeline 對於大 batch 的數據（200~400G/每小時）性能提升高達40%，對於小 batch（小於 100G/每小時）提升效果沒有大 batch 提升的那麼明顯，每天所有 batches平均提升水平27.5%左右。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"預測數據性能平均提升30%。由於數據輸入源不一樣，目前是分別兩個 pipelines 在跑歷史和預測數據，產生的表的數目也不太一樣，因此做了分別的評估。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以歷史數據上線後的端到端到運行時間爲例（如下圖），肉眼可見上線後整體 pipeline 的運行時間有了明顯的下降，能夠更快的輸出數據供下游使用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/83/83df84f40d696fbf295fa5147c6f7cf0.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"集羣內存使用降低","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣內存使用對於大 batch 達降低30%左右，每天平均平均節省25%左右。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以歷史數據上線後的運行時集羣的 memory 在 ganglia 上的截圖爲例（如下圖），整體集羣的內存使用從 41.2T 降到 30.1T，這意味着我們可以用更少的機器花更少的錢來跑同樣的 Spark 任務。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4e/4e16f96ab77fe63803966e2e48d0a1a1.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"AWS Cost 降低","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pipelines 做了自動的 Scale In/Scale Out 策略: 在需要資源的時候擴集羣的 Task 結點，在任務結束後自動去縮集羣的 Task 結點，且會根據每次 batch 數據的大小通過算法學習得到最佳的機器數。通過升級到 Spark 3.0 後，由於現在任務跑的更快並且需要的機器更少，上線後統計 AWS Cost 每天節省30%左右，大約一年能爲公司節省百萬成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247485217&idx=1&sn=3ce9fa8ad179c008754873129e51fbe7&chksm=fd3d41b4ca4ac8a25957fc9541437e6546fc2926df2908bad5adbd30a6cecf9f797fef74f894&scene=21#wechat_redirect","title":null},"content":[{"type":"text","text":"SparkSQL的3種Join實現","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247494892&idx=1&sn=1e2d5ae8beb129236b1e1a1f9e997422&chksm=fd3eaa79ca49236facdde16d22a3d197b89b06af4fc019a435512cc88bfa20054fcb03d31dc5&scene=21#wechat_redirect","title":null},"content":[{"type":"text","text":"MongoDB + Spark: 完整的大數據解決方案","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247487480&idx=1&sn=ec2325cb27dca653269b68f7d6101a07&chksm=fd3d496dca4ac07bbf25fea5a6dc4416be0a19d44ea5aeedc956c640e4acc74066898c195325&scene=21#wechat_redirect","title":null},"content":[{"type":"text","text":"Spark SQL是如何選擇join策略的？","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歡迎關注，","attrs":{}},{"type":"link","attrs":{"href":"https://shimo.im/docs/jdPhrtFwVCAMkoWv","title":""},"content":[{"type":"text","text":"《大數據成神之路》","attrs":{}}]},{"type":"text","text":"系列文章","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

“企業創新新引擎”數據庫專項賦能會，讓雲原生技術普惠千行百業！

本文分享自華爲雲社區《“企業創新新引擎”數據庫專項賦能會，讓雲原生技術普惠千行百業！》，作者： GaussDB 數據庫。 4月19日，由福州軟件園科技創新發展公司和華爲技術有限公司聯合主辦的HCDG城市行福州站——“企業創新新引擎”數據庫專

2024-04-24 10:32:53

如何增強Java API 的導入和導出性能

前言 GrapeCity Documents for Excel (以下簡稱GcExcel) 是葡萄城公司的一款服務端表格組件，它提供了一組全面的 API 以編程方式生成 Excel (XLSX) 電子表格文檔的功能，支持爲多個平臺創建、操

2024-04-23 10:23:02

SLS 查詢新範式：使用 SPL 對日誌進行交互式探索

作者：無哲引言在構建現代數據和業務系統的過程中，可觀測性已經變得至關重要，日誌服務（SLS）爲 Log/Trace/Metric 數據提供了大規模、低成本、高性能的一站式平臺服務，並提供數據採集、加工、投遞、分析、告警、可視化等功能，從

2024-04-22 21:12:05

WhaleScheduler爲銀行業全信創環境打造統一調度管理平臺解決方案

項目背景數字金融是數字經濟的重要支撐和驅動力。近年來，我國針對數字金融的發展政策頻頻出臺，《金融科技發展規劃（2022-2025年）》、《“十四五”數字經濟發展規劃》、《關於銀行業保險業數字化轉型的指導意見》、《金融標準化“十四五”

2024-04-19 21:18:25

千帆杯AI原生應用創意挑戰賽-效率工具常規賽重磅上線！

賽題內容本期比賽爲開放賽題，參賽者需要圍繞“效率工具”主題，結合自身的專業背景和創意想法，設計並開發出具有創新性和實用性的AI原生應用。要求使用工具：AppBuilder。參賽者可用0代碼創建應用調試指令，也可自定義組件與workf

2024-04-19 11:29:42

文檔圖像大模型

隨着信息技術的快速發展，文檔處理已經成爲日常生活和工作中不可或缺的一部分。傳統的文檔處理方法往往需要人工參與，效率低下且易出錯。近年來，隨着深度學習技術的突破，文檔圖像大模型在智能文檔處理領域嶄露頭角，爲提升文檔處理性能提供了新的解決方案。

2024-04-18 11:29:52

GaussDB(DWS)基於Flink的實時數倉構建

本文分享自華爲雲社區《GaussDB(DWS)基於Flink的實時數倉構建》，作者：胡辣湯。大數據時代，廠商對實時數據分析的訴求越來越強烈，數據分析時效從T+1時效趨向於T+0時效，爲了給客戶提供極速分析查詢能力，華爲雲數倉GaussDB

2024-04-18 10:32:57

五一假期暢遊指南：Python技術構建的熱門景點分析系統解讀

導言五一假期即將到來，作爲一名熱愛旅遊的技術達人，我總是希望能夠通過技術手段更好地規劃我的旅行路線。在這篇文章中，我將向大家介紹一款基於Python技術的熱門景點分析系統，幫助您在五一假期中游玩得更加盡興！ 1. 系統概述熱門景點

2024-04-16 23:25:46

裁員了！別錯過2024年大數據工程師必備的10項技能

在當今快速發展的世界中，數據被視爲新的石油。隨着對數據驅動洞察的日益依賴，大數據工程師的角色比以往任何時候都更爲關鍵。這些專業人員在管理和優化組織內的數據操作中扮演着至關重要的角色。在本文中，我們將探索2024年大數據工程師必須具備的十

2024-04-16 11:00:53

還在擔心報表不好做？不用怕，試試這個方法（四）

系列文章：《還在擔心報表不好做？不用怕，試試這個方法》（一）《還在擔心報表不好做？不用怕，試試這個方法》（二）《還在擔心報表不好做？不用怕，試試這個方法》（三）概要在上一篇文章《還在擔心報表不好做？不用怕，試試這個方法》（三）中，

2024-04-16 10:23:03

MaxCompute 近實時增全量處理一體化新架構和使用場景介紹

隨着當前數據處理業務場景日趨複雜，對於大數據處理平臺基礎架構的能力要求也越來越高，既要求數據湖的大存儲能力，也要求具備海量數據高效批處理能力，同時還可能對延時敏感的近實時鏈路有強需求，本文主要介紹基於 MaxCompute 的離線近實時一體

2024-04-15 23:41:52

普元信息顧偉：用更簡單的方式來建設數據中臺

近日，普元信息與鏡舟科技聯合舉辦“數據中臺新範式”雲端峯會，深入解析湖倉一體、批流一體、治理與運營一體的數據中臺新範式特徵，闡述以一站式聯合方案賦能企業提質增效的實踐經驗。普元信息數智研究院院長顧偉發表主旨演講《基於湖倉一體，構建開發

2024-04-12 11:43:03

Sql優化之回表

前言： MySQL的性能是大家在使用時十分關心的問題，比如在高併發訪問時，並且有慢sql存在的情況下，MySQL的性能會明顯下降，這會導致數據庫響應時間變慢，甚至導致數據庫宕機。那麼爲了避免Mysql性能問題，比較常用的方式創建適當的索引

2024-04-08 23:16:30

Ascend C 自定義PRelu算子

本文分享自華爲雲社區《Ascend C 自定義PRelu算子》，作者： jackwangcumt。 1 PRelu算子概述 PReLU是 Parametric Rectified Linear Unit的縮寫，首次由何凱明團隊提出，和Le

2024-04-08 10:33:15

Notion 開源替代品 AFFINE 部署和使用教程

AFFiNE 是一款完全開源的 Notion + Miro 替代品，與 Notion 相比，AFFiNE 更注重隱私安全，優先將筆記內容保存到本地。 GitHub 地址：https://github.com/toeverything/AFF

2024-04-07 21:14:35

24小時熱門文章

最新文章

最新評論文章