Applying the New Features of Apache Spark 3.0: Practice in FreeWheel's Core Business Data Team

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"引言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相信作爲Spark的粉絲或者平時工作與Spark相關的同學大多知道,Spark 3.0在2020年6月官方重磅發佈,並於9月發佈穩定線上版本,這是Spark有史以來最大的一次release,共包含了3400多個patches,而且恰逢Spark發佈的第十年,具有非常重大的意義。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"團隊在Spark發佈後,快速動手搭好Spark 3.0的裸機集羣並在其上進行了初步的調研,發現相較於Spark 2.x 確實有性能上的提升。於是跟AWS EMR和Support團隊進行了多次溝通表達我們的迫切需求後,EMR團隊給予了快速的響應,在11月底發佈了內測版本。作爲第一批內測用戶,我們做了Data Pipelines上各個模塊的升級,測試和數據驗證。團隊通過高效的敏捷開發趕在2020年聖誕廣告季之前在生產環境順利發佈上線,整體"},{"type":"text","marks":[{"type":"strong"}],"text":"性能提升高達40%"},{"type":"text","text":"(對於大batch)的數據,"},{"type":"text","marks":[{"type":"strong"}],"text":"AWS Cost平均節省25%~30%之間"},{"type":"text","text":",大約每年至少能爲公司節省百萬成本。目前線上穩定運行,預期藉助此次升級能夠更從容地爲 FreeWheel 高速增長業務量和數據分析需求保駕護航。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這次Spark 3.0的升級中,其實並不是一個簡簡單單的版本更換,因爲團隊的Data Pipelines所依賴的生態圈本質上其實也發生了一個很大的變化。比如EMR有一個大版本的升級,從5.26升級到最新版6.2.0,底層的Hadoop也從2.x升級到3.2.1,Scala只能支持2.12等等。本篇文章主要是想和大家分享一下Spark 3.0在FreeWheel大數據團隊升級背後的故(xuè)事(lèi)和相關的實戰經驗,希望能對大家以後的使用Spark 3.0特別是基於AWS EMR上開發有所幫助,可以在Spark升級的道路上走的更順一些。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"團隊介紹"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FreeWheel核心業務數據團隊的主要工作是通過收集,分析來自用戶的視頻廣告數據,來幫助客戶更好地制定廣告計劃,滿足客戶不斷增長的業務需求,最終幫助客戶實現業務的增長。其中最主要的兩類數據分別是預測數據和歷史數據:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"預測數據"},{"type":"text","text":"會根據用戶歷史廣告投放情況進行算法分析和學習來得到未來預測情況,在此基礎上向客戶提供有價值的數據分析結果,比如廣告投放是否健康,廣告位是否足夠,當前的廣告售賣是否合理等等信息。通過這些數據分析的反饋可以幫助用戶更好地在廣告定價、售期等方面做出正確的決定,最終達到自己的銷售目標。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"歷史數據"},{"type":"text","text":"主要是提供用戶業務場景數據分析所需要的功能,比如數據查詢,Billing賬單,廣告投放情況,市場策略等,並且通過大量的歷史數據從多維度多指標的角度提供強有力的BI分析能力進而幫助用戶洞察數據發生的變化,發現潛在的問題和市場機會。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作爲核心業務數據團隊裏重要的成員,"},{"type":"text","marks":[{"type":"strong"}],"text":"Transformer"},{"type":"text","text":"團隊的主要負責:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"基於大數據平臺技術建立Data 
Pipelines"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"負責將交易級別的數據轉化爲分析級別的數據,服務下游所有的數據產品"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"構建統一的數據倉庫"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過分層業務模型來構建所有數據產品不同場景下(歷史或者預測)使用一致的業務視圖和指標"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提供不同粒度或者維度的聚合事實數據"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提供基於特定場景的數據集市"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"提供統一的數據發佈服務和接口"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"數據建模和Data Pipelines架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當交易級別的廣告(歷史或者預測)數據進入系統後,會通過數據建模和Data Pipelines進行統一的建模或者分析,視業務需要更進一步構建數據集市,生成的聚合事實數據會被髮布到數據倉庫Hive和Clickhouse裏供下游數據產品通過Presto或者Clickhouse查詢引擎來消費。如下是整體建模和Data Pipelines的架構圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/7f\/e0\/7f2340c952cfe011f6ab2da65d77c0e0.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中主要模塊包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Optimus"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如它的名字一樣,"},{"type":"codeinline","content":[{"type":"text","text":"Optimus"}]},{"type":"text","text":"同樣是Transformer團隊的模塊中的領袖人物,肩負業務數據團隊最重要的數據建模部分。通過分層數據建模的方式來構建統一的基於上下文的數據模型,保障所有下游產品在不同的應用和業務場景下的計算指標,計算邏輯一致,且避免來回重複計算掃描數據。比如預測數據和歷史數據同樣的指標含義,就使得提供給客戶的數據對比更有說服力和決策指導意義。目前它會產生將近四十張左右的小時粒度的歷史事實表和預測事實表。目前每天處理的數據在TB級別,會根據每個小時的數據量自動進行擴或者縮集羣,保證任務的高性能同時達到資源的高效利用目標。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"JetFire"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"numbe
r":0,"align":null,"origin":null},"content":[{"type":"codeinline","content":[{"type":"text","text":"JetFire"}]},{"type":"text","text":"是一個基於Spark的通用ETL框架,支持用戶通過SQL或者Code的方式靈活的定製ETL任務和分析數據任務。目前主要用於Post-Optimus的場景,生成基於特定業務場景更高聚合粒度的數據集市上。比如生成"},{"type":"codeinline","content":[{"type":"text","text":"todate"}]},{"type":"text","text":"(迄今爲止)的統計指標,像每個客戶截止到目前或者過去18個月的廣告投放總數。這樣就可以避免每次查詢對底層數據或者Optimus生成的聚合數據進行全掃。生成一次供多次查詢,可以極大提高查詢效率,降低成本。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Publisher"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於Spark的數據發佈模塊,負責將數據發佈到數據倉庫裏。由於數據建模產生的數據按日期進行分區,當存在Late Data的時候,很容易生成碎小文件,Publisher通過發佈數據前合併碎小文件的功能來提升下游的查詢效率。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Bumblebee"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要是爲數據建模和Data Pipelines的各個模塊提供模塊測試和集成測試環境,供業務開發的同學使用。此外,基於此提供所有Data Pipelines的整體一致的CD和災備方案,保障在極端場景下系統的快速啓動和恢復。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data Restatement"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了日常的Data Pipelines,在客戶數據投放出現問題或者數據倉庫數據出現偏差遺漏時,需要自動修數據的Pipelines來支持大範圍的數據修正和補償。整體的作業調度需要保證日常工作正常完成的情況下,儘快完成數據修正工作。目前提供整個batch或者delta兩種方式修數據,來滿足不同的應用場景。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data Publish API"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"負責爲下游提供數據發佈信息,來觸發一些訂閱的報表或者產品發佈。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了 Data Publish API 服務部署在EKS上,其他相關模塊目前都運行在AWS EMR上,靈活使用Spot Instance和On 
Except for the Data Publish API service, which is deployed on EKS, all the other modules currently run on AWS EMR, flexibly mixing Spot and On-Demand instances to use resources efficiently. With these modules the team provides solid data and technical support for the company's business growth.

## Results

The main results of this upgrade are as follows.

### Clear performance improvement

- For the **historical data** pipeline, large batches (200~400 GB per hour) improved by `up to 40%`; the gain for small batches (under 100 GB per hour) is less pronounced, and across all batches in a day the `average improvement is around 27.5%`.
- **Forecast data** performance `improved by 30% on average`.
  - Because the input sources differ, historical and forecast data currently run in two separate pipelines and produce different numbers of tables, so they were evaluated separately.

Taking the end-to-end runtime of the historical data pipeline after go-live as an example (figure below), the overall pipeline runtime visibly dropped, so data reaches downstream consumers sooner.

![](https://static001.geekbang.org/infoq/ea/eace6ea359a5dead6a1d098622e0cce3.png)

### Lower cluster memory usage

Cluster memory usage `dropped by about 30%` for large batches, and by `about 25%` on average per day.

Taking the Ganglia view of the cluster's runtime memory after the historical data pipeline went live as an example (figure below), overall cluster memory usage dropped from 41.2 TB to 30.1 TB, which means we can run the same Spark jobs with fewer machines and less money.

![](https://static001.geekbang.org/infoq/7b/7b8698ea1d6036c3ae1c06e61867735e.png)
Cost降低"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pipelines做了自動的Scale In\/Scale Out策略: 在需要資源的時候擴集羣的Task結點,在任務結束後自動去縮集羣的Task結點,且會根據每次batch數據的大小通過算法學習得到最佳的機器數。通過升級到Spark 3.0後,由於現在任務跑的更快並且需要的機器更少,上線後統計AWS Cost每天"},{"type":"codeinline","content":[{"type":"text","text":"節省30%"}]},{"type":"text","text":"左右,大約一年能爲公司節省百萬成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下是歷史數據 Pipeline 上線後,通過 AWS Billing 得到的賬單 Cost 數據,可以看到在使用 Spot Instance 情況下(花費柱狀圖較短的情況下)從上線前(藍色線)到上線後(紅色線)每天有顯著的30%左右的成本下降, 如果使用 AWS On Demand 的 Instance 的話那麼節省就更可觀了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/9a\/6d\/9a7d1c58c7a68fd0999d3262a553926d.jpg","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"其他"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data Pipelines裏的所有的相關模塊都完成了Spark 3.0的升級,享受最新技術棧和優化帶來的收益。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於任務運行時間和需要的機器數明顯下降,整體的Spot Instance被中斷的概率也大大降低,任務穩定性得到加強。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"發佈了自動化數據驗證工具進行端到端的數據驗證。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"統一併升級了所有模塊的CD Pipelines。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來我們具體看看我們做了什麼,又踩了什麼樣的坑,以及背後有什麼魔法幫助達到既讓任務跑得快又能爲公司省錢的效果。對 Spark 3.0 新特性感興趣的同學可以參考我的另外一篇文章——關於"},{"type":"link","attrs":{"href":"https:\/\/xie.infoq.cn\/article\/fad821a83e19c6478458e0b03","title":"xxx","type":null},"content":[{"type":"text","text":"Spark 3.0的關鍵新特性回顧"}]},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"我們做了什麼?遇到什麼坑?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data Pipelines和相關的迴歸測試框架都進行相關依賴生態圈的統一升級,接下來會跟大家詳細分享細節部分。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Spark升級到最新穩定版3.0.1"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark 
"},{"type":"codeinline","content":[{"type":"text","text":"3.0.1"}]},{"type":"text","text":"是社區目前推薦使用的最新的穩定版本,於2020年九月正式發佈,其中解決了3.0版本里的一些潛在bug。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"主要的改動"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"打開Spark 3.0 AQE的新特性"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要配置如下:"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":" \"spark.sql.adaptive.enabled\": true,\n \"spark.sql.adaptive.coalescePartitions.enabled\": true,\n \"spark.sql.adaptive.coalescePartitions.minPartitionNum\": 1,\n \"spark.sql.adaptive.advisoryPartitionSizeInBytes\": \"128MB\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是,AQE特性只是在reducer階段不用指定reducer的個數,但"},{"type":"text","marks":[{"type":"strong"}],"text":"並不代表你不再需要指定任務的並行度了"},{"type":"text","text":"。因爲map階段仍然需要將數據劃分爲合適的分區進行處理,如果沒有指定並行度會使用默認的200,當數據量過大時,很容易出現OOM。建議還是按照任務之前的並行度設置來配置參數"},{"type":"codeinline","content":[{"type":"text","text":"spark.sql.shuffle.partitions"}]},{"type":"text","text":"和"},{"type":"codeinline","content":[{"type":"text","text":"spark.default.parallelism"}]},{"type":"text","text":"。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"升級HyperLogLog相關的UDAF到新接口"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark 3.0提供了通過用戶定製實現的Aggregator來註冊實現UDAF,可以避免對每一行的數據反覆進行序列化和反序列化來進行聚合,而只需在整個分區裏序列化一次 ,緩解了對cpu的壓力,提升性能。假如一個DataFrame有100萬行數據共10個paritions,那麼舊的UDAF方式的序列化反序列化需要至少100萬+10次(合併分區裏的結果)。而新的函數只需要10次即可,大大減少整體的序列化操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/45\/45a075115f52c5c1a425f0e563c61f99.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"依賴Hadoop版本升級"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"依賴的Hadoop根據Spark和EMR支持的版本升級到"},{"type":"codeinline","content":[{"type":"text","text":"3.2.1"}]}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"ext {\n hadoopVersion = \"3.2.1\"\n}\ncompile group: \"org.apache.hadoop\", name: \"hadoop-client\", version: 
\"${hadoopVersion}\"\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"打開 History Server Event Logs 滾動功能"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark 3.0提供了類似Log4j那樣對於長時間運行的日誌按照時間或者文件的大小進行切割,這樣對於Streaming長期運行的任務和大任務來說比較友好。"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":" \"spark.eventLog.rolling.enabled\": true,\n \"spark.eventLog.rolling.maxFileSize\": \"1024m\",\n \"spark.eventLog.buffer.kb\": \"10m\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"遇到的坑"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"讀Parquet文件失敗"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"升級到Spark 3.0後,讀源數據Parquet文件會出現一些莫名的問題,有些文件可以正常解析,而有些文件則會拋出失敗的異常錯誤,這個錯誤是整個升級的Blocker,非常令人苦惱。"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"具體的錯誤信息"}]}]}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. 
One possible cause: Parquet column cannot be converted in the corresponding files.\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原因"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在仔細調試和閱讀源碼後發現,Spark 3.0在Parquet的嵌套schema的邏輯上做了修改,主要是關於使用的優化特性"},{"type":"codeinline","content":[{"type":"text","text":"spark.sql.optimizer.nestedSchemaPruning.enabled"}]},{"type":"text","text":"時的變化,具體可以進一步閱讀相關的"},{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/spark\/pull\/24307","title":"","type":null},"content":[{"type":"text","text":"ticket"}]},{"type":"text","text":"。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而產生的影響就是當在有嵌套schema的Parquet文件上去讀取不存在的field時,會拋出錯誤。而在2.4以前的版本是,是允許訪問不存在的field並返回none,並不會中斷整個程序。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"解決辦法"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於我們數據建模和上游開發模式就是面向接口編程,爲了不和schema嚴格綁定,是會存在提前讀取一些暫時還沒有上線的field並暫時存放空值。因此,新的邏輯修改直接就break了原來的開發模式, 而且代碼裏也要加入各種兼容老的schema邏輯。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"於是我們將優化"},{"type":"codeinline","content":[{"type":"text","text":"spark.sql.optimizer.nestedSchemaPruning.enabled"}]},{"type":"text","text":"會關掉後,再進行性能的測試,發現性能的影響幾乎可以忽略。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"鑑於上面的影響太大和性能測試結果,最終選擇設置"},{"type":"codeinline","content":[{"type":"text","text":"spark.sql.optimizer.nestedSchemaPruning.enabled = false"}]},{"type":"text","text":"。後續會進一步研究是否有更優雅的解決方式。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"History Server的Connection Refused"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark 3.0裏History Server在解析日誌文件由於內存問題失敗時, History Server會重啓,隨後會出現"},{"type":"codeinline","content":[{"type":"text","text":"Connection Refused"}]},{"type":"text","text":"的錯誤信息,而在2.x裏,並不會導致整個History Server的重啓。"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"解決方案"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "},{"type":"text","text":"增加History 
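For completeness, a minimal sketch of how the workaround can be applied when the session is created; the application name is illustrative, and the same property can equally be passed as `--conf` on spark-submit or in spark-defaults.

```scala
import org.apache.spark.sql.SparkSession

object SessionWithWorkaround {
  def build(): SparkSession =
    SparkSession.builder()
      .appName("optimus")  // illustrative name
      // Trade away nested-schema pruning so that reading not-yet-populated
      // nested fields keeps returning nulls instead of failing, as on Spark 2.4.
      .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
      .getOrCreate()
}
```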
- **History Server connection refused**

  In Spark 3.0, when the History Server fails while parsing a log file because of memory problems, it restarts, after which `Connection Refused` errors appear; in 2.x this did not cause the whole History Server to restart.

  - **Solution**: increase the History Server's memory.
    - On the master node, modify the Spark configuration file:

```
export SPARK_DAEMON_MEMORY=12g
```

    - Then restart the History Server: `sudo systemctl restart spark-history-server`.

- **History UI shows jobs that never finish**

  - **Cause**
    - With AQE enabled the whole query is re-split during execution, and 3.0 also adds many observable metrics (for example around shuffle), so the history logs become relatively large. For some batches the logs cannot be synced to the History Server in time, so the History UI shows the job stuck `in progress` even though it has actually finished.
    - After reading the source code and the logs, our suspicion is that something goes wrong when the Spark driver's `EventLoggingListener` writes event logs to the upgraded HDFS (Hadoop `3.2.1`), for example losing the notification that the corresponding job has ended. The debugging logs in that part of the source are limited, so we cannot yet pin down the root cause and will keep following up.

> Similar problems occasionally happened on Spark 2.4 as well, but they seem to have become more frequent after the 3.0 upgrade. If you hit this, note that although the logs are incomplete, **the job execution and the data it produces are both correct**.
9868"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Datanode 端口: 50020 –> 9867, 50010 –> 9866, 50475 –> 9865, 50075 –> 9864"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"EMR升級到最新版6.2.0"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"系統升級"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"EMR 6.2.0使用的操作系統是更好"},{"type":"codeinline","content":[{"type":"text","text":"Amazon Linux2"}]},{"type":"text","text":",整體系統的服務安裝和控制從直接調用各個服務自己的起停命令(原有的操作系統版本過低)更換爲統一的"},{"type":"codeinline","content":[{"type":"text","text":"Systemd"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"啓用Yarn的結點標籤"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在EMR的6.x的發佈裏,禁用了Yarn的結點標籤功能,相較於原來Driver強制只能跑在Core結點上,新的EMR裏Driver可以跑在做任意結點,細節可以參考"},{"type":"link","attrs":{"href":"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-master-core-task-nodes.html","title":"","type":null},"content":[{"type":"text","text":"文檔"}]},{"type":"text","text":"。而由於我們的Data Pipelines需要EMR的Task節點按需進行擴或者縮,而且用的還是Spot Instance。因此這種場景下Driver更適合跑在常駐的(On Demand)的Core結點上,而不是隨時面臨收回的Task節點上。對應的EMR集羣改動:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"yarn.node-labels.enabled: true\nyarn.node-labels.am.default-node-label-expression: 'CORE'"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Spark Submit 命令的修改"}]}]}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在EMR新的版本里用extraJavaOptions會報錯,這個和EMR內部的設置有關係,具體詳情可以參考"},{"type":"link","attrs":{"href":"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-spark-configure.html","title":"","type":null},"content":[{"type":"text","text":"EMR配置"}]},{"type":"text","text":" ,修改如下:"},{"type":"codeinline","content":[{"type":"text","text":"spark.executor.extraJavaOptions=-XX"}]},{"type":"text","text":" -> 
"},{"type":"codeinline","content":[{"type":"text","text":"spark.executor.defaultJavaOptions=-XX:+UseG1GC"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"遇到的坑"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Hive Metastore衝突"}]}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原因"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"EMR 6.2.0裏內置的Hive Metastore版本是"},{"type":"codeinline","content":[{"type":"text","text":"2.3.7"}]},{"type":"text","text":",而公司內部系統使用的目前版本是"},{"type":"codeinline","content":[{"type":"text","text":"1.2.1"}]},{"type":"text","text":",因此在使用新版EMR的時候會報莫名的各種包問題,根本原因就是使用的Metastore版本衝突問題。"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"錯誤信息示例"},{"type":"text","text":":"}]}]}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"User class threw exception: java.lang.RuntimeException: [download failed: net.minidev#accessors-smart;1.2!accessors-smart.jar(bundle), download failed: org.ow2.asm#asm;5.0.4!asm.jar, download failed: org.apache.kerby#kerb-core;1.0.1!kerb-core.jar, download failed: org.apache.kerby#kerb-server;1.0.1!kerb-server.jar, download failed: org.apache.htrace#htrace-core4;4.1.0-incubating!htrace-core4.jar, download failed: com.fasterxml.jackson.core#jackson-databind;2.7.8!jackson-databind.jar(bundle), download failed: com.fasterxml.jackson.core#jackson-core;2.7.8!jackson-core.jar(bundle), download failed: javax.xml.bind#jaxb-api;2.2.11!jaxb-api.jar, download failed: org.eclipse.jetty#jetty-util;9.3.19.v20170502!jetty-util.jar, download failed: com.google.inject#guice;4.0!guice.jar, download failed: com.sun.jersey#jersey-server;1.19!jersey-server.jar]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"解決方案"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"初始方案:"}]}]}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"\"spark.sql.hive.metastore.version\": \"1.2.1\",\n\"spark.sql.hive.metastore.jars\": 
\"maven\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但初始方案每次任務運行時都需要去maven庫裏下載,比較影響性能而且浪費資源,當多任務併發去下載的時候會出問題,並且"},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/2.4.0\/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore","title":"","type":null},"content":[{"type":"text","text":"官方文檔"}]},{"type":"text","text":"不建議在生產環境下使用。因此將lib包的下載直接打入鏡像裏,然後啓動EMR集羣的時候加載一次到"},{"type":"codeinline","content":[{"type":"text","text":"\/dependency_libs\/hive\/*"}]},{"type":"text","text":"即可,完善後方案爲:"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"\"spark.sql.hive.metastore.version\": \"1.2.1\",\n\"spark.sql.hive.metastore.jars\": \"\/dependency_libs\/hive\/*\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Hive Server連接失敗"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"錯誤信息"}]}]}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"Caused by: org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! 
- **Hive Server connection failure**

  - **Error message**

```
Caused by: org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=default})
```

  - **Cause**
    - A problem similar to the Hive Metastore conflict: the hive-jdbc.jar bundled with Spark 3.0 is too new.
  - **Solution**
    - Download a compatible jar and replace the hive-jdbc.jar that ships with Spark 3.0:

```
wget -P ./ https://github.com/timveil/hive-jdbc-uber-jar/releases/download/v1.8-2.6.3/hive-jdbc-uber-2.6.3.0-235.jar
```
- **Writing data to HDFS occasionally fails**

  Running on the latest EMR cluster, the HDFS write stage frequently failed. The error messages in the logs:

  - Spark log

```
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hadoop/output/20201023040000/tablename/normal/_temporary/0/_temporary/attempt_20201103002533_0146_m_001328_760289/event_date=2020-10-22 03%3A00%3A00/part-01328-7c2e85a0-dfc8-4d4d-8d49-ed9b6aca06f6.c000.zlib.orc could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
```

  - HDFS DataNode log

```
365050 java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096
365051 at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:150)
365052 at java.lang.Thread.run(Thread.java:748)
```

  - **Solution**: raise the corresponding HDFS transfer-thread limit.

```
dfs.datanode.max.transfer.threads = 16384
```

  - We are not sure whether the EMR upgrade changed the default value of this HDFS parameter along the way.

### Upgrading Scala to 2.12
Because Spark 3.0 no longer supports Scala 2.11, all code had to be upgraded to 2.12. For more on what is new in Scala 2.12, see the [release notes](https://www.scala-lang.org/news/2.12.0/#library-improvements).

- **Syntax changes**
  - `JavaConversions` is deprecated; use `JavaConverters` and call the `.asJava` or `.asScala` conversions explicitly (see the sketch after this list).
  - The concurrency-related interfaces changed (`scala.concurrent.Future`).
- **Upgrades of surrounding dependencies**
  - Including, but not limited to, `scalatest`, `scalacheck`, and `scala-xml`, upgraded to their Scala 2.12 builds.
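A small before/after sketch of the `JavaConverters` change; the collections and names are illustrative.

```scala
// Scala 2.11 style (now deprecated): implicit conversions via
//   import scala.collection.JavaConversions._
// Scala 2.12 style: explicit conversions via JavaConverters.
import scala.collection.JavaConverters._

object ConvertersDemo {
  def main(args: Array[String]): Unit = {
    val javaList = new java.util.ArrayList[String]()
    javaList.add("optimus")
    javaList.add("jetfire")

    val scalaSeq: Seq[String] = javaList.asScala.toSeq        // Java -> Scala, explicit
    val backToJava: java.util.List[String] = scalaSeq.asJava  // Scala -> Java, explicit

    scalaSeq.foreach(println)
    println(backToJava.size())
  }
}
```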
### Other related adjustments

- **Cluster resource allocation algorithm adjusted**

  Overall cluster memory usage dropped noticeably after the 3.0 upgrade, so the Data Pipelines re-tuned the algorithm that derives cluster size from input file sizes to match the new resource needs.

- **Python upgraded to 3.x**

## Why is it both faster and cheaper?

Let's take a closer look at why the 3.0 upgrade both reduces runtime and saves cluster cost, using one table from the Optimus data models as an example:

- In the reduce stage, the task count plummets from `40320` tasks without AQE to `4580` tasks, a reduction of an order of magnitude.
  - In the figure below, the lower half shows the Spark 2.x tasks without AQE, and the upper half shows Spark 3.x with AQE enabled.

![](https://static001.geekbang.org/infoq/ba/bae7cf0451398a5a9d72d74519fae234.png)

- Looking at the more detailed timing view, the same aggregate operations after the `shuffle reader` drop from `4.44h` to `2.56h`, saving nearly half the time.
- On the left are the Spark 2.x runtime metrics; on the right are the metrics with AQE enabled, using the `custom shuffle reader`.

![](https://static001.geekbang.org/infoq/d1/d1a8158b983e39f93f04b1e87f6e642a.png)

**Analysis**:
- **The AQE feature**:
  - [AQE](https://xie.infoq.cn/article/fad821a83e19c6478458e0b03) adjusts and optimizes the overall Spark SQL execution process (see the figure below); its biggest highlight is that it keeps feeding the real, precise execution statistics of already-completed plan nodes back into the optimizer to re-optimize the remaining execution plan.

![](https://static001.infoq.cn/resource/image/d5/9a/d5e51b503c82af9f31df0f69c7887b9a.gif)

  - **AQE automatically adjusts the number of reducers, reducing the partition count.** Parallelism has always been a troublesome setting for Spark users. If it is too high, there are too many tasks, the overhead is large, and the whole job slows down; if it is too low, partitions are large, OOM becomes likely, resources are not used well, and the advantage of running tasks in parallel is not fully realized.

    Moreover, because the job's parallelism on the Spark context has to be set up front and cannot be changed dynamically, it often happens that a job needs high parallelism for the large input at the start, but after transformations and filtering the data set has become small and the initially configured partition count is far too high. AQE handles this well: when the reducers read the data, it automatically coalesces small partitions according to the target partition size the user sets (`spark.sql.adaptive.advisoryPartitionSizeInBytes`), adaptively reducing the partition count to cut resource waste and overhead and improve job performance (a toy local demonstration appears right after this analysis).

- As the single table above shows, enabling AQE drastically reduces the task count, which not only lightens the driver's load but also cuts the scheduling, memory, and startup-management overhead of launching tasks, lowers CPU usage, and improves I/O performance.
- Take the historical Data Pipelines as an example: thirty-plus tables run in Spark in parallel, and each gets a large performance boost, which in turn frees resources earlier and in larger amounts for the other tables; they benefit from each other, so the whole data-modeling process naturally ends up faster.

- Large batches (>200 GB) improve more than small batches (<100 GB), by up to 40%, mainly because a large batch has more data, uses more machines, and is configured with higher parallelism, so AQE has more, and more visible, opportunities to show its strengths. Small batches run at lower parallelism, so the gain is smaller, but they still see roughly a 27.5% speed-up.

- **Memory optimizations**:
  - Beyond AQE reducing the memory taken by overly fragmented tasks, Spark 3.0 also made many other memory optimizations, such as slimming down the Aggregate metrics ([ticket](https://issues.apache.org/jira/browse/SPARK-29562)), Netty's shared memory pool ([ticket](https://issues.apache.org/jira/browse/SPARK-24920)), a Task Manager deadlock fix ([ticket](https://issues.apache.org/jira/browse/SPARK-27338)), and avoiding reading shuffle blocks over the network in some scenarios ([ticket](https://issues.apache.org/jira/browse/SPARK-27651)), all of which reduce memory pressure. These memory optimizations, stacked on top of AQE, explain the roughly `30%` drop in cluster memory usage shown in the earlier figure.

- Every module of the Data Pipelines, end to end, was upgraded to Spark 3.0, so the full pipeline benefits from the new stack.
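To see the partition-coalescing effect outside the production pipelines, here is a toy, self-contained demonstration under assumed local settings; the data, sizes, and resulting partition counts are illustrative, and the coalesced plan is also visible in the SQL tab of the Spark UI.

```scala
import org.apache.spark.sql.SparkSession

object AqeCoalesceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aqe-coalesce-demo")
      .master("local[4]")
      .config("spark.sql.shuffle.partitions", "200")  // map-side parallelism still set explicitly
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
      .getOrCreate()

    // A tiny aggregation: 200 shuffle partitions are configured, but the
    // post-shuffle output is small, so AQE coalesces it to just a few partitions.
    val grouped = spark.range(0, 1000000L)
      .selectExpr("id % 100 AS k", "id AS v")
      .groupBy("k")
      .count()

    // Materializing the plan lets AQE decide the final partition count
    // from runtime statistics; for this toy data it is typically 1.
    println(s"post-shuffle partitions with AQE: ${grouped.rdd.getNumPartitions}")

    spark.stop()
  }
}
```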
To sum up: `end-to-end speed-up of the Spark jobs + lower cluster resource usage = better performance at lower cost`.

## Looking ahead

Next, the team will keep up with updates to the tech stack and continue tuning and contributing to the Data Pipelines at both the code and the stack level. We will also introduce more monitoring metrics to better address data skew that may appear in business modeling, giving stronger technical support and protection to FreeWheel's rapidly growing business.

Finally, special thanks to the AWS EMR and Support teams for their quick responses and support throughout the upgrade.

**About the author**

Xiao Hongmei graduated from Peking University and previously worked at MicroStrategy, Meituan, and the big-data company Pegasus, with rich experience in big-data development and tuning, big-data product analysis, data warehousing and modeling, project management, and agile development. She now leads the core business data Transformer team at Comcast FreeWheel, responsible for building, running, and optimizing the big-data Data Pipelines platform as well as data-warehouse modeling and core data publishing. She enjoys accumulating and sharing big-data expertise and is committed to building a big-data ecosystem that makes data products easier to use and adds value to the business.

**Reference**

- [A review of the key new features in Spark 3.0](https://xie.infoq.cn/article/fad821a83e19c6478458e0b03)
- [Spark 3.0 Release Notes](https://spark.apache.org/releases/spark-release-3-0-0.html)
- [AQE](https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html)
- [DPP](https://blog.knoldus.com/dynamic-partition-pruning-in-spark-3-0/)
- [UDAF](https://issues.apache.org/jira/browse/SPARK-27296)
- [CBO](https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html)
SQL語法"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/latest\/configuration.html#spark-sql","title":"","type":null},"content":[{"type":"text","text":"Spark SQL配置"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/spark\/blob\/master\/docs\/web-ui.md","title":"","type":null},"content":[{"type":"text","text":"Spark Web UI使用"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/SPARK-29779","title":"","type":null},"content":[{"type":"text","text":"Spark Event Logs滾動"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-spark-configure.html","title":"","type":null},"content":[{"type":"text","text":"EMR Spark配置"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/spark\/pull\/24307","title":"","type":null},"content":[{"type":"text","text":"Parquet嵌套schema問題"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/2.4.0\/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore","title":"","type":null},"content":[{"type":"text","text":"Spark Hive Metastore配置"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-master-core-task-nodes.html","title":"","type":null},"content":[{"type":"text","text":"EMR 結點標籤配置"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.scala-lang.org\/news\/2.12.0\/#library-improvements","title":"","type":null},"content":[{"type":"text","text":"Scala 2.12改進"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/SPARK-29562","title":"","type":null},"content":[{"type":"text","text":"Spark Aggregation指標改進"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/SPARK-24920","title":"","type":null},"content":[{"type":"text","text":"Spark Netty 
共享內存Pool"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/SPARK-27338","title":"","type":null},"content":[{"type":"text","text":"Spark Task Manager 死鎖問題"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/SPARK-27651","title":"","type":null},"content":[{"type":"text","text":"Spark Shuffle Block避免網絡讀取"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}