Xiaomi's Growth Analytics Platform: A Practice Based on Apache Doris

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"1、背景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着小米互聯網業務的發展,各個產品線利用用戶行爲數據對業務進行增長分析的需求越來越迫切。顯然,讓每個業務產品線都自己搭建一套增長分析系統,不僅成本高昂,也會導致效率低下。我們希望能有一款產品能夠幫助他們屏蔽底層複雜的技術細節,讓相關業務人員能夠專注於自己的技術領域,從而提高工作效率。通過分析調查發現,小米已有的統計平臺無法支持靈活的維度交叉查詢,數據查詢分析效率較低,複雜查詢需要依賴於研發人員,同時缺乏根據用戶行爲高效的分羣工具,對於用戶的運營策略囿於設施薄弱而較爲粗放,運營效率較低和效果不佳。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於上述需求和痛點,小米大數據和雲平臺聯合開發了增長分析系統(Growing Analytics, 下面簡稱GA),旨在提供一個靈活的多維實時查詢和分析平臺,統一數據接入和查詢方案,幫助業務線做精細化運營。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"2、增長分析場景介紹"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/94/940635f36678d4fe9724c1df716eaebb.jpeg","alt":"24264162-be4a33da995fe293","title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://forum.dorisdb.com/uploads/default/original/1X/41b231d2208793434081655ee44f0f3d772ddfa3.jpeg","title":"24264162-be4a33da995fe293"},"content":[{"type":"text","text":"24264162-be4a33da995fe293960×540 36.9 
KB"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如上圖所示,分析、決策、執行是一個循環迭代的過程,因此,增長分析查詢非常靈活,涉及分析的維度有幾十上百個,我們無法預先定義好所有要計算的結果,代價太高,所以這也就要求了所有的數據需要即時計算和分析。同時,決策具有時效性,因此數據從攝入到可以查詢的時延不能太高。另外,業務發展迅速,需要增加新的分析維度,所以我們需要能夠支持schema的變更(主要是在線增加字段)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們的業務中,增長分析最常用的三個功能是事件分析(佔絕大多數)、留存分析和漏斗分析;這三個功能業務都要求針對實時入庫(只有append)的明細數據,能夠即席選擇維度和條件(通常還要join業務畫像表或者圈選的人羣包),然後在秒級返回結果(業界相關的產品如神策、GrowingIO等都能達到這個性能)。一些只支持提前聚合的預計算引擎(如Kylin),雖然查詢性能優秀,但難以支持schema隨時變更,衆多的維度也會造成Cube存儲佔用失控,而Hive能夠在功能上滿足要求,但是性能上較差。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"綜上,我們需要存儲和計算明細數據,需要一套支持近實時數據攝取,可靈活修改schema和即席查詢的數據分析系統解決方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"3、技術架構演進"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"3.1 初始架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GA立項於2018年年中,當時基於開發時間和成本,技術棧等因素的考慮,我們複用了現有各種大數據基礎組件(HDFS, Kudu, 
On top of these components we built a growth-analytics query system with a Lambda architecture. The first-generation GA architecture is shown below:

[Figure: first-generation GA architecture (1011×540) — https://static001.geekbang.org/infoq/a4/a4ec20a764ddda7e98e6b1a7cd799163.png]

GA covers the whole pipeline: data collection, data cleaning, querying, and BI reporting. Data collected from the sources is first cleaned and written in a unified JSON format to Talos (Xiaomi's in-house message queue). Spark Streaming then moves it into Kudu. Kudu is an excellent OLAP storage engine that supports both real-time ingestion and fast queries, so we used it to store hot data, with HDFS storing cold data. To hide the hot/cold split from users, a dynamic partition management service migrates table partitions: it periodically converts expired hot data into cold data on HDFS and updates a union view over the Kudu and HDFS tables. When a user queries the view through the SparkSQL service, the engine routes the SQL automatically and processes the Kudu and HDFS data together.

In its historical context, the first-generation GA relieved our users' pain of coarse, inefficient operations, but it also exposed problems. The first was operating cost. The original design ran every component on shared clusters, but in practice query performance was easily disturbed by other jobs on those clusters and tended to jitter, especially when reading data from the shared HDFS cluster, which was sometimes slow; as a result, the GA cluster's storage and compute components were all deployed separately. The second was performance.
SparkSQL is a query engine designed on a batch-processing model: the shuffle between stages still spills to disk, so end-to-end SQL latency is high. To keep queries from being starved of resources we added machines, but in practice the headroom was limited; the solution could not use machines efficiently enough for fast queries, and some resources were wasted. We therefore wanted a new solution that would both improve query performance and reduce our operating cost.

3.2 Re-selection

MPP SQL engines such as Impala and Presto run SQL efficiently, but they still depend on Kudu, HDFS, the Hive Metastore, and other components, so the operating cost stays high. Moreover, because compute and storage are separated, the query engine cannot promptly observe changes in the storage layer, which rules out finer-grained optimizations; for example, a cache at the SQL layer could not guarantee fresh results. Our goal was therefore an MPP database with integrated compute and storage to replace our storage and compute components. We required that it:

1. Deliver sufficiently fast query performance.
2. Support standard SQL comprehensively and be friendly to users.
3. Depend on no external systems and be simple to operate.
4. Have an active developer community, to ease our later maintenance and upgrades.

Doris is an MPP-based interactive SQL data warehouse that Baidu open-sourced to the Apache community, aimed mainly at reporting and multidimensional analysis. It chiefly integrates technology from Google Mesa and Cloudera Impala.
Doris met the requirements above. After running internal performance tests and talking with the community, we decided to replace our original compute and storage components with Doris, which simplified the architecture to the following:

[Figure: simplified GA architecture after adopting Doris (787×217) — https://static001.geekbang.org/infoq/04/04a4bc6ccb228b0a5977c67556a9396d.png]

3.3 Performance Testing

With roughly equivalent compute resources, we picked a business producing about one billion rows per day and compared SparkSQL and Doris query performance across different scenarios (6 event analyses, 3 retention analyses, 3 funnel analyses) and different time ranges (one week to one month).

[Figure: SparkSQL vs. Doris query latency by scenario (1061×546) — https://static001.geekbang.org/infoq/62/6272791127ac265116fee3211549a230.png]
KB"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如上圖測試結果,在增長分析的場景下,Doris查詢性能相比於SparkSQL+Kudu+HDFS方案具有明顯的提升,在事件分析場景下平均降低約85%左右的查詢時間,在留存和漏斗場景下平均降低約50%左右的查詢時間。對於我們我們業務大多數都是事件分析需求來講,這個性能提升很大。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"4、Doris實踐與優化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"4.1 Doris在增長分析平臺的使用情況"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5b/5bd8d71866c945db2a38b3aa8f3bc8a3.png","alt":"24264162-2a96ac54f6206b68","title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://forum.dorisdb.com/uploads/default/original/1X/79dcfa0da4711bcd27303ef758b556ae35a68cc6.png","title":"24264162-2a96ac54f6206b68"},"content":[{"type":"text","text":"24264162-2a96ac54f6206b68923×281 57.5 
KB"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着接入業務的增多,目前,我們的增長分析集羣單集羣最大規模已經擴展到了近百臺,存量數據到了PB級別。其中,近實時的產品線作業有數十個,每天有幾百億條的數據入庫,每日有效的業務查詢SQL達1.2w+。業務的增多和集羣規模的增大,讓我們也遇到不少問題和挑戰,下面我們將介紹運維Doris集羣過程中遇到的一些問題和應對措施或改進。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"4.2 Doris數據導入實踐"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Doris大規模接入業務的第一個挑戰是數據導入,基於我們目前的業務需求,數據要儘可能實時導入。而對於增長分析集羣,目前有數十個業務明細數據表需要近實時導入,這其中還包含了幾個大業務(大業務每天的數據條數從幾十億到上百億不等,字段數在200~400)。爲了保證數據不重複插入,Doris採用label標記每批數據的導入,並採用兩階段提交來保證數據導入的事務性,要麼全部成功,要麼全部失敗。爲了方便監控和管理數據導入作業,我們使用Spark Streaming封裝了stream load操作,實現了將Talos的數據導入到Doris中。每隔幾分鐘,Spark Streaming會從Talos讀取一個批次的數據並生成相應的RDD,RDD的每個分區和Talos的每個分區一一對應,如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/db/db1c7756c8d8f64650972fe6863151ec.png","alt":"24264162-1380e0f17b3f32a4","title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://forum.dorisdb.com/uploads/default/original/1X/23011d6ff2ff4b7e2aa68494f0ed5e07f82c72c2.png","title":"24264162-1380e0f17b3f32a4"},"content":[{"type":"text","text":"24264162-1380e0f17b3f32a4897×514 83.2 KB"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於Doris來說,一次stream 
For Doris, each stream load operation produces one transaction, and the master node of the fe process manages the whole transaction lifecycle; committing too many transactions in a short time puts the fe master under heavy pressure. For a single streaming import job, if the message queue has m partitions and each partition may run up to n stream load operations per batch, one batch can generate m*n transactions. To keep Doris imports stable, we tuned the Spark Streaming batch interval to between 1 and 3 minutes depending on each business's data volume and latency requirements, and made each stream load carry as much data as possible.

In the cluster's early days this streaming import mechanism ran smoothly, but problems arrived with scale. First, some large tables holding many days of data began failing imports frequently, surfacing as import timeout errors. Our investigation found the cause: our stream loads did not specify a target partition (the online event tables are partitioned by day). Some event tables retained more than three months of data with over 600 tablets per day, and each table keeps three replicas by default, so every write had to open roughly 180,000 writers and timed out while still opening them. Yet because the data arrives in real time, nothing is written to the other days' partitions, so those writers were never actually needed. Once we understood this, we made two changes: we specify the target partition at import time based on the data's date, and we cut the number of tablets per daily partition from 600+ to 200+ (too many tablets hurts both import and query efficiency). With target partitions specified and tablet counts capped, large tables now import smoothly without timeouts.

The other problem that dogged us was that the growing number of businesses needing real-time import put heavy pressure on the fe master node and hurt import efficiency. Each stream load requires the coordinator be node to interact with the fe several times, as shown below:

[Figure: fe interactions performed by the coordinator be during a stream load (816×566) — https://static001.geekbang.org/infoq/a2/a2e5d2379179a6ff601929c97b7fa120.png]
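The writer explosion described above is easy to reproduce with back-of-the-envelope arithmetic (a sketch; the figure of roughly 100 retained daily partitions is our reading of "more than three months of data"):

```python
def writers_to_open(partitions, tablets_per_partition, replicas):
    """Writers a load must open when no target partition is specified:
    one per tablet replica across every partition of the table."""
    return partitions * tablets_per_partition * replicas

# A table keeping ~100 daily partitions with 600+ tablets each, 3 replicas:
before = writers_to_open(100, 600, 3)   # ~180,000 writers per load
# After targeting the one active partition and trimming tablets to 200:
after = writers_to_open(1, 200, 3)      # 600 writers per load
```

The fix shrinks the setup work per load by three orders of magnitude, which is why the timeouts disappeared.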
KB"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"曾經有段時間,我們發現master節點偶爾出現線程數飆升,隨後cpu load升高, 最後進程掛掉重啓的情況。我們的查詢併發並不是很高,所以不太可能是查詢導致的。但同時我們通過對max_running_txn_num_per_db參數的設置已經對數據導入在fe端做了限流,所以爲何fe的master節點的線程數會飆升讓我們感到比較奇怪。經過查看日誌發現,be端有大量請求數據導入執行計劃失敗的日誌。我們的確限制住了單個db能夠允許同時存在的最大事務數目,但是由於fe在計算執行計劃的時候需要獲取db的讀鎖,提交和完成事務需要獲取db的寫鎖,一些長尾任務的出現導致了好多計算執行計劃的任務都堵塞在獲取db鎖上邊,這時候be客戶端發現rpc請求超時了,於是立即重試,fe端的thirft server需要啓動新的線程來處理新的請求,但是之前的事務任務並沒有取消,這時候積壓的任務不斷增多,最終導致了雪崩效應。針對這種情況,我們對Doris主要做了以下的改造:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"在構造fe的thrift server的線程池時使用顯式創建線程池的方式而非原生的newCachedThreadPool方式,對線程數做了相應的限制,避免因爲線程數飆升而導致資源耗盡,同時添加了相應的監控。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"當be對fe的rpc請求超時時,大部分情況下都是fe無法在指定時間內處理完請求導致的,所以在重試之前加上緩衝時間,避免fe端處理請求的堵塞情況進一步惡化。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"重構了下GlobalTransactionMgr的代碼,在保持兼容原有接口的基礎上,支持db級別的事務隔離,儘量減少不同事務請求之間的相互影響,同時優化了部分事務處理邏輯,加快事務處理的效率。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"獲取db鎖添加了超時機制,如果指定時間內獲取不到db鎖,則取消任務,因爲這時候be端的rpc請求也已經超時了,繼續執行取消的任務沒有意義。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"對coordinator be每一步操作的耗時添加metric記錄,如請求開始事務的耗時,獲取執行計劃的耗時等,在最終的執行結果中返回,方便我們及時瞭解每個stream 
These changes markedly improved import stability; since then the master node has never again crashed under the pressure of import transactions. Some import issues still await improvement, however:

1. The be side uses libevent to handle http requests. Reactor-style libevent is normally the first choice for high-performance network servers, but it does not fit our scenario: Doris repeatedly calls blocking business logic inside the callbacks, such as rpc requests and waiting for data distribution to finish, and since multiple requests share one thread, some requests' callbacks are not handled promptly. We have no good fix yet; our only mitigation is to raise libevent's thread count to soften the interference between requests. A thorough solution awaits further discussion in the community.
2. When updating a table's partition versions, the fe isolates at the db level. That lock granularity is too coarse: imports into different tables of the same db all contend for the db lock, which greatly reduces the fe's transaction throughput.
3. The publish-transaction step is still prone to publish timeout (meaning responses confirming the publish cannot be collected in time from a majority of the be nodes involved in the transaction), which remains a major obstacle to improving import efficiency.

4.3 Online Query Practice

In the growth-analytics scenario, event tables are our core tables, and detail logs must be imported into them in real time. These tables need no aggregation or deduplication, and the business requires querying detail records, so they all use the duplicate model (DUPLICATE KEY).
Event tables are partitioned at day granularity, and the bucketing column is the log id (in fact a randomly generated md5); its hash spreads data evenly across buckets and avoids the write and query problems caused by skew.

The chart below shows the last 30 days of query-performance statistics for our largest production cluster (gathered from Doris's query audit log); over the most recent week, successful SQL queries ran between 12,000 and 20,000 per day.

[Figure: 30-day query-performance statistics of the largest cluster (916×483) — https://static001.geekbang.org/infoq/55/55befabe38fbc6a3939d6ff11d376657.png]

As the chart shows, with Doris the average query time stays around 10 seconds and never exceeds 15, and P95 latency generally stays within 30 seconds. Compared with the old SparkSQL experience, this is a marked improvement.

Doris exposes a query-parallelism parameter, parallel_fragment_exec_instance_num, which our query service tunes dynamically based on the number of running tasks: raising it under low load for better performance, lowering it under high load for cluster stability. While analyzing business query profiles, we noticed that Doris by default keeps the same parallelism before and after an exchange. For aggregation queries, however, the data volume after the exchange shrinks drastically; keeping the same parallelism there not only wastes resources but, with so little post-exchange data run at high parallelism, can in theory even slow the query down. We therefore added a parameter, doris_exchange_instances, to control task parallelism after the exchange (illustrated below), and it performed well in business testing.

[Figure: controlling post-exchange parallelism with doris_exchange_instances (1034×684) — https://static001.geekbang.org/infoq/66/6676846608ebde7818d42b2d76f1aff9.png]

The effect is minor for businesses with huge data volumes, or for queries where the exchange does not significantly reduce the data, but for aggregation or join queries of small and medium businesses (especially small ones that use many buckets) the optimization is noticeable. Tests on businesses of different sizes bore out this reasoning. We picked a small business with 400 million rows per day and measured query performance in several scenarios:

[Figure: query latency with and without doris_exchange_instances on a small business (899×509) — https://static001.geekbang.org/infoq/cd/cd1cdd2b0b8e1f3263f85a8ed9333233.png]

As the results show, doris_exchange_instances clearly improves small aggregation and join queries. Of course, this test found the optimal value only after many runs, which is rarely feasible in production; in practice, for small and medium businesses it is enough to lower the value moderately based on the number of buckets the query plan must scan and the cluster size, taking a modest gain at low cost. We later contributed this improvement to the community, where the parameter was renamed parallel_exchange_instance_num.

To extend SQL's query capabilities, Doris also provides a UDF (User-Defined Functions) framework similar to SparkSQL's and Hive's; when the built-in functions fall short, users can implement their own against the UDF framework. Doris supports two classes of UDF (UDTFs — User-Defined Table-Generating Functions, which map one input row to many output rows — are not yet supported).
The first class is the ordinary UDF, which produces one output row from one input row. The second is the UDAF (User-Defined Aggregate Function), an aggregate that consumes many input rows and produces one output row. A UDAF executes as follows:

[Figure: UDAF execution flow (966×698) — https://static001.geekbang.org/infoq/33/3317c45fc91ac61d9a0086cd93eaa888.png]

A UDAF generally defines four functions: Init, Update, Merge, and Finalize. If the intermediate output is a complex data type, it must also implement Serialize to serialize the intermediate type during the shuffle, and deserialize it again inside Merge. In the growth-analytics scenario, both retention analysis and funnel analysis rely on UDAFs. Take retention analysis: it is a model of user engagement and activity that asks how many of the users who performed an initial action go on to perform a follow-up action. For it we first defined a function, retention_info, whose input is each user's behavior records; grouping by user id, it produces each user's retention information for every time unit (day, week, month, and so on) within the chosen window. We then defined retention_count, whose input is the per-user retention information produced by retention_info; grouping by the retention time unit (usually day), it computes the number of retained users in each unit. With these UDAFs, the retention-analysis computation goes through cleanly.
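The Init/Update/Merge/Finalize protocol behind retention_count can be mimicked in a few lines. This is a toy model, not the actual UDAF; the per-user sets of day offsets stand in for the output of retention_info:

```python
class RetentionCount:
    """A toy aggregate mirroring Doris's Init/Update/Merge/Finalize protocol.

    Each input row is one user's retention info: a set of day offsets
    (0 = the day of the initial action) on which the user came back.
    The final output is, per day offset, how many users were retained.
    """

    def init(self):
        return {}                      # day offset -> retained user count

    def update(self, state, user_days):
        for day in user_days:
            state[day] = state.get(day, 0) + 1
        return state

    def merge(self, state, other):     # combine partial states after shuffle
        for day, n in other.items():
            state[day] = state.get(day, 0) + n
        return state

    def finalize(self, state):
        return [state.get(d, 0) for d in range(max(state, default=-1) + 1)]
```

In the engine, update runs per input row on each node, merge combines the partial states shipped across the exchange, and finalize renders the result row; a complex intermediate state like this dict is what the Serialize hook would carry through the shuffle.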
4.4 Managing Doris Tables

From the standpoint of partitioning, our Doris olap tables fall into two types. The first is non-partitioned tables, such as user-cohort tables and business profile tables: cohort tables are individually small but very numerous, while profile tables are of moderate size and additionally need updates. The second is partitioned tables, such as the event tables, which are individually large. We designed these with a time column as the partition key, so a new partition must be added every day for real-time data to land in the current day's partition, and expired partitions must be dropped promptly. Making every business manage its own table partitions would clearly be tedious and error-prone. Our original GA architecture already had a dynamic partition management service, so after moving to Doris we integrated that service into Doris itself, letting users configure, at day, week, or month granularity, how many partitions to retain and how many to create in advance.

The other typical management task is schema change, chiefly adding columns. Doris currently supports only basic data types, while the logs reported by businesses in big-data scenarios are mostly nested types (list, map), so they must be flattened or converted on ingestion. This leaves Doris tables with very wide schemas, and fields that are hard to flatten end up stored as varchar, which is awkward to use and slow to query. Because Doris lacks nested types, whenever a nested field gains a new element the Doris table needs a new column, and the wait from submitting the add-column request to its completion is long; when the cluster manages a huge number of tablets and the table itself has a lot of data and tablets, adding a column can even fail. We made two improvements:

1. Shorten the wait between a schema-change request and its execution. Originally, a schema-change transaction had to wait until all preceding transactions in the same db had finished before it could start; we changed it to wait only for preceding transactions that involve the table being altered. When a db has many import jobs, this greatly shortens the wait for a schema change, and it also prevents import failures on other tables from indefinitely stalling a schema change on this one.
2. Speed up creating the tablets that carry the new schema. Doris performs a schema change by creating tablets with the new schema and migrating data from the old tablets into them. A be node manages all of its tablets in a single map guarded by one global lock; when tablets are very numerous, every tablet-management operation contends for that lock, new-tablet creation times out, and the schema change fails. We sharded both the map and the lock, which eliminated these tablet-creation timeouts.

5. Summary and Outlook

Since Doris went live at Xiaomi with its first business in September 2019, we have deployed nearly ten clusters at home and abroad (several hundred BE nodes in total) that complete tens of thousands of online analytical queries every day, carrying most of our online analysis needs, including growth analytics and report queries. Replacing SparkSQL with Doris as the main OLAP engine both raised query performance substantially and simplified our data-analytics architecture; it stands as a fairly successful practice of serving detail-data queries with Doris at scale.

Going forward, we will keep working on raising real-time import efficiency and overall query performance. Quite a few internal businesses need the UNIQUE KEY model, whose scan performance still lags clearly behind the DUPLICATE KEY model; closing that gap is a performance problem we must tackle next.

6. Acknowledgements

As the community grows, Doris keeps maturing. Members of Doris's core development team founded DingShi Technology, which focuses on improving Doris's performance and rounding out its ecosystem, such as a visual management and operations platform and security components. DingShi's engineers gave us a great deal of help while we used Doris, for which we sincerely thank them!

7. About the Authors

Cai Conghui, OLAP engineer at Xiaomi, Apache Doris Committer

Zhong Yun, big data engineer at Xiaomi