# Beike's OLAP Engine Practice Based on Druid
{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"貝殼作爲全國領先的房產交易和租賃在線服務平臺,有很多業務場景會產出大量實時和離線數據,針對這些數據進行查詢分析,對於企業發展和業務拓展至關重要。不同業務線不同查詢場景下,單一技術手段很難滿足業務方的需求,Druid就是我們在探索之路上發現的比較切合業務方需求的OLAP引擎之一,基於Druid我們做了深入地實踐,接下來就由我和業界朋友們一起分享。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"內容包括:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"貝殼OLAP平臺介紹"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OLAP技術選型策略"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Druid在貝殼的應用實踐"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Druid結合貝殼業務場景的改進"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"未來規劃"}]}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"貝殼OLAP平臺介紹"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 平臺簡介"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ca\/cad4f21629ca5871bf1476ca9ead7dfa.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"平臺的使用對象主要是經紀人、運營人員、房產分析師和客戶。平臺的架構就如上圖所示,整個平臺分爲四層,第一層爲應用層,應用層主要是看板和報表。第二層是指標層,提供了一個一站式的指標平臺,主要使用對象是數倉人員,數倉人員可以在一站式平臺上做數據建模、例行作業任務配置、指標定義加工以及指標API輸出。第三層爲路由層,路由層是一個統一查詢引擎,提供查詢語義轉換、查詢緩存、查詢降級以及不同類型的引擎切換。第四層是OLAP引擎,目前使用的主要引擎是Kylin、Druid以及Clickhouse,其中Kylin和Druid主要是分擔離線指標業務,Clickhouse負擔實時指標業務。在2020年4月份之前,平臺底層的離線指標引擎主要依託Kylin爲主,在2020年5月份之後,逐漸引入Druid引擎,目前兩個引擎的流量比例Druid佔60%左右,kylin在40%左右。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 
### 2. Why We Introduced Druid

![figure](https://static001.geekbang.org/infoq/c5/c5899c59227e35c0364df6f5babea319.jpeg)

We introduced Druid because running offline metrics on Kylin exposed five main problems:

- Building a Kylin data source takes a long time, while some business parties require the data to be ready by a fixed point in time.
- The data sources occupy a lot of underlying storage.
- Query flexibility is poor; covering different query patterns sometimes requires building several cubes.
- The expansion over the source data is enormous: relative to the size of the source tables' ORC files, the blow-up is striking and can amount to a severe dimension explosion.
- The tuning threshold is relatively high; warehouse engineers need training before they can work with it.

## OLAP Technology Selection Strategy

### 1. Why Druid

![figure](https://static001.geekbang.org/infoq/0a/0a4f9e71424980f8240250c9a15fefb8.jpeg)

Let me first share the selection criteria Beike applied when choosing Druid. The most important part of any selection is knowing what kind of OLAP engine you actually need. Beike had five requirements: first, PB-scale data volume; second, sub-second response; third, fairly high concurrency (in our scenarios the average QPS is around five to six hundred, with peaks up to 2,000); fourth, a flexible query interface that plugs into the QE layer, so the unified SQL query engine can hide the syntax differences between underlying OLAP engines; fifth, fast data import, so that query data sources are generated on schedule and business queries are served on time.

![figure](https://static001.geekbang.org/infoq/3b/3beaa5d15a2d4e6cd908ff5d14c33ed2.jpeg)

Given those five requirements, the candidates were Druid, Kylin, Doris, and ClickHouse. Judged against our needs for high concurrency and exact distinct counts: Druid's concurrency is close to Kylin's, Doris is somewhat better than ClickHouse, and ClickHouse's high-concurrency performance is comparatively weak. Stock Druid supports exact distinct counts at the SQL level, but offline multi-column exact distinct is not supported natively; a community version implements it with bitmaps, borrowing Kylin's dictionary encoding and using an AppendTrie tree. Weighing these factors together with operating costs, we settled on Druid.
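The core idea behind that bitmap-based exact distinct is dictionary encoding: each string is mapped to a stable integer ID, the IDs seen in an aggregation bucket are collected in a bitmap, and the bitmap's cardinality is the exact distinct count. Below is a toy sketch of the principle only; the real extension uses an AppendTrie dictionary and compressed bitmaps, not a Python dict and an int.

```python
# Toy illustration of dictionary-encoded exact distinct counting. A dict
# plays the dictionary and a Python int plays the bitmap.

class Dictionary:
    """Assigns a stable, monotonically increasing ID to each new value."""
    def __init__(self):
        self.ids = {}

    def encode(self, value):
        return self.ids.setdefault(value, len(self.ids))

def exact_distinct(values, dictionary):
    bitmap = 0                      # an int used as a bitset
    for v in values:
        bitmap |= 1 << dictionary.encode(v)
    return bin(bitmap).count("1")   # cardinality = exact distinct count

d = Dictionary()
print(exact_distinct(["a1", "a2", "a1"], d))  # -> 2
```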
### 2. Benchmarking Druid Against Kylin

We ran some initial tests to compare the performance of the two engines.

![figure](https://static001.geekbang.org/infoq/68/68346342376b91de97e18adb53544993.jpeg)

For build time, we selected seven data sources in everyday production use, covering both full and incremental tables, and compared the average import time over roughly the last month with identical compute resources. Druid's import time was clearly shorter, about one third of Kylin's.

![figure](https://static001.geekbang.org/infoq/4a/4a8e1cdb713e53d8fb0a8ea59599257e.jpeg)

For average query latency, we compared the two at around 200 QPS over query windows ranging from a day to a week, month, quarter, and half year. Druid's latencies came out close to Kylin's. In theory Kylin should be faster, because its pre-aggregation goes further (every query condition and measure is effectively pre-computed), so with good tuning its query speed should be the best.

![figure](https://static001.geekbang.org/infoq/02/02585590e215d46ba93d9d4358e7fa70.jpeg)

We also measured HDFS storage usage and the expansion ratio over the source data. Druid's HDFS footprint is clearly smaller than that of Kylin's cubes. In the chart, the blue bars in front are Druid and the remaining colors are Kylin cubes. You may wonder why some data sources have several cubes: to match different query patterns, Kylin pre-aggregates multiple cubes to keep queries fast. As for expansion, relative to the Hive source tables in the ADS layer, Druid's expansion ratio is roughly 1-3x, while Kylin's runs from 18x up to 100x.

### 3. Druid's Architecture

![figure](https://static001.geekbang.org/infoq/0c/0ca8f35a5c0b72d151abeea861239dac.jpeg)

Druid's architecture has four parts: a query serving layer, a data storage layer, a cluster management layer, and a data ingestion layer. The query layer is the broker, which receives query requests from clients. In production we split the storage layer into two tiers: a hot tier and a cold tier. The hot tier generally holds the last half year of data aggregated at day granularity; the cold tier holds data older than half a year (one, two, even five years), aggregated at month granularity. Their storage media differ as well: the hot tier runs on NVMe SSDs, while the cold tier uses ordinary HDDs in RAID 10 to improve I/O performance. The ingestion/execution layer runs the offline and real-time tasks. The cluster management layer consists of the ingestion master, the overlord, and the storage master, the coordinator.
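Hot/cold tiering like this is typically realized by naming a tier in each historical's runtime properties (for example `druid.server.tier=hot`) and publishing load rules through the coordinator's rules API. A minimal sketch follows, assuming stock Druid endpoints; the host, datasource, tier names, and replica counts are illustrative rather than Beike's actual settings.

```python
# Sketch: publish load rules so the last six months of a datasource stay on
# the "hot" tier with two replicas, while everything older loads on the
# default (cold) tier.
import json
import urllib.request

rules = [
    {"type": "loadByPeriod", "period": "P6M",
     "tieredReplicants": {"hot": 2}},
    {"type": "loadForever",
     "tieredReplicants": {"_default_tier": 1}},
]

req = urllib.request.Request(
    "http://coordinator:8081/druid/coordinator/v1/rules/my_datasource",
    data=json.dumps(rules).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req)
```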
## How Druid Is Applied at Beike

### 1. Metric Construction

![figure](https://static001.geekbang.org/infoq/f5/f53a454a82304eb96dcd74e25f4f3a6d.jpeg)

Druid is applied at Beike through the one-stop metrics platform, which integrates four functions: warehouse modeling, metric definition, metric computation, and metric APIs.

![figure](https://static001.geekbang.org/infoq/38/38f9cafbcb13f2c435c5cf2766bdcf57.jpeg)

The flow for creating a Druid metric is as follows. First, the user locates the target OLAP table on the metadata platform. Next they create a model and a cube, concepts borrowed from Kylin's modeling approach. The model specifies the join relationships between the fact table and the dimension tables, and designates the measure and dimension columns. The cube then selects dimensions and measures on top of the model and assigns each measure column an aggregation rule: count distinct, sum, count, avg, and so on. Once the model is created, the platform automatically builds a Hive-to-Druid ingestion job.

![figure](https://static001.geekbang.org/infoq/07/074861667e6c44e39344678a2fbadc6e.jpeg)

Offline metric jobs currently support hourly, daily, weekly, and monthly schedules, plus more complex time expressions (`in` a set of pt date partitions, greater-than-or-equal, less-than-or-equal) that let users re-backfill historical data. Once the job is built, data is loaded into Druid automatically at the scheduled times. Finally, the user creates metrics on the platform, for example the number of customer house showings led by agents, and binds them to the corresponding cube, after which they are ready for use. Dashboard developers can call the metrics directly through the API, and other users can configure dashboards on Odin to query the underlying data.
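Under the hood, a Hive-to-Druid job like this boils down to submitting an ingestion spec to the overlord's task endpoint. Below is a trimmed, hypothetical sketch of such a spec; the datasource, columns, and paths are made up, and field placement varies by Druid version (older `index_hadoop` specs nest the timestamp and dimensions inside a parser/parseSpec and also configure the Parquet input format, omitted here).

```python
# Sketch: build a trimmed index_hadoop-style spec and submit it to the
# overlord. Names and paths are illustrative.
import json
import urllib.request

spec = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "agent_showings",
            "timestampSpec": {"column": "pt", "format": "yyyy-MM-dd"},
            "dimensionsSpec": {"dimensions": ["city", "agent_id"]},
            "metricsSpec": [
                {"type": "count", "name": "rows"},
                {"type": "longSum", "name": "showings", "fieldName": "showings"},
            ],
            "granularitySpec": {
                "segmentGranularity": "DAY",  # one segment interval per day
                "queryGranularity": "DAY",    # pre-aggregate to day level
                "intervals": ["2021-01-01/2021-01-02"],
            },
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static",
                          "paths": "hdfs://nn/staging/agent_showings/pt=2021-01-01"},
        },
    },
}

req = urllib.request.Request(
    "http://overlord:8090/druid/indexer/v1/task",
    data=json.dumps(spec).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read())  # -> {"task": "<task id>"}
```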
### 2. Application Results

![figure](https://static001.geekbang.org/infoq/e4/e4b8d77b3da30d2e01e5b0051683fadc.jpeg)

In query volume, the platform now serves about 60 million queries per day, versus roughly 30 million before April, about double since the start of the year. Kylin and Druid split the load roughly 4:6, with Druid taking 60%.

![figure](https://static001.geekbang.org/infoq/b7/b7cd50092394a8cf886ac6c6febe6257.jpeg)

In build time, constructing a Druid data source takes only about half as long as the Kylin build.

![figure](https://static001.geekbang.org/infoq/fa/faafae3be481f7b599ebb6b7ef6a5a7d.jpeg)

In storage, by an incomplete count Kylin occupied 600 TB of underlying storage in April 2020; after a full migration to Druid we expect roughly one tenth of that, a very substantial saving in underlying storage resources.

![figure](https://static001.geekbang.org/infoq/10/108c30fc393f871e11e14a1803fc58db.jpeg)

In the three-second response rate, Druid stays above 99.9%, while Kylin sits around 99.3%-99.4%, slightly lower as expected, since tuning Kylin cubes is cumbersome.

## Improving Druid for Beike's Business Scenarios

### 1. Overview of the Improvements

![figure](https://static001.geekbang.org/infoq/25/25d5257b2240427577dfc4ec7aa59529.jpeg)

Beike has made several improvements to Druid for its business scenarios. This article covers two areas: how data sources are imported into Druid, and how the Druid cluster is kept stable.
### 2. Optimizing Druid Data Source Imports

**Optimizing index_hadoop imports**

![figure](https://static001.geekbang.org/infoq/73/73f4c0dc2ed02218a779999f00ca3857.jpeg)

The first optimization targets offline, Hadoop-type import jobs.

First, the import process end to end:

① As described earlier, once a cube is built a scheduled job is created automatically; when it fires, it pulls data from the Hive warehouse.

② The data pulled from Hive is written out as Parquet files; once they are ready, the Druid overlord (the master node that drives data loading) is notified that the data is available.

③ The overlord loads the Parquet files from HDFS and starts the Hadoop index job, which runs in three steps. The first is the partition job, which decides how many segments there will be. The second is the dictionary-building job; our build currently uses the community's offline exact-distinct version, and note that this step is triggered only when a count-distinct measure is specified. The third is the index-generation job, which builds inverted and bitmap indexes over the dimension and measure columns.

④ When the Hadoop index job completes, the segments are persisted to deep storage on HDFS; the historicals then pull the files from HDFS and convert them into their own storage format, and the import is finished.

![figure](https://static001.geekbang.org/infoq/5a/5af64562ba68d61be31b75ba5993bc4e.jpeg)

How long such an offline load takes depends mainly on two factors: the size of the source table itself, and how much MapReduce resource the job is given.

Here is an example: an incremental table with 140 million rows, 40 columns, count-distinct and sum measures, and a cardinality of about 6 million. When querying Hive to produce the Parquet files, we repartition in advance into 20 partitions of about 5 million rows each, yielding 20 Parquet files; the partition count determines the number of maps in the partition job. Second, the time column covers only one day, since it is an incremental table and the current pt holds only yesterday's data. Loading that single day into Druid directly would leave the index-generation stage with exactly one reducer and terrible job throughput, so based on experience we set numShards to 5 directly and give each map/reduce task 8 GB of memory. With these steps, import efficiency improved markedly.
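The sketch below shows the two knobs from this example, assuming the staging step is a Spark job (table and path names are made up) and that the ingestion spec pins the shard count with a hashed partitionsSpec, one plausible way to do this in an index_hadoop tuningConfig.

```python
# Sketch (PySpark): repartition the staging data so the number of Parquet
# files -- and hence the number of maps in Druid's partition job -- is fixed
# at 20.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive2druid-staging").getOrCreate()
df = spark.sql("SELECT * FROM ads.agent_showings WHERE pt = '2021-01-01'")
df.repartition(20).write.mode("overwrite").parquet(
    "hdfs://nn/staging/agent_showings/pt=2021-01-01")

# And in the index_hadoop tuningConfig, force 5 shards for the single-day
# interval instead of leaving the reduce stage with one reducer, giving each
# map/reduce task 8 GB along the way:
tuning_config = {
    "type": "hadoop",
    "partitionsSpec": {"type": "hashed", "numShards": 5},
    "jobProperties": {
        "mapreduce.map.memory.mb": "8192",
        "mapreduce.reduce.memory.mb": "8192",
    },
}
```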
![figure](https://static001.geekbang.org/infoq/de/ded63f4720b7a6d813df7c57896b05c6.jpeg)

The chart above compares data source import times over the last seven days before and after the optimization; the optimized time is about one third of the original.

![figure](https://static001.geekbang.org/infoq/0c/0c0b0c6dda1dcdbd9ffd1d31116cfda7.jpeg)

Exact distinct counting with index_hadoop imports still has problems on columns of very high cardinality (50 million, 60 million, even 100 million distinct values). When the index-generator job's map phase pulls the dictionary to build the bitmap arrays, the containers hit full GC. The usual remedy is to raise the map memory, but that cannot be pushed indefinitely; this remains a focus of our future optimization work.

**Adding multi-column real-time exact distinct to Kafka index jobs**

![figure](https://static001.geekbang.org/infoq/cb/cb37afe0529943c825ab5a53f19b2342.jpeg)

The second improvement adds multi-column real-time exact distinct counting to the kafka index job type, driven by business demand for real-time exact distinct statistics on GMV, GSV, and share counts. Stock Druid supports exact distinct in its SQL dialect, but the query performance is modest and only a single column is supported: a statement can run `select count(distinct A) from table1`, but not `select count(distinct A), count(distinct B) from table1`. The community's earlier offline exact-distinct version does not cover real-time scenarios either; it fits only near-real-time use (hourly jobs), not second- or minute-level latency.

![figure](https://static001.geekbang.org/infoq/df/dfa5a8fbedc387ec83ff5106475cf6e3.jpeg)
Beike's solution, drawing on community experience, is a CommonUnique extension. Its implementation has three parts. First, numeric codes are generated with the snowflake algorithm: while the kafka index job runs, a local snowflake-ID service is started inside the process, which keeps ID generation fast. Second, the dictionary is stored in a Redis cluster, where a Redis-based distributed lock guarantees that each string's numeric code is unique. The IDs increase monotonically: when a string arrives, an ID is generated and written to Redis with a set, and if a value already exists for that key, the existing value is simply returned. Third, queries use a 64-bit bitmap for distinct counting, since the numeric codes produced by snowflake IDs are longs, so 64 bits is the appropriate width. In tests over the most recent month of queries, results generally return in under a second.
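A sketch of that get-or-assign dictionary encoding, using the redis-py client: SET NX plays the role of the distributed lock, so the first writer's snowflake ID becomes the permanent code for a string. The `SnowflakeId` class here is a minimal stand-in, not the production generator.

```python
# Sketch of the CommonUnique dictionary encoding described above.
import threading
import time

import redis

class SnowflakeId:
    """Minimal snowflake-style IDs: 41-bit timestamp, 10-bit worker id,
    12-bit sequence. Illustrative only -- it ignores clock rollback and
    sequence exhaustion within a millisecond."""
    def __init__(self, worker_id):
        self.worker_id = worker_id & 0x3FF
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            self.sequence = (self.sequence + 1) & 0xFFF if now == self.last_ms else 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.sequence

r = redis.Redis(host="redis-cluster", port=6379)
gen = SnowflakeId(worker_id=1)

def encode(column, value):
    """Get-or-assign: returns the unique 64-bit code for (column, value)."""
    key = f"dict:{column}:{value}"
    candidate = gen.next_id()
    if r.set(key, candidate, nx=True):   # only succeeds for the first writer
        return candidate
    return int(r.get(key))               # someone assigned a code already
```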
![figure](https://static001.geekbang.org/infoq/aa/aa0df3ee97330efc23c8c061d2faea49.jpeg)

The figure above shows how the CommonUnique metric type is used. At ingestion time, CommonUnique is declared as the metric type in the metricsSpec, where fieldName is the raw column and name is the column to be bitmap-encoded. On the right is a groupBy query example: at query time, CommonUnique is specified in the aggregations, with both name and fieldName pointing at the bitmap-encoded column, which fulfills a select-count-distinct query on that column.

### 3. Measures That Keep the Druid Cluster Stable

**Background**

![figure](https://static001.geekbang.org/infoq/32/32d2358e07210a89e041b60abd938d02.jpeg)

Druid's query peak currently runs from 7 a.m. to noon, with peak Druid QPS around 1,200 and the unified SQL query engine above it peaking around 2,000 QPS.

![figure](https://static001.geekbang.org/infoq/57/578706c8444cecdde4f80959b9f9a7e0.jpeg)

Druid carries the query needs of more than 20 business lines, which Druid's native load-limiting policies alone cannot satisfy. Each business line's queries differ in importance and in SQL complexity, so control has to be fine-grained per business line and per data source; the native controls are too coarse, and simply killing queries on timeout cannot serve all the business lines' needs at peak hours.

![figure](https://static001.geekbang.org/infoq/7b/7bb2b14cf0368fe298697852de757c29.jpeg)

We use three measures to keep the cluster stable:

① query caching

② dynamic throttling

③ HDFS storage optimization

**Query caching**

![figure](https://static001.geekbang.org/infoq/5b/5b681fcc1adc8a322e71091a56889b1d.jpeg)

The caching here is not Druid's own caching (on the broker and historicals) but upper-layer caching of Druid query results: the metric API cache and the unified SQL query engine cache. In practice the metric API cache hits about 30% of requests and the query engine cache about 17%, so the upper layers absorb somewhat over 40% of the load.

![figure](https://static001.geekbang.org/infoq/35/3554545be0741bb915f5ee59b2f900a6.jpeg)

With caching in place, the next question is when to clear the caches so users never read stale data. We clear a cache at the moment the historicals' segment cache is ready. One thing worth noting: the completion time of a Druid hadoop index task does not coincide with the time its segments land on the historicals. The task may be finished while segment loading still takes a long while, depending on the number of historicals and on how many threads each historical devotes to loading segments, so the task's completion time cannot stand in for data readiness. Someone proposed in the community that a task should only complete once its segments have landed on the historicals, but the suggestion was declined: a task that publishes many segments (say two years of data at 700+ segments per run, or even five years for some users) would keep its task slot occupied throughout loading, blocking new ingestion and wasting thread resources. Our approach instead: when a user submits an index hadoop job, we put the task id in a queue; once the task reports success, we compare the successful task's execution time with the version timestamps of the segments on disk. If a segment's version timestamp is newer than the hadoop index job's timestamp, we consider the data landed and only then trigger the cache purge; otherwise the task stays in the queue for the next polling round, and entries past a certain lifetime are cleaned up automatically.
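A sketch of that invalidation check: a task id is queued together with a timestamp recorded at submission, and after the task succeeds, caches are cleared only once a segment version newer than that timestamp has been published. The endpoints follow stock Druid APIs (response field names vary slightly across versions); hosts, the datasource, and the `cache` object are illustrative.

```python
# Sketch: poll task status and segment versions before purging caches.
import json
import urllib.request
from datetime import datetime

OVERLORD = "http://overlord:8090"
COORDINATOR = "http://coordinator:8081"

def get_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def segments_landed(datasource, submitted_at):
    """True once some published segment's version is newer than the task."""
    segs = get_json(f"{COORDINATOR}/druid/coordinator/v1/metadata/"
                    f"datasources/{datasource}/segments?full")
    for seg in segs:
        version = datetime.fromisoformat(seg["version"].replace("Z", "+00:00"))
        if version > submitted_at:
            return True
    return False

def poll_once(queue, cache):
    """queue: list of (task_id, datasource, submitted_at) tuples;
    submitted_at is a timezone-aware datetime recorded at submission."""
    for entry in list(queue):
        task_id, datasource, submitted_at = entry
        status = get_json(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status")
        if status["status"]["status"] != "SUCCESS":
            continue                      # still running or failed; keep it
        if segments_landed(datasource, submitted_at):
            cache.invalidate(datasource)  # safe: the new data is queryable
            queue.remove(entry)
        # otherwise leave it queued for the next polling round
```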
**Dynamic throttling**

![figure](https://static001.geekbang.org/infoq/4b/4b7a5b44bf6eb526959d372496bc6f55.jpeg)

The second measure is dynamic throttling. Druid's native throttling happens at the broker: if the cluster can sustain, say, 400 QPS, anything beyond that is simply rejected. That alone cannot protect a cluster at our scale under our business scenarios, because even queries within the 400-QPS budget can saturate the historicals' CPUs if their SQL is complex enough, hurting the higher-priority business lines. Our policy instead collects the historicals' CPU load through the broker; when load runs high, we throttle step by step, ranked by business-line importance and by which data sources have carried the hottest queries over the last five minutes, protecting the highest-priority business lines from throttling and limiting the lesser ones first. In practice, some business lines' requests are not human-triggered at all: to make their reports render faster, they drive very high QPS programmatically. For those it suffices to guarantee that a request succeeds once, so we can throttle them and execute their queries when CPU load is low, ensuring that high-priority tier-1 reports are produced on time.

![figure](https://static001.geekbang.org/infoq/f0/f0549152bfd49c294b6c4c4ff9dbaf2e.jpeg)

The chart above shows the effect. During the 7-a.m.-to-noon peak there are many spikes: when CPU usage climbs, certain secondary business lines or their data sources get throttled, shaving the peaks and filling the troughs so that high-priority queries are not materially affected.

**Deep storage optimization**

![figure](https://static001.geekbang.org/infoq/b9/b9fe295d0f07799efd8e12972f9425f6.jpeg)

The third measure targets the HDFS deep storage. The platform's Druid deployment carries 300+ data sources and 100,000+ segments, yet on HDFS we observed something striking: the directory count reached 4 million, the file count likewise 4 million, and small files under 20 MB made up over 50%. The main culprit is full tables: a single job run may backfill one or two years of data with segments aggregated by day, so after a month or a quarter of routine runs the HDFS directory count balloons, and too many directories degrade the HDFS namenode. Our optimization has three parts. First, the last half year of data is aggregated at day granularity and older history at month granularity. Second, during query troughs we use Druid's health checks to find data sources whose sharding is unreasonable and trigger compact tasks to merge the excess shards. Note that compaction should generally not run during query peaks: it competes for the cluster's thread resources, interferes with segment loading in particular, and noticeably hurts query performance. Third, for the full tables mentioned above, whose loads span anywhere from one to five years, we keep only the segment versions from the last three days and clean up older accumulated versions.
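Submitting such a merge is a small task post to the overlord. A minimal sketch using the legacy-style compact task spec (newer Druid releases express the interval via an ioConfig instead); hosts, the datasource, and the interval are illustrative.

```python
# Sketch: submit a compaction task off-peak to merge one month of small
# segments flagged by the health check.
import json
import urllib.request

compact_task = {
    "type": "compact",
    "dataSource": "agent_showings",
    "interval": "2019-01-01/2019-02-01",
}

req = urllib.request.Request(
    "http://overlord:8090/druid/indexer/v1/task",
    data=json.dumps(compact_task).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read())  # -> {"task": "<task id>"}
```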
## Future Plans

![figure](https://static001.geekbang.org/infoq/a8/a824f77952d2cb9c345578e0f21c774d.jpeg)

Our plans fall into two areas.

The first is deeper practice with real-time metrics on Druid. Today Druid mainly carries offline metrics; on the real-time side we run 100+ real-time metrics on ClickHouse, still a modest number, and we will gradually shift real-time metric workloads toward Druid. Now that Druid has real-time exact-distinct capability, and given ClickHouse's relatively high operating cost, it is ready to share the real-time business.

The second is further exploration of offline import paths, in three parts:

① For offline jobs, stock Druid offers index hadoop; we plan to try index spark jobs as the import type, which should be considerably faster than MapReduce.

② Try index parallel jobs for small data sources: when the data volume is modest we can skip Hadoop altogether, since merely allocating MapReduce resources takes time.

③ Use Hive for a global dictionary. Exact distinct on high-cardinality columns easily drives the index-generator map phase into full GC, and map memory cannot be tuned up forever, so we hope to follow Kylin 4.0 and keep the dictionary in Hive for exact distinct counting.

**Speaker:**

**Wang Xiao**

Beike | Senior R&D Engineer

Senior R&D engineer at Beike; holds a master's degree from Beijing University of Posts and Telecommunications. Previously at China Telecom and Baidu, he has worked in big data for many years, participating from zero to one in building Baidu's ad-hoc platform PINGO and the one-stop machine learning platform JARVIS, and leading the cloud migration and private delivery of several big data products, including Baidu's commercial product "Luban". He joined Beike in 2019 and currently works on big data OLAP query engine development.

This article is reproduced from: DataFunTalk (ID: dataFunTalk)

Original link: [貝殼基於Druid的OLAP引擎應用實踐](https://mp.weixin.qq.com/s/omSEyuaikf_SI6VzNmFjNA)