After turning to ClickHouse, Vipshop achieves self-service analysis over tens of billions of rows

Hello everyone, I'm Wang Yu, head of the OLAP team on Vipshop's real-time platform. I'm responsible for the open-source modification, cluster optimization and maintenance of OLAP engines such as Presto, ClickHouse, Kylin and Kudu at Vipshop, and for supporting and guiding their use across the business. My topic today is Vipshop's practice of self-service analysis over tens of billions of rows on ClickHouse, in four parts:

- The evolution of OLAP at Vipshop
- How the experimentation platform implements self-service data analysis on Flink and ClickHouse
- Problems and challenges encountered with ClickHouse
- The outlook for ClickHouse at Vipshop

## I. The evolution of OLAP at Vipshop

### 1. How OLAP evolved at Vipshop

![figure](https://static001.geekbang.org/infoq/85/85d24680dfdc49463d861062955782fe.jpeg)

**1) Presto**

Presto is the current workhorse of OLAP at Vipshop and has gone through several evolutions in architecture and usage. Today it runs on more than 500 high-spec physical machines with strong CPUs and plenty of memory, serving over 20 online businesses. Queries peak at around 5 million per day, reading and processing close to 3 PB of data daily, with hundreds of billions of rows scanned. We have also done some work on using and improving Presto, detailed below.

**2) Kylin**

The middle logo in the figure is Kylin, an Apache project open-sourced from China. Kylin complements Presto: when the data volume is too large for Presto to query the raw tables interactively — for example several months of clickstream data — but the queries follow fixed rules, the results can be pre-computed, and Kylin fits that role well.

**3) ClickHouse**

We currently run two clusters, each with roughly 20 to 30 high-spec physical machines, serving the A/B-testing experimentation platform, OLAP log queries, event-tracking monitoring and other projects. ClickHouse is the key to our future development.

Kylin's data volume is much smaller than Presto's; it mainly serves a few specific pre-aggregated queries, such as analysis under fixed supply rules or fixed reports.

### 2. Vipshop's OLAP deployment architecture

![figure](https://static001.geekbang.org/infoq/89/8964d0e1209130de5b7677acbb2e2011.jpeg)

The lower part is the data layer: Hadoop and Spark. For example, importing data from Hive into ClickHouse goes through Waterdrop; older Waterdrop versions ran on Spark, while newer versions can run on Flink, and users choose via configuration.

There are also Kafka and Flink: real-time warehouse data is written into ClickHouse or HBase with Kafka plus Flink, depending on the specific business need and architecture. Above the engine layer sits a proxy layer. For ClickHouse we use chproxy, another open-source project with fairly complete features — user-level limits and permissions, for instance, can all be configured in chproxy.

On top of chproxy we also set up HAProxy as an HA layer, in case chproxy itself runs into problems. Users connect directly to the HAProxy address and port, and we handle the distribution there to guarantee high availability. On the Presto proxy layer we use Nebula, our self-developed OLAP query-routing service tool, formerly called Spider.
Above that sit the externally exposed data services and the various business applications. Parallel to that whole pipeline on the right, the left side corresponds to the monitoring of each component: some of it uses Prometheus, and Presto exposes its own API and JDBC interfaces, which we collect into Open-Falcon for monitoring and alerting. That is the overall OLAP deployment architecture.

### 3. Presto business architecture

![figure](https://static001.geekbang.org/infoq/d6/d6f13f0e56909f2dfbbe477f0f3175e7.jpeg)

The bottom layer exploits Presto's multi-source connectors — in the figure, Hive, Kudu, MySQL and Alluxio. Presto lets you configure catalogs and join several data sources within a single SQL statement, which is very convenient.

The middle layer contains multiple Presto clusters, originally one independent cluster per business. Against each cluster we can deploy multiple Spider instances that diagnose, analyze and record every query, because Presto itself does not persist query history — it lives in memory with an expiry time.

The top layer holds the business applications served by Presto. It started with just two — 魔方 (Magic Cube) and self-service analysis — and has grown to twenty. Presto really is easy to adopt: users just write SQL, the syntax is close to Hive's, there are plenty of UDFs, and access via the client or JDBC is simple, so usage is very heavy.

### 4. Improvements to Presto

![figure](https://static001.geekbang.org/infoq/58/58dcad819e679379a507d86b8f4c72ea.jpeg)

First, we built a Presto management tool, modifying the Presto server and client source code. Our in-house tool Spider/Nebula pulls node and query information from the APIs and system tables that Presto exposes. On one hand, queries are recorded into MySQL and loaded into Hive via ETL jobs for storage and analysis, so that later we can build reports and decide where to optimize.

On the other hand, it scores each cluster from its query count and node status: roughly how many queries the cluster is running, how heavy they are, and how loaded the cluster is. Each new query from the client is then routed to the cluster with the lowest score — the lowest resource usage — which gives us load balancing.

Anyone familiar with Presto knows that HA for the Presto client has to be built yourself, since the official version only provides a single coordinator endpoint. With scoring plus load balancing we achieved HA, so deployments, upgrades and cluster migrations are all invisible to users.

### 5. Presto on containers

![figure](https://static001.geekbang.org/infoq/ce/ce0b3d31c89109a4063b181a918171fd.jpeg)

Moving Presto onto Kubernetes lets us scale clusters elastically, schedule resources sensibly and deploy intelligently, which makes operations much easier — we only need to add containers for Presto to adjust its capacity. Combined with the admission tool above, this forms end-to-end smart routing. With this cloud setup and the management tooling we also borrow resources from the Spark cluster: during the day the resources come back to us for Presto ad hoc queries, and at night we release them through Kubernetes so Spark can run offline jobs. That gives efficient resource utilization.

### 6. Introducing ClickHouse

As the business demands more from OLAP, some scenarios exceed what Presto and Kylin can deliver — for example low-latency real-time joins of tens of billions of rows against tens of billions (local joins), or OLAP workloads at moderate QPS requiring average response times under one second. So we turned our attention to the faster ClickHouse.

The figure below compares ClickHouse with Presto and Kylin, which we already use.

![figure](https://static001.geekbang.org/infoq/a1/a1aa11f66e7ee6009a8f9c801bf2513f.jpeg)

ClickHouse's strengths:

For storage, part of ClickHouse's metadata is kept in ClickHouse itself, while replication paths and similar information live in ZooKeeper. The local file directories also hold the metadata for local tables, distributed tables and so on; if you are interested, it's worth looking at how a ClickHouse data directory is laid out.

Data is stored in local directories according to the configured policy, and in certain query scenarios ClickHouse can be up to 10x faster than Presto. But even with the newer table engines there are still many limitations and problems.

The scenarios we found most practical are joins at the tens-of-billions scale, some fairly complex aggregate computations, and audience computation — anything involving bitmaps.

### 7. ClickHouse's advantages over Presto and other traditional OLAP engines

![figure](https://static001.geekbang.org/infoq/41/417c7219a482fd989cfb3384af39ee17.jpeg)

Wide-table query performance is excellent. Most of our analysis is SQL aggregation over wide tables, and ClickHouse's aggregation times are very small — an order-of-magnitude improvement.

Single-table analysis, and joins aligned on the partitioning, both gain strong performance. For example, joining a table of tens of billions of rows against one of billions — a big-table-to-big-table join — reached roughly 10-second query times on 24C*128G*10 shards (2 replicas) after optimization. With somewhat less data it stays within 1 to 2 seconds: faster queries than Presto, on fewer physical resources.

ClickHouse also brings many efficient algorithms: various cardinality estimators, map computations, and bitmap AND/OR/NOT operations. In many scenarios these are worth digging into; below we briefly introduce the bitmap scenarios we have mastered so far.
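As an illustration of the co-located storage these joins rely on, here is a minimal sketch (all table, cluster and column names are hypothetical) of a local table plus a Distributed table sharded by murmurHash3_64:

```sql
-- Hypothetical names throughout. Rows are routed by murmurHash3_64(mid),
-- so all rows for the same mid land in the same shard's replica group.
CREATE TABLE app.user_flow_local ON CLUSTER my_cluster
(
    dt    Date,
    mid   String,      -- user id used as the join/bucket key
    event String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/user_flow', '{replica}')
PARTITION BY dt
ORDER BY (mid, event);

CREATE TABLE app.user_flow_all ON CLUSTER my_cluster
AS app.user_flow_local
ENGINE = Distributed(my_cluster, app, user_flow_local, murmurHash3_64(mid));
```

Inserting into `user_flow_all` routes each row by the hash automatically; writing to the local tables directly requires the writer (for example a Flink sink) to apply the same routing rule itself.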
## II. How the experimentation platform implements self-service data analysis on Flink and ClickHouse

![figure](https://static001.geekbang.org/infoq/b4/b4a395da68214ccd388abd7b7a4513a7.jpeg)

A lot of technology these days revolves around A/B testing, so we have been paying close attention to these scenarios lately.

### 1. OLAP scenarios on the experimentation platform

![figure](https://static001.geekbang.org/infoq/b5/b5522d4920b41dfb8b70c4cb1c353a03.jpeg)

The figure shows a typical path on the experimentation platform, from the A/B-test logs on the far left down through exposure, click, add-to-cart, favorite, and finally order creation — clearly a top-to-bottom funnel. We implemented a Flink SQL Redis connector supporting Redis sink and source, with a configurable cache for dimension-table lookups, which greatly raises application TPS. Flink SQL builds the real-time data pipeline and finally sinks the wide table into ClickHouse, stored by a murmurhash on a chosen field so that the same user's data always stays within the same replica node group.

We also modified the code of the Flink ClickHouse connector ourselves to support writing to local tables.

### 2. How Flink writes to ClickHouse local tables

![figure](https://static001.geekbang.org/infoq/81/817da985a4a7b0243f3c2e72764cbff2.jpeg)

- Step one: look up ClickHouse's metadata by database and table name — the SQL queries system.tables, ClickHouse's built-in system table — to get the engine information of the table being written.

- Step two: parse the engine information to obtain the cluster name, the local table name, and so on.

- Step three: with the cluster name, query system.clusters — another ClickHouse system table — for the cluster's shard and node information.

Finally, from the configured information, initialize connections by randomly picking a URL within each shard group. The Flink ClickHouse sink connects to the machines through those URLs and, when the configured flush interval fires, sinks a batch of data to the ClickHouse server. (Some parameter tuning is involved here as well.)

That is the complete modified flow for getting data from Kafka through Flink into ClickHouse.

The sequence diagram of the Flink write path is below:

![figure](https://static001.geekbang.org/infoq/85/8599875ee50561475568e706462f2059.jpeg)

### 3. Solving tens-of-billions-row JOINs in ClickHouse

In practice we hit specific scenarios — for example, joining one day of user traffic clicks against the A/B-test logs to match experiments with audiences. That posed a big challenge: the requirement first landed on Presto, where tens of billions JOIN tens of billions simply blew up memory, failing with an "exceeded memory limit" error and returning nothing.
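Steps one and three above can be sketched as two queries against the system tables (database, table and cluster names are hypothetical):

```sql
-- Step 1: fetch the engine definition of the target distributed table.
-- engine_full contains the cluster, database, local-table and sharding-key
-- arguments, which is what step 2 parses out.
SELECT engine, engine_full
FROM system.tables
WHERE database = 'app' AND name = 'user_flow_all';

-- Step 3: resolve the shard/replica endpoints of the cluster found in step 2.
SELECT shard_num, replica_num, host_address, port
FROM system.clusters
WHERE cluster = 'my_cluster';
```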
Joining two large distributed tables also performed very poorly: shrinking the problem to a few billion JOIN a billion, the SQL would return, but only after several minutes — nowhere near ad hoc performance for users.

On top of that, users might want to join in a few more dimension tables, meaning even more data and more trouble. So we looked at how ClickHouse could solve this.

![figure](https://static001.geekbang.org/infoq/d7/d755660f8d92ffeab23d1235e02c7683.jpeg)

- **Bucketing the join key**

Here we applied something like a bucketing concept. First, the join key of both the left and right tables is hashed at table-creation time so that rows land on fixed machine nodes: murmurHash3_64(mid).

If you write through the distributed table, specify the murmurhash on that field as the sharding rule when creating the table; INSERT INTO the distributed table then automatically distributes rows to the correct local tables by that rule. If you write to local tables directly, add the murmurhash to the routing strategy on the Flink write side.

In short, whichever path you write through, the bucketing rule must stay identical. At query time, the distributed table JOINs the local table to get the desired effect: as in the SQL on the right of the figure, the left table is the -all distributed table joined to the -local table on the right — not -all JOIN -all. That achieves the bucketed join.

![figure](https://static001.geekbang.org/infoq/81/814cd251b4230169eea1bb3817b5b748.jpeg)

The bucketed JOIN produces the same result as distributed JOIN distributed — we verified this many times — while each node processes only total data volume / (machine count / replica count). For instance, with 20 machines and 2 replicas there are 10 distinct shards, so looking at the input size, each shard reads the total divided by 10. And that can be shrunk further by scaling out.

So the first step is making the joined data small: the join then runs entirely on each machine without touching the distributed table, and only the final aggregation runs distributed. One important caveat: if the left table of the join is a subquery, the distribution rule does not take effect and the result comes out far smaller than expected — it degenerates to local JOIN local with no distribution guarantee. As in the figure, if the left side is the subquery SELECT * FROM tableA_local, it behaves exactly like SELECT * FROM tableA_all: inside a subquery, -all and -local are treated the same, both as the local table.

That is ClickHouse's own semantics, so we won't judge whether it is right or wrong — but different phrasings really do return different results.

So when using this bucketing trick, remember the left table cannot be a subquery; if it is, 10 shards return only about 1/10 of the true result, which is wrong.

### 4. Incremental update scenarios

![figure](https://static001.geekbang.org/infoq/85/8569885da97a1a65fb3e06c4b7e47cee.jpeg)

Order data needs deduplication, just as when writing to Kudu or MySQL. Traffic data is written in real time, and to join orders with traffic the order data must be deduplicated, because an order has a life cycle and keeps updating after it is created.

We validated four approaches to see how ClickHouse performs:

- ReplacingMergeTree: merges cannot be relied on, so counts fluctuate up and down; it does not meet the need.
- Replicated MergeTree: can deduplicate; suitable as long as the hash field does not change.
- remote tables: complex queries, a performance hit, and replica-reliability concerns, but the results are accurate.
- A Flink-side solution: sidesteps both deduplication and the changing-hash-field problem.

In short, ClickHouse is not an engine built around updates. Use it selectively and choose the approach per requirement; personally I suggest being cautious with updates in ClickHouse.

### 5. Problems and optimizations on the Flink write side

![figure](https://static001.geekbang.org/infoq/ea/eae65a7a04c946b35cc4524b581a69a3.jpeg)

**Note:** this serves two scenarios. Under high write throughput, the batch-size threshold is hit first — say 210,000 records arrive before the timer; at night, event data is sparse, so the 60-second timer triggers the write instead. The 60-second flush is our fallback to keep latency low.

![figure](https://static001.geekbang.org/infoq/b6/b684dda66e51da8af234e09da5870700.jpeg)
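The bucketed distributed-JOIN-local pattern from section 3 above can be sketched as follows (table and column names are hypothetical; both tables must share the murmurHash3_64(mid) sharding rule):

```sql
-- Left side is the distributed (_all) table; right side is the co-located _local
-- table. Each shard joins only its own slice of both tables, and only the final
-- aggregation is distributed.
SELECT a.mid, count() AS clicks
FROM app.user_flow_all AS a
INNER JOIN app.ab_log_local AS b ON a.mid = b.mid
WHERE a.dt = '2021-06-01'
GROUP BY a.mid;
```

Per the caveat in the text, the left side must be the plain _all table, not a subquery, or the bucketing degenerates and returns only roughly 1/shard-count of the true result.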
**Note:** the coalesce null-handling function is added in the Flink SQL, which guarantees the data is complete at sink time. This is reliable and the approach we recommend.

## III. Problems and challenges encountered with ClickHouse

### 1. ClickHouse query optimization

![figure](https://static001.geekbang.org/infoq/3f/3fd1a1b344d5f6a7ec6d09a9102bdbbe.jpeg)

Choose the right engine for the business and the data characteristics — replicas, merges and updates all factor into the engine choice. Picking the ClickHouse table engine well gets twice the result for half the effort, and whether or not you use replicas matters a lot for data stability.

Do the data partitioning well — first-level and second-level partitions. Partitions such as date or type act as a first layer of indexing, and often work better than an index for locating files and data quickly. Strike a balance between the number of partitions and the number of parts per partition: too many parts in one partition degrades scan efficiency.

Based on the SQL patterns, set the ORDER BY sort key, like the one shown in the figure. As anyone doing OLAP knows, reducing the amount of data scanned gives a huge efficiency boost, so this primary index is very important.

Set up the data life cycle with TTL, which ClickHouse supports natively at table-creation time. Here we use the table's own dt — a Date column — plus an interval, giving a 32-day life cycle. TTL saves a lot of trouble: history does not pile up, and older data can be moved off to other storage or cold storage.

Index granularity: the default is 8192, but for tables with few rows and very large columns that default is not appropriate — for example bitmap tables where one row is one tag and the bitmap column can be tens of GB, while the table has only one or two thousand rows. In such cases the value should be set smaller.

### 2. Common parameter tuning

![figure](https://static001.geekbang.org/infoq/80/807838476626ce960eee7526f8873c98.jpeg)

![figure](https://static001.geekbang.org/infoq/96/961e9ecc5e8b20b71c44cfab0f0b74d7.jpeg)

In production practice we gradually changed some of ClickHouse's default parameters.

As shown in the figures, the parameters for query and merge concurrency and memory live in the config or users.xml configuration files; for special scenarios you can set session-level parameters instead of global ones. The current settings can also be inspected in ClickHouse's system tables.

Second are the merge parameters, including the background thread pool size, which is set according to your CPU core count — all tuned against your
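The table-level settings from section 1 above — partitioning, sort key, TTL and index granularity — can be sketched in one DDL (all names and values are illustrative):

```sql
-- Illustrative only: date/type partitions as a first index layer, a sort key
-- matched to typical filters, a 32-day TTL on the Date column, and a smaller
-- index granularity for few-row/large-column tables.
CREATE TABLE app.events_local
(
    dt   Date,
    type String,
    mid  String,
    cnt  UInt64
)
ENGINE = MergeTree
PARTITION BY (dt, type)              -- first- and second-level partitions
ORDER BY (type, mid)                 -- primary index: cuts scanned data
TTL dt + INTERVAL 32 DAY             -- life cycle: parts expire after 32 days
SETTINGS index_granularity = 1024;   -- default 8192; smaller for wide rows
```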
business and machine count; then we run generic test SQL against them. After many rounds of testing we settled on the current settings, whose performance we are now fairly happy with.

### 3. Materialized views

![figure](https://static001.geekbang.org/infoq/39/3920274ffe0b2980df21a656c9fe93f9.jpeg)

No discussion of ClickHouse is complete without materialized views. A ClickHouse materialized view persists a query result, and it genuinely improved our query efficiency. To users it is indistinguishable from a table — a table that is continuously pre-computing. It is created with a special engine plus AS SELECT, much like the CREATE TABLE ... AS SELECT pattern familiar to anyone who writes ETL jobs.

That is how ClickHouse materialized views express the required rules. If historical data also needs to be initialized, add the POPULATE keyword — but beware: while the view is back-filling history from the source table, you must stop writing to the source. Otherwise, data written during the POPULATE back-fill will not be computed into the view — not then, and not ever.

So when back-filling history with POPULATE, do not write to the source table until the materialized view is complete. That is one thing to watch out for.

In using materialized views we also came to appreciate their pros and cons.

- Pros: fast queries. With the rules all written into the view, it is much faster than querying the raw data — the total row count is far smaller, because everything is pre-computed.

- Cons: it is essentially a streaming, append-only technique, so deduplication, retraction or similar analysis over historical data does not work well in a materialized view, and some scenarios are limited. Also, if a table carries many materialized views, writes to it consume a lot of machine resources — you may suddenly find write bandwidth or storage spiking, and then you need to review the views on that table.

## IV. Outlook

![figure](https://static001.geekbang.org/infoq/bc/bc80f8b124a8b9285120bac52b108bd5.jpeg)

### 1. RoaringBitmap retention analysis

We use RoaringBitmap to accelerate funnel and retention analysis in audience-tagging scenarios: bitmaps save a great deal of space and cut storage cost substantially. ClickHouse's built-in bitmap AND/OR/NOT functions (bitmapAnd and the like) can express the business need directly. For example, tag A covers an audience of 100 million as one bitmap, and tag B covers 50 million; to find users carrying both A and B, feed the two bitmaps to bitmapAnd and it returns the estimate immediately, very efficiently — at the millions or tens-of-millions scale it can come back within 10 seconds. Expressing the same thing in Presto means awkward SQL, big tables and huge scan counts.

### 2. Storage-compute separation

We plan to evaluate a suitable distributed storage layer — for example JuiceFS, an excellent file system open-sourced in China.
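A tiny self-contained illustration of the bitmap intersection described above (with literal arrays standing in for real tag audiences; in practice the bitmaps would come from groupBitmapState over a tag table):

```sql
-- Intersect two audiences and count the overlap.
-- bitmapBuild turns an array into a RoaringBitmap; bitmapAnd intersects them.
SELECT bitmapCardinality(
           bitmapAnd(bitmapBuild([1, 2, 3, 5, 8]),   -- audience of tag A
                     bitmapBuild([2, 3, 4, 8, 9]))   -- audience of tag B
       ) AS overlap_count;   -- 3 users (ids 2, 3 and 8) carry both tags
```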
There is also S3: a while ago ClickHouse officially published a practice using Amazon S3 as the underlying distributed storage. The point of such a distributed storage layer is to stop having to expand the ClickHouse cluster frequently just for storage.

Because ClickHouse storage is shard-local, scaling out is cumbersome: old data has to be moved over. And since the old and new data follow different rules, and the machine count differs before and after expansion, the murmurhash results change too, so rows map to different places in the cluster.

With storage-compute separation on cloud storage, the number of machines we need stays the same, yet the problem above is solved.

### 3. Governance tooling

We are still in the promotion phase for ClickHouse, with little control over how consumers use it and not much governance of storage, compute or query roles. Data security is paramount in big data, and we will gradually complete this area in our upcoming work.

### 4. Access control

Newer ClickHouse versions include RBAC access control, which is also the officially recommended approach. We want to apply this authorization scheme to our own ClickHouse clusters.

### 5. Resource control

On the resource side, combined with storage-compute separation, we will assign different users to different businesses; with each user's storage requested through the cloud platform, we can then do cost accounting on each user's storage.

These five points are where we plan to keep investing around ClickHouse. On bitmaps, we are already collaborating deeply with the advertising, audience and experimentation platform teams — more to share on that another time.

**Q1: What is the latency of data ingestion into ClickHouse? Can it reach second-level? At second-level latency, what is a single node's ingestion throughput?**

**A1:** It depends on how you write to ClickHouse. Inserting directly through the native port can be second-level, but the main cost is in merges. It comes down to the use case: for big-data scenarios with massive event streams, querying freshly written data at second-level freshness is not a good fit; but if you treat it as a modest store — writing at the million-row scale, where merges barely matter — second-level latency is achievable.

Single-node throughput: that depends on whether you mean row count or byte volume. Writing with the Flink sink, our peak is roughly 600,000 TPS across 10 machines, about 60,000-70,000 TPS per machine. This depends on machine bandwidth, CPU, IO, and whether the underlying storage is SSD.

**Q2: In OLAP scenarios, what methods are there to optimize storage space and access efficiency?**

**A2:** First it depends on which engine and storage your OLAP scenario uses. We mostly use Presto, mainly with compression and HyperLogLog. Both Spark and Presto have HyperLogLog, but they are not interoperable, so we reimplemented Presto's HyperLogLog classes in Spark: write with Spark, then compute the HyperLogLog in Presto. That is essentially pre-computation — storing pre-computed results for the target scenarios — to optimize storage space. As for access efficiency, the first factor is the amount of data scanned, and the second is SQL optimization.

We mainly use the SQL's EXPLAIN output to see where it is slow: in reading data — Presto's SOURCE stage splits into schedule and plan — or in the HMS step. Recently we separated HMS reads from writes: ETL writes go to the primary HMS, reads go to the replica HMS, so writes no longer hurt read efficiency and ad hoc queries run fast. This question is really broad, and every detail is worth digging into.

**Q3: What are the differences between Kudu and ClickHouse?**

**A3:** ClickHouse ships with its own very powerful query engine, while Kudu, as far as I know, is queried through engines like Impala or Presto. ClickHouse is still a shard-local storage model: the distributed-table concept pulls the single-machine data together for final aggregation, relying mostly on single-machine compute. Kudu leans toward storage performance and real-time warehousing — update and upsert. For audience bitmap scenarios, only ClickHouse works for us, with its storage, its algorithms and its bitmap function syntax; Kudu is not suitable there. We choose the architecture per business scenario.

**Q4: How do ClickHouse and Flink guarantee consistency?**

**A4:** In our setup, Kafka retains three days of data, and offsets are saved automatically. We run one real-time pipeline and one offline pipeline; we monitor the offline five-minute tables and run data-quality checks against ClickHouse, investigating wherever the two differ. On the ClickHouse side we can also drop a partition's data and rebuild it. So far the data has been quite accurate. Most inaccuracies came from Waterdrop, but with Flink SQL writes, so far Flink
Connector的性能和準確性還是夠的。另外我們已經做好了跟離線活動表的JOIN,這個也可以用來判斷。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q5:請問Kudu在唯品會還有使用場景麼?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A5:"},{"type":"text","text":"我們Kudu主要是做了訂單數據,打點數據曝光數據很少上去,只有一定的曝光數據上去,我們Kudu的話機器不是特別多,只有不到20臺,三個Kudu master,其他都是TABLET,主要是把數據通過Flink落到Kudu裏面,然後用Presto來查,用Presto把kudu跟一些其他的Hive或MySQL的一些維表來JOIN,基本上就是訂單和很少的一部分曝光場景在用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q6:請問存算分離有什麼方案?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A6: "},{"type":"text","text":"存分離的話,現在用Juice FS非常合適,大家可以去看一下,這個是國內的開源的項目,他們也非常願意跟ClickHouse來做這件事情。第一個情況就是把這種分佈式的存儲,直接當成本地存儲的掛載,通過這種方式來實現。第二個就是更深度定製,就是通過寫入文件流或者是那種outsteam流去改一些CK的源碼,來對接FS,直接把Juice FS這種文件系統的FS類拿過來,直接在ClickHouse源碼裏面去用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果剛開始不是深度使用的話,可以用第一種掛載方案來試一下。第二,如果有源碼修改的能力,有意願的公司可以去試一下。主要是在FS、Stream這種文件流的寫入和讀取這一塊要改一下源碼。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q7:請問ClickHouse 和Kylin 如何選擇?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A7: 
"},{"type":"text","text":"ClickHouse 和Kylin 差別還是比較大的。首先kylin是通過Hive預計算以後,把數據放到Hbase裏面,在最新的Kylin 4.0的話,它是用Spark去做一些事情。ClickHouse它本身是用C寫的,所以說他整個在CPU的SIMD這個層面,是有優勢的。而且ClickHouse數據就存在那裏,你會用不同的規則,不同的匯聚條件,更多的是根據更靈活的條件做查詢adhoc的場景。像Kylin的話,我們更多的願意把它作爲固定報表或者預計算的場景,兩個的使用方式可能還是不太一樣的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q8:請問多表關聯性能如何?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A8: "},{"type":"text","text":"剛纔我們看到很多表,如果你是用分佈式表JOIN分佈式表,不用我們這種分桶方案,它的JOIN性能是不行的,因爲他的理論是把所有分佈式表的數據,拿到一臺單機上來做JOIN,這個效果是不好的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是像我跟大家分享的,我就是通過一些具體的主鍵ID或者其他東西,把它進行分桶,就是a表、b表、c表三個表JOIN,我們把三個表的某一個表的user這個字段都是按照一定規則落到這臺機器,本地做一個JOIN,10臺機器都把它本地化做起來,最後再通過整個分佈式表,把他們放在一起。這樣子就形成了一個多表JOIN,就是剛纔我分享的,大家可以看一看PPT裏面說用分桶來做這種JOIN,如果是分桶不了的話,你多表分佈式JOIN的話,性能是比較一般的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q9:請問ClickHouse可不可以解讀Kafka?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A9: "},{"type":"text","text":"可以,但是我們Flink裏面還做了很多事情,比如數據處理,我們是通過Flink 
SQL做的,包括關聯維表,關聯Redis維表這種你直接鏈kafka引擎是做不了的事情。我們是有這種需求,就會去做這樣的事情,能在Flink裏面實現實時維表的這些功能的部分就在Flink裏面做掉,減少ClickHouse這一端更多的關於維表關聯和其他的工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q10:請問CK的併發不好,如何解決2c場景?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A10: "},{"type":"text","text":"目前解決方式就是加機器,用不同的chproxy連到不同的機器上,然後再用不同的chproxy去扛,chproxy最終是會平均打到這麼多臺機器上,所以你機器越多,它併發的性能越好。但是它有一個瓶頸,就是如果2c的話,看你們公司的整個場景是如何的,如果是像淘寶這樣的2c,我覺得是不太合適的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q11:請問Doris有什麼缺點,才採用ClickHouse替換?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A11: "},{"type":"text","text":"目前解決方式就是加機器,用不同的chproxy連到不同的機器上,然後再用不同的chproxy去扛,chproxy最終是會平均打到這麼多臺機器上,所以你機器越多,它併發的性能越好。但是它有一個瓶頸,就是如果2c的話,看你們公司的整個場景是如何的,如果是像淘寶這樣的2c,我覺得是不太合適的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q12:請問Doris有什麼缺點,才採用CK替換?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A12: "},{"type":"text","text":"我們替換並不是說Doris有什麼缺點,更多的是ClickHouse有優點。像Flink這種維表的場景,在ClickHouse裏面用大寬表JOIN的場景,包括後來Bit map場景,我們是根據場景需要用到ClickHouse,而且也不太想多維護的原因才用ClickHouse替換,因爲Doris是能實現的ClickHouse也能夠實現,比如指標等。我們現在的監控日誌用ClickHouse也能做,像有些ClickHouse能做,但Doris做不了,比如Bit 
map。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Doris本身引擎沒有什麼問題,是因爲有更好更實用的場景,ClickHouse能做更多的事情,所以我們也是基於這些考慮加過去ClickHouse,把原來的一些東西都寫到ClickHouse上去也可以減少維護成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"嘉賓介紹:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"王玉,"},{"type":"text","text":"唯品會實時平臺OLAP團隊負責人。負責唯品會Presto、ClickHouse、Kylin、Kudu等OLAP組件的開源修改、組件優化和維護,業務使用方案支持和指引等工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自:dbaplus社羣(ID:dbaplus)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/JNUPz9G7cxNxTVUwvIi1nQ","title":"xxx","type":null},"content":[{"type":"text","text":"唯品會翻牌ClickHouse後,實現百億級數據自助分析"}]}]}]}
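The tag-intersection idea from the RoaringBitmap section can be sketched as follows. This is a minimal illustration, not the production setup: plain Python sets stand in for RoaringBitmap, and `bitmap_and_cardinality` is a hypothetical analogue of ClickHouse's `bitmapAndCardinality` function, which does the AND-plus-count natively on compressed bitmaps.

```python
# Toy sketch of tag-audience intersection (sets stand in for RoaringBitmap).

def bitmap_and_cardinality(a: set, b: set) -> int:
    """Count users carrying both tags: the AND of the two bitmaps."""
    return len(a & b)

# Tag A: a large audience; tag B: a smaller one (toy sizes, not 100M / 50M).
tag_a = set(range(0, 1_000_000))        # user ids 0 .. 999_999
tag_b = set(range(500_000, 1_500_000))  # user ids 500_000 .. 1_499_999

overlap = bitmap_and_cardinality(tag_a, tag_b)
print(overlap)  # → 500000, users satisfying both A and B
```

The business query ("users with tag A AND tag B") reduces to one set operation over two bitmaps, which is why it beats scanning wide tables row by row.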
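The bucketed multi-table JOIN described in the Q&A (route each table's user field by the same rule to the same machine, JOIN locally, then concatenate) can be sketched like this. It is a toy model under stated assumptions: 10 shards, and `zlib.crc32` standing in for the murmurHash routing mentioned in the talk. The point is only that identical keys land on identical shards for every table, so each shard can JOIN its own slice.

```python
import zlib

NUM_SHARDS = 10  # one bucket per node, mirroring the talk's 10-machine setup

def shard_for(user_id: str) -> int:
    # Stand-in for murmurHash-based routing: same key -> same shard, always.
    return zlib.crc32(user_id.encode()) % NUM_SHARDS

# Rows of tables a and b keyed by user id; co-locate them before joining.
table_a = [("u1", "order#1"), ("u2", "order#2"), ("u3", "order#3")]
table_b = [("u1", "click#9"), ("u3", "click#7")]

shards_a = [[] for _ in range(NUM_SHARDS)]
shards_b = [[] for _ in range(NUM_SHARDS)]
for uid, row in table_a:
    shards_a[shard_for(uid)].append((uid, row))
for uid, row in table_b:
    shards_b[shard_for(uid)].append((uid, row))

# Each shard joins only its own slice; the "distributed table" step is
# nothing more than concatenating the per-shard results.
joined = []
for i in range(NUM_SHARDS):
    b_by_uid = {uid: row for uid, row in shards_b[i]}
    for uid, row in shards_a[i]:
        if uid in b_by_uid:
            joined.append((uid, row, b_by_uid[uid]))

print(sorted(joined))  # → [('u1', 'order#1', 'click#9'), ('u3', 'order#3', 'click#7')]
```

Because both tables use the same routing function, no matching pair is ever split across shards, which is exactly why the scheme avoids shipping whole tables to one node.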