A Brief Comparison of Apache Kylin and ClickHouse

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache Kylin and ClickHouse are both popular big data OLAP engines. Kylin was originally developed at eBay's China R&D center, then open-sourced and donated to the Apache Software Foundation in 2014. With sub-second query latency and very high query concurrency, it has been adopted by many large companies, including Meituan, Didi, Ctrip, Beike Zhaofang, Tencent, and 58.com."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse, which has become very popular in the OLAP space over the past couple of years, was developed by the Russian search giant Yandex and open-sourced in 2016. Typical users include well-known companies such as ByteDance, Sina, and Tencent."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do these two OLAP engines differ, what are their respective strengths, and how should you choose between them? "},{"type":"text","marks":[{"type":"strong"}],"text":"This article compares the two engines in terms of technical principles, storage structure, optimization methods, and best-fit scenarios, to offer some reference points for technology selection."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"01 Technical Principles"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For technical principles, we compare two aspects: "},{"type":"text","marks":[{"type":"strong"}],"text":"architecture"},{"type":"text","text":" and "},{"type":"text","marks":[{"type":"strong"}],"text":"ecosystem"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.1 Technical Architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Kylin is a Hadoop-based MOLAP (Multidimensional OLAP) technology whose core is the OLAP Cube"},{"type":"text","text":". Unlike traditional MOLAP technology, Kylin runs on
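Hadoop, a powerful and highly scalable platform, so it can handle massive data volumes (TB to PB). It loads precomputed multidimensional Cubes (built with MapReduce or Spark) into HBase, a low-latency distributed database, which enables sub-second query responses. The more recent Kylin 4 uses "},{"type":"text","marks":[{"type":"strong"}],"text":"Spark + Parquet"},{"type":"text","text":" in place of HBase, which further simplifies the architecture. Because most aggregation has already been done in offline jobs (Cube builds), a SQL query does not need to touch the raw data; it uses indexes to locate the pre-aggregated results and performs a second-stage computation on them, which can be hundreds or even thousands of times faster than scanning the raw data. Because CPU usage is low, Kylin can sustain high concurrency, which makes it a good fit for multi-user, interactive analysis such as self-service analytics and fixed reports."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"ClickHouse is a distributed ROLAP (Relational OLAP) analytics engine based on an MPP architecture"},{"type":"text","text":". All nodes are peers, each processing its own portion of the data (shared nothing). It implements a vectorized execution engine and exploits an LSM-tree-like merge tree, sparse indexes, and CPU SIMD (Single Instruction, Multiple Data) instructions to make full use of the hardware and compute efficiently. As a result, when ClickHouse processes large data volumes it can often push the CPU to its performance limit."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.2 Technical Ecosystem"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kylin is written in Java and is deeply integrated with the Hadoop ecosystem. It uses HDFS for distributed storage; the compute engine can be MapReduce, Spark, or Flink, and the storage engine can be HBase or Parquet (with Spark). Source data can come from Hive, Kafka, RDBMSs, and so on, and multi-node coordination relies on ZooKeeper. Kylin is compatible with Hive metadata and supports only SELECT queries; schema changes must be made in Hive and then synchronized to Kylin. Modeling is done through the Web UI, jobs are scheduled through the REST API, and job progress can be viewed in the Web UI."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse is written in C++ and forms a self-contained system with few third-party dependencies. It supports fairly complete DDL and DML, and most operations can be done from the command line with SQL. A distributed cluster relies on ZooKeeper for coordination, while a single node does not need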
ZooKeeper; most configuration is done by editing configuration files."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"02 Storage"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kylin uses HBase or Parquet from the Hadoop ecosystem for storage, relies on HBase rowkey indexes or Parquet row-group sparse indexes to speed up queries, and uses HBase Region Servers or Spark executors for distributed parallel computation. ClickHouse manages its own data storage; its characteristics include MergeTree as the main storage structure, compressed data blocks, and sparse indexes. The two storage engines are compared in detail below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 Kylin Storage Structure"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kylin precomputes multidimensional Cube data through pre-aggregation. At query time it dynamically selects the best-matching Cuboid (similar to a materialized view) based on the query conditions, which greatly reduces CPU work and I/O."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/6a\/bb\/6a5f5b0de0cb3e3f2ee61af11c678bbb.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During Cube builds, Kylin applies encoding and compression to dimension values, such as dictionary encoding, to minimize storage. Because both the storage engine and the build engine are pluggable, the storage layout differs between storage engines."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"HBase
storage"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With HBase as the storage engine, each dimension is encoded during precomputation so that dimension values have a fixed length. When HFiles are generated, the dimensions of each result row are concatenated into the rowkey and the aggregated values become the value. The order of the dimensions determines the rowkey layout and directly affects query efficiency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c1\/ab\/c10bba682822bcd88b54a3918dc6c3ab.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d5\/98\/d5c01ebbf85a04caa42638a565ed3f98.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Parquet storage engine"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With Parquet as the storage format, dimension values and aggregated values are stored directly, without encoding or rowkey concatenation. Before writing Parquet files, the compute engine sorts the results by dimension; the earlier a dimension appears in the sort order, the more efficient filtering on it becomes. The number of shards within a partition and the number of row groups per Parquet file also affect query efficiency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 ClickHouse
Storage Structure"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When creating a table in ClickHouse, the user typically specifies a partition key. ClickHouse combines data compression with purely columnar storage: with MergeTree, each column is stored separately and compressed in blocks."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/48\/a2\/486bc75900ed9ff0a583e4456d51c0a2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data is always written to disk as parts, and once certain conditions are met, ClickHouse background threads periodically merge these parts."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a2\/f9\/a29c2805780b64cb7a07b3cacf5850f9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As data keeps growing, ClickHouse merges the data within each partition directory to make scans more efficient."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse also
provides a sparse index for each data block. When handling a query, it can use the sparse index to reduce the amount of data scanned and so speed up the query."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/cd\/b6\/cd0f6dc367a461dee570c1e1a2a968b6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"03 Optimization Methods"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kylin and ClickHouse are both big data processing systems. As data volumes keep growing, the right optimizations can greatly reduce query response time and storage footprint. Because the two differ in their compute and storage systems, their optimization methods differ as well. The following subsections look at each in turn."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 Kylin Optimization Methods"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Kylin's core principle is precomputation"},{"type":"text","text":". As described in Section 01, Kylin uses Apache Spark or MapReduce as its compute engine, HBase or Parquet for storage, and Apache Calcite for SQL parsing and post-aggregation computation. "},{"type":"text","marks":[{"type":"strong"}],"text":"Kylin's key contribution is a set of optimization methods that address dimension explosion and excessive data scanning"},{"type":"text","text":". These include aggregation groups, joint dimensions, derived dimensions, dimension table snapshots, rowkey order, and shard-by
columns."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Aggregation groups: prune the Cuboid space so that unnecessary combinations are not precomputed;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Joint dimensions: group dimensions that usually appear together, reducing unnecessary precomputation;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Derived dimensions: mark dimensions that can be computed from other dimensions (for example, year, month, and day can be derived from a date) as derived, reducing unnecessary precomputation;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dimension table snapshots: keep the dimension table in memory and compute on the fly, reducing storage;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dictionary encoding: reduces storage;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rowkey encoding and shard-by columns: speed up queries by reducing the number of rows scanned"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a6\/b0\/a6d9e4e2714yy9a2a5a353c37e79b9b0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 ClickHouse
Optimization Methods"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The most common optimization for MPP systems is splitting data across databases and tables. Similarly, "},{"type":"text","marks":[{"type":"strong"}],"text":"the most common ClickHouse optimizations are partitioning and sharding, and ClickHouse also offers some specialized table engines"},{"type":"text","text":". In summary, these optimization methods include:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Use flat (denormalized) tables instead of multi-table joins, avoiding expensive joins and data shuffling"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Choose appropriate partition keys, sorting keys, and secondary indexes to reduce data scanning"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Build a distributed ClickHouse cluster and add shards and replicas to increase compute resources"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Combine materialized views with precomputation-oriented engines such as SummingMergeTree and AggregatingMergeTree where appropriate"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As performance and concurrency requirements rise, machine resource consumption rises with them. The official ClickHouse documentation suggests keeping query concurrency to no more than 100; when higher concurrency is required, some of ClickHouse's special engines can be used to reduce resource consumption."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The most commonly used special engines are SummingMergeTree and AggregatingMergeTree. Both are derived from MergeTree
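and, in essence, precompute the data that queries need and store it in ClickHouse, which further reduces resource consumption at query time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In principle, SummingMergeTree and AggregatingMergeTree work much like Kylin's Cube. However, when there are many dimensions, maintaining a large number of materialized views is impractical and carries a high management cost. Kylin, by contrast, provides a set of simple, direct optimization methods to avoid dimension explosion."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As shown above, both ClickHouse and Kylin offer ways to reduce storage footprint and the number of rows scanned at query time. It is generally accepted that, with appropriate tuning, both can meet business needs at large data volumes. ClickHouse computes on the fly with MPP while Kylin relies on precomputation, and because they take different technical routes, their best-fit scenarios differ as well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"04 Best-Fit Scenarios"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because it relies on precomputation, Kylin suits aggregate queries with fixed patterns, for example SQL whose join, group by, and where clauses follow relatively stable patterns. The larger the data volume, the more pronounced Kylin's advantage. In particular, "},{"type":"text","marks":[{"type":"strong"}],"text":"Kylin has an especially large advantage for count distinct, Top N, and Percentile, and is widely used for dashboards, reports of all kinds, large-screen displays, traffic statistics, and user behavior analysis"},{"type":"text","text":". Meituan, Jiguang, and Beike Zhaofang have built their data service platforms on Kylin, serving millions to tens of millions of queries per day, with most queries completing within 2 to 3 seconds. For such high-concurrency scenarios there are few better alternatives."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Because of its MPP architecture, ClickHouse has strong on-the-fly computing power, which suits flexible queries or detail-level queries when concurrency is modest"},{"type":"text","text":". Typical scenarios include user tag filtering over very many columns with arbitrary combinations of where conditions, and complex ad hoc queries at low concurrency. When data volume and query traffic are large, you need to deploy a distributed ClickHouse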
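cluster, which makes operations considerably more challenging."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If some queries are highly flexible but run infrequently, computing on the fly saves resources: because there are few queries, the total cost can still be worthwhile even if each one consumes a lot of compute. If queries follow fixed patterns and run in large numbers, Kylin is the better fit: the up-front cost of computing and storing the results is amortized across many queries, which makes it the most economical choice."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"05 Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article compared Kylin and ClickHouse in terms of technical principles, storage structure, optimization methods, and best-fit scenarios."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Technical principles"},{"type":"text","text":": ClickHouse uses an MPP, shared-nothing architecture; queries are flexible and installation, deployment, and operation are simple, but because data is stored locally, scaling out and maintenance are relatively cumbersome. Kylin uses MOLAP precomputation on Hadoop with a shared-storage architecture that separates compute from storage (especially with Parquet storage), so it is better suited to relatively fixed query patterns over very large data volumes. Being Hadoop-based, it integrates easily with existing big data platforms and scales horizontally with ease (especially after moving from HBase to Spark + Parquet)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Storage structure"},{"type":"text","text":": ClickHouse stores detail data, characterized by the MergeTree storage structure and sparse indexes, and aggregate tables can be built on top of the detail data to speed up queries. Kylin uses pre-aggregation with HBase or Parquet storage; its materialized views are transparent to queries, so aggregate queries are very efficient, but detail queries are not supported."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Optimization methods"},{"type":"text","text":": ClickHouse offers partitioning, sharding, and secondary indexes, while Kylin offers aggregation groups, joint dimensions, derived dimensions, hierarchy dimensions, and rowkey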
ordering."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Best-fit scenarios"},{"type":"text","text":": ClickHouse generally suits flexible queries over hundreds of millions to several billion records (larger scales are supported, but cluster operations become harder). Kylin is better suited to relatively fixed query patterns over several billion to more than ten billion records."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below summarizes the comparison across several dimensions:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a7\/10\/a7ea7da802d48b5d844f19b7f48fec10.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Overall, Kylin and ClickHouse each have their own applicable domains and scenarios. No single analytics engine fits every scenario in modern data analysis, so enterprises need to choose the right tool for the problem at hand based on their own business scenarios. We hope this article helps with making a suitable technology choice."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the author"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zhou Yao (周耀), Solutions Architect at Kyligence; Apache Kylin and Apache Superset
contributor."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"This article is reposted from the WeChat official account apachekylin (ID: ApacheKylin)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original link"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzAwODE3ODU5MA==&mid=2653081811&idx=1&sn=a30d9f66cedaa8b466fd56202e9ac1b3&chksm=80a4ae22b7d327345b635cbca42fb865a13e98166e47b3b8c075ee8eb23fddb68c27bb42be03&token=1340822333&lang=zh_CN#rd","title":"","type":null},"content":[{"type":"text","text":"A Brief Comparison of Apache Kylin and ClickHouse"}]}]}]}