ClickHouse在大數據領域企業級應用實踐和探索總結

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"ClickHouse簡介","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2020年下半年在OLAP領域有一匹黑馬以席捲之勢進入大數據開發者的領域,它就是ClickHouse。在2019年小編也曾介紹過ClickHouse,大家可以參考這裏進行入門:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?_biz=MzU3MzgwNTU2Mg==&mid=2247492821&idx=1&sn=6b27fb340afc633b6152cb9e022db611&chksm=fd3ea240ca492b56a34ee8a140528a6b3a2e6c82f164b1b622dbbbb8b54d89aa3d401283346f&token=1901516241&lang=zhCN#rd","title":""},"content":[{"type":"text","text":"來自俄羅斯的兇猛彪悍的分析數據庫-ClickHouse","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?_biz=MzU3MzgwNTU2Mg==&mid=2247492803&idx=1&sn=04283aa0913674e158a8519120fc14cd&chksm=fd3ea256ca492b40f60ac90eee4d23f9c2b8156a9c83439297317efcb826b32779b762f554e4&token=1901516241&lang=zhCN#rd","title":""},"content":[{"type":"text","text":"基於ClickHouse的用戶行爲分析實踐","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?_biz=MzU3MzgwNTU2Mg==&mid=2247489470&idx=1&sn=734886246f2413f72d605cad726b59f2&chksm=fd3d512bca4ad83d516e5d6f0dd4c21c83867de758238277a162fecad0ec21a9eff9a3a59c23&token=1901516241&lang=zhCN#rd","title":""},"content":[{"type":"text","text":"Prometheus+Clickhouse實現業務告警","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼我們有必要先從全局瞭解一下ClickHouse到底是個什麼樣的數據庫?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse是一個開源的,面向列的分析數據庫,由Yandex爲OLAP和大數據用例創建。 ClickHouse對實時查詢處理的支持使其適用於需要亞秒級分析結果的應用程序。 ClickHouse的查詢語言是SQL的一種方言,它支持強大的聲明性查詢功能,同時爲最終用戶提供熟悉度和較小的學習曲線。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse全稱是Click Stream,Data Warehouse,簡稱ClickHouse就是基於頁面的點擊事件流,面向數據倉庫進行OLAP分析。ClickHouse是一款開源的數據分析數據庫,由戰鬥民族俄羅斯Yandex公司研發的,Yandex是做搜索引擎的,就類似與Google,百度等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們都知道搜索引擎的營收主要來源與流量和廣告業務,所以搜索引擎公司會着重分析用戶網路流量,像Google有Anlytics,百度有百度統計,那麼Yandex就對應於Yandex.Metrica。ClickHouse就式在Yandex.Metrica下產生的技術。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"面向列的數據庫將記錄存儲在按列而不是行分組的塊中。通過不加載查詢中不存在的列的數據,面向列的數據庫在完成查詢時花費的時間更少。因此,對於某些工作負載(如OLAP),這些數據庫可以比傳統的基於行的系統更快地計算和返回結果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"爲什麼ClickHouse能夠異軍突起","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"優異的性能","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據官網的介紹(https://clickhouse.tech/benchmark/dbms/),ClickHouse在相同的服務器配置與數據量下,平均響應速度:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Vertica的2.63倍(Vertica是一款收費的列式存儲數據庫)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"InfiniDB的17倍(可伸縮的分析數據庫引擎,基於Mysql搭建)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MonetDB的27倍(開源的列式數據庫)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hive的126倍","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MySQL的429倍","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Greenplum的10倍","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark的1倍","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"性能是衡量 OLAP 數據庫的關鍵指標,我們可以通過 ClickHouse 官方測試結果感受下 ClickHouse 的極致性能,其中綠色代表性能最佳,紅色代表性能較差,紅色越深代表性能越弱。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/86/8645bd397e3e452f1979c34834dc9a52.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在之前用了一篇文章","attrs":{}},{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?_biz=MzU3MzgwNTU2Mg==&mid=2247493466&idx=1&sn=bc5bddf48e3b8227dc2796ff883062dc&chksm=fd3ea1cfca4928d918cbf978dad994f6f53e5ea54cd0cbbd62e983d541f8db9ee3808f22562c&token=1901516241&lang=zhCN#rd","title":""},"content":[{"type":"text","text":"《數據告訴你ClickHouse有多快》","attrs":{}}]},{"type":"text","text":" 詳細講解了 ClickHouse的優異性能表現。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"優勢和侷限性","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse主要特點:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ROLAP(關係型的聯機分析處理,和它一起比較的還有OLTP聯機事務處理,我們常見的ERP,CRM系統就屬於OLTP)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在線實時查詢","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"完整的DBMS(關係數據庫)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"列式存儲(區別與HBase,ClickHouse的是完全列式存儲,HBase具體說是列族式存儲)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不需要任何數據預處理","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持批量更新","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"擁有完善的SQl支持和函數","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持高可用(多主結構,在後面的結構設計中會講到)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不依賴Hadoop複雜生態(像ES一樣,開箱即用)","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一些不足:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不支持事務(這其實也是大部分OLAP數據庫的缺點)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不擅長根據主鍵按行粒度查詢(但是支持這種操作)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不擅長按行刪除數據(但是支持這種操作)","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"核心概念和原理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse 採用了典型的分組式的分佈式架構,集羣架構如下圖所示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8e/8ed1ced89452b5490ab41e3d3eb85314.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這其中的角色包括:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Shard :集羣內劃分爲多個分片或分組(Shard 0 … Shard N),通過 Shard 的線性擴展能力,支持海量數據的分佈式存儲計算。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Node :每個 Shard 內包含一定數量的節點(Node,即進程),同一 Shard 內的節點互爲副本,保障數據可靠。ClickHouse 中副本數可按需建設,且邏輯上不同 Shard 內的副本數可不同。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ZooKeeper Service :集羣所有節點對等,節點間通過 ZooKeeper 服務進行分佈式協調。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse基礎架構如下圖所示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/35/3551b2bc38b19a46436a817abc83316c.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Column與Field","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Column和Field是ClickHouse數據最基礎的映射單元。內存中的一列數據由一個Column對象表示。Column對象分爲接口和實現兩個部分,在IColumn接口對象中,定義了對數據進行各種關係運算的方法。在大多數場合,ClickHouse都會以整列的方式操作數據,但凡事也有例外。如果需要操作單個具體的數值 ( 也就是單列中的一行數據 ),則需要使用Field對象,Field對象代表一個單值。與Column對象的泛化設計思路不同,Field對象使用了聚合的設計模式。在Field對象內部聚合了Null、UInt64、String和Array等13種數據類型及相應的處理邏輯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DataType","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據的序列化和反序列化工作由DataType負責。IDataType接口定義了許多正反序列化的方法,它們成對出現。IDataType也使用了泛化的設計模式,具體方法的實現邏輯由對應數據類型的實例承載。DataType雖然負責序列化相關工作,但它並不直接負責數據的讀取,而是轉由從Column或Field對象獲取。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Block與Block流","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse內部的數據操作是面向Block對象進行的,並且採用了流的形式。Block對象可以看作數據表的子集。Block對象的本質是由數據對象、數據類型和列名稱組成的三元組,即Column、DataType及列名稱字符串。僅通過Block對象就能完成一系列的數據操作。Block並沒有直接聚合Column和DataType對象,而是通過ColumnWithTypeAndName對象進行間接引用。Block流操作有兩組頂層接口:IBlockInputStream負責數據的讀取和關係運算,IBlockOutputStream負責將數據輸出到下一環節。IBlockInputStream接口定義了讀取數據的若干個read虛方法,而具體的實現邏輯則交由它的實現類來填充。IBlockInputStream接口總共有60多個實現類,這些實現類大致可以分爲三類:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一類用於處理數據定義的DDL操作","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二類用於處理關係運算的相關操作","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三類則是與表引擎呼應,每一種表引擎都擁有與之對應的BlockInputStream實現","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"IBlockOutputStream的設計與IBlockInputStream如出一轍。這些實現類基本用於表引擎的相關處理,負責將數據寫入下一環節或者最終目的地。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Table","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在數據表的底層設計中並沒有所謂的Table對象,它直接使用IStorage接口指代數據表。表引擎是ClickHouse的一個顯著特性,不同的表引擎由不同的子類實現。IStorage接口負責數據的定義、查詢與寫入。IStorage負責根據AST查詢語句的指示要求,返回指定列的原始數據。後續的加工、計算和過濾則由下面介紹的部分進行。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parser與Interpreter","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parser分析器負責創建AST對象;而Interpreter解釋器則負責解釋AST,並進一步創建查詢的執行管道。它們與IStorage一起,串聯起了整個數據查詢的過程。Parser分析器可以將一條SQL語句以遞歸下降的方法解析成AST語法樹的形式。不同的SQL語句,會經由不同的Parser實現類解析。Interpreter解釋器的作用就像Service服務層一樣,起到串聯整個查詢過程的作用,它會根據解釋器的類型,聚合它所需要的資源。首先它會解析AST對象;然後執行\"業務邏輯\" ( 例如分支判斷、設置參數、調用接口等 );最終返回IBlock對象,以線程的形式建立起一個查詢執行管道。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Functions 與Aggregate Functions","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse主要提供兩類函數—普通函數(Functions)和聚合函數(Aggregate Functions)。普通函數由IFunction接口定義,擁有數十種函數實現,採用向量化的方式直接作用於一整列數據。聚合函數由IAggregateFunction接口定義,相比無狀態的普通函數,聚合函數是有狀態的。以COUNT聚合函數爲例,其AggregateFunctionCount的狀態使用整型UInt64記錄。聚合函數的狀態支持序列化與反序列化,所以能夠在分佈式節點之間進行傳輸,以實現增量計算。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cluster與Replication","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse的集羣由分片 ( Shard ) 組成,而每個分片又通過副本 ( Replica ) 組成。這種分層的概念,在一些流行的分佈式系統中十分普遍。這裏有幾個與衆不同的特性。ClickHouse的1個節點只能擁有1個分片,也就是說如果要實現1分片、1副本,則至少需要部署2個服務節點。分片只是一個邏輯概念,其物理承載還是由副本承擔的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"ClickHouse的核心特性","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse的特性有很多,一款被認可的OLAP引擎能夠得到大家的頻繁使用一定是有獨特的特性,小編列舉了幾個ClickHouse異於常人的特性:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"列式存儲&數據壓縮","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"按列存儲與按行存儲相比,前者可以有效減少查詢時所需掃描的數據量,這一點可以用一個示例簡單說明。假設一張數據表A擁有50個字段A1~A50,以及100行數據。現在需要查詢前5個字段並進行數據分析,那麼通過列存儲,我們僅需讀取必要的列數據,相比於普通行存,可減少 10 倍左右的讀取、解壓、處理等開銷,對性能會有質的影響。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果數據按行存儲,數據庫首先會逐行掃描,並獲取每行數據的所有50個字段,再從每一行數據中返回A1~A5這5個字段。不難發現,儘管只需要前面的5個字段,但由於數據是按行進行組織的,實際上還是掃描了所有的字段。如果數據按列存儲,就不會發生這樣的問題。由於數據按列組織,數據庫可以直接獲取A1~A5這5列的數據,從而避免了多餘的數據掃描。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"按列存儲相比按行存儲的另一個優勢是對數據壓縮的友好性。ClickHouse的數據按列進行組織,屬於同一列的數據會被保存在一起,列與列之間也會由不同的文件分別保存 ( 這裏主要指MergeTree表引擎 )。數據默認使用LZ4算法壓縮,在Yandex.Metrica的生產環境中,數據總體的壓縮比可以達到8:1 ( 未壓縮前17PB,壓縮後2PB )。列式存儲除了降低IO和存儲的壓力之外,還爲向量化執行做好了鋪墊。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"向量化執行","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"坊間有句玩笑,即\"能用錢解決的問題,千萬別花時間\"。而業界也有種調侃如出一轍,即\"能升級硬件解決的問題,千萬別優化程序\"。有時候,你千辛萬苦優化程序邏輯帶來的性能提升,還不如直接升級硬件來得簡單直接。這雖然只是一句玩笑不能當真,但硬件層面的優化確實是最直接、最高效的提升途徑之一。向量化執行就是這種方式的典型代表,這項寄存器硬件層面的特性,爲上層應用程序的性能帶來了指數級的提升。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"向量化執行,可以簡單地看作一項消除程序中循環的優化。這裏用一個形象的例子比喻。小胡經營了一家果汁店,雖然店裏的鮮榨蘋果汁深受大家喜愛,但客戶總是抱怨製作果汁的速度太慢。小胡的店裏只有一臺榨汁機,每次他都會從籃子裏拿出一個蘋果,放到榨汁機內等待出汁。如果有8個客戶,每個客戶都點了一杯蘋果汁,那麼小胡需要重複循環8次上述的榨汁流程,才能榨出8杯蘋果汁。如果製作一杯果汁需要5分鐘,那麼全部製作完畢則需要40分鐘。爲了提升果汁的製作速度,小胡想出了一個辦法。他將榨汁機的數量從1臺增加到了8臺,這麼一來,他就可以從籃子裏一次性拿出8個蘋果,分別放入8臺榨汁機同時榨汁。此時,小胡只需要5分鐘就能夠製作出8杯蘋果汁。爲了製作n杯果汁,非向量化執行的方式是用1臺榨汁機重複循環制作n次,而向量化執行的方式是用n臺榨汁機只執行1次。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了實現向量化執行,需要利用CPU的SIMD指令。SIMD的全稱是Single Instruction Multiple Data,即用單條指令操作多條數據。現代計算機系統概念中,它是通過數據並行以提高性能的一種實現方式 ( 其他的還有指令級並行和線程級並行 ),它的原理是在CPU寄存器層面實現數據的並行操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在計算機系統的體系結構中,存儲系統是一種層次結構。典型服務器計算機的存儲層次結構如圖1所示。一個實用的經驗告訴我們,存儲媒介距離CPU越近,則訪問數據的速度越快。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d5/d509664f924a7e6e807d0f529e63c86a.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從上圖中可以看到,從左向右,距離CPU越遠,則數據的訪問速度越慢。從寄存器中訪問數據的速度,是從內存訪問數據速度的300倍,是從磁盤中訪問數據速度的3000萬倍。所以利用CPU向量化執行的特性,對於程序的性能提升意義非凡。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse目前利用SSE4.2指令集實現向量化執行。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"關係模型與SQL查詢","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比HBase和Redis這類NoSQL數據庫,ClickHouse使用關係模型描述數據並提供了傳統數據庫的概念 ( 數據庫、表、視圖和函數等 )。與此同時,ClickHouse完全使用SQL作爲查詢語言 ( 支持GROUP BY、ORDER BY、JOIN、IN等大部分標準SQL ),這使得它平易近人,容易理解和學習。因爲關係型數據庫和SQL語言,可以說是軟件領域發展至今應用最爲廣泛的技術之一,擁有極高的\"羣衆基礎\"。也正因爲ClickHouse提供了標準協議的SQL查詢接口,使得現有的第三方分析可視化系統可以輕鬆與它集成對接。在SQL解析方面,ClickHouse是大小寫敏感的,這意味着SELECT a 和 SELECT A所代表的語義是不同的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關係模型相比文檔和鍵值對等其他模型,擁有更好的描述能力,也能夠更加清晰地表述實體間的關係。更重要的是,在OLAP領域,已有的大量數據建模工作都是基於關係模型展開的 ( 星型模型、雪花模型乃至寬表模型 )。ClickHouse使用了關係模型,所以將構建在傳統關係型數據庫或數據倉庫之上的系統遷移到ClickHouse的成本會變得更低,可以直接沿用之前的經驗成果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"多樣化的表引擎","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"也許因爲Yandex.Metrica的最初架構是基於MySQL實現的,所以在ClickHouse的設計中,能夠察覺到一些MySQL的影子,表引擎的設計就是其中之一。與MySQL類似,ClickHouse也將存儲部分進行了抽象,把存儲引擎作爲一層獨立的接口。截至本書完稿時,ClickHouse共擁有合併樹、內存、文件、接口和其他6大類20多種表引擎。其中每一種表引擎都有着各自的特點,用戶可以根據實際業務場景的要求,選擇合適的表引擎使用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常而言,一個通用系統意味着更廣泛的適用性,能夠適應更多的場景。但通用的另一種解釋是平庸,因爲它無法在所有場景內都做到極致。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在軟件的世界中,並不會存在一個能夠適用任何場景的通用系統,爲了突出某項特性,勢必會在別處有所取捨。其實世間萬物都遵循着這樣的道理,就像信天翁和蜂鳥,雖然都屬於鳥類,但它們各自的特點卻鑄就了完全不同的體貌特徵。信天翁擅長遠距離飛行,環繞地球一週只需要1至2個月的時間。因爲它能夠長時間處於滑行狀態,5天才需要扇動一次翅膀,心率能夠保持在每分鐘100至200次之間。而蜂鳥能夠垂直懸停飛行,每秒可以揮動翅膀70~100次,飛行時的心率能夠達到每分鐘1000次。如果用數據庫的場景類比信天翁和蜂鳥的特點,那麼信天翁代表的可能是使用普通硬件就能實現高性能的設計思路,數據按粗粒度處理,通過批處理的方式執行;而蜂鳥代表的可能是按細粒度處理數據的設計思路,需要高性能硬件的支持。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將表引擎獨立設計的好處是顯而易見的,通過特定的表引擎支撐特定的場景,十分靈活。對於簡單的場景,可直接使用簡單的引擎降低成本,而複雜的場景也有合適的選擇。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"多線程與分佈式","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse幾乎具備現代化高性能數據庫的所有典型特徵,對於可以提升性能的手段可謂是一一用盡,對於多線程和分佈式這類被廣泛使用的技術,自然更是不在話下。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果說向量化執行是通過數據級並行的方式提升了性能,那麼多線程處理就是通過線程級並行的方式實現了性能的提升。相比基於底層硬件實現的向量化執行SIMD,線程級並行通常由更高層次的軟件層面控制。現代計算機系統早已普及了多處理器架構,所以現今市面上的服務器都具備良好的多核心多線程處理能力。由於SIMD不適合用於帶有較多分支判斷的場景,ClickHouse也大量使用了多線程技術以實現提速,以此和向量化執行形成互補。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果一個籃子裝不下所有的雞蛋,那麼就多用幾個籃子來裝,這就是分佈式設計中分而治之的基本思想。同理,如果一臺服務器性能喫緊,那麼就利用多臺服務的資源協同處理。爲了實現這一目標,首先需要在數據層面實現數據的分佈式。因爲在分佈式領域,存在一條金科玉律—計算移動比數據移動更加划算。在各服務器之間,通過網絡傳輸數據的成本是高昂的,所以相比移動數據,更爲聰明的做法是預先將數據分佈到各臺服務器,將數據的計算查詢直接下推到數據所在的服務器。ClickHouse在數據存取方面,既支持分區 ( 縱向擴展,利用多線程原理 ),也支持分片 ( 橫向擴展,利用分佈式原理 ),可以說是將多線程和分佈式的技術應用到了極致。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"多主架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS、Spark、HBase和Elasticsearch這類分佈式系統,都採用了Master-Slave主從架構,由一個管控節點作爲Leader統籌全局。而ClickHouse則採用Multi-Master多主架構,集羣中的每個節點角色對等,客戶端訪問任意一個節點都能得到相同的效果。這種多主的架構有許多優勢,例如對等的角色使系統架構變得更加簡單,不用再區分主控節點、數據節點和計算節點,集羣中的所有節點功能相同。所以它天然規避了單點故障的問題,非常適合用於多數據中心、異地多活的場景。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"在線查詢","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse經常會被拿來與其他的分析型數據庫作對比,比如Vertica、SparkSQL、Hive和Elasticsearch等,它與這些數據庫確實存在許多相似之處。例如,它們都可以支撐海量數據的查詢場景,都擁有分佈式架構,都支持列存、數據分片、計算下推等特性。這其實也側面說明了ClickHouse在設計上確實吸取了各路奇技淫巧。與其他數據庫相比,ClickHouse也擁有明顯的優勢。例如,Vertica這類商用軟件價格高昂;SparkSQL與Hive這類系統無法保障90%的查詢在1秒內返回,在大數據量下的複雜查詢可能會需要分鐘級的響應時間;而Elasticsearch這類搜索引擎在處理億級數據聚合查詢時則顯得捉襟見肘。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如ClickHouse的\"廣告詞\"所言,其他的開源系統太慢,商用的系統太貴,只有Clickouse在成本與性能之間做到了良好平衡,即又快又開源。ClickHouse當之無愧地闡釋了\"在線\"二字的含義,即便是在複雜查詢的場景下,它也能夠做到極快響應,且無須對數據進行任何預處理加工。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"數據分片與分佈式查詢","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據分片是將數據進行橫向切分,這是一種在面對海量數據的場景下,解決存儲和查詢瓶頸的有效手段,是一種分治思想的體現。ClickHouse支持分片,而分片則依賴集羣。每個集羣由1到多個分片組成,而每個分片則對應了ClickHouse的1個服務節點。分片的數量上限取決於節點數量 ( 1個分片只能對應1個服務節點 )。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse並不像其他分佈式系統那樣,擁有高度自動化的分片功能。ClickHouse提供了本地表 ( Local Table ) 與分佈式表 ( Distributed Table ) 的概念。一張本地表等同於一份數據的分片。而分佈式表本身不存儲任何數據,它是本地表的訪問代理,其作用類似分庫中間件。藉助分佈式表,能夠代理訪問多個數據分片,從而實現分佈式查詢。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種設計類似數據庫的分庫和分表,十分靈活。例如在業務系統上線的初期,數據體量並不高,此時數據表並不需要多個分片。所以使用單個節點的本地表 ( 單個數據分片 ) 即可滿足業務需求,待到業務增長、數據量增大的時候,再通過新增數據分片的方式分流數據,並通過分佈式表實現分佈式查詢。這就好比一輛手動擋賽車,它將所有的選擇權都交到了使用者的手中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"簡單入門","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於ClickHouse的安裝,我們在這裏不再詳細展開,官網有詳細的文檔可以參考。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們使用Java客戶端連接ClickHouse進行一些簡單的操作。首先Clickhouse 有兩種 JDBC 驅動實現:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一種是官方給出的","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"\n ru.yandex.clickhouse\n clickhouse-jdbc\n 0.1.52\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"還有一種第三方提供的驅動:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"\n com.github.housepower\n clickhouse-native-jdbc\n 1.6-stable\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"兩者間的主要區別如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"驅動類加載路徑不同,分別爲 ru.yandex.clickhouse.ClickHouseDriver 和 com.github.housepower.jdbc.ClickHouseDriver\n默認連接端口不同,分別爲 8123 和 9000\n連接協議不同,官方驅動使用 HTTP 協議,而三方驅動使用 TCP 協議","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是,兩種驅動不可共用,同個項目中只能選擇其中一種驅動。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"Class.forName(\"com.github.housepower.jdbc.ClickHouseDriver\");\nConnection connection = DriverManager.getConnection(\"jdbc:clickhouse://192.168.60.131:9000\");\n\nStatement statement = connection.createStatement();\nstatement.executeQuery(\"create table test.example(day Date, name String, age UInt8) Engine=Log\");\n\nPreparedStatement pstmt = connection.prepareStatement(\"insert into test.example values(?, ?, ?)\");\n\n// insert 10 records\nfor (int i = 0; i < 10; i++) {\n pstmt.setDate(1, new Date(System.currentTimeMillis()));\n pstmt.setString(2, \"panda_\" + (i + 1));\n pstmt.setInt(3, 18);\n pstmt.addBatch();\n}\npstmt.executeBatch();\n\nStatement statement = connection.createStatement();\n\nString sql = \"select * from test.jdbc_example\";\nResultSet rs = statement.executeQuery(sql);\n\nwhile (rs.next()) {\n // ResultSet 的下標值從 1 開始,不可使用 0,否則越界,報 ArrayIndexOutOfBoundsException 異常\n System.out.println(rs.getDate(1) + \", \" + rs.getString(2) + \", \" + rs.getInt(3));\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過 clickhouse-client 命令行界面查看錶情況:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"ck-master :) show tables;\n\nSHOW TABLES\n\n┌─name─────────┐\n│ hits │\n│ jdbc_example │\n└──────────────┘\n\nck-master :) select * from example;\n\nSELECT *\nFROM jdbc_example\n\n┌────────day─┬─name─────┬─age─┐\n│ 2019-04-25 │ panda_1 │ 18 │\n│ 2019-04-25 │ panda_2 │ 18 │\n│ 2019-04-25 │ panda_3 │ 18 │\n│ 2019-04-25 │ panda_4 │ 18 │\n│ 2019-04-25 │ panda_5 │ 18 │\n│ 2019-04-25 │ panda_6 │ 18 │\n│ 2019-04-25 │ panda_7 │ 18 │\n│ 2019-04-25 │ panda_8 │ 18 │\n│ 2019-04-25 │ panda_9 │ 18 │\n│ 2019-04-25 │ panda_10 │ 18 │\n└────────────┴──────────┴─────┘\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"企業級應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從2019年起,已經有很多大廠開始在生產環境使用ClickHouse,小編在這裏根據各大公司使用情況作了一些總結,適用場景、方案、優化等等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ClickHouse在攜程酒店數據智能平臺的應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖是攜程實際應用ClickHouse的架構圖,底層數據大部分是離線的,一部分是實時的,離線數據現在大概有將近 3000 多個 job 每天都是把數據從 HIVE 拉到 ClickHouse 裏面去,實時數據主要是接外部數據,然後批量寫到 ClickHouse 裏面。數據智能平臺 80%以上的數據都在 ClickHouse 上面。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/de/deb42e12845f4b6359379492e8ba13ce.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時,攜程還存在全量數據同步和增量數據同步的場景。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"全量數據同步的流程流程如下圖:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"清空A","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"temp表,將最新的數據從Hive通過ETL導入到A","attrs":{}},{"type":"text","text":"temp表","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將A rename 成A","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"temp","attrs":{}},{"type":"text","text":"temp","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將A_temp rename成 A","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將A","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"temp","attrs":{}},{"type":"text","text":"temp rename成 A_tem","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/19/19514723db9be9c610a7ed72ea954e2c.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是,ClickHouse每執行一個 insert 的時候會產生一個進程 ID,如果沒有執行完,直接 Rename 就會造成數據丟失,數據就不對了,所以必須要有一個 job 在輪詢看這邊是不是執行完了,只有當 insert 的進程 id 執行完成後再做後面一系列的 rename。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"增量的數據同步流程入下圖所示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"清空A","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"temp表,將最近3個月的數據從Hive通過ETL導入到A","attrs":{}},{"type":"text","text":"temp表","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將A表中3個月之前的數據select into到A_temp表","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將A rename 成A","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"temp","attrs":{}},{"type":"text","text":"temp","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將A_temp rename成 A","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將A","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"temp","attrs":{}},{"type":"text","text":"temp rename成 A_tem","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/22/227e8775a3b5ed0715aab2dd26c3dd9e.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以下是攜程使用ClickHouse的經驗:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、數據導入之前要評估好分區字段","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse 因爲是根據分區文件存儲的,如果說你的分區字段真實數據粒度很細,數據導入的時候就會把你的物理機打爆。其實數據量可能沒有多少,但是因爲你用的字段不合理,會產生大量的碎片文件,磁盤空間就會打到底。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2、數據導入提前根據分區做好排序,避免同時寫入過多分區導致 clickhouse 內部來不及 Merge","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據導入之前我們做好排序,這樣可以降低數據導入後 ClickHouse 後臺異步 Merge 的時候涉及到的分區數,肯定是涉及到的分區數越少服務器壓力也會越小。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3、左右表 join 的時候要注意數據量的變化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再就是左右表 join 的問題,ClickHouse 它必須要大表在左邊,小表在右邊。但是我們可能某些業務場景跑着跑着數據量會返過來了,這個時候我們需要有監控能及時發現並修改這個 join 關係。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4、根據數據量以及應用場景評估是否採用分佈式","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分佈式要根據應用場景來,如果你的應用場景向上彙總後數據量已經超過了單物理機的存儲或者 CPU/內存瓶頸而不得不採用分佈式 ClickHouse 也有很完善的 MPP 架構,但同時你也要維護好你的主 keyboard。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5、監控好服務器的 CPU/內存波動","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再就是做好監控,我前面說過 ClickHouse 的 CPU 拉到 60%的時候,基本上你的慢查詢馬上就出來了,所以我這邊是有對 CPU 和內存的波動進行監控的,類似於 dump,這個我們抓下來以後就可以做分析。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6、數據存儲磁盤儘量採用 SSD","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據存儲儘量用 SSD,因爲我之前也開始用過機械硬盤,機械硬盤有一個問題就是當你的服務器要運維以後需要重啓,這個時候數據要加載,我們現在單機數據量存儲有超過了 200 億以上,這還是我幾個月前統計的。這個數據量如果說用機械硬盤的話,重啓一次可能要等上好幾個小時服務器纔可用,所以儘量用 SSD,重啓速度會快很多。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然重啓也有一個問題就是說會導致你的數據合併出現錯亂,這是一個坑。所以我每次維護機器的時候,同一個集羣我不會同時維護幾臺機器,我只會一臺一臺維護,A 機器好了以後會跟它的備用機器對比數據,否則機器起來了,但是數據不一定是對的,並且可能是一大片數據都是不對的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7、減少數據中文本信息的冗餘存儲","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要減少一些中文信息的冗餘存儲,因爲中文信息會導致整個服務器的 IO 很高,特別是導數據的時候。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"8、特別適用於數據量大,查詢頻次可控的場景,如數據分析、埋點日誌系統","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於它的應用,我認爲從成本角度來說,就像以前我們有很多業務數據的修改日誌,大家開發的時候可能都習慣性的存到 MySQL 裏面,但是實際上我認爲這種數據非常適合於落到 ClickHouse 裏面,比落到 MySQL 裏面成本會更低,查詢速度會更快。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Clickhouse在快手的應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b0/b08ba8bbd5e6b0570ee94eb69f698cf4.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e0/e00e0dc804fa0312e669950a7cc1a118.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e2/e279d8fd6e796b67d6f123d0cd0a7884.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fa/fa7e9b68c24e709f414049a761969b74.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/31/312c599220563366689f0f60084a1e7f.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/20/207fd5933541390379000ff5f5f60ff9.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c0/c04b71cf9687b4f03c786b18857d22df.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Clickhouse在QQ的應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"QQ音樂大數據團隊基於ClickHouse+Superset等基礎組件,結合騰訊雲EMR產品的雲端能力,搭建起高可用、低延遲的實時OLAP分析計算可視化平臺。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣日均新增萬億數據,規模達到上萬核CPU,PB級數據量。整體實現秒級的實時數據分析、提取、下鑽、監控數據基礎服務,大大提高了大數據分析與處理的工作效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9a/9a66ff60869f2854cbb391f5a06a60cd.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過OLAP分析平臺,極大降低了探索數據的門檻,做到全民BI,全民數據服務,實現了實時PV、UV、營收、用戶圈層、熱門歌曲等各類指標高效分析,全鏈路數據秒級分析定位,加強數據上報規範,形成一個良好的正循環。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"面對上萬核集羣規模、PB級的數據量,經過QQ音樂大數據團隊和騰訊雲EMR雙方技術團隊無數次技術架構升級優化,性能優化,逐步形成高可用、高性能、高安全的OLAP計算分析平臺。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1)基於SSD盤的ZooKeeper","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse依賴於ZooKeeper實現分佈式系統的協調工作,在ClickHouse併發寫入量較大時,ZooKeeper對元數據存儲處理不及時,會導致ClickHouse副本間同步出現延遲,降低集羣整體性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決方案:採用SSD盤的ZooKeeper大幅提高IO的性能,在表個數小於100,數據量級在TB級別時,也可採用HDD盤,其他情況都建議採用SSD盤。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1f/1ff4d9cd55fd2b0edb2389d812979d0e.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2)數據寫入一致性","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據在寫入ClickHouse失敗重試後內容出現重複,導致了不同系統,如Hive離線數倉中分析結果,與ClickHouse集羣中運算結果不一致。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7b/7b9abf65491443d0526e1670d143f7f5.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決方案:基於統一全局的負載均衡調度策略,完成數據失敗後仍然可寫入同一Shard,實現數據冪等寫入,從而保證在ClickHouse中數據一致性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3)實時離線數據寫入","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse數據主要來自實時流水上報數據和離線數據中間分析結果數據,如何在架構中完成上萬億基本數據的高效安全寫入,是一個巨大的挑戰。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決方案:基於Tube消息隊列,完成統一數據的分發消費,基於上述的一致性策略實現數據冪同步,做到實時和離線數據的高效寫入。 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a1/a1f9d0ea70985ba85c752062b8571264.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4)表分區數優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"部分離線數據倉庫採用按小時落地分區,如果採用原始的小時分區更新同步,會造成ClickHouse中Select查詢打開大量文件及文件描述符,進而導致性能低下。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決方案:ClickHouse官方也建議,表分區的數量建議不超過10000,上述的數據同步架構完成小時分區轉換爲天分區,同時程序中完成數據冪等消費。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(5)讀/寫分離架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"頻繁的寫動作,會消耗大量CPU/內存/網卡資源,後臺合併線程得不到有效資源,降低Merge Parts速度,MergeTree構建不及時,進而影響讀取效率,導致集羣性能降低。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決方案:ClickHouse臨時節點預先完成數據分區文件構建,動態加載到線上服務集羣,緩解ClickHouse在大量併發寫場景下的性能問題,實現高效的讀/寫分離架構,具體步驟和架構如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"a)利用K8S的彈性構建部署能力,構建臨時ClickHouse節點,數據寫入該節點完成數據的Merge、排序等構建工作;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b)構建完成數據按MergeTree結構關聯至正式業務集羣。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然對一些小數據量的同步寫入,可採用10000條以上批量的寫入。 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/10/10e4dd5b2e3d6309f86966f632e3ec41.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(6)跨表查詢本地化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在ClickHouse集羣中跨表進行Select查詢時,採用Global IN/Global Join語句性能較爲低下。分析原因,是在此類操作會生成臨時表,並跨設備同步該表,導致查詢速度慢。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決方案:採用一致性hash,將相同主鍵數據寫入同一個數據分片,在本地local表完成跨表聯合查詢,數據均來自於本地存儲,從而提高查詢速度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種優化方案也有一定的潛在問題,目前ClickHouse尚不提供數據的Reshard能力,當Shard所存儲主鍵數據量持續增加,達到磁盤容量上限需要分拆時,目前只能根據原始數據再次重建CK集羣,有較高的成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ec/ec61052d8895e1a2a042ed9b04f2e171.png","alt":"file","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse從進入大衆視野到開始廣泛應用實踐並不久,ClickHouse社區也在飛速發展中。ClickHouse仍然年輕,雖然在某些方面存在不足,但極致性能的存儲引擎,使得ClickHouse成爲一個非常優秀的存儲底座。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:","attrs":{}},{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247497962&idx=1&sn=efe8ff8f6679dda84b67f1bb27845f44&chksm=fd3ebe7fca493769335b0b60b914f58be126ad542fcf55ddfddfa76cbfcc35da1da04130978b&token=808170609&lang=zh_CN#rd","title":""},"content":[{"type":"text","text":"ClickHouse大數據領域企業級應用實踐和探索總結","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歡迎關注,","attrs":{}},{"type":"link","attrs":{"href":"https://shimo.im/docs/jdPhrtFwVCAMkoWv","title":""},"content":[{"type":"text","text":"《大數據成神之路》","attrs":{}}]},{"type":"text","text":"系列文章","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://blog.csdn.net/u013411339/article/details/112426724","title":""},"content":[{"type":"text","text":"《2021年最新版大數據面試題全面開啓更新》","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章