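Big Data with ClickHouse: Installation, Deployment, and Performance Testing

A record of the process.

Overview

This is my own summary of how I understand ClickHouse; see the official site for details.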

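  • A column-oriented database developed by the Russian search engine company Yandex and open-sourced in 2016
  • Mainly used for online OLAP; it does not support transactions, so it is not suited to OLTP
  • ClickHouse Chinese community
  • ClickHouse official Chinese site
  • Its strength is queries on large wide tables; joins across several large tables perform worse than in typical OLAP tools
  • Its extreme performance comes from squeezing everything out of the server hardware
  • Queries on datasets of around ten billion rows can return within seconds
  • Storage is columnar, so aggregate queries such as count are fast, and the compression ratio is much higher than with typical storage formats
  • When creating a table, choose a suitable table engine to get better query performance
  • Supports indexes and suits online queries
  • Concurrency is low: the official recommendation is about 100, although the CHproxy proxy can raise it
  • Every SQL statement uses the server's resources to the fullest in order to reach peak performance
  • Near-standard SQL; only a small part differs from the SQL:2003 standard
  • Replication keeps data safe
  • Supports approximate computation
  • Supports real-time data updates; large batches perform better
  • A vectorized engine was designed to improve CPU utilization
  • The sparse index makes ClickHouse a poor fit for point queries that fetch a single row by key
  • Data can only be deleted or modified in batches
  • Better supported on Ubuntu and Debian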

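Environment

  • CentOS 7 (most production environments still run CentOS, so this walkthrough uses it as well)

Single-node installation

ClickHouse can run on Linux, FreeBSD, or Mac OS X on any x86_64, AArch64, or PowerPC64LE CPU architecture.

  • Check whether the environment is supported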

    [root@bigdata01 module]# grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
    SSE 4.2 supported
    
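  • Raise the open-file limit on CentOS

    Edit the following two files:

    vim /etc/security/limits.conf

    vim /etc/security/limits.d/20-nproc.conf

    Note: in some environments the second file is not called 20-nproc.conf. Run ls /etc/security/limits.d first to see what it is named.

    Add the following lines, including the leading *: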

    * soft nofile 65536 
    * hard nofile 65536 
    * soft nproc 131072 
    * hard nproc 131072
    

    The settings take effect after the server is restarted. Check the result with ulimit -n or ulimit -a.

  • Install dependencies

    yum install -y libtool
    yum install -y '*unixODBC*'
    
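  • About downloads: nothing needs to be downloaded by hand, because yum can install the packages directly. The following is for reference only.

Official site: https://clickhouse.yandex/

Download location: http://repo.red-soft.biz/repos/clickhouse/stable/el7/

https://packagecloud.io/Altinity/clickhouse

I use a release from about six months ago here. ClickHouse releases frequently, so pay attention to what changes between versions.

Installed version: *-19.15.5.18-1.el7.x86_64.rpm

This includes:

  • Add the yum repository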

    curl -s https://packagecloud.io/install/repositories/Altinity/clickhouse/script.rpm.sh | sudo bash
    
  • Install with yum

    The commands below install a specific version. To install the latest version instead, run sudo yum install -y clickhouse-server clickhouse-client directly.

    sudo yum install clickhouse-server-common-19.15.5.18-1.el7.x86_64
    sudo yum install clickhouse-server-19.15.5.18-1.el7.x86_64    # note: this also installs clickhouse-common-static as a dependency
    sudo yum install clickhouse-debuginfo-19.15.5.18-1.el7.x86_64
    sudo yum install clickhouse-client-19.15.5.18-1.el7.x86_64
    

    Check the installation:

    sudo yum list installed 'clickhouse*'
    
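  • Where each installed component puts its files

    On https://packagecloud.io/Altinity/clickhouse, open a given component of a given version to see its file list. The paths below are the ones most worth knowing.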

    /etc/clickhouse-client/config.xml
    /usr/bin/clickhouse-client
    /usr/bin/clickhouse-benchmark
    
    /etc/clickhouse-server/users.xml
    /etc/clickhouse-server/config.xml
    /usr/bin/clickhouse-server
    /etc/security/limits.d/clickhouse.conf
    /etc/init.d/clickhouse-server
    /etc/cron.d/clickhouse-server
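
Before moving on, it can be worth confirming that the nofile and nproc limits configured earlier are actually in effect for the current session. The following is a minimal sketch, not part of the ClickHouse packages: check_limit is a hypothetical helper, and the thresholds are simply the values written to limits.conf in this article.

```shell
#!/usr/bin/env bash
# check_limit is a hypothetical helper, not a ClickHouse tool.
# It compares a current limit with the minimum expected value;
# "unlimited" always passes.
check_limit() {
    local name="$1" actual="$2" expected="$3"
    if [ "$actual" = "unlimited" ] || [ "$actual" -ge "$expected" ]; then
        echo "OK   $name=$actual (expected >= $expected)"
    else
        echo "LOW  $name=$actual (expected >= $expected)"
        return 1
    fi
}

# Thresholds are the values added to limits.conf in this article.
# "|| true" keeps going so that both limits are always reported.
check_limit nofile "$(ulimit -n)" 65536 || true
check_limit nproc  "$(ulimit -u)" 131072 || true
```

A line starting with LOW suggests the limits files were not picked up, for example because the server has not been restarted yet.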
    

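Common configuration

  • Server configuration

    Note: restart the service after changing the server configuration.

    The configuration files are in /etc/clickhouse-server.

    • users.xml: user configuration. By default there is one user, default, with no password.

      To add a user, copy the way the default user is configured, i.e. add an equivalent block of tags.

    • config.xml: service configuration. Ports, bind addresses, security settings, and so on can be changed here.

  • Client configuration

    • When clickhouse-client starts, it reads /etc/clickhouse-client/config.xml by default.

    • The -c option points it at a different config.xml, e.g. clickhouse-client -c /opt/software/config.xml

    • /etc/clickhouse-client/config.xml holds the settings used to connect to the server.

Starting and checking the service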

service clickhouse-server start
service clickhouse-server status

[root@bigdata01 ~]# service clickhouse-server start
Start clickhouse-server service: Path to data directory in /etc/clickhouse-server/config.xml: /var/lib/clickhouse/
DONE
[root@bigdata01 ~]# service clickhouse-server status
clickhouse-server service is running

Command-line client

[root@bigdata01 ~]# clickhouse-client
ClickHouse client version 19.15.5.18.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.15.5 revision 54426.

bigdata01 :) show databases;

SHOW DATABASES

┌─name────┐
│ default │
│ system  │
└─────────┘

2 rows in set. Elapsed: 0.001 sec.

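To connect to a different port or server address, add options such as --port 8080 --host 127.0.0.1 (as the banner above shows, the client connects to localhost:9000 by default).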

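Distributed cluster installation

Prerequisite: every machine has already been set up following the single-node steps above.

  • On every machine, edit /etc/clickhouse-server/config.xml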

    <listen_host>::</listen_host>
    <!-- <listen_host>::1</listen_host> -->
    <!-- <listen_host>127.0.0.1</listen_host> -->
    
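  • On every machine, create a metrika.xml file under /etc

    vim /etc/metrika.xml
    
    Add the following content:
    
    <yandex>
    <clickhouse_remote_servers>
        <!-- cluster name: for a 3-machine cluster with 1 replica, use the tag below -->
        <perftest_3shards_1replicas>
            <!-- one shard entry per machine -->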
            <shard>
                 <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata01</host>
                    <port>19000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata02</host>
                    <port>19000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata03</host>
                    <port>19000</port>
                </replica>
            </shard>
        </perftest_3shards_1replicas>
    </clickhouse_remote_servers>
    
    <!-- ZooKeeper cluster configuration -->
    <zookeeper-servers>
      <node index="1">
        <host>bigdata01</host>
        <port>32181</port>
      </node>
    
      <node index="2">
        <host>bigdata02</host>
        <port>32181</port>
      </node>
      <node index="3">
        <host>bigdata03</host>
        <port>32181</port>
      </node>
    </zookeeper-servers>
    
    <!-- macros: set this to the host of the current machine -->
    <macros>
        <replica>bigdata01</replica>
    </macros>
    
    <networks>
       <ip>::/0</ip>
    </networks>
    
    <clickhouse_compression>
    <case>
      <min_part_size>10000000000</min_part_size>
      <min_part_size_ratio>0.01</min_part_size_ratio>
      <method>lz4</method>
    </case>
    </clickhouse_compression>
    
    </yandex>
    
  • Start the service on every machine

    Note: start ZooKeeper first.
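
Because the same metrika.xml is created on every machine while the macros section has to name the local host, that section is easy to get wrong when copying the file around. The following is a minimal sketch of generating it per host. render_macros is a hypothetical helper, not something ClickHouse provides, and the host names are the three used in this article.

```shell
#!/usr/bin/env bash
# render_macros is a hypothetical helper, not part of ClickHouse.
# It prints the <macros> block for the host given as $1.
render_macros() {
    local host="$1"
    printf '<macros>\n    <replica>%s</replica>\n</macros>\n' "$host"
}

# Show the block each node of this article's cluster would need.
for h in bigdata01 bigdata02 bigdata03; do
    echo "# $h"
    render_macros "$h"
done
```

On a real node, render_macros "$(hostname)" would print the block for that machine, assuming the hostname matches the name used in the cluster definition.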

Uninstalling

  • List the installed packages

    [root@bigdata01 ~]# yum list installed | grep clickhouse
    clickhouse-client.x86_64              19.15.5.18-1.el7                @Altinity_clickhouse
    clickhouse-common-static.x86_64       19.15.5.18-1.el7                @Altinity_clickhouse
    clickhouse-debuginfo.x86_64           19.15.5.18-1.el7                @Altinity_clickhouse
    clickhouse-server.x86_64              19.15.5.18-1.el7                @Altinity_clickhouse
    clickhouse-server-common.x86_64       19.15.5.18-1.el7                @Altinity_clickhouse
    
  • Remove the packages

    yum remove -y clickhouse-client.x86_64 clickhouse-common-static.x86_64 clickhouse-debuginfo.x86_64 clickhouse-server.x86_64 clickhouse-server-common.x86_64
    
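  • Search the whole filesystem for leftover files and delete them

    find / -name '*clickhouse*'
    rm -rf <paths found above>
    
  • Force removal when uninstalling reports an error

    # remove the rpm package without running its uninstall scripts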
    sudo rpm -e clickhouse-server.x86_64 --noscripts
    

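Performance testing

The test uses the flight data provided on the official site, which covers 1987 to 2017. Because storage space is limited, only the data from 2000 to 2017 is used.

Test machine: Baidu Cloud server, 2 cores / 4 GB RAM / 40 GB disk / compute-optimized c3, 1 Mbps

In these tests, the dataset below did not push the machine to its performance limit.

All of the tests below are described on the official site.

  • Official dataset reference: https://clickhouse.tech/docs/zh/getting_started/example_datasets/ontime/

  • Create the table (note: connect with clickhouse-client -m; without -m to enable multi-line input, the statement fails)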

    CREATE TABLE `ontime` (
      `Year` UInt16,
      `Quarter` UInt8,
      `Month` UInt8,
      `DayofMonth` UInt8,
      `DayOfWeek` UInt8,
      `FlightDate` Date,
      `UniqueCarrier` FixedString(7),
      `AirlineID` Int32,
      `Carrier` FixedString(2),
      `TailNum` String,
      `FlightNum` String,
      `OriginAirportID` Int32,
      `OriginAirportSeqID` Int32,
      `OriginCityMarketID` Int32,
      `Origin` FixedString(5),
      `OriginCityName` String,
      `OriginState` FixedString(2),
      `OriginStateFips` String,
      `OriginStateName` String,
      `OriginWac` Int32,
      `DestAirportID` Int32,
      `DestAirportSeqID` Int32,
      `DestCityMarketID` Int32,
      `Dest` FixedString(5),
      `DestCityName` String,
      `DestState` FixedString(2),
      `DestStateFips` String,
      `DestStateName` String,
      `DestWac` Int32,
      `CRSDepTime` Int32,
      `DepTime` Int32,
      `DepDelay` Int32,
      `DepDelayMinutes` Int32,
      `DepDel15` Int32,
      `DepartureDelayGroups` String,
      `DepTimeBlk` String,
      `TaxiOut` Int32,
      `WheelsOff` Int32,
      `WheelsOn` Int32,
      `TaxiIn` Int32,
      `CRSArrTime` Int32,
      `ArrTime` Int32,
      `ArrDelay` Int32,
      `ArrDelayMinutes` Int32,
      `ArrDel15` Int32,
      `ArrivalDelayGroups` Int32,
      `ArrTimeBlk` String,
      `Cancelled` UInt8,
      `CancellationCode` FixedString(1),
      `Diverted` UInt8,
      `CRSElapsedTime` Int32,
      `ActualElapsedTime` Int32,
      `AirTime` Int32,
      `Flights` Int32,
      `Distance` Int32,
      `DistanceGroup` UInt8,
      `CarrierDelay` Int32,
      `WeatherDelay` Int32,
      `NASDelay` Int32,
      `SecurityDelay` Int32,
      `LateAircraftDelay` Int32,
      `FirstDepTime` String,
      `TotalAddGTime` String,
      `LongestAddGTime` String,
      `DivAirportLandings` String,
      `DivReachedDest` String,
      `DivActualElapsedTime` String,
      `DivArrDelay` String,
      `DivDistance` String,
      `Div1Airport` String,
      `Div1AirportID` Int32,
      `Div1AirportSeqID` Int32,
      `Div1WheelsOn` String,
      `Div1TotalGTime` String,
      `Div1LongestGTime` String,
      `Div1WheelsOff` String,
      `Div1TailNum` String,
      `Div2Airport` String,
      `Div2AirportID` Int32,
      `Div2AirportSeqID` Int32,
      `Div2WheelsOn` String,
      `Div2TotalGTime` String,
      `Div2LongestGTime` String,
      `Div2WheelsOff` String,
      `Div2TailNum` String,
      `Div3Airport` String,
      `Div3AirportID` Int32,
      `Div3AirportSeqID` Int32,
      `Div3WheelsOn` String,
      `Div3TotalGTime` String,
      `Div3LongestGTime` String,
      `Div3WheelsOff` String,
      `Div3TailNum` String,
      `Div4Airport` String,
      `Div4AirportID` Int32,
      `Div4AirportSeqID` Int32,
      `Div4WheelsOn` String,
      `Div4TotalGTime` String,
      `Div4LongestGTime` String,
      `Div4WheelsOff` String,
      `Div4TailNum` String,
      `Div5Airport` String,
      `Div5AirportID` Int32,
      `Div5AirportSeqID` Int32,
      `Div5WheelsOn` String,
      `Div5TotalGTime` String,
      `Div5LongestGTime` String,
      `Div5WheelsOff` String,
      `Div5TailNum` String
    ) ENGINE = MergeTree
    PARTITION BY Year
    ORDER BY (Carrier, FlightDate)
    SETTINGS index_granularity = 8192;

  • Download the data (script provided by the official site)

    for s in `seq 1987 2017`
    do
    for m in `seq 1 12`
    do
    wget http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
    done
    done

  • Load the downloaded data (I use a per_test database for testing, so adjust the statement to your setup, and mind the host and port)

    for i in *.zip; do echo "$i"; unzip -cq "$i" '*.csv' | sed 's/\.00//g' | clickhouse-client --host=127.0.0.1 --port=19000 --query="INSERT INTO per_test.ontime FORMAT CSVWithNames"; done
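
Since only part of the 1987 to 2017 range is loaded here, it can help to parameterize the download by year. The following is a minimal sketch. ontime_urls is a hypothetical helper, and the URL pattern is copied from the official download loop above; whether those URLs are still served is not something this sketch checks.

```shell
#!/usr/bin/env bash
# ontime_urls is a hypothetical helper, not part of the official instructions.
# It prints one download URL per month for the years $1 to $2 (inclusive).
ontime_urls() {
    local first="$1" last="$2" y m
    for y in $(seq "$first" "$last"); do
        for m in $(seq 1 12); do
            echo "http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${y}_${m}.zip"
        done
    done
}

# Example: list the twelve files for the year 2000.
ontime_urls 2000 2000
```

The list can be piped to wget, for example ontime_urls 2000 2001 | wget -c -i -, which reads the URLs from standard input.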

  • Number of flights per day of the week, 2000 to 2008

    SELECT 
        DayOfWeek, 
        count(*) AS c
    FROM ontime
    WHERE (Year >= 2000) AND (Year <= 2008)
    GROUP BY DayOfWeek
    ORDER BY c DESC
    
    ┌─DayOfWeek─┬───────c─┐
    │         1 │ 1024694 │
    │         3 │ 1019282 │
    │         2 │ 1015141 │
    │         5 │ 1014324 │
    │         4 │ 1013083 │
    │         7 │  979170 │
    │         6 │  908404 │
    └───────────┴─────────┘
    
    7 rows in set. Elapsed: 0.042 sec. Processed 6.97 million rows, 20.92 MB (167.64 million rows/s., 502.92 MB/s.)
    
  • Number of flights delayed by more than 10 minutes, by day of the week, 2000 to 2008

    SELECT 
        DayOfWeek, 
        count(*) AS c
    FROM ontime
    WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
    GROUP BY DayOfWeek
    ORDER BY c DESC
    
    ┌─DayOfWeek─┬──────c─┐
    │         5 │ 274999 │
    │         4 │ 254490 │
    │         7 │ 238941 │
    │         1 │ 209985 │
    │         3 │ 201997 │
    │         6 │ 183685 │
    │         2 │ 178767 │
    └───────────┴────────┘
    
    7 rows in set. Elapsed: 0.156 sec. Processed 6.97 million rows, 48.82 MB (44.71 million rows/s., 313.00 MB/s.) 
    
    
  • Number of delays of more than 10 minutes per airport, 2000 to 2008

    SELECT 
        Origin, 
        count(*) AS c
    FROM ontime
    WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
    GROUP BY Origin
    ORDER BY c DESC
    LIMIT 10
    
    ┌─Origin─┬──────c─┐
    │ ORD    │ 105023 │
    │ ATL    │  73496 │
    │ DFW    │  67485 │
    │ PHX    │  66968 │
    │ LAX    │  66964 │
    │ LAS    │  50462 │
    │ STL    │  47812 │
    │ DEN    │  46164 │
    │ SFO    │  43537 │
    │ DTW    │  43341 │
    └────────┴────────┘
    
    10 rows in set. Elapsed: 0.156 sec. Processed 6.97 million rows, 76.72 MB (44.59 million rows/s., 490.50 MB/s.) 
    
  • Percentage of flights delayed by more than 10 minutes per carrier, 2000 to 2008

    SELECT 
        Carrier, 
        c, 
        c2, 
        (c * 100) / c2 AS c3
    FROM 
    (
        SELECT 
            Carrier, 
            count(*) AS c
        FROM ontime
        WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
        GROUP BY Carrier
    )
    INNER JOIN 
    (
        SELECT 
            Carrier, 
            count(*) AS c2
        FROM ontime
        WHERE (Year >= 2000) AND (Year <= 2008)
        GROUP BY Carrier
    ) USING (Carrier)
    ORDER BY c3 DESC
    
    ┌─Carrier─┬──────c─┬──────c2─┬─────────────────c3─┐
    │ UA      │ 262451 │  915911 │ 28.654640025067938 │
    │ AS      │  51977 │  188884 │  27.51794752334766 │
    │ WN      │ 314159 │ 1148649 │ 27.350304575200955 │
    │ HP      │  69859 │  264180 │  26.44371262018321 │
    │ US      │ 185689 │  886115 │  20.95540646530078 │
    │ AA      │ 181789 │  896349 │ 20.281051242317446 │
    │ TW      │  64220 │  319764 │  20.08356162669969 │
    │ DL      │ 199886 │ 1089116 │ 18.353049629240594 │
    │ NW      │ 115102 │  667317 │  17.24847411350228 │
    │ CO      │  78593 │  474145 │ 16.575731052737034 │
    │ MQ      │  17229 │  108410 │  15.89244534637026 │
    │ AQ      │   1910 │   15258 │ 12.518023332022546 │
    └─────────┴────────┴─────────┴────────────────────┘
    
    12 rows in set. Elapsed: 0.186 sec. Processed 13.95 million rows, 83.69 MB (75.06 million rows/s., 450.37 MB/s.) 
    
    

    A better version of the same query:

    SELECT 
        Carrier, 
        avg(DepDelay > 10) * 100 AS c3
    FROM ontime
    WHERE (Year >= 2000) AND (Year <= 2008)
    GROUP BY Carrier
    ORDER BY c3 DESC
    
    ┌─Carrier─┬─────────────────c3─┐
    │ UA      │  28.65464002506794 │
    │ AS      │ 27.517947523347665 │
    │ WN      │  27.35030457520095 │
    │ HP      │ 26.443712620183206 │
    │ US      │  20.95540646530078 │
    │ AA      │ 20.281051242317446 │
    │ TW      │  20.08356162669969 │
    │ DL      │ 18.353049629240594 │
    │ NW      │  17.24847411350228 │
    │ CO      │ 16.575731052737034 │
    │ MQ      │ 15.892445346370259 │
    │ AQ      │ 12.518023332022546 │
    └─────────┴────────────────────┘
    
    12 rows in set. Elapsed: 0.129 sec. Processed 6.97 million rows, 55.79 MB (53.97 million rows/s., 431.75 MB/s.)
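
The shorter query gives the same answer because DepDelay > 10 evaluates to 0 or 1, so its average is the fraction of delayed flights. The following sketch checks that arithmetic outside the database with awk. delay_pct and flag_avg_pct are hypothetical helpers, the UA counts (262451 delayed out of 915911 flights) come from the join-based result, and the four sample delays are made up purely for illustration.

```shell
#!/usr/bin/env bash
# Both functions are hypothetical helpers for checking the arithmetic.

# Join-style formula: delayed count * 100 / total count.
delay_pct() {
    awk -v c="$1" -v c2="$2" 'BEGIN { printf "%.2f\n", c * 100 / c2 }'
}

# avg-style formula: mean of the 0/1 flag (delay > 10), times 100.
# Reads one delay value per line from standard input.
flag_avg_pct() {
    awk '{ n += ($1 > 10) } END { printf "%.2f\n", n * 100 / NR }'
}

# UA row from the join-based result: c = 262451, c2 = 915911.
delay_pct 262451 915911

# Made-up sample: 2 of the 4 delays exceed 10 minutes.
printf '5\n20\n0\n35\n' | flag_avg_pct
```

The first call prints 28.65, matching the UA percentage in both result tables; the second prints 50.00.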
    
  • Percentage of flights delayed by more than 10 minutes, per year

    SELECT 
        Year, 
        avg(DepDelay > 10) * 100
    FROM ontime
    GROUP BY Year
    ORDER BY Year ASC
    
    ┌─Year─┬─multiply(avg(greater(DepDelay, 10)), 100)─┐
    │ 2000 │                         23.17167181619297 │
    │ 2001 │                        17.505660117222323 │
    └──────┴───────────────────────────────────────────┘
    
    2 rows in set. Elapsed: 0.084 sec. Processed 6.97 million rows, 41.84 MB (83.21 million rows/s., 499.26 MB/s.)

  • Most popular destinations, ranked by number of distinct origin cities, 2000 to 2010

    SELECT 
        DestCityName, 
        uniqExact(OriginCityName) AS u
    FROM ontime
    WHERE (Year >= 2000) AND (Year <= 2010)
    GROUP BY DestCityName
    ORDER BY u DESC
    LIMIT 10
    
    ┌─DestCityName──────────┬───u─┐
    │ Chicago, IL           │ 117 │
    │ Dallas/Fort Worth, TX │ 115 │
    │ Atlanta, GA           │ 100 │
    │ Minneapolis, MN       │  88 │
    │ Houston, TX           │  81 │
    │ Detroit, MI           │  81 │
    │ St. Louis, MO         │  76 │
    │ Charlotte, NC         │  70 │
    │ Pittsburgh, PA        │  69 │
    │ Newark, NJ            │  67 │
    └───────────────────────┴─────┘
    
    10 rows in set. Elapsed: 0.559 sec. Processed 6.97 million rows, 322.81 MB (12.47 million rows/s., 577.07 MB/s.)
    
  • Q10

    SELECT 
        min(Year), 
        max(Year), 
        Carrier, 
        count(*) AS cnt, 
        sum(ArrDelayMinutes > 30) AS flights_delayed, 
        round(sum(ArrDelayMinutes > 30) / count(*), 2) AS rate
    FROM ontime
    WHERE (DayOfWeek NOT IN (6, 7)) AND (OriginState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (DestState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (FlightDate < '2010-01-01')
    GROUP BY Carrier
    HAVING (cnt > 100000) AND (max(Year) > 1990)
    ORDER BY rate DESC
    LIMIT 1000
    
    ┌─min(Year)─┬─max(Year)─┬─Carrier─┬────cnt─┬─flights_delayed─┬─rate─┐
    │      2000 │      2001 │ UA      │ 649862 │          121889 │ 0.19 │
    │      2000 │      2001 │ HP      │ 192068 │           27480 │ 0.14 │
    │      2000 │      2001 │ AA      │ 615877 │           79539 │ 0.13 │
    │      2000 │      2001 │ US      │ 638984 │           79708 │ 0.12 │
    │      2000 │      2001 │ TW      │ 224711 │           25778 │ 0.11 │
    │      2000 │      2001 │ WN      │ 855501 │           97260 │ 0.11 │
    │      2000 │      2001 │ NW      │ 483807 │           52931 │ 0.11 │
    │      2000 │      2001 │ CO      │ 349268 │           38560 │ 0.11 │
    │      2000 │      2001 │ DL      │ 764713 │           79954 │  0.1 │
    └───────────┴───────────┴─────────┴────────┴─────────────────┴──────┘
    
    9 rows in set. Elapsed: 0.370 sec. Processed 6.97 million rows, 84.86 MB (18.86 million rows/s., 229.45 MB/s.)
    
  • Multi-dimensional query 1

    SELECT 
        Year, 
        OriginCityName, 
        DepartureDelayGroups, 
        CancellationCode, 
        avg(DepDelay > 10) * 100
    FROM ontime
    GROUP BY 
        Year, 
        OriginCityName, 
        DepartureDelayGroups, 
        CancellationCode
    ORDER BY Year ASC
    LIMIT 1
    
    ┌─Year─┬─OriginCityName───┬─DepartureDelayGroups─┬─CancellationCode─┬─multiply(avg(greater(DepDelay, 10)), 100)─┐
    │ 2000 │ Fayetteville, NC │ 7                    │                  │                                       100 │
    └──────┴──────────────────┴──────────────────────┴──────────────────┴───────────────────────────────────────────┘
    
    1 rows in set. Elapsed: 0.613 sec. Processed 6.97 million rows, 275.48 MB (11.38 million rows/s., 449.40 MB/s.)
    

Overall, the tests show excellent performance. ClickHouse is a powerful tool for OLAP analysis.
