Big Data with ClickHouse: Installation, Deployment, and Performance Testing

Notes on the process.

Overview

This is my own summary; see the official site for the details.

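  • A column-oriented database developed by the Russian search engine company Yandex and open-sourced in 2016
  • Mainly used for online analytical processing (OLAP); it does not support transactions, so it is not suited to OLTP
  • ClickHouse Chinese community
  • ClickHouse Chinese official site
  • Its strength is querying large wide tables; joins across several large tables perform worse than in typical OLAP tools
  • Its extreme performance comes from squeezing the most out of the server hardware
  • Queries over datasets of tens of billions of rows can respond within seconds
  • Because storage is columnar, aggregate queries such as count are fast, and the compression ratio is much higher than with ordinary storage layouts
  • When creating a table, choose a suitable table engine to get better query performance
  • Supports indexes and is suited to online queries
  • Concurrency is low: the official recommendation is about 100 queries per second, although the add-on CHProxy can raise this
  • Every SQL statement uses the server's resources at full capacity to achieve maximum performance
  • Near-standard SQL; only a small part differs from the SQL:2003 standard
  • Replication keeps data safe
  • Supports approximate calculations
  • Supports real-time data updates; large batches perform better
  • Uses a vectorized execution engine to improve CPU utilization
  • The sparse index makes ClickHouse a poor fit for point queries that retrieve a single row by key
  • Data can only be deleted or modified in batches
  • Ubuntu and Debian are better supported than other systems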

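Environment

  • CentOS 7 (most production environments still run CentOS, so it is used here as well)

Single-node installation

ClickHouse runs on Linux, FreeBSD, or Mac OS X on any x86_64, AArch64, or PowerPC64LE CPU architecture.

  • Check whether the environment is supported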

    [root@bigdata01 module]# grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
    SSE 4.2 supported
    
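  • Raise the open-file limit on CentOS

    Edit the following two files

    vim /etc/security/limits.conf

    vim /etc/security/limits.d/20-nproc.conf

    Note: on some systems the second file is not called 20-nproc.conf; run ls /etc/security/limits.d first to see the actual name

    Add the following lines, including the leading * character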

    * soft nofile 65536 
    * hard nofile 65536 
    * soft nproc 131072 
    * hard nproc 131072
    

    The settings take effect after the server is rebooted; check the result with ulimit -n or ulimit -a
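    The check after the reboot can also be scripted. The following is only a minimal sketch, assuming a bash shell: the helper name limit_ok is made up for illustration, and 65536 is simply the nofile value configured above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a limit value is "unlimited"
# or at least the required minimum.
limit_ok() {
    local actual="$1" required="$2"
    [ "$actual" = "unlimited" ] || [ "$actual" -ge "$required" ]
}

# Compare the current open-file limit with the value set in limits.conf.
nofile="$(ulimit -n)"
if limit_ok "$nofile" 65536; then
    echo "nofile limit OK: $nofile"
else
    echo "nofile limit too low: $nofile (expected >= 65536)"
fi
```

    The same helper can be applied to the process limit by passing it the output of ulimit -u and 131072.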

  • Install dependencies

    yum install -y libtool
    yum install -y *unixODBC*
    
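  • About downloading: there is no need to download the packages by hand, since yum can install them directly; the links below are for reference only

Official site: https://clickhouse.yandex/

Download locations: http://repo.red-soft.biz/repos/clickhouse/stable/el7/

https://packagecloud.io/Altinity/clickhouse

The release used here is about six months old. ClickHouse versions change quickly, so check what each update contains.

Installed version: *-19.15.5.18-1.el7.x86_64.rpm

The installation consists of: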

  • Add the yum repository

    curl -s https://packagecloud.io/install/repositories/Altinity/clickhouse/script.rpm.sh | sudo bash
    
  • Install with yum

    The commands below install a specific version; to install the latest version, run sudo yum install -y clickhouse-server clickhouse-client instead

    sudo yum install clickhouse-server-common-19.15.5.18-1.el7.x86_64
    # Note: the next command also installs clickhouse-common-static as a dependency
    sudo yum install clickhouse-server-19.15.5.18-1.el7.x86_64
    sudo yum install clickhouse-debuginfo-19.15.5.18-1.el7.x86_64
    sudo yum install clickhouse-client-19.15.5.18-1.el7.x86_64
    

    Check the installation:

    sudo yum list installed 'clickhouse*'
    
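  • Where the installed components put their files

    On https://packagecloud.io/Altinity/clickhouse you can open the matching version and component to see its file list; the paths below are the ones most worth knowing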

    /etc/clickhouse-client/config.xml
    /usr/bin/clickhouse-client
    /usr/bin/clickhouse-benchmark
    
    /etc/clickhouse-server/users.xml
    /etc/clickhouse-server/config.xml
    /usr/bin/clickhouse-server
    /etc/security/limits.d/clickhouse.conf
    /etc/init.d/clickhouse-server
    /etc/cron.d/clickhouse-server
    

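Common configuration

  • Server configuration

    Note: restart the service after changing the server configuration

    The configuration files are in the /etc/clickhouse-server directory

    • users.xml: user configuration. By default there is a user named default with no password.

      To add a user, follow the way the default user is configured, that is, add a matching set of tags

    • config.xml: server configuration. Ports, the bind IP, security settings, and so on can be changed here

  • Client configuration

    • When you run clickhouse-client, it reads /etc/clickhouse-client/config.xml by default at startup

    • Use the -c option to point at a different config.xml, for example clickhouse-client -c /opt/software/config.xml

    • /etc/clickhouse-client/config.xml holds the settings used to connect to the server

Start / check the service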

service clickhouse-server start
service clickhouse-server status

[root@bigdata01 ~]# service clickhouse-server start
Start clickhouse-server service: Path to data directory in /etc/clickhouse-server/config.xml: /var/lib/clickhouse/
DONE
[root@bigdata01 ~]# service clickhouse-server status
clickhouse-server service is running

Command-line client

[root@bigdata01 ~]# clickhouse-client
ClickHouse client version 19.15.5.18.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.15.5 revision 54426.

bigdata01 :) show databases;

SHOW DATABASES

┌─name────┐
│ default │
│ system  │
└─────────┘

2 rows in set. Elapsed: 0.001 sec.

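To specify the port or server address, add options such as --port 8080 --host 127.0.0.1

Distributed cluster installation

Prerequisite: every machine has already been installed following the single-node steps above

  • On every machine, edit /etc/clickhouse-server/config.xml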

    <listen_host>::</listen_host>
    <!-- <listen_host>::1</listen_host> -->
    <!-- <listen_host>127.0.0.1</listen_host> -->
    
  • On every machine, create a metrika.xml file under /etc

    vim /etc/metrika.xml
    
    Add the following content
    
    <yandex>
    <clickhouse_remote_servers>
        <!-- For a cluster of 3 shards with 1 replica each, name the tag as follows -->
        <perftest_3shards_1replicas>
            <!-- One shard entry per machine -->
            <shard>
                 <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata01</host>
                    <port>19000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata02</host>
                    <port>19000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata03</host>
                    <port>19000</port>
                </replica>
            </shard>
        </perftest_3shards_1replicas>
    </clickhouse_remote_servers>
    
    <!-- ZooKeeper cluster configuration -->
    <zookeeper-servers>
      <node index="1">
        <host>bigdata01</host>
        <port>32181</port>
      </node>
    
      <node index="2">
        <host>bigdata02</host>
        <port>32181</port>
      </node>
      <node index="3">
        <host>bigdata03</host>
        <port>32181</port>
      </node>
    </zookeeper-servers>
    
    <!-- macros: set this to the current machine's host -->
    <macros>
        <replica>bigdata01</replica>
    </macros>
    
    <networks>
       <ip>::/0</ip>
    </networks>
    
    <clickhouse_compression>
    <case>
      <min_part_size>10000000000</min_part_size>
      <min_part_size_ratio>0.01</min_part_size_ratio>
      <method>lz4</method>
    </case>
    </clickhouse_compression>
    
    </yandex>
    
  • Start every machine

    Note: start ZooKeeper first
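
Every node needs the same shard list in metrika.xml, so one way to avoid hand-editing slips is to generate the <shard> entries from a list of hosts. The following is only a rough sketch: render_shards is a hypothetical helper, not something shipped with ClickHouse, and the hostnames and port 19000 are simply the values used in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one <shard> entry per host.
# Usage: render_shards PORT HOST [HOST ...]
render_shards() {
    local port="$1"; shift
    local host
    for host in "$@"; do
        # internal_replication sits at the shard level, next to <replica>.
        cat <<EOF
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>${host}</host>
                <port>${port}</port>
            </replica>
        </shard>
EOF
    done
}

# Print the entries for the three machines used in this article.
render_shards 19000 bigdata01 bigdata02 bigdata03
```

The printed entries are meant to be pasted between the opening and closing perftest_3shards_1replicas tags on each node; the macros section still has to be edited per machine.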

Uninstall

  • List the installed modules

    [root@bigdata01 ~]# yum list installed | grep clickhouse
    clickhouse-client.x86_64              19.15.5.18-1.el7                @Altinity_clickhouse
    clickhouse-common-static.x86_64       19.15.5.18-1.el7                @Altinity_clickhouse
    clickhouse-debuginfo.x86_64           19.15.5.18-1.el7                @Altinity_clickhouse
    clickhouse-server.x86_64              19.15.5.18-1.el7                @Altinity_clickhouse
    clickhouse-server-common.x86_64       19.15.5.18-1.el7                @Altinity_clickhouse
    
  • Uninstall the modules

    yum remove -y clickhouse-client.x86_64 clickhouse-common-static.x86_64 clickhouse-debuginfo.x86_64 clickhouse-server.x86_64 clickhouse-server-common.x86_64
    
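  • Search the whole filesystem for leftover files, then delete them

    find / -name '*clickhouse*'
    rm -rf <paths found above>
    
  • Force removal if uninstalling reports an error

    # Remove the rpm package without running its uninstall scripts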
    sudo rpm -e clickhouse-server.x86_64 --noscripts
    

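Performance test

The test uses the flight on-time dataset provided on the official site, which covers 1987 to 2017. Because storage space is limited, only the data from 2000 to 2017 is used.

Test machine: Baidu Cloud server, 2 cores / 4 GB / 40 GB / compute-optimized c3, 1 Mbps

In testing, the large dataset below did not push the machine to its performance limit.

All of the tests below are described on the official site.

  • Dataset download reference on the official site: https://clickhouse.tech/docs/zh/getting_started/example_datasets/ontime/

  • Create the table (note: log in with clickhouse-client -m; without -m, which enables multi-line input, the statement fails)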

    CREATE TABLE `ontime` (
      `Year` UInt16,
      `Quarter` UInt8,
      `Month` UInt8,
      `DayofMonth` UInt8,
      `DayOfWeek` UInt8,
      `FlightDate` Date,
      `UniqueCarrier` FixedString(7),
      `AirlineID` Int32,
      `Carrier` FixedString(2),
      `TailNum` String,
      `FlightNum` String,
      `OriginAirportID` Int32,
      `OriginAirportSeqID` Int32,
      `OriginCityMarketID` Int32,
      `Origin` FixedString(5),
      `OriginCityName` String,
      `OriginState` FixedString(2),
      `OriginStateFips` String,
      `OriginStateName` String,
      `OriginWac` Int32,
      `DestAirportID` Int32,
      `DestAirportSeqID` Int32,
      `DestCityMarketID` Int32,
      `Dest` FixedString(5),
      `DestCityName` String,
      `DestState` FixedString(2),
      `DestStateFips` String,
      `DestStateName` String,
      `DestWac` Int32,
      `CRSDepTime` Int32,
      `DepTime` Int32,
      `DepDelay` Int32,
      `DepDelayMinutes` Int32,
      `DepDel15` Int32,
      `DepartureDelayGroups` String,
      `DepTimeBlk` String,
      `TaxiOut` Int32,
      `WheelsOff` Int32,
      `WheelsOn` Int32,
      `TaxiIn` Int32,
      `CRSArrTime` Int32,
      `ArrTime` Int32,
      `ArrDelay` Int32,
      `ArrDelayMinutes` Int32,
      `ArrDel15` Int32,
      `ArrivalDelayGroups` Int32,
      `ArrTimeBlk` String,
      `Cancelled` UInt8,
      `CancellationCode` FixedString(1),
      `Diverted` UInt8,
      `CRSElapsedTime` Int32,
      `ActualElapsedTime` Int32,
      `AirTime` Int32,
      `Flights` Int32,
      `Distance` Int32,
      `DistanceGroup` UInt8,
      `CarrierDelay` Int32,
      `WeatherDelay` Int32,
      `NASDelay` Int32,
      `SecurityDelay` Int32,
      `LateAircraftDelay` Int32,
      `FirstDepTime` String,
      `TotalAddGTime` String,
      `LongestAddGTime` String,
      `DivAirportLandings` String,
      `DivReachedDest` String,
      `DivActualElapsedTime` String,
      `DivArrDelay` String,
      `DivDistance` String,
      `Div1Airport` String,
      `Div1AirportID` Int32,
      `Div1AirportSeqID` Int32,
      `Div1WheelsOn` String,
      `Div1TotalGTime` String,
      `Div1LongestGTime` String,
      `Div1WheelsOff` String,
      `Div1TailNum` String,
      `Div2Airport` String,
      `Div2AirportID` Int32,
      `Div2AirportSeqID` Int32,
      `Div2WheelsOn` String,
      `Div2TotalGTime` String,
      `Div2LongestGTime` String,
      `Div2WheelsOff` String,
      `Div2TailNum` String,
      `Div3Airport` String,
      `Div3AirportID` Int32,
      `Div3AirportSeqID` Int32,
      `Div3WheelsOn` String,
      `Div3TotalGTime` String,
      `Div3LongestGTime` String,
      `Div3WheelsOff` String,
      `Div3TailNum` String,
      `Div4Airport` String,
      `Div4AirportID` Int32,
      `Div4AirportSeqID` Int32,
      `Div4WheelsOn` String,
      `Div4TotalGTime` String,
      `Div4LongestGTime` String,
      `Div4WheelsOff` String,
      `Div4TailNum` String,
      `Div5Airport` String,
      `Div5AirportID` Int32,
      `Div5AirportSeqID` Int32,
      `Div5WheelsOn` String,
      `Div5TotalGTime` String,
      `Div5LongestGTime` String,
      `Div5WheelsOff` String,
      `Div5TailNum` String
    ) ENGINE = MergeTree
    PARTITION BY Year
    ORDER BY (Carrier, FlightDate)
    SETTINGS index_granularity = 8192;

  • Download the data (script provided by the official site)

    for s in `seq 1987 2017`
    do
        for m in `seq 1 12`
        do
            wget http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
        done
    done

  • Load the downloaded data (I use the per_test database for testing, so adjust the statement, host, and port to your setup)

    for i in *.zip; do echo "$i"; unzip -cq "$i" '*.csv' | sed 's/\.00//g' | clickhouse-client --host=127.0.0.1 --port=19000 --query="INSERT INTO per_test.ontime FORMAT CSVWithNames"; done
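
Since only part of the year range is used here, it can help to print the download URLs for a chosen range first and inspect them before fetching anything. This is a minimal sketch under that assumption: the function names ontime_url and list_ontime_urls are made up for illustration, and the URL pattern is copied from the download loop above.

```shell
#!/usr/bin/env bash
# Build the archive URL for one year and month, using the same
# pattern as the official download loop.
ontime_url() {
    printf 'http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_%s_%s.zip\n' "$1" "$2"
}

# Print one URL per month for every year in the inclusive range.
# Usage: list_ontime_urls FIRST_YEAR LAST_YEAR
list_ontime_urls() {
    local first="$1" last="$2" year month
    for year in $(seq "$first" "$last"); do
        for month in $(seq 1 12); do
            ontime_url "$year" "$month"
        done
    done
}

# Dry run: only print the list; nothing is downloaded here.
list_ontime_urls 2000 2001
```

To actually fetch the files, the list can be saved with list_ontime_urls 2000 2017 > urls.txt and handed to wget -c -i urls.txt, where -c resumes partially downloaded archives.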

  • Number of flights per day of the week from 2000 to 2008

    SELECT 
        DayOfWeek, 
        count(*) AS c
    FROM ontime
    WHERE (Year >= 2000) AND (Year <= 2008)
    GROUP BY DayOfWeek
    ORDER BY c DESC
    
    ┌─DayOfWeek─┬───────c─┐
    │         1 │ 1024694 │
    │         3 │ 1019282 │
    │         2 │ 1015141 │
    │         5 │ 1014324 │
    │         4 │ 1013083 │
    │         7 │  979170 │
    │         6 │  908404 │
    └───────────┴─────────┘
    
    7 rows in set. Elapsed: 0.042 sec. Processed 6.97 million rows, 20.92 MB (167.64 million rows/s., 502.92 MB/s.)
    
  • Number of flights delayed by more than 10 minutes, per day of the week, from 2000 to 2008

    SELECT 
        DayOfWeek, 
        count(*) AS c
    FROM ontime
    WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
    GROUP BY DayOfWeek
    ORDER BY c DESC
    
    ┌─DayOfWeek─┬──────c─┐
    │         5 │ 274999 │
    │         4 │ 254490 │
    │         7 │ 238941 │
    │         1 │ 209985 │
    │         3 │ 201997 │
    │         6 │ 183685 │
    │         2 │ 178767 │
    └───────────┴────────┘
    
    7 rows in set. Elapsed: 0.156 sec. Processed 6.97 million rows, 48.82 MB (44.71 million rows/s., 313.00 MB/s.) 
    
    
  • Number of delays of more than 10 minutes per airport from 2000 to 2008

    SELECT 
        Origin, 
        count(*) AS c
    FROM ontime
    WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
    GROUP BY Origin
    ORDER BY c DESC
    LIMIT 10
    
    ┌─Origin─┬──────c─┐
    │ ORD    │ 105023 │
    │ ATL    │  73496 │
    │ DFW    │  67485 │
    │ PHX    │  66968 │
    │ LAX    │  66964 │
    │ LAS    │  50462 │
    │ STL    │  47812 │
    │ DEN    │  46164 │
    │ SFO    │  43537 │
    │ DTW    │  43341 │
    └────────┴────────┘
    
    10 rows in set. Elapsed: 0.156 sec. Processed 6.97 million rows, 76.72 MB (44.59 million rows/s., 490.50 MB/s.) 
    
  • Percentage of flights delayed by more than 10 minutes per carrier from 2000 to 2008

    SELECT 
        Carrier, 
        c, 
        c2, 
        (c * 100) / c2 AS c3
    FROM 
    (
        SELECT 
            Carrier, 
            count(*) AS c
        FROM ontime
        WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
        GROUP BY Carrier
    )
    INNER JOIN 
    (
        SELECT 
            Carrier, 
            count(*) AS c2
        FROM ontime
        WHERE (Year >= 2000) AND (Year <= 2008)
        GROUP BY Carrier
    ) USING (Carrier)
    ORDER BY c3 DESC
    
    ┌─Carrier─┬──────c─┬──────c2─┬─────────────────c3─┐
    │ UA      │ 262451 │  915911 │ 28.654640025067938 │
    │ AS      │  51977 │  188884 │  27.51794752334766 │
    │ WN      │ 314159 │ 1148649 │ 27.350304575200955 │
    │ HP      │  69859 │  264180 │  26.44371262018321 │
    │ US      │ 185689 │  886115 │  20.95540646530078 │
    │ AA      │ 181789 │  896349 │ 20.281051242317446 │
    │ TW      │  64220 │  319764 │  20.08356162669969 │
    │ DL      │ 199886 │ 1089116 │ 18.353049629240594 │
    │ NW      │ 115102 │  667317 │  17.24847411350228 │
    │ CO      │  78593 │  474145 │ 16.575731052737034 │
    │ MQ      │  17229 │  108410 │  15.89244534637026 │
    │ AQ      │   1910 │   15258 │ 12.518023332022546 │
    └─────────┴────────┴─────────┴────────────────────┘
    
    12 rows in set. Elapsed: 0.186 sec. Processed 13.95 million rows, 83.69 MB (75.06 million rows/s., 450.37 MB/s.) 
    
    

    A better version of the query

    SELECT 
        Carrier, 
        avg(DepDelay > 10) * 100 AS c3
    FROM ontime
    WHERE (Year >= 2000) AND (Year <= 2008)
    GROUP BY Carrier
    ORDER BY c3 DESC
    
    ┌─Carrier─┬─────────────────c3─┐
    │ UA      │  28.65464002506794 │
    │ AS      │ 27.517947523347665 │
    │ WN      │  27.35030457520095 │
    │ HP      │ 26.443712620183206 │
    │ US      │  20.95540646530078 │
    │ AA      │ 20.281051242317446 │
    │ TW      │  20.08356162669969 │
    │ DL      │ 18.353049629240594 │
    │ NW      │  17.24847411350228 │
    │ CO      │ 16.575731052737034 │
    │ MQ      │ 15.892445346370259 │
    │ AQ      │ 12.518023332022546 │
    └─────────┴────────────────────┘
    
    12 rows in set. Elapsed: 0.129 sec. Processed 6.97 million rows, 55.79 MB (53.97 million rows/s., 431.75 MB/s.)
    
  • Percentage of flights delayed by more than 10 minutes, by year

    SELECT 
        Year, 
        avg(DepDelay > 10) * 100
    FROM ontime
    GROUP BY Year
    ORDER BY Year ASC
    
    ┌─Year─┬─multiply(avg(greater(DepDelay, 10)), 100)─┐
    │ 2000 │                         23.17167181619297 │
    │ 2001 │                        17.505660117222323 │
    └──────┴───────────────────────────────────────────┘
    
    2 rows in set. Elapsed: 0.084 sec. Processed 6.97 million rows, 41.84 MB (83.21 million rows/s., 499.26 MB/s.)
    
  • Most popular destinations, ranked by the number of distinct origin cities, from 2000 to 2010

    SELECT 
        DestCityName, 
        uniqExact(OriginCityName) AS u
    FROM ontime
    WHERE (Year >= 2000) AND (Year <= 2010)
    GROUP BY DestCityName
    ORDER BY u DESC
    LIMIT 10
    
    ┌─DestCityName──────────┬───u─┐
    │ Chicago, IL           │ 117 │
    │ Dallas/Fort Worth, TX │ 115 │
    │ Atlanta, GA           │ 100 │
    │ Minneapolis, MN       │  88 │
    │ Houston, TX           │  81 │
    │ Detroit, MI           │  81 │
    │ St. Louis, MO         │  76 │
    │ Charlotte, NC         │  70 │
    │ Pittsburgh, PA        │  69 │
    │ Newark, NJ            │  67 │
    └───────────────────────┴─────┘
    
    10 rows in set. Elapsed: 0.559 sec. Processed 6.97 million rows, 322.81 MB (12.47 million rows/s., 577.07 MB/s.)
    
  • Q10

    SELECT 
        min(Year), 
        max(Year), 
        Carrier, 
        count(*) AS cnt, 
        sum(ArrDelayMinutes > 30) AS flights_delayed, 
        round(sum(ArrDelayMinutes > 30) / count(*), 2) AS rate
    FROM ontime
    WHERE (DayOfWeek NOT IN (6, 7)) AND (OriginState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (DestState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (FlightDate < '2010-01-01')
    GROUP BY Carrier
    HAVING (cnt > 100000) AND (max(Year) > 1990)
    ORDER BY rate DESC
    LIMIT 1000
    
    ┌─min(Year)─┬─max(Year)─┬─Carrier─┬────cnt─┬─flights_delayed─┬─rate─┐
    │      2000 │      2001 │ UA      │ 649862 │          121889 │ 0.19 │
    │      2000 │      2001 │ HP      │ 192068 │           27480 │ 0.14 │
    │      2000 │      2001 │ AA      │ 615877 │           79539 │ 0.13 │
    │      2000 │      2001 │ US      │ 638984 │           79708 │ 0.12 │
    │      2000 │      2001 │ TW      │ 224711 │           25778 │ 0.11 │
    │      2000 │      2001 │ WN      │ 855501 │           97260 │ 0.11 │
    │      2000 │      2001 │ NW      │ 483807 │           52931 │ 0.11 │
    │      2000 │      2001 │ CO      │ 349268 │           38560 │ 0.11 │
    │      2000 │      2001 │ DL      │ 764713 │           79954 │  0.1 │
    └───────────┴───────────┴─────────┴────────┴─────────────────┴──────┘
    
    9 rows in set. Elapsed: 0.370 sec. Processed 6.97 million rows, 84.86 MB (18.86 million rows/s., 229.45 MB/s.)
    
  • Multi-dimensional query 1

    SELECT 
        Year, 
        OriginCityName, 
        DepartureDelayGroups, 
        CancellationCode, 
        avg(DepDelay > 10) * 100
    FROM ontime
    GROUP BY 
        Year, 
        OriginCityName, 
        DepartureDelayGroups, 
        CancellationCode
    ORDER BY Year ASC
    LIMIT 1
    
    ┌─Year─┬─OriginCityName───┬─DepartureDelayGroups─┬─CancellationCode─┬─multiply(avg(greater(DepDelay, 10)), 100)─┐
    │ 2000 │ Fayetteville, NC │ 7                    │                  │                                       100 │
    └──────┴──────────────────┴──────────────────────┴──────────────────┴───────────────────────────────────────────┘
    
    1 rows in set. Elapsed: 0.613 sec. Processed 6.97 million rows, 275.48 MB (11.38 million rows/s., 449.40 MB/s.)
    

Overall, the tests show excellent performance. ClickHouse is a powerful tool for OLAP analysis.
