Notes on the process
Overview
This is my own summary of what I understood; see the official site for details.
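- A column-oriented database developed by the Russian search company Yandex and open-sourced in 2016
- Mainly used for online analytical processing (OLAP); it does not support transactions, so it is not suited to OLTP
- ClickHouse Chinese community
- ClickHouse Chinese official site
- Its strength is queries over large wide tables; queries that join several large tables perform worse than in typical OLAP tools
- Its extreme performance comes from squeezing everything out of the server hardware
- Queries over datasets of ten billion rows can return within seconds
- Storage is column-oriented, so aggregate queries such as count are fast, and the compression ratio is much higher than with typical storage formats
- When creating a table, you need to choose a suitable table engine to get better query performance
- Supports indexes; suitable for online queries
- Concurrency is low (the official recommendation is 100), but the CHproxy add-on can raise it
- Every SQL statement uses the server's resources at full capacity to reach maximum performance
- Near-standard SQL; only a small part differs from the SQL:2003 standard
- Replication keeps data safe
- Supports approximate computation
- Supports real-time data updates; large batch updates perform better
- Uses a vectorized execution engine to improve CPU utilization
- The sparse index makes ClickHouse a poor fit for point queries that fetch a single row by key
- Deleting or modifying data is only possible in batches
- Ubuntu and Debian are better supported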
Environment
- CentOS 7 (most production environments still run CentOS, so it is used here as well)
Single-node installation
ClickHouse can run on Linux, FreeBSD, or Mac OS X on any x86_64, AArch64, or PowerPC64LE CPU architecture
- Check whether the environment is supported
[root@bigdata01 module]# grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
SSE 4.2 supported
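- Raise the open-file limit on CentOS
Edit the following two files:
vim /etc/security/limits.conf
vim /etc/security/limits.d/20-nproc.conf
Note: in some environments the file is not called 20-nproc.conf; run ls /etc/security/limits.d first to see its actual name.
Add the following lines, including the leading *:
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072
The change takes effect after the server is rebooted; check the result with ulimit -n or ulimit -a.
- Install dependencies
yum install -y libtool
yum install -y *unixODBC*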
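After the reboot, the limits written to limits.conf above can be compared against the required values instead of reading the ulimit output by eye. The following is a minimal bash sketch; the function names limit_ok and report_limit are made up for illustration, and the thresholds are simply the ones used in the step above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the current limit meets the required minimum.
# A value of "unlimited" always satisfies the requirement.
limit_ok() {
    local current="$1" required="$2"
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}

# Hypothetical helper: print a one-line verdict for a named limit.
report_limit() {
    local name="$1" current="$2" required="$3"
    if limit_ok "$current" "$required"; then
        echo "$name OK (current: $current, required: $required)"
    else
        echo "$name TOO LOW (current: $current, required: $required)"
    fi
}

# ulimit -n is the open-file limit, ulimit -u the per-user process limit (bash).
report_limit nofile "$(ulimit -n)" 65536
report_limit nproc "$(ulimit -u)" 131072
```

Run it in a fresh login shell as the user that will start the server, since limits are applied per session.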
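- About downloads: nothing needs to be downloaded by hand, because yum can install directly; the following is for reference only
Official site: https://clickhouse.yandex/
Download locations: http://repo.red-soft.biz/repos/clickhouse/stable/el7/
https://packagecloud.io/Altinity/clickhouse
A release from about half a year ago is used here; ClickHouse releases come out quickly, so pay attention to what changes between versions.
Installed version: *-19.15.5.18-1.el7.x86_64.rpm
Packages:
- clickhouse-test-19.15.5.18-1.el7.x86_64.rpm (test module, optional)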
- clickhouse-server-common-19.15.5.18-1.el7.x86_64.rpm
- clickhouse-server-19.15.5.18-1.el7.x86_64.rpm
- clickhouse-debuginfo-19.15.5.18-1.el7.x86_64.rpm
- clickhouse-common-static-19.15.5.18-1.el7.x86_64.rpm
- clickhouse-client-19.15.5.18-1.el7.x86_64.rpm
- Add the yum repository
curl -s https://packagecloud.io/install/repositories/Altinity/clickhouse/script.rpm.sh | sudo bash
- Install with yum
The commands below install a specific version; to install the latest version, just run sudo yum install -y clickhouse-server clickhouse-client
sudo yum install clickhouse-server-common-19.15.5.18-1.el7.x86_64
sudo yum install clickhouse-server-19.15.5.18-1.el7.x86_64
# Note: the command above also installs clickhouse-common-static as a dependency
sudo yum install clickhouse-debuginfo-19.15.5.18-1.el7.x86_64
sudo yum install clickhouse-client-19.15.5.18-1.el7.x86_64
Check the installation:
sudo yum list installed 'clickhouse*'
- File layout of the installed components
From https://packagecloud.io/Altinity/clickhouse you can click through to a given version and component to see its file list; a few of the more frequently needed paths are listed here:
/etc/clickhouse-client/config.xml
/usr/bin/clickhouse-client
/usr/bin/clickhouse-benchmark
/etc/clickhouse-server/users.xml
/etc/clickhouse-server/config.xml
/usr/bin/clickhouse-server
/etc/security/limits.d/clickhouse.conf
/etc/init.d/clickhouse-server
/etc/cron.d/clickhouse-server
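Common configuration
- Server configuration
Remember to restart the service after changing the server configuration.
The configuration files are in the /etc/clickhouse-server directory.
  - users.xml: user configuration. By default there is a default user with no password. To add a user, follow the way the default user is configured, that is, add another element of the same form.
  - config.xml: service configuration. Ports, bind IP, security settings, and so on can be changed here.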
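- Client configuration
  - When clickhouse-client is run, it reads /etc/clickhouse-client/config.xml by default to start the client
  - The -c option points to a different config.xml, for example clickhouse-client -c /opt/software/config.xml
  - /etc/clickhouse-client/config.xml holds the settings for connecting to the server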
- Start / check the service
service clickhouse-server start
service clickhouse-server status
[root@bigdata01 ~]# service clickhouse-server start
Start clickhouse-server service: Path to data directory in /etc/clickhouse-server/config.xml: /var/lib/clickhouse/
DONE
[root@bigdata01 ~]# service clickhouse-server status
clickhouse-server service is running
Command-line client
[root@bigdata01 ~]# clickhouse-client
ClickHouse client version 19.15.5.18.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.15.5 revision 54426.
bigdata01 :) show databases;
SHOW DATABASES
┌─name────┐
│ default │
│ system  │
└─────────┘
2 rows in set. Elapsed: 0.001 sec.
To specify the port or server address, add the options --port 8080 --host 127.0.0.1
Distributed cluster installation
Prerequisite: every machine has already been installed following the single-node steps above.
- On every machine, edit /etc/clickhouse-server/config.xml
<listen_host>::</listen_host>
<!-- <listen_host>::1</listen_host> -->
<!-- <listen_host>127.0.0.1</listen_host> -->
- On every machine, create a metrika.xml file under /etc
vim /etc/metrika.xml
Add the following content:
<yandex>
    <clickhouse_remote_servers>
        <!-- For a cluster of 3 shards with 1 replica each, name the tag like this -->
        <perftest_3shards_1replicas>
            <!-- One shard entry per machine -->
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata01</host>
                    <port>19000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata02</host>
                    <port>19000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata03</host>
                    <port>19000</port>
                </replica>
            </shard>
        </perftest_3shards_1replicas>
    </clickhouse_remote_servers>
    <!-- ZooKeeper cluster configuration -->
    <zookeeper-servers>
        <node index="1">
            <host>bigdata01</host>
            <port>32181</port>
        </node>
        <node index="2">
            <host>bigdata02</host>
            <port>32181</port>
        </node>
        <node index="3">
            <host>bigdata03</host>
            <port>32181</port>
        </node>
    </zookeeper-servers>
    <!-- macros configuration: use the host of the current machine -->
    <macros>
        <replica>bigdata01</replica>
    </macros>
    <networks>
        <ip>::/0</ip>
    </networks>
    <clickhouse_compression>
        <case>
            <min_part_size>10000000000</min_part_size>
            <min_part_size_ratio>0.01</min_part_size_ratio>
            <method>lz4</method>
        </case>
    </clickhouse_compression>
</yandex>
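Since the shard list is identical on every machine and only the macros entry differs per host, the repetitive parts of metrika.xml can be generated rather than typed. The following is a rough bash sketch that prints those fragments to standard output; shard_blocks and macros_block are hypothetical helper names, and the host names and port 19000 are the ones from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one <shard> element per host.
# Usage: shard_blocks PORT HOST...
shard_blocks() {
    local port="$1"
    shift
    local host
    for host in "$@"; do
        printf '<shard>\n'
        printf '    <internal_replication>true</internal_replication>\n'
        printf '    <replica>\n'
        printf '        <host>%s</host>\n' "$host"
        printf '        <port>%s</port>\n' "$port"
        printf '    </replica>\n'
        printf '</shard>\n'
    done
}

# Hypothetical helper: print the <macros> element for one machine.
macros_block() {
    printf '<macros>\n    <replica>%s</replica>\n</macros>\n' "$1"
}

# Fragments for the three-node example; paste them into metrika.xml by hand.
shard_blocks 19000 bigdata01 bigdata02 bigdata03
macros_block "$(uname -n)"
```

The output is only a fragment; it still has to be placed inside the cluster tag and the yandex root element shown above.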
- Start the service on every machine
Note: start ZooKeeper first.
Uninstall
- List the installed packages
[root@bigdata01 ~]# yum list installed | grep clickhouse
clickhouse-client.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
clickhouse-common-static.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
clickhouse-debuginfo.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
clickhouse-server.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
clickhouse-server-common.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
- Remove the packages
yum remove -y clickhouse-client.x86_64 clickhouse-common-static.x86_64 clickhouse-debuginfo.x86_64 clickhouse-server.x86_64 clickhouse-server-common.x86_64
- Search the whole system for leftover files, then delete them
find / -name 'clickhouse'
rm -rf <paths found above>
- Force removal when the uninstall reports an error
# Remove the rpm package without running its uninstall scripts
sudo rpm -e clickhouse-server.x86_64 --noscripts
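For the leftover-file search in the uninstall steps, it is safer to review a list before anything is deleted. A minimal sketch, assuming bash and find; list_leftovers is a hypothetical helper, and the directories passed to it are only a guess at where ClickHouse files usually end up. Unlike the exact-name search above, it matches every name that starts with clickhouse, such as clickhouse-server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every file or directory whose name starts with
# "clickhouse" under the given roots, without descending into directories
# that already matched.
list_leftovers() {
    find "$@" -name 'clickhouse*' -prune -print 2>/dev/null
}

# Review this list by hand first.
list_leftovers /etc /var/lib /var/log /usr/bin

# Only after checking the list, delete what it printed, for example:
# list_leftovers /etc /var/lib /var/log /usr/bin | xargs -r rm -rf
```

The delete line is left commented out on purpose, because rm -rf on an unchecked list is hard to undo.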
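Performance test
The test uses the flight data provided on the official site, which covers 1987 to 2017. Because storage space is limited, only the data from 2000 to 2017 is used.
Test machine: Baidu Cloud server, 2 cores / 4 GB / 40 GB / compute type c3, 1 Mbps
In testing, the dataset below did not push the machine to its performance limit.
All of the tests below are described on the official site.
- Official reference for downloading the data: https://clickhouse.tech/docs/zh/getting_started/example_datasets/ontime/
- Create the table (note: connect with clickhouse-client -m; without -m to enable multi-line input, the statement fails)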
CREATE TABLE `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree
PARTITION BY Year
ORDER BY (Carrier, FlightDate)
SETTINGS index_granularity = 8192;
- Download the data (script provided by the official docs)
for s in `seq 1987 2017`
do
for m in `seq 1 12`
do
wget http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
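When only part of the 1987 to 2017 range is needed, as in this test, the download loop can be parameterized by year. A minimal sketch, assuming bash and seq are available; ontime_urls is a hypothetical helper name, and the URL pattern is the one from the loop above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one download URL per month for the given
# inclusive year range. Usage: ontime_urls FIRST_YEAR LAST_YEAR
ontime_urls() {
    local first="$1" last="$2" year month
    for year in $(seq "$first" "$last"); do
        for month in $(seq 1 12); do
            echo "http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${year}_${month}.zip"
        done
    done
}

# Dry run: show how many files the chosen range would fetch.
echo "files to download: $(ontime_urls 2000 2017 | wc -l)"

# Actual download, one wget call per URL:
# ontime_urls 2000 2017 | xargs -n 1 wget
```

Printing the list first makes it easy to estimate disk usage before committing to the download.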
- Load the downloaded data (I use the per_test database for testing, so adjust the statement accordingly; mind the host and port)
for i in *.zip; do
    echo "$i"
    unzip -cq "$i" '*.csv' | sed 's/\.00//g' | clickhouse-client --host=127.0.0.1 --port=19000 --query="INSERT INTO per_test.ontime FORMAT CSVWithNames"
done
- Number of flights per day of the week, 2000 to 2008
SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE (Year >= 2000) AND (Year <= 2008)
GROUP BY DayOfWeek
ORDER BY c DESC
┌─DayOfWeek─┬───────c─┐
│         1 │ 1024694 │
│         3 │ 1019282 │
│         2 │ 1015141 │
│         5 │ 1014324 │
│         4 │ 1013083 │
│         7 │  979170 │
│         6 │  908404 │
└───────────┴─────────┘
7 rows in set. Elapsed: 0.042 sec. Processed 6.97 million rows, 20.92 MB (167.64 million rows/s., 502.92 MB/s.)
- Number of flights delayed by more than 10 minutes per day of the week, 2000 to 2008
SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
GROUP BY DayOfWeek
ORDER BY c DESC
┌─DayOfWeek─┬──────c─┐
│         5 │ 274999 │
│         4 │ 254490 │
│         7 │ 238941 │
│         1 │ 209985 │
│         3 │ 201997 │
│         6 │ 183685 │
│         2 │ 178767 │
└───────────┴────────┘
7 rows in set. Elapsed: 0.156 sec. Processed 6.97 million rows, 48.82 MB (44.71 million rows/s., 313.00 MB/s.)
- Number of delays of more than 10 minutes per airport, 2000 to 2008
SELECT Origin, count(*) AS c
FROM ontime
WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
GROUP BY Origin
ORDER BY c DESC
LIMIT 10
┌─Origin─┬──────c─┐
│ ORD    │ 105023 │
│ ATL    │  73496 │
│ DFW    │  67485 │
│ PHX    │  66968 │
│ LAX    │  66964 │
│ LAS    │  50462 │
│ STL    │  47812 │
│ DEN    │  46164 │
│ SFO    │  43537 │
│ DTW    │  43341 │
└────────┴────────┘
10 rows in set. Elapsed: 0.156 sec. Processed 6.97 million rows, 76.72 MB (44.59 million rows/s., 490.50 MB/s.)
- Percentage of flights delayed by more than 10 minutes per carrier, 2000 to 2008
SELECT Carrier, c, c2, (c * 100) / c2 AS c3
FROM
(
    SELECT Carrier, count(*) AS c
    FROM ontime
    WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
    GROUP BY Carrier
)
INNER JOIN
(
    SELECT Carrier, count(*) AS c2
    FROM ontime
    WHERE (Year >= 2000) AND (Year <= 2008)
    GROUP BY Carrier
) USING (Carrier)
ORDER BY c3 DESC
┌─Carrier─┬──────c─┬──────c2─┬─────────────────c3─┐
│ UA      │ 262451 │  915911 │ 28.654640025067938 │
│ AS      │  51977 │  188884 │  27.51794752334766 │
│ WN      │ 314159 │ 1148649 │ 27.350304575200955 │
│ HP      │  69859 │  264180 │  26.44371262018321 │
│ US      │ 185689 │  886115 │  20.95540646530078 │
│ AA      │ 181789 │  896349 │ 20.281051242317446 │
│ TW      │  64220 │  319764 │  20.08356162669969 │
│ DL      │ 199886 │ 1089116 │ 18.353049629240594 │
│ NW      │ 115102 │  667317 │  17.24847411350228 │
│ CO      │  78593 │  474145 │ 16.575731052737034 │
│ MQ      │  17229 │  108410 │  15.89244534637026 │
│ AQ      │   1910 │   15258 │ 12.518023332022546 │
└─────────┴────────┴─────────┴────────────────────┘
12 rows in set. Elapsed: 0.186 sec. Processed 13.95 million rows, 83.69 MB (75.06 million rows/s., 450.37 MB/s.)
A better version of the query
SELECT Carrier, avg(DepDelay > 10) * 100 AS c3
FROM ontime
WHERE (Year >= 2000) AND (Year <= 2008)
GROUP BY Carrier
ORDER BY c3 DESC
┌─Carrier─┬─────────────────c3─┐
│ UA      │  28.65464002506794 │
│ AS      │ 27.517947523347665 │
│ WN      │  27.35030457520095 │
│ HP      │ 26.443712620183206 │
│ US      │  20.95540646530078 │
│ AA      │ 20.281051242317446 │
│ TW      │  20.08356162669969 │
│ DL      │ 18.353049629240594 │
│ NW      │  17.24847411350228 │
│ CO      │ 16.575731052737034 │
│ MQ      │ 15.892445346370259 │
│ AQ      │ 12.518023332022546 │
└─────────┴────────────────────┘
12 rows in set. Elapsed: 0.129 sec. Processed 6.97 million rows, 55.79 MB (53.97 million rows/s., 431.75 MB/s.)
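The shorter query gives the same percentages because the comparison DepDelay > 10 evaluates to 0 or 1, so its average is the delayed fraction, which is exactly what the JOIN version obtains by dividing two counts. It also scans the table once instead of twice, which matches the row counts reported above. The sketch below reproduces that arithmetic outside ClickHouse with awk on five made-up delay values; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# JOIN-style computation: count the delayed rows and all rows, then divide.
# Reads one delay value per line; expects at least one line of input.
percent_by_counts() {
    awk '{ total++ } $1 > 10 { delayed++ } END { printf "%.2f\n", delayed * 100 / total }'
}

# avg-style computation: average the 0/1 result of the comparison.
percent_by_avg() {
    awk '{ sum += ($1 > 10); n++ } END { printf "%.2f\n", sum / n * 100 }'
}

# Five made-up delays; three of them exceed 10 minutes.
printf '%s\n' 5 12 30 0 11 | percent_by_counts   # prints 60.00
printf '%s\n' 5 12 30 0 11 | percent_by_avg      # prints 60.00
```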
- Percentage of flights delayed by more than 10 minutes, per year
SELECT
Year,
avg(DepDelay > 10) * 100
FROM ontime
GROUP BY Year
ORDER BY Year ASC
┌─Year─┬─multiply(avg(greater(DepDelay, 10)), 100)─┐
│ 2000 │                         23.17167181619297 │
│ 2001 │                        17.505660117222323 │
└──────┴───────────────────────────────────────────┘
2 rows in set. Elapsed: 0.084 sec. Processed 6.97 million rows, 41.84 MB (83.21 million rows/s., 499.26 MB/s.)
- Most popular destinations, measured by the number of distinct origin cities, 2000 to 2010
SELECT DestCityName, uniqExact(OriginCityName) AS u
FROM ontime
WHERE (Year >= 2000) AND (Year <= 2010)
GROUP BY DestCityName
ORDER BY u DESC
LIMIT 10
┌─DestCityName──────────┬───u─┐
│ Chicago, IL           │ 117 │
│ Dallas/Fort Worth, TX │ 115 │
│ Atlanta, GA           │ 100 │
│ Minneapolis, MN       │  88 │
│ Houston, TX           │  81 │
│ Detroit, MI           │  81 │
│ St. Louis, MO         │  76 │
│ Charlotte, NC         │  70 │
│ Pittsburgh, PA        │  69 │
│ Newark, NJ            │  67 │
└───────────────────────┴─────┘
10 rows in set. Elapsed: 0.559 sec. Processed 6.97 million rows, 322.81 MB (12.47 million rows/s., 577.07 MB/s.)
- Q10
SELECT
    min(Year),
    max(Year),
    Carrier,
    count(*) AS cnt,
    sum(ArrDelayMinutes > 30) AS flights_delayed,
    round(sum(ArrDelayMinutes > 30) / count(*), 2) AS rate
FROM ontime
WHERE (DayOfWeek NOT IN (6, 7)) AND (OriginState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (DestState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (FlightDate < '2010-01-01')
GROUP BY Carrier
HAVING (cnt > 100000) AND (max(Year) > 1990)
ORDER BY rate DESC
LIMIT 1000
┌─min(Year)─┬─max(Year)─┬─Carrier─┬────cnt─┬─flights_delayed─┬─rate─┐
│      2000 │      2001 │ UA      │ 649862 │          121889 │ 0.19 │
│      2000 │      2001 │ HP      │ 192068 │           27480 │ 0.14 │
│      2000 │      2001 │ AA      │ 615877 │           79539 │ 0.13 │
│      2000 │      2001 │ US      │ 638984 │           79708 │ 0.12 │
│      2000 │      2001 │ TW      │ 224711 │           25778 │ 0.11 │
│      2000 │      2001 │ WN      │ 855501 │           97260 │ 0.11 │
│      2000 │      2001 │ NW      │ 483807 │           52931 │ 0.11 │
│      2000 │      2001 │ CO      │ 349268 │           38560 │ 0.11 │
│      2000 │      2001 │ DL      │ 764713 │           79954 │  0.1 │
└───────────┴───────────┴─────────┴────────┴─────────────────┴──────┘
9 rows in set. Elapsed: 0.370 sec. Processed 6.97 million rows, 84.86 MB (18.86 million rows/s., 229.45 MB/s.)
- Q multi-dimensional 1
SELECT
    Year,
    OriginCityName,
    DepartureDelayGroups,
    CancellationCode,
    avg(DepDelay > 10) * 100
FROM ontime
GROUP BY Year, OriginCityName, DepartureDelayGroups, CancellationCode
ORDER BY Year ASC
LIMIT 1
┌─Year─┬─OriginCityName───┬─DepartureDelayGroups─┬─CancellationCode─┬─multiply(avg(greater(DepDelay, 10)), 100)─┐
│ 2000 │ Fayetteville, NC │ 7                    │                  │                                       100 │
└──────┴──────────────────┴──────────────────────┴──────────────────┴───────────────────────────────────────────┘
1 rows in set. Elapsed: 0.613 sec. Processed 6.97 million rows, 275.48 MB (11.38 million rows/s., 449.40 MB/s.)
Overall, the tests show excellent performance. ClickHouse is a powerful tool for OLAP analysis.