TPC-DS用於Clickhouse和Doris性能測試

大致介紹

TPC-DS採用星型、雪花型等多維數據模式。它包含7張事實表,17張緯度表平均每張表含有18列。其工作負載包含99個SQL查詢,覆蓋SQL99和2003的核心部分以及OLAP。這個測試集包含對大數據集的統計、報表生成、聯機查詢、數據挖掘等複雜應用,測試用的數據和值是有傾斜的,與真實數據一致。可以說TPC-DS是與真實場景非常接近的一個測試集,也是難度較大的一個測試集。
Clickhouse是俄羅斯Yandex公司開源的一個非常快的數據管理系統,性能非常強悍。Apache Doris是百度開源的另一個基於 MPP 的交互式 SQL 數據倉庫,主要用於解決報表和多維分析,成熟穩定。

下載編譯

下載官方網址
也可以通過Git下載

git clone https://github.com/gregrahn/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX
tpcds.sql是建表SQL
[tools]$ ll *.sql
-rw-r----- 1 prodadmin prodadmin 13875 Mar 12 03:44 tpcds_ri.sql
-rw-r----- 1 prodadmin prodadmin 22153 Mar 12 03:44 tpcds_source.sql
-rw-r----- 1 prodadmin prodadmin 30001 Mar 12 03:44 tpcds.sql
dsdgen是生成數據的工具,dsqgen是生成Query的工具
[tools]$ ll ds*
-rwxr-x--- 1 prodadmin prodadmin 455880 Apr 16 19:15 dsdgen
-rwxr-x--- 1 prodadmin prodadmin 292254 Apr 16 19:15 dsqgen
在query_templates中是query模板
[tools]$ ls ../query_templates/
ansi.tpl     query12.tpl  query18.tpl  query23.tpl  query29.tpl  query34.tpl  query3.tpl   query45.tpl  query50.tpl  query56.tpl  query61.tpl  query67.tpl  query72.tpl  query78.tpl  query83.tpl  query89.tpl  query94.tpl  query9.tpl
db2.tpl      query13.tpl  query19.tpl  query24.tpl  query2.tpl   query35.tpl  query40.tpl  query46.tpl  query51.tpl  query57.tpl  query62.tpl  query68.tpl  query73.tpl  query79.tpl  query84.tpl  query8.tpl   query95.tpl  README
netezza.tpl  query14.tpl  query1.tpl   query25.tpl  query30.tpl  query36.tpl  query41.tpl  query47.tpl  query52.tpl  query58.tpl  query63.tpl  query69.tpl  query74.tpl  query7.tpl   query85.tpl  query90.tpl  query96.tpl  sqlserver.tpl
oracle.tpl   query15.tpl  query20.tpl  query26.tpl  query31.tpl  query37.tpl  query42.tpl  query48.tpl  query53.tpl  query59.tpl  query64.tpl  query6.tpl   query75.tpl  query80.tpl  query86.tpl  query91.tpl  query97.tpl  templates.lst
query10.tpl  query16.tpl  query21.tpl  query27.tpl  query32.tpl  query38.tpl  query43.tpl  query49.tpl  query54.tpl  query5.tpl   query65.tpl  query70.tpl  query76.tpl  query81.tpl  query87.tpl  query92.tpl  query98.tpl
query11.tpl  query17.tpl  query22.tpl  query28.tpl  query33.tpl  query39.tpl  query44.tpl  query4.tpl   query55.tpl  query60.tpl  query66.tpl  query71.tpl  query77.tpl  query82.tpl  query88.tpl  query93.tpl  query99.tpl

建表語句

請根據Hive、Doris、Clickhouse等組件特點,修改建表語句,請注意,列是否爲空,列的順序等和後面步驟的導入數據密切相關,請勿輕易修改。
1 Clickhouse數據類型
2 Doris建表和數據類型

create table dbgen_version
create table customer_address
create table customer_demographics
create table date_dim
create table warehouse
create table ship_mode
create table time_dim
create table reason
create table income_band
create table item
create table store
create table call_center
create table customer
create table web_site
create table store_returns
create table household_demographics
create table web_page
create table promotion
create table catalog_page
create table inventory
create table catalog_returns
create table web_returns
create table web_sales
create table catalog_sales
create table store_sales

數據生成

可以建一個腳本,來生成數據
1 數據分隔符是“|”,空值默認爲空。Clickhouse支持的格式Doris Load格式
2 scale單位爲G,指生成的數據量大小,paralle指分割多少個文件,child指第幾個文件

[tools]$ cat build_data_tsv.h
echo $1
mkdir ../../data_tsv/
nohup ./dsdgen  -scale 100 -dir ../../data_tsv/ -paralle 10 -child $1 > child$1.log &

數據導入

  • Clickhouse數據導入,一個例子,我寫了一個convert.py的小腳本,處理分隔符、空值、列順序等問題
if [ ! -f "./data_tsv/dbgen_version_$1_10.done" ]; then
cat ./data_tsv/dbgen_version_$1_10.dat|python convert.py ck|clickhouse-client --query="INSERT INTO default.dbgen_version_dist FORMAT CSV"
touch ./data_tsv/dbgen_version_$1_10.done
fi
  • Doris數據導入
cat ./data_tsv/dbgen_version_$1_10.dat|python convert.py dr dbgen_version|curl --location-trusted -u root: -H "label:dbgen_version_$1_10_" -H "timeout:1200" -T - http://ip:port/api/testdb/dbgen_version/_stream_load

生成Query

cat build_sql.sh
./dsqgen \
-DIRECTORY ../query_templates \
-INPUT ../query_templates/templates.lst \
-VERBOSE Y \
-QUALIFY Y \
-SCALE 100 \
-DIALECT sqlserver \
-OUTPUT_DIR ../../query/

Query改寫

比如如下的SQL,在Doris中是可以正確的執行的,但是在Clickhouse中不行,CK中需要子查詢嵌套或者用global inner join來顯示指定broadcast的字表。

select
  c_last_name,c_first_name,substr(s_city,1,30),ss_ticket_number,amt,profit
  from
   (select ss_ticket_number
          ,ss_customer_sk
          ,store.s_city
          ,sum(ss_coupon_amt) amt
          ,sum(ss_net_profit) profit
    from store_sales,date_dim,store,household_demographics
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk
    and store_sales.ss_store_sk = store.s_store_sk
    and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
    and (household_demographics.hd_dep_count = 1 or household_demographics.hd_vehicle_count > -1)
    and date_dim.d_dow = 1
    and date_dim.d_year in (2000,2000+1,2000+2)
    and store.s_number_employees between 200 and 295
    group by ss_ticket_number,ss_customer_sk,ss_addr_sk,store.s_city) ms,customer
    where ss_customer_sk = c_customer_sk
 order by c_last_name,c_first_name,substr(s_city,1,30), profit
 limit 10;

改寫後

select c_last_name, c_first_name, substr(tbl1.s_city,1,30), ss_ticket_number, amt, profit
from customer_dist
global inner join (
   select ss_ticket_number, ss_customer_sk, s_city, sum(ss_coupon_amt) as amt, sum(ss_net_profit) as profit
   from store_sales_dist
   global inner join ( select d_date_sk from date_dim_dist where date_dim_dist.d_dow = 1 and d_year in (2000,2000+1,2000+2) ) on ss_sold_date_sk = d_date_sk
   global inner join ( select s_store_sk, s_city from store_dist where s_number_employees between 200 and 295 ) on ss_store_sk = s_store_sk
   global inner join ( select hd_demo_sk from household_demographics_dist where hd_dep_count = 1 or hd_vehicle_count > -1 ) on ss_hdemo_sk = hd_demo_sk
   group by ss_ticket_number, ss_customer_sk, ss_addr_sk, s_city
) tbl1 on ss_customer_sk = c_customer_sk
order by c_last_name, c_first_name, substr(s_city,1,30), profit
limit 10;

測試結論

1 導入數據Clickhouse快
2 數據壓縮率Clickhouse好
3 單表查詢Clickhouse快
4 Join查詢兩者各有優劣,數據量小情況下Clickhouse好,數據量大Doris好
5 Doris對SQL支持情況要好

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章