Preface
- OS: CentOS 7
- Hadoop: 2.7.7
- Hive: 2.3.0
- Objective: measure the execution time of deduplication queries that use DISTINCT versus GROUP BY on a column, across different numbers of distinct values in that column, and compare the deduplication performance of the two approaches in each scenario.
- Results table (timings filled in during the experiment):
Table | Rows | Distinct values in queried column | DISTINCT | GROUP BY |
---|---|---|---|---|
tab_1 | 100000 | 3 | | |
tab_2 | 100000 | 10000 | | |
Procedure
1) Create the test tables
drop table if exists tab_1;
create table tab_1(
id int,
value int
)
row format delimited
fields terminated by '\t';
drop table if exists tab_2;
create table tab_2 like tab_1;
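Note that create table ... like copies only the table definition, not any data. Assuming a hive or beeline session, the copied schema can be double-checked with:
show create table tab_2;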
2) Load the test datasets
Test dataset tab_1.txt:
1 1
2 1
3 1
4 3
5 1
...
99997 3
99998 2
99999 3
100000 2
Test dataset tab_2.txt:
1 3715
2 7211
3 4909
4 2913
5 9839
...
99997 2884
99998 698
99999 4839
100000 2101
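The two files above were prepared outside Hive. As an alternative sketch, equivalent data could be synthesized inside Hive instead of the load statements below: the posexplode-over-split trick produces 100000 rows, and the pmod/rand expressions are assumed stand-ins that yield 3 and roughly 10000 distinct values respectively (rand() will not guarantee exactly 10000):
-- tab_1: ids 1..100000, value cycling through 1..3
insert overwrite table tab_1
select pos + 1 as id, pmod(pos, 3) + 1 as value
from (select posexplode(split(space(99999), ' ')) as (pos, x)) t;
-- tab_2: ids 1..100000, value drawn uniformly from 1..10000
insert overwrite table tab_2
select pos + 1 as id, cast(floor(rand() * 10000) + 1 as int) as value
from (select posexplode(split(space(99999), ' ')) as (pos, x)) t;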
Load each dataset into its corresponding table:
load data local inpath '/tmp/hive/data/tbl/tab_1.txt' overwrite into table tab_1;
load data local inpath '/tmp/hive/data/tbl/tab_2.txt' overwrite into table tab_2;
3) Run the queries and record execution times
Disable automatic local mode, so the queries run as real MapReduce jobs rather than locally:
hive> set hive.exec.mode.local.auto = false;
Manually set the number of reducers:
hive> set mapreduce.job.reduces = 3;
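A property can be echoed back by issuing set with just its name, to confirm the value took effect (the output below is what a Hive 2.x session would be expected to print):
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=3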
Run the test queries and record each execution time:
select distinct(value) from tab_1; -- 31.335s
select value from tab_1 group by value; -- 31.587s
select distinct(value) from tab_2; -- 32.376s
select value from tab_2 group by value; -- 33.834s
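As a quick sanity check before comparing timings, the distinct counts can be confirmed directly (expected values per the design table above):
select count(distinct value) from tab_1; -- expect 3
select count(distinct value) from tab_2; -- expect 10000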
4) Compare the execution plans
0: jdbc:hive2://hadoop101:10000/default (test)> explain select distinct(value) from tab_1;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tab_1 |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: value (type: int) |
| outputColumnNames: value |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: value (type: int) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: int) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: int) |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: int) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
0: jdbc:hive2://hadoop101:10000/default (test)> explain select value from tab_1 group by value;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tab_1 |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: value (type: int) |
| outputColumnNames: value |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: value (type: int) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: int) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: int) |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: int) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
5) Results
Table | Rows | Distinct values in queried column | DISTINCT | GROUP BY |
---|---|---|---|---|
tab_1 | 100000 | 3 | 31.335s | 31.587s |
tab_2 | 100000 | 10000 | 32.376s | 33.834s |
Conclusion:
In Hive 2.3.0, deduplicating with DISTINCT and with GROUP BY yields identical execution plans (as the two EXPLAIN outputs above show, DISTINCT is rewritten into the same hash-mode map-side Group By Operator followed by a mergepartial reduce) and roughly equal execution times, so the two approaches are essentially equivalent in deduplication performance.
If anything in the procedure or conclusions falls short, corrections are welcome; these results are for reference only.
PS: Take care with aggregate functions in Hive. For a global aggregation with no GROUP BY (the classic case being count(distinct col)), Hive funnels all rows through a single reducer, even when the reducer count is set manually. On a large dataset that lone reducer becomes a bottleneck and may even fail, typically by running out of memory. In that situation, aggregate in two steps with a subquery: group by the relevant column in the inner query, which lets multiple reducers share the work, then apply the overall aggregation to the subquery's result in the outer query, as sketched below.
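For instance, a minimal sketch of this rewrite on tab_2, assuming the goal is a distinct count (these statements are illustrative, not part of the experiment above):
-- single-reducer version: every row funnels through one reducer
select count(distinct value) from tab_2;
-- staged version: the inner group by deduplicates across multiple reducers,
-- then the outer count aggregates the already-small grouped result
select count(1)
from (select value from tab_2 group by value) t;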