Comparing DISTINCT and GROUP BY deduplication performance in Hive

Preface

  • OS: CentOS 7
  • Hadoop: 2.7.7
  • Hive: 2.3.0
  • Goal: for a column with varying numbers of distinct values, record the execution time of deduplication queries that use DISTINCT and GROUP BY on that column, and compare the two approaches' performance in each scenario
  • Experiment table (timings filled in under Results):

Table   Rows     Distinct values in column   DISTINCT   GROUP BY
tab_1   100000   3
tab_2   100000   10000

Experiment

1) Create the test tables

drop table if exists tab_1;
create table tab_1(
    id int,
    value int
)
row format delimited
fields terminated by '\t';

drop table if exists tab_2;
create table tab_2 like tab_1;

2) Load the test datasets

Test dataset tab_1.txt:

1	1
2	1
3	1
4	3
5	1
...
99997	3
99998	2
99999	3
100000	2

Test dataset tab_2.txt:

1	3715
2	7211
3	4909
4	2913
5	9839
...
99997	2884
99998	698
99999	4839
100000	2101
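The elided rows above can be reproduced with a small script. This is a sketch only: the exact values in the original files are unknown, so it assumes uniformly random values, producing files of the same shape (100,000 tab-separated rows with 3 and roughly 10,000 distinct values respectively):

```python
import os
import random

def generate(path, num_rows, num_distinct, seed=42):
    """Write num_rows tab-separated (id, value) rows, with value drawn
    uniformly from 1..num_distinct (a stand-in for the original data)."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for i in range(1, num_rows + 1):
            f.write(f"{i}\t{rng.randint(1, num_distinct)}\n")

# Paths match the load statements below.
os.makedirs("/tmp/hive/data/tbl", exist_ok=True)
generate("/tmp/hive/data/tbl/tab_1.txt", 100000, 3)      # 3 distinct values
generate("/tmp/hive/data/tbl/tab_2.txt", 100000, 10000)  # ~10000 distinct values
```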

Load each dataset into its table:

load data local inpath '/tmp/hive/data/tbl/tab_1.txt' overwrite into table tab_1;
load data local inpath '/tmp/hive/data/tbl/tab_2.txt' overwrite into table tab_2;

3) Run the queries and record the execution times

Disable automatic local mode:

hive> set hive.exec.mode.local.auto=false;

Set the reducer count manually:

hive> set mapreduce.job.reduces=3;

Run the test queries and record the execution times:

select distinct(value) from tab_1; -- 31.335s
select value from tab_1 group by value; -- 31.587s
select distinct(value) from tab_2; -- 32.376s
select value from tab_2 group by value; -- 33.834s

4) Execution plan comparison

  • explain select distinct(value) from tab_1;
0: jdbc:hive2://hadoop101:10000/default (test)> explain select distinct(value) from tab_1;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: tab_1                           |
|             Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: value (type: int)       |
|               outputColumnNames: value             |
|               Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: value (type: int)            |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: int) |
|                   Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: int)              |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

  • explain select value from tab_1 group by value;
0: jdbc:hive2://hadoop101:10000/default (test)> explain select value from tab_1 group by value;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: tab_1                           |
|             Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: value (type: int)       |
|               outputColumnNames: value             |
|               Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: value (type: int)            |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: int) |
|                   Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: int)              |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
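Rather than comparing the two plans by eye, the EXPLAIN outputs can be diffed mechanically. A minimal sketch (the plan strings here are abbreviated stand-ins; in practice, paste or read the full outputs from files):

```python
import difflib

def diff_plans(plan_a, plan_b):
    """Return the unified diff of two EXPLAIN outputs (empty list if identical)."""
    return list(difflib.unified_diff(plan_a.splitlines(), plan_b.splitlines(),
                                     fromfile="distinct", tofile="group_by",
                                     lineterm=""))

# Abbreviated stand-ins for the two plans shown above.
plan_distinct = "Group By Operator\n  keys: value (type: int)\n  mode: hash"
plan_group_by = "Group By Operator\n  keys: value (type: int)\n  mode: hash"
d = diff_plans(plan_distinct, plan_group_by)
print("plans identical" if not d else "\n".join(d))
```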

5) Results

Table   Rows     Distinct values in column   DISTINCT   GROUP BY
tab_1   100000   3                           31.335s    31.587s
tab_2   100000   10000                       32.376s    33.834s

Conclusion:

In Hive 2.3.0, deduplication with DISTINCT and with GROUP BY produces the same execution plan and roughly the same execution time, so there is essentially no performance difference between the two.

If there are any shortcomings in the procedure or conclusion above, corrections are welcome; this conclusion is for reference only.

PS: Take care when using aggregate functions in Hive. For an aggregation over the whole table (no GROUP BY), Hive generally runs the final aggregation in a single reducer, even if the reducer count is set manually, so on a large dataset that lone reducer can become a bottleneck or fail outright. In such cases, split the work into steps with a subquery: in the subquery, group by some column and aggregate within each group, so that multiple reducers can run in parallel; then, in the outer query, aggregate the subquery's results into the final answer.
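As an illustration (this query is not part of the experiment above), an overall `select count(distinct value) from tab_2;` can be rewritten as `select count(1) from (select value from tab_2 group by value) t;` so that deduplication runs across many reducers before the final count. A Python sketch of why the rewrite parallelizes, with hash partitioning standing in for Hive's shuffle:

```python
import random
from collections import defaultdict

def count_distinct_two_stage(rows, num_reducers=3):
    # Stage 1 (the subquery's GROUP BY): rows are shuffled to reducers
    # by key, so each reducer deduplicates only its own partition.
    partitions = defaultdict(set)
    for v in rows:
        partitions[hash(v) % num_reducers].add(v)
    # Stage 2 (the outer count): partitions hold disjoint keys, so the
    # final aggregation just sums the per-reducer distinct counts.
    return sum(len(s) for s in partitions.values())

rows = [random.randrange(10000) for _ in range(100000)]
assert count_distinct_two_stage(rows) == len(set(rows))
```

Because each distinct key lands in exactly one partition, the cheap final sum is exact; only the small per-reducer results reach the last stage, instead of every row.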


End~
