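1. What is a bucketed table
A bucketed table is a Hive table whose rows are spread across separate files according to the value of a chosen column.
2. How bucketing works
The official Hive documentation explains: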
How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.
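In other words:
The hash of the bucketing column is taken modulo the number of buckets, and rows with the same remainder go into the same bucket.
On HDFS, each bucket corresponds to one data file.
For example, bucketing by the name column into 3 buckets means taking the hash of each name value modulo 3 and placing the row according to the result: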
- Rows whose result is 0 are stored in one file
- Rows whose result is 1 are stored in a second file
- Rows whose result is 2 are stored in a third file
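The modulo rule can be tried directly in a Hive session. The following is a minimal sketch, assuming that the built-in hash() function matches the hash Hive uses for bucketing (newer Hive versions may bucket with a different hash function, so the indexes are illustrative only). The three literal names and the alias bucket_index are made up for the example.

```sql
-- Sketch: apply the quoted formula (hash & 0x7FFFFFFF) mod num_buckets
-- to a few literal values. 2147483647 is 0x7FFFFFFF in decimal.
SELECT name,
       (hash(name) & 2147483647) % 3 AS bucket_index  -- 3 = number of buckets
FROM (
  SELECT 'xiaoming1' AS name
  UNION ALL
  SELECT 'xiaoming2' AS name
  UNION ALL
  SELECT 'xiaoming3' AS name
) t;
```

Masking with 0x7FFFFFFF clears the sign bit of the hash, so the remainder is never negative and always falls in the range 0 to 2.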
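3. Advantages of bucketed tables
- Helps avoid data skew, because hashing tends to spread rows evenly across buckets
- Makes sampling easier, because a query can read a subset of the buckets
- Makes map-side JOINs more efficient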
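The sampling and join advantages can be sketched as follows, using the user_buckets_demo table created in section 4. This is an illustration under assumptions rather than a tuned setup: order_buckets_demo, user_id and amount are hypothetical names, and the second table is assumed to be clustered by (user_id) into 4 buckets.

```sql
-- Sampling: because the table is clustered by id into 4 buckets,
-- sampling ON id should let Hive read just one of the four bucket files.
SELECT s.id, s.name
FROM db_hive.user_buckets_demo TABLESAMPLE(BUCKET 1 OUT OF 4 ON id) s;

-- Map-side join: allow Hive to use a bucket map join when both tables
-- are bucketed on the join key.
-- order_buckets_demo is a hypothetical table, not part of this article.
set hive.optimize.bucketmapjoin=true;
SELECT u.id, u.name, o.amount
FROM db_hive.user_buckets_demo u
JOIN db_hive.order_buckets_demo o ON u.id = o.user_id;
```

Whether the optimizer actually chooses a bucket map join depends on the Hive version, the table sizes and the other join settings in effect.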
4. How to create a bucketed table
-- Create the bucketed table
create table db_hive.user_buckets_demo(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
Notes:
* clustered by: specifies the bucketing column
* into x buckets: specifies the number of buckets
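To check that the two clauses took effect, the table metadata can be inspected. A small sketch, assuming the create statement above has already been run:

```sql
-- Show the full table metadata; the storage section should list
-- the number of buckets (4) and the bucket columns (id).
DESCRIBE FORMATTED db_hive.user_buckets_demo;
```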
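5. Loading data into a bucketed table
Notes:
* Load the data into a normal (non-bucketed) table first, then insert it with insert into ... select ... from.
Otherwise the data will not be bucketed, because load data only copies or moves files and does not redistribute rows.
* In Hive versions before 2.0, set the following before inserting:
set hive.enforce.bucketing=true;
or set the number of reducers to the number of buckets and add cluster by on the bucketing column to the select:
set mapreduce.job.reduces=4;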
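For the second option, the Hive documentation pairs the reducer count with a cluster by clause in the select, so that each reducer writes one bucket. A hedged sketch using the two tables from this article (this route is only relevant before Hive 2.0, and the choice of insert overwrite is an assumption for the example):

```sql
-- Manual bucketing without hive.enforce.bucketing (pre-2.0 Hive only).
-- The reducer count must equal the number of buckets of the target table.
set mapreduce.job.reduces=4;

-- cluster by id routes rows with the same bucket to the same reducer.
insert overwrite table db_hive.user_buckets_demo
select id, name
from db_hive.user_demo
cluster by id;
```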
Create the normal table user_demo:
create table db_hive.user_demo(id int, name string)
row format delimited fields terminated by '\t';
Prepare the data file user_bucket.txt (fields separated by a tab):
1 xiaoming1
2 xiaoming2
3 xiaoming3
4 xiaoming4
5 xiaoming5
6 xiaoming6
7 xiaoming7
8 xiaoming8
9 xiaoming9
10 xiaoming10
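Load the data into the normal table user_demo:
load data local inpath '/home/hadoop/hive/data/user_bucket.txt' overwrite into table db_hive.user_demo;
Load the data into the bucketed table user_buckets_demo:
-- Only needed before Hive 2.0; later versions always enforce bucketing
set hive.enforce.bucketing=true;
insert into db_hive.user_buckets_demo select * from db_hive.user_demo;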
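After the insert, one way to see which file each row landed in is Hive's INPUT__FILE__NAME virtual column. The query below is a sketch; expected_bucket assumes the id hash is the id itself, as the quoted documentation describes for int columns, and may not match on Hive versions that bucket with a different hash function.

```sql
-- For each row, show the bucket file it was written to and the
-- bucket index predicted by (hash & 0x7FFFFFFF) mod 4.
SELECT id,
       name,
       INPUT__FILE__NAME AS bucket_file,
       (hash(id) & 2147483647) % 4 AS expected_bucket
FROM db_hive.user_buckets_demo
ORDER BY expected_bucket, id;
```

Rows that share an expected_bucket value should also share a bucket_file.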
View the data files in the web UI
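The same files can also be listed from the Hive CLI without a browser. A minimal sketch, assuming the default warehouse location /user/hive/warehouse; the actual path depends on hive.metastore.warehouse.dir and on how db_hive was created.

```sql
-- List the files of the bucketed table on HDFS.
-- The path is an assumption based on the default warehouse directory.
dfs -ls /user/hive/warehouse/db_hive.db/user_buckets_demo;
```

Since each bucket corresponds to one data file, a table with 4 buckets should show four files here.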