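[Hive] Bucketed Tables

1. What is a bucketed table

A bucketed table is a Hive table whose rows are spread across separate files according to the value of a chosen column.

2. How bucketing works

The official Hive documentation explains: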

How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.

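In other words:

The hash of the bucketing column is taken modulo the number of buckets, and rows with the same remainder go into the same bucket.
On HDFS, each bucket corresponds to one data file.

For example, bucketing by the name column into 3 buckets means taking the hash of each name value modulo 3 and assigning each row to a bucket by the result.

  • Rows whose result is 0 are stored in one file

  • Rows whose result is 1 are stored in one file

  • Rows whose result is 2 are stored in one file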
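
To preview which bucket a given row would land in, the formula from the quote can be written out as a query. This is only a sketch: it assumes the built-in hash() function returns the same hash that Hive uses when bucketing, which may not hold on Hive versions that use a different bucketing hash, and it runs against the user_demo table created in section 5 below.

```sql
-- Sketch: apply (hash & 0x7FFFFFFF) mod num_buckets by hand for 3 buckets.
-- Assumption: hash() matches the hash Hive uses for bucketing; this can differ
-- between Hive versions, so treat bucket_idx as illustrative only.
select name,
       (hash(name) & 2147483647) % 3 as bucket_idx  -- 2147483647 = 0x7FFFFFFF
from db_hive.user_demo;
```

Rows that share the same bucket_idx would be written to the same file.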

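3. Advantages of bucketed tables

  • Helps reduce data skew, since hashing tends to spread rows evenly across the buckets
  • Makes sampling easier
  • Makes map-side JOINs more efficient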
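
As an illustration of the sampling benefit, a query can read a single bucket instead of scanning the whole table. The sketch below uses Hive's tablesample clause on the user_buckets_demo table defined in section 4; the choice of bucket 1 is arbitrary.

```sql
-- Sketch: sample roughly one quarter of the rows by reading bucket 1 of 4.
-- The table is clustered by id into 4 buckets, so sampling on id
-- lines up with the existing bucket layout.
select id, name
from db_hive.user_buckets_demo
tablesample(bucket 1 out of 4 on id);
```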

4. How to create a bucketed table

-- Create the bucketed table
create table db_hive.user_buckets_demo(id int, name string)
clustered by(id) 
into 4 buckets 
row format delimited fields terminated by '\t';

Note:
* clustered by: specifies the bucketing column
* into x buckets: specifies the number of buckets

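5. Loading data into a bucketed table

Note:
* The data must first be loaded into a regular table and then inserted with insert into ... select ... from.
  Otherwise, the data will not be bucketed.
* In Hive versions before 2.0, set one of the following before inserting data:
  set hive.enforce.bucketing=true;
  or
  set mapreduce.job.reduces=4;
  With the second option, the number of reducers must equal the number of buckets (4 here), and the select needs a cluster by clause on the bucketing column.

Create a regular table user_demo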

create table db_hive.user_demo(id int, name string)
row format delimited fields terminated by '\t';

Prepare the tab-separated data file user_bucket.txt

1	xiaoming1
2	xiaoming2
3	xiaoming3
4	xiaoming4
5	xiaoming5
6	xiaoming6
7	xiaoming7
8	xiaoming8
9	xiaoming9
10	xiaoming10

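Load the data into the regular table user_demo

load data local inpath '/home/hadoop/hive/data/user_bucket.txt' overwrite into table db_hive.user_demo;

Load the data into the bucketed table user_buckets_demo

-- Only needed before Hive 2.0; later versions always enforce bucketing
set hive.enforce.bucketing=true;
insert into db_hive.user_buckets_demo select * from db_hive.user_demo;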

The resulting data files can then be viewed in the HDFS web UI.
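
The result can also be checked with a query rather than the web page. A minimal sketch, assuming the insert above has completed, uses Hive's INPUT__FILE__NAME virtual column to count how many rows ended up in each data file:

```sql
-- Sketch: list each data file of the bucketed table with its row count.
-- INPUT__FILE__NAME is a Hive virtual column holding the source file path.
select INPUT__FILE__NAME as data_file,
       count(*)          as row_count
from db_hive.user_buckets_demo
group by INPUT__FILE__NAME;
```

With 4 buckets, up to four files would be expected; the exact file names depend on the Hive version and execution engine.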
