Learning Big Data with Ease: Bucketed Tables in Hive

Original

2020-06-29 10:30

You may not be outstanding, but never settle for being ordinary.

Bucketed tables in Hive

Official documentation

To start, here is what the official documentation says about creating bucketed tables:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables

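What is a bucketed table?

Bucketing divides data at a finer granularity than partitioning. It splits the data according to the hash of a column's value. For example, to bucket by the name column into 3 buckets, Hive takes the hash of each name value modulo 3 and uses the result to choose the bucket: rows whose result is 0 go to one file, rows whose result is 1 go to a second file, and rows whose result is 2 go to a third. In short, a bucket is a finer-grained division of data than a table or a partition. Hive hashes the value of the bucketing column, divides by the number of buckets, and uses the remainder to decide which bucket stores the row.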
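The hash-then-modulo rule can be sketched in a few lines of Python. This is only an illustration: `toy_hash`, `bucket_for` and `assign_buckets` are hypothetical names, and `toy_hash` is a stand-in hash, not necessarily the function Hive applies to string columns.

```python
# Illustrative only: toy_hash is a stand-in, not Hive's actual hash function.
def toy_hash(value):
    """Java-style 31-polynomial hash over UTF-8 bytes, kept to 32 bits."""
    h = 0
    for byte in value.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h


def bucket_for(value, num_buckets):
    """Return the 0-based bucket index: hash modulo the bucket count."""
    return toy_hash(value) % num_buckets


def assign_buckets(names, num_buckets=3):
    """Group names into num_buckets lists, one list per bucket file."""
    buckets = [[] for _ in range(num_buckets)]
    for name in names:
        buckets[bucket_for(name, num_buckets)].append(name)
    return buckets


if __name__ == "__main__":
    sample = ["alice", "bob", "carol", "dave"]
    for index, rows in enumerate(assign_buckets(sample)):
        print(f"bucket {index}: {rows}")
```

Whatever hash is used, the same value always lands in the same bucket, which is what makes sampling and bucket-wise joins possible.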

Syntax

DDL syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[STORED AS file_format] 
[LOCATION hdfs_path]

Bucketed table syntax

CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

In the example above, the page_view table is bucketed (clustered by) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.

There is also an example of creating and populating bucketed tables.

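Working with bucketed tables

First, create a bucketed table and load a data file into it directly

Prepare the data file student.txt

Create the bucketed table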

create table stu_buck(id int, name string) 
clustered by(id) 
into 4 buckets 
row format delimited fields terminated by '\t';

Check the table structure

hive (default)> desc formatted stu_buck; 
Num Buckets: 4

Load data into the bucketed table

hive (default)> load data local inpath '/datas/student.txt' into table stu_buck;

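Check whether the table has been split into 4 buckets: it has not. Why?

A plain load data does not bucket anything. The result is the same as for a non-bucketed table: there is a single file on HDFS.

Let's try another approach.

Populate the bucketed table through a subquery

First create an ordinary stu table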

create table stu(id int, name string) 
row format delimited fields terminated by '\t';

Load data into the ordinary stu table

load data local inpath '/datas/student.txt' into table stu;

Empty the stu_buck table

truncate table stu_buck; 
select * from stu_buck;

Load data into the bucketed table through a subquery
```
insert into table stu_buck 
select id, name from stu; 
```
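There is still only one bucket. Why?

The official documentation already explains this: a parameter has to be set before bucketing takes effect.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

Set the following property (this applies to Hive 0.x and 1.x; from Hive 2.0 onward the property was removed and bucketing is always enforced)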
```
hive (default)> set hive.enforce.bucketing=true; 
hive (default)> insert into table stu_buck 
select id, name from stu; 
```

Query the bucketed data

hive (default)> select * from stu_buck; 
OK
stu_buck.id stu_buck.name 
1004 ss4 
1008 ss8 
1012 ss12 
1016 ss16 
1001 ss1 
1005 ss5 
1009 ss9 
1013 ss13 
1002 ss2 
1006 ss6 
1010 ss10 
1014 ss14 
1003 ss3 
1007 ss7 
1011 ss11 
1015 ss15
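The order of the rows above follows from the bucket assignment. Here is a small sketch that reproduces it, under two assumptions that the post does not state: student.txt holds ids 1001 to 1016 with names ss1 to ss16 (reconstructed from the query output), and an int column hashes to its own value. Newer Hive versions may use a different hash function, so the exact placement can differ there. `int_bucket` and `write_buckets` are hypothetical helper names.

```python
def int_bucket(row_id, num_buckets):
    """Assumption: an INT hashes to itself; mask the sign bit, then take the remainder."""
    return (row_id & 0x7FFFFFFF) % num_buckets


def write_buckets(rows, num_buckets):
    """Distribute (id, name) rows into one list per bucket file, keeping input order."""
    files = [[] for _ in range(num_buckets)]
    for row in rows:
        files[int_bucket(row[0], num_buckets)].append(row)
    return files


if __name__ == "__main__":
    # Reconstructed input: ids 1001..1016, names ss1..ss16.
    rows = [(1000 + i, f"ss{i}") for i in range(1, 17)]
    # Reading the bucket files one after another gives the order of the query result.
    for bucket in write_buckets(rows, 4):
        for row_id, name in bucket:
            print(row_id, name)
```

Bucket 0 holds the ids divisible by 4 (1004, 1008, 1012, 1016), bucket 1 the ids with remainder 1, and so on, which matches the four groups in the output.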

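Sampling queries on buckets

For very large data sets, users sometimes need a representative sample of the results rather than the full result. Hive supports this by sampling a table.

Query the data in table stu_buck.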

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);

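Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).

y must be a multiple or a factor of the table's total bucket count. Hive uses y to decide the sampling ratio. For example, if the table has 4 buckets, y=2 samples 4/2 = 2 buckets of data, and y=8 samples 4/8 = 1/2 of one bucket. x is the bucket to start from. If more than one bucket is needed, each subsequent bucket number is the previous one plus y.

For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples 4/2 = 2 buckets of data: bucket 1 (x) and bucket 3 (x+y).

Note: x must be less than or equal to y; otherwise Hive reports the following error: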

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
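The sampling arithmetic can be checked with a short sketch. `sampled_buckets` and `sample_rows` are hypothetical helpers, not Hive APIs. The row-level filter assumes that sampling keeps rows whose hash modulo y equals x - 1 and that an int id hashes to itself, so treat it as an approximation of what Hive does.

```python
def sampled_buckets(total_buckets, x, y):
    """1-based bucket numbers that are read when y is a factor of total_buckets."""
    if x < 1 or y < 1:
        raise ValueError("x and y must be positive")
    if x > y:
        raise ValueError("Numerator should not be bigger than denominator")
    if total_buckets % y != 0:
        raise ValueError("this helper only covers y being a factor of the bucket count")
    # Start at bucket x, then keep adding y.
    return list(range(x, total_buckets + 1, y))


def sample_rows(ids, x, y):
    """Row-level view (assumed): keep ids whose hash modulo y equals x - 1."""
    if x > y:
        raise ValueError("Numerator should not be bigger than denominator")
    return [i for i in ids if (i & 0x7FFFFFFF) % y == x - 1]


if __name__ == "__main__":
    ids = list(range(1001, 1017))
    print(sampled_buckets(4, 1, 2))  # starts at bucket 1, then 1 + 2
    print(sample_rows(ids, 1, 4))    # one whole bucket
    print(sample_rows(ids, 1, 8))    # half of one bucket
```

With 16 ids in 4 buckets, y=4 returns the four rows of one bucket, while y=8 returns only two of them, which is the "half a bucket" case.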

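Use cases

Sampling queries

For very large data sets, when users need only a representative result rather than the full query result, they can sample the table by bucket.

When the table is also declared with SORTED BY, the data within each bucket is sorted, so joining two such tables bucket by bucket becomes an efficient merge join.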
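To see why sorted buckets help, here is a minimal sketch of a bucket-wise merge join. It assumes both tables are bucketed on the join key into the same number of buckets and sorted on that key inside each bucket. `merge_join` and `bucketed_join` are illustrative names, and this is not how Hive itself implements its sort-merge-bucket join.

```python
def merge_join(left, right):
    """Join two lists of (key, value) tuples that are each sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                for k in range(j, j_end):
                    result.append((left_key, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


def bucketed_join(left_buckets, right_buckets):
    """Join bucket n of one table only with bucket n of the other."""
    joined = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        joined.extend(merge_join(left_bucket, right_bucket))
    return joined
```

Each bucket pair is scanned once from start to end, with no sorting or hashing at join time, and rows in bucket n never need to be compared with rows in any other bucket.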

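Data sharding

Bucketing splits the data by a chosen key, which makes the data easier to store and to process.

Partitioned tables and bucketed tables

Bucketing operates on data files, whereas partitioning operates on data directories rather than files.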
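The difference shows up in the storage layout, sketched below. The warehouse root, the partition values and the bucket file names such as `000000_0` are assumptions for illustration; actual names can vary with the Hive version and execution engine. `partition_dir` and `bucket_file_paths` are hypothetical helpers.

```python
def partition_dir(table_root, partition):
    """A partition maps to nested directories of the form column=value."""
    parts = [f"{column}={value}" for column, value in partition.items()]
    return "/".join([table_root] + parts)


def bucket_file_paths(directory, num_buckets):
    """Buckets map to files inside that directory, one file per bucket (assumed naming)."""
    return [f"{directory}/{index:06d}_0" for index in range(num_buckets)]


if __name__ == "__main__":
    # Hypothetical location and partition values for the page_view example.
    root = "/user/hive/warehouse/page_view"
    directory = partition_dir(root, {"dt": "2020-06-29", "country": "US"})
    for path in bucket_file_paths(directory, 4):
        print(path)
```

Pruning a partition skips a whole directory, while bucketing only decides which file inside a directory a row is written to.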

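Partitioning is a convenient way to isolate data and optimize queries. However, not every data set can be partitioned sensibly, especially when it is hard to settle on a suitable partition size.

Bucketing is another technique for breaking a data set into more manageable parts.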
