Bucketed tables in Hive: basic concepts
- Hive can further organize each table (or partition) into buckets; a bucket is a finer-grained division of the data than a partition. Bucketing is done on a chosen column: Hive hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record belongs in. Roughly speaking, a partition is a coarse-grained division and a bucket a fine-grained one. When the data volume is large and a job needs to finish quickly, running multiple map and reduce tasks is the way to get there.
But if the input is a single file, only one map task can be started. A bucketed table is a good choice here: by specifying the CLUSTERED BY column, the file is hash-split into several smaller files.
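The assignment rule described above (hash the clustering column, take it modulo the bucket count) can be sketched in Python; the bucket count of 4 and the sample ids below are just illustrative:

```python
def bucket_of(value, num_buckets):
    """Assign a record to a bucket: hash the clustering column value,
    then take the remainder modulo the number of buckets.
    For Hive INT columns the hash is the value itself."""
    return hash(value) % num_buckets

# Records with the same clustering value always land in the same bucket.
assert bucket_of(7, 4) == bucket_of(7, 4)
print([bucket_of(i, 4) for i in [1, 2, 3, 4]])  # → [1, 2, 3, 0]
```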
There are two reasons to organize a table (or partition) into buckets:
- Higher query-processing efficiency. Buckets add extra structure to the table, and Hive can exploit that structure for certain queries. In particular, a join between two tables that are both bucketed on the same columns (including the join column) can be implemented efficiently as a map-side join: only buckets holding the same key values need to be joined against each other, which can greatly reduce the amount of data involved in the JOIN.
- More efficient sampling. When working with large datasets, it is very convenient during query development and tuning to trial-run a query against a small slice of the data.
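The join argument can be made concrete: when both tables are bucketed on the join key with the same bucket count, bucket i of one table only ever needs to be compared against bucket i of the other. A minimal Python sketch (the two table contents below are made up for illustration):

```python
from collections import defaultdict

def to_buckets(rows, key_index, n):
    """Split rows into n buckets by hash(key) % n, as Hive does on write."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key_index]) % n].append(row)
    return buckets

def bucketed_join(left, right, n):
    """Join two tables bucketed on the join key: only matching bucket
    pairs are compared, instead of every row against every row."""
    lb, rb = to_buckets(left, 0, n), to_buckets(right, 0, n)
    out = []
    for i in range(n):           # bucket i joins only with bucket i
        for l in lb[i]:
            for r in rb[i]:
                if l[0] == r[0]:
                    out.append((l[0], l[1], r[1]))
    return out

users  = [(1, "hello"), (2, "world"), (7, "hbase")]
orders = [(1, "o-100"), (7, "o-200")]
print(bucketed_join(users, orders, 4))  # → [(1, 'hello', 'o-100'), (7, 'hbase', 'o-200')]
```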
Working with bucketed tables:
1. Create a bucketed table. The CLUSTERED BY clause specifies the column used for bucketing and the number of buckets.
CREATE TABLE bucketed_user (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
2. View the table structure
hive> desc bucketed_user;
OK
id int
name string
Time taken: 0.052 seconds, Fetched: 2 row(s)
The t_user table contains the following data:
hive> select * from t_user;
OK
1 hello
2 world
3 java
4 hadoop
5 android
6 hive
7 hbase
8 sqoop
9 sqark
1 hello
2 world
3 java
4 hadoop
5 android
6 hive
7 hbase
8 sqoop
9 sqark
1 aaaa
1 bbbb
1 cccc
1 dddd
Time taken: 0.076 seconds, Fetched: 22 row(s)
Loading the data from t_user into bucketed_user
- There are two ways to get data into a bucketed table: load externally generated files directly, or let Hive generate the bucketed data for you.
- Because Hive cannot check at LOAD time whether a data file actually matches the bucket definition, a mismatch only surfaces as errors at query time. It is therefore safer to let Hive produce the data, i.e. insert from an existing table into the newly defined bucketed table. (On Hive versions before 2.0 you may also need to run: set hive.enforce.bucketing = true; so that the number of reducers matches the bucket count.)
INSERT OVERWRITE TABLE bucketed_user
SELECT * FROM t_user;
The job runs as follows:
Query ID = centosm_20170325122000_5f5c9c5f-9d6f-4f4b-9b94-d1d257ed852f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1488032149798_0006, Tracking URL = http://centosm:8088/proxy/application_1488032149798_0006/
Kill Command = /home/centosm/hadoopM/bin/hadoop job -kill job_1488032149798_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2017-03-25 12:20:20,784 Stage-1 map = 0%, reduce = 0%
2017-03-25 12:20:47,857 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.6 sec
2017-03-25 12:21:47,186 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 9.39 sec
2017-03-25 12:22:20,580 Stage-1 map = 100%, reduce = 79%, Cumulative CPU 17.81 sec
2017-03-25 12:22:24,591 Stage-1 map = 100%, reduce = 87%, Cumulative CPU 20.31 sec
2017-03-25 12:22:25,778 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 23.46 sec
MapReduce Total cumulative CPU time: 23 seconds 460 msec
Ended Job = job_1488032149798_0006
Loading data to table default.bucketed_user
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 27.3 sec HDFS Read: 18519 HDFS Write: 381 SUCCESS
Total MapReduce CPU Time Spent: 27 seconds 300 msec
OK
Time taken: 177.489 seconds
Query the table after the job completes:
hive> select * from bucketed_user;
OK
8 sqoop
4 hadoop
4 hadoop
8 sqoop
1 hello
1 cccc
1 bbbb
1 aaaa
9 sqark
5 android
1 dddd
1 hello
9 sqark
5 android
6 hive
6 hive
2 world
2 world
7 hbase
3 java
7 hbase
3 java
Time taken: 0.082 seconds, Fetched: 22 row(s)
The data files in the Hive warehouse:
[centosm@centosm test]$ hdfs dfs -ls /user/hive/warehouse/bucketed_user
Found 4 items
-rwxr-xr-x 1 centosm supergroup 17 2017-03-25 12:22 /user/hive/warehouse/bucketed_user/000000_0
-rwxr-xr-x 1 centosm supergroup 26 2017-03-25 12:22 /user/hive/warehouse/bucketed_user/000001_0
-rwxr-xr-x 1 centosm supergroup 15 2017-03-25 12:22 /user/hive/warehouse/bucketed_user/000002_0
-rwxr-xr-x 1 centosm supergroup 15 2017-03-25 12:22 /user/hive/warehouse/bucketed_user/000003_0
[centosm@centosm test]$
[centosm@centosm test]$ hdfs dfs -cat /user/hive/warehouse/bucketed_user/000000_0
8,sqoop
4,hadoop
4,hadoop
8,sqoop
[centosm@centosm test]$ hdfs dfs -cat /user/hive/warehouse/bucketed_user/000001_0
1,hello
1,cccc
1,bbbb
1,aaaa
9,sqark
5,android
1,dddd
1,hello
9,sqark
5,android
[centosm@centosm test]$ hdfs dfs -cat /user/hive/warehouse/bucketed_user/000002_0
6,hive
6,hive
2,world
2,world
[centosm@centosm test]$ hdfs dfs -cat /user/hive/warehouse/bucketed_user/000003_0
7,hbase
3,java
7,hbase
3,java
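The file layout above is easy to verify: for an INT clustering column Hive's hash is the value itself, so each row goes to the file whose index is id % 4. This short check reuses the 22 ids from t_user:

```python
# The 22 id values from the t_user table shown earlier.
ids = [1, 2, 3, 4, 5, 6, 7, 8, 9,
       1, 2, 3, 4, 5, 6, 7, 8, 9,
       1, 1, 1, 1]

buckets = {k: [] for k in range(4)}
for i in ids:
    buckets[i % 4].append(i)    # bucket index = id % 4 for INT keys

# Matches the four warehouse files shown above:
print(sorted(set(buckets[0])))  # → [4, 8]      (000000_0)
print(sorted(set(buckets[1])))  # → [1, 5, 9]   (000001_0)
print(sorted(set(buckets[2])))  # → [2, 6]      (000002_0)
print(sorted(set(buckets[3])))  # → [3, 7]      (000003_0)
```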
Querying with TABLESAMPLE:
hive> select * from bucketed_user tablesample(bucket 1 out of 4 on id);
OK
8 sqoop
4 hadoop
4 hadoop
8 sqoop
Time taken: 0.104 seconds, Fetched: 4 row(s)
hive>
> select * from bucketed_user tablesample(bucket 2 out of 4 on id);
OK
1 hello
1 cccc
1 bbbb
1 aaaa
9 sqark
5 android
1 dddd
1 hello
9 sqark
5 android
Time taken: 0.067 seconds, Fetched: 10 row(s)
hive>
> select * from bucketed_user tablesample(bucket 3 out of 4 on id);
OK
6 hive
6 hive
2 world
2 world
Time taken: 0.075 seconds, Fetched: 4 row(s)
TABLESAMPLE restricts a query to a subset of the buckets instead of the whole dataset; the first query above reads only the first of the four buckets. Compared with an unbucketed table this is far more efficient: to sample the same small fraction of an unbucketed table, Hive would have to scan the entire dataset with rand().
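The semantics of TABLESAMPLE(BUCKET x OUT OF y ON col) can be sketched as a filter on the hash: keep the rows whose hash falls in sample bucket x (1-based). The sketch below assumes INT keys, where Hive's hash is the value itself; the sample rows are taken from bucketed_user:

```python
def tablesample(rows, key_index, x, y):
    """Emulate TABLESAMPLE(BUCKET x OUT OF y ON col): keep rows whose
    key hashes into sample bucket x (1-based) out of y.
    Assumes INT keys, for which Hive's hash is the value itself."""
    return [r for r in rows if hash(r[key_index]) % y == x - 1]

rows = [(8, "sqoop"), (4, "hadoop"), (1, "hello"), (7, "hbase")]
print(tablesample(rows, 0, 1, 4))  # → [(8, 'sqoop'), (4, 'hadoop')]
```

When y equals the table's bucket count, as in the queries above, each sample corresponds exactly to one physical bucket file, so Hive can answer the query by reading that single file.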
Conclusion: the results above show clearly that bucketing places all rows with the same user id into the same bucket, and that a single bucket can also hold rows for several different ids; for example, /user/hive/warehouse/bucketed_user/000003_0 holds all the rows with id 3 and id 7. When we then query all the data for one specific id, this greatly narrows the range that must be searched.