Hive: Bucketed Tables

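I. The concept of buckets:
Hive can further organize each table or partition into buckets, a finer-grained
division of the data. A table does not have to be partitioned to be bucketed: an
unpartitioned table can be bucketed directly. Buckets are defined on a column: Hive
hashes the column value and takes the remainder after dividing by the number of
buckets, and that remainder determines which bucket the record is stored in.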
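The hash-then-modulo rule can be sketched in a few lines of Python. This is only an illustration, not Hive's code: the function name bucket_for is hypothetical, and it assumes that the hash code of an INT column is the integer itself and that the sign bit is cleared before taking the remainder.

```python
INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE; clears the sign bit


def bucket_for(hash_code: int, num_buckets: int) -> int:
    """Return the 0-based bucket index for a 32-bit hash code."""
    return (hash_code & INT_MAX) % num_buckets


# Assumption: for an INT column the hash code is the value itself.
print(bucket_for(8, 4))   # 0
print(bucket_for(9, 4))   # 1
print(bucket_for(-3, 4))  # 1 (the sign bit is cleared before the modulo)
```

With 4 buckets, uid 8 goes to bucket 0 and uid 9 to bucket 1.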
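There are two reasons to organize a table (or partition) into buckets:
(1) More efficient query processing. Buckets impose extra structure on a table, and Hive
can exploit that structure for some queries. In particular, a join of two tables that are
bucketed on the same column, where that column is the join column, can be implemented
efficiently as a map-side join. When both tables have the same number of buckets, rows
with the same key land in the same-numbered bucket of each table, so each bucket only
has to be joined with its counterpart. This greatly reduces the amount of data each
join task has to read.
(2) More efficient sampling. When working with large data sets, it is very convenient
to be able to try out a query on a small fraction of the data while developing and
refining it.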
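To make reason (1) concrete, here is a minimal Python sketch of a bucket-by-bucket join. It is not how Hive implements map-side joins internally; the function names and the orders data are hypothetical, keys are assumed to be non-negative integers, and both tables are assumed to use the same bucket count.

```python
from collections import defaultdict


def bucketize(rows, num_buckets):
    """Group (key, value) rows into buckets by key modulo num_buckets."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets):
    """Join two tables bucket by bucket; only same-numbered buckets are compared."""
    left_b = bucketize(left, num_buckets)
    right_b = bucketize(right, num_buckets)
    result = []
    for b in range(num_buckets):
        # Build an in-memory lookup for this one bucket only.
        lookup = defaultdict(list)
        for key, value in right_b[b]:
            lookup[key].append(value)
        for key, value in left_b[b]:
            for other in lookup[key]:
                result.append((key, value, other))
    return result


users = [(1, "xiaoA"), (2, "xiaoB"), (5, "xiaoE")]
# Hypothetical second table keyed by the same user ID.
orders = [(1, "order-1"), (5, "order-2"), (5, "order-3"), (7, "order-4")]
print(sorted(bucketed_join(users, orders, 4)))
```

Each pass through the loop only holds one bucket of the right-hand table in memory, which is the property that makes the join cheap.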

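Forcing multiple reducers to write the bucket files:
Set the following before inserting data; otherwise only a single file is produced:
set hive.enforce.bucketing = true;
To populate a bucketed table, set the hive.enforce.bucketing property to true. Hive then
creates the number of buckets declared in the table definition, and a plain INSERT is enough.
(This property applies to Hive 0.x and 1.x; from Hive 2.0 on it was removed and bucketing is always enforced.)
Note that the CLUSTERED BY and SORTED BY clauses in the table definition do not change how data is loaded. Without enforcement, the user is responsible for loading the data correctly, including bucketing and sorting it.
'set hive.enforce.bucketing = true' automatically sets the number of reducers in the preceding stage to match the number of buckets.
You can also set mapred.reduce.tasks yourself to match the bucket count, but 'set hive.enforce.bucketing = true' is the recommended approach.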

II. Worked examples
1. Split user data into 4 buckets, using the user ID as the bucketing column
Create a plain (non-bucketed) table:
create table if not exists u_users(
uid int,
uname string,
uage int
)
row format delimited fields terminated by',';

vi u_users.txt
1,xiaoA,12
2,xiaoB,10
3,xiaoC,12
4,xiaoD,17
5,xiaoE,12
6,xiaoF,16
7,xiaoG,15
8,xiaoH,12
9,xiaoW,12
10,xiaoT,12
11,xiaoL,18

load data local inpath '/opt/data/u_users.txt' into table u_users;

Create the bucketed table (bucketed on the user ID):
create table if not exists bk_users(
uid int,
uname string,
uage int
)
clustered by(uid) into 4 buckets
row format delimited fields terminated by',';

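Notes:
1. clustered by(uid) into 4 buckets must come before row format delimited fields terminated by','. The order cannot be swapped.
2. clustered by(uid) into 4 buckets uses the table's uid column as the bucketing column and splits the data into 4 buckets.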

Force multiple reducers to write the bucket files:
set hive.enforce.bucketing = true;

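Load data into the bucketed table:
Note: data can only be added to a bucketed table from a query result set (INSERT ... SELECT). LOAD DATA only moves files into place and does not bucket them.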
insert into table bk_users select * from u_users;

List the bucket files under the bucketed table's directory:
hive> dfs -ls hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users;
Found 4 items
-rwxr-xr-x 3 root supergroup 22 2017-04-24 14:49 hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users/000000_0
-rwxr-xr-x 3 root supergroup 33 2017-04-24 14:49 hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users/000001_0
-rwxr-xr-x 3 root supergroup 34 2017-04-24 14:49 hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users/000002_0
-rwxr-xr-x 3 root supergroup 34 2017-04-24 14:49 hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users/000003_0

hive> dfs -cat hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users/000000_0;
8,xiaoH,12
4,xiaoD,17
hive> dfs -cat hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users/000001_0;
9,xiaoW,12
5,xiaoE,12
1,xiaoA,12
hive> dfs -cat hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users/000002_0;
10,xiaoT,12
6,xiaoF,16
2,xiaoB,10
hive> dfs -cat hdfs://Hadoop001:9000/user/hive/warehouse/db_1608c.db/bk_users/000003_0;
11,xiaoL,18
7,xiaoG,15
3,xiaoC,12
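The listing above can be cross-checked with a short Python sketch. It assumes the bucket index for an INT column is simply uid modulo 4 and that every row ends with a single newline; the helper names are hypothetical. Under those assumptions it reproduces both the contents and the byte sizes (22, 33, 34, 34) of the four files, though not the row order inside each file.

```python
ROWS = [
    "1,xiaoA,12", "2,xiaoB,10", "3,xiaoC,12", "4,xiaoD,17",
    "5,xiaoE,12", "6,xiaoF,16", "7,xiaoG,15", "8,xiaoH,12",
    "9,xiaoW,12", "10,xiaoT,12", "11,xiaoL,18",
]
NUM_BUCKETS = 4


def bucket_files(lines, num_buckets):
    """Assign each text row to a bucket by its leading uid modulo num_buckets."""
    files = {b: [] for b in range(num_buckets)}
    for line in lines:
        uid = int(line.split(",")[0])
        files[uid % num_buckets].append(line)
    return files


def file_size(lines):
    """Size in bytes of a text file holding these ASCII lines, one newline each."""
    return sum(len(line) + 1 for line in lines)


files = bucket_files(ROWS, NUM_BUCKETS)
for b, lines in files.items():
    uids = [line.split(",")[0] for line in lines]
    print(f"{b:06d}_0  {file_size(lines)} bytes  uids={uids}")
```

Bucket 0 holds uids 4 and 8, bucket 1 holds 1, 5 and 9, and so on, matching the dfs -cat output.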

Querying the bucketed table:
select * from bk_users;

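tablesample is the bucket sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y)
select * from bk_users TABLESAMPLE(BUCKET x OUT OF y);

y should be a multiple or a factor of the table's total number of buckets.

x must not be greater than y (x = y is allowed), otherwise the query fails.
Hive uses y to determine the sampling fraction: 1/y of the data is sampled.

clustered by(id) into 16 buckets;
For example, if the table has 16 buckets, then y=8 samples (16/8=) 2 buckets of data,
and y=32 samples (16/32=) 1/2 of one bucket.
x is the bucket to start sampling from (buckets are numbered from 1).

clustered by(id) into 32 buckets;
For example, if the table has 32 buckets, tablesample(bucket 3 out of 16)
samples (32/16=) 2 buckets of data in total:
the 3rd bucket and the (3+16=) 19th bucket.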
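The arithmetic in these examples can be written down as a small Python helper. This is a sketch of the rules stated above, not of Hive's planner; the function name sampled_buckets is hypothetical, and it rejects values of y that are neither a factor nor a multiple of the bucket count.

```python
from fractions import Fraction


def sampled_buckets(total, x, y):
    """Return (1-based bucket numbers that are read, fraction of each bucket kept)."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if total % y == 0:
        # y is a factor of the bucket count: whole buckets x, x+y, x+2y, ...
        return list(range(x, total + 1, y)), Fraction(1)
    if y % total == 0:
        # y is a multiple of the bucket count: only part of one bucket
        return [(x - 1) % total + 1], Fraction(total, y)
    raise ValueError("y should be a factor or a multiple of the bucket count")


print(sampled_buckets(32, 3, 16))  # ([3, 19], Fraction(1, 1))
print(sampled_buckets(16, 1, 8))   # ([1, 9], Fraction(1, 1))
print(sampled_buckets(16, 1, 32))  # ([1], Fraction(1, 2))
```

The first call matches the 32-bucket example (buckets 3 and 19), and the last one matches the half-bucket case with 16 buckets and y=32.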

bk_users bucket layout: clustered by(uid) into 4 buckets
-- Sample one bucket from bk_users (x=2, y=4):
select * from bk_users TABLESAMPLE(BUCKET 2 OUT OF 4);

-- Sample two buckets from bk_users (x=2, y=2):
select * from bk_users TABLESAMPLE(BUCKET 2 OUT OF 2);

-- Sample all four buckets from bk_users (x=1, y=1):
select * from bk_users TABLESAMPLE(BUCKET 1 OUT OF 1);

-- Sample half a bucket from bk_users (x=1, y=8):
select * from bk_users TABLESAMPLE(BUCKET 1 OUT OF 8);
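As a rough check on what these four queries should select, the following Python sketch applies the sampling rule row by row. It assumes that a row is kept when its uid modulo y equals x-1 (uid being an INT whose hash is the value itself) and that the table holds the 11 sample uids; the function name is hypothetical and the row order of a real query may differ.

```python
UIDS = list(range(1, 12))  # the 11 uid values loaded into bk_users


def tablesample(uids, x, y):
    """Keep the uids that fall into slot x (1-based) out of y hash slots."""
    return [uid for uid in uids if uid % y == x - 1]


print(tablesample(UIDS, 2, 4))  # [1, 5, 9]
print(tablesample(UIDS, 2, 2))  # [1, 3, 5, 7, 9, 11]
print(tablesample(UIDS, 1, 1))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(tablesample(UIDS, 1, 8))  # [8]
```

BUCKET 2 OUT OF 4 corresponds to file 000001_0, BUCKET 2 OUT OF 2 to files 000001_0 and 000003_0, and BUCKET 1 OUT OF 8 to half of file 000000_0 (uid 8 but not uid 4).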

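================= Sampling queries ======================
-- Take 5 random rows from a table (this sorts the whole table, so it is expensive on large tables):
select * from u_users order by rand() limit 5;

-- Block sampling (TABLESAMPLE (n PERCENT)): sample about n% of the table's size.
-- Sampling works at the granularity of HDFS blocks, so a small table may be returned in full.
select * from u_users TABLESAMPLE (10 PERCENT);

-- Sample by data size (TABLESAMPLE (nM)), where M means megabytes:
select * from u_users TABLESAMPLE (10M);

-- Sample by row count (TABLESAMPLE (n ROWS)); n applies to each input split, so the total can exceed n:
select * from u_users TABLESAMPLE (5 ROWS);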

************ Partitioning + bucketing + mixed static/dynamic partition insert ************
Example 2: partition by country and city, and bucket on the f1 column
create external table if not exists tb_part_bk_users(
f1 string,
f2 string,
f3 string,
contry string,
city string
)
row format delimited fields terminated by'\t';

load data local inpath '/opt/data/par_buc.txt' into table tb_part_bk_users;

Create the partitioned and bucketed table:
create external table if not exists part_bk_users(
f1 string,
f2 string,
f3 string
)
partitioned by(contry string,city string)
clustered by(f1) into 5 buckets
row format delimited fields terminated by'\t';

Notes:
1. partitioned by(contry string,city string)
clustered by(f1) into 5 buckets
The partition clause is written first, followed by the bucketing clause.

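Add data to the partitioned, bucketed table with a mixed insert (static contry partition, dynamic city partition):
1. Enable dynamic partitioning and set the dynamic partition mode to nonstrict
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

2. Force multiple reducers to write the bucket files
set hive.enforce.bucketing = true;

3. Data can only be added to a bucketed table from a query result set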
insert into table part_bk_users partition(contry='CA',city)
select f1,f2,f3,city from tb_part_bk_users where contry='CA';

insert into table part_bk_users partition(contry='US',city)
select f1,f2,f3,city from tb_part_bk_users where contry='US';
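To picture where a row ends up after these inserts, here is a hedged Python sketch that combines the partition directories with the bucket file name. Everything in it is illustrative: the function names, the city value "Toronto" and the f1 value "a" are hypothetical, and the string hash is a Java-style stand-in for ASCII values, not a claim about Hive's exact hashing of string columns.

```python
def java_style_hash(s: str) -> int:
    """Stand-in hash: 31-multiplier hash over the bytes, as a signed 32-bit int."""
    h = 0
    for byte in s.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def target_path(table, partition, bucket_key, num_buckets):
    """Build <table>/<col=val>/.../<bucket file> for one row."""
    part_dirs = "/".join(f"{col}={val}" for col, val in partition)
    bucket = (java_style_hash(bucket_key) & 0x7FFFFFFF) % num_buckets
    return f"{table}/{part_dirs}/{bucket:06d}_0"


# Hypothetical row: contry='CA', city='Toronto', f1='a', 5 buckets per partition.
print(target_path("part_bk_users", [("contry", "CA"), ("city", "Toronto")], "a", 5))
# part_bk_users/contry=CA/city=Toronto/000002_0
```

The point of the sketch is the layout: partition columns become nested directories, and the bucketing column only decides which file inside the leaf directory receives the row.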
