(4) The Types of Tables in Hive


2018-12-26 03:04

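Internal (managed) table: the table comes first, then the data. Creating the table creates its directory; the data is then uploaded into that directory and becomes the table's data.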

create table people (col1 string, col2 string) row format delimited fields terminated by '\t';
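As a rough sketch of the "table first, data second" flow, the statements below load a file into the table and then inspect where Hive keeps it. The local path /opt/people.txt is a hypothetical tab-separated file, not something from the original post.

```sql
-- Assumption: /opt/people.txt is a hypothetical local file with two tab-separated columns.
load data local inpath '/opt/people.txt' into table people;

-- Shows the table's metadata, including its storage location and its table type.
describe formatted people;

-- For a managed table, dropping it removes the metadata and the directory holding the data.
drop table people;
```

Because the data lives inside the table's own directory, dropping the table should be treated as deleting the data as well.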

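External table: the data comes first, then the table. The data files already exist on HDFS; the table is then created to point at them, so that Hive can manage the data.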

create external table people (col1 string, col2 string) row format delimited fields terminated by '\t' location '/people';
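A minimal sketch of how the external table behaves differently on drop, assuming the /people directory from the statement above already holds tab-separated files:

```sql
-- The table type reported here should be external rather than managed.
describe formatted people;

-- Dropping an external table removes only the metadata in the metastore.
drop table people;

-- The files under /people are left in place (dfs commands can be run from the Hive CLI).
dfs -ls /people;
```

This is why external tables are the usual choice when the same files are shared with other tools: Hive only attaches a schema to them.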

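Partitioned table: the purpose is to improve query efficiency, since a query that filters on the partition columns only has to read the matching directories.

There are two partition columns here.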

create table phone (col1 string, col2 string) partitioned by (country string, size string) row format delimited fields terminated by '\t';

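When loading data, specify the value of each partition column. A country=china directory is created under the phone directory, and a size=large directory is created under country=china; additional partition columns nest in the same way.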

load data local inpath '/opt/test.txt' overwrite into table phone partition (country='china', size='large');
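To check the result of the load, something like the following could be run. The warehouse path is an assumption (the common default /user/hive/warehouse); the actual location depends on the installation.

```sql
-- Lists the partitions that exist for the table.
show partitions phone;

-- Assumption: default warehouse location; adjust to the real table directory.
dfs -ls /user/hive/warehouse/phone/country=china;

-- Filtering on the partition columns lets Hive read only the matching directory.
select col1, col2 from phone where country = 'china' and size = 'large';
```

The partition columns are not stored in the data file itself; their values come from the directory names.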

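Bucketed table. Purpose: data sampling, that is, splitting a large data set into several smaller parts, each of which keeps the characteristics of the source data.

Scenario: testing against a huge data set is slow, so a small sample needs to be drawn from it for testing.

How it works: a hash function is applied to one column of the source data, and the result is taken modulo the number of buckets. This spreads the rows fairly evenly across the buckets, and each bucket keeps the characteristics of the source data, so it can stand in for the full data set in tests.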
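The hash-then-modulo idea can be approximated with Hive's built-in hash() and pmod() functions. This is only a sketch: it assumes a table named students with an integer id column and 2 buckets, and Hive's internal bucket assignment may differ in detail from this expression.

```sql
-- For each row: hash the chosen column, then take the result modulo the bucket count (2 here).
select id, hash(id) as h, pmod(hash(id), 2) as bucket_index
from students;

-- Count the rows per bucket to see how evenly the data is spread.
select pmod(hash(id), 2) as bucket_index, count(*) as cnt
from students
group by pmod(hash(id), 2);
```

If the counts come out heavily skewed, the chosen column is probably a poor bucketing key.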

------------------------------------------------------------------------------------------------------------------------------------------------

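Hive has bucketing enforcement turned off by default, so it must be enabled manually: set hive.enforce.bucketing = true; (This applies to Hive 1.x and earlier; the property was removed in Hive 2.0, where bucketing is always enforced.)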

Suppose there is a source table students. We create a new bucketed table with 2 buckets, hashing on the id column.

create table students_temp (id int, name string) clustered by (id) into 2 buckets row format delimited fields terminated by '\t';

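Insert the source table's data into the bucketed table. The statement is compiled into map and reduce tasks, with 2 reducers (one per bucket). Each bucket corresponds to one file in the table's directory.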

insert into students_temp select * from students;
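One way to confirm the "one file per bucket" claim is to list the table's directory after the insert. The path below assumes the common default warehouse location, and the file names mentioned in the comment are what Hive typically produces rather than a guarantee.

```sql
-- Assumption: default warehouse location; adjust to the real table directory.
-- With 2 buckets, two data files are expected (often named 000000_0 and 000001_0).
dfs -ls /user/hive/warehouse/students_temp;

-- The metadata also records the bucket count and the bucketing column.
describe formatted students_temp;
```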

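To query a single bucket, use tablesample as shown below: the data is split into two parts and the first part is returned.

If the table is not bucketed, the sample is still returned, but Hive has to read all of the data and compute it on the fly.

If the number of parts differs from the table's bucket count, the data likewise has to be read and the sample computed on the fly.

If the number of parts equals the table's bucket count, Hive can go straight to the file for that bucket, which is far more efficient.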

select * from students_temp tablesample(bucket 1 out of 2 on id);
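A few variations on the same clause, as a sketch against the students_temp table defined above. In bucket x out of y, x selects which part to return and y is the number of parts.

```sql
-- The second of the two parts (the other bucket).
select * from students_temp tablesample(bucket 2 out of 2 on id);

-- Four parts instead of two: returns roughly a quarter of the rows.
-- y no longer equals the table's bucket count, so this does not map to a single bucket file.
select * from students_temp tablesample(bucket 1 out of 4 on id);

-- Compare the sample size with the full table to check the split.
select count(*) from students_temp tablesample(bucket 1 out of 2 on id);
select count(*) from students_temp;
```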
