Hive Data Types, Internal and External Tables, Partitioning, and Bucketing

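Hive data types

Hive's STRING type is comparable to VARCHAR in a relational database: it holds a variable-length string.

Hive has three complex data types: ARRAY, MAP, and STRUCT. ARRAY and MAP resemble the array and Map of Java, while STRUCT resembles a struct in C in that it wraps a set of named fields. Complex types can be nested to any depth.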
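
As a rough sketch of how the three complex types might be declared and read: the table emp_info, its columns, and the chosen delimiters below are assumptions made up for illustration and do not come from the original notes.

```sql
-- Hypothetical table; names and delimiters are illustrative assumptions.
create table if not exists emp_info(
  name    string,
  hobbies array<string>,                      -- ordered list of values
  scores  map<string,int>,                    -- key/value pairs
  address struct<street:string, city:string>  -- named fields
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

-- Array elements are read by zero-based index, map values by key,
-- and struct fields with dot notation.
select hobbies[0], scores['math'], address.city
from emp_info;
```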

DDL

Creating a database

To avoid an error when the database already exists, add an if not exists check (the standard form).

create database if not exists db_hive;

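Listing databases

show databases;

Switching to a database

use db_hive;

Showing database information

desc database db_hive;

Dropping a database (empty database)

drop database if exists db_hive2;

Dropping a non-empty database (forced drop)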

drop database db_hive cascade;

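Creating tables

Internal tables

Tables created by default are called internal (managed) tables, because Hive controls (more or less) the lifecycle of their data. By default Hive stores the data of these tables in a subdirectory of the directory defined by the configuration property hive.metastore.warehouse.dir (for example, /user/hive/warehouse). When an internal table is dropped, Hive also deletes its data. Internal tables are not well suited to sharing data with other tools.

Creating a table directly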

create table if not exists student(
id int,
name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student';

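Field delimiter: the quoted value can be ',', ' ' (a space), '\t', and so on.
row format delimited fields terminated by '\t'

Storage format of the table's files; defaults to textfile when omitted.
stored as textfile

location: the HDFS directory that holds the table's data.
location '/user/hive/warehouse/student';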

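Creating a table from a query result (as)

-- the target needs a name different from the source table; employee2 is an illustrative name
create table if not exists employee2 as select * from employee;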

Creating an internal table from subqueries

create table employees as
with 
r1 as (select name from emps where empname='zhangsan' and sex=1),
r2 as (select name from emps where sex=0)
select * from r1 union all select * from r2;

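Creating a table from the structure of an existing table

create table if not exists student1 like student;

Listing the tables

show tables;

Viewing a table's columns

desc student;

Viewing detailed table information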

desc formatted student;

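External tables

An external table is declared with the external keyword.

Because the table is external, Hive does not consider itself the full owner of the data. Dropping the table does not delete the data, although the metadata describing the table is removed.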

create external table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
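
To make the difference in drop behaviour concrete, here is a small sketch that could be run in a scratch database. The table names demo_managed and demo_external and the path /tmp/demo_external are assumptions for illustration only.

```sql
-- Hypothetical tables used only to compare the two table types.
create table if not exists demo_managed(id int);
create external table if not exists demo_external(id int)
location '/tmp/demo_external';   -- assumed HDFS path

-- The "Table Type" row reports MANAGED_TABLE or EXTERNAL_TABLE.
desc formatted demo_managed;
desc formatted demo_external;

drop table demo_managed;    -- removes the metadata and the data directory
drop table demo_external;   -- removes the metadata only; files under /tmp/demo_external remain
```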

Loading data into an external table
When the data is in HDFS

load data inpath '/myinfo/4.txt' into table mytest;

When the data is on the local Linux filesystem

load data local inpath '/myinfo/4.txt' into table mytest;

Tables with collection columns
Delimiter between the elements of a collection:
collection items terminated by ','
For example, given the data:

1 zhangsan football,basketball,drink 22
2 caixukun sing,tiao,rap 12

Fields are separated by spaces, and the elements of a collection by commas.

create external table myuser(
uid int,
uname string,
ulike array<string>,
uage int
)
row format delimited fields terminated by ' '
collection items terminated by ','
location '/myinfo';

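Partitioned tables

A partition of a table corresponds to a separate directory on HDFS that holds all of that partition's data files. Partitioning in Hive is simply splitting by directory: a large dataset is divided into smaller ones according to business needs. A query that selects the required partitions through expressions in its WHERE clause reads only those directories, which makes it much more efficient.

Partitions are either static or dynamic.

Static partitions: when the partition value is known in advance, the partition is static. The partition name is given explicitly when a partition is added or data is loaded into it.

Creating a static partitioned table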

-- partition is a reserved word in Hive, so the table name is quoted with backticks
create table `partition`(
id int, 
name string, 
age int
)
partitioned by (sex int)
row format delimited fields terminated by '\t';

Loading data

load data local inpath '/opt/module/datas/data1.txt' into table `partition` partition(sex='1');
load data local inpath '/opt/module/datas/data2.txt' into table `partition` partition(sex='0');

Querying
select * from `partition` where sex='1';
select * from `partition` where sex='0';

Union query

select * from `partition` where sex='1'
union
select * from `partition` where sex='0';

Adding and dropping partitions (add, drop)

alter table `partition` add partition(sex='3');
alter table `partition` drop partition(sex='3');

Listing a table's partitions

show partitions `partition`;

Multi-level partitioned tables

create table partition2(
id int,
name string
)
partitioned by (age int,sex int)
row format delimited fields terminated by '\t';

Loading data into a multi-level partitioned table

load data local inpath '/opt/module/datas/data.txt' into table partition2 partition(age='20',sex='0');

Querying a multi-level partitioned table

select * from partition2 where age='20' and sex='0';

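Dynamic partitions

Dynamic partitions: the partition values are not fixed in advance but are determined by the input data.

Dynamic partitioning must be enabled in Hive.

Enabling dynamic partitioning
In nonstrict mode, every partition column may be dynamic.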

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

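Creating a dynamically partitioned table (origninfos is the source table, partinfos the partitioned target)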

create external table origninfos(
id string,
name string,
sex string,
age int
)
row format delimited fields terminated by ' '
location '/orgin';

create external table partinfos(
id string,
name string,
age int
)
partitioned by (sex string)
row format delimited fields terminated by ' ';

-- the dynamic partition column (sex) must come last in the select list
insert into partinfos partition(sex) select id,name,age,sex from origninfos;

Bucketed tables

Each table or partition can be further organized by Hive into buckets; a bucket is a finer-grained division of the data.
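
The notes give no bucketing DDL, so the following is only a minimal sketch of what a bucketed table could look like. The table name stu_buck, the bucketing column, and the bucket count of 4 are assumptions; student is the table created earlier in these notes.

```sql
-- Hypothetical bucketed table; name, columns and bucket count are assumptions.
create table if not exists stu_buck(
  id   int,
  name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';

-- On Hive releases before 2.0, bucketing is only enforced on insert when this is set;
-- later releases always enforce it, so the line is left commented out here.
-- set hive.enforce.bucketing=true;

-- Populate through insert ... select so that rows are hashed on id into the 4 buckets.
insert into table stu_buck select id, name from student;
```

Loading files with load data does not redistribute rows, which is why the sketch fills the table through a query instead.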

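1. Efficient sampling
Sampling becomes more efficient. When working with large datasets, it is very convenient during query development and revision to be able to trial-run a query on a small portion of the data.

2. Faster joins
Queries can be processed more efficiently. Bucketing adds extra structure to a table that Hive can exploit for some queries. In particular, a join of two tables that are bucketed on the same column (one that includes the join column) can be implemented efficiently as a map-side join. When both tables are bucketed on the shared join column, only the buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data the JOIN has to process.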
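
Both benefits can be sketched in a few statements. This assumes a hypothetical table stu_buck bucketed on id into 4 buckets and a second hypothetical table stu_buck2 bucketed the same way. Whether Hive actually picks a bucket map join depends on the version and on other join settings, so treat the second part as something to verify with explain rather than a guarantee.

```sql
-- Assumes stu_buck is bucketed on id into 4 buckets (hypothetical table).
-- Read only the first of the 4 buckets, roughly a quarter of the rows.
select * from stu_buck tablesample(bucket 1 out of 4 on id);

-- Let the optimizer consider a bucket map join when both sides
-- are bucketed on the join key.
set hive.optimize.bucketmapjoin=true;

-- stu_buck2 is a second hypothetical table, also bucketed on id.
select a.id, a.name, b.name
from stu_buck a
join stu_buck2 b on a.id = b.id;
```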

Altering tables

These operations change the table's metadata, not the data it holds.
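
The notes list no alter statements, so here is a short sketch of common ones. The table alter_demo and all column names are hypothetical and exist only for this illustration.

```sql
-- Hypothetical table used only to demonstrate metadata changes.
create table if not exists alter_demo(id int, name string);

-- Rename the table.
alter table alter_demo rename to alter_demo2;

-- Append a column; existing data files are untouched, so old rows read NULL for it.
alter table alter_demo2 add columns (age int comment 'added column');

-- Rename a column (the type must be restated).
alter table alter_demo2 change column name full_name string;

-- Confirm the new schema.
desc alter_demo2;
```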

Note: truncate can only clear the data of internal tables; it cannot delete the data of an external table.
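
A brief sketch of the note above, using the student (internal) and dept (external) tables defined earlier in these notes. The behaviour described for the external table is what the Hive versions these notes appear to target do; it is left commented out because it is expected to fail.

```sql
-- student is an internal table: its rows are removed, the table definition stays.
truncate table student;

-- dept is an external table: Hive rejects the statement instead of deleting the files.
-- truncate table dept;
```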
