Apache Hive Basic Syntax

Preface

I wrote this document two years ago as a set of notes when I first started learning Hive, and recently stumbled across it again.

I. Databases

1. Creating a database

create database|schema [if not exists] xiaoming;

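The optional if not exists clause in square brackets skips creation when the database already exists; otherwise the database is created.

2. Listing databases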

show databases|schemas;

3. Switching to a database

use xiaoming;

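4. Dropping a database

drop database [if exists] xiaoming [restrict|cascade];

By default Hive refuses to drop a database that still contains tables: the tables must be dropped first, otherwise the statement fails. Appending the cascade keyword drops the database together with its tables. The default is restrict, which enforces the check.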

II. Tables

1. Creating tables

A. Creating an internal (managed) table

create table [if not exists] xiaoming01(id int,name string);

No delimiter is specified here, so the table uses Hive's default field delimiter \001 (Ctrl-A).

Loading data:

load data [local] inpath 'path' [overwrite] into table xiaoming01;

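B. Creating an external table

create external table [if not exists] xiaoming01(id int,name string) location 'path';

Difference between internal and external tables: when data is loaded into an internal table, Hive moves the files into the warehouse directory. For an external table Hive only records the location of the data and leaves the files where they are. When a table is dropped, an internal table loses both its metadata and its data, whereas an external table loses only its metadata and the data files remain.

Loading data (files already under the directory given by location are readable without any load; load moves additional files into that directory):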

load data [local] inpath 'path' into table xiaoming01;

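C. Creating a partitioned table

Partitioned tables come in two kinds. A single-level partitioned table has one level of subdirectories under the table directory. A multi-level partitioned table has nested subdirectories under the table directory.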

create table xiaoming01(id int,name string) partitioned by(country string);

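The statement above creates a single-level partitioned table with country as the partition column. Note that a partition column must not be a column that already exists in the table.

create table xiaoming02(id int,name string) partitioned by(country string,province string);

The statement above creates a two-level partitioned table with country as the first partition column and province as the second.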

Loading data:

-- single-level partitioned table:
load data [local] inpath 'path' [overwrite] into table xiaoming01 partition(country='CN');
-- two-level partitioned table:
load data [local] inpath 'path' [overwrite] into table xiaoming02 partition(country='CN',province='ShangHai');

Listing the partitions of a table:

show partitions xiaoming01;

Querying by partition:

select * from xiaoming01 where country = 'CN'; -- read only the data in the CN partition

desc xiaoming01; -- show the table structure

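Notes:

A partition column is a virtual column: its values are not stored in the data files.
The value of a partition column is specified when data is loaded into the partition.
The purpose of partitioning is to avoid full table scans and so speed up queries.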
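
To make the directory layout behind these notes concrete, here is a small Java sketch (plain Java, not Hive code) that builds the relative directory a partition spec maps to under the usual key=value naming. The class and method names are made up for illustration, and the sketch assumes partitions without a custom location.

```java
public class PartitionPathDemo {
    // Builds the relative partition directory from alternating key/value arguments,
    // e.g. ("country", "CN", "province", "ShangHai") gives country=CN/province=ShangHai.
    static String partitionPath(String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("expected key/value pairs");
        }
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < keyValues.length; i += 2) {
            if (i > 0) {
                path.append('/');
            }
            path.append(keyValues[i]).append('=').append(keyValues[i + 1]);
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Single-level partition: one directory level under the table directory.
        System.out.println(partitionPath("country", "CN"));
        // Two-level partition: nested directories.
        System.out.println(partitionPath("country", "CN", "province", "ShangHai"));
    }
}
```

A query with where country = 'CN' only needs to read files under the country=CN directory, which is why the partition value never has to appear inside the data files.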

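D. Creating a bucketed table

In older Hive releases, bucketing is not enforced on insert by default and has to be switched on manually. (As far as I know, Hive 2.x removed this setting and always enforces bucketing.)

set hive.enforce.bucketing = true; -- enforce bucketing on insert

set mapreduce.job.reduces = 4; -- set the number of reducers to 4, matching the number of buckets

create table xiaoming01(id int,name string) clustered by(id) into 4 buckets; -- bucketed table: bucket column id, 4 buckets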

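Loading data (a plain load does not distribute rows into buckets, so it should not be used for a bucketed table):

insert overwrite table xiaoming01 select * from student cluster by(id);

A bucketed table is loaded with insert + select: the data first sits in an intermediate staging table (student here), is selected from it with bucketing, and the result is inserted into the bucketed table. Bucketing needs a reduce phase, whereas a plain load is essentially a file put performed by Hive, with no MapReduce job involved.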

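Requirement: bucket by one column while sorting by another

insert overwrite table xiaoming01 select * from student distribute by id sort by name [asc|desc];

cluster by cannot be combined with sort by, because cluster by already distributes and sorts on the same column; adding another sort would conflict.

cluster by (distribute and sort on the same column) == distribute by (distribute) + sort by (sort; the column may differ)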

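Notes:

The bucket column of a bucketed (clustered) table must be a column that already exists in the table.

Loading a bucketed table with insert + select runs a MapReduce job, and each bucket is filled from the corresponding partition of that job, so rows are assigned to buckets by hash by default.

Bucketing also splits the structured data files behind a table into finer pieces, but its main use is speeding up join queries.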
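
The hash-based bucket assignment mentioned above can be sketched in plain Java. This assumes the classic rule, bucket = (hash & Integer.MAX_VALUE) % numBuckets, with the hash of an int column being the value itself; newer Hive versions may use a different hash function, so treat it as an illustration only. BucketDemo and bucketFor are hypothetical names.

```java
public class BucketDemo {
    // Assumed classic rule: clear the sign bit, then take the remainder by the bucket count.
    static int bucketFor(int id, int numBuckets) {
        int hash = Integer.hashCode(id); // for an int this is the value itself
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // With 4 buckets, ids 1..8 cycle through buckets 1, 2, 3, 0, 1, 2, 3, 0.
        for (int id = 1; id <= 8; id++) {
            System.out.println("id=" + id + " -> bucket " + bucketFor(id, 4));
        }
    }
}
```

Because equal ids always land in the same bucket, two tables bucketed on the same column into the same number of buckets can be joined bucket by bucket.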

2. Altering tables

A. Altering an ordinary table

Rename table xiaoming01 to xiaoming02:

alter table xiaoming01 rename to xiaoming02;

Add a column dept of type string to xiaoming01; the trailing comment is optional:

alter table xiaoming01 add columns(dept string comment '部門');

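Remove the dept column from xiaoming01. As far as I know, Hive has no drop column statement, so the columns to keep are listed with replace columns instead:

alter table xiaoming01 replace columns (id int, name string);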

Rename the name column of xiaoming01 to newName with type string; the column can optionally be repositioned:

alter table xiaoming01 change [column] name newName string [first|after column_name];

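Replace the column list of xiaoming01, here turning newName back into name. replace columns discards the existing column definitions and defines the listed ones, so every column to keep must be listed:

alter table xiaoming01 replace columns (id int, name string);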

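B. Altering a partitioned table

Adding a partition

alter table xiaoming01 add partition (country='USA') location 'path';

Adding a partition this way does not move the data under path, and no country=USA partition directory is created.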

Adding several partitions

alter table xiaoming01 add partition(country='USA',province='NewYork') location 'path' partition(country='CN',province='ShangHai') location 'path';

Dropping a partition

alter table xiaoming01 drop if exists partition(country='USA');

Renaming a partition

alter table xiaoming01 partition(country='USA') rename to partition(country='CN');

3. Deleting tables

truncate table xiaoming01; -- remove all rows from xiaoming01, keeping the table

drop table xiaoming01; -- drop the table xiaoming01 itself

III. Other operations

1. load

load performs no transformation on the data: it is a pure copy/move of the data files into the location that corresponds to the Hive table.

load data [local] inpath 'path' [overwrite] into table tablename [partition (partcol1=val1, partcol2=val2 ...)] ;

2. insert

Hive can insert the result of a query into a table with the insert clause.

-- overwrite replaces the existing data
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

-- into appends to the existing data
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

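The number of columns in the query result must match the number of columns of the target table.

If a result column's type differs from the corresponding target column's type, Hive attempts a conversion. The conversion is not guaranteed to succeed, and values that fail to convert become NULL.

A query on a table can also be inserted back into that same table with insert into, which in effect duplicates the table's rows.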
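
As a rough analogy for the rule that a failed conversion yields NULL rather than an error, the following Java sketch parses a string into an Integer and returns null when parsing fails. It is only an illustration with hypothetical names; Hive's actual conversion rules differ in their details.

```java
public class LenientCastDemo {
    // Returns the parsed int, or null when the text is not a valid integer,
    // mimicking how a failed conversion shows up as NULL in the target column.
    static Integer toIntOrNull(String text) {
        if (text == null) {
            return null;
        }
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toIntOrNull("12"));  // prints 12
        System.out.println(toIntOrNull("abc")); // prints null
    }
}
```

The practical consequence is that a type mismatch does not stop the insert, so unexpected NULLs in the target table are worth checking after loading.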

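A. Multi-insert

First create three tables: the first column of the first table matches the column of the second table, and its second column matches the column of the third table.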

create table source_table (id int, name string) row format delimited fields terminated by ',';

create table test_insert1 (id int) row format delimited fields terminated by ',';

create table test_insert2 (name string) row format delimited fields terminated by ',';

The statement below reads id from source_table into test_insert1 and name into test_insert2. This is a multi-insert: the columns of one source table are written to several target tables in a single statement.

from source_table -- query source_table
insert overwrite table test_insert1
select id -- insert the id column into test_insert1
insert overwrite table test_insert2
select name; -- insert the name column into test_insert2

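B. Dynamic partition insert

Like bucketing, dynamic partitioning is disabled by default in older Hive releases and has to be enabled manually.

set hive.exec.dynamic.partition=true; -- enable dynamic partitioning (default false in older releases)

set hive.exec.dynamic.partition.mode=nonstrict; -- partition mode: strict (the default) requires at least one static partition, nonstrict allows every partition column to be dynamic

Requirement: insert the rows of dynamic_partition_table into the matching partitions of the target table d_p_t according to their date (day).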

-- create the source table
create table dynamic_partition_table(day string,ip string) row format delimited fields terminated by ",";

load data local inpath 'path' into table dynamic_partition_table;
-- contents of the loaded file:
-- 2015-05-10,ip1
-- 2015-05-10,ip2
-- 2015-06-14,ip3
-- 2015-06-14,ip4
-- 2015-06-15,ip1
-- 2015-06-15,ip2

-- create the target table:
create table d_p_t(ip string) partitioned by (month string,day string);

-- run the dynamic partition insert:
insert overwrite table d_p_t partition (month,day)
select ip,substr(day,1,7) as month,day
from dynamic_partition_table;

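Requirement: export query results to the file system

-- query results can be written to a directory, either local or on HDFS

-- write all rows of t_p to a local directory
insert overwrite local directory '/home/hadoop/test'
select * from t_p;

-- write all rows of t_p to a directory on HDFS
insert overwrite directory '/aaa/test'
select * from t_p;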

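Note:

In a dynamic partition insert, columns are mapped by position, so the insert succeeds as long as the positions line up, even when the column names differ.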
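
To show how the sample rows end up in partitions, here is a plain Java sketch that routes each "day,ip" row to a month=.../day=... directory, deriving the month the same way substr(day,1,7) does. The class and method names are hypothetical, and the sketch only imitates the routing, not Hive's actual execution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DynamicPartitionDemo {
    // Equivalent of substr(day, 1, 7) in HiveQL: the first seven characters, e.g. 2015-05.
    static String monthOf(String day) {
        return day.substring(0, 7);
    }

    // Groups "day,ip" rows by their target partition directory.
    static Map<String, List<String>> route(List<String> rows) {
        Map<String, List<String>> partitions = new TreeMap<>();
        for (String row : rows) {
            String[] fields = row.split(",");
            String day = fields[0];
            String ip = fields[1];
            String dir = "month=" + monthOf(day) + "/day=" + day;
            partitions.computeIfAbsent(dir, k -> new ArrayList<>()).add(ip);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "2015-05-10,ip1", "2015-05-10,ip2",
                "2015-06-14,ip3", "2015-06-14,ip4",
                "2015-06-15,ip1", "2015-06-15,ip2");
        // Prints one line per partition directory with the ip values stored in it.
        route(rows).forEach((dir, ips) -> System.out.println(dir + " -> " + ips));
    }
}
```

The six sample rows produce three partitions, one per distinct (month, day) pair, and only the ip column is stored inside each partition.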

3. select

Basic select syntax

select [all | distinct] select_expr, select_expr, ...
from table_name [join table_other on expr]
[where where_condition]
[group by col_list [having condition]]
[cluster by col_list | [distribute by col_list] [sort by col_list] | order by col_list]
[limit number]

The order of these clauses is fixed.

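Explanation:

1. order by: sorts the whole input globally, so only one reduce task is used; with large inputs this can take a long time.

2. sort by: not a global sort. Rows are sorted within each reduce task, so sort by guarantees that each reducer's output is ordered but not that the overall result is.

3. distribute by: sends rows to different reducers according to the given columns, using hash partitioning.

4. cluster by: does what distribute by does and also sorts by the same columns. When the distribute column and the sort column are the same, cluster by = distribute by + sort by.

5. distinct: returns only the distinct values of the selected columns.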
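
The difference between a per-reducer sort and a global sort is easy to see in a small simulation. The Java sketch below distributes values to reducers by hash and then sorts inside each reducer, which is roughly what distribute by plus sort by on the same column (that is, cluster by) does. The names are hypothetical and the modulo hash is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortByDemo {
    // distribute by: route each value to a reducer by hash;
    // sort by: sort the values inside each reducer independently.
    static List<List<Integer>> distributeAndSort(List<Integer> values, int reducers) {
        List<List<Integer>> outputs = new ArrayList<>();
        for (int r = 0; r < reducers; r++) {
            outputs.add(new ArrayList<>());
        }
        for (int v : values) {
            int reducer = (Integer.hashCode(v) & Integer.MAX_VALUE) % reducers;
            outputs.get(reducer).add(v);
        }
        for (List<Integer> out : outputs) {
            Collections.sort(out);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(5, 2, 8, 1, 4, 7);
        // prints [[2, 4, 8], [1, 5, 7]]
        System.out.println(distributeAndSort(values, 2));
    }
}
```

Each reducer's output is ordered, but concatenating the outputs gives 2, 4, 8, 1, 5, 7, which is not globally sorted. A global order needs order by, at the cost of a single reducer.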

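4. Hive join

Inner join: returns the rows that satisfy the join condition on both sides.

select * from t_a a inner join t_b b on a.id=b.id;

Left outer join: every row of the left table is kept; where the right table has no match, its columns are null.

select * from t_a a left join t_b b on a.id=b.id;

Right outer join: the mirror image of the left outer join.

select * from t_a a right join t_b b on a.id=b.id;

Left semi join: returns the rows of the left table for which the right table has a row satisfying the on condition.

select * from t_a a left semi join t_b b on a.id=b.id;

Full join (full outer join): returns all rows from both sides; columns without a match are null.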

select * from t_a a full join t_b b on a.id=b.id;

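in/exists subqueries (a newer feature, available in 1.2.1; as far as I know, subqueries in where date back to 0.13): same effect as left semi join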

select * from t_a a where a.id in (select id from t_b);
select * from t_a a where exists (select * from t_b b where a.id=b.id);

cross join (use with caution): returns the Cartesian product of the two tables; no join key is needed.

select a.*,b.* from t_a a cross join t_b b;

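Notes:

During a join, the reducers buffer the rows of every table except the last one, so put the largest table last to keep the buffered data small.

Hive supports equi-joins; non-equi join conditions are not supported (at least in the older versions these notes cover). Joins of more than two tables are supported.

The join clause must come before the where clause.

Joins are not commutative: the order of the tables matters, and both left and right joins are left-associative.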
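
The difference between an inner join and a left semi join shows up when the right table has duplicate keys. The following Java sketch imitates both on lists of ids; it is an in-memory illustration with hypothetical names, not how Hive executes joins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SemiJoinDemo {
    // Inner join on id: one output row per matching (left, right) pair.
    static List<Integer> innerJoinIds(List<Integer> left, List<Integer> right) {
        List<Integer> out = new ArrayList<>();
        for (Integer l : left) {
            for (Integer r : right) {
                if (l.equals(r)) {
                    out.add(l);
                }
            }
        }
        return out;
    }

    // Left semi join: each left row appears at most once, if any right row matches.
    static List<Integer> leftSemiJoinIds(List<Integer> left, List<Integer> right) {
        Set<Integer> rightIds = new HashSet<>(right);
        List<Integer> out = new ArrayList<>();
        for (Integer l : left) {
            if (rightIds.contains(l)) {
                out.add(l);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> tA = Arrays.asList(1, 2, 3);
        List<Integer> tB = Arrays.asList(2, 2, 4);
        System.out.println(innerJoinIds(tA, tB));    // prints [2, 2]
        System.out.println(leftSemiJoinIds(tA, tB)); // prints [2]
    }
}
```

The semi join behaves like the in/exists form: it only asks whether a match exists, so duplicates on the right side do not multiply the left rows.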

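5. UDF (user-defined function)

When Hive's built-in functions cannot meet a requirement, we can write a user-defined function.

Kinds of user-defined functions:

UDF: operates on a single row and produces a single row as output (math functions, string functions).
UDAF (user-defined aggregate function): takes multiple input rows and produces a single output row (count, max).

How to write a UDF:

Write a Java class that extends UDF and provides one or more overloaded evaluate methods.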

import org.apache.hadoop.hive.ql.exec.UDF;

public class AddUdf extends UDF {
    // Adds two integers; returns null if either argument is null.
    public Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    // Overload for doubles, with the same null handling.
    public Double evaluate(Double a, Double b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }
}

Package the class into a jar and upload it to the server.

Add the jar to Hive:

add jar /path/AddUdf.jar;

Create a temporary function bound to the compiled class:

create temporary function add_example as 'xxx.AddUdf';

Use the function:

SELECT add_example(scores.math, scores.art) FROM scores;
```
-- drop the temporary function
drop temporary function add_example;
```

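6. Hive delimiters

A. How Hive reads files: Hive first calls the InputFormat (TextInputFormat by default) to read the data one line at a time, and then uses the Deserializer of the SerDe (LazySimpleSerDe by default) to split each record into fields (default delimiter \001).

So Hive's default field delimiter is \001. If no delimiter is specified for the table, the files we load must also use \001 as their delimiter. Otherwise no error is raised, but the fields are not recognized and the query returns null, null, null...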
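
Why a wrongly delimited file produces nulls can be seen by splitting lines on the Ctrl-A character in plain Java. This is only a sketch of the splitting step with hypothetical names; the real LazySimpleSerDe does considerably more.

```java
public class CtrlADemo {
    // Splits a line on the Ctrl-A character (code point 1), keeping trailing empty fields.
    static String[] splitFields(String line) {
        return line.split("\u0001", -1);
    }

    public static void main(String[] args) {
        // Correctly delimited line: two fields are found.
        System.out.println(splitFields("1\u0001zhangsan").length); // prints 2
        // Comma-delimited line: the whole line is a single field.
        System.out.println(splitFields("1,zhangsan").length);      // prints 1
    }
}
```

With a comma-delimited file the whole line lands in the first column; if that text cannot be converted to the column's type the value is NULL, and the remaining columns have no data at all.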

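B. By default Hive supports only single-byte field delimiters. If the data file uses a multi-character delimiter, for example:

01||zhangsan
02||lisi

then RegexSerDe can be used to extract the fields with a regular expression.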

create table t_bi_reg(id string,name string) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties( 
'input.regex'='(.*)\\|\\|(.*)', 
'output.format.string'='%1$s %2$s' 
) 
stored as textfile;

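Where:

input.regex: the input regular expression; whatever stands on either side of || is captured as one field.
output.format.string: the output format string; %1$s and %2$s refer to the first and second columns of the table.

Notes:

a. With RegexSerDe, every column must be of type string.

b. In input.regex, each capture group corresponds to one column.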
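
The same regular expression can be tried out with java.util.regex before it is put into a table definition. The sketch below uses hypothetical class and method names and returns null for a line that does not match; how RegexSerDe itself treats non-matching lines is not covered here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFieldsDemo {
    // Same pattern as input.regex above; the backslashes are doubled in Java source as well.
    private static final Pattern LINE = Pattern.compile("(.*)\\|\\|(.*)");

    // Returns one string per capture group, or null when the line does not match.
    static String[] extract(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] fields = extract("01||zhangsan");
        System.out.println(fields[0]); // prints 01
        System.out.println(fields[1]); // prints zhangsan
    }
}
```

Because the first group is greedy, a line with several || separators puts everything before the last || into the first field, so a table with more columns needs one capture group per column.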