Hive Basics

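Hive is essentially a client tool, similar to Navicat or PL/SQL Developer. The difference is that Hive reads data stored on HDFS and is used for offline queries. Offline means slow: a single job can run for several hours or even longer.

Hive is widely used in day-to-day development, mostly for statistical work. In my own experience, about 80% of the work can be done with Hive's basic features; only rarely do you need complex queries or tuning.

This article picks out basics that are easy to overlook. It is fairly long, so consider bookmarking it and reading it in several sittings. The backend documentation explains each topic in detail; if you want to dig deeper into Hive, download it at the end of the article.

This article mainly covers: Hive query principles, internal vs. external tables, partitioned tables, bucketed tables, data import, complex data types, the four "by" clauses, and a real-world join that uses partitions.

Before getting started, let's take a quick look at how Hive works: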

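01 Hive Query Principles

Hive translates HQL into MapReduce (MR) programs and runs them. We won't dig into how the translation works under the hood; here is a simple overview of the query process:

1 Create a Hive table that matches the format of the data on HDFS.

2 Load the HDFS data into the table through the mapping.

3 The table's metadata is recorded in MySQL. Metadata does not mean the data on HDFS; it means the Hive table's own properties, such as its columns and storage location.

4 When writing a select statement, write the query according to the mapping between the table and the data.

5 When the query runs, Hive first looks up the table's file location in the metastore database, then uses its parser, compiler, optimizer, and executor to turn the SQL statement into an MR program that runs on YARN and produces the final result.

PS: Hive can be accessed in three ways: bin/hive (the command-line client), JDBC, and the web UI. JDBC is the most common. (The backend documentation describes the setup in detail; to make it easier to run the SQL below, set up an environment first.)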
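One quick way to see step 5 in action is Hive's EXPLAIN statement, which prints the plan produced by the compiler and optimizer without running the job. The sketch below assumes the order_partition table that is created later in this article; any existing table would do.

```sql
-- Print the execution plan instead of running the query.
-- The output lists the stages Hive generates from the SQL text
-- (typically a map/reduce stage followed by a fetch stage).
explain
select month, count(*) as order_cnt
from order_partition
where month = '2019-03'
group by month;
```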

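02 Internal vs. External Tables

Unlike tables in a conventional database, Hive tables are either internal (managed) tables or external tables. The two behave differently when the table is created and loaded, and when it is dropped.

When the table is created and loaded:

An internal table moves the data files into its own location, by default under /user/hive/warehouse/.

An external table does not move the data; it stays wherever it already is.

When the table is dropped:

Dropping an internal table deletes its data as well.

Dropping an external table leaves the data in place.

So the difference is clear. In practice, external tables are usually used to map existing data, while computed results usually go into internal tables, because an internal table only stores results or serves in joins and is independent of the source data on HDFS.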

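Question:

How can you tell whether a table is external or internal?

Answer:

For an existing table, run:
desc formatted table_name;
and check the Table Type field, which shows MANAGED_TABLE for an internal table or EXTERNAL_TABLE for an external one.

For a new table:
The create statement itself tells you: a statement with EXTERNAL creates an external table, and one without it creates an internal table.

The create table syntax is: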

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)] -- column name and column type
[COMMENT table_comment] -- table comment
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] -- partition columns
[CLUSTERED BY (col_name, col_name, ...) -- bucketing columns
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] -- sort columns and number of buckets
[ROW FORMAT row_format] -- e.g. row format delimited fields terminated by 'delimiter'
[STORED AS file_format] -- storage file format
[LOCATION hdfs_path] -- HDFS path the table maps to
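To make the internal/external distinction concrete, here is a minimal sketch of an external table's life cycle. The table name demo_ext_log and the path /tmp/demo_ext_log are made up for illustration and are not from the original setup.

```sql
-- Hypothetical external table over an existing HDFS directory.
create external table if not exists demo_ext_log (
  id  string,
  msg string
)
row format delimited fields terminated by '\t'
location '/tmp/demo_ext_log';

-- "Table Type" in the output should read EXTERNAL_TABLE.
desc formatted demo_ext_log;

-- Removes only the metadata; the files under /tmp/demo_ext_log stay on HDFS.
drop table demo_ext_log;
```

Running the same statements without the EXTERNAL keyword would create an internal table, and the final drop would delete the data files as well.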

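03 Hive Partitioned Tables

Partitioned tables are used almost everywhere. Data is typically partitioned by calendar year and month, which makes it easier to manage, and a query can be restricted to specific partitions.

A query without a partition filter: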

select a1,a2 .. from table1;

This scans the whole table; with a large data set you could be waiting a very long time for the result.

A query with a partition filter:

select a1,a2 .. from table1 where (year = '2019' and month='12');

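This reads only the data for December 2019. Using partitions well improves query efficiency considerably.

So how do you create partitions?

Just fill in the partition line of the create table syntax:

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] -- partition columns

It's that simple.

A table with a single partition column is straightforward, so here is an example with multiple partition columns: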

create table student (id string,name string, age int)
partitioned by (year string,month string,day string)
row format delimited fields terminated by '\t';

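Types of partitioning:

Partitions are either static or dynamic.

With static partitioning you specify the partition by hand, giving an explicit value for each partition column. For example:

1 Create the partitioned table: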

create table order_partition(
order_number string,
order_price double,
order_time string
)
partitioned BY(month string)
row format delimited fields terminated by '\t';

2 Prepare the data. order.txt contains the following (fields are tab-separated, matching the table's delimiter):

10001 100 2019-03-02
10002 200 2019-03-02
10003 300 2019-03-02
10004 400 2019-03-03
10005 500 2019-03-03
10006 600 2019-03-03
10007 700 2019-03-04
10008 800 2019-03-04
10009 900 2019-03-04

3 Load the local file into the table

load data local inpath '/bigdata/install/hivedatas/order.txt' overwrite into table order_partition partition(month='2019-03');

The partition is specified at the end as 2019-03, so all of the rows above land in the 2019-03 partition. Partitions can also be added and dropped manually.

4 Query the result

select * from order_partition where month='2019-03';

Result (the last column is the partition value):

10001 100.0 2019-03-02 2019-03
10002 200.0 2019-03-02 2019-03
10003 300.0 2019-03-02 2019-03
10004 400.0 2019-03-03 2019-03
10005 500.0 2019-03-03 2019-03
10006 600.0 2019-03-03 2019-03
10007 700.0 2019-03-04 2019-03
10008 800.0 2019-03-04 2019-03
10009 900.0 2019-03-04 2019-03
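As noted above, partitions can also be added and dropped by hand. A minimal sketch against the order_partition table from this example; the partition value '2019-04' is only illustrative.

```sql
-- Add an empty partition by hand.
alter table order_partition add if not exists partition (month = '2019-04');

-- List the partitions the metastore knows about.
show partitions order_partition;

-- Drop the partition again; for an internal table this also deletes its data.
alter table order_partition drop if exists partition (month = '2019-04');
```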

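Dynamic partitioning loads rows into different partitions automatically. Unlike static partitioning, you only name the partition column; you do not give explicit values for it.

For example:

1 Create the tables:

-- Create a plain (non-partitioned) table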
create table t_order(
order_number string,
order_price double,
order_time string
)row format delimited fields terminated by '\t';

-- Create the target partitioned table
create table order_dynamic_partition(
order_number string,
order_price double
)partitioned BY(order_time string)
row format delimited fields terminated by '\t';

2 Prepare the data file order_created.txt, with the same content as in the static partition example:

10001 100 2019-03-02
10002 200 2019-03-02
10003 300 2019-03-02
10004 400 2019-03-03
10005 500 2019-03-03
10006 600 2019-03-03
10007 700 2019-03-04
10008 800 2019-03-04
10009 900 2019-03-04

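3 Load the data into the plain table t_order

load data local inpath '/bigdata/install/hivedatas/order_created.txt' overwrite into table t_order;

No partition value is specified by hand for the partitioned table. In the next step Hive decides from each row's partition column value which partition the row belongs to.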

4 Load the data into the partitioned table dynamically

Dynamic partitioning requires these settings:

-- Enable dynamic partitioning

hive> set hive.exec.dynamic.partition=true;

-- Put Hive in non-strict mode

hive> set hive.exec.dynamic.partition.mode=nonstrict;

-- Load the data

hive> insert into table order_dynamic_partition partition(order_time) select order_number,order_price,order_time from t_order;

5 View the partitions

hive> show partitions order_dynamic_partition;

order_time=2019-03-02
order_time=2019-03-03
order_time=2019-03-04
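The non-strict setting is needed above because every partition column is dynamic. In strict mode Hive requires at least one static partition column. The sketch below shows a mixed static and dynamic insert; the table order_ym_partition is hypothetical and only serves as an illustration, while t_order is the plain table from this example.

```sql
-- Hypothetical target table with two partition columns.
create table order_ym_partition(
  order_number string,
  order_price  double,
  order_time   string
)
partitioned by (year string, month string)
row format delimited fields terminated by '\t';

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict;

-- year is static (given a value); month is dynamic (taken from the last select column).
-- Static partition columns must come before dynamic ones in the partition clause.
insert into table order_ym_partition partition(year = '2019', month)
select order_number, order_price, order_time,
       substr(order_time, 6, 2) as month   -- '2019-03-02' -> '03'
from t_order;
```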

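04 Hive Bucketed Tables

Bucketed tables are generally used only with very large data sets. Bucketing hashes the value of a chosen column and uses the hash to split the data: rows whose hash maps to the same bucket go into the same file, so data that would otherwise sit in one file ends up spread across several files.

For example:

Set a few parameters before creating a bucketed table:

1 Enable bucketing

set hive.enforce.bucketing = true;

2 Set the number of reducers to match the number of buckets

set mapreduce.job.reduces = 4;

Create the tables:
-- 1 Create the bucketed table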

create table user_bucket_demo(id int,name string)
clustered by (id)
into 4 buckets
row format delimited fields terminated by '\t';

-- 2 Create a plain table

create table user_demo(id int,name string)
row format delimited fields terminated by '\t';

-- 3 Load local data into the plain table

load data local inpath '/home/hadoop/data/02/user_bucket.txt' into table user_demo;

Note:

-- Loading data into the bucketed table this way does not bucket it

load data local inpath '/home/hadoop/data/02/user_bucket.txt' into table user_bucket_demo;

-- 4 The correct way to load data into the bucketed table:

insert into user_bucket_demo select * from user_demo;

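-- 5 View the result

select * from user_bucket_demo tablesample(bucket 1 out of 2);

-- Number of buckets read = 4/2 = 2

-- First, data is taken from bucket 1

-- Then data is taken from bucket 1+2 = 3

How tablesample(bucket x out of y) works:

x is the bucket to start reading from.
y must be a multiple or a factor of the table's bucket count; data is read from (bucket count / y) buckets in total.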
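Two more variations may help make the x and y rule concrete. This is a sketch against the 4-bucket user_bucket_demo table above; the optional "on id" clause names the bucketing column explicitly.

```sql
-- y equals the bucket count (4): 4/4 = 1, so exactly one bucket is read, here bucket 2.
select * from user_bucket_demo tablesample(bucket 2 out of 4 on id);

-- y is larger than the bucket count: 4/8 = 1/2, so about half of bucket 1 is read.
select * from user_bucket_demo tablesample(bucket 1 out of 8 on id);
```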

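05 Importing Data into Hive

Importing data is usually initialization work. Once a table is mapped to an HDFS path, subsequent partition data is typically picked up through that mapping, so explicit imports are not needed very often; they are mostly used for your own testing.

The ways to import data are as follows:

Loading data with load

This approach was already used in the partitioned table section.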

load data [local] inpath 'dataPath' [overwrite ] into table student [partition (partcol1=val1,…)];

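With local, the file is loaded from the local file system; without it, from HDFS.

With overwrite, the existing data in the table is replaced; without it, the data is appended.

With partition, the data is loaded into the given partition.

Loading data with a query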

insert overwrite table yourTableName partition(month = '201806') select column1,column2 from otherTable;

Creating a table and loading it from a query (create table as select)

create table yourTableName as select * from otherTable;

Specifying the data path with location (commonly used)

1 Create the table and point it at a path on HDFS

create external table score (s_id string,c_id string,s_score int) row format delimited fields terminated by '\t' location '/myscore';

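2 Upload the data to HDFS. HDFS can be operated from the Hive client with dfs commands.

-- Create the HDFS path

dfs -mkdir -p /myscore;

-- Upload the data to HDFS; the test data is at the end of the article.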

dfs -put /bigdata/install/hivedatas/score.csv /myscore;

-- View the result

select * from score;

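Note:

If the query returns no data, you can run:

msck repair table score;

to repair the table. Put simply, this rebuilds the mapping between the table and its data files: for a partitioned table it registers partition directories that exist on HDFS but are missing from the metastore.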
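The repair step matters most for partitioned tables whose files are placed on HDFS directly. The following sketch shows that situation; the table score_part, the path /myscore_part, and the partition value 201806 are assumptions made for illustration, and the source file is the score.csv used above.

```sql
-- Hypothetical partitioned external table.
create external table score_part (s_id string, c_id string, s_score int)
partitioned by (month string)
row format delimited fields terminated by '\t'
location '/myscore_part';

-- Place a file directly into a partition-style directory on HDFS.
dfs -mkdir -p /myscore_part/month=201806;
dfs -put /bigdata/install/hivedatas/score.csv /myscore_part/month=201806;

-- The metastore does not know this partition yet, so this returns no rows.
select * from score_part where month = '201806';

-- Register the directory as a partition, then query again.
msck repair table score_part;
select * from score_part where month = '201806';
```

An explicit alter table score_part add partition (month = '201806'); would register the same directory without scanning the whole table path.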

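06 Creating Tables with Complex Data Types

Hive has three complex data types: Array, Map, and Struct.

Array: an ordered collection of elements of the same type
Map: key-value (k-v) pairs
Struct: a group of fields that may have different types

When creating a table, besides the field delimiter in row format, you also need to specify delimiters for any complex-type columns.

Create table syntax for complex types: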

create table tablename (id string,name string,...)
row format delimited fields terminated by ' '
collection items terminated by '\t' -- delimiter between elements of an Array or Struct (and between Map entries)
map keys terminated by ':'; -- delimiter between a Map key and its value

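Notes on the syntax

Creating tables:

Array, Struct, and Map all use collection items terminated by '' as the element delimiter.
For a Map with several entries, the key-value pairs are separated with collection items terminated by '\t'.
Within one entry, the key and value are separated with map keys terminated by ':'.

Querying:

array -- select locations[0]
map -- info['name']
struct -- info.name info.age

Test cases:

Array
Prepare the test data file t_array.txt; the array elements are joined with ",".

Data: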

1 zhangsan beijing,shanghai
2 lisi shanghai,tianjin

Create the table:

create table t_array(
id string,
name string,
locations array<string>
)
row format delimited fields terminated by ' ' collection items terminated by ',';

Load the data into the table

load data local inpath '/home/hadoop/data/01/t_array.txt' into table t_array;

Test queries:

1 Simple query:

select id,locations[0],locations[1] from t_array;

2 Count the elements in the array

select size(locations) from t_array;

3 Find rows whose locations contain beijing

select * from t_array
where array_contains(locations,'beijing');

Map

Prepare the test data file t_map.txt

Data:

1 name:zhangsan#age:30
2 name:lisi#age:40

Create the table:

create table t_map
(id string,info map<string,string>)
row format delimited fields terminated by ' '
collection items terminated by '#' -- delimiter between key-value pairs
map keys terminated by ':'; -- delimiter between a key and its value

Load the data:

load data local inpath '/home/hadoop/data/01/t_map.txt' into table t_map;

Queries:

1 Simple query:

select id,info['name'],info['age'] from t_map;

2 Get all keys of the map:

select map_keys(info) from t_map;

3 Get all values of the map:

select map_values(info) from t_map;

Struct

Prepare the test data file t_struct.txt

Data:

1 zhangsan:30:beijing
2 lisi:40:shanghai

Create the table:

create table t_struct(id string,info struct<name:string,age:int,address:string>)
row format delimited fields terminated by ' ' -- delimiter between fields
collection items terminated by ':'; -- delimiter between struct members

Load the data:

load data local inpath '' into table t_struct;

Query:

select id,info.name,info.age,info.address from t_struct;
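Beyond indexing into a complex column, a common follow-up is to flatten an array into rows. A minimal sketch using the t_array table from the Array example; the alias names loc_view and loc are illustrative.

```sql
-- Turn each array element into its own row.
select id, name, loc
from t_array
lateral view explode(locations) loc_view as loc;

-- With the two sample rows of t_array this yields four rows (row order may vary):
-- 1  zhangsan  beijing
-- 1  zhangsan  shanghai
-- 2  lisi      shanghai
-- 2  lisi      tianjin
```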

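07 The Four "by" Clauses in Hive

order by: global sort. All rows go through a single reducer, so the result is globally ordered regardless of the configured number of reducers.
sort by: sorts within each reducer. With one reducer the result is globally ordered, the same as order by; with more than one reducer, only each reducer's output is ordered.
distribute by + sort by: partitioned sort. Unlike plain sort by, you can name the distribution column: map output rows with the same hash of that column are sent to the same reducer, and each reducer's output is ordered.
cluster by: when distribute by and sort by use the same column, the pair can be replaced with cluster by (ascending order only).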
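The four clauses can be compared side by side on the order_partition table from section 03. This is only a sketch; the reducer count of 3 is an arbitrary choice to make the differences visible.

```sql
-- Use several reducers so the differences become visible.
set mapreduce.job.reduces = 3;

-- Global order: one reducer sorts everything.
select * from order_partition order by order_price desc;

-- Each reducer's output is sorted, but the overall result is not.
select * from order_partition sort by order_price desc;

-- Rows with the same order_time go to the same reducer, sorted by price within it.
select * from order_partition distribute by order_time sort by order_price desc;

-- Same as: distribute by order_number sort by order_number (ascending only).
select * from order_partition cluster by order_number;
```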

08 Real-World Requirement: Using Partitions When Joining Tables

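Table joins in Hive work the same way as in a conventional database, with the same keywords such as inner join and left join. Here is a requirement from real work.

The requirement:

Hive has a table that stores articles.
Columns:
title -- title
content -- content
pubtime -- publication time
serviceId -- article type

Partition columns -- year, month

Find articles published from November 11 to 18, 2019, whose title is identical to the content, whose title is longer than 30 characters, and whose article type is between 1 and 5.

The solution uses subqueries plus a self-join to find the matching articles.
Note: be sure to filter on the partitions, otherwise the job will hang.

The resulting SQL: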

select t1.id, t1.title,t1.content, t1.pubtime,t1.serviceId
from (select id, title,content, pubtime,serviceId from article_info where (year = '2019' and month = '11')) t1
inner join (select id, url, content, pubtime,serviceId from article_info where (year = '2019' and month = '11')) t2
on t1.id = t2.id
where t1.pubtime >= '2019-11-11 00:00:00' and t1.pubtime <='2019-11-18 23:59:59'
and length(t1.title) > 30 and t1.serviceId in (1,2,3,4,5) and t1.title = t2.content;
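If id is unique in article_info (an assumption; the article does not say so), the self-join pairs each row only with itself, and the same requirement can be sketched as a single scan of the partition:

```sql
-- Assumes id is unique in article_info, so joining the table to itself on id
-- matches each row only with itself.
select id, title, content, pubtime, serviceId
from article_info
where year = '2019' and month = '11'              -- partition filter
  and pubtime >= '2019-11-11 00:00:00'
  and pubtime <= '2019-11-18 23:59:59'
  and length(title) > 30                          -- title longer than 30 characters
  and serviceId in (1,2,3,4,5)
  and title = content;
```

The partition filter is still the part that keeps the job from scanning the whole table.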
