大數據分析利器之hive(3)

1、hive的參數傳遞

1.1 Hive命令行

hive [-hiveconf x=y]* [<-i filename>]* [<-f filename>|<-e query-string>] [-S]

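Options:

1. -i  Initialize the session by running the HQL in the given file.

2. -e  Execute the HQL string given on the command line.

3. -f  Execute an HQL script file.

4. -v  Echo the executed HQL statements to the console.

5. -p  Connect to the Hive Server on the given port number.

6. -hiveconf x=y  Set a Hive/Hadoop configuration variable for this run.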

1.2 Hive參數配置方式

Hive參數大全：
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
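When developing Hive applications you will inevitably need to set Hive parameters, either to tune how HQL runs or to help track down problems.
A question that comes up often in practice is why a parameter that was set has no effect. The usual cause is that it was set the wrong way.

General parameters can be set in three ways:

Configuration file: hive-site.xml

Command-line option: passed when starting the Hive client

Parameter declaration: issued with set after entering the client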

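Configuration files: Hive's configuration files are

User-defined configuration file: $HIVE_CONF_DIR/hive-site.xml
Default configuration file: $HIVE_CONF_DIR/hive-default.xml
The user-defined configuration overrides the default configuration

Hive also reads the Hadoop configuration, because Hive starts as a Hadoop client, and Hive's configuration overrides Hadoop's. Settings in the configuration files apply to every Hive process started on the machine.

Command-line options: when starting Hive (as a client or as a server), add -hiveconf param=value on the command line to set a parameter, for example:

bin/hive -hiveconf hive.root.logger=INFO,console

This setting applies to the Session being started (or, when started as a server, to the Sessions of all requests).

Parameter declaration: use the SET keyword in HQL to set a parameter, for example:

set mapred.reduce.tasks=100;

This setting is also scoped to the session.

The priority of the three methods increases in the order listed: a parameter declaration overrides a command-line option, and a command-line option overrides the configuration file. Note that some system-level parameters, such as the log4j settings, must be set by one of the first two methods, because they are read before the Session is established.

Parameter declaration  >  command-line option  >  configuration file (hive)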
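
The precedence can be checked from inside a session. The following is a minimal sketch, assuming a Hive CLI or Beeline session; it relies on the fact that set followed by only a property name prints the value currently in effect.

```sql
-- Print the value currently in effect (from hive-site.xml or -hiveconf, if set there)
set mapred.reduce.tasks;

-- A SET statement overrides both of those, for the current session only
set mapred.reduce.tasks=100;

-- Printing again should now show mapred.reduce.tasks=100
set mapred.reduce.tasks;
```

Opening a new session and printing the property again should show the file or command-line value once more, since the override does not outlive the session.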

1.3 使用變量傳遞參數

實際工作當中，我們一般都是將hive的hql語法開發完成之後，就寫入到一個腳本里面去，然後定時的通過命令hive -f 去執行hive的語法即可，然後通過定義變量來傳遞參數到hive的腳本當中.

hive0.9以及之前的版本是不支持傳參的
hive1.0版本之後支持 hive -f 傳遞參數

在hive當中我們一般可以使用hivevar或者hiveconf來進行參數的傳遞

1.3.1 hiveconf使用說明

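hiveconf defines properties (configuration parameters) of the Hive execution context and can override values from hive-site.xml (hive-default.xml), such as the user's working directory, the log level, or the execution queue.
A hiveconf variable must be read with the hiveconf prefix:

${hiveconf:key}

For example, to override the execution queue at startup:

bin/hive --hiveconf "mapred.job.queue.name=root.default"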

1.3.2 hivevar使用說明

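hivevar defines variables that are substituted into HQL at run time, loosely similar to the placeholders of a PreparedStatement in Java. A hivevar variable can be read either as ${hivevar:key} or, without the prefix, as ${key}.

With the prefix:
 ${hivevar:key}
Without the prefix:
 ${key}

For example, after starting Hive with --hivevar name=zhangsan, both ${hivevar:name} and ${name} yield zhangsan.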
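
The two ways of reading a hivevar can be tried without restarting the client. This is a small sketch, assuming variable substitution is enabled (the default) and a Hive version that accepts a SELECT without a FROM clause; the column aliases are made up for illustration.

```sql
-- Define a hivevar variable inside the session (same effect as --hivevar name=zhangsan)
set hivevar:name=zhangsan;

-- Both forms are replaced by the text zhangsan before the statement is compiled
select '${hivevar:name}' as with_prefix, '${name}' as without_prefix;
```

Because the substitution is purely textual, the quotes around each placeholder are what turn the value into a string literal.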

1.3.3 define使用說明

define serves exactly the same purpose as hivevar, and has the short form -d:

bin/hive --hiveconf "mapred.job.queue.name=root.default" -d my="201809" --database mydb

select * from mydb where concat(year, month) = ${my} limit 10;

1.3.4 hiveconf與hivevar使用實戰

需求：hive當中執行以下hql語句，並將參數全部都傳遞進去

select * from student left join score on student.s_id = score.s_id where score.month = '201807' and score.s_score > 80 and score.c_id = 03;

第一步：創建student表並加載數據

hive (myhive)> create external table student
(s_id string,s_name string,s_birth string , s_sex string ) row format delimited
fields terminated by '\t';

hive (myhive)> load data local inpath '/zsc/install/hivedatas/student.csv' overwrite into table student;

第二步：定義hive腳本

開發hql腳本，並使用hiveconf和hivevar進行參數傳遞

cd /zsc/install/hivedatas

vim hivevariable.hql
use myhive;
select * from student left join score on student.s_id = score.s_id where score.month = ${hiveconf:month} and score.s_score > ${hivevar:s_score} and score.c_id = ${c_id};

第三步：調用hive腳本並傳遞參數

[root@node03 hive-1.1.0-cdh5.14.2]# bin/hive --hiveconf month=201807 --hivevar s_score=80 --hivevar c_id=03  -f /zsc/install/hivedatas/hivevariable.hql

2、hive的常用函數介紹

2.1 系統內置函數

1．查看系統自帶的函數
hive> show functions;
2．顯示自帶的函數的用法
hive> desc function upper;
3．詳細顯示自帶的函數的用法
hive> desc function extended upper;

2.2 數值計算

2.2.1 取整函數: round

Syntax: round(double a)
Returns: DOUBLE
Description: rounds a to the nearest whole number (halves round up)

hive> select round(3.1415926) from tableName;
3
hive> select round(3.5) from tableName;
4
hive> create table tableName as select round(9542.158) from tableName;

2.2.2 指定精度取整函數: round

語法: round(double a, int d)
返回值: DOUBLE
說明: 返回指定精度d的double類型

hive> select round(3.1415926,4) from tableName;
3.1416

2.2.3 向下取整函數: floor

語法: floor(double a)
返回值: BIGINT
說明: 返回等於或者小於該double變量的最大的整數

hive> select floor(3.1415926) from tableName;
3
hive> select floor(25) from tableName;
25

2.2.4 向上取整函數: ceil

語法: ceil(double a)
返回值: BIGINT
說明: 返回等於或者大於該double變量的最小的整數

hive> select ceil(3.1415926) from tableName;
4
hive> select ceil(46) from tableName;
46

2.2.5 向上取整函數: ceiling

語法: ceiling(double a)
返回值: BIGINT
說明: 與ceil功能相同

hive> select ceiling(3.1415926) from tableName;
4
hive> select ceiling(46) from tableName;
46

2.2.6 取隨機數函數: rand

語法: rand(),rand(int seed)
返回值: double
Description: returns a random number in the range 0 to 1. If a seed is given, the sequence of random numbers is reproducible

hive> select rand() from tableName;
0.5577432776034763
hive> select rand() from tableName;
0.6638336467363424
hive> select rand(100) from tableName;
0.7220096548596434
hive> select rand(100) from tableName;
0.7220096548596434

2.3 日期函數

2.3.1 UNIX時間戳轉日期函數: from_unixtime

語法: from_unixtime(bigint unixtime[, string format])
返回值: string
說明: 轉化UNIX時間戳（從1970-01-01 00:00:00 UTC到指定時間的秒數）到當前時區的時間格式

hive> select from_unixtime(1323308943,'yyyyMMdd') from tableName;
20111208

2.3.2 獲取當前UNIX時間戳函數: unix_timestamp

語法: unix_timestamp()
返回值: bigint
Description: returns the current UNIX timestamp in seconds (the value itself does not depend on the time zone)

hive> select unix_timestamp() from tableName;
1323309615

2.3.3 日期轉UNIX時間戳函數: unix_timestamp

語法: unix_timestamp(string date)
返回值: bigint
說明: 轉換格式爲"yyyy-MM-dd HH:mm:ss"的日期到UNIX時間戳。如果轉化失敗，則返回0。

hive> select unix_timestamp('2011-12-07 13:01:03') from tableName;
1323234063

2.3.4 指定格式日期轉UNIX時間戳函數: unix_timestamp

語法: unix_timestamp(string date, string pattern)
返回值: bigint
說明: 轉換pattern格式的日期到UNIX時間戳。如果轉化失敗，則返回0。

hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from tableName;
1323234063
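
unix_timestamp and from_unixtime are often chained to change the layout of a date string. A minimal sketch, assuming both calls run in the same session (and therefore the same time zone):

```sql
-- Parse a compact date-time string, then print it in another layout
select from_unixtime(
         unix_timestamp('20111207 13:01:03', 'yyyyMMdd HH:mm:ss'),
         'yyyy-MM-dd HH:mm:ss');
-- 2011-12-07 13:01:03
```

The intermediate bigint depends on the session time zone, but the final string does not, because the same zone is used in both directions.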

2.3.5 日期時間轉日期函數: to_date

語法: to_date(string timestamp)
返回值: string
說明: 返回日期時間字段中的日期部分。

hive> select to_date('2011-12-08 10:03:01') from tableName;
2011-12-08

2.3.6 日期轉年函數: year

語法: year(string date)
返回值: int
說明: 返回日期中的年。

hive> select year('2011-12-08 10:03:01') from tableName;
2011
hive> select year('2012-12-08') from tableName;
2012

2.3.7 日期轉月函數: month

語法: month (string date)
返回值: int
說明: 返回日期中的月份。

hive> select month('2011-12-08 10:03:01') from tableName;
12
hive> select month('2011-08-08') from tableName;
8

2.3.8 日期轉天函數: day

語法: day (string date)
返回值: int
說明: 返回日期中的天。

hive> select day('2011-12-08 10:03:01') from tableName;
8
hive> select day('2011-12-24') from tableName;
24

2.3.9 日期轉小時函數: hour

語法: hour (string date)
返回值: int
說明: 返回日期中的小時。

hive> select hour('2011-12-08 10:03:01') from tableName;
10

2.3.10 日期轉分鐘函數: minute

語法: minute (string date)
返回值: int
說明: 返回日期中的分鐘。

hive> select minute('2011-12-08 10:03:01') from tableName;
3

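Seconds-of-date function: second

Syntax: second (string date)
Returns: int
Description: returns the seconds part of the date.

hive> select second('2011-12-08 10:03:01') from tableName;
1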

2.3.11 Week-of-year function: weekofyear

Syntax: weekofyear (string date)
Returns: int
Description: returns the week number of the date within its year.

hive> select weekofyear('2011-12-08 10:03:01') from tableName;
49

2.3.12 日期比較函數: datediff

語法: datediff(string enddate, string startdate)
返回值: int
說明: 返回結束日期減去開始日期的天數。

hive> select datediff('2012-12-08','2012-05-09') from tableName;
213

2.3.13 日期增加函數: date_add

語法: date_add(string startdate, int days)
返回值: string
說明: 返回開始日期startdate增加days天后的日期。

hive> select date_add('2012-12-08',10) from tableName;
2012-12-18

2.3.14 日期減少函數: date_sub

語法: date_sub (string startdate, int days)
返回值: string
說明: 返回開始日期startdate減少days天后的日期。

hive> select date_sub('2012-12-08',10) from tableName;
2012-11-28

2.4 條件函數

2.4.1 If函數: if

語法: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
返回值: T
說明: 當條件testCondition爲TRUE時，返回valueTrue；否則返回valueFalseOrNull

hive> select if(1=2,100,200) from tableName;
200
hive> select if(1=1,100,200) from tableName;
100

2.4.2 非空查找函數: COALESCE

語法: COALESCE(T v1, T v2, …)
返回值: T
說明: 返回參數中的第一個非空值；如果所有值都爲NULL，那麼返回NULL

hive> select COALESCE(null,'100','50') from tableName;
100

2.4.3 條件判斷函數：CASE

語法: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
返回值: T
說明：如果a等於b，那麼返回c；如果a等於d，那麼返回e；否則返回f

hive> Select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary
hive> Select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
tim

2.4.4 條件判斷函數：CASE

語法: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
返回值: T
說明：如果a爲TRUE,則返回b；如果c爲TRUE，則返回d；否則返回e

hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary
hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
tom
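
The literal examples above show the syntax; in practice CASE is usually applied to a column. The sketch below assumes the score table from section 1.3.4 with its s_id and s_score columns; the thresholds and labels are illustrative only.

```sql
-- Label each score row; branches are tested top to bottom and the first match wins
select s_id,
       s_score,
       case
         when s_score >= 80 then 'high'
         when s_score >= 60 then 'pass'
         else 'fail'
       end as grade
from score;
```

A row whose s_score is NULL matches neither WHEN branch and therefore falls through to the ELSE label.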

2.5 字符串函數

2.5.1 字符串長度函數：length

語法: length(string A)
返回值: int
說明：返回字符串A的長度

hive> select length('abcedfg') from tableName;
7

2.5.2 字符串反轉函數：reverse

語法: reverse(string A)
返回值: string
說明：返回字符串A的反轉結果

hive> select reverse('abcedfg') from tableName;
gfdecba

2.5.3 字符串連接函數：concat

語法: concat(string A, string B…)
返回值: string
說明：返回輸入字符串連接後的結果，支持任意個輸入字符串

hive> select concat('abc','def','gh') from tableName;
abcdefgh

2.5.4 字符串連接並指定字符串分隔符：concat_ws

語法: concat_ws(string SEP, string A, string B…)
返回值: string
說明：返回輸入字符串連接後的結果，SEP表示各個字符串間的分隔符

hive> select concat_ws(',','abc','def','gh')from tableName;
abc,def,gh

2.5.5 字符串截取函數：substr

語法: substr(string A, int start),substring(string A, int start)
返回值: string
說明：返回字符串A從start位置到結尾的字符串

hive> select substr('abcde',3) from tableName;
cde
hive> select substring('abcde',3) from tableName;
cde
hive>  select substr('abcde',-1) from tableName;  （和ORACLE相同）
e

2.5.6 字符串截取函數：substr,substring

語法: substr(string A, int start, int len),substring(string A, int start, int len)
返回值: string
說明：返回字符串A從start位置開始，長度爲len的字符串

hive> select substr('abcde',3,2) from tableName;
cd
hive> select substring('abcde',3,2) from tableName;
cd
hive> select substring('abcde',-2,2) from tableName;
de

2.5.7 字符串轉大寫函數：upper,ucase

語法: upper(string A) ucase(string A)
返回值: string
說明：返回字符串A的大寫格式

hive> select upper('abSEd') from tableName;
ABSED
hive> select ucase('abSEd') from tableName;
ABSED

2.5.8 字符串轉小寫函數：lower,lcase

語法: lower(string A) lcase(string A)
返回值: string
說明：返回字符串A的小寫格式

hive> select lower('abSEd') from tableName;
absed
hive> select lcase('abSEd') from tableName;
absed

2.5.9 去空格函數：trim

語法: trim(string A)
返回值: string
說明：去除字符串兩邊的空格

hive> select trim(' abc ') from tableName;
abc

2.5.10 url解析函數 parse_url

語法:
parse_url(string urlString, string partToExtract [, string keyToExtract])
返回值: string
說明：返回URL中指定的部分。partToExtract的有效值爲：HOST, PATH,
QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.

hive> select parse_url
('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') 
from tableName;
www.tableName.com 
hive> select parse_url
('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1')
 from tableName;
v1

2.5.11 json解析 get_json_object

語法: get_json_object(string json_string, string path)
返回值: string
說明：解析json的字符串json_string,返回path指定的內容。如果輸入的json字符串無效，那麼返回NULL。

hive> select  get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} },"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;
amy

2.5.12 重複字符串函數：repeat

語法: repeat(string str, int n)
返回值: string
說明：返回重複n次後的str字符串

hive> select repeat('abc',5) from tableName;
abcabcabcabcabc

2.5.13 分割字符串函數: split

語法: split(string str, string pat)
返回值: array
說明: 按照pat字符串分割str，會返回分割後的字符串數組

hive> select split('abtcdtef','t') from tableName;
["ab","cd","ef"]
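
Two details are worth trying out: the result is an ordinary array, so it can be indexed and measured, and pat is treated as a regular expression. A short sketch, assuming a Hive version that accepts a SELECT without a FROM clause:

```sql
-- First element of the array, and the number of elements
select split('abtcdtef','t')[0], size(split('abtcdtef','t'));
-- ab    3

-- pat is a regular expression, so a literal dot has to be escaped
select split('a.b.c','\\.');
-- ["a","b","c"]
```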

2.6 集合統計函數

2.6.1 個數統計函數: count

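Syntax: count(*), count(expr), count(DISTINCT expr[, expr...])
Returns: bigint

Description: count(*) returns the number of rows retrieved, including rows that contain NULL values; count(expr) returns the number of rows in which expr is not NULL; count(DISTINCT expr[, expr...]) returns the number of distinct non-NULL values of the given expressions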

hive> select count(*) from tableName;
20
hive> select count(distinct t) from tableName;
10

2.6.2 總和統計函數: sum

語法: sum(col), sum(DISTINCT col)
返回值: double
說明: sum(col)統計結果集中col的相加的結果；sum(DISTINCT col)統計結果中col不同值相加的結果

hive> select sum(t) from tableName;
100
hive> select sum(distinct t) from tableName;
70

2.6.3 平均值統計函數: avg

語法: avg(col), avg(DISTINCT col)
返回值: double
說明: avg(col)統計結果集中col的平均值；avg(DISTINCT col)統計結果中col不同值相加的平均值

hive> select avg(t) from tableName;
50
hive> select avg (distinct t) from tableName;
30

2.6.4 最小值統計函數: min

語法: min(col)
返回值: double
說明: 統計結果集中col字段的最小值

hive> select min(t) from tableName;
20

2.6.5 最大值統計函數: max

Syntax: max(col)
返回值: double
說明: 統計結果集中col字段的最大值

hive> select max(t) from tableName;
120

2.7 複合類型構建函數

2.7.1 Map類型構建: map

語法: map (key1, value1, key2, value2, …)
說明：根據輸入的key和value對構建map類型

create table score_map(name string, score map<string,int>)
row format delimited fields terminated by '\t' 
collection items terminated by ',' map keys terminated by ':';
-- 集合之間的分隔符只能指定一個
創建數據內容如下並加載數據
cd /zsc/install/hivedatas/
vim score_map.txt

zhangsan	數學:80,語文:89,英語:95
lisi	語文:60,數學:80,英語:99

加載數據到hive表當中去
load data local inpath '/zsc/install/hivedatas/score_map.txt' overwrite into table score_map;

map結構數據訪問：
獲取所有的value：
select name,map_values(score) from score_map;

獲取所有的key：
select name,map_keys(score) from score_map;

按照key來進行獲取value值
select name,score["數學"]  from score_map;

查看map元素個數
select name,size(score) from score_map;
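
A map column can also be flattened into one row per entry. This is a sketch against the score_map table above; the aliases t, subject and points are made up for illustration.

```sql
-- One output row per (name, subject) pair; explode on a map yields a key column and a value column
select name, subject, points
from score_map
lateral view explode(score) t as subject, points;
```

With the two sample lines loaded, this should return six rows, three for zhangsan and three for lisi.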

2.7.2 Struct類型構建: struct

語法: struct(val1, val2, val3, …)
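Description: builds a struct from the given arguments, similar to a struct in C; its fields are accessed with dot notation (X.X). Suppose the data looks like this: the movie ABC has been rated by 1254 people, with a score of 7.4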

創建struct表
hive> create table movie_score( name string,  info struct<number:int,score:float> )row format delimited fields terminated by "\t"  collection items terminated by ":"; 

加載數據
cd /zsc/install/hivedatas/
vim struct.txt

ABC	1254:7.4  
DEF	256:4.9  
XYZ	456:5.4

加載數據
load data local inpath '/zsc/install/hivedatas/struct.txt' overwrite into table movie_score;


hive當中查詢數據
hive> select * from movie_score;  
hive> select info.number,info.score from movie_score;  
OK  
1254    7.4  
256     4.9  
456     5.4

2.7.3 array類型構建: array

語法: array(val1, val2, …)
說明：根據輸入的參數構建數組array類型

hive> create table  person(name string,work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

加載數據到person表當中去
cd /zsc/install/hivedatas/
vim person.txt

數據內容格式如下
biansutao	beijing,shanghai,tianjin,hangzhou
linan	changchu,chengdu,wuhan

加載數據
hive > load  data local inpath '/zsc/install/hivedatas/person.txt' overwrite into table person;

Query all the data
hive > select * from person;

Query by array index (subscript)
hive > select work_locations[0] from person;

查詢所有集合數據
hive  > select work_locations from person; 

查詢元素個數
hive >  select size(work_locations) from person;

2.8 Size functions for complex types

2.8.1 Map size function: size(Map<K,V>)

Syntax: size(Map<K,V>)
返回值: int
說明: 返回map類型的長度

hive> select size(t) from map_table2;
2

2.8.2 array類型長度函數: size(Array)

語法: size(Array)
返回值: int
說明: 返回array類型的長度

hive> select size(t) from arr_table2;
4

2.8.3 Type conversion function

Type conversion function: cast
Syntax: cast(expr as <type>)
Returns: <type>
Description: returns expr converted to the given data type

hive> select cast('1' as bigint) from tableName;
1
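
Two more cases show what cast does with convertible and non-convertible input. A minimal sketch, assuming a Hive version that accepts a SELECT without a FROM clause:

```sql
-- A numeric string converts cleanly, so arithmetic works on the result
select cast('1' as bigint) + 1;
-- 2

-- A string that is not a number converts to NULL rather than raising an error
select cast('abc' as int);
-- NULL
```

The silent NULL is worth remembering when a cast column is later used in a filter or join, since rows with bad values simply drop out.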

2.9 hive當中的lateral view 與 explode以及reflect和分析函數

2.9.1 使用explode函數將hive表中的Map和Array字段數據進行拆分

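lateral view is used together with UDTFs (table-generating functions, which can emit several rows for one input row) such as explode, often applied to the array returned by split. It turns one row into multiple rows, and the expanded rows can then be aggregated. lateral view first applies the UDTF to each row of the base table, the UDTF splits that row into one or more rows, and lateral view then joins the results back to the source row, producing a virtual table that supports a table alias.
explode can also be used on its own to split a complex array or map column into multiple rows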

需求：現在有數據格式如下
zhangsan	child1,child2,child3,child4	k1:v1,k2:v2
lisi	child5,child6,child7,child8	k3:v3,k4:v4

字段之間使用\t分割，需求將所有的child進行拆開成爲一列
 
+----------+--+
| mychild  |
+----------+--+
| child1   |
| child2   |
| child3   |
| child4   |
| child5   |
| child6   |
| child7   |
| child8   |
+----------+--+

將map的key和value也進行拆開，成爲如下結果

+-----------+-------------+--+
| mymapkey  | mymapvalue  |
+-----------+-------------+--+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+--+

select   name, mychild from  t3  lateral view explode(children) tempTable as mychild;
-- tempTable 虛擬表名
-- mychild 虛擬表的列名
-- lateral view可以理解爲行轉列的一個函數 explode只是將複雜結構數據炸開
-- 實現下面的需求
zhangsan        child1
zhangsan        child2
zhangsan        child3
zhangsan        child4
lisi    child5
lisi    child6
lisi    child7
lisi    child8

Step 1: create the Hive database

hive (default)> create database hive_explode;
hive (default)> use hive_explode;

第二步：創建hive表，然後使用explode拆分map和array

hive (hive_explode)> create  table hive_explode.t3(name string,children array<string>,address Map<string,string>) row format delimited fields terminated by '\t'  collection items    terminated by ','  map keys terminated by ':' stored as textFile;

第三步：加載數據

cd  /zsc/install/hivedatas/

vim maparray
數據內容格式如下

zhangsan	child1,child2,child3,child4	k1:v1,k2:v2
lisi	child5,child6,child7,child8	k3:v3,k4:v4

hive表當中加載數據

hive (hive_explode)> load data local inpath '/zsc/install/hivedatas/maparray' into table hive_explode.t3;

第四步：使用explode將hive當中數據拆開

將array當中的數據拆分開

hive (hive_explode)> SELECT explode(children) AS myChild FROM hive_explode.t3;

將map當中的數據拆分開

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM hive_explode.t3;

2.9.2 使用explode拆分json字符串

需求：現在有一些數據格式如下：

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

其中字段與字段之間的分隔符是 |
我們要解析得到所有的monthSales對應的值爲以下這一列（行轉列）

4900
2090
6987

第一步：創建hive表

hive (hive_explode)> create table hive_explode.explode_lateral_view  (area string, goods_id string, sale_info string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS textfile;

第二步：準備數據並加載數據

cd /zsc/install/hivedatas
vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

加載數據到hive表當中去

hive (hive_explode)> load data local inpath '/zsc/install/hivedatas/explode_json' overwrite into table hive_explode.explode_lateral_view;

第三步：使用explode拆分Array

hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from hive_explode.explode_lateral_view;

第四步：使用explode拆解Map

hive (hive_explode)> select explode(split(area,',')) as area from hive_explode.explode_lateral_view;

第五步：拆解json字段

hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as  sale_info from hive_explode.explode_lateral_view;

然後我們想用get_json_object來獲取key爲monthSales的數據：

hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as  sale_info from hive_explode.explode_lateral_view;


This fails with FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
A UDTF such as explode cannot be nested inside another function.
Likewise, trying to select a second column next to it, as in select explode(split(area,',')) as area,good_id from explode_lateral_view;
fails with FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'
When a UDTF is used directly in the SELECT clause it must be the only expression, and this is where LATERAL VIEW comes in

2.9.3 配合LATERAL VIEW使用

配合lateral view查詢多個字段

hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2;

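Here LATERAL VIEW explode(split(goods_id,','))goods acts as a virtual table that is joined to the source table explode_lateral_view, pairing each source row with each of its exploded values (a Cartesian product per row)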

也可以多重使用

hive (hive_explode)> select goods_id2,sale_info,area2 from explode_lateral_view  LATERAL VIEW explode(split(goods_id,','))goods as goods_id2 LATERAL VIEW explode(split(area,','))area as area2;

本質是三個表笛卡爾積的結果

最終，我們可以通過下面的句子，把這個json格式的一行數據，完全轉換成二維表的方式展現

hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source, get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales, get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,  get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view   LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{'))sale_info as sale_info_1;

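Summary:

Lateral View is normally used together with a UDTF, to get around the restriction that a UDTF cannot appear next to other expressions in the SELECT clause
Multiple Lateral Views produce something like a Cartesian product
The OUTER keyword makes a row whose UDTF output is empty still appear, with NULL in the UDTF columns, so that no data is lost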
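
The effect of OUTER is easiest to see with a UDTF call that produces no rows at all. The sketch below reuses the hive_explode.t3 table from section 2.9.1 and explodes an empty array on purpose; the aliases tmp and c are made up for illustration.

```sql
-- Without OUTER: explode(array()) yields no rows, so the query returns nothing
select name, c
from hive_explode.t3
lateral view explode(array()) tmp as c;

-- With OUTER: every row of t3 is kept, and c is NULL
select name, c
from hive_explode.t3
lateral view outer explode(array()) tmp as c;
```

The same applies to real data: a row whose array column is empty or NULL disappears from a plain lateral view and survives with OUTER.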

2.10 列轉行

2.10.1 相關函數說明

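CONCAT(string A/col, string B/col…): returns the concatenation of the input strings; any number of input strings is supported.
CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be a string, like the remaining arguments. If the separator is NULL, the result is also NULL. The function skips any NULL values after the separator argument. It also accepts an array of strings, in which case the array elements are joined with the separator.
COLLECT_SET(col): accepts only primitive data types. It aggregates the values of a column with duplicates removed and produces an array-typed field.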

2.10.2 數據準備

name	constellation	blood_type
孫悟空	白羊座	A
老王	射手座	A
宋宋	白羊座	B
豬八戒	白羊座	A
冰冰	射手座	A

2.10.3 需求

把星座和血型一樣的人歸類到一起。結果如下：

射手座,A            老王|冰冰
白羊座,A            孫悟空|豬八戒
白羊座,B            宋宋

2.10.4 創建本地文件，導入數據

服務器執行以下命令創建文件，注意數據使用\t進行分割

cd /zsc/install/hivedatas
vim constellation.txt

孫悟空	白羊座	A
老王	射手座	A
宋宋	白羊座	B
豬八戒	白羊座	A
冰冰	射手座	A

2.10.5 創建hive表並導入數據

創建hive表並加載數據

hive (hive_explode)> create table person_info(  name string,  constellation string,  blood_type string)  row format delimited fields terminated by "\t";

加載數據

hive (hive_explode)> load data local inpath '/zsc/install/hivedatas/constellation.txt' into table person_info;

2.10.6 按需求查詢數據

hive (hive_explode)> select t1.base, concat_ws('|', collect_set(t1.name)) name from    (select name, concat(constellation, "," , blood_type) base from person_info) t1 group by  t1.base;

2.11 行轉列

2.11.1 函數說明

EXPLODE(col)：將hive一列中複雜的array或者map結構拆分成多行。
LATERAL VIEW
- 用法：LATERAL VIEW udtf(expression) tableAlias AS columnAlias
- 解釋：用於和split, explode等UDTF一起使用，它能夠將一列數據拆成多行數據，在此基礎上可以對拆分後的數據進行聚合。

2.11.2 數據準備

數據內容如下，字段之間都是使用\t進行分割

cd /zsc/install/hivedatas

vim movie.txt
《疑犯追蹤》	懸疑,動作,科幻,劇情
《Lie to me》	懸疑,警匪,動作,心理,劇情
《戰狼2》	戰爭,動作,災難

2.11.3 需求

將電影分類中的數組數據展開。結果如下：

《疑犯追蹤》	懸疑
《疑犯追蹤》	動作
《疑犯追蹤》	科幻
《疑犯追蹤》	劇情
《Lie to me》	懸疑
《Lie to me》	警匪
《Lie to me》	動作
《Lie to me》	心理
《Lie to me》	劇情
《戰狼2》	戰爭
《戰狼2》	動作
《戰狼2》	災難

2.11.4 創建hive表並導入數據

hive (hive_explode)> create table movie_info(movie string, category array<string>) row format delimited fields terminated by "\t" collection items terminated by ",";

加載數據

load data local inpath "/zsc/install/hivedatas/movie.txt" into table movie_info;

2.11.5 按需求查詢數據

hive (hive_explode)>  select movie, category_name  from  movie_info lateral view explode(category) table_tmp as category_name;
-- Count how many movies fall into each category, grouping by category name
select  category_name,count(1) from (
 select movie, category_name  from  movie_info lateral view explode(category) 
 table_tmp as category_name ) temp2 group by temp2.category_name;

2.12 reflect函數

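The reflect function lets you call Java's built-in methods from SQL, which covers many cases that would otherwise need a custom UDF. It works like Java reflection: pass in the class name, the method name and the arguments, and the Java method is invoked.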

2.12.1 使用java.lang.Math當中的Max求兩列中最大值

創建hive表

hive (hive_explode)>  create table test_udf(col1 int,col2 int) row format delimited fields terminated by ',';

準備數據並加載數據

cd /zsc/install/hivedatas

vim test_udf

1,2
4,3
6,4
7,5
5,6

加載數據

hive (hive_explode)> load data local inpath '/zsc/install/hivedatas/test_udf' overwrite into table test_udf;

使用java.lang.Math當中的Max求兩列當中的最大值

hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;
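
For this particular task Hive also has a built-in function, which makes a handy cross-check of the reflect result. A sketch, assuming a Hive version that provides greatest(); the column aliases are made up for illustration.

```sql
-- reflect version next to the built-in greatest(), for comparison
select col1,
       col2,
       reflect("java.lang.Math","max",col1,col2) as via_reflect,
       greatest(col1,col2)                       as via_builtin
from test_udf;
```

On the sample data both result columns should agree on every row; reflect remains the option to reach for when no built-in equivalent exists.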

2.12.2 不同記錄執行不同的java內置函數

創建hive表

hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';

準備數據

cd /zsc/install/hivedatas

vim test_udf2

java.lang.Math,min,1,2
java.lang.Math,max,2,3

加載數據

hive (hive_explode)> load data local inpath '/zsc/install/hivedatas/test_udf2' overwrite into table test_udf2;

執行查詢

hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;

2.12.3 判斷是否爲數字

使用apache commons中的函數，commons下的jar已經包含在hadoop的classpath中，所以可以直接使用。
使用方式如下：

hive (hive_explode)> select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123");

2.13 hive當中的分析函數—分組求topN

2.13.1 分析函數的作用介紹

對於一些比較複雜的數據求取過程，我們可能就要用到分析函數，分析函數主要用於分組求topN，或者求取百分比，或者進行數據的切片等等，我們都可以使用分析函數來解決

2.13.2 常用的分析函數介紹

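1. ROW_NUMBER():

Generates a sequence number, starting from 1, for the rows within a partition in the given order. For example, ordering by pv descending gives each day's pv rank within the partition. ROW_NUMBER() has many uses, such as fetching the top-ranked row of each partition, or the first refer in a session.

2. RANK():

Generates the rank of each row within its partition; tied rows share a rank and leave gaps in the ranking

3. DENSE_RANK():

Generates the rank of each row within its partition; tied rows share a rank and leave no gaps in the ranking

4. CUME_DIST:

Number of rows with a value less than or equal to the current value / total number of rows in the partition. For example, the proportion of people whose salary is less than or equal to the current salary

5. PERCENT_RANK:

(RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)

6. NTILE(n):

Splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows do not divide evenly, the extra rows go to the first buckets. NTILE does not support ROWS BETWEEN, so a clause such as NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.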

2.13.3 需求描述

現有數據內容格式如下，分別對應三個字段，cookieid，createtime ，pv，求取每個cookie訪問pv前三名的數據記錄，其實就是分組求topN，求取每組當中的前三個值

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

第一步：創建數據庫表

CREATE EXTERNAL TABLE cookie_pv (
cookieid string,
createtime string, 
pv INT
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' ;

第二步：準備數據並加載

創建數據，並加載到hive表當中去

cd /zsc/install/hivedatas
vim cookiepv.txt

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

加載數據到hive表當中去

load  data  local inpath '/zsc/install/hivedatas/cookiepv.txt'  overwrite into table  cookie_pv;

第三步：使用分析函數來求取每個cookie訪問PV的前三條記錄

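-- A window-function alias cannot be referenced in the WHERE clause of the same SELECT,
-- so the ranking is computed in a subquery and filtered in the outer query
SELECT * FROM (
SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM cookie_pv 
) t
WHERE rn1 <=  3 ;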
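
The other analytic functions listed in 2.13.2 can be tried on the same table. This is a sketch against cookie_pv; the aliases bucket, cume and pct_rank are made up for illustration.

```sql
SELECT
  cookieid,
  createtime,
  pv,
  -- Split each cookie's rows, highest pv first, into 3 buckets
  NTILE(3)       OVER (PARTITION BY cookieid ORDER BY pv DESC) AS bucket,
  -- Share of rows in the partition with pv <= the current pv
  CUME_DIST()    OVER (PARTITION BY cookieid ORDER BY pv)      AS cume,
  -- Relative rank of the current row within the partition, from 0 to 1
  PERCENT_RANK() OVER (PARTITION BY cookieid ORDER BY pv)      AS pct_rank
FROM cookie_pv;
```

Each cookie has seven rows in the sample data, so NTILE(3) should produce buckets of three, two and two rows.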

2.14 hive自定義函數

2.14.1 自定義函數的基本介紹

Hive 自帶了一些函數，比如：max/min等，但是數量有限，自己可以通過自定義UDF來方便的擴展。
當Hive提供的內置函數無法滿足你的業務處理需要時，此時就可以考慮使用用戶自定義函數（UDF：user-defined function）。
根據用戶自定義函數類別分爲以下三種：
UDF（User-Defined-Function）一進一出
UDAF（User-Defined Aggregation Function）聚集函數，多進一出count/max/min
UDTF（User-Defined Table-Generating Functions）一進多出如lateral view explode()
官方文檔地址 :
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
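Programming steps:
- Extend org.apache.hadoop.hive.ql.exec.UDF
- Implement the evaluate method; evaluate can be overloaded;
Notes
- A UDF must have a return type. It may return null, but the return type cannot be void;
- Text/LongWritable and similar Hadoop types are commonly used in UDFs; plain Java types are not recommended;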

2.14.2 自定義函數開發

第一步：創建maven java 工程，並導入jar包

<repositories>
    <repository>
        <id>cloudera</id>
 <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
</dependencies>
<build>
<plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.0</version>
        <configuration>
            <source>1.8</source>
            <target>1.8</target>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
     <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-shade-plugin</artifactId>
         <version>2.2</version>
         <executions>
             <execution>
                 <phase>package</phase>
                 <goals>
                     <goal>shade</goal>
                 </goals>
                 <configuration>
                     <filters>
                         <filter>
                             <artifact>*:*</artifact>
                             <excludes>
                                 <exclude>META-INF/*.SF</exclude>
                                 <exclude>META-INF/*.DSA</exclude>
                                 <exclude>META-INF/*.RSA</exclude>
                             </excludes>
                         </filter>
                     </filters>
                 </configuration>
             </execution>
         </executions>
     </plugin>
</plugins>
</build>

第二步：開發java類繼承UDF，並重載evaluate 方法

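package com.zsc.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUDF extends UDF {
     public Text evaluate(final Text s) {
         if (null == s) {
             return null;
         }
         // Return the input converted to upper case
         return new Text(s.toString().toUpperCase());
     }
 }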

第三步：將我們的項目打包，並上傳到hive的lib目錄下

使用maven的package進行打包，將我們打包好的jar包上傳到node03服務器的/zsc/install/hive-1.1.0-cdh5.14.2/lib 這個路徑下

第四步：添加我們的jar包

重命名我們的jar包名稱

cd /zsc/install/hive-1.1.0-cdh5.14.2/lib
mv original-day_hive_udf-1.0-SNAPSHOT.jar udf.jar

hive的客戶端添加我們的jar包

0: jdbc:hive2://node03:10000> add jar /zsc/install/hive-1.1.0-cdh5.14.2/lib/udf.jar;

Step 5: register a function name for our custom class

0: jdbc:hive2://node03:10000> create temporary function touppercase as 'com.zsc.udf.MyUDF';

Step 6: use the custom function

0: jdbc:hive2://node03:10000>select touppercase('abc');

hive當中如何創建永久函數 :

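A temporary function has to be added again every time we enter the Hive client, and it disappears once the client exits. To avoid that, we can create a permanent function instead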

創建永久函數 :

1、指定數據庫，將我們的函數創建到指定的數據庫下面
0: jdbc:hive2://node03:10000>use myhive;

2、使用add jar添加我們的jar包到hive當中來
0: jdbc:hive2://node03:10000>add jar /zsc/install/hive-1.1.0-cdh5.14.2/lib/udf.jar;

3、查看我們添加的所有的jar包
0: jdbc:hive2://node03:10000>list  jars;

4、創建永久函數，與我們的函數進行關聯
0: jdbc:hive2://node03:10000>create  function myuppercase as 'com.zsc.udf.MyUDF';

5、查看我們的永久函數
0: jdbc:hive2://node03:10000>show functions like 'my*';

6、使用永久函數
0: jdbc:hive2://node03:10000>select myhive.myuppercase('helloworld');

7、刪除永久函數
0: jdbc:hive2://node03:10000>drop function myhive.myuppercase;

8、查看函數
 show functions like 'my*';

3. hive表的數據壓縮

3.1 數據的壓縮說明

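Evaluating compression schemes:
- A compression scheme can be judged by the following three criteria:
  - 1. Compression ratio: the higher the ratio, the smaller the compressed file, so higher is better
  - 2. Compression time: the faster the better
  - 3. Whether the compressed file can be split: a splittable format lets a single file be processed by multiple Mappers, which parallelizes better

Common compression formats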

壓縮方式	壓縮比	壓縮速度	解壓縮速度	是否可分割
gzip	13.4%	21 MB/s	118 MB/s	否
bzip2	13.2%	2.4MB/s	9.5MB/s	是
lzo	20.5%	135 MB/s	410 MB/s	是
snappy	22.2%	172 MB/s	409 MB/s	否

Hadoop編碼/解碼器方式

壓縮格式	對應的編碼/解碼器
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
Gzip	org.apache.hadoop.io.compress.GzipCodec
BZip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compress.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec

壓縮性能的比較

壓縮算法	原始文件大小	壓縮文件大小	壓縮速度	解壓速度
gzip	8.3GB	1.8GB	17.5MB/s	58MB/s
bzip2	8.3GB	1.1GB	2.4MB/s	9.5MB/s
LZO	8.3GB	2.9GB	49.3MB/s	74.6MB/s

snappy 和 lzo 在實際工作中用的比較多
snappy : http://google.github.io/snappy/

3.2 壓縮配置參數

To enable compression in Hadoop, configure the following parameters (in mapred-site.xml unless noted otherwise):

Parameter	Default	Stage	Recommendation
io.compression.codecs (configured in core-site.xml)	org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec	Input compression	Hadoop uses the file extension to determine whether a codec is supported
mapreduce.map.output.compress	false	Mapper output	Set to true to enable compression
mapreduce.map.output.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Mapper output	Use the LZO, LZ4 or snappy codec to compress data at this stage
mapreduce.output.fileoutputformat.compress	false	Reducer output	Set to true to enable compression
mapreduce.output.fileoutputformat.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Reducer output	Use a standard tool or codec such as gzip or bzip2
mapreduce.output.fileoutputformat.compress.type	RECORD	Reducer output	Compression type for SequenceFile output: NONE, RECORD or BLOCK

3.3 開啓Map輸出階段壓縮

開啓map輸出階段壓縮可以減少job中map和Reduce task間數據傳輸量。具體配置如下：

1）開啓hive中間傳輸數據壓縮功能
hive (default)>set hive.exec.compress.intermediate=true;

2）開啓mapreduce中map輸出壓縮功能
hive (default)>set mapreduce.map.output.compress=true;

3）設置mapreduce中map輸出數據的壓縮方式
hive (default)>set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4）執行查詢語句
   select count(1) from score;

3.4 開啓Reduce輸出階段壓縮

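When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this feature. Users may want to keep the default value false from the default configuration file, so that output is uncompressed plain text by default, and enable output compression only where needed by setting the property to true in a query or script.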

1）開啓hive最終輸出數據壓縮功能
hive (default)>set hive.exec.compress.output=true;

2）開啓mapreduce最終輸出數據壓縮
hive (default)>set mapreduce.output.fileoutputformat.compress=true;

3）設置mapreduce最終數據輸出壓縮方式
hive (default)>set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4）設置mapreduce最終數據輸出壓縮爲塊壓縮
hive (default)>set mapreduce.output.fileoutputformat.compress.type=BLOCK;

5）測試一下輸出結果是否是壓縮文件
insert overwrite local directory '/zsc/install/hivedatas/snappy' select * from score distribute by s_id sort by s_id desc;

3.5 Json數據解析UDF開發

原始json數據格式內容如下：

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"} 
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}

Requirement: create a Hive table, load the data, parse the JSON-formatted data with a custom function, and end up with the following result

movie	rate	timestamp	uid
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
1197	3	978302268	1
1287	5	978302039	1
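
The requirement calls for a custom UDF. As a reference point for checking such a UDF's output, the same table can be produced with the built-in json_tuple function. This is only a sketch: the table name rate_json, its column line, the file path and the aliases are all hypothetical and not part of the original exercise.

```sql
-- Hypothetical staging table: one JSON document per line, in a single string column
create table rate_json(line string);

-- Hypothetical path; point it at wherever the JSON file was saved
load data local inpath '/zsc/install/hivedatas/rating.json' overwrite into table rate_json;

-- json_tuple is a built-in UDTF, so it is used through LATERAL VIEW
select t.movie, t.rate, t.ts, t.uid
from rate_json
lateral view json_tuple(line, 'movie', 'rate', 'timeStamp', 'uid') t as movie, rate, ts, uid;
```

json_tuple returns every value as a string, so a cast is still needed if the columns are to be stored as numbers.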
