Hadoop Study Notes: Advanced Hive Operations

1 Joins

1.1 Data Preparation

  • Create two tables
hive> create table studenta(
    > id int,
    > name string)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
OK
Time taken: 0.138 seconds
hive> create table studentb(
    > id int,
    > age int)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
OK
Time taken: 0.057 seconds

  • Create two data files
vim studenta.txt
vim studentb.txt

studenta.txt
10001	shiny
10002	mark
10003	angel
10005	ella
10009	jack
10014	eva
10018	judy
10020	cendy

studentb.txt
10001	23
10004	22
10007	24
10008	21
10009	25
10012	25
10015	20
10018	19
10020	26
  • Load the data
hive> load data local inpath '/home/zjt/data/studenta.txt' overwrite into table studenta;
Loading data to table ducl_test.studenta
Table ducl_test.studenta stats: [numFiles=1, numRows=0, totalSize=90, rawDataSize=0]
OK
Time taken: 0.248 seconds
hive> load data local inpath '/home/zjt/data/studentb.txt' overwrite into table studentb;
Loading data to table ducl_test.studentb
Table ducl_test.studentb stats: [numFiles=1, numRows=0, totalSize=81, rawDataSize=0]
OK
Time taken: 0.199 seconds
hive> select * from studenta;
OK
10001	shiny
10002	mark
10003	angel
10005	ella
10009	jack
10014	eva
10018	judy
10020	cendy
Time taken: 0.045 seconds, Fetched: 8 row(s)
hive> select * from studentb;
OK
10001	23
10004	22
10007	24
10008	21
10009	25
10012	25
10015	20
10018	19
10020	26
Time taken: 0.047 seconds, Fetched: 9 row(s)

1.2 Join Operations

1.2.1 Inner Join (JOIN)

  • Syntax and example
//Syntax
... join ... on ...
//Example
hive> select * from studenta a join studentb b on a.id=b.id;
...
OK
10001	shiny	10001	23
10009	jack	10009	25
10018	judy	10018	19
10020	cendy	10020	26
Time taken: 34.066 seconds, Fetched: 4 row(s)

  • Effect
    • Returns only the rows that satisfy the join condition on both sides

1.2.2 Left Outer Join

  • Syntax and example
//Syntax
... left join ... on ...
//Example
hive> select * from studenta a left join studentb b on a.id=b.id;
OK
10001	shiny	10001	23
10002	mark	NULL	NULL
10003	angel	NULL	NULL
10005	ella	NULL	NULL
10009	jack	10009	25
10014	eva	NULL	NULL
10018	judy	10018	19
10020	cendy	10020	26
Time taken: 28.853 seconds, Fetched: 8 row(s)
  • Effect
    • Matches against the left table: every left row is kept
    • Unmatched positions are filled with NULL
    • The number of rows returned equals that of the left table (here the join keys are unique)

1.2.3 右外連接

  • 語法與實例
//語法
... right join ... on ...
hive> select * from studenta a right join studentb b on a.id=b.id;
OK
10001	shiny	10001	23
NULL	NULL	10004	22
NULL	NULL	10007	24
NULL	NULL	10008	21
10009	jack	10009	25
NULL	NULL	10012	25
NULL	NULL	10015	20
10018	judy	10018	19
10020	cendy	10020	26
Time taken: 28.703 seconds, Fetched: 9 row(s)
  • Effect
    • Matches against the right table: every right row is kept
    • Unmatched positions are filled with NULL
    • The number of rows returned equals that of the right table

1.2.4 Full Outer Join

  • Syntax and example
//Syntax
... full join ... on ...
//Example
hive> select * from studenta a full join studentb b on a.id=b.id;
OK
10001	shiny	10001	23
10002	mark	NULL	NULL
10003	angel	NULL	NULL
NULL	NULL	10004	22
10005	ella	NULL	NULL
NULL	NULL	10007	24
NULL	NULL	10008	21
10009	jack	10009	25
NULL	NULL	10012	25
10014	eva	NULL	NULL
NULL	NULL	10015	20
10018	judy	10018	19
10020	cendy	10020	26
Time taken: 34.453 seconds, Fetched: 13 row(s)
  • Effect
    • Matches against both tables
    • Unmatched positions are filled with NULL
    • The number of rows returned equals the size of the deduplicated union of both tables' join keys
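As a semantics recap, the following plain-Python sketch (an illustration of the join logic, not how Hive executes joins) reproduces the row counts from the transcripts above using the same ids:

```python
# studenta/studentb ids from the sample data above
a = {10001: "shiny", 10002: "mark", 10003: "angel", 10005: "ella",
     10009: "jack", 10014: "eva", 10018: "judy", 10020: "cendy"}
b = {10001: 23, 10004: 22, 10007: 24, 10008: 21, 10009: 25,
     10012: 25, 10015: 20, 10018: 19, 10020: 26}

inner = sorted(set(a) & set(b))   # join: ids present on both sides
left = sorted(a)                  # left join: every left id survives
right = sorted(b)                 # right join: every right id survives
full = sorted(set(a) | set(b))    # full join: deduplicated union of ids

# row counts match the transcripts: 4, 8, 9, 13
print(len(inner), len(left), len(right), len(full))
```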

1.3 Left Semi Join

  • Syntax and example
//Syntax
... left semi join ... on ...
//Example
hive> select * from studenta a left semi join studentb b on a.id=b.id;
OK
10001	shiny
10009	jack
10018	judy
10020	cendy
Time taken: 32.075 seconds, Fetched: 4 row(s)
  • Effect
    • Returns the left-table rows (left-table columns only) that satisfy the join condition
    • The right table may be referenced only in the ON clause; filtering it in the WHERE clause, the SELECT clause, or anywhere else fails, because when a join statement contains a WHERE clause the JOIN executes first and the WHERE runs afterwards
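In plain-Python terms (again just an illustrative sketch), a left semi join is an EXISTS-style filter: keep each left row whose id appears on the right, and emit only the left columns:

```python
# left rows and right-side ids from the sample tables above
studenta = [(10001, "shiny"), (10002, "mark"), (10003, "angel"),
            (10005, "ella"), (10009, "jack"), (10014, "eva"),
            (10018, "judy"), (10020, "cendy")]
studentb_ids = {10001, 10004, 10007, 10008, 10009,
                10012, 10015, 10018, 10020}

# keep left rows whose id exists on the right; right columns never appear
semi = [(sid, name) for sid, name in studenta if sid in studentb_ids]
print(semi)  # the same four rows the transcript shows
```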

2 Data Types

2.1 Primitive Data Types

  • Integer, floating-point, string, char, boolean, and date types
    [image: table of primitive data types]

2.2 Complex Data Types

  • Arrays, maps, and structs
    [image: table of complex data types]

2.3 Complex Data Type Examples

2.3.1 Array (ARRAY)

  • Create the table
hive> create table employee(
    > name string,
    > age int,
    > work_location array<string>)
    > row format delimited
    > fields terminated by '\t'
    > collection items terminated by ',' //specify the delimiter for array elements
    > stored as textfile;
OK
Time taken: 0.096 seconds
  • Data preparation
employee.txt
shiny	23	beijing,tianjin,qingdao
jack	34	shanghai,guangzhou
mark	26	beijing,xian
ella	21	beijing
judy	30	shanghai,hangzhou,chongqing
cendy	28	beijing,shanghai,dalian,chengdu
  • Load the data
[zjt@master data]$ vim employee.txt

hive> load data local inpath '/home/zjt/data/employee.txt' overwrite into table employee;
Loading data to table ducl_test.employee
Table ducl_test.employee stats: [numFiles=1, numRows=0, totalSize=174, rawDataSize=0]
OK
Time taken: 0.698 seconds
  • Query the data
//Query all rows
hive> select * from employee;
OK
shiny	23	["beijing","tianjin","qingdao"]
jack	34	["shanghai","guangzhou"]
mark	26	["beijing","xian"]
ella	21	["beijing"]
judy	30	["shanghai","hangzhou","chongqing"]
cendy	28	["beijing","shanghai","dalian","chengdu"]
Time taken: 0.277 seconds, Fetched: 6 row(s)

//Query a given position in the array
hive> select name,age,work_location[0] from employee;
OK
shiny	23	beijing
jack	34	shanghai
mark	26	beijing
ella	21	beijing
judy	30	shanghai
cendy	28	beijing
Time taken: 0.12 seconds, Fetched: 6 row(s)

//Out-of-range array positions come back as NULL
hive> select name,age,work_location[2] from employee;
OK
shiny	23	qingdao
jack	34	NULL
mark	26	NULL
ella	21	NULL
judy	30	chongqing
cendy	28	dalian
Time taken: 0.084 seconds, Fetched: 6 row(s)
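The NULL-for-missing-position rule can be sketched in plain Python (illustrative only; Hive applies it internally):

```python
def hive_index(arr, i):
    # Hive's arr[i]: positions past the end of the array yield NULL
    return arr[i] if 0 <= i < len(arr) else None

work_location = ["shanghai", "guangzhou"]  # jack's row from above
first = hive_index(work_location, 0)       # "shanghai"
third = hive_index(work_location, 2)       # None, displayed as NULL
```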

2.3.2 Map

  • Create the table
hive> create table scores(
    > name string,
    > score map<string,int>)
    > row format delimited
    > fields terminated by '\t'
    > collection items terminated by ','
    > map keys terminated by ':'
    > stored as textfile;
OK
Time taken: 0.09 seconds

  • Data preparation
scores.txt 
shiny	chinese:90,math:100,english:99
mark	chinese:89,math:56,english:87
judy	chinese:94,math:78,english:81
ella	chinese:54,math:23,english:48
jack	chinese:100,math:95,english:69
cendy	chinese:67,math:83,english:45
  • Load the data
[zjt@master data]$ vim scores.txt

hive> load data local inpath '/home/zjt/data/scores.txt' into table scores;
Loading data to table ducl_test.scores
Table ducl_test.scores stats: [numFiles=1, totalSize=214]
OK
Time taken: 0.251 seconds
  • Query the data
//Query all rows
hive> select * from scores;
OK
shiny	{"chinese":90,"math":100,"english":99}
mark	{"chinese":89,"math":56,"english":87}
judy	{"chinese":94,"math":78,"english":81}
ella	{"chinese":54,"math":23,"english":48}
jack	{"chinese":100,"math":95,"english":69}
cendy	{"chinese":67,"math":83,"english":45}
Time taken: 0.069 seconds, Fetched: 6 row(s)

//Look up one key in the map and add a constant column to the result
hive> select name,'chinese',score["chinese"] from scores;
OK
shiny	chinese	90
mark	chinese	89
judy	chinese	94
ella	chinese	54
jack	chinese	100
cendy	chinese	67
Time taken: 1.341 seconds, Fetched: 6 row(s)
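The lookup semantics can be sketched in plain Python (illustrative only): score["chinese"] pulls one key per row, and a key that is absent comes back as NULL:

```python
# name -> map rows, mirroring the first two rows of the scores table
rows = [("shiny", {"chinese": 90, "math": 100, "english": 99}),
        ("mark", {"chinese": 89, "math": 56, "english": 87})]

# dict.get returns None for a missing key, matching Hive's NULL
result = [(name, "chinese", score.get("chinese")) for name, score in rows]
```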

2.3.3 Struct

  • Create the table
hive> create table coursescore(
    > id int,
    > course struct<name:string,score:int>)
    > row format delimited
    > fields terminated by '\t'
    > collection items terminated by ','
    > stored as textfile;
OK
Time taken: 0.323 seconds
  • Data preparation
coursescore.txt 
1	chinese,100
2	math,98
3	english,99
4	computer,78
  • Load the data
[zjt@master data]$ vim coursescore.txt

hive> load data local inpath '/home/zjt/data/coursescore.txt' into table coursescore;
Loading data to table ducl_test.coursescore
Table ducl_test.coursescore stats: [numFiles=1, totalSize=51]
OK
Time taken: 1.392 seconds
  • Query the data
//Query all rows
hive> select * from coursescore;
OK
1	{"name":"chinese","score":100}
2	{"name":"math","score":98}
3	{"name":"english","score":99}
4	{"name":"computer","score":78}
Time taken: 0.586 seconds, Fetched: 4 row(s)

//Struct members are accessed with "."
hive> select id,course.name,course.score from coursescore;
OK
1	chinese	100
2	math	98
3	english	99
4	computer	78
Time taken: 0.058 seconds, Fetched: 4 row(s)

3 Queries

  • Hive queries largely follow standard SQL; only the differences are listed below
  • Conditional expression
  • Format: if(condition, value_if_true, value_if_false)
    The first argument is the condition to test; the second is the value returned when the condition holds; the third is the value returned when it does not. Each may be a fixed value or an expression
//Example: return the student id when age is 15, otherwise 0
select if(age=15,id,0) from stu_messages;
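A plain-Python sketch of if()'s behaviour (illustrative only; note that a NULL condition also selects the third argument in Hive):

```python
def hive_if(cond, value_if_true, value_if_false):
    # Hive's if(): return the second argument only when the condition is
    # TRUE; FALSE and NULL (None) both fall through to the third argument
    return value_if_true if cond is True else value_if_false

students = [(10001, 15), (10002, 16), (10003, 15)]  # (id, age) sample rows
ids = [hive_if(age == 15, sid, 0) for sid, age in students]
print(ids)  # [10001, 0, 10003]
```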
  • Sorting
  • Hive provides four sorting-related clauses:
    • order by: total order over the whole result (runs through a single reducer)
    • sort by: sorts rows within each reducer only
    • distribute by: controls which reducer each row is sent to
    • cluster by: shorthand for distribute by + sort by on the same columns

4 Functions

4.1 Built-in Functions

4.1.1 List the Built-in Functions

show functions

4.1.2 Show a Function's Description

desc function trim

hive> desc function trim;
OK
trim(str) - Removes the leading and trailing space characters from str 
Time taken: 0.008 seconds, Fetched: 1 row(s)

4.1.3 Show a Function's Extended Description

desc function extended trim

hive> desc function extended trim;
OK
trim(str) - Removes the leading and trailing space characters from str 
Example:
  > SELECT trim('   facebook  ') FROM src LIMIT 1;
  'facebook'
Time taken: 0.005 seconds, Fetched: 4 row(s)

4.2 Parsing JSON with Built-in Functions

4.2.1 Create the Table

hive> create table rat_json(
    > line string)
    > row format delimited;
OK

4.2.2 Prepare the Data

rating.json
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}

4.2.3 Load the Data

//Load the JSON file into the table
hive> load data local inpath '/home/zjt/data/rating.json' into table rat_json;
Loading data to table ducl_test.rat_json
Table ducl_test.rat_json stats: [numFiles=1, totalSize=65602705]
OK
Time taken: 1.278 seconds

4.2.4 Inspect the Data

hive> select * from rat_json limit 10;
OK
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
{"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}
Time taken: 0.086 seconds, Fetched: 10 row(s)

4.2.5 Parse the JSON Data

  • Parse each JSON line into four fields and insert them into a new table, rate (which holds the processed data and therefore needs four columns)
//Create the target table
hive> create table rate(
    > move int,
    > rate int,
    > ts int,
    > uid int)
    > row format delimited
    > fields terminated by '\t';
OK
Time taken: 0.052 seconds

//Parse the JSON and insert the results into rate
insert into table rate
select get_json_object(line,'$.movie') as move,
get_json_object(line,'$.rate') as rate,
get_json_object(line,'$.timeStamp') as ts,
get_json_object(line,'$.uid') as uid
from rat_json;
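get_json_object takes the JSON string and a JSONPath-style expression and returns the matched value as a string. A rough Python equivalent for the flat '$.field' paths used here (a sketch only; the real function also handles nested paths and array indexes):

```python
import json

def get_json_object_sketch(line, path):
    # handles only flat '$.field' paths, unlike Hive's get_json_object
    assert path.startswith("$.")
    value = json.loads(line).get(path[2:])
    return value  # a missing key yields None, i.e. Hive's NULL

line = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'
row = tuple(get_json_object_sketch(line, p)
            for p in ("$.movie", "$.rate", "$.timeStamp", "$.uid"))
print(row)  # ('1193', '5', '978300760', '1')
```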

4.2.6 Query the Parsed Data

hive> select * from rate limit 10;
OK
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
1197	3	978302268	1
1287	5	978302039	1
2804	5	978300719	1
594	4	978302268	1
919	4	978301368	1
Time taken: 0.035 seconds, Fetched: 10 row(s)

4.3 User-Defined Functions

4.3.1 Kinds

  • UDF: operates on a single data row and produces a single data row as output (e.g., mathematical and string functions)
  • UDAF (User-Defined Aggregation Function): accepts multiple input rows and produces a single output row (e.g., COUNT, MAX)

4.3.2 Parsing JSON with a Custom UDF

  • Write the Java class
//A simple Java class that extends org.apache.hadoop.hive.ql.exec.UDF and overloads the evaluate method

package org.zjt.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;


public class JsonUDF extends UDF {
	public String evaluate(String jsonStr,String fields) throws JSONException{
		// parse the JSON string into a JSONObject
		JSONObject json=new JSONObject(jsonStr);
		// extract the requested field and cast it to String
		String result=(String) json.get(fields);
		
		return result;
	}
}
  • Build the jar
//Package the class into a jar and add it to Hive's classpath
hive> add jar /home/zjt/json-udf.jar;
Added [/home/zjt/json-udf.jar] to class path
Added resources: [/home/zjt/json-udf.jar]
hive> list jar;
/home/zjt/json-udf.jar
  • Create a temporary function
//Create a temporary function bound to the compiled class
hive> create temporary function jsontostring as 'org.zjt.hive.udf.JsonUDF';
OK
Time taken: 0.014 seconds
  • Parse the JSON
//Store the results in the rates table (rates holds the processed data, so it needs four columns)
create table rates as select
jsontostring(line,'movie') as move,
jsontostring(line,'rate') as rate,
jsontostring(line,'timeStamp') as ts,
jsontostring(line,'uid') as uid
from rat_json;
  • Inspect the data
hive> select * from rates limit 5;
OK
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
Time taken: 0.088 seconds, Fetched: 5 row(s)

5 Hive Shell

5.1 Parameters

  • -i: run an HQL initialization file before the session starts
  • -e: execute the given HQL from the command line
  • -f: execute an HQL script file
  • -v: verbose; echo the executed HQL to the console
  • -p: connect to Hive Server on the given port number
  • -hiveconf x=y: set Hive/Hadoop configuration variables
  • -S: silent mode; run without printing log output

5.2 Example

//Execute HQL from the command line
hive -e 'select * from ducl_test.student3'

5.3 Running a Script

//The script file
[zjt@master ~]$ vim hive.hql

use ducl_2019;
create table if not exists stu(id int,name string)
row format delimited fields terminated by ',' stored as textfile;
load data local inpath '/home/zjt/data/stu.txt' into table stu;
insert overwrite directory '/ducl_2019/stu' 
row format delimited
fields terminated by '\t'
select * from stu;

//Run the HQL script
[zjt@master ~]$ hive -f hive.hql