Hive II

Hive

A data warehouse.
OLAP (online analytical processing).
Data is stored on HDFS.
Metadata is stored in a relational database.

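Hive execution flow

The CLI interacts with the driver.
The driver hands the query to the compiler (syntax parsing and semantic analysis).
The compiler queries the metastore during compilation and generates an execution plan.
The plan is returned to the driver, which submits it to the execution engine.
The execution engine submits the job to Hadoop, and Hadoop returns the result,
which is passed back up to the client.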

Hive is a tool on top of Hadoop MR.

Technologies

hive
hiveserver2
beeline:

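Partitioned table: a partition is a directory.
Bucketed table: a bucket is a file.
Managed (internal) table: dropping the table deletes everything, both metadata and data.
External table: dropping the table deletes only the table definition, which is kept in the metastore; the data files remain.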

Structured data types:

tinyint
smallint
int
bigint
float
double
decimal

array [,,,,]
struct {"",12,}
named struct {"key1":"v1","k2":"v2"}
map {1:"",,,,}
union {a:[]}

split() function

explode()

A table-generating function (UDTF). It expands an array or a map into multiple rows.
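A minimal sketch of what explode() does to arrays and maps, written in Python for illustration only (this is not Hive code). The helper name `explode` and the sample rows are hypothetical.

```python
def explode(rows, column):
    """Yield one output row per element of the array or map stored in `column`."""
    for row in rows:
        value = row[column]
        if isinstance(value, dict):
            # A map explodes into one (key, value) row per entry.
            for k, v in value.items():
                yield (k, v)
        else:
            # An array explodes into one single-column row per element.
            for item in value:
                yield (item,)


# Hypothetical sample rows with an array column and a map column.
rows = [
    {"name": "tom", "arr": ["a", "b"], "map1": {"x": 1}},
    {"name": "ann", "arr": ["c"], "map1": {"y": 2, "z": 3}},
]

exploded_arr = list(explode(rows, "arr"))
exploded_map = list(explode(rows, "map1"))
print(exploded_arr)
print(exploded_map)
```

Two input rows become three output rows in both cases, which is why explode() is called a table-generating function.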

cross

Optimization techniques

mapjoin

select /*+MAPJOIN()*/

Full CREATE TABLE syntax

CREATE TABLE employee
    (
    name string,
    arr ARRAY<string>,
    struc STRUCT<sex:string,age:int>,
    map1 MAP<string,int>,
    map2 MAP<string,ARRAY<string>>
    )
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '|'                -- default \001
        COLLECTION ITEMS TERMINATED BY ','      -- default \002
        MAP KEYS TERMINATED BY ':'              -- default \003
        LINES TERMINATED BY '\n'                -- line terminator
        STORED AS TEXTFILE;

Map-side join

    //Join hint: /*+ MAPJOIN(alias) */ names the small table to load into memory
    SELECT /*+ MAPJOIN(c) */ c.* FROM custs c CROSS JOIN orders o WHERE c.id  <> o.cid;

    //Alternatively, enable automatic conversion to map-side joins
    set hive.auto.convert.join=true;
    SELECT c.* FROM custs c CROSS JOIN orders o WHERE c.id  <> o.cid;
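The idea behind a map-side join can be sketched in Python (illustration only, not Hive code): the small table is loaded into an in-memory hash table and the big table is streamed past it, so no shuffle or reduce phase is needed. The sketch assumes an equi-join on `custs.id = orders.cid`, unlike the inequality join above; the function name and the sample data are hypothetical.

```python
def map_join(small, big, small_key, big_key):
    """Join `big` against `small` by hashing the small side and streaming the big side."""
    # Build phase: the small table is loaded into an in-memory hash table.
    lookup = {}
    for row in small:
        lookup.setdefault(row[small_key], []).append(row)
    # Probe phase: every row of the big table is looked up, no shuffle involved.
    for row in big:
        for match in lookup.get(row[big_key], []):
            yield {**match, **row}


# Hypothetical sample data.
custs = [{"id": 1, "name": "tom"}, {"id": 2, "name": "ann"}]
orders = [
    {"orderno": "no001", "cid": 1},
    {"orderno": "no002", "cid": 1},
    {"orderno": "no003", "cid": 3},
]

joined = list(map_join(custs, orders, "id", "cid"))
print(joined)
```

This only works when the small side fits in memory, which is why the hint (or the automatic conversion) targets the small table.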

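load data local inpath //upload: copies a file from the local filesystem
load data inpath //move: moves a file that is already on HDFS

The load command can load data into a partitioned table, but it cannot place rows into specific buckets.
For bucketed tables, use insert into ...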

//Run a Hive command directly from the shell
$>hive -e "select * from mydb.custs"

//Run a Hive script from the shell
$>hive -f hive.sql

//Export a table: both the metadata and the data are exported
$hive>export table mydb.custs to '/user/centos/custs.dat' ;

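//order by: global sort. A single reducer produces the total order, so combining it with limit is strongly recommended.
//With limit, each reducer only has to return limit rows.
$hive>select * from custs order by id limit 3 ;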

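//Set mapred mode to strict. In strict mode:
//1. order by must be used with limit to bound the result size
//2. a query on a partitioned table must specify the partition.
$hive>set hive.mapred.mode=strict ;
$hive>set hive.mapred.mode=nonstrict ;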

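//sort by sorts within each reducer by the given columns (asc|desc); the overall output is not globally ordered
//with mapred.reduce.tasks=1 it is equivalent to order by
//order by always uses a single reducer
//mapred.reduce.tasks has no hive prefix; it is a Hadoop property
$hive>set mapred.reduce.tasks=2 ;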

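//distribute by corresponds to the partitioning step of MR: rows are partitioned by the given column.
//The column used for distribution must appear in the select list.
$hive>select * from orders distribute by cid sort by price desc ;

//cluster by is a shortcut for distributing and sorting by the same column.
$hive>select * from orders cluster by cid ;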

Sorting summary

1.order by
    global sort
2.sort by
    sort within each reducer
3.distribute by
    partitioning: decides by which column rows are assigned to reducers.
4.cluster by
    distribute by x sort by x ;
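The four clauses above can be sketched in Python (illustration only, not Hive code). The sketch assumes integer keys and uses a plain modulo in place of the real hash partitioner; the function name and the sample rows are hypothetical.

```python
def distribute_sort(rows, dist_key, sort_key, num_reducers, descending=False):
    """Mimic `distribute by dist_key sort by sort_key` with num_reducers reducers."""
    # distribute by: route each row to a reducer based on the distribution key.
    # Assumption: integer keys, modulo stands in for the hash partitioner.
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[row[dist_key] % num_reducers].append(row)
    # sort by: every reducer sorts only its own rows.
    for bucket in reducers:
        bucket.sort(key=lambda r: r[sort_key], reverse=descending)
    return reducers


# Hypothetical sample rows.
orders = [
    {"cid": 1, "price": 10},
    {"cid": 2, "price": 30},
    {"cid": 1, "price": 50},
    {"cid": 2, "price": 20},
    {"cid": 3, "price": 40},
]

# Two reducers: each output is sorted, but there is no global order.
parts = distribute_sort(orders, "cid", "price", num_reducers=2, descending=True)
# One reducer: sort by degenerates into order by.
single = distribute_sort(orders, "cid", "price", num_reducers=1, descending=True)
# cluster by cid: distribute and sort on the same column.
clustered = distribute_sort(orders, "cid", "cid", num_reducers=2)
```

With one reducer the result is a total order, which matches the note that sort by with mapred.reduce.tasks=1 behaves like order by.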

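//Functions
//size() returns the number of elements in an array or map
$hive>select size(arr) from emp ;

//Whether the array contains the given element
$hive>select array_contains(arr,"xx") from emp ;

//List all functions
show functions ;

//Describe a function, and a few built-in functions
desc function extended array_contains ;
select current_database() ;
select current_user() ;
select current_date() ;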

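//case when == switch case
SELECT CASE WHEN length(name) <= 4 THEN 'Short' ELSE 'Long' END as xx FROM emp ;

//Reverse a string
select reverse("12345") ;
//Extract the file name: reverse the path, take the part before the first '/', reverse it back
SELECT reverse(split(reverse('/home/user/employee.txt'),'/')[0]) ;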
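The reverse/split/reverse trick can be traced step by step in Python (illustration only; the helper names are hypothetical, and Python's `str.split` stands in for Hive's regex-based split(), which behaves the same for a literal '/').

```python
def hive_reverse(s):
    """Equivalent of reverse(): the string read backwards."""
    return s[::-1]


def file_name(path):
    """Equivalent of reverse(split(reverse(path), '/')[0])."""
    reversed_path = hive_reverse(path)         # 'txt.eeyolpme/resu/emoh/'
    first_part = reversed_path.split("/")[0]   # 'txt.eeyolpme'
    return hive_reverse(first_part)            # 'employee.txt'


name = file_name("/home/user/employee.txt")
print(name)
```

Reversing first turns "the last path component" into "the first component", which is easy to pick with index 0.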

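//Create an array
select array(1,1,1,2,3,4) ;

//collect_set() is an aggregate function: it aggregates the result set and returns a set (duplicates removed).
SELECT collect_set(work_place[0]) AS flat_workplace0 FROM employee;

//Virtual column (built-in column)
select INPUT__FILE__NAME from emp ;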

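//Transactions: ACID support was introduced in Hive 0.13.0; row-level INSERT ... VALUES, UPDATE and DELETE are available from 0.14.0.
//All transactions are auto-committed, the storage format must be ORC, and the table must be bucketed.
1.Set the relevant properties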
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;

2.Command to show transactions:
SHOW TRANSACTIONS;

3.DML syntax
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2]…)] VALUES values_row [, values_row …];
UPDATE tablename SET column = value [, column = value…] [WHERE expression]
DELETE FROM tablename [WHERE expression]

4.When creating the table, make it bucketed, stored as ORC, with the transactional property
create table tx(id int ,name string , age int)
clustered by (id) into 2 buckets
stored as orc
TBLPROPERTIES('transactional'='true');

5.Run the operations
insert into tx(id,name,age) values(1,'tom',2) ;
update tx set name = 'tomas' where id = 1 ;
delete from tx where id =1 ;

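Row-oriented storage

Structured data: all fields of a row are stored together.
select name from orders ;
Disk access is non-sequential: reading one column has to skip over the other fields of every row.

Column-oriented storage

Access is sequential: the values of a column are stored contiguously.
orc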

Data aggregation and sampling

count()
sum()
avg()
max()
min()

//Number of orders per customer
select cid,count(*) from orders group by cid ;

//Wrong: every non-aggregated column in select must appear in group by.
select cid,orderno ,count(*) from orders group by cid  ;

//Set of distinct values
select cid,collect_set(price) from orders group by cid  ;

//Several aggregate functions in one select
select cid,max(price),min(price) from orders group by cid  ;

//coalesce returns its first non-null argument
SELECT sum(coalesce(sex_age.age,0)) AS age_sum,
sum(if(sex_age.sex = 'Female',sex_age.age,0))
AS female_age_sum FROM employee;

//Nested aggregates are not allowed; the following statement is an error
SELECT avg(count(*)) AS row_cnt FROM employee ;

//With count + distinct, the mapred.reduce.tasks setting is ignored and
//a single reducer is used, as with order by


//Map-side aggregation (pre-aggregation); uses more memory. Enabled by default since Hive 0.3.
set hive.map.aggr=true ;

Advanced aggregation

GROUPING SETS.
Equivalent to several group by queries combined with union all.

//Order counts per customer and per order number
select count(*) from orders group by cid ;
select count(*) from orders group by orderno ;

//group by + union all
select count(*) from orders group by cid union all select count(*) from orders group by orderno ;
//group by: lists the columns available for grouping,
//grouping sets: says how those columns are combined into groups.
select count(*) from orders group by cid,orderno grouping sets(cid,orderno,()) ;
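A rough Python sketch of how grouping sets behaves (illustration only, not Hive code): one count(*) ... group by per grouping set, concatenated like union all, with columns outside the current set reported as None (NULL). The function name and the sample rows are hypothetical.

```python
from collections import Counter


def grouping_sets_count(rows, columns, sets):
    """count(*) per grouping set; columns not in the set come out as None."""
    out = []
    for gset in sets:  # one group by per grouping set ...
        counts = Counter(
            tuple(row[c] if c in gset else None for c in columns) for row in rows
        )
        out.extend(counts.items())  # ... appended like union all
    return out


# Hypothetical sample rows.
orders = [
    {"cid": 1, "orderno": "no001"},
    {"cid": 1, "orderno": "no002"},
    {"cid": 2, "orderno": "no003"},
]

result = grouping_sets_count(
    orders, ("cid", "orderno"), [("cid",), ("orderno",), ()]
)
for key, n in result:
    print(key, n)
```

The empty set `()` contributes the single grand-total row.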


rollup

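rollup extends group by.
For n grouping columns it produces n + 1 levels of aggregation: one for each prefix of the column list, plus the grand total.

GROUP BY a,b,c WITH ROLLUP
is equivalent to
GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(a),())

select cid,orderno,count(*) from orders GROUP BY cid,orderno GROUPING SETS ((cid,orderno),(cid),()) ;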

cube

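cube extends grouping sets: it aggregates over every combination of the grouping columns, regardless of their order.
//voucher
//vip
//6 o'clock
GROUP BY a,b,c WITH CUBE is equivalent to
GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())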
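The expansion of rollup and cube into grouping sets can be generated mechanically. A small Python sketch (illustration only; the function names are hypothetical):

```python
from itertools import combinations


def rollup_sets(cols):
    """Prefixes of the column list, longest first: n + 1 grouping sets."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def cube_sets(cols):
    """Every subset of the columns: 2 ** n grouping sets."""
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]


rollup = rollup_sets(["a", "b", "c"])
cube = cube_sets(["a", "b", "c"])
print(rollup)
print(cube)
```

For three columns rollup yields 4 grouping sets and cube yields 8, matching the two GROUPING SETS lists written out above.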

Aggregate conditions

having filters groups after aggregation.
//Using having
select cid , max(price) mx from orders group by cid having mx > 100.1 ;


//The same query as a nested subquery
select t.cid , t.mx from (select cid , max(price) mx from orders group by cid) t where t.mx > 100.1 ;

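Analytic functions

Supported since Hive 0.11. An analytic function scans multiple input rows to compute a result for each row. It is usually combined with OVER, PARTITION BY, ORDER BY and a windowing clause.
Unlike traditional grouping, where each group yields a single result (for example max), the result of an analytic function appears many times: it is attached to every row of the output.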

Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_clause>])
SELECT  name, dept_num, salary,
            COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
            SUM(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) AS deptTotal,
            SUM(salary) OVER(ORDER BY dept_num) AS runningTotal1, 
            SUM(salary) OVER(ORDER BY dept_num, name rows unbounded preceding) AS runningTotal2
            FROM employee_contract
            ORDER BY dept_num, name;
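The difference between runningTotal1 and runningTotal2 in the query above comes from the window frame. With ORDER BY and no window clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the ordering key are summed together; ROWS UNBOUNDED PRECEDING advances one row at a time. A Python sketch of both (illustration only; function names and sample rows are hypothetical):

```python
def running_total_range(rows, order_key, value):
    """SUM(value) OVER (ORDER BY order_key): the frame includes all peers of the current row."""
    ordered = sorted(rows, key=lambda r: r[order_key])
    return [
        sum(r[value] for r in ordered if r[order_key] <= row[order_key])
        for row in ordered
    ]


def running_total_rows(rows, order_keys, value):
    """SUM(value) OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING): strictly row by row."""
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in order_keys))
    out, total = [], 0
    for row in ordered:
        total += row[value]
        out.append(total)
    return out


# Hypothetical sample rows: two employees share dept_num 1.
employees = [
    {"name": "ann", "dept_num": 1, "salary": 100},
    {"name": "bob", "dept_num": 1, "salary": 200},
    {"name": "cat", "dept_num": 2, "salary": 300},
]

total1 = running_total_range(employees, "dept_num", "salary")
total2 = running_total_rows(employees, ["dept_num", "name"], "salary")
print(total1)
print(total2)
```

Both employees of department 1 see 300 in the first variant, while the second variant gives 100 and then 300.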
//The whole result set is ordered by cid; inside each window partition, rows are ordered by id descending.
SELECT  id, orderno, price,cid ,
    COUNT(*) OVER (PARTITION BY cid) AS cnt  , 
    min(price) over (partition by orderno order by id desc) FROM orders ORDER BY cid;


    //Minimum price within each orderno partition
    SELECT  id, orderno, price,cid ,
    min(price) over (partition by orderno) FROM orders ORDER BY cid;

    //With only order by, the window of each row runs from the first row to the current row (including its peers).
    SELECT  id, orderno, price,cid ,
    min(price) over (order by price desc) FROM orders ORDER BY cid;

    //Each over clause partitions independently; the partitions are not nested
    SELECT  id, orderno, price,cid ,
    COUNT(*) OVER (PARTITION BY cid) AS cnt  ,
    min(price) over (partition by orderno) FROM orders ORDER BY cid;

    //Ordering within a partition
    SELECT  id, orderno, price,cid ,
    min(price) over (partition by cid order by price desc) FROM orders ORDER BY cid;

    //rank
    SELECT  id, orderno, price,cid ,
    RANK() OVER (PARTITION BY cid ORDER BY price) FROM orders ORDER BY cid;

    //dense_rank (an order by inside over is needed for the ranking to be meaningful)
    SELECT  id, orderno, price,cid ,
    dense_rank() over (partition by cid order by price) FROM orders ORDER BY cid;

    //row_number()
    SELECT  id, orderno, price,cid ,
    row_number() over (partition by cid order by price) FROM orders ORDER BY cid;
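How the three numbering functions differ on ties can be sketched in Python for a single partition ordered ascending (illustration only; the function name and the sample prices are hypothetical):

```python
def rank_functions(values):
    """Return (value, rank, dense_rank, row_number) for one partition, ordered ascending."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for row_number, v in enumerate(sorted(values), start=1):
        if v != prev:
            rank = row_number  # rank jumps to the row number at each new value
            dense += 1         # dense_rank increases by one, leaving no gaps
            prev = v
        out.append((v, rank, dense, row_number))
    return out


ranked = rank_functions([20, 10, 30, 20])
for line in ranked:
    print(line)
```

Tied rows share a rank and a dense_rank; rank then skips ahead, dense_rank does not, and row_number is always unique.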

    //CUME_DIST: cumulative distribution

    //PERCENT_RANK
    (currow - 1) / (totalrow - 1)
    1: (1 - 1) / (3 - 1) = 0
    2: (2 - 1) / (3 - 1) = 0.5
    3: (3 - 1) / (3 - 1) = 1

    //NTILE, together with the other ranking functions in one query:
CREATE TABLE employee
(
name string,
dept_num int,
salary float 
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
SELECT name, dept_num, salary,
RANK()       OVER (PARTITION BY dept_num ORDER BY salary) AS rank,
DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank, 
ROW_NUMBER() OVER () AS row_num,
ROUND((CUME_DIST() OVER (PARTITION BY dept_num ORDER BY salary)), 2) AS cume_dist,
PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank,
NTILE(4) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile
FROM employee ORDER BY dept_num;

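CUME_DIST: cumulative distribution

row number of the current row / total rows; rows with equal values all take the row number of the last row of the tie.
For example: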
1:  2 / 3 = 0.67
1:  2 / 3 = 0.67
2:  3 / 3 = 1

1:  1 / 3 = 0.33
2:  3 / 3 = 1
2:  3 / 3 = 1

1:  3 / 3 = 1
1:  3 / 3 = 1
1:  3 / 3 = 1
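The three examples above can be reproduced with a short Python sketch (illustration only; the function name is hypothetical). "Row number of the last tied row" is the same as the number of rows whose value is less than or equal to the current one.

```python
def cume_dist(values):
    """cume_dist for one partition ordered ascending: rows with value <= current / total rows."""
    ordered = sorted(values)
    n = len(ordered)
    return [sum(1 for x in ordered if x <= v) / n for v in ordered]


a = [round(x, 2) for x in cume_dist([1, 1, 2])]
b = [round(x, 2) for x in cume_dist([1, 2, 2])]
c = [round(x, 2) for x in cume_dist([1, 1, 1])]
print(a, b, c)
```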

percent_rank

(currentrow - 1) / (totalrow - 1)
Similar to cume_dist, but tied rows take the row number of the first row with the same rank.
1:  (1 - 1) / (3 - 1) = 0
1:  (1 - 1) / (3 - 1) = 0
2:  (3 - 1) / (3 - 1) = 1

1:  (1 - 1) / (3 - 1) = 0
2:  (2 - 1) / (3 - 1) = 0.5
2:  (2 - 1) / (3 - 1) = 0.5
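A matching Python sketch for percent_rank (illustration only; the function name is hypothetical). The row number of the first tied row is the rank, that is 1 plus the number of strictly smaller rows, so rank - 1 is just the count of smaller rows.

```python
def percent_rank(values):
    """percent_rank for one partition ordered ascending: (rank - 1) / (total rows - 1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [0.0]  # a single row has percent_rank 0
    # rank - 1 equals the number of rows with a strictly smaller value.
    return [sum(1 for x in ordered if x < v) / (n - 1) for v in ordered]


p1 = percent_rank([1, 1, 2])
p2 = percent_rank([1, 2, 2])
print(p1, p2)
```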

NTile

Assigns a bucket number to each row; the argument specifies the number of buckets.
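A Python sketch of the bucket assignment (illustration only; the function name is hypothetical). It assumes the usual SQL convention that bucket sizes differ by at most one and that the first buckets take the extra rows when the row count is not divisible by the bucket count.

```python
def ntile(num_rows, buckets):
    """Bucket number (1-based) for each of num_rows ordered rows."""
    base, extra = divmod(num_rows, buckets)
    out = []
    for b in range(1, buckets + 1):
        # Assumption: the first `extra` buckets each take one additional row.
        size = base + (1 if b <= extra else 0)
        out.extend([b] * size)
    return out


ten_rows = ntile(10, 4)
three_rows = ntile(3, 4)
print(ten_rows)
print(three_rows)
```

With fewer rows than buckets, the trailing buckets simply stay empty.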