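Hive SQL

Queries are submitted through the Hive CLI or through HiveServer2 (which is essentially a JDBC connection).
Hive CLI:
hive -e "your sql"  runs the SQL and exits
hive -S -e "your sql"  silent mode: the output omits information such as elapsed time and row counts
hive -f /xx/your_sql.hql  runs the SQL in the given file (inside the Hive shell, use source to run a SQL file)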

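Hive external tables and managed (internal) tables

Managed table: Hive controls the data life cycle (dropping the table deletes the data). The data is stored under the default Hive warehouse directory, which is configured with hive.metastore.warehouse.dir.
External table: the location keyword specifies the data directory, and Hive manages only the table structure (the table metadata). When renaming an external table, avoid a direct rename, because it can change where the table's data is located; copy the table structure into a new table instead.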

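Loading data into Hive

Loading file data into a managed table

load data [local] inpath '/your_path/'
overwrite into table your_table_name
[partition (par_key='par_value')]
(The overwrite keyword replaces the existing data in the target partition; if no partition is specified, it replaces the data of the whole table.)

With local, the path is on the local file system; otherwise it is an HDFS path.
A local directory (or file) is uploaded to the table's HDFS path under the Hive warehouse; an HDFS source is moved there rather than copied.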

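Loading data by adding a partition to a partitioned table:

alter table your_table_name add partition (par_key='par_value')
location 'hdfs://xxx';
This approach works for both external and managed tables (the location clause does not move the data into the Hive warehouse path).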

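Inserting data with a query

1. Single-partition insert:
insert overwrite table your_table_name
[partition (par_key='par_value')]
select * from src_table
where xx='par_value';
(The overwrite keyword replaces the existing data in the target partition; if no partition is specified, it replaces the data of the whole table.)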

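2. Multi-partition insert (the source table is scanned once; each target partition needs its own insert clause):
from src_table
insert overwrite table your_table_name
partition (par_key='par_value1')
select * where src_table.xx='par_value1'
insert overwrite table your_table_name
partition (par_key='par_value2')
select * where src_table.xx='par_value2'
insert overwrite table your_table_name
partition (par_key='par_value3')
select * where src_table.xx='par_value3';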

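3. Dynamic-partition insert (with hive.exec.dynamic.partition.mode=strict, at least one partition key must be static; set it to nonstrict to make every partition key dynamic):
insert overwrite table your_table_name
partition (par_key)
select …, xx
from src_table;
Hive uses the last column of the select list (or the last several columns, one per dynamic partition key) to determine the partition.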
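
The positional rule above can be modelled outside Hive. The following is a minimal Python sketch, not Hive code: the function name dynamic_partition_insert and the sample rows are hypothetical, and it only illustrates that the trailing select column picks the partition while the remaining columns become the stored data.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, num_partition_cols=1):
    """Group rows by their trailing column(s), the way a dynamic-partition
    insert picks the target partition from the end of the select list."""
    partitions = defaultdict(list)
    for row in rows:
        data = row[:-num_partition_cols]
        partition_values = row[-num_partition_cols:]
        # Only the data columns are written; the partition value names the partition.
        partitions[partition_values].append(data)
    return dict(partitions)


# Hypothetical rows standing in for: select uid, user_name, xx from src_table
rows = [
    ("u1", "tom", "2024-01-01"),
    ("u2", "ann", "2024-01-02"),
    ("u3", "bob", "2024-01-01"),
]
for partition_values, data in sorted(dynamic_partition_insert(rows).items()):
    print(partition_values, data)
```

Because the match is by position rather than by name, putting the partition column anywhere but last in the select list would send rows to the wrong partitions.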

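Exporting data from Hive

The number of output files depends on the number of reducers.
The output format is specified with outputformat (org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat).

1. Export with a query:
insert overwrite [local] directory '/your_path/'
select * …;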

2. Writing to multiple directories:
from your_table_name
insert overwrite directory '/your_path/x1/'
select * where xx=x1
insert overwrite directory '/your_path/x2/'
select * where xx=x2
insert overwrite directory '/your_path/x3/'
select * where xx=x3;

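Sorting in Hive

order by: global sort (the output is totally ordered);
sort by: sort within each reducer (each reducer's output is ordered, but the overall output may not be);
distribute by: controls how map output is divided among the reducers. Rows are repartitioned in the shuffle (mapper to reducer) according to the specified columns, so rows with the same values go to the same reducer; it does not sort by itself;
distribute by + sort by: rows are distributed to the reducers by the distribute by key, and each reducer then sorts its rows by the sort by columns (distribute by must come before sort by);
cluster by = distribute by + sort by (when the distribute by key and the sort by key are the same)

Summary: order by yields a globally ordered result; sort by, distribute by, and cluster by do not guarantee global order, although sort by does produce an ordered result when there is only one reducer.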
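
The difference between per-reducer order and global order can be sketched in plain Python. This is an illustrative model only: the function names are hypothetical, and routing integer keys with key % num_reducers is an assumption standing in for Hive's hash partitioning.

```python
from collections import defaultdict


def distribute_and_sort(rows, num_reducers, dist_key, sort_key):
    """Model `distribute by dist_key sort by sort_key` over num_reducers reducers."""
    reducers = defaultdict(list)
    for row in rows:
        # Assumption: integer keys are routed with key % num_reducers.
        reducers[row[dist_key] % num_reducers].append(row)
    output = []
    for r in range(num_reducers):
        # Each reducer sorts only its own rows; the outputs are concatenated as-is.
        output.extend(sorted(reducers[r], key=lambda row: row[sort_key]))
    return output


def order_by(rows, sort_key):
    """Model `order by sort_key`: one total ordering over all rows."""
    return sorted(rows, key=lambda row: row[sort_key])


pairs = [(1, 30), (2, 25), (3, 22), (4, 19), (5, 33), (6, 40)]
rows = [{"uid": uid, "age": age} for uid, age in pairs]

print([r["age"] for r in distribute_and_sort(rows, 2, "uid", "age")])
print([r["age"] for r in order_by(rows, "age")])
```

With two reducers each half of the output is sorted but the whole is not; with a single reducer the two functions return the same list, which matches the note about sort by above.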

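Hive bucketed tables

Bucketing, like partitioning, divides a table's data by specified columns.
The differences: bucketing requires the number of buckets to be fixed in advance, and rows are assigned to buckets by the hash of the specified key, so the buckets hold relatively even amounts of data; the bucketing column is an ordinary column of the table. Partitioning divides data by the values of the partition columns; the partition columns are not stored in the data files themselves, and the amount of data per partition can be uneven.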

Creating the table:
create table your_table_name (uid string, user_name string, age string)
clustered by (uid) into 100 buckets;
Inserting data:
Option 1: use the hive.enforce.bucketing setting
set hive.enforce.bucketing=true;
insert overwrite table your_table_name
select uid, user_name, age
from src_table;
Option 2: set the number of reducers manually and use cluster by on the bucketing key
set mapred.reduce.tasks=100;
insert overwrite table your_table_name
select uid, user_name, age
from src_table
cluster by uid;
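
To make "hash of the key, modulo the number of buckets" concrete, here is a small Python sketch. It is an assumption-laden illustration, not Hive's implementation: it hashes strings the way Java's String.hashCode does, whereas the hash Hive actually applies depends on the column type and the Hive version. The names java_string_hash and bucket_for are hypothetical.

```python
from collections import Counter


def java_string_hash(s):
    """32-bit signed hash computed the way Java's String.hashCode does."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_for(key, num_buckets):
    """Bucket index in [0, num_buckets): clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_buckets


# Hypothetical uid values; count how many land in each of 4 buckets.
counts = Counter(bucket_for(f"user_{i}", 4) for i in range(1000))
print(sorted(counts.items()))
```

The same key always maps to the same bucket, which is what lets one reducer per bucket (as in the two insert options above) produce exactly one file per bucket.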

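Sampling queries in Hive

Sampling by bucket

select * from your_table_name tablesample(bucket n out of m on col)  divides the rows into m buckets by the col column and selects the n-th bucket.
select * from your_table_name tablesample(bucket n out of m on rand())  divides the rows into m buckets by a random number and selects the n-th bucket.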

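Block sampling

select * from your_table_name tablesample(0.1 percent)  samples by percentage of data blocks
This kind of sampling depends on the storage format; the smallest unit that can be sampled is one HDFS block (128 MB by default).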

Random data partitioning with a bucketed table

For example, your_table_bucked is a table bucketed on bucket_key into 100 buckets:

select * from your_table_bucked tablesample (bucket 2 out of 100 on bucket_key);  (selects the 2nd bucket)
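
What "bucket n out of m" selects can be modelled in a few lines of Python. This is a sketch under a stated assumption: the key is an integer and its hash is taken to be the value itself, so a row belongs to bucket n when key % m equals n - 1. The function name tablesample_bucket is hypothetical.

```python
def tablesample_bucket(rows, n, m, key):
    """Keep the rows that fall into the n-th of m hash buckets (n is 1-based)."""
    if not 1 <= n <= m:
        raise ValueError("n must satisfy 1 <= n <= m")
    # Assumption: the hash of an integer key is the key itself.
    return [row for row in rows if key(row) % m == n - 1]


rows = [{"bucket_key": k} for k in range(1, 21)]
sample = tablesample_bucket(rows, 2, 4, key=lambda r: r["bucket_key"])
print([r["bucket_key"] for r in sample])
```

Taking n = 1 through m in turn partitions the rows with no overlap, which is why the bucket samples can serve as disjoint random splits of a table.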

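Hive views

Concepts:

1. A Hive view is essentially SQL (think of it as a stored query);
2. A view is read-only (commands such as insert and load cannot be used on it), so data and metadata cannot be changed through it;
3. Views are typically used to simplify query SQL and for access control (the view's SQL filters rows by condition).

Creating a view:

create view [if not exists] your_view_name as
select * from your_table_name where xx = 'xxx';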

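create table your_view_name1 like your_view_name;  (Hive has no create view ... like; create table ... like accepts a view as its source and, in Hive 0.8.0 and later, creates a table that adopts the view's schema)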

Querying a view:

Same as querying a table.

Dropping a view:

drop view [if exists] your_view_name;

Hive indexes

A Hive index is essentially a table.
An index can be built on the whole table or on specific partitions.
create index your_index
on table your_table (your_index_col)
as 'BITMAP'
with deferred rebuild;

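Hive strict mode

hive.mapred.mode=strict
It prohibits three kinds of queries: full scans of a partitioned table (a where clause with no partition-column filter); order by without limit; and joins without an on condition (Cartesian products).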

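Hive functions

Mathematical functions; aggregate functions (many rows in, one row out); table-generating functions (one row in, many rows out).
They are not listed one by one here.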

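Hive window functions

window function + over (partition by col1 order by col2)  (groups the rows by one column and/or orders them by another)

count(1) over (partition by col1)  counts rows per col1 group. Unlike count(1) with group by col1, it returns every input row, each annotated with its group's count, whereas group by returns one aggregated row per group;

row_number() over (order by col1)  orders all rows by col1 and assigns a sequential number;
rank() over (order by col1)  orders all rows by col1 and assigns a rank (unlike row_number, rows with the same col1 get the same rank).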
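
The three behaviours described above are easy to confuse, so here is a small Python model of them. It is only an illustration of the semantics on hypothetical in-memory lists; none of these functions are Hive APIs.

```python
from collections import Counter


def row_number(values):
    """Sequential numbers 1..n in sorted order; ties still get distinct numbers."""
    return [(v, i) for i, v in enumerate(sorted(values), start=1)]


def rank(values):
    """Ties share a rank, and the rank after a tie skips ahead (1, 2, 2, 4)."""
    ordered = sorted(values)
    result = []
    for i, v in enumerate(ordered, start=1):
        if i > 1 and v == ordered[i - 2]:
            result.append((v, result[-1][1]))  # same value as previous row: reuse its rank
        else:
            result.append((v, i))
    return result


def count_over_partition(values):
    """Like count(1) over (partition by col1): every row is kept, with its group's count."""
    counts = Counter(values)
    return [(v, counts[v]) for v in values]


col1 = [20, 10, 20, 30]
print(row_number(col1))
print(rank(col1))
print(count_over_partition(col1))
print(sorted(Counter(col1).items()))  # group by: one row per group
```

The last two lines show the contrast with group by: the window version keeps all four rows, while the grouped version collapses them to three.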

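User-defined functions

Guideline for writing user-defined functions: minimize or avoid object creation and reuse existing objects; immutable types are generally avoided (to reduce GC pressure).

udf

A user-defined function takes one row as input and returns one row (compare Spark's map).
Extend the UDF class and implement the evaluate() method.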

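udaf

A user-defined aggregate function takes a set of rows and returns one row (compare Spark's aggregateByKey and reduceByKey).
Extend the UDAF class and implement these methods (in its evaluator, a class implementing UDAFEvaluator):
init(): initialization;
iterate(): aggregation logic; its parameter types are the actual input data types;
terminatePartial(): returns the intermediate aggregation result;
merge(): combines intermediate results; its parameter type matches the type returned by terminatePartial();
terminate(): returns the final result.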
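
The order in which these five methods are called is easier to see in a worked example. Below is a Python model of the lifecycle for an average, not the Java API itself: the class AvgEvaluator, the driver run_avg, and the snake_case method names are hypothetical stand-ins for a real evaluator.

```python
class AvgEvaluator:
    """Python stand-in for a UDAF evaluator that computes an average."""

    def init(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        # Called once per input row; null inputs are skipped.
        if value is not None:
            self.total += value
            self.count += 1

    def terminate_partial(self):
        # The partial result carries sum and count, not a partial average.
        return (self.total, self.count)

    def merge(self, partial):
        # Receives exactly what terminate_partial returned.
        total, count = partial
        self.total += total
        self.count += count

    def terminate(self):
        return self.total / self.count if self.count else None


def run_avg(splits):
    """Drive the lifecycle: one evaluator per input split, then one merging evaluator."""
    partials = []
    for split in splits:
        evaluator = AvgEvaluator()
        evaluator.init()
        for value in split:
            evaluator.iterate(value)
        partials.append(evaluator.terminate_partial())
    final = AvgEvaluator()
    final.init()
    for partial in partials:
        final.merge(partial)
    return final.terminate()


print(run_avg([[1, 2, 3], [4], [5, 6]]))
```

Carrying (sum, count) through terminate_partial is the design point: averaging the three partial averages would give a different, wrong answer because the splits have different sizes.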

udtf

A user-defined table-generating function takes one row and returns multiple rows (compare Spark's flatMap).

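Hive query optimization

Join optimization

Put the large table on the right side of the join.
Always give the join an on condition (without on, the join is a Cartesian product; Hive strict mode can be enabled to reject joins that have no on condition).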

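Map-reduce optimization

1. Set reasonable numbers of map and reduce tasks

Too few map and reduce tasks hurts efficiency because there is not enough parallelism;
too many wastes time on task initialization.
Number of maps: determined by the number of input files; merge an excess of small files beforehand, or split files that are too large (ideally, file sizes are chosen sensibly when the warehouse is built);
Number of reduces:
set the task count directly with set mapred.reduce.tasks;
hive.exec.reducers.bytes.per.reducer: the amount of data each reduce task handles (reduces = total data size / this parameter);
hive.exec.reducers.max: the maximum number of reduces per job;
split the reduce step (logical optimization).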
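
The arithmetic behind the two reducer parameters can be written out as follows. This is a rough sketch of the rule stated above (total data size divided by bytes per reducer, capped by the maximum), not Hive's exact estimator; the function name and the sample values for both settings are assumptions, since defaults differ between Hive versions.

```python
def estimate_reducers(total_input_bytes, bytes_per_reducer, max_reducers):
    """Reducers = ceil(total / bytes_per_reducer), kept within [1, max_reducers]."""
    wanted = (total_input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(max_reducers, wanted))


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed settings for illustration only:
# hive.exec.reducers.bytes.per.reducer = 256 MiB, hive.exec.reducers.max = 1009
print(estimate_reducers(10 * GIB, 256 * MIB, 1009))    # 10 GiB of input
print(estimate_reducers(1024 * GIB, 256 * MIB, 1009))  # 1 TiB of input hits the cap
```

Setting mapred.reduce.tasks explicitly bypasses this estimate, which is why it is listed as the first option.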

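2. JVM reuse (use with caution when cluster resources are tight: the slots of finished tasks may stay occupied and unreleased until the whole job ends)

Configured in mapred-site.xml:
mapred.job.reuse.jvm.num.tasks  sets how many tasks a JVM slot is reused for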

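3. Parallel execution (jobs with no ordering dependency run concurrently)

hive.exec.parallel=true

4. Speculative execution (use with caution when cluster resources are tight)

Launches backup tasks for slow-running tasks.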

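5. Data reuse

Load the data once in a single map pass, and pick out the rows that match each where condition (writing them to the specified table or path).
The SQL looks like this (the target table and path are placeholders):
from your_table
insert overwrite table your_table_1
select * where col = 'value1'
insert overwrite directory '/your_path/'
select * where col = 'value2';

The same idea also applies to group by optimization,
which requires hive.multigroupby.singlemr=true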

Hive file archiving

Cold data can be packed into HDFS archive files to reduce the load on the NameNode (archive files have the .har suffix).
set hive.archive.enabled=true;
alter table your_table_name archive partition (par='par_value');
alter table your_table_name unarchive partition (par='par_value');

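Using Hive virtual columns

Useful for troubleshooting, for example locating the file and offset of a problematic log record.
select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, ROW__OFFSET__INSIDE__BLOCK
from your_table_name;
(The virtual column names use double underscores; ROW__OFFSET__INSIDE__BLOCK is only available when hive.exec.rowoffset=true.)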
