Hive Knowledge Overview (Repost)

Original

2020-07-04 07:59

Kept as personal notes, with a few examples I have written added in. Additions are welcome.

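1. order by, sort by, distribute by, cluster by

Background table structure
The explanations below share one running example, so we need a scenario with a table and some data to fill it. The table has 3 fields: personId identifies a person, company is the name of a company, and money is that company's annual profit (unit: 10,000 RMB).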

personId    company     money
p1          Company1    100
p2          Company2    200
p1          Company3    150
p3          Company4    300

Create the table and load the data:

create table company_info(
  personId string,
  company string,
  money float
)
row format delimited fields terminated by "\t";

load data local inpath 'company_info.txt' into table company_info;

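1. order by
In Hive, order by performs a global sort of the query result: the output of every mapper is sent to a single reducer, and the job starts only one reducer regardless of the data size. With a very large data set this takes a long time.
Important: in strict mode, order by must be accompanied by a limit clause. Without a cap on the row count, a global sort over a huge data set can fail without producing any result. Related property: set hive.mapred.mode=nonstrict/strict

Example: sort by money
select * from company_info order by money desc;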
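
The strict-mode rule above can be sketched as follows. This is a minimal example that assumes the company_info table from this section; the limit of 3 is an arbitrary choice, and on newer Hive releases the check may be governed by a different property than the one these notes name.

```sql
-- Switch the session to strict mode (the property named in the note above).
set hive.mapred.mode=strict;

-- In strict mode an order by without limit is rejected,
-- so cap the number of rows that the single reducer has to return.
select personId, company, money
from company_info
order by money desc
limit 3;
```

With the four sample rows shown earlier, this returns the rows whose money values are 300, 200 and 150, in that order.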

2. sort by
In Hive, sort by sorts each chunk of data locally: the rows handled by each reducer come out sorted, but the combined result is not guaranteed to be globally sorted.
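
A small sketch of this per-reducer behaviour, assuming the company_info table above. The reducer count of 2 and the output directory /tmp/sort_by_demo are hypothetical values picked for illustration.

```sql
-- Ask for 2 reducers so that per-reducer ordering becomes visible.
set mapreduce.job.reduces=2;

-- Each reducer sorts only the rows it receives.
select personId, company, money
from company_info
sort by money desc;

-- Writing to a directory makes the effect easier to inspect:
-- each reducer writes its own file, and each file is sorted on its own.
insert overwrite local directory '/tmp/sort_by_demo'
select personId, company, money
from company_info
sort by money desc;
```

Which rows end up on which reducer is up to Hive when only sort by is used, which is why distribute by is usually added.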

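3. distribute by
In Hive, distribute by is usually used together with sort by: distribute by decides which reducer each row is sent to (rows with the same value of the given column go to the same reducer), and sort by then sorts the rows inside each reducer.
Important: distribute by must be written before sort by.
Important: related properties are mapreduce.job.reduces and hive.exec.reducers.bytes.per.reducer.

Example: put different people (personId) into different groups and sort each group by money.
select * from company_info distribute by personId sort by personId, money desc;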

4. cluster by
In Hive, cluster by is equivalent to distribute by plus sort by when both use the same column. Note that cluster by can only sort in ascending order; you cannot specify asc or desc on its columns.

Example: a cluster by query equivalent to distribute by with sort by
select * from company_info distribute by personId sort by personId;
is equivalent to
select * from company_info cluster by personId;

2. Rows to columns and columns to rows (UDAF and UDTF)
1. Rows to columns
Table structure:

name        constellation   blood_type
Sun Wukong  Aries           A
Dahai       Sagittarius     A
Songsong    Aries           B
Zhu Bajie   Aries           A
Fengjie     Sagittarius     A

Create the table and load the data:

create table person_info(
  name string,
  constellation string,
  blood_type string)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/person_info.tsv' into table person_info;

Example: group together the people who share the same constellation and blood type

select
  t1.base, concat_ws('|', collect_set(t1.name)) name
from
  (select
     name, concat(constellation, ",", blood_type) base
   from
     person_info) t1
group by t1.base;
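
As an alternative sketch of the same aggregation, the two columns can be grouped directly instead of being concatenated in a subquery first. The alias names is my own choice; everything else comes from the person_info table above.

```sql
-- Group on the two columns directly; collect_set gathers the distinct
-- names of each group and concat_ws joins them with '|'.
select
  constellation,
  blood_type,
  concat_ws('|', collect_set(name)) as names
from person_info
group by constellation, blood_type;
```

collect_set drops duplicate names and does not guarantee the order of the collected values. If duplicates should be kept, collect_list can be used in its place.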

2. Columns to rows
Table structure:

movie                   category
Person of Interest      suspense,action,sci-fi,drama
Lie to Me               suspense,crime,action,psychological,drama
Wolf Warrior 2          war,action,disaster

Create the table and load the data:

create table movie_info(
  movie string,
  category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";

load data local inpath '/opt/module/datas/movie_info.tsv' into table movie_info;

Example: expand the array of categories for each movie

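select
  movie, category_name
from
  movie_info lateral view explode(category) table_tmp as category_name;

Supplement: an example from one of my own tables, where the remark string is split on the full-width comma '，' before being exploded:

select id, category_name
from fb.fb_clue
lateral view explode(split(remark, '，')) newtable as category_name
where id < 5;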

3. Array operations
"fields terminated by": the separator between fields.
"collection items terminated by": the separator between the items inside a single field.
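
Once a column has been declared as an array with these separators, it can be queried with Hive's built-in array functions. A short sketch against the movie_info table from the previous section; the column aliases are my own.

```sql
select
  movie,
  category[0]          as first_category,    -- arrays are indexed from 0
  size(category)       as category_count,    -- number of items in the array
  sort_array(category) as sorted_categories  -- items in ascending order
from movie_info;
```

array_contains(category, value) can be added to the where clause to keep only the rows whose array holds a given label; the value has to match a label exactly as it is stored in the data.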

4. ORC storage
ORC stands for Optimized Row Columnar file. It evolved from RCFile and provides an efficient way to store data in Hive, improving the efficiency of reading, writing and processing data.
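
A minimal sketch of putting data into ORC, assuming the company_info text table from section 1. The table name company_info_orc is hypothetical.

```sql
-- Same columns as company_info, but stored as ORC instead of delimited text.
create table company_info_orc(
  personId string,
  company string,
  money float)
stored as orc;

-- ORC files are written by Hive itself, so the table is populated with a
-- query rather than by loading a text file into it directly.
insert into table company_info_orc
select personId, company, money
from company_info;
```

Going through a text staging table like this is a common pattern, because a raw text file is not a valid ORC file.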

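5. Hive bucketing
Hive can further organize a table, or a partition of a table, into buckets in order to get:
1. More efficient data sampling
2. More efficient data processing
Bucketing works by hashing a specified column: the data is split into a set of buckets according to that column's hash, and each bucket corresponds to one storage file.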

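1. Bucketing a table directly
Before starting, set hive.enforce.bucketing to true so that Hive enforces the bucket layout when data is inserted. (Hive 2.0 and later always enforce bucketing and no longer have this property.)

create table music(
  id int,
  name string,
  size float)
clustered by (id) into 4 buckets
row format delimited
fields terminated by "\t";

This statement divides the data of the music table into 4 buckets by id. (The clustered by clause has to come before row format.) When data is inserted, 4 reduce tasks run and 4 files are written.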
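
The insert and sampling steps can be sketched as follows. The staging table music_staging is hypothetical, and the file path is borrowed from the load statement later in these notes; the point is that rows are hashed into buckets when they are written by an insert.

```sql
-- Hypothetical plain-text staging table with the same columns as music.
create table music_staging(
  id int,
  name string,
  size float)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/music.txt' into table music_staging;

-- The property these notes mention; uncomment it if your Hive version needs it.
-- set hive.enforce.bucketing=true;

-- Writing through insert lets Hive hash each row on id into one of the 4 buckets.
insert overwrite table music
select id, name, size
from music_staging;

-- Sampling: read only the first of the 4 buckets instead of the whole table.
select * from music tablesample(bucket 1 out of 4 on id);
```

Because the sample is taken on the same column the table is bucketed on, Hive can read a single bucket file rather than scanning all of the data.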

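2. Bucketing inside partitions
When the data volume is very large and would require a huge number of partitions, consider using buckets: too many partitions can bring down the file system, and bucketed data can be sampled and joined on the bucketing column more efficiently. Which bucket a row lands in is decided by the hash of the clustered by column modulo the number of buckets. The hash spreads the rows out, but it does not guarantee that every bucket holds the same amount of data.

create table music2(
  id int,
  name string,
  size float)
partitioned by (`date` string)
clustered by (id) sorted by (size) into 4 buckets
row format delimited
fields terminated by "\t";

load data local inpath '/opt/module/datas/music.txt' into table music2 partition(`date`='2017-08-30');

Note: date is a reserved keyword in recent Hive versions, so the partition column name is quoted with backticks.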
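
To see the "hash modulo number of buckets" rule and the uneven distribution it can produce, the bucket index can be approximated with Hive's built-in hash and pmod functions. This only illustrates the idea: the hash function Hive uses internally for bucketing may differ between versions, so the numbers are not guaranteed to match the actual bucket files.

```sql
-- Approximate bucket index for each row: hash of the clustering column
-- modulo the number of buckets (4 for music2).
select id, pmod(hash(id), 4) as bucket_no
from music2
where `date` = '2017-08-30';

-- Rows per approximate bucket; the counts are usually close but not equal.
select pmod(hash(id), 4) as bucket_no, count(*) as row_count
from music2
where `date` = '2017-08-30'
group by pmod(hash(id), 4);
```

pmod is used instead of % so that a negative hash value still maps to an index between 0 and 3.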
