Sorting in Hive Queries

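I. Notes on Query Statements

1. Column aliases cannot be used in the WHERE clause, because WHERE is evaluated before the SELECT list that defines them

2. LIKE and RLIKE

1) Use the LIKE operator to select values that match a pattern.

2) The pattern can contain literal characters or digits, plus two wildcards:

% matches zero or more characters (any number of characters).

_ matches exactly one character.

3) RLIKE is a Hive extension of this feature: it lets you specify the match condition with Java regular expressions, a more powerful pattern language.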

4) Worked examples

(1) Find employees whose salary starts with 2

hive (default)> select * from emp where sal LIKE '2%';

(2) Find employees whose salary has 2 as its second digit

hive (default)> select * from emp where sal LIKE '_2%';

(3) Find employees whose salary contains a 2

hive (default)> select * from emp where sal RLIKE '[2]';
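The difference between the three patterns above can be sketched outside Hive. The Python below is only an illustration under stated assumptions: the helper names `like` and `rlike` and the salary values are made up, salaries are treated as plain strings, and Python's `re` module stands in for Java regular expressions (the two dialects agree on these simple patterns). LIKE has to match the whole value, while RLIKE succeeds if the pattern matches any substring.

```python
import re

def like(value, pattern):
    """Emulate SQL LIKE: % matches any run of characters, _ exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    # LIKE must match the entire value, hence fullmatch.
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

def rlike(value, pattern):
    """Emulate RLIKE: true if the regex matches anywhere in the value."""
    return re.search(pattern, value) is not None

salaries = ["800", "1250", "2450", "2975", "3000"]  # made-up sample values

starts_with_2 = [s for s in salaries if like(s, "2%")]
second_is_2 = [s for s in salaries if like(s, "_2%")]
contains_2 = [s for s in salaries if rlike(s, "[2]")]
```
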

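3. Full outer joins are supported

Full outer join: returns all records from both tables that satisfy the WHERE clause. Where a row in one table has no matching value on the join column in the other table, the other table's columns are filled with NULL.

hive (default)> select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;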
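To make the NULL filling concrete, here is a small Python sketch of what the full join above returns. It is not how Hive executes the join; the function name `full_outer_join` and the sample rows are invented for illustration, `None` plays the role of NULL, and department numbers are assumed to be unique in `dept`.

```python
def full_outer_join(emps, depts):
    """emps: list of (empno, ename, deptno); depts: list of deptno values.
    Returns (e.empno, e.ename, d.deptno) rows, with None where a side has no match."""
    rows = []
    matched = set()
    for empno, ename, deptno in emps:
        # A NULL deptno never equals anything, so it cannot match.
        if deptno is not None and deptno in depts:
            rows.append((empno, ename, deptno))
            matched.add(deptno)
        else:
            rows.append((empno, ename, None))
    # Departments that matched no employee still appear, with NULL employee columns.
    for deptno in depts:
        if deptno not in matched:
            rows.append((None, None, deptno))
    return rows

# Made-up sample data.
emps = [(7369, "SMITH", 20), (7499, "ALLEN", 30), (7999, "NOBODY", None)]
depts = [10, 20, 30, 40]
result = full_outer_join(emps, depts)
```

Every employee and every department shows up at least once, which is the property that distinguishes a full outer join from an inner join.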

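4. ★ OR is not supported in join predicates

The following query is invalid in Hive versions that only accept equality conditions in ON (more complex ON expressions are accepted from Hive 2.2.0 onward):

hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno or e.ename=d.ename;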

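II. Sorting

1. Global sorting (ORDER BY)

ORDER BY: sorts the whole result set globally, using a single reducer.

Sorting with the ORDER BY clause:

ASC (ascending): ascending order (the default)

DESC (descending): descending order

The ORDER BY clause goes at the end of the SELECT statement.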

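2. SORT BY: sorting within each reducer

SORT BY: sorts the rows inside each reducer independently, so the overall result set is not globally ordered.

It is usually used together with DISTRIBUTE BY, and DISTRIBUTE BY must be written before SORT BY.

If mapred.reduce.tasks=1 (the older name of mapreduce.job.reduces), the result is the same as ORDER BY. If it is greater than 1, the output is split into that many files; each file is sorted by the specified column, but global order is not guaranteed.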

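Example:

1). Set the number of reducers

hive (default)> set mapreduce.job.reduces=3;

2). Check the configured number of reducers

hive (default)> set mapreduce.job.reduces;

3). View employee information sorted by employee number in descending order

Without DISTRIBUTE BY, the query does not control which reducer a row is sent to; each reducer simply sorts the rows it receives by empno.

hive (default)> select * from emp sort by empno desc;

4). Write the query result to files (sorted by department number in descending order)

hive (default)> insert overwrite local directory '/opt/module/datas/sortby-result' select * from emp sort by deptno desc;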
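Why three reducers give three individually sorted files rather than one sorted result can be shown with a short Python sketch. This is a simplification, not Hive's implementation: the function name `sort_by` and the employee numbers are made up, and round-robin assignment is only an assumed stand-in for however Hive spreads rows across reducers when no DISTRIBUTE BY is given.

```python
def sort_by(rows, key, num_reducers, reverse=False):
    """Split rows across reducers, then sort inside each reducer only."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Round-robin is an assumption made here to keep the sketch deterministic.
        buckets[i % num_reducers].append(row)
    return [sorted(bucket, key=key, reverse=reverse) for bucket in buckets]

empnos = [7369, 7499, 7521, 7566, 7654, 7698]  # made-up sample values

# Three reducers: each output "file" is sorted, the concatenation is not.
files = sort_by(empnos, key=lambda x: x, num_reducers=3, reverse=True)
concatenated = [row for f in files for row in f]

# One reducer: same result as a global ORDER BY ... DESC.
single = sort_by(empnos, key=lambda x: x, num_reducers=1, reverse=True)[0]
```
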

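3. Partitioning (DISTRIBUTE BY)

DISTRIBUTE BY: similar to a custom partitioner in MapReduce. It partitions the rows and is used together with SORT BY.

Note that Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.

DISTRIBUTE BY takes the hash code of the specified column modulo the number of reducers and sends each row to the corresponding reducer.

This is the partition step of a MapReduce job: by default, the reducer that handles a row is chosen as (key.hashCode() & Integer.MAX_VALUE) % numReduce. The parentheses matter, because % binds more tightly than & in Java.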
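The partition formula can be tried out with a few lines of Python. This is a sketch under assumptions: the function name `partition` is made up, and the hash code of an integer key is taken to be the integer itself, as with Java's `Integer.hashCode()`. The mask with `Integer.MAX_VALUE` clears the sign bit before the modulo, so the result is never negative.

```python
INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE

def partition(hash_code, num_reducers):
    """Reducer index for a 32-bit hash code: mask the sign bit, then take the modulo."""
    return (hash_code & INT_MAX) % num_reducers

deptnos = [10, 20, 30]
assignment = {d: partition(d, 3) for d in deptnos}

# A negative hash code still lands in a valid reducer index.
# (In Java, -7 % 3 alone would be -1, which is why the mask is applied first.)
negative_case = partition(-7, 3)
```
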

Example:

hive (default)> set mapreduce.job.reduces=3;

hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

Writing to a local directory only works with OVERWRITE; INTO cannot be used.
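A rough Python model of the statement above: rows are first routed to a reducer by department number, then each reducer sorts its rows by employee number in descending order. The function name `distribute_sort` and the sample rows are hypothetical, and the hash of an integer key is assumed to be the integer itself.

```python
from collections import defaultdict

def distribute_sort(rows, dist_key, sort_key, num_reducers, reverse=False):
    """Route rows to reducers by dist_key, then sort within each reducer by sort_key."""
    reducers = defaultdict(list)
    for row in rows:
        index = (dist_key(row) & 0x7FFFFFFF) % num_reducers
        reducers[index].append(row)
    return {
        index: sorted(bucket, key=sort_key, reverse=reverse)
        for index, bucket in sorted(reducers.items())
    }

# Made-up (empno, deptno) rows.
emp = [(7369, 20), (7499, 30), (7782, 10), (7566, 20), (7698, 30), (7839, 10)]

output_files = distribute_sort(
    emp,
    dist_key=lambda row: row[1],   # distribute by deptno
    sort_key=lambda row: row[0],   # sort by empno
    num_reducers=3,
    reverse=True,                  # desc
)
```

Each entry of `output_files` corresponds to one output file in the target directory: all rows of a department end up in the same file, ordered by employee number within it.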

4. CLUSTER BY

When the DISTRIBUTE BY and SORT BY columns are the same, CLUSTER BY can be used instead.

CLUSTER BY combines the function of DISTRIBUTE BY with that of SORT BY. However, the sort is always ascending: you cannot specify ASC or DESC.

1) The following two statements are equivalent

hive (default)> select * from emp cluster by deptno;

hive (default)> select * from emp distribute by deptno sort by deptno;

Note: partitioning by department number does not mean that each partition holds exactly one department; for example, departments 20 and 30 may land in the same partition.
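The note about several departments sharing a partition can be illustrated with a small Python sketch. It is only a model under assumptions: the function name `cluster_by` and the rows are made up, and the hash of an integer department number is taken to be the number itself. With four departments and three reducers, two departments necessarily share a reducer, and the ascending sort keeps each department's rows together inside it.

```python
def cluster_by(rows, key, num_reducers):
    """Distribute rows by key, then sort each reducer's rows by the same key, ascending."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[(key(row) & 0x7FFFFFFF) % num_reducers].append(row)
    return [sorted(bucket, key=key) for bucket in buckets]

# Made-up (empno, deptno) rows.
emp = [(7788, 40), (7369, 20), (7782, 10), (7499, 30), (7839, 10), (7900, 40)]

partitions = cluster_by(emp, key=lambda row: row[1], num_reducers=3)
```

Here departments 10 and 40 fall into the same partition; which departments collide depends on the key values and on the number of reducers.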
