Hive Queries 03: Sorting
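1. Global sorting (order by)
order by: performs a global sort with a single reducer. No matter how many reducers are configured, order by always produces one totally ordered result, so only one reducer runs.
Use asc for ascending order (the default) and desc for descending order.
The order by clause comes at the end of the select statement (only a limit clause may follow it).
For example:
Query employee information sorted by salary in ascending order: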
hive (default)> select * from emp order by sal;
... ...
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
Time taken: 13.866 seconds, Fetched: 14 row(s)
Query employee information sorted by salary in descending order:
hive (default)> select * from emp order by sal desc;
... ...
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Time taken: 22.399 seconds, Fetched: 14 row(s)
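The two queries above can be modeled outside Hive with a short Python sketch. This is only an illustration of the semantics, not how Hive implements them: Python's built-in sorted() stands in for the single reducer, and the list holds a few (ename, sal) pairs copied from the emp output.

```python
# Illustrative model only: sorted() plays the role of the single reducer
# that order by uses. Rows are (ename, sal) pairs taken from the emp table.
emp = [("KING", 5000.0), ("SMITH", 800.0), ("ALLEN", 1600.0), ("JAMES", 950.0)]

# order by sal  (asc is the default direction)
ascending = sorted(emp, key=lambda row: row[1])

# order by sal desc
descending = sorted(emp, key=lambda row: row[1], reverse=True)

print([name for name, _ in ascending])
print([name for name, _ in descending])
```

Because every row passes through one sort, the result is totally ordered, which is also why order by can be slow on large tables.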
2. Sorting by a column alias
hive (default)> select e.ename, e.sal*2 two_sal from emp e order by two_sal;
... ...
OK
e.ename two_sal
SMITH 1600.0
JAMES 1900.0
ADAMS 2200.0
WARD 2500.0
MARTIN 2500.0
MILLER 2600.0
TURNER 3000.0
ALLEN 3200.0
CLARK 4900.0
BLAKE 5700.0
JONES 5950.0
SCOTT 6000.0
FORD 6000.0
KING 10000.0
Time taken: 15.481 seconds, Fetched: 14 row(s)
3. Sorting by multiple columns
For example, sort by department number and then by salary, both in ascending order:
hive (default)> select ename, deptno, sal from emp order by deptno, sal;
... ...
OK
ename deptno sal
MILLER 10 1300.0
CLARK 10 2450.0
KING 10 5000.0
SMITH 20 800.0
ADAMS 20 1100.0
JONES 20 2975.0
SCOTT 20 3000.0
FORD 20 3000.0
JAMES 30 950.0
MARTIN 30 1250.0
WARD 30 1250.0
TURNER 30 1500.0
ALLEN 30 1600.0
BLAKE 30 2850.0
Time taken: 31.475 seconds, Fetched: 14 row(s)
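A multi-column sort compares the first column and uses the second only to break ties. The Python sketch below models that with a tuple key; it is an illustration, not Hive code. The rows are a shuffled subset of the output above. The second sort models a mixed-direction variant (order by deptno asc, sal desc), which does not appear in the original example and is included here as an assumption about what a reader might try next.

```python
# (ename, deptno, sal) rows copied from the output above, deliberately shuffled.
rows = [
    ("KING", 10, 5000.0),
    ("SMITH", 20, 800.0),
    ("MILLER", 10, 1300.0),
    ("JAMES", 30, 950.0),
    ("ADAMS", 20, 1100.0),
    ("BLAKE", 30, 2850.0),
]

# order by deptno, sal: compare deptno first, fall back to sal on ties.
by_dept_then_sal = sorted(rows, key=lambda r: (r[1], r[2]))

# Mixed directions (deptno ascending, sal descending): negating the numeric
# column flips its direction inside the tuple key.
by_dept_then_sal_desc = sorted(rows, key=lambda r: (r[1], -r[2]))

print([r[0] for r in by_dept_then_sal])
print([r[0] for r in by_dept_then_sal_desc])
```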
4. Sorting within each reducer (sort by)
sort by: sorts the rows inside each reducer; the combined result set is not globally sorted.
For example:
Set the number of reducers:
hive (default)> set mapreduce.job.reduces=3;
Check the number of reducers:
hive (default)> set mapreduce.job.reduces;
mapreduce.job.reduces=3
View employee information sorted by employee number (empno) in descending order:
hive (default)> select * from emp sort by empno desc;
... ...
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Time taken: 28.287 seconds, Fetched: 14 row(s)
Write the query result to local files:
hive (default)> insert overwrite local directory
> '/opt/module/data/soryby_result'
> row format delimited fields terminated by '\t'
> select * from emp sort by empno desc;
... ...
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
Time taken: 32.479 seconds
View the contents of the files generated under /opt/module/data/soryby_result:
[root@hadoop102 soryby_result]# ll
total 12
-rw-r--r--. 1 root root 288 Apr 1 22:55 000000_0
-rw-r--r--. 1 root root 282 Apr 1 22:55 000001_0
-rw-r--r--. 1 root root 91 Apr 1 22:55 000002_0
[root@hadoop102 soryby_result]# cat 000000_0
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7839 KING PRESIDENT \N 1981-11-17 5000.0 \N 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 \N 20
7782 CLARK MANAGER 7839 1981-6-9 2450.0 \N 10
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 \N 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
[root@hadoop102 soryby_result]#
[root@hadoop102 soryby_result]# cat 000001_0
7934 MILLER CLERK 7782 1982-1-23 1300.0 \N 10
7900 JAMES CLERK 7698 1981-12-3 950.0 \N 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 \N 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 \N 20
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
[root@hadoop102 soryby_result]#
[root@hadoop102 soryby_result]# cat 000002_0
7902 FORD ANALYST 7566 1981-12-3 3000.0 \N 20
7369 SMITH CLERK 7902 1980-12-17 800.0 \N 20
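The three files show why sort by is not a global sort: each file is in descending empno order, but their concatenation is not. A minimal Python sketch of that behavior follows. The grouping of empno values into three lists is copied from the files above rather than computed, since without distribute by Hive decides at run time which reducer receives each row.

```python
# empno values grouped as in the three output files above. The grouping is
# taken from that output; the order inside each list is shuffled on purpose.
reducer_inputs = [
    [7698, 7844, 7654, 7839, 7782, 7788],
    [7499, 7934, 7521, 7900, 7566, 7876],
    [7369, 7902],
]

# sort by empno desc: each reducer sorts only the rows it received.
reducer_outputs = [sorted(part, reverse=True) for part in reducer_inputs]

# The client sees the reducer outputs one after another.
combined = [empno for part in reducer_outputs for empno in part]

# Compare with what a true global sort (order by empno desc) would return.
globally_sorted = combined == sorted(combined, reverse=True)

print(reducer_outputs)
print(globally_sorted)
```

The second print shows False: 7654 at the end of the first file is followed by 7934 at the start of the second.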
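5. Partitioned sorting (distribute by)
distribute by: similar to the partitioner in MapReduce; it decides which reducer each row is sent to and is normally combined with sort by.
Note that Hive requires distribute by to be written before sort by.
When testing distribute by, be sure to configure multiple reducers; otherwise its effect is not visible.
For example:
Partition by deptno and, within each partition, sort by empno in descending order: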
hive (default)> set mapreduce.job.reduces=3;
hive (default)> insert overwrite local directory
> '/opt/module/data/distributeby_result'
> row format delimited fields terminated by '\t'
> select * from emp distribute by deptno sort by empno desc;
... ...
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
Time taken: 26.144 seconds
[root@hadoop102 distributeby_result]# ll
total 12
-rw-r--r--. 1 root root 293 Apr 1 23:15 000000_0
-rw-r--r--. 1 root root 139 Apr 1 23:15 000001_0
-rw-r--r--. 1 root root 229 Apr 1 23:15 000002_0
[root@hadoop102 distributeby_result]#
[root@hadoop102 distributeby_result]# cat 000000_0
7900 JAMES CLERK 7698 1981-12-3 950.0 \N 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 \N 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
[root@hadoop102 distributeby_result]#
[root@hadoop102 distributeby_result]# cat 000001_0
7934 MILLER CLERK 7782 1982-1-23 1300.0 \N 10
7839 KING PRESIDENT \N 1981-11-17 5000.0 \N 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 \N 10
[root@hadoop102 distributeby_result]#
[root@hadoop102 distributeby_result]# cat 000002_0
7902 FORD ANALYST 7566 1981-12-3 3000.0 \N 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 \N 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 \N 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 \N 20
7369 SMITH CLERK 7902 1980-12-17 800.0 \N 20
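The file layout above (deptno 30 in 000000_0, 10 in 000001_0, 20 in 000002_0) is consistent with a partition rule of deptno modulo the number of reducers. The Python sketch below models the query under that assumption; the modulo rule is inferred from this output and is not a statement about Hive's partitioner in general. The list holds the (empno, deptno) pairs of all 14 employees.

```python
NUM_REDUCERS = 3  # mirrors set mapreduce.job.reduces=3

# (empno, deptno) pairs from the emp table.
emp = [(7369, 20), (7499, 30), (7521, 30), (7566, 20), (7654, 30),
       (7698, 30), (7782, 10), (7788, 20), (7839, 10), (7844, 30),
       (7876, 20), (7900, 30), (7902, 20), (7934, 10)]

# distribute by deptno.
# Assumption: the target partition is deptno modulo the reducer count.
partitions = {i: [] for i in range(NUM_REDUCERS)}
for empno, deptno in emp:
    partitions[deptno % NUM_REDUCERS].append((empno, deptno))

# sort by empno desc, applied inside each partition independently.
for rows in partitions.values():
    rows.sort(key=lambda r: r[0], reverse=True)

for index, rows in partitions.items():
    print(index, [empno for empno, _ in rows])
```

Under this assumption the three printed lists match the empno order in 000000_0, 000001_0, and 000002_0 respectively.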
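6. cluster by
When distribute by and sort by use the same column, cluster by can be used instead.
cluster by combines the behavior of distribute by and sort by, but it can sort only in ascending order; asc or desc cannot be specified.
For example:
The following two statements are equivalent: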
hive (default)> select deptno, ename from emp
> cluster by deptno;
hive (default)> select deptno, ename from emp
> distribute by deptno
> sort by deptno;
Both produce the same result:
OK
deptno ename
30 MARTIN
30 JAMES
30 BLAKE
30 WARD
30 TURNER
30 ALLEN
10 MILLER
10 KING
10 CLARK
20 SCOTT
20 JONES
20 ADAMS
20 FORD
20 SMITH
Time taken: 15.187 seconds, Fetched: 14 row(s)
Specifying desc causes an error:
hive (default)> select deptno, ename from emp
> cluster by deptno desc;
FAILED: ParseException line 2:18 extraneous input 'desc' expecting EOF near '<EOF>'
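The relationship between the three clauses can be sketched in Python as follows. This is a model under stated assumptions, not Hive's implementation: the helper names distribute_by, sort_by, and cluster_by are hypothetical, the partition rule (key modulo reducer count) is assumed, and the row (40, "DEMO") is a made-up row added so that one partition holds two different deptno values.

```python
NUM_REDUCERS = 3  # assumed, as with mapreduce.job.reduces=3


def distribute_by(rows, key):
    """Split rows into partitions (assumed rule: key modulo reducer count)."""
    parts = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        parts[key(row) % NUM_REDUCERS].append(row)
    return parts


def sort_by(parts, key, descending=False):
    """Sort each partition independently of the others."""
    return [sorted(part, key=key, reverse=descending) for part in parts]


def cluster_by(rows, key):
    """Model of cluster by: distribute by plus ascending sort by on the same key."""
    return sort_by(distribute_by(rows, key), key)


def deptno(row):
    return row[0]


# (deptno, ename) rows; (40, "DEMO") is hypothetical and not part of emp.
rows = [(30, "MARTIN"), (10, "KING"), (40, "DEMO"),
        (20, "SCOTT"), (30, "ALLEN"), (10, "CLARK")]

clustered = cluster_by(rows, deptno)
explicit = sort_by(distribute_by(rows, deptno), deptno)
descending = sort_by(distribute_by(rows, deptno), deptno, descending=True)

print(clustered == explicit)
print(clustered[1])
print(descending[1])
```

In this model a descending order is available only through the explicit two-step form, which corresponds to writing distribute by deptno sort by deptno desc in Hive instead of cluster by.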