Hive Queries 03: Sorting

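1. Global sort (order by)

order by: a global sort that runs in a single reducer. No matter how many reducers are configured, order by sorts the whole result set in one reducer.

Use asc for ascending order (the default) and desc for descending order.

The order by clause goes at the end of the select statement.

For example:

Query employee information sorted by salary in ascending order: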

hive (default)> select * from emp order by sal;

... ...
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm       emp.deptno
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
Time taken: 13.866 seconds, Fetched: 14 row(s)

Query employee information sorted by salary in descending order:

hive (default)> select * from emp order by sal desc;

... ...
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm       emp.deptno
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
Time taken: 22.399 seconds, Fetched: 14 row(s)
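
The asc/desc behaviour above can be modelled outside Hive. The following is only a small Python sketch, not Hive code: the function name order_by is made up for illustration, and the rows are a few (ename, sal) pairs copied from the emp table. It treats order by as one sort over all rows, the way a single reducer sees them.

```python
# A few (ename, sal) pairs copied from the emp table above
emp = [("SMITH", 800.0), ("ALLEN", 1600.0), ("WARD", 1250.0), ("KING", 5000.0)]


def order_by(rows, key, desc=False):
    """Hypothetical model of order by: one sort over every row,
    as if all rows were sent to a single reducer."""
    return sorted(rows, key=key, reverse=desc)


# order by sal (asc is the default)
ascending = order_by(emp, key=lambda row: row[1])
# order by sal desc
descending = order_by(emp, key=lambda row: row[1], desc=True)

print([ename for ename, _ in ascending])   # SMITH, WARD, ALLEN, KING
print([ename for ename, _ in descending])  # KING, ALLEN, WARD, SMITH
```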

2. Sorting by a column alias

hive (default)> select e.ename, e.sal*2 two_sal from emp e order by two_sal;

... ... 
OK
e.ename	two_sal
SMITH	1600.0
JAMES	1900.0
ADAMS	2200.0
WARD	2500.0
MARTIN	2500.0
MILLER	2600.0
TURNER	3000.0
ALLEN	3200.0
CLARK	4900.0
BLAKE	5700.0
JONES	5950.0
SCOTT	6000.0
FORD	6000.0
KING	10000.0
Time taken: 15.481 seconds, Fetched: 14 row(s)

3. Sorting by multiple columns

For example, sort by department number and then by salary, both ascending:

hive (default)> select ename, deptno, sal from emp order by deptno, sal;

... ...
OK
ename	deptno	sal
MILLER	10	1300.0
CLARK	10	2450.0
KING	10	5000.0
SMITH	20	800.0
ADAMS	20	1100.0
JONES	20	2975.0
SCOTT	20	3000.0
FORD	20	3000.0
JAMES	30	950.0
MARTIN	30	1250.0
WARD	30	1250.0
TURNER	30	1500.0
ALLEN	30	1600.0
BLAKE	30	2850.0
Time taken: 31.475 seconds, Fetched: 14 row(s)
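
How the second column only breaks ties in the first can be sketched in Python with a tuple sort key. This is an illustration, not Hive code; the rows are a subset of (ename, deptno, sal) values from the emp table, and the variable names are chosen here for the example.

```python
# A subset of (ename, deptno, sal) rows copied from the emp table above
emp = [
    ("KING", 10, 5000.0),
    ("WARD", 30, 1250.0),
    ("SMITH", 20, 800.0),
    ("MILLER", 10, 1300.0),
    ("BLAKE", 30, 2850.0),
    ("JONES", 20, 2975.0),
]

# Tuple key: deptno is compared first, sal only decides ties on deptno.
# This mirrors "order by deptno, sal".
by_dept_then_sal = sorted(emp, key=lambda row: (row[1], row[2]))

# Negating sal reverses its direction only, mirroring "order by deptno, sal desc".
by_dept_then_sal_desc = sorted(emp, key=lambda row: (row[1], -row[2]))

print([row[0] for row in by_dept_then_sal])
print([row[0] for row in by_dept_then_sal_desc])
```

Each column in a multi-column order by keeps its own direction, which is why the second sort negates only the salary.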

4. Sorting within each reducer (sort by)

sort by: sorts the rows inside each reducer; the overall result set is not globally sorted.

For example:

Set the number of reducers:

hive (default)> set mapreduce.job.reduces=3;

Check the number of reducers:

hive (default)> set mapreduce.job.reduces;
mapreduce.job.reduces=3

View employee information sorted by employee number (empno) in descending order:

hive (default)> select * from emp sort by empno desc;

... ...
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm       emp.deptno
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
Time taken: 28.287 seconds, Fetched: 14 row(s)

Write the query result to files:

hive (default)> insert overwrite local directory 
              > '/opt/module/data/soryby_result'
              > row format delimited fields terminated by '\t'
              > select * from emp sort by empno desc;

... ... 
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm       emp.deptno
Time taken: 32.479 seconds

Check the contents of the files generated under /opt/module/data/soryby_result:

[root@hadoop102 soryby_result]# ll
total 12
-rw-r--r--. 1 root root 288 Apr  1 22:55 000000_0
-rw-r--r--. 1 root root 282 Apr  1 22:55 000001_0
-rw-r--r--. 1 root root  91 Apr  1 22:55 000002_0

[root@hadoop102 soryby_result]# cat 000000_0 
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7839	KING	PRESIDENT	\N	1981-11-17	5000.0	\N	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	\N	20
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	\N	10
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	\N	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
[root@hadoop102 soryby_result]# 
[root@hadoop102 soryby_result]# cat 000001_0 
7934	MILLER	CLERK	7782	1982-1-23	1300.0	\N	10
7900	JAMES	CLERK	7698	1981-12-3	950.0	\N	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	\N	20
7566	JONES	MANAGER	7839	1981-4-2	2975.0	\N	20
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
[root@hadoop102 soryby_result]# 
[root@hadoop102 soryby_result]# cat 000002_0 
7902	FORD	ANALYST	7566	1981-12-3	3000.0	\N	20
7369	SMITH	CLERK	7902	1980-12-17	800.0	\N	20
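
The three files above are each in descending empno order, yet reading them one after another does not give a globally sorted list. A rough Python model of this is shown below. It is a sketch under a loud assumption: rows are handed to reducers round-robin purely as a stand-in, because without distribute by the real assignment is decided by Hive and will generally differ from this.

```python
# empno values copied from the emp table above
empnos = [7369, 7499, 7521, 7566, 7654, 7698, 7782,
          7788, 7839, 7844, 7876, 7900, 7902, 7934]
NUM_REDUCERS = 3  # mirrors: set mapreduce.job.reduces=3

# ASSUMPTION: round-robin assignment is only a placeholder for whatever
# row-to-reducer assignment Hive actually chooses.
reducers = [[] for _ in range(NUM_REDUCERS)]
for i, empno in enumerate(empnos):
    reducers[i % NUM_REDUCERS].append(empno)

# sort by empno desc: every reducer sorts only the rows it received
outputs = [sorted(part, reverse=True) for part in reducers]

# Reading the output files one after another
combined = [empno for part in outputs for empno in part]

for index, part in enumerate(outputs):
    print(index, part)
print("globally sorted:", combined == sorted(empnos, reverse=True))
```

Each per-reducer list is sorted, but the concatenation is not, which is the difference between sort by and order by.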

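5. Partitioned sorting (distribute by)

distribute by: similar to the partitioner in MR; it decides which reducer each row is sent to, and is used together with sort by.

Note that Hive requires distribute by to be written before sort by.

When testing distribute by, be sure to set multiple reducers; otherwise its effect cannot be seen.

For example:

Partition by deptno, and sort by empno in descending order within each partition: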

hive (default)> set mapreduce.job.reduces=3;

hive (default)> insert overwrite local directory
              > '/opt/module/data/distributeby_result'
              > row format delimited fields terminated by '\t'
              > select * from emp distribute by deptno sort by empno desc;

... ...
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm       emp.deptno
Time taken: 26.144 seconds

[root@hadoop102 distributeby_result]# ll
total 12
-rw-r--r--. 1 root root 293 Apr  1 23:15 000000_0
-rw-r--r--. 1 root root 139 Apr  1 23:15 000001_0
-rw-r--r--. 1 root root 229 Apr  1 23:15 000002_0
[root@hadoop102 distributeby_result]# 
[root@hadoop102 distributeby_result]# cat 000000_0 
7900	JAMES	CLERK	7698	1981-12-3	950.0	\N	30
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	\N	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
[root@hadoop102 distributeby_result]# 
[root@hadoop102 distributeby_result]# cat 000001_0 
7934	MILLER	CLERK	7782	1982-1-23	1300.0	\N	10
7839	KING	PRESIDENT	\N	1981-11-17	5000.0	\N	10
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	\N	10
[root@hadoop102 distributeby_result]# 
[root@hadoop102 distributeby_result]# cat 000002_0 
7902	FORD	ANALYST	7566	1981-12-3	3000.0	\N	20
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	\N	20
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	\N	20
7566	JONES	MANAGER	7839	1981-4-2	2975.0	\N	20
7369	SMITH	CLERK	7902	1980-12-17	800.0	\N	20
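
The file layout above (department 30 in 000000_0, 10 in 000001_0, 20 in 000002_0) can be reproduced with a short Python model. This is a sketch, not Hive's implementation: it assumes the reducer is chosen as deptno modulo the number of reducers, an assumption that is consistent with the files shown here but is not stated by the original text. The function name partition is made up for the example.

```python
# (empno, deptno) pairs copied from the emp table above
emp = [
    (7369, 20), (7499, 30), (7521, 30), (7566, 20), (7654, 30),
    (7698, 30), (7782, 10), (7788, 20), (7839, 10), (7844, 30),
    (7876, 20), (7900, 30), (7902, 20), (7934, 10),
]
NUM_REDUCERS = 3  # mirrors: set mapreduce.job.reduces=3


def partition(deptno, num_reducers):
    # ASSUMPTION: an integer key maps to reducer (key mod reducer count)
    return deptno % num_reducers


# distribute by deptno: rows with the same deptno land in the same reducer
files = {r: [] for r in range(NUM_REDUCERS)}
for empno, deptno in emp:
    files[partition(deptno, NUM_REDUCERS)].append((empno, deptno))

# sort by empno desc: each reducer sorts its own rows
for rows in files.values():
    rows.sort(key=lambda row: row[0], reverse=True)

for r, rows in files.items():
    print(f"00000{r}_0", [empno for empno, _ in rows])
```

Under that assumption 30 % 3 = 0, 10 % 3 = 1 and 20 % 3 = 2, so each department ends up in its own file, in the same order as the listing above.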

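6. cluster by

When distribute by and sort by use the same column, cluster by can be used instead.

Besides doing what distribute by does, cluster by also does what sort by does, but it can only sort in ascending order; asc or desc cannot be specified.

For example:

The following two statements are equivalent: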

hive (default)> select deptno, ename from emp
              > cluster by deptno;

hive (default)> select deptno, ename from emp
              > distribute by deptno
              > sort by deptno;

Both produce:

OK
deptno	ename
30	MARTIN
30	JAMES
30	BLAKE
30	WARD
30	TURNER
30	ALLEN
10	MILLER
10	KING
10	CLARK
20	SCOTT
20	JONES
20	ADAMS
20	FORD
20	SMITH
Time taken: 15.187 seconds, Fetched: 14 row(s)

Specifying desc causes an error:

hive (default)> select deptno, ename from emp
              > cluster by deptno desc;
FAILED: ParseException line 2:18 extraneous input 'desc' expecting EOF near '<EOF>'
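
The relationship between cluster by and the distribute by plus sort by pair can be sketched in Python as follows. This is a model, not Hive code: the function names are hypothetical, the reducer is assumed to be the key modulo the reducer count, and two reducers are used (instead of three) only so that several departments share one reducer and the per-reducer sort becomes visible.

```python
# A subset of (deptno, ename) rows copied from the emp table above
emp = [(20, "SMITH"), (30, "ALLEN"), (10, "CLARK"),
       (20, "SCOTT"), (10, "KING"), (30, "JAMES")]
NUM_REDUCERS = 2  # chosen for the illustration, see the note above


def distribute_then_sort(rows, dist_key, sort_key, desc=False):
    """Model of: distribute by dist_key sort by sort_key [desc]."""
    # ASSUMPTION: reducer = integer key mod reducer count
    parts = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        parts[dist_key(row) % NUM_REDUCERS].append(row)
    return [sorted(part, key=sort_key, reverse=desc) for part in parts]


def cluster_by(rows, key):
    """Model of cluster by: one column for both steps, ascending only."""
    return distribute_then_sort(rows, dist_key=key, sort_key=key)


def deptno(row):
    return row[0]


clustered = cluster_by(emp, deptno)
explicit = distribute_then_sort(emp, dist_key=deptno, sort_key=deptno)
# A descending sort needs the explicit two-clause form
explicit_desc = distribute_then_sort(emp, deptno, deptno, desc=True)

print(clustered == explicit)
print([d for d, _ in clustered[0]])
print([d for d, _ in explicit_desc[0]])
```

When a descending order is needed on the clustering column, the query has to be written as distribute by deptno sort by deptno desc, since cluster by accepts no direction.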