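0 order by performs a global sort: all rows are sent to a single reduce task and sorted there. sort by sorts within each reduce task, so each reducer's output is ordered, but the overall result is only locally ordered. distribute by c1 applies to the map output and sends records with the same c1 value to the same reducer; if the number of reducers is small, records with several different c1 values are sent to the same reducer.
1 distribute by must be written before sort by; otherwise the query fails with an error.
2 distribute by c1,c2 sort by c1,c2 = cluster by c1,c2. Note that cluster by can be used only when the columns after distribute by are the same as the columns after sort by. The result is still locally ordered, not globally ordered.
3 cluster by c1,c2 sorts in ascending order by default; asc or desc cannot be specified, otherwise an error is raised.
4 When the number of reducers is 1, sort by c1,c2 = order by c1,c2. Both sort inside a single reducer, so the sorted results are the same.
5 distribute by c1 sort by c2,c3 desc: if c1 has only one distinct value, this is equivalent to order by c2,c3 desc, because distribute by c1 sends all map output to the same reducer, which then sorts by c2,c3 desc. This is the same situation as item 4. It does not depend on how many reduce tasks there are: even if several reduce tasks are set manually, the map output goes to only one of them and the other reduce tasks receive no input.
6 To set the number of reduce tasks manually when testing: set mapreduce.job.reduce = 2; -- no effect. set mapred.reduce.tasks = 2; -- works. Note that the current Hadoop property is spelled mapreduce.job.reduces, with a trailing s, which is most likely why the first form is ignored.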
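The notes above can be sketched as a small Python toy model. It is only an illustration of the reasoning, not Hive's implementation: the function names are made up for this sketch, rows are (c1, c2) pairs, and `key % num_reducers` is an assumed stand-in for Hive's hash partitioner.

```python
from operator import itemgetter

# Toy model of the shuffle step; an illustration, not Hive's implementation.

def distribute_by(rows, key, num_reducers):
    """Route rows with the same key(row) to the same reducer.
    Assumption: key % num_reducers stands in for Hive's hash partitioner."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers

def sort_by(reducers, key):
    """Sort each reducer's rows independently (local order only)."""
    return [sorted(part, key=key) for part in reducers]

def cluster_by(rows, key, num_reducers):
    """distribute by + sort by on the same key, always ascending."""
    return sort_by(distribute_by(rows, key, num_reducers), key)

def order_by(rows, key):
    """Global sort: everything is sorted in one place."""
    return sorted(rows, key=key)

def concat(reducers):
    """Final output: reducer outputs written one after another."""
    return [row for part in reducers for row in part]

c1, c2 = itemgetter(0), itemgetter(1)
rows = [(1, 9), (2, 3), (3, 7), (1, 2), (2, 8), (3, 1)]

# distribute by c1 sort by c2, with 2 reducers: each part is sorted by c2,
# but the concatenation is not globally sorted.
two_reducers = sort_by(distribute_by(rows, c1, 2), c2)
print(two_reducers)  # [[(2, 3), (2, 8)], [(3, 1), (1, 2), (3, 7), (1, 9)]]

# With 1 reducer the same query gives the order by result (note 4).
one_reducer = sort_by(distribute_by(rows, c1, 1), c2)
print(concat(one_reducer) == order_by(rows, c2))  # True
```

In this model, rows with c1=1 and c1=3 share reducer 1, which matches note 0: several distinct values can land in the same reducer when there are few reducers.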
The Hive query syntax is as follows:
[WITH CommonTableExpression (, CommonTableExpression)*]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
Tests:
1 select * from tmp.test_1031_external_dt ;
2
select * from tmp.test_1031_external_dt
distribute by dt sort by report_time ;
-- Stage-1: number of mappers: 3; number of reducers: 1
-- result 5
select * from tmp.test_1031_external_dt order by report_time ;
-- Stage-1: number of mappers: 3; number of reducers: 1
-- result 6
With a single reducer, the two results are identical.
3
-- set mapreduce.job.reduce = 2; -- no effect
set mapred.reduce.tasks = 2; -- works
select * from tmp.test_1031_external_dt sort by report_time ;
-- Stage-1: number of mappers: 3; number of reducers: 2
-- result 7: the first segment ascends from '123ThreadPoolExecutor' to 'mapred'; the second segment ascends from '123ThreadPoolExecutor' to 'runNewMapper'
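The two ascending segments in result 7 can be reproduced with a rough Python sketch. It assumes the rows reach the two reducers without regard to report_time; the alternating split below is purely an illustrative stand-in, not Hive's actual assignment, and 'call' and 'main' are made-up sample values.

```python
# Toy model; the alternating split is an assumption, not Hive's partitioner.
values = ["mapred", "runNewMapper", "123ThreadPoolExecutor",
          "123ThreadPoolExecutor", "call", "main"]
num_reducers = 2

# Hand every second row to the same reducer, ignoring the sort key.
parts = [values[i::num_reducers] for i in range(num_reducers)]

# sort by: each reducer sorts only its own rows.
runs = [sorted(part) for part in parts]

# The final output is the reducer outputs written one after another.
output = [v for run in runs for v in run]

print(runs[0])  # ['123ThreadPoolExecutor', 'call', 'mapred']
print(runs[1])  # ['123ThreadPoolExecutor', 'main', 'runNewMapper']
print(output == sorted(values))  # False: two sorted runs, not one global order
```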
4
set mapred.reduce.tasks = 2; -- works
select * from tmp.test_1031_external_dt
distribute by dt sort by report_time ;
-- result 8: rows with dt=2020-06-08 are sorted in reducer 1; rows with dt=2020-06-16 and dt=2020-06-09 are sorted in reducer 2
5
select * from tmp.test_1031_external_dt
sort by report_time distribute by dt
-- FAILED: ParseException line 2:20 missing EOF at 'distribute' near 'report_time'
6
set mapred.reduce.tasks = 10; -- works
select * from tmp.test_1031_external_dt where dt = '2020-06-08'
distribute by dt sort by report_time ;
-- result 11: globally ordered; the map output is written to only one reducer
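Why test 6 is globally ordered even with 10 reducers can be checked with a minimal Python sketch. The `reducer_for` helper and the sample rows are hypothetical, and `zlib.crc32` is only an assumed stand-in for Hive's hash function; any deterministic hash leads to the same conclusion.

```python
import zlib

def reducer_for(dt, num_reducers):
    # Stand-in hash partitioner (assumption, not Hive's actual hash).
    return zlib.crc32(dt.encode("utf-8")) % num_reducers

# (dt, report_time) rows; the values are illustrative only.
rows = [
    ("2020-06-08", "mapred"),
    ("2020-06-09", "call"),
    ("2020-06-08", "123ThreadPoolExecutor"),
    ("2020-06-16", "main"),
    ("2020-06-08", "runNewMapper"),
]

# where dt = '2020-06-08'
filtered = [r for r in rows if r[0] == "2020-06-08"]

# distribute by dt, with 10 reducers
num_reducers = 10
reducers = [[] for _ in range(num_reducers)]
for row in filtered:
    reducers[reducer_for(row[0], num_reducers)].append(row)

# sort by report_time inside each reducer
sorted_parts = [sorted(part, key=lambda r: r[1]) for part in reducers]
non_empty = [part for part in sorted_parts if part]
output = [r for part in sorted_parts for r in part]

print(len(non_empty))  # 1: every row has the same dt, so one reducer gets them all
print(output == sorted(filtered, key=lambda r: r[1]))  # True
```

The other nine reducers stay empty, so the single reducer's local order is also the global order.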
Version: Hive 2.0.0
Reference:
https://zhuanlan.zhihu.com/p/93747613