1. Data preparation
Create the person table:
CREATE TABLE `person`(
`id` int,
`name` string,
`address` string);
Insert the following rows:
hive> insert into person values(1, 'lisi', 'beijing');
hive> insert into person values(2, 'zhangsan', 'chengdu');
hive> insert into person values(3, 'wangwu', 'shanghai');
hive> insert into person values(4, 'zhaoliu', 'guangzhou');
hive> insert into person values(5, 'name5', 'beijing');
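2. order by
order by performs a global sort over the entire query result, so all rows pass through a single reducer. On large datasets this step can take a long time.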
hive> select * from person order by id desc;
5 name5 beijing
4 zhaoliu guangzhou
3 wangwu shanghai
2 zhangsan chengdu
1 lisi beijing
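3. sort by
sort by only sorts the rows within each reducer, that is, it performs a local sort. Each reducer's output is ordered, but the combined result is not globally ordered.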
hive> set mapreduce.job.reduces=3;
hive> insert overwrite local directory '/root/sortby-result' select * from person sort by id desc;
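# Rows in each reducer's output file are sorted by id in descending order.
# Fields are separated by Hive's default \001 (Ctrl-A) delimiter, which the terminal does not display.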
[root@master ~]# cat /root/sortby-result/000000_0
5name5beijing
[root@master ~]# cat /root/sortby-result/000001_0
4zhaoliuguangzhou
3wangwushanghai
2zhangsanchengdu
[root@master ~]# cat /root/sortby-result/000002_0
1lisibeijing
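The local-sort behaviour can be mimicked outside Hive with a small Python sketch. This is a toy model only: the seeded random assignment of rows to reducers is an assumption standing in for however Hive actually spreads rows, and the function name `sort_by_id_desc` is made up for illustration, so the bucket contents will not match the files above.

```python
import random

# Same rows as the person table.
rows = [
    (1, "lisi", "beijing"),
    (2, "zhangsan", "chengdu"),
    (3, "wangwu", "shanghai"),
    (4, "zhaoliu", "guangzhou"),
    (5, "name5", "beijing"),
]

def sort_by_id_desc(rows, num_reducers, seed=0):
    """Toy model of `sort by id desc`: scatter rows, then sort each bucket."""
    # Assumption: a seeded RNG stands in for Hive's row-to-reducer assignment.
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[rng.randrange(num_reducers)].append(row)
    # Each reducer sorts only the rows it received (a local sort).
    return [sorted(b, key=lambda r: r[0], reverse=True) for b in buckets]

buckets = sort_by_id_desc(rows, num_reducers=3)
for i, bucket in enumerate(buckets):
    print(f"reducer {i}: {bucket}")
```

Every bucket comes out in descending id order, yet concatenating the buckets does not generally give a globally sorted list, which is the difference from order by.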
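4. distribute by
distribute by controls how the mappers' output is partitioned across reducers. It guarantees that records with the same key are sent to the same reducer.
# Partition by address, then sort by id in descending order within each reducer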
hive> set mapreduce.job.reduces=3;
hive> insert overwrite local directory '/root/distributeby-result' select * from person distribute by address sort by id desc;
[root@master ~]# cat /root/distributeby-result/000000_0
4zhaoliuguangzhou
3wangwushanghai
[root@master ~]# cat /root/distributeby-result/000001_0
5name5beijing
1lisibeijing
[root@master ~]# cat /root/distributeby-result/000002_0
2zhangsanchengdu
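A rough Python sketch of the same idea: hash the distribute by key to pick a reducer, then sort inside each reducer. The helper names are hypothetical, and CRC32 is used only as a deterministic stand-in for Hive's own hash function, so which bucket an address lands in will likely differ from the files above.

```python
import zlib

# Same rows as the person table.
rows = [
    (1, "lisi", "beijing"),
    (2, "zhangsan", "chengdu"),
    (3, "wangwu", "shanghai"),
    (4, "zhaoliu", "guangzhou"),
    (5, "name5", "beijing"),
]

def partition_of(key, num_reducers):
    # Assumption: CRC32 is a stand-in for Hive's real partitioning hash.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def distribute_by_address_sort_by_id_desc(rows, num_reducers):
    """Toy model of `distribute by address sort by id desc`."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # Rows with the same address always hash to the same bucket.
        buckets[partition_of(row[2], num_reducers)].append(row)
    # Local sort by id, descending, inside each bucket.
    return [sorted(b, key=lambda r: r[0], reverse=True) for b in buckets]

result = distribute_by_address_sort_by_id_desc(rows, num_reducers=3)
for i, bucket in enumerate(result):
    print(f"reducer {i}: {bucket}")
```

Because the bucket is a pure function of the address, both beijing rows always end up together, regardless of how many reducers are configured.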
5. cluster by
cluster by is equivalent to using distribute by and sort by on the same column. However, cluster by does not accept asc or desc; it always sorts in ascending order.
hive> set mapreduce.job.reduces=3;
hive> insert overwrite local directory '/root/clusterby-result' select * from person cluster by address;
[root@master ~]# cat /root/clusterby-result/000000_0
4zhaoliuguangzhou
3wangwushanghai
[root@master ~]# cat /root/clusterby-result/000001_0
5name5beijing
1lisibeijing
[root@master ~]# cat /root/clusterby-result/000002_0
2zhangsanchengdu
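The relationship between the two forms can be written down as a short Python sketch, in which cluster by is just the distribute-then-sort step with the same column used twice and the direction fixed to ascending. The function names and the CRC32 hash are assumptions for illustration, not Hive internals.

```python
import zlib

# Same rows as the person table.
rows = [
    (1, "lisi", "beijing"),
    (2, "zhangsan", "chengdu"),
    (3, "wangwu", "shanghai"),
    (4, "zhaoliu", "guangzhou"),
    (5, "name5", "beijing"),
]

ADDRESS = 2  # column index of address in each row tuple

def distribute_and_sort(rows, dist_col, sort_col, num_reducers, descending=False):
    """Toy model of `distribute by <dist_col> sort by <sort_col> [asc|desc]`."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: CRC32 stands in for Hive's real partitioning hash.
        idx = zlib.crc32(str(row[dist_col]).encode("utf-8")) % num_reducers
        buckets[idx].append(row)
    return [sorted(b, key=lambda r: r[sort_col], reverse=descending) for b in buckets]

def cluster_by(rows, col, num_reducers):
    """Toy model of `cluster by <col>`: same column twice, always ascending."""
    return distribute_and_sort(rows, col, col, num_reducers, descending=False)

clustered = cluster_by(rows, ADDRESS, num_reducers=3)
for i, bucket in enumerate(clustered):
    print(f"reducer {i}: {bucket}")
```

When a descending order or a different sort column is needed, the explicit distribute by ... sort by ... form has to be used instead.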