System Environment

Linux Ubuntu 16.04

jdk-7u75-linux-x64

hive-1.1.0-cdh5.4.5

hadoop-2.6.0-cdh5.4.5

mysql-5.7.24

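Task Overview

1. How to use Order by (global sort) and Sort by (local sort), and the scenarios each one suits.

2. Use cases and basic syntax of Group by.

3. How Cluster by relates to Distribute by and Sort by, and how to use them.

Task Steps

1. First check whether the Hadoop processes are already running. If they are not, change to the /apps/hadoop/sbin directory and start Hadoop.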

jps  
cd /apps/hadoop/sbin  
./start-all.sh

2. Then start MySQL, which stores Hive's metadata. (Password: zhangyu)

sudo service mysql start

3. Change to the /data/hive4 directory. If it does not exist, create the hive4 folder first.

mkdir -p /data/hive4  
cd /data/hive4

4. Use wget to download the files under http://192.168.1.100:60000/allfiles/hive4.

wget http://192.168.1.100:60000/allfiles/hive4/goods_visit
wget http://192.168.1.100:60000/allfiles/hive4/order_items
wget http://192.168.1.100:60000/allfiles/hive4/buyer_favorite

5. In the terminal, type the hive command to start the Hive command line.

hive

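Order by Demonstration

1. In Hive, create a table goods_visit with two fields, goods_id (string) and click_num (int), delimited by '\t'. click_num is declared as int so that the sort below compares numbers rather than text.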

create table goods_visit(goods_id string,click_num int)
row format delimited fields terminated by '\t'  stored as textfile;

Once the table is created, list the tables to check.

show tables;

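2. Load the data from the local file /data/hive4/goods_visit into the Hive table goods_visit.

load data local inpath '/data/hive4/goods_visit' into table goods_visit;

3. Use Order by to sort the goods by click count in descending order, and take the first 10 rows with limit.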

select * from goods_visit order by click_num desc limit 10;
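
To make the semantics of the query above concrete, here is a minimal Python sketch of a global descending sort followed by a limit. The sample rows and the function name order_by_click_desc are hypothetical and are not taken from the real goods_visit file. The last lines show why click_num is better stored as int: compared as strings, the values sort lexicographically.

```python
# Hypothetical sample rows (goods_id, click_num); not real goods_visit data.
rows = [("1001", 9), ("1002", 120), ("1003", 35), ("1004", 10)]


def order_by_click_desc(rows, limit):
    """Sort all rows globally by click_num, descending, then keep `limit` rows."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:limit]


top = order_by_click_desc(rows, 3)
print(top)  # [('1002', 120), ('1003', 35), ('1004', 10)]

# The same click counts compared as strings sort lexicographically instead.
as_strings = sorted((str(r[1]) for r in rows), reverse=True)
print(as_strings)  # ['9', '35', '120', '10']
```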

Sort by Demonstration

1. To demonstrate the effect of Sort by, I set the number of Reduce tasks to three with the following command:

set mapred.reduce.tasks=3;

2. Create an order details table for an e-commerce site, named order_items, with seven fields: item_id, order_id, goods_id, goods_number, shop_price, goods_price and goods_amount. All of them are of type string, and the delimiter is '\t'.

create table order_items(item_id string,order_id string,goods_id string,goods_number string,
shop_price string,goods_price string,goods_amount string)
row format delimited fields terminated by '\t'  stored as textfile;

3. Load the data from the local file /data/hive4/order_items into the Hive table order_items.

load data local inpath '/data/hive4/order_items' into table order_items;

4. Sort by goods ID (goods_id). Each Reduce task sorts only its own share of the rows.

select * from order_items sort by goods_id;
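
The following Python sketch illustrates what a local sort means with three Reduce tasks. It assumes the rows are dealt to the reducers round-robin, which is a simplification chosen for this example and not Hive's actual assignment. The goods IDs and the function name sort_by are made up.

```python
def sort_by(rows, key, num_reducers):
    """Split rows across reducers, then sort each reducer's share locally."""
    # Assumption: round-robin assignment, used here only for illustration.
    shares = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(share, key=key) for share in shares]


# Hypothetical goods_id values.
goods_ids = ["g5", "g1", "g4", "g2", "g6", "g3"]
outputs = sort_by(goods_ids, key=lambda g: g, num_reducers=3)
print(outputs)  # [['g2', 'g5'], ['g1', 'g6'], ['g3', 'g4']]

# Each reducer's output is ordered, but the concatenation is not.
flat = [g for part in outputs for g in part]
print(flat == sorted(flat))  # False
```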

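Group by Demonstration

1. Create a buyer favorites table for an e-commerce site, named buyer_favorite, with three fields: buyer_id, goods_id and dt. All of them are of type string, and the delimiter is '\t'.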

create table buyer_favorite(buyer_id string,goods_id string,dt string)
row format delimited fields terminated by '\t'  stored as textfile;

2. Load the data from the local file /data/hive4/buyer_favorite into the Hive table buyer_favorite.

load data local inpath '/data/hive4/buyer_favorite' into table buyer_favorite;

3. Group by dt and count the buyer_id values for each day.

select dt,count(buyer_id) from buyer_favorite group by dt;
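
As a rough illustration of the query above, this Python sketch groups rows by dt and counts the non-null buyer_id values in each group, which is how count(column) behaves in SQL. The sample rows and the function name count_buyers_by_dt are hypothetical.

```python
def count_buyers_by_dt(rows):
    """Return {dt: number of non-null buyer_id values} for the given rows."""
    counts = {}
    for buyer_id, _goods_id, dt in rows:
        counts.setdefault(dt, 0)      # every dt value forms a group
        if buyer_id is not None:      # count(col) skips NULLs
            counts[dt] += 1
    return counts


# Hypothetical rows (buyer_id, goods_id, dt); not real buyer_favorite data.
rows = [
    ("b1", "g1", "2010-12-01"),
    ("b2", "g1", "2010-12-01"),
    ("b1", "g2", "2010-12-02"),
    (None, "g3", "2010-12-02"),
]
print(count_buyers_by_dt(rows))  # {'2010-12-01': 2, '2010-12-02': 1}
```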

Distribute by Demonstration

1. To demonstrate the effect of Distribute by, I set the number of Reduce tasks to three with the following command:

set mapred.reduce.tasks=3;

2. Using the buyer favorites table, distribute the rows by user ID (buyer_id) with distribute by and write the output to the local directory /data/hive4/out.

insert overwrite local directory '/data/hive4/out' select * from buyer_favorite distribute by buyer_id;

3. Switch to a Linux terminal window and look at the files under /data/hive4/out.

cd /data/hive4/out  
ls

The data has been distributed into three files by buyer_id.
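
The distribution step can be pictured as hash partitioning, sketched below in Python. zlib.crc32 is used only as a stand-in for a deterministic hash; Hive applies its own hash function, so the actual file a given buyer_id lands in will differ. The sample rows and the function name distribute_by are hypothetical.

```python
import zlib


def distribute_by(rows, key, num_reducers):
    """Send each row to the reducer chosen by hashing its key."""
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        # Stand-in hash (assumption); Hive's own hash function is different.
        bucket = zlib.crc32(key(row).encode("utf-8")) % num_reducers
        parts[bucket].append(row)
    return parts


# Hypothetical rows (buyer_id, goods_id).
rows = [("b1", "g1"), ("b2", "g1"), ("b1", "g2"), ("b3", "g5"), ("b2", "g9")]
parts = distribute_by(rows, key=lambda r: r[0], num_reducers=3)

# Rows sharing a buyer_id always end up in the same part; no row is lost.
for i, part in enumerate(parts):
    print(i, part)
```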

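Cluster by Demonstration

Cluster by provides the function of Distribute by together with that of Sort by; it is equivalent to Distribute by plus Sort by on the same column. The sort, however, can only be ascending: you cannot specify ASC or DESC.

1. Set the number of Reduce tasks to 3.

set mapred.reduce.tasks=3;

2. Distribute buyer_favorite three ways by buyer_id, and sort each part by buyer_id.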

select * from buyer_favorite cluster by buyer_id;
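
A minimal Python sketch of the same idea: partition by a hash of the key, then sort each partition in ascending order by that key. As before, zlib.crc32 is only an assumed stand-in for Hive's hash function, and the rows and the function name cluster_by are hypothetical.

```python
import zlib


def cluster_by(rows, key, num_reducers):
    """Hash-partition rows by key, then sort each partition ascending by key."""
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        # Stand-in hash (assumption); Hive's own hash function is different.
        parts[zlib.crc32(key(row).encode("utf-8")) % num_reducers].append(row)
    return [sorted(part, key=key) for part in parts]


# Hypothetical rows (buyer_id, goods_id).
rows = [("b3", "g5"), ("b1", "g1"), ("b2", "g1"), ("b1", "g2"), ("b2", "g9")]
for part in cluster_by(rows, key=lambda r: r[0], num_reducers=3):
    print(part)  # each part is ordered by buyer_id
```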

Order by versus Sort by

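Order by sorts the entire result set globally and uses only one Reduce task. That single Reduce task carries a heavy load, so on large data sets the query can take a very long time and may never return a result. For this reason the number of output rows should be limited; when Hive runs in strict mode (hive.mapred.mode=strict), an Order by without a limit clause is rejected.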

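Sort by sorts within each Reduce task, and there can be several Reduce tasks. It guarantees that the output of each Reduce task is ordered, but the outputs of several Reduce tasks taken together are not necessarily ordered. To obtain a total order, a further global sort is therefore needed after the local sort. This is like sorting inside each small group first, so that only the already sorted groups remain to be combined, which can improve efficiency considerably on large data sets.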
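
The final combining step described above is cheap because each input is already ordered. As a sketch, Python's heapq.merge performs such a merge of sorted sequences. The reducer outputs below are made-up numbers, and merge_sorted is a hypothetical helper name.

```python
import heapq

# Hypothetical outputs of three Reduce tasks, each already sorted locally.
reducer_outputs = [[2, 5, 9], [1, 6], [3, 4, 8]]


def merge_sorted(outputs):
    """Merge several individually sorted lists into one globally sorted list."""
    return list(heapq.merge(*outputs))


print(merge_sorted(reducer_outputs))  # [1, 2, 3, 4, 5, 6, 8, 9]
```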

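Distribute by versus Group by

Distribute by splits the data on the Map side according to the specified condition and hands it to the Reduce side: rows are partitioned by the specified column into different Reduce output files.

Group by divides a data set into smaller groups according to some rule and then processes each group. For example, an e-commerce site that wants statistics on a year of sales can use Group by to split the year's data by month and then compute the ten best-selling goods of each month.

Both partition the data by key and both use a Reduce phase. The only difference is that Distribute by merely spreads the rows out, whereas Group by gathers the rows with the same key together and must be followed by an aggregation.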
