This post is the third and final part of the Hive project hands-on series.
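- Let's start with a quick test
0: jdbc:hive2://hadoop002:10000> select * from video_orc order by views desc limit 10;
This query ran out of memory.
- Test again
0: jdbc:hive2://hadoop002:10000> select videoId,views from video_orc order by views desc limit 10;
From these two runs we can draw a conclusion: do not use select * here, otherwise the heap overflows. Name only the columns you need.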
1. Top 10 videos by view count
Approach: use order by on the views column for a global sort, and limit the output to the first 10 rows.
Final code:
select
    videoId,
    uploader,
    age,
    category,
    length,
    views,
    rate,
    ratings,
    comments
from
    video_orc
order by
    views desc
limit 10;
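2. Top 10 categories by popularity
Approach:
1. Count how many videos each category contains, and show the 10 categories with the most videos.
2. Group by category, then count the videoId values in each group.
3. In the current table, one video maps to one or more categories. To group by category, first expand the category column into one row per category (column-to-row), then count.
4. Finally, sort by popularity and show the first 10 rows.
Final code: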
select
    category_name as category,
    count(t1.videoId) as hot
from (
    select
        videoId,
        category_name
    from
        video_orc lateral view explode(category) t_category as category_name) t1
group by
    t1.category_name
order by
    hot desc
limit 10;
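To see what the column-to-row step does in isolation, here is a minimal sketch on a single made-up row. The values 'v1', 'Music' and 'Comedy' are illustrative only, and the sketch assumes a Hive version that accepts a select without a from clause (Hive 0.13 or later, as far as I know).

```sql
-- Toy input: one video with two categories (made-up values).
select
    t.videoId,
    category_name
from (
    select 'v1' as videoId, array('Music', 'Comedy') as category) t
lateral view explode(t.category) t_category as category_name;
```

The result should contain one row per array element, so video v1 appears once for each of its categories. That is why count(videoId) per category_name works in the query above.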
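3. Categories of the 20 most-viewed videos, and how many Top 20 videos each category contains
Approach:
1. Find the full records of the 20 most-viewed videos, sorted in descending order.
2. Expand the category column of those 20 records (column-to-row).
3. Finally, query each category name and the number of Top 20 videos in that category.
Final code: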
select
    category_name as category,
    count(t2.videoId) as hot_with_views
from (
    select
        videoId,
        category_name
    from (
        select
            *
        from
            video_orc
        order by
            views desc
        limit 20) t1 lateral view explode(category) t_category as category_name) t2
group by
    category_name
order by
    hot_with_views desc;
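Given the earlier observation that select * can overflow the heap, a variant that projects only the columns the outer query needs may be worth trying. This is a sketch of the same logic, not a measured improvement. It keeps views in the inner select list so that order by refers to a selected column.

```sql
select
    category_name as category,
    count(t2.videoId) as hot_with_views
from (
    select
        videoId,
        category_name
    from (
        -- Only the columns needed downstream, instead of *.
        select
            videoId,
            category,
            views
        from
            video_orc
        order by
            views desc
        limit 20) t1 lateral view explode(category) t_category as category_name) t2
group by
    category_name
order by
    hot_with_views desc;
```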
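4. Category ranking of the videos related to the Top 50 most-viewed videos
Approach:
1. Query the full records of the 50 most-viewed videos (each record includes that video's related videos); call this temporary result t1.
2. Expand the relatedId column of those 50 records into rows; call this temporary result t2.
3. Inner join the related video ids with the video_orc table.
4. Group by category, count the videos in each group, then rank.
- 1. The 50 most-viewed videos (t1)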
select
    *
from
    video_orc
order by
    views desc
limit 50;
- 2. Expand the related video ids into rows (t2)
select
    explode(relatedId) as videoId
from
    t1;
- 3. Produce two columns: the category, and the related video id found in the previous step
select
    videoId,
    category_name
from (
    select
        distinct(t2.videoId),
        t3.category
    from
        t2
    inner join
        video_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_category as category_name;
- 4. Group by category, count the videos in each group, then rank
Final code:
select
    category_name as category,
    count(t5.videoId) as hot
from (
    select
        videoId,
        category_name
    from (
        select
            distinct(t2.videoId),
            t3.category
        from (
            select
                explode(relatedId) as videoId
            from (
                select
                    *
                from
                    video_orc
                order by
                    views desc
                limit 50) t1) t2
        inner join
            video_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_category as category_name) t5
group by
    category_name
order by
    hot desc;
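The steps above name their intermediate results t1, t2 and t4, but those are not real tables, so the step snippets cannot be run one by one. One way to keep the step-by-step shape in a single runnable statement is a common table expression. This is a sketch that assumes a Hive version with support for the with clause (0.13 or later, as far as I know). The logic is meant to match the nested query above, except that t1 selects only relatedId and views instead of *.

```sql
with t1 as (
    -- Step 1: the 50 most-viewed videos (only the columns needed later).
    select relatedId, views
    from video_orc
    order by views desc
    limit 50
),
t2 as (
    -- Step 2: one row per related video id.
    select explode(relatedId) as videoId
    from t1
),
t4 as (
    -- Step 3: attach the category array; distinct applies to the whole row.
    select distinct
        t2.videoId,
        t3.category
    from t2
    inner join video_orc t3 on t2.videoId = t3.videoId
)
-- Step 4: expand the categories, then count and rank.
select
    category_name as category,
    count(videoId) as hot
from t4 lateral view explode(category) t_category as category_name
group by category_name
order by hot desc;
```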
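5. Top 10 videos by popularity within each category, using Music as the example
Approach:
1. To find the Top 10 most popular videos in the Music category, we first need to isolate that category, which means expanding the category column. So we create a table that stores the data with category expanded into categoryId.
2. Insert data into the expanded category table.
3. Compute video popularity within the chosen category (Music).
Final code:
- 1. Create the category table: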
create table gulivideo_category(
    videoId string,
    uploader string,
    age int,
    categoryId string,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as orc;
- 2. Insert data into the category table:
insert into table gulivideo_category
select
    videoId,
    uploader,
    age,
    categoryId,
    length,
    views,
    rate,
    ratings,
    comments,
    relatedId
from
    video_orc lateral view explode(category) t_category as categoryId;
- 3. Top 10 for the Music category (other categories work the same way)
select
    videoId,
    categoryId,
    views
from
    gulivideo_category
where
    categoryId = "Music"
order by
    views desc
limit 10;
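If only a single category is of interest, the intermediate table may not be strictly necessary. A minimal sketch, assuming category in video_orc is an array of strings (which the explode(category) call above implies), filters on the array column directly with Hive's built-in array_contains function:

```sql
-- Sketch only: no expanded table, filter on the array column itself.
select
    videoId,
    views
from
    video_orc
where
    array_contains(category, 'Music')
order by
    views desc
limit 10;
```

The expanded gulivideo_category table is still the more convenient choice when the same per-category analysis is repeated, as in the next two sections.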
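6. Top 10 videos by traffic within each category, using Music as the example
Approach:
1. Create the expanded video category table (the table with category expanded into categoryId); the gulivideo_category table from section 5 serves this purpose.
2. Sort by ratings.
Final code: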
select
    videoId,
    views,
    ratings
from
    gulivideo_category
where
    categoryId = "Music"
order by
    ratings desc
limit 10;
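7. Top 10 videos by view count in every category
Approach:
1. Start from the table with categoryId expanded.
2. In a subquery, partition by categoryId, sort within each partition, and generate an increasing sequence number; name that column rank.
3. From the temporary result produced by the subquery, keep the rows whose rank is less than or equal to 10.
Final code: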
select
    t1.*
from (
    select
        videoId,
        categoryId,
        views,
        row_number() over(partition by categoryId order by views desc) rank
    from
        gulivideo_category) t1
where
    rank <= 10;
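Because row_number() hands out a distinct sequence number even when two videos have the same views, the query above returns at most 10 rows per category, and ties at the boundary are broken arbitrarily. To compare this with the other ranking functions, the sketch below puts the three side by side for one category. The aliases rn, rk and drk are made up for illustration.

```sql
select
    videoId,
    categoryId,
    views,
    -- Always 1, 2, 3, ... with no repeats.
    row_number() over(partition by categoryId order by views desc) as rn,
    -- Ties share a number, and the following numbers are skipped.
    rank() over(partition by categoryId order by views desc) as rk,
    -- Ties share a number, with no gaps afterwards.
    dense_rank() over(partition by categoryId order by views desc) as drk
from
    gulivideo_category
where
    categoryId = 'Music';
```

Filtering on rk <= 10 or drk <= 10 instead of rn <= 10 can return more than 10 rows per category when view counts tie, so the choice depends on how ties should be treated.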
That's it: everything for this hands-on project has now been posted.
Writing these posts takes effort, and your support is what keeps me going. If you liked this one, don't forget to follow me!