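Big Data Development Interview Summary, Part 3: Common SQL Question Types
(1) SQL: taking the first n rows, or the first n% of rows, from each group
1) Partition by UserID and number the rows inside each group, with rn counting up from 1: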
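-- Ordering by the partition key leaves the order inside each group arbitrary; replace it with the column that defines "first" (for example a timestamp).
SELECT * ,ROW_NUMBER() OVER(partition by b.UserID order by b.UserID ) rn from b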
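2) Work out how many rows make up 5% of each user's data:
with temp as
(
-- When 5% of the count is below 1, take one row; otherwise round to the nearest integer. SelCnt is the number of rows to take from each group (the largest row_number to keep).
SELECT b.UserID,cast(ROUND(CASE WHEN COUNT(1)*0.05<1 THEN 1 ELSE COUNT(1)*0.05 END,0) AS INT) SelCnt from b
group by b.UserID
)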
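3) Join to the temporary table and keep the rows whose row number does not exceed the per-group count (the largest row_number to keep). rn is computed in a subquery because a window-function alias cannot be referenced in the WHERE clause of the same SELECT:
SELECT r.* from
(SELECT * ,ROW_NUMBER() OVER(partition by b.UserID order by b.UserID ) rn from b) r
inner join temp on r.UserID=temp.UserID
where r.rn<=temp.SelCnt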
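The three steps above can be tried end to end with a small script. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the real engine (SQLite 3.25 or later is needed for window functions), the OrderID column and the sample rows are hypothetical, and the CTE called temp in the text is named quota here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: OrderID is an assumed column that defines the order inside a group.
conn.execute("CREATE TABLE b (UserID TEXT, OrderID INTEGER)")
# u1 has 40 rows (5% = 2 rows), u2 has 3 rows (5% < 1, so 1 row is taken).
rows = [("u1", i) for i in range(1, 41)] + [("u2", i) for i in range(1, 4)]
conn.executemany("INSERT INTO b VALUES (?, ?)", rows)

top_query = """
WITH quota AS (
    -- rows to keep per user: 5% of the group, rounded, but at least 1
    SELECT UserID,
           CAST(ROUND(CASE WHEN COUNT(1) * 0.05 < 1 THEN 1
                           ELSE COUNT(1) * 0.05 END, 0) AS INT) AS SelCnt
    FROM b
    GROUP BY UserID
),
ranked AS (
    -- number the rows inside each user's group
    SELECT b.*,
           ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY OrderID) AS rn
    FROM b
)
SELECT ranked.UserID, ranked.OrderID, ranked.rn
FROM ranked
INNER JOIN quota ON ranked.UserID = quota.UserID
WHERE ranked.rn <= quota.SelCnt
ORDER BY ranked.UserID, ranked.rn
"""
top_rows = conn.execute(top_query).fetchall()
print(top_rows)  # [('u1', 1, 1), ('u1', 2, 2), ('u2', 1, 1)]
```

Putting the row numbering in its own CTE is what lets the outer WHERE clause filter on rn.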
(2) SQL: taking a percentage band of each group (from n% to m%, for example 20% to 30%)
1) Same as above.
2) Compute the row-number boundaries that correspond to 20% and 30%:
with temp as(
SELECT b.UserID,cast(ROUND(CASE WHEN COUNT(1)*0.2<1 THEN 1 ELSE COUNT(1)*0.2 END,0) AS INT) SelCntStart, cast(ROUND(CASE WHEN COUNT(1)*0.3<1 THEN 1 ELSE COUNT(1)*0.3 END,0) AS INT) SelCntEnd
from b
group by b.UserID
)
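3) Select the rows whose row number falls inside that range. Both bounds are inclusive here; use r.rn>temp.SelCntStart to exclude the lower boundary row:
SELECT r.* from
(SELECT * ,ROW_NUMBER() OVER(partition by b.UserID order by b.UserID ) rn from b) r
inner join temp on r.UserID=temp.UserID
where r.rn>=temp.SelCntStart and r.rn<=temp.SelCntEnd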
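A small run makes the band selection concrete. This is a sketch, not a definitive implementation: it assumes Python's built-in sqlite3 module (SQLite 3.25+) in place of the real engine, a hypothetical OrderID column as the sort key, made-up rows, and an inclusive lower bound. The CTE called temp in the text is named bounds here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: OrderID is an assumed column that defines the order inside a group.
conn.execute("CREATE TABLE b (UserID TEXT, OrderID INTEGER)")
# u1 has 10 rows (20% -> row 2, 30% -> row 3); u2 has 3 rows (both bounds clamp to 1).
rows = [("u1", i) for i in range(1, 11)] + [("u2", i) for i in range(1, 4)]
conn.executemany("INSERT INTO b VALUES (?, ?)", rows)

band_query = """
WITH bounds AS (
    SELECT UserID,
           CAST(ROUND(CASE WHEN COUNT(1) * 0.2 < 1 THEN 1
                           ELSE COUNT(1) * 0.2 END, 0) AS INT) AS SelCntStart,
           CAST(ROUND(CASE WHEN COUNT(1) * 0.3 < 1 THEN 1
                           ELSE COUNT(1) * 0.3 END, 0) AS INT) AS SelCntEnd
    FROM b
    GROUP BY UserID
),
ranked AS (
    SELECT b.*,
           ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY OrderID) AS rn
    FROM b
)
SELECT ranked.UserID, ranked.OrderID
FROM ranked
INNER JOIN bounds ON ranked.UserID = bounds.UserID
-- keep rows between the two boundaries, both inclusive
WHERE ranked.rn >= bounds.SelCntStart AND ranked.rn <= bounds.SelCntEnd
ORDER BY ranked.UserID, ranked.rn
"""
band_rows = conn.execute(band_query).fetchall()
print(band_rows)  # [('u1', 2), ('u1', 3), ('u2', 1)]
```

For very small groups both boundaries clamp to 1, so exactly one row is returned for that user.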
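(3) Compute the average interval between a user's purchases, the longest interval, and the time at which the longest interval occurs
1) Introduction to the LAG and LEAD functions
LAG
LAG(col,n,DEFAULT) returns the value of col from the n-th row above the current row within the window.
Argument 1 is the column name; argument 2 is the offset n, i.e. how many rows up (optional, defaults to 1); argument 3 is the default value returned when there is no n-th row above, that is, when the offset runs past the start of the partition (if omitted, NULL is returned).
LEAD
The opposite of LAG.
LEAD(col,n,DEFAULT) returns the value of col from the n-th row below the current row within the window.
Argument 1 is the column name; argument 2 is the offset n, i.e. how many rows down (optional, defaults to 1); argument 3 is the default value returned when there is no n-th row below, that is, when the offset runs past the end of the partition (if omitted, NULL is returned).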
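The behaviour of the three arguments is easiest to see on a three-row partition. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25+), whose LAG and LEAD take the same three arguments; the table t, its columns, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with three purchases by one user.
conn.execute("CREATE TABLE t (userid TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("u1", "2024-01-01"), ("u1", "2024-01-03"), ("u1", "2024-01-07")],
)

lag_lead_rows = conn.execute("""
SELECT ts,
       LAG(ts)            OVER (PARTITION BY userid ORDER BY ts) AS prev_ts,   -- offset defaults to 1
       LAG(ts, 2, 'none') OVER (PARTITION BY userid ORDER BY ts) AS prev2_ts,  -- default used when no such row
       LEAD(ts)           OVER (PARTITION BY userid ORDER BY ts) AS next_ts    -- NULL past the end
FROM t
ORDER BY ts
""").fetchall()

for row in lag_lead_rows:
    print(row)
# ('2024-01-01', None, 'none', '2024-01-03')
# ('2024-01-03', '2024-01-01', 'none', '2024-01-07')
# ('2024-01-07', '2024-01-03', '2024-01-01', None)
```

The first row has nothing above it, so LAG yields NULL (None in Python) unless a default is supplied; the last row has nothing below it, so LEAD yields NULL.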
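2) Differences between row_number (consecutive, no repeats), rank (gaps after ties), and dense_rank (consecutive, repeats possible)
row_number is very widely used and is the best choice for plain ordering: it assigns a sequence number to every row returned, increasing one by one with no repeats, so tied rows still receive different numbers.
rank returns the rank of each row within its partition of the result set. Tied rows share a rank, and a row's rank is one plus the number of rows ranked ahead of it, so gaps follow ties (1, 1, 3).
dense_rank works like rank, but the numbers it generates are consecutive (1, 1, 2), whereas the numbers generated by rank may have gaps.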
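The difference between the three functions shows up as soon as two rows tie. A minimal sketch, assuming Python's built-in sqlite3 module (SQLite 3.25+) and a hypothetical scores table with two equal values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: two rows tie at 90.
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(90,), (90,), (80,), (70,)])

rank_rows = conn.execute("""
SELECT score, rn, rk, drk
FROM (
    SELECT score,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,   -- always unique
           RANK()       OVER (ORDER BY score DESC) AS rk,   -- ties share a rank, then a gap
           DENSE_RANK() OVER (ORDER BY score DESC) AS drk   -- ties share a rank, no gap
    FROM scores
)
ORDER BY rn
""").fetchall()

for row in rank_rows:
    print(row)
# (90, 1, 1, 1)
# (90, 2, 1, 1)
# (80, 3, 3, 2)
# (70, 4, 4, 3)
```

Which of the two tied rows gets row number 1 is arbitrary; add a tie-breaking column to the ORDER BY when that matters.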
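3) Compute the interval between a user's consecutive purchases. Use the lead function over a window that partitions by user and orders by purchase time: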
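-- t1 is shared by the queries in steps 4) to 6): each of them follows this WITH clause.
with t1 as (select userid,
time as stime, -- current purchase time
lead(time) over(partition by userid order by time) etime, -- the next purchase time after the current one
UNIX_TIMESTAMP(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss')- UNIX_TIMESTAMP(time,'yyyy-MM-dd HH:mm:ss') period -- gap in seconds; NULL for each user's last purchase
from test.user_log)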
4) Compute each user's average purchase interval (avg ignores the NULL period of the last purchase):
select userid,
avg(period) as period
from t1 group by userid;
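5) Compute each user's longest interval and the time at which it occurs:
select userid,stime,etime,period
from (select
userid,stime,etime,period,
rank() over(partition by userid order by period desc) as rn
from t1
where period is not null -- skip each user's last purchase, which has no following one
) t2
where rn = 1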
6) If only the longest interval between a user's purchases is needed, this also works:
select userid,
max(period) as period
from t1 group by userid;
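Steps 3) to 5) can be checked together on a few rows. The sketch below is written under explicit assumptions: it runs on Python's built-in sqlite3 module (SQLite 3.25+) rather than Hive, so strftime('%s', ...) stands in for UNIX_TIMESTAMP, the table is a local user_log instead of test.user_log, the sample purchases are made up, and the average is computed with a window function so that one query returns all three results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical purchase log; timestamps are 'yyyy-MM-dd HH:mm:ss' strings as in the text.
conn.execute("CREATE TABLE user_log (userid TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO user_log VALUES (?, ?)",
    [
        ("u1", "2024-01-01 00:00:00"),
        ("u1", "2024-01-02 00:00:00"),  # 1 day after the previous purchase
        ("u1", "2024-01-05 00:00:00"),  # 3 days after the previous purchase
        ("u2", "2024-01-01 10:00:00"),
        ("u2", "2024-01-01 12:00:00"),  # 2 hours after the previous purchase
    ],
)

interval_query = """
WITH t1 AS (
    SELECT userid,
           time AS stime,
           LEAD(time) OVER (PARTITION BY userid ORDER BY time) AS etime
    FROM user_log
),
t2 AS (
    -- gap in seconds; the last purchase of each user has no etime and is dropped
    SELECT userid, stime, etime,
           CAST(strftime('%s', etime) AS INTEGER)
             - CAST(strftime('%s', stime) AS INTEGER) AS period
    FROM t1
    WHERE etime IS NOT NULL
),
t3 AS (
    SELECT t2.*,
           AVG(period) OVER (PARTITION BY userid) AS avg_period,
           RANK() OVER (PARTITION BY userid ORDER BY period DESC) AS rn
    FROM t2
)
SELECT userid, avg_period, period AS max_period, stime, etime
FROM t3
WHERE rn = 1
ORDER BY userid
"""
interval_rows = conn.execute(interval_query).fetchall()
for row in interval_rows:
    print(row)
# ('u1', 172800.0, 259200, '2024-01-02 00:00:00', '2024-01-05 00:00:00')
# ('u2', 7200.0, 7200, '2024-01-01 10:00:00', '2024-01-01 12:00:00')
```

Because rank is used, a user whose longest interval occurs more than once gets one output row per occurrence; switch to row_number if exactly one row per user is required.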