SQL: Queries for Computing Retention Rate and Retained-User Counts in MySQL and SQL Server

Authors:
First author: 趙芮萱, responsible for MySQL (hereafter "Teacher Zhao")
Second author: 葉嘉浩, responsible for SQL Server (hereafter "Student Ye")

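Stack Overflow (outside the firewall) and the CSDN blogs (inside it) already carry many articles on computing retention rate, but most of them share three problems:
1) The queries follow a single line of thought, with no summary of alternative implementations.
2) They cover only MySQL or only SQL Server, with no summary of queries for both dialects and no portable query.
3) The queries meet the business requirement but not the performance requirement, and give no thought to optimizing the code.
The rest of this article presents queries for retention rate and retained-user counts with these three problems in mind.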

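About the data table:
The table is named 瀏覽明細表 and has two columns.
The first column is uid varchar(20); the second is logtime datetime, the login date and time.
The full data set is listed at the end of the article.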

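I. Computing retained-user counts
**1.1) First definition of retained users:** a new user logs in on day 1 and logs in again on day N. For example, 3-day retained users are those who first logged in on day 1 and logged in again on day 3; whether they logged in on day 2 does not matter.
1.1.1) First approach: compare datediff directly against the day offset (N-1, so 1 for next-day, 2 for 3-day, 6 for 7-day).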

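-- SQL Server syntax. CREATE VIEW must be alone in its batch, hence the go separators.
-- a: first login time of each user
CREATE VIEW a as 
	(select uid,min(logtime) as first_logtime from 瀏覽明細表 group by uid)
go
-- U: every distinct login date in the table
create view U as
	(select distinct(logtime) from 瀏覽明細表)
go
select U.logtime, 
	count(distinct a.uid)as 新增數,
	count(distinct b.uid)as 次留數,
	count(distinct c.uid)as 三日留數,
	count(distinct d.uid) as 七日留數
from  U 
	left join a on U.logtime=a.first_logtime 
	-- view a exposes first_logtime (it has no logtime column); each join tests its own alias
	left join 瀏覽明細表 b on datediff(dd,a.first_logtime,b.logtime)=1 and b.uid=a.uid 
	left join 瀏覽明細表 c on datediff(dd,a.first_logtime,c.logtime)=2 and c.uid=a.uid 
	left join 瀏覽明細表 d on datediff(dd,a.first_logtime,d.logtime)=6 and d.uid=a.uid 
group by U.logtime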
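
The query in this section uses the three-argument datediff of SQL Server. As a rough MySQL counterpart, the sketch below assumes the same table and column names and relies on MySQL's two-argument DATEDIFF(later, earlier), which returns whole days. CREATE OR REPLACE VIEW is used so the block can be re-run; treat it as an illustration under those assumptions, not as the authors' own MySQL version.

```sql
-- MySQL sketch (assumption: table 瀏覽明細表 with columns uid and logtime, as in the article)
CREATE OR REPLACE VIEW a AS
    SELECT uid, MIN(logtime) AS first_logtime
    FROM `瀏覽明細表`
    GROUP BY uid;

CREATE OR REPLACE VIEW U AS
    SELECT DISTINCT logtime
    FROM `瀏覽明細表`;

SELECT U.logtime,
       COUNT(DISTINCT a.uid) AS `新增數`,
       COUNT(DISTINCT b.uid) AS `次留數`,
       COUNT(DISTINCT c.uid) AS `三日留數`,
       COUNT(DISTINCT d.uid) AS `七日留數`
FROM U
    LEFT JOIN a ON U.logtime = a.first_logtime
    -- DATEDIFF(later, earlier): two arguments in MySQL, result in days
    LEFT JOIN `瀏覽明細表` b ON DATEDIFF(b.logtime, a.first_logtime) = 1 AND b.uid = a.uid
    LEFT JOIN `瀏覽明細表` c ON DATEDIFF(c.logtime, a.first_logtime) = 2 AND c.uid = a.uid
    LEFT JOIN `瀏覽明細表` d ON DATEDIFF(d.logtime, a.first_logtime) = 6 AND d.uid = a.uid
GROUP BY U.logtime
ORDER BY U.logtime;
```

As a hand check against the sample data at the end of the article, the 2020-03-01 row should show 16 new users, 9 next-day, 2 three-day and 3 seven-day retained users.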

1.1.2) Second approach: add the day offset to the first-login datetime (datetime + N-1).

-- Skip the two CREATE VIEW batches if views a and U already exist.
CREATE VIEW a as 
	(select uid,min(logtime) as first_logtime from 瀏覽明細表 group by uid)
go
create view U as
	(select distinct(logtime) from 瀏覽明細表)
go
select U.logtime, 
	count(distinct a.uid)as 新增數,
	count(distinct b.uid)as 次留數,
	count(distinct c.uid)as 三日留數,
	count(distinct d.uid) as 七日留數
from  U 
	left join a on U.logtime=a.first_logtime 
	-- In SQL Server, datetime + 1 adds one day; this works for datetime but not for date or datetime2.
	-- The equality test assumes logtime carries no time-of-day part, as in the sample data.
	left join 瀏覽明細表 b on b.logtime=a.first_logtime+1 and b.uid=a.uid 
	left join 瀏覽明細表 c on c.logtime=a.first_logtime+2 and c.uid=a.uid 
	left join 瀏覽明細表 d on d.logtime=a.first_logtime+6 and d.uid=a.uid 
group by U.logtime
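
In MySQL, writing first_logtime+1 performs numeric addition on the datetime value and does not move it forward by a day, so the "+N" idea needs interval arithmetic there. A minimal MySQL sketch, assuming views a and U from this section already exist and that logtime holds midnight values as in the sample data:

```sql
-- MySQL sketch: same join logic as above, with DATE_ADD instead of "datetime + N"
SELECT U.logtime,
       COUNT(DISTINCT a.uid) AS `新增數`,
       COUNT(DISTINCT b.uid) AS `次留數`,
       COUNT(DISTINCT c.uid) AS `三日留數`,
       COUNT(DISTINCT d.uid) AS `七日留數`
FROM U
    LEFT JOIN a ON U.logtime = a.first_logtime
    -- DATE_ADD(dt, INTERVAL n DAY) shifts the datetime by n calendar days
    LEFT JOIN `瀏覽明細表` b
           ON b.logtime = DATE_ADD(a.first_logtime, INTERVAL 1 DAY) AND b.uid = a.uid
    LEFT JOIN `瀏覽明細表` c
           ON c.logtime = DATE_ADD(a.first_logtime, INTERVAL 2 DAY) AND c.uid = a.uid
    LEFT JOIN `瀏覽明細表` d
           ON d.logtime = DATE_ADD(a.first_logtime, INTERVAL 6 DAY) AND d.uid = a.uid
GROUP BY U.logtime
ORDER BY U.logtime;
```

If logtime carried a time-of-day component, the equality would rarely match; comparing DATE(b.logtime) with DATE(a.first_logtime) shifted by the interval would be one way around that.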

Both approaches above should return the following result:
[Image: query result]
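**1.2) Second definition of retained users:** a new user logs in on day 1 and logs in again on any day from day 2 through day N. For example, 3-day retained users are those who first logged in on day 1 and logged in again on day 2 or day 3. This query differs from the two above in that the 3-day and 7-day left joins each combine two datediff conditions to form a range.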

-- Skip the two CREATE VIEW batches if views a and U already exist.
CREATE VIEW a as 
	(select uid,min(logtime) as first_logtime from 瀏覽明細表 group by uid)
go
create view U as
	(select distinct(logtime) from 瀏覽明細表)
go
select U.logtime, 
	count(distinct a.uid)as 新增數,
	count(distinct b.uid)as 次留數,
	count(distinct c.uid)as 三日留數,
	count(distinct d.uid) as 七日留數
from  U 
	left join a on U.logtime=a.first_logtime 
	-- next-day: exactly 1 day after first login; 3-day and 7-day: any login within the range
	left join 瀏覽明細表 b on (datediff(dd,a.first_logtime,b.logtime)=1 and b.uid=a.uid) 
	left join 瀏覽明細表 c on (datediff(dd,a.first_logtime,c.logtime)>=1 and datediff(dd,a.first_logtime,c.logtime)<=2 ) and c.uid=a.uid 
	left join 瀏覽明細表 d on (datediff(dd,a.first_logtime,d.logtime)>=1 and datediff(dd,a.first_logtime,d.logtime)<=6 ) and d.uid=a.uid 
group by U.logtime
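
For MySQL, the range form of the second definition can be written with the two-argument DATEDIFF and BETWEEN. This is a sketch that assumes views a and U from this section exist and that the table and column names match the article:

```sql
-- MySQL sketch: retained = logged in on any day from day 2 through day N
SELECT U.logtime,
       COUNT(DISTINCT a.uid) AS `新增數`,
       COUNT(DISTINCT b.uid) AS `次留數`,
       COUNT(DISTINCT c.uid) AS `三日留數`,
       COUNT(DISTINCT d.uid) AS `七日留數`
FROM U
    LEFT JOIN a ON U.logtime = a.first_logtime
    LEFT JOIN `瀏覽明細表` b
           ON DATEDIFF(b.logtime, a.first_logtime) = 1 AND b.uid = a.uid
    -- BETWEEN is inclusive on both ends: offsets 1..2 cover day 2 and day 3
    LEFT JOIN `瀏覽明細表` c
           ON (DATEDIFF(c.logtime, a.first_logtime) BETWEEN 1 AND 2) AND c.uid = a.uid
    -- offsets 1..6 cover day 2 through day 7
    LEFT JOIN `瀏覽明細表` d
           ON (DATEDIFF(d.logtime, a.first_logtime) BETWEEN 1 AND 6) AND d.uid = a.uid
GROUP BY U.logtime
ORDER BY U.logtime;
```

Worked by hand on the sample data, the 2020-03-01 cohort of 16 new users gives 9 next-day, 9 three-day and 16 seven-day retained users under this definition, since every one of the 16 logs in again at some point between 2020-03-02 and 2020-03-07.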

The approach above should return the following result:
[Image: query result]
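II. Computing the retention rate
Computing the retention rate only requires wrapping one of the three queries above in an outer select. The example below uses one of the three; to compute the rate from another, substitute it as the subquery.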

select *,
-- 100.0 forces decimal division (100*int/int would truncate to a whole number);
-- nullif avoids a divide-by-zero error on days with no new users (concat then yields just '%');
-- the cast trims the result to two decimal places for display
concat(cast(round(100.0*次留數/nullif(新增數,0),2) as decimal(5,2)),'%') as 次日留存率,
concat(cast(round(100.0*三日留數/nullif(新增數,0),2) as decimal(5,2)),'%') as 三日留存率,
concat(cast(round(100.0*七日留數/nullif(新增數,0),2) as decimal(5,2)),'%') as 七日留存率
from
(select U.logtime, 
	count(distinct a.uid)as 新增數,
	count(distinct b.uid)as 次留數,
	count(distinct c.uid)as 三日留數,
	count(distinct d.uid) as 七日留數
from  U 
	left join a on U.logtime=a.first_logtime 
	left join 瀏覽明細表 b on (datediff(dd,a.first_logtime,b.logtime)=1 and b.uid=a.uid) 
	left join 瀏覽明細表 c on (datediff(dd,a.first_logtime,c.logtime)>=1 and datediff(dd,a.first_logtime,c.logtime)<=2 ) and c.uid=a.uid 
	left join 瀏覽明細表 d on (datediff(dd,a.first_logtime,d.logtime)>=1 and datediff(dd,a.first_logtime,d.logtime)<=6 ) and d.uid=a.uid 
group by U.logtime)as P
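
The same wrapping works in MySQL. The sketch below assumes views a and U exist and uses the second retention definition, as the example above does. In MySQL the / operator already returns a decimal, so no 100.0 is needed; NULLIF is an added guard so that days with no new users yield NULL explicitly.

```sql
-- MySQL sketch: retention rate as a percentage string per first-login date
SELECT P.*,
       CONCAT(ROUND(100 * P.`次留數`   / NULLIF(P.`新增數`, 0), 2), '%') AS `次日留存率`,
       CONCAT(ROUND(100 * P.`三日留數` / NULLIF(P.`新增數`, 0), 2), '%') AS `三日留存率`,
       CONCAT(ROUND(100 * P.`七日留數` / NULLIF(P.`新增數`, 0), 2), '%') AS `七日留存率`
FROM (
    SELECT U.logtime,
           COUNT(DISTINCT a.uid) AS `新增數`,
           COUNT(DISTINCT b.uid) AS `次留數`,
           COUNT(DISTINCT c.uid) AS `三日留數`,
           COUNT(DISTINCT d.uid) AS `七日留數`
    FROM U
        LEFT JOIN a ON U.logtime = a.first_logtime
        LEFT JOIN `瀏覽明細表` b
               ON DATEDIFF(b.logtime, a.first_logtime) = 1 AND b.uid = a.uid
        LEFT JOIN `瀏覽明細表` c
               ON (DATEDIFF(c.logtime, a.first_logtime) BETWEEN 1 AND 2) AND c.uid = a.uid
        LEFT JOIN `瀏覽明細表` d
               ON (DATEDIFF(d.logtime, a.first_logtime) BETWEEN 1 AND 6) AND d.uid = a.uid
    GROUP BY U.logtime
) AS P
ORDER BY P.logtime;
```

For the 2020-03-01 cohort in the sample data, 9 of the 16 new users return the next day, so the next-day retention rate is 100 × 9 / 16 = 56.25%.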

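III. Optimization
The three queries in Part I differ slightly, but all of them rely on chained left joins across multiple tables. With large data volumes and many tables this approach runs slowly, so the query can be rewritten as follows to improve efficiency.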

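-- byday = days between a user's first login and each of that user's logins.
-- One pass over the table plus conditional sums replaces the three self-joins
-- (this implements definition 1.1: a login on exactly day N).
select U.logtime,
	sum(case when byday=0 then 1 else 0 end) as new,
	sum(case when byday=1 then 1 else 0 end) as _2days,
	sum(case when byday=2 then 1 else 0 end) as _3days,
	sum(case when byday=6 then 1 else 0 end) as _7days
from U 
-- distinct collapses repeated rows for the same user and logtime
left join (select distinct a.uid, logtime, first_logtime,datediff(dd,first_logtime,logtime) as byday
		   from 瀏覽明細表 x left join a on x.uid=a.uid) as sub on
		   U.logtime=sub.first_logtime
		   group by U.logtime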
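
A MySQL rendering of the same conditional-sum idea might look like the sketch below. It assumes views a and U exist. The output aliases (new_users, day2, day3, day7) are illustrative names, and DISTINCT is applied to (uid, first_logtime, byday) without logtime, a small deviation from the query above so that several logins by one user on the same day still count once even when logtime has a time-of-day part.

```sql
-- MySQL sketch: one scan of the table, then conditional sums per cohort date
SELECT U.logtime,
       SUM(CASE WHEN sub.byday = 0 THEN 1 ELSE 0 END) AS new_users,
       SUM(CASE WHEN sub.byday = 1 THEN 1 ELSE 0 END) AS day2,
       SUM(CASE WHEN sub.byday = 2 THEN 1 ELSE 0 END) AS day3,
       SUM(CASE WHEN sub.byday = 6 THEN 1 ELSE 0 END) AS day7
FROM U
    LEFT JOIN (
        -- one row per user and day offset from that user's first login
        SELECT DISTINCT a.uid,
               a.first_logtime,
               DATEDIFF(x.logtime, a.first_logtime) AS byday
        FROM `瀏覽明細表` x
            JOIN a ON x.uid = a.uid
    ) AS sub ON U.logtime = sub.first_logtime
GROUP BY U.logtime
ORDER BY U.logtime;
```

On the sample data the 2020-03-01 row should come out as 16, 9, 2 and 3, matching the first retention definition.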

Timing the original query:

-- Record the start time, run the query, then report elapsed milliseconds
Declare @d Datetime Set @d=GetDate()
select U.logtime, 
	count(distinct a.uid)as 新增數,
	count(distinct b.uid)as 次留數,
	count(distinct c.uid)as 三日留數,
	count(distinct d.uid) as 七日留數
from  U 
	left join a on U.logtime=a.first_logtime 
	-- view a exposes first_logtime (it has no logtime column); each join tests its own alias
	left join 瀏覽明細表 b on datediff(dd,a.first_logtime,b.logtime)=1 and b.uid=a.uid 
	left join 瀏覽明細表 c on datediff(dd,a.first_logtime,c.logtime)=2 and c.uid=a.uid 
	left join 瀏覽明細表 d on datediff(dd,a.first_logtime,d.logtime)=6 and d.uid=a.uid 
group by U.logtime
Select [語句執行花費時間(毫秒)]=DateDiff(ms,@d,GetDate())

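[Image: timing result]
*Note: the screenshot shows 3 ms, but repeated runs of the query can return anything from 0 to 14 ms. 3 ms is the value that occurs most often, so we take this mode as the running time of the query. Only one of the three queries from Part I is timed here, but all three came out at 3 ms.
Timing the optimized query: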

Declare @d Datetime Set @d=GetDate()
select U.logtime,
	sum(case when byday=0 then 1 else 0 end) as new,
	sum(case when byday=1 then 1 else 0 end) as _2days,
	sum(case when byday=2 then 1 else 0 end) as _3days,
	sum(case when byday=6 then 1 else 0 end) as _7days
from U 
left join (select distinct a.uid, logtime, first_logtime,datediff(dd,first_logtime,logtime) as byday
		   from 瀏覽明細表 x left join a on x.uid=a.uid) as sub on
		   U.logtime=sub.first_logtime
		   group by U.logtime
Select [語句執行花費時間(毫秒)]=DateDiff(ms,@d,GetDate())

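[Image: timing result]
*Note: the screenshot shows 0 ms, but repeated runs of the query can return anything from 0 to 4 ms. 0 ms is the value that occurs most often, so we take this mode as the running time of the query.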
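
Because GetDate() returns a datetime, which SQL Server rounds to steps of roughly 3.33 ms, single-run readings of 0 ms or 3 ms are close to the timer's resolution. One possible way to get a finer reading is to use SYSDATETIME() and average over many runs. The sketch below is for SQL Server and assumes views a and U exist; the variable names (@runs, @i, @t0, @sink) and the output aliases are illustrative, not from the article.

```sql
-- SQL Server sketch: average elapsed microseconds over repeated runs
DECLARE @runs int = 100, @i int = 0, @sink int;
DECLARE @t0 datetime2(7) = SYSDATETIME();

WHILE @i < @runs
BEGIN
    -- Summing the four counts into @sink forces every column to be computed
    -- without sending a result set to the client on each iteration
    SELECT @sink = SUM(q.new_users + q.day2 + q.day3 + q.day7)
    FROM (
        SELECT U.logtime,
               SUM(CASE WHEN byday = 0 THEN 1 ELSE 0 END) AS new_users,
               SUM(CASE WHEN byday = 1 THEN 1 ELSE 0 END) AS day2,
               SUM(CASE WHEN byday = 2 THEN 1 ELSE 0 END) AS day3,
               SUM(CASE WHEN byday = 6 THEN 1 ELSE 0 END) AS day7
        FROM U
            LEFT JOIN (
                SELECT DISTINCT a.uid, logtime, first_logtime,
                       DATEDIFF(dd, first_logtime, logtime) AS byday
                FROM 瀏覽明細表 x
                    LEFT JOIN a ON x.uid = a.uid
            ) AS sub ON U.logtime = sub.first_logtime
        GROUP BY U.logtime
    ) AS q;
    SET @i += 1;
END;

SELECT [avg_microseconds_per_run] = DATEDIFF(microsecond, @t0, SYSDATETIME()) / @runs;
```

Swapping the inner query for the left-join version gives a comparable figure for the original approach. Repeated runs mostly hit cached data, so the averages describe warm-cache behaviour; SET STATISTICS TIME ON is another built-in option that reports CPU and elapsed time per statement.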

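Conclusion: on this sample, the optimized query improves running efficiency by roughly a factor of three (a mode of 0 ms against 3 ms). Since GetDate() returns a datetime, which SQL Server rounds to increments of about 3.33 ms, a difference this small sits at the resolution limit of the measurement and is best read as a rough indication.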

IV. Main table data:

Create table 瀏覽明細表
(logtime datetime,
uid varchar(20))
insert into 瀏覽明細表(logtime,uid) values
('2020/3/1','a1001'),
('2020/3/1','a1002'),
('2020/3/1','a1003'),
('2020/3/1','a1004'),
('2020/3/1','a1005'),
('2020/3/1','a1006'),
('2020/3/1','a1007'),
('2020/3/1','a1008'),
('2020/3/1','a1009'),
('2020/3/1','a1010'),
('2020/3/1','a1011'),
('2020/3/1','a1012'),
('2020/3/1','a1013'),
('2020/3/1','a1014'),
('2020/3/1','a1015'),
('2020/3/1','a1016'),
('2020/3/2','a1001'),
('2020/3/2','a1002'),
('2020/3/2','a1003'),
('2020/3/2','a1003'),
('2020/3/2','a1004'),
('2020/3/2','a1005'),
('2020/3/2','a1006'),
('2020/3/2','a1007'),
('2020/3/2','a1008'),
('2020/3/2','a1009'),
('2020/3/2','a1020'),
('2020/3/2','a1019'),
('2020/3/2','a1018'),
('2020/3/2','a1017'),
('2020/3/3','a1021'),
('2020/3/3','a1022'),
('2020/3/3','a1023'),
('2020/3/3','a1024'),
('2020/3/3','a1025'),
('2020/3/3','a1026'),
('2020/3/3','a1001'),
('2020/3/3','a1002'),
('2020/3/4','a1001'),
('2020/3/4','a1002'),
('2020/3/4','a1003'),
('2020/3/4','a1004'),
('2020/3/4','a1005'),
('2020/3/4','a1006'),
('2020/3/4','a1007'),
('2020/3/4','a1008'),
('2020/3/4','a1009'),
('2020/3/4','a1009'),
('2020/3/5','a1010'),
('2020/3/5','a1011'),
('2020/3/5','a1012'),
('2020/3/5','a1013'),
('2020/3/5','a1014'),
('2020/3/5','a1015'),
('2020/3/5','a1016'),
('2020/3/5','a1017'),
('2020/3/6','a1018'),
('2020/3/6','a1019'),
('2020/3/6','a1020'),
('2020/3/6','a1021'),
('2020/3/6','a1022'),
('2020/3/6','a1023'),
('2020/3/6','a1024'),
('2020/3/6','a1025'),
('2020/3/6','a1026'),
('2020/3/6','a1027'),
('2020/3/6','a1028'),
('2020/3/6','a1029'),
('2020/3/6','a1030'),
('2020/3/6','a1031'),
('2020/3/6','a1032'),
('2020/3/6','a1033'),
('2020/3/7','a1001'),
('2020/3/7','a1002'),
('2020/3/7','a1003'),
('2020/3/7','a1023'),
('2020/3/7','a1024'),
('2020/3/7','a1018'),
('2020/3/7','a1019'),
('2020/3/7','a1020'),
('2020/3/8','a1001'),
('2020/3/8','a1002'),
('2020/3/8','a1003'),
('2020/3/8','a1012'),
('2020/3/8','a1013'),
('2020/3/8','a1014'),
('2020/3/8','a1015'),
('2020/3/8','a1030')

Special thanks to Teacher Zhao and 吳磊 for contributing the optimized query in Part III.
6:22 a.m., May 23, 2020, Beijing
