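I. Problem Statement
1. Description
This problem comes from a real business requirement. A table in a live-streaming service logs users entering and leaving live rooms, in the following format: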
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:10:01 |
| 1 | 1 | 2018-01-01 01:01:02 | 2018-01-01 01:11:01 |
| 1 | 1 | 2018-01-01 01:01:05 | 2018-01-01 01:10:01 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-03 01:11:01 |
| 2 | 4 | 2018-01-01 01:03:02 | 2018-01-01 01:12:05 |
| 2 | 5 | 2018-01-01 01:11:02 | 2018-01-01 01:12:05 |
| 2 | 6 | 2018-01-01 01:15:02 | 2018-01-01 01:16:05 |
| 2 | 7 | 2018-01-01 01:01:03 | 2018-01-01 01:11:05 |
| 2 | 8 | 2018-01-01 23:01:03 | 2018-01-02 01:11:01 |
| 3 | 1 | 2018-01-05 01:01:01 | 2018-01-10 01:01:01 |
| 3 | 2 | 2018-01-05 01:01:01 | 2018-01-06 01:01:01 |
| 3 | 3 | 2018-01-06 01:01:01 | 2018-01-06 02:01:01 |
...
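The four columns are the room ID, the user ID, the entry time, and the exit time. The task is to compute, for each day, the peak number of users and the total active duration of every active room. A room is active on a given day if, at one-second granularity, two or more users are in the room at the same moment on that day. The peak number of users is the largest number of users simultaneously present in an active room during the day. The total active duration is the sum of all active periods within the day.
2. Analysis
This is a typical problem of aggregating over overlapping time intervals. The requirement breaks down into the following sub-problems:
- Merge the overlapping intervals of the same user within a room.
- Split intervals whose start and end fall on different days.
- Find the active periods.
- For each day, compute the number of distinct users in each active period of each room and the length of that period.
- Take the maximum user count over the active periods and sum the active durations.
(1) Overlapping intervals of the same user within a room
In theory, the intervals in which one user is in a room never overlap. But the table is populated by reports from a mobile client, and anyone who has built mobile applications knows that statistics cannot rely directly on client-reported data, because many things can make those reports inaccurate. In this case, for any given room, the time users spend in it contains overlaps, and an overlap can be either between records of the same user or between different users. In the first case the user must not be counted twice when deciding whether the room is active, so those overlapping intervals have to be merged. For example, on 2018-01-01 user 1 has four log records in room 1: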
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:10:01 |
| 1 | 1 | 2018-01-01 01:01:05 | 2018-01-01 01:10:01 |
| 1 | 1 | 2018-01-01 01:01:02 | 2018-01-01 01:11:01 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+--------+---------------------+---------------------+
To determine whether room 1 has any active period between '2018-01-01 01:01:01' and '2018-01-01 01:11:05', the four records must be merged into the following two:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:11:01 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+--------+---------------------+---------------------+
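The merge rule above can be sketched outside the database. The following Python snippet is only an illustration of the logic, not the article's SQL: the function name merge_user_intervals is hypothetical, and timestamps are assumed to be fixed-width 'YYYY-MM-DD HH:MM:SS' strings, which compare correctly as plain text.

```python
from itertools import groupby

def merge_user_intervals(rows):
    """Merge overlapping (roomid, userid, s, e) rows per room and user."""
    merged = []
    ordered = sorted(rows)  # orders by roomid, userid, s, e
    for (roomid, userid), group in groupby(ordered, key=lambda r: (r[0], r[1])):
        cur_s = cur_e = None
        for _, _, s, e in group:
            if cur_e is not None and s <= cur_e:
                cur_e = max(cur_e, e)  # overlaps the open interval: extend it
            else:
                if cur_s is not None:
                    merged.append((roomid, userid, cur_s, cur_e))
                cur_s, cur_e = s, e    # start a new interval
        merged.append((roomid, userid, cur_s, cur_e))
    return merged

# The four records of user 1 in room 1 from the example above.
rows = [
    (1, 1, "2018-01-01 01:01:01", "2018-01-01 01:10:01"),
    (1, 1, "2018-01-01 01:01:05", "2018-01-01 01:10:01"),
    (1, 1, "2018-01-01 01:01:02", "2018-01-01 01:11:01"),
    (1, 1, "2018-01-01 01:11:02", "2018-01-01 01:11:05"),
]
result = merge_user_intervals(rows)
for row in result:
    print(row)
```

Sorting first is what lets a single pass suffice: once rows are ordered by start time, a row either extends the interval currently being built or starts a new one.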
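(2) Intervals that span days
Because the statistics are per day, an interval whose entry and exit times fall on different days must be split. For example, user 1's stay in room 2 spans three days: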
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-03 01:11:01 |
+--------+--------+---------------------+---------------------+
To produce statistics for the three days '2018-01-01', '2018-01-02', and '2018-01-03', this record must be split into the following three records:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-01 23:59:59 |
| 2 | 1 | 2018-01-02 00:00:00 | 2018-01-02 23:59:59 |
| 2 | 1 | 2018-01-03 00:00:00 | 2018-01-03 01:11:01 |
+--------+--------+---------------------+---------------------+
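The end time of one piece and the start time of the next differ by one second; they must not be identical. This turns out to be essential for the algorithm, described later, that computes the number of distinct users and the duration of each active period.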
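As a rough illustration of the splitting rule, here is a small Python sketch. It is not the article's SQL, and the function name split_by_day is hypothetical. Each piece ends at 23:59:59 and the next one starts one second later, at 00:00:00.

```python
from datetime import datetime, timedelta

def split_by_day(s, e):
    """Split the interval [s, e] into pieces that each stay within one day."""
    pieces = []
    cur = s
    while cur.date() < e.date():
        day_end = datetime(cur.year, cur.month, cur.day, 23, 59, 59)
        pieces.append((cur, day_end))
        cur = day_end + timedelta(seconds=1)  # next piece starts at 00:00:00
    pieces.append((cur, e))
    return pieces

# User 1 in room 2, from the example above.
pieces = split_by_day(datetime(2018, 1, 1, 1, 1, 3), datetime(2018, 1, 3, 1, 11, 1))
for start, end in pieces:
    print(start, end)
```

An interval that does not cross midnight comes back unchanged as a single piece.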
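(3) Finding the active periods
Once the two preprocessing steps are done, the active periods can be computed. This is the hard part: the key question is how to obtain the active periods efficiently. We tried several approaches; two of them are presented later, and their performance differs enormously.
The following statements create a test table and populate it with data, which is used to demonstrate the results of the SQL in this article.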
create table test1 (roomid int, userid int, s datetime, e datetime);
insert into test1 values
(1, 1, '2018-01-01 01:01:01', '2018-01-01 01:10:01'),
(1, 2, '2018-01-01 01:01:02', '2018-01-01 01:01:05'),
(1, 3, '2018-01-01 01:01:05', '2018-01-01 01:02:05'),
(2, 4, '2018-01-01 01:03:02', '2018-01-01 01:12:05'),
(2, 5, '2018-01-01 01:11:02', '2018-01-01 01:12:05'),
(2, 6, '2018-01-01 01:15:02', '2018-01-01 01:16:05'),
(2, 7, '2018-01-01 01:01:03', '2018-01-01 01:11:05'),
(1, 1, '2018-01-01 01:01:05', '2018-01-01 01:10:01'),
(1, 1, '2018-01-01 01:01:02', '2018-01-01 01:11:01'),
(1, 1, '2018-01-01 01:11:02', '2018-01-01 01:11:05'),
(2, 1, '2018-01-01 01:01:03', '2018-01-03 01:11:01'),
(2, 8, '2018-01-01 23:01:03', '2018-01-02 01:11:01'),
(3, 1, '2018-01-05 01:01:01', '2018-01-10 01:01:01'),
(3, 2, '2018-01-05 01:01:01', '2018-01-06 01:01:01'),
(3, 3, '2018-01-06 01:01:01', '2018-01-06 02:01:01');
commit;
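To check how the different approaches perform on a real data set, we collected three days of business data, 2,505,495 rows, and stored it in the u_room_log table. u_room_log has the same structure as test1, and neither table has any index.
II. Optimizing the Overlap Query
As noted above, the first problem to solve is merging a user's overlapping intervals within a room. Two implementations are discussed below: a self-join and a cursor.
1. Self-join
For overlap problems, the SQL solution that most readily comes to mind is a self-join. First find the start time of each merged group, using DISTINCT to remove duplicates; then find the end time of each group in the same way; finally join the two result sets and use MIN to pick the end time that belongs to each start time. The complete SQL is shown below: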
select distinct roomid, userid,
if(date(s)!=date(e) and id>1,date(s+interval id-1 day),s) s,
if(date(s+interval id-1 day)=date(e),e,date_format(s+interval id-1 day,'%Y-%m-%d 23:59:59')) e
from (select distinct s.roomid, s.userid, s.s,
(select min(e) -- end time of each merged interval
from (select distinct roomid, userid, e
from test1 a
where not exists (select * from test1 b
where a.roomid = b.roomid
and a.userid = b.userid
and a.e >= b.s
and a.e < b.e)) s2
where s2.e > s.s
and s.roomid = s2.roomid
and s.userid = s2.userid) e
from (select distinct roomid, userid, s -- start times for each user in each room
from test1 a
where not exists (select * from test1 b
where a.roomid = b.roomid
and a.userid = b.userid
and a.s > b.s
and a.s <= b.e)) s,
(select distinct roomid, userid, e -- end times for each user in each room
from test1 a
where not exists (select * from test1 b
where a.roomid = b.roomid
and a.userid = b.userid
and a.e >= b.s
and a.e < b.e)) e
where s.roomid = e.roomid
and s.userid = e.userid) t1,
(select id from nums where id<=100) nums
where nums.id<=datediff(e,s)+1;
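The outermost query handles intervals that span days. Joining the auxiliary numbers table expands a single row into several rows; id<=100 means that one interval can span at most 100 days. For a live-streaming service with daily statistics, that is more than enough. To keep the query fast, this limit should be the smallest value that meets the requirement. The result of the query is: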
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:11:01 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 2 | 4 | 2018-01-01 01:03:02 | 2018-01-01 01:12:05 |
| 2 | 5 | 2018-01-01 01:11:02 | 2018-01-01 01:12:05 |
| 2 | 6 | 2018-01-01 01:15:02 | 2018-01-01 01:16:05 |
| 2 | 7 | 2018-01-01 01:01:03 | 2018-01-01 01:11:05 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-01 23:59:59 |
| 2 | 8 | 2018-01-01 23:01:03 | 2018-01-01 23:59:59 |
| 3 | 1 | 2018-01-05 01:01:01 | 2018-01-05 23:59:59 |
| 3 | 2 | 2018-01-05 01:01:01 | 2018-01-05 23:59:59 |
| 3 | 3 | 2018-01-06 01:01:01 | 2018-01-06 02:01:01 |
| 2 | 1 | 2018-01-02 00:00:00 | 2018-01-02 23:59:59 |
| 2 | 8 | 2018-01-02 00:00:00 | 2018-01-02 01:11:01 |
| 3 | 1 | 2018-01-06 00:00:00 | 2018-01-06 23:59:59 |
| 3 | 2 | 2018-01-06 00:00:00 | 2018-01-06 01:01:01 |
| 2 | 1 | 2018-01-03 00:00:00 | 2018-01-03 01:11:01 |
| 3 | 1 | 2018-01-07 00:00:00 | 2018-01-07 23:59:59 |
| 3 | 1 | 2018-01-08 00:00:00 | 2018-01-08 23:59:59 |
| 3 | 1 | 2018-01-09 00:00:00 | 2018-01-09 23:59:59 |
| 3 | 1 | 2018-01-10 00:00:00 | 2018-01-10 01:01:01 |
+--------+--------+---------------------+---------------------+
22 rows in set (0.01 sec)
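The 15 rows of the original table become 22 rows after merging overlaps and splitting cross-day intervals. The self-join is fairly easy to follow and performs acceptably on a small data set, but on a large table its performance problems become obvious. After replacing test1 with u_room_log in the query, we gave up waiting for a result. The execution plan shows directly why it is slow: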
+----+--------------------+------------+------------+-------+---------------+-------------+---------+-------------------+----------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+------------+------------+-------+---------------+-------------+---------+-------------------+----------+----------+----------------------------------------------------+
| 1 | PRIMARY | nums | NULL | range | PRIMARY | PRIMARY | 8 | NULL | 100 | 100.00 | Using where; Using index; Using temporary |
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 24980980 | 100.00 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | <derived6> | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 100.00 | Using where; Using temporary |
| 2 | DERIVED | <derived8> | NULL | ref | <auto_key0> | <auto_key0> | 14 | s.roomid,s.userid | 10 | 100.00 | Distinct |
| 8 | DERIVED | a | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 100.00 | Using where; Using temporary |
| 9 | DEPENDENT SUBQUERY | b | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 0.11 | Using where |
| 6 | DERIVED | a | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 100.00 | Using where; Using temporary |
| 7 | DEPENDENT SUBQUERY | b | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 0.11 | Using where |
| 3 | DEPENDENT SUBQUERY | <derived4> | NULL | ref | <auto_key0> | <auto_key0> | 14 | s.roomid,s.userid | 249809 | 33.33 | Using where; Using index |
| 4 | DERIVED | a | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 100.00 | Using where; Using temporary |
| 5 | DEPENDENT SUBQUERY | b | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 0.11 | Using where |
+----+--------------------+------------+------------+-------+---------------+-------------+---------+-------------------+----------+----------+----------------------------------------------------+
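Running correlated subqueries several times over a 2.5-million-row table means the total number of rows examined is on the order of a product of several 2.5-million-row scans. With execution times like that the query is of no practical use, so this approach was rejected. We want an implementation that scans the table only once. That is the best achievable, since the table has to be scanned at least once anyway.
2. Cursor + in-memory temporary tables
A basic rule of database optimization is to prefer set-based operations and avoid cursors. Consider the simplest possible example. nums is a single-column auxiliary numbers table with one million rows, and the select below takes 0.41 seconds.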
mysql> select @id:=id from nums;
...
1000000 rows in set, 1 warning (0.41 sec)
Traversing the same table with a cursor takes 3.05 seconds, about 7.4 times as long as the single select statement.
mysql> delimiter //
mysql> create procedure p_cursor()
-> begin
-> declare done int default 0;
-> declare v_id bigint;
->
-> declare cur_nums cursor for select id from nums;
-> declare continue handler for not found set done = 1;
->
-> open cur_nums;
-> repeat
-> fetch cur_nums into v_id;
-> until done end repeat;
-> close cur_nums;
-> end//
Query OK, 0 rows affected (0.01 sec)
mysql>
mysql> call p_cursor()//
Query OK, 0 rows affected (3.05 sec)
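This case is different, however. A cursor over the business table lets us apply more complex logic while traversing the rows one by one, which avoids joins between large tables and greatly reduces the number of rows scanned, so it can perform far better than the join. Below is a stored procedure that merges overlapping intervals with a cursor.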
drop procedure if exists sp_overlap;
delimiter //
create procedure sp_overlap()
begin
declare done int default 0;
declare v_roomid bigint;
declare v_userid bigint;
declare v_start datetime;
declare v_end datetime;
declare v_prev_roomid int;
declare v_prev_userid bigint;
declare v_max_end datetime;
declare cur_t1 cursor for select roomid,userid,s,e from test1 order by roomid,userid,s,e;
declare continue handler for not found set done = 1;
drop table if exists t;
drop table if exists t1;
drop table if exists tmp_s;
create temporary table t(
roomid bigint,
userid bigint,
s datetime,
e datetime,
broken int
) engine=memory;
create temporary table t1 (
roomid int,
userid bigint,
s datetime,
e datetime
) engine=memory;
create temporary table tmp_s(
roomid bigint,
userid bigint,
s datetime,
e datetime,
i int
) engine=memory;
open cur_t1;
repeat
fetch cur_t1 into v_roomid,v_userid,v_start,v_end;
if done !=1 then
if(v_roomid=v_prev_roomid and v_userid=v_prev_userid) then
if(v_start<=v_max_end) then
insert into t values(v_roomid,v_userid,v_start,v_end,0);
else
insert into t values(v_roomid,v_userid,v_start,v_end,1);
end if;
if(v_end>=v_max_end) then
set v_max_end:=v_end;
end if;
set v_prev_roomid:=v_roomid;
set v_prev_userid:=v_userid;
else
set v_max_end:=v_end;
set v_prev_roomid:=v_roomid;
set v_prev_userid:=v_userid;
insert into t values(v_roomid,v_userid,v_start,v_end,1);
end if;
end if;
until done end repeat;
close cur_t1;
insert into tmp_s
select roomid,userid,min(s) s,max(e) e,datediff(max(e),min(s))+1 i
from (select roomid,userid,s,e,case when @flag=flag then @rn:=@rn+broken when @flag:=flag then @rn:=broken end ran
from (select roomid,userid,s,e,broken,concat(roomid,',',userid) flag from t,(select @flag:='',@rn:=0) vars) a
order by roomid,userid,s,e) b
group by roomid,userid,ran;
select max(i) into @c from tmp_s;
insert into t1(roomid,userid,s,e)
select roomid, userid,
if(date(s)!=date(e) and id>1,date(s+interval id-1 day),s) s,
if(date(s+interval id-1 day)=date(e) ,e,date_format(s+interval id-1 day,'%Y-%m-%d 23:59:59')) e
from tmp_s t1,
(select id from nums where id<=@c) nums
where (nums.id<=t1.i);
end
//
delimiter ;
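The query that defines the cursor must be ordered by room ID, user ID, start time, and end time. The four variables v_roomid, v_userid, v_start, and v_end hold the four columns of the cursor's current row. Because rows are grouped by room and user, v_prev_roomid and v_prev_userid hold the room ID and user ID of the previous row; comparing them with the current row tells which rows belong to the same group.
The variable v_max_end holds the largest end time seen so far in the current group. If the current row's start time is less than or equal to v_max_end, the row overlaps an earlier interval in the group and is flagged 0; otherwise it does not overlap any earlier interval in the group and is flagged 1. The results of the cursor traversal are stored in temporary table t, which has one more column than the source table, broken, holding the flag that indicates whether the row is to be merged: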
+--------+--------+---------------------+---------------------+--------+
| roomid | userid | s | e | broken |
+--------+--------+---------------------+---------------------+--------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:10:01 | 1 |
| 1 | 1 | 2018-01-01 01:01:02 | 2018-01-01 01:11:01 | 0 |
| 1 | 1 | 2018-01-01 01:01:05 | 2018-01-01 01:10:01 | 0 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 | 1 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 | 1 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 | 1 |
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-03 01:11:01 | 1 |
| 2 | 4 | 2018-01-01 01:03:02 | 2018-01-01 01:12:05 | 1 |
| 2 | 5 | 2018-01-01 01:11:02 | 2018-01-01 01:12:05 | 1 |
| 2 | 6 | 2018-01-01 01:15:02 | 2018-01-01 01:16:05 | 1 |
| 2 | 7 | 2018-01-01 01:01:03 | 2018-01-01 01:11:05 | 1 |
| 2 | 8 | 2018-01-01 23:01:03 | 2018-01-02 01:11:01 | 1 |
| 3 | 1 | 2018-01-05 01:01:01 | 2018-01-10 01:01:01 | 1 |
| 3 | 2 | 2018-01-05 01:01:01 | 2018-01-06 01:01:01 | 1 |
| 3 | 3 | 2018-01-06 01:01:01 | 2018-01-06 02:01:01 | 1 |
+--------+--------+---------------------+---------------------+--------+
15 rows in set (0.00 sec)
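The temporary table tmp_s stores the result after merging rows. Besides the original four columns, it has one more column holding the number of days spanned between the start time and the end time. The query that populates it contains this expression: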
case when @flag=flag then @rn:=@rn+broken when @flag:=flag then @rn:=broken end
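This expression groups rows by room and user (rows with the same @flag belong to the same group) and keeps a running total of broken within each group. Because rows to be merged have broken=0, all rows that belong to the same merged interval share the same running total: 1 for the first merged interval of a group, 2 for the second, and so on. The outer query then groups by the three columns roomid, userid, and ran, and min(s), max(e), and datediff(max(e),min(s))+1 give the merged start time, the merged end time, and the number of days spanned.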
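The flag-and-running-total idea can be traced by hand with a short Python sketch. This is an illustration only, not the stored procedure itself: flag_rows and merge_flagged are hypothetical names, and timestamps are assumed to be fixed-width strings that compare correctly as text.

```python
from itertools import groupby

def flag_rows(rows):
    """Return (roomid, userid, s, e, broken, ran) in room, user, start, end order."""
    flagged = []
    prev_key, max_end, ran = None, None, 0
    for roomid, userid, s, e in sorted(rows):
        key = (roomid, userid)
        if key != prev_key:
            broken, ran, max_end = 1, 0, e  # first row of a new (room, user) group
        elif s <= max_end:
            broken = 0                      # overlaps an earlier row of the group
        else:
            broken = 1                      # gap: a new merged interval starts here
        max_end = max(max_end, e)
        ran += broken                       # running total of broken within the group
        flagged.append((roomid, userid, s, e, broken, ran))
        prev_key = key
    return flagged

def merge_flagged(flagged):
    """Group by (roomid, userid, ran) and take min(s), max(e)."""
    merged = []
    for (roomid, userid, _), grp in groupby(flagged, key=lambda r: (r[0], r[1], r[5])):
        grp = list(grp)
        merged.append((roomid, userid, min(r[2] for r in grp), max(r[3] for r in grp)))
    return merged

rows = [
    (1, 1, "2018-01-01 01:01:01", "2018-01-01 01:10:01"),
    (1, 1, "2018-01-01 01:01:05", "2018-01-01 01:10:01"),
    (1, 1, "2018-01-01 01:01:02", "2018-01-01 01:11:01"),
    (1, 1, "2018-01-01 01:11:02", "2018-01-01 01:11:05"),
    (1, 2, "2018-01-01 01:01:02", "2018-01-01 01:01:05"),
]
flagged = flag_rows(rows)
merged = merge_flagged(flagged)
```

For user 1 in room 1 the running totals come out as 1, 1, 1, 2, so the first three rows collapse into one interval and the fourth stays separate.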
The following query then gets the maximum number of days spanned:
select max(i) from tmp_s;
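Finally, tmp_s is joined with the auxiliary numbers table to split the intervals that span days, and the result is stored in temporary table t1.
The procedure scans the source table only once, using a cursor, and keeps its intermediate results in in-memory temporary tables, so it is fairly general for overlap problems. Three temporary tables are used for the sake of readability: each step stores its intermediate result in its own in-memory temporary table, which keeps the logic clear. Performance tuning also has to be weighed against readability, flexibility, and maintainability, so as to avoid "optimization obsession". Three temporary tables are not strictly necessary here, and removing one might improve performance slightly, but collapsing these steps into a single query would make the SQL very hard to read and maintain, and the cost would outweigh the gain.
Run against the u_room_log table, the stored procedure produces 2,557,836 rows in 2 minutes 26 seconds, which is acceptable.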
mysql> set max_heap_table_size=268435456;
Query OK, 0 rows affected (0.00 sec)
mysql> set tmp_table_size=268435456;
Query OK, 0 rows affected (0.00 sec)
mysql> call sp_overlap();
Query OK, 2557836 rows affected (2 min 26.36 sec)
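III. Improving the Algorithm for Finding Active Periods
The two preceding steps produce the result set t1, in which a given user has no overlapping intervals within a room (not even a shared start or end point) and no row's start and end times fall on different days. The next task is to use t1 as input and, following the definition of an active period, find the intervals in which different users overlap. Two algorithms are implemented here, "minimum range" and "plus/minus counter", but with production-size data only the latter performs acceptably.
1. Minimum range algorithm (table join)
The algorithm works as follows:
(1) Sort all entry and exit time points of the same room together, regardless of user. For example, the entry and exit records for roomid=1 are: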
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:11:01 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+--------+---------------------+---------------------+
The output of this step is:
+--------+---------------------+
| roomid | timepoint |
+--------+---------------------+
| 1 | 2018-01-01 01:01:01 |
| 1 | 2018-01-01 01:01:02 |
| 1 | 2018-01-01 01:01:05 |
| 1 | 2018-01-01 01:01:05 |
| 1 | 2018-01-01 01:02:05 |
| 1 | 2018-01-01 01:11:01 |
| 1 | 2018-01-01 01:11:02 |
| 1 | 2018-01-01 01:11:05 |
+--------+---------------------+
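(2) Within each roomid in the previous output, take the current row's time point as the end time and the previous row's time point as the start time, and discard rows whose start time is empty or equal to the end time. The output is the set of minimal time ranges for each room. For example, the minimal time ranges for roomid=1 are: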
+--------+---------------------+---------------------+
| roomid | starttime | endtime |
+--------+---------------------+---------------------+
| 1 | 2018-01-01 01:01:01 | 2018-01-01 01:01:02 |
| 1 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 1 | 2018-01-01 01:02:05 | 2018-01-01 01:11:01 |
| 1 | 2018-01-01 01:11:01 | 2018-01-01 01:11:02 |
| 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+---------------------+---------------------+
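This step is the core of the algorithm. It serializes all entry and exit time points of a room onto one continuous time axis, so that the resulting ranges are contiguous but do not overlap.
(3) Inner join the previous output with table t1. If a user's online interval covers a minimal range, output that minimal range together with the userid and roomid. The result contains one or more minimal ranges for each user in each room. For example, in the room with roomid=1, the minimal ranges for each user are: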
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:01:02 |
| 1 | 1 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 1 | 1 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 1 | 1 | 2018-01-01 01:02:05 | 2018-01-01 01:11:01 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+--------+---------------------+---------------------+
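(4) Group the previous output by roomid and minimal range, and keep the groups that contain more than one userid. The result is the set of active periods for each room. For example, the output for roomid=1 is: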
+--------+---------------------+---------------------+---+
| roomid | s | e | c |
+--------+---------------------+---------------------+---+
| 1 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 | 2 |
| 1 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 | 2 |
+--------+---------------------+---------------------+---+
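(5) For each room and day, take the maximum user count over the active periods and sum the active durations (rounded to minutes). For example, the output for roomid=1 is: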
+--------+------------+------+------+
| roomid | dt | ts | c |
+--------+------------+------+------+
| 1 | 2018-01-01 | 1 | 2 |
+--------+------------+------+------+
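Steps (1) to (5) can be traced on the roomid=1 data with the Python sketch below. It is a hedged illustration of the logic rather than the stored procedure: the names ts, active_ranges, and daily_summary are hypothetical, and the nested scan over rows in step (3) plays the role of the expensive table join.

```python
from collections import defaultdict
from datetime import datetime

def ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

def active_ranges(rows):
    """rows: merged, day-split (roomid, userid, s, e). Returns {(roomid, start, end): users}."""
    points = defaultdict(set)
    for roomid, _, s, e in rows:
        points[roomid].update((s, e))                 # step (1): all time points per room
    result = {}
    for roomid, pts in points.items():
        ordered = sorted(pts)
        for start, end in zip(ordered, ordered[1:]):  # step (2): minimal ranges
            users = {u for r, u, s, e in rows         # step (3): users covering the range
                     if r == roomid and s <= start and end <= e}
            if len(users) > 1:                        # step (4): keep active ranges only
                result[(roomid, start, end)] = len(users)
    return result

def daily_summary(ranges):
    """Step (5): per room and day, total active minutes and peak user count."""
    totals = defaultdict(lambda: [0, 0])
    for (roomid, start, end), count in ranges.items():
        entry = totals[(roomid, start.date())]
        entry[0] += int((end - start).total_seconds())
        entry[1] = max(entry[1], count)
    return {key: (round(sec / 60), peak) for key, (sec, peak) in totals.items()}

rows = [
    (1, 1, ts("2018-01-01 01:01:01"), ts("2018-01-01 01:11:01")),
    (1, 2, ts("2018-01-01 01:01:02"), ts("2018-01-01 01:01:05")),
    (1, 3, ts("2018-01-01 01:01:05"), ts("2018-01-01 01:02:05")),
    (1, 1, ts("2018-01-01 01:11:02"), ts("2018-01-01 01:11:05")),
]
ranges = active_ranges(rows)
summary = daily_summary(ranges)
```

Note that Python's round and MySQL's round treat exact halves differently; the example data (63 seconds, rounded to 1 minute) is not affected.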
Below is the stored procedure that implements the minimum range algorithm:
drop procedure if exists sp_active_duration;
delimiter //
create procedure sp_active_duration()
begin
declare done int default 0;
declare v_roomid bigint;
declare v_start datetime;
declare v_end datetime;
drop table if exists tmp_time_point;
create temporary table tmp_time_point(
roomid bigint,
timepoint datetime
) engine=memory;
insert into tmp_time_point select roomid,s from t1;
insert into tmp_time_point select roomid,e from t1;
select roomid,date(s) dt,round(sum(timestampdiff(second,s,e))/60) ts,max(c) c
from (select roomid,s,e ,count(distinct userid) c
from (select distinct v6.roomid,v6.userid,starttime s,endtime e
from (select distinct roomid,cast(starttime as datetime) starttime,cast(endtime as datetime) endtime
from (select if(@roomid=roomid,@d,'') as starttime,@d:=timepoint,@roomid:=roomid,p.roomid,p.timepoint endtime
from tmp_time_point p,(select @d:='',@roomid:=-1) vars
order by roomid,timepoint) v4
where starttime!='' and date(starttime)=date(endtime) and starttime <> endtime) v5
inner join t1 v6 on(v5.starttime between v6.s and v6.e and v5.endtime between v6.s and v6.e and v5.roomid=v6.roomid)) v6
group by roomid,s,e having count(distinct userid)>1) v7
group by roomid,date(s);
end
//
delimiter ;
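The tmp_time_point table holds the output of step (1). MySQL allows a temporary table to be referenced only once in a single query; otherwise it raises ERROR 1137 (HY000): Can't reopen table: 't1'. That is why tmp_time_point is populated with two separate insert statements. The intermediate result sets v5, v6, and v7 are the outputs of steps (2), (3), and (4) respectively.
The logic of the minimum range algorithm is sound, but step (3) requires a table join, which takes a very long time on large data because so many rows have to be scanned. The execution plan of the procedure's final select statement on the u_room_log data is: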
mysql> explain select roomid,date(s) dt,round(sum(timestampdiff(second,s,e))/60) ts,max(c) c
-> from (select roomid,s,e ,count(distinct userid) c
-> from (select distinct v6.roomid,v6.userid,greatest(s,starttime) s,least(e,endtime) e
-> from (select distinct roomid,cast(starttime as datetime) starttime,cast(endtime as datetime) endtime
-> from (select if(@roomid=roomid,@d,'') as starttime,@d:=timepoint,@roomid:=roomid,p.roomid,p.timepoint endtime
-> from tmp_time_point p,(select @d:='',@roomid:=-1) vars
-> order by roomid,timepoint) v4
-> where starttime!='' and date(starttime)=date(endtime) and starttime <> endtime) v5
-> inner join t1 v6 on(v5.starttime between v6.s and v6.e and v5.endtime between v6.s and v6.e and v5.roomid=v6.roomid)) v6
-> group by roomid,s,e having count(distinct userid)>1) v7
-> group by roomid,date(s);
+----+-------------+------------+------------+--------+---------------+-------------+---------+----------------+------------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+--------+---------------+-------------+---------+----------------+------------+----------+------------------------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 1308213650 | 100.00 | Using temporary |
| 2 | DERIVED | <derived3> | NULL | ALL | NULL | NULL | NULL | NULL | 1308213650 | 100.00 | Using filesort |
| 3 | DERIVED | v6 | NULL | ALL | roomid | NULL | NULL | NULL | 2557836 | 100.00 | Using where; Using temporary |
| 3 | DERIVED | <derived4> | NULL | ref | <auto_key0> | <auto_key0> | 9 | test.v6.roomid | 41436 | 1.23 | Using where; Using index |
| 4 | DERIVED | <derived5> | NULL | ALL | NULL | NULL | NULL | NULL | 5115672 | 81.00 | Using where; Using temporary |
| 5 | DERIVED | <derived6> | NULL | system | NULL | NULL | NULL | NULL | 1 | 100.00 | Using filesort |
| 5 | DERIVED | p | NULL | ALL | NULL | NULL | NULL | NULL | 5115672 | 100.00 | NULL |
| 6 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+------------+------------+--------+---------------+-------------+---------+----------------+------------+----------+------------------------------+
8 rows in set, 5 warnings (0.01 sec)
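As the plan shows, step (3) joins two tables of several million rows each, so when sp_active_duration() was run on the u_room_log data we gave up waiting for a result.
2. Plus/minus counter algorithm (single scan)
As with the overlap merging, we want to scan the data only once and eliminate the table join to improve performance. In fact, once the data has been processed by sp_overlap, the active periods can be obtained efficiently. The core idea is to sort all entry and exit time points together while recording how many users enter or leave at each time point. The online time can then be divided into mutually exclusive periods, and the running total of entries and exits at all time points before the current one gives the overlap degree, that is, the number of distinct users, for the period from the previous time point to the current one. An entry is marked +1 and an exit -1, hence the name plus/minus counter algorithm. The steps are as follows.
(1) Combine all entry and exit time points of the same room into one column, marking entries as 1 and exits as -1. A 1 means that one user enters at that time point, and a -1 means that one user leaves at that time point. After this step the records for roomid=1 become: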
+--------+---------------------+------+
| roomid | timepoint | type |
+--------+---------------------+------+
| 1 | 2018-01-01 01:01:01 | 1 |
| 1 | 2018-01-01 01:01:02 | 1 |
| 1 | 2018-01-01 01:01:05 | -1 |
| 1 | 2018-01-01 01:01:05 | 1 |
| 1 | 2018-01-01 01:02:05 | -1 |
| 1 | 2018-01-01 01:11:01 | -1 |
| 1 | 2018-01-01 01:11:02 | 1 |
| 1 | 2018-01-01 01:11:05 | -1 |
+--------+---------------------+------+
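(2) Group by room and time point and sum the markers, which removes duplicate time points. A duplicate time point means that within the same second several users entered the same room, left it, or some entered while others left. Summing gives the net number of users entering or leaving at that time point. This step is required for two reasons: 1. each time point must be unique within a room; 2. the direction and size of the net change at each time point must be known. Both conditions must hold for the algorithm to be correct. For the same reason, when cross-day records are split, the end of one piece and the start of the next differ by one second, which keeps the time points unique. After this step the records for roomid=1 become: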
+--------+---------------------+------+
| roomid | timepoint | type |
+--------+---------------------+------+
| 1 | 2018-01-01 01:01:01 | 1 |
| 1 | 2018-01-01 01:01:02 | 1 |
| 1 | 2018-01-01 01:01:05 | 0 |
| 1 | 2018-01-01 01:02:05 | -1 |
| 1 | 2018-01-01 01:11:01 | -1 |
| 1 | 2018-01-01 01:11:02 | 1 |
| 1 | 2018-01-01 01:11:05 | -1 |
+--------+---------------------+------+
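(3) Partition by room, order by time point, and for each time point fetch the net change of the previous time point. If there is no previous time point, this is the first entry into the room and the previous net change is set to 0. After this step the records for roomid=1 become: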
+--------+---------------------+------+----------+
| roomid | timepoint | type | prevType |
+--------+---------------------+------+----------+
| 1 | 2018-01-01 01:01:01 | 1 | 0 |
| 1 | 2018-01-01 01:01:02 | 1 | 1 |
| 1 | 2018-01-01 01:01:05 | 0 | 1 |
| 1 | 2018-01-01 01:02:05 | -1 | 0 |
| 1 | 2018-01-01 01:11:01 | -1 | -1 |
| 1 | 2018-01-01 01:11:02 | 1 | -1 |
| 1 | 2018-01-01 01:11:05 | -1 | 1 |
+--------+---------------------+------+----------+
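(4) Take the previous time point as the start time and the current time point as the end time, which divides the room's online time into mutually exclusive periods. The running total of the net changes at all time points before the current one is the overlap degree of the period. After this step the records for roomid=1 are as follows, where rn is the number of distinct users between starttime and endtime: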
+--------+---------------------+---------------------+------+
| roomid | starttime | endtime | rn |
+--------+---------------------+---------------------+------+
| 1 | NULL | 2018-01-01 01:01:01 | 0 |
| 1 | 2018-01-01 01:01:01 | 2018-01-01 01:01:02 | 1 |
| 1 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 | 2 |
| 1 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 | 2 |
| 1 | 2018-01-01 01:02:05 | 2018-01-01 01:11:01 | 1 |
| 1 | 2018-01-01 01:11:01 | 2018-01-01 01:11:02 | 0 |
| 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 | 1 |
+--------+---------------------+---------------------+------+
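(5) For each day and room, sum the active duration (the periods whose overlap degree is greater than 1) and find the peak number of users over the active periods (the maximum overlap degree). The final result for roomid=1 is shown below, where dur is the active duration (rounded to minutes) and c is the peak number of users: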
+--------+------------+------+------+
| roomid | dt | dur | c |
+--------+------------+------+------+
| 1 | 2018-01-01 | 1 | 2 |
+--------+------------+------+------+
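The counter idea can be checked against the roomid=1 data with a short Python sketch. It is an illustration under stated assumptions, not the stored procedure: ts and active_summary are hypothetical names, and instead of materializing prevType the sketch keeps a running count that is updated after each time point is processed, which yields the same rn values.

```python
from collections import defaultdict
from datetime import datetime

def ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

def active_summary(rows):
    """rows: merged, day-split (roomid, userid, s, e). Returns {(roomid, date): (minutes, peak)}."""
    deltas = defaultdict(int)
    for roomid, _, s, e in rows:
        deltas[(roomid, s)] += 1  # steps (1)-(2): +1 per entry, summed per time point
        deltas[(roomid, e)] -= 1  # -1 per exit
    totals = defaultdict(lambda: [0, 0])
    prev_room, prev_point, online = None, None, 0
    for (roomid, point), delta in sorted(deltas.items()):
        if roomid != prev_room:
            online = 0            # nobody is online before a room's first time point
        elif online >= 2:         # online = users present between prev_point and point
            entry = totals[(roomid, prev_point.date())]
            entry[0] += int((point - prev_point).total_seconds())
            entry[1] = max(entry[1], online)
        online += delta           # apply the net change at this time point
        prev_room, prev_point = roomid, point
    return {key: (round(sec / 60), peak) for key, (sec, peak) in totals.items()}

rows = [
    (1, 1, ts("2018-01-01 01:01:01"), ts("2018-01-01 01:11:01")),
    (1, 2, ts("2018-01-01 01:01:02"), ts("2018-01-01 01:01:05")),
    (1, 3, ts("2018-01-01 01:01:05"), ts("2018-01-01 01:02:05")),
    (1, 1, ts("2018-01-01 01:11:02"), ts("2018-01-01 01:11:05")),
]
summary = active_summary(rows)
```

The only per-room state is one integer, which is why a single ordered pass over the time points is enough and no join back to the interval table is needed.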
The sp_active_duration procedure rewritten to use the plus/minus counter algorithm is as follows:
drop procedure if exists sp_active_duration;
delimiter //
create procedure sp_active_duration()
begin
declare done int default 0;
declare v_roomid bigint;
declare v_start datetime;
declare v_end datetime;
declare cur_test cursor for select roomid,s,e from t1;
declare continue handler for not found set done = 1;
drop table if exists tmp_time_point;
create temporary table tmp_time_point(
roomid bigint,
timepoint datetime,
type smallint
) engine=memory;
-- +1 at each start point, -1 at each end point
insert into tmp_time_point(roomid,timepoint,type) select roomid,s,1 from t1;
insert into tmp_time_point(roomid,timepoint,type) select roomid,e,-1 from t1;
select roomid,date(s) dt,round(sum(timestampdiff(second,date_format(s,'%Y-%m-%d %H:%i:%s'),date_format(e,'%Y-%m-%d %H:%i:%s')))/60) ts,max(rn) c
from (select if(@roomid=roomid,@d,'') as s,
@d:=str_to_date(timepoint,'%Y-%m-%d %H:%i:%s.%f'),
@roomid:=roomid,
p.roomid,
str_to_date(timepoint,'%Y-%m-%d %H:%i:%s.%f') e,
rn
from (select round(case when @roomid=roomid then @rn:=@rn+prevType when @roomid:=roomid then @rn:=prevType end) rn,b.prevType,roomid,timepoint,type
from (select a.roomid,timepoint,type,if(@roomid=roomid,@type,0) prevType, @roomid:=roomid, @type:=type
from (select *
from (select roomid,timepoint,sum(type) type from tmp_time_point group by roomid,timepoint) tmp_time_point,
(select @roomid:=-1,@rn:=0,@type:=0) vars
order by roomid ,timepoint) a) b
order by roomid ,timepoint) p,
(select @d:='',@roomid:=-1) vars
order by roomid,timepoint) v4
where rn>=2
group by roomid,date(s);
end
//
delimiter ;
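The tmp_time_point table stores the result of step (1); b and v4 are the outputs of steps (3) and (4). The final query of the procedure scans tmp_time_point only once, so it runs far faster. On the u_room_log data, sp_active_duration takes 1 minute 13 seconds.
To meet the original requirement, simply call the two stored procedures one after the other in the same session. For the 2.5 million rows of business log data, the total execution time is about 3 minutes 40 seconds.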
set max_heap_table_size=268435456;
set tmp_table_size=268435456;
call sp_overlap();
call sp_active_duration();
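IV. A Single-Query Solution in MySQL 8
MySQL 8 provides a rich set of window functions, which make complex analytic queries possible. Moreover, the row-level user-variable idiom of older MySQL versions is now deprecated: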
mysql> select @a:=id from nums limit 1;
+--------+
| @a:=id |
+--------+
| 1 |
+--------+
1 row in set, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1287 | Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable=expression, ...', or 'SELECT expression(s) INTO variables(s)'. |
+---------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
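Before window functions were available, row-level variables were a necessary evil for handling complex logic. They are not standard SQL, they read poorly, and porting them to another RDBMS is more trouble than rewriting from scratch. MySQL 8, by contrast, is close to Oracle in SQL functionality, and the overlapping-interval problem can be solved with a single query:
with c1 as -- merge each user's overlapping intervals within a room, used for the peak user count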
(
select distinct roomid,userid,min(s) s,max(e) e
from (select roomid,userid,s,e,
sum(broken) over (partition by roomid, userid order by s,e) flag
from (select *,
(case when s <= max(e) over (partition by roomid, userid order by s,e rows between unbounded preceding and 1 preceding) then 0
else 1
end) as broken
from test1
) t
) t
group by roomid,userid,flag
),
c2 as -- split intervals that span days
(
select *
from (select roomid,userid,s,e
from c1
where date(s) = date(e) -- does not span days
union all
select roomid,userid,
case when id = 1 then s else date_add(date(s),interval id-1 day) end s,
case when id = m2 then e else date_add(date(s),interval id*3600*24 -1 second) end e
from (select roomid,userid,s,e,id,
max(id) over (partition by roomid,userid,s) m2
from c1,(select id from nums where id<=100) n
where date(s) <> date(e) -- spans days
and id <= datediff(e,s)+1) t1) t1 -- datediff counts days correctly across month boundaries
),
c3 as -- compute the user count of each range while computing the minimal ranges
(
select roomid,ts endtime,sum(prevtype) over(partition by roomid order by ts) rn,
lag(ts) over (partition by roomid order by ts) starttime
from (
select a.*,ifnull(lag(type) over (partition by roomid order by ts),0) prevtype
from (
select
roomid,ts,sum(type) type
from (
select roomid,e ts, -1 type
from c2
union all
select roomid,s ts, 1 type
from c2
) t1 group by roomid,ts
) a
) c
)
select roomid,dt,round(sum(dur)/60) ts,max(rn) c from (
select roomid,date(starttime) dt,timestampdiff(second,starttime,endtime) dur,rn
from c3 where rn>=2
) t
group by roomid,dt
order by roomid,dt;
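The query follows exactly the same logic as the stored procedures, with most of the heavy lifting handed to window functions. It is more concise, but it is not faster: in the same environment, the with query takes about 4 minutes 10 seconds on u_room_log, half a minute slower than the custom stored procedures.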