MPP columnar storage: optimizing OVER performance

 

Conclusions first:

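1. Most OVER operations can be replaced by GROUP BY + INNER JOIN. If OVER is unavoidable (for example, together with special functions such as lead), keep the SELECT list of the window query to the PARTITION BY and ORDER BY columns wherever possible, so that the engine does not scan extra columns just to compute the window. Fetch the remaining columns by joining back to the source table once, after the final result has been filtered (that is, use OVER + INNER JOIN instead of a plain OVER).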

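2. In the tests below, GROUP BY + INNER JOIN was more than twice as fast as OVER + INNER JOIN (6.3 s versus 14 s on 50 million rows), so prefer GROUP BY + INNER JOIN whenever the rewrite is possible.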
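For the case where OVER cannot be avoided, the pattern in point 1 can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite (through Python's sqlite3 module) as a stand-in for the MPP engine, a hypothetical table `face_view` with a handful of rows, and an arbitrary gap threshold of 100. It shows the shape of the query, not its performance, and it assumes (object_id, snap_time) identifies a row uniquely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; not the table used in the measurements.
conn.executescript("""
CREATE TABLE face_view (
    object_id TEXT, device_id TEXT, close_image_url TEXT, snap_time INTEGER
);
INSERT INTO face_view VALUES
    ('p1', 'd1', 'u1', 100),
    ('p1', 'd2', 'u2', 250),
    ('p1', 'd3', 'u3', 300),
    ('p2', 'd1', 'u4', 120),
    ('p2', 'd4', 'u5', 500);
""")

# The window pass reads only the partition/order columns. The filter runs on
# that narrow result, and the wide columns are joined in for surviving rows.
NARROW_LEAD_THEN_JOIN = """
SELECT w.object_id, w.snap_time, w.next_snap_time, b.device_id, b.close_image_url
FROM (
    SELECT object_id, snap_time, next_snap_time
    FROM (
        SELECT object_id, snap_time,
               lead(snap_time) OVER (PARTITION BY object_id ORDER BY snap_time)
                   AS next_snap_time
        FROM face_view
    ) AS s
    WHERE next_snap_time - snap_time >= 100
) AS w
INNER JOIN face_view AS b
    ON w.object_id = b.object_id AND w.snap_time = b.snap_time
ORDER BY w.object_id, w.snap_time
"""

rows = conn.execute(NARROW_LEAD_THEN_JOIN).fetchall()
for row in rows:
    print(row)
```

Only the rows that pass the gap filter reach the join, so device_id and close_image_url are looked up for two rows here rather than carried through the window for all five.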

Test details:

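1. With columnar storage, does taking every column that OVER() does not need out of the window query, and fetching it with a join afterwards, noticeably improve the performance of the OVER operation?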

create or replace view dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 as 
select  object_id,device_id,close_image_url,distant_image_url,snap_time,date_time  
from viid_facestatic.res_time_space where object_type ='2' and snap_time between 1588262400000 and 1588694399000 and hour(to_timestamp(snap_time/1000)) between 0 and 23;

-- 10 million rows, execution time 6.6 s
 SELECT a.object_id,a.device_id,a.close_image_url AS latest_close_image_url,a.total_appear_times,a.date_time as date_time
FROM (
SELECT a.object_id,a.device_id,a.close_image_url,a.snap_time,a.date_time,
row_number()OVER(partition by a.object_id,a.date_time ORDER BY a.snap_time desc,a.date_time)as num1,
count(1)OVER(partition by a.object_id,a.date_time)as total_appear_times 
FROM  dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a)a 
WHERE a.num1=1 AND a.total_appear_times>=10
ORDER BY a.total_appear_times DESC limit 10;

-- After changing device_id and close_image_url to be fetched via INNER JOIN: 10 million rows, execution time 2.9 s
SELECT b.device_id,b.close_image_url AS latest_close_image_url,a.* from (
 SELECT a.object_id,a.total_appear_times,a.date_time as date_time,a.snap_time
FROM (
SELECT a.object_id,a.snap_time,a.date_time,
row_number()OVER(partition by a.object_id,a.date_time ORDER BY a.snap_time desc,a.date_time)as num1,
count(1)OVER(partition by a.object_id,a.date_time)as total_appear_times 
FROM  dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a)a 
WHERE a.num1=1 AND a.total_appear_times>=10
ORDER BY a.total_appear_times DESC limit 10)a
inner join dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 b
on a.object_id=b.object_id and a.snap_time=b.snap_time;

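-- Conclusion
With columnar storage, fetching the columns that OVER() does not need by joining after the result has been filtered significantly improves the performance of the OVER operation (6.6 s down to 2.9 s here).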
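The rewrite in this test changes the result only if the join key is not unique. The sketch below checks that on a tiny sample. It is an illustration under assumptions, not the measured setup: SQLite via Python's sqlite3 stands in for the MPP engine, `face_view` is a hypothetical table, and the threshold is lowered from 10 to 2 to suit the sample size.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; (object_id, snap_time) is unique at this point.
conn.executescript("""
CREATE TABLE face_view (
    object_id TEXT, date_time TEXT, snap_time INTEGER,
    device_id TEXT, close_image_url TEXT
);
INSERT INTO face_view VALUES
    ('p1', '2020-05-01', 100, 'd1', 'u1'),
    ('p1', '2020-05-01', 200, 'd2', 'u2'),
    ('p1', '2020-05-01', 300, 'd3', 'u3'),
    ('p2', '2020-05-01', 150, 'd1', 'u4'),
    ('p2', '2020-05-01', 400, 'd2', 'u5'),
    ('p3', '2020-05-01', 500, 'd9', 'u6');
""")

# Wide form: the extra columns travel through the window computation.
WIDE = """
SELECT object_id, date_time, device_id, close_image_url, total_appear_times
FROM (
    SELECT object_id, date_time, device_id, close_image_url,
           row_number() OVER (PARTITION BY object_id, date_time
                              ORDER BY snap_time DESC) AS num1,
           count(1) OVER (PARTITION BY object_id, date_time) AS total_appear_times
    FROM face_view
) AS a
WHERE num1 = 1 AND total_appear_times >= 2
ORDER BY total_appear_times DESC, object_id
"""

# Narrow form: window over key columns only, then join the extras back.
NARROW_THEN_JOIN = """
SELECT a.object_id, a.date_time, b.device_id, b.close_image_url, a.total_appear_times
FROM (
    SELECT object_id, date_time, snap_time, total_appear_times
    FROM (
        SELECT object_id, date_time, snap_time,
               row_number() OVER (PARTITION BY object_id, date_time
                                  ORDER BY snap_time DESC) AS num1,
               count(1) OVER (PARTITION BY object_id, date_time) AS total_appear_times
        FROM face_view
    ) AS s
    WHERE num1 = 1 AND total_appear_times >= 2
) AS a
INNER JOIN face_view AS b
    ON a.object_id = b.object_id AND a.snap_time = b.snap_time
ORDER BY a.total_appear_times DESC, a.object_id
"""

wide = conn.execute(WIDE).fetchall()
narrow = conn.execute(NARROW_THEN_JOIN).fetchall()

# Add a second record with the same object_id and latest snap_time.
conn.execute("INSERT INTO face_view VALUES ('p1', '2020-05-01', 300, 'd4', 'u7')")
wide_tie = conn.execute(WIDE).fetchall()
narrow_tie = conn.execute(NARROW_THEN_JOIN).fetchall()

print(wide == narrow)                  # same rows while the join key is unique
print(len(wide_tie), len(narrow_tie))  # the join-back form gains a row on a tie
```

If two records of one object can share a snap_time, the join-back form returns one row per matching record, so either deduplicate after the join or add a further column to the join condition.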

2. When there are no extra columns (the SELECT list contains only columns used inside OVER), is OVER more expensive than GROUP BY?
-- 1.6 s
SELECT a.object_id,a.date_time,count(1)OVER(partition by a.object_id,a.date_time)as total_appear_times FROM dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a;

-- 2.37 s
SELECT a.object_id,a.date_time, count(1) as total_appear_times FROM dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a
group by a.object_id,a.date_time;

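-- Conclusion
Between a bare OVER and a GROUP BY, OVER is expected to be slightly more expensive because it keeps every input row, although the two timings recorded above (1.6 s for OVER, 2.37 s for GROUP BY) do not show this on their own.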
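The "keeps every input row" point can be seen directly in the row counts. This is a minimal sketch assuming SQLite (via Python's sqlite3) as a stand-in engine and a hypothetical table `face_view`; it shows result sizes only and says nothing about timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: 6 input rows in 3 (object_id, date_time) groups.
conn.executescript("""
CREATE TABLE face_view (object_id TEXT, date_time TEXT, snap_time INTEGER);
INSERT INTO face_view VALUES
    ('p1', '2020-05-01', 100),
    ('p1', '2020-05-01', 200),
    ('p1', '2020-05-01', 300),
    ('p2', '2020-05-01', 150),
    ('p2', '2020-05-01', 400),
    ('p3', '2020-05-01', 500);
""")

# OVER emits one output row per input row, each carrying its group's count.
over_rows = conn.execute("""
    SELECT object_id, date_time,
           count(1) OVER (PARTITION BY object_id, date_time) AS total_appear_times
    FROM face_view
""").fetchall()

# GROUP BY emits one output row per group.
group_rows = conn.execute("""
    SELECT object_id, date_time, count(1) AS total_appear_times
    FROM face_view
    GROUP BY object_id, date_time
""").fetchall()

print(len(over_rows), len(group_rows))
print(set(over_rows) == set(group_rows))  # same distinct (group, count) pairs
```

Both queries compute the same counts; the OVER result simply repeats each count once per member of the group, which is the extra volume that later steps have to handle.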

3. Is GROUP BY followed by INNER JOIN faster than OVER followed by INNER JOIN?
--------------------------------------------------------------------------------
create or replace view dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 as 
select  object_id,device_id,device_name ,close_image_url,distant_image_url,snap_time,date_time  
from viid_facestatic.res_time_space where object_type ='3' 
--and date_time between '2020-05-01' and '2020-05-02'
and snap_time between 1588262400000 and 1588694399000 and hour(to_timestamp(snap_time/1000)) between 0 and 23;

-- 50 million rows, 6.3 s
select b.device_id,b.device_name,b.close_image_url AS latest_close_image_url,a.* FROM (
select a.* from(
SELECT a.object_id,a.date_time, count(1)total_appear_times,max(snap_time )mtime FROM dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a
group by a.object_id,a.date_time)a
WHERE a.total_appear_times>=10
ORDER BY a.total_appear_times DESC limit 10)a
inner join dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 b
on a.object_id=b.object_id and a.mtime=b.snap_time;

-- 50 million rows, 14 s
SELECT b.device_id,b.device_name,b.close_image_url AS latest_close_image_url,a.* from (
 SELECT a.object_id,a.total_appear_times,a.date_time as date_time,a.snap_time
FROM (
SELECT a.object_id,a.snap_time,a.date_time,
row_number()OVER(partition by a.object_id,a.date_time ORDER BY a.snap_time desc,a.date_time)as num1,
count(1)OVER(partition by a.object_id,a.date_time)as total_appear_times 
FROM  dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a)a 
WHERE a.num1=1 AND a.total_appear_times>=10
ORDER BY a.total_appear_times DESC limit 10)a
inner join dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 b
on a.object_id=b.object_id and a.snap_time=b.snap_time;

-- Conclusion
GROUP BY followed by INNER JOIN is faster than OVER followed by INNER JOIN (6.3 s versus 14 s on 50 million rows).
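The two forms compared in this test are interchangeable because row_number() = 1 under ORDER BY snap_time DESC selects the same snap_time that max(snap_time) returns for the group. The sketch below checks that they produce identical rows on a small sample. The assumptions are the same kind as before: SQLite via Python's sqlite3 as a stand-in engine, a hypothetical table `face_view`, and a threshold of 2 instead of 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; not the table used in the measurements.
conn.executescript("""
CREATE TABLE face_view (
    object_id TEXT, date_time TEXT, snap_time INTEGER, device_id TEXT
);
INSERT INTO face_view VALUES
    ('p1', '2020-05-01', 100, 'd1'),
    ('p1', '2020-05-01', 200, 'd2'),
    ('p1', '2020-05-01', 300, 'd3'),
    ('p2', '2020-05-01', 150, 'd1'),
    ('p2', '2020-05-01', 400, 'd2'),
    ('p3', '2020-05-01', 500, 'd9');
""")

# Aggregate first: count and latest snap_time per group, then join back.
GROUP_THEN_JOIN = """
SELECT g.object_id, g.date_time, g.total_appear_times, g.mtime, b.device_id
FROM (
    SELECT object_id, date_time,
           count(1) AS total_appear_times, max(snap_time) AS mtime
    FROM face_view
    GROUP BY object_id, date_time
) AS g
INNER JOIN face_view AS b
    ON g.object_id = b.object_id AND g.mtime = b.snap_time
WHERE g.total_appear_times >= 2
ORDER BY g.total_appear_times DESC, g.object_id
"""

# Window first: rank and count per group, keep rank 1, then join back.
OVER_THEN_JOIN = """
SELECT a.object_id, a.date_time, a.total_appear_times, a.snap_time, b.device_id
FROM (
    SELECT object_id, date_time, snap_time, total_appear_times
    FROM (
        SELECT object_id, date_time, snap_time,
               row_number() OVER (PARTITION BY object_id, date_time
                                  ORDER BY snap_time DESC) AS num1,
               count(1) OVER (PARTITION BY object_id, date_time) AS total_appear_times
        FROM face_view
    ) AS s
    WHERE num1 = 1 AND total_appear_times >= 2
) AS a
INNER JOIN face_view AS b
    ON a.object_id = b.object_id AND a.snap_time = b.snap_time
ORDER BY a.total_appear_times DESC, a.object_id
"""

group_result = conn.execute(GROUP_THEN_JOIN).fetchall()
over_result = conn.execute(OVER_THEN_JOIN).fetchall()
print(group_result == over_result)
```

The GROUP BY form hands the filter and the join one row per group, while the OVER form first produces one row per input record and discards most of them afterwards, which is consistent with the timing difference reported above.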






 

 

 
