MPP columnar storage: over() performance optimization

 

Conclusions first:

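1. Most over() operations can be replaced by group by + inner join. If over() is unavoidable (for example, together with special functions such as lead), keep the select list of the window query limited to the partition by and order by columns, so that the engine does not scan unneeded columns during the window computation. Fetch the remaining columns by joining back to the source table once, after the result has been filtered (that is, use over + inner join instead of a wide over).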

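2. In these tests, group by + inner join was more than twice as fast as over + inner join (6.3s vs 14s on 50 million rows), so use group by + inner join wherever it can replace over().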
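For the case where over() cannot be avoided (lead is the example named above), the narrow-window pattern can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the MPP engine, the table `snaps`, its sample rows and the gap threshold of 150 are hypothetical, and (object_id, snap_time) is assumed to be unique. It shows the query shape only and says nothing about timing.

```python
import sqlite3

# Hypothetical stand-in table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snaps (object_id TEXT, device_id TEXT, snap_time INTEGER)"
)
conn.executemany(
    "INSERT INTO snaps VALUES (?, ?, ?)",
    [
        ("p1", "d1", 100),
        ("p1", "d2", 200),
        ("p1", "d3", 350),
        ("p2", "d1", 150),
        ("p2", "d4", 160),
    ],
)

# The window query reads only the partition/order columns.
# device_id is fetched afterwards by joining the filtered rows back.
narrow_lead_sql = """
SELECT g.object_id, g.snap_time, g.next_snap_time, b.device_id
FROM (
    SELECT object_id, snap_time, next_snap_time
    FROM (
        SELECT object_id, snap_time,
               lead(snap_time) OVER (
                   PARTITION BY object_id ORDER BY snap_time
               ) AS next_snap_time
        FROM snaps
    ) w
    WHERE next_snap_time - snap_time >= 150
) g
INNER JOIN snaps b
    ON g.object_id = b.object_id AND g.snap_time = b.snap_time
ORDER BY g.object_id, g.snap_time
"""
gaps = conn.execute(narrow_lead_sql).fetchall()
print(gaps)
```

The last row of each partition has no successor, so lead returns NULL there and the gap filter drops it.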

Test details:

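1. With columnar storage, does moving every column that over() does not need out of the window query, and fetching it by a join afterwards, significantly improve over() performance?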

create or replace view dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 as 
select  object_id,device_id,close_image_url,distant_image_url,snap_time,date_time  
from viid_facestatic.res_time_space where object_type ='2' and snap_time between 1588262400000 and 1588694399000 and hour(to_timestamp(snap_time/1000)) between 0 and 23;

-- 10 million rows, execution time 6.6s
 SELECT a.object_id,a.device_id,a.close_image_url AS latest_close_image_url,a.total_appear_times,a.date_time as date_time
FROM (
SELECT a.object_id,a.device_id,a.close_image_url,a.snap_time,a.date_time,
row_number()OVER(partition by a.object_id,a.date_time ORDER BY a.snap_time desc,a.date_time)as num1,
count(1)OVER(partition by a.object_id,a.date_time)as total_appear_times 
FROM  dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a)a 
WHERE a.num1=1 AND a.total_appear_times>=10
ORDER BY a.total_appear_times DESC limit 10;

-- After changing device_id and close_image_url to be fetched via inner join: 10 million rows, execution time 2.9s
SELECT b.device_id,b.close_image_url AS latest_close_image_url,a.* from (
 SELECT a.object_id,a.total_appear_times,a.date_time as date_time,a.snap_time
FROM (
SELECT a.object_id,a.snap_time,a.date_time,
row_number()OVER(partition by a.object_id,a.date_time ORDER BY a.snap_time desc,a.date_time)as num1,
count(1)OVER(partition by a.object_id,a.date_time)as total_appear_times 
FROM  dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a)a 
WHERE a.num1=1 AND a.total_appear_times>=10
ORDER BY a.total_appear_times DESC limit 10)a
inner join dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 b
on a.object_id=b.object_id and a.snap_time=b.snap_time;

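-- Conclusion
With columnar storage, moving the columns that over() does not need out of the window query, and fetching them by a join after the result has been filtered, significantly improves over() performance (6.6s down to 2.9s on 10 million rows).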
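The two query shapes compared in this test can be checked for equivalence on a small data set. The sketch below is an assumption-laden illustration, not the original benchmark: it runs on Python's built-in sqlite3 rather than the MPP engine, the table `snaps` and its rows are hypothetical, the threshold is lowered from 10 to 2 to fit the tiny sample, and (object_id, snap_time) is assumed to be unique so that the join back returns exactly one row per group.

```python
import sqlite3

# Hypothetical stand-in for the view used in the test above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snaps (object_id TEXT, device_id TEXT, "
    "close_image_url TEXT, snap_time INTEGER, date_time TEXT)"
)
conn.executemany(
    "INSERT INTO snaps VALUES (?, ?, ?, ?, ?)",
    [
        ("p1", "d1", "u1", 100, "2020-05-01"),
        ("p1", "d2", "u2", 200, "2020-05-01"),
        ("p1", "d3", "u3", 300, "2020-05-01"),
        ("p2", "d1", "u4", 150, "2020-05-01"),
        ("p2", "d4", "u5", 250, "2020-05-02"),
    ],
)

# Wide form: extra columns travel through the window computation.
wide_sql = """
SELECT object_id, date_time, snap_time, device_id, close_image_url,
       total_appear_times
FROM (
    SELECT object_id, date_time, snap_time, device_id, close_image_url,
           row_number() OVER (PARTITION BY object_id, date_time
                              ORDER BY snap_time DESC) AS rn,
           count(*) OVER (PARTITION BY object_id, date_time)
               AS total_appear_times
    FROM snaps
) t
WHERE rn = 1 AND total_appear_times >= 2
ORDER BY object_id, date_time
"""

# Narrow form: the window sees only partition/order columns;
# the extra columns are joined back after filtering.
narrow_sql = """
SELECT a.object_id, a.date_time, a.snap_time, b.device_id,
       b.close_image_url, a.total_appear_times
FROM (
    SELECT object_id, date_time, snap_time, total_appear_times
    FROM (
        SELECT object_id, date_time, snap_time,
               row_number() OVER (PARTITION BY object_id, date_time
                                  ORDER BY snap_time DESC) AS rn,
               count(*) OVER (PARTITION BY object_id, date_time)
                   AS total_appear_times
        FROM snaps
    ) t
    WHERE rn = 1 AND total_appear_times >= 2
) a
INNER JOIN snaps b
    ON a.object_id = b.object_id AND a.snap_time = b.snap_time
ORDER BY a.object_id, a.date_time
"""

wide_result = conn.execute(wide_sql).fetchall()
narrow_result = conn.execute(narrow_sql).fetchall()
print(wide_result == narrow_result)
```

If (object_id, snap_time) were not unique in the real data, the join back could return more than one row per group, so that assumption is worth verifying before applying the rewrite.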

2. With no extra columns (the select list contains only columns used by over()), is over() more time-consuming than group by?
--1.6s
SELECT a.object_id,a.date_time,count(1)OVER(partition by a.object_id,a.date_time)as total_appear_times FROM dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a;

--2.37s
SELECT a.object_id,a.date_time, count(1) FROM dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a
group by a.object_id,a.date_time;

-- Conclusion
Comparing plain over() with group by, the former is slightly more time-consuming, because it keeps every input row.
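The "keeps every input row" point can be seen on a toy table. This is a minimal sketch using Python's built-in sqlite3 as a stand-in engine; the table `snaps` and its rows are hypothetical, and the example shows only result cardinality, not timing.

```python
import sqlite3

# Hypothetical sample: 5 input rows falling into 3 groups.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snaps (object_id TEXT, date_time TEXT)")
conn.executemany(
    "INSERT INTO snaps VALUES (?, ?)",
    [
        ("p1", "2020-05-01"),
        ("p1", "2020-05-01"),
        ("p1", "2020-05-01"),
        ("p2", "2020-05-01"),
        ("p2", "2020-05-02"),
    ],
)

# over(): one output row per input row, with the group count repeated.
over_rows = conn.execute(
    "SELECT object_id, date_time, "
    "count(*) OVER (PARTITION BY object_id, date_time) AS total_appear_times "
    "FROM snaps"
).fetchall()

# group by: one output row per group.
group_rows = conn.execute(
    "SELECT object_id, date_time, count(*) AS total_appear_times "
    "FROM snaps GROUP BY object_id, date_time"
).fetchall()

print(len(over_rows), len(group_rows))
```

Deduplicating the over() output gives exactly the group by output, which is why the two are interchangeable whenever only per-group values are needed.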

3. Is group by followed by inner join faster than over() followed by inner join?
--------------------------------------------------------------------------------
create or replace view dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 as 
select  object_id,device_id,device_name ,close_image_url,distant_image_url,snap_time,date_time  
from viid_facestatic.res_time_space where object_type ='3' 
--and date_time between '2020-05-01' and '2020-05-02'
and snap_time between 1588262400000 and 1588694399000 and hour(to_timestamp(snap_time/1000)) between 0 and 23;

-- 50 million rows, 6.3s
select b.device_id,b.device_name,b.close_image_url AS latest_close_image_url,a.* FROM (
select a.* from(
SELECT a.object_id,a.date_time, count(1)total_appear_times,max(snap_time )mtime FROM dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a
group by a.object_id,a.date_time)a
WHERE a.total_appear_times>=10
ORDER BY a.total_appear_times DESC limit 10)a
inner join dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 b
on a.object_id=b.object_id and a.mtime=b.snap_time;

-- 50 million rows, 14s
SELECT b.device_id,b.device_name,b.close_image_url AS latest_close_image_url,a.* from (
 SELECT a.object_id,a.total_appear_times,a.date_time as date_time,a.snap_time
FROM (
SELECT a.object_id,a.snap_time,a.date_time,
row_number()OVER(partition by a.object_id,a.date_time ORDER BY a.snap_time desc,a.date_time)as num1,
count(1)OVER(partition by a.object_id,a.date_time)as total_appear_times 
FROM  dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 a)a 
WHERE a.num1=1 AND a.total_appear_times>=10
ORDER BY a.total_appear_times DESC limit 10)a
inner join dts_figure_resource.face_freq_appear_view_523040439e1c8f454eac1450481ff7bda34447c10 b
on a.object_id=b.object_id and a.snap_time=b.snap_time;

-- Conclusion
Group by followed by inner join is faster than over() followed by inner join (6.3s vs 14s on 50 million rows).
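The rewrite tested here, replacing row_number()/count() over a partition with count(*) and max(snap_time) under group by, can be checked for equivalence on a small sample. This is a sketch under assumptions: it runs on Python's built-in sqlite3 rather than the MPP engine, the table `snaps` and its rows are hypothetical, the threshold is lowered from 10 to 2, and (object_id, snap_time) is assumed to be unique.

```python
import sqlite3

# Hypothetical stand-in for the view used in the test above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snaps (object_id TEXT, device_id TEXT, "
    "close_image_url TEXT, snap_time INTEGER, date_time TEXT)"
)
conn.executemany(
    "INSERT INTO snaps VALUES (?, ?, ?, ?, ?)",
    [
        ("p1", "d1", "u1", 100, "2020-05-01"),
        ("p1", "d2", "u2", 200, "2020-05-01"),
        ("p1", "d3", "u3", 300, "2020-05-01"),
        ("p2", "d1", "u4", 150, "2020-05-01"),
        ("p2", "d4", "u5", 160, "2020-05-01"),
        ("p2", "d5", "u6", 250, "2020-05-02"),
    ],
)

# group by + inner join: max(snap_time) identifies the latest row.
group_sql = """
SELECT a.object_id, a.date_time, a.mtime, b.device_id,
       b.close_image_url, a.total_appear_times
FROM (
    SELECT object_id, date_time,
           count(*) AS total_appear_times,
           max(snap_time) AS mtime
    FROM snaps
    GROUP BY object_id, date_time
    HAVING count(*) >= 2
) a
INNER JOIN snaps b
    ON a.object_id = b.object_id AND a.mtime = b.snap_time
ORDER BY a.object_id, a.date_time
"""

# over + inner join: row_number() = 1 identifies the latest row.
over_sql = """
SELECT a.object_id, a.date_time, a.snap_time, b.device_id,
       b.close_image_url, a.total_appear_times
FROM (
    SELECT object_id, date_time, snap_time, total_appear_times
    FROM (
        SELECT object_id, date_time, snap_time,
               row_number() OVER (PARTITION BY object_id, date_time
                                  ORDER BY snap_time DESC) AS rn,
               count(*) OVER (PARTITION BY object_id, date_time)
                   AS total_appear_times
        FROM snaps
    ) t
    WHERE rn = 1 AND total_appear_times >= 2
) a
INNER JOIN snaps b
    ON a.object_id = b.object_id AND a.snap_time = b.snap_time
ORDER BY a.object_id, a.date_time
"""

group_result = conn.execute(group_sql).fetchall()
over_result = conn.execute(over_sql).fetchall()
print(group_result == over_result)
```

The rewrite works here because the only per-row value needed from the window is the latest snap_time, which max() provides directly. Functions such as lead, which depend on neighbouring rows, have no such aggregate equivalent.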






 

 

 
