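Starting from 2020/04/22 and extending over 3 days, join exposure events and click events on req_id and compute the distribution of the time interval between them, at a granularity of 30 minutes.
Click event fields: req_id, clickTime
Exposure event fields: req_id, exposureTime
We need the percentage of clickTime - exposureTime values falling within each time difference (30 minutes, 60 minutes, 90 minutes, ...).
The result data looks like this: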
Interval (minutes)   Share (%)   Notes
30                   80
60                   95
90                   96
120                  98
150                  99
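Approach:
1. Left join click events to exposure events on req_id, keeping only rows where both the exposure time and the click time exist: click[20200422], exposure[20200420, 20200421, 20200422].
2. Get the click time and the exposure time.
3. Subtract the exposure time from the click time obtained in step 2 to get the time difference.
4. Compute the share of each time difference.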
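Steps 3 and 4 can be sanity-checked locally on a handful of rows before running the full query. The sketch below is only an illustration: `bucket_cumulative_share` is a hypothetical helper name, the input format (one "clickTime exposureTime" pair per line, both in epoch milliseconds) is an assumption, and it assumes the click never precedes the exposure.

```shell
# Hypothetical helper: reads "clickTime exposureTime" pairs (epoch milliseconds,
# whitespace-separated, one pair per line) on stdin and prints
# "<bucket upper bound in minutes> <cumulative percent>" for each non-empty bucket.
bucket_cumulative_share() {
    awk '
    {
        diff_ms = $1 - $2
        # 0-based index of the 30-minute bucket (assumes diff_ms >= 0)
        idx = int(diff_ms / 1000 / 60 / 30)
        upper = (idx + 1) * 30
        count[upper]++
        total++
        if (upper > max) max = upper
    }
    END {
        running = 0
        # Walk the buckets in ascending order so the running sum is cumulative
        for (b = 30; b <= max; b += 30) {
            if (!(b in count)) continue
            running += count[b]
            printf "%d %.2f\n", b, running * 100 / total
        }
    }'
}

# Four sample gaps: 10, 20, 40 and 100 minutes
printf '%s\n' '600000 0' '1200000 0' '2400000 0' '6000000 0' | bucket_cumulative_share
# prints:
# 30 50.00
# 60 75.00
# 120 100.00
```

As in the query, a gap of exactly 30 minutes lands in the 60-minute bucket, because the bucket index is the floor of the gap divided by 30 minutes.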
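function etldata() {
    echo "start etldata"
    # Buckets the click-minus-exposure gap into 30-minute intervals and reports,
    # per interval: its row count, the overall total, and the cumulative share (%).
    # The /1000/60/30 division implies rtime is an epoch timestamp in milliseconds.
    # Clicks without a matching exposure are dropped, as step 1 requires.
    # One row is kept per req_id; ordering by exposure time, then click time, is an
    # assumption (the original window had no ORDER BY, so the kept row was arbitrary).
    # The running sum is ordered by interval so that it is deterministic.
    beeline -e "
select
    interval_min
    ,cnt
    ,total
    ,sum(rate) over(order by interval_min rows between UNBOUNDED PRECEDING and CURRENT ROW) as cum_rate
from(
    select
        interval_min
        ,cnt
        ,total
        ,(cnt*100.00)/total as rate
    from
    (
        select
            interval_min
            ,cnt
            ,sum(cnt) over() as total
        from(
            select
                (gap_idx+1)*30 as interval_min
                ,count(*) as cnt
            from
            (
                select
                    click.req_id
                    ,click.rtime as ctime
                    ,expo.rtime as etime
                    ,floor((click.rtime-expo.rtime)/1000/60/30) as gap_idx
                    ,row_number() over(partition by expo.req_id order by expo.rtime, click.rtime) as rn
                from (
                    select rtime,req_id
                    from tmp.ad_click_b
                    where data_date=2020042406
                ) click
                left join
                (
                    select rtime,req_id
                    from tmp.ad_exposure_b
                    where data_date>=2020042404 and data_date<=2020042406
                ) expo on click.req_id = expo.req_id
                where expo.rtime is not null
            )cli_epo where rn=1 group by gap_idx
        )t2
    )t1
)t order by interval_min asc
"
    echo "end etldata"
}
function main() {
    etldata
}
main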
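The script hard-codes its partition values. Step 1 of the approach describes a click day plus an exposure window of three days ending on that day. The sketch below shows one way to derive such day strings. `day_window` is a hypothetical helper, it assumes GNU `date` (the `-d` option) is available on the host, and it only produces yyyymmdd strings; how those map onto the `data_date` partition values is not stated in the original.

```shell
# Hypothetical helper: prints the N consecutive days ending at END_DAY (yyyymmdd),
# oldest first. Relies on GNU date's -d option (an assumption about the host).
day_window() {
    end_day="$1"
    days="$2"
    i=$((days - 1))
    while [ "$i" -ge 0 ]; do
        # -u keeps the arithmetic in UTC so DST shifts cannot move the day
        date -u -d "${end_day} -${i} days" +%Y%m%d
        i=$((i - 1))
    done
}

day_window 20200422 3
# prints:
# 20200420
# 20200421
# 20200422
```

The first and last lines of the output could then serve as the lower and upper bounds of the exposure filter, with the last line also used as the click day.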