Verifying Data Generated by Spark (Part 1)

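In day-to-day development we often need to verify that the data Spark generates matches the result of joining the same tables in the source Oracle database.
In other words, it is enough to confirm that spark sql --> hive sql --> oracle sql all return the same result. As an example, here is how we checked the data for July 2019:
(1) Run the SQL in the test Oracle database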

select
st.sst_code,
sum(case when o.order_type ='10721023' and pdet.part_type='10741001' then nvl(pdet.promotion_amount*t.fkimg,0) else 0 end) einvoice_total_p,
sum(case when o.order_type ='10721023' and pdet.part_type='10741003' then nvl(pdet.promotion_amount*t.fkimg,0) else 0 end) einvoice_accessory_total_p
from sbpopt.tt_einvoice_sap d left join sbpopt.tt_einvoice_item_sap t on d.vbeln = t.vbeln
left join (select ts.sst_code,tm.sst_name,ts.to_sender from sbpopt.tm_sst_sender ts, sbpopt.tm_sst tm
where ts.sst_code = tm.sst_code group by ts.sst_code,ts.to_sender,tm.sst_name ) st on substr(d.kunag, 6, 10) = st.to_sender
left join sbpopt.tt_part_order_detail det on t.order_code = det.order_code and t.aupos = det.no
left join sbpopt.tt_part_order o on det.order_id = o.order_id
left join sbpopt.tm_promotion_package pac on det.part_code = pac.pp_code and det.sn = pac.pp_version
left join sbpopt.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
left join sbpopt.tm_part_maindata pm on t.matnr = pm.part_code
where d.IS_CANCELED != 'Y' 
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and o.order_type = 10721023
and st.sst_code in ('74308100', '74308310') 
and to_char(d.fkdat,'yyyyMM') = '201907'
group by st.sst_code;

(2) The Spark SQL output is stored in the dws table, so we can aggregate it directly

select 
aa.sst_code ,
sum(aa.einvoice_amount_tyre_p) ,
sum(aa.einvoice_accessory_oil_p)
from dws_tt_einvoice_shipping aa where aa.sst_code in ('74308100', '74308310') 
and aa.billing_date>='20190701' and aa.billing_date<'20190801'
group by aa.rssc_name,aa.rssc_code,aa.sst_name,aa.sst_code;
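The Oracle query filters on a yyyyMM string while the dws query above uses a half-open yyyyMMdd range, so it is worth making sure both cover the same days. A minimal sketch of deriving the range from the month; `month_bounds` is a hypothetical helper, not part of the original job:

```python
from datetime import date


def month_bounds(yyyymm):
    """Return (first day of month, first day of next month) as yyyyMMdd strings."""
    year, month = int(yyyymm[:4]), int(yyyymm[4:6])
    start = date(year, month, 1)
    # Roll over to January of the following year after December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.strftime("%Y%m%d"), end.strftime("%Y%m%d")


lower, upper = month_bounds("201907")
# Matches the filter: billing_date >= '20190701' and billing_date < '20190801'
print(lower, upper)
```

Using a half-open range (>= lower, < upper) avoids having to know how many days the month has.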

For one dealer only one of the two metrics matches, while for the other dealer both metrics match exactly. Why?
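Instead of comparing the two result sets by eye, the check can be scripted. The sketch below assumes each result has been loaded into a dict keyed by sst_code; `diff_metrics`, the metric names, and all the numbers are made up for illustration and are not the real query results:

```python
from decimal import Decimal


def diff_metrics(expected, actual, tolerance=Decimal("0.01")):
    """Return {key: {metric: (expected, actual)}} for every value that differs."""
    mismatches = {}
    for key in sorted(set(expected) | set(actual)):
        exp_row = expected.get(key, {})
        act_row = actual.get(key, {})
        for metric in sorted(set(exp_row) | set(act_row)):
            e, a = exp_row.get(metric), act_row.get(metric)
            # A value missing on either side counts as a mismatch.
            if e is None or a is None or abs(e - a) > tolerance:
                mismatches.setdefault(key, {})[metric] = (e, a)
    return mismatches


# Hypothetical values, only to show the shape of the comparison.
oracle_rows = {
    "74308100": {"metric_1": Decimal("1200.00"), "metric_2": Decimal("350.00")},
    "74308310": {"metric_1": Decimal("800.00"), "metric_2": Decimal("90.00")},
}
dws_rows = {
    "74308100": {"metric_1": Decimal("1200.00"), "metric_2": Decimal("310.00")},
    "74308310": {"metric_1": Decimal("800.00"), "metric_2": Decimal("90.00")},
}

for sst_code, metrics in diff_metrics(oracle_rows, dws_rows).items():
    print(sst_code, metrics)
```

Only the dealer and metric that disagree are reported, which tells us which sst_code to drill into next.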
(3) Query the aggregated data at the wd layer

select
sum(case when pdet.part_type='10741001' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) einvoice_total_p,
sum(case when pdet.part_type='10741003' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) einvoice_accessory_total_p
from (select * from asmp.wd_tt_einvoice_item_sap e) wi
left join (select * from asmp.wd_tt_part_order p ) wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
left join asmp.tm_promotion_package pac on wo.d_part_code = pac.pp_code and wo.d_sn = pac.pp_version
left join asmp.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and wo.order_type = 10721023
and wi.sst_code='74308310' and substr(wi.te_fkdat,1,10)>='20190701' and substr(wi.te_fkdat,1,10)<'20190801'

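Surprisingly, the wd-layer data matches Oracle, so the problem seems to be in the dws-layer code logic.
Could the left join be at fault? Let's verify the data first.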

-- 1. Query wd detail data A
select
substr(wi.te_fkdat,1,10) date,
sum(case when pdet.part_type='10741003' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) accessory_total_p
from (select * from asmp.wd_tt_einvoice_item_sap e) wi
left join  (select * from asmp.wd_tt_part_order p ) wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
left join asmp.tm_promotion_package pac on wo.d_part_code = pac.pp_code and wo.d_sn = pac.pp_version
left join asmp.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and wo.order_type = 10721023
and wi.sst_code='74308100' and substr(wi.te_fkdat,1,7)='2019-07'
group by substr(wi.te_fkdat,1,10)
-- 2. Query wd detail data B
select
substr(wi.te_fkdat,1,10) date,
sum(case when wi.tpm_parttype='10741003' then nvl(wi.tei_zsp_value,0)+nvl(wi.tei_netwr,0) else 0 end) einvoice_accessory_total
from asmp.wd_tt_einvoice_item_sap wi
left join asmp.wd_tt_part_order wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (wi.tpm_parttype = '10741001' or wi.tpm_parttype = '10741003' ) and wo.order_type != 10721023
and wi.sst_code='74308100' and substr(wi.te_fkdat,1,7)='2019-07'
group by substr(wi.te_fkdat,1,10)

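Comparing the two result sets shows that A_SQL returns rows for days that B_SQL does not have: the 6th and the 27th are missing from B_SQL. That confirms the join really is the problem, because a left join keeps only the keys present on its left side and drops any day that exists only on the right. Changing the left join to a full join fixed it.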
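To see the effect in isolation, here is a small sketch using Python's built-in sqlite3 module. The tables `a` and `b` and their values are made up and only stand in for the per-day results of A_SQL and B_SQL; the real dws code is not shown here. Because older SQLite versions lack FULL JOIN, the full join is emulated with a left join plus the unmatched rows from the other side, which is an assumption of this sketch and not how Spark does it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (day TEXT PRIMARY KEY, amount_p REAL);  -- stand-in for the A_SQL result
    CREATE TABLE b (day TEXT PRIMARY KEY, amount REAL);    -- stand-in for the B_SQL result
    INSERT INTO a VALUES ('2019-07-05', 10.0), ('2019-07-06', 20.0), ('2019-07-27', 30.0);
    INSERT INTO b VALUES ('2019-07-05', 100.0);
""")

# Left join driven by b: days that exist only in a (the 6th and 27th) disappear.
left_join = conn.execute("""
    SELECT b.day, b.amount, a.amount_p
    FROM b LEFT JOIN a ON a.day = b.day
    ORDER BY 1
""").fetchall()

# Emulated full join: every row of b, plus the rows of a that found no match in b.
full_join = conn.execute("""
    SELECT b.day, b.amount, a.amount_p
    FROM b LEFT JOIN a ON a.day = b.day
    UNION ALL
    SELECT a.day, NULL, a.amount_p
    FROM a LEFT JOIN b ON b.day = a.day
    WHERE b.day IS NULL
    ORDER BY 1
""").fetchall()

print("left join:", left_join)
print("full join:", full_join)
print("sum of amount_p, left vs full:",
      sum(r[2] or 0 for r in left_join), sum(r[2] or 0 for r in full_join))
```

With the left join the monthly total of amount_p silently loses the two days that have no counterpart in b, which is the same kind of mismatch seen between dws and Oracle. In Spark SQL or Hive the second query can simply be written with a full join.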
