Keeping a full dataset in Hive with a temporary table
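Requirement:
In a Hive environment, table a is the full table and table b is the incremental table (it holds only the data from the current day's run).
Rows that exist in a but not in b must stay in a,
and rows that exist in b but not in a must be appended to a.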
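Approach 1:
Use a left outer join to filter out the rows that exist in a but not in b,
then combine all of b's rows with those filtered rows, so an id present in both tables ends up with b's row.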
---------------------Create sample data (demonstrated in Oracle)
--Build sample tables a (full) and b (increment)
with a as(
select 1 as id, 'Lisi' as name ,'2019-10-01' as time from dual
union all
select 2 as id, 'Wangmen' as name,'2019-10-01' as time from dual
union all
select 3 as id, 'Zhaoliu' as name,'2019-10-01' as time from dual
union all
select 4 as id, 'Pangsan' as name,'2019-10-01' as time from dual
),
b as(
select 1 as id, 'Lisi' as name,'2019-10-03' as time from dual
union all
select 2 as id, 'Wangmen' as name,'2019-10-03' as time from dual
union all
select 5 as id, 'Huangsan' as name,'2019-10-03' as time from dual
)
--Left join keeps the rows found only in a; union all then appends every row of b
select a.id, a.name,a.time
from a
left join b
on a.id = b.id
where b.id is null
union all
select b.id,b.name,b.time
from b
;
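In Hive itself, the merged result would typically be staged before it replaces a, which is presumably what the temporary table in the title refers to. The following is a minimal sketch, not the author's tested code. It assumes a and b already exist as Hive tables with columns id, name and time; the name tmp_a_full is hypothetical, and time is backquoted in case the Hive version in use treats it as a reserved word.

```sql
-- Stage the merged result in a session-scoped temporary table
-- (tmp_a_full is a hypothetical name chosen for this sketch).
CREATE TEMPORARY TABLE tmp_a_full AS
SELECT a.id, a.name, a.`time`
FROM a
LEFT JOIN b
  ON a.id = b.id
WHERE b.id IS NULL            -- rows that exist only in a
UNION ALL
SELECT b.id, b.name, b.`time`
FROM b;                       -- every row of today's increment

-- Replace the contents of a with the staged full set.
INSERT OVERWRITE TABLE a
SELECT id, name, `time`
FROM tmp_a_full;
```

Staging first means the query that reads a has finished before a is overwritten, and the intermediate result can be checked (for example with a row count) before the overwrite runs.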
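Approach 2:
First combine the rows of a and b,
then use the analytic function row_number() to number the rows within each id, ordered by time descending, and keep only the most recent row for each duplicated id.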
---------------------Create sample data (demonstrated in Oracle)
--Build sample tables a (full) and b (increment)
with a as(
select 1 as id, 'Lisi' as name ,'2019-10-01' as time from dual
union all
select 2 as id, 'Wangmen' as name,'2019-10-01' as time from dual
union all
select 3 as id, 'Zhaoliu' as name,'2019-10-01' as time from dual
union all
select 4 as id, 'Pangsan' as name,'2019-10-01' as time from dual
),
b as(
select 1 as id, 'Lisi' as name,'2019-10-02' as time from dual
union all
select 2 as id, 'Wangmen' as name,'2019-10-02' as time from dual
union all
select 5 as id, 'Huangsan' as name,'2019-10-02' as time from dual
)
--Union both tables, then keep the newest row per id (rr = 1)
SELECT id
      ,NAME
      ,TIME
      ,rr
FROM (SELECT id
            ,NAME
            ,TIME
            ,row_number() over(PARTITION BY id ORDER BY TIME DESC) AS rr
      FROM (SELECT a.id
                  ,a.name
                  ,a.time
            FROM a a
            UNION ALL
            SELECT b.id
                  ,b.name
                  ,b.time
            FROM b b) c) d
WHERE d.rr = 1
;
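One edge case the sample data does not exercise: if a and b contain the same id with an identical time value, row_number() may pick either row. A possible refinement, sketched here in Hive syntax, tags each row with its source so that the increment wins ties. The src column is an assumption added for illustration and is not part of the original tables; a and b are assumed to be existing tables with columns id, name and time.

```sql
SELECT id, name, `time`
FROM (
  SELECT id, name, `time`,
         -- newest time first; on equal time, prefer the row from b (src = 1)
         row_number() OVER (PARTITION BY id ORDER BY `time` DESC, src DESC) AS rr
  FROM (
    SELECT a.id, a.name, a.`time`, 0 AS src FROM a   -- existing full table
    UNION ALL
    SELECT b.id, b.name, b.`time`, 1 AS src FROM b   -- today's increment
  ) c
) d
WHERE d.rr = 1;
```

The outer select also drops the helper columns rr and src, so the result has the same shape as table a.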
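Because the sample data is so small, it was not possible to tell which approach performs better; this is left for follow-up.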