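Differences between union and unionAll in Spark.
In Spark SQL, union scans all of the data and then removes duplicate rows;
unionAll (written UNION ALL in SQL), by contrast, simply concatenates the two datasets and returns them, so it is considerably faster.
In practice, unionAll is used more often.
Note that this describes the SQL operators: in the Dataset/DataFrame API, both union and unionAll behave like UNION ALL, and removing duplicates takes an explicit distinct().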
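union returns the union of the two datasets, without duplicate rows. Both sides must have the same number of columns; the column types may differ, provided Spark can coerce them to a common type.
unionAll returns the union of the two datasets, including duplicate rows.
Intersect returns the intersection of the two datasets, without duplicate rows.
Minus (also written EXCEPT) returns the difference of the two datasets, without duplicate rows.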
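The four operators above can be compared side by side on a toy input. The following is only a minimal sketch, assuming a local SparkSession; the object name SetOpsDemo, the views t1 and t2, and the column id are made up for illustration and do not come from the query further down.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SetOpsDemo {
  // Runs each set operation on two tiny hypothetical views and returns
  // the resulting ids, sorted so the comparison is deterministic.
  def run(spark: SparkSession): Seq[(String, Seq[Int])] = {
    import spark.implicits._

    // t1 contains a duplicate (2); t1 and t2 overlap on 2 and 3.
    Seq(1, 2, 2, 3).toDF("id").createOrReplaceTempView("t1")
    Seq(2, 3, 4).toDF("id").createOrReplaceTempView("t2")

    def sortedIds(df: DataFrame): Seq[Int] =
      df.as[Int].collect().sorted.toSeq

    val t1 = spark.table("t1")
    val t2 = spark.table("t2")

    Seq(
      // SQL operators
      "UNION"     -> sortedIds(spark.sql("select id from t1 union select id from t2")),
      "UNION ALL" -> sortedIds(spark.sql("select id from t1 union all select id from t2")),
      "INTERSECT" -> sortedIds(spark.sql("select id from t1 intersect select id from t2")),
      "MINUS"     -> sortedIds(spark.sql("select id from t1 minus select id from t2")),
      // Dataset API: union keeps duplicates, distinct() removes them
      "Dataset union"            -> sortedIds(t1.union(t2)),
      "Dataset union + distinct" -> sortedIds(t1.union(t2).distinct())
    )
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("set-ops-demo")
      .getOrCreate()

    run(spark).foreach { case (name, ids) =>
      println(s"$name: ${ids.mkString(", ")}")
    }
    // UNION: 1, 2, 3, 4
    // UNION ALL: 1, 2, 2, 2, 3, 3, 4
    // INTERSECT: 2, 3
    // MINUS: 1
    // Dataset union: 1, 2, 2, 2, 3, 3, 4
    // Dataset union + distinct: 1, 2, 3, 4

    spark.stop()
  }
}
```

The results are sorted only to make them easy to compare; none of these operators guarantees row order on its own.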
// For each dis value ('1', '2', '3'), keep the top 3 rows per cgi,
// ranked by a different ratio_cgi column in each branch, then stack
// the three results with union all.
spark.sql("""
  (select t.cgi, t.n_cgi
   from (select a.cgi, b.n_cgi, dis, b.left2 aoa,
                row_number() over (partition by a.cgi order by b.left2 desc) as rn
         from nloc_out a
         left join ratio_cgi b on a.cgi = b.cgi
         where a.dis = '1') t
   where t.rn <= 3)
  union all
  (select t.cgi, t.n_cgi
   from (select a.cgi, b.n_cgi, dis, b.right1 aoa,
                row_number() over (partition by a.cgi order by b.right1 desc) as rn
         from nloc_out a
         left join ratio_cgi b on a.cgi = b.cgi
         where a.dis = '2') t
   where t.rn <= 3)
  union all
  select t.cgi, t.n_cgi
  from (select a.cgi, b.n_cgi, dis, (b.left1 + b.right2) aoa,
               row_number() over (partition by a.cgi order by (b.left1 + b.right2) desc) as rn
        from nloc_out a
        left join ratio_cgi b on a.cgi = b.cgi
        where a.dis = '3') t
  where t.rn <= 3
""").createOrReplaceTempView("nloc_ncgis_prb_out")