Hive 0.14: an execution-plan generation bug in multi-level left outer joins

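Preface:

        Over the past few days I ran into a very strange problem: a statement with a three-level left outer join produced different results on Hive 0.9 and Hive 0.14.

        On 0.14, changing the order in which the right-hand tables are joined gives the correct output, but at first I could not tell why. My guess was that the compiler generates a faulty execution plan. (Changing the join order means rewriting A left outer join B left outer join C as A left outer join C left outer join B. The symptom is that a value aggregated in subquery B shows up incorrectly in the final result.)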


The problem in detail:

                      Original statement:

select  A.state_date,
           A.customer,
           A.channel_2,
           A.id,
           A.pid,
           A.type,
           A.pv,
           A.uv,
           A.visits,
           if(C.stay_visits is null,0,C.stay_visits) as stay_visits,
           A.stay_time,
           if(B.bounce is null,0,B.bounce) as bounce
 from
     (select a.state_date,
            a.customer,
            b.url as channel_2,
            b.id,
            b.pid,
            b.type,
            count(1) as pv,
            count(distinct a.gid) uv,
            count(distinct a.session_id) as visits,
            sum(a.stay_time) as stay_time
       from      
               ( select state_date,
                           customer,
                           gid,
                           session_id,
                           ep,
                           stay_time
                    from bdi_fact.mid_pageview_dt0
                    where l_date ='$v_date'
                  )a
                  join
                  (select l_date as state_date ,
                          url,
                          id,
                          pid,
                          type,
                          cid
                   from bdi_fact.frequency_channel
                   where l_date ='$v_date'
                   and type ='2'
                   and dr='0'
                  )b
                   on  a.customer=b.cid 
                   where a.ep  rlike b.url
                   group by a.state_date, a.customer, b.url,b.id,b.pid,b.type
       )A
       
    left outer join
       (   select
                   c.state_date ,
                   c.customer ,
                   d.url as channel_2,
                   d.id,
                   sum(pagedepth) as bounce
            from
                  ( select
                              t1.state_date ,
                              t1.customer ,
                              t1.session_id,
                              t1.ep,
                              t2.pagedepth
                    from          
                         ( select
                                     state_date ,
                                     customer ,
                                     session_id,
                                     exit_url as ep
                          from ods.mid_session_enter_exit_dt0
                          where l_date ='$v_date'
                          )t1
                         join
                          ( select
                                    state_date ,
                                    customer ,
                                    session_id,
                                    pagedepth
                            from ods.mid_session_action_dt0
                            where l_date ='$v_date'
                            and  pagedepth='1'
                          )t2
                         on t1.customer=t2.customer
                         and t1.session_id=t2.session_id
                   )c
                   join
                   (select *
                   from bdi_fact.frequency_channel
                   where l_date ='$v_date'
                   and type ='2'
                   and dr='0'
                   )d
                   on c.customer=d.cid
                   where c.ep  rlike d.url
                   group by  c.state_date,c.customer,d.url,d.id
             )B
             on
         A.customer=B.customer
             and A.channel_2=B.channel_2
             and A.id=B.id
      left outer join
     (
             select e.state_date,
            e.customer,
            f.url as channel_2,
            f.id,
            f.pid,
            f.type,
            count(distinct e.session_id) as stay_visits
       from      
               ( select state_date,
                           customer,
                           gid,
                           session_id,
                           ep,
                           stay_time
                    from bdi_fact.mid_pageview_dt0
                    where l_date ='$v_date'
                  )e
                  join
                  (select l_date as state_date,
                          url,
                          id,
                          pid,
                          type,
                          cid
                   from bdi_fact.frequency_channel
                   where l_date ='$v_date'
                   and type ='2'
                   and dr='0'
                  )f
                   on  e.customer=f.cid 
                   where e.ep  rlike f.url
                   and e.stay_time is not null
                   and e.stay_time <>'0'
                   group by e.state_date, e.customer, f.url,f.id,f.pid,f.type
           )C
    on
        A.customer=C.customer
        and   A.channel_2=C.channel_2
        and   A.id=C.id
        and   A.pid=C.pid
        and   A.type=C.type
 where A.customer='Cdianyingwang' and A.channel_2='http://www.1905.com/film/filmnews/jk/' and A.id='127';

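                  The value B.bounce computed in subquery B is wrong in the final result (for example, the correct value is 500 but 100 is shown).

                  After adjusting the join order, however, the result is correct.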

 

                  Printing the execution plan:

                  [Figure: execution plan output]

 

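                The plan clearly shows that one column is missing from the output of one stage. That stage is the task for subquery B, and the missing column is B.bounce.

                This would explain why the final result is wrong.

                Now that the location of the problem is known, on to the source code.

                Looking at the class ExprNodeColumnDesc.java (responsible for resolving and generating output columns) turned up the following: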

                

                (The arrow in the figure points to my modified code. The original code was:

                        if (tabAlias != null && dest.tabAlias != null && !tabAlias.equals(dest.tabAlias)) {
                            return false;
                        }
                )

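                When deciding whether to output a column, the parser checks whether the current column is the same as a column of the final table (a query can contain many intermediate tables, as in a multi-level join).

                                 (In what follows, "table" always means the table alias.)

                                 If the intermediate table and the final table are both non-null and the current table is not the final table, the check returns false. That is, the current column differs from the final column and has to be output.

                                 There is a hole, though: when the final table is null, the original code returns true and the column is dropped outright,

                                 because a return value of true tells the compiler that the final result already contains this column, so it does not need to be output.

                                 But if the alias of the final table is null, the column of the current table should be output. Otherwise the data has no destination, and its source is lost as well.

                                 On reflection, this is probably just a careless coding mistake. (In a multi-level left outer join, the alias of the target table can be null.)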
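To make the null-alias case concrete, here is a minimal, self-contained sketch. It is not Hive's actual class and not the patch described in this post: `Col`, `isSameOriginal`, and `isSameStrict` are hypothetical names, the column-name comparison is an assumption about what the surrounding method does, and the strict variant is only one possible way to stop a null alias from matching.

```java
import java.util.Objects;

public class AliasCheckSketch {
    // Hypothetical stand-in for a column descriptor; not Hive's real class.
    static final class Col {
        final String column;
        final String tabAlias; // may be null, as in the multi-level join case

        Col(String column, String tabAlias) {
            this.column = column;
            this.tabAlias = tabAlias;
        }
    }

    // Uses the original condition quoted above: the aliases are compared
    // only when BOTH are non-null, so a null alias matches any alias.
    static boolean isSameOriginal(Col cur, Col dest) {
        if (!cur.column.equals(dest.column)) { // assumed name comparison
            return false;
        }
        if (cur.tabAlias != null && dest.tabAlias != null
                && !cur.tabAlias.equals(dest.tabAlias)) {
            return false;
        }
        return true;
    }

    // Hypothetical stricter variant (an assumption, not the author's patch):
    // a null alias is equal only to another null alias.
    static boolean isSameStrict(Col cur, Col dest) {
        return cur.column.equals(dest.column)
                && Objects.equals(cur.tabAlias, dest.tabAlias);
    }

    public static void main(String[] args) {
        Col fromB = new Col("bounce", "B");
        Col noAlias = new Col("bounce", null);

        // Original check: treated as the same column, so it would be dropped.
        System.out.println("original: " + isSameOriginal(fromB, noAlias));
        // Stricter check: treated as different, so it would be kept.
        System.out.println("strict:   " + isSameStrict(fromB, noAlias));
    }
}
```

With the original condition the two descriptors compare as the same column even though only one of them has an alias, which matches the behaviour described above, where B.bounce is judged to be already present and is left out.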

 

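               After modifying the code above, recompiling, and testing on the cluster, the execution plan output is correct and so is the data:

                      Result before the patch (the last column is B.bounce):

                      [Figure: query result before the patch]

                      Result after the patch:

                      [Figure: query result after the patch]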
