Common Joins on Hive Tables
Compared with writing joins directly in MapReduce, Hive's advantage is that it simplifies tedious processing work, and table joins are a case in point. This article covers Hive's four main join types: inner join, outer join, semi join, and map join.
We will use the data in the sales and things tables below to illustrate what each join does.
(Figure 1: sales table)      (Figure 2: things table)
joe     2                    air     1
hank    3                    shuit   2
wangwu  4                    milk    3
lisi    0                    water   4
daic    2
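If you want to follow along, the tables can be created and loaded roughly like this. This is a minimal sketch: the column names (name, item, id), the tab delimiter, and the file paths are assumptions, not taken from the original figures; id is declared STRING to match the key type shown later in the EXPLAIN output.

```sql
-- Assumed schemas for the example tables; adjust column names
-- and delimiters to match your own data files.
CREATE TABLE sales (name STRING, id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE things (item STRING, id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load local tab-separated files (paths are placeholders).
LOAD DATA LOCAL INPATH '/tmp/sales.txt' OVERWRITE INTO TABLE sales;
LOAD DATA LOCAL INPATH '/tmp/things.txt' OVERWRITE INTO TABLE things;
```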
Inner Join
The inner join is the simplest join: it returns only the rows that match in both tables. Tables are joined with the join keyword, and the join predicate follows the on keyword. Equality conditions are specified in the on clause, and multiple conditions can be combined there with and or or.
For example: select sales.*,things.* from sales join things on (sales.id=things.id);
(Figure 3: query result)
joe 2 shuit 2
hank 3 milk 3
wangwu 4 water 4
daic 2 shuit 2
Adding an and condition: select sales.*,things.* from sales join things on (sales.id=things.id and sales.id>2);
(Figure 4: query result)
A single join is usually executed as one MapReduce job; you can use explain to see how many jobs a query compiles into.
For example: explain extended select sales.*,things.* from sales join things on (sales.id=things.id);
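As a rough illustration of job counts (the tables a, b, c here are hypothetical), Hive can merge multiple joins that share the same join key into a single MapReduce job, while joins on different keys are compiled into separate jobs:

```sql
-- One job: both joins use b.key, so Hive can combine them.
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key)
       JOIN c ON (c.key = b.key);

-- Two jobs: the second join uses a different key (b.key2).
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key)
       JOIN c ON (c.key = b.key2);
```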
Outer Join
An outer join also returns the rows that have no match. There are three kinds: left outer join, right outer join, and full outer join.
left outer join
A left outer join returns every row from the left table; for rows with no match, the columns from the joined table are filled with NULL.
For example: select sales.*,things.* from sales left outer join things on (sales.id=things.id);
joe 2 shuit 2
hank 3 milk 3
wangwu 4 water 4
lisi 0 NULL NULL
daic 2 shuit 2
right outer join
Compared with left outer join, right outer join simply swaps the roles of the two tables: every row of the right table is returned.
For example: select sales.*,things.* from sales right outer join things on (sales.id=things.id);
joe 2 shuit 2
daic 2 shuit 2
wangwu 4 water 4
NULL NULL air 1
hank 3 milk 3
full outer join
As the name suggests, it outputs a row for every row in either table; rows with no match on the other side are padded with NULL.
For example:
select sales.*,things.* from sales full outer join things on (sales.id=things.id);
lisi 0 NULL NULL
wangwu 4 water 4
NULL NULL air 1
joe 2 shuit 2
daic 2 shuit 2
hank 3 milk 3
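One point worth remembering with outer joins: a condition on the right-hand table behaves differently in the on clause than in a where clause. In the on clause it only restricts which rows are considered a match (unmatched left rows still come back, NULL-padded), while in the where clause it filters after the join and silently turns the query back into an inner join. A sketch using the example tables:

```sql
-- Still returns every sales row; the things columns are NULL
-- wherever things.id <> 2 or there is no match at all.
SELECT sales.*, things.*
FROM sales LEFT OUTER JOIN things
  ON (sales.id = things.id AND things.id = 2);

-- The WHERE filter discards the NULL-padded rows, so this behaves
-- like an inner join restricted to id = 2.
SELECT sales.*, things.*
FROM sales LEFT OUTER JOIN things
  ON (sales.id = things.id)
WHERE things.id = 2;
```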
Semi Join
A semi join is similar to a left join, except that it does not output values from the right table:
For example: select * from sales left semi join things on (sales.id=things.id);
joe 2
hank 3
wangwu 4
daic 2
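left semi join is Hive's traditional way of expressing an IN/EXISTS-style subquery (older Hive versions did not support IN with a subquery); on recent versions the following form should return the same rows. Note the restriction in both forms: only columns from the left table can appear in the select list.

```sql
-- Equivalent IN-subquery form of the LEFT SEMI JOIN above.
SELECT sales.*
FROM sales
WHERE sales.id IN (SELECT id FROM things);
```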
Map Join
When one table is small enough to fit in memory, such as the sales table, Hive can load it into memory and perform the join inside the map task. To request a map join explicitly, use a hint written as a comment:
For example: select /*+ mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);
joe 2 shuit 2
hank 3 milk 3
wangwu 4 water 4
daic 2 shuit 2
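In modern Hive versions the hint is usually unnecessary: with hive.auto.convert.join enabled (the default since Hive 0.11), the optimizer converts a join to a map join automatically whenever one side is below a size threshold. The parameters below are real Hive settings; the threshold value shown is just an example:

```sql
-- Let Hive convert joins to map joins automatically.
SET hive.auto.convert.join = true;

-- Tables smaller than this many bytes are eligible to be loaded
-- into memory as the small side (example value: ~25 MB).
SET hive.mapjoin.smalltable.filesize = 25000000;

SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
```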
Finally, check the execution plan.
For example: explain select /*+ mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:32
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: things
                  Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 id (type: string)
                        1 id (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:31
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: sales
                  Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 id (type: string)
                        1 id (type: string)
                      outputColumnNames: _col0, _col1, _col5, _col6
                      input vertices:
                        1 Map 2
                      Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string), _col1 (type: string), _col5 (type: string), _col6 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3
                        Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink