Common Joins on Hive Tables
Compared with writing joins directly in MapReduce, Hive's advantage is that it simplifies tedious processing work, and table joins are a case in point. This article covers Hive's four main join types: inner join, outer join, semi join, and map join.
We will use the data in the sales and things tables below to illustrate what each join does.
(Figure 1: sales table)      (Figure 2: things table)
joe     2                    air     1
hank    3                    shuit   2
wangwu  4                    milk    3
lisi    0                    water   4
daic    2
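If you want to follow along, the tables can be created and loaded roughly like this. This is a minimal sketch: the column names (name, item, id), the tab delimiter, and the file paths are assumptions, not taken from the original figures; id is declared STRING to match the key type shown later in the EXPLAIN output.

```sql
-- Assumed schemas for the example tables; adjust column names
-- and delimiters to match your own data files.
CREATE TABLE sales (name STRING, id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE things (item STRING, id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load local tab-separated files (paths are placeholders).
LOAD DATA LOCAL INPATH '/tmp/sales.txt' OVERWRITE INTO TABLE sales;
LOAD DATA LOCAL INPATH '/tmp/things.txt' OVERWRITE INTO TABLE things;
```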
Inner Join
The inner join is the simplest join: it returns only the rows that match in both tables. Tables are joined with the join keyword, and the join predicate follows the on keyword. Equality conditions are specified in the on clause, and multiple conditions can be combined there with and or or.
For example: select sales.*,things.* from sales join things on (sales.id=things.id);
(Figure 3: query result)
joe 2 shuit 2
hank 3 milk 3
wangwu 4 water 4
daic 2 shuit 2
Adding an and condition: select sales.*,things.* from sales join things on (sales.id=things.id and sales.id>2);
(Figure 4: query result)
A single join is usually executed as one MapReduce job; you can use explain to see how many jobs a query compiles into.
For example: explain extended select sales.*,things.* from sales join things on (sales.id=things.id);
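As a rough illustration of job counts (the tables a, b, c here are hypothetical), Hive can merge multiple joins that share the same join key into a single MapReduce job, while joins on different keys are compiled into separate jobs:

```sql
-- One job: both joins use b.key, so Hive can combine them.
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key)
       JOIN c ON (c.key = b.key);

-- Two jobs: the second join uses a different key (b.key2).
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key)
       JOIN c ON (c.key = b.key2);
```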
Outer Join
An outer join also returns the rows that have no match. There are three kinds: left outer join, right outer join, and full outer join.
left outer join
A left outer join returns every row from the left table; for rows with no match, the columns from the joined table are filled with NULL.
For example: select sales.*,things.* from sales left outer join things on (sales.id=things.id);
joe 2 shuit 2
hank 3 milk 3
wangwu 4 water 4
lisi 0 NULL NULL
daic 2 shuit 2
right outer join
Compared with left outer join, right outer join simply swaps the roles of the two tables: every row of the right table is returned.
For example: select sales.*,things.* from sales right outer join things on (sales.id=things.id);
joe 2 shuit 2
daic 2 shuit 2
wangwu 4 water 4
NULL NULL air 1
hank 3 milk 3
full outer join
As the name suggests, it outputs a row for every row in either table; rows with no match on the other side are padded with NULL.
For example:
select sales.*,things.* from sales full outer join things on (sales.id=things.id);
lisi 0 NULL NULL
wangwu 4 water 4
NULL NULL air 1
joe 2 shuit 2
daic 2 shuit 2
hank 3 milk 3
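One point worth remembering with outer joins: a condition on the right-hand table behaves differently in the on clause than in a where clause. In the on clause it only restricts which rows are considered a match (unmatched left rows still come back, NULL-padded), while in the where clause it filters after the join and silently turns the query back into an inner join. A sketch using the example tables:

```sql
-- Still returns every sales row; the things columns are NULL
-- wherever things.id <> 2 or there is no match at all.
SELECT sales.*, things.*
FROM sales LEFT OUTER JOIN things
  ON (sales.id = things.id AND things.id = 2);

-- The WHERE filter discards the NULL-padded rows, so this behaves
-- like an inner join restricted to id = 2.
SELECT sales.*, things.*
FROM sales LEFT OUTER JOIN things
  ON (sales.id = things.id)
WHERE things.id = 2;
```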
Semi Join
A semi join is similar to a left join, except that it does not output values from the right table:
For example: select * from sales left semi join things on (sales.id=things.id);
joe 2
hank 3
wangwu 4
daic 2
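left semi join is Hive's traditional way of expressing an IN/EXISTS-style subquery (older Hive versions did not support IN with a subquery); on recent versions the following form should return the same rows. Note the restriction in both forms: only columns from the left table can appear in the select list.

```sql
-- Equivalent IN-subquery form of the LEFT SEMI JOIN above.
SELECT sales.*
FROM sales
WHERE sales.id IN (SELECT id FROM things);
```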
Map Join
When one table is small enough to fit in memory, such as the sales table, Hive can load it into memory and perform the join inside the map task. To request a map join explicitly, use a hint written as a comment:
For example: select /*+ mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);
joe 2 shuit 2
hank 3 milk 3
wangwu 4 water 4
daic 2 shuit 2
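In modern Hive versions the hint is usually unnecessary: with hive.auto.convert.join enabled (the default since Hive 0.11), the optimizer converts a join to a map join automatically whenever one side is below a size threshold. The parameters below are real Hive settings; the threshold value shown is just an example:

```sql
-- Let Hive convert joins to map joins automatically.
SET hive.auto.convert.join = true;

-- Tables smaller than this many bytes are eligible to be loaded
-- into memory as the small side (example value: ~25 MB).
SET hive.mapjoin.smalltable.filesize = 25000000;

SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
```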
Finally, check the execution plan.
For example: explain select /*+ mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:32
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: things
                  Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 id (type: string)
                        1 id (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:31
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: sales
                  Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 id (type: string)
                        1 id (type: string)
                      outputColumnNames: _col0, _col1, _col5, _col6
                      input vertices:
                        1 Map 2
                      Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string), _col1 (type: string), _col5 (type: string), _col6 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3
                        Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink