0 Introduction
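pandas can combine tables not only with the .concat function but also with the .merge function, which joins two DataFrames on shared key columns or on their indexes.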
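To make the difference between the two functions concrete, here is a minimal sketch. The frames `a` and `b` and their columns `x` and `y` are hypothetical sample data, not taken from the examples in this post. With `axis=1`, `pd.concat` aligns rows on the index labels and never looks at the values in `key`, whereas `pd.merge` lines rows up by those values.

```python
import pandas as pd

# Hypothetical sample data: the same keys, but in a different row order
a = pd.DataFrame({'key': ['K0', 'K1'], 'x': [1, 2]})
b = pd.DataFrame({'key': ['K1', 'K0'], 'y': [10, 20]})

# concat with axis=1 aligns on the index labels (0 and 1 here),
# so row 0 pairs K0 from `a` with K1 from `b`
side_by_side = pd.concat([a, b], axis=1)

# merge pairs rows whose 'key' values are equal
joined = pd.merge(a, b, on='key')

print(side_by_side)
print(joined)
```

In `joined`, K0 is paired with y=20 and K1 with y=10, while `side_by_side` simply keeps both `key` columns next to each other.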
1 Merging DataFrames with .merge
import pandas as pd
Create two DataFrames:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
Call .merge to join the two tables on the 'key' column:
res = pd.merge(left,right,on='key')
res
|   | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |
| 2 | K2 | A2 | B2 | C2 | D2 |
| 3 | K3 | A3 | B3 | C3 | D3 |
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K3'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K3 K0 C3 D3
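Call .merge on several keys with how set to 'outer'. Rows whose (key1, key2) combination appears in both tables are joined; rows whose combination appears in only one table are kept, and the columns coming from the other table are filled with missing values (NaN).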
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='outer')  # how defaults to 'inner'
res
|   | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K3 | K0 | NaN | NaN | C3 | D3 |
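With how set to 'inner', only rows whose (key1, key2) combination appears in both tables are kept, so the merge introduces no NaN rows: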
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='inner')  # how defaults to 'inner'
res
|   | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
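In the inner result, the row with A2 appears twice because right contains two rows with the key (K1, K0). If that kind of duplication would be a surprise in your own data, one option is the validate parameter of pd.merge. The sketch below assumes you would rather have the merge fail than duplicate rows; the flag name `duplicate_keys_found` is only illustrative.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K3'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

try:
    # 'one_to_one' requires the key combination to be unique in both frames
    pd.merge(left, right, on=['key1', 'key2'], how='inner',
             validate='one_to_one')
    duplicate_keys_found = False
except pd.errors.MergeError as err:
    # right holds (K1, K0) twice, so the check fails
    duplicate_keys_found = True
    print(err)

print(duplicate_keys_found)
```

Using validate='one_to_many' instead would accept this data, since the keys in left are unique.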
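With how set to 'left', every row of the left DataFrame is kept. Matching rows from the right DataFrame are joined on, left rows without a match get NaN in the right-hand columns, and keys that appear only in the right DataFrame are dropped: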
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='left')  # how defaults to 'inner'
res
|   | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
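The list of how options also includes 'right', which is not run above. As a rough side-by-side check, the sketch below rebuilds the same two frames, counts the rows each option returns, and then prints the 'right' result. The names `row_counts` and `res_right` are made up for this illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K3'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Number of rows returned by each join type
row_counts = {how: len(pd.merge(left, right, on=['key1', 'key2'], how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)  # {'inner': 3, 'left': 5, 'right': 4, 'outer': 6}

# 'right' mirrors 'left': every row of right is kept, and the
# unmatched key (K3, K0) gets NaN in columns A and B
res_right = pd.merge(left, right, on=['key1', 'key2'], how='right')
print(res_right)
```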
Adding the indicator parameter to .merge appends a _merge column that records where each row came from: both, left_only, or right_only.
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='outer',indicator=True)  # show merge information
res
|   | key1 | key2 | A | B | C | D | _merge |
|---|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
| 5 | K3 | K0 | NaN | NaN | C3 | D3 | right_only |
Passing a string to indicator instead of True sets the name of that column:
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='outer',indicator='indicator_column')  # show merge information
res
|   | key1 | key2 | A | B | C | D | indicator_column |
|---|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
| 5 | K3 | K0 | NaN | NaN | C3 | D3 | right_only |
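One common use of the indicator column is to pull out the rows that exist in only one of the two tables. The following is a small sketch of that idea using the default `_merge` column name; the variable `left_only` is introduced here purely for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K3'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

res = pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)

# Keep the rows whose key combination was found only in left,
# then drop the helper column
left_only = res[res['_merge'] == 'left_only'].drop(columns='_merge')
print(left_only)
```

Here `left_only` holds the rows for A1 and A3, the two key combinations that right does not contain.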
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']},
                    index=['K0', 'K1', 'K2', 'K3'])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']},
                     index=['K0', 'K1', 'K2', 'K3'])
print(left)
print(right)
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
C D
K0 C0 D0
K1 C1 D1
K2 C2 D2
K3 C3 D3
Like .concat, .merge can also join on the index instead of on a column, by setting left_index=True and right_index=True:
res = pd.merge(left,right,left_index=True,right_index=True,how='outer')
res
|   | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | C1 | D1 |
| K2 | A2 | B2 | C2 | D2 |
| K3 | A3 | B3 | C3 | D3 |
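For index-on-index joins like this one, pandas also offers DataFrame.join as a shorter spelling. The sketch below rebuilds the two indexed frames and compares both routes; `via_merge` and `via_join` are illustrative names, and the comparison is only meant to show that the two calls give the same table for this data.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']},
                    index=['K0', 'K1', 'K2', 'K3'])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']},
                     index=['K0', 'K1', 'K2', 'K3'])

# Index-on-index join written with pd.merge
via_merge = pd.merge(left, right, left_index=True, right_index=True,
                     how='outer')

# The same join written with DataFrame.join, which joins on the index by default
via_join = left.join(right, how='outer')

print(via_merge)
print(via_join)
```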
boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]})
girls = pd.DataFrame({'k':['K0','K0','K3'],'age':[4,5,6]})
print(boys)
print(girls)
k age
0 K0 1
1 K1 2
2 K2 3
k age
0 K0 4
1 K0 5
2 K3 6
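When both tables contain a non-key column with the same name, the suffixes parameter appends a suffix to each copy in the merged table so that the columns from the two tables can be told apart: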
res = pd.merge(boys,girls,on='k',suffixes=['_boy','_girl'],how='outer')
res
|   | k | age_boy | age_girl |
|---|---|---|---|
| 0 | K0 | 1.0 | 4.0 |
| 1 | K0 | 1.0 | 5.0 |
| 2 | K1 | 2.0 | NaN |
| 3 | K2 | 3.0 | NaN |
| 4 | K3 | NaN | 6.0 |
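Two details of this result are worth a quick check: what the columns are called when suffixes is left out, and why the ages print as 1.0 rather than 1. The sketch below rebuilds boys and girls; the names `default`, `named`, and `inner` are illustrative only.

```python
import pandas as pd

boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})

# Without suffixes, pandas falls back to '_x' and '_y'
default = pd.merge(boys, girls, on='k', how='outer')
print(list(default.columns))  # ['k', 'age_x', 'age_y']

# The outer join introduces NaN, so the integer ages become float64
named = pd.merge(boys, girls, on='k', how='outer',
                 suffixes=('_boy', '_girl'))
print(named.dtypes)

# An inner join has no missing values here, so the ages stay integers
inner = pd.merge(boys, girls, on='k', how='inner',
                 suffixes=('_boy', '_girl'))
print(inner.dtypes)
```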