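3.8 Combining Datasets: Merge and Join
One essential feature of Pandas is its high-performance, in-memory join and merge operations. The main interface for them is the pd.merge function.
3.8.1 Relational Algebra
The theoretical foundation of merging is relational algebra.
3.8.2 Categories of Joins
pd.merge implements three types of join: one-to-one, many-to-one, and many-to-many.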
import pandas as pd
import numpy as np

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""

    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
One-to-one joins
This is the simplest type of merge, and it closely resembles the column-wise concatenation introduced in Section 3.7.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
df1
| | employee | group |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2
| | employee | hire_date |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
To combine these two DataFrames into one, use the pd.merge function:
df3 = pd.merge(df1, df2)
df3
| | employee | group | hire_date |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
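pd.merge automatically recognizes the employee column shared by both DataFrames and joins on it as the key, producing a new DataFrame. The row order of the key does not have to match between the inputs. The row indices of the inputs are discarded and a new row index is generated.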
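When a merge is expected to be strictly one-to-one, it can be useful to have Pandas check that assumption. The sketch below rebuilds df1 and df2 and passes the validate parameter of pd.merge, which raises an error if the key is duplicated on either side. Treat it as an illustration of the behavior described above rather than a required step.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# validate='one_to_one' asks merge to raise an error if the key column
# contains duplicates in either input, instead of silently adding rows.
df3 = pd.merge(df1, df2, validate='one_to_one')

# The shared column 'employee' is used as the key; rows are matched by
# value, not by position, and the result gets a fresh integer index.
print(df3)
```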
Many-to-one joins
In a many-to-one join, one of the two key columns contains duplicate values. The resulting DataFrame preserves those duplicates. For example:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
df3
| | employee | group | hire_date |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
df4
| | group | supervisor |
---|---|---|
0 | Accounting | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
pd.merge(df3, df4)
| | employee | group | hire_date | supervisor |
---|---|---|---|---|
0 | Bob | Accounting | 2008 | Carly |
1 | Jake | Engineering | 2012 | Guido |
2 | Lisa | Engineering | 2004 | Guido |
3 | Sue | HR | 2014 | Steve |
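The resulting DataFrame has an additional supervisor column, in which some values are repeated as required by the input data.
Many-to-many joins
If the key column in both the left and the right input contains duplicates, the result is a many-to-many join. For example:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
df1
| | employee | group |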
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df5
| | group | skills |
---|---|---|
0 | Accounting | math |
1 | Accounting | spreadsheets |
2 | Engineering | coding |
3 | Engineering | linux |
4 | HR | spreadsheets |
5 | HR | organization |
pd.merge(df1, df5)
| | employee | group | skills |
---|---|---|---|
0 | Bob | Accounting | math |
1 | Bob | Accounting | spreadsheets |
2 | Jake | Engineering | coding |
3 | Jake | Engineering | linux |
4 | Lisa | Engineering | coding |
5 | Lisa | Engineering | linux |
6 | Sue | HR | spreadsheets |
7 | Sue | HR | organization |
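The eight rows in the many-to-many result above can be explained by counting: for each key value, the output contains (number of matching left rows) × (number of matching right rows) rows. The following sketch checks this for df1 and df5; the variable names left_counts, right_counts, and expected_rows are chosen here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

merged = pd.merge(df1, df5)

# Count how often each key value occurs on each side.
left_counts = df1['group'].value_counts()
right_counts = df5['group'].value_counts()

# Per key: left occurrences times right occurrences, summed over all keys.
# Accounting: 1*2, Engineering: 2*2, HR: 1*2
expected_rows = int((left_counts * right_counts).sum())

print(expected_rows, len(merged))
```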
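These three types of join can be combined with other Pandas tools to implement a wide range of functionality. In practice, however, real datasets are rarely as clean as the ones in these examples. The following sections cover further options of pd.merge that help handle the problems that arise when joining data.
3.8.3 Specifying the Merge Key
By default, pd.merge uses the one or more column names shared by both inputs as the key. However, the columns to merge on often have different names in the two inputs, so pd.merge provides parameters to handle this.
The on parameter
The simplest approach is to set the on parameter to a column name string or a list of column names:
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")
df1
| | employee | group |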
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2
| | employee | hire_date |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
pd.merge(df1, df2, on='employee')
| | employee | group | hire_date |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
This parameter can be used only when both DataFrames contain the specified column name.
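The on parameter also accepts a list of column names, in which case rows match only when all listed columns agree. The sketch below uses two small hypothetical tables (sales and targets, which are not part of the original examples) to show a merge on two key columns.

```python
import pandas as pd

# Hypothetical data, used only to illustrate a multi-column key.
sales = pd.DataFrame({'year': [2020, 2020, 2021],
                      'region': ['East', 'West', 'East'],
                      'sales': [100, 150, 120]})
targets = pd.DataFrame({'year': [2020, 2021, 2021],
                        'region': ['East', 'East', 'West'],
                        'target': [90, 110, 140]})

# A row is kept only if both 'year' and 'region' match (default inner join).
merged = pd.merge(sales, targets, on=['year', 'region'])
print(merged)
```

Only the (2020, East) and (2021, East) combinations appear in both tables, so the result has two rows.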
The left_on and right_on parameters
Sometimes the two datasets to merge use different column names for the key. In this case, the left_on and right_on parameters specify the two column names:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')
df1
| | employee | group |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df3
| | name | salary |
---|---|---|
0 | Bob | 70000 |
1 | Jake | 80000 |
2 | Lisa | 120000 |
3 | Sue | 90000 |
pd.merge(df1, df3, left_on="employee", right_on="name")
| | employee | group | name | salary |
---|---|---|---|---|
0 | Bob | Accounting | Bob | 70000 |
1 | Jake | Engineering | Jake | 80000 |
2 | Lisa | Engineering | Lisa | 120000 |
3 | Sue | HR | Sue | 90000 |
The result contains a redundant column, which can be removed with the DataFrame drop method:
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
| | employee | group | salary |
---|---|---|---|
0 | Bob | Accounting | 70000 |
1 | Jake | Engineering | 80000 |
2 | Lisa | Engineering | 120000 |
3 | Sue | HR | 90000 |
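An alternative to dropping the redundant column afterwards is to rename the key column before merging, so that both inputs share a name and on can be used. This is a sketch of that variation, not the approach taken in the text; salaries is a stand-in name for the df3 table defined above.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
salaries = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'salary': [70000, 80000, 120000, 90000]})

# Rename 'name' to 'employee' so the key has the same name on both sides;
# rename returns a new DataFrame and leaves 'salaries' unchanged.
merged = pd.merge(df1, salaries.rename(columns={'name': 'employee'}),
                  on='employee')
print(merged)
```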
The left_index and right_index parameters
Sometimes you need to merge on an index rather than on a column:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
df1a
| | group |
---|---|
employee | |
Bob | Accounting |
Jake | Engineering |
Lisa | Engineering |
Sue | HR |
df2a
| | hire_date |
---|---|
employee | |
Lisa | 2004 |
Bob | 2008 |
Jake | 2012 |
Sue | 2014 |
Setting the left_index and/or right_index parameters of pd.merge uses the index as the merge key:
display('df1a', 'df2a', "pd.merge(df1a, df2a, left_index=True, right_index=True)")
df1a
| | group |
---|---|
employee | |
Bob | Accounting |
Jake | Engineering |
Lisa | Engineering |
Sue | HR |
df2a
| | hire_date |
---|---|
employee | |
Lisa | 2004 |
Bob | 2008 |
Jake | 2012 |
Sue | 2014 |
pd.merge(df1a, df2a, left_index=True, right_index=True)
| | group | hire_date |
---|---|---|
employee | ||
Bob | Accounting | 2008 |
Jake | Engineering | 2012 |
Lisa | Engineering | 2004 |
Sue | HR | 2014 |
For convenience, DataFrame implements the join method, which merges on the index by default:
display('df1a', 'df2a', 'df1a.join(df2a)')
df1a
| | group |
---|---|
employee | |
Bob | Accounting |
Jake | Engineering |
Lisa | Engineering |
Sue | HR |
df2a
| | hire_date |
---|---|
employee | |
Lisa | 2004 |
Bob | 2008 |
Jake | 2012 |
Sue | 2014 |
df1a.join(df2a)
| | group | hire_date |
---|---|---|
employee | ||
Bob | Accounting | 2008 |
Jake | Engineering | 2012 |
Lisa | Engineering | 2004 |
Sue | HR | 2014 |
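In the example above, join and pd.merge give the same result because both indices contain exactly the same labels. The two differ in their default join type: DataFrame.join defaults to a left join, while pd.merge defaults to an inner join. The sketch below makes the difference visible with a hypothetical table, partial, that covers only two of the four employees.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df1a = df1.set_index('employee')

# Hypothetical table with hire dates for only two of the four employees.
partial = pd.DataFrame({'hire_date': [2004, 2008]},
                       index=pd.Index(['Lisa', 'Bob'], name='employee'))

# join defaults to how='left': every row of df1a is kept,
# and missing hire dates become NaN.
joined = df1a.join(partial)

# merge defaults to how='inner': only labels present in both indices remain.
merged = pd.merge(df1a, partial, left_index=True, right_index=True)

print(joined)
print(merged)
```

Passing how='inner' to join, or how='left' to pd.merge, makes the two calls equivalent again.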
To mix indices and columns, combine left_index with right_on, or left_on with right_index:
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
df1a
| | group |
---|---|
employee | |
Bob | Accounting |
Jake | Engineering |
Lisa | Engineering |
Sue | HR |
df3
| | name | salary |
---|---|---|
0 | Bob | 70000 |
1 | Jake | 80000 |
2 | Lisa | 120000 |
3 | Sue | 90000 |
pd.merge(df1a, df3, left_index=True, right_on='name')
| | group | name | salary |
---|---|---|---|
0 | Accounting | Bob | 70000 |
1 | Engineering | Jake | 80000 |
2 | Engineering | Lisa | 120000 |
3 | HR | Sue | 90000 |
All of these parameters also work with multiple indices and multiple columns.
3.8.4 Specifying Set Arithmetic for Joins
The set arithmetic used is an important consideration in any join. It comes into play when a value appears in one key column but not in the other. For example:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
df6
| | name | food |
---|---|---|
0 | Peter | fish |
1 | Paul | beans |
2 | Mary | bread |
df7
| | name | drink |
---|---|---|
0 | Mary | wine |
1 | Joseph | beer |
pd.merge(df6, df7)
| | name | food | drink |
---|---|---|---|
0 | Mary | bread | wine |
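The two datasets have only one name entry in common: Mary. By default, the result contains the intersection of the two sets of inputs; this is known as an inner join. The how parameter sets the join type explicitly, and it defaults to 'inner':
pd.merge(df6, df7, how='inner')
| | name | food | drink |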
---|---|---|---|
0 | Mary | bread | wine |
The other options for the how parameter are 'outer', 'left', and 'right'.
An outer join returns the union of the two inputs and fills all missing values with NaN:
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")
df6
| | name | food |
---|---|---|
0 | Peter | fish |
1 | Paul | beans |
2 | Mary | bread |
df7
| | name | drink |
---|---|---|
0 | Mary | wine |
1 | Joseph | beer |
pd.merge(df6, df7, how='outer')
| | name | food | drink |
---|---|---|---|
0 | Peter | fish | NaN |
1 | Paul | beans | NaN |
2 | Mary | bread | wine |
3 | Joseph | NaN | beer |
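With an outer join it is often helpful to know which input each row came from. pd.merge provides an indicator parameter for this; when set to True it adds a _merge column with the values left_only, right_only, or both. The sketch below applies it to df6 and df7. The helper dictionary origin is introduced here only to make the result easy to read.

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])

# indicator=True adds a '_merge' column recording the source of each row.
outer = pd.merge(df6, df7, how='outer', indicator=True)

# Map each name to its origin, independent of the row order of the result.
origin = dict(zip(outer['name'], outer['_merge'].astype(str)))
print(origin)
```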
The left join and right join keep every entry of the left input and of the right input, respectively. For example:
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
df6
| | name | food |
---|---|---|
0 | Peter | fish |
1 | Paul | beans |
2 | Mary | bread |
df7
| | name | drink |
---|---|---|
0 | Mary | wine |
1 | Joseph | beer |
pd.merge(df6, df7, how='left')
| | name | food | drink |
---|---|---|---|
0 | Peter | fish | NaN |
1 | Paul | beans | NaN |
2 | Mary | bread | wine |
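The output rows now correspond to the entries of the left input. With how='right', the output rows correspond to the entries of the right input instead.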
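For completeness, here is the right join on the same two tables, which the text describes but does not show. It is a short sketch using df6 and df7 as defined earlier.

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])

# how='right' keeps every row of df7; Joseph has no match in df6,
# so his 'food' value is filled with NaN.
right = pd.merge(df6, df7, how='right')
print(right)
```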
3.8.5 Overlapping Column Names: The suffixes Parameter
Finally, the two input DataFrames may have conflicting column names. For example:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
df8
| | name | rank |
---|---|---|
0 | Bob | 1 |
1 | Jake | 2 |
2 | Lisa | 3 |
3 | Sue | 4 |
df9
| | name | rank |
---|---|---|
0 | Bob | 3 |
1 | Jake | 1 |
2 | Lisa | 4 |
3 | Sue | 2 |
pd.merge(df8, df9, on="name")
| | name | rank_x | rank_y |
---|---|---|---|
0 | Bob | 1 | 3 |
1 | Jake | 2 | 1 |
2 | Lisa | 3 | 4 |
3 | Sue | 4 | 2 |
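Because the output would have two conflicting column names, pd.merge automatically appends the suffixes _x and _y to make them unique. Custom suffixes can be specified with the suffixes parameter:
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')
df8
| | name | rank |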
---|---|---|
0 | Bob | 1 |
1 | Jake | 2 |
2 | Lisa | 3 |
3 | Sue | 4 |
df9
| | name | rank |
---|---|---|
0 | Bob | 3 |
1 | Jake | 1 |
2 | Lisa | 4 |
3 | Sue | 2 |
pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])
| | name | rank_L | rank_R |
---|---|---|---|
0 | Bob | 1 | 3 |
1 | Jake | 2 | 1 |
2 | Lisa | 3 | 4 |
3 | Sue | 4 | 2 |
The suffixes parameter works with any join type, and it also applies when there are multiple overlapping column names.
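To illustrate the case of several overlapping columns, the sketch below extends df8 and df9 with a second shared column. The score column and its values are hypothetical and added only for this example. The suffixes are applied to every overlapping non-key column, while the key column name is left unchanged.

```python
import pandas as pd

# 'score' is a hypothetical extra column, not part of the original tables.
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4],
                    'score': [90, 85, 80, 75]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2],
                    'score': [70, 95, 60, 88]})

# Both 'rank' and 'score' overlap, so each receives the custom suffixes.
merged = pd.merge(df8, df9, on='name', suffixes=('_L', '_R'))
print(merged.columns.tolist())
```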
3.8.6 Example: US States Data
Omitted.