3.8 Combining Datasets: Merge and Join

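One essential feature of pandas is its high-performance, in-memory join and merge operations. The main interface for these operations is the pd.merge function.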

3.8.1 Relational Algebra

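The theoretical foundation of merging is relational algebra, the formal set of rules for manipulating relational data that also underlies the operations available in most databases.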

3.8.2 Categories of Joins

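pd.merge implements three types of joins: one-to-one, many-to-one, and many-to-many. Which one is performed depends on the input data, not on a parameter you set.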

import pandas as pd
import numpy as np

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

One-to-one joins

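The one-to-one join is the simplest type of merge, and it is very similar to the column-wise concatenation introduced in Section 3.7.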

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')

df1

  employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df2

  employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014

To combine the two DataFrames above into a single DataFrame, use the pd.merge function:

df3 = pd.merge(df1, df2)
df3
  employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

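pd.merge recognizes that both DataFrames have an employee column and automatically joins on it as the key, producing a new DataFrame. The rows are matched by key value, so the different ordering of employee in df1 and df2 does not matter. When merging on columns, the original row indices are discarded and a new row index is generated.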
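
The two behaviours described above (matching by key value rather than by position, and discarding the original index) can be checked with a short sketch. The custom index labels 'a' to 'd' are an assumption added here purely for illustration; they are not part of the original example.

```python
import pandas as pd

# Same data as df1/df2 above, but df1 gets custom index labels
# (hypothetical, added only to make the index reset visible).
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']},
                   index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

merged = pd.merge(df1, df2)

# Lisa is row 'c' in df1 but row 0 in df2; the merge still pairs them by value.
lisa_hire_date = merged.loc[merged['employee'] == 'Lisa', 'hire_date'].iloc[0]

# The labels 'a'..'d' are gone; the result carries a fresh integer index.
result_index = list(merged.index)
print(lisa_hire_date, result_index)
```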

Many-to-one joins

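In a many-to-one join, one of the two key columns contains duplicate entries. The resulting DataFrame preserves those duplicate entries. For example: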

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

df3

  employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

df4

  group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve

pd.merge(df3, df4)

  employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve

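The resulting DataFrame has an additional supervisor column, in which a value is repeated wherever the input data requires it (here, Guido appears once for each Engineering employee).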
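
If you expect a merge to be many-to-one, pd.merge can check that assumption through its validate parameter, which raises pandas.errors.MergeError when the keys do not have the declared shape. The sketch below is illustrative only: the variable names and the extra 'Dana' row are made up for the demonstration.

```python
import pandas as pd
from pandas.errors import MergeError

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# Passes: 'group' repeats on the left but is unique on the right.
checked = pd.merge(employees, supervisors, on='group', validate='many_to_one')

# Hypothetical bad input: a second supervisor row for HR makes the
# right-hand key non-unique, so the many-to-one check should fail.
extra = pd.DataFrame({'group': ['HR'], 'supervisor': ['Dana']})
bad_supervisors = pd.concat([supervisors, extra], ignore_index=True)

try:
    pd.merge(employees, bad_supervisors, on='group', validate='many_to_one')
    raised = False
except MergeError:
    raised = True

print(len(checked), raised)
```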

Many-to-many joins

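If the key column in both the left and the right input contains duplicates, the result is a many-to-many join: each left row is paired with every right row that has the same key. For example: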

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")

df1

  employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df5

  group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization

pd.merge(df1, df5)

  employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization
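
One way to reason about the size of a many-to-many result is to multiply, for each key, the number of matching rows on the left by the number on the right. The following sketch works that arithmetic out for df1 and df5; the helper variable names are assumptions introduced here.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

# Count how often each key occurs on each side.
left_counts = df1['group'].value_counts()
right_counts = df5['group'].value_counts()

# Per-key product; keys missing on one side become NaN and are dropped,
# which mirrors the default inner join.
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = pd.merge(df1, df5)
print(expected_rows, len(merged))
```

Checking the row count this way before and after a merge is a cheap guard against accidental row multiplication on large tables.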

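These three types of joins can be combined with other pandas tools to implement a wide range of functionality. In practice, however, real datasets are rarely as clean as the ones in these examples. The following sections cover further pd.merge options that help deal with common problems in joins.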

3.8.3 Specifying the Merge Key

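By default, pd.merge uses the column or columns that have the same name in both inputs as the key. The columns to be matched often do not share a name, however, so pd.merge provides several parameters to handle this.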

The on parameter

The simplest approach is to set the on parameter to a column name (a string) or to a list of column names:

display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

df1

  employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df2

  employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014

pd.merge(df1, df2, on='employee')

  employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

This parameter can be used only when both DataFrames contain the specified column name.
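
The list form of on is not shown above, so here is a minimal sketch of it. The sales and staff tables and their columns are invented for the illustration; a row is matched only when every listed key column agrees.

```python
import pandas as pd

# Hypothetical tables keyed by the pair (store, year).
sales = pd.DataFrame({'store': ['A', 'A', 'B'],
                      'year': [2020, 2021, 2020],
                      'revenue': [10, 12, 7]})
staff = pd.DataFrame({'store': ['A', 'A', 'B'],
                      'year': [2020, 2021, 2021],
                      'headcount': [3, 4, 2]})

# Both 'store' and 'year' must match. (B, 2020) and (B, 2021) do not pair up,
# so the default inner join drops them.
both = pd.merge(sales, staff, on=['store', 'year'])
print(both)
```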

The left_on and right_on parameters

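Sometimes you need to merge two datasets whose key columns have different names. In that case, use the left_on and right_on parameters to name the key column on each side (note that df3 is redefined here):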

df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')

df1

  employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df3

  name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000

pd.merge(df1, df3, left_on="employee", right_on="name")

  employee group name salary
0 Bob Accounting Bob 70000
1 Jake Engineering Jake 80000
2 Lisa Engineering Lisa 120000
3 Sue HR Sue 90000

The result contains a redundant column (name duplicates employee), which can be removed with the DataFrame drop method:

pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
  employee group salary
0 Bob Accounting 70000
1 Jake Engineering 80000
2 Lisa Engineering 120000
3 Sue HR 90000
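
An alternative to dropping the redundant column afterwards is to rename the key before merging, so that on can be used and the duplicate column never appears. This is a sketch of that variant, not something the original text prescribes.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# Approach from the text: merge on differently named keys, then drop the copy.
via_drop = pd.merge(df1, df3, left_on='employee', right_on='name').drop('name', axis=1)

# Variant: align the key name first (rename returns a new DataFrame,
# df3 itself is left unchanged), then merge on the shared name.
via_rename = pd.merge(df1, df3.rename(columns={'name': 'employee'}), on='employee')

print(via_rename)
```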

The left_index and right_index parameters

Rather than merging on a column, you sometimes need to merge on an index:

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')

df1a

  group
employee  
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df2a

  hire_date
employee  
Lisa 2004
Bob 2008
Jake 2012
Sue 2014

You can use the index as the merge key by setting the left_index and/or right_index flags in pd.merge:

display('df1a', 'df2a',
        "pd.merge(df1a, df2a, left_index=True, right_index=True)")

df1a

  group
employee  
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df2a

  hire_date
employee  
Lisa 2004
Bob 2008
Jake 2012
Sue 2014

pd.merge(df1a, df2a, left_index=True, right_index=True)

  group hire_date
employee    
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014

For convenience, DataFrame also implements the join method, which merges on indices by default:

display('df1a', 'df2a', 'df1a.join(df2a)')

df1a

  group
employee  
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df2a

  hire_date
employee  
Lisa 2004
Bob 2008
Jake 2012
Sue 2014

df1a.join(df2a)

  group hire_date
employee    
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
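
In the example above join and pd.merge give the same rows because every employee appears in both inputs. One difference worth keeping in mind is the default join type: DataFrame.join defaults to how='left', while pd.merge defaults to how='inner'. The sketch below makes this visible with a shortened right-hand table (df2b is a hypothetical name introduced here).

```python
import pandas as pd

df1a = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']}
                    ).set_index('employee')

# Hypothetical right-hand table that only knows about two employees.
df2b = pd.DataFrame({'employee': ['Lisa', 'Bob'],
                     'hire_date': [2004, 2008]}).set_index('employee')

joined = df1a.join(df2b)   # how='left' by default: all four left rows are kept
merged = pd.merge(df1a, df2b, left_index=True, right_index=True)  # inner: matches only

print(len(joined), len(merged))
```

Passing how='inner' to join, or how='left' to pd.merge, makes the two calls agree.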

To mix indices and columns, combine left_index with right_on, or left_on with right_index:

display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")

df1a

  group
employee  
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df3

  name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000

pd.merge(df1a, df3, left_index=True, right_on='name')

  group name salary
0 Accounting Bob 70000
1 Engineering Jake 80000
2 Engineering Lisa 120000
3 HR Sue 90000

All of these parameters also work with multiple indices and/or multiple columns.
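
As a rough sketch of the multi-key case, the following merges two key columns on the left against a two-level MultiIndex on the right. The tables and their values are invented for the illustration.

```python
import pandas as pd

# Hypothetical left table with a two-column key.
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1'],
                     'key2': ['a', 'b', 'a'],
                     'value': [1, 2, 3]})

# Hypothetical right table whose index is a two-level MultiIndex
# holding the same kind of (key1, key2) pairs.
right = pd.DataFrame({'extra': [10, 20, 30]},
                     index=pd.MultiIndex.from_tuples([('K0', 'a'),
                                                      ('K0', 'b'),
                                                      ('K1', 'a')]))

# The list passed to left_on is matched, in order, against the index levels.
result = pd.merge(left, right, left_on=['key1', 'key2'], right_index=True)

pairs = dict(zip(result['value'], result['extra']))
print(pairs)
```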

3.8.4 Specifying Set Arithmetic for Joins

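The set arithmetic used is an important consideration in any join. It comes into play whenever a value appears in one key column but not in the other. For example: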

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')

df6

  name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

  name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7)

  name food drink
0 Mary bread wine

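The two datasets have only one name entry in common: Mary. By default, the result contains the intersection of the two sets of keys, which is known as an inner join. The join type is set with the how parameter, which defaults to 'inner':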

pd.merge(df6, df7, how='inner')
  name food drink
0 Mary bread wine

The how parameter also accepts 'outer', 'left', and 'right'.

An outer join returns the union of the keys from both inputs and fills every missing value with NaN:

display('df6', 'df7', "pd.merge(df6, df7, how='outer')")

df6

  name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

  name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7, how='outer')

  name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
3 Joseph NaN beer
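
When inspecting an outer join it can help to see which input each row came from. pd.merge offers the indicator parameter for this; with indicator=True it adds a _merge column containing 'left_only', 'right_only', or 'both'. A small sketch, where the origin dictionary is a helper introduced here:

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

# indicator=True appends a categorical '_merge' column to the result.
tagged = pd.merge(df6, df7, how='outer', indicator=True)

# Map each name to its origin; this does not depend on the row order,
# which can differ between pandas versions for outer joins.
origin = tagged.set_index('name')['_merge'].astype(str).to_dict()
print(origin)
```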

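The left join and the right join keep every row of the left input and of the right input, respectively, and fill in NaN where the other input has no matching key. For example: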

display('df6', 'df7', "pd.merge(df6, df7, how='left')")

df6

  name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

  name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7, how='left')

  name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine

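The output rows now correspond to the entries of the left input. With how='right', the output rows correspond to the entries of the right input instead.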
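
The right join is described but not shown, so here is the corresponding sketch on the same df6 and df7. The variable names right_join and food_by_name are introduced here for the illustration.

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

# Keep every row of df7; Joseph has no match in df6, so his food is NaN.
right_join = pd.merge(df6, df7, how='right')

food_by_name = right_join.set_index('name')['food']
print(right_join)
```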

3.8.5 Overlapping Column Names: The suffixes Parameter

Finally, you may run into a case where the two input DataFrames have conflicting column names. For example:

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')

df8

  name rank
0 Bob 1
1 Jake 2
2 Lisa 3
3 Sue 4

df9

  name rank
0 Bob 3
1 Jake 1
2 Lisa 4
3 Sue 2

pd.merge(df8, df9, on="name")

  name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2

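Because the output would otherwise contain two columns with the same name, pd.merge automatically appends the suffixes _x and _y to keep them distinct. You can also specify custom suffixes with the suffixes parameter: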

display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')

df8

  name rank
0 Bob 1
1 Jake 2
2 Lisa 3
3 Sue 4

df9

  name rank
0 Bob 3
1 Jake 1
2 Lisa 4
3 Sue 2

pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])

  name rank_L rank_R
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2

The suffixes parameter works with every join type, and it also applies when several column names overlap.
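
To illustrate the case of several overlapping columns, the sketch below adds a second shared column, score, to both tables. That column and its values are made up for the example. Each non-key column that appears in both inputs receives the corresponding suffix.

```python
import pandas as pd

# df8/df9 as above, extended with a hypothetical second overlapping column.
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4],
                    'score': [90, 85, 80, 75]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2],
                    'score': [70, 95, 60, 88]})

# A left join here, to show that suffixes is independent of the join type.
merged = pd.merge(df8, df9, on='name', how='left', suffixes=('_L', '_R'))
print(list(merged.columns))
```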

3.8.6 Example: US States Data
