[Python3] Pandas v1.0: (4) Combining Datasets

[ Pandas version: 1.0.1 ]

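VI. Combining Datasets: Concat and Append

Combining data from different sources includes:

Simple concatenation of two different datasets
Database-style joins and merges that handle datasets with overlapping fields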

import numpy as np
import pandas as pd

# Helper that quickly builds a DataFrame of a given shape
def make_df(cols, ind):
    """Quickly make a simple DataFrame."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# Example DataFrame
make_df('ABC', range(3))

#     A   B   C
# 0  A0  B0  C0
# 1  A1  B1  C1
# 2  A2  B2  C2

(I) Concatenating NumPy arrays: np.concatenate()

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])

x = [[1, 2], [3, 4]]
np.concatenate([x, x], axis=1)
# array([[1, 2, 1, 2],
#        [3, 4, 3, 4]])

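(II) Simple concatenation with pd.concat

pd.concat() takes more parameters than np.concatenate() and offers more options, such as control over index handling and join behavior.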

pandas.concat — pandas 1.0.3 documentation

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True) → DataFrame or Series

Parameters:

objs:	a sequence or mapping of Series or DataFrame objects
		If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.

axis:	{0/’index’, 1/’columns’}, default 0
		The axis to concatenate along.

join:	{‘inner’, ‘outer’}, default ‘outer’
		How to handle indexes on other axis (or axes).

ignore_index:	bool, default False
		If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

keys:	sequence, default None
		If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.

levels:	list of sequences, default None
		Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.

names:	list, default None
		Names for the levels in the resulting hierarchical index.

verify_integrity:	bool, default False
		Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

sort:	bool, default False
		Sort non-concatenation axis if it is not already aligned when join is ‘outer’. This has no effect when join='inner', which already preserves the order of the non-concatenation axis.

copy:	bool, default True
		If False, do not copy data unnecessarily.

Returns:	object, type of objs
		When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.

# Concatenating one-dimensional objects
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
# 1    A
# 2    B
# 3    C
# 4    D
# 5    E
# 6    F
# dtype: object

# Concatenating higher-dimensional objects
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1)
print(df2)
print(pd.concat([df1, df2]))
#     A   B
# 1  A1  B1
# 2  A2  B2

#     A   B
# 3  A3  B3
# 4  A4  B4

#     A   B
# 1  A1  B1
# 2  A2  B2
# 3  A3  B3
# 4  A4  B4

# Concatenate column-wise (axis defaults to 0, so set axis='columns')
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
print(df3)
print(df4)
print(pd.concat([df3, df4], axis='columns'))
#     A   B
# 0  A0  B0
# 1  A1  B1

#     C   D
# 0  C0  D0
# 1  C1  D1

#     A   B   C   D
# 0  A0  B0  C0  D0
# 1  A1  B1  C1  D1

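1. Duplicate indices

One of the main differences between np.concatenate() and pd.concat() is that Pandas preserves indices when concatenating, even if the result then has duplicate indices.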

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index   # make the indices duplicate
print(x)
print(y)
print(pd.concat([x, y]))
#     A   B
# 0  A0  B0
# 1  A1  B1

#     A   B
# 0  A2  B2
# 1  A3  B3

#     A   B
# 0  A0  B0
# 1  A1  B1
# 0  A2  B2
# 1  A3  B3

(1) Catching duplicate indices as an error: the verify_integrity parameter

With verify_integrity=True, the concatenation raises an exception if the indices overlap.

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
# ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')

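(2) Ignoring the index: the ignore_index parameter

Sometimes the index does not matter and can be ignored. With ignore_index=True, the concatenation creates a new integer index for the result.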

pd.concat([x, y], ignore_index=True)
#     A   B
# 0  A0  B0
# 1  A1  B1
# 2  A2  B2
# 3  A3  B3

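(3) Adding MultiIndex keys: the keys parameter

The keys parameter assigns a label to each data source; the result is then hierarchically indexed (a MultiIndex) with those labels as the outermost level.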

pd.concat([x, y], keys=['x', 'y'])
#       A   B
# x 0  A0  B0
#   1  A1  B1
# y 0  A2  B2
#   1  A3  B3
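
As a small sketch of how the keyed result can be used afterwards: the level names 'source' and 'row' below are illustrative choices, not part of the original example. It names the index levels with the names parameter, selects one source back out with .loc, and shows that passing a dict gives the same keys.

```python
import pandas as pd

# Two small frames that share the same index labels (0 and 1)
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# keys labels each source; names labels the two index levels
# ('source' and 'row' are illustrative names)
combined = pd.concat([x, y], keys=['x', 'y'], names=['source', 'row'])

# Select the rows that came from y via the outer index level
from_y = combined.loc['y']
print(from_y)

# Passing a dict uses its keys as the keys argument
combined_from_dict = pd.concat({'x': x, 'y': y})
print(combined_from_dict.index.tolist())
```

Because the outer level records where each row came from, the duplicate inner labels 0 and 1 are no longer ambiguous.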

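2. Concatenation with joins

In practice, the data to be combined often have different sets of column names. When only some column names are shared, the join parameter controls the result.

By default, entries for which no data is available are filled with NaN
join='outer' (the default) takes the union of all input columns
join='inner' takes the intersection of the input columns

The join_axes parameter directly specified the columns of the result; it took a list of index objects.

Note: join_axes was deprecated earlier and removed in pandas 1.0, where passing it raises TypeError: concat() got an unexpected keyword argument 'join_axes'. It still works in versions before 1.0.

The reindex() or reindex_like() methods can replace what join_axes did.

References: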

Removed the previously deprecated keyword “join_axes” from concat(); use reindex_like on the result instead (GH22318) - What’s new in 1.0.0 (January 29, 2020)

pandas.DataFrame.reindex - pandas 1.0.3 documentation

pandas.DataFrame.reindex_like - pandas 1.0.3 documentation

df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5)
print(df6)
#     A   B   C
# 1  A1  B1  C1
# 2  A2  B2  C2

#     B   C   D
# 3  B3  C3  D3
# 4  B4  C4  D4

pd.concat([df5, df6], sort=False)
#      A   B   C    D
# 1   A1  B1  C1  NaN
# 2   A2  B2  C2  NaN
# 3  NaN  B3  C3   D3
# 4  NaN  B4  C4   D4

pd.concat([df5, df6], join='inner')
#     B   C
# 1  B1  C1
# 2  B2  C2
# 3  B3  C3
# 4  B4  C4

# pd.concat([df5, df6], join_axes=[df5.columns])  # no longer works in pandas 1.0+
#      A   B   C
# 1   A1  B1  C1
# 2   A2  B2  C2
# 3  NaN  B3  C3
# 4  NaN  B4  C4

# Alternative
pd.concat([df5, df6]).reindex(columns=df5.columns)
#      A   B   C
# 1   A1  B1  C1
# 2   A2  B2  C2
# 3  NaN  B3  C3
# 4  NaN  B4  C4
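
One caveat when choosing between the two replacements, sketched below with frames rebuilt inline so the block runs on its own: reindex(columns=...) only restricts the columns, whereas reindex_like(other) conforms to both the index and the columns of other, so it can also drop rows.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

combined = pd.concat([df5, df6], sort=False)

# Keeps all four rows, restricted to df5's columns (the old join_axes behavior)
by_columns = combined.reindex(columns=df5.columns)

# Conforms to df5's rows AND columns, so rows 3 and 4 are dropped
like_df5 = combined.reindex_like(df5)

print(by_columns.shape)
print(like_df5.shape)
```

For reproducing join_axes=[df5.columns], reindex(columns=df5.columns) is therefore the closer match in this example.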

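3. The append() method

# Both calls give the same result
df5.append(df6)
pd.concat([df5, df6])

Unlike the append() method of Python lists, the Pandas append() method does not modify the original object. It creates a new object holding the combined data, so every call rebuilds the index and the data buffer.

If you need to do several append operations, it is better to build a list of DataFrames and pass them all at once to concat().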
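
A minimal sketch of that advice, where the batches list is a hypothetical stand-in for data that arrives in pieces: the pieces are collected in a plain Python list and concatenated once. This pattern also carries over to later pandas releases, where DataFrame.append() was deprecated (1.4) and then removed (2.0).

```python
import pandas as pd

# Hypothetical batches standing in for data that arrives piece by piece
batches = [
    pd.DataFrame({'A': ['A%d' % i], 'B': ['B%d' % i]}, index=[i])
    for i in range(4)
]

# One concat call builds the index and data buffer a single time,
# instead of once per appended piece
result = pd.concat(batches)
print(result)
#     A   B
# 0  A0  B0
# 1  A1  B1
# 2  A2  B2
# 3  A3  B3
```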

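VII. Combining Datasets: Merge and Join with pd.merge()

One essential feature of Pandas is its high-performance, in-memory join and merge operations. The main interface for them is the pd.merge function.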

pandas.DataFrame.merge — pandas 1.0.3 documentation

DataFrame.merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) → 'DataFrame'

	Merge DataFrame or named Series objects with a database-style join.

	The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters:
right: 	DataFrame or named Series
		Object to merge with.

how: 	{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
		Type of merge to be performed.
		- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
		- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
		- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
		- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

on: 	label or list
		Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_on:	label or list, or array-like
		Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_on:	label or list, or array-like
		Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_index:	bool, default False
		Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.

right_index: bool, default False
		Use the index from the right DataFrame as the join key. Same caveats as left_index.

sort:	bool, default False
		Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).

suffixes:	tuple of (str, str), default (‘_x’, ‘_y’)
		Suffix to apply to overlapping column names in the left and right side, respectively. To raise an exception on overlapping columns use (False, False).

copy:	bool, default True
		If False, avoid copy if possible.

indicator: bool or str, default False
		If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

validate: str, optional
		If specified, checks if merge is of specified type.
		- “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
		- “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
		- “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
		- “many_to_many” or “m:m”: allowed, but does not result in checks.

Returns: DataFrame
		A DataFrame of the two merged objects.

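(I) Relational algebra

The behavior implemented in pd.merge() is a subset of relational algebra, a formal set of rules for manipulating relational data that forms the conceptual foundation of the operations available in most databases.

The strength of the relational algebra approach is that it proposes several primitive operations, which can be composed into complex operations on any dataset.

Pandas implements these basic building blocks in the pd.merge() function and in the related join() method of DataFrame.

(II) Categories of joins

1. One-to-one joins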

df_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'], 
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df_2 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'], 
                     'hire_date': [2004, 2008, 2012, 2014]})
df_1
#   employee        group
# 0      Bob   Accounting
# 1     Jake  Engineering
# 2     Lisa  Engineering
# 3      Sue           HR

df_2
#   employee  hire_date
# 0      Bob       2004
# 1     Jake       2008
# 2     Lisa       2012
# 3      Sue       2014

df_3 = pd.merge(df_1, df_2)
df_3
#   employee        group  hire_date
# 0      Bob   Accounting       2004
# 1     Jake  Engineering       2008
# 2     Lisa  Engineering       2012
# 3      Sue           HR       2014

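The order of entries in the shared column may differ between the two inputs; pd.merge() accounts for this automatically
By default pd.merge() discards the original row index, except when merging on the index (see the left_index and right_index parameters)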
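
Both points can be checked with a small sketch. The string index labels ('a', 'b', 'x', 'y') and the two-row frames are made up for illustration; the right frame lists the employees in the opposite order.

```python
import pandas as pd

# Illustrative inputs: custom string indices, and 'employee' in different orders
left = pd.DataFrame({'employee': ['Bob', 'Jake'],
                     'group': ['Accounting', 'Engineering']},
                    index=['a', 'b'])
right = pd.DataFrame({'employee': ['Jake', 'Bob'],
                      'hire_date': [2008, 2004]},
                     index=['x', 'y'])

merged = pd.merge(left, right)
print(merged)
#   employee        group  hire_date
# 0      Bob   Accounting       2004
# 1     Jake  Engineering       2008
```

The rows are matched on the employee values rather than on position, and neither input's index survives in the result.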

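2. Many-to-one joins

In a many-to-one join, one of the two key columns contains duplicate entries. The resulting DataFrame preserves those duplicates, repeating the matching rows from the other input as needed.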

df_4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'], 
                     'supervisor': ['Carly', 'Guido', 'Steve']})
df_4
#          group supervisor
# 0   Accounting      Carly
# 1  Engineering      Guido
# 2           HR      Steve

pd.merge(df_3, df_4)
#   employee        group  hire_date supervisor
# 0      Bob   Accounting       2004      Carly
# 1     Jake  Engineering       2008      Guido
# 2     Lisa  Engineering       2012      Guido
# 3      Sue           HR       2014      Steve

3. Many-to-many joins

If the key column in both the left and the right input contains duplicates, the result is a many-to-many merge.

df_5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
                     'skill': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
df_5
#          group         skill
# 0   Accounting          math
# 1   Accounting  spreadsheets
# 2  Engineering        coding
# 3  Engineering         linux
# 4           HR  spreadsheets
# 5           HR  organization

pd.merge(df_1, df_5)
#   employee        group         skill
# 0      Bob   Accounting          math
# 1      Bob   Accounting  spreadsheets
# 2     Jake  Engineering        coding
# 3     Jake  Engineering         linux
# 4     Lisa  Engineering        coding
# 5     Lisa  Engineering         linux
# 6      Sue           HR  spreadsheets
# 7      Sue           HR  organization
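
The validate parameter described in the signature above can assert which of these three join types is expected. The sketch below rebuilds two of the example frames under the illustrative names employees and supervisors: a many-to-one check passes, while a one-to-one check raises MergeError because 'Engineering' appears twice on the left.

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# Many-to-one: 'group' repeats on the left but is unique on the right
ok = pd.merge(employees, supervisors, validate='many_to_one')

# Claiming one-to-one fails, since the left keys are not unique
try:
    pd.merge(employees, supervisors, validate='one_to_one')
    raised = False
except pd.errors.MergeError as e:
    raised = True
    print("MergeError:", e)
```

Checking the expected join type this way can catch accidental duplicate keys before they multiply rows in the result.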

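(III) Specifying the merge key

By default, pd.merge() looks for one or more column names shared by the two inputs and uses them as the key. Column names often do not match so neatly, however, so pd.merge() provides several parameters to handle this.

1. The on parameter

Set on to a column name or a list of column names. This option works only if both DataFrames have the specified column names.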

pd.merge(df_1, df_2, on='employee')
#   employee        group  hire_date
# 0      Bob   Accounting       2004
# 1     Jake  Engineering       2008
# 2     Lisa  Engineering       2012
# 3      Sue           HR       2014
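
The example above passes a single column name. As a sketch of the list form, the hypothetical sales and targets tables below (not part of the original data) are keyed by both region and year, so a row matches only when both values agree.

```python
import pandas as pd

# Hypothetical tables keyed by two columns
sales = pd.DataFrame({'region': ['East', 'East', 'West'],
                      'year': [2019, 2020, 2020],
                      'sales': [100, 120, 90]})
targets = pd.DataFrame({'region': ['East', 'West', 'West'],
                        'year': [2020, 2019, 2020],
                        'target': [110, 80, 95]})

# A list for `on` makes the combination (region, year) the merge key
merged = pd.merge(sales, targets, on=['region', 'year'])
print(merged)
#   region  year  sales  target
# 0   East  2020    120     110
# 1   West  2020     90      95
```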

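2. The left_on and right_on parameters

When the key column has a different name in each dataset, use left_on and right_on to specify the two names.

The result then has a redundant column (because the names differ), which can be removed with the DataFrame drop() method.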

df_3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'salary': [70000, 80000, 120000, 90000]})
df_3
#    name  salary
# 0   Bob   70000
# 1  Jake   80000
# 2  Lisa  120000
# 3   Sue   90000

pd.merge(df_1, df_3, left_on='employee', right_on='name')
#   employee        group  name  salary
# 0      Bob   Accounting   Bob   70000
# 1     Jake  Engineering  Jake   80000
# 2     Lisa  Engineering  Lisa  120000
# 3      Sue           HR   Sue   90000

pd.merge(df_1, df_3, left_on='employee', right_on='name').drop('name', axis=1)
#   employee        group  salary
# 0      Bob   Accounting   70000
# 1     Jake  Engineering   80000
# 2     Lisa  Engineering  120000
# 3      Sue           HR   90000

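3. The left_index and right_index parameters

Setting left_index and right_index uses the index as the merge key.

The DataFrame join() method also merges on indices and gives the same result.

To mix indices and columns, combine left_index with right_on, or left_on with right_index.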

df1a = df_1.set_index('employee')
df1a
#                 group
# employee
# Bob        Accounting
# Jake      Engineering
# Lisa      Engineering
# Sue                HR

df2a = df_2.set_index('employee')
df2a
#           hire_date
# employee
# Bob            2004
# Jake           2008
# Lisa           2012
# Sue            2014

pd.merge(df1a, df2a, left_index=True, right_index=True)
#                 group  hire_date
# employee
# Bob        Accounting       2004
# Jake      Engineering       2008
# Lisa      Engineering       2012
# Sue                HR       2014

df1a.join(df2a)
#                 group  hire_date
# employee
# Bob        Accounting       2004
# Jake      Engineering       2008
# Lisa      Engineering       2012
# Sue                HR       2014

pd.merge(df1a, df_3, left_index=True, right_on='name')
#          group  name  salary
# 0   Accounting   Bob   70000
# 1  Engineering  Jake   80000
# 2  Engineering  Lisa  120000
# 3           HR   Sue   90000
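
The mirror-image combination, left_on with right_index, works the same way. In this sketch the salaries frame is an illustrative variant of the salary table that stores the employee names in its index instead of in a column.

```python
import pandas as pd

df_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

# Illustrative frame: names live in the index, not in a column
salaries = pd.DataFrame({'salary': [70000, 80000, 120000, 90000]},
                        index=['Bob', 'Jake', 'Lisa', 'Sue'])

# Match the left 'employee' column against the right index
merged = pd.merge(df_1, salaries, left_on='employee', right_index=True)
print(merged)
#   employee        group  salary
# 0      Bob   Accounting   70000
# 1     Jake  Engineering   80000
# 2     Lisa  Engineering  120000
# 3      Sue           HR   90000
```

Since the right-hand key is an index rather than a column, no redundant name column appears in the result and nothing needs to be dropped.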

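(IV) Specifying set arithmetic for joins

Set arithmetic matters when a value appears in one key column but not in the other.

The how parameter sets the join type: {'left', 'right', 'outer', 'inner'}, default 'inner'

An inner join returns the intersection of the two sets of keys (the default)
An outer join returns the union of the two sets of keys and fills all missing values with NaN
A left join and a right join keep only the keys of the left input and of the right input, respectively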

df_6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'], 'food': ['fish', 'beans', 'bread']}, 
                   columns=['name', 'food'])
df_6
#     name   food
# 0  Peter   fish
# 1   Paul  beans
# 2   Mary  bread

df_7 = pd.DataFrame({'name': ['Mary', 'Joseph'], 'drink': ['wine', 'beer']}, 
                    columns=['name', 'drink'])
df_7
#      name drink
# 0    Mary  wine
# 1  Joseph  beer

pd.merge(df_6, df_7, how='inner')
#    name   food drink
# 0  Mary  bread  wine

pd.merge(df_6, df_7, how='outer')
#      name   food drink
# 0   Peter   fish   NaN
# 1    Paul  beans   NaN
# 2    Mary  bread  wine
# 3  Joseph    NaN  beer

pd.merge(df_6, df_7, how='left')
#     name   food drink
# 0  Peter   fish   NaN
# 1   Paul  beans   NaN
# 2   Mary  bread  wine

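(V) Overlapping column names: the suffixes parameter

When the two input DataFrames have conflicting column names, pd.merge() automatically appends the suffix _x or _y; custom suffixes can be set with the suffixes parameter.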

df_8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'], 'rank': [1, 2, 3, 4]})
df_8
#    name  rank
# 0   Bob     1
# 1  Jake     2
# 2  Lisa     3
# 3   Sue     4

df_9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'], 'rank': [3, 1, 4, 2]})
df_9
#    name  rank
# 0   Bob     3
# 1  Jake     1
# 2  Lisa     4
# 3   Sue     2

# Default suffixes
pd.merge(df_8, df_9, on='name')
#    name  rank_x  rank_y
# 0   Bob       1       3
# 1  Jake       2       1
# 2  Lisa       3       4
# 3   Sue       4       2

# Custom suffixes
pd.merge(df_8, df_9, on='name', suffixes=['_L', '_R'])
#    name  rank_L  rank_R
# 0   Bob       1       3
# 1  Jake       2       1
# 2  Lisa       3       4
# 3   Sue       4       2

Summarized from the Python Data Science Handbook.