[ Pandas version: 1.0.1 ]
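VI. Combining Datasets: Concat and Append
Combining data from different sources covers two cases:
- simple concatenation of two different datasets
- database-style join and merge operations for datasets with overlapping fields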
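import numpy as np
import pandas as pd

# Helper that builds a small DataFrame of a fixed form for the examples below
def make_df(cols, ind):
    """Quickly make a simple DataFrame."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)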
# Example DataFrame
make_df('ABC', range(3))
# A B C
# 0 A0 B0 C0
# 1 A1 B1 C1
# 2 A2 B2 C2
(I) Concatenating NumPy arrays: np.concatenate()
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])
x = [[1, 2], [3, 4]]
np.concatenate([x, x], axis=1)
# array([[1, 2, 1, 2],
# [3, 4, 3, 4]])
(II) Simple concatenation with pd.concat
pd.concat() accepts more parameters than np.concatenate() and is correspondingly more flexible.
# pandas.concat — pandas 1.0.3 documentation
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True) → DataFrame or Series
Parameters:
objs: a sequence or mapping of Series or DataFrame objects
If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
axis: {0/’index’, 1/’columns’}, default 0
The axis to concatenate along.
join: {‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis (or axes).
ignore_index: bool, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
keys: sequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
levels: list of sequences, default None
Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
names: list, default None
Names for the levels in the resulting hierarchical index.
verify_integrity: bool, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
sort: bool, default False
Sort non-concatenation axis if it is not already aligned when join is ‘outer’. This has no effect when join='inner', which already preserves the order of the non-concatenation axis.
copy: bool, default True
If False, do not copy data unnecessarily.
Returns: object, type of objs
When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.
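The return-type rule quoted above can be checked directly. A small sketch, assuming two named Series made up for illustration (s1 and s2 are not part of the original examples):

```python
import pandas as pd

s1 = pd.Series(['A', 'B'], index=[1, 2], name='left')
s2 = pd.Series(['C', 'D'], index=[1, 2], name='right')

stacked = pd.concat([s1, s2])               # axis=0: the result is still a Series
side_by_side = pd.concat([s1, s2], axis=1)  # axis=1: the result is a DataFrame

print(type(stacked).__name__)       # Series
print(type(side_by_side).__name__)  # DataFrame
print(list(side_by_side.columns))   # ['left', 'right']
```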
# Concatenating one-dimensional objects
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
# 1 A
# 2 B
# 3 C
# 4 D
# 5 E
# 6 F
# dtype: object
# Concatenating higher-dimensional objects
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1)
print(df2)
print(pd.concat([df1, df2]))
# A B
# 1 A1 B1
# 2 A2 B2
# A B
# 3 A3 B3
# 4 A4 B4
# A B
# 1 A1 B1
# 2 A2 B2
# 3 A3 B3
# 4 A4 B4
# Concatenating column-wise (the default is axis=0, so pass axis='columns' or axis=1)
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
print(df3)
print(df4)
print(pd.concat([df3, df4], axis='columns'))
# A B
# 0 A0 B0
# 1 A1 B1
# C D
# 0 C0 D0
# 1 C1 D1
# A B C D
# 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
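1. Duplicate indices
One of the main differences between np.concatenate() and pd.concat() is that Pandas preserves the indices when concatenating, even if the result contains duplicate index values.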
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # reuse x's index so that the indices are duplicated
print(x)
print(y)
print(pd.concat([x, y]))
# A B
# 0 A0 B0
# 1 A1 B1
# A B
# 0 A2 B2
# 1 A3 B3
# A B
# 0 A0 B0
# 1 A1 B1
# 0 A2 B2
# 1 A3 B3
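(1) Catching duplicate indices as an error: the verify_integrity parameter
With verify_integrity=True, the concatenation raises an exception if the resulting index contains duplicates.
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)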
# ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
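If raising an exception is too strict, the duplicates can also be inspected after the fact. A minimal sketch, assuming it is acceptable to concatenate first and check afterwards; it uses the is_unique attribute and the duplicated() method of the index rather than verify_integrity:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

combined = pd.concat([x, y])

# is_unique is False as soon as any index label occurs more than once
print(combined.index.is_unique)  # False

# duplicated() flags the repeated labels; unique() lists each of them once
dupes = combined.index[combined.index.duplicated()].unique()
print(list(dupes))  # [0, 1]
```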
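(2) Ignoring the index: the ignore_index parameter
Sometimes the index itself does not matter and can simply be ignored. With ignore_index=True, the concatenation creates a new integer index for the result.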
pd.concat([x, y], ignore_index=True)
# A B
# 0 A0 B0
# 1 A1 B1
# 2 A2 B2
# 3 A3 B3
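(3) Adding MultiIndex keys: the keys parameter
The keys parameter assigns a label to each data source; the result is then indexed hierarchically, with these labels as the outermost level.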
pd.concat([x, y], keys=['x', 'y'])
# A B
# x 0 A0 B0
# 1 A1 B1
# y 0 A2 B2
# 1 A3 B3
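Once the keys are in place, each source can be recovered by its label. A short sketch, assuming the level names 'source' and 'row' (chosen here for illustration and passed through the names parameter):

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# keys labels each input; names labels the levels of the resulting MultiIndex
combined = pd.concat([x, y], keys=['x', 'y'], names=['source', 'row'])

# Selecting on the outer level returns the rows that came from y
from_y = combined.loc['y']
print(from_y)

# A (source, row) tuple addresses a single row
print(combined.loc[('x', 1), 'A'])  # A1
```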
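2. Join-style concatenation
In practice, the datasets to be combined often have different column names. When only some of the column names are shared, the join parameter controls the result:
- By default, positions for which no data is available are filled with NaN.
- join='outer' (the default) takes the union of all input columns.
- join='inner' takes the intersection of the input columns.
The join_axes parameter used to specify the columns of the result directly; it took a list of index objects.
- Note: join_axes was removed in pandas 1.0, where passing it raises TypeError: concat() got an unexpected keyword argument 'join_axes'. It still works in versions before 1.0.
- The reindex() or reindex_like() method provides equivalent behavior.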
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5)
print(df6)
# A B C
# 1 A1 B1 C1
# 2 A2 B2 C2
# B C D
# 3 B3 C3 D3
# 4 B4 C4 D4
pd.concat([df5, df6], sort=False)
# A B C D
# 1 A1 B1 C1 NaN
# 2 A2 B2 C2 NaN
# 3 NaN B3 C3 D3
# 4 NaN B4 C4 D4
pd.concat([df5, df6], join='inner')
# B C
# 1 B1 C1
# 2 B2 C2
# 3 B3 C3
# 4 B4 C4
# pd.concat([df5, df6], join_axes=[df5.columns])  # no longer works in pandas 1.0+
# A B C
# 1 A1 B1 C1
# 2 A2 B2 C2
# 3 NaN B3 C3
# 4 NaN B4 C4
# Alternative
pd.concat([df5, df6]).reindex(columns=df5.columns)
# A B C
# 1 A1 B1 C1
# 2 A2 B2 C2
# 3 NaN B3 C3
# 4 NaN B4 C4
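3. The append() method
# Both calls give the same result
df5.append(df6)
pd.concat([df5, df6])
Unlike the append() method of Python lists, the Pandas append() method does not modify the original object. It creates a new object holding the combined data, so every call has to build a new index and a new data buffer.
If you need to perform many append operations, it is better to build a list of DataFrames first and pass them all to concat() in one call.
Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; on those versions only the pd.concat() form works.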
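The advice to collect the pieces first and concatenate once can be sketched as follows. The loop and the per-batch data are hypothetical stand-ins for whatever actually produces the pieces:

```python
import pandas as pd

frames = []
for batch in range(3):
    # Hypothetical data source: each iteration yields one small DataFrame
    frames.append(pd.DataFrame({'batch': [batch, batch],
                                'value': [batch * 10, batch * 10 + 1]}))

# A single concat call builds the index and the data buffer only once
result = pd.concat(frames, ignore_index=True)
print(result.shape)  # (6, 2)
```

Appending to a Python list is cheap, so the cost of building the combined DataFrame is paid only in the final pd.concat() call.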
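VII. Combining Datasets: Merge and Join with pd.merge()
One essential feature of Pandas is its high-performance, in-memory join and merge operations. The main interface for them is the pd.merge function.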
# pandas.DataFrame.merge — pandas 1.0.3 documentation
DataFrame.merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) → 'DataFrame'
Merge DataFrame or named Series objects with a database-style join.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters:
right: DataFrame or named Series
Object to merge with.
how: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
Type of merge to be performed.
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
on: label or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_on: label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
right_on: label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
left_index: bool, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_index: bool, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index.
sort: bool, default False
Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).
suffixes: tuple of (str, str), default (‘_x’, ‘_y’)
Suffix to apply to overlapping column names in the left and right side, respectively. To raise an exception on overlapping columns use (False, False).
copy: bool, default True
If False, avoid copy if possible.
indicator: bool or str, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
validate: str, optional
If specified, checks if merge is of specified type.
- “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
- “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
- “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
- “many_to_many” or “m:m”: allowed, but does not result in checks.
Returns: DataFrame
A DataFrame of the two merged objects.
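Of the parameters listed above, indicator is a convenient way to see where each row of a merge came from. A minimal sketch, assuming two small frames (left and right, with made-up columns) whose keys only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [20, 30, 40]})

# indicator=True adds a categorical "_merge" column naming the source of each row
merged = pd.merge(left, right, on='key', how='outer', indicator=True)

# Map each key to its source: left_only, right_only, or both
source = dict(zip(merged['key'], merged['_merge'].astype(str)))
print(source)
```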
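(I) Relational algebra
The behavior of pd.merge() is a subset of relational algebra, a formal set of rules for manipulating relational data that forms the theoretical basis of the operations available in most databases.
The strength of the relational algebra approach is that it proposes a few simple primitive operations that can be composed into complex operations on any dataset.
Pandas implements several of these fundamental building blocks in the pd.merge() function and in the related join() method of DataFrame.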
(II) Categories of joins
1. One-to-one joins
df_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df_2 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
df_1
# employee group
# 0 Bob Accounting
# 1 Jake Engineering
# 2 Lisa Engineering
# 3 Sue HR
df_2
# employee hire_date
# 0 Bob 2004
# 1 Jake 2008
# 2 Lisa 2012
# 3 Sue 2014
df_3 = pd.merge(df_1, df_2)
df_3
# employee group hire_date
# 0 Bob Accounting 2004
# 1 Jake Engineering 2008
# 2 Lisa Engineering 2012
# 3 Sue HR 2014
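- The entries of the shared column need not be in the same order in both inputs; pd.merge() accounts for this automatically.
- pd.merge() discards the original row index by default, except when merging on the index (see the left_index and right_index parameters).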
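2. Many-to-one joins
In a many-to-one join, one of the two key columns contains duplicate entries. The resulting DataFrame preserves those duplicates, repeating the matching rows of the other input as needed.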
df_4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
df_4
# group supervisor
# 0 Accounting Carly
# 1 Engineering Guido
# 2 HR Steve
pd.merge(df_3, df_4)
# employee group hire_date supervisor
# 0 Bob Accounting 2004 Carly
# 1 Jake Engineering 2008 Guido
# 2 Lisa Engineering 2012 Guido
# 3 Sue HR 2014 Steve
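Whether a merge really is many-to-one can be checked with the validate parameter of pd.merge(). A small sketch, assuming cut-down versions of the employee and supervisor tables (the names employees and supervisors are illustrative):

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                          'group': ['Accounting', 'Engineering', 'Engineering']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                            'supervisor': ['Carly', 'Guido']})

# 'group' repeats on the left but is unique on the right: many-to-one passes
ok = pd.merge(employees, supervisors, validate='many_to_one')
print(ok['supervisor'].tolist())  # ['Carly', 'Guido', 'Guido']

# Requiring one-to-one fails, because 'Engineering' occurs twice on the left
try:
    pd.merge(employees, supervisors, validate='one_to_one')
    raised = False
except pd.errors.MergeError as e:
    raised = True
    print("MergeError:", e)
```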
3. Many-to-many joins
If the key column contains duplicate values in both the left and the right input, the result is a many-to-many join.
df_5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
'skill': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
df_5
# group skill
# 0 Accounting math
# 1 Accounting spreadsheets
# 2 Engineering coding
# 3 Engineering linux
# 4 HR spreadsheets
# 5 HR organization
pd.merge(df_1, df_5)
# employee group skill
# 0 Bob Accounting math
# 1 Bob Accounting spreadsheets
# 2 Jake Engineering coding
# 3 Jake Engineering linux
# 4 Lisa Engineering coding
# 5 Lisa Engineering linux
# 6 Sue HR spreadsheets
# 7 Sue HR organization
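In a many-to-many join, each key contributes (matches on the left) x (matches on the right) rows, which is why the result can grow quickly. A sketch that checks this count, assuming small made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'group': ['Eng', 'Eng', 'HR'],
                     'employee': ['Jake', 'Lisa', 'Sue']})
right = pd.DataFrame({'group': ['Eng', 'Eng', 'HR', 'HR'],
                      'skill': ['coding', 'linux', 'spreadsheets', 'organization']})

merged = pd.merge(left, right)

# Per-key occurrence counts on each side, multiplied key by key
left_counts = left['group'].value_counts()
right_counts = right['group'].value_counts()
expected_rows = int((left_counts * right_counts).sum())

print(expected_rows, len(merged))  # 6 6
```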
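(III) Specifying the merge key
By default, pd.merge() uses the one or more column names shared by both inputs as the key. Since the key columns often do not have matching names, pd.merge() provides several parameters to handle this.
1. The on parameter
Set on to a column name or a list of column names. This only works when both DataFrames contain the specified column name(s).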
pd.merge(df_1, df_2, on='employee')
# employee group hire_date
# 0 Bob Accounting 2004
# 1 Jake Engineering 2008
# 2 Lisa Engineering 2012
# 3 Sue HR 2014
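The list form of on matches rows only when all listed columns agree. A minimal sketch, assuming two hypothetical tables (sales and targets) keyed by year and region:

```python
import pandas as pd

sales = pd.DataFrame({'year': [2019, 2019, 2020],
                      'region': ['N', 'S', 'N'],
                      'sales': [10, 20, 30]})
targets = pd.DataFrame({'year': [2019, 2020, 2020],
                        'region': ['N', 'N', 'S'],
                        'target': [12, 28, 25]})

# Only the (year, region) pairs present in both frames survive the inner join
merged = pd.merge(sales, targets, on=['year', 'region'])
print(merged)
```

Here only (2019, 'N') and (2020, 'N') occur in both inputs, so the result has two rows.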
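2. The left_on and right_on parameters
When the key column has a different name in each dataset, specify the two names with left_on and right_on.
Because the names differ, the result contains a redundant column, which can be removed with the DataFrame drop() method.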
df_3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
df_3
# name salary
# 0 Bob 70000
# 1 Jake 80000
# 2 Lisa 120000
# 3 Sue 90000
pd.merge(df_1, df_3, left_on='employee', right_on='name')
# employee group name salary
# 0 Bob Accounting Bob 70000
# 1 Jake Engineering Jake 80000
# 2 Lisa Engineering Lisa 120000
# 3 Sue HR Sue 90000
pd.merge(df_1, df_3, left_on='employee', right_on='name').drop('name', axis=1)
# employee group salary
# 0 Bob Accounting 70000
# 1 Jake Engineering 80000
# 2 Lisa Engineering 120000
# 3 Sue HR 90000
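Another way to avoid the redundant column is to rename the key before merging, so that a plain on= works. This is an alternative sketch, not the approach used above; the frames repeat the example data:

```python
import pandas as pd

df_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df_3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'salary': [70000, 80000, 120000, 90000]})

# rename() returns a new DataFrame, so df_3 itself is left unchanged
merged = pd.merge(df_1, df_3.rename(columns={'name': 'employee'}), on='employee')
print(list(merged.columns))  # ['employee', 'group', 'salary']
```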
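3. The left_index and right_index parameters
Setting left_index and right_index uses the index as the merge key.
The DataFrame join() method also merges on the index and gives the same result.
To mix indices and columns, combine left_index with right_on, or left_on with right_index.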
df1a = df_1.set_index('employee')
df1a
# group
# employee
# Bob Accounting
# Jake Engineering
# Lisa Engineering
# Sue HR
df2a = df_2.set_index('employee')
df2a
# hire_date
# employee
# Bob 2004
# Jake 2008
# Lisa 2012
# Sue 2014
pd.merge(df1a, df2a, left_index=True, right_index=True)
# group hire_date
# employee
# Bob Accounting 2004
# Jake Engineering 2008
# Lisa Engineering 2012
# Sue HR 2014
df1a.join(df2a)
# group hire_date
# employee
# Bob Accounting 2004
# Jake Engineering 2008
# Lisa Engineering 2012
# Sue HR 2014
pd.merge(df1a, df_3, left_index=True, right_on='name')
# group name salary
# 0 Accounting Bob 70000
# 1 Engineering Jake 80000
# 2 Engineering Lisa 120000
# 3 HR Sue 90000
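The example above combines left_index with right_on; the mirrored combination, left_on with right_index, works the same way. A small sketch, assuming a salary table indexed by name (both frames are made up for illustration):

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake'],
                          'group': ['Accounting', 'Engineering']})
salaries = pd.DataFrame({'salary': [70000, 80000]},
                        index=pd.Index(['Bob', 'Jake'], name='name'))

# The left key is a column, the right key is the index
merged = pd.merge(employees, salaries, left_on='employee', right_index=True)
print(merged)
```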
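(IV) Specifying set arithmetic for joins
When a key value appears in one key column but not in the other, the set arithmetic of the join determines what happens.
The how parameter sets the join type: {'left', 'right', 'outer', 'inner'}, default 'inner'.
- An inner join returns the intersection of the two sets of keys (the default).
- An outer join returns the union of the two sets of keys, filling all missing values with NaN.
- A left join and a right join keep only the keys of the left input and of the right input, respectively.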
df_6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'], 'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df_6
# name food
# 0 Peter fish
# 1 Paul beans
# 2 Mary bread
df_7 = pd.DataFrame({'name': ['Mary', 'Joseph'], 'drink': ['wine', 'beer']},
columns=['name', 'drink'])
df_7
# name drink
# 0 Mary wine
# 1 Joseph beer
pd.merge(df_6, df_7, how='inner')
# name food drink
# 0 Mary bread wine
pd.merge(df_6, df_7, how='outer')
# name food drink
# 0 Peter fish NaN
# 1 Paul beans NaN
# 2 Mary bread wine
# 3 Joseph NaN beer
pd.merge(df_6, df_7, how='left')
# name food drink
# 0 Peter fish NaN
# 1 Paul beans NaN
# 2 Mary bread wine
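The right join is the mirror image of the left join shown above. A minimal sketch reusing the same two frames, rebuilt here so that the block runs on its own:

```python
import pandas as pd

df_6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                     'food': ['fish', 'beans', 'bread']})
df_7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                     'drink': ['wine', 'beer']})

# how='right' keeps every key of df_7; Joseph has no food, so that cell is NaN
right_join = pd.merge(df_6, df_7, how='right')
print(right_join['name'].tolist())         # ['Mary', 'Joseph']
print(right_join['food'].isna().tolist())  # [False, True]
```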
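(V) Overlapping column names: the suffixes parameter
When the two input DataFrames have conflicting column names, pd.merge() automatically appends the suffix _x or _y to them. Custom suffixes can be set with the suffixes parameter.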
df_8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'], 'rank': [1, 2, 3, 4]})
df_8
# name rank
# 0 Bob 1
# 1 Jake 2
# 2 Lisa 3
# 3 Sue 4
df_9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'], 'rank': [3, 1, 4, 2]})
df_9
# name rank
# 0 Bob 3
# 1 Jake 1
# 2 Lisa 4
# 3 Sue 2
# Suffixes added automatically
pd.merge(df_8, df_9, on='name')
# name rank_x rank_y
# 0 Bob 1 3
# 1 Jake 2 1
# 2 Lisa 3 4
# 3 Sue 4 2
# Custom suffixes
pd.merge(df_8, df_9, on='name', suffixes=['_L', '_R'])
# name rank_L rank_R
# 0 Bob 1 3
# 1 Jake 2 1
# 2 Lisa 3 4
# 3 Sue 4 2
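Related Pandas posts:
[Python3] Pandas v1.0 (1): Objects, Data Selection, and Operations
[Python3] Pandas v1.0 (2): Handling Missing Values
[Python3] Pandas v1.0 (3): Hierarchical Indexing
[Python3] Pandas v1.0 (4): Combining Datasets [this post]
[Python3] Pandas v1.0 (5): Aggregation and Grouping
[Python3] Pandas v1.0 (6): Pivot Tables
[Python3] Pandas v1.0 (7): Vectorized String Operations
[Python3] Pandas v1.0 (8): Working with Time Series
[Python3] Pandas v1.0 (9): High-Performance Pandas: eval() and query()
Summarized from the Python Data Science Handbook.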