Python Data Analysis Trio, Pandas (Part 8): Data Reshaping, Duplicate Handling, and Data Replacement

The NumPy and Matplotlib series are already complete and may also be of interest:

NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

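Recommended learning resources and sites (the author contributed to some of the documentation translations):

NumPy Chinese documentation site: https://www.numpy.org.cn/
Pandas Chinese documentation site: https://www.pypandas.cn/
Matplotlib Chinese documentation site: https://www.matplotlib.org.cn/
NumPy, Matplotlib, and Pandas cheat sheets: https://github.com/TRHX/Python-quick-reference-table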

Contents

【01x00】Data Reshaping

【02x00】Handling Duplicate Data

【03x00】Data Replacement

This is an anti-scraping notice; readers may ignore it.
This article was originally published on CSDN by TRHX.
Blog home page: https://itrhx.blog.csdn.net/
Article link: https://itrhx.blog.csdn.net/article/details/106900748
Do not repost without permission.

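【01x00】Data Reshaping

pandas provides several basic operations for rearranging tabular data. These are known as reshape or pivot operations. For hierarchical (multi-level) indexes, the two main methods are:

stack: pivots the columns of the data into rows;
unstack: pivots the rows of the data into columns.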
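
To make the relationship between the two methods concrete, here is a minimal sketch showing that, for a small frame with no missing values, unstack undoes stack. The frame and the variable names (df, stacked, restored) are illustrative and not part of the original article.

```python
import pandas as pd

# Small frame with a single level of columns
df = pd.DataFrame([[0, 1], [2, 3]],
                  index=['cat', 'dog'],
                  columns=['weight', 'height'])

# stack: the column labels become the inner level of the row index (result is a Series)
stacked = df.stack()

# unstack: the inner level of the row index goes back to the columns (result is a DataFrame)
restored = stacked.unstack()

print(stacked)
print(restored)

# Compare with the columns in the original order
print(restored[df.columns].equals(df))
```

The round trip only holds this cleanly when every row/column combination has a value; the sections below show what happens when some combinations are missing.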

【01x01】stack

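The stack method pivots the columns of the data into rows.

Basic syntax: DataFrame.stack(self, level=-1, dropna=True)

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html

Parameter	Description
level	The column level(s) to move into the row index: a level number or label, or a list of them; default -1 (the innermost level)
dropna	bool; whether to drop rows of the reshaped result whose values are all NaN; default True

Single level columns: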

>>> import pandas as pd
>>> obj = pd.DataFrame([[0, 1], [2, 3]], index=['cat', 'dog'], columns=['weight', 'height'])
>>> obj
     weight  height
cat       0       1
dog       2       3
>>> 
>>> obj.stack()
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

多層列（Multi level columns）：

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('weight', 'pounds')])
>>> obj = pd.DataFrame([[1, 2], [2, 4]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight       
        kg pounds
cat      1      2
dog      2      4
>>> 
>>> obj.stack()
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

Missing values: when a combination of column levels has no data, the stacked result is filled with NaN:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> obj = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> 
>>> obj.stack()
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN

Use the level parameter to choose which column level(s) to stack:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> obj = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> 
>>> obj.stack(level=0)
             kg    m
cat height  NaN  2.0
    weight  1.0  NaN
dog height  NaN  4.0
    weight  3.0  NaN
>>> 
>>> obj.stack(level=1)
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN
>>>
>>> obj.stack(level=[0, 1])
cat  height  m     2.0
     weight  kg    1.0
dog  height  m     4.0
     weight  kg    3.0
dtype: float64

If a row of the reshaped data consists entirely of NaN, it is dropped by default; set dropna=False to keep it:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> obj = pd.DataFrame([[None, 1.0], [2.0, 3.0]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0
>>> 
>>> obj.stack(dropna=False)
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
>>> 
>>> obj.stack(dropna=True)
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN

【01x02】unstack

The unstack method pivots the rows of the data into columns.

Basic syntax:

Series.unstack(self, level=-1, fill_value=None)
DataFrame.unstack(self, level=-1, fill_value=None)

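Parameter	Description
level	The row index level(s) to move into the columns; default -1 (the innermost level)
fill_value	Value used in place of the NaN that unstack produces for missing combinations

Applied to a Series: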

>>> import pandas as pd
>>> obj = pd.Series([1, 2, 3, 4], index=pd.MultiIndex.from_product([['one', 'two'], ['a', 'b']]))
>>> obj
one  a    1
     b    2
two  a    3
     b    4
dtype: int64
>>> 
>>> obj.unstack()
     a  b
one  1  2
two  3  4
>>> 
>>> obj.unstack(level=0)
   one  two
a    1    3
b    2    4

As with stack, missing values (NaN) are introduced when a combination of labels does not exist:

>>> import pandas as pd
>>> obj1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
>>> obj2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
>>> obj3 = pd.concat([obj1, obj2], keys=['one', 'two'])
>>> obj3
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
>>> 
>>> obj3.unstack()
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
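
The fill_value parameter listed in the table above is not exercised by the examples, so here is a minimal sketch of it using the same data. The variable names (s, wide_nan, wide_zero) are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s = pd.concat([s1, s2], keys=['one', 'two'])

# Default: label combinations that do not exist become NaN
wide_nan = s.unstack()

# With fill_value, those positions receive the given value instead of NaN
wide_zero = s.unstack(fill_value=0)

print(wide_nan)
print(wide_zero)
```

Choosing a fill value is only appropriate when it cannot be confused with real data; 0 is used here purely for illustration.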

Applied to a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.arange(6).reshape((2, 3)),
...                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
...                    columns=pd.Index(['one', 'two', 'three'], name='number'))
>>> obj
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
>>> 
>>> obj2 = obj.stack()
>>> obj2
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
>>> 
>>> obj3 = pd.DataFrame({'left': obj2, 'right': obj2 + 5},
...                     columns=pd.Index(['left', 'right'], name='side'))
>>> obj3
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
>>> 
>>> obj3.unstack('state')
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
>>> 
>>> obj3.unstack('state').stack('side')
state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7

【02x00】Handling Duplicate Data

duplicated: tests whether each value is a duplicate;
drop_duplicates: removes duplicate values.

【02x01】duplicated

The duplicated method returns a boolean Series indicating whether each value (or row) is a duplicate.

Basic syntax:

Series.duplicated(self, keep='first')
DataFrame.duplicated(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first') → Series

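Parameter	Description
keep	How duplicates are marked; default 'first'
	'first': unique values and the first occurrence of each duplicated value are marked False; later occurrences are marked True
	'last': unique values and the last occurrence of each duplicated value are marked False; earlier occurrences are marked True
	False: every occurrence of a duplicated value is marked True; unique values are marked False
subset	Column label or sequence of labels; DataFrame only. Only the given column(s) are considered when identifying duplicates; by default all columns are used

By default (equivalent to keep='first'), within each group of equal values the first occurrence is marked False and every later occurrence is marked True; unique values are always marked False: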

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
dtype: object
>>> 
>>> obj.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool
>>>
>>> obj.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

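With keep='last', unique values and the last occurrence in each group are marked False and the earlier occurrences True. With keep=False, every occurrence of a duplicated value is marked True and all other values False: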

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
dtype: object
>>> 
>>> obj.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool
>>> 
>>> obj.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool

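For a DataFrame, the subset parameter restricts the comparison to the given column(s); by default all columns are considered. The data2 column below is random, so your values will differ from the output shown: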

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
...                     'data2': np.random.randint(0, 4, 8)})
>>> obj
  data1  data2
0     a      0
1     a      0
2     a      0
3     a      3
4     b      3
5     b      3
6     b      0
7     b      2
>>> 
>>> obj.duplicated()
0    False
1     True
2     True
3    False
4    False
5     True
6    False
7    False
dtype: bool
>>> 
>>> obj.duplicated(subset='data1')
0    False
1     True
2     True
3     True
4    False
5     True
6     True
7     True
dtype: bool
>>> 
>>> obj.duplicated(subset='data2', keep='last')
0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
dtype: bool
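
Because the example above depends on random data, here is a deterministic sketch of a common use of duplicated: building a boolean mask over several columns and using it to inspect the offending rows. The orders frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical order records; the column names are illustrative only
orders = pd.DataFrame({
    'user':  ['ann', 'ann', 'bob', 'bob', 'cid'],
    'item':  ['pen', 'pen', 'ink', 'pad', 'pen'],
    'price': [1.0, 1.2, 3.0, 2.0, 1.0],
})

# A row is a duplicate when its (user, item) pair occurs more than once;
# keep=False flags every member of such a group, not only the repeats
dup_mask = orders.duplicated(subset=['user', 'item'], keep=False)

print(dup_mask.tolist())      # [True, True, False, False, False]
print(orders[dup_mask])       # the rows that share a (user, item) pair
print(int(dup_mask.sum()))    # 2
```

Passing a list to subset compares the listed columns together, so ('bob', 'ink') and ('bob', 'pad') are not duplicates of each other even though the user repeats.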

【02x02】drop_duplicates

The drop_duplicates method returns a copy of the object with duplicate values removed.

Basic syntax:

Series.drop_duplicates(self, keep='first', inplace=False)

DataFrame.drop_duplicates(self,
                          subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
                          keep: Union[str, bool] = 'first',
                          inplace: bool = False,
                          ignore_index: bool = False) → Union[DataFrame, NoneType]

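Parameter	Description
keep	Which duplicates to remove; default 'first'
	'first': keep unique values and the first occurrence of each duplicated value; drop the later occurrences
	'last': keep unique values and the last occurrence of each duplicated value; drop the earlier occurrences
	False: drop every occurrence of a duplicated value; keep only unique values
inplace	bool, default False. If False, a new object with the duplicates removed is returned; if True, the original object is modified in place and None is returned
subset	Column label or sequence of labels; DataFrame only. Only the given column(s) are considered when identifying duplicates; by default all columns are used
ignore_index	bool, default False; DataFrame only. If True, the original index labels are discarded and the result is indexed 0, 1, 2, …, n-1

Using the keep parameter: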

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>> 
>>> obj.drop_duplicates()
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
>>> 
>>> obj.drop_duplicates(keep='last')
1       cow
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>> 
>>> obj.drop_duplicates(keep=False)
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
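
The keep parameter has the same meaning in duplicated and drop_duplicates. As a rough check of that correspondence, the sketch below compares filtering with the negated duplicated mask against drop_duplicates for each keep setting. The loop and the names via_mask and via_drop are illustrative.

```python
import pandas as pd

s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')

results = {}
for keep in ('first', 'last', False):
    via_mask = s[~s.duplicated(keep=keep)]   # keep the rows NOT flagged as duplicates
    via_drop = s.drop_duplicates(keep=keep)  # let pandas drop them directly
    results[keep] = (via_mask.equals(via_drop), via_drop.index.tolist())
    print(keep, results[keep])
```

For this Series both routes select the same rows: index 0, 1, 3, 5 for 'first', 1, 3, 4, 5 for 'last', and 1, 3, 5 for False.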

With inplace=True the method returns None and modifies the original object directly:

>>> import pandas as pd
>>> obj1 = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')
>>> obj1
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>> 
>>> obj2 = obj1.drop_duplicates()
>>> obj2          # a new Series is returned
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
>>> 
>>> obj3 = obj1.drop_duplicates(inplace=True)
>>> obj3         # None is returned, so nothing is printed
>>>
>>> obj1         # the original object has been modified
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

Applied to a DataFrame (the data2 column is random, so your values will differ from the output shown):

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
...                     'data2': np.random.randint(0, 4, 8)})
>>> obj
  data1  data2
0     a      2
1     a      1
2     a      1
3     a      2
4     b      1
5     b      2
6     b      0
7     b      0
>>> 
>>> obj.drop_duplicates()
  data1  data2
0     a      2
1     a      1
4     b      1
5     b      2
6     b      0
>>> 
>>> obj.drop_duplicates(subset='data2')
  data1  data2
0     a      2
1     a      1
6     b      0
>>> 
>>> obj.drop_duplicates(subset='data2', ignore_index=True)
  data1  data2
0     a      2
1     a      1
2     b      0
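
The parameters can be combined. As a deterministic sketch, the following keeps only the last record for each pair of key columns and renumbers the result. The log frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sensor log: the same (sensor, day) pair can be reported more than once
log = pd.DataFrame({
    'sensor': ['s1', 's1', 's2', 's2', 's2'],
    'day':    [1, 1, 1, 2, 2],
    'value':  [10, 11, 20, 21, 22],
})

# Keep the last report for each (sensor, day) pair and reset the index to 0..n-1
latest = log.drop_duplicates(subset=['sensor', 'day'], keep='last', ignore_index=True)

print(latest)
print(len(log))   # 5: without inplace=True the original frame is left unchanged
```

Here keep='last' assumes that later rows are the more recent ones; if the frame is not already ordered that way, it would need to be sorted first.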

【03x00】Data Replacement

【03x01】replace

The replace method substitutes values based on their content.

Basic syntax:

Series.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
DataFrame.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')

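Common parameters:

Parameter	Description
to_replace	How to find the values to replace. May be a string, regular expression, list, dict, integer, float, Series, or None; see the official documentation for how each form behaves
value	The value used to replace matches. For a DataFrame, a dict can specify which value to use for each column; regular expressions, strings, lists, or dicts of such objects are also allowed
inplace	bool, default False; whether to modify the original data directly and return None
regex	bool, or the same types as to_replace. When to_replace is a regular expression, regex should be True; alternatively, pass the pattern through this parameter instead of to_replace

Passing a single value for both to_replace and value replaces one value with another: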

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

Passing several values for to_replace and one for value replaces all of them with the same value:

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.replace([0, 1, 2, 3], 4)
0    4
1    4
2    4
3    4
4    4
dtype: int64

Passing lists of equal length for both to_replace and value replaces each value with its counterpart:

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.replace([0, 1, 2, 3], [4, 3, 2, 1])
0    4
1    3
2    2
3    1
4    4
dtype: int64

Passing a dict for to_replace (shown on a DataFrame, after a scalar replacement for comparison):

>>> import pandas as pd
>>> obj = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                     'B': [5, 6, 7, 8, 9],
...                     'C': ['a', 'b', 'c', 'd', 'e']})
>>> obj
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
>>> 
>>> obj.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
>>> 
>>> obj.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> 
>>> obj.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> obj.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Passing a regular expression for to_replace:

>>> import pandas as pd
>>> obj = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                     'B': ['abc', 'bar', 'xyz']})
>>> obj
      A    B
0   bat  abc
1   foo  bar
2  bait  xyz
>>> 
>>> obj.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> 
>>> obj.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> 
>>> obj.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> 
>>> obj.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> 
>>> obj.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz
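
A typical practical use of the dict forms shown above is turning placeholder codes into real missing values. The sketch below assumes a hypothetical survey frame in which -999 and 'N/A' stand for "missing"; the frame, its column names, and the sentinel codes are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: -999 and 'N/A' are placeholder codes for "missing"
df = pd.DataFrame({
    'age':  [25, -999, 40],
    'city': ['Paris', 'N/A', 'Rome'],
})

# {column: value_to_find} with a single replacement value:
# each sentinel is only looked for in its own column
cleaned = df.replace({'age': -999, 'city': 'N/A'}, np.nan)

# Nested dict {column: {old: new}}: restrict a replacement to one column
renamed = df.replace({'city': {'Paris': 'FR-Paris'}})

print(cleaned)
print(renamed)
print(df.loc[1, 'age'])   # -999: replace returns a new object unless inplace=True
```

Scoping each sentinel to its column avoids accidentally replacing a legitimate value that happens to match a sentinel in another column.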

【03x02】where

The where method replaces the values for which the condition is False.

Basic syntax:

Series.where(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
DataFrame.where(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

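Common parameters:

Parameter	Description
cond	The condition. Where cond is True the original value is kept; where it is False the value is replaced by the corresponding value from other
other	The replacement. Where cond is False, the corresponding value from this parameter is used; default NaN
inplace	bool, default False; whether to modify the original data directly and return None

Applied to a Series: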

>>> import pandas as pd
>>> obj = pd.Series(range(5))
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.where(obj > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> 
>>> obj.where(obj > 1, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64

Applied to a DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> obj
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> 
>>> m = obj % 3 == 0
>>> obj.where(m, -obj)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> 
>>> obj.where(m, -obj) == np.where(m, obj, -obj)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
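
As a small sketch of the two kinds of other argument described above, the following replaces negative values first with a scalar and then with values taken from an aligned Series. The data and variable names are illustrative.

```python
import pandas as pd

s = pd.Series([5, -3, 8, -1, 0])

# Scalar `other`: keep values where the condition holds, otherwise use 0
non_negative = s.where(s >= 0, 0)

# Series `other`: the replacement is taken from the matching position of -s
flipped = s.where(s >= 0, -s)

print(non_negative.tolist())   # [5, 0, 8, 0, 0]
print(flipped.tolist())        # [5, 3, 8, 1, 0]
print(s.tolist())              # [5, -3, 8, -1, 0]: the original is unchanged
```

Supplying other explicitly also avoids the conversion to float seen when the default NaN is used on integer data.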

【03x03】mask

The mask method is the inverse of where: it replaces the values for which the condition is True.

Basic syntax:

Series.mask(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
DataFrame.mask(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

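Common parameters:

Parameter	Description
cond	The condition. Where cond is False the original value is kept; where it is True the value is replaced by the corresponding value from other
other	The replacement. Where cond is True, the corresponding value from this parameter is used; default NaN
inplace	bool, default False; whether to modify the original data directly and return None

Applied to a Series: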

>>> import pandas as pd
>>> obj = pd.Series(range(5))
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.mask(obj > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> 
>>> obj.mask(obj > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64

Applied to a DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> obj
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> 
>>> m = obj % 3 == 0
>>> 
>>> obj.mask(m, -obj)
   A  B
0  0  1
1  2 -3
2  4  5
3 -6  7
4  8 -9
>>> 
>>> obj.where(m, -obj) == obj.mask(~m, -obj)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
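
The equivalence shown above, mask(cond) matching where(~cond), can be used to pick whichever reads more naturally. A minimal sketch that caps large values follows; the threshold limit and the data are hypothetical.

```python
import pandas as pd

s = pd.Series([1, 2, 100, 3, 250])
limit = 10  # hypothetical threshold chosen for illustration

# mask: replace the values for which the condition is True
capped = s.mask(s > limit, limit)

# where: keep the values for which the complementary condition is True
same = s.where(s <= limit, limit)

print(capped.tolist())       # [1, 2, 10, 3, 10]
print(capped.equals(same))   # True
```

Here mask states the intent ("replace what exceeds the limit") more directly, while where is the better fit when the condition describes the values to keep.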
