The Python Data Analysis Trio, Pandas (Part 4): Function Application, Mapping, Sorting, and Hierarchical Indexing

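The NumPy and Matplotlib series are already complete; you are welcome to follow them:

NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

Recommended learning resources and sites (the author contributed to part of the documentation translation):

NumPy official Chinese site: https://www.numpy.org.cn/
Pandas official Chinese site: https://www.pypandas.cn/
Matplotlib official Chinese site: https://www.matplotlib.org.cn/
NumPy, Matplotlib, and Pandas cheat sheets: https://github.com/TRHX/Python-quick-reference-table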

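This is an anti-scraping notice; readers can ignore it.
This article was first published on CSDN. Author: TRHX.
Blog home page: https://itrhx.blog.csdn.net/
Article link: https://itrhx.blog.csdn.net/article/details/106758103
Reproduction without permission is prohibited. Please respect original work.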

【01x00】Function Application and Mapping

Pandas objects can be passed directly to NumPy ufuncs (element-wise array functions):

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.random.randn(5,4) - 1)
>>> obj
          0         1         2         3
0 -0.228107  1.377709 -1.096528 -2.051001
1 -2.477144 -0.500013 -0.040695 -0.267452
2 -0.485999 -1.232930 -0.390701 -1.947984
3 -0.839161 -0.702802 -1.756359 -1.873149
4  0.853121 -1.540105  0.621614 -0.583360
>>> 
>>> np.abs(obj)
          0         1         2         3
0  0.228107  1.377709  1.096528  2.051001
1  2.477144  0.500013  0.040695  0.267452
2  0.485999  1.232930  0.390701  1.947984
3  0.839161  0.702802  1.756359  1.873149
4  0.853121  1.540105  0.621614  0.583360
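
As a minimal sketch of the point above (the small frame and its labels are made up for illustration), the following shows that a ufunc applied to a DataFrame works element by element and returns a DataFrame that keeps the original row and column labels:

```python
import numpy as np
import pandas as pd

# Small fixed frame so the result is reproducible (labels are illustrative).
df = pd.DataFrame(
    [[-1.5, 2.0], [3.0, -4.0]],
    index=["r1", "r2"],
    columns=["x", "y"],
)

abs_df = np.abs(df)        # element-wise absolute value, still a DataFrame
sqrt_df = np.sqrt(abs_df)  # ufuncs can be chained on the result

print(type(abs_df).__name__)
print(abs_df)
print(sqrt_df)
```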

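Function mapping: the apply method applies a function along one axis of a DataFrame. The axis parameter selects the direction: with the default, axis=0, each column is passed to the function; with axis=1, each row is passed: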

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.random.randn(5,4) - 1)
>>> obj
          0         1         2         3
0 -0.707028 -0.755552 -2.196480 -0.529676
1 -0.772668  0.127485 -2.015699 -0.283654
2  0.248200 -1.940189 -1.068028 -1.751737
3 -0.872904 -0.465371 -1.327951 -2.883160
4 -0.092664  0.258351 -1.010747 -2.313039
>>> 
>>> obj.apply(lambda x : x.max())
0    0.248200
1    0.258351
2   -1.010747
3   -0.283654
dtype: float64
>>>
>>> obj.apply(lambda x : x.max(), axis=1)
0   -0.529676
1    0.127485
2    0.248200
3   -0.465371
4    0.258351
dtype: float64
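
The function passed to apply does not have to be a lambda. The sketch below uses two hypothetical helpers, value_range and min_max (names chosen only for this illustration), to show a scalar-returning function on both axes and a function that returns a Series, in which case apply assembles the results into a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [10, 2, 8]})

def value_range(values):
    # Hypothetical helper: spread between the largest and smallest value.
    return values.max() - values.min()

col_range = df.apply(value_range)          # axis=0: one result per column
row_range = df.apply(value_range, axis=1)  # axis=1: one result per row

def min_max(col):
    # Returning a Series yields one output row per key in the Series.
    return pd.Series({"min": col.min(), "max": col.max()})

summary = df.apply(min_max)

print(col_range)
print(row_range)
print(summary)
```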

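In addition, applymap applies a function to every individual element (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which behaves the same way):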

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.random.randn(5,4) - 1)
>>> obj
          0         1         2         3
0 -0.772463 -1.597008 -3.196100 -1.948486
1 -1.765108 -1.646421 -0.687175 -0.401782
2  0.275699 -3.115184 -1.429063 -1.075610
3 -0.251734 -0.448399 -3.077677 -0.294674
4 -1.495896 -1.689729 -0.560376 -1.808794
>>> 
>>> obj.applymap(lambda x : '%.2f' % x)
       0      1      2      3
0  -0.77  -1.60  -3.20  -1.95
1  -1.77  -1.65  -0.69  -0.40
2   0.28  -3.12  -1.43  -1.08
3  -0.25  -0.45  -3.08  -0.29
4  -1.50  -1.69  -0.56  -1.81
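
The following sketch shows one way to write element-wise formatting so that it runs on both older and newer pandas releases. It assumes DataFrame.map is present from pandas 2.1 onward and falls back to applymap otherwise; fmt is a hypothetical helper name. Series.map does the same job for a single column:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.234, -5.678], "b": [0.5, 2.0]})

def fmt(x):
    # Hypothetical helper: format one number with two decimal places.
    return f"{x:.2f}"

# Assumption: DataFrame.map exists on pandas >= 2.1; older releases only
# provide applymap, so pick whichever is available.
elementwise = df.map if hasattr(df, "map") else df.applymap
formatted = elementwise(fmt)

# For a single column, Series.map applies the function element by element.
col_formatted = df["a"].map(fmt)

print(formatted)
print(col_formatted)
```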

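【02x00】Sorting

【02x01】sort_index() Sorting by Index

Sorting a dataset by some criterion is another important built-in operation. To sort by row or column labels (in lexicographic order), use the sort_index method, which returns a new, sorted object.

The basic syntax for Series and DataFrame is: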

Series.sort_index(self, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index: bool = False)
DataFrame.sort_index(self, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index: bool = False)

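Commonly used parameters:

Parameter	Description
axis	Axis to sort along: `0` or `'index'`, `1` or `'columns'`; `1` / `'columns'` is available only on DataFrame
ascending	`True` (default) sorts in ascending order; `False` sorts in descending order
kind	Sorting algorithm: `'quicksort'` (default), `'mergesort'`, or `'heapsort'`; see numpy.sort() for details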

Applied to a Series (sorting by the index):

>>> import pandas as pd
>>> obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
>>> obj
d    0
a    1
b    2
c    3
dtype: int64
>>> 
>>> obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64

Applied to a DataFrame (sorting by either the row index or the column labels):

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
>>> obj
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
>>> 
>>> obj.sort_index()
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
>>> 
>>> obj.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
>>> 
>>> obj.sort_index(axis=1, ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
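
Because sort_index returns a new object, the original is left untouched unless inplace=True is passed. A small sketch of this, also using the ascending and ignore_index parameters from the signature above (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(range(4), index=["d", "a", "b", "c"])

desc = s.sort_index(ascending=False)     # new Series, labels in reverse order
reset = s.sort_index(ignore_index=True)  # sorted by label, then relabeled 0..n-1

print(list(s.index))     # the original order is unchanged
print(list(desc.index))
print(reset)
```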

【02x02】sort_values() Sorting by Value

The basic syntax for Series and DataFrame is:

Series.sort_values(self, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)
DataFrame.sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)

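Commonly used parameters:

Parameter	Description
by	Required for DataFrame: the label or list of labels to sort by (column labels when axis=0, row labels when axis=1); Series has no such parameter
axis	Axis to sort along: `0` or `'index'`, `1` or `'columns'`; `1` / `'columns'` is available only on DataFrame
ascending	`True` (default) sorts in ascending order; `False` sorts in descending order
kind	Sorting algorithm: `'quicksort'` (default), `'mergesort'`, or `'heapsort'`; see numpy.sort() for details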

Applied to a Series, sorting by value; by default, any missing values are placed at the end of the Series:

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series([4, 7, -3, 2])
>>> obj
0    4
1    7
2   -3
3    2
dtype: int64
>>> 
>>> obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64
>>> 
>>> obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
>>> obj
0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64
>>> 
>>> obj.sort_values()
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
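
The placement of missing values is controlled by the na_position parameter shown in the signature above. A brief sketch, reusing the same data (variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# na_position='first' moves the NaN entries to the front.
nan_first = s.sort_values(na_position="first")

# Descending order still keeps NaN at the end by default.
desc = s.sort_values(ascending=False)

print(nan_first)
print(desc)
```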

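Applied to a DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more column names to the by parameter of sort_values(). When several columns are given, rows are sorted by the first column; rows with equal values in the first column are then ordered by the second column, and so on: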

>>> import pandas as pd
>>> obj = pd.DataFrame({'a': [4, 4, -3, 2], 'b': [0, 1, 0, 1], 'c': [6, 4, 1, 3]})
>>> obj
   a  b  c
0  4  0  6
1  4  1  4
2 -3  0  1
3  2  1  3
>>> 
>>> obj.sort_values(by='c')
   a  b  c
2 -3  0  1
3  2  1  3
1  4  1  4
0  4  0  6
>>> 
>>> obj.sort_values(by='c', ascending=False)
   a  b  c
0  4  0  6
1  4  1  4
3  2  1  3
2 -3  0  1
>>>
>>> obj.sort_values(by=['a', 'b'])
   a  b  c
2 -3  0  1
3  2  1  3
0  4  0  6
1  4  1  4

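With axis=1, by names a row label, and the columns are reordered according to the values in that row:

>>> import pandas as pd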
>>> obj = pd.DataFrame({'a': [4, 4, -3, 2], 'b': [0, 1, 0, 1], 'c': [6, 4, 1, 3]}, index=['A', 'B', 'C', 'D'])
>>> obj
   a  b  c
A  4  0  6
B  4  1  4
C -3  0  1
D  2  1  3
>>> 
>>> obj.sort_values(by='B', axis=1)
   b  a  c
A  0  4  6
B  1  4  4
C  0 -3  1
D  1  2  3
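
When sorting by several columns, ascending also accepts a list with one flag per column. The sketch below assumes the same data as the earlier example and sorts by b ascending, then by a descending; ignore_index=True (from the signature above) renumbers the result:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 4, -3, 2], "b": [0, 1, 0, 1], "c": [6, 4, 1, 3]})

# One ascending flag per sort key: b ascending, a descending within each b.
mixed = df.sort_values(by=["b", "a"], ascending=[True, False])

# Same ordering, but with a fresh 0..n-1 index instead of the original labels.
renumbered = df.sort_values(by=["b", "a"], ascending=[True, False], ignore_index=True)

print(mixed)
print(renumbered)
```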

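【02x03】rank() Rank of Each Element After Sorting

rank() returns an object of the same shape as the original whose values are ranks: the position, counted from 1, that each original value would take if the data were sorted.

The basic syntax for Series and DataFrame is: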

Series.rank(self: ~ FrameOrSeries, axis=0, method: str = 'average', numeric_only: Union[bool, NoneType] = None, na_option: str = 'keep', ascending: bool = True, pct: bool = False)
DataFrame.rank(self: ~ FrameOrSeries, axis=0, method: str = 'average', numeric_only: Union[bool, NoneType] = None, na_option: str = 'keep', ascending: bool = True, pct: bool = False)

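Commonly used parameters:

Parameter	Description
axis	Axis to rank along: `0` or `'index'`, `1` or `'columns'`; `1` / `'columns'` is available only on DataFrame
method	How tied values are ranked: `'average'` (default): the mean of the ranks the tied values would occupy; `'min'`: the lowest of those ranks; `'max'`: the highest of those ranks; `'first'`: ranks assigned in the order the values appear; `'dense'`: like `'min'`, but the rank always increases by exactly 1 from one group to the next (see the examples that follow)
ascending	`True` (default) ranks in ascending order; `False` ranks in descending order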

Applied to a Series, values are ranked by size, and tied values are handled according to method:

>>> import pandas as pd
>>> obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
>>> obj
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
>>> 
>>> obj.rank()
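0    6.5  # elements 0 and 2 occupy positions 6 and 7 in ascending order; the default takes their mean, 6.5
1    1.0
2    6.5
3    4.5  # elements 3 and 6 occupy positions 4 and 5 in ascending order; the default takes their mean, 4.5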
4    3.0
5    2.0
6    4.5
dtype: float64
>>> 
>>> obj.rank(method='first')
0    6.0  # elements 0 and 2 occupy positions 6 and 7; ranked in order of appearance, they get 6 and 7
1    1.0
2    7.0
3    4.0  # elements 3 and 6 occupy positions 4 and 5; ranked in order of appearance, they get 4 and 5
4    3.0
5    2.0
6    5.0
dtype: float64
>>> 
>>> obj.rank(method='dense')
0    5.0  # elements 0 and 2 occupy positions 6 and 7; dense uses the group minimum but leaves no gaps between groups, so both get 5
1    1.0
2    5.0
3    4.0  # elements 3 and 6 occupy positions 4 and 5; the group minimum is 4
4    3.0
5    2.0
6    4.0
dtype: float64
>>> 
>>> obj.rank(method='min')
0    6.0  # elements 0 and 2 occupy positions 6 and 7; both get the minimum, 6
1    1.0
2    6.0
3    4.0  # elements 3 and 6 occupy positions 4 and 5; both get the minimum, 4
4    3.0
5    2.0
6    4.0
dtype: float64
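
For completeness, here is a short sketch of two options from the table that the examples above do not exercise, method='max' and ascending=False, on the same data. It also shows that, with the default na_option='keep' from the signature, a missing value receives a NaN rank rather than being ranked last (the three-element Series is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

max_rank = s.rank(method="max")      # tied values share the highest rank of the group
desc_rank = s.rank(ascending=False)  # rank 1 goes to the largest value

# Default na_option='keep': the NaN entry keeps a NaN rank.
with_nan = pd.Series([3.0, np.nan, 1.0]).rank()

print(max_rank.tolist())
print(desc_rank.tolist())
print(with_nan.tolist())
```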

On a DataFrame, the axis parameter selects the axis to rank along:

>>> import pandas as pd
>>> obj = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
>>> obj
     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5
>>> 
>>> obj.rank()
     b    a    c
0  3.0  1.5  2.0
1  4.0  3.5  3.0
2  1.0  1.5  4.0
3  2.0  3.5  1.0
>>> 
>>> obj.rank(axis='columns')
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0

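【03x00】Hierarchical Indexing

【03x01】Introducing Hierarchical Indexes

The following example creates a Series whose index is built from two lists: the first list is the outer index level and the second list is the inner index level: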

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.Series(np.random.randn(12),index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> obj
a  0   -0.201536
   1   -0.629058
   2    0.766716
b  0   -1.255831
   1   -0.483727
   2   -0.018653
c  0    0.788787
   1    1.010097
   2   -0.187258
d  0    1.242363
   1   -0.822011
   2   -0.085682
dtype: float64

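【03x02】The MultiIndex Object

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html

Printing the type of the index of the Series from the example above gives a MultiIndex object. Its levels attribute lists the unique labels in each of the two levels, and its codes attribute gives, for each position, the integer position of its label within the corresponding level: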

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.Series(np.random.randn(12),index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> obj
a  0    0.035946
   1   -0.867215
   2   -0.053355
b  0   -0.986616
   1    0.026071
   2   -0.048394
c  0    0.251274
   1    0.217790
   2    1.137674
d  0   -1.245178
   1    1.234972
   2   -0.035624
dtype: float64
>>> 
>>> type(obj.index)
<class 'pandas.core.indexes.multi.MultiIndex'>
>>> 
>>> obj.index
MultiIndex([('a', 0),
            ('a', 1),
            ('a', 2),
            ('b', 0),
            ('b', 1),
            ('b', 2),
            ('c', 0),
            ('c', 1),
            ('c', 2),
            ('d', 0),
            ('d', 1),
            ('d', 2)],
           )
>>> obj.index.levels
FrozenList([['a', 'b', 'c', 'd'], [0, 1, 2]])
>>>
>>> obj.index.codes
FrozenList([[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
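
To make the relationship between levels and codes concrete, the sketch below (on a smaller, made-up index) rebuilds the index tuples by looking up each code in its level. This is only an illustration of how the two attributes fit together, not how pandas materializes the index internally:

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays([["a", "a", "b", "b"], [0, 1, 0, 1]])

print(idx.levels[0].tolist())  # unique labels of the outer level
print(idx.codes[0].tolist())   # position of each entry's label within that level

# Look up every code in its level to recover the label at each position.
outer = idx.levels[0].take(idx.codes[0])
inner = idx.levels[1].take(idx.codes[1])
rebuilt = list(zip(outer, inner))

print(rebuilt)
```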

The from_arrays() method is commonly used to convert a list of arrays into a MultiIndex object:

>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])

Other commonly used methods are listed below (see the official documentation for more):

Method	Description
from_arrays(arrays[, sortorder, names])	Convert arrays to a MultiIndex
from_tuples(tuples[, sortorder, names])	Convert a list of tuples to a MultiIndex
from_product(iterables[, sortorder, names])	Build a MultiIndex from the Cartesian product of several iterables
from_frame(df[, sortorder, names])	Convert a DataFrame to a MultiIndex
set_levels(self, levels[, level, inplace, …])	Set new levels on a MultiIndex
set_codes(self, codes[, level, inplace, …])	Set new codes on a MultiIndex
sortlevel(self[, level, ascending, …])	Sort by the given level
droplevel(self[, level])	Remove the specified level
swaplevel(self[, i, j])	Swap level i with level j, for example the outer and inner levels
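
As a brief sketch of the first three constructors in the table, the following builds the same index as the from_arrays() example in three different ways; the variable names are illustrative:

```python
import pandas as pd

names = ["number", "color"]

from_arrays = pd.MultiIndex.from_arrays(
    [[1, 1, 2, 2], ["red", "blue", "red", "blue"]], names=names
)

tuples = [(1, "red"), (1, "blue"), (2, "red"), (2, "blue")]
from_tuples = pd.MultiIndex.from_tuples(tuples, names=names)

# Cartesian product: every number paired with every color, in order.
from_product = pd.MultiIndex.from_product([[1, 2], ["red", "blue"]], names=names)

print(from_tuples.equals(from_arrays))
print(from_product.tolist())
```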

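【03x03】Selecting Values

For an object with a hierarchical index, passing a single label selects on the outer level and returns all of the matching inner-level entries. Passing two labels selects on the outer level with the first and on the inner level with the second; a slice (:) in the first position selects every outer label: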

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.Series(np.random.randn(12),index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> obj
a  0    0.550202
   1    0.328784
   2    1.422690
b  0   -1.333477
   1   -0.933809
   2   -0.326541
c  0    0.663686
   1    0.943393
   2    0.273106
d  0    1.354037
   1   -2.312847
   2   -2.343777
dtype: float64
>>> 
>>> obj['b']
0   -1.333477
1   -0.933809
2   -0.326541
dtype: float64
>>>
>>> obj['b', 1]
-0.9338094811708413
>>> 
>>> obj[:, 2]
a    1.422690
b   -0.326541
c    0.273106
d   -2.343777
dtype: float64
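
The same selections can be written with .loc, which makes label-based access explicit, and Series.xs offers another way to select on the inner level. A minimal sketch on a smaller Series with fixed values (the data is made up so the results are reproducible):

```python
import pandas as pd

s = pd.Series(
    range(6),
    index=[["a", "a", "a", "b", "b", "b"], [0, 1, 2, 0, 1, 2]],
)

outer_b = s.loc["b"]           # all inner entries under outer label 'b'
single = s.loc[("b", 1)]       # one value: outer 'b', inner 1
inner_2 = s.loc[:, 2]          # inner label 2 under every outer label
inner_2_xs = s.xs(2, level=1)  # same selection via a cross-section

print(outer_b.tolist())
print(single)
print(inner_2)
```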

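【03x04】Swapping Levels and Sorting

The swaplevel() method (available on both Series and MultiIndex) swaps the outer and inner index levels. The sortlevel() method of MultiIndex sorts by the outer level first and then by the inner level, in ascending order by default; pass ascending=False for descending order. It returns a tuple containing the sorted index and the integer indexer that produces that order: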

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.Series(np.random.randn(12),index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> obj
a  0   -0.110215
   1    0.193075
   2   -1.101706
b  0   -1.325743
   1    0.528418
   2   -0.127081
c  0   -0.733822
   1    1.665262
   2    0.127073
d  0    1.262022
   1   -1.170518
   2    0.966334
dtype: float64
>>> 
>>> obj.swaplevel()
0  a   -0.110215
1  a    0.193075
2  a   -1.101706
0  b   -1.325743
1  b    0.528418
2  b   -0.127081
0  c   -0.733822
1  c    1.665262
2  c    0.127073
0  d    1.262022
1  d   -1.170518
2  d    0.966334
dtype: float64
>>> 
>>> obj.swaplevel().index.sortlevel()
(MultiIndex([(0, 'a'),
            (0, 'b'),
            (0, 'c'),
            (0, 'd'),
            (1, 'a'),
            (1, 'b'),
            (1, 'c'),
            (1, 'd'),
            (2, 'a'),
            (2, 'b'),
            (2, 'c'),
            (2, 'd')],
           ), array([ 0,  3,  6,  9,  1,  4,  7, 10,  2,  5,  8, 11], dtype=int32))
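
sortlevel() sorts only the index. To reorder the data as well, one option is to call sort_index() on the Series itself, or to apply the returned indexer with iloc. The sketch below uses a smaller Series with fixed values (made up for illustration) to show that both routes give the same order:

```python
import pandas as pd

s = pd.Series(
    range(6),
    index=[["a", "a", "a", "b", "b", "b"], [0, 1, 2, 0, 1, 2]],
)

swapped = s.swaplevel()          # inner level becomes the outer level
sorted_s = swapped.sort_index()  # sorts the data together with the index

# sortlevel() returns (sorted index, positions that produce that order).
sorted_index, indexer = swapped.index.sortlevel()
via_indexer = swapped.iloc[indexer]

print(sorted_s)
print(list(indexer))
```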
