The Python Data Analysis Trio: Pandas (Part 5), Statistical Computation and Descriptive Statistics

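The companion NumPy and Matplotlib series are complete; feel free to follow them:

NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

Recommended learning resources and sites (the author contributed to some of the documentation translations):

NumPy Chinese-language site: https://www.numpy.org.cn/
Pandas Chinese-language site: https://www.pypandas.cn/
Matplotlib Chinese-language site: https://www.matplotlib.org.cn/
NumPy, Matplotlib, and Pandas cheat sheets: https://github.com/TRHX/Python-quick-reference-table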

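(Anti-scraping notice; readers may ignore it.)
This article was first published on CSDN by TRHX.
Blog home: https://itrhx.blog.csdn.net/
Article link: https://itrhx.blog.csdn.net/article/details/106788501
Reposting without permission is prohibited. Please respect original work and do not plagiarize.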

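【01x00】Statistical Computation

Pandas objects come with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics: they extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the corresponding NumPy array methods, they have built-in handling for missing data: NaN values are skipped by default.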
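The difference in missing-data handling can be seen in a minimal sketch like the one below. The variable names are illustrative only; the calls are the standard NumPy and pandas reductions.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]

# A plain NumPy reduction propagates NaN.
numpy_total = np.array(values).sum()

# pandas skips NaN by default (skipna=True).
pandas_total = pd.Series(values).sum()

# skipna=False restores the NaN-propagating behaviour.
pandas_strict_total = pd.Series(values).sum(skipna=False)

print(numpy_total, pandas_total, pandas_strict_total)
```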

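【01x01】sum(): Sum

sum() returns the sum of the values along the specified axis. It is the counterpart of numpy.sum(), except that missing values are skipped by default.

Basic syntax for Series and DataFrame: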

Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
DataFrame.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)

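Common parameters:

Parameter	Description
axis	Axis to sum along: `0` or `'index'`, `1` or `'columns'`; `1` or `'columns'` is only available on a DataFrame
skipna	bool; whether to exclude missing values (NA/null) when summing. Defaults to True
level	If the axis is a MultiIndex (hierarchical), sum along the given level (this argument was deprecated in pandas 1.3 and removed in 2.0)

Usage with a Series: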

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.sum()
14
>>> 
>>> obj.sum(level='blooded')
blooded
warm    6
cold    8
Name: legs, dtype: int64
>>> 
>>> obj.sum(level=0)
blooded
warm    6
cold    8
Name: legs, dtype: int64
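On pandas 2.0 and later, where the `level` argument is no longer accepted, the same per-level totals can be obtained with `groupby(level=...)`. The following is a sketch of that equivalent; `legs` and `legs_by_blood` are names chosen here for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['warm', 'warm', 'cold', 'cold'],
     ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
legs = pd.Series([4, 2, 0, 8], name='legs', index=idx)

# sort=False keeps groups in order of first appearance (warm, then cold),
# matching the output of the older level= argument.
legs_by_blood = legs.groupby(level='blooded', sort=False).sum()
print(legs_by_blood)
```

The same pattern works for min(), max(), and mean() by swapping the final method call.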

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.sum()
one    9.25
two   -5.80
dtype: float64
>>> 
>>> obj.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
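Row `c` sums to 0.00 above because every value in it is NaN and NaN is skipped, leaving an empty sum. A small sketch of how the `min_count` and `skipna` arguments change that result (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default: NaN is skipped, so the all-NaN row 'c' sums to 0.0.
row_sums = frame.sum(axis=1)

# min_count=1 requires at least one non-NaN value; row 'c' becomes NaN.
row_sums_min1 = frame.sum(axis=1, min_count=1)

# skipna=False propagates NaN, so rows 'a' and 'c' both become NaN.
row_sums_strict = frame.sum(axis=1, skipna=False)

print(row_sums_min1)
print(row_sums_strict)
```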

【01x02】min(): Minimum

min() returns the minimum value along the specified axis.

Basic syntax for Series and DataFrame:

Series.min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

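Common parameters:

Parameter	Description
axis	Axis to take the minimum along: `0` or `'index'`, `1` or `'columns'`; `1` or `'columns'` is only available on a DataFrame
skipna	bool; whether to exclude missing values (NA/null) when taking the minimum. Defaults to True
level	If the axis is a MultiIndex (hierarchical), take the minimum along the given level (this argument was deprecated in pandas 1.3 and removed in 2.0)

Usage with a Series: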

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.min()
0
>>> 
>>> obj.min(level='blooded')
blooded
warm    2
cold    0
Name: legs, dtype: int64
>>> 
>>> obj.min(level=0)
blooded
warm    2
cold    0
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.min()
one    0.75
two   -4.50
dtype: float64
>>> 
>>> obj.min(axis=1)
a    1.4
b   -4.5
c    NaN
d   -1.3
dtype: float64
>>> 
>>> obj.min(axis='columns', skipna=False)
a    NaN
b   -4.5
c    NaN
d   -1.3
dtype: float64

【01x03】max(): Maximum

max() returns the maximum value along the specified axis.

Basic syntax for Series and DataFrame:

Series.max(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

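Common parameters:

Parameter	Description
axis	Axis to take the maximum along: `0` or `'index'`, `1` or `'columns'`; `1` or `'columns'` is only available on a DataFrame
skipna	bool; whether to exclude missing values (NA/null) when taking the maximum. Defaults to True
level	If the axis is a MultiIndex (hierarchical), take the maximum along the given level (this argument was deprecated in pandas 1.3 and removed in 2.0)

Usage with a Series: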

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.max()
8
>>> 
>>> obj.max(level='blooded')
blooded
warm    4
cold    8
Name: legs, dtype: int64
>>> 
>>> obj.max(level=0)
blooded
warm    4
cold    8
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.max()
one    7.1
two   -1.3
dtype: float64
>>> 
>>> obj.max(axis=1)
a    1.40
b    7.10
c     NaN
d    0.75
dtype: float64
>>> 
>>> obj.max(axis='columns', skipna=False)
a     NaN
b    7.10
c     NaN
d    0.75
dtype: float64
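When both extremes are needed, min() and max() can be requested together. The sketch below uses `agg()` with a list of method names and also derives the per-column range; the names `extremes` and `value_range` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# agg() with a list of method names returns one row per statistic.
extremes = frame.agg(['min', 'max'])
print(extremes)

# Column-wise range (max - min); NaN is skipped by both reductions.
value_range = frame.max() - frame.min()
print(value_range)
```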

【01x04】mean(): Mean

mean() returns the mean of the values along the specified axis.

Basic syntax for Series and DataFrame:

Series.mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

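Common parameters:

Parameter	Description
axis	Axis to average along: `0` or `'index'`, `1` or `'columns'`; `1` or `'columns'` is only available on a DataFrame
skipna	bool; whether to exclude missing values (NA/null) when averaging. Defaults to True
level	If the axis is a MultiIndex (hierarchical), average along the given level (this argument was deprecated in pandas 1.3 and removed in 2.0)

Usage with a Series: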

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.mean()
3.5
>>> 
>>> obj.mean(level='blooded')
blooded
warm    3
cold    4
Name: legs, dtype: int64
>>> 
>>> obj.mean(level=0)
blooded
warm    3
cold    4
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> 
>>> obj.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> 
>>> obj.mean(axis='columns', skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
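With the default skipna=True, the mean is taken over the non-missing values only, so the divisor is the count of non-NaN entries rather than the number of rows. A short sketch that checks this relationship (variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

column_means = frame.mean()

# count() returns the number of non-NaN values per column (3 and 2 here),
# so sum / count reproduces the NaN-skipping mean.
manual_means = frame.sum() / frame.count()

print(np.allclose(column_means, manual_means))
```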

【01x05】idxmin(): Index of the Minimum

idxmin() returns the index label of the minimum value.

Basic syntax for Series and DataFrame:

Series.idxmin(self, axis=0, skipna=True, *args, **kwargs)
DataFrame.idxmin(self, axis=0, skipna=True)

Common parameters:

Parameter	Description
axis	Axis to search along: `0` or `'index'`, `1` or `'columns'`; `1` or `'columns'` is only available on a DataFrame
skipna	bool; whether to exclude missing values (NA/null). Defaults to True

Usage with a Series:

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.idxmin()
('cold', 'fish')

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.idxmin()
one    d
two    b
dtype: object
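Because idxmin() returns labels, its result can be passed straight to `.loc` to fetch the matching row, and `axis=1` gives the column holding each row's minimum. The sketch below drops the all-NaN row first; that step is an assumption made here for safety, since depending on the pandas version an all-NaN row may yield NaN, a warning, or an error.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Label of the row where column 'one' is smallest, then the full row.
row_label = frame['one'].idxmin()
row_with_min = frame.loc[row_label]
print(row_label)
print(row_with_min)

# Column label of each row's minimum, after dropping the all-NaN row 'c'.
clean = frame.dropna(how='all')
col_of_row_min = clean.idxmin(axis=1)
print(col_of_row_min)
```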

【01x06】idxmax(): Index of the Maximum

idxmax() returns the index label of the maximum value.

Basic syntax for Series and DataFrame:

Series.idxmax(self, axis=0, skipna=True, *args, **kwargs)
DataFrame.idxmax(self, axis=0, skipna=True)

Common parameters:

Parameter	Description
axis	Axis to search along: `0` or `'index'`, `1` or `'columns'`; `1` or `'columns'` is only available on a DataFrame
skipna	bool; whether to exclude missing values (NA/null). Defaults to True

Usage with a Series:

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.idxmax()
('cold', 'spider')

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.idxmax()
one    b
two    d
dtype: object
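idxmax() returns an index label, while Series.argmax() returns an integer position. A minimal sketch of the difference, using a flat index for brevity (the Series and variable names are illustrative):

```python
import pandas as pd

legs = pd.Series([4, 2, 0, 8],
                 index=['dog', 'falcon', 'fish', 'spider'], name='legs')

label = legs.idxmax()      # index label of the maximum
position = legs.argmax()   # integer position of the maximum

# Labels go with .loc, positions go with .iloc; both reach the same value.
print(label, position, legs.loc[label], legs.iloc[position])
```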

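【02x00】Descriptive Statistics

describe() produces a quick set of summary statistics: count, mean, standard deviation, minimum and maximum, quartiles, and so on. Its parameters control which percentiles are reported and which column data types are included or excluded.

Basic syntax for Series and DataFrame: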

Series.describe(self: ~ FrameOrSeries, percentiles=None, include=None, exclude=None)
DataFrame.describe(self: ~ FrameOrSeries, percentiles=None, include=None, exclude=None)

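Parameter	Description
percentiles	List of numbers, optional; the percentiles to include in the output. All values should be between 0 and 1. Defaults to [.25, .5, .75], i.e. the 25th, 50th, and 75th percentiles
include	Data types to include in the result: a list of dtypes or 'all'. Defaults to None; see the official documentation for accepted values
exclude	Data types to omit from the result: a list of dtypes. Defaults to None; see the official documentation for accepted values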

Describing a numeric Series:

>>> import pandas as pd
>>> obj = pd.Series([1, 2, 3])
>>> obj
0    1
1    2
2    3
dtype: int64
>>> 
>>> obj.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical (object-dtype) Series:

>>> import pandas as pd
>>> obj = pd.Series(['a', 'a', 'b', 'c'])
>>> obj
0    a
1    a
2    b
3    c
dtype: object
>>> 
>>> obj.describe()
count     4
unique    3
top       a
freq      2
dtype: object

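Describing timestamps (the output below reflects pandas 1.x; from pandas 2.0 onward, datetime data is summarized like numeric data, with mean, min, percentiles, and max instead of unique, top, freq, first, and last):

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series([
    np.datetime64("2000-01-01"),
    np.datetime64("2010-01-01"),
    np.datetime64("2010-01-01")
    ])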
>>> obj
0   2000-01-01
1   2010-01-01
2   2010-01-01
dtype: datetime64[ns]
>>> 
>>> obj.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a DataFrame (by default only numeric columns are summarized):

>>> import pandas as pd
>>> obj = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
>>> obj
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> 
>>> obj.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns regardless of data type:

>>> import pandas as pd
>>> obj = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
>>> obj
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> 
>>> obj.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Including only category columns:

>>> import pandas as pd
>>> obj = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
>>> obj
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> 
>>> obj.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1
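The examples above exercise `include`; the `percentiles` and `exclude` parameters can be sketched in the same way. The data and variable names below are illustrative, and `np.number` is used as the dtype selector for numeric columns.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1, 2, 3, 4, 5])

# Request the 10th and 90th percentiles instead of the default quartiles.
tails = scores.describe(percentiles=[0.1, 0.9])
print(tails)

frame = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'c'],
})

# Keep only numeric columns, or everything except numeric columns.
numeric_summary = frame.describe(include=[np.number])
non_numeric_summary = frame.describe(exclude=[np.number])
print(numeric_summary)
print(non_numeric_summary)
```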

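【03x00】Common Statistical Methods

Commonly used statistical methods are summarized in the table below:

Method	Description	Official docs
count	Number of non-NA values	Series | DataFrame
describe	Summary statistics for a Series or for each DataFrame column	Series | DataFrame
min	Minimum value	Series | DataFrame
max	Maximum value	Series | DataFrame
argmin	Integer position of the minimum value	Series
argmax	Integer position of the maximum value	Series
idxmin	Index label of the minimum value	Series | DataFrame
idxmax	Index label of the maximum value	Series | DataFrame
quantile	Sample quantile (from 0 to 1)	Series | DataFrame
sum	Sum of the values	Series | DataFrame
mean	Mean of the values	Series | DataFrame
median	Arithmetic median (50% quantile)	Series | DataFrame
mad	Mean absolute deviation from the mean (deprecated in pandas 1.5, removed in 2.0)	Series | DataFrame
var	Sample variance	Series | DataFrame
std	Sample standard deviation	Series | DataFrame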
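A few of the methods in the table that are not demonstrated earlier can be sketched as follows. The data is made up for illustration. Because mad() is not available on pandas 2.0 and later, the mean absolute deviation is computed by hand here, which is one possible replacement rather than an official one.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

non_missing = data.count()            # NaN is not counted
middle = data.median()                # 50% quantile
first_quartile = data.quantile(0.25)  # linear interpolation by default
sample_var = data.var()               # divides by n - 1 (ddof=1)
sample_std = data.std()               # square root of the sample variance

# Hand-written mean absolute deviation around the mean.
mean_abs_dev = (data - data.mean()).abs().mean()

print(non_missing, middle, first_quartile, sample_var, sample_std, mean_abs_dev)
```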
