[ Pandas version: 1.0.1 ]
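9. Pivot Tables
A pivot table takes column-wise data as input and groups it into a two-dimensional table that summarizes the data along several dimensions at once (in effect a multidimensional GroupBy aggregation, with grouping applied to both rows and columns).
(1) Building a pivot table with GroupBy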
import numpy as np
import pandas as pd
titanic = pd.read_csv('./seaborn-data-master/titanic.csv')
titanic.head()
# Survival rate by sex
titanic.groupby('sex')[['survived']].mean()
# survived
# sex
# female 0.742038
# male 0.188908
# Survival rate by sex and passenger class
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
# class First Second Third
# sex
# female 0.968085 0.921053 0.500000
# male 0.368852 0.157407 0.135447
# Survival rate by sex and passenger class (using pivot_table)
titanic.pivot_table('survived', index='sex', columns='class')
# class First Second Third
# sex
# female 0.968085 0.921053 0.500000
# male 0.368852 0.157407 0.135447
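The equivalence of the two routes above can be checked on a small table. The following is a minimal sketch that assumes a hypothetical toy DataFrame (not the Titanic data) with the same three column names.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above.
df = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'class':    ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 1, 0, 1, 1],
})

# GroupBy route: group on both keys, take the mean, then move 'class' into the columns.
by_groupby = df.groupby(['sex', 'class'])['survived'].mean().unstack()

# pivot_table route: the same grouping, with the mean as the aggregation.
by_pivot = df.pivot_table('survived', index='sex', columns='class', aggfunc='mean')

print(by_groupby)
print(by_pivot)
```

Both tables have 'sex' on the rows and 'class' on the columns; pivot_table simply expresses the group, aggregate, and unstack steps in one call.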
(2) Pivot table syntax: pivot_table
The pivot_table method of DataFrame handles multidimensional aggregation tasks in a single call.
# pandas.DataFrame.pivot_table — pandas 1.0.3 documentation
DataFrame.pivot_table(self, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) → 'DataFrame'
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
Parameters:
values: column to aggregate, optional
index: column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.
columns: column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.
aggfunc: function, list of functions, dict, default numpy.mean
If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves) If dict is passed, the key is column to aggregate and value is function or list of functions.
fill_value: scalar, default None
Value to replace missing values with.
margins: bool, default False
Add all row / columns (e.g. for subtotal / grand totals).
dropna: bool, default True
Do not include columns whose entries are all NaN.
margins_name: str, default ‘All’
Name of the row / column that will contain the totals when margins is True.
observed: bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
Returns: DataFrame
An Excel style pivot table.
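To make two of the documented parameters concrete, here is a minimal sketch on a hypothetical toy DataFrame. It passes a list to aggfunc, which should yield hierarchical columns with the function name on top, and uses fill_value to replace the cell for a combination that has no rows.

```python
import pandas as pd

# Hypothetical toy data; there is no ('male', 'Second') row on purpose.
df = pd.DataFrame({
    'sex':   ['female', 'female', 'male', 'male'],
    'class': ['First', 'Second', 'First', 'First'],
    'fare':  [80.0, 20.0, 50.0, 70.0],
})

# A list of aggregations gives two column levels: aggregation name, then 'class'.
# fill_value=0 replaces the missing ('male', 'Second') cells.
multi = df.pivot_table(values='fare', index='sex', columns='class',
                       aggfunc=['mean', 'count'], fill_value=0)

print(multi)
print(multi.columns.nlevels)
```

Individual cells are then addressed with a tuple on the column axis, for example multi.loc['male', ('mean', 'First')].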
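1. Multi-level pivot tables
Grouping in a pivot table can be specified at multiple levels through various parameters.
Binning functions:
pd.cut(1-d array, bins)
splits the data by value, using the given bin edges.
pd.qcut(ndarray/Series, int)
splits the data by sample quantiles, so that each bin holds as nearly as possible the same number of observations.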
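The difference between the two binning functions is easiest to see on skewed data. A minimal sketch, assuming a hypothetical Series of ages:

```python
import pandas as pd

# Hypothetical ages, deliberately skewed toward young values.
ages = pd.Series([2, 5, 10, 12, 15, 17, 30, 80])

# pd.cut: explicit edges (0, 18] and (18, 80]; bin sizes depend on where the values fall.
by_value = pd.cut(ages, [0, 18, 80])

# pd.qcut: two quantile bins, so each bin receives about half of the observations.
by_count = pd.qcut(ages, 2)

print(by_value.value_counts(sort=False))
print(by_count.value_counts(sort=False))
```

With pd.cut the bin edges are fixed and the counts vary; with pd.qcut the counts are balanced and the edges are derived from the data.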
# Use age as a third dimension by binning it into age ranges
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
# Split fare into two bins of equal count (quantiles) and add it to the pivot table
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
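# The result is a four-dimensional aggregation with hierarchical indices, laid out as a grid that shows how the values relate to one another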
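The hierarchical index produced by a multi-level pivot can be inspected on a small table. This is a sketch under stated assumptions: the DataFrame is hypothetical toy data, the labels argument of pd.cut is added only to make the bins readable, and observed=False is passed explicitly to match the documented default for categorical groupers.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns.
df = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'age':      [10, 30, 12, 40, 50, 35],
    'class':    ['First', 'First', 'Third', 'Third', 'First', 'Third'],
    'survived': [1, 1, 0, 0, 1, 0],
})

# Bin age by value; the labels are an illustrative choice, not part of the original example.
age_group = pd.cut(df['age'], [0, 18, 80], labels=['child', 'adult'])

# A column name and a binned Series together form a two-level row index.
table = df.pivot_table('survived', index=['sex', age_group], columns='class',
                       aggfunc='mean', observed=False)

print(table)
print(table.index.names)
```

Rows are then selected with a tuple such as table.loc[('male', 'adult')], one entry per index level.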
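2. Main pivot_table parameters
- The fill_value and dropna parameters handle missing values.
- The aggfunc parameter sets the aggregation type; the default is np.mean.
  - The aggregation can be given as a common string ('sum', 'mean', 'count', 'min', 'max', etc.).
  - It can be given as a standard aggregation function (np.sum, min, sum, etc.).
  - It can also be a dictionary that maps each column to its own aggregation.
- The values parameter can be omitted when aggfunc is given as a mapping, because the dictionary keys already determine which columns are aggregated.
- To compute totals for each group, set the margins parameter.
- The label of the margin row and column can be customized with the margins_name parameter; the default is "All".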
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived': sum, 'fare': 'mean'})
# fare survived
# class First Second Third First Second Third
# sex
# female 106.125798 21.970121 16.118810 91 70 72
# male 67.226127 19.741782 12.661633 45 17 47
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
# class First Second Third All
# sex
# female 0.968085 0.921053 0.500000 0.742038
# male 0.368852 0.157407 0.135447 0.188908
# All 0.629630 0.472826 0.242363 0.383838
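One detail worth checking is how the margin values are computed. The sketch below uses hypothetical toy data and a custom margins_name; it suggests that each margin is aggregated from the underlying rows rather than by averaging the cell means.

```python
import pandas as pd

# Hypothetical toy data with unequal group sizes.
df = pd.DataFrame({
    'sex':      ['female', 'female', 'female', 'male', 'male'],
    'class':    ['First', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 1, 0, 1],
})

# margins=True adds a total row and column; margins_name renames them from 'All'.
table = df.pivot_table('survived', index='sex', columns='class',
                       aggfunc='mean', margins=True, margins_name='Total')

print(table)

# The female cells are 0.5 (First) and 1.0 (Third), yet the female total is the
# mean over all three female rows, not the mean of the two cell values.
print(table.loc['female', 'Total'])
```

Here the female total is 2/3 because two of the three female rows have survived equal to 1, whereas averaging the two cells would give 0.75.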
Summarized from the Python Data Science Handbook.