[Python3] Pandas v1.0: (6) Pivot Tables


[ Pandas version: 1.0.1 ]


9. Pivot Tables

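A pivot table takes column-wise data as input and outputs a two-dimensional table that summarizes the data along multiple dimensions. It is essentially a multidimensional GroupBy aggregation in which rows and columns are grouped at the same time.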

(1) Building a pivot table with GroupBy

import numpy as np
import pandas as pd
titanic = pd.read_csv('./seaborn-data-master/titanic.csv')
titanic.head()
# Survival rate by sex
titanic.groupby('sex')[['survived']].mean()
#         survived
# sex
# female  0.742038
# male    0.188908

# Survival rate by sex and passenger class
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
# class      First    Second     Third
# sex
# female  0.968085  0.921053  0.500000
# male    0.368852  0.157407  0.135447
# Survival rate by sex and passenger class (using pivot_table)
titanic.pivot_table('survived', index='sex', columns='class')
# class      First    Second     Third
# sex
# female  0.968085  0.921053  0.500000
# male    0.368852  0.157407  0.135447
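The equivalence of the two routes above can be checked on a small table. The following is a minimal sketch that assumes a hypothetical six-row DataFrame in place of the Titanic CSV (the rows are made up for illustration); it builds the same table once with groupby plus unstack and once with pivot_table.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above
df = pd.DataFrame({
    'sex':      ['female', 'female', 'female', 'male', 'male', 'male'],
    'class':    ['First', 'Third', 'Third', 'First', 'Third', 'Third'],
    'survived': [1, 1, 0, 0, 0, 1],
})

# Route 1: group on both keys, aggregate, then move 'class' into the columns
by_groupby = df.groupby(['sex', 'class'])['survived'].mean().unstack()

# Route 2: the same aggregation written as a pivot table
by_pivot = df.pivot_table(values='survived', index='sex', columns='class',
                          aggfunc='mean')

print(by_pivot)
```

Both tables hold the same numbers; pivot_table is mainly a shorter and more readable way to write the same operation.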

(2) Pivot table syntax: pivot_table

The pivot_table method of DataFrame handles multidimensional aggregation tasks quickly.

pandas.DataFrame.pivot_table — pandas 1.0.3 documentation

DataFrame.pivot_table(self, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) -> 'DataFrame'

		Create a spreadsheet-style pivot table as a DataFrame.

		The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Parameters:

values:		column to aggregate, optional
index:		column, Grouper, array, or list of the previous
			If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.

columns:	column, Grouper, array, or list of the previous
			If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.

aggfunc:	function, list of functions, dict, default numpy.mean
			If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If dict is passed, the key is column to aggregate and value is function or list of functions.

fill_value:	scalar, default None
			Value to replace missing values with.

margins:	bool, default False
			Add all row / columns (e.g. for subtotal / grand totals).

dropna:		bool, default True
			Do not include columns whose entries are all NaN.

margins_name: str, default ‘All’
			Name of the row / column that will contain the totals when margins is True.

observed:	bool, default False
			This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

Returns: 	DataFrame
			An Excel style pivot table.
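Two of the parameters listed above are easier to understand from a small example. The sketch below assumes a hypothetical four-row DataFrame (not the Titanic data) in which one sex/class combination is missing; it shows fill_value replacing the resulting NaN, and a list passed to aggfunc producing hierarchical columns.

```python
import pandas as pd

# Hypothetical toy data: there are no male passengers in First class
df = pd.DataFrame({
    'sex':   ['female', 'female', 'male', 'male'],
    'class': ['First', 'Third', 'Third', 'Third'],
    'fare':  [80.0, 10.0, 8.0, 12.0],
})

# Without fill_value the male/First cell would be NaN; here it becomes 0
filled = df.pivot_table(values='fare', index='sex', columns='class',
                        aggfunc='mean', fill_value=0)

# A list of functions adds a top column level named after each function
multi = df.pivot_table(values='fare', index='sex', columns='class',
                       aggfunc=['mean', 'max'])

print(filled)
print(multi)
```

The aggregations are passed as strings here; function objects such as np.mean are also accepted, as the parameter description states.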

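1. Multi-level pivot tables

Grouping in a pivot table can be specified at multiple levels through the various parameters.

Binning functions:

  • pd.cut(1-d array, bins) splits data by value, using the given bin edges.
  • pd.qcut(ndarray/Series, int) splits data by quantiles, aiming for roughly the same number of items in each bin.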
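The difference between the two binning functions shows up clearly on skewed data. A minimal sketch, assuming a hypothetical five-value Series with one outlier:

```python
import pandas as pd

# Hypothetical values with one large outlier
values = pd.Series([1, 2, 3, 4, 100])

# cut: bin edges are fixed by value, so bin sizes can be very uneven
by_value = pd.cut(values, bins=[0, 50, 100])

# qcut: bin edges come from quantiles, so bin sizes are roughly equal
by_count = pd.qcut(values, q=2)

value_counts = by_value.value_counts().sort_index().tolist()
quantile_counts = by_count.value_counts().sort_index().tolist()

print(value_counts)     # [4, 1]
print(quantile_counts)  # [3, 2]
```

With cut, four of the five values fall into the first bin. With qcut, the edge is placed at the median, so the bins hold three and two values.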
# Use age as a third dimension by binning it
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')


# Split the fare into two equal-count bins and add it to the pivot table
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
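# The result is a four-dimensional aggregate table with hierarchical indexes, laid out as a grid that shows how the values relate to one another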
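The structure of the resulting table can be inspected without the Titanic file. The sketch below assumes a hypothetical six-row DataFrame with made-up ages and fares, and passes observed=True explicitly so that only bin combinations present in the data are shown (an assumption made here to keep the output stable; the calls above rely on the default).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic set
df = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'female', 'male'],
    'class':    ['First', 'Third', 'First', 'Third', 'First', 'Third'],
    'age':      [10, 30, 15, 40, 50, 12],
    'fare':     [80.0, 30.0, 20.0, 8.0, 90.0, 7.0],
    'survived': [1, 1, 0, 0, 1, 1],
})

age = pd.cut(df['age'], [0, 18, 80])   # value-based age bins
fare = pd.qcut(df['fare'], 2)          # two equal-count fare bins

# Three dimensions: (sex, age bin) on the rows, class on the columns
table3 = df.pivot_table(values='survived', index=['sex', age],
                        columns='class', aggfunc='mean', observed=True)

# Four dimensions: (fare bin, class) on the columns as well
table4 = df.pivot_table(values='survived', index=['sex', age],
                        columns=[fare, 'class'], aggfunc='mean',
                        observed=True)

print(table3.index.names)    # row levels
print(table4.columns.names)  # column levels
```

Each binned Series contributes one level to the row or column MultiIndex, named after the Series it was built from.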


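2. Key pivot_table parameters

  • The fill_value and dropna parameters handle missing values
  • The aggfunc parameter sets the aggregation function; the default is np.mean
    • Aggregations can be given as common strings ('sum', 'mean', 'count', 'min', 'max', etc.)
    • They can also be given as function objects (np.sum, min, sum, etc.)
    • A dictionary can map different columns to different aggregations
  • The values parameter can be omitted when aggfunc is given as a dictionary, because the mapping already determines which columns are aggregated
  • The margins parameter adds totals for each group
  • The margin label can be customized with the margins_name parameter; the default is "All"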
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived': sum, 'fare': 'mean'})

#               fare                       survived
# class        First     Second      Third    First Second Third
# sex
# female  106.125798  21.970121  16.118810       91     70    72
# male     67.226127  19.741782  12.661633       45     17    47
titanic.pivot_table('survived', index='sex', columns='class', margins=True)

# class      First    Second     Third       All
# sex
# female  0.968085  0.921053  0.500000  0.742038
# male    0.368852  0.157407  0.135447  0.188908
# All     0.629630  0.472826  0.242363  0.383838
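The dictionary form of aggfunc and a custom margin label can be tried together on a small table. A minimal sketch, assuming a hypothetical four-row DataFrame and the label 'Total' chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic set
df = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male'],
    'class':    ['First', 'Third', 'First', 'Third'],
    'survived': [1, 0, 0, 1],
    'fare':     [80.0, 10.0, 60.0, 8.0],
})

# A dict picks the columns to aggregate, so values is not needed
per_column = df.pivot_table(index='sex', columns='class',
                            aggfunc={'survived': 'sum', 'fare': 'mean'})

# margins=True appends a totals row and column, labelled by margins_name
totals = df.pivot_table(values='survived', index='sex', columns='class',
                        aggfunc='mean', margins=True, margins_name='Total')

print(per_column)
print(totals)
```

Note that the keyword is margins_name, with an "s"; the bottom-right cell of the totals table is the overall mean of the survived column.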

Related Pandas reading:

[Python3] Pandas v1.0: (1) Objects, Data Selection, and Operations
[Python3] Pandas v1.0: (2) Handling Missing Values
[Python3] Pandas v1.0: (3) Hierarchical Indexing
[Python3] Pandas v1.0: (4) Combining Datasets
[Python3] Pandas v1.0: (5) Aggregation and Grouping
[Python3] Pandas v1.0: (6) Pivot Tables [this article]
[Python3] Pandas v1.0: (7) Vectorized String Operations
[Python3] Pandas v1.0: (8) Working with Time Series
[Python3] Pandas v1.0: (9) High-Performance Pandas: eval() and query()


Summarized from the Python Data Science Handbook.
