[Python3] Pandas v1.0: (6) Pivot Tables


[ Pandas version: 1.0.1 ]


Part 9: Pivot Tables

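A pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that summarizes the data along multiple dimensions. It is essentially a multidimensional GroupBy aggregation in which the grouping happens across rows and columns at the same time.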

(1) Building a pivot table with GroupBy

import numpy as np
import pandas as pd
titanic = pd.read_csv('./seaborn-data-master/titanic.csv')
titanic.head()
# Survival rate by sex
titanic.groupby('sex')[['survived']].mean()
#         survived
# sex
# female  0.742038
# male    0.188908

# Survival rate by sex and passenger class
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
# class      First    Second     Third
# sex
# female  0.968085  0.921053  0.500000
# male    0.368852  0.157407  0.135447
# Survival rate by sex and passenger class (using pivot_table)
titanic.pivot_table('survived', index='sex', columns='class')
# class      First    Second     Third
# sex
# female  0.968085  0.921053  0.500000
# male    0.368852  0.157407  0.135447
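The equivalence between the two approaches can be checked without the Titanic file. The sketch below uses a small made-up DataFrame (the rows are hypothetical and only reuse the column names from above) and compares the GroupBy-plus-unstack result with the pivot_table result.

```python
import pandas as pd

# Hypothetical toy data that only borrows the Titanic column names
df = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 0, 0, 1, 1],
})

# Two-key GroupBy, then move the inner index level ('class') to the columns
by_groupby = df.groupby(['sex', 'class'])['survived'].mean().unstack()

# The same aggregation written as a pivot table (the default aggfunc is the mean)
by_pivot = df.pivot_table('survived', index='sex', columns='class')

print(by_groupby)
print(by_pivot)
```

Both tables have sex on the rows and class on the columns; pivot_table simply expresses the grouping keys and the reshaping in one call.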

(2) Pivot table syntax: pivot_table

The pivot_table method of DataFrame handles multidimensional aggregation tasks concisely.

pandas.DataFrame.pivot_table — pandas 1.0.3 documentation

DataFrame.pivot_table(self, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) -> 'DataFrame'

		Create a spreadsheet-style pivot table as a DataFrame.

		The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Parameters:

values:		column to aggregate, optional
index:		column, Grouper, array, or list of the previous
			If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.

columns:	column, Grouper, array, or list of the previous
			If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.

aggfunc:	function, list of functions, dict, default numpy.mean
			If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If dict is passed, the key is column to aggregate and value is function or list of functions.

fill_value:	scalar, default None
			Value to replace missing values with.

margins:	bool, default False
			Add all row / columns (e.g. for subtotal / grand totals).

dropna:		bool, default True
			Do not include columns whose entries are all NaN.

margins_name: str, default ‘All’
			Name of the row / column that will contain the totals when margins is True.

observed:	bool, default False
			This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

Returns: 	DataFrame
			An Excel style pivot table.
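As a rough illustration of fill_value, margins and margins_name from the parameter list above, the following sketch pivots a small made-up sales table. The DataFrame, its column names (region, item, amount) and the 'Total' label are assumptions chosen for the example, not part of the pandas documentation.

```python
import pandas as pd

# Hypothetical sales records; note there is no ('north', 'pear') combination
sales = pd.DataFrame({
    'region': ['north', 'north', 'south', 'south', 'south'],
    'item': ['apple', 'apple', 'apple', 'pear', 'pear'],
    'amount': [10, 20, 5, 7, 3],
})

# Without fill_value, the missing combination shows up as NaN
plain = sales.pivot_table(values='amount', index='region',
                          columns='item', aggfunc='sum')

# fill_value replaces the missing cell; margins=True adds a totals row and
# column, labeled with margins_name instead of the default 'All'
filled = sales.pivot_table(values='amount', index='region', columns='item',
                           aggfunc='sum', fill_value=0,
                           margins=True, margins_name='Total')

print(plain)
print(filled)
```

The totals are computed with the same aggfunc as the cells, so with 'sum' the bottom-right cell is the grand total of amount.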

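1. Multi-level pivot tables

The grouping in a pivot table can be specified with multiple levels, through several of the parameters.

Binning functions:

  • pd.cut(1-d array, bins) splits the data by value, using the bin edges you supply; qcut, by contrast, splits the data according to how many observations fall in each bin.
  • pd.qcut(ndarray/Series, int) splits the data by sample quantiles, so that each bin holds as close to the same number of observations as possible.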
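The difference between the two functions is easiest to see on skewed data. The sketch below uses a made-up Series (the values are hypothetical) and counts how many observations land in each bin.

```python
import pandas as pd

# Hypothetical skewed values: most are small, one is large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# pd.cut with explicit edges: intervals (0, 10] and (10, 100]
by_value = pd.cut(values, [0, 10, 100])

# pd.qcut with 2 quantile bins: split at the median so both bins are equally full
by_count = pd.qcut(values, 2)

print(by_value.value_counts().sort_index().tolist())  # → [7, 1]
print(by_count.value_counts().sort_index().tolist())  # → [4, 4]
```

pd.cut leaves the bins unevenly filled because the edges are fixed in advance, while pd.qcut moves the edge to the median so each bin receives four values.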
# Use age as a third dimension by binning it
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')


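# Split the fare into two equal-count (quantile) bins and add it to the pivot table
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
# The result is a four-dimensional aggregation with hierarchical indices, laid out as a grid that shows the relationships between the values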
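To try the multi-level pattern without the Titanic file, here is a minimal sketch on a made-up DataFrame (the six rows are hypothetical). It passes a list to index that mixes a column name with a binned Series; observed=True is an explicit choice made here so that only combinations present in the data are shown.

```python
import pandas as pd

# Hypothetical passengers that only borrow the Titanic column names
df = pd.DataFrame({
    'sex': ['female', 'female', 'female', 'male', 'male', 'male'],
    'age': [10, 30, 40, 15, 25, 50],
    'class': ['First', 'First', 'Third', 'Third', 'First', 'Third'],
    'survived': [1, 1, 0, 0, 1, 0],
})

# Bin age by value into (0, 18] and (18, 80]; the result is a categorical Series
age = pd.cut(df['age'], [0, 18, 80])

# A list passed to index creates one index level per key; a Series such as
# `age` can sit next to a column name. observed=True keeps only groups that occur.
table = df.pivot_table('survived', index=['sex', age], columns='class',
                       observed=True)

print(table)
print(list(table.index.names))
```

The resulting index has two levels, sex and age, and cells with no matching passengers are NaN.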


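2. Main pivot_table parameters

  • fill_value and dropna control how missing values are handled
  • aggfunc sets the aggregation function; the default is np.mean
    • The aggregation can be given as a common string ('sum', 'mean', 'count', 'min', 'max', etc.)
    • It can also be given as a function object that performs an aggregation (np.sum, min, sum, etc.), passed without calling it
    • A dict can map each column to its own aggregation function
  • values can be omitted when aggfunc is given as a mapping, because the dict keys already determine which columns are aggregated
  • margins computes totals along each grouping
  • The label of the margin row/column can be customized with margins_name; the default is "All"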
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived': sum, 'fare': 'mean'})

#               fare                       survived
# class        First     Second      Third    First Second Third
# sex
# female  106.125798  21.970121  16.118810       91     70    72
# male     67.226127  19.741782  12.661633       45     17    47
titanic.pivot_table('survived', index='sex', columns='class', margins=True)

# class      First    Second     Third       All
# sex
# female  0.968085  0.921053  0.500000  0.742038
# male    0.368852  0.157407  0.135447  0.188908
# All     0.629630  0.472826  0.242363  0.383838
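The dict form of aggfunc and the margins_name parameter can be sketched on a small made-up DataFrame as follows. The four rows and the 'Total' label are assumptions for illustration; the aggregations are given as strings here, which is one of the forms listed above.

```python
import pandas as pd

# Hypothetical toy data that only borrows the Titanic column names
df = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male'],
    'class': ['First', 'Third', 'First', 'Third'],
    'survived': [1, 0, 1, 0],
    'fare': [80.0, 10.0, 60.0, 8.0],
})

# Dict aggfunc: the keys select the columns to aggregate, so `values` is omitted
per_column = df.pivot_table(index='sex', columns='class',
                            aggfunc={'survived': 'sum', 'fare': 'mean'})

# margins=True adds totals; margins_name replaces the default "All" label
with_totals = df.pivot_table('survived', index='sex', columns='class',
                             aggfunc='sum', margins=True,
                             margins_name='Total')

print(per_column)
print(with_totals)
```

With a dict, the result has two column levels: the aggregated column name on top and the class below it.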

Related Pandas reading:

[Python3] Pandas v1.0: (1) Objects, Data Indexing and Operations
[Python3] Pandas v1.0: (2) Handling Missing Values
[Python3] Pandas v1.0: (3) Hierarchical Indexing
[Python3] Pandas v1.0: (4) Combining Datasets
[Python3] Pandas v1.0: (5) Aggregation and Grouping
[Python3] Pandas v1.0: (6) Pivot Tables [this article]
[Python3] Pandas v1.0: (7) Vectorized String Operations
[Python3] Pandas v1.0: (8) Working with Time Series
[Python3] Pandas v1.0: (9) High-Performance Pandas: eval() and query()


Summarized from the Python Data Science Handbook.
