[Python3] Pandas v1.0: (5) Aggregation and Grouping


[ Pandas version: 1.0.1 ]


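VIII. Aggregation and Grouping

When analyzing large datasets, a basic task is efficient summarization: computing aggregation metrics such as sum(), mean(), median(), min(), and max(), each of which reduces the dataset to a single number that captures one of its characteristics.

(I) Simple Aggregation in Pandas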

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
# 0    0.374540
# 1    0.950714
# 2    0.731994
# 3    0.598658
# 4    0.156019
# dtype: float64

ser.sum() 		# 2.811925491708157
ser.mean() 		# 0.5623850983416314

df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
#           A         B
# 0  0.155995  0.020584
# 1  0.058084  0.969910
# 2  0.866176  0.832443
# 3  0.601115  0.212339
# 4  0.708073  0.181825

df.mean()
# A    0.477888
# B    0.443420
# dtype: float64

df.mean(axis='columns')
# 0    0.088290
# 1    0.513997
# 2    0.849309
# 3    0.406727
# 4    0.444949
# dtype: float64

# Planets dataset
planets = pd.read_csv('./seaborn-data-master/planets.csv')
planets.shape 		# (1035, 6)
planets.head()


planets.dropna().describe()


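Pandas aggregation methods

Method              Description
count()             Number of items
first(), last()     First and last item
mean(), median()    Mean and median
min(), max()        Minimum and maximum
std(), var()        Standard deviation and variance
mad()               Mean absolute deviation
prod()              Product of all items
sum()               Sum of all items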
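
To make the table concrete, here is a minimal sketch that calls several of these methods on a small Series. The sample values are made up for illustration. The mean absolute deviation is computed by hand, since mad() is no longer available in recent pandas releases (it was removed in pandas 2.0).

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
s = pd.Series([2.0, 4.0, 4.0, 10.0])

summary = {
    'count': s.count(),
    'mean': s.mean(),
    'median': s.median(),
    'min': s.min(),
    'max': s.max(),
    'std': s.std(),    # sample standard deviation (ddof=1 by default)
    'var': s.var(),    # sample variance (ddof=1 by default)
    'prod': s.prod(),
    'sum': s.sum(),
}

# Mean absolute deviation, written out explicitly
mad = (s - s.mean()).abs().mean()

print(summary)
print('mad:', mad)
```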

(II) GroupBy: Split, Apply, Combine

pandas.DataFrame.groupby — pandas 1.0.3 documentation

DataFrame.groupby(self, by=None, axis=0, level=None, as_index: bool = True, sort: bool = True, group_keys: bool = True, squeeze: bool = False, observed: bool = False) -> 'groupby_generic.DataFrameGroupBy'

	Group DataFrame using a mapper or by a Series of columns.

	A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters:
by:		mapping, function, label, or list of labels
		Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.

axis:	{0 or ‘index’, 1 or ‘columns’}, default 0
		Split along rows (0) or columns (1).

level:	int, level name, or sequence of such, default None
		If the axis is a MultiIndex (hierarchical), group by a particular level or levels.

as_index: bool, default True
		For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

sort:	bool, default True
		Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

group_keys: bool, default True
		When calling apply, add group keys to index to identify pieces.

squeeze: bool, default False
		Reduce the dimensionality of the return type if possible, otherwise return a consistent type.

observed: bool, default False
		This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

Returns: DataFrameGroupBy
		Returns a groupby object that contains information about the groups.
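
As a rough illustration of two of these parameters, the sketch below compares the default behaviour with as_index=False and sort=False. The small DataFrame is invented for the example.

```python
import pandas as pd

# Hypothetical data; the keys deliberately appear in the order B, A
df = pd.DataFrame({'key': ['B', 'A', 'B', 'A'], 'data': [1, 2, 3, 4]})

# Default: group labels become the index and are sorted (A, B)
default = df.groupby('key').sum()

# as_index=False: 'key' stays an ordinary column ("SQL-style" output)
flat = df.groupby('key', as_index=False).sum()

# sort=False: groups keep their order of first appearance (B, A)
unsorted = df.groupby('key', sort=False).sum()

print(default)
print(flat)
print(unsorted)
```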

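1. Split, apply, combine

The GroupBy process:

  • Split: partition the DataFrame into groups according to the specified key
  • Apply: apply a function to each group, usually an aggregation, transformation, or filter
  • Combine: merge the results from each group into a single output array
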
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])

df.groupby('key') 		
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a1fd7e450>
df.groupby('key').sum()

Calling groupby() on a DataFrame returns a DataFrameGroupBy object, which can be thought of as a special view of the DataFrame. Nothing is computed until an aggregation is applied (lazy evaluation).
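
The three steps can also be written out by hand. The following is only a sketch to show what groupby() does conceptually; it does not reflect how pandas implements it internally, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split: one sub-DataFrame per distinct key
pieces = {k: df[df['key'] == k] for k in sorted(df['key'].unique())}

# Apply: reduce each piece to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='data')
manual.index.name = 'key'

# The same result in a single groupby call
grouped = df.groupby('key')['data'].sum()

print(manual)
print(grouped)
```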

2. The GroupBy object

(1) Column indexing

planets.groupby('method') 
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a1fb7fe10>
planets.groupby('method')['orbital_period']
# <pandas.core.groupby.generic.SeriesGroupBy object at 0x1a1fd7e2d0>

# Median orbital period for each discovery method
planets.groupby('method')['orbital_period'].median()
# method
# Astrometry                         631.180000
# Eclipse Timing Variations         4343.500000
# Imaging                          27500.000000
# Microlensing                      3300.000000
# Orbital Brightness Modulation        0.342887
# Pulsar Timing                       66.541900
# Pulsation Timing Variations       1170.000000
# Radial Velocity                    360.200000
# Transit                              5.714932
# Transit Timing Variations           57.011000
# Name: orbital_period, dtype: float64

(2) Iterating over groups

for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))
# Astrometry                     shape=(2, 6)
# Eclipse Timing Variations      shape=(9, 6)
# Imaging                        shape=(38, 6)
# Microlensing                   shape=(23, 6)
# Orbital Brightness Modulation  shape=(3, 6)
# Pulsar Timing                  shape=(5, 6)
# Pulsation Timing Variations    shape=(1, 6)
# Radial Velocity                shape=(553, 6)
# Transit                        shape=(397, 6)
# Transit Timing Variations      shape=(4, 6)
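
Besides iterating, a single group can be pulled out by its label, and the group sizes can be read directly. A small sketch on made-up data, so that it runs without the planets CSV:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})
grouped = df.groupby('key')

# Iteration yields (label, sub-DataFrame) pairs
labels = [name for name, _ in grouped]

# size() gives the number of rows in each group
sizes = grouped.size()

# get_group() returns the sub-DataFrame for one label
group_a = grouped.get_group('A')

print(labels)
print(sizes)
print(group_a)
```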

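(3) Dispatching methods

Through some Python class magic, a method called on the GroupBy object is first applied to each group, and the results are then combined and returned by GroupBy. Any valid DataFrame or Series method can be called through the GroupBy object in this way.

  • Use the describe() method of DataFrame to compute descriptive statistics for each group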
planets.groupby('method')['year'].describe()
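
For a version that runs without the planets CSV, the sketch below calls describe() through a GroupBy object on a small invented DataFrame. The result has one row per group and one column per statistic.

```python
import pandas as pd

# Hypothetical data standing in for the planets table
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 1, 2, 3, 4, 5]})

# describe() is dispatched to each group, then the results are combined
stats = df.groupby('key')['data'].describe()

print(stats)
```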


3. Aggregate, filter, transform, and apply

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data1': range(6), 
                   'data2': rng.randint(0, 10, 6)}, columns=['key', 'data1', 'data2'])
df
#   key  data1  data2
# 0   A      0      5
# 1   B      1      0
# 2   C      2      3
# 3   A      3      3
# 4   B      4      7
# 5   C      5      9

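(1) Aggregation: aggregate()

aggregate() supports more complex operations: it accepts a string, a function, or a list of these.

  • It can compute all the aggregates at once
  • A Python dictionary can specify which aggregation to apply to each column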
df.groupby('key').aggregate(['min', np.median, max])
#     data1            data2
#       min median max   min median max
# key
# A       0    1.5   3     3    4.0   5
# B       1    2.5   4     0    3.5   7
# C       2    3.5   5     3    6.0   9

df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
#      data1  data2
# key
# A        0      5
# B        1      7
# C        2      9
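
A related option, not covered in the text above, is named aggregation (available since pandas 0.25), which lets the output columns be named explicitly. A minimal sketch, with the data2 values hard-coded to match the table above and the output column names chosen arbitrarily:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword is an output column name; the tuple is (input column, function)
result = df.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
)

print(result)
```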

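(2) Filtering: filter()

A filter drops data based on properties of each group. The function passed to filter() must return a Boolean value indicating whether the group passes the filter.

def filter_func(x):
    '''Keep only groups whose data2 column has a standard deviation greater than 4.'''
    return x['data2'].std() > 4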
df
#   key  data1  data2
# 0   A      0      5
# 1   B      1      0
# 2   C      2      3
# 3   A      3      3
# 4   B      4      7
# 5   C      5      9

df.groupby('key').std()
#        data1     data2
# key
# A    2.12132  1.414214
# B    2.12132  4.949747
# C    2.12132  4.242641

df.groupby('key').filter(filter_func)
#   key  data1  data2
# 1   B      1      0
# 2   C      2      3
# 4   B      4      7
# 5   C      5      9
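
One way to check the filter result is to build the same row selection from a Boolean mask: transform('std') broadcasts each group's standard deviation back to its rows. This is only an equivalent sketch, with data2 hard-coded to match the table above.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# filter(): keep whole groups whose data2 standard deviation exceeds 4
kept = df.groupby('key').filter(lambda g: g['data2'].std() > 4)

# Same selection via a row-level mask built with transform()
mask = df.groupby('key')['data2'].transform('std') > 4
same = df[mask]

print(kept)
print(kept.equals(same))
```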

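(3) Transformation: transform()

An aggregation returns a reduced version of the data in each group, whereas a transformation returns a transformed version of the full data. The output has the same shape as the input.

  • A common example: center the data by subtracting each group's mean from its samples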
df.groupby('key').transform(lambda x: x - x.mean())
#    data1  data2
# 0   -1.5    1.0
# 1   -1.5   -3.5
# 2   -1.5   -3.0
# 3    1.5   -1.0
# 4    1.5    3.5
# 5    1.5    3.0
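
Two variations on the same idea, sketched with data2 hard-coded to match the table above: passing the string 'mean' to transform() to broadcast group means, and dividing by the group standard deviation to obtain per-group z-scores. The numeric columns are selected explicitly so the key column is not involved.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
cols = ['data1', 'data2']

# Broadcast each group's mean back to the original rows, then subtract
group_means = df.groupby('key')[cols].transform('mean')
centered = df[cols] - group_means

# Per-group z-scores: center, then scale by the group's standard deviation
z = df.groupby('key')[cols].transform(lambda x: (x - x.mean()) / x.std())

print(centered)
print(z)
```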

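(4) Application: apply()

apply() applies an arbitrary function to each group. The function takes the DataFrame for one group and returns a pandas object (a DataFrame or Series) or a scalar; the combine step adapts to the type of the returned value.

# Normalize the first column by the sum of the second column

def norm_by_data2(x):
    '''x is the DataFrame for one group.'''
    x = x.copy()  # work on a copy so the group passed in is not modified
    x['data1'] = x['data1'] / x['data2'].sum()
    return x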
df
#   key  data1  data2
# 0   A      0      5
# 1   B      1      0
# 2   C      2      3
# 3   A      3      3
# 4   B      4      7
# 5   C      5      9

df.groupby('key').apply(norm_by_data2)
#   key     data1  data2
# 0   A  0.000000      5
# 1   B  0.142857      0
# 2   C  0.166667      3
# 3   A  0.375000      3
# 4   B  0.571429      7
# 5   C  0.416667      9
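
To illustrate how the combine step adapts to the return type, the sketch below uses two hypothetical helper functions: one returns a scalar per group (combined into a Series) and one returns a single row per group (combined into a DataFrame). The numeric columns are selected explicitly before apply(), which is an assumption made here to keep the behaviour consistent across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
grouped = df.groupby('key')[['data1', 'data2']]

def data1_range(g):
    '''Return a scalar: the spread of data1 within the group.'''
    return g['data1'].max() - g['data1'].min()

def top_row(g):
    '''Return a Series: the row with the largest data2 in the group.'''
    return g.loc[g['data2'].idxmax()]

spread = grouped.apply(data1_range)  # scalars -> Series indexed by key
top = grouped.apply(top_row)         # Series -> DataFrame indexed by key

print(spread)
print(top)
```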

4. Specifying the split key

(1) A list, array, Series, or index as the grouping key: the key can be any Series or list whose length matches that of the DataFrame

L = [0, 1, 0, 1, 2, 0]
df.groupby(L).sum()
#    data1  data2
# 0      7     17
# 1      4      3
# 2      4      7

df.groupby(df['key']).sum()
#      data1  data2
# key
# A        3      8
# B        5      7
# C        7     12

(2) A dictionary or Series mapping index values to group names: provide a dictionary that maps index values to group keys

df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2
#      data1  data2
# key
# A        0      5
# B        1      0
# C        2      3
# A        3      3
# B        4      7
# C        5      9

df2.groupby(mapping).sum()
#            data1  data2
# consonant     12     19
# vowel          3      8
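
One way to see what the dictionary does is to apply the mapping to the index explicitly and group by the resulting labels. A sketch under the assumption that df2 is rebuilt with hard-coded values matching the table above:

```python
import pandas as pd

df2 = pd.DataFrame({'data1': range(6), 'data2': [5, 0, 3, 3, 7, 9]},
                   index=pd.Index(['A', 'B', 'C', 'A', 'B', 'C'], name='key'))
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# Passing the dict directly: each index value is looked up in the mapping
by_dict = df2.groupby(mapping).sum()

# Equivalent: map the index first, then group by the resulting labels
labels = df2.index.map(mapping)
by_labels = df2.groupby(labels).sum()

print(by_dict)
print(by_labels)
```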

(3) Any Python function: any Python function can be passed to groupby; it is applied to each index value and its output is used as the group key

df2.groupby(str.lower).mean() 	# convert the index to lowercase
#    data1  data2
# a    1.5    4.0
# b    2.5    3.5
# c    3.5    6.0
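
The same mechanism works with any callable. As a further sketch, the built-in len groups rows by the length of their index label; the fruit names and quantities are invented for the example.

```python
import pandas as pd

# Hypothetical data indexed by strings of different lengths
fruit = pd.DataFrame({'qty': [1, 2, 3, 4, 5]},
                     index=['apple', 'fig', 'kiwi', 'pear', 'plum'])

# len() is called on every index value; its result is the group key
by_len = fruit.groupby(len).sum()

print(by_len)
```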

(4) A list of valid keys: any of the preceding key types can be combined to group on a multi-index

df2.groupby([str.lower, mapping]).mean()
#              data1  data2
# a vowel        1.5    4.0
# b consonant    2.5    3.5
# c consonant    3.5    6.0

# Number of planets discovered by each method in each decade
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
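
The same pipeline can be tried without the planets CSV. The sketch below uses a tiny invented table with the same three column names; the rows are made up and are not taken from the real dataset.

```python
import pandas as pd

# Made-up rows using the same column names as the planets data
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity',
               'Radial Velocity', 'Imaging'],
    'year': [2009, 2011, 1995, 2008, 2004],
    'number': [1, 2, 1, 3, 1],
})

# Build a decade label such as '2000s' from the year
decade = 10 * (toy['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'

# Group by method and decade, then pivot decades into columns
table = toy.groupby(['method', decade])['number'].sum().unstack().fillna(0)

print(table)
```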



Related Pandas reading:

[Python3] Pandas v1.0: (1) Objects, Data Selection, and Operations
[Python3] Pandas v1.0: (2) Handling Missing Data
[Python3] Pandas v1.0: (3) Hierarchical Indexing
[Python3] Pandas v1.0: (4) Combining Datasets
[Python3] Pandas v1.0: (5) Aggregation and Grouping [this article]
[Python3] Pandas v1.0: (6) Pivot Tables
[Python3] Pandas v1.0: (7) Vectorized String Operations
[Python3] Pandas v1.0: (8) Working with Time Series
[Python3] Pandas v1.0: (9) High-Performance Pandas: eval() and query()


Summarized from the Python Data Science Handbook.
