Notes on groupby in the pandas library
I. Splitting an object into groups
1 A simple example
In [1]: df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
...: ('bird', 'Psittaciformes', 24.0),
...: ('mammal', 'Carnivora', 80.2),
...: ('mammal', 'Primates', np.nan),
...: ('mammal', 'Carnivora', 58)],
...: index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
...: columns=('class', 'order', 'max_speed'))
...:
In [2]: df
Out[2]:
class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
In [3]: grouped = df.groupby('class') # two groups; the default is axis=0, i.e. splitting by rows
In [4]: grouped = df.groupby('order', axis='columns') # returns a GroupBy object; nothing is printed
In [5]: grouped = df.groupby(['class', 'order']) # four groups
By default groupby splits along the rows; with axis=1 it splits along the columns instead, so each group is a set of whole columns:
In [50]: grouped = df.groupby(df.dtypes, axis=1) # group the columns by dtype; the frame is split in two
In [51]: for i,j in grouped:
...: print(i)
...: print(j)
float64
max_speed
falcon 389.0
parrot 24.0
lion 80.2
monkey NaN
leopard 58.0
object
class order
falcon bird Falconiformes
parrot bird Psittaciformes
lion mammal Carnivora
monkey mammal Primates
leopard mammal Carnivora
A few other common operations:
# iterate over the groups to inspect their contents; use the tuple form when grouping by multiple keys
for i,j in grouped:
print(i)
print(j)
for (i1,i2),j in grouped:
print(i1, i2)
print(j)
# or simply list the groups:
grouped.groups
# sum each group (provided the column can be summed):
In []: df.groupby('class').sum()
Out[]:
max_speed
class
bird 413.0
mammal 138.2
# mean, group size, and count:
In []: df.groupby('class').mean()
Out[]:
max_speed
class
bird 206.5
mammal 69.1
In []: df.groupby('class').size()
Out[]:
class
bird 2
mammal 3
dtype: int64
In []: df.groupby('class').count() # note the difference between count and size
Out[]:
order max_speed
class
bird 2 2
mammal 3 2 # only 2 because of the NaN value
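The size/count distinction is easy to verify directly; a minimal self-contained sketch rebuilding the frame above:

```python
import numpy as np
import pandas as pd

# rebuild the example frame
df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
                   ('bird', 'Psittaciformes', 24.0),
                   ('mammal', 'Carnivora', 80.2),
                   ('mammal', 'Primates', np.nan),
                   ('mammal', 'Carnivora', 58)],
                  index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
                  columns=('class', 'order', 'max_speed'))

# size() counts rows per group, NaN included
sizes = df.groupby('class').size()
# count() counts non-NA cells per column, so monkey's NaN is skipped
counts = df.groupby('class').count()
```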
2 Grouping with a MultiIndex
In [6]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B': ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C': np.random.randn(8),
...: 'D': np.random.randn(8)})
In [7]: df
Out[7]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
In [8]: grouped = df.groupby('A') # 2 groups
In [9]: grouped = df.groupby(['A', 'B']) # 6 groups
In [10]: df2 = df.set_index(['A', 'B'])
Out[10]: # print(df2)
C D
A B
foo one -1.209388 -0.309949
bar one -0.380334 -1.352238
foo two 0.309979 -0.695926
bar three 0.650321 0.965206
foo two 0.809020 1.003307
bar two 0.668484 1.013688
foo one 0.513104 0.079576
three 1.579055 -0.083461 # note how the repeated outer label is collapsed in the MultiIndex display
In [11]: grouped = df2.groupby(level=df2.index.names.difference(['B']))
# equivalent to grouped = df2.groupby(level=0); to group by the inner index 'B', use level=1
# since column 'B' cannot be summed, the result also equals df.groupby(['A']).sum()
In [12]: grouped.sum()
Out[12]:
C D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
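The level equivalences above can be checked with a small sketch; fixed values replace np.random.randn so the sums are predictable:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'two'],
                   'C': [1.0, 2.0, 3.0, 4.0]})
df2 = df.set_index(['A', 'B'])

# grouping by level name and by level position describe the same grouping
by_name = df2.groupby(level='A').sum()
by_pos = df2.groupby(level=0).sum()
assert by_name.equals(by_pos)
```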
3 Grouping columns with a function
In [13]: def get_letter_type(letter):
....: if letter.lower() in 'aeiou':
....: return 'vowel'
....: else:
....: return 'consonant'
In [14]: grouped = df.groupby(get_letter_type, axis=1)
# axis=0 splits the DataFrame into groups of rows, while axis=1 splits it into groups of columns
# here the function is applied to each column label, much like apply
In []: for i,j in grouped:
print(i)
print(j)
Out[]: # 'a' is a vowel, so the frame is split into two parts
consonant
B C D
0 one -1.209388 -0.309949
1 one -0.380334 -1.352238
2 two 0.309979 -0.695926
3 three 0.650321 0.965206
4 two 0.809020 1.003307
5 two 0.668484 1.013688
6 one 0.513104 0.079576
7 three 1.579055 -0.083461
vowel
A
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
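The same split can be inspected without axis=1 (which newer pandas versions deprecate) by applying the classifier to the column labels themselves; a sketch under that assumption:

```python
import pandas as pd

def get_letter_type(letter):
    # classify a column label as vowel or consonant
    if letter.lower() in 'aeiou':
        return 'vowel'
    return 'consonant'

df = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'],
                   'C': [1.0, 2.0], 'D': [3.0, 4.0]})

# groupby with a callable key applies it to the index labels;
# df.columns.to_series() makes the column labels the index
groups = df.columns.to_series().groupby(get_letter_type).groups
```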
4 Using level
The following example shows the basic use of level:
In [15]: lst = [1, 2, 3, 1, 2, 3]
In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)
Out[16]: # print(s)
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
In [17]: grouped = s.groupby(level=0) # still 3 groups, keyed by the index; level=1 refers to the second level of a MultiIndex
In [18]: grouped.first() # first row of each group
Out[18]:
1 1
2 2
3 3
dtype: int64
In [19]: grouped.last() # last row of each group
Out[19]:
1 10
2 20
3 30
dtype: int64
In [20]: grouped.sum()
Out[20]:
1 11
2 22
3 33
dtype: int64
5 Sorting
By default groupby sorts the groups by key in ascending order; this can be turned off.
In [21]: df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
In [22]: df2.groupby(['X']).sum()
Out[22]:
Y
X
A 7
B 3
In [23]: df2.groupby(['X'], sort=False).sum() # not descending order, but the order in which the keys first appear
Out[23]:
Y
X
B 3
A 7
Within each group, the rows are not re-sorted; they keep their original order.
In [24]: df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
In [25]: df3.groupby(['X']).get_group('A')
Out[25]:
X Y
0 A 1
2 A 3
In [26]: df3.groupby(['X']).get_group('B')
Out[26]:
X Y
1 B 4
3 B 2
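The effect of sort=False on the group order can be asserted directly; a short sketch:

```python
import pandas as pd

df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})

# sort=True (the default) orders the groups by key;
# sort=False keeps the order in which the keys first appear
sorted_keys = list(df2.groupby('X').sum().index)
unsorted_keys = list(df2.groupby('X', sort=False).sum().index)
```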
6 Attributes of the GroupBy object
Grouping by a single column, and grouping the columns:
In [27]: df.groupby('A').groups
Out[27]:
{'bar': Int64Index([1, 3, 5], dtype='int64'),
'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}
In [28]: df.groupby(get_letter_type, axis=1).groups
Out[28]:
{'consonant': Index(['B', 'C', 'D'], dtype='object'),
'vowel': Index(['A'], dtype='object')}
Grouping by two columns:
In [29]: grouped = df.groupby(['A', 'B'])
In [30]: grouped.groups
Out[30]:
{('bar', 'one'): Int64Index([1], dtype='int64'),
('bar', 'three'): Int64Index([3], dtype='int64'),
('bar', 'two'): Int64Index([5], dtype='int64'),
('foo', 'one'): Int64Index([0, 6], dtype='int64'),
('foo', 'three'): Int64Index([7], dtype='int64'),
('foo', 'two'): Int64Index([2, 4], dtype='int64')}
In [31]: len(grouped)
Out[31]: 6
Press Tab to see the available methods:
In [32]: df
Out[32]:
height weight gender
2000-01-01 42.849980 157.500553 male
2000-01-02 49.607315 177.340407 male
2000-01-03 56.293531 171.524640 male
2000-01-04 48.421077 144.251986 female
2000-01-05 46.556882 152.526206 male
2000-01-06 68.448851 168.272968 female
2000-01-07 70.757698 136.431469 male
2000-01-08 58.909500 176.499753 female
2000-01-09 76.435631 174.094104 female
2000-01-10 45.306120 177.540920 male
In [33]: gb = df.groupby('gender')
In [34]: gb.<TAB> # noqa: E225, E999
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight
7 Grouping a MultiIndex Series
# build a list of arrays for a two-level index
In [35]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [36]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
In [37]: s = pd.Series(np.random.randn(8), index=index)
# repeated adjacent index labels are hidden in the display
In [38]: s
Out[38]:
first second
bar one -0.919854
two -0.042379
baz one 1.247642
two -0.009920
foo one 0.290213
two 0.495767
qux one 0.362949
two 1.548106
dtype: float64
A simple group-wise sum:
In [39]: grouped = s.groupby(level=0)
In [40]: grouped.sum() # 等价于s.groupby('first').sum()
Out[40]:
first
bar -0.962232
baz 1.237723
foo 0.785980
qux 1.911055
dtype: float64
In [41]: s.groupby(level='second').sum() # equivalent to s.sum(level='second'), shown below
Out[41]:
second
one 0.980950
two 1.991575
dtype: float64
In [42]: s.sum(level='second')
Out[42]:
second
one 0.980950
two 1.991575
dtype: float64
You can also group by two index levels at once (note that s has been redefined here with a third index level):
In [43]: s
Out[43]:
first second third
bar doo one -1.131345
two -0.089329
baz bee one 0.337863
two -0.945867
foo bop one -0.932132
two 1.956030
qux bop one 0.017587
two -0.016692
dtype: float64
In [44]: s.groupby(level=['first', 'second']).sum()
Out[44]:
first second
bar doo -1.220674
baz bee -0.608004
foo bop 1.023898
qux bop 0.000895
dtype: float64
In [45]: s.groupby(['first', 'second']).sum()
Out[45]:
first second
bar doo -1.220674
baz bee -0.608004
foo bop 1.023898
qux bop 0.000895
dtype: float64
8 Column selection within groups
# operate on a single column of the grouped data
In [53]: grouped = df.groupby(['A'])
In [54]: grouped_C = grouped['C']
In [56]: df['C'].groupby(df['A']) # the most direct and simplest form
Out[56]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f2b486509b0>
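Both spellings describe the same SeriesGroupBy, which a quick sketch with fixed values can confirm:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# selecting the column before or after grouping gives the same result
a = df.groupby('A')['C'].sum()
b = df['C'].groupby(df['A']).sum()
assert a.equals(b)
```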
II. Selecting a group
# get_group takes the group name as it appears when iterating
In [60]: grouped.get_group('bar')
Out[60]:
A B C D
1 bar one 0.254161 1.511763
3 bar three 0.215897 -0.990582
5 bar two -0.077118 1.211526
In [61]: df.groupby(['A', 'B']).get_group(('bar', 'one'))
Out[61]:
A B C D
1 bar one 0.254161 1.511763
III. Aggregation
In [62]: grouped = df.groupby('A')
In [63]: grouped.aggregate(np.sum) # equivalent to grouped.sum()
Out[63]:
C D
A
bar 0.392940 1.732707
foo -1.796421 2.824590
In [64]: grouped = df.groupby(['A', 'B'])
In [65]: grouped.aggregate(np.sum)
Out[65]:
C D
A B
bar one 0.254161 1.511763
three 0.215897 -0.990582
two -0.077118 1.211526
foo one -0.983776 1.614581
three -0.862495 0.024580
two 0.049851 1.185429
In [66]: grouped = df.groupby(['A', 'B'], as_index=False) # as_index=False keeps the group keys as ordinary columns, so aggregation returns a flat DataFrame
In [67]: grouped.aggregate(np.sum)
Out[67]:
A B C D
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
In [68]: df.groupby('A', as_index=False).sum()
Out[68]:
A C D
0 bar 0.392940 1.732707
1 foo -1.796421 2.824590
In [70]: grouped.size()
Out[70]:
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
dtype: int64
In [71]: grouped.describe() # a per-group description
# more aggregations: std(), var(), sem(), nth()
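The as_index=False behaviour is the same as aggregating and then resetting the index; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# as_index=False keeps the key as a regular column,
# which matches aggregating and then calling reset_index()
flat = df.groupby('A', as_index=False).sum()
via_reset = df.groupby('A').sum().reset_index()
assert flat.equals(via_reset)
```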
1 Applying multiple functions at once
1.1 The same functions on every column
In [72]: grouped = df.groupby('A')
In [73]: grouped['C'].agg([np.sum, np.mean, np.std]) # agg is an alias for aggregate
Out[73]:
sum mean std
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
In [74]: grouped.agg([np.sum, np.mean, np.std])
Out[74]:
C D
sum mean std sum mean std
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
In [75]: (grouped['C'].agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'}))
....:
Out[75]:
foo bar baz
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
In [76]: (grouped.agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'}))
....:
Out[76]:
C D
foo bar baz foo bar baz
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
1.2 Different functions per column
In [77]: grouped.agg({'C': np.sum,
....: 'D': lambda x: np.std(x, ddof=1)})
....:
Out[77]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
In [78]: grouped.agg({'C': 'sum', 'D': 'std'})
Out[78]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
In [79]: from collections import OrderedDict
In [80]: grouped.agg({'D': 'std', 'C': 'mean'})
Out[80]:
D C
A
bar 1.366330 0.130980
foo 0.884785 -0.359284
In [81]: grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
Out[81]:
D C
A
bar 1.366330 0.130980
foo 0.884785 -0.359284
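Newer pandas (0.25+) also supports named aggregation, which sets both the function per column and the name of the output column in one call; a sketch with fixed values (the names c_sum/d_std are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [5.0, 6.0, 7.0, 8.0]})

# output_name=(input_column, function)
out = df.groupby('A').agg(c_sum=('C', 'sum'), d_std=('D', 'std'))
```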
1.3 Cython-optimized aggregations
Currently only sum, mean, std, and sem have Cython-optimized implementations.
IV. Transformation
In [84]: index = pd.date_range('10/1/1999', periods=1100)
In [85]: ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
# rolling: window is the number of past periods, min_periods the minimum number of observations required
In [86]: ts = ts.rolling(window=100, min_periods=100).mean().dropna()
In [87]: ts.head()
Out[87]:
2000-01-08 0.779333
2000-01-09 0.778852
2000-01-10 0.786476
2000-01-11 0.782797
2000-01-12 0.798110
Freq: D, dtype: float64
In [88]: ts.tail()
Out[88]:
2002-09-30 0.660294
2002-10-01 0.631095
2002-10-02 0.673601
2002-10-03 0.709213
2002-10-04 0.719369
Freq: D, dtype: float64
# standardize the series within each year
In [89]: transformed = (ts.groupby(lambda x: x.year)
....: .transform(lambda x: (x - x.mean()) / x.std()))
Out[89]:
2000-01-08 -0.624080
2000-01-09 -0.763061
2000-01-10 -1.009653
2000-01-11 -0.965821
2000-01-12 -1.227731...
# visualize the data
In [96]: compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})
In [97]: compare.plot()
Out[97]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2b4866c1d0>
# compute max minus min within each year
In [98]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
Out[98]:
2000-01-08 0.623893
2000-01-09 0.623893
2000-01-10 0.623893
2000-01-11 0.623893
2000-01-12 0.623893
2000-01-13 0.623893
2000-01-14 0.623893
...
2002-09-28 0.558275
2002-09-29 0.558275
2002-09-30 0.558275
2002-10-01 0.558275
2002-10-02 0.558275
2002-10-03 0.558275
2002-10-04 0.558275
Freq: D, Length: 1001, dtype: float64
# an equivalent form (note that these names shadow the built-ins max and min)
In [99]: max = ts.groupby(lambda x: x.year).transform('max')
In [100]: min = ts.groupby(lambda x: x.year).transform('min')
In [101]: max - min
Out[101]:
2000-01-08 0.623893
2000-01-09 0.623893
2000-01-10 0.623893
2000-01-11 0.623893
2000-01-12 0.623893
2000-01-13 0.623893
2000-01-14 0.623893
...
2002-09-28 0.558275
2002-09-29 0.558275
2002-09-30 0.558275
2002-10-01 0.558275
2002-10-02 0.558275
2002-10-03 0.558275
2002-10-04 0.558275
Freq: D, Length: 1001, dtype: float64
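The standardization above can be verified numerically: after the transform, every year has mean ~0 and std ~1. A self-checking sketch (using a seeded generator and a shorter series than the original for speed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.date_range('2000-01-01', periods=400)
ts = pd.Series(rng.normal(0.5, 2, 400), index)

key = ts.index.year
z = ts.groupby(key).transform(lambda x: (x - x.mean()) / x.std())

# each year of the standardized series has mean ~0 and std ~1
check = z.groupby(key).agg(['mean', 'std'])
assert np.allclose(check['mean'], 0)
assert np.allclose(check['std'], 1)
```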
In [102]: data_df
Out[102]:
A B C
0 1.539708 -1.166480 0.533026
1 1.302092 -0.505754 NaN
2 -0.371983 1.104803 -0.651520
3 -1.309622 1.118697 -1.161657
4 -1.924296 0.396437 0.812436
5 0.815643 0.367816 -0.469478
6 -0.030651 1.376106 -0.645129
.. ... ... ...
993 0.012359 0.554602 -1.976159
994 0.042312 -1.628835 1.013822
995 -0.093110 0.683847 -0.774753
996 -0.185043 1.438572 NaN
997 -0.394469 -0.642343 0.011374
998 -1.174126 1.857148 NaN
999 0.234564 0.517098 0.393534
[1000 rows x 3 columns]
In [103]: countries = np.array(['US', 'UK', 'GR', 'JP'])
In [104]: key = countries[np.random.randint(0, 4, 1000)]
In [105]: grouped = data_df.groupby(key)
# Non-NA count in each group
In [106]: grouped.count()
Out[106]:
A B C
GR 209 217 189
JP 240 255 217
UK 216 231 193
US 239 250 217
In [107]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
In [108]: grouped_trans = transformed.groupby(key)
In [109]: grouped.mean() # original group means
Out[109]:
A B C
GR -0.098371 -0.015420 0.068053
JP 0.069025 0.023100 -0.077324
UK 0.034069 -0.052580 -0.116525
US 0.058664 -0.020399 0.028603
In [110]: grouped_trans.mean() # transformation did not change group means
Out[110]:
A B C
GR -0.098371 -0.015420 0.068053
JP 0.069025 0.023100 -0.077324
UK 0.034069 -0.052580 -0.116525
US 0.058664 -0.020399 0.028603
In [111]: grouped.count() # original has some missing data points
Out[111]:
A B C
GR 209 217 189
JP 240 255 217
UK 216 231 193
US 239 250 217
In [112]: grouped_trans.count() # counts after transformation
Out[112]:
A B C
GR 228 228 228
JP 267 267 267
UK 247 247 247
US 258 258 258
In [113]: grouped_trans.size() # Verify non-NA count equals group size
Out[113]:
GR 228
JP 267
UK 247
US 258
dtype: int64
Other group-wise transformations:
- fillna
- forward fill: ffill / pad
- backward fill: bfill / backfill
- lag and lead: shift(1), shift(-1)
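These methods all respect group boundaries; for example, shift lags each group independently, so the first row of every group becomes NaN. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'v': [1.0, 2.0, 3.0, 4.0]})

# shift works within each group: the first row of every group
# has no predecessor inside the group, so it becomes NaN
lagged = df.groupby('g')['v'].shift(1)
```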
1 New syntax for window and resampling operations
In [115]: df_re = pd.DataFrame({'A': [1] * 10 + [5] * 10,
.....: 'B': np.arange(20)})
In [116]: df_re
Out[116]:
A B
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
.. .. ..
13 5 13
14 5 14
15 5 15
16 5 16
17 5 17
18 5 18
19 5 19
[20 rows x 2 columns]
# rolling 4-period mean of B within each group; NaN until 4 observations are available
In [117]: df_re.groupby('A').rolling(4).B.mean()
Out[117]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
...
5 13 11.5
14 12.5
15 13.5
16 14.5
17 15.5
18 16.5
19 17.5
Name: B, Length: 20, dtype: float64
# expanding (running) sum within each group
In [118]: df_re.groupby('A').expanding().sum()
Out[118]:
A B
A
1 0 1.0 0.0
1 2.0 1.0
2 3.0 3.0
3 4.0 6.0
4 5.0 10.0
5 6.0 15.0
6 7.0 21.0
... ... ...
5 13 20.0 46.0
14 25.0 60.0
15 30.0 75.0
16 35.0 91.0
17 40.0 108.0
18 45.0 126.0
19 50.0 145.0
[20 rows x 2 columns]
In [119]: df_re = pd.DataFrame({'date': pd.date_range(start='2016-01-01', periods=4,
.....: freq='W'),
.....: 'group': [1, 1, 2, 2],
.....: 'val': [5, 6, 7, 8]}).set_index('date')
In [120]: df_re
Out[120]:
group val
date
2016-01-03 1 5
2016-01-10 1 6
2016-01-17 2 7
2016-01-24 2 8
# resample irregular data onto a regular grid; the rule can also be e.g. '60S'
In [121]: df_re.groupby('group').resample('1D').ffill()
Out[121]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
... ... ...
2 2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
[16 rows x 2 columns]
V. Filtration
In [122]: sf = pd.Series([1, 1, 2, 3, 3, 3])
# keep the groups whose sum is greater than 2
# compare with sf[sf.apply(lambda x: x >= 2)], which filters individual elements instead of whole groups
In [123]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[123]:
3 3
4 3
5 3
dtype: int64
In [124]: dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
# keep the groups containing more than 2 elements
In [125]: dff.groupby('B').filter(lambda x: len(x) > 2)
Out[125]:
A B
2 2 b
3 3 b
4 4 b
5 5 b
In [126]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
Out[126]:
A B
0 NaN NaN
1 NaN NaN
2 2.0 b
3 3.0 b
4 4.0 b
5 5.0 b
6 NaN NaN
7 NaN NaN
In [127]: dff['C'] = np.arange(8)
In [128]: dff.groupby('B').filter(lambda x: len(x['C']) > 2)
Out[128]:
A B C
2 2 b 2
3 3 b 3
4 4 b 4
5 5 b 5
# first two rows of each group; tail() works analogously
In [129]: dff.groupby('B').head(2)
Out[129]:
A B C
0 0 a 0
1 1 a 1
2 2 b 2
3 3 b 3
6 6 c 6
7 7 c 7
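The contrast between filter (whole groups) and head (n rows per group) can be asserted directly; a sketch on the same dff:

```python
import pandas as pd

dff = pd.DataFrame({'A': range(8), 'B': list('aabbbbcc')})

# filter keeps or drops entire groups at once
big = dff.groupby('B').filter(lambda x: len(x) > 2)

# head selects the first n rows of every group
first_two = dff.groupby('B').head(2)
```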
VI. Group-wise functions
In [130]: grouped = df.groupby('A')
In [131]: grouped.agg(lambda x: x.std()) # can be shortened to grouped.std()
Out[131]:
C D
A
bar 0.181231 1.366330
foo 0.912265 0.884785
In [132]: grouped.std()
Out[132]:
C D
A
bar 0.181231 1.366330
foo 0.912265 0.884785
In [133]: tsdf = pd.DataFrame(np.random.randn(1000, 3),
.....: index=pd.date_range('1/1/2000', periods=1000),
.....: columns=['A', 'B', 'C'])
In [134]: tsdf.iloc[::2] = np.nan
In [135]: grouped = tsdf.groupby(lambda x: x.year)
# forward-fill within each year
In [136]: grouped.fillna(method='pad')
Out[136]:
A B C
2000-01-01 NaN NaN NaN # NaN because there is no earlier value to fill from
2000-01-02 -0.353501 -0.080957 -0.876864
2000-01-03 -0.353501 -0.080957 -0.876864
2000-01-04 0.050976 0.044273 -0.559849
2000-01-05 0.050976 0.044273 -0.559849
2000-01-06 0.030091 0.186460 -0.680149
2000-01-07 0.030091 0.186460 -0.680149
... ... ... ...
2002-09-20 2.310215 0.157482 -0.064476
2002-09-21 2.310215 0.157482 -0.064476
2002-09-22 0.005011 0.053897 -1.026922
2002-09-23 0.005011 0.053897 -1.026922
2002-09-24 -0.456542 -1.849051 1.559856
2002-09-25 -0.456542 -1.849051 1.559856
2002-09-26 1.123162 0.354660 1.128135
[1000 rows x 3 columns]
In [137]: s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
In [138]: g = pd.Series(list('abababab'))
In [139]: gb = s.groupby(g)
# 3 largest values in each group
In [140]: gb.nlargest(3)
Out[140]:
a 4 19.0
0 9.0
2 7.0
b 1 8.0
3 5.0
7 3.3
dtype: float64
# 3 smallest values in each group
In [141]: gb.nsmallest(3)
Out[141]:
a 6 4.2
2 7.0
0 9.0
b 5 1.0
7 3.3
3 5.0
dtype: float64
VII. Statistics with apply
In [142]: df
Out[142]:
A B C D
0 foo one -0.575247 1.346061
1 bar one 0.254161 1.511763
2 foo two -1.143704 1.627081
3 bar three 0.215897 -0.990582
4 foo two 1.193555 -0.441652
5 bar two -0.077118 1.211526
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
In [143]: grouped = df.groupby('A')
In [144]: grouped['C'].apply(lambda x: x.describe())
Out[144]:
A
bar count 3.000000
mean 0.130980
std 0.181231
min -0.077118
25% 0.069390
50% 0.215897
75% 0.235029
...
foo mean -0.359284
std 0.912265
min -1.143704
25% -0.862495
50% -0.575247
75% -0.408530
max 1.193555
Name: C, Length: 16, dtype: float64
In [145]: grouped = df.groupby('A')['C']
In [146]: def f(group):
.....: return pd.DataFrame({'original': group,
.....: 'demeaned': group - group.mean()})
.....:
In [147]: grouped.apply(f)
Out[147]:
original demeaned
0 -0.575247 -0.215962
1 0.254161 0.123181
2 -1.143704 -0.784420
3 0.215897 0.084917
4 1.193555 1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
In [148]: def f(x):
.....: return pd.Series([x, x ** 2], index=['x', 'x^2'])
.....:
In [149]: s = pd.Series(np.random.rand(5))
In [150]: s
Out[150]:
0 0.321438
1 0.493496
2 0.139505
3 0.910103
4 0.194158
dtype: float64
In [151]: s.apply(f)
Out[151]:
x x^2
0 0.321438 0.103323
1 0.493496 0.243538
2 0.139505 0.019462
3 0.910103 0.828287
4 0.194158 0.037697
In [152]: d = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [153]: def identity(df):
.....: print(df)
.....: return df
In [154]: d.groupby("a").apply(identity) # apply may evaluate the first group twice while choosing a code path, hence the duplicate print
a b
0 x 1
a b
0 x 1
a b
1 y 2
Out[154]:
a b
0 x 1
1 y 2
VIII. Other useful features
In [155]: df
Out[155]:
A B C D
0 foo one -0.575247 1.346061
1 bar one 0.254161 1.511763
2 foo two -1.143704 1.627081
3 bar three 0.215897 -0.990582
4 foo two 1.193555 -0.441652
5 bar two -0.077118 1.211526
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
In [156]: df.groupby('A').std()
Out[156]:
C D
A
bar 0.181231 1.366330
foo 0.912265 0.884785
In [157]: from decimal import Decimal
In [158]: df_dec = pd.DataFrame(
.....: {'id': [1, 2, 1, 2],
.....: 'int_column': [1, 2, 3, 4],
.....: 'dec_column': [Decimal('0.50'), Decimal('0.15'),
.....: Decimal('0.25'), Decimal('0.40')]
.....: }
.....: )
# Decimal columns can be sum'd explicitly by themselves...
In [159]: df_dec.groupby(['id'])[['dec_column']].sum()
Out[159]:
dec_column
id
1 0.75
2 0.55
# ...but cannot be combined with standard data types or they will be excluded
In [160]: df_dec.groupby(['id'])[['int_column', 'dec_column']].sum()
Out[160]:
int_column
id
1 4
2 6
# Use .agg function to aggregate over standard and "nuisance" data types
# at the same time
In [161]: df_dec.groupby(['id']).agg({'int_column': 'sum', 'dec_column': 'sum'})
Out[161]:
int_column dec_column
id
1 4 0.75
2 6 0.55
IX. Examples
1 Regrouping by a factor
In [218]: df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0],
.....: 'c': [1, 0, 0], 'd': [2, 3, 4]})
.....:
In [219]: df
Out[219]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
# group the columns by their column sums, then sum within each group of columns
In [220]: df.groupby(df.sum(), axis=1).sum()
Out[220]:
1 9
0 2 2
1 1 3
2 0 4
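Since newer pandas deprecates axis=1 in groupby, an axis-free sketch of the same regrouping transposes, groups the rows (the original columns) by their sums, and transposes back:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0],
                   'c': [1, 0, 0], 'd': [2, 3, 4]})

# df.sum() maps each column label to its sum (a, b, c -> 1; d -> 9);
# grouping the transposed rows by that Series merges columns a, b, c
regrouped = df.T.groupby(df.sum()).sum().T
```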
2 Multi-column grouping
In [221]: dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})
In [222]: dfg
Out[222]:
A B
0 1 a
1 1 a
2 2 a
3 3 b
4 2 a
'''
dfg.groupby(['A', 'B']).groups
{(1, 'a'): Int64Index([0, 1], dtype='int64'),
(2, 'a'): Int64Index([2, 4], dtype='int64'),
(3, 'b'): Int64Index([3], dtype='int64')}
'''
In [223]: dfg.groupby(["A", "B"]).ngroup() # ngroup assigns each row the number of its group
Out[223]:
0 0
1 0
2 1
3 2
4 1
dtype: int64
In [224]: dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
Out[224]:
0 0
1 0
2 1
3 3
4 2
dtype: int64
3 Grouping by index
In [225]: df = pd.DataFrame(np.random.randn(10, 2))
In [226]: df
Out[226]:
0 1
0 -0.793893 0.321153
1 0.342250 1.618906
2 -0.975807 1.918201
3 -0.810847 -1.405919
4 -1.977759 0.461659
5 0.730057 -1.316938
6 -0.751328 0.528290
7 -0.257759 -1.081009
8 0.505895 -1.701948
9 -1.006349 0.020208
In [227]: df.index // 5
Out[227]: Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype='int64')
In [228]: df.groupby(df.index // 5).std()
Out[228]:
0 1
0 0.823647 1.312912
1 0.760109 0.942941
# application: perform an operation on every n rows
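The every-n-rows idea is easiest to see with a sum instead of std, since the chunk sums are exact; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v': np.arange(10)})

# integer-dividing the positional index groups the frame
# into consecutive chunks of 5 rows
chunk_sums = df.groupby(df.index // 5)['v'].sum()
```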
4 Different computations on different columns
In [229]: df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
.....: 'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
.....: 'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
.....: 'd': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]})
.....:
In [230]: def compute_metrics(x):
.....: result = {'b_sum': x['b'].sum(), 'c_mean': x['c'].mean()}
.....: return pd.Series(result, name='metrics')
.....:
In [231]: result = df.groupby('a').apply(compute_metrics)
In [232]: result
Out[232]:
metrics b_sum c_mean
a
0 2.0 0.5
1 2.0 0.5
2 2.0 0.5
In [233]: result.stack()
Out[233]:
a metrics
0 b_sum 2.0
c_mean 0.5
1 b_sum 2.0
c_mean 0.5
2 b_sum 2.0
c_mean 0.5
dtype: float64