Notes on pandas groupby
I. Grouping objects
1 A simple example
In [1]: df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
...: ('bird', 'Psittaciformes', 24.0),
...: ('mammal', 'Carnivora', 80.2),
...: ('mammal', 'Primates', np.nan),
...: ('mammal', 'Carnivora', 58)],
...: index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
...: columns=('class', 'order', 'max_speed'))
...:
In [2]: df
Out[2]:
class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
In [3]: grouped = df.groupby('class') # two groups; the default is axis=0 (i.e. axis='index')
In [4]: grouped = df.groupby('order', axis='columns') # no output; just creates the GroupBy object
In [5]: grouped = df.groupby(['class', 'order']) # four groups
By default groupby splits along the rows; pass axis=1 to split column-wise instead, so each group is a bundle of columns:
In [50]: grouped = df.groupby(df.dtypes, axis=1) # group the columns by dtype; the frame splits into two blocks
In [51]: for i,j in grouped:
...: print(i)
...: print(j)
float64
max_speed
falcon 389.0
parrot 24.0
lion 80.2
monkey NaN
leopard 58.0
object
class order
falcon bird Falconiformes
parrot bird Psittaciformes
lion mammal Carnivora
monkey mammal Primates
leopard mammal Carnivora
A few other common operations:
# Iterate over the groups to inspect their contents; unpack the key according to the number of grouping columns
for i,j in grouped:
print(i)
print(j)
for (i1,i2),j in grouped:
print(i1, i2)
print(j)
# Or simply list the groups:
grouped.groups
# Sum within each group (only works for columns that can be summed):
In []: df.groupby('class').sum()
Out[]:
max_speed
class
bird 413.0
mammal 138.2
# Mean, group size, and count:
In []: df.groupby('class').mean()
Out[]:
max_speed
class
bird 206.5
mammal 69.1
In []: df.groupby('class').size()
Out[]:
class
bird 2
mammal 3
dtype: int64
In []: df.groupby('class').count() # note the difference between count and size
Out[]:
order max_speed
class
bird 2 2
mammal 3 2 # 2 because of the NaN value
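The size/count distinction above can be checked in isolation; a minimal sketch (the frame mirrors the example's shape):

```python
import numpy as np
import pandas as pd

# A frame shaped like the example above: one NaN in the numeric column.
df = pd.DataFrame({'class': ['bird', 'bird', 'mammal', 'mammal', 'mammal'],
                   'max_speed': [389.0, 24.0, 80.2, np.nan, 58.0]})

sizes = df.groupby('class').size()                  # counts rows, NaN included
counts = df.groupby('class')['max_speed'].count()   # counts non-NaN values only

print(sizes['mammal'], counts['mammal'])  # 3 2
```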
2 MultiIndex and grouping
In [6]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B': ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C': np.random.randn(8),
...: 'D': np.random.randn(8)})
In [7]: df
Out[7]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
In [8]: grouped = df.groupby('A') # 2 groups
In [9]: grouped = df.groupby(['A', 'B']) # 6 groups
In [10]: df2 = df.set_index(['A', 'B'])
Out[10]: # print(df2)
C D
A B
foo one -1.209388 -0.309949
bar one -0.380334 -1.352238
foo two 0.309979 -0.695926
bar three 0.650321 0.965206
foo two 0.809020 1.003307
bar two 0.668484 1.013688
foo one 0.513104 0.079576
three 1.579055 # repeated outer labels are hidden in the MultiIndex display
In [11]: grouped = df2.groupby(level=df2.index.names.difference(['B']))
# equivalent to grouped = df2.groupby(level=0); to group by the inner level 'B', use level=1
# since column 'B' cannot be summed, this also equals df.groupby(['A']).sum()
In [12]: grouped.sum()
Out[12]:
C D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
3 Grouping by a function and grouping columns
In [13]: def get_letter_type(letter):
....: if letter.lower() in 'aeiou':
....: return 'vowel'
....: else:
....: return 'consonant'
In [14]: grouped = df.groupby(get_letter_type, axis=1)
# axis=0 groups the rows of the frame, axis=1 groups its columns
# the callable is applied to each column label, much like apply
In []: for i,j in grouped:
print(i)
print(j)
Out[]: # lowercase 'a' is a vowel, so the original df is split into two parts
consonant
B C D
0 one -1.209388 -0.309949
1 one -0.380334 -1.352238
2 two 0.309979 -0.695926
3 three 0.650321 0.965206
4 two 0.809020 1.003307
5 two 0.668484 1.013688
6 one 0.513104 0.079576
7 three 1.579055 -0.083461
vowel
A
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
4 Using level
The following example shows the basic use of level:
In [15]: lst = [1, 2, 3, 1, 2, 3]
In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)
Out[16]: # print(s)
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
In [17]: grouped = s.groupby(level=0) # still 3 groups, keyed by the index; with a MultiIndex, level=1 selects the second level level
In [18]: grouped.first() # first row of each group
Out[18]:
1 1
2 2
3 3
dtype: int64
In [19]: grouped.last() # last row of each group
Out[19]:
1 10
2 20
3 30
dtype: int64
In [20]: grouped.sum()
Out[20]:
1 11
2 22
3 33
dtype: int64
5 Sorting
groupby sorts the group keys in ascending order by default; this can be disabled.
In [21]: df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
In [22]: df2.groupby(['X']).sum()
Out[22]:
Y
X
A 7
B 3
In [23]: df2.groupby(['X'], sort=False).sum() # not descending order, but the order in which the keys first appear
Out[23]:
Y
X
B 3
A 7
Within each group, rows are not re-sorted; they keep their original order.
In [24]: df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
In [25]: df3.groupby(['X']).get_group('A')
Out[25]:
X Y
0 A 1
2 A 3
In [26]: df3.groupby(['X']).get_group('B')
Out[26]:
X Y
1 B 4
3 B 2
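A quick check that rows keep their original positions inside a group; a minimal sketch using the same df3:

```python
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

# 'A' rows sit at positions 0 and 2 in the original frame, and stay there.
group_a = df3.groupby('X').get_group('A')
print(list(group_a.index), list(group_a['Y']))  # [0, 2] [1, 3]
```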
6 Methods of a GroupBy object
Grouping by one column, and grouping the columns:
In [27]: df.groupby('A').groups
Out[27]:
{'bar': Int64Index([1, 3, 5], dtype='int64'),
'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}
In [28]: df.groupby(get_letter_type, axis=1).groups
Out[28]:
{'consonant': Index(['B', 'C', 'D'], dtype='object'),
'vowel': Index(['A'], dtype='object')}
Grouping by two columns:
In [29]: grouped = df.groupby(['A', 'B'])
In [30]: grouped.groups
Out[30]:
{('bar', 'one'): Int64Index([1], dtype='int64'),
('bar', 'three'): Int64Index([3], dtype='int64'),
('bar', 'two'): Int64Index([5], dtype='int64'),
('foo', 'one'): Int64Index([0, 6], dtype='int64'),
('foo', 'three'): Int64Index([7], dtype='int64'),
('foo', 'two'): Int64Index([2, 4], dtype='int64')}
In [31]: len(grouped)
Out[31]: 6
Press Tab to list the available methods:
In [32]: df
Out[32]:
height weight gender
2000-01-01 42.849980 157.500553 male
2000-01-02 49.607315 177.340407 male
2000-01-03 56.293531 171.524640 male
2000-01-04 48.421077 144.251986 female
2000-01-05 46.556882 152.526206 male
2000-01-06 68.448851 168.272968 female
2000-01-07 70.757698 136.431469 male
2000-01-08 58.909500 176.499753 female
2000-01-09 76.435631 174.094104 female
2000-01-10 45.306120 177.540920 male
In [33]: gb = df.groupby('gender')
In [34]: gb.<TAB> # noqa: E225, E999
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight
7 Grouping with a MultiIndex
# build arrays for a two-level index
In [35]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [36]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
In [37]: s = pd.Series(np.random.randn(8), index=index)
# repeated outer labels are hidden in the display
In [38]: s
Out[38]:
first second
bar one -0.919854
two -0.042379
baz one 1.247642
two -0.009920
foo one 0.290213
two 0.495767
qux one 0.362949
two 1.548106
dtype: float64
Simple sums:
In [39]: grouped = s.groupby(level=0)
In [40]: grouped.sum() # equivalent to s.groupby('first').sum()
Out[40]:
first
bar -0.962232
baz 1.237723
foo 0.785980
qux 1.911055
dtype: float64
In [41]: s.groupby(level='second').sum() # equivalent to s.sum(level='second') below (deprecated in newer pandas)
Out[41]:
second
one 0.980950
two 1.991575
dtype: float64
In [42]: s.sum(level='second')
Out[42]:
second
one 0.980950
two 1.991575
dtype: float64
Grouping and summing by two index levels at once also works (note: the s shown below is a new series with a three-level index):
In [43]: s
Out[43]:
first second third
bar doo one -1.131345
two -0.089329
baz bee one 0.337863
two -0.945867
foo bop one -0.932132
two 1.956030
qux bop one 0.017587
two -0.016692
dtype: float64
In [44]: s.groupby(level=['first', 'second']).sum()
Out[44]:
first second
bar doo -1.220674
baz bee -0.608004
foo bop 1.023898
qux bop 0.000895
dtype: float64
In [45]: s.groupby(['first', 'second']).sum()
Out[45]:
first second
bar doo -1.220674
baz bee -0.608004
foo bop 1.023898
qux bop 0.000895
dtype: float64
8 Selecting a column of a GroupBy
# apply different operations to individual columns
In [53]: grouped = df.groupby(['A'])
In [54]: grouped_C = grouped['C']
In [56]: df['C'].groupby(df['A']) # the most direct, simplest form
Out[56]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f2b486509b0>
II. Selecting a group
# get_group takes the group key seen during iteration
In [60]: grouped.get_group('bar')
Out[60]:
A B C D
1 bar one 0.254161 1.511763
3 bar three 0.215897 -0.990582
5 bar two -0.077118 1.211526
In [61]: df.groupby(['A', 'B']).get_group(('bar', 'one'))
Out[61]:
A B C D
1 bar one 0.254161 1.511763
III. Aggregation
In [62]: grouped = df.groupby('A')
In [63]: grouped.aggregate(np.sum) # equivalent to grouped.sum()
Out[63]:
C D
A
bar 0.392940 1.732707
foo -1.796421 2.824590
In [64]: grouped = df.groupby(['A', 'B'])
In [65]: grouped.aggregate(np.sum)
Out[65]:
C D
A B
bar one 0.254161 1.511763
three 0.215897 -0.990582
two -0.077118 1.211526
foo one -0.983776 1.614581
three -0.862495 0.024580
two 0.049851 1.185429
In [66]: grouped = df.groupby(['A', 'B'], as_index=False) # as_index=False keeps the group keys as ordinary columns, so aggregation returns a plain DataFrame
In [67]: grouped.aggregate(np.sum)
Out[67]:
A B C D
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
In [68]: df.groupby('A', as_index=False).sum()
Out[68]:
A C D
0 bar 0.392940 1.732707
1 foo -1.796421 2.824590
In [70]: grouped.size()
Out[70]:
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
dtype: int64
In [71]: grouped.describe() # descriptive statistics for each group
# more per-group methods: std(), var(), sem(), nth()
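A minimal sketch of two of the listed per-group methods (the tiny frame here is illustrative, not the df above):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bar', 'foo', 'bar', 'foo'],
                   'C': [1.0, 2.0, 3.0, 4.0]})
g = df.groupby('A')['C']

var_c = g.var()  # sample variance (ddof=1): bar has (1, 3) -> 2.0, foo has (2, 4) -> 2.0
sem_c = g.sem()  # standard error of the mean: std / sqrt(n) -> 1.0 for both groups
```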
1 Applying several functions at once
1.1 The same functions on every column
In [72]: grouped = df.groupby('A')
In [73]: grouped['C'].agg([np.sum, np.mean, np.std]) # agg is an alias for aggregate
Out[73]:
sum mean std
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
In [74]: grouped.agg([np.sum, np.mean, np.std])
Out[74]:
C D
sum mean std sum mean std
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
In [75]: (grouped['C'].agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'}))
....:
Out[75]:
foo bar baz
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
In [76]: (grouped.agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'}))
....:
Out[76]:
C D
foo bar baz foo bar baz
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
1.2 Different functions per column
In [77]: grouped.agg({'C': np.sum,
....: 'D': lambda x: np.std(x, ddof=1)})
....:
Out[77]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
In [78]: grouped.agg({'C': 'sum', 'D': 'std'})
Out[78]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
In [79]: from collections import OrderedDict
In [80]: grouped.agg({'D': 'std', 'C': 'mean'})
Out[80]:
D C
A
bar 1.366330 0.130980
foo 0.884785 -0.359284
In [81]: grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
Out[81]:
D C
A
bar 1.366330 0.130980
foo 0.884785 -0.359284
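In newer pandas (0.25+), per-column aggregation with readable output names can also be written as "named aggregation"; a minimal sketch (the output names c_sum and d_std are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bar', 'foo', 'bar', 'foo'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [5.0, 6.0, 7.0, 8.0]})

# Each keyword names an output column; the value is (input column, aggregation).
out = df.groupby('A').agg(
    c_sum=('C', 'sum'),
    d_std=('D', 'std'),
)
```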
1.3 Cython-optimized functions
Currently only sum, mean, std and sem have Cython-optimized paths.
IV. Transformation
In [84]: index = pd.date_range('10/1/1999', periods=1100)
In [85]: ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
# rolling: window is the lookback length, min_periods the minimum number of observations
In [86]: ts = ts.rolling(window=100, min_periods=100).mean().dropna()
In [87]: ts.head()
Out[87]:
2000-01-08 0.779333
2000-01-09 0.778852
2000-01-10 0.786476
2000-01-11 0.782797
2000-01-12 0.798110
Freq: D, dtype: float64
In [88]: ts.tail()
Out[88]:
2002-09-30 0.660294
2002-10-01 0.631095
2002-10-02 0.673601
2002-10-03 0.709213
2002-10-04 0.719369
Freq: D, dtype: float64
# standardize the series within each year
In [89]: transformed = (ts.groupby(lambda x: x.year)
....: .transform(lambda x: (x - x.mean()) / x.std()))
Out[89]:
2000-01-08 -0.624080
2000-01-09 -0.763061
2000-01-10 -1.009653
2000-01-11 -0.965821
2000-01-12 -1.227731...
# visualize the data
In [96]: compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})
In [97]: compare.plot()
Out[97]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2b4866c1d0>
# max minus min within each year
In [98]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
Out[98]:
2000-01-08 0.623893
2000-01-09 0.623893
2000-01-10 0.623893
2000-01-11 0.623893
2000-01-12 0.623893
2000-01-13 0.623893
2000-01-14 0.623893
...
2002-09-28 0.558275
2002-09-29 0.558275
2002-09-30 0.558275
2002-10-01 0.558275
2002-10-02 0.558275
2002-10-03 0.558275
2002-10-04 0.558275
Freq: D, Length: 1001, dtype: float64
# equivalent form (note: max and min shadow the Python builtins here)
In [99]: max = ts.groupby(lambda x: x.year).transform('max')
In [100]: min = ts.groupby(lambda x: x.year).transform('min')
In [101]: max - min
Out[101]:
2000-01-08 0.623893
2000-01-09 0.623893
2000-01-10 0.623893
2000-01-11 0.623893
2000-01-12 0.623893
2000-01-13 0.623893
2000-01-14 0.623893
...
2002-09-28 0.558275
2002-09-29 0.558275
2002-09-30 0.558275
2002-10-01 0.558275
2002-10-02 0.558275
2002-10-03 0.558275
2002-10-04 0.558275
Freq: D, Length: 1001, dtype: float64
In [102]: data_df
Out[102]:
A B C
0 1.539708 -1.166480 0.533026
1 1.302092 -0.505754 NaN
2 -0.371983 1.104803 -0.651520
3 -1.309622 1.118697 -1.161657
4 -1.924296 0.396437 0.812436
5 0.815643 0.367816 -0.469478
6 -0.030651 1.376106 -0.645129
.. ... ... ...
993 0.012359 0.554602 -1.976159
994 0.042312 -1.628835 1.013822
995 -0.093110 0.683847 -0.774753
996 -0.185043 1.438572 NaN
997 -0.394469 -0.642343 0.011374
998 -1.174126 1.857148 NaN
999 0.234564 0.517098 0.393534
[1000 rows x 3 columns]
In [103]: countries = np.array(['US', 'UK', 'GR', 'JP'])
In [104]: key = countries[np.random.randint(0, 4, 1000)]
In [105]: grouped = data_df.groupby(key)
# Non-NA count in each group
In [106]: grouped.count()
Out[106]:
A B C
GR 209 217 189
JP 240 255 217
UK 216 231 193
US 239 250 217
In [107]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
In [108]: grouped_trans = transformed.groupby(key)
In [109]: grouped.mean() # original group means
Out[109]:
A B C
GR -0.098371 -0.015420 0.068053
JP 0.069025 0.023100 -0.077324
UK 0.034069 -0.052580 -0.116525
US 0.058664 -0.020399 0.028603
In [110]: grouped_trans.mean() # transformation did not change group means
Out[110]:
A B C
GR -0.098371 -0.015420 0.068053
JP 0.069025 0.023100 -0.077324
UK 0.034069 -0.052580 -0.116525
US 0.058664 -0.020399 0.028603
In [111]: grouped.count() # original has some missing data points
Out[111]:
A B C
GR 209 217 189
JP 240 255 217
UK 216 231 193
US 239 250 217
In [112]: grouped_trans.count() # counts after transformation
Out[112]:
A B C
GR 228 228 228
JP 267 267 267
UK 247 247 247
US 258 258 258
In [113]: grouped_trans.size() # Verify non-NA count equals group size
Out[113]:
GR 228
JP 267
UK 247
US 258
dtype: int64
Group-aware versions of these methods also exist:
- fillna
- forward fill: ffill / pad
- backward fill: bfill / backfill
- lag/lead: shift(1), shift(-1)
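These group-aware methods never cross group boundaries; a minimal sketch (the data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'a', 'b', 'b'],
                   'v': [1.0, np.nan, 3.0, np.nan, 5.0]})
gb = df.groupby('g')['v']

filled = gb.ffill()   # forward fill within each group only:
                      # row 3 stays NaN because group 'b' has no earlier value
lagged = gb.shift(1)  # lag by one row within each group
```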
1 New syntax for window and resampling operations
In [115]: df_re = pd.DataFrame({'A': [1] * 10 + [5] * 10,
.....: 'B': np.arange(20)})
In [116]: df_re
Out[116]:
A B
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
.. .. ..
13 5 13
14 5 14
15 5 15
16 5 16
17 5 17
18 5 18
19 5 19
[20 rows x 2 columns]
# rolling 4-period mean of B; NaN until four observations are available
In [117]: df_re.groupby('A').rolling(4).B.mean()
Out[117]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
...
5 13 11.5
14 12.5
15 13.5
16 14.5
17 15.5
18 16.5
19 17.5
Name: B, Length: 20, dtype: float64
# expanding (cumulative) sum within each group
In [118]: df_re.groupby('A').expanding().sum()
Out[118]:
A B
A
1 0 1.0 0.0
1 2.0 1.0
2 3.0 3.0
3 4.0 6.0
4 5.0 10.0
5 6.0 15.0
6 7.0 21.0
... ... ...
5 13 20.0 46.0
14 25.0 60.0
15 30.0 75.0
16 35.0 91.0
17 40.0 108.0
18 45.0 126.0
19 50.0 145.0
[20 rows x 2 columns]
In [119]: df_re = pd.DataFrame({'date': pd.date_range(start='2016-01-01', periods=4,
.....: freq='W'),
.....: 'group': [1, 1, 2, 2],
.....: 'val': [5, 6, 7, 8]}).set_index('date')
In [120]: df_re
Out[120]:
group val
date
2016-01-03 1 5
2016-01-10 1 6
2016-01-17 2 7
2016-01-24 2 8
# resample irregular data onto a regular grid; the rule could also be e.g. '60S'
In [121]: df_re.groupby('group').resample('1D').ffill()
Out[121]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
... ... ...
2 2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
[16 rows x 2 columns]
V. Filtering
In [122]: sf = pd.Series([1, 1, 2, 3, 3, 3])
# keep the groups whose sum exceeds 2
# compare with sf[sf.apply(lambda x: x >= 2)], which filters elements rather than groups
In [123]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[123]:
3 3
4 3
5 3
dtype: int64
In [124]: dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
# keep the groups with more than 2 elements
In [125]: dff.groupby('B').filter(lambda x: len(x) > 2)
Out[125]:
A B
2 2 b
3 3 b
4 4 b
5 5 b
In [126]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
Out[126]:
A B
0 NaN NaN
1 NaN NaN
2 2.0 b
3 3.0 b
4 4.0 b
5 5.0 b
6 NaN NaN
7 NaN NaN
In [127]: dff['C'] = np.arange(8)
In [128]: dff.groupby('B').filter(lambda x: len(x['C']) > 2)
Out[128]:
A B C
2 2 b 2
3 3 b 3
4 4 b 4
5 5 b 5
# first two rows of each group; tail() gives the last rows
In [129]: dff.groupby('B').head(2)
Out[129]:
A B C
0 0 a 0
1 1 a 1
2 2 b 2
3 3 b 3
6 6 c 6
7 7 c 7
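tail() is the mirror image of head(); a minimal sketch on the same dff shape:

```python
import pandas as pd

dff = pd.DataFrame({'A': range(8), 'B': list('aabbbbcc')})

top = dff.groupby('B').head(2)     # first two rows of each group, original order kept
bottom = dff.groupby('B').tail(1)  # last row of each group
print(list(bottom['A']))  # [1, 5, 7]
```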
VI. Group-wise instance methods
In [130]: grouped = df.groupby('A')
In [131]: grouped.agg(lambda x: x.std()) # shorthand: grouped.std()
Out[131]:
C D
A
bar 0.181231 1.366330
foo 0.912265 0.884785
In [132]: grouped.std()
Out[132]:
C D
A
bar 0.181231 1.366330
foo 0.912265 0.884785
In [133]: tsdf = pd.DataFrame(np.random.randn(1000, 3),
.....: index=pd.date_range('1/1/2000', periods=1000),
.....: columns=['A', 'B', 'C'])
In [134]: tsdf.iloc[::2] = np.nan
In [135]: grouped = tsdf.groupby(lambda x: x.year)
# forward fill within each year
In [136]: grouped.fillna(method='pad')
Out[136]:
A B C
2000-01-01 NaN NaN NaN # NaN: nothing earlier to fill from
2000-01-02 -0.353501 -0.080957 -0.876864
2000-01-03 -0.353501 -0.080957 -0.876864
2000-01-04 0.050976 0.044273 -0.559849
2000-01-05 0.050976 0.044273 -0.559849
2000-01-06 0.030091 0.186460 -0.680149
2000-01-07 0.030091 0.186460 -0.680149
... ... ... ...
2002-09-20 2.310215 0.157482 -0.064476
2002-09-21 2.310215 0.157482 -0.064476
2002-09-22 0.005011 0.053897 -1.026922
2002-09-23 0.005011 0.053897 -1.026922
2002-09-24 -0.456542 -1.849051 1.559856
2002-09-25 -0.456542 -1.849051 1.559856
2002-09-26 1.123162 0.354660 1.128135
[1000 rows x 3 columns]
In [137]: s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
In [138]: g = pd.Series(list('abababab'))
In [139]: gb = s.groupby(g)
# top 3 of each group
In [140]: gb.nlargest(3)
Out[140]:
a 4 19.0
0 9.0
2 7.0
b 1 8.0
3 5.0
7 3.3
dtype: float64
# smallest 3 of each group
In [141]: gb.nsmallest(3)
Out[141]:
a 6 4.2
2 7.0
0 9.0
b 5 1.0
7 3.3
3 5.0
dtype: float64
VII. Statistics
In [142]: df
Out[142]:
A B C D
0 foo one -0.575247 1.346061
1 bar one 0.254161 1.511763
2 foo two -1.143704 1.627081
3 bar three 0.215897 -0.990582
4 foo two 1.193555 -0.441652
5 bar two -0.077118 1.211526
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
In [143]: grouped = df.groupby('A')
In [144]: grouped['C'].apply(lambda x: x.describe())
Out[144]:
A
bar count 3.000000
mean 0.130980
std 0.181231
min -0.077118
25% 0.069390
50% 0.215897
75% 0.235029
...
foo mean -0.359284
std 0.912265
min -1.143704
25% -0.862495
50% -0.575247
75% -0.408530
max 1.193555
Name: C, Length: 16, dtype: float64
In [145]: grouped = df.groupby('A')['C']
In [146]: def f(group):
.....: return pd.DataFrame({'original': group,
.....: 'demeaned': group - group.mean()})
.....:
In [147]: grouped.apply(f)
Out[147]:
original demeaned
0 -0.575247 -0.215962
1 0.254161 0.123181
2 -1.143704 -0.784420
3 0.215897 0.084917
4 1.193555 1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
In [148]: def f(x):
.....: return pd.Series([x, x ** 2], index=['x', 'x^2'])
.....:
In [149]: s = pd.Series(np.random.rand(5))
In [150]: s
Out[150]:
0 0.321438
1 0.493496
2 0.139505
3 0.910103
4 0.194158
dtype: float64
In [151]: s.apply(f)
Out[151]:
x x^2
0 0.321438 0.103323
1 0.493496 0.243538
2 0.139505 0.019462
3 0.910103 0.828287
4 0.194158 0.037697
In [152]: d = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [153]: def identity(df):
.....: print(df)
.....: return df
In [154]: d.groupby("a").apply(identity)
a b
0 x 1
a b
0 x 1
a b
1 y 2
Out[154]:
a b
0 x 1
1 y 2
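The double print of the first group above is a real gotcha: depending on the pandas version, apply may call the function on the first group twice (once to decide on a fast path), so functions with side effects are unsafe. A sketch that records the calls (the exact count is version-dependent, so only a lower bound is stated):

```python
import pandas as pd

d = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
calls = []

def identity(group):
    calls.append(group["b"].iloc[0])  # side effect: record each invocation
    return group

d.groupby("a").apply(identity)
# len(calls) may be 2 or 3 depending on the pandas version,
# but both groups are always visited at least once.
```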
VIII. Other useful tricks
In [155]: df
Out[155]:
A B C D
0 foo one -0.575247 1.346061
1 bar one 0.254161 1.511763
2 foo two -1.143704 1.627081
3 bar three 0.215897 -0.990582
4 foo two 1.193555 -0.441652
5 bar two -0.077118 1.211526
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
In [156]: df.groupby('A').std()
Out[156]:
C D
A
bar 0.181231 1.366330
foo 0.912265 0.884785
In [157]: from decimal import Decimal
In [158]: df_dec = pd.DataFrame(
.....: {'id': [1, 2, 1, 2],
.....: 'int_column': [1, 2, 3, 4],
.....: 'dec_column': [Decimal('0.50'), Decimal('0.15'),
.....: Decimal('0.25'), Decimal('0.40')]
.....: }
.....: )
# Decimal columns can be sum'd explicitly by themselves...
In [159]: df_dec.groupby(['id'])[['dec_column']].sum()
Out[159]:
dec_column
id
1 0.75
2 0.55
# ...but cannot be combined with standard data types or they will be excluded
In [160]: df_dec.groupby(['id'])[['int_column', 'dec_column']].sum()
Out[160]:
int_column
id
1 4
2 6
# Use .agg function to aggregate over standard and "nuisance" data types
# at the same time
In [161]: df_dec.groupby(['id']).agg({'int_column': 'sum', 'dec_column': 'sum'})
Out[161]:
int_column dec_column
id
1 4 0.75
2 6 0.55
IX. Examples
1 Grouping by factors
In [218]: df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0],
.....: 'c': [1, 0, 0], 'd': [2, 3, 4]})
.....:
In [219]: df
Out[219]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
# group the columns by their column sums, then sum across each group of columns row by row
In [220]: df.groupby(df.sum(), axis=1).sum()
Out[220]:
1 9
0 2 2
1 1 3
2 0 4
2 Grouping by multiple columns
In [221]: dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})
In [222]: dfg
Out[222]:
A B
0 1 a
1 1 a
2 2 a
3 3 b
4 2 a
'''
dfg.groupby(['A', 'B']).groups
{(1, 'a'): Int64Index([0, 1], dtype='int64'),
(2, 'a'): Int64Index([2, 4], dtype='int64'),
(3, 'b'): Int64Index([3], dtype='int64')}
'''
In [223]: dfg.groupby(["A", "B"]).ngroup() # ngroup assigns each group a number
Out[223]:
0 0
1 0
2 1
3 2
4 1
dtype: int64
In [224]: dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
Out[224]:
0 0
1 0
2 1
3 3
4 2
dtype: int64
3 Grouping by index
In [225]: df = pd.DataFrame(np.random.randn(10, 2))
In [226]: df
Out[226]:
0 1
0 -0.793893 0.321153
1 0.342250 1.618906
2 -0.975807 1.918201
3 -0.810847 -1.405919
4 -1.977759 0.461659
5 0.730057 -1.316938
6 -0.751328 0.528290
7 -0.257759 -1.081009
8 0.505895 -1.701948
9 -1.006349 0.020208
In [227]: df.index // 5
Out[227]: Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype='int64')
In [228]: df.groupby(df.index // 5).std()
Out[228]:
0 1
0 0.823647 1.312912
1 0.760109 0.942941
# application: perform an operation on every n rows
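The every-n-rows idea can be sketched directly (chunk size 5 and the mean are illustrative choices):

```python
import pandas as pd

s = pd.Series(range(10))

# Integer-divide the positional index to label rows 0-4 as chunk 0, rows 5-9 as chunk 1.
chunk_means = s.groupby(s.index // 5).mean()
print(chunk_means.loc[0], chunk_means.loc[1])  # 2.0 7.0
```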
4 Group-wise metrics over several columns
In [229]: df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
.....: 'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
.....: 'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
.....: 'd': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]})
.....:
In [230]: def compute_metrics(x):
.....: result = {'b_sum': x['b'].sum(), 'c_mean': x['c'].mean()}
.....: return pd.Series(result, name='metrics')
.....:
In [231]: result = df.groupby('a').apply(compute_metrics)
In [232]: result
Out[232]:
metrics b_sum c_mean
a
0 2.0 0.5
1 2.0 0.5
2 2.0 0.5
In [233]: result.stack()
Out[233]:
a metrics
0 b_sum 2.0
c_mean 0.5
1 b_sum 2.0
c_mean 0.5
2 b_sum 2.0
c_mean 0.5
dtype: float64