pandas groupby study notes

Notes on the groupby features of the pandas library.

I. Splitting an object into groups

1 A simple example

In [1]: df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
   ...:                    ('bird', 'Psittaciformes', 24.0),
   ...:                    ('mammal', 'Carnivora', 80.2),
   ...:                    ('mammal', 'Primates', np.nan),
   ...:                    ('mammal', 'Carnivora', 58)],
   ...:                   index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
   ...:                   columns=('class', 'order', 'max_speed'))
   ...:

In [2]: df
Out[2]:
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0

In [3]: grouped = df.groupby('class')	# two groups; the default is axis=0, i.e. axis='index'

In [4]: grouped = df.groupby('order', axis='columns')	# assignment only, so no output

In [5]: grouped = df.groupby(['class', 'order'])	# four groups

By default groupby splits the rows according to the values of the given column(s). Passing axis=1 splits the columns instead, so each group is a set of whole columns:

In [50]: grouped = df.groupby(df.dtypes, axis=1)	# group the columns by their dtype; here that gives two groups

In [51]: for i,j in grouped:
    ...:     print(i)
    ...:     print(j)

float64
         max_speed
falcon       389.0
parrot        24.0
lion          80.2
monkey         NaN
leopard       58.0

object
          class           order
falcon     bird   Falconiformes
parrot     bird  Psittaciformes
lion     mammal       Carnivora
monkey   mammal        Primates
leopard  mammal       Carnivora

A few other commonly used operations:

# iterate over the groups to see what each one contains
for i, j in grouped:          # one grouping key: i is the group name, j the sub-frame
    print(i)
    print(j)
for (i1, i2), j in grouped:   # two grouping keys: the group name is a tuple
    print(i1, i2)
    print(j)

# or simply list the groups:
grouped.groups

# sum within each group (only columns that can be summed are kept):
In []: df.groupby('class').sum()
Out[]:
        max_speed
class            
bird        413.0
mammal      138.2

# mean, group size, and per-column count:
In []: df.groupby('class').mean()
Out[]:
        max_speed
class            
bird        206.5
mammal       69.1

In []: df.groupby('class').size()
Out[]:
class
bird      2
mammal    3
dtype: int64
    
In []: df.groupby('class').count()	# note how count differs from size
Out[]:
        order  max_speed
class                   
bird        2          2
mammal      3          2	# only 2 because of the NaN in max_speed

2 MultiIndex and grouping

In [6]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8),
   ...:                    'D': np.random.randn(8)})

In [7]: df
Out[7]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

In [8]: grouped = df.groupby('A')	# 2 groups

In [9]: grouped = df.groupby(['A', 'B'])	# 6 groups

In [10]: df2 = df.set_index(['A', 'B'])

In []: df2
Out[]:
                  C         D
A   B                        
foo one   -1.209388 -0.309949
bar one   -0.380334 -1.352238
foo two    0.309979 -0.695926
bar three  0.650321  0.965206
foo two    0.809020  1.003307
bar two    0.668484  1.013688
foo one    0.513104  0.079576
    three  1.579055 -0.083461	# note: repeated outer labels of a MultiIndex are not re-printed

In [11]: grouped = df2.groupby(level=df2.index.names.difference(['B']))
# equivalent to grouped = df2.groupby(level=0); to group by the inner index level 'B', use level=1
# since column 'B' is non-numeric and dropped from the sum, this also matches df.groupby(['A']).sum()

In [12]: grouped.sum()
Out[12]:
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938
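
As a quick check of the equivalences mentioned in the comments above, a minimal sketch (it assumes a pandas version that silently drops the non-numeric column 'B' from sum(); newer versions may need numeric_only=True):

a = df2.groupby(level='A').sum()   # group by the index level name
b = df2.groupby(level=0).sum()     # the same level, addressed by position
c = df.groupby('A').sum()          # 'B' is non-numeric and excluded from the sum
a.equals(b) and a.equals(c)        # expected: True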

3 Grouping by a function (grouping the columns)

In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'

In [14]: grouped = df.groupby(get_letter_type, axis=1)
# with axis=0 the function is applied to the row labels to form the groups; with axis=1 it is applied to the column names
# so the grouping key acts much like a function passed to apply

In []: for i, j in grouped:
   ...:     print(i)
   ...:     print(j)

Out[]:    # 'A' in lower case is a vowel, so the frame is split into two parts
consonant
       B         C         D
0    one -1.209388 -0.309949
1    one -0.380334 -1.352238
2    two  0.309979 -0.695926
3  three  0.650321  0.965206
4    two  0.809020  1.003307
5    two  0.668484  1.013688
6    one  0.513104  0.079576
7  three  1.579055 -0.083461
vowel
     A
0  foo
1  bar
2  foo
3  bar
4  foo
5  bar
6  foo
7  foo
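
To pull one of these column groups out directly, get_group should also work on the axis=1 grouping (a sketch worth verifying on your pandas version); the equivalent manual column selection is shown for comparison:

grouped.get_group('vowel')       # just column A
grouped.get_group('consonant')   # columns B, C and D

# the same selection done by hand with the grouping function
df[[c for c in df.columns if get_letter_type(c) == 'vowel']]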

4 Using the level argument

The following example shows the basic use of level:

In [15]: lst = [1, 2, 3, 1, 2, 3]

In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)

In []: s
Out[]:
1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64

In [17]: grouped = s.groupby(level=0)	# grouping by the index still gives 3 groups; on a MultiIndex, level=1 would pick the second index level

In [18]: grouped.first()	# the first entry of each group
Out[18]:
1    1
2    2
3    3
dtype: int64

In [19]: grouped.last()    # the last entry of each group
Out[19]:
1    10
2    20
3    30
dtype: int64

In [20]: grouped.sum()
Out[20]:
1    11
2    22
3    33
dtype: int64
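
Since the index here is a plain (non-unique) Index rather than a MultiIndex, grouping by level=0 is simply grouping by the index values themselves; a minimal sanity check:

s.groupby(level=0).sum().equals(s.groupby(s.index).sum())   # expected: True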

5 Sorting

groupby sorts the group keys in ascending order by default; this can be switched off.

In [21]: df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})

In [22]: df2.groupby(['X']).sum()
Out[22]:
   Y
X   
A  7
B  3

In [23]: df2.groupby(['X'], sort=False).sum()    # not descending order, but the order in which the keys first appear
Out[23]:
   Y
X   
B  3
A  7

Within each group the rows are not re-sorted; by default they keep their original order.

In [24]: df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

In [25]: df3.groupby(['X']).get_group('A')
Out[25]:
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(['X']).get_group('B')
Out[26]:
   X  Y
1  B  4
3  B  2
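
If you do want the rows inside each group ordered by a column, one simple option (a sketch, not the only way) is to sort the frame before grouping:

df3.sort_values('Y').groupby(['X']).get_group('B')
#    X  Y
# 3  B  2
# 1  B  4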

6 Methods and attributes of the GroupBy object

Grouping by a column, and grouping the columns:

In [27]: df.groupby('A').groups
Out[27]:
{'bar': Int64Index([1, 3, 5], dtype='int64'),
 'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}

In [28]: df.groupby(get_letter_type, axis=1).groups
Out[28]:
{'consonant': Index(['B', 'C', 'D'], dtype='object'),
 'vowel': Index(['A'], dtype='object')}

Grouping by two columns:

In [29]: grouped = df.groupby(['A', 'B'])

In [30]: grouped.groups
Out[30]:
{('bar', 'one'): Int64Index([1], dtype='int64'),
 ('bar', 'three'): Int64Index([3], dtype='int64'),
 ('bar', 'two'): Int64Index([5], dtype='int64'),
 ('foo', 'one'): Int64Index([0, 6], dtype='int64'),
 ('foo', 'three'): Int64Index([7], dtype='int64'),
 ('foo', 'two'): Int64Index([2, 4], dtype='int64')}

In [31]: len(grouped)
Out[31]: 6
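
The df used in the next few inputs is a different frame that this note never constructs; a minimal sketch that builds a similar one (the column names match the output below, but the values are random and will differ):

np.random.seed(0)
df = pd.DataFrame({'height': np.random.normal(60, 10, 10),
                   'weight': np.random.normal(160, 15, 10),
                   'gender': np.random.choice(['male', 'female'], 10)},
                  index=pd.date_range('2000-01-01', periods=10))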

Press the Tab key to see which methods and attributes are available:

In [32]: df
Out[32]:
               height      weight  gender
2000-01-01  42.849980  157.500553    male
2000-01-02  49.607315  177.340407    male
2000-01-03  56.293531  171.524640    male
2000-01-04  48.421077  144.251986  female
2000-01-05  46.556882  152.526206    male
2000-01-06  68.448851  168.272968  female
2000-01-07  70.757698  136.431469    male
2000-01-08  58.909500  176.499753  female
2000-01-09  76.435631  174.094104  female
2000-01-10  45.306120  177.540920    male

In [33]: gb = df.groupby('gender')

In [34]: gb.<TAB>  # noqa: E225, E999
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

7 Grouping with a MultiIndex

# create two arrays for a two-level index
In [35]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....: 

In [36]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [37]: s = pd.Series(np.random.randn(8), index=index)

# repeated adjacent index labels are not re-printed
In [38]: s
Out[38]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

A simple group sum:

In [39]: grouped = s.groupby(level=0)

In [40]: grouped.sum()    # equivalent to s.groupby('first').sum()
Out[40]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

In [41]: s.groupby(level='second').sum()    # equivalent to s.sum(level='second')
Out[41]: 
second
one    0.980950
two    1.991575
dtype: float64

In [42]: s.sum(level='second')
Out[42]: 
second
one    0.980950
two    1.991575
dtype: float64

You can also group and sum by several index levels at once (note that s below has been rebuilt with a three-level index named first/second/third):

In [43]: s
Out[43]: 
first  second  third
bar    doo     one     -1.131345
               two     -0.089329
baz    bee     one      0.337863
               two     -0.945867
foo    bop     one     -0.932132
               two      1.956030
qux    bop     one      0.017587
               two     -0.016692
dtype: float64

In [44]: s.groupby(level=['first', 'second']).sum()
Out[44]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64
    
In [45]: s.groupby(['first', 'second']).sum()
Out[45]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64

8 Selecting a column of a GroupBy

# work with a single column of the grouped frame
In [53]: grouped = df.groupby(['A'])

In [54]: grouped_C = grouped['C']

In [56]: df['C'].groupby(df['A'])	# the most direct spelling
Out[56]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f2b486509b0>
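
All three spellings select the same column of the grouped data, so any aggregation on them agrees; a quick check:

a = df.groupby('A')['C'].sum()
b = grouped_C.sum()                  # grouped['C'] from above
c = df['C'].groupby(df['A']).sum()
a.equals(b) and a.equals(c)          # expected: True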

II. Selecting a group

# the argument of get_group is the group name seen when iterating
In [60]: grouped.get_group('bar')
Out[60]: 
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526

In [61]: df.groupby(['A', 'B']).get_group(('bar', 'one'))
Out[61]: 
     A    B         C         D
1  bar  one  0.254161  1.511763

III. Aggregation

In [62]: grouped = df.groupby('A')

In [63]: grouped.aggregate(np.sum)	# equivalent to grouped.sum()
Out[63]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [64]: grouped = df.groupby(['A', 'B'])

In [65]: grouped.aggregate(np.sum)
Out[65]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429
    
In [66]: grouped = df.groupby(['A', 'B'], as_index=False)	# with as_index=False the group keys stay as ordinary columns, so the aggregation yields a plain DataFrame

In [67]: grouped.aggregate(np.sum)
Out[67]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [68]: df.groupby('A', as_index=False).sum()
Out[68]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

In [70]: grouped.size()
Out[70]: 
A    B    
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64
    
In [71]: grouped.describe()	# a summary of each group
# other useful methods: std(), var(), sem(), nth()
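
nth() deserves a note: it picks rows by position within each group, whereas first()/last() skip over NaN values. A small sketch on a fresh grouping by 'A' (the exact interaction of nth with as_index has varied across pandas versions, so treat this as an illustration):

g = df.groupby('A')
g.nth(0)     # the literal first row of each group, NaN and all
g.first()    # the first non-NaN value of each column per group
g.nth(-1)    # the literal last row of each group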

1 Applying multiple functions at once

1.1 The same functions on every column

In [72]: grouped = df.groupby('A')

In [73]: grouped['C'].agg([np.sum, np.mean, np.std])	# agg is an alias for aggregate
Out[73]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

In [74]: grouped.agg([np.sum, np.mean, np.std])
Out[74]: 
            C                             D                    
          sum      mean       std       sum      mean       std
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

In [75]: (grouped['C'].agg([np.sum, np.mean, np.std])
   ....:              .rename(columns={'sum': 'foo',
   ....:                               'mean': 'bar',
   ....:                               'std': 'baz'}))
   ....: 
Out[75]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

In [76]: (grouped.agg([np.sum, np.mean, np.std])
   ....:         .rename(columns={'sum': 'foo',
   ....:                          'mean': 'bar',
   ....:                          'std': 'baz'}))
   ....: 
Out[76]: 
            C                             D                    
          foo       bar       baz       foo       bar       baz
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

1.2 Different functions for different columns

In [77]: grouped.agg({'C': np.sum,
   ....:              'D': lambda x: np.std(x, ddof=1)})
   ....: 
Out[77]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

In [78]: grouped.agg({'C': 'sum', 'D': 'std'})
Out[78]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

In [79]: from collections import OrderedDict

In [80]: grouped.agg({'D': 'std', 'C': 'mean'})
Out[80]: 
            D         C
A                      
bar  1.366330  0.130980
foo  0.884785 -0.359284

In [81]: grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
Out[81]: 
            D         C
A                      
bar  1.366330  0.130980
foo  0.884785 -0.359284

1.3 Cython-optimized aggregations

Currently only sum, mean, std and sem have Cython-optimized implementations.
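
In practice this means that spelling an aggregation as a string name (or calling the method directly) takes the fast Cython path, while an equivalent lambda falls back to the slower generic route; the numbers are the same either way (a sketch, timings depend on your data and pandas version):

grouped['C'].agg('mean')              # optimized path
grouped['C'].mean()                   # same result
grouped['C'].agg(lambda x: x.mean())  # same result, generic (slower) path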

IV. Transformation

In [84]: index = pd.date_range('10/1/1999', periods=1100)

In [85]: ts = pd.Series(np.random.normal(0.5, 2, 1100), index)

# rolling: window is the number of past periods, min_periods the minimum number of observations required
In [86]: ts = ts.rolling(window=100, min_periods=100).mean().dropna()

In [87]: ts.head()
Out[87]: 
2000-01-08    0.779333
2000-01-09    0.778852
2000-01-10    0.786476
2000-01-11    0.782797
2000-01-12    0.798110
Freq: D, dtype: float64

In [88]: ts.tail()
Out[88]: 
2002-09-30    0.660294
2002-10-01    0.631095
2002-10-02    0.673601
2002-10-03    0.709213
2002-10-04    0.719369
Freq: D, dtype: float64

# standardize the series within each year
In [89]: transformed = (ts.groupby(lambda x: x.year)
   ....:                  .transform(lambda x: (x - x.mean()) / x.std()))

# the first values of transformed:
2000-01-08   -0.624080
2000-01-09   -0.763061
2000-01-10   -1.009653
2000-01-11   -0.965821
2000-01-12   -1.227731...
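
A quick sanity check on the transform: within each year the standardized values should now have mean roughly 0 and standard deviation roughly 1:

transformed.groupby(lambda x: x.year).agg(['mean', 'std'])   # each year: mean ~ 0, std ~ 1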

# visualize the data
In [96]: compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})

In [97]: compare.plot()
Out[97]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2b4866c1d0>
# compute the max-min range within each year
In [98]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
Out[98]: 
2000-01-08    0.623893
2000-01-09    0.623893
2000-01-10    0.623893
2000-01-11    0.623893
2000-01-12    0.623893
2000-01-13    0.623893
2000-01-14    0.623893
                ...   
2002-09-28    0.558275
2002-09-29    0.558275
2002-09-30    0.558275
2002-10-01    0.558275
2002-10-02    0.558275
2002-10-03    0.558275
2002-10-04    0.558275
Freq: D, Length: 1001, dtype: float64

# an equivalent formulation
In [99]: max = ts.groupby(lambda x: x.year).transform('max')
In [100]: min = ts.groupby(lambda x: x.year).transform('min')
In [101]: max - min
Out[101]: 
2000-01-08    0.623893
2000-01-09    0.623893
2000-01-10    0.623893
2000-01-11    0.623893
2000-01-12    0.623893
2000-01-13    0.623893
2000-01-14    0.623893
                ...   
2002-09-28    0.558275
2002-09-29    0.558275
2002-09-30    0.558275
2002-10-01    0.558275
2002-10-02    0.558275
2002-10-03    0.558275
2002-10-04    0.558275
Freq: D, Length: 1001, dtype: float64
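
The data_df in the next input is another frame the note never builds; a minimal sketch of a similar one (the NaN positions are random, so the group counts below will not match exactly):

data_df = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'])
data_df = data_df.mask(np.random.rand(1000, 3) < 0.1)   # knock out roughly 10% of the values at random
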
In [102]: data_df
Out[102]: 
            A         B         C
0    1.539708 -1.166480  0.533026
1    1.302092 -0.505754       NaN
2   -0.371983  1.104803 -0.651520
3   -1.309622  1.118697 -1.161657
4   -1.924296  0.396437  0.812436
5    0.815643  0.367816 -0.469478
6   -0.030651  1.376106 -0.645129
..        ...       ...       ...
993  0.012359  0.554602 -1.976159
994  0.042312 -1.628835  1.013822
995 -0.093110  0.683847 -0.774753
996 -0.185043  1.438572       NaN
997 -0.394469 -0.642343  0.011374
998 -1.174126  1.857148       NaN
999  0.234564  0.517098  0.393534

[1000 rows x 3 columns]

In [103]: countries = np.array(['US', 'UK', 'GR', 'JP'])

In [104]: key = countries[np.random.randint(0, 4, 1000)]

In [105]: grouped = data_df.groupby(key)

# Non-NA count in each group
In [106]: grouped.count()
Out[106]: 
      A    B    C
GR  209  217  189
JP  240  255  217
UK  216  231  193
US  239  250  217

In [107]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))

In [108]: grouped_trans = transformed.groupby(key)

In [109]: grouped.mean()  # original group means
Out[109]: 
           A         B         C
GR -0.098371 -0.015420  0.068053
JP  0.069025  0.023100 -0.077324
UK  0.034069 -0.052580 -0.116525
US  0.058664 -0.020399  0.028603

In [110]: grouped_trans.mean()  # transformation did not change group means
Out[110]: 
           A         B         C
GR -0.098371 -0.015420  0.068053
JP  0.069025  0.023100 -0.077324
UK  0.034069 -0.052580 -0.116525
US  0.058664 -0.020399  0.028603

In [111]: grouped.count()  # original has some missing data points
Out[111]: 
      A    B    C
GR  209  217  189
JP  240  255  217
UK  216  231  193
US  239  250  217

In [112]: grouped_trans.count()  # counts after transformation
Out[112]: 
      A    B    C
GR  228  228  228
JP  267  267  267
UK  247  247  247
US  258  258  258

In [113]: grouped_trans.size()  # Verify non-NA count equals group size
Out[113]: 
GR    228
JP    267
UK    247
US    258
dtype: int64

Group-wise versions of several familiar methods are also available (see the sketch after this list):

  • fillna

  • forward fill: ffill / pad

  • backward fill: bfill / backfill

  • lag and lead: shift(1), shift(-1)
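
A minimal sketch of these group-wise variants, using grouped = data_df.groupby(key) from above; each operation restarts at every group boundary:

grouped.ffill()     # forward fill within each group
grouped.bfill()     # backward fill within each group
grouped.shift(1)    # lag every column by one row, per group
grouped.shift(-1)   # lead by one row, per group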

1 New syntax for window and resampling operations

In [115]: df_re = pd.DataFrame({'A': [1] * 10 + [5] * 10,
   .....:                       'B': np.arange(20)})

In [116]: df_re
Out[116]: 
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
6   1   6
.. ..  ..
13  5  13
14  5  14
15  5  15
16  5  16
17  5  17
18  5  18
19  5  19

[20 rows x 2 columns]

# rolling 4-period mean of B within each group; NaN until 4 observations are available
In [117]: df_re.groupby('A').rolling(4).B.mean()
Out[117]: 
A    
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
         ... 
5  13    11.5
   14    12.5
   15    13.5
   16    14.5
   17    15.5
   18    16.5
   19    17.5
Name: B, Length: 20, dtype: float64

# expanding (cumulative) sum within each group
In [118]: df_re.groupby('A').expanding().sum()
Out[118]: 
         A      B
A                
1 0    1.0    0.0
  1    2.0    1.0
  2    3.0    3.0
  3    4.0    6.0
  4    5.0   10.0
  5    6.0   15.0
  6    7.0   21.0
...    ...    ...
5 13  20.0   46.0
  14  25.0   60.0
  15  30.0   75.0
  16  35.0   91.0
  17  40.0  108.0
  18  45.0  126.0
  19  50.0  145.0

[20 rows x 2 columns]

In [119]: df_re = pd.DataFrame({'date': pd.date_range(start='2016-01-01', periods=4,
   .....:                                             freq='W'),
   .....:                       'group': [1, 1, 2, 2],
   .....:                       'val': [5, 6, 7, 8]}).set_index('date')

In [120]: df_re
Out[120]: 
            group  val
date                  
2016-01-03      1    5
2016-01-10      1    6
2016-01-17      2    7
2016-01-24      2    8

# resample irregular data onto a regular grid within each group; the frequency could also be e.g. '60S'
In [121]: df_re.groupby('group').resample('1D').ffill()
Out[121]: 
                  group  val
group date                  
1     2016-01-03      1    5
      2016-01-04      1    5
      2016-01-05      1    5
      2016-01-06      1    5
      2016-01-07      1    5
      2016-01-08      1    5
      2016-01-09      1    5
...                 ...  ...
2     2016-01-18      2    7
      2016-01-19      2    7
      2016-01-20      2    7
      2016-01-21      2    7
      2016-01-22      2    7
      2016-01-23      2    7
      2016-01-24      2    8

[16 rows x 2 columns]

V. Filtration

In [122]: sf = pd.Series([1, 1, 2, 3, 3, 3])

# keep the groups whose sum is greater than 2
# compare this with element-wise selection such as sf[sf.apply(lambda x: x >= 2)] (see the sketch below)
In [123]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[123]: 
3    3
4    3
5    3
dtype: int64
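
For contrast with the comment above: element-wise boolean selection keeps individual values, while filter keeps or drops whole groups based on a group-level condition:

sf[sf >= 2]    # element-wise: the single 2 survives as well
# 2    2
# 3    3
# 4    3
# 5    3
# dtype: int64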
    
In [124]: dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})

# keep the groups with more than 2 elements
In [125]: dff.groupby('B').filter(lambda x: len(x) > 2)
Out[125]: 
   A  B
2  2  b
3  3  b
4  4  b
5  5  b

In [126]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
Out[126]: 
     A    B
0  NaN  NaN
1  NaN  NaN
2  2.0    b
3  3.0    b
4  4.0    b
5  5.0    b
6  NaN  NaN
7  NaN  NaN

In [127]: dff['C'] = np.arange(8)

In [128]: dff.groupby('B').filter(lambda x: len(x['C']) > 2)
Out[128]: 
   A  B  C
2  2  b  2
3  3  b  3
4  4  b  4
5  5  b  5

# the first two rows of each group; tail() works analogously (see below)
In [129]: dff.groupby('B').head(2)
Out[129]: 
   A  B  C
0  0  a  0
1  1  a  1
2  2  b  2
3  3  b  3
6  6  c  6
7  7  c  7
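
tail() is the mirror image, and both keep the rows in their original order:

dff.groupby('B').tail(1)
#    A  B  C
# 1  1  a  1
# 5  5  b  5
# 7  7  c  7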

VI. Dispatching methods to the groups

In [130]: grouped = df.groupby('A')

In [131]: grouped.agg(lambda x: x.std())	# can be shortened to grouped.std()
Out[131]: 
            C         D
A                      
bar  0.181231  1.366330
foo  0.912265  0.884785

In [132]: grouped.std()
Out[132]: 
            C         D
A                      
bar  0.181231  1.366330
foo  0.912265  0.884785

In [133]: tsdf = pd.DataFrame(np.random.randn(1000, 3),
   .....:                     index=pd.date_range('1/1/2000', periods=1000),
   .....:                     columns=['A', 'B', 'C'])

In [134]: tsdf.iloc[::2] = np.nan

In [135]: grouped = tsdf.groupby(lambda x: x.year)

# forward fill within each year
In [136]: grouped.fillna(method='pad')
Out[136]: 
                   A         B         C
2000-01-01       NaN       NaN       NaN	# still NaN: there is nothing earlier in the group to fill from
2000-01-02 -0.353501 -0.080957 -0.876864
2000-01-03 -0.353501 -0.080957 -0.876864
2000-01-04  0.050976  0.044273 -0.559849
2000-01-05  0.050976  0.044273 -0.559849
2000-01-06  0.030091  0.186460 -0.680149
2000-01-07  0.030091  0.186460 -0.680149
...              ...       ...       ...
2002-09-20  2.310215  0.157482 -0.064476
2002-09-21  2.310215  0.157482 -0.064476
2002-09-22  0.005011  0.053897 -1.026922
2002-09-23  0.005011  0.053897 -1.026922
2002-09-24 -0.456542 -1.849051  1.559856
2002-09-25 -0.456542 -1.849051  1.559856
2002-09-26  1.123162  0.354660  1.128135

[1000 rows x 3 columns]

In [137]: s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])

In [138]: g = pd.Series(list('abababab'))

In [139]: gb = s.groupby(g)

# the 3 largest values per group
In [140]: gb.nlargest(3)
Out[140]: 
a  4    19.0
   0     9.0
   2     7.0
b  1     8.0
   3     5.0
   7     3.3
dtype: float64

# the 3 smallest values per group
In [141]: gb.nsmallest(3)
Out[141]: 
a  6    4.2
   2    7.0
   0    9.0
b  5    1.0
   7    3.3
   3    5.0
dtype: float64

VII. Statistics via apply

In [142]: df
Out[142]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [143]: grouped = df.groupby('A')

In [144]: grouped['C'].apply(lambda x: x.describe())
Out[144]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
     50%      0.215897
     75%      0.235029
                ...   
foo  mean    -0.359284
     std      0.912265
     min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64
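
The same summary can also be obtained as a DataFrame with one row per group, which is often easier to read than the long stacked Series above:

grouped['C'].describe()
#      count      mean       std  ...
# A                               
# bar    3.0  0.130980  0.181231  ...
# foo    5.0 -0.359284  0.912265  ...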

In [145]: grouped = df.groupby('A')['C']

In [146]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....: 

In [147]: grouped.apply(f)
Out[147]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211

In [148]: def f(x):
   .....:     return pd.Series([x, x ** 2], index=['x', 'x^2'])
   .....: 

In [149]: s = pd.Series(np.random.rand(5))

In [150]: s
Out[150]: 
0    0.321438
1    0.493496
2    0.139505
3    0.910103
4    0.194158
dtype: float64

In [151]: s.apply(f)
Out[151]: 
          x       x^2
0  0.321438  0.103323
1  0.493496  0.243538
2  0.139505  0.019462
3  0.910103  0.828287
4  0.194158  0.037697

In [152]: d = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [153]: def identity(df):
   .....:     print(df)
   .....:     return df

In [154]: d.groupby("a").apply(identity)    # apply may call the function twice on the first group, hence the duplicate print
   a  b
0  x  1
   a  b
0  x  1
   a  b
1  y  2
Out[154]: 
   a  b
0  x  1
1  y  2

VIII. Other useful features

In [155]: df
Out[155]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [156]: df.groupby('A').std()
Out[156]: 
            C         D
A                      
bar  0.181231  1.366330
foo  0.912265  0.884785

In [157]: from decimal import Decimal

In [158]: df_dec = pd.DataFrame(
   .....:     {'id': [1, 2, 1, 2],
   .....:      'int_column': [1, 2, 3, 4],
   .....:      'dec_column': [Decimal('0.50'), Decimal('0.15'),
   .....:                     Decimal('0.25'), Decimal('0.40')]
   .....:      }
   .....: )

# Decimal columns can be sum'd explicitly by themselves...
In [159]: df_dec.groupby(['id'])[['dec_column']].sum()
Out[159]: 
   dec_column
id           
1        0.75
2        0.55

# ...but cannot be combined with standard data types or they will be excluded
In [160]: df_dec.groupby(['id'])[['int_column', 'dec_column']].sum()
Out[160]: 
    int_column
id            
1            4
2            6

# Use .agg function to aggregate over standard and "nuisance" data types
# at the same time
In [161]: df_dec.groupby(['id']).agg({'int_column': 'sum', 'dec_column': 'sum'})
Out[161]: 
    int_column dec_column
id                       
1            4       0.75
2            6       0.55

IX. Examples

1 Regrouping columns by a factor

In [218]: df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0],
   .....:                    'c': [1, 0, 0], 'd': [2, 3, 4]})
   .....: 

In [219]: df
Out[219]: 
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4

# group the columns by their column sums, then sum across each group of columns
In [220]: df.groupby(df.sum(), axis=1).sum()
Out[220]: 
   1  9
0  2  2
1  1  3
2  0  4

2 Grouping by multiple columns

In [221]: dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})

In [222]: dfg
Out[222]: 
   A  B
0  1  a
1  1  a
2  2  a
3  3  b
4  2  a

In []: dfg.groupby(['A', 'B']).groups
Out[]:
{(1, 'a'): Int64Index([0, 1], dtype='int64'),
 (2, 'a'): Int64Index([2, 4], dtype='int64'),
 (3, 'b'): Int64Index([3], dtype='int64')}

In [223]: dfg.groupby(["A", "B"]).ngroup()	# ngroup assigns an integer label to each distinct group
Out[223]: 
0    0
1    0
2    1
3    2
4    1
dtype: int64

In [224]: dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
Out[224]: 
0    0
1    0
2    1
3    3
4    2
dtype: int64

3 Grouping by the index

In [225]: df = pd.DataFrame(np.random.randn(10, 2))

In [226]: df
Out[226]: 
          0         1
0 -0.793893  0.321153
1  0.342250  1.618906
2 -0.975807  1.918201
3 -0.810847 -1.405919
4 -1.977759  0.461659
5  0.730057 -1.316938
6 -0.751328  0.528290
7 -0.257759 -1.081009
8  0.505895 -1.701948
9 -1.006349  0.020208

In [227]: df.index // 5
Out[227]: Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype='int64')

In [228]: df.groupby(df.index // 5).std()
Out[228]: 
          0         1
0  0.823647  1.312912
1  0.760109  0.942941
# application: perform an operation on every block of n rows (see the sketch below)
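
The same idea works for any block size n; building the key from positions also avoids relying on the index being a clean RangeIndex (a small sketch):

n = 5
df.groupby(np.arange(len(df)) // n).std()   # one std per block of n consecutive rows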

4 Different operations on different columns

In [229]: df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
   .....:                    'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
   .....:                    'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
   .....:                    'd': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]})
   .....: 

In [230]: def compute_metrics(x):
   .....:     result = {'b_sum': x['b'].sum(), 'c_mean': x['c'].mean()}
   .....:     return pd.Series(result, name='metrics')
   .....: 

In [231]: result = df.groupby('a').apply(compute_metrics)

In [232]: result
Out[232]: 
metrics  b_sum  c_mean
a                     
0          2.0     0.5
1          2.0     0.5
2          2.0     0.5

In [233]: result.stack()
Out[233]: 
a  metrics
0  b_sum      2.0
   c_mean     0.5
1  b_sum      2.0
   c_mean     0.5
2  b_sum      2.0
   c_mean     0.5
dtype: float64