pandas groupby study notes

Notes on groupby in the pandas library

I. Grouping objects

1 A simple example

In [1]: df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
   ...:                    ('bird', 'Psittaciformes', 24.0),
   ...:                    ('mammal', 'Carnivora', 80.2),
   ...:                    ('mammal', 'Primates', np.nan),
   ...:                    ('mammal', 'Carnivora', 58)],
   ...:                   index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
   ...:                   columns=('class', 'order', 'max_speed'))
   ...:

In [2]: df
Out[2]:
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0

In [3]: grouped = df.groupby('class')	# produces two groups; the default is axis=0, i.e. axis='index'

In [4]: grouped = df.groupby('order', axis='columns')	# no output (just an assignment)

In [5]: grouped = df.groupby(['class', 'order'])	# produces four groups

By default groupby splits along the rows. With axis=1 the split is column-wise instead, so each group is a bundle of whole columns:

In [50]: grouped = df.groupby(df.dtypes, axis=1)	# group the columns by dtype; the frame splits into two blocks

In [51]: for i,j in grouped:
    ...:     print(i)
    ...:     print(j)

float64
         max_speed
falcon       389.0
parrot        24.0
lion          80.2
monkey         NaN
leopard       58.0

object
          class           order
falcon     bird   Falconiformes
parrot     bird  Psittaciformes
lion     mammal       Carnivora
monkey   mammal        Primates
leopard  mammal       Carnivora

A few other common operations:

# iterate over the groups to inspect their contents; unpack the key to match its shape
for i,j in grouped:
	print(i)
	print(j)
for (i1,i2),j in grouped:
    print(i1, i2)
    print(j)

# or simply list the groups:
grouped.groups

# sum within each group (only meaningful for columns that can be summed):
In []: df.groupby('class').sum()
Out[]:
        max_speed
class            
bird        413.0
mammal      138.2

# mean, group size, and count:
In []: df.groupby('class').mean()
Out[]:
        max_speed
class            
bird        206.5
mammal       69.1

In []: df.groupby('class').size()
Out[]:
class
bird      2
mammal    3
dtype: int64
    
In []: df.groupby('class').count()	# note the difference between count and size
Out[]:
        order  max_speed
class                   
bird        2          2
mammal      3          2	# because of the NaN value
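The size/count distinction above can be checked directly; a minimal sketch on a frame like the one at the top of these notes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'class': ['bird', 'bird', 'mammal', 'mammal', 'mammal'],
                   'max_speed': [389.0, 24.0, 80.2, np.nan, 58.0]})

sizes = df.groupby('class').size()     # counts rows, NaN included
counts = df.groupby('class').count()   # counts only non-NA cells, column by column
```

So size() reports 3 mammals, while count() reports only 2 non-missing max_speed values for them.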

2 MultiIndex and grouping

In [6]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8),
   ...:                    'D': np.random.randn(8)})

In [7]: df
Out[7]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

In [8]: grouped = df.groupby('A')	# 2 groups

In [9]: grouped = df.groupby(['A', 'B'])	# 6 groups
In [10]: df2 = df.set_index(['A', 'B'])

In []: df2
Out[]:
                  C         D
A   B                        
foo one   -1.209388 -0.309949
bar one   -0.380334 -1.352238
foo two    0.309979 -0.695926
bar three  0.650321  0.965206
foo two    0.809020  1.003307
bar two    0.668484  1.013688
foo one    0.513104  0.079576
    three  1.579055 -0.083461	# repeated outer labels are elided in a MultiIndex display

In [11]: grouped = df2.groupby(level=df2.index.names.difference(['B']))
# equivalent to grouped = df2.groupby(level=0); to group on the inner index 'B', use level=1
# since column 'B' cannot be summed, grouped.sum() also equals df.groupby(['A']).sum()

In [12]: grouped.sum()
Out[12]:
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938
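The claimed equivalence between grouping on an index level and grouping on the original column can be verified on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'two'],
                   'C': [1, 2, 3, 4]})
df2 = df.set_index(['A', 'B'])

by_level = df2.groupby(level='A').sum()    # group on the 'A' index level
by_column = df.groupby('A')[['C']].sum()   # same result from the flat frame
```

Both produce a frame indexed by 'A' with the summed 'C' column.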

3 Grouping with a function and by columns

In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'

In [14]: grouped = df.groupby(get_letter_type, axis=1)
# axis=0 splits the rows into groups, while axis=1 splits the columns instead
# the function is applied to each column label, much like apply

In []: for i,j in grouped:
    print(i)
    print(j)

Out[]:    # the lowercase 'a' is a vowel, so the frame is split into two parts
consonant
       B         C         D
0    one -1.209388 -0.309949
1    one -0.380334 -1.352238
2    two  0.309979 -0.695926
3  three  0.650321  0.965206
4    two  0.809020  1.003307
5    two  0.668484  1.013688
6    one  0.513104  0.079576
7  three  1.579055 -0.083461
vowel
     A
0  foo
1  bar
2  foo
3  bar
4  foo
5  bar
6  foo
7  foo
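Besides a function, the grouping key can also be a dict or Series that maps labels to group names; a small sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
mapping = {'a': 'x', 'b': 'x', 'c': 'y', 'd': 'y'}  # label -> group name
out = s.groupby(mapping).sum()
```

Labels 'a' and 'b' land in group 'x', 'c' and 'd' in group 'y'.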

4 Using level

The following example shows the basic use of level:

In [15]: lst = [1, 2, 3, 1, 2, 3]

In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)

In []: s
Out[]:
1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64

In [17]: grouped = s.groupby(level=0)	# still 3 groups, keyed by the index; in a MultiIndex, level=1 selects the second level

In [18]: grouped.first()	# first row of each group
Out[18]:
1    1
2    2
3    3
dtype: int64

In [19]: grouped.last()    # last row of each group
Out[19]:
1    10
2    20
3    30
dtype: int64

In [20]: grouped.sum()
Out[20]:
1    11
2    22
3    33
dtype: int64
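The remark above about level=1 selecting the second level of a MultiIndex can be illustrated with a tiny two-level Series:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
s = pd.Series([10, 20, 30, 40], index=idx)

inner = s.groupby(level=1).sum()   # group on the second (inner) level
```

The result is keyed by the inner labels 1 and 2, summing across the outer level.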

5 Sorting

By default groupby sorts the group keys in ascending order; this can be turned off.

In [21]: df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})

In [22]: df2.groupby(['X']).sum()
Out[22]:
   Y
X   
A  7
B  3

In [23]: df2.groupby(['X'], sort=False).sum()    # not descending order, but the order in which the keys first appear
Out[23]:
   Y
X   
B  3
A  7
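The effect of sort=False on the key order can be checked directly (it can also be a performance win when sorted output is not needed):

```python
import pandas as pd

df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})

sorted_keys = list(df2.groupby('X').sum().index)                 # ascending
unsorted_keys = list(df2.groupby('X', sort=False).sum().index)   # first-seen order
```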

Within each group, rows are not re-sorted; they keep their original order.

In [24]: df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

In [25]: df3.groupby(['X']).get_group('A')
Out[25]:
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(['X']).get_group('B')
Out[26]:
   X  Y
1  B  4
3  B  2

6 Attributes of a GroupBy object

Grouping by one column, and grouping the columns:

In [27]: df.groupby('A').groups
Out[27]:
{'bar': Int64Index([1, 3, 5], dtype='int64'),
 'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}

In [28]: df.groupby(get_letter_type, axis=1).groups
Out[28]:
{'consonant': Index(['B', 'C', 'D'], dtype='object'),
 'vowel': Index(['A'], dtype='object')}

Grouping by two columns:

In [29]: grouped = df.groupby(['A', 'B'])

In [30]: grouped.groups
Out[30]:
{('bar', 'one'): Int64Index([1], dtype='int64'),
 ('bar', 'three'): Int64Index([3], dtype='int64'),
 ('bar', 'two'): Int64Index([5], dtype='int64'),
 ('foo', 'one'): Int64Index([0, 6], dtype='int64'),
 ('foo', 'three'): Int64Index([7], dtype='int64'),
 ('foo', 'two'): Int64Index([2, 4], dtype='int64')}

In [31]: len(grouped)
Out[31]: 6

Use the Tab key to see the available methods:

In [32]: df
Out[32]:
               height      weight  gender
2000-01-01  42.849980  157.500553    male
2000-01-02  49.607315  177.340407    male
2000-01-03  56.293531  171.524640    male
2000-01-04  48.421077  144.251986  female
2000-01-05  46.556882  152.526206    male
2000-01-06  68.448851  168.272968  female
2000-01-07  70.757698  136.431469    male
2000-01-08  58.909500  176.499753  female
2000-01-09  76.435631  174.094104  female
2000-01-10  45.306120  177.540920    male

In [33]: gb = df.groupby('gender')

In [34]: gb.<TAB>  # noqa: E225, E999
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last
gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform  gb.aggregate  gb.count
gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth
gb.prod       gb.resample   gb.sum        gb.var        gb.apply      gb.cummax     gb.cumsum     gb.fillna
gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size
gb.tail       gb.weight

7 Grouping with a MultiIndex

# build arrays for a two-level index
In [35]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....: 

In [36]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [37]: s = pd.Series(np.random.randn(8), index=index)

# repeated outer labels are elided in the display
In [38]: s
Out[38]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

A simple sum:

In [39]: grouped = s.groupby(level=0)

In [40]: grouped.sum()    # equivalent to s.groupby('first').sum()
Out[40]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

In [41]: s.groupby(level='second').sum()    # equivalent to s.sum(level='second')
Out[41]: 
second
one    0.980950
two    1.991575
dtype: float64

In [42]: s.sum(level='second')
Out[42]: 
second
one    0.980950
two    1.991575
dtype: float64

You can also group and sum on two index levels at once (note that s has been re-created here with a three-level index 'first'/'second'/'third'):

In [43]: s
Out[43]: 
first  second  third
bar    doo     one     -1.131345
               two     -0.089329
baz    bee     one      0.337863
               two     -0.945867
foo    bop     one     -0.932132
               two      1.956030
qux    bop     one      0.017587
               two     -0.016692
dtype: float64

In [44]: s.groupby(level=['first', 'second']).sum()
Out[44]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64
    
In [45]: s.groupby(['first', 'second']).sum()
Out[45]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64

8 Selecting a column from a GroupBy

# apply a different operation to each column
In [53]: grouped = df.groupby(['A'])

In [54]: grouped_C = grouped['C']

In [56]: df['C'].groupby(df['A'])	# the most direct way
Out[56]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f2b486509b0>

II. Selecting a group

# get_group takes the group key seen during iteration
In [60]: grouped.get_group('bar')
Out[60]: 
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526

In [61]: df.groupby(['A', 'B']).get_group(('bar', 'one'))
Out[61]: 
     A    B         C         D
1  bar  one  0.254161  1.511763

III. Aggregation

In [62]: grouped = df.groupby('A')

In [63]: grouped.aggregate(np.sum)	# equivalent to grouped.sum()
Out[63]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [64]: grouped = df.groupby(['A', 'B'])

In [65]: grouped.aggregate(np.sum)
Out[65]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429
    
In [66]: grouped = df.groupby(['A', 'B'], as_index=False)	# as_index=False keeps the group keys as ordinary columns, so the result is a plain DataFrame

In [67]: grouped.aggregate(np.sum)
Out[67]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [68]: df.groupby('A', as_index=False).sum()
Out[68]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

In [70]: grouped.size()
Out[70]: 
A    B    
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64
    
In [71]: grouped.describe()	# describes each group
# related methods: std(), var(), sem(), nth()
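Among the related methods, var() is just std() squared (both use ddof=1 by default); a quick sanity check on made-up values:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'C': [1.0, 3.0, 2.0, 6.0]})
g = df.groupby('A')['C']

v = g.var()   # sample variance (ddof=1)
s = g.std()   # sample standard deviation
```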

1 Applying multiple functions at once

1.1 The same functions on every column

In [72]: grouped = df.groupby('A')

In [73]: grouped['C'].agg([np.sum, np.mean, np.std])	# agg is an alias for aggregate
Out[73]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

In [74]: grouped.agg([np.sum, np.mean, np.std])
Out[74]: 
            C                             D                    
          sum      mean       std       sum      mean       std
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

In [75]: (grouped['C'].agg([np.sum, np.mean, np.std])
   ....:              .rename(columns={'sum': 'foo',
   ....:                               'mean': 'bar',
   ....:                               'std': 'baz'}))
   ....: 
Out[75]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

In [76]: (grouped.agg([np.sum, np.mean, np.std])
   ....:         .rename(columns={'sum': 'foo',
   ....:                          'mean': 'bar',
   ....:                          'std': 'baz'}))
   ....: 
Out[76]: 
            C                             D                    
          foo       bar       baz       foo       bar       baz
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

1.2 Different functions per column

In [77]: grouped.agg({'C': np.sum,
   ....:              'D': lambda x: np.std(x, ddof=1)})
   ....: 
Out[77]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

In [78]: grouped.agg({'C': 'sum', 'D': 'std'})
Out[78]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

In [79]: from collections import OrderedDict

In [80]: grouped.agg({'D': 'std', 'C': 'mean'})
Out[80]: 
            D         C
A                      
bar  1.366330  0.130980
foo  0.884785 -0.359284

In [81]: grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
Out[81]: 
            D         C
A                      
bar  1.366330  0.130980
foo  0.884785 -0.359284
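In pandas 0.25 and later, the dict/OrderedDict pattern above is superseded by named aggregation, which also names the output columns; a sketch assuming pandas >= 0.25:

```python
import pandas as pd

df = pd.DataFrame({'A': ['bar', 'bar', 'foo', 'foo'],
                   'C': [1.0, 3.0, 2.0, 4.0],
                   'D': [5.0, 7.0, 6.0, 8.0]})

# each keyword is an output column name mapped to (input column, aggregation)
out = df.groupby('A').agg(c_sum=('C', 'sum'), d_mean=('D', 'mean'))
```

Column order in the result follows the keyword order, so the OrderedDict trick is no longer needed.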

1.3 Cython-optimized aggregations

Currently only sum, mean, std, and sem go through the optimized Cython code paths.

IV. Transformation

In [84]: index = pd.date_range('10/1/1999', periods=1100)

In [85]: ts = pd.Series(np.random.normal(0.5, 2, 1100), index)

# rolling: window is the look-back length, min_periods the minimum number of observations required
In [86]: ts = ts.rolling(window=100, min_periods=100).mean().dropna()

In [87]: ts.head()
Out[87]: 
2000-01-08    0.779333
2000-01-09    0.778852
2000-01-10    0.786476
2000-01-11    0.782797
2000-01-12    0.798110
Freq: D, dtype: float64

In [88]: ts.tail()
Out[88]: 
2002-09-30    0.660294
2002-10-01    0.631095
2002-10-02    0.673601
2002-10-03    0.709213
2002-10-04    0.719369
Freq: D, dtype: float64

# standardize the series within each year
In [89]: transformed = (ts.groupby(lambda x: x.year)
   ....:                  .transform(lambda x: (x - x.mean()) / x.std()))

In []: transformed
Out[]:
2000-01-08   -0.624080
2000-01-09   -0.763061
2000-01-10   -1.009653
2000-01-11   -0.965821
2000-01-12   -1.227731...

# visualize both series
In [96]: compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})

In [97]: compare.plot()
Out[97]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2b4866c1d0>
# per-year max minus min, broadcast back to every row
In [98]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
Out[98]: 
2000-01-08    0.623893
2000-01-09    0.623893
2000-01-10    0.623893
2000-01-11    0.623893
2000-01-12    0.623893
2000-01-13    0.623893
2000-01-14    0.623893
                ...   
2002-09-28    0.558275
2002-09-29    0.558275
2002-09-30    0.558275
2002-10-01    0.558275
2002-10-02    0.558275
2002-10-03    0.558275
2002-10-04    0.558275
Freq: D, Length: 1001, dtype: float64

# equivalent form (named ts_max/ts_min to avoid shadowing the built-ins max/min)
In [99]: ts_max = ts.groupby(lambda x: x.year).transform('max')
In [100]: ts_min = ts.groupby(lambda x: x.year).transform('min')
In [101]: ts_max - ts_min
Out[101]: 
2000-01-08    0.623893
2000-01-09    0.623893
2000-01-10    0.623893
2000-01-11    0.623893
2000-01-12    0.623893
2000-01-13    0.623893
2000-01-14    0.623893
                ...   
2002-09-28    0.558275
2002-09-29    0.558275
2002-09-30    0.558275
2002-10-01    0.558275
2002-10-02    0.558275
2002-10-03    0.558275
2002-10-04    0.558275
Freq: D, Length: 1001, dtype: float64
# data_df: a 1000-row DataFrame with scattered NaN values (construction not shown)
In [102]: data_df
Out[102]: 
            A         B         C
0    1.539708 -1.166480  0.533026
1    1.302092 -0.505754       NaN
2   -0.371983  1.104803 -0.651520
3   -1.309622  1.118697 -1.161657
4   -1.924296  0.396437  0.812436
5    0.815643  0.367816 -0.469478
6   -0.030651  1.376106 -0.645129
..        ...       ...       ...
993  0.012359  0.554602 -1.976159
994  0.042312 -1.628835  1.013822
995 -0.093110  0.683847 -0.774753
996 -0.185043  1.438572       NaN
997 -0.394469 -0.642343  0.011374
998 -1.174126  1.857148       NaN
999  0.234564  0.517098  0.393534

[1000 rows x 3 columns]

In [103]: countries = np.array(['US', 'UK', 'GR', 'JP'])

In [104]: key = countries[np.random.randint(0, 4, 1000)]

In [105]: grouped = data_df.groupby(key)

# Non-NA count in each group
In [106]: grouped.count()
Out[106]: 
      A    B    C
GR  209  217  189
JP  240  255  217
UK  216  231  193
US  239  250  217

In [107]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))

In [108]: grouped_trans = transformed.groupby(key)

In [109]: grouped.mean()  # original group means
Out[109]: 
           A         B         C
GR -0.098371 -0.015420  0.068053
JP  0.069025  0.023100 -0.077324
UK  0.034069 -0.052580 -0.116525
US  0.058664 -0.020399  0.028603

In [110]: grouped_trans.mean()  # transformation did not change group means
Out[110]: 
           A         B         C
GR -0.098371 -0.015420  0.068053
JP  0.069025  0.023100 -0.077324
UK  0.034069 -0.052580 -0.116525
US  0.058664 -0.020399  0.028603

In [111]: grouped.count()  # original has some missing data points
Out[111]: 
      A    B    C
GR  209  217  189
JP  240  255  217
UK  216  231  193
US  239  250  217

In [112]: grouped_trans.count()  # counts after transformation
Out[112]: 
      A    B    C
GR  228  228  228
JP  267  267  267
UK  247  247  247
US  258  258  258

In [113]: grouped_trans.size()  # Verify non-NA count equals group size
Out[113]: 
GR    228
JP    267
UK    247
US    258
dtype: int64
  • fillna

  • forward fill: ffill/pad

  • backward fill: bfill/backfill

  • lag/lead: shift(1), shift(-1)
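All of these fill/shift methods respect group boundaries, never pulling values across groups; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'v': [1.0, np.nan, np.nan, 4.0]})

filled = df.groupby('g')['v'].ffill()    # forward fill, never crossing groups
lagged = df.groupby('g')['v'].shift(1)   # lag by one row within each group
```

Note that group b's first NaN stays NaN: the 1.0 from group a is never used to fill it.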

1 New syntax for window and resample operations

In [115]: df_re = pd.DataFrame({'A': [1] * 10 + [5] * 10,
   .....:                       'B': np.arange(20)})

In [116]: df_re
Out[116]: 
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
6   1   6
.. ..  ..
13  5  13
14  5  14
15  5  15
16  5  16
17  5  17
18  5  18
19  5  19

[20 rows x 2 columns]

# rolling 4-period mean of B; NaN until 4 observations are available
In [117]: df_re.groupby('A').rolling(4).B.mean()
Out[117]: 
A    
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
         ... 
5  13    11.5
   14    12.5
   15    13.5
   16    14.5
   17    15.5
   18    16.5
   19    17.5
Name: B, Length: 20, dtype: float64
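Notice above that the rolling window restarts at each group boundary; a smaller sketch (window of 2 for brevity) makes this easy to verify:

```python
import pandas as pd

df_re = pd.DataFrame({'A': [1, 1, 1, 5, 5, 5], 'B': [0, 1, 2, 3, 4, 5]})

# result is indexed by (group key, original row label)
rolled = df_re.groupby('A').B.rolling(2).mean()
```

Row 3 is NaN again even though the frame is contiguous, because it is the first row of group 5.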

# expanding (cumulative) sum
In [118]: df_re.groupby('A').expanding().sum()
Out[118]: 
         A      B
A                
1 0    1.0    0.0
  1    2.0    1.0
  2    3.0    3.0
  3    4.0    6.0
  4    5.0   10.0
  5    6.0   15.0
  6    7.0   21.0
...    ...    ...
5 13  20.0   46.0
  14  25.0   60.0
  15  30.0   75.0
  16  35.0   91.0
  17  40.0  108.0
  18  45.0  126.0
  19  50.0  145.0

[20 rows x 2 columns]

In [119]: df_re = pd.DataFrame({'date': pd.date_range(start='2016-01-01', periods=4,
   .....:                                             freq='W'),
   .....:                       'group': [1, 1, 2, 2],
   .....:                       'val': [5, 6, 7, 8]}).set_index('date')

In [120]: df_re
Out[120]: 
            group  val
date                  
2016-01-03      1    5
2016-01-10      1    6
2016-01-17      2    7
2016-01-24      2    8

# resample irregular data onto a regular grid; the rule could also be e.g. '60S'
In [121]: df_re.groupby('group').resample('1D').ffill()
Out[121]: 
                  group  val
group date                  
1     2016-01-03      1    5
      2016-01-04      1    5
      2016-01-05      1    5
      2016-01-06      1    5
      2016-01-07      1    5
      2016-01-08      1    5
      2016-01-09      1    5
...                 ...  ...
2     2016-01-18      2    7
      2016-01-19      2    7
      2016-01-20      2    7
      2016-01-21      2    7
      2016-01-22      2    7
      2016-01-23      2    7
      2016-01-24      2    8

[16 rows x 2 columns]

V. Filtration

In [122]: sf = pd.Series([1, 1, 2, 3, 3, 3])

# keep the groups whose sum is greater than 2
# compare with sf[sf.apply(lambda x: x >= 2)], which filters individual values instead
In [123]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[123]: 
3    3
4    3
5    3
dtype: int64
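The contrast hinted at in the comment is worth spelling out: filter keeps or drops whole groups, while boolean indexing keeps individual values. A side-by-side sketch:

```python
import pandas as pd

sf = pd.Series([1, 1, 2, 3, 3, 3])

by_group = sf.groupby(sf).filter(lambda x: x.sum() > 2)  # whole-group decision
by_value = sf[sf >= 2]                                   # element-wise decision
```

Group 2 has sum 2, so filter drops it entirely, but the lone value 2 survives the element-wise selection.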
    
In [124]: dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})

# keep the groups that contain more than 2 elements
In [125]: dff.groupby('B').filter(lambda x: len(x) > 2)
Out[125]: 
   A  B
2  2  b
3  3  b
4  4  b
5  5  b

In [126]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
Out[126]: 
     A    B
0  NaN  NaN
1  NaN  NaN
2  2.0    b
3  3.0    b
4  4.0    b
5  5.0    b
6  NaN  NaN
7  NaN  NaN

In [127]: dff['C'] = np.arange(8)

In [128]: dff.groupby('B').filter(lambda x: len(x['C']) > 2)
Out[128]: 
   A  B  C
2  2  b  2
3  3  b  3
4  4  b  4
5  5  b  5

# first two rows of each group; tail() works analogously
In [129]: dff.groupby('B').head(2)
Out[129]: 
   A  B  C
0  0  a  0
1  1  a  1
2  2  b  2
3  3  b  3
6  6  c  6
7  7  c  7

VI. Dispatching instance methods to groups

In [130]: grouped = df.groupby('A')

In [131]: grouped.agg(lambda x: x.std())	# can be shortened to grouped.std()
Out[131]: 
            C         D
A                      
bar  0.181231  1.366330
foo  0.912265  0.884785

In [132]: grouped.std()
Out[132]: 
            C         D
A                      
bar  0.181231  1.366330
foo  0.912265  0.884785

In [133]: tsdf = pd.DataFrame(np.random.randn(1000, 3),
   .....:                     index=pd.date_range('1/1/2000', periods=1000),
   .....:                     columns=['A', 'B', 'C'])

In [134]: tsdf.iloc[::2] = np.nan

In [135]: grouped = tsdf.groupby(lambda x: x.year)

# forward-fill within each year
In [136]: grouped.fillna(method='pad')
Out[136]: 
                   A         B         C
2000-01-01       NaN       NaN       NaN	# NaN because there is no earlier value to fill from
2000-01-02 -0.353501 -0.080957 -0.876864
2000-01-03 -0.353501 -0.080957 -0.876864
2000-01-04  0.050976  0.044273 -0.559849
2000-01-05  0.050976  0.044273 -0.559849
2000-01-06  0.030091  0.186460 -0.680149
2000-01-07  0.030091  0.186460 -0.680149
...              ...       ...       ...
2002-09-20  2.310215  0.157482 -0.064476
2002-09-21  2.310215  0.157482 -0.064476
2002-09-22  0.005011  0.053897 -1.026922
2002-09-23  0.005011  0.053897 -1.026922
2002-09-24 -0.456542 -1.849051  1.559856
2002-09-25 -0.456542 -1.849051  1.559856
2002-09-26  1.123162  0.354660  1.128135

[1000 rows x 3 columns]

In [137]: s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])

In [138]: g = pd.Series(list('abababab'))

In [139]: gb = s.groupby(g)

# 3 largest values per group
In [140]: gb.nlargest(3)
Out[140]: 
a  4    19.0
   0     9.0
   2     7.0
b  1     8.0
   3     5.0
   7     3.3
dtype: float64

# 3 smallest values per group
In [141]: gb.nsmallest(3)
Out[141]: 
a  6    4.2
   2    7.0
   0    9.0
b  5    1.0
   7    3.3
   3    5.0
dtype: float64

VII. Statistics

In [142]: df
Out[142]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [143]: grouped = df.groupby('A')

In [144]: grouped['C'].apply(lambda x: x.describe())
Out[144]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
     50%      0.215897
     75%      0.235029
                ...   
foo  mean    -0.359284
     std      0.912265
     min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

In [145]: grouped = df.groupby('A')['C']

In [146]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....: 

In [147]: grouped.apply(f)
Out[147]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
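The same demeaning is usually written with transform, which always returns an object of the original shape; a sketch with small made-up values:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'], 'C': [1.0, 3.0, 2.0, 6.0]})

demeaned = df.groupby('A')['C'].transform(lambda g: g - g.mean())
```

After demeaning, every group's mean is exactly zero.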

In [148]: def f(x):
   .....:     return pd.Series([x, x ** 2], index=['x', 'x^2'])
   .....: 

In [149]: s = pd.Series(np.random.rand(5))

In [150]: s
Out[150]: 
0    0.321438
1    0.493496
2    0.139505
3    0.910103
4    0.194158
dtype: float64

In [151]: s.apply(f)
Out[151]: 
          x       x^2
0  0.321438  0.103323
1  0.493496  0.243538
2  0.139505  0.019462
3  0.910103  0.828287
4  0.194158  0.037697

In [152]: d = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [153]: def identity(df):
   .....:     print(df)
   .....:     return df

# apply may call the function twice on the first group (to choose a code path), hence the duplicated print
In [154]: d.groupby("a").apply(identity)
   a  b
0  x  1
   a  b
0  x  1
   a  b
1  y  2
Out[154]: 
   a  b
0  x  1
1  y  2

VIII. Other useful features

In [155]: df
Out[155]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [156]: df.groupby('A').std()
Out[156]: 
            C         D
A                      
bar  0.181231  1.366330
foo  0.912265  0.884785

In [157]: from decimal import Decimal

In [158]: df_dec = pd.DataFrame(
   .....:     {'id': [1, 2, 1, 2],
   .....:      'int_column': [1, 2, 3, 4],
   .....:      'dec_column': [Decimal('0.50'), Decimal('0.15'),
   .....:                     Decimal('0.25'), Decimal('0.40')]
   .....:      }
   .....: )

# Decimal columns can be summed explicitly by themselves...
In [159]: df_dec.groupby(['id'])[['dec_column']].sum()
Out[159]: 
   dec_column
id           
1        0.75
2        0.55

# ...but cannot be combined with standard data types or they will be excluded
In [160]: df_dec.groupby(['id'])[['int_column', 'dec_column']].sum()
Out[160]: 
    int_column
id            
1            4
2            6

# Use .agg function to aggregate over standard and "nuisance" data types
# at the same time
In [161]: df_dec.groupby(['id']).agg({'int_column': 'sum', 'dec_column': 'sum'})
Out[161]: 
    int_column dec_column
id                       
1            4       0.75
2            6       0.55

IX. Examples

1 Regrouping by factor

In [218]: df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0],
   .....:                    'c': [1, 0, 0], 'd': [2, 3, 4]})
   .....: 

In [219]: df
Out[219]: 
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4

# group the columns by their column sums, then sum across each row within each group
In [220]: df.groupby(df.sum(), axis=1).sum()
Out[220]: 
   1  9
0  2  2
1  1  3
2  0  4
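The same regrouping can be written without axis=1 via a transpose round-trip; an equivalent sketch (df.sum() gives column sums a=1, b=1, c=1, d=9, so columns fall into groups 1 and 9):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0],
                   'c': [1, 0, 0], 'd': [2, 3, 4]})

# transpose, group the (former) columns by their sums, sum, transpose back
out = df.T.groupby(df.sum()).sum().T
```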

2 Grouping by multiple columns

In [221]: dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})

In [222]: dfg
Out[222]: 
   A  B
0  1  a
1  1  a
2  2  a
3  3  b
4  2  a

In []: dfg.groupby(['A', 'B']).groups
Out[]:
{(1, 'a'): Int64Index([0, 1], dtype='int64'),
 (2, 'a'): Int64Index([2, 4], dtype='int64'),
 (3, 'b'): Int64Index([3], dtype='int64')}

In [223]: dfg.groupby(["A", "B"]).ngroup()	# ngroup assigns a numeric label to each distinct group
Out[223]: 
0    0
1    0
2    1
3    2
4    1
dtype: int64

In [224]: dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
Out[224]: 
0    0
1    0
2    1
3    3
4    2
dtype: int64

3 Grouping by index

In [225]: df = pd.DataFrame(np.random.randn(10, 2))

In [226]: df
Out[226]: 
          0         1
0 -0.793893  0.321153
1  0.342250  1.618906
2 -0.975807  1.918201
3 -0.810847 -1.405919
4 -1.977759  0.461659
5  0.730057 -1.316938
6 -0.751328  0.528290
7 -0.257759 -1.081009
8  0.505895 -1.701948
9 -1.006349  0.020208

In [227]: df.index // 5
Out[227]: Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype='int64')

In [228]: df.groupby(df.index // 5).std()
Out[228]: 
          0         1
0  0.823647  1.312912
1  0.760109  0.942941
# application: perform an operation on every n rows
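The every-n-rows pattern generalizes to any chunked aggregation; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v': np.arange(10)})

# rows 0-4 get key 0, rows 5-9 get key 1
chunk_means = df.groupby(df.index // 5)['v'].mean()
```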

4 Group-wise operations on different columns

In [229]: df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
   .....:                    'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
   .....:                    'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
   .....:                    'd': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]})
   .....: 

In [230]: def compute_metrics(x):
   .....:     result = {'b_sum': x['b'].sum(), 'c_mean': x['c'].mean()}
   .....:     return pd.Series(result, name='metrics')
   .....: 

In [231]: result = df.groupby('a').apply(compute_metrics)

In [232]: result
Out[232]: 
metrics  b_sum  c_mean
a                     
0          2.0     0.5
1          2.0     0.5
2          2.0     0.5

In [233]: result.stack()
Out[233]: 
a  metrics
0  b_sum      2.0
   c_mean     0.5
1  b_sum      2.0
   c_mean     0.5
2  b_sum      2.0
   c_mean     0.5
dtype: float64