pandas-groupby函數使用以及聚合函數使用(比賽中常用的寫法!!)(transform的解釋)


import pandas as pd
 
df = pd.DataFrame({'Country':['China','China', 'India', 'India', 'America', 'Japan', 'China', 'India'], 
                   'Income':[10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                    'Age':[5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
df
Age Country Income
0 5000 China 10000
1 4321 China 10000
2 1234 India 5000
3 4010 India 5002
4 250 America 40000
5 250 Japan 50000
6 4500 China 8000
7 4321 India 5000

主要就是理解下面這句話!!!後面的是一些細節!!

一個基礎的語法!!!

a=df.groupby('Country', as_index = False)['Income'].agg({ 'Income_sum':'sum'})
a
Country Income_sum
0 America 40000
1 China 28000
2 India 15002
3 Japan 50000

上面代碼的意思:統計每一種Conutry下Income的總和!!!

具體起到的作用:
①groupby(‘Country’, as_index = False) :按每一種Country進行統計!!!!!
②[‘Income’]:統計的是每一種Country下Income的相關特徵!!!
③agg函數中 ‘Income_sum’:‘sum’,第一個字段’Income_sum’,其實你也可以寫’sum’,但是如果寫’Income_sum’,pandas會將
“每一個country下的Income總和”這個特徵的特證名添爲’Income_sum’

可以看到的是,這樣造出來的 Income_sum特徵,需要拼接回原來的df上去
也就是

data_merged = pd.merge(df, a, on=['Country'], how='left')
data_merged
Age Country Income Income_sum
0 5000 China 10000 28000
1 4321 China 10000 28000
2 1234 India 5000 15002
3 4010 India 5002 15002
4 250 America 40000 40000
5 250 Japan 50000 50000
6 4500 China 8000 28000
7 4321 India 5000 15002

比賽中經常用到的寫法

df.groupby('Country').Income.transform('sum')
0    28000
1    28000
2    15002
3    15002
4    40000
5    50000
6    28000
7    15002
Name: Income, dtype: int64

可以看到,上面的數據,如第0行,和China的Income總和是一樣的,而
df的第0行,實際上也是China!!,也就是說,他已經按Country字段給merge好了!!!

-------------------------------------------------分割線,下面是原理和細節

a=df.groupby('Country')
a.max()
Age Income
Country
America 250 40000
China 5000 10000
India 4321 5002
Japan 250 50000
a.agg(['min', 'mean', 'max'])
Age Income
min mean max min mean max
Country
America 250 250.000000 250 40000 40000.000000 40000
China 4321 4607.000000 5000 8000 9333.333333 10000
India 1234 3188.333333 4321 5000 5000.666667 5002
Japan 250 250.000000 250 50000 50000.000000 50000

下面這個就是單獨的取了一個字段的聚合結果
下面的結果表示,a[‘Age’]是單獨的一個字段的group,對單獨的一個字段的group我們可以對其聚合進行改名
比如下面 num_agg = {‘p’:[‘min’, ‘mean’, ‘max’]} 其實那個’p’按常規來說應該寫成Age,但是此處這麼寫,就相當於將表中的Age字段直接更名爲’p’了

num_agg = {'p':['min', 'mean', 'max']}
a['Age'].agg(num_agg)############ 只拿出了Age這一列!!!!!,也就是一個字段的時候可以改名
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
p
min mean max
Country
America 250 250.000000 250
China 4321 4607.000000 5000
India 1234 3188.333333 4321
Japan 250 250.000000 250

要注意 a是一個包含兩個字段的(Income和Age)的group,
agg函數不支持對含兩個以上字段的聚合進行名字更改(如下圖)

num_agg = {'Age_1':['min', 'mean', 'max']}
a.agg(num_agg)######  a包含了兩個以上的字段了!
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-7-8131bee46d39> in <module>()
      1 num_agg = {'Age_1':['min', 'mean', 'max']}
----> 2 a.agg(num_agg)######  a包含了兩個以上的字段了!


~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1313     @Appender(_shared_docs['aggregate'])
   1314     def aggregate(self, arg, *args, **kwargs):
-> 1315         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   1316 
   1317     agg = aggregate


~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
    184 
    185         _level = kwargs.pop('_level', None)
--> 186         result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
    187         if how is None:
    188             return result


~/.local/lib/python3.5/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    408                           k not in obj.columns):
    409                         raise KeyError(
--> 410                             "Column '{col}' does not exist!".format(col=k))
    411 
    412                 arg = new_arg


KeyError: "Column 'Age_1' does not exist!"
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
a.agg(num_agg,as_index=False)
Income Age
min max min mean max
Country
America 40000 40000 250 250.000000 250
China 8000 10000 4321 4607.000000 5000
India 5000 5002 1234 3188.333333 4321
Japan 50000 50000 250 250.000000 250
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
a.agg(num_agg,as_index=True)
Income Age
min max min mean max
Country
America 40000 40000 250 250.000000 250
China 8000 10000 4321 4607.000000 5000
India 5000 5002 1234 3188.333333 4321
Japan 50000 50000 250 250.000000 250
num_agg = {'Age':['min', 'mean', 'max']}
a.agg(num_agg,as_index=False)
Age
min mean max
Country
America 250 250.000000 250
China 4321 4607.000000 5000
India 1234 3188.333333 4321
Japan 250 250.000000 250
num_agg = {'Age':['min', 'mean', 'max']}
a['Age'].agg(num_agg,as_index=True)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
Age
min mean max
Country
America 250 250.000000 250
China 4321 4607.000000 5000
India 1234 3188.333333 4321
Japan 250 250.000000 250
num_agg = {'min', 'mean', 'max'}#正規用法!!!!!!
a['Age'].agg(num_agg,as_index=False)
min max mean
Country
America 250 250 250.000000
China 4321 5000 4607.000000
India 1234 4321 3188.333333
Japan 250 250 250.000000
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章