import pandas as pd
df = pd. DataFrame( { 'Country' : [ 'China' , 'China' , 'India' , 'India' , 'America' , 'Japan' , 'China' , 'India' ] ,
'Income' : [ 10000 , 10000 , 5000 , 5002 , 40000 , 50000 , 8000 , 5000 ] ,
'Age' : [ 5000 , 4321 , 1234 , 4010 , 250 , 250 , 4500 , 4321 ] } )
df
Age
Country
Income
0
5000
China
10000
1
4321
China
10000
2
1234
India
5000
3
4010
India
5002
4
250
America
40000
5
250
Japan
50000
6
4500
China
8000
7
4321
India
5000
主要就是理解下面這句話!!!後面的是一些細節!!
一個基礎的語法!!!
a= df. groupby( 'Country' , as_index = False ) [ 'Income' ] . agg( { 'Income_sum' : 'sum' } )
a
Country
Income_sum
0
America
40000
1
China
28000
2
India
15002
3
Japan
50000
上面代碼的意思:統計每一種Conutry下Income的總和!!!
具體起到的作用:
①groupby(‘Country’, as_index = False) :按每一種Country進行統計!!!!!
②[‘Income’]:統計的是每一種Country下Income的相關特徵!!!
③agg函數中 ‘Income_sum’:‘sum’,第一個字段’Income_sum’,其實你也可以寫’sum’,但是如果寫’Income_sum’,pandas會將
“每一個country下的Income總和”這個特徵的特證名添爲’Income_sum’
可以看到的是,這樣造出來的 Income_sum特徵,需要拼接回原來的df上去
也就是
data_merged = pd. merge( df, a, on= [ 'Country' ] , how= 'left' )
data_merged
Age
Country
Income
Income_sum
0
5000
China
10000
28000
1
4321
China
10000
28000
2
1234
India
5000
15002
3
4010
India
5002
15002
4
250
America
40000
40000
5
250
Japan
50000
50000
6
4500
China
8000
28000
7
4321
India
5000
15002
比賽中經常用到的寫法
df. groupby( 'Country' ) . Income. transform( 'sum' )
0 28000
1 28000
2 15002
3 15002
4 40000
5 50000
6 28000
7 15002
Name: Income, dtype: int64
可以看到,上面的數據,如第0行,和China的Income總和是一樣的,而
df的第0行,實際上也是China!!,也就是說,他已經按Country字段給merge好了!!!
-------------------------------------------------分割線,下面是原理和細節
a= df. groupby( 'Country' )
a. max ( )
Age
Income
Country
America
250
40000
China
5000
10000
India
4321
5002
Japan
250
50000
a. agg( [ 'min' , 'mean' , 'max' ] )
Age
Income
min
mean
max
min
mean
max
Country
America
250
250.000000
250
40000
40000.000000
40000
China
4321
4607.000000
5000
8000
9333.333333
10000
India
1234
3188.333333
4321
5000
5000.666667
5002
Japan
250
250.000000
250
50000
50000.000000
50000
下面這個就是單獨的取了一個字段的聚合結果
下面的結果表示,a[‘Age’]是單獨的一個字段的group,對單獨的一個字段的group我們可以對其聚合進行改名
比如下面 num_agg = {‘p’:[‘min’, ‘mean’, ‘max’]} 其實那個’p’按常規來說應該寫成Age,但是此處這麼寫,就相當於將表中的Age字段直接更名爲’p’了
num_agg = { 'p' : [ 'min' , 'mean' , 'max' ] }
a[ 'Age' ] . agg( num_agg)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
p
min
mean
max
Country
America
250
250.000000
250
China
4321
4607.000000
5000
India
1234
3188.333333
4321
Japan
250
250.000000
250
要注意 a是一個包含兩個字段的(Income和Age)的group,
agg函數不支持對含兩個以上字段的聚合進行名字更改(如下圖)
num_agg = { 'Age_1' : [ 'min' , 'mean' , 'max' ] }
a. agg( num_agg)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-7-8131bee46d39> in <module>()
1 num_agg = {'Age_1':['min', 'mean', 'max']}
----> 2 a.agg(num_agg)###### a包含了兩個以上的字段了!
~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
1313 @Appender(_shared_docs['aggregate'])
1314 def aggregate(self, arg, *args, **kwargs):
-> 1315 return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
1316
1317 agg = aggregate
~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
184
185 _level = kwargs.pop('_level', None)
--> 186 result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
187 if how is None:
188 return result
~/.local/lib/python3.5/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
408 k not in obj.columns):
409 raise KeyError(
--> 410 "Column '{col}' does not exist!".format(col=k))
411
412 arg = new_arg
KeyError: "Column 'Age_1' does not exist!"
num_agg = { 'Age' : [ 'min' , 'mean' , 'max' ] , 'Income' : [ 'min' , 'max' ] }
a. agg( num_agg, as_index= False )
Income
Age
min
max
min
mean
max
Country
America
40000
40000
250
250.000000
250
China
8000
10000
4321
4607.000000
5000
India
5000
5002
1234
3188.333333
4321
Japan
50000
50000
250
250.000000
250
num_agg = { 'Age' : [ 'min' , 'mean' , 'max' ] , 'Income' : [ 'min' , 'max' ] }
a. agg( num_agg, as_index= True )
Income
Age
min
max
min
mean
max
Country
America
40000
40000
250
250.000000
250
China
8000
10000
4321
4607.000000
5000
India
5000
5002
1234
3188.333333
4321
Japan
50000
50000
250
250.000000
250
num_agg = { 'Age' : [ 'min' , 'mean' , 'max' ] }
a. agg( num_agg, as_index= False )
Age
min
mean
max
Country
America
250
250.000000
250
China
4321
4607.000000
5000
India
1234
3188.333333
4321
Japan
250
250.000000
250
num_agg = { 'Age' : [ 'min' , 'mean' , 'max' ] }
a[ 'Age' ] . agg( num_agg, as_index= True )
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
Age
min
mean
max
Country
America
250
250.000000
250
China
4321
4607.000000
5000
India
1234
3188.333333
4321
Japan
250
250.000000
250
num_agg = { 'min' , 'mean' , 'max' }
a[ 'Age' ] . agg( num_agg, as_index= False )
min
max
mean
Country
America
250
250
250.000000
China
4321
5000
4607.000000
India
1234
4321
3188.333333
Japan
250
250
250.000000