Chapter 3: Grouping

import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv',index_col='ID')
df.head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+


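I. The SAC Process

1. Meaning

SAC stands for the split-apply-combine process that underlies group operations.

Split divides the data into groups according to some rule, apply runs a function on each group independently, and combine assembles the per-group results into a single data structure.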
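
The three steps can be spelled out by hand on a small table. This is only a sketch on hypothetical toy data (not the chapter's table.csv); the names toy, pieces, means, manual and auto are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the chapter's table.csv
toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
                    'Math': [34.0, 87.2, 83.3, 50.6, 52.5]})

# split: one sub-frame per distinct School value
pieces = {key: toy[toy['School'] == key] for key in toy['School'].unique()}

# apply: run the same function on every piece independently
means = {key: piece['Math'].mean() for key, piece in pieces.items()}

# combine: collect the per-group results into one Series
manual = pd.Series(means, name='Math').rename_axis('School')

# groupby performs the same three steps in one call
auto = toy.groupby('School')['Math'].mean()
```

Both manual and auto hold one mean per school; groupby simply hides the loop.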

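2. The apply step

In practice, the apply step usually falls into one of four kinds of task:

Aggregation: compute a statistic for each group (for example the group mean or the number of elements in each group)

Transformation: operate on every element within each group (for example standardizing the values)

Filtration: keep or drop whole groups according to some rule (for example selecting the groups in which some metric is below 50)

Combined problems: a mixture of the three kinds above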
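
A quick way to tell the first three kinds apart is to look at the shape of what comes back. The sketch below uses hypothetical toy data and an arbitrary threshold of 61 chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
                    'Math': [34.0, 87.2, 83.3, 50.6, 52.5]})
g = toy.groupby('School')['Math']

# Aggregation: one value per group
agg_out = g.mean()
# Transformation: one value per original row
trans_out = g.transform(lambda s: s - s.mean())
# Filtration: all rows of the groups whose function result is True
filt_out = toy.groupby('School').filter(lambda d: d['Math'].mean() > 61)
```

Here agg_out has 2 entries, trans_out has 5, and filt_out keeps only the 3 rows of S_2.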

II. The groupby Function

1. Basics of grouping:

(a) Grouping by a single column

grouped_single = df.groupby('School')
display(grouped_single)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026FFBC23DA0>

Calling groupby produces a groupby object. The object itself computes nothing; work is done only when one of its methods is called.

For example, to retrieve a single group:

grouped_single.get_group('S_1')

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1204	S_1	C_2	F	street_5	162	63	33.8	B
1205	S_1	C_2	F	street_6	167	63	68.4	B-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1302	S_1	C_3	F	street_1	175	57	87.7	A-
1303	S_1	C_3	M	street_7	188	82	49.7	B
1304	S_1	C_3	M	street_2	195	70	85.2	A
1305	S_1	C_3	F	street_5	187	69	61.7	B-

(b) Grouping by several columns

grouped_mul = df.groupby(['School','Class'])
grouped_mul.get_group(('S_2','C_1'))

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2101	S_2	C_1	M	street_7	174	84	83.3	C
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2104	S_2	C_1	F	street_5	159	97	72.2	B+
2105	S_2	C_1	M	street_4	170	81	34.2	A

(c) Group sizes and number of groups

grouped_single.size()

School
S_1    15
S_2    20
dtype: int64

grouped_mul.size()

School  Class
S_1     C_1      5
        C_2      5
        C_3      5
S_2     C_1      5
        C_2      5
        C_3      5
        C_4      5
dtype: int64

grouped_single.ngroups  # 2, one per school listed by size() above

grouped_mul.ngroups  # 7, one per (School, Class) pair listed by size() above

(d) Iterating over groups

for name,group in grouped_single:
    print(name)
    display(group.head())

S_1

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

S_2

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2101	S_2	C_1	M	street_7	174	84	83.3	C
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2104	S_2	C_1	F	street_5	159	97	72.2	B+
2105	S_2	C_1	M	street_4	170	81	34.2	A

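(e) The level parameter (for a MultiIndex) and the axis parameter

# level=1 groups by the second index level (School); axis=0 groups rows
df.set_index(['Gender','School']).groupby(level=1,axis=0).get_group('S_1').head()
# level=0 groups by the first index level (Gender); this call produces the output below
df.set_index(['Gender','School']).groupby(level=0).get_group('M')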

		Class	Address	Height	Weight	Math	Physics
Gender	School
M	S_1	C_1	street_1	173	63	34.0	A+
	S_1	C_1	street_2	186	82	87.2	B+
	S_1	C_2	street_5	188	68	97.0	A-
	S_1	C_2	street_6	160	53	58.8	A+
	S_1	C_3	street_4	161	68	31.5	B+
	S_1	C_3	street_7	188	82	49.7	B
	S_1	C_3	street_2	195	70	85.2	A
	S_2	C_1	street_7	174	84	83.3	C
	S_2	C_1	street_4	157	61	52.5	B-
	S_2	C_1	street_4	170	81	34.2	A
	S_2	C_2	street_5	193	100	39.1	B
	S_2	C_2	street_4	155	91	73.8	A+
	S_2	C_2	street_1	175	74	47.2	B-
	S_2	C_3	street_5	171	88	32.7	A
	S_2	C_3	street_4	187	73	48.9	B
	S_2	C_4	street_7	166	82	48.7	B

2. Characteristics of groupby objects

(a) Listing all callable methods

As the listing shows, a groupby object exposes a large number of methods, which makes it very flexible

print([attr for attr in dir(grouped_single) if not attr.startswith('_')])

['Address', 'Class', 'Gender', 'Height', 'Math', 'Physics', 'School', 'Weight', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']

(b) head and first on a groupby object

Calling head on a groupby object returns the first few rows of each group, not the first few rows of the whole dataset

grouped_single.head(1)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
2101	S_2	C_1	M	street_7	174	84	83.3	C

first returns, for each group, the first non-null value of every column, with the group key as the index

grouped_single.first()

	Class	Gender	Address	Height	Weight	Math	Physics
School
S_1	C_1	M	street_1	173	63	34.0	A+
S_2	C_1	M	street_7	174	84	83.3	C
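
The two methods look alike here because the data has no missing values. The sketch below, on hypothetical toy data with one NaN, shows where they differ: head(1) returns whole rows, while first looks for the first non-null entry column by column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2'],
                    'Math': [np.nan, 87.2, 83.3, 50.6]})
g = toy.groupby('School')

# head(1): the first row of each group, original row labels kept, NaN included
first_rows = g.head(1)
# first(): indexed by School; the NaN in S_1 is skipped, so Math is 87.2
first_vals = g.first()
```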

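(c) Grouping keys

groupby is very flexible about the grouping key: any list-like with the same length as the DataFrame will do, and a function can be used as the key as well

df.groupby(np.random.choice(['a','b','c'],df.shape[0])).get_group('a')
# Equivalent to adding np.random.choice(['a','b','c'],df.shape[0]) as a new column and grouping by it
# Each call to np.random.choice draws a fresh random labelling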
print(np.random.choice(['a','b','c'],df.shape[0]))
a=df.groupby(np.random.choice(['a','b','c'],df.shape[0]))

for name ,group in a:
    print(name )
    display(group)

a.size()

['a' 'b' 'b' 'a' 'c' 'b' 'c' 'b' 'b' 'b' 'b' 'c' 'c' 'a' 'a' 'b' 'b' 'a'
 'c' 'b' 'b' 'c' 'c' 'a' 'b' 'a' 'a' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'a']
a

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1203	S_1	C_2	M	street_6	160	53	58.8	A+
2101	S_2	C_1	M	street_7	174	84	83.3	C
2105	S_2	C_1	M	street_4	170	81	34.2	A
2301	S_2	C_3	F	street_4	157	78	72.3	B+
2304	S_2	C_3	F	street_6	164	81	95.5	A-
2402	S_2	C_4	M	street_7	166	82	48.7	B
2403	S_2	C_4	F	street_6	158	60	59.7	B+
2404	S_2	C_4	F	street_2	160	84	67.7	B
2405	S_2	C_4	F	street_6	193	54	47.6	B

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1204	S_1	C_2	F	street_5	162	63	33.8	B
1303	S_1	C_3	M	street_7	188	82	49.7	B
1304	S_1	C_3	M	street_2	195	70	85.2	A
1305	S_1	C_3	F	street_5	187	69	61.7	B-
2201	S_2	C_2	M	street_5	193	100	39.1	B
2204	S_2	C_2	M	street_1	175	74	47.2	B-
2303	S_2	C_3	F	street_7	190	99	65.9	C

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1205	S_1	C_2	F	street_6	167	63	68.4	B-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1302	S_1	C_3	F	street_1	175	57	87.7	A-
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2104	S_2	C_1	F	street_5	159	97	72.2	B+
2202	S_2	C_2	F	street_7	194	77	68.5	B+
2203	S_2	C_2	M	street_4	155	91	73.8	A+
2205	S_2	C_2	F	street_7	183	76	85.4	B
2302	S_2	C_3	M	street_5	171	88	32.7	A
2305	S_2	C_3	M	street_4	187	73	48.9	B
2401	S_2	C_4	F	street_2	192	62	45.3	A

a    12
b     8
c    15
dtype: int64

When a function is used as the key, it is called with each index label in turn, and this makes more elaborate groupings possible
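
As a small illustration of that point, the sketch below groups hypothetical toy rows by the leading digit of their ID. The function only ever receives the index label, never the row.

```python
import pandas as pd

toy = pd.DataFrame({'Math': [34.0, 32.5, 83.3, 50.6]},
                   index=pd.Index([1101, 1102, 2101, 2102], name='ID'))

# The key function sees each index label (an int here) and returns its group key
by_prefix = toy.groupby(lambda label: label // 1000)['Math'].mean()
```

IDs 1101 and 1102 fall into group 1, and 2101 and 2102 into group 2.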

df[:5].groupby(lambda x:print(x)).head(5)
a=df[:5].groupby(pd.Series([2,1,1,4,5],index=[1105,1104,1103,1102,1101]))
# a.size()
for name ,group in a:
    print(name )
    display(group)
# df[:5].groupby(lambda x:x*2).head(5)

display(df.iloc[0:5,0:5])
b=df.iloc[0:5,0:5].groupby([1,1,1,2,1],axis=1)
for name ,group in b:
    print(name )
    display(group)

	School	Class	Gender	Address	Height
ID
1101	S_1	C_1	M	street_1	173
1102	S_1	C_1	F	street_2	192
1103	S_1	C_1	M	street_2	186
1104	S_1	C_1	F	street_2	167
1105	S_1	C_1	F	street_4	159

	School	Class	Gender	Height
ID
1101	S_1	C_1	M	173
1102	S_1	C_1	F	192
1103	S_1	C_1	M	186
1104	S_1	C_1	F	167
1105	S_1	C_1	F	159

	Address
ID
1101	street_1
1102	street_2
1103	street_2
1104	street_2
1105	street_4

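Grouping by odd and even rows

# get_loc returns the 0-based position of a label, e.g. 1 for ID 1102
df.index.get_loc(1102)

# First grouping: by row position counted from 1 ('奇數行' = odd rows, '偶數行' = even rows)
display(df.groupby(lambda x:'奇數行'  if not df.index.get_loc(x)%2==1 else '偶數行').groups)
# Second grouping: by parity of the ID itself ('奇數ID行' = odd ID, '偶數ID行' = even ID)
df.groupby(lambda x:'奇數ID行' if  x%2==1 else '偶數ID行').groups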

{'偶數行': Int64Index([1102, 1104, 1201, 1203, 1205, 1302, 1304, 2101, 2103, 2105, 2202,
             2204, 2301, 2303, 2305, 2402, 2404],
            dtype='int64', name='ID'),
 '奇數行': Int64Index([1101, 1103, 1105, 1202, 1204, 1301, 1303, 1305, 2102, 2104, 2201,
             2203, 2205, 2302, 2304, 2401, 2403, 2405],
            dtype='int64', name='ID')}





{'偶數ID行': Int64Index([1102, 1104, 1202, 1204, 1302, 1304, 2102, 2104, 2202, 2204, 2302,
             2304, 2402, 2404],
            dtype='int64', name='ID'),
 '奇數ID行': Int64Index([1101, 1103, 1105, 1201, 1203, 1205, 1301, 1303, 1305, 2101, 2103,
             2105, 2201, 2203, 2205, 2301, 2303, 2305, 2401, 2403, 2405],
            dtype='int64', name='ID')}

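With a MultiIndex, the lambda receives each index entry as a tuple. The code below checks, for boys and girls at each of the two schools, whether the average Math score is a pass (at least 60)

Note: this only demonstrates what groupby can do; you would not write it this way in practice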

df.set_index(['Gender','School']).head()

		Class	Address	Height	Weight	Math	Physics
Gender	School
M	S_1	C_1	street_1	173	63	34.0	A+
F	S_1	C_1	street_2	192	73	32.5	B+
M	S_1	C_1	street_2	186	82	87.2	B+
F	S_1	C_1	street_2	167	81	80.4	B-
F	S_1	C_1	street_4	159	64	84.8	B+

df.set_index(['Gender','School']).sort_index().groupby(lambda x:print(x))

('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')





<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026FFBE152E8>

math_score = df.set_index(['Gender','School'])['Math'].sort_index()
# The key is a tuple: the (Gender, School) index entry plus a label
# ('均分及格' = average passes, '均分不及格' = average fails)
grouped_score = df.set_index(['Gender','School']).sort_index().\
            groupby(lambda x:(x,'均分及格' if math_score[x].mean()>=60 else '均分不及格'))
for name,_ in grouped_score:print(name)
for name ,group in grouped_score:
    print(name)
    display(group)

(('F', 'S_1'), '均分及格')
(('F', 'S_2'), '均分及格')
(('M', 'S_1'), '均分及格')
(('M', 'S_2'), '均分不及格')
(('F', 'S_1'), '均分及格')

		Class	Address	Height	Weight	Math	Physics
Gender	School
F	S_1	C_1	street_2	192	73	32.5	B+
	S_1	C_1	street_2	167	81	80.4	B-
	S_1	C_1	street_4	159	64	84.8	B+
	S_1	C_2	street_4	176	94	63.5	B-
	S_1	C_2	street_5	162	63	33.8	B
	S_1	C_2	street_6	167	63	68.4	B-
	S_1	C_3	street_1	175	57	87.7	A-
	S_1	C_3	street_5	187	69	61.7	B-

(('F', 'S_2'), '均分及格')

		Class	Address	Height	Weight	Math	Physics
Gender	School
F	S_2	C_1	street_6	161	61	50.6	B+
	S_2	C_1	street_5	159	97	72.2	B+
	S_2	C_2	street_7	194	77	68.5	B+
	S_2	C_2	street_7	183	76	85.4	B
	S_2	C_3	street_4	157	78	72.3	B+
	S_2	C_3	street_7	190	99	65.9	C
	S_2	C_3	street_6	164	81	95.5	A-
	S_2	C_4	street_2	192	62	45.3	A
	S_2	C_4	street_6	158	60	59.7	B+
	S_2	C_4	street_2	160	84	67.7	B
	S_2	C_4	street_6	193	54	47.6	B

(('M', 'S_1'), '均分及格')

		Class	Address	Height	Weight	Math	Physics
Gender	School
M	S_1	C_1	street_1	173	63	34.0	A+
	S_1	C_1	street_2	186	82	87.2	B+
	S_1	C_2	street_5	188	68	97.0	A-
	S_1	C_2	street_6	160	53	58.8	A+
	S_1	C_3	street_4	161	68	31.5	B+
	S_1	C_3	street_7	188	82	49.7	B
	S_1	C_3	street_2	195	70	85.2	A

(('M', 'S_2'), '均分不及格')

		Class	Address	Height	Weight	Math	Physics
Gender	School
M	S_2	C_1	street_7	174	84	83.3	C
	S_2	C_1	street_4	157	61	52.5	B-
	S_2	C_1	street_4	170	81	34.2	A
	S_2	C_2	street_5	193	100	39.1	B
	S_2	C_2	street_4	155	91	73.8	A+
	S_2	C_2	street_1	175	74	47.2	B-
	S_2	C_3	street_5	171	88	32.7	A
	S_2	C_3	street_4	187	73	48.9	B
	S_2	C_4	street_7	166	82	48.7	B

math_score.tail()
print(math_score[('M','S_2')])
math_score[('M', 'S_2')].mean()

(M, S_2)    83.3
(M, S_2)    52.5
(M, S_2)    34.2
(M, S_2)    39.1
(M, S_2)    73.8
(M, S_2)    47.2
(M, S_2)    32.7
(M, S_2)    48.9
(M, S_2)    48.7
Name: Math, dtype: float64





51.155555555555544

(d) Selecting columns with []

[] selects one or several columns from a groupby object, so the average-score comparison above can be written concisely as:

df.groupby(['Gender','School'])['Math'].mean()>=60

Gender  School
F       S_1        True
        S_2        True
M       S_1        True
        S_2       False
Name: Math, dtype: bool

A list selects several columns:

df.groupby(['Gender','School'])[['Math','Height']].mean()

		Math	Height
Gender	School
F	S_1	64.100000	173.125000
F	S_2	66.427273	173.727273
M	S_1	63.342857	178.714286
M	S_2	51.155556	172.000000

(e) Grouping by a continuous variable

For example, use cut to bin the Math scores:

bins = [0,40,60,80,90,100]
cuts = pd.cut(df['Math'],bins=bins) # the optional labels argument assigns custom labels
df.groupby(cuts)['Math'].count()

Math
(0, 40]       7
(40, 60]     10
(60, 80]      9
(80, 90]      7
(90, 100]     2
dtype: int64
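
The labels argument mentioned in the comment can be sketched as follows. The label names and the toy scores are hypothetical; the bin edges are the ones used above.

```python
import pandas as pd

scores = pd.Series([34.0, 32.5, 87.2, 80.4, 97.0, 58.8], name='Math')
bins = [0, 40, 60, 80, 90, 100]
# Hypothetical label names; there must be exactly one label per interval
labels = ['fail', 'weak', 'pass', 'good', 'excellent']
bands = pd.cut(scores, bins=bins, labels=labels)

# observed=False keeps empty categories in the result with a count of 0
counts = scores.groupby(bands, observed=False).count()
```

Intervals are closed on the right by default, so 80.4 lands in (80, 90] and the (60, 80] band stays empty in this toy sample.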

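III. Aggregation, Filtration and Transformation

1. Aggregation

(a) Common aggregation functions

Aggregation reduces a collection of numbers to a single scalar, so mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max are all aggregation functions

To get familiar with the operations, we can verify the standard error of the mean, sem, which is computed as $\frac{\text{within-group standard deviation}}{\sqrt{\text{group size}}}$: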

group_m = grouped_single['Math']
display(group_m.std().values/np.sqrt(group_m.count().values)== group_m.sem().values)

group_m=grouped_single['Math']
display(group_m.std().values)
# np.sqrt()
display(np.sqrt(group_m.count().values))
# group_m.head()
group_m.std().values/np.sqrt(group_m.count().values)==group_m.sem().values

array([ True,  True])



array([23.07747407, 17.58930521])



array([3.87298335, 4.47213595])





array([ True,  True])
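
Exact == comparison of floating-point results happens to succeed here, but it can fail on other data because of rounding. A slightly more robust check, sketched on hypothetical toy data, uses np.allclose instead.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'School': ['S_1'] * 3 + ['S_2'] * 4,
                    'Math': [34.0, 32.5, 87.2, 83.3, 50.6, 52.5, 72.2]})
g = toy.groupby('School')['Math']

# Sample standard deviation (ddof=1) divided by the square root of the group size
manual_sem = g.std() / np.sqrt(g.count())
# allclose tolerates tiny rounding differences that == would report as False
same = np.allclose(manual_sem, g.sem())
```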

(b) Using several aggregation functions at once

group_m.agg(['sum','mean','std','sem','count'])

	sum	mean	std	sem	count
School
S_1	956.2	63.746667	23.077474	5.958578	15
S_2	1191.1	59.555000	17.589305	3.933088	20

Renaming the result columns with (name, function) tuples

group_m.agg([('rename_sum','sum'),('rename_mean','mean')])

	rename_sum	rename_mean
School
S_1	956.2	63.746667
S_2	1191.1	59.555000

Specifying which functions apply to which columns

grouped_mul.agg({'Math':['mean','max'],'Height':'var'})

		Math		Height
		mean	max	var
School	Class
S_1	C_1	63.78	87.2	183.3
	C_2	64.30	97.0	132.8
	C_3	63.16	87.7	179.2
S_2	C_1	58.56	83.3	54.7
	C_2	62.80	85.4	256.0
	C_3	63.06	95.5	205.7
	C_4	53.80	67.7	300.2

(c) Using custom functions

grouped_single['Math'].agg(lambda x:print(x.head(),x.count(),'間隔'))  # '間隔' = separator
# agg passes the data in one group at a time, column by column, which opens up many possibilities

1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
Name: Math, dtype: float64 15 間隔
2101    83.3
2102    50.6
2103    52.5
2104    72.2
2105    34.2
Name: Math, dtype: float64 20 間隔





School
S_1    None
S_2    None
Name: Math, dtype: object

pandas provides no built-in function for the range (maximum minus minimum), but agg makes the within-group range easy to compute

grouped_single['Math'].agg(lambda x:x.max()-x.min())

School
S_1    65.5
S_2    62.8
Name: Math, dtype: float64

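(d) Multiple aggregations with NamedAgg

Note: in the pandas version used for these notes, lambda functions are not supported here, but functions defined separately with def are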

display(grouped_single['Math'].head())

def R1(x):
    return x.max()-x.min()
def R2(x):
    return x.max()-x.median()
# Select the columns with a list; each NamedAgg names its input column and the function to apply
grouped_single[['Math','Height']].agg(min_score1=pd.NamedAgg(column='Math',aggfunc=R1),
                          max_score1=pd.NamedAgg(column='Math',aggfunc='max'),
                          range_score2=pd.NamedAgg(column='Height',aggfunc=R2)).head()

ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
2101    83.3
2102    50.6
2103    52.5
2104    72.2
2105    34.2
Name: Math, dtype: float64

	min_score1	max_score1	range_score2
School
S_1	65.5	97.0	20.0
S_2	62.8	95.5	23.5

(e) Aggregation functions with arguments

Check whether at least one Math score in the group lies between 50 and 52:

def f(s,low,high):
    return s.between(low,high).max()
grouped_single['Math'].agg(f,50,52)

grouped_single['Math'].agg(lambda x:print(x.between(50,52)))
grouped_single['Math'].agg(lambda x:print(x.between(50,52).max()))
def f(s,low,high):
    return s.between(low,high).any()
grouped_single['Math'].agg(f,50,52)

1101    False
1102    False
1103    False
1104    False
1105    False
1201    False
1202    False
1203    False
1204    False
1205    False
1301    False
1302    False
1303    False
1304    False
1305    False
Name: Math, dtype: bool
2101    False
2102     True
2103    False
2104    False
2105    False
2201    False
2202    False
2203    False
2204    False
2205    False
2301    False
2302    False
2303    False
2304    False
2305    False
2401    False
2402    False
2403    False
2404    False
2405    False
Name: Math, dtype: bool
False
True





School
S_1    False
S_2     True
Name: Math, dtype: bool

If several functions are needed and at least one of them takes arguments, wrap that function:

def f_test(s,low,high):
    return s.between(low,high).max()
def agg_f(f_mul,name,*args):#,**kwargs
    def wrapper(x):
        return f_mul(x,*args)#,**kwargs
    wrapper.__name__ = name
    return wrapper
new_f = agg_f(f_test,'at_least_one_in_50_52',50,52)
grouped_single['Math'].agg([new_f,'mean']).head()




	at_least_one_in_50_52	mean
School
S_1	False	63.746667
S_2	True	59.555000

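The goal here is to pass a function that takes arguments to agg alongside other functions. agg_f(f_test,'at_least_one_in_50_52',50,52) returns wrapper, a one-argument function that remembers f_mul and args through its closure. When agg calls wrapper(x) on a group's data x, wrapper returns f_mul(x,50,52), so the outer arguments reach the inner function while agg only ever sees a function of x. Setting wrapper.__name__ gives the result column its name.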

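2. Filtration

filter selects whole groups (remember that the result contains every row of each selected group), so the function passed in must return a Boolean scalar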

grouped_single[['Math','Physics']].filter(lambda x:print((x['Math']>32).all())).head()
# grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32)).head()
# grouped_single[['Math','Physics']].agg(lambda x:print(x['Math']>32))
# grouped_single[['Math','Physics']].agg(lambda x:print(x.head(),x.count(),'間隔'))
grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()

False
True

	Math	Physics
ID
2101	83.3	C
2102	50.6	B+
2103	52.5	B-
2104	72.2	B+
2105	34.2	A

grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>34).all()).head()

	Math	Physics
ID

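filter selects groups: the function returns True for a group when every student in it scores above the threshold (32 or 34) and False otherwise. With a threshold of 32 only one of the two groups qualifies, so only that group is returned; with 34 neither qualifies, so the result is empty.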
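
The same selection can also be written as an ordinary Boolean row mask, which makes it clearer what filter does. This is a sketch on hypothetical toy data, with the threshold 32 borrowed from the example above.

```python
import pandas as pd

toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2'],
                    'Math': [31.5, 87.2, 32.7, 95.5]})

# filter: the function sees one group's sub-frame and returns a single bool
kept = toy.groupby('School').filter(lambda d: (d['Math'] > 32).all())

# Equivalent row mask: broadcast each group's minimum back onto its rows
mask = toy.groupby('School')['Math'].transform('min') > 32
kept_by_mask = toy[mask]
```

Both versions keep every row of S_2 and drop every row of S_1, including the S_1 row whose own score is above 32.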

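3. Transformation

(a) The object passed in

transform passes each column of each group to the function, and the return value must have exactly the same length as that column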

grouped_single[['Math','Height']].agg(lambda x:print(x-x.min())).head()

1101     2.5
1102     1.0
1103    55.7
1104    48.9
1105    53.3
1201    65.5
1202    32.0
1203    27.3
1204     2.3
1205    36.9
1301     0.0
1302    56.2
1303    18.2
1304    53.7
1305    30.2
Name: Math, dtype: float64
2101    50.6
2102    17.9
2103    19.8
2104    39.5
2105     1.5
2201     6.4
2202    35.8
2203    41.1
2204    14.5
2205    52.7
2301    39.6
2302     0.0
2303    33.2
2304    62.8
2305    16.2
2401    12.6
2402    16.0
2403    27.0
2404    35.0
2405    14.9
Name: Math, dtype: float64
1101    14
1102    33
1103    27
1104     8
1105     0
1201    29
1202    17
1203     1
1204     3
1205     8
1301     2
1302    16
1303    29
1304    36
1305    28
Name: Height, dtype: int64
2101    19
2102     6
2103     2
2104     4
2105    15
2201    38
2202    39
2203     0
2204    20
2205    28
2301     2
2302    16
2303    35
2304     9
2305    32
2401    37
2402    11
2403     3
2404     5
2405    38
Name: Height, dtype: int64

	Math	Height
School
S_1	None	None
S_2	None	None

grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()

	Math	Height
ID
1101	2.5	14
1102	1.0	33
1103	55.7	27
1104	48.9	8
1105	53.3	0

If the function returns a scalar, that value is broadcast to every element of the group

grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()

	Math	Height
ID
1101	63.746667	175.733333
1102	63.746667	175.733333
1103	63.746667	175.733333
1104	63.746667	175.733333
1105	63.746667	175.733333
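
For common statistics, transform also accepts the function name as a string, which gives the same broadcast result without a lambda. A sketch on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2'],
                    'Math': [34.0, 87.2, 83.3, 50.6]})
g = toy.groupby('School')['Math']

via_lambda = g.transform(lambda s: s.mean())
via_name = g.transform('mean')   # built-in name, same broadcast result

# The broadcast pieces combine with the original column, e.g. for a z-score
zscore = (toy['Math'] - g.transform('mean')) / g.transform('std')
```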

(b) Within-group standardization with transform

grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()

	Math	Height
ID
1101	-1.288991	-0.214991
1102	-1.353990	1.279460
1103	1.016287	0.807528
1104	0.721627	-0.686923
1105	0.912289	-1.316166

(c) Filling missing values with the group mean using transform

df_nan = df[['Math','School']].copy().reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan
df_nan.head()

	ID	Math	School
0	1101	34.0	S_1
1	1102	NaN	S_1
2	1103	NaN	S_1
3	1104	80.4	S_1
4	1105	84.8	S_1

df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()

	ID	Math	School
0	1101	68.214286	S_1
1	1102	68.214286	S_1
2	1103	87.200000	S_1
3	1104	80.400000	S_1
4	1105	68.214286	S_1
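
Because the missing positions above are drawn at random, the output changes from run to run. A deterministic sketch of the same idea, on hypothetical toy data, selects only the Math column so that no other column passes through fillna:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'School': ['S_1'] * 3 + ['S_2'] * 3,
                    'Math': [34.0, np.nan, 80.0, np.nan, 50.0, 70.0]})

# Each NaN is replaced by the mean of the non-missing scores in its own school
filled = toy.groupby('School')['Math'].transform(lambda s: s.fillna(s.mean()))
```

The S_1 gap becomes 57.0 (the mean of 34 and 80) and the S_2 gap becomes 60.0 (the mean of 50 and 70).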

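IV. The apply Function

1. The flexibility of apply

Of all the group methods, apply is probably the most widely used, thanks to its flexibility:

As for the input, the printout below shows that each group is passed to apply as a whole table: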

df.groupby('School').apply(lambda x:print(x.head(5)))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+
2103    S_2   C_1      M  street_4     157      61  52.5      B-
2104    S_2   C_1      F  street_5     159      97  72.2      B+
2105    S_2   C_1      M  street_4     170      81  34.2       A

The flexibility of apply comes largely from the variety of values it can return:

① Scalar return values (one per column)

df[['School','Math','Height']].groupby('School').apply(lambda x:print(x,x.max()))
display(df[['School','Math','Height']].groupby('School').agg(lambda x:x.max()))
df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())

     School  Math  Height
ID                       
1101    S_1  34.0     173
1102    S_1  32.5     192
1103    S_1  87.2     186
1104    S_1  80.4     167
1105    S_1  84.8     159
1201    S_1  97.0     188
1202    S_1  63.5     176
1203    S_1  58.8     160
1204    S_1  33.8     162
1205    S_1  68.4     167
1301    S_1  31.5     161
1302    S_1  87.7     175
1303    S_1  49.7     188
1304    S_1  85.2     195
1305    S_1  61.7     187 School    S_1
Math       97
Height    195
dtype: object
     School  Math  Height
ID                       
2101    S_2  83.3     174
2102    S_2  50.6     161
2103    S_2  52.5     157
2104    S_2  72.2     159
2105    S_2  34.2     170
2201    S_2  39.1     193
2202    S_2  68.5     194
2203    S_2  73.8     155
2204    S_2  47.2     175
2205    S_2  85.4     183
2301    S_2  72.3     157
2302    S_2  32.7     171
2303    S_2  65.9     190
2304    S_2  95.5     164
2305    S_2  48.9     187
2401    S_2  45.3     192
2402    S_2  48.7     166
2403    S_2  59.7     158
2404    S_2  67.7     160
2405    S_2  47.6     193 School     S_2
Math      95.5
Height     194
dtype: object

	Math	Height
School
S_1	97.0	195
S_2	95.5	194

	School	Math	Height
School
S_1	S_1	97.0	195
S_2	S_2	95.5	194

② List-like return values

display(df[['School','Math','Height']].groupby('School').apply(lambda x:x-x.min()).head())
df[['School','Math','Height']].groupby('School').transform(lambda x:x-x.min()).head()

	Math	Height
ID
1101	2.5	14.0
1102	1.0	33.0
1103	55.7	27.0
1104	48.9	8.0
1105	53.3	0.0

	Math	Height
ID
1101	2.5	14
1102	1.0	33
1103	55.7	27
1104	48.9	8
1105	53.3	0

③ DataFrame return values

df[['School','Math','Height']].groupby('School')\
    .apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),
                                  'col2':x['Math']-x['Math'].min(),
                                  'col3':x['Height']-x['Height'].max(),
                                  'col4':x['Height']-x['Height'].min()})).head()





	col1	col2	col3	col4
ID
1101	-63.0	2.5	-22	14
1102	-64.5	1.0	-3	33
1103	-9.8	55.7	-9	27
1104	-16.6	48.9	-28	8
1105	-12.2	53.3	-36	0

2. Computing several statistics at once with apply

OrderedDict makes this quick:

from collections import OrderedDict
def f(df):
    data = OrderedDict()
    data['M_sum'] = df['Math'].sum()
    data['W_var'] = df['Weight'].var()
    data['H_mean'] = df['Height'].mean()
    print('data',data)
    print('series')
    print(pd.Series(data))
    return pd.Series(data)
grouped_single.apply(f)




data OrderedDict([('M_sum', 956.2000000000002), ('W_var', 117.42857142857143), ('H_mean', 175.73333333333332)])
series
M_sum     956.200000
W_var     117.428571
H_mean    175.733333
dtype: float64
data OrderedDict([('M_sum', 1191.1), ('W_var', 181.08157894736837), ('H_mean', 172.95)])
series
M_sum     1191.100000
W_var      181.081579
H_mean     172.950000
dtype: float64

	M_sum	W_var	H_mean
School
S_1	956.2	117.428571	175.733333
S_2	1191.1	181.081579	172.950000
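
The same table can usually be produced without apply by using keyword-style named aggregation. The sketch below assumes a pandas version that supports the name=(column, function) form (0.25 or later) and uses hypothetical toy data; the helper stats is made up for the comparison.

```python
import pandas as pd

toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2'],
                    'Math': [34.0, 87.2, 83.3, 50.6],
                    'Weight': [63, 82, 84, 61],
                    'Height': [173, 186, 174, 161]})

def stats(d):
    # A plain dict keeps insertion order on Python 3.7+
    return pd.Series({'M_sum': d['Math'].sum(),
                      'W_var': d['Weight'].var(),
                      'H_mean': d['Height'].mean()})

# Select the needed columns first so that apply only sees data columns
via_apply = toy.groupby('School')[['Math', 'Weight', 'Height']].apply(stats)
# Named aggregation: result_name=(input column, aggregation)
via_agg = toy.groupby('School').agg(M_sum=('Math', 'sum'),
                                    W_var=('Weight', 'var'),
                                    H_mean=('Height', 'mean'))
```

apply remains the more general tool; named aggregation is only a shortcut when each output is a single statistic of a single column.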

V. Questions and Exercises

1. Questions

[Question 1] What are forward and backward filling in fillna, and how are they done?

df = pd.read_csv('data/table.csv',index_col='ID')
df.head(3)
df_nan = df[['Math','School']].copy().reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan
df_nan.head()
df_nan.Math=df_nan.Math.bfill()  # same result as fillna(method='bfill'), which newer pandas versions deprecate
df_nan.head()

	ID	Math	School
0	1101	34.0	S_1
1	1102	87.2	S_1
2	1103	87.2	S_1
3	1104	97.0	S_1
4	1105	97.0	S_1

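The method parameter of fillna controls the fill direction. Forward fill replaces each missing value with the last non-missing value before it in the column; backward fill replaces it with the next non-missing value after it.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

pad / ffill: forward fill (values propagate downward)

backfill / bfill: backward fill (values propagate upward)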
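
A short sketch of both directions on a hypothetical toy Series, followed by a grouped forward fill, which stops values from one group leaking into the next:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # copies the last valid value downward
backward = s.bfill()   # copies the next valid value upward; the trailing NaN has none

# Within groups, the leading NaN of S_2 is not filled from S_1
toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2'],
                    'Math': [34.0, np.nan, np.nan, 50.6]})
grouped_fill = toy.groupby('School')['Math'].ffill()
```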

[Question 2] What does the following code do? Design a groupby version of it.

s = pd.Series ([0, 1, 1, 0, 1, 1, 1, 0])
s1 = s.cumsum()
result = s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1,fill_value =0)

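s1: cumulative sum of s: [0, 1, 2, 2, 3, 4, 5, 5]

s.mul(s1): element-wise product of s and s1: [0, 1, 2, 0, 3, 4, 5, 0]

.diff(): first-order difference: [nan, 1.0, 1.0, -2.0, 3.0, 1.0, 1.0, -5.0]

.where(lambda x: x < 0): keep only the negative values and set the rest to NaN: [nan, nan, nan, -2.0, nan, nan, nan, -5.0]

.ffill(): forward fill: [nan, nan, nan, -2.0, -2.0, -2.0, -2.0, -5.0]

.add(s1,fill_value =0): treat missing values as 0 and add s1: [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 3.0, 0.0]

The result gives, at each position, the length of the current run of consecutive 1s, resetting to 0 at every 0.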

list(s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1,fill_value =0))
gp =df.groupby('School')

gp.apply(lambda x:x['Math'].mul(x['Math'].cumsum()).diff().where(lambda m: m < 0).ffill().add(x['Math'].cumsum(),fill_value =0))

School  ID  
S_1     1101       34.00
        1102       66.50
        1103      153.70
        1104      234.10
        1105      318.90
        1201      415.90
        1202    -9421.00
        1203    -9362.20
        1204   -11740.56
        1205   -11672.16
        1301   -21966.61
        1302   -21878.91
        1303   -25585.41
        1304   -25500.21
        1305   -16257.66
S_2     2101       83.30
        2102      -29.65
        2103       22.85
        2104       95.05
        2105    -8364.36
        2201    -8325.26
        2202    -8256.76
        2203    -8182.96
        2204    -9864.48
        2205    -9779.08
        2301    -2042.69
        2302   -25111.27
        2303   -25045.37
        2304   -24949.87
        2305   -37377.81
        2401     -300.07
        2402     -251.37
        2403     -191.67
        2404     -123.97
        2405   -19527.49
Name: Math, dtype: float64
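
The chain is designed for a 0/1 sequence, so a groupby version is easiest to interpret on an indicator column. The sketch below assumes a hypothetical Pass column (1 for a pass, 0 otherwise) and counts the current streak of passes separately within each school.

```python
import pandas as pd

# Hypothetical 0/1 indicator, not a column of the chapter's dataset
toy = pd.DataFrame({'School': ['S_1'] * 4 + ['S_2'] * 4,
                    'Pass': [1, 1, 0, 1, 1, 1, 1, 0]})

def run_length(s):
    # Same chain as the original snippet, applied to one group's 0/1 Series
    s1 = s.cumsum()
    return s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1, fill_value=0)

# transform restarts the count at the first row of every school
toy['streak'] = toy.groupby('School')['Pass'].transform(run_length)
```

For S_1 the streak is 1, 2, 0, 1 and for S_2 it is 1, 2, 3, 0; the count in S_2 does not continue from S_1.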

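[Question 3] How do you compute the within-group 0.25 and 0.75 quantiles? Show them in a single table.

[Question 4] Since indexing can already select subsets that meet a condition, what is the point of the filter function?

[Question 5] How do aggregation, transformation and filtration differ, and what do they share, in input, output and purpose?

[Question 6] For multi-function aggregation with arguments, is there a way to get the same result without the wrapper technique?

Question 3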

gp.apply(lambda x:pd.DataFrame({'q25':x.quantile(0.25),
                                  'q75':x.quantile(0.75)
                                       }))

		q25	q75
School
S_1	Height	164.50	187.500
	Weight	63.00	77.000
	Math	41.85	85.000
S_2	Height	159.75	187.750
	Weight	70.25	85.000
	Math	47.50	72.225
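
An alternative sketch passes both quantiles to quantile in a single call after restricting the groupby to numeric columns, which avoids building a DataFrame by hand. The toy data is hypothetical, and a list-valued q is assumed to be supported by the installed pandas version.

```python
import pandas as pd

toy = pd.DataFrame({'School': ['S_1'] * 4 + ['S_2'] * 4,
                    'Math': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
                    'Height': [150, 160, 170, 180, 155, 165, 175, 185]})

# Rows are indexed by (School, quantile), columns are the numeric variables
q = toy.groupby('School')[['Math', 'Height']].quantile([0.25, 0.75])
# Move the quantile level into the columns: one row per school
wide = q.unstack()
```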

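Question 4

filter selects groups; its result contains every row of the groups that pass

Question 5

Aggregation computes statistics per group: the input is each group's data and the output is one statistic per group, a scalar for each column.

Transformation operates on every element within each group (for example standardization): the input is each group's data and the output is that data transformed by some rule, with its shape unchanged.

Filtration selects groups by some rule: the input is each group's data and the output is all rows of the groups that meet the condition.

Question 6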
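
One possible answer, offered as a sketch rather than the intended solution: fix the extra arguments with a lambda and name the column with a (name, function) tuple, the same renaming form used earlier in the chapter. The toy data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2'],
                    'Math': [34.0, 87.2, 50.6, 83.3]})

def f_test(s, low, high):
    return s.between(low, high).any()

# The tuple names the result column; the lambda fixes low and high
result = toy.groupby('School')['Math'].agg(
    [('at_least_one_in_50_52', lambda s: f_test(s, 50, 52)), 'mean'])
```

functools.partial(f_test, low=50, high=52) could take the place of the lambda in the same tuple.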

2. Exercises

[Exercise 1] Given a dataset about diamonds whose columns record carat, color, depth and price, answer the following:

pd.read_csv('data/Diamonds.csv').head()

	carat	color	depth	price
0	0.23	E	61.5	326
1	0.21	E	59.8	326
2	0.23	E	56.9	327
3	0.29	I	62.4	334
4	0.31	J	63.3	335

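(a) Among all diamonds heavier than 1 carat, what is the range (maximum minus minimum) of the price?

(b) Using the 0.2/0.4/0.6/0.8 quantiles of depth as group boundaries, which color is most common in each group? Is that color also the most expensive per unit weight, on average, within the group?

(c) Group the diamonds by carat into (0, 0.5], (0.5, 1], (1, 1.5], (1.5, 2] and above 2, sort each group by increasing depth, and find the length of the longest run of strictly increasing consecutive prices in each group.

(d) Group by color and compute the regression coefficient of price on carat for each group. (Simple linear regression with one variable, using only Pandas and NumPy)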

a

df=pd.read_csv('data/Diamonds.csv')
df.head()

	carat	color	depth	price
0	0.23	E	61.5	326
1	0.21	E	59.8	326
2	0.23	E	56.9	327
3	0.29	I	62.4	334
4	0.31	J	63.3	335

a=df[df['carat']>1]
a['price'].max()-a['price'].min()

df_r=df.query('carat>1')['price']
df_r.max()-df_r.min()

b

np.linspace(0,1,6)

array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])

bins=df['depth'].quantile(np.linspace(0,1,6)).tolist()
cuts=pd.cut(df['depth'],bins=bins)
df['cuts']=cuts
df.head()

	carat	color	depth	price	cuts
0	0.23	E	61.5	326	(60.8, 61.6]
1	0.21	E	59.8	326	(43.0, 60.8]
2	0.23	E	56.9	327	(43.0, 60.8]
3	0.29	I	62.4	334	(62.1, 62.7]
4	0.31	J	63.3	335	(62.7, 79.0]

color_result = df.groupby('cuts')['color'].describe()
color_result

	count	unique	top	freq
cuts
(43.0, 60.8]	11294	7	E	2259
(60.8, 61.6]	11831	7	G	2593
(61.6, 62.1]	10403	7	G	2247
(62.1, 62.7]	10137	7	G	2193
(62.7, 79.0]	10273	7	G	2000

df['均重價格']=df['price']/df['carat']  # '均重價格' = price per carat
color_result['top'] == [i[1] for i in df.groupby(['cuts'
                                ,'color'])['均重價格'].mean().groupby(['cuts']).idxmax().values]

cuts
(43.0, 60.8]    False
(60.8, 61.6]    False
(61.6, 62.1]    False
(62.1, 62.7]     True
(62.7, 79.0]     True
Name: top, dtype: bool

c

df = df.drop(columns='均重價格')
cuts = pd.cut(df['carat'],bins=[0,0.5,1,1.5,2,np.inf]) # optionally pass labels= to assign custom labels
df['cuts'] = cuts
df.head()

	carat	color	depth	price	cuts
0	0.23	E	61.5	326	(0.0, 0.5]
1	0.21	E	59.8	326	(0.0, 0.5]
2	0.23	E	56.9	327	(0.0, 0.5]
3	0.29	I	62.4	334	(0.0, 0.5]
4	0.31	J	63.3	335	(0.0, 0.5]

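def f(nums):
    """Return the length of the longest run of consecutive, strictly increasing values in a list."""
    if not nums:
        return 0
    res = 1        # longest run found so far
    cur_len = 1    # length of the run ending at the current element
    for i in range(1, len(nums)):
        if nums[i-1] < nums[i]:
            cur_len += 1
            res = max(cur_len, res)
        else:
            cur_len = 1    # the run is broken, so start over
    return res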

for name,group in df.groupby('cuts'):
    group = group.sort_values(by='depth')
    s = group['price']
    print(name,f(s.tolist()))

(0.0, 0.5] 8
(0.5, 1.0] 8
(1.0, 1.5] 7
(1.5, 2.0] 11
(2.0, inf] 7
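The explicit loop in `f` can also be written with pandas primitives. The following is a sketch of one alternative, not the author's method, and the function name `longest_increasing_run` is made up here. A new run starts wherever the difference to the previous element is not positive, so a cumulative sum of those break points labels the runs, and the size of the largest label group is the answer. It assumes the Series has no missing values.

```python
import pandas as pd

def longest_increasing_run(s):
    """Length of the longest run of consecutive, strictly increasing values in a Series."""
    if s.empty:
        return 0
    starts_new_run = s.diff().le(0)     # the first diff is NaN, which compares as False
    run_id = starts_new_run.cumsum()    # all elements of one run share the same id
    return int(run_id.value_counts().max())

prices = pd.Series([1, 2, 3, 2, 5, 6, 7, 8, 1])
print(longest_increasing_run(prices))
```

Inside the group loop it would be called on the sorted `price` column directly, without converting to a list.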

d

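for name,group in df[['carat','price','color']].groupby('color'):
    # Design matrix (stored transposed): a row of ones for the intercept and a row of carat values
    L1 = np.array([np.ones(group.shape[0]),group['carat']]).reshape(2,group.shape[0])
    L2 = group['price']
    # Normal equations: beta = (X^T X)^{-1} X^T y, with X = L1.T
    result = np.linalg.inv(L1.dot(L1.T)).dot(L1).dot(L2)
    print('For color %s, the intercept is %f and the regression coefficient is %f'%(name,result[0],result[1]))

For color D, the intercept is -2361.017152 and the regression coefficient is 8408.353126
For color E, the intercept is -2381.049600 and the regression coefficient is 8296.212783
For color F, the intercept is -2665.806191 and the regression coefficient is 8676.658344
For color G, the intercept is -2575.527643 and the regression coefficient is 8525.345779
For color H, the intercept is -2460.418046 and the regression coefficient is 7619.098320
For color I, the intercept is -2878.150356 and the regression coefficient is 7761.041169
For color J, the intercept is -2920.603337 and the regression coefficient is 7094.192092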
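
For a single predictor, the normal equations reduce to slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x), which avoids the explicit matrix inverse. A minimal sketch on synthetic data follows. The values and the helper name `fit_line` are assumptions for illustration, and the real coefficients are the ones printed above.

```python
import numpy as np
import pandas as pd

def fit_line(group):
    """Hypothetical helper: least-squares intercept and slope of price on carat."""
    x, y = group['carat'], group['price']
    slope = x.cov(y) / x.var()          # both use ddof=1, so the factor cancels
    intercept = y.mean() - slope * x.mean()
    return pd.Series({'intercept': intercept, 'slope': slope})

# Synthetic data lying exactly on two known lines.
carat = np.array([0.2, 0.5, 1.0, 1.5])
toy = pd.DataFrame({
    'color': ['D'] * 4 + ['E'] * 4,
    'carat': np.tile(carat, 2),
    'price': np.concatenate([8000 * carat - 2000, 7000 * carat - 1500]),
})

coefs = toy.groupby('color')[['carat', 'price']].apply(fit_line)
print(coefs)
```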

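[Exercise 2]: Given a dataset of illicit drug reports in the United States from 2010 to 2017, whose columns record the year, state (five in total), county, substance name, and number of reports, solve the following problems: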

pd.read_csv('data/Drugs.csv').head()

	YYYY	State	COUNTY	SubstanceName	DrugReports
0	2010	VA	ACCOMACK	Propoxyphene	1
1	2010	OH	ADAMS	Morphine	9
2	2010	PA	ADAMS	Methadone	2
3	2010	VA	ALEXANDRIA CITY	Heroin	5
4	2010	PA	ALLEGHENY	Hydromorphone	5

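(a) For each year, which county had the most reports? Was that county's state also the one with the most reports that year?

(b) From 2014 to 2015, which state saw the largest increase in Heroin reports? Within that state, was Heroin the substance with the largest growth? If not, find the substance that was.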

a

df = pd.read_csv('data/Drugs.csv')
df.head()

	YYYY	State	COUNTY	SubstanceName	DrugReports
0	2010	VA	ACCOMACK	Propoxyphene	1
1	2010	OH	ADAMS	Morphine	9
2	2010	PA	ADAMS	Methadone	2
3	2010	VA	ALEXANDRIA CITY	Heroin	5
4	2010	PA	ALLEGHENY	Hydromorphone	5

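# Total reports per (year, state, county) and per (year, state). Grouping by state as well
# keeps same-named counties in different states apart (e.g. ADAMS in OH and PA).
county_sum = df.groupby(['YYYY','State','COUNTY'])['DrugReports'].sum()
state_sum = df.groupby(['YYYY','State'])['DrugReports'].sum()
for i in range(2010,2018):
    state,county = county_sum.loc[i].idxmax()
    state_true = state_sum.loc[i].idxmax()
    if state==state_true:
        print('In %d, %s county had the most reports, and its state %s also had the most reports'%(i,county,state))
    else:
        print('In %d, %s county had the most reports, but its state %s did not; %s had the most reports'%(i,county,state,state_true))

In 2010, PHILADELPHIA county had the most reports, and its state PA also had the most reports
In 2011, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2012, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2013, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2014, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2015, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2016, HAMILTON county had the most reports, and its state OH also had the most reports
In 2017, HAMILTON county had the most reports, and its state OH also had the most reports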
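
A loop-free variant, sketched on invented numbers: sum once, then take `idxmax` within each year. Grouping by `State` as well as `COUNTY` is a choice made here because county names repeat across states (ADAMS appears under both OH and PA in the sample rows).

```python
import pandas as pd

# Toy data; the counts are invented for illustration only.
toy = pd.DataFrame({
    'YYYY': [2010, 2010, 2010, 2011, 2011, 2011],
    'State': ['PA', 'OH', 'OH', 'PA', 'OH', 'OH'],
    'COUNTY': ['ADAMS', 'ADAMS', 'HAMILTON', 'ADAMS', 'ADAMS', 'HAMILTON'],
    'DrugReports': [9, 4, 7, 3, 5, 6],
})

county_sum = toy.groupby(['YYYY', 'State', 'COUNTY'])['DrugReports'].sum()
state_sum = toy.groupby(['YYYY', 'State'])['DrugReports'].sum()

# For each year, the index label (a tuple) of the largest total.
top_county = county_sum.groupby(level='YYYY').idxmax()
top_state = state_sum.groupby(level='YYYY').idxmax()

print(top_county)
print(top_state)
```

In this toy table the top county of 2010 lies in PA while the top state is OH, the same kind of mismatch the loop reports for 2011 through 2015.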

b

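df_b = df[(df['YYYY'].isin([2014,2015]))&(df['SubstanceName']=='Heroin')]
# Sum only the numeric column; recent pandas versions no longer drop string columns in sum()
df_add = df_b.groupby(['YYYY','State'])[['DrugReports']].sum()
(df_add.loc[2015]-df_add.loc[2014]).idxmax()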

DrugReports    OH
dtype: object

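df_b = df[(df['YYYY'].isin([2014,2015]))&(df['State']=='OH')]
# Sum only the numeric column; recent pandas versions no longer drop string columns in sum()
df_add = df_b.groupby(['YYYY','SubstanceName'])[['DrugReports']].sum()
display((df_add.loc[2015]-df_add.loc[2014]).idxmax()) # absolute increase; relies on index alignment
display((df_add.loc[2015]/df_add.loc[2014]).idxmax()) # relative growth (ratio)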

DrugReports    Heroin
dtype: object



DrugReports    Acetyl fentanyl
dtype: object
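
The two displays answer different questions: the difference ranks substances by absolute increase, the ratio by relative growth. Here is a sketch with invented counts, which also shows that a substance reported in only one of the two years yields NaN after alignment and is skipped by `idxmax`.

```python
import pandas as pd

# Toy data; the counts are invented for illustration only.
toy = pd.DataFrame({
    'YYYY': [2014, 2014, 2015, 2015, 2015],
    'SubstanceName': ['Heroin', 'Fentanyl', 'Heroin', 'Fentanyl', 'Morphine'],
    'DrugReports': [100, 5, 150, 40, 8],
})

# One row per substance, one column per year.
wide = toy.groupby(['SubstanceName', 'YYYY'])['DrugReports'].sum().unstack('YYYY')

abs_change = wide[2015] - wide[2014]   # absolute increase
rel_change = wide[2015] / wide[2014]   # relative growth

print(abs_change.idxmax(), rel_change.idxmax())
```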
