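Python Support Vector Machines Done Right (Case Study)

Follow the WeChat official account: 小程在线

Follow the CSDN blog: 程志伟的博客

Working through this article takes about 3 hours.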

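In this case study we put the key parameters, attributes and interfaces of the SVC class to work. The parameters include the soft-margin penalty coefficient C, the kernel function kernel, the kernel-related parameters gamma, coef0 and degree, class_weight for handling class imbalance, probability for enabling probability estimates, and cache_size for controlling the kernel cache memory. The main attributes are support_vectors_, which returns the support vectors, and coef_, which can be read as feature importance (available only with the linear kernel). Among the interfaces, the core one is decision_function. We also cover evaluation metrics for classification models, namely the confusion matrix and the ROC curve, along with some ideas for feature engineering and data preprocessing.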
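
As a quick orientation before the case study, here is a minimal sketch of the attributes and interface named above. The six-point toy dataset is an assumption made purely for illustration and has nothing to do with the weather data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative assumption, not the weather dataset).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

support_vectors = clf.support_vectors_  # training points that define the margin
weights = clf.coef_                     # per-feature weights, linear kernel only
scores = clf.decision_function(X)       # signed score: positive means class 1
predictions = clf.predict(X)
```

For a binary problem, predict simply thresholds decision_function at zero, which is why the later sections can feed decision_function straight into roc_auc_score.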

Import the required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

Load and explore the data
weather = pd.read_csv(r"H:\程志伟\python\菜菜的机器学习skleaen课堂\SVM数据\weatherAUS5000.csv",index_col=0)
weather.head()
Out[2]:
Date Location MinTemp ... Temp9am Temp3pm RainTomorrow
0 2015-03-24 Adelaide 12.3 ... 15.1 17.7 No
1 2011-07-12 Adelaide 7.9 ... 8.4 11.3 No
2 2010-02-08 Adelaide 24.0 ... 32.4 37.4 No
3 2016-09-19 Adelaide 6.7 ... 11.2 15.9 No
4 2014-03-05 Adelaide 16.7 ... 20.8 23.7 No

[5 rows x 22 columns]

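Feature/Label	Meaning
Date	Date of observation
Location	Name of the weather station that recorded the data
MinTemp	Minimum temperature, in degrees Celsius
MaxTemp	Maximum temperature, in degrees Celsius
Rainfall	Rainfall recorded for the day, in mm
Evaporation	Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine	Number of hours of bright sunshine in the day
WindGustDir	Direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed	Speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am	Wind direction at 9am
WindDir3pm	Wind direction at 3pm
WindSpeed9am	Wind speed (km/h) averaged over the 10 minutes before 9am
WindSpeed3pm	Wind speed (km/h) averaged over the 10 minutes before 3pm
Humidity9am	Humidity (percent) at 9am
Humidity3pm	Humidity (percent) at 3pm
Pressure9am	Atmospheric pressure (hPa) reduced to mean sea level at 9am
Pressure3pm	Atmospheric pressure (hPa) reduced to mean sea level at 3pm
Cloud9am	Fraction of sky obscured by cloud at 9am, measured in oktas: 0 means a completely clear sky and 8 means completely overcast
Cloud3pm	Fraction of sky obscured by cloud at 3pm, in oktas
Temp9am	Temperature at 9am, in degrees Celsius
Temp3pm	Temperature at 3pm, in degrees Celsius
RainTomorrow	Target variable, our label: did it rain the next day?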

# Separate the feature matrix from the label Y

X = weather.iloc[:,:-1]
Y = weather.iloc[:,-1]

# Explore the data types
X.shape
Out[4]: (5000, 21)

X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 5000 non-null object
1 Location 5000 non-null object
2 MinTemp 4979 non-null float64
3 MaxTemp 4987 non-null float64
4 Rainfall 4950 non-null float64
5 Evaporation 2841 non-null float64
6 Sunshine 2571 non-null float64
7 WindGustDir 4669 non-null object
8 WindGustSpeed 4669 non-null float64
9 WindDir9am 4651 non-null object
10 WindDir3pm 4887 non-null object
11 WindSpeed9am 4949 non-null float64
12 WindSpeed3pm 4919 non-null float64
13 Humidity9am 4936 non-null float64
14 Humidity3pm 4880 non-null float64
15 Pressure9am 4506 non-null float64
16 Pressure3pm 4504 non-null float64
17 Cloud9am 3111 non-null float64
18 Cloud3pm 3012 non-null float64
19 Temp9am 4967 non-null float64
20 Temp3pm 4912 non-null float64
dtypes: float64(16), object(5)
memory usage: 859.4+ KB

# Explore the missing values
X.isnull().mean()
Out[6]:
Date 0.0000
Location 0.0000
MinTemp 0.0042
MaxTemp 0.0026
Rainfall 0.0100
Evaporation 0.4318
Sunshine 0.4858
WindGustDir 0.0662
WindGustSpeed 0.0662
WindDir9am 0.0698
WindDir3pm 0.0226
WindSpeed9am 0.0102
WindSpeed3pm 0.0162
Humidity9am 0.0128
Humidity3pm 0.0240
Pressure9am 0.0988
Pressure3pm 0.0992
Cloud9am 0.3778
Cloud3pm 0.3976
Temp9am 0.0066
Temp3pm 0.0176
dtype: float64

# Explore the label classes
np.unique(Y)
Out[7]: array(['No', 'Yes'], dtype=object)

1.2 Splitting the data and exploring the labels first
Split into training and test sets, then compute descriptive statistics
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y,test_size=0.3,random_state=420)

# Reset the index
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])

Check for class imbalance
Ytrain.value_counts()
Out[10]:
No 2704
Yes 796
Name: RainTomorrow, dtype: int64

Ytest.value_counts()
Out[11]:
No 1157
Yes 343
Name: RainTomorrow, dtype: int64

Encode the labels
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder().fit(Ytrain)

Ytrain = pd.DataFrame(encoder.transform(Ytrain))
Ytest = pd.DataFrame(encoder.transform(Ytest))

1.3 Exploring the features and processing the feature matrix
1.3.1 Descriptive statistics and outliers
Xtrain.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T
Out[14]:
count mean std ... 90% 99% max
MinTemp 3486.0 12.225645 6.396243 ... 20.9 25.900 29.0
MaxTemp 3489.0 23.245543 7.201839 ... 33.0 40.400 46.4
Rainfall 3467.0 2.487049 7.949686 ... 6.6 41.272 115.8
Evaporation 1983.0 5.619163 4.383098 ... 10.2 20.600 56.0
Sunshine 1790.0 7.508659 3.805841 ... 12.0 13.300 13.9
WindGustSpeed 3263.0 39.858413 13.219607 ... 57.0 76.000 117.0
WindSpeed9am 3466.0 14.046163 8.670472 ... 26.0 37.000 65.0
WindSpeed3pm 3437.0 18.553390 8.611818 ... 30.0 43.000 65.0
Humidity9am 3459.0 69.069095 18.787698 ... 94.0 100.000 100.0
Humidity3pm 3408.0 51.651995 20.697872 ... 79.0 98.000 100.0
Pressure9am 3154.0 1017.622067 7.065236 ... 1027.0 1033.247 1038.1
Pressure3pm 3154.0 1015.227077 7.032531 ... 1024.4 1030.800 1036.0
Cloud9am 2171.0 4.491939 2.858781 ... 8.0 8.000 8.0
Cloud3pm 2095.0 4.603819 2.655765 ... 8.0 8.000 8.0
Temp9am 3481.0 16.989859 6.537552 ... 26.0 31.000 38.0
Temp3pm 3431.0 21.719003 7.031199 ... 31.4 38.600 45.9

[16 rows x 13 columns]

Xtest.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T
Out[15]:
count mean std ... 90% 99% max
MinTemp 1493.0 11.916812 6.375377 ... 20.48 25.316 28.3
MaxTemp 1498.0 22.906809 6.986043 ... 32.60 38.303 45.1
Rainfall 1483.0 2.241807 7.988822 ... 5.20 35.372 108.2
Evaporation 858.0 5.657809 4.105762 ... 10.40 19.458 38.8
Sunshine 781.0 7.677465 3.862294 ... 12.20 13.400 13.9
WindGustSpeed 1406.0 40.044097 14.027052 ... 57.00 78.000 122.0
WindSpeed9am 1483.0 13.986514 9.124337 ... 26.00 39.360 72.0
WindSpeed3pm 1482.0 18.601215 8.850446 ... 31.00 43.000 56.0
Humidity9am 1477.0 68.688558 18.876448 ... 95.00 100.000 100.0
Humidity3pm 1472.0 51.431386 20.459957 ... 78.00 96.290 100.0
Pressure9am 1352.0 1017.763536 6.910275 ... 1026.50 1033.449 1038.2
Pressure3pm 1350.0 1015.397926 6.916976 ... 1024.20 1031.151 1036.9
Cloud9am 940.0 4.494681 2.870468 ... 8.00 8.000 8.0
Cloud3pm 917.0 4.403490 2.731969 ... 8.00 8.000 8.0
Temp9am 1486.0 16.751817 6.339816 ... 25.45 30.200 35.1
Temp3pm 1481.0 21.483660 6.770567 ... 30.90 37.400 42.9

[16 rows x 13 columns]

Xtrain.shape
Out[16]: (3500, 21)

Xtest.shape
Out[17]: (1500, 21)

1.3.2 Handling a difficult feature: the date
Xtrainc = Xtrain.copy()
Xtrainc.sort_values(by="Location")
Out[18]:
Date Location MinTemp ... Cloud3pm Temp9am Temp3pm
2796 2015-03-24 Adelaide 12.3 ... NaN 15.1 17.7
2975 2012-08-17 Adelaide 7.8 ... NaN 8.3 12.5
775 2013-03-16 Adelaide 17.4 ... NaN 19.1 20.7
861 2011-07-12 Adelaide 7.9 ... NaN 8.4 11.3
2906 2015-08-24 Adelaide 9.2 ... NaN 9.9 13.4
... ... ... ... ... ... ...
2223 2009-05-08 Woomera 9.2 ... 1.0 13.7 20.1
1984 2014-05-26 Woomera 15.5 ... 7.0 18.0 21.5
1592 2012-01-10 Woomera 16.8 ... 6.0 18.3 24.9
2824 2015-11-03 Woomera 16.2 ... 7.0 20.5 26.2
1005 2010-05-14 Woomera 3.9 ... 1.0 11.5 18.5

[3500 rows x 21 columns]

Xtrain.iloc[:,0].value_counts()
Out[19]:
2014-05-16 6
2015-10-12 6
2015-07-03 6
2012-09-18 5
2012-11-23 5
..
2013-11-17 1
2008-12-23 1
2011-10-26 1
2010-06-15 1
2011-06-20 1
Name: Date, Length: 2141, dtype: int64

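# First, dates are not unique: some dates repeat
# Second, after the train/test split the dates are no longer consecutive but scattered
# Does a particular day of a particular year tend to be rainy, or tend to be dry?
# It is not the date that determines rain; that day's sunshine hours, humidity, temperature and so on matter far more
# Judged on its own, the date appears to have no direct influence on the prediction
# If we treated it as a continuous variable, the algorithm would see a series of numbers from roughly 1 to 3000 and would not realise they are dates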

Xtrain.iloc[:,0].value_counts().count()
Out[20]: 2141

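# If we treated it as a categorical variable there would be too many classes (2141); converted to numbers it would be taken as a continuous variable; and as dummy variables it would blow up the feature dimension

One of our features is "Rainfall", the rainfall recorded at the given location on the given date; in other words, today's rainfall. Common sense suggests that whether it rains today should influence whether it rains tomorrow: some places have climates where rain, once it starts, lasts for many days, while in others a downpour comes and goes quickly. We can therefore convert the continuing influence of time on the weather into a feature, "did it rain today", which neatly turns a dependency between the labels of neighbouring samples into a relationship between a feature and the label.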
Xtrain["Rainfall"].head(20)
Out[21]:
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.2
8 0.0
9 0.2
10 1.0
11 0.0
12 0.2
13 0.0
14 0.0
15 3.0
16 0.2
17 0.0
18 35.2
19 0.0
Name: Rainfall, dtype: float64

Xtrain["Rainfall"].isnull().sum()
Out[22]: 33

Xtrain.loc[Xtrain["Rainfall"] >= 1,"RainToday"] = "Yes"
Xtrain.loc[Xtrain["Rainfall"] < 1,"RainToday"] = "No"
# NaN never compares equal to anything, so use isnull() to locate missing rainfall
Xtrain.loc[Xtrain["Rainfall"].isnull(),"RainToday"] = np.nan

Xtest.loc[Xtest["Rainfall"] >= 1,"RainToday"] = "Yes"
Xtest.loc[Xtest["Rainfall"] < 1,"RainToday"] = "No"
Xtest.loc[Xtest["Rainfall"].isnull(),"RainToday"] = np.nan

Xtrain.head()
Out[25]:
Date Location MinTemp ... Temp9am Temp3pm RainToday
0 2015-08-24 Katherine 17.5 ... 27.5 NaN No
1 2016-12-10 Tuggeranong 9.5 ... 14.6 23.6 No
2 2010-04-18 Albany 13.0 ... 17.5 20.8 No
3 2009-11-26 Sale 13.9 ... 18.5 27.5 No
4 2014-04-25 Mildura 6.0 ... 12.4 22.4 No

[5 rows x 22 columns]

Xtest.head()
Out[26]:
Date Location MinTemp ... Temp9am Temp3pm RainToday
0 2016-01-23 NorahHead 22.0 ... 26.2 23.1 Yes
1 2009-03-05 MountGambier 12.0 ... 14.8 17.5 Yes
2 2010-03-05 MountGinini 9.1 ... NaN NaN NaN
3 2013-10-26 Wollongong 13.1 ... 16.8 19.6 No
4 2016-11-28 Sale 12.2 ... 13.6 19.0 No

[5 rows x 22 columns]

We have now created a feature, RainToday: whether it rained today.
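
One detail of this step is easy to get wrong: a comparison against np.nan never matches, so missing rainfall has to be located with isnull(). The sketch below uses a made-up four-value Series (an assumption for illustration) to show the difference and to build the label with the same 1 mm threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical rainfall values, including one missing reading.
rain = pd.Series([0.0, 2.5, np.nan, 0.4], name="Rainfall")

never_matches = (rain == np.nan)  # all False: NaN is not equal to anything, itself included
missing_mask = rain.isnull()      # the reliable way to find missing values

# Start from an all-missing column, then fill in the rows that have a reading.
rain_today = pd.Series(np.nan, index=rain.index, dtype=object, name="RainToday")
rain_today[rain >= 1] = "Yes"
rain_today[rain < 1] = "No"
```

Rows with missing rainfall satisfy neither comparison, so they simply keep the NaN they started with.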

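The date itself does not affect the weather, but the month and season it falls in do: a day picked from the rainy season is more likely to be followed by rain than a day outside it. We cannot make a machine learning model understand which season each month belongs to, but we can group samples by month so that, through training, the algorithm can pick up that a given month or season is more prone to rain. We can therefore extract the month or season as a feature and discard the specific date. This gives us a second new feature, the month, "Month".

# Extract the month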
Xtrain.loc[0,"Date"].split("-")
Out[27]: ['2015', '08', '24']

int(Xtrain.loc[0,"Date"].split("-")[1])
Out[28]: 8

Xtrain["Date"] = Xtrain["Date"].apply(lambda x:int(x.split("-")[1]))

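# After the replacement, we need to rename the column
# rename is one of the few functions that can change a single column name
# Usually we assign df.columns = some list to change all column names at once
# but rename lets us change just one column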

Xtrain.head()
Out[30]:
Date Location MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8 Katherine 17.5 36.0 ... NaN 27.5 NaN No
1 12 Tuggeranong 9.5 25.0 ... NaN 14.6 23.6 No
2 4 Albany 13.0 22.6 ... 3.0 17.5 20.8 No
3 11 Sale 13.9 29.8 ... 6.0 18.5 27.5 No
4 4 Mildura 6.0 23.5 ... 4.0 12.4 22.4 No

[5 rows x 22 columns]

Xtrain.loc[:,'Date'].value_counts()
Out[31]:
3 334
5 324
7 316
9 302
6 302
1 300
11 299
10 282
4 265
2 264
12 259
8 253
Name: Date, dtype: int64

Xtrain = Xtrain.rename(columns={"Date":"Month"})
Xtrain.head()
Out[32]:
Month Location MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8 Katherine 17.5 36.0 ... NaN 27.5 NaN No
1 12 Tuggeranong 9.5 25.0 ... NaN 14.6 23.6 No
2 4 Albany 13.0 22.6 ... 3.0 17.5 20.8 No
3 11 Sale 13.9 29.8 ... 6.0 18.5 27.5 No
4 4 Mildura 6.0 23.5 ... 4.0 12.4 22.4 No

[5 rows x 22 columns]

Xtest["Date"] = Xtest["Date"].apply(lambda x:int(x.split("-")[1]))
Xtest = Xtest.rename(columns={"Date":"Month"})
Xtest.head()
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[33]:
Month Location MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 1 NorahHead 22.0 27.8 ... NaN 26.2 23.1 Yes
1 3 MountGambier 12.0 18.6 ... 7.0 14.8 17.5 Yes
2 3 MountGinini 9.1 13.3 ... NaN NaN NaN NaN
3 10 Wollongong 13.1 20.3 ... NaN 16.8 19.6 No
4 11 Sale 12.2 20.0 ... 4.0 13.6 19.0 No

[5 rows x 22 columns]

From the date we derived two new features: "rain today" and "month".
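
The month extraction above splits the date string by hand. As an alternative sketch (not what the session above ran), pandas can parse the column once and expose the month through the datetime accessor. The three dates below are copied from the training-set preview; the small DataFrame itself is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2015-08-24", "2016-12-10", "2010-04-18"]})

# Parse the strings as dates, keep only the month, then rename the single column.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d").dt.month
df = df.rename(columns={"Date": "Month"})
```

Parsing with an explicit format also raises an error on malformed dates, which the string split would silently mishandle.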

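1.3.3 Handling a difficult feature: the location

Different locations have different climates and therefore influence "will it rain tomorrow" differently. If we can convert each location into its climate, we can group different cities under the same climate, and rainfall patterns within one climate should be similar.

Approach: given the climates of the country's major cities and their latitudes and longitudes, we compute the geographic distance from each weather station in our samples to every major city and find the nearest one. The climate of that major city is taken as the climate of the sample's location.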

Xtrain.loc[:,'Location'].value_counts().count()
Out[34]: 49

cityll = pd.read_csv(r"H:\程志伟\python\菜菜的机器学习skleaen课堂\SVM数据\cityll.csv",index_col=0)
city_climate = pd.read_csv(r"H:\程志伟\python\菜菜的机器学习skleaen课堂\SVM数据\Cityclimate.csv")

cityll.head()
Out[36]:
City Latitude Longitude Latitudedir Longitudedir
0 Adelaide 34.9285° 138.6007° S, E
1 Albany 35.0275° 117.8840° S, E
2 Albury 36.0737° 146.9135° S, E
3 Wodonga 36.1241° 146.8818° S, E
4 AliceSprings 23.6980° 133.8807° S, E

city_climate.head()
Out[37]:
City Climate
0 Adelaide Warm temperate
1 Albany Mild temperate
2 Albury Hot dry summer, cool winter
3 Wodonga Hot dry summer, cool winter
4 AliceSprings Hot dry summer, warm winter

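To make these two tables usable, we first strip the degree symbol from the latitude and longitude in cityll, and then merge the two tables.

# Strip the degree symbol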
cityll.loc[0,'Latitude'][:-1]
Out[38]: '34.9285'

cityll["Latitudenum"] = cityll["Latitude"].apply(lambda x:float(x[:-1]))
cityll["Longitudenum"] = cityll["Longitude"].apply(lambda x:float(x[:-1]))
cityll.head()
Out[39]:
City Latitude Longitude ... Longitudedir Latitudenum Longitudenum
0 Adelaide 34.9285° 138.6007° ... E 34.9285 138.6007
1 Albany 35.0275° 117.8840° ... E 35.0275 117.8840
2 Albury 36.0737° 146.9135° ... E 36.0737 146.9135
3 Wodonga 36.1241° 146.8818° ... E 36.1241 146.8818
4 AliceSprings 23.6980° 133.8807° ... E 23.6980 133.8807

[5 rows x 7 columns]

citylld = cityll.iloc[:,[0,5,6]]

# Add the climate from city_climate to citylld
citylld["climate"] = city_climate.iloc[:,-1]
citylld.head()
__main__:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[40]:
City Latitudenum Longitudenum climate
0 Adelaide 34.9285 138.6007 Warm temperate
1 Albany 35.0275 117.8840 Mild temperate
2 Albury 36.0737 146.9135 Hot dry summer, cool winter
3 Wodonga 36.1241 146.8818 Hot dry summer, cool winter
4 AliceSprings 23.6980 133.8807 Hot dry summer, warm winter

citylld.loc[:,'climate'].value_counts()
Out[41]:
Hot dry summer, cool winter 24
Hot dry summer, warm winter 18
Warm temperate 18
High humidity summer, warm winter 17
Mild temperate 9
Cool temperate 9
Warm humid summer, mild winter 5
Name: climate, dtype: int64

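To compute distances we need every city that appears in the sample data. We assume that only locations present in the training set will appear in the test set.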

samplecity = pd.read_csv(r"H:\程志伟\\samplecity.csv",index_col=0)
# Apply the same processing to samplecity: strip the degree symbols and drop the direction columns
samplecity["Latitudenum"] = samplecity["Latitude"].apply(lambda x:float(x[:-1]))
samplecity["Longitudenum"] = samplecity["Longitude"].apply(lambda x:float(x[:-1]))
samplecityd = samplecity.iloc[:,[0,5,6]]
samplecityd.head()
Out[42]:
City Latitudenum Longitudenum
0 Canberra 35.2809 149.1300
1 Sydney 33.8688 151.2093
2 Perth 31.9505 115.8605
3 Darwin 12.4634 130.8456
4 Hobart 42.8821 147.3272

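We now have the latitude, longitude and climate of the major cities, as well as the latitude and longitude of each sample location. Next we compute the distance from each sample location to every major city; the climate of the nearest major city becomes the climate of that sample location.

# First use radians to convert degrees to radians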
from math import radians, sin, cos, acos
citylld.loc[:,"slat"] = citylld.iloc[:,1].apply(lambda x : radians(x))
citylld.loc[:,"slon"] = citylld.iloc[:,2].apply(lambda x : radians(x))
samplecityd.loc[:,"elat"] = samplecityd.iloc[:,1].apply(lambda x : radians(x))
samplecityd.loc[:,"elon"] = samplecityd.iloc[:,2].apply(lambda x : radians(x))

citylld.head()
Out[46]:
City Latitudenum ... slat slon
0 Adelaide 34.9285 ... 0.609617 2.419039
1 Albany 35.0275 ... 0.611345 2.057464
2 Albury 36.0737 ... 0.629605 2.564124
3 Wodonga 36.1241 ... 0.630484 2.563571
4 AliceSprings 23.6980 ... 0.413608 2.336659

[5 rows x 6 columns]

samplecityd.head()
Out[47]:
City Latitudenum Longitudenum elat elon
0 Canberra 35.2809 149.1300 0.615768 2.602810
1 Sydney 33.8688 151.2093 0.591122 2.639100
2 Perth 31.9505 115.8605 0.557641 2.022147
3 Darwin 12.4634 130.8456 0.217527 2.283687
4 Hobart 42.8821 147.3272 0.748434 2.571345

import sys
for i in range(samplecityd.shape[0]):
    slat = citylld.loc[:,"slat"]
    slon = citylld.loc[:,"slon"]
    elat = samplecityd.loc[i,"elat"]
    elon = samplecityd.loc[i,"elon"]
    # Great-circle distance (km) from sample location i to every major city
    dist = 6371.01 * np.arccos(np.sin(slat)*np.sin(elat) +
                               np.cos(slat)*np.cos(elat)*np.cos(slon.values - elon))
    city_index = np.argsort(dist)[0]
    # After each calculation, take the nearest city and record it and its climate in samplecityd
    samplecityd.loc[i,"closest_city"] = citylld.loc[city_index,"City"]
    samplecityd.loc[i,"climate"] = citylld.loc[city_index,"climate"]

# Inspect the result and check that the city matching is broadly correct
samplecityd.head()
Out[49]:
City Latitudenum ... closest_city    climate
0 Canberra 35.2809 ... Canberra Cool temperate
1 Sydney 33.8688 ... Sydney Warm temperate
2 Perth 31.9505 ... Perth Warm temperate
3 Darwin 12.4634 ... Darwin High humidity summer, warm winter
4 Hobart 42.8821 ... Hobart Cool temperate

[5 rows x 7 columns]
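
The nearest-city matching relies on the spherical law of cosines. The standalone sketch below repeats the idea with the standard library only. The helper names great_circle_km and nearest are hypothetical; the coordinates are the ones printed in the tables above, kept unsigned as in the source, which works only because every city lies in the same hemisphere.

```python
from math import radians, sin, cos, acos

EARTH_RADIUS_KM = 6371.01  # same radius constant as the loop above

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines; inputs in degrees."""
    p1, l1, p2, l2 = map(radians, (lat1, lon1, lat2, lon2))
    cos_angle = sin(p1) * sin(p2) + cos(p1) * cos(p2) * cos(l1 - l2)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding just outside [-1, 1]
    return EARTH_RADIUS_KM * acos(cos_angle)

def nearest(station, cities):
    """Return the name of the city closest to station, a (lat, lon) tuple."""
    return min(cities, key=lambda name: great_circle_km(*station, *cities[name]))

# A few reference cities from cityll (latitude, longitude in degrees).
cities = {
    "Adelaide": (34.9285, 138.6007),
    "Albany": (35.0275, 117.8840),
    "Albury": (36.0737, 146.9135),
    "AliceSprings": (23.6980, 133.8807),
}

closest_to_canberra = nearest((35.2809, 149.1300), cities)
closest_to_perth = nearest((31.9505, 115.8605), cities)
```

With only these four reference cities, Canberra matches Albury and Perth matches Albany; in the full table each capital matches itself, as the output above shows.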

# Check the distribution of climates
samplecityd["climate"].value_counts()
Out[50]:
Warm temperate 15
Mild temperate 10
Cool temperate 9
Hot dry summer, cool winter 6
High humidity summer, warm winter 4
Hot dry summer, warm winter 3
Warm humid summer, mild winter 2
Name: climate, dtype: int64

# Once verified, take the climate for each sample city and save it
locafinal = samplecityd.iloc[:,[0,-1]]

locafinal.head()
Out[52]:
City climate
0 Canberra Cool temperate
1 Sydney Warm temperate
2 Perth Warm temperate
3 Darwin High humidity summer, warm winter
4 Hobart Cool temperate

locafinal.columns = ["Location","Climate"]

# Set the location as the index of locafinal so that map can match on it later
locafinal = locafinal.set_index(keys="Location")

locafinal.to_csv(r"H:\程志伟\python\\samplelocation.csv")

locafinal.head()
Out[56]:
Climate
Location
Canberra Cool temperate
Sydney Warm temperate
Perth Warm temperate
Darwin High humidity summer, warm winter
Hobart Cool temperate

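Now that we have the climate for each sample city, we use it to replace the original city, that is, the weather station name. Here we can use map, which looks up each value of a feature in the mapping we supply and replaces the original value with the mapped one.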
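
To make the behaviour of map concrete, here is a small sketch with a two-entry lookup. The lookup contents are taken from the table above, while "Atlantis" is a deliberately invented location used to show what happens to a value with no match.

```python
import pandas as pd

# Lookup indexed by location, in the same shape as locafinal.iloc[:, 0].
lookup = pd.Series(
    {"Canberra": "Cool temperate", "Darwin": "High humidity summer, warm winter"},
    name="Climate",
)
location = pd.Series(["Darwin", "Canberra", "Darwin", "Atlantis"])

climate = location.map(lookup)  # values missing from the lookup become NaN
# The .str methods skip NaN, so the clean-up is safe even with unmatched rows.
climate = climate.str.replace(",", "", regex=False).str.strip()
```

An apply with x.strip() would instead raise on an unmatched row, because NaN is a float; that is worth keeping in mind if a station is ever missing from the lookup.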

Xtrain.head()
Out[57]:
Month Location MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8 Katherine 17.5 36.0 ... NaN 27.5 NaN No
1 12 Tuggeranong 9.5 25.0 ... NaN 14.6 23.6 No
2 4 Albany 13.0 22.6 ... 3.0 17.5 20.8 No
3 11 Sale 13.9 29.8 ... 6.0 18.5 27.5 No
4 4 Mildura 6.0 23.5 ... 4.0 12.4 22.4 No

[5 rows x 22 columns]

# Replace the values with map
import re
Xtrain["Location"] = Xtrain["Location"].map(locafinal.iloc[:,0])
Xtrain.head()
Out[58]:
Month Location ... Temp3pm RainToday
0 8 High humidity summer, warm winter ... NaN No
1 12 Cool temperate ... 23.6 No
2 4 Mild temperate ... 20.8 No
3 11 Mild temperate ... 27.5 No
4 4 Hot dry summer, cool winter ... 22.4 No

[5 rows x 22 columns]

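# Replace the content of Location, making sure the matched climate strings contain no commas and no surrounding spaces
# We use the re module to remove the commas

# re.sub(pattern to replace, replacement, string to operate on)
# x.strip() removes surrounding whitespace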

Xtrain["Location"] = Xtrain["Location"].apply(lambda x:re.sub(",","",x.strip()))
Xtrain.head()
Out[60]:
Month Location MinTemp ... Temp9am Temp3pm RainToday
0 8 High humidity summer warm winter 17.5 ... 27.5 NaN No
1 12 Cool temperate 9.5 ... 14.6 23.6 No
2 4 Mild temperate 13.0 ... 17.5 20.8 No
3 11 Mild temperate 13.9 ... 18.5 27.5 No
4 4 Hot dry summer cool winter 6.0 ... 12.4 22.4 No

[5 rows x 22 columns]

Xtest["Location"] = Xtest["Location"].map(locafinal.iloc[:,0]).apply(lambda x:re.sub(",","",x.strip()))

# Now that the content has changed, rename the column from "Location" to "Climate"
Xtrain = Xtrain.rename(columns={"Location":"Climate"})
Xtest = Xtest.rename(columns={"Location":"Climate"})
Xtrain.head()
Out[62]:
Month Climate MinTemp ... Temp9am Temp3pm RainToday
0 8 High humidity summer warm winter 17.5 ... 27.5 NaN No
1 12 Cool temperate 9.5 ... 14.6 23.6 No
2 4 Mild temperate 13.0 ... 17.5 20.8 No
3 11 Mild temperate 13.9 ... 18.5 27.5 No
4 4 Hot dry summer cool winter 6.0 ... 12.4 22.4 No

[5 rows x 22 columns]

Xtest.head()
Out[63]:
Month Climate MinTemp ... Temp9am Temp3pm RainToday
0 1 Cool temperate 22.0 ... 26.2 23.1 Yes
1 3 Mild temperate 12.0 ... 14.8 17.5 Yes
2 3 Cool temperate 9.1 ... NaN NaN NaN
3 10 Warm temperate 13.1 ... 16.8 19.6 No
4 11 Mild temperate 12.2 ... 13.6 19.0 No

[5 rows x 22 columns]

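The location is now processed. We have not yet converted this feature to numbers, that is, it is not yet encoded; we will encode it later together with the other categorical variables.

1.3.4 Handling categorical variables: missing values

# Check the proportion of missing values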
Xtrain.isnull().mean()
Out[64]:
Month 0.000000
Climate 0.000000
MinTemp 0.004000
MaxTemp 0.003143
Rainfall 0.009429
Evaporation 0.433429
Sunshine 0.488571
WindGustDir 0.067714
WindGustSpeed 0.067714
WindDir9am 0.067429
WindDir3pm 0.024286
WindSpeed9am 0.009714
WindSpeed3pm 0.018000
Humidity9am 0.011714
Humidity3pm 0.026286
Pressure9am 0.098857
Pressure3pm 0.098857
Cloud9am 0.379714
Cloud3pm 0.401429
Temp9am 0.005429
Temp3pm 0.019714
RainToday 0.009429
dtype: float64

# First find out which features are categorical
cate = Xtrain.columns[Xtrain.dtypes == "object"].tolist()

# Besides the features of dtype "object", cloud cover is stored as numbers but is categorical in nature
cloud = ["Cloud9am","Cloud3pm"]
cate = cate + cloud
cate
Out[66]:
['Climate',
'WindGustDir',
'WindDir9am',
'WindDir3pm',
'RainToday',
'Cloud9am',
'Cloud3pm']

# For categorical features, we impute with the mode
from sklearn.impute import SimpleImputer
si = SimpleImputer(missing_values=np.nan,strategy="most_frequent")

# Note that we fit the imputer on the training data, which amounts to computing the training-set modes
si.fit(Xtrain.loc[:,cate])
Out[67]:
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
missing_values=nan, strategy='most_frequent', verbose=0)

# Then we use the training-set modes to fill both the training and test sets
Xtrain.loc[:,cate] = si.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = si.transform(Xtest.loc[:,cate])

Xtrain.head()
Out[69]:
Month Climate MinTemp ... Temp9am Temp3pm RainToday
0 8 High humidity summer warm winter 17.5 ... 27.5 NaN No
1 12 Cool temperate 9.5 ... 14.6 23.6 No
2 4 Mild temperate 13.0 ... 17.5 20.8 No
3 11 Mild temperate 13.9 ... 18.5 27.5 No
4 4 Hot dry summer cool winter 6.0 ... 12.4 22.4 No

[5 rows x 22 columns]

Xtest.head()
Out[70]:
Month Climate MinTemp ... Temp9am Temp3pm RainToday
0 1 Cool temperate 22.0 ... 26.2 23.1 Yes
1 3 Mild temperate 12.0 ... 14.8 17.5 Yes
2 3 Cool temperate 9.1 ... NaN NaN No
3 10 Warm temperate 13.1 ... 16.8 19.6 No
4 11 Mild temperate 12.2 ... 13.6 19.0 No

[5 rows x 22 columns]

# Check whether the categorical features still have missing values
Xtrain.loc[:,cate].isnull().mean()
Out[71]:
Climate 0.0
WindGustDir 0.0
WindDir9am 0.0
WindDir3pm 0.0
RainToday 0.0
Cloud9am 0.0
Cloud3pm 0.0
dtype: float64

Xtest.loc[:,cate].isnull().mean()
Out[72]:
Climate 0.0
WindGustDir 0.0
WindDir9am 0.0
WindDir3pm 0.0
RainToday 0.0
Cloud9am 0.0
Cloud3pm 0.0
dtype: float64
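
The key point of this step is that the mode is learned from the training set only and then reused on the test set. A minimal sketch with made-up wind directions (an assumption for illustration) shows the pattern in isolation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical single-column frames standing in for Xtrain and Xtest.
train = pd.DataFrame({"WindGustDir": ["N", "N", "SW", np.nan]}, dtype=object)
test = pd.DataFrame({"WindGustDir": [np.nan, "E"]}, dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputer.fit(train)                       # learns the mode "N" from the training set only

train_filled = imputer.transform(train)  # NumPy arrays of dtype object
test_filled = imputer.transform(test)    # the test set is filled with the training mode
```

Fitting on the training set alone keeps information from the test set out of the preprocessing, even when the test set's own mode would differ.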

1.3.5 Handling categorical variables: encoding
# Encode all categorical variables as numbers, one number per category

from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()

# Fit on the training set

oe = oe.fit(Xtrain.loc[:,cate])

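# Use the encoding learned from the training set to encode both feature matrices
# If the test matrix raises an error here, the test set contains a category never seen in the training set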
Xtrain.loc[:,cate] = oe.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = oe.transform(Xtest.loc[:,cate])

Xtrain.loc[:,cate].head()
Out[76]:
Climate WindGustDir WindDir9am WindDir3pm RainToday Cloud9am Cloud3pm
0 1.0 2.0 6.0 0.0 0.0 0.0 7.0
1 0.0 6.0 4.0 6.0 0.0 7.0 7.0
2 4.0 13.0 4.0 0.0 0.0 1.0 3.0
3 4.0 8.0 3.0 8.0 0.0 6.0 6.0
4 2.0 5.0 0.0 6.0 0.0 2.0 4.0

Xtest.loc[:,cate].head()
Out[77]:
Climate WindGustDir WindDir9am WindDir3pm RainToday Cloud9am Cloud3pm
0 0.0 11.0 8.0 11.0 1.0 7.0 7.0
1 4.0 12.0 12.0 8.0 1.0 8.0 7.0
2 0.0 4.0 3.0 9.0 0.0 7.0 7.0
3 6.0 12.0 13.0 9.0 0.0 7.0 7.0
4 4.0 0.0 12.0 0.0 0.0 8.0 4.0
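
The comment above warns that an unseen test category makes the encoder fail. The sketch below reproduces that failure on toy data and shows one possible way around it, assuming a scikit-learn version that supports the handle_unknown option of OrdinalEncoder (0.24 or later, which is newer than the version that produced the output in this article). The choice of -1 as the placeholder is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([["N"], ["E"], ["S"]], dtype=object)
test = np.array([["E"], ["NW"]], dtype=object)  # "NW" never appears in train

# Default behaviour: unknown categories raise a ValueError at transform time.
strict = OrdinalEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# Optional behaviour: map unknown categories to a reserved value instead.
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(train)
encoded = tolerant.transform(test)  # categories are sorted, so E=0, N=1, S=2
```

Whether a placeholder is appropriate depends on the model; for a distance-based model such as an SVM, an arbitrary -1 is itself a modelling decision.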

1.3.6 Handling continuous variables: imputing missing values
col = Xtrain.columns.tolist()
for i in cate:
    col.remove(i)

col
Out[78]:
['Month',
'MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Temp9am',
'Temp3pm']

# Instantiate the imputer; strategy "mean" fills with the mean
impmean = SimpleImputer(missing_values=np.nan,strategy = "mean")

# Fit on the training set
impmean = impmean.fit(Xtrain.loc[:,col])

# Apply mean imputation to the training and test sets separately
Xtrain.loc[:,col] = impmean.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = impmean.transform(Xtest.loc[:,col])

Xtrain.head()
Out[82]:
Month Climate MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8.0 1.0 17.5 36.0 ... 7.0 27.5 21.719003 0.0
1 12.0 0.0 9.5 25.0 ... 7.0 14.6 23.600000 0.0
2 4.0 4.0 13.0 22.6 ... 3.0 17.5 20.800000 0.0
3 11.0 4.0 13.9 29.8 ... 6.0 18.5 27.500000 0.0
4 4.0 2.0 6.0 23.5 ... 4.0 12.4 22.400000 0.0

[5 rows x 22 columns]

Xtest.head()
Out[83]:
Month Climate MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 1.0 0.0 22.0 27.8 ... 7.0 26.200000 23.100000 1.0
1 3.0 4.0 12.0 18.6 ... 7.0 14.800000 17.500000 1.0
2 3.0 0.0 9.1 13.3 ... 7.0 16.989859 21.719003 0.0
3 10.0 6.0 13.1 20.3 ... 7.0 16.800000 19.600000 0.0
4 11.0 4.0 12.2 20.0 ... 4.0 13.600000 19.000000 0.0

[5 rows x 22 columns]

Xtrain.isnull().sum()
Out[84]:
Month 0
Climate 0
MinTemp 0
MaxTemp 0
Rainfall 0
Evaporation 0
Sunshine 0
WindGustDir 0
WindGustSpeed 0
WindDir9am 0
WindDir3pm 0
WindSpeed9am 0
WindSpeed3pm 0
Humidity9am 0
Humidity3pm 0
Pressure9am 0
Pressure3pm 0
Cloud9am 0
Cloud3pm 0
Temp9am 0
Temp3pm 0
RainToday 0
dtype: int64

Xtest.isnull().sum()
Out[85]:
Month 0
Climate 0
MinTemp 0
MaxTemp 0
Rainfall 0
Evaporation 0
Sunshine 0
WindGustDir 0
WindGustSpeed 0
WindDir9am 0
WindDir3pm 0
WindSpeed9am 0
WindSpeed3pm 0
Humidity9am 0
Humidity3pm 0
Pressure9am 0
Pressure3pm 0
Cloud9am 0
Cloud3pm 0
Temp9am 0
Temp3pm 0
RainToday 0
dtype: int64

1.3.7 Handling continuous variables: standardization
col.remove("Month")
col
Out[86]:
['MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Temp9am',
'Temp3pm']

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss = ss.fit(Xtrain.loc[:,col])
Xtrain.loc[:,col] = ss.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = ss.transform(Xtest.loc[:,col])

Xtrain.head()
Out[89]:
Month Climate MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8.0 1.0 0.826375 1.774044 ... 7.0 1.612270 0.000000 0.0
1 12.0 0.0 -0.427048 0.244031 ... 7.0 -0.366608 0.270238 0.0
2 4.0 4.0 0.121324 -0.089790 ... 3.0 0.078256 -0.132031 0.0
3 11.0 4.0 0.262334 0.911673 ... 6.0 0.231658 0.830540 0.0
4 4.0 2.0 -0.975421 0.035393 ... 4.0 -0.704091 0.097837 0.0

[5 rows x 22 columns]

Xtest.head()
Out[90]:
Month Climate MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 1.0 0.0 1.531425 0.633489 ... 7.0 1.412848 0.198404 1.0
1 3.0 4.0 -0.035354 -0.646158 ... 7.0 -0.335927 -0.606132 1.0
2 3.0 0.0 -0.489720 -1.383346 ... 7.0 0.000000 0.000000 0.0
3 10.0 6.0 0.136992 -0.409702 ... 7.0 -0.029125 -0.304431 0.0
4 11.0 4.0 -0.004018 -0.451429 ... 4.0 -0.520009 -0.390632 0.0

[5 rows x 22 columns]

Ytrain.head()
Out[91]:
0
0 0
1 0
2 0
3 1
4 0

Ytest.head()
Out[92]:
0
0 0
1 0
2 1
3 0
4 0
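
A side effect of combining mean imputation with standardization is visible in the output above: values that were missing show up as exactly 0.000000 after scaling (for example Temp3pm in the first training row). The sketch below reproduces this on a made-up single column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with one missing value in each set.
train = np.array([[10.0], [20.0], [np.nan], [30.0]])
test = np.array([[np.nan], [25.0]])

imputer = SimpleImputer(strategy="mean").fit(train)  # training mean is 20
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)

scaler = StandardScaler().fit(train_filled)          # mean 20, std sqrt(50)
train_scaled = scaler.transform(train_filled)
test_scaled = scaler.transform(test_filled)
```

Filling with the mean leaves the column mean unchanged, so every imputed entry lands exactly on zero once the mean is subtracted, in both the training and the test set.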

1.4 Modelling and model evaluation
from time import time
import datetime
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, recall_score

# The model is of course the support vector classifier SVC; first we compare kernels to choose one
# We want to monitor accuracy, recall and the AUC score together
times = time() # SVM is computationally expensive, so we keep track of the running time
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest) # model predictions
    score = clf.score(Xtest,Ytest) # accuracy
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" % (kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

linear 's testing accuracy 0.844000, recall is 0.469388', auc is 0.869029
00:07:008706

poly 's testing accuracy 0.840667, recall is 0.457726', auc is 0.868157
00:07:870836

rbf 's testing accuracy 0.813333, recall is 0.306122', auc is 0.814873
00:10:493201

sigmoid 's testing accuracy 0.655333, recall is 0.154519', auc is 0.437308
00:11:395323

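We notice that accuracy and AUC are just about acceptable, but recall is low under every kernel. By comparison, the linear kernel actually performs best. So in which direction should we tune? What do we want most?
We can have different goals:
First, identify the minority class at any cost and get the highest recall.
Second, pursue the highest prediction accuracy: everything serves accuracy, and we do not care about recall or AUC.
Third, strike a balance between recall, ROC and accuracy, neither chasing nor sacrificing any one of them.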

1.5 Model tuning
1.5.1 Pursuing the highest recall
times = time()
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
              ,class_weight = "balanced"
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" %(kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

linear 's testing accuracy 0.796667, recall is 0.775510', auc is 0.870062
00:05:915204

poly 's testing accuracy 0.793333, recall is 0.763848', auc is 0.871448
00:06:949940

rbf 's testing accuracy 0.803333, recall is 0.600583', auc is 0.819713
00:09:667872

sigmoid 's testing accuracy 0.562000, recall is 0.282799', auc is 0.437119
00:11:633267
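
It helps to know what class_weight = "balanced" actually computes. scikit-learn documents the weights as n_samples / (n_classes * np.bincount(y)); the sketch below applies that formula to the training-set counts reported earlier (2704 No, 796 Yes).

```python
import numpy as np

counts = np.array([2704, 796])  # training-set counts of class 0 (No) and class 1 (Yes)
n_samples = counts.sum()
n_classes = len(counts)

# Weight per class under class_weight="balanced".
balanced_weights = n_samples / (n_classes * counts)

# Relative emphasis on the minority class compared with the majority class.
ratio = balanced_weights[1] / balanced_weights[0]
```

The weights come out at roughly 0.65 for class 0 and 2.20 for class 1, a ratio of about 3.4 to 1, so the {1:10} setting tried next is about three times more aggressive than "balanced".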

Having settled on the linear kernel, we can even tilt class_weight further towards the minority class to raise recall at any cost
times = time()
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
              ,class_weight = {1:10} # note: this sets class 1 to weight 10; class 0 implicitly keeps weight 1
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" %(kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

linear 's testing accuracy 0.636667, recall is 0.912536', auc is 0.866360
00:12:969724

poly 's testing accuracy 0.634667, recall is 0.912536', auc is 0.866885
00:14:926113

rbf 's testing accuracy 0.790000, recall is 0.553936', auc is 0.802820
00:18:623275

sigmoid 's testing accuracy 0.228667, recall is 1.000000', auc is 0.436592
00:21:038490

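As recall climbs unchecked, accuracy drops sharply, yet the AUC holds up well, staying around 0.86 for the linear and poly kernels. If our goal is a fairly high AUC together with good recall, the model is already doing well. Accuracy is now low, but with the linear kernel we do catch the vast majority of rainy days (recall of about 0.91).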

1.5.2 追求最高准确率

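If our samples are highly imbalanced but many majority-class samples are misclassified, we can let the model simply predict every sample as 0 and ignore the minority class entirely.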

Ytrain = Ytrain.iloc[:,0].ravel()
Ytest = Ytest.iloc[:,0].ravel()

valuec = pd.Series(Ytest).value_counts()
valuec
Out[98]:
0 1157
1 343
dtype: int64

valuec[0]/valuec.sum()
Out[99]: 0.7713333333333333

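Predicting every sample as the majority class would give an accuracy of 0.7713333333333333, whereas the accuracy above is 0.844000, which means that some minority-class samples are also classified correctly.

We can use the confusion matrix to compute the specificity. If the specificity is already very high, there is little room left to improve on the majority class.
from sklearn.metrics import confusion_matrix as CM
clf = SVC(kernel = "linear"
          ,gamma="auto"
          ,cache_size = 5000
         ).fit(Xtrain, Ytrain)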
result = clf.predict(Xtest)
cm = CM(Ytest,result,labels=(1,0))
cm
Out[100]:
array([[ 161, 182],
[ 52, 1105]], dtype=int64)

specificity = cm[1,1]/cm[1,:].sum()
specificity
Out[101]: 0.9550561797752809

# Almost all 0s are classified correctly, and quite a few 1s as well
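
To tie the confusion matrix back to the scores printed earlier, the sketch below recomputes recall, specificity, precision and accuracy by hand from the matrix shown above. Because of labels=(1,0), the rows are true class 1 then true class 0, and the columns are predicted class 1 then predicted class 0.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[161, 182],
               [52, 1105]])

tp, fn = cm[0]  # true 1: predicted 1, predicted 0
fp, tn = cm[1]  # true 0: predicted 1, predicted 0

recall = tp / (tp + fn)       # share of rainy days that were caught
specificity = tn / (tn + fp)  # share of dry days that were caught
precision = tp / (tp + fp)    # share of rain predictions that were right
accuracy = (tp + tn) / cm.sum()
```

These reproduce the recall of 0.469388 and accuracy of 0.844000 reported for the linear kernel, and the specificity of 0.955 computed above.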

We can try using class_weight to nudge the model slightly towards the minority class, to see whether there is still room to improve accuracy. If a higher accuracy appears during this slight adjustment, the model has not yet reached its limit.
irange = np.linspace(0.01,0.05,10)
for i in irange:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" % (1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
under ratio 1:1.010000 testing accuracy 0.844667, recall is 0.475219', auc is 0.869157
00:05:282753
under ratio 1:1.014444 testing accuracy 0.844667, recall is 0.478134', auc is 0.869185
00:06:590835
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869198
00:06:539148
under ratio 1:1.023333 testing accuracy 0.845333, recall is 0.481050', auc is 0.869175
00:06:203914
under ratio 1:1.027778 testing accuracy 0.844000, recall is 0.481050', auc is 0.869394
00:06:585682
under ratio 1:1.032222 testing accuracy 0.844000, recall is 0.481050', auc is 0.869528
00:06:291609
under ratio 1:1.036667 testing accuracy 0.844000, recall is 0.481050', auc is 0.869659
00:05:643494
under ratio 1:1.041111 testing accuracy 0.844667, recall is 0.483965', auc is 0.869629
00:06:332509
under ratio 1:1.045556 testing accuracy 0.844667, recall is 0.483965', auc is 0.869712
00:06:435075
under ratio 1:1.050000 testing accuracy 0.845333, recall is 0.486880', auc is 0.869863
00:06:337505

A pleasant surprise: the highest accuracy is 84.53%, higher than the 84.40% we got without any adjustment. The model evidently still has some potential, so we can refine the learning curve further.
irange_ = np.linspace(0.018889,0.027778,10)
for i in irange_:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" %(1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869213
00:05:366834
under ratio 1:1.019877 testing accuracy 0.844000, recall is 0.478134', auc is 0.869228
00:05:415828
under ratio 1:1.020864 testing accuracy 0.844000, recall is 0.478134', auc is 0.869218
00:05:250741
under ratio 1:1.021852 testing accuracy 0.844667, recall is 0.478134', auc is 0.869188
00:05:137647
under ratio 1:1.022840 testing accuracy 0.844667, recall is 0.478134', auc is 0.869220
00:05:145678
under ratio 1:1.023827 testing accuracy 0.844667, recall is 0.481050', auc is 0.869188
00:05:224714
under ratio 1:1.024815 testing accuracy 0.844667, recall is 0.481050', auc is 0.869231
00:04:954503
under ratio 1:1.025803 testing accuracy 0.844000, recall is 0.481050', auc is 0.869253
00:05:323782
under ratio 1:1.026790 testing accuracy 0.844000, recall is 0.481050', auc is 0.869314
00:05:072606
under ratio 1:1.027778 testing accuracy 0.844667, recall is 0.481050', auc is 0.869374
00:05:165673

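The results are not much better: no setting beats the 84.53% accuracy we already found. Without sample balancing, the model's accuracy is evidently very close to its limit, and nudging the model toward the minority class cannot produce a qualitative change.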
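The effect that this class_weight sweep relies on is easier to see on a small synthetic problem. The sketch below is only an illustration: the data (`X_demo`, `y_demo`), the cluster positions, and the two weights are all made up and have nothing to do with the weather dataset. It fits a linear SVC twice and reports accuracy and recall for each weight on the minority class.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 900 negatives and 100 positives
# drawn from two overlapping 2-D Gaussian clusters.
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(900, 2)),
                    rng.normal(1.5, 1.0, size=(100, 2))])
y_demo = np.array([0] * 900 + [1] * 100)

results = {}
for w in (1.0, 10.0):
    # class_weight={1: w} multiplies the penalty C by w for class 1 only
    clf_demo = SVC(kernel="linear", class_weight={1: w}).fit(X_demo, y_demo)
    pred_demo = clf_demo.predict(X_demo)
    results[w] = (clf_demo.score(X_demo, y_demo), recall_score(y_demo, pred_demo))
    print("weight %.1f: accuracy %f, recall %f" % (w, results[w][0], results[w][1]))
```

A heavier weight on class 1 makes misclassified minority samples more expensive, so the boundary tends to move toward the majority class: recall on the minority class goes up, usually at some cost in accuracy. Weights as close to 1 as those in the sweep above move the boundary very little, which would be consistent with the nearly flat results.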

If we really want to push accuracy higher, the only option is to switch to a different model.

from sklearn.linear_model import LogisticRegression as LR

logclf = LR(solver="liblinear").fit(Xtrain, Ytrain)
logclf.score(Xtest,Ytest)
Out[105]: 0.8486666666666667

C_range = np.linspace(3,5,10)
for C in C_range:
    logclf = LR(solver="liblinear",C=C).fit(Xtrain, Ytrain)
    print(C,logclf.score(Xtest,Ytest))
3.0 0.8493333333333334
3.2222222222222223 0.8493333333333334
3.4444444444444446 0.8493333333333334
3.6666666666666665 0.8493333333333334
3.888888888888889 0.8493333333333334
4.111111111111111 0.8493333333333334
4.333333333333333 0.8493333333333334
4.555555555555555 0.8493333333333334
4.777777777777778 0.8493333333333334
5.0 0.8493333333333334

Although we achieved a very small improvement, the model's accuracy clearly still has not changed qualitatively.

1.5.3 Pursuing balance
import matplotlib.pyplot as plt
C_range = np.linspace(0.01,20,20)
recallall = []
aucall = []
scoreall = []
for C in C_range:
    times = time()
    clf = SVC(kernel = "linear",C=C,cache_size = 5000
              ,class_weight = "balanced"
              ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    recallall.append(recall)
    aucall.append(auc)
    scoreall.append(score)
    print("under C %f, testing accuracy is %f,recall is %f', auc is %f" % (C,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

print(max(aucall),C_range[aucall.index(max(aucall))])
plt.figure()
plt.plot(C_range,recallall,c="red",label="recall")
plt.plot(C_range,aucall,c="black",label="auc")
plt.plot(C_range,scoreall,c="orange",label="accuracy")
plt.legend()
plt.show()
under C 0.010000, testing accuracy is 0.800000,recall is 0.752187', auc is 0.870634
00:00:759538
under C 1.062105, testing accuracy is 0.796000,recall is 0.775510', auc is 0.870024
00:06:290471
under C 2.114211, testing accuracy is 0.794000,recall is 0.772595', auc is 0.870160
00:11:766379
under C 3.166316, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870165
00:15:810231
under C 4.218421, testing accuracy is 0.796000,recall is 0.775510', auc is 0.870112
00:20:414523
under C 5.270526, testing accuracy is 0.796000,recall is 0.775510', auc is 0.870082
00:24:632509
under C 6.322632, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870100
00:29:588032
under C 7.374737, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870022
00:34:028188
under C 8.426842, testing accuracy is 0.796000,recall is 0.775510', auc is 0.870090
00:37:620724
under C 9.478947, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870123
00:44:379563
under C 10.531053, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870092
00:47:067436
under C 11.583158, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870097
00:50:203707
under C 12.635263, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870019
00:55:903715
under C 13.687368, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870039
00:58:575636
under C 14.739474, testing accuracy is 0.795333,recall is 0.772595', auc is 0.869986
01:04:257676
under C 15.791579, testing accuracy is 0.795333,recall is 0.772595', auc is 0.869997
01:07:978324
under C 16.843684, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870032
01:13:093954
under C 17.895789, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870024
01:16:849644
under C 18.947895, testing accuracy is 0.795333,recall is 0.772595', auc is 0.870014
01:21:223715
under C 20.000000, testing accuracy is 0.794667,recall is 0.772595', auc is 0.870047
01:25:687908
0.8706340666900172 0.01
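The runs above all use class_weight="balanced". As a rough sketch of what that setting computes, scikit-learn documents the weights as n_samples / (n_classes * np.bincount(y)). The helper name `balanced_class_weights` and the label vector `y_demo` below are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def balanced_class_weights(y):
    """Per-class weights following the documented 'balanced' formula:
    n_samples / (n_classes * number_of_samples_in_that_class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical labels: 900 "no rain" (0) and 100 "rain" (1)
y_demo = np.array([0] * 900 + [1] * 100)
weights_demo = balanced_class_weights(y_demo)
print(weights_demo)
```

With a 9:1 imbalance the minority class is weighted nine times as heavily as the majority class, which is far more aggressive than the ratios near 1:1.02 tried earlier. That would explain why recall jumps to about 0.77 while accuracy drops to about 0.80.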

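Once C exceeds 1, the model's performance stabilizes, and increasing C further brings no significant improvement. The range we set for C was probably too wide; however, widening or narrowing it further would only move the AUC around 0.87. Tuning C cannot produce a qualitative change in any of the model's metrics. We now plug a C value from this grid into the model and look at the exact accuracy and recall. The code below uses C=3.1663157894736838; note that the highest AUC printed above, 0.8706, was obtained at C=0.01.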

times = time()
clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
,class_weight = "balanced"
).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
score = clf.score(Xtest,Ytest)
recall = recall_score(Ytest, result)
auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
print("testing accuracy %f,recall is %f', auc is %f" % (score,recall,auc))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
testing accuracy 0.795333,recall is 0.772595', auc is 0.870165
00:16:030390

Can we improve this model by adjusting the decision threshold?
from sklearn.metrics import roc_curve as ROC
import matplotlib.pyplot as plt

FPR, Recall, thresholds = ROC(Ytest,clf.decision_function(Xtest),pos_label=1)
area = roc_auc_score(Ytest,clf.decision_function(Xtest))
plt.figure()
plt.plot(FPR, Recall, color='red',
label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

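Taking this model as the basis, we solve for the best threshold, that is, the point on the ROC curve where Recall - FPR is largest.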
maxindex = (Recall - FPR).tolist().index(max(Recall - FPR))
thresholds[maxindex]
Out[111]: -0.08950517388953827
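The selection rule used here, maximizing Recall - FPR, is commonly known as Youden's J statistic. The following is a minimal NumPy sketch of the same rule on made-up labels and scores; `youden_threshold`, `y_demo`, and `s_demo` are hypothetical names, and the loop tries every distinct score as a threshold rather than calling roc_curve.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR,
    where a sample is predicted positive when score >= threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    best_thr, best_j = None, -np.inf
    # Try thresholds from highest to lowest, keeping the first maximum
    for thr in np.unique(scores)[::-1]:
        pred = scores >= thr
        tpr = (pred & (y_true == 1)).sum() / n_pos
        fpr = (pred & (y_true == 0)).sum() / n_neg
        if tpr - fpr > best_j:
            best_thr, best_j = thr, tpr - fpr
    return best_thr, best_j

# Hypothetical decision_function outputs for 4 negatives and 4 positives
y_demo = [0, 0, 0, 1, 0, 1, 1, 1]
s_demo = [-2.0, -1.5, -0.5, -0.3, 0.1, 0.4, 1.2, 2.0]
thr_demo, j_demo = youden_threshold(y_demo, s_demo)
print(thr_demo, j_demo)  # 0.4 0.75
```

For decision_function scores the default cut-off is 0, so a selected threshold close to 0, like the -0.0895 found above, suggests the default decision rule is already near the point this criterion prefers.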

Based on the best threshold we selected, we manually determine y_predict and compute the recall and accuracy at this threshold.
from sklearn.metrics import accuracy_score as AC
times = time()
clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
,class_weight = "balanced"
).fit(Xtrain, Ytrain)

prob = pd.DataFrame(clf.decision_function(Xtest))
prob.loc[prob.iloc[:,0] >= thresholds[maxindex],"y_pred"]=1
prob.loc[prob.iloc[:,0] < thresholds[maxindex],"y_pred"]=0
prob.loc[:,"y_pred"].isnull().sum()
Out[113]: 0

# Check the accuracy and recall under the custom threshold
score = AC(Ytest,prob.loc[:,"y_pred"].values)
recall = recall_score(Ytest, prob.loc[:,"y_pred"])
print("testing accuracy %f,recall is %f" % (score,recall))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
testing accuracy 0.789333,recall is 0.804665
01:04:299623

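Overall this is no better than the unadjusted model: recall rises from 0.7726 to 0.8047, but accuracy falls from 0.7953 to 0.7893. So if we are pursuing balance, SVC's own result is already very close to the best we can get. Adjusting the threshold, the parameter C, or class_weight is not guaranteed to help.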
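The trade-off between recall and accuracy when the threshold moves can be reproduced on a toy example. The sketch below uses a vectorized NumPy comparison in place of the DataFrame assignment; `evaluate_at_threshold` and the demo arrays are hypothetical and not taken from the weather data.

```python
import numpy as np

def evaluate_at_threshold(y_true, scores, threshold):
    """Predict 1 where score >= threshold, then return
    (y_pred, accuracy, recall) for the positive class."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    accuracy = (y_pred == y_true).mean()
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    recall = tp / (tp + fn)
    return y_pred, accuracy, recall

# Hypothetical decision scores: 6 negatives and 3 positives
y_demo = np.array([0, 0, 0, 1, 0, 0, 0, 1, 1])
s_demo = np.array([-2.0, -1.5, -1.0, -0.3, -0.2, -0.1, 0.3, 0.5, 1.0])

results = {}
for thr in (0.0, -0.35):
    _, acc, rec = evaluate_at_threshold(y_demo, s_demo, thr)
    results[thr] = (acc, rec)
    print("threshold %.2f: accuracy %f, recall %f" % (thr, acc, rec))
```

In this toy case, lowering the threshold from 0 to -0.35 captures the one missed positive but also turns two more negatives into false positives, so recall rises while accuracy falls. This is the same direction of change as in the experiment above.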
