Machine Learning: Strengths and Weaknesses of SVM Kernel Functions

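At first glance, every kernel except the sigmoid kernel seems to perform reasonably well. In practice, though, rbf and poly each have drawbacks of their own. We use the breast cancer dataset as an example to demonstrate: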

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from time import time
import datetime

# Load the dataset and extract the features and labels
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and test sets
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3,random_state=420)

Kernel = ["linear","poly","rbf","sigmoid"]

for kernel in Kernel:
    time0 = time()
    clf= SVC(kernel = kernel
             , gamma="auto"
            ).fit(Xtrain,Ytrain)
    print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
    # Elapsed time, printed as minutes:seconds:microseconds
    print(datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

The accuracy under kernel linear is 0.929825
00:00:308202

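However, the run never finishes: after the result for the linear kernel is printed, nothing more appears. This shows that the polynomial kernel, at its default degree, takes an enormous amount of time here. Remove the polynomial kernel from the loop and see whether the remaining kernels produce results: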

Kernel = ["linear","rbf","sigmoid"]

for kernel in Kernel:
    time0 = time()
    clf= SVC(kernel = kernel
             , gamma="auto"
            ).fit(Xtrain,Ytrain)
    print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
    # Elapsed time, printed as minutes:seconds:microseconds
    print(datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

The accuracy under kernel linear is 0.929825
00:00:386003
The accuracy under kernel rbf is 0.596491
00:00:030880
The accuracy under kernel sigmoid is 0.596491
00:00:006026
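As an aside on the stalled polynomial run: SVC accepts a max_iter parameter (default -1, meaning no limit) that caps the number of solver iterations. The sketch below is my own illustration, not part of the original experiment. It caps the solver at an arbitrarily chosen 10,000 iterations so that the default degree-3 polynomial kernel returns. The resulting model may not have converged, so its accuracy is only a rough indication.

```python
import warnings
from time import perf_counter

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data and split as in the walkthrough
X, y = load_breast_cancer(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

start = perf_counter()
with warnings.catch_warnings():
    # Silence the early-termination warning; we expect the cap to be hit
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    # max_iter=10_000 is an assumed, arbitrary cap chosen for illustration
    capped_clf = SVC(kernel="poly", gamma="auto", max_iter=10_000).fit(Xtrain, Ytrain)
elapsed = perf_counter() - start

capped_score = capped_clf.score(Xtest, Ytest)
print("poly (capped) accuracy: %f, elapsed: %.3f s" % (capped_score, elapsed))
```

A cap like this only makes the run return; it does not address why the solver struggles in the first place.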

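The three-kernel run above shows two things. First, the breast cancer dataset behaves like a linear dataset: the linear kernel performs well, while rbf and sigmoid, the two kernels suited to nonlinear data, are unusable judging by these results. Second, the linear kernel runs far slower than the two nonlinear kernels.
If the data is linear, then setting the degree parameter to 1 should let the polynomial kernel achieve good results too: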

# If the data is linear, set the polynomial kernel's degree to 1
Kernel = ["linear","poly","rbf","sigmoid"]

for kernel in Kernel:
    time0 = time()
    clf= SVC(kernel = kernel
             , gamma="auto"
             , degree = 1
            ).fit(Xtrain,Ytrain)
    print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
    print(datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

The accuracy under kernel linear is 0.929825
00:00:365032
The accuracy under kernel poly is 0.923977
00:00:039890
The accuracy under kernel rbf is 0.596491
00:00:022904
The accuracy under kernel sigmoid is 0.596491
00:00:007016

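The polynomial kernel immediately runs much faster, and its accuracy rises close to that of the linear kernel, which is good news. But earlier experiments showed that rbf can also perform very well on linear data, so why are its results here so poor?
The real problem is the scale of the features. Recall how we solve for the decision boundary and how we decide which side of it a point falls on: by computing a "distance". SVM is not purely a distance-based model, but it is heavily affected by feature scale. Let's explore the feature scales in the breast cancer dataset: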

# Inspect the feature scales
import pandas as pd
data = pd.DataFrame(X)
data.describe().T
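The summary table is not reproduced here, so as a rough stand-in the sketch below (my own illustration, with X_raw and feature_std as assumed names) prints the smallest and largest per-feature standard deviation and their ratio, which gives one number for how far apart the feature scales are.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Raw, unscaled feature matrix (569 samples, 30 features)
X_raw = load_breast_cancer().data

# Spread of each feature, measured by its standard deviation
feature_std = X_raw.std(axis=0)

print("smallest feature std: %g" % feature_std.min())
print("largest feature std:  %g" % feature_std.max())
print("ratio largest/smallest: %g" % (feature_std.max() / feature_std.min()))
```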

At a glance, the features are indeed on very different scales. We use the standardization class from data preprocessing to standardize the data:

# Standardize the data to remove the scale mismatch
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
data = pd.DataFrame(X)
data.describe().T
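As a quick check in place of the table, the sketch below (my own illustration) confirms that every standardized column has mean close to 0 and standard deviation close to 1. It then compares the average RBF kernel value between distinct samples on raw and on standardized features, using gamma = 1 / n_features, which is what gamma="auto" resolves to. The helper mean_off_diagonal is a hypothetical name introduced here. If the raw-data average is close to zero, the kernel treats almost every pair of samples as unrelated, which is one plausible reading of why rbf performed so poorly before scaling.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

X_raw = load_breast_cancer().data
X_std = StandardScaler().fit_transform(X_raw)

# After standardization each column should have mean ~0 and std ~1
print("means ~ 0:", np.allclose(X_std.mean(axis=0), 0.0, atol=1e-8))
print("stds  ~ 1:", np.allclose(X_std.std(axis=0), 1.0))

def mean_off_diagonal(K):
    """Average kernel value over pairs of distinct samples."""
    n = K.shape[0]
    return (K.sum() - np.trace(K)) / (n * (n - 1))

gamma = 1.0 / X_raw.shape[1]  # the value gamma="auto" uses
raw_similarity = mean_off_diagonal(rbf_kernel(X_raw, gamma=gamma))
std_similarity = mean_off_diagonal(rbf_kernel(X_std, gamma=gamma))

print("mean RBF similarity, raw features:          %g" % raw_similarity)
print("mean RBF similarity, standardized features: %g" % std_similarity)
```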

With standardization done, loop SVC over the kernels again, this time with degree set to 1, and observe how each kernel performs on the rescaled data:

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3,random_state=420)

Kernel = ["linear","poly","rbf","sigmoid"]

for kernel in Kernel:
    time0 = time()
    clf= SVC(kernel = kernel
             , gamma="auto"
             , degree = 1
            ).fit(Xtrain,Ytrain)
    print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
    print(datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

The accuracy under kernel linear is 0.976608
00:00:004986
The accuracy under kernel poly is 0.964912
00:00:003990
The accuracy under kernel rbf is 0.970760
00:00:004987
The accuracy under kernel sigmoid is 0.953216
00:00:003989

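With the feature scales unified, every kernel's running time drops sharply, most of all for the linear kernel, and the polynomial kernel (with degree set to 1) is now among the fastest. In addition, rbf now delivers excellent results. From this exploration we can conclude:
1. The linear kernel, and especially the polynomial kernel with high-degree terms, is very slow to compute.
2. rbf and the polynomial kernel both handle datasets with inconsistent feature scales poorly.
Fortunately, both shortcomings can be addressed by rescaling the data. It is therefore strongly recommended to rescale the data before running an SVM!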
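One way to follow this recommendation is to bundle the scaler and the classifier together. The walkthrough above standardizes the full dataset before splitting it, so statistics from the test rows also influence the scaling. The sketch below is a variant under my own assumptions, not the original experiment. It uses make_pipeline so that StandardScaler is fitted on the training portion only, so the accuracies it prints may differ slightly from those shown earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_all = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_raw, y_all, test_size=0.3, random_state=420)

pipeline_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # The scaler is fitted inside fit(), on the training rows only,
    # and the same transformation is reused when scoring the test rows
    model = make_pipeline(
        StandardScaler(),
        SVC(kernel=kernel, gamma="auto", degree=1),
    )
    model.fit(Xtr, ytr)
    pipeline_scores[kernel] = model.score(Xte, yte)
    print("The accuracy under kernel %s is %f" % (kernel, pipeline_scores[kernel]))
```

Keeping the scaler inside the pipeline also means the same object can be passed to cross-validation utilities without rescaling by hand.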
