Machine Learning: SVM in scikit-learn

Original post

2018-11-04 12:11

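A support vector machine (SVM) is an effective supervised learning method for both classification and regression, built on the idea of a maximum-margin classifier. Understanding SVMs properly takes a fair amount of mathematical background, in particular Lagrangian duality. Given my limited expertise, this post does not implement the SVM algorithm in Python from scratch; instead it uses the libsvm-based SVC in scikit-learn to give a short introduction to how support vector machines are used.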

An intuitive view of the maximum margin

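Given a dataset, we want a hyperplane that separates its classes well. Take the 2D dataset in the figure below as an example: in 2D, the separating hyperplane is simply a line.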

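Clearly this dataset has more than one solution, since many different lines separate the two classes. Which one is optimal? That is the point of looking for the maximum margin: the optimal solution should be a stable, reliable decision boundary. In the figure below, the black line is the optimal solution, and the two blue lines are the margin boundaries, which pass through the support vectors.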

Figure: several lines that all "solve" the separation problem
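The question of which separating line is best can be made concrete with a few lines of NumPy. The sketch below uses a made-up six-point dataset and three hand-picked lines (all of them assumptions for illustration, not the data from the figure). Every line separates the two groups, but they differ in how far the nearest point sits from the line, and that distance is what the maximum-margin idea compares.

```python
import numpy as np

# Toy 2D data (assumed for illustration): two linearly separable groups.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# Three hand-picked lines w.x + b = 0 that all separate the two groups.
candidates = {
    "vertical x = 3": (np.array([1.0, 0.0]), -3.0),
    "horizontal y = 3": (np.array([0.0, 1.0]), -3.0),
    "diagonal x + y = 5.5": (np.array([1.0, 1.0]), -5.5),
}


def geometric_margin(w, b):
    """Distance from the line w.x + b = 0 to the closest data point."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)


margins = {name: geometric_margin(w, b) for name, (w, b) in candidates.items()}
for name, margin in margins.items():
    print("%s: margin = %.3f" % (name, margin))

best = max(margins, key=margins.get)
print("widest margin:", best)
```

Among these three candidates the diagonal line keeps the largest distance to its nearest point, so it is the most tolerant of small shifts in the data. An SVM searches over all lines for the one with the largest such distance instead of comparing a fixed shortlist.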

Mathematical formulation

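1. The class labels y_i are defined as +1 and -1, which is convenient for notation and computation.
2. Suppose a hyperplane separates the dataset well, with equation f(x) = w^T x + b = 0. Points on one side satisfy f(x_i) >= 0 with y_i = +1, and points on the other side satisfy f(x_i) <= 0 with y_i = -1. The margin boundaries, on which the support vectors lie, are f(x) = w^T x + b = ±1.
3. Combining the two conditions above, the SVM problem becomes finding the optimal solution subject to y_i f(x_i) >= 1.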
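The constraint y_i f(x_i) >= 1 can be checked numerically on a fitted model. The following is a minimal sketch, assuming a small hand-made dataset and a very large C so that scikit-learn's soft-margin SVC behaves approximately like the hard-margin formulation described here. The values are only expected to satisfy the constraint up to the solver's tolerance.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels -1 / +1 (assumed for illustration).
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

f = clf.decision_function(X)   # f(x_i) = w^T x_i + b
functional_margin = y * f      # y_i * f(x_i)

print("y_i * f(x_i):", np.round(functional_margin, 3))
print("support vector indices:", clf.support_)
```

Points listed in `support_` should sit at approximately y_i f(x_i) = 1, that is, on the margin boundaries, while all other points give larger values.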

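From geometry, each margin boundary lies at distance 1/||w|| from the hyperplane, so the gap between the two parallel boundaries is 2/||w||. Maximizing the margin is therefore equivalent to: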

$$
\begin{cases}
\min_{w,b} \ \frac{1}{2}\lVert w \rVert^{2} \\
\text{s.t. } y_i (w^{T} x_i + b) - 1 \geq 0, \quad i = 1, \dots, n
\end{cases}
$$
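For readers who want the geometric step spelled out, here is a short worked derivation. The points x_+ and x_- are notation introduced only for this sketch: x_+ is any point on the boundary f(x) = +1 and x_- any point on the boundary f(x) = -1.

```latex
w^{T} x_{+} + b = +1, \qquad w^{T} x_{-} + b = -1
\;\Longrightarrow\; w^{T}(x_{+} - x_{-}) = 2

\text{width} = \frac{w^{T}(x_{+} - x_{-})}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}

\max_{w,b} \frac{2}{\lVert w \rVert}
\;\Longleftrightarrow\;
\min_{w,b} \frac{1}{2}\lVert w \rVert^{2}
```

The width is the projection of x_+ - x_- onto the unit normal w/||w||, so each boundary lies 1/||w|| from the hyperplane. Squaring the norm and adding the factor 1/2 does not change the minimizer; it only makes the objective smooth and convenient to differentiate.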

Using Lagrange multipliers and duality, the problem above can be turned into solving:

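$$
L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^{2} - \sum_{i=1}^{n} \alpha_i \left( y_i (w^{T} x_i + b) - 1 \right), \qquad \alpha_i \geq 0
$$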
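The multipliers in this Lagrangian are not purely abstract: a fitted scikit-learn SVC exposes them. The sketch below, on an assumed toy dataset with a very large C to approximate the hard-margin case, reads `dual_coef_`, which stores the products α_i y_i for the support vectors only (every other point has α_i = 0). It then uses the standard stationarity result w = Σ α_i y_i x_i to rebuild the weight vector and compares it with `coef_`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels -1 / +1 (assumed for illustration).
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i, one entry per support vector.
alpha_times_y = clf.dual_coef_[0]

# Rebuild w from the dual variables: w = sum_i alpha_i * y_i * x_i.
w_from_dual = alpha_times_y @ clf.support_vectors_

print("alpha_i * y_i:", np.round(alpha_times_y, 3))
print("w rebuilt from the dual:", np.round(w_from_dual, 3))
print("w reported by coef_:    ", np.round(clf.coef_[0], 3))
```

The two weight vectors should agree, and the entries of `dual_coef_` should sum to approximately zero, which is the second stationarity condition Σ α_i y_i = 0.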

Primal problem

Optimization problem

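First minimize L by taking the partial derivatives with respect to w and b and setting them to zero, then maximize over α. The derivation is omitted here and only its final result is given.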
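For reference, carrying out these two steps on the hard-margin Lagrangian gives the standard textbook dual problem. It is written here with the same symbols as above; the index j is introduced only for the double sum.

```latex
\frac{\partial L}{\partial w} = 0 \;\Longrightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^{T} x_j
\quad \text{s.t.} \quad \alpha_i \geq 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0
```

The data enter the dual only through the inner products x_i^T x_j, which is what allows a kernel such as RBF to be substituted for the inner product.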

The SVM package in scikit-learn

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import scale
# train_test_split and GridSearchCV live in sklearn.model_selection; the old
# sklearn.cross_validation and sklearn.grid_search modules were removed in 0.20.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import numpy as np

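# Load the comma-separated data: two feature columns followed by the class label.
data = np.loadtxt(fname='mlslpic/logistci_data1.txt', dtype='float', delimiter=',')
dataSet = data[:, 0:2]
classLabel = data[:, 2]
# Standardize each feature to zero mean and unit variance.
X = scale(dataSet)
y = classLabel
# Default split: 75% training, 25% test.
X_train, X_test, y_train, y_test = train_test_split(X, y)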

# A one-step pipeline wrapping an RBF-kernel SVC.
pipeline = Pipeline([
    ('clf', SVC(kernel='rbf'))
])
# Candidate values for the RBF kernel width (gamma) and the penalty term (C).
parameters = {
    'clf__gamma': (0.01, 0.03, 0.1),
    'clf__C': (0.01, 0.1, 0.3)
}

# cv=3 matches the 3-fold run shown in the results; recent scikit-learn versions default to 5 folds.
grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=1, verbose=1, scoring='accuracy', refit=True)
grid_search.fit(X_train, y_train)

print('Best score: %.3f' % grid_search.best_score_)
print('Best parameter set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('%s, %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print('Classification report:\n', classification_report(y_test, predictions))

Results

Fitting 3 folds for each of 9 candidates, totalling 27 fits
Best score: 0.867
Best parameter set:
clf__C, 0.3
clf__gamma, 0.1
Classification report:
             precision    recall  f1-score   support

        0.0       0.92      0.92      0.92        12
        1.0       0.92      0.92      0.92        13

avg / total       0.92      0.92      0.92        25
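The run above depends on a local data file, so it cannot be reproduced as-is. The following is a self-contained variant of the same grid search, offered as a sketch rather than a replacement: the data come from `make_classification` (a synthetic stand-in, not the article's dataset), and the scaler is moved inside the pipeline so that it is fitted on the training folds only. The step name `scale` and the variable names are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the article's data file (an assumption): 100 samples, 2 features.
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),   # re-fitted on each training fold during the search
    ("clf", SVC(kernel="rbf")),
])
param_grid = {
    "clf__gamma": (0.01, 0.03, 0.1),
    "clf__C": (0.01, 0.1, 0.3),
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("best CV accuracy: %.3f" % search.best_score_)
print("best parameters:", search.best_params_)
print("test accuracy: %.3f" % search.score(X_test, y_test))
```

Keeping the scaler in the pipeline means the cross-validation folds never see statistics computed from their own validation data, which tends to give a more honest estimate than scaling the whole dataset up front. The numbers printed here will differ from the results above because the data are different.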

ROC curve

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Use continuous decision scores rather than hard 0/1 predictions,
# so the curve has more than a single operating point.
scores = grid_search.decision_function(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, scores, pos_label=1)
roc_auc = auc(false_positive_rate, recall)
print(roc_auc)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal: performance of a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('Fall-out')
plt.ylabel('Recall')
plt.show()

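The closer the ROC curve lies to the top-left corner, the higher the recall at a given false positive rate (and the closer the AUC is to 1), so the better the trained classifier.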
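To see how the points of a ROC curve arise, it helps to run `roc_curve` on a handful of values small enough to check by hand. The labels and scores below are made up for illustration. Each threshold on the score yields one (fall-out, recall) pair, and the AUC equals the fraction of positive/negative pairs in which the positive sample receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Four hand-made samples (assumed for illustration): true labels and classifier scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

false_positive_rate, recall, thresholds = roc_curve(y_true, scores, pos_label=1)
roc_auc = auc(false_positive_rate, recall)

for fpr, tpr in zip(false_positive_rate, recall):
    print("fall-out = %.2f, recall = %.2f" % (fpr, tpr))
print("AUC = %.2f" % roc_auc)   # prints "AUC = 0.75"
```

Of the four positive/negative pairs, three are ranked correctly (only the positive scored 0.35 falls below the negative scored 0.4), which gives the AUC of 0.75.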

Appendix: SVM parameters in scikit-learn
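One way to list the parameters of the installed version, together with their default values, is to ask an unfitted estimator directly. This is a small sketch; the exact set of parameters printed depends on the scikit-learn version in use.

```python
from sklearn.svm import SVC

clf = SVC()
params = clf.get_params()

# Print every constructor parameter of SVC with its default value.
for name in sorted(params):
    print("%s = %r" % (name, params[name]))
```

The same names, prefixed with the pipeline step name and a double underscore (for example `clf__C` and `clf__gamma`), are the keys used in the grid search parameter dictionary.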
