Machine Learning: SVM Linear Classification

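Introduction:

This post walks through a simple example of linear SVM classification in MATLAB using the LIBSVM library.

The exercise:

This is a course exercise from Stanford University (I also recommend the university's site while I'm at it). Here is the link to the exercise: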
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex7/ex7.html
The data can be downloaded here:
http://openclassroom.stanford.edu/MainFolder/courses/MachineLearning/exercises/ex7materials/ex7Data.zip

Installing LIBSVM:

See this blogger's post, which is really well written:
https://blog.csdn.net/qq_31781741/article/details/82666861#commentBox

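SVM:

The SVM formulation we use is given below. Its derivation and solution are fairly involved, so only the formula is given here and SVMs are not explained in detail; there are many excellent explanations online.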
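For reference, the standard soft-margin SVM primal problem with a linear kernel, which is what LIBSVM's C-SVC mode (-s 0) solves, is usually written as below. The notation here (m training examples, slack variables \xi_i) is my own assumption and may differ from the course notes.

```latex
\min_{w,\,b,\,\xi} \quad \frac{1}{2}\, w^{T} w + C \sum_{i=1}^{m} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m
```

The first term favors a wide margin, and the second term charges a cost C for every unit by which a training example violates that margin.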

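Two-dimensional classification problem:

(1) First, consider a classification problem with two features. Load the "twofeature.txt" data file into Matlab/Octave with the following command:
[trainlabels, trainfeatures] = libsvmread('twofeature.txt');
Note that this file is formatted for LIBSVM, so it cannot be loaded with the usual Matlab/Octave commands. (The "usual command" here means load. I think the usual commands could actually be used too, but the file would need some extra format handling.)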
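To give an idea of what that extra format handling might look like, here is a minimal sketch that parses LIBSVM-style text rows ("label index:value index:value ...") by hand. The two sample rows are made up, the sketch assumes exactly two features and single-space separators, and libsvmread remains the simpler choice in practice.

```matlab
% Made-up sample rows in LIBSVM text format (NOT taken from twofeature.txt)
rows = {'1 1:0.5 2:1.5', '-1 1:2.0 2:0.25'};

n = numel(rows);
y = zeros(n, 1);      % class labels
x = zeros(n, 2);      % dense feature matrix; assumes exactly two features

for i = 1:n
    tokens = strsplit(strtrim(rows{i}), ' ');
    y(i) = str2double(tokens{1});            % first token is the label
    for j = 2:numel(tokens)
        pair = sscanf(tokens{j}, '%d:%f');   % [feature index; feature value]
        x(i, pair(1)) = pair(2);
    end
end

disp([y x])
```

Features missing from a row simply stay at zero, which matches the sparse convention of the format.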
(2) First, plot the points in twofeature.txt. The resulting figure is shown below:
Code: (note that this is MATLAB code!)

% Load training features and labels
[y, x] = libsvmread('twofeature.txt');
figure
pos = find(y == 1);
neg = find(y == -1);
plot(x(pos,1), x(pos,2), 'ko', 'MarkerFaceColor', 'b'); hold on;
plot(x(neg,1), x(neg,2), 'ko', 'MarkerFaceColor', 'g')

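There is a fairly obvious separation gap. However, the blue class has an outlier on the far left. We will now look at how this outlier affects the SVM decision boundary.

Setting C=1

The parameter C in the SVM optimization problem is a positive cost factor that penalizes misclassified training examples.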
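To make the role of C concrete, the sketch below evaluates the soft-margin objective 0.5*w'*w + C*sum(slack) for one fixed, hand-picked w and b on four made-up points. None of these numbers come from the exercise data; the point is only that the same margin violation costs far more when C is large.

```matlab
% Toy data (made up): the last point lies on the wrong side of the boundary
X = [1 1; 2 2; -1 -1; 0.5 0.5];
y = [1; 1; -1; -1];

% A fixed, hand-picked separator (hypothetical, not a trained model)
w = [1; 1];
b = 0;

% Slack of each example: how far it falls short of a functional margin of 1
slack = max(0, 1 - y .* (X * w + b));

% Soft-margin objective as a function of the cost C
objective = @(C) 0.5 * (w' * w) + C * sum(slack);

fprintf('total slack     = %g\n', sum(slack));
fprintf('objective C=1   = %g\n', objective(1));
fprintf('objective C=100 = %g\n', objective(100));
```

With a large C the optimizer is pushed to move the boundary so that the slack shrinks, even if that means a narrower margin.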
First, we run the classifier with C = 1:
model = svmtrain(trainlabels, trainfeatures, '-s 0 -t 0 -c 1');
After training, "model" is a struct containing the model parameters. We can now obtain w and b with the following code:

model = svmtrain(y, x, sprintf('-s 0 -t 0 -c %g', C));
w = model.SVs' * model.sv_coef
b = -model.rho
if (model.Label(1) == -1)
    w = -w; b = -b;
end
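A w and b recovered this way define the classifier completely: a point is labeled by the sign of w'*x + b, and in two dimensions the boundary is the line w(1)*x1 + w(2)*x2 + b = 0. Here is a small sketch using hypothetical values for w and b (not the values trained in this exercise):

```matlab
% Hypothetical parameters, chosen only for illustration
w = [1; 2];
b = -4;

% Solve w(1)*x1 + w(2)*x2 + b = 0 for x2 to get the boundary line
boundary_x2 = @(x1) -(w(1) * x1 + b) / w(2);

% Predicted label for each row of X: +1 on one side of the line, -1 on the other
predict = @(X) sign(X * w + b);

disp(boundary_x2([0 2]))        % x2 on the boundary at x1 = 0 and x1 = 2
disp(predict([3 3; 0 0])')      % labels for the points (3,3) and (0,0)
```

The same rearrangement for x2 is what the plotting code in the full script uses to draw each boundary.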

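Once you have w and b, you can use them to plot the decision boundary. The result is shown in the figure below. With C = 1, the outlier is misclassified, but the decision boundary looks reasonable.
The resulting w and b values: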

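Setting C=100

Now let's see what happens when the cost factor is much higher. Train the model and plot the decision boundary again, this time with C set to 100. The outlier is now classified correctly, but the decision boundary no longer looks like a natural fit for the rest of the data. This example shows that when the cost penalty is large, the SVM algorithm tries very hard to avoid misclassifications.
The tradeoff is that the algorithm gives less weight to producing a large separation margin.
w and b for C=100:

Comparing different values of C:

Adjusting C adjusts the margin of the separating surface: the larger C is, the smaller the margin and the higher the accuracy on the training set. In nonlinear classification problems, however, a large C can lead to overfitting, so choosing a suitable value of C is very important.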
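For a linear SVM the width of the margin is 2/norm(w), so the effect of C can be read directly off the trained w. The sketch below uses two hypothetical weight vectors (not values from this exercise) to show how a larger norm of w means a narrower margin.

```matlab
% Hypothetical weight vectors, NOT values trained in this exercise
w_low_cost  = [0.6; 0.8];   % small norm, as might result from a small C
w_high_cost = [3; 4];       % large norm, as might result from a large C

% Distance between the two margin hyperplanes w'*x + b = +1 and w'*x + b = -1
margin_width = @(w) 2 / norm(w);

fprintf('low C:  margin width = %.2f\n', margin_width(w_low_cost));
fprintf('high C: margin width = %.2f\n', margin_width(w_high_cost));
```

Applying margin_width to the w printed for each C in the full script is a quick way to compare the models numerically.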

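Spam classification example:

Now let's return to the spam classification example from the previous exercise. In the data folder there should be the same four training sets seen in the Naive Bayes exercise, now formatted for LIBSVM. They are named: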
a. email_train-50.txt (based on 50 email documents)
b. email_train-100.txt (100 documents)
c. email_train-400.txt (400 documents)
d. email_train-all.txt (the complete 700 training documents)

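Comparing training sets of different sizes leads to one conclusion: the larger the training set, the smaller the prediction error. The results below are the accuracies obtained with training sets of 50, 100, 400, and all 700 documents. Because the feature vectors are too high-dimensional, the decision boundary cannot be plotted for visual inspection.
Running the script on each training file gives the following output: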
50 documents: Accuracy = 75.3846% (196/260)
100 documents: Accuracy = 88.4615% (230/260)
400 documents: Accuracy = 98.0769% (255/260)
the complete 700 training documents: Accuracy = 98.4615% (256/260)
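The percentages above are just correct predictions divided by the 260 test emails. A short sketch that reproduces them from the counts in the output, and also shows how many test emails each model gets wrong:

```matlab
% Counts copied from the svmpredict output above
train_sizes = [50 100 400 700];
correct     = [196 230 255 256];   % correctly classified test emails
n_test      = 260;                 % size of the test set

accuracy = 100 * correct / n_test; % accuracy in percent
n_errors = n_test - correct;       % misclassified test emails

for k = 1:numel(train_sizes)
    fprintf('%3d documents: accuracy = %.4f%%, errors = %d\n', ...
            train_sizes(k), accuracy(k), n_errors(k));
end
```

Seen as error counts, most of the improvement comes from going from 50 to 400 documents; the step from 400 to 700 corrects only one more test email.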

Complete code:
1.m

% SVM Linear classification
% A 2-feature example

clear all; close all; 

% Load training features and labels
[y, x] = libsvmread('twofeature.txt');

% Set the costs to compare, with one line style per cost
Cs = [1 10 50 100];
styles = {'r-', 'b-', 'c-', 'k-'};

% Plot the data points
figure
pos = find(y == 1);
neg = find(y == -1);
plot(x(pos,1), x(pos,2), 'ko', 'MarkerFaceColor', 'b'); hold on;
plot(x(neg,1), x(neg,2), 'ko', 'MarkerFaceColor', 'g')

% Train one model per cost and plot its decision boundary
% Libsvm options
% -s 0 : classification
% -t 0 : linear kernel
% -c somenumber : set the cost
plot_x = linspace(min(x(:,1)), max(x(:,1)), 30);
for k = 1:numel(Cs)
    model = svmtrain(y, x, sprintf('-s 0 -t 0 -c %g', Cs(k)));

    % Get the primal variables w, b from the model
    w = model.SVs' * model.sv_coef
    b = -model.rho
    if (model.Label(1) == -1)
        w = -w; b = -b;
    end

    % The boundary is the line w(1)*x1 + w(2)*x2 + b = 0
    plot_y = (-1/w(2))*(w(1)*plot_x + b);
    plot(plot_x, plot_y, styles{k}, 'LineWidth', 2)
end
title(sprintf('SVM Linear Classifier'), 'FontSize', 14)

2.m

% SVM Email text classification

clear all; close all; clc

% Load training features and labels
[train_y, train_x] = libsvmread('email_train-all.txt');

% Train the model and get the primal variables w, b from the model

% Libsvm options
% -t 0 : linear kernel
% Leave other options as their defaults 
model = svmtrain(train_y, train_x, '-t 0');
w = model.SVs' * model.sv_coef;
b = -model.rho;
if (model.Label(1) == -1)
    w = -w; b = -b;
end

% Load testing features and labels
[test_y, test_x] = libsvmread('email_test.txt');

[predicted_label, accuracy, decision_values] = svmpredict(test_y, test_x, model);
% After running svmpredict, the accuracy should be printed to the matlab
% console

Other references:
https://blog.csdn.net/gyh_420/article/details/77943973 (key reference)
https://www.cnblogs.com/zx-zhang/p/9972173.html
