【Exercise 6 Support Vector Machines】
【Code】【Part 1】
ex6.m
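Dataset 1: linearly separable data
Train with a linear (no-kernel) SVM
Dataset 2:
Implement the Gaussian kernel
Train with the Gaussian kernel
Dataset 3:
Parameter selection: C, σ
Train with the Gaussian kernel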
%% Machine Learning Online Class
% Exercise 6 | Support Vector Machines
%
% Instructions
% ------------
%
% This file contains code that helps you get started on the
% exercise. You will need to complete the following functions:
%
% gaussianKernel.m
% dataset3Params.m
% processEmail.m
% emailFeatures.m
%
% For this exercise, you will not need to change any code in this file,
% or any other files other than those mentioned above.
%
%% Initialization
clear ; close all; clc
%% =============== Part 1: Loading and Visualizing Data ================
% We start the exercise by first loading and visualizing the dataset.
% The following code will load the dataset into your environment and plot
% the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data1:
% You will have X, y in your environment
load('ex6data1.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ==================== Part 2: Training Linear SVM ====================
% The following code will train a linear SVM on the dataset and plot the
% decision boundary learned.
%
% Load from ex6data1:
% You will have X, y in your environment
load('ex6data1.mat');
fprintf('\nTraining Linear SVM ...\n')
% You should try to change the C value below and see how the decision
% boundary varies (e.g., try C = 1000)
C = 1;
model = svmTrain(X, y, C, @linearKernel, 1e-3, 20);
visualizeBoundaryLinear(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% =============== Part 3: Implementing Gaussian Kernel ===============
% You will now implement the Gaussian kernel to use
% with the SVM. You should complete the code in gaussianKernel.m
%
fprintf('\nEvaluating the Gaussian Kernel ...\n')
x1 = [1 2 1]; x2 = [0 4 -1]; sigma = 2;
sim = gaussianKernel(x1, x2, sigma);
fprintf(['Gaussian Kernel between x1 = [1; 2; 1], x2 = [0; 4; -1], sigma = %f :' ...
'\n\t%f\n(for sigma = 2, this value should be about 0.324652)\n'], sigma, sim);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% =============== Part 4: Visualizing Dataset 2 ================
% The following code will load the next dataset into your environment and
% plot the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data2:
% You will have X, y in your environment
load('ex6data2.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ========== Part 5: Training SVM with RBF Kernel (Dataset 2) ==========
% After you have implemented the kernel, we can now use it to train the
% SVM classifier.
%
fprintf('\nTraining SVM with RBF Kernel (this may take 1 to 2 minutes) ...\n');
% Load from ex6data2:
% You will have X, y in your environment
load('ex6data2.mat');
% SVM Parameters
C = 1; sigma = 0.1;
% We set the tolerance and max_passes lower here so that the code will run
% faster. However, in practice, you will want to run the training to
% convergence.
model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% =============== Part 6: Visualizing Dataset 3 ================
% The following code will load the next dataset into your environment and
% plot the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data3:
% You will have X, y in your environment
load('ex6data3.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ========== Part 7: Training SVM with RBF Kernel (Dataset 3) ==========
% This is a different dataset that you can use to experiment with. Try
% different values of C and sigma here.
%
% Load from ex6data3:
% You will have X, y in your environment
load('ex6data3.mat');
% Try different SVM Parameters here
[C, sigma] = dataset3Params(X, y, Xval, yval);
% Train the SVM
model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
linearKernel.m
【Notes pending】
function sim = linearKernel(x1, x2)
%LINEARKERNEL returns a linear kernel between x1 and x2
% sim = linearKernel(x1, x2) returns a linear kernel between x1 and x2
% and returns the value in sim
% Ensure that x1 and x2 are column vectors
x1 = x1(:); x2 = x2(:);
% Compute the kernel
sim = x1' * x2; % dot product
end
gaussianKernel.m
Apply the Gaussian kernel formula directly.
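For reference, the formula in question is the standard Gaussian (RBF) kernel, written here with the function's own argument names. The worked numbers use the test vectors from ex6.m (x1 = [1 2 1], x2 = [0 4 -1], sigma = 2) and reproduce the value of about 0.324652 that the script says to expect.

```latex
\begin{align}
K(x_1, x_2) &= \exp\!\left(-\frac{\lVert x_1 - x_2 \rVert^2}{2\sigma^2}\right) \\
x_1 - x_2 &= (1,\,-2,\,2), \qquad \lVert x_1 - x_2 \rVert^2 = 1 + 4 + 4 = 9 \\
K &= \exp\!\left(-\frac{9}{2 \cdot 2^2}\right) = \exp(-1.125) \approx 0.324652
\end{align}
```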
function sim = gaussianKernel(x1, x2, sigma)
%RBFKERNEL returns a radial basis function kernel between x1 and x2
% sim = gaussianKernel(x1, x2, sigma) returns a gaussian kernel between x1 and x2
% and returns the value in sim
% Ensure that x1 and x2 are column vectors
x1 = x1(:); x2 = x2(:);
% You need to return the following variables correctly.
sim = 0;
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return the similarity between x1
% and x2 computed using a Gaussian kernel with bandwidth
% sigma
%
%
sim = exp(-sum((x1 - x2).^2) / (2 * sigma^2));
% =============================================================
end
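dataset3Params.m
Two nested for loops evaluate every (C, σ) pair; the error for a pair is the fraction of validation predictions that disagree with the true labels:
error_val(i,j)=mean(double(predictions ~= yval));
To find the minimum, either compare against the running best after each evaluation, or store every error and let built-in functions find the smallest one at the end.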
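The "store everything, search at the end" approach can be sketched as follows. The 3x2 error grid and the candidate values are made up for illustration; only min and ind2sub are the point here.

```matlab
% Hypothetical error grid: rows index candidate C values, columns index sigma
C_vec = [0.1; 1; 10];
sigma_vec = [0.3; 3];
error_val = [0.20 0.35;
             0.05 0.30;
             0.10 0.25];

[min_err, ind] = min(error_val(:));      % linear (column-major) index of the smallest entry
[i, j] = ind2sub(size(error_val), ind);  % convert it back to (row, column)
C = C_vec(i);
sigma = sigma_vec(j);
fprintf('best C = %g, sigma = %g, error = %g\n', C, sigma, min_err);
```

Keeping the full grid costs a little memory but leaves the whole error surface available for inspection afterwards, which the running-minimum variant does not.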
function [C, sigma] = dataset3Params(X, y, Xval, yval)
%DATASET3PARAMS returns your choice of C and sigma for Part 3 of the exercise
%where you select the optimal (C, sigma) learning parameters to use for SVM
%with RBF kernel
% [C, sigma] = DATASET3PARAMS(X, y, Xval, yval) returns your choice of C and
% sigma. You should complete this function to return the optimal C and
% sigma based on a cross-validation set.
%
% You need to return the following variables correctly.
C = 1;
sigma = 0.3;
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return the optimal C and sigma
% learning parameters found using the cross validation set.
% You can use svmPredict to predict the labels on the cross
% validation set. For example,
% predictions = svmPredict(model, Xval);
% will return the predictions on the cross validation set.
%
% Note: You can compute the prediction error using
% mean(double(predictions ~= yval))
%
C_vec=[0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];
sigma_vec=[0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];
error_val=zeros(length(C_vec),length(sigma_vec));
% Two ways to find the minimum: compare against the running best after each
% evaluation, or store every error and use built-in functions at the end.
for i=1:length(C_vec)
    C_temp=C_vec(i);
    for j=1:length(sigma_vec)
        sigma_temp=sigma_vec(j);
        model= svmTrain(X, y, C_temp, @(x1, x2) gaussianKernel(x1, x2, sigma_temp));
        predictions = svmPredict(model, Xval);
        error_val(i,j)=mean(double(predictions ~= yval));
        %{
        Alternative (running minimum): replace the assignment above with the
        lines below and drop the min/ind2sub lookup after the loops.
        err=mean(double(predictions ~= yval));
        if ~exist('error_min','var') || err<error_min
            error_min=err;
            C=C_temp; sigma=sigma_temp;
        end
        %}
    end
end
[~,ind]=min(error_val(:));
[i,j]=ind2sub(size(error_val),ind);
C=C_vec(i);
sigma=sigma_vec(j);
% =========================================================================
end
svmPredict.m
【Notes pending】
function pred = svmPredict(model, X)
%SVMPREDICT returns a vector of predictions using a trained SVM model
%(svmTrain).
% pred = SVMPREDICT(model, X) returns a vector of predictions using a
% trained SVM model (svmTrain). X is a mxn matrix where each
% example is a row. model is a svm model returned from svmTrain.
% predictions pred is a m x 1 column of predictions of {0, 1} values.
%
% Check if we are getting a column vector, if so, then assume that we only
% need to do prediction for a single example
if (size(X, 2) == 1)
% Examples should be in rows
X = X';
end
% Dataset
m = size(X, 1);
p = zeros(m, 1);
pred = zeros(m, 1);
if strcmp(func2str(model.kernelFunction), 'linearKernel')
% We can use the weights and bias directly if working with the
% linear kernel
p = X * model.w + model.b;
elseif strfind(func2str(model.kernelFunction), 'gaussianKernel')
% Vectorized RBF Kernel
% This is equivalent to computing the kernel on every pair of examples
X1 = sum(X.^2, 2);
X2 = sum(model.X.^2, 2)';
K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
K = model.kernelFunction(1, 0) .^ K;
K = bsxfun(@times, model.y', K);
K = bsxfun(@times, model.alphas', K);
p = sum(K, 2);
else
% Other Non-linear kernel
for i = 1:m
prediction = 0;
for j = 1:size(model.X, 1)
prediction = prediction + ...
model.alphas(j) * model.y(j) * ...
model.kernelFunction(X(i,:)', model.X(j,:)');
end
p(i) = prediction + model.b;
end
end
% Convert predictions into 0 / 1
pred(p >= 0) = 1;
pred(p < 0) = 0;
end
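The vectorized RBF branch above relies on the identity ||x - z||^2 = ||x||^2 + ||z||^2 - 2*x'*z, together with the fact that kernelFunction(1, 0) equals exp(-1/(2*sigma^2)), so raising it to the squared distance gives the kernel value. A small numerical check, with made-up matrices A and B and a stand-in anonymous kernel (all hypothetical, not part of the exercise files):

```matlab
sigma = 0.5;
% Stand-in for gaussianKernel with sigma bound, as svmTrain would receive it
kernel = @(x1, x2) exp(-sum((x1(:) - x2(:)).^2) / (2 * sigma^2));

A = [0 0; 1 2; -1 0.5];   % 3 "test" examples, one per row
B = [1 1; 0 -2];          % 2 "support vectors", one per row

% Pairwise squared distances via ||a||^2 + ||b||^2 - 2*a*b'
D = bsxfun(@plus, sum(A.^2, 2), bsxfun(@plus, sum(B.^2, 2)', -2 * A * B'));
K_vec = kernel(1, 0) .^ D;            % exp(-1/(2*sigma^2)) raised to each distance

% Reference: evaluate the kernel pair by pair
K_loop = zeros(size(A, 1), size(B, 1));
for i = 1:size(A, 1)
    for j = 1:size(B, 1)
        K_loop(i, j) = kernel(A(i, :), B(j, :));
    end
end
max_diff = max(abs(K_vec(:) - K_loop(:)));
fprintf('max difference between the two computations: %g\n', max_diff);
```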
visualizeBoundaryLinear.m
Plots the linear decision boundary.
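The line drawn by this function follows from setting the linear SVM's decision value to zero and solving for the second coordinate. This assumes w_2 is nonzero; a vertical boundary would need separate handling.

```latex
\begin{align}
w_1 x_1 + w_2 x_2 + b &= 0 \\
x_2 &= -\frac{w_1 x_1 + b}{w_2}
\end{align}
```

In the code, xp plays the role of x_1 and yp the role of x_2.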
function visualizeBoundaryLinear(X, y, model)
%VISUALIZEBOUNDARYLINEAR plots a linear decision boundary learned by the
%SVM
% VISUALIZEBOUNDARYLINEAR(X, y, model) plots a linear decision boundary
% learned by the SVM and overlays the data on it
w = model.w;
b = model.b;
xp = linspace(min(X(:,1)), max(X(:,1)), 100);
yp = - (w(1)*xp + b)/w(2);
plotData(X, y);
hold on;
plot(xp, yp, '-b');
hold off
end
visualizeBoundary.m
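Uses the contour function to draw the non-linear boundary.
-> contour draws level curves of a surface over a 2-D grid, so a meshgrid is built first. (In fact, contour also accepts x and y as vectors.)
-> On the resulting grid, compute the prediction (0/1) at every point.
-> Draw the boundary with contour. Note: because svmPredict outputs 0/1, 0.5 is used as the threshold. The argument 0.5 means that only the single contour level at height 0.5 is drawn, and the function requires a single level to be passed as two repeated values, i.e. [0.5 0.5].
Plotting the predictions at the same time with imagesc shows clearly discrete 0/1 values, and the decision boundary (white) sits exactly where 0 meets 1.
MATLAB's contour function interpolates across the gaps between grid points, which is why a level at height 0.5 can be found in data that contains only 0 and 1.
Remove the [0.5 0.5] argument, redraw the boundary, and zoom in to see the effect of the interpolation.
Compare with plotDecisionBoundary, the boundary-plotting function for logistic regression in ex2:
It plots θᵀx (rather than g(θᵀx) > 0.5), so the contour threshold is 0 (rather than 0.5).
Plotting θᵀx with imagesc shows a continuously varying surface rather than discrete 0/1 values.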
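The interpolation point can be checked without opening a figure. The sketch below is an illustration only: it uses contourc (which returns contour vertices instead of drawing them) as a stand-in for contour, and a made-up 0/1 grid that is 1 inside the unit circle in place of real svmPredict output.

```matlab
% Made-up 0/1 "predictions": 1 inside the unit circle, 0 outside
x = linspace(-2, 2, 41);
[X1, X2] = meshgrid(x, x);
vals = double(X1.^2 + X2.^2 <= 1);

% A single level must be given as a repeated pair, as in contour(..., [0.5 0.5])
Cm = contourc(x, x, vals, [0.5 0.5]);
level = Cm(1, 1);              % contour level of the first segment
n = Cm(2, 1);                  % number of vertices in that segment
pts = Cm(:, 2:n+1);            % 2 x n matrix of (x; y) vertices
r = sqrt(sum(pts.^2, 1));      % distance of each vertex from the origin
fprintf('level = %g, radius range = [%.3f, %.3f]\n', level, min(r), max(r));
```

The returned vertices lie between grid points, close to radius 1, even though vals itself contains nothing but 0 and 1.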
function visualizeBoundary(X, y, model, varargin)
%VISUALIZEBOUNDARY plots a non-linear decision boundary learned by the SVM
% VISUALIZEBOUNDARY(X, y, model) plots a non-linear decision
% boundary learned by the SVM and overlays the data on it
% Plot the training data on top of the boundary
plotData(X, y)
% Make classification predictions over a grid of values
x1plot = linspace(min(X(:,1)), max(X(:,1)), 100)';
x2plot = linspace(min(X(:,2)), max(X(:,2)), 100)';
[X1, X2] = meshgrid(x1plot, x2plot);
vals = zeros(size(X1));
for i = 1:size(X1, 2)
this_X = [X1(:, i), X2(:, i)];
vals(:, i) = svmPredict(model, this_X);
end
% Plot the SVM boundary
hold on
contour(X1, X2, vals, [0.5 0.5], 'b');
% contour accepts x and y either as vectors or as meshgrid matrices
%{
imagesc(x1plot, x2plot, vals)
hold on
contour(x1plot, x2plot, vals,[0.5 ,0.5] , 'w','LineWidth',2);
hold on
plotData(X, y)
%}
hold off;
end
【Part 2】
ex6_spam.m
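-> Implement the email-processing functions (convert an email to word indices, then to a feature vector x) and check them on a sample email
-> Load the preprocessed training and test sets (X, y), train a classifier with a linear (no-kernel) SVM, and predict
-> List the 15 words that weigh most heavily toward a "spam" prediction in the trained model, together with their weights
-> Process one specific sample email, feed it to the model, and output the prediction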
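The "top predictors" step in the list above comes down to sorting the linear SVM's weight vector in descending order and reading off the matching vocabulary entries. A toy sketch with made-up weights and words (not the real spam model):

```matlab
% Hypothetical weights and vocabulary, for illustration only
w = [0.10; -0.30; 0.45; 0.05];
vocab = {'our'; 'thank'; 'click'; 'the'};

[weight, idx] = sort(w, 'descend');   % largest weights first
top_k = 2;
for i = 1:top_k
    fprintf(' %-10s (%f)\n', vocab{idx(i)}, weight(i));
end
```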
%% Machine Learning Online Class
% Exercise 6 | Spam Classification with SVMs
%
% Instructions
% ------------
%
% This file contains code that helps you get started on the
% exercise. You will need to complete the following functions:
%
% gaussianKernel.m
% dataset3Params.m
% processEmail.m
% emailFeatures.m
%
% For this exercise, you will not need to change any code in this file,
% or any other files other than those mentioned above.
%
%% Initialization
clear ; close all; clc
%% ==================== Part 1: Email Preprocessing ====================
% To use an SVM to classify emails into Spam v.s. Non-Spam, you first need
% to convert each email into a vector of features. In this part, you will
% implement the preprocessing steps for each email. You should
% complete the code in processEmail.m to produce a word indices vector
% for a given email.
fprintf('\nPreprocessing sample email (emailSample1.txt)\n');
% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices = processEmail(file_contents);
% Print Stats
fprintf('Word Indices: \n');
fprintf(' %d', word_indices);
fprintf('\n\n');
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ==================== Part 2: Feature Extraction ====================
% Now, you will convert each email into a vector of features in R^n.
% You should complete the code in emailFeatures.m to produce a feature
% vector for a given email.
fprintf('\nExtracting features from sample email (emailSample1.txt)\n');
% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices = processEmail(file_contents);
features = emailFeatures(word_indices);
% Print Stats
fprintf('Length of feature vector: %d\n', length(features));
fprintf('Number of non-zero entries: %d\n', sum(features > 0));
fprintf('Program paused. Press enter to continue.\n');
pause;
%% =========== Part 3: Train Linear SVM for Spam Classification ========
% In this section, you will train a linear classifier to determine if an
% email is Spam or Not-Spam.
% Load the Spam Email dataset
% You will have X, y in your environment
load('spamTrain.mat');
fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
C = 0.1;
model = svmTrain(X, y, C, @linearKernel);
p = svmPredict(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);
%% =================== Part 4: Test Spam Classification ================
% After training the classifier, we can evaluate it on a test set. We have
% included a test set in spamTest.mat
% Load the test dataset
% You will have Xtest, ytest in your environment
load('spamTest.mat');
fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')
p = svmPredict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
pause;
%% ================= Part 5: Top Predictors of Spam ====================
% Since the model we are training is a linear SVM, we can inspect the
% weights learned by the model to understand better how it is determining
% whether an email is spam or not. The following code finds the words with
% the highest weights in the classifier. Informally, the classifier
% 'thinks' that these words are the most likely indicators of spam.
%
% Sort the weights and obtain the vocabulary list
[weight, idx] = sort(model.w, 'descend');
vocabList = getVocabList();
fprintf('\nTop predictors of spam: \n');
for i = 1:15
fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i));
end
fprintf('\n\n');
fprintf('\nProgram paused. Press enter to continue.\n');
pause;
%% =================== Part 6: Try Your Own Emails =====================
% Now that you've trained the spam classifier, you can use it on your own
% emails! In the starter code, we have included spamSample1.txt,
% spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
% The following code reads in one of these emails and then uses your
% learned SVM classifier to determine whether the email is Spam or
% Not Spam
% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different email types). Try your own emails as well!
filename = 'spamSample1.txt';
% Read and predict
file_contents = readFile(filename);
word_indices = processEmail(file_contents);
x = emailFeatures(word_indices);
p = svmPredict(model, x);
fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');
processEmail.m
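Process the email word by word: is the word in the vocabulary list? If so, append its index in the vocabulary list to word_indices; otherwise skip it.
(Either loop over the vocabulary list or use the find function.)
Suppose an email reads xxx aaa yyy bbb (after preprocessing), where xxx has index 93 in the vocabulary list, bbb has index 20, and aaa and yyy are not listed. The output is then the column vector [93; 20].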
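The lookup can be tried on a toy vocabulary. Note that this sketch swaps in ismember, which handles all words at once; it is not the approach used in the solution, which checks one word at a time with strcmp. The four-word vocabulary is hypothetical; the real list comes from getVocabList().

```matlab
% Hypothetical miniature vocabulary
vocabList = {'aa'; 'bbb'; 'ccc'; 'xxx'};
tokens = {'xxx', 'aaa', 'yyy', 'bbb'};       % email after preprocessing

[found, loc] = ismember(tokens, vocabList);  % loc(k) = index of tokens{k}, or 0 if absent
hits = loc(found);                           % drop words not in the vocabulary
word_indices = hits(:);                      % column vector, in email order
disp(word_indices');
```

Here xxx maps to index 4 and bbb to index 2, while aaa and yyy are skipped.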
function word_indices = processEmail(email_contents)
%PROCESSEMAIL preprocesses the body of an email and
%returns a list of word_indices
% word_indices = PROCESSEMAIL(email_contents) preprocesses
% the body of an email and returns a list of indices of the
% words contained in the email.
%
% Load Vocabulary
vocabList = getVocabList();
% Init return value
word_indices = [];
% ========================== Preprocess Email ===========================
% Find the Headers ( \n\n and remove )
% Uncomment the following lines if you are working with raw emails with the
% full headers
% hdrstart = strfind(email_contents, ([char(10) char(10)]));
% email_contents = email_contents(hdrstart(1):end);
% Lower case
email_contents = lower(email_contents);
% Strip all HTML
% Looks for any expression that starts with < and ends with >, contains no
% < or > inside the tag, and replaces it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');
% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number');
% Handle URLS
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
'(http|https)://[^\s]*', 'httpaddr');
% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');
% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar');
% ========================== Tokenize Email ===========================
% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');
% Process file
l = 0;
while ~isempty(email_contents)
% Tokenize and also get rid of any punctuation
[str, email_contents] = ...
strtok(email_contents, ...
[' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);
% Remove any non alphanumeric characters
str = regexprep(str, '[^a-zA-Z0-9]', '');
% Stem the word
% (the porterStemmer sometimes has issues, so we use a try catch block)
try
str = porterStemmer(strtrim(str));
catch
str = '';
continue;
end
% Skip the word if it is too short
if length(str) < 1
continue;
end
% Look up the word in the dictionary and add to word_indices if
% found
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to add the index of str to
% word_indices if it is in the vocabulary. At this point
% of the code, you have a stemmed word from the email in
% the variable str. You should look up str in the
% vocabulary list (vocabList). If a match exists, you
% should add the index of the word to the word_indices
% vector. Concretely, if str = 'action', then you should
% look up the vocabulary list to find where in vocabList
% 'action' appears. For example, if vocabList{18} =
% 'action', then, you should add 18 to the word_indices
% vector (e.g., word_indices = [word_indices ; 18]; ).
%
% Note: vocabList{idx} returns the word with index idx in the
% vocabulary list.
%
% Note: You can use strcmp(str1, str2) to compare two strings (str1 and
% str2). It will return 1 only if the two strings are equivalent.
%
for i = 1:length(vocabList)
    if strcmp(str, vocabList{i})
        word_indices = [word_indices; i];
        break
    end
end
%{
Alternative: vectorized lookup with find instead of the loop above
result = find(strcmp(str, vocabList));
if ~isempty(result)
    word_indices = [word_indices; result];
end
%}
% =============================================================
% Print to screen, ensuring that the output lines are not too long
if (l + length(str) + 1) > 78
fprintf('\n');
l = 0;
end
fprintf('%s ', str);
l = l + length(str) + 1;
end
% Print footer
fprintf('\n\n=========================\n');
end
emailFeatures.m
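Convert the word indices into the feature vector x used for training. For a four-word vocabulary, the rule works like this: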
[1,3] -> 1010
[2] -> 0100
[2,3,4] -> 0111
...
MATLAB's flexible indexing makes this simple to implement.
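A minimal sketch of that indexing trick, using a toy vocabulary of size 4 instead of the exercise's 1899 and a made-up index list:

```matlab
n = 4;                        % toy vocabulary size (the exercise uses 1899)
word_indices = [2; 3; 4; 3];  % a repeated index is harmless
x = zeros(n, 1);
x(word_indices) = 1;          % scalar expansion sets every listed position
disp(x');
```

This matches the [2,3,4] -> 0111 row of the table: positions 2, 3 and 4 are set and position 1 stays 0.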
function x = emailFeatures(word_indices)
%EMAILFEATURES takes in a word_indices vector and produces a feature vector
%from the word indices
% x = EMAILFEATURES(word_indices) takes in a word_indices vector and
% produces a feature vector from the word indices.
% Total number of words in the dictionary
n = 1899;
% You need to return the following variables correctly.
x = zeros(n, 1);
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return a feature vector for the
% given email (word_indices). To help make it easier to
% process the emails, we have already pre-processed each
% email and converted each word in the email into an index in
% a fixed dictionary (of 1899 words). The variable
% word_indices contains the list of indices of the words
% which occur in one email.
%
% Concretely, if an email has the text:
%
% The quick brown fox jumped over the lazy dog.
%
% Then, the word_indices vector for this text might look
% like:
%
% 60 100 33 44 10 53 60 58 5
%
% where, we have mapped each word onto a number, for example:
%
% the -- 60
% quick -- 100
% ...
%
% (note: the above numbers are just an example and are not the
% actual mappings).
%
% Your task is to take one such word_indices vector and construct
% a binary feature vector that indicates whether a particular
% word occurs in the email. That is, x(i) = 1 when word i
% is present in the email. Concretely, if the word 'the' (say,
% index 60) appears in the email, then x(60) = 1. The feature
% vector should look like:
%
% x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..];
%
%
x(word_indices) = 1;
% =========================================================================
end