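This post covers spam email classification with three approaches: SVM, logistic regression, and an SAE-Logistic deep network classifier. The sections below explain how each one is applied to spam classification.
The dataset is spamData.mat. The training set has 3065 samples, the test set has 1536 samples, and each sample has 57 features.
Dataset download: https://github.com/probml/pmtkdata/tree/master/spamData
I. SVM classification
II. Logistic classification
III. SAE-Logistic classification
I. SVM classification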
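I will not go over the theory here: earlier posts cover it, plenty of material is available online, and others explain it better than I can. This section only shows how to classify spam with libsvm and with the SVM toolbox that ships with MATLAB.
1. Spam classification with libsvm
libsvm download: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Detailed guide: http://www.matlabsky.com/thread-11925-1-1.html
After downloading libsvm, add it to the MATLAB path: File -> Set Path -> Add with Subfolders, then select the libsvm-3.11 folder.
1. In the MATLAB Command Window, enter: mex -setup
2. At the prompt "Would you like mex to locate installed compilers [y]/n?", answer y.
3. Select a compiler:
[1] Microsoft Visual C++ 2010 in E:\VS2010
[0] None
Choose 1.
4. At the prompt "Are these correct [y]/n?", answer y.
libsvm is now ready to use.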
load('spamData.mat');
model = svmtrain(ytrain,Xtrain,'-t 0');
[predict_label,accuracy] = svmpredict(ytest,Xtest,model);
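The -t x option in the svmtrain call selects the kernel; x can be 0, 1, 2, 3, or 4. If -t is omitted, the default is 2.
0) linear kernel
1) polynomial kernel
2) RBF kernel
3) sigmoid kernel
4) user-defined (precomputed) kernel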
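To make the kernel options more concrete, here is a minimal sketch that evaluates kernels 0 to 3 by hand on two small vectors. The vectors u and v are made up for illustration. The parameter values follow libsvm's documented defaults (gamma = 1/num_features, coef0 = 0, degree = 3), but this is only a hand calculation, not libsvm's own code.

```matlab
% Two hypothetical feature vectors (not taken from spamData.mat)
u = [1; 2; 3];
v = [0.5; -1; 2];

% libsvm defaults: gamma = 1/num_features, coef0 = 0, degree = 3
gamma  = 1 / numel(u);
coef0  = 0;
degree = 3;

kLinear  = u' * v;                               % -t 0: u'*v
kPoly    = (gamma * (u' * v) + coef0) ^ degree;  % -t 1: (gamma*u'*v + coef0)^degree
kRbf     = exp(-gamma * sum((u - v) .^ 2));      % -t 2: exp(-gamma*|u-v|^2)
kSigmoid = tanh(gamma * (u' * v) + coef0);       % -t 3: tanh(gamma*u'*v + coef0)

fprintf('linear=%g poly=%g rbf=%g sigmoid=%g\n', kLinear, kPoly, kRbf, kSigmoid);
```

With -t 4, libsvm expects the kernel values to be computed by the user and passed in place of the feature matrix, so there is no formula to evaluate for that case.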
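Comparing these kernels, the linear kernel performs best, reaching an accuracy of 91.1458%.
Before training, the labels need to be remapped like this: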
ytrain(ytrain==0) = -1;
ytest(ytest==0) = -1;
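2. Spam classification with MATLAB's built-in SVM toolbox
MATLAB ships with its own SVM toolbox, and this section shows how to use it for spam classification. The program comes first, followed by the explanation.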
load spamData
svmStruct = svmtrain(Xtrain,ytrain,'showplot',true);
classes=svmclassify(svmStruct,Xtest,'showplot',true);
nCorrect = sum(classes == ytest);              % number of correct predictions
accuracy = 100 * nCorrect / length(classes);   % accuracy in percent
fprintf('accuracy=%.2f%%\n', accuracy);        % %.2f prints the number; %s would not
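If a message appears when the program runs, ignore it; it has no effect on the result.
Right-click svmtrain in the program and open svmtrain.m. The defaults are set in the line dflts = {'linear', ...}, which is line 287 in my copy. The kernel can be changed there to any entry of okfuns = {'linear','quadratic','radial','rbf','polynomial','mlp'}.
linear: linear kernel.
quadratic: quadratic kernel.
radial: I am not sure which kernel this is.
rbf: radial basis function kernel, usually called the Gaussian kernel, although Ng says it has little to do with the Gaussian.
polynomial: polynomial kernel.
mlp: multilayer perceptron kernel.
Next, the accuracy of each kernel on the spam classification task.
The quadratic kernel performs best, reaching 85.55%.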
The prepared datasets, in three different preprocessing variants, are available in the resources.
Results for Exercise 8.1:
Logistic regression
SVM
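II. Logistic classification
Reference: http://www.docin.com/p-160363677.html
An earlier post covered softmax classification, which can be adapted to logistic regression: http://blog.csdn.net/hlx371240/article/details/40015395
That post was written for the course 《最優化計算方法》 (Optimization Methods), where I optimized the parameters with L-BFGS and steepest descent (SD). Here I use the toolbox to do the classification directly.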
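Why softmax can stand in for logistic regression: with two classes, the softmax probability of class 1 equals the logistic sigmoid applied to the difference of the two class scores. The sketch below checks this numerically. The values of theta and x are made up for illustration and are not trained parameters.

```matlab
% Hypothetical 2-class softmax parameters (numClasses x inputSize) and one sample
theta = [0.3 -0.2 0.5; -0.1 0.4 0.2];
x     = [1; -2; 0.5];

% Softmax probabilities, computed stably by subtracting the largest score
z = theta * x;
z = z - max(z);
p = exp(z) / sum(exp(z));

% Logistic regression with weight vector w = theta(1,:) - theta(2,:)
w = theta(1, :) - theta(2, :);
pLogistic = 1 / (1 + exp(-(w * x)));

fprintf('softmax P(class 1) = %.6f, logistic = %.6f\n', p(1), pLogistic);
```

This is also why the scripts below remap label 0 to 2: the softmax code indexes classes from 1, so the two classes become 1 and 2.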
%% STEP 0: Initialise constants and parameters
inputSize = 57; % Size of input vector
numClasses = 2; % Number of classes
lambda = 1e-4; % Weight decay parameter
%%=====================================================================
%% STEP 1: Load data
load('D:\機器學習課程\作業三\spamData.mat');
Xtrain=Xtrain';
ytrain(ytrain==0) = 2; % Remap label 0 to 2 (labels must be 1..numClasses)
inputData = Xtrain;
DEBUG = false;
if DEBUG
inputSize = 8;
inputData = randn(8, 100);
ytrain = randi(numClasses, 100, 1); % random labels that match the debug data
end
% Randomly initialise theta
theta = 0.005 * randn(numClasses * inputSize, 1); % theta is passed as a column vector
%%======================================================================
%% STEP 2: Implement softmaxCost
[cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, inputData, ytrain);
%%======================================================================
%% STEP 3: Learning parameters
options.maxIter = 100;
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
inputData, ytrain, options);
%%======================================================================
%% STEP 4: Testing
Xtest=Xtest';
ytest(ytest==0) = 2; % Remap label 0 to 2
size(softmaxModel.optTheta)
size(inputData)
[pred] = softmaxPredict(softmaxModel, Xtest);
acc = mean(ytest(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);
softmaxTrain.m
function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)
if ~exist('options', 'var')
options = struct;
end
if ~isfield(options, 'maxIter')
options.maxIter = 400;
end
% initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);
% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
% function. Generally, for minFunc to work, you
% need a function pointer with two outputs: the
% function value and the gradient. In our problem,
% softmaxCost.m satisfies this.
options.display = 'on'; % set on the struct that is actually passed to minFunc
[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
numClasses, inputSize, lambda, ...
inputData, labels), ...
theta, options);
% Fold softmaxOptTheta into a nicer format
softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;
end
softmaxCost.m
function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)
% numClasses - the number of classes
% inputSize - the size N of the input vector
% lambda - weight decay parameter
% data - the N x M input matrix, where each column data(:, i) corresponds to
% a single example
% labels - an M x 1 matrix containing the labels corresponding for the input data
%
% Unroll the parameters from theta
theta = reshape(theta, numClasses, inputSize); % reshape the parameter column vector into a matrix
numCases = size(data, 2); % number of input samples
groundTruth = full(sparse(labels, 1:numCases, 1)); % indicator matrix: entry (labels(i), i) is 1, all others are 0
% sparse takes its row indices from labels and its column indices from 1:numCases
cost = 0;
thetagrad = zeros(numClasses, inputSize);
M = bsxfun(@minus,theta*data,max(theta*data, [], 1)); % subtract the column maximum for numerical stability
M = exp(M);
p = bsxfun(@rdivide, M, sum(M)); % class probabilities, one column per sample
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(theta(:) .^ 2);
thetagrad = -1/numCases * (groundTruth - p) * data' + lambda * theta;
grad = [thetagrad(:)];
end
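A quick way to gain confidence in a cost/gradient pair like this is a numerical gradient check. The sketch below repeats the same formulas as softmaxCost inline, using anonymous functions so it runs on its own, and compares the analytic gradient with central differences. The toy data, labels, and starting theta are made up for illustration.

```matlab
% Tiny hypothetical problem: 3 features, 4 samples, 2 classes
numClasses = 2; inputSize = 3; lambda = 1e-4;
data   = reshape(sin(1:12), inputSize, 4);
labels = [1; 2; 2; 1];
theta  = 0.1 * cos(1:numClasses*inputSize)';
numCases    = size(data, 2);
groundTruth = full(sparse(labels, 1:numCases, 1));

% Same formulas as softmaxCost, written as anonymous functions of theta
scores = @(t) reshape(t, numClasses, inputSize) * data;
expo   = @(M) exp(bsxfun(@minus, M, max(M, [], 1)));
probs  = @(t) bsxfun(@rdivide, expo(scores(t)), sum(expo(scores(t)), 1));
costFn = @(t) -1/numCases * groundTruth(:)' * reshape(log(probs(t)), [], 1) ...
              + lambda/2 * sum(t .^ 2);

% Analytic gradient
p = probs(theta);
gradMat = -1/numCases * (groundTruth - p) * data' ...
          + lambda * reshape(theta, numClasses, inputSize);
grad = gradMat(:);

% Numerical gradient by central differences
epsilon = 1e-4;
numGrad = zeros(size(theta));
for k = 1:numel(theta)
    e = zeros(size(theta));
    e(k) = epsilon;
    numGrad(k) = (costFn(theta + e) - costFn(theta - e)) / (2 * epsilon);
end

relErr = norm(numGrad - grad) / norm(numGrad + grad);
fprintf('relative gradient error: %g\n', relErr);
```

A very small relative error suggests the analytic gradient matches the cost; a large one usually points to a sign or scaling mistake.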
softmaxPredict.m
function [pred] = softmaxPredict(softmaxModel, data)
% Unroll the parameters from theta
theta = softmaxModel.optTheta; % this provides a numClasses x inputSize matrix
pred = zeros(1, size(data, 2));
[nop, pred] = max(theta * data);
end
initializeParameters.m
function theta = initializeParameters(hiddenSize, visibleSize)
%% Initialize parameters randomly based on layer sizes.
r = sqrt(6) / sqrt(hiddenSize+visibleSize+1); % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);
% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all
% your parameters into a vector, which can then be used with minFunc.
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];
end
sigmoidInv.m
function sigmInv = sigmoidInv(x)
sigmInv = sigmoid(x).*(1-sigmoid(x));
end
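Despite its name, sigmoidInv computes the derivative of the sigmoid, sigmoid(x).*(1-sigmoid(x)), not its inverse. A minimal check against a central-difference estimate, with arbitrarily chosen sample points:

```matlab
% Sigmoid and its derivative, written as anonymous functions
sigmoid     = @(x) 1 ./ (1 + exp(-x));
sigmoidGrad = @(x) sigmoid(x) .* (1 - sigmoid(x));   % what sigmoidInv.m computes

x = [-2 -0.5 0 0.5 2];   % arbitrary test points
h = 1e-5;                % finite-difference step

% Central-difference estimate of the derivative
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h);
maxDiff = max(abs(numeric - sigmoidGrad(x)));

fprintf('max difference from finite differences: %g\n', maxDiff);
```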
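This algorithm uses L-BFGS, a quasi-Newton method. It saves memory by working with an approximation of the Hessian, built from a few recent gradients, instead of the exact Hessian. I learned this from my instructor, and I recommend that students take her course 《數值優化》 (Numerical Optimization) next semester: she teaches it very well and it is easy to stay focused in her lectures.
The minFunc toolbox is also needed and can be downloaded from the resources. The final accuracy is 92.057%. Not every run gives exactly this figure, because the parameters are initialized randomly.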
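The memory-saving idea behind L-BFGS can be sketched in a few lines. The code below is not minFunc's implementation. It is a toy version, assuming a small made-up quadratic f(x) = 0.5*x'*A*x - b'*x so that an exact line search can be used. It keeps only the last m step/gradient-difference pairs and applies the standard two-loop recursion to obtain a search direction without ever forming a Hessian.

```matlab
% Hypothetical strictly convex quadratic: f(x) = 0.5*x'*A*x - b'*x
A = diag([1 2 3 4]) + 0.5 * ones(4);
b = [1; 2; 3; 4];

m = 3;                 % number of (s, y) pairs kept in memory
x = zeros(4, 1);
g = A * x - b;         % gradient of f at x
S = zeros(4, 0);       % recent steps           s_k = x_{k+1} - x_k
Y = zeros(4, 0);       % recent gradient changes y_k = g_{k+1} - g_k

for iter = 1:50
    if norm(g) < 1e-10, break; end

    % Two-loop recursion: q ends up as (approximate inverse Hessian) * g
    q = g;
    k = size(S, 2);
    alpha = zeros(k, 1);
    rho   = zeros(k, 1);
    for i = k:-1:1                       % newest pair first
        rho(i)   = 1 / (Y(:, i)' * S(:, i));
        alpha(i) = rho(i) * (S(:, i)' * q);
        q = q - alpha(i) * Y(:, i);
    end
    if k > 0                             % scale by the usual initial guess
        q = ((S(:, k)' * Y(:, k)) / (Y(:, k)' * Y(:, k))) * q;
    end
    for i = 1:k                          % oldest pair first
        beta = rho(i) * (Y(:, i)' * q);
        q = q + S(:, i) * (alpha(i) - beta);
    end
    d = -q;                              % search direction

    t = -(g' * d) / (d' * A * d);        % exact line search (quadratic only)
    xNew = x + t * d;
    gNew = A * xNew - b;

    S = [S, xNew - x];                   % store the new pair
    Y = [Y, gNew - g];
    if size(S, 2) > m                    % drop the oldest pair
        S(:, 1) = [];
        Y(:, 1) = [];
    end
    x = xNew;
    g = gNew;
end

fprintf('distance to exact solution: %g\n', norm(x - A \ b));
```

For a real cost such as softmaxCost there is no closed-form step length, so a practical optimizer replaces the exact line search with an inexact one.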
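III. SAE-Logistic classification
SAE stands for Sparse Auto Encoder. It adds a self-taught learning layer that extracts features before classification, on the assumption that the occurrences of individual words in an email may be correlated. This is feature extraction by a neural network.
Reference: http://nlp.stanford.edu/~socherr/sparseAutoencoder_2011new.pdf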
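What makes the autoencoder "sparse" is a penalty that pushes the average activation of each hidden unit toward a small target value. Below is a minimal sketch of that penalty, using the same KL-divergence formula as sparseAutoencoderCost further down. The activation matrix a2 is made up for illustration.

```matlab
sparsityParam = 0.1;   % target average activation (rho)

% Hypothetical hidden activations: 2 hidden units (rows) x 4 samples (columns)
a2 = [0.1 0.1 0.1 0.1;
      0.9 0.8 0.7 0.6];

rhoHat = mean(a2, 2);  % average activation of each hidden unit

% KL divergence between the target rho and each unit's average activation
kl = sparsityParam .* log(sparsityParam ./ rhoHat) + ...
     (1 - sparsityParam) .* log((1 - sparsityParam) ./ (1 - rhoHat));

fprintf('unit 1 penalty: %g, unit 2 penalty: %g\n', kl(1), kl(2));
```

The first unit already averages 0.1, so its penalty is essentially zero. The second unit is active most of the time and is penalized. In the cost function the per-unit penalties are summed and weighted by beta.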
%STEP 1: Initialize parameters and load data
clear all;
clc;
load('D:\機器學習課程\作業三\spamData.mat');
Xtrain=Xtrain';
ytrain(ytrain==0) = 2; % Remap label 0 to 2
Xtest=Xtest';
ytest(ytest == 0) = 2; % Remap label 0 to 2
inputSize = 57;
numLabels = 2;
a=[57 50 45 40 35 30 25 20 15 10 5]; % hidden layer sizes to try
sparsityParam = 0.1; % desired average activation of the hidden units
beta = 3; % weight of sparsity penalty term
numClasses = 2; % Number of classes (spam / not spam)
lambda = 1e-4; % Weight decay parameter (an earlier assignment of 3e-3 was overridden by this value)
result = zeros(1, numel(a)); % test accuracy for each hidden layer size
%% ======================================================================
%STEP 2: Train the self-taught learning layer (SAE)
for i=1:numel(a)
hiddenSize = a(i);
theta = initializeParameters(hiddenSize, inputSize);
%-------------------------------------------------------------------
opttheta = theta;
addpath minFunc/
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';
[opttheta, loss] = minFunc( @(p) sparseAutoencoderCost(p, ...
inputSize, hiddenSize, ...
lambda, sparsityParam, ...
beta, Xtrain), ...
theta, options);
trainFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
Xtrain);
%% ================================================
%STEP 3: Train the softmax classifier
saeSoftmaxTheta = 0.005 * randn(hiddenSize * numClasses, 1);
softmaxLambda = 1e-4;
numClasses = 2;
softoptions = struct;
softoptions.maxIter = 500;
softmaxModel = softmaxTrain(hiddenSize,numClasses,softmaxLambda,...
trainFeatures,ytrain,softoptions);
theta_new = softmaxModel.optTheta(:);
%% ============================================================
% Fine-tune the whole network (encoder layer + softmax)
stack = cell(1,1);
stack{1}.w = reshape(opttheta(1:hiddenSize * inputSize), hiddenSize, inputSize);
stack{1}.b =opttheta(2*hiddenSize*inputSize+1:2*hiddenSize*inputSize+hiddenSize);
[stackparams, netconfig] = stack2params(stack);
stackedAETheta = [theta_new;stackparams];
addpath minFunc/;
options = struct;
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';
[stackedAEOptTheta,cost] = minFunc(@(p)stackedAECost(p,inputSize,hiddenSize,numClasses, netconfig,lambda, Xtrain, ytrain),stackedAETheta,options);
%% =================================================================
%STEP 4: Test
[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSize, ...
numClasses, netconfig, Xtest);
acc = mean(ytest(:) == pred(:));
fprintf('Accuracy = %0.3f%%\n', acc * 100);
result(i)=acc * 100;
end
stackedAEPredict.m
function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
a{layer+1} = sigmoid(z{layer+1});
end
[~, pred] = max(softmaxTheta * a{depth+1});
end
% You might find this useful
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
stackedAECost.m
function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
numClasses, netconfig, ...
lambda, data, labels)
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.
% theta: trained weights from the autoencoder
% inputSize: the number of input units
% hiddenSize: the number of hidden units *at the 2nd layer*
% numClasses: the number of categories
% netconfig: the network configuration of the stack
% lambda: the weight regularization penalty
% data: Our matrix containing the training data as columns. So, data(:,i) is the i-th training example.
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
% Gradients to be computed
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
stackgrad{d}.w = zeros(size(stack{d}.w));
stackgrad{d}.b = zeros(size(stack{d}.b));
end
cost = 0;
numCases = size(data, 2); % number of input samples
groundTruth = full(sparse(labels, 1:numCases, 1));
% Forward pass through the stacked layers
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
a{layer+1} = sigmoid(z{layer+1});
end
% Softmax on the last hidden layer (column maximum subtracted for stability)
M = softmaxTheta * a{depth+1};
M = bsxfun(@minus, M, max(M));
p = bsxfun(@rdivide, exp(M), sum(exp(M)));
% NOTE: the data term is scaled by 1/numClasses here, whereas softmaxCost
% averages over samples with 1/numCases. The gradients below use the same
% factor, so cost and gradient remain consistent with each other.
cost = -1/numClasses * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);
softmaxThetaGrad = -1/numClasses * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;
% Backpropagate the error terms
d = cell(depth+1, 1);
d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});
for layer = (depth:-1:2)
d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end
for layer = (depth:-1:1)
stackgrad{layer}.w = (1/numClasses) * d{layer+1} * a{layer}';
stackgrad{layer}.b = (1/numClasses) * sum(d{layer+1}, 2);
end
% -------------------------------------------------------------------------
%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];
end
% You might find this useful
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
feedForwardAutoencoder.m
function [activation] = feedForwardAutoencoder(theta, hiddenSize, visibleSize, data)
% theta: trained weights from the autoencoder
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% data: Our matrix containing the training data as columns. So, data(:,i) is the i-th training example.
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
% Instructions: Compute the activation of the hidden layer for the Sparse Autoencoder.
activation = sigmoid(W1*data+repmat(b1,[1,size(data,2)]));
end
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
sparseAutoencoderCost.m
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
lambda, sparsityParam, beta, data)
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
% notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example.
% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));
Jcost = 0; % reconstruction error
Jweight = 0; % weight decay penalty
Jsparse = 0; % sparsity penalty
[n m] = size(data); % m is the number of samples, n the number of features
% Forward pass: compute the linear combinations and activations of each layer
z2 = W1*data+repmat(b1,1,m); % b1 must be replicated into a matrix with m columns
a2 = sigmoid(z2);
z3 = W2*a2+repmat(b2,1,m);
a3 = sigmoid(z3);
% Reconstruction error
Jcost = (0.5/m)*sum(sum((a3-data).^2));
% Weight decay term
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));
% Sparsity penalty term
rho = (1/m).*sum(a2,2); % average activation of each hidden unit
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...
(1-sparsityParam).*log((1-sparsityParam)./(1-rho)));
% Total cost
cost = Jcost+lambda*Jweight+beta*Jsparse;
% Backward pass: compute the error term of each node
d3 = -(data-a3).*sigmoidInv(z3);
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho)); % extra term contributed by the
% sparsity penalty when taking the partial derivative
d2 = (W2'*d3+repmat(sterm,1,m)).*sigmoidInv(z2);
% Compute W1grad
W1grad = W1grad+d2*data';
W1grad = (1/m)*W1grad+lambda*W1;
% Compute W2grad
W2grad = W2grad+d3*a2';
W2grad = (1/m).*W2grad+lambda*W2;
% Compute b1grad
b1grad = b1grad+sum(d2,2);
b1grad = (1/m)*b1grad; % the bias gradient is a vector, so sum each row over all samples
% Compute b2grad
b2grad = b2grad+sum(d3,2);
b2grad = (1/m)*b2grad;
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
% Note: despite the name, this returns the derivative of the sigmoid
function sigmInv = sigmoidInv(x)
sigmInv = sigmoid(x).*(1-sigmoid(x));
end
stack2params.m
function [params, netconfig] = stack2params(stack)
params = [];
for d = 1:numel(stack)
params = [params ; stack{d}.w(:) ; stack{d}.b(:) ];
assert(size(stack{d}.w, 1) == size(stack{d}.b, 1), ...
['The bias should be a *column* vector of ' ...
int2str(size(stack{d}.w, 1)) 'x1']);
if d < numel(stack)
assert(size(stack{d}.w, 1) == size(stack{d+1}.w, 2), ...
['The adjacent layers L' int2str(d) ' and L' int2str(d+1) ...
' should have matching sizes.']);
end
end
if nargout > 1
% Setup netconfig
if numel(stack) == 0
netconfig.inputsize = 0;
netconfig.layersizes = {};
else
netconfig.inputsize = size(stack{1}.w, 2);
netconfig.layersizes = {};
for d = 1:numel(stack)
netconfig.layersizes = [netconfig.layersizes ; size(stack{d}.w,1)];
end
end
end
end
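The code above also calls params2stack, which is not listed in this post; it performs the reverse of stack2params. The sketch below is not that file. It only illustrates the round trip for a single layer, with made-up sizes and values: the weights and bias are flattened into one vector in the same order stack2params uses (w first, then b) and then recovered with reshape.

```matlab
% Hypothetical single layer: 3 hidden units, 4 inputs
hiddenSize = 3;
inputSize  = 4;
w = reshape(1:hiddenSize*inputSize, hiddenSize, inputSize);  % 3 x 4 weights
b = (101:100+hiddenSize)';                                   % 3 x 1 bias

% Pack: same ordering as stack2params (w(:) followed by b(:))
params = [w(:); b(:)];

% Unpack: what a params2stack-style function has to do for this layer
nW    = hiddenSize * inputSize;
wBack = reshape(params(1:nW), hiddenSize, inputSize);
bBack = params(nW+1 : nW+hiddenSize);

fprintf('round trip ok: %d\n', isequal(w, wBack) && isequal(b, bBack));
```

This is why netconfig stores the input size and layer sizes: without them the flat vector cannot be cut back into matrices of the right shape.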
Experimental results:
The figure shows that the accuracy is above 90% for almost all settings. With 25 hidden units, the accuracy reaches 93.36%.