Spam Classification: SVM, Logistic Regression, and SAE-Logistic (Deep Network)

This post walks through spam classification with three methods: an SVM, logistic regression, and an SAE-Logistic deep network. Each method's use on the spam dataset is described in turn.

The dataset is spamData.mat: 3065 training samples and 1536 test samples, each with 57 features.

Dataset download: https://github.com/probml/pmtkdata/tree/master/spamData
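A quick sanity check after downloading (the .mat file provides the variables Xtrain, ytrain, Xtest, and ytest, which are the names used in all the code below):

load('spamData.mat');        % provides Xtrain, ytrain, Xtest, ytest
size(Xtrain)                 % should report 3065 x 57
size(Xtest)                  % should report 1536 x 57
unique(ytrain)'              % labels take the values 0 and 1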

I. SVM classification

II. Logistic regression classification

III. SAE-Logistic classification


I. SVM classification

    I won't go over the theory here; it is covered in earlier posts and in plenty of material online. This section only shows how to classify the spam data with LIBSVM and with MATLAB's built-in SVM toolbox.

1. Spam classification with LIBSVM

LIBSVM download: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

A detailed guide (in Chinese): http://www.matlabsky.com/thread-11925-1-1.html

After downloading LIBSVM, add it to the MATLAB path: File -> Set Path -> Add with Subfolders -> add the path of the libsvm-3.11 folder.

1. First, in the MATLAB Command Window, type: mex -setup

2. When "Would you like mex to locate installed compilers [y]/n?" appears, answer y.

3. At "Select a compiler:"
[1] Microsoft Visual C++ 2010 in E:\VS2010
[0] None
choose 1.

4. At "Are these correct [y]/n?", answer y.

The MEX interface is now set up and LIBSVM can be used:

load('spamData.mat');
model = svmtrain(ytrain, Xtrain, '-t 0');                      % train with a linear kernel
[predict_label, accuracy] = svmpredict(ytest, Xtest, model);   % accuracy(1) is the test accuracy in %
The '-t x' option selects the kernel; x can be 0, 1, 2, 3, or 4, and the default is 2 if -t is omitted (a comparison sketch is given after the label-remapping snippet below):

0) linear kernel
1) polynomial kernel
2) RBF kernel
3) sigmoid kernel
4) precomputed (user-defined) kernel


Comparing the results across the kernels, the linear kernel performs best, reaching 91.1458% accuracy.

The labels need to be remapped first:

ytrain(ytrain==0) = -1;   % LIBSVM works with +1/-1 labels
ytest(ytest==0) = -1;
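A minimal sketch for comparing the four standard kernels in one run, assuming the labels have been remapped as above and the LIBSVM MATLAB interface is on the path:

% Compare LIBSVM kernels: 0 = linear, 1 = polynomial, 2 = RBF, 3 = sigmoid.
for k = 0:3
    opts  = sprintf('-t %d -q', k);                 % -q suppresses LIBSVM's training output
    model = svmtrain(ytrain, Xtrain, opts);
    [~, acc] = svmpredict(ytest, Xtest, model);     % acc(1) is the test accuracy in %
    fprintf('kernel -t %d: accuracy = %.4f%%\n', k, acc(1));
end

Note that LIBSVM's svmtrain shares its name with MATLAB's built-in svmtrain used in the next subsection, so whichever folder sits higher on the path wins; make sure the LIBSVM folder takes precedence when running this.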


2. Spam classification with MATLAB's built-in SVM toolbox

    MATLAB ships with its own SVM toolbox; here is how to use it for spam classification. The program is given first and then explained.

load spamData
svmStruct = svmtrain(Xtrain, ytrain, 'showplot', true);       % train with the default (linear) kernel
classes = svmclassify(svmStruct, Xtest, 'showplot', true);
nCorrect = sum(classes == ytest);                             % number of correctly classified test samples
accuracy = 100 * nCorrect / length(classes);
fprintf('accuracy = %.2f%%\n', accuracy);

If a warning appears while this runs (for example that showplot only works for 2-D data), it can be ignored; it does not affect the result.

To change the kernel, right-click the function name in the program and open svmtrain.m. Near the defaults line dflts = {'linear', ...} (line 287 in my copy), the default kernel can be changed to any of the values in okfuns = {'linear','quadratic', 'radial','rbf','polynomial','mlp'}; a sketch that selects the kernel through an option instead of editing the file is given after the list below.

linear: linear kernel.

quadratic: quadratic kernel.

radial: I am not sure which kernel this corresponds to.

rbf: radial basis function kernel, usually called the Gaussian kernel (though Ng says it does not really have anything to do with the Gaussian).

polynomial: polynomial kernel.

mlp: multilayer perceptron kernel.
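Rather than editing svmtrain.m, the kernel can also be selected through the 'kernel_function' option of MATLAB's svmtrain. A minimal sketch looping over the documented kernel names (I left out 'radial' from the okfuns list above, since I am not sure it is accepted as an option value; exact option names may vary between MATLAB versions):

% Try each kernel of MATLAB's built-in svmtrain and report the test accuracy.
kernels = {'linear', 'quadratic', 'rbf', 'polynomial', 'mlp'};
for k = 1:numel(kernels)
    try
        svmStruct = svmtrain(Xtrain, ytrain, 'kernel_function', kernels{k});
        classes   = svmclassify(svmStruct, Xtest);
        acc       = 100 * sum(classes == ytest) / numel(ytest);
        fprintf('%-10s kernel: accuracy = %.2f%%\n', kernels{k}, acc);
    catch err
        fprintf('%-10s kernel: failed (%s)\n', kernels{k}, err.message);
    end
end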


Comparing the recognition rates of the different kernels on the spam data, the quadratic kernel performs best, reaching 85.55%.


The cleaned datasets (with three different preprocessing variants) are available in the download resources.

Exercise 8.1 homework results: logistic regression and SVM.

II. Logistic regression classification

     Reference: http://www.docin.com/p-160363677.html

An earlier post covered softmax classification, which can be turned into logistic regression: http://blog.csdn.net/hlx371240/article/details/40015395

That post was written for the Optimization Methods course, where L-BFGS and steepest descent were used to optimize the parameters; here I use a toolbox to do the classification directly.
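A quick note on why a two-class softmax is the same model as logistic regression: with K = 2 classes the softmax probability of class 1 collapses to a sigmoid of the difference of the two weight vectors,

$$
p(y=1\mid x)=\frac{e^{\theta_1^{\top}x}}{e^{\theta_1^{\top}x}+e^{\theta_2^{\top}x}}
=\frac{1}{1+e^{-(\theta_1-\theta_2)^{\top}x}}
=\sigma\big((\theta_1-\theta_2)^{\top}x\big),
$$

so running the softmax code below with numClasses = 2 is logistic regression, just with an over-parameterized theta.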

%% STEP 0: Initialise constants and parameters
inputSize = 57;     % size of each input vector
numClasses = 2;     % number of classes
lambda = 1e-4;      % weight decay parameter
%%=====================================================================
%% STEP 1: Load data
load('D:\機器學習課程\作業三\spamData.mat');
Xtrain = Xtrain';                 % the softmax code below expects one sample per column
ytrain(ytrain==0) = 2;            % remap label 0 to 2 (softmax labels must lie in 1..numClasses)
inputData = Xtrain;
DEBUG = false;                    % set to true to debug softmaxCost on a small random problem
if DEBUG
    inputSize = 8;
    inputData = randn(8, 100);
    labels = randi(numClasses, 100, 1);   % random labels in 1..numClasses
end
% Randomly initialise theta as one long column vector of parameters
theta = 0.005 * randn(numClasses * inputSize, 1);
%%======================================================================
%% STEP 2: Implement softmaxCost
[cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, inputData, ytrain);
%%======================================================================
%% STEP 3: Learning parameters
options.maxIter = 100;
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
                            inputData, ytrain, options);
%%======================================================================
%% STEP 4: Testing
Xtest = Xtest';
ytest(ytest==0) = 2;              % remap label 0 to 2, same as the training labels
size(softmaxModel.optTheta)
size(inputData)
[pred] = softmaxPredict(softmaxModel, Xtest);
acc = mean(ytest(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);
softmaxTrain.m

<span style="font-family:Times New Roman;">function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)
if ~exist('options', 'var')
    options = struct;
end
if ~isfield(options, 'maxIter')
    options.maxIter = 400;
end
% initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);
% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % softmaxCost.m satisfies this.
minFuncOptions.display = 'on';
[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
                                   numClasses, inputSize, lambda, ...
                                   inputData, labels), ...                                   
                              theta, options);
% Fold softmaxOptTheta into a nicer format
softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;                       
end                   

softmaxCost.m

<span style="font-size:14px;">function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)

% numClasses - the number of classes 
% inputSize - the size N of the input vector
% lambda - weight decay parameter
% data - the N x M input matrix, where each column data(:, i) is a single
%        training example
% labels - an M x 1 vector containing the label for each input sample
%

% Unroll the parameters from theta
theta = reshape(theta, numClasses, inputSize);   % reshape the parameter vector into a numClasses x inputSize matrix

numCases = size(data, 2);                        % number of training samples
groundTruth = full(sparse(labels, 1:numCases, 1));   % indicator (one-hot) matrix: a 1 at row labels(i), column i
cost = 0;
thetagrad = zeros(numClasses, inputSize);
M = bsxfun(@minus,theta*data,max(theta*data, [], 1));   % subtract each column's max for numerical stability
M = exp(M);
p = bsxfun(@rdivide, M, sum(M));                        % class probabilities, one column per sample
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(theta(:) .^ 2);
thetagrad = -1/numCases * (groundTruth - p) * data' + lambda * theta;

grad = thetagrad(:);
end
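For reference, the quantity softmaxCost.m computes is the regularized softmax cost and its gradient (with m = numCases, K = numClasses, n = inputSize):

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}1\{y^{(i)}=k\}\log\frac{e^{\theta_k^{\top}x^{(i)}}}{\sum_{j=1}^{K}e^{\theta_j^{\top}x^{(i)}}}
+\frac{\lambda}{2}\sum_{k=1}^{K}\sum_{j=1}^{n}\theta_{kj}^{2},
\qquad
\nabla_{\theta_k}J=-\frac{1}{m}\sum_{i=1}^{m}x^{(i)}\Big(1\{y^{(i)}=k\}-p(y^{(i)}=k\mid x^{(i)})\Big)+\lambda\,\theta_k .
$$

Subtracting the column-wise maximum of theta*data before exponentiating does not change these values; it only keeps exp from overflowing.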


softmaxPredict.m

function [pred] = softmaxPredict(softmaxModel, data)
% Unroll the parameters from theta
theta = softmaxModel.optTheta;  % this provides a numClasses x inputSize matrix
pred = zeros(1, size(data, 2));
[nop, pred] = max(theta * data);
end

initializeParameters.m

<span style="font-family:Times New Roman;">function theta = initializeParameters(hiddenSize, visibleSize)
%% Initialize parameters randomly based on layer sizes.
r  = sqrt(6) / sqrt(hiddenSize+visibleSize+1);   % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);
% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all 
% your parameters into a vector, which can then be used with minFunc. 
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];
end
sigmoidInv.m

function sigmInv = sigmoidInv(x)
    sigmInv = sigmoid(x).*(1-sigmoid(x));
end

This training uses L-BFGS, a quasi-Newton method: it saves memory by replacing the exact Hessian with an approximation. This was covered by my teacher; I recommend taking her Numerical Optimization course next semester, as she explains it very well.
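To make the "approximate Hessian" remark concrete, a quasi-Newton step has the form

$$
x_{k+1}=x_k-\alpha_k H_k\nabla f(x_k),\qquad H_{k+1}y_k=s_k,\quad s_k=x_{k+1}-x_k,\;\; y_k=\nabla f(x_{k+1})-\nabla f(x_k),
$$

where H_k approximates the inverse Hessian and is built purely from gradient differences via the secant condition above. L-BFGS stores only the last few (s_k, y_k) pairs instead of a full matrix, which is where the memory saving comes from.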

The minFunc toolbox is also available in the download resources.

The final recognition rate is 92.057%. It will not be exactly the same on every run, because the parameters are initialized randomly.
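If a repeatable number is wanted, the randomness can be pinned down by seeding MATLAB's generator before creating theta (a minimal sketch; the accuracy obtained for any particular seed may still differ slightly from the figure above):

rng(0);                                             % fix the random seed
theta = 0.005 * randn(numClasses * inputSize, 1);   % the same initialization as in the script above, now reproducible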


III. SAE-Logistic classification
       SAE stands for Sparse Auto Encoder. A self-learned layer is added to extract features before classification, since the occurrences of different words in an email may be correlated; this is feature extraction with a neural network.

    Reference: http://nlp.stanford.edu/~socherr/sparseAutoencoder_2011new.pdf

       Earlier post: http://blog.csdn.net/hlx371240/article/details/40201499

The network structure is: a 57-dimensional input layer, one sparse-autoencoder hidden layer, and a 2-way softmax output layer.

main.m
<span style="color:#3333ff;font-size:18px; font-weight: bold; font-family: 'Times New Roman';">%STEP 2: 初始化參數和load數據
</span><span style="font-family:Times New Roman;font-size:14px;">clear all;
clc;
load('D:\機器學習課程\作業三\spamData.mat');
Xtrain = Xtrain';                % one sample per column
ytrain(ytrain==0) = 2;           % remap label 0 to 2
Xtest = Xtest';
ytest(ytest == 0) = 2;           % remap label 0 to 2
inputSize  = 57;
numLabels  = 2;
a = [57 50 45 40 35 30 25 20 15 10 5];   % hidden-layer sizes to try
sparsityParam = 0.1; % desired average activation of the hidden units
lambda = 3e-3;       % weight decay parameter
beta = 3;            % weight of sparsity penalty term
numClasses = 2;      % number of classes
lambda = 1e-4;       % weight decay parameter (note: this overrides the 3e-3 above)
%% ======================================================================
%STEP 2: Train the sparse autoencoder (self-taught feature) layer
for i=1:11
    hiddenSize = a(i);
    theta = initializeParameters(hiddenSize, inputSize);
    %-------------------------------------------------------------------
    opttheta = theta;
    addpath minFunc/
    options.Method = 'lbfgs';
    options.maxIter = 400;
    options.display = 'on';
    [opttheta, loss] = minFunc( @(p) sparseAutoencoderCost(p, ...
        inputSize, hiddenSize, ...
        lambda, sparsityParam, ...
        beta, Xtrain), ...
        theta, options);
    trainFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
        Xtrain);
    %% ================================================
    %STEP 3: Train the softmax classifier
    saeSoftmaxTheta = 0.005 * randn(hiddenSize * numClasses, 1);
    softmaxLambda = 1e-4;
    numClasses = 2;
    softoptions = struct;
    softoptions.maxIter = 500;
    softmaxModel = softmaxTrain(hiddenSize,numClasses,softmaxLambda,...
        trainFeatures,ytrain,softoptions);
    theta_new = softmaxModel.optTheta(:);
    %% ============================================================
    % Fine-tune the whole stack (autoencoder layer + softmax) with backprop via stackedAECost
    stack = cell(1,1);
    stack{1}.w = reshape(opttheta(1:hiddenSize * inputSize), hiddenSize, inputSize);
    stack{1}.b =opttheta(2*hiddenSize*inputSize+1:2*hiddenSize*inputSize+hiddenSize);
    [stackparams, netconfig] = stack2params(stack);
    stackedAETheta = [theta_new;stackparams];
    addpath minFunc/;
    options = struct;
    options.Method = 'lbfgs';
    options.maxIter = 400;
    options.display = 'on';
    [stackedAEOptTheta,cost] =  minFunc(@(p)stackedAECost(p,inputSize,hiddenSize,numClasses, netconfig,lambda, Xtrain, ytrain),stackedAETheta,options);
    %% =================================================================
    %STEP 4: Testing
    [pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSize, ...
        numClasses, netconfig, Xtest);
    acc = mean(ytest(:) == pred(:));
    fprintf('Accuracy = %0.3f%%\n', acc * 100);
    result(i)=acc * 100;
end
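Since the loop stores each run's accuracy in result(i), the summary reported in the experimental results below can be plotted with a small sketch like this (hidden-layer size on the x-axis, test accuracy on the y-axis):

% Plot test accuracy against the hidden-layer sizes tried in the loop above.
figure;
plot(a, result, '-o');
xlabel('number of hidden units');
ylabel('test accuracy (%)');
title('SAE-Logistic accuracy vs. hidden-layer size');
grid on;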
stackedAEPredict.m
<span style="font-family:Times New Roman;font-size:14px;">function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)                     
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
  z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
  a{layer+1} = sigmoid(z{layer+1});
end
[~, pred] = max(softmaxTheta * a{depth+1});
end
% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
stackedAECost.m
<span style="font-family:Times New Roman;font-size:14px;">function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                              numClasses, netconfig, ...
                                              lambda, data, labels)                                       
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.                                         
% theta: trained weights from the autoencoder
% visibleSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% netconfig:   the network configuration of the stack
% lambda:      the weight regularization penalty
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% labels: A vector containing labels, where labels(i) is the label for the i-th training example.
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end
cost = 0; % You need to compute this
numCases = size(data, 2);   % number of training samples
groundTruth = full(sparse(labels, 1:numCases, 1));
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
  z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
  a{layer+1} = sigmoid(z{layer+1});
end
M = softmaxTheta * a{depth+1};
M = bsxfun(@minus, M, max(M));
p = bsxfun(@rdivide, exp(M), sum(exp(M)));
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);   % average over the samples (numCases), not the classes
softmaxThetaGrad = -1/numCases * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;
d = cell(depth+1);
d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});
for layer = (depth:-1:2)
  d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end
for layer = (depth:-1:1)
  stackgrad{layer}.w = (1/numCases) * d{layer+1} * a{layer}';
  stackgrad{layer}.b = (1/numCases) * sum(d{layer+1}, 2);
end
% -------------------------------------------------------------------------
%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];
end
% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
feedForwardAutoencoder.m
<span style="font-family:Times New Roman;font-size:14px;">function [activation] = feedForwardAutoencoder(theta, hiddenSize, visibleSize, data)
% theta: trained weights from the autoencoder
% visibleSize: the number of input units
% hiddenSize: the number of hidden units
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
%  Instructions: Compute the activation of the hidden layer for the Sparse Autoencoder.
activation  = sigmoid(W1*data+repmat(b1,[1,size(data,2)]));
end
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
sparseAutoencoderCost.m
<span style="font-family:Times New Roman;font-size:14px;">function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)
% visibleSize: the number of input units
% hiddenSize: the number of hidden units
% lambda: weight decay parameter
% sparsityParam: the desired average activation of the hidden units (the rho
%                in the lecture notes)
% beta: weight of the sparsity penalty term
% data: matrix containing the training data, so data(:,i) is the i-th training example
% The input theta is a vector (because minFunc expects the parameters to be a vector). 
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
% Cost and gradient variables (your code needs to compute these values). 
% Here, we initialize them to zeros. 
cost = 0;
W1grad = zeros(size(W1)); 
W2grad = zeros(size(W2));
b1grad = zeros(size(b1)); 
b2grad = zeros(size(b2));

Jcost = 0;    % reconstruction error term
Jweight = 0;  % weight decay (regularization) term
Jsparse = 0;  % sparsity penalty term
[n m] = size(data);   % m = number of samples, n = number of features per sample
% Forward pass: compute each layer's linear combinations and activations
z2 = W1*data+repmat(b1,1,m);   % b1 must be replicated into an m-column matrix
a2 = sigmoid(z2);
z3 = W2*a2+repmat(b2,1,m);
a3 = sigmoid(z3);
% Reconstruction error
Jcost = (0.5/m)*sum(sum((a3-data).^2));
% Weight decay term
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));
% Sparsity penalty term (KL divergence between sparsityParam and the average activations)
rho = (1/m).*sum(a2,2);   % average activation of each hidden unit over the training set
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...
        (1-sparsityParam).*log((1-sparsityParam)./(1-rho)));
% Total cost
cost = Jcost+lambda*Jweight+beta*Jsparse;
% Backward pass: compute the error term of each node
d3 = -(data-a3).*sigmoidInv(z3);
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));   % extra term introduced by the
                                                                 % sparsity penalty
d2 = (W2'*d3+repmat(sterm,1,m)).*sigmoidInv(z2); 
% Gradient of W1
W1grad = W1grad+d2*data';
W1grad = (1/m)*W1grad+lambda*W1;
% Gradient of W2
W2grad = W2grad+d3*a2';
W2grad = (1/m).*W2grad+lambda*W2;
% Gradient of b1
b1grad = b1grad+sum(d2,2);
b1grad = (1/m)*b1grad;   % the bias gradient is a vector: sum the error terms across samples
% Gradient of b2
b2grad = b2grad+sum(d3,2);
b2grad = (1/m)*b2grad;
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
function sigmInv = sigmoidInv(x)
    sigmInv = sigmoid(x).*(1-sigmoid(x));
end
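Written out, the cost that sparseAutoencoderCost.m returns is (m samples, \hat{\rho}_j the average activation of hidden unit j, \rho = sparsityParam):

$$
J(W,b)=\underbrace{\frac{1}{2m}\sum_{i=1}^{m}\big\lVert a^{(3)}(x^{(i)})-x^{(i)}\big\rVert^{2}}_{\text{Jcost}}
+\lambda\underbrace{\frac{1}{2}\big(\lVert W_1\rVert_F^{2}+\lVert W_2\rVert_F^{2}\big)}_{\text{Jweight}}
+\beta\underbrace{\sum_{j}\Big(\rho\log\frac{\rho}{\hat{\rho}_j}+(1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\Big)}_{\text{Jsparse}},
$$

which is exactly cost = Jcost + lambda*Jweight + beta*Jsparse in the code; the backward pass then adds the beta-weighted sparsity term sterm to the usual error signal of the hidden layer.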
stack2params.m
<span style="font-family:Times New Roman;font-size:14px;">function [params, netconfig] = stack2params(stack)

params = [];
for d = 1:numel(stack)
    params = [params ; stack{d}.w(:) ; stack{d}.b(:) ];
    assert(size(stack{d}.w, 1) == size(stack{d}.b, 1), ...
        ['The bias should be a *column* vector of ' ...
         int2str(size(stack{d}.w, 1)) 'x1']);
    if d < numel(stack)
        assert(size(stack{d}.w, 1) == size(stack{d+1}.w, 2), ...
            ['The adjacent layers L' int2str(d) ' and L' int2str(d+1) ...
             ' should have matching sizes.']);
    end
end
if nargout > 1
    % Setup netconfig
    if numel(stack) == 0
        netconfig.inputsize = 0;
        netconfig.layersizes = {};
    else
        netconfig.inputsize = size(stack{1}.w, 2);
        netconfig.layersizes = {};
        for d = 1:numel(stack)
            netconfig.layersizes = [netconfig.layersizes ; size(stack{d}.w,1)];
        end
    end
end
end

Experimental results:

The recognition rate stays above 90% for nearly all hidden-layer sizes; with 25 hidden units it reaches 93.36%.

Comparing the recognition rates of all the methods: SAE-Logistic > Logistic > LIBSVM > MATLAB SVM. SAE-Logistic gives the highest recognition rate, and that is exactly the appeal of deep networks (deep learning).


