UFLDL Exercise: Self-Taught Learning

Combine the sparse autoencoder with softmax regression to form a simple three-layer network. The data is still MNIST: the first layer is an input layer of 28*28 neurons, the middle is a hidden layer of 200 neurons, and it is followed by a 5-class softmax loss layer.
First, train the hidden layer unsupervised with the sparse autoencoder, using the images labeled 5-9; this shapes the hidden-layer weights W1 into the pattern below (the same as in the earlier exercise):
[Figure: visualization of the learned hidden-layer weights W1]
The figure contains 200 tiles; each tile is one hidden neuron's weight vector, reshaped into an image and laid out in order.

With these weights W1 fixed, feed in the images labeled 0-4. Multiplying by W1 and passing through the hidden layer's sigmoid activation quickly yields the hidden-layer outputs (a matrix of size 200 x number of training images). These outputs are then fed into the final softmax layer for softmax regression. Since only labels 0-4 are present, the softmax has 5 classes: compute the corresponding cost function, train by gradient descent, and the result is a trained softmaxModel containing W2. Finally, run this model on the other half of the labeled data (the test set) to get the accuracy.
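In outline, and reusing the variable names from the scripts below, the feature-extraction-plus-softmax step amounts to the following minimal sketch (W1 and b1 are the encoder weights and biases learned by the sparse autoencoder; in the full code this is what feedForwardAutoencoder and softmaxTrain do):

% Schematic sketch only; the complete exercise scripts are listed below.
trainFeat = sigmoid(W1 * trainData + repmat(b1, 1, size(trainData, 2))); % 200 x #train
testFeat  = sigmoid(W1 * testData  + repmat(b1, 1, size(testData, 2)));  % 200 x #test
model = softmaxTrain(hiddenSize, numLabels, 1e-4, trainFeat, trainLabels, options);
pred  = softmaxPredict(model, testFeat);
accuracy = mean(pred(:) == testLabels(:));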
The full code is as follows:
stlExercise.m

%% CS294A/CS294W Self-taught Learning Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  self-taught learning. You will need to complete code in feedForwardAutoencoder.m
%  You will also need to have implemented sparseAutoencoderCost.m and 
%  softmaxCost.m from previous exercises.
%
%% ======================================================================
%  STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to 
%  change the parameters below.

clc;
clear;

inputSize  = 28 * 28;
numLabels  = 5;
hiddenSize = 200;
sparsityParam = 0.1; % desired average activation of the hidden units.
                     % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                     %  in the lecture notes). 
lambda = 3e-3;       % weight decay parameter       
beta = 3;            % weight of sparsity penalty term   
maxIter = 400;

%% ======================================================================
%  STEP 1: Load data from the MNIST database
%
%  This loads our training and test data from the MNIST database files.
%  We have sorted the data for you in this so that you will not have to
%  change it.

% Load MNIST database files
mnistData   = loadMNISTImages('mnist/train-images-idx3-ubyte');
mnistLabels = loadMNISTLabels('mnist/train-labels-idx1-ubyte');

% Set Unlabeled Set (All Images)

% Simulate a Labeled and Unlabeled set
labeledSet   = find(mnistLabels >= 0 & mnistLabels <= 4); % indices of the digits 0-4 (the labeled set)
unlabeledSet = find(mnistLabels >= 5);                    % indices of the digits 5-9 (the unlabeled set)

numTrain = round(numel(labeledSet)/2);
trainSet = labeledSet(1:numTrain);        % first half of the labeled set for training
testSet  = labeledSet(numTrain+1:end);    % second half for testing

unlabeledData = mnistData(:, unlabeledSet);

trainData   = mnistData(:, trainSet);     % labeled training data
trainLabels = mnistLabels(trainSet)' + 1; % Shift Labels to the Range 1-5

testData   = mnistData(:, testSet);
testLabels = mnistLabels(testSet)' + 1;   % Shift Labels to the Range 1-5

% Output Some Statistics
fprintf('# examples in unlabeled set: %d\n', size(unlabeledData, 2));
fprintf('# examples in supervised training set: %d\n\n', size(trainData, 2));
fprintf('# examples in supervised testing set: %d\n\n', size(testData, 2));

%% ======================================================================
%  STEP 2: Train the sparse autoencoder
%  This trains the sparse autoencoder on the unlabeled training
%  images. 

%% ----------------- YOUR CODE HERE ----------------------
%  Find opttheta by running the sparse autoencoder on
%  unlabeledTrainingImages

%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, inputSize);

addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
options.maxIter = 400;    % Maximum number of iterations of L-BFGS to run 
options.display = 'on';


[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                   inputSize, hiddenSize, ...
                                   lambda, sparsityParam, ...
                                   beta, unlabeledData), ...
                              theta, options);

%% -----------------------------------------------------

% Visualize weights
W1 = reshape(opttheta(1:hiddenSize * inputSize), hiddenSize, inputSize);
display_network(W1');

%%======================================================================
%% STEP 3: Extract Features from the Supervised Dataset
%  
%  You need to complete the code in feedForwardAutoencoder.m so that the 
%  following command will extract features from the data.

trainFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
                                       trainData);  % hiddenSize (200) x number of training examples

testFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
                                       testData);   % hiddenSize (200) x number of test examples

%%======================================================================
%% STEP 4: Train the softmax classifier

softmaxModel = struct;  
%% ----------------- YOUR CODE HERE ----------------------
%  Use softmaxTrain.m from the previous exercise to train a multi-class
%  classifier. 

%  Use lambda = 1e-4 for the weight regularization for softmax

% You need to compute softmaxModel using softmaxTrain on trainFeatures and
% trainLabels

lambda = 1e-4;
numClasses = 5;
options.maxIter = 400;
inputSize = hiddenSize;   % the softmax now operates on the 200-dimensional learned features
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
                            trainFeatures, trainLabels, options);

%% -----------------------------------------------------


%%======================================================================
%% STEP 5: Testing 

%% ----------------- YOUR CODE HERE ----------------------
% Compute Predictions on the test set (testFeatures) using softmaxPredict
% and softmaxModel
[pred] = softmaxPredict(softmaxModel, testFeatures);

%% -----------------------------------------------------
% Classification Score
fprintf('Test Accuracy: %f%%\n', 100*mean(pred(:) == testLabels(:)));

% (note that we shift the labels by 1, so that digit 0 now corresponds to
%  label 1)
%
% Accuracy is the proportion of correctly classified images
% The results for our implementation was:
%
% Accuracy: 98.3%
%
% 
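
For reference, the objective that sparseAutoencoderCost.m below computes (in the lecture-notes notation, with $m$ training examples and $\hat\rho_j$ the average activation of hidden unit $j$) is

$$ J(W,b) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\bigl\|a^{(3)}(x^{(i)}) - x^{(i)}\bigr\|^2 + \frac{\lambda}{2}\Bigl(\|W^{(1)}\|_F^2 + \|W^{(2)}\|_F^2\Bigr) + \beta\sum_{j=1}^{200}\mathrm{KL}\bigl(\rho\,\|\,\hat\rho_j\bigr), $$

$$ \mathrm{KL}\bigl(\rho\,\|\,\hat\rho_j\bigr) = \rho\log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j}. $$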

sparseAutoencoderCost.m

function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)

% visibleSize: the number of input units (here 28*28 = 784)
% hiddenSize: the number of hidden units (here 200)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our visibleSize x numPatches matrix containing the training data.  So, data(:,i) is the i-th training example.

% The input theta is a vector (because minFunc expects the parameters to be a vector). 
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values). 
% Here, we initialize them to zeros. 
cost = 0;

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b) 
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term 
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2 
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
% 
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2. 
% 
numPatches = size(data,2);
KLdist = 0;

z2 = W1 * data + repmat(b1,[1,numPatches]);   % hiddenSize x numPatches
a2 = sigmoid(z2);                             % hiddenSize x numPatches
p = sum(a2,2);                                % hiddenSize x 1
z3 = W2 * a2 + repmat(b2,[1,numPatches]);     % visibleSize x numPatches
a3 = sigmoid(z3);                             % visibleSize x numPatches
J_sparse = 0.5 * sum(sum((a3-data).^2));      % unscaled reconstruction error

p = p / numPatches;   % average activation of each hidden unit

%% Backpropagation
residual3 = -(data-a3).*a3.*(1-a3);  % output-layer delta, visibleSize x numPatches
tmp = beta * ( - sparsityParam ./ p + (1-sparsityParam) ./ (1-p)); % sparsity term, hiddenSize x 1
residual2 = (W2' * residual3 + repmat(tmp,[1,numPatches])) .* a2.*(1-a2); % hidden-layer delta, hiddenSize x numPatches
W2grad = residual3 * a2' / numPatches + lambda * W2;   % visibleSize x hiddenSize
W1grad = residual2 * data' / numPatches + lambda * W1; % hiddenSize x visibleSize
b2grad = sum(residual3,2) / numPatches; % visibleSize x 1
b1grad = sum(residual2,2) / numPatches; % hiddenSize x 1

%% KL-divergence sparsity penalty
for j = 1:hiddenSize
    KLdist = KLdist + sparsityParam *log( sparsityParam / p(j) )   +   (1 - sparsityParam) * log((1-sparsityParam) / (1 - p(j)));
end

cost = J_sparse / numPatches + (sum(sum(W1.^2)) + sum(sum(W2.^2))) * lambda / 2  + beta * KLdist;

%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)

    sigm = 1 ./ (1 + exp(-x));
end
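
Whenever sparseAutoencoderCost.m is modified it is worth re-running the numerical gradient check from the earlier sparse autoencoder exercise on a tiny problem. A minimal sketch, assuming computeNumericalGradient.m from that exercise is on the path (the sizes and data here are made up for debugging only):

% Hypothetical gradient check on a small random problem (assumes
% computeNumericalGradient.m from the earlier exercise is available).
dbgVisible = 8; dbgHidden = 5;
dbgData  = rand(dbgVisible, 10);
dbgTheta = initializeParameters(dbgHidden, dbgVisible);
[~, grad] = sparseAutoencoderCost(dbgTheta, dbgVisible, dbgHidden, ...
                                  lambda, sparsityParam, beta, dbgData);
numGrad = computeNumericalGradient(@(p) sparseAutoencoderCost(p, dbgVisible, dbgHidden, ...
                                  lambda, sparsityParam, beta, dbgData), dbgTheta);
fprintf('Relative difference: %g (should be very small, e.g. below 1e-8)\n', ...
        norm(numGrad - grad) / norm(numGrad + grad));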

softmaxCost.m

function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)

% numClasses - the number of classes 
% inputSize - the size N of the input vector
% lambda - weight decay parameter
% data - the N x M input matrix, where each column data(:, i) corresponds to
%        a single training example
% labels - an M x 1 matrix containing the labels corresponding for the input data
%

% Unroll the parameters from theta
theta = reshape(theta, numClasses, inputSize); % numClasses x inputSize (5 x 200 in this exercise)

numCases = size(data, 2);   % number of examples

groundTruth = full(sparse(labels, 1:numCases, 1));  % numClasses x numCases indicator matrix
cost = 0;

thetagrad = zeros(numClasses, inputSize);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost and gradient for softmax regression.
%                You need to compute thetagrad and cost.
%                The groundTruth matrix might come in handy.

thetaTx = theta * data;   % numClasses x numCases
% Subtract the column-wise maximum to keep exp from overflowing
thetaTx = bsxfun(@minus, thetaTx, max(thetaTx, [], 1));   % numClasses x numCases
eThetaTx = exp(thetaTx);   % numClasses x numCases
hypothesis = eThetaTx./(repmat(sum(eThetaTx),size(eThetaTx,1),1)); % numClasses x numCases

cost = -sum(sum(log(hypothesis).*groundTruth))/numCases;
cost = cost + sum(sum(theta.^2))*lambda/2;

thetagrad = -(groundTruth - hypothesis)*data'./numCases + lambda.*theta;  % numClasses x inputSize

% ------------------------------------------------------------------
% Unroll the gradient matrices into a vector for minFunc
grad = [thetagrad(:)];
end
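
Step 5 of stlExercise.m also calls softmaxPredict, which is not listed above. A minimal sketch of what it can look like, assuming (as in softmaxTrain from the previous exercise) that the learned parameters are stored in softmaxModel.optTheta as a numClasses x inputSize matrix:

function [pred] = softmaxPredict(softmaxModel, data)
% softmaxModel - model returned by softmaxTrain (optTheta: numClasses x inputSize)
% data         - inputSize x numExamples matrix of features
% pred         - 1 x numExamples vector of predicted labels (1..numClasses)

theta = softmaxModel.optTheta;          % numClasses x inputSize
[~, pred] = max(theta * data, [], 1);   % pick the highest-scoring class for each column

end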

feedForwardAutoencoder.m

function [activation] = feedForwardAutoencoder(theta, hiddenSize, visibleSize, data)

% theta: trained weights from the autoencoder
% visibleSize: the number of input units (probably 784) 
% hiddenSize: the number of hidden units (probably 200) 
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 

% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the activation of the hidden layer for the Sparse Autoencoder.

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);      % hiddenSize x visibleSize (200 x 784)
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);  % hiddenSize x 1

activation = sigmoid(W1 * data + repmat(b1,[1,size(data,2)]));  % hiddenSize x number of examples

%-------------------------------------------------------------------

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
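
A quick, purely illustrative sanity check after Step 3: the extracted features should have hiddenSize rows and one column per example.

% Illustrative check of the feature dimensions (hiddenSize x number of examples)
feats = feedForwardAutoencoder(opttheta, hiddenSize, 28*28, trainData);
assert(isequal(size(feats), [hiddenSize, size(trainData, 2)]));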

These are the main files; the others have already been covered in earlier posts. Essentially this exercise just combines the material of the previous two chapters.
The run output is as follows:
# examples in unlabeled set: 29404
# examples in supervised training set: 15298

# examples in supervised testing set: 15298

Iteration FunEvals Step Length Function Val Opt Cond
1 5 3.14777e-02 1.10167e+02 8.54784e+03
2 6 1.00000e+00 9.14035e+01 4.37969e+03
3 7 1.00000e+00 8.38713e+01 1.98396e+03
4 8 1.00000e+00 7.95523e+01 2.03865e+03
5 9 1.00000e+00 7.12491e+01 2.72443e+03
6 10 1.00000e+00 5.43268e+01 2.97157e+03
7 11 1.00000e+00 3.28241e+01 8.45774e+02
8 ….



387 398 1.00000e+00 1.16473e+01 5.81143e+00
388 399 1.00000e+00 1.16468e+01 7.29393e+00
389 400 1.00000e+00 1.16462e+01 9.87254e+00
390 401 1.00000e+00 1.16455e+01 8.87469e+00
391 402 1.00000e+00 1.16451e+01 6.20162e+00
392 403 1.00000e+00 1.16445e+01 5.85851e+00
393 404 1.00000e+00 1.16442e+01 6.73153e+00
394 405 1.00000e+00 1.16436e+01 7.44381e+00
395 406 1.00000e+00 1.16433e+01 1.08066e+01
396 407 1.00000e+00 1.16426e+01 6.51165e+00
397 408 1.00000e+00 1.16423e+01 5.74701e+00
398 409 1.00000e+00 1.16419e+01 6.52625e+00
399 410 1.00000e+00 1.16415e+01 6.77704e+00
400 411 1.00000e+00 1.16411e+01 8.99886e+00
Exceeded Maximum Number of Iterations
Iteration FunEvals Step Length Function Val Opt Cond
1 3 6.30893e-01 1.33252e+00 1.33745e+01
2 4 1.00000e+00 4.75012e-01 4.63605e+00
3 5 1.00000e+00 3.35942e-01 3.22785e+00
4 6 1.00000e+00 2.52347e-01 1.51473e+00
5 7 1.00000e+00 1.99355e-01 8.24715e-01
6 8 1.00000e+00 1.65120e-01 9.29615e-01
7 9 1.00000e+00 1.38252e-01 5.01888e-01
8 10 1.00000e+00 1.28380e-01 3.52842e-01
9 11 1.00000e+00 1.14536e-01 2.63068e-01
10 12 1.00000e+00 1.00907e-01 3.94667e-01
11 13 1.00000e+00 9.30330e-02 2.36750e-01
12 14 1.00000e+00 8.75129e-02 1.66041e-01
13 15 1.00000e+00 8.30299e-02 1.09462e-01
14 16 1.00000e+00 7.91959e-02 1.07245e-01
15 17 1.00000e+00 7.65056e-02 1.08345e-01
16 18 1.00000e+00 7.50167e-02 1.22889e-01
17 19 1.00000e+00 7.35633e-02 6.50477e-02
18 20 1.00000e+00 7.28726e-02 4.92305e-02
19 21 1.00000e+00 7.23238e-02 4.84382e-02
20 22 1.00000e+00 7.18056e-02 5.33487e-02
21 23 1.00000e+00 7.13970e-02 5.01899e-02
22 24 1.00000e+00 7.11652e-02 3.17180e-02
23 25 1.00000e+00 7.09514e-02 2.80332e-02
24 26 1.00000e+00 7.08163e-02 2.81097e-02
25 27 1.00000e+00 7.06447e-02 2.41427e-02
26 28 1.00000e+00 7.05202e-02 1.79102e-02
27 29 1.00000e+00 7.04211e-02 1.02507e-02
28 30 1.00000e+00 7.03787e-02 1.20261e-02
29 31 1.00000e+00 7.03509e-02 1.02202e-02
30 32 1.00000e+00 7.03269e-02 8.55803e-03
31 33 1.00000e+00 7.03047e-02 6.39926e-03
32 34 1.00000e+00 7.02937e-02 4.76259e-03
33 35 1.00000e+00 7.02883e-02 3.95992e-03
34 36 1.00000e+00 7.02837e-02 3.65665e-03
35 37 1.00000e+00 7.02802e-02 3.33861e-03
36 38 1.00000e+00 7.02776e-02 2.33751e-03
37 39 1.00000e+00 7.02764e-02 1.37919e-03
38 40 1.00000e+00 7.02755e-02 1.25428e-03
39 41 1.00000e+00 7.02747e-02 1.19318e-03
40 42 1.00000e+00 7.02741e-02 1.35342e-03
41 43 1.00000e+00 7.02738e-02 9.47215e-04
42 44 1.00000e+00 7.02736e-02 4.93086e-04
43 45 1.00000e+00 7.02735e-02 3.68544e-04
44 46 1.00000e+00 7.02734e-02 3.88469e-04
45 47 1.00000e+00 7.02733e-02 3.51496e-04
46 49 4.74595e-01 7.02733e-02 4.15585e-04
47 50 1.00000e+00 7.02733e-02 2.27196e-04
48 51 1.00000e+00 7.02733e-02 1.46213e-04
49 52 1.00000e+00 7.02733e-02 1.18617e-04
50 53 1.00000e+00 7.02733e-02 1.00106e-04
51 54 1.00000e+00 7.02733e-02 7.27407e-05
52 56 3.17967e-01 7.02732e-02 5.64051e-05
53 57 1.00000e+00 7.02732e-02 3.64929e-05
Function Value changing by less than TolX
Test Accuracy: 98.339652%
The result again matches the reference answer exactly. (Ng's tutorial really is excellent: the code you write yourself amounts to only a dozen or so lines and is easy to get working, yet those lines are exactly the most essential and most instructive parts.)

One thing worth noting: training the sparse autoencoder, i.e. learning W1, is quite time-consuming; it took about 10 minutes, unexpectedly long. After the W1 weight image (the one at the top) was displayed, the softmax training and testing finished in about a second. As the log shows, the softmax reaches the target loss after only 53 iterations, whereas previously, without the sparse autoencoder layer in the middle (i.e. feeding the 28*28 inputs directly into a 10-way softmax), it took roughly 350 iterations, and ran for quite a while, to converge. Combining features learned without supervision with a supervised classifier (self-taught learning, loosely a form of semi-supervised learning) really does help when the labeled dataset is fairly small.
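For reference, the raw-pixel baseline mentioned above (the earlier softmax-regression exercise: 28*28 inputs straight into a 10-way softmax, no learned features) amounts to something like this hypothetical call, reusing softmaxTrain and the variables from stlExercise.m:

% Hypothetical raw-pixel baseline for comparison: no autoencoder features.
rawOptions.maxIter = 400;
rawModel = softmaxTrain(28*28, 10, 1e-4, mnistData, mnistLabels + 1, rawOptions);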
In the end, with just three layers and about 10 minutes we get roughly 98% accuracy, far simpler and faster than running a CNN in Caffe, although the accuracy is still somewhat lower.
