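Deep learning 1 (Fundamentals, part 1)

Source: http://www.cnblogs.com/tornadomeet You are welcome to repost or share, but please cite the source.

　　Preface:

　　I recently decided to study some of the theory behind deep learning in a somewhat systematic way, following Andrew Ng's web tutorial, the UFLDL Tutorial, which is said to be accessible and not too long. Before that, I am reviewing the basics of machine learning using the course at http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning. The material is short, each section runs only a few minutes, and it is very well presented.

　　Some terms used in the tutorial:

　　Model representation:

　　The form of the function being learned, which can be written with matrices.

　　Vectorized implementation:

　　An implementation of the function's expression in vector form.

　　Feature scaling:

　　Rescaling every dimension of the features, for example so that each has zero mean.

　　Normal equations:

　　The matrix form of the parameter solution in multivariate linear regression; the equation that gives this solution is called the normal equations.

　　Optimization objective:

　　The objective function to be optimized, for example the derivation of the loss function in logistic regression, or the regularized objective function in multivariate linear regression.

　　Gradient Descent, Newton's Method:

　　Both are methods for finding the minimum of the objective function.

　　Common variations:

　　The various forms the regularization term can take.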

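　　Some notes:

　　Model representation means giving the functional relationship between input and output. The function rests on some assumption about its form and may contain parameters. Given many training samples, we can also write down the average error over those samples, usually called the loss function. Our goal is to find the parameters of the model, which is done by minimizing the loss function. The usual tool is gradient descent: start from a random set of parameter values, then update them repeatedly so that each update lowers the loss, until it reaches a minimum. In gradient descent the objective can be viewed as a function of the parameters alone, because once the sample inputs and outputs are fixed, only the parameters remain as free variables. Every iteration updates all parameters, and each uses the same form of update: the previous value minus the learning rate times the partial derivative of the objective with respect to that parameter (with a single parameter, simply the derivative). Why does this work? Trying parameter values at different points shows that the partial derivative points in the direction in which the objective increases, so stepping against it lowers the objective, which is what we want. Even with a fixed learning rate (as long as it is not too large), gradient descent can converge to a local minimum, because the gradient shrinks as the minimum approaches and so does its product with the fixed learning rate. In linear regression we can use gradient descent to find the regression parameters. The method is sometimes called batch gradient descent, where "batch" means that every parameter update uses all training samples.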
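The update rule with a fixed learning rate can be tried on a one-parameter objective. This is a minimal sketch: the objective J(theta) = (theta - 3)^2, the learning rate 0.1 and the iteration count are hypothetical values chosen for illustration, not taken from the course.

```matlab
% Minimise J(theta) = (theta - 3)^2 with a fixed learning rate.
J     = @(theta) (theta - 3).^2;    % objective, viewed as a function of the parameter
dJ    = @(theta) 2 .* (theta - 3);  % its derivative
alpha = 0.1;                        % fixed learning rate (hypothetical value)
theta = 0;                          % starting guess
for k = 1:200
    theta = theta - alpha * dJ(theta);  % step against the derivative
end
disp(theta)                         % close to the minimiser theta = 3
```

The step alpha * dJ(theta) shrinks on its own as theta approaches 3, which is why the fixed learning rate still converges here.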

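Vectorized implementation means writing the computation in vector form. Many quantities in practical problems are vectors, so writing out every component is inconvenient and vector notation should be used wherever possible. The gradient descent update above, for example, can also be written in vector form. Vector formulas are compact and map directly onto MATLAB code. Because gradient descent follows the gradient toward the extremum, inputs whose dimensions have very different ranges produce contours of the objective that are stretched much more in some directions than in others, and convergence then becomes very slow. So before running gradient descent, apply feature scaling: typically each dimension is made zero-mean by subtracting its mean and is then divided by the range of that variable.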
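The scaling step described above (subtract the mean, divide by the range) might look like this in vectorized form. The sample matrix is hypothetical, and repmat is used so that the sketch does not rely on implicit expansion.

```matlab
% Hypothetical samples, one per row: [area, bedrooms]
X = [2104 3; 1600 3; 2400 3; 1416 2; 3000 4];
m = size(X, 1);
mu   = mean(X);               % per-column mean
span = max(X) - min(X);       % per-column range
Xs = (X - repmat(mu, m, 1)) ./ repmat(span, m, 1);
disp(mean(Xs))                % each column now has (numerically) zero mean
disp(max(Xs) - min(Xs))       % and a range of 1
```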

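Next comes the effect of the learning rate on gradient descent. If the learning rate is too large, each iteration may overshoot and the iterates jump back and forth across the minimum while moving away from it, so the loss keeps growing instead of shrinking. In the plot of loss against iteration number the curve then rises. A learning rate that is too large can also make the curve oscillate. If the learning rate is too small, the curve falls very slowly and may look flat over many iterations. So which value should be chosen? In practice this is done empirically: try values such as ..., 0.0001, 0.001, 0.01, 0.1, 1.0, ... and pick the one for which the loss-versus-iteration curve drops fastest.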

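The same problem can be approached with different features and different models. On the feature side, a single area feature could instead be written as two features, length and width. On the model side, with polynomial fitting one can choose the highest power of x. When training, all the training data are usually arranged into a matrix with one sample per row, sometimes called the "design matrix". Solving for the parameters of a polynomial model in matrix form gives w = inv(X'*X)*X'*y, an equation also known as the normal equations. X'*X is square, but its inverse need not exist (a square matrix without an inverse is called singular). For instance, if X is the single element 0, its reciprocal does not exist, so it is a singular matrix, although that example is a very special one. A more common case is having more parameters than training samples, which also makes X'*X non-invertible. To solve the problem then, either add a regularization term or remove some features (typically by dimensionality reduction, dropping features that are strongly correlated). Also, solving linear regression through the normal equations does not require feature scaling of the inputs (this has a theoretical basis: the closed-form solution is not found by an iterative search, so the shape of the contours does not matter).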
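The case of more parameters than samples can be checked directly. A minimal sketch with a hypothetical 2-sample, 4-parameter design matrix:

```matlab
X = [1 2 3 4; 1 5 6 7];   % 2 samples (rows), 4 parameters (columns), hypothetical data
A = X' * X;               % 4x4, but its rank cannot exceed the 2 rows of X
disp(rank(A))             % smaller than 4, so inv(A) does not exist
```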

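　　The functions above are mostly for regression, where the predicted value is continuous. If the value to predict takes only two values, yes or no, i.e. 0 or 1, we have a classification problem. We then need a function that maps the original prediction into the range 0 to 1; this is usually the logistic function, also called the sigmoid function. Since its value is still continuous, the output of the logistic function is interpreted as the probability that y equals 1 given x.

　　Loosely speaking, a convex function is one whose only minimum is the global one, whereas a non-convex function may have several local extrema. We generally want the loss function to be convex. In classification, first consider the training samples whose label is 1. For these we want the loss to be smallest (0) when the prediction is 1, and largest (infinite) when the prediction is 0, so -log(h(x)) is commonly used, which meets exactly these requirements. Likewise, for samples whose label is 0 the usual loss is -log(1-h(x)). Combining the two gives -y*log(h(x))-(1-y)*log(1-h(x)), which is equivalent to the two cases above but more compact. This form of loss follows from maximum likelihood estimation (MLE). Gradient descent can still be used to find the optimal parameters. The update formula again needs the partial derivatives of the loss, and, interestingly, they have the same structure as in multivariate linear regression; only the hypothesis differs: a plain linear function in one case and a linear function composed with the sigmoid in the other.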
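Writing the loss and its partial derivative out shows why the two structures coincide. This is a sketch in the usual course notation, which is an assumption here: m training samples indexed by i, hypothesis h_θ(x) = g(θ^T x), and g the sigmoid.

```latex
\begin{aligned}
J(\theta) &= -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big],
\qquad g'(z) = g(z)\big(1-g(z)\big) \\
\frac{\partial J}{\partial \theta_j} &= \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}
\end{aligned}
```

The squared-error loss of linear regression, (1/(2m)) Σ (h_θ(x^{(i)}) - y^{(i)})^2 with h_θ(x) = θ^T x, has a partial derivative of exactly the same form.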

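　　Gradient descent finds the parameter values at which a function is smallest, while Newton's method finds the parameter values at which a function equals 0. The two goals look different at first, but if the function whose zero is sought is the derivative of some function A, then Newton's method is in effect finding a minimum of A (or possibly a maximum), so the two methods share the same purpose. The Newton update can also be written in vector form; the expression involves the Hessian matrix and the vector of first derivatives (the gradient).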
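The idea of minimizing a function by finding the zero of its derivative can be illustrated in one dimension. A minimal sketch, assuming the hypothetical function f(t) = exp(t) - 2*t, whose minimum lies at t = log(2):

```matlab
% Newton's method applied to f'(t) = 0 for f(t) = exp(t) - 2*t (hypothetical example).
df  = @(t) exp(t) - 2;   % first derivative of f
d2f = @(t) exp(t);       % second derivative of f (the 1-D "Hessian")
theta = 0;               % starting guess
for k = 1:10
    theta = theta - df(theta) / d2f(theta);  % Newton update
end
disp(theta)              % close to log(2)
```

No learning rate appears in the update, and a handful of iterations is enough, in line with the comparison between the two methods.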

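　　Now compare the two. First, gradient descent requires choosing a learning rate, while Newton's method has no parameters to choose. Second, gradient descent needs many iterations to reach the minimum, while Newton's method needs only a few. On the other hand, each gradient descent iteration is cheap, with cost O(n), while each Newton iteration is expensive, O(n^3). So Newton's method suits a small number of features n, and gradient descent is the better choice when n is large. As a rule of thumb, the boundary between small and large is taken to be around n = 1000.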

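　　When the system has many input features but few training samples, over-fitting occurs easily. The remedy is either to reduce the number of features through dimensionality reduction (or through model selection), or to use regularization. Regularization is usually the most effective option when there are many features, provided each feature contributes only a little to the final prediction. The regularization term acts on the parameters and keeps them small; when all parameters are small the hypothesis is a simple one, which counters over-fitting well. The regularization term is usually multiplied by a penalty coefficient called the regularization parameter. If this coefficient is too large, all parameters may end up very close to 0 and the model under-fits. In multivariate linear regression the term usually penalizes parameters 1 to n (parameter 0 can be included as well, but that is uncommon). As the number of training samples grows, the influence of the regularization term gradually weakens, so the learned parameters tend to grow slowly. The regularization term comes in many forms; some do not include the number of features, such as L2-norm regularization (also called 2-norm regularization), and there is also L1-norm regularization. Because there are so many forms, they are referred to as the common variations of the regularization term.

　　When regularized linear regression is solved by gradient descent, the update formulas are similar to before (the formula for parameter 0 is unchanged because parameter 0 is not penalized). The difference is that each other parameter is first multiplied by (1-alpha*lamda/m) before the usual term is subtracted. This factor is usually only slightly smaller than 1, so the update stays close to the unregularized one. The normal equation is also similar to before, roughly inv(X'*X+lamda*A)*X'*y, with one extra term, where A is a diagonal matrix whose diagonal entries are all 1 except the first, which is 0 (for the usual regularization term). With this term the matrix being inverted is generally invertible, so the problem can be solved even when there are fewer samples than features. For logistic regression (whose loss contains logarithms), the gradient descent update likewise multiplies by (1-alpha*lamda/m). Logistic regression has no closed-form normal equation; with Newton's method the gradient vector changes accordingly and the matrix lamda/m*A, with A as above, is added to the Hessian, which likewise resolves the non-invertibility problem.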
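The effect of the extra term on invertibility can be checked on a small case. A minimal sketch with hypothetical data (2 samples, 4 parameters) and lambda = 1; the backslash operator is used instead of inv() to solve the system:

```matlab
X = [1 2 3 4; 1 5 6 7];           % fewer samples than parameters (hypothetical data)
y = [1; 2];
lambda = 1;                       % regularization parameter (hypothetical value)
n = size(X, 2);
A = diag([0; ones(n-1, 1)]);      % first diagonal entry 0: the intercept is not penalised
M = X' * X + lambda * A;          % matrix of the regularized normal equation
theta = M \ (X' * y);             % solve M * theta = X' * y
disp(rank(X' * X))                % 2: singular without the extra term
disp(rank(M))                     % 4: invertible with it
```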

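　　Logistic regression and multiple linear regression have a great deal in common. The main difference is the dependent variable; almost everything else is the same. For this reason both belong to one family, the generalized linear models. Models in this family have essentially the same form and differ in the dependent variable: continuous gives multiple linear regression, binomial gives logistic regression, Poisson gives Poisson regression, negative binomial gives negative binomial regression, and so on. It is enough to keep track of the dependent variable. The dependent variable in logistic regression can be binary or multi-class, but the binary case is more common and easier to interpret, so binary logistic regression is the one used most in practice.

　　References: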

http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning

http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

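Deep learning 2 (linear regression exercise)

　　Preface

　　This post is an exercise in linear regression, using the simplest case: one input variable and two parameters. It follows the Stanford course page http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex2/ex2.html. The exercise provides 50 data points, where x is the age of each of 50 children, between 2 and 8 years and possibly fractional, and y is the corresponding height, also a decimal number. The task is to estimate, from these 50 training samples, the height of a child at ages 3.5 and 7. Plotting the training samples suggests immediately that this is a typical linear regression problem.

　　MATLAB functions used:

　　legend:

　　For example legend('Training data', 'Linear regression') labels what each plotted series in the figure represents. Here the first series (actually discrete points) is the training data and the second (actually a straight line) is the regression line.

　　hold on, hold off:

　　hold on keeps the current plot so that later plotting commands draw on top of it. hold off ends that mode, so the next plot replaces what is on the axes.

　　linspace:

　　For example linspace(-3, 3, 100) returns 100 numbers between -3 and 3, evenly (linearly) spaced.

　　logspace:

　　For example logspace(-2, 2, 15) returns 15 numbers between 10^(-2) and 10^(2). The exponents are evenly spaced, so the values themselves are logarithmically spaced rather than evenly spaced.

　　Results:

　　Scatter plot of the training samples with the fitted regression line:

　　Surface plot of the loss function against the parameters:

　　Contour plot of the loss function:

　　Code and comments:

　　Solving with the normal equations: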

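%% Method 1: normal equations
x = load('ex2x.dat');          % ages
y = load('ex2y.dat');          % heights
plot(x,y,'*')
xlabel('age')
ylabel('height')
x = [ones(size(x,1),1),x];     % add the intercept column
w = (x'*x)\(x'*y)              % solve the normal equations without forming inv()
hold on
plot(x(:,2),x*w)               % fitted line, drawn from the computed w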

　　Solving with gradient descent:

% Exercise 2 Linear Regression

% Data is roughly based on 2000 CDC growth figures
% for boys
%
% x refers to a boy's age
% y is a boy's height in meters
%

clear all; close all; clc
x = load('ex2x.dat'); y = load('ex2y.dat');

m = length(y); % number of training examples


% Plot the training data
figure; % open a new figure window
plot(x, y, 'o');
ylabel('Height in meters')
xlabel('Age in years')

% Gradient descent
x = [ones(m, 1) x]; % Add a column of ones to x
theta = zeros(size(x(1,:)))'; % initialize fitting parameters
MAX_ITR = 1500;
alpha = 0.07;

for num_iterations = 1:MAX_ITR
    % This is a vectorized version of the 
    % gradient descent update formula
    % It's also fine to use the summation formula from the videos
    
    % Here is the gradient
    grad = (1/m).* x' * ((x * theta) - y);
    
    % Here is the actual update
    theta = theta - alpha .* grad;
    
    % Sequential update: The wrong way to do gradient descent
    % grad1 = (1/m).* x(:,1)' * ((x * theta) - y);
    % theta(1) = theta(1) + alpha*grad1;
    % grad2 = (1/m).* x(:,2)' * ((x * theta) - y);
    % theta(2) = theta(2) + alpha*grad2;
end
% print theta to screen
theta

% Plot the linear fit
hold on; % keep previous plot visible
plot(x(:,2), x*theta, '-')
legend('Training data', 'Linear regression') % label what each plotted series represents
hold off % don't overlay any more plots on this figure

% Closed form solution for reference
% You will learn about this method in future videos
exact_theta = (x' * x)\x' * y

% Predict values for age 3.5 and 7
predict1 = [1, 3.5] *theta
predict2 = [1, 7] * theta


% Calculate J matrix

% Grid over which we will calculate J
theta0_vals = linspace(-3, 3, 100);
theta1_vals = linspace(-1, 1, 100);

% initialize J_vals to a matrix of 0's
J_vals = zeros(length(theta0_vals), length(theta1_vals));

for i = 1:length(theta0_vals)
      for j = 1:length(theta1_vals)
      t = [theta0_vals(i); theta1_vals(j)];    
      J_vals(i,j) = (0.5/m) .* (x * t - y)' * (x * t - y);
    end
end

% Because of the way meshgrids work in the surf command, we need to 
% transpose J_vals before calling surf, or else the axes will be flipped
J_vals = J_vals';
% Surface plot
figure;
surf(theta0_vals, theta1_vals, J_vals)
xlabel('\theta_0'); ylabel('\theta_1');

% Contour plot
figure;
% Plot J_vals as 15 contours spaced logarithmically between 0.01 and 100
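contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 2, 15)) % draw the contour lines
xlabel('\theta_0'); ylabel('\theta_1'); % TeX markup: \theta gives the Greek letter, _0 subscripts only the next character unless braces are used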

　　References:

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex2/ex2.html

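Deep learning 3 (multivariate linear regression exercise)

　　Preface:

　　This post practices multivariate linear regression (with only 3 parameters here). The reference page is http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex3/ex3.html. The previous post, Deep learning 2 (linear regression exercise), already covered solving single-variable linear regression, but there the gradient descent learning rate was fixed at 0.07. In this exercise the learning rate has to be chosen by us, so we try values from small to large (say from 0.001 to 10) and decide by looking at the curve of loss against iteration number. Once the learning rate alpha is chosen, the solution method is the same as before.

　　The problem provides 47 training samples. The y value is the price of a house, and x has 2 attributes: the size of the house and its number of bedrooms. From these data we learn the model and then predict the price of a house of size 1650 with 3 bedrooms.

　　Background:

　　dot(A,B): the inner product of vectors A and B.

　　From linear regression theory, the loss function of the system is:

　　Its vector form is:

　　When gradient descent is used, the parameter update formula is:

　　It also has a vector form (shown in the code).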
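The expressions referred to above can be written in the usual course notation. The notation is an assumption here: m training samples, hypothesis h_θ(x) = θ^T x, design matrix X with one sample per row, and learning rate α. The vector forms match the vectorized code in this post.

```latex
\begin{aligned}
J(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2
           = \frac{1}{2m}\,(X\theta - y)^{T}(X\theta - y) \\
\theta_j &:= \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}
\qquad\text{(for all } j \text{ simultaneously)} \\
\theta &:= \theta - \frac{\alpha}{m}\,X^{T}(X\theta - y)
\end{aligned}
```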

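　　Results:

　　Results of testing the learning rates:

　　A learning rate of 1 gives very fast convergence, so the final program uses a learning rate of 1.

　　Predictions from gradient descent and from the closed-form solution:

　　The two results agree.

　　Main code: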

%% Method 1: gradient descent
x = load('ex3x.dat');
y = load('ex3y.dat');

x = [ones(size(x,1),1) x];
meanx = mean(x);   % column means
sigmax = std(x);   % column standard deviations
x(:,2) = (x(:,2)-meanx(2))./sigmax(2);
x(:,3) = (x(:,3)-meanx(3))./sigmax(3);

figure
itera_num = 100; % number of iterations to try
sample_num = size(x,1); % number of training samples
alpha = [0.01, 0.03, 0.1, 0.3, 1, 1.3]; % candidate learning rates, each roughly 3x the previous one
plotstyle = {'b', 'r', 'g', 'k', 'b--', 'r--'};

theta_grad_descent = zeros(size(x(1,:)));
for alpha_i = 1:length(alpha) % try each learning rate
    theta = zeros(size(x,2),1); % initialise theta to 0
    Jtheta = zeros(itera_num, 1);
    for i = 1:itera_num % run itera_num iterations with this learning rate
        Jtheta(i) = (1/(2*sample_num)).*(x*theta-y)'*(x*theta-y); % Jtheta is a column vector of costs
        grad = (1/sample_num).*x'*(x*theta-y);
        theta = theta - alpha(alpha_i).*grad;
    end
    plot(0:49, Jtheta(1:50), plotstyle{alpha_i}, 'LineWidth', 2) % brace indexing returns the style as a char array
    hold on
    
    if(1 == alpha(alpha_i)) % alpha = 1 converged fastest in the experiment, so keep its theta
        theta_grad_descent = theta
    end
end
legend('0.01','0.03','0.1','0.3','1','1.3');
xlabel('Number of iterations')
ylabel('Cost function')

% prediction: scale the query point with the same mean and standard deviation
price_grad_descend = theta_grad_descent'*[1 (1650-meanx(2))/sigmax(2) (3-meanx(3))/sigmax(3)]'


%% Method 2: normal equations
x = load('ex3x.dat');
y = load('ex3y.dat');
x = [ones(size(x,1),1) x];

theta_norequ = (x'*x)\(x'*y)   % solve without forming inv()
price_norequ = theta_norequ'*[1 1650 3]'

　　References:

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex3/ex3.html

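Deep learning 4 (logistic regression exercise)

　　Preface:

　　This section practices logistic regression, following http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html. The training features are the scores of 80 students in two exams, and the label says whether the student was admitted to college: '1' if admitted and '0' otherwise. This is a typical binary classification problem. Of the 80 samples, 40 are positive and 40 negative. Logistic regression produces a probability, which becomes a binary decision by comparing it with 0.5.

　　Background:

　　In logistic regression the logistic function is:

　　Its benefit is that it squashes the output into the range 0 to 1. The loss function in logistic regression differs from the one in linear regression and is defined as:

　　If Newton's method is used to find the regression parameters, the update formula is:

　　where the first derivative and the Hessian matrix are: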
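The expressions referred to above can be written in the usual course notation. The notation is an assumption here (m samples indexed by i, g the sigmoid); the gradient and Hessian agree with the vectorized code in this post.

```latex
\begin{aligned}
h_\theta(x) &= g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}} \\
J(\theta) &= \frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log h_\theta(x^{(i)}) - \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] \\
\theta^{(t+1)} &= \theta^{(t)} - H^{-1}\,\nabla_\theta J \\
\nabla_\theta J &= \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x^{(i)}, \qquad
H = \frac{1}{m}\sum_{i=1}^{m} h_\theta(x^{(i)})\big(1-h_\theta(x^{(i)})\big)\,x^{(i)}\big(x^{(i)}\big)^{T}
\end{aligned}
```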

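　　When programming, avoid for loops and use the vectorized forms of these formulas directly (see the code for details).

　　Some MATLAB functions:

　　find:

　　Returns a vector containing the indices at which the expression in parentheses is true (nonzero).

　　inline:

　　Builds an inline function, much like a formula written on scratch paper. The expression is given in single quotes; if there are several arguments, they are listed afterwards, each in single quotes and separated by commas. For example g = inline('sin(alpha*x)','x','alpha') defines the two-argument function g(x,alpha) = sin(alpha*x).

　　Results:

　　Distribution of the training samples and the learned decision boundary:

　　Loss value against iteration number:

　　Final output:

　　For a student who scores 20 on the first exam and 80 on the second, the probability of not being admitted is 0.6680, so as a binary decision this student would not be admitted.

　　Code (provided by the original page):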

% Exercise 4 -- Logistic Regression

clear all; close all; clc

x = load('ex4x.dat'); 
y = load('ex4y.dat');

[m, n] = size(x);

% Add intercept term to x
x = [ones(m, 1), x]; 

% Plot the training data
% Use different markers for positives and negatives
figure
pos = find(y); neg = find(y == 0); % find returns the indices where its argument is true (nonzero)
plot(x(pos, 2), x(pos,3), '+')
hold on
plot(x(neg, 2), x(neg, 3), 'o')
hold on
xlabel('Exam 1 score')
ylabel('Exam 2 score')


% Initialize fitting parameters
theta = zeros(n+1, 1);

% Define the sigmoid function
g = inline('1.0 ./ (1.0 + exp(-z))'); 

% Newton's method
MAX_ITR = 7;
J = zeros(MAX_ITR, 1);

for i = 1:MAX_ITR
    % Calculate the hypothesis function
    z = x * theta;
    h = g(z); % apply the logistic (sigmoid) function
    
    % Calculate gradient and hessian.
    % The formulas below are equivalent to the summation formulas
    % given in the lecture videos.
    grad = (1/m).*x' * (h-y); % gradient in vectorized form
    H = (1/m).*x' * diag(h) * diag(1-h) * x; % Hessian in vectorized form
    
    % Calculate J (for testing convergence)
    J(i) =(1/m)*sum(-y.*log(h) - (1-y).*log(1-h)); % loss function in vectorized form
    
    theta = theta - H\grad; % Newton update: H\grad solves H*d = grad without forming inv(H)
end
% Display theta
theta

% Calculate the probability that a student with
% Score 20 on exam 1 and score 80 on exam 2 
% will not be admitted
prob = 1 - g([1, 20, 80]*theta)

% Draw the decision boundary
% Plot Newton's method result
% Only need 2 points to define a line, so choose two endpoints
plot_x = [min(x(:,2))-2,  max(x(:,2))+2];
% Calculate the decision boundary line
plot_y = (-1./theta(3)).*(theta(2).*plot_x +theta(1));
plot(plot_x, plot_y)
legend('Admitted', 'Not admitted', 'Decision Boundary')
hold off

% Plot J
figure
plot(0:MAX_ITR-1, J, 'o--', 'MarkerFaceColor', 'r', 'MarkerSize', 8)
xlabel('Iteration'); ylabel('J')
% Display J
J

　　References:

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html

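Deep learning 5 (regularized linear regression exercise)

　　Preface:

　　This section practices the use of the regularization term. In some machine learning models, too many parameters combined with too few training samples easily leads to over-fitting. So the loss function "penalizes" the parameters to keep them from growing large; smaller parameters mean a simpler model, and a simpler model is less prone to over-fitting. The reference page is http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex5/ex5.html. Seven training points are given and are fitted with a 5th-order polynomial. The main point is to see how different regularization parameters affect the learned curve.

　　Background:

　　The model is:

　　The loss function including the regularization term is:

　　The normal equation solution of the model is: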
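The expressions referred to above can be written in the usual course notation. The notation is an assumption here (m samples, n = 5 polynomial terms plus the intercept, design matrix X with one sample per row); the last line agrees with the code in this post.

```latex
\begin{aligned}
h_\theta(x) &= \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5 \\
J(\theta) &= \frac{1}{2m}\Big[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\Big] \\
\theta &= \Big(X^{T}X + \lambda\,\mathrm{diag}(0,1,\dots,1)\Big)^{-1}X^{T}y
\end{aligned}
```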

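　　The program tests the effect of lambda = 0, 1, 10 on the final result.

　　Some MATLAB functions:

　　plot:

　　This is about some properties of plotted points. For example plot(x,y,'o','MarkerEdgeColor','b','MarkerFaceColor','r') draws the x-y points as circles with a blue edge and a red fill, which shows what 'MarkerEdgeColor' and 'MarkerFaceColor' mean.

　　diag:

　　diag builds a diagonal matrix from a vector, so its argument should be a vector. For example diag(ones(3,1)) gives a 3*3 diagonal matrix.

　　legend:

　　Note the TeX markup: for example legend('\lambda_0') labels the series as lambda with subscript 0.

　　Results:

　　The sample points and the learned curves:

　　With lambda=1 the model is best: it is less prone to over-fitting while still following the original data reasonably well.

　　Main code: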

clc,clear
% load the data
x = load('ex5Linx.dat');
y = load('ex5Liny.dat');

% show the raw data
plot(x,y,'o','MarkerEdgeColor','b','MarkerFaceColor','r')

% build the design matrix from powers of the feature
x = [ones(length(x),1) x x.^2 x.^3 x.^4 x.^5];
[m n] = size(x);
n = n -1;

% compute the parameters sida (theta) and plot the fitted curves
rm = diag([0;ones(n,1)]); % the matrix that multiplies lamda
lamda = [0 1 10]';
colortype = {'g','b','r'};
sida = zeros(n+1,3);
xrange = linspace(min(x(:,2)),max(x(:,2)))';
hold on;
for i = 1:3
    sida(:,i) = (x'*x+lamda(i).*rm)\(x'*y); % regularized normal equations, solved without inv()
    norm_sida = norm(sida(:,i)) % norm of the parameters for this lamda
    yrange = [ones(size(xrange)) xrange xrange.^2 xrange.^3,...
        xrange.^4 xrange.^5]*sida(:,i);
    plot(xrange',yrange,colortype{i})
    hold on
end
legend('training data', '\lambda=0', '\lambda=1','\lambda=10') % \lambda is TeX markup for the Greek letter
hold off

　　References:

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex5/ex5.html

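Deep learning 6 (regularized logistic regression exercise)

　　Preface:

　　The previous post, Deep learning 5 (regularized linear regression exercise), introduced the regularization term in linear regression. This section practices it in logistic regression and uses Newton's method to find the model parameters. The reference page is http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex5/ex5.html. The problem provides training data with 2 features. The distribution of the data shows that the classes are far from linearly separable, so higher-order features are needed; this program uses features up to the 6th power.

　　Background:

　　contour:

　　This function draws contour lines. For example contour(u, v, z, [0, 0], 'LineWidth', 2) draws, in the u-v plane, the contour of the surface z at the level z = 0, with line width 2. The size of z must match the grid defined by u and v.

　　In logistic regression the hypothesis is:

　　In this problem the features x are mapped into a 28-dimensional space; the mapped vector is:

　　The loss function with the regularization term is:

　　The corresponding Newton update is:

　　where: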
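The expressions referred to above can be written in the notation of the exercise. The notation is an assumption here (u and v are the two raw features, m samples, n = 27 non-intercept parameters, g the sigmoid); the gradient and Hessian agree with the code in this post.

```latex
\begin{aligned}
h_\theta(x) &= g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}, \qquad
x = \big[\,1,\; u,\; v,\; u^2,\; uv,\; v^2,\; u^3,\; \dots,\; uv^5,\; v^6\,\big]^{T} \\
J(\theta) &= -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \\
\theta^{(t+1)} &= \theta^{(t)} - H^{-1}\,\nabla_\theta J \\
\nabla_\theta J &= \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x^{(i)} + \frac{\lambda}{m}\,\big[\,0,\; \theta_1,\; \dots,\; \theta_n\,\big]^{T} \\
H &= \frac{1}{m}\sum_{i=1}^{m} h_\theta(x^{(i)})\big(1-h_\theta(x^{(i)})\big)\,x^{(i)}\big(x^{(i)}\big)^{T} + \frac{\lambda}{m}\,\mathrm{diag}(0,1,\dots,1)
\end{aligned}
```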

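　　Some remarks on the formulas (screenshot from the original page):

　　Results:

　　Distribution of the original training data:

　　Decision boundary for lambda=0:

　　Decision boundary for lambda=1:

　　Decision boundary for lambda=10:

　　Code: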

% load the data
clc,clear,close all;
x = load('ex5Logx.dat');
y = load('ex5Logy.dat');

% plot the distribution of the data
plot(x(find(y),1),x(find(y),2),'o','MarkerFaceColor','b')
hold on;
plot(x(find(y==0),1),x(find(y==0),2),'r+')
legend('y=1','y=0')

% Add polynomial features to x by 
% calling the feature mapping function
% provided in separate m-file
x = map_feature(x(:,1), x(:,2));

[m, n] = size(x);

% Initialize fitting parameters
theta = zeros(n, 1);

% Define the sigmoid function
g = inline('1.0 ./ (1.0 + exp(-z))'); 

% setup for Newton's method
MAX_ITR = 15;
J = zeros(MAX_ITR, 1);

% Lambda is the regularization parameter
lambda = 1; % set to 0, 1 or 10 and run three times to obtain the three results

% Newton's Method
for i = 1:MAX_ITR
    % Calculate the hypothesis function
    z = x * theta;
    h = g(z);
    
    % Calculate J (for testing convergence)
    J(i) =(1/m)*sum(-y.*log(h) - (1-y).*log(1-h))+ ...
    (lambda/(2*m))*norm(theta([2:end]))^2;
    
    % Calculate gradient and hessian.
    G = (lambda/m).*theta; G(1) = 0; % extra term for gradient
    L = (lambda/m).*eye(n); L(1) = 0;% extra term for Hessian
    grad = ((1/m).*x' * (h-y)) + G;
    H = ((1/m).*x' * diag(h) * diag(1-h) * x) + L;
    
    % Here is the actual update
    theta = theta - H\grad;
  
end
% Show J to determine if algorithm has converged
J
% display the norm of our parameters
norm_theta = norm(theta) 

% Plot the results 
% We will evaluate theta*x over a 
% grid of features and plot the contour 
% where theta*x equals zero

% Here is the grid range
u = linspace(-1, 1.5, 200);
v = linspace(-1, 1.5, 200);

z = zeros(length(u), length(v));
% Evaluate z = theta*x over the grid
for i = 1:length(u)
    for j = 1:length(v)
        z(i,j) = map_feature(u(i), v(j))*theta; % z is the linear term theta'*x before the sigmoid, not the loss-versus-iteration curve
    end
end
z = z'; % important to transpose z before calling contour

% Plot z = 0
% Notice you need to specify the range [0, 0]
contour(u, v, z, [0, 0], 'LineWidth', 2) % draw the contour where z = 0, i.e. where the predicted probability is exactly 0.5
legend('y = 1', 'y = 0', 'Decision boundary')
title(sprintf('\\lambda = %g', lambda), 'FontSize', 14)


hold off

% Uncomment to plot J
% figure
% plot(0:MAX_ITR-1, J, 'o--', 'MarkerFaceColor', 'r', 'MarkerSize', 8)
% xlabel('Iteration'); ylabel('J')
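The listing above calls map_feature, which the exercise provides as a separate m-file and which is not reproduced in this post. The following is only a sketch of what such a mapping could look like, assuming it produces all monomials of the two features up to degree 6, which gives the 28 dimensions mentioned earlier; the ordering of the columns is an assumption.

```matlab
function out = map_feature(feat1, feat2)
% Maps two feature columns u = feat1, v = feat2 to
% [1, u, v, u^2, u*v, v^2, ..., u*v^5, v^6], one row per sample.
degree = 6;
out = ones(size(feat1(:,1)));          % intercept column
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (feat1.^(i-j)) .* (feat2.^j);  % monomial u^(i-j) * v^j
    end
end
end
```

For two scalar inputs the function returns a single row of 28 values, which is how the grid loop in the listing uses it.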

　　References:

Deep learning 5 (regularized linear regression exercise)

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex5/ex5.html

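Deep learning 7 (Fundamentals, part 2)

　　The earlier posts introduced two classic machine learning algorithms, linear regression and logistic regression, and the exercises showed that they work well on some problems. Now we turn to another algorithm, the neural network. If linear and logistic regression can in principle handle regression and classification problems, why are there so many other algorithms, such as the neural networks discussed here? The reason is simple. In the earlier exercises the input dimension was very small (2 or 3), and logistic regression needed the original features mapped into a higher-dimensional space: 2 features mapped to all terms up to degree 6 already become 28 dimensions. Real-world data have far more features. A tiny 50*50 grayscale image already has 2500 features, and using logistic regression with higher-order terms for object detection could need millions of features. That is computationally heavy, and so many features also make the learned function prone to over-fitting. In short, linear and logistic regression alone are far from enough in practice, which is why neural networks, with their particular advantages, came to be studied.

　　The structure of a neural network model is fairly clear: the inputs are multiplied by their weights, summed, and a bias is added to form the output. Only the formulas are tedious and easy to get wrong. Suppose layer j has Sj nodes and layer j+1 has S(j+1) nodes. The parameters of layer j then form a matrix of size S(j+1)*(Sj+1), where the extra column belongs to the bias node, whose value is fixed at 1 and which is not counted in Sj. To keep the formulas manageable, neural networks are usually written in vectorized form. Why are neural networks such capable learners? Biologically, they imitate the brain, which has a powerful learning mechanism. From the model itself, if we look only at the output layer and the last layer connected to it, we find a simple linear regression (or logistic regression if the output is squashed into 0~1). The layers before it have simply learned new features on their own, features well suited to the problem. Put plainly, a neural network learns features that are better suited to solving the problem.

　　On the surface, consecutive layers are directly connected and a linear combination of the previous layer's outputs forms the current layer's output, so wouldn't even a many-layer network only learn linear combinations of the inputs? Why then is it said that neural networks can learn nonlinear functions? The reasoning above contains a basic mistake: the linear combination is not itself the output of the layer but is passed through another function, most commonly the logistic function (others such as the hyperbolic tangent are also common). Without it the network really could learn only linear features. Neural networks are quite powerful. A single-layer network can learn "and", "or", "not" and NOR gates; a two-layer network can learn XOR-type gates (for example an AND unit and a NOR unit combined by an OR unit, which gives XNOR); and a network with 3 layers (not counting the input and output layers) is said to be able to represent essentially arbitrary functions. There are many interesting stories about this in the history of neural networks. Neural networks also extend easily to multi-class problems: for n classes, put n nodes in the output layer. If the classes are separable, some learned network will make exactly one of the n output nodes equal 1 for each input, which achieves multi-class classification.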
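The gate construction mentioned above can be checked numerically. A minimal sketch, assuming sigmoid units and hand-picked weights (the values -30, 20, 10 and so on are illustrative choices, not taken from the text): an AND unit and a NOR unit in the hidden layer, combined by an OR unit, give XNOR.

```matlab
g = @(z) 1 ./ (1 + exp(-z));               % sigmoid activation
X = [0 0; 0 1; 1 0; 1 1];                  % all four input combinations
A1 = [ones(4,1) X];                        % prepend the bias unit
and_w = [-30;  20;  20];                   % fires only when both inputs are 1
nor_w = [ 10; -20; -20];                   % fires only when both inputs are 0
or_w  = [-10;  20;  20];                   % fires when at least one input is 1
hidden = [g(A1*and_w), g(A1*nor_w)];       % hidden layer: AND unit and NOR unit
xnor_out = g([ones(4,1) hidden] * or_w);   % output layer: OR of the two hidden units
disp(round(xnor_out'))                     % 1 0 0 1, the XNOR truth table
```

Without the sigmoid between the layers the whole network would collapse to a single linear function of the inputs, which cannot produce this truth table.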

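　　The loss function of a neural network is easy to write down; take a multi-class network as the example. The loss is discussed here within supervised learning, because only then do we know how much was lost (more recently losses can also be computed in unsupervised settings, for example with an AutoEncoder). Suppose all parameters are known. Each input sample then yields an output, and comparing it with the labelled output of that sample gives a loss term. In multi-class problems the output is a vector, so the loss is computed over every dimension (the training labels should then also be vectors, or at least be convertible to vectors). The resulting loss looks very much like the logistic regression loss and is easy to understand.

　　With the loss function in hand we can use gradient descent or Newton's method to find the network parameters. Either way we need the partial derivative of the loss with respect to every parameter, so the work centers on computing these derivatives. The best-known algorithm for this is BP, the backpropagation algorithm. It can be shown that the partial derivative with respect to a parameter depends on the error of the node the parameter feeds into and on the output value of the node in the preceding layer that the parameter connects from. The task thus reduces to computing the error of every node in every layer (the input layer needs no error). It can further be shown that the error of each node is obtained by propagating back the errors of all nodes in the next layer, which is where the name comes from. To summarize, with many training samples we feed in one sample at a time, compute the output of every node, then use the label of the sample to compute the error of every node backwards. The contribution of that sample to each partial derivative is the product of the corresponding output value and error, and these contributions are accumulated. After all samples have been processed, the accumulated totals are the partial derivatives of the loss with respect to the parameters. The idea behind BP is that the error at a node is passed on from the errors of the adjacent layer, with the network weights as the transfer coefficients.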
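The relations described above can be written compactly. This is a sketch that assumes the notation of the UFLDL tutorial, which the text does not spell out: W^{(l)} and b^{(l)} connect layer l to layer l+1, a^{(l)} = f(z^{(l)}) is the output of layer l, n_l is the output layer, and the loss of one sample is the squared error (1/2)||a^{(n_l)} - y||^2.

```latex
\begin{aligned}
\delta^{(n_l)} &= -\big(y - a^{(n_l)}\big)\odot f'\big(z^{(n_l)}\big) \\
\delta^{(l)} &= \Big(\big(W^{(l)}\big)^{T}\delta^{(l+1)}\Big)\odot f'\big(z^{(l)}\big), \qquad l = n_l-1,\,\dots,\,2 \\
\nabla_{W^{(l)}} J &= \delta^{(l+1)}\big(a^{(l)}\big)^{T}, \qquad \nabla_{b^{(l)}} J = \delta^{(l+1)}
\end{aligned}
```

Summing the last line over all samples gives the accumulated partial derivatives mentioned in the summary.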

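　　Solving a neural network with gradient descent is error-prone in practice, because the partial derivatives involve many matrices that are easy to mix up in code. If the loss or its partial derivatives are wrong, the iterations go wrong too and fail to converge, so it is well worth checking the derivatives. Andrew Ng recommends gradient checking: after implementing the partial derivatives, pick a parameter value and evaluate the derivative there, then take 2 points close to that value on either side and divide the difference of the loss at these two points by the distance between them (for points close enough this is just the definition of the derivative). If the two results nearly agree, the derivative code is very likely correct and the rest of the work can proceed with confidence. At that point remember to switch gradient checking off: the numerical estimate needs two evaluations of the loss for every single parameter, which is far slower than one backpropagation pass that yields all derivatives at once.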
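The check can be sketched on a small stand-in problem. The cost function, its analytic gradient, the test point and the step epsilon below are all hypothetical choices for illustration; in a real network the analytic gradient would come from backpropagation.

```matlab
J    = @(t) sum(t.^2) + t(1)*t(2);             % hypothetical cost function
grad = @(t) [2*t(1) + t(2); 2*t(2) + t(1)];    % its analytic gradient
theta   = [1.5; -0.7];                         % point at which to check
epsilon = 1e-4;                                % half-distance between the two nearby points
numgrad = zeros(size(theta));
for k = 1:numel(theta)
    e = zeros(size(theta));
    e(k) = epsilon;                            % perturb one parameter at a time
    numgrad(k) = (J(theta + e) - J(theta - e)) / (2*epsilon);
end
reldiff = norm(numgrad - grad(theta)) / norm(numgrad + grad(theta));
disp(reldiff)                                  % tiny when the analytic gradient is right
```

The loop evaluates the cost twice per parameter, which is what makes the check slow for networks with thousands of parameters.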

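　　When training a network, never initialize all parameters to the same value. If you do, the parameters within each layer stay identical, so the hidden units learn the same feature, which is redundant and performs poorly. The sensible choice is random initialization, usually with zero mean and small values around 0.

　　If the same algorithm is used to find the parameters (say BP throughout), the performance of the network depends on its structure, i.e. the number of hidden layers and the number of units in each. A common default is a single hidden layer; if several hidden layers are used, give each the same number of units. More hidden units usually give better results, at a higher computational cost.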

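Deep learning 8 (Sparse Autoencoder)

　　Preface:

　　This lesson covers a well-known algorithm in deep learning, the sparse autoencoder. Deep learning is often associated with unsupervised feature learning, and the sparse autoencoder is indeed unsupervised. As described in the earlier posts Deep learning 1 (Fundamentals, part 1) and Deep learning 7 (Fundamentals, part 2), in supervised learning with a neural network, once the network structure is fixed we can write down the loss function (which should "penalize" the parameters so that none grows too large) and its partial derivatives, and then use an optimization algorithm to find the best parameters. Note that the loss function needs labelled samples. So how can the sparse autoencoder learn without supervision? Does its loss not need the labels (the usual y values)? In fact a "label" is still needed, but the target output is the input feature vector x itself, i.e. y = x. The benefit is that the hidden layer becomes a good substitute for the input features, since it can reconstruct them fairly accurately. The network structure of a sparse autoencoder is shown below:

　　Computing the loss function:

　　Without the sparsity constraint, the loss function of the network is:

　　Sparse coding constrains the output of the hidden layer: the average output of each hidden node should be close to 0, so that most hidden nodes are inactive. The loss function of the sparse autoencoder is therefore:

　　The last term is the KL distance (KL divergence), given by:

　　The average output of a hidden node is computed as: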
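The expressions referred to in this subsection can be written in the notation of the UFLDL tutorial. The notation is an assumption here: m samples, weights W^{(l)}_{ji}, weight decay parameter λ, sparsity weight β, sparsity parameter ρ, s_2 hidden nodes, and a^{(2)}_j the output of hidden node j.

```latex
\begin{aligned}
J(W,b) &= \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\big\|h_{W,b}(x^{(i)}) - x^{(i)}\big\|^2 + \frac{\lambda}{2}\sum_{l}\sum_{i}\sum_{j}\big(W^{(l)}_{ji}\big)^2 \\
J_{\text{sparse}}(W,b) &= J(W,b) + \beta\sum_{j=1}^{s_2}\mathrm{KL}\big(\rho\,\|\,\hat\rho_j\big) \\
\mathrm{KL}\big(\rho\,\|\,\hat\rho_j\big) &= \rho\log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j} \\
\hat\rho_j &= \frac{1}{m}\sum_{i=1}^{m} a^{(2)}_j\big(x^{(i)}\big)
\end{aligned}
```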

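　　The sparsity parameter is usually small, for example 0.05, the probability of a rare event. This requires the average output of every hidden node to be close to 0.05 (effectively close to 0, since the activation function is the sigmoid), which achieves sparsity. The KL distance here measures the difference between two Bernoulli distributions, one with mean equal to the sparsity parameter and one with mean equal to the average output of the node. The constraint shows that the larger the difference, the larger the "penalty", so the average outputs of the hidden nodes end up close to 0.05.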
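How the penalty behaves can be seen by evaluating it for a few average activations. A minimal sketch, assuming the standard KL divergence between two Bernoulli distributions; the values in rho_hat are hypothetical.

```matlab
% KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), element-wise
kl = @(rho, rho_hat) rho .* log(rho ./ rho_hat) + ...
                     (1 - rho) .* log((1 - rho) ./ (1 - rho_hat));
rho     = 0.05;                    % target average activation
rho_hat = [0.01 0.05 0.2 0.5];     % hypothetical average activations of four hidden nodes
penalty = kl(rho, rho_hat);
disp(penalty)                      % zero at rho_hat = rho, growing as rho_hat moves away
```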

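　　Computing the partial derivatives of the loss function:

　　Without the sparsity rule, the partial derivatives are obtained from the loss function as follows:

　　With sparsity added, the error of a hidden node changes from:

　　to: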
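In the notation of the UFLDL tutorial (an assumption here: δ^{(l)}_i is the error of node i in layer l, W^{(2)}_{ji} the weight from hidden node i to output node j, s_3 the number of output nodes, β the sparsity weight, ρ the sparsity parameter and \hat\rho_i the average output of hidden node i), the two error expressions are, without and with the sparsity term:

```latex
\begin{aligned}
\delta^{(2)}_i &= \Big(\sum_{j=1}^{s_3} W^{(2)}_{ji}\,\delta^{(3)}_j\Big)\, f'\big(z^{(2)}_i\big) \\
\delta^{(2)}_i &= \Big(\sum_{j=1}^{s_3} W^{(2)}_{ji}\,\delta^{(3)}_j + \beta\Big(-\frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i}\Big)\Big)\, f'\big(z^{(2)}_i\big)
\end{aligned}
```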

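　　Solving with gradient descent:

　　With the loss function and its partial derivatives available, gradient descent can be used to find the optimal network parameters. The whole procedure is:

　　The formulas show that the partial derivative of the loss is an accumulation: every sample adds one contribution. This is because the loss itself is the sum of the losses of the individual training samples, and by the sum rule of differentiation its partial derivative is the sum of the per-sample partial derivatives. It follows that the order in which the samples are fed to the network does not matter: the same operation is applied to every sample, and the result for a later sample does not depend on an earlier one (it is a plain sum, and summation is order-independent).

　　References:

Deep learning 1 (Fundamentals, part 1)

Deep learning 7 (Fundamentals, part 2)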

http://deeplearning.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity

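Deep learning 9 (Sparse Autoencoder exercise)

　　Preface:

　　Now for a worked example of the sparse autoencoder, following Ng's web tutorial Exercise:Sparse Autoencoder. The task is roughly this: from a set of natural images, extract 10000 small patches of size 8*8, and use the sparse autoencoder to learn the features of one hidden layer. The network has 3 layers: 64 input nodes, 25 hidden nodes, and 64 output nodes.

　　Background:

　　The main work is computing the loss function of the network and its partial derivatives; the formulas are in the earlier post Deep learning 8 (Sparse Autoencoder). The steps are outlined below in plain words to make the flow of the algorithm clear.

　　1. Compute the input value (z in the code) and the output value (a in the code, the sigmoid of z) of every node.

　　2. Use z and a to compute the error of every node (delta in the code).

　　3. Use the a, z and delta of every node to express the loss function and its partial derivatives. These are mathematical derivations whose formulas are given in the earlier post Deep learning 8 (Sparse Autoencoder).

　　Step 1 runs forward, i.e. input layer -> hidden layer -> output layer. Step 2 runs backward (hence the name BP algorithm): the node errors are computed in the order output layer -> hidden layer -> input layer.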

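　　Some MATLAB functions:

　　bsxfun:

　　C=bsxfun(fun,A,B) applies an element-wise binary operation to arrays A and B, where fun is a function handle (to an m-file function or an inline function). Common choices for fun are plus, minus and so on, written with a leading '@'. A and B normally have the same size; where they differ, in each dimension the sizes must either match or one of them must be 1. For example in bsxfun(@minus, A, mean(A)) the sizes of A and mean(A) differ: mean(A) is first expanded to the size of A, and then each element of the expanded mean(A) is subtracted from the corresponding element of A.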
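The expansion can be seen on a small matrix. A minimal sketch with a hypothetical 3-by-2 matrix A:

```matlab
A = [1 2; 3 4; 5 6];                      % 3x2 matrix (hypothetical data)
centered = bsxfun(@minus, A, mean(A));    % mean(A) is 1x2 and is expanded along the rows
disp(centered)                            % each column now has zero mean
```

Here mean(A) is [3 4], so the result is [-2 -2; 0 0; 2 2].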

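　　rand:

　　Generates uniformly distributed pseudorandom numbers in the interval (0,1).
　　Main syntax:
　　rand(m,n) returns an m-by-n matrix of uniformly distributed pseudorandom numbers.
　　rand(m,n,'double') returns pseudorandom numbers of the given precision; 'single' is also accepted.
　　rand(RandStream,m,n) draws the numbers from the given RandStream (which I think of as the random seed).

　　randn:

　　Generates pseudorandom numbers from the standard normal distribution (mean 0, variance 1). The main syntax is the same as above.

　　randi:

　　Generates uniformly distributed pseudorandom integers.
　　Main syntax:
　　randi(iMax) returns an integer drawn uniformly from the closed interval [1, iMax].
　　randi(iMax,m,n) returns an m-by-n matrix of integers from [1, iMax].
　　r = randi([iMin,iMax],m,n) returns an m-by-n matrix of integers from the closed interval [iMin, iMax].

　　exist:

　　Tests whether something exists. For example exist('opt_normalize', 'var') checks whether the variable opt_normalize exists, where 'var' restricts the check to variables.

　　colormap:

　　Sets the colormap (the table of color values) of the current figure.

　　floor:

　　floor(A): the largest integer not greater than A.

　　ceil:

　　ceil(A): the smallest integer not less than A.

　　imagesc:

　　imagesc is similar to image and can be used to display images. For example imagesc(array,'EraseMode','none',[-1 1]) maps the range [-1,1] of the data in array linearly onto the full current colormap and displays the result. The erase mode is set to none, meaning the background is not erased.

　　repmat:

　　This function tiles a matrix. For example B = repmat(A,m,n) creates a matrix B made of m*n copies of A, so the size of B is [size(A,1)*m size(A,2)*n].

　　Purpose of function handles:

　　Without a function handle, repeated calls to a function each require a full search of the path for that function, which slows the computation; a handle avoids this cost because it refers to the function directly, like a pointer. A function handle acts like a name for the function, somewhat like a reference in C++.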

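　　Procedure:

　　First run step 1 of the main program train.m, which randomly samples 10000 small patches and displays 204 of them, as shown below:

　　Then run steps 2 and 3 of train.m, which compute and verify the loss function and the gradient. Gradient checking can take a long time; for me it took more than an hour and a half (it had not finished after more than an hour, so I went to bed). The gradient check gave an error of only 6.5101e-11, far below 1e-9, so the loss and derivative code is correct. The parameters can then be found with an optimization algorithm; the one provided here is L-BFGS. After a few minutes of optimization the result is ready.

　　The final W1 weights are shown below:

　　Code:

　　train.m: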

%% CS294A/CS294W Programming Assignment Starter Code

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  programming assignment. You will need to complete the code in sampleIMAGES.m,
%  sparseAutoencoderCost.m and computeNumericalGradient.m. 
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file. 
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to 
%  change the parameters below.

visibleSize = 8*8;   % number of input units 
hiddenSize = 25;     % number of hidden units 
sparsityParam = 0.01;   % desired average activation of the hidden units.
                     % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
             %  in the lecture notes). 
lambda = 0.0001;     % weight decay parameter       
beta = 3;            % weight of sparsity penalty term       

%%======================================================================
%% STEP 1: Implement sampleIMAGES
%
%  After implementing sampleIMAGES, the display_network command should
%  display a random sample of 200 patches from the dataset

patches = sampleIMAGES;
display_network(patches(:,randi(size(patches,2),204,1)),8);% randi(size(patches,2),204,1)
                                                           % returns a 204x1 column vector of random integers
                                                           % between 1 and 10000, so 204 randomly chosen patches are shown


%  Obtain random parameters theta
theta = initializeParameters(hiddenSize, visibleSize);

%%======================================================================
%% STEP 2: Implement sparseAutoencoderCost
%
%  You can implement all of the components (squared error cost, weight decay term,
%  sparsity penalty) in the cost function at once, but it may be easier to do 
%  it step-by-step and run gradient checking (see STEP 3) after each step.  We 
%  suggest implementing the sparseAutoencoderCost function using the following steps:
%
%  (a) Implement forward propagation in your neural network, and implement the 
%      squared error term of the cost function.  Implement backpropagation to 
%      compute the derivatives.   Then (using lambda=beta=0), run Gradient Checking 
%      to verify that the calculations corresponding to the squared error cost 
%      term are correct.
%
%  (b) Add in the weight decay term (in both the cost function and the derivative
%      calculations), then re-run Gradient Checking to verify correctness. 
%
%  (c) Add in the sparsity penalty term, then re-run Gradient Checking to 
%      verify correctness.
%
%  Feel free to change the training settings when debugging your
%  code.  (For example, reducing the training set size or 
%  number of hidden units may make your code run faster; and setting beta 
%  and/or lambda to zero may be helpful for debugging.)  However, in your 
%  final submission of the visualized weights, please use parameters we 
%  gave in Step 0 above.

[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, lambda, ...
                                     sparsityParam, beta, patches);

%%======================================================================
%% STEP 3: Gradient Checking
%
% Hint: If you are debugging your code, performing gradient checking on smaller models 
% and smaller training sets (e.g., using only 10 training examples and 1-2 hidden 
% units) may speed things up.

% First, lets make sure your numerical gradient computation is correct for a
% simple function.  After you have implemented computeNumericalGradient.m,
% run the following: 
checkNumericalGradient();

% Now we can use it to check your cost function and derivative calculations
% for the sparse autoencoder.  
numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
                                                  hiddenSize, lambda, ...
                                                  sparsityParam, beta, ...
                                                  patches), theta);

% Use this to visually compare the gradients side by side
%disp([numgrad grad]); 

% Compare numerically computed gradients with the ones obtained from backpropagation
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff); % Should be small. In our implementation, these values are
            % usually less than 1e-9.

            % When you got this working, Congratulations!!! 

%%======================================================================
%% STEP 4: After verifying that your implementation of
%  sparseAutoencoderCost is correct, You can start training your sparse
%  autoencoder with minFunc (L-BFGS).

%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, visibleSize);

%  Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
options.maxIter = 400;      % Maximum number of iterations of L-BFGS to run 
options.display = 'on';


[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                   visibleSize, hiddenSize, ...
                                   lambda, sparsityParam, ...
                                   beta, patches), ...
                              theta, options);

%%======================================================================
%% STEP 5: Visualization 

W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
figure;
display_network(W1', 12); 

print -djpeg weights.jpg   % save the visualization to a file

　　sampleIMAGES.m:

function patches = sampleIMAGES()
% sampleIMAGES
% Returns 10000 patches for training

load IMAGES;    % load images from disk 

patchsize = 8;  % we'll use 8x8 patches 
numpatches = 10000;

% Initialize patches with zeros.  Your code will fill in this matrix--one
% column per patch, 10000 columns. 
patches = zeros(patchsize*patchsize, numpatches);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Fill in the variable called "patches" using data 
%  from IMAGES.  
%  
%  IMAGES is a 3D array containing 10 images
%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
%  and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
%  it. (The contrast on these images look a bit off because they have
%  been preprocessed using using "whitening."  See the lecture notes for
%  more details.) As a second example, IMAGES(21:30,21:30,1) is an image
%  patch corresponding to the pixels in the block (21,21) to (30,30) of
%  Image 1
for imageNum = 1:10 % sample 1000 random patches from each image, 10000 patches in total
    [rowNum, colNum] = size(IMAGES(:,:,imageNum));
    for patchNum = 1:1000 % draw the 1000 patches for this image
        xPos = randi([1, rowNum-patchsize+1]); % top row of the patch
        yPos = randi([1, colNum-patchsize+1]); % left column of the patch
        patches(:,(imageNum-1)*1000+patchNum) = reshape(IMAGES(xPos:xPos+patchsize-1, ...
                                                        yPos:yPos+patchsize-1, imageNum), patchsize*patchsize, 1);
    end
end


%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure 
% the range of pixel values is also bounded between [0,1]
patches = normalizeData(patches);

end


%% ---------------------------------------------------------------
function patches = normalizeData(patches)

% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer

% Remove DC (mean of images). 
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));
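patches = max(min(patches, pstd), -pstd) / pstd; % by the three-sigma rule almost all values (about 99.7% for Gaussian data)
                                                % lie within +/-3 std, so little is clipped; the data now lies in [-1, 1]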

% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;

end

　　initializeParameters.m:

function theta = initializeParameters(hiddenSize, visibleSize)

%% Initialize parameters randomly based on layer sizes.
r  = sqrt(6) / sqrt(hiddenSize+visibleSize+1);   % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;

b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);

% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all 
% your parameters into a vector, which can then be used with minFunc. 
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];

end

sparseAutoencoderCost.m:

function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)

% visibleSize: the number of input units (probably 64) 
% hiddenSize: the number of hidden units (probably 25) 
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data.  So, data(:,i) is the i-th training example. 
  
% The input theta is a vector (because minFunc expects the parameters to be a vector). 
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 

% Convert the long parameter vector into each layer's weight matrix and bias vector
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values). 
% Here, we initialize them to zeros. 
cost = 0;
W1grad = zeros(size(W1)); 
W2grad = zeros(size(W2));
b1grad = zeros(size(b1)); 
b2grad = zeros(size(b2));

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b) 
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term 
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2 
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
% 
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2. 
% 

Jcost = 0;   % reconstruction (squared-error) term
Jweight = 0; % weight decay term
Jsparse = 0; % sparsity penalty term
[n, m] = size(data); % m is the number of examples, n is the number of features

% Forward pass: compute the linear combination (z) and the activation (a) of each layer
z2 = W1*data+repmat(b1,1,m); % b1 must be replicated into m columns so that it is added to every example
a2 = sigmoid(z2);
z3 = W2*a2+repmat(b2,1,m);
a3 = sigmoid(z3);

% Average squared reconstruction error
Jcost = (0.5/m)*sum(sum((a3-data).^2));

% Weight decay term
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));

% Sparsity penalty term
rho = (1/m).*sum(a2,2); % average activation of each hidden unit over the training set
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...
        (1-sparsityParam).*log((1-sparsityParam)./(1-rho)));

% Total cost
cost = Jcost+lambda*Jweight+beta*Jsparse;

% Backward pass: compute the error term of each layer
d3 = -(data-a3).*sigmoidInv(z3);
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho)); % extra term that the sparsity penalty
                                                             % contributes to the hidden-layer derivative
d2 = (W2'*d3+repmat(sterm,1,m)).*sigmoidInv(z2); 

% Compute W1grad
W1grad = W1grad+d2*data';
W1grad = (1/m)*W1grad+lambda*W1;

% Compute W2grad
W2grad = W2grad+d3*a2';
W2grad = (1/m).*W2grad+lambda*W2;

% Compute b1grad
b1grad = b1grad+sum(d2,2);
b1grad = (1/m)*b1grad; % the derivative w.r.t. b is a vector, so the values in each row are summed over all examples

% Compute b2grad
b2grad = b2grad+sum(d3,2);
b2grad = (1/m)*b2grad;



% %% Method 2: process one example at a time (slow)
% m=size(data,2);
% rho=zeros(size(b1));
% for i=1:m
%     %feedforward
%     a1=data(:,i);
%     z2=W1*a1+b1;
%     a2=sigmoid(z2);
%     z3=W2*a2+b2;
%     a3=sigmoid(z3);
%     %cost=cost+(a1-a3)'*(a1-a3)*0.5;
%     rho=rho+a2;
% end
% rho=rho/m;
% sterm=beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));
% %sterm=beta*2*rho;
% for i=1:m
%     %feedforward
%     a1=data(:,i);
%     z2=W1*a1+b1;
%     a2=sigmoid(z2);
%     z3=W2*a2+b2;
%     a3=sigmoid(z3);
%     cost=cost+(a1-a3)'*(a1-a3)*0.5;
%     %backpropagation
%     delta3=(a3-a1).*a3.*(1-a3);
%     delta2=(W2'*delta3+sterm).*a2.*(1-a2);
%     W2grad=W2grad+delta3*a2';
%     b2grad=b2grad+delta3;
%     W1grad=W1grad+delta2*a1';
%     b1grad=b1grad+delta2;
% end
% 
% kl=sparsityParam*log(sparsityParam./rho)+(1-sparsityParam)*log((1-sparsityParam)./(1-rho));
% %kl=rho.^2;
% cost=cost/m;
% cost=cost+sum(sum(W1.^2))*lambda/2.0+sum(sum(W2.^2))*lambda/2.0+beta*sum(kl);
% W2grad=W2grad./m+lambda*W2;
% b2grad=b2grad./m;
% W1grad=W1grad./m+lambda*W1;
% b1grad=b1grad./m;


%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)

    sigm = 1 ./ (1 + exp(-x));
end

% Derivative of the sigmoid function (despite the name, this is not its inverse)
function sigmInv = sigmoidInv(x)

    sigmInv = sigmoid(x).*(1-sigmoid(x));
end

computeNumericalGradient.m:

function numgrad = computeNumericalGradient(J, theta)
% numgrad = computeNumericalGradient(J, theta)
% theta: a vector of parameters
% J: a function that outputs a real-number. Calling y = J(theta) will return the
% function value at theta. 
  
% Initialize numgrad with zeros
numgrad = zeros(size(theta));

%% ---------- YOUR CODE HERE --------------------------------------
% Instructions: 
% Implement numerical gradient checking, and return the result in numgrad.  
% (See Section 2.3 of the lecture notes.)
% You should write code so that numgrad(i) is (the numerical approximation to) the 
% partial derivative of J with respect to the i-th input argument, evaluated at theta.  
% I.e., numgrad(i) should be the (approximately) the partial derivative of J with 
% respect to theta(i).
%                
% Hint: You will probably want to compute the elements of numgrad one at a time. 

epsilon = 1e-4;
n = size(theta,1);
E = eye(n);
for i = 1:n
    delta = E(:,i)*epsilon;
    numgrad(i) = (J(theta+delta)-J(theta-delta))/(epsilon*2.0);
end

% n=size(theta,1);
% E=eye(n);
% epsilon=1e-4;
% for i=1:n
%     dtheta=E(:,i)*epsilon;
%     numgrad(i)=(J(theta+dtheta)-J(theta-dtheta))/epsilon/2.0;
% end

%% ---------------------------------------------------------------
end

　　checkNumericalGradient.m:

function [] = checkNumericalGradient()
% This code can be used to check your numerical gradient implementation 
% in computeNumericalGradient.m
% It analytically evaluates the gradient of a very simple function called
% simpleQuadraticFunction (see below) and compares the result with your numerical
% solution. Your numerical gradient implementation is incorrect if
% your numerical solution deviates too much from the analytical solution.
  
% Evaluate the function and gradient at x = [4; 10]; (Here, x is a 2d vector.)
x = [4; 10];
[value, grad] = simpleQuadraticFunction(x);

% Use your code to numerically compute the gradient of simpleQuadraticFunction at x.
% (The notation "@simpleQuadraticFunction" denotes a pointer to a function.)
numgrad = computeNumericalGradient(@simpleQuadraticFunction, x);

% Visually examine the two gradient computations.  The two columns
% you get should be very similar. 
disp([numgrad grad]);
fprintf('The above two columns you get should be very similar.\n(Left-Your Numerical Gradient, Right-Analytical Gradient)\n\n');

% Evaluate the norm of the difference between two solutions.  
% If you have a correct implementation, and assuming you used EPSILON = 0.0001 
% in computeNumericalGradient.m, then diff below should be 2.1452e-12 
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff); 
fprintf('Norm of the difference between numerical and analytical gradient (should be < 1e-9)\n\n');
end


  
function [value,grad] = simpleQuadraticFunction(x)
% this function accepts a 2D vector as input. 
% Its outputs are:
%   value: h(x1, x2) = x1^2 + 3*x1*x2
%   grad: A 2x1 vector that gives the partial derivatives of h with respect to x1 and x2 
% Note that when we pass @simpleQuadraticFunction(x) to computeNumericalGradients, we're assuming
% that computeNumericalGradients will use only the first returned value of this function.

value = x(1)^2 + 3*x(1)*x(2);

grad = zeros(2, 1);
grad(1)  = 2*x(1) + 3*x(2);
grad(2)  = 3*x(1);

end

　　display_network.m:

function [h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)
% This function visualizes filters in matrix A. Each column of A is a
% filter. We will reshape each column into a square image and visualizes
% on each cell of the visualization panel. 
% All other parameters are optional, usually you do not need to worry
% about it.
% opt_normalize: whether we need to normalize the filter so that all of
% them can have similar contrast. Default value is true.
% opt_graycolor: whether we use gray as the heat map. Default is true.
% cols: how many columns are there in the display. Default value is the
% squareroot of the number of columns in A.
% opt_colmajor: you can switch convention to row major for A. In that
% case, each row of A is a filter. Default value is false.
warning off all

% exist('name', 'var') tests whether something called name exists; 'var' restricts the check to variables
if ~exist('opt_normalize', 'var') || isempty(opt_normalize)
    opt_normalize= true;
end

if ~exist('opt_graycolor', 'var') || isempty(opt_graycolor)
    opt_graycolor= true;
end

if ~exist('opt_colmajor', 'var') || isempty(opt_colmajor)
    opt_colmajor = false;
end

% rescale
A = A - mean(A(:));

% colormap(gray) selects the grayscale colormap
if opt_graycolor, colormap(gray); end

% compute rows, cols
[L M]=size(A);
sz=sqrt(L);
buf=1;
if ~exist('cols', 'var') % the number of columns was not given
    if floor(sqrt(M))^2 ~= M % M is not a perfect square
        n=ceil(sqrt(M));
        while mod(M, n)~=0 && n<1.2*sqrt(M), n=n+1; end
        m=ceil(M/n); % m is the number of rows of patches in the display grid
    else
        n=sqrt(M);
        m=n;
    end
else
    n = cols;
    m = ceil(M/n);
end

array=-ones(buf+m*(sz+buf),buf+n*(sz+buf));

if ~opt_graycolor
    array = 0.1.* array;
end


if ~opt_colmajor
    k=1;
    for i=1:m
        for j=1:n
            if k>M, 
                continue; 
            end
            clim=max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/max(abs(A(:)));
            end
            k=k+1;
        end
    end
else
    k=1;
    for j=1:n
        for i=1:m
            if k>M, 
                continue; 
            end
            clim=max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz);
            end
            k=k+1;
        end
    end
end

if opt_graycolor
    h=imagesc(array,'EraseMode','none',[-1 1]); % EraseMode 'none' means no pixels are erased when the image is redrawn
else
    h=imagesc(array,'EraseMode','none',[-1 1]);
end
axis image off

drawnow;

warning on all

　　Experiment summary:

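　　What do the weight images in the results represent? According to Visualizing a Trained Autoencoder, if the input is constrained to have an L2 norm of at most 1, that is:

$$\|x\|^2 = \sum_{j} x_j^2 \le 1$$

　　then it can be shown that hidden unit $i$ is maximally activated, i.e. its output is closest to 1, only when every component of the input x satisfies:

$$x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j} \left(W^{(1)}_{ij}\right)^2}}$$

　　In other words, the input values and the weights are positively correlated: each displayed weight image is the input pattern that the corresponding hidden unit responds to most strongly.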
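　　As a rough illustration of this result, the sketch below normalizes each row of a weight matrix to unit L2 norm, which gives the norm-bounded input that maximally activates each hidden unit. The matrix W1 here is a made-up 3-by-4 example, not the trained weights from this experiment.

```matlab
% Hypothetical weights: 3 hidden units, 4 inputs (a trained network would
% supply W1 from opttheta instead).
W1 = [ 3  0  4  0;
       1  1  1  1;
       0 -2  0  0];

% Row i, scaled to unit L2 norm, is the norm-bounded input that
% maximizes the activation of hidden unit i.
rowNorms = sqrt(sum(W1.^2, 2));            % one norm per hidden unit
xMax = bsxfun(@rdivide, W1, rowNorms);     % same size as W1

disp(xMax);
```

　　display_network performs a similar per-filter rescaling before drawing, which is why the picture of W1' can be read as a picture of these maximizing inputs.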

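　　Update (2013-05-06):

　　In the earlier version of this post, the vectorized implementation of sparseAutoencoderCost.m never worked. This is now solved: the fix was simply to replace the earlier Iweight with Jweight.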

　　References:

Exercise:Sparse Autoencoder

Deep learning: 8 (Sparse Autoencoder)

Autoencoders and Sparsity

Visualizing a Trained Autoencoder

UFLDL exercise (Sparse Autoencoder)

http://code.google.com/p/nlsbook/source/browse/trunk/nlsbook/cs294ps1/starter/?r=28

Deep learning: 10 (PCA and whitening)

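　　PCA:

　　PCA serves two purposes: dimensionality reduction (which speeds up training, reduces memory use, and so on) and data visualization.

　　PCA is not linear regression. Linear regression minimizes the error in the y direction, whereas PCA minimizes the projection error onto the reduced subspace. Moreover, linear regression predicts y from x, whereas PCA treats all dimensions of the x samples equally.

　　Data must be preprocessed before PCA. First comes mean normalization: for each feature dimension, subtract the mean of that dimension. Then the different dimensions are scaled to the same range, usually by dividing by the maximum value. Curiously, though, for natural images the value subtracted is not the mean of each dimension but the mean of each image itself, because PCA preprocessing depends on the application.

　　Natural images are the kind of images the human eye commonly sees, and they follow certain statistical regularities. In practice, any picture taken with an ordinary camera without heavy artificial editing can be called a natural image, because many algorithms are fairly robust to this kind of input. When learning from natural images there is little need for variance normalization, because the statistics of every part of a natural image are similar; making the data zero-mean is enough. When training on other kinds of images, such as handwritten characters, variance normalization is needed.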
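　　To make the two kinds of mean removal concrete, here is a minimal sketch on a made-up 4-by-3 matrix in which each column is treated as one patch and each row as one pixel position. The variable names xFeat and xPatch are only for illustration.

```matlab
% Hypothetical data: 4 pixels per patch (rows), 3 patches (columns).
x = [1 2 9;
     3 2 7;
     5 2 5;
     7 2 3];

% Per-feature mean: subtract the mean of each row (each pixel position),
% computed across all patches.
xFeat = bsxfun(@minus, x, mean(x, 2));

% Per-patch mean: subtract the mean intensity of each column (each patch).
xPatch = bsxfun(@minus, x, mean(x, 1));

disp(mean(xFeat, 2)');   % mean of every pixel position after centering
disp(mean(xPatch, 1));   % mean of every patch after centering
```

　　The first form is the usual PCA preprocessing; the second is the one described above for natural image patches.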

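　　PCA computes two things: the directions of the new basis vectors, and the values of the original samples projected onto these directions.

　　First compute the covariance matrix of the training samples (the input data is assumed to be zero-mean already):

$$\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^T$$

　　Then take the SVD of the covariance matrix. Each column of the resulting U is one of the new direction vectors for the data; the first column is the principal direction, and so on. U'*X (using only the first k columns of U when reducing the dimension) gives the new sample values z, that is:

$$z^{(i)} = U^T x^{(i)}$$

　　(Geometrically, each component of z is the signed distance of the original point along the corresponding direction.) This completes the two main computations of PCA. U*z recovers the original sample x: exactly if all components are kept, approximately if only the first k are kept.

　　When PCA is used for dimensionality reduction in supervised learning, extract only the x values of the training samples, compute the principal component matrix U and the reduced values z, and pair z with the original y values to form a new training set for the classifier. At test time, reduce each new test sample with the same U and feed the result to the trained classifier.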
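　　The train/test procedure can be sketched as follows. The data, the choice k = 2 and the variable names (xTrain, xTest, zTrain, zTest) are all made up for illustration; the point is only that the mean and U are estimated on the training set and then reused unchanged on the test sample.

```matlab
% Hypothetical toy data: columns are examples, rows are features.
xTrain = [2 0 -2  0  1 -1;
          2 0 -2  0  1 -1;
          0 1  0 -1  0  0];
xTest  = [1; 1; 0.5];

% Fit on the training set only: mean, covariance and principal directions.
mu    = mean(xTrain, 2);
xc    = bsxfun(@minus, xTrain, mu);
sigma = (xc * xc') / size(xc, 2);
[U, S, V] = svd(sigma);

k = 2;                                   % number of components kept (assumed)
zTrain = U(:, 1:k)' * xc;                % reduced training inputs, to be paired with the original y
zTest  = U(:, 1:k)' * (xTest - mu);      % test input: reuse the training mu and U

disp(size(zTrain));
disp(size(zTest));
```

　　A classifier would then be trained on the pairs (zTrain, y) and evaluated on zTest.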

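　　One point deserves attention: PCA does not prevent overfitting. On the surface PCA reduces the dimension, so with the same number of training samples there are fewer features and overfitting should be less likely. In practice, however, this has very little effect on overfitting; regularization remains the main tool for preventing it.

　　Not every ML task needs PCA. Use it only when the original training data cannot meet some requirement, such as training speed, memory footprint, or the wish to visualize the data. If none of these matters, PCA is not necessary.

　　Whitening:

　　The purpose of whitening is to remove the correlation between data dimensions; it is a preprocessing step for many algorithms. For example, when training on image data, adjacent pixel values are correlated, so much of the information is redundant, and whitening can be used to decorrelate it. Whitened data must satisfy two conditions: first, the correlation between different features is minimal, close to 0; second, all features have the same variance (not necessarily 1). The common whitening operations are PCA whitening and ZCA whitening.

　　PCA whitening: after the data x is transformed by PCA into z, the dimensions of z are uncorrelated, which satisfies the first whitening condition. It then suffices to divide each dimension of z by its standard deviation, so that every dimension has variance 1 and all variances are equal. With $\lambda_i$ the i-th eigenvalue of the covariance matrix (the variance of $z_i$), the formula is:

$$x_{\mathrm{PCAwhite},i} = \frac{z_i}{\sqrt{\lambda_i}}$$

　　ZCA whitening: the data x is first transformed by PCA into z, but without reducing the dimension, since all components are kept. This again satisfies the first whitening condition (the features are uncorrelated). The variance is then normalized to 1 in the same way, and finally the result is left-multiplied by the eigenvector matrix U.

　　The ZCA whitening formula is:

$$x_{\mathrm{ZCAwhite}} = U\, x_{\mathrm{PCAwhite}}$$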
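　　A quick way to check both whitening conditions is to compute the covariance of the whitened data. The sketch below does this on a small made-up zero-mean 2-D dataset; the regularization constant epsilon used in the exercises is left out here to keep the check exact.

```matlab
% Hypothetical zero-mean 2-D data (columns are examples).
x = [ 2  1 -1 -2  3 -3;
      1  2 -2 -1  2 -2];
m = size(x, 2);

sigma = (x * x') / m;            % covariance of the zero-mean data
[U, S, V] = svd(sigma);
lambda = diag(S);                % eigenvalues = variances along the principal directions

xRot      = U' * x;                          % rotate into the eigenbasis
xPCAWhite = diag(1 ./ sqrt(lambda)) * xRot;  % unit variance in every direction
xZCAWhite = U * xPCAWhite;                   % rotate back towards the original axes

covPCA = (xPCAWhite * xPCAWhite') / m;
covZCA = (xZCAWhite * xZCAWhite') / m;
disp(covPCA);   % approximately the identity matrix
disp(covZCA);   % approximately the identity matrix as well
```

　　Both covariance matrices come out as the identity up to rounding, because ZCA whitening only applies the extra rotation U to the PCA-whitened data.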

　　References:

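Deep learning: 11 (PCA and whitening exercise on 2D data)

　　Preface:

　　This post practices PCA, PCA whitening and ZCA whitening on 2D data. The dataset consists of 45 data points, each 2-dimensional. The reference is Exercise:PCA in 2D. Together with the theory in the earlier post Deep learning: 10 (PCA and whitening), it helps to further understand what PCA and whitening do.

　　Some MATLAB functions:

　　scatter:

　　scatter(X,Y,<S>,<C>,'<type>');
　　<S> – controls the point size. If it is a vector of the same length as X and Y, its values set the size of each point; if it is a scalar or omitted, all points have the same size.
　　<C> – controls the point color. If it is a vector of the same length as X and Y, the colors are mapped linearly from its values through the colormap; if it is a three-column matrix with one row per point, each row is the RGB color of that point, where [0,0,0] is black and [1,1,1] is white. If omitted, all points have the same color.
　　<type> – the marker style: 'filled' fills the markers; by default hollow circles are drawn.

　　plot:

　　plot can draw straight lines; for example, plot([1 2],[0 4]) draws a line from (1,0) to (2,4). Note how the coordinates correspond: the first argument holds the x values and the second the y values.

　　Experiment procedure:

　　1. First download the 2D data. Because the data is stored as text, it is loaded in ASCII mode. Then compute the covariance matrix of the input samples and its SVD to obtain the eigenvectors, and draw the two principal directions over the original data points, as shown in the figure below:

　　2. Plot the new data obtained after the PCA transformation, as shown below:

　　3. Reconstruct the original data from the new data; the result is shown below:

　　4. The distribution of the data after PCA whitening:

　　5. The distribution of the data after ZCA whitening: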

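　　PCA whitening and ZCA whitening both give data whose dimensions are uncorrelated and have equal variance. They differ in the coordinate system of the result: PCA whitening leaves the data in the rotated eigenbasis, whereas ZCA whitening rotates it back with U, so the result stays as close as possible to the original data.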

　　Experiment code:

close all

%%================================================================
%% Step 0: Load data
%  We have provided the code to load data from pcaData.txt into x.
%  x is a 2 * 45 matrix, where the kth column x(:,k) corresponds to
%  the kth data point.Here we provide the code to load natural image data into x.
%  You do not need to change the code below.

x = load('pcaData.txt','-ascii');
figure(1);
scatter(x(1, :), x(2, :));
title('Raw data');


%%================================================================
%% Step 1a: Implement PCA to obtain U 
%  Implement PCA to obtain the rotation matrix U, which is the eigenbasis
%  sigma. 

% -------------------- YOUR CODE HERE -------------------- 
u = zeros(size(x, 1)); % You need to compute this
[n m] = size(x);
%x = x-repmat(mean(x,2),1,m); % preprocessing: make each dimension zero-mean
sigma = (1.0/m)*x*x';
[u s v] = svd(sigma);


% -------------------------------------------------------- 
hold on
plot([0 u(1,1)], [0 u(2,1)]); % draw the first principal direction
plot([0 u(1,2)], [0 u(2,2)]); % draw the second principal direction
scatter(x(1, :), x(2, :));
hold off

%%================================================================
%% Step 1b: Compute xRot, the projection on to the eigenbasis
%  Now, compute xRot by projecting the data on to the basis defined
%  by U. Visualize the points by performing a scatter plot.

% -------------------- YOUR CODE HERE -------------------- 
xRot = zeros(size(x)); % You need to compute this
xRot = u'*x;


% -------------------------------------------------------- 

% Visualise the covariance matrix. You should see a line across the
% diagonal against a blue background.
figure(2);
scatter(xRot(1, :), xRot(2, :));
title('xRot');

%%================================================================
%% Step 2: Reduce the number of dimensions from 2 to 1. 
%  Compute xRot again (this time projecting to 1 dimension).
%  Then, compute xHat by projecting the xRot back onto the original axes 
%  to see the effect of dimension reduction

% -------------------- YOUR CODE HERE -------------------- 
k = 1; % Use k = 1 and project the data onto the first eigenbasis
xHat = zeros(size(x)); % You need to compute this
xHat = u*([u(:,1),zeros(n,1)]'*x);


% -------------------------------------------------------- 
figure(3);
scatter(xHat(1, :), xHat(2, :));
title('xHat');


%%================================================================
%% Step 3: PCA Whitening
%  Compute xPCAWhite and plot the results.

epsilon = 1e-5;
% -------------------- YOUR CODE HERE -------------------- 
xPCAWhite = zeros(size(x)); % You need to compute this
xPCAWhite = diag(1./sqrt(diag(s)+epsilon))*u'*x;



% -------------------------------------------------------- 
figure(4);
scatter(xPCAWhite(1, :), xPCAWhite(2, :));
title('xPCAWhite');

%%================================================================
%% Step 3: ZCA Whitening
%  Compute xZCAWhite and plot the results.

% -------------------- YOUR CODE HERE -------------------- 
xZCAWhite = zeros(size(x)); % You need to compute this
xZCAWhite = u*diag(1./sqrt(diag(s)+epsilon))*u'*x;

% -------------------------------------------------------- 
figure(5);
scatter(xZCAWhite(1, :), xZCAWhite(2, :));
title('xZCAWhite');

%% Congratulations! When you have reached this point, you are done!
%  You can now move onto the next PCA exercise. :)

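　　References:

Exercise:PCA in 2D

Deep learning: 12 (PCA and whitening exercise on natural images)

　　Preface:

　　This time PCA and PCA whitening are applied to natural images. For the theory, see the earlier post Deep learning: 10 (PCA and whitening). The data, steps and requirements of this experiment follow the page http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial . The experimental data are 10000 patches of size 12*12 randomly sampled from natural images. PCA retaining 99% of the variance is computed on these patches, and finally PCA whitening and ZCA whitening are applied to them and compared.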
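　　Retaining 99% of the variance means choosing the smallest k whose leading eigenvalues account for at least 99% of the total. A minimal sketch, using made-up eigenvalues in place of the ones that svd would return for the patch covariance matrix:

```matlab
% Hypothetical eigenvalues of the covariance matrix, sorted in descending
% order as svd returns them (in the exercise they would come from diag(s)).
lambda = [60; 25; 10; 4.5; 0.3; 0.2];

retained = cumsum(lambda) / sum(lambda);   % fraction of variance kept by the first k components
k = find(retained >= 0.99, 1);             % smallest k that keeps at least 99%

fprintf('k = %d retains %.2f%% of the variance\n', k, 100 * retained(k));
```

　　Lowering the threshold, for example to 0.90, gives a smaller k and a coarser reconstruction.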

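　　Environment: MATLAB R2012a

　　Procedure and results:

　　Randomly sample 10000 patches and display 204 of them, as shown below:

　　Then make the patches zero-mean, giving the figure below:

　　Apply the PCA transformation to the sampled patches to obtain new sample data; the covariance matrix of the new data is shown below:

　　The original data reconstructed from PCA with 99% of the variance retained:

　　The images after PCA whitening:

　　The covariance matrix of the patches at this point:

　　The result of ZCA whitening:

　　Experiment code and comments: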

%%================================================================
%% Step 0a: Load data
%  Here we provide the code to load natural image data into x.
%  x will be a 144 * 10000 matrix, where the kth column x(:, k) corresponds to
%  the raw image data from the kth 12x12 image patch sampled.
%  You do not need to change the code below.

x = sampleIMAGESRAW();
figure('name','Raw images');
randsel = randi(size(x,2),204,1); % A random selection of samples for visualization
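display_network(x(:,randsel)); % x contains negative values but can still be displayed, because display_network rescales the data before drawing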

%%================================================================
%% Step 0b: Zero-mean the data (by row)
%  You can make use of the mean and repmat/bsxfun functions.

% -------------------- YOUR CODE HERE -------------------- 
x = x-repmat(mean(x,1),size(x,1),1); % subtract the mean of each column, i.e. the mean of each patch
%x = x-repmat(mean(x,2),1,size(x,2));

%%================================================================
%% Step 1a: Implement PCA to obtain xRot
%  Implement PCA to obtain xRot, the matrix in which the data is expressed
%  with respect to the eigenbasis of sigma, which is the matrix U.


% -------------------- YOUR CODE HERE -------------------- 
xRot = zeros(size(x)); % You need to compute this
[n m] = size(x);
sigma = (1.0/m)*x*x';
[u s v] = svd(sigma);
xRot = u'*x;


%%================================================================
%% Step 1b: Check your implementation of PCA
%  The covariance matrix for the data expressed with respect to the basis U
%  should be a diagonal matrix with non-zero entries only along the main
%  diagonal. We will verify this here.
%  Write code to compute the covariance matrix, covar. 
%  When visualised as an image, you should see a straight line across the
%  diagonal (non-zero entries) against a blue background (zero entries).

% -------------------- YOUR CODE HERE -------------------- 
covar = zeros(size(x, 1)); % You need to compute this
covar = (1./m)*xRot*xRot';

% Visualise the covariance matrix. You should see a line across the
% diagonal against a blue background.
figure('name','Visualisation of covariance matrix');
imagesc(covar);

%%================================================================
%% Step 2: Find k, the number of components to retain
%  Write code to determine k, the number of components to retain in order
%  to retain at least 99% of the variance.

% -------------------- YOUR CODE HERE -------------------- 
k = 0; % Set k accordingly
ss = diag(s);
% for k=1:m
%    if sum(s(1:k))./sum(ss) < 0.99
%        continue;
% end
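% cumsum(ss) is the cumulative sum of the eigenvalues, so cumsum(ss)/sum(ss)
% is the fraction of variance retained by the first k components.
% The comparison >= 0.99 yields a logical vector, and find(..., 1) returns the
% first index where it is true, i.e. the smallest k that retains at least 99%.
k = find(cumsum(ss)/sum(ss) >= 0.99, 1);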

%%================================================================
%% Step 3: Implement PCA with dimension reduction
%  Now that you have found k, you can reduce the dimension of the data by
%  discarding the remaining dimensions. In this way, you can represent the
%  data in k dimensions instead of the original 144, which will save you
%  computational time when running learning algorithms on the reduced
%  representation.
% 
%  Following the dimension reduction, invert the PCA transformation to produce 
%  the matrix xHat, the dimension-reduced data with respect to the original basis.
%  Visualise the data and compare it to the raw data. You will observe that
%  there is little loss due to throwing away the principal components that
%  correspond to dimensions with low variation.

% -------------------- YOUR CODE HERE -------------------- 
xHat = zeros(size(x));  % You need to compute this
xHat = u*[u(:,1:k)'*x;zeros(n-k,m)];

% Visualise the data, and compare it to the raw data
% You should observe that the raw and processed data are of comparable quality.
% For comparison, you may wish to generate a PCA reduced image which
% retains only 90% of the variance.

figure('name',['PCA processed images ',sprintf('(%d / %d dimensions)', k, size(x, 1)),'']);
display_network(xHat(:,randsel));
figure('name','Raw images');
display_network(x(:,randsel));

%%================================================================
%% Step 4a: Implement PCA with whitening and regularisation
%  Implement PCA with whitening and regularisation to produce the matrix
%  xPCAWhite. 

epsilon = 0.1;
xPCAWhite = zeros(size(x));

% -------------------- YOUR CODE HERE -------------------- 
xPCAWhite = diag(1./sqrt(diag(s)+epsilon))*u'*x;
figure('name','PCA whitened images');
display_network(xPCAWhite(:,randsel));

%%================================================================
%% Step 4b: Check your implementation of PCA whitening 
%  Check your implementation of PCA whitening with and without regularisation. 
%  PCA whitening without regularisation results a covariance matrix 
%  that is equal to the identity matrix. PCA whitening with regularisation
%  results in a covariance matrix with diagonal entries starting close to 
%  1 and gradually becoming smaller. We will verify these properties here.
%  Write code to compute the covariance matrix, covar. 
%
%  Without regularisation (set epsilon to 0 or close to 0), 
%  when visualised as an image, you should see a red line across the
%  diagonal (one entries) against a blue background (zero entries).
%  With regularisation, you should see a red line that slowly turns
%  blue across the diagonal, corresponding to the one entries slowly
%  becoming smaller.

% -------------------- YOUR CODE HERE -------------------- 
covar = (1./m)*xPCAWhite*xPCAWhite';

% Visualise the covariance matrix. You should see a red line across the
% diagonal against a blue background.
figure('name','Visualisation of covariance matrix');
imagesc(covar);

%%================================================================
%% Step 5: Implement ZCA whitening
%  Now implement ZCA whitening to produce the matrix xZCAWhite. 
%  Visualise the data and compare it to the raw data. You should observe
%  that whitening results in, among other things, enhanced edges.

xZCAWhite = zeros(size(x));

% -------------------- YOUR CODE HERE -------------------- 
xZCAWhite = u*xPCAWhite;

% Visualise the data, and compare it to the raw data.
% You should observe that the whitened images have enhanced edges.
figure('name','ZCA whitened images');
display_network(xZCAWhite(:,randsel));
figure('name','Raw images');
display_network(x(:,randsel));

　　References:

http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

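Deep learning: 13 (Softmax Regression)

　　In the earlier logistic regression post, Deep learning: 4 (logistic regression exercise), we saw that logistic regression can handle some classification problems with nonlinear decision boundaries (given suitable features). However, it only handles two classes, and along with the predicted class it also gives the probability of that prediction. How can it be extended to multi-class problems with a similar method, meaning one that outputs both the class and its probability? This post covers softmax regression, a multi-class extension of logistic regression. The reference is: http://deeplearning.stanford.edu/wiki/index.php/Softmax_Regression

　　In logistic regression, the hypothesis being learned is:

$$h_\theta(x) = \frac{1}{1+\exp(-\theta^T x)}$$

　　and the corresponding loss function is:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$$

　　As can be seen, given a sample the model outputs a single probability: the probability that the sample belongs to class '1'. Since there are only two classes, the probability of the other class is 1 minus this value. Now suppose the problem is multi-class, with k classes in total. In softmax regression the hypothesis becomes:

$$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)}=1 \mid x^{(i)};\theta) \\ p(y^{(i)}=2 \mid x^{(i)};\theta) \\ \vdots \\ p(y^{(i)}=k \mid x^{(i)};\theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_k^T x^{(i)}} \end{bmatrix}$$

　　Here the parameter θ is no longer a column vector but a matrix. Each row can be viewed as the parameters of the classifier for one class, and there are k rows. So θ can be written as:

$$\theta = \begin{bmatrix} \theta_1^T \\ \theta_2^T \\ \vdots \\ \theta_k^T \end{bmatrix}$$

　　The loss function is now:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\log\frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}\right]$$

　　where 1{.} is the indicator function: it equals 1 when the statement inside the braces is true and 0 otherwise.

　　To find the parameters with gradient descent, Newton's method or L-BFGS, we need the partial derivatives of the loss function. For softmax regression they are:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[x^{(i)}\left(1\{y^{(i)}=j\} - p(y^{(i)}=j \mid x^{(i)};\theta)\right)\right]$$

　　Note that $\nabla_{\theta_j} J(\theta)$ is a vector, computed for the j-th class. So the formula above is the gradient for a single class, and it must be computed for every class. Its l-th element, $\partial J(\theta)/\partial\theta_{jl}$, is the partial derivative of the loss with respect to the l-th parameter of the j-th class.

　　Interestingly, the optimal parameters in softmax regression are not unique. Given one optimal set of parameters, subtracting the same vector ψ from every θ_j gives exactly the same loss, so the solution is not unique. The proof is:

$$p(y^{(i)}=j \mid x^{(i)};\theta) = \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^{k} e^{(\theta_l-\psi)^T x^{(i)}}} = \frac{e^{\theta_j^T x^{(i)}}\, e^{-\psi^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}\, e^{-\psi^T x^{(i)}}} = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}$$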
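　　The same property can be checked numerically. In the sketch below the parameters theta, the input x and the shift psi are arbitrary made-up values; the class probabilities are computed before and after subtracting psi from every θ_j.

```matlab
% Hypothetical parameters: k = 3 classes, n = 2 features (row j of theta is theta_j').
theta = [ 0.5 -1.0;
          2.0  0.3;
         -0.7  1.5];
x   = [1.2; -0.4];      % one input example
psi = [10; -3];         % arbitrary vector subtracted from every theta_j

% Class probabilities under the original parameters.
s1 = exp(theta * x);
p1 = s1 / sum(s1);

% Class probabilities after shifting every row by psi'.
thetaShifted = bsxfun(@minus, theta, psi');
s2 = exp(thetaShifted * x);
p2 = s2 / sum(s2);

disp([p1 p2]);          % the two columns agree up to rounding error
```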

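　　Why does this happen? Broadly, the loss function here is convex but not strictly convex: around a minimum it is flat along some directions, so moving the parameters along those directions leaves the loss unchanged. How can this be avoided? Adding a regularization term solves it. (For example, when solving with Newton's method, the Hessian without a regularization term may be non-invertible, which leads to the situation above; with the regularization term the Hessian becomes invertible.) The loss function with regularization is:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\log\frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}\right] + \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta_{ij}^2$$

　　and its partial derivatives are:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[x^{(i)}\left(1\{y^{(i)}=j\} - p(y^{(i)}=j \mid x^{(i)};\theta)\right)\right] + \lambda\theta_j$$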
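　　The regularized cost and gradient translate almost directly into vectorized MATLAB. The following is only a sketch on a made-up toy problem, not the softmaxCost.m of the exercise; the variable names follow the starter code where possible, and subtracting the column maximum before exp is an added numerical safeguard that relies on the non-uniqueness property described above.

```matlab
% Hypothetical toy problem: n = 2 features, k = 3 classes, m = 4 examples.
data   = [1 0 -1  2;
          0 1  1 -1];            % n-by-m, one example per column
labels = [1; 2; 3; 1];           % class of each example, in 1..k
theta  = [ 0.1  0.2;
          -0.3  0.4;
           0.5 -0.6];            % k-by-n, row j holds theta_j'
lambda = 1e-4;                   % weight decay strength
[k, n] = size(theta);
m = size(data, 2);

% Indicator matrix: groundTruth(j,i) = 1{y(i) = j}.
groundTruth = full(sparse(labels, 1:m, 1, k, m));

% Class probabilities; subtracting the column maximum avoids overflow
% and does not change the probabilities.
M = theta * data;
M = bsxfun(@minus, M, max(M, [], 1));
p = exp(M);
p = bsxfun(@rdivide, p, sum(p, 1));

% Regularized cost and its gradient (same shape as theta).
cost = -sum(sum(groundTruth .* log(p))) / m + (lambda / 2) * sum(theta(:).^2);
grad = -(groundTruth - p) * data' / m + lambda * theta;

disp(cost);
disp(grad);
```

　　Before training with such a function, the gradient would normally be compared against computeNumericalGradient on a small problem like this one.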

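　　What remains is to solve the problem with a numerical optimization method. The formulas also show, from a mathematical point of view, that softmax regression is an extension of logistic regression.

　　The tutorial page also discusses the difference between softmax regression and k binary classifiers and when to use each. The key point is this: if the classes are strictly mutually exclusive, meaning a sample cannot belong to two classes at once, use softmax regression. Conversely, if the classes may overlap, use k binary classifiers.

　　References:

Deep learning: 4 (logistic regression exercise)

http://deeplearning.stanford.edu/wiki/index.php/Softmax_Regression

Deep learning: 14 (Softmax Regression exercise)

　　Preface:

　　This post practices softmax regression as a multi-class classifier. The theory was covered in the earlier post Deep learning: 13 (Softmax Regression). The experiment follows the page http://deeplearning.stanford.edu/wiki/index.php/Exercise:Softmax_Regression. The task is handwritten digit recognition on the MNIST database, which has 60,000 training samples and 10,000 test samples covering the 10 digits 0~9. Each sample is a small 28*28 image.

　　Environment: MATLAB R2012a

　　Background:

　　This experiment uses only the softmax model: there is no hidden layer, just an input layer and an output layer, because no features are extracted from the MNIST samples and the raw pixels are used directly. The main work is computing the loss function and its partial derivatives, which are:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\log\frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}\right] + \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta_{ij}^2$$

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[x^{(i)}\left(1\{y^{(i)}=j\} - p(y^{(i)}=j \mid x^{(i)};\theta)\right)\right] + \lambda\theta_j$$

　　Some MATLAB functions:

　　sparse:

　　Creates a sparse matrix. For example, in sparse(A, B, k), A and B are vectors and k is a constant. Every stored entry of the resulting sparse matrix has the value k, and the entries are located at the positions (A(i), B(i)) given by corresponding elements of A and B.

　　full:

　　Creates an ordinary (full) matrix, usually by converting a sparse matrix back.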
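　　In this exercise the two functions are typically combined to build the indicator matrix 1{y(i) = j} from the label vector. A minimal sketch with made-up labels:

```matlab
labels = [2; 1; 3; 2];                 % hypothetical class labels for 4 examples
m = numel(labels);

S = sparse(labels, 1:m, 1);            % stores only the 4 nonzero entries
G = full(S);                           % ordinary 3-by-4 matrix of zeros and ones

disp(S);
disp(G);
```

　　Column i of G has a single 1 in the row given by labels(i), so G can be multiplied element-wise with the log-probabilities to pick out the term for the correct class of each example.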

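　　Errors encountered:

　　With the author's starter code, even loading the data failed with the error: Error using permute Out of memory. Type HELP MEMORY for your options. Tracing it led to the statement images = permute(images,[2 1 3]) in loadMNISTImages.m: the images array is too large to be permuted within the available memory. Yet the data is small, only a few tens of megabytes, and none of the many out-of-memory remedies I tried helped. In the end I changed the statement just before it, images = reshape(images, numCols, numRows, numImages);, to images = reshape(images, numRows, numCols, numImages);, since the result is the same for this experiment anyway. Because the root cause is memory, either use 64-bit MATLAB or optimize the function yourself to reduce its memory use at run time.

　　Results:

　　Accuracy: 92.640%

　　This is very close to the result given in the tutorial.

　　Main experiment code:

　　softmaxExercise.m: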

%% CS294A/CS294W Softmax Exercise 

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  softmax exercise. You will need to write the softmax cost function 
%  in softmaxCost.m and the softmax prediction function in softmaxPred.m. 
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%  (However, you may be required to do so in later exercises)

%%======================================================================
%% STEP 0: Initialise constants and parameters
%
%  Here we define and initialise some constants which allow your code
%  to be used more generally on any arbitrary input. 
%  We also initialise some parameters used for tuning the model.

inputSize = 28 * 28; % Size of input vector (MNIST images are 28x28)
numClasses = 10;     % Number of classes (MNIST images fall into 10 classes)

lambda = 1e-4; % Weight decay parameter

%%======================================================================
%% STEP 1: Load data
%
%  In this section, we load the input and output data.
%  For softmax regression on MNIST pixels, 
%  the input data is the images, and 
%  the output data is the labels.
%

% Change the filenames if you've saved the files under different names
% On some platforms, the files might be saved as 
% train-images.idx3-ubyte / train-labels.idx1-ubyte

images = loadMNISTImages('train-images.idx3-ubyte');
labels = loadMNISTLabels('train-labels.idx1-ubyte');
labels(labels==0) = 10; % Remap 0 to 10

inputData = images;

% For debugging purposes, you may wish to reduce the size of the input data
% in order to speed up gradient checking. 
% Here, we create synthetic dataset using random data for testing

% DEBUG = true; % Set DEBUG to true when debugging.
DEBUG = false;
if DEBUG
    inputSize = 8;
    inputData = randn(8, 100);
    labels = randi(10, 100, 1);
end

% Randomly initialise theta
theta = 0.005 * randn(numClasses * inputSize, 1);%theta is passed in as a column vector

%%======================================================================
%% STEP 2: Implement softmaxCost
%
%  Implement softmaxCost in softmaxCost.m. 

[cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, inputData, labels);
                                     
%%======================================================================
%% STEP 3: Gradient checking
%
%  As with any learning algorithm, you should always check that your
%  gradients are correct before learning the parameters.
% 

if DEBUG
    numGrad = computeNumericalGradient( @(x) softmaxCost(x, numClasses, ...
                                    inputSize, lambda, inputData, labels), theta);

    % Use this to visually compare the gradients side by side
    disp([numGrad grad]); 

    % Compare numerically computed gradients with those computed analytically
    diff = norm(numGrad-grad)/norm(numGrad+grad);
    disp(diff); 
    % The difference should be small. 
    % In our implementation, these values are usually less than 1e-7.

    % When your gradients are correct, congratulations!
end

%%======================================================================
%% STEP 4: Learning parameters
%
%  Once you have verified that your gradients are correct, 
%  you can start training your softmax regression code using softmaxTrain
%  (which uses minFunc).

options.maxIter = 100;
%softmaxModel is just a struct holding the learned optimal parameters plus the input size and the number of classes
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
                            inputData, labels, options);
                          
% Although we only use 100 iterations here to train a classifier for the 
% MNIST data set, in practice, training for more iterations is usually
% beneficial.

%%======================================================================
%% STEP 5: Testing
%
%  You should now test your model against the test images.
%  To do this, you will first need to write softmaxPredict
%  (in softmaxPredict.m), which should return predictions
%  given a softmax model and the input data.

images = loadMNISTImages('t10k-images.idx3-ubyte');
labels = loadMNISTLabels('t10k-labels.idx1-ubyte');
labels(labels==0) = 10; % Remap 0 to 10

inputData = images;
size(softmaxModel.optTheta)
size(inputData)

% You will have to implement softmaxPredict in softmaxPredict.m
[pred] = softmaxPredict(softmaxModel, inputData);

acc = mean(labels(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);

% Accuracy is the proportion of correctly classified images
% After 100 iterations, the results for our implementation were:
%
% Accuracy: 92.200%
%
% If your values are too low (accuracy less than 0.91), you should check 
% your code for errors, and make sure you are training on the 
% entire data set of 60000 28x28 training images 
% (unless you modified the loading code, this should be the case)

　　softmaxCost.m

function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)

% numClasses - the number of classes 
% inputSize - the size N of the input vector
% lambda - weight decay parameter
% data - the N x M input matrix, where each column data(:, i) corresponds to
%        a single test set
% labels - an M x 1 matrix containing the labels corresponding for the input data
%

% Unroll the parameters from theta
theta = reshape(theta, numClasses, inputSize);%將輸入的參數列向量變成一個矩陣

numCases = size(data, 2);%輸入樣本的個數
groundTruth = full(sparse(labels, 1:numCases, 1));%sparse builds a sparse matrix whose nonzero entries all equal the third argument, 1;
                                                    %their (row, column) subscripts are given by labels and 1:numCases
cost = 0;

thetagrad = zeros(numClasses, inputSize);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost and gradient for softmax regression.
%                You need to compute thetagrad and cost.
%                The groundTruth matrix might come in handy.

M = bsxfun(@minus,theta*data,max(theta*data, [], 1));
M = exp(M);
p = bsxfun(@rdivide, M, sum(M));
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(theta(:) .^ 2);
thetagrad = -1/numCases * (groundTruth - p) * data' + lambda * theta;



% ------------------------------------------------------------------
% Unroll the gradient matrices into a vector for minFunc
grad = [thetagrad(:)];
end
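
　　The max subtraction in softmaxCost above is a numerical-stability trick: shifting every column of theta*data by its own maximum leaves the softmax probabilities unchanged but keeps exp from overflowing. A minimal sketch with made-up scores (the matrix z and the variable names are illustrative, not part of the exercise):

```matlab
% Hypothetical class scores: 3 classes x 2 samples. Column 1 is column 2 + 999.
z = [1000 1; 1001 2; 1002 3];

% Naive softmax: exp(1000) overflows to Inf, so column 1 becomes NaN.
eNaive = exp(z);
pNaive = bsxfun(@rdivide, eNaive, sum(eNaive, 1));

% Stable softmax: subtract each column's maximum before exponentiating.
zShift = bsxfun(@minus, z, max(z, [], 1));
eShift = exp(zShift);
pStable = bsxfun(@rdivide, eShift, sum(eShift, 1));

disp(any(isnan(pNaive(:))));   % the naive version breaks down
disp(pStable);                 % both columns hold the same probabilities
```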

　　softmaxTrain.m:

function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)
%softmaxTrain Train a softmax model with the given parameters on the given
% data. Returns softmaxOptTheta, a vector containing the trained parameters
% for the model.
%
% inputSize: the size of an input vector x^(i)
% numClasses: the number of classes 
% lambda: weight decay parameter
% inputData: an N by M matrix containing the input data, such that
%            inputData(:, c) is the cth input
% labels: M by 1 matrix containing the class labels for the
%            corresponding inputs. labels(c) is the class label for
%            the cth input
% options (optional): options
%   options.maxIter: number of iterations to train for

if ~exist('options', 'var')
    options = struct;
end

if ~isfield(options, 'maxIter')
    options.maxIter = 400;
end

% initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);

% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % softmaxCost.m satisfies this.
options.display = 'on';   % set on the struct that is actually passed to minFunc

[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
                                   numClasses, inputSize, lambda, ...
                                   inputData, labels), ...                                   
                              theta, options);

% Fold softmaxOptTheta into a nicer format
softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;
                          
end

　　softmaxPredict.m:

function [pred] = softmaxPredict(softmaxModel, data)

% softmaxModel - model trained using softmaxTrain
% data - the N x M input matrix, where each column data(:, i) corresponds to
%        a single test set
%
% Your code should produce the prediction matrix 
% pred, where pred(i) is argmax_c P(y(c) | x(i)).
 
% Unroll the parameters from theta
theta = softmaxModel.optTheta;  % this provides a numClasses x inputSize matrix
pred = zeros(1, size(data, 2));

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start 
%                from 1.


[nop, pred] = max(theta * data);
%  pred= max(peed_temp);


% ---------------------------------------------------------------------

end

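　　References:

Deep learning: 13 (Softmax Regression)

http://deeplearning.stanford.edu/wiki/index.php/Exercise:Softmax_Regression

Deep learning: 15 (Self-Taught Learning Exercise)

　　Preface:

　　This exercise practices implementing self-taught learning, following http://deeplearning.stanford.edu/wiki/index.php/Exercise:Self-Taught_Learning. Self-taught learning uses unsupervised learning to learn the parameters of a feature extractor and then supervised learning to train a classifier; here the two are a sparse autoencoder and softmax regression, respectively. The data is again the MNIST handwritten digit dataset.

　　Background:

　　From the earlier posts we know that the output of a sparse autoencoder has the same size as its input and is very close to it. How, then, does a trained sparse autoencoder produce a feature vector? The features extracted from an input sample are simply the outputs of the hidden layer. First look at the classic sparse autoencoder model from before, shown in the figure below:

　　After removing the output layer at the end, the values of the hidden layer are the features we need, as shown in the figure below:

　　According to the tutorial, two settings of unsupervised feature learning need to be distinguished: self-taught learning and semi-supervised learning. Self-taught learning is the more general setting: the unlabeled data need not come from the same distribution as the labeled data. The tutorial illustrates this well with an example. Suppose we want to build a system that distinguishes cars from motorcycles. If the unlabeled training images are natural images downloaded at random (some may contain cars or motorcycles, some may contain neither, and most contain neither) and these samples are used to train the feature model, the method is called self-taught learning. If the unlabeled images are all pictures of cars and motorcycles and we merely do not know which picture shows which vehicle, that is, only the labels are missing, then the method is not strictly unsupervised feature learning and should be called semi-supervised learning.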

　　Some MATLAB functions:

　　numel:

　　For example, n = numel(A) returns the number of elements in the matrix A.

　　unique:

　　unique returns the distinct elements of a vector, sorted.
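
　　A small sketch of both functions on made-up values (A and trainLabels here are illustrative); the last line is the same idiom the exercise uses to count classes:

```matlab
A = [3 1 2; 1 3 3];
n = numel(A);                 % number of elements in A (2*3)
u = unique(A);                % distinct values of A, sorted ascending, as a column

% Count the distinct class labels in a made-up label vector.
trainLabels = [2 5 2 1 5 1];
numClasses = numel(unique(trainLabels));
```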

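　　Experimental results:

　　The samples of digits 5~9 are used for unsupervised training with a sparse autoencoder. The weights learned from this data, displayed as images, are shown below:

　　The actual task of this experiment, however, is to classify the five digits 0~4. Although the unsupervised training used samples of digits 5~9, this does not hurt the later results. The classifier itself is a softmax regression, so that stage is supervised. According to the tutorial page the final accuracy is 98%, whereas building the classifier directly on the raw pixels is not only less accurate (only 96%) but also considerably slower to train.

　　Main code of the experiment: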

　　stlExercise.m:

%% CS294A/CS294W Self-taught Learning Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  self-taught learning. You will need to complete code in feedForwardAutoencoder.m
%  You will also need to have implemented sparseAutoencoderCost.m and 
%  softmaxCost.m from previous exercises.
%
%% ======================================================================
%  STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to 
%  change the parameters below.

inputSize  = 28 * 28;
numLabels  = 5;
hiddenSize = 200;
sparsityParam = 0.1; % desired average activation of the hidden units.
                     % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                     %  in the lecture notes). 
lambda = 3e-3;       % weight decay parameter       
beta = 3;            % weight of sparsity penalty term   
maxIter = 400;

%% ======================================================================
%  STEP 1: Load data from the MNIST database
%
%  This loads our training and test data from the MNIST database files.
%  We have sorted the data for you in this so that you will not have to
%  change it.

% Load MNIST database files
mnistData   = loadMNISTImages('train-images.idx3-ubyte');
mnistLabels = loadMNISTLabels('train-labels.idx1-ubyte');

% Set Unlabeled Set (All Images)

% Simulate a Labeled and Unlabeled set
labeledSet   = find(mnistLabels >= 0 & mnistLabels <= 4);
unlabeledSet = find(mnistLabels >= 5);

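%% Added line: use only the first third of the unlabeled samples
unlabeledSet = unlabeledSet(1:end/3);

numTest = round(numel(labeledSet)/2); % half of the labeled samples; note that numTest is not used below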
numTrain = round(numel(labeledSet)/3);
trainSet = labeledSet(1:numTrain);
testSet  = labeledSet(numTrain+1:2*numTrain);

unlabeledData = mnistData(:, unlabeledSet); %% why does running this and the next indexing statement back to back fail?
% pack;
trainData   = mnistData(:, trainSet);
trainLabels = mnistLabels(trainSet)' + 1; % Shift Labels to the Range 1-5

% mnistData2 = mnistData;
testData   = mnistData(:, testSet);
testLabels = mnistLabels(testSet)' + 1;   % Shift Labels to the Range 1-5

% Output Some Statistics
fprintf('# examples in unlabeled set: %d\n', size(unlabeledData, 2));
fprintf('# examples in supervised training set: %d\n\n', size(trainData, 2));
fprintf('# examples in supervised testing set: %d\n\n', size(testData, 2));

%% ======================================================================
%  STEP 2: Train the sparse autoencoder
%  This trains the sparse autoencoder on the unlabeled training
%  images. 

%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, inputSize);

%% ----------------- YOUR CODE HERE ----------------------
%  Find opttheta by running the sparse autoencoder on
%  unlabeledTrainingImages

opttheta = theta; 
addpath minFunc/
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';
[opttheta, loss] = minFunc( @(p) sparseAutoencoderLoss(p, ...
      inputSize, hiddenSize, ...
      lambda, sparsityParam, ...
      beta, unlabeledData), ...
      theta, options);


%% -----------------------------------------------------
                          
% Visualize weights
W1 = reshape(opttheta(1:hiddenSize * inputSize), hiddenSize, inputSize);
display_network(W1');

%%======================================================================
%% STEP 3: Extract Features from the Supervised Dataset
%  
%  You need to complete the code in feedForwardAutoencoder.m so that the 
%  following command will extract features from the data.

trainFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
                                       trainData);

testFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
                                       testData);

%%======================================================================
%% STEP 4: Train the softmax classifier

softmaxModel = struct;  
%% ----------------- YOUR CODE HERE ----------------------
%  Use softmaxTrain.m from the previous exercise to train a multi-class
%  classifier. 

%  Use lambda = 1e-4 for the weight regularization for softmax
lambda = 1e-4;
inputSize = hiddenSize;
numClasses = numel(unique(trainLabels)); % unique returns the distinct elements of a vector, sorted

% You need to compute softmaxModel using softmaxTrain on trainFeatures and
% trainLabels



options.maxIter = 100;
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
                            trainFeatures, trainLabels, options);



%% -----------------------------------------------------


%%======================================================================
%% STEP 5: Testing 

%% ----------------- YOUR CODE HERE ----------------------
% Compute Predictions on the test set (testFeatures) using softmaxPredict
% and softmaxModel


[pred] = softmaxPredict(softmaxModel, testFeatures);


%% -----------------------------------------------------

% Classification Score
fprintf('Test Accuracy: %f%%\n', 100*mean(pred(:) == testLabels(:)));

% (note that we shift the labels by 1, so that digit 0 now corresponds to
%  label 1)
%
% Accuracy is the proportion of correctly classified images
% The results for our implementation was:
%
% Accuracy: 98.3%
%
%

　　feedForwardAutoencoder.m:

function [activation] = feedForwardAutoencoder(theta, hiddenSize, visibleSize, data)

% theta: trained weights from the autoencoder
% visibleSize: the number of input units (probably 64) 
% hiddenSize: the number of hidden units (probably 25) 
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
  
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the activation of the hidden layer for the Sparse Autoencoder.
activation  = sigmoid(W1*data+repmat(b1,[1,size(data,2)]));

%-------------------------------------------------------------------

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

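　　References:

http://deeplearning.stanford.edu/wiki/index.php/Exercise:Self-Taught_Learning

MNIST Dataset

Deep learning: 16 (Deep Networks)

　　This post follows the section Building Deep Networks for Classification of http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial. It has two main parts:

　　1. From self-taught learning to deep networks:

　　As the earlier post on self-taught learning (Deep learning: 15 (Self-Taught Learning Exercise)) showed, that method extracts features in a purely unsupervised way. This post builds on it by further fine-tuning the network parameters with a supervised method, which gives better results. Combining the two steps of self-taught learning gives the structure shown below:

　　Clearly this is a multi-layer neural network, with three layers.

　　In general, the model parameters obtained by unsupervised learning can serve as the initial values for supervised learning. When a large amount of labeled data is available, gradient descent or a similar method can then continue to optimize the parameters; thanks to this initialization the optimization usually converges to a fairly good local optimum. With randomly initialized parameters, a multi-layer network rarely converges to a good local optimum, because its objective function is non-convex.

　　When should fine-tuning be used to adjust the result of unsupervised learning? Only when a large number of labeled samples is available. With plenty of unlabeled samples but only a few labeled ones, fine-tuning is not appropriate. If we do not fine-tune, the classifier in the third layer should use the concatenated representation, that is, the learned features are fed in together with the original features. With fine-tuning the results are better anyway, and the concatenated representation is no longer needed.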
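
　　The concatenated representation simply stacks the raw input on top of the learned hidden activations before handing the result to the classifier. A minimal sketch, assuming random placeholder weights W1 and b1 in place of a trained autoencoder (all names and sizes here are illustrative):

```matlab
inputSize = 6; hiddenSize = 4; numSamples = 5;   % toy sizes
x  = rand(inputSize, numSamples);                % one sample per column
W1 = 0.1 * randn(hiddenSize, inputSize);         % placeholder for trained weights
b1 = zeros(hiddenSize, 1);

% Hidden-layer activations (the learned features).
a = 1 ./ (1 + exp(-bsxfun(@plus, W1 * x, b1)));

replacementFeatures  = a;        % features only
concatenatedFeatures = [x; a];   % raw input stacked with the features
```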

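　　2. A short overview of deep networks:

　　A multi-layer neural network can represent more complex functions of the input, because each layer is a non-linear transformation of the previous one. This requires the activation function of every layer to be non-linear; otherwise several layers are no more expressive than one.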
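
　　Why linear activations gain nothing from depth can be seen in a few lines: two stacked linear layers are equivalent to one layer whose weight matrix is the product of the two. A small sketch with random made-up weights (biases omitted for brevity):

```matlab
W1 = randn(4, 3); W2 = randn(2, 4);   % two layers with identity activation
x  = randn(3, 1);

twoLayers = W2 * (W1 * x);     % output of the stacked linear layers
oneLayer  = (W2 * W1) * x;     % a single layer with weights W2*W1

gap = norm(twoLayers - oneLayer);   % agrees up to rounding error
```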

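　　Advantages of deep networks:

　　1. They can learn more complex representations than a single-layer network. For example, a function that a k-layer network can learn with a polynomial number of nodes per layer may require an exponentially large number of nodes in a (k-1)-layer network.

　　2. The features learned by different layers rise gradually from low level to high level. For images, the first hidden layer may learn edge features, the second may learn contours and the like, and later layers may learn something even higher-level, such as a part of the target object. In other words, lower hidden layers learn low-level features and higher hidden layers learn high-level features.

　　3. This multi-layer structure closely resembles the layered perception structure of the human cerebral cortex, so it has some biological basis.

　　Disadvantages of deep networks:

　　1. The deeper the network, the more training samples it needs. With supervised learning these samples are even harder to obtain, since all of them must be labeled. With too few samples, overfitting occurs easily.

　　2. Optimizing the parameters of a multi-layer network is a highly non-convex problem that often converges to a poor local optimum, and ordinary optimization algorithms generally perform badly on it. In other words, parameter optimization is a hard point.

　　3. Diffusion of gradients. In a deep network the partial derivatives of the loss are computed with the BP algorithm, but the gradient magnitudes shrink markedly as they propagate back toward the earlier layers. The derivative of the loss with respect to the weights of the earlier layers is therefore very small, and those weights are updated extremely slowly. A closely related problem noted in the tutorial: if the last few layers have enough neurons, they can fit the labeled data on their own without help from the earlier layers, so training the whole randomly initialized network at once ends up performing much like a shallow network, which defeats the purpose of depth.

　　For these reasons, greedy layer-wise training is generally used: train the first hidden layer, then the second, then the third, and so on, and finally use the trained parameters as the initial values of the whole network. One benefit is that data is easier to obtain, since the earlier layers are trained by unsupervised methods and only the final output layer needs labeled data. In addition, unsupervised learning implicitly provides prior knowledge about the input data, so these initial values usually lead to a good local optimum. A common greedy layer-wise method is stacked autoencoders. Its encoding formula is as follows:

　　The decoding formula is as follows: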
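
　　For reference, a reconstruction of the two stacked-autoencoder formulas in the notation of the UFLDL tutorial, for a stack of $n$ autoencoders where $W^{(k,1)}, b^{(k,1)}$ are the encoding and $W^{(k,2)}, b^{(k,2)}$ the decoding parameters of the $k$-th autoencoder (the exact indexing is an assumption about the tutorial's convention). Encoding runs each layer in forward order:

```latex
a^{(l)} = f\left(z^{(l)}\right), \qquad z^{(l+1)} = W^{(l,1)} a^{(l)} + b^{(l,1)}
```

　　Decoding runs the decoding steps of the autoencoders in reverse order:

```latex
a^{(n+l)} = f\left(z^{(n+l)}\right), \qquad z^{(n+l+1)} = W^{(n-l,2)} a^{(n+l)} + b^{(n-l,2)}
```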

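　　Finally, the parameters learned by the stacked autoencoders are used to initialize the whole network, which can now be treated as a single neural network model that just happens to have several layers; the usual BP algorithm works for networks of any depth. The final parameter adjustment (fine-tuning) is the same as for the sparse autoencoder model learned earlier. A screenshot of the procedure is shown below:

　　References:

http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

Deep learning: 15 (Self-Taught Learning Exercise)

Deep learning: 17 (Linear Decoders, Convolution and Pooling)

　　This post looks at the linear decoder and at convolution and pooling, two techniques commonly used on large images, following the corresponding sections of http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial.

　　Linear Decoders:

　　Taking a three-layer sparse autoencoder network, the output layer of the sparse autoencoder satisfies the following formula:

　　The formula shows that a3 is the output of the function f. In an ordinary sparse autoencoder f is usually the sigmoid function, whose range is (0,1), so a3 also lies between 0 and 1. We also know that the output layer of the model should match the input as closely as possible, i.e. a3 = x, which implies that x must lie between 0 and 1 as well. The input data therefore has to be scaled into [0,1] first. Some domains satisfy this, for example the MNIST digit recognition of the earlier experiments, but others do not: data after PCA whitening, for instance, does not necessarily lie between 0 and 1. This is where the linear decoder comes in. A linear decoder keeps the sigmoid activation in the hidden layer but uses a linear activation in the output layer, the simplest case being the identity function. The output layer then satisfies the following formula:

　　When computing gradients with the BP algorithm, only the formula for the error term of the output layer changes, to the following: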
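
　　Written out in the tutorial's usual notation (a sketch assuming the squared-error reconstruction cost $\tfrac{1}{2}\lVert y - a^{(3)} \rVert^2$ with target $y = x$), the sigmoid output layer, the linear output layer and the resulting error terms are:

```latex
\begin{aligned}
\text{sigmoid decoder:}\quad & a^{(3)} = f\left(z^{(3)}\right), \qquad z^{(3)} = W^{(2)} a^{(2)} + b^{(2)} \\
\text{linear decoder:}\quad & a^{(3)} = z^{(3)} = W^{(2)} a^{(2)} + b^{(2)} \\
\text{output error term:}\quad & \delta^{(3)} = -\left(y - a^{(3)}\right) \\
\text{hidden error term:}\quad & \delta^{(2)} = \left( \left(W^{(2)}\right)^{\top} \delta^{(3)} \right) \odot f'\left(z^{(2)}\right)
\end{aligned}
```

　　Compared with the sigmoid case, the factor $f'(z^{(3)})$ in $\delta^{(3)}$ disappears because the derivative of the identity function is 1; the hidden-layer term keeps its usual form.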

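　　Convolution:

　　Before looking at convolution, consider why fully connected networks gave way to locally connected networks. In a fully connected network with a large image, say 96*96, and a hidden layer that is to learn 100 features, connecting every input pixel to every hidden node requires learning about 10^6 parameters (96*96*100), which makes the BP algorithm noticeably slower.

　　Hence locally connected networks, in which each hidden node is connected only to a contiguous subset of the input. This mimics the visual cortex, where different locations respond only to local regions. In neural networks, local connectivity is implemented with convolution. Its theoretical basis is that natural images are stationary: the statistics of one part of an image are similar to those of other parts, so features learned on one part also apply elsewhere.

　　Here is a concrete example of how convolution is carried out. Given a dataset of large images Xlarge of size r*c, first sample small patches of size a*b at random from it, then learn from these patches (with a sparse autoencoder, say) using k hidden nodes. Convolving the learned features over a large image then yields this many features: k * (r - a + 1) * (c - b + 1).

　　The convolution windows overlap as they move.

　　Pooling:

　　Convolution greatly reduces the number of network parameters to train. For a 96*96 image and 100 hidden units, using 8*8 patches with the same 100 hidden units cuts the number of weights to 8*8*100 = 6400, which makes feature learning much easier. But a new problem appears: the output vector becomes very large. The fully connected network produces only 100 outputs, whereas the convolved network produces 89*89*100 = 792100, which in turn makes the subsequent classifier hard to design. This is why pooling was introduced.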
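
　　The counts in this example are easy to verify. A small arithmetic sketch (variable names are illustrative), with conv2 used only to confirm the output size of a stride-1 valid convolution:

```matlab
imageDim = 96; patchDim = 8; numFeatures = 100;

fullWeights  = imageDim^2 * numFeatures;    % fully connected: 921600, about 10^6
patchWeights = patchDim^2 * numFeatures;    % 8x8 patches: 6400
convDim      = imageDim - patchDim + 1;     % 89 positions per axis (stride 1)
convOutputs  = convDim^2 * numFeatures;     % 792100 convolved features

% conv2 with 'valid' produces exactly that output size for one feature.
response = conv2(rand(imageDim), rand(patchDim), 'valid');
```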

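　　Why does pooling work? Convolution relies on the stationarity of images, i.e. different parts of an image share the same statistics. The vector computed by convolution at one location is a feature of that local region, and because of stationarity, summary statistics of these feature vectors over a region should be similar for all local blocks of the image. Computing such statistics over the convolution results is called pooling, which is why pooling is effective. Common pooling methods are max pooling and average pooling. Pooling over contiguous regions also makes the learned features translation invariant (the property the tutorial names): if the image shifts slightly, the pooled unit still covers the location where the feature fires.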
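
　　A minimal sketch of mean and max pooling over disjoint regions, using magic(4) as a stand-in for one convolved feature map (the map and the 2x2 pooling size are made up for illustration):

```matlab
convolved = magic(4);      % stand-in for one 4x4 convolved feature map
poolDim   = 2;             % pool over disjoint 2x2 regions
outDim    = size(convolved, 1) / poolDim;

meanPooled = zeros(outDim);
maxPooled  = zeros(outDim);
for r = 1:outDim
    for c = 1:outDim
        rows  = (r - 1) * poolDim + (1:poolDim);
        cols  = (c - 1) * poolDim + (1:poolDim);
        block = convolved(rows, cols);
        meanPooled(r, c) = mean(block(:));   % average pooling
        maxPooled(r, c)  = max(block(:));    % max pooling
    end
end
```

　　The 16 convolved values shrink to 4 pooled values per feature, which is what keeps the classifier input manageable.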

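　　In short, convolution addresses the computational complexity of the preceding unsupervised feature learning, while pooling serves the subsequent supervised classifier by reducing the number of parameters that have to be trained (this reading assumes the common setup in which features are extracted by an unsupervised method and the classifier is trained by a supervised one).

　　References:

http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

Deep learning: 18 (On Random Sampling)

　　I have recently been reading about RBMs in deep learning. RBMs already come with formulas that are hard to follow, and once Gibbs sampling is thrown in they become a real headache, so I decided to look at the theory behind Gibbs sampling first. It turns out that Gibbs sampling is one kind of random sampling, so this post is mainly a simple-level look at random sampling. It draws on the blog post The Basic Idea of Stochastic Simulation and Common Sampling Methods (sampling), the most accessible explanation I found online. When learning material full of mathematical formulas, it helps a great deal when someone explains it in plain language; for beginners this is extremely important. I also consulted some material from http://www.jdl.ac.cn/user/lyqing/StatLearning/StatlLearning_handout.html.

　　Sampling means that we know the probability distribution of a sample x (multi-dimensional in most cases) and want to use it to generate a set of sample points. Some may ask what is hard about that, since tools such as MATLAB have commands for generating samples from various distributions, for example the uniform and normal distributions. That is true, but generating samples even from those distributions is not trivial and takes careful design. Designing a dedicated generator for every possible distribution would be far too laborious, hence random sampling methods, which only need to approximate the theoretical result. This is just one reason for the emergence of random sampling methods and purely my own understanding; there are certainly other factors as well.

　　The common random sampling methods are introduced in the following parts:

　　1. Rejection (accept-reject) sampling

　　This method uses a distribution that is easy to sample from to imitate the distribution we want to sample. It must satisfy some conditions, as follows:

　　The sampling procedure is as follows:

　　The geometric interpretation is as follows:

　　From the explanation above: given a sample x, a value y is drawn at random under the envelope Mq(x); if y falls between the two curves the sample is rejected, otherwise it is accepted. This is easy to understand, so I will skip the theoretical derivations, which are rather dry.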
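
　　A minimal sketch of the accept-reject rule, under assumed choices that are not from the original post: the target density is p(x) = 2x on [0,1], the proposal q is uniform on [0,1], and M = 2 so that M*q(x) >= p(x) everywhere:

```matlab
% Target density p(x) = 2x on [0,1]; proposal q(x) = 1 (uniform); M = 2.
p = @(x) 2 * x;
q = @(x) ones(size(x));
M = 2;

numProposals = 100000;
x = rand(numProposals, 1);                 % candidates drawn from q
y = rand(numProposals, 1) .* M .* q(x);    % uniform height under the envelope M*q(x)

accepted = x(y <= p(x));                   % keep candidates that fall under p(x)

acceptRate = numel(accepted) / numProposals;   % close to 1/M in expectation
sampleMean = mean(accepted);                   % close to the true mean 2/3
```

　　The tighter the envelope M*q(x) hugs p(x), the fewer candidates are wasted.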

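　　2. Importance sampling

　　My understanding of importance sampling is that its purpose is not to generate samples but to evaluate a definite integral; the integral is obtained by random sampling from another distribution that is easy to sample from (my study of this is shallow, so for now this is how I understand it). As shown in the figure below:

　　Here a large number of samples x are drawn from q(x), the mean of f(x)*w(x) over them is computed, and this gives the value of the integral I. w(x) = p(x)/q(x) is the importance weight: it corrects for the fact that the samples come from q(x) rather than from p(x), so a region that q visits too often relative to p is weighted down, and one that it visits too rarely is weighted up.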
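
　　A minimal sketch of the estimator, under assumed choices that are not from the original post: the integral is I = E_p[x^2] with p the standard normal density (so the exact value is 1), and the proposal q is a normal density with standard deviation 2:

```matlab
f = @(x) x .^ 2;                                      % integrand
p = @(x) exp(-x .^ 2 / 2) / sqrt(2 * pi);             % target density: N(0, 1)
sigmaQ = 2;
q = @(x) exp(-x .^ 2 / (2 * sigmaQ^2)) / (sigmaQ * sqrt(2 * pi));   % proposal: N(0, 4)

numSamples = 100000;
x = sigmaQ * randn(numSamples, 1);    % draw from q, not from p
w = p(x) ./ q(x);                     % importance weights
estimate = mean(f(x) .* w);           % approximates E_p[x^2] = 1
```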

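　　3. Metropolis-Hastings

　　This method uses a proposal distribution to update the sample with a certain probability, somewhat like rejection sampling. The procedure is as follows: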
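
　　A minimal random-walk sketch, under assumed choices that are not from the original post: the target is an unnormalized standard normal density and the proposal is a symmetric Gaussian step, for which the acceptance probability reduces to min(1, p(candidate)/p(current)):

```matlab
logTarget = @(x) -0.5 * x .^ 2;   % log of an unnormalized N(0,1) density
numSteps  = 50000;
stepSize  = 1.0;                  % std of the Gaussian random-walk proposal

samples = zeros(numSteps, 1);
current = 0;
for t = 1:numSteps
    candidate = current + stepSize * randn();
    % Symmetric proposal, so the ratio is p(candidate)/p(current).
    if log(rand()) < logTarget(candidate) - logTarget(current)
        current = candidate;      % accept the move
    end
    samples(t) = current;         % on rejection the old state is repeated
end

kept = samples(1001:end);         % drop a burn-in prefix
```

　　Unlike rejection sampling, a rejected proposal still produces a sample: the chain stays where it is.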

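　　4. Gibbs sampling

　　Gibbs sampling requires knowing the conditional probability of each attribute of a sample given all the other attributes; it then uses these conditionals to generate the value of each attribute in turn. The procedure is as follows: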
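
　　A minimal sketch for a two-attribute sample, under an assumed target that is not from the original post: a zero-mean, unit-variance bivariate normal with correlation rho, whose conditionals are x|y ~ N(rho*y, 1-rho^2) and y|x ~ N(rho*x, 1-rho^2):

```matlab
rho = 0.5;                        % correlation of the target bivariate normal
condStd = sqrt(1 - rho^2);        % std of x|y and of y|x
numSteps = 20000;

samples = zeros(numSteps, 2);
x = 0; y = 0;
for t = 1:numSteps
    x = rho * y + condStd * randn();   % draw x from p(x | y)
    y = rho * x + condStd * randn();   % draw y from p(y | x)
    samples(t, :) = [x y];
end

kept = samples(501:end, :);       % drop a burn-in prefix
c = corrcoef(kept);               % 2x2 correlation matrix of the two columns
estimatedRho = c(1, 2);
```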

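　　References:

The Basic Idea of Stochastic Simulation and Common Sampling Methods (sampling)

http://www.jdl.ac.cn/user/lyqing/StatLearning/StatlLearning_handout.html

Deep learning: 19 (A Simple Understanding of RBM)

　　This post gives a brief introduction to the RBM. DBN, an important network structure in deep learning, can be built by stacking RBMs, so understanding RBMs helps in understanding the DBN algorithm and deep learning algorithms in general. Deep learning took off in 2006 thanks to Hinton's papers, but those papers are rather obscure and full of formulas, and reading them is a heavy load for a novice like me. Most papers introducing RBMs mention an energy function, so it is worth understanding that concept first. The page http://202.197.191.225:8080/30/text/chapter06/6_2t24.htm introduces the energy function as follows:

　　Every system has a corresponding stable state. A ball in a bowl rests at the bottom; even if a disturbance moves it away, it returns to the bottom once the disturbance stops. Anyone who has studied physics knows that the stable state is the state of lowest potential energy, so a stable state corresponds to the minimum of some energy. Hopfield carried this idea over to Hopfield networks and constructed a definition of an energy function, which was one of his major contributions. Introducing the energy function deepens our understanding of this class of dynamical systems and turns the search for a stable state into the search for an extremum, i.e. an optimization problem, which in turn gives Hopfield networks an application in solving optimization problems.

　　Now let us look at the RBM, whose structure is shown below:

　　The RBM has two layers. The first is the visible layer, generally the input layer; the other is the hidden layer, which is what we usually call the feature extraction layer. Most articles treat the nodes of both layers as binary, taking only the values 0 or 1. RBM nodes can of course take real values; binary values are used here only to make the formulas easier to explain. As the earlier posts showed, once a network structure is designed, the next task is to solve for its parameters, usually by minimizing a loss function. In an autoencoder the loss is the error between the reconstruction and the input (usually with regularization on the parameters); in logistic regression the loss depends on the difference between the output and the sample label. So what is the loss function of an RBM, and how is its partial derivative computed?

　　Before answering, we start from the energy function. For the RBM model, the energy between the input vector v and the hidden-layer output vector h is:

　　The joint probability of the two is:

　　where Z is the normalization factor, with value:

　　Following convention, rename the input v as the function's argument x; the probability distribution of x is then:

　　Define an intermediate quantity F(x) as:

　　The probability distribution of x can then be rewritten as:

　　Its partial derivative, negated, is then: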
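
　　For reference, the chain of definitions above can be written as follows in the notation of the deeplearning.net RBM tutorial listed in the references, with $b$ and $c$ the visible and hidden biases, $W$ the weight matrix and $\theta$ any parameter (the symbol names follow that tutorial and are an assumption about what the missing formula images showed):

```latex
\begin{aligned}
E(v,h) &= -b^{\top}v - c^{\top}h - h^{\top}Wv \\
p(v,h) &= \frac{e^{-E(v,h)}}{Z}, \qquad Z = \sum_{v,h} e^{-E(v,h)} \\
P(x) &= \sum_{h} \frac{e^{-E(x,h)}}{Z} \\
F(x) &= -\log \sum_{h} e^{-E(x,h)} \\
P(x) &= \frac{e^{-F(x)}}{Z}, \qquad Z = \sum_{x} e^{-F(x)} \\
-\frac{\partial \log P(x)}{\partial \theta} &= \frac{\partial F(x)}{\partial \theta} - \sum_{\tilde{x}} P(\tilde{x}) \frac{\partial F(\tilde{x})}{\partial \theta}
\end{aligned}
```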

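　　From the abstract description of the energy function above, a system (here the RBM) reaches a stable state when its energy is minimal. The formulas above show that making the energy E small means making F(x) small, which means making P(x) large. The loss can therefore be taken as -log P(x), the negative log-likelihood, which is why the derivative above carries a minus sign.

　　In the RBM shown in the figure, the following probabilities are easy to obtain:

　　F(v) (that is, F(x)) is then:

　　This function is also called the free energy. After a series of theoretical derivations, the partial derivative of the loss function is:

　　Evidently -log P(v) is treated as the loss function here. Articles on RBMs almost always bring up Gibbs sampling; for a brief introduction see the previous post, Deep learning: 18 (On Random Sampling). But why use random sampling to obtain data when we already have training samples? I have not fully figured this out myself. After reading some simple RBM code, my current understanding is this: the partial-derivative formula above is a difference of two terms. According to most papers, the minuend is the free-energy gradient evaluated on the input samples, and the subtrahend is the expectation of the free-energy gradient over samples generated by the model. Those model samples are obtained by Gibbs sampling: roughly, feed the original data v into the network, compute the hidden output h(1), infer v(1) back from it, then compute h(2), and so on for k steps; the final v(k) is taken as the model sample. In practice the chain is often stopped after only a few steps, even a single one, as in contrastive divergence.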
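
　　One such Gibbs step, and the weight-gradient estimate that contrastive divergence with a single step (CD-1) builds from it, can be sketched as follows. This is a toy binary RBM with random placeholder weights and a made-up input vector, not a trained model; all names and sizes are illustrative:

```matlab
numVisible = 6; numHidden = 3;
W = 0.1 * randn(numHidden, numVisible);   % placeholder weights, not trained
b = zeros(numVisible, 1);                 % visible biases
c = zeros(numHidden, 1);                  % hidden biases
sigm = @(z) 1 ./ (1 + exp(-z));

v0 = double(rand(numVisible, 1) > 0.5);   % a made-up binary training vector

% One Gibbs step: v0 -> h0 -> v1 -> h1.
ph0 = sigm(W * v0 + c);                       % P(h = 1 | v0)
h0  = double(rand(numHidden, 1) < ph0);       % sample hidden states
pv1 = sigm(W' * h0 + b);                      % P(v = 1 | h0)
v1  = double(rand(numVisible, 1) < pv1);      % sample the reconstruction
ph1 = sigm(W * v1 + c);                       % P(h = 1 | v1)

% CD-1 estimate of the weight gradient: data term minus model term.
dW = ph0 * v0' - ph1 * v1';
```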

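　　The blog post A Brief Discussion of the Basic Ideas and Methods of Deep Learning offers another way to understand this. Suppose we have a bipartite graph with no links between nodes of the same layer; one layer is the visible layer, i.e. the input data layer (v), and the other is the hidden layer (h). If all nodes are binary variables (taking only the values 0 or 1) and the joint distribution p(v, h) follows a Boltzmann distribution, the model is called a Restricted Boltzmann Machine (RBM). Why is it a deep learning method? Because the model is a bipartite graph, the hidden nodes are conditionally independent given v, i.e. p(h|v) = p(h1|v)...p(hn|v). Likewise, the visible nodes are conditionally independent given the hidden layer h. Since all of v and h follow the Boltzmann distribution, given an input v we can obtain the hidden layer h through p(h|v), and from h we can obtain a visible layer again through p(v|h). The parameters are adjusted so that the visible layer v1 reconstructed from the hidden layer matches the original visible layer v; when it does, the hidden layer is an alternative representation of the visible layer and can serve as features of the input data. That is why it is a deep learning method.

　　References:

http://202.197.191.225:8080/30/text/chapter06/6_2t24.htm

http://deeplearning.net/tutorial/rbm.html

http://edchedch.wordpress.com/2011/07/18/introduction-to-restricted-boltzmann-machines/

Deep learning: 18 (On Random Sampling)

A Brief Discussion of the Basic Ideas and Methods of Deep Learning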

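Deep learning: 20 (An Analysis of Single-Layer Networks in Unsupervised Feature Learning)

　　This post is an analysis written after reading the paper An Analysis of Single-Layer Networks in Unsupervised Feature Learning by Ng's group. The paper studies network structures with a single hidden layer and compares four of them: k-means, sparse autoencoder, sparse RBM and GMM. The authors reach the following conclusions: 1. The number of hidden nodes, the sampling density (the stride used in convolution) and the receptive field size have a large effect on the quality of the extracted features, even larger than the number of layers or the deep learning algorithm itself. 2. Whitening is well worth doing in preprocessing. 3. Of the four algorithms tested, k-means surprisingly performs best. Their final advice is therefore to whiten the data during preprocessing, to train more features per layer, and to sample the data more densely.

　　NORB:

　　See http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/index.html. The dataset consists of images of toys from 5 categories: four-legged animals, airplanes, trucks, human figures and cars. Each category contains several different toys (10 per category, 50 toys in total). The toys were photographed with 2 cameras at 9 elevations and 18 azimuths. Some sample images are shown below:

　　CIFAR-10:

　　See http://www.cs.toronto.edu/~kriz/cifar.html. This dataset is also for image recognition and has 10 classes (airplane, bird and so on). Each class has 6000 images, of which 5000 are used for training and 1000 for testing. The images are 32*32. Some sample images are shown below:

　　In deep learning the biggest drawback is usually the large number of parameters that need tuning, such as the learning rate, the sparsity penalty coefficient, the weight decay coefficient and momentum (which seems to be needed for RBMs). These have to be chosen by cross-validation; such structures are slow to train to begin with, and cross-validating this many parameters takes even longer. This makes the paper's conclusion attractive: k-means works that well and has none of these parameters to worry about.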

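　　A brief introduction to the four algorithms follows:

　　Sparse autoencoder:

　　Its network function is as follows:

　　Sparse RBM:

　　Its function has a form similar to that of the sparse autoencoder; only the idea behind solving for the parameters differs. A sparse RBM is mainly optimized with the CD (contrastive divergence) algorithm, whereas a sparse autoencoder is mainly optimized with the BP algorithm.

　　K-means clustering:

　　With hard k-means, the feature mapping is as follows:

　　where c(j) are the cluster centroids found by clustering.

　　With soft k-means, the expression is as follows:

　　where Zk is computed as follows:

　　μ(z) is the mean of the elements of z.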
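
　　Both k-means encodings can be sketched in a few lines. The centroids and the input below are made up for illustration; the soft variant follows the "triangle" form f_k(x) = max(0, μ(z) - z_k) with z_k = ||x - c(k)||, as described in the paper:

```matlab
centroids = [0 0; 3 0; 0 4]';     % made-up centroids, one per column (2 x 3)
x = [1; 0];                       % one input vector

% Distance from x to every centroid: z_k = ||x - c^(k)||_2.
z = sqrt(sum(bsxfun(@minus, centroids, x) .^ 2, 1));

% Hard assignment: 1 for the nearest centroid, 0 elsewhere.
[~, nearest] = min(z);
hardFeatures = zeros(size(z));
hardFeatures(nearest) = 1;

% Soft ("triangle") assignment: f_k = max(0, mean(z) - z_k).
softFeatures = max(0, mean(z) - z);
```

　　The soft encoding zeroes out centroids that are farther than average, so it stays sparse while keeping more information than the one-hot hard assignment.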

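　　GMM:

　　Its objective function is as follows:

　　The classifier is an SVM in all cases.

　　Once the parameters of the feature extraction network have been trained, features can be extracted from an input image, as illustrated below:

　　Experimental results:

　　First, the image features learned with and without whitening in the four cases are shown below:

　　With whitening more detail is learned, and all of the algorithms then learn something resembling Gabor filters, so these properties are not exclusive to deep learning architectures.

　　The curves below show that the more hidden nodes there are, the higher the final recognition rate, and that soft k-means performs best.

　　The curve below shows that the smaller the stride, the better the result. The authors nevertheless suggest setting the stride to more than 1, because too small a value increases the computation. In sparse coding, for example, every convolution of a small patch of a test image requires solving an optimization problem to obtain the output (unlike autoencoders, RBMs and similar deep learning algorithms), so the cost becomes very high. As the stride grows, however, the recognition rate drops significantly.

　　The figure below shows that a receptive field size of 6 gives the best result. The authors do not consider this conclusive, though: enlarging this parameter means more training samples are needed before its benefit can show. Either way, a fairly small receptive field can already learn good features.

　　References:

　　An Analysis of Single-Layer Networks in Unsupervised Feature Learning, Adam Coates, Honglak Lee, and Andrew Y. Ng. In AISTATS 14, 2011.

http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/index.html

http://www.cs.toronto.edu/~kriz/cifar.html

Deep learning: 21 (The Role of Random Initialization in Unsupervised Feature Learning)

　　This is another interesting paper from Ng's group. The paper discussed in the previous post, Deep learning: 20 (An Analysis of Single-Layer Networks in Unsupervised Feature Learning), concluded that parameters such as the number of hidden nodes, the convolution size and the stride matter more than the number of layers or the learning algorithm itself: even a single-layer network with enough hidden nodes and a small convolution size and stride can, with a simple algorithm such as k-means, match the best results of more complex deep learning algorithms. The present paper, On random weights and unsupervised feature learning, puts forward a further point: there is no need for complex, time-consuming deep learning algorithms to train the network parameters at all. Simply assigning the network a random set of parameters yields features that are no worse than those obtained by pretraining and careful fine-tuning, and it saves a great deal of training time.

　　These two conclusions are bound to raise doubts. So many people study deep learning, have proposed so many algorithms and built all kinds of deep architectures, and now it turns out that a single-layer network without any deep learning algorithm comes close to, or even beats, the best deep learning results. Is deep learning still worth studying? And a single-layer network hardly deserves the name deep learning; we might as well go back to calling it neural network learning. A novice like me cannot answer such questions, so I will just wait and see.

　　The paper mainly answers two questions: 1. Why does random initialization sometimes perform so well? 2. If unsupervised learning is used for pretraining and supervised learning for fine-tuning, what do these steps contribute?

　　On the first question, the authors argue that random weights can work well because, once the architecture is fixed, the network itself is already selective with respect to its input, for example frequency selective and translation invariant. The formula is as follows:

　　Thus the frequency of the optimal input is the frequency at which the filter f has its largest magnitude, which is why the network is frequency selective; the phase term that follows is not fixed, which shows that the network is also translation invariant. (I did not fully understand this formula; the proof is in the paper's appendix.) The figure below shows randomly chosen network weights and the corresponding optimal input:

　　Here circular convolution means that the convolution may run past the image border (the filter wraps around), whereas valid convolution must stay entirely inside the image. See the illustration below: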
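
　　The difference between the two can be checked numerically. In this sketch (toy random image X and filter K, both made up), valid convolution uses conv2 and circular convolution is computed through the FFT, which wraps around the border by construction:

```matlab
n = 8; k = 3;
X = rand(n);            % toy image
K = rand(k);            % toy filter

validResp = conv2(X, K, 'valid');                      % (n-k+1) x (n-k+1)
circResp  = real(ifft2(fft2(X) .* fft2(K, n, n)));     % n x n, wraps around the border

% Away from the wrapped border the two responses coincide.
gap = max(max(abs(circResp(k:n, k:n) - validResp)));
```

　　Circular convolution thus keeps the output the same size as the image, at the price of mixing pixels from opposite borders in the first k-1 rows and columns.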

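　　The authors compare classification accuracy without and with convolution, as shown below:

　　The difference between not using and using convolution is that the former uses different network parameters at each location, whereas the latter uses the same parameters everywhere. The figure above also shows that using convolution gives better results.

　　Next is the authors' answer to the second question. First look at the figure below:

　　The figure shows that pretrained parameters classify better than randomly initialized ones; the test datasets are NORB and CIFAR. The authors do not seem to give a specific explanation of what pretraining contributes. They only offer a suggestion: rather than spending time on the training method, choose a better network architecture.

　　Finally, the authors show how to use random weights to choose the network architecture, since this saves a lot of time, as shown in the table below:

　　References:

　　On random weights and unsupervised feature learning, Saxe, A., Koh, P.W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. In ICML, 2011.

Deep learning: 20 (An Analysis of Single-Layer Networks in Unsupervised Feature Learning)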

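Deep learning: 22 (Linear Decoder Exercise)

　　Preface:

　　This exercise practices the linear decoder. For background see Deep learning: 17 (Linear Decoders, Convolution and Pooling); the steps follow Exercise: Learning color features with Sparse Autoencoders. The experiment uses a sparse autoencoder with a linear decoder to learn patch features from images of the STL-10 dataset, and this time the weights are trained on RGB image patches.

　　Background:

　　PCA whitening and ZCA whitening both decorrelate the data and give every dimension (approximately) unit variance; with a regularization term epsilon the variances come out slightly below 1. They differ in the coordinate system and in typical use: PCA whitening leaves the data in the rotated PCA basis and is mainly used together with dimensionality reduction, while ZCA whitening rotates the whitened data back to the original basis, so it is mainly used for decorrelation while staying as close as possible to the original data.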
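
　　The relation between the two whitening transforms can be checked on toy data. This sketch uses made-up correlated samples and sets epsilon to 0 so that the exact property is visible; with epsilon > 0, as in the exercise, the covariances are only approximately the identity:

```matlab
numSamples = 1000;
A = [2 0 0; 1 1 0; 0 1 3];            % mixing matrix that correlates the dimensions
x = A * randn(3, numSamples);
x = bsxfun(@minus, x, mean(x, 2));    % zero-mean each dimension

sigma = x * x' / numSamples;
[u, s, ~] = svd(sigma);
epsilon = 0;                          % no regularization, to expose the exact property

PCAWhite = diag(1 ./ sqrt(diag(s) + epsilon)) * u';
ZCAWhite = u * PCAWhite;              % rotate back to the original basis

covPCA = (PCAWhite * x) * (PCAWhite * x)' / numSamples;
covZCA = (ZCAWhite * x) * (ZCAWhite * x)' / numSamples;
```

　　Both covariance matrices come out as the identity, and ZCAWhite is symmetric, which is why the ZCA-whitened data stays visually close to the original patches.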

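　　Some MATLAB notes:

　　The benefit of a function handle is that a function can be passed as an argument into another function, which can then use it in whatever computation it needs to produce the final result, for example numerical differentiation or integration. If only the values computed from the function were passed in instead, a lot of code would be needed before every call, which is cumbersome; with a function handle that code lives inside the called function and does not have to be repeated outside each time.

　　MATLAB can save data with the save function in .mat format. In MATLAB's current folder panel the file appears with the .mat extension, but in Windows Explorer the extension may be hidden and the file type shown as Microsoft Access Table Shortcut, because Windows also associates .mat with that file type.

　　Some notes on the experiment:

　　In Ng's tutorial and exercises the input sample matrix stores one sample per column, so the number of columns equals the number of samples.

　　A 64 x 100000 matrix is no problem for MATLAB.

　　In this experiment ZCA whitening is applied to the patches, and the mean is removed per dimension (this kind of mean removal seems sound to me. An earlier post removed the mean of each individual patch instead, which felt much less sound; that was done on natural images, where every dimension has the same statistics, so it is admissible there, but I still find it questionable). Because ZCA whitening is used, the new vectors are not reduced in dimension; they are only decorrelated and given equal variance in every dimension. It also follows that whitening need not be applied to the original large images: whiten whatever data you feed into the network for training. Here the network is trained on small patches, so the small patches are what should be whitened.

　　The data and variables of this experiment are set up as follows:

　　The training matrix has size 192*100000. Each input patch is 8*8, so the input layer has 192 nodes (= 8*8*3, for the 3 channels; each column is arranged in r, g, b order). The hidden layer has 400 nodes, the weight decay coefficient is 0.003, the sparsity penalty coefficient is 5, and the sparsity target is that 3.5% of the hidden nodes are active. In ZCA whitening, 0.1 is added to the eigenvalues in the denominator to avoid large values.

　　Because a linear decoder is used, the activation function of the output layer is the identity (its derivative is 1), i.e. the layer's output equals its input, which slightly reduces the computation.

　　At the end the program displays the learned weights. What is displayed already includes the whitening step, i.e. the combination of whitening and the sparse autoencoder. The display call is displayColorNetwork( (W*ZCAWhite)');

　　Why (W*ZCAWhite)'? First, W*ZCAWhite is used because for every sample x fed into the network the response is equivalent to W*ZCAWhite*x. Second, each row of W*ZCAWhite is the transform of one hidden node, whereas displayColorNetwork shows each column as one small image patch, so a transpose is needed.

　　Experimental results:

　　Screenshot of the original patches:

　　Screenshot after ZCA whitening:

　　The 400 learned features are displayed below:

　　Main code of the experiment: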

%% CS294A/CS294W Linear Decoder Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  linear decoder exercise. For this exercise, you will only need to modify
%  the code in sparseAutoencoderLinearCost.m. You will not need to modify
%  any code in this file.

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

imageChannels = 3;     % number of channels (rgb, so 3)

patchDim   = 8;          % patch dimension
numPatches = 100000;   % number of patches

visibleSize = patchDim * patchDim * imageChannels;  % number of input units 
outputSize  = visibleSize;   % number of output units
hiddenSize  = 400;           % number of hidden units (more than the input units here)

sparsityParam = 0.035; % desired average activation of the hidden units.
lambda = 3e-3;         % weight decay parameter       
beta = 5;              % weight of sparsity penalty term       

epsilon = 0.1;           % epsilon for ZCA whitening

%%======================================================================
%% STEP 1: Create and modify sparseAutoencoderLinearCost.m to use a linear decoder,
%          and check gradients
%  You should copy sparseAutoencoderCost.m from your earlier exercise 
%  and rename it to sparseAutoencoderLinearCost.m. 
%  Then you need to rename the function from sparseAutoencoderCost to
%  sparseAutoencoderLinearCost, and modify it so that the sparse autoencoder
%  uses a linear decoder instead. Once that is done, you should check 
% your gradients to verify that they are correct.

% NOTE: Modify sparseAutoencoderCost first!

% To speed up gradient checking, we will use a reduced network and some
% dummy patches

debugHiddenSize = 5;
debugvisibleSize = 8;
patches = rand([8 10]); % 10 random samples, each an 8-dimensional column vector with entries in 0~1
theta = initializeParameters(debugHiddenSize, debugvisibleSize); 

[cost, grad] = sparseAutoencoderLinearCost(theta, debugvisibleSize, debugHiddenSize, ...
                                           lambda, sparsityParam, beta, ...
                                           patches);

% Check gradients
numGrad = computeNumericalGradient( @(x) sparseAutoencoderLinearCost(x, debugvisibleSize, debugHiddenSize, ...
                                                  lambda, sparsityParam, beta, ...
                                                  patches), theta);

% Use this to visually compare the gradients side by side
disp([numGrad grad]); 

diff = norm(numGrad-grad)/norm(numGrad+grad);
% Should be small. In our implementation, these values are usually less than 1e-9.
disp(diff); 

assert(diff < 1e-9, 'Difference too large. Check your gradient computation again');

% NOTE: Once your gradients check out, you should run step 0 again to
%       reinitialize the parameters
%}

%%======================================================================
%% STEP 2: Learn features on small patches
%  In this step, you will use your sparse autoencoder (which now uses a 
%  linear decoder) to learn features on small patches sampled from related
%  images.

%% STEP 2a: Load patches
%  In this step, we load 100k patches sampled from the STL10 dataset and
%  visualize them. Note that these patches have been scaled to [0,1]

load stlSampledPatches.mat

displayColorNetwork(patches(:, 1:100));

%% STEP 2b: Apply preprocessing
%  In this sub-step, we preprocess the sampled patches, in particular, 
%  ZCA whitening them. 
% 
%  In a later exercise on convolution and pooling, you will need to replicate 
%  exactly the preprocessing steps you apply to these patches before 
%  using the autoencoder to learn features on them. Hence, we will save the
%  ZCA whitening and mean image matrices together with the learned features
%  later on.

% Subtract mean patch (hence zeroing the mean of the patches)
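meanPatch = mean(patches, 2);  % note: this subtracts the per-dimension mean (taken across patches); why does this differ from the earlier exercises?
patches = bsxfun(@minus, patches, meanPatch); % zero-center every dimension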

% Apply ZCA whitening
sigma = patches * patches' / numPatches;
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u'; % compute the ZCA whitening matrix
patches = ZCAWhite * patches;
figure
displayColorNetwork(patches(:, 1:100));

%% STEP 2c: Learn features
%  You will now use your sparse autoencoder (with linear decoder) to learn
%  features on the preprocessed patches. This should take around 45 minutes.

theta = initializeParameters(hiddenSize, visibleSize);

% Use minFunc to minimize the function
addpath minFunc/

options = struct;
options.Method = 'lbfgs'; 
options.maxIter = 400;
options.display = 'on';

[optTheta, cost] = minFunc( @(p) sparseAutoencoderLinearCost(p, ...
                                   visibleSize, hiddenSize, ...
                                   lambda, sparsityParam, ...
                                   beta, patches), ...
                              theta, options); % note its arguments

% Save the learned features and the preprocessing matrices for use in 
% the later exercise on convolution and pooling
fprintf('Saving learned features and preprocessing matrices...\n');                          
save('STL10Features.mat', 'optTheta', 'ZCAWhite', 'meanPatch');
fprintf('Saved\n');

%% STEP 2d: Visualize learned features

W = reshape(optTheta(1:visibleSize * hiddenSize), hiddenSize, visibleSize);
b = optTheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
figure;
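% Why use (W*ZCAWhite)' here? First, W*ZCAWhite is used because feeding a raw sample x
% into the network gives a hidden-layer input equivalent to W*ZCAWhite*x (up to the bias).
% Second, each row of W*ZCAWhite holds the transform of one hidden unit, whereas
% displayColorNetwork shows each column as one small image patch, so a transpose is needed.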
displayColorNetwork( (W*ZCAWhite)');
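
　　The ZCA whitening in STEP 2b can be sanity-checked on synthetic data. The sketch below is not part of the exercise: the toy sizes, the mixing matrix `A`, the tiny `epsilon`, and the names `x`, `xWhite`, `covWhite` are all assumptions made for illustration. With a small `epsilon`, the covariance of the whitened data should be close to the identity.

```matlab
% Minimal sketch on assumed toy data (not the STL-10 patches).
n = 6; m = 5000;                      % dimension and number of samples (assumed)
A = eye(n) + 0.5 * triu(ones(n), 1);  % fixed mixing matrix that correlates the features
x = A * randn(n, m);
epsilon = 1e-10;                      % tiny regularizer, chosen for this check only

meanX = mean(x, 2);                   % per-dimension mean across samples
x = bsxfun(@minus, x, meanX);         % zero-center every dimension

sigma = x * x' / m;                   % covariance matrix of the centered data
[u, s, ~] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u';
xWhite = ZCAWhite * x;

covWhite = xWhite * xWhite' / m;
disp(max(max(abs(covWhite - eye(n)))));   % should be close to 0
```

　　With the exercise's larger `epsilon = 0.1` the whitened covariance is only approximately the identity, since small eigenvalues are damped rather than fully rescaled.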

sparseAutoencoderLinearCost.m:

function [cost,grad] = sparseAutoencoderLinearCost(theta, visibleSize, hiddenSize, ...
                                                            lambda, sparsityParam, beta, data)
% -------------------- YOUR CODE HERE --------------------
% Instructions:
%   Copy sparseAutoencoderCost in sparseAutoencoderCost.m from your
%   earlier exercise onto this file, renaming the function to
%   sparseAutoencoderLinearCost, and changing the autoencoder to use a
%   linear decoder.
% -------------------- YOUR CODE HERE --------------------                                    
% The input theta is a vector because minFunc only deal with vectors. In
% this step, we will convert theta to matrix format such that they follow
% the notation in the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Loss and gradient variables (your code needs to compute these values)
m = size(data, 2); % number of samples

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the loss for the Sparse Autoencoder and gradients
%                W1grad, W2grad, b1grad, b2grad
%
%  Hint: 1) data(:,i) is the i-th example
%        2) your computation of loss and gradients should match the size
%        above for loss, W1grad, W2grad, b1grad, b2grad

% z2 = W1 * x + b1
% a2 = f(z2)
% z3 = W2 * a2 + b2
% h_Wb = a3 = z3   (linear decoder: identity activation on the output layer)

z2 = W1 * data + repmat(b1, [1, m]);
a2 = sigmoid(z2);
z3 = W2 * a2 + repmat(b2, [1, m]);
a3 = z3;

rhohats = mean(a2,2);
rho = sparsityParam;
KLsum = sum(rho * log(rho ./ rhohats) + (1-rho) * log((1-rho) ./ (1-rhohats)));


squares = (a3 - data).^2;
squared_err_J = (1/2) * (1/m) * sum(squares(:));
weight_decay_J = (lambda/2) * (sum(W1(:).^2) + sum(W2(:).^2));
sparsity_J = beta * KLsum;

cost = squared_err_J + weight_decay_J + sparsity_J; % total cost

% In general, delta3 = -(data - a3) .* fprime(z3);
% with a linear decoder a3 = z3, so fprime(z3) = 1 and the factor drops out.
delta3 = -(data - a3);
beta_term = beta * (- rho ./ rhohats + (1-rho) ./ (1-rhohats));
delta2 = ((W2' * delta3) + repmat(beta_term, [1,m]) ) .* a2 .* (1-a2);

W2grad = (1/m) * delta3 * a2' + lambda * W2;
b2grad = (1/m) * sum(delta3, 2);
W1grad = (1/m) * delta2 * data' + lambda * W1;
b1grad = (1/m) * sum(delta2, 2);

%-------------------------------------------------------------------
% Convert weights and bias gradients to a compressed form
% This step will concatenate and flatten all your gradients to a vector
% which can be used in the optimization method.
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end
%-------------------------------------------------------------------
% We are giving you the sigmoid function, you may find this function
% useful in your computation of the loss and the gradients.
function sigm = sigmoid(x)

    sigm = 1 ./ (1 + exp(-x));
end
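
　　The only change a linear decoder makes to backpropagation is at the output layer, where the derivative factor is 1. The sketch below checks this numerically for the output bias. It is a simplified illustration under stated assumptions: toy sizes, random parameters, and only the squared-error term (no weight decay, no sparsity penalty). The names `costOfB2`, `numGrad` and `relDiff` are hypothetical.

```matlab
% Minimal sketch (toy sizes are assumptions): squared-error term only.
visibleSize = 4; hiddenSize = 3; m = 7;
data = randn(visibleSize, m);
W1 = 0.1 * randn(hiddenSize, visibleSize);  b1 = zeros(hiddenSize, 1);
W2 = 0.1 * randn(visibleSize, hiddenSize);  b2 = zeros(visibleSize, 1);

sigmoid = @(z) 1 ./ (1 + exp(-z));
a2 = sigmoid(bsxfun(@plus, W1 * data, b1));

% Cost as a function of the output bias only (everything else held fixed).
costOfB2 = @(b) sum(sum((bsxfun(@plus, W2 * a2, b) - data).^2)) / (2 * m);

% Analytic gradient: with a linear decoder a3 = z3, so f'(z3) = 1.
a3 = bsxfun(@plus, W2 * a2, b2);
delta3 = a3 - data;
b2grad = sum(delta3, 2) / m;

% Central-difference numerical gradient.
h = 1e-4;
numGrad = zeros(size(b2));
for i = 1:numel(b2)
    e = zeros(size(b2));
    e(i) = h;
    numGrad(i) = (costOfB2(b2 + e) - costOfB2(b2 - e)) / (2 * h);
end
relDiff = norm(numGrad - b2grad) / norm(numGrad + b2grad);
disp(relDiff);   % expected to be very small
```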

　　References:

Deep learning: 17 (Linear Decoders, Convolution and Pooling)

Exercise: Implement deep networks for digit classification

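Deep learning: 23 (Convolution and Pooling exercise)

　　Preface:

　　This experiment practices convolution and pooling. The goal is a deeper understanding of how to convolve large images to obtain the output of each learned feature, and then pool those outputs so that the resulting features gain properties such as translation invariance. The experiment follows the Stanford web tutorial: Exercise:Convolution and Pooling. See also the earlier post Deep learning: 17 (Linear Decoders, Convolution and Pooling). This experiment builds on the feature-extraction network learned in the earlier post Deep learning: 22 (linear decoder exercise).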

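　　Experiment basics:

　　First, an outline of the overall training and testing flow. This post makes it clearer that, in the training stage, whitening is applied to the small patches. Because the input data are large images, every convolution step has to include both the whitening and the network weight computation. Each learned hidden-unit feature therefore yields, for every image, a slightly smaller feature map, which is then mean-pooled (before this, the program contains some code that tests the correctness of the convolution and pooling code). With these pooled feature values and the labels, a softmax multi-class classifier can be trained.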

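　　In the testing stage the large images are convolved in the same way. Each image block being convolved must also be preprocessed with the whitening parameters obtained during training, and features are extracted through convolution and pooling exactly as in training. The trained softmax classifier is then used for prediction.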
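
　　Because whitening and mean subtraction are both linear, they can be folded into the hidden-layer weights once instead of being applied to every image block. The sketch below illustrates the identity with random stand-ins for the learned parameters; the toy sizes and the names `direct`, `folded` and `bMean` are assumptions for illustration.

```matlab
% Minimal sketch (toy sizes; random stand-ins for the learned parameters).
visibleSize = 12; hiddenSize = 5;
W = randn(hiddenSize, visibleSize);
b = randn(hiddenSize, 1);
ZCAWhite = randn(visibleSize, visibleSize);   % stand-in for the whitening matrix
meanPatch = rand(visibleSize, 1);             % stand-in for the training mean
x = rand(visibleSize, 1);                     % one raw (unpreprocessed) patch

sigmoid = @(z) 1 ./ (1 + exp(-z));

% Preprocess explicitly, then apply the hidden layer.
direct = sigmoid(W * (ZCAWhite * (x - meanPatch)) + b);

% Fold the preprocessing into the weights and bias once, then reuse them.
WT = W * ZCAWhite;
bMean = b - WT * meanPatch;
folded = sigmoid(WT * x + bMean);

disp(max(abs(direct - folded)));   % differs only by floating-point rounding
```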

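　　Training the parameters of the feature-extraction network takes a long time, whereas training a classifier such as softmax takes much less.

　　For an n-dimensional array in MATLAB, it is usually easiest to peel the dimensions off from right to left, because that is how MATLAB prints such arrays. For understanding the data, reading left to right or right to left are both fine; use whichever is more convenient.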

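　　The convolution test in the program works as follows: first use cnnConvolve to compute the convolved values for the given samples, then randomly pick many patches and compute the network output for each by direct algebra. If the two results differ only negligibly for all of the patches (1000 are chosen here), the convolution computation is correct.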
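
　　The same kind of check can be tried on a single channel without any trained network. The sketch below, with assumed toy sizes and a random filter, compares `conv2(..., 'valid')` applied to a flipped filter against a direct element-wise product at random positions. The names `flipped`, `convolved` and `maxErr` are hypothetical.

```matlab
% Minimal sketch (single channel, toy sizes are assumptions).
imageDim = 10; patchDim = 3;
im = rand(imageDim, imageDim);
feature = randn(patchDim, patchDim);   % one filter, laid out like a patch

% conv2 flips its kernel, so flip it first to get a plain sliding dot product.
flipped = flipud(fliplr(feature));
convolved = conv2(im, flipped, 'valid');   % (imageDim - patchDim + 1)^2 outputs

% Compare against the direct computation at a few random positions.
maxErr = 0;
for k = 1:20
    r = randi([1, imageDim - patchDim + 1]);
    c = randi([1, imageDim - patchDim + 1]);
    patch = im(r:r + patchDim - 1, c:c + patchDim - 1);
    direct = sum(sum(patch .* feature));
    maxErr = max(maxErr, abs(direct - convolved(r, c)));
end
disp(maxErr);   % should be at rounding-error level
```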

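　　The pooling test in the program works as follows: pooling is done by the function cnnPool, whose arguments are the pooling dimension and the data to pool. The program first builds a small test matrix, computes the mean-pooling result by hand, and then computes the result with cnnPool. If the two results are equal, the pooling function is correct.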
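
　　As a small worked example of mean pooling over non-overlapping regions, the sketch below pools an assumed 6x6 matrix with 3x3 regions. It is a standalone illustration, not the exercise's cnnPool; the sizes and the names `M`, `pooled`, `rows` and `cols` are chosen for this example only.

```matlab
% Minimal sketch (assumed 6x6 input and 3x3 pooling regions).
poolDim = 3;
M = reshape(1:36, 6, 6);          % columns hold 1..6, 7..12, and so on
resultDim = floor(size(M, 1) / poolDim);

pooled = zeros(resultDim, resultDim);
for poolRow = 1:resultDim
    rows = (poolRow - 1) * poolDim + (1:poolDim);
    for poolCol = 1:resultDim
        cols = (poolCol - 1) * poolDim + (1:poolDim);
        block = M(rows, cols);
        pooled(poolRow, poolCol) = mean(block(:));
    end
end
disp(pooled);   % [8 26; 11 29]
```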

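　　Color features are handled as follows: convolve one RGB channel at a time (three convolutions in total), then add the three resulting matrices element-wise. The subsequent pooling then only has to operate on a single feature map.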
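
　　Summing the per-channel results is valid because a hidden unit's input is one dot product over all dimensions of the flattened patch, and that dot product splits into one partial sum per channel. The sketch below checks this with assumed toy sizes and random stand-ins for an image and one weight row; `img`, `w`, `summed` and `direct` are hypothetical names.

```matlab
% Minimal sketch (toy sizes; random stand-ins for an image and one filter row).
patchDim = 3; imageDim = 7; numChannels = 3;
patchSize = patchDim * patchDim;
img = rand(imageDim, imageDim, numChannels);
w = randn(1, patchSize * numChannels);     % one row of the weight matrix

outDim = imageDim - patchDim + 1;
summed = zeros(outDim, outDim);
for channel = 1:numChannels
    offset = (channel - 1) * patchSize;
    feature = reshape(w(offset + 1:offset + patchSize), patchDim, patchDim);
    feature = flipud(fliplr(feature));     % undo conv2's kernel flip
    summed = summed + conv2(img(:, :, channel), feature, 'valid');
end

% Direct computation at one position: flatten the 3-channel patch column-major.
r = 2; c = 4;
patch = img(r:r + patchDim - 1, c:c + patchDim - 1, :);
direct = w * patch(:);
disp(abs(direct - summed(r, c)));   % should be at rounding-error level
```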

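　　The convolved output has the form:

　　convolvedFeatures(featureNum, imageNum, imageRow, imageCol)

　　The pooled output has the form:

　　pooledFeatures(featureNum, imageNum, poolRow, poolCol)

　　The images are stored in the form:

　　convImages(imageRow, imageCol, imageChannel, imageNum)

　　Because the softmax classifier only has to be trained for 4 classes, it is very fast and takes less than 1 minute.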
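
　　The array layouts above determine both the sizes after convolution and pooling and the reshaping needed before softmax. The sketch below uses the image, patch and pooling sizes from this exercise, but only a handful of features and images (assumed toy values) so that the arrays stay small.

```matlab
% Minimal sketch: exercise sizes for images/patches/pooling, toy counts otherwise.
imageDim = 64; patchDim = 8; poolDim = 19;
convDim = imageDim - patchDim + 1;        % 57
poolOut = floor(convDim / poolDim);       % 3

numFeatures = 4; numImages = 5;           % toy values (assumed)
pooled = rand(numFeatures, numImages, poolOut, poolOut);

% Move the image index last, then flatten everything else into one column
% per image.
X = permute(pooled, [1 3 4 2]);
X = reshape(X, numel(pooled) / numImages, numImages);
disp(size(X));   % 36 5
```

　　With the 400 features used in this exercise, each image would be described by 400 * 3 * 3 = 3600 pooled values.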

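　　Some MATLAB functions:

　　squeeze:

　　B = squeeze(A) returns an array B with the same elements as A, but with all singleton dimensions removed. A singleton dimension is one for which size(A,dim) = 1. Two-dimensional arrays are unaffected by squeeze: if A is a row vector, a column vector, or a scalar (1-by-1), then B = A. For example, rand(4,1,3) produces a uniformly distributed array with 3 pages of 4 rows and 1 column each; after squeeze the 1-column dimension disappears, leaving a 4-by-3 two-dimensional array. rand(4,2,3) has no singleton dimension, so squeeze leaves it unchanged.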
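
　　A short illustration of these cases, including the 4-D slice pattern that appears when reading one feature map out of a convolvedFeatures-style array. The array `F` and its sizes are assumptions for this example.

```matlab
A = rand(4, 1, 3);
B = squeeze(A);
disp(size(B));            % 4 3

C = rand(4, 2, 3);
disp(size(squeeze(C)));   % 4 2 3  (no singleton dimension, unchanged)

% A slice such as F(f, i, :, :) has size 1 x 1 x R x C;
% squeeze turns it into an R x C matrix.
F = rand(2, 3, 5, 5);
S = squeeze(F(1, 2, :, :));
disp(size(S));            % 5 5
```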

　　size:

　　size(A,n): if A is a multidimensional array, size(A,n) is the length of its n-th dimension, returned as a scalar.
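
　　For example, with an array laid out like convImages (the concrete sizes below are assumed for illustration):

```matlab
images = zeros(64, 64, 3, 8);     % rows x cols x channels x images (toy array)
numRows = size(images, 1);        % 64
numChannels = size(images, 3);    % 3
numImages = size(images, 4);      % 8
trailing = size(images, 5);       % 1: dimensions beyond ndims(images) have length 1
```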

　　Experiment results:

　　The learned feature images are:

　　The final prediction accuracy is: Accuracy: 80.406%

　　Main experiment code:

　　CnnExercise.m:

%% CS294A/CS294W Convolutional Neural Networks Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  convolutional neural networks exercise. In this exercise, you will only
%  need to modify cnnConvolve.m and cnnPool.m. You will not need to modify
%  this file.

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

imageDim = 64;         % image dimension
imageChannels = 3;     % number of channels (rgb, so 3)

patchDim = 8;          % patch dimension
numPatches = 50000;    % number of patches

visibleSize = patchDim * patchDim * imageChannels;  % number of input units ,8*8*3=192
outputSize = visibleSize;   % number of output units
hiddenSize = 400;           % number of hidden units 

epsilon = 0.1;           % epsilon for ZCA whitening

poolDim = 19;          % dimension of pooling region

%%======================================================================
%% STEP 1: Train a sparse autoencoder (with a linear decoder) to learn 
%  features from color patches. If you have completed the linear decoder
%  exercise, use the features that you have obtained from that exercise, 
%  loading them into optTheta. Recall that we have to keep around the 
%  parameters used in whitening (i.e., the ZCA whitening matrix and the
%  meanPatch)

% --------------------------- YOUR CODE HERE --------------------------
% Train the sparse autoencoder and fill the following variables with 
% the optimal parameters:

optTheta =  zeros(2*hiddenSize*visibleSize+hiddenSize+visibleSize, 1); % total number of parameters of the patch-level network
ZCAWhite =  zeros(visibleSize, visibleSize);
meanPatch = zeros(visibleSize, 1);
load STL10Features.mat;


% --------------------------------------------------------------------

% Display and check to see that the features look good
W = reshape(optTheta(1:visibleSize * hiddenSize), hiddenSize, visibleSize);
b = optTheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);

displayColorNetwork( (W*ZCAWhite)'); % explained in an earlier post

%%======================================================================
%% STEP 2: Implement and test convolution and pooling
%  In this step, you will implement convolution and pooling, and test them
%  on a small part of the data set to ensure that you have implemented
%  these two functions correctly. In the next step, you will actually
%  convolve and pool the features with the STL10 images.

%% STEP 2a: Implement convolution
%  Implement convolution in the function cnnConvolve in cnnConvolve.m

% Note that we have to preprocess the images in the exact same way 
% we preprocessed the patches before we can obtain the feature activations.

load stlTrainSubset.mat % loads numTrainImages, trainImages, trainLabels

%% Use only the first 8 images for testing
convImages = trainImages(:, :, :, 1:8); 

% NOTE: Implement cnnConvolve in cnnConvolve.m first! W and b are already in matrix/vector form
convolvedFeatures = cnnConvolve(patchDim, hiddenSize, convImages, W, b, ZCAWhite, meanPatch);

%% STEP 2b: Checking your convolution
%  To ensure that you have convolved the features correctly, we have
%  provided some code to compare the results of your convolution with
%  activations from the sparse autoencoder

% For 1000 random points
for i = 1:1000    
    featureNum = randi([1, hiddenSize]); % pick a random feature
    imageNum = randi([1, 8]); % pick a random image
    imageRow = randi([1, imageDim - patchDim + 1]); % pick a random top-left corner
    imageCol = randi([1, imageDim - patchDim + 1]);    
   
    % within the randomly chosen image, extract the patch whose top-left corner is the random point
    patch = convImages(imageRow:imageRow + patchDim - 1, imageCol:imageCol + patchDim - 1, :, imageNum);
    patch = patch(:); % flattened in column-major order
    patch = patch - meanPatch;
    patch = ZCAWhite * patch; % whiten the patch with the same parameters used in training
    
    features = feedForwardAutoencoder(optTheta, hiddenSize, visibleSize, patch); % compute the hidden-layer output for this patch

    if abs(features(featureNum, 1) - convolvedFeatures(featureNum, imageNum, imageRow, imageCol)) > 1e-9
        fprintf('Convolved feature does not match activation from autoencoder\n');
        fprintf('Feature Number    : %d\n', featureNum);
        fprintf('Image Number      : %d\n', imageNum);
        fprintf('Image Row         : %d\n', imageRow);
        fprintf('Image Column      : %d\n', imageCol);
        fprintf('Convolved feature : %0.5f\n', convolvedFeatures(featureNum, imageNum, imageRow, imageCol));
        fprintf('Sparse AE feature : %0.5f\n', features(featureNum, 1));       
        error('Convolved feature does not match activation from autoencoder');
    end 
end

disp('Congratulations! Your convolution code passed the test.');

%% STEP 2c: Implement pooling
%  Implement pooling in the function cnnPool in cnnPool.m

% NOTE: Implement cnnPool in cnnPool.m first!
pooledFeatures = cnnPool(poolDim, convolvedFeatures);

%% STEP 2d: Checking your pooling
%  To ensure that you have implemented pooling, we will use your pooling
%  function to pool over a test matrix and check the results.

testMatrix = reshape(1:64, 8, 8); % arrange the numbers 1..64 into a matrix, increasing down the columns
% compute the mean-pooling result directly
expectedMatrix = [mean(mean(testMatrix(1:4, 1:4))) mean(mean(testMatrix(1:4, 5:8))); ...
                  mean(mean(testMatrix(5:8, 1:4))) mean(mean(testMatrix(5:8, 5:8))); ];
            
testMatrix = reshape(testMatrix, 1, 1, 8, 8);

% squeeze removes the singleton dimensions
pooledFeatures = squeeze(cnnPool(4, testMatrix)); % the argument 4 means pooling over 4x4 regions

if ~isequal(pooledFeatures, expectedMatrix)
    disp('Pooling incorrect');
    disp('Expected');
    disp(expectedMatrix);
    disp('Got');
    disp(pooledFeatures);
else
    disp('Congratulations! Your pooling code passed the test.');
end

%%======================================================================
%% STEP 3: Convolve and pool with the dataset
%  In this step, you will convolve each of the features you learned with
%  the full large images to obtain the convolved features. You will then
%  pool the convolved features to obtain the pooled features for
%  classification.
%
%  Because the convolved features matrix is very large, we will do the
%  convolution and pooling 50 features at a time to avoid running out of
%  memory. Reduce this number if necessary

stepSize = 50;
assert(mod(hiddenSize, stepSize) == 0, 'stepSize should divide hiddenSize'); % hiddenSize/stepSize must be an integer; here the work is done in 8 batches

load stlTrainSubset.mat % loads numTrainImages, trainImages, trainLabels
load stlTestSubset.mat  % loads numTestImages,  testImages,  testLabels

pooledFeaturesTrain = zeros(hiddenSize, numTrainImages, ... % imageDim is the size of the large images, 64 here
    floor((imageDim - patchDim + 1) / poolDim), ... % poolDim is the size of one pooling region, 19 here, i.e. one pool per 19x19 block
    floor((imageDim - patchDim + 1) / poolDim) ); % the resulting pooledFeaturesTrain is 400 x 2000 x 3 x 3
pooledFeaturesTest = zeros(hiddenSize, numTestImages, ...
    floor((imageDim - patchDim + 1) / poolDim), ...
    floor((imageDim - patchDim + 1) / poolDim) ); % pooledFeaturesTest is 400 x 3200 x 3 x 3

tic();

for convPart = 1:(hiddenSize / stepSize) % extract features from the raw images in batches, stepSize hidden units at a time
    
    featureStart = (convPart - 1) * stepSize + 1; % first feature of this batch
    featureEnd = convPart * stepSize; % last feature of this batch
    
    fprintf('Step %d: features %d to %d\n', convPart, featureStart, featureEnd);  
    Wt = W(featureStart:featureEnd, :);
    bt = b(featureStart:featureEnd);    
    
    fprintf('Convolving and pooling train images\n');
    convolvedFeaturesThis = cnnConvolve(patchDim, stepSize, ... % the second argument is the number of "hidden" units in the current batch
        trainImages, Wt, bt, ZCAWhite, meanPatch);
    pooledFeaturesThis = cnnPool(poolDim, convolvedFeaturesThis);
    pooledFeaturesTrain(featureStart:featureEnd, :, :, :) = pooledFeaturesThis;   
    toc();
    clear convolvedFeaturesThis pooledFeaturesThis; % free these large variables once they are no longer needed, since the test part comes next
    
    fprintf('Convolving and pooling test images\n');
    convolvedFeaturesThis = cnnConvolve(patchDim, stepSize, ...
        testImages, Wt, bt, ZCAWhite, meanPatch);
    pooledFeaturesThis = cnnPool(poolDim, convolvedFeaturesThis);
    pooledFeaturesTest(featureStart:featureEnd, :, :, :) = pooledFeaturesThis;   
    toc();

    clear convolvedFeaturesThis pooledFeaturesThis;

end


% You might want to save the pooled features since convolution and pooling takes a long time
save('cnnPooledFeatures.mat', 'pooledFeaturesTrain', 'pooledFeaturesTest');
toc();

%%======================================================================
%% STEP 4: Use pooled features for classification
%  Now, you will use your pooled features to train a softmax classifier,
%  using softmaxTrain from the softmax exercise.
%  Training the softmax classifier for 1000 iterations should take less than
%  10 minutes.

% Add the path to your softmax solution, if necessary
% addpath /path/to/solution/

% Setup parameters for softmax
softmaxLambda = 1e-4; % weight decay coefficient
numClasses = 4;
% Reshape the pooledFeatures to form an input vector for softmax
softmaxX = permute(pooledFeaturesTrain, [1 3 4 2]); % permute reorders the dimensions, moving the image index last
softmaxX = reshape(softmaxX, numel(pooledFeaturesTrain) / numTrainImages,... % numel(pooledFeaturesTrain) / numTrainImages is
                        numTrainImages);                                    % the feature vector length for each image
    
softmaxY = trainLabels;

options = struct;
options.maxIter = 200;
softmaxModel = softmaxTrain(numel(pooledFeaturesTrain) / numTrainImages,... % the first argument is inputSize
    numClasses, softmaxLambda, softmaxX, softmaxY, options);

%%======================================================================
%% STEP 5: Test classifier
%  Now you will test your trained classifier against the test images

softmaxX = permute(pooledFeaturesTest, [1 3 4 2]);
softmaxX = reshape(softmaxX, numel(pooledFeaturesTest) / numTestImages, numTestImages);
softmaxY = testLabels;

[pred] = softmaxPredict(softmaxModel, softmaxX);
acc = (pred(:) == softmaxY(:));
acc = sum(acc) / size(acc, 1);
fprintf('Accuracy: %2.3f%%\n', acc * 100); % report the prediction accuracy

% You should expect to get an accuracy of around 80% on the test images.

　　cnnConvolve.m:

function convolvedFeatures = cnnConvolve(patchDim, numFeatures, images, W, b, ZCAWhite, meanPatch)
%cnnConvolve Returns the convolution of the features given by W and b with
%the given images
%
% Parameters:
%  patchDim - patch (feature) dimension
%  numFeatures - number of features
%  images - large images to convolve with, matrix in the form
%           images(r, c, channel, image number)
%  W, b - W, b for features from the sparse autoencoder
%  ZCAWhite, meanPatch - ZCAWhitening and meanPatch matrices used for
%                        preprocessing
%
% Returns:
%  convolvedFeatures - matrix of convolved features in the form
%                      convolvedFeatures(featureNum, imageNum, imageRow, imageCol)

patchSize = patchDim*patchDim;
assert(numFeatures == size(W,1), 'W should have numFeatures rows');
numImages = size(images, 4); % size of dimension 4: the number of images
imageDim = size(images, 1); % size of dimension 1: the number of image rows
imageChannels = size(images, 3); % size of dimension 3: the number of channels
assert(patchSize*imageChannels == size(W,2), 'W should have patchSize*imageChannels cols');

% Instructions:
%   Convolve every feature with every large image here to produce the 
%   numFeatures x numImages x (imageDim - patchDim + 1) x (imageDim - patchDim + 1) 
%   matrix convolvedFeatures, such that 
%   convolvedFeatures(featureNum, imageNum, imageRow, imageCol) is the
%   value of the convolved featureNum feature for the imageNum image over
%   the region (imageRow, imageCol) to (imageRow + patchDim - 1, imageCol + patchDim - 1)
%
% Expected running times: 
%   Convolving with 100 images should take less than 3 minutes 
%   Convolving with 5000 images should take around an hour
%   (So to save time when testing, you should convolve with less images, as
%   described earlier)

% -------------------- YOUR CODE HERE --------------------
% Precompute the matrices that will be used during the convolution. Recall
% that you need to take into account the whitening and mean subtraction
% steps

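WT = W*ZCAWhite; % equivalent network weights with whitening folded in
b_mean = b - WT*meanPatch; % bias correction needed because the input images are not mean-subtracted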

% --------------------------------------------------------

convolvedFeatures = zeros(numFeatures, numImages, imageDim - patchDim + 1, imageDim - patchDim + 1);
for imageNum = 1:numImages
  for featureNum = 1:numFeatures

    % convolution of image with feature matrix for each channel
    convolvedImage = zeros(imageDim - patchDim + 1, imageDim - patchDim + 1);
    for channel = 1:imageChannels

      % Obtain the feature (patchDim x patchDim) needed during the convolution
      % ---- YOUR CODE HERE ----
      offset = (channel-1)*patchSize;
      feature = reshape(WT(featureNum,offset+1:offset+patchSize), patchDim, patchDim); % extract one patch-sized block of weights
      im  = images(:,:,channel,imageNum);

      % Flip the feature matrix because of the definition of convolution, as explained later
      feature = flipud(fliplr(squeeze(feature)));
      
      % Obtain the image
      im = squeeze(images(:, :, channel, imageNum)); % extract one channel of one image

      % Convolve "feature" with "im", adding the result to convolvedImage
      % be sure to do a 'valid' convolution
      % ---- YOUR CODE HERE ----
      convolvedoneChannel = conv2(im, feature, 'valid');
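      convolvedImage = convolvedImage + convolvedoneChannel; % sum the 3 channels: the hidden-unit input is a dot product over all input dimensions, so the per-channel results add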
      
      % ------------------------

    end
    
    % Add the bias unit (correcting for the mean subtraction as well)
    % Then, apply the sigmoid function to get the hidden activation
    % ---- YOUR CODE HERE ----

    convolvedImage = sigmoid(convolvedImage+b_mean(featureNum));
    
    
    % ------------------------
    
    % The convolved feature is the sum of the convolved values for all channels
    convolvedFeatures(featureNum, imageNum, :, :) = convolvedImage;
  end
end


end

function sigm = sigmoid(x)
    sigm = 1./(1+exp(-x));
end

　　cnnPool.m:

function pooledFeatures = cnnPool(poolDim, convolvedFeatures)
%cnnPool Pools the given convolved features
%
% Parameters:
%  poolDim - dimension of pooling region
%  convolvedFeatures - convolved features to pool (as given by cnnConvolve)
%                      convolvedFeatures(featureNum, imageNum, imageRow, imageCol)
%
% Returns:
%  pooledFeatures - matrix of pooled features in the form
%                   pooledFeatures(featureNum, imageNum, poolRow, poolCol)
%     

numImages = size(convolvedFeatures, 2); % number of images
numFeatures = size(convolvedFeatures, 1); % number of features
convolvedDim = size(convolvedFeatures, 3); % number of rows in each convolved feature map
resultDim  = floor(convolvedDim / poolDim);
pooledFeatures = zeros(numFeatures, numImages, resultDim, resultDim);

% -------------------- YOUR CODE HERE --------------------
% Instructions:
%   Now pool the convolved features in regions of poolDim x poolDim,
%   to obtain the 
%   numFeatures x numImages x (convolvedDim/poolDim) x (convolvedDim/poolDim) 
%   matrix pooledFeatures, such that
%   pooledFeatures(featureNum, imageNum, poolRow, poolCol) is the 
%   value of the featureNum feature for the imageNum image pooled over the
%   corresponding (poolRow, poolCol) pooling region 
%   (see http://ufldl/wiki/index.php/Pooling )
%   
%   Use mean pooling here.
% -------------------- YOUR CODE HERE --------------------
for imageNum = 1:numImages
    for featureNum = 1:numFeatures
        for poolRow = 1:resultDim
            offsetRow = 1+(poolRow-1)*poolDim;
            for poolCol = 1:resultDim
                offsetCol = 1+(poolCol-1)*poolDim;
                patch = convolvedFeatures(featureNum,imageNum,offsetRow:offsetRow+poolDim-1,...
                    offsetCol:offsetCol+poolDim-1); % extract one pooling region
                pooledFeatures(featureNum,imageNum,poolRow,poolCol) = mean(patch(:)); % mean pooling
            end
        end
    end
end

end

　　References:

Deep learning: 17 (Linear Decoders, Convolution and Pooling)

Exercise:Convolution and Pooling

　　Deep learning: 22 (linear decoder exercise)

http://blog.sina.com.cn/s/blog_50363a790100wyeq.html

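Deep learning: 24 (stacked autoencoder exercise)

　　Preface:

　　This exercise practices training a network with 2 hidden layers. Each layer is trained with the sparse autoencoder idea, and the two hidden layers together extract features from the input data. The task is handwritten digit recognition on MNIST; the content and steps follow the web tutorial Exercise: Implement deep networks for digit classification. Once features have been extracted from the digit images, softmax is used to classify them. For an introduction to MNIST see the page: MNIST Dataset. The theory behind this post is covered in the earlier post Deep learning: 16 (deep networks).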

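　　Experiment basics:

　　The training procedure for the deep network is roughly as follows:

　　1. Using the raw input data as input, train the parameters of the first hidden layer (with the sparse autoencoder method), and use the trained parameters to compute the output of the first hidden layer.

　　2. Use the output of step 1 as the input of the second network, and train the parameters of the second hidden layer in the same way.

　　3. Use the output of step 2 as the input of the softmax multi-class classifier, and train the softmax parameters using the labels of the original data.

　　4. Compute the loss function of the whole network (2 hidden layers plus the softmax classifier), together with the partial derivative of that loss with respect to every parameter.

　　5. Use the parameters from steps 1, 2 and 3 to initialize the whole deep network (2 hidden layers and 1 softmax output layer), then run L-BFGS to iterate toward parameters near a minimum of the loss above, and take these as the final parameters of the whole network.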

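　　The procedure above assumes a softmax classifier, whose loss function has an explicit formula. During fine-tuning the whole network can therefore be treated as a single model: compute its loss and partial derivatives and, given labeled data, use the pretrained parameters as the starting point and let an optimization algorithm solve for the parameters of the whole network. But what if the classifier at the end is not softmax but something else, such as an SVM or a random forest? The feature-extraction layers are already pretrained and those parameters can initialize the front of the network, but how should fine-tuning be done? The labels are only used by the final classifier, so the loss of the whole system cannot be computed in the same way. Would one have to go back to making the final output of the first n layers reconstruct the input of the first layer (that is, a multi-layer sparse autoencoder)? I have not worked this out yet and hope to understand it later.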

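　　A few points to note when training deep networks (assuming 2 hidden layers):

　　1. When pretraining with sparse autoencoders, the output of each hidden layer has to be computed in turn. If a softmax classifier follows, the output of the last hidden layer is likewise needed as the softmax input in order to train the softmax parameters.
　　2. It follows from point 1 that the classifier parameters must be pretrained before fine-tuning. During fine-tuning, all layers are treated as one single network, so every iteration updates the parameters of all layers.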

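　　In practice, training the first hidden layer takes the longest, because the weight matrix to learn is 200*784 (not counting the bias b). Training the second hidden layer is faster, mainly because only a 200*200 weight matrix has to be learned, so there are far fewer parameters. Training softmax is faster still, since it has even fewer parameters and its loss and gradient formulas are simpler than those of the two hidden layers. The final fine-tuning of the whole network takes about as long as training the second hidden layer.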
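
　　The parameter counts behind this timing comparison can be tallied directly. The sketch below uses the layer sizes from this exercise and assumes the layout used in these posts: each sparse autoencoder has encoder and decoder weights plus two bias vectors, and the softmax layer has weights only. The variable names are hypothetical.

```matlab
inputSize = 28 * 28; hiddenSizeL1 = 200; hiddenSizeL2 = 200; numClasses = 10;

% Encoder weight matrices alone (the figures quoted in the text).
encoder1 = hiddenSizeL1 * inputSize;       % 156800
encoder2 = hiddenSizeL2 * hiddenSizeL1;    % 40000

% Full autoencoders: encoder + decoder weights + both bias vectors.
sae1Params = 2 * hiddenSizeL1 * inputSize + hiddenSizeL1 + inputSize;        % 314584
sae2Params = 2 * hiddenSizeL2 * hiddenSizeL1 + hiddenSizeL2 + hiddenSizeL1;  % 80400
softmaxParams = numClasses * hiddenSizeL2;                                   % 2000

% Fine-tuning keeps only the encoder halves plus the softmax weights.
fineTuneParams = softmaxParams + (encoder1 + hiddenSizeL1) + (encoder2 + hiddenSizeL2);  % 199200
```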

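　　Some functions used in the program:

　　[params, netconfig] = stack2params(stack)

　　Converts the layered network parameters in stack (possibly several parameters) into a single vector params, which makes it convenient to apply various optimization algorithms. netconfig stores information about the network structure: netconfig.inputsize is the number of input-layer nodes, and the elements of netconfig.layersizes are the numbers of nodes in each hidden layer.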

　　[ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, numClasses, netconfig,lambda, data, labels)

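　　This function computes the loss of the whole network and the partial derivative of the loss with respect to every parameter. The loss is a single real number, computed from the softmax classifier: only the labels and the softmax output-layer values are needed. The partial derivatives, in contrast, are many, one per parameter, covering both the hidden layers and the softmax layer. The softmax part is obtained directly from its formula, while the hidden-layer part is obtained by backpropagation (first compute the error term of each layer, then use it to compute the gradients with respect to w and b).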
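
　　The computation described above can be sketched on a toy network. This is not the exercise's stackedAECost: the sizes, the random parameters, the variable names, and the choice to apply weight decay to the softmax weights only are all assumptions made for illustration.

```matlab
% Minimal sketch (toy sizes; random parameters; not the exercise solution).
inputSize = 5; h1 = 4; h2 = 3; numClasses = 3; m = 6; lambda = 1e-4;
data = rand(inputSize, m);
labels = [1 2 3 1 2 3];
W1 = 0.1 * randn(h1, inputSize);  b1 = zeros(h1, 1);
W2 = 0.1 * randn(h2, h1);         b2 = zeros(h2, 1);
softmaxTheta = 0.1 * randn(numClasses, h2);

sigmoid = @(z) 1 ./ (1 + exp(-z));
groundTruth = full(sparse(labels, 1:m, 1, numClasses, m));   % one-hot labels

% Forward pass through both hidden layers and the softmax layer.
a2 = sigmoid(bsxfun(@plus, W1 * data, b1));
a3 = sigmoid(bsxfun(@plus, W2 * a2, b2));
scores = softmaxTheta * a3;
scores = bsxfun(@minus, scores, max(scores, [], 1));   % numerical stability
p = exp(scores);
p = bsxfun(@rdivide, p, sum(p, 1));

% A single scalar loss, computed from the labels and the softmax outputs.
cost = -sum(sum(groundTruth .* log(p))) / m ...
       + (lambda / 2) * sum(softmaxTheta(:).^2);

% Softmax gradient comes straight from its formula.
softmaxThetaGrad = -(groundTruth - p) * a3' / m + lambda * softmaxTheta;

% Hidden layers: backpropagate the error terms, then form the gradients.
delta3 = -(softmaxTheta' * (groundTruth - p)) .* a3 .* (1 - a3);
delta2 = (W2' * delta3) .* a2 .* (1 - a2);
W2grad = delta3 * a2' / m;    b2grad = sum(delta3, 2) / m;
W1grad = delta2 * data' / m;  b1grad = sum(delta2, 2) / m;
```

　　As with the earlier exercises, a numerical gradient check on a small network like this one is a reasonable way to validate such code before running it on the full data.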

　　stack = params2stack(params, netconfig)

　　The inverse of the function above: it unpacks a parameter vector back into the layer-by-layer structure of the deep network.
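
　　The round trip between a layered stack and a flat vector can be sketched by hand. The code below is a hypothetical re-creation for illustration, not the stack2params/params2stack supplied with the exercise; in particular the ordering (w then b, layer by layer) and the names `layerSizes`, `restored` and `pos` are assumptions.

```matlab
% Minimal sketch: hand-rolled flatten/unflatten with assumed ordering.
stack = cell(2, 1);
stack{1}.w = rand(3, 4);  stack{1}.b = rand(3, 1);
stack{2}.w = rand(2, 3);  stack{2}.b = rand(2, 1);

% Flatten: append w(:) then b(:) for each layer, remembering the layer sizes.
params = [];
layerSizes = zeros(numel(stack), 1);
inputSize = size(stack{1}.w, 2);
for d = 1:numel(stack)
    params = [params; stack{d}.w(:); stack{d}.b(:)];
    layerSizes(d) = size(stack{d}.w, 1);
end

% Unflatten: walk the vector with a running position.
restored = cell(numel(layerSizes), 1);
pos = 1;
prevSize = inputSize;
for d = 1:numel(layerSizes)
    n = layerSizes(d);
    restored{d}.w = reshape(params(pos:pos + n * prevSize - 1), n, prevSize);
    pos = pos + n * prevSize;
    restored{d}.b = params(pos:pos + n - 1);
    pos = pos + n;
    prevSize = n;
end
```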

　　[pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)

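　　This function predicts the class of the input data. theta holds the parameters of the whole network (including the classifier part), numClasses is the number of classes, and netconfig describes the network structure.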
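
　　Prediction amounts to a forward pass followed by picking the highest-scoring class. The sketch below shows the idea with random stand-ins for the trained parameters; it is not the exercise's stackedAEPredict, and the toy sizes and variable names are assumptions.

```matlab
% Minimal sketch (toy sizes; random stand-ins for trained parameters).
inputSize = 5; h1 = 4; h2 = 3; numClasses = 3; m = 6;
data = rand(inputSize, m);
W1 = randn(h1, inputSize);  b1 = randn(h1, 1);
W2 = randn(h2, h1);         b2 = randn(h2, 1);
softmaxTheta = randn(numClasses, h2);

sigmoid = @(z) 1 ./ (1 + exp(-z));
a2 = sigmoid(bsxfun(@plus, W1 * data, b1));
a3 = sigmoid(bsxfun(@plus, W2 * a2, b2));

% The predicted class is the row with the largest score; the softmax
% normalization does not change which row is largest, so it can be skipped.
[~, pred] = max(softmaxTheta * a3, [], 1);
disp(pred);   % 1 x m vector of labels in 1..numClasses
```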

　　[h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)

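　　This function displays the matrix A. Each column of A must be one weight vector, and the length of each column must be a perfect square. Each column of A is then shown as a small patch image; how many patches there are and how they are arranged is decided inside the function.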

　　MATLAB built-in functions:

　　struct:

　　s = struct; creates a structure array s (a scalar structure with no fields).

　　nargout:

　　Inside a function, nargout is the number of output arguments requested by the caller.
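
　　A common use is to skip the work for an output the caller did not ask for, for example an expensive gradient. The function below is a hypothetical helper written for this illustration (saved as its own file); it is not part of the exercise code.

```matlab
% File: squareWithSlope.m (hypothetical helper, for illustration only)
function [value, slope] = squareWithSlope(x)
    value = x.^2;
    if nargout > 1          % compute the second output only when requested
        slope = 2 * x;
    end
end
```

　　Calling `v = squareWithSlope(3)` returns 9 without computing the slope, while `[v, s] = squareWithSlope(3)` returns 9 and 6.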

　　save:

　　For example, save('saves/step2.mat', 'sae1OptTheta'); requires that a directory named saves exists under the current directory; otherwise the call fails.
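
　　One way to avoid that failure is to create the directory first when it is missing. In the sketch below the random vector is only a stand-in for the trained parameters, and `outDir` is a hypothetical variable name.

```matlab
% Minimal sketch: create the target directory before saving.
outDir = 'saves';                       % directory name used in this exercise
if ~exist(outDir, 'dir')
    mkdir(outDir);
end
sae1OptTheta = rand(10, 1);             % stand-in for the trained parameters
save(fullfile(outDir, 'step2.mat'), 'sae1OptTheta');
```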

　　Experiment results:

　　The features of the first hidden layer are shown below:

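　　I do not know how to display the features of the second hidden layer. Each node of the second hidden layer corresponds to a 200-dimensional weight vector, and display_network cannot show these because it only handles features whose dimension is a perfect square. It is unclear whether 200 should be arranged as 20*10 or as 8*25, and I am curious how the many deep learning articles display their second-layer features: which factorization of 200 would be representative? To be determined. The features are therefore not shown here, because taking the first 196 of the 200 dimensions and displaying them with display_network reveals nothing: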

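　　Recognition accuracy without fine-tuning the network parameters:

　　Before Finetuning Test Accuracy: 92.190%

　　Recognition accuracy after fine-tuning the network parameters:

　　After Finetuning Test Accuracy: 97.670%

　　Main experiment code with comments: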

　　stackedAEExercise.m:

%% CS294A/CS294W Stacked Autoencoder Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  stacked autoencoder exercise. You will need to complete code in
%  stackedAECost.m
%  You will also need to have implemented sparseAutoencoderCost.m and 
%  softmaxCost.m from previous exercises. You will need the initializeParameters.m
%  loadMNISTImages.m, and loadMNISTLabels.m files from previous exercises.
%  
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file. 
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to 
%  change the parameters below.

DISPLAY = true;
inputSize = 28 * 28;
numClasses = 10;
hiddenSizeL1 = 200;    % Layer 1 Hidden Size
hiddenSizeL2 = 200;    % Layer 2 Hidden Size
sparsityParam = 0.1;   % desired average activation of the hidden units.
                       % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                       %  in the lecture notes). 
lambda = 3e-3;         % weight decay parameter       
beta = 3;              % weight of sparsity penalty term       

%%======================================================================
%% STEP 1: Load data from the MNIST database
%
%  This loads our training data from the MNIST database files.

% Load MNIST database files
trainData = loadMNISTImages('train-images.idx3-ubyte');
trainLabels = loadMNISTLabels('train-labels.idx1-ubyte');

trainLabels(trainLabels == 0) = 10; % Remap 0 to 10 since our labels need to start from 1

%%======================================================================
%% STEP 2: Train the first sparse autoencoder
%  This trains the first sparse autoencoder on the unlabelled MNIST training
%  images.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.
%  Randomly initialize the parameters
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the first layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL1"
%                You should store the optimal parameters in sae1OptTheta
addpath minFunc/;
options = struct;
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';
[sae1OptTheta, cost] =  minFunc(@(p)sparseAutoencoderCost(p,...
    inputSize,hiddenSizeL1,lambda,sparsityParam,beta,trainData),sae1Theta,options); % train the first-layer parameters
save('saves/step2.mat', 'sae1OptTheta');

if DISPLAY
  W1 = reshape(sae1OptTheta(1:hiddenSizeL1 * inputSize), hiddenSizeL1, inputSize);
  display_network(W1');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 2: Train the second sparse autoencoder
%  This trains the second sparse autoencoder on the first autoencoder
%  features.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

[sae1Features] = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
                                        inputSize, trainData);

%  Randomly initialize the parameters
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the second layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL2" and an inputsize of
%                "hiddenSizeL1"
%
%                You should store the optimal parameters in sae2OptTheta

[sae2OptTheta, cost] =  minFunc(@(p)sparseAutoencoderCost(p,...
    hiddenSizeL1,hiddenSizeL2,lambda,sparsityParam,beta,sae1Features),sae2Theta,options); % train the second-layer parameters
save('saves/step3.mat', 'sae2OptTheta');

figure;
if DISPLAY
  W11 = reshape(sae1OptTheta(1:hiddenSizeL1 * inputSize), hiddenSizeL1, inputSize);
  W12 = reshape(sae2OptTheta(1:hiddenSizeL2 * hiddenSizeL1), hiddenSizeL2, hiddenSizeL1);
  % TODO(zellyn): figure out how to display a 2-level network
%  display_network(log(W11' ./ (1-W11')) * W12');
%   W12_temp = W12(1:196,1:196);
%   display_network(W12_temp');
%   figure;
%   display_network(W12_temp');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 3: Train the softmax classifier
%  This trains the softmax classifier on the second autoencoder features.
%  If you've correctly implemented softmaxCost.m, you don't need
%  to change anything here.

[sae2Features] = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
                                        hiddenSizeL1, sae1Features);

%  Randomly initialize the parameters
saeSoftmaxTheta = 0.005 * randn(hiddenSizeL2 * numClasses, 1);


%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the softmax classifier, the classifier takes in
%                input of dimension "hiddenSizeL2" corresponding to the
%                hidden layer size of the 2nd layer.
%
%                You should store the optimal parameters in saeSoftmaxOptTheta 
%
%  NOTE: If you used softmaxTrain to complete this part of the exercise,
%        set saeSoftmaxOptTheta = softmaxModel.optTheta(:);


softmaxLambda = 1e-4;
numClasses = 10;
softoptions = struct;
softoptions.maxIter = 400;
softmaxModel = softmaxTrain(hiddenSizeL2,numClasses,softmaxLambda,...
                            sae2Features,trainLabels,softoptions);
saeSoftmaxOptTheta = softmaxModel.optTheta(:);

save('saves/step4.mat', 'saeSoftmaxOptTheta');
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 5: Finetune softmax model

% Implement the stackedAECost to give the combined cost of the whole model
% then run this cell.

% Initialize the stack using the parameters learned
stack = cell(2,1);
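% Both sae1OptTheta and sae2OptTheta also contain the weights of the sparse autoencoders'
% reconstruction (decoder) layers; only the encoder weights and biases are copied into the stack.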
stack{1}.w = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), ...
                     hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2*hiddenSizeL1*inputSize+1:2*hiddenSizeL1*inputSize+hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), ...
                     hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2*hiddenSizeL2*hiddenSizeL1+1:2*hiddenSizeL2*hiddenSizeL1+hiddenSizeL2);

% Initialize the parameters for the deep model
[stackparams, netconfig] = stack2params(stack);
stackedAETheta = [ saeSoftmaxOptTheta ; stackparams ];% stackedAETheta is a vector holding the parameters of the whole network, with the softmax classifier parameters placed first

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the deep network, hidden size here refers to the
%                dimension of the input to the classifier, which corresponds 
%                to "hiddenSizeL2".
%
%

[stackedAEOptTheta, cost] =  minFunc(@(p)stackedAECost(p,inputSize,hiddenSizeL2,...
                         numClasses, netconfig,lambda, trainData, trainLabels),...
                        stackedAETheta,options);% fine-tune the parameters of the whole network
save('saves/step5.mat', 'stackedAEOptTheta');

figure;
if DISPLAY
  optStack = params2stack(stackedAEOptTheta(hiddenSizeL2*numClasses+1:end), netconfig);
  W11 = optStack{1}.w;
  W12 = optStack{2}.w;
  % TODO(zellyn): figure out how to display a 2-level network
  % display_network(log(1 ./ (1-W11')) * W12');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 6: Test 
%  Instructions: You will need to complete the code in stackedAEPredict.m
%                before running this part of the code
%

% Get labelled test images
% Note that we apply the same kind of preprocessing as the training set
testData = loadMNISTImages('t10k-images.idx3-ubyte');
testLabels = loadMNISTLabels('t10k-labels.idx1-ubyte');

testLabels(testLabels == 0) = 10; % Remap 0 to 10

[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('Before Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('After Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

% Accuracy is the proportion of correctly classified images
% The results for our implementation were:
%
% Before Finetuning Test Accuracy: 87.7%
% After Finetuning Test Accuracy:  97.6%
%
% If your values are too low (accuracy less than 95%), you should check 
% your code for errors, and make sure you are training on the 
% entire data set of 60000 28x28 training images 
% (unless you modified the loading code, this should be the case)

　　stackedAECost.m:

function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                              numClasses, netconfig, ...
                                              lambda, data, labels)
                                         
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.
                                         
% theta: trained weights from the autoencoder
% inputSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% netconfig:   the network configuration of the stack
% lambda:      the weight regularization penalty
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example


%% Unroll softmaxTheta parameter

% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end

cost = 0; % You need to compute this

% You might find these variables useful
M = size(data, 2);
groundTruth = full(sparse(labels, 1:M, 1));


%% --------------------------- YOUR CODE HERE -----------------------------
%  Instructions: Compute the cost function and gradient vector for 
%                the stacked autoencoder.
%
%                You are given a stack variable which is a cell-array of
%                the weights and biases for every layer. In particular, you
%                can refer to the weights of Layer d, using stack{d}.w and
%                the biases using stack{d}.b . To get the total number of
%                layers, you can use numel(stack).
%
%                The last layer of the network is connected to the softmax
%                classification layer, softmaxTheta.
%
%                You should compute the gradients for the softmaxTheta,
%                storing that in softmaxThetaGrad. Similarly, you should
%                compute the gradients for each layer in the stack, storing
%                the gradients in stackgrad{d}.w and stackgrad{d}.b
%                Note that the size of the matrices in stackgrad should
%                match exactly that of the size of the matrices in stack.
%

depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;

for layer = (1:depth)
  z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
  a{layer+1} = sigmoid(z{layer+1});
end

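% Note: M is reused here for the class scores (above it held the number of examples).
M = softmaxTheta * a{depth+1};
M = bsxfun(@minus, M, max(M));   % subtract the column max for numerical stability
p = bsxfun(@rdivide, exp(M), sum(exp(M)));

% The data term is scaled by 1/numClasses rather than 1/(number of examples);
% cost and gradients use the same factor, so they stay mutually consistent.
cost = -1/numClasses * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);
softmaxThetaGrad = -1/numClasses * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;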

d = cell(depth+1, 1);

d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});

for layer = (depth:-1:2)
  d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end

for layer = (depth:-1:1)
  stackgrad{layer}.w = (1/numClasses) * d{layer+1} * a{layer}';
  stackgrad{layer}.b = (1/numClasses) * sum(d{layer+1}, 2);
end

% -------------------------------------------------------------------------

%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

　　stackedAEPredict.m:

function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)
                                         
% stackedAEPredict: Takes a trained theta and a test data set,
% and returns the predicted labels for each example.
                                         
% theta: trained weights from the autoencoder
% inputSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 

% Your code should produce the prediction matrix 
% pred, where pred(i) is argmax_c P(y(c) | x(i)).
 
%% Unroll theta parameter

% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start 
%                from 1.

depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;

for layer = (1:depth)
  z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
  a{layer+1} = sigmoid(z{layer+1});
end

[~, pred] = max(softmaxTheta * a{depth+1});% pick the output with the highest probability
% -----------------------------------------------------------

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

　　References:

MNIST Dataset

Exercise: Implement deep networks for digit classification

Deep learning: 16 (deep networks)

Deep learning: 25 (Recognition performance of a single-layer network with K-means)

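　　Preface:

　　This post uses K-means to analyze the performance of a single-layer network, mainly on the CIFAR-10 image recognition dataset. For background on single-layer network performance, see the earlier post Deep learning: 20 (Analysis of single-layer networks in unsupervised feature learning). As before, the reference paper is An Analysis of Single-Layer Networks in Unsupervised Feature Learning, Adam Coates, Honglak Lee, and Andrew Y. Ng. In AISTATS 14, 2011. The focus here is the K-means algorithm among the four algorithms studied, partly because the authors only provide a K-means demo and partly because sparse autoencoders were already covered at length in earlier posts. The code for this post can be downloaded from Ng's homepage: http://ai.stanford.edu/~ang/papers.php.

　　Experiment background:

　　K-means notes:

　　K-means alternates between two steps. The first is the cluster assignment step, which assigns every sample to a cluster. The second is the move centroid step, which recomputes each cluster center. K-means works not only on data with clearly separated classes but also on data without obvious classes (data that looks indistinguishable to the human eye). Even then the clustering result can be interpreted, because in some settings the number of clusters is simply chosen by a person.

　　Since K-means is a machine learning algorithm, it has an objective function to optimize, which is shown below: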
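
　　A sketch of that objective in its standard form, under assumed notation that is not taken from the post: there are m samples x^(i), c^(i) is the index of the cluster assigned to sample i, and mu_k is the center of cluster k. This quantity is often called the distortion.

```latex
J\left(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_k\right)
  = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2
```

　　The cluster assignment step minimizes J over the c^(i) with the centers fixed, and the move centroid step minimizes J over the mu_k with the assignments fixed.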

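　　When K-means initializes the k clusters, the initialization is random, and different starting points can produce a final clustering far from what one expects. This is the local-optimum problem of K-means. The usual remedy is to run K-means several times, compute the loss of each run, and keep the result with the smallest loss.

　　Another tricky issue is choosing the number of clusters k, since some datasets look reasonable under several values of k. A common approach is the "elbow" method: plot the number of clusters k on the horizontal axis against the K-means loss on the vertical axis and look for the turning point of the curve. The curve usually looks like an arm, and the point at the elbow gives the k to use. This method does not always apply, though, because k may be dictated by the task. For example, one may simply want to split the dataset into 10 groups (a common situation, e.g. binning patients by age), in which case k is set to 10.

　　In this experiment K-means also consists of those two steps: assign each sample to a cluster, then recompute the centers. The assignment step, however, does not compute the Euclidean distance between two vectors directly; it uses inner products. For a sample a from matrix A we want the sample b in matrix B (here the centroids) with the smallest distance, i.e. min (a-b)^2, which is equivalent to min (a^2+b^2-2*a*b), which in turn is equivalent to max (a*b-0.5*a^2-0.5*b^2). Suppose a is one fixed input sample and b ranges over the centroids. When the fixed a is compared with different b, the a^2 term is the same for every b and can be dropped, so the problem reduces to max (a*b-0.5*b^2). This is the core idea of the run_kmeans function. (I did not understand the program at first and only worked it out gradually; presumably computing the K-means distances with matrix operations like this is faster.)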
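
　　The equivalence can be checked on a toy problem. The sketch below uses made-up samples and centroids (not data from the demo) and compares the labels obtained from explicit squared distances with the labels obtained from the inner-product score a*b - 0.5*b^2.

```matlab
% Toy data (illustrative values only): one sample per row, one centroid per row.
X = [0 0; 1 1; 4 5; 5 4];
centroids = [0.5 0.5; 4.5 4.5];

% Direct approach: squared Euclidean distance from every sample to every centroid.
dist2 = zeros(size(X,1), size(centroids,1));
for j = 1:size(centroids,1)
    delta = bsxfun(@minus, X, centroids(j,:));
    dist2(:,j) = sum(delta.^2, 2);
end
[~, labelsDirect] = min(dist2, [], 2);

% Inner-product approach: maximise b*a' - 0.5*|b|^2 over the centroids b.
c2 = 0.5 * sum(centroids.^2, 2);               % k-by-1
scores = bsxfun(@minus, centroids * X', c2);   % k-by-m
[~, labelsTrick] = max(scores);                % max over rows, 1-by-m

disp(isequal(labelsDirect(:), labelsTrick(:)));
```

　　The second form needs only one matrix product per batch, which is presumably why the demo prefers it.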

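　　Once clustering has produced the k centers, features are extracted for each sample. These features are built from the distances between the sample and the k centers. The simplest scheme is nearest neighbor (hard assignment): the feature for the closest center is set to 1 and all others to 0. The formula is as follows:

　　That encoding is extremely sparse (exactly one 1, everything else 0). A relaxed version works as follows: compute the mean distance d from the sample to the k centers, set every feature whose distance is greater than d to 0, and for distances less than d use the difference between d and that distance. This makes roughly half or more of the features zero, so the code is still sparse, while keeping more information about the centers that are relatively close. The formula in this case is: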
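
　　A minimal sketch of the two encodings for a single patch, assuming a made-up distance vector z (the distances of one patch to k = 4 centers). The variable names are illustrative and do not come from the demo.

```matlab
z = [1 2 3 6];              % distances from one patch to 4 centers (made up)

% Hard assignment: 1 for the nearest center, 0 elsewhere.
[~, nearest] = min(z);
fHard = zeros(size(z));
fHard(nearest) = 1;

% Relaxed ("triangle") encoding: max(0, mean distance - distance).
mu = mean(z);               % mean distance, here 3
fTri = max(mu - z, 0);

disp(fHard);                % 1 0 0 0
disp(fTri);                 % 2 1 0 0
```

　　Centers farther away than the mean distance get 0, and closer centers get a larger activation.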

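　　First, the CIFAR-10 dataset. After downloading it from http://www.cs.toronto.edu/~kriz/ and unpacking it, the contents are as follows:

　　Each data_batch is 10000x3072, i.e. 10,000 sample images, each 32*32 with three RGB channels. Each row is one sample, which is the opposite of the layout used in the programs of earlier posts. There are 5 data_batch files, so 50,000 training images in total. The test set test_batch has 10,000 images, 1,000 randomly selected from each of the 10 classes.

　　A short summary on mean normalization:

　　Take a matrix built from many images (each image is treated as a vector, so several images form a matrix). To whiten this matrix, it must be mean-normalized first. In earlier experiments the mean was sometimes taken within each image, i.e. over all dimensions of one image, and sometimes per dimension across the images in the matrix. Is that a contradiction? It is not; the two kinds of normalization serve different purposes. When computing the covariance matrix of the data, or before feeding training samples to a classifier, normalize each dimension (because the covariance describes the relationship between dimensions). When the goal is to normalize the brightness and contrast of each image, for example before whitening, normalize within each image (usually also dividing by the standard deviation within that image).

　　In addition, samples fed to an SVM classifier generally need to be standardized.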

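　　MATLAB notes:

　　A function defined in a MATLAB function file does not need a closing end statement.

　　svd(), eig():

　　In principle these two are quite different operations: eig computes an eigendecomposition and requires a square matrix, whereas svd computes a singular value decomposition and accepts any m*n matrix. For a symmetric positive semidefinite matrix such as a covariance matrix, the singular values equal the eigenvalues.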
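
　　A small sketch of the relationship, using a made-up 3*2 matrix: svd works on the non-square matrix directly, and its squared singular values match the eigenvalues of the square matrix A'*A.

```matlab
A = [3 0; 0 4; 0 0];            % 3-by-2, not square (illustrative values)
s = svd(A);                     % singular values in descending order
C = A' * A;                     % 2-by-2 symmetric positive semidefinite
e = sort(eig(C), 'descend');    % eigenvalues of the square matrix

disp([s.^2, e]);                % the two columns agree
```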

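　　cov:

　　cov(x) computes the covariance matrix of x. It expects each row of x to be one sample, i.e. each column is one dimension (variable) of the data. x does not need to be mean-normalized beforehand.

　　var:

　　This function computes the variance. The unbiased estimate divides by N-1; otherwise the divisor is N. By default the divisor is N-1, i.e. the unbiased estimate.

　　b1 = var(a); % default behavior
　　b2 = var(a, 0); % the default formula (divide by N-1)
　　c1 = var(a, 1); % the alternative formula (divide by N)
　　d1 = var(a, 0, 1); % operate on each column (divide by N-1)
　　d2 = var(a, 0, 2); % operate on each row (divide by N-1)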

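　　im2col:

　　This function pulls small blocks out of a large matrix and unrolls each block into a column vector. For example, B = im2col(A,[m n],block_type) takes m*n blocks from A, rearranges each block column-wise into a vector, and stacks these column vectors into a matrix. The block_type argument sets how the blocks are taken and has two possible values, 'distinct' and 'sliding'. In 'distinct' mode the blocks do not overlap in the original matrix, and blocks that run short of elements are padded with 0. In 'sliding' mode the window moves one element at a time, so neighboring blocks overlap and no zero padding is involved. If the argument is omitted, the default is 'sliding'.

　　random:

　　Unlike the common rand, randi and randn, random can draw from many different distributions, selected by the parameter name, e.g. binomial, Poisson or exponential. The general call is: Y = random(name,A,B,C,[m,n,...])

　　rdivide:

　　In bsxfun(@rdivide,A,B), where A is a matrix and B is a row vector, every element of A is divided by the value of B in the corresponding column.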
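
　　A quick illustration with made-up numbers: each column of A is divided by the matching entry of the row vector B.

```matlab
A = [2 4 6; 8 10 12];
B = [2 2 3];                    % one divisor per column of A
C = bsxfun(@rdivide, A, B);     % column j of A is divided by B(j)

disp(C);                        % [1 2 2; 4 5 4]
```

　　If B were a column vector with one entry per row of A, the same call would divide row-wise instead.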

　　sum:

　　The point here is summing over multidimensional arrays. If X is m*n*p, then sum(X,1) is 1*n*p, sum(X,2) is m*1*p, and sum(X,3) is m*n*1. In other words, the dimension being summed over becomes 1 and the others stay unchanged.
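
　　The sizes can be confirmed on a small made-up array. Note that MATLAB does not report a trailing singleton dimension, so the m*n*1 result shows up as m*n.

```matlab
X = ones(2, 3, 4);              % m = 2, n = 3, p = 4

s1 = size(sum(X, 1));           % [1 3 4]
s2 = size(sum(X, 2));           % [2 1 4]
s3 = size(sum(X, 3));           % [2 3], i.e. 2*3*1 with the trailing 1 dropped

disp(s1); disp(s2); disp(s3);
```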

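　　Experiment results:

　　The cluster centers learned by K-means are displayed below:

　　Recognition accuracy of the K-means method on the CIFAR-10 training images:

　　Train accuracy 86.112000%

　　Recognition accuracy on the test images:

　　Test accuracy 77.350000%

　　Main code of the experiment: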

kmeans_demo.m:

CIFAR_DIR='cifar-10-batches-mat/';

assert(strcmp(CIFAR_DIR, 'cifar-10-batches-mat/'), ...% strcmp returns 1 when the strings are equal
       ['You need to modify kmeans_demo.m so that CIFAR_DIR points to ' ...
        'your cifar-10-batches-mat directory.  You can download this ' ...
        'data from:  http://www.cs.toronto.edu/~kriz/cifar-10-matlab.tar.gz']);

%% Configuration
addpath minFunc;
rfSize = 6;
numCentroids=1600;% number of cluster centers
whitening=true;
numPatches = 400000;% 400,000 patches, quite a lot!
CIFAR_DIM=[32 32 3];

%% Load CIFAR training data
fprintf('Loading training data...\n');
f1=load([CIFAR_DIR '/data_batch_1.mat']);
f2=load([CIFAR_DIR '/data_batch_2.mat']);
f3=load([CIFAR_DIR '/data_batch_3.mat']);
f4=load([CIFAR_DIR '/data_batch_4.mat']);
f5=load([CIFAR_DIR '/data_batch_5.mat']);

trainX = double([f1.data; f2.data; f3.data; f4.data; f5.data]);% 50000*3072
trainY = double([f1.labels; f2.labels; f3.labels; f4.labels; f5.labels]) + 1; % add 1 to labels, so they run from 1 to 10
clear f1 f2 f3 f4 f5;% free the variables as soon as they are no longer needed

% extract random patches
patches = zeros(numPatches, rfSize*rfSize*3);% 400000*108
for i=1:numPatches
  if (mod(i,10000) == 0) fprintf('Extracting patch: %d / %d\n', i, numPatches); end
  
  r = random('unid', CIFAR_DIM(1) - rfSize + 1);% discrete uniform distribution
  c = random('unid', CIFAR_DIM(2) - rfSize + 1);
  % mod(i-1,size(trainX,1)) cycles through the images, so numPatches/size(trainX,1) patches are taken from each image
  patch = reshape(trainX(mod(i-1,size(trainX,1))+1, :), CIFAR_DIM);% 32*32*3
  patch = patch(r:r+rfSize-1,c:c+rfSize-1,:);% 6*6*3
  patches(i,:) = patch(:)';% each row of patches is one small patch
end

% normalize for contrast: subtract each row's mean, then divide by sqrt(variance + 10) of that row
% bsxfun(@rdivide,A,B) divides each element of A by the value in the corresponding row (or column) of B.
patches = bsxfun(@rdivide, bsxfun(@minus, patches, mean(patches,2)), sqrt(var(patches,[],2)+10));

% whiten
if (whitening)
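  C = cov(patches);% covariance matrix of patches
  M = mean(patches);
  [V,D] = eig(C);
  P = V * diag(sqrt(1./(diag(D) + 0.1))) * V';% P is the ZCA whitening matrix
  % before whitening, every dimension of the data must have zero mean
  patches = bsxfun(@minus, patches, M) * P;% note: the side on which the whitening matrix is applied depends on whether samples are rows or columns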
end

% run K-means
centroids = run_kmeans(patches, numCentroids, 50);% cluster the patches; the resulting centers are stored in centroids
show_centroids(centroids, rfSize); drawnow;

% extract training features
if (whitening)
  trainXC = extract_features(trainX, centroids, rfSize, CIFAR_DIM, M,P);% M is the mean vector, P the whitening matrix, CIFAR_DIM the image dimensions, rfSize the patch (receptive field) size
else
  trainXC = extract_features(trainX, centroids, rfSize, CIFAR_DIM);
end

% standardize data, so that everything fed to the SVM classifier is standardized
trainXC_mean = mean(trainXC);
trainXC_sd = sqrt(var(trainXC)+0.01);
trainXCs = bsxfun(@rdivide, bsxfun(@minus, trainXC, trainXC_mean), trainXC_sd);
trainXCs = [trainXCs, ones(size(trainXCs,1),1)];% append a constant 1 to every feature vector (bias term)

% train classifier using SVM
C = 100;
theta = train_svm(trainXCs, trainY, C);

[val,labels] = max(trainXCs*theta, [], 2);
fprintf('Train accuracy %f%%\n', 100 * (1 - sum(labels ~= trainY) / length(trainY)));

%%%%% TESTING %%%%%

%% Load CIFAR test data
fprintf('Loading test data...\n');
f1=load([CIFAR_DIR '/test_batch.mat']);
testX = double(f1.data);
testY = double(f1.labels) + 1;
clear f1;

% compute testing features and standardize
if (whitening)
  testXC = extract_features(testX, centroids, rfSize, CIFAR_DIM, M,P);
else
  testXC = extract_features(testX, centroids, rfSize, CIFAR_DIM);
end
testXCs = bsxfun(@rdivide, bsxfun(@minus, testXC, trainXC_mean), trainXC_sd);
testXCs = [testXCs, ones(size(testXCs,1),1)];

% test and print result
[val,labels] = max(testXCs*theta, [], 2);
fprintf('Test accuracy %f%%\n', 100 * (1 - sum(labels ~= testY) / length(testY)));

run_kmeans.m:

function centroids = run_kmeans(X, k, iterations)

  x2 = sum(X.^2,2);% sum of squares of each sample, i.e. its squared Euclidean distance to the origin
  centroids = randn(k,size(X,2))*0.1;%X(randsample(size(X,1), k), :); k is 1600 here, i.e. 1600 clusters
  BATCH_SIZE=1000;
  
  
  for itr = 1:iterations% iterations is the number of K-means iterations
    fprintf('K-means iteration %d / %d\n', itr, iterations);
    
    c2 = 0.5*sum(centroids.^2,2);% half the squared distance of each centroid to the origin

    summation = zeros(k, size(X,2));
    counts = zeros(k, 1);
    
    loss =0;
    
    for i=1:BATCH_SIZE:size(X,1) % X has 400000 rows here, so this loop runs 400 times
      lastIndex=min(i+BATCH_SIZE-1, size(X,1));% lastIndex=1000,2000,3000,...
      m = lastIndex - i + 1;% number of samples in this batch (1000, possibly fewer in the last batch)
      % This also finds each sample's label: min (a-b)^2 is equivalent to min (a^2+b^2-2*a*b), i.e. max (a*b-0.5*a^2-0.5*b^2), with a an input sample and b a centroid.
      % When one sample a is compared with all centroids b, the a^2 term is constant and can be dropped, so this reduces to max (a*b-0.5*b^2).
      [val,labels] = max(bsxfun(@minus,centroids*X(i:lastIndex,:)',c2));% val is a 1*m row vector; labels holds the cluster index assigned to each sample in this iteration
      loss = loss + sum(0.5*x2(i:lastIndex) - val');% the loss is only accumulated; it is not used afterwards
      
      S = sparse(1:m,labels,1,m,k,m); % labels as indicator matrix; the last argument is the maximum number of nonzeros
      summation = summation + S'*X(i:lastIndex,:);% 1600*108
      counts = counts + sum(S,1)';% 1600*1 column vector; each entry is the number of samples in that cluster
    end


    centroids = bsxfun(@rdivide, summation, counts);% step 2, move centroids
    
    % just zap empty centroids so they don't introduce NaNs everywhere.
    badIndex = find(counts == 0);
    centroids(badIndex, :) = 0;% reset empty clusters (0/0 would give NaN)
  end

extract_features.m:

function XC = extract_features(X, centroids, rfSize, CIFAR_DIM, M,P)
  assert(nargin == 4 || nargin == 6);
  whitening = (nargin == 6);
  numCentroids = size(centroids,1);% number of centroids
  
  % compute features for all training images
  XC = zeros(size(X,1), numCentroids*4);% why 4? because the pooling below is done over 4 quadrants
  for i=1:size(X,1)
    if (mod(i,1000) == 0) fprintf('Extracting features: %d / %d\n', i, size(X,1)); end
    
    % extract overlapping sub-patches into rows of 'patches'
    patches = [ im2col(reshape(X(i,1:1024),CIFAR_DIM(1:2)), [rfSize rfSize]) ;% like a convolution, take out small patches; after the transpose each row of patches holds the RGB values of one small block of the image
                im2col(reshape(X(i,1025:2048),CIFAR_DIM(1:2)), [rfSize rfSize]) ;% so each row is one 108-dimensional sample, and each image contributes 27*27 rows
                im2col(reshape(X(i,2049:end),CIFAR_DIM(1:2)), [rfSize rfSize]) ]';

    % do preprocessing for each patch
    
    % normalize for contrast: mean-normalize within each sample before whitening
    patches = bsxfun(@rdivide, bsxfun(@minus, patches, mean(patches,2)), sqrt(var(patches,[],2)+10));
    % whiten
    if (whitening)
      patches = bsxfun(@minus, patches, M) * P;
    end
    
    % compute 'triangle' activation function
    xx = sum(patches.^2, 2);
    cc = sum(centroids.^2, 2)';
    xc = patches * centroids';
    
    z = sqrt( bsxfun(@plus, cc, bsxfun(@minus, xx, 2*xc)) ); % distances = xx^2+cc^2-2*xx*cc;
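    [v,inds] = min(z,[],2);% the empty brackets are required; otherwise min would compare the elements of z with 2. Here 2 means dimension 2 of z
    mu = mean(z, 2); % average distance to centroids for each patch
    patches = max(bsxfun(@minus, mu, z), 0);% each row holds: mean distance to the 1600 centers minus the distance to each center, clipped below at 0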
    % patches is now the data matrix of activations for each patch
    
    % reshape to numCentroids-channel image
    prows = CIFAR_DIM(1)-rfSize+1;
    pcols = CIFAR_DIM(2)-rfSize+1;
    patches = reshape(patches, prows, pcols, numCentroids);
    
    % pool over quadrants
    halfr = round(prows/2);
    halfc = round(pcols/2);
    q1 = sum(sum(patches(1:halfr, 1:halfc, :), 1),2);% sum of activations over the quadrant, a 1*1*1600 array
    q2 = sum(sum(patches(halfr+1:end, 1:halfc, :), 1),2);
    q3 = sum(sum(patches(1:halfr, halfc+1:end, :), 1),2);
    q4 = sum(sum(patches(halfr+1:end, halfc+1:end, :), 1),2);
    
    % concatenate into feature vector
    XC(i,:) = [q1(:);q2(:);q3(:);q4(:)]';% similar to a pooling operation
  end

train_svm.m:

function theta = train_svm(trainXC, trainY, C)
  
  numClasses = max(trainY);
  %w0 = zeros(size(trainXC,2)*(numClasses-1), 1);
  w0 = zeros(size(trainXC,2)*numClasses, 1);
  w = minFunc(@my_l2svmloss, w0, struct('MaxIter', 1000, 'MaxFunEvals', 1000), ...
              trainXC, trainY, numClasses, C);

  theta = reshape(w, size(trainXC,2), numClasses);
  
% 1-vs-all L2-svm loss function;  similar to LibLinear.
function [loss, g] = my_l2svmloss(w, X, y, K, C)
  [M,N] = size(X);
  theta = reshape(w, N,K);
  Y = bsxfun(@(y,ypos) 2*(y==ypos)-1, y, 1:K);

  margin = max(0, 1 - Y .* (X*theta));
  loss = (0.5 * sum(theta.^2)) + C*mean(margin.^2);
  loss = sum(loss);  
  g = theta - 2*C/M * (X' * (margin .* Y));
  g = g(:);

  %[v,i] = max(X*theta,[],2);
  %sum(i ~= y) / length(y)

　　References:

Deep learning: 20 (Analysis of single-layer networks in unsupervised feature learning)

　　An Analysis of Single-Layer Networks in Unsupervised Feature Learning, Adam Coates, Honglak Lee, and Andrew Y. Ng. In AISTATS 14, 2011.

http://www.cs.toronto.edu/~kriz/

　　http://ai.stanford.edu/~ang/papers.php

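Deep learning: 26 (A simple understanding of sparse coding)

　　Sparse coding:

　　This section gives a brief introduction to sparse coding. Sparse coding is an important branch of deep learning and can likewise extract good features from a dataset. The content follows the Stanford deep learning tutorial pages Sparse Coding and Sparse Coding: Autoencoder Interpretation; the corresponding Chinese tutorial pages are 稀疏編碼 and 稀疏編碼自編碼表達.

　　Before that, some familiarity with convex optimization helps. Baidu Baike explains it as follows: "convex optimization" is a special kind of optimization in which the objective function is convex and the feasible set defined by the constraints is a convex set; in other words, both the objective and the constraints are "convex".

　　Now for a brief introduction to sparse coding. Sparse coding decomposes each input sample in X into a linear combination of basis elements, and the coefficients in front of these basis elements are the features of the input sample. The decomposition is written as: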
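
　　Assuming the notation of the UFLDL tutorial that this post follows, with basis vectors phi_i and coefficients a_i, the decomposition can be sketched as:

```latex
x = \sum_{i=1}^{k} a_i \phi_i
```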

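　　In general the number of basis vectors k is required to be very large, at least larger than the number of elements n in x, because such an overcomplete basis captures the intrinsic structure and features of the input data more easily. The familiar PCA algorithm also finds a set of basis vectors to decompose X, but that set is small, so the coefficients a are uniquely determined. In sparse coding k is much larger than n and the coefficients a are no longer unique. The usual remedy is to impose a sparsity constraint on a, which is where sparse coding comes from. The corresponding cost function (earlier posts said "loss function"; from now on "cost function" is used, which seems the better translation) is: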
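
　　A sketch of that cost function, assuming the form given in the UFLDL Sparse Coding page, where x^(j) is the j-th of m samples, a_i^(j) its coefficients, and S(.) a sparsity penalty:

```latex
\min_{a_i^{(j)},\,\phi_i}\;
  \sum_{j=1}^{m}\left\lVert x^{(j)} - \sum_{i=1}^{k} a_i^{(j)}\phi_i \right\rVert^2
  + \lambda \sum_{i=1}^{k} S\!\left(a_i^{(j)}\right)
```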

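　　The first term is the cost of reconstructing the input X, the second term S(.) is the sparsity penalty on the decomposition coefficients, and lambda is a constant that weights the two costs. There is still a problem: the coefficients a can be scaled down to very small values while the basis vectors are scaled up, which leaves the first term essentially unchanged and makes the sparsity penalty small, yet fails to achieve the goal, namely that only a few coefficients are far from 0 rather than most coefficients being nonzero (even if only slightly). The usual fix is to constrain the values of the basis vectors as well. The constrained cost function is: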
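
　　Under the same assumed UFLDL notation, the constrained problem bounds the squared norm of every basis vector by a constant C:

```latex
\min_{a_i^{(j)},\,\phi_i}\;
  \sum_{j=1}^{m}\left\lVert x^{(j)} - \sum_{i=1}^{k} a_i^{(j)}\phi_i \right\rVert^2
  + \lambda \sum_{i=1}^{k} S\!\left(a_i^{(j)}\right)
\quad \text{subject to} \quad
  \lVert \phi_i \rVert^2 \le C,\;\; i = 1,\dots,k
```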

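　　Probabilistic interpretation of sparse coding:

　　This part explains sparse coding from a probabilistic point of view. I did not fully follow it, so this is only my rough understanding. Once the error is taken into account, the sparse coding decomposition of an input sample X becomes:

　　The goal is to find a set of basis vectors Ф such that the probability distribution of the input data under the model is as close as possible to the empirical distribution of the input data. Measuring similarity with the KL divergence, this means minimizing their KL divergence, i.e. minimizing the following expression:

　　Since the empirical distribution of the input data is fixed, minimizing the expression above is equivalent to maximizing the log-likelihood of the data under the model.

　　After placing a prior on the coefficients a and approximating the integral, this finally becomes equivalent to minimizing the following energy function:

　　This ties in nicely with the sparse coding cost function.

　　By now it should be clear that sparse coding is slow in practice: even after the basis Ф has been learned in the training phase, the test phase still has to solve a convex optimization problem to obtain the features (the coefficients in front of the basis vectors). This is slower than an ordinary feedforward neural network, where a forward pass needs only a matrix multiplication, an addition and a function evaluation.

　　Autoencoder interpretation of sparse coding:

　　First, the Lk norm of a vector x is ||x||_k = (sum_i |x_i|^k)^(1/k). So the L1 norm is the sum of the absolute values of the elements, and the L2 norm is the Euclidean distance from the vector to the origin.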

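　　In matrix form the sparse coding cost function is:

　　As before, a sparsity penalty is placed on the features s using the L1 norm, and the basis matrix A is kept from growing too large with a squared L2 norm. The L1 norm of the features is not differentiable at 0, so gradient descent and similar methods cannot be applied to the cost function above directly. To make it differentiable at 0, the formula is changed to: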
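
　　A sketch of both forms, assuming the UFLDL shorthand in which sqrt(s^2 + epsilon) stands for the sum over all entries s_k of sqrt(s_k^2 + epsilon), with epsilon a small smoothing constant:

```latex
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2
\qquad\longrightarrow\qquad
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2
```

　　As epsilon goes to 0 the smoothed term approaches |s|, while for epsilon > 0 it is differentiable everywhere.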

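　　Topographic sparse coding:

　　Topographic sparse coding imitates the cerebral cortex, where adjacent neurons detect similar features. In deep learning we would like the learned features to have this "topographic order" as well: if the features are arranged into a matrix, adjacent features in the matrix should be similar. Concretely, the L1 sparsity penalty on the feature coefficients is replaced by a sum of L1 penalties over groups of features. Adjacent groups overlap, so a change in the overlapping values changes the penalty of each group involved, which mimics the behavior of the cortex. The cost function becomes:

　　In matrix form:

　　Summary:

　　In practice, to write correct optimization code that converges quickly and reliably to the optimum, the following tricks can be used:

Split the input samples into several small mini-batches. Each iteration then processes fewer samples and runs much faster, and overall convergence also speeds up (I have not yet worked out why).
Do not initialize s randomly. It is usually initialized as follows:

　　Finally, optimizing the cost function in practice goes roughly like this:

Randomly initialize A
Repeat the following steps until convergence
1. Randomly select a small mini-batch.
2. Initialize s as described above.
3. With A fixed from the previous step, solve for the s that minimizes J(A,s)
4. With the s obtained in the previous step, solve for the A that minimizes J(A,s)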
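
　　A rough sketch of step 2, under the assumption that s is initialized by projecting the mini-batch onto the current basis and rescaling each feature by the squared length of its basis vector. The exact scaling is an assumption here, and the matrices are made-up toy values.

```matlab
A = [1 0 1; 0 2 1];          % hypothetical basis: n = 2, k = 3, one basis vector per column
x = [1 2; 3 4];              % hypothetical mini-batch: n = 2, m = 2, one sample per column

s = A' * x;                              % k-by-m projection onto the basis
normA = sum(A .^ 2, 1)';                 % k-by-1 squared lengths of the basis vectors
s = bsxfun(@rdivide, s, normA);          % rescale each feature (row of s)

disp(s);                                 % [1 2; 1.5 2; 2 3]
```

　　Starting from this projection rather than from random values gives the optimizer for s a point that already reconstructs the batch reasonably well.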

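　　References:

Sparse Coding

Sparse Coding: Autoencoder Interpretation

稀疏編碼 (Chinese version of Sparse Coding)

稀疏編碼自編碼 (Chinese version of Sparse Coding: Autoencoder Interpretation)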

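Deep learning: 27 (Differentiating matrix norms in sparse coding)

　　Preface:

　　Computing the partial derivatives of the sparse coding cost function requires differentiating matrix norms. This should be common in other models too, for example when penalizing the entries of a matrix so that they do not grow too large, which can be done with the F-norm (introduced below). I looked up material on differentiating matrix norms, and this section gives a brief introduction.

　　First, many people online confuse the 2-norm with the norm for F=2, or in other words confuse the entrywise matrix p-norm with the induced p-norm (possibly because different books define them differently). Below I follow the definitions used by the authoritative matrix reference the matrix cookbook and by MATLAB's built-in functions. The matrix cookbook is a very good reference; looking up matrix formulas in it is as convenient as using a dictionary.

　　The induced 2-norm of a matrix is what we usually call the 2-norm. It is defined as: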
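
　　In standard notation, with A^H the conjugate transpose (plain transpose for real matrices) and lambda_max the largest eigenvalue, the induced 2-norm can be written as:

```latex
\lVert A \rVert_2
  = \max_{x \neq 0} \frac{\lVert Ax \rVert_2}{\lVert x \rVert_2}
  = \sqrt{\lambda_{\max}\!\left(A^{H}A\right)}
```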

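　　The matrix norm for F=2, on the other hand, is the norm most often used in practical optimization. It is also called the Frobenius norm and is defined as: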
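
　　In standard notation, with a_ij the entries of A, the Frobenius norm can be written as:

```latex
\lVert A \rVert_F
  = \sqrt{\sum_{i}\sum_{j} \lvert a_{ij} \rvert^{2}}
  = \sqrt{\operatorname{trace}\!\left(A^{H}A\right)}
```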

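　　With this in mind, recall that in the earlier post Deep learning: 26 (A simple understanding of sparse coding), Prof. Ng gives the sparse coding cost as:

　　Prof. Ng calls, for example, the first term an l2 norm. Under the definitions used here that is not accurate; strictly speaking it should be the Frobenius norm (though his definition may simply differ, and what matters is that the problem gets solved). After all, if MATLAB's 2-norm were used for Ng's l2 norm, the result would be wrong.

　　To check this, here is a simple experiment in MATLAB:

%% elementwise definition: sum of squares of all entries of a, then square root
a = magic(3);
b = a.^2;
c = sum(b(:));
d = sqrt(c)

%% MATLAB's built-in 2-norm
e = norm(a,2)

%% square root of the largest eigenvalue of a'*a
f = a'*a;
g = eig(f);
h = max(g);
i = sqrt(h)

%% Frobenius norm formula (F=2)
j = sqrt(trace(a*a'))

%% MATLAB's built-in Frobenius norm
k = norm(a,'fro')

　　The output is: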

　　d =

　　16.8819

　　e =

　　15.0000

　　i =

　　 15.0000

　　j =

　　16.8819

　　k =

　　 16.8819

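　　The results show that the 2-norm computed from its definition and MATLAB's 2-norm agree; both are 15. The Frobenius norm formula, MATLAB's Frobenius norm function, and the elementwise definition of the Frobenius norm also agree, giving 16.8819. The experiment is consistent with the description above.

　　Now, how do we differentiate the first term of the sparse coding cost function with respect to the matrices A and s? This is clearly a matter of differentiating a Frobenius norm. When differentiating with respect to A, s and X are treated as constants, and similarly for s. The material on the forum http://www.mathchina.net/dvbbs/dispbbs.asp?boardid=4&Id=3673 leads to the answer. For the derivative with respect to s, see the following worked example:

　　For the derivative with respect to A, see: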
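
　　A sketch of the derivation through the trace, assuming real matrices and writing the first term of the cost as f(A, s) = ||As - X||_F^2:

```latex
\begin{aligned}
f(A, s) &= \lVert As - X \rVert_F^2
         = \operatorname{tr}\!\left((As - X)^{T}(As - X)\right) \\
        &= \operatorname{tr}\!\left(s^{T}A^{T}As\right)
           - 2\operatorname{tr}\!\left(X^{T}As\right)
           + \operatorname{tr}\!\left(X^{T}X\right) \\[4pt]
\frac{\partial f}{\partial s} &= 2A^{T}As - 2A^{T}X = 2A^{T}(As - X) \\[4pt]
\frac{\partial f}{\partial A} &= 2Ass^{T} - 2Xs^{T} = 2(As - X)s^{T}
\end{aligned}
```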

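　　Summary:

　　I can now tell the 2-norm and the F=2 norm apart; beyond that, the techniques of matrix differentiation need practice. So far, though, I have not found a formula for the derivative of the matrix 2-norm and do not know how to derive one.

　　References:

Matrix norm - Wikipedia, the free encyclopedia

　　the matrix cookbook

Deep learning: 26 (A simple understanding of sparse coding)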

http://www.mathworks.com/matlabcentral/newsreader/view_thread/287712

http://www.mathchina.net/dvbbs/dispbbs.asp?boardid=4&Id=3673

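Deep learning: 28 (Using the backpropagation idea to differentiate matrix norms in sparse coding)

　　Preface:

　　Optimizing the sparse coding objective involves matrix differentiation, because it contains many derivatives of matrix norms. I am also not fluent in matrix calculus: when deriving the derivative of the topographic sparse coding cost function (the non-topographic case is much simpler) from the earlier post Deep learning: 26 (A simple understanding of sparse coding) with respect to the feature variable s, I spent most of a day on scratch paper without getting the right result. The expression is:

　　Continuing with the UFLDL tutorial, I found that the page Deriving gradients using the backpropagation idea already gives the answer I wanted. The idea is to borrow the way a BP neural network differentiates its cost function: turn the cost function above into a multi-layer network, then use the error term of the nodes in each layer to derive, backwards, the derivative at each layer. Andrew Ng really deserves admiration; the tutorial addresses exactly what we need.

　　Before using the BP idea to compute derivatives of matrix norms, here is the form of the BP algorithm used for this kind of problem (slightly different from the classic BP algorithm, e.g. in how the error of the last layer is computed; I have not yet worked out the reason for the change):

　　1. For the nodes in the output layer of the network (the network built from the cost function), compute the error term with the following formula:

　　2. From the second-to-last layer down to layer 2, compute the error term of each layer in turn:

　　3. Compute the partial derivatives with respect to the parameters of layer l (for layer 0, this is the partial derivative of the cost function with respect to the input data treated as a parameter):

　　For example, the previous post Deep learning: 27 (Differentiating matrix norms in sparse coding) converted the matrix norm into a trace and applied the trace differentiation formulas; that was for the derivative of the non-topographic cost function with respect to the weight matrix A. Here the BP idea is used for the derivative with respect to the feature matrix s. The cost function is:

　　Treat s in the expression as the input of the network and turn the variables and transformations in the formula, one by one, into the following network structure:

　　List the weights of each layer, the activation function and its derivative, the error term, and the input of each layer, as follows:

　　Following the BP procedure above, the error of the last layer is the partial derivative of J, the sum of the outputs of the last layer, with respect to the input of a node in that layer, where J is:

　　Differentiating J with respect to Zi only involves the term containing Zi, so the partial derivative is 2*Zi.

　　The partial derivative of the cost function with respect to the input X (here s) then follows directly from the formulas: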
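
　　For the reconstruction term F(s) = ||As - x||^2, the error term 2*Zi propagated back through the weights A gives the gradient 2*A'*(A*s - x). The sketch below checks that expression against central finite differences; A, x and s are made-up values, and only the reconstruction term is assumed.

```matlab
A = [1 2; 3 4; 5 6];            % illustrative 3-by-2 weight matrix
x = [1; 0; -1];                 % illustrative data vector
s = [0.5; -0.25];               % point at which the gradient is evaluated

F = @(v) sum((A*v - x).^2);     % F(s) = ||A*s - x||^2
gradAnalytic = 2 * A' * (A*s - x);

h = 1e-6;                       % step for the central difference
gradNumeric = zeros(size(s));
for idx = 1:numel(s)
    e = zeros(size(s));
    e(idx) = h;
    gradNumeric(idx) = (F(s + e) - F(s - e)) / (2*h);
end

disp([gradAnalytic, gradNumeric]);    % the two columns should agree closely
```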

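　　Now back to the derivative that I could not derive in a whole day: the derivative of the topographic sparse coding cost function with respect to the feature matrix s, i.e. the formula given at the start of this post.

　　In the same way, it is converted into the following network structure:

　　Likewise, the parameters of the corresponding network are listed:

　　The output function J is:

　　And the final, rather magical, answer is:

　　This method is worth mastering in case a formula ever needs to be derived for a paper of my own.

　　References:

Deep learning: 26 (A simple understanding of sparse coding)

Deriving gradients using the backpropagation idea

Deep learning: 27 (Differentiating matrix norms in sparse coding)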

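Deep learning: 29 (Sparse coding exercise)

　　Preface

　　This section works through the sparse coding part of UFLDL, the Stanford online deep learning tutorial; see the page Exercise:Sparse Coding. The experiment learns from 20,000 natural-image patches using sparse coding and topographic sparse coding in turn, and examines the basis images that are learned. The training data is the natural-image set IMAGES, available on the tutorial site.

　　Experiment background

　　The main idea of sparse coding is to learn a set of "basis data" for the input dataset. Once the basis is available, every item in the input dataset can be written as a linear combination of the basis elements, and sparsity means that the combination coefficients are sparse, i.e. most of them are 0. Clearly each basis element has the same size as an input sample, and the number of basis elements is usually larger than the dimension of a sample. The simplest way to see this is the formula from the earlier post:

　　Here the input x is decomposed into a linear combination of the basis vectors Ф with coefficients ai. That is for a single sample only; machine learning deals with large datasets, so consider the case where the samples form a matrix. The earlier post Deep learning: 26 (A simple understanding of sparse coding) gave the cost function of the non-topographic sparse coding system as:

　　The cost function with topographic structure is:

　　The goal of training is to find the best parameter A with an optimization algorithm, since A is the set of basis vectors. Both cost functions above contain two unknown matrices, A and s, so a simple optimization method cannot be applied directly. The usual approach is alternating optimization: fix A and optimize s, then fix s and optimize A, and so on, stopping once the preset number of iterations is reached. The first thing to settle is the derivative of the cost function with respect to A and s.

　　These derivatives involve derivatives of matrix norms, and there are two common approaches. The first turns the problem into differentiating a matrix trace; see the earlier post Deep learning: 27 (Differentiating matrix norms in sparse coding). The second uses the BP idea; see Deep learning: 28 (Using the backpropagation idea to differentiate matrix norms in sparse coding).

　　The derivative of the cost function with respect to the weight matrix A is as follows (the topographic and non-topographic cases give the same result, because the two cost functions do not differ in A):

　　The derivative with respect to s for the non-topographic structure:

　　The derivative with respect to s for topographic sparse coding:

　　Some notes on this program:

Following the formulas above and our understanding, A consists of the learned basis vectors and S holds the coefficients of each sample in that basis. Under this convention A is n*k with each column a learned basis vector, S is k*m with each column the sparse coding coefficients of the corresponding input sample, and X is n*m with each column one sample. If the roles are reversed (this reading is not the intended one; it is only a different way of laying out the matrices), so that A holds the decomposition coefficients (codes) of the input X and S consists of the basis learned from the data, then the dimensions are as follows: with m samples, each an n-dimensional vector, X is m*n; learning k features to minimize the cost, A is m*k with row i the coefficients of sample i, and S is k*n with row i the i-th learned basis vector. In this experiment and in similar cases later, the correct version, i.e. the first convention, is used.
In MATLAB, right-dividing by a matrix A and right-multiplying by inv(A) are the same by definition, but the computed results can differ; right division is more accurate.
Note that the last operation in the formula for the derivative of the topographic cost with respect to s is an elementwise product, i.e. corresponding entries are multiplied. If it is implemented as an ordinary matrix product, gradient checking is very unlikely to pass.
The training images IMAGES are 512*512, 10 in total. 20,000 8*8 patches are extracted from these 10 large images; some of them are shown below: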

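　　Some MATLAB functions:

　　circshift:

　　This function shifts a matrix circularly. For example, B = circshift(A,shiftsize) shifts A circularly according to shiftsize, which is generally a vector with one entry per dimension. The first entry is the shift in the vertical direction (more precisely, along the first dimension; only the 2-D case is considered here, and the rest is analogous), with a positive value shifting down. The second entry is the shift in the horizontal direction, with a positive value shifting right.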
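
　　A short illustration on a made-up 3*3 matrix: a positive first entry moves rows down, and a negative second entry moves columns left, with wrap-around in both cases.

```matlab
A = [1 2 3; 4 5 6; 7 8 9];

B = circshift(A, [1 0]);    % rows shifted down by 1, last row wraps to the top
C = circshift(A, [0 -1]);   % columns shifted left by 1, first column wraps to the end

disp(B);                    % [7 8 9; 1 2 3; 4 5 6]
disp(C);                    % [2 3 1; 5 6 4; 8 9 7]
```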

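　　randperm:

　　This function generates a random row vector. randperm(n) returns a row vector of length n containing the values 1 to n in random order without repetition. randperm(n,k) returns a row vector of length k whose entries are also between 1 and n, without repetition.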
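
　　A small sketch of both call forms. The outputs are random, so only their properties are displayed.

```matlab
n = 10;
p = randperm(n);            % 1-by-10: the integers 1..10 in random order
q = randperm(n, 3);         % 1-by-3: three distinct integers drawn from 1..10

disp(isequal(sort(p), 1:n));        % 1: p is a permutation of 1..n
disp(numel(unique(q)) == 3);        % 1: no repeated entries in q
```

　　Taking the first k entries of randperm(n) is another way to draw k distinct indices.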

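　　questdlg:

　　button = questdlg('qstring','title','str1','str2','str3',default) opens a dialog box. qstring is the text of the dialog, title is its title, the next three arguments are the labels of the three buttons (in the positions of yes, no and cancel), and the last argument default selects the default button.

　　Experiment results:

　　In the alternating optimization, when s is fixed and A is optimized, A has a closed-form solution, so an optimizer such as L-BFGS is not needed. Setting the derivative of the cost function with respect to A to 0 gives the closed-form solution:

　　Note that the identity matrix must be multiplied by a coefficient (the number of samples); otherwise the A computed this way in the program does not pass the check.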
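
　　A sketch of that closed form, assuming the cost is J(A) = ||A*s - x||^2 / m + gamma * ||A||^2 with m the number of samples in the batch (the sparsity term does not depend on A). Setting the gradient to 0 then gives A = x*s' / (s*s' + gamma*m*I). The matrices below are made-up toy values.

```matlab
x = [1 2 3; 4 5 6];             % n = 2, m = 3, one sample per column
s = [1 0 1; 0 1 1];             % k = 2, m = 3, one code per column
gamma = 0.1;
m = size(x, 2);

% Closed-form minimiser; right division is used instead of inv().
A = (x * s') / (s * s' + gamma * m * eye(size(s, 1)));

% The gradient of J with respect to A should vanish at this A.
grad = 2 * (A*s - x) * s' / m + 2 * gamma * A;
disp(max(abs(grad(:))));        % close to 0
```

　　Dropping the factor m in front of the identity matrix would leave a nonzero gradient under this cost.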

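　　The non-topographic result learned in this setting:

　　The result above is not very clean; it uses 16*16 patches rather than 8*8.

　　With cg optimization and 256 features on 16*16 patches, the result is:

　　With 8*8 patches and 121 features, the result is (this one looks reasonable):

　　With lbfgs and 256 features on 16*16 patches, the result is (much worse, which shows that the optimization method affects the outcome):

　　Selected code with comments:

　　sparseCodingExercise.m: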

%% CS294A/CS294W Sparse Coding Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  sparse coding exercise. In this exercise, you will need to modify
%  sparseCodingFeatureCost.m and sparseCodingWeightCost.m. You will also
%  need to modify this file, sparseCodingExercise.m slightly.

% Add the paths to your earlier exercises if necessary
% addpath /path/to/solution

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

addpath minFunc;
numPatches = 20000;   % number of patches
numFeatures = 256;    % number of features to learn
patchDim = 16;         % patch dimension
visibleSize = patchDim * patchDim; % single-channel grayscale: 16*16 = 256 dims here (8*8 = 64 dims with 121 features in the other run)

% dimension of the grouping region (poolDim x poolDim) for topographic sparse coding
poolDim = 3;

% number of patches per batch
batchNumPatches = 2000; % i.e. 10 batches

lambda = 5e-5;  % L1-regularisation parameter (on features)
epsilon = 1e-5; % L1-regularisation epsilon |x| ~ sqrt(x^2 + epsilon)
gamma = 1e-2;   % L2-regularisation parameter (on basis)

%%======================================================================
%% STEP 1: Sample patches

images = load('IMAGES.mat');
images = images.IMAGES;
patches = sampleIMAGES(images, patchDim, numPatches);
display_network(patches(:, 1:64));

%%======================================================================
%% STEP 3: Iterative optimization
%  Once you have implemented the cost functions, you can now optimize for
%  the objective iteratively. The code to do the iterative optimization 
%  using mini-batching and good initialization of the features has already
%  been included for you. 
% 
%  However, you will still need to derive and fill in the analytic solution 
%  for optimizing the weight matrix given the features. 
%  Derive the solution and implement it in the code below, verify the
%  gradient as described in the instructions below, and then run the
%  iterative optimization.

% Initialize options for minFunc
options.Method = 'cg';
options.display = 'off';
options.verbose = 0;

% Initialize matrices
weightMatrix = rand(visibleSize, numFeatures); % visibleSize x numFeatures (256 x 256 here)
featureMatrix = rand(numFeatures, batchNumPatches); % numFeatures x batchNumPatches (256 x 2000 here)

% Initialize grouping matrix
assert(floor(sqrt(numFeatures)) ^2 == numFeatures, 'numFeatures should be a perfect square');
donutDim = floor(sqrt(numFeatures));
assert(donutDim * donutDim == numFeatures,'donutDim^2 must be equal to numFeatures');

groupMatrix = zeros(numFeatures, donutDim, donutDim); % numFeatures x donutDim x donutDim (256 x 16 x 16 here)
groupNum = 1;
for row = 1:donutDim
    for col = 1:donutDim 
        groupMatrix(groupNum, 1:poolDim, 1:poolDim) = 1;%poolDim=3
        groupNum = groupNum + 1;
        groupMatrix = circshift(groupMatrix, [0 0 -1]);
    end
    groupMatrix = circshift(groupMatrix, [0 -1, 0]);
end
groupMatrix = reshape(groupMatrix, numFeatures, numFeatures); % numFeatures x numFeatures (256 x 256 here)

if isequal(questdlg('Initialize grouping matrix for topographic or non-topographic sparse coding?', 'Topographic/non-topographic?', 'Non-topographic', 'Topographic', 'Non-topographic'), 'Non-topographic')
    groupMatrix = eye(numFeatures); % grouping matrix for the non-topographic case
end

% Initial batch
indices = randperm(numPatches);%1*20000
indices = indices(1:batchNumPatches);%1*2000
batchPatches = patches(:, indices);                           

fprintf('%6s%12s%12s%12s%12s\n','Iter', 'fObj','fResidue','fSparsity','fWeight');
warning off;
for iteration = 1:200   
  %  iteration = 1;
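    residual = weightMatrix * featureMatrix - batchPatches;  % named residual so the built-in error() is not shadowed
    fResidue = sum(residual(:) .^ 2) / batchNumPatches;  % the reconstruction error is averaged over the number of samples
    num_batches = size(batchPatches,2);  % number of patches in the batch, despite the name
    R = groupMatrix * (featureMatrix .^ 2);
    R = sqrt(R + epsilon);    
    fSparsity = lambda * sum(R(:));    % the sparsity and weight-decay terms are not divided by the number of samples
    
    fWeight = gamma * sum(weightMatrix(:) .^ 2);
    
    % on the first pass all of these values come from the random initialization above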
    fprintf('  %4d  %10.4f  %10.4f  %10.4f  %10.4f\n', iteration, fResidue+fSparsity+fWeight, fResidue, fSparsity, fWeight)
               
    % Select a new batch
    indices = randperm(numPatches);
    indices = indices(1:batchNumPatches);
    batchPatches = patches(:, indices);                    
    
    % Reinitialize featureMatrix with respect to the new batch,
    % following the method described in the online tutorial
    featureMatrix = weightMatrix' * batchPatches;
    normWM = sum(weightMatrix .^ 2)';
    featureMatrix = bsxfun(@rdivide, featureMatrix, normWM); 
    
    % Optimize for feature matrix    
    options.maxIter = 20;
    % with the weights held fixed, optimize the feature matrix
    [featureMatrix, cost] = minFunc( @(x) sparseCodingFeatureCost(weightMatrix, x, visibleSize, numFeatures, batchPatches, gamma, lambda, epsilon, groupMatrix), ...
                                           featureMatrix(:), options);
    featureMatrix = reshape(featureMatrix, numFeatures, batchNumPatches);                                      
    % closed-form minimizer of the weight cost for the fixed features
    weightMatrix = (batchPatches*featureMatrix')/(gamma*num_batches*eye(size(featureMatrix,1))+featureMatrix*featureMatrix');
    % evaluated at that minimizer, grad should be close to zero
    [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, batchPatches, gamma, lambda, epsilon, groupMatrix);
          
end
    figure;
    display_network(weightMatrix);

　　sparseCodingWeightCost.m:

function [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures,  patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingWeightCost - given the features in featureMatrix, 
%                         computes the cost and gradient with respect to
%                         the weights, given in weightMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
%                   vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
%                   for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
%                   features included in the rth group. groupMatrix(r, c)
%                   is 1 if the cth feature is in the rth group and 0
%                   otherwise.

    if exist('groupMatrix', 'var')
        assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
    else
        groupMatrix = eye(numFeatures); % non-topographic sparse coding corresponds to an identity grouping matrix
    end

    numExamples = size(patches, 2); % equals 5 in the gradient-check test

    weightMatrix = reshape(weightMatrix, visibleSize, numFeatures); % the inputs normally already have this shape
    featureMatrix = reshape(featureMatrix, numFeatures, numExamples);
    
    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   weights given in weightMatrix.     
    % -------------------- YOUR CODE HERE --------------------    
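    %% Cost function
    delta = weightMatrix*featureMatrix-patches;
    fResidue = sum(sum(delta.^2))./numExamples; % reconstruction error, averaged over the samples
    fWeight = gamma*sum(sum(weightMatrix.^2)); % keeps the entries of the basis from growing too large
%     sparsityMatrix = sqrt(groupMatrix*(featureMatrix.^2)+epsilon);
%     fSparsity = lambda*sum(sparsityMatrix(:)); % sparsity penalty on the features
%     cost = fResidue+fWeight+fSparsity; % full objective
    cost = fResidue+fWeight; % the sparsity term does not depend on the weights, so it is left out
    
    %% Gradient of the cost with respect to the weights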
    grad = (2*weightMatrix*featureMatrix*featureMatrix'-2*patches*featureMatrix')./numExamples+2*gamma*weightMatrix;
    grad = grad(:);
   
end
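
The closed-form weight update used in sparseCodingExercise.m follows from setting the gradient above to zero. A short derivation is sketched here, writing A for weightMatrix, S for featureMatrix, X for patches and m for numExamples (these symbols are shorthand chosen for this note, not names from the code):

```latex
J(A) = \frac{1}{m}\,\lVert AS - X \rVert_F^2 + \gamma\,\lVert A \rVert_F^2

\nabla_A J = \frac{2}{m}\,(AS - X)\,S^\top + 2\gamma A

\nabla_A J = 0
\;\Longrightarrow\;
A\,(SS^\top + \gamma m I) = XS^\top
\;\Longrightarrow\;
A = XS^\top\,(SS^\top + \gamma m I)^{-1}
```

The factor m in front of gamma comes from the 1/m on the reconstruction term; leaving it out is the kind of mismatch between cost, gradient and analytic solution described in the experiment summary below.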

　　sparseCodingFeatureCost.m:

function [cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingFeatureCost - given the weights in weightMatrix,
%                          computes the cost and gradient with respect to
%                          the features, given in featureMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
%                   vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
%                   for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
%                   features included in the rth group. groupMatrix(r, c)
%                   is 1 if the cth feature is in the rth group and 0
%                   otherwise.

    isTopo = 1;
%     L = size(groupMatrix,1);
%     [K M] = size(featureMatrix);
    if exist('groupMatrix', 'var')
        assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
        if(isequal(groupMatrix,eye(numFeatures)));
            isTopo = 0;
        end
    else
        groupMatrix = eye(numFeatures);
         isTopo = 0;
    end
    
    numExamples = size(patches, 2);
    weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
    featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   features given in featureMatrix.     
    %   You may wish to write the non-topographic version, ignoring
    %   the grouping matrix groupMatrix first, and extend the 
    %   non-topographic version to the topographic version later.
    % -------------------- YOUR CODE HERE --------------------
    
    
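    %% Cost function
    delta = weightMatrix*featureMatrix-patches;
    fResidue = sum(sum(delta.^2))./numExamples; % reconstruction error, averaged over the samples
%     fWeight = gamma*sum(sum(weightMatrix.^2)); % keeps the entries of the basis from growing too large
    sparsityMatrix = sqrt(groupMatrix*(featureMatrix.^2)+epsilon);
    fSparsity = lambda*sum(sparsityMatrix(:)); % sparsity penalty on the features
%     cost = fResidue+fSparsity+fWeight; % the weight matrix A is constant here, so fWeight can be left out
    cost = fResidue+fSparsity;

    %% Gradient of the cost with respect to the features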
    gradResidue = (-2*weightMatrix'*patches+2*weightMatrix'*weightMatrix*featureMatrix)./numExamples;
  
    % non-topographic case:
    if ~isTopo
        gradSparsity = lambda*(featureMatrix./sparsityMatrix);
    end
    
    % topographic case:
    if isTopo
        % careful: the final .* is an element-wise product, applied after the matrix products
        gradSparsity = lambda*groupMatrix'*(groupMatrix*(featureMatrix.^2)+epsilon).^(-0.5).*featureMatrix;
    end
    grad = gradResidue+gradSparsity;
    grad = grad(:);
    
end
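
The topographic sparsity gradient above is easy to get wrong, so a finite-difference check is worthwhile. The following is a minimal sketch in Python (standard library only) for a single feature column; the helper names and the tiny grouping matrix G and feature vector F are made up for illustration and are not part of the exercise code.

```python
import math

def sparsity_cost(G, F, lam, eps):
    """lam * sum(sqrt(G * F.^2 + eps)) for one feature column F."""
    total = 0.0
    for row in G:
        total += math.sqrt(sum(g * f * f for g, f in zip(row, F)) + eps)
    return lam * total

def sparsity_grad(G, F, lam, eps):
    """Analytic gradient: lam * (G' * (G * F.^2 + eps).^(-0.5)) .* F."""
    inv_root = [
        1.0 / math.sqrt(sum(g * f * f for g, f in zip(row, F)) + eps)
        for row in G
    ]
    grad = []
    for c, f in enumerate(F):
        # column c of G weighted by the inverse roots, i.e. entry c of G' * inv_root
        weight = sum(G[r][c] * inv_root[r] for r in range(len(G)))
        grad.append(lam * weight * f)  # element-wise product with F
    return grad

def numeric_grad(G, F, lam, eps, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for c in range(len(F)):
        plus = F[:c] + [F[c] + h] + F[c + 1:]
        minus = F[:c] + [F[c] - h] + F[c + 1:]
        diff = sparsity_cost(G, plus, lam, eps) - sparsity_cost(G, minus, lam, eps)
        grad.append(diff / (2 * h))
    return grad

# Hypothetical 3-feature example with overlapping groups of two features each.
G = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
F = [0.5, -1.2, 2.0]
analytic = sparsity_grad(G, F, lam=5e-5, eps=1e-5)
numeric = numeric_grad(G, F, lam=5e-5, eps=1e-5)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

If the element-wise product were replaced by a matrix product, or the transpose on G were dropped, the two gradients would no longer agree.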

　　sampleIMAGES.m:

function patches = sampleIMAGES(images, patchsize,numpatches)
% sampleIMAGES
% Returns numpatches patches for training

% load IMAGES;    % load images from disk 

%patchsize = 8;  % we'll use 8x8 patches 
%numpatches = 10000;

% Initialize patches with zeros.  Your code will fill in this matrix--one
% column per patch, numpatches columns. 
patches = zeros(patchsize*patchsize, numpatches);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Fill in the variable called "patches" using data 
%  from IMAGES.  
%  
%  IMAGES is a 3D array containing 10 images
%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
%  and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
%  it. (The contrast on these images look a bit off because they have
%  been preprocessed using using "whitening."  See the lecture notes for
%  more details.) As a second example, IMAGES(21:30,21:30,1) is an image
%  patch corresponding to the pixels in the block (21,21) to (30,30) of
%  Image 1
numImages = size(images, 3);                      % 10 images in IMAGES.mat
patchesPerImage = floor(numpatches / numImages);  % 2000 per image when numpatches = 20000
for imageNum = 1:numImages % draw the same number of random patches from every image
    [rowNum, colNum] = size(images(:,:,imageNum));
    for patchNum = 1:patchesPerImage
        xPos = randi([1,rowNum-patchsize+1]);
        yPos = randi([1, colNum-patchsize+1]);
        patches(:,(imageNum-1)*patchesPerImage+patchNum) = reshape(images(xPos:xPos+patchsize-1,yPos:yPos+patchsize-1,...
                                                        imageNum),patchsize*patchsize,1);
    end
end


%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure 
% the range of pixel values is also bounded between [0,1]
% patches = normalizeData(patches);

end


%% ---------------------------------------------------------------
function patches = normalizeData(patches)

% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer

% Remove DC (mean of images). 
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd; % for roughly Gaussian data, about 99.7% of the values lie within 3 standard deviations
                                                % after this step the data lie in [-1, 1]

% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;

end

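　　Experiment summary:

　　Topographic sparse coding is unfinished: the run produces nothing useful, and I would welcome any pointers.

　　2013.5.6:

　　Non-topographic sparse coding now works. The earlier problem was that the reconstruction-error term of the cost function was not divided by the number of samples (a hint from a reader in the comments below this post), so the cost, its gradient, and the analytic solution for A were all wrong.

　　Topographic sparse coding still fails to train, though: the cost does not decrease during training and instead moves with no clear pattern.

　　2013.5.14:

　　Topographic sparse coding is now mostly working. The main reason no features were learned before is that the final patch normalization in sampleIMAGES.m had not been commented out. (My guess: the large source images are already whitened, so applying that lossy normalization on top may add further error and leave the samples unsuitable for sparse coding.) In addition, following a tip from @地皮菜 in the discussion group, switching the optimizer from lbfgs to cg gives reasonable results. Evidently the choice of optimization algorithm also affects the final result.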

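　　References:

Exercise:Sparse Coding

Deep learning: 26 (A simple understanding of sparse coding)

Deep learning: 27 (Matrix norm derivatives in sparse coding)

Deep learning: 28 (Deriving the matrix norm derivatives in sparse coding with the BP idea)

Deep learning: 30 (Data preprocessing techniques)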

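　　Preface:

　　This post describes how to preprocess data in a practical machine learning system. In my experience, data preprocessing accounts for at least a third of the work in designing the whole system. Collecting the data is already time-consuming and laborious, because many factors have to be taken into account, and the data sometimes also need tedious labelling. Once that is done we have the raw data and can move on to the preprocessing described below. This post follows the UFLDL web tutorial page Data Preprocessing; a link to the Chinese version is at the bottom of that page.

　　Basics:

　　In general, how well an algorithm performs depends to some extent on whether the data have been normalized and whitened. For a specific problem, however, the parameters of these preprocessing steps are hard to set precisely unless you understand the algorithm in question very well. The techniques are introduced below from two angles: normalization and whitening.

　　Data normalization:

　　Data normalization usually covers three things: simple rescaling, per-example mean subtraction, and feature standardization. Rescaling is needed because each dimension of the data means something different, so their ranges may differ; it is therefore useful to bring them all into a fixed range, typically [0, 1] or [-1, 1]. A further benefit is that the default parameters of later steps (the whitening step, for example) then need little retuning.

　　Per-example mean subtraction is mainly used on stationary data, that is, data whose dimensions all share the same statistics. For natural images, for instance, it reduces the influence of image brightness, which is rarely of interest. It only suits grayscale images, though; in RGB and other color images the channels do not share the same statistics, so it is rarely used there.

　　Feature standardization gives every dimension of the data zero mean and unit variance. This matters in many machine learning algorithms, SVMs for example.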
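
As an illustration of feature standardization, here is a minimal Python sketch using only the standard library. The function name standardize_features and the toy data are assumptions made for this example. Note the orientation: here each row is a sample and each column a feature, whereas the MATLAB code in this series stores one sample per column.

```python
import statistics

def standardize_features(samples):
    """Give every feature (column) zero mean and unit variance.

    samples: list of rows, one row per sample.
    Returns (standardized rows, per-feature means, per-feature standard deviations).
    """
    columns = list(zip(*samples))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    scaled = [
        # a constant feature (std == 0) carries no information and is mapped to 0
        [(x - mu) / sd if sd > 0 else 0.0 for x, mu, sd in zip(row, means, stds)]
        for row in samples
    ]
    return scaled, means, stds

# Two features on very different scales.
data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaled, means, stds = standardize_features(data)
print(means)
print(scaled)
```

The means and standard deviations should be computed on the training set only and then reused unchanged on the test set.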

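　　Data whitening:

　　Whitening is applied after normalization. In practice, many deep learning algorithms depend on whitening for good performance. The features must be zero-mean before whitening, but this holds automatically once feature standardization has been applied. The most important choice in whitening is the parameter epsilon, which has a strong effect on the results.

　　In reconstruction-based models (the common RBM, sparse coding and autoencoder all belong here, since they essentially reconstruct their input), epsilon is usually chosen so that whitening also low-pass filters the input. What counts as a suitable epsilon is hard to pin down: if epsilon is too small there is no filtering effect, a lot of noise gets through, and the model then has to fit that noise; if it is too large the original data are blurred too much. The usual approach is to plot the eigenvalues of the transformed data: epsilon is reasonable when the small eigenvalues are all close to 0, so that the long tail of the curve lies close to the x-axis. The horizontal axis of that plot is the eigenvalue index, with the eigenvalues sorted from largest to smallest.

　　The tutorial gives a practical tip: if the data have been scaled to a reasonable range (such as [0, 1]), start tuning from epsilon = 0.01 or epsilon = 0.1.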
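
For reference, the role of epsilon is visible in the ZCA whitening transform as the patches_preprocessing.m script in the next post computes it. In the sketch below, x is a zero-mean sample, m is the number of samples, and s_k are the eigenvalues of the covariance matrix (the symbols are chosen here for illustration):

```latex
\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)\top} = U S U^\top,
\qquad
x_{\mathrm{ZCA}} = U \,\operatorname{diag}\!\left(\frac{1}{\sqrt{s_k + \epsilon}}\right) U^\top x
```

Without epsilon, a direction with a tiny eigenvalue s_k would be amplified by 1/sqrt(s_k); with epsilon much larger than s_k the gain is capped near 1/sqrt(epsilon), which is where the low-pass filtering effect comes from.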

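　　For ICA models based on orthogonalization, epsilon should be kept as small as possible, because these models orthogonalize the learned features to remove the correlation between dimensions. (I do not fully understand this yet, since I have not had time to study the ICA model; I will come back to it later.)

　　The tutorial ends with standard preprocessing pipelines for some common kinds of data. These are tied to specific datasets, so treat them as a guide only.

　　References:

Data Preprocessing

Deep learning: 31 (Data preprocessing exercise)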

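　　Preface:

　　This post practises some data preprocessing steps that precede the design of a machine learning system (not only deep learning). The basics were covered in the earlier post Deep learning: 30 (Data preprocessing techniques): they come down to normalization and whitening, where normalization includes rescaling, mean and variance normalization and so on, and the common kinds of whitening are PCA whitening and ZCA whitening.

　　Experiment background:

　　The data used here are the ASL fingerspelling dataset, which can be downloaded from http://personal.ee.surrey.ac.uk/Personal/N.Pugeault/index.php?section=FingerSpellingDataset. Some basic facts about this ASL dataset:

　　It is a library of static hand-shape images for 24 letters (the signs for j and z involve motion and are left out). Every subject and every letter has both color and depth images, and the training and test data together are about 2.2 GB. (The images are stored as 8-bit integers and are usually converted to floating point when processed in MATLAB, so the working data size exceeds 10 GB.)

　　The images were captured with a Kinect from 5 different people, with roughly 500 images per letter per person, so there are about 24*5*500 = 60k color images. This is only approximate, since not every person has exactly 500 images per letter. There are as many depth images as color images, so about 60k as well. The dataset's authors train on half of the images and test on the other half, using both color and depth. Each run therefore uses at least 30k images, each with thousands of dimensions, which is a fairly large amount of data.

　　I also found that the first color image is missing throughout the dataset, so the numbering starts from the second image; care is needed when pairing the color images with the Kinect depth images. Some of the data are also inconsistent: in some folders the number of depth images differs from the number of color images. The original RGB images are 8-bit, while the depth images are 16-bit. (File sizes are normally given in bytes, whereas transfer rates are normally given in bits.)

　　Some images from the ASL dataset:

　　Some MATLAB notes:

　　In MATLAB, matrices of the same size and the same floating-point type can still occupy different amounts of space when saved to a file, because their contents (the element values) differ.

　　For ordinary RGB images imagesc and imshow behave much the same; the difference is that imagesc also displays the axis tick labels.

　　dir:

　　Lists the contents of a folder. If the listing of a folder shows one real subfolder, it actually contains at least 3 directory entries, because '.' and '..' stand for the current directory and its parent.

　　load:

　　In command form (without parentheses), load cannot take a variable holding the file name; the file name must be written out literally.

　　sparse:

　　The index arguments of this function must be positive integers, because negative numbers and 0 cannot be used as subscripts.

　　Experiment results:

　　The experiment implements the following 3 small preprocessing steps.

　　First: rescale every image to 96*96. The supplied images vary in size, so a rough middle value was chosen. Each image is turned into a column vector, and the samples together form a matrix. Because the images are used for training and testing, they are split into 2 parts following the authors' method, each part containing 3 kinds of image: RGB color, grayscale and Kinect depth. The data are large, so each of the 5 subjects is stored as a separate group. The rescaled images are therefore saved in 30 files in total (2 parts x 3 image types x 5 subjects). Some of the files are shown below:

　　Second: the training images will be used to train some deep learning model, so local patches (10*10) are extracted from them. There are 30k training images and 10 patches are taken from each, giving 300k patches in all.

　　Third: whiten these patches, using plain ZCA whitening.

　　Main code with comments:

　　The 3 m-files below correspond to the 3 steps above.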

img_preprocessing.m:

%% data processing:
% translate the picture sets to the mat form
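% Resize the fingerspelling image database to a common size (96*96 here), flatten each image into one column,
% and collect the columns into matrices. The data of each subject (A, B, C, D, E) are stored separately, and
% RGB color, grayscale and depth images are all saved for later experiments.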

%add the picture path
addpath c:/Data
addpath c:/Data/fingerspelling5
addpath c:/Data/fingerspellingmat5/
matdatapath = 'c:/Data/fingerspellingmat5/';

% locations of the images and of the saved mat files
img_root_path = 'c:/Data/fingerspelling5/';
mat_root_path = 'c:/Data/fingerspellingmat5/';

% target size for the rescaled images
img_scale_width = 96;
img_scale_height = 96;

%% Convert the images to mat data
img_who_path = dir(img_root_path); % dir lists the contents of a folder
if(img_who_path(1).isdir) % one folder per subject: A, B, C, ...
    length_img_who_path = length(img_who_path);
    for ii = 4:length_img_who_path %3~7
        % Buffers for the intermediate data are allocated here. My machine has 8 GB of RAM, so everything is read in one go;
        % with less memory it is better to store the data in smaller pieces.
        % training and test parts of the RGB images
        color_img_train = zeros(img_scale_width*img_scale_height*3,250*24);
        color_label_train = zeros(250*24,1);
        color_img_test = zeros(img_scale_width*img_scale_height*3,250*24);
        color_label_test = zeros(250*24,1);
        % training and test parts of the grayscale images
        gray_img_train = zeros(img_scale_width*img_scale_height,250*24);
        gray_label_train = zeros(250*24,1);
        gray_img_test = zeros(img_scale_width*img_scale_height,250*24);
        gray_label_test = zeros(250*24,1);
        % training and test parts of the depth images
        depth_img_train = zeros(img_scale_width*img_scale_height,250*24);
        depth_label_train = zeros(250*24,1);
        depth_img_test = zeros(img_scale_width*img_scale_height,250*24);
        depth_label_test = zeros(250*24,1);
        
        img_which_path = dir([img_root_path img_who_path(ii).name '/']);
        if(img_which_path(1).isdir) % one folder per letter: a, b, c, ...
            length_img_which_path = length(img_which_path);
            for jj = 3:length_img_which_path % entries 3~26; the first two are '.' and '..'
                
               % list the RGB images (the grayscale images are derived from them)
               color_img_set = dir([img_root_path img_who_path(ii).name '/' ...
                                img_which_path(jj).name '/color_*.png']); % RGB images under A/a/ ...
               % list the depth images
               depth_img_set = dir([img_root_path img_who_path(ii).name '/' ...
                                img_which_path(jj).name '/depth_*.png']); % depth images under A/a/ ...
                            
               assert(length(color_img_set) == length(depth_img_set),'the number of color images must agree with the number of depth images');
               img_num = length(color_img_set); % the RGB and depth counts are equal
               assert(img_num >= 500, 'the number of rgb color images must be at least 500');                         
               img_father_path = [img_root_path img_who_path(ii).name '/'  img_which_path(jj).name '/'];
               for kk = 1:500
                   color_img_name = [img_father_path color_img_set(kk).name];          
                   depth_img_name = [img_father_path depth_img_set(kk).name];        
                   fprintf('Processing the image: %s and %s\n',color_img_name,depth_img_name);
                   % read the RGB image and derive the grayscale image; resize first, then convert to double
                   color_img = imresize(imread(color_img_name),[96 96]);
                   gray_img = rgb2gray(color_img);
                   color_img = im2double(color_img);                  
                   gray_img = im2double(gray_img);
                   % read the depth image
                   depth_img = imresize(imread(depth_img_name),[96 96]);
                   depth_img = im2double(depth_img);                  
                   % write the image data into the arrays
                   if kk <= 250
                       color_img_train(:,(jj-3)*250+kk) =  color_img(:);
                       color_label_train((jj-3)*250+kk) = jj-2;
                       gray_img_train(:,(jj-3)*250+kk) =  gray_img(:);
                       gray_label_train((jj-3)*250+kk) = jj-2;
                       depth_img_train(:,(jj-3)*250+kk) = depth_img(:);
                       depth_label_train((jj-3)*250+kk) = jj-2;
                   else
                       color_img_test(:,(jj-3)*250+kk-250) = color_img(:);
                       color_label_test((jj-3)*250+kk-250) = jj-2;
                       gray_img_test(:,(jj-3)*250+kk-250) = gray_img(:);
                       gray_label_test((jj-3)*250+kk-250) = jj-2;
                       depth_img_test(:,(jj-3)*250+kk-250) = depth_img(:);
                       depth_label_test((jj-3)*250+kk-250) = jj-2;
                   end
               end              
            end                      
        end
        % save the image matrices
        fprintf('Saving %s\n',[mat_root_path 'color_img_train_' img_who_path(ii).name '.mat']);
        save([mat_root_path 'color_img_train_' img_who_path(ii).name '.mat'], 'color_img_train','color_label_train');
        fprintf('Saving %s\n',[mat_root_path 'color_img_test_' img_who_path(ii).name '.mat']);
        save([mat_root_path 'color_img_test_' img_who_path(ii).name '.mat'] ,'color_img_test', 'color_label_test');
        fprintf('Saving %s\n',[mat_root_path 'gray_img_train_' img_who_path(ii).name '.mat']);
        save([mat_root_path 'gray_img_train_' img_who_path(ii).name '.mat'], 'gray_img_train','gray_label_train');
        fprintf('Saving %s\n',[mat_root_path 'gray_img_test_' img_who_path(ii).name '.mat']);
        save([mat_root_path 'gray_img_test_' img_who_path(ii).name '.mat'] ,'gray_img_test', 'gray_label_test'); 
        fprintf('Saving %s\n',[mat_root_path 'depth_img_train_' img_who_path(ii).name '.mat']);
        save([mat_root_path 'depth_img_train_' img_who_path(ii).name '.mat'], 'depth_img_train','depth_label_train');
        fprintf('Saving %s\n',[mat_root_path 'depth_img_test_' img_who_path(ii).name '.mat']);
        save([mat_root_path 'depth_img_test_' img_who_path(ii).name '.mat'] ,'depth_img_test', 'depth_label_test');        
        
        % clear the variables to free memory
        clear color_img_train color_label_train color_img_test color_label_test...
        gray_img_train gray_label_train gray_img_test gray_label_test...
        depth_img_train depth_label_train depth_img_test depth_label_test;
    end
end

sample_patches.m:

function patches = sample_patches(imgset, img_width, img_height, num_perimage, patch_size, channels)
% sample_patches
% imgset: matrix in which each column holds the data of one image
% img_width: width of the image stored in each column
% img_height: height of the image stored in each column
% num_perimage: number of small patches sampled from each image
% patch_size: side length of a patch (patches are square)
% channels: number of image channels, 3 for RGB or 1 for grayscale and depth

[n, m] = size(imgset); % n is the dimension of one image, m is the number of images
num_patches = num_perimage*m; % total number of patches to sample

% Initialize patches with zeros: one column per patch, num_patches columns.
assert(channels == 1 || channels == 3, 'channels must be 1 or 3');
patches = zeros(patch_size*patch_size*channels, num_patches);

assert(n == img_width*img_height*channels, 'The images in imgset must agree with the given width, height and channels');


% randomly sample num_perimage patches from each image
for imageNum = 1:m
     img = reshape(imgset(:,imageNum),[img_height img_width channels]);
     for patchNum = 1:num_perimage
        xPos = randi([1,img_height-patch_size+1]);
        yPos = randi([1, img_width-patch_size+1]);
        patch = img(xPos:xPos+patch_size-1,yPos:yPos+patch_size-1,:);
        patches(:,(imageNum-1)*num_perimage+patchNum) = patch(:);
    end
end


 end

patches_preprocessing.m:

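% Extract the patches used for training, for the RGB color images.
% Patches are 10*10 (only a default; the size can be changed),
% and 10 patches are taken from each image (also adjustable).
% The un-whitened patch matrix is saved in patches_without_whitening.mat, one patch per column.
% The ZCA-whitened patch matrix is saved in patches_with_whitening.mat, together with the mean vector
% mean_patches and the whitening matrix ZCAWhitening.

patch_size = 10;
num_per_img = 10; % number of patches taken from each image
num_patches = 100000; % 300k patches are available, which is too many, so only 100k are kept
epsilon = 0.1; % regularization term in the denominator of the whitening step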

% add the root directories
addpath c:/Data
addpath c:/Data/fingerspelling5
addpath c:/Data/fingerspellingmat5/
matdatapath = 'c:/Data/fingerspellingmat5/';

% load the color training images of all 5 subjects and sample patches from each
subjects = 'ABCDE';
patches = [];
for s = 1:length(subjects)
    mat_name = ['color_img_train_' subjects(s) '.mat'];
    fprintf('Loading %s...\n', mat_name);
    load(mat_name); % function form of load, since the file name is held in a variable
    fprintf('Sampling the patches from the color_img_train_%s set...\n', subjects(s));
    % sample the patches of this subject and append them to the combined matrix
    patches = [patches, sample_patches(color_img_train,96,96,num_per_img,patch_size,3)];
    clear color_img_train;
end
size_patches = size(patches); % a 2-element vector; the channels are already folded into the rows
rand_patches = randi(size_patches(2), [1 num_patches]); % draw 100000 column indices at random (with replacement)
patches = patches(:, rand_patches);

% save the raw patches directly
fprintf('Saving the patches_without_whitening.mat...\n');
save([matdatapath 'patches_without_whitening.mat'], 'patches');

% ZCA-whiten the data
mean_patches = mean(patches,2); % mean of each dimension
patches = patches - repmat(mean_patches,[1 num_patches]); % subtract the mean from each dimension
sigma = (1./num_patches).*patches*patches';

[u s v] = svd(sigma);
ZCAWhitening = u*diag(1./sqrt(diag(s)+epsilon))*u'; % ZCA whitening matrix: decorrelates the dimensions and equalizes their variances (up to the epsilon regularization)
patches = ZCAWhitening*patches;

% save the whitened data together with the mean column vector and the ZCAWhitening matrix
fprintf('Saving the patches_with_whitening.mat...\n');
save([matdatapath 'patches_with_whitening.mat'], 'patches', 'mean_patches', 'ZCAWhitening');


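% %% The rest only tests why patches_with_whitening.mat and patches_without_whitening.mat differ so much in size.
% % Matrices of the same size and floating-point type can still take up different amounts of file space, because their contents differ.
% % save ZCAWhitening on its own
% fprintf('Saving the zca_whiteing.mat...\n');
% save([matdatapath 'zca_whiteing.mat'], 'ZCAWhitening');
% 
% % save mean_patches on its own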
% fprintf('Saving the mean_patches.mat...\n');
% save([matdatapath 'mean_patches.mat'], 'mean_patches');
% 
% aa = ones(300,300000);
% save([matdatapath 'aaones.mat'],'aa');

　　References:

Deep learning: 30 (Data preprocessing techniques)

http://personal.ee.surrey.ac.uk/Personal/N.Pugeault/index.php?section=FingerSpellingDataset

Deep learning: 32 (Basics_3)

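　　Preface:

　　This post revisits the basics of the sparse autoencoder and adds some of my own understanding. The sparse autoencoder was introduced in the earlier post Deep learning: 8 (Sparse Autoencoder).

　　Basics:

　　First, why can a sparse autoencoder learn features of the input data? In a plain autoencoder the hidden layer usually has fewer nodes than the input layer, yet the network must still reconstruct its input, so the hidden layer compresses the original data. Since this compression works, the hidden layer represents part of the structure of the input (the components of the input are not independent of one another). If a sparsity constraint is placed on the hidden nodes instead (in which case the hidden layer usually has more nodes than the input layer), each hidden node is inactive for most inputs. The learned features are then more representative, because a node only responds to the inputs it is tuned to, and those inputs are the features we want to learn.

　　Sparsity in this sense does not mean that for a single input sample most hidden nodes are inactive (although that may well be the case in practice). It means that, over all input samples, a given node is inactive for most of them.

　　The sparsity penalty is then:

$$\sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat\rho_j)$$

　　Here the sparsity parameter $\rho$ is usually small, for example 0.05, and the average activation $\hat\rho_j$ of hidden node j over the m samples is computed as:

$$\hat\rho_j = \frac{1}{m}\sum_{i=1}^{m} a_j^{(2)}\big(x^{(i)}\big)$$

　　Expanding the KL divergence gives:

$$\sum_{j=1}^{s_2} \left[ \rho \log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j} \right]$$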
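
To make the penalty concrete, the following is a minimal Python sketch that computes the average activations and the KL-divergence penalty from a small table of hidden activations. The function name kl_sparsity_penalty and the toy activations are assumptions for this example; a real implementation would also multiply the penalty by a weight (called beta in the UFLDL notes) before adding it to the cost.

```python
import math

def kl_sparsity_penalty(rho, activations):
    """Sum over hidden units of KL(rho || rho_hat_j).

    activations: list of rows, one row of hidden activations per sample,
    each value strictly between 0 and 1 (sigmoid outputs).
    Returns (penalty, rho_hat).
    """
    m = len(activations)
    # average activation of every hidden unit over the m samples
    rho_hat = [sum(col) / m for col in zip(*activations)]
    penalty = sum(
        rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
        for r in rho_hat
    )
    return penalty, rho_hat

# Two samples, two hidden units: the first unit is sparse, the second is not.
acts = [[0.05, 0.9], [0.05, 0.7]]
penalty, rho_hat = kl_sparsity_penalty(0.05, acts)
print(rho_hat, penalty)
```

A unit whose average activation equals rho contributes nothing, while the second unit, active on almost every sample, accounts for the whole penalty.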

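　　However, in one of Ng's lecture handouts at http://www.stanford.edu/class/cs294a/handouts.html, sparsity is expressed and computed slightly differently: instead of computing the expected activation of node i over all samples in one pass, the estimate is updated iteratively, one sample at a time, as shown in the screenshot from the notes:

　　What I find hard to understand is that the bias b there is not obtained from the partial-derivative formula but from the sparsity term instead. The procedure is as follows:

　　References:

Deep learning: 8 (Sparse Autoencoder)

http://www.stanford.edu/class/cs294a/handouts.html

Deep learning: 33 (The ICA model)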

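　　Basics:

　　In the sparse coding model (see Deep learning: 26 (A simple understanding of sparse coding) and Deep learning: 29 (Sparse coding exercise)), the learned basis is over-complete: it contains more basis vectors than the data have dimensions, so when a data point is decomposed as a linear combination of the basis, the basis vectors are necessarily linearly dependent. If we want a linearly independent basis, the number of basis vectors must be less than or equal to the dimension of the samples. The ICA (Independent Component Analysis) model covered here meets this requirement: the basis it learns is not only linearly independent but also orthonormal. The main reference for this post is: Independent Component Analysis

　　The ICA objective function is very simple:

$$J(W) = \lVert Wx \rVert_1$$

　　It has a single term, the L1 norm of the coefficients obtained by applying the linear transform W to the data x. (This is the vector L1 norm: when x is a vector, Wx is a vector too. Note that the matrix 1-norm and the vector 1-norm differ in definition and intent; see the earlier post on norms, Deep learning: 27 (Matrix norm derivatives in sparse coding).) This term plays the role of the sparsity penalty on the features in sparse coding. Unlike sparse coding, the basis W here maps the input data directly to features, whereas the basis in sparse coding maps the feature coefficients back to a reconstruction of the original data.

　　With the orthonormality constraint on W added, the problem becomes:

$$\min_W \ \lVert Wx \rVert_1 \quad \text{s.t.} \quad WW^\top = I$$

　　To optimize the weights by gradient descent under this objective and constraint, the following 2 steps are repeated:

$$1.\ \ W \leftarrow W - \alpha \nabla_W \lVert Wx \rVert_1 \qquad\qquad 2.\ \ W \leftarrow \operatorname{proj}_U W, \ \text{where } U \text{ is the set of matrices satisfying } WW^\top = I$$

　　The learning rate alpha may vary (a line search can be used to speed up gradient descent; I have not studied the details). The derivative of the L1 norm of Wx with respect to W can be obtained by casting the computation as a neural network and applying the backpropagation idea; see the article Deriving gradients using the backpropagation idea. The objective function in this case is:

　　The resulting derivative is:

　　In addition, after every gradient-descent update the weights W must be orthonormalized, which is step 2 above. Written out explicitly, the update is:

$$W \leftarrow (WW^\top)^{-\frac{1}{2}}\, W$$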
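
A quick check that the orthonormalization update W ← (WWᵀ)^(-1/2) W really returns a matrix with orthonormal rows, assuming W has full row rank so that WWᵀ is invertible (W' denotes the updated matrix, a symbol introduced here):

```latex
W' = (WW^\top)^{-\frac{1}{2}}\, W
\;\Longrightarrow\;
W' W'^\top
= (WW^\top)^{-\frac{1}{2}} \,(WW^\top)\, (WW^\top)^{-\frac{1}{2}}
= I
```

The second equality uses the fact that (WWᵀ)^(-1/2) is symmetric and commutes with WWᵀ.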

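　　Because the weight matrix is orthonormal, two things follow:

The number of basis vectors in W cannot exceed the dimension of the input data. One way to see this: W is orthonormal and therefore linearly independent, and the number of linearly independent vectors cannot exceed the dimension of the input data.
When the ICA model is used, the input data must be ZCA whitened with the denominator parameter epsilon set to 0, the reason given being that the orthonormalization update for W above already amounts to ZCA whitening. This is what the web tutorial says; I honestly did not understand it.

　　Also, PCA whitening and ZCA whitening are both whitening operations: they remove the correlation between data dimensions and make the covariance matrix of the features the identity matrix.

　　References:

Deep learning: 26 (A simple understanding of sparse coding)

Deep learning: 29 (Sparse coding exercise)

Independent Component Analysis

Deep learning: 27 (Matrix norm derivatives in sparse coding)

Deriving gradients using the backpropagation idea

Deep learning: 34 (Dimensionality reduction with neural networks)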

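　　The importance of dimensionality reduction needs no explanation. Using neural networks (NN) for aggressive dimensionality reduction dates from 2006, starting with a paper in Science that year: reducing the dimensionality of data with neural networks. Its author is the renowned Hinton, and the paper marks the point at which deep learning took off.

　　I spent some time reading the paper today; some notes follow:

　　Multilayer perceptrons were proposed back in the last century, so why were they not widely used? The reason is that when optimizing the weights of a multilayer nonlinear network it is hard to reach a good solution. Numerical optimization (with BP, for example) starts from randomly initialized weights. If the initial weights are too large, training easily converges to a poor local minimum; if they are too small, the gradients shrink as the error is propagated backwards, so the layers closest to the input update most slowly. This optimization difficulty is why multilayer NNs were not applied at scale. The deep autoencoder designed by the paper's authors, however, reaches a good solution fairly quickly: each layer is first trained separately with an unsupervised method (RBM here), and the result is used as the initial value for fine-tuning. The method is regarded as a nonlinear generalization of PCA.

Each layer is pretrained with an RBM; for a short introduction to RBMs see the earlier post Deep learning: 19 (A simple understanding of RBM). The main idea is to use an energy function:

$$E(v, h) = -\sum_{i} b_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i h_j w_{ij}$$

where $v_i$ and $h_j$ are the binary states of visible unit i and hidden unit j, $b_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between them.

　　Given an input image (binary images for now), we can adjust the weights and biases of the network so that the network assigns that image a lower energy.

　　The paper says that a single layer of binary units is not enough to model a large dataset, so several layers are used, with the output of the first layer serving as the input of the second. Each added layer raises a lower bound on the log probability that the network assigns to the input data, and the upper layers extract higher-order features of the layers below.

　　A diagram of pretraining and fine-tuning, with the encoder and decoder, is as follows:

　　As the diagram shows, once pretraining is complete the decoder part is brought back and the layers are unrolled into the full network, whose parameters are then fine-tuned using the real data as the target output.

　　When the input data are continuous-valued, the binary visible units are simply replaced by Gaussian units with unit variance, while the output of the first hidden layer remains binary.

　　The paper contains several experiments: handwritten digit recognition, face image compression, news topic extraction, and so on. In the layer-wise training for these experiments, the input layer of the first RBM is the real data, normalized to (0,1), and the input layer of every other RBM is the output probabilities of the previous RBM. In the actual network, however, every layer is a binary stochastic variable except the bottom input layer and the hidden layer of the top RBM, which are continuous. The hidden layer of the top RBM is a Gaussian random variable with unit variance whose mean is determined by the input to that RBM.

　　Experiment result 1:

　　In each of these 3 figures the top row is the original images, followed by the NN reconstruction and the PCA reconstructions (PCA with different numbers of principal components, logistic PCA or standard PCA; I have not looked at logistic PCA closely). The top-left figure comes from an NN that reduces 784-dimensional data directly to 6 dimensions!

　　The authors also found experimentally that when the network is shallow, with only 1 hidden layer, it can reach good results without pretraining, although RBM pretraining still saves time in the later BP training. Also, with the same number of parameters, deep networks achieve a lower reconstruction error on test data than shallow ones, but this holds only when the parameter counts are equal. On the MNIST handwritten digit dataset the authors used a 784-500-500-2000-10 network (three hidden layers between the 784 inputs and the 10 outputs), which brought the recognition error down to 1.2%. The weights obtained in pretraining account for most of the final recognition performance, because pretraining already captures the internal structure of the data, while the labelled data used in fine-tuning adjust the parameters only slightly.

　　References:

　　reducing the dimensionality of data with neural networks

Deep learning: 19 (A simple understanding of RBM)

Deep learning: 35 (Exercise: dimensionality reduction with neural networks)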

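　　Preface:

　　This post is the exercise for the previous post, Deep learning: 34 (Dimensionality reduction with neural networks), namely the code accompanying Hinton's Science paper reducing the dimensionality of data with neural networks, which can be downloaded from http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html. I spent some time reading and running it. The code is really 2 separate projects. One uses MNIST only to train a deep autoencoder for compression; it is unsupervised and is evaluated by the reconstruction error (MSE). The other is MNIST handwritten digit recognition; its pretraining is unsupervised and its fine-tuning is supervised, and it is evaluated by the recognition rate or error rate.

　　MNIST dimensionality-reduction experiment:

　　This experiment trains a deep autoencoder with 4 hidden layers. The input layer has 784 dimensions and the 4 hidden layers have 1000, 500, 250 and 30 dimensions. The network weights are obtained as follows:

1. Train the first RBM, formed by the 784-dimensional input layer and the 1000-dimensional first hidden layer. It is optimized as an RBM on the training samples; once that is done, compute the hidden-layer outputs for the training samples.
2. Use the outputs from step 1 as the input for training the second RBM, optimize it in the same way, and compute its outputs. Train the third and fourth RBMs likewise.
3. Unroll the 4 RBMs and connect them into a new network, split into an encoder part and a decoder part, and initialize this network with the weights obtained in steps 1 and 2.
4. The final output of the new network has as many nodes as the original input, so the original input can serve as the target output. Use BP to compute the cost function of the network and its partial derivatives.
5. Starting from the initial values of step 3 and using the cost and derivatives of step 4, optimize the whole network with conjugate gradient to obtain the final weights. The whole procedure is unsupervised.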
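
Step 3 (unrolling) can be pictured with a small Python sketch that mirrors the encoder layer sizes into the full encoder-decoder stack and counts its parameters. The helper names are made up for this example, and the count assumes plain fully connected layers with one bias per unit.

```python
def unroll_autoencoder(layer_sizes):
    """Mirror the encoder layer sizes to obtain the full encoder-decoder stack."""
    # layer_sizes[-2::-1] walks back from the layer below the code layer to the input
    return layer_sizes + layer_sizes[-2::-1]

def count_parameters(sizes):
    """Weights plus biases of a fully connected stack with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

encoder = [784, 1000, 500, 250, 30]
full = unroll_autoencoder(encoder)
print(full)
print(count_parameters(full))
```

The unrolled stack is 784-1000-500-250-30-250-500-1000-784: the 30-dimensional code layer sits in the middle and the output has the same 784 dimensions as the input, which is why the input can act as the target in step 4.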

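　　Some MATLAB functions:

　　rem and mod:

　　Reference: The difference between modulo (mod) and remainder (rem): Matlab study notes

　　The modulo operation is often also called the remainder operation, and both return a remainder. The only difference between rem and mod is this:
　　When x and y have the same sign, the two functions give the same result; when the signs differ, the result of rem has the sign of x, while the result of mod has the sign of y. This follows from how they are defined: rem uses fix and mod uses floor (both round to an integer; fix rounds toward 0, floor rounds toward negative infinity). For y not equal to 0, rem(x,y) returns x-n.*y with n = fix(x./y), and mod(x,y) returns x-n.*y with n = floor(x./y).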
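
The two definitions can be tried out directly. The sketch below re-implements them in Python for scalar inputs with y not equal to 0; matlab_rem and matlab_mod are names invented for this example (they mimic, but are not, the MATLAB functions).

```python
import math

def matlab_rem(x, y):
    """x - fix(x/y)*y: truncation toward zero, so the result takes the sign of x."""
    return x - math.trunc(x / y) * y

def matlab_mod(x, y):
    """x - floor(x/y)*y: rounding toward minus infinity, so the result takes the sign of y."""
    return x - math.floor(x / y) * y

for x, y in [(7, 3), (-7, 3), (7, -3), (-7, -3)]:
    print(x, y, matlab_rem(x, y), matlab_mod(x, y))
```

For (7, 3) and (-7, -3) both functions agree (1 and -1 respectively). For (-7, 3) rem gives -1 and mod gives 2; for (7, -3) rem gives 1 and mod gives -2. In Python itself the % operator follows the mod convention and math.fmod follows the rem convention.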

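　　The m-files in the project:

　　converter.m:

　　Converts the sample set from the .ubyte format to .ascii, and then to .mat.

　　makebatches.m:

　　Turns the original 2-D dataset into a 3-D one: the data are split into batches, and the extra dimension indexes the batch.

　　The steps the program uses to optimize the RBM weights are roughly as follows (assuming a 2-layer RBM, that is, just an input layer and an output layer, both with binary units):

1. Randomly initialize the weight matrix w and the bias vector b.
2. Propagate the visible-layer input matrix v forward to compute the hidden-layer output matrix h, and compute the matrix of mean products between corresponding nodes of v and h.
3. The output h from step 2 consists of probabilities; sample it stochastically into binary 0/1 values.
4. Propagate the binarized h from step 3 backward to compute the visible-layer matrix v'.
5. Propagate v' forward to compute the hidden-layer matrix h', and compute the matrix of mean products between corresponding nodes of v' and h'.
6. Subtract the mean matrix of step 5 from the mean matrix of step 2; the result is the matrix of weight increments.
7. Apply the weight update rule with the corresponding learning rate.
8. Repeat steps 2 to 7 until convergence.

　　Steps for optimizing the biases:

1. Randomly initialize the weight matrix w and the bias vector b.
2. Propagate the visible-layer input matrix v forward to compute the hidden-layer output matrix h, and compute the mean vector of the v-layer samples and the mean vector of the h layer.
3. The output h from step 2 consists of probabilities; sample it stochastically into binary 0/1 values.
4. Propagate the binarized h from step 3 backward to compute the visible-layer matrix v'.
5. Propagate v' forward to compute the hidden-layer matrix h', and compute the mean vector of the v'-layer samples and the mean vector of the h' layer.
6. Subtract the v' mean vector of step 5 from the v mean vector of step 2; the result is the increment for the biases of the visible layer v. Subtract the h' mean vector of step 5 from the h mean vector of step 2; the result is the increment for the biases of the hidden layer h.
7. Apply the update rule to the biases with the corresponding learning rate.
8. Repeat steps 2 to 7 until convergence.

　　Of course, the weights and biases are updated together in every iteration, so they should converge together. The weight update can also be modified slightly, for example by adding a momentum term, so that the current increment retains part of the previous increment.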
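
The two step lists above can be combined into one update. The following is a minimal, unoptimized Python sketch (standard library only) of a single contrastive-divergence step for a tiny binary RBM, without momentum or weight decay. All names, the toy data and the layer sizes are assumptions for illustration; it is not the rbm.m code from the project.

```python
import math
import random

def sigmoid(z):
    # numerically stable for large |z|
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def propagate(rows, weights, biases):
    """Activation probabilities of the target layer for every row of the source layer."""
    return [[sigmoid(sum(x * w for x, w in zip(row, col)) + b)
             for col, b in zip(zip(*weights), biases)]
            for row in rows]

def mean_products(a_rows, b_rows):
    """Batch mean of the products a_i * b_j, one entry per pair of nodes."""
    n = len(a_rows)
    return [[sum(a[i] * b[j] for a, b in zip(a_rows, b_rows)) / n
             for j in range(len(b_rows[0]))]
            for i in range(len(a_rows[0]))]

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def cd1_step(v, w, vb, hb, lr, rng):
    """One CD-1 update of weights w (visible x hidden) and biases vb, hb."""
    h_prob = propagate(v, w, hb)                          # step 2: v -> h
    h_state = [[1.0 if p > rng.random() else 0.0 for p in row]
               for row in h_prob]                         # step 3: sample 0/1 states
    v_recon = propagate(h_state, list(zip(*w)), vb)       # step 4: h -> v'
    h_recon = propagate(v_recon, w, hb)                   # step 5: v' -> h'
    pos, neg = mean_products(v, h_prob), mean_products(v_recon, h_recon)
    w = [[wij + lr * (p - q) for wij, p, q in zip(w_row, p_row, q_row)]
         for w_row, p_row, q_row in zip(w, pos, neg)]     # steps 6-7 for the weights
    vb = [b + lr * (p - q)
          for b, p, q in zip(vb, column_means(v), column_means(v_recon))]
    hb = [b + lr * (p - q)
          for b, p, q in zip(hb, column_means(h_prob), column_means(h_recon))]
    return w, vb, hb

def recon_error(v, w, vb, hb):
    """Mean squared reconstruction error per sample, using probabilities only."""
    r = propagate(propagate(v, w, hb), list(zip(*w)), vb)
    return sum((a - b) ** 2 for x, y in zip(v, r) for a, b in zip(x, y)) / len(v)

rng = random.Random(0)
batch = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]] * 5   # 10 samples, 4 visible units
num_vis, num_hid = 4, 2
w = [[0.1 * rng.gauss(0, 1) for _ in range(num_hid)] for _ in range(num_vis)]
vb, hb = [0.0] * num_vis, [0.0] * num_hid

before = recon_error(batch, w, vb, hb)
for _ in range(2000):                                      # step 8: repeat
    w, vb, hb = cd1_step(batch, w, vb, hb, lr=0.1, rng=rng)
after = recon_error(batch, w, vb, hb)
print(before, after)
```

The statistics use the hidden probabilities while the reconstruction uses the sampled binary states, matching steps 2 to 5 above. A momentum term would keep a running increment for w, vb and hb and add a fraction of the previous increment at each step.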

　　函數CG_MNIST形式如下：

　　function [f, df] = CG_MNIST(VV,Dim,XX);

　　該函數實現的功能是計算網絡代價函數值f，以及f對網絡中各個參數值的偏導數df，權值和偏置值是同時處理。其中參數VV爲網絡中所有參數構成的列向量，參數Dim爲每層網絡的節點數構成的向量，XX爲訓練樣本集合。f和df分別表示網絡的代價函數和偏導函數值。

　　共軛梯度下降的優化函數形式爲：

　　[X, fX, i] = minimize(X, f, length, P1, P2, P3, ... )

　　該函數時使用共軛梯度的方法來對參數X進行優化，所以X是網絡的參數值，爲一個列向量。f是一個函數的名稱，它主要是用來計算網絡中的代價函數以及代價函數對各個參數X的偏導函數，f的參數值分別爲X，以及minimize函數後面的P1,P2,P3,…使用共軛梯度法進行優化的最大線性搜索長度爲length。返回值X爲找到的最優參數，fX爲在此最優參數X下的代價函數，i爲線性搜索的長度（即迭代的次數）。

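　　Experimental results:

　　The author sets the number of iterations to 200. In my run, reaching iteration 35 had already taken more than 6 hours, so I did not wait for the rest (the full run would need more than 30 hours). The original digits and the reconstructed digits at that point are shown below:

　　The mean squared errors are: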

　　Train squared error: 4.318

　　Test squared error: 4.520

　　Main code of the experiment, with comments:

mnistdeepauto.m:

clear all
close all

maxepoch=10; %In the Science paper we use maxepoch=50, but it works just fine. 
numhid=1000; numpen=500; numpen2=250; numopen=30;

fprintf(1,'Converting Raw files into Matlab format \n');
converter; % convert the data into MATLAB format

fprintf(1,'Pretraining a deep autoencoder. \n');
fprintf(1,'The Science paper used 50 epochs. This uses %3i \n', maxepoch);

makebatches;
[numcases numdims numbatches]=size(batchdata);

fprintf(1,'Pretraining Layer 1 with RBM: %d-%d \n',numdims,numhid);
restart=1;
rbm;
hidrecbiases=hidbiases; % hidbiases holds the hidden-layer biases
save mnistvh vishid hidrecbiases visbiases;% save this layer's variables: weights, hidden-layer biases, visible-layer biases

fprintf(1,'\nPretraining Layer 2 with RBM: %d-%d \n',numhid,numpen);
batchdata=batchposhidprobs;% batchposhidprobs holds the output probabilities of the first RBM
numhid=numpen;
restart=1;
rbm;% train the 2nd RBM
hidpen=vishid; penrecbiases=hidbiases; hidgenbiases=visbiases;
save mnisthp hidpen penrecbiases hidgenbiases;% mnisthp is the name of the saved file

fprintf(1,'\nPretraining Layer 3 with RBM: %d-%d \n',numpen,numpen2);
batchdata=batchposhidprobs;
numhid=numpen2;
restart=1;
rbm;
hidpen2=vishid; penrecbiases2=hidbiases; hidgenbiases2=visbiases;% 3rd RBM
save mnisthp2 hidpen2 penrecbiases2 hidgenbiases2;

fprintf(1,'\nPretraining Layer 4 with RBM: %d-%d \n',numpen2,numopen);
batchdata=batchposhidprobs;
numhid=numopen; 
restart=1;
rbmhidlinear;
hidtop=vishid; toprecbiases=hidbiases; topgenbiases=visbiases;% 4th RBM
save mnistpo hidtop toprecbiases topgenbiases;

backprop;

rbm.m:

epsilonw      = 0.1;   % Learning rate for weights 
epsilonvb     = 0.1;   % Learning rate for biases of visible units 
epsilonhb     = 0.1;   % Learning rate for biases of hidden units; the hidden and visible layers thus have separate biases, while the weights are shared
weightcost  = 0.0002;   
initialmomentum  = 0.5;
finalmomentum    = 0.9;

[numcases numdims numbatches]=size(batchdata);% [100,784,600]

if restart ==1,
  restart=0;
  epoch=1;

% Initializing symmetric weights and biases. 
  vishid     = 0.1*randn(numdims, numhid); % small random initial weights, 784*1000
  hidbiases  = zeros(1,numhid); % biases are initialized to 0
  visbiases  = zeros(1,numdims);

  poshidprobs = zeros(numcases,numhid);% 100*1000, hidden-layer output probabilities of one batch in the forward pass
  neghidprobs = zeros(numcases,numhid);
  posprods    = zeros(numdims,numhid);% 784*1000
  negprods    = zeros(numdims,numhid);
  vishidinc  = zeros(numdims,numhid);
  hidbiasinc = zeros(1,numhid);
  visbiasinc = zeros(1,numdims);
  batchposhidprobs=zeros(numcases,numhid,numbatches);% hidden-layer output probabilities of the whole data set in the forward pass
end

for epoch = epoch:maxepoch, % 10 epochs in total
 fprintf(1,'epoch %d\r',epoch); 
 errsum=0;
 for batch = 1:numbatches, % every epoch loops over all batches
 fprintf(1,'epoch %d batch %d\r',epoch,batch); 

%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
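  data = batchdata(:,:,batch);% take one batch per iteration; each row is one sample
  poshidprobs = 1./(1 + exp(-data*vishid - repmat(hidbiases,numcases,1)));% hidden-unit output probabilities for the forward pass of the samples
  batchposhidprobs(:,:,batch)=poshidprobs;
  posprods    = data' * poshidprobs;% 784*1000; positive-phase statistics for the weight update: each entry is the product of a visible unit and a hidden unit, summed over the samples of this batch
  poshidact   = sum(poshidprobs);% sum over the samples
  posvisact = sum(data);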

%%%%%%%%% END OF POSITIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  poshidstates = poshidprobs > rand(numcases,numhid); % binarize the hidden layer to 0/1 (done after posprods is computed), sampling according to the probabilities

%%%%%%%%% START NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
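  negdata = 1./(1 + exp(-poshidstates*vishid' - repmat(visbiases,numcases,1)));% visible-layer data from the backward pass (the reconstruction)
  neghidprobs = 1./(1 + exp(-negdata*vishid - repmat(hidbiases,numcases,1)));% hidden probabilities from forward-propagating the reconstruction
  negprods  = negdata'*neghidprobs;% negative-phase statistics, computed the same way, 784*1000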
  neghidact = sum(neghidprobs);
  negvisact = sum(negdata); 

%%%%%%%%% END OF NEGATIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  err= sum(sum( (data-negdata).^2 ));% squared difference between the data and its reconstruction
  errsum = err + errsum; % errsum is only used to print the error of each epoch

   if epoch>5,
     momentum=finalmomentum;% 0.9; momentum is the fraction of the previous weight increment that is carried over, and the smaller value is used during the first 5 epochs
   else
     momentum=initialmomentum;% 0.5
   end;

%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
    vishidinc = momentum*vishidinc + ... % vishidinc is 784*1000, the weight increment
                epsilonw*( (posprods-negprods)/numcases - weightcost*vishid); % posprods/numcases is the expectation of vi*hj in the positive (data) phase; negprods/numcases is the same expectation for the reconstruction
    visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*(posvisact-negvisact); % all three increments follow the update formulas
    hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*(poshidact-neghidact);

    vishid = vishid + vishidinc;
    visbiases = visbiases + visbiasinc;
    hidbiases = hidbiases + hidbiasinc;

%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

  end
  fprintf(1, 'epoch %4i error %6.1f  \n', epoch, errsum); 
end;

CG_MNIST.m:

function [f, df] = CG_MNIST(VV,Dim,XX);

l1 = Dim(1);
l2 = Dim(2);
l3 = Dim(3);
l4= Dim(4);
l5= Dim(5);
l6= Dim(6);
l7= Dim(7);
l8= Dim(8);
l9= Dim(9);
N = size(XX,1);% number of samples

% Do decomversion.
 w1 = reshape(VV(1:(l1+1)*l2),l1+1,l2);% VV is one long column vector; the slice taken here already includes the biases
 xxx = (l1+1)*l2; % xxx is the number of elements already consumed
 w2 = reshape(VV(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
 xxx = xxx+(l2+1)*l3;
 w3 = reshape(VV(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
 xxx = xxx+(l3+1)*l4;
 w4 = reshape(VV(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
 xxx = xxx+(l4+1)*l5;
 w5 = reshape(VV(xxx+1:xxx+(l5+1)*l6),l5+1,l6);
 xxx = xxx+(l5+1)*l6;
 w6 = reshape(VV(xxx+1:xxx+(l6+1)*l7),l6+1,l7);
 xxx = xxx+(l6+1)*l7;
 w7 = reshape(VV(xxx+1:xxx+(l7+1)*l8),l7+1,l8);
 xxx = xxx+(l7+1)*l8;
 w8 = reshape(VV(xxx+1:xxx+(l8+1)*l9),l8+1,l9);% the steps above reshape the parameter vector back into weight matrices


  XX = [XX ones(N,1)];
  w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  XXout = 1./(1 + exp(-w7probs*w8));

f = -1/N*sum(sum( XX(:,1:end-1).*log(XXout) + (1-XX(:,1:end-1)).*log(1-XXout)));% cross-entropy between the original data and the reconstruction
IO = 1/N*(XXout-XX(:,1:end-1));
Ix8=IO; 
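dw8 =  w7probs'*Ix8;% gradient for the output layer; the error term differs from the MSE form introduced earlier because the cost here is cross-entropy, so the sigmoid-derivative factor cancels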

Ix7 = (Ix8*w8').*w7probs.*(1-w7probs); 
Ix7 = Ix7(:,1:end-1);
dw7 =  w6probs'*Ix7;

Ix6 = (Ix7*w7').*w6probs.*(1-w6probs); 
Ix6 = Ix6(:,1:end-1);
dw6 =  w5probs'*Ix6;

Ix5 = (Ix6*w6').*w5probs.*(1-w5probs); 
Ix5 = Ix5(:,1:end-1);
dw5 =  w4probs'*Ix5;

Ix4 = (Ix5*w5');
Ix4 = Ix4(:,1:end-1);
dw4 =  w3probs'*Ix4;

Ix3 = (Ix4*w4').*w3probs.*(1-w3probs); 
Ix3 = Ix3(:,1:end-1);
dw3 =  w2probs'*Ix3;

Ix2 = (Ix3*w3').*w2probs.*(1-w2probs); 
Ix2 = Ix2(:,1:end-1);
dw2 =  w1probs'*Ix2;

Ix1 = (Ix2*w2').*w1probs.*(1-w1probs); 
Ix1 = Ix1(:,1:end-1);
dw1 =  XX'*Ix1;

df = [dw1(:)' dw2(:)' dw3(:)' dw4(:)' dw5(:)' dw6(:)'  dw7(:)'  dw8(:)'  ]'; % partial derivatives of the network cost function
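The comment on dw8 notes that the output error term looks different from the MSE case. A short derivation, in my own notation rather than the author's, shows why: with x the target pixels, a the input to the output sigmoid and y = σ(a) the reconstruction (XXout in the code), the sigmoid derivative cancels under cross-entropy.

```latex
\begin{aligned}
f &= -\frac{1}{N}\sum_{n=1}^{N}\sum_{j}\Big[x_{nj}\ln y_{nj} + (1-x_{nj})\ln(1-y_{nj})\Big], \qquad y_{nj}=\sigma(a_{nj}),\\
\frac{\partial f}{\partial y_{nj}} &= \frac{1}{N}\,\frac{y_{nj}-x_{nj}}{y_{nj}(1-y_{nj})}, \qquad
\frac{\partial y_{nj}}{\partial a_{nj}} = y_{nj}(1-y_{nj}),\\
\frac{\partial f}{\partial a_{nj}} &= \frac{1}{N}\,(y_{nj}-x_{nj}).
\end{aligned}
```

This is exactly IO = 1/N*(XXout-XX(:,1:end-1)), and dw8 = w7probs'*Ix8 follows from the chain rule. With a squared-error cost the last line would instead carry the extra factor y(1-y), as it does for the hidden sigmoid layers (Ix7, Ix6, ...).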

backprop.m:

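maxepoch=200;% 35 epochs already took more than 6 hours and 200 would need more than 30 hours, so I did not let it run to the end
fprintf(1,'\nFine-tuning deep autoencoder by minimizing cross entropy error. \n');% fine-tuning is done by minimizing the cross-entropy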
fprintf(1,'60 batches of 1000 cases each. \n');

load mnistvh% load the parameters of the 4 RBMs one by one
load mnisthp
load mnisthp2
load mnistpo 

makebatches;
[numcases numdims numbatches]=size(batchdata);
N=numcases; 

%%%% PREINITIALIZE WEIGHTS OF THE AUTOENCODER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
w1=[vishid; hidrecbiases];% stack each layer's weights and biases so that they are handled as one matrix
w2=[hidpen; penrecbiases];
w3=[hidpen2; penrecbiases2];
w4=[hidtop; toprecbiases];
w5=[hidtop'; topgenbiases]; 
w6=[hidpen2'; hidgenbiases2]; 
w7=[hidpen'; hidgenbiases]; 
w8=[vishid'; visbiases];

%%%%%%%%%% END OF PREINITIALIZATIO OF WEIGHTS  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

l1=size(w1,1)-1;% number of nodes in each layer of the network
l2=size(w2,1)-1;
l3=size(w3,1)-1;
l4=size(w4,1)-1;
l5=size(w5,1)-1;
l6=size(w6,1)-1;
l7=size(w7,1)-1;
l8=size(w8,1)-1;
l9=l1; % the output layer has as many nodes as the input layer
test_err=[];
train_err=[];


for epoch = 1:maxepoch

%%%%%%%%%%%%%%%%%%%% COMPUTE TRAINING RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0; 
[numcases numdims numbatches]=size(batchdata);
N=numcases;
 for batch = 1:numbatches
  data = [batchdata(:,:,batch)];
  data = [data ones(N,1)];% append one column of ones for the bias term
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];% forward pass: compute each layer's output and append one extra column (constant 1) to it
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  dataout = 1./(1 + exp(-w7probs*w8));
  err= err +  1/N*sum(sum( (data(:,1:end-1)-dataout).^2 )); % reconstruction error
  end
 train_err(epoch)=err/numbatches;% reconstruction error on the training samples, averaged over the batches

%%%%%%%%%%%%%% END OF COMPUTING TRAINING RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%% DISPLAY FIGURE TOP ROW REAL DATA BOTTOM ROW RECONSTRUCTIONS %%%%%%%%%%%%%%%%%%%%%%%%%
fprintf(1,'Displaying in figure 1: Top row - real data, Bottom row -- reconstructions \n');
output=[];
 for ii=1:15
  output = [output data(ii,1:end-1)' dataout(ii,:)'];% output holds 15 groups (15 digits are displayed), each with 2 columns: the original and the reconstruction
 end
   if epoch==1 
   close all 
   figure('Position',[100,600,1000,200]);
   else 
   figure(1)
   end 
   mnistdisp(output);
   drawnow;

%%%%%%%%%%%%%%%%%%%% COMPUTE TEST RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[testnumcases testnumdims testnumbatches]=size(testbatchdata);
N=testnumcases;
err=0;
for batch = 1:testnumbatches
  data = [testbatchdata(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  dataout = 1./(1 + exp(-w7probs*w8));
  err = err +  1/N*sum(sum( (data(:,1:end-1)-dataout).^2 ));
  end
 test_err(epoch)=err/testnumbatches;
 fprintf(1,'Before epoch %d Train squared error: %6.3f Test squared error: %6.3f \t \t \n',epoch,train_err(epoch),test_err(epoch));

%%%%%%%%%%%%%% END OF COMPUTING TEST RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 tt=0;
 for batch = 1:numbatches/10 % numbatches is 600 for the training set, so this loop runs over 60 combined batches of 1000 cases
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%%%% COMBINE 10 MINIBATCHES INTO 1 LARGER MINIBATCH %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 tt=tt+1; 
 data=[];
 for kk=1:10
  data=[data 
        batchdata(:,:,(tt-1)*10+kk)]; 
 end 

%%%%%%%%%%%%%%% PERFORM CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  max_iter=3;
  VV = [w1(:)' w2(:)' w3(:)' w4(:)' w5(:)' w6(:)' w7(:)' w8(:)']';% stack all weights (biases included) into one long column vector
  Dim = [l1; l2; l3; l4; l5; l6; l7; l8; l9];% number of nodes in each layer (bias units not counted)

  [X, fX] = minimize(VV,'CG_MNIST',max_iter,Dim,data);

  w1 = reshape(X(1:(l1+1)*l2),l1+1,l2);
  xxx = (l1+1)*l2;
  w2 = reshape(X(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
  xxx = xxx+(l2+1)*l3;
  w3 = reshape(X(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
  xxx = xxx+(l3+1)*l4;
  w4 = reshape(X(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
  xxx = xxx+(l4+1)*l5;
  w5 = reshape(X(xxx+1:xxx+(l5+1)*l6),l5+1,l6);
  xxx = xxx+(l5+1)*l6;
  w6 = reshape(X(xxx+1:xxx+(l6+1)*l7),l6+1,l7);
  xxx = xxx+(l6+1)*l7;
  w7 = reshape(X(xxx+1:xxx+(l7+1)*l8),l7+1,l8);
  xxx = xxx+(l7+1)*l8;
  w8 = reshape(X(xxx+1:xxx+(l8+1)*l9),l8+1,l9); % reassign each matrix in turn from the optimized parameters

%%%%%%%%%%%%%%% END OF CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 end

 save mnist_weights w1 w2 w3 w4 w5 w6 w7 w8 
 save mnist_error test_err train_err;

end

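　　MNIST recognition experiment:

　　The recognition part on the MNIST handwritten digit database is very similar to the dimensionality-reduction part above. It also pre-trains the whole network first, except that for recognition the pre-trained network must include the output softmax part, and that part is pre-trained with a supervised method. The fine-tuning differs as follows: in the dimensionality-reduction experiment fine-tuning is unsupervised, i.e., the labels are the original input data, whereas in the recognition experiment the labels are the actual labels of the training samples.

　　For MNIST handwritten digit recognition, the cost function of the network with the softmax part added has to be computed, and the author's program provides 2 functions for this. The first is used to pre-train the softmax classifier:

　　function [f, df] = CG_CLASSIFY_INIT(VV,Dim,w3probs,target);

　　This function is dedicated to pre-training the softmax classifier part, because the initial RBM pre-training does not include the output softmax layer. In backpropclassify.m the input VV is the weight vector of the softmax layer only (VV = w_class(:)), Dim is the vector of node counts of the 2 layers that make up the softmax part, w3probs is the sample set used to train the softmax layer, and target holds the labels of that sample set. f and df are the cost function of the softmax network and its partial derivatives.

　　The other function is the one that actually computes the cost for fine-tuning the network:

　　function [f, df] = CG_CLASSIFY(VV,Dim,XX,target);

　　The input VV is the parameter vector of the network, Dim is the vector of node counts of each layer, XX is the training set, and target holds the labels of the training set. f and df are the cost function of the whole network and its partial derivatives.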
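The softmax output and cost computed inside both functions can be illustrated on toy numbers. This is a sketch, not the author's code: the matrices `A` and `target` are made-up values, and subtracting the row maximum before exp is a numerical-stability step that I add here; the original listings call exp directly.

```matlab
% Toy pre-softmax activations and one-hot labels (hypothetical values).
A = [1 2 3; 0 0 0];                              % N x K inputs to the softmax layer
target = [0 0 1; 1 0 0];                         % one-hot rows
E = exp(A - repmat(max(A, [], 2), 1, size(A, 2)));   % subtract the row maximum (added for stability)
P = E ./ repmat(sum(E, 2), 1, size(A, 2));       % softmax probabilities; each row sums to 1
f = -sum(sum(target .* log(P)));                 % cross-entropy cost, same form as in CG_CLASSIFY
IO = P - target;                                 % derivative of f with respect to A
```

IO plays the role of the variable of the same name in CG_CLASSIFY_INIT and CG_CLASSIFY, from which dw_class = w3probs'*IO is formed.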

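　　Experimental results:

　　The author uses 1 input layer, 3 hidden layers and one softmax classification layer, with node counts 784-500-500-2000-10.

　　The final recognition error rate is 1.2%.

　　Main code of the experiment, with comments: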

mnistclassify.m:

clear all
close all

maxepoch=50; 
numhid=500; numpen=500; numpen2=2000; 

fprintf(1,'Converting Raw files into Matlab format \n');
converter; 

fprintf(1,'Pretraining a deep autoencoder. \n');
fprintf(1,'The Science paper used 50 epochs. This uses %3i \n', maxepoch);

makebatches;
[numcases numdims numbatches]=size(batchdata);

fprintf(1,'Pretraining Layer 1 with RBM: %d-%d \n',numdims,numhid);
restart=1;
rbm;
hidrecbiases=hidbiases; 
save mnistvhclassify vishid hidrecbiases visbiases;% mnistvhclassify is the file that stores the weights of the first layer

fprintf(1,'\nPretraining Layer 2 with RBM: %d-%d \n',numhid,numpen);
batchdata=batchposhidprobs;
numhid=numpen;
restart=1;
rbm;
hidpen=vishid; penrecbiases=hidbiases; hidgenbiases=visbiases;
save mnisthpclassify hidpen penrecbiases hidgenbiases;% mnisthpclassify likewise, for the 2nd layer

fprintf(1,'\nPretraining Layer 3 with RBM: %d-%d \n',numpen,numpen2);
batchdata=batchposhidprobs;
numhid=numpen2;
restart=1;
rbm;
hidpen2=vishid; penrecbiases2=hidbiases; hidgenbiases2=visbiases;
save mnisthp2classify hidpen2 penrecbiases2 hidgenbiases2;

backpropclassify;

backpropclassify.m:

maxepoch=200;
fprintf(1,'\nTraining discriminative model on MNIST by minimizing cross entropy error. \n');
fprintf(1,'60 batches of 1000 cases each. \n');

load mnistvhclassify % load the pre-trained weights of the 3 RBMs
load mnisthpclassify
load mnisthp2classify

makebatches;
[numcases numdims numbatches]=size(batchdata);
N=numcases; 

%%%% PREINITIALIZE WEIGHTS OF THE DISCRIMINATIVE MODEL%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

w1=[vishid; hidrecbiases];
w2=[hidpen; penrecbiases];
w3=[hidpen2; penrecbiases2];
w_class = 0.1*randn(size(w3,2)+1,10); % classification layer: the last layer outputs 10 nodes directly and acts as a softmax classifier
 

%%%%%%%%%% END OF PREINITIALIZATIO OF WEIGHTS  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

l1=size(w1,1)-1;
l2=size(w2,1)-1;
l3=size(w3,1)-1;
l4=size(w_class,1)-1;
l5=10; 
test_err=[];
train_err=[];


for epoch = 1:maxepoch % 200

%%%%%%%%%%%%%%%%%%%% COMPUTE TRAINING MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0; 
err_cr=0;
counter=0;
[numcases numdims numbatches]=size(batchdata);
N=numcases;
 for batch = 1:numbatches
  data = [batchdata(:,:,batch)];
  target = [batchtargets(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10); % softmax classifier

  [I J]=max(targetout,[],2);% J holds the index of the maximum
  [I1 J1]=max(target,[],2);
  counter=counter+length(find(J==J1));% length(find(J==J1)) is the number of samples whose predicted class equals the true label
  err_cr = err_cr- sum(sum( target(:,1:end).*log(targetout))) ;
 end
 train_err(epoch)=(numcases*numbatches-counter);% number of misclassified training samples in this epoch
 train_crerr(epoch)=err_cr/numbatches;

%%%%%%%%%%%%%% END OF COMPUTING TRAINING MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%% COMPUTE TEST MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0;
err_cr=0;
counter=0;
[testnumcases testnumdims testnumbatches]=size(testbatchdata);
N=testnumcases;
for batch = 1:testnumbatches
  data = [testbatchdata(:,:,batch)];
  target = [testbatchtargets(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);

  [I J]=max(targetout,[],2);
  [I1 J1]=max(target,[],2);
  counter=counter+length(find(J==J1));
  err_cr = err_cr- sum(sum( target(:,1:end).*log(targetout))) ;
end
 test_err(epoch)=(testnumcases*testnumbatches-counter); % number of misclassified test samples; all of this is obtained on top of the pre-trained weights
 test_crerr(epoch)=err_cr/testnumbatches;
 fprintf(1,'Before epoch %d Train # misclassified: %d (from %d). Test # misclassified: %d (from %d) \t \t \n',...
            epoch,train_err(epoch),numcases*numbatches,test_err(epoch),testnumcases*testnumbatches);

%%%%%%%%%%%%%% END OF COMPUTING TEST MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 tt=0;
 for batch = 1:numbatches/10
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%%%% COMBINE 10 MINIBATCHES INTO 1 LARGER MINIBATCH %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 tt=tt+1; 
 data=[];
 targets=[]; 
 for kk=1:10
  data=[data 
        batchdata(:,:,(tt-1)*10+kk)]; 
  targets=[targets
        batchtargets(:,:,(tt-1)*10+kk)];
 end 

%%%%%%%%%%%%%%% PERFORM CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  max_iter=3;

  if epoch<6  % First update top-level weights holding other weights fixed. The first 5 epochs train only the softmax part
    N = size(data,1);
    XX = [data ones(N,1)];
    w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
    w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
    w3probs = 1./(1 + exp(-w2probs*w3)); %w3probs = [w3probs  ones(N,1)];

    VV = [w_class(:)']';
    Dim = [l4; l5];
    [X, fX] = minimize(VV,'CG_CLASSIFY_INIT',max_iter,Dim,w3probs,targets);
    w_class = reshape(X,l4+1,l5);

  else
    VV = [w1(:)' w2(:)' w3(:)' w_class(:)']';
    Dim = [l1; l2; l3; l4; l5];
    [X, fX] = minimize(VV,'CG_CLASSIFY',max_iter,Dim,data,targets);

    w1 = reshape(X(1:(l1+1)*l2),l1+1,l2);
    xxx = (l1+1)*l2;
    w2 = reshape(X(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
    xxx = xxx+(l2+1)*l3;
    w3 = reshape(X(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
    xxx = xxx+(l3+1)*l4;
    w_class = reshape(X(xxx+1:xxx+(l4+1)*l5),l4+1,l5);

  end
%%%%%%%%%%%%%%% END OF CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 end

 save mnistclassify_weights w1 w2 w3 w_class
 save mnistclassify_error test_err test_crerr train_err train_crerr;

end

CG_CLASSIFY_INIT.m:

function [f, df] = CG_CLASSIFY_INIT(VV,Dim,w3probs,target);% network with only 2 layers
l1 = Dim(1);
l2 = Dim(2);
N = size(w3probs,1);% N is the number of training samples
% Do decomversion.
  w_class = reshape(VV,l1+1,l2);
  w3probs = [w3probs  ones(N,1)];  

  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);
  f = -sum(sum( target(:,1:end).*log(targetout))) ;% f is the cost (cross-entropy) of the softmax classifier
IO = (targetout-target(:,1:end));
Ix_class=IO; 
dw_class =  w3probs'*Ix_class; % partial derivatives

df = [dw_class(:)']';

CG_CLASSIFY.m:

function [f, df] = CG_CLASSIFY(VV,Dim,XX,target);

l1 = Dim(1);
l2 = Dim(2);
l3= Dim(3);
l4= Dim(4);
l5= Dim(5);
N = size(XX,1);

% Do decomversion.
 w1 = reshape(VV(1:(l1+1)*l2),l1+1,l2);
 xxx = (l1+1)*l2;
 w2 = reshape(VV(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
 xxx = xxx+(l2+1)*l3;
 w3 = reshape(VV(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
 xxx = xxx+(l3+1)*l4;
 w_class = reshape(VV(xxx+1:xxx+(l4+1)*l5),l4+1,l5);


  XX = [XX ones(N,1)];
  w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];

  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);
  f = -sum(sum( target(:,1:end).*log(targetout))) ;

IO = (targetout-target(:,1:end));
Ix_class=IO; 
dw_class =  w3probs'*Ix_class; 

Ix3 = (Ix_class*w_class').*w3probs.*(1-w3probs);
Ix3 = Ix3(:,1:end-1);
dw3 =  w2probs'*Ix3;

Ix2 = (Ix3*w3').*w2probs.*(1-w2probs); 
Ix2 = Ix2(:,1:end-1);
dw2 =  w1probs'*Ix2;

Ix1 = (Ix2*w2').*w1probs.*(1-w1probs); 
Ix1 = Ix1(:,1:end-1);
dw1 =  XX'*Ix1;

df = [dw1(:)' dw2(:)' dw3(:)' dw_class(:)']';

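　　Summary of the experiments:

　　 1. I have finally read through the source code of an RBM. Before this I had only looked at the theory and its many formulas; now that there is code to go with them, reading the code is a real pleasure!

　　 2. Because whole images are used for training here (not patches of them), there is no convolution or pooling involved. During pre-training, therefore, the input of each RBM is simply the output of the previous RBM. When no softmax layer is added, fine-tuning is still unsupervised (the labels are still the original input data); once the softmax layer is added, fine-tuning uses the true labels of the training samples, because classification is now required.

　　 3. The deeper the network, the longer fine-tuning takes; convergence needs a lot of time even after pre-training.

　　 4. I still do not understand where the data for the second layer comes from when convolution is used to train on large images. It might be, as above, the output of the previous layer (but then how is fine-tuning done, without labeled data?), or it might be the output of the large image after convolution and pooling through the first layer (in which case the cost function of the network is hard to set up, because it contains convolution and pooling operations).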

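　　References:

Deep learning: 34 (Dimensionality reduction of data with a NN)

　　reducing the dimensionality of data with neural networks

http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html

The difference between modulo (mod) and remainder (rem): Matlab study notes

Deep learning: 36 (Some confusion about building a deep convolutional SAE network)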

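　　Preface:

　　Recently I have kept wondering: if I train a deep model with the SCSAE (stacked convolution sparse autoencoder) algorithm, where does the training data for the second layer and all later layers come from? If the sample size (the rawest data provided, e.g., a set of large images) equals the input dimension of the first network we train, then the input of the second layer is simply the output of the first layer (and so on for later layers). In that case, however, no convolution is involved at all (and where there is convolution there is usually pooling as well), so it does not belong to the SCSAE framework I want to discuss. Based on my own understanding of deep learning (I have been working with DL for less than 2 months and am still a beginner), I believe the input of the second layer should be obtained by passing the original training set through the already-trained first layer by convolution and then sampling from its output (if the output feature maps are larger than the input size of the second layer, random sampling is needed).

　　The other question I have been thinking about is this: once the SCSAE network has been pre-trained successfully, how should fine-tuning be carried out? Fine-tuning will certainly use the BP algorithm, but the SCSAE network has no clear, intuitive structure (unlike a non-convolutional network, which may have many layers but where the output of one layer connects directly to the input of the next, so the structure is clear at a glance and fine-tuning is easy to understand). Will BP therefore work differently here? Consider in particular the pooling that follows convolution, e.g., max-pooling: it has no parameters to learn, yet it is a layer of the SCSAE structure, so it seems it should affect the BP computation.

　　Content:

　　With these 2 questions in mind I started searching the web for answers. First I found the paper Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. Judging by the title it was exactly what I wanted, except that it pre-trains with a standard AE rather than an SAE, which matters little. I skimmed the paper; here are some notes:

　　A DAE (denoising autoencoder), when training the network parameters, first computes statistics of the input samples, adds a certain amount of noise to the samples according to those statistics, and then feeds the noisy images into the network for training. The benefit is that a network able to reconstruct the samples from their noisy versions generalizes better.

　　In SCAE (the paper's method) every hidden node is used for convolution. For a given hidden node, convolving a large image yields one feature map, and applying the transposed weight matrix of that same node to the feature map recovers the corresponding large image.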

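　　A short summary of gradient descent:

　　Standard gradient descent is batch gradient descent: the increment of each update is computed from the error of all samples at once.

　　Stochastic gradient descent is similar to standard gradient descent, except that each weight update uses only one randomly chosen sample to compute the increment.

　　Conjugate gradient descent builds on the gradient descent above and uses a strategy to adapt the learning rate: it performs a series of line searches to find a direction toward the minimum of the error function and then finds a suitable learning rate along that direction. The first line search is along the negative gradient.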
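The three variants can be written side by side. The notation is mine, not the author's: E_n is the error on sample n, g_t is the gradient of the total error at w_t, η is a fixed learning rate and α_t is the step length returned by a line search. The Polak-Ribière formula is given as one common choice of β_t, not as the choice made by any particular implementation discussed here.

```latex
\begin{aligned}
\text{batch:}\quad & w_{t+1} = w_t - \eta\,\frac{1}{N}\sum_{n=1}^{N}\nabla E_n(w_t),\\
\text{stochastic:}\quad & w_{t+1} = w_t - \eta\,\nabla E_{n_t}(w_t), \qquad n_t \text{ drawn at random},\\
\text{conjugate gradient:}\quad & d_0 = -g_0,\quad d_t = -g_t + \beta_t\, d_{t-1},\quad w_{t+1} = w_t + \alpha_t\, d_t,\\
& \beta_t^{\mathrm{PR}} = \frac{g_t^{\top}(g_t - g_{t-1})}{g_{t-1}^{\top} g_{t-1}}.
\end{aligned}
```

The first search direction d_0 is the negative gradient, which matches the remark above; later directions mix in the previous direction instead of following the gradient alone.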

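　　Max-pooling improves the invariance of the extracted features and was originally intended for supervised learning. The max-pooling the authors use is generally non-overlapping, and they have published related MATLAB source code: http://www.idsia.ch/~masci/software.php. The paper states that once a max-pooling layer is used there is no need for L1 or L2 regularization on the hidden layer or the weights. Why?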
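Non-overlapping max-pooling itself is simple to write down. The following is a minimal sketch on a made-up 4x4 feature map with 2*2 windows; it is not taken from the authors' published code, and the names `F`, `p` and `P` are assumptions for the example.

```matlab
F = [1 2 5 6;
     3 4 7 8;
     9 1 2 3;
     0 5 4 1];                       % toy 4x4 feature map (hypothetical values)
p = 2;                               % pooling size; windows do not overlap
[H, W] = size(F);
P = zeros(H/p, W/p);
for r = 1:H/p
    for c = 1:W/p
        block = F((r-1)*p+1 : r*p, (c-1)*p+1 : c*p);
        P(r, c) = max(block(:));     % keep the strongest response in each window
    end
end
disp(P);
```

Shifting a strong response by one pixel inside a window leaves P unchanged, which is the kind of invariance the paragraph refers to. The layer has no trainable parameters.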

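　　With one hidden layer (optionally followed by a max-pooling layer) the authors extract 20 features of size 7*7 from the MNIST and CIFAR10 databases. The results are shown below:

　　Here a shows the features learned without noise and without a pooling layer; b with noise but without a pooling layer; c with a 2*2 pooling layer but without noise; d with a 2*2 pooling layer and with noise.

　　As can be seen, panel c learns fairly good features; panel c uses pooling and no noise. Since panels a and b both learn unimportant features, and the features in panel d do not look much like those of the human visual cortex, the authors conclude that adding noise is of little use and that max-pooling is extremely powerful, so powerful that, as they put it, no other constraint is needed once max-pooling is used, as if it were a magic tool. I do not quite agree with the authors. For one thing, they only use a plain AE (with no other constraint, purely a compression), the number of features is small, and the number of training samples is small too, so it is quite normal that no features are learned.

　　Later the authors build a deep network with 6 hidden layers to perform recognition on the 2 databases MNIST and CIFAR10, using raw data without any preprocessing. Since they had already concluded that max-pooling with a standard AE is enough and that noise is unnecessary, this network is set up accordingly. The 1st hidden layer has 100 filters of size 5*5, the 2nd hidden layer is 2*2 max-pooling, the 3rd hidden layer has 150 filters of size 5*5, the 4th hidden layer is again 2*2 max-pooling, the 5th hidden layer has 200 filters of size 3*3, the 6th hidden layer has 300 fully connected nodes, and the final output layer is a softmax classifier. The same network is used for both databases; for CIFAR10, whose images are RGB, the image is split into 3 channel images that are fed separately into the 6-hidden-layer network for recognition (an interesting detail).

　　Its experimental results are as follows:

　　Summary:

　　This paper (there is not much in it, so it need not be read closely) did not give me the answers to my 2 questions, although I would guess that its pre-training works roughly the way I guessed for the first question. On the second question the authors do not elaborate at all (they only say that supervised fine-tuning used 5% of the samples); perhaps they assume the method is common knowledge.

　　So next I plan to read papers on CNNs, because once I understand how a CNN works, these 2 questions will no longer be questions.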

　　References:

http://www.idsia.ch/~masci/software.php

Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction

Deep learning: 37 (Optimization methods in deep learning)

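　　Content:

　　This post is mainly based on the paper On optimization methods for deep learning, which compares the performance of three common optimization algorithms, SGD (stochastic gradient descent), L-BFGS (limited-memory BFGS) and CG (conjugate gradient), within the deep learning setting. Below are some notes taken after reading it.

　　Advantages of SGD: simple to implement, and very fast to optimize when there are enough training samples.

　　Disadvantages of SGD: many parameters have to be tuned by hand, such as the learning rate and the convergence criterion. It is also a sequential method, which makes it awkward for GPU parallelism or distributed processing.

　　What distinguishes the common deep learning methods (e.g., Autoencoder, RBM, DBN, ICA, Sparse coding) is the form of the objective function. This is really the most essential difference: because the objective functions differ, the optimization methods may differ as well. For example, the objective of an RBM is related to the energy of the network and is optimized with CD, whereas the objective of an Autoencoder is the MSE between the target output and the actual output; since the partial derivatives of that objective can be computed directly, it can be optimized with L-BFGS, CG and similar methods. The other methods are analogous. One therefore cannot tell from the network structure alone which deep learning method is being used: given only a 2-layer network of size 64-100, you cannot know which method it belongs to, because such a network can be trained either as an RBM or as an Autoencoder.

　　The conclusion the authors draw from their experiments is that different optimization algorithms have different strengths and weaknesses and suit different situations. L-BFGS, for example, works better than SGD and CG when the parameter dimension is fairly low (generally meaning below 10000), especially for models with convolution. For high-dimensional parameter problems CG works better than the other 2. In other words, SGD is generally somewhat worse, and this remains true with GPU acceleration: on a GPU, L-BFGS and CG speed up markedly, while SGD gains very little. On a single-core processor, the advantage of L-BFGS comes mainly from using a second-order approximation of the relation between the parameters to speed up optimization, while CG benefits from conjugacy information, which in theory is defined with respect to the Hessian matrix.

　　However, with a large minibatch and a line search, the optimization performance of SGD also improves.

　　Comparison of the optimization performance of SGD, L-BFGS and CG on a single core for the Autoencoder model. The results are as follows:

　　As can be seen, SGD performs worst.

　　Under the same conditions, the comparison for training a Sparse autoencoder model is as follows:

　　Here SGD does even worse. The main reason is that L-BFGS and CG can use large minibatches to estimate the expected activation of each node, a value that is used to constrain the sparsity of that node, whereas SGD has to rely on a noisy estimate.

　　The authors also ran a good number of experiments on GPUs and with convolution.

　　Finally, the authors trained a Sparse autoencoder network with 2 hidden layers (pooling layers not counted) and applied it to MNIST. Its recognition rates are as follows:

　　The authors' website provides some code, see deep autoencoder with L-BFGS. From the title I expected the code to implement pre-training and fine-tuning of a deep convolutional autoencoder, since the paper uses convolution, but after reading the code I found that it implements just an ordinary two-layer autoencoder. So the second question of my earlier post still needs an answer: Deep learning: 36 (Some confusion about building a deep convolutional SAE network).

　　Below are some comments on the main parts of the authors' code:

optimizeAutoencoderLBFGS.m (carries out the parameter optimization of the deep autoencoder):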

function [] = optimizeAutoencoderLBFGS(layersizes, datasetpath, ...
                                       finalObjective)
% train a deep autoencoder with variable hidden sizes
% layersizes : the sizes of the hidden layers. For instance, specifying layersizes =
%     [200 100] will create a network looks like input -> 200 -> 100 -> 200
%     -> output (same size as input). Notice the mirroring structure of the
%     autoencoders. Default layersizes = [2*3072 100]
% datasetpath: the path to the CIFAR dataset (where we find the *.mat
%     files). see loadData.m
% finalObjective: the final objective that you use to compare to
%                 terminate your optimization. To qualify, the objective
%                 function on the entire training set must be below this
%                 value.
%
% Author: Quoc V. Le
% 
%% Handle default parameters
if nargin < 3 || isempty(finalObjective)
    finalObjective = 70; % i am just making this up, the evaluation objective 
                         % will be much lower
end
if nargin < 2 || isempty(datasetpath)
  datasetpath = '.';
end
if nargin < 1 || isempty(layersizes)
  layersizes = [2*3072 100];
  layersizes = [200 100];
end

%% Load data
loadData % traindata is 3072*10000; each column is one sample vector

%% Random initialization
initializeWeights;% judging from the author's code for this part, I see no trace of convolution or pooling, so how do the layers end up connected?

%% Optimization: minibatch L-BFGS
% Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. 
% On optimization methods for deep learning. ICML, 2011

addpath minFunc/
options.Method = 'lbfgs'; 
options.maxIter = 20;      
options.display = 'on';
options.TolX = 1e-3;

perm = randperm(size(traindata,2));
traindata = traindata(:,perm);% put the training samples in random order
batchSize = 1000;% there are 10000 samples in total, so they form 10 batches
maxIter = 20;
for i=1:maxIter    
    startIndex = mod((i-1) * batchSize, size(traindata,2)) + 1;
    fprintf('startIndex = %d, endIndex = %d\n', startIndex, startIndex + batchSize-1);
    data = traindata(:, startIndex:startIndex + batchSize-1); 
    [theta, obj] = minFunc( @deepAutoencoder, theta, options, layersizes, ...
                            data);
    if obj <= finalObjective % use the minibatch obj as a heuristic for stopping
                             % because checking the entire dataset is very
                             % expensive
        % yes, we should check the objective for the entire training set        
        trainError = deepAutoencoder(theta, layersizes, traindata);
        if trainError <= finalObjective
            % now your submission is qualified
            break
        end
    end
end

%% write to text files so that we can test your program
writeToTextFiles;

deepAutoencoder.m (computes the cost function of the deep network and its derivatives):

function [cost,grad] = deepAutoencoder(theta, layersizes, data)
% cost and gradient of a deep autoencoder 
% layersizes is a vector of sizes of hidden layers, e.g., 
% layersizes[2] is the size of layer 2
% this does not count the visible layer
% data is the input data, each column is an example
% the activation function of the last layer is linear, the activation
% function of intermediate layers is the hyperbolic tangent function

% WARNING: the code is optimized for ease of implementation and
% understanding, not speed nor space

%% FORCING THETA TO BE IN MATRIX FORMAT FOR EASE OF UNDERSTANDING
% Note that this is not optimized for space, one can just retrieve W and b
% on the fly during forward prop and backprop. But i do it here so that the
% readers can understand what's going on
layersizes = [size(data,1) layersizes];
l = length(layersizes);
lnew = 0;
for i=1:l-1
    lold = lnew + 1;
    lnew = lnew + layersizes(i) * layersizes(i+1);
    W{i} = reshape(theta(lold:lnew), layersizes(i+1), layersizes(i));
    lold = lnew + 1;
    lnew = lnew + layersizes(i+1);
    b{i} = theta(lold:lnew);
end
% handle tied-weight stuff
j = 1;
for i=l:2*(l-1)
    lold = lnew + 1;
    lnew = lnew + layersizes(l-j);
    W{i} = W{l - j}'; % simply use the transpose of the corresponding encoder matrix
    b{i} = theta(lold:lnew);
    j = j + 1;
end
assert(lnew == length(theta), 'Error: dimensions of theta and layersizes do not match\n')


%% FORWARD PROP
for i=1:2*(l-1)-1
    if i==1
        [h{i} dh{i}] = tanhAct(bsxfun(@plus, W{i}*data, b{i}));
    else
        [h{i} dh{i}] = tanhAct(bsxfun(@plus, W{i}*h{i-1}, b{i}));
    end
end
h{i+1} = linearAct(bsxfun(@plus, W{i+1}*h{i}, b{i+1}));

%% COMPUTE COST
diff = h{i+1} - data; 
M = size(data,2); 
cost = 1/M * 0.5 * sum(diff(:).^2);% a purely standard autoencoder, with no extra constraint such as sparsity

%% BACKPROP
if nargout > 1
    outderv = 1/M * diff;    
    for i=2*(l-1):-1:2
        Wgrad{i} = outderv * h{i-1}';
        bgrad{i} = sum(outderv,2);        
        outderv = (W{i}' * outderv) .* dh{i-1};        
    end
    Wgrad{1} = outderv * data';
    bgrad{1} = sum(outderv,2);
        
    % handle tied-weight stuff        
    j = 1;
    for i=l:2*(l-1)
        Wgrad{l-j} = Wgrad{l-j} + Wgrad{i}';
        j = j + 1;
    end
    % dump the results to the grad vector
    grad = zeros(size(theta));
    lnew = 0;
    for i=1:l-1
        lold = lnew + 1;
        lnew = lnew + layersizes(i) * layersizes(i+1);
        grad(lold:lnew) = Wgrad{i}(:);
        lold = lnew + 1;
        lnew = lnew + layersizes(i+1);
        grad(lold:lnew) = bgrad{i}(:);
    end
    j = 1;
    for i=l:2*(l-1)
        lold = lnew + 1;
        lnew = lnew + layersizes(l-j);
        grad(lold:lnew) = bgrad{i}(:);
        j = j + 1;
    end
end 
end

%% USEFUL ACTIVATION FUNCTIONS
function [a da] = sigmoidAct(x)

a = 1 ./ (1 + exp(-x));
if nargout > 1
    da = a .* (1-a);
end
end

function [a da] = tanhAct(x)
a = tanh(x);
if nargout > 1
    da = (1-a) .* (1+a);
end
end

function [a da] = linearAct(x)
a = x;
if nargout > 1
    da = ones(size(a));
end
end

initializeWeights.m (initializes the parameters; the values are random but must satisfy certain requirements):

%% Random initialization
% X. Glorot, Y. Bengio. 
% Understanding the difficulty of training deep feedforward neural networks.
% AISTATS 2010.
% QVL: this initialization method appears to perform better than 
% theta = randn(d,1);
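s0 = size(traindata,1);% s0 is the dimensionality of the samples
layersizes = [s0 layersizes];% input layer-hidden1-hidden2, e.g. 3072-6144-100 for layersizes = [2*3072 100]
l = length(layersizes);% number of layers in the network, not counting the decoder part; with 2 hidden layers, l=3 here
lnew = 0;
for i=1:l-1 % i runs from 1 to 2 when l=3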
    lold = lnew + 1;
    lnew = lnew + layersizes(i) * layersizes(i+1);
    r  = sqrt(6) / sqrt(layersizes(i+1)+layersizes(i));   
    A = rand(layersizes(i+1), layersizes(i))*2*r - r; %reshape(theta(lold:lnew), layersizes(i+1), layersizes(i));
    theta(lold:lnew) = A(:); % assigns the weights W
    lold = lnew + 1;
    lnew = lnew + layersizes(i+1);
    A = zeros(layersizes(i+1),1);
    theta(lold:lnew) = A(:);% assigns the biases b
end % the encoder part ends here
j = 1;
for i=l:2*(l-1) % i runs from 3 to 4 when l=3; the decoder part starts here
    lold = lnew + 1;
    lnew = lnew + layersizes(l-j);
    theta(lold:lnew)= zeros(layersizes(l-j),1);
    j = j + 1;
end
theta = theta';
layersizes = layersizes(2:end); % drop the input layer
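
For readers who do not use MATLAB, the initialization above can be sketched in Python. This is only an illustrative sketch under stated assumptions: `glorot_uniform` and `init_autoencoder_params` are hypothetical helper names that do not appear in the original code, the parameters are returned as nested lists rather than packed into one `theta` vector, and the layer sizes in the usage line are a shrunken stand-in for the 3072-6144-100 example.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Return a fan_out x fan_in matrix drawn from U(-r, r),
    with r = sqrt(6) / sqrt(fan_in + fan_out)."""
    r = math.sqrt(6.0) / math.sqrt(fan_in + fan_out)
    return [[rng.uniform(-r, r) for _ in range(fan_in)] for _ in range(fan_out)]

def init_autoencoder_params(layersizes, seed=0):
    """layersizes includes the input layer, e.g. [input, hidden1, hidden2].

    Returns (weights, enc_biases, dec_biases). Decoder weights are tied to
    the encoder weights, so only decoder biases are created for the decoder.
    """
    rng = random.Random(seed)
    weights, enc_biases = [], []
    for n_in, n_out in zip(layersizes[:-1], layersizes[1:]):
        weights.append(glorot_uniform(n_in, n_out, rng))
        enc_biases.append([0.0] * n_out)
    # Decoder biases mirror the encoder in reverse: hidden1, then input.
    dec_biases = [[0.0] * n for n in reversed(layersizes[:-1])]
    return weights, enc_biases, dec_biases

# Hypothetical small network: 12 inputs, hidden layers of 24 and 5 units.
weights, enc_b, dec_b = init_autoencoder_params([12, 24, 5])
```

As in the MATLAB version, all biases start at zero and only the weights are random.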

　　References:

　　Le, Q. V., et al. (2011). On optimization methods for deep learning. Proc. of ICML.

deep autoencoder with L-BFGS

Deep learning: 36 (Some confusion about building deep convolutional SAE networks)

Deep learning: 38 (A brief introduction to Stacked CNN)

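　　Preface:

　　This post gives a brief introduction to stacked CNNs (deep convolutional networks). It grew out of a point of confusion I ran into while building an SAE network; see Deep learning: 36 (Some confusion about building deep convolutional SAE networks). When doing recognition on large images, one sometimes needs an unsupervised method to pre-train each layer of a stacked CNN and then uses the BP algorithm to fine-tune the whole network, with the output of each layer serving as the input of the next. That is easy to say, but is it really that easy? For a beginner, implementing this pipeline does not go so smoothly, because many details are involved. I do not intend to cover deep stacked networks, convolution, or pooling in detail here; for those, see the earlier posts Deep learning: 16 (deep networks) and Deep learning: 17 (Linear Decoders, Convolution and Pooling). This post focuses on just one point (explained below).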

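　　Background:

　　First, the advantage of convolution and pooling is that they reduce the number of parameters the network has to learn, and the learned features gain some invariance, for example to small translations and rotations. Taking feature extraction from 2-D images as an example, the number of parameters drops because the network does not take every pixel of the whole image as input; it only needs to learn from small patches. The invariance comes from using mean-pooling, max-pooling, or similar methods.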

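　　Take the classic LeNet5 architecture diagram as an example:

　　As the diagram shows, for every 32*32 input image the network outputs an 84-dimensional vector, which is the feature vector we extract.

　　Layer C1 consists of 6 feature maps of size 28*28. They are obtained by convolving the 32*32 input image with 6 patches of size 5*5 using a stride of 1 pixel, so 28=32-5+1. Layer S2 then has 6 feature maps of size 14*14, because every 4 pixels (a 2*2 block) are pooled into 1 value. All of this is easy to understand and is explained in detail in the UFLDL tutorials Feature extraction using convolution and Pooling.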
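
The size arithmetic for C1, S2, and C3 can be checked with a few lines of Python. This is a minimal sketch; the helper names `conv_output_size` and `pool_output_size` are made up for illustration, and the fully connected layer used for comparison is a hypothetical alternative, not part of LeNet5.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Side length of a 'valid' convolution output (no padding)."""
    return (input_size - kernel_size) // stride + 1

def pool_output_size(input_size, pool_size):
    """Side length after non-overlapping pooling."""
    return input_size // pool_size

c1 = conv_output_size(32, 5)   # 32 - 5 + 1
s2 = pool_output_size(c1, 2)   # one value per 2*2 block
c3 = conv_output_size(s2, 5)   # same rule applied to the S2 maps

# Weights only (biases ignored): 6 shared 5*5 kernels versus a hypothetical
# fully connected layer from all 32*32 pixels to all 6*c1*c1 units.
conv_weights = 6 * 5 * 5
dense_weights = (32 * 32) * (6 * c1 * c1)
print(c1, s2, c3, conv_weights, dense_weights)
```

The gap between `conv_weights` and `dense_weights` is the parameter saving that weight sharing over small patches provides.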

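　　The hardest question is: how are the 16 feature maps of size 10*10 in C3 obtained? That is what this post most wants to make clear.

　　Some might say this is simple: feed the contents of S2 into a network with a 5*5 input layer and 16 hidden units. That explanation is wrong and misses the essence of the problem. My answer is: convolve the S2 feature maps with a network that has 150 input nodes (=5*5*6, not 5*5) and 16 output nodes.

　　Moreover, each feature map in C3 is not necessarily connected to every feature map in S2; it may be connected to only some of them. In LeNet5, for example, the connections are as follows:

　　Entries marked with an X indicate a connection. Take one of the 16 hidden nodes of the learned network (structure 150-16) for analysis, say feature map No. 3 in C3, which is connected to feature maps No. 3, 4, and 5 of the preceding layer S2. How is the value of this feature map No. 3 (call it H3) obtained? The procedure is as follows: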

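　　First, split the 150 input nodes of the 150-16 network (this notation is used from here on; it means 150 input-layer nodes and 16 hidden-layer nodes) into 6 parts of 25 consecutive nodes each. Take the third-from-last part (25 nodes), and specifically the 25 weights that connect it to the 4th of the 16 hidden nodes (the 4th because it corresponds to No. 3, counting from 0). Reshape them to 5*5 and convolve the third-from-last S2 feature map with this 5*5 feature patch; call the resulting feature map h1.

　　Similarly, take the second-from-last part of the 150 inputs (25 nodes), again the 25 weights connected to the 4th hidden node, reshape them to 5*5, and convolve the second-from-last S2 feature map with this patch; call the resulting feature map h2.

　　Next, take the last part of the 150 inputs (25 nodes), again the 25 weights connected to the 4th hidden node, reshape them to 5*5, and convolve the last S2 feature map with this patch; call the resulting feature map h3.

　　Finally, add the three matrices h1, h2, and h3 to get a new matrix h, add a bias b to every element of h, and pass the result through the sigmoid activation function. This gives the feature map H3 we are after.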
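
The procedure for H3 can be sketched in pure Python as follows. This is an illustration under assumptions, not the original implementation: `conv2d_valid` and `c3_feature_map` are hypothetical names, the S2 maps and weights are random stand-ins, each 25-weight block is reshaped row by row (MATLAB's `reshape` is column-major), and the sliding-window product is computed without flipping the kernel.

```python
import math
import random

def conv2d_valid(image, kernel):
    """'Valid' 2-D sliding-window filter (no padding, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)] for r in range(oh)]

def c3_feature_map(s2_maps, w_row, bias, connected):
    """Build one C3 map from the S2 maps whose indices are in `connected`.

    w_row holds the 150 weights feeding one hidden node of the 150-16
    network, laid out as 6 consecutive blocks of 25 (one block per S2 map).
    """
    total = None
    for k in connected:
        block = w_row[25 * k: 25 * (k + 1)]
        kernel = [block[5 * i: 5 * i + 5] for i in range(5)]  # 25 -> 5*5
        h = conv2d_valid(s2_maps[k], kernel)
        if total is None:
            total = h
        else:
            total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, h)]
    # Add the bias to every element, then apply the sigmoid.
    return [[1.0 / (1.0 + math.exp(-(v + bias))) for v in row] for row in total]

rng = random.Random(0)
s2_maps = [[[rng.random() for _ in range(14)] for _ in range(14)] for _ in range(6)]
w_row = [rng.uniform(-0.1, 0.1) for _ in range(150)]  # stand-in for hidden node No. 3
H3 = c3_feature_map(s2_maps, w_row, bias=0.0, connected=[3, 4, 5])
```

`H3` is a 10*10 map, matching the C3 size, and only the last three weight blocks of `w_row` are used because map No. 3 is connected to S2 maps 3, 4, and 5.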

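　　That finally covers what I wanted to explain; the rest of the LeNet5 structure can be worked out in the same way. It turns out that describing this process in writing is quite hard; face to face, a few sentences would have done it.

　　In classic CNN architectures (such as LeNet5 here), there is no need to pre-train each layer. In current stacked CNNs, however, unsupervised pre-training is generally used to speed up the final optimization of the network parameters. Now let us address the first question raised in Deep learning: 36 (Some confusion about building deep convolutional SAE networks). Mapped onto the LeNet5 framework, the question is: when pre-training the weights W of the 150-16 network from S2 to C3, where do the training samples come from?

　　First, suppose we have m large images as training samples. S2 then yields 6*m feature maps in total, each of size 14*14. The convolution uses 5*5 patches and the input to this network is 150-dimensional, so the data must be sub-sampled. We therefore sample from these 6*m maps: for each group of 6 feature maps (the 6 S2 maps of one image), randomly sample a number of 5*5 patches from all 6 maps simultaneously (that is, at the same position in each map), and reshape each set of 6 patches, in order, into a 150-dimensional vector. This vector is one training sample for the 150-16 network. Repeating the procedure yields many such samples, which together form the training set of this network.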
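
The sampling scheme described above might look like this in Python. It is a sketch under assumptions: `sample_training_vector` is a hypothetical helper, the S2 maps are random stand-ins for real pooled features, and each patch is flattened row by row.

```python
import random

def sample_training_vector(s2_maps, patch=5, rng=random):
    """Cut the same patch*patch window out of every S2 map and concatenate."""
    size = len(s2_maps[0])
    r = rng.randrange(size - patch + 1)   # one random position...
    c = rng.randrange(size - patch + 1)   # ...shared by all 6 maps
    vec = []
    for fmap in s2_maps:                  # keep the map order fixed
        for i in range(patch):
            vec.extend(fmap[r + i][c:c + patch])
    return vec

rng = random.Random(0)
# Stand-in for the 6 S2 maps (14*14) produced by one training image.
s2_maps = [[[rng.random() for _ in range(14)] for _ in range(14)] for _ in range(6)]
samples = [sample_training_vector(s2_maps, rng=rng) for _ in range(20)]
```

Each entry of `samples` has 6*25 = 150 values and can serve as one input vector when pre-training the 150-16 network.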

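　　Here are some resources I found online over the past few days:

　　First, the handwritten character recognition demo for LeNet5; see its web page, LeNet-5, convolutional neural networks, and the paper behind the demo: LeCun, Y., et al. (1998). "Gradient-based learning applied to document recognition." The paper is long; reading only the part on single-character recognition is enough. For details on each layer of LeNet5 as described in the paper, see the web page Deep Learning study notes series (7).

　　Next is a simple version of LeNet5 written in Python with the Theano machine learning library: Convolutional Neural Networks (LeNet). Readers who know Python can take a look; it is quite accessible (and the gist is understandable even without knowing Python). For a MATLAB implementation of stacked CNN, see https://sites.google.com/site/chumerin/projects/mycnn, which includes source code and a GUI.

　　Finally, the paper on the algorithm Hinton's group used for ImageNet recognition in 2012: Imagenet classification with deep convolutional neural networks. The corresponding GPU-based C++ code is also available: https://code.google.com/p/cuda-convnet/.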

　　Summary:

　　The source of the training samples for later layers during pre-training of a stacked CNN is now clear, but I still do not fully understand the final fine-tuning of the whole network; it probably involves quite a few formulas.

　　References:

Deep learning: 36 (Some confusion about building deep convolutional SAE networks)

Deep learning: 16 (deep networks)

Deep learning: 17 (Linear Decoders, Convolution and Pooling)

Deep Learning study notes series (7)

Convolutional Neural Networks (LeNet)

https://sites.google.com/site/chumerin/projects/mycnn

Gradient-based learning applied to document recognition.

Imagenet classification with deep convolutional neural networks.

Feature extraction using convolution

Pooling

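Deep learning: 39 (ICA model exercise)

　　Preface:

　　This post is an exercise on the ICA model. For the theory, see the earlier post Deep learning: 33 (ICA model). The content and steps of the experiment follow the UFLDL tutorial Exercise:Independent Component Analysis. As in many earlier exercises, the goal is to learn ICA features from the STL-10 dataset. The data are already provided as patches: 20,000 patches of size 8*8.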

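　　Experiment basics:

　　The steps are as follows:

　　1. Set the network parameters; the dimensionality of each input sample is 8*8*3=192.
　　2. Whiten the input samples, for example with ZCA whitening, making sure the parameter epsilon is set to 0.
　　3. Implement the ICA cost function and its gradient. The tutorial Exercise:Independent Component Analysis gives the cost function as:

　　$$J(W) = \lVert Wx \rVert_1$$

　　(subject, of course, to the weight matrix W being orthonormal).

　　However, an earlier UFLDL tutorial, Deriving gradients using the backpropagation idea, gives the cost function as:

　　$$J(W) = \lVert W^T W x - x \rVert_2^2$$

　　The second cost function seems more reasonable to me, and that tutorial also gives its gradient (so one can be lazy and skip the derivation when implementing it). The formula given there contains a small error, though; the corrected formula is:

　　$$\nabla_W J = 2W(W^T W x - x)x^T + 2(Wx)(W^T W x - x)^T$$

　　The error is that the leftmost factor of the first term on the right-hand side should be W rather than its transpose W'; otherwise the matrix dimensions do not match when the program runs.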

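　　4. Finally, optimize the parameter W iteratively. Because W must remain orthonormal, L-BFGS cannot be applied directly as in earlier exercises. Instead, each iteration takes a gradient descent step and then applies an orthonormalization step that turns W back into an orthonormal matrix. The learning rate alpha described in the tutorial changes dynamically and is found by a line search. The orthonormalization formula for W is:

　　$$W \leftarrow (WW^T)^{-\frac{1}{2}} W$$

　　5. With the cost function and gradient above, Ng's code does not run: the program gets stuck in an infinite loop during the line search. (I have not studied line search, so I do not understand why.) Finally, following a suggestion from the user "蜘蛛小俠" in the Deep Learning high-quality discussion group, I added a sparsity constraint on the features to the cost function (note that the features here are Wx) and increased the number of iterations in Ng's code, for example to 5000. Nothing else in the program needs to change, and it then produces results.

　　The cost function then becomes (as implemented in the uncommented lines of orthonormalICACost.m, with X the matrix whose m columns are the patches, and the square, square root, and division $\oslash$ taken element-wise):

　　$$J(W) = \frac{1}{m}\lVert W^T W X - X \rVert_F^2 + \sum_{i,j}\sqrt{(WX)_{ij}^2 + \epsilon}$$

　　and its gradient is:

　　$$\nabla_W J = \frac{1}{m}\left[2W(W^T W X - X)X^T + 2(WX)(W^T W X - X)^T\right] + \left[(WX) \oslash \sqrt{(WX)^2 + \epsilon}\right]X^T$$

　　The number of samples m must be taken into account here; otherwise, even if the cost function and its gradient pass the gradient check, the orthonormal projection check on W may still fail.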
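
The orthonormalization in step 4 is what the MATLAB line `(considerWeightMatrix*considerWeightMatrix')^(-0.5)*considerWeightMatrix` in ICAExercise.m computes. To make the idea concrete without a linear algebra library, the sketch below handles only the special case of a W with exactly two linearly independent rows, where the inverse square root of the 2*2 matrix WW' has a closed form. The function name `project_two_rows` and the example matrix are assumptions for illustration; the real exercise uses a 121*192 matrix and needs a general matrix inverse square root.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def project_two_rows(W):
    """Return (W W^T)^(-1/2) W for a W with two linearly independent rows."""
    G = matmul(W, transpose(W))          # 2*2 symmetric positive definite
    a, b, d = G[0][0], G[0][1], G[1][1]
    s = math.sqrt(a * d - b * b)         # sqrt(det G)
    t = math.sqrt(a + d + 2.0 * s)       # trace of sqrt(G)
    # Closed form for 2*2 SPD matrices: sqrt(G) = (G + s*I) / t
    r00, r01, r11 = (a + s) / t, b / t, (d + s) / t
    det_r = r00 * r11 - r01 * r01
    inv_sqrt = [[r11 / det_r, -r01 / det_r],
                [-r01 / det_r, r00 / det_r]]
    return matmul(inv_sqrt, W)

W = [[1.0, 2.0, 0.5], [0.3, -1.0, 2.0]]
P = project_two_rows(W)
PPt = matmul(P, transpose(P))            # should be close to the 2*2 identity
```

Checking that `PPt` is close to the identity plays the same role as the `assert` on `temp` in the MATLAB code.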

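　　Experimental results:

　　The training samples are displayed below:

　　The result after 20000 iterations is shown below (my computer's CPU is not powerful, so this took a full day; running 50000 iterations would presumably give better results, but I did not have time to verify that):

　　Main code of the experiment, with comments: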

ICAExercise.m:

%% CS294A/CS294W Independent Component Analysis (ICA) Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  ICA exercise. In this exercise, you will need to modify
%  orthonormalICACost.m and a small part of this file, ICAExercise.m.

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

numPatches = 20000;
numFeatures = 121;
imageChannels = 3;
patchDim = 8;
visibleSize = patchDim * patchDim * imageChannels;

outputDir = '.';
% The L1 penalty is usually replaced by the square root of the square plus a small constant, because the L1 norm is not differentiable at 0
epsilon = 1e-6; % L1-regularisation epsilon |Wx| ~ sqrt((Wx).^2 + epsilon)

%%======================================================================
%% STEP 1: Sample patches

patches = load('stlSampledPatches.mat');
patches = patches.patches(:, 1:numPatches);
displayColorNetwork(patches(:, 1:100));


%%======================================================================
%% STEP 2: ZCA whiten patches
%  In this step, we ZCA whiten the sampled patches. This is necessary for
%  orthonormal ICA to work.

patches = patches / 255;
meanPatch = mean(patches, 2);
patches = bsxfun(@minus, patches, meanPatch);

sigma = patches * patches';
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s))) * u';
patches = ZCAWhite * patches;

%%======================================================================
%% STEP 3: ICA cost functions
%  Implement the cost function for orthonormal ICA (you don't have to 
%  enforce the orthonormality constraint in the cost function) 
%  in the function orthonormalICACost in orthonormalICACost.m.
%  Once you have implemented the function, check the gradient.

% Use less features and smaller patches for speed
% numFeatures = 5;
% patches = patches(1:3, 1:5);
% visibleSize = 3;
% numPatches = 5;
% 
% weightMatrix = rand(numFeatures, visibleSize);
% 
% [cost, grad] = orthonormalICACost(weightMatrix, visibleSize, numFeatures, patches, epsilon);
% 
% numGrad = computeNumericalGradient( @(x) orthonormalICACost(x, visibleSize, numFeatures, patches, epsilon), weightMatrix(:) );
% % Uncomment to display the numeric and analytic gradients side-by-side
% % disp([numGrad grad]); 
% diff = norm(numGrad-grad)/norm(numGrad+grad);
% fprintf('Orthonormal ICA difference: %g\n', diff);
% assert(diff < 1e-7, 'Difference too large. Check your analytic gradients.');
% 
% fprintf('Congratulations! Your gradients seem okay.\n');


%%======================================================================
%% STEP 4: Optimization for orthonormal ICA
%  Optimize for the orthonormal ICA objective, enforcing the orthonormality
%  constraint. Code has been provided to do the gradient descent with a
%  backtracking line search using the orthonormalICACost function 
%  (for more information about backtracking line search, you can read the 
%  appendix of the exercise).
%
%  However, you will need to write code to enforce the orthonormality 
%  constraint by projecting weightMatrix back into the space of matrices 
%  satisfying WW^T  = I.
%
%  Once you are done, you can run the code. 10000 iterations of gradient
%  descent will take around 2 hours, and only a few bases will be
%  completely learned within 10000 iterations. This highlights one of the
%  weaknesses of orthonormal ICA - it is difficult to optimize for the
%  objective function while enforcing the orthonormality constraint - 
%  convergence using gradient descent and projection is very slow.

weightMatrix = rand(numFeatures, visibleSize);%121*192
[cost, grad] = orthonormalICACost(weightMatrix(:), visibleSize, numFeatures, patches, epsilon);
fprintf('%11s%16s%10s\n','Iteration','Cost','t');
startTime = tic();

% Initialize some parameters for the backtracking line search
alpha = 0.5;
t = 0.02;
lastCost = 1e40;

% Do 20000 iterations of gradient descent
for iteration = 1:20000
                       
    grad = reshape(grad, size(weightMatrix));
    newCost = Inf;        
    linearDelta = sum(sum(grad .* grad));
    
    % Perform the backtracking line search
    while 1
        considerWeightMatrix = weightMatrix - alpha * grad;
        % -------------------- YOUR CODE HERE --------------------
        % Instructions:
        %   Write code to project considerWeightMatrix back into the space
        %   of matrices satisfying WW^T = I.
        %   
        %   Once that is done, verify that your projection is correct by 
        %   using the checking code below. After you have verified your
        %   code, comment out the checking code before running the
        %   optimization.
        
        % Project considerWeightMatrix such that it satisfies WW^T = I
%         error('Fill in the code for the projection here');        
        considerWeightMatrix = (considerWeightMatrix*considerWeightMatrix')^(-0.5)*considerWeightMatrix;
        % Verify that the projection is correct
        temp = considerWeightMatrix * considerWeightMatrix';
        temp = temp - eye(numFeatures);
        assert(sum(temp(:).^2) < 1e-23, 'considerWeightMatrix does not satisfy WW^T = I. Check your projection again');
%         error('Projection seems okay. Comment out verification code before running optimization.');
        
        % -------------------- YOUR CODE HERE --------------------                                        

        [newCost, newGrad] = orthonormalICACost(considerWeightMatrix(:), visibleSize, numFeatures, patches, epsilon);
        if newCost >= lastCost - alpha * t * linearDelta
            t = 0.9 * t;
        else
            break;
        end
    end
   
    lastCost = newCost;
    weightMatrix = considerWeightMatrix;
    
    fprintf('  %9d  %14.6f  %8.7g\n', iteration, newCost, t);
    
    t = 1.1 * t;
    
    cost = newCost;
    grad = newGrad;
           
    % Visualize the learned bases as we go along    
    if mod(iteration, 10000) == 0
        duration = toc(startTime);
        % Visualize the learned bases over time in different figures so 
        % we can get a feel for the slow rate of convergence
        figure(floor(iteration /  10000));
        displayColorNetwork(weightMatrix'); 
    end
                   
end

% Visualize the learned bases
displayColorNetwork(weightMatrix');
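
In the listing above, the trial point inside the `while` loop is always `weightMatrix - alpha * grad`, and only the threshold parameter `t` changes between attempts. For comparison, here is a sketch of the common textbook form of backtracking (Armijo) line search, in which the step size itself shrinks until the cost decreases enough. It is not the exercise's code: the function name `backtracking_step`, its parameters, and the toy quadratic objective are all assumptions chosen for illustration, and no orthonormal projection is applied.

```python
def backtracking_step(f, grad_f, x, step=1.0, shrink=0.5, c=1e-4, max_tries=50):
    """One gradient-descent step with Armijo backtracking on a list of floats."""
    g = grad_f(x)
    g_norm_sq = sum(gi * gi for gi in g)
    fx = f(x)
    for _ in range(max_tries):
        candidate = [xi - step * gi for xi, gi in zip(x, g)]
        # Sufficient-decrease (Armijo) condition.
        if f(candidate) <= fx - c * step * g_norm_sq:
            return candidate, step
        step *= shrink                  # step too long: shrink and retry
    return x, 0.0                       # give up instead of looping forever

# Toy objective (an assumption for illustration): f(x) = sum((k+1) * x_k^2).
def f(x):
    return sum((k + 1) * xi * xi for k, xi in enumerate(x))

def grad_f(x):
    return [2 * (k + 1) * xi for k, xi in enumerate(x)]

x = [1.0, -2.0, 0.5]
for _ in range(100):
    x, used_step = backtracking_step(f, grad_f, x)
```

The `max_tries` cap is a design choice that guarantees the inner loop terminates even when no acceptable step is found.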

orthonormalICACost.m:

function [cost, grad] = orthonormalICACost(theta, visibleSize, numFeatures, patches, epsilon)
%orthonormalICACost - compute the cost and gradients for orthonormal ICA
%                     (i.e. compute the cost ||Wx||_1 and its gradient)

    weightMatrix = reshape(theta, numFeatures, visibleSize);
    
    cost = 0;
    grad = zeros(numFeatures, visibleSize);
    
    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   weights given in weightMatrix.     
    % -------------------- YOUR CODE HERE --------------------     
    %% Method 1:
    num_samples = size(patches,2);
%     cost = sum(sum((weightMatrix'*weightMatrix*patches-patches).^2))./num_samples+...
%             sum(sum(sqrt((weightMatrix*patches).^2+epsilon)))./num_samples;
%     grad = (2*weightMatrix*(weightMatrix'*weightMatrix*patches-patches)*patches'+...
%         2*weightMatrix*patches*(weightMatrix'*weightMatrix*patches-patches)')./num_samples+...
%         ((weightMatrix*patches./sqrt((weightMatrix*patches).^2+epsilon))*patches')./num_samples;
    cost = sum(sum((weightMatrix'*weightMatrix*patches-patches).^2))./num_samples+...
            sum(sum(sqrt((weightMatrix*patches).^2+epsilon)));
    grad = (2*weightMatrix*(weightMatrix'*weightMatrix*patches-patches)*patches'+...
        2*weightMatrix*patches*(weightMatrix'*weightMatrix*patches-patches)')./num_samples+...
        (weightMatrix*patches./sqrt((weightMatrix*patches).^2+epsilon))*patches';
    grad = grad(:);
end
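
The uncommented cost and gradient lines above can be sanity-checked numerically on a tiny problem. The sketch below mirrors them in pure Python and compares the analytic gradient with central finite differences. The helper names, the matrix sizes (2 features, 3 inputs, 4 samples), and the larger epsilon of 1e-2 are assumptions chosen to keep the check small and well conditioned.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def ica_cost_grad(W, X, eps):
    """Cost and gradient mirroring the uncommented MATLAB lines."""
    m = len(X[0])
    WX = matmul(W, X)
    WtWX = matmul(transpose(W), WX)
    R = [[p - q for p, q in zip(r1, r2)] for r1, r2 in zip(WtWX, X)]  # W'WX - X
    S = [[math.sqrt(v * v + eps) for v in row] for row in WX]         # smoothed |WX|
    cost = sum(v * v for row in R for v in row) / m + sum(v for row in S for v in row)
    t1 = matmul(matmul(W, R), transpose(X))     # W * R * X'
    t2 = matmul(WX, transpose(R))               # (W X) * R'
    ratio = [[v / s for v, s in zip(rv, rs)] for rv, rs in zip(WX, S)]
    t3 = matmul(ratio, transpose(X))            # smoothed-L1 term
    grad = [[2 * (a + b) / m + c for a, b, c in zip(r1, r2, r3)]
            for r1, r2, r3 in zip(t1, t2, t3)]
    return cost, grad

def numerical_grad(W, X, eps, h=1e-6):
    """Central finite differences, one weight at a time."""
    G = [[0.0] * len(W[0]) for _ in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            old = W[i][j]
            W[i][j] = old + h
            up, _ = ica_cost_grad(W, X, eps)
            W[i][j] = old - h
            down, _ = ica_cost_grad(W, X, eps)
            W[i][j] = old
            G[i][j] = (up - down) / (2 * h)
    return G

rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 features, 3 inputs
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 3 inputs, 4 samples
_, analytic = ica_cost_grad(W, X, 1e-2)
numeric = numerical_grad(W, X, 1e-2)
max_diff = max(abs(a - n) for ra, rn in zip(analytic, numeric) for a, n in zip(ra, rn))
```

A small `max_diff` indicates that the gradient with W (rather than W') as the leftmost factor of the first term agrees with the cost function.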