Deep Learning: Autoencoders

Autoencoders fall into three categories:

First:

sparse autoencoder

Second:

denoising autoencoder

Third:

contractive autoencoder



We now discuss each in turn:


sparse autoencoder:


In this section we study a well-known class of algorithms in the deep learning field: the sparse autoencoder. Deep learning is also described as unsupervised learning, so the sparse autoencoder here should be unsupervised as well. As explained in the earlier posts Deep learning:一(基礎知識_1) and Deep learning:七(基礎知識_2), in the supervised setting, once the structure of a neural network is fixed we can write down the expression for its loss function (which should, of course, penalize the parameters so that none of them grows too large), derive the expressions for the loss function's partial derivatives, and then use an optimization algorithm to find the optimal network parameters. Note that the loss function requires labeled samples. So why can the sparse autoencoder learn without supervision? Doesn't its loss function need labeled values (the y values, as they are usually called)? In fact, sparse autoencoding does need "labels"; it is just that the target output is the input feature vector x itself, i.e. the label is y = x. The benefit of this is that the hidden layer becomes a good stand-in for the input features, because it can reconstruct the input feature values fairly accurately. A network structure diagram of a sparse autoencoder is shown below:

[Figure: sparse autoencoder network structure]

Deriving the loss function:

Without a sparsity constraint, the network's loss function is:

\begin{align}J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2\end{align}

Sparse coding constrains the output of the hidden layer: the average output of each hidden node should be as close to zero as possible, so that most hidden nodes are inactive. The loss function of the sparse autoencoder is then:

\begin{align}J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j)\end{align}

The last term is the KL divergence, whose expression is:

\begin{align}{\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}\end{align}

The average output of a hidden node is computed as:

\begin{align}\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]\end{align}

The parameter \textstyle \rho is usually chosen to be very small, say 0.05, i.e. the probability of a rare event. The constraint thus requires the average output of every hidden node to be close to 0.05 (which effectively means close to zero, since the activation function in the network is the sigmoid), and this is what achieves sparsity. The KL divergence here measures how different the two values are; as the constraint term shows, the larger the difference, the larger the penalty, so the average outputs of the hidden nodes end up close to 0.05.

 

Deriving the partial derivatives of the loss function:

Without the sparsity rule, the partial derivatives of the loss function are obtained by the usual backpropagation derivation:

\begin{align}\delta^{(n_l)}_i &= - \left( y_i - a^{(n_l)}_i \right) f'(z^{(n_l)}_i), \\\delta^{(l)}_i &= \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i), \\\frac{\partial}{\partial W^{(l)}_{ij}} J(W,b;x,y) &= a^{(l)}_j \delta^{(l+1)}_i, \qquad \frac{\partial}{\partial b^{(l)}_i} J(W,b;x,y) = \delta^{(l+1)}_i.\end{align}

After adding sparsity, the error term of a hidden-layer neuron changes from

\begin{align}\delta^{(2)}_i = \left( \sum_{j=1}^{s_{2}} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i)\end{align}

to

\begin{align}\delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_{2}} W^{(2)}_{ji} \delta^{(3)}_j \right) + \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i).\end{align}

  

Solving with gradient descent:

With the loss function and its partial derivatives in hand, we can use gradient descent to find the optimal network parameters. One iteration of batch gradient descent is:

1. Set \textstyle \Delta W^{(l)} := 0 and \textstyle \Delta b^{(l)} := 0 for all layers \textstyle l.
2. For each training example, run backpropagation and accumulate \textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y) and \textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y).
3. Update the parameters: \textstyle W^{(l)} := W^{(l)} - \alpha \left[ \frac{1}{m} \Delta W^{(l)} + \lambda W^{(l)} \right], \textstyle b^{(l)} := b^{(l)} - \alpha \frac{1}{m} \Delta b^{(l)}.

As the formulas above show, computing the partial derivatives of the loss function is an accumulation: each training sample contributes one term. This is because the loss function itself is a sum of per-sample losses, and by the sum rule of differentiation its partial derivatives are likewise sums of per-sample partial derivatives. It follows that the order in which training samples are fed to the network does not matter: the operation performed for each sample is identical, and a later sample's result does not depend on an earlier one's (the contributions are simply added together, and the addition is order-independent).
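As a quick numerical check of this order-independence (a sketch with made-up gradient matrices, not the actual network gradients), summing per-example gradients in two different orders gives the same total up to floating-point error:

```python
import numpy as np

# Made-up per-example gradients; the point is only that accumulation
# is a plain sum, so the order of the examples does not matter.
rng = np.random.default_rng(0)
grads = [rng.normal(size=(3, 3)) for _ in range(5)]

total_forward = sum(grads)            # examples in order 1..5
total_reverse = sum(reversed(grads))  # examples in order 5..1
print(np.allclose(total_forward, total_reverse))  # True
```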

 

This summary is based on the following source, which gives a more detailed explanation:

http://deeplearning.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only a set of unlabeled training examples \textstyle \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}, where \textstyle x^{(i)} \in \Re^{n}. An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses \textstyle y^{(i)} = x^{(i)}.

Here is an autoencoder:

[Figure: an autoencoder network]

The autoencoder tries to learn a function \textstyle h_{W,b}(x) \approx x. In other words, it is trying to learn an approximation to the identity function, so as to output \textstyle \hat{x} that is similar to \textstyle x. The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data. As a concrete example, suppose the inputs \textstyle x are the pixel intensity values from a \textstyle 10 \times 10 image (100 pixels) so \textstyle n=100, and there are \textstyle s_2=50 hidden units in layer \textstyle L_2. Note that we also have \textstyle y \in \Re^{100}. Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input. I.e., given only the vector of hidden unit activations \textstyle a^{(2)} \in \Re^{50}, it must try to reconstruct the 100-pixel input \textstyle x. If the input were completely random---say, each \textstyle x_i comes from an IID Gaussian independent of the other features---then this compression task would be very difficult. But if there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA's.
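The shapes in this example can be sketched as a single forward pass (random weights and hypothetical variable names like `W1`, `W2`; this is only a shape illustration, not a trained network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, s2 = 100, 50                 # input size and hidden-layer size from the example
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(s2, n))   # encoder weights (hypothetical init)
b1 = np.zeros(s2)
W2 = rng.normal(scale=0.01, size=(n, s2))   # decoder weights
b2 = np.zeros(n)

x = rng.random(n)               # one 100-pixel input
a2 = sigmoid(W1 @ x + b1)       # 50-dim hidden code: the compressed representation
x_hat = sigmoid(W2 @ a2 + b2)   # 100-dim reconstruction; training pushes x_hat toward x
```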

Our argument above relied on the number of hidden units \textstyle s_2 being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

Informally, we will think of a neuron as being "active" (or as "firing") if its output value is close to 1, or as being "inactive" if its output value is close to 0. We would like to constrain the neurons to be inactive most of the time. This discussion assumes a sigmoid activation function. If you are using a tanh activation function, then we think of a neuron as being inactive when it outputs values close to -1.

Recall that \textstyle a^{(2)}_j denotes the activation of hidden unit \textstyle j in the autoencoder. However, this notation doesn't make explicit what was the input \textstyle x that led to that activation. Thus, we will write \textstyle a^{(2)}_j(x) to denote the activation of this hidden unit when the network is given a specific input \textstyle x. Further, let

\begin{align}\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]\end{align}

be the average activation of hidden unit \textstyle j (averaged over the training set). We would like to (approximately) enforce the constraint

\begin{align}\hat\rho_j = \rho,\end{align}

where \textstyle \rho is a sparsity parameter, typically a small value close to zero (say \textstyle \rho = 0.05). In other words, we would like the average activation of each hidden neuron \textstyle j to be close to 0.05 (say). To satisfy this constraint, the hidden unit's activations must mostly be near 0.
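Concretely, \textstyle \hat\rho_j is just the mean of hidden unit \textstyle j's activation over the training set; a sketch with random data (the weight shapes and names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, s2 = 20, 100, 50            # examples, input size, hidden units
X = rng.random((m, n))            # one training example per row
W1 = rng.normal(scale=0.01, size=(s2, n))
b1 = np.zeros(s2)

A2 = sigmoid(X @ W1.T + b1)       # A2[i, j] = a2_j(x^(i)), shape (m, s2)
rho_hat = A2.mean(axis=0)         # average activation per hidden unit, shape (s2,)
```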


To achieve this, we will add an extra penalty term to our optimization objective that penalizes \textstyle \hat\rho_j deviating significantly from \textstyle \rho. Many choices of the penalty term will give reasonable results. We will choose the following:

\begin{align}\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.\end{align}

Here, \textstyle s_2 is the number of neurons in the hidden layer, and the index \textstyle j is summing over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

\begin{align}\sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),\end{align}

where \textstyle {\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j} is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean \textstyle \rho and a Bernoulli random variable with mean \textstyle \hat\rho_j. KL-divergence is a standard function for measuring how different two different distributions are. (If you've not seen KL-divergence before, don't worry about it; everything you need to know about it is contained in these notes.)

This penalty function has the property that \textstyle {\rm KL}(\rho || \hat\rho_j) = 0 if \textstyle \hat\rho_j = \rho, and otherwise it increases monotonically as \textstyle \hat\rho_j diverges from \textstyle \rho. For example, in the figure below, we have set \textstyle \rho = 0.2, and plotted \textstyle {\rm KL}(\rho || \hat\rho_j) for a range of values of \textstyle \hat\rho_j:
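This penalty can be transcribed directly (the function name here is mine); evaluating it shows the minimum at \textstyle \hat\rho_j = \rho and growth as \textstyle \hat\rho_j moves away:

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

rho = 0.2
print(kl_bernoulli(rho, 0.2))    # 0.0 exactly at rho_hat = rho
print(kl_bernoulli(rho, 0.05))   # positive, larger than at rho_hat = 0.1
print(kl_bernoulli(rho, 0.999))  # very large as rho_hat approaches 1
```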

[Figure: {\rm KL}(\rho || \hat\rho_j) plotted against \hat\rho_j, with \rho = 0.2]

We see that the KL-divergence reaches its minimum of 0 at \textstyle \hat\rho_j = \rho, and blows up (it actually approaches \textstyle \infty) as \textstyle \hat\rho_j approaches 0 or 1. Thus, minimizing this penalty term has the effect of causing \textstyle \hat\rho_j to be close to \textstyle \rho.

Our overall cost function is now

\begin{align}J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),\end{align}

where \textstyle J(W,b) is as defined previously, and \textstyle \beta controls the weight of the sparsity penalty term. The term \textstyle \hat\rho_j (implicitly) depends on \textstyle W,b also, because it is the average activation of hidden unit \textstyle j, and the activation of a hidden unit depends on the parameters \textstyle W,b.

To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement trick involving only a small change to your code. Specifically, where previously for the second layer (\textstyle l=2), during backpropagation you would have computed

\begin{align}\delta^{(2)}_i = \left( \sum_{j=1}^{s_{2}} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),\end{align}

now instead compute

\begin{align}\delta^{(2)}_i =  \left( \left( \sum_{j=1}^{s_{2}} W^{(2)}_{ji} \delta^{(3)}_j \right)+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .\end{align}
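In vectorized form, the change amounts to adding one term inside the parentheses before multiplying by \textstyle f'(z^{(2)}). A sketch for a single example (variable names, shapes, and the value of rho_hat are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
s2, s3 = 50, 100                 # hidden and output layer sizes
rho, beta = 0.05, 3.0
W2 = rng.normal(scale=0.01, size=(s3, s2))
z2 = rng.normal(size=s2)         # layer-2 pre-activations for one example
a2 = sigmoid(z2)
delta3 = rng.normal(size=s3)     # output-layer error term
rho_hat = np.full(s2, 0.1)       # precomputed average activations (assumed)

f_prime = a2 * (1.0 - a2)        # sigmoid'(z2)

# without sparsity:
delta2_plain = (W2.T @ delta3) * f_prime

# with the sparsity term added inside the parentheses:
sparsity_term = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
delta2 = (W2.T @ delta3 + sparsity_term) * f_prime
```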

One subtlety is that you'll need to know \textstyle \hat\rho_i to compute this term. Thus, you'll need to compute a forward pass on all the training examples first to compute the average activations on the training set, before computing backpropagation on any example. If your training set is small enough to fit comfortably in computer memory (this will be the case for the programming assignment), you can compute forward passes on all your examples and keep the resulting activations in memory and compute the \textstyle \hat\rho_is. Then you can use your precomputed activations to perform backpropagation on all your examples. If your data is too large to fit in memory, you may have to scan through your examples computing a forward pass on each to accumulate (sum up) the activations and compute \textstyle \hat\rho_i (discarding the result of each forward pass after you have taken its activations \textstyle a^{(2)}_i into account for computing \textstyle \hat\rho_i). Then after having computed \textstyle \hat\rho_i, you'd have to redo the forward pass for each example so that you can do backpropagation on that example. In this latter case, you would end up computing a forward pass twice on each example in your training set, making it computationally less efficient.
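The memory-constrained case can be sketched as two scans over the data; only the first pass (accumulating \textstyle \hat\rho) is shown, and the example data and names are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
s2, n = 50, 100
W1 = rng.normal(scale=0.01, size=(s2, n))
b1 = np.zeros(s2)
examples = [rng.random(n) for _ in range(8)]  # stand-in for data streamed from disk

# Pass 1: forward pass per example, keeping only the running sum of activations
# (each full forward-pass result is discarded after it is accumulated).
acc = np.zeros(s2)
for x in examples:
    acc += sigmoid(W1 @ x + b1)
rho_hat = acc / len(examples)

# Pass 2 would then redo the forward pass per example and run backpropagation
# using this precomputed rho_hat.
```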


The full derivation showing that the algorithm above results in gradient descent is beyond the scope of these notes. But if you implement the autoencoder using backpropagation modified this way, you will be performing gradient descent exactly on the objective \textstyle J_{\rm sparse}(W,b). Using the derivative checking method, you will be able to verify this for yourself as well.


denoising autoencoder:











