Deep learning------------ Autoencoders

Autoencoders come in three types. Meow, which three are they? Ta-da-da-da:

First:

sparse autoencoder

Second:

denoising autoencoder

Finally:

contractive autoencoder

Let's go through them one by one:

Sparse autoencoder:

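This lesson looks at a well-known family of algorithms in deep learning: the sparse autoencoder, that is, automatic encoding under a sparsity constraint. Deep learning relies heavily on unsupervised feature learning, and the sparse autoencoder is likewise trained without supervision. As covered in the earlier posts Deep learning: 1 (Basics_1) and Deep learning: 7 (Basics_2), in supervised learning with a neural network, once the network structure is fixed we can write down the loss function (which should also penalize the network parameters so that no parameter grows too large), derive the expressions for its partial derivatives, and then use an optimization algorithm to find the best parameters. Note that the supervised loss function needs labeled samples.

So how can a sparse autoencoder learn without supervision? Does its loss function not need the label values (the usual y)? In fact a sparse autoencoder does need "labels" too, except that the target output is the input feature vector x itself, i.e. y = x. The benefit is that the hidden layer becomes a good substitute for the input features, because the inputs can be reconstructed from it fairly accurately. The network structure of a sparse autoencoder is shown below: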
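To make the y = x idea concrete, here is a minimal NumPy sketch of a one-hidden-layer autoencoder whose target is its own input. The names (`sigmoid`, `reconstruction_loss`, `W1`, `b1`, `W2`, `b2`), the layer sizes, and the choice of a sigmoid output layer are assumptions made for illustration, not something this post prescribes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruction_loss(X, W1, b1, W2, b2):
    """Average squared reconstruction error of a one-hidden-layer autoencoder.

    X has shape (m, n): m unlabeled examples with n features each.
    """
    A2 = sigmoid(X @ W1.T + b1)      # hidden activations, shape (m, s2)
    A3 = sigmoid(A2 @ W2.T + b2)     # reconstruction of the input, shape (m, n)
    Y = X                            # the "label" is simply the input itself
    return 0.5 * np.mean(np.sum((A3 - Y) ** 2, axis=1))

rng = np.random.default_rng(0)
m, n, s2 = 8, 6, 3                   # illustrative sizes only
X = rng.uniform(0.1, 0.9, size=(m, n))
W1 = rng.normal(scale=0.1, size=(s2, n))
b1 = np.zeros(s2)
W2 = rng.normal(scale=0.1, size=(n, s2))
b2 = np.zeros(n)

loss = reconstruction_loss(X, W1, b1, W2, b2)
print(loss)
```

No label array is passed in anywhere: the only data the loss sees is `X`.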

Computing the loss function:

Without the sparsity constraint, the network's loss function is as follows:
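The expression itself is not shown here. As a sketch, assuming the post follows the Stanford notes it cites (an average squared-error term plus a weight-decay penalty), the cost without the sparsity constraint would look roughly like this. The symbols $\textstyle m$ (number of examples), $\textstyle \lambda$ (weight-decay coefficient), $\textstyle n_l$ (number of layers) and $\textstyle s_l$ (units in layer $\textstyle l$) come from those notes and are assumptions here.

```latex
\begin{align}
J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - x^{(i)} \right\|^2
       + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2
\end{align}
```

The target inside the norm is $\textstyle x^{(i)}$ rather than a separate label, which is the y = x substitution described above.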

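The sparsity constraint acts on the outputs of the hidden layer: the average output of each hidden node should be as close to 0 as possible, so that most hidden nodes are inactive most of the time. The loss function of the sparse autoencoder therefore becomes:

$\begin{align}J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j)\end{align}$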

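The second term is the KL divergence, given by:

$\begin{align}{\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}\end{align}$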

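The average output of hidden node $\textstyle j$ is computed as:

$\begin{align}\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]\end{align}$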

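Here the sparsity parameter $\textstyle \rho$ is usually chosen to be very small, for example 0.05, comparable to the probability of a rare event. This asks every hidden node to have an average output close to 0.05 (in practice, close to 0, since the network's activation function is the sigmoid, whose outputs lie between 0 and 1), which is how sparsity is achieved. The KL divergence here measures the difference between two Bernoulli distributions, one with mean $\textstyle \rho$ and one with mean $\textstyle \hat\rho_j$. The constraint shows that the larger the difference, the larger the penalty, so the average output of each hidden node ends up close to 0.05.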

Computing the partial derivatives of the loss function:

Without the sparsity rule, the partial derivatives are obtained from the loss function by ordinary backpropagation, as follows:

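Once sparsity is added, the error term of a hidden-layer node changes: the backpropagated sum gains the extra term $\textstyle \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right)$ before it is multiplied by $\textstyle f'(z^{(2)}_i)$. Both versions of the formula appear in the quoted notes further down.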

Solving with gradient descent:

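With the loss function and its partial derivatives in hand, gradient descent can be used to find the optimal network parameters. The overall procedure is as follows: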

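As the procedure shows, the partial derivative of the loss function is an accumulation: each training example adds one contribution. This is because the loss function is itself the sum of the losses of the individual training examples, so by the sum rule of differentiation its partial derivative is the sum of the per-example partial derivatives. It follows that, within one batch update, the order in which the training examples are fed to the network does not matter: every example undergoes the same operation, and the result for a later example does not depend on the result for an earlier one (the contributions are simply added, and addition is order-independent).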
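The order-independence claim can be checked numerically. The sketch below uses a deliberately simplified stand-in, a single linear layer with loss 0.5 * ||W x - x||^2, instead of the full sparse autoencoder; `per_example_grad` and `batch_grad` are hypothetical helper names.

```python
import numpy as np

def per_example_grad(W, x):
    """Gradient of 0.5 * ||W x - x||^2 with respect to W for one example x."""
    return np.outer(W @ x - x, x)

def batch_grad(W, X, order):
    """Accumulate per-example gradients in the given order, then average them."""
    total = np.zeros_like(W)
    for i in order:
        total += per_example_grad(W, X[i])
    return total / len(order)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
W = rng.normal(size=(4, 4))

g_forward = batch_grad(W, X, range(len(X)))
g_shuffled = batch_grad(W, X, rng.permutation(len(X)))

# Equal up to floating-point rounding, because the parameters W are held
# fixed while the sum over examples is formed.
same = np.allclose(g_forward, g_shuffled)
print(same)
```

This only holds while the parameters stay fixed during the accumulation, as in batch gradient descent. If the weights were updated after every example (stochastic updates), the order would matter.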

This summary is based on the following page, which gives a more detailed explanation:

http://deeplearning.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only a set of unlabeled training examples $\textstyle \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, where $\textstyle x^{(i)} \in \Re^{n}$. An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses $\textstyle y^{(i)} = x^{(i)}$.

Here is an autoencoder:

The autoencoder tries to learn a function $\textstyle h_{W,b}(x) \approx x$. In other words, it is trying to learn an approximation to the identity function, so as to output $\textstyle \hat{x}$ that is similar to $\textstyle x$. The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data. As a concrete example, suppose the inputs $\textstyle x$ are the pixel intensity values from a $\textstyle 10 \times 10$ image (100 pixels) so $\textstyle n=100$, and there are $\textstyle s_2=50$ hidden units in layer $\textstyle L_2$. Note that we also have $\textstyle y \in \Re^{100}$. Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input. I.e., given only the vector of hidden unit activations $\textstyle a^{(2)} \in \Re^{50}$, it must try to reconstruct the 100-pixel input $\textstyle x$. If the input were completely random (say, each $\textstyle x_i$ comes from an IID Gaussian independent of the other features), then this compression task would be very difficult. But if there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to the one PCA finds.
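The contrast between structured and completely random inputs can be illustrated without training a network. The sketch below swaps in PCA (computed via SVD) as a linear stand-in for the autoencoder and compresses 100 features down to 50 numbers. The synthetic data, the 20 latent factors, and the helper name `pca_reconstruction_error` are all assumptions made for the illustration.

```python
import numpy as np

def pca_reconstruction_error(X, k):
    """Mean squared error after projecting centered X onto its top-k principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                     # shape (n, k)
    X_hat = Xc @ V_k @ V_k.T           # compress to k numbers, then decompress
    return np.mean((Xc - X_hat) ** 2)

rng = np.random.default_rng(0)
m, n, k = 500, 100, 50

# Structured data: 100 observed features driven by 20 latent factors plus small noise.
latent = rng.normal(size=(m, 20))
mixing = rng.normal(size=(20, n))
X_structured = latent @ mixing + 0.01 * rng.normal(size=(m, n))

# Unstructured data: every feature is an independent Gaussian.
X_random = rng.normal(size=(m, n))

err_structured = pca_reconstruction_error(X_structured, k)
err_random = pca_reconstruction_error(X_random, k)
print(err_structured, err_random)
```

The correlated data is reconstructed almost perfectly from 50 numbers, while the IID data loses a large share of its variance, which is the behaviour the paragraph above describes for the 50-unit hidden layer.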

Our argument above relied on the number of hidden units $\textstyle s_2$ being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

Informally, we will think of a neuron as being "active" (or as "firing") if its output value is close to 1, or as being "inactive" if its output value is close to 0. We would like to constrain the neurons to be inactive most of the time. This discussion assumes a sigmoid activation function. If you are using a tanh activation function, then we think of a neuron as being inactive when it outputs values close to -1.

Recall that $\textstyle a^{(2)}_j$ denotes the activation of hidden unit $\textstyle j$ in the autoencoder. However, this notation doesn't make explicit which input $\textstyle x$ led to that activation. Thus, we will write $\textstyle a^{(2)}_j(x)$ to denote the activation of this hidden unit when the network is given a specific input $\textstyle x$. Further, let

$\begin{align}\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]\end{align}$

be the average activation of hidden unit $\textstyle j$ (averaged over the training set). We would like to (approximately) enforce the constraint

$\begin{align}\hat\rho_j = \rho,\end{align}$

where $\textstyle \rho$ is a sparsity parameter, typically a small value close to zero (say $\textstyle \rho = 0.05$ ). In other words, we would like the average activation of each hidden neuron $\textstyle j$ to be close to 0.05 (say). To satisfy this constraint, the hidden unit's activations must mostly be near 0.
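As a small illustration of the definition of $\textstyle \hat\rho_j$, the following sketch computes the average activation of every hidden unit in one vectorized step. The sizes mirror the 10 x 10 image example; the random data, the weight scale, and the bias of -3 are assumptions chosen so that the units are mostly inactive, and `average_activation` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_activation(X, W1, b1):
    """Return rho_hat, shape (s2,): mean activation of each hidden unit over the m rows of X."""
    A2 = sigmoid(X @ W1.T + b1)   # shape (m, s2); row i holds a^(2)(x^(i))
    return A2.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 100))               # 200 examples, 100 "pixels" each
W1 = rng.normal(scale=0.01, size=(50, 100))    # 50 hidden units
b1 = np.full(50, -3.0)                         # strongly negative bias keeps units mostly inactive

rho_hat = average_activation(X, W1, b1)
print(rho_hat.min(), rho_hat.max())
```

With these assumed values each entry of `rho_hat` lands near sigmoid(-3), roughly 0.047, so the constraint $\textstyle \hat\rho_j \approx 0.05$ is approximately met by construction. In training, the penalty term is what pushes the network toward such values.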

To achieve this, we will add an extra penalty term to our optimization objective that penalizes $\textstyle \hat\rho_j$ deviating significantly from $\textstyle \rho$ . Many choices of the penalty term will give reasonable results. We will choose the following:

$\begin{align}\sum_{j=1}^{s_2} \left[ \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j} \right].\end{align}$

Here, $\textstyle s_2$ is the number of neurons in the hidden layer, and the index $\textstyle j$ runs over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

$\begin{align}\sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),\end{align}$

where $\textstyle {\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}$ is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean $\textstyle \rho$ and a Bernoulli random variable with mean $\textstyle \hat\rho_j$. KL-divergence is a standard function for measuring how different two distributions are. (If you've not seen KL-divergence before, don't worry about it; everything you need to know about it is contained in these notes.)

This penalty function has the property that $\textstyle {\rm KL}(\rho || \hat\rho_j) = 0$ if $\textstyle \hat\rho_j = \rho$ , and otherwise it increases monotonically as $\textstyle \hat\rho_j$ diverges from $\textstyle \rho$ . For example, in the figure below, we have set $\textstyle \rho = 0.2$ , and plotted $\textstyle {\rm KL}(\rho || \hat\rho_j)$ for a range of values of $\textstyle \hat\rho_j$ :

We see that the KL-divergence reaches its minimum of 0 at $\textstyle \hat\rho_j = \rho$ , and blows up (it actually approaches $\textstyle \infty$ ) as $\textstyle \hat\rho_j$ approaches 0 or 1. Thus, minimizing this penalty term has the effect of causing $\textstyle \hat\rho_j$ to be close to $\textstyle \rho$ .
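The properties just described (zero at $\textstyle \hat\rho_j = \rho$, growing as $\textstyle \hat\rho_j$ moves away, blowing up near 0 or 1) can be checked with a few lines of plain Python. The function names `kl_bernoulli` and `sparsity_penalty` are hypothetical, and the sample values of $\textstyle \hat\rho_j$ are arbitrary.

```python
import math

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), both strictly inside (0, 1)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(rho, rho_hats):
    """Sum of the per-unit KL terms over all hidden units."""
    return sum(kl_bernoulli(rho, r) for r in rho_hats)

rho = 0.2                                  # same value as in the plot described above
at_target = kl_bernoulli(rho, 0.2)         # rho_hat equal to rho
slightly_off = kl_bernoulli(rho, 0.3)
far_off = kl_bernoulli(rho, 0.9)
near_zero = kl_bernoulli(rho, 1e-6)        # approaching the blow-up at 0

print(at_target, slightly_off, far_off, near_zero)
print(sparsity_penalty(rho, [0.2, 0.3, 0.9]))
```

The function is only defined for values strictly between 0 and 1, which a sigmoid hidden layer guarantees for the average activations.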

Our overall cost function is now

$\begin{align}J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),\end{align}$

where $\textstyle J(W,b)$ is as defined previously, and $\textstyle \beta$ controls the weight of the sparsity penalty term. The term $\textstyle \hat\rho_j$ (implicitly) depends on $\textstyle W,b$ also, because it is the average activation of hidden unit $\textstyle j$ , and the activation of a hidden unit depends on the parameters $\textstyle W,b$ .

To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement trick involving only a small change to your code. Specifically, where previously for the second layer ( $\textstyle l=2$ ), during backpropagation you would have computed

$\begin{align}\delta^{(2)}_i = \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),\end{align}$

now instead compute

$\begin{align}\delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right)+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .\end{align}$
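The extra term can be traced back to the derivative of the KL penalty with respect to the average activation. The following is only a short sketch of that one step, using the definitions given earlier, not a full derivation:

```latex
\begin{align}
\frac{\partial}{\partial \hat\rho_i} \, {\rm KL}(\rho || \hat\rho_i)
  &= \frac{\partial}{\partial \hat\rho_i} \left[ \rho \log \frac{\rho}{\hat\rho_i} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_i} \right] \\
  &= -\frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i}
\end{align}
```

Multiplying by $\textstyle \beta$ gives the term added inside the bracket. Because $\textstyle \hat\rho_i$ is an average over the $\textstyle m$ training examples, each example's activation contributes with a factor $\textstyle 1/m$; assuming $\textstyle J(W,b)$ also averages its per-example terms with $\textstyle 1/m$, the two factors match and the penalty can be folded into the per-example error term as shown.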

One subtlety is that you'll need to know $\textstyle \hat\rho_i$ to compute this term. Thus, you'll need to compute a forward pass on all the training examples first to compute the average activations on the training set, before computing backpropagation on any example. If your training set is small enough to fit comfortably in computer memory (this will be the case for the programming assignment), you can compute forward passes on all your examples, keep the resulting activations in memory, and compute the $\textstyle \hat\rho_i$ values. Then you can use your precomputed activations to perform backpropagation on all your examples. If your data is too large to fit in memory, you may have to scan through your examples computing a forward pass on each to accumulate (sum up) the activations and compute $\textstyle \hat\rho_i$ (discarding the result of each forward pass after you have taken its activations $\textstyle a^{(2)}_i$ into account for computing $\textstyle \hat\rho_i$). Then, after having computed $\textstyle \hat\rho_i$, you'd have to redo the forward pass for each example so that you can do backpropagation on that example. In this latter case, you would end up computing a forward pass twice on each example in your training set, making it computationally less efficient.
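The out-of-memory case can be sketched as a first pass that keeps only a running sum of activations. In the sketch below, slices of an in-memory array stand in for chunks read from disk; `streaming_average_activation` and the chunk size of 100 are assumptions for illustration, and only the first pass is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def streaming_average_activation(chunks, W1, b1):
    """First pass: accumulate hidden activations chunk by chunk, keeping only a running sum."""
    total = np.zeros(W1.shape[0])
    count = 0
    for X_chunk in chunks:
        A2 = sigmoid(X_chunk @ W1.T + b1)   # discarded once it has been added to the sum
        total += A2.sum(axis=0)
        count += X_chunk.shape[0]
    return total / count

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 20))
W1 = rng.normal(scale=0.1, size=(5, 20))
b1 = np.zeros(5)

# Hypothetical chunking: ten slices of 100 rows stand in for data read from disk.
chunks = [X[i:i + 100] for i in range(0, len(X), 100)]

rho_hat_stream = streaming_average_activation(chunks, W1, b1)
rho_hat_full = sigmoid(X @ W1.T + b1).mean(axis=0)   # in-memory reference
print(np.allclose(rho_hat_stream, rho_hat_full))
```

A second loop over the same chunks would then redo the forward pass and run backpropagation with the now-known average activations, which is the doubled forward cost the text mentions.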

The full derivation showing that the algorithm above results in gradient descent is beyond the scope of these notes. But if you implement the autoencoder using backpropagation modified this way, you will be performing gradient descent exactly on the objective $\textstyle J_{\rm sparse}(W,b)$ . Using the derivative checking method, you will be able to verify this for yourself as well.
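One way to carry out that verification is a central-difference gradient check. The sketch below implements the sparse cost and the modified backpropagation for a tiny network and compares the result with numerical derivatives. It assumes sigmoid units in both layers and omits the weight-decay term for brevity; all function names, sizes, and the values of `rho` and `beta` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta, n, s2):
    """Split a flat parameter vector into W1 (s2,n), b1 (s2,), W2 (n,s2), b2 (n,)."""
    i = 0
    W1 = theta[i:i + s2 * n].reshape(s2, n); i += s2 * n
    b1 = theta[i:i + s2]; i += s2
    W2 = theta[i:i + n * s2].reshape(n, s2); i += n * s2
    b2 = theta[i:i + n]
    return W1, b1, W2, b2

def cost_and_grad(theta, X, s2, rho, beta):
    """Sparse cost (without weight decay) and its gradient via the modified backpropagation."""
    m, n = X.shape
    W1, b1, W2, b2 = unpack(theta, n, s2)
    A2 = sigmoid(X @ W1.T + b1)                  # hidden activations, (m, s2)
    A3 = sigmoid(A2 @ W2.T + b2)                 # reconstructions, (m, n)
    rho_hat = A2.mean(axis=0)                    # needs every example before any backprop
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    cost = 0.5 * np.sum((A3 - X) ** 2) / m + beta * np.sum(kl)

    delta3 = (A3 - X) * A3 * (1 - A3)            # output error; sigmoid derivative is a(1-a)
    sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    delta2 = (delta3 @ W2 + sparsity) * A2 * (1 - A2)   # modified hidden-layer error

    gW1 = delta2.T @ X / m
    gb1 = delta2.mean(axis=0)
    gW2 = delta3.T @ A2 / m
    gb2 = delta3.mean(axis=0)
    return cost, np.concatenate([gW1.ravel(), gb1, gW2.ravel(), gb2])

def numerical_grad(f, theta, eps=1e-5):
    """Central-difference estimate of the gradient of the scalar function f at theta."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
m, n, s2 = 10, 6, 4
rho, beta = 0.05, 3.0
X = rng.uniform(0.1, 0.9, size=(m, n))
theta = rng.normal(scale=0.1, size=2 * n * s2 + s2 + n)

_, analytic = cost_and_grad(theta, X, s2, rho, beta)
numeric = numerical_grad(lambda t: cost_and_grad(t, X, s2, rho, beta)[0], theta)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

A very small `max_diff` indicates that the modified error term yields the gradient of the sparse objective. Dropping the `sparsity` term from `delta2` while leaving the cost unchanged should make the check fail, which is a quick way to confirm the check is sensitive to the modification.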


