Manifold Interpretation of PCA and Linear Auto-Encoders
The goal is to find a projection of x into a subspace that preserves as much information about x as possible.

Let the encoder be h = f(x) = W(x − μ), where μ = E[x] and W ∈ R^{d×D};
h is a low-dimensional representation of x.
Let the decoder be x̂ = g(h) = b + Vh.

With a linear encoder and decoder, minimizing the reconstruction error E‖x − x̂‖² means that, at the optimum, the columns of V and the rows of W span the same subspace as the d principal eigenvectors of the covariance of x. In the case of PCA, the rows of W are exactly these eigenvectors (orthonormal), with V = Wᵀ and b = μ. The optimal reconstruction error is

min E‖x − x̂‖² = λ_{d+1} + … + λ_D

where x ∈ R^D, h ∈ R^d, and the λ_i are the eigenvalues of the covariance matrix, sorted in decreasing order. If the covariance has rank at most d, the discarded eigenvalues are zero and the reconstruction error vanishes.
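A quick numerical check of this statement, as an illustrative sketch with made-up dimensions and data: encode centered data with the top d eigenvectors of its covariance, decode, and compare the mean squared reconstruction error with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 10_000, 5, 2                 # samples, dimension of x, dimension of h (arbitrary)

# Data with an anisotropic covariance so the eigenvalues clearly differ
A = rng.normal(size=(D, D))
x = rng.normal(size=(n, D)) @ A.T
mu = x.mean(axis=0)
xc = x - mu

# Eigendecomposition of the (biased) sample covariance, eigenvalues in decreasing order
cov = np.cov(xc, rowvar=False, bias=True)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Encoder h = W(x - mu), decoder x_hat = mu + W^T h, rows of W = top-d eigenvectors
W = eigvecs[:, :d].T
h = xc @ W.T
x_hat = mu + h @ W

mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
print(mse, eigvals[d:].sum())          # the two numbers match (up to floating point)
```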
ICA
Independent Component Analysis (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001)

Like probabilistic PCA and factor analysis, ICA fits the linear factor model:
- sample the real-valued factors h ∼ P(h)
- sample the real-valued observed variables x = Wh + b + noise

What is particular about ICA is that, unlike PCA and factor analysis, it does not assume that the prior is Gaussian; it only assumes that the prior factorizes, i.e. P(h) = ∏_i P(h_i). If we then assume that the latent variables are non-Gaussian, we can recover them, and this is what ICA is trying to achieve.
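To see what recovering the factors means in practice, here is a small sketch; the sizes, the mixing matrix, the Laplace prior and the use of scikit-learn's FastICA are all arbitrary illustration choices. Data are generated from the linear factor model with independent non-Gaussian factors, and ICA recovers those factors up to permutation, sign and scale.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n, d, D = 5000, 3, 5                     # samples, latent dimension, observed dimension

# Factorized non-Gaussian prior: independent Laplace factors, P(h) = prod_i P(h_i)
h = rng.laplace(size=(n, d))

# Linear factor model: x = Wh + b + noise
W = rng.normal(size=(D, d))
b = rng.normal(size=D)
x = h @ W.T + b + 0.01 * rng.normal(size=(n, D))

# FastICA recovers the factors up to permutation, sign and scale
ica = FastICA(n_components=d, random_state=0)
h_hat = ica.fit_transform(x)

# Each true factor should be strongly correlated with exactly one recovered component
corr = np.corrcoef(h.T, h_hat.T)[:d, d:]
print(np.round(np.abs(corr), 2))
```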
Sparse Coding as a Generative Model
A particularly interesting form of non-Gaussianity arises with distributions that are sparse. A classic choice is the Student-t prior, P(h_i) ∝ 1 / (1 + h_i²/ν)^((ν+1)/2), whose sharp peak at zero and heavy tails favor codes in which most components are close to zero.
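The sparsity can be seen directly from samples. The sketch below, with an arbitrary choice of ν, compares a unit-variance Student-t prior with a unit-variance Gaussian: the Student-t puts more mass near zero and has heavier tails.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, n = 5.0, 100_000                     # degrees of freedom (arbitrary) and sample count

t = rng.standard_t(df=nu, size=n)        # samples from a Student-t prior
t = t / t.std()                          # rescale to unit variance for a fair comparison
g = rng.normal(size=n)                   # Gaussian samples with unit variance

for name, s in [("Student-t", t), ("Gaussian", g)]:
    near_zero = np.mean(np.abs(s) < 0.1)     # probability mass concentrated near zero
    excess_kurtosis = np.mean(s ** 4) - 3.0  # heavy tails show up as positive excess kurtosis
    print(f"{name:9s}  P(|h| < 0.1) = {near_zero:.3f}   excess kurtosis = {excess_kurtosis:.1f}")
```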
Greedy Layerwise Unsupervised Pre-Training
Greedy
- the different layers are not jointly trained with respect to a global training objective, which can make the procedure sub-optimal.

Layerwise
- it proceeds one layer at a time, training the k-th layer while keeping the previously trained layers fixed.

Unsupervised
- each layer is trained with an unsupervised representation-learning algorithm.

Pre-training
- it is only a first step; a joint training algorithm is then applied to fine-tune all the layers together with respect to the criterion of interest (a minimal sketch of the whole procedure follows below).
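A minimal sketch of the procedure, illustrative only: the layer sizes, optimizer, epoch counts and the use of small PyTorch auto-encoders are arbitrary choices rather than a prescription. Each layer is trained as an auto-encoder on the output of the previously trained, frozen layers; the pre-trained encoders are then stacked under a task-specific output layer for joint fine-tuning.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(1000, 50)                # unlabeled training data (toy stand-in)

layer_sizes = [50, 32, 16]               # input dimension followed by hidden sizes (arbitrary)
encoders = []
H = X                                    # representation produced by the layers trained so far

# Greedy, layerwise, unsupervised: train one auto-encoder at a time on the output
# of the previous layers, keeping the already-trained layers fixed.
for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for epoch in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(H)), H)   # unsupervised reconstruction objective
        loss.backward()
        opt.step()
    encoders.append(enc)
    with torch.no_grad():
        H = enc(H)                       # frozen output of this layer feeds the next one

# Pre-training is only a first step: stack the encoders, add a task-specific output
# layer, and fine-tune all layers jointly on the supervised criterion of interest.
model = nn.Sequential(*encoders, nn.Linear(layer_sizes[-1], 1))
```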
Transfer Learning and Domain Adaptation
The objective is to extract information from data in a first setting (dataset A) and exploit it when learning, or even when directly making predictions, in a second setting (dataset B). For example, reviews in different domains (of movies, music or books) differ in their specifics, but they also share a lot of structure; adapting a model from one such domain to another is what is meant by Domain Adaptation (a minimal sketch of this recipe follows after the list below).
Two examples of extreme forms of transfer learning:
- one-shot learning
- zero-shot learning (also called zero-data learning)
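One common recipe is to reuse the representation learned in the first setting and train only a small task-specific part on the second setting. The sketch below is illustrative only: the feature extractor is a random stand-in for something pre-trained on dataset A, and all sizes are made up. It freezes the shared layers and trains a new classifier head on the target-domain data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a feature extractor trained on the first setting (e.g. the stacked
# encoders from the pre-training sketch above); here it just has random weights.
features = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU())
for p in features.parameters():
    p.requires_grad = False              # reuse the shared representation as-is

head = nn.Linear(16, 2)                  # new task-specific classifier for the second setting

# Small labeled dataset from the second domain (toy stand-in)
Xb = torch.randn(200, 50)
yb = torch.randint(0, 2, (200,))

opt = torch.optim.Adam(head.parameters(), lr=1e-2)
for epoch in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(head(features(Xb)), yb)
    loss.backward()
    opt.step()
```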