Restricted Boltzmann Machines
Energy-Based Models (EBM)
Energy-based models associate a scalar energy to each configuration of the variables of interest. Learning corresponds to modifying that energy function so that its shape has desirable properties. For example, we would like plausible or desirable configurations to have low energy. Energy-based probabilistic models define a probability distribution through an energy function, as follows:

$$p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_{x} e^{-E(x)}$$
The normalizing factor Z is called the partition function by analogy with physical systems.
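To make the definition concrete, here is a small numerical sketch: a toy energy function over three binary variables (the quadratic energy and its random couplings are illustrative choices, not from the text), with the partition function computed by brute-force enumeration.

```python
import numpy as np
from itertools import product

# Toy energy-based model over 3 binary variables.
# The energy function and its parameters are illustrative, not from the text.
rng = np.random.default_rng(0)
J = rng.normal(size=(3, 3))          # arbitrary pairwise couplings
J = (J + J.T) / 2                    # symmetrize

def energy(x):
    return -x @ J @ x                # scalar energy of configuration x

configs = [np.array(c) for c in product([0, 1], repeat=3)]
Z = sum(np.exp(-energy(x)) for x in configs)          # partition function
p = {tuple(x): np.exp(-energy(x)) / Z for x in configs}

# dividing by Z makes p a valid probability distribution
assert abs(sum(p.values()) - 1.0) < 1e-9
```

Enumerating all configurations is only feasible for toy models; for real models the intractability of Z is exactly what motivates the sampling methods below.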
EBMs with Hidden Units
In many cases of interest, we do not observe the example x fully, or we want to introduce some non-observed variables to increase the expressive power of the model. So we consider an observed part (still denoted x here) and a hidden part h. We can then write:

$$P(x) = \sum_h P(x,h) = \sum_h \frac{e^{-E(x,h)}}{Z}$$
In such cases, to map this formulation to one similar to Eq. (1), we introduce the notation (inspired from physics) of free energy, defined as follows:

$$\mathcal{F}(x) = -\log \sum_h e^{-E(x,h)}$$
which allows us to write:

$$P(x) = \frac{e^{-\mathcal{F}(x)}}{Z}, \qquad Z = \sum_x e^{-\mathcal{F}(x)}$$
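The identity that the free energy lets us marginalize out h can be checked numerically. The following sketch uses a toy model with 2 visible and 2 hidden binary units (the bilinear energy and random weights are assumptions for illustration) and verifies that $e^{-\mathcal{F}(x)}/Z$ equals $\sum_h P(x,h)$ for every x.

```python
import numpy as np
from itertools import product

# Numerical check of P(x) = e^{-F(x)} / Z on a toy model with 2 visible
# and 2 hidden binary units. Parameters are arbitrary.
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 2))

def E(x, h):
    return -h @ W @ x

xs = [np.array(c) for c in product([0, 1], repeat=2)]
hs = xs                                       # hidden configs: same enumeration

def F(x):                                     # F(x) = -log sum_h e^{-E(x,h)}
    return -np.log(sum(np.exp(-E(x, h)) for h in hs))

Z = sum(np.exp(-E(x, h)) for x in xs for h in hs)

for x in xs:
    p_marginal = sum(np.exp(-E(x, h)) for h in hs) / Z   # sum_h P(x,h)
    p_free = np.exp(-F(x)) / Z                           # e^{-F(x)} / Z
    assert abs(p_marginal - p_free) < 1e-9
```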
The data negative log-likelihood gradient then has a particularly interesting form:

$$-\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}$$
The derivation is as follows (note the distinction between the training example x and the samples $\tilde{x}$ drawn from the model). Since $\log p(x) = -\mathcal{F}(x) - \log Z$,

$$-\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} + \frac{\partial \log Z}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} - \frac{1}{Z}\sum_{\tilde{x}} e^{-\mathcal{F}(\tilde{x})} \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}$$
Notice that the above gradient contains two terms, which are referred to as the positive and negative phase. The terms positive and negative do not refer to the sign of each term in the equation, but rather reflect their effect on the probability density defined by the model. The first term increases the probability of training data (by reducing the corresponding free energy), while the second term decreases the probability of samples generated by the model.
It is usually difficult to determine this gradient analytically, as it involves the computation of the expectation $\mathbb{E}_{p}\!\left[\partial \mathcal{F}(x) / \partial \theta\right]$, i.e., a sum over all possible configurations of the input x, weighted by the model distribution p.
The first step in making this computation tractable is to estimate the expectation using a fixed number of model samples. Samples used to estimate the negative phase gradient are referred to as negative particles, which are denoted as $\mathcal{N}$. The gradient can then be written as:

$$-\frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|}\sum_{\tilde{x} \in \mathcal{N}} \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}$$
where we would ideally like the elements $\tilde{x}$ of $\mathcal{N}$ to be sampled according to p (i.e., using a Monte-Carlo method).
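The Monte-Carlo replacement of the expectation by an average over negative particles can be illustrated on a tiny example. Here p is a small explicit distribution and f stands in for $\partial \mathcal{F}/\partial \theta$ per state; both are illustrative assumptions, not quantities from the text.

```python
import numpy as np

# The negative-phase expectation E_p[f(x)] approximated by an average
# over N negative particles sampled from p. p and f are illustrative.
rng = np.random.default_rng(2)
p = np.array([0.1, 0.2, 0.3, 0.4])        # model distribution over 4 states
f = np.array([1.0, -2.0, 0.5, 3.0])       # stand-in for dF/dtheta per state

exact = (p * f).sum()                     # true expectation
particles = rng.choice(4, size=100_000, p=p)   # negative particles ~ p
estimate = f[particles].mean()            # Monte-Carlo estimate

assert abs(estimate - exact) < 0.05       # close for a large sample
```

In a real EBM we cannot sample from p directly, which is why the MCMC machinery below is needed to obtain the negative particles.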
With the formulas above we can, in principle, train an EBM. The only remaining question is how to extract these negative particles.
Restricted Boltzmann Machines (RBM)
Boltzmann Machines (BMs) are a particular form of log-linear Markov Random Field (MRF), i.e., for which the energy function is linear in its free parameters. To make them powerful enough to represent complicated distributions (i.e., go from the limited parametric setting to a non-parametric one), we consider that some of the variables are never observed (they are called hidden). By having more hidden variables (also called hidden units), we can increase the modeling capacity of the Boltzmann Machine (BM). Restricted Boltzmann Machines further restrict BMs to those without visible-visible and hidden-hidden connections. A graphical depiction of an RBM is shown below.
The energy function E(v,h) of an RBM is defined as:

$$E(v,h) = -b'v - c'h - h'Wv$$
where W represents the weights connecting hidden and visible units, and b, c are the offsets (biases) of the visible and hidden layers respectively.
This translates directly to the following free energy formula (where $W_i$ denotes the i-th row of W and $h_i$ the i-th hidden unit):

$$\mathcal{F}(v) = -b'v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)}$$
Because of the specific structure of RBMs, visible and hidden units are conditionally independent given one another. Using this property, we can write:

$$p(h|v) = \prod_i p(h_i|v), \qquad p(v|h) = \prod_j p(v_j|h)$$
RBMs with binary units
In the commonly studied case of using binary units (where $v_j, h_i \in \{0,1\}$), we obtain a probabilistic version of the usual neuron activation function:

$$P(h_i = 1 | v) = \operatorname{sigm}(c_i + W_i v)$$

$$P(v_j = 1 | h) = \operatorname{sigm}(b_j + W'_j h)$$
First, write Eq. (8) in scalar form:

$$E(v,h) = -\sum_j b_j v_j - \sum_i c_i h_i - \sum_i \sum_j h_i W_{ij} v_j$$

Therefore:

$$P(h_i = 1 | v) = \frac{e^{c_i + W_i v}}{1 + e^{c_i + W_i v}} = \operatorname{sigm}(c_i + W_i v)$$

Note that $\operatorname{sigm}(x) = 1/(1 + e^{-x})$ is the logistic sigmoid, and that the hidden units can be treated one at a time because they are conditionally independent given v. Eq. (12) can be derived in the same way.
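The sigmoid form of the conditional can be verified numerically by comparing it against the brute-force conditional computed directly from the energy. The following sketch uses a small binary RBM with arbitrary parameter values (an illustrative assumption).

```python
import numpy as np
from itertools import product

# Check that P(h_i = 1 | v) = sigm(c_i + W_i v) in a binary RBM, by
# brute-force enumeration of the hidden units. Parameters are arbitrary.
rng = np.random.default_rng(3)
n_v, n_h = 3, 2
W = rng.normal(size=(n_h, n_v))
b = rng.normal(size=n_v)                  # visible biases
c = rng.normal(size=n_h)                  # hidden biases

def energy(v, h):
    return -b @ v - c @ h - h @ W @ v

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

v = np.array([1, 0, 1])
hs = [np.array(h) for h in product([0, 1], repeat=n_h)]
joint = np.array([np.exp(-energy(v, h)) for h in hs])   # unnormalized p(v,h)
p_h_given_v = joint / joint.sum()                       # p(h|v)

for i in range(n_h):
    brute = sum(p for p, h in zip(p_h_given_v, hs) if h[i] == 1)
    assert abs(brute - sigm(c[i] + W[i] @ v)) < 1e-9
```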
The free energy of an RBM with binary units further simplifies to:

$$\mathcal{F}(v) = -b'v - \sum_i \log\left(1 + e^{c_i + W_i v}\right)$$
because, for binary $h_i$, $\sum_{h_i \in \{0,1\}} e^{h_i (c_i + W_i v)} = 1 + e^{c_i + W_i v}$.
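This closed form can be checked against the definition of the free energy by summing over all hidden configurations explicitly. The RBM below and its parameter values are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Check the closed form F(v) = -b'v - sum_i log(1 + e^{c_i + W_i v})
# against F(v) = -log sum_h e^{-E(v,h)}. Parameters are arbitrary.
rng = np.random.default_rng(4)
n_v, n_h = 3, 3
W = rng.normal(size=(n_h, n_v))
b = rng.normal(size=n_v)
c = rng.normal(size=n_h)

def energy(v, h):
    return -b @ v - c @ h - h @ W @ v

def free_energy(v):
    # closed form, valid for binary hidden units
    return -b @ v - np.log1p(np.exp(c + W @ v)).sum()

v = np.array([0, 1, 1])
hs = [np.array(h) for h in product([0, 1], repeat=n_h)]
brute = -np.log(sum(np.exp(-energy(v, h)) for h in hs))  # definition of F(v)
assert abs(brute - free_energy(v)) < 1e-9
```

The closed form is what makes binary RBMs practical: evaluating $\mathcal{F}(v)$ costs one matrix-vector product instead of a sum over $2^{n_h}$ hidden configurations.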
We could differentiate Eq. (6) with respect to each RBM parameter (b, c, W) directly, but the computation is far too expensive to be practical; instead, we approximate the gradients by sampling.
Sampling in an RBM
Samples of p(x) can be obtained by running a Markov chain to convergence, using Gibbs sampling as the transition operator.
Gibbs sampling of the joint of N random variables $S = (S_1, \ldots, S_N)$ is done through a sequence of N sampling sub-steps of the form $S_i \sim p(S_i | S_{-i})$, where $S_{-i}$ contains the N-1 other random variables in S excluding $S_i$.
For RBMs, S consists of the set of visible and hidden units. However, since they are conditionally independent, one can perform block Gibbs sampling. In this setting, visible units are sampled simultaneously given fixed values of the hidden units. Similarly, hidden units are sampled simultaneously given the visibles. A step in the Markov chain is thus taken as follows:

$$h^{(n+1)} \sim \operatorname{sigm}(W'v^{(n)} + c)$$

$$v^{(n+1)} \sim \operatorname{sigm}(W h^{(n+1)} + b)$$
Note that h and v are both vectors, and all components of each are sampled simultaneously.
where $h^{(n+1)}$ refers to the set of all hidden units at the (n+1)-th step of the Markov chain. Concretely, $h_i^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\operatorname{sigm}(W_i' v^{(n)} + c_i)$, and similarly, $v_j^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\operatorname{sigm}(W_j h^{(n+1)} + b_j)$.
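One such block Gibbs step can be sketched directly from the two formulas above. The sketch below uses a small binary RBM with arbitrary parameter values (an illustrative assumption), sampling all hidden units given v and then all visible units given the new h.

```python
import numpy as np

# One block Gibbs step in a binary RBM: sample h ~ p(h|v), then v ~ p(v|h).
# Sizes and parameter values are illustrative.
rng = np.random.default_rng(5)
n_v, n_h = 6, 4
W = rng.normal(size=(n_h, n_v)) * 0.1
b = np.zeros(n_v)
c = np.zeros(n_h)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    # all hidden units sampled simultaneously given v
    h = (rng.random(n_h) < sigm(c + W @ v)).astype(float)
    # all visible units sampled simultaneously given the new h
    v_new = (rng.random(n_v) < sigm(b + W.T @ h)).astype(float)
    return v_new, h

v = rng.integers(0, 2, size=n_v).astype(float)
for _ in range(10):                        # run the chain for a few steps
    v, h = gibbs_step(v)
```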
This can be illustrated graphically as an alternating chain $v^{(0)} \rightarrow h^{(0)} \rightarrow v^{(1)} \rightarrow h^{(1)} \rightarrow \cdots$
As $t \rightarrow \infty$, samples $(v^{(t)}, h^{(t)})$ are guaranteed to be accurate samples of p(v,h). In theory, each parameter update would require running one such chain to convergence, which is prohibitively expensive; this motivates the Contrastive Divergence algorithm.
Contrastive Divergence (CD-k)
Contrastive Divergence uses two tricks to speed up the sampling process:
- Since we eventually want $p(v) \approx p_{\text{train}}(v)$ (the true, underlying distribution of the data), we initialize the Markov chain with a training example, i.e., from a distribution that is expected to be close to p, so that the chain will already be close to having converged to its final distribution p. In other words, initializing the Markov chain from training samples pushes the model toward the same distribution as the training data.
- CD does not wait for the chain to converge. Samples are obtained after only k steps of Gibbs sampling. In practice, k=1 has been shown to work surprisingly well.
- CD restarts a chain for each observed example: each time the gradient of the parameters has been computed from the current chain, the Markov chain is re-initialized from a new training example, the gradient is computed again, and so on.
We can estimate the gradient with the CD-k algorithm:

$$-\frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} - \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}$$

where x is an input training sample and $\tilde{x}$ is the negative particle obtained after k steps of Gibbs sampling starting from x.
See the code below for details:
# determine gradients on RBM parameters
# note that we only need the sample at the end of the chain
chain_end = nv_samples[-1]
cost = T.mean(self.free_energy(self.input)) - T.mean(
self.free_energy(chain_end))
# We must not compute the gradient through the gibbs sampling
gparams = T.grad(cost, self.params, consider_constant=[chain_end])
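For readers who prefer to see the update without Theano's symbolic differentiation, the same CD-1 step can be sketched in plain NumPy using the standard gradient expressions for a binary RBM (sizes, parameter values, and the learning rate below are illustrative assumptions).

```python
import numpy as np

# A plain-NumPy sketch of one CD-1 parameter update for a binary RBM:
# gradient = (positive-phase stats from the data) - (stats from the chain end).
# All sizes, data, and the learning rate are illustrative.
rng = np.random.default_rng(6)
n_v, n_h, lr = 6, 4, 0.1
W = rng.normal(size=(n_h, n_v)) * 0.01
b = np.zeros(n_v)
c = np.zeros(n_h)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_grads(v0):
    ph0 = sigm(c + W @ v0)                  # p(h=1|v0): positive phase
    h0 = (rng.random(n_h) < ph0).astype(float)
    # one Gibbs step gives the chain end (the negative particle)
    v1 = (rng.random(n_v) < sigm(b + W.T @ h0)).astype(float)
    ph1 = sigm(c + W @ v1)                  # negative phase
    dW = np.outer(ph0, v0) - np.outer(ph1, v1)
    db = v0 - v1
    dc = ph0 - ph1
    return dW, db, dc

v0 = rng.integers(0, 2, size=n_v).astype(float)   # stand-in training sample
dW, db, dc = cd1_grads(v0)
W += lr * dW                               # gradient ascent on log p(v0)
b += lr * db
c += lr * dc
```

As in the Theano snippet, no gradient is propagated through the Gibbs sampling itself: the chain end is treated as a constant when forming the update.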
Persistent CD
Persistent CD [Tieleman08] (see the paper for details) uses another approximation for sampling from p(v,h). It relies on a single Markov chain with a persistent state (i.e., the chain is not restarted for each observed example). For each parameter update, we extract new samples by simply running the chain for k steps; the state of the chain is then preserved for subsequent updates.
The general intuition is that if parameter updates are small enough compared to the mixing rate of the chain, the Markov chain should be able to “catch up” to changes in the model.
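The bookkeeping difference from CD is small and can be sketched as follows: the negative particle is carried over between parameter updates instead of being re-initialized from a training example each time. The RBM, its sizes, and k below are illustrative assumptions; the parameter-update step is elided.

```python
import numpy as np

# Sketch of the persistent-chain bookkeeping in PCD: the chain state is
# carried across parameter updates rather than reset from data each time.
rng = np.random.default_rng(7)
n_v, n_h, k = 6, 4, 1
W = rng.normal(size=(n_h, n_v)) * 0.01
b = np.zeros(n_v)
c = np.zeros(n_h)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    h = (rng.random(n_h) < sigm(c + W @ v)).astype(float)
    return (rng.random(n_v) < sigm(b + W.T @ h)).astype(float)

persistent_v = rng.integers(0, 2, size=n_v).astype(float)  # persistent state

for step in range(5):                      # five parameter updates
    for _ in range(k):                     # advance the persistent chain k steps
        persistent_v = gibbs_step(persistent_v)
    # ... use persistent_v as the negative particle in the gradient and
    # update W, b, c; the chain state is NOT reset between updates.
```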
An RBM Example
For the full code, see RestrictedBoltzmannMachines.py on my GitHub.
References
[1] DeepLearning Tutorial: Restricted Boltzmann Machines (RBM), a tutorial on implementing deep learning algorithms with Theano.
[2] Section 5 of Learning Deep Architectures for AI (Bengio), which describes how to train RBMs.
[3] 受限玻爾茲曼機(RBM)學習筆記, a very detailed and accessible explanation of the principles behind RBMs.