Table of Contents
- Probabilistic Graphical Models
- Statistical and Algorithmic Foundations of Deep Learning
- 01 An overview of DL components
- Historical remarks: early days of neural networks
- Reverse-mode automatic differentiation (aka backpropagation)
- Modern building blocks: units, layers, activation functions, loss functions, etc.
- 02 Similarities and differences between GMs and NNs
- Graphical models vs. computational graphs
- Sigmoid Belief Networks as graphical models
- Deep Belief Networks and Boltzmann Machines
- 03 Combining DL methods and GMs
- Using outputs of NNs as inputs to GMs
- GMs with potential functions represented by NNs; NNs with structured outputs
- 04 Bayesian Learning of NNs
Probabilistic Graphical Models
Statistical and Algorithmic Foundations of Deep Learning
Author: Eric Xing
01 An overview of DL components
Historical remarks: early days of neural networks
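A biological neuron looks like this:
The upstream cell passes neurotransmitters along its axon to the dendrites of the downstream cells. Inspired by this principle, an artificial neuron (or perceptron) is constructed as shown in the figure below.
Similarly: biological neural network → artificial neural network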
![Figure](https://img-blog.csdnimg.cn/2020051209264072.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L05HVWV2ZXIxNQ==,size_16,color_FFFFFF,t_70)
Reverse-mode automatic differentiation (aka backpropagation)
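Let us now look at the perceptron learning algorithm in detail.
Suppose this is a regression problem $x \to y$; the objective function is then:
To minimize this function we differentiate it; specifically:
where
The resulting update rule is: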
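A minimal sketch of such an update, assuming a single sigmoid unit $o = \sigma(w \cdot x)$ trained on the squared error $E(w) = \frac{1}{2} \sum_d (t_d - o_d)^2$, for which $\partial E / \partial w_i = -\sum_d (t_d - o_d)\, o_d (1 - o_d)\, x_{d,i}$. The notation, the learning rate, and the toy data are illustrative assumptions, not taken from the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Output of one sigmoid unit: o = sigmoid(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def squared_error(w, data):
    """E(w) = 1/2 * sum_d (t_d - o_d)^2."""
    return 0.5 * sum((t - predict(w, x)) ** 2 for x, t in data)

def gradient_step(w, data, lr):
    """One batch gradient-descent step on E(w)."""
    grad = [0.0] * len(w)
    for x, t in data:
        o = predict(w, x)
        delta = (t - o) * o * (1.0 - o)
        for i, xi in enumerate(x):
            grad[i] -= delta * xi  # dE/dw_i = -sum_d (t_d - o_d) o_d (1 - o_d) x_di
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Hypothetical toy data: logical OR; the last input component is a constant bias of 1.
data = [([0.0, 0.0, 1.0], 0.0), ([0.0, 1.0, 1.0], 1.0),
        ([1.0, 0.0, 1.0], 1.0), ([1.0, 1.0, 1.0], 1.0)]
w = [0.0, 0.0, 0.0]
initial_error = squared_error(w, data)
for _ in range(5000):
    w = gradient_step(w, data, lr=0.5)
final_error = squared_error(w, data)
print(initial_error, final_error)
```

The factor $o_d (1 - o_d)$ is the derivative of the sigmoid; with a linear unit it would be replaced by 1.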
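Next, let us turn to neural network models:
Here, the hidden units have no targets.
An artificial neural network is nothing more than a composition of complex functions that can be represented by a computational graph.
Applying the chain rule and using reverse accumulation, we obtain:
This algorithm is commonly known as backpropagation. What if some of the functions are stochastic? Then use stochastic backpropagation! Modern software packages can do this automatically (more on this later).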
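To make reverse accumulation concrete, here is a small sketch of reverse-mode differentiation on a scalar computational graph. The `Var` class and the `backward` function are hypothetical names introduced for illustration; real packages implement the same idea on tensors.

```python
import math

class Var:
    """A node in a computational graph: a value plus its accumulated gradient."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # pairs (parent_node, local_derivative)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x.value))
    return Var(s, ((x, s * (1.0 - s)),))

def backward(output):
    """Visit nodes from the output back to the inputs, applying the chain rule."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # inputs are appended before their consumers

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # consumers are processed before their inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local

w, x, b = Var(2.0), Var(3.0), Var(-5.0)
y = sigmoid(w * x + b)
backward(y)
print(w.grad, x.grad, b.grad)
```

One backward sweep yields the derivative of the output with respect to every input, which is why reverse mode suits functions with many parameters and a single scalar loss.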
Modern building blocks: units, layers, activation functions, loss functions, etc.
Commonly used activation functions:
- Linear and ReLU
- Sigmoid and tanh
- Etc.
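For reference, the activation functions listed above can be written as scalar functions in a few lines. This is only an illustrative sketch; deep learning libraries provide vectorized versions.

```python
import math

def linear(z):
    return z

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

for z in (-2.0, 0.0, 2.0):
    print(z, linear(z), relu(z), sigmoid(z), tanh(z))
```

Sigmoid and tanh are the same shape up to shifting and scaling: $\tanh(z) = 2\sigma(2z) - 1$.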
Network layers:
- Fully connected
- Convolutional & pooling
- Recurrent
- ResNets
- Etc.
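In other words, the basic building blocks can be combined arbitrarily. With multiple loss functions, one can do multi-target prediction, transfer learning, and more. Given enough data, deeper architectures just keep improving.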
Feature learning
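Successful learning of intermediate representations [Lee et al., ICML 2009; Lee et al., NIPS 2009]
Representation learning: the network learns increasingly abstract representations of the data that are “disentangled,” i.e., amenable to linear separation.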
02 Similarities and differences between GMs and NNs
Graphical models vs. computational graphs
Graphical models:
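- A representation for encoding meaningful knowledge and the associated uncertainty in graphical form
- Learning and inference are based on a rich toolbox of well-studied (structure-dependent) techniques (e.g., EM, message passing, VI, MCMC)
- The graph represents the model
Utility of the graph
- A vehicle for synthesizing a global loss function from local structure (potential functions, feature functions, etc.)
- A vehicle for designing sound and efficient inference algorithms (sum-product, mean-field, etc.)
- A vehicle to inspire approximation and penalization (structured MF, tree approximation, etc.)
- A vehicle for monitoring theoretical and empirical behavior and the accuracy of inference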
Utility of the loss function
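- A major measure of the quality of the learning algorithm and the model
Deep neural networks:
- Learn representations that facilitate computation and performance on the end metric (intermediate representations are not guaranteed to be meaningful)
- Learning is predominantly based on gradient descent, with gradients computed by backpropagation; inference is often trivial and done via a “forward pass”
- The graph represents the computation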
Utility of the network
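- A vehicle to conceptually synthesize complex decision hypotheses (stage-wise projection and aggregation)
- A vehicle for organizing computational operations (stage-wise update of latent states)
- A vehicle for designing processing steps and computing modules (layer-wise parallelization)
- No obvious utility in evaluating DL inference algorithms
So far, graphical models are representations of probability distributions, whereas neural networks are function approximators (with no probabilistic meaning). Some neural networks are in fact graphical models (i.e., the units/neurons represent random variables):
- Boltzmann machines (Hinton & Sejnowski, 1983)
- Restricted Boltzmann machines (Smolensky, 1986)
- Learning and inference in sigmoid belief networks (Neal, 1992)
- Fast learning in deep belief networks (Hinton, Osindero, Teh, 2006)
- Deep Boltzmann machines (Salakhutdinov & Hinton, 2009)
We go through them one by one below.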
I: Restricted Boltzmann Machines
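A restricted Boltzmann machine (RBM) is a Markov random field represented by a bipartite graph: every node in one layer/part of the graph is connected to every node in the other layer, and there are no connections within a layer.
The joint distribution is:
The log-likelihood of a single data point (with the unobserved variables marginalized out):
The gradient of the log-likelihood with respect to the model parameters:
The gradient of the log-likelihood with respect to the parameters (alternative form):
Both expectations can be approximated by sampling. Sampling from the posterior is exact (the RBM factorizes over h given v). Sampling from the joint is done via MCMC (e.g., Gibbs sampling).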
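For reference, one standard way to write these quantities for a binary RBM is given below. The symbols $w_{ij}$, $b_i$, $c_j$ and $Z(\theta)$ are notational assumptions and may differ from the lecture's; the gradient is shown for a weight $w_{ij}$, first as explicit sums and then as a difference of expectations.

```latex
\begin{aligned}
p(v, h; \theta) &= \frac{1}{Z(\theta)} \exp\Big( \sum_{i,j} w_{ij} v_i h_j + \sum_i b_i v_i + \sum_j c_j h_j \Big) \\
\log p(v; \theta) &= \log \sum_{h} \exp\Big( \sum_{i,j} w_{ij} v_i h_j + \sum_i b_i v_i + \sum_j c_j h_j \Big) - \log Z(\theta) \\
\frac{\partial \log p(v; \theta)}{\partial w_{ij}} &= \sum_{h} p(h \mid v)\, v_i h_j - \sum_{v', h} p(v', h)\, v'_i h_j \\
&= \mathbb{E}_{p(h \mid v)}[v_i h_j] - \mathbb{E}_{p(v, h)}[v_i h_j] \\
p(h_j = 1 \mid v) &= \sigma\Big( c_j + \sum_i w_{ij} v_i \Big), \qquad
p(v_i = 1 \mid h) = \sigma\Big( b_i + \sum_j w_{ij} h_j \Big)
\end{aligned}
```

The first expectation conditions on the observed $v$ and is cheap because the hidden units are independent given $v$; the second is over the full joint and is the one that requires MCMC.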
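In the neural network literature:
- Computing the first term is called the clamped/wake/positive phase (the network is “awake” because it conditions on the visible variables)
- Computing the second term is called the unclamped/sleep/free/negative phase (the network is “asleep” because it samples the visible variables from the joint; metaphorically, it is dreaming the visible inputs)
Learning is done by optimizing the log-likelihood of the model for the given data via stochastic gradient descent (SGD). Estimating the second term (the negative phase) relies heavily on the mixing properties of the Markov chain, which often causes slow convergence and requires extra computation.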
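The two phases can be sketched in a few lines. Note the substitution: the code below uses contrastive divergence (CD-1), which truncates the Gibbs chain after a single step instead of running it until it mixes, so it is a common approximation rather than the exact log-likelihood gradient. The function names, parameter names (`W`, `b`, `c`), toy data, and hyperparameters are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_hidden(v, W, c, rng):
    """Sample h ~ p(h | v); exact, since hidden units are independent given v."""
    probs = [sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))
             for j in range(len(c))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def sample_visible(h, W, b, rng):
    """Sample v ~ p(v | h); exact, since visible units are independent given h."""
    probs = [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 parameter update (in place) for a single data vector v0."""
    h0, ph0 = sample_hidden(v0, W, c, rng)  # positive (clamped) phase
    v1, _ = sample_visible(h0, W, b, rng)   # one block Gibbs step
    _, ph1 = sample_hidden(v1, W, c, rng)   # negative (free) phase statistics
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])

rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
data = [[1, 1, 0, 0], [0, 0, 1, 1]]  # hypothetical toy patterns
for _ in range(500):
    for v in data:
        cd1_update(v, W, b, c, lr=0.1, rng=rng)
h_sample, h_probs = sample_hidden(data[0], W, c, rng)
print(h_probs)
```

Running more Gibbs steps before collecting the negative statistics moves the estimate closer to the true gradient at a higher computational cost.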
II: Sigmoid Belief Networks
Sigmoid belief networks are simply Bayesian networks over binary variables whose conditional probabilities are represented by sigmoid functions:
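One common way to write this conditional, assuming binary units $x_i \in \{0, 1\}$ with parent set $\mathrm{pa}(x_i)$; the weights $w_{ij}$ and bias $b_i$ are notational assumptions:

```latex
p\big(x_i = 1 \mid \mathrm{pa}(x_i)\big) = \sigma\Big( b_i + \sum_{x_j \in \mathrm{pa}(x_i)} w_{ij}\, x_j \Big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```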
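Bayesian networks exhibit a phenomenon called the “explaining away” effect: if A correlates with C, then the chance of B correlating with C decreases. ⇒ A and B become correlated given C.
Note that, because of the “explaining away” effect, all hidden variables become dependent once we condition on the visible layer of a belief network.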
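A small numerical illustration of explaining away, assuming a three-node network A → C ← B with independent binary causes and C defined as a deterministic OR of A and B. The prior probabilities are made-up values for illustration.

```python
from itertools import product

p_a, p_b = 0.1, 0.1  # assumed priors P(A=1), P(B=1)

def p_c_given(a, b):
    """P(C=1 | A=a, B=b): C is a deterministic OR of its parents."""
    return 1.0 if (a or b) else 0.0

def joint(a, b, c):
    pa = p_a if a else 1.0 - p_a
    pb = p_b if b else 1.0 - p_b
    pc = p_c_given(a, b) if c else 1.0 - p_c_given(a, b)
    return pa * pb * pc

def prob_a_given(**evidence):
    """P(A=1 | evidence) by brute-force enumeration of the joint."""
    num = den = 0.0
    for a, b, c in product([0, 1], repeat=3):
        assignment = {"a": a, "b": b, "c": c}
        if all(assignment[k] == v for k, v in evidence.items()):
            p = joint(a, b, c)
            den += p
            if a == 1:
                num += p
    return num / den

print(prob_a_given())          # prior: 0.1
print(prob_a_given(c=1))       # observing C raises belief in A: about 0.526
print(prob_a_given(c=1, b=1))  # B explains C away: back to about 0.1
```

A and B are independent a priori, yet once C is observed, learning B changes the belief in A. This induced dependence is what makes the posterior over hidden units intractable to factorize.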
Sigmoid Belief Networks as graphical models
Neal proposed Monte Carlo methods for learning and inference (Neal, 1992):
RBMs are infinite belief networks
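To make a gradient update of the model parameters, we need to compute expectations via sampling.
- We can sample exactly from the posterior in the first phase
- We run block Gibbs sampling to draw approximate samples from the joint distribution
The conditional distributions p(v|h) and p(h|v) are represented by sigmoids. We can therefore view Gibbs sampling from the joint distribution represented by an RBM as top-down propagation in an infinitely deep sigmoid belief network!
An RBM is equivalent to an infinitely deep belief network. When we train an RBM, we are really training an infinitely deep belief net, just with the weights of all layers tied. If the weights are “untied” to some extent, we get a deep belief network.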
Deep Belief Networks and Boltzmann Machines
III: Deep Belief Nets
DBNs are hybrid graphical models (chain graphs). Their joint probability distribution can be written as:
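A standard form of this factorization, assuming a DBN with visible layer $v$ and hidden layers $h^1, \dots, h^L$ (the superscript notation is an assumption): the top two layers form an undirected RBM, and all lower layers are directed top-down.

```latex
P(v, h^1, \dots, h^L) = P(v \mid h^1) \left( \prod_{l=1}^{L-2} P(h^l \mid h^{l+1}) \right) P(h^{L-1}, h^L)
```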
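Challenges:
Exact inference in DBNs is problematic because of the explaining away effect.
Training is done in two stages:
- Greedy pre-training + ad-hoc fine-tuning; no proper joint training
- Approximate inference is feed-forward (bottom-up)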
Layer-wise pre-training
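- Pre-train and freeze the first RBM
- Stack another RBM on top and train it
- The weights of layers 2 and above remain tied
- We repeat this procedure: pre-train and untie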
Fine-tuning
- Pre-training is quite ad-hoc and is unlikely to lead to a good probabilistic model per se
- However, the layers of representations could perhaps be useful for some other downstream tasks!
- We can further “fine-tune” a pre-trained DBN for some other task
Setting A: Unsupervised learning (DBN → autoencoder)
- Pre-train a stack of RBMs in a greedy layer-wise fashion
- “Unroll” the RBMs to create an autoencoder
- Fine-tune the parameters by optimizing the reconstruction error
Setting B: Supervised learning (DBN → classifier)
- Pre-train a stack of RBMs in a greedy layer-wise fashion
- “Unroll” the RBMs to create a feedforward classifier
- Fine-tune the parameters by optimizing the classification error
Deep Belief Nets and Boltzmann Machines
DBMs are fully undirected models (Markov random fields).
- They can be trained similarly to RBMs via MCMC (Hinton & Sejnowski, 1983)
- A variational approximation of the data distribution can be used for faster training (Salakhutdinov & Hinton, 2009)
- As with DBNs, they can be used to initialize other networks for downstream tasks
A few critical points to note about all these models:
- The primary goal of deep generative models is to represent the distribution of the observable variables. Adding layers of hidden variables allows the model to represent increasingly complex distributions.
- Hidden variables are secondary (auxiliary) elements used to facilitate learning of complex dependencies between the observables.
- Training of the model is ad-hoc, but what matters is the quality of learned hidden representations.
- Representations are judged by their usefulness on a downstream task (the probabilistic meaning of the model is often discarded at the end).
- In contrast, classical graphical models are often concerned with the correctness of learning and inference of all variables
Conclusion
- DL & GM: the fields are similar in the beginning (structure, energy, etc.), and then diverge to their own signature pipelines
- DL: most effort is directed to comparing different architectures and their components (models are driven by evaluating empirical performance on downstream tasks)
- DL models are good at learning robust hierarchical representations from the data and suitable for simple reasoning (call it “low-level cognition”)
- GM: the effort is directed towards improving inference accuracy and convergence speed
- GMs are best for provably correct inference and suitable for high-level complex reasoning tasks (call it “high-level cognition”)
- Convergence of both fields is very promising!
03 Combining DL methods and GMs
Using outputs of NNs as inputs to GMs
Combining sequential NNs and GMs
HMM: hidden Markov model
Hybrid NNs + conditional GMs
In a standard CRF (conditional random field), each of the factor cells is a parameter.
In a hybrid model, these values are computed by a neural network.
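A sketch of the hybrid idea on a linear-chain CRF: the unary potentials come from a tiny one-layer "network" rather than a lookup table, the pairwise potentials are a plain transition table, and decoding uses Viterbi. The names `unary_scores` and `viterbi`, the weights, and the inputs are hypothetical and chosen only to make the example self-contained.

```python
def unary_scores(x, W, b):
    """Stand-in for a neural network: maps input features to one score per label."""
    return [sum(wk * xk for wk, xk in zip(W[y], x)) + b[y] for y in range(len(b))]

def viterbi(unaries, transitions):
    """Highest-scoring label sequence of a linear-chain CRF (max-sum recursion)."""
    n_labels = len(transitions)
    score = list(unaries[0])
    backptr = []
    for u in unaries[1:]:
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda yp: score[yp] + transitions[yp][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + u[y])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    y = max(range(n_labels), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(backptr):  # follow back-pointers from the last position
        y = ptr[y]
        path.append(y)
    return path[::-1]

W = [[-1.0], [1.0]]                      # assumed network weights, one row per label
b = [0.0, 0.0]
transitions = [[1.0, -1.0], [-1.0, 1.0]]  # assumed pairwise scores favoring label persistence
inputs = [[2.0], [-0.1], [1.5]]

unaries = [unary_scores(x, W, b) for x in inputs]
independent = [max(range(2), key=lambda y: u[y]) for u in unaries]
structured = viterbi(unaries, transitions)
print(independent)  # [1, 0, 1]
print(structured)   # [1, 1, 1]
```

The middle input weakly prefers label 0 on its own, but the pairwise potentials override it. In a trained hybrid model the network and the transition scores would be learned jointly.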
GMs with potential functions represented by NNs; NNs with structured outputs
Using GMs as Prediction Explanations
How do we build a powerful predictive model whose predictions we can interpret in terms of semantically meaningful features?
Contextual Explanation Networks (CENs)
- The final prediction is made by a linear GM.
- Each coefficient assigns a weight to a meaningful attribute.
- Allows us to judge predictions in terms of GMs produced by the context encoder.
CEN: Implementation Details
Workflow:
- Maintain a (sparse) dictionary of GM parameters.
- Process complex inputs (images, text, time series, etc.) using deep nets; use soft attention to either select or combine models from the dictionary.
- Use constructed GMs (e.g., CRFs) to make predictions.
- Inspect GM parameters to understand the reasoning behind predictions.
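The workflow above can be sketched with the simplest possible ingredients: a linear map stands in for the deep context encoder, and a logistic-regression model stands in for the GM. All names (`encode_context`, `explain`, `predict`), the dictionary entries, and the numbers are hypothetical; an actual CEN would use a deep network and could generate richer models such as CRFs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encode_context(context, V):
    """Stand-in for a deep context encoder: a linear map to attention logits."""
    return [sum(vk * ck for vk, ck in zip(row, context)) for row in V]

def explain(context, V, dictionary):
    """Soft-attend over the dictionary to build a context-specific linear model."""
    attn = softmax(encode_context(context, V))
    n_attrs = len(dictionary[0])
    theta = [sum(a * model[k] for a, model in zip(attn, dictionary))
             for k in range(n_attrs)]
    return theta, attn

def predict(attributes, theta):
    """Logistic prediction from interpretable attributes with the generated model."""
    z = sum(t * x for t, x in zip(theta, attributes))
    return 1.0 / (1.0 + math.exp(-z))

dictionary = [[2.0, 0.0], [0.0, -2.0]]  # two assumed linear models over two attributes
V = [[5.0, 0.0], [0.0, 5.0]]            # assumed encoder weights
context = [1.0, 0.0]                    # e.g., features extracted from an image

theta, attn = explain(context, V, dictionary)
print(attn)                        # which dictionary entries were selected
print(theta)                       # inspectable coefficients for this context
print(predict([1.0, 1.0], theta))  # prediction from the interpretable attributes
```

Because the final prediction is linear in the attributes, the generated `theta` itself serves as the explanation for that particular input.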
Results: imagery as context
Based on the imagery, CEN learns to select different models for urban and rural areas.
Results: classical image & text datasets
CEN architectures for survival analysis
04 Bayesian Learning of NNs
- Bayesian learning of NN parameters
- Deep kernel learning
A neural network as a probabilistic model:
Likelihood:
- Categorical distribution for classification ⇒ cross-entropy loss
- Gaussian distribution for regression ⇒ squared loss
Prior on parameters:
- Gaussian prior ⇒ L2 regularization
- Laplace prior ⇒ L1 regularization
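The correspondence between priors and regularizers follows from writing out the MAP objective. The symbols below ($\theta$ for the network parameters, $\sigma^2$ for the Gaussian variance, $s$ for the Laplace scale) are notational assumptions:

```latex
\begin{aligned}
\hat{\theta}_{\mathrm{MAP}} &= \arg\max_{\theta} \Big[ \sum_{n} \log p(y_n \mid x_n, \theta) + \log p(\theta) \Big] \\
p(\theta) = \mathcal{N}(\theta;\, 0, \sigma^2 I) \;&\Rightarrow\; -\log p(\theta) = \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 + \mathrm{const} \\
p(\theta) = \prod_{k} \frac{1}{2s} \exp\!\Big( -\frac{|\theta_k|}{s} \Big) \;&\Rightarrow\; -\log p(\theta) = \frac{1}{s} \lVert \theta \rVert_1 + \mathrm{const}
\end{aligned}
```

Minimizing the negative log-likelihood plus the negative log-prior is therefore the usual loss plus an L2 or L1 penalty, with the regularization strength set by the prior's scale.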
Bayesian learning [MacKay 1992, Neal 1996, de Freitas 2003]