Derivation, Proof, and Examples of the EM Algorithm

EM

Derivation [1]

$\mathcal{X}=\{x_1, x_2, \cdots, x_N\}$: the observed data

$z$: the latent variable

The log-likelihood function is
$$\log P(\mathcal{X};\theta)=\log \prod_i^N P(x_i;\theta) = \sum_i^N \log P(x_i;\theta) \tag{1}$$
Our goal is to find the parameter $\theta$ that maximizes this log-likelihood:
$$\underset{\theta}{\arg \max} \log P(\mathcal{X};\theta) \tag{2}$$
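As a quick numerical illustration of Eq. (1): the log of a product of per-sample likelihoods equals the sum of per-sample log-likelihoods. The probabilities below are made-up values for three observations:

```python
import math

# Eq. (1): the log of a product equals the sum of logs.
# The per-sample likelihoods here are invented purely for illustration.
probs = [0.2, 0.5, 0.1]
lhs = math.log(math.prod(probs))       # log prod_i P(x_i; theta)
rhs = sum(math.log(p) for p in probs)  # sum_i log P(x_i; theta)
assert abs(lhs - rhs) < 1e-12
```

In practice the sum-of-logs form also avoids the numerical underflow that the raw product would suffer for large $N$.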

By the product rule,
$$P(x,z;\theta) = P(x;\theta)P(z|x;\theta) \tag{3}$$
and therefore
$$\log P(x;\theta) = \log \frac{P(x,z;\theta)}{P(z|x;\theta)} = \log P(x,z;\theta) - \log P(z|x;\theta) \tag{4}$$
Suppose $z$ follows a distribution $Q(z;\phi)$. Adding and subtracting $\log Q(z;\phi)$ gives
$$\begin{aligned} \log P(x;\theta) & = \log P(x,z;\theta) - \log P(z|x;\theta) + \log Q(z;\phi) - \log Q(z;\phi) \\ & = (\log P(x,z;\theta) - \log Q(z;\phi)) - (\log P(z|x;\theta) - \log Q(z;\phi)) \\ & = \log \frac{P(x,z;\theta)}{Q(z;\phi)} - \log \frac{P(z|x;\theta)}{Q(z;\phi)} \end{aligned} \tag{5}$$
Take the expectation of both sides of Eq. (5) with respect to $z$:
$$\text{left} = \int_z Q(z;\phi) \log P(x;\theta)\, dz = \log P(x;\theta) \int_z Q(z;\phi)\, dz = \log P(x;\theta) \tag{6}$$
since $\int_z Q(z;\phi)\, dz = 1$.

$$\begin{aligned} \text{right} & = \int_z Q(z;\phi) \log \frac{P(x,z;\theta)}{Q(z;\phi)}\, dz - \int_z Q(z;\phi) \log \frac{P(z|x;\theta)}{Q(z;\phi)}\, dz \\ & = \text{ELBO} + \text{KL}(Q(z;\phi)\,\|\,P(z|x;\theta)) \end{aligned} \tag{7}$$
The first term in Eq. (7) is called the ELBO (evidence lower bound); the second term is a KL divergence.
We therefore obtain
$$\log P(x;\theta) = \text{ELBO} + \text{KL}(Q(z;\phi)\,\|\,P(z|x;\theta)) \tag{8}$$
Since $\text{KL}(\cdot) \ge 0$, we have $\log P(x;\theta) \ge \text{ELBO}$, with equality if and only if $Q(z;\phi) = P(z|x;\theta)$. The ELBO is therefore a lower bound: by repeatedly raising the ELBO, we push up $\log P(x;\theta)$ and thereby work toward our goal of maximizing the log-likelihood.
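The decomposition in Eq. (8) is easy to verify numerically. The sketch below uses a made-up joint table over a binary latent z for a single observation x; for any choice of Q, log P(x) = ELBO + KL(Q || P(z|x)):

```python
import math

# Made-up joint P(x, z) for one observed x and a binary z; illustrative only.
P_xz = {0: 0.12, 1: 0.28}
P_x = sum(P_xz.values())                             # marginal P(x)
P_z_given_x = {z: p / P_x for z, p in P_xz.items()}  # posterior P(z|x)

Q = {0: 0.5, 1: 0.5}  # an arbitrary Q(z); not the true posterior

elbo = sum(Q[z] * math.log(P_xz[z] / Q[z]) for z in Q)
kl = sum(Q[z] * math.log(Q[z] / P_z_given_x[z]) for z in Q)

assert abs(math.log(P_x) - (elbo + kl)) < 1e-12  # Eq. (8)
assert elbo <= math.log(P_x)                     # ELBO is a lower bound
```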

Suppose we have $\theta^{(t)}$ and want to maximize the ELBO, i.e. minimize $\text{KL}(Q(z;\phi)\,\|\,P(z|x;\theta^{(t)}))$:
$$\phi^{(t)} = \underset{\phi}{\arg \min}\, \text{KL}(Q(z;\phi)\,\|\,P(z|x;\theta^{(t)})) = \underset{\phi}{\arg \max}\, \text{ELBO}(\phi, \theta^{(t)}) \tag{9}$$
At the optimal $\phi^{(t)}$ we have $Q(z;\phi^{(t)}) = P(z|x;\theta^{(t)})$. In practice the optimum is often hard to obtain, and our aim is simply to make the ELBO as large as possible. Once $\phi^{(t)}$ is found, we can in turn solve $\theta^{(t+1)} = \underset{\theta}{\arg \max}\,\text{ELBO}(\phi^{(t)}, \theta)$.

By repeating these two maximization steps, we can approximately find the parameter $\theta$ that maximizes the log-likelihood.

Let us take a closer look at $\text{ELBO}(\phi^{(t)}, \theta)$:

$$\begin{aligned} \text{ELBO}(\phi^{(t)}, \theta) &= \int_z Q(z;\phi^{(t)}) \log \frac{P(x,z;\theta)}{Q(z;\phi^{(t)})}\, dz \\ &= \int_z Q(z;\phi^{(t)}) \log P(x,z;\theta)\, dz - \int_z Q(z;\phi^{(t)}) \log Q(z;\phi^{(t)})\, dz \\ &= E_{z\sim Q(z;\phi^{(t)})}[\log P(x,z;\theta)] - E_{z\sim Q(z;\phi^{(t)})}[\log Q(z;\phi^{(t)})] \end{aligned} \tag{10}$$
In the last line of Eq. (10), the first term is an expectation over $z$ and the second is a constant (since $\phi^{(t)}$ is known), so
$$\begin{aligned} \theta^{(t+1)} &= \underset{\theta}{\arg \max}\,\text{ELBO}(\phi^{(t)}, \theta) \\ &= \underset{\theta}{\arg \max}\, E_{z\sim Q(z;\phi^{(t)})}[\log P(x,z;\theta)] \end{aligned} \tag{11}$$
This is why EM is called the expectation–maximization algorithm.

In summary, the EM algorithm iterates as follows:

  • E-step: fix $\theta^{(t)}$ and solve $\phi^{(t)}=\underset{\phi}{\arg \max}\, \text{ELBO}(\phi, \theta^{(t)})$, i.e. minimize $\text{KL}(Q(z;\phi)\,\|\,P(z|x;\theta^{(t)}))$ as in Eq. (9)
  • M-step: fix $\phi^{(t)}$ and solve $\theta^{(t+1)} = \underset{\theta}{\arg \max}\, E_{z\sim Q(z;\phi^{(t)})}[\log P(x,z;\theta)]$

The order of the E-step and the M-step can be swapped.
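To make the two steps concrete, here is a minimal sketch of the loop on a problem not discussed in the text: a 1-D mixture of two unit-variance Gaussians with equal weights, where θ is the pair of means and Q(z) is taken to be the exact posterior responsibility (the optimal E-step). The data and initial values are invented:

```python
import math
import random

# Illustrative data: two clusters centred near 0 and 4.
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(4.0, 1.0) for _ in range(200)]

mu1, mu2 = -1.0, 1.0  # initial theta = (mu1, mu2)
for _ in range(50):
    # E-step: Q(z) = posterior responsibility of component 1 for each x
    resp = []
    for x in data:
        p1 = math.exp(-0.5 * (x - mu1) ** 2)
        p2 = math.exp(-0.5 * (x - mu2) ** 2)
        resp.append(p1 / (p1 + p2))
    # M-step: maximize E_Q[log P(x, z; theta)] -> responsibility-weighted means
    mu1 = sum(r * x for r, x in zip(resp, data)) / sum(resp)
    mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - sum(resp))

print(round(mu1, 2), round(mu2, 2))  # means close to the cluster centres
```

Each pass performs one E-step and one M-step; the two means drift toward the cluster centres near 0 and 4.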

Convergence Proof of the EM Algorithm [1]

A simple proof: as long as each update $\theta^{(t)} \to \theta^{(t+1)}$ satisfies $\log P(x;\theta^{(t)}) \le \log P(x;\theta^{(t+1)})$, the likelihood sequence is non-decreasing and the algorithm is guaranteed to converge.

Starting from Eq. (4), take the expectation of both sides with respect to $z \sim Q(z;\phi^{(t)})$:
$$\begin{aligned} \text{left} &= \int_z Q(z;\phi^{(t)}) \log P(x;\theta)\, dz \\ &= \log P(x;\theta) \int_z Q(z; \phi^{(t)})\, dz \\ &= \log P(x;\theta) \end{aligned} \tag{12}$$

$$\text{right} = \int_z Q(z;\phi^{(t)}) \log P(x,z;\theta)\,dz - \int_z Q(z;\phi^{(t)}) \log P(z|x;\theta)\,dz \tag{13}$$

Since $\phi^{(t)}$ was obtained by solving Eq. (9), assume it is the optimum, so $Q(z;\phi^{(t)}) = P(z|x;\theta^{(t)})$. Substituting into Eq. (13) gives
$$\begin{aligned} \text{right} &= \int_z P(z|x;\theta^{(t)}) \log P(x,z;\theta)\,dz - \int_z P(z|x;\theta^{(t)}) \log P(z|x;\theta)\,dz \\ &= H_1 (\theta, \theta^{(t)}) - H_2 (\theta, \theta^{(t)}) \end{aligned} \tag{14}$$
where $H_1$ and $H_2$ denote the two terms of Eq. (14).

Combining Eqs. (12) and (14),
$$\log P(x;\theta) = H_1 (\theta, \theta^{(t)}) - H_2 (\theta, \theta^{(t)}) \tag{15}$$

Since $\theta^{(t+1)} = \underset{\theta}{\arg \max}\,E_{z\sim Q(z;\phi^{(t)})}[\log P(x,z;\theta)]$, we have $H_1 (\theta^{(t+1)}, \theta^{(t)}) \ge H_1 (\theta^{(t)}, \theta^{(t)})$. It then remains to show that $-H_2(\theta^{(t+1)}, \theta^{(t)}) \ge -H_2(\theta^{(t)}, \theta^{(t)})$ in order to conclude $\log P(x;\theta^{(t+1)}) \ge \log P(x;\theta^{(t)})$.

Now,
$$\begin{aligned} & H_2(\theta^{(t+1)}, \theta^{(t)}) - H_2(\theta^{(t)}, \theta^{(t)}) \\ =& \int_z P(z|x;\theta^{(t)}) \log P(z|x; \theta^{(t+1)})\, dz - \int_z P(z|x;\theta^{(t)}) \log P(z|x; \theta^{(t)})\, dz \\ =& \int_z P(z|x;\theta^{(t)}) \log \frac{P(z|x; \theta^{(t+1)})}{P(z|x; \theta^{(t)})}\, dz \end{aligned} \tag{16}$$
We show that Eq. (16) is at most 0 in two ways:

Method 1: Eq. (16) is a negative KL divergence, $-\text{KL}(P(z|x;\theta^{(t)})\,\|\,P(z|x;\theta^{(t+1)})) \le 0$.

Method 2: Eq. (16) equals $E_{z \sim P(z|x;\theta^{(t)})}[\log \frac{P(z|x; \theta^{(t+1)})}{P(z|x; \theta^{(t)})}]$. By Jensen's inequality, $E[\log X] \le \log E[X]$, so
$$\begin{aligned} & E_{z \sim P(z|x;\theta^{(t)})}[\log \frac{P(z|x; \theta^{(t+1)})}{P(z|x; \theta^{(t)})}] \\ \le & \log E_{z \sim P(z|x;\theta^{(t)})}[\frac{P(z|x; \theta^{(t+1)})}{P(z|x; \theta^{(t)})}] \\ = & \log \int_z P(z|x;\theta^{(t)}) \frac{P(z|x; \theta^{(t+1)})}{P(z|x; \theta^{(t)})}\, dz \\ =& \log \int_z P(z|x; \theta^{(t+1)})\, dz \\ =& \log 1 = 0 \end{aligned} \tag{17}$$
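The inequality proved above can also be checked numerically: for any two distributions over the same support, the quantity in Eq. (16) is a negative KL divergence and never exceeds zero. The random distributions below merely stand in for P(z|x; θ(t)) and P(z|x; θ(t+1)):

```python
import math
import random

# Check that sum_z p_t(z) * log(p_{t+1}(z) / p_t(z)) <= 0 (Eq. 16):
# it is minus a KL divergence. Both distributions are random, illustrative only.
random.seed(0)
for _ in range(100):
    w1 = [random.random() for _ in range(4)]
    w2 = [random.random() for _ in range(4)]
    p_t = [w / sum(w1) for w in w1]   # plays the role of P(z|x; theta_t)
    p_t1 = [w / sum(w2) for w in w2]  # plays the role of P(z|x; theta_{t+1})
    neg_kl = sum(a * math.log(b / a) for a, b in zip(p_t, p_t1))
    assert neg_kl <= 1e-12            # -KL(p_t || p_{t+1}) <= 0
```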

Example [2]

Coin tossing: there are two coins whose materials differ, so their probabilities of landing heads are not the same. We have a single set of observations and want to estimate each coin's probability of heads. Five rounds were performed, with five tosses per round. We do not know which coin was used in each round, so the problem gains a latent variable: the identity of the coin chosen in each round.
(Figure: the coin-tossing example; image from https://blog.csdn.net/u010834867/article/details/90762296)

Let the two coins be A and B, with $P(\text{H}|A)=x_1$, $P(\text{T}|A)=1-x_1$, $P(\text{H}|B)=x_2$, and $P(\text{T}|B)=1-x_2$.

Suppose that in round $i$ coin A is chosen with probability $P(z_i=A)=y_i$ and coin B with probability $P(z_i=B)=1-y_i$.

Writing toss $j$ of round $i$ as $x_{ij}$, the log-likelihood is
$$\log P(x) = \log \prod_i \prod_j P(x_{ij}) = \sum_i \sum_j \log P(x_{ij})$$
with $P(x_{ij})=\frac{P(x,z)}{P(z|x)}=\frac{P(z)P(x|z)}{P(z|x)}$.

Here $P(z|x)$ is hard to compute directly, so we use the EM algorithm to solve for $x_1$ and $x_2$.

First, we write out the expectation:
$$\begin{aligned} & E_{z\sim Q(z;\phi^{(t)})}[\log P(x,z;\theta)] \\ =& y_1 \log y_1(x_1 x_1 (1-x_1) x_1 (1-x_1)) &+& (1-y_1)\log (1-y_1)(x_2 x_2 (1-x_2) x_2 (1-x_2)) \\ +& y_2 \log y_2((1-x_1)(1-x_1)x_1 x_1(1-x_1)) &+& (1-y_2)\log (1-y_2)((1-x_2)(1-x_2)x_2 x_2(1-x_2)) \\ +& y_3 \log y_3(x_1 (1-x_1)(1-x_1)(1-x_1)(1-x_1)) &+& (1-y_3)\log (1-y_3)(x_2 (1-x_2)(1-x_2)(1-x_2)(1-x_2)) \\ +& y_4 \log y_4(x_1 (1-x_1)(1-x_1) x_1 x_1) &+& (1-y_4) \log (1-y_4)(x_2 (1-x_2)(1-x_2) x_2 x_2) \\ +& y_5 \log y_5((1-x_1)x_1 x_1 (1-x_1) (1-x_1)) &+& (1-y_5) \log (1-y_5)((1-x_2)x_2 x_2 (1-x_2) (1-x_2)) \end{aligned}$$

Substituting $x_1=0.2$, $x_2=0.7$ into the expression above gives
$$\begin{aligned} & E_{z\sim Q(z;\phi^{(t)})}[\log P(x,z;\theta)] \\ =& y_1 \log 0.00512 y_1 &+& (1-y_1)\log 0.03087(1-y_1) \\ +& y_2 \log 0.02048 y_2 &+& (1-y_2)\log 0.01323(1-y_2) \\ +& y_3 \log 0.08192 y_3 &+& (1-y_3)\log 0.00567 (1-y_3) \\ +& y_4 \log 0.00512 y_4 &+& (1-y_4) \log 0.03087 (1-y_4) \\ +& y_5 \log 0.02048 y_5 &+& (1-y_5) \log 0.01323(1-y_5) \end{aligned}$$
(Figure: the case where the heads probabilities are known; image from https://blog.csdn.net/u010834867/article/details/90762296)
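The numeric coefficients above can be checked with a few lines of code: each round's coefficient is x^heads · (1−x)^tails over five tosses, with the heads counts read off the products in the expectation:

```python
# Verify the coefficients obtained by substituting x1 = 0.2 and x2 = 0.7.
heads = [3, 2, 1, 3, 2]  # heads per round, read off the products above
for x, expected in [(0.2, [0.00512, 0.02048, 0.08192, 0.00512, 0.02048]),
                    (0.7, [0.03087, 0.01323, 0.00567, 0.03087, 0.01323])]:
    for h, e in zip(heads, expected):
        # coefficient for this round: x^heads * (1 - x)^tails
        assert abs(x**h * (1 - x)**(5 - h) - e) < 1e-9
```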

We now solve
$$\max E_{z\sim Q(z;\phi^{(t)})}[\log P(x,z;\theta)]$$

To keep the computation simple, we take $Q(z;\phi^{(t)})$ to be $y=\{0,1,1,0,1\}$, i.e. $z=\{B,A,A,B,A\}$. The resulting expectation is not the maximum, but this does not affect the convergence of the algorithm. Since the values of $z$ are now fixed, $\theta^{(t+1)}$ can be computed directly:
$$x_1 = (2+1+2) / 15 = 0.33, \quad x_2 = (3+3) / 10 = 0.6$$

We then keep iterating until $z$, or $x_1$ and $x_2$, converge.
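The whole iteration can be sketched as a short hard-assignment EM, matching the text's choice of fixing z in each step; the heads counts per round are read off the products in the expectation above:

```python
# Hard-assignment EM for the two-coin example; data and initial values
# follow the text.
heads = [3, 2, 1, 3, 2]  # heads observed in each of the 5 rounds
n = 5                    # tosses per round
x1, x2 = 0.2, 0.7        # initial P(heads | A), P(heads | B)

for _ in range(20):
    # E-step (hard): assign each round to the coin with the higher likelihood
    z = []
    for h in heads:
        la = x1**h * (1 - x1)**(n - h)  # likelihood of the round under coin A
        lb = x2**h * (1 - x2)**(n - h)  # likelihood of the round under coin B
        z.append('A' if la >= lb else 'B')
    # M-step: re-estimate each coin's heads probability from its rounds
    a_heads = sum(h for h, c in zip(heads, z) if c == 'A')
    a_total = n * z.count('A')
    b_heads = sum(h for h, c in zip(heads, z) if c == 'B')
    b_total = n * z.count('B')
    new_x1 = a_heads / a_total if a_total else x1
    new_x2 = b_heads / b_total if b_total else x2
    if abs(new_x1 - x1) < 1e-9 and abs(new_x2 - x2) < 1e-9:
        break  # parameters stopped changing
    x1, x2 = new_x1, new_x2

print(z, round(x1, 3), round(x2, 3))
```

Starting from x1 = 0.2 and x2 = 0.7, the first E-step reproduces z = {B, A, A, B, A} and the first M-step gives x1 = 1/3 and x2 = 0.6; the assignment then stops changing, so the loop terminates.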

If anything here is inaccurate, please point it out.


  1. https://www.bilibili.com/video/av31906558?p=1

  2. https://blog.csdn.net/u010834867/article/details/90762296
