Chapter 9: The EM Algorithm

Starting with Chapter 9 the material changes in character: Chapters 2-8 covered classification problems, all of which belong to supervised learning, whereas the EM algorithm of Chapter 9 belongs to unsupervised learning. This post summarizes the applications of the EM algorithm, how it handles problems, and the derivation of its underlying principle.

The EM Algorithm

The EM algorithm (expectation-maximization algorithm) is an iterative algorithm. A probabilistic model may involve both observed variables and hidden (latent) variables. If all of the model's variables are observed, then given the data we can estimate the parameters directly by maximum likelihood estimation or Bayesian estimation. When the model contains latent variables, however, such a direct estimate is no longer possible. For this situation Dempster et al. proposed the EM algorithm in 1977: the E-step computes an expectation, and the M-step performs a maximization.
Input: observed variable data $Y$, latent variable data $Z$, the joint distribution $P(Y,Z|\theta)$, and the conditional distribution $P(Z|Y,\theta)$.
Output: model parameter $\theta$.
(1) Choose an initial value $\theta^{(0)}$ for the parameter and start iterating.
(2) **E-step:** Let $\theta^{(i)}$ be the estimate of $\theta$ at the $i$-th iteration. At the E-step of iteration $i+1$, compute
$$
Q(\theta,\theta^{(i)}) = E_Z\big[\ln P(Y,Z|\theta) \mid Y, \theta^{(i)}\big] = \sum_Z \ln P(Y,Z|\theta)\, P(Z|Y,\theta^{(i)})
$$
Here $P(Z|Y,\theta^{(i)})$ is the conditional distribution of the latent data $Z$ given the observed data $Y$ and the current parameter estimate $\theta^{(i)}$.
(3) **M-step:** Find the $\theta$ that maximizes $Q(\theta,\theta^{(i)})$; this becomes the parameter estimate of iteration $i+1$:
$$
\theta^{(i+1)}=\mathop{\arg\max}\limits_{\theta} Q(\theta,\theta^{(i)})
$$
(4) Repeat steps (2) and (3) until convergence (stopping criterion: $\theta^{(i)}$ and $\theta^{(i+1)}$ are sufficiently close, or $Q(\theta^{(i+1)},\theta^{(i)})$ and $Q(\theta^{(i)},\theta^{(i-1)})$ are sufficiently close).
The function $Q(\theta,\theta^{(i)})$ is the core of the EM algorithm and is called the Q function.
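
To make the iteration structure concrete, here is a minimal sketch of the generic EM loop in Python (my own illustration, not from the book). The callables `e_step` and `m_step` are placeholders to be supplied by a specific model; for the Gaussian mixture below they would be the responsibility computation and the closed-form updates derived later in this post.

```python
import numpy as np

def em(theta0, e_step, m_step, y, tol=1e-6, max_iter=1000):
    """Generic EM driver (a sketch; theta is an array of parameters).

    e_step(y, theta) -> expected sufficient statistics of the latent
                        variables under P(Z | Y, theta), i.e. what Q needs
    m_step(y, stats) -> the theta maximizing Q(theta, theta_old)
    """
    theta = np.asarray(theta0, dtype=float)
    for it in range(max_iter):
        stats = e_step(y, theta)          # E-step: expectations of latent data
        theta_new = np.asarray(m_step(y, stats), dtype=float)  # M-step
        # stop when the parameter estimates barely change
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new, it + 1
        theta = theta_new
    return theta, max_iter
```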

Derivation

The EM algorithm was stated above, but why does it approximately carry out maximum likelihood estimation of the observed data? Below we derive the EM algorithm by approximately maximizing the log-likelihood of the observed data, which clarifies what the algorithm actually does.

A result used in the derivation:

Jensen's inequality: $f\big(\sum_i \alpha_i x_i\big) \geqslant \sum_i \alpha_i f(x_i)$ for a concave function $f$ (the logarithm is concave), where the weights satisfy $\sum_i \alpha_i = 1$ and $0 \leqslant \alpha_i \leqslant 1$.
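
A quick numeric sanity check of this inequality for the concave logarithm, a small sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5.0, size=6)       # arbitrary positive points
alpha = rng.dirichlet(np.ones(6))       # weights: alpha_i >= 0, sum to 1

lhs = np.log(np.dot(alpha, x))          # ln(sum_i alpha_i x_i)
rhs = np.dot(alpha, np.log(x))          # sum_i alpha_i ln(x_i)
print(lhs, ">=", rhs, lhs >= rhs)       # Jensen for concave ln: lhs >= rhs
```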

We have a parameter vector $\theta$ to be estimated, observed data $Y=(y_1,y_2,\cdots,y_N)$, and latent data $Z=(z_1,z_2,\cdots,z_N)$. The log-likelihood to be maximized with respect to $\theta$ is
$$
L(\theta) = \ln P(Y|\theta) = \ln \sum_Z P(Y,Z|\theta) = \ln \Big( \sum_Z P(Z|\theta) P(Y|Z,\theta) \Big)
$$
Suppose the estimate of $\theta$ after the $i$-th iteration is $\theta^{(i)}$. We want the new estimate $\theta$ to increase $L(\theta)$, i.e. $L(\theta) > L(\theta^{(i)})$, so consider the difference
$$
L(\theta)-L(\theta^{(i)}) = \ln \Big( \sum_Z P(Z|\theta) P(Y|Z,\theta) \Big) - \ln P(Y|\theta^{(i)})
$$
In general $\ln P_1 P_2 \cdots P_N$ is easy to handle, while $\ln \sum P_1 P_2$ is not. To remove the summation inside the logarithm we bound it with Jensen's inequality. How do we build weights $\alpha_i$ for the sum over $Z$? The natural choice is a probability distribution over $Z$, whose values sum to 1, so insert $P(Z|Y,\theta^{(i)})$:
$$
\begin{aligned}
L(\theta)-L(\theta^{(i)}) &= \ln \Big( \sum_Z P(Z|\theta) P(Y|Z,\theta)\Big) - \ln P(Y|\theta^{(i)}) \\
&= \ln \Big( \sum_Z P(Z|Y,\theta^{(i)}) \frac{P(Z|\theta)P(Y|Z,\theta)}{P(Z|Y,\theta^{(i)})} \Big) - \ln P(Y|\theta^{(i)})
\end{aligned}
$$
By Jensen's inequality,
$$
\ln \Big( \sum_Z P(Z|Y,\theta^{(i)}) \frac{P(Z|\theta)P(Y|Z,\theta)}{P(Z|Y,\theta^{(i)})} \Big) \geqslant \sum_Z P(Z|Y,\theta^{(i)}) \ln \frac{P(Z|\theta)P(Y|Z,\theta)}{P(Z|Y,\theta^{(i)})}
$$
Since $\ln P(Y|\theta^{(i)}) = \sum_Z P(Z|Y,\theta^{(i)}) \ln P(Y|\theta^{(i)})$, it follows that
$$
\begin{aligned}
L(\theta)-L(\theta^{(i)}) &\geqslant \sum_Z P(Z|Y,\theta^{(i)}) \ln \frac{P(Z|\theta)P(Y|Z,\theta)}{P(Z|Y,\theta^{(i)})} - \sum_Z P(Z|Y,\theta^{(i)}) \ln P(Y|\theta^{(i)}) \\
&= \sum_Z P(Z|Y,\theta^{(i)}) \ln \frac{P(Z|\theta)P(Y|Z,\theta)}{P(Z|Y,\theta^{(i)})\, P(Y|\theta^{(i)})}
\end{aligned}
$$
Let
$$
B(\theta,\theta^{(i)}) = L(\theta^{(i)}) + \sum_Z P(Z|Y,\theta^{(i)}) \ln \frac{P(Z|\theta)P(Y|Z,\theta)}{P(Z|Y,\theta^{(i)})\, P(Y|\theta^{(i)})}
$$
Then $L(\theta) \geqslant B(\theta,\theta^{(i)})$, i.e. $B(\theta,\theta^{(i)})$ is a lower bound of $L(\theta)$. To increase $L(\theta)$ we maximize $B(\theta,\theta^{(i)})$ instead. Dropping the terms that do not depend on $\theta$:
$$
\begin{aligned}
\theta^{(i+1)} &= \mathop{\arg\max}\limits_{\theta} B(\theta,\theta^{(i)}) = \mathop{\arg\max}\limits_{\theta} \Big( \sum_Z P(Z|Y,\theta^{(i)}) \ln \big(P(Z|\theta)P(Y|Z,\theta)\big)\Big) \\
&= \mathop{\arg\max}\limits_{\theta} \Big( \sum_Z P(Z|Y,\theta^{(i)}) \ln P(Y,Z|\theta)\Big)
\end{aligned}
$$
Since $Q(\theta, \theta^{(i)}) = \sum_Z \ln P(Y,Z|\theta)\, P(Z|Y,\theta^{(i)})$, this is exactly
$$
\theta^{(i+1)} = \mathop{\arg\max}\limits_{\theta} Q(\theta, \theta^{(i)})
$$
which is the M-step of the EM algorithm; the E-step corresponds to computing $\sum_Z P(Z|Y,\theta^{(i)}) \ln P(Y,Z|\theta)$. This derives the EM algorithm: it approaches the maximization of the log-likelihood by repeatedly maximizing a lower bound of it.
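
To see that $B(\theta,\theta^{(i)})$ really is a lower bound that touches $L(\theta)$ at $\theta=\theta^{(i)}$, here is a small numeric sketch of my own. It assumes a single observation from a two-component mixture of unit-variance Gaussians whose two means play the role of $\theta$, with the mixing weights held fixed:

```python
import numpy as np

def normal_pdf(y, mu, sigma=1.0):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

alpha = np.array([0.4, 0.6])        # P(z | theta), held fixed in this toy example
y = 1.3                             # a single observation
mu_i = np.array([-1.0, 2.0])        # current estimate theta^(i) (the two means)

def L(mu):                          # observed-data log-likelihood ln P(y | theta)
    return np.log(np.sum(alpha * normal_pdf(y, mu)))

# posterior over z under theta^(i):  P(z | y, theta^(i))
post = alpha * normal_pdf(y, mu_i)
post = post / post.sum()

def B(mu):                          # the lower bound B(theta, theta^(i))
    ratio = alpha * normal_pdf(y, mu) / (post * np.exp(L(mu_i)))
    return L(mu_i) + np.sum(post * np.log(ratio))

for mu1 in np.linspace(-3, 3, 7):   # vary one mean, keep the other fixed
    mu = np.array([mu1, 2.0])
    print(f"mu1={mu1:+.1f}  L={L(mu):.4f}  B={B(mu):.4f}  L>=B: {L(mu) >= B(mu)}")
print("Equality at theta = theta^(i):", np.isclose(L(mu_i), B(mu_i)))
```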

Application of the EM Algorithm to Gaussian Mixture Models

The Gaussian Mixture Model

A Gaussian mixture model is a probability model of the form
$$
P(y|\theta)=\sum_{k=1}^K \alpha_k \phi(y|\theta_k)
$$
where the $\alpha_k$ are mixing coefficients with $\alpha_k \geqslant 0$ and $\sum_{k=1}^K \alpha_k = 1$, and $\phi(y|\theta_k)$ is the Gaussian density with $\theta_k=(\mu_k, \sigma_k^2)$,
$$
\phi(y|\theta_k)=\frac{1}{\sqrt{2 \pi}\, \sigma_k} \exp \left( -\frac{(y-\mu_k)^2}{2\sigma_k^2} \right)
$$
called the $k$-th component.
Consider only a simple one-dimensional random variable $y$. The Gaussian distribution is just the normal distribution: if $y \sim N(\mu, \sigma^2)$ and we observe values of $y$, then $\mu$ and $\sigma^2$ are easy to estimate. But now $y$ does not come from a single Gaussian; with certain probabilities it comes from one of two different Gaussians $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, that is, a mixture of two Gaussians, and we do not know which Gaussian each $y$ came from. This is where the latent variable enters. For parameter estimation with a latent variable we can proceed as follows: represent $z$ by a vector $\gamma$; if $z=1$ then $\gamma=(1,0,0,\cdots,0)$, and if $z=2$ then $\gamma=(0,1,0,\cdots,0)$. This is a one-hot encoding: if $z$ selects the $i$-th Gaussian, the $i$-th component of $\gamma$ is 1 and all other components are 0.
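
As an illustration (my own, not from the book), a minimal sketch of a two-component one-dimensional mixture: sampling first draws the one-hot $\gamma$ with probabilities $\alpha_k$, then draws $y$ from the chosen Gaussian; the density is the $\alpha_k$-weighted sum of the component densities.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([0.3, 0.7])             # mixing coefficients, sum to 1
mu = np.array([-2.0, 3.0])               # component means
sigma = np.array([1.0, 1.5])             # component standard deviations

def mixture_density(y):
    """P(y|theta) = sum_k alpha_k * phi(y | mu_k, sigma_k^2)."""
    phi = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return phi @ alpha

def sample(n):
    """Draw n points: first the latent one-hot gamma, then y from that component."""
    z = rng.choice(len(alpha), size=n, p=alpha)     # latent component index
    gamma = np.eye(len(alpha))[z]                   # one-hot encoding of z
    y = rng.normal(mu[z], sigma[z])                 # observation from the chosen Gaussian
    return y, gamma

y, gamma = sample(5)
print("samples:", np.round(y, 2))
print("one-hot latent gamma:\n", gamma)
print("density at samples:", np.round(mixture_density(y), 4))
```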

Derivation

Identify the latent variable and write the complete-data log-likelihood

Following the EM algorithm, there is a latent variable $\gamma$ indicating which Gaussian the current $y$ comes from. For the first observation we have $\gamma_1=(\gamma_{11},\gamma_{12},\cdots,\gamma_{1K})$, where, following the book's definition of $\gamma_{jk}$:
$$
\gamma_{jk} = \begin{cases} 1, & \text{observation } j \text{ comes from component } k \\ 0, & \text{otherwise} \end{cases}
\qquad j = 1,2,\cdots, N;\ k=1, 2,\cdots,K
$$
This is a discrete distribution: the first value has probability $\alpha_1$, the second $\alpha_2$, ..., the $K$-th $\alpha_K$. Once $\gamma_1$ is known, we know which Gaussian $y_1$ was drawn from.
$$
\begin{aligned}
p(\gamma_1,y_1|\theta) &= p(\gamma_1|\theta) \cdot p(y_1 | \gamma_1,\theta) \\
&= \alpha_1^{\gamma_{11}} \alpha_2^{\gamma_{12}} \cdots \alpha_K^{\gamma_{1K}}\, \phi(y_1|\theta_1)^{\gamma_{11}} \phi(y_1|\theta_2)^{\gamma_{12}} \cdots \phi(y_1|\theta_K)^{\gamma_{1K}} = \prod_{k=1}^K \big[ \alpha_k \phi(y_1 | \theta_k) \big]^{\gamma_{1k}}
\end{aligned}
$$
This is the complete-data density of the first sample. Maximum likelihood estimation maximizes the likelihood of all samples, which requires their joint distribution; for all samples the complete-data density is
$$
P(y,\gamma|\theta)=\prod_{j=1}^N \prod_{k=1}^K \big[\alpha_k \phi(y_j | \theta_k)\big]^{\gamma_{jk}}
$$
Since $\prod_{j=1}^N \prod_{k=1}^K \alpha_k^{\gamma_{jk}} = \prod_{k=1}^K \alpha_k^{\sum_{j=1}^N \gamma_{jk}}$, and $\sum_{j=1}^N \gamma_{jk}$ counts how many of the $N$ samples come from the $k$-th Gaussian, write $n_k=\sum_{j=1}^N \gamma_{jk}$, so that $n_1+n_2+\cdots+n_K=N$ and
$$
\prod_{j=1}^N \prod_{k=1}^K \alpha_k^{\gamma_{jk}} = \prod_{k=1}^K \alpha_k^{n_k}
$$
Therefore
$$
P(y, \gamma|\theta) = \prod_{k=1}^K \alpha_k^{n_k} \prod_{j=1}^N \big[\phi(y_j|\theta_k)\big]^{\gamma_{jk}} = \prod_{k=1}^K \alpha_k^{n_k} \prod_{j=1}^N \left[ \frac{1}{\sqrt{2\pi}\, \sigma_k} \exp\left(-\frac{(y_j-\mu_k)^2}{2 \sigma_k^2} \right) \right]^{\gamma_{jk}}
$$
$$
\ln P(y, \gamma|\theta) = \sum_{k=1}^K \left\{ n_k \ln \alpha_k + \sum_{j=1}^N \gamma_{jk} \left[\ln \frac{1}{\sqrt{2\pi}} - \ln \sigma_k - \frac{1}{2 \sigma_k^2} (y_j - \mu_k)^2\right] \right\}
$$
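
A short self-contained sketch of my own (reusing the same two-component parameters as above) that evaluates $\ln P(y,\gamma|\theta)$ both directly from the product form and via the $n_k$ decomposition just derived; the two agree:

```python
import numpy as np

alpha = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.5])

rng = np.random.default_rng(2)
z = rng.choice(2, size=6, p=alpha)                 # true latent assignments
gamma = np.eye(2)[z]                               # one-hot gamma_{jk}
y = rng.normal(mu[z], sigma[z])                    # observations

log_phi = (-np.log(np.sqrt(2 * np.pi)) - np.log(sigma)
           - (y[:, None] - mu) ** 2 / (2 * sigma ** 2))     # ln phi(y_j | theta_k)

# Direct form: ln prod_j prod_k [alpha_k phi(y_j|theta_k)]^{gamma_jk}
direct = np.sum(gamma * (np.log(alpha) + log_phi))

# Decomposed form: sum_k { n_k ln alpha_k + sum_j gamma_jk [...] }
n_k = gamma.sum(axis=0)
decomposed = np.sum(n_k * np.log(alpha)) + np.sum(gamma * log_phi)

print(direct, decomposed, np.isclose(direct, decomposed))   # identical values
```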

E-step of the EM algorithm: determine the Q function

Replace the latent quantities by their expectations; the latent quantities are $\gamma_{jk}$ and $n_k$.
Since $E(n_k) = E\big(\sum_j \gamma_{jk}\big) = \sum_j E(\gamma_{jk})$ and $E(\gamma_{jk} \mid y, \theta^{(i)}) = P(\gamma_{jk}=1 \mid y, \theta^{(i)})$, and the expectation is taken given the previous estimate $\theta^{(i)}$ and all the observed data $y_j$, we need the distribution $P(\gamma_{jk}=1 \mid y, \theta^{(i)})$:
$$
\begin{aligned}
P(\gamma_{jk}=1 \mid y, \theta^{(i)}) &= \frac{P(\gamma_{jk}=1, y_j \mid \theta^{(i)})}{P(y_j \mid \theta^{(i)})} \\
&= \frac{P(\gamma_{jk}=1, y_j \mid \theta^{(i)})}{\displaystyle \sum_{k=1}^K P(\gamma_{jk}=1, y_j \mid \theta^{(i)})} \\
&= \frac{P(\gamma_{jk}=1 \mid \theta^{(i)})\, P(y_j \mid \gamma_{jk}=1, \theta^{(i)})}{\displaystyle \sum_{k=1}^K P(\gamma_{jk}=1 \mid \theta^{(i)})\, P(y_j \mid \gamma_{jk}=1, \theta^{(i)})}
\end{aligned}
$$
Since $\alpha_k = P(\gamma_{jk}=1 \mid \theta)$ and $\phi(y_j \mid \theta_k) = P(y_j \mid \gamma_{jk}=1, \theta)$,
$$
\hat{\gamma}_{jk} = E(\gamma_{jk} \mid y, \theta^{(i)}) = P(\gamma_{jk}=1 \mid y, \theta^{(i)}) = \frac{\alpha_k^{(i)} \phi(y_j \mid \theta_k^{(i)})}{\displaystyle \sum_{k=1}^K \alpha_k^{(i)} \phi(y_j \mid \theta_k^{(i)})}, \qquad \theta^{(i)} = (\alpha_k^{(i)}, \theta_k^{(i)})
$$
$\hat{\gamma}_{jk}$ is the conditional expectation of $\gamma_{jk}$ given the observed data $y$ and $\theta^{(i)}$ (the responsibility of component $k$ for observation $j$); it depends on $j$ through $y_j$ as well as on $k$. Writing $n_k = \sum_{j=1}^N \hat{\gamma}_{jk}$ and substituting these expectations into the complete-data log-likelihood gives
$$
Q(\theta, \theta^{(i)}) = E_Z\big[\ln P(y,\gamma \mid \theta) \mid y, \theta^{(i)}\big] = \sum_{k=1}^K \left\{ n_k \ln \alpha_k + \sum_{j=1}^N \hat{\gamma}_{jk} \left[ \ln \frac{1}{\sqrt{2\pi}} - \ln \sigma_k - \frac{1}{2 \sigma_k^2} (y_j - \mu_k)^2\right] \right\}
$$
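
The E-step thus reduces to computing the responsibility matrix $\hat{\gamma}_{jk}$; a minimal sketch under the same assumed two-component setup:

```python
import numpy as np

def e_step(y, alpha, mu, sigma):
    """gamma_hat[j, k] = alpha_k phi(y_j|theta_k) / sum_k alpha_k phi(y_j|theta_k)."""
    phi = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    weighted = alpha * phi                            # numerator, shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

y = np.array([-2.3, -1.8, 0.1, 2.9, 3.4])
gamma_hat = e_step(y, np.array([0.5, 0.5]), np.array([-2.0, 3.0]), np.array([1.0, 1.5]))
print(np.round(gamma_hat, 3))                         # each row sums to 1
```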

M-step of the EM algorithm

The parameters to estimate are $\alpha_k, \mu_k, \sigma_k^2$. Set the partial derivatives to zero:
$$
\frac{\partial Q(\theta, \theta^{(i)})}{\partial \mu_k} = 0, \qquad
\frac{\partial Q(\theta, \theta^{(i)})}{\partial \sigma_k^2} = 0, \qquad
\begin{cases} \dfrac{\partial Q(\theta, \theta^{(i)})}{\partial \alpha_k} = 0 \\ \displaystyle \sum_k \alpha_k = 1 \end{cases}
$$
(the $\alpha_k$ update is a maximization under the constraint $\sum_k \alpha_k = 1$). Solving these equations yields
$$
\mu_k^{(i+1)} = \frac{\displaystyle \sum_{j=1}^N \hat{\gamma}_{jk}\, y_j}{\displaystyle \sum_{j=1}^N \hat{\gamma}_{jk}}, \qquad
(\sigma_k^2)^{(i+1)} = \frac{\displaystyle \sum_{j=1}^N \hat{\gamma}_{jk}\, (y_j - \mu_k^{(i+1)})^2}{\displaystyle \sum_{j=1}^N \hat{\gamma}_{jk}}, \qquad
\alpha_k^{(i+1)} = \frac{n_k}{N} = \frac{\displaystyle \sum_{j=1}^N \hat{\gamma}_{jk}}{N}
$$
where $\hat{\gamma}_{jk}=E(\gamma_{jk}\mid y,\theta^{(i)})$, $n_k = \sum_{j=1}^N \hat{\gamma}_{jk}$, and $k=1,2,\cdots,K$.
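
Putting the E-step and these closed-form M-step updates together gives the full EM loop for the one-dimensional mixture. A sketch under the same assumed two-component setup (not the book's code):

```python
import numpy as np

def e_step(y, alpha, mu, sigma):
    phi = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    weighted = alpha * phi
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step(y, gamma_hat):
    """Closed-form updates derived above."""
    n_k = gamma_hat.sum(axis=0)                          # n_k = sum_j gamma_hat_jk
    alpha = n_k / len(y)                                 # alpha_k^(i+1) = n_k / N
    mu = (gamma_hat * y[:, None]).sum(axis=0) / n_k      # weighted mean
    var = (gamma_hat * (y[:, None] - mu) ** 2).sum(axis=0) / n_k
    return alpha, mu, np.sqrt(var)

# synthetic data from a known two-component mixture
rng = np.random.default_rng(3)
z = rng.choice(2, size=500, p=[0.3, 0.7])
y = rng.normal(np.array([-2.0, 3.0])[z], np.array([1.0, 1.5])[z])

# crude initialization, then alternate E-step and M-step
alpha, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(100):
    gamma_hat = e_step(y, alpha, mu, sigma)
    alpha, mu, sigma = m_step(y, gamma_hat)

print("alpha:", np.round(alpha, 3))    # should be near (0.3, 0.7)
print("mu:   ", np.round(mu, 3))       # should be near (-2, 3)
print("sigma:", np.round(sigma, 3))    # should be near (1, 1.5)
```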

Generalizations of the EM Algorithm

The GEM Algorithm

Input: observed data, the Q function.
Output: model parameters.
(1) Initialize the parameters $\theta^{(0)}=(\theta^{(0)}_1,\theta^{(0)}_2,\cdots,\theta^{(0)}_d)$ and start iterating.
(2) Iteration $i+1$, step 1: let $\theta^{(i)}=(\theta^{(i)}_1,\theta^{(i)}_2,\cdots,\theta^{(i)}_d)$ be the estimate of the parameters $\theta=(\theta_1,\theta_2,\cdots,\theta_d)$, and compute
$$
Q(\theta,\theta^{(i)}) = E_Z\big[ \ln P(Y,Z|\theta)\mid Y,\theta^{(i)} \big] = \sum_Z P(Z|Y,\theta^{(i)}) \ln P(Y,Z|\theta)
$$
(3) Step 2: perform $d$ conditional maximizations.
First, holding $\theta^{(i)}_2,\theta^{(i)}_3,\cdots,\theta^{(i)}_d$ fixed, find the $\theta^{(i+1)}_1$ that maximizes $Q(\theta,\theta^{(i)})$.
Then, under the conditions $\theta_1=\theta^{(i+1)}_1$ and $\theta_j=\theta^{(i)}_j,\ j=3,4,\cdots,d$, find the $\theta^{(i+1)}_2$ that maximizes $Q(\theta,\theta^{(i)})$.
Continuing in this way, after $d$ conditional maximizations we obtain $\theta^{(i+1)}=(\theta^{(i+1)}_1,\theta^{(i+1)}_2,\cdots, \theta^{(i+1)}_d)$ such that $Q(\theta^{(i+1)},\theta^{(i)}) > Q(\theta^{(i)},\theta^{(i)})$.
(4) Repeat (2) and (3) until convergence. A small sketch of the coordinate-wise maximization idea follows.
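
A tiny sketch (my own illustration) of the conditional maximization in step (3), using a toy concave quadratic as a stand-in for $Q(\theta,\theta^{(i)})$: each coordinate is maximized in turn while the others stay fixed, and the objective never decreases.

```python
import numpy as np

# Toy stand-in for Q(theta, theta^(i)): a concave quadratic in theta = (t1, t2, t3)
A = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.5, 0.2],
              [0.1, 0.2, 1.0]])          # positive definite
b = np.array([1.0, -2.0, 0.5])

def Q(theta):
    return -0.5 * theta @ A @ theta + b @ theta

theta = np.zeros(3)                      # plays the role of theta^(i)
print("before:", Q(theta))
for k in range(3):                       # d = 3 conditional maximizations
    # maximize Q over theta_k with the other coordinates fixed:
    # dQ/dtheta_k = 0  =>  theta_k = (b_k - sum_{m != k} A_km theta_m) / A_kk
    theta[k] = (b[k] - A[k] @ theta + A[k, k] * theta[k]) / A[k, k]
    print(f"after maximizing coordinate {k + 1}:", Q(theta))
```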
