
PRML Chapter 10 Approximate Inference

10.1 Variational Inference

For observed variables $X=\{x_1,\dots,x_N\}$ and latent variables $Z=\{z_1,\dots,z_N\}$, our probabilistic model specifies the joint distribution $p(X, Z)$, and our goal is to find approximations for the posterior distribution $p(Z|X)$ as well as for the model evidence $p(X)$. As in our discussion of EM, we can decompose the log marginal probability as:

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \| p)$$

where

$$\mathcal{L}(q) = \int q(Z) \ln\left\{ \frac{p(X, Z)}{q(Z)} \right\} dZ$$

$$\mathrm{KL}(q \| p) = -\int q(Z)\ln \left\{\frac{p(Z | X)}{q(Z)} \right\} dZ$$
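As a quick sanity check of this decomposition, the sketch below evaluates $\mathcal{L}(q)$ and $\mathrm{KL}(q\|p)$ for a toy discrete latent variable; the numbers chosen for $p(z)$, $p(x|z)$ and $q(z)$ are arbitrary illustrative values, not anything from the text.

```python
import numpy as np

# Toy check of ln p(X) = L(q) + KL(q || p) for one observation and a binary latent z.
p_z = np.array([0.3, 0.7])           # prior p(z)
p_x_given_z = np.array([0.9, 0.2])   # likelihood of the observed x under each z
q_z = np.array([0.5, 0.5])           # any normalized variational distribution q(z)

p_xz = p_z * p_x_given_z             # joint p(x, z)
p_x = p_xz.sum()                     # evidence p(x)
posterior = p_xz / p_x               # exact posterior p(z | x)

L = np.sum(q_z * np.log(p_xz / q_z))            # lower bound L(q)
KL = -np.sum(q_z * np.log(posterior / q_z))     # KL(q || p)

print(np.log(p_x), L + KL)           # the two numbers agree
```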

10.1.1 Factorized distributions

Suppose we partition the elements of $Z$ into disjoint groups and assume that $q$ factorizes with respect to these groups:

$$q(Z) = \prod_{i=1}^{M}q_{i}(Z_{i})$$

Substituting this into the decomposition above and denoting $q_j(Z_j)$ simply by $q_j$, we obtain:

$$\mathcal{L}(q) = \int q_{j}\ln \tilde{p}(X, Z_{j})\, dZ_{j} - \int q_{j}\ln q_{j}\, dZ_{j} + \text{const}$$

where we have defined a new distribution $\tilde{p}(X, Z_{j})$ by the relation:

$$\ln \tilde{p}(X, Z_{j}) = E_{i\neq j}[\ln p(X, Z)] + \text{const}$$

$$E_{i\neq j}[\ln p(X, Z)] = \int \ln p(X,Z)\prod_{i\neq j}q_{i}\, dZ_{i}$$

Maximizing $\mathcal{L}(q)$ is therefore equivalent to minimizing the Kullback-Leibler divergence between $q_j$ and $\tilde{p}(X,Z_j)$, and the minimum occurs when $q_j(Z_j)=\tilde{p}(X,Z_j)$. Thus we obtain the general expression for the optimal solution $q_j^{*}(Z_j)$:

$$\ln q^{*}_{j}(Z_{j}) = E_{i\neq j}[\ln p(X, Z)] + \text{const}$$

10.1.2 Properties of factorized approximations

The two forms of the Kullback-Leibler divergence, $\mathrm{KL}(q\|p)$ and $\mathrm{KL}(p\|q)$, are members of the alpha family of divergences defined by:

$$D_\alpha (p\|q)=\frac{4}{1-\alpha^2}\left(1-\int p(x)^{(1+\alpha)/2}q(x)^{(1-\alpha)/2}dx\right)$$

10.1.3 Example: The univariate Gaussian
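A minimal numerical sketch of the mean-field updates $q(\mu,\tau)=q(\mu)q(\tau)$ for this example, assuming data $x_n \sim \mathcal{N}(\mu,\tau^{-1})$ with conjugate priors $p(\mu|\tau)=\mathcal{N}(\mu|\mu_0,(\lambda_0\tau)^{-1})$ and $p(\tau)=\mathrm{Gam}(\tau|a_0,b_0)$; the function name and the default hyperparameter values are illustrative choices.

```python
import numpy as np

def vb_univariate_gaussian(x, mu0=0.0, lambda0=1.0, a0=1e-2, b0=1e-2, n_iter=100):
    """Mean-field q(mu) q(tau) for x_n ~ N(mu, tau^-1) with priors
    p(mu | tau) = N(mu | mu0, (lambda0 tau)^-1) and p(tau) = Gam(tau | a0, b0)."""
    N = len(x)
    x_bar, x_sq_sum = x.mean(), np.sum(x ** 2)
    a_N = a0 + 0.5 * (N + 1)              # shape of q(tau); fixed across iterations
    E_tau = a0 / b0                       # initial guess for E[tau]
    for _ in range(n_iter):
        # q(mu) = N(mu | mu_N, lambda_N^-1)
        mu_N = (lambda0 * mu0 + N * x_bar) / (lambda0 + N)
        lambda_N = (lambda0 + N) * E_tau
        E_mu, E_mu_sq = mu_N, mu_N ** 2 + 1.0 / lambda_N
        # q(tau) = Gam(tau | a_N, b_N), using E_mu[ sum_n (x_n - mu)^2 + lambda0 (mu - mu0)^2 ]
        E_quad = (x_sq_sum - 2.0 * N * x_bar * E_mu + N * E_mu_sq
                  + lambda0 * (E_mu_sq - 2.0 * mu0 * E_mu + mu0 ** 2))
        b_N = b0 + 0.5 * E_quad
        E_tau = a_N / b_N
    return mu_N, lambda_N, a_N, b_N

# e.g. vb_univariate_gaussian(np.random.default_rng(0).normal(2.0, 0.5, size=200))
```

Each pass updates $q(\mu)$ using the current $E[\tau]$ and then updates $q(\tau)$ using the current moments of $\mu$, which is exactly the coordinate scheme implied by the general solution above.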

10.1.4 Model comparison

For model comparison we introduce a variational distribution over both the latent variables and a model index $m$, of the form $q(Z, m) = q(Z|m)q(m)$. We can readily verify the following decomposition based on this variational distribution:

$$\ln p(X)=\mathcal{L}-\sum_m\sum_Z q(Z|m)q(m)\ln\left(\frac{p(Z,m|X)}{q(Z|m)q(m)}\right)$$

where $\mathcal{L}$ is a lower bound on $\ln p(X)$ and is given by

$$\mathcal{L}=\sum_m\sum_Z q(Z|m)q(m)\ln\left(\frac{p(Z,X,m)}{q(Z|m)q(m)}\right)$$

Maximizing $\mathcal{L}$ with respect to the distribution $q(m)$, using a Lagrange multiplier to enforce normalization, gives:

$$q(m)\propto p(m)\exp(\mathcal{L}_m)$$

where $\mathcal{L}_m=\sum_Z q(Z|m)\ln\left(\frac{p(Z,X|m)}{q(Z|m)}\right)$ is the lower bound obtained for model $m$ individually.
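In practice this amounts to exponentiating and normalizing the per-model bounds; a small sketch (the function name and the max-shift used for numerical stability are my own choices):

```python
import numpy as np

def model_posterior(log_prior, lower_bound):
    """q(m) proportional to p(m) exp(L_m); log_prior holds ln p(m), lower_bound holds L_m."""
    log_q = np.asarray(log_prior) + np.asarray(lower_bound)
    log_q = log_q - log_q.max()       # shift before exponentiating, for numerical stability
    q = np.exp(log_q)
    return q / q.sum()

# e.g. model_posterior(np.log([0.5, 0.3, 0.2]), [-1402.3, -1391.8, -1395.0])
```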

10.2 Illustration: Variational Mixture of Gaussians

We can write the conditional distribution of $Z$, given the mixing coefficients $\pi$, in the form:

$$p(Z | \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_{k}^{z_{nk}}$$

Similarly, the conditional distribution of the observed data vectors, given the latent variables and the component parameters, is:

$$p(X | Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K}\mathcal{N}(x_{n} | \mu_{k}, \Lambda_{k}^{-1})^{z_{nk}}$$

We choose a Dirichlet distribution over the mixing coefficients $\pi$:

$$p(\pi) = \mathrm{Dir}(\pi | \alpha_{0}) = C(\alpha_{0})\prod_{k=1}^{K}\pi_{k}^{\alpha_{0} - 1}$$

and introduce an independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component, given by:

$$p(\mu, \Lambda) = p(\mu | \Lambda)p(\Lambda) = \prod_{k=1}^{K}\mathcal{N}(\mu_{k} | m_{0}, (\beta_{0}\Lambda_{k})^{-1})\,\mathcal{W}(\Lambda_{k} | W_{0}, \nu_{0})$$

10.2.1 Variational distribution

In order to formulate a variational treatment of this model, we write down the joint distribution of all of the random variables:

$$p(X, Z, \pi, \mu, \Lambda) = p(X | Z, \mu, \Lambda)\,p(Z | \pi)\,p(\pi)\,p(\mu | \Lambda)\,p(\Lambda)$$

Consider a variational distribution which factorizes between the latent variables and the parameters, so that:

$$q(Z, \pi, \mu, \Lambda) = q(Z)\,q(\pi, \mu, \Lambda)$$

Let us consider the derivation of the update equation for the factor $q(Z)$. The log of the optimized factor is given by:

$$\ln q^{*}(Z) = E_{\pi, \mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}$$

Absorbing terms that do not depend on $Z$ into the constant:

$$\ln q^{*}(Z) = E_\pi[\ln p(Z | \pi)] + E_{\mu, \Lambda}[\ln p(X | Z,\mu, \Lambda)] + \text{const}$$

Substituting for the two conditional distributions on the right-hand side and absorbing terms independent of $Z$ into the constant, we have:

$$\ln q^*(Z)=\sum_{n=1}^N\sum_{k=1}^K z_{nk}\ln\rho_{nk}+\text{const}$$

where we have defined:

$$\ln \rho_{nk} = E[\ln \pi_{k}] + \frac{1}{2}E[\ln|\Lambda_{k}|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}E_{\mu_{k}, \Lambda_{k}}[(x_{n} - \mu_{k})^{T}\Lambda_{k}(x_{n} - \mu_{k})]$$

Taking the exponential of both sides we obtain:

$$q^{*}(Z) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\rho_{nk}^{z_{nk}}$$

Normalizing this distribution we obtain:

$$q^{*}(Z) = \prod_{n=1}^{N}\prod_{k=1}^{K} r_{nk}^{z_{nk}}$$

where:

$$r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K}\rho_{nj}}$$

For the distribution $q^*(Z)$ we have the standard result:

$$E[z_{nk}] = r_{nk}$$
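Numerically the responsibilities are best computed in the log domain. The sketch below takes the required expectations as precomputed inputs (the argument names are mine) and normalizes with a max-shift, implementing the two equations for $\ln\rho_{nk}$ and $r_{nk}$ above:

```python
import numpy as np

def responsibilities(E_ln_pi, E_ln_det_Lambda, E_quad, D):
    """r_nk from ln rho_nk.

    E_ln_pi         : (K,)  values of E[ln pi_k]
    E_ln_det_Lambda : (K,)  values of E[ln |Lambda_k|]
    E_quad          : (N,K) values of E[(x_n - mu_k)^T Lambda_k (x_n - mu_k)]
    D               : data dimensionality
    """
    ln_rho = (E_ln_pi[None, :] + 0.5 * E_ln_det_Lambda[None, :]
              - 0.5 * D * np.log(2.0 * np.pi) - 0.5 * E_quad)
    ln_rho = ln_rho - ln_rho.max(axis=1, keepdims=True)   # log-sum-exp shift
    rho = np.exp(ln_rho)
    return rho / rho.sum(axis=1, keepdims=True)           # r_nk = rho_nk / sum_j rho_nj
```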

Let us now consider the factor $q(\pi, \mu, \Lambda)$ in the variational posterior distribution. We have:

$$\ln q^{*}(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K}\ln p(\mu_{k}, \Lambda_{k}) + E_{Z}[\ln p(Z | \pi)] + \sum_{k=1}^{K}\sum_{n=1}^{N}E[z_{nk}]\ln \mathcal{N}(x_{n} | \mu_{k}, \Lambda_{k}^{-1}) + \text{const}$$

The right-hand side decomposes into terms involving only $\pi$ and terms involving only $\mu_k$ and $\Lambda_k$, so the variational posterior factorizes as:

$$q(\pi,\mu,\Lambda)=q(\pi)\prod_{k=1}^{K}q(\mu_k,\Lambda_k)$$

The results are given by:

$$q^{*}(\pi) = \mathrm{Dir}(\pi | \alpha)$$

where $\alpha$ has components $\alpha_k$ given by:

$$\alpha_{k} = \alpha_{0} + \sum_{n=1}^{N}r_{nk}$$

Using $q^{*}(\mu_{k}, \Lambda_{k}) = q^{*}(\mu_{k} | \Lambda_{k})\,q^{*}(\Lambda_{k})$, the posterior distribution is a Gaussian-Wishart distribution given by:

$$q^{*}(\mu_{k}, \Lambda_{k}) = \mathcal{N}(\mu_{k} | m_{k}, (\beta_{k}\Lambda_{k})^{-1})\,\mathcal{W}(\Lambda_{k} | W_{k}, \nu_{k})$$

where, defining the responsibility-weighted statistics $N_k=\sum_{n=1}^{N} r_{nk}$, $\hat{x}_k=\frac{1}{N_k}\sum_{n=1}^{N} r_{nk}x_n$ and $S_k=\frac{1}{N_k}\sum_{n=1}^{N} r_{nk}(x_n-\hat{x}_k)(x_n-\hat{x}_k)^{T}$, we have

$$\beta_{k} = \beta_{0} + N_{k}$$

$$m_{k} = \frac{1}{\beta_{k}}(\beta_{0}m_{0} + N_{k}\hat{x}_{k})$$

$$W_{k}^{-1} = W_{0}^{-1} + N_{k}S_{k} + \frac{\beta_{0}N_{k}}{\beta_{0} + N_{k}}(\hat{x}_{k} - m_{0})(\hat{x}_{k} - m_{0})^{T}$$

$$\nu_{k} = \nu_{0} + N_{k}$$
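Putting the $q(Z)$ update together with these equations gives one full variational iteration for the mixture. A sketch of the $q(\pi,\mu,\Lambda)$ update given the responsibilities, using the definitions of $N_k$, $\hat{x}_k$ and $S_k$ above (the function signature is my own):

```python
import numpy as np

def update_parameters(X, r, alpha0, beta0, m0, W0_inv, nu0):
    """q(pi, mu, Lambda) update from responsibilities r (N x K) and data X (N x D)."""
    N_k = r.sum(axis=0)                                   # N_k = sum_n r_nk
    x_hat = (r.T @ X) / N_k[:, None]                      # responsibility-weighted means
    alpha = alpha0 + N_k                                  # Dirichlet parameters
    beta = beta0 + N_k
    m = (beta0 * m0 + N_k[:, None] * x_hat) / beta[:, None]
    nu = nu0 + N_k
    W_inv = []
    for k in range(len(N_k)):
        diff = X - x_hat[k]                               # (N, D)
        S_k = (r[:, k, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / N_k[k]
        d = (x_hat[k] - m0)[:, None]
        W_inv.append(W0_inv + N_k[k] * S_k
                     + (beta0 * N_k[k] / (beta0 + N_k[k])) * (d @ d.T))
    return alpha, beta, m, np.stack(W_inv), nu
```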

10.2.2 Variational lower bound

For the variational mixture of Gaussians, the lower bound is given by:

$$\mathcal{L} = \sum_{Z}\iiint q(Z, \pi, \mu, \Lambda)\ln\left\{\frac{p(X, Z, \pi, \mu, \Lambda)}{q(Z, \pi, \mu, \Lambda)}\right\} d\pi\, d\mu\, d\Lambda$$

10.2.3 Predictive density

For a new value $\hat{x}$ of the observed variable, with corresponding latent variable $\hat{z}$, the predictive density is given by:

$$p(\hat{x} | X) = \sum_{\hat{z}}\iiint p(\hat{x} | \hat{z},\mu, \Lambda)\,p(\hat{z} | \pi)\,p(\pi, \mu, \Lambda | X)\, d\pi\, d\mu\, d\Lambda$$

10.2.4 Determining the number of components

10.2.5 Induced factorizations

10.3 Variational Linear Regression

The joint distribution of all the variables is given by:

$$p(\mathbf{t},w,\alpha)=p(\mathbf{t}|w)\,p(w|\alpha)\,p(\alpha)$$

where

$$p(\mathbf{t}|w)=\prod_{n=1}^N \mathcal{N}(t_n|w^T\phi_n,\beta^{-1})$$

$$p(w|\alpha)=\mathcal{N}(w|0,\alpha^{-1}I)$$

$$p(\alpha)=\mathrm{Gam}(\alpha|a_0,b_0)$$

10.3.1 Variational distribution

Our first goal is to find an approximation to the posterior distribution $p(w,\alpha|\mathbf{t})$. The variational posterior distribution is given by the factorized expression:

$$q(w,\alpha)=q(w)\,q(\alpha)$$

Applying the general result for the optimal factors, we find:

$$q^*(\alpha)=\mathrm{Gam}(\alpha|a_N,b_N)$$

where

$$a_N=a_0+\frac{M}{2}$$

$$b_N=b_0+\frac{1}{2}E[w^Tw]$$

and the distribution $q^*(w)$ is Gaussian:

$$q^*(w)=\mathcal{N}(w|m_N,S_N)$$

where

$$m_N=\beta S_N\Phi^T \mathbf{t}$$

$$S_N=(E[\alpha]I+\beta\Phi^T\Phi)^{-1}$$
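These updates are coupled through $E[\alpha]=a_N/b_N$ and $E[w^Tw]=m_N^T m_N+\mathrm{Tr}(S_N)$ (a standard Gaussian moment), so in practice they are iterated to convergence. A minimal sketch, assuming the noise precision $\beta$ is known and using illustrative defaults for $a_0$, $b_0$:

```python
import numpy as np

def vb_linear_regression(Phi, t, beta, a0=1e-2, b0=1e-2, n_iter=50):
    """Iterate the coupled updates for q(w) = N(m_N, S_N) and q(alpha) = Gam(a_N, b_N)."""
    N, M = Phi.shape
    a_N = a0 + 0.5 * M                        # fixed throughout the iterations
    b_N = b0                                  # starting value (illustrative)
    for _ in range(n_iter):
        E_alpha = a_N / b_N                   # E[alpha] under Gam(a_N, b_N)
        S_N = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        E_wTw = m_N @ m_N + np.trace(S_N)     # E[w^T w] = m_N^T m_N + Tr(S_N)
        b_N = b0 + 0.5 * E_wTw
    return m_N, S_N, a_N, b_N
```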

10.3.2 Predictive distribution

The predictive distribution over $t$, given a new input $x$, is evaluated using the Gaussian variational posterior:

$$p(t|x,\mathbf{t})\simeq \mathcal{N}(t|m^T_N\phi(x),\sigma^2(x))$$

where the input-dependent variance is given by

$$\sigma^2(x)=\frac{1}{\beta}+\phi(x)^TS_N\phi(x)$$
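Given the fitted $m_N$ and $S_N$ from the sketch above, the predictive mean and variance at a new feature vector follow directly:

```python
def predict(phi_x, m_N, S_N, beta):
    """Predictive mean and variance at a new input with feature vector phi_x."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```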

10.3.3 Lower bound

10.4 Exponential Family Distributions

Suppose the joint distribution of observed and latent variables is a member of the exponential family, parameterized by natural parameters $\eta$, so that:

$$p(X,Z|\eta)=\prod_{n=1}^N h(x_n,z_n)\,g(\eta)\exp\{\eta^Tu(x_n,z_n)\}$$

We shall also use a conjugate prior for $\eta$, which can be written as:

$$p(\eta|v_0)=f(v_0,x_0)\,g(\eta)^{v_0}\exp\{v_0\eta^Tx_0\}$$

Now consider a variational distribution that factorizes between the latent variables and the parameters, so that $q(Z,\eta)=q(Z)q(\eta)$. The optimal factors are then given by:

$$q^*(z_n)=h(x_n,z_n)\,g(E[\eta])\exp\{E[\eta^T]u(x_n,z_n)\}$$

$$q^*(\eta)=f(v_N,x_N)\,g(\eta)^{v_N}\exp\{v_N\eta^Tx_N\}$$

where

$$v_N=v_0+N$$

$$v_Nx_N=v_0x_0+\sum_{n=1}^N E_{z_n}[u(x_n,z_n)]$$

10.4.1 Variational message passing

The joint distribution corresponding to a directed graph can be written using the decomposition:

$$p(x)=\prod_i p(x_i|\mathrm{pa}_i)$$

Now consider a variational approximation in which the distribution $q(x)$ is assumed to factorize with respect to the $x_i$, so that:

$$q(x)=\prod_i q_i(x_i)$$

Substituting this factorization into the general result for the optimal factors, we obtain:

$$\ln q_j^*(x_j)=E_{i\neq j}\left[\sum_i \ln p(x_i|\mathrm{pa}_i)\right]+\text{const}$$

Any factor $p(x_i|\mathrm{pa}_i)$ that does not depend on $x_j$ can be absorbed into the constant, so the right-hand side involves only the conditional for $x_j$ itself together with the conditionals of its children, i.e. the Markov blanket of node $x_j$; the update for $q_j$ is therefore a local computation on the graph.

10.5 Local Variational Methods

For a convex function $f(x)$, we can obtain lower bounds in terms of the conjugate function, since $f(x) \geq \eta x - g(\eta)$ for any $\eta$, where:

$$g(\eta) = - \min_{x}\{ f(x) - \eta x \} = \max_{x}\{ \eta x - f(x) \}$$

$$f(x) = \max_{\eta}\{ \eta x - g(\eta) \}$$

Similarly, for concave functions we obtain upper bounds, $f(x) \leq \eta x - g(\eta)$, using:

$$f(x) = \min_{\eta}\{ \eta x- g(\eta) \}$$

$$g(\eta) = \min_{x}\{ \eta x - f(x) \}$$

If the function of interest is neither convex nor concave, we can first apply an invertible transformation so that the transformed function is convex or concave, obtain the bound there, and then transform back. An example is the logistic sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

For the logistic sigmoid, the logarithm $\ln\sigma(x)$ is concave, which yields the upper bound below, while the term $-\ln(e^{x/2}+e^{-x/2})$ appearing in $\ln\sigma(x)$ is convex as a function of $x^2$, which yields the lower bound:

$$\sigma(x) \leq \exp(\eta x - g(\eta))$$

$$\sigma(x) \geq \sigma(\xi)\exp\left\{ \frac{x-\xi}{2} - \lambda(\xi)(x^{2} - \xi^{2}) \right\}$$

where:

$$\lambda(\xi) = \frac{1}{2\xi}\left[ \sigma(\xi) - \frac{1}{2} \right]$$

To see how these bounds can be used, suppose we wish to evaluate an integral of the form:

$$I = \int \sigma(a)p(a)\, da$$

Employing the variational lower bound $\sigma(a) \geq f(a,\xi)$, where $f(a,\xi)$ denotes the right-hand side of the sigmoid lower bound above, we obtain:

$$I \geq \int f(a, \xi)p(a)\, da = F(\xi)$$
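A quick numerical check of both the pointwise bound and the induced bound on the integral, taking $p(a)$ to be a Gaussian evaluated on a grid (the grid, the Gaussian parameters and the value of $\xi$ are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (1 / (2 xi)) * (sigma(xi) - 1/2)
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def f_lower(a, xi):
    # f(a, xi) = sigma(xi) exp{ (a - xi)/2 - lambda(xi) (a^2 - xi^2) }
    return sigmoid(xi) * np.exp(0.5 * (a - xi) - lam(xi) * (a ** 2 - xi ** 2))

a = np.linspace(-15.0, 15.0, 20001)                          # quadrature grid
da = a[1] - a[0]
p_a = np.exp(-0.5 * (a - 1.0) ** 2) / np.sqrt(2.0 * np.pi)   # a Gaussian p(a)

xi = 1.5
I = np.sum(sigmoid(a) * p_a) * da        # the integral I
F = np.sum(f_lower(a, xi) * p_a) * da    # its lower bound F(xi)

# the pointwise bound holds everywhere, hence F(xi) <= I
print(np.all(f_lower(a, xi) <= sigmoid(a) + 1e-12), F <= I)
```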

10.6 Variational Logistic Regression

10.6.1 Variational posterior distribution

In the variational framework, we seek to maximize a lower bound on the marginal likelihood. For the Bayesian logistic regression model, the marginal likelihood takes the form:

$$p(\mathbf{t}) = \int p(\mathbf{t} | w)p(w)\,dw = \int \left[\prod_{n=1}^{N}p(t_{n} | w) \right]p(w)\, dw$$

The conditional distribution for $t$ can be written as

$$p(t | w) = e^{at}\sigma(-a)$$

where $a = w^{T}\phi$. Using the variational lower bound on the logistic sigmoid function, we obtain:

$$p(t | w) = e^{at}\sigma(-a)\geq e^{at}\sigma(\xi)\exp\left\{ -\frac{a + \xi}{2} - \lambda(\xi)(a^{2} - \xi^{2}) \right\}$$

Applying this bound to each factor, with $a=w^T\phi_n$ and a variational parameter $\xi_n$ for each observation, and multiplying by the prior distribution, we obtain a bound on the joint distribution of $\mathbf{t}$ and $w$:

$$p(\mathbf{t}, w) = p(\mathbf{t} | w)p(w) \geq h(w, \xi)p(w)$$

where

$$h(w, \xi) = \prod_{n=1}^{N}\sigma(\xi_{n})\exp\left\{w^{T}\phi_{n}t_{n} - (w^{T}\phi_{n} + \xi_{n})/2 - \lambda(\xi_{n})\left([w^{T}\phi_{n}]^{2} - \xi^{2}_{n}\right) \right\}$$

The Gaussian variational posterior then has the form:

$$q(w) = \mathcal{N}(w | m_{N}, S_{N})$$

where

$$m_{N} = S_{N}\left( S_{0}^{-1}m_{0} + \sum_{n=1}^{N}\left(t_{n} - \frac{1}{2}\right) \phi_{n} \right)$$

$$S_{N}^{-1} = S_{0}^{-1} + 2\sum_{n=1}^{N}\lambda(\xi_{n})\phi_{n}\phi_{n}^{T}$$

10.6.2 Optimizing the variational parameters

Substituting the inequality above back into the marginal likelihood gives:

$$\ln p(\mathbf{t}) = \ln\int p(\mathbf{t} | w)p(w)\,dw \geq \ln \int h(w, \xi)p(w)\, dw = \mathcal{L}(\xi)$$

There are two approaches to determining the $\xi_n$: we can treat them as parameters optimized by an EM algorithm, in which the E step computes $q(w)$ and the M step maximizes the bound with respect to each $\xi_n$; alternatively, because the integral over $w$ is Gaussian once the quadratic bound is used, we can perform it analytically and then maximize $\mathcal{L}(\xi)$ directly with respect to the $\xi_n$. Both approaches lead to the same re-estimation equations.
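A minimal sketch of the resulting iterative scheme, alternating the $q(w)$ update above with the standard $\xi$ re-estimation $\xi_n^2=\phi_n^T(S_N+m_N m_N^T)\phi_n$ (quoted here without derivation; the function signature and the initialization of $\xi$ are my own choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def vb_logistic_regression(Phi, t, m0, S0, n_iter=25):
    """Alternate the q(w) update with re-estimation of the variational parameters xi_n."""
    N, M = Phi.shape
    S0_inv = np.linalg.inv(S0)
    xi = np.ones(N)                                        # arbitrary initialization
    for _ in range(n_iter):
        # q(w) = N(w | m_N, S_N) for the current xi (equations above)
        S_N_inv = S0_inv + 2.0 * (lam(xi)[:, None] * Phi).T @ Phi
        S_N = np.linalg.inv(S_N_inv)
        m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
        # re-estimate xi_n^2 = phi_n^T (S_N + m_N m_N^T) phi_n
        A = S_N + np.outer(m_N, m_N)
        xi = np.sqrt(np.einsum('ni,ij,nj->n', Phi, A, Phi))
    return m_N, S_N, xi
```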

10.6.3 Inference of hyperparameters

10.7 Expectation Propagation
