Consider observed variables $\mathbf{X} = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}$ and latent variables $\mathbf{Z} = \{\mathbf{z}_1,\dots,\mathbf{z}_N\}$. Our probabilistic model specifies the joint distribution $p(\mathbf{X},\mathbf{Z})$, and our goal is to find approximations for the posterior distribution $p(\mathbf{Z}\mid\mathbf{X})$ and for the model evidence $p(\mathbf{X})$. As in our discussion of EM, we can decompose the log marginal probability as:
$$\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)$$
where
$$\mathcal{L}(q) = \int q(\mathbf{Z})\ln\left\{\frac{p(\mathbf{X},\mathbf{Z})}{q(\mathbf{Z})}\right\}d\mathbf{Z}$$
$$\mathrm{KL}(q\,\|\,p) = -\int q(\mathbf{Z})\ln\left\{\frac{p(\mathbf{Z}\mid\mathbf{X})}{q(\mathbf{Z})}\right\}d\mathbf{Z}$$
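To see where this decomposition comes from, substitute $p(\mathbf{X},\mathbf{Z}) = p(\mathbf{Z}\mid\mathbf{X})\,p(\mathbf{X})$ into $\mathcal{L}(q)$ and use the fact that $q(\mathbf{Z})$ integrates to one:
$$\mathcal{L}(q) = \int q(\mathbf{Z})\ln\left\{\frac{p(\mathbf{Z}\mid\mathbf{X})\,p(\mathbf{X})}{q(\mathbf{Z})}\right\}d\mathbf{Z} = \ln p(\mathbf{X}) + \int q(\mathbf{Z})\ln\left\{\frac{p(\mathbf{Z}\mid\mathbf{X})}{q(\mathbf{Z})}\right\}d\mathbf{Z} = \ln p(\mathbf{X}) - \mathrm{KL}(q\,\|\,p)$$
Since $\mathrm{KL}(q\,\|\,p) \ge 0$, the functional $\mathcal{L}(q)$ is a lower bound on $\ln p(\mathbf{X})$, and maximizing it with respect to $q$ is equivalent to minimizing the KL divergence to the true posterior.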
10.1.1 Factorized distributions
Suppose we partition the elements of $\mathbf{Z}$ into disjoint groups $\mathbf{Z}_i$, with $i = 1,\dots,M$, and assume that $q$ factorizes with respect to these groups:
$$q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i)$$
Substituting this into the expression for $\mathcal{L}(q)$ above, and denoting $q_j(\mathbf{Z}_j)$ simply by $q_j$, we obtain:
$$\mathcal{L}(q) = \int q_j \ln \tilde{p}(\mathbf{X},\mathbf{Z}_j)\,d\mathbf{Z}_j - \int q_j \ln q_j \,d\mathbf{Z}_j + \text{const}$$
where we have defined a new distribution $\tilde{p}(\mathbf{X},\mathbf{Z}_j)$ by the relation:
$$\ln \tilde{p}(\mathbf{X},\mathbf{Z}_j) = \mathbb{E}_{i \neq j}[\ln p(\mathbf{X},\mathbf{Z})] + \text{const}$$
$$\mathbb{E}_{i \neq j}[\ln p(\mathbf{X},\mathbf{Z})] = \int \ln p(\mathbf{X},\mathbf{Z}) \prod_{i \neq j} q_i \,d\mathbf{Z}_i$$
Maximizing $\mathcal{L}(q)$ with respect to $q_j$, keeping the other factors fixed, is equivalent to minimizing the Kullback-Leibler divergence between $q_j(\mathbf{Z}_j)$ and $\tilde{p}(\mathbf{X},\mathbf{Z}_j)$, and the minimum occurs when $q_j(\mathbf{Z}_j) = \tilde{p}(\mathbf{X},\mathbf{Z}_j)$. Thus we obtain the general expression for the optimal solution $q_j^\ast(\mathbf{Z}_j)$ given by:
$$\ln q_j^\ast(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}[\ln p(\mathbf{X},\mathbf{Z})] + \text{const}$$
10.1.2 Properties of factorized approximations
The two forms of Kullback-Leibler divergence, $\mathrm{KL}(q\,\|\,p)$ and $\mathrm{KL}(p\,\|\,q)$, are both members of the alpha family of divergences defined by:
$$D_\alpha(p\,\|\,q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2}\,q(x)^{(1-\alpha)/2}\,dx\right)$$
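As a quick numerical illustration (a minimal sketch with made-up discrete distributions, not part of the original notes), the two forms of KL divergence are recovered in the limits $\alpha \to 1$ and $\alpha \to -1$:

```python
import numpy as np

# Two arbitrary discrete distributions, chosen only for illustration
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

def alpha_divergence(p, q, alpha):
    # D_alpha(p||q) = 4/(1 - alpha^2) * (1 - sum_x p^{(1+alpha)/2} q^{(1-alpha)/2})
    return 4.0 / (1.0 - alpha**2) * (1.0 - np.sum(p**((1 + alpha) / 2) * q**((1 - alpha) / 2)))

def kl(p, q):
    return np.sum(p * np.log(p / q))

print(alpha_divergence(p, q, 0.9999), kl(p, q))    # alpha -> +1 recovers KL(p||q)
print(alpha_divergence(p, q, -0.9999), kl(q, p))   # alpha -> -1 recovers KL(q||p)
```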
10.1.3 Example: The univariate Gaussian
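The notes do not work through this example, so what follows is only a sketch: a factorized approximation $q(\mu,\tau) = q(\mu)\,q(\tau)$ for a Gaussian with unknown mean $\mu$ and precision $\tau$, assuming the conjugate priors $p(\mu\mid\tau) = \mathcal{N}(\mu\mid\mu_0,(\lambda_0\tau)^{-1})$ and $p(\tau) = \mathrm{Gam}(\tau\mid a_0,b_0)$. Applying the general result above gives a Gaussian update for $q(\mu)$ and a Gamma update for $q(\tau)$; since the two updates are coupled, they are iterated until convergence. The hyperparameter values and synthetic data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100)   # synthetic data
N, xbar = len(x), x.mean()

# Prior hyperparameters (assumed values)
mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3

E_tau = a0 / b0           # initial guess for E[tau]
aN = a0 + (N + 1) / 2     # this part of the q(tau) update does not change across iterations

for _ in range(100):
    # q(mu) = N(mu | muN, 1/lamN)
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    E_mu, var_mu = muN, 1.0 / lamN

    # q(tau) = Gam(tau | aN, bN), using E[(x_n - mu)^2] = (x_n - E[mu])^2 + Var[mu]
    bN = b0 + 0.5 * (np.sum((x - E_mu) ** 2) + N * var_mu
                     + lam0 * ((E_mu - mu0) ** 2 + var_mu))
    E_tau = aN / bN

print(muN, E_tau)   # approximate posterior mean of mu and expected precision
```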
10.1.4 Model comparison
If we wish to compare a set of candidate models, we can introduce a variational distribution over the latent variables and the model index jointly; a corresponding decomposition of $\ln p(\mathbf{X})$ based on this variational distribution can readily be verified.
10.3 Variational Linear Regression
For the Bayesian linear regression model (with the noise precision $\beta$ treated as a known constant), the joint distribution of all of the variables is given by:
$$p(\mathbf{t},\mathbf{w},\alpha) = p(\mathbf{t}\mid\mathbf{w})\,p(\mathbf{w}\mid\alpha)\,p(\alpha)$$
where
$$p(\mathbf{t}\mid\mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}\!\left(t_n\mid \mathbf{w}^{\mathrm T}\boldsymbol{\phi}_n,\ \beta^{-1}\right)$$
$$p(\mathbf{w}\mid\alpha) = \mathcal{N}\!\left(\mathbf{w}\mid \mathbf{0},\ \alpha^{-1}\mathbf{I}\right)$$
$$p(\alpha) = \mathrm{Gam}(\alpha\mid a_0, b_0)$$
10.3.1 Variational distribution
Our first goal is to find an approximation to the posterior distribution $p(\mathbf{w},\alpha\mid\mathbf{t})$. We assume a variational posterior of the factorized form:
$$q(\mathbf{w},\alpha) = q(\mathbf{w})\,q(\alpha)$$
Applying the general result for the optimal factors, we find that $q^\ast(\alpha)$ is a Gamma distribution:
$$q^\ast(\alpha) = \mathrm{Gam}(\alpha\mid a_N, b_N)$$
where
$$a_N = a_0 + \frac{M}{2}$$
$$b_N = b_0 + \frac{1}{2}\,\mathbb{E}\!\left[\mathbf{w}^{\mathrm T}\mathbf{w}\right]$$
and the distribution q∗(w) is Gaussian:
$$q^\ast(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid \mathbf{m}_N,\, \mathbf{S}_N)$$
where
$$\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm T}\mathbf{t}$$
$$\mathbf{S}_N = \left(\mathbb{E}[\alpha]\,\mathbf{I} + \beta\,\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\right)^{-1}$$
Here $\boldsymbol{\Phi}$ denotes the $N \times M$ design matrix whose $n$th row is $\boldsymbol{\phi}_n^{\mathrm T}$.
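Because the two factors are coupled ($q^\ast(\mathbf{w})$ depends on $\mathbb{E}[\alpha] = a_N/b_N$, while $b_N$ depends on $\mathbb{E}[\mathbf{w}^{\mathrm T}\mathbf{w}] = \mathbf{m}_N^{\mathrm T}\mathbf{m}_N + \mathrm{Tr}(\mathbf{S}_N)$), the updates are iterated to convergence. Below is a minimal numerical sketch of this loop, with synthetic data, a known noise precision $\beta$, and polynomial basis functions chosen purely for illustration; the last few lines also evaluate the predictive distribution discussed in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data (for illustration only); beta is treated as known
N, beta = 50, 25.0
x = rng.uniform(-1, 1, size=N)
t = np.sin(np.pi * x) + rng.normal(scale=beta ** -0.5, size=N)

# Design matrix: M polynomial basis functions, Phi[n, :] = phi(x_n)^T
M = 6
Phi = np.vander(x, M, increasing=True)

a0, b0 = 1e-3, 1e-3        # broad Gamma prior on alpha
aN = a0 + M / 2            # fixed by the update for q(alpha)
E_alpha = a0 / b0

for _ in range(100):
    # q(w) = N(w | mN, SN)
    SN = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t

    # q(alpha) = Gam(alpha | aN, bN), using E[w^T w] = mN^T mN + Tr(SN)
    bN = b0 + 0.5 * (mN @ mN + np.trace(SN))
    E_alpha = aN / bN

# Predictive mean and variance at a new input x*
x_star = 0.3
phi_star = np.vander(np.array([x_star]), M, increasing=True)[0]
pred_mean = mN @ phi_star
pred_var = 1.0 / beta + phi_star @ SN @ phi_star
print(pred_mean, pred_var)
```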
10.3.2 Predictive distribution
The predictive distribution over $t$, given a new input $\mathbf{x}$, is evaluated by marginalizing over the Gaussian variational posterior for $\mathbf{w}$:
$$p(t\mid\mathbf{x},\mathbf{t}) \simeq \int p(t\mid\mathbf{x},\mathbf{w})\,q^\ast(\mathbf{w})\,d\mathbf{w} = \mathcal{N}\!\left(t\mid \mathbf{m}_N^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}),\ \sigma^2(\mathbf{x})\right)$$
where the input-dependent variance is given by
$$\sigma^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm T}\mathbf{S}_N\,\boldsymbol{\phi}(\mathbf{x})$$
10.3.3 Lower bound
10.4 Exponential Family Distributions
Suppose the joint distribution of observed and latent variables is a member of the exponential family, parameterized by natural parameters η so that:
$$p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\eta}) = \prod_{n=1}^{N} h(\mathbf{x}_n,\mathbf{z}_n)\,g(\boldsymbol{\eta})\exp\left\{\boldsymbol{\eta}^{\mathrm T}\mathbf{u}(\mathbf{x}_n,\mathbf{z}_n)\right\}$$
We shall also use a conjugate prior for $\boldsymbol{\eta}$, which can be written as:
$$p(\boldsymbol{\eta}\mid\nu_0) = f(\nu_0,\boldsymbol{\chi}_0)\,g(\boldsymbol{\eta})^{\nu_0}\exp\left\{\nu_0\,\boldsymbol{\eta}^{\mathrm T}\boldsymbol{\chi}_0\right\}$$
Now consider a variational distribution that factorizes between the latent variables and the parameters, so that $q(\mathbf{Z},\boldsymbol{\eta}) = q(\mathbf{Z})\,q(\boldsymbol{\eta})$. Applying the general result for the optimal factors gives:
$$q^\ast(\mathbf{z}_n) = h(\mathbf{x}_n,\mathbf{z}_n)\,g\!\left(\mathbb{E}[\boldsymbol{\eta}]\right)\exp\left\{\mathbb{E}[\boldsymbol{\eta}]^{\mathrm T}\mathbf{u}(\mathbf{x}_n,\mathbf{z}_n)\right\}$$
$$q^\ast(\boldsymbol{\eta}) = f(\nu_N,\boldsymbol{\chi}_N)\,g(\boldsymbol{\eta})^{\nu_N}\exp\left\{\boldsymbol{\eta}^{\mathrm T}\boldsymbol{\chi}_N\right\}$$
where
$$\nu_N = \nu_0 + N$$
$$\boldsymbol{\chi}_N = \boldsymbol{\chi}_0 + \sum_{n=1}^{N}\mathbb{E}_{\mathbf{z}_n}\!\left[\mathbf{u}(\mathbf{x}_n,\mathbf{z}_n)\right]$$
As before, the two factors are coupled, since $q^\ast(\mathbf{z}_n)$ depends on $\mathbb{E}[\boldsymbol{\eta}]$ while $\boldsymbol{\chi}_N$ depends on the expected sufficient statistics under $q^\ast(\mathbf{z}_n)$, so the updates are solved by cycling between them.
10.4.1 Variational message passing
The joint distribution corresponding to a directed graph can be written using the decomposition:
$$p(\mathbf{x}) = \prod_{i} p(\mathbf{x}_i\mid \mathrm{pa}_i)$$
Now consider a variational approximation in which the distribution q(x) is assumed to factorize with respect to the xi so that:
$$q(\mathbf{x}) = \prod_{i} q_i(\mathbf{x}_i)$$
Substituting this into the general result for the optimal factors, we obtain:
$$\ln q_j^\ast(\mathbf{x}_j) = \mathbb{E}_{i \neq j}\!\left[\sum_{i}\ln p(\mathbf{x}_i\mid \mathrm{pa}_i)\right] + \text{const}$$
Any factor on the right-hand side that does not involve $\mathbf{x}_j$ can be absorbed into the constant, so the update for $q_j$ depends only on the factor $p(\mathbf{x}_j\mid\mathrm{pa}_j)$ together with the factors for the children of node $j$; in other words, it depends only on the nodes in the Markov blanket of $\mathbf{x}_j$.
10.5 Local Variational Methods
For convex functions, we can obtain linear lower bounds $f(x) \ge \eta x - g(\eta)$, valid for any $\eta$, by introducing the conjugate (dual) function:
$$g(\eta) = -\min_x\{f(x) - \eta x\} = \max_x\{\eta x - f(x)\}$$
$$f(x) = \max_\eta\{\eta x - g(\eta)\}$$
and for concave functions, we analogously obtain upper bounds $f(x) \le \eta x - g(\eta)$:
$$f(x) = \min_\eta\{\eta x - g(\eta)\}$$
$$g(\eta) = \min_x\{\eta x - f(x)\}$$
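As a simple worked example (not part of the original notes), take the convex function $f(x) = x^2$. Then
$$g(\eta) = \max_x\{\eta x - x^2\} = \frac{\eta^2}{4}, \qquad f(x) = \max_\eta\left\{\eta x - \frac{\eta^2}{4}\right\} = x^2,$$
and for any fixed $\eta$ the line $\eta x - \eta^2/4$ is a lower bound on $x^2$ that touches it at $x = \eta/2$.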
If the function of interest is neither convex nor concave, we first look for an invertible transformation (such as taking the logarithm) that makes it so. An example is the logistic sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
For the logistic sigmoid this yields an upper bound and a lower bound:
$$\sigma(x) \le \exp(\eta x - g(\eta))$$
$$\sigma(x) \ge \sigma(\xi)\exp\left\{\frac{x-\xi}{2} - \lambda(\xi)\left(x^2 - \xi^2\right)\right\}$$
where:
$$\lambda(\xi) = \frac{1}{2\xi}\left[\sigma(\xi) - \frac{1}{2}\right]$$
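A quick numerical check of both bounds (a sketch, not part of the original notes). For the upper bound, the conjugate function follows from the general definition above and works out to the binary entropy $g(\eta) = -\eta\ln\eta - (1-\eta)\ln(1-\eta)$ for $0 < \eta < 1$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upper_bound(x, eta):
    # sigma(x) <= exp(eta*x - g(eta)), with g(eta) the binary entropy
    g = -eta * np.log(eta) - (1 - eta) * np.log(1 - eta)
    return np.exp(eta * x - g)

def lower_bound(x, xi):
    # sigma(x) >= sigma(xi) * exp{(x - xi)/2 - lambda(xi) * (x^2 - xi^2)}
    lam = (sigmoid(xi) - 0.5) / (2 * xi)
    return sigmoid(xi) * np.exp((x - xi) / 2 - lam * (x**2 - xi**2))

x = np.linspace(-6, 6, 101)
print(np.all(sigmoid(x) <= upper_bound(x, eta=0.3)))   # True for any eta in (0, 1)
print(np.all(sigmoid(x) >= lower_bound(x, xi=1.7)))    # True for any xi
print(sigmoid(1.7), lower_bound(1.7, xi=1.7))          # equal: the lower bound is tight at x = +-xi
```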
To see how these bounds can be used, suppose we wish to evaluate an integral of the form:
$$I = \int \sigma(a)\,p(a)\,da$$
Replacing $\sigma(a)$ by its variational lower bound $f(a,\xi) = \sigma(\xi)\exp\{(a-\xi)/2 - \lambda(\xi)(a^2-\xi^2)\}$, we obtain
$$I \ge \int f(a,\xi)\,p(a)\,da = F(\xi)$$
If $p(a)$ is Gaussian, the integrand is the exponential of a quadratic function of $a$, so $F(\xi)$ can be evaluated analytically; the bound is then tightened by maximizing $F(\xi)$ with respect to $\xi$.
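As a concrete illustration (a sketch with a made-up Gaussian $p(a) = \mathcal{N}(a\mid m, s^2)$, not part of the original notes): completing the square in the exponent gives a closed form for $F(\xi)$, which the snippet below compares with a numerical evaluation of $I$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2 * xi)

m, s = 0.0, 1.0   # parameters of p(a) = N(a | m, s^2), chosen for illustration

def F(xi):
    # Closed form of the integral of f(a, xi) * N(a | m, s^2) over a,
    # obtained by completing the square in the exponent
    A = lam(xi) + 1.0 / (2 * s**2)
    B = 0.5 + m / s**2
    return (sigmoid(xi) * np.exp(-xi / 2 + lam(xi) * xi**2)
            / np.sqrt(2 * np.pi * s**2)
            * np.sqrt(np.pi / A)
            * np.exp(B**2 / (4 * A) - m**2 / (2 * s**2)))

# Numerical value of I = integral of sigma(a) * p(a) da on a fine grid
a = np.linspace(m - 10 * s, m + 10 * s, 20001)
p = np.exp(-(a - m)**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
I = np.sum(sigmoid(a) * p) * (a[1] - a[0])

for xi in [0.5, 1.0, 2.0]:
    print(I, F(xi))   # F(xi) <= I for every xi; the bound is tightened by optimizing xi
```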
10.6 Variational Logistic Regression
10.6.1 Variational posterior distribution
In the variational framework, we seek to maximize a lower bound on the marginal likelihood. For the Bayesian logistic regression model, the marginal likelihood takes the form:
$$p(\mathbf{t}) = \int p(\mathbf{t}\mid\mathbf{w})\,p(\mathbf{w})\,d\mathbf{w} = \int\left[\prod_{n=1}^{N} p(t_n\mid\mathbf{w})\right]p(\mathbf{w})\,d\mathbf{w}$$
The conditional distribution for each target $t \in \{0, 1\}$ can be written as
$$p(t\mid\mathbf{w}) = \sigma(a)^{t}\left\{1 - \sigma(a)\right\}^{1-t} = e^{at}\,\sigma(-a)$$
where $a = \mathbf{w}^{\mathrm T}\boldsymbol{\phi}$. Using the variational lower bound on the logistic sigmoid function, we obtain:
$$p(t\mid\mathbf{w}) = e^{at}\,\sigma(-a) \ge e^{at}\,\sigma(\xi)\exp\left\{-\frac{a+\xi}{2} - \lambda(\xi)\left(a^2 - \xi^2\right)\right\}$$
Writing $a_n = \mathbf{w}^{\mathrm T}\boldsymbol{\phi}_n$ for each observation, and multiplying by the prior distribution, we obtain a bound on the joint distribution of $\mathbf{t}$ and $\mathbf{w}$: