
PRML Chapter 2 Probability Distributions

2.1 Binary Variables

  • Bernoulli distribution: $Bern(x | \mu) = \mu^{x}(1-\mu)^{1-x}$
  • Binomial distribution: $Bin(m | N,\mu) = \frac{N!}{(N-m)!m!}\mu^{m}(1-\mu)^{N-m}$

2.1.1 The beta distribution

To treat this problem from a Bayesian perspective, we need a prior distribution over $\mu$. The beta distribution is a common choice:

$$Beta(\mu | a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$$

where $E[\mu] = \frac{a}{a + b}$ and $var[\mu] = \frac{ab}{(a+b)^{2}(a + b + 1)}$. Multiplying the prior by the binomial likelihood (with $m$ observations of $x=1$ and $l$ observations of $x=0$) gives the posterior distribution:

$$p(\mu | m,l,a,b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1}$$

For prediction, we need to evaluate the distribution of $x$ given the observed data $D$:

$$p(x = 1 | D) = \int_{0}^{1}p(x=1 | \mu)p(\mu | D)d\mu = \int_{0}^{1}\mu\, p(\mu | D)d\mu = E[\mu | D]$$

So we have:

$$p(x = 1 | D) = \frac{m+a}{m+a+l+b}$$
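
As an illustration (not from the original text), here is a minimal Python sketch of this Beta-Bernoulli update; the prior hyperparameters, the true coin bias, and the random seed are all assumed values:

```python
import numpy as np

# Minimal sketch: Beta prior Beta(a, b), data D with m ones and l zeros.
a, b = 2.0, 2.0                      # assumed prior hyperparameters
rng = np.random.default_rng(0)
D = rng.binomial(1, 0.7, size=50)    # synthetic coin flips, true mu = 0.7
m, l = D.sum(), len(D) - D.sum()

# Posterior is Beta(m + a, l + b); the predictive p(x=1|D) is its mean.
p_x1 = (m + a) / (m + a + l + b)
print(p_x1)                          # approaches 0.7 as more data arrive
```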

On average, observing more data reduces the uncertainty of the posterior distribution, as the following identities show:

$$E_{\theta}[\theta] = E_{D}[E_{\theta}[\theta | D]]$$

$$var_{\theta}[\theta] = E_{D}[var_{\theta}[\theta | D]] + var_{D}[E_{\theta}[\theta | D]]$$

2.2 Multinomial Variables

If we want to describe a variable that can take more than two states using a binary scheme, we can use a 1-of-K vector of the form $x = (0, 0, \dots, 1, \dots, 0)^{T}$. If $\mu_k$ denotes the probability of $x_k=1$ and we have a dataset of $N$ independent observations, the likelihood function is:

$$p(D | \mu) = \prod_{k=1}^{K}\mu_{k}^{m_{k}},\quad m_k=\sum_n x_{nk}$$

Using a Lagrange multiplier and setting the derivative with respect to $\mu_k$ to zero, we find the MLE solution for $\mu$:

$$\sum_{k=1}^{K}m_{k}\ln\mu_{k} + \lambda\left(\sum_{k=1}^{K}\mu_{k}-1\right)$$

$$\mu_{k}^{ML} = \frac{m_{k}}{N}$$

Now consider the joint distribution of the counts $m_1, \dots, m_K$ conditioned on the parameters $\mu$ and the total number of observations $N$; it is called the multinomial distribution (here $\sum_{k=1}^{K}m_{k} = N$):

$$Mult(m_{1}, m_{2}, \dots, m_{K} |\mu, N) = \frac{N!}{m_{1}!m_{2}!\dots m_{K}!}\prod_{k=1}^{K}\mu_{k}^{m_{k}}$$

2.2.1 The Dirichlet Distribution

The conjugate prior for the parameters $\{\mu_k\}$ is the Dirichlet distribution:

$$Dir(\mu | \alpha) = \frac{\Gamma(\alpha_{0})}{\Gamma(\alpha_{1})\dots\Gamma(\alpha_{K})} \prod_{k=1}^{K}\mu_{k}^{\alpha_{k}-1}$$

where $\alpha_{0} = \sum_{k=1}^{K}\alpha_{k}$. It is then easy to obtain the posterior distribution:

$$p(\mu | D,\alpha) = Dir(\mu | \alpha + m)$$
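
A minimal sketch of this Dirichlet-multinomial update, with assumed hyperparameters $\alpha$ and synthetic one-hot data:

```python
import numpy as np

# Minimal sketch: Dirichlet prior updated by multinomial counts.
alpha = np.array([1.0, 1.0, 1.0])                   # assumed prior Dir(alpha), K = 3
rng = np.random.default_rng(0)
X = rng.multinomial(1, [0.2, 0.3, 0.5], size=200)   # one-hot observations x_n
m = X.sum(axis=0)                                   # counts m_k

mu_ml = m / m.sum()                                 # MLE: m_k / N
alpha_post = alpha + m                              # posterior Dir(alpha + m)
mu_post_mean = alpha_post / alpha_post.sum()
print(mu_ml, mu_post_mean)
```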

2.3 The Gaussian Distribution

Geometric form of the Gaussian distribution: the exponent is a quadratic form:

$$\Delta^{2} = (x-\mu)^{T}\Sigma^{-1}(x-\mu)$$

$\Sigma$ can be taken to be symmetric, and from the eigenvector equation $\Sigma u_{i} = \lambda_{i}u_{i}$, choosing the eigenvectors to be orthonormal, we have:

$$\Sigma = \sum_{i=1}^{D}\lambda_{i}u_{i}u_{i}^{T}$$

So the quadratic form can be written as:

$$\Delta^{2} = \sum_{i=1}^{D}\frac{y_{i}^{2}}{\lambda_{i}},\quad y_{i} = u_{i}^{T}(x-\mu)$$

In the new coordinate system defined by the $y_i$, the Gaussian distribution takes the form:

$$p(y) = p(x)|J| = \prod_{j=1}^{D}\frac{1}{(2\pi\lambda_{j})^{\frac{1}{2}}}\exp\left\{-\frac{y_{j}^{2}}{2\lambda_{j}}\right\},\quad J_{ij} = \frac{\partial x_{i}}{\partial y_{j}} = U_{ij}$$
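
A quick numerical check (with an assumed random covariance and point) that the eigen-decomposition of $\Sigma$ diagonalizes the quadratic form:

```python
import numpy as np

# Minimal sketch: verify that y_i = u_i^T (x - mu) turns the Mahalanobis
# distance into sum_i y_i^2 / lambda_i.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)          # an arbitrary SPD covariance (assumed)
mu = np.array([1.0, -2.0, 0.5])
x = rng.standard_normal(3)

lam, U = np.linalg.eigh(Sigma)           # Sigma = U diag(lam) U^T, columns are u_i
delta2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
y = U.T @ (x - mu)                       # rotated coordinates
delta2_rotated = np.sum(y**2 / lam)
print(np.isclose(delta2_direct, delta2_rotated))   # True
```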

2.3.1 Conditional Gaussian distributions

Consider a multivariate normal distribution and suppose we partition:

$$x = \begin{pmatrix} x_{a} \\ x_{b} \end{pmatrix},\quad \mu = \begin{pmatrix} \mu_{a} \\ \mu_{b} \end{pmatrix},\quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$$

and we introduce the precision matrix:

$$\Sigma^{-1}= \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$

To find an expression for the conditional distribution $p(x_a | x_b)$, we expand the quadratic form:

$$-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu) = -\frac{1}{2}(x_{a} - \mu_{a})^{T}\Lambda_{aa}(x_{a} - \mu_{a}) - \frac{1}{2}(x_{a} - \mu_{a})^{T}\Lambda_{ab}(x_{b} - \mu_{b}) \\ -\frac{1}{2}(x_{b} - \mu_{b})^{T}\Lambda_{ba}(x_{a} - \mu_{a}) - \frac{1}{2}(x_{b} - \mu_{b})^{T}\Lambda_{bb}(x_{b} - \mu_{b})$$

which is the exponential term of the conditional Gaussian. Note also that for a general Gaussian:

$$-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu) = -\frac{1}{2}x^{T}\Sigma^{-1}x + x^{T}\Sigma^{-1}\mu + const$$

Applying this to $p(x_a | x_b)$, with $x_b$ regarded as a constant, and comparing the second-order term $-\frac{1}{2}x_a^T\Lambda_{aa}x_a$ and the linear term $x_a^T(\Lambda_{aa}\mu_{a} - \Lambda_{ab}(x_{b} - \mu_{b}))$ in $x_a$, we can read off the covariance and mean:

$$\Sigma_{a|b} = \Lambda_{aa}^{-1}$$

$$\mu_{a|b} = \Sigma_{a|b}(\Lambda_{aa}\mu_{a} - \Lambda_{ab}(x_{b} - \mu_{b})) = \mu_{a} - \Lambda_{aa}^{-1}\Lambda_{ab}(x_{b} - \mu_{b})$$
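
A minimal sketch of conditioning via the partitioned precision matrix, cross-checked against the equivalent covariance-block (Schur complement) expression; the covariance, mean, and observed $x_b$ are assumed values:

```python
import numpy as np

# Minimal sketch: conditional Gaussian from a partitioned precision matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)          # assumed SPD covariance
mu = rng.standard_normal(4)
ia, ib = slice(0, 2), slice(2, 4)        # partition x = (x_a, x_b)

Lam = np.linalg.inv(Sigma)               # precision matrix
Lam_aa, Lam_ab = Lam[ia, ia], Lam[ia, ib]
x_b = rng.standard_normal(2)             # observed value of x_b

Sigma_a_given_b = np.linalg.inv(Lam_aa)
mu_a_given_b = mu[ia] - Sigma_a_given_b @ Lam_ab @ (x_b - mu[ib])

# Equivalent expression using covariance blocks, as a consistency check.
S_ab, S_bb = Sigma[ia, ib], Sigma[ib, ib]
mu_check = mu[ia] + S_ab @ np.linalg.inv(S_bb) @ (x_b - mu[ib])
print(np.allclose(mu_a_given_b, mu_check))   # True
```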

2.3.2 Marginal Gaussian distributions

To show that the marginal distribution $p(x_{a}) = \int p(x_{a}, x_{b})dx_{b}$ is Gaussian, the method is similar to Section 2.3.1: complete the square in $x_b$, integrate it out, and compare the remaining terms in $x_a$. The results are:

$$\Sigma_{a} = (\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1}$$

$$\mu_{a} = \Sigma_{a}(\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})\mu_{a}$$

Using the partitioned-inverse identity, these simplify to $E[x_{a}] = \mu_{a}$ and $cov[x_{a}] = \Sigma_{aa}$.

2.3.3 Bayes’ theorem for Gaussian variables

Suppose we are given a Gaussian marginal distribution $p(x)$ and a Gaussian conditional distribution $p(y | x)$ whose mean is a linear function of $x$:
$$p(x) = N(x | \mu, \Lambda^{-1})$$

$$p(y | x) = N(y | Ax + b, L^{-1})$$

We wish to find the marginal distribution $p(y)$ and the conditional distribution $p(x | y)$. Consider the log of the joint distribution over $z = (x, y)$:

$$\begin{aligned} \ln p(z) & = \ln p(x) + \ln p(y | x) \\ & = -\frac{1}{2}(x-\mu)^{T}\Lambda(x-\mu) - \frac{1}{2}(y-Ax-b)^{T}L(y-Ax-b) + const \end{aligned}$$

By comparing quadratic forms we can get the mean and covariance of the joint distribution:

$$cov[z] = R^{-1} = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^{T} \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^{T} \end{pmatrix}$$

$$E[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}$$

And for the marginal distribution $p(y)$, it is easy to observe that

$$E[y] = A\mu + b$$

$$cov[y] = L^{-1} + A\Lambda^{-1}A^{T}$$

Similarly, for the conditional distribution $p(x | y)$, we have:

$$E[x | y] = (\Lambda + A^{T}LA)^{-1}\{A^{T}L(y-b) + \Lambda\mu \}$$

$$cov[x | y] = (\Lambda + A^{T}LA)^{-1}$$
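
A minimal sketch of these linear-Gaussian formulas; $\mu$, $\Lambda$, $A$, $b$, $L$ and the observed $y$ below are assumed example values:

```python
import numpy as np

# Minimal sketch: p(x) = N(mu, Lam^{-1}), p(y|x) = N(Ax + b, L^{-1}).
mu = np.array([0.0, 1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.5]])        # prior precision (assumed)
A = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])
b = np.array([0.1, -0.2, 0.3])
L = np.eye(3) * 4.0                             # observation precision (assumed)

# Marginal of y.
mean_y = A @ mu + b
cov_y = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

# Posterior over x given an observed y.
y = np.array([0.2, 1.8, 1.1])
cov_x_given_y = np.linalg.inv(Lam + A.T @ L @ A)
mean_x_given_y = cov_x_given_y @ (A.T @ L @ (y - b) + Lam @ mu)
print(mean_y, mean_x_given_y)
```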

2.3.4 Maximum likelihood for the Gaussian

In this part we only state the results:

$$\mu_{ML}=\frac{1}{N}\sum_{n=1}^N x_n$$

$$\Sigma_{ML}=\frac{1}{N}\sum_{n=1}^N(x_n-\mu_{ML})(x_n-\mu_{ML})^T$$

2.3.5 Sequential estimation

Sequential methods allow data points to be processed one at a time and then discarded. Taking the ML estimate of the mean from 2.3.4 and separating out the contribution from the final data point $x_N$, we obtain:

$$\begin{aligned} \mu_{ML}^{(N)} & = \frac{1}{N}\sum_{n=1}^{N}x_{n} \\ & = \mu_{ML}^{(N-1)} + \frac{1}{N}(x_{N} - \mu_{ML}^{(N-1)}) \end{aligned}$$
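
A small sketch (with synthetic data) showing that the sequential update reproduces the batch ML mean:

```python
import numpy as np

# Minimal sketch: batch ML mean vs. sequential update on assumed synthetic data.
rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=1000)

mu_seq = 0.0
for N, x_N in enumerate(data, start=1):
    mu_seq = mu_seq + (x_N - mu_seq) / N        # mu^{(N)} = mu^{(N-1)} + (x_N - mu^{(N-1)})/N

print(np.isclose(mu_seq, data.mean()))          # True
```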

A more general formulation of sequential learning is the Robbins-Monro algorithm. Consider random variables $\theta$ and $z$ governed by a joint distribution $p(z,\theta)$. The conditional expectation of $z$ given $\theta$ is

$$f(\theta) = E[z | \theta] = \int zp(z | \theta) dz$$

$f(\theta)$ is called the regression function. Our goal is to find the root $\theta^*$ such that $f(\theta^*)=0$. Suppose we observe values of $z$ and we wish to find a corresponding sequential estimation scheme for $\theta^*$. Assuming that $E[(z-f)^{2} | \theta] < \infty$, the sequence of successive estimates is given by:

$$\theta^{(N)} = \theta^{(N-1)} - \alpha_{N-1}z(\theta^{(N-1)})$$
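
A minimal sketch of Robbins-Monro applied to estimating a Gaussian mean with known variance. Taking $z(\theta) = -\partial \ln p(x|\theta)/\partial\theta$ and $\alpha_{N-1} = \sigma^2/N$ (the standard maximum-likelihood application) recovers the sequential mean update above; the true mean, variance, and seed are assumed values:

```python
import numpy as np

# Minimal sketch: Robbins-Monro for the Gaussian mean (sigma^2 known).
# z(theta) = -(x - theta) / sigma^2, whose regression function has its root
# at the true mean; step sizes alpha_{N-1} = sigma^2 / N.
rng = np.random.default_rng(0)
true_mu, sigma2 = 3.0, 1.0
theta = 0.0
for N in range(1, 2001):
    x_N = rng.normal(true_mu, np.sqrt(sigma2))
    z = -(x_N - theta) / sigma2
    alpha = sigma2 / N
    theta = theta - alpha * z              # theta^{(N)} = theta^{(N-1)} - alpha_{N-1} z
print(theta)                               # close to 3.0
```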

2.3.6 Bayesian inference for the Gaussian

Estimating $\mu$ ($\sigma^2$ known):
The likelihood function of the Gaussian is:

$$p(x | \mu) = \prod_{n=1}^{N}p(x_{n} | \mu) = \frac{1}{(2\pi\sigma^{2})^{N/2}}\exp\left\{-\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_{n} - \mu)^{2}\right\}$$

Choose $p(\mu)$ to be Gaussian, $p(\mu) = N(\mu | \mu_{0},\sigma_{0}^{2})$; the posterior distribution is then given by:

$$p(\mu | x) \propto p(x | \mu)p(\mu)$$

Consequently, the posterior distribution will be:

$$p(\mu | x) = N(\mu | \mu_{N}, \sigma_{N}^{2})$$

where

$$\mu_{N} = \frac{\sigma^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\mu_{0} + \frac{N\sigma_{0}^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\mu_{ML},\quad \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_{n}$$

$$\frac{1}{\sigma_{N}^{2}} = \frac{1}{\sigma_{0}^{2}} + \frac{N}{\sigma^{2}}$$
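
A minimal numerical sketch of this posterior update; the prior hyperparameters and the synthetic data are assumed:

```python
import numpy as np

# Minimal sketch: posterior N(mu | mu_N, sigma_N^2) with known noise variance.
rng = np.random.default_rng(0)
sigma2 = 1.0                        # known noise variance
mu0, sigma0_2 = 0.0, 10.0           # assumed prior N(mu | mu0, sigma0^2)
x = rng.normal(2.5, np.sqrt(sigma2), size=30)
N, mu_ml = len(x), x.mean()

mu_N = (sigma2 * mu0 + N * sigma0_2 * mu_ml) / (N * sigma0_2 + sigma2)
sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)
print(mu_N, sigma_N2)               # mean shrinks toward mu_ml, variance toward 0
```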

The Bayesian paradigm leads very naturally to a sequential view of the inference problem.

$$p(\mu | x) \propto \left[ p(\mu)\prod_{n=1}^{N-1}p(x_{n} | \mu) \right] p(x_{N} | \mu)$$

Estimating the precision $\lambda$ ($\mu$ known):
The likelihood function for $\lambda$ takes the form:

$$p(x | \lambda) = \prod_{n=1}^{N}N(x_{n} | \mu, \lambda^{-1}) \propto \lambda^{\frac{N}{2}}\exp\left\{ -\frac{\lambda}{2}\sum_{n=1}^{N}(x_{n} - \mu)^{2} \right\}$$

The prior distribution that we choose is the gamma distribution:

$$Gam(\lambda | a, b) = \frac{1}{\Gamma(a)}b^{a}\lambda^{a-1}\exp(-b\lambda)$$

The posterior distribution is given by:

$$p(\lambda | x) \propto \lambda^{a_{0} - 1}\lambda^{\frac{N}{2}}\exp\left\{ -b_{0}\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_{n}-\mu)^{2}\right\}$$

Consequently, the posterior distribution will be $Gam(\lambda | a_{N}, b_{N})$, where

$$a_{N} = a_{0} + \frac{N}{2}$$

$$b_{N} = b_{0} + \frac{1}{2}\sum_{n=1}^{N}(x_{n} - \mu)^{2} = b_{0} + \frac{N}{2}\sigma_{ML}^{2}$$
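
A matching sketch for the precision, again with assumed prior hyperparameters and synthetic data:

```python
import numpy as np

# Minimal sketch: Gamma posterior over the precision lambda with the mean known.
rng = np.random.default_rng(0)
mu = 0.0                                 # known mean
a0, b0 = 1.0, 1.0                        # assumed prior Gam(lambda | a0, b0)
x = rng.normal(mu, 2.0, size=100)        # true precision = 1/4

a_N = a0 + len(x) / 2.0
b_N = b0 + 0.5 * np.sum((x - mu) ** 2)
print(a_N / b_N)                         # posterior mean of lambda, near 0.25
```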

When both $\mu$ and the precision $\lambda$ are unknown:

The prior distribution we choose (the normal-gamma distribution) takes the form:

$$p(\mu, \lambda) = N(\mu | \mu_{0}, (\beta\lambda)^{-1})Gam(\lambda | a, b)$$

where $\mu_{0} = \frac{c}{\beta}$, $a = \frac{1+\beta}{2}$ and $b = d - \frac{c^{2}}{2\beta}$.

2.3.7 Student’s t-distribution

If we have a Gaussian distribution $N(x | \mu, \tau^{-1})$ and a gamma prior $Gam(\tau | a, b)$, integrating out $\tau$ gives the marginal distribution of $x$:

$$p(x | \mu, a, b) = \int_{0}^{\infty} N(x | \mu, \tau^{-1})Gam(\tau | a, b)d\tau$$

Replacing the parameters by $\nu=2a$ and $\lambda=\frac{a}{b}$, we obtain Student's t-distribution:

$$St(x | \mu,\lambda, \nu) = \frac{\Gamma(\frac{\nu}{2} + \frac{1}{2})}{\Gamma(\frac{\nu}{2})}\left( \frac{\lambda}{\pi\nu} \right)^{\frac{1}{2}}\left[ 1 + \frac{\lambda(x-\mu)^{2}}{\nu} \right]^{-\frac{\nu}{2}-\frac{1}{2}}$$
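
A numerical sanity check (with assumed values of $\mu$, $a$, $b$ and $x$) that the gamma-weighted integral over $\tau$ reproduces the closed-form Student's t density:

```python
import numpy as np
from math import gamma, pi

# Minimal sketch: integrate N(x|mu, tau^{-1}) Gam(tau|a, b) over tau numerically
# and compare with St(x | mu, lam, nu) where nu = 2a and lam = a/b.
mu, a, b, x = 0.0, 3.0, 2.0, 1.3         # assumed example values
nu, lam = 2 * a, a / b

tau = np.linspace(1e-6, 60.0, 200000)
normal = np.sqrt(tau / (2 * pi)) * np.exp(-0.5 * tau * (x - mu) ** 2)
gam = (b ** a / gamma(a)) * tau ** (a - 1) * np.exp(-b * tau)
mixture = np.sum(normal * gam) * (tau[1] - tau[0])   # simple Riemann sum

st = (gamma(nu / 2 + 0.5) / gamma(nu / 2)) * np.sqrt(lam / (pi * nu)) \
     * (1 + lam * (x - mu) ** 2 / nu) ** (-(nu + 1) / 2)
print(np.isclose(mixture, st, rtol=1e-3))            # True
```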

2.3.8 Periodic variables

2.3.9 Mixtures of Gaussians

By taking linear combinations of basic distributions such as Gaussians, almost any continuous density can be approximated to arbitrary accuracy:

$$p(x) = \sum_{k=1}^{K}\pi_{k}N(x | \mu_{k}, \Sigma_{k})$$
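
A minimal sketch evaluating a one-dimensional mixture of Gaussians with assumed mixing coefficients, means, and standard deviations:

```python
import numpy as np

# Minimal sketch: density of a 1-D mixture of Gaussians.
def mog_pdf(x, pis, mus, sigmas):
    x = np.asarray(x, dtype=float)[..., None]
    comps = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ pis                      # sum_k pi_k N(x | mu_k, sigma_k^2)

pis = np.array([0.3, 0.5, 0.2])             # mixing coefficients, sum to 1
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.8])
print(mog_pdf([-2.0, 0.0, 3.0], pis, mus, sigmas))
```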

2.4 The Exponential Family

The exponential family of distributions over $x$, with natural parameters $\eta$, is defined as:

$$p(x | \eta) = h(x)g(\eta)\exp\{\eta^{T}u(x)\}$$

where the coefficient $g(\eta)$ ensures normalization:

$$g(\eta)\int h(x)\exp\{\eta^{T}u(x)\} dx = 1$$
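
As a standard worked example (not derived in the notes above), the Bernoulli distribution can be put into this form:

$$Bern(x | \mu) = \mu^{x}(1-\mu)^{1-x} = (1-\mu)\exp\left\{ x\ln\left(\frac{\mu}{1-\mu}\right) \right\}$$

so that $u(x) = x$, $h(x) = 1$, $\eta = \ln\frac{\mu}{1-\mu}$, and $g(\eta) = \frac{1}{1+\exp(\eta)}$, i.e. the logistic sigmoid evaluated at $-\eta$.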

2.4.1 Maximum likelihood and sufficient statistics

2.4.2 Conjugate priors

For any member of the exponential family, there exists a conjugate prior that can be written in the form:

$$p(\eta | \chi, \nu) = f(\chi, \nu)g(\eta)^{\nu}\exp(\nu\eta^{T}\chi)$$

2.4.3 Noninformative priors

Two simple examples of noninformative priors:

The location parameter: $p(x| \mu) = f(x-\mu)$

The scale parameter: $p(x | \sigma) = \frac{1}{\sigma}f\left(\frac{x}{\sigma}\right)$

2.5 Nonparametric Methods
