PRML Chapter 2 Probability Distributions
2.1 Binary Variables
Bernoulli distribution: $Bern(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}$

Binomial distribution: $Bin(m \mid N, \mu) = \frac{N!}{(N-m)!m!}\mu^{m}(1-\mu)^{N-m}$
2.1.1 The beta distribution
To treat $\mu$ in a Bayesian framework, we need a prior distribution. The beta distribution is a common choice:

$$Beta(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$$

where $E[\mu] = \frac{a}{a+b}$ and $var[\mu] = \frac{ab}{(a+b)^{2}(a+b+1)}$. Multiplying by the binomial likelihood (with $m$ observations of $x=1$ and $l$ of $x=0$), we get the posterior distribution:

$$p(\mu \mid m, l, a, b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1}$$
For prediction, we need to estimate the distribution of $x$ given the observed training data $D$:

$$p(x=1 \mid D) = \int_{0}^{1}p(x=1 \mid \mu)p(\mu \mid D)\,d\mu = \int_{0}^{1}\mu\, p(\mu \mid D)\,d\mu = E[\mu \mid D]$$

So we have:

$$p(x=1 \mid D) = \frac{m+a}{m+a+l+b}$$
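As a quick numerical sketch of this beta-binomial update (the hyperparameters $a=b=2$ and the coin-flip data below are invented for illustration):

```python
# Beta-binomial posterior predictive: p(x=1 | D) = (m+a)/(m+a+l+b).
a, b = 2.0, 2.0                      # Beta prior hyperparameters (assumed values)
data = [1, 1, 0, 1, 0, 1, 1, 0]     # invented coin flips
m = sum(data)                        # number of ones
l = len(data) - m                    # number of zeros

posterior_pred = (m + a) / (m + a + l + b)   # E[mu | D] = p(x=1 | D)
mle = m / len(data)                          # compare: maximum likelihood

print(posterior_pred, mle)
```

Note how the prior pulls the prediction toward $1/2$ relative to the MLE, which matters most when the dataset is small.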
On average, the uncertainty of the posterior distribution decreases steadily as more data are observed:

$$E_{\theta}[\theta] = E_{D}[E_{\theta}[\theta \mid D]]$$

$$var_{\theta}[\theta] = E_{D}[var_{\theta}[\theta \mid D]] + var_{D}[E_{\theta}[\theta \mid D]]$$
2.2 Multinomial Variables
If we want to describe a variable with more than two states using binary values, we can use the 1-of-$K$ representation $x = (0, 0, \dots, 1, \dots, 0)^{T}$. If $\mu_{k}$ denotes the probability of $x_{k} = 1$ and we have a dataset of $N$ independent observations, the likelihood function will be:

$$p(D \mid \mu) = \prod_{k=1}^{K}\mu_{k}^{m_{k}}, \quad m_{k} = \sum_{n} x_{nk}$$

Introducing a Lagrange multiplier to enforce $\sum_{k}\mu_{k} = 1$ and setting the partial derivatives to zero gives the MLE solution for $\mu$:

$$\sum_{k=1}^{K}m_{k}\ln\mu_{k} + \lambda\left(\sum_{k=1}^{K}\mu_{k} - 1\right)$$

$$\mu_{k}^{ML} = \frac{m_{k}}{N}$$

Now consider the joint distribution of the counts $m_{1}, \dots, m_{K}$ conditioned on $N$; it is called the multinomial distribution (here $\sum_{k=1}^{K}m_{k} = N$):

$$Mult(m_{1}, m_{2}, \dots, m_{K} \mid \mu, N) = \frac{N!}{m_{1}!m_{2}!\dots m_{K}!}\prod_{k=1}^{K}\mu_{k}^{m_{k}}$$
2.2.1 The Dirichlet Distribution
The conjugate prior for the parameters $\mu_{k}$ is the Dirichlet distribution:

$$Dir(\mu \mid \alpha) = \frac{\Gamma(\alpha_{0})}{\Gamma(\alpha_{1})\dots\Gamma(\alpha_{K})}\prod_{k=1}^{K}\mu_{k}^{\alpha_{k}-1}$$

where $\alpha_{0} = \sum_{k=1}^{K}\alpha_{k}$, and it is easy to get the posterior distribution:

$$p(\mu \mid D, \alpha) = Dir(\mu \mid \alpha + m)$$
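A minimal sketch of this conjugate update, with an invented symmetric prior $\alpha$ and a handful of one-hot observations:

```python
import numpy as np

# Dirichlet-multinomial conjugacy: posterior parameters are alpha + m.
alpha = np.array([1.0, 1.0, 1.0])     # assumed Dirichlet prior
X = np.array([[1, 0, 0],              # invented one-hot observations
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

m = X.sum(axis=0)                     # counts m_k
alpha_post = alpha + m                # Dir(mu | alpha + m)
posterior_mean = alpha_post / alpha_post.sum()   # E[mu_k] = alpha_k' / alpha_0'
mle = m / len(X)                      # mu_k^{ML} = m_k / N

print(alpha_post, posterior_mean, mle)
```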
2.3 The Gaussian Distribution
The geometric form of the Gaussian distribution: the exponent is a quadratic form

$$\Delta^{2} = (x-\mu)^{T}\Sigma^{-1}(x-\mu)$$

$\Sigma$ can be taken to be symmetric, and from the eigenvector equation $\Sigma u_{i} = \lambda_{i}u_{i}$, when we choose the eigenvectors to be orthonormal, we have:

$$\Sigma = \sum_{i=1}^{D}\lambda_{i}u_{i}u_{i}^{T}$$

So the quadratic form can be written as:

$$\Delta^{2} = \sum_{i=1}^{D}\frac{y_{i}^{2}}{\lambda_{i}}, \quad y_{i} = u_{i}^{T}(x-\mu)$$

In the new coordinate system defined by the $y_{i}$, the Gaussian distribution factorizes:

$$p(y) = p(x)|J| = \prod_{j=1}^{D}\frac{1}{(2\pi\lambda_{j})^{1/2}}\exp\left\{-\frac{y_{j}^{2}}{2\lambda_{j}}\right\}, \quad J_{ij} = \frac{\partial x_{i}}{\partial y_{j}} = U_{ij}$$
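The eigendecomposition above can be checked numerically; the covariance matrix and test point below are invented for illustration:

```python
import numpy as np

# Verify Sigma = sum_i lambda_i u_i u_i^T and that the Mahalanobis
# distance equals sum_i y_i^2 / lambda_i with y_i = u_i^T (x - mu).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])        # assumed covariance
mu = np.array([1.0, -1.0])            # assumed mean
x = np.array([2.0, 0.5])              # assumed test point

lam, U = np.linalg.eigh(Sigma)        # columns of U are orthonormal eigenvectors
Sigma_rebuilt = (U * lam) @ U.T       # sum_i lambda_i u_i u_i^T

y = U.T @ (x - mu)                    # rotated coordinates y_i
delta2 = np.sum(y**2 / lam)           # quadratic form in the new basis
delta2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

print(delta2, delta2_direct)
```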
2.3.1 Conditional Gaussian distributions
Consider a multivariate Gaussian distribution and suppose we partition the variables as:
$$x = \begin{pmatrix} x_{a} \\ x_{b} \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_{a} \\ \mu_{b} \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$$
and we introduce the precision matrix:
$$\Sigma^{-1} = \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$
To find an expression for the conditional distribution $p(x_{a} \mid x_{b})$, we expand the exponent:
$$-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu) = -\frac{1}{2}(x_{a}-\mu_{a})^{T}\Lambda_{aa}(x_{a}-\mu_{a}) - \frac{1}{2}(x_{a}-\mu_{a})^{T}\Lambda_{ab}(x_{b}-\mu_{b}) - \frac{1}{2}(x_{b}-\mu_{b})^{T}\Lambda_{ba}(x_{a}-\mu_{a}) - \frac{1}{2}(x_{b}-\mu_{b})^{T}\Lambda_{bb}(x_{b}-\mu_{b})$$
which is the exponential term of the conditional Gaussian. Also, note that for any Gaussian:

$$-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu) = -\frac{1}{2}x^{T}\Sigma^{-1}x + x^{T}\Sigma^{-1}\mu + const$$
Apply this method to $p(x_{a} \mid x_{b})$, regarding $x_{b}$ as a constant. Comparing the second-order term $-\frac{1}{2}x_{a}^{T}\Lambda_{aa}x_{a}$ and the linear term $x_{a}^{T}(\Lambda_{aa}\mu_{a} - \Lambda_{ab}(x_{b} - \mu_{b}))$ in $x_{a}$, we can read off the covariance and mean:

$$\Sigma_{a|b} = \Lambda_{aa}^{-1}$$

$$\mu_{a|b} = \Sigma_{a|b}(\Lambda_{aa}\mu_{a} - \Lambda_{ab}(x_{b} - \mu_{b})) = \mu_{a} - \Lambda_{aa}^{-1}\Lambda_{ab}(x_{b} - \mu_{b})$$
2.3.2 Marginal Gaussian distributions
To show that the marginal distribution $p(x_{a}) = \int p(x_{a}, x_{b})\,dx_{b}$ is Gaussian, the method is similar to that of 2.3.1; the results are:

$$\Sigma_{a} = (\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1}$$

$$E[x_{a}] = \Sigma_{a}(\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})\mu_{a} = \mu_{a}$$

(in fact $\Sigma_{a}$ is simply the partition $\Sigma_{aa}$ of the covariance matrix).
2.3.3 Bayes’ theorem for Gaussian variables
Suppose we are given a Gaussian marginal distribution $p(x)$ and a Gaussian conditional distribution $p(y \mid x)$ whose mean is a linear function of $x$:

$$p(x) = N(x \mid \mu, \Lambda^{-1})$$

$$p(y \mid x) = N(y \mid Ax + b, L^{-1})$$

We wish to find the marginal distribution $p(y)$ and the conditional distribution $p(x \mid y)$. Consider the log of the joint distribution over $z = (x, y)^{T}$:
$$\ln p(z) = \ln p(x) + \ln p(y \mid x) = -\frac{1}{2}(x-\mu)^{T}\Lambda(x-\mu) - \frac{1}{2}(y-Ax-b)^{T}L(y-Ax-b) + C$$
By comparing quadratic forms we can get the mean and covariance of the joint distribution:
$$cov[z] = R^{-1} = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^{T} \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^{T} \end{pmatrix}$$

$$E[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}$$
For the marginal distribution $p(y)$, it is easy to observe that:

$$E[y] = A\mu + b$$

$$cov[y] = L^{-1} + A\Lambda^{-1}A^{T}$$
Similarly, for the conditional distribution, we have:

$$E[x \mid y] = (\Lambda + A^{T}LA)^{-1}\{A^{T}L(y-b) + \Lambda\mu\}$$

$$cov[x \mid y] = (\Lambda + A^{T}LA)^{-1}$$
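The marginal results can be sanity-checked by simulation; the one-dimensional values of $\mu$, $\Lambda$, $A$, $b$, $L$ below are invented:

```python
import numpy as np

# Monte Carlo check of the linear-Gaussian marginal:
#   E[y] = A mu + b,  cov[y] = L^{-1} + A Lam^{-1} A^T  (scalars here).
rng = np.random.default_rng(0)
mu, Lam = 1.0, 4.0            # p(x) = N(x | mu, Lam^-1)
A, b, L = 2.0, 0.5, 2.0       # p(y | x) = N(y | A x + b, L^-1)

n = 200_000
x = rng.normal(mu, (1 / Lam) ** 0.5, size=n)
y = A * x + b + rng.normal(0.0, (1 / L) ** 0.5, size=n)

Ey_theory = A * mu + b                    # 2.5
covy_theory = 1 / L + A * (1 / Lam) * A   # 0.5 + 1.0 = 1.5

print(y.mean(), Ey_theory, y.var(), covy_theory)
```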
2.3.4 Maximum likelihood for the Gaussian
For this part we only state the results:

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_{n}$$

$$\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_{n} - \mu_{ML})(x_{n} - \mu_{ML})^{T}$$
2.3.5 Sequential estimation
Sequential methods allow data points to be processed one at a time and then discarded. Referring to the MLE result of 2.3.4, we can separate out the contribution from the final data point $x_{N}$:

$$\mu_{ML}^{(N)} = \frac{1}{N}\sum_{n=1}^{N}x_{n} = \mu_{ML}^{(N-1)} + \frac{1}{N}(x_{N} - \mu_{ML}^{(N-1)})$$
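The update rule above amounts to a one-pass running mean; a minimal sketch with invented data:

```python
# Sequential maximum-likelihood mean:
# mu^(N) = mu^(N-1) + (x_N - mu^(N-1)) / N.
data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]   # invented observations
mu = 0.0
for n, x in enumerate(data, start=1):
    mu += (x - mu) / n                   # update uses only the new point

batch_mean = sum(data) / len(data)       # same result computed in batch
print(mu, batch_mean)
```

Each data point can be discarded after its update, which is what makes the scheme useful for streaming data.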
A more general formulation of sequential learning is the Robbins-Monro algorithm. Consider random variables $\theta$ and $z$ governed by a joint distribution $p(z, \theta)$. The conditional expectation of $z$ given $\theta$ is

$$f(\theta) = E[z \mid \theta] = \int zp(z \mid \theta)\,dz$$

$f(\theta)$ is called a regression function. Our goal is to find the root $\theta^{*}$ at which $f(\theta^{*}) = 0$, using a sequential estimation scheme based only on observed values of $z$. Assuming the conditional variance is finite, $E[(z-f)^{2} \mid \theta] < \infty$, the sequence of successive estimates is:

$$\theta^{(N)} = \theta^{(N-1)} - \alpha_{N-1}z(\theta^{(N-1)})$$
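A toy sketch of the procedure, where the regression function $f(\theta) = \theta - 2$ (so $\theta^{*} = 2$) and the Gaussian observation noise are both invented for illustration:

```python
import random

# Robbins-Monro: find theta* with f(theta*) = 0 from noisy observations
# z(theta) whose conditional mean is f(theta).
random.seed(0)

def z(theta):
    """Noisy observation with mean f(theta) = theta - 2 (assumed model)."""
    return (theta - 2.0) + random.gauss(0.0, 0.5)

theta = 10.0
for n in range(1, 5001):
    alpha = 1.0 / n          # step sizes: sum alpha_n = inf, sum alpha_n^2 < inf
    theta -= alpha * z(theta)

print(theta)
```

The step-size conditions noted in the comment are what guarantee convergence to the root.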
2.3.6 Bayesian inference for the Gaussian
Estimate $\mu$ ($\sigma$ known):

The likelihood function of the Gaussian distribution:

$$p(x \mid \mu) = \prod_{n=1}^{N}p(x_{n} \mid \mu) = \frac{1}{(2\pi\sigma^{2})^{N/2}}\exp\left\{-\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_{n} - \mu)^{2}\right\}$$
Choose a Gaussian prior $p(\mu) = N(\mu \mid \mu_{0}, \sigma_{0}^{2})$; the posterior distribution is given by:

$$p(\mu \mid x) \propto p(x \mid \mu)p(\mu)$$

Consequently, the posterior distribution will be:

$$p(\mu \mid x) = N(\mu \mid \mu_{N}, \sigma_{N}^{2})$$

where

$$\mu_{N} = \frac{\sigma^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\mu_{0} + \frac{N\sigma_{0}^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\mu_{ML}, \quad \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_{n}$$

$$\frac{1}{\sigma_{N}^{2}} = \frac{1}{\sigma_{0}^{2}} + \frac{N}{\sigma^{2}}$$
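These two formulas are easy to evaluate directly; the prior values and synthetic data below are invented:

```python
import numpy as np

# Posterior over the Gaussian mean with known variance:
# mu_N is a precision-weighted blend of mu_0 and mu_ML.
rng = np.random.default_rng(1)
sigma = 1.0                           # known noise std (assumed)
mu0, sigma0 = 0.0, 2.0                # assumed prior N(mu0, sigma0^2)
x = rng.normal(1.5, sigma, size=50)   # synthetic data, true mean 1.5

N = len(x)
mu_ml = x.mean()
mu_N = (sigma**2 * mu0 + N * sigma0**2 * mu_ml) / (N * sigma0**2 + sigma**2)
sigma_N2 = 1.0 / (1.0 / sigma0**2 + N / sigma**2)

print(mu_N, sigma_N2)
```

Note that $\mu_{N}$ always lies between the prior mean and the MLE, and $\sigma_{N}^{2}$ shrinks as $N$ grows.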
The Bayesian paradigm leads very naturally to a sequential view of the inference problem: the posterior after $N-1$ observations acts as the prior when the $N$-th arrives.

$$p(\mu \mid x) \propto \left[p(\mu)\prod_{n=1}^{N-1}p(x_{n} \mid \mu)\right]p(x_{N} \mid \mu)$$
Estimate $\sigma$ ($\mu$ known):

It is convenient to work with the precision $\lambda = 1/\sigma^{2}$. The likelihood function for $\lambda$ takes the form:

$$p(x \mid \lambda) = \prod_{n=1}^{N}N(x_{n} \mid \mu, \lambda^{-1}) \propto \lambda^{N/2}\exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_{n} - \mu)^{2}\right\}$$

The prior distribution that we choose is the Gamma distribution:

$$Gam(\lambda \mid a, b) = \frac{1}{\Gamma(a)}b^{a}\lambda^{a-1}\exp(-b\lambda)$$
The posterior distribution is given by:
$$p(\lambda \mid x) \propto \lambda^{a_{0}-1}\lambda^{N/2}\exp\left\{-b_{0}\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_{n} - \mu)^{2}\right\}$$

Consequently, the posterior distribution will be $Gam(\lambda \mid a_{N}, b_{N})$, where

$$a_{N} = a_{0} + \frac{N}{2}$$

$$b_{N} = b_{0} + \frac{1}{2}\sum_{n=1}^{N}(x_{n} - \mu)^{2} = b_{0} + \frac{N}{2}\sigma_{ML}^{2}$$
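A sketch of this Gamma update on synthetic data (the prior $a_{0}, b_{0}$ and the true precision $1/4$ are invented):

```python
import numpy as np

# Gamma posterior over the precision with known mean:
# a_N = a_0 + N/2,  b_N = b_0 + (1/2) sum (x_n - mu)^2.
rng = np.random.default_rng(2)
mu = 0.0                                 # known mean (assumed)
x = rng.normal(mu, 2.0, size=100)        # synthetic data, true precision 1/4

a0, b0 = 1.0, 1.0                        # assumed prior hyperparameters
N = len(x)
aN = a0 + N / 2
bN = b0 + 0.5 * np.sum((x - mu) ** 2)

posterior_mean_precision = aN / bN       # E[lambda] for Gam(aN, bN)
print(aN, bN, posterior_mean_precision)
```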
$\mu$ and $\sigma$ both unknown:

The conjugate prior in this case is the normal-gamma distribution:

$$p(\mu, \lambda) = N(\mu \mid \mu_{0}, (\beta\lambda)^{-1})Gam(\lambda \mid a, b)$$

where $\mu_{0} = \frac{c}{\beta}$, $a = \frac{1+\beta}{2}$ and $b = d - \frac{c^{2}}{2\beta}$.
2.3.7 Student’s t-distribution
If we have a Gaussian distribution $N(x \mid \mu, \tau^{-1})$ and a Gamma prior $Gam(\tau \mid a, b)$, integrating out the precision $\tau$ gives the marginal distribution of $x$:

$$p(x \mid \mu, a, b) = \int_{0}^{\infty}N(x \mid \mu, \tau^{-1})Gam(\tau \mid a, b)\,d\tau$$

Replacing the parameters by $\nu = 2a$ and $\lambda = \frac{a}{b}$, we get the Student's t-distribution:

$$St(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\frac{\nu}{2} + \frac{1}{2})}{\Gamma(\frac{\nu}{2})}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\frac{\nu}{2}-\frac{1}{2}}$$
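The mixture interpretation can be simulated directly: draw $\tau \sim Gam(a,b)$, then $x \sim N(\mu, \tau^{-1})$, and the samples of $x$ follow $St(x \mid \mu, a/b, 2a)$. The parameter values below are invented; with $\nu = 2a = 6$ and $\lambda = a/b = 1$, the variance of the t-distribution is $\nu/(\nu-2) = 1.5$:

```python
import numpy as np

# Student's t as a scale mixture of Gaussians over the precision tau.
rng = np.random.default_rng(3)
mu, a, b = 0.0, 3.0, 3.0                 # assumed: nu = 2a = 6, lambda = a/b = 1

n = 100_000
tau = rng.gamma(shape=a, scale=1.0 / b, size=n)   # Gam(a, b): rate b -> scale 1/b
x = rng.normal(mu, 1.0 / np.sqrt(tau))            # each sample has its own precision

nu = 2 * a
var_theory = nu / (nu - 2)               # 1.5, finite since nu > 2
print(x.var(), var_theory)
```

The heavier tails come from samples where $\tau$ happens to be small, i.e. Gaussians with large variance.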
2.3.8 Periodic variables
2.3.9 Mixtures of Gaussians
By taking linear combinations of basic distributions such as Gaussians, almost any continuous density can be approximated to arbitrary accuracy:

$$p(x) = \sum_{k=1}^{K}\pi_{k}N(x \mid \mu_{k}, \Sigma_{k})$$
2.4 The Exponential Family
The exponential family of distributions over $x$ with parameter $\eta$ is defined as:

$$p(x \mid \eta) = h(x)g(\eta)\exp\{\eta^{T}u(x)\}$$

where $g(\eta)$ is fixed by the normalization condition:

$$g(\eta)\int h(x)\exp\{\eta^{T}u(x)\}\,dx = 1$$
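As a standard worked example, the Bernoulli distribution from 2.1 can be cast in this form:

$$Bern(x \mid \mu) = \mu^{x}(1-\mu)^{1-x} = (1-\mu)\exp\left\{x\ln\frac{\mu}{1-\mu}\right\}$$

so that $\eta = \ln\frac{\mu}{1-\mu}$, $u(x) = x$, $h(x) = 1$, and $g(\eta) = \frac{1}{1+e^{\eta}}$, i.e. $1-\mu$ expressed in terms of the natural parameter via the logistic sigmoid $\mu = \frac{1}{1+e^{-\eta}}$.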
2.4.1 Maximum likelihood and sufficient statistics
2.4.2 Conjugate priors
For any member of the exponential family, there exists a conjugate prior that can be written in the form:

$$p(\eta \mid \chi, \nu) = f(\chi, \nu)g(\eta)^{\nu}\exp(\nu\eta^{T}\chi)$$
2.4.3 Noninformative priors
Two simple examples of noninformative priors:

The location parameter: $p(x \mid \mu) = f(x - \mu)$

The scale parameter: $p(x \mid \sigma) = \frac{1}{\sigma}f\left(\frac{x}{\sigma}\right)$
2.5 Nonparametric Methods