Expectation Maximization Algorithm

Complete Information

Consider an experiment with a coin $A$ that has a probability $\theta_A$ of heads, and a coin $B$ that has a probability $\theta_B$ of heads. We draw samples as follows: pick one of the coins at random, flip it $n$ times, and record the number of heads and tails (which sum to $n$). If we recorded which coin we used for each sample, we have complete information and can estimate $\theta_A$ and $\theta_B$ in closed form. To be very explicit, suppose we drew 5 samples with the number of heads and tails of the $i$th sample represented as a vector $x_i$, and the sequence of coins chosen was $A$, $A$, $B$, $A$, $B$. Then the complete log likelihood is
$$\log p(x_1;\theta_A) + \log p(x_2;\theta_A) + \log p(x_3;\theta_B) + \log p(x_4;\theta_A) + \log p(x_5;\theta_B)$$
where $p(x_i;\theta)$ is the binomial distribution PMF with $n$ trials (written as m in the code below) and success probability $p=\theta$. We will use $z_i$ to indicate the label of the $i$th coin, that is, whether we used coin $A$ or coin $B$ to generate the $i$th sample.
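Writing $h_i$ for the number of heads in the $i$th sample, the complete-information maximum likelihood estimates are just the pooled proportions of heads for each coin:

$$\hat{\theta}_A = \frac{h_1 + h_2 + h_4}{3n}, \qquad \hat{\theta}_B = \frac{h_3 + h_5}{2n}$$

This is exactly what the closed-form ("exact") calculation in the code below computes.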

Now, solving for the parameters by numerically minimizing the negative complete log likelihood:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import bernoulli, binom

np.random.seed(1234)

def neg_loglik(thetas, n, xs, zs):
    # negative complete-data log likelihood: each sample x is scored with the
    # binomial PMF of the coin indicated by its label z
    return -np.sum([binom(n, thetas[z]).logpmf(x) for (x, z) in zip(xs, zs)])
m = 10                        # number of flips per sample
theta_A = 0.8                 # true probability of heads for coin A
theta_B = 0.3                 # true probability of heads for coin B
theta_0 = [theta_A, theta_B]  # true parameter vector

coin_A = bernoulli(theta_A)
coin_B = bernoulli(theta_B)

# number of heads in each of the 5 samples, and the coin used for each (0 = A, 1 = B)
xs = list(map(sum, [coin_A.rvs(m), coin_A.rvs(m), coin_B.rvs(m), coin_A.rvs(m), coin_B.rvs(m)]))
zs = [0, 0, 1, 0, 1]

Exact solution:

xs = np.array(xs)
print(xs)
array([7.000, 9.000, 2.000, 6.000, 0.000])
ml_A = np.sum(xs[[0,1,3]])/(3.0*m)
ml_B = np.sum(xs[[2,4]])/(2.0*m)
print(ml_A, ml_B)
(0.73333333333333328, 0.10000000000000001)

Numerical estimate:

bnds = [(0,1), (0,1)]
minimize(neg_loglik, [0.5, 0.5], args=(m, xs, zs),
         bounds=bnds, method='tnc', options={'maxiter': 100})
 status: 1
success: True
   nfev: 17
    fun: 7.6552677541393193
      x: array([0.733, 0.100])
message: 'Converged (|f_n-f_(n-1)| ~= 0)'
    jac: array([-0.000, -0.000])
    nit: 6

Incomplete Information

However, if we did not record the coin we used, we have missing data and the problem of estimating $\theta$ is harder to solve. One way to approach the problem is to ask: can we assign a weight $w_i$ to each sample according to how likely it is to have been generated by coin $A$ or coin $B$?

With knowledge of the weights $w_i$, we can maximize the likelihood to find $\theta$. Similarly, given $\theta$, we can calculate what the weights $w_i$ should be. So the basic idea behind Expectation Maximization (EM) is simply to start with a guess for $\theta$, use it to calculate the distribution over the latent labels $z_i$ (the weights $w_i$), then update $\theta$ using these weights, and repeat until convergence. The derivation below shows why this scheme of alternating updates actually works.

A verbal outline of the derivation: first consider the log likelihood function as a curve (surface) whose base is $\theta$. Find another function $Q$ of $\theta$ that is a lower bound of the log likelihood but touches the log likelihood function at some $\theta$ (E-step). Next find the value of $\theta$ that maximizes this function (M-step). Now find yet another function of $\theta$ that is a lower bound of the log likelihood but touches the log likelihood function at this new $\theta$. Repeat until convergence; at this point, the maxima of the lower bound and likelihood functions coincide and we have found the maximum log likelihood.

The only remaining step is how to find the functions that are lower bounds of the log likelihood. This requires a little math using Jensen's inequality, shown below. In the E-step, we identify a function $Q$ which is a lower bound for the log likelihood. We want $Q$ to touch the log likelihood, and we know that Jensen's inequality is an equality only when the function it is applied to is constant. This forces $Q_i$ to be the posterior distribution of $z_i$, which completes the E-step. In the M-step, we find the value of $\theta$ that maximizes the $Q$ function, and then we iterate over the E and M steps until convergence.
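A sketch of the standard bound, writing $Q_i$ for an arbitrary distribution over the latent label $z_i$ and applying Jensen's inequality to the concave $\log$:

$$\ell(\theta) = \sum_i \log \sum_{z_i} p(x_i, z_i; \theta) = \sum_i \log \sum_{z_i} Q_i(z_i)\, \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \;\ge\; \sum_i \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$

Equality holds exactly when the ratio $p(x_i, z_i; \theta)/Q_i(z_i)$ is constant in $z_i$, i.e. when $Q_i(z_i) = p(z_i \mid x_i; \theta)$.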
So we see that EM is an algorithm for maximum likelihood optimization when there is missing information, or when it is useful to add latent (augmented) variables to simplify the maximum likelihood calculation.

For the E-step, for each sample we compute the posterior weight of each coin given the current parameters. For the M-step, we need to find the value of $\theta$ that maximizes the $Q$ function; we can differentiate and solve for each component $\theta_s$ where the derivative vanishes. The resulting updates, sketched below, are the ones labelled [EQN 1] and [EQN 2] in the code.
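A sketch of these updates, under the equal-prior assumption on the two coins that is implicit in the code below; here $x_i$ is the (heads, tails) count vector of the $i$th sample, $h_i$ is its number of heads, and $n$ is the number of flips per sample.

E-step ([EQN 1]), the posterior weights:

$$w_{iA} = \frac{p(x_i; \theta_A)}{p(x_i; \theta_A) + p(x_i; \theta_B)}, \qquad w_{iB} = 1 - w_{iA}$$

M-step ([EQN 2]), the weighted proportions of heads:

$$\theta_A = \frac{\sum_i w_{iA}\, h_i}{\sum_i w_{iA}\, n}, \qquad \theta_B = \frac{\sum_i w_{iB}\, h_i}{\sum_i w_{iB}\, n}$$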

xs = np.array([(5,5), (9,1), (8,2), (4,6), (7,3)])  # observed (heads, tails) counts per sample
thetas = np.array([[0.6, 0.4], [0.5, 0.5]])         # initial guesses for [P(H), P(T)] of coins A and B

tol = 0.01
max_iter = 100

ll_old = 0
for i in range(max_iter):
    ws_A = []
    ws_B = []

    vs_A = []
    vs_B = []

    ll_new = 0

    # E-step: calculate probability distributions over possible completions
    for x in xs:
        # binomial log likelihood of this sample under each coin,
        # up to the binomial coefficient (which cancels in the weights)
        ll_A = np.sum(x * np.log(thetas[0]))
        ll_B = np.sum(x * np.log(thetas[1]))

        # [EQN 1]: posterior probability that this sample came from coin A / B,
        # assuming equal prior probability for the two coins
        denom = np.exp(ll_A) + np.exp(ll_B)
        w_A = np.exp(ll_A)/denom
        w_B = np.exp(ll_B)/denom

        ws_A.append(w_A)
        ws_B.append(w_B)

        # weighted (heads, tails) counts, used below to update theta
        vs_A.append(w_A * x)
        vs_B.append(w_B * x)

        # accumulate the expected complete log likelihood
        ll_new += w_A * ll_A + w_B * ll_B
    # M-step: update values for parameters given current distribution
    # [EQN 2]
    thetas[0] = np.sum(vs_A, 0)/np.sum(vs_A)
    thetas[1] = np.sum(vs_B, 0)/np.sum(vs_B)
    # print the current parameter estimates and expected log likelihood

    print("Iteration: %d" % (i+1))
    print("theta_A = %.2f, theta_B = %.2f, ll = %.2f" % (thetas[0,0], thetas[1,0], ll_new))

    if np.abs(ll_new - ll_old) < tol:  break
    ll_old = ll_new

Suppose the set $Z = (X, Y)$ consists of the observed data $X$ and the unobserved data $Y$; $X$ and $Z = (X, Y)$ are called the incomplete data and the complete data, respectively. Assume that the joint probability density of $Z$ is parameterized as $P(X, Y \mid \theta)$, where $\theta$ denotes the parameter to be estimated. The maximum likelihood estimate of $\theta$ is obtained by maximizing the log likelihood function of the incomplete data $X$.
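In other words, the quantity being maximized is the incomplete-data log likelihood, obtained by summing (or integrating) the complete-data density over the unobserved $Y$:

$$L(\theta; X) = \log P(X \mid \theta) = \log \sum_{Y} P(X, Y \mid \theta)$$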

EM Algorithm

Problem of Latent Variables for Maximum Likelihood

A common modeling problem is estimating a joint probability distribution for a dataset. This is the problem of density estimation: selecting a probability distribution function, and the parameters of that distribution, that best explain the joint probability distribution of the observed data.

There are many techniques for solving this problem of estimating an appropriate joint probability distribution, one of the most common of which is maximum likelihood estimation. Maximum likelihood estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample. A limitation of maximum likelihood estimation is that it assumes that the dataset is complete, or fully observed.

This is not always the case. There may be datasets where only some of the relevant variables can be observed, and some cannot, and although they influence other random variables in the dataset, they remain hidden. More generally, these unobserved or hidden variables are referred to as latent variables. Conventional maximum likelihood estimation does not work well in the presence of latent variables. Instead, an alternate formulation of maximum likelihood is required for searching for the appropriate model parameters in the presence of latent variables. The Expectation-Maximization algorithm is one such approach.

Expectation-Maximization Algorithm

The Expectation-Maximization (EM) Algorithm is an approach for maximum likelihood estimation in the presence of latent variables.

The EM algorithm is an iterative approach that cycles between two modes. The first mode attempts to estimate the missing or latent variables, called the estimation-step or E-step. The second mode attempts to optimize the parameters of the model to best explain the data, called the maximization-step or M-step.

  • E-Step: Estimate the missing variables in the dataset.
  • M-Step: Maximize the parameters of the model in the presence of the data.
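As a rough sketch, these two steps amount to a short loop. The helper functions e_step and m_step below are hypothetical placeholders to be filled in for a specific model (for instance, the coin example above):

# minimal, generic sketch of the EM loop; e_step and m_step are placeholders
def em(data, theta, e_step, m_step, tol=1e-6, max_iter=100):
    ll_old = -float("inf")
    for _ in range(max_iter):
        weights, ll_new = e_step(data, theta)   # E-step: expected values of the latent variables
        theta = m_step(data, weights)           # M-step: re-estimate parameters from weighted data
        if abs(ll_new - ll_old) < tol:          # stop once the likelihood stops improving
            break
        ll_old = ll_new
    return theta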

In machine learning, the EM algorithm is perhaps most well known for use in unsupervised learning problems, such as density estimation and clustering.

Perhaps the most discussed application of the EM algorithm is for clustering with a mixture model.

Gaussian Mixture Model and the EM Algorithm

A mixture model is a model comprised of an unspecified combination of multiple probability distribution functions.

The Gaussian Mixture Model (GMM) is a mixture model that uses a combination of Gaussian (Normal) probability distributions and requires the estimation of the mean and standard deviation parameters for each. There are many techniques for estimating the parameters for a GMM, although a maximum likelihood estimate is perhaps the most common.

Consider the case where a dataset is comprised of many points that happen to be generated by two different processes. The points for each process have a Gaussian probability distribution, but the data is combined and the distributions are similar enough that it is not obvious to which distribution a given point may belong.

The process used to generate each data point represents a latent variable, e.g. process 0 or process 1. It influences the data but is not observable. As such, the EM algorithm is an appropriate approach to use to estimate the parameters of the distributions.

Example of Gaussian Mixture Model

We can make the application of the EM algorithm to a Gaussian Mixture Model concrete with a worked example.

First, let’s contrive a problem where we have a dataset where points are generated from one of two Gaussian processes. The points are one-dimensional, the mean of the first distribution is 20, the mean of the second distribution is 40, and both distributions have a standard deviation of 5. We will draw 3,000 points from the first process and 7,000 points from the second process and mix them together.

# generate a sample
X1 = normal(loc=20, scale=5, size=3000)
X2 = normal(loc=40, scale=5, size=7000)
X = hstack((X1, X2))

We can then plot a histogram of the points to give an intuition for the dataset, which is expected to show a bimodal distribution with a peak for each of the means of the two distributions. The complete example is listed below.

# example of a bimodal constructed from two gaussian processes
from numpy import hstack
from numpy.random import normal
from matplotlib import pyplot
# generate a sample
X1 = normal(loc=20, scale=5, size=3000)
X2 = normal(loc=40, scale=5, size=7000)
X = hstack((X1, X2))
# plot the histogram
pyplot.hist(X, bins=50, density=True)
pyplot.show()

Running the example creates the dataset and then creates a histogram plot for the data points. The plot clearly shows the expected bimodal distribution with a peak for the first process around 20 and a peak for the second process around 40. We can see that for many of the points in the middle of the two peaks, it is ambiguous which distribution they were drawn from.

We can model the problem of estimating the density of this dataset using a Gaussian Mixture Model.

The GaussianMixture scikit-learn class can be used to model this problem and estimate the parameters of the distributions using the expectation-maximization algorithm. The class allows us to specify the suspected number of underlying processes used to generate the data via the n_components argument when defining the model. Here, we will set this to 2 for the two processes or distributions. If the number of processes was not known, a range of different numbers of components could be tested and the model with the best fit could be chosen, where models could be evaluated using scores such as Akaike or Bayesian Information Criterion (AIC or BIC). There are also many ways we can configure the model to incorporate other information we may know about the data, such as how to estimate initial values for the distributions. In this case, we will randomly guess the initial parameters, by setting the init_params argument to ‘random’.
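As a brief sketch of that model-selection idea (not part of the worked example; the range of component counts and the random_state here are illustrative choices), candidate values of n_components can be compared with the model's BIC score, where lower is better:

# sketch: choose the number of components by comparing BIC scores (lower is better)
from numpy import hstack
from numpy.random import normal
from sklearn.mixture import GaussianMixture

# same two-process sample as in the example below, reshaped to one column
X = hstack((normal(loc=20, scale=5, size=3000), normal(loc=40, scale=5, size=7000)))
X = X.reshape((len(X), 1))

best_k, best_bic = None, float("inf")
for k in range(1, 6):
    candidate = GaussianMixture(n_components=k, init_params='random', random_state=0)
    candidate.fit(X)
    bic = candidate.bic(X)          # Bayesian Information Criterion for this fit
    if bic < best_bic:
        best_k, best_bic = k, bic
print(best_k)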

Once the model is fit, we can access the learned parameters via attributes of the model, such as the means, covariances, mixing weights, and more. More usefully, we can use the fit model to estimate the latent variable for existing and new data points. For example, we can estimate the latent variable for the points in the training dataset, and we would expect the first 3,000 points to belong to one process (e.g. value=1) and the next 7,000 data points to belong to a different process (e.g. value=0).

# fit model
model = GaussianMixture(n_components=2, init_params='random')
model.fit(X)
# predict latent values
yhat = model.predict(X)
# check latent value for first few points
print(yhat[:100])
# check latent value for last few points
print(yhat[-100:])

Tying all of this together, the complete example is listed below.

from numpy import hstack
from numpy.random import normal
from sklearn.mixture import GaussianMixture
# generate a sample
X1 = normal(loc=20, scale=5, size=3000)
X2 = normal(loc=40, scale=5, size=7000)
X = hstack((X1, X2))
# reshape into a table with one column
X = X.reshape((len(X), 1))
# fit model
model = GaussianMixture(n_components=2, init_params='random')
model.fit(X)
# predict latent values
yhat = model.predict(X)
# check latent value for first few points
print(yhat[:100])
# check latent value for last few points
print(yhat[-100:])
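As noted above, once the model is fit, the learned parameters can also be inspected directly; a brief sketch continuing from the fitted model in the example above, using the standard scikit-learn attribute names:

# inspect the learned mixture parameters
print(model.weights_)       # mixing weights; expected to be roughly 0.3 and 0.7
print(model.means_)         # component means; expected to be near 20 and 40
print(model.covariances_)   # component variances; expected near 25 (i.e. std dev 5 squared)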

