
PRML Chapter 4 Linear Models for Classification

4.1 Discriminant Functions

4.1.1 Two classes

The simplest representation of a linear discriminant function can be expressed as:

y(x) = w^{T}x + w_{0}

The normal distance from the origin to the decision surface is given by

\frac{w^{T}x}{||w||} = -\frac{w_{0}}{||w||}

4.1.2 Multiple classes

Considering a single K-class discriminant comprising K linear functions of the form

y_{k}(x) = w_{k}^{T}x + w_{k0}

and we assign a point $x$ to class $C_k$ if $y_k(x) > y_j(x)$ for all $j \neq k$. The decision boundary between class $C_k$ and class $C_j$ is given by $y_k(x) = y_j(x)$, which corresponds to a (D-1)-dimensional hyperplane of the form

(w_{k} - w_{j})^{T}x + (w_{k0} - w_{j0}) = 0

4.1.3 Least squares for classification

Each class $C_k$ is described by its own linear model so that

y_k(x) = w^T_k x + w_{k0}

and we can group these together to get

y(x) = \tilde{W}^{T}\tilde{x}

where $\tilde{x} = (1, x^T)^T$ is the augmented input vector and the $k$th column of $\tilde{W}$ is $\tilde{w}_k = (w_{k0}, w_k^T)^T$.

The sum-of-squares error function can be written as

E_{D}(\tilde{W}) = \frac{1}{2}\mathrm{Tr}\left\{ (\tilde{X}\tilde{W} - T)^{T}(\tilde{X}\tilde{W} - T) \right\}

Setting the derivative with respect to $\tilde{W}$ to zero, we obtain the solution for $\tilde{W}$:

\tilde{W} = (\tilde{X}^{T}\tilde{X})^{-1}\tilde{X}^{T}T = \tilde{X}^{\dagger}T

We then obtain the discriminant function in the form

y(x) = \tilde{W}^{T}\tilde{x} = T^{T}(\tilde{X}^{\dagger})^{T}\tilde{x}
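As a quick illustration, here is a minimal NumPy sketch of this least-squares classifier (the function names and the use of a 1-of-K target matrix `T` are my own assumptions, not from the book):

```python
import numpy as np

def fit_least_squares(X, T):
    """Return W_tilde = pinv(X_tilde) @ T, where T is an (N, K) 1-of-K target matrix."""
    X_tilde = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias feature x0 = 1
    return np.linalg.pinv(X_tilde) @ T               # (X~^T X~)^{-1} X~^T T

def predict(W_tilde, X):
    """Assign each input to the class with the largest discriminant y_k(x)."""
    X_tilde = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```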

4.1.4 Fisher’s linear discriminant

The mean vectors of the two classes are given by

m_1 = \frac{1}{N_1}\sum_{n\in C_1} x_n, \quad m_2 = \frac{1}{N_2}\sum_{n\in C_2} x_n

To achieve good separation of the projected classes, we might choose $w$ to maximize the distance between the projected class means, $m_2 - m_1 = w^T(\mathbf{m}_2 - \mathbf{m}_1)$, where $m_k = w^T\mathbf{m}_k$.

The within-class variance of the transformed data from class $C_k$ is given by:

s_{k}^{2} = \sum_{n\in C_{k}}(y_{n} - m_{k})^{2}

where $y_n = w^T x_n$. The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance:

J(w) = \frac{(m_{2} - m_{1})^{2}}{s_{1}^{2} + s_{2}^{2}}

We can rewrite the Fisher criterion in the following form, where $S_B$ is the between-class covariance matrix and $S_W$ is the within-class covariance matrix:

S_{B} = (m_{2} - m_{1})(m_{2} - m_{1})^{T}

S_{W} = \sum_{n\in C_{1}}(x_{n} - m_{1})(x_{n} - m_{1})^{T} + \sum_{n\in C_{2}}(x_{n}-m_{2})(x_{n} - m_{2})^{T}

J(w) = \frac{w^{T}S_{B}w}{w^{T}S_{W}w}

Differentiating with respect to $w$, we find that $J(w)$ is maximized when:

(w^{T}S_{B}w)S_{W}w = (w^{T}S_{W}w)S_{B}w

we then obtain:

w \propto S_{W}^{-1}(m_{2} - m_{1})
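A short sketch of computing this Fisher direction with NumPy (assuming `X1` and `X2` hold the samples of the two classes as rows; the helper name is illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w proportional to S_W^{-1} (m2 - m1) for two classes given as (N_k, D) arrays."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix: sum of outer products of the centred samples.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)    # solve S_W w = (m2 - m1) instead of inverting
    return w / np.linalg.norm(w)         # only the direction matters
```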

4.1.5 Relation to least squares

The Fisher solution can be obtained as a special case of least squares for the two-class problem. For class $C_1$ we take the targets to be $N/N_1$, and for class $C_2$ we take them to be $-N/N_2$.

The sum-of-squares error function can be written as:

E = \frac{1}{2}\sum_{n=1}^N(w^Tx_n+w_0-t_n)^2

Setting the derivatives of $E$ with respect to $w_0$ and $w$ to zero, we obtain:

\sum_{n=1}^N(w^Tx_n+w_0-t_n)=0

\sum_{n=1}^N(w^Tx_n+w_0-t_n)x_n=0

Thus we can get:

w_0 = -w^T m

where $m = \frac{1}{N}\sum_{n=1}^N x_n$ is the mean of the whole data set, and

\left(S_W+\frac{N_1N_2}{N}S_B\right)w = N(m_1-m_2) \quad\Rightarrow\quad w \propto S_W^{-1}(m_2-m_1)

4.1.6 Fisher’s discriminant for multiple classes

For the multi-class problem, in analogy with the two-class case, we define the following quantities in the input space:

  • Mean vector
    m_k=\frac{1}{N_k}\sum_{n\in C_k}x_n,\quad m=\frac{1}{N}\sum_{n=1}^Nx_n=\frac{1}{N}\sum_{k=1}^KN_km_k

  • Within-class covariance matrix
    S_W=\sum_{k=1}^K S_k,\quad S_k=\sum_{n\in C_k}(x_n-m_k)(x_n-m_k)^T

  • Between-class covariance matrix
    S_B=\sum_{k=1}^K N_k(m_k-m)(m_k-m)^T

  • The total covariance matrix
    S_T=\sum_{n=1}^N(x_n-m)(x_n-m)^T,\quad S_T=S_W+S_B

Next we introduce $D' > 1$ linear 'features' $y_k = w_k^T x$, which can be grouped together to give $y = W^T x$. We can then define similar matrices in the projected $D'$-dimensional y-space:

s_W=\sum_{k=1}^K\sum_{n\in C_k}(y_n-\mu_k)(y_n-\mu_k)^T

s_B=\sum_{k=1}^K N_k(\mu_k-\mu)(\mu_k-\mu)^T

\mu_k=\frac{1}{N_k}\sum_{n\in C_k}y_n,\quad \mu=\frac{1}{N}\sum_{k=1}^KN_k \mu_k

One of the many possible choices of criterion is $J(W) = \mathrm{Tr}(s_W^{-1}s_B)$, which can be rewritten in terms of the projection matrix as $J(W) = \mathrm{Tr}[(WS_WW^T)^{-1}(WS_BW^T)]$, to be maximized with respect to $W$.
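One standard way to maximize this criterion is to take the weight vectors to be the leading eigenvectors of $S_W^{-1}S_B$. A rough NumPy sketch under that assumption, with illustrative names:

```python
import numpy as np

def multiclass_fisher(X, labels, n_components):
    """Return a (D, D') projection matrix whose columns are the leading
    eigenvectors of S_W^{-1} S_B; project with Y = X @ W."""
    m = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        X_k = X[labels == k]
        m_k = X_k.mean(axis=0)
        S_W += (X_k - m_k).T @ (X_k - m_k)          # within-class scatter
        d = (m_k - m).reshape(-1, 1)
        S_B += len(X_k) * (d @ d.T)                 # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:n_components]].real
```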

4.1.7 The perceptron algorithm

The perceptron is a generalized linear model of the form:

y(x) = f(w^{T}\phi(x))

and the nonlinear activation function is given by:

f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}

An alternative error function, known as the perceptron criterion, is given by the following, where $\mathcal{M}$ denotes the set of misclassified patterns:

E_{P}(w) = -\sum_{n\in \mathcal{M}}w^T\phi_{n}t_{n}

Training then proceeds with the stochastic gradient descent algorithm, applied to the misclassified patterns:

w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_P(w) = w^{(\tau)} + \eta\phi_n t_n
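A minimal sketch of this update rule in NumPy (assuming targets in {-1, +1} and a precomputed design matrix `Phi`; the loop structure and stopping rule are my own choices):

```python
import numpy as np

def train_perceptron(Phi, t, eta=1.0, max_epochs=100):
    """Cycle through the data, applying w <- w + eta * phi_n * t_n to misclassified points."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:        # misclassified (or exactly on the boundary)
                w += eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                     # converged (guaranteed only if separable)
            break
    return w
```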

4.2 Probabilistic Generative Models

For the two-class problem, the posterior probability of class $C_1$ can be written as:

p(C_{1} | x) = \sigma(a)

where $a = \ln\frac{p(x|C_{1})p(C_{1})}{p(x|C_{2})p(C_{2})}$ and $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the logistic sigmoid function.

For the case of K>2, we have:

p(C_{k} | x) = \frac{p(x|C_{k})p(C_{k})}{\sum_{j}p(x|C_{j})p(C_{j})} = \frac{\exp(a_{k})}{\sum_{j}\exp(a_{j})}

where $a_{k} = \ln\left(p(x | C_{k})p(C_{k})\right)$; this normalized exponential is known as the softmax function.

4.2.1 Continuous inputs

Assume that the class-conditional densities are Gaussian and that all classes share the same covariance matrix. The density for class $C_k$ is then given by:

p(x | C_{k}) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{ -\frac{1}{2}(x-\mu_{k})^{T}\Sigma^{-1}(x-\mu_{k}) \right\}

Considering first the case of two classes, we have:

p(C_{1} | x) = \sigma(w^{T}x + w_{0})

where $w = \Sigma^{-1}(\mu_{1} - \mu_{2})$ and $w_{0} = -\frac{1}{2}\mu_{1}^{T}\Sigma^{-1}\mu_{1} + \frac{1}{2}\mu_{2}^{T}\Sigma^{-1}\mu_{2} + \ln\frac{p(C_{1})}{p(C_{2})}$.

For the general case of $K$ classes, we have:

a_k(x) = w_k^Tx + w_{k0}

where $w_k = \Sigma^{-1}\mu_k$ and $w_{k0} = -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln p(C_k)$.

4.2.2 Maximum likelihood solution

Once we have specified a parametric form for the class-conditional densities, we can determine the values of the parameters and the prior class probabilities using maximum likelihood.

For the case of two classes, we have:

p(\mathbf{t}, \mathbf{X} | \pi, \mu_{1}, \mu_{2}, \Sigma) = \prod_{n=1}^{N}[\pi N(x_{n} | \mu_{1}, \Sigma)]^{t_{n}}[(1-\pi)N(x_{n} | \mu_{2}, \Sigma)]^{1-t_{n}}

Maximizing with respect to the parameters gives $\pi = N_1/N$ together with:

\mu_{1} = \frac{1}{N_{1}}\sum_{n=1}^{N}t_{n}x_{n}

\mu_{2} = \frac{1}{N_{2}}\sum_{n=1}^{N}(1-t_{n})x_{n}

\Sigma = \frac{N_{1}}{N}S_{1} + \frac{N_{2}}{N}S_{2}

where $S_{1} = \frac{1}{N_{1}}\sum_{n\in C_{1}}(x_{n} - \mu_{1})(x_{n} - \mu_{1})^{T}$ and $S_{2} = \frac{1}{N_{2}}\sum_{n\in C_{2}}(x_{n} - \mu_{2})(x_{n} - \mu_{2})^{T}$.

The extension to the multi-class case follows the same idea.
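As a concrete illustration, a minimal sketch of these two-class maximum likelihood estimates and the resulting posterior (the function names and the {0, 1} coding of the targets are my own assumptions):

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """Two-class ML fit: t in {0, 1}, shared covariance Sigma = (N1/N) S1 + (N2/N) S2."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)
    return pi, mu1, mu2, Sigma

def posterior_class1(x, pi, mu1, mu2, Sigma):
    """p(C1|x) = sigma(w^T x + w0), with w and w0 as in Section 4.2.1."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```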

4.2.3 Discrete features

In this case, we have class-conditional distributions of the form:

p(x|C_k)=\prod_{i=1}^D\mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}

Substituting into the expression for $a_k(x)$ then gives linear functions of the input values $x_i$:

a_k(x)=\sum_{i=1}^D\left[x_i \ln\mu_{ki}+(1-x_i)\ln(1-\mu_{ki})\right]+\ln p(C_k)
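A tiny sketch of evaluating this quantity for one class (assuming binary features and previously estimated parameters `mu_k` and `prior_k`; the clipping is my own addition to guard against taking the log of zero):

```python
import numpy as np

def discrete_a_k(x, mu_k, prior_k, eps=1e-12):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k), x binary."""
    mu_k = np.clip(mu_k, eps, 1 - eps)          # avoid log(0) for degenerate estimates
    return np.sum(x * np.log(mu_k) + (1 - x) * np.log(1 - mu_k)) + np.log(prior_k)
```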

4.2.4 Exponential family

For members of the exponential family, the distribution of $x$ can be written in the form:

p(x|\lambda_k)=h(x)g(\lambda_k)\exp\{\lambda_k^Tu(x)\}

If we let $u(x) = x$ and introduce a scaling parameter $s$, we obtain:

p(x | \lambda_{k}, s) = \frac{1}{s}h\left(\frac{1}{s}x\right)g(\lambda_{k})\exp\left\{\frac{1}{s}\lambda_{k}^{T}x\right\}

Consequently, for the two-class problem the posterior class probability is given by a logistic sigmoid acting on a linear function $a(x)$:

a(x)=(\lambda_1-\lambda_2)^Tx+\ln g(\lambda_1)-\ln g(\lambda_2)+\ln p(C_1)-\ln p(C_2)

And for the $K$-class problem:

a_{k}(x) = \frac{1}{s}\lambda_{k}^{T}x + \ln g(\lambda_{k}) + \ln p(C_{k})

4.3 Probabilistic Discriminative Models

4.3.1 Fixed basis functions

4.3.2 Logistic regression

For the two-class classification problem, the posterior probability of class $C_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector $\phi$:

p(C_{1} | \phi) = y(\phi) = \sigma(w^{T}\phi)

We now use maximum likelihood to determine the parameters of the logistic regression model. The likelihood function is:

p(\mathbf{t} | w) = \prod_{n=1}^{N}y_{n}^{t_{n}}\{1-y_{n}\}^{1-t_{n}}

where $\mathbf{t} = (t_1, ..., t_N)^T$ and $y_n = p(C_1|\phi_n)$. We can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form:

E(w) = -\ln p(\mathbf{t} | w) = -\sum_{n=1}^{N}\{ t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n}) \}

where $y_n = \sigma(a_n)$ and $a_n = w^T\phi_n$. Taking the gradient of the error function with respect to $w$, we obtain:

\nabla E(w) = \sum_{n=1}^{N}(y_{n} - t_{n})\phi_{n}
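This gradient leads directly to a simple gradient-descent fit; a minimal batch sketch, with the learning rate and iteration count chosen arbitrarily for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(Phi, t, eta=0.1, n_steps=1000):
    """Batch gradient descent on the cross-entropy error; the gradient is Phi^T (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        y = sigmoid(Phi @ w)
        w -= eta * Phi.T @ (y - t)
    return w
```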

4.3.3 Iterative reweighted least squares

The Newton-Raphson update for minimizing a function $E(w)$ takes the form:

w^{new} = w^{old} - \mathbf{H}^{-1}\nabla E(w)

Applying the Newton-Raphson update to the cross-entropy error function of the logistic regression model, the gradient and Hessian of the error function are given by:

\nabla E(w) = \sum_{n=1}^{N}(y_{n} - t_{n})\phi_{n} = \Phi^{T}(\mathbf{y} - \mathbf{t})

H = \nabla\nabla E(w) = \sum_{n=1}^{N}y_{n}(1-y_{n})\phi_{n}\phi_{n}^{T} = \Phi^{T}R\Phi

where $R$ is the $N \times N$ diagonal weighting matrix with elements

R_{nn} = y_{n}(1-y_{n})

The Newton-Raphson update formula for the logistic regression model becomes:

w^{new} = w^{old} - (\Phi^{T}R\Phi)^{-1}\Phi^{T}(\mathbf{y} - \mathbf{t}) = (\Phi^{T}R\Phi)^{-1}\Phi^{T}R\mathbf{z}

where

\mathbf{z} = \Phi w^{old} - R^{-1}(\mathbf{y} - \mathbf{t})
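A compact sketch of this IRLS loop (the clipping of the diagonal of $R$ is my own safeguard against saturated outputs near 0 or 1, not part of the algorithm as stated):

```python
import numpy as np

def fit_logistic_irls(Phi, t, n_iter=10):
    """Iterative reweighted least squares: w <- (Phi^T R Phi)^{-1} Phi^T R z."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        r = np.clip(y * (1 - y), 1e-10, None)        # diagonal of R
        z = Phi @ w - (y - t) / r                    # z = Phi w_old - R^{-1}(y - t)
        PhiR = Phi * r[:, None]                      # rows of Phi scaled by R_nn
        w = np.linalg.solve(PhiR.T @ Phi, PhiR.T @ z)
    return w
```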

4.3.4 Multiclass logistic regression

For this problem, the posterior probabilities are given by:

p(C_k|\phi)=y_k(\phi)=\frac{\exp(a_k)}{\sum_j \exp(a_j)}

where $a_k = w^T_k\phi$.

Similarly, we can write down the likelihood function:

p(\mathbf{T}|w_1,...,w_K)=\prod_{n=1}^N\prod_{k=1}^K p(C_k|\phi_n)^{t_{nk}}=\prod_{n=1}^N\prod_{k=1}^K y_{nk}^{t_{nk}}

The cross-entropy error function for the multiclass classification problem is then:

E(w_1,...,w_K)=-\ln p(\mathbf{T}|w_1,...,w_K)=-\sum_{n=1}^N\sum_{k=1}^K t_{nk}\ln y_{nk}

The gradient and the blocks of the Hessian are given by:

\nabla_{w_j}E(w_1,...,w_K)=\sum_{n=1}^N(y_{nj}-t_{nj})\phi_n

\nabla_{w_k}\nabla_{w_j}E(w_1,...,w_K)=\sum_{n=1}^N y_{nk}(I_{kj}-y_{nj})\phi_n\phi_n^T
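A small sketch of evaluating the softmax outputs and the gradient (stacking the $w_k$ as columns of a matrix `W` is my own convention here):

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)          # subtract row max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_gradient(Phi, T, W):
    """Gradient of the multiclass cross-entropy, one column per class: Phi^T (Y - T).

    Phi: (N, M) design matrix, T: (N, K) 1-of-K targets, W: (M, K) weights."""
    Y = softmax(Phi @ W)                          # y_nk = softmax(a_n)_k with a_n = W^T phi_n
    return Phi.T @ (Y - T)
```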

4.3.5 Probit regression

If the value of a threshold $\theta$ is drawn from a probability density $p(\theta)$, then the corresponding activation function is given by the cumulative distribution function:

f(a)=\int^a_{-\infty}p(\theta)\,d\theta

Suppose that the density is given by a zero-mean, unit-variance Gaussian:

\Phi(a) = \int_{-\infty}^{a} N(\theta | 0, 1)\, d\theta

which is known as the probit function. Many numerical packages provide for the evaluation of a closely related function defined by:

\mathrm{erf}(a) = \frac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^{2})\,d\theta

which is known as the erf function, or error function. It is related to the probit function by:

\Phi(a) = \frac{1}{2}\left\{ 1 + \mathrm{erf}\left(\frac{a}{\sqrt{2}}\right) \right\}

The generalized linear model based on a probit activation function is known as probit regression.
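A quick numerical check of the relation above, assuming SciPy is available:

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

a = np.linspace(-3.0, 3.0, 7)
probit = norm.cdf(a)                           # Phi(a), the standard normal CDF
via_erf = 0.5 * (1.0 + erf(a / np.sqrt(2.0)))  # Phi(a) expressed through erf
print(np.allclose(probit, via_erf))            # prints True
```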

4.3.6 Canonical link functions

If we assume that the conditional distribution of the target variable comes from the exponential family, and we choose the corresponding canonical link function as the activation function (the link function being the inverse of the activation function), then the gradient of the error function takes the simple form:

\nabla E(w) = \frac{1}{s}\sum_{n=1}^{N}\{y_{n} - t_{n}\}\phi_{n}

For the Gaussian, $s = \beta^{-1}$, whereas for the logistic model, $s = 1$.

4.4 The Laplace Approximation

The Laplace approximation aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Suppose the distribution is defined by:

p(z) = \frac{1}{Z}f(z)

where $Z = \int f(z)\,dz$ is the (possibly unknown) normalization coefficient.

Expanding $\ln f(z)$ around a stationary point $z_0$ (a mode of the distribution) gives:

\ln f(z) \simeq \ln f(z_{0}) - \frac{1}{2}(z-z_{0})^{T}A(z-z_{0})

where $A = -\nabla\nabla\ln f(z)|_{z=z_{0}}$. Taking the exponential of both sides, we obtain:

f(z) \simeq f(z_{0})\exp \left\{ -\frac{1}{2}(z - z_{0})^{T}A(z-z_{0}) \right\}

Since the Gaussian approximation $q(z)$ is proportional to $f(z)$, normalizing gives:

q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}}\exp \left\{ -\frac{1}{2}(z-z_{0})^{T}A(z-z_{0}) \right\} = N(z | z_{0}, A^{-1})
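As a toy illustration, a one-dimensional sketch that finds the mode numerically and uses a finite-difference second derivative for $A$ (the optimizer, step sizes, and example function are arbitrary choices for illustration):

```python
import numpy as np

def laplace_approximation_1d(log_f, z_init=0.0, lr=0.1, n_iter=500, h=1e-3):
    """Return (z0, variance) of the Gaussian q(z) = N(z | z0, A^{-1}) approximating
    p(z) proportional to f(z), where A = -d^2 ln f / dz^2 at the mode z0."""
    grad = lambda z: (log_f(z + h) - log_f(z - h)) / (2 * h)
    z = z_init
    for _ in range(n_iter):
        z += lr * grad(z)                                     # gradient ascent to the mode
    A = -(log_f(z + h) - 2 * log_f(z) + log_f(z - h)) / h**2  # negative second derivative
    return z, 1.0 / A

# Example: approximate a density with ln f(z) = -z**4 / 4 + z (mode at z = 1)
z0, var = laplace_approximation_1d(lambda z: -z**4 / 4 + z)
```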

4.4.1 Model comparison and BIC

4.5 Bayesian Logistic Regression

4.5.1 Laplace approximation

We seek a Gaussian approximation to the posterior distribution. With a Gaussian prior $p(w) = N(w | w_0, S_0)$, the log posterior takes the form:

\ln p(w | \mathbf{t}) = -\frac{1}{2}(w - w_{0})^{T}S_{0}^{-1}(w - w_{0}) + \sum_{n=1}^{N}\{ t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n})\} + \mathrm{const}

The covariance of the approximation is then given by the inverse of the matrix of second derivatives of the negative log posterior:

S_{N}^{-1} = -\nabla\nabla \ln p(w | \mathbf{t}) = S_{0}^{-1} + \sum_{n=1}^{N}y_{n}(1-y_{n})\phi_{n}\phi_{n}^{T}

The Gaussian approximation to the posterior distribution therefore takes the form:

q(w) = N(w | w_{MAP}, S_{N})

4.5.2 Predictive distribution

Marginalizing over this Gaussian approximation to the posterior, the predictive distribution for class $C_1$ becomes:

p(C_{1} | \mathbf{t}) = \int \sigma(a)p(a)\,da = \int\sigma(a)N(a | \mu_{a}, \sigma_{a}^{2})\,da
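This integral of a sigmoid against a Gaussian has no closed form; a simple Monte Carlo estimate is sketched below (the sample count and seed are arbitrary, and in practice $\mu_a$ and $\sigma_a^2$ would come from the Laplace posterior):

```python
import numpy as np

def predictive_probability(mu_a, sigma2_a, n_samples=100_000, seed=0):
    """Monte Carlo estimate of p(C1|t) = E[sigma(a)] with a ~ N(mu_a, sigma2_a)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(mu_a, np.sqrt(sigma2_a), size=n_samples)
    return float(np.mean(1.0 / (1.0 + np.exp(-a))))
```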
