The simplest representation of a linear discriminant function can be expressed as:
$$y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0$$
The normal distance from the origin to the decision surface is given by
$$\frac{\mathbf{w}^{\mathrm{T}}\mathbf{x}}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}$$
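As a quick numerical check of this distance formula (hypothetical numbers, NumPy assumed):

```python
import numpy as np

w = np.array([3.0, 4.0])    # weight vector (made-up values)
w0 = -10.0                  # bias

# Normal distance from the origin to the decision surface y(x) = 0
print(-w0 / np.linalg.norm(w))              # 2.0

# More generally, y(x) / ||w|| is the signed distance of a point x from the surface
x = np.array([2.0, 2.0])
print((w @ x + w0) / np.linalg.norm(w))     # 0.8
```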
4.1.2 Multiple classes
Consider a single K-class discriminant comprising $K$ linear functions of the form
$$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0}$$
We assign a point $\mathbf{x}$ to class $\mathcal{C}_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$. The decision boundary between class $\mathcal{C}_k$ and class $\mathcal{C}_j$ is given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$, and hence corresponds to a $(D-1)$-dimensional hyperplane defined by
$$(\mathbf{w}_k - \mathbf{w}_j)^{\mathrm{T}}\mathbf{x} + (w_{k0} - w_{j0}) = 0$$
4.1.3 Least squares for classification
Each class Ck is described by its own linear model so that
$$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0}$$
and we can group these together to get
$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}}\widetilde{\mathbf{x}}$$
The sum-of-squares error function can be written as
$$E_D(\widetilde{\mathbf{W}}) = \frac{1}{2}\operatorname{Tr}\left\{(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})^{\mathrm{T}}(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})\right\}$$
Setting the derivative with respect to $\widetilde{\mathbf{W}}$ to zero, we obtain the solution for $\widetilde{\mathbf{W}}$:
$$\widetilde{\mathbf{W}} = (\widetilde{\mathbf{X}}^{\mathrm{T}}\widetilde{\mathbf{X}})^{-1}\widetilde{\mathbf{X}}^{\mathrm{T}}\mathbf{T} = \widetilde{\mathbf{X}}^{\dagger}\mathbf{T}$$
We then obtain the discriminant function in the form
$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}}\widetilde{\mathbf{x}} = \mathbf{T}^{\mathrm{T}}\left(\widetilde{\mathbf{X}}^{\dagger}\right)^{\mathrm{T}}\widetilde{\mathbf{x}}$$
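As a quick illustration, here is a minimal NumPy sketch of this least-squares classifier; the toy data, class layout, and variable names are my own and purely hypothetical.

```python
import numpy as np

# Toy data: N points in D=2 dimensions, K=3 classes with 1-of-K target coding.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat([0, 1, 2], 30)
T = np.eye(3)[labels]                                 # N x K target matrix

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x0 = 1 for the bias

# W~ = (X~^T X~)^{-1} X~^T T, i.e. the pseudo-inverse solution
W_tilde = np.linalg.pinv(X_tilde) @ T

# Discriminant y(x) = W~^T x~; assign each point to the class with the largest output
y = X_tilde @ W_tilde
pred = np.argmax(y, axis=1)
print("training accuracy:", np.mean(pred == labels))
```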
4.1.4 Fisher’s linear discriminant
The mean vectors of the two classes are given by
$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{n \in \mathcal{C}_1}\mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n \in \mathcal{C}_2}\mathbf{x}_n$$
To keep the projected classes well separated, we might choose $\mathbf{w}$ to maximize the separation of the projected class means, $m_2 - m_1 = \mathbf{w}^{\mathrm{T}}(\mathbf{m}_2 - \mathbf{m}_1)$, where $m_k = \mathbf{w}^{\mathrm{T}}\mathbf{m}_k$ is the mean of the projected data from class $\mathcal{C}_k$.
The within-class variance of the transformed data from class $\mathcal{C}_k$ is given by:
$$s_k^2 = \sum_{n \in \mathcal{C}_k}(y_n - m_k)^2$$
where $y_n = \mathbf{w}^{\mathrm{T}}\mathbf{x}_n$. The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance:
$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$$
The Fisher criterion can be rewritten in the form
$$J(\mathbf{w}) = \frac{\mathbf{w}^{\mathrm{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\mathrm{T}}\mathbf{S}_W\mathbf{w}}$$
where $\mathbf{S}_B$ is the between-class covariance matrix and $\mathbf{S}_W$ is the within-class covariance matrix.
Differentiating with respect to $\mathbf{w}$, we find that $J(\mathbf{w})$ is maximized when:
$$(\mathbf{w}^{\mathrm{T}}\mathbf{S}_B\mathbf{w})\,\mathbf{S}_W\mathbf{w} = (\mathbf{w}^{\mathrm{T}}\mathbf{S}_W\mathbf{w})\,\mathbf{S}_B\mathbf{w}$$
We then obtain:
$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$
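A minimal NumPy sketch of the two-class Fisher direction under these formulas (toy data and names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))   # class C1 samples
X2 = rng.normal([4, 2], 1.0, size=(50, 2))   # class C2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)    # class means

# Within-class covariance S_W, summed over both classes
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction w ∝ S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# The projected class means should be well separated along w
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())
```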
4.1.5 Relation to least squares
The Fisher solution can be obtained as a special case of least squares for the two-class problem. For class $\mathcal{C}_1$ we take the targets to be $N/N_1$, and for $\mathcal{C}_2$ we take them to be $-N/N_2$.
The sum-of-squares error function can be written:
$$E = \frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + w_0 - t_n\right)^2$$
Setting the derivatives of $E$ with respect to $w_0$ and $\mathbf{w}$ to zero and rearranging, we again obtain $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$, so least squares with this choice of targets reproduces the Fisher solution.
4.1.6 Fisher's discriminant for multiple classes
In the multi-class case, the total covariance matrix is
$$\mathbf{S}_T = \sum_{n=1}^{N}(\mathbf{x}_n - \mathbf{m})(\mathbf{x}_n - \mathbf{m})^{\mathrm{T}}$$
and it decomposes as $\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B$.
Next we introduce $D' > 1$ linear 'features' $y_k = \mathbf{w}_k^{\mathrm{T}}\mathbf{x}$, and the feature values can be grouped together as $\mathbf{y} = \mathbf{W}^{\mathrm{T}}\mathbf{x}$. We can define similar matrices in the projected $D'$-dimensional y-space:
$$\mathbf{s}_W = \sum_{k=1}^{K}\sum_{n \in \mathcal{C}_k}(\mathbf{y}_n - \boldsymbol{\mu}_k)(\mathbf{y}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}$$
$$\mathbf{s}_B = \sum_{k=1}^{K} N_k(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^{\mathrm{T}}$$
$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n \in \mathcal{C}_k}\mathbf{y}_n, \qquad \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{K} N_k\boldsymbol{\mu}_k$$
One of the many possible choices of criterion is $J(\mathbf{W}) = \operatorname{Tr}\left(\mathbf{s}_W^{-1}\mathbf{s}_B\right)$, and it is straightforward to see that the quantity to maximize is
$$J(\mathbf{W}) = \operatorname{Tr}\left[(\mathbf{W}\mathbf{S}_W\mathbf{W}^{\mathrm{T}})^{-1}(\mathbf{W}\mathbf{S}_B\mathbf{W}^{\mathrm{T}})\right]$$
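Maximizing this trace criterion leads, as in standard treatments, to building the projection from the leading eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$. A sketch under that assumption; the function name `fisher_projection` is my own:

```python
import numpy as np

def fisher_projection(X, labels, d_prime):
    """Return a D x d_prime projection matrix from the eigenvectors of S_W^{-1} S_B."""
    classes = np.unique(labels)
    m = X.mean(axis=0)                       # overall mean
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in classes:
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)                 # within-class scatter
        S_B += len(Xk) * np.outer(mk - m, mk - m)      # between-class scatter
    # Eigenvectors of S_W^{-1} S_B with the largest eigenvalues span the projection
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:d_prime]]
```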
4.1.7 The perceptron algorithm
A generalized linear model takes the form:
$$y(\mathbf{x}) = f\left(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})\right)$$
and the nonlinear activation function is given by:
$$f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$$
Here is an alternative error function, known as the perceptron criterion:
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}}\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n t_n$$
where $\mathcal{M}$ denotes the set of misclassified patterns and $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$.
Training then proceeds by applying the stochastic gradient descent algorithm to this error function:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\,\boldsymbol{\phi}_n t_n$$
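A minimal sketch of this stochastic gradient descent procedure for the perceptron; the function name and data layout are my own, and targets are assumed to be in {−1, +1}:

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Phi: N x M design matrix of features phi(x_n); t: targets in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        misclassified = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:          # point contributes to E_P
                w = w + eta * phi_n * t_n       # w <- w + eta * phi_n * t_n
                misclassified += 1
        if misclassified == 0:                  # converged (requires separable data)
            break
    return w
```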
4.2 Probabilistic Generative Models
For the problem of two classes, the posterior probability for class $\mathcal{C}_1$ can be written as
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \sigma(a)$$
where
$$a = \ln\frac{p(\mathbf{x} \mid \mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2)p(\mathcal{C}_2)}$$
and $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the logistic sigmoid.
4.3.2 Logistic regression
For the two-class classification problem, the posterior probability of class $\mathcal{C}_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector:
$$p(\mathcal{C}_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})$$
We now use maximum likelihood estimation (MLE) to determine the parameters of the logistic regression model. The likelihood function is:
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\{1 - y_n\}^{1 - t_n}$$
where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$ and $y_n = p(\mathcal{C}_1 \mid \boldsymbol{\phi}_n)$. We can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form:
$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N}\left\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right\}$$
where $y_n = \sigma(a_n)$ and $a_n = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n$. Taking the gradient of the error function with respect to $\mathbf{w}$, we obtain:
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\boldsymbol{\phi}_n$$
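For illustration, a short sketch of this gradient and a plain gradient-descent step; the function names and step size are my own choices (the IRLS update developed below is the more direct route):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_grad(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n, with y_n = sigma(w^T phi_n), t_n in {0, 1}."""
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)

def gradient_step(w, Phi, t, eta=0.1):
    """One plain gradient-descent update on the cross-entropy error."""
    return w - eta * cross_entropy_grad(w, Phi, t)
```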
4.3.3 Iterative reweighted least squares
The Newton-Raphson update, for minimizing a function E(w), takes the form:
$$\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - \mathbf{H}^{-1}\nabla E(\mathbf{w})$$
Applying the Newton-Raphson update to the cross-entropy error function for the logistic regression model, the gradient and Hessian of the error function are given by:
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{y} - \mathbf{t})$$
$$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n(1 - y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm{T}} = \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi}$$
where $\mathbf{R}$ is the $N \times N$ diagonal matrix with elements $R_{nn} = y_n(1 - y_n)$.
The Newton-Raphson update formula for the logistic regression model becomes:
$$\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - (\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{y} - \mathbf{t}) = (\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\mathbf{z}$$
where
$$\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{\text{old}} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$$
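A self-contained IRLS sketch following these updates; the function name is hypothetical, and the small `jitter` term for numerical stability is my own addition:

```python
import numpy as np

def irls_logistic(Phi, t, n_iter=20, jitter=1e-8):
    """Iterative reweighted least squares for two-class logistic regression."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))            # y_n = sigma(w^T phi_n)
        R = y * (1.0 - y)                               # diagonal entries of R
        z = Phi @ w - (y - t) / np.maximum(R, jitter)   # working targets z
        # w_new = (Phi^T R Phi)^{-1} Phi^T R z
        A = Phi.T @ (R[:, None] * Phi) + jitter * np.eye(M)
        w = np.linalg.solve(A, Phi.T @ (R * z))
    return w
```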
4.3.4 Multiclass logistic regression
For this problem, the posterior probabilities are given by:
$$p(\mathcal{C}_k \mid \boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where $a_k = \mathbf{w}_k^{\mathrm{T}}\boldsymbol{\phi}$.
Similarly, using a 1-of-K coding scheme for the targets, we can write down the likelihood function:
$$p(\mathbf{T} \mid \mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}$$
where $y_{nk} = y_k(\boldsymbol{\phi}_n)$ and $\mathbf{T}$ is the $N \times K$ matrix of target variables $t_{nk}$.
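A small sketch of the softmax posteriors and the corresponding multiclass cross-entropy error (assuming 1-of-K targets; the names are mine):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax: y_nk = exp(a_nk) / sum_j exp(a_nj)."""
    A = A - A.max(axis=1, keepdims=True)        # subtract row max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(W, Phi, T):
    """E = -sum_n sum_k t_nk ln y_nk, with activations a_nk = w_k^T phi_n (W is M x K)."""
    Y = softmax(Phi @ W)                        # N x K posterior probabilities
    return -np.sum(T * np.log(Y + 1e-12))
```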
4.3.5 Probit regression
If the value of $\theta$ is drawn from a probability density $p(\theta)$, then the corresponding activation function is given by the cumulative distribution function:
$$f(a) = \int_{-\infty}^{a} p(\theta)\,\mathrm{d}\theta$$
Suppose the density is given by a zero-mean, unit-variance Gaussian:
$$\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\,\mathrm{d}\theta$$
which is known as the probit function. Many numerical packages provide the evaluation of a closely related function defined by
$$\operatorname{erf}(a) = \frac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^2)\,\mathrm{d}\theta$$
which is known as the erf function (or error function). It is related to the probit function by
$$\Phi(a) = \frac{1}{2}\left\{1 + \operatorname{erf}\!\left(\frac{a}{\sqrt{2}}\right)\right\}$$
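This relation is easy to check numerically; a quick sketch assuming SciPy is available:

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

a = np.linspace(-3, 3, 7)
probit = 0.5 * (1.0 + erf(a / np.sqrt(2.0)))   # Phi(a) computed via the erf function
print(np.allclose(probit, norm.cdf(a)))        # True: matches the Gaussian CDF
```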
The generalized linear model based on a probit activation function is known as probit regression.
4.3.6 Canonical link functions
If we assume that the conditional distribution of the target variable comes from the exponential family, and we choose the activation function to be the canonical link function (the link function is the inverse of the activation function), then the gradient of the error function takes the simple form:
$$\nabla E(\mathbf{w}) = \frac{1}{s}\sum_{n=1}^{N}\{y_n - t_n\}\boldsymbol{\phi}_n$$
For the Gaussian noise model $s = \beta^{-1}$, whereas for the logistic model $s = 1$.
4.4 The Laplace Approximation
The Laplace approximation aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Suppose the distribution is defined by
$$p(\mathbf{z}) = \frac{1}{Z} f(\mathbf{z})$$
where $Z = \int f(\mathbf{z})\,\mathrm{d}\mathbf{z}$ is the normalization coefficient.
Expanding $\ln f(\mathbf{z})$ around a stationary point $\mathbf{z}_0$:
$$\ln f(\mathbf{z}) \simeq \ln f(\mathbf{z}_0) - \frac{1}{2}(\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}}\mathbf{A}(\mathbf{z} - \mathbf{z}_0)$$
where $\mathbf{A} = -\nabla\nabla\ln f(\mathbf{z})\big|_{\mathbf{z}=\mathbf{z}_0}$. Taking the exponential of both sides we obtain: