Summary
A support vector machine (SVM) is a supervised learning model used for classification and regression analysis. Given a set of labeled training examples, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a gap that is as wide as possible. SVMs can perform linear classification, and, using the kernel trick, they can also perform non-linear classification efficiently.
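As a minimal sketch of this in practice (assuming scikit-learn is available; any SVM library would do, and the toy data here is illustrative):

```python
from sklearn import svm

# XOR-like toy data: the two classes are not linearly separable.
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]

linear_clf = svm.SVC(kernel="linear").fit(X, y)  # linear classification
rbf_clf = svm.SVC(kernel="rbf", gamma=1.0).fit(X, y)  # kernel trick

print(linear_clf.predict([[0.9, 0.9]]))  # a linear separator cannot represent XOR
print(rbf_clf.predict([[0.9, 0.9]]))     # the RBF kernel handles it
```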
Definition
An SVM constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class: the larger this margin, the lower the generalization error of the classifier.
Often the sets to discriminate are not linearly separable in the original finite-dimensional space. In that case, the original finite-dimensional space is mapped into a much higher-dimensional space, where the separation is presumably easier. By defining a kernel function $k(x, y)$ suited to the problem, the mappings are designed so that dot products in the high-dimensional space can be computed easily in terms of the variables in the original space.
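To make "dot products computed easily" concrete, here is a small numeric check (a sketch using NumPy): the homogeneous polynomial kernel $(x \cdot y)^2$ on 2-D inputs equals an ordinary dot product after the explicit feature map $\varphi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2)$.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

kernel_value = np.dot(x, y) ** 2          # k(x, y) = (x . y)^2, computed in 2-D
explicit_value = np.dot(phi(x), phi(y))   # the same value, computed in 3-D

print(kernel_value, explicit_value)       # both are 16.0
```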
The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen as linear combinations, with parameters $\alpha_i$, of images of feature vectors that occur in the data set. With this choice of hyperplane, the points $x$ in the feature space that are mapped into the hyperplane are defined by the relation:
$$\sum_i \alpha_i\, k(x_i, x) = \text{constant}.$$
If $k(x, y)$ becomes small as $y$ grows farther from $x$, each term in the sum measures the closeness of the test point $x$ to the corresponding data point $x_i$. In this way, the sum of kernels can be used to measure the relative nearness of each test point to the data points originating in either of the sets to be discriminated.
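A direct sketch of that nearness measure (the two sets, the test point, and $\gamma$ below are illustrative), using a Gaussian kernel that decays with distance:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel: shrinks toward 0 as b moves away from a."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

set_A = np.array([[0.0, 0.0], [0.5, 0.0]])   # points from one set
set_B = np.array([[3.0, 3.0], [3.5, 3.0]])   # points from the other set
test = np.array([0.2, 0.1])

nearness_A = sum(rbf(p, test) for p in set_A)
nearness_B = sum(rbf(p, test) for p in set_B)
print(nearness_A > nearness_B)  # True: the test point is nearer to set A
```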
Motivation
The goal is to decide which class a new data point will be in. Of three candidate separating hyperplanes, H1 does not separate the classes; H2 does, but only with a small margin; H3 separates them with the maximum margin. We choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. This hyperplane is called the maximum-margin hyperplane.
Linear SVM
We are given a training set $D$ of $n$ points of the form $(x_i, y_i)$, where each $x_i$ is a $p$-dimensional real vector and $y_i \in \{-1, 1\}$ indicates the class of $x_i$. Any hyperplane can be written as the set of points $x$ satisfying
$$w \cdot x - b = 0,$$
where $w$ is the normal vector to the hyperplane. The parameter $b / \|w\|$ determines the offset of the hyperplane from the origin along $w$.
(Figure: maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.)
If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of points, described by the equations
$$w \cdot x - b = 1$$
and
$$w \cdot x - b = -1.$$
The distance between these two hyperplanes is $2 / \|w\|$, so maximizing the margin means minimizing $\|w\|$. In order to prevent data points from falling into the margin, we add the following constraint: for each $i$, either
$$w \cdot x_i - b \ge 1 \qquad \text{for } x_i \text{ of the first class,}$$
or
$$w \cdot x_i - b \le -1 \qquad \text{for } x_i \text{ of the second.}$$
This can be rewritten as:
$$y_i (w \cdot x_i - b) \ge 1, \quad \text{for all } 1 \le i \le n. \qquad (1)$$
Putting this together, we get the optimization problem:
Minimize (in $w, b$)
$$\|w\|$$
subject to (for any $i = 1, \dots, n$)
$$y_i (w \cdot x_i - b) \ge 1.$$
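As a quick sanity check of constraint (1) (a sketch; the toy data and the hand-picked $w$ and $b$ are illustrative and feasible, not necessarily optimal):

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

w = np.array([1.0, 1.0])  # hand-picked normal vector for this toy data
b = 2.0                   # hand-picked offset

margins = y * (X @ w - b)
print(margins)               # [2. 4. 2. 3.]
print(np.all(margins >= 1))  # True: (w, b) satisfies every constraint
```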
Primal form
The problem above is difficult to solve because it depends on $\|w\|$, which involves a square root. Substituting $\|w\|$ with $\frac{1}{2}\|w\|^2$ (the factor $\frac{1}{2}$ is for mathematical convenience) does not change the solution and yields a quadratic programming optimization problem: minimize (in $w, b$)
$$\frac{1}{2}\|w\|^2$$
subject to (for any $i = 1, \dots, n$)
$$y_i (w \cdot x_i - b) \ge 1.$$
By introducing Lagrange multipliers $\alpha$, the previous constrained problem can be expressed as
$$\min_{w, b} \max_{\alpha \ge 0} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] \right\}.$$
The Karush-Kuhn-Tucker (KKT) conditions imply that the solution can be expressed as a linear combination of the training vectors:
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$
Only a few $\alpha_i$ will be greater than zero. The corresponding $x_i$ are exactly the support vectors, which lie on the margin and satisfy $y_i (w \cdot x_i - b) = 1$. From this one can derive that the support vectors also satisfy
$$w \cdot x_i - b = 1 / y_i = y_i \iff b = w \cdot x_i - y_i,$$
which allows one to define the offset $b$. In practice, it is more robust to average over all $N_{SV}$ support vectors:
$$b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}} (w \cdot x_i - y_i).$$
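These two identities can be checked against a fitted library model (a sketch assuming scikit-learn; note that SVC stores the products $\alpha_i y_i$ in dual_coef_ and uses the decision rule $w \cdot x + b$, the opposite sign convention for the offset):

```python
import numpy as np
from sklearn import svm

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = svm.SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin

# w = sum_i alpha_i y_i x_i; SVC's dual_coef_ already stores alpha_i * y_i.
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()

# Offset in this document's convention (w.x - b = 0), averaged over the
# support vectors: b = mean(w.x_i - y_i). Because SVC's rule is w.x + intercept_,
# this b should equal -clf.intercept_.
sv = clf.support_vectors_
y_sv = y[clf.support_]
b = np.mean(sv @ w - y_sv)

print(w, b, -clf.intercept_)
```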
Dual form
Writing the classification rule in its unconstrained dual form reveals that the maximum-margin hyperplane, and therefore the classification task, is only a function of the support vectors. Using the fact that $\|w\|^2 = w \cdot w$ and substituting $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, one can show that the dual of the SVM reduces to the following optimization problem:
Maximize (in $\alpha_i$)
$$\tilde{L}(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^{\mathsf{T}} x_j = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$
subject to (for any $i = 1, \dots, n$)
$$\alpha_i \ge 0,$$
and to the constraint from the minimization in $b$:
$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$
Here the kernel is defined by $k(x_i, x_j) = x_i \cdot x_j$.
The weights can be computed from the $\alpha$ terms:
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$
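The dual is small enough to hand to a generic solver (a sketch using SciPy; the toy data is illustrative and SLSQP is just one choice of method):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                             # linear kernel matrix k(x_i, x_j)
Q = (y[:, None] * y[None, :]) * K

def neg_dual(alpha):
    # Negative of the dual objective, since scipy minimizes
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum alpha_i y_i = 0
    method="SLSQP",
)
alpha = res.x
w = (alpha * y) @ X
print(alpha.round(3), w)   # nonzero alphas mark the support vectors
```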
Soft margin
If there exists no hyperplane that can split the examples of the two classes, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces non-negative slack variables $\xi_i$, which measure the degree of misclassification of the data point $x_i$:
$$y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad 1 \le i \le n. \qquad (2)$$
The objective function is then increased by a term that penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes: minimize (in $w, \xi, b$)
$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
subject to (for any $i = 1, \dots, n$)
$$y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
The constraint in (2), along with the objective of minimizing $\|w\|$, can be solved using Lagrange multipliers as done above. One then has to solve the following problem:
$$\min_{w, \xi, b} \max_{\alpha, \beta} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \right\}$$
with $\alpha_i, \beta_i \ge 0$.
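At the optimum each slack variable takes the value $\xi_i = \max(0,\, 1 - y_i(w \cdot x_i - b))$, so the soft-margin objective can be evaluated directly (a sketch; the data and the values of $w$, $b$, and $C$ below are illustrative):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum of slacks, with the optimal slack for each
    point being xi_i = max(0, 1 - y_i (w.x_i - b))."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return 0.5 * np.dot(w, w) + C * slacks.sum(), slacks

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])   # the last point violates the margin

value, slacks = soft_margin_objective(np.array([1.0, 1.0]), 2.0, X, y, C=1.0)
print(slacks)  # nonzero slack only for points inside or beyond the margin
print(value)
```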
Dual form
Maximize (in $\alpha_i$)
$$\tilde{L}(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$
subject to (for any $i = 1, \dots, n$)
$$0 \le \alpha_i \le C$$
and
$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$
The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with $C$ appearing only as an additional constraint on the Lagrange multipliers.
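Compared with the hard-margin dual solved earlier, the only change needed in that SciPy sketch is the box constraint on the multipliers (the value of $C$ is illustrative):

```python
C = 10.0
bounds = [(0.0, C)] * n   # soft margin: 0 <= alpha_i <= C, instead of alpha_i >= 0
```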
Nonlinear classification
To obtain a non-linear classifier, one applies the kernel trick to the maximum-margin hyperplane: the resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.
Some common kernels include:
- Polynomial (homogeneous): $k(x_i, x_j) = (x_i \cdot x_j)^d$
- Polynomial (inhomogeneous): $k(x_i, x_j) = (x_i \cdot x_j + 1)^d$
- Gaussian radial basis function: $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, for $\gamma > 0$. Sometimes parametrized using $\gamma = 1/(2\sigma^2)$
- Hyperbolic tangent: $k(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j + c)$, for some (not every) $\kappa > 0$ and $c < 0$
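These kernels transcribe directly into code (a sketch; the hyperparameter defaults $d$, $\gamma$, $\kappa$, and $c$ are illustrative):

```python
import numpy as np

# Direct transcriptions of the kernels listed above.
def poly_homogeneous(x, z, d=2):
    return np.dot(x, z) ** d

def poly_inhomogeneous(x, z, d=2):
    return (np.dot(x, z) + 1) ** d

def gaussian_rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def hyperbolic_tangent(x, z, kappa=1.0, c=-1.0):
    return np.tanh(kappa * np.dot(x, z) + c)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (poly_homogeneous, poly_inhomogeneous, gaussian_rbf, hyperbolic_tangent):
    print(k.__name__, k(x, z))
```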
Properties
The effectiveness of an SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter $C$.
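A common way to pick these is a cross-validated grid search (a sketch assuming scikit-learn; the grid values and synthetic dataset are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# Try every (C, gamma) pair for an RBF-kernel SVM, scored by 5-fold CV.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```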
Implementation
The parameters of the maximum-margin hyperplane are derived by solving the optimization above; in practice, specialized algorithms solve the resulting quadratic programming problem efficiently.