Support vector machine - notes from Wikipedia

Summary

A support vector machine (SVM) is used for classification and regression analysis.

An SVM training algorithm builds a model that assigns new examples to one category or the other.

An SVM model is a representation of the examples as points in space.

SVM can perform linear classification and, using the kernel trick, non-linear classification.
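
As a quick illustration of both modes, a minimal scikit-learn sketch (the toy data and parameter values are my own placeholders, not from the original text):

```python
# Minimal sketch: linear vs. kernelized SVM classification with scikit-learn.
# The toy data and parameter values below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                  # 40 points in R^2
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # labels in {-1, +1}

linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)        # linear classification
rbf_clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)   # non-linear, via the kernel trick

print(linear_clf.predict([[1.0, 2.0]]), rbf_clf.predict([[1.0, 2.0]]))
```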


Definition

An SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.

A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class; in general, the larger the margin, the lower the generalization error of the classifier.

Sometimes the sets to discriminate are not linearly separable in the original finite-dimensional space; in that case, the original finite-dimensional space is mapped into a much higher-dimensional space, where the separation is presumably easier.

The mappings are designed so that dot products in the higher-dimensional space can be computed easily in terms of a kernel function K(x,y) on the original space.

The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant.

The vectors defining the hyperplanes can be chosen as linear combinations, with parameters \alpha_i, of the images of the data points in the feature space.

The points x in the feature space that are mapped into the hyperplane are then defined by the relation: \textstyle\sum_i \alpha_i K(x_i,x) = \mathrm{constant}.

This sum of kernels can be used to measure the relative nearness of each test point to the data points originating in either of the sets to be discriminated.
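
To make the "dot products computed easily" point concrete, here is a small numpy sketch (an illustrative example of my own, not from the original text): the homogeneous degree-2 polynomial kernel K(x, y) = (x \cdot y)^2 equals the ordinary dot product of explicitly mapped features \varphi(x) \cdot \varphi(y).

```python
# Sketch: a kernel evaluates a dot product in a higher-dimensional feature space
# without building that space explicitly. Degree-2 polynomial kernel on R^2.
import numpy as np

def phi(v):
    # Explicit feature map for K(x, y) = (x . y)^2 on R^2:
    # phi(v) = (v1^2, v2^2, sqrt(2) * v1 * v2)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

def K(x, y):
    return np.dot(x, y) ** 2        # cheap: one dot product in the original space

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y), np.dot(phi(x), phi(y)))   # both print 1.0
```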


Motivation

The goal is to decide which class a new data point will be in.

[Figure: Svm separating hyperplanes (SVG).svg]

H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximum margin.

We choose the hyperplane so that the distance from it to the nearest data point on each side is maximized.

Such a hyperplane is called the maximum-margin hyperplane.

Linear SVM


Given a training set \mathcal{D} of n points of the form

\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^p,\, y_i \in \{-1,1\}\right\}_{i=1}^n

Each \mathbf{x}_i is a p-dimensional real vector.

Any hyperplane can be written as the set of points \mathbf{x} satisfying

\mathbf{w}\cdot\mathbf{x} - b=0,\,

where \mathbf{w} is the normal vector to the hyperplane.

[Figure: Svm max sep hyperplane with margin.png]

Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

If the training data are linearly separable, we can select two hyperplanes that separate the points and have no points between them; they can be described by the equations

\mathbf{w}\cdot\mathbf{x} - b=1\,

and

\mathbf{w}\cdot\mathbf{x} - b=-1.\,

The distance between these two hyperplanes is \tfrac{2}{\|\mathbf{w}\|}, so to maximize the margin we want to minimize \|\mathbf{w}\|.

In order to prevent data points from falling into the margin, we add the following constraint: for each i either

\mathbf{w}\cdot\mathbf{x}_i - b \ge 1\qquad\text{ for }\mathbf{x}_i of the first class

or

\mathbf{w}\cdot\mathbf{x}_i - b \le -1\qquad\text{ for }\mathbf{x}_i of the second.

This can be rewritten as:

y_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1, \quad \text{ for all } 1 \le i \le n.\qquad\qquad(1)

Minimize (in {\mathbf{w},b})

\|\mathbf{w}\|

subject to (for any i = 1, \dots, n)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1. \,
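
As a quick illustration of what this asks for, here is a numpy sketch (the toy points and the candidate \mathbf{w}, b are my own assumptions) that checks the constraints for a candidate hyperplane and reads off the margin width \tfrac{2}{\|\mathbf{w}\|}:

```python
# Sketch: checking the hard-margin constraints and margin width for a
# candidate (w, b). The data and candidate values are illustrative.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.array([0.5, 0.5])   # candidate normal vector
b = 0.0                    # candidate offset

feasible = np.all(y * (X @ w - b) >= 1)   # constraint (1) for every point
margin = 2.0 / np.linalg.norm(w)          # distance between the two margin hyperplanes
print(feasible, margin)
```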

Primal form

For mathematical convenience, and without changing the solution, we substitute \|\mathbf{w}\| with \tfrac{1}{2}\|\mathbf{w}\|^2. This turns the problem into a quadratic programming optimization:

\arg\min_{(\mathbf{w},b)}\frac{1}{2}\|\mathbf{w}\|^2

subject to (for any i = 1, \dots, n)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1.
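
This quadratic program can be passed directly to a general-purpose convex solver. A minimal sketch using cvxpy (my choice of library, not something prescribed by the text; the toy data are placeholders):

```python
# Sketch: hard-margin primal SVM as a quadratic program, solved with cvxpy.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, p = X.shape

w = cp.Variable(p)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w - b) >= 1]          # y_i (w . x_i - b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```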

By introducing Lagrange multipliers \boldsymbol{\alpha}, the previous constrained problem can be expressed as

\arg\min_{\mathbf{w},b } \max_{\boldsymbol{\alpha}\geq 0 } \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n}{\alpha_i[y_i(\mathbf{w}\cdot \mathbf{x_i} - b)-1]} \right\}

The Karush-Kuhn-Tucker (KKT) conditions imply that the solution can be expressed as a linear combination of the training vectors:

\mathbf{w} = \sum_{i=1}^n{\alpha_i y_i\mathbf{x_i}}.

Only a few \alpha_i will be greater than zero. The corresponding \mathbf{x_i} are exactly the support vectors, which lie on the margin and satisfy y_i(\mathbf{w}\cdot\mathbf{x_i} - b) = 1. From this one can derive that the support vectors also satisfy

\mathbf{w}\cdot\mathbf{x_i} - b = 1 / y_i = y_i \iff b = \mathbf{w}\cdot\mathbf{x_i} - y_i

which allows one to define the offset b. In practice, it is more robust to average over all N_{SV} support vectors:

b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}}{(\mathbf{w}\cdot\mathbf{x_i} - y_i)}
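
Given \alpha_i values from a solver, a small numpy sketch (my own helper, with hypothetical variable names) that recovers \mathbf{w} and averages b over the support vectors:

```python
# Sketch: reconstruct w = sum_i alpha_i y_i x_i and average b over the
# support vectors (those with alpha_i > 0, up to a numerical tolerance).
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    w = (alpha * y) @ X                      # sum_i alpha_i y_i x_i
    sv = alpha > tol                         # indices of support vectors
    b = np.mean(X[sv] @ w - y[sv])           # average of w . x_i - y_i over the SVs
    return w, b
```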

Dual form

Writing the classification rule reveals that the maximum-margin hyperplane and therefore the classification task is only a function of the support vectors.

Using the fact that \|\mathbf{w}\|^2 = \mathbf{w}\cdot \mathbf{w} and substituting \mathbf{w} = \sum_{i=1}^n{\alpha_i y_i\mathbf{x_i}}, one can show that the dual of the SVM reduces to the following optimization problem:

Maximize (in \alpha_i )

\tilde{L}(\mathbf{\alpha})=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)

subject to (for any i = 1, \dots, n)

\alpha_i \geq 0,\,

and to the constraint from the minimization in b

\sum_{i=1}^n \alpha_i y_i = 0.

Here the kernel is defined by k(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i\cdot\mathbf{x}_j.

\mathbf{w} can be computed from the \alpha terms:

\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i.
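
A sketch of this dual problem in cvxpy (toy data of my own; the quadratic term \sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i\cdot\mathbf{x}_j is rewritten as \|\sum_i \alpha_i y_i \mathbf{x}_i\|^2 so the solver sees a standard convex form):

```python
# Sketch: hard-margin dual SVM with the linear kernel k(x_i, x_j) = x_i . x_j.
# Uses sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j) = || sum_i alpha_i y_i x_i ||^2.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

alpha = cp.Variable(len(y))
w_expr = cp.multiply(alpha, y) @ X                        # sum_i alpha_i y_i x_i
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(w_expr))
constraints = [alpha >= 0, alpha @ y == 0]                # alpha_i >= 0, sum_i alpha_i y_i = 0
cp.Problem(objective, constraints).solve()

print(alpha.value)   # only the support vectors get alpha_i > 0
```
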
Soft margin

The soft margin method chooses a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.

The method introduces non-negative slack variables \xi_i, which measure the degree of misclassification of the data point \mathbf{x}_i:

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i \quad 1 \le i \le n. \quad\quad(2)

The optimization then becomes a trade-off between a large margin and a small error penalty:

\arg\min_{\mathbf{w},\mathbf{\xi}, b } \left\{\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \right\}

subject to (for any i=1,\dots n)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i, ~~~~\xi_i \ge 0

This constraint in (2) along with the objective of minimizing \|\mathbf{w}\| can be solved using Lagrange multipliers as done above. One has then to solve the following problem:

\arg\min_{\mathbf{w},\mathbf{\xi}, b } \max_{\boldsymbol{\alpha},\boldsymbol{\beta} }\left \{ \frac{1}{2}\|\mathbf{w}\|^2+C \sum_{i=1}^n \xi_i- \sum_{i=1}^{n}{\alpha_i[y_i(\mathbf{w}\cdot \mathbf{x_i} - b) -1 + \xi_i]}- \sum_{i=1}^{n} \beta_i \xi_i \right \}

with \alpha_i, \beta_i \ge 0.
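
A cvxpy sketch of this soft-margin problem (the data, the noisy point, and C = 1.0 are my own placeholders):

```python
# Sketch: soft-margin primal SVM with slack variables xi_i and penalty C.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [0.5, 0.5]])   # last point violates the margin
y = np.array([1.0, 1.0, -1.0, -1.0])
n, p = X.shape
C = 1.0

w, b, xi = cp.Variable(p), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w - b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value, xi.value)    # xi_i > 0 marks margin violations
```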


Dual form

Maximize (in \alpha_i )

\tilde{L}(\mathbf{\alpha})=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)

subject to (for any i = 1, \dots, n)

0 \leq \alpha_i \leq C,\,

and

\sum_{i=1}^n \alpha_i y_i = 0.
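
The same cvxpy sketch as for the hard-margin dual applies here, with one extra box constraint \alpha_i \le C (toy data and C are again placeholders):

```python
# Sketch: soft-margin dual SVM; identical to the hard-margin dual except that
# each alpha_i is additionally bounded above by the penalty parameter C.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [0.5, 0.5]])   # last point is noisy
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

alpha = cp.Variable(len(y))
w_expr = cp.multiply(alpha, y) @ X
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(w_expr))
constraints = [alpha >= 0, alpha <= C, alpha @ y == 0]
cp.Problem(objective, constraints).solve()
print(alpha.value)
```
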
Nonlinear classification

To obtain a nonlinear classifier, every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.

Some common kernels include:

Polynomial (homogeneous): k(\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^d

Polynomial (inhomogeneous): k(\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d

Gaussian radial basis function: k(\mathbf{x}_i,\mathbf{x}_j) = \exp(-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2), for \gamma > 0

Hyperbolic tangent: k(\mathbf{x}_i,\mathbf{x}_j) = \tanh(\kappa\,\mathbf{x}_i \cdot \mathbf{x}_j + c), for some (not every) \kappa > 0 and c < 0

The kernel is related to the transform \varphi(\mathbf{x_i}) by the equation k(\mathbf{x_i}, \mathbf{x_j}) = \varphi(\mathbf{x_i})\cdot \varphi(\mathbf{x_j}).

The weight vector in the transformed space is \textstyle\mathbf{w} = \sum_i \alpha_i y_i \varphi(\mathbf{x}_i); dot products with \mathbf{w} for classification can again be computed by the kernel trick.
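
Putting this together, a numpy sketch (with illustrative, made-up multipliers) of the kernelized decision function f(x) = \sum_i \alpha_i y_i k(\mathbf{x}_i, x) - b, here using a Gaussian RBF kernel:

```python
# Sketch: kernelized SVM decision function. In the transformed space w is never
# formed explicitly; classification uses only kernel evaluations.
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision(x, alpha, X, y, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i y_i k(x_i, x) - b ; the predicted label is sign(f(x))
    return sum(a_i * y_i * kernel(x_i, x) for a_i, y_i, x_i in zip(alpha, y, X)) - b

# Toy usage with made-up multipliers:
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.7, 0.7])
print(np.sign(decision(np.array([0.8, 1.2]), alpha, X, y, b=0.0)))
```
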
Properties

The effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C.
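
In practice this selection is often done by cross-validated grid search; a minimal scikit-learn sketch (the parameter grid and toy data are arbitrary illustrations):

```python
# Sketch: choosing the kernel, its parameters, and C by cross-validated grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)   # nonlinearly separable labels

param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1.0]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```
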
Implementation
The parameters of the maximum-margin hyperplane are derived by solving the optimization.