Support vector machine - notes from Wikipedia

Summary

A support vector machine (SVM) is used for classification and regression analysis.

An SVM training algorithm builds a model that assigns new examples to one category or the other.

An SVM model is a representation of the examples as points in space.

SVM can perform linear classification and, using the kernel trick, non-linear classification.
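
As a quick illustration of both modes, a minimal scikit-learn sketch (the toy data and parameter values are my own placeholders, not from the original text):

```python
# Minimal sketch: linear vs. kernelized SVM classification with scikit-learn.
# The toy data and parameter values below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                  # 40 points in R^2
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # labels in {-1, +1}

linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)        # linear classification
rbf_clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)   # non-linear, via the kernel trick

print(linear_clf.predict([[1.0, 2.0]]), rbf_clf.predict([[1.0, 2.0]]))
```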


Definition

An SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.

A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class; in general, the larger the margin, the lower the generalization error of the classifier.

Sometimes the sets to discriminate are not linearly separable in the original finite-dimensional space; in that case, the original finite-dimensional space is mapped into a much higher-dimensional space, where the separation is presumably easier.

The mappings are designed so that dot products in the higher-dimensional space can be computed easily in terms of a kernel function K(x,y) on the original space.

The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant.

The vectors defining the hyperplanes can be chosen as linear combinations, with parameters \alpha_i, of the images of the data points in the feature space.

The points x in the feature space that are mapped into the hyperplane are then defined by the relation: \textstyle\sum_i \alpha_i K(x_i,x) = \mathrm{constant}.

This sum of kernels can be used to measure the relative nearness of each test point to the data points originating in either of the sets to be discriminated.
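
To make the "dot products computed easily" point concrete, here is a small numpy sketch (an illustrative example of my own, not from the original text): the homogeneous degree-2 polynomial kernel K(x, y) = (x \cdot y)^2 equals the ordinary dot product of explicitly mapped features \varphi(x) \cdot \varphi(y).

```python
# Sketch: a kernel evaluates a dot product in a higher-dimensional feature space
# without building that space explicitly. Degree-2 polynomial kernel on R^2.
import numpy as np

def phi(v):
    # Explicit feature map for K(x, y) = (x . y)^2 on R^2:
    # phi(v) = (v1^2, v2^2, sqrt(2) * v1 * v2)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

def K(x, y):
    return np.dot(x, y) ** 2        # cheap: one dot product in the original space

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y), np.dot(phi(x), phi(y)))   # both print 1.0
```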


Motivation

The goal is to decide which class a new data point will be in.

[Figure: Svm separating hyperplanes (SVG).svg]

H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximum margin.

We choose the hyperplane so that the distance from it to the nearest data point on each side is maximized.

Such a hyperplane is called the maximum-margin hyperplane.

Linear SVM


Given a training set \mathcal{D} of n points of the form

\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^p,\, y_i \in \{-1,1\}\right\}_{i=1}^n

Each \mathbf{x}_i is a p-dimensional real vector.

Any hyperplane can be written as the set of points \mathbf{x} satisfying

\mathbf{w}\cdot\mathbf{x} - b=0,\,

where \mathbf{w} is the normal vector to the hyperplane.

[Figure: Svm max sep hyperplane with margin.png]

Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

If the training data are linearly separable, we can select two hyperplanes that separate the points and have no points between them; they can be described by the equations

\mathbf{w}\cdot\mathbf{x} - b=1\,

and

\mathbf{w}\cdot\mathbf{x} - b=-1.\,

The distance between these two hyperplanes is \tfrac{2}{\|\mathbf{w}\|}, so to maximize the margin we want to minimize \|\mathbf{w}\|.

In order to prevent data points from falling into the margin, we add the following constraint: for each i either

\mathbf{w}\cdot\mathbf{x}_i - b \ge 1\qquad\text{ for }\mathbf{x}_i of the first class

or

\mathbf{w}\cdot\mathbf{x}_i - b \le -1\qquad\text{ for }\mathbf{x}_i of the second.

This can be rewritten as:

y_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1, \quad \text{ for all } 1 \le i \le n.\qquad\qquad(1)

Minimize (in {\mathbf{w},b})

\|\mathbf{w}\|

subject to (for any i = 1, \dots, n)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1. \,
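
As a quick illustration of what this asks for, here is a numpy sketch (the toy points and the candidate \mathbf{w}, b are my own assumptions) that checks the constraints for a candidate hyperplane and reads off the margin width \tfrac{2}{\|\mathbf{w}\|}:

```python
# Sketch: checking the hard-margin constraints and margin width for a
# candidate (w, b). The data and candidate values are illustrative.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.array([0.5, 0.5])   # candidate normal vector
b = 0.0                    # candidate offset

feasible = np.all(y * (X @ w - b) >= 1)   # constraint (1) for every point
margin = 2.0 / np.linalg.norm(w)          # distance between the two margin hyperplanes
print(feasible, margin)
```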

Primal form

For mathematical convenience, and without changing the solution, we substitute \|\mathbf{w}\| with \tfrac{1}{2}\|\mathbf{w}\|^2. This turns the problem into a quadratic programming optimization:

\arg\min_{(\mathbf{w},b)}\frac{1}{2}\|\mathbf{w}\|^2

subject to (for any i = 1, \dots, n)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1.
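
This quadratic program can be passed directly to a general-purpose convex solver. A minimal sketch using cvxpy (my choice of library, not something prescribed by the text; the toy data are placeholders):

```python
# Sketch: hard-margin primal SVM as a quadratic program, solved with cvxpy.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, p = X.shape

w = cp.Variable(p)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w - b) >= 1]          # y_i (w . x_i - b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```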

By introducing Lagrange multipliers \boldsymbol{\alpha}, the previous constrained problem can be expressed as

\arg\min_{\mathbf{w},b } \max_{\boldsymbol{\alpha}\geq 0 } \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n}{\alpha_i[y_i(\mathbf{w}\cdot \mathbf{x_i} - b)-1]} \right\}

The Karush-Kuhn-Tucker (KKT) conditions imply that the solution can be expressed as a linear combination of the training vectors:

\mathbf{w} = \sum_{i=1}^n{\alpha_i y_i\mathbf{x_i}}.

Only a few \alpha_i will be greater than zero. The corresponding \mathbf{x_i} are exactly the support vectors, which lie on the margin and satisfy y_i(\mathbf{w}\cdot\mathbf{x_i} - b) = 1. From this one can derive that the support vectors also satisfy

\mathbf{w}\cdot\mathbf{x_i} - b = 1 / y_i = y_i \iff b = \mathbf{w}\cdot\mathbf{x_i} - y_i

which allows one to define the offset b. In practice, it is more robust to average over all N_{SV} support vectors:

b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}}{(\mathbf{w}\cdot\mathbf{x_i} - y_i)}
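
Given \alpha_i values from a solver, a small numpy sketch (my own helper, with hypothetical variable names) that recovers \mathbf{w} and averages b over the support vectors:

```python
# Sketch: reconstruct w = sum_i alpha_i y_i x_i and average b over the
# support vectors (those with alpha_i > 0, up to a numerical tolerance).
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    w = (alpha * y) @ X                      # sum_i alpha_i y_i x_i
    sv = alpha > tol                         # indices of support vectors
    b = np.mean(X[sv] @ w - y[sv])           # average of w . x_i - y_i over the SVs
    return w, b
```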

Dual form

Writing the classification rule reveals that the maximum-margin hyperplane and therefore the classification task is only a function of the support vectors.

Using the fact that \|\mathbf{w}\|^2 = \mathbf{w}\cdot \mathbf{w} and substituting \mathbf{w} = \sum_{i=1}^n{\alpha_i y_i\mathbf{x_i}}, one can show that the dual of the SVM reduces to the following optimization problem:

Maximize (in \alpha_i )

\tilde{L}(\mathbf{\alpha})=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)

subject to (for any i = 1, \dots, n)

\alpha_i \geq 0,\,

and to the constraint from the minimization in b

\sum_{i=1}^n \alpha_i y_i = 0.

Here the kernel is defined by k(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i\cdot\mathbf{x}_j.

\mathbf{w} can be computed from the \alpha terms:

\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i.
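
A sketch of this dual problem in cvxpy (toy data of my own; the quadratic term \sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i\cdot\mathbf{x}_j is rewritten as \|\sum_i \alpha_i y_i \mathbf{x}_i\|^2 so the solver sees a standard convex form):

```python
# Sketch: hard-margin dual SVM with the linear kernel k(x_i, x_j) = x_i . x_j.
# Uses sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j) = || sum_i alpha_i y_i x_i ||^2.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

alpha = cp.Variable(len(y))
w_expr = cp.multiply(alpha, y) @ X                        # sum_i alpha_i y_i x_i
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(w_expr))
constraints = [alpha >= 0, alpha @ y == 0]                # alpha_i >= 0, sum_i alpha_i y_i = 0
cp.Problem(objective, constraints).solve()

print(alpha.value)   # only the support vectors get alpha_i > 0
```
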
Soft margin

The soft margin method chooses a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.

The method introduces non-negative slack variables \xi_i, which measure the degree of misclassification of the data point \mathbf{x}_i:

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i \quad 1 \le i \le n. \quad\quad(2)

The optimization then becomes a trade-off between a large margin and a small error penalty:

\arg\min_{\mathbf{w},\mathbf{\xi}, b } \left\{\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \right\}

subject to (for any i=1,\dots n)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i, ~~~~\xi_i \ge 0

This constraint in (2) along with the objective of minimizing \|\mathbf{w}\| can be solved using Lagrange multipliers as done above. One has then to solve the following problem:

\arg\min_{\mathbf{w},\mathbf{\xi}, b } \max_{\boldsymbol{\alpha},\boldsymbol{\beta} }\left \{ \frac{1}{2}\|\mathbf{w}\|^2+C \sum_{i=1}^n \xi_i- \sum_{i=1}^{n}{\alpha_i[y_i(\mathbf{w}\cdot \mathbf{x_i} - b) -1 + \xi_i]}- \sum_{i=1}^{n} \beta_i \xi_i \right \}

with \alpha_i, \beta_i \ge 0.
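
A cvxpy sketch of this soft-margin problem (the data, the noisy point, and C = 1.0 are my own placeholders):

```python
# Sketch: soft-margin primal SVM with slack variables xi_i and penalty C.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [0.5, 0.5]])   # last point violates the margin
y = np.array([1.0, 1.0, -1.0, -1.0])
n, p = X.shape
C = 1.0

w, b, xi = cp.Variable(p), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w - b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value, xi.value)    # xi_i > 0 marks margin violations
```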


Dual form

Maximize (in \alpha_i )

\tilde{L}(\mathbf{\alpha})=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)

subject to (for any i = 1, \dots, n)

0 \leq \alpha_i \leq C,\,

and

\sum_{i=1}^n \alpha_i y_i = 0.
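
The same cvxpy sketch as for the hard-margin dual applies here, with one extra box constraint \alpha_i \le C (toy data and C are again placeholders):

```python
# Sketch: soft-margin dual SVM; identical to the hard-margin dual except that
# each alpha_i is additionally bounded above by the penalty parameter C.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [0.5, 0.5]])   # last point is noisy
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

alpha = cp.Variable(len(y))
w_expr = cp.multiply(alpha, y) @ X
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(w_expr))
constraints = [alpha >= 0, alpha <= C, alpha @ y == 0]
cp.Problem(objective, constraints).solve()
print(alpha.value)
```
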
Nonlinear classification

To obtain a nonlinear classifier, every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.

Some common kernels include:

Polynomial (homogeneous): k(\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^d

Polynomial (inhomogeneous): k(\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d

Gaussian radial basis function: k(\mathbf{x}_i,\mathbf{x}_j) = \exp(-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2), for \gamma > 0

Hyperbolic tangent: k(\mathbf{x}_i,\mathbf{x}_j) = \tanh(\kappa\,\mathbf{x}_i \cdot \mathbf{x}_j + c), for some (not every) \kappa > 0 and c < 0

The kernel is related to the transform \varphi(\mathbf{x_i}) by the equation k(\mathbf{x_i}, \mathbf{x_j}) = \varphi(\mathbf{x_i})\cdot \varphi(\mathbf{x_j}).

The weight vector in the transformed space is \textstyle\mathbf{w} = \sum_i \alpha_i y_i \varphi(\mathbf{x}_i); dot products with \mathbf{w} for classification can again be computed by the kernel trick.
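
Putting this together, a numpy sketch (with illustrative, made-up multipliers) of the kernelized decision function f(x) = \sum_i \alpha_i y_i k(\mathbf{x}_i, x) - b, here using a Gaussian RBF kernel:

```python
# Sketch: kernelized SVM decision function. In the transformed space w is never
# formed explicitly; classification uses only kernel evaluations.
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision(x, alpha, X, y, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i y_i k(x_i, x) - b ; the predicted label is sign(f(x))
    return sum(a_i * y_i * kernel(x_i, x) for a_i, y_i, x_i in zip(alpha, y, X)) - b

# Toy usage with made-up multipliers:
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.7, 0.7])
print(np.sign(decision(np.array([0.8, 1.2]), alpha, X, y, b=0.0)))
```
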
Properties

The effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C.
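
In practice this selection is often done by cross-validated grid search; a minimal scikit-learn sketch (the parameter grid and toy data are arbitrary illustrations):

```python
# Sketch: choosing the kernel, its parameters, and C by cross-validated grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)   # nonlinearly separable labels

param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1.0]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```
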
Implementation
The parameters of the maximum-margin hyperplane are derived by solving the optimization.