Introduction to SVM

A simple classical problem

SVM finds an optimal "hyperplane" between two groups of data points. Although the separation it finds is always linear in the space it works in, we can incorporate other geometric structures by transforming the data into other spaces.

In the simplest case, consider the following two-dimensional example.

[Figure: Linear split of uniform U([0,1]^2) points]

Our objective is to find the optimal separating line between these two groups of points. We take the best line to be the one with the largest distance to both the red and the black dots. Let's assume we have n=50 points, as in the example above. We denote the points by \mathbf{x}_i and the class of a point by y_i, setting y_i = 1 for black dots and y_i = -1 for red dots for each point i \in \{1,2,...,n\}. We can now solve a minimization problem, i.e. find a hyperplane (which in the \mathbb{R}^2 case is just a line):

Minimize
\| \mathbf{w} \|
subject to
y_i(\mathbf{x}_i'\mathbf{w} + b) \geq 1,

where \mathbf{w} is a vector normal to the separating hyperplane we are looking for and b is an affine shift, accounting for the fact that our hyperplane may not pass through the origin. We use the fact that any hyperplane in \mathbb{R}^d can be defined by its normal vector and its shift from the origin. Note, however, that here the length of the normal vector is crucial: the margin width equals 2/\|\mathbf{w}\|, so the shorter \mathbf{w} is, the wider the margin.

Note that the minimization problem does not have a closed-form solution, so we need optimization methods to solve it. Luckily, it is a convex problem, so it is reasonably simple to solve numerically.
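To make this concrete, here is a minimal sketch of fitting the problem above numerically. It uses the e1071 package, which is my choice of solver rather than anything prescribed in this post; since its svm() always fits a soft margin (introduced in the next section), a very large cost is used so that the fit behaves like a hard margin on this separable data. The fields $coefs, $SV and $rho are how e1071 exposes the fitted solution.

# A minimal sketch, assuming the e1071 package (not used in this post)
# install.packages('e1071') # uncomment if you don't have e1071
library(e1071)

n  = 50
X1 = 2 * runif(n) - 1
X2 = 2 * runif(n) - 1
y  = ifelse(X1 + 2 * X2 < -1, 1, -1)     # labels y_i in {1, -1}

fit = svm(x = cbind(X1, X2), y = factor(y), kernel = "linear",
          cost = 1e5, scale = FALSE)     # huge cost ~ hard margin

w = t(fit$coefs) %*% fit$SV              # normal vector of the hyperplane
b = -fit$rho                             # affine shift
plot(X1, X2, col = (y == 1) + 1)
abline(a = -b / w[2], b = -w[1] / w[2])  # the fitted separating line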

Soft margin, ‘intersecting’ classes

In practice the groups are often not entirely separable and a separating hyperplane simply does not exist. Cortes and Vapnik [2] introduced a so-called soft margin, which allows misclassification of points close to the boundary.

Minimize over \mathbf{w}, b and \xi_i for i \in \{1,2,...,n\}
\| \mathbf{w} \|^2 /2 + C \sum_i \xi_i
subject to
y_i(\mathbf{x}_i'\mathbf{w} + b) \geq 1 - \xi_i,
\xi_i \geq 0.

The intuition behind this formulation is that \xi_i measures how far point i falls on the wrong side of its margin and C is the cost of this violation: the further the point, the larger the penalty. We can also see this in the constraint, which has been relaxed from \geq 1 to \geq 1 - \xi_i, giving in some sense more freedom (and this freedom costs C\xi_i).

For technical reasons it is useful to work with the dual problem (equivalent in the sense that it yields the same solution \mathbf{w}, b):

(1) Maximize over \alpha_i for i \in \{1,2,...,n\}
\sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle \mathbf{x}_i,\mathbf{x}_j \rangle,
subject to
0 \leq \alpha_i \leq C,
\sum_i \alpha_i y_i = 0,

where \langle\mathbf{x}_i,\mathbf{x}_j\rangle denotes the inner product of the two points. For the details of how to transform the initial problem into this last, most convenient, form, please refer to [1].
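As an illustration of how (1) can be attacked directly, here is a rough sketch using the quadprog package (my own choice of solver, not something from [1] or this post); the sample, the value of C and the small ridge added to keep the matrix positive definite are all assumptions made purely for the demo.

# A sketch only: solving the dual (1) with a generic quadratic programming solver
# install.packages('quadprog') # uncomment if you don't have quadprog
library(quadprog)

n = 50
X = cbind(2 * runif(n) - 1, 2 * runif(n) - 1)
y = ifelse(X[, 1] + 2 * X[, 2] < -1, 1, -1)
C = 10

Q  = outer(y, y) * (X %*% t(X))              # Q_ij = y_i y_j <x_i, x_j>
D  = Q + diag(1e-6, n)                       # small ridge: solve.QP needs D > 0
A  = cbind(y, diag(n), -diag(n))             # y'a = 0, a_i >= 0, -a_i >= -C
b0 = c(0, rep(0, n), rep(-C, n))
sol = solve.QP(Dmat = D, dvec = rep(1, n), Amat = A, bvec = b0, meq = 1)

alpha = sol$solution
w = colSums(alpha * y * X)                   # recover w = sum_i alpha_i y_i x_i
sv = which(alpha > 1e-5 & alpha < C - 1e-5)  # points exactly on the margin
b = mean(y[sv] - X[sv, ] %*% w)              # recover the shift b from them

Dedicated SVM solvers exploit the structure of (1) far better than a generic QP routine, so this is only meant to show that the dual really is the whole problem.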

Note that if we set C very large, the problem becomes equivalent to an SVM with a hard margin, whereas too small a value of C gives robust but potentially inaccurate results.
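A small experiment makes this trade-off visible. It again assumes the e1071 package and an arbitrary noise level; on overlapping classes a small cost leaves a wide margin with many points inside it (hence many support vectors), while a large cost tries hard to classify every point correctly.

# Overlapping classes: labels perturbed by noise around the line x1 + 2*x2 = -1
library(e1071)

n = 200
X = cbind(2 * runif(n) - 1, 2 * runif(n) - 1)
Y = factor(ifelse(X[, 1] + 2 * X[, 2] + rnorm(n, sd = 0.5) < -1, 1, -1))

loose = svm(X, Y, kernel = "linear", cost = 0.01, scale = FALSE)  # small C
tight = svm(X, Y, kernel = "linear", cost = 100,  scale = FALSE)  # large C
c(small_C = nrow(loose$SV), large_C = nrow(tight$SV))             # support vector counts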

Non-linearity and kernel methods

The power of SVM comes from the fact that we can incorporate non-linear separations. The basic idea is to map the points into another space and look for a separating hyperplane in the new space. For example, suppose we have two-dimensional data of the following kind.

[Figure: Circular pattern in a cloud of U([0,1]^2) points]

We want to map \mathbf{x} = (x_1,x_2) to the three-dimensional point \phi(\mathbf{x}) = (x_1,x_2,x_1^2 + x_2^2), as shown below.

[Figure: The transformed dataset, (X,Y) \mapsto (X,Y,X^2 + Y^2)]

Note, however, that in (1) we only need the inner products between points, so we never need the mapped points themselves. What we need is the inner product \langle \phi(\mathbf{x}),\phi(\mathbf{z}) \rangle, which is simply

K(\mathbf{x},\mathbf{z}) = x_1z_1 + x_2 z_2 + (x_1^2 + x_2^2)(z_1^2 + z_2^2).
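A quick numerical check (illustration only, with an arbitrary pair of points) that this K really equals the inner product of the mapped points:

phi = function(x) c(x[1], x[2], x[1]^2 + x[2]^2)
K   = function(x, z) x[1] * z[1] + x[2] * z[2] + (x[1]^2 + x[2]^2) * (z[1]^2 + z[2]^2)

x = c(0.3, -0.7)
z = c(-0.1, 0.5)
all.equal(sum(phi(x) * phi(z)), K(x, z))  # TRUE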

This approach, often referred to as a kernel method, already works for this simple case, as can be seen in the figure. However, from the computational perspective it is important that we do not need to define \phi at all. We just need to define K and plug it into (1), which now becomes

\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j).
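To see the whole pipeline end to end, here is a sketch that plugs this K into an off-the-shelf solver. It uses the kernlab package, which is my choice and is not mentioned in this post; kernlab accepts a user-defined kernel function once it is given the class "kernel". The data mimic the circular example above.

# A sketch only, assuming the kernlab package
# install.packages('kernlab') # uncomment if you don't have kernlab
library(kernlab)

K = function(x, z) sum(x * z) + sum(x^2) * sum(z^2)  # the kernel defined above
class(K) = "kernel"

n = 1000
X = cbind(2 * runif(n) - 1, 2 * runif(n) - 1)
Y = factor((X[, 1] + 0.2)^2 + (X[, 2] + 0.5)^2 < 0.4^2)  # the circular pattern

fit = ksvm(X, Y, kernel = K, C = 10)
plot(X[, 1], X[, 2], col = as.integer(predict(fit, X)))  # predicted classes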

In the next section we will investigate the tuning of parameters. Apart from the cost parameter C, which penalizes misclassification, we need to choose the kernel.

Source code

Examples from this post can be generated using the following R code

# Example 1: linear split
n = 1000
X1 = 2 * runif(n) - 1   # n points drawn uniformly from [-1, 1]^2
X2 = 2 * runif(n) - 1

Y = X1 + 2 * X2 < -1    # class: TRUE below the line x1 + 2*x2 = -1
plot(X1, X2, col = Y + 1)

# Example 2: circular pattern
p = c(-0.2, -0.5)   # centre of the circle
r = 0.4             # radius
Y = (X1 - p[1])^2 + (X2 - p[2])^2 < r^2  # class: TRUE inside the circle
plot(X1, X2, col = Y + 1)

# the 3d plot of the transformed space
# install.packages('rgl') # uncomment if you don't have the rgl package
library(rgl)

X3 = X1^2 + X2^2                # the extra coordinate x1^2 + x2^2
plot3d(X1, X2, X3, col = Y + 1)

# add a lower layer at z = 0 with the original (untransformed) points
col = Y
col[Y] = rgb(1, 0.5, 0.5)       # light red for the circle
col[!Y] = rgb(0.5, 0.5, 0.5)    # grey for the rest
points3d(X1, X2, 0, col = col)  # points3d adds to the current rgl scene

References

[1] Burges, C., A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery 2(2), Springer, 1998.
[2] Cortes, C. and Vapnik, V., Support-Vector Networks, Machine Learning 20, Springer, 1995.
