A simple classical problem
SVM finds an optimal “hyperplane” between two groups of datapoints. Although the separating surface it finds is always linear (a hyperplane), we can incorporate other geometric structures by transforming the data into other spaces.
In the simplest case, consider this two-dimensional example.
Our objective is to find the optimal separating line between these two groups of points. We assume that the best one is the line with the largest distance from both the red and the black dots. Let’s assume we have points as in the example above. We denote the points as $x_i \in \mathbb{R}^2$, the class of a point as $y_i$, and set $y_i = 1$ for black dots and $y_i = -1$ for red dots, for each point $i = 1, \dots, n$. We can now solve a minimization problem, i.e. find a hyperplane (which in $\mathbb{R}^2$ is just a line):

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left(\langle w, x_i \rangle + b\right) \ge 1, \quad i = 1, \dots, n,$$

where $w$ is a vector normal to the margin we are looking for and $b$ is an affine shift, corresponding to the fact that our hyperplane may not go through the origin. We use the fact that any hyperplane in space can be defined by its normal vector and its shift from the origin. Note however that here the length of the vector $w$ is crucial and corresponds to the margin size: the margin width is $2/\|w\|$, so minimizing $\|w\|$ maximizes the margin.
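In practice one rarely solves this by hand; below is a minimal sketch fitting the linear example with the e1071 package (an interface to libsvm). The package choice, the seed, and the cost value are our illustration choices, not from the post.

```r
# Fitting a linear SVM on the separable example -- a sketch.
# install.packages("e1071")  # if needed
library(e1071)

set.seed(42)
n  <- 1000
X1 <- 2 * runif(n) - 1
X2 <- 2 * runif(n) - 1
Y  <- factor(X1 + 2 * X2 < -1)           # two linearly separable classes
X  <- cbind(X1, X2)

fit <- svm(X, Y, kernel = "linear", cost = 10, scale = FALSE)

# Recover the hyperplane parameters from the support vectors:
w <- t(fit$coefs) %*% fit$SV             # normal vector
b <- -fit$rho                            # shift, so the plane is <w, x> + b = 0
plot(X1, X2, col = as.integer(Y) + 1)
abline(a = -b / w[2], b = -w[1] / w[2])  # the separating line
```

Here `fit$coefs` and `fit$SV` hold the dual coefficients and support vectors, from which $w$ is assembled exactly as in the dual formulation below.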
Note that the minimization problem does not have a closed-form solution, so we need to use optimization methods in order to solve it. Luckily, it is a convex (quadratic) problem, so it is reasonably simple to solve by numerical methods.
Soft margin, ‘intersecting’ classes
In practice the groups are often not entirely separable and the separating hyperplane simply does not exist. In [2], Cortes and Vapnik introduced a so-called soft margin, which allows misclassification of points close to the boundary.
Minimize over $w$, $b$ and $\zeta_i \ge 0$, for $i = 1, \dots, n$:

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i\left(\langle w, x_i \rangle + b\right) \ge 1 - \zeta_i, \quad i = 1, \dots, n.$$

The intuition behind this formula is that $\zeta_i$ are the distances by which points violate the margin and $C$ is the cost of this misclassification. The further the point is on the wrong side, the larger the price. We can also see that in the constraint, which has been relaxed from $y_i(\langle w, x_i \rangle + b) \ge 1$ to $y_i(\langle w, x_i \rangle + b) \ge 1 - \zeta_i$, giving in some sense more freedom (and this freedom costs $C\zeta_i$).
For technical reasons it is useful to work with a dual problem (equivalent in the sense that it has the same solution $w$): maximize over $\alpha_i$

$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \tag{1}$$

where $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, so the classifier is $f(x) = \operatorname{sign}(\langle w, x \rangle + b)$. For the details of how to transform the initial problem into this last, most convenient, form, please refer to [1].
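Since (1) is a quadratic program, it can be handed to an off-the-shelf QP solver. Here is a sketch using quadprog::solve.QP; the toy data, the cost $C = 10$, and the small ridge term added to keep the matrix numerically positive definite are our choices, not from the post.

```r
# Solving the dual (1) directly with quadprog -- a sketch.
# install.packages("quadprog")  # if needed
library(quadprog)

set.seed(1)
n  <- 100
X1 <- 2 * runif(n) - 1
X2 <- 2 * runif(n) - 1
y  <- ifelse(X1 + 2 * X2 < -1, 1, -1)             # labels in {-1, 1}
X  <- cbind(X1, X2)
C  <- 10

# Maximize sum(alpha) - (1/2) alpha' D alpha
# subject to sum(alpha * y) = 0 and 0 <= alpha_i <= C.
D <- (y %*% t(y)) * (X %*% t(X)) + diag(1e-6, n)  # ridge keeps D positive definite
A <- cbind(y, diag(n), -diag(n))                  # one constraint per column
bc <- c(0, rep(0, n), rep(-C, n))
sol   <- solve.QP(Dmat = D, dvec = rep(1, n), Amat = A, bvec = bc, meq = 1)
alpha <- sol$solution

w  <- colSums(alpha * y * X)                      # w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-5 & alpha < C - 1e-5)      # support vectors on the margin
b  <- mean(y[sv] - X[sv, ] %*% w)                 # y_i = <w, x_i> + b on the margin
pred <- sign(X %*% w + b)
mean(pred == y)                                   # training accuracy
```

Only the points with $\alpha_i > 0$ (the support vectors) contribute to $w$, which is why the fitted model can be stored compactly.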
Note that if we set $C$ very large, the problem is equivalent to an SVM with a hard margin, whereas too small a value of $C$ gives robust but potentially inaccurate results.
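This effect can be seen by comparing the number of support vectors at two costs (a sketch with e1071; the amount of class overlap and the two cost values are arbitrary illustration choices):

```r
# A small cost gives a wide, forgiving margin (many support vectors);
# a large cost approaches the hard margin (few support vectors).
library(e1071)

set.seed(3)
n <- 500
X <- cbind(2 * runif(n) - 1, 2 * runif(n) - 1)
Y <- factor(X[, 1] + 2 * X[, 2] + rnorm(n, sd = 0.3) < -1)  # overlapping classes

soft <- svm(X, Y, kernel = "linear", cost = 0.01)
hard <- svm(X, Y, kernel = "linear", cost = 100)
c(soft$tot.nSV, hard$tot.nSV)   # the small-cost fit keeps far more support vectors
```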
Non-linearity and kernel methods
The power of SVM comes from the fact that we may incorporate nonlinear separation. The basic idea is to map the points into another space and look for the separating hyperplane in the new space. For example, suppose we have two-dimensional data of the following kind.
We want to map it to three-dimensional space via $\phi(x_1, x_2) = (x_1, x_2, x_1^2 + x_2^2)$, as shown below,
Note however, that in (1) we actually need only the inner products between the points, so in the new space we don’t care about the mapped points themselves. What we need is the inner product, which here is simply

$$K(x, x') = \langle \phi(x), \phi(x') \rangle = \langle x, x' \rangle + \|x\|^2 \|x'\|^2.$$
This approach, often referred to as a kernel method (or the kernel trick), already works for the simple case, as can be seen in the figure. However, from the computational perspective it is important that we do not need to define $\phi$ explicitly. We just need to define the kernel $K$ and we can plug it into (1), which now becomes

$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j).$$
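The mapping from the figure can also be applied directly: lift the points to three dimensions and fit a linear SVM there (a sketch with e1071; the seed and default cost are our choices, and equivalently one could hand the kernel $K$ above to a solver that accepts custom kernels):

```r
# Lifting (x1, x2) to (x1, x2, x1^2 + x2^2) makes the circular
# classes linearly separable, so a linear SVM suffices in 3d.
library(e1071)

set.seed(7)
n  <- 1000
X1 <- 2 * runif(n) - 1
X2 <- 2 * runif(n) - 1
Y  <- factor((X1 + 0.2)^2 + (X2 + 0.5)^2 < 0.4^2)  # circular classes as in the figure

Phi <- cbind(X1, X2, X1^2 + X2^2)                   # explicitly mapped points
fit <- svm(Phi, Y, kernel = "linear", scale = FALSE)
mean(fitted(fit) == Y)                              # the lifted data separates linearly
```

The circle $(x_1 + 0.2)^2 + (x_2 + 0.5)^2 = 0.4^2$ becomes a plane in the lifted coordinates, which is exactly why the linear SVM succeeds there.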
In the next section we will investigate parameter tuning. Apart from the cost parameter $C$, which penalizes misclassification, we need to choose the kernel.
Examples from this post can be generated using the following R code:
# Example 1: linear split
n = 1000
X1 = 2 * runif(n) - 1
X2 = 2 * runif(n) - 1
Y = X1 + 2 * X2 < -1
plot(X1, X2, col = Y + 1)

# Example 2: circular pattern
p = c(-0.2, -0.5)
r = 0.4
Y = (X1 - p[1])^2 + (X2 - p[2])^2 < r^2
plot(X1, X2, col = Y + 1)

# the 3d plot of the transformed space
# install.packages('rgl')  # uncomment if you don't have the rgl package
library(rgl)
X3 = X1^2 + X2^2
plot3d(X1, X2, X3, col = Y + 1)

# add the lower layer
col = character(n)
col[Y] = rgb(1, 0.5, 0.5)
col[!Y] = rgb(0.5, 0.5, 0.5)
points3d(X1, X2, 0, col = col)
[1] Burges C., A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery 2(2), Springer, 1998
[2] Cortes C. and Vapnik V., Support-Vector Networks, Machine Learning 20, Springer, 1995