Support vector machines (SVMs) are one of the most popular supervised learning algorithms in use today, even with the onslaught of deep learning and neural network take-over. The reason they have remained popular is due to their reliability across a wide variety of problem domains and datasets. They often have great generalisation performance, and this is almost solely due to the clever way in which they work - that is, how they approach the problem of supervised learning and how they formulate the optimisation problem they solve.

There are two types of SVMS - hard-margin and soft-margin. Hard-margin SVMs assume the data is linearly-separable (in the raw feature space or some high-dimensional feature space that we can map to) without any errors, whereas a soft-margin has some leeway in that it allows for some misclassification is the data is not completely linearly-separable. When speaking of SVMs, we are generally referring to soft-margin ones, and thus this post will focus on these. Moreover, we will focus on a binary classification context.

Consider a labeled set of feature vectors and corresponding targets: , where is feature vector and is target . An SVM attempts to find a hyperplane that separates the classes in the feature space, or some transformed version of the feature space. The hyperplane, however, is defined to be a very specific separating hyperplane - the one that separates the data maximally; that is, with the largest margin between the two classes.

Define a hyperplane , such that . Then, we know that is the signed distance from to . As a side note, in the case that the data is linearly-separable, we have that , . However, since we are solely dealing with the linearly non-separable case, we define a set of slack variables . These essentially provide the SVM classifier with some leeway in that it then allows for a certain amount of misclassification. Then, we let be the width of the margin either side of our maximum-margin hyperplane. We want, for all , that , , and , for some . This means that we want a point to be at least a distance of away from (on its correct side of the margin) with a leeway/slack of .

The above contraints lead to a non-convex optimization problem. However, it can be re-formulated in such a way that makes it convex. Thus, we modify such that for all , . That is, we measure the relative distance from a point to the hyperplane, as opposedd to the actual distance as done in the first, non-convex, formulation. The slack variables essentially just represent the proportional amount by which the predictions are on the wrong side of their margin. By bounding , we essentially bound the total proportional amount by which the training predictions fall on the wrong side of their margin.

Thus, it is clear that misclassification occurs when . Therefore, the constraint means we can have at most training misclassifications.

If we drop the unit norm constraint of parameter , we define , we can then formulate the following convex optimization problem for the SVM:

With this formulation, it is clear that points well within their bounds (i.e. well within their class boundary) do not have much affect on shaping the decision/class boundary. Also, if the class boundary can be constructed using a small number of support vector relative to the train set size, then the generalization performance will be high, even in an infinite-dimensional space.

It should be noted that up until now we have been working in the base feature space. However, SVMs are part of a class of methods known as kernel methods. This means that we can apply a function (known as the kernel function) to transform the base feature space into a (possibly) very high-dimensional feature space. We hypothesise that in this transformed feature space, it may be easier to find a separating hyperplane between the classes. It is this kernel property of SVMs that allow them to learn non-linear decision boundaries (as opposed to linear, which the SVM without a kernel function can learn). We can simply replace in the formulation by , where is the kernel. The most common kernel functions are polynomial or radial basis (Gaussian) functions.

Some great resources for SVMs can be found at the following links:

The Elements of Statistical Learning

Support Vector Networks

Support Vector Machine Soft Margin Classifiers: Error Analysis