Support vector machines (SVMs) are one of the most popular supervised learning algorithms in use today, even with the onslaught of deep learning and neural network take-over. The reason they have remained popular is due to their reliability across a wide variety of problem domains and datasets. They often have great generalisation performance, and this is almost solely due to the clever way in which they work - that is, how they approach the problem of supervised learning and how they formulate the optimisation problem they solve.

There are two types of SVMS - hard-margin and soft-margin. Hard-margin SVMs assume the data is linearly-separable (in the raw feature space or some high-dimensional feature space that we can map to) without any errors, whereas a soft-margin has some leeway in that it allows for some misclassification is the data is not completely linearly-separable. When speaking of SVMs, we are generally referring to soft-margin ones, and thus this post will focus on these. Moreover, we will focus on a binary classification context.

Consider a labeled set of $n$ feature vectors and corresponding targets: $\{(x_i, y_i)\}^{n}_{i=1}$, where $x_i \in \mathbb{R}^m$ is feature vector $i$ and $y_i \in \{0, 1\}$ is target $i$. An SVM attempts to find a hyperplane that separates the classes in the feature space, or some transformed version of the feature space. The hyperplane, however, is defined to be a very specific separating hyperplane - the one that separates the data maximally; that is, with the largest margin between the two classes.

Define a hyperplane $\mathcal{H} := \{x : f(x) = x^T \beta + \beta_0 = 0\}$, such that $\|\beta\| = 1$. Then, we know that $f(x)$ is the signed distance from $x$ to $\mathcal{H}$. As a side note, in the case that the data is linearly-separable, we have that $y_i f(x_i) > 0$, $\forall i$. However, since we are solely dealing with the linearly non-separable case, we define a set of slack variables $\xi = [\xi_1, \xi_2, \dots, \xi_n]$. These essentially provide the SVM classifier with some leeway in that it then allows for a certain amount of misclassification. Then, we let $M$ be the width of the margin either side of our maximum-margin hyperplane. We want, for all $i$, that $y_i (x_i^T \beta + \beta_0) \ge M - \xi_i$, $\xi_i \ge 0$, and $\sum_i \xi_i \le K$, for some $K \in \mathbb{R}$. This means that we want a point $x_i$ to be at least a distance of $M$ away from $\mathcal{H}$ (on its correct side of the margin) with a leeway/slack of $\xi_i$.

The above contraints lead to a non-convex optimization problem. However, it can be re-formulated in such a way that makes it convex. Thus, we modify such that for all $i$, $y_i (x_i^T \beta + \beta_0) \ge M (1 - \xi_i)$. That is, we measure the relative distance from a point $x_i$ to the hyperplane, as opposedd to the actual distance as done in the first, non-convex, formulation. The slack variables essentially just represent the proportional amount by which the predictions are on the wrong side of their margin. By bounding $\sum_i \xi_i$, we essentially bound the total proportional amount by which the training predictions fall on the wrong side of their margin.

Thus, it is clear that misclassification occurs when $\xi_i > 1$. Therefore, the $\sum_i \xi_i \le K$ constraint means we can have at most $K$ training misclassifications.

If we drop the unit norm constraint of parameter $\beta$, we define $M := \frac{1}{\|\beta\|}$, we can then formulate the following convex optimization problem for the SVM:

$\text{min} \|\beta\| \\ \text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i, \xi_i \ge 0, \sum_i \xi_i \le K, \forall i$

With this formulation, it is clear that points well within their bounds (i.e. well within their class boundary) do not have much affect on shaping the decision/class boundary. Also, if the class boundary can be constructed using a small number of support vector relative to the train set size, then the generalization performance will be high, even in an infinite-dimensional space.

It should be noted that up until now we have been working in the base feature space. However, SVMs are part of a class of methods known as kernel methods. This means that we can apply a function (known as the kernel function) to transform the base feature space into a (possibly) very high-dimensional feature space. We hypothesise that in this transformed feature space, it may be easier to find a separating hyperplane between the classes. It is this kernel property of SVMs that allow them to learn non-linear decision boundaries (as opposed to linear, which the SVM without a kernel function can learn). We can simply replace $x_i$ in the formulation by $\phi(x_i)$, where $\phi_i$ is the kernel. The most common kernel functions are polynomial or radial basis (Gaussian) functions.

Some great resources for SVMs can be found at the following links:

The Elements of Statistical Learning

Support Vector Networks

Support Vector Machine Soft Margin Classifiers: Error Analysis