Support Vector Machines - Why and How
Support vector machines (SVMs) are one of the most popular supervised learning algorithms in use today, even amid the rise of deep learning and neural networks. They have remained popular because they are reliable across a wide variety of problem domains and datasets. They often generalize well, and this is largely due to the clever way in which they work - that is, how they approach the problem of supervised learning and how they formulate the optimization problem they solve.
There are two types of SVMs - hard-margin and soft-margin. Hard-margin SVMs assume the data is linearly separable (in the raw feature space, or in some high-dimensional feature space we can map to) without any errors, whereas soft-margin SVMs have some leeway in that they allow for some misclassification if the data is not completely linearly separable. When speaking of SVMs, we are generally referring to soft-margin ones, and thus this post will focus on these. Moreover, we will focus on a binary classification context.
Consider a labeled set of \(n\) feature vectors and corresponding targets: \(\{(x_i, y_i)\}^{n}_{i=1}\), where \(x_i \in \mathbb{R}^m\) is feature vector \(i\) and \(y_i \in \{-1, +1\}\) is target \(i\). An SVM attempts to find a hyperplane that separates the classes in the feature space, or in some transformed version of the feature space. The hyperplane, however, is a very specific separating hyperplane - the one that separates the data maximally; that is, with the largest margin between the two classes.
Define a hyperplane \(\mathcal{H} := \{x : f(x) = x^T \beta + \beta_0 = 0\}\), such that \(\|\beta\| = 1\). Then \(f(x)\) is the signed distance from \(x\) to \(\mathcal{H}\). As a side note, if the data is linearly separable, there exists such a hyperplane with \(y_i f(x_i) > 0\) for all \(i\). However, since we are dealing with the linearly non-separable case, we define a set of slack variables \(\xi = [\xi_1, \xi_2, \dots, \xi_n]\). These give the SVM classifier some leeway by allowing a certain amount of misclassification. We then let \(M\) be the width of the margin on either side of our maximum-margin hyperplane, and require, for all \(i\), that \(y_i (x_i^T \beta + \beta_0) \ge M - \xi_i\), \(\xi_i \ge 0\), and \(\sum_i \xi_i \le K\), for some \(K \in \mathbb{R}\). This means that we want a point \(x_i\) to be at least a distance of \(M\) away from \(\mathcal{H}\) (on its correct side of the margin), with a leeway/slack of \(\xi_i\).
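To make the geometry concrete, here is a minimal Python sketch. The hyperplane parameters, the margin width \(M\), and the data points are made-up values for illustration only; the snippet just normalizes \(\beta\) to unit norm, computes the signed distances \(f(x_i)\), and finds the smallest slacks satisfying \(y_i f(x_i) \ge M - \xi_i\):

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative values, not fitted)
beta = np.array([3.0, 4.0])
beta0 = -1.0

# Normalize so that ||beta|| = 1; then f(x) is a signed Euclidean distance
scale = np.linalg.norm(beta)
beta, beta0 = beta / scale, beta0 / scale

X = np.array([[2.0, 1.0], [0.5, -0.2], [-1.0, 0.8]])  # a few feature vectors
y = np.array([+1, -1, -1])                            # labels in {-1, +1}

f = X @ beta + beta0             # signed distance of each point to the hyperplane
M = 0.5                          # assumed margin width, for illustration
xi = np.maximum(0.0, M - y * f)  # smallest slack with y_i * f(x_i) >= M - xi_i

print(f)   # signed distances
print(xi)  # slack needed by each point
```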
The above constraints lead to a non-convex optimization problem. However, it can be re-formulated in such a way that makes it convex. Thus, we modify the constraint such that, for all \(i\), \(y_i (x_i^T \beta + \beta_0) \ge M (1 - \xi_i)\). That is, we measure the distance from a point \(x_i\) to the hyperplane relative to the margin width, as opposed to the actual distance used in the first, non-convex, formulation. The slack variables then represent the proportional amount by which a prediction falls on the wrong side of its margin. By bounding \(\sum_i \xi_i\), we bound the total proportional amount by which the training predictions fall on the wrong side of their margin.
Thus, it is clear that misclassification occurs exactly when \(\xi_i > 1\), since that is when the point has crossed to the wrong side of the hyperplane itself. Therefore, the constraint \(\sum_i \xi_i \le K\) means we can have at most \(K\) training misclassifications.
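A small numerical illustration of the relative-slack formulation (the parameters below are made up, not fitted): each point's slack is \(\xi_i = \max(0,\ 1 - y_i f(x_i))\), and a point is misclassified exactly when \(\xi_i > 1\).

```python
import numpy as np

# Hypothetical (not fitted) parameters of f(x) = x^T beta + beta0
beta = np.array([1.0, -0.5])
beta0 = 0.2

X = np.array([[2.0, 1.0],    # comfortably on the correct side
              [0.1, 0.3],    # correct side, but inside the margin
              [-0.5, 0.4]])  # wrong side of the hyperplane
y = np.array([+1, +1, +1])

margins = y * (X @ beta + beta0)     # relative margins y_i * f(x_i)
xi = np.maximum(0.0, 1.0 - margins)  # proportional slack of each point
misclassified = xi > 1.0             # slack above 1 means the point crossed the hyperplane

print(margins)        # [ 1.7   0.15 -0.5 ]
print(xi)             # [ 0.    0.85  1.5 ]
print(misclassified)  # [False False  True]
```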
If we drop the unit-norm constraint on \(\beta\) and instead define \(M := \frac{1}{\|\beta\|}\), we can formulate the following convex optimization problem for the SVM:
\[\min_{\beta,\, \beta_0,\, \xi} \|\beta\| \quad \text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i,\ \xi_i \ge 0 \ \ \forall i, \quad \sum_i \xi_i \le K\]
With this formulation, it is clear that points well within their bounds (i.e. well within their class boundary) do not have much effect on shaping the decision boundary. Also, if the class boundary can be constructed using a small number of support vectors relative to the training set size, then the generalization performance tends to be good, even in a very high- (or infinite-) dimensional feature space.
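As a minimal sketch of this exact formulation, here is how it might be written with the cvxpy modelling library; the toy dataset and the slack budget \(K\) are assumptions made purely for illustration.

```python
import numpy as np
import cvxpy as cp

# Toy two-class data with labels in {-1, +1} (assumed for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(20, 2)),
               rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
n, m = X.shape

K = 5.0  # assumed budget on the total slack

beta = cp.Variable(m)
beta0 = cp.Variable()
xi = cp.Variable(n, nonneg=True)

constraints = [
    cp.multiply(y, X @ beta + beta0) >= 1 - xi,  # margin constraints with slack
    cp.sum(xi) <= K,                             # total-slack budget
]
problem = cp.Problem(cp.Minimize(cp.norm(beta, 2)), constraints)
problem.solve()

print(beta.value, beta0.value)  # the soft-margin maximum-margin hyperplane
```

In practice, most libraries solve the equivalent penalized form \(\min_{\beta, \beta_0, \xi} \tfrac{1}{2}\|\beta\|^2 + C \sum_i \xi_i\) subject to the same margin and non-negativity constraints, where the cost parameter \(C\) plays the role of the slack budget \(K\).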
It should be noted that up until now we have been working in the base feature space. However, SVMs are part of a class of methods known as kernel methods. This means that we can apply a mapping \(\phi\) to transform the base feature space into a (possibly) very high-dimensional feature space, in which we hypothesize it may be easier to find a separating hyperplane between the classes. It is this kernel property of SVMs that allows them to learn non-linear decision boundaries (as opposed to the linear ones an SVM without a kernel function can learn). In the formulation above, we simply replace \(x_i\) with \(\phi(x_i)\), where \(\phi\) is the feature map; the kernel function computes the inner products \(K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\) that the optimization needs, without ever constructing \(\phi(x_i)\) explicitly. The most common kernel functions are polynomial and radial basis (Gaussian) kernels.
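For instance, here is a short sketch using scikit-learn's SVC on data that is not linearly separable in the base feature space; the dataset and the hyperparameter values (C, gamma) are placeholders, and note that SVC solves the penalized form of the soft-margin problem mentioned above rather than the \(K\)-budget form.

```python
import numpy as np
from sklearn.svm import SVC

# Toy non-linearly-separable data: class +1 inside a disc, class -1 outside
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(400, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# A linear kernel cannot separate a disc from its surroundings,
# while the RBF kernel implicitly maps the data into a space where it can.
print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:   ", rbf_svm.score(X, y))
```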
Some great resources for SVMs can be found at the following links:
The Elements of Statistical Learning
Support Vector Machine Soft Margin Classifiers: Error Analysis