Jekyll2023-01-15T17:16:17+00:00https://davidtorpey.com//feed.xmlDavid TorpeyStuff that interests me, and hopefully you too. Hopefully we learn something along the way as well.On the Robustness of Self-Supervised Representations for Multi-view Object Classification2023-01-15T13:55:55+00:002023-01-15T13:55:55+00:00https://davidtorpey.com//2023/01/15/ssl-multiview<p>In this post, I’ll talk about a paper we <a href="https://www.sciencedirect.com/science/article/abs/pii/S0167865522002276">recently published</a>
on the robustness of self-supervised representations with respect to
viewpoint variation - one of the core tenets of any capable vision system.
At this point, it is known that vision models pretrained using self-supervised
objectives outperform standard supervised pretraining on a set of common, standard
benchmark datasets such as ImageNet, CIFAR10, COCO, and Birdsnap. However, these
datasets all serve to evaluate these models in a very narrow aspect - simple
object classification performance.</p>
<p>Little work has been done to evaluate these models in a more granular way, and
on more niche datasets. In this paper, we evaluate these models specifically
with respect to multi-view recognition performance. We tackle this problem through
two main approaches. Firstly, we synthetically vary viewpoint by approximating
it with homographies of different strengths. This gives us fine-grained
control of the viewpoint, and serves as an initial way to benchmark supervised
learning against self-supervised (SS) learning in this domain. Then, we evaluate the
models on real-world, multi-view datasets.</p>
<h2 id="an-empirical-measure-of-robustness">An Empirical Measure of Robustness</h2>
<p>We define an empirical measure of robustness to viewpoint variation in order
to quantify and rank models during evaluation. Consider functions \(f : \mathcal{X} \rightarrow \mathbb{R}^n\)
and \(g : \mathcal{X} \rightarrow \mathbb{R}^n\), and a sample space of images \(\mathcal{X}\).
These are the supervised pretrained and SS pretrained models, respectively. We
aim to analyse the efficacy and representational power of embeddings \(f(x), g(x) \in \mathbb{R}^n\)
in terms of robustness to viewpoint variation. Essentially, we aim to analyse
whether SS representations produced by \(g\) are more robust than those from \(f\).
Mathematically, this can be formalised as follows. Consider a function \(V : \mathcal{X} \rightarrow \mathcal{X}\)
that is tasked with altering an object’s viewpoint. Then, a function \(g\) is
more robust to \(V\) than a function \(f\), if:</p>
\[\mathbb{E}[L(f(x), f(V(x)))] \ge \mathbb{E}[L(g(x), g(V(x)))]\]
<p>for some loss function \(L\), and for all \(x \in \mathcal{X}\). This is the
criterion we use to measure and compare models’ multi-view recognition performance,
and implicitly, their viewpoint invariance.</p>
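<p>As a rough illustration of how this criterion could be estimated in practice, the sketch below replaces the expectation with an average over a finite sample. Everything in it is a toy assumption made for illustration: the "images" are short lists of numbers, <code class="language-plaintext highlighter-rouge">f</code>, <code class="language-plaintext highlighter-rouge">g</code> and <code class="language-plaintext highlighter-rouge">V</code> are hypothetical stand-ins, and the loss is taken to be the squared Euclidean distance. It is not the evaluation code from the paper.</p>

```python
def squared_distance(u, v):
    """Loss L: squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def expected_shift(model, transform, samples, loss=squared_distance):
    """Sample-average estimate of E[L(model(x), model(V(x)))]."""
    shifts = [loss(model(x), model(transform(x))) for x in samples]
    return sum(shifts) / len(shifts)


def is_more_robust(g, f, transform, samples):
    """True if g satisfies the robustness criterion relative to f."""
    return expected_shift(f, transform, samples) >= expected_shift(g, transform, samples)


# Hypothetical stand-ins: f is sensitive to element order, g is not.
def f(x):
    return list(x)


def g(x):
    return [sum(x), max(x), min(x)]


def V(x):
    """Toy 'viewpoint change': reverse the input."""
    return list(reversed(x))


samples = [[1.0, 2.0, 3.0], [0.0, 5.0, 1.0], [2.0, 2.0, 7.0]]
shift_f = expected_shift(f, V, samples)
shift_g = expected_shift(g, V, samples)
```

<p>Here <code class="language-plaintext highlighter-rouge">g</code> is unaffected by the toy transformation while <code class="language-plaintext highlighter-rouge">f</code> is not, so <code class="language-plaintext highlighter-rouge">is_more_robust(g, f, V, samples)</code> holds.</p>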
<h2 id="synthetic-viewpoint-variation-analysis">Synthetic Viewpoint Variation Analysis</h2>
<p>We alter viewpoint in a controlled environment by applying a homography to an
image. We represent a homography as \(H_{\alpha} : \mathcal{X} \rightarrow \mathcal{X}\),
where \(\alpha \in [0, 1]\) is the strength of the homography.</p>
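<p>To make the idea of a strength-parameterised homography concrete, here is a minimal sketch. The post does not specify how \(\alpha\) maps to a homography matrix, so the parameterisation below (a single perspective term scaled by \(\alpha\)) and the names <code class="language-plaintext highlighter-rouge">homography_matrix</code> and <code class="language-plaintext highlighter-rouge">apply_homography</code> are assumptions made purely for illustration.</p>

```python
def homography_matrix(alpha, width):
    """Hypothetical perspective warp; alpha = 0 gives the identity."""
    return [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [alpha / width, 0.0, 1.0],
    ]


def apply_homography(H, point):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Dividing by w is what makes the mapping projective rather than affine.
    return (xh / w, yh / w)


# Corners of a 100 x 50 image, warped at full strength.
corners = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]
identity = [apply_homography(homography_matrix(0.0, 100.0), c) for c in corners]
warped = [apply_homography(homography_matrix(1.0, 100.0), c) for c in corners]
```

<p>With this choice, the right-hand edge of the image is pulled towards the origin as \(\alpha\) grows, while the left-hand edge stays fixed. The warped quadrilateral no longer fills the original frame.</p>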
<p>We also evaluate the potential bias of the models towards the black background
induced from performing a homography on an image. We do this using (what we term)
a <em>bounded homography</em>, whereby we crop the maximum-area inscribed axis-aligned
rectangle from the resulting polygon. An example can be seen below:</p>
<p><img src="/assets/bounded-homography.png" alt="Bounded Homography" /></p>
<p>The below table contains the results using linear evaluation on a host of common
benchmark datasets:</p>
<p><img src="/assets/linear-eval-homog.png" alt="Linear Eval Results" /></p>
<p>Interestingly, the supervised baselines perform best on the 2 most common
small-scale benchmark datasets - CIFAR10 and CIFAR100. The rest of the datasets
are dominated by SSL models. Next, we show a summary of results by
synthetically varying viewpoint using homographies of strength 0.2, 0.4, 0.6, and 0.8.</p>
<p><img src="/assets/bounded-results.png" alt="Bounded Homog Results" /></p>
<p>These results suggest that supervised models are more biased towards the
black backgrounds induced in default homographies, and when accounted for using
the bounded homography, SSL consistently performs best.</p>
<h2 id="multi-view-performance-in-the-wild">Multi-view Performance in the Wild</h2>
<p>We evaluate these models in a real-world environment using a set of 5 inherently
multi-view real-world datasets (see paper for details). Below is a table of the
top-performing models overall for each of these:</p>
<p><img src="/assets/real-world-top.png" alt="Real World Top" /></p>
<p>Once again, SS methods dominate - except for Recon3D. This is an expected result,
since this dataset is the least like ImageNet (the dataset all evaluated models
were pretrained on). It <a href="https://openaccess.thecvf.com/content/CVPR2021/papers/Ericsson_How_Well_Do_Self-Supervised_Models_Transfer_CVPR_2021_paper.pdf">has been shown</a>
that SS models are less robust than supervised models when evaluated on datasets
that contain data with a large distribution shift from ImageNet. Next, we show
results for varying the amount of <strong>context</strong> a model requires to perform retrieval
of the correct object (at a different viewpoint):</p>
<p><img src="/assets/mvmc.png" alt="MVMC" /></p>
<p>We see that with as few as <em>one</em> image per class, SS techniques outperform the
supervised baselines significantly. We encourage you to see the paper for further
experiments on evaluating models with respect to the amount of context needed
to perform a particular multi-view task.</p>
<h2 id="tldr">TL;DR</h2>
<ul>
<li>Representations learned through SS techniques are shown, in multiple scenarios,
to be more robust to viewpoint changes. This holds for both the synthetic, and
real-world experiments.</li>
<li>We show that with very little context (e.g. number of support samples, amount
of training images, etc.), SSL consistently outperforms the supervised baselines
in a real-world environment.</li>
<li>We posit that SS representations encode information more pertinent to object
parts (as a byproduct of the training objectives), which enables improved
robustness to viewpoint.</li>
<li>In our experiments, the <em>instance discrimination</em> class of SSL models outperforms
the <em>pretext task</em> class of SSL models, but supervised models outperform the latter.</li>
<li>ViT-based supervised and SS models show promise, and perform best in many
experiments.</li>
</ul>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale2021-06-11T19:55:55+00:002021-06-11T19:55:55+00:00https://davidtorpey.com//2021/06/11/vision-transformer<p>As is commonly known at this point, transformers have transformed the field of
NLP, and sequence modelling in general. However, computer vision has thus far
remained dominated by the CNN. Its inductive biases result in unparalleled
efficiency in terms of data and parameters for modelling data with a grid-like
topology - most often images or video.</p>
<p>However, it is known that a vanilla MLP (under certain assumptions) is a universal
approximator of continuous functions defined on compact subsets of
\(\mathbb{R}^n\). Therefore, architectures without such strong inductive biases
(such as the locality assumptions and convolution in CNNs) may work just as well
for computer vision. Although previous attempts have been made to model images
using non-CNN neural network architectures, none have enjoyed the widespread
success and utility of the CNN and its derivatives. One of the major problems of
applying transformers (or more broadly, self-attention) to images is the inherent
dimensionality of image data. In vanilla self-attention, one would need every input
to attend to every other input. With text data, this is more manageable than with images,
since one would need to attend from every pixel to every other pixel in an image
(in the vanilla formulation). This is computationally infeasible.</p>
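<p>A quick back-of-the-envelope calculation shows the scale of the problem. The 224x224 input resolution below is an assumption (a common choice for ImageNet-style models), while the 16x16 patch size comes from the paper's title. The sketch only counts entries in a single attention matrix.</p>

```python
def attention_matrix_entries(num_tokens):
    """Vanilla self-attention compares every token with every token."""
    return num_tokens * num_tokens


image_size = 224   # assumed input resolution
patch_size = 16    # patch size from the ViT paper title

pixel_tokens = image_size * image_size
patch_tokens = (image_size // patch_size) ** 2

pixel_entries = attention_matrix_entries(pixel_tokens)
patch_entries = attention_matrix_entries(patch_tokens)
reduction = pixel_entries // patch_entries
```

<p>Attending over pixels gives 50,176 tokens and roughly 2.5 billion attention entries per head and per layer, whereas attending over 16x16 patches gives 196 tokens and 38,416 entries.</p>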
<p>This post focuses on a paper that shows that competitive performance to state-of-the-art
can be achieved using a transformer-like architecture, thereby (somewhat, as we will see)
circumventing the strong inductive biases in CNNs. There are a few caveats to this claim,
which will be elaborated upon below. The architecture proposed in this paper is known as
<em>vision transformer</em> (<a href="https://openreview.net/pdf?id=YicbFdNTTy">ViT</a>).</p>
<p>There have been previous attempts to reduce the reliance on CNNs in computer
vision architectures in favour of self-attention and transformer-based
architectures (see <a href="https://arxiv.org/pdf/1711.07971.pdf">this</a>,
<a href="https://arxiv.org/abs/2005.12872">this</a>, and <a href="https://arxiv.org/abs/1906.05909">this</a>).
However, these previous attempts are notably more complex than the standard
transformer, and thus cannot benefit from the vanilla formulation’s computational
efficiency and scalability - which, as we will see, is essential for these non-CNN
architectures to compete. Because of this, ViT aims to make as few modifications as
possible, so that the maximum benefit in terms of scalability and
efficiency from the standard transformer can be realised.</p>
<p>The architecture is fairly simple, and seems to be a natural way to model images using a
transformer-like architecture (see below):</p>
<p><img src="/assets/vit.png" alt="vit" /></p>
<p>An image is broken up into patches, and each patch is flattened and projected
into a latent space using a linear projection (using a projection matrix \(E\)).
However, without additional information, the transformer would be processing the
patches as a set instead of as a sequence. This would make the result permutation
invariant, which is not ideal in this case. To overcome this, the authors assign
(much like in NLP applications of transformers) positional embeddings to each of
the patch embeddings to give the transformer context of patch ordering.</p>
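<p>The patch-embedding step described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the ViT implementation: the image is a tiny single-channel grid, the projection matrix \(E\) and the positional embeddings are fixed by hand instead of being learned, and the extra class token used by ViT is omitted.</p>

```python
def patchify(image, patch_size):
    """Split a 2D grid into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


def embed_patches(patches, projection, positional):
    """Linearly project each patch, then add its positional embedding."""
    tokens = []
    for index, patch in enumerate(patches):
        projected = [sum(w * p for w, p in zip(row, patch)) for row in projection]
        tokens.append([v + pos for v, pos in zip(projected, positional[index])])
    return tokens


# A 4x4 "image" split into four 2x2 patches, embedded into 2 dimensions.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
E = [
    [1.0, 0.0, 0.0, 0.0],        # hand-picked, not learned
    [0.25, 0.25, 0.25, 0.25],
]
positional = [[0.1 * i, 0.0] for i in range(4)]  # hand-picked, not learned

patches = patchify(image, 2)
tokens = embed_patches(patches, E, positional)
```

<p>Without the positional term, shuffling <code class="language-plaintext highlighter-rouge">patches</code> would simply shuffle <code class="language-plaintext highlighter-rouge">tokens</code>, which is exactly the set-like behaviour the positional embeddings are meant to break.</p>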
<p>After this point, it is a standard transformer model (i.e. the transformer encoder
from <a href="https://arxiv.org/abs/1706.03762">the original transformer paper</a>) that eventually
maps to the class prediction. This is interesting as there is no convolution present
in the architecture, thereby forgoing the strong inductive biases made by CNNs. The
global approach to self-attention in ViT is a weaker prior than the strong locality
assumption made when performing convolution. The only time priors based on the 2D
structure of images are added into the modelling process of ViT is when higher
resolution images are fed in. In this case, the effective sequence length is
longer for the same patch size, at which point the positional embeddings may not
be as meaningful as for shorter image patch sequences. Thus, 2D interpolation of
the positional embeddings is performed, according to their location in the
original image.</p>
<p>The results of ViT are interesting. Due to the lack of strong, but beneficial
inductive biases of CNNs, the ViT does not perform well when the amount of data
is not large enough. Only when the scale of the data is, frankly, ridiculous (300M images),
does the ViT start outperforming ResNet-like architectures. At smaller scales, the
ViT is not as competitive, but does take far fewer computational resources to train
to achieve the same accuracy as the state-of-the-art CNNs.</p>
<p>Interestingly, even without the manual injection of image-specific priors, the ViT
model still learns surprising things during training. For example, the linear
projection (to embed the patches into the initial latent space), learns filters
that are surprisingly similar to those usually learned by CNNs (see below).</p>
<p><img src="/assets/vitviz1.png" alt="vitviz1" /></p>
<p>Overall, applying more general, scalable models to problems is often a good
approach to take, and transformers are in many ways the most general neural
network architectures we currently have that work (more so than vanilla MLPs).
Their current utility for computer vision, even with ViT, is somewhat limited
due to the computational resources required for training them at scale being
infeasible for most people (as compared with CNNs). However, it is most definitely
a promising step in the direction of converging on a unified architecture that
can effectively model any modality of data (images, video, text, audio, etc.).</p>pydags - A lightweight DAG framework for Python2021-05-03T19:55:55+00:002021-05-03T19:55:55+00:00https://davidtorpey.com//2021/05/03/pydags<p>I recently released a pre-alpha version of a Python library I’ve been working on. It’s still in the
very early stages of development, but this tutorial aims to give an introduction to the library and
its purpose.</p>
<p>The library is called <a href="https://github.com/DavidTorpey/pydags">pydags</a>, and it’s meant to serve as a
lightweight alternative to the enterprise, heavyweight DAG frameworks such as Airflow, Kubeflow, and
Luigi. pydags is a Python-native framework to express and execute DAG workloads, focusing on local
development, with no reliance on Kubernetes or Docker. It’s a quick and easy way to get started with
DAG computation in Python, and has a Kubeflow-like interface for defining inter-node dependencies in
the DAG.</p>
<h2 id="pydags-terminology">pydags Terminology</h2>
<p>A quick note on terminology in pydags. Firstly, a DAG is called a <code class="language-plaintext highlighter-rouge">Pipeline</code> in pydags. And a pipeline
consists of many <code class="language-plaintext highlighter-rouge">Stages</code>. In essence, pipelines and stages are synonymous with DAGs and nodes,
respectively.</p>
<h2 id="example-usage---simple">Example Usage - Simple</h2>
<p>Suppose we want to create a moderately complex DAG consisting of 6 nodes. First, we import the required
classes and methods from pydags:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pydags.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">pydags.stage</span> <span class="kn">import</span> <span class="n">stage</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Pipeline</code> is the main class for defining and executing DAGs. <code class="language-plaintext highlighter-rouge">stage</code> is a decorator, and is one of the
ways to define stages in pydags. Next, we define a dummy stage in the form of a method. This will be
where the computation for that particular node/stage of the DAG will be defined. As we will see later,
one may also define stages as classes instead of methods.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">stage</span>
<span class="k">def</span> <span class="nf">stage_1</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Running stage 1'</span><span class="p">)</span>
</code></pre></div></div>
<p>This is just a dummy stage (for demonstration purposes), and thus doesn’t really do anything useful. In a real
use-case, the computation would be more significant. We can now create a few more dummy stages with this <code class="language-plaintext highlighter-rouge">@stage</code>
decorator:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">stage</span>
<span class="k">def</span> <span class="nf">stage_2</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Running stage 2'</span><span class="p">)</span>

<span class="o">@</span><span class="n">stage</span>
<span class="k">def</span> <span class="nf">stage_3</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Running stage 3'</span><span class="p">)</span>

<span class="o">@</span><span class="n">stage</span>
<span class="k">def</span> <span class="nf">stage_4</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Running stage 4'</span><span class="p">)</span>

<span class="o">@</span><span class="n">stage</span>
<span class="k">def</span> <span class="nf">stage_5</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Running stage 5'</span><span class="p">)</span>

<span class="o">@</span><span class="n">stage</span>
<span class="k">def</span> <span class="nf">stage_6</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Running stage 6'</span><span class="p">)</span>
</code></pre></div></div>
<p>Now, we can instantiate the stages, and create a pipeline. Instantiating a stage that has been defined
as a function simply means invoking <code class="language-plaintext highlighter-rouge">__call__</code>, for example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stage1</span> <span class="o">=</span> <span class="n">stage_1</span><span class="p">()</span>
<span class="n">stage2</span> <span class="o">=</span> <span class="n">stage_2</span><span class="p">()</span>
</code></pre></div></div>
<p>It should be noted that this does not actually invoke the functions; instead, the decorator wraps the
function and its arguments in a proxy class that is readable by the Pipeline class.</p>
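<p>For readers curious how a decorator can defer execution like this, here is a minimal, self-contained sketch of the general pattern. It is not the pydags source: <code class="language-plaintext highlighter-rouge">StageProxy</code>, <code class="language-plaintext highlighter-rouge">lazy_stage</code> and <code class="language-plaintext highlighter-rouge">run</code> are hypothetical names chosen for illustration, and the real library may be structured quite differently.</p>

```python
import functools


class StageProxy:
    """Hypothetical stand-in for a proxy object; not pydags' actual class."""

    def __init__(self, func, args, kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs
        self.upstream = []  # stages that must finish before this one

    def after(self, other):
        # Record the dependency and return self so calls can be chained.
        self.upstream.append(other)
        return self

    def run(self):
        # Only here is the wrapped function actually invoked.
        return self.func(*self.args, **self.kwargs)


def lazy_stage(func):
    """Calling the decorated function builds a proxy instead of running it."""
    @functools.wraps(func)
    def build(*args, **kwargs):
        return StageProxy(func, args, kwargs)
    return build


calls = []


@lazy_stage
def load():
    calls.append("load")


@lazy_stage
def train():
    calls.append("train")


load_stage = load()
train_stage = train().after(load_stage)
calls_before_run = list(calls)  # still empty: nothing has executed yet
load_stage.run()
train_stage.run()
```

<p>The key design choice is that calling the decorated function is cheap and side-effect free, which lets a scheduler inspect the whole graph before running anything.</p>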
<p>In order to define inter-dependencies between pipeline stages, one simply has to call the <code class="language-plaintext highlighter-rouge">.after()</code> method
of a particular stage in a sort of object builder pattern. This is similar to Kubeflow in this way. Next,
we define some inter-dependencies in our 6-stage pipeline:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stage3</span> <span class="o">=</span> <span class="n">stage_3</span><span class="p">().</span><span class="n">after</span><span class="p">(</span><span class="n">stage2</span><span class="p">)</span>
<span class="n">stage4</span> <span class="o">=</span> <span class="n">stage_4</span><span class="p">().</span><span class="n">after</span><span class="p">(</span><span class="n">stage2</span><span class="p">)</span>
<span class="n">stage5</span> <span class="o">=</span> <span class="n">stage_5</span><span class="p">().</span><span class="n">after</span><span class="p">(</span><span class="n">stage1</span><span class="p">)</span>
<span class="n">stage6</span> <span class="o">=</span> <span class="n">stage_6</span><span class="p">().</span><span class="n">after</span><span class="p">(</span><span class="n">stage3</span><span class="p">).</span><span class="n">after</span><span class="p">(</span><span class="n">stage4</span><span class="p">).</span><span class="n">after</span><span class="p">(</span><span class="n">stage5</span><span class="p">)</span>
</code></pre></div></div>
<p>In the first line above, we tell pydags that the computation for Stage 3 must occur after Stage 2, and so on. We
can now define the pipeline:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">()</span>
<span class="n">pipeline</span><span class="p">.</span><span class="n">add_stages</span><span class="p">([</span>
    <span class="n">stage1</span><span class="p">,</span> <span class="n">stage2</span><span class="p">,</span> <span class="n">stage3</span><span class="p">,</span>
    <span class="n">stage4</span><span class="p">,</span> <span class="n">stage5</span><span class="p">,</span> <span class="n">stage6</span>
<span class="p">])</span>
</code></pre></div></div>
<p>The two primary methods for a Pipeline object are <code class="language-plaintext highlighter-rouge">visualize</code> and <code class="language-plaintext highlighter-rouge">start</code>. Firstly, <code class="language-plaintext highlighter-rouge">visualize</code> simply renders a
visual representation of the pipeline in a matplotlib figure. For this Pipeline, the following figure is shown when
running <code class="language-plaintext highlighter-rouge">pipeline.visualize()</code>:</p>
<p><img src="/assets/pydags_1.png" alt="Simple DAG" /></p>
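<p>To see which orderings the dependencies declared with <code class="language-plaintext highlighter-rouge">.after()</code> actually permit, the same six-stage graph can be modelled with the standard library's <code class="language-plaintext highlighter-rouge">graphlib</code> module (Python 3.9+). This is an independent sketch and says nothing about how pydags schedules stages internally. The <code class="language-plaintext highlighter-rouge">execution_waves</code> helper and the string stage names are assumptions made for illustration.</p>

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it must run after.
dependencies = {
    "stage1": set(),
    "stage2": set(),
    "stage3": {"stage2"},
    "stage4": {"stage2"},
    "stage5": {"stage1"},
    "stage6": {"stage3", "stage4", "stage5"},
}


def execution_waves(graph):
    """Group stages into waves; stages within a wave are independent."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
```

<p>Stages in the same wave have no dependencies on each other, so they are the natural candidates for running in parallel.</p>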
<p>The <code class="language-plaintext highlighter-rouge">.start()</code> method invokes the execution of the pipeline. All the stages will execute in the order implied by the
user-defined dependencies. One may specify a positive integer for the <code class="language-plaintext highlighter-rouge">num_cores</code> parameter of the <code class="language-plaintext highlighter-rouge">.start()</code> method in order to run
stages of the pipeline in parallel (those which can be run in parallel, such as stages 1 and 2, or stages 3, 4,
and 5). pydags will distribute the computation over the number of CPU cores specified.</p>A Foundation of Mathematics - The Peano Axioms2020-09-12T19:55:55+00:002020-09-12T19:55:55+00:00https://davidtorpey.com//2020/09/12/peano<p>In the past, mathematicians wished to create a foundation for all of mathematics. The number system can be constructed hierarchically from the set of natural numbers \(\mathbb{N}\). From \(\mathbb{N}\), we can construct the integers \(\mathbb{Z}\), rationals \(\mathbb{Q}\), reals \(\mathbb{R}\), complex numbers \(\mathbb{C}\), and more. However, it is desirable to be able to construct the naturals (\(\mathbb{N}\)) from more basic ingredients, since there is no reason \(\mathbb{N}\) should itself be fundamental.</p>
<p>The Peano axioms (1889) are a set of axioms that allow for the construction of the natural numbers without ever referencing concepts such as arithmetic or counting. In this way, these axioms are fundamental.</p>
<h2 id="background-concepts">Background Concepts</h2>
<p>One should be familiar with the concept of a set, and that two sets with the same elements are the same set. Secondly, we define a binary relation \(=\), known commonly as equals, that is reflexive (\(x=x\)), symmetric (\(x=y \implies y=x\)), and transitive (\(x=y \wedge y=z \implies x=z\)). These properties may seem obvious, but they are key to defining what we mean by <em>equality</em>. Thirdly, we require that the set \(\mathbb{N}\) that we wish to construct with these axioms is closed under this \(=\) relation: if \(x \in \mathbb{N}\) and \(x = y\), then \(y \in \mathbb{N}\). Finally, we require the notion of a map / function. This is simply something that maps inputs to outputs.</p>
<h2 id="axoim-1">Axiom 1</h2>
\[a \in \mathbb{N}\]
<p>This axiom essentially forces the set under construction to be nonempty: \(\mathbb{N} \neq \emptyset\). We state that there is some element \(a\) that is a member of our set.</p>
<h2 id="axoim-2">Axiom 2</h2>
\[\exists S \ni x \in \mathbb{N} \implies S(x) \in \mathbb{N}\]
<h2 id="axoim-3">Axiom 3</h2>
\[\nexists x \in \mathbb{N} \ni S(x) = a\]
<h2 id="axiom-4">Axiom 4</h2>
\[x, y \in \mathbb{N} \wedge S(x) = S(y) \implies x=y\]
<p>Here we are essentially stating that our map \(S\) is injective.</p>
<p>Axioms 1-4 allow us to define a concept of <em>next</em> or <em>successor</em> without ever explicitly imposing preconceived notions about numbers. Now, if we associate each value of our successor function \(S\) with some symbol, it starts looking a lot like the set of natural numbers has been constructed. For example, if we define \(0 := a\), \(1 := S(a)\), \(2 := S(S(a))\), this seems very similar to the natural numbers.</p>
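<p>One way to get a feel for this construction is to model it with nothing but nested tuples. In the sketch below, the choice of the empty tuple for \(a\) and of tuple-wrapping for \(S\) is an assumption made for illustration; any objects satisfying Axioms 1-4 would do. The <code class="language-plaintext highlighter-rouge">as_int</code> helper does use ordinary counting, but only to display the result, not to build it.</p>

```python
a = ()  # the distinguished element from Axiom 1


def S(x):
    """Successor map: wrap the element in one more layer of tuple."""
    return (x,)


def as_int(x):
    """Display helper only: count how many times S was applied to a."""
    count = 0
    while x != a:
        (x,) = x  # peel off one layer
        count += 1
    return count


zero, one, two = a, S(a), S(S(a))

# Build the first ten elements by repeatedly applying S.
first_ten = [a]
for _ in range(9):
    first_ten.append(S(first_ten[-1]))
```

<p>On these elements, no successor equals \(a\) (Axiom 3) and distinct elements have distinct successors (Axiom 4), which is what makes the chain \(a, S(a), S(S(a)), \dots\) behave like \(0, 1, 2, \dots\).</p>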
<p>However, we are not done. There is still a loophole that admits unintended elements. Consider \(e_1, e_2 \in \mathbb{N}\) with \(S(e_1) = e_2\) and \(S(e_2) = e_1\). This does not violate Axioms 1-4. It allows for a set that is bigger than the intended \(\mathbb{N}\), since \(e_1\) and \(e_2\) form a loop that is detached from every other element in \(\mathbb{N}\).</p>
<h2 id="axiom-5">Axiom 5</h2>
<p>Suppose \(\exists T \subset \mathbb{N}\) such that:</p>
\[a \in T\]
<p>and</p>
\[x \in T \implies S(x) \in T\]
<p>The only such set \(T\) is \(\mathbb{N}\). This axiom circumvents the above-described loophole.</p>
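<p>Written as a single formula (a restatement of the above in the same notation, with \(\subseteq\) used to make explicit that \(T\) may equal \(\mathbb{N}\)), Axiom 5 reads:</p>
\[\forall T \subseteq \mathbb{N} : \left( a \in T \wedge \left( \forall x \in T : S(x) \in T \right) \right) \implies T = \mathbb{N}\]
<p>To see why this closes the loophole, take \(T = \{a, S(a), S(S(a)), \dots\}\). This \(T\) contains \(a\) and is closed under \(S\), so the axiom forces \(T = \mathbb{N}\). Detached elements such as \(e_1\) and \(e_2\) are not in \(T\), and therefore cannot be in \(\mathbb{N}\).</p>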
<p>Please note that this blog post is, in part, a summary of <a href="https://www.youtube.com/watch?v=3gBoP8jZ1Is">this video</a>.</p>A Simple Framework for Contrastive Learning of Visual Representations2020-09-12T19:55:55+00:002020-09-12T19:55:55+00:00https://davidtorpey.com//2020/09/12/simclr<p>A popular and useful framework for <em>contrastive</em> self-supervised learning known as <strong>SimCLR</strong> was introduced by <a href="https://arxiv.org/pdf/2002.05709.pdf">Chen et al.</a> The framework simplifies previous contrastive approaches to self-supervised learning, and at the time was state-of-the-art at unsupervised image representation learning. The main simplification lies in the fact that SimCLR requires no specialised modules or additions to the architecture such as memory banks.</p>
<h2 id="architecture">Architecture</h2>
<p>As with previous contrastive methods, the architecture is Siamese-like, as can be seen below:</p>
<p><img src="/assets/simclr.png" alt="simclr" /></p>
<p>An input image \(x\) is sampled from the training dataset. Two random augmentations are then applied to this input image to produce two distinct views \(x_i\) and \(x_j\) of the same image. These two images are fed through the same encoder \(f\) to produce latent vectors \(h_i\) and \(h_j\). It should be noted that \(f\) is parameterised as a large CNN such as ResNet. Finally, these latent vectors are fed through an MLP \(g\) (known as the <em>projection head</em>) to produce final latent vectors \(z_i\) and \(z_j\).</p>
<p>Although \(g\) may initially seem unnecessary, it has been shown empirically to improve performance versus computing the loss directly on \(h_i\) and \(h_j\). Importantly, the supervision signal is computed using the projection head latent vectors \(z_i\) and \(z_j\), and <strong>not</strong> using the encoder latent vectors. Further, the encoder latent vectors are the ones used for downstream tasks (linear evaluation, etc.) after this self-supervised training.</p>
<h2 id="loss-function">Loss Function</h2>
<p>Since this is a <em>contrastive</em> method, the loss function is defined in such a way that it contrasts negative pairs of examples with positive pairs of examples (i.e. \(x_i\) and \(x_j\) are a positive pair of examples). This loss function is known as normalised temperature-scaled cross entropy (NT-Xent), and is defined as:</p>
\[l_{i,j} = -\text{log} \frac{\text{exp}(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} 1\{k \neq i\} \text{exp}(\text{sim}(z_i, z_k) / \tau)}\]
<p>where \(\tau\) is the temperature, \(N\) is the minibatch size, and \(\text{sim}\) is the cosine similarity. With this loss, for a given \(z_i\), every other augmented example in the minibatch is treated as a negative, except for its positive partner \(z_j\). This loss function works well in practice, but typically requires very large batch sizes to be effective.</p>
<p><em>Note that this architecture and loss function are fully unsupervised.</em></p>
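<p>The NT-Xent formula above translates fairly directly into code. The sketch below is a plain-Python illustration rather than the SimCLR implementation. It assumes the \(2N\) projected vectors are stored in one list with positive partners at adjacent indices \((2k, 2k+1)\), and the default temperature of 0.5 is an arbitrary choice for the example.</p>

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_pair(z, i, j, tau=0.5):
    """l_{i,j}: contrast the positive pair (i, j) against all k != i."""
    numerator = math.exp(cosine_similarity(z[i], z[j]) / tau)
    denominator = sum(
        math.exp(cosine_similarity(z[i], z[k]) / tau)
        for k in range(len(z))
        if k != i
    )
    return -math.log(numerator / denominator)


def nt_xent(z, tau=0.5):
    """Average l_{i,j} and l_{j,i} over all positive pairs.

    Assumes z[2k] and z[2k + 1] are the two views of the same image.
    """
    n_pairs = len(z) // 2
    total = 0.0
    for k in range(n_pairs):
        i, j = 2 * k, 2 * k + 1
        total += nt_xent_pair(z, i, j, tau) + nt_xent_pair(z, j, i, tau)
    return total / (2 * n_pairs)


# Positives agree in the first batch and disagree in the second.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(aligned)
loss_misaligned = nt_xent(misaligned)
```

<p>The loss is small when the two views of each image map to similar vectors and the other examples in the batch do not, and it grows when the positives are no more similar than the negatives.</p>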
<h2 id="data-augmentations">Data Augmentations</h2>
<p>The augmentations applied during training are 1. random cropping and resizing, 2. random colour distortions (in the form of brightness, hue, saturation, and contrast jitter), and 3. random Gaussian blurring. It should be noted that, for SimCLR, random cropping and colour jittering are crucial for good performance.</p>
<h2 id="consideration">Consideration</h2>
<p>Although SimCLR works well, there are a few drawbacks. Firstly, the computation needed to train the architecture to an acceptable level can be prohibitive. The batch size used in the paper is 4096, which means 8192 augmented images are processed in each training iteration.</p>Reducing the dimensionality of data with neural networks2020-06-26T19:55:55+00:002020-06-26T19:55:55+00:00https://davidtorpey.com//2020/06/26/reducing-dim-ae<p>Reducing the dimensionality of data has many valuable potential uses. The low-dimensional version of the data can be used for visualisation, or for further processing in a modelling pipeline. The low-dimensional version should capture only the salient features of the data, and can indeed be seen as a form of compression. Many techniques for dimensionality reduction exist, including <a href="https://www.tandfonline.com/doi/abs/10.1080/14786440109462720">PCA</a> (and its kernelized variant Kernel PCA), <a href="https://cs.nyu.edu/~roweis/lle/papers/lleintro.pdf">Locally Linear Embedding</a>, <a href="https://web.mit.edu/cocosci/Papers/sci_reprint.pdf">ISOMAP</a>, <a href="https://arxiv.org/pdf/1802.03426.pdf">UMAP</a>, <a href="https://www.ics.uci.edu/~welling/teaching/273ASpring09/Fisher-LDA.pdf">Linear Discriminant Analysis</a>, and <a href="http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">t-SNE</a>. Some of these are linear methods, while others are non-linear methods. Many of the non-linear methods fall into a class of algorithms known as manifold learning algorithms.</p>
<h2 id="architecture">Architecture</h2>
<p>The dimensionality reduction technique discussed in this paper is based on neural networks, and is known as the <a href="https://www.cs.toronto.edu/~hinton/science.pdf">autoencoder</a>. An autoencoder is essentially a non-linear generalisation of PCA. The autoencoder architecture consists of an encoder network and decoder network, with a latent code bottleneck layer in the middle (see below figure). The goal of the encoder is to compress the input vector into a low-dimensional code that captures the salient features / information in the data. The goal of the decoder is to use that code to reconstruct an approximation of the input vector. The two networks are parameterised as multi-layer perceptrons (MLPs), and the full autoencoder (encoder + decoder) is trained end-to-end using gradient descent. Formally, the goal of an autoencoder is to minimise \(L(x, g(f(x)))\), where \(L\) is some loss function, \(f\) is the encoder network, and \(g\) is the decoder network.</p>
<p><img src="/assets/rddnn1.png" alt="rddnn1" /></p>
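<p>To make the objective \(L(x, g(f(x)))\) concrete, here is a minimal sketch of an autoencoder trained end-to-end with gradient descent. It is not the architecture from the paper: the toy data, the single tanh encoder unit, the linear decoder, the learning rate, and all function names are assumptions chosen purely for illustration.</p>

```python
import math
import random


def make_data(n=50, seed=0):
    """Toy data: 2-D points on the line x2 = 2 * x1, so one latent unit suffices."""
    rng = random.Random(seed)
    return [(t, 2.0 * t) for t in (rng.uniform(-1.0, 1.0) for _ in range(n))]


def init_params():
    """Small fixed initial weights: (w_e, b_e, w_d, b_d)."""
    return ([0.1, 0.1], 0.0, [0.1, 0.1], [0.0, 0.0])


def encode(params, x):
    """Encoder f: R^2 -> R, a single tanh unit producing the latent code."""
    w_e, b_e, _, _ = params
    return math.tanh(w_e[0] * x[0] + w_e[1] * x[1] + b_e)


def decode(params, z):
    """Decoder g: R -> R^2, a linear layer reconstructing the input."""
    _, _, w_d, b_d = params
    return [w_d[0] * z + b_d[0], w_d[1] * z + b_d[1]]


def reconstruction_loss(params, data):
    """Mean squared reconstruction error L(x, g(f(x))) over the dataset."""
    total = 0.0
    for x in data:
        r = decode(params, encode(params, x))
        total += (r[0] - x[0]) ** 2 + (r[1] - x[1]) ** 2
    return total / len(data)


def train(data, params, steps=500, lr=0.05):
    """Full-batch gradient descent on encoder and decoder jointly."""
    w_e, b_e, w_d, b_d = params
    n = len(data)
    for _ in range(steps):
        current = (w_e, b_e, w_d, b_d)
        g_we, g_be, g_wd, g_bd = [0.0, 0.0], 0.0, [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(current, x)
            r = decode(current, z)
            err = [2.0 * (r[k] - x[k]) / n for k in range(2)]  # dL/dr
            dz = err[0] * w_d[0] + err[1] * w_d[1]             # back through the decoder
            da = dz * (1.0 - z * z)                            # back through tanh
            for k in range(2):
                g_wd[k] += err[k] * z
                g_bd[k] += err[k]
                g_we[k] += da * x[k]
            g_be += da
        w_e = [w_e[k] - lr * g_we[k] for k in range(2)]
        w_d = [w_d[k] - lr * g_wd[k] for k in range(2)]
        b_d = [b_d[k] - lr * g_bd[k] for k in range(2)]
        b_e -= lr * g_be
    return (w_e, b_e, w_d, b_d)


data = make_data()
initial = reconstruction_loss(init_params(), data)
trained = train(data, init_params())
final = reconstruction_loss(trained, data)
print(f"loss before: {initial:.4f}, after: {final:.4f}")
```

<p>The one-dimensional code here plays the role of the bottleneck layer; a practical autoencoder would use deeper MLPs and an automatic differentiation library rather than hand-written gradients.</p>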
<h2 id="pre-training">Pre-Training</h2>
<p>One important trick performed in the paper is pre-training of the autoencoder. This is done in order to get the weights of the network to be at a suitable initialisation such that fine-tuning is easier and more effective. The pre-training is done in a greedy, layer-wise manner (i.e. each pair of layers is pre-trained separately). This pre-training is done using a restricted Boltzmann machine (<a href="https://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf">RBM</a>).</p>
<h2 id="results">Results</h2>
<p>It is important to recall that an autoencoder is performing non-linear dimensionality reduction, and as such should learn a better low-dimensional data manifold than linear methods such as PCA or <a href="http://www.gbv.de/dms/ilmenau/toc/180019538.PDF">factor analysis</a>. We can see a comparison between the low-dimensional representations learned by <a href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">LSA</a> and an autoencoder in the below figure (applied to documents). Clearly, the autoencoder appears to learn a better representation.</p>
<p><img src="/assets/rddnn2.png" alt="rddnn2" /></p>
<p><img src="/assets/rddnn3.png" alt="rddnn3" /></p>Reducing the dimensionality of data has many valuable potential uses. The low-dimensional version of the data can be used for visualisation, or for further processing in a modelling pipeline. The low-dimensional version should capture only the salient features of the data, and can indeed be seen as a form of compression. Many techniques for dimensionality reduction exist, including PCA (and its kernelized variant Kernel PCA), Locally Linear Embedding, ISOMAP, UMAP, Linear Discriminant Analysis, and t-SNE. Some of these are linear methods, while others are non-linear methods. Many of the non-linear methods fall into a class of algorithms known as manifold learning algorithms.FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence2020-06-25T19:55:55+00:002020-06-25T19:55:55+00:00https://davidtorpey.com//2020/06/25/fixmatch<p>Labelled data is often either expensive or hard to obtain. As such, there has been a plethora of work to make better use of unlabelled data in machine learning, with paradigms such as unsupervised learning, semi-supervised learning, and more recently, self-supervised learning. <a href="https://arxiv.org/pdf/2001.07685.pdf">FixMatch</a> is an approach to semi-supervised learning (SSL) that combines two common approaches of SSL: 1. consistency regularisation and 2. pseudo-labelling.</p>
<h2 id="consistency-regularisation">Consistency Regularisation</h2>
<p>Consistency regularisation is an approach that utilises unlabelled data, and its core assumption is: <em>the model should output similar predictions when fed perturbed versions of the same input sample</em>. Formally, what this means is that given a model \(f\) and input sample \(x\), \(f(x) \approx f(a(x))\) for some perturbation function \(a\). For example, for a given image, the model should return the same prediction for any perturbed version of that image (e.g. after colour jittering, or an affine transformation).</p>
<p>The vanilla loss term when enforcing consistency is given by:</p>
\[\sum_i ||p(y_i | \alpha(u_i)) - p(y_i | \alpha(u_i))||_2^2\]
<p>where \(p\) is the model, \(u_i\) is an unlabelled example, and \(\alpha\) is a stochastic perturbation function. Because \(\alpha\) is stochastic, the two terms are evaluated on two independent perturbations of \(u_i\) and are therefore not identical. The \(L_2\)-norm can be swapped out for other norms or metrics, but the key idea is that perturbed versions of the same input should produce similar predictions.</p>
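<p>The consistency term can be sketched in a few lines. The model below is a hypothetical stand-in (a fixed linear map followed by a softmax), and the perturbation is assumed to be additive Gaussian noise; neither comes from the paper. The point is only that the same stochastic function is applied twice to each unlabelled sample.</p>

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax returning a probability vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(x):
    """Hypothetical stand-in for p(y | x): a fixed linear map to 3 class logits."""
    weights = [[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]]
    return softmax([w[0] * x[0] + w[1] * x[1] for w in weights])


def perturb(x, rng, scale=0.1):
    """Assumed stochastic perturbation alpha: small additive Gaussian noise."""
    return [v + rng.gauss(0.0, scale) for v in x]


def consistency_loss(model, unlabelled, rng):
    """Sum of squared L2 distances between predictions on two perturbed views."""
    loss = 0.0
    for u in unlabelled:
        p1 = model(perturb(u, rng))  # first random draw of alpha(u)
        p2 = model(perturb(u, rng))  # second, independent draw of alpha(u)
        loss += sum((a - b) ** 2 for a, b in zip(p1, p2))
    return loss


rng = random.Random(0)
batch = [[0.5, 1.0], [-1.0, 0.3], [2.0, -0.7]]
print(consistency_loss(toy_model, batch, rng))
```

<p>A model whose output does not change under the perturbation incurs zero loss, which is exactly the behaviour the regulariser rewards.</p>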
<h2 id="pseudo-labelling">Pseudo-Labelling</h2>
<p>The idea behind pseudo-labelling is to use the model itself to produce artificial labels for unlabelled data. Such pseudo-labels are usually made to be hard labels (i.e. argmax of the model’s predicted class distribution), since this encourages the model to be confident in its predictions.</p>
<p>The vanilla loss term when employing pseudo-labelling is given by:</p>
\[\sum_i 1\{\max(q_i) \ge \tau\} H(\hat{q}_i, q_i)\]
<p>where \(q_i = p(y_i \vert u_i)\), \(\hat{q}_i = \text{argmax}(q_i)\) is a one-hot pseudo-label, \(H\) is cross-entropy, and \(\tau\) is a threshold parameter.</p>
<h2 id="the-model">The Model</h2>
<p><img src="/assets/fixmatch1.png" alt="fixmatch1" /></p>
<p>Consistency regularisation is enforced through the use of two data augmentation strategies. The first is weak augmentation, which is a simple flip-and-shift strategy whereby images are randomly flipped horizontally with probability \(0.5\), and randomly translated up to \(12.5\)% horizontally and vertically. The second is strong augmentation, which is implemented using either <a href="https://arxiv.org/abs/1909.13719">RandAugment</a> or <a href="https://arxiv.org/pdf/1911.09785.pdf">CTAugment</a>. Both of these strong augmentation strategies apply a stronger form of distortion to the source images, such as colour distortion and affine transformations such as shearing.</p>
<h3 id="loss-function">Loss Function</h3>
<p>The loss function for the FixMatch model consists of two terms: a supervised term \(l_s\) and an unsupervised term \(l_u\). Additionally, since FixMatch is an SSL algorithm, the loss is computed using a labelled batch of images, as well as a larger unlabelled batch of images. Note that \(\alpha\) is a weak augmentation function, and \(A\) is a strong augmentation function.</p>
<p>The supervised term is standard cross-entropy on weakly-augmented versions of the images in the batch:</p>
\[l_s = \frac{1}{B} \sum_i H(p_i, p(y_i | \alpha(x_i)))\]
<p>where \(B\) is the number of labelled images in the batch, \(x_i\) is a labelled example, and \(p_i\) is its one-hot label.</p>
<p>The unsupervised term relies on a model-generated pseudo-label. To compute this one-hot label, we first compute the model’s class distribution on weakly-augmented versions of the images: \(q_i = p(y_i \vert \alpha(u_i))\). The pseudo-label is then given by: \(\hat{q}_i = \text{argmax}(q_i)\). The actual loss term is then standard cross-entropy using this pseudo-label as the ground truth vs. predictions on <em>strongly-augmented</em> versions of the images:</p>
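\[l_u = \frac{1}{\mu B} \sum_i 1\{\max(q_i) \ge \tau\} H(\hat{q}_i, p(y_i \vert A(u_i)))\]
<p>where \(\mu \in \mathbb{N}\) (typically \(\mu > 1\)) is the ratio of unlabelled to labelled examples in a batch, so that the sum runs over \(\mu B\) unlabelled images, and \(\tau\) denotes the confidence threshold above which we retain the generated pseudo-label.</p>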
<p>The full final loss function is then given by \(l_s + \lambda_u l_u\), where \(\lambda_u \in \mathbb{R}\) is a parameter that controls the weight given to the unlabelled loss term.</p>
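<p>As a rough sketch of how the two terms combine, the function below computes \(l_s + \lambda_u l_u\) from predictions that are assumed to have been produced already by the model on weakly and strongly augmented inputs. The function names, the example predictions, and the default values of <code>tau</code> and <code>lambda_u</code> are illustrative assumptions, not a reference implementation.</p>

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) for two discrete distributions over the same classes."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def one_hot(index, num_classes):
    """One-hot vector with a 1 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def fixmatch_loss(labels, preds_labelled, preds_weak, preds_strong,
                  tau=0.95, lambda_u=1.0):
    """Return (total, l_s, l_u) from already-computed model predictions.

    labels         : class indices of the B labelled images
    preds_labelled : p(y | alpha(x_i)) on weakly augmented labelled images
    preds_weak     : q_i = p(y | alpha(u_i)) on the mu * B unlabelled images
    preds_strong   : p(y | A(u_i)) on the same unlabelled images
    """
    num_classes = len(preds_labelled[0])

    # Supervised term: cross-entropy against the true one-hot labels.
    l_s = sum(cross_entropy(one_hot(y, num_classes), p)
              for y, p in zip(labels, preds_labelled)) / len(labels)

    # Unsupervised term: only confident weak predictions yield a pseudo-label.
    l_u = 0.0
    for q, p_strong in zip(preds_weak, preds_strong):
        if max(q) >= tau:
            pseudo = one_hot(q.index(max(q)), num_classes)
            l_u += cross_entropy(pseudo, p_strong)
    l_u /= len(preds_weak)  # divide by mu * B, not by the number retained

    return l_s + lambda_u * l_u, l_s, l_u


labels = [0, 2]
preds_labelled = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
preds_weak = [[0.97, 0.02, 0.01], [0.5, 0.3, 0.2]]   # only the first passes tau
preds_strong = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
total, l_s, l_u = fixmatch_loss(labels, preds_labelled, preds_weak, preds_strong)
print(total, l_s, l_u)
```

<p>Note that the second unlabelled example contributes nothing to \(l_u\), because its most confident class falls below the threshold.</p>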
<h2 id="results">Results</h2>
<p><img src="/assets/fixmatch2.png" alt="fixmatch2" /></p>
<p>The key results from the paper can be seen in the above picture. It is also interesting to note that FixMatch manages to achieve \(78\)% accuracy on CIFAR-10 with only <strong>1</strong> image per class.</p>
<h2 id="important-considerations">Important Considerations</h2>
<p>The paper notes that careful attention has to be given to various factors of the deep learning pipeline in label-sparse settings such as SSL. In particular, SSL methods are disproportionately affected by factors such as optimiser choice, learning rate schedule, and regularisation. The recommendations from the paper include using vanilla SGD with momentum instead of Adam, weight decay regularisation (parameter norm penalties), and a specific cosine-based learning rate schedule.</p>Labelled data is often either expensive or hard to obtain. As such, there has been a plethora of work to make better use of unlabelled data in machine learning, with paradigms such as unsupervised learning, semi-supervised learning, and more recently, self-supervised learning. FixMatch is an approach to semi-supervised learning (SSL) that combines two common approaches of SSL: 1. consistency regularisation and 2. pseudo-labelling.All About Convex Hulls2020-06-24T19:55:55+00:002020-06-24T19:55:55+00:00https://davidtorpey.com//2020/06/24/convex-hulls<p>The convex hull is a very important concept in geometry, and has many applications in fields such as computer vision, mathematics, statistics, and economics. Essentially, a convex hull of a shape or set of points is the smallest convex set that contains that shape or set of points. Many algorithms exist to compute a convex hull. Many of these algorithms have focused on the 2D or 3D case, however, the general \(d\)-dimensional case is of big interest in many applications.</p>
<p>It is important to first build up some background knowledge so that we can effectively talk about convex hulls. We will be working with the general \(d\)-dimensional case, but will visualise in 2D. First, we have the concept of a \(d\)-simplex, which is just a generalisation of the concept of a triangle to arbitrary dimensions (similar to what a cube is to a square, or a hyperplane to a line). Additionally, a \(d\)-dimensional convex hull is represented by its vertices and “faces” (which are essentially \((d - 1)\)-dimensional affine hyperplanes).</p>
<p>Consider a set of \(n\) points \(\mathcal{S} = \{\mathbf{x}_i \in \mathbb{R}^d\}_{i=1}^n\) for which we want to compute a convex hull.</p>
<p>A very naive approach (that I strongly recommend against using) is the following. We simply consider all possible faces / hyperplanes that can be made using points from \(\mathcal{S}\), and choose only those faces where all other points lie only to one side of the face. In the below figure, F1 is one such face, since all other points lie to one side of the face. However, F2 is not, since points lie on both sides of it. Thus F1 will be part of the convex hull, whereas F2 will not.</p>
<p><img src="/assets/ch1.png" alt="CH1" /></p>
<p>This naive algorithm is highly inefficient since all possible faces will need to be checked. This involves checking \(n \choose d\) possible faces. In other words, the number of candidate faces alone grows as \(\Theta({n \choose d})\), and each candidate must additionally be tested against the remaining points.</p>
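<p>For intuition, the naive approach can be sketched in 2D, where a “face” is simply the line through a pair of points. The sketch assumes distinct points in general position (no three hull points collinear); the function names and the example points are made up for illustration.</p>

```python
from itertools import combinations


def cross(o, a, b):
    """z-component of (a - o) x (b - o); its sign gives the side of line o->a that b is on."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def naive_hull_edges(points):
    """Return every pair of points whose connecting line has all other points on one side."""
    edges = []
    for p, q in combinations(points, 2):          # all n-choose-2 candidate faces
        sides = [cross(p, q, r) for r in points if r != p and r != q]
        if all(s >= 0 for s in sides) or all(s <= 0 for s in sides):
            edges.append((p, q))
    return edges


# Four corners of a rectangle plus two interior points.
points = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
for edge in naive_hull_edges(points):
    print(edge)
```

<p>Only the four sides of the rectangle survive the test; every candidate line that passes through an interior point, or along a diagonal, has points on both sides.</p>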
<p>A much more efficient algorithm for computing a convex hull is the quickhull algorithm. It is a popular algorithm for the general dimension case, and is indeed the implementation in the scipy package (which leverages the qhull library).</p>
<p>A key operation used in the quickhull algorithm is <em>signed distance from a point to a hyperplane</em>. We need it to be signed, since we want to know which side of the hyperplane / face the point lies on. Formally, we can compute this signed distance for a point \(\mathbf{x}\) using the following:</p>
\[\frac{\langle \mathbf{x}, \mathbf{n} \rangle - \langle \mathbf{p}, \mathbf{n} \rangle}{||\mathbf{n}||_2}\]
<p>where \(\mathbf{p}\) is a point that lies on the hyperplane, and \(\mathbf{n}\) is the hyperplane’s normal vector. This distance is visualised in the below figure.</p>
<p><img src="/assets/ch2.png" alt="CH2" /></p>
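<p>The signed distance formula translates directly into code. The sketch below uses plain Python tuples for vectors; the function name and the example hyperplane are chosen only for illustration.</p>

```python
import math


def signed_distance(x, p, n):
    """Signed distance from point x to the hyperplane through p with normal n."""
    dot_xn = sum(xi * ni for xi, ni in zip(x, n))
    dot_pn = sum(pi * ni for pi, ni in zip(p, n))
    norm_n = math.sqrt(sum(ni * ni for ni in n))
    return (dot_xn - dot_pn) / norm_n


# The line x + y = 2 in 2D: it passes through p = (1, 1) and has normal n = (1, 1).
p, n = (1.0, 1.0), (1.0, 1.0)
print(signed_distance((2.0, 2.0), p, n))  # positive: on the side the normal points to
print(signed_distance((0.0, 0.0), p, n))  # negative: on the opposite side
print(signed_distance((2.0, 0.0), p, n))  # zero: the point lies on the hyperplane
```

<p>The sign is what matters for hull construction: it tells us whether a point lies outside a candidate face, and the magnitude can be used to pick the farthest such point.</p>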
<p>The full algorithm is given below:</p>
<p><img src="/assets/quickhull.png" alt="CH3" /></p>The convex hull is a very important concept in geometry, and has many applications in fields such as computer vision, mathematics, statistics, and economics. Essentially, a convex hull of a shape or set of points is the smallest convex set that contains that shape or set of points. Many algorithms exist to compute a convex hull. Many of these algorithms have focused on the 2D or 3D case, however, the general \(d\)-dimensional case is of big interest in many applications.Representation Learning (1)2019-12-28T19:55:55+00:002019-12-28T19:55:55+00:00https://davidtorpey.com//2019/12/28/fisher-vector-nn<p>For a while I’ve been interested in representation learning in the context of deep learning. Concepts such as self-supervised learning, unsupervised representation learning using GANs or VAEs, or simply through a vanilla supervised learning of some neural network architecture. Upon reading the literature, I had an idea that serves as a nice integration of two very interesting and useful models / techniques - the Fisher vector (which I’ve previously posted about in my blog <a href="https://davidtorpey.com/2018/11/25/feature-quantisation.html">here</a>), and the variational autoencoder (which I’ve been meaning to write a blog post about!). This blog post just serves to flesh out the idea, should I choose to pursue or revisit it at some point.</p>
<p>The Fisher vector is a state-of-the-art patch encoding technique. It can be seen as a soft / probabilistic version of VLAD (vector of locally-aggregated descriptors), which itself is very similar to the bag of visual words encoding / quantisation technique, except that you quantise the residuals of local descriptors to their cluster center, instead of the actual visual word occurrences. The Fisher vector is based on the Fisher kernel, and assumes that the generation process of the descriptors being encoded can be modelled by some parametric probability distribution \(u_{\theta}\), where \(u\) is the PDF and \(\theta\) are the associated parameters of this distribution. Typically, in the context of Fisher vectors, \(u_{\theta}\) is chosen to be a \(K\)-mode GMM. Thus, \(\theta = \{\alpha_i, \mu_i, \Sigma_i\}_{i=1}^{K}\) are the \(K\) mixture weights, means, and covariance matrices of the GMM. EM can then be used to compute the maximum likelihood estimates of the parameters of the GMM. The Fisher vector is then defined to be the concatenation of the gradients of the log likelihood function of the GMM with respect to each of the parameters. What should be emphasised here is that \(u_{\theta}\) can be <strong>any</strong> parametric distribution, and the estimation of its parameters can be done in any way we prescribe, not necessarily using MLE/EM.</p>
<p>A <a href="https://arxiv.org/abs/1312.6114">variational autoencoder</a> (VAE) is a neural network architecture, and is a generative model. It is one of the most popular current generative models in deep learning, along with the <a href="https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf">generative adversarial network</a> (GAN). A VAE is a type of autoencoder (or, more correctly, encoder-decoder network) that contains a stochastic encoder function \(q_{\theta}(z \vert x)\), which is parameterised as a neural network. This encoder outputs the parameters of \(q_{\theta}(z \vert x)\), for which we choose, a-priori, some parametric form (e.g. a multivariate Gaussian). We can then obtain a latent representation \(z\) of our input by sampling from this distribution using our learned parameter estimates. The decoder part of the VAE is also parameterised as a neural network, and is defined as \(p_{\phi}(x \vert z)\). Using this function, we can compute the reconstruction of our input \(x\). One of the goals of the VAE (and AEs in general) is that the inputs, and their associated reconstructions from the decoder, be similar. This should be achieved within the paradigm of the latent space serving as a bottleneck in the learning process. This encourages the network to only encode salient information in the latent representations of the input. The VAE loss function includes a KL-divergence term, in addition to the regular pixel-space loss (which is usually MSE or some variant). The KL-divergence term serves as a regulariser on the learning process, which forces distribution \(q\) to be close to distribution \(p\). In other words, we want the KL-divergence between the encoder \(q_{\theta}\) and the prior \(p(z)\) to be small.</p>
<p>Typically, \(p\) is chosen to be standard Normal, and \(q\) is chosen to be a multivariate Gaussian. Once trained, samples similar to those it was trained on can be generated using the learned distribution. However, for the purposes of this post, we focus on the VAE’s ability to learn the parameters of some distribution, whose functional form we choose a-priori.</p>
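<p>When \(p\) is standard Normal and \(q\) is a Gaussian with diagonal covariance (an additional assumption made here for simplicity), the KL term has the closed form \(\frac{1}{2} \sum_k (\sigma_k^2 + \mu_k^2 - 1 - \log \sigma_k^2)\). The sketch below computes it from a mean vector and a log-variance vector, which is a common but assumed parameterisation of the encoder output, and also draws a latent sample.</p>

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -0.5]   # hypothetical encoder outputs for one input
print(kl_to_standard_normal(mu, log_var))
print(sample_latent(mu, log_var, rng))
```

<p>The KL term is zero only when the encoder outputs exactly the prior (zero mean, unit variance), and grows as the two distributions move apart.</p>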
<p>The idea is to learn a Fisher vector using a variant of the VAE architecture. One limiting factor of the Fisher vector is that the information it encodes is based on interest points with associated descriptors. These interest points are usually detected with methods like SIFT or SURF, which all, in some way or another, define “interesting” as having large gradients in all directions. In this way, they often focus on image regions containing edges or corner-like structures, and thus disregard large portions of images that contain homogeneous regions or regions with a low colour gradient. However, I hypothesise that such regions can provide very valuable information in the global context of an image. Using a convolutional variant of a VAE, we can learn better representations of the images that take into account the full context of the image. Additionally, we can assume \(q\) to be a GMM, and can learn the parameters of the GMM. An additional layer in the network can be used to compute the Fisher vector using the GMM parameters. These can be easily included as a neural network layer, since all the operations to compute the Fisher vector have simple gradients. Thus, the full process of training the VAE, and by proxy learning a Fisher vector, can be done in an end-to-end learnable way.</p>
<p>Some great resources for this post can be found below:</p>
<p><a href="https://lear.inrialpes.fr/pubs/2010/PSM10/PSM10_0766.pdf">Fisher vectors</a></p>
<p><a href="https://jaan.io/what-is-variational-autoencoder-vae-tutorial/">VAE</a></p>
<p><a href="http://anotherdatum.com/vae.html">VAE</a></p>For a while I’ve been interested in representation learning in the context of deep learning. Concepts such as self-supervised learning, unsupervised representation learning using GANs or VAEs, or simply through a vanilla supervised learning of some neural network architecture. Upon reading the literature, I had an idea that serves as a nice integration of two very interesting and useful models / techniques - the Fisher vector (which I’ve previously posted about in my blog here), and the variational autoencoder (which I’ve been meaning to write a blog post about!). This blog post just serves to flesh out the idea, should I choose to pursue or revisit it at some point.SVMs: A Geometric Interpretation2019-03-30T19:55:55+00:002019-03-30T19:55:55+00:00https://davidtorpey.com//2019/03/30/svm-geometric-interpretation<p><img src="/assets/base.png" alt="Example Points" /></p>
<p>Consider a set of positive and negative samples from some dataset as shown above. How can we approach the problem of classifying these - and more importantly, unseen - samples as either positive or negative examples? The most intuitive way to do this is to draw a line / hyperplane between the positive and negative samples.</p>
<p>However, which line should we draw? We could draw this one:</p>
<p><img src="/assets/badline1.png" alt="Wrong line 1" /></p>
<p>or this one:</p>
<p><img src="/assets/badline2.png" alt="Wrong line 2" /></p>
<p>However, neither of the above seem like the best fit. Perhaps a line such that the boundary between the two classes is maximal is the optimal line?</p>
<p><img src="/assets/svmline.png" alt="SVM line" /></p>
<p>This line is such that the margin is maximized. This is the line an SVM attempts to find - an SVM attempts to find the <strong>maximum-margin separating hyperplane</strong> between the two classes. However, we need to construct a decision rule to classify examples. To do this, consider a vector \(\mathbf{w}\) perpendicular to the margin. Further, consider some unknown vector \(\mathbf{u}\) representing some example we want to classify:</p>
<p><img src="/assets/wandu.png" alt="Wrong line 1" /></p>
<p>We want to know which side of the decision boundary \(\mathbf{u}\) lies on in order to classify it. To do this, we project it onto \(\mathbf{w}\) by computing \(\mathbf{w} \cdot \mathbf{u}\). This gives us a value that is proportional to how far \(\mathbf{u}\) extends <em>in the direction of</em> \(\mathbf{w}\). We can then use this to determine which side of the boundary \(\mathbf{u}\) lies on using the following decision rule:</p>
\[\mathbf{w} \cdot \mathbf{u} \ge c\]
<p>for some \(c \in \mathbb{R}\). \(c\) is basically telling us that if we are far <em>enough</em> away, we can classify \(\mathbf{u}\) as a positive example. We can rewrite the above decision rule as follows:</p>
\[\mathbf{w} \cdot \mathbf{u} + b \ge 0\]
<p>where \(b = -c\).</p>
<p>But, what \(\mathbf{w}\) and \(b\) should we choose? We don’t have enough constraint in the problem to fix a particular \(\mathbf{w}\) or \(b\). Therefore, we introduce additional constraints:</p>
\[\mathbf{w} \cdot \mathbf{x}_+ + b \ge 1\]
<p>and</p>
\[\mathbf{w} \cdot \mathbf{x}_- + b \le -1\]
<p>These constraints basically force the function that defines our decision rule to produce a value of 1 or greater for positive examples, and -1 or less for negative examples.</p>
<p>Now, instead of dealing with two inequalities, we introduce a new variable, \(y_i\), for mathematical convenience. It is defined as:</p>
\[y_i = \begin{cases}
1 & \text{positive example} \\
-1 & \text{negative example}
\end{cases}\]
<p>This variable essentially encodes the targets of each example. We multiply both inequalities from above by \(y_i\). For the positive example constraint we get:</p>
\[y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1\]
<p>and for the negative example constraint we get:</p>
\[y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1\]
<p>which is the same constraint! The introduction of \(y_i\) has simplified the problem. We can rewrite this constraint as:</p>
\[y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0\]
<p>However, we go a step further: for the examples that lie exactly on the edges of the margin, we require this constraint to hold with equality:</p>
\[y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 = 0\]
<p>The above equation holds for the examples lying on the edges of the margin (known as <em>support vectors</em>): for them, the decision function evaluates to exactly \(\pm 1\). All other training examples lie strictly outside the margin and satisfy the inequality. It is the support vectors that define our decision boundary. For a fixed label, the equation is also clearly that of a hyperplane, which is what we want!</p>
<p>Keep in mind that our goal is to find the margin separating positive and negative examples to be as large as possible. This means that we will need to know the width of our margin so that we can maximize it. The following picture shows how we can calculate this width.</p>
<p><img src="/assets/width.png" alt="Margin Width" /></p>
<p>To calculate the width of the margin, we need a unit normal. Then we can just project \(\mathbf{x}_+ - \mathbf{x}_-\) onto this unit normal and this would exactly be the width of the margin. Luckily, vector \(\mathbf{w}\) was defined to be normal! Thus, we can compute the width as follows:</p>
\[\text{width} = (\mathbf{x}_+ - \mathbf{x}_-) \cdot \frac{\mathbf{w}}{||\mathbf{w}||}\]
<p>where the norm ensures that \(\mathbf{w}\) becomes a unit normal. From earlier, we know \(y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 = 0\). Using this, simple algebra yields:</p>
\[\mathbf{x}_+ \cdot \mathbf{w} = 1 - b\]
<p>and</p>
\[- \mathbf{x}_- \cdot \mathbf{w} = 1 + b\]
<p>Thus, substituting into the expression for the width yields:</p>
\[\text{width} = \frac{2}{||\mathbf{w}||}\]
<p>which is interesting! The width of our margin for such a problem depends only on \(\mathbf{w}\). Since we want to maximize the margin, we want:</p>
\[\text{max} \frac{2}{||\mathbf{w}||}\]
<p>which is the same as</p>
\[\text{max} \frac{1}{||\mathbf{w}||}\]
<p>which is the same as</p>
\[\text{min} ||\mathbf{w}||\]
<p>which is the same as</p>
\[\text{min} \frac{1}{2} ||\mathbf{w}||^2\]
<p>where we write it like this for mathematical convenience reasons that will become apparent shortly.</p>
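<p>The width result is easy to check numerically. The sketch below uses a hand-picked toy problem with one support vector per class; the particular values of \(\mathbf{w}\), \(b\), and the two points are assumptions chosen so that both margin constraints hold with equality.</p>

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def width_by_projection(w, x_pos, x_neg):
    """Project x_pos - x_neg onto the unit normal w / ||w||."""
    diff = [p - q for p, q in zip(x_pos, x_neg)]
    return dot(diff, w) / math.sqrt(dot(w, w))


def width_by_formula(w):
    """The closed-form margin width 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


# Toy problem: one support vector per class, lying exactly on the margin edges.
w, b = (0.5, 0.5), 0.0
x_pos, x_neg = (1.0, 1.0), (-1.0, -1.0)

print(dot(w, x_pos) + b, dot(w, x_neg) + b)   # the two margin constraints: +1 and -1
print(width_by_projection(w, x_pos, x_neg))   # projection of x_pos - x_neg
print(width_by_formula(w))                    # 2 / ||w||, the same value
```

<p>Both computations give \(2\sqrt{2}\), which is also the distance between the two points, as expected when the support vectors face each other directly across the margin.</p>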
<p>One easy approach to solve such an optimisation problem is using Lagrange multipliers. We first formulate our Lagrangian:</p>
\[L(\mathbf{w}, b) = \frac{1}{2} ||\mathbf{w}||^2 - \sum_i \alpha_i [y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1]\]
<p>We find the optimal settings for \(\mathbf{w}\) and \(b\) by computing the respective partial derivatives and setting them to zero. First, for \(\mathbf{w}\):</p>
\[\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0\]
<p>which implies that \(\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\). This means that \(\mathbf{w}\) is simply a linear combination of the samples! Now, for \(b\):</p>
\[\frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i = 0\]
<p>which implies that \(\sum_i \alpha_i y_i = 0\).</p>
<p>We could just stop here. We can solve the optimisation problem as is. However, we shall not do that! At least not yet. Let’s plug our expression for \(\mathbf{w}\) back into the Lagrangian:</p>
\[L = \frac{1}{2} (\sum_i \alpha_i y_i \mathbf{x}_i) \cdot (\sum_j \alpha_j y_j \mathbf{x}_j) - \sum_i \alpha_i y_i \mathbf{x}_i \cdot (\sum_j \alpha_j y_j \mathbf{x}_j) - \sum_i \alpha_i y_i b + \sum_i \alpha_i\]
<p>which, after some algebra (and using \(\sum_i \alpha_i y_i = 0\) to eliminate the term involving \(b\)), results in:</p>
\[L = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j\]
<p>What the above equation tells us is that the optimisation depends <strong>only</strong> on dot products of pairs of samples! This observation will prove key later on. Also, we should note that training examples that are not support vectors will have \(\alpha_i = 0\), as these examples do not affect or define the decision boundary.</p>
<p>Putting the expression for \(\mathbf{w}\) back into our decision rule yields:</p>
\[\sum_i \alpha_i y_i \mathbf{x}_i \cdot \mathbf{u} + b \ge 0\]
<p>which means the decision rule also depends <strong>only</strong> on dot products, this time between the training samples and the unknown vector \(\mathbf{u}\)! Another great benefit is that it is provable that this optimisation problem is convex - meaning that any optimum we find is guaranteed to be a global optimum.</p>
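<p>These relationships can be verified on a tiny example. The sketch below uses two training points and multipliers \(\alpha_i = 0.25\) that were worked out by hand for this toy set (an assumption specific to this example, not the output of a solver). It recovers \(\mathbf{w}\) from the multipliers, evaluates the dual objective, and classifies new points using dot products only.</p>

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def weight_vector(alpha, y, X):
    """w = sum_i alpha_i * y_i * x_i."""
    return [sum(a * t * x[k] for a, t, x in zip(alpha, y, X)) for k in range(len(X[0]))]


def dual_objective(alpha, y, X):
    """sum_i alpha_i - 1/2 * sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(X)
    pairwise = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
                   for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * pairwise


def decision_value(u, alpha, y, X, b):
    """sum_i alpha_i y_i (x_i . u) + b; classify as positive when this is >= 0."""
    return sum(a * t * dot(x, u) for a, t, x in zip(alpha, y, X)) + b


X = [(1.0, 1.0), (-1.0, -1.0)]   # one positive and one negative example
y = [1.0, -1.0]
alpha = [0.25, 0.25]             # hand-derived for this toy set; sum_i alpha_i y_i = 0
b = 0.0

w = weight_vector(alpha, y, X)
print(w)                                        # the weight vector built from the samples
print(dual_objective(alpha, y, X))              # dual value
print(0.5 * dot(w, w))                          # primal value (1/2)||w||^2, for comparison
print(decision_value((2.0, 0.5), alpha, y, X, b))
print(decision_value((-3.0, 1.0), alpha, y, X, b))
```

<p>For this example the dual and primal objective values coincide, and the sign of the decision value gives the predicted class of each new point.</p>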
<p>However, now a problem arises! The above optimisation problem assumes the data is linearly-separable in the input vector space. However, in most real-life scenarios, this assumption simply does not hold. We therefore have to adapt the SVM to accommodate this, and to allow for non-linear decision boundaries. To do this, we introduce a transformation \(\phi\) which will transform the input vector into a (high-dimensional) vector space. It is in this vector space that we will attempt to find the maximum-margin line / hyperplane.
In this case, we would simply need to swap the dot product \(\mathbf{x}_i \cdot \mathbf{x}_j\) in the optimisation problem with \(\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)\). We can do this solely because, as shown above, both the optimisation and the decision rule depend only on dot products between pairs of samples. This is known as the <em>kernel trick</em>. Thus, if we have a function \(K\) such that:</p>
\[K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)\]
<p>then we don’t actually need to know the transformation \(\phi\) itself! We only need the function \(K\), which is known as a kernel function. This is why we can use kernels that transform the data into an infinite-dimensional space (such as the RBF kernel), because we are not computing the transformations directly. Instead, we simply use a special function (i.e. kernel function) to compute dot products in this space without needing to compute the transformations.</p>
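<p>A small worked example makes the kernel trick tangible. For the degree-2 polynomial kernel \(K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2\) in 2D, an explicit feature map is \(\phi(\mathbf{x}) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)\). The sketch below checks that both routes give the same number, and also includes an RBF kernel, for which no finite-dimensional \(\phi\) can be written down. The function names and the value of <code>gamma</code> are illustrative assumptions.</p>

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2D inputs."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever forming phi."""
    return dot(x, z) ** 2


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); its feature space is infinite-dimensional."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


x, z = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, z))        # kernel evaluated in the input space
print(dot(phi(x), phi(z)))      # same value via the explicit 3D feature map
print(rbf_kernel(x, z))
```

<p>The kernel evaluation needs only a 2D dot product, whereas the explicit route first maps both points into 3D; the gap widens rapidly for higher degrees and dimensions.</p>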
<p>This kernel trick allows the SVM to learn non-linear decision boundaries, and the problem still clearly remains convex. However, even with the kernel trick, the SVM with such a formulation still assumes that the data is linearly-separable in this transformed space. Such SVMs are known as <em>hard-margin</em> SVMs. This assumption does not hold most of the time for real-world data. Therefore, we arrive at the most common form of the SVM nowadays - the <em>soft-margin</em> SVM. Essentially, so-called <em>slack</em> variables are introduced into the optimisation problem to control the amount of misclassification the SVM is allowed to make. For more information on soft-margin SVMs, see <a href="https://davidtorpey.com/2018/11/25/svm.html">my blog post on the subject</a>.</p>
<p>I highly recommend looking at <a href="https://www.youtube.com/watch?v=_PwhiWxHK8o&t=2s">this</a> lecture if you would like to learn more about the concept behind SVMs.</p>