Jekyll2019-07-01T06:43:48+00:00https://davidtorpey.com//feed.xmlDavid TorpeyStuff that interests me, and hopefully you too. Hopefully we learn something along the way as well.Face Recognition: Eigenfaces2019-04-29T19:55:55+00:002019-04-29T19:55:55+00:00https://davidtorpey.com//2019/04/29/eigenfaces<p>The main idea behind eigenfaces is that we want to learn a low-dimensional space - known as the eigenface subspace - on which we assume the faces intrinsically lie. From there, we can then compare faces within this low-dimensional space in order to perform facial recognition. It’s a relatively simple approach to facial recognition, but indeed one of the most famous and effective ones of the early approaches. It still works well in simple, controlled scenarios.</p>
<p>Assume we have a set of <script type="math/tex">m</script> images <script type="math/tex">\{I_i\}^{m}_{i=1}</script>, where <script type="math/tex">I_i \in \mathcal{G}^{r \times c}</script>; <script type="math/tex">\mathcal{G} = \{0, 1, \dots, 255\}</script>; and <script type="math/tex">r \times c</script> is the spatial dimension of the image. The first step to the algorithm is to resize all the images in the set to the same size. Typically, the images are converted to grayscale, since it is assumed that colour is not an important factor. This is clearly debatable, however, for the purposes of this post we will assume that the images are grayscale images.</p>
<p>Each image is then converted to a vector, by appending each row into one long vector. Given an image from the set, we convert it to a vector <script type="math/tex">\Gamma_i \in \mathcal{G}^{rc}</script>.</p>
<p>We then calculate the mean face <script type="math/tex">\Psi</script>:</p>
<script type="math/tex; mode=display">\Psi = \frac{1}{m} \sum_{i=1}^{m} \Gamma_i</script>
<p>We then zero-centre the image vectors <script type="math/tex">\Gamma_i</script> by subtracting the mean from each. This results in a set of vectors <script type="math/tex">\Phi_i</script>:</p>
<script type="math/tex; mode=display">\Phi_i = \Gamma_i - \Psi</script>
<p>We then perform PCA on the matrix <script type="math/tex">A</script>, where <script type="math/tex">A</script> is given by:</p>
<script type="math/tex; mode=display">A = [\Phi_1 \Phi_2 \cdots \Phi_m] \in \mathbb{R}^{rc \times m}</script>
<p>More concretely, we compute the covariance matrix <script type="math/tex">C \in \mathbb{R}^{rc \times rc}</script>:</p>
<script type="math/tex; mode=display">C = \frac{1}{m} \sum_{i=1}^m \Phi_i \Phi_i^T = \frac{1}{m} A A^T</script>
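<p>We would then typically compute the eigen decomposition of this matrix. However, in the interest of speed, the eigen decomposition is instead computed for <script type="math/tex">A^T A \in \mathbb{R}^{m \times m}</script>, which is much smaller whenever there are fewer images than pixels. This is mathematically justified since the <script type="math/tex">m</script> eigenvalues of <script type="math/tex">A^T A</script> are the <script type="math/tex">m</script> largest eigenvalues of <script type="math/tex">A A^T</script>, and each eigenvector <script type="math/tex">v</script> of <script type="math/tex">A^T A</script> gives an eigenvector <script type="math/tex">A v</script> of <script type="math/tex">A A^T</script>. The constant factor <script type="math/tex">\frac{1}{m}</script> only rescales the eigenvalues and leaves the eigenvectors unchanged.</p>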
<p>We then retain the first <script type="math/tex">k</script> principal components: the <script type="math/tex">k</script> eigenvectors with largest associated absolute eigenvalues. This corresponds to a matrix <script type="math/tex">V \in \mathbb{R}^{m \times k}</script>, where the columns of the matrix are these chosen eigenvectors. We then compute the so-called projection matrix <script type="math/tex">U \in \mathbb{R}^{rc \times k}</script>:</p>
<script type="math/tex; mode=display">U = A V</script>
<p>Lastly, we can finally find the eigenface subspace <script type="math/tex">\Omega \in \mathbb{R}^{k \times m}</script>:</p>
<script type="math/tex; mode=display">\Omega = U^T A</script>
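<p>The training steps above can be sketched in NumPy as follows. This is a minimal sketch rather than a reference implementation: the function name <code>fit_eigenfaces</code> and the random "faces" are assumptions made purely for illustration, and the columns of <code>U</code> are left unnormalised to match the formulas above (some implementations rescale them to unit length).</p>

```python
import numpy as np

def fit_eigenfaces(images, k):
    """images: array of shape (m, r, c). Returns (psi, U, omega)."""
    m = images.shape[0]
    # Flatten each image row by row; columns of gamma are the Gamma_i.
    gamma = images.reshape(m, -1).T.astype(np.float64)   # (rc, m)
    psi = gamma.mean(axis=1, keepdims=True)              # mean face, (rc, 1)
    A = gamma - psi                                      # centred faces, (rc, m)
    # Eigen decomposition of the small m x m matrix A^T A (symmetric).
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(eigvals)[::-1][:k]                # k largest eigenvalues
    V = eigvecs[:, order]                                # (m, k)
    U = A @ V                                            # projection matrix, (rc, k)
    omega = U.T @ A                                      # projected training faces, (k, m)
    return psi, U, omega

# Hypothetical data: 10 random 8x6 "images" with values in {0, ..., 255}.
rng = np.random.default_rng(0)
faces = rng.integers(0, 256, size=(10, 8, 6))
psi, U, omega = fit_eigenfaces(faces, k=3)
print(U.shape, omega.shape)  # (48, 3) (3, 10)
```

<p>Each column of <code>U</code> is an eigenvector of <code>A @ A.T</code>, even though that large matrix is never decomposed. Because the faces are centred, at most <code>m - 1</code> eigenvalues are non-zero, so it is sensible to keep <code>k</code> below <code>m</code>.</p>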
<p>Now, for the actual facial recognition part! Consider a resized grayscale test image <script type="math/tex">I \in \mathcal{G}^{r \times c}</script>. We reshape this into a vector:</p>
<script type="math/tex; mode=display">\Gamma \in \mathcal{G}^{rc \times 1}</script>
<p>We then mean-normalise:</p>
<script type="math/tex; mode=display">\Phi = \Gamma - \Psi</script>
<p>Finally, we project the test face onto the eigenface subspace (i.e. the linear manifold learned by PCA):</p>
<script type="math/tex; mode=display">\hat{\Phi} = U^T \Phi</script>
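<p>As a rough sketch of the recognition step, the snippet below projects a test face and compares it with the projected training faces using Euclidean distance, the classical choice. The helper names, the random training set and the noise level are all assumptions for illustration; the training part simply repeats the formulas from earlier in the post.</p>

```python
import numpy as np

def project(face, psi, U):
    """Mean-normalise a face image and project it onto the eigenface subspace."""
    return U.T @ (face.reshape(-1, 1) - psi)              # (k, 1)

def nearest_face(face, psi, U, omega):
    """Index of the training face closest to `face` in the subspace."""
    phi_hat = project(face, psi, U)
    dists = np.linalg.norm(omega - phi_hat, axis=0)       # one distance per training face
    return int(np.argmin(dists))

# Hypothetical training set: 10 random 8x6 images.
rng = np.random.default_rng(1)
train = rng.integers(0, 256, size=(10, 8, 6)).astype(np.float64)
gamma = train.reshape(10, -1).T
psi = gamma.mean(axis=1, keepdims=True)
A = gamma - psi
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
V = eigvecs[:, np.argsort(eigvals)[::-1][:5]]             # keep k = 5 components
U = A @ V
omega = U.T @ A

# A slightly noisy copy of training face 4 should be matched back to index 4.
test_face = train[4] + rng.normal(0.0, 1.0, size=(8, 6))
predicted = nearest_face(test_face, psi, U, omega)
print(predicted)
```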
<p>Given this projected face, we can find which face it is closest to in the eigenface subspace, and classify it as that person’s face:</p>
<script type="math/tex; mode=display">\text{prediction} = \text{argmin}_{i} ||\Omega_i - \hat{\Phi}||_2</script>
<p>where <script type="math/tex">\Omega_i</script> is the <script type="math/tex">i^{\text{th}}</script> face in the eigenface subspace. It is clear that this is using Euclidean distance, as this is the metric used in the classical eigenface algorithm. We can, however, instead opt for <script type="math/tex">L_1</script> distance or any other distance metric.</p>SVMs: A Geometric Interpretation2019-03-30T19:55:55+00:002019-03-30T19:55:55+00:00https://davidtorpey.com//2019/03/30/svm-geometric-interpretation<p><img src="/assets/base.png" alt="Example Points" /></p>
<p>Consider a set of positive and negative samples from some dataset as shown above. How can we approach the problem of classifying these - and more importantly, unseen - samples as either positive or negative examples? The most intuitive way to do this is to draw a line / hyperplane between the positive and negative samples.</p>
<p>However, which line should we draw? We could draw this one:</p>
<p><img src="/assets/badline1.png" alt="Wrong line 1" /></p>
<p>or this one:</p>
<p><img src="/assets/badline2.png" alt="Wrong line 2" /></p>
<p>However, neither of the above seems like the best fit. Perhaps the optimal line is the one for which the gap between the two classes is maximal?</p>
<p><img src="/assets/svmline.png" alt="SVM line" /></p>
<p>This line is such that the margin is maximized. This is the line an SVM attempts to find - an SVM attempts to find the <strong>maximum-margin separating hyperplane</strong> between the two classes. However, we need to construct a decision rule to classify examples. To do this, consider a vector <script type="math/tex">\mathbf{w}</script> perpendicular to the margin. Further, consider some unknown vector <script type="math/tex">\mathbf{u}</script> representing some example we want to classify:</p>
<p><img src="/assets/wandu.png" alt="Wrong line 1" /></p>
<p>We want to know which side of the decision boundary <script type="math/tex">\mathbf{u}</script> is on in order to classify it. To do this, we project it onto <script type="math/tex">\mathbf{w}</script> by computing <script type="math/tex">\mathbf{w} \cdot \mathbf{u}</script>. This will give us a value that is proportional to the distance <script type="math/tex">\mathbf{u}</script> extends <em>in the direction of</em> <script type="math/tex">\mathbf{w}</script>. We can then use this to determine which side of the boundary <script type="math/tex">\mathbf{u}</script> lies on using the following decision rule:</p>
<script type="math/tex; mode=display">\mathbf{w} \cdot \mathbf{u} \ge c</script>
<p>for some <script type="math/tex">c \in \mathbb{R}</script>. <script type="math/tex">c</script> is basically telling us that if we are far <em>enough</em> away, we can classify <script type="math/tex">\mathbf{u}</script> as a positive example. We can rewrite the above decision rule as follows:</p>
<script type="math/tex; mode=display">\mathbf{w} \cdot \mathbf{u} + b \ge 0</script>
<p>where <script type="math/tex">b = -c</script>.</p>
<p>But what <script type="math/tex">\mathbf{w}</script> and <script type="math/tex">b</script> should we choose? We don’t have enough constraints in the problem to fix a particular <script type="math/tex">\mathbf{w}</script> or <script type="math/tex">b</script>. Therefore, we introduce additional constraints:</p>
<script type="math/tex; mode=display">\mathbf{w} \cdot \mathbf{x}_+ + b \ge 1</script>
<p>and</p>
<script type="math/tex; mode=display">\mathbf{w} \cdot \mathbf{x}_- + b \le -1</script>
<p>These constraints basically force the function that defines our decision rule to produce a value of 1 or greater for positive examples, and -1 or less for negative examples.</p>
<p>Now, instead of dealing with two inequalities, we introduce a new variable, <script type="math/tex">y_i</script>, for mathematical convenience. It is defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
y_i = \begin{cases}
1 & \text{positive example} \\
-1 & \text{negative example}
\end{cases} %]]></script>
<p>This variable essentially encodes the targets of each example. We multiply both inequalities from above by <script type="math/tex">y_i</script>. For the positive example constraint we get:</p>
<script type="math/tex; mode=display">y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1</script>
<p>and for the negative example constraint we get:</p>
<script type="math/tex; mode=display">y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1</script>
<p>which is the same constraint! The introduction of <script type="math/tex">y_i</script> has simplified the problem. We can rewrite this constraint as:</p>
<script type="math/tex; mode=display">y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0</script>
<p>However, we go a step further for the examples that lie exactly on the edges of the margin, and require the constraint to hold with equality:</p>
<script type="math/tex; mode=display">y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 = 0</script>
<p>The above equation constrains examples lying on the margins (known as <em>support vectors</em>) to be exactly 0. We do this because these are the training points closest to the other class, and we want such points to define our decision boundary. For a fixed label <script type="math/tex">y_i</script>, it is also clearly the equation of a hyperplane, which is what we want!</p>
<p>Keep in mind that our goal is to find the margin separating positive and negative examples to be as large as possible. This means that we will need to know the width of our margin so that we can maximize it. The following picture shows how we can calculate this width.</p>
<p><img src="/assets/width.png" alt="Margin Width" /></p>
<p>To calculate the width of the margin, we need a unit normal. Then we can just project <script type="math/tex">\mathbf{x}_+ - \mathbf{x}_-</script> onto this unit normal and this would exactly be the width of the margin. Luckily, vector <script type="math/tex">\mathbf{w}</script> was defined to be normal! Thus, we can compute the width as follows:</p>
<script type="math/tex; mode=display">\text{width} = (\mathbf{x}_+ - \mathbf{x}_-) \cdot \frac{\mathbf{w}}{||\mathbf{w}||}</script>
<p>where the norm ensures that <script type="math/tex">\mathbf{w}</script> becomes a unit normal. From earlier, we know <script type="math/tex">y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 = 0</script>. Using this, simple algebra yields:</p>
<script type="math/tex; mode=display">\mathbf{x}_+ \cdot \mathbf{w} = 1 - b</script>
<p>and</p>
<script type="math/tex; mode=display">- \mathbf{x}_- \cdot \mathbf{w} = 1 + b</script>
<p>Thus, substituting into the expression for the width yields:</p>
<script type="math/tex; mode=display">\text{width} = \frac{2}{||\mathbf{w}||}</script>
<p>which is interesting! The width of our margin for such a problem depends only on <script type="math/tex">\mathbf{w}</script>. Since we want to maximize the margin, we want:</p>
<script type="math/tex; mode=display">\text{max} \frac{2}{||\mathbf{w}||}</script>
<p>which is the same as</p>
<script type="math/tex; mode=display">\text{max} \frac{1}{||\mathbf{w}||}</script>
<p>which is the same as</p>
<script type="math/tex; mode=display">\text{min} ||\mathbf{w}||</script>
<p>which is the same as</p>
<script type="math/tex; mode=display">\text{min} \frac{1}{2} ||\mathbf{w}||^2</script>
<p>where we write it like this for mathematical convenience reasons that will become apparent shortly.</p>
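<p>The margin-width result can be checked numerically on a toy problem. In the sketch below, the values of <code>w</code>, <code>b</code> and the two support vectors are hand-picked assumptions chosen so that the margin constraints hold with equality; they are not the output of any solver.</p>

```python
import numpy as np

# Hand-picked (assumed) hyperplane parameters and one support vector per class.
w = np.array([0.5, 0.5])
b = -2.0
x_pos = np.array([3.0, 3.0])   # positive support vector: w.x + b = +1
x_neg = np.array([1.0, 1.0])   # negative support vector: w.x + b = -1

def classify(u):
    """Decision rule: +1 if w.u + b >= 0, otherwise -1."""
    return 1 if w @ u + b >= 0 else -1

margin_pos = w @ x_pos + b
margin_neg = w @ x_neg + b

# Width by projecting (x_+ - x_-) onto the unit normal, and by the closed form.
width_projection = (x_pos - x_neg) @ w / np.linalg.norm(w)
width_formula = 2.0 / np.linalg.norm(w)
print(margin_pos, margin_neg, width_projection, width_formula)
```

<p>Both width computations give roughly 2.83 here, and shrinking the norm of <code>w</code> while keeping the constraints satisfied is exactly what widens the margin.</p>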
<p>One easy approach to solve such an optimisation problem is using Lagrange multipliers. We first formulate our Lagrangian:</p>
<script type="math/tex; mode=display">L(\mathbf{w}, b) = \frac{1}{2} ||\mathbf{w}||^2 - \sum_i \alpha_i [y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1]</script>
<p>We find the optimal settings for <script type="math/tex">\mathbf{w}</script> and <script type="math/tex">b</script> by computing the respective partial derivatives and setting them to zero. First, for <script type="math/tex">\mathbf{w}</script>:</p>
<script type="math/tex; mode=display">\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0</script>
<p>which implies that <script type="math/tex">\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i</script>. This means that <script type="math/tex">\mathbf{w}</script> is simply a linear combination of the samples! Now, for <script type="math/tex">b</script>:</p>
<script type="math/tex; mode=display">\frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i = 0</script>
<p>which implies that <script type="math/tex">\sum_i \alpha_i y_i = 0</script>.</p>
<p>We could just stop here. We can solve the optimisation problem as is. However, we shall not do that! At least not yet. Let’s plug our expression for <script type="math/tex">\mathbf{w}</script> back into the Lagrangian, keeping in mind the condition <script type="math/tex">\sum_i \alpha_i y_i = 0</script> obtained from the derivative with respect to <script type="math/tex">b</script>:</p>
<script type="math/tex; mode=display">L = \frac{1}{2} (\sum_i \alpha_i y_i \mathbf{x}_i) \cdot (\sum_j \alpha_j y_j \mathbf{x}_j) - \sum_i \alpha_i y_i \mathbf{x}_i \cdot (\sum_j \alpha_j y_j \mathbf{x}_j) - \sum_i \alpha_i y_i b + \sum_i \alpha_i</script>
<p>which, after some algebra, results in:</p>
<script type="math/tex; mode=display">L = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j</script>
<p>What the above equation tells us is that the optimisation depends <strong>only</strong> on dot products of pairs of samples! This observation will prove key later on. Also, we should note that training examples that are not support vectors will have <script type="math/tex">\alpha_i = 0</script>, as these examples do not affect or define the decision boundary.</p>
<p>Putting the expression for <script type="math/tex">\mathbf{w}</script> back into our decision rule yields:</p>
<script type="math/tex; mode=display">\sum_i \alpha_i y_i \mathbf{x}_i \cdot \mathbf{u} + b \ge 0</script>
<p>which means the decision rule also depends <strong>only</strong> on dot products of pairs of samples! Another great benefit is that it is provable that this optimisation problem is convex - meaning we are guaranteed to always find global optima.</p>
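<p>To make the dual form concrete, here is a small sketch on a two-point toy problem. The data and the multipliers <code>alpha</code> are assumptions picked by hand so that both stationarity conditions above hold; no optimiser is run. The point is only that the objective and the decision rule need nothing but dot products.</p>

```python
import numpy as np

# Assumed toy data: one positive and one negative example (both support vectors).
X = np.array([[3.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])   # hand-picked so that sum_i alpha_i y_i = 0

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    gram = X @ X.T                       # all pairwise dot products
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ gram @ ay

def decision_value(u, alpha, X, y, b):
    """sum_i alpha_i y_i (x_i . u) + b, using only dot products with u."""
    return (alpha * y) @ (X @ u) + b

w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
b = y[0] - w @ X[0]                      # from y_i (w . x_i + b) = 1 on a support vector
u = np.array([4.0, 1.0])
print(w, b, dual_objective(alpha, X, y), decision_value(u, alpha, X, y, b))
```

<p>For this toy problem the dual objective equals <code>0.5 * ||w||^2</code>, and the dot-product form of the decision rule agrees with evaluating <code>w @ u + b</code> directly.</p>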
<p>However, now a problem arises! The above optimisation problem assumes the data is linearly-separable in the input vector space. However, in most real-life scenarios, this assumption is simply untrue. We therefore have to adapt the SVM to accommodate for this, and to allow for non-linear decision boundaries. To do this, we introduce a transformation <script type="math/tex">\phi</script> which will transform the input vector into a (high-dimensional) vector space. It is in this vector space that we will attempt to find the maximum-margin line / hyperplane.
In this case, we would simply need to swap the dot product <script type="math/tex">\mathbf{x}_i \cdot \mathbf{x}_j</script> in the optimisation problem with <script type="math/tex">\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)</script>. We can do this solely because, as shown above, both the optimisation and the decision rule depend only on dot products between pairs of samples. This is known as the <em>kernel trick</em>. Thus, if we have a function <script type="math/tex">K</script> such that:</p>
<script type="math/tex; mode=display">K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)</script>
<p>then we don’t actually need to know the transformation <script type="math/tex">\phi</script> itself! We only need the function <script type="math/tex">K</script>, which is known as a kernel function. This is why we can use kernels that transform the data into an infinite-dimensional space (such as the RBF kernel), because we are not computing the transformations directly. Instead, we simply use a special function (i.e. kernel function) to compute dot products in this space without needing to compute the transformations.</p>
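<p>A small worked example may help here. For two-dimensional inputs, the degree-2 polynomial kernel <code>(x . z)^2</code> corresponds to an explicit three-dimensional feature map, so the two sides of the identity above can be compared directly. The choice of kernel, the test vectors and the <code>gamma</code> value are assumptions for illustration; the RBF kernel is included only to show a kernel that is evaluated without ever writing down its feature map.</p>

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x . z)^2 with 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    """Same dot product, computed without forming phi(x) or phi(z)."""
    return (x @ z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel; its feature space is infinite-dimensional, so phi is never formed."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
explicit = phi(x) @ phi(z)
implicit = poly2_kernel(x, z)
print(explicit, implicit, rbf_kernel(x, z))
```

<p>Both routes give 121 for these vectors, up to floating-point rounding, but the kernel needs only one dot product in the original space.</p>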
<p>This kernel trick allows the SVM to learn non-linear decision boundaries, and the problem still clearly remains convex. However, even with the kernel trick, the SVM with such a formulation still assumes that the data is linearly-separable in this transformed space. Such SVMs are known as <em>hard-margin</em> SVMs. This assumption does not hold most of the time for real-world data. Therefore, we arrive at the most common form of the SVM nowadays - the <em>soft-margin</em> SVM. Essentially, so-called <em>slack</em> variables are introduced into the optimisation problem to control the amount of misclassification the SVM is allowed to make. For more information on soft-margin SVMs, see <a href="https://davidtorpey.com/2018/11/25/svm.html">my blog post on the subject</a>.</p>Human Action Recognition2019-03-18T19:55:55+00:002019-03-18T19:55:55+00:00https://davidtorpey.com//2019/03/18/human-action-recognition<p>In this post we will discuss the problem of human action recognition - an application of video analysis / recognition. The task is simply to identify a single action from a video. The typical setting is a dataset consisting of <script type="math/tex">N</script> action classes, where each class has a set of videos associated with it relating to that action. We will focus on the approaches typically taken in early action recognition research, and then focus on the current state-of-the-art approaches. There is a recurring theme in action recognition of extending conventional two-dimensional algorithms into three dimensions to accommodate the extra (temporal) dimension when dealing with videos instead of images.</p>
<p>Early research tends to focus on hand-crafting features. The benefit of this is that you are incorporating domain knowledge into the features, which should increase performance. The high-level idea behind these approaches is as follows:</p>
<ul>
<li>Use an interest point detection mechanism to localise points of interest to be used as the basis for feature extraction.</li>
<li>Compute descriptions of these interest points in the form of (typically, gradient-based) descriptors.</li>
<li>Quantise local descriptors into global video feature representations.</li>
<li>Train an SVM of some form to learn to map from global video representation to action class.</li>
</ul>
<p>Interest points are usually detected using a three-dimensional extension of the well-known Harris operator - space-time interest points (STIPs). However, in later research simple dense sampling was instead preferred for its resulting performance and speed. Interest points are also detected at multiple spatial and temporal scales to account for actions of differing speed and temporal extent. Descriptors are commonly computed within a local three-dimensional volume around each interest point (i.e. a cuboid). These descriptors are typically one of the following three (in some form or other): 1. histogram of oriented gradients; 2. histogram of optical flow; 3. motion boundary histograms.</p>
<p>The quantisation step to encode these local features into a global, fixed-length feature representation is usually done using either: 1. K-Means clustering using a bag-of-visual-words approach; or 2. Fisher vectors. Fisher vectors typically result in higher performance, but at the cost of a much higher dimensionality. The normalisation applied to these features is important. The common approach was to apply <script type="math/tex">L_2</script> normalisation; however, power normalisation has been preferred more recently. An SVM then learns the mapping to action classes from the normalised versions of the representations. The most successful of these hand-crafted approaches is iDT (improved dense trajectories). iDTs are often used in tandem with deep networks in state-of-the-art approaches as they are able to encode some pertinent, salient information about the videos / actions that is difficult for the networks to capture.</p>
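<p>The bag-of-visual-words encoding and the normalisation step can be sketched as below. This assumes a codebook has already been learned (for example with K-Means); the codebook, the descriptors and the exponent <code>alpha = 0.5</code> (signed square root, a common choice for power normalisation) are illustrative assumptions, not values taken from any particular paper's setup.</p>

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Count how many local descriptors fall nearest to each visual word."""
    # Squared distances between every descriptor and every codebook centre.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=codebook.shape[0]).astype(np.float64)

def power_l2_normalise(v, alpha=0.5):
    """Power normalisation sign(v)|v|^alpha followed by L2 normalisation."""
    v = np.sign(v) * np.abs(v) ** alpha
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Assumed codebook with 3 visual words and 5 local descriptors (2-D for readability).
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, 1.0], [0.0, 9.0], [11.0, -1.0], [0.5, 0.0]])

hist = bovw_histogram(descriptors, codebook)   # counts per visual word: [2, 2, 1]
feature = power_l2_normalise(hist)
print(hist, feature)
```

<p>The resulting fixed-length <code>feature</code> vector is what would be passed to the SVM, regardless of how many local descriptors the video produced.</p>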
<p>More recent research into action recognition has, unsurprisingly, been focused on deep learning. The most natural way to apply deep neural networks to video is to extend the successful 2D CNN architectures into the temporal domain by simply using 3D kernels in the convolutional layers and 3D pooling. This use of 3D CNNs is very common in this domain, although some research did attempt to process individual RGB frames with 2D CNN architectures. An example of a 3D CNN can be seen below.</p>
<p><img src="/assets/3dcnn.png" alt="3D CNN" /></p>
<p>The most significant contribution to human action recognition using deep learning, however, was the introduction of additional cues to model the action. More concretely, the raw RGB videos are fed into one 3D CNN which will learn salient appearance features. Further, there is another network - a flow network - which learns salient motion features from optical flow videos. An optical flow video is computed by performing frame-by-frame dense optical flow on the raw video, and using the resulting horizontal and vertical optical flow vector fields as the “images” / “frames” of the flow video. This modelling process is based on the intuition that actions can naturally be decomposed into spatial and temporal components (which will be modelled by the RGB and flow networks separately). An example of an optical flow field “frame” using different optical flow algorithms can be seen below (RGB frame, MPEG flow, Farneback flow, and Brox flow). The more accurate flow algorithms, such as Brox and TV-L1, result in higher performance. However, they are much more intensive to compute, especially without their GPU implementations.</p>
<p><img src="/assets/flow.png" alt="Optical Flow Fields" /></p>
<p>This two-network approach is the basis for the state-of-the-art approaches in action recognition such as I3D and temporal segment networks. Some research attempts to add additional cues to appearance and motion to model actions, such as pose.</p>
<p>It is important to note that when using deep learning to solve action recognition, massive computational resources are needed to train the 3D CNNs. Some of the state-of-the-art approaches utilise upwards of 64 powerful GPUs to train the networks. This is needed in particular to pre-train the networks on massive datasets like Kinetics to make use of transfer learning.</p>
<p>Another consideration (for deep learning approaches in particular) is the temporal resolution of the samples used during training. The durations of actions vary hugely, and in order to make the system robust, the model needs to accommodate this. Some approaches employ careful sampling of various snippets along the temporal evolution of the video so that the samples cover the action fully. Others employ a large temporal resolution for the sample - 60-100 frames. However, this increases computational cost significantly.</p>
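<p>One simple way to sample snippets that cover the whole video is to split it into equal segments and draw one snippet from each. The sketch below is only an illustration of that idea under assumed names and parameters (<code>sample_snippet_starts</code>, 4 segments, 10-frame snippets); it is not the exact sampling scheme of any specific published method.</p>

```python
import numpy as np

def sample_snippet_starts(num_frames, num_segments, snippet_len, rng):
    """Pick one random snippet start index from each equal-length segment."""
    if num_frames < snippet_len:
        raise ValueError("video is shorter than one snippet")
    # Valid start indices are 0 .. num_frames - snippet_len (inclusive).
    edges = np.linspace(0, num_frames - snippet_len + 1, num_segments + 1).astype(int)
    starts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        hi = max(hi, lo + 1)                  # guard against empty segments in short videos
        starts.append(int(rng.integers(lo, hi)))
    return starts

rng = np.random.default_rng(0)
starts = sample_snippet_starts(num_frames=100, num_segments=4, snippet_len=10, rng=rng)
snippets = [list(range(s, s + 10)) for s in starts]   # frame indices of each snippet
print(starts)
```

<p>Because each snippet comes from a different part of the video, short and long actions are both covered without feeding every frame to the network.</p>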
<p>Some good resources and references can be found here:</p>
<p><a href="https://hal.inria.fr/hal-00873267v2/document">iDT</a></p>
<p><a href="https://arxiv.org/pdf/1705.07750.pdf">I3D</a></p>
<p><a href="https://wanglimin.github.io/papers/WangXWQLTV_ECCV16.pdf">Temporal Segment Networks</a></p>
<p><a href="https://hal.inria.fr/hal-01764222/document">PoTion</a></p>Dimensionality Reduction2019-01-31T19:55:55+00:002019-01-31T19:55:55+00:00https://davidtorpey.com//2019/01/31/dimensionality-reduction<p>In machine learning, we often work with very high-dimensional data. For example, we might be working in a genome prediction context, in which case our feature vectors would contain thousands of dimensions, or perhaps we’re dealing with another context where the dimensions reach hundreds of thousands or possibly millions. In such a context, one common way to get a handle on the data - to understand it better - is to visualise the data by reducing its dimensions. This can be done using conventional dimensionality reduction techniques such as PCA and LDA, or using manifold learning techniques such as t-SNE and LLE.</p>
<p>For the purposes of this post, let’s assume the input features are <script type="math/tex">M</script>-dimensional.</p>
<p>The most popular, and perhaps simplest, dimensionality reduction technique is principal components analysis (PCA). In it, we assume that the relationships between the variables / features are linear. “Importance” in the PCA algorithm is defined by variance. This assumption that variance is the important factor often holds (but not always!). To get the so-called principal components of the data, we find the orthogonal directions of maximum variance. These are the components that maximize the variance of the data.</p>
<p>We obtain these principal components by finding the eigen decomposition of the covariance matrix of the input matrix - that is, its eigenvalues and eigenvectors. Since the covariance matrix is often prohibitively expensive to compute for a large number of features, the eigenvalues and eigenvectors are often found using the SVD instead, which decomposes the input matrix into three separate matrices: one set of singular vectors gives the eigenvectors of the covariance matrix, and the squared singular values are proportional to its eigenvalues. In this way, we do not need to compute the covariance matrix directly. The data must be centered in order for this SVD trick to work.</p>
<p>The <script type="math/tex">N</script> principal components are then the <script type="math/tex">N</script> eigenvectors with largest associated absolute eigenvalues. These are linear combinations of the input features, where each feature contributes a different amount to the principal component. If there are strong linear relationships between the input variables, relatively few principal components will capture the majority of the variance in the data. However, if not much of the variance is captured by relatively few components, this does not necessarily mean that there are no relationships or underlying structure in the data - the structure might be in the form of non-linear interactions and relationships. This is the reason non-linear dimensionality reduction (such as KPCA) and manifold learning techniques exist.</p>
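<p>The SVD route to PCA can be sketched in a few lines of NumPy. The function name <code>pca_svd</code>, the convention that rows are samples, and the random data are assumptions for illustration; the covariance matrix is formed at the end only to confirm that both routes give the same variances.</p>

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of X (rows = samples, columns = features) via the SVD of the centred data."""
    Xc = X - X.mean(axis=0)                            # centring is required
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # singular values sorted descending
    components = Vt[:n_components]                     # principal directions (as rows)
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    projected = Xc @ components.T                      # data in the principal subspace
    return projected, components, explained_variance

# Assumed data: 200 samples with 5 linearly mixed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
projected, components, explained_variance = pca_svd(X, n_components=2)

# Cross-check against the eigenvalues of the sample covariance matrix.
cov_eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(explained_variance, cov_eigvals[:2])
```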
<p><img src="/assets/pca.png" alt="PCA" /></p>
<p>In the above image we can see that in the original, 3-dimensional, raw feature space, the clusters of data are separated quite nicely. The 4 groups are roughly linearly separable. In the left plot, we can also see the first two principal components of the data - the two (orthogonal) directions / axes in which the data varies maximally with respect to variance. In the right plot, we can see the projection of the data into the 2-dimensional principal subspace. The data separates quite nicely into 4 distinct clusters. This suggests that the data has strong linear relationships.</p>
<p>Manifold learning allows us to estimate the hypothesised low-dimensional non-linear manifold (or set of manifolds) on which our high-dimensional data lies. Different manifold learning algorithms optimise for different criteria depending on what type of structure of the data they want to capture - local or global or a combination.</p>
<p>I’ll discuss one manifold learning technique. This technique - t-SNE - is popular in the machine learning research community. t-SNE stands for t-distributed stochastic neighbour embedding. t-SNE stems from a technique known as SNE (unsurprisingly, stochastic neighbour embedding). SNE converts distances between data points in the original, high-dimensional space (termed datapoints) into conditional probabilities that represent similarities. These similarities are simply the probability that a datapoint <script type="math/tex">x_i</script> would pick a datapoint <script type="math/tex">x_j</script> as its neighbour if neighbours were picked in proportion to their probability density under a Gaussian centered at <script type="math/tex">x_i</script>, which we denote as <script type="math/tex">p_{ij}</script>. This means that for nearby points, this similarity is relatively high, and for widely-separated points this similarity approaches zero. The low-dimensional counterparts of the datapoints (known as the map points) are <script type="math/tex">y_i</script> and <script type="math/tex">y_j</script>. We compute similar conditional probabilities (i.e. similarities) for these map points, which we denote <script type="math/tex">q_{ij}</script>.</p>
<p>If these map points correctly model the similarities of the datapoints, we should have that <script type="math/tex">p_{ij}</script> is equal to <script type="math/tex">q_{ij}</script>. Thus, SNE attempts to find the low-dimensional representation that minimizes the KL-divergence between these two conditional distributions. The problem with this approach is that the cost function is difficult to optimize, and it also suffers from the infamous crowding problem - the area of the low-dimensional map that is available to accommodate moderately-distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints. Thus, t-SNE is born.</p>
<p>t-SNE addresses these issues of SNE by using a symmetric cost function with simpler gradients, and by using a Student’s t-distribution instead of a Gaussian to calculate the similarities in the low-dimensional space. This heavy-tailed distribution in the low-dimensional space alleviates the crowding and optimization problems. The KL-divergence-based cost function can be easily optimized using a variant of gradient descent with momentum.</p>
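<p>The quantities t-SNE works with can be sketched as follows: Gaussian-based similarities in the high-dimensional space, symmetrised into a joint distribution, Student-t similarities (one degree of freedom) in the map, and the KL-divergence between them. This is a simplified sketch: a single fixed bandwidth <code>sigma</code> is an assumption made here for brevity, whereas t-SNE proper tunes a separate bandwidth per point from a perplexity setting, and the gradient-descent loop is omitted entirely.</p>

```python
import numpy as np

def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

def high_dim_similarities(X, sigma=1.0):
    """Symmetrised Gaussian similarities p_ij (fixed sigma is a simplification)."""
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)                  # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)        # row i holds p_{j|i}
    return (P_cond + P_cond.T) / (2.0 * X.shape[0])

def low_dim_similarities(Y):
    """Student-t similarities q_ij between map points."""
    W = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost that is minimised over the map points."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))      # assumed datapoints
Y = rng.normal(size=(20, 2))      # assumed (random, unoptimised) map points
P, Q = high_dim_similarities(X), low_dim_similarities(Y)
cost = kl_divergence(P, Q)
print(cost)
```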
<p>t-SNE is able to learn good, realistic manifolds as it is able to effectively capture the non-linear relationships and interactions in data, if they are present. t-SNE in its original form computes, specifically, a 2-dimensional projection / map. We can see a comparison of t-SNE and PCA in the below image. It is clear that PCA is inherently limited, since the projection into the principal subspace is linear. It is clear that t-SNE has much more effectively captured the structure of the data, and allowed for a much nicer, clearer visualization.</p>
<p><img src="/assets/pcavstsne.png" alt="PCA vs t-SNE" /></p>
<p>Some great resources for this topic can be found at:</p>
<p><a href="http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">t-SNE</a></p>
<p><a href="http://www.jmlr.org/papers/volume9/goldberg08a/goldberg08a.pdf">Manifold Learning</a></p>
<p><a href="https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf">PCA Tutorial</a></p>Optical Flow2018-12-23T14:23:55+00:002018-12-23T14:23:55+00:00https://davidtorpey.com//2018/12/23/optical-flow<p>Optical flow is a method for motion analysis and image registration that aims to compute the displacement of intensity patterns. Optical flow is used in many different settings in the computer vision realm, such as video recognition and video compression. The key assumption of many optical flow algorithms is known as the brightness constancy constraint, and is defined as:</p>
<script type="math/tex; mode=display">f(x, y, t) = f(x + dx, y + dy, t + dt)</script>
<p>This constraint simply states that the intensity of moving pixels remains constant during motion. If we take the first-order Taylor series expansion of the right-hand side of this equation and cancel the common <script type="math/tex">f(x, y, t)</script> term, we obtain <script type="math/tex">f_x dx + f_y dy + f_t dt = 0</script>. Dividing by <script type="math/tex">dt</script> yields:</p>
<script type="math/tex; mode=display">f_x u + f_y v + f_t = 0</script>
<p>where <script type="math/tex">u = \frac{dx}{dt}</script>, and <script type="math/tex">v = \frac{dy}{dt}</script>. This equation is known as the optical flow (constraint) equation. Since it is a single equation in the two unknowns <script type="math/tex">u</script> and <script type="math/tex">v</script>, the system is underconstrained.</p>
<p>The first optical flow algorithm that will be discussed is perhaps the most well-known - Lucas-Kanade, which underlies the KLT (Kanade-Lucas-Tomasi) tracker. In order to perform optical flow, one first needs to detect some interest points (pixels) to track. In the case of the KLT tracker, these are usually a set of sparse interest points, such as Shi-Tomasi good features to track.</p>
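<p>Since the system is underconstrained, KLT considers local optical flow: it assumes the flow is constant within a <script type="math/tex">(2k+1) \times (2k+1)</script> window around each interest point. Each pixel in the window contributes one optical flow equation, which yields an overdetermined system <script type="math/tex">A \mathbf{u} = -f_t</script>, where each row of <script type="math/tex">A</script> holds the spatial derivatives <script type="math/tex">(f_x, f_y)</script> at one pixel, <script type="math/tex">\mathbf{u} = (u, v)^T</script> is the flow, and <script type="math/tex">f_t</script> stacks the temporal derivatives. Using the pseudo-inverse of <script type="math/tex">A</script>, we can obtain the least-squares solution:</p>
<p><script type="math/tex">\mathbf{u} = -(A^T A)^{-1} A^T f_t</script>.</p>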
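<p>As a rough illustration, the least-squares step for a single window can be sketched in NumPy. This is only a sketch under stated assumptions: the function name lucas_kanade_window, the synthetic pattern, and the finite-difference derivatives are all choices made for this example, and the minus sign comes from moving the temporal derivative to the right-hand side of the optical flow equation.</p>

```python
import numpy as np

def lucas_kanade_window(frame1, frame2, row, col, k):
    """Estimate the flow (u, v) for the (2k+1) x (2k+1) window centred at (row, col)."""
    # Spatial derivatives by finite differences (axis 0 is y, axis 1 is x).
    f_y, f_x = np.gradient(frame1)
    # Temporal derivative approximated by the frame difference.
    f_t = frame2 - frame1
    window = (slice(row - k, row + k + 1), slice(col - k, col + k + 1))
    # One optical flow equation per pixel: f_x * u + f_y * v = -f_t.
    A = np.stack([f_x[window].ravel(), f_y[window].ravel()], axis=1)
    b = -f_t[window].ravel()
    # Least-squares solution, equivalent to applying the pseudo-inverse of A.
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow

def pattern(x, y):
    """Hypothetical smooth intensity pattern used to build two synthetic frames."""
    return np.sin(0.3 * x + 0.1 * y) + np.cos(0.1 * x - 0.25 * y)

ys, xs = np.mgrid[0:40, 0:40].astype(float)
true_flow = np.array([0.2, -0.1])  # (u, v) in pixels per frame
frame1 = pattern(xs, ys)
frame2 = pattern(xs - true_flow[0], ys - true_flow[1])  # frame1 shifted by the flow

estimated_flow = lucas_kanade_window(frame1, frame2, row=20, col=20, k=7)
print(estimated_flow)  # should be close to true_flow for small displacements
```

<p>The estimate is only approximate, since both the Taylor expansion and the finite differences are first-order approximations that hold for small displacements.</p>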
<p>There are other optical flow algorithms that perform dense optical flow - optical flow for dense interest points. Lucas-Kanade works well for sparse interest points, but it is too computationally intensive for dense optical flow. Dense interest points are most often sampled using a technique known as dense sampling - sampling points on a regular grid on the image. This can even be every pixel.</p>
<p>One such algorithm is Farneback’s method, which computes the flow for dense interest points. For example, if every pixel is tracked from one frame to the next in a video, the result is the horizontal and vertical flow of each pixel. These flows form a two-channel image of the same size as the input frames, where the channels are the optical flow fields representing the horizontal and vertical flow, respectively.</p>Ensemble Learning2018-12-10T19:55:55+00:002018-12-10T19:55:55+00:00https://davidtorpey.com//2018/12/10/ensemble-learning<p>Ensemble learning is one of the most useful methods in machine learning, not least because it is essentially agnostic to the statistical learning algorithm being used. Ensemble learning techniques are a set of algorithms that define how to combine multiple classifiers to make one strong classifier. There are various ensemble learning techniques, but this post will focus on the two most popular - bagging and boosting. These two approach the same problem in very different ways.</p>
<p>To explain these two algorithms, we assume a binary classification context, with a dataset consisting of a feature set <script type="math/tex">D</script> and a target set <script type="math/tex">Y</script>, where <script type="math/tex">y \in \{-1, 1\}</script> <script type="math/tex">\forall y \in Y</script>.</p>
<p>Bagging, otherwise known as bootstrap aggregation, depends on a sampling technique known as the bootstrap. This is a resampling method where we sample, with replacement, at each step of the aggregation. Essentially, we obtain bootstrapped samples <script type="math/tex">X_t \subset D</script>, and train a weak learner <script type="math/tex">h_t : X_t \mapsto Y</script> on each, for <script type="math/tex">t = 1, \dots, M</script>, where <script type="math/tex">M \in \mathbb{N}</script> is the number of so-called weak learners in the ensemble. Then, on a test example <script type="math/tex">x \in S</script> from a test set <script type="math/tex">S</script>, we make a prediction by taking the mode of the predictions of the <script type="math/tex">M</script> weak learners: <script type="math/tex">y_p = \text{mode}([h_1(x), h_2(x), \dots, h_M(x)])</script>. Random Forests, for example, employ bagging in their predictions. However, the bootstrapped samples used to train each of the learners (decision trees, which are typically grown deep) consist of random samples of both the examples and the features. In this way, the decision trees in the random forest are made to be approximately de-correlated from each other, which gives the algorithm its effectiveness. The main reason for using bagging is to reduce the variance of an estimator. Such estimators are usually ones with a large VC-dimension or capacity, such as deep decision trees.</p>
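<p>The bagging procedure can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: decision stumps are used as the base learner purely to keep the code short, and the helper names (fit_stump, bagging_fit, bagging_predict, make_blobs) and the synthetic two-blob dataset are assumptions made for this example.</p>

```python
import numpy as np

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, sign) stump with the fewest errors."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, -sign, sign)
                err = np.mean(pred != y)
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best

def predict_stump(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, -sign, sign)

def bagging_fit(X, y, M, rng):
    """Train M stumps, each on its own bootstrap sample of the training set."""
    stumps = []
    for _ in range(M):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def bagging_predict(stumps, X):
    votes = np.stack([predict_stump(s, X) for s in stumps])
    # With labels in {-1, 1} and an odd M, the mode is the sign of the vote sum.
    return np.sign(votes.sum(axis=0)).astype(int)

rng = np.random.default_rng(0)

def make_blobs(n):
    """Hypothetical dataset: two Gaussian blobs, one per class."""
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]
    return X, y

X_train, y_train = make_blobs(200)
X_test, y_test = make_blobs(200)
ensemble = bagging_fit(X_train, y_train, M=11, rng=rng)
accuracy = np.mean(bagging_predict(ensemble, X_test) == y_test)
print(accuracy)
```

<p>An odd number of learners is chosen here so that the majority vote over labels in {-1, 1} can never tie.</p>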
<p>Boosting is another very popular ensemble learning method. Unlike bagging, the current learner in the ensemble depends on the results of the previous learner in the ensemble. We will discuss the popular boosting algorithm known as Adaboost. Adaboost adaptively reweights samples such that the difficult-to-classify samples are given more weight as the ensemble progresses. A prediction scheme is then introduced to incorporate the predictions of each learner in the ensemble. Similar to bagging, we create an ensemble consisting of <script type="math/tex">M</script> weak learners. We first initialise the sample weights to a uniform distribution: <script type="math/tex">D_1(i) = \frac{1}{m}</script> <script type="math/tex">\forall i</script>, where <script type="math/tex">m</script> is the number of samples. Then, for each <script type="math/tex">t = 1, \dots, M</script>, we train a weak learner <script type="math/tex">h_t : X \mapsto \{-1, 1\}</script> using distribution <script type="math/tex">D_t</script>. We then find the weighted error of the weak learner: <script type="math/tex">\epsilon_t = P_{i \sim D_t}[h_t(x_i) \neq y_i]</script>. Next, we compute the weight of the weak learner, which is also used to adaptively amend the distribution for the next weak learner in the ensemble so that difficult-to-classify samples are weighted more heavily. This is done using: <script type="math/tex">\alpha_t = \frac{1}{2} \text{ln}(\frac{1-\epsilon_t}{\epsilon_t})</script>. The distribution is then updated using the following formula: <script type="math/tex">D_{t+1}(i) = \frac{D_t(i) \text{exp}(-\alpha_t y_i h_t(x_i))}{Z_t}</script>, where <script type="math/tex">Z_t</script> is a normalisation constant to ensure <script type="math/tex">D_{t+1}</script> is a distribution. The final classifier is then given by the following formula: <script type="math/tex">H(x) = \text{sign}(\sum_{t=1}^M \alpha_t h_t(x))</script>. 
Boosting is most commonly used to reduce the bias of an estimator, and the weak learner can be any classifier.</p>Autoencoders2018-12-02T19:55:55+00:002018-12-02T19:55:55+00:00https://davidtorpey.com//2018/12/02/auto-encoders<p>Autoencoders fall under the unsupervised learning category, and are a special case of neural networks that map the inputs (in the input layer) back to the inputs (in the final layer). This can be seen mathematically as <script type="math/tex">f : \mathbb{R}^m \mapsto \mathbb{R}^m</script>. Autoencoders were originally introduced to address dimensionality reduction. In the original paper, Hinton compares it with PCA, another dimensionality reduction algorithm. He showed that autoencoders outperform PCA when non-linear mappings are needed to represent the data. They are able to learn a more realistic low-dimensional manifold than linear methods due to their non-linear nature.</p>
<p>Okay, enough with the introduction; let’s get into it. Autoencoders can be thought of as having two networks in one grand network. We refer to the first network as the encoder network. This takes in the actual data as the input and runs it through the network to the output, similar to a vanilla neural network. The second network is the decoder network. This takes the output of the encoder as its input and uses the original input data as targets.</p>
<p>Usually, when we speak about autoencoders, we refer to the under-complete structure. This means that the “code” layer has fewer neurons than the input layer. The “code” layer, also sometimes referred to as the “latent variables”, is the layer we described above. That is, the output layer of the encoder and the input layer of the decoder. Now, using an under-complete structure starts to make sense, since we are essentially decreasing the dimensionality of our data. As research has continued over the past few years, people have become much more interested in what the network learns in the code layer, and a lot of research has gone into investigating this.</p>
<p>Generally the decoder is a reflection of the encoder along the code layer. However, in encoder-decoder models we can have various combinations in that we can add LSTM cells in the encoder and not in the decoder or vice-versa.</p>
<p>Since math makes everything easier, let’s represent the above mathematically as follows: <script type="math/tex">f : \mathbb{R}^m \rightarrow \mathbb{R}^n</script> and <script type="math/tex">g : \mathbb{R}^n \rightarrow \mathbb{R}^m</script>, where <script type="math/tex">f</script> is the encoder and <script type="math/tex">g</script> is the decoder. If we are considering an under-complete structure then <script type="math/tex">m > n</script>.</p>
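<p>To make the encoder and decoder mappings concrete, here is a minimal NumPy sketch of an under-complete autoencoder trained with plain gradient descent. Everything specific in it is an assumption made for illustration: the layer sizes, the tanh non-linearity in the encoder, the linear decoder, the learning rate, and the synthetic data that lies on a 2-dimensional subspace.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 8, 2, 200  # input dimension, code dimension (m > n), number of samples

# Synthetic data lying on a 2-dimensional subspace of R^8 (assumed for the demo).
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, m))
X = (X - X.mean(axis=0)) / X.std()

W_enc, b_enc = 0.1 * rng.normal(size=(m, n)), np.zeros(n)
W_dec, b_dec = 0.1 * rng.normal(size=(n, m)), np.zeros(m)

def encoder(x):
    """f : R^m -> R^n, maps inputs to the code layer."""
    return np.tanh(x @ W_enc + b_enc)

def decoder(code):
    """g : R^n -> R^m, maps codes back to the input space."""
    return code @ W_dec + b_dec

learning_rate, losses = 0.05, []
for _ in range(2000):
    code = encoder(X)
    X_hat = decoder(code)
    error = X_hat - X                       # the inputs are also the targets
    losses.append(np.mean(error ** 2))
    # Backpropagate the mean squared reconstruction error.
    grad_out = 2.0 * error / error.size
    grad_code = grad_out @ W_dec.T
    grad_pre = grad_code * (1.0 - code ** 2)  # derivative of tanh
    W_dec -= learning_rate * code.T @ grad_out
    b_dec -= learning_rate * grad_out.sum(axis=0)
    W_enc -= learning_rate * X.T @ grad_pre
    b_enc -= learning_rate * grad_pre.sum(axis=0)

print(losses[0], losses[-1])
```

<p>After training, encoder(X) gives the 2-dimensional codes, and decoder(encoder(X)) gives the reconstructions in the original 8-dimensional space.</p>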
<p>Autoencoders seek to describe the low-dimensional smooth structure of our high dimensional data, otherwise referred to as high-dimensional surfaces.</p>
<p>Many variations of autoencoders have been developed over the past few years. These include over-complete autoencoders, de-noising autoencoders, variational autoencoders, etc. The basic idea behind all these models is the same as for the normal autoencoder.</p>
<p>Applications of these models can vary from dimensionality reduction to information retrieval.</p>
<p>Some great resources can be found at:</p>
<p><a href="https://www.cs.toronto.edu/~hinton/science.pdf">Reducing the dimensionality of data with neural networks</a></p>
<p><a href="http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/">Autoencoders - Tutorial</a></p>
<p><a href="https://becominghuman.ai/understanding-autoencoders-unsupervised-learning-technique-82fb3fbaec2">Understanding Autoencoders</a></p>Local Feature Encoding and Quantisation2018-11-25T19:55:55+00:002018-11-25T19:55:55+00:00https://davidtorpey.com//2018/11/25/feature-quantisation<p>In this post, I will describe local feature encoding and quantisation - why it is useful, where it is used, and some of the popular techniques used to perform it.</p>
<p>Feature quantisation is commonly used in domains such as image and video retrieval. However, it can be applied anywhere we would like to convert a variable number of local features into a single feature of uniform dimensionality.</p>
<p>Consider a set of images <script type="math/tex">\{I_i\}^{n}_{i=1}</script>, and suppose that we would like to obtain a fixed-length representation of each image so that we can index them quickly and easily by comparing these representations using some similarity measure. One common way to do this is to find interest points on the images, and compute the SIFT or SURF descriptors around those interest points. This means that each image will have a set of descriptors <script type="math/tex">\{v_j\}_{j=1}^{n_i}</script>, where <script type="math/tex">n_i</script> is the number of descriptors found for image <script type="math/tex">I_i</script>. Since the <script type="math/tex">n_i</script>s are not necessarily equal, we need some scheme to compute a fixed-length global representation of the image from these local descriptors in order, for example, to be able to compare the similarity between images.</p>
<p>The most popular of these local feature encoding methods is bag-of-words (BoW). This is sometimes known as bag-of-visual-words, or bag-of-features. This technique is performed in the following manner. We sample a subset of the local descriptors across all images. Call this set <script type="math/tex">S</script>. We then use the descriptors in <script type="math/tex">S</script> to estimate a K-Means clustering with <script type="math/tex">K</script> cluster centroids. These <script type="math/tex">K</script> centroids can be thought of as visual words in the image feature space, which together form a visual codebook. Once we have learned this so-called visual codebook, we can then use it to compute a global, fixed-length representation of an image.</p>
<p>To compute the fixed-length representation, consider a particular image <script type="math/tex">I_i</script>’s descriptor set <script type="math/tex">\{v_j\}_{j=1}^{n_i}</script>. We then compute a vector <script type="math/tex">h \in \mathbb{R}^K</script>, where the <script type="math/tex">k</script>th dimension of <script type="math/tex">h</script> is the number of local descriptors of <script type="math/tex">I_i</script> that belong to the <script type="math/tex">k</script>th visual word (i.e. cluster centroid) of the K-Means clustering. This is the quantisation part of the process. Determining which visual word a particular descriptor belongs to is achieved by computing the distance between the descriptor and all the cluster centroids, using some distance metric (usually Euclidean distance) in the image feature space, and picking the closest centroid. This vector of counts is then L2-normalized to obtain the final, global, fixed-length representation of the image <script type="math/tex">I_i</script>.</p>
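<p>The codebook learning and quantisation steps can be sketched as follows. This is an illustrative sketch only: the small K-Means routine, the function names kmeans and bow_encode, and the random 16-dimensional vectors standing in for SIFT or SURF descriptors are all assumptions made for this example.</p>

```python
import numpy as np

def kmeans(S, K, iters=20, seed=0):
    """Tiny Lloyd's-algorithm K-Means returning K centroids (the visual words)."""
    rng = np.random.default_rng(seed)
    centroids = S[rng.choice(len(S), size=K, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):  # keep the old centroid if a cluster empties
                centroids[k] = S[labels == k].mean(axis=0)
    return centroids

def bow_encode(descriptors, centroids):
    """Count descriptors per nearest visual word, then L2-normalise the counts."""
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    words = dists.argmin(axis=1)  # Euclidean nearest centroid per descriptor
    h = np.bincount(words, minlength=len(centroids)).astype(float)
    return h / np.linalg.norm(h)

rng = np.random.default_rng(1)
K, dim = 5, 16
image_a = rng.normal(size=(30, dim))  # 30 stand-in descriptors for one image
image_b = rng.normal(size=(47, dim))  # 47 stand-in descriptors for another image
S = np.vstack([image_a, image_b])     # here, every descriptor is used for the codebook
codebook = kmeans(S, K)

h_a = bow_encode(image_a, codebook)
h_b = bow_encode(image_b, codebook)
print(h_a.shape, h_b.shape)
```

<p>Although the two images have different numbers of descriptors, both end up with a K-dimensional representation, so they can be compared directly, for example with a dot product.</p>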
<p>Other techniques exist to encode local features into global features, such as Fisher vectors and VLAD (vector of locally aggregated descriptors). Fisher vectors are the current state-of-the-art in this domain. However, they can quickly become very high-dimensional, as they are essentially a concatenation of the partial derivatives of the log-likelihood with respect to the parameters of a GMM (Gaussian mixture model) estimated with <script type="math/tex">K</script> modes in the <script type="math/tex">D</script>-dimensional image feature space. They are <script type="math/tex">2 K D + K</script>-dimensional. However, the <script type="math/tex">K</script> term is often discarded, as these are the derivatives with respect to the mixture weights, which have been empirically shown to not provide much value to the representation. Thus, they are typically <script type="math/tex">2 K D</script>-dimensional. VLAD is a representation computed by aggregating the residuals of the descriptors with respect to their assigned cluster centroids in a K-Means clustering of the data. They often result in similar performance to Fisher vectors, while being of a lower dimensionality and quicker to compute.</p>
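<p>For comparison with BoW, a VLAD encoding can be sketched as below, assuming a codebook of K centroids in a D-dimensional descriptor space. The function name vlad_encode, the random centroids standing in for a learned K-Means codebook, and the single global L2 normalisation are assumptions made for this example; practical implementations often add further normalisation steps.</p>

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """Sum the residuals to each descriptor's nearest centroid, then flatten."""
    K, D = centroids.shape
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignment == k]
        if len(members) > 0:
            v[k] = (members - centroids[k]).sum(axis=0)  # aggregated residuals
    v = v.ravel()                                        # K * D dimensional
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
K, D = 4, 16
centroids = rng.normal(size=(K, D))     # stand-in for a learned codebook
descriptors = rng.normal(size=(60, D))  # stand-in local descriptors of one image
vlad = vlad_encode(descriptors, centroids)
print(vlad.shape)
```

<p>The result has K times D dimensions regardless of how many descriptors the image has.</p>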
<p><img src="/assets/BoW.png" alt="BoW Flow" /></p>
<p>Some great resources for this topic can be found at:</p>
<p><a href="https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/csurka-eccv-04.pdf">BoW</a></p>
<p><a href="https://www.robots.ox.ac.uk/~vgg/rg/papers/peronnin_etal_ECCV10.pdf">Fisher vectors</a></p>
<p><a href="https://lear.inrialpes.fr/pubs/2010/JDSP10/jegou_compactimagerepresentation.pdf">VLAD</a></p>Support Vector Machines - Why and How2018-11-25T19:55:55+00:002018-11-25T19:55:55+00:00https://davidtorpey.com//2018/11/25/svm<p>Support vector machines (SVMs) are one of the most popular supervised learning algorithms in use today, even with the onslaught of deep learning and neural network take-over. The reason they have remained popular is due to their reliability across a wide variety of problem domains and datasets. They often have great generalisation performance, and this is almost solely due to the clever way in which they work - that is, how they approach the problem of supervised learning and how they formulate the optimisation problem they solve.</p>
<p>There are two types of SVMs - hard-margin and soft-margin. Hard-margin SVMs assume the data is linearly-separable (in the raw feature space or some high-dimensional feature space that we can map to) without any errors, whereas a soft-margin SVM has some leeway in that it allows for some misclassification if the data is not completely linearly-separable. When speaking of SVMs, we are generally referring to soft-margin ones, and thus this post will focus on these. Moreover, we will focus on a binary classification context.</p>
<p>Consider a labeled set of <script type="math/tex">n</script> feature vectors and corresponding targets: <script type="math/tex">\{(x_i, y_i)\}^{n}_{i=1}</script>, where <script type="math/tex">x_i \in \mathbb{R}^m</script> is feature vector <script type="math/tex">i</script> and <script type="math/tex">y_i \in \{-1, 1\}</script> is target <script type="math/tex">i</script>. An SVM attempts to find a hyperplane that separates the classes in the feature space, or some transformed version of the feature space. The hyperplane, however, is defined to be a very specific separating hyperplane - the one that separates the data maximally; that is, with the largest margin between the two classes.</p>
<p>Define a hyperplane <script type="math/tex">\mathcal{H} := \{x : f(x) = x^T \beta + \beta_0 = 0\}</script>, such that <script type="math/tex">\|\beta\| = 1</script>. Then, we know that <script type="math/tex">f(x)</script> is the signed distance from <script type="math/tex">x</script> to <script type="math/tex">\mathcal{H}</script>. As a side note, in the case that the data is linearly-separable, we have that <script type="math/tex">y_i f(x_i) > 0</script>, <script type="math/tex">\forall i</script>. However, since we are solely dealing with the linearly non-separable case, we define a set of slack variables <script type="math/tex">\xi = [\xi_1, \xi_2, \dots, \xi_n]</script>. These essentially provide the SVM classifier with some leeway in that it then allows for a certain amount of misclassification. Then, we let <script type="math/tex">M</script> be the width of the margin either side of our maximum-margin hyperplane. We want, for all <script type="math/tex">i</script>, that <script type="math/tex">y_i (x_i^T \beta + \beta_0) \ge M - \xi_i</script>, <script type="math/tex">\xi_i \ge 0</script>, and <script type="math/tex">\sum_i \xi_i \le K</script>, for some <script type="math/tex">K \in \mathbb{R}</script>. This means that we want a point <script type="math/tex">x_i</script> to be at least a distance of <script type="math/tex">M</script> away from <script type="math/tex">\mathcal{H}</script> (on its correct side of the margin) with a leeway/slack of <script type="math/tex">\xi_i</script>.</p>
<p>The above constraints lead to a non-convex optimization problem. However, it can be re-formulated in such a way that makes it convex. Thus, we modify the constraint such that for all <script type="math/tex">i</script>, <script type="math/tex">y_i (x_i^T \beta + \beta_0) \ge M (1 - \xi_i)</script>. That is, we measure the relative distance from a point <script type="math/tex">x_i</script> to the hyperplane, as opposed to the actual distance as done in the first, non-convex, formulation. The slack variables essentially just represent the proportional amount by which the predictions are on the wrong side of their margin. By bounding <script type="math/tex">\sum_i \xi_i</script>, we essentially bound the total proportional amount by which the training predictions fall on the wrong side of their margin.</p>
<p>Thus, it is clear that misclassification occurs when <script type="math/tex">\xi_i > 1</script>. Therefore, the <script type="math/tex">\sum_i \xi_i \le K</script> constraint means we can have at most <script type="math/tex">K</script> training misclassifications.</p>
<p>If we drop the unit norm constraint on the parameter <script type="math/tex">\beta</script> and define <script type="math/tex">M := \frac{1}{\|\beta\|}</script>, we can then formulate the following convex optimization problem for the SVM:</p>
<script type="math/tex; mode=display">\text{min} \|\beta\| \\
\text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i, \xi_i \ge 0, \sum_i \xi_i \le K, \forall i</script>
<p>With this formulation, it is clear that points well within their bounds (i.e. well within their class boundary) do not have much effect on shaping the decision/class boundary. Also, if the class boundary can be constructed using a small number of support vectors relative to the train set size, then the generalization performance will be high, even in an infinite-dimensional space.</p>
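<p>To see how the slack variables behave, the small example below computes, for a fixed hyperplane, the smallest slack that satisfies each constraint in the formulation above, along with the margin width. The helper name slack_and_margin, the four toy points, and the chosen hyperplane are assumptions made for illustration; no optimisation is performed here.</p>

```python
import numpy as np

def slack_and_margin(X, y, beta, beta0):
    """Smallest slacks satisfying y_i (x_i^T beta + beta0) >= 1 - xi_i, xi_i >= 0."""
    f = X @ beta + beta0
    xi = np.maximum(0.0, 1.0 - y * f)
    margin = 1.0 / np.linalg.norm(beta)  # M = 1 / ||beta||
    return xi, margin

# Toy data with labels in {-1, 1}; the hyperplane is the vertical line x_1 = 0.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = np.array([1.0, 0.0]), 0.0

xi, margin = slack_and_margin(X, y, beta, beta0)
misclassified = xi > 1.0

print(xi)             # [0.  0.5 0.  1.5]
print(margin)         # 1.0
print(misclassified)  # [False False False  True]
```

<p>The second point is correctly classified but sits inside the margin, so its slack lies between 0 and 1. The last point is on the wrong side of the hyperplane, so its slack exceeds 1.</p>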
<p>It should be noted that up until now we have been working in the base feature space. However, SVMs are part of a class of methods known as kernel methods. This means that we can apply a function <script type="math/tex">\phi</script> to transform the base feature space into a (possibly) very high-dimensional feature space, where the kernel function computes inner products in this transformed space without constructing it explicitly. We hypothesise that in this transformed feature space, it may be easier to find a separating hyperplane between the classes. It is this kernel property of SVMs that allows them to learn non-linear decision boundaries (as opposed to linear, which the SVM without a kernel function can learn). We can simply replace <script type="math/tex">x_i</script> in the formulation by <script type="math/tex">\phi(x_i)</script>, where <script type="math/tex">\phi</script> is the feature map associated with the kernel. The most common kernel functions are polynomial or radial basis (Gaussian) functions.</p>
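<p>A kernel function returns the inner product of two points after they have been mapped into the transformed feature space. As a small worked example, the homogeneous degree-2 polynomial kernel on 2-dimensional inputs can be checked against its explicit feature map. The function names phi and poly_kernel and the two sample points are assumptions made for this example.</p>

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """k(x, z) = (x . z)^2, computed without building the feature map."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in the transformed space
implicit = poly_kernel(x, z)              # same value from the base space
print(explicit, implicit)
```

<p>Both values agree, which is why the transformed space never needs to be built explicitly. For the radial basis kernel, the corresponding feature space is infinite-dimensional, so only the kernel form is practical.</p>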
<p>Some great resources for SVMs can be found at the following links:</p>
<p><a href="https://web.stanford.edu/~hastie/Papers/ESLII.pdf">The Elements of Statistical Learning</a></p>
<p><a href="http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf">Support Vector Networks</a></p>
<p><a href="http://www.jmlr.org/papers/volume5/chen04b/chen04b.pdf">Support Vector Machine Soft Margin Classifiers: Error Analysis</a></p>Face Recognition: Eigenfaces2018-11-25T19:55:55+00:002018-11-25T19:55:55+00:00https://davidtorpey.com//2018/11/25/eigenfaces<p>The main idea behind eigenfaces is that we want to learn a low-dimensional space - known as the eigenface subspace - on which we assume the faces intrinsically lie. From there, we can then compare faces within this low-dimensional space in order to perform facial recognition. It’s a relatively simple approach to facial recognition, but indeed one of the most famous and effective ones of the early approaches. It still works well in simple, controlled scenarios.</p>
<p>Assume we have a set of <script type="math/tex">m</script> images <script type="math/tex">\{I_i\}^{m}_{i=1}</script>, where <script type="math/tex">I_i \in \mathcal{G}^{r \times c}</script>; <script type="math/tex">\mathcal{G} = \{0, 1, \dots, 255\}</script>; and <script type="math/tex">r \times c</script> is the spatial dimension of the image. The first step to the algorithm is to resize all the images in the set to the same size. Typically, the images are converted to grayscale, since it is assumed that colour is not an important factor. This is clearly debatable, however, for the purposes of this post we will assume that the images are grayscale images.</p>
<p>Each image is then converted to a vector, by appending each row into one long vector. Given an image from the set, we convert it to a vector <script type="math/tex">\Gamma_i \in \mathcal{G}^{rc}</script>.</p>
<p>We then calculate the mean face <script type="math/tex">\Psi</script>:</p>
<script type="math/tex; mode=display">\Psi = \frac{1}{m} \sum_{i=1}^{m} \Gamma_i</script>
<p>We then zero-centre the image vectors <script type="math/tex">\Gamma_i</script> by subtracting the mean from each. This results in a set of vectors <script type="math/tex">\Phi_i</script>:</p>
<script type="math/tex; mode=display">\Phi_i = \Gamma_i - \Psi</script>
<p>We then perform PCA on the matrix <script type="math/tex">A</script>, where <script type="math/tex">A</script> is given by:</p>
<script type="math/tex; mode=display">A = [\Phi_1 \Phi_2 \cdots \Phi_m] \in \mathbb{R}^{rc \times m}</script>
<p>More concretely, we compute the covariance matrix <script type="math/tex">C \in \mathbb{R}^{rc \times rc}</script>:</p>
<script type="math/tex; mode=display">C = \frac{1}{m} \sum_{i=1}^m \Phi_i \Phi_i^T = \frac{1}{m} A A^T</script>
<p>We would then typically compute the eigen decomposition of this matrix. However, in the interest of speed, the eigen decomposition is instead computed for <script type="math/tex">A^T A \in \mathbb{R}^{m \times m}</script>, which is much smaller when <script type="math/tex">m \ll rc</script>. This is mathematically justified since the <script type="math/tex">m</script> eigenvalues of <script type="math/tex">A^T A</script> are equal to the <script type="math/tex">m</script> largest eigenvalues of <script type="math/tex">A A^T</script>, and if <script type="math/tex">v</script> is an eigenvector of <script type="math/tex">A^T A</script>, then <script type="math/tex">A v</script> is an eigenvector of <script type="math/tex">A A^T</script> with the same eigenvalue.</p>
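<p>This relationship between the two eigen decompositions is easy to check numerically. The sketch below uses a random matrix in place of the mean-centred face matrix, and the sizes are arbitrary assumptions chosen to keep the example small.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
rc, m = 50, 5                 # pixels per image, number of images (rc > m)
A = rng.normal(size=(rc, m))  # stand-in for the matrix of centred face vectors

# np.linalg.eigh returns eigenvalues in ascending order for symmetric matrices.
small_vals, small_vecs = np.linalg.eigh(A.T @ A)  # cheap: m x m
large_vals, _ = np.linalg.eigh(A @ A.T)           # expensive: rc x rc

# The m eigenvalues of A^T A match the m largest eigenvalues of A A^T.
top_large_vals = large_vals[-m:]
print(np.allclose(small_vals, top_large_vals))

# An eigenvector v of A^T A maps to the eigenvector A v of A A^T.
v, lam = small_vecs[:, -1], small_vals[-1]
u = A @ v
print(np.allclose((A @ A.T) @ u, lam * u))
```

<p>The remaining eigenvalues of the larger matrix are zero up to numerical error, so nothing is lost by working with the smaller one.</p>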
<p>We then retain the first <script type="math/tex">k</script> principal components: the <script type="math/tex">k</script> eigenvectors with largest associated absolute eigenvalues. This corresponds to a matrix <script type="math/tex">V \in \mathbb{R}^{m \times k}</script>, where the columns of the matrix are these chosen eigenvectors. We then compute the so-called projection matrix <script type="math/tex">U \in \mathbb{R}^{rc \times k}</script>:</p>
<script type="math/tex; mode=display">U = A V</script>
<p>Lastly, we project the training faces onto the eigenface subspace, which gives <script type="math/tex">\Omega \in \mathbb{R}^{k \times m}</script>:</p>
<script type="math/tex; mode=display">\Omega = U^T A</script>
<p>Now, for the actual facial recognition part! Consider a resized grayscale test image <script type="math/tex">I \in \mathcal{G}^{r \times c}</script>. We reshape this into a vector:</p>
<script type="math/tex; mode=display">\Gamma \in \mathcal{G}^{rc \times 1}</script>
<p>We then mean-normalise:</p>
<script type="math/tex; mode=display">\Phi = \Gamma - \Psi</script>
<p>Finally, we project the test face onto the eigenface subspace (i.e. the linear manifold learned by PCA):</p>
<script type="math/tex; mode=display">\hat{\Phi} = U^T \Phi</script>
<p>Given this projected face, we can find which face it is closest to in the eigenface subspace, and classify it as that person’s face:</p>
<script type="math/tex; mode=display">\text{prediction} = \operatorname*{arg\,min}_{i} \|\Omega_i - \hat{\Phi}\|_2</script>
<p>where <script type="math/tex">\Omega_i</script> is the <script type="math/tex">i^{\text{th}}</script> face in the eigenface subspace. It is clear that this is using Euclidean distance, as this is the metric used in the classical eigenface algorithm. We can, however, instead opt for <script type="math/tex">L_1</script> distance or any other distance metric.</p>
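<p>Putting the steps of the eigenface algorithm together, a minimal NumPy sketch might look as follows. The function names train_eigenfaces and recognise are assumptions made for this example, and random 8 x 8 arrays stand in for resized grayscale face images, so this shows only the mechanics and not realistic recognition performance.</p>

```python
import numpy as np

def train_eigenfaces(images, k):
    """images: array of shape (m, r, c). Returns the mean face, U, and Omega."""
    m = len(images)
    Gamma = images.reshape(m, -1).astype(float).T  # rc x m, rows appended per image
    Psi = Gamma.mean(axis=1, keepdims=True)        # mean face
    A = Gamma - Psi                                # zero-centred face vectors
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)     # small m x m problem, ascending
    V = eigvecs[:, ::-1][:, :k]                    # k eigenvectors, largest first
    U = A @ V                                      # projection matrix, rc x k
    Omega = U.T @ A                                # training faces in the subspace
    return Psi, U, Omega

def recognise(image, Psi, U, Omega):
    """Return the index of the training face closest to the projected test face."""
    Phi = image.reshape(-1, 1).astype(float) - Psi
    Phi_hat = U.T @ Phi
    distances = np.linalg.norm(Omega - Phi_hat, axis=0)  # Euclidean, per column
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
faces = rng.integers(0, 256, size=(6, 8, 8))  # six stand-in "faces"
Psi, U, Omega = train_eigenfaces(faces, k=4)

# A noisy copy of training face 3 should be matched back to index 3.
probe = faces[3] + rng.normal(scale=1.0, size=(8, 8))
prediction = recognise(probe, Psi, U, Omega)
print(prediction)
```

<p>Swapping the Euclidean norm in recognise for another distance, such as the sum of absolute differences, gives the variant with a different metric mentioned above.</p>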