Jekyll2019-03-20T13:27:11+00:00https://davidtorpey.com//feed.xmlDavid TorpeyStuff that interests me, and hopefully you too. Hopefully we learn something along the way as well.Human Action Recognition2019-03-18T19:55:55+00:002019-03-18T19:55:55+00:00https://davidtorpey.com//2019/03/18/human-action-recognition<p>In this post we will discuss the problem of human action recognition - an application of video analysis / recognition. The task is simply to identify a single action from a video. The typical setting is a dataset consisting of <script type="math/tex">N</script> action classes, where each class has a set of videos associated with it relating to that action. We will focus on the approaches typically taken in early action recognition research, and then move on to the current state-of-the-art approaches. There is a recurring theme in action recognition of extending conventional two-dimensional algorithms into three dimensions to accommodate the extra (temporal) dimension when dealing with videos instead of images.</p>
<p>Early research tended to focus on hand-crafted features. The benefit of this is that domain knowledge is incorporated directly into the features, which should increase performance. The high-level idea behind these approaches is as follows:</p>
<ul>
<li>Use interest point detection mechanism to localise points of interest to be used as the basis for feature extraction.</li>
<li>Compute descriptions of these interest points in the form of (typically, gradient-based) descriptors.</li>
<li>Quantise local descriptors into global video feature representations.</li>
<li>Train an SVM of some form to learn to map from global video representation to action class.</li>
</ul>
<p>Interest points are usually detected using a three-dimensional extension of the well-known Harris operator - space-time interest points (STIPs). However, in later research simple dense sampling was instead preferred for its resulting performance and speed. Interest points are also detected at multiple spatial and temporal scales to account for actions of differing speed and temporal extent. Descriptors are commonly computed within a local three-dimensional volume around the interest points (i.e. a cuboid). These descriptors are typically one of the following three (in some form): 1. histogram of oriented gradients; 2. histogram of optical flow; 3. motion boundary histograms.</p>
<p>The quantisation step to encode these local features into a global, fixed-length feature representation is usually done using either: 1. K-Means clustering using a bag-of-visual-words approach; or 2. Fisher vectors. Fisher vectors typically result in higher performance, but at the cost of an exploding dimensionality. The normalisation applied to these features is important. The common approach was to apply <script type="math/tex">L_2</script> normalisation; more recently, power normalisation has been preferred. An SVM then learns the mapping to action classes from the normalised versions of the representations. The most successful of these hand-crafted approaches is iDT (improved dense trajectories). iDTs are often used in tandem with deep networks in state-of-the-art approaches as they are able to encode some pertinent, salient information about the videos / actions that is difficult for the networks to capture.</p>
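As a small illustration of the normalisation step, here is a sketch (hypothetical vector `v`; the exponent `alpha = 0.5` is a common choice) of L2 versus power normalisation:

```python
import numpy as np

def l2_normalise(v):
    # Scale v to unit Euclidean length
    return v / np.linalg.norm(v)

def power_normalise(v, alpha=0.5):
    # Signed power ("square-rooting" when alpha = 0.5), followed by L2
    return l2_normalise(np.sign(v) * np.abs(v) ** alpha)

v = np.array([4.0, -9.0, 0.0, 1.0])   # a toy quantised representation
print(l2_normalise(v))
print(power_normalise(v))
```

Power normalisation dampens the influence of large components, which tends to help with the bursty statistics of quantised representations.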
<p>More recent research into action recognition has, unsurprisingly, been focused on deep learning. The most natural way to apply deep neural networks to video is to extend the successful 2D CNN architectures into the temporal domain by simply using 3D kernels in the convolutional layers and 3D pooling. This use of 3D CNNs is very common in this domain, although some research did attempt to process individual RGB frames with 2D CNN architectures. An example of a 3D CNN can be seen below.</p>
<p><img src="/assets/3dcnn.png" alt="3D CNN" /></p>
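To make the 2D-to-3D extension concrete, below is a naive (illustrative, not optimised) sketch of a single 3D convolution over a video volume; the clip size and the averaging kernel are arbitrary choices:

```python
import numpy as np

def conv3d(video, kernel):
    # Valid 3-D convolution (no stride, no padding) of a T x H x W volume
    T, H, W = video.shape
    t, k, _ = kernel.shape
    out = np.zeros((T - t + 1, H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                out[i, j, l] = np.sum(video[i:i+t, j:j+k, l:l+k] * kernel)
    return out

video = np.random.default_rng(0).random((16, 32, 32))  # 16-frame grayscale clip
kernel = np.ones((3, 3, 3)) / 27                       # 3x3x3 averaging kernel
out = conv3d(video, kernel)
print(out.shape)   # the temporal dimension is convolved too
```

Unlike the 2D case, the output shrinks along time as well as space - the kernel mixes information across neighbouring frames.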
<p>The most significant contribution to human action recognition using deep learning, however, was the introduction of additional cues to model the action. More concretely, the raw RGB videos are fed into one 3D CNN which learns salient appearance features. Further, there is another network - a flow network - which learns salient motion features from optical flow videos. An optical flow video is computed by performing frame-by-frame dense optical flow on the raw video, and using the resulting horizontal and vertical optical flow vector fields as the “images” / “frames” of the flow video. This modelling process is based on the intuition that actions can naturally be decomposed into spatial and temporal components (which are modelled by the RGB and flow networks separately). An example of an optical flow field “frame” using different optical flow algorithms can be seen below (RGB frame, MPEG flow, Farneback flow, and Brox flow). The more accurate flow algorithms, such as Brox and TVL-1, result in higher performance. However, they are much more intensive to compute, especially without their GPU implementations.</p>
<p><img src="/assets/flow.png" alt="Optical Flow Fields" /></p>
<p>This two-network approach is the basis for the state-of-the-art approaches in action recognition such as I3D and temporal segment networks. Some research attempts to add additional cues to appearance and motion to model actions, such as pose.</p>
<p>It is important to note that when using deep learning to solve action recognition, massive computational resources are needed to train the 3D CNNs. Some of the state-of-the-art approaches utilise upwards of 64 powerful GPUs to train the networks. This is needed in particular to pre-train the networks on massive datasets like Kinetics to make use of transfer learning.</p>
<p>Another consideration (particularly for deep learning approaches) is the temporal resolution of the samples used during training. The durations of actions vary hugely, and in order to make the system robust, the model needs to accommodate this. Some approaches employ careful sampling of various snippets along the temporal evolution of the video so that the samples cover the action fully. Others employ a large temporal resolution for the sample - 60-100 frames. However, this increases computational cost significantly.</p>
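The snippet-sampling idea can be sketched as follows (a hypothetical helper in the spirit of temporal segment networks; the segment count and snippet length are arbitrary choices):

```python
import numpy as np

def sample_snippets(num_frames, num_segments=3, snippet_len=5, rng=None):
    # Split the frame range into equal segments and draw one snippet from each,
    # so the samples jointly cover the full temporal extent of the video.
    rng = rng if rng is not None else np.random.default_rng()
    bounds = np.linspace(0, num_frames - snippet_len, num_segments + 1)
    starts = [int(rng.uniform(bounds[i], bounds[i + 1]))
              for i in range(num_segments)]
    return [list(range(s, s + snippet_len)) for s in starts]

snips = sample_snippets(100, rng=np.random.default_rng(0))
print(snips)   # three 5-frame snippets spread across a 100-frame video
```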
<p>Some good resources and references can be found here:</p>
<p><a href="https://hal.inria.fr/hal-00873267v2/document">iDT</a></p>
<p><a href="https://arxiv.org/pdf/1705.07750.pdf">I3D</a></p>
<p><a href="https://wanglimin.github.io/papers/WangXWQLTV_ECCV16.pdf">Temporal Segment Networks</a></p>
<p><a href="https://hal.inria.fr/hal-01764222/document">PoTion</a></p>Dimensionality Reduction2019-01-31T19:55:55+00:002019-01-31T19:55:55+00:00https://davidtorpey.com//2019/01/31/dimensionality-reduction<p>In machine learning, we often work with very high-dimensional data. For example, we might be working in a genome prediction context, in which case our feature vectors would contain thousands of dimensions, or perhaps we’re dealing with another context where the dimensionality reaches hundreds of thousands or even millions. In such a context, one common way to get a handle on the data - to understand it better - is to visualise the data by reducing its dimensions. This can be done using conventional dimensionality reduction techniques such as PCA and LDA, or using manifold learning techniques such as t-SNE and LLE.</p>
<p>For the purposes of this post, let’s assume the input features are <script type="math/tex">M</script>-dimensional.</p>
<p>The most popular, and perhaps simplest, dimensionality reduction technique is principal components analysis (PCA). In it, we assume that the relationships between the variables / features are linear. “Importance” in the PCA algorithm is defined by variance, and this assumption that variance is the important factor often holds (but not always!). To obtain the so-called principal components of the data, we find the orthogonal directions of maximum variance.</p>
<p>We obtain these principal components from the eigendecomposition of the covariance matrix of the input matrix - that is, its eigenvalues and eigenvectors. Since the covariance matrix is often prohibitively expensive to compute for a large number of features, the eigenvalues and eigenvectors are usually found using the SVD algorithm, which decomposes the input matrix into three separate matrices whose singular values and right singular vectors correspond directly to the eigenvalues and eigenvectors of the covariance matrix. In this way, we need not directly compute the covariance matrix. The data must be centered in order for this SVD trick to work.</p>
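The SVD trick can be sketched in a few lines (an illustrative example; the synthetic correlated data and the choice of `N = 2` components are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features

Xc = X - X.mean(axis=0)            # centering is required for the SVD trick
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

N = 2
components = Vt[:N]                # top-N principal directions (eigenvectors)
projected = Xc @ components.T      # data in the principal subspace
explained_var = s ** 2 / (len(X) - 1)   # eigenvalues of the covariance matrix

print(projected.shape)
```

Note that the covariance matrix itself is never formed: the squared singular values (scaled by the sample count) are its eigenvalues, and the rows of `Vt` are its eigenvectors.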
<p>The <script type="math/tex">N</script> principal components are then the <script type="math/tex">N</script> eigenvectors with the largest associated absolute eigenvalues. These are linear combinations of the input features, where each feature contributes a different amount to the principal component. If there are strong linear relationships between the input variables, relatively few principal components will capture the majority of the variance in the data. However, if not much of the variance is captured by relatively few components, this does not necessarily mean that there are no relationships or underlying structure in the data - the structure might be in the form of non-linear interactions and relationships. This is the reason non-linear dimensionality reduction (such as KPCA) and manifold learning techniques exist.</p>
<p><img src="/assets/pca.png" alt="PCA" /></p>
<p>In the above image we can see that in the original, 3-dimensional, raw feature space, the clusters of data are separated quite nicely; the 4 groups are roughly linearly separable. In the left plot, we can also see the first two principal components of the data - the two (orthogonal) directions / axes in which the data varies maximally with respect to variance. In the right plot, we can see the projection of the data into the 2-dimensional principal subspace. The data separates quite nicely into 4 distinct clusters, which suggests that the data has strong linear relationships.</p>
<p>Manifold learning allows us to estimate the hypothesised low-dimensional non-linear manifold (or set of manifolds) on which our high-dimensional data lies. Different manifold learning algorithms optimise for different criteria depending on what type of structure of the data they want to capture - local or global or a combination.</p>
<p>I’ll discuss one manifold learning technique. This technique - t-SNE - is popular in the machine learning research community. t-SNE stands for t-distributed stochastic neighbour embedding. t-SNE spawns from a technique known as SNE (unsurprisingly known as stochastic neighbour embedding). SNE converts distances between data points in the original, high-dimensional space (termed datapoints) into conditional probabilities that represent similarities. These similarities are simply the probability that a datapoint <script type="math/tex">x_i</script> would pick a datapoint <script type="math/tex">x_j</script> as its neighbour if neighbours were picked in proportion to their probability density under a Gaussian centered at <script type="math/tex">x_i</script>, which we denote as <script type="math/tex">p_{ij}</script>. This means that for nearby points, this similarity is relatively high, and for widely-separated points this similarity approaches zero. The low-dimensional counterparts of the datapoints (known as the map points) are <script type="math/tex">y_i</script> and <script type="math/tex">y_j</script>. We compute similar conditional probabilities (i.e. similarities) for these map points, which we denote <script type="math/tex">q_{ij}</script>.</p>
<p>If these map points correctly model the similarities of the datapoints, we should have that <script type="math/tex">p_{ij}</script> is equal to <script type="math/tex">q_{ij}</script>. Thus, SNE attempts to find the low-dimensional representation that minimizes the KL-divergence between these two conditional distributions. The problem with this approach is that the cost function is difficult to optimize, and it also suffers from the infamous crowding problem - the area of the low-dimensional map that is available to accommodate moderately-distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints. Thus, t-SNE is born.</p>
<p>t-SNE addresses these issues of SNE by using a symmetric cost function with simpler gradients, and uses a student t-distribution to calculate the similarities in the low-dimensional space instead of a Gaussian. This heavy-tailed distribution in the low-dimensional space alleviates the crowding and optimization problems. The KL-divergence-based cost function can be easily optimized using a variant of gradient descent with momentum.</p>
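The two similarity kernels can be sketched on toy data (a hypothetical example; a single global `sigma` is used here, whereas t-SNE actually tunes a per-point bandwidth via a perplexity parameter):

```python
import numpy as np

def gaussian_similarities(X, sigma=1.0):
    # Gaussian kernel on pairwise squared distances in the high-dim space
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbour
    return P / P.sum(axis=1, keepdims=True)  # row-wise conditional probabilities

def student_t_similarities(Y):
    # Student t-distribution (1 degree of freedom) on the low-dim map points
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    Q = 1.0 / (1.0 + d2)                     # heavy-tailed kernel
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))   # 6 datapoints in R^10
Y = rng.normal(size=(6, 2))    # their (here random) 2-D map points
P = gaussian_similarities(X)
Q = student_t_similarities(Y)
print(P.shape, Q.shape)
```

The heavy tail of the t-distribution means moderately-distant datapoints can sit further apart in the map while keeping the same similarity, which is exactly what alleviates crowding.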
<p>t-SNE is able to learn good, realistic manifolds as it is able to effectively capture the non-linear relationships and interactions in data, if they are present. t-SNE in its original form computes, specifically, a 2-dimensional projection / map. We can see a comparison of t-SNE and PCA in the below image. PCA is inherently limited here, since the projection into the principal subspace is linear. t-SNE has much more effectively captured the structure of the data, allowing for a much nicer, clearer visualisation.</p>
<p><img src="/assets/pcavstsne.png" alt="PCA vs t-SNE" /></p>
<p>Some great resources for this topic can be found at:</p>
<p><a href="http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">t-SNE</a></p>
<p><a href="http://www.jmlr.org/papers/volume9/goldberg08a/goldberg08a.pdf">Manifold Learning</a></p>
<p><a href="https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf">PCA Tutorial</a></p>Optical Flow2018-12-23T14:23:55+00:002018-12-23T14:23:55+00:00https://davidtorpey.com//2018/12/23/optical-flow<p>Optical flow is a method for motion analysis and image registration that aims to compute the displacement of intensity patterns. Optical flow is used in many different settings in the computer vision realm, such as video recognition and video compression. The key assumption of many optical flow algorithms is known as the brightness constancy constraint, and is defined as:</p>
<script type="math/tex; mode=display">f(x, y, t) = f(x + dx, y + dy, t + dt)</script>
<p>This constraint simply states that the intensity of moving pixels remains constant during motion. If we take the first-order Taylor expansion of this equation, we obtain <script type="math/tex">f_x dx + f_y dy + f_t dt = 0</script>. Dividing by <script type="math/tex">dt</script> yields:</p>
<script type="math/tex; mode=display">f_x u + f_y v + f_t = 0</script>
<p>where <script type="math/tex">u = \frac{dx}{dt}</script>, and <script type="math/tex">v = \frac{dy}{dt}</script>. This equation is known as the optical flow (constraint) equation. Since we want to solve for both <script type="math/tex">u</script> and <script type="math/tex">v</script> from this single equation, the system is underconstrained.</p>
<p>The first optical flow algorithm that will be discussed is perhaps the most well-known - Lucas-Kanade, otherwise known as KLT. In order to perform optical flow, one first needs to detect some interest points (pixels) we want to track. In the case of the KLT tracker, these are usually a set of sparse interest points, such as Shi-Tomasi good features to track.</p>
<p>Since the system is underconstrained, KLT considers local optical flow - assuming the flow is constant within a <script type="math/tex">(2k+1) \times (2k+1)</script> window. This yields an overdetermined system of equations <script type="math/tex">A u = -f_t</script>. Using the pseudo-inverse of <script type="math/tex">A</script>, we can obtain the least-squares solution:</p>
<p><script type="math/tex">u = -(A^T A)^{-1} A^T f_t</script>.</p>
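The local least-squares solve can be sketched numerically (a hypothetical example: a smooth synthetic pattern with a known sub-pixel shift; the window size and test pattern are arbitrary choices):

```python
import numpy as np

ys, xs = np.mgrid[0:32, 0:32].astype(float)
u_true, v_true = 0.5, 0.2

def pattern(x, y):
    # Two differently-oriented components avoid the aperture problem
    return np.sin(0.2 * x) + np.sin(0.3 * y)

frame0 = pattern(xs, ys)
frame1 = pattern(xs - u_true, ys - v_true)   # pattern shifted by (u, v)

fx = np.gradient(frame0, axis=1)   # spatial derivatives
fy = np.gradient(frame0, axis=0)
ft = frame1 - frame0               # temporal derivative

# Local (2k+1) x (2k+1) window around the image centre
k = 7
win = (slice(16 - k, 16 + k + 1), slice(16 - k, 16 + k + 1))
A = np.stack([fx[win].ravel(), fy[win].ravel()], axis=1)
b = -ft[win].ravel()

flow, *_ = np.linalg.lstsq(A, b, rcond=None)  # solves (A^T A) flow = A^T b
print(flow)   # approximately (0.5, 0.2)
```

Note the brightness-constancy linearisation only holds for small, sub-pixel motions; in practice this is handled with image pyramids.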
<p>There are other optical flow algorithms that perform dense optical flow - optical flow for a dense set of interest points. Lucas-Kanade works well for sparse interest points, but is too computationally intensive for dense optical flow. Dense interest points are most often sampled using a technique known as dense sampling - sampling points on a regular grid over the image. This can even be every pixel.</p>
<p>One such algorithm is Farneback’s method, which computes the flow for dense interest points. For example, if every pixel is tracked from one frame to another in a video, the result is the per-pixel horizontal and vertical flow of that pixel. These flows essentially form a two-channel image of the same size as the input frames, where the channels are optical flow vector fields representing the horizontal and vertical flow, respectively.</p>Ensemble Learning2018-12-10T19:55:55+00:002018-12-10T19:55:55+00:00https://davidtorpey.com//2018/12/10/ensemble-learning<p>Ensemble learning is one of the most useful methods in machine learning, not least because it is essentially agnostic to the statistical learning algorithm being used. Ensemble learning techniques are a set of algorithms that define how to combine multiple classifiers into one strong classifier. There are various ensemble learning techniques, but this post will focus on the two most popular - bagging and boosting. These two approach the same problem in very different ways.</p>
<p>To explain these two algorithms, we assume a binary classification context, with a dataset consisting of a feature set <script type="math/tex">D</script> and a target set <script type="math/tex">Y</script>, where <script type="math/tex">y \in \{-1, 1\}</script> <script type="math/tex">\forall y \in Y</script>.</p>
<p>Bagging, otherwise known as bootstrap aggregation, depends on a sampling technique known as the bootstrap. This is a resampling method where we sample, with replacement, at each step of the aggregation. Essentially, we obtain bootstrapped samples <script type="math/tex">X_t \subset D</script>, and train a weak learner <script type="math/tex">h_t : X_t \mapsto Y</script>, for <script type="math/tex">t = 1, \dots, M</script>, where <script type="math/tex">M \in \mathbb{N}</script> is the number of so-called weak learners in the ensemble. Then, on a test example <script type="math/tex">x</script>, we make a prediction by taking the mode of the predictions of the <script type="math/tex">M</script> weak learners: <script type="math/tex">y_p = \text{mode}([h_1(x), h_2(x), \dots, h_M(x)])</script>. Random forests, for example, employ bagging in their predictions. However, the bootstrapped samples used to train each of the weak learners (here decision trees) consist of random samples of both the examples and the features. In this way, the decision trees in the random forest are made to be approximately de-correlated from each other, which gives the algorithm its effectiveness. The main reason for using bagging is to reduce the variance of an estimator. Such estimators are usually ones with a large VC-dimension or capacity, such as random forests.</p>
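Bagging can be sketched on toy 1-D data (a hypothetical threshold-stump weak learner; `M = 25` is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    # Weak learner: the single-feature threshold with lowest training error
    best = None
    for thr in np.unique(X):
        for sign in (1, -1):
            pred = np.where(X >= thr, sign, -sign)
            err = np.mean(pred != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda X: np.where(X >= thr, sign, -sign)

def bag(X, y, M=25):
    n = len(X)
    learners = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # bootstrap: n draws with replacement
        learners.append(fit_stump(X[idx], y[idx]))
    def predict(Xq):
        votes = np.stack([h(Xq) for h in learners])
        return np.sign(votes.sum(axis=0))  # majority vote (mode for +/-1 labels)
    return predict

# Toy data: y = +1 when x > 0
X = np.linspace(-1, 1, 101)
y = np.where(X > 0, 1, -1)
predict = bag(X, y)
acc = np.mean(predict(X) == y)
print(acc)   # training accuracy of the bagged ensemble
```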
<p>Boosting is another very popular ensemble learning method. Unlike bagging, the current learner in the ensemble depends on the results of the previous learner in the ensemble. We will discuss the popular boosting algorithm known as AdaBoost. AdaBoost adaptively reweights samples such that the difficult-to-classify samples are given more weight as the ensemble progresses. A prediction scheme is then introduced to incorporate the predictions of each learner in the ensemble. Similar to bagging, we create an ensemble consisting of <script type="math/tex">M</script> weak learners. We then initialise the weight for each sample to a uniform distribution: <script type="math/tex">D_1(i) = \frac{1}{m}</script> <script type="math/tex">\forall i</script>, where <script type="math/tex">m</script> is the number of samples. Then, for each round <script type="math/tex">t</script>, we train a weak learner <script type="math/tex">h_t : X \mapsto \{-1, 1\}</script> using distribution <script type="math/tex">D_t</script>. We then find the error of the weak learner: <script type="math/tex">\epsilon_t = P_{i \sim D_t}[h_t(x_i) \neq y_i]</script>. Next, we compute the weight that we will use to adaptively amend the distribution for the next weak learner in the ensemble so that difficult-to-classify samples are weighted more heavily: <script type="math/tex">\alpha_t = \frac{1}{2} \text{ln}(\frac{1-\epsilon_t}{\epsilon_t})</script>. The distribution is then updated using the following formula: <script type="math/tex">D_{t+1}(i) = \frac{D_t(i) \text{exp}(-\alpha_t y_i h_t(x_i))}{Z_t}</script>, where <script type="math/tex">Z_t</script> is a normalisation constant to ensure <script type="math/tex">D_{t+1}</script> is a distribution. The final classifier is then given by: <script type="math/tex">H(x) = \text{sign}(\sum_{t=1}^M \alpha_t h_t(x))</script>. 
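The update rules above can be sketched on toy 1-D data (a hypothetical threshold-stump weak learner; the data is deliberately not separable by any single stump, so boosting is actually needed):

```python
import numpy as np

def adaboost(X, y, M=10):
    m = len(X)
    D = np.full(m, 1.0 / m)                # D_1(i) = 1/m
    learners = []
    for _ in range(M):
        # Weak learner: the stump with lowest weighted error under D_t
        best = None
        for thr in np.unique(X):
            for sign in (1, -1):
                pred = np.where(X >= thr, sign, -sign)
                eps = D[pred != y].sum()   # epsilon_t
                if best is None or eps < best[0]:
                    best = (eps, thr, sign)
        eps, thr, sign = best
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        pred = np.where(X >= thr, sign, -sign)
        D = D * np.exp(-alpha * y * pred)  # upweight misclassified samples
        D /= D.sum()                       # Z_t normalisation
        learners.append((alpha, thr, sign))
    def H(Xq):
        s = sum(a * np.where(Xq >= t, sg, -sg) for a, t, sg in learners)
        return np.sign(s)
    return H

# Toy data: a positive interval, not separable by any single stump
X = np.array([-3., -2., -1., 1., 2., 3.])
y = np.array([-1, -1, 1, 1, -1, -1])
H = adaboost(X, y)
acc = np.mean(H(X) == y)
print(acc)
```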
Boosting is most commonly used to reduce the bias of an estimator, and the weak learner can be any classifier.</p>Autoencoders2018-12-02T19:55:55+00:002018-12-02T19:55:55+00:00https://davidtorpey.com//2018/12/02/auto-encoders<p>Autoencoders fall under the unsupervised learning category, and are a special case of neural networks that map the inputs (in the input layer) back to the inputs (in the final layer). This can be seen mathematically as <script type="math/tex">f : \mathbb{R}^m \mapsto \mathbb{R}^m</script>. Autoencoders were originally introduced to address dimensionality reduction. In the original paper, Hinton compares them with PCA, another dimensionality reduction algorithm, and shows that autoencoders outperform PCA when non-linear mappings are needed to represent the data. They are able to learn a more realistic low-dimensional manifold than linear methods due to their non-linear nature.</p>
<p>Okay, enough with the introduction; let’s get into it. Autoencoders can be thought of as two networks in one grand network. We refer to the first network as the encoder network. This takes the actual data as input and runs it through to its output, similar to a vanilla neural network. The second network is the decoder network. This takes the output of the encoder as its input and uses the original input data as its targets.</p>
<p>Usually, when we speak about autoencoders, we refer to the under-complete structure. This means that the “code” layer has fewer neurons than the input layer. The “code” layer, also sometimes referred to as the “latent variables”, is the layer we described above - that is, the output layer of the encoder and the input layer of the decoder. Now, using an under-complete structure starts to make sense, since we are essentially decreasing the dimensionality of our data. As research has continued over the past few years, people have become much more interested in what the network learns in the code layer, and a lot of research has gone into investigating this.</p>
<p>Generally the decoder is a reflection of the encoder along the code layer. However, in encoder-decoder models we can have various combinations in that we can add LSTM cells in the encoder and not in the decoder or vice-versa.</p>
<p>Since math makes everything easier, let’s represent the above mathematically as follows: <script type="math/tex">f : \mathbb{R}^m \rightarrow \mathbb{R}^n</script> and <script type="math/tex">g : \mathbb{R}^n \rightarrow \mathbb{R}^m</script>, where <script type="math/tex">f</script> is the encoder and <script type="math/tex">g</script> is the decoder. If we are considering an under-complete structure then <script type="math/tex">m > n</script>.</p>
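This can be sketched under strong simplifying assumptions (a purely *linear* encoder and decoder, trained by plain gradient descent on the reconstruction error; the dimensions, learning rate, and synthetic data are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 2                              # input dim m, code dim n (m > n)

# Synthetic data lying near a 2-D linear subspace of R^8
Z = rng.normal(size=(500, n))
X = Z @ rng.normal(size=(n, m)) + 0.01 * rng.normal(size=(500, m))

We = 0.1 * rng.normal(size=(m, n))       # encoder weights: f(x) = x We
Wd = 0.1 * rng.normal(size=(n, m))       # decoder weights: g(c) = c Wd
lr = 0.02
for _ in range(2000):
    code = X @ We                        # f(X): the under-complete code
    err = code @ Wd - X                  # reconstruction error g(f(X)) - X
    # Gradient steps on the mean squared reconstruction error
    grad_Wd = code.T @ err / len(X)
    grad_We = X.T @ (err @ Wd.T) / len(X)
    Wd -= lr * grad_Wd
    We -= lr * grad_We

loss = np.mean((X - (X @ We) @ Wd) ** 2)
print(loss)   # should be well below the raw data variance
```

A linear autoencoder like this can at best recover the principal subspace; the non-linear activations of a real autoencoder are what let it beat PCA.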
<p>Autoencoders seek to describe the low-dimensional smooth structure - the manifold - underlying our high-dimensional data.</p>
<p>There are many variations of autoencoders that have been developed over the past few years, including over-complete autoencoders, de-noising autoencoders, and variational autoencoders. The basic idea behind all these models is the same as for the standard autoencoder.</p>
<p>Applications of these models can vary from dimensionality reduction to information retrieval.</p>
<p>Some great resources can be found at:</p>
<p><a href="https://www.cs.toronto.edu/~hinton/science.pdf">Reducing the dimensionality of data with neural networks</a></p>
<p><a href="http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/">Autoencoders - Tutorial</a></p>
<p><a href="https://becominghuman.ai/understanding-autoencoders-unsupervised-learning-technique-82fb3fbaec2">Understanding Autoencoders</a></p>Local Feature Encoding and Quantisation2018-11-25T19:55:55+00:002018-11-25T19:55:55+00:00https://davidtorpey.com//2018/11/25/feature-quantisation<p>In this post, I will describe local feature encoding and quantisation - why it is useful, where it is used, and some of the popular techniques used to perform it.</p>
<p>Feature quantisation is commonly used in domains such as image and video retrieval; however, it can be applied anywhere we would like to convert a variable number of local features into a single feature of uniform dimensionality.</p>
<p>Consider a set of images <script type="math/tex">\{I_i\}^{n}_{i=1}</script>, and suppose we would like to obtain a fixed-length representation of each image so that we can index them quickly and easily by comparing these representations using some similarity measure. One common way to do this is to find interest points on the images, and compute the SIFT or SURF descriptors around those interest points. This means that each image will have a set of descriptors <script type="math/tex">\{v_j\}_{j=1}^{n_i}</script>, where <script type="math/tex">n_i</script> is the number of descriptors found for image <script type="math/tex">I_i</script>. Since the <script type="math/tex">n_i</script>s are not necessarily equal, we need some scheme to compute a fixed-length global representation of the image from these local descriptors in order, for example, to be able to compare similarity between images.</p>
<p>The most popular of these local feature encoding methods is bag-of-words (BoW), sometimes known as bag-of-visual-words or bag-of-features. The technique is performed in the following manner. We sample a subset of the local descriptors across all images; call this set <script type="math/tex">S</script>. We then use the descriptors in <script type="math/tex">S</script> to estimate a K-Means clustering with <script type="math/tex">K</script> cluster centroids. These <script type="math/tex">K</script> centroids can be thought of as visual words in the image feature space, and together they form a visual codebook. Once we have learned this so-called visual codebook, we can use it to compute a global, fixed-length representation of an image.</p>
<p>To compute the fixed-length representation, consider a particular image <script type="math/tex">I_i</script>’s descriptor set <script type="math/tex">\{v_j\}_{j=1}^{n_i}</script>. We then compute a vector <script type="math/tex">h \in \mathbb{R}^K</script>, where the <script type="math/tex">k</script>th dimension of <script type="math/tex">h</script> counts the number of local descriptors of <script type="math/tex">I_i</script> that belong to the <script type="math/tex">k</script>th visual word (i.e. cluster centroid) of the K-Means clustering. This is the quantisation part of the process. Determining which visual word a particular descriptor belongs to is achieved by computing the distance between the descriptor and all the cluster centroids, using some distance metric (usually Euclidean distance) in the image feature space. This vector of counts is then <script type="math/tex">L_2</script>-normalised to obtain the final, global, fixed-length representation of the image <script type="math/tex">I_i</script>.</p>
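The quantisation step can be sketched as follows (a hypothetical random codebook stands in for learned K-Means centroids; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 16                             # codebook size, descriptor dimension
codebook = rng.normal(size=(K, d))       # stand-in for K-Means centroids
descriptors = rng.normal(size=(40, d))   # one image's local descriptors (n_i = 40)

# Assign each descriptor to its nearest centroid (Euclidean distance)
dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
words = dists.argmin(axis=1)

# Histogram of visual-word counts, then L2 normalisation
h = np.bincount(words, minlength=K).astype(float)
h /= np.linalg.norm(h)
print(h.shape)   # fixed length K regardless of n_i
```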
<p>Other techniques exist to encode local features into global features, such as Fisher vectors and VLAD (vector of locally aggregated descriptors). Fisher vectors are the current state-of-the-art in this domain. However, they can quickly become very high-dimensional, as they are essentially a concatenation of partial derivatives with respect to the parameters of a GMM (Gaussian mixture model) with <script type="math/tex">K</script> modes estimated in the <script type="math/tex">D</script>-dimensional image feature space. They are <script type="math/tex">2 K D + K</script>-dimensional; however, the <script type="math/tex">K</script> term is often discarded, as these are the derivatives with respect to the mixture weights, which have been empirically shown to add little value to the representation. Thus, they are typically <script type="math/tex">2 K D</script>-dimensional. VLAD is a representation computed by quantising the residuals of the descriptors with respect to their assigned cluster centroids in a K-Means clustering of the data. It often results in similar performance to Fisher vectors, while being of a lower dimensionality and quicker to compute.</p>
<p><img src="/assets/BoW.png" alt="BoW Flow" /></p>
<p>Some great resources for this topic can be found at:</p>
<p><a href="https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/csurka-eccv-04.pdf">BoW</a></p>
<p><a href="https://www.robots.ox.ac.uk/~vgg/rg/papers/peronnin_etal_ECCV10.pdf">Fisher vectors</a></p>
<p><a href="https://lear.inrialpes.fr/pubs/2010/JDSP10/jegou_compactimagerepresentation.pdf">VLAD</a></p>Support Vector Machines - Why and How (2018-11-25, https://davidtorpey.com//2018/11/25/svm)<p>Support vector machines (SVMs) are one of the most popular supervised learning algorithms in use today, even with the onslaught of deep learning and the neural network take-over. The reason they have remained popular is their reliability across a wide variety of problem domains and datasets. They often have great generalisation performance, and this is almost solely due to the clever way in which they work - that is, how they approach the problem of supervised learning and how they formulate the optimisation problem they solve.</p>
<p>There are two types of SVMs - hard-margin and soft-margin. A hard-margin SVM assumes the data is linearly separable (in the raw feature space or some high-dimensional feature space that we can map to) without any errors, whereas a soft-margin SVM has some leeway in that it allows for some misclassification if the data is not completely linearly separable. When speaking of SVMs, we are generally referring to soft-margin ones, and thus this post will focus on these. Moreover, we will focus on a binary classification context.</p>
<p>Consider a labeled set of <script type="math/tex">n</script> feature vectors and corresponding targets: <script type="math/tex">\{(x_i, y_i)\}^{n}_{i=1}</script>, where <script type="math/tex">x_i \in \mathbb{R}^m</script> is feature vector <script type="math/tex">i</script> and <script type="math/tex">y_i \in \{-1, 1\}</script> is target <script type="math/tex">i</script>. An SVM attempts to find a hyperplane that separates the classes in the feature space, or some transformed version of the feature space. The hyperplane, however, is defined to be a very specific separating hyperplane - the one that separates the data maximally; that is, with the largest margin between the two classes.</p>
<p>Define a hyperplane <script type="math/tex">\mathcal{H} := \{x : f(x) = x^T \beta + \beta_0 = 0\}</script>, such that <script type="math/tex">\|\beta\| = 1</script>. Then, we know that <script type="math/tex">f(x)</script> is the signed distance from <script type="math/tex">x</script> to <script type="math/tex">\mathcal{H}</script>. As a side note, in the case that the data is linearly separable, we have that <script type="math/tex">y_i f(x_i) > 0</script>, <script type="math/tex">\forall i</script>. However, since we are solely dealing with the linearly non-separable case, we define a set of slack variables <script type="math/tex">\xi = [\xi_1, \xi_2, \dots, \xi_n]</script>. These essentially provide the SVM classifier with some leeway in that it then allows for a certain amount of misclassification. Then, we let <script type="math/tex">M</script> be the width of the margin on either side of our maximum-margin hyperplane. We want, for all <script type="math/tex">i</script>, that <script type="math/tex">y_i (x_i^T \beta + \beta_0) \ge M - \xi_i</script>, <script type="math/tex">\xi_i \ge 0</script>, and <script type="math/tex">\sum_i \xi_i \le K</script>, for some <script type="math/tex">K \in \mathbb{R}</script>. This means that we want a point <script type="math/tex">x_i</script> to be at least a distance of <script type="math/tex">M</script> away from <script type="math/tex">\mathcal{H}</script> (on its correct side of the margin) with a leeway/slack of <script type="math/tex">\xi_i</script>.</p>
<p>The above constraints lead to a non-convex optimization problem. However, it can be re-formulated in such a way that makes it convex. Thus, we modify such that for all <script type="math/tex">i</script>, <script type="math/tex">y_i (x_i^T \beta + \beta_0) \ge M (1 - \xi_i)</script>. That is, we measure the relative distance from a point <script type="math/tex">x_i</script> to the hyperplane, as opposed to the actual distance as done in the first, non-convex, formulation. The slack variables then simply represent the proportional amount by which the predictions are on the wrong side of their margin. By bounding <script type="math/tex">\sum_i \xi_i</script>, we bound the total proportional amount by which the training predictions fall on the wrong side of their margin.</p>
<p>Thus, it is clear that misclassification occurs when <script type="math/tex">\xi_i > 1</script>. Therefore, the <script type="math/tex">\sum_i \xi_i \le K</script> constraint means we can have at most <script type="math/tex">K</script> training misclassifications.</p>
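The slack variables are easy to compute for any candidate hyperplane: <script type="math/tex">\xi_i = \max(0, 1 - y_i f(x_i))</script>. A small NumPy sketch on toy data (the hyperplane here is an arbitrary illustrative choice, not a trained one):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # labels in {-1, +1}
beta, beta_0 = np.array([1.0, 1.0]), 0.0    # a candidate hyperplane (not optimised)

margins = y * (X @ beta + beta_0)           # signed functional margin per point
xi = np.maximum(0.0, 1.0 - margins)         # slack: shortfall from the margin
n_misclassified = int((xi > 1.0).sum())     # xi_i > 1 <=> wrong side of hyperplane
```

Note that `xi > 1` holds exactly when the margin `y_i f(x_i)` is negative, i.e. when the point is actually misclassified, which is the equivalence the text describes.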
<p>If we drop the unit-norm constraint on the parameter <script type="math/tex">\beta</script> and define <script type="math/tex">M := \frac{1}{\|\beta\|}</script>, we can then formulate the following convex optimization problem for the SVM:</p>
<script type="math/tex; mode=display">\min_{\beta, \beta_0} \|\beta\| \\
\text{s.t. } y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i, \; \xi_i \ge 0 \; \forall i, \; \sum_i \xi_i \le K</script>
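This constrained problem is equivalent to an unconstrained penalised form, <script type="math/tex">\min \frac{1}{2}\|\beta\|^2 + C \sum_i \xi_i</script>, where the cost <script type="math/tex">C</script> plays the inverse role of the budget <script type="math/tex">K</script>. A minimal subgradient-descent sketch of that penalised objective on toy data (the data, learning rate and epoch count are illustrative assumptions; real implementations use dedicated QP or coordinate-descent solvers):

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on (1/2)||beta||^2 + C * sum_i max(0, 1 - y_i f(x_i))."""
    n, m = X.shape
    beta, beta_0 = np.zeros(m), 0.0
    for _ in range(epochs):
        margins = y * (X @ beta + beta_0)
        active = margins < 1                 # points violating the margin
        # subgradient of the hinge terms only involves margin violators
        grad_beta = beta - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        beta -= lr * grad_beta
        beta_0 -= lr * grad_b
    return beta, beta_0

# linearly separable toy problem: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(30, 2)), rng.normal(2, 0.5, size=(30, 2))])
y = np.array([-1] * 30 + [1] * 30)
beta, beta_0 = train_soft_margin_svm(X, y)
accuracy = (np.sign(X @ beta + beta_0) == y).mean()
```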
<p>With this formulation, it is clear that points well within their bounds (i.e. well within their class boundary) do not have much effect on shaping the decision/class boundary. Also, if the class boundary can be constructed using a small number of support vectors relative to the training set size, then the generalization performance will be high, even in an infinite-dimensional space.</p>
<p>It should be noted that up until now we have been working in the base feature space. However, SVMs are part of a class of methods known as kernel methods. This means that we can apply a function (known as the kernel function) to implicitly transform the base feature space into a (possibly) very high-dimensional feature space. We hypothesise that in this transformed feature space, it may be easier to find a separating hyperplane between the classes. It is this kernel property of SVMs that allows them to learn non-linear decision boundaries (as opposed to the linear ones an SVM without a kernel function can learn). We can simply replace <script type="math/tex">x_i</script> in the formulation by <script type="math/tex">\phi(x_i)</script>, where <script type="math/tex">\phi</script> is the feature map induced by the kernel; in practice <script type="math/tex">\phi</script> is never computed explicitly, since the optimisation only requires the inner products <script type="math/tex">k(x_i, x_j) = \phi(x_i)^T \phi(x_j)</script>. The most common kernel functions are polynomial or radial basis (Gaussian) functions.</p>
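The two common kernels mentioned above are straightforward to compute directly; the kernel trick means these pairwise similarity values are all the optimiser ever needs. A NumPy sketch (the `gamma`, `degree` values and toy points are illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def poly_kernel(X, Y, degree=3, c=1.0):
    """Polynomial kernel: k(x, y) = (x^T y + c)^degree."""
    return (X @ Y.T + c) ** degree

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
K_rbf = rbf_kernel(X, X)   # Gram matrix: pairwise similarities under the kernel
```

For the RBF kernel, every point has similarity 1 with itself and the similarity decays with distance, so nearby points get large kernel values and distant ones values near zero.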
<p>Some great resources for SVMs can be found at the following links:</p>
<p><a href="https://web.stanford.edu/~hastie/Papers/ESLII.pdf">The Elements of Statistical Learning</a></p>
<p><a href="http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf">Support Vector Networks</a></p>
<p><a href="http://www.jmlr.org/papers/volume5/chen04b/chen04b.pdf">Support Vector Machine Soft Margin Classifiers: Error Analysis</a></p>