For a while I’ve been interested in representation learning in the context of deep learning: concepts such as self-supervised learning, unsupervised representation learning using GANs or VAEs, or simply the representations learned through vanilla supervised training of some neural network architecture. Upon reading the literature, I had an idea that nicely integrates two very interesting and useful models / techniques: the Fisher vector (which I’ve previously posted about on this blog), and the variational autoencoder (which I’ve been meaning to write a blog post about!). This blog post just serves to flesh out the idea, should I choose to pursue or revisit it at some point.

The Fisher vector is a state-of-the-art patch encoding technique. It can be seen as a soft / probabilistic version of VLAD (vector of locally-aggregated descriptors), which itself is very similar to the bag of visual words encoding / quantisation technique, except that you accumulate the residuals of local descriptors with respect to their assigned cluster center, instead of counting the actual visual word occurrences. The Fisher vector is based on the Fisher kernel, and assumes that the generation process of the descriptors being encoded can be modelled by some parametric probability distribution $u_{\theta}$, where $u$ is the PDF and $\theta$ are the associated parameters of this distribution. Typically, in the context of Fisher vectors, $u_{\theta}$ is chosen to be a $K$-mode GMM. Thus, $\theta = \{\alpha_i, \mu_i, \Sigma_i\}_{i=1}^{K}$ are the $K$ mixture weights, means, and covariance matrices of the GMM. EM can then be used to compute the maximum likelihood estimates of the parameters of the GMM. The Fisher vector is then defined to be the concatenation of the gradients of the log-likelihood function of the GMM with respect to each of the parameters (in practice, usually normalised using the Fisher information matrix). What should be emphasised here is that $u_{\theta}$ can be any parametric distribution, and the estimation of its parameters can be done in any way we prescribe, not necessarily using MLE/EM.
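As a rough illustration of the definition above, here is a minimal NumPy sketch that computes the raw gradients of the GMM log-likelihood for a set of descriptors and concatenates them. It assumes diagonal covariances (stored as per-dimension standard deviations `sigma`) and parameterises the mixture weights through softmax `logits` so that they stay on the simplex. Both choices, and all of the function names, are my own assumptions for illustration, and the sketch leaves out the Fisher information normalisation that is typically applied in practice.

```python
import numpy as np

def gmm_log_components(X, logits, mu, sigma):
    """Return log(w_k * N(x_n | mu_k, diag(sigma_k^2))) as an (N, K) array."""
    D = X.shape[1]
    log_w = logits - np.logaddexp.reduce(logits)            # log-softmax -> log mixture weights
    z = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]  # standardised residuals, (N, K, D)
    log_gauss = (-0.5 * np.sum(z ** 2, axis=2)
                 - np.sum(np.log(sigma), axis=1)
                 - 0.5 * D * np.log(2.0 * np.pi))
    return log_w[None, :] + log_gauss

def log_likelihood(X, logits, mu, sigma):
    """Total log-likelihood of the descriptors X under the GMM."""
    return np.logaddexp.reduce(gmm_log_components(X, logits, mu, sigma), axis=1).sum()

def fisher_vector(X, logits, mu, sigma):
    """Concatenated gradients of the log-likelihood w.r.t. (logits, mu, sigma).

    Unnormalised: no Fisher information scaling is applied here.
    """
    log_comp = gmm_log_components(X, logits, mu, sigma)
    # Posterior responsibilities gamma[n, k] = p(component k | x_n).
    gamma = np.exp(log_comp - np.logaddexp.reduce(log_comp, axis=1, keepdims=True))
    w = np.exp(logits - np.logaddexp.reduce(logits))
    diff = X[:, None, :] - mu[None, :, :]                    # (N, K, D)

    g_logits = (gamma - w[None, :]).sum(axis=0)              # (K,)
    g_mu = np.einsum("nk,nkd->kd", gamma, diff / sigma ** 2)  # (K, D)
    g_sigma = np.einsum("nk,nkd->kd", gamma, diff ** 2 / sigma ** 3 - 1.0 / sigma)
    return np.concatenate([g_logits, g_mu.ravel(), g_sigma.ravel()])

# Example usage with made-up descriptors and GMM parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))             # 100 descriptors of dimension 8
logits = np.zeros(4)                      # 4 equally weighted components
mu = rng.normal(size=(4, 8))
sigma = np.ones((4, 8))
fv = fisher_vector(X, logits, mu, sigma)  # length K + 2*K*D
```

The resulting vector has length $K + 2KD$, regardless of how many descriptors were encoded, which is what makes it usable as a fixed-size image representation.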

A variational autoencoder (VAE) is a neural network architecture, and is a generative model. It is one of the most popular current generative models in deep learning, along with the generative adversarial network (GAN). A VAE is a type of autoencoder (or, more correctly, encoder-decoder network) that contains a stochastic encoder function $q_{\theta}(z \mid x)$, which is parameterised as a neural network. This encoder outputs the parameters of $q_{\theta}(z \mid x)$, for which we choose, a priori, some parametric form (e.g. a multivariate Gaussian). We can then obtain a latent representation $z$ of our input by sampling from this distribution using our learned parameter estimates. The decoder part of the VAE is also parameterised as a neural network, and is defined as $p_{\phi}(x \mid z)$. Using this function, we can compute the reconstruction of our input $x$. One of the goals of the VAE (and AEs in general) is that the inputs, and their associated reconstructions from the decoder, be similar. This should be achieved within the paradigm of the latent space serving as a bottleneck in the learning process, which encourages the network to encode only salient information in the latent representations of the input. The VAE loss function includes a KL-divergence term, in addition to the regular pixel-space loss (which is usually MSE or some variant). The KL-divergence term serves as a regulariser on the learning process, forcing the distribution $q$ to be close to the distribution $p$. In other words, we want the KL-divergence between the encoder distribution $q_{\theta}(z \mid x)$ and the prior $p(z)$ to be small.
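Written out, the loss described above takes roughly the following form. This is a sketch in the notation of this post (encoder parameters $\theta$, decoder parameters $\phi$), and it is only one common way of writing the objective:

$$
\mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_{\theta}(z \mid x)}\left[\log p_{\phi}(x \mid z)\right] + D_{\mathrm{KL}}\left(q_{\theta}(z \mid x) \parallel p(z)\right)
$$

If we assume the decoder $p_{\phi}(x \mid z)$ is a Gaussian with fixed variance, the first term reduces (up to a scale factor and an additive constant) to the squared error between $x$ and its reconstruction, which is where the MSE pixel-space loss comes from.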

Typically, $p$ is chosen to be a standard Normal, and $q$ is chosen to be a multivariate Gaussian. Once the VAE is trained, samples similar to those it was trained on can be generated by decoding latent vectors drawn from the learned distribution. However, for the purposes of this post, we focus on the VAE’s ability to learn the parameters of some distribution, whose functional form we choose a priori.
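With a standard Normal prior and a Gaussian encoder, the KL term has a well-known closed form. The sketch below shows that term, the usual reparameterised sampling step, and a per-example loss. It assumes a diagonal covariance for $q$ (stored as a log-variance) and a plain squared-error reconstruction term; the function names are hypothetical and there is no actual network here, just the pieces of the loss.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def vae_loss(x, x_recon, mu, log_var):
    """Per-example loss: squared reconstruction error plus the KL regulariser."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return recon + kl_to_standard_normal(mu, log_var)

# Example usage with made-up encoder outputs for a batch of 2 inputs.
rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var = np.zeros((2, 2))
z = reparameterise(mu, log_var, rng)   # latent samples, same shape as mu
```

Writing the sample as `mu + sigma * eps` is what lets gradients flow back to the encoder outputs, since the randomness is pushed into `eps`.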

The idea is to learn a Fisher vector using a variant of the VAE architecture. One limiting factor of the Fisher vector is that the information it encodes is based on interest points with associated descriptors. These interest points are usually found with methods like SIFT or SURF, which all, in one way or another, define “interesting” as having large gradients in all directions. As a result, they often focus on image regions containing edges or corner-like structures, and disregard large portions of the image that contain homogeneous regions or regions with a low colour gradient. However, I hypothesise that such regions can provide very valuable information in the global context of an image. Using a convolutional variant of a VAE, we can learn better representations of the images that take into account the full context of the image. Additionally, we can assume $q$ to be a GMM, and learn the parameters of that GMM. An additional layer in the network can then be used to compute the Fisher vector from the GMM parameters. This can easily be included as a neural network layer, since all the operations needed to compute the Fisher vector have simple gradients. Thus, the full process of training the VAE, and by proxy learning a Fisher vector, can be done in an end-to-end learnable way.
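To make the "encoder outputs GMM parameters" step a little more concrete, here is one hypothetical way the final encoder layer could be turned into valid GMM parameters. The layout of the raw output vector, the softmax / exponential constraints, the diagonal covariances, and the name `gmm_head` are all assumptions of mine rather than a worked-out design; a Fisher vector layer would then consume the returned parameters.

```python
import numpy as np

def gmm_head(h, K, D):
    """Map a raw encoder output of length K + 2*K*D to valid GMM parameters.

    Returns (weights, mu, sigma) for a K-component GMM with diagonal
    covariances over D dimensions.
    """
    h = np.asarray(h, dtype=float)
    if h.shape != (K + 2 * K * D,):
        raise ValueError(f"expected a vector of length {K + 2 * K * D}, got shape {h.shape}")
    logits = h[:K]
    mu = h[K:K + K * D].reshape(K, D)
    log_sigma = h[K + K * D:].reshape(K, D)

    # Softmax keeps the mixture weights positive and summing to one.
    weights = np.exp(logits - np.logaddexp.reduce(logits))
    # Exponentiating keeps the standard deviations strictly positive.
    sigma = np.exp(log_sigma)
    return weights, mu, sigma

# Example usage with a made-up raw encoder output.
rng = np.random.default_rng(0)
K, D = 3, 4
h = rng.normal(size=K + 2 * K * D)
weights, mu, sigma = gmm_head(h, K, D)
```

Every operation here (slicing, softmax, exponential) is differentiable, so a head like this would not get in the way of end-to-end training.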

Some great resources for this post can be found below:

Fisher vectors

VAE
