Variational Autoencoders

Assumed knowledge: you should already have a general understanding of how neural networks work for this to make any sense.

Problem motivation

Unsupervised learning tries to learn structure from data without explicit labels describing what is in each example. Using the data alone, the model has to find patterns within the training set. How can we achieve this?

One common method is the autoencoder. It takes in an image, compresses it down to a small number of dimensions (also called an embedding or latent space), then tries to recreate the original input. To do this well, the model has to learn meaningful features.
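A rough sketch of a plain autoencoder (assuming PyTorch and flattened inputs; the input and layer sizes here are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder: compress the input down to a small latent vector
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: try to recreate the original input from the latent
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Training just minimizes reconstruction error, e.g.:
# loss = ((model(x) - x) ** 2).mean()
```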

TODO: Include image here & clean this up

One issue with autoencoders is that there is no constraint on the latent space, so your embeddings could be scattered anywhere and you can’t sample from the space in any meaningful way. There’s also no real way to move from one latent to another, such as transforming a 0 into a 6, since there may be large discontinuities in the space where a latent doesn’t map to anything meaningful. How can we fix this?

Autoencoders to VAE

The key issue with autoencoders is that there is no constraint on what the latent dimensions can look like, so one way to solve this is to put constraints on them. If we could force the latent dimensions to follow a diagonal Gaussian, then there would be meaningful ways to interpolate between points, and the space would have much more normality (both literally and figuratively).

What is a VAE?

Starting from a traditional autoencoder, we keep the decoder the same: it still takes some latent vector and tries to recreate the input. The difference is in how we build the encoder. Rather than deterministically using a NN to compress the input down into a latent vector, we instead have the NN output the means and variances of a diagonal Gaussian.1 Once we have the means and variances, we can randomly sample a latent vector from that distribution!
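A rough sketch of how the encoder changes (assuming PyTorch; the layer sizes are arbitrary illustrative choices, and predicting log-variance instead of variance is a common convenience so the variance stays positive, not something specific to this post): instead of a single latent vector, the network outputs a mean and a log-variance for each latent dimension, which parameterize the diagonal Gaussian we sample from.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
        )
        # Two heads: one for the means, one for the log-variances
        # of the diagonal Gaussian over the latent dimensions.
        self.mu_head = nn.Linear(256, latent_dim)
        self.logvar_head = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)
```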

Reparameterization trick

You’ll notice that if we are doing random sampling, there is no way to backpropagate the error through that step to train your neural network. The cleverest way I saw this explained is that normally, if you were to sample directly from the \mu and \sigma that the encoder produces, then your formula for the latent would be z \sim N(\mu, \sigma), which you can’t differentiate through. If instead you write z = \mu + \sigma \odot \epsilon, where \epsilon \sim N(0, I), then this is differentiable since you’ve factored out the randomness! What’s important to remember is that if you have = then you’re able to take derivatives through it, but if you have \sim, you can’t take derivatives of randomly sampled variables!

TLDR: Don’t use the parameters as the parameters of the distribution you’re directly sampling from; instead, use them to scale and shift some outside source of randomness so that you can still backprop the errors through.
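A minimal sketch of the trick (assuming the mean/log-variance encoder above): all the randomness comes from an \epsilon drawn from a fixed standard normal, and \mu and \sigma only scale and shift it, so gradients flow through them.

```python
import torch

def reparameterize(mu, logvar):
    # sigma = exp(logvar / 2); epsilon ~ N(0, I) carries all the randomness
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    # z = mu + sigma * eps is a plain equation, so backprop works through it
    return mu + std * eps
```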

What is the point of ELBO?

TODO: Now how in the world do we ensure that the latent stays close to a standard Gaussian? Also note that there’s no reason it has to be this distribution; you can swap it out for anything else where you can get a closed-form KL divergence and reparameterize. What is the solution?
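For reference, the standard form of the objective is the ELBO: a reconstruction term plus a KL term that pulls the encoder’s distribution q(z|x) toward the prior p(z) = N(0, I):

\text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - KL(q(z|x) \,\|\, p(z))

For a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form, which is exactly why this pair of distributions is so convenient:

KL(q(z|x) \,\|\, N(0, I)) = \frac{1}{2} \sum_i \left( \mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2 \right)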

Deriving ELBO

Mode Collapse

VAEs can suffer from mode collapse, where all inputs map to a single output. In practice, this looks like the decoder producing the same output regardless of the latent code, usually something like an average of your dataset.

If this occurs, it’s generally good to follow the outline in neural net recipes. If you do have mode collapse AND you’re certain the problem is within the model, then what’s probably happening is that your KL-divergence loss term dominates your reconstruction loss, and the model is stuck in a local optimum where it’s better to push every latent onto the normal Gaussian prior without any differentiation between inputs.

To fix this, try overfitting on a single x and setting the KL-divergence weight \le 1. Typical weights are somewhere in the range of [0.0001, 0.005]. If the collapse continues even with tiny weights, you can try getting rid of the KL divergence altogether and making sure the model can even recreate your input. If a constant weight doesn’t work, you can try a linear warmup where the KL-divergence weight slowly scales from 0 to 1, or a cyclic schedule.
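A minimal sketch of these schedules (the step counts and maximum weight are arbitrary illustrative choices; the returned value is the weight multiplying the KL term in the loss):

```python
def kl_weight(step, warmup_steps=10_000, max_beta=1.0, cycle_steps=None):
    """Linear warmup of the KL weight from 0 to max_beta; if cycle_steps
    is given, restart the warmup every cycle_steps steps (cyclic schedule)."""
    if cycle_steps is not None:
        step = step % cycle_steps
    return min(max_beta, max_beta * step / warmup_steps)

# loss = reconstruction_loss + kl_weight(step) * kl_divergence
```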

For both of these schedules, it seems important for the model to be able to learn a good representation before the latent dimensions collapse down to a Gaussian. If anyone has papers on the theory of this beyond the natural intuition of why it’s good, I’d love to read them!

Further reading

Sources


  1. A diagonal Gaussian is one where each dimension is independent of the others, so there is no interaction between them. This makes modeling simpler since we only need n means and n variances rather than an n \times n covariance matrix! You can think of this the same way Naive Bayes makes a simplifying assumption that is probably false but makes the problem tractable.↩︎