← Diffusion models

The general vibe here is that you take Gaussian noise, remove a little bit of it, keep doing that, and eventually you get an image?! Isn’t that kind of amazing?

Basically, I really like the framing that diffusion is a big VAE. With a regular VAE, you have an encoder and a decoder joined by a small latent bottleneck, and the hope is that the model learns something interesting in that bottleneck. You can extend this to multiple layers for a hierarchical VAE. A diffusion model is then basically a hierarchical VAE with the restrictions that the encoder is fixed to just adding noise, the latent dimension is the same as the input, and you only learn the decoder.
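To make the "fixed encoder" point concrete, here's a minimal sketch of the forward (noising) process in NumPy. The linear beta schedule and the closed-form q(x_t | x_0) are standard DDPM-style choices, not something from this note, and the 8×8 "image" is just a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear beta schedule over T steps (a common choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def encode(x0, t):
    """The 'encoder': fixed, no learned parameters, it just adds noise.

    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
    """
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

# Toy "image" in [-1, 1].
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))
xt, eps = encode(x0, t=500)

# The latent has the same shape as the input -- no bottleneck,
# unlike a regular VAE.
assert xt.shape == x0.shape
```

By the last timestep, alpha_bar is nearly zero, so x_T is essentially pure Gaussian noise; all the interesting learning happens in the decoder that reverses this.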

TODO: There is really beautiful math behind classifier free guidance, I just love it.
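While the full derivation is still a TODO, the guidance step itself fits in one line. This is one common parameterization (extrapolating from the unconditional noise prediction toward the conditional one); conventions for the guidance weight vary across papers:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: push the noise prediction past the
    conditional one, away from the unconditional one.

    w = 0 -> purely unconditional, w = 1 -> purely conditional,
    w > 1 -> over-emphasize the condition.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Tiny example with made-up predictions.
guided = cfg_combine(np.zeros(4), np.ones(4), w=2.0)
```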

TODO: The math behind diffusion models is generally just very pretty.

TODO: I need to write a notebook on deriving ELBO since that’s something I wish I had.
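As a placeholder until that notebook exists, the core of the derivation is a single application of Jensen's inequality:

```latex
\log p(x)
= \log \int p(x, z)\, dz
= \log \mathbb{E}_{q(z \mid x)}\!\left[\frac{p(x, z)}{q(z \mid x)}\right]
\geq \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]
= \mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right]
  - \mathrm{KL}\!\left(q(z \mid x) \,\|\, p(z)\right)
```

The last line is the usual reconstruction-plus-KL form of the ELBO; for diffusion you apply the same trick with the whole noising trajectory $x_{1:T}$ playing the role of $z$.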

That said, my intuition here is probably flawed, seeing as there are papers showing you can do “diffusion” with a wide family of image degradations beyond Gaussian noise.

Resources