The general idea is to flesh out a blog post that goes from a vanilla transformer like you’d learn about in a class to the cutting edge of how DeepSeek is training their LLMs. This is mainly an excuse for me to learn and read all the pretty papers they’ve written.
Prereqs
If you want to catch up on your transformer fundamentals, then I would recommend reading [TODO]. Link to the visual guides to transformers websites.
TODO: Create some large lineage that shows model names and how they came to be
DeepSeek LLM
We start off with their first language model, where they essentially copy Llama 2 and use a standard dense transformer.
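To pin down what I mean by "standard dense transformer," here's a minimal sketch of one Llama-style decoder block: pre-norm RMSNorm, multi-head self-attention, and a SwiGLU MLP, each wrapped in a residual connection. The class names and default dimensions are mine (they roughly match Llama-2-7B), and RoPE plus the causal mask are left out to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm, as used in Llama-family models (no mean subtraction)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated MLP: down( silu(gate(x)) * up(x) )."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DenseBlock(nn.Module):
    """One pre-norm decoder block: x -> x + attn(norm(x)) -> x + mlp(norm(x))."""
    def __init__(self, dim=4096, n_heads=32, mlp_hidden=11008):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp_norm = RMSNorm(dim)
        self.mlp = SwiGLU(dim, mlp_hidden)

    def forward(self, x):                        # x: (batch, seq, dim)
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted for brevity
        x = x + attn_out                         # residual around attention
        return x + self.mlp(self.mlp_norm(x))    # residual around the MLP
```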
DeepSeekMoE
MoE stands for Mixture of Experts.
Outline:
- What is a MoE?
- Why would we want one?
- What are the pitfalls?
- How does DeepSeekMoE seek to address those pitfalls?
DeepSeekMoE introduced fine-grained experts and shared experts.

### Mixture of Experts (MoE)
TODO
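To make the picture concrete, here's a minimal sketch of an MoE layer in the DeepSeekMoE spirit: a learned router sends each token to a few small ("fine-grained") routed experts, while a couple of shared experts see every token unconditionally. All names and sizes are mine, the routing loop is written for clarity rather than speed, and the auxiliary load-balancing losses are left out entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward expert; fine-grained = much narrower than a dense MLP."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim, bias=False),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim, bias=False),
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, dim=1024, n_routed=64, n_shared=2, top_k=6, expert_hidden=256):
        super().__init__()
        self.routed = nn.ModuleList([Expert(dim, expert_hidden) for _ in range(n_routed)])
        self.shared = nn.ModuleList([Expert(dim, expert_hidden) for _ in range(n_shared)])
        self.router = nn.Linear(dim, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, dim)
        scores = F.softmax(self.router(x), dim=-1)          # token-to-expert affinities
        weights, idx = scores.topk(self.top_k, dim=-1)      # keep only the top-k routed experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                         # naive dispatch loop; real code batches by expert
            for e in range(len(self.routed)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.routed[e](x[mask])
        for expert in self.shared:                          # shared experts see every token, no gating
            out += expert(x)
        return out
```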
DeepSeek-V2
This is one of the largest jumps in model architecture. While V1 was a standard dense model, V2 uses a Mixture of Experts (MoE) and Multi-head Latent Attention (MLA).
You can think of it as just a spiritual enlargement of the small models from DeepSeekMoE, but it's good enough that they decided to call it V2.
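The MoE side looks like the sketch above, just scaled up. MLA is the new piece: instead of caching full per-head keys and values, the model caches one small latent vector per token and reconstructs keys and values from it with up-projections. The sketch below is heavily simplified and uses my own names; it drops the query compression and the decoupled RoPE key that the real MLA needs to stay position-aware.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    """Latent attention, stripped down: keys and values are rebuilt from a small
    shared latent, so only that latent would need to live in the KV cache."""
    def __init__(self, dim=2048, n_heads=16, head_dim=128, kv_latent_dim=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.kv_down = nn.Linear(dim, kv_latent_dim, bias=False)   # compress hidden state to latent c_kv
        self.k_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.out_proj = nn.Linear(n_heads * head_dim, dim, bias=False)

    def forward(self, x):                                   # x: (batch, seq, dim)
        b, s, _ = x.shape
        c_kv = self.kv_down(x)                              # (b, s, kv_latent_dim): the only thing cached
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(attn.transpose(1, 2).reshape(b, s, -1))
```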
DeepSeek-V3
DeepSeek-R1
Their first model to use reinforcement learning on verifiable rewards.
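Here, "verifiable" means the reward comes from a check you can run programmatically (does the final answer match the reference, does the code pass its tests?) rather than from a learned reward model. A toy example is below; the assumption that answers show up in a `\boxed{...}` span is mine, not necessarily R1's exact format.

```python
import re

def math_answer_reward(model_output: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 if the final boxed answer matches the known
    ground truth, else 0.0. No reward model involved, just a programmatic check."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)  # assumes a \boxed{...} answer span
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```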
Appendix
Massive List of Benchmarks
- It's a personal goal of mine to understand which benchmarks are being used, why they were created, and what they're meant to test.
- What does testing bits-per-byte on a dataset actually measure? (Quick sketch after this list.)
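As far as I understand it, bits-per-byte is the model's total cross-entropy over a dataset, converted from nats to bits and divided by the dataset's size in bytes, which makes it comparable across tokenizers in a way per-token perplexity isn't. A tiny sketch (the function and argument names are mine):

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Total cross-entropy over the text, converted from nats to bits,
    normalized by the text's UTF-8 size in bytes."""
    n_bytes = len(text.encode("utf-8"))
    return (total_nll_nats / math.log(2)) / n_bytes
```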