Attention Is All You Need (Vaswani et al., 2017) — The original Transformer paper. Surprisingly readable; everything else on this list is commentary on it.
The Annotated Transformer — Harvard NLP walks through the paper line-by-line with PyTorch code interleaved. Best companion read.
The Illustrated Transformer — Jay Alammar — Still the best visual explanation on the internet. If a formula feels abstract, Jay has probably drawn it.
3Blue1Brown — Attention in Transformers (visual) — Grant Sanderson's animated walkthrough of Q, K, V, and the attention matrix. The geometric intuition for the dot product as "similarity" is much clearer in animation.
3Blue1Brown — But what is a GPT? (precursor video) — Watch this one first if you haven't. Sets up the embedding-space intuition.
Andrej Karpathy — Let's build GPT, from scratch, in code, spelled out — 2-hour video building essentially what your repo does. The single best resource if you like learning by typing.
nanoGPT (Karpathy's repo) — Production-quality cousin of your transformers-from-scratch repo. Reproduces GPT-2. Read model.py after you've read your own.
Sebastian Raschka — Self-Attention from Scratch — Codes scaled dot-product attention step-by-step in PyTorch with all dimensions labeled. Maps directly onto the Q, K, V, d, n vocabulary I gave you.
Sebastian Raschka — Understanding and Coding Self-Attention (same as above — it's the best one).
Dive into Deep Learning — Attention Mechanisms chapter — Free textbook with runnable code. The math notation is rigorous without being intimidating.
Sinusoidal — Amirhossein Kazemnejad's blog — The canonical "why those weird sin/cos formulas?" explainer. Walks through what pos, i, d, and the 10000 constant actually do.
RoPE — EleutherAI blog — The clearest write-up of rotary embeddings. Explains why rotating Q and K by position-dependent angles makes their dot product depend only on relative position.
RoPE — Su Jianlin's original blog (translated) — From the inventor. Denser, but rewarding.
Layer Normalization paper (Ba, Kiros, Hinton 2016) — Short and readable. Explains why LayerNorm works without batch dependence.
Deep Residual Learning (He et al. 2016) — The original ResNet paper, which introduced the residual connection (x + sublayer(x) in transformer terms). Predates transformers, but it is the reason you can stack 100 blocks.
On Layer Normalization in the Transformer Architecture (Xiong et al. 2020) — Pre-norm vs post-norm; explains why modern models put LayerNorm before the sublayer instead of after.
Scaling Laws for Neural Language Models (Kaplan et al. 2020) — The original OpenAI paper introducing N, D, C and the power-law curves on your slide 16.
Training Compute-Optimal LLMs / Chinchilla (Hoffmann et al. 2022) — The DeepMind paper that revised Kaplan's conclusions and gave us the "20 tokens per parameter" rule.
Chinchilla Explained — Alexandra Barr — Plain-English breakdown if the paper is too dense.
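FlashAttention (Dao et al. 2022) — The fused-kernel paper: it computes exact attention tile by tile, so the full n × n attention matrix never has to be written out to GPU memory. The accompanying Tri Dao blog post is more digestible.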
GQA paper (Ainslie et al. 2023) — Grouped-query attention. Short paper.
IBM — What is grouped-query attention? — Good plain-English intro to GQA vs MHA vs MQA.
Mixture of Experts Explained — Hugging Face blog — Best practical MoE explainer. Covers Mixtral, routing, load balancing.
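Speculative decoding paper (Leviathan et al. 2023) — A small draft model proposes several tokens and the large model verifies them in a single pass, without changing the output distribution.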
Hugging Face NLP Course — Chapter 6: Tokenizers — Walks through BPE, WordPiece, and Unigram with runnable code.
Karpathy — Let's build the GPT Tokenizer — Companion 2-hour video to "Let's build GPT." Builds tiktoken-style BPE from scratch.
Anthropic — Transformer Circuits — For understanding what's happening inside the weights once you grok the architecture. Start with "A Mathematical Framework for Transformer Circuits."
Neel Nanda — TransformerLens tutorials — Hands-on mechanistic interpretability library; great for "what does head 5 in layer 3 actually do?"
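
To make the N, D, C vocabulary from the two scaling-law papers above concrete, here is a minimal sketch of the arithmetic. It assumes the common approximation C ≈ 6 · N · D for training FLOPs and treats "20 tokens per parameter" as a rule of thumb rather than an exact law. The function names are made up for illustration and do not come from either paper.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb number of training tokens D for a model with N parameters.

    Assumption: the "20 tokens per parameter" heuristic; the real optimum
    depends on the fitted scaling-law constants.
    """
    return tokens_per_param * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute C ~= 6 * N * D (forward plus backward pass)."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    n = 70e9                   # N: 70B parameters, the size of Chinchilla itself
    d = chinchilla_tokens(n)   # D: about 1.4 trillion tokens
    c = training_flops(n, d)   # C: total training FLOPs under the 6ND approximation
    print(f"N = {n:.2e} params, D = {d:.2e} tokens, C = {c:.2e} FLOPs")
```

Plugging in N = 70B gives roughly 1.4T tokens, which is in line with how Chinchilla itself was trained.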