The foundational paper

Attention Is All You Need (Vaswani et al., 2017) — The original Transformer paper. Surprisingly readable; everything else on this list is commentary on it.

The Annotated Transformer — Harvard NLP walks through the paper line-by-line with PyTorch code interleaved. Best companion read.

The visual intuition tier (start here if anything felt fuzzy)

The Illustrated Transformer — Jay Alammar — Still the best visual explanation on the internet. If a formula feels abstract, Jay has probably drawn it.

3Blue1Brown — Attention in Transformers (visual) — Grant Sanderson's animated walkthrough of Q, K, V, and the attention matrix. The geometric intuition for the dot product as "similarity" is much clearer in animation.

3Blue1Brown — But what is a GPT? (precursor video) — Watch this one first if you haven't. Sets up the embedding-space intuition.

Code-first learning (matches your repo's spirit)

Andrej Karpathy — Let's build GPT, from scratch, in code, spelled out — 2-hour video building essentially what your repo does. The single best resource if you like learning by typing.

nanoGPT (Karpathy's repo) — Production-quality cousin of your transformers-from-scratch repo. Reproduces GPT-2. Read model.py after you've read your own.

Sebastian Raschka — Self-Attention from Scratch — Codes scaled dot-product attention step-by-step in PyTorch with all dimensions labeled. Maps directly onto the Q, K, V, d, n vocabulary I gave you.

Q, K, V and the attention formula

Sebastian Raschka — Understanding and Coding Self-Attention (the same post listed under code-first learning; it is the best treatment of the formula).

Dive into Deep Learning — Attention Mechanisms chapter — Free textbook with runnable code. The math notation is rigorous without being intimidating.
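
To make the formula these resources cover concrete, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d)) V, in plain Python. It assumes a single head, no causal mask, and no learned projection matrices; the function names and the toy Q, K, V values are made up for illustration and are not taken from any of the linked posts.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head attention. Q and K are n x d, V is n x d_v (lists of lists)."""
    d = len(Q[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d)
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        for q in Q
    ]
    # Each row of weights is a probability distribution over positions.
    weights = [softmax(row) for row in scores]
    # Output row i is the weighted average of the rows of V.
    out = [
        [sum(w * v[j] for w, v in zip(wrow, V)) for j in range(len(V[0]))]
        for wrow in weights
    ]
    return out, weights

# Toy inputs (assumed values): n = 3 tokens, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and the first query attends more to the keys it overlaps with than to the orthogonal one, which is the "dot product as similarity" intuition in numbers.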

Positional encoding

Sinusoidal — Amirhossein Kazemnejad's blog — The canonical "why those weird sin/cos formulas?" explainer. Walks through what pos, i, d, and the 10000 constant actually do.

RoPE — EleutherAI blog — The clearest write-up of rotary embeddings. Explains why rotating Q and K by position-dependent angles makes their dot product depend only on the relative offset between the two positions.

RoPE — Su Jianlin's original blog (translated) — From the inventor. Denser, but rewarding.
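
As a rough companion to the three posts above, the sketch below computes the sinusoidal encoding for one position and applies a rotary rotation to a vector, then checks that the rotated dot product depends only on the offset between positions. It assumes the adjacent-pair layout (dimensions 2i and 2i+1 form one rotation plane) and the base of 10000; some implementations pair the dimensions differently, and the function names and sample vectors here are hypothetical.

```python
import math

def sinusoidal_pe(pos, d, base=10000.0):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d // 2):
        angle = pos / (base ** (2 * i / d))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def rope(vec, pos, base=10000.0):
    """Rotate each (2i, 2i+1) pair of vec by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos / (base ** (2 * i / d))
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([
            x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
        ])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumed toy query and key vectors with d = 4.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (3) at very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

`score_near` and `score_far` agree up to floating-point error because both pairs are three positions apart, which is the relative-position property the RoPE posts derive.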

Layer Norm and residuals

Layer Normalization paper (Ba, Kiros, Hinton 2016) — Short and readable. Explains why LayerNorm works without batch dependence.

Deep Residual Learning (He et al. 2016) — The original ResNet paper, which introduced the skip connection that transformers write as x + sublayer(x). Pre-dates transformers but is the reason you can stack 100 blocks.

On Layer Normalization in the Transformer Architecture (Xiong et al. 2020) — Pre-norm vs post-norm; explains why modern models put LayerNorm before the sublayer instead of after.
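
The difference between the two orderings is easier to see side by side. This is a minimal sketch, assuming a LayerNorm without the learnable gain and bias and a made-up `sublayer` standing in for attention or the feed-forward network; none of these names come from the papers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (almost) unit variance.
    Learnable gain and bias are omitted for brevity."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or the MLP."""
    return [2.0 * v + 1.0 for v in x]

def post_norm_block(x):
    # Original Transformer ordering: LayerNorm(x + sublayer(x)).
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x):
    # Pre-norm ordering: x + sublayer(LayerNorm(x)).
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 8.0]
post = post_norm_block(x)
pre = pre_norm_block(x)
```

In the pre-norm block the input `x` reaches the output untouched through the residual path, while post-norm renormalizes the sum on every block. That untouched identity path is the property usually credited with making deep pre-norm stacks easier to train.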

Scaling laws

Scaling Laws for Neural Language Models (Kaplan et al. 2020) — The original OpenAI paper introducing N, D, C and the power-law curves on your slide 16.

Training Compute-Optimal Large Language Models / Chinchilla (Hoffmann et al. 2022) — The DeepMind paper that revised Kaplan's conclusions and gave us the rough "20 tokens per parameter" rule.

Chinchilla Explained — Alexandra Barr — Plain-English breakdown if the paper is too dense.
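
For a feel of what the rule of thumb implies, here is a back-of-the-envelope sketch using N, D, and C from the papers above. It assumes the approximate 20-tokens-per-parameter ratio and the common C ≈ 6·N·D estimate of training FLOPs; both are approximations, not exact fits, and the function names are made up.

```python
TOKENS_PER_PARAM = 20        # rough Chinchilla rule of thumb (assumed constant)
FLOPS_PER_PARAM_TOKEN = 6    # common C ~ 6 * N * D approximation

def compute_optimal_tokens(n_params):
    """Approximate compute-optimal training tokens D for N parameters."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute C in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

n = 70_000_000_000                 # N: a 70B-parameter model
d = compute_optimal_tokens(n)      # D: 1.4 trillion tokens
c = training_flops(n, d)           # C: about 5.88e23 FLOPs
```

The 70B / 1.4T pairing matches the model size and token count reported for Chinchilla itself.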

State-of-the-art techniques (slide 17)

FlashAttention (Dao et al. 2022) — The fused-kernel paper. The accompanying Tri Dao blog post is more digestible.

GQA paper (Ainslie et al. 2023) — Grouped-query attention. Short paper.

IBM — What is grouped-query attention? — Good plain-English intro to GQA vs MHA vs MQA.

Mixture of Experts Explained — Hugging Face blog — Best practical MoE explainer. Covers Mixtral, routing, load balancing.

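Speculative decoding paper (Leviathan et al. 2023) — A small draft model proposes several tokens and the large model verifies them in a single parallel pass.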
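
For the grouped-query attention entries above, the core idea fits in a few lines: several query heads share one key/value head. The sketch below shows the head mapping and the resulting KV-cache size; the helper names and the model dimensions are assumptions chosen for illustration, not numbers from the GQA paper.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_values(n_layers, n_kv_heads, head_dim, seq_len):
    """Number of cached scalars: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len

n_q = 8
mha = [kv_head_for(h, n_q, 8) for h in range(n_q)]  # one K/V head per query head
gqa = [kv_head_for(h, n_q, 2) for h in range(n_q)]  # four query heads per K/V head
mqa = [kv_head_for(h, n_q, 1) for h in range(n_q)]  # all query heads share one

# Hypothetical model: 32 layers, head_dim 128, 4096-token context.
cache_mha = kv_cache_values(32, 8, 128, 4096)
cache_gqa = kv_cache_values(32, 2, 128, 4096)
```

With these assumed sizes the GQA cache is a quarter of the MHA cache, which is the memory saving that motivates the technique. MHA and MQA fall out as the two extremes of the same mapping.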

Tokenization / BPE

Hugging Face NLP Course — Chapter 6: Tokenizers — Walks through BPE, WordPiece, and Unigram with runnable code.

Karpathy — Let's build the GPT Tokenizer — Companion 2-hour video to "Let's build GPT." Builds tiktoken-style BPE from scratch.
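
If you want the gist of BPE before committing to either resource, here is a minimal training-loop sketch: count adjacent symbol pairs, merge the most frequent one, repeat. It assumes character-level symbols and whitespace-split words; tiktoken-style tokenizers work on bytes and add pre-tokenization rules that are omitted here. The function names and toy corpus are made up.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # On ties, max returns the pair that was counted first.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to num_merges merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, vocab = train_bpe("low low low lower lowest", 2)
```

After two merges on this toy corpus the shared prefix "low" has become a single symbol, so "lower" is split as low + e + r.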

Going deeper / interpretability

Anthropic — Transformer Circuits — If you want to understand what's happening inside the weights once you grok the architecture. Start with "A Mathematical Framework for Transformer Circuits."

Neel Nanda — TransformerLens tutorials — Hands-on mechanistic interpretability library; great for "what does head 5 in layer 3 actually do?"

