Transformers Lecture Notes
- Large Language Models (LLMs) are essentially stacks of Transformer blocks.
- Transformer = a neural-network architecture for sequences that replaces the recurrence of RNN-based models with the “attention” operation.
- Core intuition: every token can dynamically look at ("attend to") every other token in its context and decide—by learned similarity scores—how much information to borrow.
Timeline of Language-Modeling Milestones
- 1990s – Static Word Embeddings
- One vector per word type; early count-based methods (e.g., LSA), later shallow-learned ones (word2vec, GloVe).
- 2003 – Neural Language Model
- Bengio et al. introduce end-to-end neural probability models for sequences.
- 2008 – Multi-Task Learning
- Jointly training on heterogeneous objectives.
- 2015 – Attention
- Bahdanau attention for sequence-to-sequence translation.
- 2017 – Transformer
- Vaswani et al. remove recurrence entirely, making attention the sole workhorse.
- 2018 – Contextual Word Embeddings & Pre-training
- ELMo, BERT: fine-tuning giant pretrained encoders.
- 2019 – Prompting
- Treat tasks as text completion; unleash zero-/few-shot capabilities.
Static vs Contextual Word Embeddings
- Problem with static embeddings
- One vector per word type ⇒ cannot capture polysemy.
- Example: “The chicken didn’t cross the road because it was too tired.”
• Static “it” vector has no clue what it refers to.
- Contextual Embeddings
- Each token obtains a context-specific vector.
- Obtained by Attention: integrate neighboring words with learned weights.
- Properties demonstrated with cloze examples (“it was too tired” vs “it was too wide”).
Attention Mechanism – Intuition
- Build a token’s contextual embedding by selectively integrating information from all other tokens.
- “Attending” = assigning higher weights to more relevant tokens.
- Visually: A column of layer $k$ interacts with all columns of layer $k$ to create the next-layer column.
- Weighted sum over vectors:
- Given previous-layer token vectors $x_1,\dots,x_N$, produce $a_i$ for position $i$.
- Left-to-right causal LM: positions $j>i$ are masked.
Single-Head Equations
- Project three role-specific versions of each vector:
- $q_i = x_i W_Q$ (Query)
- $k_i = x_i W_K$ (Key)
- $v_i = x_i W_V$ (Value)
- Each projection matrix has size $d \times d_k$ (or $d \times d_v$ for values).
- Similarity score (scaled dot-product): $\text{score}(i,j)=\dfrac{q_i\cdot k_j}{\sqrt{d_k}}$
- Softmax over allowed keys ⇒ attention weights: $\alpha_{ij}=\operatorname{softmax}_j\big(\text{score}(i,j)\big)$
- Output vector: $a_i=\sum_{j\le i} \alpha_{ij}\, v_j$
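The single-head equations above can be sketched directly in NumPy. This is a minimal illustration, not an optimized implementation; the function name, shapes, and random weights are assumptions for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def single_head_attention(X, W_Q, W_K, W_V):
    """Causal single-head attention over token vectors X (N x d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # shapes N x d_k, N x d_k, N x d_v
    d_k = K.shape[1]
    A = np.zeros((X.shape[0], V.shape[1]))
    for i in range(X.shape[0]):
        # score only the allowed keys j <= i (causal masking)
        scores = Q[i] @ K[: i + 1].T / np.sqrt(d_k)
        alpha = softmax(scores)            # attention weights alpha_{ij}
        A[i] = alpha @ V[: i + 1]          # weighted sum of values
    return A
```

Note that position 0 can only attend to itself, so its output is exactly its own value vector.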
Step-by-step Example (computing $a_3$)
- Compute $q_3$, $k_{1:3}$, $v_{1:3}$.
- Dot-product $q_3 \cdot k_j$ for $j = 1..3$.
- Divide each score by $\sqrt{d_k}$.
- Apply softmax to obtain weights $\alpha_{3j}$.
- Multiply each $v_j$ by $\alpha_{3j}$.
- Sum to obtain $a_3$.
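These steps can be traced with concrete numbers. The 2-dimensional query/key/value vectors below are made up for illustration; only the procedure follows the lecture.

```python
import numpy as np

# Toy step-by-step computation of a_3 (index 2, zero-based).
q3 = np.array([1.0, 0.5])        # query for position 3 (illustrative values)
K = np.array([[0.2, 0.1],        # k_1
              [1.0, 0.3],        # k_2
              [0.4, 0.9]])       # k_3
V = np.array([[1.0, 0.0],        # v_1
              [0.0, 1.0],        # v_2
              [0.5, 0.5]])       # v_3
d_k = 2

scores = K @ q3 / np.sqrt(d_k)                 # q_3 . k_j / sqrt(d_k), j = 1..3
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax -> weights alpha_{3j}
a3 = alpha @ V                                 # sum_j alpha_{3j} v_j
```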
Multi-Head Attention
- Instead of one set $(W_Q, W_K, W_V)$, use $h$ independent sets ⇒ $h$ heads.
- Each head learns to focus on different relations: syntax, coreference, negation, positional offsets, etc.
- Pipeline per head $t$: $\text{head}_t = \operatorname{Attention}(X W_Q^{(t)}, X W_K^{(t)}, X W_V^{(t)})$
- Concatenate heads and apply output projection $W_O$: $\text{MHA}(X)=\big[\text{head}_1;\dots;\text{head}_h\big]\, W_O$
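A minimal NumPy sketch of this pipeline, assuming the per-head projections are passed in as lists (names and shapes are illustrative):

```python
import numpy as np

def softmax_rows(S):
    """Row-wise numerically stable softmax; -inf entries become weight 0."""
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, WQ, WK, WV, WO):
    """WQ/WK/WV: lists of h per-head projections; WO: (h*d_v) x d output map."""
    N = X.shape[0]
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)  # causal: forbid j > i
    heads = []
    for Wq, Wk, Wv in zip(WQ, WK, WV):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        S = Q @ K.T / np.sqrt(K.shape[1])   # scaled dot-product scores
        S[mask] = -np.inf                    # apply the causal mask
        heads.append(softmax_rows(S) @ V)    # head_t, shape N x d_v
    return np.concatenate(heads, axis=1) @ WO  # [head_1; ...; head_h] W_O
```

Because of the mask, perturbing a later token never changes the output at an earlier position.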
The Transformer Block
- One residual stream per token propagates up the stack.
- Each block performs (with residual connections):
- LayerNorm: $t_1 = \operatorname{LayerNorm}(x_i)$
- Multi-Head Attention: $t_2 = \text{MHA}(t_1)$
- Residual add: $t_2 = x_i + t_2$
- LayerNorm: $t_3 = \operatorname{LayerNorm}(t_2)$
- Feed-Forward Network: $t_4 = \text{FFN}(t_3)$
- Residual add: $x_i^{\text{next}} = t_2 + t_4$
- FFN: two linear layers with a non-linearity (ReLU or GELU): $\operatorname{FFN}(z)=\max(0,\, zW_1+b_1)W_2+b_2$
- LayerNorm: z-score over the features of one vector.
- Stack $L$ blocks ⇒ same dimensionality d throughout.
- Every component except attention only touches its own residual stream.
- Attention literally moves information between streams (Elhage et al., 2021).
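The block's sub-layer sequence can be sketched as follows. This is a pre-LN sketch under the stated assumptions (LayerNorm gain/bias omitted, attention passed in as a function); it is not the lecture's reference implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """z-score over the features of each vector (gain/bias omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def transformer_block(X, attn, W1, b1, W2, b2):
    """One block: LayerNorm -> attention -> residual -> LayerNorm -> FFN -> residual.

    attn is any shape-preserving function (e.g. multi-head attention).
    """
    t2 = X + attn(layer_norm(X))                             # attention sub-layer + residual
    ffn = np.maximum(0, layer_norm(t2) @ W1 + b1) @ W2 + b2  # two-layer ReLU FFN
    return t2 + ffn                                          # second residual add
```

Note the output keeps the same dimensionality $d$ as the input, which is what allows $L$ blocks to be stacked.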
Parallelizing Attention Computation
- Pack tokens into matrix X\in\mathbb{R}^{N\times d} (rows = tokens).
- Compute all queries, keys, values in one matmul:
$Q = X W_Q,\; K = X W_K,\; V = X W_V$ (shapes $N\times d_k$, etc.)
- Score matrix via batched dot-product: $S = Q K^T$ (shape $N\times N$).
- Apply scaling, causal mask (upper triangle set to $-\infty$), softmax row-wise ⇒ $A$.
- Output matrix: $O = A V$ (shape $N\times d_v$).
- Quadratic cost \mathcal{O}(N^2) in sequence length ⇒ motivation for efficient/long-range variants.
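The matrix formulation above fits in a few NumPy lines; the explicit $N \times N$ score matrix is exactly where the quadratic cost appears (function name is an assumption for the sketch):

```python
import numpy as np

def causal_attention(Q, K, V):
    """All-positions attention in three matmuls (O(N^2) score matrix)."""
    N, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)                                   # N x N scores
    S = np.where(np.triu(np.ones((N, N), bool), 1), -np.inf, S)  # mask j > i
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)                         # row-wise softmax
    return A @ V                                                 # N x d_v output
```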
Token & Position Embeddings
- Initial input matrix X is sum of:
- Token embedding from matrix E\in\mathbb{R}^{|V|\times d}.
- Positional embedding E_{\text{pos}}\in\mathbb{R}^{N\times d} (learned absolute positions in this lecture).
- Example workflow (BPE-tokenized string "Thanks for all the"):
- Token indices [5,4000,10532,2224].
- Lookup rows in E.
- Add corresponding positional rows [0,1,2,3].
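The workflow above amounts to two table lookups and an add. The vocabulary size, dimensionality, and random tables below are illustrative assumptions; only the token ids come from the lecture's example.

```python
import numpy as np

vocab_size, d, N = 50000, 16, 4              # illustrative sizes, not the lecture's
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d))         # token embedding matrix E
E_pos = rng.normal(size=(N, d))              # learned absolute position embeddings

tokens = np.array([5, 4000, 10532, 2224])    # BPE ids for "Thanks for all the"
X = E[tokens] + E_pos[np.arange(len(tokens))]  # row lookup + positional add
```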
Language Modeling Head
- After $L$ blocks, final hidden matrix H\in\mathbb{R}^{N\times d}.
- Unembedding layer: linear map back to vocabulary logits using tied weights $E^T$: $U = H E^T$ (shape $N\times |V|$).
- Softmax over each row produces the predictive distribution $P(w_{t+1}\mid\text{context})$: $P_{ti}=\dfrac{e^{U_{ti}}}{\sum_{j} e^{U_{tj}}}$
- “Weight tying” constrains parameters and empirically improves generalization.
- Pipeline summary for one training step:
- Tokenize input sequence (max length N).
- Form X = E[\text{tokens}] + E_{\text{pos}}[0:N].
- Pass through stacked Transformer blocks (with masking).
- Compute logits via unembedding; softmax gives probabilities.
- Compute cross-entropy loss vs. true next-token labels; back-propagate to update all parameters $\{W_Q^{(t)}, W_K^{(t)}, \dots, E, E_{\text{pos}}\}$.
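The unembedding-plus-loss step of this pipeline can be sketched as one function (a simplified forward-only sketch; the name and shapes are assumptions, and gradients are left to the framework in practice):

```python
import numpy as np

def lm_head_loss(H, E, targets):
    """Tied unembedding + cross-entropy against next-token labels.

    H: N x d final hidden states; E: |V| x d embedding matrix (tied weights);
    targets: length-N array of true next-token ids.
    """
    U = H @ E.T                                           # N x |V| logits
    U = U - U.max(axis=1, keepdims=True)                  # stabilize softmax
    P = np.exp(U) / np.exp(U).sum(axis=1, keepdims=True)  # row-wise softmax
    return -np.log(P[np.arange(len(targets)), targets]).mean()  # cross-entropy
```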
Practical, Philosophical & Ethical Notes
- Attention’s context integration yields interpretability hooks (e.g., visualize heads for syntax or coreference).
- Quadratic complexity poses environmental cost ⇒ active research on linear/efficient attention.
- Weight tying re-uses parameters, shrinking model size and aligning the embedding and prediction spaces, so each word is represented consistently at input and output.
- Contextual embeddings revolutionize downstream NLP by providing task-agnostic meaning representations fine-tuned with minimal labeled data.
Key Takeaways
- Transformers convert symbolic sequences into rich context-dependent vectors through stacked blocks of LayerNorm → Multi-Head Attention → FFN with residuals.
- Attention = scaled dot-product softmax weighted-sum; multi-head instantiation learns diverse relational patterns.
- Parallel matrix formulation enables GPU/TPU efficiency but is \mathcal{O}(N^2).
- Additive token + positional embeddings encode both lexical identity and order.
- The language-model head “unembeds” the final hidden states to vocab logits, closing the auto-regressive loop.