Transformers Lecture Notes

Transformers: Introduction

  • Large Language Models (LLMs) are essentially stacks of Transformer blocks.
  • Transformer = a neural-network architecture for sequences that replaces recurrence with the “attention” operation, interleaving attention layers with feed-forward layers.
  • Core intuition: every token can dynamically look at ("attend to") every other token in its context and decide—by learned similarity scores—how much information to borrow.

Timeline of Language-Modeling Milestones

  • 1990 – Static Word Embeddings
    • One vector per word type, count-based or shallow-learned; later popularized by word2vec (2013) and GloVe (2014).
  • 2003 – Neural Language Model
    • Bengio et al. introduce end-to-end neural probability models for sequences.
  • 2008 – Multi-Task Learning
    • Jointly training on heterogeneous objectives.
  • 2015 – Attention
    • Bahdanau attention for sequence-to-sequence translation.
  • 2017 – Transformer
    • Vaswani et al. remove recurrence entirely, making attention the sole sequence-mixing operation.
  • 2018 – Contextual Word Embeddings & Pre-training
    • ELMo, BERT: fine-tuning giant pretrained encoders.
  • 2019 – Prompting
    • Treat tasks as text completion; unleash zero-/few-shot capabilities.

Static vs Contextual Word Embeddings

  • Problem with static embeddings
    • One vector per word type ⇒ cannot capture polysemy.
    • Example: “The chicken didn’t cross the road because it was too tired.”
      • Static “it” vector has no clue what it refers to.
  • Contextual Embeddings
    • Each token obtains a context-specific vector.
    • Obtained by Attention: integrate neighboring words with learned weights.
    • Properties demonstrated with cloze examples (“it was too tired” vs “it was too wide”).

Attention Mechanism – Intuition

  • Build a token’s contextual embedding by selectively integrating information from all other tokens.
  • “Attending” = assigning higher weights to more relevant tokens.
  • Visually: A column of layer $k$ interacts with all columns of layer $k$ to create the next-layer column.

Attention Formal Definition

  • Weighted sum over vectors:
    • Given previous-layer token vectors x_1,\dots,x_N, produce a_i for position $i$.
    • Left-to-right causal LM: positions j>i are masked.

Single-Head Equations

  • Project three role-specific versions of each vector:
    • q_i = x_i W_Q (Query)
    • k_i = x_i W_K (Key)
    • v_i = x_i W_V (Value)
    • Each projection matrix has size d\times d_k (or d\times d_v for values).
  • Similarity score (scaled dot-product):
    \text{score}(i,j)=\frac{q_i\cdot k_j}{\sqrt{d_k}}
  • Softmax over the allowed keys (j\le i) ⇒ attention weights:
    \alpha_{ij}=\operatorname{softmax}_{j\le i}\big(\text{score}(i,j)\big)
  • Output vector:
    a_i=\sum_{j\le i} \alpha_{ij}\, v_j

Step-by-step Example (computing a_3)

  1. Compute q_3, k_{1:3}, v_{1:3}.
  2. Dot-product q_3\cdot k_j for $j=1..3$.
  3. Divide each score by \sqrt{d_k}.
  4. Apply softmax to obtain weights $\alpha_{3j}$.
  5. Multiply each v_j by \alpha_{3j}.
  6. Sum to obtain a_3.
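The six steps above can be traced directly in NumPy. A minimal sketch, assuming random toy dimensions; the variable names and matrices are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_k = 6, 3
X = rng.standard_normal((3, d))                 # rows are x_1, x_2, x_3
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

# 1. Project q_3 and all keys/values k_{1:3}, v_{1:3}.
q3 = X[2] @ W_Q
K = X @ W_K
V = X @ W_V
# 2.-3. Dot products q_3 . k_j, divided by sqrt(d_k).
scores = K @ q3 / np.sqrt(d_k)
# 4. Softmax turns scores into weights alpha_{3j}.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
# 5.-6. Weight each v_j and sum to obtain a_3.
a3 = alpha @ V
print(a3.shape)   # (3,)
```

The `scores.max()` subtraction is the standard numerical-stability trick for softmax; it leaves the weights unchanged.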

Multi-Head Attention

  • Instead of one set $(W_Q, W_K, W_V)$, use $h$ independent sets ⇒ $h$ heads.
  • Each head learns to focus on different relations: syntax, coreference, negation, positional offsets, etc.
  • Pipeline per head $t$:
    \text{head}_t = \operatorname{Attention}(X W_Q^{(t)}, X W_K^{(t)}, X W_V^{(t)})
  • Concatenate heads and apply output projection W_O: \text{MHA}(X)=\big[\text{head}_1;\dots;\text{head}_h\big] W_O
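A minimal NumPy sketch of causal multi-head attention, looping over heads for clarity (the function name `mha` and the toy sizes are our own, not from the lecture):

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over the last axis."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mha(X, WQ, WK, WV, WO):
    """Causal multi-head attention: WQ/WK/WV shape (h, d, d_k), WO shape (h*d_k, d)."""
    h, d, d_k = WQ.shape
    N = X.shape[0]
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)  # hide keys with j > i
    heads = []
    for t in range(h):                                # one pass per head t
        Q, K, V = X @ WQ[t], X @ WK[t], X @ WV[t]
        S = Q @ K.T / np.sqrt(d_k)                    # scaled scores
        S[mask] = -np.inf                             # causal mask
        heads.append(softmax(S) @ V)                  # head_t, shape (N, d_k)
    return np.concatenate(heads, axis=-1) @ WO        # [head_1;...;head_h] W_O

rng = np.random.default_rng(2)
h, d, d_k, N = 2, 8, 4, 5
X = rng.standard_normal((N, d))
WQ, WK, WV = (rng.standard_normal((h, d, d_k)) for _ in range(3))
WO = rng.standard_normal((h * d_k, d))
out = mha(X, WQ, WK, WV, WO)
print(out.shape)  # (5, 8)
```

Because of the mask, perturbing the last token cannot change the outputs at earlier positions, which is exactly the left-to-right causal property.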

Transformer Block Architecture

  • One residual stream per token propagates up the stack.
  • Each block performs (with residual connections):
    1. LayerNorm t_1 = \operatorname{LayerNorm}(x_i)
    2. Multi-Head Attention t_2 = \text{MHA}(t_1)
    3. Residual add t_2 = x_i + t_2
    4. LayerNorm t_3 = \operatorname{LayerNorm}(t_2)
    5. Feed-Forward Network t_4 = \text{FFN}(t_3)
    6. Residual add x_i^{\text{next}} = t_2 + t_4
  • FFN: two linear layers with a non-linearity (ReLU or GELU); with ReLU:
    \operatorname{FFN}(z)=\max(0, zW_1+b_1)W_2+b_2
  • LayerNorm: z-score over features of one vector.
  • Stack $L$ blocks ⇒ same dimensionality d throughout.
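The six-step pre-norm block can be sketched in NumPy as follows. To keep the example short, the attention sub-layer is a placeholder linear map standing in for MHA (any function mapping (N, d) to (N, d) works); the helper names are ours:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """z-score each token vector over its features."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(z, W1, b1, W2, b2):
    """Two linear layers with ReLU: max(0, zW1+b1)W2+b2."""
    return np.maximum(0.0, z @ W1 + b1) @ W2 + b2

def block(X, attn, ffn_params):
    t1 = layer_norm(X)           # 1. LayerNorm
    t2 = X + attn(t1)            # 2.-3. attention sub-layer + residual
    t3 = layer_norm(t2)          # 4. LayerNorm
    t4 = ffn(t3, *ffn_params)    # 5. FFN
    return t2 + t4               # 6. residual -> x_i^next

rng = np.random.default_rng(3)
N, d, d_ff = 4, 8, 16
X = rng.standard_normal((N, d))
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
W_attn = rng.standard_normal((d, d))     # placeholder for real MHA
out = block(X, lambda t: t @ W_attn, (W1, b1, W2, b2))
print(out.shape)  # (4, 8)
```

Note that input and output both have shape (N, d), which is what lets $L$ such blocks stack with the same dimensionality throughout.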

Information-Flow Perspective

  • Every component except attention only touches its own residual stream.
  • Attention literally moves information between streams (Elhage et al., 2021).

Parallelizing Attention Computation

  • Pack tokens into matrix X\in\mathbb{R}^{N\times d} (rows = tokens).
  • Compute all queries, keys, values in one matmul:
    Q = X W_Q,\;K = X W_K,\;V = X W_V (shapes N\times d_k, etc.)
  • Score matrix via batched dot-product:
    S = Q K^T (shape N\times N).
  • Apply scaling, causal mask (upper triangle set to -\infty), softmax row-wise ⇒ A.
  • Output matrix: O = A V (shape N\times d_v).
  • Quadratic cost \mathcal{O}(N^2) in sequence length ⇒ motivation for efficient/long-range variants.
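The matrix pipeline above translates almost line for line into NumPy. A minimal single-head sketch with illustrative toy sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, d_k = 5, 8, 4
X = rng.standard_normal((N, d))             # rows = tokens
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # all positions in one matmul each
S = Q @ K.T / np.sqrt(d_k)                  # (N, N) scaled score matrix
S[np.triu_indices(N, k=1)] = -np.inf        # causal mask: upper triangle
A = np.exp(S - S.max(1, keepdims=True))
A /= A.sum(1, keepdims=True)                # row-wise softmax -> weights A
O = A @ V                                   # (N, d_v) output matrix
print(O.shape)  # (5, 4)
```

The (N, N) score matrix S is where the quadratic cost lives: both its memory and the two matmuls that produce and consume it grow as N².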

Token & Position Embeddings

  • Initial input matrix X is sum of:
    • Token embedding from matrix E\in\mathbb{R}^{|V|\times d}.
    • Positional embedding E_{\text{pos}}\in\mathbb{R}^{N\times d} (learned absolute positions in this lecture).
  • Example workflow (BPE-tokenized string "Thanks for all the"):
    1. Token indices [5,4000,10532,2224].
    2. Lookup rows in E.
    3. Add corresponding positional rows [0,1,2,3].
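The three-step workflow is just two row lookups and an addition. A sketch with random stand-in matrices (the real E and E_pos are learned; |V| here is chosen only so the example indices are in range):

```python
import numpy as np

rng = np.random.default_rng(5)
V_size, N_max, d = 11000, 512, 8
E = rng.standard_normal((V_size, d))        # token embedding matrix
E_pos = rng.standard_normal((N_max, d))     # learned absolute positions

tokens = [5, 4000, 10532, 2224]             # ids for "Thanks for all the"
X = E[tokens] + E_pos[:len(tokens)]         # row lookups + positions 0..3
print(X.shape)  # (4, 8)
```

Each row of X is the sum of a token's identity vector and its position's vector, so the same word at two positions gets two different inputs.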

Language Modeling Head

  • After $L$ blocks, final hidden matrix H\in\mathbb{R}^{N\times d}.
  • Unembedding layer: linear map back to vocabulary logits using tied weights E^T:
    U = H E^T (shape N\times |V|).
  • Softmax over each row produces the predictive distribution P(w_{t+1}\mid\text{context}): P_{ti}=\frac{e^{U_{ti}}}{\sum_{j} e^{U_{tj}}}.
  • “Weight tying” constrains parameters and empirically improves generalization.
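With tied weights the head needs no new parameters at all, just the transpose of E. A minimal sketch with random stand-ins for H and E:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, V_size = 4, 8, 50
H = rng.standard_normal((N, d))             # final hidden states after L blocks
E = rng.standard_normal((V_size, d))        # tied token-embedding matrix

U = H @ E.T                                 # unembed: (N, |V|) logits
P = np.exp(U - U.max(1, keepdims=True))
P /= P.sum(1, keepdims=True)                # row-wise softmax: P_{ti}
print(P.shape)  # (4, 50)
```

Row t of P is the model's distribution over the next token after position t, which is what the cross-entropy loss scores during training.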

The Final Transformer Language Model

  • Pipeline summary for one training step:
    1. Tokenize input sequence (max length N).
    2. Form X = E[\text{tokens}] + E_{\text{pos}}[0:N].
    3. Pass through stacked Transformer blocks (with masking).
    4. Compute logits via unembedding; softmax gives probabilities.
    5. Compute cross-entropy loss vs. true next-token labels; back-propagate to update all parameters \{W_Q^{(t)}, W_K^{(t)}, \dots, E, E_{\text{pos}}\}.
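Step 5's loss is just the mean negative log-probability the model assigned to the true next tokens. A self-contained sketch on a hand-made toy distribution (the 4-word vocabulary and targets are invented for illustration):

```python
import numpy as np

def cross_entropy(P, targets):
    """Mean negative log-probability of the true next tokens."""
    return -np.mean(np.log(P[np.arange(len(targets)), targets]))

# toy predictive distributions over a 4-word vocab at 2 positions
P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
targets = np.array([0, 3])        # true next-token ids
loss = cross_entropy(P, targets)
print(round(loss, 4))  # 0.8715
```

The second row is uniform, so it contributes exactly log 4 ≈ 1.386 of loss regardless of the target; confident correct predictions (row one) contribute much less.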

Practical, Philosophical & Ethical Notes

  • Attention’s context integration yields interpretability hooks (e.g., visualize heads for syntax or coreference).
  • Quadratic complexity poses environmental cost ⇒ active research on linear/efficient attention.
  • Weight tying re-uses the embedding matrix as the output projection, shrinking model size and aligning the embedding and prediction spaces; empirically it often improves generalization.
  • Contextual embeddings revolutionize downstream NLP by providing task-agnostic meaning representations fine-tuned with minimal labeled data.

Key Takeaways

  • Transformers convert symbolic sequences into rich context-dependent vectors through stacked blocks of LayerNorm → Multi-Head Attention → FFN with residuals.
  • Attention = scaled dot-product softmax weighted-sum; multi-head instantiation learns diverse relational patterns.
  • Parallel matrix formulation enables GPU/TPU efficiency but is \mathcal{O}(N^2).
  • Additive token + positional embeddings encode both lexical identity and order.
  • The language-model head “unembeds” the final hidden states to vocab logits, closing the auto-regressive loop.