Transformers – Comprehensive Lecture 1 Notes

Module 1.1 – Limitations of Sequential (RNN) Models

  • Seq2Seq with RNN encoder–decoder
    • Encoder produces hidden states h_0,\dots,h_T for source sentence (e.g. “I enjoyed the movie transformers”).
    • Final state h_T (a.k.a. concept/thought/context vector) passed to decoder as s_0 = h_T.
    • Decoder generates target sentence step-by-step using its own hidden states s_1,\dots.
  • Problems
    • Bottleneck: single vector h_T must encode all information, ignoring direct word alignment.
    • Computation is inherently sequential → cannot parallelize across time steps during training.
    • Susceptible to vanishing / exploding gradients.

Attention Mechanism Refresher (RNN context)

  • Idea: provide decoder direct access to every encoder hidden state \{h_i\} instead of only h_T.
  • Context vector for decoder step t: c_t = \sum_{i=1}^{n} \alpha_{ti}\, h_i (a minimal numpy sketch follows this list).
    • Alignment scores
      \alpha_{ti}=\text{align}(y_t,h_i)=\frac{\exp(\text{score}(s_{t-1},h_i))}{\sum_{i'}\exp(\text{score}(s_{t-1},h_{i'}))}
    • Typical score function (Bahdanau):
      \text{score}(s,h)=v_a^\top\,\tanh(U_{att}\,s + W_{att}\,h).
  • Benefit: better translation via word-to-word correspondence.
  • Still sequential: s_{t-1} required before we can compute row t of \alpha → no full parallelization.
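
  • A minimal numpy sketch of the alignment computation above, for a single decoder step; the matrix names (U_att, W_att, v_a) and all dimensions are illustrative toy choices, not the lecture's exact setup:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def bahdanau_context(s_prev, H, U_att, W_att, v_a):
    """Context vector c_t for one decoder step.

    s_prev : (d_dec,)    previous decoder state s_{t-1}
    H      : (T, d_enc)  encoder hidden states h_1..h_T stored as rows
    """
    # score(s_{t-1}, h_i) = v_a^T tanh(U_att s_{t-1} + W_att h_i), for every i at once
    scores = np.tanh(H @ W_att.T + U_att @ s_prev) @ v_a   # (T,)
    alpha = softmax(scores)                                  # alignment weights, sum to 1
    return alpha @ H                                         # c_t = sum_i alpha_ti h_i

# toy dimensions, illustrative only
T, d_enc, d_dec, d_att = 5, 8, 8, 16
rng = np.random.default_rng(0)
c_t = bahdanau_context(rng.normal(size=d_dec),
                       rng.normal(size=(T, d_enc)),
                       rng.normal(size=(d_att, d_dec)),
                       rng.normal(size=(d_att, d_enc)),
                       rng.normal(size=d_att))
```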

Desire for New Architecture

  • Maintain attention’s alignment strengths.
  • Enable full parallel computation across sentence positions (eliminate sequential recurrence).
  • Address gradient problems.

Transition to Transformers – “Attention Is All You Need”

  • Replace RNNs with stacked attention + feed-forward blocks in both encoder & decoder.
  • Key primitives:
    1. Self-Attention (within encoder or decoder).
    2. Encoder–Decoder (Cross) Attention (decoder queries encoder outputs).
    3. Position-wise Feed-Forward Networks (FFN).

Self-Attention Mechanics

  • Objective: for every input token, produce representation that aggregates information from all tokens weighted by contextual relevance.
  • Inputs & linear projections
    • Word embeddings H=[h_1, \dots , h_T] \in \mathbb R^{d_{model}\times T} (here d_{model}=512).
    • Three learned matrices W_Q, W_K, W_V \in \mathbb R^{d_k\times d_{model}} (with d_k=d_q=d_v=64 in the original base model).
    • Compute matrices
      Q=W_Q H, \; K=W_K H, \; V=W_V H, where Q,K,V \in \mathbb R^{d_k\times T}.
  • Scaled Dot-Product Attention (vectorized): Z = \text{softmax}\!\left(\frac{Q^\top K}{\sqrt{d_k}}\right) V^\top \quad (Z \in \mathbb R^{T\times d_v}) (see the numpy sketch after this list).
    • Q^\top K → T\times T matrix of all pairwise scores q_i \cdot k_j.
    • Division by \sqrt{d_k} prevents large dot products causing small gradients (stabilizes softmax).
    • Softmax along keys axis ensures each row sums to 1.
  • Parallelism: all rows computed in one matrix operation – no time-step dependency.
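
  • A small numpy sketch of scaled dot-product attention as described above; for readability the tokens are stored as rows, so the formula reads softmax(Q K^T / sqrt(d_k)) V, the transpose of the column convention used in the notes. All sizes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (T, d_k); V: (T, d_v); tokens are rows (transposed vs. the notes).

    Returns Z: (T, d_v), one contextualized vector per token.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (T, T) pairwise q_i . k_j, scaled
    if mask is not None:
        scores = scores + mask                         # additive mask of 0 / -inf entries
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # each row sums to 1
    return weights @ V                                 # weighted sum of value vectors

# toy example: T=4 tokens, d_model=512 projected down to d_k=d_v=64
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 512))
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.02 for _ in range(3))
Z = scaled_dot_product_attention(H @ W_Q, H @ W_K, H @ W_V)   # (4, 64)
```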

Multi-Head Self-Attention

  • Motivation: analogous to multiple convolutional kernels → capture diverse relation sub-spaces.
  • Procedure for each head i
    \text{head}_i = \text{Attention}(Q_i,K_i,V_i) with separate W_{Q_i},W_{K_i},W_{V_i}.
  • Concatenation & output projection
    \text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)\,W_O, where W_O \in \mathbb R^{h d_v\times d_{model}} (h=8 in the base model → concat dimension 512; a sketch follows this list).
  • Learning is fully parallel across heads.
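
  • A sketch of the multi-head wrapper described above; the per-head projection matrices are kept in plain Python lists for clarity (a real implementation would batch them into one tensor), and the inner attention is the same scaled dot-product routine as in the previous sketch:

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention (tokens as rows), as in the previous sketch
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(H, W_Q, W_K, W_V, W_O):
    """H: (T, d_model); W_Q/W_K/W_V: lists holding one (d_model, d_k) matrix per head."""
    heads = [attention(H @ wq, H @ wk, H @ wv)          # each head: (T, d_v)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O          # (T, h*d_v) @ (h*d_v, d_model)

# base-model sizes: d_model=512, h=8, d_k=d_v=64
rng = np.random.default_rng(0)
d_model, h, d_k, T = 512, 8, 64, 4
H = rng.normal(size=(T, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model)) * 0.02
out = multi_head_attention(H, W_Q, W_K, W_V, W_O)        # (4, 512)
```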

Encoder Layer Structure

  • Two sublayers per layer (stack of N=6):
    1. Multi-Head Self-Attention.
    2. Position-wise FFN: \text{FFN}(z)=W_2\,\text{ReLU}(W_1 z + b_1)+b_2 with W_1\in\mathbb R^{2048\times512},\ W_2\in\mathbb R^{512\times2048}.
  • Residual connection plus LayerNorm after each sublayer.
  • Parameter count (per layer, h=8, biases ignored; worked out in the sketch below):
    • Q, K, V projections: 3\times (8\cdot64\times512) \approx 7.9\times10^5 (\approx 2.6\times10^5 each).
    • Output W_O: 512\times512 \approx 2.6\times10^5.
    • FFN: 2\times512\times2048 \approx 2.1\times10^6.
    • \approx 3.1\times10^6 total per layer → \approx 19\;\text{M} for the 6-layer encoder stack.
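
  • A quick arithmetic check of the per-layer counts above, assuming the base-model sizes and ignoring biases and LayerNorm parameters:

```python
d_model, h, d_k, d_ff, N = 512, 8, 64, 2048, 6

qkv  = 3 * (h * d_k) * d_model      # Q, K, V projections: 3 * 262,144 = 786,432
w_o  = (h * d_k) * d_model          # output projection W_O:        262,144
ffn  = 2 * d_model * d_ff           # W_1 and W_2:                2,097,152
per_layer = qkv + w_o + ffn         # ~3.1M per encoder layer
print(per_layer, N * per_layer)     # 3,145,728 and ~18.9M for the 6-layer stack
```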

Decoder Layer Structure

  • Three sublayers (stack of 6):
    1. Masked Multi-Head Self-Attention (prevents using future tokens during training).
    2. Multi-Head Cross-Attention where Q comes from decoder self-att output; K,V come from encoder outputs E.
    3. Position-wise FFN.
  • Same residual + LayerNorm after each.
  • Teacher Forcing & Masking
    • During training, ground-truth target tokens shifted right (start-of-sequence token prepended) are fed as inputs.
    • Mask matrix M is upper-triangular with -\infty above the diagonal → softmax assigns zero weight to future positions (see the sketch below).
  • Parameter cost per decoder layer: ~4 M (≈2 M FFN + 1 M masked self-attn + 1 M cross-attn).
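
  • A minimal sketch of the causal (look-ahead) mask described above; it is added to the T×T score matrix before the softmax, e.g. as the mask argument of the attention sketch earlier:

```python
import numpy as np

def causal_mask(T):
    """Additive mask: 0 on/below the diagonal, -inf strictly above it."""
    return np.triu(np.full((T, T), -np.inf), k=1)

print(causal_mask(4))
# Adding this to the score matrix before the softmax gives weight 0 to all
# future positions, so row t only attends to tokens 0..t.
```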

Positional Encoding

  • Self-attention is permutation-invariant; need token order info.
  • Sinusoidal encoding (fixed, no new parameters): \text{PE}(j,2i)=\sin\!\left(\frac{j}{10000^{2i/d_{model}}}\right),\qquad \text{PE}(j,2i+1)=\cos\!\left(\frac{j}{10000^{2i/d_{model}}}\right)
    • j = position (0…T−1); 2i and 2i+1 index the even/odd embedding dimensions (0…d_{model}−1).
    • Produces unique, smooth patterns whose relationship between two positions depends only on their offset, enabling the model to learn relative distances.
  • Added element-wise to word embeddings.
    h_j' = h_j + p_j (a numpy sketch follows).
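
  • A numpy sketch of the sinusoidal encoding above, producing one row per position that is added to the corresponding word embedding (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(T, d_model):
    """PE matrix of shape (T, d_model); row j is added to the embedding at position j."""
    positions = np.arange(T)[:, None]                       # (T, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # j / 10000^(2i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)     # even dimensions
    pe[:, 1::2] = np.cos(angles)     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(T=10, d_model=512)
# h_prime = word_embeddings + pe    # element-wise add, no learned parameters
```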

Normalization & Residuals

  • Layer Normalization preferred over BatchNorm in NLP:
    \mu = \frac1H \sum_{i=1}^H x_i,\quad \sigma^2 = \frac1H \sum_{i=1}^H (x_i-\mu)^2,\qquad \hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}},\qquad y_i = \gamma\,\hat{x}_i + \beta.
  • Applied after each sublayer alongside residual add.
  • Statistics are computed per token over its own features (not across the batch), ensuring stable gradients and allowing batch size 1 during inference (sketch below).
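
  • A minimal LayerNorm sketch matching the formulas above, normalizing each token vector over its own feature dimension:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (T, d_model); gamma, beta: (d_model,) learned scale and shift.

    Each token vector is normalized over its own features, never over the batch.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# residual connection + LayerNorm around a sublayer, as in every encoder/decoder sublayer:
# y = layer_norm(x + sublayer(x), gamma, beta)
```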

Training Details – Learning-Rate Warm-Up

  • Standard decaying LR \eta \propto \text{step}^{-0.5} converges slowly early on.
  • A purely growing LR (\eta \propto \text{step}) diverges later in training.
  • Combined schedule (warm-up): \eta = d_{model}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\; \text{step}\;\cdot\text{warmup}^{-1.5}\right)
    • WarmupSteps = 4000 in original paper.
    • The combined schedule increases linearly during warm-up, then follows inverse-sqrt decay, staying at or above the simple decay curve after warm-up (see the sketch below).
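
  • The warm-up schedule above as a small Python function, using the base-model defaults (d_model=512, warmup=4000):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Warm-up schedule: linear increase up to `warmup`, then inverse-sqrt decay."""
    step = max(step, 1)                    # avoid step**-0.5 diverging at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# peak learning rate is reached at step == warmup (~7e-4 for the defaults above)
print(transformer_lr(1), transformer_lr(4000), transformer_lr(40000))
```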

Output Generation & Parameters

  • Top decoder layer outputs S \in \mathbb R^{T\times512}.
  • Linear projection W_D \in \mathbb R^{512\times|V|} followed by softmax → word distribution.
  • Vocabulary |V|\approx37000 ⇒ \approx19\,\text{M} parameters, dominant share of overall ~65 M.
  • Inference is autoregressive: previous decoded word fed back; teacher forcing not used.
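
  • A sketch of greedy autoregressive decoding as described above; decoder_step, sos_id and eos_id are hypothetical placeholders standing in for the full decoder stack, the 512×|V| output projection and the vocabulary, not part of the lecture:

```python
import numpy as np

def greedy_decode(encoder_outputs, decoder_step, sos_id, eos_id, max_len=50):
    """Autoregressive inference sketch (no teacher forcing).

    decoder_step(tokens, encoder_outputs) is a placeholder that returns next-token
    logits over the vocabulary, given everything decoded so far.
    """
    tokens = [sos_id]
    for _ in range(max_len):
        logits = decoder_step(np.array(tokens), encoder_outputs)   # (|V|,)
        next_token = int(np.argmax(logits))    # greedy pick of the most probable word
        tokens.append(next_token)              # fed back in at the next step
        if next_token == eos_id:
            break
    return tokens
```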

Summary of Transformer Advantages

  • Full parallelization across sequence positions during training (both encoder & masked decoder self-attn computed with matrix ops).
  • Multi-head attention captures diverse contextual relations.
  • Residuals + LayerNorm → deep (30 attention/FFN sublayers across the encoder and decoder stacks) yet trainable network.
  • Sinusoidal positional encodings can generalize to sequence lengths beyond those seen in training, without adding learned embeddings.
  • Warm-up learning-rate schedule accelerates convergence.

Ethical / Practical Notes

  • Parameter counts (65 M) imply high compute & energy cost; scaling further magnifies this.
  • Attention visualisations (Colab demos) reveal which words influence each other → interpretability aid.
  • Masking strategy prevents information leakage during training yet allows efficient teacher forcing.

Connections & Real-World Relevance

  • CNN analogy: multi-head ≈ multiple kernels; FFN ≈ pointwise conv.
  • Compared to RNN: eliminates sequential dependency, solves vanishing-gradient problem, achieves state-of-the-art in machine translation and many NLP tasks.
  • Positional encoding ideas now reused in vision (ViT) & audio transformers.

Numerical Recap (Base Model)

  • d_{model}=512,\; d_k=d_v=64,\; h=8,\; N=6 encoder & decoder layers.
  • Encoder params ≈ 18\,\text{M}, decoder core ≈ 24\,\text{M}, output head ≈ 19\,\text{M} → total ≈ 65\,\text{M}.