Transformers – Comprehensive Lecture 1 Notes
Module 1.1 – Limitations of Sequential (RNN) Models
- Seq2Seq with RNN encoder–decoder
- Encoder produces hidden states h_0,\dots,h_T for the source sentence (e.g. “I enjoyed the movie transformers”).
- Final state h_T (a.k.a. concept/thought/context vector) passed to decoder as s_0=h_T.
- Decoder generates target sentence step-by-step using its own hidden states s_1,\dots.
- Problems
- Bottleneck: single vector h_T must encode all information, ignoring direct word alignment.
- Computation is inherently sequential → cannot parallelize across time steps during training.
- Susceptible to vanishing / exploding gradients.
Attention Mechanism Refresher (RNN context)
- Idea: provide decoder direct access to every encoder hidden state \{h_i\} instead of only h_T.
- Context vector for decoder step t
c_t = \sum_{i=1}^{T} \alpha_{ti}\, h_i
- Alignment scores
\alpha_{ti}=\text{align}(y_t,h_i)=\frac{\exp(\text{score}(s_{t-1},h_i))}{\sum_{i'}\exp(\text{score}(s_{t-1},h_{i'}))}
- Typical score function (Bahdanau):
\text{score}(s,h)=v_a^\top\,\tanh(U_{att}\,s + W_{att}\,h).
- Benefit: better translation via word-to-word correspondence.
- Still sequential: s_{t-1} required before we can compute row t of \alpha → no full parallelization.
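A minimal NumPy sketch of one decoder step of this additive attention (shapes and helper names are illustrative, not from the lecture):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bahdanau_step(s_prev, H, U_att, W_att, v_a):
    """One decoder step of additive (Bahdanau) attention.

    s_prev : (d,)   previous decoder state s_{t-1}
    H      : (T, d) encoder hidden states h_1..h_T (rows)
    Returns the context vector c_t and the weights alpha_t.
    """
    # score(s, h_i) = v_a^T tanh(U_att s + W_att h_i), one score per position
    scores = np.tanh(s_prev @ U_att.T + H @ W_att.T) @ v_a  # (T,)
    alpha = softmax(scores)                                 # weights sum to 1
    c_t = alpha @ H                                         # weighted sum of states
    return c_t, alpha
```

Note how the step still consumes s_{t-1}: the context for step t cannot be computed before step t−1 finishes, which is exactly the sequential bottleneck this section describes.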
Desire for New Architecture
- Maintain attention’s alignment strengths.
- Enable full parallel computation across sentence positions (eliminate sequential recurrence).
- Address gradient problems.
- Replace RNNs with stacked attention + feed-forward blocks in both encoder & decoder.
- Key primitives:
- Self-Attention (within encoder or decoder).
- Encoder–Decoder (Cross) Attention (decoder queries encoder outputs).
- Position-wise Feed-Forward Networks (FFN).
Self-Attention Mechanics
- Objective: for every input token, produce representation that aggregates information from all tokens weighted by contextual relevance.
- Inputs & linear projections
- Word embeddings H=[h_1, \dots , h_T] \in \mathbb R^{d_{model}\times T} (here d_{model}=512).
- Three learned matrices W_Q, W_K, W_V \in \mathbb R^{d_k\times d_{model}} (with d_k=d_q=d_v=64 in the original base model).
- Compute matrices
Q=W_Q H, \; K=W_K H, \; V=W_V H where Q,K,V \in \mathbb R^{d_k\times T}.
- Scaled Dot-Product Attention (vectorized)
Z = \text{softmax}\!\left(\frac{Q^\top K}{\sqrt{d_k}}\right) V^\top \quad (Z \in \mathbb R^{T\times d_v})
- Q^\top K → T\times T matrix of all pairwise scores q_i \cdot k_j.
- Division by \sqrt{d_k} prevents large dot products causing small gradients (stabilizes softmax).
- Softmax along keys axis ensures each row sums to 1.
- Parallelism: all rows computed in one matrix operation – no time-step dependency.
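The vectorized computation above can be sketched in a few lines of NumPy; this sketch uses the common row convention Q, K, V \in \mathbb R^{T\times d} (the lecture's column convention is its transpose):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (T, d_k); V: (T, d_v). Rows index token positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T): all pairwise q_i . k_j
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # (T, d_v)
```

Every output row is produced by the same two matrix multiplications, with no loop over time steps.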
Multi-Head Self-Attention
- Motivation: analogous to multiple convolutional kernels → capture diverse relation sub-spaces.
- Procedure for each head i
\text{head}_i = \text{Attention}(Q_i,K_i,V_i) with separate W_{Q_i},W_{K_i},W_{V_i}.
- Concatenation & output projection
\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)\, W_O,
where W_O \in \mathbb R^{h d_v\times d_{model}} (h=8 in base model → concat dimension 512).
- Computation is fully parallel across heads.
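A sketch of the full multi-head procedure under the same row convention; weight shapes here are illustrative:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; rows index token positions."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head(H, W_Q, W_K, W_V, W_O):
    """H: (T, d_model); W_Q/W_K/W_V: (h, d_model, d_k); W_O: (h*d_k, d_model)."""
    heads = [attention(H @ wq, H @ wk, H @ wv)      # one (T, d_k) output per head
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O     # (T, d_model)
```

The Python loop over heads is only for readability; each head's projections are independent, so frameworks batch them into one tensor contraction.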
Encoder Layer Structure
- Two sublayers per layer (stack of N=6):
- Multi-Head Self-Attention.
- Position-wise FFN: \text{FFN}(z)=W_2\,\text{ReLU}(W_1 z + b_1)+b_2 with W_1\in\mathbb R^{2048\times512},\;W_2\in\mathbb R^{512\times2048}.
- Residual connection plus LayerNorm after each sublayer.
- Parameter count (per layer, h=8):
- Q/K/V projections: 3 matrices of (8\cdot64)\times512 = 512\times512 \approx 2.6\times10^5 parameters each.
- Output W_O: 512\times512 \approx 2.6\times10^5.
- FFN: 2\times512\times2048 \approx 2.1\times10^6.
- \approx 3.1\times10^6 total per layer → \approx 19\,\text{M} for 6 layers.
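The per-layer arithmetic can be checked directly (a sketch; bias, embedding, and LayerNorm parameters are omitted):

```python
# Base-model hyperparameters from the lecture
d_model, d_k, h, d_ff, N = 512, 64, 8, 2048, 6

qkv = 3 * (h * d_k) * d_model   # Q, K, V projections: three 512x512 matrices
w_o = (h * d_k) * d_model       # output projection W_O: 512x512
ffn = 2 * d_model * d_ff        # W1 (2048x512) and W2 (512x2048)
per_layer = qkv + w_o + ffn     # ~3.1M parameters per encoder layer
encoder_total = N * per_layer   # ~18.9M for the 6-layer stack
```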
Decoder Layer Structure
- Three sublayers (stack of 6):
- Masked Multi-Head Self-Attention (prevents using future tokens during training).
- Multi-Head Cross-Attention where Q comes from decoder self-att output; K,V come from encoder outputs E.
- Position-wise FFN.
- Same residual + LayerNorm after each.
- Teacher Forcing & Masking
- During training, the ground-truth target sequence, shifted right (start token prepended), is fed as decoder input.
- Mask matrix M is upper-triangular with -\infty above diagonal → softmax zeroes future positions.
- Parameter cost per decoder layer: ~4 M (≈2 M FFN + 1 M masked self-attn + 1 M cross-attn).
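The masking trick above can be demonstrated in a few lines of NumPy (a sketch, not the lecture's code):

```python
import numpy as np

def causal_mask(T):
    """Upper-triangular mask: -inf strictly above the diagonal, 0 elsewhere."""
    return np.triu(np.full((T, T), -np.inf), k=1)

# Adding the mask to the score matrix before softmax zeroes future positions.
T = 4
scores = np.zeros((T, T)) + causal_mask(T)      # uniform scores, then masked
weights = np.exp(scores)                        # exp(-inf) = 0
weights /= weights.sum(axis=-1, keepdims=True)  # row t attends only to 0..t
```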
Positional Encoding
- Self-attention is permutation-invariant; need token order info.
- Sinusoidal encoding (fixed, no new parameters):
\text{PE}(j,i)=\begin{cases}\sin\left(\frac{j}{10000^{2i/d_{model}}}\right), & i\text{ even}\\[4pt]
\cos\left(\frac{j}{10000^{2i/d_{model}}}\right), & i\text{ odd}\end{cases}
- j = position (0…T−1), i = dimension index (0…d_{model}-1).
- Produces unique, smooth patterns that let the model learn relative distances (the matrix of pairwise distances between encodings is symmetric).
- Added element-wise to word embeddings.
h_j' = h_j + p_j.
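A NumPy sketch of the sinusoidal encoding, using the common convention that dimensions 2i and 2i+1 share the frequency 10000^{2i/d_{model}}:

```python
import numpy as np

def positional_encoding(T, d_model):
    """Fixed sinusoidal encodings: sin on even dims, cos on odd dims."""
    j = np.arange(T)[:, None]        # positions 0..T-1 (rows)
    i = np.arange(d_model)[None, :]  # dimension indices (columns)
    # Dimensions 2k and 2k+1 share the same frequency.
    angle = j / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # (T, d_model)
```

The result has the embedding shape, so it is added element-wise to the word embeddings; no parameters are learned.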
Normalization & Residuals
- Layer Normalization preferred over BatchNorm in NLP:
\mu = \frac1H \sum_{i=1}^H x_i,\quad \sigma^2 = \frac1H \sum_{i=1}^H (x_i-\mu)^2
\hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}},\qquad y_i = \gamma\,\hat{x}_i + \beta.
- Applied after each sublayer alongside the residual add.
- Ensures stable gradients and allows batch size 1 during inference.
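A minimal LayerNorm sketch matching the equations above, where H is the feature dimension of each token:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector (last axis) independently."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Because the statistics are computed per token rather than per batch, the same code works unchanged with a batch of one at inference time.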
Training Details – Learning-Rate Warm-Up
- Standard decaying LR \eta \propto \text{step}^{-0.5} converges slowly early on.
- A purely growing LR (\eta \propto \text{step}) diverges later.
- Combined schedule (warm-up):
\eta = d_{model}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\; \text{step}\;\cdot\text{warmup}^{-1.5}\right)
- WarmupSteps = 4000 in original paper.
- The schedule rises linearly during warm-up, then follows the inverse-square-root decay; after warm-up it coincides with the plain decay curve.
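The combined schedule is a one-liner (a sketch; `lr` is an illustrative name):

```python
def lr(step, d_model=512, warmup=4000):
    """Warm-up schedule: linear growth up to step == warmup,
    then inverse-square-root decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The `min` selects the linear branch before `warmup` steps and the decaying branch afterwards; the peak learning rate occurs exactly at `step == warmup`.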
Output Generation & Parameters
- Top decoder layer outputs S \in \mathbb R^{T\times512}.
- Linear projection W_D \in \mathbb R^{512\times|V|} followed by softmax → word distribution.
- Vocabulary |V|\approx37000 ⇒ 512\times|V|\approx19\,\text{M} parameters, the dominant share of the overall ~65 M.
- Inference is autoregressive: previous decoded word fed back; teacher forcing not used.
- Full parallelization across sequence positions during training (both encoder & masked decoder self-attn computed with matrix ops).
- Multi-head attention captures diverse contextual relations.
- Residuals + LayerNorm → deep (30 sublayers in total: 6\times2 encoder + 6\times3 decoder) yet trainable network.
- Sinusoidal positional encodings allow unlimited length generalization without learned embeddings.
- Warm-up learning-rate schedule accelerates convergence.
Ethical / Practical Notes
- Parameter counts (65 M) imply high compute & energy cost; scaling further magnifies this.
- Attention visualisations (Colab demos) reveal which words influence each other → interpretability aid.
- Masking strategy prevents information leakage during training yet allows efficient teacher forcing.
Connections & Real-World Relevance
- CNN analogy: multi-head ≈ multiple kernels; FFN ≈ pointwise conv.
- Compared to RNN: eliminates sequential dependency, solves vanishing-gradient problem, achieves state-of-the-art in machine translation and many NLP tasks.
- Positional encoding ideas now reused in vision (ViT) & audio transformers.
Numerical Recap (Base Model)
- d_{model}=512,\;d_k=64,\;h=8,\;N=6 encoder & decoder layers.
- Encoder params ≈ 18\,\text{M}, decoder core ≈ 24\,\text{M}, output head ≈ 19\,\text{M} → total ≈ 65\,\text{M}.