Transformers – Comprehensive Lecture 1 Notes

Module 1.1 – Limitations of Sequential (RNN) Models

  • Seq2Seq with RNN encoder–decoder
    • Encoder produces hidden states h_0,\dots,h_T for source sentence (e.g. “I enjoyed the movie transformers”).
    • Final state h_T (a.k.a. concept/thought/context vector) passed to decoder as s_0 = h_T.
    • Decoder generates target sentence step-by-step using its own hidden states s_1,\dots.
  • Problems
    • Bottleneck: single vector h_T must encode all information, ignoring direct word alignment.
    • Computation is inherently sequential → cannot parallelize across time steps during training.
    • Susceptible to vanishing / exploding gradients.

Attention Mechanism Refresher (RNN context)

  • Idea: provide decoder direct access to every encoder hidden state \{h_i\} instead of only h_T.
  • Context vector for decoder step t: c_t = \sum_{i=1}^{n} \alpha_{ti}\, h_i (a minimal numpy sketch follows this list).
    • Alignment scores
      \alpha_{ti}=\text{align}(y_t,h_i)=\frac{\exp(\text{score}(s_{t-1},h_i))}{\sum_{i'}\exp(\text{score}(s_{t-1},h_{i'}))}
    • Typical score function (Bahdanau):
      \text{score}(s,h)=v_a^\top\,\tanh(U_{att}\,s + W_{att}\,h).
  • Benefit: better translation via word-to-word correspondence.
  • Still sequential: s_{t-1} required before we can compute row t of \alpha → no full parallelization.
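
  • A minimal numpy sketch of the alignment computation above, for a single decoder step; the matrix names (U_att, W_att, v_a) and all dimensions are illustrative toy choices, not the lecture's exact setup:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def bahdanau_context(s_prev, H, U_att, W_att, v_a):
    """Context vector c_t for one decoder step.

    s_prev : (d_dec,)    previous decoder state s_{t-1}
    H      : (T, d_enc)  encoder hidden states h_1..h_T stored as rows
    """
    # score(s_{t-1}, h_i) = v_a^T tanh(U_att s_{t-1} + W_att h_i), for every i at once
    scores = np.tanh(H @ W_att.T + U_att @ s_prev) @ v_a   # (T,)
    alpha = softmax(scores)                                  # alignment weights, sum to 1
    return alpha @ H                                         # c_t = sum_i alpha_ti h_i

# toy dimensions, illustrative only
T, d_enc, d_dec, d_att = 5, 8, 8, 16
rng = np.random.default_rng(0)
c_t = bahdanau_context(rng.normal(size=d_dec),
                       rng.normal(size=(T, d_enc)),
                       rng.normal(size=(d_att, d_dec)),
                       rng.normal(size=(d_att, d_enc)),
                       rng.normal(size=d_att))
```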

Desire for New Architecture

  • Maintain attention’s alignment strengths.
  • Enable full parallel computation across sentence positions (eliminate sequential recurrence).
  • Address gradient problems.

Transition to Transformers – “Attention Is All You Need”

  • Replace RNNs with stacked attention + feed-forward blocks in both encoder & decoder.
  • Key primitives:
    1. Self-Attention (within encoder or decoder).
    2. Encoder–Decoder (Cross) Attention (decoder queries encoder outputs).
    3. Position-wise Feed-Forward Networks (FFN).

Self-Attention Mechanics

  • Objective: for every input token, produce representation that aggregates information from all tokens weighted by contextual relevance.
  • Inputs & linear projections
    • Word embeddings H=[h_1, \dots , h_T] \in \mathbb R^{d_{model}\times T} (here d_{model}=512).
    • Three learned matrices W_Q, W_K, W_V \in \mathbb R^{d_k\times d_{model}} (with d_k=d_q=d_v=64 in the original base model).
    • Compute matrices
      Q=W_Q H, \; K=W_K H, \; V=W_V H, where Q,K,V \in \mathbb R^{d_k\times T}.
  • Scaled Dot-Product Attention (vectorized): Z = \text{softmax}\!\left(\frac{Q^\top K}{\sqrt{d_k}}\right) V^\top \quad (Z \in \mathbb R^{T\times d_v}) (see the numpy sketch after this list).
    • Q^\top K → T\times T matrix of all pairwise scores q_i \cdot k_j.
    • Division by \sqrt{d_k} prevents large dot products causing small gradients (stabilizes softmax).
    • Softmax along keys axis ensures each row sums to 1.
  • Parallelism: all rows computed in one matrix operation – no time-step dependency.
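
  • A small numpy sketch of scaled dot-product attention as described above; for readability the tokens are stored as rows, so the formula reads softmax(Q K^T / sqrt(d_k)) V, the transpose of the column convention used in the notes. All sizes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (T, d_k); V: (T, d_v); tokens are rows (transposed vs. the notes).

    Returns Z: (T, d_v), one contextualized vector per token.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (T, T) pairwise q_i . k_j, scaled
    if mask is not None:
        scores = scores + mask                         # additive mask of 0 / -inf entries
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # each row sums to 1
    return weights @ V                                 # weighted sum of value vectors

# toy example: T=4 tokens, d_model=512 projected down to d_k=d_v=64
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 512))
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.02 for _ in range(3))
Z = scaled_dot_product_attention(H @ W_Q, H @ W_K, H @ W_V)   # (4, 64)
```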

Multi-Head Self-Attention

  • Motivation: analogous to multiple convolutional kernels → capture diverse relation sub-spaces.
  • Procedure for each head i
    \text{head}_i = \text{Attention}(Q_i,K_i,V_i) with separate W_{Q_i},W_{K_i},W_{V_i}.
  • Concatenation & output projection
    \text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)\,W_O, where W_O \in \mathbb R^{h d_v\times d_{model}} (h=8 in the base model → concat dimension 512; a sketch follows this list).
  • Learning is fully parallel across heads.
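
  • A sketch of the multi-head wrapper described above; the per-head projection matrices are kept in plain Python lists for clarity (a real implementation would batch them into one tensor), and the inner attention is the same scaled dot-product routine as in the previous sketch:

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention (tokens as rows), as in the previous sketch
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(H, W_Q, W_K, W_V, W_O):
    """H: (T, d_model); W_Q/W_K/W_V: lists holding one (d_model, d_k) matrix per head."""
    heads = [attention(H @ wq, H @ wk, H @ wv)          # each head: (T, d_v)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O          # (T, h*d_v) @ (h*d_v, d_model)

# base-model sizes: d_model=512, h=8, d_k=d_v=64
rng = np.random.default_rng(0)
d_model, h, d_k, T = 512, 8, 64, 4
H = rng.normal(size=(T, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model)) * 0.02
out = multi_head_attention(H, W_Q, W_K, W_V, W_O)        # (4, 512)
```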

Encoder Layer Structure

  • Two sublayers per layer (stack of N=6):
    1. Multi-Head Self-Attention.
    2. Position-wise FFN: \text{FFN}(z)=W_2\,\text{ReLU}(W_1 z + b_1)+b_2 with W_1\in\mathbb R^{2048\times512},\ W_2\in\mathbb R^{512\times2048}.
  • Residual connection plus LayerNorm after each sublayer.
  • Parameter count (per layer, h=8, biases ignored; worked out in the sketch below):
    • Q, K, V projections: 3\times (8\cdot64\times512) \approx 7.9\times10^5 (\approx 2.6\times10^5 each).
    • Output W_O: 512\times512 \approx 2.6\times10^5.
    • FFN: 2\times512\times2048 \approx 2.1\times10^6.
    • \approx 3.1\times10^6 total per layer → \approx 19\;\text{M} for the 6-layer encoder stack.
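
  • A quick arithmetic check of the per-layer counts above, assuming the base-model sizes and ignoring biases and LayerNorm parameters:

```python
d_model, h, d_k, d_ff, N = 512, 8, 64, 2048, 6

qkv  = 3 * (h * d_k) * d_model      # Q, K, V projections: 3 * 262,144 = 786,432
w_o  = (h * d_k) * d_model          # output projection W_O:        262,144
ffn  = 2 * d_model * d_ff           # W_1 and W_2:                2,097,152
per_layer = qkv + w_o + ffn         # ~3.1M per encoder layer
print(per_layer, N * per_layer)     # 3,145,728 and ~18.9M for the 6-layer stack
```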

Decoder Layer Structure

  • Three sublayers (stack of 6):
    1. Masked Multi-Head Self-Attention (prevents using future tokens during training).
    2. Multi-Head Cross-Attention where Q comes from decoder self-att output; K,V come from encoder outputs E.
    3. Position-wise FFN.
  • Same residual + LayerNorm after each.
  • Teacher Forcing & Masking
    • During training, ground-truth target tokens shifted right (start-of-sequence token prepended) are fed as inputs.
    • Mask matrix M is upper-triangular with -\infty above the diagonal → softmax assigns zero weight to future positions (see the sketch below).
  • Parameter cost per decoder layer: ~4 M (≈2 M FFN + 1 M masked self-attn + 1 M cross-attn).
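
  • A minimal sketch of the causal (look-ahead) mask described above; it is added to the T×T score matrix before the softmax, e.g. as the mask argument of the attention sketch earlier:

```python
import numpy as np

def causal_mask(T):
    """Additive mask: 0 on/below the diagonal, -inf strictly above it."""
    return np.triu(np.full((T, T), -np.inf), k=1)

print(causal_mask(4))
# Adding this to the score matrix before the softmax gives weight 0 to all
# future positions, so row t only attends to tokens 0..t.
```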

Positional Encoding

  • Self-attention is permutation-invariant; need token order info.
  • Sinusoidal encoding (fixed, no new parameters): \text{PE}(j,2i)=\sin\!\left(\frac{j}{10000^{2i/d_{model}}}\right),\qquad \text{PE}(j,2i+1)=\cos\!\left(\frac{j}{10000^{2i/d_{model}}}\right)
    • j = position (0…T−1); 2i and 2i+1 index the even/odd embedding dimensions (0…d_{model}−1).
    • Produces unique, smooth patterns whose relationship between two positions depends only on their offset, enabling the model to learn relative distances.
  • Added element-wise to word embeddings.
    h_j' = h_j + p_j (a numpy sketch follows).
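
  • A numpy sketch of the sinusoidal encoding above, producing one row per position that is added to the corresponding word embedding (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(T, d_model):
    """PE matrix of shape (T, d_model); row j is added to the embedding at position j."""
    positions = np.arange(T)[:, None]                       # (T, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # j / 10000^(2i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)     # even dimensions
    pe[:, 1::2] = np.cos(angles)     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(T=10, d_model=512)
# h_prime = word_embeddings + pe    # element-wise add, no learned parameters
```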

Normalization & Residuals

  • Layer Normalization preferred over BatchNorm in NLP:
    \mu = \frac1H \sum_{i=1}^H x_i,\quad \sigma^2 = \frac1H \sum_{i=1}^H (x_i-\mu)^2,\qquad \hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}},\qquad y_i = \gamma\,\hat{x}_i + \beta.
  • Applied after each sublayer alongside residual add.
  • Statistics are computed per token over its own features (not across the batch), ensuring stable gradients and allowing batch size 1 during inference (sketch below).
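
  • A minimal LayerNorm sketch matching the formulas above, normalizing each token vector over its own feature dimension:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (T, d_model); gamma, beta: (d_model,) learned scale and shift.

    Each token vector is normalized over its own features, never over the batch.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# residual connection + LayerNorm around a sublayer, as in every encoder/decoder sublayer:
# y = layer_norm(x + sublayer(x), gamma, beta)
```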

Training Details – Learning-Rate Warm-Up

  • Standard decaying LR \eta \propto \text{step}^{-0.5} converges slowly early on.
  • A purely growing LR (\eta \propto \text{step}) diverges later in training.
  • Combined schedule (warm-up): \eta = d_{model}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\; \text{step}\;\cdot\text{warmup}^{-1.5}\right)
    • WarmupSteps = 4000 in original paper.
    • The combined schedule increases linearly during warm-up, then follows inverse-sqrt decay, staying at or above the simple decay curve after warm-up (see the sketch below).
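
  • The warm-up schedule above as a small Python function, using the base-model defaults (d_model=512, warmup=4000):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Warm-up schedule: linear increase up to `warmup`, then inverse-sqrt decay."""
    step = max(step, 1)                    # avoid step**-0.5 diverging at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# peak learning rate is reached at step == warmup (~7e-4 for the defaults above)
print(transformer_lr(1), transformer_lr(4000), transformer_lr(40000))
```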

Output Generation & Parameters

  • Top decoder layer outputs S \in \mathbb R^{T\times512}.
  • Linear projection W_D \in \mathbb R^{512\times|V|} followed by softmax → word distribution.
  • Vocabulary |V|\approx37000 ⇒ \approx19\,\text{M} parameters, dominant share of overall ~65 M.
  • Inference is autoregressive: previous decoded word fed back; teacher forcing not used.
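
  • A sketch of greedy autoregressive decoding as described above; decoder_step, sos_id and eos_id are hypothetical placeholders standing in for the full decoder stack, the 512×|V| output projection and the vocabulary, not part of the lecture:

```python
import numpy as np

def greedy_decode(encoder_outputs, decoder_step, sos_id, eos_id, max_len=50):
    """Autoregressive inference sketch (no teacher forcing).

    decoder_step(tokens, encoder_outputs) is a placeholder that returns next-token
    logits over the vocabulary, given everything decoded so far.
    """
    tokens = [sos_id]
    for _ in range(max_len):
        logits = decoder_step(np.array(tokens), encoder_outputs)   # (|V|,)
        next_token = int(np.argmax(logits))    # greedy pick of the most probable word
        tokens.append(next_token)              # fed back in at the next step
        if next_token == eos_id:
            break
    return tokens
```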

Summary of Transformer Advantages

  • Full parallelization across sequence positions during training (both encoder & masked decoder self-attn computed with matrix ops).
  • Multi-head attention captures diverse contextual relations.
  • Residuals + LayerNorm → deep (30 attention/FFN sublayers across the encoder and decoder stacks) yet trainable network.
  • Sinusoidal positional encodings can generalize to sequence lengths beyond those seen in training, without adding learned embeddings.
  • Warm-up learning-rate schedule accelerates convergence.

Ethical / Practical Notes

  • Parameter counts (65 M) imply high compute & energy cost; scaling further magnifies this.
  • Attention visualisations (Colab demos) reveal which words influence each other → interpretability aid.
  • Masking strategy prevents information leakage during training yet allows efficient teacher forcing.

Connections & Real-World Relevance

  • CNN analogy: multi-head ≈ multiple kernels; FFN ≈ pointwise conv.
  • Compared to RNN: eliminates sequential dependency, solves vanishing-gradient problem, achieves state-of-the-art in machine translation and many NLP tasks.
  • Positional encoding ideas now reused in vision (ViT) & audio transformers.

Numerical Recap (Base Model)

  • d_{model}=512,\; d_k=d_v=64,\; h=8,\; N=6 encoder & decoder layers.
  • Encoder params ≈ 18\,\text{M}, decoder core ≈ 24\,\text{M}, output head ≈ 19\,\text{M} → total ≈ 65\,\text{M}.