Transformers Lecture Notes

Transformers: Introduction

  • Large Language Models (LLMs) are essentially stacks of Transformer blocks.
  • Transformer = a neural-network architecture for sequences that replaces recurrence with the “attention” operation, interleaving attention layers with feed-forward layers.
  • Core intuition: every token can dynamically look at ("attend to") every other token in its context and decide—by learned similarity scores—how much information to borrow.

Timeline of Language-Modeling Milestones

  • 1990 – Static Word Embeddings
    • One vector per word type, count-based or shallow-learned; later popularized by word2vec (2013) and GloVe (2014).
  • 2003 – Neural Language Model
    • Bengio et al. introduce end-to-end neural probability models for sequences.
  • 2008 – Multi-Task Learning
    • Jointly training on heterogeneous objectives.
  • 2015 – Attention
    • Bahdanau attention for sequence-to-sequence translation.
  • 2017 – Transformer
    • Vaswani et al. remove recurrence entirely, making attention the sole sequence-mixing operation.
  • 2018 – Contextual Word Embeddings & Pre-training
    • ELMo, BERT: fine-tuning giant pretrained encoders.
  • 2019 – Prompting
    • Treat tasks as text completion; unleash zero-/few-shot capabilities.

Static vs Contextual Word Embeddings

  • Problem with static embeddings
    • One vector per word type ⇒ cannot capture polysemy.
    • Example: “The chicken didn’t cross the road because it was too tired.”
      • Static “it” vector has no clue what it refers to.
  • Contextual Embeddings
    • Each token obtains a context-specific vector.
    • Obtained by Attention: integrate neighboring words with learned weights.
    • Properties demonstrated with cloze examples (“it was too tired” vs “it was too wide”).

Attention Mechanism – Intuition

  • Build a token’s contextual embedding by selectively integrating information from all other tokens.
  • “Attending” = assigning higher weights to more relevant tokens.
  • Visually: A column of layer $k$ interacts with all columns of layer $k$ to create the next-layer column.

Attention Formal Definition

  • Weighted sum over vectors:
    • Given previous-layer token vectors x_1,\dots,x_N, produce a_i for position $i$.
    • Left-to-right causal LM: positions j>i are masked.

Single-Head Equations

  • Project three role-specific versions of each vector:
    • q_i = x_i W_Q (Query)
    • k_i = x_i W_K (Key)
    • v_i = x_i W_V (Value)
    • Each projection matrix has size d\times d_k (or d\times d_v for values).
  • Similarity score (scaled dot-product):
    \text{score}(i,j)=\frac{q_i\cdot k_j}{\sqrt{d_k}}
  • Softmax over the allowed keys (j\le i) ⇒ attention weights:
    \alpha_{ij}=\operatorname{softmax}_{j\le i}\big(\text{score}(i,j)\big)
  • Output vector:
    a_i=\sum_{j\le i} \alpha_{ij}\, v_j

Step-by-step Example (computing a_3)

  1. Compute q_3, k_{1:3}, v_{1:3}.
  2. Dot-product q_3\cdot k_j for $j=1..3$.
  3. Divide each score by \sqrt{d_k}.
  4. Apply softmax to obtain weights $\alpha_{3j}$.
  5. Multiply each v_j by \alpha_{3j}.
  6. Sum to obtain a_3.
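The six steps above can be traced directly in NumPy. A minimal sketch, assuming random toy dimensions; the variable names and matrices are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_k = 6, 3
X = rng.standard_normal((3, d))                 # rows are x_1, x_2, x_3
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

# 1. Project q_3 and all keys/values k_{1:3}, v_{1:3}.
q3 = X[2] @ W_Q
K = X @ W_K
V = X @ W_V
# 2.-3. Dot products q_3 . k_j, divided by sqrt(d_k).
scores = K @ q3 / np.sqrt(d_k)
# 4. Softmax turns scores into weights alpha_{3j}.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
# 5.-6. Weight each v_j and sum to obtain a_3.
a3 = alpha @ V
print(a3.shape)   # (3,)
```

The `scores.max()` subtraction is the standard numerical-stability trick for softmax; it leaves the weights unchanged.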

Multi-Head Attention

  • Instead of one set $(W_Q, W_K, W_V)$, use $h$ independent sets ⇒ $h$ heads.
  • Each head learns to focus on different relations: syntax, coreference, negation, positional offsets, etc.
  • Pipeline per head $t$:
    \text{head}_t = \operatorname{Attention}(X W_Q^{(t)}, X W_K^{(t)}, X W_V^{(t)})
  • Concatenate heads and apply output projection W_O: \text{MHA}(X)=\big[\text{head}_1;\dots;\text{head}_h\big] W_O
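A minimal NumPy sketch of causal multi-head attention, looping over heads for clarity (the function name `mha` and the toy sizes are our own, not from the lecture):

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over the last axis."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mha(X, WQ, WK, WV, WO):
    """Causal multi-head attention: WQ/WK/WV shape (h, d, d_k), WO shape (h*d_k, d)."""
    h, d, d_k = WQ.shape
    N = X.shape[0]
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)  # hide keys with j > i
    heads = []
    for t in range(h):                                # one pass per head t
        Q, K, V = X @ WQ[t], X @ WK[t], X @ WV[t]
        S = Q @ K.T / np.sqrt(d_k)                    # scaled scores
        S[mask] = -np.inf                             # causal mask
        heads.append(softmax(S) @ V)                  # head_t, shape (N, d_k)
    return np.concatenate(heads, axis=-1) @ WO        # [head_1;...;head_h] W_O

rng = np.random.default_rng(2)
h, d, d_k, N = 2, 8, 4, 5
X = rng.standard_normal((N, d))
WQ, WK, WV = (rng.standard_normal((h, d, d_k)) for _ in range(3))
WO = rng.standard_normal((h * d_k, d))
out = mha(X, WQ, WK, WV, WO)
print(out.shape)  # (5, 8)
```

Because of the mask, perturbing the last token cannot change the outputs at earlier positions, which is exactly the left-to-right causal property.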

Transformer Block Architecture

  • One residual stream per token propagates up the stack.
  • Each block performs (with residual connections):
    1. LayerNorm t_1 = \operatorname{LayerNorm}(x_i)
    2. Multi-Head Attention t_2 = \text{MHA}(t_1)
    3. Residual add t_2 = x_i + t_2
    4. LayerNorm t_3 = \operatorname{LayerNorm}(t_2)
    5. Feed-Forward Network t_4 = \text{FFN}(t_3)
    6. Residual add x_i^{\text{next}} = t_2 + t_4
  • FFN: two linear layers with a non-linearity (ReLU or GELU); with ReLU:
    \operatorname{FFN}(z)=\max(0, zW_1+b_1)W_2+b_2
  • LayerNorm: z-score over features of one vector.
  • Stack $L$ blocks ⇒ same dimensionality d throughout.
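The six-step pre-norm block can be sketched in NumPy as follows. To keep the example short, the attention sub-layer is a placeholder linear map standing in for MHA (any function mapping (N, d) to (N, d) works); the helper names are ours:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """z-score each token vector over its features."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(z, W1, b1, W2, b2):
    """Two linear layers with ReLU: max(0, zW1+b1)W2+b2."""
    return np.maximum(0.0, z @ W1 + b1) @ W2 + b2

def block(X, attn, ffn_params):
    t1 = layer_norm(X)           # 1. LayerNorm
    t2 = X + attn(t1)            # 2.-3. attention sub-layer + residual
    t3 = layer_norm(t2)          # 4. LayerNorm
    t4 = ffn(t3, *ffn_params)    # 5. FFN
    return t2 + t4               # 6. residual -> x_i^next

rng = np.random.default_rng(3)
N, d, d_ff = 4, 8, 16
X = rng.standard_normal((N, d))
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
W_attn = rng.standard_normal((d, d))     # placeholder for real MHA
out = block(X, lambda t: t @ W_attn, (W1, b1, W2, b2))
print(out.shape)  # (4, 8)
```

Note that input and output both have shape (N, d), which is what lets $L$ such blocks stack with the same dimensionality throughout.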

Information-Flow Perspective

  • Every component except attention only touches its own residual stream.
  • Attention literally moves information between streams (Elhage et al., 2021).

Parallelizing Attention Computation

  • Pack tokens into matrix X\in\mathbb{R}^{N\times d} (rows = tokens).
  • Compute all queries, keys, values in one matmul:
    Q = X W_Q,\;K = X W_K,\;V = X W_V (shapes N\times d_k, etc.)
  • Score matrix via batched dot-product:
    S = Q K^T (shape N\times N).
  • Apply scaling, causal mask (upper triangle set to -\infty), softmax row-wise ⇒ A.
  • Output matrix: O = A V (shape N\times d_v).
  • Quadratic cost \mathcal{O}(N^2) in sequence length ⇒ motivation for efficient/long-range variants.
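The matrix pipeline above translates almost line for line into NumPy. A minimal single-head sketch with illustrative toy sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, d_k = 5, 8, 4
X = rng.standard_normal((N, d))             # rows = tokens
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # all positions in one matmul each
S = Q @ K.T / np.sqrt(d_k)                  # (N, N) scaled score matrix
S[np.triu_indices(N, k=1)] = -np.inf        # causal mask: upper triangle
A = np.exp(S - S.max(1, keepdims=True))
A /= A.sum(1, keepdims=True)                # row-wise softmax -> weights A
O = A @ V                                   # (N, d_v) output matrix
print(O.shape)  # (5, 4)
```

The (N, N) score matrix S is where the quadratic cost lives: both its memory and the two matmuls that produce and consume it grow as N².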

Token & Position Embeddings

  • Initial input matrix X is sum of:
    • Token embedding from matrix E\in\mathbb{R}^{|V|\times d}.
    • Positional embedding E_{\text{pos}}\in\mathbb{R}^{N\times d} (learned absolute positions in this lecture).
  • Example workflow (BPE-tokenized string "Thanks for all the"):
    1. Token indices [5,4000,10532,2224].
    2. Lookup rows in E.
    3. Add corresponding positional rows [0,1,2,3].
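The three-step workflow is just two row lookups and an addition. A sketch with random stand-in matrices (the real E and E_pos are learned; |V| here is chosen only so the example indices are in range):

```python
import numpy as np

rng = np.random.default_rng(5)
V_size, N_max, d = 11000, 512, 8
E = rng.standard_normal((V_size, d))        # token embedding matrix
E_pos = rng.standard_normal((N_max, d))     # learned absolute positions

tokens = [5, 4000, 10532, 2224]             # ids for "Thanks for all the"
X = E[tokens] + E_pos[:len(tokens)]         # row lookups + positions 0..3
print(X.shape)  # (4, 8)
```

Each row of X is the sum of a token's identity vector and its position's vector, so the same word at two positions gets two different inputs.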

Language Modeling Head

  • After $L$ blocks, final hidden matrix H\in\mathbb{R}^{N\times d}.
  • Unembedding layer: linear map back to vocabulary logits using tied weights E^T:
    U = H E^T (shape N\times |V|).
  • Softmax over each row produces the predictive distribution P(w_{t+1}\mid\text{context}): P_{ti}=\frac{e^{U_{ti}}}{\sum_{j} e^{U_{tj}}}.
  • “Weight tying” constrains parameters and empirically improves generalization.
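With tied weights the head needs no new parameters at all, just the transpose of E. A minimal sketch with random stand-ins for H and E:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, V_size = 4, 8, 50
H = rng.standard_normal((N, d))             # final hidden states after L blocks
E = rng.standard_normal((V_size, d))        # tied token-embedding matrix

U = H @ E.T                                 # unembed: (N, |V|) logits
P = np.exp(U - U.max(1, keepdims=True))
P /= P.sum(1, keepdims=True)                # row-wise softmax: P_{ti}
print(P.shape)  # (4, 50)
```

Row t of P is the model's distribution over the next token after position t, which is what the cross-entropy loss scores during training.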

The Final Transformer Language Model

  • Pipeline summary for one training step:
    1. Tokenize input sequence (max length N).
    2. Form X = E[\text{tokens}] + E_{\text{pos}}[0:N].
    3. Pass through stacked Transformer blocks (with masking).
    4. Compute logits via unembedding; softmax gives probabilities.
    5. Compute cross-entropy loss vs. true next-token labels; back-propagate to update all parameters \{W_Q^{(t)}, W_K^{(t)}, \dots, E, E_{\text{pos}}\}.
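Step 5's loss is just the mean negative log-probability the model assigned to the true next tokens. A self-contained sketch on a hand-made toy distribution (the 4-word vocabulary and targets are invented for illustration):

```python
import numpy as np

def cross_entropy(P, targets):
    """Mean negative log-probability of the true next tokens."""
    return -np.mean(np.log(P[np.arange(len(targets)), targets]))

# toy predictive distributions over a 4-word vocab at 2 positions
P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
targets = np.array([0, 3])        # true next-token ids
loss = cross_entropy(P, targets)
print(round(loss, 4))  # 0.8715
```

The second row is uniform, so it contributes exactly log 4 ≈ 1.386 of loss regardless of the target; confident correct predictions (row one) contribute much less.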

Practical, Philosophical & Ethical Notes

  • Attention’s context integration yields interpretability hooks (e.g., visualize heads for syntax or coreference).
  • Quadratic complexity poses environmental cost ⇒ active research on linear/efficient attention.
  • Weight tying re-uses the embedding matrix as the output projection, shrinking model size and aligning the embedding and prediction spaces; empirically it often improves generalization.
  • Contextual embeddings revolutionize downstream NLP by providing task-agnostic meaning representations fine-tuned with minimal labeled data.

Key Takeaways

  • Transformers convert symbolic sequences into rich context-dependent vectors through stacked blocks of LayerNorm → Multi-Head Attention → FFN with residuals.
  • Attention = scaled dot-product softmax weighted-sum; multi-head instantiation learns diverse relational patterns.
  • Parallel matrix formulation enables GPU/TPU efficiency but is \mathcal{O}(N^2).
  • Additive token + positional embeddings encode both lexical identity and order.
  • The language-model head “unembeds” the final hidden states to vocab logits, closing the auto-regressive loop.