Transformers

1
New cards

Foundation

2
New cards

What is a transformer?

A transformer is a neural network architecture that uses self-attention to process all tokens in a sequence in parallel, so it can learn which parts of the input should attend to which others.

3
New cards

What does attending mean?

What does “token A attends to token B” mean?

“Attending” just means paying more attention to some things than to others.

“Token A attends to token B” means the model is focusing more on token B when deciding what token A should mean.

4
New cards

What is self-attention, in simpler terms?

Self-attention is a mechanism where each token in a sequence ‘looks at’ all the other tokens and learns how much to weight each one when computing its own representation.

That way, the model can focus on the most relevant context for every token.

Simpler: the “self” means each token attends to other tokens in the same sequence (including itself).

5
New cards

What is masked self-attention?

Masked self-attention is where we block a token from attending to some other tokens—usually future tokens—by using a mask. In the decoder of a transformer, we mask future positions so the model can only use past and current tokens when predicting the next one.

6
New cards

What is cross-attention, in simpler terms?

Cross-attention is where queries come from one sequence and keys/values come from another, so the model can use information from the input sequence when generating each output token.

Simpler: tokens in one place (e.g., the decoder) look at tokens from somewhere else (e.g., the encoder output) to get information.

7
New cards

What problem do Transformers solve?

Transformers model long-range dependencies in sequences (text, audio, DNA, etc.) efficiently and in parallel, avoiding the sequential bottlenecks of RNNs/LSTMs.

8
New cards

What are the two major Transformer roles?

Encoders produce contextual embeddings of inputs;

Decoders generate outputs token-by-token while attending to both prior outputs and the encoder.

9
New cards

Architecture (encoder/decoder/blocks)

10
New cards

What is a Transformer block?

A transformer block is made of two main parts:

a multi-head self-attention layer and a position-wise feed-forward network, each wrapped with residual connections and layer normalization (plus dropout for regularization)

Multi-head (self-)attention → Add & LayerNorm → Feed-forward MLP → Add & LayerNorm
(with dropout inside the attention and MLP).
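
A minimal PyTorch sketch of this recipe, following the post-norm order named on this card (sizes such as d_model=512 are illustrative assumptions, not from the deck):

import torch.nn as nn

class TransformerBlock(nn.Module):
    # Post-norm order: sublayer -> dropout -> residual add -> LayerNorm.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)   # multi-head self-attention (Q = K = V = x)
        x = self.ln1(x + self.drop(a))                   # Add & LayerNorm
        x = self.ln2(x + self.drop(self.ffn(x)))         # feed-forward MLP, then Add & LayerNorm
        return x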

11
New cards

How do encoder and decoder differ?

Encoder blocks use full self-attention;

Decoder blocks use masked self-attention (to prevent peeking ahead) plus cross-attention over encoder outputs.

13
New cards

What are residual connections for?

Residuals help gradient flow and stabilize deep networks by adding the block input to the block output.

14
New cards

Why is layer normalization used?

LayerNorm stabilizes activations and gradients within each token’s feature vector, improving training stability.

15
New cards

What does the feed-forward network (FFN) do?

The FFN applies two linear layers with a nonlinearity (e.g., GELU/ReLU) to each token independently, increasing model capacity.

16
New cards

Attention mechanics (Q/K/V)

17
New cards

What are Query, Key, and Value?

The model projects token embeddings into Queries (what I’m looking for), Keys (what I offer), and Values (the information to aggregate).

18
New cards

How is scaled dot-product attention computed?

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V — compute similarity scores QKᵀ, scale them by √dₖ, apply softmax to turn scores into weights, then take the weighted sum of the Values.
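
A small sketch of this computation in PyTorch (function name and shapes are assumptions for illustration):

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # similarity scores: (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block disallowed positions (e.g., the future)
    weights = torch.softmax(scores, dim=-1)                    # one attention distribution per query
    return weights @ v                                         # weighted sum of the Values
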
19
New cards

Why “multi-head” attention?

Multiple heads learn different relationships in parallel (e.g., syntax vs semantics), then their outputs are concatenated and mixed.

20
New cards

What is masked attention and why is it needed?

A causal mask ensures token t only attends to ≤ t, preventing information leakage during generation.
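
One common way to build such a mask, sketched with an assumed sequence length; it plugs into an attention function as the mask argument:

import torch

n = 5                                                         # sequence length (illustrative)
causal_mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # lower-triangular: True where attention is allowed
# Row t is True only for columns <= t, so token t never sees future tokens;
# masked positions get -inf scores before the softmax and thus zero weight.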

21
New cards

What is cross-attention?

Cross-attention uses the decoder’s Queries with the encoder’s Keys/Values so generated tokens can attend to the encoded input.

22
New cards

Positional information

23
New cards

Why do we need positional encodings?

Because attention is permutation-invariant; positional encodings inject order information so the model knows token positions.

24
New cards

What are common positional schemes?

Sinusoidal (fixed), learned absolute, and relative/rotary (RoPE) encodings; relative/rotary often generalize better to longer contexts.
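
A sketch of the fixed sinusoidal variant (the classic formulation; the function name is assumed and d_model is taken to be even):

import math
import torch

def sinusoidal_positions(n_positions, d_model):
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)   # (n, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even feature dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd feature dimensions
    return pe                            # added to the token embeddings before the first block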

25
New cards

Tokenization & embeddings

26
New cards

How does tokenization work here?

Subword tokenizers (e.g., BPE/WordPiece) split text into units that balance vocabulary size with coverage, then map tokens to embeddings.

27
New cards

What is weight tying?

The input embedding matrix is reused as the output projection, reducing parameters and sometimes improving performance.
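
In a typical PyTorch setup this is a single parameter share (a sketch; the vocabulary and hidden sizes are assumptions):

import torch.nn as nn

vocab_size, d_model = 32000, 512
embedding = nn.Embedding(vocab_size, d_model)          # input token embeddings
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # output projection to vocabulary logits
lm_head.weight = embedding.weight                      # weight tying: both layers share one (vocab, d_model) matrix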

28
New cards

Training & Objectives

29
New cards

How are encoder-only models trained?

They are often trained with masked-language modeling (MLM), predicting randomly masked tokens from context.

30
New cards

How are decoder-only models trained?

They use causal language modeling (CLM), predicting the next token given previous tokens.

31
New cards

What is teacher forcing?

During training, the ground-truth previous tokens are fed to the decoder instead of its own predictions to stabilize learning.

32
New cards

What loss is typically used?

Cross-entropy loss over the next-token distribution is standard.
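
A sketch of the usual shifted next-token cross-entropy (random tensors stand in for real model outputs and token ids):

import torch
import torch.nn.functional as F

logits = torch.randn(2, 8, 100)           # (batch, seq_len, vocab) from the model
tokens = torch.randint(0, 100, (2, 8))    # (batch, seq_len) input token ids

# Predict token t+1 from positions <= t: drop the last prediction, drop the first label.
shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
shift_labels = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)    # averaged next-token cross-entropy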

33
New cards

Scaling & Efficiency

34
New cards

What is kv-cache during inference?

It stores Keys/Values of past tokens so each new token reuses them, speeding autoregressive decoding.
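
A toy sketch of the idea (real implementations keep one cache per layer and manage device placement; the names here are assumed):

import torch

class KVCache:
    # Keys/Values grow along the sequence dimension; each new token appends instead of recomputing the past.
    def __init__(self):
        self.k = None   # (batch, heads, seq_so_far, d_k)
        self.v = None

    def update(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v   # the new token attends over all cached Keys/Values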

35
New cards

What is quantization?

Quantization stores weights/activations in lower precision (e.g., INT8/INT4) to reduce memory and increase speed with minimal accuracy loss.

36
New cards

Practical Usage

37
New cards

When do you choose encoder-only vs decoder-only?

Use encoder-only for understanding tasks (classification, retrieval); decoder-only for generation (chat, writing code); encoder-decoder for seq2seq (translation, summarization).

38
New cards

How does fine-tuning differ from prompt-only use?

Fine-tuning updates parameters for a task; prompting keeps the model fixed and steers behavior via context.

39
New cards

What are adapters/LoRA at a high level?

They add small trainable modules or low-rank updates on top of frozen backbones to achieve task adaptation with far fewer trainable parameters.
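
A sketch of a LoRA-style wrapper around a frozen linear layer (the class name and the alpha/r scaling follow common convention; treat them as assumptions):

import torch.nn as nn

class LoRALinear(nn.Module):
    # y = W_frozen(x) + (alpha / r) * B(A(x)), with only the low-rank A and B trained.
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze the backbone weights
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)                  # start as a no-op so training begins at the backbone
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))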

40
New cards

Common pitfalls & Stabilization

41
New cards

Why do Transformers sometimes “stall” in training?

Poor learning rates, initialization, or normalization can cause instability; optimizers like AdamW and warmup schedules help.

42
New cards

What is gradient clipping and why use it?

Clipping caps gradient norms to prevent exploding gradients, improving training stability.
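
Typical placement in a PyTorch training step, sketched with a stand-in model (the max_norm value is an assumption):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                                            # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # cap the global gradient norm
optimizer.step()
optimizer.zero_grad()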

43
New cards

Why use dropout/attention dropout?

They regularize the model, reducing overfitting by randomly zeroing parts of activations/attention weights.

44
New cards

Why divide by √d_k in attention scores?

It keeps dot-products in a stable range so softmax does not saturate, which stabilizes training.

45
New cards

Generation Specifics

46
New cards

How do sampling strategies affect outputs?

Greedy/beam search improve determinism but reduce diversity; top-k/top-p sampling increases diversity at the cost of possible errors.
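
A sketch combining top-k and top-p filtering before sampling (thresholds and the function name are illustrative; top_k is assumed to be no larger than the vocabulary):

import torch

def sample_next(logits, top_k=50, top_p=0.9, temperature=1.0):
    # logits: (batch, vocab) scores for the next token.
    logits = logits / temperature
    # Top-k: drop everything outside the k highest-scoring tokens.
    kth = torch.topk(logits, top_k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): keep the smallest prefix of sorted tokens whose probability mass reaches top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    drop = probs.cumsum(dim=-1) - probs > top_p                  # tokens entirely past the nucleus boundary
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)   # sampled token ids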

47
New cards

Why does context length matter?

Longer context lets the model condition on more prior information, but increases compute and memory; extrapolation beyond training length can degrade quality.

48
New cards

Evaluation & safety (interview-ready)

49
New cards

How do you evaluate a Transformer on NLP tasks?

Use task-appropriate metrics: accuracy/F1 for classification, BLEU/ROUGE for generation, perplexity for language modeling, and human eval for quality/safety.

50
New cards

What causes hallucinations and how to mitigate them?

Hallucinations arise when the model confidently generates unsupported content; mitigations include retrieval-augmented generation, better prompts, and post-hoc verification.

51
New cards

Implementation Knobs

52
New cards

Why use GELU over ReLU in FFNs?

GELU often improves performance in language models due to smoother gating around zero.

53
New cards

What is pre-norm vs post-norm?

Pre-norm applies LayerNorm before attention/FFN and is more stable for deep stacks; post-norm normalizes after sublayers.
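
The difference in one line each, sketched on a single feed-forward sublayer (sizes are illustrative):

import torch
import torch.nn as nn

ln = nn.LayerNorm(512)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(2, 16, 512)

post_norm = ln(x + ffn(x))    # post-norm (original Transformer): sublayer -> residual add -> LayerNorm
pre_norm = x + ffn(ln(x))     # pre-norm (most modern LLMs): LayerNorm -> sublayer -> residual add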

54
New cards

Why increase model depth vs width?

Depth often improves hierarchical abstraction; width increases per-layer capacity. The best choice depends on data/compute budgets.

55
New cards

What is weight decay (AdamW) doing here?

It regularizes by penalizing large weights, improving generalization without entangling with Adam’s moment updates.

56
New cards

Modern variants

57
New cards

What is a Vision Transformer (ViT)?

ViT splits images into patches, embeds them as tokens, and applies the standard Transformer encoder for image classification.
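
A sketch of turning an image into patch tokens with a strided convolution, a common ViT implementation trick (patch size and dimensions are assumptions):

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                  # (batch, channels, height, width)
patch, d_model = 16, 768
to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)   # one 16x16 patch -> one embedding

tokens = to_tokens(img)                            # (1, d_model, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)         # (1, 196, d_model): a sequence of patch tokens
# A learned [CLS] token and positional embeddings are added before the standard encoder stack.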

58
New cards

What is an encoder-decoder LLM (e.g., T5) vs decoder-only LLM (e.g., GPT)?

T5 uses seq2seq with MLM-style objectives, good for conditioned generation; GPT uses causal decoding, strong for open-ended generation.

59
New cards

What is Rotary Positional Embedding (RoPE)?

RoPE encodes relative positions by rotating Q/K in complex space, enabling better length generalization and extrapolation.
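
A sketch of the pairwise rotation applied to Q and K (one common formulation; the function name, base, and an even head dimension are assumptions):

import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, d_head) with even d_head; consecutive dimension pairs are rotated.
    b, h, n, d = x.shape
    pos = torch.arange(n, dtype=torch.float32)                           # positions 0..n-1
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # one frequency per pair
    angles = pos[:, None] * freqs[None, :]                               # (n, d/2): angle grows with position
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                                 # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out   # applied to Q and K before the dot product; V is left unrotated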

60
New cards

What is multi-query attention (MQA)?

MQA shares Keys/Values across heads while keeping separate Queries, reducing memory and speeding up decoding with minimal quality loss.
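
A shape-level sketch of MQA with random tensors (dimensions are illustrative):

import torch

b, h, n, d_h = 2, 8, 16, 64
q = torch.randn(b, h, n, d_h)     # one Query head per attention head, as in standard MHA
k = torch.randn(b, 1, n, d_h)     # a single shared Key head...
v = torch.randn(b, 1, n, d_h)     # ...and a single shared Value head

scores = q @ k.transpose(-2, -1) / d_h ** 0.5     # K broadcasts across the head dimension
out = torch.softmax(scores, dim=-1) @ v           # the KV cache is h times smaller than in full MHA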

61
New cards

Foundations (Deep Dive)

62
New cards

What core limitation of RNNs/LSTMs do Transformers fix?

They remove the sequential dependency during training, so long-range context is learned with parallel matrix ops instead of step-by-step recurrence.

63
New cards

Why does “self-attention” scale better conceptually than recurrence?

Because each token directly attends to all other tokens in one shot, avoiding vanishing dependencies across many time steps.

64
New cards

What is the central object a Transformer learns?

It learns a contextual embedding for each token such that meaning depends on both the token and its surrounding context via attention.

65
New cards

Why is attention described as “content-based addressing”?

Queries retrieve Values by matching to Keys, so tokens fetch information based on semantic similarity rather than fixed positions.

66
New cards

Why use the softmax in attention?

Softmax converts similarity scores into a probability distribution, emphasizing a few relevant tokens while keeping gradients stable.

67
New cards

Why divide by √dₖ in attention?

It keeps dot-products in a numerically stable range so the softmax does not saturate early in training.

68
New cards

What is the learning signal in language modeling?

Cross-entropy between the predicted token distribution and the ground-truth next (or masked) token provides gradients to all layers.

69
New cards

Why do Transformers need positional information at all?

Self-attention is permutation-invariant; positional encodings inject order so the model can distinguish “cat chased dog” from “dog chased cat.”

70
New cards

How does multi-head attention increase capacity without huge cost?

It splits the hidden size across heads, letting each head specialize in different relations, then concatenates and mixes them with a linear layer.

71
New cards

When do we prefer encoder-decoder over decoder-only?

We prefer encoder-decoder when the task is explicitly sequence-to-sequence (e.g., translation, summarization with long sources) because cross-attention cleanly conditions on the encoded input.

72
New cards

What does “causal” mean in decoder-only LMs?

Causal means each position can only attend to previous positions, enabling next-token prediction without leaking future information.

73
New cards

What is weight tying and why use it?

It reuses the input embedding matrix for the output projection to reduce parameters and often improve perplexity.

74
New cards

What is the kv-cache and why is it crucial for generation?

It stores past Keys/Values so decoding a new token only computes attention against cached states instead of recomputing all previous layers.

75
New cards

Architecture (Anatomy + Shapes)

76
New cards

What is the minimal Transformer block recipe?

[LayerNorm] → Multi-Head Self-Attention → Residual add → [LayerNorm] → Feed-Forward (two linear layers with activation) → Residual add. (Pre-norm shown.)

77
New cards

Why pre-norm over post-norm in modern LLMs?

Pre-norm improves gradient flow in very deep stacks and is empirically more stable.

78
New cards

What does the FFN actually do?

It expands the hidden dimension (e.g., 4×) with a nonlinearity (GELU/ReLU) and projects back, acting as a per-token MLP to increase representational power.

79
New cards

What are typical tensor shapes in attention?

For batch B, length n, hidden d, heads h, head dim d_h = d/h:

Embeddings: (B, n, d) → Q/K/V: (B, h, n, d_h) → scores: (B, h, n, n) → output: (B, n, d).
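
The split-into-heads and merge-back reshapes behind these shapes, sketched with x standing in for the Q/K/V projections:

import torch

B, n, d, h = 2, 16, 512, 8
d_h = d // h

x = torch.randn(B, n, d)                            # embeddings: (B, n, d)
q = x.view(B, n, h, d_h).transpose(1, 2)            # split into heads: (B, h, n, d_h)
scores = q @ q.transpose(-2, -1) / d_h ** 0.5       # scores: (B, h, n, n)
out = torch.softmax(scores, dim=-1) @ q             # per-head outputs: (B, h, n, d_h)
out = out.transpose(1, 2).reshape(B, n, d)          # concatenate heads back: (B, n, d)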

80
New cards

What is the output projection after multi-head attention?

Concatenated head outputs (B, n, h·d_h) are linearly projected back to (B, n, d) to mix information across heads.

81
New cards

How do encoder vs decoder blocks differ structurally?

Encoders: self-attention + FFN.


Decoders: masked self-attention + cross-attention (to encoder) + FFN.

82
New cards

What is cross-attention wiring?

Decoder Queries attend to Keys/Values produced by the encoder, allowing the decoder to condition on the source sequence.

83
New cards

What does LayerNorm normalize?

It normalizes features within each token vector (mean/variance across the hidden dimension) to stabilize activations.

84
New cards

Why choose GELU over ReLU in FFNs?

GELU provides smoother gating around zero and empirically improves language model quality.

85
New cards

What is Multi-Query Attention (MQA) and why use it?

MQA shares Keys/Values across heads while keeping separate Queries, reducing memory and latency in decoding with small quality loss.

86
New cards

What is Grouped-Query Attention (GQA)?

GQA is a middle ground where Keys/Values are shared within groups of heads, balancing speed and quality.

87
New cards

Where do positional encodings plug in?

They are added or applied to token embeddings (absolute) or to Q/K (relative/rotary) before computing attention.

88
New cards

What does the final LM head do?

It projects the decoder’s hidden states to vocabulary logits; with weight tying, it shares weights with the input embeddings.

89
New cards

Attention Mechanics (Key Terms)

90
New cards

What is a Query?

A learned linear projection that represents “what information this token is seeking.”

91
New cards

What is a Key?

A learned linear projection that represents “what information this token offers.”

92
New cards

What is a Value?

A learned linear projection that carries the actual information aggregated by attention.

93
New cards

What is Scaled Dot-Product Attention?

Compute scores = QKᵀ/√dₖ → softmax → weighted sum over V.

94
New cards

What is a Head?

One parallel attention subspace with its own Q/K/V projections; multiple heads run in parallel.

95
New cards

What is a Residual Stream?

The pathway carrying the token representations through the stack; sublayers add to it via skip connections.

96
New cards

What is Pre-LN vs Post-LN?

Pre-LN normalizes inputs to sublayers; Post-LN normalizes outputs. Pre-LN is more stable for deep models.

97
New cards

What is a Causal Mask?

A binary mask that blocks future positions to ensure valid next-token prediction.

98
New cards
99
New cards
100
New cards