Transformers

Foundation

What is a transformer?

A Transformer is a neural network that models sequences using self-attention (not recurrence), letting each token weight all others in parallel to capture long-range dependencies efficiently for encoding and generation.

What is the high-level idea of attention?

Attention lets each token compute a weighted average of information from other tokens so it can focus on the most relevant context.

What problem do Transformers solve?

Transformers model long-range dependencies in sequences (text, audio, DNA, etc.) efficiently and in parallel, avoiding the sequential bottlenecks of RNNs/LSTMs.

What are the two major Transformer roles?

Encoders produce contextual embeddings of inputs;

Decoders generate outputs token-by-token while attending to both prior outputs and the encoder.

What does “self-attention” mean?

Self-attention means a sequence attends to itself so each token can gather context from other tokens in the same sequence.

Architecture (encoder/decoder/blocks)

What is the canonical Transformer block?

A block stacks Multi-Head Self-Attention (MHSA), a residual connection, layer normalization, and a position-wise feed-forward network (FFN), typically repeated N times.
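
For concreteness, a minimal pre-norm block sketch in PyTorch (the class name, sizes, and the use of nn.MultiheadAttention are illustrative assumptions, not part of the original cards):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm variant: LayerNorm -> MHSA -> residual add, LayerNorm -> FFN -> residual add."""
    def __init__(self, d_model=512, n_heads=8, ffn_mult=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                      # position-wise feed-forward network
            nn.Linear(d_model, ffn_mult * d_model),
            nn.GELU(),
            nn.Linear(ffn_mult * d_model, d_model),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.drop(attn_out)                    # residual connection around attention
        x = x + self.drop(self.ffn(self.norm2(x)))     # residual connection around the FFN
        return x

x = torch.randn(2, 16, 512)                            # (batch, seq_len, d_model)
print(TransformerBlock()(x).shape)                     # torch.Size([2, 16, 512]); stack N such blocks
```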

How do encoder and decoder differ?

Encoder blocks use full self-attention;

Decoder blocks use masked self-attention (to prevent peeking ahead) plus cross-attention over encoder outputs.

What are residual connections for?

Residuals help gradient flow and stabilize deep networks by adding the block input to the block output.

Why is layer normalization used?

LayerNorm stabilizes activations and gradients within each token’s feature vector, improving training stability.

What does the feed-forward network (FFN) do?

The FFN applies two linear layers with a nonlinearity (e.g., GELU/ReLU) to each token independently, increasing model capacity.

Attention mechanics (Q/K/V)

What are Query, Key, and Value?

The model projects token embeddings into Queries (what I’m looking for), Keys (what I offer), and Values (the information to aggregate).

How is scaled dot-product attention computed?

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V — similarity scores QKᵀ are scaled by √dₖ, passed through a softmax to form attention weights, and used to take a weighted sum of the Values.
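
A from-scratch sketch of this computation in PyTorch (function and tensor names are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k). Returns (batch, heads, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)           # (B, h, n, n) similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                     # each row sums to 1
    return weights @ v                                          # weighted average of the Values

q = k = v = torch.randn(1, 8, 10, 64)
print(scaled_dot_product_attention(q, k, v).shape)              # torch.Size([1, 8, 10, 64])
```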

Why “multi-head” attention?

Multiple heads learn different relationships in parallel (e.g., syntax vs semantics), then their outputs are concatenated and mixed.

What is masked attention and why is it needed?

A causal mask ensures token t only attends to ≤ t, preventing information leakage during generation.
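
A small sketch of building such a mask with torch.tril (the boolean convention, True = may attend, is an assumption matching the attention sketch above):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position t may attend only to positions <= t."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```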

What is cross-attention?

Cross-attention uses the decoder’s Queries with the encoder’s Keys/Values so generated tokens can attend to the encoded input.

Positional information

Why do we need positional encodings?

Because attention is permutation-invariant; positional encodings inject order information so the model knows token positions.

What are common positional schemes?

Sinusoidal (fixed), learned absolute, and relative/rotary (RoPE) encodings; relative/rotary often generalize better to longer contexts.
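
As an example of the fixed scheme, a sketch of the original sinusoidal encoding (the helper name and shapes are illustrative):

```python
import math
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encodings, shape (seq_len, d_model), d_model assumed even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)              # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                          # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

# Typically added to the token embeddings before the first block:
# x = token_embeddings + sinusoidal_positions(x.size(1), x.size(2))
```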

Tokenization & embeddings

How does tokenization work here?

Subword tokenizers (e.g., BPE/WordPiece) split text into units that balance vocabulary size with coverage, then map tokens to embeddings.

What is weight tying?

The input embedding matrix is reused as the output projection, reducing parameters and sometimes improving performance.
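
A minimal sketch of tying in PyTorch (the sizes are illustrative):

```python
import torch.nn as nn

vocab_size, d_model = 32000, 512                        # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)               # input embedding: token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)    # output projection: vector -> logits
lm_head.weight = embed.weight                           # weight tying: both share one (vocab, d_model) matrix
```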

Training & Objectives

How are encoder-only models trained?

They are often trained with masked-language modeling (MLM), predicting randomly masked tokens from context.

How are decoder-only models trained?

They use causal language modeling (CLM), predicting the next token given previous tokens.
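
A sketch of this objective, assuming the model returns logits of shape (batch, seq_len, vocab); shifting the targets by one position is what implements next-token prediction with ground-truth prefixes (teacher forcing and the loss are covered in the next two cards):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); token_ids: (batch, seq_len).
    Position t is trained to predict token t+1."""
    pred = logits[:, :-1, :]        # predictions at positions 0..n-2
    target = token_ids[:, 1:]       # the "next token" for each of those positions
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

logits = torch.randn(2, 8, 100)                 # fake model output over a 100-token vocabulary
tokens = torch.randint(0, 100, (2, 8))
print(causal_lm_loss(logits, tokens))           # a scalar cross-entropy value
```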

What is teacher forcing?

During training, the ground-truth previous tokens are fed to the decoder instead of its own predictions to stabilize learning.

What loss is typically used?

Cross-entropy loss over the next-token distribution is standard.

Scaling & Efficiency

What is kv-cache during inference?

It stores Keys/Values of past tokens so each new token reuses them, speeding autoregressive decoding.
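
A minimal sketch of the idea (the class layout is an assumption, not any specific library's cache API):

```python
import torch

class KVCache:
    """Appends each new token's Keys/Values so past projections are never recomputed."""
    def __init__(self):
        self.k = None   # (batch, heads, seen_len, d_k)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new/v_new: (batch, heads, 1, d_k) for the single token being decoded.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v   # attention for the new token runs against the full cache
```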

What is quantization?

Quantization stores weights/activations in lower precision (e.g., INT8/INT4) to reduce memory and increase speed with minimal accuracy loss.
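
A toy sketch of symmetric per-tensor INT8 quantization to illustrate the idea; production schemes (per-channel scales, INT4 packing, calibration) are more involved:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization: store int8 values plus one floating-point scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale            # approximate reconstruction used at compute time

w = torch.randn(4, 4)
q, s = quantize_int8(w)
print((w - dequantize(q, s)).abs().max())   # small quantization error
```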

Practical Usage

When do you choose encoder-only vs decoder-only?

Use encoder-only for understanding tasks (classification, retrieval); decoder-only for generation (chat, writing code); encoder-decoder for seq2seq (translation, summarization).

How does fine-tuning differ from prompt-only use?

Fine-tuning updates parameters for a task; prompting keeps the model fixed and steers behavior via context.

What are adapters/LoRA at a high level?

They add small trainable modules or low-rank updates on top of frozen backbones to achieve task adaptation with far fewer trainable parameters.
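
A sketch of a LoRA-style wrapper around a frozen linear layer (the names, rank, and scaling convention are illustrative):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W_frozen(x) + (alpha / r) * B(A(x)); only the small A and B matrices are trained."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                             # freeze the backbone layer
        self.A = nn.Linear(base.in_features, r, bias=False)     # low-rank down-projection
        self.B = nn.Linear(r, base.out_features, bias=False)    # low-rank up-projection
        nn.init.zeros_(self.B.weight)                           # start as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 8192 trainable vs ~262k frozen parameters
```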

Common pitfalls & Stabilization

Why do Transformers sometimes “stall” in training?

Poor learning rates, initialization, or normalization can cause instability; optimizers like AdamW and warmup schedules help.

What is gradient clipping and why use it?

Clipping caps gradient norms to prevent exploding gradients, improving training stability.
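
Usage sketch with PyTorch's built-in utility (the tiny stand-in model and max_norm value are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                       # stand-in for a Transformer
loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm
```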

Why use dropout/attention dropout?

They regularize the model, reducing overfitting by randomly zeroing parts of activations/attention weights.

Why divide by √d_k in attention scores?

It keeps dot-products in a stable range so softmax does not saturate, which stabilizes training.

General Specifics

How do sampling strategies affect outputs?

Greedy/beam search improve determinism but reduce diversity; top-k/top-p sampling increases diversity at the cost of possible errors.
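
A sketch of top-k plus nucleus (top-p) filtering over next-token logits (the thresholds and names are illustrative):

```python
import torch

def sample_next_token(logits, top_k=50, top_p=0.9, temperature=1.0):
    """logits: (vocab,). Keep the top-k tokens and the smallest set whose cumulative
    probability exceeds top_p, then sample from the renormalized distribution."""
    logits = logits / temperature
    kth = torch.topk(logits, top_k).values[-1]
    logits = logits.masked_fill(logits < kth, float("-inf"))    # top-k filter
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    probs[sorted_idx[mass_before > top_p]] = 0.0                # top-p (nucleus) filter
    probs = probs / probs.sum()
    return int(torch.multinomial(probs, num_samples=1))

print(sample_next_token(torch.randn(1000)))   # greedy decoding would instead take argmax
```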

Why does context length matter?

Longer context lets the model condition on more prior information, but increases compute and memory; extrapolation beyond training length can degrade quality.

Evaluation & safety (interview-ready)

How do you evaluate a Transformer on NLP tasks?

Use task-appropriate metrics: accuracy/F1 for classification, BLEU/ROUGE for generation, perplexity for language modeling, and human eval for quality/safety.

What causes hallucinations and how to mitigate them?

Hallucinations arise when the model confidently generates unsupported content; mitigations include retrieval-augmented generation, better prompts, and post-hoc verification.

Implementation Knobs

Why use GELU over ReLU in FFNs?

GELU often improves performance in language models due to smoother gating around zero.

What is pre-norm vs post-norm?

Pre-norm applies LayerNorm before attention/FFN and is more stable for deep stacks; post-norm normalizes after sublayers.

Why increase model depth vs width?

Depth often improves hierarchical abstraction; width increases per-layer capacity. The best choice depends on data/compute budgets.

What is weight decay (AdamW) doing here?

It regularizes by penalizing large weights, improving generalization without entangling with Adam’s moment updates.

Modern variants

What is a Vision Transformer (ViT)?

ViT splits images into patches, embeds them as tokens, and applies the standard Transformer encoder for image classification.
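
A sketch of the patch-embedding step using a strided convolution (sizes follow the common 224×224 image / 16×16 patch setup; names are illustrative):

```python
import torch
import torch.nn as nn

patch, d_model = 16, 768
to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)  # each 16x16 patch -> one token

img = torch.randn(1, 3, 224, 224)
tokens = to_patches(img).flatten(2).transpose(1, 2)   # (1, 196, 768): 14x14 grid of patch tokens
print(tokens.shape)                                   # these tokens feed a standard encoder
```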

What is an encoder-decoder LLM (e.g., T5) vs decoder-only LLM (e.g., GPT)?

T5 uses seq2seq with MLM-style objectives, good for conditioned generation; GPT uses causal decoding, strong for open-ended generation.

What is Rotary Positional Embedding (RoPE)?

RoPE encodes relative positions by rotating Q/K in complex space, enabling better length generalization and extrapolation.
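
A sketch of applying a rotary embedding to Q (and likewise K) before attention; this uses the split-half pairing convention, one of the common variants, and names/shapes are illustrative:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (batch, heads, seq_len, d_head), d_head even. Pairs of dimensions are rotated by an
    angle that grows with position, so Q·K ends up depending on the relative offset."""
    b, h, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(n, dtype=torch.float32)[:, None] * freqs[None, :]     # (n, half)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]          # split dimensions into rotation pairs
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 16, 64)
print(rope(q).shape)   # torch.Size([1, 8, 16, 64]); applied to Q and K, not to V
```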

What is multi-query attention (MQA)?

MQA shares Keys/Values across heads while keeping separate Queries, reducing memory and speeding up decoding with minimal quality loss.

Foundations (Deep Dive)

What core limitation of RNNs/LSTMs do Transformers fix?

They remove the sequential dependency during training, so long-range context is learned with parallel matrix ops instead of step-by-step recurrence.

Why does “self-attention” scale better conceptually than recurrence?

Because each token directly attends to all other tokens in one shot, avoiding vanishing dependencies across many time steps.

What is the central object a Transformer learns?

It learns a contextual embedding for each token such that meaning depends on both the token and its surrounding context via attention.

Why is attention described as “content-based addressing”?

Queries retrieve Values by matching to Keys, so tokens fetch information based on semantic similarity rather than fixed positions.

Why use the softmax in attention?

Softmax converts similarity scores into a probability distribution, emphasizing a few relevant tokens while keeping gradients stable.

Why divide by √dₖ in attention?

It keeps dot-products in a numerically stable range so the softmax does not saturate early in training.

What is the learning signal in language modeling?

Cross-entropy between the predicted token distribution and the ground-truth next (or masked) token provides gradients to all layers.

Why do Transformers need positional information at all?

Self-attention is permutation-invariant; positional encodings inject order so the model can distinguish “cat chased dog” from “dog chased cat.”

How does multi-head attention increase capacity without huge cost?

It splits the hidden size across heads, letting each head specialize in different relations, then concatenates and mixes them with a linear layer.

When do we prefer encoder-decoder over decoder-only?

We prefer encoder-decoder when the task is explicitly sequence-to-sequence (e.g., translation, summarization with long sources) because cross-attention cleanly conditions on the encoded input.

What does “causal” mean in decoder-only LMs?

Causal means each position can only attend to previous positions, enabling next-token prediction without leaking future information.

What is weight tying and why use it?

It reuses the input embedding matrix for the output projection to reduce parameters and often improve perplexity.

What is the kv-cache and why is it crucial for generation?

It stores past Keys/Values so decoding a new token only computes attention against cached states instead of recomputing all previous layers.

Architecture (Anatomy + Shapes)

What is the minimal Transformer block recipe?

[LayerNorm] → Multi-Head Self-Attention → Residual add → [LayerNorm] → Feed-Forward (two linear layers with activation) → Residual add. (Pre-norm shown.)

Why pre-norm over post-norm in modern LLMs?

Pre-norm improves gradient flow in very deep stacks and is empirically more stable.

What does the FFN actually do?

It expands the hidden dimension (e.g., 4×) with a nonlinearity (GELU/ReLU) and projects back, acting as a per-token MLP to increase representational power.

What are typical tensor shapes in attention?

For batch B, length n, hidden d, heads h, head dim d_h = d/h:


Embeddings: (B, n, d) → Q/K/V: (B, h, n, d_h) → scores: (B, h, n, n) → output: (B, n, d).
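
The same shape flow as a quick PyTorch walk-through (one tensor is reused for Q/K/V purely to check shapes):

```python
import torch

B, n, d, h = 2, 16, 512, 8
d_h = d // h

x = torch.randn(B, n, d)                      # embeddings: (B, n, d)
q = x.view(B, n, h, d_h).transpose(1, 2)      # per-head Q (and K, V): (B, h, n, d_h)
scores = q @ q.transpose(-2, -1)              # attention scores: (B, h, n, n)
out = scores.softmax(dim=-1) @ q              # per-head outputs: (B, h, n, d_h)
out = out.transpose(1, 2).reshape(B, n, d)    # heads concatenated back: (B, n, d)
print(scores.shape, out.shape)
```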

What is the output projection after multi-head attention?

Concatenated head outputs (B, n, h·d_h) are linearly projected back to (B, n, d) to mix information across heads.

How do encoder vs decoder blocks differ structurally?

Encoders: self-attention + FFN.


Decoders: masked self-attention + cross-attention (to encoder) + FFN.

What is cross-attention wiring?

Decoder Queries attend to Keys/Values produced by the encoder, allowing the decoder to condition on the source sequence.

What does LayerNorm normalize?

It normalizes features within each token vector (mean/variance across the hidden dimension) to stabilize activations.

Why choose GELU over ReLU in FFNs?

GELU provides smoother gating around zero and empirically improves language model quality.

What is Multi-Query Attention (MQA) and why use it?

MQA shares Keys/Values across heads while keeping separate Queries, reducing memory and latency in decoding with small quality loss.

What is Grouped-Query Attention (GQA)?

GQA is a middle ground where Keys/Values are shared within groups of heads, balancing speed and quality.

Where do positional encodings plug in?

They are added or applied to token embeddings (absolute) or to Q/K (relative/rotary) before computing attention.

What does the final LM head do?

It projects the decoder’s hidden states to vocabulary logits; with weight tying, it shares weights with the input embeddings.

Attention Mechanics (Key Terms)

What is a Query?

A learned linear projection that represents “what information this token is seeking.”

What is a Key?

A learned linear projection that represents “what information this token offers.”

What is a Value?

A learned linear projection that carries the actual information aggregated by attention.

What is a Scaled Dot-Product Attention?

Compute scores = QKᵀ/√dₖ → softmax → weighted sum over V.

What is a Head?

One parallel attention subspace with its own Q/K/V projections; multiple heads run in parallel.

What is a Residual Stream?

The pathway carrying the token representations through the stack; sublayers add to it via skip connections.

What is Pre-LN vs Post-LN?

Pre-LN normalizes inputs to sublayers; Post-LN normalizes outputs. Pre-LN is more stable for deep models.

What is a Causal Mask?

A binary mask that blocks future positions to ensure valid next-token prediction.
