Transformers Oral Exam Study Set

44 Terms

1
what does vocab_size represent in the embedding layer?
it represents the number of unique tokens the model can embed, typically determined during tokenization
2
what does embed_dim represent in the embedding layer?
it refers to the length of each embedding vector and is set as a model hyperparameter
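
a minimal sketch of the embedding lookup from the two cards above, assuming PyTorch; the vocab_size and embed_dim values here are illustrative placeholders:

    import torch
    import torch.nn as nn

    vocab_size = 30000   # illustrative: number of unique tokens from the tokenizer
    embed_dim = 512      # illustrative: length of each embedding vector (hyperparameter)

    embedding = nn.Embedding(vocab_size, embed_dim)

    token_ids = torch.tensor([[5, 124, 7892]])   # one sequence of 3 token ids
    vectors = embedding(token_ids)               # shape: (1, 3, 512)
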
3
how are embeddings initialized during training?
they are initialized randomly unless pretrained, and are updated as the model learns
4
why do transformers need positional embeddings?
because transformers lack inherent order awareness, and positional embeddings inject sequence information
5
are sinusoidal positional embeddings trainable?
no, they are fixed and not updated during training
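
a sketch of fixed sinusoidal positional embeddings, assuming the sine/cosine formulation from the original paper; nothing below is an nn.Parameter, so the table is never updated during training:

    import math
    import torch

    def sinusoidal_positional_encoding(max_len, d_model):
        position = torch.arange(max_len).unsqueeze(1)    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)     # even dimensions: sine
        pe[:, 1::2] = torch.cos(position * div_term)     # odd dimensions: cosine
        return pe                                        # fixed table, (max_len, d_model)

    pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
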
6
what happens when the same token appears in the same position in two sequences?
the final embedding is the same since both word and positional embeddings are identical
7
why use multiple attention heads in a transformer?
to capture diverse relationships between tokens in parallel, improving representation power
8
how many attention heads are used in the given implementation?
8 attention heads are used
9
what is the dimensionality of each attention head in a 512d embedding?
64, since 512 is split equally among 8 heads
10
how many sets of Q, K, and V are logically created for multi-head attention?
8 sets logically, but only one set of Q, K, and V matrices is implemented and then split
11
how can shared Q, K, V projections still support diverse attention?
because each head processes a separate slice and learns patterns independently
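
a minimal sketch of the "one set of Q, K, V, then split into heads" idea from the cards above, assuming PyTorch, embed_dim=512, and 8 heads of 64 dimensions each:

    import torch
    import torch.nn as nn

    embed_dim, num_heads = 512, 8
    head_dim = embed_dim // num_heads            # 64

    w_q = nn.Linear(embed_dim, embed_dim)        # single Q projection over all 512 dims
    w_k = nn.Linear(embed_dim, embed_dim)        # single K projection
    w_v = nn.Linear(embed_dim, embed_dim)        # single V projection

    x = torch.randn(2, 10, embed_dim)            # (batch, seq_len, embed_dim)

    def split_heads(t):
        # (batch, seq_len, 512) -> (batch, 8, seq_len, 64): each head gets its own slice
        b, s, _ = t.shape
        return t.view(b, s, num_heads, head_dim).transpose(1, 2)

    q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5    # (batch, 8, seq, seq)
    attn = scores.softmax(dim=-1) @ v                     # (batch, 8, seq, 64)
    out = attn.transpose(1, 2).reshape(2, 10, embed_dim)  # heads concatenated back to 512
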
12
what does Add & Norm do in a transformer block?
it adds residual connections and applies layer normalization to stabilize and preserve information
13
what is the structure of the transformer feedforward network?
two linear layers with ReLU in between, typically sized 512 → 2048 → 512
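
a sketch combining the Add & Norm and feed-forward cards above, assuming PyTorch and the post-norm layout of the original transformer:

    import torch
    import torch.nn as nn

    d_model = 512
    feed_forward = nn.Sequential(
        nn.Linear(d_model, 2048),   # expand 512 -> 2048
        nn.ReLU(),
        nn.Linear(2048, d_model),   # project back 2048 -> 512
    )
    norm = nn.LayerNorm(d_model)

    x = torch.randn(2, 10, d_model)
    out = norm(x + feed_forward(x))   # Add (residual) & Norm around the sublayer
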
14
why are the same vectors used as Q, K, and V in the encoder?
because the encoder uses self-attention to relate tokens within the same input
15
how many encoder blocks are in the original transformer?
6 blocks, each with its own parameters, stacked sequentially
16
why is a look-ahead mask used in the decoder?
to prevent attention to future tokens during training and ensure autoregressive behavior
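
a sketch of a look-ahead (causal) mask, assuming PyTorch; positions above the diagonal are blocked before the softmax so no token can attend to later tokens:

    import torch

    seq_len = 5
    # True above the diagonal marks the "future" positions to hide
    look_ahead = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

    scores = torch.randn(seq_len, seq_len)                  # raw attention scores
    masked = scores.masked_fill(look_ahead, float("-inf"))  # blocked positions -> -inf
    weights = masked.softmax(dim=-1)                        # future tokens get weight 0
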
17
what is cross-attention in the decoder for?
it allows the decoder to focus on relevant encoder outputs for generating context-aware outputs
18
what does the final linear layer in the decoder do?
it projects the output to vocabulary size, allowing the model to score and choose the next token
19
why use softmax after the final decoder layer?
to convert scores into probabilities for each possible next token
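
a sketch of the final projection and softmax over the vocabulary, assuming PyTorch; d_model and vocab_size here are illustrative:

    import torch
    import torch.nn as nn

    d_model, vocab_size = 512, 30000
    to_vocab = nn.Linear(d_model, vocab_size)      # final linear layer of the decoder

    decoder_output = torch.randn(1, 7, d_model)    # (batch, seq_len, d_model)
    logits = to_vocab(decoder_output)              # (1, 7, vocab_size): a score per token
    probs = logits.softmax(dim=-1)                 # probabilities over the vocabulary
    next_token = probs[:, -1].argmax(dim=-1)       # greedy pick for the next token
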
20
how is the encoder output used in the decoder?
it’s passed as both key and value in cross-attention layers for every decoder block
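
a sketch of the cross-attention wiring, assuming PyTorch's nn.MultiheadAttention: queries come from the decoder, keys and values come from the encoder output:

    import torch
    import torch.nn as nn

    cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    decoder_states = torch.randn(1, 7, 512)    # queries: current decoder representations
    encoder_output = torch.randn(1, 12, 512)   # keys and values: encoder output

    out, weights = cross_attn(query=decoder_states,
                              key=encoder_output,
                              value=encoder_output)   # out: (1, 7, 512)
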
21
what does model.eval() do?
it switches the model to evaluation mode, disabling training behaviors like dropout
22
why use torch.no_grad() during inference?
to save memory and computation by disabling gradient tracking
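
a sketch of the inference pattern from the two cards above, assuming PyTorch; the tiny linear model and inputs are hypothetical stand-ins, not a real transformer:

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 10)     # stand-in for a trained model
    inputs = torch.randn(1, 512)   # stand-in input batch

    model.eval()                   # evaluation mode: disables dropout and similar behavior
    with torch.no_grad():          # no gradient tracking: saves memory and computation
        output = model(inputs)     # parameters are used but never updated
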
23
what happens to model parameters during inference vs. fine-tuning?
they stay fixed during inference but are updated during fine-tuning
24
why fine-tune instead of training from scratch on small datasets?
because a small dataset is rarely enough to train a model effectively from scratch, while fine-tuning leverages the knowledge already stored in the pretrained weights
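
a sketch of one common fine-tuning pattern, assuming PyTorch; the pretrained body here is a hypothetical stand-in for a loaded transformer:

    import torch
    import torch.nn as nn

    pretrained_body = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # stand-in backbone
    task_head = nn.Linear(512, 2)      # new head for the small downstream dataset

    for p in pretrained_body.parameters():
        p.requires_grad = False        # keep the pretrained knowledge frozen

    # only the head is passed to the optimizer, so only it is updated
    optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-4)
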
25
what is the main innovation of the transformer architecture?
it removes recurrence and convolutions, using only attention mechanisms
26
how is the transformer structured?
it follows an encoder-decoder architecture with stacked layers of attention and feed-forward networks
27
how many layers are in the transformer encoder?
6 identical layers
28
how many layers are in the transformer decoder?
6 identical layers
29
what are the two main components of each encoder layer?
multi-head self-attention and a feed-forward network
30
what additional component does each decoder layer have?
a cross-attention layer over the encoder output
31
why is masking used in the decoder?
to prevent attending to future tokens and preserve auto-regressive behavior
32
what is scaled dot-product attention?
it is a mechanism that computes attention scores by scaling dot products of queries and keys
33
why do we scale the dot product in attention?
to prevent large values that push softmax into regions with very small gradients
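
a sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, assuming PyTorch:

    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # scaling keeps softmax gradients usable
        return scores.softmax(dim=-1) @ v

    q = k = v = torch.randn(1, 5, 64)              # (batch, seq_len, d_k)
    out = scaled_dot_product_attention(q, k, v)    # (1, 5, 64)
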
34
what is multi-head attention?
it applies attention multiple times in parallel with different learned projections and then combines the results
35
how many heads are used in multi-head attention?
8 heads
36
what are the advantages of multi-head attention?
it allows the model to attend to information from different representation subspaces simultaneously
37
what is the feed-forward network in each layer?
two linear layers with a ReLU activation in between, applied independently to each position
38
what is the input and output size of the feed-forward network?
both are 512, with an inner layer of size 2048
39
what is the role of the embedding layer?
to convert tokens into vectors of dimension 512
40
what is positional encoding used for?
to inject order information into the model since it lacks recurrence and convolution
41
how is positional encoding implemented?
using fixed sinusoidal functions of different frequencies
42
why were sinusoidal positional encodings chosen?
they let the model attend by relative position, since the encoding at any fixed offset is a linear function of the current position's encoding, and they may generalize to sequence lengths longer than those seen during training
43
how is self-attention different from convolution?
self-attention connects all positions with a constant number of sequential operations, while a single convolutional layer with kernel width smaller than the sequence length does not connect all pairs of positions and needs a stack of layers to do so
44
what makes self-attention efficient?
it enables parallelization and has shorter paths for learning long-range dependencies