Transformers Oral Exam Study Set

24 Terms

1
what does vocab_size represent in the embedding layer?
it represents the number of unique tokens the model can embed, typically set by the tokenizer's vocabulary
2
what does embed_dim represent in the embedding layer?
it refers to the length of each embedding vector and is set as a model hyperparameter
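A minimal sketch of how these two values appear in a PyTorch embedding layer; the concrete numbers are illustrative, except that 512 matches the model dimension used in later cards:

```python
import torch
import torch.nn as nn

vocab_size = 30000   # illustrative: set by the tokenizer's vocabulary
embed_dim = 512      # model hyperparameter

# One trainable row per token id, each row of length embed_dim.
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[5, 17, 42]])   # (batch=1, seq_len=3)
vectors = embedding(token_ids)            # (1, 3, 512)
```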
3
how are embeddings initialized during training?
they are initialized randomly unless pretrained embeddings are loaded, and are updated as the model learns
4
why do transformers need positional embeddings?
because self-attention is permutation-invariant, so the model has no inherent sense of token order; positional embeddings inject that sequence information
5
are sinusoidal positional embeddings trainable?
no, they are fixed and not updated during training
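A sketch of fixed sinusoidal positional encodings, registered as a buffer so they are saved with the model but never updated by the optimizer (dimensions chosen to match the 512-d example):

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, embed_dim=512, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2) * (-math.log(10000.0) / embed_dim)
        )
        pe = torch.zeros(max_len, embed_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer is excluded from parameters(), so it carries no gradients.
        self.register_buffer("pe", pe)

    def forward(self, x):                 # x: (batch, seq_len, embed_dim)
        return x + self.pe[: x.size(1)]
```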
6
what happens when the same token appears in the same position in two sequences?
the final embedding is the same since both word and positional embeddings are identical
7
why use multiple attention heads in a transformer?
to capture diverse relationships between tokens in parallel, improving representation power
8
how many attention heads are used in the given implementation?
8 attention heads are used
9
what is the dimensionality of each attention head in a 512d embedding?
64, since 512 is split equally among 8 heads
10
how many sets of Q, K, and V are logically created for multi-head attention?
8 sets logically, but in practice a single set of Q, K, and V projection matrices is implemented and its output is split across the heads
11
how can shared Q, K, V projections still support diverse attention?
because each head processes a separate slice and learns patterns independently
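A sketch of the single-projection-then-split pattern for the 512-d / 8-head setup described above (simplified: no output projection, masking, or dropout):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
head_dim = embed_dim // num_heads        # 64 per head

# One linear layer each for Q, K, V over the full 512 dimensions.
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)

x = torch.randn(2, 10, embed_dim)        # (batch, seq_len, embed_dim)

def split_heads(t):
    # (batch, seq_len, 512) -> (batch, 8, seq_len, 64): each head gets its own slice
    b, s, _ = t.shape
    return t.view(b, s, num_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (batch, 8, seq, seq)
attn_out = scores.softmax(dim=-1) @ v                # (batch, 8, seq, 64)
```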
12
what does Add & Norm do in a transformer block?
it adds residual connections and applies layer normalization to stabilize and preserve information
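A sketch of the residual-plus-LayerNorm wrapper (post-norm variant, as in the original paper); `sublayer_output` stands for the output of attention or the feedforward network:

```python
import torch.nn as nn

class AddNorm(nn.Module):
    def __init__(self, embed_dim=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_output):
        # Residual connection preserves the original signal;
        # LayerNorm stabilizes the combined activations.
        return self.norm(x + self.dropout(sublayer_output))
```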
13
what is the structure of the transformer feedforward network?
two linear layers with ReLU in between, typically sized 512 → 2048 → 512
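The position-wise feedforward block as a short sketch with the 512 → 2048 → 512 sizing mentioned above:

```python
import torch.nn as nn

feed_forward = nn.Sequential(
    nn.Linear(512, 2048),   # expand
    nn.ReLU(),
    nn.Linear(2048, 512),   # project back to the model dimension
)
```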
14
why are the same vectors used as Q, K, and V in the encoder?
because the encoder uses self-attention to relate tokens within the same input
15
how many encoder blocks are in the original transformer?
6 blocks, each with its own parameters, stacked sequentially
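PyTorch's built-in modules express the same stacking; this sketch uses the sizes from the earlier cards, and `nn.TransformerEncoder` gives each of the 6 layers its own copy of the parameters:

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # 6 independent blocks
```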
16
why is a look-ahead mask used in the decoder?
to prevent attention to future tokens during training and ensure autoregressive behavior
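A sketch of a look-ahead (causal) mask built with torch.triu: positions above the diagonal are set to -inf so a token cannot attend to later tokens:

```python
import torch

seq_len = 5
# True above the diagonal marks "future" positions to be blocked.
look_ahead = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.randn(seq_len, seq_len)                   # raw attention scores
masked = scores.masked_fill(look_ahead, float("-inf"))   # -inf -> 0 after softmax
weights = masked.softmax(dim=-1)
```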
17
what is cross-attention in the decoder for?
it lets the decoder attend to the relevant parts of the encoder output, so generation is conditioned on the source sequence
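A sketch of the cross-attention shapes: queries come from the decoder states, while keys and values come from the encoder output (single head, no projections, for brevity):

```python
import torch

encoder_out = torch.randn(1, 12, 512)    # (batch, src_len, embed_dim)
decoder_state = torch.randn(1, 7, 512)   # (batch, tgt_len, embed_dim)

q = decoder_state                        # queries from the decoder
k = v = encoder_out                      # keys and values from the encoder

scores = q @ k.transpose(-2, -1) / 512 ** 0.5   # (1, tgt_len, src_len)
context = scores.softmax(dim=-1) @ v            # (1, tgt_len, embed_dim)
```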
18
what does the final linear layer in the decoder do?
it projects the decoder output to vocabulary size, producing a score (logit) for every token so the model can choose the next one
19
why use softmax after the final decoder layer?
to convert scores into probabilities for each possible next token
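A sketch of the output head: project decoder states to vocabulary-sized logits, then softmax over that axis to get next-token probabilities (greedy decoding shown as one simple choice; sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 30000, 512
output_proj = nn.Linear(embed_dim, vocab_size)

decoder_out = torch.randn(1, 7, embed_dim)   # (batch, tgt_len, embed_dim)
logits = output_proj(decoder_out)            # (1, 7, vocab_size): one score per token
probs = logits.softmax(dim=-1)               # probabilities over the vocabulary
next_token = probs[:, -1].argmax(dim=-1)     # greedy pick for the last position
```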
20
how is the encoder output used in the decoder?
it’s passed as both the keys and the values to the cross-attention layer of every decoder block
21
what does model.eval() do?
it switches the model to evaluation mode, disabling training behaviors like dropout
22
why use torch.no_grad() during inference?
to save memory and computation by disabling gradient tracking
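The two usually appear together at inference time; a minimal sketch with a stand-in model (the layers here are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512), nn.Dropout(0.1), nn.ReLU(), nn.Linear(512, 30000)
)

model.eval()                  # dropout switches to inference behavior
with torch.no_grad():         # no computation graph is recorded: less memory, less compute
    logits = model(torch.randn(1, 512))
```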
23
what happens to model parameters during inference vs. fine-tuning?
they stay fixed during inference but are updated during fine-tuning
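A sketch contrasting the two, assuming a small stand-in model: inference never calls an optimizer, while fine-tuning runs the usual backward/step loop, often with some layers frozen via requires_grad:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Inference: parameters are read, never written.
with torch.no_grad():
    preds = model(torch.randn(4, 512))

# Fine-tuning: optionally freeze early layers, then update the rest.
for p in model[0].parameters():
    p.requires_grad = False               # frozen layer stays fixed
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

loss = nn.functional.cross_entropy(
    model(torch.randn(4, 512)), torch.randint(0, 10, (4,))
)
loss.backward()
optimizer.step()                          # only unfrozen parameters change
```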
24
why fine-tune instead of training from scratch on small datasets?
because a small dataset isn’t enough to train a model effectively from scratch, while fine-tuning leverages the knowledge already learned during pretraining