Transformers Oral Exam Study Set

24 Terms
1. What does vocab_size represent in the embedding layer?
It is the number of unique tokens the model can embed, typically determined by the tokenizer's vocabulary.

2. What does embed_dim represent in the embedding layer?
It is the length of each embedding vector, set as a model hyperparameter.
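
A minimal sketch of these two parameters using PyTorch's nn.Embedding; the sizes are illustrative, with 512 matching the original Transformer's embedding width:

```python
import torch
import torch.nn as nn

vocab_size = 30_000   # unique tokens the tokenizer can produce (illustrative)
embed_dim = 512       # length of each embedding vector (hyperparameter)

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)

token_ids = torch.tensor([[5, 102, 7]])   # (batch=1, seq_len=3)
vectors = embedding(token_ids)            # shape: (1, 3, 512)
print(vectors.shape)
```
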
3. How are embeddings initialized during training?
They are initialized randomly unless pretrained embeddings are loaded, and they are updated as the model learns.

4. Why do transformers need positional embeddings?
Because transformers have no inherent notion of token order; positional embeddings inject sequence information.

5. Are sinusoidal positional embeddings trainable?
No. They are fixed values and are not updated during training.
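
A sketch of the fixed sinusoidal table from "Attention Is All You Need"; because it is built as a plain tensor rather than an nn.Parameter, nothing in it receives gradients:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, embed_dim: int) -> torch.Tensor:
    """Build the (max_len, embed_dim) sinusoidal table."""
    position = torch.arange(max_len).unsqueeze(1)          # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2) * (-math.log(10000.0) / embed_dim)
    )                                                       # 10000^(-2i/d)
    pe = torch.zeros(max_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)            # even indices
    pe[:, 1::2] = torch.cos(position * div_term)            # odd indices
    return pe                                               # fixed, not a Parameter

pe = sinusoidal_positional_encoding(max_len=512, embed_dim=512)
print(pe.requires_grad)  # False: the table stays constant during training
```
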
6. What happens when the same token appears in the same position in two sequences?
The final embedding is identical, since both the word embedding and the positional embedding are the same.

7. Why use multiple attention heads in a transformer?
To capture diverse relationships between tokens in parallel, improving representational power.

8. How many attention heads are used in the given implementation?
8 attention heads.

9. What is the dimensionality of each attention head in a 512-dimensional embedding?
64, since 512 is split equally among 8 heads (512 / 8 = 64).

10. How many sets of Q, K, and V are logically created for multi-head attention?
8 sets logically, but in practice a single set of Q, K, and V projection matrices is implemented and its output is split across heads, as sketched below.
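
A sketch of that split, assuming 8 heads over a 512-dimensional embedding (so head_dim = 512 / 8 = 64); one projection each for Q, K, and V is shared, then reshaped so every head attends over its own 64-dimensional slice:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
head_dim = embed_dim // num_heads          # 64

# One projection each for Q, K, V, shared across all heads
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)

x = torch.randn(2, 10, embed_dim)          # (batch, seq_len, embed_dim)
q = q_proj(x)                              # (2, 10, 512)

# Split the 512 channels into 8 logical heads of 64 dims each
q_heads = q.view(2, 10, num_heads, head_dim).transpose(1, 2)  # (2, 8, 10, 64)
print(q_heads.shape)
```
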
11. How can shared Q, K, V projections still support diverse attention?
Because each head processes a separate slice of the projected vectors and learns its own patterns independently.

12. What does Add & Norm do in a transformer block?
It adds a residual connection and applies layer normalization, which stabilizes training and preserves the original signal.
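
A minimal Add & Norm sketch, assuming the post-norm arrangement of the original paper (normalize after adding the residual); the sublayer output could come from attention or the feedforward network:

```python
import torch
import torch.nn as nn

embed_dim = 512
norm = nn.LayerNorm(embed_dim)

def add_and_norm(x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
    # Residual addition preserves the input signal; LayerNorm stabilizes activations
    return norm(x + sublayer_out)

x = torch.randn(2, 10, embed_dim)
out = add_and_norm(x, torch.randn_like(x))   # same shape as x
```
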
13. What is the structure of the transformer feedforward network?
Two linear layers with a ReLU in between, typically sized 512 → 2048 → 512.
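
In PyTorch this position-wise network is a short nn.Sequential; the 512 → 2048 → 512 sizes follow the original paper:

```python
import torch
import torch.nn as nn

feedforward = nn.Sequential(
    nn.Linear(512, 2048),   # expand
    nn.ReLU(),              # nonlinearity
    nn.Linear(2048, 512),   # project back to the model width
)

x = torch.randn(2, 10, 512)
print(feedforward(x).shape)   # (2, 10, 512): applied independently at each position
```
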
14. Why are the same vectors used as Q, K, and V in the encoder?
Because the encoder uses self-attention to relate tokens within the same input sequence.

15. How many encoder blocks are in the original transformer?
6 blocks, each with its own parameters, stacked sequentially.
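
One way to sketch the stacking with PyTorch's built-in modules; nn.TransformerEncoder clones the given layer, so each of the 6 blocks gets its own independent parameters:

```python
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(block, num_layers=6)   # 6 independent, stacked blocks
```
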
16. Why is a look-ahead mask used in the decoder?
To prevent attention to future tokens during training and preserve autoregressive behavior.
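
A sketch of a look-ahead (causal) mask; True entries mark future positions whose attention scores would be set to -inf before the softmax:

```python
import torch

seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(mask)
# Typical use: scores = scores.masked_fill(mask, float('-inf'))
# so position i can attend only to positions 0..i.
```
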
17. What is cross-attention in the decoder for?
It lets the decoder attend to relevant encoder outputs so it can generate context-aware predictions.

18. What does the final linear layer in the decoder do?
It projects the decoder output to vocabulary size, producing a score for every possible next token.

19. Why use softmax after the final decoder layer?
To convert those scores into a probability distribution over the vocabulary, as sketched below.
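
A sketch of the two final steps together, with illustrative sizes: project decoder states to vocabulary scores, then softmax them into next-token probabilities:

```python
import torch
import torch.nn as nn

embed_dim, vocab_size = 512, 30_000        # illustrative sizes
to_vocab = nn.Linear(embed_dim, vocab_size)

decoder_out = torch.randn(1, 7, embed_dim) # (batch, seq_len, embed_dim)
logits = to_vocab(decoder_out)             # one score per vocabulary token
probs = torch.softmax(logits, dim=-1)      # probabilities for each next token
next_token = probs[:, -1].argmax(dim=-1)   # greedy pick at the last position
```
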
20. How is the encoder output used in the decoder?
It is passed as both key and value to the cross-attention layer in every decoder block.
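
A sketch of that wiring using PyTorch's nn.MultiheadAttention: the decoder state supplies the query, while the encoder output supplies both key and value:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

decoder_state = torch.randn(1, 7, 512)   # queries come from the decoder
encoder_out = torch.randn(1, 10, 512)    # keys and values come from the encoder

out, weights = attn(query=decoder_state, key=encoder_out, value=encoder_out)
print(out.shape)  # (1, 7, 512): one context-aware vector per decoder position
```
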
21. What does model.eval() do?
It switches the model to evaluation mode, disabling training-only behaviors such as dropout.

22. Why use torch.no_grad() during inference?
To save memory and computation by disabling gradient tracking.
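
A minimal inference sketch, using a stand-in module in place of a real trained model:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)     # stand-in for any trained model
inputs = torch.randn(1, 512)

model.eval()                    # evaluation mode: dropout and similar layers disabled
with torch.no_grad():           # no gradient graph is built: saves memory and compute
    output = model(inputs)

print(output.requires_grad)     # False
```
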
23. What happens to model parameters during inference vs. fine-tuning?
They stay fixed during inference but are updated during fine-tuning.

24. Why fine-tune instead of training from scratch on small datasets?
Because a small dataset cannot train a large model effectively from scratch, while fine-tuning reuses knowledge already learned during pretraining.
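
One common fine-tuning pattern, sketched with stand-in modules: freeze the pretrained weights and train only a small new head on the target task:

```python
import torch.nn as nn

# Stand-in for a pretrained backbone
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6
)

for p in backbone.parameters():
    p.requires_grad = False     # pretrained knowledge stays fixed

head = nn.Linear(512, 2)        # small task head, trained from scratch on the new data
# An optimizer would then be given only the head's trainable parameters.
```
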