Transformers Oral Exam Study Set

44 Terms

1
what does vocab_size represent in the embedding layer?
it represents the number of unique tokens the model can embed, typically determined during tokenization
2
what does embed_dim represent in the embedding layer?
it refers to the length of each embedding vector and is set as a model hyperparameter
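
a minimal sketch of the embedding lookup from the two cards above, assuming PyTorch; the vocab_size and embed_dim values here are illustrative placeholders:

    import torch
    import torch.nn as nn

    vocab_size = 30000   # illustrative: number of unique tokens from the tokenizer
    embed_dim = 512      # illustrative: length of each embedding vector (hyperparameter)

    embedding = nn.Embedding(vocab_size, embed_dim)

    token_ids = torch.tensor([[5, 124, 7892]])   # one sequence of 3 token ids
    vectors = embedding(token_ids)               # shape: (1, 3, 512)
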
3
how are embeddings initialized during training?
they are initialized randomly unless pretrained, and are updated as the model learns
4
why do transformers need positional embeddings?
because transformers lack inherent order awareness, and positional embeddings inject sequence information
5
are sinusoidal positional embeddings trainable?
no, they are fixed and not updated during training
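
a sketch of fixed sinusoidal positional embeddings, assuming the sine/cosine formulation from the original paper; nothing below is an nn.Parameter, so the table is never updated during training:

    import math
    import torch

    def sinusoidal_positional_encoding(max_len, d_model):
        position = torch.arange(max_len).unsqueeze(1)    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)     # even dimensions: sine
        pe[:, 1::2] = torch.cos(position * div_term)     # odd dimensions: cosine
        return pe                                        # fixed table, (max_len, d_model)

    pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
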
6
what happens when the same token appears in the same position in two sequences?
the final embedding is the same since both word and positional embeddings are identical
7
why use multiple attention heads in a transformer?
to capture diverse relationships between tokens in parallel, improving representation power
8
how many attention heads are used in the given implementation?
8 attention heads are used
9
what is the dimensionality of each attention head in a 512d embedding?
64, since 512 is split equally among 8 heads
10
how many sets of Q, K, and V are logically created for multi-head attention?
8 sets logically, but only one set of Q, K, and V matrices is implemented and then split
11
how can shared Q, K, V projections still support diverse attention?
because each head processes a separate slice and learns patterns independently
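
a minimal sketch of the "one set of Q, K, V, then split into heads" idea from the cards above, assuming PyTorch, embed_dim=512, and 8 heads of 64 dimensions each:

    import torch
    import torch.nn as nn

    embed_dim, num_heads = 512, 8
    head_dim = embed_dim // num_heads            # 64

    w_q = nn.Linear(embed_dim, embed_dim)        # single Q projection over all 512 dims
    w_k = nn.Linear(embed_dim, embed_dim)        # single K projection
    w_v = nn.Linear(embed_dim, embed_dim)        # single V projection

    x = torch.randn(2, 10, embed_dim)            # (batch, seq_len, embed_dim)

    def split_heads(t):
        # (batch, seq_len, 512) -> (batch, 8, seq_len, 64): each head gets its own slice
        b, s, _ = t.shape
        return t.view(b, s, num_heads, head_dim).transpose(1, 2)

    q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5    # (batch, 8, seq, seq)
    attn = scores.softmax(dim=-1) @ v                     # (batch, 8, seq, 64)
    out = attn.transpose(1, 2).reshape(2, 10, embed_dim)  # heads concatenated back to 512
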
12
what does Add & Norm do in a transformer block?
it adds residual connections and applies layer normalization to stabilize and preserve information
13
what is the structure of the transformer feedforward network?
two linear layers with ReLU in between, typically sized 512 → 2048 → 512
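
a sketch combining the Add & Norm and feed-forward cards above, assuming PyTorch and the post-norm layout of the original transformer:

    import torch
    import torch.nn as nn

    d_model = 512
    feed_forward = nn.Sequential(
        nn.Linear(d_model, 2048),   # expand 512 -> 2048
        nn.ReLU(),
        nn.Linear(2048, d_model),   # project back 2048 -> 512
    )
    norm = nn.LayerNorm(d_model)

    x = torch.randn(2, 10, d_model)
    out = norm(x + feed_forward(x))   # Add (residual) & Norm around the sublayer
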
14
why are the same vectors used as Q, K, and V in the encoder?
because the encoder uses self-attention to relate tokens within the same input
15
how many encoder blocks are in the original transformer?
6 blocks, each with its own parameters, stacked sequentially
16
why is a look-ahead mask used in the decoder?
to prevent attention to future tokens during training and ensure autoregressive behavior
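
a sketch of a look-ahead (causal) mask, assuming PyTorch; positions above the diagonal are blocked before the softmax so no token can attend to later tokens:

    import torch

    seq_len = 5
    # True above the diagonal marks the "future" positions to hide
    look_ahead = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

    scores = torch.randn(seq_len, seq_len)                  # raw attention scores
    masked = scores.masked_fill(look_ahead, float("-inf"))  # blocked positions -> -inf
    weights = masked.softmax(dim=-1)                        # future tokens get weight 0
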
17
what is cross-attention in the decoder for?
it allows the decoder to focus on relevant encoder outputs for generating context-aware outputs
18
what does the final linear layer in the decoder do?
it projects the output to vocabulary size, allowing the model to score and choose the next token
19
why use softmax after the final decoder layer?
to convert scores into probabilities for each possible next token
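
a sketch of the final projection and softmax over the vocabulary, assuming PyTorch; d_model and vocab_size here are illustrative:

    import torch
    import torch.nn as nn

    d_model, vocab_size = 512, 30000
    to_vocab = nn.Linear(d_model, vocab_size)      # final linear layer of the decoder

    decoder_output = torch.randn(1, 7, d_model)    # (batch, seq_len, d_model)
    logits = to_vocab(decoder_output)              # (1, 7, vocab_size): a score per token
    probs = logits.softmax(dim=-1)                 # probabilities over the vocabulary
    next_token = probs[:, -1].argmax(dim=-1)       # greedy pick for the next token
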
20
how is the encoder output used in the decoder?
it’s passed as both key and value in cross-attention layers for every decoder block
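
a sketch of the cross-attention wiring, assuming PyTorch's nn.MultiheadAttention: queries come from the decoder, keys and values come from the encoder output:

    import torch
    import torch.nn as nn

    cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    decoder_states = torch.randn(1, 7, 512)    # queries: current decoder representations
    encoder_output = torch.randn(1, 12, 512)   # keys and values: encoder output

    out, weights = cross_attn(query=decoder_states,
                              key=encoder_output,
                              value=encoder_output)   # out: (1, 7, 512)
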
21
what does model.eval() do?
it switches the model to evaluation mode, disabling training behaviors like dropout
22
why use torch.no_grad() during inference?
to save memory and computation by disabling gradient tracking
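
a sketch of the inference pattern from the two cards above, assuming PyTorch; the tiny linear model and inputs are hypothetical stand-ins, not a real transformer:

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 10)     # stand-in for a trained model
    inputs = torch.randn(1, 512)   # stand-in input batch

    model.eval()                   # evaluation mode: disables dropout and similar behavior
    with torch.no_grad():          # no gradient tracking: saves memory and computation
        output = model(inputs)     # parameters are used but never updated
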
23
what happens to model parameters during inference vs. fine-tuning?
they stay fixed during inference but are updated during fine-tuning
24
why fine-tune instead of training from scratch on small datasets?
because a small dataset is rarely enough to train a model effectively from scratch, while fine-tuning leverages the knowledge already stored in the pretrained weights
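
a sketch of one common fine-tuning pattern, assuming PyTorch; the pretrained body here is a hypothetical stand-in for a loaded transformer:

    import torch
    import torch.nn as nn

    pretrained_body = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # stand-in backbone
    task_head = nn.Linear(512, 2)      # new head for the small downstream dataset

    for p in pretrained_body.parameters():
        p.requires_grad = False        # keep the pretrained knowledge frozen

    # only the head is passed to the optimizer, so only it is updated
    optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-4)
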
25
what is the main innovation of the transformer architecture?
it removes recurrence and convolutions, using only attention mechanisms
26
how is the transformer structured?
it follows an encoder-decoder architecture with stacked layers of attention and feed-forward networks
27
how many layers are in the transformer encoder?
6 identical layers
28
how many layers are in the transformer decoder?
6 identical layers
29
what are the two main components of each encoder layer?
multi-head self-attention and a feed-forward network
30
what additional component does each decoder layer have?
a cross-attention layer over the encoder output
31
why is masking used in the decoder?
to prevent attending to future tokens and preserve auto-regressive behavior
32
what is scaled dot-product attention?
it is a mechanism that computes attention scores by scaling dot products of queries and keys
33
why do we scale the dot product in attention?
to prevent large values that push softmax into regions with very small gradients
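
a sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, assuming PyTorch:

    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # scaling keeps softmax gradients usable
        return scores.softmax(dim=-1) @ v

    q = k = v = torch.randn(1, 5, 64)              # (batch, seq_len, d_k)
    out = scaled_dot_product_attention(q, k, v)    # (1, 5, 64)
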
34
what is multi-head attention?
it applies attention multiple times in parallel with different learned projections and then combines the results
35
how many heads are used in multi-head attention?
8 heads
36
what are the advantages of multi-head attention?
it allows the model to attend to information from different representation subspaces simultaneously
37
what is the feed-forward network in each layer?
two linear layers with a ReLU activation in between, applied independently to each position
38
what is the input and output size of the feed-forward network?
both are 512, with an inner layer of size 2048
39
what is the role of the embedding layer?
to convert tokens into vectors of dimension 512
40
what is positional encoding used for?
to inject order information into the model since it lacks recurrence and convolution
41
how is positional encoding implemented?
using fixed sinusoidal functions of different frequencies
42
why were sinusoidal positional encodings chosen?
they let the model attend by relative position, since the encoding at any fixed offset is a linear function of the current position's encoding, and they may generalize to sequence lengths longer than those seen during training
43
how is self-attention different from convolution?
self-attention connects all positions with a constant number of sequential operations, while a single convolutional layer with kernel width smaller than the sequence length does not connect all pairs of positions and needs a stack of layers to do so
44
what makes self-attention efficient?
it enables parallelization and has shorter paths for learning long-range dependencies