What are the four main model families introduced in this lecture?
Decoder-only, Encoder-only, Embedding models, and Mixture of Experts (MoE).
What is the primary capability of encoder-decoder architectures?
Sequence-to-sequence modeling.
What are typical use cases of encoder-decoder models?
Translation and summarization.
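A minimal sketch of sequence-to-sequence use, assuming the Hugging Face transformers library is installed; the t5-small checkpoint and translation task are illustrative choices, not taken from the lecture.

```python
# Encoder-decoder (seq2seq) example: translation with T5.
# Assumes the Hugging Face transformers library; checkpoint is illustrative.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The weather is nice today."))
```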
What is the main characteristic of decoder-only models?
Autoregressive generation.
What tasks are decoder-only models typically used for?
Text generation, code synthesis, and chat.
What is the defining feature of encoder-only models?
Bidirectional attention for representation learning.
What tasks are encoder-only models best suited for?
Classification, Named Entity Recognition, and semantic search.
What is the role of embedding models?
Producing dense vector representations of text.
What applications commonly use embedding models?
RAG, similarity search, and clustering.
What is the key idea behind Mixture of Experts (MoE)?
Sparse routing enables scalable computation.
What is autoregressive generation in decoder-only models?
Each token is generated based only on previously generated tokens.
What type of attention do decoder-only models use?
Causal (masked) attention.
Why is causal attention necessary in decoder-only models?
It prevents the model from attending to future tokens during training.
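A minimal NumPy sketch of the causal mask: score positions above the diagonal are set to -inf before the softmax, so token i cannot attend to any token j > i. All weights here are random stand-ins.

```python
# Sketch: causal (masked) self-attention scores with NumPy.
import numpy as np

seq_len, d = 4, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))

scores = Q @ K.T / np.sqrt(d)                      # raw attention scores
mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal
scores = np.where(mask == 1, -np.inf, scores)      # block future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
print(np.round(weights, 2))                        # upper triangle is exactly 0
```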
How is the architecture of decoder-only transformers simplified?
The encoder and cross-attention mechanisms are removed.
How do decoder-only models process input tokens?
Sequentially, left to right: during generation, each step processes the prompt plus all previously generated tokens fed back into the input stream.
What is the goal of autoregressive generation examples shown in the lecture?
To illustrate how outputs are recursively fed back into the model input.
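A toy sketch of that feedback loop; `model` is a hypothetical placeholder for next-token prediction, not a real network.

```python
# Sketch of the autoregressive loop: each new token is appended to the
# input and fed back in at the next step.
def model(token_ids):
    """Hypothetical stand-in: returns a next-token id from the sequence so far."""
    return (sum(token_ids) + 1) % 100

def generate(prompt_ids, max_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model(ids)   # predict from all tokens generated so far
        ids.append(next_id)    # feed the output back into the input stream
    return ids

print(generate([3, 14, 15]))
```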
What problem does KV caching address during inference?
Recomputing attention keys and values for past tokens is expensive.
How does KV caching improve inference efficiency?
By storing previously computed key and value matrices in memory.
What is the trade-off of using KV caching?
Increased memory usage proportional to sequence length.
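A single-head sketch of a KV cache with random projection matrices standing in for trained weights: each decoding step computes keys and values for the new token only and appends them to the cache, so memory grows with sequence length.

```python
# Sketch: a per-layer KV cache. New K/V rows are appended and reused
# instead of being recomputed for all past tokens at every step.
import numpy as np

d = 8
rng = np.random.default_rng(1)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one row per generated token

def attend(x_new):
    """One decoding step: compute K/V for the new token only, reuse the rest."""
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x_new @ W_q
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V  # attention output for the new token

for _ in range(3):                          # three decoding steps
    out = attend(rng.normal(size=d))
print(len(k_cache), "cached key vectors")   # memory grows with sequence length
```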
What distinguishes encoder-only models from decoder-only models?
They process the entire sequence simultaneously with bidirectional context.
What training objective is used in encoder-only models?
Masked Language Modeling (MLM).
How does Masked Language Modeling work?
Random tokens are masked and the model predicts them using surrounding context.
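A minimal sketch of MLM prediction, assuming the Hugging Face transformers library; bert-base-uncased is an illustrative encoder-only checkpoint trained with MLM.

```python
# Sketch: masked language modeling via the fill-mask pipeline.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
# The model uses context on BOTH sides of [MASK] to predict the token.
for pred in fill("The capital of France is [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```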
Why do encoder-only models not require KV caching?
They are not autoregressive and do not generate tokens sequentially.
What tasks are encoder-only architectures ideal for?
Classification, sentiment analysis, and named entity recognition.
What problem does Retrieval Augmented Generation (RAG) address?
The fixed knowledge cutoff of LLMs.
What are the three steps of RAG?
Retrieve, augment, and generate.
What happens during the retrieve step in RAG?
The system searches a vector database for relevant documents.
How are documents represented for retrieval in RAG systems?
As vector embeddings generated by embedding models.
What does semantic similarity mean in embedding models?
Proximity in vector space corresponds to similarity in meaning: texts with similar meanings map to nearby embeddings.
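A toy sketch of the retrieve and augment steps built on semantic similarity; `embed` is a hypothetical character-frequency stand-in for a real embedding model, used only to keep the example self-contained.

```python
# Sketch of the retrieve-and-augment half of RAG with toy embeddings.
import numpy as np

def embed(text):
    """Toy embedding: normalized character-frequency vector (stand-in only)."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    return v / (np.linalg.norm(v) + 1e-9)

docs = [
    "KV caching stores keys and values from past tokens.",
    "Mixture of Experts routes each token to a few experts.",
    "RAG retrieves documents before generating an answer.",
]
doc_vecs = np.stack([embed(d) for d in docs])   # precomputed offline

query = "How does retrieval augmented generation work?"
scores = doc_vecs @ embed(query)                # cosine similarity (unit vectors)
best = docs[int(np.argmax(scores))]             # retrieve step

prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"  # augment step
print(prompt)                                   # generate step would use an LLM
```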
What is the difference between bi-encoders and cross-encoders?
Bi-encoders encode queries and documents separately, while cross-encoders process them jointly.
What is the main advantage of bi-encoders?
Efficient similarity search using precomputed embeddings.
What is the main advantage of cross-encoders?
Higher accuracy by modeling full query-document interactions.
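A sketch contrasting the two scoring styles, assuming the sentence-transformers library is available; both checkpoint names are illustrative.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "What does KV caching store?"
docs = ["KV caching stores past keys and values.",
        "MoE routes tokens to experts."]

# Bi-encoder: encode query and documents separately, then compare vectors.
bi = SentenceTransformer("all-MiniLM-L6-v2")
q_vec, d_vecs = bi.encode(query), bi.encode(docs)
print(util.cos_sim(q_vec, d_vecs))       # fast: doc vectors can be precomputed

# Cross-encoder: score each (query, document) pair jointly.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross.predict([(query, d) for d in docs]))  # slower, more accurate
```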
How does Mixture of Experts decouple model size from computation cost?
Only a subset of experts is activated per token.
What is conditional computation in MoE models?
Activating only selected experts for each token.
What role does the gating mechanism play in MoE?
It routes tokens to the most appropriate experts.
What is sparse gating in MoE architectures?
Selecting only the top-k experts for each token.
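A NumPy sketch of top-k (sparse) gating for a single token; the router logits and expert weights are random stand-ins for trained parameters.

```python
# Sketch: sparse top-k gating. A router scores all experts, but only
# the top-k are activated and their outputs mixed by the gate weights.
import numpy as np

rng = np.random.default_rng(2)
num_experts, k, d = 8, 2, 16

x = rng.normal(size=d)                           # one token's hidden state
router_logits = rng.normal(size=num_experts)     # stand-in for x @ W_router

topk = np.argsort(router_logits)[-k:]            # indices of the k best experts
gates = np.exp(router_logits[topk])
gates /= gates.sum()                             # softmax over selected experts

experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]  # toy experts
y = sum(g * (x @ experts[i]) for g, i in zip(gates, topk))
print("active experts:", topk, "gate weights:", np.round(gates, 2))
```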
Why are load balancing losses used in MoE training?
To prevent expert collapse, where the router sends most tokens to a few experts while the rest go unused.
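A sketch of one common formulation, the Switch-Transformer-style auxiliary loss: loss = num_experts * Σ_i f_i * P_i, where f_i is the fraction of tokens routed to expert i and P_i is its mean router probability; it is minimized when routing is uniform.

```python
# Sketch: auxiliary load-balancing loss over a batch of router outputs.
import numpy as np

rng = np.random.default_rng(3)
tokens, num_experts = 32, 4
logits = rng.normal(size=(tokens, num_experts))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

assignments = probs.argmax(axis=-1)                        # top-1 routing
f = np.bincount(assignments, minlength=num_experts) / tokens
P = probs.mean(axis=0)
aux_loss = num_experts * np.sum(f * P)   # smallest when routing is uniform
print(round(float(aux_loss), 3))
```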
What does the capacity factor control in MoE systems?
The maximum number of tokens an expert can process.
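A worked example of the usual capacity formula, capacity = capacity_factor × tokens_per_batch ÷ num_experts; exact rounding and overflow handling vary by implementation.

```python
# Sketch: per-expert capacity under a capacity factor. Tokens beyond
# the cap are typically dropped or passed through a residual connection.
tokens_per_batch, num_experts, capacity_factor = 32, 4, 1.25
capacity = int(capacity_factor * tokens_per_batch / num_experts)
print("per-expert capacity:", capacity)  # 1.25 * 32 / 4 = 10 tokens
```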