Model Families & Architecture Variants

38 Terms

1

What are the four main model families introduced in this lecture?

Decoder-only, Encoder-only, Embedding models, and Mixture of Experts (MoE).

2

What is the primary capability of encoder-decoder architectures?

Sequence-to-sequence modeling.

3

What are typical use cases of encoder-decoder models?

Translation and summarization.

4

What is the main characteristic of decoder-only models?

Autoregressive generation.

5

What tasks are decoder-only models typically used for?

Text generation, code synthesis, and chat.

6

What is the defining feature of encoder-only models?

Bidirectional attention for representation learning.

7

What tasks are encoder-only models best suited for?

Classification, named entity recognition (NER), and semantic search.

8

What is the role of embedding models?

Producing dense vector representations of text.

9

What applications commonly use embedding models?

RAG, similarity search, and clustering.

10

What is the key idea behind Mixture of Experts (MoE)?

Sparse routing enables scalable computation.

11

What is autoregressive generation in decoder-only models?

Each new token is predicted from only the tokens that precede it (the prompt plus everything generated so far).

12

What type of attention do decoder-only models use?

Causal (masked) attention.

13

Why is causal attention necessary in decoder-only models?

It prevents the model from attending to future tokens during training.
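
A minimal sketch of the mask (toy NumPy, single head, no learned weights): scores above the diagonal are set to negative infinity before the softmax, so each position can only attend to itself and earlier positions.

```python
import numpy as np

def causal_attention_weights(q, k):
    """Toy causal self-attention weights; q and k are (seq_len, d) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                    # raw attention logits
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)   # True above the diagonal = future positions
    scores = np.where(future, -np.inf, scores)                 # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)       # softmax over the key dimension

q = k = np.random.randn(4, 8)
print(np.round(causal_attention_weights(q, k), 2))             # upper triangle is all zeros
```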

14

How is the architecture of decoder-only transformers simplified?

The encoder and cross-attention mechanisms are removed.

15

How do decoder-only models process input tokens?

The prompt and previously generated tokens form a single input stream; each new output is appended to it and fed back in for the next step.

16

What is the goal of autoregressive generation examples shown in the lecture?

To illustrate how outputs are recursively fed back into the model input.
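
A minimal sketch of that loop, where `model` is a hypothetical callable mapping a list of token ids to next-token logits (both names are invented for this example):

```python
def generate(model, prompt_ids, max_new_tokens=20):
    """Minimal greedy decoding loop; `model` returns one logit per vocabulary entry."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                                         # forward pass over the sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)   # greedy: pick the argmax token
        ids.append(next_id)                                         # feed the output back into the input
    return ids

# Usage with a dummy "model" that always favors token id 7:
print(generate(lambda ids: [0.0] * 7 + [1.0], prompt_ids=[1, 2, 3], max_new_tokens=4))
```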

17

What problem does KV caching address during inference?

Recomputing attention keys and values for all past tokens at every generation step is expensive.

18

How does KV caching improve inference efficiency?

By storing previously computed key and value matrices in memory.

19

What is the trade-off of using KV caching?

Increased memory usage proportional to sequence length.
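
A toy NumPy sketch of the cache for a single attention head (real implementations keep one cache per layer and head): each step appends one new key/value row, so per-step compute stays flat while memory grows linearly with the generated sequence.

```python
import numpy as np

class KVCache:
    """Toy per-head key/value cache; it grows by one row per generated token."""
    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))
        self.values = np.empty((0, d_head))

    def append(self, k, v):
        # Store the new token's key/value so earlier tokens are never re-projected.
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

    def attend(self, q):
        # Attention for the newest query against all cached keys/values.
        scores = self.keys @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values

cache = KVCache(d_head=8)
for _ in range(5):                       # one step per generated token
    k, v, q = (np.random.randn(8) for _ in range(3))
    cache.append(k, v)
    _ = cache.attend(q)
print(cache.keys.shape)                  # (5, 8): memory grows linearly with sequence length
```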

20

What distinguishes encoder-only models from decoder-only models?

They process the entire sequence simultaneously with bidirectional context.

21

What training objective is used in encoder-only models?

Masked Language Modeling (MLM).

22

How does Masked Language Modeling work?

Random tokens are masked and the model predicts them using surrounding context.
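
A toy version of the corruption step (simplified; real recipes such as BERT's also sometimes replace tokens with random ones or leave them unchanged):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Toy masked-language-modeling corruption: hide random tokens and
    record the originals as prediction targets."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok          # the model must predict this from context on both sides
        else:
            corrupted.append(tok)
    return corrupted, targets

random.seed(0)
print(mask_tokens("the model predicts masked tokens from context".split(), mask_prob=0.3))
```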

23

Why do encoder-only models not require KV caching?

They are not autoregressive and do not generate tokens sequentially.

24

What tasks are encoder-only architectures ideal for?

Classification, sentiment analysis, and named entity recognition.

25

What problem does Retrieval Augmented Generation (RAG) address?

The fixed knowledge cutoff of LLMs.

26

What are the three steps of RAG?

Retrieve, augment, and generate.

27

What happens during the retrieve step in RAG?

The system searches a vector database for relevant documents.

28

How are documents represented for retrieval in RAG systems?

As vector embeddings generated by embedding models.
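
A compressed sketch of the retrieve and augment steps (the `embed` function here is a random stand-in for a real embedding model, so the ranking is not actually semantic; the generate step just hands the prompt to whatever LLM is in use):

```python
import numpy as np

def embed(text):
    """Stand-in embedding function; a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)

def retrieve(query, documents, top_k=2):
    # Retrieve: rank documents by similarity between query and document embeddings.
    q = embed(query)
    return sorted(documents, key=lambda d: float(embed(d) @ q), reverse=True)[:top_k]

def build_prompt(query, documents):
    # Augment: prepend the retrieved context; Generate: pass this prompt to the LLM.
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "KV caching stores keys and values from earlier tokens.",
    "MoE layers route each token to a few experts.",
    "Masked language modeling hides random tokens.",
]
print(build_prompt("How does MoE routing work?", docs))
```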

29

What does semantic similarity mean in embedding models?

Distance in vector space corresponds to similarity in meaning.
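
For example, with cosine similarity as the measure and toy 2-D vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: values near 1.0 mean the embeddings point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; a real embedding model would produce these vectors.
cat, kitten, invoice = np.array([0.9, 0.1]), np.array([0.8, 0.2]), np.array([0.1, 0.9])
print(cosine_similarity(cat, kitten))   # high: related meanings sit close in vector space
print(cosine_similarity(cat, invoice))  # low: unrelated meanings sit far apart
```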

30

What is the difference between bi-encoders and cross-encoders?

Bi-encoders encode queries and documents separately, while cross-encoders process them jointly.

31

What is the main advantage of bi-encoders?

Efficient similarity search using precomputed embeddings.

32

What is the main advantage of cross-encoders?

Higher accuracy by modeling full query-document interactions.
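
A schematic contrast between the two patterns; `encode` and `joint_score` are crude stand-ins (a random projection and word overlap) meant only to show where the cost sits, not how real models score:

```python
import numpy as np

rng = np.random.default_rng(0)
encode = lambda text: rng.standard_normal(16)                     # stand-in sentence encoder
joint_score = lambda q, d: len(set(q.split()) & set(d.split()))   # stand-in cross-encoder

docs = ["kv caching stores keys and values", "moe routes tokens to experts"]
doc_vecs = np.stack([encode(d) for d in docs])    # bi-encoder: embed the corpus once, offline

def bi_encoder_search(query):
    # Answering a query is a cheap dot-product scan over the precomputed vectors.
    return int(np.argmax(doc_vecs @ encode(query)))

def cross_encoder_rerank(query, candidates):
    # Every (query, document) pair is scored jointly: slower, but the scorer sees
    # the full interaction between the two texts, so it is usually more accurate.
    return max(candidates, key=lambda d: joint_score(query, d))

print(bi_encoder_search("how does moe routing work"))
print(cross_encoder_rerank("how does moe routing work", docs))
```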

33

How does Mixture of Experts decouple model size from computation cost?

Only a subset of experts is activated per token.

34

What is conditional computation in MoE models?

Activating only selected experts for each token.

35

What role does the gating mechanism play in MoE?

It routes tokens to the most appropriate experts.

36

What is sparse gating in MoE architectures?

Selecting only the top-k experts for each token.
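
A toy sketch of top-k routing for one token (experts reduced to plain linear maps and the gate to a single matrix; both are invented for illustration):

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Toy sparse MoE layer for a single token vector `x`:
    the gate scores every expert, but only the top-k experts actually run."""
    gate_logits = gate_weights @ x                          # one score per expert
    top = np.argsort(gate_logits)[-top_k:]                  # indices of the k best experts
    gate = np.exp(gate_logits[top])
    gate /= gate.sum()                                      # renormalize over the selected experts
    # Conditional computation: unselected experts cost nothing for this token.
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gate, top))

d, n_experts = 8, 4
x = np.random.randn(d)
experts = np.random.randn(n_experts, d, d)                  # each expert is a simple linear map here
gates = np.random.randn(n_experts, d)
print(moe_layer(x, experts, gates).shape)                   # (8,): same shape as the input token
```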

37

Why are load balancing losses used in MoE training?

To prevent expert collapse, where the router sends nearly all tokens to a few experts while the rest go unused.
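
A sketch of one common recipe, an auxiliary loss in the style of the Switch Transformer: it penalizes the product of each expert's token fraction and mean gate probability, which is smallest when routing is balanced.

```python
import numpy as np

def load_balancing_loss(gate_probs, expert_assignment, n_experts):
    """Auxiliary balance loss (Switch-style, one common choice): scale the dot product
    of per-expert token fractions and mean router probabilities by the expert count."""
    tokens = len(expert_assignment)
    frac_tokens = np.bincount(expert_assignment, minlength=n_experts) / tokens
    mean_prob = gate_probs.mean(axis=0)            # average router probability per expert
    return n_experts * float(frac_tokens @ mean_prob)

# Toy router outputs for 6 tokens and 3 experts (rows sum to 1).
probs = np.array([[0.7, 0.2, 0.1]] * 6)
assignment = probs.argmax(axis=1)                  # every token routed to expert 0: collapse
print(load_balancing_loss(probs, assignment, 3))   # a high value flags the imbalance
```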

38

What does the capacity factor control in MoE systems?

The maximum number of tokens each expert can accept per batch; overflow tokens are dropped or rerouted.
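
A sketch of the usual definition (Switch-style; the 1.25 default is just a typical illustrative value):

```python
def expert_capacity(tokens_per_batch, n_experts, capacity_factor=1.25):
    """Expert capacity as commonly defined in Switch-style MoE: each expert accepts
    at most capacity_factor * (tokens / experts) tokens, trading compute headroom
    against the risk of dropping overflow tokens."""
    return int(capacity_factor * tokens_per_batch / n_experts)

print(expert_capacity(tokens_per_batch=1024, n_experts=8))  # 160 slots per expert
```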