Comprehensive Guide to Mixture of Experts, RAG, Quantization, and Fine-Tuning in Large Language Models

35 Terms

1. Mixture of Experts (MoE)

A neural architecture that activates only a subset of "expert" networks per input, enabling large capacity with reduced compute per token.

2. MoE Gating Network

A small network that decides which experts process each token, typically by taking a softmax over routing scores and keeping the top-k experts.
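
A sketch of top-k routing in PyTorch; the function name and toy shapes are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def top_k_gate(x, w_gate, k=2):
    """Pick the top-k experts per token and renormalize their scores.

    x:      (num_tokens, d_model) token representations
    w_gate: (d_model, num_experts) learned routing weights
    """
    logits = x @ w_gate                       # (num_tokens, num_experts)
    topk_logits, topk_idx = logits.topk(k, dim=-1)
    weights = F.softmax(topk_logits, dim=-1)  # softmax over the chosen k
    return topk_idx, weights

# Toy usage: 4 tokens, d_model=8, 4 experts, top-2 routing.
idx, w = top_k_gate(torch.randn(4, 8), torch.randn(8, 4))
print(idx.shape, w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```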

3. MoE Load Balancing

A regularization technique, usually an auxiliary loss, that discourages expert over- and under-use by encouraging roughly uniform routing across experts.
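
One common formulation is the Switch Transformer auxiliary loss, which pushes both the fraction of tokens per expert and the mean routing probability per expert toward uniform; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top1_idx, num_experts):
    """Switch-Transformer-style auxiliary loss; attains its minimum (1.0)
    when routing is perfectly uniform across experts."""
    probs = F.softmax(router_logits, dim=-1)              # (tokens, experts)
    f = F.one_hot(top1_idx, num_experts).float().mean(0)  # token fraction per expert
    p = probs.mean(0)                                     # mean router prob per expert
    return num_experts * torch.sum(f * p)

logits = torch.randn(16, 4)  # router logits: 16 tokens, 4 experts
print(load_balancing_loss(logits, logits.argmax(dim=-1), 4))
```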

4. Dense vs Sparse MoE

Dense uses all experts per input; Sparse activates a few, saving compute but increasing routing complexity.

5. MoE Trade-offs

High capacity at low per-token compute, but at the cost of a larger memory footprint, routing overhead, and training instability.

6. Retrieval-Augmented Generation (RAG)

A system that retrieves relevant documents from a knowledge base before generation to ground outputs and reduce hallucination.

7. RAG Components

Retriever (finds documents) + Generator (LLM that conditions on retrieved context).
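
A minimal end-to-end sketch; `embed` and `generate` stand in for a real embedding model and LLM and are not defined here:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    """Dense retrieval: return the k documents most similar to the query."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def build_prompt(question, passages):
    """Ground the generator by putting retrieved text in the prompt."""
    context = "\n\n".join(passages)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# Hypothetical usage with an embedding model `embed` and an LLM `generate`:
# answer = generate(build_prompt(q, retrieve(embed(q), doc_vecs, docs)))
```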

8. RAG Challenges

Hallucinations, irrelevant retrievals, context window limits, and inconsistent document grounding.

9. Hybrid Retrieval

Combines dense (vector-based) and sparse (keyword-based) retrieval for better coverage and precision.
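
One simple fusion rule is reciprocal rank fusion (RRF), which needs only the two ranked lists rather than comparable scores; a sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs (best first); k=60 follows the RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]  # e.g., from vector search
sparse = ["d1", "d9", "d3"]  # e.g., from BM25
print(reciprocal_rank_fusion([dense, sparse]))  # d1 and d3 rank highest
```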

10. Improving RAG Accuracy

Use re-ranking, better chunking, and retrieval fine-tuning, and instruct the model to cite its sources.

11. Quantization

Representing model weights and activations in lower precision (e.g., 8-bit or 4-bit) to reduce memory use and speed up inference.
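
A minimal sketch of symmetric per-tensor int8 quantization in NumPy:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ≈ scale * q, q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # error is at most ~scale/2
```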

12. Post-Training Quantization (PTQ)

Quantizes a pre-trained model without retraining; fast, but can reduce accuracy.

13. Quantization-Aware Training (QAT)

Simulates quantization effects during training so that the model retains accuracy after conversion to low precision.

14. Symmetric vs Asymmetric Quantization

Symmetric uses a zero-centered scale; asymmetric adds a zero-point offset to handle value ranges not centered at zero.
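
Complementing the symmetric example under "Quantization", a sketch of the asymmetric scheme with an explicit zero point (helper name is illustrative):

```python
import numpy as np

def quantize_asymmetric_uint8(x):
    """Asymmetric quantization: x ≈ scale * (q - zero_point).

    The zero point lets the uint8 grid cover a range that is not centered
    at zero, e.g. post-ReLU activations in [0, max]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    zero_point = int(np.clip(round(-lo / scale), 0, 255))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

x = np.random.rand(4, 4).astype(np.float32) * 6.0  # all-positive, ReLU-like
q, scale, zp = quantize_asymmetric_uint8(x)
print(np.abs(x - scale * (q.astype(np.float32) - zp)).max())
```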

15. Quantization Trade-offs

Improves speed and efficiency but risks numerical instability and performance loss.

16. Knowledge Distillation

Training a smaller student model to mimic a larger teacher's outputs, reducing size and latency.

17. Distillation Loss

Combines the student's task loss with a temperature-scaled KL divergence between the teacher's and student's softened output distributions.
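
A sketch of the classic soft-target formulation (Hinton et al.); `alpha` and the temperature `T` are tunable hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hard-label task loss plus temperature-softened KL term.

    The T**2 factor keeps soft-target gradients comparable in magnitude
    across temperatures."""
    task = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * task + (1.0 - alpha) * kd

s, t = torch.randn(8, 10), torch.randn(8, 10)  # student / teacher logits
print(distillation_loss(s, t, torch.randint(0, 10, (8,))))
```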

18. Intermediate Layer Distillation

Matching hidden states or attention maps between teacher and student for richer transfer.

19. Distillation Benefits

Retains performance while reducing model size, latency, and energy consumption.

20. Distillation Limitations

May underperform if teacher outputs are noisy or student is too small.

21. Attention Mechanism

Computes weighted combinations of values based on query-key similarity, allowing models to focus on relevant tokens.
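
In code, the whole mechanism is a few lines; a NumPy sketch of scaled dot-product attention:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — a weighted combination of values."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ V

Q, K, V = np.random.randn(5, 16), np.random.randn(5, 16), np.random.randn(5, 32)
print(attention(Q, K, V).shape)  # (5, 32)
```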

22. Self-Attention

Each token attends to every other token in the sequence, capturing global dependencies.

23. Multi-Head Attention

Multiple attention heads learn diverse relationships in parallel subspaces.

24. Positional Encoding

Injects sequence order information since attention is permutation-invariant.
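
A sketch of the fixed sinusoidal encoding from "Attention Is All You Need" (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe  # added to token embeddings before the first layer

print(sinusoidal_positional_encoding(128, 64).shape)  # (128, 64)
```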

25. Attention Complexity

O(n²) time and memory in sequence length n; this cost motivates efficient attention variants.

26. Fine-Tuning

Adapting a pre-trained model to new data or tasks by updating its weights or small subsets of parameters.

27. Full Fine-Tuning

Updates all weights; high flexibility but expensive and risks overfitting.

28. LoRA (Low-Rank Adaptation)

Freezes the base weights and trains low-rank update matrices added to them; parameter-efficient, and the update can be merged in or removed.
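
A minimal sketch of a LoRA-wrapped linear layer in PyTorch (class name is illustrative, not the `peft` API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (alpha/r) * B A."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```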

29. Adapter Tuning

Adds small trainable modules between layers while freezing the backbone model.

30. Prefix or Prompt Tuning

Learns task-specific prompt vectors without modifying model weights.

31. Prompt Engineering

Crafting effective input prompts to elicit desired outputs from LLMs without retraining.

32. Chain-of-Thought Prompting

Encouraging step-by-step reasoning to improve problem-solving accuracy.
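
A minimal illustration of the prompting pattern (the exact wording is arbitrary):

```
Q: A store has 23 apples, sells 9, then receives 12 more. How many now?
A: Let's think step by step. 23 - 9 = 14. 14 + 12 = 26. The answer is 26.
```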

33. Prompt Optimization

Automating prompt improvement via gradient-based or search-based methods.

34. Prompt Injection

Adversarial input that manipulates model behavior; mitigated by input sanitization and filtering.

35. Prompt Engineering Goal

Maximize LLM accuracy, faithfulness, and control using structure, context, and examples.