Mixture of Experts (MoE)
A neural architecture that activates only a subset of "expert" networks per input, enabling large capacity with reduced compute per token.
MoE Gating Network
A small model that decides which experts to activate for each token, often using top-k routing or softmax-based scores.
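A minimal PyTorch sketch of softmax-based top-k routing; the shapes, names, and k=2 choice are illustrative, not from the source:

```python
import torch
import torch.nn.functional as F

def top_k_gate(x, w_gate, k=2):
    """Route each token to its top-k experts.

    x:      (num_tokens, d_model) token representations
    w_gate: (d_model, num_experts) gating weights
    Returns the chosen expert indices and normalized routing weights.
    """
    logits = x @ w_gate                       # (num_tokens, num_experts)
    topk_logits, topk_idx = logits.topk(k, dim=-1)
    weights = F.softmax(topk_logits, dim=-1)  # renormalize over the chosen k
    return topk_idx, weights

# Example: 4 tokens, model dim 8, 4 experts, top-2 routing
x = torch.randn(4, 8)
w_gate = torch.randn(8, 4)
idx, w = top_k_gate(x, w_gate, k=2)
print(idx.shape, w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```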
MoE Load Balancing
A regularization technique that discourages expert overuse or underuse by encouraging roughly uniform routing across experts.
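A sketch of one common auxiliary loss, in the style of the Switch Transformer formulation; the exact form varies by implementation:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits, expert_idx, num_experts):
    """Auxiliary loss that pushes routing toward a uniform spread of tokens.

    router_logits: (num_tokens, num_experts) raw gating scores
    expert_idx:    (num_tokens,) index of the expert each token was sent to
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_e: fraction of tokens dispatched to each expert
    f = torch.bincount(expert_idx, minlength=num_experts).float() / expert_idx.numel()
    # p_e: mean routing probability assigned to each expert
    p = probs.mean(dim=0)
    # Minimized when both f and p are uniform (1 / num_experts)
    return num_experts * torch.sum(f * p)
```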
Dense vs Sparse MoE
Dense MoE uses all experts per input; sparse MoE activates only a few, saving compute but adding routing complexity.
MoE Trade-offs
High capacity and efficiency, but adds memory, routing overhead, and training instability.
Retrieval-Augmented Generation (RAG)
A system that retrieves relevant documents from a knowledge base before generation to ground outputs and reduce hallucination.
RAG Components
Retriever (finds documents) + Generator (LLM that conditions on retrieved context).
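A toy sketch of the retrieve-then-generate flow; the similarity search is plain NumPy, and the embedding model and LLM that would produce the vectors and complete the prompt are assumed, not a specific API:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    """Return the k documents whose embeddings are most similar to the query
    (dot product equals cosine similarity when vectors are L2-normalized)."""
    scores = doc_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def build_prompt(question, passages):
    """Ground the generator in the retrieved context."""
    context = "\n\n".join(passages)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# Toy usage with random embeddings; in practice an embedding model produces
# query_vec / doc_vecs and an LLM completes the returned prompt.
docs = ["MoE routes tokens to experts.", "RAG retrieves documents.", "LoRA adds low-rank adapters."]
doc_vecs = np.random.randn(3, 16)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = doc_vecs[1] + 0.1 * np.random.randn(16)
print(build_prompt("What does RAG do?", retrieve(query_vec, doc_vecs, docs, k=2)))
```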
RAG Challenges
Hallucinations, irrelevant retrievals, context window limits, and inconsistent document grounding.
Hybrid Retrieval
Combines dense (vector-based) and sparse (keyword-based) retrieval for better coverage and precision.
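One simple way to merge the two ranked lists is reciprocal rank fusion (RRF); the constant 60 is a conventional default, not from the source:

```python
def reciprocal_rank_fusion(dense_ranking, sparse_ranking, k=60):
    """Merge two ranked lists of doc ids; documents ranked highly by either
    retriever float to the top. k dampens the impact of low ranks."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: doc "b" ranks well in both lists, so it wins overall.
print(reciprocal_rank_fusion(["a", "b", "c"], ["b", "d", "a"]))  # ['b', 'a', 'd', 'c']
```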
Improving RAG Accuracy
Use re-ranking, better chunking, and retrieval fine-tuning, and instruct the model to cite its sources.
Quantization
Representing model weights and activations in lower precision (e.g. 8-bit or 4-bit) to reduce memory and speed up inference.
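A minimal NumPy sketch of a symmetric int8 round-trip; per-tensor scaling is shown for simplicity, while real systems often use per-channel or group-wise schemes:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = np.abs(w).max() / 127.0                       # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())             # small rounding error
```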
Post-Training Quantization (PTQ)
Quantize a pre-trained model without retraining; fast but can reduce accuracy.
Quantization-Aware Training (QAT)
Simulate quantization during training to maintain accuracy after conversion.
Symmetric vs Asymmetric Quantization
Symmetric uses zero-centered scaling; asymmetric includes offset to handle non-zero means.
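A short sketch of how the two schemes compute their parameters; the 8-bit ranges and the sample data are illustrative:

```python
import numpy as np

def symmetric_params(x, bits=8):
    """Scale only: real 0 maps exactly to quantized 0 (zero_point = 0)."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8
    return np.abs(x).max() / qmax, 0

def asymmetric_params(x, bits=8):
    """Scale plus zero-point offset, useful when the data has a non-zero mean."""
    qmin, qmax = 0, 2 ** bits - 1               # 0..255 for uint8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    return scale, zero_point

x = np.random.rand(1000) * 5.0 - 1.0            # values roughly in [-1, 4]
print(symmetric_params(x))                      # zero_point is always 0
print(asymmetric_params(x))                     # non-zero zero_point absorbs the offset
```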
Quantization Trade-offs
Improves speed and efficiency but risks numerical instability and performance loss.
Knowledge Distillation
Training a smaller student model to mimic a larger teacher's outputs, reducing size and latency.
Distillation Loss
Combines the student's task loss with a KL divergence between the teacher's and student's temperature-softened output distributions.
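A sketch of the standard soft-target formulation; the temperature T and mixing weight alpha shown here are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the hard-label task loss with a soft-label KL term.

    T softens both distributions; the T*T factor restores the gradient scale.
    alpha balances imitating the teacher against fitting the true labels.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * soft + (1 - alpha) * hard
```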
Intermediate Layer Distillation
Matching hidden states or attention maps between teacher and student for richer transfer.
Distillation Benefits
Retains performance while reducing model size, latency, and energy consumption.
Distillation Limitations
May underperform if teacher outputs are noisy or student is too small.
Attention Mechanism
Computes weighted combinations of values based on query-key similarity, allowing models to focus on relevant tokens.
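A minimal single-head sketch of scaled dot-product attention; shapes and the toy input are illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    q, k, v: (seq_len, d_k) for a single head; each softmax row gives the
    weights used to mix the value vectors for that query.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # query-key similarity
    weights = F.softmax(scores, dim=-1)             # attention weights per query
    return weights @ v

q = k = v = torch.randn(5, 16)   # self-attention: same sequence supplies Q, K, V
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([5, 16])
```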
Self-Attention
Each token attends to every other token in the sequence, capturing global dependencies.
Multi-Head Attention
Multiple attention heads learn diverse relationships in parallel subspaces.
Positional Encoding
Injects sequence order information since attention is permutation-invariant.
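A sketch of the original Transformer's sinusoidal scheme; learned position embeddings and rotary encodings are common alternatives:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                      # added to the token embeddings

print(sinusoidal_positional_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```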
Attention Complexity
O(n²) time and memory with sequence length; motivates efficient attention variants.
Fine-Tuning
Adapting a pre-trained model to new data or tasks by updating its weights or small subsets of parameters.
Full Fine-Tuning
Updates all weights; high flexibility but expensive and risks overfitting.
LoRA (Low-Rank Adaptation)
Fine-tunes low-rank matrices added to weight layers; efficient and reversible.
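A minimal sketch of a LoRA-augmented linear layer; the rank, alpha, and initialization values are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A.

    Only A and B (rank r) receive gradients; merging W + (alpha/r) * B @ A
    back into one matrix removes the adapter at inference time.
    """
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)           # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(64, 64)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```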
Adapter Tuning
Adds small trainable modules between layers while freezing the backbone model.
Prefix or Prompt Tuning
Learns task-specific prompt vectors without modifying model weights.
Prompt Engineering
Crafting effective input prompts to elicit desired outputs from LLMs without retraining.
Chain-of-Thought Prompting
Encouraging step-by-step reasoning to improve problem-solving accuracy.
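An illustrative few-shot chain-of-thought prompt; the wording and the worked example are made up for illustration, not a prescribed template:

```python
cot_prompt = """Q: A cafe sold 23 coffees in the morning and 31 in the afternoon. Each costs $4. What was the total revenue?
A: Let's think step by step.
Total coffees = 23 + 31 = 54.
Revenue = 54 * 4 = $216.
The answer is $216.

Q: A train travels at 60 km/h for 2.5 hours. How far does it go?
A: Let's think step by step.
"""
```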
Prompt Optimization
Automating prompt improvement via gradient-based or search-based methods.
Prompt Injection
Adversarial input that manipulates model behavior; mitigated by input sanitization and filtering.
Prompt Engineering Goal
Maximize LLM accuracy, faithfulness, and control using structure, context, and examples.