Vocabulary flashcards covering key terms, methods, rewards, benchmarks, and models discussed in the lecture on DeepSeek-R1 and its reinforcement-learning-based reasoning pipeline.
DeepSeek-R1-Zero
First-generation reasoning model trained purely with large-scale reinforcement learning and no supervised fine-tuning; it shows strong reasoning but suffers from poor readability and language mixing.
DeepSeek-R1
Improved model that adds cold-start data and multi-stage SFT + RL, reaching reasoning performance comparable to OpenAI-o1-1217.
Reinforcement Learning (RL) in LLMs
Post-training technique that optimizes a language model’s policy via rewards, boosting reasoning at lower cost than pre-training.
Supervised Fine-Tuning (SFT)
Training a pretrained model on labeled examples; skipped for DeepSeek-R1-Zero but used in later stages of DeepSeek-R1.
Group Relative Policy Optimization (GRPO)
Cost-efficient RL algorithm that replaces a critic with group-based baseline estimation when updating the policy model.
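A minimal sketch of the group-relative baseline idea, assuming rewards for a group of outputs sampled from the same prompt; it omits the clipped policy-ratio objective and KL penalty of the full algorithm, and the function name is illustrative.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward in a group of sampled outputs against the
    group's own mean and standard deviation, so no learned critic is
    needed to estimate a baseline (illustrative sketch)."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean_r) / std_r for r in rewards]

# Example: rewards for a group of 4 responses to the same prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```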
Accuracy Reward
Reward signal that grants positive feedback when the model’s answer is verifiably correct (e.g., math box check, code test cases).
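A hedged sketch of one such rule-based check for math answers wrapped in \boxed{}; the helper below is illustrative, not the pipeline's actual verifier.

```python
import re

def accuracy_reward(model_output: str, reference_answer: str) -> float:
    """Extract the last \\boxed{...} expression and compare it with the
    reference answer; 1.0 if they match exactly, else 0.0 (illustrative)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0

print(accuracy_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
```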
Format Reward
Reward component that enforces wrapping the reasoning process between <think> and </think> tags so outputs follow the expected template.
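A minimal sketch of such a rule-based format check, assuming a template with <think>…</think> reasoning followed by an <answer>…</answer> block; the regex and function name are illustrative.

```python
import re

# The reasoning must appear inside <think>...</think>, followed by the
# final answer inside <answer>...</answer> (assumed template, illustrative).
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(model_output: str) -> float:
    return 1.0 if FORMAT_PATTERN.match(model_output.strip()) else 0.0

print(format_reward("<think>2 + 2 = 4</think> <answer>4</answer>"))  # 1.0
print(format_reward("The answer is 4."))                             # 0.0
```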
Chain-of-Thought (CoT)
Explicit sequence of intermediate reasoning steps written out by the model to arrive at an answer.
Cold Start Data
Small, high-quality set of long CoTs used to fine-tune the base model before RL, accelerating convergence and improving readability.
Language Consistency Reward
Additional RL signal that penalizes mixed-language CoTs and encourages output in the target language.
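A toy sketch of such a signal, assuming the reward is the fraction of CoT tokens detected as the target language; the token-level detector here is only a stand-in.

```python
def language_consistency_reward(cot_tokens, is_target_language) -> float:
    """Illustrative: reward equals the fraction of CoT tokens judged to be
    in the target language, penalizing mixed-language reasoning."""
    if not cot_tokens:
        return 0.0
    return sum(is_target_language(tok) for tok in cot_tokens) / len(cot_tokens)

# Toy usage with an ASCII check standing in for a real language detector.
print(language_consistency_reward(["hence", "x", "=", "3"], str.isascii))  # 1.0
```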
Majority Voting
Inference method that samples multiple outputs and selects the most frequent answer to raise accuracy.
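A minimal sketch of majority voting over sampled final answers; the helper name is illustrative.

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Return the most frequent final answer among several sampled outputs;
    ties resolve to the answer seen first (illustrative helper)."""
    return Counter(sampled_answers).most_common(1)[0][0]

print(majority_vote(["72", "68", "72", "72", "70"]))  # 72
```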
Self-Evolution Process
Gradual lengthening and refinement of CoTs during RL, leading to emergent reflection and better reasoning.
“Aha Moment”
Training milestone where the model begins reallocating more thinking time and reevaluating its steps, showing emergent insight.
Distillation
Transferring reasoning patterns from a large teacher (DeepSeek-R1) to smaller dense models via supervised learning.
DeepSeek-R1-Distill-Qwen-7B
7B-parameter distilled model that attains 55.5% pass@1 on AIME 2024, beating larger open baselines.
Pass@1
Metric measuring the fraction of problems the model solves correctly on its first attempt.
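A minimal sketch, assuming one sampled solution per problem; reported numbers often instead average correctness over several samples per problem.

```python
def pass_at_1(first_attempt_correct):
    """Fraction of problems whose single sampled solution is correct
    (illustrative; lists of booleans, one per problem)."""
    return sum(first_attempt_correct) / len(first_attempt_correct)

print(pass_at_1([True, False, True, True]))  # 0.75
```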
AIME 2024 Benchmark
Set of 15 American Invitational Mathematics Examination problems used to test mathematical reasoning.
MATH-500
Collection of 500 advanced mathematics problems evaluating higher-level problem solving.
Codeforces Rating
Elo-style score derived from competitive programming tasks; gauges coding ability of LLMs.
LiveCodeBench
Benchmark of real-world coding tasks across multiple languages, assessed with CoT-based generation.
Reasoning-Oriented RL
RL phase centered on math, coding, science, and logic tasks with clear rule-based feedback.
Rejection Sampling
Generating multiple outputs from a checkpoint and keeping only correct/high-quality ones to build new SFT data.
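A hedged sketch of the data-construction loop; `generate` and `is_acceptable` are hypothetical callables standing in for the checkpoint and the correctness/quality filter.

```python
def build_sft_data_by_rejection_sampling(prompts, generate, is_acceptable, num_samples=4):
    """For each prompt, sample several completions from a checkpoint and keep
    only those that pass a correctness/quality filter, yielding new SFT pairs
    (illustrative sketch)."""
    dataset = []
    for prompt in prompts:
        for _ in range(num_samples):
            completion = generate(prompt)
            if is_acceptable(prompt, completion):
                dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# Toy usage with stand-in callables.
toy = build_sft_data_by_rejection_sampling(
    ["2+2=?"],
    generate=lambda p: "<think>2+2=4</think> 4",
    is_acceptable=lambda p, c: "4" in c,
)
```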
Reward Hacking
Undesired behavior where the model exploits flaws in a reward model to inflate reward without real improvement.
Process Reward Model (PRM)
Neural model scoring intermediate reasoning steps; prone to reward hacking and high overhead in large-scale RL.
Monte Carlo Tree Search (MCTS) in LLMs
Search algorithm tried for token-level exploration; faced scaling issues due to enormous search space.
Reasoning Patterns
Strategies such as self-verification and reflection that enable effective problem solving and can be transferred via distillation.
Safety RL
Reinforcement learning stage aimed at reducing harmful or non-compliant outputs; it sometimes lowers factual-QA performance in certain languages.
AlpacaEval 2.0
GPT-4-judged evaluation of open-ended tasks; DeepSeek-R1 achieves an 87.6% length-controlled win rate.
ArenaHard
Challenging GPT-4-judged benchmark for open-domain QA; DeepSeek-R1 scores a 92.3% win rate.
DeepSeek-V3-Base
Underlying pretrained model that serves as the starting point for both DeepSeek-R1-Zero and DeepSeek-R1.