DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Description

Vocabulary flashcards covering key terms, methods, rewards, benchmarks, and models discussed in the lecture on DeepSeek-R1 and its reinforcement-learning-based reasoning pipeline.

30 Terms

1. DeepSeek-R1-Zero

First-generation reasoning model trained purely with large-scale reinforcement learning and no supervised fine-tuning; shows strong reasoning but poor readability and language mixing.

2. DeepSeek-R1

Improved model that adds cold-start data and multi-stage SFT + RL, reaching reasoning performance comparable to OpenAI-o1-1217.

3. Reinforcement Learning (RL) in LLMs

Post-training technique that optimizes a language model’s policy via rewards, boosting reasoning with lower cost than pre-training.

4. Supervised Fine-Tuning (SFT)

Training a pretrained model on labeled examples; skipped for DeepSeek-R1-Zero but used in later stages of DeepSeek-R1.

5. Group Relative Policy Optimization (GRPO)

Cost-efficient RL algorithm that replaces a critic with group-based baseline estimation when updating the policy model.
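
Illustrative sketch (not from the lecture): GRPO scores each sampled output against the mean and standard deviation of its own group to obtain an advantage, instead of querying a learned critic; the full objective also includes a clipped probability ratio and a KL penalty. Variable names below are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantage: compare each sampled output's reward to the
    mean/std of its own group rather than to a learned critic's value."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1e-8  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# One prompt, a group of four sampled answers scored by rule-based rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```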

6. Accuracy Reward

Reward signal that grants positive feedback when the model’s answer is verifiably correct (e.g., math box check, code test cases).
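
A toy version of such a rule-based check (the boxed-answer convention and helper below are illustrative assumptions, not the paper's code):

```python
import re

def accuracy_reward(model_output: str, ground_truth: str) -> float:
    """Toy rule-based accuracy reward: 1.0 if the model's boxed final
    answer matches the reference exactly, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

print(accuracy_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
```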

7. Format Reward

Reward component that enforces wrapping the reasoning process in <think>…</think> tags and the final answer in <answer>…</answer> tags.
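
A minimal sketch of such a format check (illustrative only; the regex and reward values are assumptions):

```python
import re

# Reward 1.0 only when the completion is a <think>...</think> block
# followed by an <answer>...</answer> block, as the training template asks.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if FORMAT_PATTERN.match(completion) else 0.0

print(format_reward("<think>2 + 2 = 4</think><answer>4</answer>"))  # 1.0
print(format_reward("The answer is 4."))                            # 0.0
```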

8. Chain-of-Thought (CoT)

Explicit sequence of intermediate reasoning steps written out by the model to arrive at an answer.

9. Cold Start Data

Small, high-quality set of long CoTs used to fine-tune the base model before RL, accelerating convergence and improving readability.

10. Language Consistency Reward

Additional RL signal that penalizes mixed-language CoTs and encourages output in the target language.
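
In the paper this signal is computed as the proportion of target-language words in the CoT; a rough sketch, where the word-level language check is a placeholder assumption:

```python
def language_consistency_reward(cot_tokens, is_target_language) -> float:
    """Fraction of CoT tokens judged to be in the target language.
    `is_target_language` stands in for a real word-level language detector."""
    if not cot_tokens:
        return 0.0
    hits = sum(1 for token in cot_tokens if is_target_language(token))
    return hits / len(cot_tokens)

# Toy usage: treat ASCII-only tokens as "English".
print(language_consistency_reward(["First", "compute", "2+2", "然后", "check"],
                                  lambda t: t.isascii()))  # 0.8
```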

11. Majority Voting

Inference method that samples multiple outputs and selects the most frequent answer to raise accuracy.
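
A minimal sketch of this voting scheme (names are illustrative):

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Return the most frequent final answer among the sampled outputs."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Five samples for the same prompt; ties resolve by first-seen order.
print(majority_vote(["42", "41", "42", "42", "7"]))  # 42
```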

12. Self-Evolution Process

Gradual lengthening and refinement of CoTs during RL, leading to emergent reflection and better reasoning.

13. “Aha Moment”

Training milestone where the model spontaneously starts allocating more thinking time to a problem and reevaluating its initial approach, showing emergent insight.

14. Distillation

Transferring reasoning patterns from a large teacher (DeepSeek-R1) to smaller dense models via supervised learning.

15. DeepSeek-R1-Distill-Qwen-7B

7B-parameter distilled model that attains 55.5% pass@1 on AIME 2024, beating larger open baselines.

16. Pass@1

Metric measuring the fraction of problems the model solves correctly on its first attempt.
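
In the paper's evaluation setup, pass@1 is estimated by sampling k responses per question at non-zero temperature and averaging their correctness; a tiny sketch:

```python
def pass_at_1(per_sample_correct):
    """Average correctness over k sampled responses for one question,
    i.e. pass@1 = (1/k) * sum(p_i)."""
    return sum(per_sample_correct) / len(per_sample_correct)

print(pass_at_1([1, 0, 1, 1]))  # 0.75
```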

17. AIME 2024 Benchmark

Set of 15 American Invitational Mathematics Examination problems used to test mathematical reasoning.

18. MATH-500

A 500-problem subset of the MATH benchmark of competition mathematics, used to evaluate higher-level problem solving.

19. Codeforces Rating

Elo-style score derived from competitive programming tasks; gauges coding ability of LLMs.

20. LiveCodeBench

Benchmark of coding problems drawn from recent programming contests, assessed with CoT-based generation.

21. Reasoning-Oriented RL

RL phase centered on math, coding, science, and logic tasks with clear rule-based feedback.

22. Rejection Sampling

Generating multiple outputs from a checkpoint and keeping only correct/high-quality ones to build new SFT data.
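
A schematic of the idea (the `generate` and `is_correct` helpers are placeholders, not real APIs):

```python
def rejection_sample(prompts, generate, is_correct, n_samples=4):
    """Sample several completions per prompt from a checkpoint and keep
    only those judged correct/high quality as new SFT examples."""
    sft_data = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = generate(prompt)
            if is_correct(prompt, completion):
                sft_data.append({"prompt": prompt, "completion": completion})
    return sft_data

# Toy usage with stand-in helpers.
data = rejection_sample(
    ["What is 2 + 2?"],
    generate=lambda p: "<think>2 + 2 = 4</think><answer>4</answer>",
    is_correct=lambda p, c: "<answer>4</answer>" in c,
)
print(len(data))  # 4
```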

23. Reward Hacking

Undesired behavior where the model exploits flaws in a reward model to inflate reward without real improvement.

24. Process Reward Model (PRM)

Neural model scoring intermediate reasoning steps; prone to reward hacking and high overhead in large-scale RL.

25. Monte Carlo Tree Search (MCTS) in LLMs

Search algorithm tried for token-level exploration; faced scaling issues due to enormous search space.

26. Reasoning Patterns

Strategies such as self-verification and reflection that enable effective problem solving and can be transferred via distillation.

27. Safety RL

Reinforcement learning stage aimed at reducing harmful or non-compliant outputs; it can lower factual-QA scores in certain languages because the model declines to answer some queries.

28. AlpacaEval 2.0

GPT-4-judged evaluation of open-ended tasks; DeepSeek-R1 achieves an 87.6% length-controlled win-rate.

29. ArenaHard

Challenging GPT-4-judged benchmark for open-domain QA; DeepSeek-R1 scores a 92.3% win-rate.

30. DeepSeek-V3-Base

Underlying pretrained model that serves as the starting point for both DeepSeek-R1-Zero and DeepSeek-R1.