Deep Reinforcement Learning – Vocabulary


Description and Tags

Key vocabulary terms and definitions covering Deep Reinforcement Learning beyond basic DQN, including algorithm variants, core concepts, improvements, policy-gradient methods, and imitation learning challenges.


38 Terms

1

Reinforcement Learning (RL)

A machine-learning paradigm in which an agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.

2

Deep Reinforcement Learning (DRL)

RL that leverages deep neural networks to approximate value functions, policies, or models, enabling end-to-end learning from high-dimensional inputs.

3

Deep Q-Network (DQN)

A value-based DRL algorithm that combines Q-learning with deep neural networks plus experience replay and a target network to stabilize training.
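
Written out for reference (standard DQN formulation): the online network is trained to minimize the squared error (y − Q_online(s,a))², where the target is y = r + γ·max_a' Q_target(s',a'), γ is the discount factor, and Q_target is the target network.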

4

Double DQN (DDQN)

An extension of DQN that reduces Q-value overestimation by using the online network to choose the best action and the target network to evaluate it.
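
In the standard formulation, the target decouples action selection from evaluation: y = r + γ·Q_target(s', argmax_a' Q_online(s',a')), so the online network picks the action and the target network scores it.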

5

Experience Replay

A technique that stores past transitions in a buffer and samples them randomly to break temporal correlations and improve data efficiency.
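
A minimal Python sketch of such a buffer (class name and capacity are illustrative, not from the original set):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # uniform random sampling breaks temporal correlations between consecutive steps
            return random.sample(self.buffer, batch_size)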

6

Target Network

A periodically or softly updated copy of the Q-network used to compute stable target values during training.

7

Replay Memory

The buffer that holds past state–action–reward–next-state tuples for experience replay.

8

Q-Table

A tabular representation of state–action values used in classical (non-deep) Q-learning or SARSA.
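
For reference, the standard tabular Q-learning update applied to such a table is Q(s,a) ← Q(s,a) + α·[r + γ·max_a' Q(s',a') − Q(s,a)], with learning rate α and discount factor γ.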

9

Value Function

A function that estimates expected return from a state (V) or state-action pair (Q) under a given policy.
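
Written out under a policy π (standard definitions): V_π(s) = E[Σ_t γ^t·r_t | s_0 = s] and Q_π(s,a) = E[Σ_t γ^t·r_t | s_0 = s, a_0 = a], where γ is the discount factor.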

10

Policy Function

A mapping from states to probabilities of selecting each action; represents the agent’s behavior.

11

ε-Greedy Exploration

An exploration strategy that chooses a random action with probability ε and the greedy (best) action otherwise.
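
A minimal Python sketch of the rule (function and variable names are illustrative):

    import random

    def epsilon_greedy(q_values, epsilon):
        # q_values: list of Q(s, a) estimates for every action in the current state
        if random.random() < epsilon:
            return random.randrange(len(q_values))  # explore: uniform random action
        return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: greedy action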

12

Dueling DQN

A DQN variant that decomposes Q(s,a) into a state-value stream V(s) and an advantage stream A(s,a) to learn which states are valuable independent of action.
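
In the usual aggregation, the two streams are recombined as Q(s,a) = V(s) + A(s,a) − mean_a' A(s,a'); subtracting the mean advantage keeps the decomposition identifiable.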

13

Advantage Function

Defined as A(s,a) = Q(s,a) − V(s); measures how much better an action is than the state's average value V(s).

14

Prioritized Experience Replay (PER)

A replay strategy that samples transitions with probability proportional to their TD error magnitude, focusing learning on informative experiences.
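
In the standard formulation, a transition i with priority p_i (e.g., |TD error| + a small constant) is sampled with probability P(i) = p_i^α / Σ_k p_k^α, and the resulting bias is corrected with importance-sampling weights w_i ∝ (N·P(i))^(−β).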

15

Polyak (Soft) Update

A target-network update rule: θ_target ← τ·θ_online + (1−τ)·θ_target with small τ, providing smooth parameter changes.
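
A minimal PyTorch-style sketch of the soft update (function name and the value of τ are illustrative, not from the set):

    def soft_update(target_net, online_net, tau=0.005):
        # blend each target parameter slightly toward the corresponding online parameter
        for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
            t_param.data.copy_(tau * o_param.data + (1.0 - tau) * t_param.data)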

16

Policy Gradient Methods

Algorithms that directly optimize parameterized policies by ascending the gradient of expected return.
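
The underlying policy-gradient theorem (standard form): ∇_θ J(θ) = E_π[∇_θ log π_θ(a|s)·Q_π(s,a)], i.e., actions that lead to higher return have their log-probability pushed up.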

17

REINFORCE

The basic Monte-Carlo policy-gradient algorithm that updates policy parameters after complete episodes using returns as weights.
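
As a worked update over one episode (standard form): θ ← θ + α·Σ_t ∇_θ log π_θ(a_t|s_t)·G_t, where G_t is the discounted return from time step t (often with a baseline subtracted to reduce variance).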

18

Actor-Critic

A framework with an actor (policy) and critic (value estimator) where the critic provides low-variance gradients to update the actor.
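
In the common one-step form, the critic's TD error δ = r + γ·V(s') − V(s) serves as the advantage estimate, and the actor is updated along ∇_θ log π_θ(a|s)·δ.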

19

A2C (Advantage Actor-Critic)

A synchronous actor-critic method that uses advantage estimates for more stable updates.

20

A3C (Asynchronous Advantage Actor-Critic)

A parallel version of A2C where multiple workers asynchronously update shared parameters, improving exploration.

21

Proximal Policy Optimization (PPO)

A policy-gradient algorithm that uses a clipped surrogate loss to constrain policy updates, balancing stability and sample efficiency.
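
Written out, the standard clipped objective is L(θ) = E[min(r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t)] with probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t), so updates that push the ratio far from 1 gain nothing.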

22

Deterministic Policy Gradient (DPG)

A policy-gradient formulation for continuous actions where the policy outputs a single action instead of a distribution.
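
The corresponding deterministic policy-gradient theorem (standard form): ∇_θ J(θ) = E[∇_a Q(s,a)·∇_θ μ_θ(s)] evaluated at a = μ_θ(s), i.e., the actor is adjusted in the direction that raises the critic's value of its own action.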

23

Deep Deterministic Policy Gradient (DDPG)

An actor-critic algorithm that extends DPG with deep neural networks, target networks, and a replay buffer, making it suitable for continuous control.

24

Twin Delayed DDPG (TD3)

Improves DDPG by using two critics, delaying actor updates, and adding noise to target actions to reduce overestimation bias.
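
In the standard formulation, the shared critic target is y = r + γ·min(Q1_target(s',ã), Q2_target(s',ã)) with ã = μ_target(s') + clipped Gaussian noise; taking the minimum of the two critics counteracts overestimation.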

25

Soft Actor-Critic (SAC)

An entropy-regularized actor-critic method that maximizes expected return plus policy entropy, encouraging exploration.
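
The standard maximum-entropy objective is J(π) = E[Σ_t r_t + α·H(π(·|s_t))], where H is the policy entropy and the temperature α trades off reward against exploration.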

26

Stochastic Policy

A policy that outputs a probability distribution over actions, enabling inherent exploration and robustness.

27

Deterministic Policy

A policy that outputs a single action for each state; useful when exploration is handled separately.

28

Imitation Learning

A learning paradigm where an agent learns a policy from expert demonstrations instead of a reward signal.

29

Behavioral Cloning

A supervised imitation-learning approach that treats state–action pairs as labeled data and trains a policy via classification or regression.
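
A minimal PyTorch-style sketch of the supervised objective for discrete actions (network shapes and names are illustrative assumptions):

    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # state dim 4, 2 actions
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()  # classification against the expert's chosen action

    def bc_step(expert_states, expert_actions):
        logits = policy(expert_states)          # (batch, num_actions)
        loss = loss_fn(logits, expert_actions)  # expert actions serve as the "labels"
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()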

30

DAgger (Dataset Aggregation)

An imitation-learning algorithm that iteratively collects new states visited by the learner, labels them with the expert, and retrains to mitigate distributional shift.
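
A Python sketch of the loop; expert_demos, env, expert_policy, train_supervised, and rollout are hypothetical placeholders for the expert data, the learner's environment, the expert labeler, a behavioral-cloning trainer, and a state-collection routine:

    def dagger(expert_demos, env, expert_policy, train_supervised, rollout, iterations=10):
        dataset = list(expert_demos)                 # start from expert (state, action) pairs
        for _ in range(iterations):
            learner = train_supervised(dataset)      # behavioral cloning on all data so far
            states = rollout(env, learner)           # visit states under the learner's own policy
            labels = [expert_policy(s) for s in states]  # expert relabels those states
            dataset += list(zip(states, labels))     # aggregate; this targets distributional shift
        return train_supervised(dataset)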

31

Sample Complexity

The amount of data or interactions required for an RL algorithm to learn a satisfactory policy.

32

Stability (in RL)

The tendency of an algorithm to converge reliably without divergence or oscillations during training.

33

Generalization (in RL)

The capacity of a learned policy to perform well in unseen states or slightly different environments.

34

Moravec’s Paradox

The observation that high-level reasoning requires little computation, while low-level sensorimotor skills require enormous computational resources for AI.

35

Multi-Agent Reinforcement Learning (MARL)

RL settings involving multiple interacting agents that may cooperate or compete; extensions include MADDPG and QMIX.

36

Exploration–Exploitation Trade-off

The dilemma between exploring new actions to gain information and exploiting known rewarding actions.

37

World Model

A learned forward model of environment dynamics used in model-based RL for planning or imagination.

38

Rainbow DQN

An integrated DQN variant combining several improvements (DDQN, PER, Dueling, Multi-Step, Noisy Nets, C51, etc.) for superior performance.