Key vocabulary terms and definitions covering Deep Reinforcement Learning beyond basic DQN, including algorithm variants, core concepts, improvements, policy-gradient methods, and imitation learning challenges.
Reinforcement Learning (RL)
A machine-learning paradigm in which an agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.
Deep Reinforcement Learning (DRL)
RL that leverages deep neural networks to approximate value functions, policies, or models, enabling end-to-end learning from high-dimensional inputs.
Deep Q-Network (DQN)
A value-based DRL algorithm that combines Q-learning with deep neural networks plus experience replay and a target network to stabilize training.
Double DQN (DDQN)
An extension of DQN that reduces Q-value overestimation by using the online network to choose the best action and the target network to evaluate it.
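A minimal sketch of the Double-DQN target computation, assuming PyTorch and user-supplied `online_net` / `target_net` Q-networks that map batched states to per-action values:

```python
import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double-DQN target: the online network selects the argmax action,
    the target network evaluates it (network APIs are assumptions)."""
    with torch.no_grad():
        # Action selection with the online network ...
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... action evaluation with the target network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```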
Experience Replay
A technique that stores past transitions in a buffer and samples them randomly to break temporal correlations and improve data efficiency.
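A minimal replay-buffer sketch (uniform sampling; the transition layout is an assumption):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample them uniformly at random."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # random sampling breaks temporal correlations
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```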
Target Network
A periodically or softly updated copy of the Q-network used to compute stable target values during training.
Replay Memory
The buffer that holds past state–action–reward–next-state tuples for experience replay.
Q-Table
A tabular representation of state–action values used in classical (non-deep) Q-learning or SARSA.
Value Function
A function that estimates expected return from a state (V) or state-action pair (Q) under a given policy.
Policy Function
A mapping from states to probabilities of selecting each action; represents the agent’s behavior.
ε-Greedy Exploration
An exploration strategy that chooses a random action with probability ε and the greedy (best) action otherwise.
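A short ε-greedy sketch for a discrete action space (the `q_values` container is an assumption):

```python
import random

def epsilon_greedy(q_values, epsilon, n_actions):
    """With probability epsilon act randomly, otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # explore
    return int(max(range(n_actions), key=lambda a: q_values[a]))  # exploit
```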
Dueling DQN
A DQN variant that decomposes Q(s,a) into a state-value stream V(s) and an advantage stream A(s,a), so the network can learn which states are valuable independently of the action taken. A sketch of the dueling head follows below.
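A dueling-head sketch in PyTorch, assuming a flat observation vector and discrete actions; the mean-advantage subtraction is the common identifiability trick:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s,a)

    def forward(self, obs):
        h = self.body(obs)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)     # subtract the mean advantage for identifiability
```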
Advantage Function
Defined as A(s,a)=Q(s,a)−V(s); measures how much better an action is compared to the average at state s.
Prioritized Experience Replay (PER)
A replay strategy that samples transitions with probability proportional to their TD error magnitude, focusing learning on informative experiences.
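A sketch of proportional prioritized sampling with importance-sampling weights, assuming TD errors are available as a NumPy array; the α/β hyperparameters follow the usual proportional variant:

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample indices with probability proportional to |TD error|^alpha
    and return normalized importance-sampling weights."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()        # normalize weights for stable gradient scaling
    return idx, weights
```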
Polyak (Soft) Update
A target-network update rule, θ_target ← τ·θ_online + (1 − τ)·θ_target, providing smooth parameter changes.
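A minimal soft-update sketch, assuming the parameters are iterables of PyTorch tensors (e.g. `net.parameters()`):

```python
import torch

def polyak_update(target_params, online_params, tau=0.005):
    """Soft update: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    with torch.no_grad():
        for p_t, p_o in zip(target_params, online_params):
            p_t.mul_(1.0 - tau).add_(tau * p_o)  # in-place blend of target and online weights
```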
Policy Gradient Methods
Algorithms that directly optimize parameterized policies by ascending the gradient of expected return.
REINFORCE
The basic Monte-Carlo policy-gradient algorithm that updates policy parameters after complete episodes using returns as weights.
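A REINFORCE loss sketch for one completed episode, assuming per-step log-probability tensors from a PyTorch policy; the return normalization is a common variance-reduction choice, not part of the basic algorithm:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte-Carlo policy gradient: weight each log pi(a_t|s_t) by the return-to-go."""
    returns, g = [], 0.0
    for r in reversed(rewards):                 # discounted returns, computed backwards
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # optional normalization
    return -(torch.stack(log_probs) * returns).sum()  # negative so gradient descent ascends the return
```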
Actor-Critic
A framework with an actor (policy) and a critic (value estimator), where the critic's value estimates reduce the variance of the actor's policy-gradient updates.
A2C (Advantage Actor-Critic)
A synchronous actor-critic method that uses advantage estimates for more stable updates.
A3C (Asynchronous Advantage Actor-Critic)
A parallel version of A2C where multiple workers asynchronously update shared parameters, improving exploration.
Proximal Policy Optimization (PPO)
A policy-gradient algorithm that uses a clipped surrogate loss to constrain policy updates, balancing stability and sample efficiency.
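A sketch of PPO's clipped surrogate loss in PyTorch, assuming batched log-probabilities and advantage estimates are already computed:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective: limit how far the new policy moves from the old one."""
    ratio = torch.exp(new_log_probs - old_log_probs)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                        # negate for gradient descent
```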
Deterministic Policy Gradient (DPG)
A policy-gradient formulation for continuous actions where the policy outputs a single action instead of a distribution.
Deep Deterministic Policy Gradient (DDPG)
An actor-critic algorithm that extends DPG with deep networks, target networks, and a replay buffer; suitable for continuous control.
Twin Delayed DDPG (TD3)
Improves DDPG by using two critics, delaying actor updates, and adding noise to target actions to reduce overestimation bias.
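A sketch of the TD3 critic target showing target-policy smoothing and the twin-critic minimum; the actor/critic networks and the action bound are assumptions:

```python
import torch

def td3_targets(target_actor, critic1_t, critic2_t, rewards, next_states, dones,
                gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 target: smoothed target action, minimum of two target critics."""
    with torch.no_grad():
        action = target_actor(next_states)
        noise = (torch.randn_like(action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (action + noise).clamp(-act_limit, act_limit)   # target-policy smoothing
        q1 = critic1_t(next_states, next_action)
        q2 = critic2_t(next_states, next_action)
        next_q = torch.min(q1, q2).squeeze(-1)                        # pessimistic twin-critic estimate
        return rewards + gamma * (1.0 - dones) * next_q
```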
Soft Actor-Critic (SAC)
An entropy-regularized actor-critic method that maximizes expected return plus policy entropy, encouraging exploration.
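A one-line sketch of SAC's entropy-regularized critic target, where the entropy bonus appears as −α·log π(a'|s') and the inputs are assumed to be precomputed tensors or floats:

```python
def sac_critic_target(reward, next_q_min, next_log_prob, done, gamma=0.99, alpha=0.2):
    """Soft target: bootstrapped value plus an entropy bonus weighted by temperature alpha."""
    return reward + gamma * (1.0 - done) * (next_q_min - alpha * next_log_prob)
```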
Stochastic Policy
A policy that outputs a probability distribution over actions, enabling inherent exploration and robustness.
Deterministic Policy
A policy that outputs a single action for each state; useful when exploration is handled separately.
Imitation Learning
A learning paradigm where an agent learns a policy from expert demonstrations instead of a reward signal.
Behavioral Cloning
A supervised imitation-learning approach that treats state–action pairs as labeled data and trains a policy via classification or regression.
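A behavioral-cloning training step, assuming a discrete action space and a PyTorch policy that outputs action logits:

```python
import torch
import torch.nn as nn

def behavioral_cloning_step(policy, optimizer, expert_states, expert_actions):
    """One supervised step: treat expert (state, action) pairs as labeled data."""
    logits = policy(expert_states)
    loss = nn.functional.cross_entropy(logits, expert_actions)  # classification against expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```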
DAgger (Dataset Aggregation)
An imitation-learning algorithm that iteratively collects new states visited by the learner, labels them with the expert, and retrains to mitigate distributional shift.
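A high-level DAgger loop; `expert_policy`, `train_policy_fn`, and `rollout_fn` are hypothetical helpers standing in for the expert labeler, the supervised trainer, and the environment rollout:

```python
def dagger(env, expert_policy, train_policy_fn, rollout_fn, n_iters=10):
    """DAgger sketch: roll out the learner, label visited states with the expert,
    aggregate the dataset, and retrain (all helpers are assumptions)."""
    dataset = []                                            # aggregated (state, expert_action) pairs
    policy = None                                           # first rollout may follow the expert or a random policy
    for _ in range(n_iters):
        states = rollout_fn(env, policy)                    # states visited by the current learner
        dataset += [(s, expert_policy(s)) for s in states]  # expert labels on those states
        policy = train_policy_fn(dataset)                   # supervised retraining on the aggregate
    return policy
```

Aggregating states visited by the learner, rather than only expert trajectories, is what mitigates the distributional shift that plain behavioral cloning suffers from.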
Sample Complexity
The amount of data or interactions required for an RL algorithm to learn a satisfactory policy.
Stability (in RL)
The tendency of an algorithm to converge reliably without divergence or oscillations during training.
Generalization (in RL)
The capacity of a learned policy to perform well in unseen states or slightly different environments.
Moravec’s Paradox
The observation that high-level reasoning requires little computation, while low-level sensorimotor skills require enormous computational resources for AI.
Multi-Agent Reinforcement Learning (MARL)
RL settings involving multiple interacting agents that may cooperate or compete; extensions include MADDPG and QMIX.
Exploration–Exploitation Trade-off
The dilemma between exploring new actions to gain information and exploiting known rewarding actions.
World Model
A learned forward model of environment dynamics used in model-based RL for planning or imagination.
Rainbow DQN
An integrated DQN variant combining several improvements (Double DQN, prioritized replay, dueling networks, multi-step returns, Noisy Nets, and distributional C51) for superior performance.