A vocabulary set covering key reinforcement-learning concepts, algorithms, and implementation details from Lecture 9, with emphasis on Deep Q-Networks (DQN), Double DQN (DDQN), function approximation, and practical training considerations.
Bellman Equation
Fundamental recursive relationship that expresses the value of a state as immediate reward plus discounted value of successor states.
Temporal Difference (TD) Methods
Learning techniques that update value estimates based on the difference between successive predictions over time.
Q-Learning
Off-policy TD control algorithm that learns the optimal action-value function Q*(s,a) by bootstrapping from future estimates.
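A minimal sketch of the tabular Q-learning update, assuming a NumPy table `Q` indexed by state and action; the step size `alpha` and discount `gamma` are illustrative values, not the lecture's settings.

```python
import numpy as np

# Tabular Q-learning update: move Q(s, a) toward the bootstrapped TD target.
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the best next-state action value.
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```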
Deep Q-Network (DQN)
Algorithm that combines Q-learning with a deep convolutional neural network to approximate Q(s,a) from high-dimensional inputs such as pixels.
Double DQN (DDQN)
Variant of DQN that reduces Q-value overestimation by using the online network for action selection and the target network for value evaluation.
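A sketch of the Double DQN target computation for a mini-batch, assuming two Keras models named `online_model` and `target_model` (the names and batch layout are assumptions).

```python
import numpy as np

# Double DQN targets: online network selects actions, target network evaluates them.
# `dones` is assumed to be a 0/1 float array marking terminal transitions.
def ddqn_targets(online_model, target_model, rewards, next_states, dones, gamma=0.99):
    best_actions = np.argmax(online_model.predict(next_states, verbose=0), axis=1)  # selection
    next_q = target_model.predict(next_states, verbose=0)                           # evaluation
    selected = next_q[np.arange(len(rewards)), best_actions]
    return rewards + gamma * selected * (1.0 - dones)
```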
Experience Replay
Technique that stores past transitions in a replay buffer and samples random mini-batches to break correlations and improve data efficiency.
Replay Buffer
Finite memory structure (often a deque) that holds (state, action, reward, next_state, done) tuples for experience replay.
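A minimal deque-based replay buffer sketch; capacity and batch size are illustrative defaults rather than the lecture's values.

```python
import random
from collections import deque

# Fixed-capacity store of (state, action, reward, next_state, done) tuples.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop out automatically

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random mini-batches break temporal correlations between samples.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```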
Target Network
Frozen copy of the online Q-network whose weights are updated periodically or softly to provide stable target values.
Fixed Q-Targets
Practice of computing TD targets with parameters of an older, non-updated network to reduce non-stationarity during training.
Overestimation Bias
Systematic inflation of Q-values caused by using the same network for both action selection and evaluation (addressed by DDQN).
Epsilon-Greedy Policy
Action selection strategy that chooses a random action with probability ε and the best estimated action with probability 1−ε.
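A small epsilon-greedy sketch, assuming a Keras Q-network `model` that maps a batch of states to Q-values; the epsilon value shown is illustrative.

```python
import numpy as np

# Epsilon-greedy: explore with probability epsilon, otherwise act greedily on Q-values.
def epsilon_greedy_action(model, state, n_actions, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)                       # explore
    q_values = model.predict(state[np.newaxis], verbose=0)[0]     # forward pass for one state
    return int(np.argmax(q_values))                               # exploit
```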
Reward Clipping
Constraining reward signals (e.g., to −1, 0, +1) to stabilize learning across games with different score magnitudes.
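A one-line sketch of {−1, 0, +1} clipping via the sign function, as used in the Atari DQN setup.

```python
import numpy as np

# Map any raw game score change to -1, 0, or +1.
def clip_reward(reward):
    return float(np.sign(reward))
```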
Function Approximation
Replacing tabular value storage with a parameterized function (e.g., neural network) to handle large or continuous state spaces.
Value Function Approximation (VFA)
Estimating V(s) or Q(s,a) with a parameterized model V̂(s, w) ≈ V_π(s) to generalize across states.
Gradient Descent
Optimization procedure that updates parameters in the negative direction of the loss gradient to minimize error.
Batch Gradient Methods
Approaches that use accumulated experience to compute updates over data sets rather than single transitions.
Convolutional Neural Network (CNN)
Neural architecture with convolutional layers suited for extracting spatial features from images; core of the Atari DQN.
Polyak (Soft) Averaging
Smoothly updating target network weights as w_target ← τ·w_online + (1−τ)·w_target each step instead of hard copying.
Hard Update
Replacing target network weights with online network weights after a fixed number of steps or episodes.
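A sketch of both target-network update styles, assuming two Keras models named `online_model` and `target_model`; the τ value is illustrative.

```python
# Polyak (soft) update: w_target <- tau * w_online + (1 - tau) * w_target
def soft_update(online_model, target_model, tau=0.005):
    new_weights = [tau * w_o + (1.0 - tau) * w_t
                   for w_o, w_t in zip(online_model.get_weights(),
                                       target_model.get_weights())]
    target_model.set_weights(new_weights)

# Hard update: full copy of the online weights every N steps or episodes.
def hard_update(online_model, target_model):
    target_model.set_weights(online_model.get_weights())
```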
Shaped Reward
Modified reward signal designed to guide learning more effectively than the environment’s sparse or uninformative reward.
Discretization (Binning)
Dividing continuous observation dimensions into discrete intervals to enable tabular or simpler function approximation.
Bins
Intervals used in discretization; number and size of bins control resolution of state representation.
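A sketch covering both Discretization and Bins, using `np.digitize` and bounds roughly matching MountainCar's position and velocity ranges; the bin count is an illustrative choice.

```python
import numpy as np

# Per-dimension bin edges over the observation bounds (20 bins per dimension here).
low  = np.array([-1.2, -0.07])
high = np.array([ 0.6,  0.07])
n_bins = 20
bin_edges = [np.linspace(low[i], high[i], n_bins - 1) for i in range(len(low))]

def discretize(observation):
    # Map each continuous dimension to an integer bin index, giving a tabular state key.
    return tuple(int(np.digitize(observation[i], bin_edges[i]))
                 for i in range(len(observation)))
```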
MountainCar-v0
Classic control environment in which an underpowered car must build momentum to reach the top of a steep hill; often used to illustrate shaped rewards and discretization.
CartPole-v1
Benchmark environment requiring balancing a pole on a cart; used in the discretization and DDQN assignments.
Action Space
Set of all possible actions an agent can take in an environment; e.g., Discrete(3) for MountainCar.
Observation Space
Range and dimensionality of state vectors an environment returns; e.g., Box([-1.2,-0.07],[0.6,0.07],(2,),float32).
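A quick sketch of inspecting both spaces for MountainCar, assuming the Gymnasium package (the classic `gym` package exposes the same attributes).

```python
import gymnasium as gym

env = gym.make("MountainCar-v0")
print(env.action_space)       # Discrete(3)
print(env.observation_space)  # e.g. Box([-1.2, -0.07], [0.6, 0.07], (2,), float32)
```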
Stacked Frames
Technique of concatenating the last k observations (e.g., 4 Atari frames) to include temporal information in input.
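A sketch of frame stacking with a bounded deque, assuming k = 4 preprocessed frames as in the Atari DQN setup.

```python
import numpy as np
from collections import deque

k = 4
frame_stack = deque(maxlen=k)  # always holds the most recent k frames

def stack_frames(new_frame):
    if len(frame_stack) == 0:
        # At episode start, repeat the first frame so the stack is full immediately.
        frame_stack.extend([new_frame] * k)
    else:
        frame_stack.append(new_frame)
    return np.stack(frame_stack, axis=-1)  # e.g. (84, 84, 4) for preprocessed Atari frames
```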
Loss Function (MSE)
Mean-squared-error objective between predicted Q values and target Q values minimized during DQN training.
Adam Optimizer
Adaptive learning-rate optimization algorithm often used to train neural networks in RL.
Learning Rate
Hyperparameter that scales gradient updates; critical for stable convergence in DQN.
Gamma (Discount Factor)
Parameter 0<γ≤1 that determines how future rewards are weighted relative to immediate rewards.
Solved Threshold
Predefined average reward level indicating an environment is considered mastered (e.g., 200 in CartPole).
Rolling Average
Moving mean of recent episode rewards used to assess learning progress and convergence.
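A minimal rolling-average sketch; the 100-episode window is a common convention, not necessarily the lecture's choice.

```python
import numpy as np

# Mean of the most recent `window` episode rewards, used to track convergence.
def rolling_average(episode_rewards, window=100):
    recent = episode_rewards[-window:]
    return float(np.mean(recent)) if recent else 0.0
```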
GradientTape
TensorFlow/Keras API for manually computing gradients and applying custom training loops.
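A minimal custom training step with GradientTape, combining the MSE loss and Adam optimizer entries above; `model`, `states`, `actions`, and `targets` are assumed inputs, and the learning rate is illustrative.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def train_step(model, states, actions, targets):
    with tf.GradientTape() as tape:
        q_values = model(states, training=True)               # forward pass
        mask = tf.one_hot(actions, q_values.shape[-1])         # select the taken actions
        q_taken = tf.reduce_sum(q_values * mask, axis=1)       # predicted Q(s, a)
        loss = loss_fn(targets, q_taken)                       # MSE against TD targets
    grads = tape.gradient(loss, model.trainable_variables)     # manual gradient computation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)
```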
model.fit()
High-level Keras method that automates forward pass, loss computation, backpropagation, and weight updates.
Model Predict
Keras routine that performs a forward pass to output Q-values given input states during acting or evaluation.
Atari 2600 Benchmark
Suite of video games used to evaluate DQN’s ability to learn directly from raw pixels across diverse tasks.
Reward Clipping (−1,0,1)
Normalization technique ensuring each game’s reward magnitude is comparable and prevents exploding gradients.
Action Repeat
Executing the chosen action for multiple frames (e.g., 4) to reduce computational load and stabilize learning.
Softmax vs. Linear Output
DQN uses linear output neurons for Q-values (regression) rather than a Softmax used for classification.
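A sketch of a Q-network head with a linear output layer, one unrestricted real value per action; the dense layer sizes are illustrative, and the Atari DQN places convolutional layers before this head.

```python
import tensorflow as tf

def build_q_network(state_dim, n_actions):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),  # regression head, not Softmax
    ])
```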