Lecture 9 – Value-Function Approximation, DQN and DDQN

Description and Tags

A vocabulary set covering key reinforcement-learning concepts, algorithms, and implementation details from Lecture 9, with emphasis on Deep Q-Networks (DQN), Double DQN (DDQN), function approximation, and practical training considerations.

40 Terms

1

Bellman Equation

Fundamental recursive relationship that expresses the value of a state as immediate reward plus discounted value of successor states.
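
In standard notation (not quoted from the lecture slides), the Bellman expectation equation for a policy π reads:

    V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\big[ R(s, a, s') + \gamma V^{\pi}(s') \big]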

2

Temporal Difference (TD) Methods

Learning techniques that update value estimates based on the difference between successive predictions over time.
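
For example, the TD(0) update for state values, with step size α (standard form, not quoted from the slides):

    V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]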

3

Q-Learning

Off-policy TD control algorithm that learns the optimal action-value function Q*(s,a) by bootstrapping from future estimates.
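
The tabular Q-learning update rule (standard form):

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]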

4

Deep Q-Network (DQN)

Algorithm that combines Q-learning with a deep convolutional neural network to approximate Q(s,a) from high-dimensional inputs such as pixels.

5

Double DQN (DDQN)

Variant of DQN that reduces Q-value overestimation by using the online network for action selection and the target network for value evaluation.
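
The DDQN target, writing θ for the online-network parameters and θ⁻ for the target-network parameters:

    y = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \, \theta^{-}\big)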

6

Experience Replay

Technique that stores past transitions in a replay buffer and samples random mini-batches to break correlations and improve data efficiency.

7

Replay Buffer

Finite memory structure (often a deque) that holds (state, action, reward, next_state, done) tuples for experience replay.
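
A minimal replay-buffer sketch (class name, capacity, and batch size are illustrative, not taken from the assignment code):

    import random
    from collections import deque

    import numpy as np

    class ReplayBuffer:
        """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

        def __init__(self, capacity=100_000):
            self.memory = deque(maxlen=capacity)  # oldest transitions are dropped automatically

        def store(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))

        def sample(self, batch_size=64):
            batch = random.sample(self.memory, batch_size)  # uniform random mini-batch
            states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
            return states, actions, rewards, next_states, dones

        def __len__(self):
            return len(self.memory)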

8

Target Network

Frozen copy of the online Q-network whose weights are updated periodically or softly to provide stable target values.

9

Fixed Q-Targets

Practice of computing TD targets with parameters of an older, non-updated network to reduce non-stationarity during training.
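
With θ⁻ denoting the frozen parameters, the TD target used by DQN is:

    y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})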

10

Overestimation Bias

Systematic inflation of Q-values caused by using the same network for both action selection and evaluation (addressed by DDQN).

11

Epsilon-Greedy Policy

Action selection strategy that chooses a random action with probability ε and the best estimated action with probability 1−ε.
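
A minimal sketch of ε-greedy selection over a vector of Q-values (function and variable names illustrative):

    import numpy as np

    def epsilon_greedy(q_values, epsilon):
        """Random action with probability epsilon, otherwise the greedy action."""
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))  # explore
        return int(np.argmax(q_values))              # exploit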

12

Reward Clipping

Constraining reward signals (e.g., to −1, 0, +1) to stabilize learning across games with different score magnitudes.
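
A one-line sign-clipping sketch, assuming reward holds the raw environment reward (np.clip would bound rather than binarize it):

    import numpy as np

    clipped_reward = float(np.sign(reward))  # positive score -> +1, negative -> -1, zero -> 0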

13

Function Approximation

Replacing tabular value storage with a parameterized function (e.g., neural network) to handle large or continuous state spaces.

14

Value Function Approximation (VFA)

Estimating V(s) or Q(s,a) with a parameterized model V̂(s; w) ≈ V^π(s) to generalize across states.
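
The corresponding stochastic-gradient weight update (standard form):

    \Delta w = \alpha \big( V_{\text{target}} - \hat{V}(s; w) \big) \, \nabla_{w} \hat{V}(s; w)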

15

Gradient Descent

Optimization procedure that updates parameters in the negative direction of the loss gradient to minimize error.
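
Generic form, with learning rate α and loss L:

    w \leftarrow w - \alpha \, \nabla_{w} L(w)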

16

Batch Gradient Methods

Approaches that compute updates from batches of accumulated experience rather than from single transitions.

17

Convolutional Neural Network (CNN)

Neural architecture with convolutional layers suited for extracting spatial features from images; core of the Atari DQN.
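
A Keras sketch following the layer sizes of the Nature DQN (Mnih et al., 2015); the lecture's exact network may differ:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_q_network(n_actions, input_shape=(84, 84, 4)):
        """Convolutional Q-network mapping stacked frames to one Q-value per action."""
        return tf.keras.Sequential([
            tf.keras.Input(shape=input_shape),
            layers.Conv2D(32, 8, strides=4, activation="relu"),
            layers.Conv2D(64, 4, strides=2, activation="relu"),
            layers.Conv2D(64, 3, strides=1, activation="relu"),
            layers.Flatten(),
            layers.Dense(512, activation="relu"),
            layers.Dense(n_actions, activation="linear"),  # linear output: Q-values, not class probabilities
        ])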

18

Polyak (Soft) Averaging

Smoothly updating target-network weights each step as w_target ← τ·w_online + (1−τ)·w_target, instead of hard copying.
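
A soft-update sketch for two Keras models (the τ value is illustrative):

    def soft_update(online_model, target_model, tau=0.005):
        """Polyak-average the online weights into the target network."""
        new_weights = [tau * w_online + (1.0 - tau) * w_target
                       for w_online, w_target in zip(online_model.get_weights(),
                                                     target_model.get_weights())]
        target_model.set_weights(new_weights)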

19

Hard Update

Replacing target network weights with online network weights after a fixed number of steps or episodes.
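
In Keras this is a single copy of the weight list (model names illustrative):

    target_model.set_weights(online_model.get_weights())  # e.g., every C training steps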

20

Shaped Reward

Modified reward signal designed to guide learning more effectively than the environment’s sparse or uninformative reward.

21

Discretization (Binning)

Dividing continuous observation dimensions into discrete intervals to enable tabular or simpler function approximation.
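
A binning sketch for the two MountainCar observation dimensions (bin counts are illustrative):

    import numpy as np

    # One array of bin edges per observation dimension (position, velocity).
    position_bins = np.linspace(-1.2, 0.6, 20)
    velocity_bins = np.linspace(-0.07, 0.07, 20)

    def discretize(observation):
        """Map a continuous observation to a tuple of bin indices usable as a Q-table key."""
        position, velocity = observation
        return (int(np.digitize(position, position_bins)),
                int(np.digitize(velocity, velocity_bins)))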

22

Bins

Intervals used in discretization; number and size of bins control resolution of state representation.

23

MountainCar-v0

Classic control environment in which an underpowered car must build momentum to reach the top of a steep hill; often used to illustrate shaped rewards and discretization.

24

CartPole-v1

Benchmark environment in which the agent balances a pole on a moving cart; used in the discretization and DDQN assignments.

25

Action Space

Set of all possible actions an agent can take in an environment; e.g., Discrete(3) for MountainCar.

26

Observation Space

Range and dimensionality of state vectors an environment returns; e.g., Box([-1.2,-0.07],[0.6,0.07],(2,),float32).

27

Stacked Frames

Technique of concatenating the last k observations (e.g., 4 Atari frames) to include temporal information in the input.
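
A minimal frame-stacking sketch with a deque of length k (clear the deque at the start of each episode):

    from collections import deque

    import numpy as np

    k = 4
    frames = deque(maxlen=k)

    def stack_frames(new_frame):
        """Append the newest frame and return the last k frames as one array of shape (H, W, k)."""
        if not frames:                      # first frame of an episode: repeat it k times
            frames.extend([new_frame] * k)
        else:
            frames.append(new_frame)
        return np.stack(frames, axis=-1)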

28

Loss Function (MSE)

Mean-squared-error objective between predicted Q values and target Q values minimized during DQN training.
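
With targets y_i computed from the target network, the objective over a mini-batch of N transitions is:

    L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - Q(s_i, a_i; \theta) \big)^2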

29

Adam Optimizer

Adaptive learning-rate optimization algorithm often used to train neural networks in RL.

30

Learning Rate

Hyperparameter that scales gradient updates; critical for stable convergence in DQN.

31

Gamma (Discount Factor)

Parameter 0<γ≤1 that determines how future rewards are weighted relative to immediate rewards.
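
γ enters through the discounted return that the agent maximizes:

    G_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}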

32

Solved Threshold

Predefined average reward level indicating an environment is considered mastered (e.g., 200 in CartPole).

33

Rolling Average

Moving mean of recent episode rewards used to assess learning progress and convergence.

34

GradientTape

TensorFlow/Keras API for manually computing gradients and applying custom training loops.
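
A minimal DQN training step written with GradientTape (models, optimizer, and the sampled mini-batch are assumed to exist; actions as integers):

    import tensorflow as tf

    def train_step(online_model, target_model, optimizer, states, actions,
                   rewards, next_states, dones, gamma=0.99):
        """One gradient step on the mean-squared TD error."""
        rewards = tf.cast(rewards, tf.float32)
        dones = tf.cast(dones, tf.float32)
        next_q = tf.reduce_max(target_model(next_states), axis=1)
        targets = rewards + gamma * next_q * (1.0 - dones)       # no bootstrap on terminal transitions
        with tf.GradientTape() as tape:
            q_values = online_model(states)
            mask = tf.one_hot(actions, q_values.shape[-1])
            q_taken = tf.reduce_sum(q_values * mask, axis=1)     # Q(s, a) for the actions actually taken
            loss = tf.reduce_mean(tf.square(targets - q_taken))
        grads = tape.gradient(loss, online_model.trainable_variables)
        optimizer.apply_gradients(zip(grads, online_model.trainable_variables))
        return loss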

35

model.fit()

High-level Keras method that automates forward pass, loss computation, backpropagation, and weight updates.
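
Typical one-epoch call on a sampled mini-batch, where target_q_values holds the network's own predictions with the TD target written in at each taken action (names illustrative):

    model.fit(states, target_q_values, batch_size=64, epochs=1, verbose=0)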

36

model.predict()

Keras routine that performs a forward pass to output Q-values given input states during acting or evaluation.
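
Typical usage when selecting an action (model and state assumed; np.newaxis adds the batch dimension Keras expects):

    import numpy as np

    q_values = model.predict(state[np.newaxis, ...], verbose=0)[0]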

37

Atari 2600 Benchmark

Suite of video games used to evaluate DQN’s ability to learn directly from raw pixels across diverse tasks.

38

Reward Clipping (−1,0,1)

Normalization technique that keeps reward magnitudes comparable across games and helps prevent exploding gradients.

39

Action Repeat

Executing the chosen action for multiple frames (e.g., 4) to reduce computational load and stabilize learning.
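
A sketch of repeating one chosen action for k environment steps and summing the reward (Gymnasium-style 5-tuple step API assumed):

    def repeat_action(env, action, k=4):
        """Apply the same action k times; stop early if the episode ends."""
        total_reward = 0.0
        for _ in range(k):
            obs, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        return obs, total_reward, terminated, truncated, info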

40

Softmax vs. Linear Output

DQN uses linear output neurons to predict Q-values (a regression task) rather than the Softmax output used for classification.