Reinforcement Learning – Session 10 Vocabulary

Description and Tags

Vocabulary flashcards covering key terms, algorithms, environments, and concepts from Session 10 on Reinforcement Learning.


28 Terms

1. Dynamic Programming

A family of RL methods that solve Markov Decision Processes by using a known model of the environment to compute value functions backward from the goal.

2. Monte Carlo (MC) Method

Model-free algorithm that learns value functions by averaging complete episode returns; converges slowly but needs no environment model.
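
A minimal first-visit Monte Carlo value-estimation sketch (standard-library Python; the episode data and function name are illustrative, not from the session):

from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.99):
    # episodes: list of episodes, each a time-ordered list of (state, reward) pairs
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # walk the episode backward, accumulating the discounted return
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # first-visit: record G only at the earliest occurrence of the state
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(G)
    # the value estimate is the average of all recorded returns per state
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(mc_value_estimate([[(0, 0.0), (1, 0.0), (2, 1.0)]], gamma=0.5))   # {2: 1.0, 1: 0.5, 0: 0.25}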

3. Temporal Difference (TD) Learning

Updates value estimates from incomplete episodes by bootstrapping one step ahead, blending MC and Dynamic Programming ideas.
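
A one-step TD(0) state-value update as a sketch; the table size and transition values below are illustrative:

import numpy as np

V = np.zeros(16)           # tabular state values, e.g. for a small gridworld
alpha, gamma = 0.1, 0.99   # learning rate and discount factor

# one observed transition (illustrative values)
state, reward, next_state, done = 5, 0.0, 6, False

# bootstrap: use the current estimate of the next state instead of the full return
td_target = reward + gamma * V[next_state] * (not done)
V[state] += alpha * (td_target - V[state])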

4. SARSA

On-policy TD control algorithm that updates Q-values with the quintuple (State, Action, Reward, next State, next Action).
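
A minimal sketch of one SARSA update on a NumPy Q-table; the table sizes and sampled quintuple are illustrative (the sizes happen to match the Taxi environment):

import numpy as np

n_states, n_actions = 500, 6
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

# one observed quintuple: State, Action, Reward, next State, next Action
s, a, r, s_next, a_next, done = 42, 2, -1.0, 43, 0, False

# on-policy: bootstrap from the action the behavior policy actually took next
target = r + gamma * Q[s_next, a_next] * (not done)
Q[s, a] += alpha * (target - Q[s, a])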

5. Q-Learning

Off-policy TD control algorithm that updates Q-values toward the maximal estimated value of the next state, independent of the agent’s current policy.
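
The corresponding Q-Learning update differs only in the bootstrap term; again a sketch with illustrative sizes (64 states and 4 actions, as in an 8×8 Frozen Lake):

import numpy as np

Q = np.zeros((64, 4))
alpha, gamma = 0.1, 0.99

# one observed transition (illustrative values)
s, a, r, s_next, done = 10, 1, 0.0, 18, False

# off-policy: bootstrap from the greedy (max) action, regardless of what the agent does next
target = r + gamma * np.max(Q[s_next]) * (not done)
Q[s, a] += alpha * (target - Q[s, a])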

6. On-policy

Learning strategy that updates values using the action actually taken by the current behavior policy (e.g., SARSA).

7. Off-policy

Learning strategy that updates values using actions from a different (typically greedy) target policy (e.g., Q-Learning).

8. Exploration vs. Exploitation

The trade-off between trying new actions to gather information and choosing known rewarding actions to maximize return.

9. Frozen Lake Environment

8×8 gridworld with a fixed start and goal and slippery (stochastic) dynamics; used to test tabular RL methods.

10. Taxi Environment

5×5 grid task where a taxi picks up and drops off passengers; its larger state space is encoded by the taxi's row/column, the passenger location, and the destination.

11. CartPole Environment

Classic control task in which the agent balances a pole on a moving cart; observations are continuous and are often discretized for tabular Q-learning.

12. Mountain Car Environment

Task where a car must build momentum to reach a hilltop; sparse reward (goal only) often improved via reward shaping.

13. Reward Shaping

Adding intermediate, artificial rewards to guide learning toward the goal faster, ideally without altering the optimal policy.
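
One possible, purely illustrative shaping function for Mountain Car, rewarding the velocity the car builds up; a sketch, not the shaping used in the session:

def shaped_reward(raw_reward, observation):
    # Mountain Car observation = [position, velocity]; the default reward is -1 per step
    _, velocity = observation
    # small bonus for building speed, which the sparse goal-only reward never provides
    return raw_reward + 10.0 * abs(velocity)

print(shaped_reward(-1.0, [-0.5, 0.04]))   # -1.0 + 0.4 = -0.6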

14. Sparse Reward

Reward structure in which significant feedback is given only at distant goals, making learning slow.

15. Discretization (Binning)

Converting continuous state variables into finite bins so that tabular methods like Q-learning can be applied.
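
A sketch of binning with NumPy; the variable bounds and bin count are illustrative choices (roughly CartPole-shaped), not prescribed values:

import numpy as np

low  = np.array([-2.4, -3.0, -0.21, -3.0])   # assumed lower bounds per variable
high = np.array([ 2.4,  3.0,  0.21,  3.0])   # assumed upper bounds per variable
n_bins = 10

# bin edges for each variable; values outside the bounds fall into the outer bins
edges = [np.linspace(low[i], high[i], n_bins - 1) for i in range(4)]

def discretize(observation):
    # map each continuous value to the index of the bin it falls into
    return tuple(int(np.digitize(observation[i], edges[i])) for i in range(4))

state = discretize([0.1, -0.5, 0.02, 1.3])   # a tuple of four bin indices, usable as a Q-table key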

16. Bins

The discrete intervals, defined by boundary values, into which each continuous variable's range is partitioned during discretization.

17. Gamma (γ) – Discount Factor

Hyperparameter (0–1) that weights future rewards when computing returns; higher values of γ weight distant rewards more heavily.
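
A quick worked example of how γ discounts a return (the rewards are illustrative):

gamma = 0.9
rewards = [1.0, 0.0, 0.0, 10.0]                        # rewards received over four steps
G = sum(gamma ** t * r for t, r in enumerate(rewards))
print(G)                                               # 1.0 + 0.9**3 * 10 ≈ 8.29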

18. Epsilon (ε) – Exploration Rate

Probability of selecting a random action in ε-greedy policies to encourage exploration.
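
A minimal ε-greedy action selector (NumPy; the Q-table shape is illustrative):

import numpy as np

rng = np.random.default_rng()
Q = np.zeros((64, 4))              # illustrative tabular Q-values

def epsilon_greedy(state, epsilon, n_actions=4):
    # explore with probability epsilon, otherwise exploit the greedy action
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

action = epsilon_greedy(state=12, epsilon=0.1)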

19. Epsilon Decay

Gradual reduction of ε over episodes to shift behavior from exploration to exploitation.
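
A common multiplicative decay schedule as a sketch; the numbers are illustrative, not the session's settings:

epsilon, epsilon_min, decay = 1.0, 0.01, 0.995

for episode in range(1000):
    # ... one episode of ε-greedy interaction would run here ...
    epsilon = max(epsilon_min, epsilon * decay)   # decay after each episode, floored at epsilon_min

print(epsilon)   # ≈ 0.01: exploration has been annealed down to the floor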

20. Experience Replay

Technique where past transitions are stored in a buffer and sampled randomly to break temporal correlations when training a DQN.

21. Replay Buffer

Memory structure that holds tuples (state, action, reward, next_state, done) for experience replay.
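
A minimal replay buffer covering this card and the previous one, using only the standard library; the class and method names are illustrative:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        # a deque silently drops the oldest transitions once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of consecutive steps
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))   # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)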

22. Deep Q-Network (DQN)

Neural-network-based Q-learning algorithm that approximates the action-value function and leverages experience replay and target networks.

23. Target Network

A periodically updated copy of the Q-network used in DQN to stabilize learning by providing fixed targets during backpropagation.
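
A PyTorch-style sketch of the periodic hard sync, assuming torch is available; the layer sizes and sync interval are illustrative:

import copy
import torch.nn as nn

# online Q-network for a 4-dimensional observation and 2 actions (CartPole-sized)
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# the target network starts as an exact copy and is refreshed only periodically
target_net = copy.deepcopy(q_net)

def maybe_sync(step, sync_every=500):
    # hard update: copy the online weights into the target network every sync_every steps
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())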

24. Bellman Equation (in DQN)

Target update rule Q_target = r + γ · max_a′ Q(next_state, a′) that drives Q-value learning.
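
The same rule computed for a small batch, with illustrative numbers standing in for the target network's next-state predictions:

import numpy as np

gamma = 0.99
rewards = np.array([1.0, 1.0, 0.0])
dones   = np.array([False, False, True])
q_next  = np.array([[0.5, 1.2], [0.9, 0.3], [0.0, 0.0]])   # Q(next_state, ·) per transition

# no bootstrap term on terminal transitions (done == True)
q_target = rewards + gamma * np.max(q_next, axis=1) * (~dones)
print(q_target)   # approximately [2.188, 1.891, 0.0]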

25. Batch Size

Number of sampled transitions used in one gradient update during experience replay.

26. Rolling Average

Average reward over a recent window of episodes (e.g., last 20) used to track performance trends.
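
A sketch of a 20-episode rolling average over logged rewards; the reward data here is random and purely illustrative:

import numpy as np

episode_rewards = np.random.default_rng(0).integers(0, 200, size=100)   # fake per-episode rewards
window = 20

# mean reward over the most recent `window` episodes, recomputed after each episode
rolling = [episode_rewards[max(0, i - window + 1): i + 1].mean()
           for i in range(len(episode_rewards))]

print(rolling[-1])   # average of the final 20 episodes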

27. Solved Threshold

Pre-defined rolling-average reward above which an environment is considered solved (e.g., ≥ 195 for CartPole).

28. Hyperparameters

Tunable settings such as learning rate, γ, ε, batch size, and network architecture that govern RL algorithm performance.