Vocabulary flashcards covering key terms, algorithms, environments, and concepts from Session 10 on Reinforcement Learning.
Dynamic Programming
A family of RL methods that solve Markov Decision Processes by using a known model of the environment to compute value functions backward from the goal.
Monte Carlo (MC) Method
Model-free algorithm that learns value functions by averaging complete episode returns; converges slowly but needs no environment model.
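A minimal sketch of first-visit Monte Carlo prediction, assuming episodes are supplied as lists of (state, reward) pairs; the function and variable names are illustrative, not taken from the session code:

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.99):
    """First-visit Monte Carlo: average the full discounted return observed
    after the first visit to each state, across complete episodes."""
    returns = defaultdict(list)
    for episode in episodes:                     # episode = [(state, reward), ...]
        G = 0.0
        for t in reversed(range(len(episode))):  # walk backward, accumulating the return
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit check: only record G at the earliest occurrence of this state.
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```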
Temporal Difference (TD) Learning
Updates value estimates from incomplete episodes by bootstrapping one step ahead, blending MC and Dynamic Programming ideas.
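A tabular TD(0) prediction step might look like the following sketch (names and default values are illustrative):

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: nudge V[state] toward the bootstrapped target
    reward + gamma * V[next_state], with no bootstrap for terminal states."""
    target = reward + (0.0 if done else gamma * V[next_state])
    td_error = target - V[state]          # the "temporal difference" itself
    V[state] += alpha * td_error
    return td_error
```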
SARSA
On-policy TD control algorithm that updates Q-values with the quintuple (State, Action, Reward, next State, next Action).
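A sketch of the tabular SARSA update, assuming Q is indexed as Q[state][action]:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """On-policy update: bootstrap from the action a_next that the
    behavior policy actually selected in s_next."""
    target = r + (0.0 if done else gamma * Q[s_next][a_next])
    Q[s][a] += alpha * (target - Q[s][a])
```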
Q-Learning
Off-policy TD control algorithm that updates Q-values toward the maximal estimated value of the next state, independent of the agent’s current policy.
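The corresponding Q-Learning sketch differs only in bootstrapping from the greedy (maximum) action value:

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Off-policy update: bootstrap from the best estimated action in s_next,
    regardless of which action the behavior policy will actually take there."""
    target = r + (0.0 if done else gamma * max(Q[s_next]))
    Q[s][a] += alpha * (target - Q[s][a])
```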
On-policy
Learning strategy that updates values using the action actually taken by the current behavior policy (e.g., SARSA).
Off-policy
Learning strategy that updates values using actions from a different (typically greedy) target policy (e.g., Q-Learning).
Exploration vs. Exploitation
The trade-off between trying new actions to gather information and choosing known rewarding actions to maximize return.
Frozen Lake Environment
8×8 gridworld with fixed start and goal, slippery (stochastic) dynamics, used to test tabular RL methods.
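A minimal setup sketch using the Gymnasium package's 8×8 slippery map (the package and environment ID are assumptions here, not necessarily the session's exact code):

```python
import gymnasium as gym
import numpy as np

# Slippery 8x8 FrozenLake: actions may move the agent in an unintended direction.
env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True)

# Tabular methods fit directly: 64 discrete states x 4 discrete actions.
Q = np.zeros((env.observation_space.n, env.action_space.n))

state, info = env.reset(seed=0)
next_state, reward, terminated, truncated, info = env.step(env.action_space.sample())
```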
Taxi Environment
5×5 grid task where a taxi picks up and drops off a passenger; its large discrete state space (500 states) is encoded from taxi row/column, passenger location, and destination.
CartPole Environment
Classic control task in which the agent balances a pole on a moving cart by pushing the cart left or right; observations are continuous and often discretized for Q-learning.
Mountain Car Environment
Task where a car must build momentum to reach a hilltop; sparse reward (goal only) often improved via reward shaping.
Reward Shaping
Adding intermediate, artificial rewards to guide learning toward the goal faster without altering the optimal policy.
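One hedged illustration is potential-based shaping for Mountain Car; the specific potential function below is an assumption for the sketch, not taken from the session:

```python
def potential(observation):
    """Illustrative potential: larger when the car is further to the right
    (Mountain Car positions range roughly from -1.2 to 0.6)."""
    position, velocity = observation
    return position

def shaped_reward(raw_reward, obs, next_obs, gamma=0.99):
    """Potential-based shaping F = gamma * Phi(s') - Phi(s): adds guidance
    toward the goal while leaving the optimal policy unchanged."""
    return raw_reward + gamma * potential(next_obs) - potential(obs)
```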
Sparse Reward
Reward structure in which significant feedback is given only at distant goals, making learning slow.
Discretization (Binning)
Converting continuous state variables into finite bins so that tabular methods like Q-learning can be applied.
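A small binning sketch for CartPole observations using NumPy; the bin counts and value ranges are illustrative:

```python
import numpy as np

# Illustrative bin edges: 9 edges per variable give 10 bins each.
bin_edges = [
    np.linspace(-2.4, 2.4, 9),    # cart position
    np.linspace(-3.0, 3.0, 9),    # cart velocity
    np.linspace(-0.21, 0.21, 9),  # pole angle (radians)
    np.linspace(-3.0, 3.0, 9),    # pole angular velocity
]

def discretize(observation):
    """Map a continuous observation to a tuple of bin indices,
    usable as a key into a tabular Q-table (e.g., a dict)."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(observation, bin_edges))
```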
Bins
The intervals (defined by their boundary values, or edges) into which each continuous variable’s range is partitioned during discretization.
Gamma (γ) – Discount Factor
Hyperparameter (0–1) that weights future rewards when computing returns; higher values of γ weight distant rewards more heavily.
Epsilon (ε) – Exploration Rate
Probability of selecting a random action in ε-greedy policies to encourage exploration.
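A sketch of ε-greedy action selection over a tabular Q (names are illustrative):

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon):
    """With probability epsilon explore (uniform random action);
    otherwise exploit the action with the highest current Q-value estimate."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    q_values = Q[state]
    return max(range(n_actions), key=lambda a: q_values[a])
```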
Epsilon Decay
Gradual reduction of ε over episodes to shift behavior from exploration to exploitation.
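A common multiplicative schedule, shown as a sketch; the decay rate and floor value are assumptions, not the session's exact settings:

```python
def decay_epsilon(epsilon, decay_rate=0.995, epsilon_min=0.01):
    """Applied once per episode: exploration fades gradually
    but never drops below epsilon_min."""
    return max(epsilon_min, epsilon * decay_rate)
```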
Experience Replay
Technique where past transitions are stored in a buffer and sampled randomly to break temporal correlations when training a DQN.
Replay Buffer
Memory structure that holds tuples (state, action, reward, next_state, done) for experience replay.
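A minimal replay buffer sketch built on collections.deque:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state, done)
    tuples; the oldest transitions are discarded once capacity is reached."""

    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```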
Deep Q-Network (DQN)
Neural-network-based Q-learning algorithm that approximates the action-value function and leverages experience replay and target networks.
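A small Q-network sketch, assuming PyTorch; the layer sizes are illustrative rather than the session's exact architecture:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an observation vector to one Q-value per discrete action."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)
```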
Target Network
A periodically updated copy of the Q-network used in DQN to stabilize learning by providing fixed targets during backpropagation.
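Assuming two instances of the QNetwork sketched above (the names policy_net and target_net, and the CartPole-sized dimensions, are illustrative), the periodic hard update is a single state-dict copy:

```python
policy_net = QNetwork(obs_dim=4, n_actions=2)   # online network, trained every step
target_net = QNetwork(obs_dim=4, n_actions=2)   # frozen copy used to compute targets

# Hard update, performed every few hundred training steps rather than every step:
target_net.load_state_dict(policy_net.state_dict())
```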
Bellman Equation (in DQN)
Target update rule Q_target = r + γ · max_a′ Q(next_state, a′) that drives Q-value learning.
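A batched version of this target, sketched with PyTorch tensors and a target network; zeroing the bootstrap term on terminal transitions via the done flags is standard DQN practice:

```python
import torch

def dqn_targets(rewards, next_states, dones, target_net, gamma=0.99):
    """Bellman targets for a sampled batch:
    r + gamma * max_a' Q_target(s', a'), with the bootstrap term
    zeroed out for terminal transitions (dones is a float tensor of 0/1)."""
    with torch.no_grad():                           # targets are treated as constants
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q * (1.0 - dones)
```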
Batch Size
Number of sampled transitions used in one gradient update during experience replay.
Rolling Average
Average reward over a recent window of episodes (e.g., last 20) used to track performance trends.
Solved Threshold
Pre-defined rolling-average reward above which an environment is considered solved (e.g., CartPole ≥ 195).
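A sketch combining this card with the previous one: track a rolling average over the last 20 episodes and flag the run as solved once a full window clears the threshold (window size and threshold follow the examples on these cards):

```python
from collections import deque

recent_rewards = deque(maxlen=20)   # rolling window: last 20 episode rewards

def log_episode(episode_reward, solved_threshold=195.0):
    """Append the latest episode reward, return the rolling average,
    and report whether the solved threshold is met over a full window."""
    recent_rewards.append(episode_reward)
    rolling_avg = sum(recent_rewards) / len(recent_rewards)
    solved = len(recent_rewards) == recent_rewards.maxlen and rolling_avg >= solved_threshold
    return rolling_avg, solved
```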
Hyperparameters
Tunable settings such as learning rate, γ, ε, batch size, and network architecture that govern RL algorithm performance.