Vocabulary-style flashcards covering key RL concepts from the notes.
Reinforcement Learning (RL)
A framework for solving control/decision problems by an agent learning from interaction with an environment to maximize rewards.
Agent
An intelligent program that learns to make decisions by moving through an environment and taking actions.
Environment
The world in which the agent operates, providing states, rewards, and transitions.
State
A representation of the current situation returned by the environment; typically a feature vector.
Action
A possible decision the agent can take in a given state.
Reward
Numeric feedback from the environment evaluating the agent's action; can be positive or negative.
Policy
A rule mapping states to actions; can be deterministic or stochastic and aims to maximize cumulative rewards.
Value Function
An estimate of the expected return (cumulative reward) from a state or state-action pair.
State-Value Function (V(s))
Expected return starting from state s and following a given policy.
Action-Value Function (Q(s,a))
Expected return starting from state s, taking action a, then following a policy.
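A minimal formal statement of these two value functions, assuming the standard expected-return notation (the notes do not cover discounting, so no discount factor appears here):

```latex
% G_t denotes the cumulative reward from time t onward under policy \pi.
% A discount factor is commonly included but is omitted to match the notes.
V^{\pi}(s)    = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s \,\right]
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s,\ A_t = a \,\right]
```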
Model of the Environment
A representation of environment dynamics used to predict next state and reward for planning.
Model-Based RL
RL that uses an explicit model of the environment for planning and optimization.
Model-Free RL
RL that learns from interactions without building an explicit model of the environment.
Immediate Reinforcement Learning (IRL)
RL where evaluation (reward) happens immediately after taking an action.
Bandit Problem
A simple RL scenario with one-step decisions and rewards but no state transitions.
Multi-Armed Bandit (MAB)
A bandit problem with multiple arms, each with an unknown reward distribution, to maximize cumulative reward.
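A minimal sketch of a Bernoulli multi-armed bandit environment in Python; the arm probabilities are illustrative assumptions, not values from the notes. The ε-greedy, UCB, and Thompson sampling sketches below assume an object with this `pull(arm)` interface.

```python
import random

class BernoulliBandit:
    """K arms; pulling arm i returns reward 1 with (unknown) probability probs[i], else 0."""

    def __init__(self, probs):
        self.probs = probs          # true success probabilities (hidden from the agent)

    @property
    def n_arms(self):
        return len(self.probs)

    def pull(self, arm):
        """Return an immediate reward for the chosen arm (no state transition)."""
        return 1 if random.random() < self.probs[arm] else 0

# Example: a 3-armed bandit with made-up arm probabilities.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
print(bandit.pull(2))
```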
Exploration
Trying new actions to gather information about rewards.
Exploitation
Choosing the best-known action to maximize reward.
ε-Greedy
Policy that mostly exploits the best-known action while exploring randomly with probability ε.
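A minimal sketch of ε-greedy action selection with incremental value estimates, assuming the `BernoulliBandit` interface sketched above; ε = 0.1 is an illustrative choice.

```python
import random

def epsilon_greedy(bandit, steps=1000, eps=0.1):
    """Mostly exploit the best-known arm; explore a random arm with probability eps."""
    counts = [0] * bandit.n_arms            # pulls per arm
    values = [0.0] * bandit.n_arms          # estimated mean reward per arm
    total_reward = 0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(bandit.n_arms)        # explore
        else:
            arm = values.index(max(values))              # exploit
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
        total_reward += reward
    return values, total_reward
```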
Upper Confidence Bound (UCB)
Balances exploration and exploitation by adding a confidence bound to estimated values.
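A sketch of the UCB1 rule: each arm's estimated value gets an exploration bonus that shrinks as the arm is pulled more often. The bonus form sqrt(2 ln t / n_a) is the standard UCB1 choice, assumed here since the notes do not pin down a specific variant.

```python
import math

def ucb1(bandit, steps=1000):
    """Pick the arm maximizing estimated value plus a confidence bonus."""
    counts = [0] * bandit.n_arms
    values = [0.0] * bandit.n_arms
    for t in range(1, steps + 1):
        if t <= bandit.n_arms:
            arm = t - 1                                  # pull each arm once first
        else:
            scores = [values[a] + math.sqrt(2 * math.log(t) / counts[a])
                      for a in range(bandit.n_arms)]
            arm = scores.index(max(scores))
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values
```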
Thompson Sampling
Bayesian method for bandits that samples actions from posterior reward distributions.
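A sketch of Thompson sampling for Bernoulli rewards, assuming a Beta(1, 1) prior per arm: sample a plausible mean for each arm from its posterior and pull the arm with the highest sample.

```python
import random

def thompson_sampling(bandit, steps=1000):
    """Bernoulli bandit: keep a Beta posterior per arm, act on posterior samples."""
    successes = [1] * bandit.n_arms    # Beta prior alpha = 1
    failures = [1] * bandit.n_arms     # Beta prior beta = 1
    for _ in range(steps):
        samples = [random.betavariate(successes[a], failures[a])
                   for a in range(bandit.n_arms)]
        arm = samples.index(max(samples))
        reward = bandit.pull(arm)
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```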
PAC (Probably Approximately Correct)
Framework that guarantees near-optimal learning with high probability from a finite number of samples.
PAC-MDP
PAC framework applied to MDPs, guaranteeing near-optimal policy with high probability.
Regret
Difference between the reward of the optimal policy and the reward achieved by the algorithm.
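The cumulative regret over T steps for a bandit is commonly written as below, where mu* is the mean reward of the best arm and a_t is the arm chosen at step t (notation assumed here, not fixed by the notes):

```latex
R(T) = T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right]
```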
Bandit Optimality
Policy that minimizes cumulative regret or maximizes cumulative reward in bandits.
Value-Based Methods
RL methods that learn value functions (V or Q) to guide action selection.
Q-Learning
Model-free, off-policy algorithm that learns the action-value function Q(s,a).
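A sketch of the tabular Q-learning update; the learning rate alpha and discount factor gamma are standard hyperparameters assumed here (discounting itself is not covered in the notes). Bootstrapping from the best next-state action, rather than the action actually taken, is what makes it off-policy.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One off-policy TD update: bootstrap from the best next action, not the one taken."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Q-table keyed by (state, action), defaulting to 0.
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
```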
SARSA
On-policy algorithm updating Q-values based on the action actually taken.
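The SARSA counterpart to the sketch above: the update bootstraps from the action the current policy actually takes in the next state, which is what makes it on-policy. Hyperparameters are again illustrative.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy TD update: bootstrap from the action actually taken next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

Q = defaultdict(float)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```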
Deep Q-Network (DQN)
Neural-network approximation of the Q-function for large or continuous state spaces.
Policy Gradient
Directly optimizing a parameterized policy by gradient ascent on expected return.
REINFORCE
Basic policy gradient algorithm updating policy parameters to increase expected reward.
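A sketch of the REINFORCE update for a state-independent softmax policy over discrete actions, using NumPy; representing the episode as (action, return) pairs and using the return as the weight are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, lr=0.01):
    """episode: list of (action, return_from_that_step) pairs.
    Push up the log-probability of each action, weighted by its return."""
    for action, G in episode:
        probs = softmax(theta)
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0          # gradient of log pi(action) for a softmax policy
        theta += lr * G * grad_log_pi       # gradient ascent on expected return
    return theta

theta = np.zeros(3)                          # 3 actions, parameters start at zero
theta = reinforce_update(theta, episode=[(2, 1.0), (0, 0.5)])
```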
TRPO (Trust Region Policy Optimization)
Policy optimization method enforcing updates within a trust region for stability.
PPO (Proximal Policy Optimization)
Practical policy optimization using a surrogate objective with clipping to limit updates.
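A sketch of PPO's clipped surrogate objective: the probability ratio between the new and old policies is clipped to [1 - eps, 1 + eps] so a single update cannot move the policy too far. The clip range 0.2 is a commonly used default, assumed here.

```python
import numpy as np

def ppo_clipped_objective(new_probs, old_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, averaged over a batch of samples."""
    ratio = new_probs / old_probs                        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()         # maximize this surrogate

# Illustrative batch of three samples.
obj = ppo_clipped_objective(
    new_probs=np.array([0.5, 0.3, 0.9]),
    old_probs=np.array([0.4, 0.35, 0.6]),
    advantages=np.array([1.0, -0.5, 2.0]),
)
```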
Policy Representation
How the policy mapping states to actions is represented, e.g., as a lookup table or a neural network.
Policy-Based Reinforcement Learning
Learning the policy directly without relying on a value function.
Deterministic Policy
Policy that selects a single action for each state.
Stochastic Policy
Policy that assigns probabilities to multiple actions.
On-Policy
Setting in which the policy being improved is the same policy used to generate the data (e.g., SARSA).
Off-Policy
Setting in which the policy being learned differs from the policy used to generate the data (e.g., Q-learning).
Policy Update Stability
Techniques (e.g., TRPO/PPO) to keep updates from destabilizing learning.
Policy Gradient vs Value-Based
Policy gradient directly optimizes the policy; value-based learns value functions to derive a policy.
Policy Optimization
Adjusting policy parameters to maximize expected return.
Immediate Reward Example
An instant-feedback example such as ad clicks: the reward is observed immediately after the action.
Credit Assignment
Determining which action caused a reward; easier with immediate rewards.
Reward Signal
The goal-defining numerical feedback from the environment.
State-Action Value
Q(s,a); the value of taking action a in state s and following a policy thereafter.
Planning
Deciding on actions by considering possible future states using a model.
Exploration-Exploitation Trade-off
Balancing learning new information vs. using known good actions.
Optimal Policy
Policy that maximizes expected cumulative reward.
Cumulative Reward
Total reward accumulated over time under a policy.
Action Space
Set of all possible actions; can be discrete or continuous.
Continuous Action Spaces
Continuous (and often high-dimensional) action spaces; policy-based methods tend to handle them better than value-based methods.
Environment Dynamics (p(s'|s,a))
Probability of the next state s' given the current state s and action a.
Model-Based Planning
Using a learned model to predict future states/rewards for planning.
Model-Free Learning
Learning from trial-and-error without modeling environment dynamics.
Greedy Policy
Policy that always selects the currently best-known action.
Stability in Learning
Maintaining reliable progress during iterative policy/value updates.
Credit Assignment Problem
Challenge of identifying which action led to a reward, alleviated by immediate rewards.
Environment-Observer Interaction
The loop where the agent observes a state, takes an action, receives a reward, and moves to a new state.
Feature-Vector (State Representation)
Numeric vector describing relevant aspects of the current state.
Discounting
Not covered in the provided notes; omitted from this set.