Flashcards generated from lecture notes on Reinforcement Learning.
Reinforcement Learning (RL)
A paradigm of machine learning where an agent learns to take actions in an environment over time to maximize cumulative reward.
Policy
A mapping from states to actions learned by the agent through trial-and-error to optimize long-term rewards.
Markov Decision Process (MDP)
A mathematical framework for formalizing RL problems, characterized by states, actions, transition probabilities, reward functions, and a discount factor.
Transition Probability Function P(s'|s, a)
The probability of moving to state s' when action a is taken in state s in an MDP.
Reward Function r(s, a)
The immediate reward for taking action a in state s in an MDP.
Discount Factor γ
A factor between 0 and 1 that weights future rewards against immediate ones; the closer it is to 1, the more heavily long-term rewards are valued.
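The quantity the agent maximizes is the discounted return, which in standard notation (not spelled out in the notes) is:

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
```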
Exploration-Exploitation Dilemma
The trade-off between exploiting actions known to yield reward and exploring new actions to discover better policies.
How RL addresses large outcome space (uncertainty)
Monte Carlo sampling of outcomes and temporal-difference updates.
How RL addresses large state space
Function approximation techniques to generalize value functions or policies across states.
How RL addresses large action space
Heuristics, policy search, or decomposition.
Policy Function Approximation (PFA)
Directly parameterize the policy and adjust parameters to improve performance.
Value Function Approximation (VFA)
Learn an approximate value function and derive a policy from it.
Cost Function Approximation (CFA)
Approximate the objective or cost-to-go directly.
Direct Lookahead (DLA)
Explicitly simulate or search future consequences of actions to select good actions.
Bellman Expectation Equation
Expresses that the value of a state (or state-action pair) equals the immediate reward plus the discounted value of the successor state, averaged over the stochastic policy and environment dynamics.
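For the state-value function under a policy π this is commonly written as follows (the P and r notation matches the cards above; the rest is standard, not taken from the notes):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{\pi}(s') \Big]
```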
Dynamic Programming (DP)
Algorithms that compute V* or Q* exactly when the MDP model is known and the state-action spaces are small.
Value Iteration
Iteratively update an estimate V(s) using the Bellman optimality backup.
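A minimal tabular sketch of this backup; the array layouts for P and R are illustrative assumptions, not taken from the notes:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration via repeated Bellman optimality backups.

    P: transition probabilities, assumed shape (S, A, S')
    R: immediate rewards r(s, a), assumed shape (S, A)
    Returns the converged value estimate and a greedy policy.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: max_a [ r(s, a) + gamma * E[V(s')] ]
        Q = R + gamma * P @ V              # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)         # greedy policy derived from the values
```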
Policy Iteration
Iteratively improve a policy by alternating policy evaluation and policy improvement steps.
Monte Carlo (MC) Learning
Averages actual returns observed in sample episodes to estimate value.
Temporal-Difference (TD) Learning
Updates value estimates based on other learned estimates, allowing learning from partial progress through an episode.
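A minimal TD(0) update for illustration; the step size alpha and the indexable value table V are assumptions:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V
```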
Q-Learning
Updates Q toward the Bellman optimality target (off-policy).
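A minimal tabular sketch, assuming Q is indexable as Q[s][a] (e.g., a dict of NumPy arrays):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy update: bootstrap from the greedy (max) action in s'."""
    target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```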
SARSA
Updates Q toward the value of the action actually taken by the current policy (on-policy).
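The only change from the Q-learning sketch above is the bootstrap target, which uses the action a' the current policy actually took in s':

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update: bootstrap from the action actually taken in s'."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```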
Value-based methods
Improve the policy indirectly via learned value estimates.
Policy Gradient Methods
Directly optimize the policy by gradient ascent on the expected reward objective.
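A minimal REINFORCE sketch with a tabular softmax policy; the (state, action, reward) episode format and the tabular parameterization are assumptions for illustration:

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities from tabular preferences; theta has shape (n_states, n_actions)."""
    prefs = theta[s]
    probs = np.exp(prefs - prefs.max())
    return probs / probs.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a complete episode of (state, action, reward) tuples."""
    G = 0.0
    for s, a, r in reversed(episode):   # walk backwards so G accumulates the return from step t
        G = r + gamma * G
        probs = softmax_policy(theta, s)
        grad_log = -probs               # grad of log pi(a|s) w.r.t. theta[s] is one_hot(a) - probs
        grad_log[a] += 1.0
        theta[s] += alpha * G * grad_log  # gradient ascent on expected return
    return theta
```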
Actor-Critic Algorithms
Combine a policy-gradient actor with value-function learning (the critic).
Natural Policy Gradient (NPG)
Scales the gradient by the inverse Fisher information matrix to account for the curvature of the policy space.
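In standard notation (not from the notes), the update and the Fisher information matrix are:

```latex
\theta_{k+1} = \theta_k + \alpha \, F(\theta_k)^{-1} \nabla_\theta J(\theta_k),
\qquad
F(\theta) = \mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^\top \big]
```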
Trust Region Policy Optimization (TRPO)
Ensures each policy update does not deviate too much from the previous policy using a KL-divergence constraint.
Proximal Policy Optimization (PPO)
Uses a clipped surrogate objective to prevent the policy from changing too much.
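A minimal sketch of the clipped surrogate term; ratio = pi_new(a|s) / pi_old(a|s) and advantage are assumed to be computed elsewhere:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate to be maximized (averaged over a batch in practice)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum removes any incentive to push the ratio outside [1 - eps, 1 + eps]
    return np.minimum(unclipped, clipped)
```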
Deep Reinforcement Learning (Deep RL)
Combines RL with deep learning to handle high-dimensional sensory inputs or large state spaces.
Deep Q-Network (DQN)
Uses a loss function derived from the Bellman equation and a target network for stable training.
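In standard notation (with θ⁻ denoting the target-network parameters, held fixed between periodic updates), the loss is typically written:

```latex
L(\theta) = \mathbb{E}_{(s, a, r, s')}\Big[ \big( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \big)^2 \Big]
```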
Reward Clipping / Normalization
Keeps reward magnitudes in a reasonable range to prevent divergence in early training.
Entropy Regularization
Adds a term to the objective that rewards policy entropy, encouraging exploration and preventing premature convergence to a deterministic policy.
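One common form of the regularized objective, with an assumed entropy weight β:

```latex
J(\theta) = \mathbb{E}_{\pi_\theta}\big[ G_t \big] + \beta \, \mathbb{E}_{s}\big[ \mathcal{H}\big( \pi_\theta(\cdot \mid s) \big) \big]
```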
Batch Normalization or Layer Normalization
Helps with training deep networks on nonstationary data by normalizing inputs or layer activations.
Markov Assumptions
Financial time series often violate this assumption and are also non-stationary.
RL
Provides a flexible framework for encoding complex objectives (such as risk-adjusted returns with costs) and for learning from experience when classical optimization is intractable.
Heuristics and Action Reduction
Incorporate domain knowledge to reduce the action space to a smaller subset of sensible actions.
Factorization of Actions
An action can be thought of as a combination of sub-actions (e.g., deciding each of n components independently).
Mathematical Programming Oracles
Finding the best action is an optimization problem that can be solved with a mathematical programming approach (like linear or quadratic programming).
Optimal Learning
A concept from operations research concerned with how to choose which actions/states to observe in order to learn the optimal policy fastest, given a limited budget of observations.
Upper Confidence Bounds (UCB)
Provides a near-optimal exploration-exploitation trade-off in bandit problems and guarantees low regret (it finds the best arm with minimal loss in cumulative reward).
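A minimal UCB1 sketch for a multi-armed bandit; the exploration constant c and the array-based bookkeeping are assumptions (c = 2 recovers the classic sqrt(2 ln t / n_i) bonus):

```python
import numpy as np

def ucb1_select(counts, means, t, c=2.0):
    """Pick an arm by UCB1: empirical mean plus an optimism bonus that shrinks with pulls.

    counts: pulls per arm, means: empirical mean reward per arm, t: total pulls so far.
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])            # try every arm at least once first
    bonus = np.sqrt(c * np.log(t) / counts)
    return int(np.argmax(means + bonus))
```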
Intrinsic motivation
Encourages exploration by adding exploration bonuses to the reward for novelty or uncertainty.
Centralized training with decentralized execution (CTDE)
A central critic can see the global state and actions, but each agent has its own policy that only sees its local observations and will execute independently at runtime.
Counterfactual Multi-Agent (COMA) policy gradient
Uses a central critic to compute an advantage for each agent by comparing the joint-action value to a counterfactual baseline in which that agent's action is marginalized out rather than simply replaced by a default action (the other agents' actions held fixed), which helps attribute credit to that agent's action.
Explainable RL (XRL)
An emerging field that aims to make RL policies more interpretable or to provide explanations for their decisions.