Flashcards generated from Reinforcement Learning lecture notes.
Reinforcement Learning (RL)
A paradigm of machine learning concerned with how an agent should take actions in an environment over time in order to maximize some notion of cumulative reward.
Policy
Mapping from states to actions learned by trial-and-error to optimize long-term rewards.
States (in MDP)
A set of states S representing all possible situations.
Actions (in MDP)
A set of actions A(s) available in each state.
Transition Probability Function (in MDP)
A transition probability function P(s′|s, a) defining the environment dynamics.
Reward Function (in MDP)
A reward function r(s, a) giving the immediate reward for taking action a in state s.
Discount Factor (in MDP)
A discount factor γ ∈ [0, 1) weighting future vs. immediate rewards.
Policy (π)
A mapping from states to actions that maximizes the expected discounted return.
State-Value Function Vπ(s)
The expected return starting from state s and following policy π thereafter.
Action-Value Function (Q-value) Qπ(s, a)
The expected return starting from state s, taking action a, and then following π.
Bellman Equation
Recursively relates the values of states (or state-action pairs) to the values of their successors.
Exploration-Exploitation Dilemma
The agent must exploit known rewarding actions to accumulate reward, but also explore new actions to discover potentially better policies.
Policy Function Approximation (PFA)
Directly parameterize the policy πθ(a|s) and adjust parameters θ to improve performance.
Value Function Approximation (VFA)
Learn an approximate value function Vw(s) or Qw(s, a) and derive a policy from it.
Cost Function Approximation (CFA)
Approximate the objective or cost-to-go directly.
Direct Lookahead (DLA)
Explicitly simulate or search future consequences of actions to select good actions.
Bellman Expectations
Expresses that the value of a state (or state-action) is the immediate reward plus the discounted value of the next state, averaging over the stochastic policy and environment.
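In symbols, one standard way to write this for the state-value function, using the MDP notation from the cards above:
```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s') \Big]
```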
Value Iteration
Iteratively update an estimate V(s) using the Bellman optimality backup.
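A minimal tabular sketch of that backup. The model arrays P and R are assumed inputs (not from the notes): P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the immediate reward r(s, a).
```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup until V stops changing."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```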
Policy Iteration
Iteratively improve a policy by alternating policy evaluation and policy improvement steps.
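A compact sketch of the two alternating steps, reusing the same hypothetical P and R model arrays as in the value-iteration example; policy evaluation is done with a fixed number of sweeps for simplicity.
```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_sweeps=100):
    """Alternate policy evaluation and greedy policy improvement."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: sweeps of the Bellman expectation backup for the current policy.
        for _ in range(eval_sweeps):
            V = np.array([R[s][policy[s]] +
                          gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
                          for s in range(n_states)])
        # Policy improvement: act greedily with respect to the evaluated V.
        new_policy = np.array([
            int(np.argmax([R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                           for a in range(len(P[s]))]))
            for s in range(n_states)])
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```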
Monte Carlo (MC) Methods
Estimate values by averaging actual returns observed in sample episodes.
Gt
Total return observed from time t onward in an episode.
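Written out with the discount factor γ defined above (a standard formulation):
```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
```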
Temporal-Difference (TD) Learning
Updates estimates from other estimates (bootstrapping, as in DP), but uses sample experience instead of a full model.
TD Error
The difference between the bootstrapped TD target and the current estimate: δt = rt+1 + γV(st+1) − V(st).
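A minimal TD(0) update sketch built around that error term; V is a tabular value estimate and alpha a step size, both assumed given.
```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = r_next + gamma * V[s_next] - V[s]   # delta_t
    V[s] += alpha * td_error
    return td_error
```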
Q-Learning
Off-policy algorithm that updates Q toward the Bellman optimality target.
SARSA
On-policy algorithm that updates Q toward the value of the action actually taken by the current policy.
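A side-by-side sketch of the two update targets from the previous two cards; the only difference is whether the bootstrap uses the greedy action (off-policy Q-learning) or the action actually taken next (on-policy SARSA). Q is assumed to be a dict of dicts mapping state to {action: value}.
```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy target: bootstrap from the greedy action in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy target: bootstrap from the action a_next actually taken."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```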
Value-based Methods
Improve the policy indirectly via learned value estimates.
Policy Gradient Methods
Directly optimize the policy by gradient ascent on the expected reward objective.
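A minimal REINFORCE-style sketch for a tabular softmax policy; theta and the episode format are illustrative assumptions, not notation from the notes.
```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update: theta <- theta + alpha * G_t * grad log pi(a_t|s_t).

    theta   -- (n_states, n_actions) logits of a tabular softmax policy (assumed)
    episode -- list of (state, action, reward) tuples from one rollout
    """
    G = 0.0
    # Iterate backwards so the return G_t can be accumulated incrementally.
    for s, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax(theta[s])
        grad_log_pi = -probs            # d log pi(a|s) / d theta[s,:] for a softmax policy
        grad_log_pi[a] += 1.0           # ... equals one_hot(a) - probs
        theta[s] += alpha * G * grad_log_pi
    return theta
```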
Natural Policy Gradient (NPG)
Scales the gradient by the inverse Fisher information matrix to take into account the curvature of the policy space.
Trust Region Policy Optimization (TRPO)
Ensures each policy update does not deviate too much from the previous policy.
Proximal Policy Optimization (PPO)
Uses a clipped surrogate objective to prevent the policy from changing too much.
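A sketch of just the clipped surrogate term, written with NumPy to show the ratio clipping; the input arrays are assumed to be precomputed per-step quantities.
```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: mean of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).

    logp_new, logp_old -- log-probabilities of the taken actions under the new
                          and old policies (assumed precomputed arrays)
    advantages         -- advantage estimates for the same steps
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```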
Deep Reinforcement Learning (Deep RL)
Uses function approximators (especially deep neural networks) to represent value functions or policies to handle large state spaces.
Target Network
A stabilized copy of the Q-network that is held fixed for a number of iterations before being updated to the current Q-network parameters θ.
Experience Replay
Breaks the strong temporal correlations between successive samples by randomizing them, which stabilizes training.
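A minimal replay-buffer sketch; the class and method names are illustrative, and the target-network card above enters only through the comment on how sampled batches would be used.
```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions; uniform random sampling breaks the
    temporal correlation between consecutive environment steps."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Sampled batches would be used to regress the Q-network toward targets
        # computed with the (periodically frozen) target network.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```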
Algorithmic Trading
Deciding when to buy or sell assets to maximize profit or utility.
Portfolio Management
Periodically reallocating a portfolio among assets to maximize returns or risk-adjusted returns.
Optimal Trade Execution
Executing a large order by splitting it into smaller pieces over time to minimize market impact and trading cost.
Market Making
Continuously quoting buy and sell prices for a security to earn the bid-ask spread while managing inventory risk.
Robo-Advising and Order Routing
Automated decision systems for advising retail investors or routing orders across exchanges.
Heuristics and Action Reduction
Incorporate domain knowledge to reduce the action space to a smaller subset of sensible actions.
Factorization of Actions
Break a big action decision into multiple smaller decisions.
Mathematical Programming Oracles
Given a state and a value function approximation, finding the best action is itself an optimization problem that can be solved with a mathematical programming approach.
Upper Confidence Bounds (UCB)
Select action a at time t as: at = argmax_a [ X̄a(t) + c·sqrt(ln t / Na(t)) ].
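A small sketch of that rule for a stationary bandit; empirical means and pull counts are assumed to be tracked elsewhere, and names are illustrative.
```python
import math

def ucb_select(means, counts, t, c=2.0):
    """Pick argmax_a [ mean_a + c * sqrt(ln t / N_a) ]; untried arms are pulled first."""
    best_a, best_score = None, float("-inf")
    for a, (mean, n) in enumerate(zip(means, counts)):
        if n == 0:
            return a                      # force at least one pull of every arm
        score = mean + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```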
Multi-Agent Reinforcement Learning (MARL)
Multiple agents learning and interacting, leading to complexities like non-stationarity and credit assignment.
Centralized Training with Decentralized Execution (CTDE)
All agents are trained together with a centralized critic that can see the global state and actions, but each agent has its own policy.
Explainable RL (XRL)
Making RL policies more interpretable or providing explanations for their decisions.
Inverse Reinforcement Learning (IRL)
Deduces reward functions from expert behavior.
Reinforcement Learning from Human Feedback (RLHF)
Agents learn from human feedback or demonstrations.