Flashcards generated from lecture notes on Reinforcement Learning.
Reinforcement Learning (RL)
A paradigm of machine learning where an agent learns to take actions in an environment over time to maximize cumulative reward.
Policy
A mapping from states to actions learned by the agent through trial-and-error to optimize long-term rewards.
Markov Decision Process (MDP)
A mathematical framework for formalizing RL problems, characterized by states, actions, transition probabilities, reward functions, and a discount factor.
Transition Probability Function P(s'|s, a)
The probability of moving to state s' when action a is taken in state s in an MDP.
Reward Function r(s, a)
The immediate reward for taking action a in state s in an MDP.
Discount Factor γ
A factor between 0 and 1 that weights future rewards against immediate ones; the closer it is to 1, the more heavily long-term rewards are valued.
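The quantity the agent maximizes is the discounted return, which in standard notation (not spelled out in the notes) is:

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
```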
Exploration-Exploitation Dilemma
The trade-off between exploiting actions known to yield reward and exploring new actions to discover better policies.
How RL addresses large outcome space (uncertainty)
Monte Carlo sampling of outcomes and temporal-difference updates.
How RL addresses large state space
Function approximation techniques to generalize value functions or policies across states.
How RL addresses large action space
Heuristics, policy search, or decomposition.
Policy Function Approximation (PFA)
Directly parameterize the policy and adjust parameters to improve performance.
Value Function Approximation (VFA)
Learn an approximate value function and derive a policy from it.
Cost Function Approximation (CFA)
Approximate the objective or cost-to-go directly.
Direct Lookahead (DLA)
Explicitly simulate or search future consequences of actions to select good actions.
Bellman Expectation Equation
Expresses that the value of a state (or state-action pair) equals the immediate reward plus the discounted value of the successor state, averaged over the stochastic policy and environment dynamics.
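For the state-value function under a policy π this is commonly written as follows (the P and r notation matches the cards above; the rest is standard, not taken from the notes):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{\pi}(s') \Big]
```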
Dynamic Programming (DP)
Algorithms that compute V* or Q* exactly when the MDP model is known and the state-action spaces are small.
Value Iteration
Iteratively update an estimate V(s) using the Bellman optimality backup.
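A minimal tabular sketch of this backup; the array layouts for P and R are illustrative assumptions, not taken from the notes:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration via repeated Bellman optimality backups.

    P: transition probabilities, assumed shape (S, A, S')
    R: immediate rewards r(s, a), assumed shape (S, A)
    Returns the converged value estimate and a greedy policy.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: max_a [ r(s, a) + gamma * E[V(s')] ]
        Q = R + gamma * P @ V              # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)         # greedy policy derived from the values
```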
Policy Iteration
Iteratively improve a policy by alternating policy evaluation and policy improvement steps.
Monte Carlo (MC) Learning
Averages actual returns observed in sample episodes to estimate value.
Temporal-Difference (TD) Learning
Updates value estimates based on other learned estimates, allowing learning from partial progress through an episode.
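A minimal TD(0) update for illustration; the step size alpha and the indexable value table V are assumptions:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V
```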
Q-Learning
Updates Q toward the Bellman optimality target (off-policy).
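A minimal tabular sketch, assuming Q is indexable as Q[s][a] (e.g., a dict of NumPy arrays):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy update: bootstrap from the greedy (max) action in s'."""
    target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```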
SARSA
Updates Q toward the value of the action actually taken by the current policy (on-policy).
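The only change from the Q-learning sketch above is the bootstrap target, which uses the action a' the current policy actually took in s':

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update: bootstrap from the action actually taken in s'."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```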
Value-based methods
Improve the policy indirectly via learned value estimates.
Policy Gradient Methods
Directly optimize the policy by gradient ascent on the expected reward objective.
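A minimal REINFORCE sketch with a tabular softmax policy; the (state, action, reward) episode format and the tabular parameterization are assumptions for illustration:

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities from tabular preferences; theta has shape (n_states, n_actions)."""
    prefs = theta[s]
    probs = np.exp(prefs - prefs.max())
    return probs / probs.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a complete episode of (state, action, reward) tuples."""
    G = 0.0
    for s, a, r in reversed(episode):   # walk backwards so G accumulates the return from step t
        G = r + gamma * G
        probs = softmax_policy(theta, s)
        grad_log = -probs               # grad of log pi(a|s) w.r.t. theta[s] is one_hot(a) - probs
        grad_log[a] += 1.0
        theta[s] += alpha * G * grad_log  # gradient ascent on expected return
    return theta
```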
Actor-Critic Algorithms
Combine a policy-gradient actor with value-function learning (the critic).
Natural Policy Gradient (NPG)
Scales the gradient by the inverse Fisher information matrix to account for the curvature of the policy space.
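In standard notation (not from the notes), the update and the Fisher information matrix are:

```latex
\theta_{k+1} = \theta_k + \alpha \, F(\theta_k)^{-1} \nabla_\theta J(\theta_k),
\qquad
F(\theta) = \mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^\top \big]
```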
Trust Region Policy Optimization (TRPO)
Ensures each policy update does not deviate too much from the previous policy using a KL-divergence constraint.
Proximal Policy Optimization (PPO)
Uses a clipped surrogate objective to prevent the policy from changing too much.
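A minimal sketch of the clipped surrogate term; ratio = pi_new(a|s) / pi_old(a|s) and advantage are assumed to be computed elsewhere:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate to be maximized (averaged over a batch in practice)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum removes any incentive to push the ratio outside [1 - eps, 1 + eps]
    return np.minimum(unclipped, clipped)
```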
Deep Reinforcement Learning (Deep RL)
Combines RL with deep learning to handle high-dimensional sensory inputs or large state spaces.
Deep Q-Network (DQN)
Uses a loss function derived from the Bellman equation and a target network for stable training.
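In standard notation (with θ⁻ denoting the target-network parameters, held fixed between periodic updates), the loss is typically written:

```latex
L(\theta) = \mathbb{E}_{(s, a, r, s')}\Big[ \big( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \big)^2 \Big]
```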
Reward Clipping / Normalization
Keeps reward magnitudes in a reasonable range to prevent divergence in early training.
Entropy Regularization
Adds a term to the objective that rewards policy entropy, encouraging exploration and preventing premature convergence to a deterministic policy.
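One common form of the regularized objective, with an assumed entropy weight β:

```latex
J(\theta) = \mathbb{E}_{\pi_\theta}\big[ G_t \big] + \beta \, \mathbb{E}_{s}\big[ \mathcal{H}\big( \pi_\theta(\cdot \mid s) \big) \big]
```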
Batch Normalization or Layer Normalization
Helps with training deep networks on nonstationary data by normalizing inputs or layer activations.
Markov Assumptions
Financial time series often violate this assumption and are also non-stationary.
RL
Provides a flexible framework for encoding complex objectives (such as risk-adjusted returns with costs) and for learning from experience when classical optimization is intractable.
Heuristics and Action Reduction
Incorporate domain knowledge to reduce the action space to a smaller subset of sensible actions.
Factorization of Actions
An action can be thought of as a combination of sub-actions (e.g., deciding each of n components independently).
Mathematical Programming Oracles
Finding the best action is an optimization problem that can be solved with a mathematical programming approach (like linear or quadratic programming).
Optimal Learning
A concept from operations research concerned with how to choose which actions/states to observe in order to learn the optimal policy fastest, given a limited budget of observations.
Upper Confidence Bounds (UCB)
Provides a near-optimal exploration-exploitation trade-off in bandit problems and guarantees low regret (it finds the best arm with minimal loss in cumulative reward).
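A minimal UCB1 sketch for a multi-armed bandit; the exploration constant c and the array-based bookkeeping are assumptions (c = 2 recovers the classic sqrt(2 ln t / n_i) bonus):

```python
import numpy as np

def ucb1_select(counts, means, t, c=2.0):
    """Pick an arm by UCB1: empirical mean plus an optimism bonus that shrinks with pulls.

    counts: pulls per arm, means: empirical mean reward per arm, t: total pulls so far.
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])            # try every arm at least once first
    bonus = np.sqrt(c * np.log(t) / counts)
    return int(np.argmax(means + bonus))
```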
Intrinsic motivation
Encourages exploration by adding exploration bonuses to the reward for novelty or uncertainty.
Centralized training with decentralized execution (CTDE)
A central critic can see the global state and actions, but each agent has its own policy that only sees its local observations and will execute independently at runtime.
Counterfactual Multi-Agent (COMA) policy gradient
Uses a central critic to compute an advantage for each agent by comparing the joint-action value to a counterfactual baseline in which that agent's action is marginalized out rather than simply replaced by a default action (the other agents' actions held fixed), which helps attribute credit to that agent's action.
Explainable RL (XRL)
An emerging field that aims to make RL policies more interpretable or to provide explanations for their decisions.