Reinforcement Learning Lecture Notes

Description

Flashcards generated from Reinforcement Learning lecture notes.

47 Terms

1. Reinforcement Learning (RL)

A paradigm of machine learning concerned with how an agent should take actions in an environment over time in order to maximize some notion of cumulative reward.

2. Policy

Mapping from states to actions learned by trial-and-error to optimize long-term rewards.

3. States (in MDP)

A set of states S representing all possible situations.

4. Actions (in MDP)

A set of actions A(s) available in each state.

5. Transition Probability Function (in MDP)

A transition probability function P(s′|s, a) defining the environment dynamics.

6. Reward Function (in MDP)

A reward function r(s, a) giving the immediate reward for taking action a in state s.

7. Discount Factor (in MDP)

A discount factor γ ∈ [0, 1) weighting future vs. immediate rewards.
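
The five cards above are the ingredients of a finite MDP (S, A(s), P, r, γ). A minimal sketch of how such a model could be encoded, using a purely hypothetical two-state problem (all names and numbers are illustrative, not taken from the notes):

```python
# Hypothetical two-state MDP: states "low"/"high", actions "wait"/"work".
states = ["low", "high"]                          # S
actions = {s: ["wait", "work"] for s in states}   # A(s)

# P[(s, a)] maps next states to probabilities, i.e. P(s'|s, a).
P = {
    ("low", "wait"):  {"low": 1.0},
    ("low", "work"):  {"low": 0.3, "high": 0.7},
    ("high", "wait"): {"high": 0.8, "low": 0.2},
    ("high", "work"): {"high": 1.0},
}

# r(s, a): immediate reward for taking action a in state s.
r = {("low", "wait"): 0.0, ("low", "work"): -1.0,
     ("high", "wait"): 2.0, ("high", "work"): 1.0}

gamma = 0.9  # discount factor in [0, 1)

# Sanity check: each transition distribution sums to 1.
for sa, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, sa
```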

8. Policy (π)

A mapping from states to actions that maximizes the expected discounted return.

9. State-Value Function Vπ(s)

The expected return starting from state s and following policy π thereafter.

10. Action-Value Function (Q-value) Qπ(s, a)

The expected return starting from state s, taking action a, and then following π.

11. Bellman Equation

Recursively relates the values of states (or state-action pairs) to the values of their successors.
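
Written out in standard form (not quoted from the notes), the Bellman expectation equations for a policy π are:

```latex
\begin{align*}
V^{\pi}(s)   &= \sum_{a}\pi(a\mid s)\Big[r(s,a) + \gamma\sum_{s'}P(s'\mid s,a)\,V^{\pi}(s')\Big] \\
Q^{\pi}(s,a) &= r(s,a) + \gamma\sum_{s'}P(s'\mid s,a)\sum_{a'}\pi(a'\mid s')\,Q^{\pi}(s',a')
\end{align*}
```

Replacing the expectation over π with a max over actions gives the Bellman optimality equations used by value iteration and Q-learning below.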

12. Exploration-Exploitation Dilemma

The agent must exploit known rewarding actions to accumulate reward, but also explore new actions to discover potentially better policies.
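
A common (though not the only) way to balance the two is ε-greedy action selection; a minimal sketch, with the Q-table and ε chosen purely for illustration:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action, otherwise exploit the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

# Illustrative Q-table for a single state with two actions.
Q = {("s0", "left"): 0.4, ("s0", "right"): 0.7}
print(epsilon_greedy(Q, "s0", ["left", "right"]))
```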

13. Policy Function Approximation (PFA)

Directly parameterize the policy πθ(a|s) and adjust parameters θ to improve performance.

14. Value Function Approximation (VFA)

Learn an approximate value function Vw(s) or Qw(s, a) and derive a policy from it.

15. Cost Function Approximation (CFA)

Approximate the objective or cost-to-go directly.

16. Direct Lookahead (DLA)

Explicitly simulate or search future consequences of actions to select good actions.

17. Bellman Expectations

Expresses that the value of a state (or state-action pair) is the immediate reward plus the discounted value of the next state, averaged over the stochastic policy and environment dynamics.

18. Value Iteration

Iteratively update an estimate V(s) using the Bellman optimality backup.
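
A minimal value-iteration sketch on a hypothetical two-state model (the MDP, tolerance, and γ are illustrative, not from the notes):

```python
# Illustrative two-state MDP (same shape as the earlier sketch).
states, actions, gamma = ["low", "high"], ["wait", "work"], 0.9
P = {("low", "wait"): {"low": 1.0},               ("low", "work"): {"low": 0.3, "high": 0.7},
     ("high", "wait"): {"high": 0.8, "low": 0.2}, ("high", "work"): {"high": 1.0}}
r = {("low", "wait"): 0.0, ("low", "work"): -1.0,
     ("high", "wait"): 2.0, ("high", "work"): 1.0}

V = {s: 0.0 for s in states}
while True:
    # Bellman optimality backup: V(s) <- max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V_new = {s: max(r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in actions)
             for s in states}
    delta = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    if delta < 1e-8:          # stop once the backup barely changes the estimate
        break

print(V)
```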

19. Policy Iteration

Iteratively improve a policy by alternating policy evaluation and policy improvement steps.
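
For comparison, a compact policy-iteration sketch on the same kind of illustrative two-state model, alternating (approximate) policy evaluation with greedy improvement:

```python
states, actions, gamma = ["low", "high"], ["wait", "work"], 0.9
P = {("low", "wait"): {"low": 1.0},               ("low", "work"): {"low": 0.3, "high": 0.7},
     ("high", "wait"): {"high": 0.8, "low": 0.2}, ("high", "work"): {"high": 1.0}}
r = {("low", "wait"): 0.0, ("low", "work"): -1.0,
     ("high", "wait"): 2.0, ("high", "work"): 1.0}

def q_value(V, s, a):
    # One-step lookahead: r(s,a) + gamma * E[V(s')]
    return r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

policy = {s: "wait" for s in states}
while True:
    # Policy evaluation: sweep the Bellman expectation backup for the current policy.
    V = {s: 0.0 for s in states}
    for _ in range(500):
        V = {s: q_value(V, s, policy[s]) for s in states}
    # Policy improvement: act greedily with respect to the evaluated V.
    improved = {s: max(actions, key=lambda a: q_value(V, s, a)) for s in states}
    if improved == policy:    # stable policy: stop
        break
    policy = improved

print(policy, V)
```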

20. Monte Carlo (MC) Methods

Estimate values by averaging actual returns observed in sample episodes.

21. Gt

The total discounted return observed from time t onward in an episode: Gt = rt+1 + γrt+2 + γ²rt+3 + ⋯.
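
A small sketch that computes this return at every step of an illustrative episode by working backwards through the reward sequence:

```python
def returns(rewards, gamma=0.9):
    """Compute G_t = r_{t+1} + gamma * G_{t+1} for each step of one episode."""
    G, out = 0.0, []
    for rew in reversed(rewards):
        G = rew + gamma * G
        out.append(G)
    return list(reversed(out))

# Hypothetical 3-step episode with a single terminal reward.
print(returns([0.0, 0.0, 1.0]))   # approximately [0.81, 0.9, 1.0] = [G_0, G_1, G_2]
```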

22. Temporal-Difference (TD) Learning

Updates estimates based on other estimates (bootstrapping like DP) but using sample experience instead of a full model.

23. TD Error

δt = rt+1 + γV(st+1) − V(st): the difference between the bootstrapped TD target and the current value estimate.
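
A tabular TD(0) update built around this error, with the learning rate and values chosen purely for illustration:

```python
alpha, gamma = 0.1, 0.9
V = {"s0": 0.0, "s1": 0.5}        # illustrative value table

def td0_update(V, s, reward, s_next):
    """V(s) <- V(s) + alpha * delta_t, with delta_t = r + gamma*V(s') - V(s)."""
    delta = reward + gamma * V[s_next] - V[s]   # the TD error
    V[s] += alpha * delta
    return delta

print(td0_update(V, "s0", 1.0, "s1"), V)
```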

24. Q-Learning

Off-policy algorithm that updates Q toward the Bellman optimality target.

25. SARSA

On-policy algorithm that updates Q toward the value of the action actually taken by the current policy.
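
The update rules behind this card and the Q-Learning card above differ only in the bootstrap target; a tabular sketch with illustrative states, actions, and numbers:

```python
alpha, gamma = 0.1, 0.9
actions = ["a", "b"]
Q = {("s0", "a"): 0.0, ("s0", "b"): 0.2,
     ("s1", "a"): 0.5, ("s1", "b"): 0.1}

def q_learning_update(Q, s, a, reward, s_next):
    # Off-policy: bootstrap from the greedy action in the next state.
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, reward, s_next, a_next):
    # On-policy: bootstrap from the action the current policy actually took next.
    target = reward + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_learning_update(Q, "s0", "a", 1.0, "s1")
sarsa_update(Q, "s0", "b", 1.0, "s1", "b")
print(Q)
```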

26. Value-based Methods

Improve the policy indirectly via learned value estimates.

27. Policy Gradient Methods

Directly optimize the policy by gradient ascent on the expected reward objective.
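
A minimal REINFORCE-style sketch of that idea for a tabular softmax policy (NumPy; the episode, step size, and problem sizes are illustrative). The update uses the score-function form ∇θ log πθ(a|s) · Gt:

```python
import numpy as np

n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))   # tabular softmax-policy parameters
alpha, gamma = 0.1, 0.9

def softmax_policy(theta, s):
    prefs = theta[s] - theta[s].max()     # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_update(theta, episode):
    """episode: list of (state, action, reward) tuples from one sampled rollout."""
    G = 0.0
    for s, a, rew in reversed(episode):
        G = rew + gamma * G               # return from this time step onward
        pi = softmax_policy(theta, s)
        grad_log = -pi
        grad_log[a] += 1.0                # gradient of log pi(a|s) for a tabular softmax
        theta[s] += alpha * G * grad_log  # stochastic gradient ascent on expected return

reinforce_update(theta, [(0, 1, 0.0), (2, 0, 1.0)])   # illustrative episode
print(theta)
```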

28. Natural Policy Gradient (NPG)

Scales the gradient by the inverse Fisher information matrix to take into account the curvature of the policy space.

29. Trust Region Policy Optimization (TRPO)

Ensures each policy update does not deviate too much from the previous policy.

30. Proximal Policy Optimization (PPO)

Uses a clipped surrogate objective to prevent the policy from changing too much.
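
The clipped surrogate term itself is short to write down; a NumPy sketch in which the probability ratios πθ(a|s)/πθ_old(a|s), the advantage estimates, and the clip range ε are all illustrative inputs:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean of min(r*A, clip(r, 1-eps, 1+eps)*A) over a batch of samples."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([0.9, 1.3, 1.05])        # illustrative probability ratios
advantage = np.array([1.0, 2.0, -0.5])    # illustrative advantage estimates
print(ppo_clip_objective(ratio, advantage))   # maximized (or negated as a loss) in the update
```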

31. Deep Reinforcement Learning (Deep RL)

Uses function approximators (especially deep neural networks) to represent value functions or policies to handle large state spaces.

32. Target Network

A stabilized copy of the Q-network that is held fixed for a number of iterations before being updated to the current Q-network parameters θ.

33. Experience Replay

Breaks the strong temporal correlations between successive samples by randomizing them, which stabilizes training.
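
A minimal replay-buffer sketch of the kind used alongside a target network in DQN-style training (capacity, batch size, and dummy transitions are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of consecutive steps.
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer()
for t in range(100):                           # push some dummy transitions
    buf.push(t, t % 2, 1.0, t + 1, False)
print(len(buf.sample(4)), "transitions sampled")
```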

34. Algorithmic Trading

Deciding when to buy or sell assets to maximize profit or utility.

35. Portfolio Management

Periodically reallocating a portfolio among assets to maximize returns or risk-adjusted returns.

36. Optimal Trade Execution

Executing a large order by splitting it into smaller pieces over time to minimize market impact and trading cost.

37. Market Making

Continuously quoting buy and sell prices for a security to earn the bid-ask spread while managing inventory risk.

38. Robo-Advising and Order Routing

Automated decision systems for advising retail investors or routing orders across exchanges.

39. Heuristics and Action Reduction

Incorporate domain knowledge to reduce the action space to a smaller subset of sensible actions.

40. Factorization of Actions

Break a big action decision into multiple smaller decisions.

41. Mathematical Programming Oracles

Given a state and a value function approximation, finding the best action is itself an optimization problem that can be solved with a mathematical programming approach.

42. Upper Confidence Bounds (UCB)

Select action a at time t as at = argmaxa [ X̄a(t) + c·√(ln t / Na(t)) ], where X̄a(t) is the average reward of action a so far and Na(t) is the number of times it has been selected.
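
A direct sketch of that selection rule for a k-armed bandit (counts, means, and the exploration constant c are illustrative; untried actions are chosen first to avoid dividing by zero):

```python
import math

def ucb_select(means, counts, t, c=2.0):
    """Pick argmax_a [ mean_a + c * sqrt(ln t / N_a(t)) ], trying each arm at least once."""
    best_a, best_score = None, float("-inf")
    for a, (mean, n) in enumerate(zip(means, counts)):
        if n == 0:
            return a                           # untried arm: explore it immediately
        score = mean + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best_a, best_score = a, score
    return best_a

print(ucb_select(means=[0.4, 0.6, 0.5], counts=[10, 5, 1], t=16))
```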

43. Multi-Agent Reinforcement Learning (MARL)

Multiple agents learning and interacting, leading to complexities like non-stationarity and credit assignment.

44. Centralized Training with Decentralized Execution (CTDE)

All agents are trained together with a centralized critic that can see the global state and actions, but each agent has its own policy.

45. Explainable RL (XRL)

Making RL policies more interpretable or providing explanations for their decisions.

46. Inverse Reinforcement Learning (IRL)

Deduces reward functions from expert behavior.

47. Reinforcement Learning from Human Feedback (RLHF)

The agent learns from human feedback, typically by fitting a reward model to human preferences (or demonstrations) and optimizing the policy against it.