(lec 5) Temporal Difference and Monte Carlo: RL1 Lecture Notes
Unknown models: Temporal Difference learning
This section introduces Temporal Difference (TD) learning as an approach for learning value functions when the model of the environment is unknown. The material draws on concepts from David Silver’s RL course and Katerina Fragkiadaki’s CMU course.
TD learning uses bootstrapping: it updates value estimates after each step using the estimated return from the next state/action, instead of waiting for a complete episode.
Central idea: leverage the Bellman equations to decompose value learning into local, incremental updates that can be performed online.
Recap: Monte Carlo on-policy learning
Two-step iterative algorithm:
Randomly initialize policy
Evaluate the policy with sampled episodes:
Action-value Q_\pi(s, a) is approximated with empirical means
Improve the policy by acting \epsilon-greedily with respect to Q(s, a):
\pi' = \epsilon\text{-greedy}(Q(s, a))
Note: decaying \epsilon over time can help convergence to an optimal policy.
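As a minimal sketch of the improvement step (the function names, the dictionary layout of Q, and the decay schedule are illustrative assumptions, not from the lecture):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Q is assumed to be a dict mapping (state, action) -> estimated value.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """A simple decay schedule: epsilon shrinks toward eps_min over episodes."""
    return max(eps_min, eps_start * decay ** episode)
```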
Recap: Monte Carlo policy evaluation
Goal: obtain action-value empirical means (Q-values) from complete episodes.
When state-action (s, a) is visited in an episode, update:
Increment visitation counter: N(s, a) \leftarrow N(s, a) + 1
Increment total return: S(s, a) \leftarrow S(s, a) + G_t
Estimate value by mean return: Q(s, a) = S(s, a) / N(s, a)
Important: to compute G_t we must have complete episodes (the full return from t onward).
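A sketch of the counter-based evaluation above (every-visit variant; the data structures and function name are assumptions for illustration):

```python
from collections import defaultdict

def mc_evaluate_episode(episode, N, S, Q, gamma=0.99):
    """Every-visit Monte Carlo update from one complete episode.

    episode: list of (state, action, reward) tuples in time order.
    N, S, Q: dicts keyed by (state, action) holding counts, return sums, means.
    """
    G = 0.0
    # Walk backwards so the return G_t can be accumulated step by step.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[(state, action)] += 1
        S[(state, action)] += G
        Q[(state, action)] = S[(state, action)] / N[(state, action)]

# Example setup: N = defaultdict(int); S = defaultdict(float); Q = defaultdict(float)
```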
The problem with complete episodes
Monte Carlo learning requires complete episodes to estimate Q_\pi(s, a).
Problems:
1) Value estimates take a long time to obtain: if an episode is extremely long (e.g., 1 million steps), no updates occur until the episode ends, which is inefficient.
2) Value estimates have high variance: randomness in the environment leads to large fluctuations in returns, requiring many samples for accuracy.
Can we learn from incomplete episodes? Bootstrapping
Yes. Instead of using the actual return, use an estimated return to update value estimates.
First: re-formulate empirical value estimate
Original online update formulation (for MC):
Whenever state-action (s, a) is visited:
1) Increment visitation counter: N(s, a) \leftarrow N(s, a) + 1
2) Increment total return: S(s, a) \leftarrow S(s, a) + G_t
3) Estimate value by mean return: Q(s, a) = S(s, a) / N(s, a)
This is described on slide 12.
First: re-formulate empirical value estimate (online variant)
Instead of waiting for the full return, incrementally compute the mean on each visit:
When (s, a) is visited:
1) Increment visitation counter: N(s, a) \leftarrow N(s, a) + 1
2) Update value estimate: Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)} (G_t - Q(s, a))
This is equivalent to maintaining the running mean of observed returns.
We can generalize this to Q(s, a) \leftarrow Q(s, a) + \alpha (G_t - Q(s, a)), where \alpha is a step-size parameter.
Useful if actual values change over time and we want to forget old observations.
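Both update rules side by side, as a sketch (the dictionary layout and function names are illustrative assumptions):

```python
def mc_update_running_mean(Q, N, sa, G):
    """Incremental mean: equivalent to averaging all returns seen for (s, a)."""
    N[sa] = N.get(sa, 0) + 1
    q = Q.get(sa, 0.0)
    Q[sa] = q + (G - q) / N[sa]

def mc_update_constant_alpha(Q, sa, G, alpha=0.1):
    """Constant step size: recent returns count more, older ones are forgotten."""
    q = Q.get(sa, 0.0)
    Q[sa] = q + alpha * (G - q)
```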
Exponential moving average interpretation
With online updating, the constant step-size rule can be rewritten as Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha G_t.
Smaller \alpha places higher priority on older values; larger \alpha places higher priority on more recent values; this is effectively a forgetting mechanism for older information.
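Unrolling the constant step-size update makes the exponential weighting explicit (a short derivation, assuming a constant \alpha and writing Q_n for the estimate after the n-th observed return G_n):

\begin{aligned}
Q_n &= Q_{n-1} + \alpha (G_n - Q_{n-1}) = \alpha G_n + (1-\alpha) Q_{n-1} \\
    &= \alpha G_n + \alpha(1-\alpha) G_{n-1} + \alpha(1-\alpha)^2 G_{n-2} + \dots + (1-\alpha)^n Q_0
\end{aligned}

so a return observed k steps ago carries weight \alpha(1-\alpha)^k.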
Temporal Difference policy evaluation
In Monte Carlo learning, the target is the actual return G_t:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (G_t - Q(s_t, a_t))
In Temporal Difference (TD) learning, the target is the estimated return from the next step:
TD target: r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})
TD update: Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
The quantity r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) is called the TD target, and its difference from the current estimate Q(s_t, a_t) is the TD error.
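A minimal TD(0) update for action values under a fixed policy (a sketch; the dictionary layout and function name are assumptions):

```python
def td0_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step TD update: move Q(s, a) toward the bootstrapped TD target."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)  # estimated return
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
```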
Recap: Recursive form of policy returns
The return can be written in two equivalent ways:
Expanded series of rewards: G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots
Recursive form: G_t = r_{t+1} + \gamma G_{t+1}
This relationship underpins the decomposition of value functions via the Bellman equations.
Recap: Bellman expectation equation
Two core decompositions:
State-value function: V_\pi(s) = \mathbb{E}[r_{t+1} + \gamma V_\pi(s_{t+1}) \mid s_t = s]
Action-value function: Q_\pi(s, a) = \mathbb{E}[r_{t+1} + \gamma Q_\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a]
These equations formalize how immediate reward and expected future value combine to produce the current value.
Temporal Difference policy evaluation (revisited)
TD update (as above): Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
The TD target is r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}), and the TD error is its difference from the current estimate Q(s_t, a_t).
N-step TD evaluation (if we want to)
Variants include:
1-step TD (TD(0))
2-step TD
3-step TD
n-step TD
∞-step TD, which coincides with Monte Carlo (MC)
These trade off bias and variance and allow longer lookahead than 1-step TD while still bootstrapping.
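Concretely, the n-step targets being traded off are (standard definitions, written here for action values):

G_t^{(1)} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})
G_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 Q(s_{t+2}, a_{t+2})
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n})
G_t^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T \quad (\text{the Monte Carlo target})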
The credit assignment problem
Central question in RL: Which actions and states in a sequence contributed to the eventual rewards?
TD handles credit assignment by propagating reward signals backwards one step at a time across successive updates; this works, but it can be slow and inefficient over long sequences.
N-step returns and TD(\lambda) offer ways to mix bootstrapping with longer lookahead to improve credit assignment (see Sutton and Barto).
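For reference, the \lambda-return used by TD(\lambda) averages all n-step returns with geometrically decaying weights (standard definition; see Sutton and Barto):

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

with \lambda = 0 recovering 1-step TD and \lambda = 1 recovering Monte Carlo.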
The credit assignment problem (Monte Carlo perspective)
How does Monte Carlo handle credit assignment?
MC attributes rewards to the entire observed sequence, which can be accurate but suffers from high variance and requires complete episodes.
Monte Carlo vs Temporal Difference
Monte Carlo:
Cannot learn until the final outcome of the episode is observed.
Cannot learn without an episode terminating.
Works only when episodes terminate.
Temporal Difference:
Can learn after every step, without waiting for episode end.
Can work with non-terminating episodes (lifelong learning).
Monte Carlo vs Temporal Difference (bias-variance and convergence)
Monte Carlo:
High variance in value estimates but low bias; good convergence but typically requires many samples.
Not sensitive to initial estimates.
Temporal Difference:
Low variance in value estimates but higher bias.
Weaker convergence guarantees, but typically requires fewer samples; more sensitive to initial estimates.
SARSA: TD on-policy learning
Look familiar? SARSA follows the same two-step iterative structure as before (evaluate, then improve \epsilon-greedily), but it is on-policy and evaluates with TD updates:
Initialize policy \pi and Q(s, a) arbitrarily; set Q(terminal, •) = 0
Repeat for each episode:
Initialize S
Choose A from S using a policy derived from Q (e.g., \epsilon-greedy)
Repeat for each step of the episode:
Take action A, observe R, S'
Choose A' from S' using policy derived from Q (e.g., \epsilon-greedy)
Update: Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]
S \leftarrow S'; A \leftarrow A'
Terminate when S is terminal.
This formulation aligns with Sutton and Barto Chapter 6.4 (note: notation may differ slightly).
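A compact tabular SARSA sketch in code, assuming a minimal environment interface where `reset()` returns a state and `step(action)` returns `(next_state, reward, done)`; all names and defaults here are illustrative:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD control (cf. Sutton and Barto, Section 6.4)."""
    Q = defaultdict(float)  # Q[(state, action)]; terminal values stay at 0

    def policy(state):
        # epsilon-greedy with respect to the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda act: Q[(state, act)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)   # assumed minimal interface
            a_next = policy(s_next)
            # On-policy TD target: bootstrap from the action actually chosen next.
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```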
SARSA: naming note
Side note: SARSA stands for State, Action, Reward, State, Action, referring to the sequence used in the update.
Temporal Difference learning is important!
Quote: "If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning" (Sutton and Barto, Chapter 6).
TD value estimates underlie nearly every major critic and actor-critic method in modern RL:
DQN, Rainbow, (MA)PPO, SAC, (MA)DDPG, TD3, Q-mix, COMA, VDN, A2C, A3C, …
The only major exceptions are pure policy search methods and MCTS-based methods.
With TD as a foundation, we can discuss modern RL research and algorithms.