(lec 5) Temporal Difference and Monte Carlo: RL1 Lecture Notes
Unknown models: Temporal Difference learning
This section introduces Temporal Difference (TD) learning as an approach for learning value functions when the model of the environment is unknown. The material draws on concepts from David Silver’s RL course and Katerina Fragkiadaki’s CMU course.
TD learning uses bootstrapping: it updates value estimates after each step using the estimated return from the next state/action, instead of waiting for a complete episode.
Central idea: leverage the Bellman equations to decompose value learning into local, incremental updates that can be performed online.
Recap: Monte Carlo on-policy learning
Two-step iterative algorithm:
Randomly initialize policy
Evaluate the policy with sampled episodes:
Action-value Q_\pi(s, a) is approximated with empirical means
Improve the policy by acting \epsilon-greedily with respect to Q(s, a):
\pi' = \epsilon\text{-greedy}(Q(s, a))
Note: decaying \epsilon over time can help convergence to an optimal policy.
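As a minimal sketch of the improvement step (the function names, the dictionary layout of Q, and the decay schedule are illustrative assumptions, not from the lecture):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Q is assumed to be a dict mapping (state, action) -> estimated value.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """A simple decay schedule: epsilon shrinks toward eps_min over episodes."""
    return max(eps_min, eps_start * decay ** episode)
```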
Recap: Monte Carlo policy evaluation
Goal: obtain action-value empirical means (Q-values) from complete episodes.
When state-action (s, a) is visited in an episode, update:
Increment visitation counter: N(s, a) \leftarrow N(s, a) + 1
Increment total return: S(s, a) \leftarrow S(s, a) + G_t
Estimate value by mean return: Q(s, a) = S(s, a) / N(s, a)
Important: to compute G_t we must have complete episodes (the full return from t onward).
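A sketch of the counter-based evaluation above (every-visit variant; the data structures and function name are assumptions for illustration):

```python
from collections import defaultdict

def mc_evaluate_episode(episode, N, S, Q, gamma=0.99):
    """Every-visit Monte Carlo update from one complete episode.

    episode: list of (state, action, reward) tuples in time order.
    N, S, Q: dicts keyed by (state, action) holding counts, return sums, means.
    """
    G = 0.0
    # Walk backwards so the return G_t can be accumulated step by step.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[(state, action)] += 1
        S[(state, action)] += G
        Q[(state, action)] = S[(state, action)] / N[(state, action)]

# Example setup: N = defaultdict(int); S = defaultdict(float); Q = defaultdict(float)
```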
The problem with complete episodes
Monte Carlo learning requires complete episodes to estimate Q_\pi(s, a).
Problems:
1) Value estimates take a long time to obtain: if an episode is extremely long (e.g., 1 million steps), no updates occur until the episode ends, which is inefficient.
2) Value estimates have high variance: randomness in the environment leads to large fluctuations in returns, requiring many samples for accuracy.
Can we learn from incomplete episodes? Bootstrapping
Yes. Instead of using the actual return, use an estimated return to update value estimates.
First: re-formulate empirical value estimate
Original online update formulation (for MC):
Whenever state-action (s, a) is visited:
1) Increment visitation counter: N(s, a) \leftarrow N(s, a) + 1
2) Increment total return: S(s, a) \leftarrow S(s, a) + G_t
3) Estimate value by mean return: Q(s, a) = S(s, a) / N(s, a)
This is described on slide 12.
First: re-formulate empirical value estimate (online variant)
Instead of waiting for the full return, incrementally compute the mean on each visit:
When (s, a) is visited:
1) Increment visitation counter: N(s, a) \leftarrow N(s, a) + 1
2) Update value estimate: Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)} (G_t - Q(s, a))
This is equivalent to maintaining the running mean of observed returns.
We can generalize this to Q(s, a) \leftarrow Q(s, a) + \alpha (G_t - Q(s, a)), where \alpha is a step-size parameter.
Useful if actual values change over time and we want to forget old observations.
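Both update rules side by side, as a sketch (the dictionary layout and function names are illustrative assumptions):

```python
def mc_update_running_mean(Q, N, sa, G):
    """Incremental mean: equivalent to averaging all returns seen for (s, a)."""
    N[sa] = N.get(sa, 0) + 1
    q = Q.get(sa, 0.0)
    Q[sa] = q + (G - q) / N[sa]

def mc_update_constant_alpha(Q, sa, G, alpha=0.1):
    """Constant step size: recent returns count more, older ones are forgotten."""
    q = Q.get(sa, 0.0)
    Q[sa] = q + alpha * (G - q)
```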
Exponential moving average interpretation
With online updating, the constant step-size rule can be rewritten as Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha G_t.
Smaller \alpha places higher priority on older values; larger \alpha places higher priority on more recent values; this is effectively a forgetting mechanism for older information.
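Unrolling the constant step-size update makes the exponential weighting explicit (a short derivation, assuming a constant \alpha and writing Q_n for the estimate after the n-th observed return G_n):

\begin{aligned}
Q_n &= Q_{n-1} + \alpha (G_n - Q_{n-1}) = \alpha G_n + (1-\alpha) Q_{n-1} \\
    &= \alpha G_n + \alpha(1-\alpha) G_{n-1} + \alpha(1-\alpha)^2 G_{n-2} + \dots + (1-\alpha)^n Q_0
\end{aligned}

so a return observed k steps ago carries weight \alpha(1-\alpha)^k.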
Temporal Difference policy evaluation
In Monte Carlo learning, the target is the actual return G_t:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (G_t - Q(s_t, a_t))
In Temporal Difference (TD) learning, the target is the estimated return from the next step:
TD target: r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})
TD update: Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
The quantity r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) is called the TD target, and its difference from the current estimate Q(s_t, a_t) is the TD error.
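A minimal TD(0) update for action values under a fixed policy (a sketch; the dictionary layout and function name are assumptions):

```python
def td0_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step TD update: move Q(s, a) toward the bootstrapped TD target."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)  # estimated return
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
```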
Recap: Recursive form of policy returns
The return can be written in two equivalent ways:
Expanded series of rewards: G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots
Recursive form: G_t = r_{t+1} + \gamma G_{t+1}
This relationship underpins the decomposition of value functions via the Bellman equations.
Recap: Bellman expectation equation
Two core decompositions:
State-value function: V_\pi(s) = \mathbb{E}[r_{t+1} + \gamma V_\pi(s_{t+1}) \mid s_t = s]
Action-value function: Q_\pi(s, a) = \mathbb{E}[r_{t+1} + \gamma Q_\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a]
These equations formalize how immediate reward and expected future value combine to produce the current value.
Temporal Difference policy evaluation (revisited)
TD update (as above): Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
The TD target is r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}), and the TD error is its difference from the current estimate Q(s_t, a_t).
N-step TD evaluation (if we want to)
Variants include:
1-step TD (TD(0))
2-step TD
3-step TD
n-step TD
∞-step TD, which coincides with Monte Carlo (MC)
These trade off bias and variance and allow longer lookahead than 1-step TD while still bootstrapping.
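Concretely, the n-step targets being traded off are (standard definitions, written here for action values):

G_t^{(1)} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})
G_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 Q(s_{t+2}, a_{t+2})
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n})
G_t^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T \quad (\text{the Monte Carlo target})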
The credit assignment problem
Central question in RL: Which actions and states in a sequence contributed to the eventual rewards?
TD handles credit assignment by propagating reward signals backwards one step at a time across successive updates; this works, but it can be slow and inefficient over long sequences.
N-step returns and TD(\lambda) offer ways to mix bootstrapping with longer lookahead to improve credit assignment (see Sutton and Barto).
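For reference, the \lambda-return used by TD(\lambda) averages all n-step returns with geometrically decaying weights (standard definition; see Sutton and Barto):

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

with \lambda = 0 recovering 1-step TD and \lambda = 1 recovering Monte Carlo.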
The credit assignment problem (Monte Carlo perspective)
How does Monte Carlo handle credit assignment?
MC attributes rewards to the entire observed sequence, which can be accurate but suffers from high variance and requires complete episodes.
Monte Carlo vs Temporal Difference
Monte Carlo:
Cannot learn until the final outcome of the episode is observed.
Cannot learn without an episode terminating.
Works only when episodes terminate.
Temporal Difference:
Can learn after every step, without waiting for episode end.
Can work with non-terminating episodes (lifelong learning).
Monte Carlo vs Temporal Difference (bias-variance and convergence)
Monte Carlo:
High variance in value estimates but low bias; good convergence but typically requires many samples.
Not sensitive to initial estimates.
Temporal Difference:
Low variance in value estimates but higher bias.
Weaker convergence guarantees, but typically requires fewer samples; more sensitive to initial estimates.
SARSA: TD on-policy learning
Look familiar? SARSA follows the same two-step iterative structure as before (evaluate, then improve \epsilon-greedily), but it is on-policy and evaluates with TD updates:
Initialize policy \pi and Q(s, a) arbitrarily; set Q(terminal, •) = 0
Repeat for each episode:
Initialize S
Choose A from S using a policy derived from Q (e.g., \epsilon-greedy)
Repeat for each step of the episode:
Take action A, observe R, S'
Choose A' from S' using policy derived from Q (e.g., \epsilon-greedy)
Update: Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]
S \leftarrow S'; A \leftarrow A'
Terminate when S is terminal.
This formulation aligns with Sutton and Barto Chapter 6.4 (note: notation may differ slightly).
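A compact tabular SARSA sketch in code, assuming a minimal environment interface where `reset()` returns a state and `step(action)` returns `(next_state, reward, done)`; all names and defaults here are illustrative:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD control (cf. Sutton and Barto, Section 6.4)."""
    Q = defaultdict(float)  # Q[(state, action)]; terminal values stay at 0

    def policy(state):
        # epsilon-greedy with respect to the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda act: Q[(state, act)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)   # assumed minimal interface
            a_next = policy(s_next)
            # On-policy TD target: bootstrap from the action actually chosen next.
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```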
SARSA: naming note
Side note: SARSA stands for State, Action, Reward, State, Action, referring to the sequence used in the update.
Temporal Difference learning is important!
Quote: "If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning" (Sutton and Barto, Chapter 6).
TD value estimates underlie nearly every major critic and actor-critic method in modern RL:
DQN, Rainbow, (MA)PPO, SAC, (MA)DDPG, TD3, Q-mix, COMA, VDN, A2C, A3C, …
The only major exceptions are pure policy search methods and MCTS-based methods.
With TD as a foundation, we can discuss modern RL research and algorithms.