Reinforcement Learning – Session 10 Vocabulary
Session Context
10th class of “Reinforcement Learning & Autonomous Systems” (MBD-Sept 24).
Purpose: wrap-up of practice assignments + deep dive into reward shaping & DQN.
Materials referenced: Gymnasium classic-control tasks, provided notebooks, prior lectures on Dynamic Programming, Model-Free methods, & Deep RL.
Practice Assignments Road-map
Assignment 1 – “Jump-Start”: quick intro to RL tooling & API usage.
Assignment 2 – Dynamic Programming (value iteration / policy iteration).
Assignment 3 – Model-Free tabular methods.
Assignment 4 – DQN / DDQN (function approximation, deep RL).
Group practice – collaborative task (details not on slide).
Capstone challenge – Mountain Car (focus of reward-shaping discussion).
Assignment 3: Tabular Model-Free Methods
Monte-Carlo (MC)
Episode-complete returns; unbiased estimates; slow in long-horizon tasks.
Temporal Difference (TD-(0))
Bootstraps after each transition; faster feedback; introduces bias.
SARSA (on-policy TD-control)
Updates toward Q(s_{t+1},a_{t+1}); behaviour policy & target policy identical.
Q-Learning (off-policy TD-control)
Updates toward \max_{a'}Q(s_{t+1},a'); learns the greedy policy while exploring.
Key insight: each algorithm balances bias/variance, exploration, convergence speed differently.
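For reference, a minimal sketch of the two TD-control updates on a tabular Q-table (a NumPy array Q indexed by state and action; the alpha and gamma defaults are illustrative, not prescribed by the assignment):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action actually taken in the next state
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy (max) action in the next state
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```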
Environments Reviewed
Frozen-Lake
2-D grid, single start/goal, stochastic & deterministic modes, 8×8 ⇒ 64 states.
Taxi-v3
5×5 grid. Encoded state-tuple: (taxi row, taxi col, passenger loc, destination). 5×5×5×4 = 500 discrete states.
Base rewards: +20 for a successful drop-off, −1 per step, −10 for an illegal pick-up/drop-off; an optimally executed episode nets roughly +8. The 500-state space makes the optimal policy hard to visualise.
CartPole-v1
Continuous obs: (x, ẋ, θ, θ̇). Requires discretisation (binning) for tabular Q-Learning.
Episode reward: +1 per step until failure (pole angle or cart position exceeds its threshold).
Comparative Cheat-Sheet – Q-Learning vs SARSA
Policy Type
Q-Learning: Off-policy (greedy target).
SARSA: On-policy (follows behaviour).
Next-state target
Q-Learning: \max_{a'}Q(s',a').
SARSA: Q(s',a'_{\text{taken}}).
Exploration / Exploitation
Q-Learning: more exploitative, risk-seeking.
SARSA: exploration-aware, risk-averse.
Convergence
Q-Learning: often faster; can be unstable under stochastic dynamics, where the \max operator tends to over-estimate values.
SARSA: slower; smoother in noisy domains.
Practical takeaway: choose SARSA when environment is highly stochastic or unsafe; choose Q-Learning for deterministic or cost-tolerant domains.
Reward Shaping – Concept & Motivations
Definition: add auxiliary, artificial rewards to guide agent toward long-term objective.
Benefits
Denser feedback ⇒ faster convergence.
Can encode domain heuristics (e.g., chess heuristics: king safety, material, pawn structure).
Risks
Potential to alter optimal policy if shaping signal is not potential-based.
Requires quantitative formulation of qualitative criteria.
Domain Examples
Lunar Lander
Native reward is sparse; agent may learn to hover indefinitely if no episode truncation.
Solutions: penalise fuel usage, increase gravity, intermediate touchdown bonuses.
Mountain Car (challenge task)
Native reward: –1 per timestep until goal (sparse negative).
Two shaping axes proposed:
Distance to goal: +k_1 |\text{position}|
Velocity: +k_2 |\text{velocity}|
Win bonus: if goal reached before 200 steps ⇒ +250.
Hyper-parameter sweep suggested to prioritise velocity vs distance.
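A minimal sketch of this shaping scheme, assuming Gymnasium's MountainCar-v0 observation layout (position, velocity); the coefficients k1 and k2 are illustrative placeholders for the hyper-parameter sweep mentioned above, and the +250 term is the win bonus from the slide:

```python
def shaped_reward(obs, base_reward, terminated, step, k1=1.0, k2=10.0):
    # obs = (position, velocity) in MountainCar-v0
    position, velocity = obs
    reward = base_reward                   # native reward: -1 per timestep
    reward += k1 * abs(position)           # distance axis from the slide
    reward += k2 * abs(velocity)           # velocity axis from the slide
    if terminated and step < 200:          # goal reached before 200 steps
        reward += 250                      # win bonus
    return reward
```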
Shaping Guidelines for CartPole & Mountain Car
Discretisation first: finer state bins ⇒ a higher-resolution Q-table (but more states to fill).
Sample code:
```python
import numpy as np

intervals = [(-2.4, 2.4), (-3.0, 3.0), (-0.5, 0.5), (-2.0, 2.0)]   # obs ranges: x, x_dot, theta, theta_dot
nbins = [12, 12, 24, 24]
bins = [np.linspace(lo, hi, n + 1) for (lo, hi), n in zip(intervals, nbins)]

def discretize(s):
    # Bin each continuous observation; clip so out-of-range values stay inside the Q-table
    return tuple(int(np.clip(np.digitize(s[i], bins[i]) - 1, 0, nbins[i] - 1)) for i in range(4))
```
Reward spec:
Base step penalty: –1.
Add potential-based terms (distance, velocity, angle, etc.).
Keep code & algorithm unchanged; only modify reward function & bins.
Objective: maximise learning speed (highest rolling-avg reward in fewest episodes).
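One way to keep the added terms potential-based (the F(s,s') = γΦ(s') − Φ(s) form listed under Consolidated Formulae) is to define a potential Φ over the observation and add only its discounted difference. The CartPole potential and weights below are illustrative assumptions, not the provided solution:

```python
def potential(obs, w_angle=1.0, w_pos=0.5):
    # Higher potential when the pole is upright and the cart is near the centre
    x, x_dot, theta, theta_dot = obs
    return -(w_angle * abs(theta) + w_pos * abs(x))

def shaped_step_reward(base_reward, obs, next_obs, gamma=0.99):
    # Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s), added to the base reward
    return base_reward + gamma * potential(next_obs) - potential(obs)
```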
Deep Q-Network (DQN) – Theory Recap
Extends tabular Q-Learning by approximating Q(s,a;\theta) with a neural network.
Uses Experience Replay to break correlation & improve sample efficiency.
Employs Target Network to stabilise bootstrapping (weights \theta^- frozen for T steps).
Bellman loss: \mathcal{L}(\theta)=\big(r_t+\gamma\max_{a'}Q(s_{t+1},a';\theta^-) - Q(s_t,a_t;\theta)\big)^2.
Training Workflow
Collect transition $(s_t, a_t, r_t, s_{t+1}, \text{done})$; append to replay buffer.
Periodically sample random minibatch, vectorise to NumPy.
Compute target Q values using frozen network.
Fit online Q-network for 1 gradient step.
Every T steps, copy online weights \theta \rightarrow \theta^-.
Continue until rolling-avg reward ≥ solved threshold.
Hyper-parameters (default template)
MAX_EPISODES = 300
MAX_STEPS = 500
Buffer size: 2000
\gamma=0.99 (discount)
\epsilon schedule: start 1.0, decay 0.99, floor 0.01
Learning rate 0.001, batch 64
Solved threshold 195 (CartPole default)
Rolling window 20 episodes for moving average.
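Gathered into a single place for reference, the defaults above might look like the sketch below (the key names are illustrative, not taken from the provided notebook):

```python
CONFIG = {
    "max_episodes": 300,
    "max_steps": 500,
    "buffer_size": 2000,
    "gamma": 0.99,            # discount factor
    "epsilon_start": 1.0,     # exploration schedule
    "epsilon_decay": 0.99,
    "epsilon_min": 0.01,
    "learning_rate": 0.001,
    "batch_size": 64,
    "solved_threshold": 195,  # CartPole-v1 default
    "rolling_window": 20,     # episodes for the moving average
}
```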
Network Skeleton
Input layer: state_dim (e.g., 4 for CartPole).
Two hidden ReLU layers, suggested 16/32 or 24/24 units.
Linear output layer of size |A|.
Compile with MSE + Adam.
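A minimal sketch of this skeleton, assuming TensorFlow/Keras (implied by the MSE + Adam compile step) and the 24/24 hidden-unit variant:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

def build_q_network(state_dim, n_actions, lr=0.001):
    model = Sequential([
        Input(shape=(state_dim,)),
        Dense(24, activation="relu"),           # hidden layer 1
        Dense(24, activation="relu"),           # hidden layer 2
        Dense(n_actions, activation="linear"),  # one Q-value per action
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=lr))
    return model
```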
ε-Greedy Action Function
import numpy as np
import random

def select_action(state, epsilon):
    if np.random.rand() <= epsilon:
        return random.randrange(action_size)   # explore: random action
    q_vals = DQN.predict(state)                # state shaped (1, state_size)
    return np.argmax(q_vals[0])                # exploit: greedy action
Experience Replay Code Notes
Sample indices without replacement for speed.
Convert batches to contiguous np.ndarray before NN calls.
Target update: target_qs[np.arange(batch_size), actions] = rewards + gamma * np.max(next_qs, axis=1) * (1 - dones).
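Putting those notes together, a sketch of one replay step, assuming transitions are stored as (state, action, reward, next_state, done) tuples and the online/target networks are Keras models; the function and argument names are illustrative:

```python
import random
import numpy as np

def experience_replay(memory, model, target_model, batch_size=64, gamma=0.99):
    batch = random.sample(memory, batch_size)                 # sample without replacement
    states = np.vstack([t[0] for t in batch])                 # contiguous arrays before NN calls
    actions = np.array([t[1] for t in batch])
    rewards = np.array([t[2] for t in batch], dtype=float)
    next_states = np.vstack([t[3] for t in batch])
    dones = np.array([t[4] for t in batch], dtype=float)

    target_qs = model.predict(states, verbose=0)              # current online estimates
    next_qs = target_model.predict(next_states, verbose=0)    # frozen target-network values

    # Overwrite only the Q-values of the actions actually taken with their Bellman targets
    target_qs[np.arange(batch_size), actions] = rewards + gamma * np.max(next_qs, axis=1) * (1 - dones)
    model.fit(states, target_qs, epochs=1, verbose=0)         # one gradient step on the online network
```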
Main Training Loop Essentials
Reset env, reshape state to (1, state_size).
Step until done or MAX_STEPS.
Store transition; call experience_replay when buffer ≥ batch size.
Decay \epsilon after each episode.
Monitor per-episode reward & 20-episode rolling average.
Save model when solved; print run-time stats.
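A condensed sketch of that loop for CartPole-v1, assuming Gymnasium's five-value step API and the helpers sketched earlier in these notes (build_q_network, select_action, experience_replay); the per-episode target sync here is a simplification of the "every T steps" rule:

```python
import gymnasium as gym
import numpy as np
from collections import deque

env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
DQN = build_q_network(state_size, action_size)          # online network (sketch above)
target_DQN = build_q_network(state_size, action_size)   # frozen copy for bootstrapping targets
target_DQN.set_weights(DQN.get_weights())

memory = deque(maxlen=2000)
epsilon, rewards_history = 1.0, []

for episode in range(300):                               # MAX_EPISODES
    state, _ = env.reset()
    state = np.reshape(state, (1, state_size))
    total = 0
    for step in range(500):                              # MAX_STEPS
        action = select_action(state, epsilon)           # epsilon-greedy helper above
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = np.reshape(next_state, (1, state_size))
        memory.append((state, action, reward, next_state, done))
        state, total = next_state, total + reward
        if len(memory) >= 64:                            # buffer >= batch size
            experience_replay(memory, DQN, target_DQN, batch_size=64)
        if done:
            break
    target_DQN.set_weights(DQN.get_weights())            # target sync (per episode here)
    epsilon = max(0.01, epsilon * 0.99)                  # decay epsilon after each episode
    rewards_history.append(total)
    if len(rewards_history) >= 20 and np.mean(rewards_history[-20:]) >= 195:
        break                                            # solved: 20-episode rolling average >= 195
```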
Sample Performance Plot (CartPole)
Graph shows episodic rewards & rolling average crossing 195 (solve line).
Demonstrates typical learning curve: rapid rise after exploration → plateau.
Practical Implications & Tips
Reward shaping often yields bigger gains than hyper-parameter tuning in sparse-reward tasks.
Ensure shaping rewards are potential-based to preserve optimality (Ng et al., 1999).
For stochastic environments, prefer SARSA or add entropy regularisation.
DQN’s stability hinges on replay buffer size, target update frequency, learning rate.
Use vectorised operations for batch predictions; Python loops become bottlenecks.
Validate learning with multiple random seeds; single run can mislead.
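A small sketch of that multi-seed check, assuming a hypothetical train_fn(seed) that runs one full training run and returns its per-episode rewards:

```python
import numpy as np

def evaluate_over_seeds(train_fn, seeds=(0, 1, 2, 3, 4)):
    # train_fn(seed) is a hypothetical callable returning a list of per-episode rewards
    curves = [np.asarray(train_fn(seed), dtype=float) for seed in seeds]
    min_len = min(len(c) for c in curves)                # align runs of unequal length
    stacked = np.stack([c[:min_len] for c in curves])
    return stacked.mean(axis=0), stacked.std(axis=0)     # mean curve and seed-to-seed spread
```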
Ethical & Safety Considerations
Over-shaping can create unintended incentives (reward hacking).
Off-policy methods may learn hazardous behaviours during exploration; shielding or conservative algorithms recommended in safety-critical domains.
Model persistence: saved networks must be version-controlled & reproducible for auditability.
Consolidated Formulae & Definitions
SARSA update: Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big].
Q-Learning update: Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big[r_t + \gamma \max_{a'}Q(s_{t+1},a') - Q(s_t,a_t)\big].
DQN target: Q_{\text{target}}(s_t,a_t) = r_t + \gamma \max_{a'}Q(s_{t+1},a';\theta^-).
Potential-based shaping term: F(s,s') = \gamma \Phi(s') - \Phi(s) (guarantees optimal policy invariance).
Action Items Before Submission
Fine-tune number of bins & shaping coefficients for Mountain Car / CartPole.
Keep algorithmic core constant; modify only reward function & discretiser.
Test solution under different seeds; provide learning curve evidence.
Document chosen hyper-parameters & rationale (scientific method).