Environment summary
Action space: Discrete(3)
⇒ {push left, no push, push right}
Observation space: Box([-1.2, -0.07],[0.6, 0.07], (2,), float32)
Default reward: r_t = -1 per time-step; the episode terminates when the car reaches the flag, or is truncated after the maximum number of steps.
Problems noted
Sparse/always-negative reward hampers learning; termination + truncation must be handled carefully.
Need to shape the reward so that learning is faster without altering the optimal policy.
Suggested tasks
Experiment with the number of discretisation bins; fewer bins ⇒ coarser state space, more bins ⇒ finer representation (risk of state explosion).
Goal: “BEST POSSIBLE LEARNING (FAST–SHORT)” by tuning bins + reward only (code & algorithm fixed).
Use ChatGPT / Internet for ideas; similar strategy to be applied later in CartPole assignment.
Helper code snippet
import numpy as np

# create_bins(interval, n) is assumed to be defined earlier in the notebook and to
# return the bin edges that split `interval` into n bins.
intervals = [(-2.4, 2.4), (-3.0, 3.0), (-0.5, 0.5), (-2.0, 2.0)]
nbins = [12, 12, 24, 24]
bins = [create_bins(intervals[i], nbins[i]) for i in range(4)]

def discretize_bins(x):
    # map each observation component to a bin index, clipped to [0, nbins[i]-1]
    return tuple(np.clip(np.digitize(x[i], bins[i]) - 1, 0, nbins[i] - 1)
                 for i in range(4))
Key take-aways
Increased bins (e.g. \text{nbins} = [12,12,24,24]) ⇒ finer representation of pole angle & angular velocity.
Same discretisation logic can port to MountainCar, Acrobot, etc.
Model-based vs Model-free
Model-based: Policy / Value iteration, Dynamic Programming, Optimal Control.
Model-free
Gradient-free: Monte-Carlo, TD, SARSA, Q-learning (tabular).
Gradient-based: Actor–Critic, Policy Gradient, Deep MPC, Deep Policy Net.
Deep RL: DQN, DDQN, etc. sit in the model-free, off-policy, gradient-based (value-based) category.
1957 Bellman DP ⇒ foundation of value recursion.
1988 TD-learning; 1992 Q-learning (Watkins).
1995 TD-Gammon (Tesauro) – NN approximator + TD.
2013 DQN on Atari – raw pixels → Q-values.
2017 AlphaGo / AlphaZero; 2019 AlphaStar, AlphaFold.
2023–24 LLM-powered Web agents (WebAgent, WebLINX, WebVoyager, Anthropic MCP).
Tooling ecosystem: LangChain (2022) → AutoGPT, AutoGen, Crew.ai, LangGraph, TapeAgents, LlamaIndex – focus on memory, chaining, multi-agent collaboration.
Atari DQN paper (Mnih et al. 2013)
Function approximation rationale.
Neural Networks as approximators.
Deep Q-Network (DQN) algorithm.
Double DQN (DDQN) improvement.
Motivation
State spaces gigantic: Backgammon 10^{20}; Go 10^{170}; continuous systems (helicopters, autonomous cars).
Tabular storage & per-state learning infeasible.
Solution: approximate V_\pi(s) or Q(s,a) with parameterised function \hat v(s,\mathbf w) or \hat q(s,a,\mathbf w).
Generalise from seen to unseen states.
Update parameters via MC or TD.
Any regressor works: linear basis, NN, decision trees, k-NN, Fourier / wavelets …
Simple MLP template for CartPole
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

inputs = Input(shape=(state_size,))
x = Dense(24, activation='relu')(inputs)
x = Dense(24, activation='relu')(x)
outputs = Dense(action_size, activation='linear')(x)  # one Q-value per action
model = Model(inputs, outputs)
model.compile(optimizer=Adam(learning_rate=lr), loss='mse')
Training APIs
model.fit – high-level training loop (forward pass + backprop integrated).
tf.GradientTape – manual training loop for custom research logic (RL, adversarial training, etc.).
Weight update rule: \mathbf w \leftarrow \mathbf w - \alpha \nabla_\mathbf w L.
Pros & cons
model.fit: simplicity, callbacks, hardware-optimised; limited customisation.
GradientTape: full control, fine-grained debugging; more boilerplate (see the minimal training-step sketch below).
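As an illustration of the tf.GradientTape path, here is a minimal custom training-step sketch. It assumes the Keras model above plus batch tensors states, actions and td_targets prepared elsewhere; the function and variable names are illustrative, not from the course code.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)

@tf.function
def train_step(states, actions, td_targets):
    with tf.GradientTape() as tape:
        q_all = model(states, training=True)                          # Q(s, ·; w) for the batch
        mask = tf.one_hot(actions, q_all.shape[-1])
        q_sa = tf.reduce_sum(q_all * mask, axis=1)                    # Q(s, a; w) for taken actions
        loss = tf.reduce_mean(tf.square(td_targets - q_sa))           # L(w) = MSE
    grads = tape.gradient(loss, model.trainable_variables)            # ∇_w L
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # w ← w − α ∇_w L
    return loss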
Goal: learn Q(s,a) end-to-end from pixel input.
Atari implementation specifics
State = stack of 4 greyscale 84\times84 frames ⇒ captures velocity.
Actions: 18 joystick positions.
Reward clipped to {-1,0,1} to normalise scale across games.
Network (same architecture for all games, no per-game tuning; sketched in code after this list):
Conv1: 16 filters 8\times8, stride 4, ReLU.
Conv2: 32 filters 4\times4, stride 2, ReLU.
FC: 256 units, ReLU → linear output |A|.
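A hedged Keras sketch of this architecture (input preprocessing and frame stacking are not shown; build_atari_dqn is an illustrative name, not the paper's code):

import tensorflow as tf
from tensorflow.keras import layers

def build_atari_dqn(num_actions=18):
    # Mnih et al. (2013) network: two conv layers, one dense layer, linear Q-output
    return tf.keras.Sequential([
        tf.keras.Input(shape=(84, 84, 4)),                    # 4 stacked 84x84 greyscale frames
        layers.Conv2D(16, 8, strides=4, activation='relu'),   # Conv1: 16 filters 8x8, stride 4
        layers.Conv2D(32, 4, strides=2, activation='relu'),   # Conv2: 32 filters 4x4, stride 2
        layers.Flatten(),
        layers.Dense(256, activation='relu'),                 # FC: 256 units
        layers.Dense(num_actions, activation='linear'),       # one Q-value per action
    ])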
Key algorithmic ideas
Experience Replay buffer \mathcal D breaks sample correlation.
Fixed (target) network Q(\cdot,\cdot;\bar w) provides stable target.
Training loop (per step)
Act \varepsilon-greedy, observe the transition (s_t, a_t, r_{t+1}, s_{t+1}) and push it to \mathcal D.
Sample mini-batch from \mathcal D.
Compute target
y = r + \gamma \max_{a'} Q(s',a';\bar w).
Minimise MSE
L_i(w) = \mathbb E_{(s,a,r,s') \sim \mathcal D}\left[ (y - Q(s,a;w))^2 \right].
Periodically copy w \to \bar w (a code sketch of this mini-batch update follows below).
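A sketch of this update on one sampled mini-batch, assuming Keras models main_model (online) and target_model, NumPy arrays states, actions, rewards, next_states and a 0/1 dones array from the buffer (all names are assumptions):

import numpy as np

next_q = target_model.predict(next_states, verbose=0)         # Q(s', ·; w̄)
y = rewards + gamma * np.max(next_q, axis=1) * (1 - dones)    # no bootstrap past terminal s'
q_targets = main_model.predict(states, verbose=0)
q_targets[np.arange(len(actions)), actions] = y               # only taken actions get new targets
main_model.fit(states, q_targets, verbose=0)                  # one MSE regression step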
Empirical results (paper)
Super-human (>100% of expert): Video Pinball 2539 %, Boxing 1707 %, Breakout 1327 % …
Difficult games: Montezuma’s Revenge ≈0 %, Private Eye 2 %.
Practical notes
Large replay buffer & RMSProp improve stability.
Action repeat = 4: each selected action is repeated for 4 frames, so the agent decides less often ⇒ faster training.
Start with \varepsilon = 1 → anneal to 0.1.
Issue in vanilla DQN: over-estimation bias, because the same network both selects and evaluates the maximising action a'.
DDQN split
Online network (main) selects the best action a^* = \arg\max_a Q_{\text{main}}(s',a).
Target network evaluates value Q_{\text{target}}(s',a^*).
Update rule
y = r + \gamma \, Q_{\text{target}}\big(s', \arg\max_a Q_{\text{main}}(s',a)\big).
Implementation snippet
# main / target are the online and target Keras models; bs is the mini-batch size
next_q_main = main.predict(next_states, verbose=0)
best_actions = np.argmax(next_q_main, axis=1)              # action selection: online network
next_q_target = target.predict(next_states, verbose=0)     # action evaluation: target network
targets = rewards + gamma * next_q_target[np.arange(bs), best_actions] * (1 - dones)
Weight-sync strategies
Hard update: copy every T steps.
Soft/Polyak: \theta_{\text{target}} \leftarrow \tau\,\theta_{\text{main}} + (1-\tau)\,\theta_{\text{target}}, with \tau \in (0,1) (see the sketch below).
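A minimal sketch of both sync strategies for Keras models (the soft_update name mirrors the helper mentioned in the implementation outline below; τ = 0.05 follows the hyper-parameters used there):

def soft_update(main_model, target_model, tau=0.05):
    # θ_target ← τ·θ_main + (1 − τ)·θ_target, applied weight array by weight array
    new_weights = [tau * mw + (1 - tau) * tw
                   for mw, tw in zip(main_model.get_weights(), target_model.get_weights())]
    target_model.set_weights(new_weights)

def hard_update(main_model, target_model):
    # full copy every T steps
    target_model.set_weights(main_model.get_weights())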
1⃣ Initialisation & Hyper-parameters
Environment: CartPole-v1.
Typical values (gathered into a config sketch after this list):
\gamma=0.99, learning rate 5\times10^{-4}.
\varepsilon_0 = 1.0, \varepsilon_{\min} = 0.01, \varepsilon_{\text{decay}} = 0.99.
Replay buffer size 10^5, batch 64.
Target soft-update \tau=0.05 every 15 steps (or hard every 500).
Rolling window = 40 for “solved” check (threshold = 200 pt).
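The values above gathered into a single illustrative config block (the constant names are assumptions, not the course code):

GAMMA = 0.99
LEARNING_RATE = 5e-4
EPSILON_START, EPSILON_MIN, EPSILON_DECAY = 1.0, 0.01, 0.99
BUFFER_SIZE, BATCH_SIZE = 100_000, 64
TAU, SOFT_UPDATE_EVERY = 0.05, 15      # or: hard update every 500 steps
ROLLING_WINDOW, SOLVED_THRESHOLD = 40, 200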
2⃣ Network definition (3 hidden layers: 16 → 64 → 16, ReLU, then linear output of size |A|); a sketch follows.
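A hedged Keras sketch of that network (the builder name is illustrative):

from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

def build_q_network(state_size, action_size):
    # 16 → 64 → 16 ReLU hidden layers, linear Q-output of size |A|
    inputs = Input(shape=(state_size,))
    x = Dense(16, activation='relu')(inputs)
    x = Dense(64, activation='relu')(x)
    x = Dense(16, activation='relu')(x)
    outputs = Dense(action_size, activation='linear')(x)
    return Model(inputs, outputs)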
3⃣ Support functions: store_experience, sample_experiences, soft_update (Polyak).
Dual replay wrappers: experience_replay_DQN, experience_replay_DDQN (buffer helpers sketched below).
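Minimal sketches of the buffer helpers (a deque-based buffer is an assumption; soft_update was sketched in the weight-sync section above):

import random
from collections import deque
import numpy as np

replay_buffer = deque(maxlen=100_000)

def store_experience(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_experiences(batch_size=64):
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    return states, actions, rewards, next_states, dones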
4⃣ Learning loop
For each episode ≤1200 steps
Choose action (\varepsilon-greedy exploration vs. greedy on predicted Q; see the sketch after this list).
Save transition; call experience replay.
Decay epsilon; track rolling avg; early-stop if solved.
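A sketch of the ε-greedy choice used in the loop (the function name is illustrative; model is the online Q-network and state a 1-D NumPy observation):

import numpy as np

def select_action(model, state, epsilon, action_size):
    # explore with probability ε, otherwise act greedily on the predicted Q-values
    if np.random.rand() < epsilon:
        return np.random.randint(action_size)
    q_values = model.predict(state[np.newaxis, :], verbose=0)
    return int(np.argmax(q_values[0]))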
5⃣ Visualisation
Plot episode rewards plus a rolling mean, with a red dashed line at the solved threshold (a plotting sketch follows).
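A plotting sketch, assuming episode returns were collected in a list episode_rewards:

import matplotlib.pyplot as plt
import numpy as np

window = 40
rolling = np.convolve(episode_rewards, np.ones(window) / window, mode='valid')
plt.plot(episode_rewards, alpha=0.4, label='episode reward')
plt.plot(np.arange(window - 1, len(episode_rewards)), rolling, label=f'{window}-episode mean')
plt.axhline(200, color='red', linestyle='--', label='solved threshold')
plt.xlabel('episode'); plt.ylabel('reward'); plt.legend(); plt.show()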
6⃣ Testing & Video rendering
Run 10 episodes with greedy policy (no exploration); cap 500 steps.
RGB frames saved to the GIF CARTPOLE_DDQN.gif (works in Colab); a recording sketch follows.
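A hedged recording sketch (assumes a Gymnasium CartPole env created with render_mode='rgb_array', a trained model, and the imageio package):

import imageio
import numpy as np

frames = []
state, _ = env.reset()
for _ in range(500):                        # cap at 500 steps
    frames.append(env.render())             # RGB array frame
    q = model.predict(state[np.newaxis, :], verbose=0)
    state, reward, terminated, truncated, _ = env.step(int(np.argmax(q[0])))
    if terminated or truncated:
        break
imageio.mimsave('CARTPOLE_DDQN.gif', frames)   # frame-rate options depend on the imageio version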
Reward shaping risk: may alter optimal policy unintentionally (be sure shaped reward is potential-based).
Hyper-parameter sensitivity: architecture, buffer size, target sync period all influence stability.
Memory footprint: replay buffers for pixel inputs may require GBs – must optimise sampling & storage (compress, lazy frames).
DDQN reduces over-estimation but does not eliminate distributional shift; further extensions (Dueling DQN, PER, Rainbow) exist.
DQN revolutionised RL by integrating ConvNets + replay + target networks.
Function approximation scales tabular RL to high-dimensional tasks; NNs are powerful universal approximators but add instability.
DDQN refines DQN with decoupled action selection/evaluation, yielding more accurate Q estimates.
Real-world pipeline requires modular program structure: hyper-params, network, replay buffer, learning loop, evaluation & visualisation.
Environment summary
Action space: Discrete(3) ⇒ {push left, no push, push right}. These actions control the car's horizontal force.
Observation space: Box([-1.2, -0.07], [0.6, 0.07], (2,), float32). This represents the car's position (between -1.2 and 0.6) and velocity (between -0.07 and 0.07).
Default reward: r_t=-1 per time-step. An agent receives a penalty for every step it takes, encouraging faster completion. The episode terminates when the car reaches the flag (position >= 0.5) or after a maximum number of steps (truncation, typically 200). This reward signal is problematic due to its sparsity.
Problems noted
Sparse/always-negative reward hampers learning: The agent only receives a positive signal upon reaching the goal, making it difficult to learn intermediate steps. The constant negative reward offers no gradient for improvement towards the goal, leading to slow and inefficient learning.
Termination + truncation must be handled carefully: The differing reasons for episode end (success vs. timeout) require distinct handling during experience buffering and target calculation to ensure accurate value estimation.
Need to shape reward so that learning is faster without altering solution policy: Reward shaping adds a potential-based term F(s,s') = \gamma\,\phi(s') - \phi(s) to the environment reward, guiding the agent with intermediate feedback (e.g., for getting closer to the goal) without changing the optimal policy, because the potential terms telescope and cancel over a full trajectory; a minimal sketch follows.
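A minimal sketch of potential-based shaping for MountainCar, assuming the Gymnasium API and the illustrative potential \phi(s) = |position + 0.5| (distance from the valley floor); neither the wrapper name nor this particular potential is prescribed by the assignment:

import gymnasium as gym

class PotentialShaping(gym.Wrapper):
    """Adds F(s, s') = gamma*phi(s') - phi(s) to the environment reward."""
    def __init__(self, env, gamma=0.99):
        super().__init__(env)
        self.gamma = gamma
        self.prev_obs = None

    def reset(self, **kwargs):
        self.prev_obs, info = self.env.reset(**kwargs)
        return self.prev_obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        phi = lambda o: abs(o[0] + 0.5)    # illustrative potential: progress up either hill
        reward += self.gamma * phi(obs) - phi(self.prev_obs)
        self.prev_obs = obs
        return obs, reward, terminated, truncated, info

Usage: env = PotentialShaping(gym.make('MountainCar-v0')); because the shaping term telescopes over a trajectory, the optimal policy is unchanged.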
Suggested tasks
Experiment with the number of discretisation bins: Discretisation transforms the continuous observation space into a finite, discrete state space. Fewer bins ≈ a coarser state space (less detail, faster learning but potentially suboptimal solutions); more bins ≈ a finer one (more detail, better solutions, but risk of state explosion and slower learning as the state space grows).
Goal: “BEST POSSIBLE LEARNING (FAST–SHORT)” by tuning bins + reward only (code & algorithm fixed). The challenge is to find the optimal balance for efficient training.
Use ChatGPT / Internet for ideas: Research common strategies for MountainCar, such as potential-based reward shaping where reward is given for reaching higher positions, or for remaining within a certain velocity range. Similar strategy to be applied later in CartPole assignment.
Helper code snippet
import numpy as np  # np and create_bins are assumed available, as in the snippet above

intervals = [(-2.4, 2.4), (-3.0, 3.0), (-0.5, 0.5), (-2.0, 2.0)]
nbins = [12, 12, 24, 24]
bins = [create_bins(intervals[i], nbins[i]) for i in range(4)]

def discretize_bins(x):
    return tuple(np.clip(np.digitize(x[i], bins[i]) - 1, 0, nbins[i] - 1)
                 for i in range(4))
Key take-aways
intervals defines the range for each of the four observation dimensions (cart position, cart velocity, pole angle, pole angular velocity).
nbins specifies the number of discrete bins for each dimension; for example, 12 bins for cart position divide its range (-2.4 to 2.4) into 12 segments.
create_bins (presumably) generates the actual bin edges (boundaries) for each dimension.
np.digitize(x[i], bins[i]) assigns the observation value x[i] to its corresponding bin index; the -1 converts to 0-based indexing, and np.clip keeps the index within the valid bounds [0, nbins[i]-1].
Increased bins (e.g., \text{nbins} = [12,12,24,24]) ⇒ finer representation of pole angle & angular velocity. This means the agent can distinguish between more subtle differences in the state, potentially leading to more precise control but a larger state space.
Same discretisation logic can port to MountainCar, Acrobot, etc., making it a reusable technique for environments with continuous observation spaces.
Model-based vs Model-free
Model-based: The agent explicitly learns or knows the environment's dynamics (transition function P(s'|s,a) and reward function R(s,a)). It can then plan actions by simulating future states or solving for optimal policies/values using methods like Policy / Value iteration and Dynamic Programming. Optimal Control also falls into this category.
Model-free: The agent learns policies or value functions directly from interactions with the environment, without explicitly building a model of the environment's dynamics. This approach is often more scalable for complex or unknown environments.
Gradient-free: These methods typically use look-up tables or simple function approximators and do not rely on gradient descent for updates. Examples include Monte-Carlo (learning from complete episodes), TD (Temporal Difference, learning from bootstrapped estimates), SARSA (On-policy TD control), and Q-learning (Off-policy TD control, typically tabular for small state spaces). These are well-suited for discrete state-action spaces.
Gradient-based: These methods use gradient descent to update parameters of a function approximator (like a neural network) that represents the policy or value function. Examples include Actor–Critic (learns both a policy and a value function), Policy Gradient (directly optimizes policy parameters), Deep MPC (Model Predictive Control, often combined with deep learning for model learning), Deep Policy Net.
Deep RL: DQN, DDQN, etc. sit in the model-free, off-policy, gradient-based (value-based) category:
Model-free: Because it learns Q-values directly from experience, without an explicit model of the environment.
Off-policy: Because it learns about the optimal policy (derived from \max_a Q(s,a)) while following a different exploration policy (e.g., \varepsilon-greedy).
Gradient-based: Because it uses neural networks as function approximators, and updates their weights via gradient descent.
Value-based: Because its primary goal is to learn the optimal action-value function Q^*(s,a).
1957 Bellman DP: Richard Bellman's Dynamic Programming provided the mathematical foundation for optimal control and value recursion, essential for solving well-defined Markov Decision Processes (MDPs).
1988 TD-learning: Introduced Temporal Difference learning, which learns directly from raw experience without a model of the environment's dynamics, bootstrapping from estimated values of future states. 1992 Q-learning (Watkins): Developed by Chris Watkins, Q-learning is a model-free, off-policy algorithm for learning an optimal action-value function.
1995 TD-Gammon (Tesauro): A landmark achievement, it demonstrated that a neural network (as a value function approximator) combined with TD-learning could learn to play backgammon at a superhuman level, pioneering the use of function approximation in RL.
2013 DQN on Atari: DeepMind's Deep Q-Network was a major breakthrough, showing that a single neural network could learn to play a wide range of Atari 2600 games directly from raw pixel inputs, often surpassing human performance, by integrating convolutional neural networks with experience replay and a target network.
2017 AlphaGo / AlphaZero: DeepMind's AlphaGo defeated the world champion Go player, a game considered far more complex than chess. AlphaZero later generalized this, learning Go, chess, and shogi from scratch (tabula rasa) without human knowledge beyond game rules, using a form of self-play and Monte Carlo Tree Search with deep neural networks. 2019 AlphaStar, AlphaFold: AlphaStar applied similar techniques to complex real-time strategy games (StarCraft II), while AlphaFold demonstrated groundbreaking success in protein folding, showcasing RL's applicability beyond traditional games.
2023–24 LLM-powered Web agents (WebAgent, WebLINX, WebVoyager, Anthropic MCP): Recent advancements highlight the integration of Large Language Models (LLMs) with RL for complex tasks like web navigation and interaction, enabling agents to understand and execute human instructions on the internet.
Tooling ecosystem: LangChain (2022) → AutoGPT, AutoGen, Crew.ai, LangGraph, TapeAgents, LlamaIndex: These frameworks and libraries facilitate the development of sophisticated AI agents by providing tools for chaining LLM calls, managing memory, enabling multi-agent collaboration, and integrating various tools and data sources.
Atari DQN paper (Mnih et al. 2013)
Function approximation rationale.
Neural Networks as approximators.
Deep Q-Network (DQN) algorithm.
Double DQN (DDQN) improvement.
Motivation
State spaces gigantic: Traditional tabular methods are impractical for high-dimensional or continuous state spaces. For example, Backgammon has an estimated 10^{20} states; Go has 10^{170} states; continuous systems like helicopters and autonomous cars have infinite states. Storing a Q-value for every possible state-action pair becomes computationally infeasible due to the