Q-Learning
Page 1: Introduction to Active Reinforcement Learning
Title: Artificial Intelligence and Intelligent Agents (F29AI)
Focus: Active Reinforcement Learning (ARL) and Q-Learning
Instructors:
Arash Eshghi
Slides based on works by:
Ioannis Konstas @HWU
Verena Rieser @HWU
Dan Klein @UC Berkeley
Page 2: Problems with TD Value Learning
TD Value Learning Overview:
Model-free policy evaluation method
Relies on Bellman updates and running sample averages
Challenges:
Turning learned values into a new policy requires a one-step look-ahead over T and R, which a model-free learner does not have
Solution:
Learn Q-values directly instead of just values
Enables model-free action selection
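To make the difficulty concrete: extracting a greedy policy from state values needs the model, while extracting it from Q-values does not. The following is a sketch in standard MDP notation; the discount factor \gamma is an assumption here, since this summary does not name it.

```latex
% Policy extraction from state values: needs T and R (the model)
\pi(s) = \arg\max_{a} \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right]

% Policy extraction from Q-values: no model required
\pi(s) = \arg\max_{a} Q(s,a)
```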
Page 4: Overview of Active Reinforcement Learning
Full reinforcement learning: Aims to find optimal policies, as value iteration does
Key Points:
Unknown transitions (T(s,a,s')) and rewards (R(s,a,s'))
The learner actively chooses actions
Goal: Learn optimal policy/values based on actions taken
Tradeoff: Exploration vs exploitation
Methodology: This is not offline planning; the agent must actually act in the world to find out what happens
Page 5: Q-Value Iteration Details
Value Iteration:
A process to find depth-limited values
Starting from V0(s) = 0
Calculate depth k+1 values from Vk
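Q-value iteration (Q-values are more useful for choosing actions, so compute them instead):
Start with Q0(s,a) = 0, then use Qk to compute the depth k+1 Q-values for every state-action pair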
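As a rough illustration of the Q-value iteration step, the sketch below applies the standard backup Q_{k+1}(s,a) = sum over s' of T(s,a,s') [R(s,a,s') + gamma * max over a' of Q_k(s',a')] to a tiny MDP. The two-state MDP, the names `transitions`, `ACTIONS`, `GAMMA`, and `q_value_iteration`, and the discount value are all invented for this example and do not come from the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical two-state MDP, invented for illustration only.
# transitions[(s, a)] is a list of (next_state, probability, reward) triples,
# i.e. it encodes both T(s, a, s') and R(s, a, s').
transitions: Dict[Tuple[str, str], List[Tuple[str, float, float]]] = {
    ("A", "stay"): [("A", 1.0, 0.0)],
    ("A", "go"): [("B", 1.0, 1.0)],
    ("B", "stay"): [("B", 1.0, 0.0)],
    ("B", "go"): [("A", 1.0, 0.0)],
}
ACTIONS = ["stay", "go"]
GAMMA = 0.5  # assumed discount factor


def q_value_iteration(num_iterations: int) -> Dict[Tuple[str, str], float]:
    """Run `num_iterations` synchronous Q-value backups starting from Q_0 = 0."""
    q = {sa: 0.0 for sa in transitions}  # Q_0(s, a) = 0
    for _ in range(num_iterations):
        new_q = {}
        for (s, a), outcomes in transitions.items():
            # Expected reward plus discounted value of the best next action.
            new_q[(s, a)] = sum(
                p * (r + GAMMA * max(q[(s2, a2)] for a2 in ACTIONS))
                for s2, p, r in outcomes
            )
        q = new_q  # Q_{k+1} replaces Q_k only after the full sweep
    return q


q1 = q_value_iteration(1)
q2 = q_value_iteration(2)
```

Note that this version needs `transitions`, that is, a known model. Q-learning replaces the expectation over next states with samples gathered by acting.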
Page 6: Q-Learning Fundamentals
Q-Learning Process:
A sample-based approach to Q-value iteration
Steps:
Receive sample (s,a,s',r)
Compare old estimate versus new sample estimate
Update running average with new estimates
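The three steps above can be sketched as a single update function. This is a minimal tabular version under stated assumptions: the function name `q_learning_update`, the default values of `alpha` (learning rate) and `gamma` (discount factor), and the toy states and actions are illustrative choices, not taken from the slides. Unseen state-action pairs are assumed to have a Q-value of 0.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_learning_update(
    q: QTable,
    state: Hashable,
    action: Hashable,
    next_state: Hashable,
    reward: float,
    actions: List[Hashable],
    alpha: float = 0.5,   # assumed learning rate
    gamma: float = 0.9,   # assumed discount factor
) -> float:
    """Apply one Q-learning update for the sample (s, a, s', r)."""
    # New sample estimate: reward plus discounted value of the best next action.
    sample = reward + gamma * max(q[(next_state, a)] for a in actions)
    # Blend the old estimate with the new sample (running average).
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * sample
    return q[(state, action)]


q: QTable = defaultdict(float)  # every Q-value starts at 0
actions = ["left", "right"]
first = q_learning_update(q, "s0", "right", "s1", 1.0, actions)
second = q_learning_update(q, "s0", "right", "s1", 1.0, actions)
```

With these assumed settings, repeating the same sample moves the estimate for ("s0", "right") halfway toward the sample value on each update.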
Page 7: Q-Learning Demonstration
Demo: Q-learning in Gridworld
Page 8: Properties of Q-Learning
Convergence:
Q-learning converges to the optimal policy even when the agent acts suboptimally while learning
This is known as off-policy learning
Requirements:
Sufficient exploration
The learning rate must eventually become small enough
The learning rate must not be decreased too quickly
In the limit, the method used to select actions does not matter
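One common way to meet both learning-rate requirements is the schedule alpha_n = 1/n, where n counts how often an estimate has been updated. This schedule is an assumption for illustration, not something the slides prescribe. It shrinks toward zero, yet slowly enough that the sum of all rates still diverges, and it makes the running average equal to the plain sample mean. The function name `running_average` is hypothetical.

```python
from typing import Iterable


def running_average(samples: Iterable[float]) -> float:
    """Running-average update with a decaying learning rate alpha_n = 1 / n."""
    estimate = 0.0
    for n, sample in enumerate(samples, start=1):
        alpha = 1.0 / n  # decays, but not too quickly
        estimate = (1 - alpha) * estimate + alpha * sample
    return estimate


avg = running_average([4.0, 8.0, 6.0])
```

In tabular Q-learning, n would typically be a separate visit count kept for each state-action pair.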
Page 9: Exploration vs. Exploitation
Dynamics:
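The agent must balance trying unfamiliar actions to gather information (exploration) against choosing the actions it currently believes are best (exploitation)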
Page 10: Epsilon-Greedy Policy
Policy Explanation:
Follows an ε-greedy approach:
With probability ε conduct exploration
With probability (1-ε) conduct exploitation
Definitions:
Exploration: Select a random action
Exploitation: Select the best known action according to current policy
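The policy described above can be sketched as a small action-selection function. The name `epsilon_greedy`, the dictionary-based Q-table, and the optional `rng` argument are assumptions made for this example. Unseen state-action pairs are treated as having a Q-value of 0.

```python
import random
from typing import Dict, Hashable, List, Optional, Tuple


def epsilon_greedy(
    q: Dict[Tuple[Hashable, Hashable], float],
    state: Hashable,
    actions: List[Hashable],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> Hashable:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        # Exploration: choose uniformly at random among all actions.
        return rng.choice(actions)
    # Exploitation: choose the action with the highest current Q-value.
    return max(actions, key=lambda a: q.get((state, a), 0.0))


q_table = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
greedy_choice = epsilon_greedy(q_table, "s0", ["left", "right"], epsilon=0.0)
random_choice = epsilon_greedy(q_table, "s0", ["left", "right"], epsilon=1.0)
```

With `epsilon=0.0` the function always exploits, and with `epsilon=1.0` it always explores. Values in between trade the two off.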
On-Policy vs. Off-Policy:
On-Policy: The policy used to act (the control policy) is the same as the policy being learned
Off-Policy: The policy used to act (e.g., epsilon-greedy) differs from the policy being learned
Q-Learning is classified as an off-policy method.
Page 11: Summary of MDPs and RL
Known MDP (Offline Solution):
Goal: Compute optimal values/policies
Technique: Value/Policy Iteration
Unknown MDP:
Model-Based Approach:
Goal: Compute optimal values/policies, or evaluate a fixed policy
Technique: Value/policy iteration or policy evaluation on the approximate (learned) MDP
Model-Free Approach:
Goal: Compute optimal values/policies
Technique: Q-learning
Page 12: Suggested Readings
Recommended Reading: Russell & Norvig, Chapter 21, Sections 21.1-21.3
Page 13: Upcoming Lectures
Future Topics:
Lecture 14: Introduction to NLP
Lecture 15: Language Modelling
Lecture 16: Syntactic Parsing as a Search Problem
Lecture 17: (if time permits) Perceptrons & Deep Learning