Q-Learning

Page 1: Introduction to Active Reinforcement Learning

  • Title: Artificial Intelligence and Intelligent Agents (F29AI)

  • Focus: Active Reinforcement Learning (ARL) and Q-Learning

  • Instructors:

    • Arash Eshghi

    • Slides based on works by:

      • Ioannis Konstas @HWU

      • Verena Rieser @HWU

      • Dan Klein @UC Berkeley

Page 2: Problems with TD Value Learning

  • TD Value Learning Overview:

    • Model-free policy evaluation method

    • Relies on Bellman updates and running sample averages

  • Challenges:

    • Turning learned values into a new policy requires a one-step lookahead through T and R, which a model-free learner does not have

  • Solution:

    • Learn Q-values directly instead of just values

    • Enables model-free action selection
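
To illustrate why Q-values enable model-free action selection: once Q(s,a) is available, the greedy policy is just an argmax over actions, with no transition or reward model involved. The following is a minimal sketch; the Q-table contents, state names, and the helper `greedy_action` are hypothetical and chosen only for illustration.

```python
# Hypothetical Q-table: maps (state, action) -> estimated Q-value.
q_table = {
    ("s0", "left"): 0.2,
    ("s0", "right"): 0.7,
    ("s1", "left"): 0.5,
    ("s1", "right"): 0.1,
}

def greedy_action(q, state, actions):
    """Pick argmax_a Q(state, a); no transition or reward model is needed."""
    return max(actions, key=lambda a: q[(state, a)])

# Extract a greedy policy for every state directly from the table.
policy = {s: greedy_action(q_table, s, ["left", "right"]) for s in ["s0", "s1"]}
print(policy)
```

With state values V(s) alone, the same step would need T and R to evaluate each action's expected outcome.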

Page 4: Overview of Active Reinforcement Learning

  • Full reinforcement learning: Find optimal policies, as value iteration does, but without knowing the model

  • Key Points:

    • Unknown transitions (T(s,a,s')) and rewards (R(s,a,s'))

    • The learner actively chooses actions

    • Goal: Learn optimal policy/values based on actions taken

    • Tradeoff: Exploration vs exploitation

    • Methodology: This is not offline planning; the agent takes actions in the world and observes what happens

Page 5: Q-Value Iteration Details

  • Value Iteration:

    • A process to find depth-limited values

    • Starting from V0(s) = 0

    • Calculate depth k+1 values from Vk

  • Focus on Q-values:

    • Begin with Q0(s,a) = 0, use Qk to compute depth k+1 Q-values
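
The Q-value iteration described above can be sketched on a small known MDP. Each sweep applies the standard backup Q_{k+1}(s,a) = sum over s' of T(s,a,s') [R(s,a,s') + gamma * max over a' of Q_k(s',a')]. The two-state MDP, its numbers, and the function name `q_value_iteration` are assumptions made up for this example, not taken from the slides.

```python
# Toy MDP (all names and numbers are illustrative assumptions).
# T[(s, a)] is a list of (next_state, probability, reward) triples.
T = {
    ("A", "stay"): [("A", 1.0, 0.0)],
    ("A", "go"): [("B", 0.8, 1.0), ("A", 0.2, 0.0)],
    ("B", "stay"): [("B", 1.0, 2.0)],
    ("B", "go"): [("A", 1.0, 0.0)],
}
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
GAMMA = 0.9

def q_value_iteration(num_iterations):
    """Run num_iterations sweeps of Q-value iteration on the toy MDP."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}  # Q_0(s, a) = 0
    for _ in range(num_iterations):
        # Build Q_{k+1} from Q_k using the known model T (which includes rewards).
        q = {
            (s, a): sum(
                p * (r + GAMMA * max(q[(s2, a2)] for a2 in ACTIONS))
                for s2, p, r in T[(s, a)]
            )
            for s in STATES
            for a in ACTIONS
        }
    return q

print(q_value_iteration(1))
```

Note that this version still needs the model T; Q-learning replaces the expectation over s' with samples gathered by acting.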

Page 6: Q-Learning Fundamentals

  • Q-Learning Process:

    • A sample-based approach to Q-value iteration

    • Steps:

      • Receive sample (s,a,s',r)

      • Compare old estimate versus new sample estimate

      • Update running average with new estimates
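
The three steps above can be sketched as a single update function. The sample estimate is r + gamma * max over a' of Q(s',a'), and the running average is Q(s,a) = (1 - alpha) * Q(s,a) + alpha * sample. The function name `q_learning_update`, the default values for `alpha` and `gamma`, and the `terminal` flag are assumptions for this sketch.

```python
from collections import defaultdict

def q_learning_update(q, s, a, s_next, r, actions,
                      alpha=0.5, gamma=0.9, terminal=False):
    """Apply one Q-learning update for the sample (s, a, s', r) in place."""
    # New sample estimate: r + gamma * max_a' Q(s', a').
    # At a terminal state there is nothing to bootstrap from.
    best_next = 0.0 if terminal else max(q[(s_next, a2)] for a2 in actions)
    sample = r + gamma * best_next
    # Running average: blend the old estimate with the new sample.
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * sample
    return q[(s, a)]

q = defaultdict(float)  # unseen (state, action) pairs start at 0
q_learning_update(q, "A", "go", "B", 1.0, ["stay", "go"])
print(q[("A", "go")])
```

Because the update takes the max over next actions, it does not depend on which action the agent actually takes next.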

Page 7: Q-Learning Demonstration

  • Context: Demonstration of Q-Learning in Gridworld

Page 8: Properties of Q-Learning

  • Convergence:

    • Q-learning converges to the optimal policy even when the agent acts suboptimally while learning

    • This is known as off-policy learning

  • Requirements:

    • Sufficient exploration

    • Eventually make the learning rate small enough

    • Do not decrease the learning rate too quickly

    • Selection method for actions becomes irrelevant in the limit
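
One schedule that shrinks the learning rate without shrinking it too quickly is alpha = 1 / N(s,a), where N(s,a) counts visits to the pair. This choice is an assumption for illustration rather than something the slides prescribe; it meets the usual stochastic-approximation conditions (the rates sum to infinity while their squares sum to a finite value). The names `visit_counts` and `learning_rate` are hypothetical.

```python
from collections import defaultdict

visit_counts = defaultdict(int)  # hypothetical per-(state, action) visit counter

def learning_rate(s, a):
    """Return alpha = 1 / N(s, a), counting this call as a visit to (s, a)."""
    visit_counts[(s, a)] += 1
    return 1.0 / visit_counts[(s, a)]

# The rate for a pair decays each time that pair is updated.
alphas = [learning_rate("A", "go") for _ in range(4)]
print(alphas)
```

Rarely visited pairs keep a large rate, so they can still change substantially once they are finally explored.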

Page 9: Exploration vs. Exploitation

  • Dynamics:

    • Essential contrast in reinforcement learning strategies

Page 10: Epsilon-Greedy Policy

  • Policy Explanation:

    • Follows an ε-greedy approach:

      • With probability ε conduct exploration

      • With probability (1-ε) conduct exploitation

  • Definitions:

    • Exploration: Select a random action

    • Exploitation: Select the best known action according to the current Q-value estimates

  • On-Policy vs. Off-Policy:

    • On-Policy: The policy used to act (control policy) is the same as the policy being learned

    • Off-Policy: The control policy (e.g., epsilon-greedy) differs from the policy being learned

    • Q-Learning is classified as an off-policy method.
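
The epsilon-greedy rule described on this page can be sketched as a small action-selection function. The name `epsilon_greedy`, the dict-based Q-table, and the example values are assumptions for illustration; unseen pairs are treated as having Q-value 0.

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon explore; otherwise exploit the current Q-values."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # exploration: uniformly random action
    # Exploitation: best known action under the current estimates.
    return max(actions, key=lambda a: q.get((state, a), 0.0))

q = {("A", "stay"): 0.1, ("A", "go"): 0.9}
rng = random.Random(0)  # seeded so the run is repeatable
picks = [epsilon_greedy(q, "A", ["stay", "go"], epsilon=0.2, rng=rng)
         for _ in range(1000)]
# Expect roughly 0.9: 0.8 from greedy picks plus half of the 0.2 random picks.
print(picks.count("go") / len(picks))
```

The agent acts with this exploring policy, while the Q-learning update itself backs up the greedy (max) value, which is what makes the method off-policy.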

Page 11: Summary of MDPs and RL

  • Known MDP (Offline Solution):

    • Goal: Compute optimal values/policies

    • Technique: Value/Policy Iteration

  • Unknown MDP:

    • Model-Based Approach:

      • Goal: Compute optimal values/policies, or evaluate a fixed policy

      • Technique: Value/Policy Iteration or policy evaluation on the approximate (learned) MDP

    • Model-Free Approach:

      • Goal: Compute optimal values/policies

      • Technique: Q-learning

Page 12: Suggested Readings

  • Recommended Reading: Russell & Norvig, Chapter 21, Sections 21.1-21.3

Page 13: Upcoming Lectures

  • Future Topics:

    • Lecture 14: Introduction to NLP

    • Lecture 15: Language Modelling

    • Lecture 16: Syntactic Parsing as a Search Problem

    • Lecture 17: (if time permits) Perceptrons & Deep Learning