Q-Learning

Title: Artificial Intelligence and Intelligent Agents (F29AI)
Focus: Active Reinforcement Learning (ARL) and Q-Learning
Instructor:
- Arash Eshghi
Slides based on work by:
- Ioannis Konstas @HWU
- Verena Rieser @HWU
- Dan Klein @UC Berkeley

TD Value Learning Overview:
- Model-free policy evaluation method
- Relies on Bellman updates and running sample averages
Challenges:
- Difficulty in deriving new policies from values
Solution:
- Learn Q-values directly instead of just values
- Enables model-free action selection
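
The difficulty can be made concrete by comparing how a greedy policy is read off each kind of estimate. This is a sketch in standard MDP notation: T and R are the transition and reward functions, and the discount factor \gamma is an assumption, since the notes do not name it.

```latex
% From state values: needs the model T and R, which a model-free learner does not have
\pi(s) = \arg\max_{a} \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right]

% From Q-values: no model needed, only a lookup over actions
\pi(s) = \arg\max_{a} Q(s,a)
```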

Full reinforcement learning: Aims to find optimal policies, as value iteration does
Key Points:
- Unknown transitions (T(s,a,s')) and rewards (R(s,a,s'))
- The learner actively chooses actions
- Goal: Learn optimal policy/values based on actions taken
- Tradeoff: Exploration vs exploitation
- Method: This is not offline planning; the agent must take actions in the real world to find out what happens

Value Iteration:
- A process to find depth-limited values
- Starting from V0(s) = 0
- Calculate depth k+1 values from Vk
Focus on Q-values:
- Begin with Q0(s,a) = 0, use Qk to compute depth k+1 Q-values
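
For reference, the two updates described above can be written out as follows. This is the standard form, with T and R the transition and reward functions; the discount factor \gamma is an assumption, as the notes do not name it.

```latex
% Value iteration: depth k+1 values from V_k, starting from V_0(s) = 0
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]

% Q-value iteration: depth k+1 Q-values from Q_k, starting from Q_0(s,a) = 0
Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \right]
```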

Q-Learning Process:
- A sample-based approach to Q-value iteration
- Steps:
  - Receive sample (s,a,s',r)
  - Compare old estimate versus new sample estimate
  - Update running average with new estimates
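
The steps above can be sketched in Python as a single update function. This is a minimal illustration, not a reference implementation: the function name `q_learning_update`, the dictionary-based Q-table, and the parameters `alpha` (learning rate) and `gamma` (discount factor) are assumptions introduced here.

```python
from collections import defaultdict


def q_learning_update(q, s, a, s_next, r, actions, alpha=0.5, gamma=0.9):
    """Blend one observed sample (s, a, s', r) into the running average Q(s, a).

    Hypothetical helper: q maps (state, action) pairs to estimates,
    alpha is the learning rate, gamma is the discount factor.
    """
    # New sample estimate: reward plus the discounted value of the best next action.
    best_next = max((q[(s_next, a2)] for a2 in actions), default=0.0)
    sample = r + gamma * best_next
    # Running average: keep (1 - alpha) of the old estimate, take alpha of the sample.
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * sample
    return q[(s, a)]


q = defaultdict(float)  # Q0(s, a) = 0 for every state-action pair
actions = ["left", "right"]
q_learning_update(q, "A", "right", "B", 10.0, actions)  # Q(A, right) becomes 5.0
q_learning_update(q, "A", "right", "B", 10.0, actions)  # Q(A, right) becomes 7.5
```

With `alpha = 0.5`, each update moves the estimate halfway toward the new sample, so repeated identical samples pull Q(A, right) from 0 to 5.0 to 7.5.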

Convergence:
- Q-learning converges to the optimal policy even if the agent acts suboptimally while learning
- This is known as off-policy learning
Requirements:
- Sufficient exploration
- Gradually reduce the learning rate
- The learning rate must not be reduced too quickly
- Selection method for actions becomes irrelevant in the limit
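
One standard way to make the learning-rate requirements precise is the pair of step-size conditions from stochastic approximation. The slides do not state these, so this is an assumption. Here \alpha_t is the learning rate used at the t-th update of a given state-action pair.

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad \text{and} \qquad
\sum_{t=1}^{\infty} \alpha_t^{2} < \infty
```

For example, \alpha_t = 1/t satisfies both conditions. A constant learning rate violates the second (it never becomes small enough), and \alpha_t = 1/t^2 violates the first (it shrinks too quickly).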

Dynamics:
- The essential contrast in reinforcement learning strategies: exploring unfamiliar actions versus exploiting the actions currently believed best
Figure: illustration with a sign reading "GRAND OPENING"

Policy Explanation:
- Follows an ε-greedy approach:
  - With probability ε conduct exploration
  - With probability (1-ε) conduct exploitation
Definitions:
- Exploration: Select a random action
- Exploitation: Select the best known action according to current policy
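
The action-selection rule defined above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `epsilon_greedy`, the dictionary-based Q-table, and the `rng` parameter are hypothetical and not taken from the slides.

```python
import random


def epsilon_greedy(q, state, actions, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the best known one.

    Hypothetical helper: q maps (state, action) pairs to current estimates;
    pairs missing from q are treated as having value 0.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)  # exploration: uniform random action
    # Exploitation: action with the highest current estimate in this state.
    return max(actions, key=lambda a: q.get((state, a), 0.0))


q = {("A", "left"): 1.0, ("A", "right"): 2.0}
greedy = epsilon_greedy(q, "A", ["left", "right"], epsilon=0.0)  # always "right"
```

Setting `epsilon=0.0` gives a purely greedy agent and `epsilon=1.0` a purely random one; values in between trade the two off.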
On-Policy vs. Off-Policy:
- On-Policy: The policy used to act (the control policy) is the same as the policy being learned
- Off-Policy: The control policy (e.g., ε-greedy) differs from the policy being learned
- Q-Learning is classified as an off-policy method.

Known MDP (Offline Solution):
- Goal: Compute optimal values/policies
- Technique: Value/Policy Iteration
Unknown MDP:
- Model-Based Approach:
  - Goal: Compute optimal values/policies, or evaluate a fixed policy
  - Technique: Apply the known-MDP techniques to an approximate MDP
- Model-Free Approach:
  - Goal: Compute optimal values/policies
  - Technique: Q-learning

Future Topics:
- Lecture 14: Introduction to NLP
- Lecture 15: Language Modelling
- Lecture 16: Syntactic Parsing as a Search Problem
- Lecture 17: (if time permits) Perceptrons & Deep Learning