Value-Iteration

Page 1: Introduction

  • Title: Artificial Intelligence and Intelligent Agents (F29AI)

  • Focus on Solving MDPs: Value Iteration

  • Presenters: Arash Eshghi, Ioannis Konstas, Verena Rieser, Dan Klein

Page 2: Recap of MDPs

  • Definition: Markov Decision Process (MDP) consists of:

    • A set of states: S

    • A set of actions: A

    • Transition function: T(s, a, s’) or P(s’ | s,a)

    • Reward function: R(s, a, s’)

    • Discount factor: γ

    • Start state: S0

  • Key MDP quantities:

    • Policy: mapping states to actions

    • Utility: sum of discounted rewards

    • Values: expected future utility from a state

    • Q-value: expected future utility from a q-state

Page 3: Optimal Quantities

  • Value of a state:

    • V*(s) = expected utility when acting optimally from state s

  • Value of a Q-state:

    • Q*(s, a) = expected utility after taking action a in state s

  • Definition of optimal policy:

    • π*(s) = optimal action to take from state s
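
In symbols, these three quantities are tied together by one-step lookahead (standard MDP notation, matching the recap on Page 2):

```latex
V^*(s) = \max_a Q^*(s, a), \qquad
Q^*(s, a) = \sum_{s'} T(s, a, s')\left[ R(s, a, s') + \gamma\, V^*(s') \right], \qquad
\pi^*(s) = \arg\max_a Q^*(s, a)
```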

Page 4: Gridworld: V*

  • Visual representation of state values after 100 iterations:

    • Values range: 0.28 to 1.00

    • Noise factor: 0.2

    • Discount: 0.9

    • Living reward: -0.1

Page 5: Gridworld: Q*

  • Display of Q-values after 100 iterations in gridworld:

    • Values from 0.13 to 1.00

    • Noise factor: 0.2

    • Discount: 0.9

    • Living reward: -0.1

Page 6: Today's Agenda

  • Topics covered:

    • Solving Value Iteration

    • Policy Evaluation

    • Policy Extraction

    • Policy Iteration

    • Introduction to Reinforcement Learning

Page 7: Time-limited Values

  • Key idea of time-limited values:

    • Vk(s): optimal value of state s if the game ends in k time steps

    • Relates to depth-k expectimax implementation

Page 8: Computing Time-Limited Values

  • Computation process for time-limited values: each Vk is computed iteratively from the previous layer Vk-1, starting from V0
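
The iterative computation follows the standard finite-horizon recurrence (written in the notation of the earlier slides; each layer is one ply of expectimax over the previous one):

```latex
V_0(s) = 0, \qquad
V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\left[ R(s, a, s') + \gamma\, V_k(s') \right]
```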

Page 9: Value Iteration

  • Process overview:

    • Initialize: V0(s) = 0 (no time steps remaining means zero expected reward)

    • Conduct one ply of expectimax from each state using Vk values

    • Repeat until convergence

  • Computational complexity: O(S^2 A) per iteration

  • Ensures convergence to unique optimal values

    • The approximations refine toward optimal solutions

    • Policies can converge before values do
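
The process above can be sketched in a few lines of Python. This is an illustrative sketch, not the course's code; the shape of `transitions(s, a)` (yielding (prob, next_state, reward) triples) and all names here are assumptions:

```python
def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """Repeat V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_k(s')].

    `transitions(s, a)` is assumed to yield (prob, next_state, reward) triples;
    `actions(s)` yields the actions available in s (empty for terminal states).
    """
    V = {s: 0.0 for s in states}  # V_0(s) = 0: no time steps remaining
    while True:
        V_next = {}
        for s in states:
            # One ply of expectimax: max over actions of the expected backup.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in transitions(s, a))
                 for a in actions(s)]
            V_next[s] = max(q) if q else 0.0  # terminal states stay at 0
        if max(abs(V_next[s] - V[s]) for s in states) < tol:  # converged
            return V_next
        V = V_next
```

Each sweep touches every state, every action, and every successor state, which is where the O(S^2 A) per-iteration cost comes from.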

Page 10: Example: Value Iteration

  • Initial values:

    • All states V0 initialized to 0

  • Outputs:

    • Analysis of slow vs. fast scenarios with living rewards and gamma values

  • Example calculations for values:

    • Slow: 1.0 * (1 + V1(Cool)) = 1.0 * (1 + 2) = 3

    • Fast: 0.5 * (2 + V1(Cool)) + 0.5 * (2 + V1(Warm)) = 0.5 * (2 + 2) + 0.5 * (2 + 1) = 3.5 (γ = 1 here)

Page 11: Example: Convergence

  • Example progression of iterations:

    • V0 = [0, 0, 0]

    • V1 = [2, 1, 0]

    • V2 = [3.5, 2.5, 0]

    • V3 does not stay at [3.5, 2.5, 0]: with γ = 1 the next backup gives V3 = [5.0, 4.0, 0], and the values keep growing

  • Calculation annotations: the slow and fast backups are highlighted, and a discount γ < 1 is what makes the values converge
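
The progression above can be checked numerically. The transition table below is an assumption based on the standard racing-car example (Cool/Warm/Overheated with slow/fast actions) with γ = 1; it reproduces the V1 and V2 layers listed above:

```python
# Hypothetical transition table: (state, action) -> [(prob, next_state, reward)].
T = {
    ('cool', 'slow'): [(1.0, 'cool', 1.0)],
    ('cool', 'fast'): [(0.5, 'cool', 2.0), (0.5, 'warm', 2.0)],
    ('warm', 'slow'): [(0.5, 'cool', 1.0), (0.5, 'warm', 1.0)],
    ('warm', 'fast'): [(1.0, 'overheated', -10.0)],
}

def backup(V, gamma=1.0):
    """One round of V_{k+1}(s) = max_a sum_{s'} p * (r + gamma * V_k(s'))."""
    new = {'overheated': 0.0}  # terminal state: value stays 0
    for s in ('cool', 'warm'):
        new[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                     for a in ('slow', 'fast'))
    return new

V0 = {'cool': 0.0, 'warm': 0.0, 'overheated': 0.0}
V1 = backup(V0)   # {'cool': 2.0, 'warm': 1.0, 'overheated': 0.0}
V2 = backup(V1)   # {'cool': 3.5, 'warm': 2.5, 'overheated': 0.0}
```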

Page 12: Example: Gridworld Value Iteration

  • Focus on action impacts:

    • Maximum value at action = right

    • Additional actions not shown

    • Maintains noise = 0.2 and gamma = 0.9 with living reward of 0

Page 13: Example: Gridworld Value Iteration Continued

  • Calculate based on optimal action:

    • Formula: 0.8 * [0 + 0.9 * 0.72] + 0.1 * [0 + 0.9 * 0] + 0.1 * [0 + 0.9 * 0] ≈ 0.52

    • Visual representation of prior state values with focus on V2 and V3.
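
The backup above is plain expected-value arithmetic and can be checked directly; the 0.72 successor value and the two zero-valued slip successors are read off the slide:

```python
gamma = 0.9
# (prob, reward, next_state_value) for the intended move and the two noise slips.
outcomes = [(0.8, 0.0, 0.72), (0.1, 0.0, 0.0), (0.1, 0.0, 0.0)]
q = sum(p * (r + gamma * v) for p, r, v in outcomes)
print(round(q, 2))  # 0.52
```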

Page 14: The Bellman Equations

  • Optimal behavior steps:

    1. Take the correct first action

    2. Continue making optimal decisions

Page 15: The Bellman Equations

  • Definition of "optimal utility":

    • Expressed as a recursive relationship among the optimal utility values (the expectimax recurrence)

    • This one-step lookahead relationship forms the foundation of the Bellman equations, which characterize the optimal values.
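
Putting the two steps from Page 14 together yields the Bellman optimality equation (standard form, consistent with the quantities defined on Page 3):

```latex
V^*(s) = \max_a \sum_{s'} T(s, a, s')\left[ R(s, a, s') + \gamma\, V^*(s') \right]
```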