Value-Iteration
Page 1: Introduction
Title: Artificial Intelligence and Intelligent Agents (F29AI)
Focus on Solving MDPs: Value Iteration
Presenters: Arash Eshghi, Ioannis Konstas, Verena Rieser, Dan Klein
Page 2: Recap of MDPs
Definition: Markov Decision Process (MDP) consists of:
A set of states: S
A set of actions: A
Transition function: T(s, a, s’) or P(s’ | s,a)
Reward function: R(s, a, s’)
Discount factor: γ
Start state: S0
Key MDP quantities:
Policy: mapping states to actions
Utility: sum of discounted rewards
Values: expected future utility from a state
Q-value: expected future utility from a q-state
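One way to hold these components in code (a sketch; the dictionary layout, state names, and `q_value` helper are my own illustration, not from the course):

```python
# Sketch of an MDP as plain Python data; the states/actions are made up.
# "T" maps (state, action) -> list of (P(s'|s,a), s', R(s,a,s')) triples.
mdp = {
    "states": ["A", "B"],
    "start": "A",
    "gamma": 0.9,
    "T": {
        ("A", "go"): [(1.0, "B", 5.0)],                     # deterministic
        ("B", "stay"): [(0.8, "B", 1.0), (0.2, "A", 1.0)],  # stochastic
    },
}

def q_value(mdp, V, s, a):
    """Q(s, a): expected one-step reward plus discounted value of s'."""
    return sum(p * (r + mdp["gamma"] * V[s2])
               for (p, s2, r) in mdp["T"][(s, a)])
```

For example, with all values initialized to 0, `q_value(mdp, V, "A", "go")` is just the expected immediate reward, 5.0.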
Page 3: Optimal Quantities
Value of a state:
V*(s) = expected utility when acting optimally from state s
Value of a Q-state:
Q*(s, a) = expected utility after taking action a in state s
Definition of optimal policy:
π*(s) = optimal action to take in state s
Page 4: Gridworld: V*
Visual representation of state values after 100 iterations:
Values range: 0.28 to 1.00
Noise factor: 0.2
Discount: 0.9
Living reward = -0.1
Page 5: Gridworld: Q*
Display of Q-values after 100 iterations in gridworld:
Values from 0.13 to 1.00
Noise factor: 0.2
Discount: 0.9
Living reward = -0.1
Page 6: Today's Agenda
Topics covered:
Solving Value Iteration
Policy Evaluation
Policy Extraction
Policy Iteration
Introduction to Reinforcement Learning
Page 7: Time-limited Values
Key idea of time-limited values:
Vk(s): optimal value of state s if the game ends in k time steps
Relates to depth-k expectimax implementation
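The depth-k expectimax connection can be sketched directly as a recursion, assuming the racing-car MDP used later in these notes (Cool/Warm/Overheated, γ = 1):

```python
# Racing-car transitions from the lecture's later example (γ = 1 there):
# (state, action) -> [(probability, next_state, reward), ...]
T = {
    ("Cool", "slow"): [(1.0, "Cool", 1)],
    ("Cool", "fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
    ("Warm", "slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
    ("Warm", "fast"): [(1.0, "Overheated", -10)],
}

def V_k(k, s, gamma=1.0):
    """Optimal value of s if the game ends in k steps: exactly depth-k
    expectimax (no memoization here, so shared subtrees recompute)."""
    if k == 0:
        return 0.0                          # V_0(s) = 0: no time left
    acts = [a for (s0, a) in T if s0 == s]
    return max(
        (sum(p * (r + gamma * V_k(k - 1, s2, gamma))
             for (p, s2, r) in T[(s, a)])
         for a in acts),
        default=0.0,                        # terminal state: no actions
    )
```

Value iteration (next page) computes the same quantities bottom-up, reusing each Vk instead of recomputing subtrees.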
Page 8: Computing Time-Limited Values
Computation process: Vk values are computed iteratively, each sweep built from the previous values Vk-1
Page 9: Value Iteration
Process overview:
Initialize: V0(s) = 0 (with no time steps remaining, no reward can be collected)
Given Vk, compute Vk+1 by one ply of expectimax from each state
Repeat until convergence
Computational complexity: O(S²A) per iteration
Ensures convergence to unique optimal values
The approximations refine toward optimal solutions
Policies can converge before values do
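A minimal sketch of the loop described above, with a tolerance-based stopping rule; the transition encoding and the toy two-state MDP in the usage example are my own, not the course's code:

```python
def value_iteration(states, transitions, gamma, tol=1e-6):
    """Repeat one ply of expectimax until values stop changing.

    transitions maps (s, a) -> [(prob, next_state, reward), ...].
    Each sweep touches every (state, action, next-state) triple,
    hence O(S^2 * A) time per iteration.
    """
    V = {s: 0.0 for s in states}              # V_0(s) = 0
    while True:
        new_V = {}
        for s in states:
            acts = [a for (s0, a) in transitions if s0 == s]
            new_V[s] = max(
                (sum(p * (r + gamma * V[s2])
                     for (p, s2, r) in transitions[(s, a)])
                 for a in acts),
                default=0.0,                  # terminal states stay at 0
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

# Hypothetical two-state chain: in A, 'stay' pays +1 forever, 'go' moves
# to the terminal state B for 0. With gamma = 0.9, V*(A) = 1/(1-0.9) = 10.
T = {("A", "stay"): [(1.0, "A", 1.0)], ("A", "go"): [(1.0, "B", 0.0)]}
V = value_iteration(["A", "B"], T, gamma=0.9)
```

The discount γ < 1 is what makes the stopping rule sound: each sweep shrinks the maximum change by a factor of γ, so the values converge to the unique fixed point.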
Page 10: Example: Value Iteration
Initial values:
All states V0 initialized to 0
Outputs:
Analysis of slow vs. fast scenarios with living rewards and gamma values
Example calculation of V2(Cool), taking the max over the two actions:
Slow: 1.0 × (1 + V1(Cool)) = 1 + 2 = 3
Fast: 0.5 × (2 + V1(Cool)) + 0.5 × (2 + V1(Warm)) = 3.5
Page 11: Example: Convergence
Example progression of iterations:
V0 = [0, 0, 0]
V1 = [2, 1, 0]
V2 = [3.5, 2.5, 0]
V3 = [5, 4, 0]: with γ = 1 the values keep growing, so convergence in general requires a discount γ < 1 or a finite horizon
Calculation annotations: slow and fast action values highlighted at each sweep, with the discount applied to next-state values
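The iteration progression above can be reproduced with a sketch of one value-iteration sweep, assuming the racing-car MDP (Cool/Warm/Overheated) with the transition probabilities and rewards from the example:

```python
# Racing-car MDP from the example (no discount): each entry is a list of
# (probability, next_state, reward) triples for a (state, action) pair.
T = {
    ("Cool", "slow"): [(1.0, "Cool", 1)],
    ("Cool", "fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
    ("Warm", "slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
    ("Warm", "fast"): [(1.0, "Overheated", -10)],
}
STATES = ["Cool", "Warm", "Overheated"]

def sweep(V, gamma=1.0):
    """One value-iteration sweep: V_{k+1}(s) = max_a E[r + gamma * V_k(s')]."""
    new_V = {}
    for s in STATES:
        acts = [a for (s0, a) in T if s0 == s]
        new_V[s] = max(
            (sum(p * (r + gamma * V[s2]) for (p, s2, r) in T[(s, a)])
             for a in acts),
            default=0.0,            # Overheated is terminal
        )
    return new_V

V0 = {s: 0.0 for s in STATES}
V1 = sweep(V0)   # {'Cool': 2.0, 'Warm': 1.0, 'Overheated': 0.0}
V2 = sweep(V1)   # {'Cool': 3.5, 'Warm': 2.5, 'Overheated': 0.0}
```

These sweeps reproduce the V1 = [2, 1, 0] and V2 = [3.5, 2.5, 0] values shown on the slide.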
Page 12: Example: Gridworld Value Iteration
Focus on action impacts:
The maximizing action in this state is right
Values of the other actions are not shown
Maintains noise = 0.2 and gamma = 0.9 with living reward of 0
Page 13: Example: Gridworld Value Iteration Continued
Calculate based on optimal action:
Formula: 0.8 × [0 + 0.9 × 0.72] + 0.1 × [0 + 0.9 × 0] + 0.1 × [0 + 0.9 × 0] ≈ 0.52
Visual representation of prior state values with focus on V2 and V3.
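The arithmetic on this slide, checked numerically; noise 0.2 means the intended move succeeds with probability 0.8 and each sideways slip happens with probability 0.1:

```python
gamma = 0.9
# Intended move (prob 0.8) lands on a neighbour whose current value is 0.72;
# the two slips (prob 0.1 each) land on neighbours with value 0.
# All immediate rewards are 0 (living reward = 0 in this gridworld).
v3 = (0.8 * (0 + gamma * 0.72)
      + 0.1 * (0 + gamma * 0)
      + 0.1 * (0 + gamma * 0))
print(round(v3, 2))  # 0.52
```

The exact value is 0.5184, which the slide rounds to 0.52.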
Page 14: The Bellman Equations
Optimal behavior steps:
Take the correct first action
Continue making optimal decisions
Page 15: The Bellman Equations
Definition of "optimal utility":
Optimal utilities obey an expectimax recurrence: a state's value is the best expected immediate reward plus the discounted value of the successor state
These recurrences are the Bellman equations, which characterize the optimal values
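Written out in the standard form, consistent with the quantities defined on page 2, the recurrence is:

```latex
V^*(s)   = \max_a Q^*(s,a) \\
Q^*(s,a) = \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V^*(s')\,\bigr] \\
V^*(s)   = \max_a \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V^*(s')\,\bigr]
```

Value iteration is exactly the fixed-point update for the last equation, with Vk in place of V* on the right-hand side.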