Value-Iteration

Page 1: Introduction

  • Title: Artificial Intelligence and Intelligent Agents (F29AI)

  • Focus on Solving MDPs: Value Iteration

  • Presenters: Arash Eshghi, Ioannis Konstas, Verena Rieser, Dan Klein

Page 2: Recap of MDPs

  • Definition: Markov Decision Process (MDP) consists of:

    • A set of states: S

    • A set of actions: A

    • Transition function: T(s, a, s’) or P(s’ | s,a)

    • Reward function: R(s, a, s’)

    • Discount factor: γ

    • Start state: S0

  • Key MDP quantities:

    • Policy: mapping states to actions

    • Utility: sum of discounted rewards

    • Values: expected future utility from a state

    • Q-value: expected future utility from a q-state

Page 3: Optimal Quantities

  • Value of a state:

    • V*(s) = expected utility when acting optimally from state s

  • Value of a Q-state:

    • Q*(s, a) = expected utility after taking action a in state s

  • Definition of optimal policy:

    • π*(s) = optimal action to take from state s
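
In symbols, these three quantities are tied together by one-step lookahead (standard MDP notation, matching the recap on Page 2):

```latex
V^*(s) = \max_a Q^*(s, a), \qquad
Q^*(s, a) = \sum_{s'} T(s, a, s')\left[ R(s, a, s') + \gamma\, V^*(s') \right], \qquad
\pi^*(s) = \arg\max_a Q^*(s, a)
```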

Page 4: Gridworld: V*

  • Visual representation of state values after 100 iterations:

    • Values range: 0.28 to 1.00

    • Noise factor: 0.2

    • Discount: 0.9

    • Living reward: -0.1

Page 5: Gridworld: Q*

  • Display of Q-values after 100 iterations in gridworld:

    • Values from 0.13 to 1.00

    • Noise factor: 0.2

    • Discount: 0.9

    • Living reward: -0.1

Page 6: Today's Agenda

  • Topics covered:

    • Solving Value Iteration

    • Policy Evaluation

    • Policy Extraction

    • Policy Iteration

    • Introduction to Reinforcement Learning

Page 7: Time-limited Values

  • Key idea of time-limited values:

    • Vk(s): optimal value of state s if the game ends in k time steps

    • Relates to depth-k expectimax implementation

Page 8: Computing Time-Limited Values

  • Computation process for time-limited values: each Vk is computed iteratively from the previous layer Vk-1, starting from V0
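
The iterative computation follows the standard finite-horizon recurrence (written in the notation of the earlier slides; each layer is one ply of expectimax over the previous one):

```latex
V_0(s) = 0, \qquad
V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\left[ R(s, a, s') + \gamma\, V_k(s') \right]
```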

Page 9: Value Iteration

  • Process overview:

    • Initialize: V0(s) = 0 (no time steps remaining means zero expected reward)

    • Conduct one ply of expectimax from each state using Vk values

    • Repeat until convergence

  • Computational complexity: O(S^2 A) per iteration

  • Ensures convergence to unique optimal values

    • The approximations refine toward optimal solutions

    • Policies can converge before values do
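
The process above can be sketched in a few lines of Python. This is an illustrative sketch, not the course's code; the shape of `transitions(s, a)` (yielding (prob, next_state, reward) triples) and all names here are assumptions:

```python
def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """Repeat V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_k(s')].

    `transitions(s, a)` is assumed to yield (prob, next_state, reward) triples;
    `actions(s)` yields the actions available in s (empty for terminal states).
    """
    V = {s: 0.0 for s in states}  # V_0(s) = 0: no time steps remaining
    while True:
        V_next = {}
        for s in states:
            # One ply of expectimax: max over actions of the expected backup.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in transitions(s, a))
                 for a in actions(s)]
            V_next[s] = max(q) if q else 0.0  # terminal states stay at 0
        if max(abs(V_next[s] - V[s]) for s in states) < tol:  # converged
            return V_next
        V = V_next
```

Each sweep touches every state, every action, and every successor state, which is where the O(S^2 A) per-iteration cost comes from.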

Page 10: Example: Value Iteration

  • Initial values:

    • All states V0 initialized to 0

  • Outputs:

    • Analysis of slow vs. fast scenarios with living rewards and gamma values

  • Example calculations for values:

    • Slow: 1.0 * (1 + V1(Cool)) = 1.0 * (1 + 2) = 3

    • Fast: 0.5 * (2 + V1(Cool)) + 0.5 * (2 + V1(Warm)) = 0.5 * (2 + 2) + 0.5 * (2 + 1) = 3.5 (γ = 1 here)

Page 11: Example: Convergence

  • Example progression of iterations:

    • V0 = [0, 0, 0]

    • V1 = [2, 1, 0]

    • V2 = [3.5, 2.5, 0]

    • V3 does not stay at [3.5, 2.5, 0]: with γ = 1 the next backup gives V3 = [5.0, 4.0, 0], and the values keep growing

  • Calculation annotations: the slow and fast backups are highlighted, and a discount γ < 1 is what makes the values converge
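
The progression above can be checked numerically. The transition table below is an assumption based on the standard racing-car example (Cool/Warm/Overheated with slow/fast actions) with γ = 1; it reproduces the V1 and V2 layers listed above:

```python
# Hypothetical transition table: (state, action) -> [(prob, next_state, reward)].
T = {
    ('cool', 'slow'): [(1.0, 'cool', 1.0)],
    ('cool', 'fast'): [(0.5, 'cool', 2.0), (0.5, 'warm', 2.0)],
    ('warm', 'slow'): [(0.5, 'cool', 1.0), (0.5, 'warm', 1.0)],
    ('warm', 'fast'): [(1.0, 'overheated', -10.0)],
}

def backup(V, gamma=1.0):
    """One round of V_{k+1}(s) = max_a sum_{s'} p * (r + gamma * V_k(s'))."""
    new = {'overheated': 0.0}  # terminal state: value stays 0
    for s in ('cool', 'warm'):
        new[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                     for a in ('slow', 'fast'))
    return new

V0 = {'cool': 0.0, 'warm': 0.0, 'overheated': 0.0}
V1 = backup(V0)   # {'cool': 2.0, 'warm': 1.0, 'overheated': 0.0}
V2 = backup(V1)   # {'cool': 3.5, 'warm': 2.5, 'overheated': 0.0}
```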

Page 12: Example: Gridworld Value Iteration

  • Focus on action impacts:

    • Maximum value at action = right

    • Additional actions not shown

    • Maintains noise = 0.2 and gamma = 0.9 with living reward of 0

Page 13: Example: Gridworld Value Iteration Continued

  • Calculate based on optimal action:

    • Formula: 0.8 * [0 + 0.9 * 0.72] + 0.1 * [0 + 0.9 * 0] + 0.1 * [0 + 0.9 * 0] ≈ 0.52

    • Visual representation of prior state values with focus on V2 and V3.
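
The backup above is plain expected-value arithmetic and can be checked directly; the 0.72 successor value and the two zero-valued slip successors are read off the slide:

```python
gamma = 0.9
# (prob, reward, next_state_value) for the intended move and the two noise slips.
outcomes = [(0.8, 0.0, 0.72), (0.1, 0.0, 0.0), (0.1, 0.0, 0.0)]
q = sum(p * (r + gamma * v) for p, r, v in outcomes)
print(round(q, 2))  # 0.52
```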

Page 14: The Bellman Equations

  • Optimal behavior steps:

    1. Take the correct first action

    2. Continue making optimal decisions

Page 15: The Bellman Equations

  • Definition of "optimal utility":

    • Expressed as a recursive relationship among the optimal utility values (the expectimax recurrence)

    • This one-step lookahead relationship forms the foundation of the Bellman equations, which characterize the optimal values.
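
Putting the two steps from Page 14 together yields the Bellman optimality equation (standard form, consistent with the quantities defined on Page 3):

```latex
V^*(s) = \max_a \sum_{s'} T(s, a, s')\left[ R(s, a, s') + \gamma\, V^*(s') \right]
```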