
In-Depth Notes on Markov Decision Processes

Markov Decision Processes (MDPs)

  • Definition:

    • MDPs are mathematical frameworks for modeling decision-making scenarios where outcomes are partly random and partly under the control of a decision maker.
  • Assumptions:

    • Fully-observable world: The agent has complete visibility of the state of the environment.
    • Non-deterministic actions: Actions can lead to different outcomes based on probabilities.
    • Non-deterministic search: Planning becomes a search problem in which executing a chosen action does not guarantee a particular successor state.
  • Applications:

    • Robotics: Determine movement while accounting for actuator failures and potential unseen obstacles.
    • Resource Allocation: Decide on production strategies while facing uncertainty about customer demand.

Key Concepts in MDPs

  • Properties:

    • States (S): The possible situations the agent can be in.
    • Actions A(s): The possible actions available in state s.
    • Transition Model T(s,a,s'): The model that describes the probability of moving to state s' given action a is taken in state s: P(s'|s,a).
    • Reward Function R(s,a,s'): A function that indicates the immediate reward received after transitioning from state s to state s' through action a.
    • Start State: The initial state where the decision process begins.
    • Terminal State: The state where the process ends.
    • Goal: Maximize the expected total (discounted) reward accumulated from the start state until a terminal state is reached.
  • Markov Property:

    • Only the current state impacts the outcomes of the actions, not the history of how that state was reached.
  • Policies:

    • A policy \pi(s) gives an action for each state.
    • The aim is to find the optimal policy \pi^* that maximizes expected utility when followed (a minimal encoding of these components and a policy is sketched after this list).
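  • A minimal sketch (Python) of one way to encode these components; the two-state example, its probabilities, rewards, and variable names are illustrative assumptions rather than anything specified in these notes:

      # States S and actions A(s)
      states = ["s0", "s1"]
      actions = {"s0": ["stay", "go"], "s1": ["stay"]}

      # Transition model T(s, a, s') = P(s'|s, a)
      T = {
          ("s0", "stay"): {"s0": 1.0},
          ("s0", "go"):   {"s0": 0.2, "s1": 0.8},   # non-deterministic action
          ("s1", "stay"): {"s1": 1.0},
      }

      # Reward function R(s, a, s')
      R = {
          ("s0", "stay", "s0"): 0.0,
          ("s0", "go", "s0"):  -1.0,
          ("s0", "go", "s1"):   5.0,
          ("s1", "stay", "s1"): 0.0,
      }

      gamma = 0.9  # discount factor

      # A policy maps each state to an action
      policy = {"s0": "go", "s1": "stay"}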

Utility and Reward

  • Utility of Sequences:

    • Preferences over sequences of rewards influence decision-making (e.g., [1, 2, 2] vs. [2, 3, 4]).
    • Considerations include immediate vs. future rewards (e.g., [0, 0, 1] vs. [1, 0, 0]).
  • Discounting:

    • Preference for immediate rewards over future ones is represented by a discount factor \gamma, where 0 \leq \gamma \leq 1 (a short worked example follows this list).
    • Essential for ensuring convergence in MDP algorithms.
  • Values of States:

    • Expected utility for starting in state s and acting optimally denoted by U^*(s).
    • Action-Utility function or Q-function: Q^*(s,a) represents expected utility for taking action a in state s and then acting optimally.
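  • Worked example of discounted utility (the standard definition; the specific numbers and \gamma = 0.5 below are illustrative assumptions):

    • U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots = \sum_{t \geq 0} \gamma^t r_t
    • With \gamma = 0.5: U([1, 2, 3]) = 1 + 0.5(2) + 0.25(3) = 2.75, whereas U([3, 2, 1]) = 3 + 0.5(2) + 0.25(1) = 4.25, so rewards received earlier count for more.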

Bellman Equation

  • Recursive equation that defines the utility of each state in terms of the utilities of its successor states:
    • The optimality condition: the utility of a state is the maximum expected discounted reward obtainable from that state, written out below.
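  • Written out with the notation defined above (this is the standard form; the original notes do not spell it out):
    • U^*(s) = \max_{a \in A(s)} \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma U^*(s')]
    • Equivalently, U^*(s) = \max_a Q^*(s,a), where Q^*(s,a) = \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma U^*(s')].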

Value Iteration

  • Process:
    • Initialize utilities: U_0(s) = 0 for all states s.
    • Iteratively apply the Bellman update U_{k+1}(s) = \max_{a} \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma U_k(s')] until the values converge to U^* (see the sketch below).
    • Computational complexity: O(S^2 A) per iteration.
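  • A compact value-iteration sketch in Python, assuming the dictionary-based MDP encoding from the earlier example (states, actions, T, R, gamma); the stopping tolerance tol is an implementation choice, not part of the notes:

      def value_iteration(states, actions, T, R, gamma, tol=1e-6):
          U = {s: 0.0 for s in states}          # U_0(s) = 0 for all states
          while True:
              U_next = {}
              for s in states:
                  # Bellman update: best action's expected discounted reward
                  U_next[s] = max(
                      sum(p * (R[(s, a, s2)] + gamma * U[s2])
                          for s2, p in T[(s, a)].items())
                      for a in actions[s]
                  )
              # Stop when no state's utility changes by more than tol
              if max(abs(U_next[s] - U[s]) for s in states) < tol:
                  return U_next
              U = U_next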

Example: Gridworld

  • An example scenario used to illustrate MDP concepts: states are cells on a grid, and rewards depend on the actions taken in those cells.
    • Successive value-iteration updates are shown to illustrate how the state values approach their optimal values.
  • Convergence Criteria:
    • Value iteration converges when there are finitely many states and actions, rewards are bounded, and the discount factor satisfies \gamma < 1 (see the note below).
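  • Why it converges (the standard contraction argument, not spelled out in the original notes): each Bellman update shrinks the worst-case error by a factor of \gamma, i.e. \max_s |U_{k+1}(s) - U^*(s)| \leq \gamma \max_s |U_k(s) - U^*(s)|, so the error goes to zero whenever \gamma < 1.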

Policy Iteration

  • Steps:
    1. Policy Evaluation: Given a policy \pi_i, calculate the utility of each state when acting according to \pi_i.
    • For each state s:
      U_i(s) = \sum_{s'} P(s'|s, \pi_i(s)) [R(s, \pi_i(s), s') + \gamma U_i(s')]
    2. Policy Improvement: Update the policy to act greedily with respect to the calculated utilities; repeat both steps until the policy no longer changes (a sketch of both phases follows this list).
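  • A policy-iteration sketch in Python, again assuming the dictionary-based MDP encoding used above; evaluating the policy by repeated sweeps (rather than solving the linear system exactly) is an implementation choice, not something specified in the notes:

      def q_value(s, a, U, T, R, gamma):
          # Expected discounted reward of taking action a in state s under utilities U
          return sum(p * (R[(s, a, s2)] + gamma * U[s2]) for s2, p in T[(s, a)].items())

      def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
          policy = {s: actions[s][0] for s in states}   # arbitrary initial policy
          while True:
              # 1. Policy evaluation: approximate U_i for the fixed policy pi_i
              U = {s: 0.0 for s in states}
              for _ in range(eval_sweeps):
                  U = {s: q_value(s, policy[s], U, T, R, gamma) for s in states}
              # 2. Policy improvement: act greedily with respect to U
              new_policy = {s: max(actions[s], key=lambda a: q_value(s, a, U, T, R, gamma))
                            for s in states}
              if new_policy == policy:                  # policy is stable: done
                  return policy, U
              policy = new_policy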

Summary

  • MDPs model non-deterministic search problems; the objective is to find an effective (ideally optimal) policy using methods such as:
    • Value Iteration
    • Policy Iteration, which alternates policy evaluation and policy improvement phases.

Preview: Reinforcement Learning

  • The setting resembles an MDP, but the agent is not given an explicit transition model or reward function.
  • The focus is on agents that learn from interaction with their environment and refine their policies based on the outcomes of their behavior.