In-Depth Notes on Markov Decision Processes
Markov Decision Processes (MDPs)
Definition:
- MDPs are mathematical frameworks for modeling decision-making scenarios where outcomes are partly random and partly under the control of a decision maker.
Assumptions:
- Fully-observable world: The agent has complete visibility of the state of the environment.
- Non-deterministic actions: Actions can lead to different outcomes based on probabilities.
- Non-deterministic search: Planning becomes a non-deterministic search problem, since choosing an action does not guarantee a specific outcome.
- Applications:
- Robotics: Determine movement considering actuator failures and potential unseen obstacles.
- Resource Allocation: Decide on production strategies while facing uncertainty about customer demands.
Key Concepts in MDPs
Properties:
- States (S): The possible situations the agent can be in.
- Actions A(s): The possible actions available in state s.
- Transition Model T(s,a,s'): The model that describes the probability of moving to state s' given action a is taken in state s: P(s'|s,a).
- Reward Function R(s,a,s'): A function that indicates the immediate reward received after transitioning from state s to state s' through action a.
- Start State: The initial state where the decision process begins.
- Terminal State: The state where the process ends.
- Goal: Maximize the expected total reward collected from the start state to a terminal state (a minimal encoding of these components appears in the sketch after this list).
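As a concrete illustration, these components can be written down as plain data. The toy "racing"-style problem below, including its states, probabilities, and rewards, is an assumption chosen purely for illustration and is not taken from the notes above.

```python
# Toy encoding of the MDP components above: S, A(s), T(s,a,s'), R(s,a,s'),
# a start state, and a terminal state. All numbers are illustrative assumptions.

STATES = ["cool", "warm", "overheated"]                    # S
ACTIONS = {"cool": ["slow", "fast"],                       # A(s); terminal state has
           "warm": ["slow", "fast"],                       # no available actions
           "overheated": []}

# Transition model T(s, a, s') = P(s' | s, a)
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}

# Reward function R(s, a, s'): immediate reward for each transition
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

START_STATE = "cool"
TERMINAL_STATES = {"overheated"}
```

Any representation works as long as it exposes the states, the actions available in each state, the transition probabilities, the rewards, a start state, and the terminal states.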
Markov Property:
- Only the current state impacts the outcomes of the actions, not the history of how that state was reached.
Policies:
- A policy \pi(s) gives an action for each state.
- The aim is to find the optimal policy \pi^* that maximizes expected utility when followed.
Utility and Reward
Utility of Sequences:
- Preferences over sequences of rewards influence decision-making (e.g., [1, 2, 2] vs. [2, 3, 4]).
- Considerations include immediate vs. future rewards (e.g., [0, 0, 1] vs. [1, 0, 0]).
Discounting:
- The preference for immediate rewards over future ones is represented by a discount factor \gamma, where 0 \leq \gamma \leq 1.
- Choosing \gamma < 1 keeps infinite sums of rewards bounded and is essential for ensuring convergence of MDP algorithms (a small numerical sketch follows below).
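A minimal sketch of how discounting changes preferences between the reward sequences mentioned above; the value \gamma = 0.5 is an assumption chosen only to keep the arithmetic simple.

```python
# Discounted utility of a reward sequence: U = sum_t gamma^t * r_t
def discounted_utility(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

gamma = 0.5  # assumed value, purely for illustration
print(discounted_utility([1, 2, 2], gamma))  # 1 + 1.0 + 0.5  = 2.5
print(discounted_utility([2, 3, 4], gamma))  # 2 + 1.5 + 1.0  = 4.5
print(discounted_utility([0, 0, 1], gamma))  # 0 + 0   + 0.25 = 0.25
print(discounted_utility([1, 0, 0], gamma))  # 1 + 0   + 0    = 1.0
```

Under any \gamma < 1 the agent prefers [1, 0, 0] to [0, 0, 1], which captures the immediate-versus-future trade-off noted above.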
Values of States:
- The expected utility of starting in state s and acting optimally thereafter is denoted U^*(s).
- Action-utility function or Q-function: Q^*(s,a) is the expected utility of taking action a in state s and then acting optimally; the two are related by U^*(s) = \max_a Q^*(s,a).
Bellman Equation
- Recursive equation used to define the utilities of states:
  U^*(s) = \max_a \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma U^*(s') ]
- This optimality condition says that the utility of a state is the maximum expected reward obtainable from it: the best achievable combination of immediate expected reward and discounted utility of the successor state.
Value Iteration
- Process:
- Initialize utilities: U_0(s) = 0 for all states.
- Iteratively update utilities according to the Bellman equation until convergence to U^*.
- Computational complexity: O(S^2 A) per iteration, since each sweep considers every state, every action, and every successor state (a minimal sketch of the full loop follows below).
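A compact sketch of value iteration under the definitions above. The dictionary-based MDP encoding, the stopping tolerance, and the tiny "line-world" at the end are illustrative assumptions rather than part of the notes.

```python
# Value iteration: repeatedly apply the Bellman update
#   U_{k+1}(s) = max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * U_k(s') ]
# until the largest change across all states falls below a tolerance.
def value_iteration(states, A, T, R, gamma=0.9, tol=1e-6):
    # A: state -> list of actions; T: (s, a) -> {s': prob}; R: (s, a, s') -> reward
    U = {s: 0.0 for s in states}           # U_0(s) = 0 for all states
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            if not A[s]:                    # terminal state: no actions, value stays 0
                U_new[s] = 0.0
                continue
            U_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * U[s2]) for s2, p in T[(s, a)].items())
                for a in A[s]
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < tol:
            return U

# Toy "line-world", assumed purely for illustration: states 0..3, state 3 is terminal,
# moving right succeeds with probability 0.8 (else the agent stays put), reaching 3 pays +1.
states = [0, 1, 2, 3]
A = {0: ["right"], 1: ["right"], 2: ["right"], 3: []}
T = {(s, "right"): {s + 1: 0.8, s: 0.2} for s in [0, 1, 2]}
R = {(s, "right", s2): (1.0 if s2 == 3 else 0.0) for s in [0, 1, 2] for s2 in (s + 1, s)}
print(value_iteration(states, A, T, R, gamma=0.9))
```

Once U^* has converged, acting greedily with respect to it (choosing the action that maximizes the bracketed expression in each state) yields the optimal policy \pi^*.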
Example: Gridworld
- A standard scenario used to illustrate MDP concepts: states are cells of a grid, and rewards depend on the transitions taken between cells.
- Displaying the values after each iteration illustrates how they progress toward the optimal values.
- Convergence Criteria:
- Value iteration converges when there are finitely many states and actions, rewards are bounded, and the discount factor satisfies \gamma < 1.
Policy Iteration
- Steps:
- Policy Evaluation: Given a policy \pi_i, calculate the utility of each state under that policy.
- For each state s:
  U_i(s) = \sum_{s'} P(s'|s, \pi_i(s)) [ R(s, \pi_i(s), s') + \gamma U_i(s') ]
- Policy Improvement: Update the policy to act greedily with respect to the calculated utilities; the two steps alternate until the policy no longer changes (a sketch of both steps follows below).
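A sketch of the two alternating steps, reusing the same dictionary-based MDP encoding assumed in the value-iteration sketch. Running a fixed number of evaluation sweeps, rather than solving the linear system for U_i exactly, is also an assumption made here for simplicity.

```python
# Policy iteration: alternate policy evaluation and greedy policy improvement.
def policy_evaluation(pi, U, states, T, R, gamma, sweeps=50):
    # Approximate U_i(s) for the fixed policy pi by repeatedly applying
    #   U_i(s) = sum_{s'} P(s'|s, pi(s)) [ R(s, pi(s), s') + gamma * U_i(s') ]
    for _ in range(sweeps):
        U = {s: 0.0 if pi[s] is None else
                sum(p * (R[(s, pi[s], s2)] + gamma * U[s2])
                    for s2, p in T[(s, pi[s])].items())
             for s in states}
    return U

def policy_iteration(states, A, T, R, gamma=0.9):
    pi = {s: (A[s][0] if A[s] else None) for s in states}   # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        U = policy_evaluation(pi, U, states, T, R, gamma)
        # Policy improvement: act greedily with respect to the evaluated utilities.
        new_pi = {s: (max(A[s], key=lambda a: sum(p * (R[(s, a, s2)] + gamma * U[s2])
                                                  for s2, p in T[(s, a)].items()))
                      if A[s] else None)
                  for s in states}
        if new_pi == pi:           # no action changed anywhere: the policy has converged
            return pi, U
        pi = new_pi

# e.g. policy_iteration(states, A, T, R) on the line-world from the value-iteration sketch
```

With exact evaluation, the loop is guaranteed to terminate: there are finitely many policies, and each improvement step never makes the policy worse.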
Summary
- MDPs model non-deterministic search problems; the objective is to find an effective (ideally optimal) policy using methods such as:
- Value Iteration
- Policy Iteration, which alternates policy evaluation and policy improvement phases.
Preview: Reinforcement Learning
- The setting is similar to MDPs, but the agent is not given an explicit transition model or reward function.
- The focus is on agents learning from interactions with their environment and refining their policies based on the outcomes (rewards) they observe.