In-Depth Notes on Markov Decision Processes
Markov Decision Processes (MDPs)
Definition:
- MDPs are mathematical frameworks for modeling decision-making scenarios where outcomes are partly random and partly under the control of a decision maker.
Assumptions:
- Fully-observable world: The agent has complete visibility of the state of the environment.
- Non-deterministic actions: Actions can lead to different outcomes based on probabilities.
- Non-deterministic search: Planning becomes a non-deterministic search problem, since choosing an action does not guarantee a specific outcome.
- Applications:
- Robotics: Determine movement considering actuator failures and potential unseen obstacles.
- Resource Allocation: Decide on production strategies while facing uncertainty about customer demands.
Key Concepts in MDPs
Properties:
- States (S): The possible situations the agent can be in.
- Actions A(s): The possible actions available in state s.
- Transition Model T(s,a,s'): The model that describes the probability of moving to state s' given action a is taken in state s: P(s'|s,a).
- Reward Function R(s,a,s'): A function that indicates the immediate reward received after transitioning from state s to state s' through action a.
- Start State: The initial state where the decision process begins.
- Terminal State: The state where the process ends.
- Goal: Maximize the expected total reward collected from the start state to a terminal state (a minimal encoding of these components appears in the sketch after this list).
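As a concrete illustration, these components can be written down as plain data. The toy "racing"-style problem below, including its states, probabilities, and rewards, is an assumption chosen purely for illustration and is not taken from the notes above.

```python
# Toy encoding of the MDP components above: S, A(s), T(s,a,s'), R(s,a,s'),
# a start state, and a terminal state. All numbers are illustrative assumptions.

STATES = ["cool", "warm", "overheated"]                    # S
ACTIONS = {"cool": ["slow", "fast"],                       # A(s); terminal state has
           "warm": ["slow", "fast"],                       # no available actions
           "overheated": []}

# Transition model T(s, a, s') = P(s' | s, a)
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}

# Reward function R(s, a, s'): immediate reward for each transition
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

START_STATE = "cool"
TERMINAL_STATES = {"overheated"}
```

Any representation works as long as it exposes the states, the actions available in each state, the transition probabilities, the rewards, a start state, and the terminal states.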
Markov Property:
- Only the current state impacts the outcomes of the actions, not the history of how that state was reached.
Policies:
- A policy \pi(s) gives an action for each state.
- The aim is to find the optimal policy \pi^* that maximizes expected utility when followed.
Utility and Reward
Utility of Sequences:
- Preferences over sequences of rewards influence decision-making (e.g., [1, 2, 2] vs. [2, 3, 4]).
- Considerations include immediate vs. future rewards (e.g., [0, 0, 1] vs. [1, 0, 0]).
Discounting:
- The preference for immediate rewards over future ones is represented by a discount factor \gamma, where 0 \leq \gamma \leq 1.
- Choosing \gamma < 1 keeps infinite sums of rewards bounded and is essential for ensuring convergence of MDP algorithms (a small numerical sketch follows below).
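A minimal sketch of how discounting changes preferences between the reward sequences mentioned above; the value \gamma = 0.5 is an assumption chosen only to keep the arithmetic simple.

```python
# Discounted utility of a reward sequence: U = sum_t gamma^t * r_t
def discounted_utility(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

gamma = 0.5  # assumed value, purely for illustration
print(discounted_utility([1, 2, 2], gamma))  # 1 + 1.0 + 0.5  = 2.5
print(discounted_utility([2, 3, 4], gamma))  # 2 + 1.5 + 1.0  = 4.5
print(discounted_utility([0, 0, 1], gamma))  # 0 + 0   + 0.25 = 0.25
print(discounted_utility([1, 0, 0], gamma))  # 1 + 0   + 0    = 1.0
```

Under any \gamma < 1 the agent prefers [1, 0, 0] to [0, 0, 1], which captures the immediate-versus-future trade-off noted above.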
Values of States:
- The expected utility of starting in state s and acting optimally thereafter is denoted U^*(s).
- Action-utility function or Q-function: Q^*(s,a) is the expected utility of taking action a in state s and then acting optimally; the two are related by U^*(s) = \max_a Q^*(s,a).
Bellman Equation
- Recursive equation used to define the utilities of states:
  U^*(s) = \max_a \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma U^*(s') ]
- This optimality condition says that the utility of a state is the maximum expected reward obtainable from it: the best achievable combination of immediate expected reward and discounted utility of the successor state.
Value Iteration
- Process:
- Initialize utilities: U_0(s) = 0 for all states.
- Iteratively update utilities according to the Bellman equation until convergence to U^*.
- Computational complexity: O(S^2 A) per iteration, since each sweep considers every state, every action, and every successor state (a minimal sketch of the full loop follows below).
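A compact sketch of value iteration under the definitions above. The dictionary-based MDP encoding, the stopping tolerance, and the tiny "line-world" at the end are illustrative assumptions rather than part of the notes.

```python
# Value iteration: repeatedly apply the Bellman update
#   U_{k+1}(s) = max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * U_k(s') ]
# until the largest change across all states falls below a tolerance.
def value_iteration(states, A, T, R, gamma=0.9, tol=1e-6):
    # A: state -> list of actions; T: (s, a) -> {s': prob}; R: (s, a, s') -> reward
    U = {s: 0.0 for s in states}           # U_0(s) = 0 for all states
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            if not A[s]:                    # terminal state: no actions, value stays 0
                U_new[s] = 0.0
                continue
            U_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * U[s2]) for s2, p in T[(s, a)].items())
                for a in A[s]
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < tol:
            return U

# Toy "line-world", assumed purely for illustration: states 0..3, state 3 is terminal,
# moving right succeeds with probability 0.8 (else the agent stays put), reaching 3 pays +1.
states = [0, 1, 2, 3]
A = {0: ["right"], 1: ["right"], 2: ["right"], 3: []}
T = {(s, "right"): {s + 1: 0.8, s: 0.2} for s in [0, 1, 2]}
R = {(s, "right", s2): (1.0 if s2 == 3 else 0.0) for s in [0, 1, 2] for s2 in (s + 1, s)}
print(value_iteration(states, A, T, R, gamma=0.9))
```

Once U^* has converged, acting greedily with respect to it (choosing the action that maximizes the bracketed expression in each state) yields the optimal policy \pi^*.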
Example: Gridworld
- A standard scenario used to illustrate MDP concepts: states are cells of a grid, and rewards depend on the transitions taken between cells.
- Displaying the values after each iteration illustrates how they progress toward the optimal values.
- Convergence Criteria:
- Value iteration converges when there are finitely many states and actions, rewards are bounded, and the discount factor satisfies \gamma < 1.
Policy Iteration
- Steps:
- Policy Evaluation: Given a policy \pi_i, calculate the utility of each state under that policy.
- For each state s:
  U_i(s) = \sum_{s'} P(s'|s, \pi_i(s)) [ R(s, \pi_i(s), s') + \gamma U_i(s') ]
- Policy Improvement: Update the policy to act greedily with respect to the calculated utilities; the two steps alternate until the policy no longer changes (a sketch of both steps follows below).
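A sketch of the two alternating steps, reusing the same dictionary-based MDP encoding assumed in the value-iteration sketch. Running a fixed number of evaluation sweeps, rather than solving the linear system for U_i exactly, is also an assumption made here for simplicity.

```python
# Policy iteration: alternate policy evaluation and greedy policy improvement.
def policy_evaluation(pi, U, states, T, R, gamma, sweeps=50):
    # Approximate U_i(s) for the fixed policy pi by repeatedly applying
    #   U_i(s) = sum_{s'} P(s'|s, pi(s)) [ R(s, pi(s), s') + gamma * U_i(s') ]
    for _ in range(sweeps):
        U = {s: 0.0 if pi[s] is None else
                sum(p * (R[(s, pi[s], s2)] + gamma * U[s2])
                    for s2, p in T[(s, pi[s])].items())
             for s in states}
    return U

def policy_iteration(states, A, T, R, gamma=0.9):
    pi = {s: (A[s][0] if A[s] else None) for s in states}   # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        U = policy_evaluation(pi, U, states, T, R, gamma)
        # Policy improvement: act greedily with respect to the evaluated utilities.
        new_pi = {s: (max(A[s], key=lambda a: sum(p * (R[(s, a, s2)] + gamma * U[s2])
                                                  for s2, p in T[(s, a)].items()))
                      if A[s] else None)
                  for s in states}
        if new_pi == pi:           # no action changed anywhere: the policy has converged
            return pi, U
        pi = new_pi

# e.g. policy_iteration(states, A, T, R) on the line-world from the value-iteration sketch
```

With exact evaluation, the loop is guaranteed to terminate: there are finitely many policies, and each improvement step never makes the policy worse.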
Summary
- MDPs model non-deterministic search problems; the objective is to find an effective (ideally optimal) policy using methods such as:
- Value Iteration
- Policy Iteration, which alternates policy evaluation and policy improvement phases.
Preview: Reinforcement Learning
- The setting is similar to MDPs, but the agent is not given an explicit transition model or reward function.
- The focus is on agents learning from interactions with their environment and refining their policies based on the outcomes (rewards) they observe.