Flashcards covering the fundamentals of Markov Decision Processes (MDPs), including definitions, policy evaluation, Bellman equations, and value iteration algorithms based on the Chapter 17 lecture notes.
What is the primary purpose of a Markov Decision Process (MDP)?
MDPs are used to model sequential decision-making processes in environments where the outcome of an action is uncertain and influenced by both the current state and the chosen action.
What are the six core components of a Markov Decision Process definition?
The set of states States (including the start state), the actions Actions(s) available from each state, the transition probabilities T(s′∣s,a), the reward function Reward(s,a,s′), a test IsEnd(s) for terminal states, and the discount factor 0≤γ≤1.
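A minimal Python sketch of how these components might be packaged together, using a dice-game-style MDP as a running example; the class name, method names, and the specific numbers (stay pays 4 and continues with probability 2/3, quit pays 10) are illustrative assumptions, not quoted from the notes.

```python
# Hypothetical sketch of an MDP's components, using a dice-game-style example.
# The class/method names and the specific numbers (reward 4, continue prob 2/3,
# quit reward 10) are illustrative assumptions, not taken from the notes.
class DiceGameMDP:
    def start_state(self):
        return "in"                         # s_start

    def is_end(self, state):
        return state == "end"               # IsEnd(s)

    def actions(self, state):
        return ["stay", "quit"]             # Actions(s)

    def succ_prob_reward(self, state, action):
        # List of (s', T(s'|s,a), Reward(s,a,s')) triples for a given (s, a).
        if action == "quit":
            return [("end", 1.0, 10)]
        return [("in", 2 / 3, 4), ("end", 1 / 3, 4)]

    def discount(self):
        return 1.0                          # gamma
```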
What is the discount factor γ and how does its value affect an agent's preference?
The discount factor 0≤γ≤1 specifies how much an agent values future rewards compared to current ones. A γ close to 0 makes the agent favor immediate rewards, while a γ of 1 treats future rewards as equal to immediate ones (additive rewards).
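For instance, a path that earns rewards r1, r2, r3, … on successive steps has discounted utility u = r1 + γr2 + γ²r3 + ⋯, so with γ = 0.5 a reward two steps away is worth only a quarter of the same reward received immediately.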
How does the environment in a search problem differ from the environment in an MDP?
Search problems typically assume a deterministic environment where the outcome of an action is known (using a successor function Succ(s,a)), whereas MDPs involve stochastic/uncertain environments modeled with transition probabilities T(s′∣s,a).
What is a policy π in the context of an MDP?
A policy π is a mapping that assigns an action a∈Actions(s) to every state s∈States.
Why is it necessary to define a policy for every state in an MDP rather than just a path?
Because of the randomness in transitions (e.g., dice rolls), an agent cannot predict exactly which state it will end up in; therefore, it needs a pre-defined action for every possible state it might encounter.
What is the Markov property?
The principle that the future depends only on the present state and not on the past sequence of events, meaning the current state contains all information necessary to make an optimal decision.
How is 'Value' (Vπ(s)) defined for a policy?
Value is the expected utility (the discounted sum of rewards) that an agent receives by starting in state s and following policy π.
What is a Q-value (Qπ(s,a)) in policy evaluation?
The expected utility of taking a specific action a from state s and then following policy π for all subsequent steps.
What is the recurrence relation for the Q-value Qπ(s,a) used in policy evaluation?
Qπ(s,a) = ∑_{s′} T(s′∣s,a)[Reward(s,a,s′) + γVπ(s′)]
How does the iterative policy evaluation algorithm determine convergence?
It continues until the maximum change between state values in consecutive iterations is at most an error tolerance ϵ, i.e., max_{s∈States} ∣Vπ^(t)(s) − Vπ^(t−1)(s)∣ ≤ ϵ.
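A sketch of iterative policy evaluation under these definitions, assuming the hypothetical DiceGameMDP interface above and a policy stored as a plain dict from states to actions; the inner Q function is the recurrence from the previous card and the stopping test is the ϵ check described here.

```python
def policy_evaluation(mdp, policy, epsilon=1e-6):
    # Hypothetical sketch: `mdp` follows the DiceGameMDP interface above and
    # `policy` is a dict mapping each non-end state to the action pi(s).
    V = {s: 0.0 for s in policy}  # current estimate of V_pi(s)

    def Q(state, action):
        # Q_pi(s,a) = sum over s' of T(s'|s,a) * [Reward(s,a,s') + gamma * V_pi(s')]
        return sum(prob * (reward + mdp.discount() * V.get(succ, 0.0))
                   for succ, prob, reward in mdp.succ_prob_reward(state, action))

    while True:
        new_V = {s: Q(s, policy[s]) for s in policy}
        # Converged once max_s |V^(t)(s) - V^(t-1)(s)| <= epsilon.
        if max(abs(new_V[s] - V[s]) for s in policy) <= epsilon:
            return new_V
        V = new_V

# Example: the 'stay' policy on the dice game above evaluates to roughly 12.
# policy_evaluation(DiceGameMDP(), {"in": "stay"})  # -> {'in': ~12.0}
```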
What is the Bellman equation for the optimal value V∗(s) of a non-terminal state?
V∗(s) = max_{a∈Actions(s)} Q∗(s,a), where Q∗(s,a) is the expected utility of taking action a and acting optimally thereafter.
What are the two conditions under which the value iteration algorithm is guaranteed to converge?
Value iteration converges if the discount factor γ < 1 or if the MDP graph is acyclic.
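A matching sketch of value iteration with the same assumed interface; each sweep applies the Bellman optimality update V∗(s) = max over a of Q∗(s,a) from the earlier card, and the loop stops with the same ϵ-based test used for policy evaluation.

```python
def value_iteration(mdp, states, epsilon=1e-6):
    # Hypothetical sketch: `states` lists the non-end states of an `mdp` that
    # follows the DiceGameMDP interface above.
    V = {s: 0.0 for s in states}  # current estimate of V*(s)

    def Q(state, action):
        # Q*(s,a) = sum over s' of T(s'|s,a) * [Reward(s,a,s') + gamma * V*(s')]
        return sum(prob * (reward + mdp.discount() * V.get(succ, 0.0))
                   for succ, prob, reward in mdp.succ_prob_reward(state, action))

    while True:
        # Bellman update: V*(s) = max over a in Actions(s) of Q*(s,a).
        new_V = {s: max(Q(s, a) for a in mdp.actions(s)) for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) <= epsilon:
            return new_V
        V = new_V

# Example: value_iteration(DiceGameMDP(), ["in"]) converges to roughly {'in': 12}.
```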
In the 'Dice Game' example provided, what was the expected utility of the 'quit' policy versus the 'stay' policy?
The expected utility for 'quit' was 10, and the expected utility for 'stay' was 12.
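As a sanity check, in the usual version of this example (assumed here, since the card does not restate the setup) 'stay' pays 4 per round and the game continues with probability 2/3, so the stay value satisfies V = 4 + (2/3)V, i.e. V = 12, while quitting pays 10 once.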
How is the optimal policy π∗(s) derived from the optimal Q-values?
π∗(s) = arg max_{a∈Actions(s)} Q∗(s,a)
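Once V∗ has been computed (for example by the value_iteration sketch above), the optimal policy can be read off with an argmax over Q-values; the names here are the same hypothetical ones used in the earlier sketches.

```python
def extract_optimal_policy(mdp, V_star, states):
    # pi*(s) = argmax over a in Actions(s) of Q*(s,a), computed from V*.
    def Q(state, action):
        return sum(prob * (reward + mdp.discount() * V_star.get(succ, 0.0))
                   for succ, prob, reward in mdp.succ_prob_reward(state, action))

    return {s: max(mdp.actions(s), key=lambda a: Q(s, a)) for s in states}

# Example: on the dice game above the optimal action in state 'in' is 'stay'.
# extract_optimal_policy(DiceGameMDP(), value_iteration(DiceGameMDP(), ["in"]), ["in"])
```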