Making Complex Decisions: Markov Decision Processes

Flashcards covering the fundamentals of Markov Decision Processes (MDPs), including definitions, policy evaluation, Bellman equations, and value iteration algorithms based on the Chapter 17 lecture notes.

1

What is the primary purpose of a Markov Decision Process (MDP)?

MDPs are used to model sequential decision-making processes in environments where the outcome of an action is uncertain and influenced by both the current state and the chosen action.

2

What are the six core components that define a Markov Decision Process?

1. A set of states, $States$.
2. A starting state $s_{start}$.
3. $Actions(s)$: the possible actions from state $s$.
4. $T(s'|s, a)$: transition probabilities.
5. $Reward(s, a, s')$: rewards for transitions.
6. $IsEnd(s)$: a test to determine if the game has ended.
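
As a concrete sketch, here is one way to encode these six components in Python. All names and numbers are illustrative (loosely modeled on the dice-game example from card 14), not taken from the notes:

```python
# The six MDP components as plain Python (hypothetical two-state game).
states = ["in", "end"]                      # 1. set of States
s_start = "in"                              # 2. starting state

def actions(s):                             # 3. Actions(s)
    return ["stay", "quit"] if s == "in" else []

# 4 & 5. T(s'|s, a) and Reward(s, a, s') combined:
# T[(s, a)] lists (next_state, probability, reward) triples.
T = {
    ("in", "stay"): [("in", 2/3, 4), ("end", 1/3, 4)],
    ("in", "quit"): [("end", 1.0, 10)],
}

def is_end(s):                              # 6. IsEnd(s)
    return s == "end"
```
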
3

What is the discount factor $\gamma$ and how does its value affect an agent's preferences?

The discount factor $0 \leq \gamma \leq 1$ specifies how much an agent values future rewards compared to current ones. A $\gamma$ close to $0$ makes the agent favor immediate rewards, while $\gamma = 1$ treats future rewards as equal to immediate ones (additive rewards).
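
A quick numerical illustration (the reward sequence is made up): the same three rewards are worth very different amounts under different $\gamma$ values.

```python
# Utility of a hypothetical reward sequence 4, 4, 4:
# u = r1 + gamma * r2 + gamma^2 * r3.
rewards = [4, 4, 4]
for gamma in (0.1, 0.5, 1.0):
    u = sum(gamma**t * r for t, r in enumerate(rewards))
    print(gamma, u)  # 0.1 -> 4.44 (myopic), 0.5 -> 7.0, 1.0 -> 12 (additive)
```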

4

How does the environment in a search problem differ from the environment in an MDP?

Search problems typically assume a deterministic environment where the outcome of an action is known (via a successor function $Succ(s, a)$), whereas MDPs involve stochastic, uncertain environments modeled with transition probabilities $T(s'|s, a)$.

5

What is a policy $\pi$ in the context of an MDP?

A policy $\pi$ is a mapping that assigns an action $a \in Actions(s)$ to every state $s \in States$.
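
For a finite state space this is just a lookup table; a minimal sketch, reusing the hypothetical dice-game names from the card-2 sketch:

```python
# A policy as a dict: one chosen action per non-terminal state.
policy = {"in": "stay"}
```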

6

Why is it necessary to define a policy for every state in an MDP rather than just a path?

Because of the randomness in transitions (e.g., dice rolls), an agent cannot predict exactly which state it will end up in; therefore, it needs a pre-defined action for every possible state it might encounter.

7

What is the Markov property?

The principle that the future depends only on the present state and not on the past sequence of events, meaning the current state contains all information necessary to make an optimal decision.

8

How is 'Value' ($V^\pi(s)$) defined for a policy?

Value is the expected utility (the discounted sum of rewards) that an agent receives by starting in state $s$ and following policy $\pi$.
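
Written out as a formula (a standard formulation of "expected discounted sum of rewards", not quoted verbatim from the notes): $V^\pi(s) = \mathbb{E}\big[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots\big]$, where $r_t$ is the reward received on step $t$ of the random path generated by starting at $s$ and following $\pi$.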

9

What is a Q-value ($Q^\pi(s, a)$) in policy evaluation?

The expected utility of taking a specific action $a$ from state $s$ and then following policy $\pi$ for all subsequent steps.

10

What is the recurrence relation for the Q-value $Q^\pi(s, a)$ used in policy evaluation?

$Q^\pi(s, a) = \sum_{s'} T(s'|s, a)\,[Reward(s, a, s') + \gamma V^\pi(s')]$
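
In code, this recurrence is a single expected-value computation; a minimal sketch, assuming the (next_state, probability, reward) encoding of $T$ and $Reward$ used in the card-2 sketch:

```python
def q_value(T, V, s, a, gamma):
    """Q^pi(s, a): expected reward of the next transition plus the
    discounted value of the landing state, per the recurrence above.
    T[(s, a)] lists (next_state, probability, reward) triples; V maps
    states to current estimates of V^pi."""
    return sum(p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)])
```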

11

How does the iterative policy evaluation algorithm determine convergence?

It continues until the maximum change in any state's value between consecutive iterations is at most an error tolerance $\epsilon$, i.e., $\max_{s \in States} |V_\pi^{(t)}(s) - V_\pi^{(t-1)}(s)| \leq \epsilon$.
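
A sketch of the full evaluation loop under the same encoding (the data layout and the eps default are assumptions, not from the notes):

```python
def evaluate_policy(states, policy, T, gamma, eps=1e-6):
    """Iterative policy evaluation: apply the Q^pi recurrence to every
    state until no value changes by more than eps. Terminal states have
    no entry in policy/T and keep value 0."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            outcomes = T.get((s, policy.get(s)), [])
            V_new[s] = sum(p * (r + gamma * V[s2]) for (s2, p, r) in outcomes)
        if max(abs(V_new[s] - V[s]) for s in states) <= eps:
            return V_new
        V = V_new
```

With the dice-game numbers from the card-2 sketch, evaluate_policy(["in", "end"], {"in": "stay"}, T, gamma=1.0) converges to roughly 12 for the 'in' state, matching card 14.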

12

What is the Bellman equation for the optimal value $V^*(s)$ of a non-terminal state?

$V^*(s) = \max_{a \in Actions(s)} Q^*(s, a)$, where $Q^*(s, a)$ is the expected utility of taking action $a$ and acting optimally thereafter.
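
Substituting the Q-value recurrence from card 10 (with $V^\pi$ replaced by the optimal $V^*$) expands this to: $V^*(s) = \max_{a \in Actions(s)} \sum_{s'} T(s'|s, a)\,[Reward(s, a, s') + \gamma V^*(s')]$.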

13

What are the two conditions under which the value iteration algorithm is guaranteed to converge?

Value iteration is guaranteed to converge if the discount factor $\gamma < 1$ or if the MDP graph is acyclic.
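
A sketch of the algorithm itself, combining the Bellman update from card 12 with the same $\epsilon$-based convergence test used in policy evaluation (encoding as in the earlier sketches):

```python
def value_iteration(states, actions, T, gamma, eps=1e-6):
    """Value iteration: apply the Bellman optimality update to every
    state until no value changes by more than eps."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            if not actions(s):          # IsEnd(s): terminal states stay at 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)])
                for a in actions(s)
            )
        if max(abs(V_new[s] - V[s]) for s in states) <= eps:
            return V_new
        V = V_new
```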

14

In the 'Dice Game' example provided, what was the expected utility of the 'quit' policy versus the 'stay' policy?

The expected utility for 'quit' was 10, and the expected utility for 'stay' was 12.
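
These numbers are consistent with the usual setup of this example (assuming 'quit' ends the game with a payout of 10, while 'stay' pays 4 per round and continues with probability $2/3$, with $\gamma = 1$; those parameters are an assumption, not stated in the notes). Under that setup the value $V$ of 'stay' satisfies $V = \frac{1}{3}(4) + \frac{2}{3}(4 + V) = 4 + \frac{2}{3}V$, giving $V = 12$, versus $10$ for quitting immediately.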

15

How is the optimal policy $\pi^*(s)$ derived from the optimal Q-values?

$\pi^*(s) = \arg\max_{a \in Actions(s)} Q^*(s, a)$
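
A sketch of this extraction step under the same encoding, assuming value iteration has already produced the optimal values V:

```python
def extract_policy(states, actions, T, V, gamma):
    """Read off pi*(s): in each non-terminal state, pick the action with
    the highest Q-value computed from the optimal values V."""
    return {
        s: max(
            actions(s),
            key=lambda a: sum(p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)]),
        )
        for s in states
        if actions(s)
    }
```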