Flashcards covering the fundamentals of Markov Decision Processes (MDPs), including definitions, policy evaluation, Bellman equations, and value iteration algorithms based on the Chapter 17 lecture notes.
What is the primary purpose of a Markov Decision Process (MDP)?
MDPs are used to model sequential decision-making processes in environments where the outcome of an action is uncertain and influenced by both the current state and the chosen action.
What are the six core components of a Markov Decision Process definition?
The set of states States (including the start state), the actions Actions(s) available from each state, the transition probabilities T(s′∣s,a), the reward function Reward(s,a,s′), a test IsEnd(s) for terminal states, and the discount factor 0≤γ≤1.
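A minimal Python sketch of how these components might be packaged together, using a dice-game-style MDP as a running example; the class name, method names, and the specific numbers (stay pays 4 and continues with probability 2/3, quit pays 10) are illustrative assumptions, not quoted from the notes.

```python
# Hypothetical sketch of an MDP's components, using a dice-game-style example.
# The class/method names and the specific numbers (reward 4, continue prob 2/3,
# quit reward 10) are illustrative assumptions, not taken from the notes.
class DiceGameMDP:
    def start_state(self):
        return "in"                         # s_start

    def is_end(self, state):
        return state == "end"               # IsEnd(s)

    def actions(self, state):
        return ["stay", "quit"]             # Actions(s)

    def succ_prob_reward(self, state, action):
        # List of (s', T(s'|s,a), Reward(s,a,s')) triples for a given (s, a).
        if action == "quit":
            return [("end", 1.0, 10)]
        return [("in", 2 / 3, 4), ("end", 1 / 3, 4)]

    def discount(self):
        return 1.0                          # gamma
```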
What is the discount factor γ and how does its value affect an agent's preference?
The discount factor 0≤γ≤1 specifies how much an agent values future rewards compared to current ones. A γ close to 0 makes the agent favor immediate rewards, while a γ of 1 treats future rewards as equal to immediate ones (additive rewards).
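For instance, a path that earns rewards r1, r2, r3, … on successive steps has discounted utility u = r1 + γr2 + γ²r3 + ⋯, so with γ = 0.5 a reward two steps away is worth only a quarter of the same reward received immediately.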
How does the environment in a search problem differ from the environment in an MDP?
Search problems typically assume a deterministic environment where the outcome of an action is known (using a successor function Succ(s,a)), whereas MDPs involve stochastic/uncertain environments modeled with transition probabilities T(s′∣s,a).
What is a policy π in the context of an MDP?
A policy π is a mapping that assigns an action a∈Actions(s) to every state s∈States.
Why is it necessary to define a policy for every state in an MDP rather than just a path?
Because of the randomness in transitions (e.g., dice rolls), an agent cannot predict exactly which state it will end up in; therefore, it needs a pre-defined action for every possible state it might encounter.
What is the Markov property?
The principle that the future depends only on the present state and not on the past sequence of events, meaning the current state contains all information necessary to make an optimal decision.
How is 'Value' (Vπ(s)) defined for a policy?
Value is the expected utility (the discounted sum of rewards) that an agent receives by starting in state s and following policy π.
What is a Q-value (Qπ(s,a)) in policy evaluation?
The expected utility of taking a specific action a from state s and then following policy π for all subsequent steps.
What is the recurrence relation for the Q-value Qπ(s,a) used in policy evaluation?
Qπ(s,a) = ∑_{s′} T(s′∣s,a)[Reward(s,a,s′) + γVπ(s′)]
How does the iterative policy evaluation algorithm determine convergence?
It continues until the maximum change between state values in consecutive iterations is at most an error tolerance ϵ, i.e., max_{s∈States} ∣Vπ^(t)(s) − Vπ^(t−1)(s)∣ ≤ ϵ.
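A sketch of iterative policy evaluation under these definitions, assuming the hypothetical DiceGameMDP interface above and a policy stored as a plain dict from states to actions; the inner Q function is the recurrence from the previous card and the stopping test is the ϵ check described here.

```python
def policy_evaluation(mdp, policy, epsilon=1e-6):
    # Hypothetical sketch: `mdp` follows the DiceGameMDP interface above and
    # `policy` is a dict mapping each non-end state to the action pi(s).
    V = {s: 0.0 for s in policy}  # current estimate of V_pi(s)

    def Q(state, action):
        # Q_pi(s,a) = sum over s' of T(s'|s,a) * [Reward(s,a,s') + gamma * V_pi(s')]
        return sum(prob * (reward + mdp.discount() * V.get(succ, 0.0))
                   for succ, prob, reward in mdp.succ_prob_reward(state, action))

    while True:
        new_V = {s: Q(s, policy[s]) for s in policy}
        # Converged once max_s |V^(t)(s) - V^(t-1)(s)| <= epsilon.
        if max(abs(new_V[s] - V[s]) for s in policy) <= epsilon:
            return new_V
        V = new_V

# Example: the 'stay' policy on the dice game above evaluates to roughly 12.
# policy_evaluation(DiceGameMDP(), {"in": "stay"})  # -> {'in': ~12.0}
```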
What is the Bellman equation for the optimal value V∗(s) of a non-terminal state?
V∗(s) = max_{a∈Actions(s)} Q∗(s,a), where Q∗(s,a) is the expected utility of taking action a and acting optimally thereafter.
What are the two conditions under which the value iteration algorithm is guaranteed to converge?
Value iteration converges if the discount factor γ < 1 or if the MDP graph is acyclic.
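A matching sketch of value iteration with the same assumed interface; each sweep applies the Bellman optimality update V∗(s) = max over a of Q∗(s,a) from the earlier card, and the loop stops with the same ϵ-based test used for policy evaluation.

```python
def value_iteration(mdp, states, epsilon=1e-6):
    # Hypothetical sketch: `states` lists the non-end states of an `mdp` that
    # follows the DiceGameMDP interface above.
    V = {s: 0.0 for s in states}  # current estimate of V*(s)

    def Q(state, action):
        # Q*(s,a) = sum over s' of T(s'|s,a) * [Reward(s,a,s') + gamma * V*(s')]
        return sum(prob * (reward + mdp.discount() * V.get(succ, 0.0))
                   for succ, prob, reward in mdp.succ_prob_reward(state, action))

    while True:
        # Bellman update: V*(s) = max over a in Actions(s) of Q*(s,a).
        new_V = {s: max(Q(s, a) for a in mdp.actions(s)) for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) <= epsilon:
            return new_V
        V = new_V

# Example: value_iteration(DiceGameMDP(), ["in"]) converges to roughly {'in': 12}.
```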
In the 'Dice Game' example provided, what was the expected utility of the 'quit' policy versus the 'stay' policy?
The expected utility for 'quit' was 10, and the expected utility for 'stay' was 12.
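As a sanity check, in the usual version of this example (assumed here, since the card does not restate the setup) 'stay' pays 4 per round and the game continues with probability 2/3, so the stay value satisfies V = 4 + (2/3)V, i.e. V = 12, while quitting pays 10 once.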
How is the optimal policy π∗(s) derived from the optimal Q-values?
π∗(s) = arg max_{a∈Actions(s)} Q∗(s,a)
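Once V∗ has been computed (for example by the value_iteration sketch above), the optimal policy can be read off with an argmax over Q-values; the names here are the same hypothetical ones used in the earlier sketches.

```python
def extract_optimal_policy(mdp, V_star, states):
    # pi*(s) = argmax over a in Actions(s) of Q*(s,a), computed from V*.
    def Q(state, action):
        return sum(prob * (reward + mdp.discount() * V_star.get(succ, 0.0))
                   for succ, prob, reward in mdp.succ_prob_reward(state, action))

    return {s: max(mdp.actions(s), key=lambda a: Q(s, a)) for s in states}

# Example: on the dice game above the optimal action in state 'in' is 'stay'.
# extract_optimal_policy(DiceGameMDP(), value_iteration(DiceGameMDP(), ["in"]), ["in"])
```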