Bandit problems are simplified MDPs; Markov reward processes (MRPs) are simplified MDPs too. Bandit problems have simplified _____________ and MRPs have simplified _____________.
State spaces | action spaces
An agent’s return is a sum of rewards over time. How do we manage tasks where T = ∞?
By discounting future returns
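A worked form of the standard definition (not specific to this question set): with rewards bounded by R_max and γ < 1, the infinite discounted sum stays finite.

```latex
G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad
|G_t| \;\le\; \sum_{k=0}^{\infty} \gamma^{k} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
\quad \text{for } 0 \le \gamma < 1 .
```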
An absorbing state in an MDP does all the following except:
Converts a continuing task into an episodic task
A policy is a mapping from _____________ to _____________.
States | actions
With an optimal value function or action-value function, one can extract an optimal policy by greedily selecting actions.
True
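A minimal sketch of that greedy extraction from a tabular action-value function; the dictionary `Q`, its states, and its actions are invented for illustration.

```python
# Greedy policy extraction from a tabular action-value function.
# Q is a hypothetical dict: Q[state][action] -> estimated return.
Q = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.1},
}

def greedy_policy(Q):
    """Return a deterministic policy picking argmax_a Q(s, a) in each state."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

print(greedy_policy(Q))  # {'s0': 'right', 's1': 'left'}
```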
In an MDP, the transition dynamics are typically written as P(s' | s, a).
True
Which items are core components of an MDP? (Select ALL that apply)
A. States (S)
B. Actions (A)
C. Transition probabilities P(s'|s,a)
D. Reward function R(s,a,s') or R(s,a)
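One way to see the components together is a tiny tabular MDP in code; the two-state example below is made up purely for illustration.

```python
from collections import namedtuple

# An MDP bundles states S, actions A, transition probabilities P(s'|s, a),
# a reward function R, and a discount factor gamma.
MDP = namedtuple("MDP", ["states", "actions", "P", "R", "gamma"])

example = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    # P[s][a] is a probability distribution over next states s'.
    P={
        "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
        "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.3, "s1": 0.7}},
    },
    # R[s][a] is the expected immediate reward for taking a in s.
    R={
        "s0": {"stay": 0.0, "go": 1.0},
        "s1": {"stay": 0.5, "go": 0.0},
    },
    gamma=0.9,
)
```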
Which statements about discounting in returns are correct? (Select ALL that apply)
A. Discounting can make infinite-horizon returns finite when γ < 1
B. Smaller γ makes the agent more short-sighted
D. Discounting is one standard way to formalize continuing tasks
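A quick numerical illustration of the short-sightedness point: with a single reward of 1 arriving only at step 10 (arbitrary numbers), a small γ nearly ignores it while a large γ keeps most of its value.

```python
# A reward of 1 arrives 10 steps in the future; all other rewards are 0.
rewards = [0.0] * 10 + [1.0]

def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.1))   # ~1e-10: almost invisible
print(discounted_return(rewards, gamma=0.99))  # ~0.904: still counts
```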
In reinforcement learning, the agent’s goal is to:
Maximize expected return (discounted sum of future rewards)
Which statement best describes the exploration vs. exploitation tradeoff?
Exploration means trying actions to learn more; exploitation means using what you know to get reward
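ε-greedy action selection is the textbook way to balance the two; the sketch below assumes a tabular value estimate and a hypothetical action set.

```python
import random

def epsilon_greedy(Q_s, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the current estimate.

    Q_s maps each action to its value estimate for the current state.
    """
    if random.random() < epsilon:
        return random.choice(list(Q_s))   # explore: any action uniformly
    return max(Q_s, key=Q_s.get)          # exploit: current best action

action = epsilon_greedy({"left": 0.2, "right": 0.8})
```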
A policy in reinforcement learning is:
A mapping from states (or observations) to actions
What is the difference between a value function V(s) and an action-value function Q(s,a)?
V(s) is expected return from a state under a policy; Q(s,a) is expected return from a state taking an action (then following a policy)
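The two are linked by the standard identities for a fixed policy π: V averages Q over the policy's action choice, and Q is one step of reward plus the discounted V of the next state.

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a),
\qquad
Q^{\pi}(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s,a,s') + \gamma\, V^{\pi}(s') \bigr].
```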
Which statement best describes temporal-difference (TD) learning?
It updates value estimates using a bootstrapped target based on the next state’s current estimate
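A minimal TD(0) update for state values, assuming a tabular V and one observed transition (s, r, s'); the names are illustrative.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    target = r + gamma * V[s_next]    # uses the *current estimate* of the next state
    V[s] += alpha * (target - V[s])   # TD error scaled by the step size
    return V

V = {"s0": 0.0, "s1": 0.0}
V = td0_update(V, s="s0", r=1.0, s_next="s1")
```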
Q-learning is an off-policy method that can learn the optimal policy even while following a different exploratory policy.
True
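The off-policy nature shows up in the update itself: the target maximizes over next actions regardless of what the exploratory behavior policy actually does. A tabular sketch with made-up states and actions:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: the target bootstraps from max_a' Q(s', a'),
    not from the action the behavior policy will take next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 0.0, "right": 0.0}}
Q = q_learning_update(Q, s="s0", a="right", r=1.0, s_next="s1")
```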
In general, Monte Carlo methods require episodes to terminate in order to compute returns.
True
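The reason termination matters is visible in the computation: the return for each step is accumulated backward from the end of a finished episode. A sketch with an invented reward sequence:

```python
def episode_returns(rewards, gamma=0.99):
    """Compute G_t for every step of a *completed* episode by sweeping backward."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(episode_returns([0.0, 0.0, 1.0]))  # [0.9801, 0.99, 1.0]
```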
Which are common challenges in reinforcement learning? (Select ALL that apply)
A. Delayed rewards
B. Exploration vs. exploitation
C. Non-i.i.d. data
E. Stochastic transitions or rewards
A dominant strategy is a strategy that:
Gives a higher payoff than any other strategy, no matter what the opponent does
We can always find a pure Nash equilibrium in any finite game.
False
Every finite game has at least one Nash equilibrium (possibly mixed).
True
Which statements about mixed strategies are correct? (Select ALL that apply)
They are probability distributions over pure strategies
They help avoid being predictable and exploitable by an opponent
They are often necessary to achieve equilibrium in games with no pure Nash equilibrium
They can be used to maximize worst-case payoff in zero-sum settings
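Matching Pennies is the usual example: it has no pure Nash equilibrium, and mixing 50/50 maximizes the row player's worst-case payoff. A small enumeration under those assumptions:

```python
import numpy as np

# Row player's payoffs in Matching Pennies (zero-sum: the column player gets the negation).
A = np.array([[ 1, -1],
              [-1,  1]])

def worst_case(p):
    """Row player's guaranteed payoff when mixing over rows with probabilities p:
    the column player then picks the column that hurts the most."""
    return min(p @ A)

for p in ([1.0, 0.0], [0.75, 0.25], [0.5, 0.5]):
    print(p, worst_case(np.array(p)))
# Pure strategies guarantee -1; the 50/50 mix guarantees 0, the value of the game.
```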
What does it mean for an outcome to be Pareto optimal?
You cannot make any player better off without making at least one other player worse off
Which statements about Nash equilibrium are correct? (Select ALL that apply)
At a Nash equilibrium, no player can improve their payoff by unilaterally changing strategy
A Nash equilibrium can be in mixed strategies (randomized action choices)
A game can have multiple Nash equilibria
Nash equilibrium is a stability concept, not necessarily the best overall outcome
In the Prisoner’s Dilemma, the classic lesson is that:
The Nash equilibrium can be worse for both players than mutual cooperation
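The lesson is easy to verify from the payoff table: mutual defection is the only outcome where neither player gains by deviating alone, yet both players would prefer mutual cooperation. A check using one common (illustrative) choice of payoffs:

```python
# Payoffs (row player, column player) for actions C = cooperate, D = defect.
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
actions = ["C", "D"]

def is_nash(a_row, a_col):
    """No player can improve by changing only their own action."""
    row_ok = all(payoffs[(a_row, a_col)][0] >= payoffs[(alt, a_col)][0] for alt in actions)
    col_ok = all(payoffs[(a_row, a_col)][1] >= payoffs[(a_row, alt)][1] for alt in actions)
    return row_ok and col_ok

print([cell for cell in payoffs if is_nash(*cell)])   # [('D', 'D')]
print(payoffs[("D", "D")], "<", payoffs[("C", "C")])  # (1, 1) is worse for both than (3, 3)
```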
In a 2-player zero-sum game, the minimax value represents:
The guaranteed payoff a player can secure assuming the opponent responds optimally
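For pure strategies in a finite zero-sum game, that security value can be found by direct enumeration; the 2×3 payoff matrix below is arbitrary.

```python
import numpy as np

# Row player's payoffs in an arbitrary zero-sum game (the column player receives the negation).
A = np.array([[ 3, -1,  2],
              [ 1,  0, -2]])

# Maximin for the row player: best row, assuming the column player then responds optimally.
row_security = A.min(axis=1).max()   # max over rows of the worst column response
# Minimax for the column player: best column, assuming the row player then responds optimally.
col_security = A.max(axis=0).min()   # min over columns of the best row response

print(row_security, col_security)    # -1 and 0 here; the pure values need not coincide,
                                     # but with mixed strategies they do (minimax theorem)
```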