Bandit problems are simplified MDPs; Markov reward processes (MRPs) are simplified MDPs too. Bandit problems have simplified _____________ and MRPs have simplified _____________.
State spaces | action spaces
An agent’s return is a sum of rewards over time. How do we manage tasks where T = ∞?
By discounting future returns
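A worked form of the standard definition (not specific to this question set): with rewards bounded by R_max and γ < 1, the infinite discounted sum stays finite.

```latex
G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad
|G_t| \;\le\; \sum_{k=0}^{\infty} \gamma^{k} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
\quad \text{for } 0 \le \gamma < 1 .
```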
An absorbing state in an MDP does all the following except:
Converts a continuing task into an episodic task
A policy is a mapping from _____________ to _____________.
States | actions
With an optimal value function or action-value function, one can extract an optimal policy by greedily selecting actions.
True
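A minimal sketch of that greedy extraction from a tabular action-value function; the dictionary `Q`, its states, and its actions are invented for illustration.

```python
# Greedy policy extraction from a tabular action-value function.
# Q is a hypothetical dict: Q[state][action] -> estimated return.
Q = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.1},
}

def greedy_policy(Q):
    """Return a deterministic policy picking argmax_a Q(s, a) in each state."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

print(greedy_policy(Q))  # {'s0': 'right', 's1': 'left'}
```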
In an MDP, the transition dynamics are typically written as P(s' | s, a).
True
Which items are core components of an MDP? (Select ALL that apply)
A. States (S)
B. Actions (A)
C. Transition probabilities P(s'|s,a)
D. Reward function R(s,a,s') or R(s,a)
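One way to see the components together is a tiny tabular MDP in code; the two-state example below is made up purely for illustration.

```python
from collections import namedtuple

# An MDP bundles states S, actions A, transition probabilities P(s'|s, a),
# a reward function R, and a discount factor gamma.
MDP = namedtuple("MDP", ["states", "actions", "P", "R", "gamma"])

example = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    # P[s][a] is a probability distribution over next states s'.
    P={
        "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
        "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.3, "s1": 0.7}},
    },
    # R[s][a] is the expected immediate reward for taking a in s.
    R={
        "s0": {"stay": 0.0, "go": 1.0},
        "s1": {"stay": 0.5, "go": 0.0},
    },
    gamma=0.9,
)
```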
Which statements about discounting in returns are correct? (Select ALL that apply)
A. Discounting can make infinite-horizon returns finite when γ < 1
B. Smaller γ makes the agent more short-sighted
D. Discounting is one standard way to formalize continuing tasks
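A quick numerical illustration of the short-sightedness point: with a single reward of 1 arriving only at step 10 (arbitrary numbers), a small γ nearly ignores it while a large γ keeps most of its value.

```python
# A reward of 1 arrives 10 steps in the future; all other rewards are 0.
rewards = [0.0] * 10 + [1.0]

def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.1))   # ~1e-10: almost invisible
print(discounted_return(rewards, gamma=0.99))  # ~0.904: still counts
```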
In reinforcement learning, the agent’s goal is to:
Maximize expected return (discounted sum of future rewards)
Which statement best describes the exploration vs. exploitation tradeoff?
Exploration means trying actions to learn more; exploitation means using what you know to get reward
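ε-greedy action selection is the textbook way to balance the two; the sketch below assumes a tabular value estimate and a hypothetical action set.

```python
import random

def epsilon_greedy(Q_s, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the current estimate.

    Q_s maps each action to its value estimate for the current state.
    """
    if random.random() < epsilon:
        return random.choice(list(Q_s))   # explore: any action uniformly
    return max(Q_s, key=Q_s.get)          # exploit: current best action

action = epsilon_greedy({"left": 0.2, "right": 0.8})
```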
A policy in reinforcement learning is:
A mapping from states (or observations) to actions
What is the difference between a value function V(s) and an action-value function Q(s,a)?
V(s) is expected return from a state under a policy; Q(s,a) is expected return from a state taking an action (then following a policy)
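The two are linked by the standard identities for a fixed policy π: V averages Q over the policy's action choice, and Q is one step of reward plus the discounted V of the next state.

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a),
\qquad
Q^{\pi}(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s,a,s') + \gamma\, V^{\pi}(s') \bigr].
```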
Which statement best describes temporal-difference (TD) learning?
It updates value estimates using a bootstrapped target based on the next state’s current estimate
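A minimal TD(0) update for state values, assuming a tabular V and one observed transition (s, r, s'); the names are illustrative.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    target = r + gamma * V[s_next]    # uses the *current estimate* of the next state
    V[s] += alpha * (target - V[s])   # TD error scaled by the step size
    return V

V = {"s0": 0.0, "s1": 0.0}
V = td0_update(V, s="s0", r=1.0, s_next="s1")
```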
Q-learning is an off-policy method that can learn the optimal policy even while following a different exploratory policy.
True
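The off-policy nature shows up in the update itself: the target maximizes over next actions regardless of what the exploratory behavior policy actually does. A tabular sketch with made-up states and actions:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: the target bootstraps from max_a' Q(s', a'),
    not from the action the behavior policy will take next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 0.0, "right": 0.0}}
Q = q_learning_update(Q, s="s0", a="right", r=1.0, s_next="s1")
```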
In general, Monte Carlo methods require episodes to terminate in order to compute returns.
True
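The reason termination matters is visible in the computation: the return for each step is accumulated backward from the end of a finished episode. A sketch with an invented reward sequence:

```python
def episode_returns(rewards, gamma=0.99):
    """Compute G_t for every step of a *completed* episode by sweeping backward."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(episode_returns([0.0, 0.0, 1.0]))  # [0.9801, 0.99, 1.0]
```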
Which are common challenges in reinforcement learning? (Select ALL that apply)
A. Delayed rewards
B. Exploration vs. exploitation
C. Non-i.i.d. data
E. Stochastic transitions or rewards
A dominant strategy is a strategy that:
Gives a higher payoff than any other strategy, no matter what the opponent does
We can always find a pure Nash equilibrium in any finite game.
False
Every finite game has at least one Nash equilibrium (possibly mixed).
True
Which statements about mixed strategies are correct? (Select ALL that apply)
They are probability distributions over pure strategies
They help avoid being predictable and exploitable by an opponent
They are often necessary to achieve equilibrium in games with no pure Nash equilibrium
They can be used to maximize worst-case payoff in zero-sum settings
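Matching Pennies is the usual example: it has no pure Nash equilibrium, and mixing 50/50 maximizes the row player's worst-case payoff. A small enumeration under those assumptions:

```python
import numpy as np

# Row player's payoffs in Matching Pennies (zero-sum: the column player gets the negation).
A = np.array([[ 1, -1],
              [-1,  1]])

def worst_case(p):
    """Row player's guaranteed payoff when mixing over rows with probabilities p:
    the column player then picks the column that hurts the most."""
    return min(p @ A)

for p in ([1.0, 0.0], [0.75, 0.25], [0.5, 0.5]):
    print(p, worst_case(np.array(p)))
# Pure strategies guarantee -1; the 50/50 mix guarantees 0, the value of the game.
```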
What does it mean for an outcome to be Pareto optimal?
You cannot make any player better off without making at least one other player worse off
Which statements about Nash equilibrium are correct? (Select ALL that apply)
At a Nash equilibrium, no player can improve their payoff by unilaterally changing strategy
A Nash equilibrium can be in mixed strategies (randomized action choices)
A game can have multiple Nash equilibria
Nash equilibrium is a stability concept, not necessarily the best overall outcome
In the Prisoner’s Dilemma, the classic lesson is that:
The Nash equilibrium can be worse for both players than mutual cooperation
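The lesson is easy to verify from the payoff table: mutual defection is the only outcome where neither player gains by deviating alone, yet both players would prefer mutual cooperation. A check using one common (illustrative) choice of payoffs:

```python
# Payoffs (row player, column player) for actions C = cooperate, D = defect.
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
actions = ["C", "D"]

def is_nash(a_row, a_col):
    """No player can improve by changing only their own action."""
    row_ok = all(payoffs[(a_row, a_col)][0] >= payoffs[(alt, a_col)][0] for alt in actions)
    col_ok = all(payoffs[(a_row, a_col)][1] >= payoffs[(a_row, alt)][1] for alt in actions)
    return row_ok and col_ok

print([cell for cell in payoffs if is_nash(*cell)])   # [('D', 'D')]
print(payoffs[("D", "D")], "<", payoffs[("C", "C")])  # (1, 1) is worse for both than (3, 3)
```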
In a 2-player zero-sum game, the minimax value represents:
The guaranteed payoff a player can secure assuming the opponent responds optimally
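For pure strategies in a finite zero-sum game, that security value can be found by direct enumeration; the 2×3 payoff matrix below is arbitrary.

```python
import numpy as np

# Row player's payoffs in an arbitrary zero-sum game (the column player receives the negation).
A = np.array([[ 3, -1,  2],
              [ 1,  0, -2]])

# Maximin for the row player: best row, assuming the column player then responds optimally.
row_security = A.min(axis=1).max()   # max over rows of the worst column response
# Minimax for the column player: best column, assuming the row player then responds optimally.
col_security = A.max(axis=0).min()   # min over columns of the best row response

print(row_security, col_security)    # -1 and 0 here; the pure values need not coincide,
                                     # but with mixed strategies they do (minimax theorem)
```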