Reinforcement Learning MCQ

Description and Tags

If there are any issues, text me on Discord @KwisJino

57 Terms

1

In Reinforcement Learning, the agent’s goal is to maximize the cumulative reward, not just the immediate reward.

True

2

The Markov property means that the next state and reward depend only on the current state and action, not on the entire history of past states.

True
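
Written out in standard notation (assumed here, not quoted from the card), the Markov property says that conditioning on the full history adds nothing beyond the current state and action:

P(S_{t+1}, R_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0) = P(S_{t+1}, R_{t+1} \mid S_t, A_t)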

3

Which of the following illustrates the exploration vs. exploitation dilemma?

Choosing between visiting a favorite restaurant or trying a new one

4

One major limitation of RL is that

It often needs large amounts of interaction data.

5

A discount factor γ close to 1 means the agent:

Strongly considers long-term rewards
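
A minimal sketch (the reward sequence is a made-up example, not from the cards) showing how γ near 1 preserves the weight of a late reward while γ near 0 almost ignores it:

# Discounted return G = sum over t of gamma^t * r_t for a toy reward sequence.
rewards = [0, 0, 0, 10]  # hypothetical stream: one large reward arriving late

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.1))   # ~0.01: the late reward barely matters
print(discounted_return(rewards, gamma=0.99))  # ~9.70: long-term reward still counts strongly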

6

Which of the following is a strength of Reinforcement Learning?

Adapts to non-stationary environments: RL can adjust its behavior through its feedback loop, though with caveats, so this is a plausible strength.

7

Which of the following best describes Reinforcement Learning?

Learning by interacting with an environment to maximize long-term rewards

8

Which of the following terms is correctly matched with its definition?

Value Function – Estimates how good a state (or state–action pair) is in terms of expected return

9

Which of the following is NOT part of the agent–environment interaction loop?

The environment updates the policy

10

Which statement about the Markov Property is incorrect?

It is not true that knowledge of all past states is required; the Markov property means the current state alone is sufficient.

11

Which statement about the discount factor (γ) is incorrect?

The claim that γ always lies strictly between 0 and 1 and therefore always shrinks future rewards is too strong; γ can equal 1 in episodic tasks, in which case future rewards are not discounted at all.

12

Which statement about the Bellman Expectation Equation is incorrect?

Q(s, a) always considers only the immediate reward R(s).

13

What is the core meaning of the Bellman Optimality Equation?

The optimal policy π* maximizes the expected return.
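
In standard notation (assumed here, not quoted from the card), the Bellman Optimality Equation for the state-value function is

V^*(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \Big]

and the optimal policy acts greedily with respect to it, \pi^*(s) = \arg\max_a Q^*(s, a), which is what "maximizes the expected return" means concretely.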

14

Which of the following statements correctly describes the goal of planning in reinforcement learning?

To use a known MDP model to find an optimal policy

15

Which of the following statements about Iterative Policy Evaluation (IPE) is correct?

It starts with arbitrary values and repeatedly applies the Bellman equation.
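
A minimal Iterative Policy Evaluation sketch; the tabular layout (P, R, policy as nested lists) is an assumption for illustration, not the course's code:

import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs, R[s][a] is the expected reward,
    and policy[s][a] is the probability of taking action a in state s."""
    V = np.zeros(len(P))                        # start from arbitrary (here zero) values
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = 0.0
            for a, pi_sa in enumerate(policy[s]):
                backup = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                v_new += pi_sa * backup         # Bellman expectation backup under the policy
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                       # repeated sweeps until values stop changing
            return V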

16

Which of the following statements correctly describes Policy Iteration or Value Iteration?

Policy Iteration alternates between policy evaluation and policy improvement.

17

Which of the following statements about Early Stopping in Policy Iteration is correct?

It can stop based on a maximum number of iterations or a convergence threshold.

18

Which of the following statements about the Action Value Function Q(s, a) is incorrect?

It includes both immediate reward and the next state’s Q-value.

19

Which of the following statements is incorrect regarding agent, environment, state, reward, and action in reinforcement learning?

The agent always selects the action that maximizes the immediate reward in every step

20

Which of the following statements is incorrect regarding Markov Process (MP), Markov Reward Process (MRP), and Markov Decision Process (MDP)?

An MRP is an MP extended with rewards but without discounting.

21

Which best describes the purpose of policy evaluation in reinforcement learning?

Estimate the value of states under a given policy so that the policy’s quality can be measured.

22

In an unknown MDP, which of the following is a correct description of model-free reinforcement learning?

The agent learns purely from interactions without prior knowledge of transitions or rewards.

23

Which of the following statements about value iteration is correct?

Value iteration updates each state value by applying the Bellman Optimality Equation and immediately taking the maximum over possible actions.
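
A matching value-iteration sketch (same hypothetical tabular layout as the policy-evaluation sketch above); note the max over actions taken directly inside the backup:

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bellman Optimality backup: take the maximum over actions immediately.
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V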

24

Which of the following best explains why Monte Carlo returns are computed backward from the end of an episode?

Because returns are recursively defined starting from the terminal state.
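
A minimal sketch of the backward pass over one finished episode's rewards (a hypothetical list), using the recursion G_t = r + γ·G_{t+1} with G = 0 beyond the terminal state:

def returns_from_episode(rewards, gamma=0.99):
    G = 0.0                      # the return after the terminal state is zero
    returns = []
    for r in reversed(rewards):  # walk backward from the end of the episode
        G = r + gamma * G        # G_t = r + gamma * G_{t+1}
        returns.append(G)
    returns.reverse()            # returns[t] is now the Monte Carlo return G_t
    return returns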

25

Which of the following is a disadvantage of Temporal Difference (TD) methods compared to Monte Carlo (MC) methods?

TD methods are biased because they update using estimates of future values rather than full returns.

26

Which of the following statements correctly distinguishes Temporal Difference (TD) from Monte Carlo (MC) methods?

TD can update after each step, while MC must wait until the episode ends

27

During policy improvement, what is the typical rule used to update the policy?

Select the action that maximizes the expected value according to the most recent estimates of the state-value function.

28

Which of the following is correct about n-step Temporal Difference (TD) methods?

None of the above

29

Which of the following statements about MC vs TD is correct?

MC only works in episodic tasks, while TD works in both episodic and continuing tasks.

30

What is the best reason Q-Learning is classified as off-policy?

It uses the target max_{a′} Q(s′, a′) regardless of which action a′ was actually taken in the next state.

31

Which pairing of method and policy type is correct?

SARSA is on-policy and Q-Learning is off-policy.
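
A hedged side-by-side sketch of the two tabular updates; Q is assumed to be a per-state list of action values, and alpha and gamma are hyperparameters:

# On-policy SARSA: the target uses the action a_next that the behavior policy actually takes next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

# Off-policy Q-Learning: the target maximizes over a', regardless of the action actually taken.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])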

32

Which explanation is correct regarding target and behavior policy in off-policy learning?

Off-policy learning allows the agent to learn the optimal target policy while following a different behavior policy to gather more diverse experiences.

33

Which of the following is NOT correct regarding exploration and exploitation?

Exploitation always guarantees the highest reward.
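
A minimal ε-greedy sketch (hypothetical Q-table) of how the dilemma is usually handled: exploiting the current estimates gives no guarantee of the best reward, so the agent explores with small probability ε:

import random

def epsilon_greedy(Q_s, epsilon=0.1):
    """Q_s is the list of estimated action values for the current state."""
    if random.random() < epsilon:                      # explore: pick a random action
        return random.randrange(len(Q_s))
    return max(range(len(Q_s)), key=lambda a: Q_s[a])  # exploit: current best estimate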

34

Which of the following is a correct explanation of the TD (Temporal Difference) target in reinforcement learning?

The TD target includes both the immediate reward and the estimated future reward, weighted by the discount factor.
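
In sketch form (names assumed for illustration), the TD(0) target and update for a state-value table:

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    # TD target = immediate reward + discounted estimate of the next state's value.
    td_target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (td_target - V[s])   # move V(s) toward the target by the TD error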

35

Which of the following is correct regarding on-policy and off-policy learning?

In on-policy learning, the agent learns from the policy it is currently following, while off-policy learning can learn from a different behavior policy

36

In a policy-based agent, the decision-making process is:

Stochastic

37

What does a value-based agent primarily aim to maximize?

Expected Q(s, a) value

38

In a continuous action space, policy-based methods are preferred because:

They remove the need to find arg max_a Q(s, a)

39

The objective function J(θ) in policy gradient methods represents:

The expected return under policy πθ

40

The REINFORCE algorithm updates parameters using:

Monte Carlo returns
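
For reference (standard notation, not quoted from the card), REINFORCE's Monte Carlo policy-gradient estimate over one episode is

\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

where G_t is the full return from step t, which is only available once the episode has ended.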

41

In Actor-Critic (AC) methods, the actor and critic are responsible for:

Actor = policy improvement, Critic = value estimation

42

In Advantage Actor-Critic (A2C/A3C), the advantage term A(s, a) is defined as:

A(s, a) = Q(s, a) − V(s)

43

Compared to Advantage AC, TD Actor-Critic uses:

Two networks (policy and value)

44

The policy gradient update in TD Actor-Critic uses:

TD error times the gradient of the log probability

45

In the Actor-Critic training code, the critic loss is computed as:

Smooth L1 loss between V(s) and the TD target
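
The course's training code is not reproduced here; as a minimal PyTorch-style sketch of what such a critic loss could look like (value_net and the tensor arguments are assumptions):

import torch
import torch.nn.functional as F

def critic_loss(value_net, s, r, s_next, done, gamma=0.99):
    """Hypothetical helper: value_net maps a state tensor to a scalar estimate V(s)."""
    with torch.no_grad():                                   # the TD target is treated as a constant
        td_target = r + gamma * value_net(s_next) * (1.0 - done)
    return F.smooth_l1_loss(value_net(s), td_target)        # Smooth L1 between V(s) and the target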

46

Which of the following equations correctly represents the policy update in TD Actor-Critic?

θ ← θ + α ∇θ log πθ(a|s) (r + γVϕ(s′) − Vϕ(s))

47

Which of the following statements about DQN (Deep Q-Network) is incorrect?

DQN is a policy-based method that directly learns a stochastic policy.

48

Which of the following statements about the REINFORCE algorithm is incorrect?

REINFORCE uses Temporal Difference (TD) error to update every step.

49

Which of the following correctly describes the purpose of PPO’s clipped objective?

To prevent excessively large policy updates
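
A minimal PyTorch-style sketch of the clipped surrogate loss (function and argument names are assumptions); taking the minimum with the clipped term is what keeps a single update from moving the policy too far:

import torch

def ppo_clip_loss(new_logp, old_logp, advantage, eps=0.2):
    ratio = torch.exp(new_logp - old_logp)               # r_t = pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clip r_t into [1 - eps, 1 + eps]
    # Negated for gradient descent; the min removes the incentive to push the ratio
    # far outside the clipping interval.
    return -torch.min(ratio * advantage, clipped * advantage).mean()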

50

When A_t > 0, what does it mean?

The action was better than expected

51

If r_t = π_new(a|s) / π_old(a|s) > 1, what does it indicate?

The new policy increases the action probability.

52

Which of the following is a correct statement about PPO?

A. PPO is model-free
B. PPO can be applied to continuous action spaces
C. PPO avoids second-order optimization by using a clipped surrogate objective
D. PPO often uses GAE to reduce the variance of the advantage

53

Which statement about PPO is incorrect?

A. PPO is a value-based algorithm only
B. PPO is a pure model-based method
C. PPO completely ignores the critic network
D. PPO does not use policy gradient updates

54

Which of the following statements about GAE is correct?

GAE computes a weighted sum of multi-step TD residuals using λ ∈ [0,1].
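
A minimal sketch of the backward GAE recursion (hypothetical reward/value lists, single episode, no done-masking):

def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has one extra entry for the state after the last step (0.0 if terminal)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD residual
        running = delta + gamma * lam * running                  # lambda-weighted sum of residuals
        advantages[t] = running
    return advantages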

55

When does the PPO clipping mechanism become active?

When r_t moves outside [1 − ϵ, 1 + ϵ].

56

What does KL Divergence measure in PPO?

How much the new policy diverges from the old policy.
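
A minimal sketch for discrete action probabilities (list inputs are an assumption for illustration):

import math

def kl_divergence(p_old, p_new):
    """KL(pi_old || pi_new) for two action distributions over the same discrete actions."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0)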

57

Why is the expectation E_π[R_t] used when defining the RL objective?

Because the return is a random variable due to stochasticity in environment and policy