If there are any issues, message me on Discord: @KwisJino.
In Reinforcement Learning, the agent’s goal is to maximize the cumulative reward, not just the immediate reward.
True
The Markov property means that the next state and reward depend only on the current state and action, not on the entire history of past states.
True
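As a worked equation (standard notation, assumed rather than taken from the card): the Markov property says that conditioning on the full history gives the same transition distribution as conditioning on the current state and action only.

```latex
% Markov property: the next state and reward depend only on (S_t, A_t).
P(S_{t+1}=s',\, R_{t+1}=r \mid S_t, A_t)
  = P(S_{t+1}=s',\, R_{t+1}=r \mid S_0, A_0, \ldots, S_t, A_t)
```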
Which of the following illustrates the exploration vs. exploitation dilemma?
Choosing between visiting a favorite restaurant or trying a new one
One major limitation of RL is that
It often needs large amounts of interaction data.
A discount factor γ close to 1 means the agent:
Strongly considers long-term rewards
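As a short worked definition (standard notation, not from the card): the return weights rewards k steps ahead by γ^k, so γ close to 1 keeps distant rewards significant, while γ close to 0 makes the agent myopic.

```latex
% Discounted return from time step t.
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```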
Which of the following is a strength of Reinforcement Learning?
Adapts to non-stationary environments: RL can potentially adapt through its feedback loop, though with caveats, so this is a plausible strength.
Which of the following best describes Reinforcement Learning?
Learning by interacting with an environment to maximize long-term rewards
Which of the following terms is correctly matched with its definition?
Value Function – Estimates how good a state (or state–action pair) is in terms of expected return
Which of the following is NOT part of the agent–environment interaction loop?
The environment updates the policy
Which statement about the Markov Property is incorrect?
It’s not true that knowledge of all past states is required; Markov means the present is sufficient.
Which statement about the discount factor (γ) is incorrect?
Saying that γ is always strictly between 0 and 1 and therefore always shrinks future rewards is too strong; γ can equal 1 in episodic tasks, in which case future rewards are not discounted at all.
Which statement about the Bellman Expectation Equation is incorrect?
Q(s, a) always considers only the immediate reward R(s).
What is the core meaning of the Bellman Optimality Equation?
The optimal policy π* maximizes the expected return.
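For reference, one standard form of the Bellman Optimality Equation (notation assumed, not from the card):

```latex
% Optimal state-value and action-value functions.
V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,\big[ R(s,a,s') + \gamma V^*(s') \big]
Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[ R(s,a,s') + \gamma \max_{a'} Q^*(s', a') \big]
```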
Which of the following statements correctly describes the goal of planning in reinforcement learning?
To use a known MDP model to find an optimal policy
Which of the following statements about Iterative Policy Evaluation (IPE) is correct?
It starts with arbitrary values and repeatedly applies the Bellman equation.
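A minimal sketch of Iterative Policy Evaluation in Python, assuming a small tabular MDP stored in dictionaries (the data layout and names here are illustrative, not from the card):

```python
def iterative_policy_evaluation(P, R, policy, gamma=0.9, theta=1e-6):
    """Evaluate a fixed policy on a tabular MDP.

    Assumed layout: P[s][a] is a list of (prob, next_state) pairs,
    R[s][a] is the expected reward, and policy[s][a] is the probability
    of choosing action a in state s.
    """
    V = {s: 0.0 for s in P}                # start from arbitrary (zero) values
    while True:
        delta = 0.0
        for s in P:
            # Bellman expectation backup for state s under the given policy.
            v_new = sum(
                policy[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                  # stop once updates become negligible
            return V
```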
Which of the following statements correctly describes Policy Iteration or Value Iteration?
Policy Iteration alternates between policy evaluation and policy improvement.
Which of the following statements about Early Stopping in Policy Iteration is correct?
It can stop based on a maximum number of iterations or a convergence threshold.
Which of the following statements about the Action Value Function Q(s, a) is incorrect?
It includes both immediate reward and the next state’s Q-value.
Which of the following statements is incorrect regarding agent, environment, state, reward, and action in reinforcement learning?
The agent always selects the action that maximizes the immediate reward in every step
Which of the following statements is incorrect regarding Markov Process (MP), Markov Reward Process (MRP), and Markov Decision Process (MDP)?
An MRP is an MP extended with rewards but without discounting.
Which best describes the purpose of policy evaluation in reinforcement learning?
Estimate the value of states under a given policy so that the policy’s quality can be measured.
In an unknown MDP, which of the following is a correct description of model-free reinforcement learning?
The agent learns purely from interactions without prior knowledge of transitions or rewards.
Which of the following statements about value iteration is correct?
Value iteration updates each state value by applying the Bellman Optimality Equation and immediately taking the maximum over possible actions.
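A minimal value-iteration sketch under the same assumed tabular layout as the policy-evaluation example above (illustrative only): each backup applies the Bellman Optimality Equation and takes the max over actions immediately.

```python
def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Tabular value iteration with greedy policy extraction at the end."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions inside the update.
            v_new = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = {
        s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
        for s in P
    }
    return V, policy
```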
Which of the following best explains why Monte Carlo returns are computed backward from the end of an episode?
Because returns are recursively defined starting from the terminal state.
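A minimal sketch of the backward pass (illustrative code): since Gt = Rt+1 + γGt+1, starting at the terminal state lets every return be computed in a single sweep.

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Compute the return G_t for each step of one finished episode,
    iterating backward so each return reuses the one after it."""
    returns = [0.0] * len(rewards)
    g = 0.0                               # return after the terminal state is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g        # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    return returns

# Example: rewards [1, 0, 2] with gamma = 0.5 give returns [1.5, 1.0, 2.0].
```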
Which of the following is a disadvantage of Temporal Difference (TD) methods compared to Monte Carlo (MC) methods?
TD methods are biased because they update using estimates of future values rather than full returns.
Which of the following statements correctly distinguishes Temporal Difference (TD) from Monte Carlo (MC) methods?
TD can update after each step, while MC must wait until the episode ends
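A minimal tabular TD(0) update sketch (illustrative names): the estimate for the current state moves toward the bootstrapped target r + γV(s′) right after each transition, with no need to wait for the episode to end.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) step on a dict-based value table."""
    target = r + (0.0 if done else gamma * V[s_next])   # TD target
    V[s] += alpha * (target - V[s])                     # step toward the target by the TD error
    return V
```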
During policy improvement, what is the typical rule used to update the policy?
Select the action that maximizes the expected value according to the most recent estimates of the state-value function.
Which of the following is correct about n-step Temporal Difference (TD) methods?
None of the above
Which of the following statements about MC vs TD is correct?
MC only works in episodic tasks, while TD works in both episodic and continuing tasks.
What is the best reason Q-Learning is classified as off-policy?
It uses the target max_a′ Q(s′, a′) regardless of which action a′ was actually taken in the next state.
Which pairing of method and policy type is correct?
SARSA is on-policy and Q-Learning is off-policy.
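A side-by-side sketch of the two tabular updates (illustrative code, assuming a dict-of-dicts Q table): SARSA bootstraps from the action that was actually taken next, while Q-Learning bootstraps from the greedy max regardless of what the behavior policy did.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: the target uses a_next, the action actually selected in s'.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the target takes the max over actions in s',
    # no matter which action the behavior policy actually chose there.
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```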
Which explanation is correct regarding target and behavior policy in off-policy learning?
Off-policy learning allows the agent to learn the optimal target policy while following a different behavior policy to gather more diverse experiences.
Which of the following is NOT correct regarding exploration and exploitation?
Exploitation always guarantees the highest reward.
Which of the following is a correct explanation of the TD (Temporal Difference) target in reinforcement learning?
The TD target includes both the immediate reward and the estimated future reward, weighted by the discount factor.
Which of the following is correct regarding on-policy and off-policy learning?
In on-policy learning, the agent learns from the policy it is currently following, while off-policy learning can learn from a different behavior policy
In a policy-based agent, the decision-making process is:
Stochastic
What does a value-based agent primarily aim to maximize?
Expected Q(s, a) value
In a continuous action space, policy-based methods are preferred because:
They remove the need to find arg max_a Q(s, a)
The objective function J(θ) in policy gradient methods represents:
The expected return under policy πθ
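Written out (standard notation, assumed): J(θ) is the return averaged over trajectories sampled from πθ, and the policy gradient theorem expresses its gradient through log-probabilities.

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t} \gamma^t r_t \right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]
```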
The REINFORCE algorithm updates parameters using:
Monte Carlo returns
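A minimal REINFORCE update sketch in PyTorch, assuming the episode's log-probabilities and rewards have already been collected (all names are illustrative, not from the card):

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE update from a complete episode: each stored
    log-probability is weighted by the Monte Carlo return that followed it."""
    returns, g = [], 0.0
    for r in reversed(rewards):            # returns computed backward, as above
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Gradient ascent on E[log pi(a|s) * G_t], written as a loss to minimize.
    loss = torch.stack([-lp * g for lp, g in zip(log_probs, returns)]).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```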
In Actor-Critic (AC) methods, the actor and critic are responsible for:
Actor = policy improvement, Critic = value estimation
In Advantage Actor-Critic (A2C/A3C), the advantage term A(s, a) is defined as:
A(s, a) = Q(s, a) − V (s)
Compared to Advantage AC, TD Actor-Critic uses:
Two networks (policy and value)
The policy gradient update in TD Actor-Critic uses:
Gradient of TD error times log probability
In the Actor-Critic training code, the critic loss is computed as:
Smooth L1 loss between V(s) and the TD target
Which of the following equations correctly represents the policy update in TD Actor-Critic?
θ ← θ + α ∇θ log πθ(a|s) · (r + γVϕ(s′) − Vϕ(s))
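A minimal sketch of one TD Actor-Critic step in PyTorch, assuming separate actor and critic networks (the names and interfaces are illustrative, not the quoted training code):

```python
import torch
import torch.nn.functional as F

def td_actor_critic_step(actor, critic, optim_actor, optim_critic,
                         s, a, r, s_next, done, gamma=0.99):
    """Critic regresses V(s) toward the TD target; actor ascends
    the TD error times the log-probability of the taken action."""
    v_s = critic(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (1.0 - done)
    td_error = (td_target - v_s).detach()            # delta = r + gamma*V(s') - V(s)

    critic_loss = F.smooth_l1_loss(v_s, td_target)   # Smooth L1, as in the question
    actor_loss = -torch.log(actor(s)[a]) * td_error  # policy gradient term

    optim_critic.zero_grad(); critic_loss.backward(); optim_critic.step()
    optim_actor.zero_grad(); actor_loss.backward(); optim_actor.step()
```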
Which of the following statements about DQN (Deep Q-Network) is incorrect?
DQN is a policy-based method that directly learns a stochastic policy.
Which of the following statements about the REINFORCE algorithm is incorrect?
REINFORCE uses Temporal Difference (TD) error to update every step.
Which of the following correctly describes the purpose of PPO’s clipped objective?
To prevent excessively large policy updates
When the advantage At > 0, what does it mean?
The action was better than expected
If rt = πnew(a|s) / πold(a|s) > 1, what does it indicate?
The new policy increases the action probability.
Which of the following is a correct statement about PPO?
A. PPO is model-free
B. PPO can be applied to continuous action spaces
C. PPO avoids second-order optimization by using a clipped surrogate objective
D. PPO often uses GAE to reduce the variance of the advantage
Which statement about PPO is incorrect?
A. PPO is a value-based algorithm only
B. PPO is a pure model-based method
C. PPO completely ignores the critic network
D. PPO does not use policy gradient updates
Which of the following statements about GAE is correct?
GAE computes a weighted sum of multi-step TD residuals using λ ∈ [0,1].
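A minimal GAE sketch (illustrative code): each step’s TD residual δt = rt + γV(st+1) − V(st) is accumulated backward with weight γλ.

```python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    Assumption of this sketch: `values` holds one extra entry for the
    state after the final step, so values[t + 1] is always defined.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]              # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    return advantages
```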
When does the PPO clipping mechanism become active?
When rt moves outside [1−ϵ, 1+ϵ].
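A minimal sketch of the clipped surrogate term in PyTorch (illustrative names): once the ratio leaves [1−ϵ, 1+ϵ], the clipped branch takes over and the incentive to push the policy further in that direction disappears.

```python
import torch

def ppo_clipped_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize.
    `advantage` is assumed to be precomputed (e.g. by GAE) and detached."""
    ratio = torch.exp(log_prob_new - log_prob_old)            # r_t = pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```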
What does KL Divergence measure in PPO?
How much the new policy diverges from the old policy.
Why is the expectation Eπ[Rt] used when defining the RL objective?
Because the return is a random variable due to stochasticity in environment and policy