Vocabulary-style flashcards covering key RL concepts from the notes.
Reinforcement Learning (RL)
A framework for solving control/decision problems by an agent learning from interaction with an environment to maximize rewards.
Agent
An intelligent program that learns to make decisions by moving through an environment and taking actions.
Environment
The world in which the agent operates, providing states, rewards, and transitions.
State
A representation of the current situation returned by the environment; typically a feature vector.
Action
A possible decision the agent can take in a given state.
Reward
Numeric feedback from the environment evaluating the agent's action; can be positive or negative.
Policy
A rule mapping states to actions; can be deterministic or stochastic and aims to maximize cumulative rewards.
Value Function
An estimate of the expected return (cumulative reward) from a state or state-action pair.
State-Value Function (V(s))
Expected return starting from state s and following a given policy.
Action-Value Function (Q(s,a))
Expected return starting from state s, taking action a, then following a policy.
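A minimal formal statement of these two value functions, assuming the standard expected-return notation (the notes do not cover discounting, so no discount factor appears here):

```latex
% G_t denotes the cumulative reward from time t onward under policy \pi.
% A discount factor is commonly included but is omitted to match the notes.
V^{\pi}(s)    = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s \,\right]
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s,\ A_t = a \,\right]
```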
Model of the Environment
A representation of environment dynamics used to predict next state and reward for planning.
Model-Based RL
RL that uses an explicit model of the environment for planning and optimization.
Model-Free RL
RL that learns from interactions without building an explicit model of the environment.
Immediate Reinforcement Learning (IRL)
RL where evaluation (reward) happens immediately after taking an action.
Bandit Problem
A simple RL scenario with one-step decisions and rewards but no state transitions.
Multi-Armed Bandit (MAB)
A bandit problem with multiple arms, each with an unknown reward distribution, to maximize cumulative reward.
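A minimal sketch of a Bernoulli multi-armed bandit environment in Python; the arm probabilities are illustrative assumptions, not values from the notes. The ε-greedy, UCB, and Thompson sampling sketches below assume an object with this `pull(arm)` interface.

```python
import random

class BernoulliBandit:
    """K arms; pulling arm i returns reward 1 with (unknown) probability probs[i], else 0."""

    def __init__(self, probs):
        self.probs = probs          # true success probabilities (hidden from the agent)

    @property
    def n_arms(self):
        return len(self.probs)

    def pull(self, arm):
        """Return an immediate reward for the chosen arm (no state transition)."""
        return 1 if random.random() < self.probs[arm] else 0

# Example: a 3-armed bandit with made-up arm probabilities.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
print(bandit.pull(2))
```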
Exploration
Trying new actions to gather information about rewards.
Exploitation
Choosing the best-known action to maximize reward.
ε-Greedy
Policy that mostly exploits the best-known action while exploring randomly with probability ε.
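A minimal sketch of ε-greedy action selection with incremental value estimates, assuming the `BernoulliBandit` interface sketched above; ε = 0.1 is an illustrative choice.

```python
import random

def epsilon_greedy(bandit, steps=1000, eps=0.1):
    """Mostly exploit the best-known arm; explore a random arm with probability eps."""
    counts = [0] * bandit.n_arms            # pulls per arm
    values = [0.0] * bandit.n_arms          # estimated mean reward per arm
    total_reward = 0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(bandit.n_arms)        # explore
        else:
            arm = values.index(max(values))              # exploit
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
        total_reward += reward
    return values, total_reward
```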
Upper Confidence Bound (UCB)
Balances exploration and exploitation by adding a confidence bound to estimated values.
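A sketch of the UCB1 rule: each arm's estimated value gets an exploration bonus that shrinks as the arm is pulled more often. The bonus form sqrt(2 ln t / n_a) is the standard UCB1 choice, assumed here since the notes do not pin down a specific variant.

```python
import math

def ucb1(bandit, steps=1000):
    """Pick the arm maximizing estimated value plus a confidence bonus."""
    counts = [0] * bandit.n_arms
    values = [0.0] * bandit.n_arms
    for t in range(1, steps + 1):
        if t <= bandit.n_arms:
            arm = t - 1                                  # pull each arm once first
        else:
            scores = [values[a] + math.sqrt(2 * math.log(t) / counts[a])
                      for a in range(bandit.n_arms)]
            arm = scores.index(max(scores))
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values
```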
Thompson Sampling
Bayesian method for bandits that samples actions from posterior reward distributions.
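A sketch of Thompson sampling for Bernoulli rewards, assuming a Beta(1, 1) prior per arm: sample a plausible mean for each arm from its posterior and pull the arm with the highest sample.

```python
import random

def thompson_sampling(bandit, steps=1000):
    """Bernoulli bandit: keep a Beta posterior per arm, act on posterior samples."""
    successes = [1] * bandit.n_arms    # Beta prior alpha = 1
    failures = [1] * bandit.n_arms     # Beta prior beta = 1
    for _ in range(steps):
        samples = [random.betavariate(successes[a], failures[a])
                   for a in range(bandit.n_arms)]
        arm = samples.index(max(samples))
        reward = bandit.pull(arm)
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```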
PAC (Probably Approximately Correct)
Framework that guarantees near-optimal learning with high probability from a finite number of samples.
PAC-MDP
PAC framework applied to MDPs, guaranteeing near-optimal policy with high probability.
Regret
Difference between the reward of the optimal policy and the reward achieved by the algorithm.
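The cumulative regret over T steps for a bandit is commonly written as below, where mu* is the mean reward of the best arm and a_t is the arm chosen at step t (notation assumed here, not fixed by the notes):

```latex
R(T) = T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right]
```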
Bandit Optimality
Policy that minimizes cumulative regret or maximizes cumulative reward in bandits.
Value-Based Methods
RL methods that learn value functions (V or Q) to guide action selection.
Q-Learning
Model-free, off-policy algorithm that learns the action-value function Q(s,a).
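A sketch of the tabular Q-learning update; the learning rate alpha and discount factor gamma are standard hyperparameters assumed here (discounting itself is not covered in the notes). Bootstrapping from the best next-state action, rather than the action actually taken, is what makes it off-policy.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One off-policy TD update: bootstrap from the best next action, not the one taken."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Q-table keyed by (state, action), defaulting to 0.
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
```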
SARSA
On-policy algorithm updating Q-values based on the action actually taken.
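The SARSA counterpart to the sketch above: the update bootstraps from the action the current policy actually takes in the next state, which is what makes it on-policy. Hyperparameters are again illustrative.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy TD update: bootstrap from the action actually taken next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

Q = defaultdict(float)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```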
Deep Q-Network (DQN)
Neural-network approximation of the Q-function for large or continuous state spaces.
Policy Gradient
Directly optimizing a parameterized policy by gradient ascent on expected return.
REINFORCE
Basic policy gradient algorithm updating policy parameters to increase expected reward.
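A sketch of the REINFORCE update for a state-independent softmax policy over discrete actions, using NumPy; representing the episode as (action, return) pairs and using the return as the weight are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, lr=0.01):
    """episode: list of (action, return_from_that_step) pairs.
    Push up the log-probability of each action, weighted by its return."""
    for action, G in episode:
        probs = softmax(theta)
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0          # gradient of log pi(action) for a softmax policy
        theta += lr * G * grad_log_pi       # gradient ascent on expected return
    return theta

theta = np.zeros(3)                          # 3 actions, parameters start at zero
theta = reinforce_update(theta, episode=[(2, 1.0), (0, 0.5)])
```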
TRPO (Trust Region Policy Optimization)
Policy optimization method enforcing updates within a trust region for stability.
PPO (Proximal Policy Optimization)
Practical policy optimization using a surrogate objective with clipping to limit updates.
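A sketch of PPO's clipped surrogate objective: the probability ratio between the new and old policies is clipped to [1 - eps, 1 + eps] so a single update cannot move the policy too far. The clip range 0.2 is a commonly used default, assumed here.

```python
import numpy as np

def ppo_clipped_objective(new_probs, old_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, averaged over a batch of samples."""
    ratio = new_probs / old_probs                        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()         # maximize this surrogate

# Illustrative batch of three samples.
obj = ppo_clipped_objective(
    new_probs=np.array([0.5, 0.3, 0.9]),
    old_probs=np.array([0.4, 0.35, 0.6]),
    advantages=np.array([1.0, -0.5, 2.0]),
)
```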
Policy Representation
How the policy mapping states to actions is represented, e.g., as a lookup table or a neural network.
Policy-Based Reinforcement Learning
Learning the policy directly without relying on a value function.
Deterministic Policy
Policy that selects a single action for each state.
Stochastic Policy
Policy that assigns probabilities to multiple actions.
On-Policy
Setting in which the policy being improved is the same policy used to generate the data (e.g., SARSA).
Off-Policy
Setting in which the policy being learned differs from the policy used to generate the data (e.g., Q-learning).
Policy Update Stability
Techniques (e.g., TRPO/PPO) to keep updates from destabilizing learning.
Policy Gradient vs Value-Based
Policy gradient directly optimizes the policy; value-based learns value functions to derive a policy.
Policy Optimization
Adjusting policy parameters to maximize expected return.
Immediate Reward Example
An instant-feedback example such as ad clicks: the reward is observed immediately after the action.
Credit Assignment
Determining which action caused a reward; easier with immediate rewards.
Reward Signal
The goal-defining numerical feedback from the environment.
State-Action Value
Q(s,a); the value of taking action a in state s and following a policy thereafter.
Planning
Deciding on actions by considering possible future states using a model.
Exploration-Exploitation Trade-off
Balancing learning new information vs. using known good actions.
Optimal Policy
Policy that maximizes expected cumulative reward.
Cumulative Reward
Total reward accumulated over time under a policy.
Action Space
Set of all possible actions; can be discrete or continuous.
Continuous Action Spaces
Continuous (and often high-dimensional) action spaces; policy-based methods tend to handle them better than value-based methods.
Environment Dynamics (p(s'|s,a))
Probability of the next state s' given the current state s and action a.
Model-Based Planning
Using a learned model to predict future states/rewards for planning.
Model-Free Learning
Learning from trial-and-error without modeling environment dynamics.
Greedy Policy
Policy that always selects the currently best-known action.
Stability in Learning
Maintaining reliable progress during iterative policy/value updates.
Credit Assignment Problem
Challenge of identifying which action led to a reward, alleviated by immediate rewards.
Environment-Observer Interaction
The loop where the agent observes a state, takes an action, receives a reward, and moves to a new state.
Feature-Vector (State Representation)
Numeric vector describing relevant aspects of the current state.
Discounting
Not covered in the provided notes; omitted from this set.