Introduction to Reinforcement Learning

Last updated 12:35 PM on 6/25/26

48 Terms


Supervised learning

Dataset is fully provided with explicit input-to-output mappings (full feedback).


Unsupervised Learning

Data is given without feedback labels (identifying hidden clusters or structures).


Reinforcement Learning

Agent lacks pre-existing dataset. It actively collects data through environmental interactions. Feedback is partial and scalar (agent learns what reward it achieves for a chosen action but never receives the ground-truth optimal action).


Multi-Armed Bandit

One-step decision-making problem. Isolates core challenge of data collection and exploration/exploitation dilemma without burden of temporal credit assignment.


Sequential Reinforcement Learning

Multi-step dependencies where action influences immediate reward and next environment state. Creates temporal credit assignment problem.


Action value

Q(a) = E[r|a]


Incremental Mean Update rule

Estimates the mean reward without storing every past reward (constant memory per action).
Q_n = Q_{n-1} + (1/n) * [r_n - Q_{n-1}]
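
A minimal Python sketch of the incremental rule, to show that it reproduces the ordinary sample mean. The function name and the reward list are illustrative assumptions, not part of the original card.

```python
def incremental_mean(rewards):
    """Return the sample mean of rewards, computed one reward at a time."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        # Q_n = Q_{n-1} + (1/n) * [r_n - Q_{n-1}]
        q = q + (1.0 / n) * (r - q)
    return q


rewards = [1.0, 0.0, 4.0, 3.0]  # hypothetical rewards for one action
estimate = incremental_mean(rewards)
batch_mean = sum(rewards) / len(rewards)
```

Only the current estimate and the count n are kept, so memory use does not grow with the number of rewards.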


Learning Rate Update

Uses a constant step size α so that recent rewards weigh more than old ones. Suited to non-stationary environments (reward distributions shift over time).
Q(a) ← Q(a) + α * [r - Q(a)]
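
A small Python sketch of the constant step-size update, assuming a hypothetical reward stream that jumps from 0 to 1 halfway through. The names and the value α = 0.5 are illustrative choices.

```python
def constant_alpha_update(q, r, alpha):
    """One update Q(a) <- Q(a) + alpha * [r - Q(a)]."""
    return q + alpha * (r - q)


q = 0.0
# Hypothetical non-stationary stream: reward is 0 for 20 steps, then 1 for 20 steps.
for r in [0.0] * 20 + [1.0] * 20:
    q = constant_alpha_update(q, r, alpha=0.5)

# The plain sample mean of the same stream would sit at 0.5,
# while the constant-alpha estimate has moved close to the new reward level.
sample_mean = 0.5
```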


ε-Greedy (random perturbation)

Acts greedily with probability 1-ε and selects random action with probability ε. Maximize performance by decaying ε over time.
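
A minimal Python sketch of ε-greedy selection over a list of action-value estimates. The function name and the `rng` parameter are assumptions made for illustration; ties are broken in favor of the lowest action index.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest estimate (first one on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Decaying ε over time can be done by the caller, for example by passing a smaller `epsilon` on later steps.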


Optimistic Initialization

Initializes action-value estimates to an unrealistically high value (ψ). Acting greedily then forces early exploration, because untried actions keep higher placeholder values than actions that were recently tried and corrected downward.


Upper Confidence Bound (UCB)

Optimism in the face of uncertainty: adds an exploration bonus (a confidence bound) that shrinks as an action is tried more often.
A_t = argmax_a [Q_t(a) + c * sqrt(ln(t) / N_t(a))]
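
A Python sketch of UCB action selection under two assumptions that the card does not state: untried actions (N_t(a) = 0) are selected first, and c defaults to 2.0. The function name is illustrative.

```python
import math


def ucb_action(q, counts, t, c=2.0):
    """Return argmax_a [Q_t(a) + c * sqrt(ln(t) / N_t(a))]."""
    # Assumption: an action that was never tried is selected first,
    # since its bonus would otherwise be undefined (division by zero).
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / counts[a]) for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])
```

With `c=0` the rule reduces to purely greedy selection. Larger values of c favor rarely tried actions.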


Markov Decision Processes (MDPs)

Sequential decision-making.
State Space (S)
Action Space (A)
Transition Probability Function (p(s’|s,a))
Reward Function (r(s,a,s’))
Discount Factor (γ ∈ [0,1])


State Space (S)

All valid environmental situations. States can be atomic (distinct elements with no crossover) or factorized (represented as vector/matrix allowing for generalization).


Action Space (A)

Set of choices available to an agent.


Transition Probability Function (p(s’|s,a))

Dynamics defining the environment’s response.


Reward Function (r(s,a,s’))

Scalar feedback signal that defines which behaviors the agent is optimizing for.


Discount Factor (γ ∈ [0,1])

Determines the present value of future rewards.


Curse of dimensionality

Cardinality (number of elements in a set) of a state space grows exponentially with dimensionality.


Fundamental Bellman Relations

State values: v^π(s)
State-action values: q^π (s,a)
Policy π(a|s)


Complete recursive equation for state values

v^π (s) = Σ_a π(a|s) Σ_s’ p(s’|s,a) [r(s,a,s’) + γ v^π(s’)]
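
A Python sketch of iterative policy evaluation, which applies this equation as a repeated in-place update. The data layout is an assumption made for illustration: `P[s][a]` is a list of `(probability, next_state, reward)` tuples and `pi[s][a]` is the action probability. The two-state MDP is hypothetical.

```python
def policy_evaluation(P, pi, gamma, tol=1e-10, max_sweeps=10_000):
    """Sweep the Bellman expectation update until values stop changing."""
    v = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in P:
            new_v = sum(
                pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    return v


# Hypothetical MDP: A moves to B with reward 0; B loops on itself with reward 1.
P = {
    "A": {"go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 1.0)]},
}
pi = {"A": {"go": 1.0}, "B": {"stay": 1.0}}
v = policy_evaluation(P, pi, gamma=0.9)
```

For this example the fixed point can be checked by hand: v(B) = 1 / (1 - 0.9) = 10 and v(A) = 0 + 0.9 * 10 = 9.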


Policy iteration

Couples explicit evaluation and improvement steps. Alternates Policy Evaluation (repeated sweeps of the Bellman expectation equation until convergence) and Policy Improvement (π’(s) ← argmax_a q^π(s,a)). This is one instance of the Generalized Policy Iteration cycle.


Bellman Optimality Equation

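Written here as the value iteration update, which applies the optimality equation repeatedly until v_{k+1} ≈ v_k:
v_{k+1}(s) = max_a Σ_s’ p(s’|s,a) [r(s,a,s’) + γ v_k(s’)]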
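
A Python sketch of value iteration based on this update. The layout `P[s][a] = [(probability, next_state, reward), ...]` and the small MDP are assumptions for illustration only.

```python
def value_iteration(P, gamma, tol=1e-10, max_sweeps=10_000):
    """Repeat v_{k+1}(s) = max_a sum_s' p(s'|s,a) [r + gamma * v_k(s')]."""
    v = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        v_next = {
            s: max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        converged = max(abs(v_next[s] - v[s]) for s in P) < tol
        v = v_next
        if converged:
            break
    return v


# Hypothetical MDP: in A the agent may stay (reward 0) or go to B (reward 1);
# B loops on itself with reward 2.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)]},
}
v_star = value_iteration(P, gamma=0.5)
```

By hand, v*(B) = 2 / (1 - 0.5) = 4 and v*(A) = max(0.5 * v*(A), 1 + 0.5 * 4) = 3.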


Monte Carlo Methods

Wait until an episode terminates, then compute the realized return G_t and update values. Updates are unbiased but have high variance.
V(s_t) ← V(s_t) + α[G_t - V(s_t)]
Requires the often infeasible assumption of Exploring Starts (for control).
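
A Python sketch of the Monte Carlo update for one finished episode. It assumes `rewards[t]` holds the reward received after leaving `states[t]` and uses an every-visit update. These conventions and all names are illustrative.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, computed backward from the episode end."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def mc_update(V, states, rewards, alpha, gamma):
    """Every-visit update V(s_t) <- V(s_t) + alpha * [G_t - V(s_t)]."""
    for s, g in zip(states, discounted_returns(rewards, gamma)):
        V[s] += alpha * (g - V[s])
    return V


V = {"a": 0.0, "b": 0.0, "c": 0.0}
mc_update(V, states=["a", "b", "c"], rewards=[1.0, 0.0, 2.0], alpha=0.5, gamma=0.5)
```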


Exploring Starts

Every episode starts at a randomly chosen state-action pair, so every pair has a nonzero chance of being visited.


Temporal-Difference Learning: TD(0)

Removes the need to wait for the end of an episode by bootstrapping: the current estimate is updated toward a one-step target built from the next state's estimate. This reduces variance at the cost of introducing bias.
V(s_t) ← V(s_t) + α[R_{t+1} + γ V(s_{t+1}) - V(s_t)]
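
A minimal Python sketch of a single TD(0) update on a value table stored as a dict. The `terminal` flag, which treats the value of a terminal next state as 0, is an assumption added for completeness. The names are illustrative.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update in place and return the TD error."""
    target = r + (0.0 if terminal else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, s="A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
```

Unlike the Monte Carlo update, this can be applied after every single transition.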


SARSA

On-policy. Learns values of the current exploratory policy (safer in online systems).

Q(s, a) ← Q(s, a) + α [ R + γ Q(s', a') - Q(s, a) ]
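
A Python sketch of one SARSA update, assuming Q is stored as a nested dict `Q[state][action]` and ignoring terminal states. The states, actions, and numbers are hypothetical.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: the target uses the action a_next actually chosen."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


Q = {
    "s": {"left": 0.0, "right": 0.0},
    "s2": {"left": 1.0, "right": 3.0},
}
# The behavior policy happened to pick "left" in s2, so its value (1.0) is used.
sarsa_update(Q, "s", "right", r=1.0, s_next="s2", a_next="left", alpha=0.5, gamma=0.5)
```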


Q-Learning

Off-policy. Learns the values of the optimal (greedy) policy directly, regardless of which behavior policy chose the actions.
Q(s, a) ← Q(s, a) + α [ R + γ max_a' Q(s', a') - Q(s, a) ]
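
A Python sketch of one Q-learning update, assuming Q is stored as a nested dict `Q[state][action]` and ignoring terminal states. The states, actions, and numbers are hypothetical.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy update: the target uses the best next action, not the taken one."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])


Q = {
    "s": {"left": 0.0, "right": 0.0},
    "s2": {"left": 1.0, "right": 3.0},
}
# Whatever action is taken next in s2, the target uses max(1.0, 3.0) = 3.0.
q_learning_update(Q, "s", "right", r=1.0, s_next="s2", alpha=0.5, gamma=0.5)
```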


Expected Sarsa

Flexible (can be used on-policy or off-policy). Uses the exact expectation over next actions under the policy instead of a single sampled action, which reduces sampling variance.
Q(s, a) ← Q(s, a) + α [ R + γ Σ_a' π(a'|s') Q(s', a') - Q(s, a) ]
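
A Python sketch of one Expected Sarsa update, assuming nested dicts `Q[state][action]` and `pi[state][action]` and ignoring terminal states. The uniform policy and the numbers are hypothetical.

```python
def expected_sarsa_update(Q, pi, s, a, r, s_next, alpha, gamma):
    """The target averages Q(s', a') over the policy's action probabilities."""
    expected_q = sum(pi[s_next][a2] * Q[s_next][a2] for a2 in Q[s_next])
    Q[s][a] += alpha * (r + gamma * expected_q - Q[s][a])


Q = {
    "s": {"left": 0.0, "right": 0.0},
    "s2": {"left": 1.0, "right": 3.0},
}
pi = {"s2": {"left": 0.5, "right": 0.5}}
# Expected next value is 0.5 * 1.0 + 0.5 * 3.0 = 2.0.
expected_sarsa_update(Q, pi, "s", "right", r=1.0, s_next="s2", alpha=0.5, gamma=0.5)
```

If `pi` puts all probability on the greedy action, the target equals the Q-learning target.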


On-policy

The agent learns the value of the exact policy it is currently using to interact with the environment. It improves upon its own current strategy (a chef learning only by tasting their own cooking).


Off-policy

The agent can learn optimal behavior from data collected by entirely different policies, past experiences or even random actions (chef learns by watching other chefs cook, reading books, and analyzing other people's recipes).


Maximization Bias

Taking the maximum of noisy values systematically overestimates true expectations.


Double Q-Learning

Solves maximization bias by decoupling action selection from action evaluation using two independent value arrays.
Q_A(s,a) ← Q_A(s,a) + α [ R + γ Q_B(s', argmax_a’ Q_A(s', a')) - Q_A(s,a) ]
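
A Python sketch of the Q_A update shown above, assuming both tables are nested dicts `Q[state][action]`. The symmetric Q_B update is obtained by swapping the two table arguments. How the caller alternates between the two (for example at random) is left out here.

```python
def double_q_update(Q_sel, Q_eval, s, a, r, s_next, alpha, gamma):
    """Update Q_sel: it selects the next action, Q_eval supplies its value."""
    a_star = max(Q_sel[s_next], key=Q_sel[s_next].get)  # argmax_a' Q_sel(s', a')
    target = r + gamma * Q_eval[s_next][a_star]
    Q_sel[s][a] += alpha * (target - Q_sel[s][a])


# Hypothetical tables that disagree about the value of "right" in s2.
Q_A = {"s": {"go": 0.0}, "s2": {"left": 1.0, "right": 3.0}}
Q_B = {"s": {"go": 0.0}, "s2": {"left": 5.0, "right": 0.5}}
double_q_update(Q_A, Q_B, "s", "go", r=1.0, s_next="s2", alpha=0.5, gamma=0.5)
```

Here Q_A picks "right", but the target uses Q_B's estimate of that action (0.5) rather than Q_A's own possibly inflated 3.0.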


Model-Based RL & Sample-Based Planning

Interactions with the real environment are irreversible. MBRL lets the agent build a learned tabular forward model from logged real-world transition tuples (s, a, r, s’).

p̂(s' | s, a) = n(s, a, s') / Σ_s’’ n(s, a, s'')
r̂(s, a, s') = R_sum(s, a, s') / n(s, a, s')
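
A Python sketch of these two count-based estimates. The class and method names are illustrative assumptions. Unseen transitions are given probability and reward 0 here, which is a design choice and not something the card specifies.

```python
from collections import defaultdict


class TabularModel:
    """Count-based estimates of p(s'|s,a) and r(s,a,s') from logged tuples."""

    def __init__(self):
        self.n = defaultdict(int)        # n(s, a, s')
        self.r_sum = defaultdict(float)  # R_sum(s, a, s')

    def observe(self, s, a, r, s_next):
        self.n[(s, a, s_next)] += 1
        self.r_sum[(s, a, s_next)] += r

    def p_hat(self, s, a, s_next):
        total = sum(c for (s0, a0, _), c in self.n.items() if (s0, a0) == (s, a))
        return self.n.get((s, a, s_next), 0) / total if total else 0.0

    def r_hat(self, s, a, s_next):
        count = self.n.get((s, a, s_next), 0)
        return self.r_sum.get((s, a, s_next), 0.0) / count if count else 0.0


model = TabularModel()
for r, s_next in [(1.0, "x"), (2.0, "x"), (3.0, "x"), (0.0, "y")]:
    model.observe("s", "a", r, s_next)
```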


Dyna Framework

Unifies model-free updates from real experience with simulated planning updates in the background. After each real transition and table update, it samples previously seen state-action pairs at random from the internal model and runs simulated updates on them, which improves data efficiency.
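
A Python sketch of one Dyna step in the style of Dyna-Q. It assumes a Q-learning update, a deterministic model that stores only the last observed outcome per state-action pair, and a nested dict `Q[state][action]`. These are illustrative choices. The card does not fix the update rule or the model type.

```python
import random


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Q-learning update; a state with no actions is treated as terminal."""
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])


def dyna_q_step(Q, model, s, a, r, s_next, alpha, gamma, n_planning, rng=random):
    q_update(Q, s, a, r, s_next, alpha, gamma)  # 1. learn from the real transition
    model[(s, a)] = (r, s_next)                 # 2. record it in the model
    for _ in range(n_planning):                 # 3. simulated planning updates
        ps, pa = rng.choice(list(model))        #    random previously seen pair
        pr, ps_next = model[(ps, pa)]
        q_update(Q, ps, pa, pr, ps_next, alpha, gamma)


Q = {"s": {"a": 0.0}, "end": {}}
model = {}
dyna_q_step(Q, model, "s", "a", r=1.0, s_next="end", alpha=0.5, gamma=0.9, n_planning=2)
```

One real transition produced three updates in total (one real, two simulated), which is the source of the data efficiency.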


Prioritized Sweeping

Learns a backward (predecessor) model and maintains a priority queue of states whose values are expected to change most after the latest update, so that computation is spent where it matters most.


Back-up Architectures

Dynamic Programming (full-width, shallow)
TD Learning (sample-width, shallow)
Monte Carlo (sample-width, deep)
Exhaustive Search (full-width, deep)


Parametric function networks

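Remove the need for a separate table entry per state, which is infeasible for continuous or extremely large state spaces. Values are approximated by a function with parameters w.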
v_π(s) ≈ v(s, w)
q_π(s, a) ≈ q(s, a, w)


Mean Squared Value Error

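Objective for value function approximation, minimized with gradient descent. μ(s) weights how much the error in each state matters.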
VE(w) = Σ_s μ(s) [ v_π(s) - v(s, w) ]²
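
A Python sketch of the objective and of one stochastic gradient step, assuming a linear approximation v(s, w) = w · x(s) with a hypothetical feature vector x(s). The constant factor from differentiating the square is folded into the step size α.

```python
def value_error(mu, v_true, v_hat):
    """VE = sum_s mu(s) * [v_pi(s) - v(s, w)]^2 over a finite state set."""
    return sum(mu[s] * (v_true[s] - v_hat[s]) ** 2 for s in mu)


def linear_value(x, w):
    """Linear approximation v(s, w) = w . x(s)."""
    return sum(xi * wi for xi, wi in zip(x, w))


def sgd_step(w, x, target, alpha):
    """Move w along the negative gradient of the squared error for one state."""
    error = target - linear_value(x, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


ve = value_error({"a": 0.5, "b": 0.5}, {"a": 1.0, "b": 2.0}, {"a": 0.0, "b": 2.0})
w = sgd_step([0.0, 0.0], x=[1.0, 2.0], target=5.0, alpha=0.1)
```

In practice the true value v_π(s) is unknown, so the target is replaced by a sampled return or a bootstrapped estimate.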


Policy Gradient Methods

Parameterize a probabilistic policy mapping π(a|s,θ) instead of optimizing values and picking actions indirectly via ε-greedy.


REINFORCE algorithm

Uses the Policy Gradient Theorem to update policy parameters directly from sampled trajectories.
θ_{t+1} = θ_t + α G_t ∇ ln π(A_t | S_t, θ_t)

Subtracting a learned baseline v(S_t, w) from the return reduces the variance of the updates and stabilizes learning.
θ_{t+1} = θ_t + α [ G_t - v(S_t, w) ] ∇ ln π(A_t | S_t, θ_t)
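
A Python sketch of one REINFORCE step for the simplest case: a single state with a softmax policy over action preferences θ, for which ∇ ln π(a) is the one-hot vector of the chosen action minus π. The single-state setup and all names are assumptions for illustration. The baseline is passed in as a plain number.

```python
import math


def softmax(theta):
    """Softmax policy over action preferences (numerically stabilized)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(theta, action, G, alpha, baseline=0.0):
    """theta <- theta + alpha * (G - baseline) * grad ln pi(action | theta)."""
    pi = softmax(theta)
    grad_log = [(1.0 if a == action else 0.0) - pi[a] for a in range(len(theta))]
    return [t + alpha * (G - baseline) * g for t, g in zip(theta, grad_log)]


theta = reinforce_step([0.0, 0.0], action=0, G=1.0, alpha=0.1)
```

When the return equals the baseline, the update is zero. Only returns that are better or worse than expected move the policy.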


Actor-Critic Methods

Combine paradigms: the actor parameterizes and improves the explicit policy π(a|s,θ), while the critic learns a value function with parameters w and supplies a one-step bootstrapped TD signal used to judge the actor’s selections.


Psychology & Neuroscience Connections

Classical Conditioning (Pavlovian)
Instrumental/Operant Conditioning (Skinnerian)
Habitual vs. Goal-Directed Behavior


Classical Conditioning (Pavlovian)

Matches the prediction problem (policy evaluation). Bursts of the neurotransmitter dopamine in the brain closely resemble Temporal-Difference error signals (δ). Learning saturates when rewards are fully expected, and extinction occurs when expected rewards are removed. This matches the behavior described by the Rescorla-Wagner learning rule.


Instrumental/Operant Conditioning (Skinnerian)

Matches control problem (policy improvement) shifting active behaviors based on reinforcement feedback.


Habitual vs. Goal-Directed Behavior

Distinction between raw reactive instincts (Model-Free RL) and forward planning cognitive maps (Model-Based RL).


AlphaGo

Go has a massive branching factor (b ≈ 250) and game depth (d ≈ 150), making traditional chess-style minimax lookahead search infeasible.

Integrations with MCTS:
Narrowing Search Width: human move logs were cloned via supervised learning to train a policy network (p_σ), which guides the MCTS selection formula to prioritize high-potential actions.
Pruning Search Depth: the policy network was improved through self-play with policy gradients (REINFORCE), and the resulting self-play games were used to train a deep value network (v_θ) that evaluates intermediate positions directly (avoiding the need to search to the end of the game).


AlphaGo Zero

Simplified the architecture and dropped all human imitation data. Uses raw board observations and treats MCTS as a policy improvement step: the search produces a stronger move distribution than the raw network, and the policy network is then trained to match that tree-search distribution.


Rescorla-Wagner

Change in associative strength of a stimulus = salience × learning rate × (maximum associative strength - summed associative strength of all stimuli present)
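
A Python sketch of one Rescorla-Wagner trial, assuming associative strengths are kept in a dict keyed by stimulus name. The stimulus names and parameter values are hypothetical.

```python
def rescorla_wagner_update(V, present, salience, learning_rate, max_strength):
    """Update the associative strength of every stimulus present on this trial."""
    # The prediction error is shared: it uses the summed strength of all
    # stimuli present, computed before any of them is changed.
    error = max_strength - sum(V[cs] for cs in present)
    for cs in present:
        V[cs] += salience[cs] * learning_rate * error
    return V


V = {"light": 0.0, "tone": 0.0}
salience = {"light": 0.5, "tone": 0.5}
rescorla_wagner_update(V, ["light"], salience, learning_rate=0.2, max_strength=1.0)
first_trial = V["light"]
for _ in range(49):
    rescorla_wagner_update(V, ["light"], salience, learning_rate=0.2, max_strength=1.0)
```

As the summed strength approaches the maximum, the error term shrinks toward zero, which is the saturation effect described on the Classical Conditioning card.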