Introduction to Reinforcement Learning

0.0(0)
Studied by 0 people
call kaiCall Kai
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
GameKnowt Play
Card Sorting

1/46

encourage image

There's no tags or description

Looks like no tags are added yet.

Last updated 4:04 PM on 6/2/26
Name
Mastery
Learn
Test
Matching
Spaced
Call with Kai

No analytics yet

Send a link to your students to track their progress

47 Terms

1
New cards

Supervised learning

Dataset is fully provided with explicit input-to-output mappings (full feedback).

2
New cards

Unsupervised Learning

Data is given without feedback labels (identifying hidden clusters or structures).

3
New cards

Reinforcement Learning

Agent lacks pre-existing dataset. It actively collects data through environmental interactions. Feedback is partial and scalar (agent learns what reward it achieves for a chosen action but never receives the ground-truth optimal action).

4
New cards

Multi-Armed Bandit

One-step decision-making problem. Isolates core challenge of data collection and exploration/exploitation dilemma without burden of temporal credit assignment.

5
New cards

Sequential Reinforcement Learning

Multi-step dependencies where action influences immediate reward and next environment state. Creates temporal credit assignment problem.

6
New cards

Action value

Q(a) = E[r|a]

7
New cards

Incremental Mean Update rule

Avoids massive memory storage overhead.
Qn = Qn-1 + (1/n) * [rn - Qn-1]

8
New cards

Learning Rate Update

Non-stationary environments (reward distributions shift over time).
Q(a) ← Q(a) + α * [r-Q(a)]

9
New cards

ε-Greedy (random perturbation)

Acts greedily with probability 1-ε and selects random action with probability ε. Maximize performance by decaying ε over time.

10
New cards

Optimistic Initialization

Initializes action-value estimates to unrealistic high valuation (ψ). Selects actions greedily and forces immediate exploration (unchosen actions will always have higher value placeholders than recent).

11
New cards

Upper Confidence Bound (UCB)

Optimism in the face of uncertainty by standard deviation/error bounds.
At = argmaxa [Qt(a) + c * sqr(ln(t)/Nt(a))]

12
New cards

Markov Decision Processes (MDPs)

Sequential decision-making.
State Space (S)
Action Space (A)
Transition Probability Function (p(s’|s,a))
Reward Function (r(s,a,s’))
Discount Factor (γ = [0,1])

13
New cards

State Space (S)

All valid environmental situations. States can be atomic (distinct elements with no crossover) or factorized (represented as vector/matrix allowing for generalization).

14
New cards

Action Space (A)

Set of choices available to an agent.

15
New cards

Transition Probability Function (p(s’|s,a))

Dynamics defining the environment’s response.

16
New cards

Reward Function (r(s,a,s’))

Scalar feedback loop optimizing specific behaviors.

17
New cards

Discount Factor (γ = [0,1])

Determines the current valuation of feature rewards.

18
New cards

Curse of dimensionality

Cardinality (number of elements in a set) of a state space grows exponentially with dimensionality.

19
New cards

Fundamental Bellman Relations

State values: vπ(s)
State-action values: qπ (s,a)
Policy π(a|s)

20
New cards

Complete recursive equation for state values

vπ (s) = Σa π(a|s) Σs’ p(s’|s,a) [r(s,a,s’) + γ vπ(s’)]

21
New cards

Policy iteration

Couples explicit evaluation and improvement steps. Iterates Policy Evaluation (repeated sweeps using formula until convergence) and Policy Improvement (π’(s) ← argmaxa qπ (s,a)) within the Generalized Policy Iteration (cycle).

22
New cards

Bellman Optimality Equation

vk+1 (s) = maxa Σs’ p (s’|s,a) [r(s,a,s’) + γ vk(s’)]

23
New cards

Monte Carlo Methods

Wait until absolute termination of an episode to compute exact realized empirical return Gt and update values. Updates are unbiased but have high variance.
V(st) ← V(st) + α[Gt - V(st)]
Requires unfeasible assumption of Exploring Starts.

24
New cards

Exploring Starts

Every episode starts at a randomized state-action pair.

25
New cards

Temporal-Difference Learning: TD(0)

Eliminates the endpoint restriction via bootstrapping (updating current estimation using single-step future estimate target, reducing variance at the cost of introduction of bias).
V(st) ← V(st) + α[Rt+1 + γ V(st+1) - V(st)]

26
New cards

SARSA

On-policy. Learns values of the current exploratory policy (safer in online systems).

Q(s, a) ← Q(s, a) + α [ R + γ Q(s', a') - Q(s, a) ]

27
New cards

Q-Learning

Off-policy. Learns values of the absolute optimal policy directly (regardless of behavior choices).
Q(s, a) ← Q(s, a) + α [ R + γ maxa’ Q(s', a') - Q(s, a) ]

28
New cards

Expected Sarsa

Flexible. Computes exact policy expectation over the target state (reduces sampling variance).
Q(s, a) ← Q(s, a) + α [ R + γ Σa’ π(a'|s')Q(s', a') - Q(s, a) ]

29
New cards

On-policy

The agent learns the value of the exact policy it is currently using to interact with the environment. It improves upon its own current strategy (chef learning only by tasting their own cooking)

30
New cards

Off-policy

The agent can learn optimal behavior from data collected by entirely different policies, past experiences or even random actions (chef learns by watching other chefs cook, reading books, and analyzing other people's recipes).

31
New cards

Maximization Bias

Taking the maximum of noisy values systematically overestimates true expectations.

32
New cards

Double Q-Learning

Solves maximization bias by decoupling action selection from action evaluation using two independent value arrays.
QA(s,a) ← QA(s,a) + α [ R + γ QB(s', argmaxa’ QA(s', a')) - QA(s,a) ]

33
New cards

Model-Based RL & Sample-Based Planning

Environmental interactions in the real world are irreversible. MBRL allows agents to construct a learned tabular forward model from experiences real-world data tuple logs (s, a, r, s’).

p̂(s' | s, a) = n(s, a, s') / Σs’’ n(s, a, s'')
r̂(s, a, s') = Rsum(s, a, s') / n(s, a, s')

34
New cards

Dyna Framework

Unifies real-world model-free updates with simulated background planning updates. After real transition and updating the table it samples randomized previously seen states/actions from internal model. Running simulated updates maximizes data efficiency.

35
New cards

Prioritized Sweeping

Learns a backward model and updates a priority queue of states whose values are most significantly altered by the latest reward update (optimizing computational allocation).

36
New cards

Back-up Architectures

Dynamic Programming (full-width, shallow)
TD Learning (sample-width, shallow)
Monte Carlo (sample-width, deep)
Exhaustive Search (full-width, deep)

37
New cards

Parametric function networks

Eliminate tracking bottlenecks of continuous or extremely large state spaces.
vπ(s) ≈ v(s, w)
qπ(s, a) ≈ q(s, a, w)

38
New cards

Mean Squared Value Error

Using Gradient Descent.
VE(w) = Σs μ(s) [ vπ(s) - v(s, w) ]2

39
New cards

Policy Gradient Methods

Parameterize a probabilistic policy mapping π(a|s,θ) instead of optimizing values and picking actions indirectly via ε-greedy.

40
New cards

REINFORCE algorithm

Policy Gradient Theorem to update parameters directly from sampled trajectories.
θt+1 = θt + α Gt ∇ ln π(At | St, θt)

We stabilize learning speed by shifting updates to reward advantages.
θt+1 = θt + α [ Gt - v(St, w) ] ∇ ln π(At | St, θt)

41
New cards

Actor-Critic Methods

Combine paradigms: actor parameterizes and improves the explicit policy π(a|s,θ) while critic tracks parameters w to learn a single-step bootstrapped TD baseline value function used to judge actor’s selections.

42
New cards

Psychology & Neuroscience Connections

Classical Conditioning (Pavlovian)
Instrumental/Operant Conditioning (Skinnerian)
Habitual vs. Goal-Directed Behavior

43
New cards

Classical Conditioning (Pavlovian)

Matches prediction problem (policy evaluation). Burst of neurotransmitter dopamine inside the brain mathematically mirror the exact functional operation of Temporal-Difference Error signals (δ). Learning saturates when rewards are fully expected and extinction when expected rewards are removes. This tracks behavior mechanism of the Rescorla-Wagner learning rule.

44
New cards

Instrumental/Operation Conditioning (Skinnerian)

Matches control problem (policy improvement) shifting active behaviors based on reinforcement feedback.

45
New cards

Habitual vs. Goal-Directed Behavior

Distinction between raw reactive instincts (Model-Free RL) and forward planning cognitive maps (Model-Based RL).

46
New cards

AlphaGO

GO has massive search branching factor (b=250) and game depth (d=150) making traditional chess-style Minmax lookahead search impossible.

Integrations of MCTS:
Narrowing Search Width: human move logs were cloned via supervised learning to train policy network (pσ) and guids MCTS selection formula to prioritize high-potential actions.
Pruning Search Depth: self-play policy gradient architectures (REINFORCE) optimized a deep value network (vθ) to evaluate intermediate game state tables directly (bypassing the computational need for full deep tree search evaluations).

47
New cards

AlphaGo Zero

Simplified complex structure. Bypassing all human imitation data entirely. Utilized raw observations and treats MCTS simultaneously as a dual-engine iteration step (MCTS acts as continuous policy improvement step and policy neural network directly learns to clone the tree-search distribution patterns via structural reinforcement updates).