Essential notation and definitions drawn from the lecture’s summary tables, covering probability symbols, bandit parameters, MDP components, value functions, TD learning, policy-gradient parameters, and linear function-approximation matrices.
Capital Letters
Used for random variables
Lower-case Letters (e.g., x)
Denote specific values of random variables or scalar functions.
Bold Lower-case (e.g., x)
Real-valued column vectors.
Bold Capitals (e.g., A)
Matrices.
.= (Definitional Equality)
Expresses that two quantities are equal by definition.
≈ (Approximately Equal)
Indicates an approximation between two quantities.
∝ (Proportional To)
Shows that one quantity is proportional to another.
Pr{X = x}
Probability that random variable X takes value x.
X ~ p(x)
Random variable X is drawn from distribution p(x).
E[X]
Expectation (mean) of random variable X.
argmax_a f(a)
Value(s) of a that maximize the function f(a).
ln x
Natural logarithm of x.
exp(x) or e^x
The exponential function; inverse of ln x.
ℝ
Set of real numbers.
f : X → Y
Function f mapping elements of set X to elements of set Y.
← (Assignment)
Assigns the value on the right to the variable on the left.
(a, b]
Half-open real interval: greater than a and up to and including b.
ε (Epsilon-greedy)
Probability of selecting a random action in an ε-greedy policy.
α (Step-size Parameter)
Learning-rate parameter for incremental updates.
γ (Discount-rate Parameter)
Factor that discounts future rewards (0 ≤ γ ≤ 1).
λ (Lambda)
Decay-rate parameter for eligibility traces.
𝟙(predicate)
Indicator function: 1 if predicate is true, else 0.
k
Number of actions (arms) in a multi-armed bandit.
t
Discrete time step or play number.
q*(a)
True (expected) reward of action a in a bandit problem.
Q_t(a)
Estimate at time t of q*(a).
N_t(a)
Number of times action a has been selected up to time t.
H_t(a)
Learned preference for selecting action a at time t (preference-based methods).
π_t(a)
Probability of selecting action a at time t.
R̄_t
Running estimate of expected reward at time t.
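The bandit symbols above (k, t, Q_t(a), N_t(a), ε) come together in the simple action-value method. A minimal sketch, assuming a sample-average update and a hypothetical reward callable `pull(a)` supplied by the environment:

```python
import random

def epsilon_greedy_bandit(k, steps, epsilon, pull):
    """Sample-average bandit with epsilon-greedy action selection.

    k       -- number of arms
    steps   -- number of plays (time steps t)
    epsilon -- probability of choosing a random action
    pull    -- hypothetical callable: pull(a) returns a sampled reward
    """
    Q = [0.0] * k   # Q_t(a): current estimates of q*(a)
    N = [0] * k     # N_t(a): how often each action has been selected
    for t in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit: argmax_a Q_t(a)
        r = pull(a)
        N[a] += 1
        # Incremental sample-average update; the step size here is 1/N_t(a).
        # A constant step size alpha would instead track nonstationary rewards.
        Q[a] += (r - Q[a]) / N[a]
    return Q, N
```

Preference-based methods replace Q_t(a) with the preferences H_t(a), select actions via a softmax π_t(a), and use R̄_t as a baseline when updating the preferences.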
s
State in a Markov Decision Process (MDP).
s′
Next state after a transition.
a
Action taken by the agent.
r
Reward received after a transition.
S
Set of all non-terminal states.
S+
Set of all states including the terminal state.
A(s)
Set of actions available in state s.
R
Set of all possible rewards (finite subset of ℝ).
|S|
Number of states in set S (cardinality).
T
Final time step of an episode.
A_t
Action taken at time step t.
S_t
State occupied at time step t.
R_t
Reward received at time step t.
π(s) (Deterministic Policy)
Action chosen in state s under a deterministic policy.
π(a | s) (Stochastic Policy)
Probability of taking action a in state s under policy π.
G_t
Return (cumulative, possibly discounted reward) following time t.
h (Horizon)
Look-ahead time step used in forward‐view methods.
G_{t:t+n}
n-step return: discounted rewards from t+1 through t+n, plus a bootstrapped value estimate at S_{t+n}.
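For reference, the return and n-step return correspond to the usual definitions (the n-step form shown here bootstraps with a value estimate at S_{t+n}; corrected off-policy variants also exist):

```latex
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n}
    + \gamma^{n} V(S_{t+n})
```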
p(s′, r | s, a)
Probability of moving to state s′ and receiving reward r after (s, a).
p(s′ | s, a)
Transition probability from state s to s′ under action a, marginalized over rewards.
r(s, a)
Expected immediate reward after taking action a in state s.
r(s, a, s′)
Expected reward on transition (s, a) → s′.
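The last three dynamics quantities can all be recovered from the four-argument function p(s′, r | s, a):

```latex
p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)

r(s, a) = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)

r(s, a, s') = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}
```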
v_π(s)
State-value: expected return from state s following policy π.
v*(s)
Optimal state-value: maximum expected return from state s.
q_π(s, a)
Action-value: expected return from (s, a) following policy π.
q*(s, a)
Optimal action-value: maximum expected return from (s, a).
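The four value functions are tied together by the policy, the dynamics, and maximization:

```latex
v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]
         = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)

q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]

v_*(s) = \max_{a} q_*(s, a), \qquad
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_*(s')\bigr]
```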
V_t
Array (vector) of current estimates of v_π(s) or v*(s).
Q_t
Array of current estimates of q_π(s, a) or q*(s, a).
V̄_t(s)
Expected approximate value at s: Σ_a π(a|s) Q_t(s, a).
U_t
Target used for updating an estimate at time t.
δ_t (TD Error)
Temporal-difference error at time step t: δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t).
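As a concrete case, tabular TD(0) uses the target U_t = R_{t+1} + γV(S_{t+1}) and moves V(S_t) toward it by α·δ_t. A minimal sketch; the table `V` and the transition arguments are illustrative assumptions:

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """One tabular TD(0) step: V(S_t) <- V(S_t) + alpha * delta_t."""
    v_next = 0.0 if terminal else V[s_next]   # V(S_{t+1}) is 0 past the terminal state
    delta = r + gamma * v_next - V[s]         # delta_t = R_{t+1} + gamma*V(S_{t+1}) - V(S_t)
    V[s] += alpha * delta                     # move the estimate toward the target U_t
    return delta
```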
n (n-step Methods)
Number of steps of bootstrapping before using a value estimate.
d
Dimensionality of weight vector w in function approximation.
w
Weight vector parameterizing an approximate value function.
v̂(s, w)
Approximate value of state s given weights w.
q̂(s, a, w)
Approximate action value for (s, a) given weights w.
∇v̂(s, w)
Gradient of v̂(s, w) with respect to w (column vector).
x(s)
Feature vector observed in state s.
x(s, a)
Feature vector observed for pair (s, a).
wᵀx
Inner (dot) product between vectors w and x.
z_t
Eligibility-trace vector at time t.
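These function-approximation symbols combine in linear semi-gradient TD(λ), where v̂(s, w) = wᵀx(s) and therefore ∇v̂(s, w) = x(s). A minimal NumPy sketch, assuming the feature vectors x(S_t) and x(S_{t+1}) are supplied by a hypothetical feature function:

```python
import numpy as np

def semi_gradient_td_lambda_step(w, z, x_s, x_s_next, r,
                                 alpha, gamma, lam, terminal=False):
    """One step of linear semi-gradient TD(lambda) with accumulating traces.

    w        -- weight vector of dimension d
    z        -- eligibility-trace vector z_{t-1}
    x_s      -- feature vector x(S_t)
    x_s_next -- feature vector x(S_{t+1})
    """
    v = w @ x_s                                 # v_hat(S_t, w) = w^T x(S_t)
    v_next = 0.0 if terminal else w @ x_s_next  # v_hat(S_{t+1}, w)
    delta = r + gamma * v_next - v              # TD error delta_t
    z = gamma * lam * z + x_s                   # z_t = gamma*lambda*z_{t-1} + grad v_hat(S_t, w)
    w = w + alpha * delta * z                   # weight update along the trace
    return w, z
```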
θ
Parameter vector defining a (possibly stochastic) target policy.
π(a | s, θ)
Probability of action a in state s under parameters θ.
J(θ)
Performance objective of policy parameter θ (e.g., expected return).
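One standard (not the only) parameterization consistent with these symbols is a linear softmax over action features x(s, a), with J(θ) taken as the value of a designated start state s₀:

```latex
\pi(a \mid s, \boldsymbol{\theta}) \doteq
  \frac{\exp\!\bigl(\boldsymbol{\theta}^{\top} \mathbf{x}(s, a)\bigr)}
       {\sum_{b} \exp\!\bigl(\boldsymbol{\theta}^{\top} \mathbf{x}(s, b)\bigr)},
\qquad
J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0)
```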
b(a | s)
Behavior policy used to generate experience while learning.
ρ_{t:h}
Importance-sampling ratio from time t through h.
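Written out, the ratio is a product of per-step probability ratios between the target and behavior policies over the indicated time range:

```latex
\rho_{t:h} \doteq \prod_{k=t}^{h} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```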
μ(s)
On-policy distribution over states under policy π.
A (TD Matrix)
Expected matrix E[x_t (x_t − γx_{t+1})ᵀ] used in linear TD theory.
b (TD Vector)
Expected vector E[R_{t+1} x_t] in linear TD.
w_TD
Fixed-point weight vector solving Aw = b (TD solution).
I
Identity matrix.
P
State-transition probability matrix under policy π.
D
Diagonal matrix with μ(s) on its diagonal.
X
Matrix whose rows are feature vectors x(s).
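With these matrices, the linear-TD quantities admit a closed matrix form, and w_TD is the resulting fixed point (r_π below denotes the vector of expected one-step rewards under π, a symbol not in the table):

```latex
\mathbf{A} = \mathbf{X}^{\top} \mathbf{D} (\mathbf{I} - \gamma \mathbf{P}) \mathbf{X},
\qquad
\mathbf{b} = \mathbf{X}^{\top} \mathbf{D}\, \mathbf{r}_\pi,
\qquad
\mathbf{w}_{\mathrm{TD}} = \mathbf{A}^{-1} \mathbf{b}
```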
δ̄_w(s) (Bellman Error)
Expected TD error at state s under weights w.
VE(w) (Value Error)
Mean-square difference between v̂(s,w) and true value v_π(s).
BE(w) (Bellman Error, MSE)
Mean-square Bellman error: E[δ̄_w(s)²].
PBE(w)
Mean-square projected Bellman error (after projection onto feature space).
TDE(w)
Mean-square temporal-difference error: E[δ_t²].
RE(w)
Mean-square return error: expected squared error between the return G_t and v̂(S_t, w).
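For comparison, the objectives can be written over the on-policy distribution μ; Π below denotes projection onto the representable (linear) value functions, a symbol not in the table:

```latex
\overline{VE}(\mathbf{w}) \doteq \sum_{s} \mu(s)\,\bigl[v_\pi(s) - \hat{v}(s, \mathbf{w})\bigr]^{2}

\overline{BE}(\mathbf{w}) \doteq \sum_{s} \mu(s)\,\bar{\delta}_{\mathbf{w}}(s)^{2}

\overline{PBE}(\mathbf{w}) \doteq \bigl\lVert \Pi\, \bar{\delta}_{\mathbf{w}} \bigr\rVert_{\mu}^{2}

\overline{TDE}(\mathbf{w}) \doteq \sum_{s} \mu(s)\,\mathbb{E}\bigl[\delta_t^{2} \mid S_t = s\bigr]

\overline{RE}(\mathbf{w}) \doteq \mathbb{E}\bigl[\bigl(G_t - \hat{v}(S_t, \mathbf{w})\bigr)^{2}\bigr]
```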