Reinforcement-Learning Notation – Vocabulary Flashcards

Description and Tags

Essential notation and definitions drawn from the lecture’s summary tables, covering probability symbols, bandit parameters, MDP components, value functions, TD learning, policy-gradient parameters, and linear function-approximation matrices.


90 Terms

1
New cards

Capital Letters (e.g., X)

Denote random variables.

2
New cards

Lower-case Letters (e.g., x)

Denote specific values of random variables or scalar functions.

3
New cards

Bold Lower-case (e.g., x)

Real-valued column vectors.

4
New cards

Bold Capitals (e.g., A)

Matrices.

5
New cards

.= (Definitional Equality)

Expresses that two quantities are equal by definition.

6
New cards

≈ (Approximately Equal)

Indicates an approximation between two quantities.

7
New cards

∝ (Proportional To)

Shows that one quantity is proportional to another.

8
New cards

Pr{X = x}

Probability that random variable X takes value x.

9
New cards

X ~ p(x)

Random variable X is drawn from distribution p(x).

10
New cards

E[X]

Expectation (mean) of random variable X.

11
New cards

argmax_a f(a)

Value(s) of a that maximize the function f(a).

12
New cards

ln x

Natural logarithm of x.

13
New cards

exp(x) or e^x

The exponential function; inverse of ln x.

14
New cards

ℝ

Set of real numbers.

15
New cards

f : X → Y

Function f mapping elements of set X to elements of set Y.

16
New cards

← (Assignment)

Assigns the value on the right to the variable on the left.

17
New cards

(a, b]

Half-open real interval: greater than a and up to and including b.

18
New cards

ε (Epsilon-greedy)

Probability of selecting a random action in an ε-greedy policy.
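
A minimal Python sketch of ε-greedy selection, for illustration only (the function name and the dict-based Q are assumptions, not notation from the lecture):

import random

def epsilon_greedy(Q, epsilon):
    # Q: dict mapping each action a to its current estimate Q_t(a)
    # Explore: with probability epsilon, pick an action uniformly at random
    if random.random() < epsilon:
        return random.choice(list(Q.keys()))
    # Exploit: otherwise pick argmax_a Q_t(a)
    return max(Q, key=Q.get)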

19
New cards

α (Step-size Parameter)

Learning-rate parameter for incremental updates.
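
For reference, the generic incremental update this parameter controls has the form NewEstimate ← OldEstimate + α [Target − OldEstimate], e.g. Q_{n+1} = Q_n + α (R_n − Q_n) for a single action's estimate.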

20
New cards

γ (Discount-rate Parameter)

Factor that discounts future rewards (0 ≤ γ ≤ 1).

21
New cards

λ (Lambda)

Decay-rate parameter for eligibility traces.

22
New cards

𝟙(predicate)

Indicator function: 1 if predicate is true, else 0.

23
New cards

k

Number of actions (arms) in a multi-armed bandit.

24
New cards

t

Discrete time step or play number.

25
New cards

q*(a)

True (expected) reward of action a in a bandit problem.

26
New cards

Q_t(a)

Estimate at time t of q*(a).

27
New cards

N_t(a)

Number of times action a has been selected up to time t.

28
New cards

H_t(a)

Learned preference for selecting action a at time t (preference-based methods).
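
For reference, gradient-bandit methods typically convert preferences into probabilities with a soft-max: π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)}.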

29
New cards

π_t(a)

Probability of selecting action a at time t.

30
New cards

R̄_t

Running estimate of expected reward at time t.

31
New cards

s

State in a Markov Decision Process (MDP).

32
New cards

s′ (or s*)

Next state after a transition.

33
New cards

a

Action taken by the agent.

34
New cards

r

Reward received after a transition.

35
New cards

S

Set of all non-terminal states.

36
New cards

S+

Set of all states including the terminal state.

37
New cards

A(s)

Set of actions available in state s.

38
New cards

R

Set of all possible rewards (finite subset of ℝ).

39
New cards

|S|

Number of states in set S (cardinality).

40
New cards

T

Final time step of an episode.

41
New cards

A_t

Action taken at time step t.

42
New cards

S_t

State occupied at time step t.

43
New cards

R_t

Reward received at time step t.

44
New cards

π(s) (Deterministic Policy)

Action chosen in state s under a deterministic policy.

45
New cards

π(a | s) (Stochastic Policy)

Probability of taking action a in state s under policy π.

46
New cards

G_t

Return (cumulative, possibly discounted reward) following time t.
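
Worked form, for reference: G_t ≐ R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}, which satisfies the recursion G_t = R_{t+1} + γG_{t+1}.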

47
New cards

h (Horizon)

Look-ahead time step used in forward-view methods.

48
New cards

G_{t:t+n}

n-step return from t+1 through t+n (discounted and possibly corrected).
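
Worked form, for reference: G_{t:t+n} ≐ R_{t+1} + γR_{t+2} + … + γ^{n−1}R_{t+n} + γⁿ V_{t+n−1}(S_{t+n}), i.e. n actual rewards plus a bootstrapped value of the state reached after n steps.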

49
New cards

p(s′, r | s, a)

Probability of moving to state s′ and receiving reward r after (s, a).

50
New cards

p(s′ | s, a)

Transition probability from state s to s′ under action a (reward ignored).

51
New cards

r(s, a)

Expected immediate reward after taking action a in state s.

52
New cards

r(s, a, s′)

Expected reward on transition (s, a) → s′.
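
For reference, the two- and three-argument forms follow from the four-argument dynamics: p(s′ | s, a) = Σ_r p(s′, r | s, a), r(s, a) = Σ_{s′, r} r · p(s′, r | s, a), and r(s, a, s′) = Σ_r r · p(s′, r | s, a) / p(s′ | s, a).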

53
New cards

v_π(s)

State-value: expected return from state s following policy π.

54
New cards

v*(s)

Optimal state-value: maximum expected return from state s.

55
New cards

q_π(s, a)

Action-value: expected return from (s, a) following policy π.

56
New cards

q*(s, a)

Optimal action-value: maximum expected return from (s, a).
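
For reference, these value functions are linked by standard identities: v_π(s) = Σ_a π(a|s) q_π(s, a), v*(s) = max_a q*(s, a), and q*(s, a) = Σ_{s′, r} p(s′, r | s, a) [r + γ v*(s′)].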

57
New cards

V_t

Array (vector) of current estimates of v_π(s) or v*(s).

58
New cards

Q_t

Array of current estimates of q_π(s, a) or q*(s, a).

59
New cards

V̄_t(s)

Expected approximate value at s: V̄_t(s) ≐ Σ_a π(a|s) Q_t(s, a).

60
New cards

U_t

Target used for updating an estimate at time t.

61
New cards

δ_t (TD Error)

Temporal-difference error at time step t: δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t).
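
For reference, the tabular TD(0) update built on this error is V(S_t) ← V(S_t) + α δ_t.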

62
New cards

n (n-step Methods)

Number of steps of bootstrapping before using a value estimate.

63
New cards

d

Dimensionality of weight vector w in function approximation.

64
New cards

w

Weight vector parameterizing an approximate value function.

65
New cards

v̂(s, w)

Approximate value of state s given weights w.

66
New cards

q̂(s, a, w)

Approximate action value for (s, a) given weights w.

67
New cards

∇v̂(s, w)

Gradient of v̂(s, w) with respect to w (column vector).
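
For reference, the semi-gradient TD(0) weight update combines several of these symbols: w_{t+1} ≐ w_t + α δ_t ∇v̂(S_t, w_t).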

68
New cards

x(s)

Feature vector observed in state s.

69
New cards

x(s, a)

Feature vector observed for pair (s, a).

70
New cards

wᵀx

Inner (dot) product between vectors w and x.
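
In the linear case, for reference: v̂(s, w) ≐ wᵀx(s) = Σ_{i=1}^{d} w_i x_i(s), so that ∇v̂(s, w) = x(s).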

71
New cards

z_t

Eligibility-trace vector at time t.
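
For reference, the accumulating trace is typically updated as z_t ≐ γλ z_{t−1} + ∇v̂(S_t, w_t), with z_{−1} = 0.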

72
New cards

θ

Parameter vector defining a (possibly stochastic) target policy.

73
New cards

π(a | s, θ)

Probability of action a in state s under parameters θ.

74
New cards

J(θ)

Performance objective of policy parameter θ (e.g., expected return).
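
For reference, one common update that ascends an estimate of ∇J(θ) is the REINFORCE rule θ_{t+1} ≐ θ_t + α G_t ∇ln π(A_t | S_t, θ_t), shown here without the γ^t discounting factor that some formulations include.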

75
New cards

b(a | s)

Behavior policy used to generate experience while learning.

76
New cards

ρ_{t:h}

Importance-sampling ratio from time t through h.
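
Worked form, for reference: ρ_{t:h} ≐ Π_{k=t}^{h} π(A_k | S_k) / b(A_k | S_k), the product of per-step probability ratios between the target and behavior policies.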

77
New cards

μ(s)

On-policy distribution over states under policy π.

78
New cards

A (TD Matrix)

Expected matrix E[x_t (x_t − γx_{t+1})ᵀ] used in linear TD theory.

79
New cards

b (TD Vector)

Expected vector E[R_{t+1} x_t] in linear TD.

80
New cards

w_TD

Fixed-point weight vector solving Aw = b (TD solution).
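
A minimal Python/NumPy sketch of computing this fixed point once estimates of A and b are available (illustrative; the variable names and the toy numbers are assumptions):

import numpy as np

# A_hat: d x d estimate of E[x_t (x_t - gamma * x_{t+1})^T]
# b_hat: d-vector estimate of E[R_{t+1} * x_t]
A_hat = np.array([[1.0, 0.2],
                  [0.1, 0.8]])
b_hat = np.array([0.5, 0.3])

# TD (LSTD-style) solution: the weight vector satisfying A w = b
w_td = np.linalg.solve(A_hat, b_hat)
print(w_td)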

81
New cards

I

Identity matrix.

82
New cards

P

State-transition probability matrix under policy π.

83
New cards

D

Diagonal matrix with μ(s) on its diagonal.

84
New cards

X

Matrix whose rows are feature vectors x(s).
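
For reference, in linear TD theory these matrices combine as A = Xᵀ D (I − γP) X and b = Xᵀ D r_π, where r_π (not itself on these cards) is the vector of expected one-step rewards under π.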

85
New cards

δ̄_w(s) (Bellman Error)

Expected TD error at state s under weights w.

86
New cards

VE(w) (Value Error)

Mean-square difference between v̂(s,w) and true value v_π(s).
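
Worked form, for reference: VE(w) ≐ Σ_s μ(s) [v_π(s) − v̂(s, w)]².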

87
New cards

BE(w) (Bellman Error, MSE)

Mean-square Bellman error: E[δ̄_w(s)²].

88
New cards

PBE(w)

Mean-square projected Bellman error (after projection onto feature space).

89
New cards

TDE(w)

Mean-square temporal-difference error: E[δ_t²].

90
New cards

RE(w)

Mean-square return error: expected squared error between n-step returns and v̂.