Basic Notation

Description and Tags

Basic RL notation

Last updated 10:29 AM on 3/24/25

11 Terms

1

What is a state?

State = s ∈ S

  • the state s represents a “snapshot” of the environment; it includes the information an agent needs to take an action in this state

  • s could, for example, represent a position in a maze (a small sketch of this follows below)

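As an illustration of the maze bullet above, here is a minimal Python sketch; the 3×3 grid, the `State` tuple, and the set name `S` are assumptions made for the example, not part of the notation itself.

```python
from typing import NamedTuple

class State(NamedTuple):
    """A 'snapshot' of a tiny grid maze: the agent's (row, col) position."""
    row: int
    col: int

# S: the set of all states in an assumed 3x3 maze
S = {State(r, c) for r in range(3) for c in range(3)}

s = State(0, 0)   # the agent starts in the top-left corner
assert s in S     # s is an element of S
```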
2

What is an action?

Action = a ∈ A(s)

  • the action a is a choice the agent can make within the environment at a given state; A(s) represents the set of all actions available in state s (a small example follows below)

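A minimal sketch of A(s) for the same assumed 3×3 maze; the move names and the helper `available_actions` are illustrative choices, not standard notation.

```python
# Assumed 3x3 maze: moves that would leave the grid are not available.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def available_actions(s):
    """A(s): the set of actions the agent can take in state s = (row, col)."""
    return {
        name
        for name, (dr, dc) in MOVES.items()
        if 0 <= s[0] + dr < 3 and 0 <= s[1] + dc < 3
    }

print(available_actions((0, 0)))   # {'down', 'right'} (set order may vary)
```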
3

What is a policy?

Policy: a = π(s)

  • This indicates that the agent takes action a according to the policy π at state s. In a deterministic policy this is always the same action a; in a stochastic policy the action is chosen according to probabilities (both cases are sketched below)

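A minimal sketch of the deterministic/stochastic distinction; the action names and probabilities are made up for illustration.

```python
import random

# Deterministic policy: a = pi(s) is always the same action for a given state.
def pi_deterministic(s):
    return "right" if s[1] < 2 else "down"

# Stochastic policy: the action is sampled from a distribution over actions.
def pi_stochastic(s):
    actions = ["up", "down", "left", "right"]
    probs = [0.1, 0.2, 0.1, 0.6]   # made-up probabilities for illustration
    return random.choices(actions, weights=probs)[0]

print(pi_deterministic((0, 0)))   # always 'right'
print(pi_stochastic((0, 0)))      # usually 'right', but not always
```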
4

What is the reward?

Reward (r) = r(s, a) or r_t

  • this is a feedback value returned to the agent after taking action a in state s. It indicates how good the taken action was with respect to the goal

    • Think of something like +1 for moving towards the goal and -1 for moving away (a toy version is sketched below)

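A toy reward function in the spirit of the +1/-1 bullet; the goal cell, grid size, and move encoding are assumptions for the example.

```python
GOAL = (2, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def reward(s, a):
    """r(s, a): +1 for reaching the goal, -1 for hitting the maze boundary, 0 otherwise."""
    dr, dc = MOVES[a]
    next_s = (s[0] + dr, s[1] + dc)
    if not (0 <= next_s[0] < 3 and 0 <= next_s[1] < 3):
        return -1.0   # moved into the maze boundary
    if next_s == GOAL:
        return 1.0    # moved onto the goal cell
    return 0.0        # ordinary step

print(reward((2, 1), "right"))   # 1.0
```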
5

What is a value function?

Value function = V(s)

  • The value function V(s) represents the cumulative reward the agent can expect to collect when starting from state s and following a certain policy π (written out below)

    • V(s) indicates how good it is to be in state s in terms of future rewards (reaching the goal)

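Written out as a formula (the standard definition; G_t is the return defined in a later card, and E_π denotes the expectation when the agent follows policy π):

V(s) = E_π[ G_t | S_t = s ] = E_π[ r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … | S_t = s ]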
6

What is an action-value function?

Action-value function (Q value) = Q(s,a)

  • The action-value function Q(s,a) represents the cumulative reward obtained by taking action a in state s and then following policy π

    • In Q-learning, Q(s, a) gives a value for each state-action pair, guiding the agent's decisions (a one-step update is sketched below)

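A minimal sketch of the one-step Q-learning update mentioned above; the nested-dict table layout, the learning rate alpha, and γ = 0.9 are assumptions for the example.

```python
from collections import defaultdict

# Q table: Q[s][a], defaulting to 0.0 for unseen state-action pairs.
Q = defaultdict(lambda: defaultdict(float))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update for a single transition (s, a, r, s')."""
    best_next = max(Q[s_next].values(), default=0.0)   # max over a' of Q(s', a')
    target = r + gamma * best_next                      # r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (target - Q[s][a])               # move Q(s, a) toward the target

q_learning_update(Q, s=(2, 1), a="right", r=1.0, s_next=(2, 2))
print(Q[(2, 1)]["right"])   # 0.1 after one update (alpha * target)
```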
7

What is the discount factor?

Discount factor γ

  • The discount factor is a number between 0 and 1 that determines the importance of future rewards relative to immediate rewards. A γ close to 1 means the agent values future rewards nearly as much as immediate rewards; a γ close to 0 makes it focus mostly on immediate rewards (see the weights below)

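A small sketch of how γ weights future rewards; the specific γ values are just examples.

```python
# Weight given to a reward k steps in the future: gamma ** k
for gamma in (0.5, 0.9, 0.99):
    weights = [round(gamma ** k, 3) for k in range(5)]
    print(gamma, weights)
# gamma = 0.5  -> far-future rewards count for very little (myopic agent)
# gamma = 0.99 -> future rewards are worth nearly as much as immediate ones
```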
8

What is the return?

Return (G) = G_t

  • This is the total accumulated reward from time step t onwards, typically discounted

    • G_t represents the (discounted) sum of rewards the agent expects to receive from time step t onward (computed in the sketch below)

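A minimal sketch of computing a discounted return from a list of rewards; the reward sequence and γ are illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 1 arriving at the third step is discounted by gamma^2
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))   # ~0.81
```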
9

What is the Bellman equation?

Bellman equation: V(s) = E[r(s,a) + γ V(s')]

  • This is a recursive equation for computing the value of a state: it expresses the state value as the expected immediate reward plus the discounted value of the next state (a tabular backup is sketched below)

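A minimal tabular sketch of one Bellman backup for a fixed policy, assuming the transition probabilities P, the reward function r, and the value table V are known and stored as plain Python objects (these names are assumptions for the example, not fixed notation).

```python
def bellman_backup(s, pi, P, r, V, gamma=0.9):
    """V(s) <- E[r(s, a) + gamma * V(s')] with a = pi(s),
    where P[(s, a)] maps next states s' to their probabilities."""
    a = pi(s)                                      # action chosen by the policy
    return sum(
        p * (r(s, a) + gamma * V[s_next])          # reward + discounted next value
        for s_next, p in P[(s, a)].items()         # expectation over the transition model
    )
```

Repeating this backup for every state (iterative policy evaluation) makes V converge, for γ < 1, to the values defined by the Bellman equation.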
10

When is a method shallow?

When it only uses information from a single transition

11

When is a method wide?

When it considers all actions that can be taken in a state