Basic Notation



Basic RL notation


11 Terms

1

What is a state?

State = s ∈ S

  • the state s represents a “snapshot” of the environment; it includes the information an agent needs to take an action in that state, while S denotes the set of all states

  • s could for example represent a position in a maze

2

What is an action?

Action = a ∈ A(s)

  • the action a is a choice the agent can make in the environment at a given state; A(s) denotes the set of all actions available in state s

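  • For example, in a simple grid-world maze the available actions might be A(s) = {up, down, left, right}, with fewer choices in states next to a wall
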
3

What is a policy?

Policy a = π(s)

  • This indicates that the agent takes action a according to the policy π at state s; with a deterministic policy this is always the same action a, while with a stochastic policy the action a is chosen according to a probability distribution

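  • For example, for a given maze cell s a deterministic policy might give π(s) = right, while a stochastic policy might assign probabilities such as π(right | s) = 0.8 and π(down | s) = 0.2
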
4

What is the reward?

Reward (r) = r(s,a) or rt

  • this is a feedback value that is returned to the agent after taking action a in state s; it indicates how good the chosen action was with respect to the goal

    • Think of something like +1 for moving towards the goal and -1 for moving away from it

5

What is a value function?

Value function = V(s)

  • The value function V(s) represents the expected cumulative reward the agent can obtain starting from state s and following a certain policy π

    • V(s) indicates how good it is to be in state s in terms of future rewards (e.g. reaching the goal)

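  • Using the return Gt (defined in a later card), one common way to write this is Vπ(s) = E[Gt | st = s], i.e. the expected discounted sum of future rewards when starting in s and following π
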
6

What is an action-value function?

Action-value function (Q value) = Q(s,a)

  • The action-value function Q(s,a) represents the expected cumulative reward when taking action a in state s and then following policy π

    • In Q-learning, Q(s, a) gives a value for each state-action pair, guiding the agent's decisions

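  • In the same style as V(s), this can be written as Qπ(s,a) = E[Gt | st = s, at = a]; a greedy agent then simply picks the action with the highest value, a = argmax over a of Q(s,a)
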
7

What is the discount factor?

Discount factor γ

  • The discount factor is a number between 0 and 1 that determines the importance of future rewards relative to immediate rewards. A γ close to 1 means the agent values future rewards nearly as much as immediate rewards

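  • As a quick worked example, assuming γ = 0.9: a reward received k steps in the future is weighted by 0.9^k, so a reward of +10 three steps away contributes about 0.9³ × 10 ≈ 7.3 to the return
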
8

What is the return?

Return (G) = Gt

  • This is the total accumulated reward from time step t onwards, typically discounted

    • Gt represents the sum of rewards the agent expects to receive from time step t onward

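  • Written out with the discount factor γ, one common convention is Gt = rt+1 + γ·rt+2 + γ²·rt+3 + …, where rt+1 is the reward received at time step t+1 (some texts start the sum at rt instead)
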
9

What is the Bellman equation?

Bellman equation = V(s) = E[r(s,a) + γ V(s')]

  • This is a recursive equation that helps compute the value of a state. It expresses the state value as the expected reward plus the discounted value of the next state

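  • As a minimal numerical sketch, assuming a deterministic transition (so the expectation drops out): if the action taken in s gives r = 1 and leads to s' with V(s') = 5, and γ = 0.9, then V(s) = 1 + 0.9 × 5 = 5.5
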
10

When is a method shallow?

When it only uses information from a single transition

11

When is a method wide?

When it considers all actions that can be taken in a state