Lecture 17 - Dopamine + Reinforcement Learning I


1
New cards

Recap: What falls under representation, valuation, action selection, and outcome evaluation?

representation: sensory representations, uncertainty estimation, Bayesian integration

valuation: value representations in PFC, availability/desirability

action selection: action selection in striatum

outcome evaluation: confidence/metacognition

2
New cards

Recap: What are the 3 types of learning?

supervised: learning from labelled data - accuracy is the feedback

unsupervised: learning structure in the data - distance/structure metric is the feedback

reinforcement: learning from trial and error - sparse reward or punishment as feedback

  • reinforcement learning also introduces the idea of the agent taking actions

3
New cards

What is the learning signal in supervised learning? Describe the process of supervised learning. Is it useful conceptually or for biological systems?

  1. you have one data point and one label

  2. you get an error (target - prediction)

  3. error gets back propagated into network

  4. use the error to figure out how to change the network to give the correct output

  5. the learning signal for a given neuron is dependent on other downstream neurons in the network

NOT USEFUL FOR BIOLOGICAL SYSTEMS SINCE YOU NEED TO KNOW ALL OTHER NEURONS DOWNSTREAM TO UPDATE A WEIGHT OF A NEURON

  • useful conceptually 
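
A minimal sketch of this loop for a single linear unit; the data point, label and learning rate are made-up values (not from the lecture). The point is that the error (target − prediction) drives the weight change, which only scales to deep networks if the error can be propagated through every downstream neuron:

```python
# Supervised delta-rule update for one linear unit (illustrative values).
w, lr = 0.0, 0.1
x, target = 2.0, 1.0                 # one data point and its label
for _ in range(100):
    prediction = w * x
    error = target - prediction      # learning signal: target - prediction
    w += lr * error * x              # gradient step on the squared error
print(round(w * x, 3))               # the prediction approaches the label (1.0)
```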

4
New cards

What is the agent and environment in reinforcement learning?

the agent has to take a number of actions and obtains sparse rewards

the agent has to learn the value of states and actions that lead to the reward

agent → actions → environment

environment → rewards/observations → agent

5
New cards

What is intracranial self-stimulation? What experiment was done with rats?

stimulation of the medial forebrain bundle (MFB)

when the rats press on the lever, they receive an electrical stimulation

  • they were not food or water deprived

the rat self-stimulates → it will press the lever repeatedly when the press is paired with stimulation

  • they work for this stimulation reward

6
New cards

What neurons were targeted in the self-stimulation experiment?

the dopaminergic system

  • ventral tegmental area

  • substantia nigra

    • their axons are activated and release dopamine

7
New cards

What diseases is dopamine implicated in? What is a treatment option?

Parkinson’s disease arises from degeneration of dopamine neurons (substantia nigra)

  • motor disorder

treatments include L-Dopa which is a metabolic precursor of dopamine

forcing dopamine neuron activation can lead to recovery in the dopaminergic system

8
New cards

How do addictive drugs act on the dopamine system?

many addictive drugs act by increasing dopamine levels through different mechanisms

9
New cards

Describe experiment (part 1). What type of neuron is being recorded?

dopamine neuron from ventral tegmental area

response of the neuron to an unpredicted liquid reward

strong response after reward delivery

THERE IS REWARD CODING IN THESE NEURONS

10
New cards

What type of learning is it where we give a cue (CS) paired with a reward (US)?

classical conditioning

  • Pavlovian

11
New cards

What is shown from the experiment? When does the neuron fire?

after association, neuron responds to cue that predicts the reward

  • increase in firing after cue

  • no response at the time of the reward delivery

12
New cards

What occurs in this trial?

  • increase in response after cue

  • decrease in firing at the time of the expected reward when the reward is not delivered

    • if the reward is omitted there is a dip in the dopamine activity at the time the reward should have been delivered

13
New cards

What is the Reward Prediction Error (RPE)?

actual value/reward - expected value/reward

comparison between what you expect and what you get

NOT AN ABSOLUTE VALUE SIGNAL

14
New cards

What is the model we use for reward prediction error?

use prediction → receive outcome → reward = prediction → keep prediction unchanged

OR

use prediction → receive outcome → reward ≠ prediction → error → update prediction

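A minimal sketch of this update loop as a simple delta rule; the learning rate and reward value are illustrative assumptions:

```python
# Prediction-error update: keep the prediction when it matches the reward,
# nudge it by the error when it does not (alpha and reward are made up).
alpha = 0.1
prediction = 0.0
for _ in range(50):
    reward = 5.0
    error = reward - prediction      # RPE: actual - expected
    prediction += alpha * error      # an error of 0 leaves the prediction unchanged
print(round(prediction, 2))          # approaches 5.0
```
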
15
New cards

You are expecting a $5 reward. You receive instead a $2 reward. How would you expect one of your dopamine neurons to respond?

decrease in activity, as you receive a worse reward than expected

  • positive value reward but negative prediction error

RPE: 2 - 5 = -3

16
New cards

You are expecting a $5 loss. You instead lose $2. How would you expect one of your dopamine neurons to respond?

RPE: -2 - (-5) = 3

increase in activity as you receive a smaller loss

17
New cards
<p>How do you optimize your actions to get to goal? What 3 estimations are required?</p>

How do you optimize your actions to get to goal? What 3 estimations are required?

  1. learning signal

  • what feedback signal should we use?

  • need to learn while propagating value across space

  2. value of each state

  • what is the definition of value for each state?

  3. choose the action that maximizes future value

  • which action should we choose?

18
New cards

How do we determine the value of each state?

state is usually location in space/time but can be defined in a more abstract space

we have a value estimation (value function)

the policy is: given the value of the states, which action should we take

19
New cards

What is the TD error? What is it similar to?

TD error measures the difference between what an agent expected to happen and what actually happened at the next time step

  • used to update the value estimate so that future predictions become more accurate

the reward prediction error in dopamine neurons is similar to the Temporal Difference (TD) error in reinforcement learning algorithms

20
New cards

What is the ‘critic’ (value function)?

  1. the critic uses state and reward information to compute the TD error

  2. the TD error is used to update the value of the states

21
New cards

What is value of a state?

the value of a state is the sum of the values that can be reached from this state, discounted by how far they are into the future

e.g. get reward now → full value

get reward in future → lower value amount

further in future = lower value (like graph)
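
A small sketch of this idea with made-up numbers: each reward reachable from the state is weighted by a power of the discount factor γ before being summed:

```python
# Discounted value of a state; the reward sequence and gamma are illustrative.
gamma = 0.9
future_rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # rewards k steps into the future
value = sum(gamma**k * r for k, r in enumerate(future_rewards))
print(round(value, 3))   # 0.9**2 * 1 + 0.9**4 * 2 = 2.122
```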

22
New cards

Why should we discount the value of future states?

uncertainty

  • you're not sure that you'll actually get the reward

  • your model of the world further into the future is less predictable

23
New cards

What is the TD error formula?

TD error = actual value (reward + discounted expected future value) − expected value of the current state

δt = rt + γV^(st+1) − V^(st)

  • rt = reward at time t

  • γ = discount factor (discounting into the future)

  • V^(st+1) = expected future value of the next state

  • compare this to V^(st), the current expected value of the state

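Plugging made-up numbers into the formula, assuming a reward of 1.0, a current value estimate of 0.5 and a next-state estimate of 0.8:

```python
# One-step TD error with illustrative numbers.
gamma = 0.9
r_t = 1.0          # reward received at time t
V_curr = 0.5       # V^(st): expected value of the current state
V_next = 0.8       # V^(st+1): expected value of the next state
td_error = r_t + gamma * V_next - V_curr
print(round(td_error, 2))   # 1.0 + 0.72 - 0.5 = 1.22
```
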
24
New cards

What is the policy (actor)?

actor: choose action that maximizes future values

25
New cards

What are different policies given the same value function?

greedy policy: always pick the most valuable future state

explore: sometimes pick another state to check whether higher-value states exist elsewhere
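
A small sketch of these two policies, with an ε-greedy rule standing in for the explore variant; it assumes the values of the candidate next states are already known, and the values and ε are made up:

```python
import random

next_state_values = {"left": 0.2, "right": 0.9, "up": 0.5}

def greedy(values):
    return max(values, key=values.get)        # always the most valuable next state

def epsilon_greedy(values, epsilon=0.1):
    if random.random() < epsilon:             # occasionally explore at random
        return random.choice(list(values))
    return greedy(values)                     # otherwise exploit

print(greedy(next_state_values))              # 'right'
print(epsilon_greedy(next_state_values))      # usually 'right', sometimes another state
```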

26
New cards

Explain this action and update.

the TD error is used as the learning signal

previous value is updated given the learning rate (alpha) and the prediction error

it is a local update that can be done before receiving the final outcome

  • you only need to know value of next state NOT the final state

  1. you start at state t

  2. you want to move towards the action/state where you get the reward

  3. you move to t + 1, from which you will get the reward in the future

  4. use the TD error to get an updated value function

  • t + 1 has more value than t (t has 0 reward)

  5. t now has a value since it leads to a valuable state

CREATES A PATH TO A REWARD
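
A minimal sketch of this local update on a made-up 1-D track whose last state holds the reward; α, γ and the track length are illustrative choices, not values from the lecture:

```python
# Tabular TD(0) on a short track: value spreads backwards from the reward.
alpha, gamma = 0.5, 0.9
n_states = 5
V = [0.0] * n_states
goal = n_states - 1                 # reward is delivered on entering this state

for episode in range(50):
    for s in range(n_states - 1):   # walk right along the track
        s_next = s + 1
        r = 1.0 if s_next == goal else 0.0
        td_error = r + gamma * V[s_next] - V[s]   # needs only the NEXT state
        V[s] += alpha * td_error                  # local update, before the goal is reached
print([round(v, 2) for v in V])     # ~[0.73, 0.81, 0.9, 1.0, 0.0]: a path to the reward
```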

27
New cards

What is TD learning? What is bootstrapping? Is this compatible with biological systems?

TD learning = learning a guess from a guess

bootstrapping = update the estimate based on the estimated value of other states 

TD errors are computed at each timestep (moment by moment)

MORE REALISTIC FOR BIOLOGICAL MODELS

28
New cards

What is TD error equation and what does it mean?

  • δt: temporal-difference prediction error

    • Reflected in dopamine firing

      • Positive = better than expected

      • Zero = as expected

      • Negative = worse than expected

  • rt: reward received at time t

    • Actual outcome

  • V^(st): predicted future value of the current state

    • What the brain expects before seeing the outcome

  • γ: discount factor

    • How much the agent values future rewards

  • V^(st+1): predicted future value of the next state

    • How valuable the next moment is expected to be
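
Written out with the terms above (in the same notation as the following cards), the full TD error is:

δt = rt + γV^(st+1) − V^(st)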

29
New cards

What happens at the CS (before learning)? What happens at reward (before learning)?

δ(CS) = γV^(st+1) − V^(st) = 0

  • CS does not predict reward 

  • predicted value = 0

  • no dopamine response

δ(R) = rt​−V^(st​) = rt​−0 = rt​

  • reward is unexpected

  • predicted value is 0

  • large dopamine burst at reward

30
New cards

What happens at the CS (after learning)? What happens at the Reward (after learning)?

δ(CS)=γV^(st+1) − V^(st) ≈ positive jump

  • CS now predicts the reward

  • large dopamine burst at CS

δ(R) = rt − V^(st) ≈ rt−rt = 0

  • Reward is fully expected

  • No dopamine burst at reward

31
New cards

What happens at the CS (after learning)? What happens at the Reward (after learning)? NO REWARD THIS TIME

δ(CS)=γV^(st+1) − V^(st) ≈ positive jump

  • CS now predicts the reward

  • large dopamine burst at CS

δ(R) = 0−V^(st) = −V^(st)

  • The animal expects a reward because of the CS

  • But reward does not come

  • Dip in dopamine firing (below baseline)
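
A minimal TD(0) simulation reproducing the pattern in the last three cards: a burst at the reward before learning, a burst at the CS (and none at the reward) after learning, and a dip when the reward is omitted. Trial length, α, γ and the reward size are made-up choices:

```python
import numpy as np

n_steps = 5                 # time steps from CS onset (t=0) to reward (t=4)
alpha, gamma = 0.2, 1.0
V = np.zeros(n_steps + 1)   # value of each within-trial state; terminal value stays 0
V_baseline = 0.0            # pre-CS baseline; CS timing is unpredictable, so this stays 0

def run_trial(reward_delivered=True):
    """Run one trial and return the TD error (the model's RPE) at each time step."""
    deltas = [gamma * V[0] - V_baseline]      # TD error at CS onset
    for t in range(n_steps):
        r = 1.0 if (reward_delivered and t == n_steps - 1) else 0.0
        delta = r + gamma * V[t + 1] - V[t]   # δt = rt + γV^(st+1) − V^(st)
        V[t] += alpha * delta                 # bootstrapped value update
        deltas.append(delta)
    return np.round(deltas, 2)

print("before learning:", run_trial())        # burst at reward time, nothing at CS
for _ in range(300):
    run_trial()                               # repeated CS-reward pairings
print("after learning: ", run_trial())        # burst at CS, ~0 at reward time
print("reward omitted: ", run_trial(False))   # dip below baseline at reward time
```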

32
New cards

How do dopamine neurons respond to rewards of different sizes?

graded response

more water than expected = larger positive RPE

smaller amount of water than expected = negative RPE

33
New cards

Why is there a negative prediction error to small rewards?

smaller reward than expected

  • compared to average size of reward, you’re getting something smaller

34
New cards

Is prediction error an absolute value?

no! it is not an absolute measure of reward vs. punishment

depends on expected value

  • relative to the expectation

  • can change with baseline

35
New cards

Recap of reinforcement learning to optimize rewards:

  1. assign value to states in the world

  2. the value assigned to a state combines its immediate reward and the discounted values of future states

  3. use the learning signal (TD error) to compare the reward plus the discounted expected value of the next state to the current expected value of the state

  4. increase or decrease the values of states accordingly