Recap: What falls under representation, valuation, action selection, and outcome evaluation?
representation: sensory representations, uncertainty estimation, Bayesian integration
valuation: value representations in PFC, availability/desirability
action selection: action selection in striatum
outcome evaluation: confidence/metacognition
Recap: What are the 3 types of learning?
supervised: learning from labelled data - accuracy is the feedback
unsupervised: learning structure in the data - distance/structure metric is the feedback
reinforcement: learning from trial and error - sparse reward or punishment as feedback
reinforcement also introduces the idea of actions taken by the agent
What is the learning signal in supervised learning? Describe process in supervised learning. Is it useful conceptually or for biological systems?
you have one data point and one label
you get an error (target - prediction)
error gets back propagated into network
use the error to figure out how to change the network so it gives the correct output
the learning signal for a given neuron is dependent on other downstream neurons in the network
NOT USEFUL FOR BIOLOGICAL SYSTEMS SINCE YOU NEED TO KNOW ALL OTHER NEURONS DOWNSTREAM TO UPDATE A WEIGHT OF A NEURON
useful conceptually
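A minimal toy sketch (not from the lecture, names and numbers are illustrative) of the supervised learning signal: the error is target minus prediction, and the weight update depends on that error. In a multi-layer network this error has to be propagated back through all downstream neurons, which is the biologically implausible part.

```python
# Toy one-weight "neuron" trained with a supervised error signal (illustrative sketch).
def supervised_step(w, x, target, lr=0.1):
    prediction = w * x
    error = target - prediction          # learning signal: target - prediction
    w = w + lr * error * x               # gradient step on squared error
    return w, error

w = 0.0
for _ in range(20):
    w, err = supervised_step(w, x=1.0, target=2.0)
print(round(w, 3))                       # approaches 2.0 as the error shrinks
```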
What is the agent and environment in reinforcement learning?
the agent has to take a number of actions and obtains sparse rewards
the agent has to learn the value of states and action that lead to the reward
agent → actions → environment
environment → rewards/observations → agent
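A minimal sketch of this agent-environment loop, with toy dynamics and a sparse reward (all names and numbers are illustrative assumptions, not from the lecture):

```python
import random

def environment(state, action):
    next_state = state + action                  # toy dynamics: move left/right on a line
    reward = 1.0 if next_state == 3 else 0.0     # sparse reward at one location
    return next_state, reward

state = 0
for t in range(10):
    action = random.choice([-1, +1])             # agent -> action -> environment
    state, reward = environment(state, action)   # environment -> reward/observation -> agent
    print(t, state, reward)
```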
What is intracranial self stimulation? What experiment with rats?
stimulation of the medial forebrain bundle (MFB)
when the rats press on the lever, they receive an electrical stimulation
they were not food or water deprived
the rat self-stimulates → it will press the lever repeatedly when pressing is paired with stimulation
they work for this stimulation reward
What neurons were targeted in the self-stimulation experiment?
the dopaminergic system
ventral tegmental area (VTA)
substantia nigra
axons are activated and dopamine is released
What diseases is dopamine implicated in? What is a treatment option?
Parkinson’s disease arises from degeneration of dopamine neurons (substantia nigra)
motor disorder
treatments include L-Dopa which is a metabolic precursor of dopamine
boosting dopamine synthesis in the remaining neurons can partially restore dopaminergic signalling
How do psychoactive/addictive drugs operate on dopamine?
many addictive drugs act by increasing dopamine levels through different mechanisms

Describe experiment (part 1). What type of neuron is being recorded?
dopamine neuron from ventral tegmental area
response of the neuron to an unpredicted liquid reward
strong response after reward delivery
THERE IS REWARD CODING IN THESE NEURONS
What type of learning is it where we give a cue (CS) paired with a reward (US)?
classical conditioning
Pavlovian

What is shown from the experiment? When does the neuron fire?
after association, neuron responds to cue that predicts the reward
increase in firing after cue
no response at the time of the reward delivery

What occurs in this trial?
increase in response after cue
decrease in firing at the time the reward was expected but not delivered
if the reward is omitted there is a dip in the dopamine activity at the time the reward should have been delivered
What is the Reward Prediction Error (RPE)?
actual value/reward - expected value/reward
comparison between what you expect and what you get
NOT AN ABSOLUTE VALUE SIGNAL
What is the model we use for reward prediction error?
use prediction → receive outcome → reward = prediction → keep prediction unchanged
OR
use prediction → receive outcome → reward ≠ prediction → error → update prediction

You are expecting a $5 reward. You receive instead a $2 reward. How would you expect one of your dopamine neurons to respond?
decrease in activity because you receive a worse reward than expected
positive value reward but negative prediction error
RPE: 2 - 5 = -3
You are expecting a $5 loss. You instead lose only $2. How would you expect one of your dopamine neurons to respond?
RPE: −2 − (−5) = +3
increase in activity as you receive a smaller loss
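Both worked examples as a tiny sketch (illustrative only): the RPE is the actual outcome minus the expected outcome, relative to expectation rather than absolute.

```python
def rpe(actual, expected):
    return actual - expected

print(rpe(2, 5))     # expected +$5, got +$2  -> -3 (dip in dopamine activity)
print(rpe(-2, -5))   # expected -$5, lost $2  -> +3 (increase in activity)
```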

How do you optimize your actions to get to goal? What 3 estimations are required?
1. learning signal: what feedback signal should we use? need to learn while propagating value across space
2. value of each state: what is the definition of value for each state?
3. action selection: which action should we choose? choose the action that maximizes future value
How do we determine the value of each state?
state is usually location in space/time but can be defined in a more abstract space
we have a value estimation (value function)
the policy is: given the values of the states, which action should we take
What is the TD error? What is it similar to?
TD error measures the difference between what an agent expected to happen and what actually happened at the next time step
used to update the value estimate so that future predictions become more accurate
the reward prediction error in dopamine neurons is similar to the Temporal Difference (TD) error in reinforcement learning algorithms
What is the ‘critic’ (value function)?
the critic uses state and reward information to compute the TD error
the TD error is used to update the value of the states
What is value of a state?
value of a state is the sum of the rewards that can be reached from this state, discounted by how far they are in the future
e.g. get reward now → full value
get reward in future → lower value amount
further in the future = lower value (as in the discounting curve)
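A minimal sketch of this discounting idea (the discount factor and reward sequences are illustrative assumptions):

```python
gamma = 0.9
reward_now   = [1, 0, 0, 0]        # reward arrives immediately
reward_later = [0, 0, 0, 1]        # reward arrives 4 steps into the future

value_now   = sum(gamma**k * r for k, r in enumerate(reward_now))    # -> 1.0 (full value)
value_later = sum(gamma**k * r for k, r in enumerate(reward_later))  # -> 0.9**3 = 0.729

print(value_now, value_later)      # the further away the reward, the lower the value
```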

Why should we discount the value of future states?
uncertainty
you are not sure you will actually get the reward
model of world in future is not predictable
What is the TD error formula?
TD error = actual value (reward + discounted expected future value) − expected value of the current state
δt = rt + γV̂(st+1) − V̂(st)
γ = discount factor (how much the future is discounted)
compare this to the current expected value of the state, V̂(st)

What is the policy (actor)?
actor: choose action that maximizes future values
What are different policies given the same value function?
greedy policy: always pick the most valuable future state
explore: sometimes pick another state to ensure high value states don’t exist elsewhere
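A minimal sketch of these two policies over the same value function (the values and epsilon are illustrative assumptions):

```python
import random

next_state_values = {"left": 0.2, "right": 0.8}

def greedy(values):
    return max(values, key=values.get)            # always pick the most valuable next state

def epsilon_greedy(values, epsilon=0.1):
    if random.random() < epsilon:                 # occasionally explore another state...
        return random.choice(list(values))
    return greedy(values)                         # ...otherwise exploit the best-known one

print(greedy(next_state_values), epsilon_greedy(next_state_values))
```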

Explain this action and update.
the TD error is used as the learning signal
previous value is updated given the learning rate (alpha) and the prediction error
it is a local update that can be done before receiving the final outcome
you only need to know value of next state NOT the final state
you start on t
you want to move towards action where you have reward
you move to t + 1 where you’re going to have reward in future
use TD error to get updated value function
t + 1 has more reward than t (t has 0 reward)
t has a value now since it leads to a value state
CREATES A PATH TO A REWARD
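A minimal sketch of this update on a chain of states with a single reward at the end (alpha, gamma, and the chain length are illustrative assumptions). The TD error only needs the value of the next state, and repeated updates propagate value backwards to earlier states, creating the path to the reward:

```python
alpha, gamma = 0.5, 0.9
n_states = 5
V = [0.0] * n_states                   # value of each state, initially 0
reward = [0.0] * n_states
reward[-1] = 1.0                       # reward only in the final state

for episode in range(20):
    for t in range(n_states - 1):      # move one step to the right each timestep
        td_error = reward[t + 1] + gamma * V[t + 1] - V[t]
        V[t] = V[t] + alpha * td_error  # local update using only the next state's value
print([round(v, 2) for v in V])         # earlier states acquire discounted value
```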
What is TD learning? What is bootstrapping? Is this compatible with biological systems?
TD learning = learning a guess from a guess
bootstrapping = update the estimate based on the estimated value of other states
TD errors are computed at each timestep (moment by moment)
MORE REALISTIC FOR BIOLOGICAL MODELS
What is TD error equation and what does it mean?
δt = rt + γV̂(st+1) − V̂(st)
δt: temporal-difference prediction error
Reflected in dopamine firing
Positive = better than expected
Zero = as expected
Negative = worse than expected
rt: reward received at time t
Actual outcome
V̂(st): predicted value of the current state
What the brain expects before seeing the outcome
γ: discount factor
How much the agent values future rewards
V̂(st+1): predicted value of the next state
How valuable the next moment is expected to be

What happens at the CS (before learning)? What happens at reward (before learning)?
δ(CS) = γV̂(st+1) − V̂(st) = 0
CS does not predict reward
predicted value = 0
no dopamine response
δ(R) = rt − V̂(st) = rt − 0 = rt
reward is unexpected
predicted value is 0
large dopamine burst at reward
What happens at the CS (after learning)? What happens at the Reward (after learning)?
δ(CS) = γV̂(st+1) − V̂(st) ≈ positive jump
CS now predicts the reward
large dopamine burst at CS
δ(R) = rt − V̂(st) ≈ rt − rt = 0
Reward is fully expected
No dopamine burst at reward
What happens at the CS (after learning)? What happens at the Reward (after learning)? NO REWARD THIS TIME
δ(CS) = γV̂(st+1) − V̂(st) ≈ positive jump
CS now predicts the reward
large dopamine burst at CS
δ(R) = 0 − V̂(st) = −V̂(st)
The animal expects a reward because of the CS
But reward does not come
Dip in dopamine firing (below baseline)
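The three cases above can be checked with a tiny two-timestep sketch (CS then reward; gamma = 1 and the values are illustrative assumptions):

```python
def td_errors(v_pre_cs, v_post_cs, reward, gamma=1.0):
    delta_cs = gamma * v_post_cs - v_pre_cs    # δ(CS): no reward delivered at the CS itself
    delta_r  = reward - v_post_cs              # δ(R): reward compared to what the CS predicted
    return delta_cs, delta_r

print(td_errors(v_pre_cs=0.0, v_post_cs=0.0, reward=1.0))  # before learning: (0, +1) burst at reward
print(td_errors(v_pre_cs=0.0, v_post_cs=1.0, reward=1.0))  # after learning:  (+1, 0) burst at CS
print(td_errors(v_pre_cs=0.0, v_post_cs=1.0, reward=0.0))  # reward omitted:  (+1, -1) dip at reward time
```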
How do dopamine neurons respond to rewards of different sizes?
graded response
more water than expected = larger positive RPE
less water than expected = negative RPE
Why is there a negative prediction error to small rewards?
smaller reward than expected
compared to average size of reward, you’re getting something smaller
Is prediction error an absolute value?
no! it is not an absolute measure of reward vs. punishment
depends on expected value
relative to the expectation
can change with baseline
Recap of reinforcement learning to optimize rewards:
assign value to states in the world
a state's value is its immediate reward plus the discounted values of the future states it leads to
the learning signal compares (reward + discounted expected value of the next state) to the expected value of the current state
use this error to increase or decrease the values of states