Reinforcement Learning and Perceptual Learning

Introduction

Chi-Tat Law & Joshua I Gold's study explores how reinforcement learning accounts for associative and perceptual learning in a visual-decision task.
Improved perceptual performance corresponds to changes in how sensory information is interpreted for decision-making.
A reinforcement-learning rule shapes the functional connectivity between sensory and decision neurons.
The model is based on the readout of simulated responses from direction-selective sensory neurons in the middle temporal area (MT) of monkey cortex.
A reward prediction error guides changes in connections between sensory neurons and the decision process. This establishes associations between motion direction and response direction and improves perceptual sensitivity by strengthening connections from the most sensitive neurons.
The study suggests a common, feedback-driven mechanism for some forms of associative and perceptual learning.

Background

Perceptual sensitivity to simple sensory stimuli can improve with training, a phenomenon known as perceptual learning.
The neural basis of perceptual learning is not completely understood, especially for vision.
Improvements can occur at various stages of visual processing, from the primary visual cortex to higher-order sensory-motor areas.
The study explores the hypothesis that changes can be driven by a reinforcement signal that generates a selective readout of the most informative sensory neurons.
Reinforcement learning involves learning by trial and error to maximize reward and minimize punishment.
A reward-prediction error compares predicted and actual rewards, helping to form associations between sensory input and rewarded actions.
Dopamine neurons in the midbrain reflect this error signal.
A similar dopamine-based signal can drive changes in the auditory cortex, and some forms of visual perceptual learning require a reinforcement signal.

Methods and Model

The study examined whether a reinforcement-learning rule based on a reward-prediction error could account for both associative and perceptual learning on a direction-discrimination task.
Monkeys were trained to decide the direction of random-dot motion and respond with a saccadic eye movement.
Monkeys learned the association between motion direction and saccadic response and then learned to make accurate direction decisions using weaker motion stimuli.
Behavioral improvements corresponded to changes in the motion-driven responses of neurons in the lateral intraparietal area (LIP).
No apparent change was observed in the motion-driven responses of neurons in MT.
Reinforcement signals establish functional connections from MT-like sensory neurons to LIP-like decision neurons.
These connections are refined to more strongly weight inputs from the most informative sensory neurons, improving perceptual sensitivity.
The model explains the time course and asymptotic behavior of both associative and perceptual learning, changes related to the readout of the sensory representation, and the establishment and progression of motion-sensitive responses in decision-making neurons.
Reinforcement learning might be involved in establishing and shaping patterns of connectivity critical for forming perceptual decisions.
The model is based on a pooling scheme that relates the activity of MT-like neurons to a perceptual decision about motion direction and a reinforcement-learning rule that evaluates and adjusts this pooling process based on the reward outcome.
The pooling model has three stages: sensory representation in MT, pooling of the MT responses, and formation of the direction decision in LIP.
MT was modeled as a population of 7,200 neurons with 36 different direction tuning functions distributed uniformly around 360 degrees, with trial-by-trial responses to the motion stimulus and interneuronal correlations.
The cumulative, motion-driven response of each MT neuron ( $x_i$ ) was pooled by LIP as a weighted sum, $y = \sum_i w_i x_i$ (equation (1)), where $w_i$ is the pooling weight assigned to the $i$th MT neuron.
The pooled response, $y$ , is corrupted with Gaussian noise.
The direction decision was based on the arithmetic sign of the pooled response: rightward (motion directions between -90 and 90 degrees) for $y \geq 0$ and leftward otherwise.
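The pooling and decision stages can be sketched in a few lines. This is a minimal illustration rather than the study's implementation: the two-neuron population, the weight and response values, and the function names `pooled_response` and `decide` are all hypothetical.

```python
import random

def pooled_response(weights, responses, noise_sd=0.0, rng=random):
    """Weighted sum of MT responses (equation (1)) plus optional Gaussian noise."""
    y = sum(w * x for w, x in zip(weights, responses))
    if noise_sd > 0.0:
        y += rng.gauss(0.0, noise_sd)  # additive pooling noise
    return y

def decide(y):
    """Return +1 (rightward) when y >= 0 and -1 (leftward) otherwise."""
    return 1 if y >= 0 else -1

# Hypothetical toy population: one right-preferring and one left-preferring neuron.
weights = [0.5, -0.5]
responses = [12.0, 4.0]  # cumulative responses on a rightward-motion trial

y = pooled_response(weights, responses)  # noise switched off, so deterministic
choice = decide(y)
print(y, choice)
```

With the noise switched off, the right-preferring neuron dominates the sum and the sketch chooses rightward.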
The pooling weights from MT to LIP were initially random and then adjusted according to a reinforcement-learning rule after each trial.
A simple delta rule sets the pooling weights on trial $n+1$ ( $w_{n+1}$ ) from the weights on trial $n$ ( $w_n$ ) and an update term ( $\Delta w$ ): $w_{n+1} = w_n + \Delta w$ (equation (2)), with $\Delta w = a C (r - m E[r]) (x - n E[x])$ (equation (3)), where:
- $a$ is the learning rate.
- $C$ describes the choice (-1 for leftward and 1 for rightward).
- $r$ is the reward outcome (1 if a reward was given and 0 otherwise).
- $E[r]$ is the predicted probability of making a correct (rewarded) choice given the pooled response ( $y$ ).
- $x$ is the vector of MT responses.
- $E[x]$ is the vector of baseline MT responses.
- $m$ and $n$ are binary variables that determine the exact form of the rule used (here, $m = 1$ and $n = 0$ ). This $n$ is a switch in the update term and is distinct from the trial index $n$.
The difference between the actual and predicted reward is the reward prediction error, $r - E[r]$ .
Weight adjustments are based on the correlation between the reward prediction error and $x$ .
$C$ determines the sign of the adjustments. Neurons that respond more strongly to rightward motion will tend to acquire more positive weights, and neurons that respond more strongly to leftward motion more negative weights.
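A single application of the update term can be sketched as follows. This is an illustrative sketch only: the function name `delta_w`, the learning rate, the responses, and the predicted reward of 0.8 are hypothetical values, and the predicted reward is passed in as a plain number rather than computed.

```python
def delta_w(x, choice, reward, expected_reward, a=0.01, m=1, n=0, baseline=None):
    """Update term: a * C * (r - m*E[r]) * (x - n*E[x]), applied per neuron."""
    if baseline is None:
        baseline = [0.0] * len(x)  # E[x]; unused when n = 0
    rpe = reward - m * expected_reward  # reward prediction error when m = 1
    return [a * choice * rpe * (xi - n * bi) for xi, bi in zip(x, baseline)]

x = [12.0, 4.0]  # hypothetical MT responses (right-preferring, left-preferring)

# Rewarded rightward choice: weights move in the direction of the choice.
dw_correct = delta_w(x, choice=1, reward=1, expected_reward=0.8)

# Unrewarded rightward choice: the negative error reverses the adjustment.
dw_error = delta_w(x, choice=1, reward=0, expected_reward=0.8)

print(dw_correct, dw_error)
```

Because the update is proportional to $x$, the neuron that responded more strongly receives the larger adjustment in both cases.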
This formulation assumes a single LIP neuron (or pool) forms the decision variable.
Alternatively, two pools of decision neurons corresponding to each of the two choices can be considered. The decision variable is then computed as the difference between the two pooled responses.
$E[r]$ estimates the subject’s confidence that the decision is correct and a reward will be obtained.
$y$ is modeled after LIP and is proportional to the log-odds ratio that rightward is the correct choice, given the sensory evidence $x$ and equal priors.
The estimated probability of a correct decision was computed as $1/(1+e^{-b|y|})$ , where $b$ is a proportionality constant.
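The confidence estimate is a logistic function of the magnitude of the pooled response. A small sketch, in which the function name `predicted_reward` and the default value of `b` are assumptions made for illustration:

```python
import math

def predicted_reward(y, b=1.0):
    """E[r] = 1 / (1 + exp(-b * |y|)): predicted probability of a correct choice."""
    return 1.0 / (1.0 + math.exp(-b * abs(y)))

print(predicted_reward(0.0))   # no evidence either way: chance level
print(predicted_reward(2.0))   # stronger evidence: higher confidence
print(predicted_reward(-2.0))  # same confidence for an equally strong leftward signal
```

Because the formula uses $|y|$, the estimate never falls below 0.5 and depends only on the strength of the evidence, not on its direction.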
After each update, $w$ is rescaled so that its vector length, $\sqrt{\sum_i w_i^2}$ , equals a constant, $w_{amp}$ . This enhances the stability of the model and prevents it from learning by indiscriminately increasing the magnitude of the weights.
The value of $w_{amp}$ was chosen to give a pooled response similar to the responses of LIP neurons in trained monkeys.
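The normalization step amounts to rescaling the weight vector to a fixed Euclidean length. A minimal sketch, with a hypothetical function name `normalize` and an arbitrary `w_amp` of 1:

```python
import math

def normalize(weights, w_amp=1.0):
    """Rescale weights so that sqrt(sum(w_i^2)) equals w_amp."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        return list(weights)  # an all-zero vector cannot be rescaled
    return [w * w_amp / norm for w in weights]

w = normalize([3.0, 4.0])
print(w)
```

Only the relative sizes of the weights survive the rescaling, so learning has to act through the shape of the weight profile rather than its overall magnitude.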
The trial-by-trial performance of the model was simulated using the exact sequences of stimulus conditions used for training monkeys in a previous study.
The model was then compared to the real behavioral, MT, and LIP data.

Results

Monkeys learned to associate a given direction of motion with a particular eye-movement response. This associative learning was quantified using the lapse rate (errors for high-coherence stimuli).
The lapse rate declined rapidly over the first week of training, reaching an asymptotic value of near zero.
Monkeys also became increasingly sensitive to weak motion signals over many months of training. These improvements in perceptual sensitivity were quantified using the discrimination threshold (the motion coherence corresponding to ~81% correct responses).
Thresholds decreased gradually, starting when low-coherence stimuli were introduced and continuing well after the monkeys had acquired the visuomotor association.
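To make the two behavioral measures concrete, the sketch below uses a Weibull psychometric function with a lapse term. The text does not state which function was fitted, so the functional form, the name `weibull_p_correct`, and the parameter values are assumptions. The form is chosen because, with no lapses, accuracy at the threshold coherence comes out near 82%, close to the ~81% criterion quoted above.

```python
import math

def weibull_p_correct(coherence, threshold, slope=1.0, lapse=0.0):
    """Assumed form: 0.5 + (0.5 - lapse) * (1 - exp(-(coherence / threshold)^slope))."""
    return 0.5 + (0.5 - lapse) * (1.0 - math.exp(-((coherence / threshold) ** slope)))

# Accuracy at the threshold coherence, with no lapses.
p_at_threshold = weibull_p_correct(coherence=15.0, threshold=15.0)

# At very high coherence, accuracy saturates at 1 - lapse.
p_high = weibull_p_correct(coherence=1000.0, threshold=15.0, lapse=0.05)

print(round(p_at_threshold, 3), round(p_high, 3))
```

In this parameterization the lapse rate sets the ceiling reached with strong motion, while the threshold sets how much coherence is needed to approach it, which is why the two measures can change on different time scales.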
The model can account for both the associative and perceptual changes. The same sequences of trials experienced by monkeys generated a simulated sequence of choices during learning.
Lapse rates declined rapidly to near zero. The time constant $\tau_{la}$ (mean [68% confidence interval], in trials) was 1,082 [1,080 1,175] for real data and 2,532 [1,839 3,658] for simulated data for monkey C, and 9,205 [9,139 9,564] for real data and 6,132 [6,000 6,539] for simulated data for monkey Z.
The discrimination threshold improved more gradually, eventually reaching lower asymptotes comparable to those reached by the monkeys. The time constant $\tau_{th}$ (mean [68% confidence interval], in trials) was 28,317 [26,945 29,839] for real data and 26,841 [26,170 27,326] for simulated data for monkey C, and 18,422 [17,339 20,500] for real data and 22,467 [22,016 22,620] for simulated data for monkey Z.
The lower asymptotes were also similar (mean [68% confidence interval], in percent coherence): 13.2 [12.7 13.7] for real data and 11.8 [11.7 12.0] for simulated data for monkey C, and 21.5 [19.7 22.5] for real data and 21.0 [20.3 21.5] for simulated data for monkey Z.
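The time constants and asymptotes above can be read as parameters of a decaying learning curve. The sketch below assumes a single-exponential form, which the text implies by its use of time constants but does not spell out. The starting threshold of 60% coherence and the function name `exponential_course` are hypothetical.

```python
import math

def exponential_course(n_trials, start, asymptote, tau):
    """Assumed form: asymptote + (start - asymptote) * exp(-n_trials / tau)."""
    return asymptote + (start - asymptote) * math.exp(-n_trials / tau)

tau_th = 28317.0   # from the text: threshold time constant, real data, monkey C
asymptote = 13.2   # from the text: lower asymptote (% coherence), real data, monkey C
start = 60.0       # hypothetical initial threshold, for illustration only

after_one_tau = exponential_course(tau_th, start, asymptote, tau_th)
print(round(after_one_tau, 1))
```

Under this assumption, one time constant's worth of trials closes about 63% of the gap between the starting value and the asymptote.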
The time course of learning for the model depended critically on the learning rate ( $a$ in equation (3)). As the learning rate increased, both time constants ( $\tau_{la}$ and $\tau_{th}$ ) decreased, indicating that simulated performance improved more rapidly.
The match was good for monkey C (linear regression, H0: $[\tau_{la}, \tau_{th}]_{monkey} = [\tau_{la}, \tau_{th}]_{model}$ , $P = 0.56$ ).
The match was not as good for monkey Z (P < 0.05), whose lapse rate declined more slowly, possibly reflecting factors other than knowledge of the sensory-motor association (distractibility, etc.).
These results were robust to a variety of pooling schemes and reinforcement rules. Briefly, similar results were found using one or two pools of decision neurons, additive, multiplicative, or both kinds of pooling noise, multiplicative, subtractive, or no normalization of the linear weights, and linear or nonlinear pooling.
Likewise, similar results were found using different learning rules, as long as they were based on a correlation between sensory input and a reward prediction error.
A qualitatively different scheme, in which pooling weights remained constant, but a decision bound on the pooled signal varied with training, was unable to reproduce the pattern of behavioral results.

Weight Optimization

The improvements in simulated discrimination performance resulted from changes in the pooling weights.
The association between motion direction and decision direction was established early in training, but performance on weaker motion signals was still near chance.
At this early stage, pooling weights tended to be strongest, with opposite signs, for neurons tuned near the two directions of motion used in the discrimination task.
As training progressed and simulated performance improved, the pooling weights continued to evolve such that weights to more sensitive neurons tuned to ~0 degrees became more positive, and weights to more sensitive neurons tuned to ~180 degrees became more negative.
Thus, the improvements in sensitivity to weak motion appeared to result from an increasingly selective readout of the more sensitive sensory neurons with training.
The learning rule (equation (2)) guided the pooling weights to a form of optimal linear readout at the end of training.
Optimal pooling weights were computed using Fisher’s linear discriminant analysis.
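For intuition, Fisher's linear discriminant gives weights proportional to $\Sigma^{-1}(\mu_R - \mu_L)$, where $\mu_R$ and $\mu_L$ are the mean responses to the two motion directions and $\Sigma$ is their shared covariance. The two-neuron sketch below is a toy illustration with hypothetical means and covariance, not the computation applied to the simulated MT population.

```python
def fisher_weights_2d(mu_right, mu_left, cov):
    """Return inv(cov) @ (mu_right - mu_left) for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    d0 = mu_right[0] - mu_left[0]
    d1 = mu_right[1] - mu_left[1]
    # The inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
    return [(d * d0 - b * d1) / det, (-c * d0 + a * d1) / det]

# Both neurons carry the same mean signal, but neuron 0 is noisier (variance 4 vs 1).
w_opt = fisher_weights_2d(
    mu_right=[10.0, 10.0],
    mu_left=[8.0, 8.0],
    cov=[[4.0, 0.0], [0.0, 1.0]],
)
print(w_opt)
```

The less variable, and therefore more sensitive, neuron receives the larger weight, which is the sense in which an optimal readout favors the most sensitive neurons.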

Choice Probability

Choice probability is a measure of the relationship between trial-by-trial fluctuations in the activity of individual neurons and choice behavior.
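Choice probability is conventionally computed as the area under an ROC curve comparing a neuron's responses on trials that ended in one choice with its responses on trials that ended in the other, for identical stimuli. The text does not describe the computation, so the sketch below follows that standard definition; the function name and the response values are hypothetical.

```python
def choice_probability(pref_choice_responses, null_choice_responses):
    """ROC area: probability that a response from a preferred-choice trial
    exceeds one from a null-choice trial, counting ties as one half."""
    wins = 0.0
    for p in pref_choice_responses:
        for q in null_choice_responses:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pref_choice_responses) * len(null_choice_responses))

# Hypothetical spike counts from one neuron, split by the choice on each trial.
cp = choice_probability([12, 15, 11], [10, 13, 9])
cp_chance = choice_probability([1, 2], [1, 2])
print(cp, cp_chance)
```

A value of 0.5 means the neuron's fluctuations carry no information about the upcoming choice; values above 0.5 mean higher firing tends to precede the neuron's preferred choice.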
For neurons in area MT, choice probability tends to be near chance early in training and then progresses steadily to values that are slightly, but reliably, above chance after training.
The model shows a similar, steady increase of choice probability for neurons tuned to the two directions of motion with training, as the pooling weights of these sensory neurons are adjusted to drive the decision process more effectively.
However, the changes in pooling weights alone were not sufficient to account for another key feature of the real MT choice probability data: a selective increase with training for the most sensitive neurons.
Interneuronal correlations are also important for choice probability.
When correlation strength depended on the similarity of direction tuning but not on the sensitivity between pairs of neurons, simulated choice probability was insensitive to neurometric sensitivity throughout training.
When the strength of correlations between pairs of neurons decreased as their direction tuning and sensitivity became less similar, simulated choice probability matched the MT measurements and increased selectively for the most-sensitive neurons.
Dynamic pooling weights and static interneuronal correlations that both depended on neuronal sensitivity could together account for changes in the relationship between MT activity and choice behavior throughout training.

LIP Activity

The selective changes in the pooling weights caused improvements in the pooled response ( $y$ in equation (1)), which were somewhat similar to the changes in motion-driven responses of individual LIP neurons that were measured during training.
As in LIP, the simulated pooled response reflected the direction decision throughout training, increasing from zero to more positive values with increasing viewing time on trials in which a rightward decision was made and decreasing from zero to more negative values with increasing viewing time on trials in which a leftward decision was made.
With training, the pooled response also became increasingly dependent on motion strength, increasing more steeply as a function of viewing time for rightward decisions and decreasing more steeply as a function of viewing time for leftward decisions.
Two important differences existed between the pooled response in the model and LIP activity:
- First, the pooled response in the model grew roughly linearly as a function of viewing duration. In contrast, LIP activity tended to increase early in motion viewing but then saturate at a threshold level.
- Second, the model tended to overestimate the signal-to-noise ratio (SNR, the difference in mean responses to the two directions of motion divided by their common s.d.) of individual LIP neurons throughout training.
The SNR of both the simulated and real LIP responses grew consistently in a similar manner with training.
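The SNR measure defined above can be sketched as follows. The "common s.d." is taken here to be the square root of the average of the two sample variances, which is an assumption about how the two distributions were combined; the function name `snr` and the response values are hypothetical.

```python
import math
import statistics

def snr(right_responses, left_responses):
    """Difference in mean responses divided by a pooled standard deviation."""
    diff = statistics.mean(right_responses) - statistics.mean(left_responses)
    pooled_var = (statistics.variance(right_responses)
                  + statistics.variance(left_responses)) / 2.0
    return diff / math.sqrt(pooled_var)

# Hypothetical pooled responses on rightward-motion and leftward-motion trials.
value = snr([10.0, 12.0, 14.0], [4.0, 6.0, 8.0])
print(value)
```

The measure grows either when the mean responses to the two directions move apart or when trial-to-trial variability shrinks.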

Fine Discrimination Task

The model was tested to determine whether it can also account for improved perceptual sensitivity on a fine discrimination task, in which the two alternative motion directions are separated by a much smaller amount.
In this case, the most informative neurons are not those that respond most strongly to the presented stimuli but rather are those tuned ~40 degrees away from those directions.
Unlike the weight profiles for the coarse task, changes in the pooling weights for this task were not centered on neurons tuned to the direction of motion of the stimulus.
Instead, the strongest weights developed around neurons tuned to values offset from the stimuli by ~40 degrees.
By the end of training, the weights were similar to the optimal readout.
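One way to see why off-axis neurons are most informative in a fine discrimination is to look at how much each neuron's mean response differs between two nearby directions. The sketch below assumes Gaussian tuning curves and equal noise across neurons. The 40-degree tuning width is an illustrative choice made so that the example reproduces the ~40-degree offset; it is not a value given in the text, and the function names are hypothetical.

```python
import math

def tuning(stimulus_deg, preferred_deg, sigma_deg=40.0):
    """Assumed Gaussian tuning curve with unit peak response."""
    d = stimulus_deg - preferred_deg
    return math.exp(-(d * d) / (2.0 * sigma_deg ** 2))

def response_difference(preferred_deg, delta_deg=2.5):
    """Mean response to +delta minus mean response to -delta."""
    return tuning(delta_deg, preferred_deg) - tuning(-delta_deg, preferred_deg)

# Search preferred directions on one side of the discrimination boundary at 0 degrees.
candidates = range(0, 181)
best = max(candidates, key=response_difference)
print(best, response_difference(0))
```

A neuron tuned exactly to the boundary responds identically to both stimuli and so carries no signal, while neurons on the steep flank of the tuning curve show the largest difference.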

Specificity of Learning

A key feature of many forms of perceptual learning is that the improvements tend to be specific to the stimulus configuration used during training.
The model, trained on the same sequence of motion axes, showed similar specificity.
After training the model using a single pair of simulated motion directions, both lapse rate and discrimination threshold were measured using different directions.
For the coarse task, lapses were mostly absent except at 90 degrees from the trained direction.
For the fine task, both lapse rates and discrimination threshold showed a higher degree of stimulus specificity.

Predictions

These forms of perceptual learning are driven by a reward prediction error.
Interneuronal correlations in MT should depend on neurometric sensitivity.
Specificity of learning on coarse and fine discrimination tasks should depend on the direction axis.
Learning should be fastest if strong motion stimuli are used early in training, as the error signal depends on the value of the pooled MT response and is therefore noisier for weaker motion, particularly early in training.

Discussion

A computational model that uses a reinforcement-learning rule to adjust pooling weights between MT-like sensory neurons and LIP-like decision neurons can account for both the behavioral and neural changes observed during training.
Changes measured in LIP during training reflect an increasingly selective readout of the most informative MT neurons.
The model establishes principles governing how functional connectivity between areas such as MT and LIP is modified by experience.
The relationship between MT activity and choice behavior, quantified as choice probability, appears to arise from both an appropriate readout scheme and a particular form of interneuronal correlations.
The reinforcement-learning model used a simple delta rule to adjust the pooling weights on the basis of a reward prediction error from the current decision.
Early in training, the feedback reinforcement signal establishes the functional connectivity that underlies the stimulus-response association.
This sensory-motor connectivity is further refined by the same learning mechanism to provide a more selective readout of the most sensitive sensory signals associated with that response.

Chi-Tat Law & Joshua I Gold's study examines reinforcement learning's role in associative and perceptual learning through a visual decision-making task, revealing that improved perceptual performance relates to sensory information interpretation changes. A reinforcement-learning rule shapes the connectivity between sensory and decision neurons, based on responses from direction-selective sensory neurons in the monkey's middle temporal area (MT). A reward prediction error strengthens connections between motion direction and response direction, enhancing perceptual sensitivity. Perceptual learning improves with training but remains incompletely understood, particularly in vision, highlighting the significance of reinforcement signals in refining sensory neuron readouts. Monkeys trained in a direction-discrimination task showed a decline in lapse rates and an improvement in sensitivity to weak motion signals over time, with the model simulating their learning and performance using specific reinforcement-learning rules. The results demonstrate increasingly selective readout from the most informative neurons and provide insights into how experience modifies functional connectivity in the neural circuitry involved in perception and decision-making, establishing principles for both associative and perceptual changes.