Lecture 7 - Instrumental or Operant Conditioning

Midterm Feedback

“Except for” MC questions instilled doubt, more time-consuming
Many MC questions involved similar answers, hard to differentiate
Too many written answer questions for the time constraint
Many choices of written answer prompts was really good
Maybe having a review session before the exam would be beneficial — we didn’t even get to finish the sixth chapter, yet we were examined on it

Lesson Outline

Instrumental Conditioning (IC) Background
Procedures
Influencing Factors
Consequences
Associations in Instrumental Conditioning

Instrumental Conditioning Background

Classical Conditioning vs. Instrumental Conditioning

Classical Conditioning
- Stimulus + Stimulus = Conditioned Reflexive Response
- Ex: Footsteps + Food = Salivation to Footsteps
Instrumental Conditioning
- Voluntary Behaviour + Consequence = Increase/Decrease in Voluntary Behaviour
- Ex: Biting one’s nails + Punishment = No More Nail-biting
Commonalities and Differences:
- Resulting behaviour is reflexive in CC and voluntary in IC

Basic IC Procedure

In instrumental conditioning, voluntary responses are modified through the following steps:
1. The organism ‘reacts or behaves’
2. A behaviour modification technique is applied
3. Consequence: The reaction or behaviour either occurs more frequently or is reduced/stopped.
- Note: IC can produce complex behaviours.

Definitions

Instrumental behavior: Behavior that occurs due to its previous role in producing consequences (e.g., Study hard to get an A+).
Instrumental conditioning: The procedures developed to study instrumental behavior through reinforcement and punishment. Examples of instrumental behavior:
- Turning the key to start a car.
- Pulling the handle on a slot machine to win.
- Driving too fast leads to a speeding ticket.
- Touching an electric fence results in a shock.

Background

A type of learning in which the consequences of behaviour tend to modify that behaviour in the future.
Rationale:
- Behaviours that are rewarded or reinforced tend to be repeated.
- Behaviors that are ignored or punished are less likely to be repeated.

Early Studies on Instrumental Conditioning

Thorndike’s Early Studies

Edward L. Thorndike (1874-1949):
- The first serious theoretical analysis of instrumental conditioning
Initially, a lot of behaviours were tried out
Animal tracks outcomes of behaviours
- S → R → O
- In context (S), response (R) produces outcome (O)
This knowledge guides future behaviours:
- Behaviours with positive outcomes increase
- Behaviours with negative outcomes decrease
Thorndike’s puzzle boxes:
- Cool, but there are some methodological problems:
  - Have to repeat trials over and over, resetting animal and device
  - Cutoff? What is the worst performance?
  - Latency decreases with learning
  - Hard to compare across animals, trials
  - How do you generate a prediction from latencies?

Instrumental Conditioning Procedures

Discrete-trial Procedures

Puzzle Boxes
Maze Learning

Runway Maze (aka Straight-Alley Maze)
- Place the rat in the S-box (start box); it has to get to the G-box (goal box)
- Reward for rats is usually Froot Loops; they love ’em
T-Maze
- Used for memory studies and other aspects
- A reward is placed at one of the two ends; when put back into the start box, rats demonstrate learning (e.g., by making the same turn to the correct G-box)

8-Arm Radial Maze
- Receive different rewards at different times
- Typically four feet off the ground (rats don’t like heights, but do go to ends of arms if there is a good reason for it — like a reward)

Free Operant Procedures

The operant response is defined in terms of its effect on the environment
- E.g., your actions alter the environment in some way
Different types of operant responses:
- 1. Lever-press
  - Rats learning lever-pressing
- 2. Chain pull
- 3. Nose-poke
  - Rats poking something with nose
- 4. Peck
  - Pigeons pecking something
What is the dependent variable?
- 1. Response-rate
- 2. Total number of responses
- 3. Latency to respond
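
The three dependent variables above can all be computed from a log of response times. A minimal sketch, assuming a hypothetical list of lever-press timestamps in seconds from the start of the session; the function name, data format, and numbers are illustrative and not from the lecture.

```python
def summarize_responses(press_times, session_length):
    """Summarize one free-operant session.

    press_times: time (s) of each response, measured from session start
                 (hypothetical data format).
    session_length: total session duration in seconds.
    """
    total = len(press_times)
    # Response rate: responses per minute over the whole session.
    rate_per_min = total / (session_length / 60)
    # Latency: time from session start to the first response (None if none occurred).
    latency = min(press_times) if press_times else None
    return {"total": total, "rate_per_min": rate_per_min, "latency": latency}


# Made-up session: six lever presses in a 2-minute session.
summary = summarize_responses([12.0, 15.5, 20.0, 48.0, 55.0, 58.5], session_length=120)
```

Here latency is taken from the start of the session; in a discrete-trial procedure it would instead be measured from the start of each trial.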

B.F. Skinner and the Skinner Box

Skinner was considered the leading authority on IC
- Was influenced by Thorndike
Skinner invented the “Skinner Box” to test IC through shaping
- Ex: One type of chamber trains rats to bar-press for rewards

The Initial Learning

IC involves learning familiar responses in new situations or in new ways
- Example: Learning where and what to run for
  - Rats do not need to learn HOW to run
  - Rats need to learn WHERE to run, WHERE to turn, and WHAT they will find at the end
Constructing new responses from familiar components
- Example: To press a lever, rats have to combine various familiar behaviours
  - Raising their paws, standing on hind legs, etc.

Shaping

Reinforces any movement in the direction of the desired response
- Rewards gradual successive approximations
- Quicker than waiting for the response to occur and then reinforcing it
- Used effectively to condition humans and many types of animals
  - Ex: parents and children, teachers and students, coaches and athletes

What is shaping?

Shaping takes what the organism already does and modifies it in different ways. It is not about teaching brand-new behaviours, but about modifying existing ones.

What is successive approximation?

In a lever-pressing experiment with a rat, you could wait for dumb luck until it presses the lever for the first time. Or you can reward successive approximations to SPEED UP THE LEARNING PROCESS. For example, reward the rat for simply looking in the direction of the lever, which increases the probability of facing the lever. After it does that over and over again, up the ante: withhold the reward until the rat APPROACHES the lever, and the probability of approaching the lever increases. Eventually it gets to the lever and nothing happens until the next step, which might be rewarding it for TOUCHING the lever.
Kid spelling example: a kid is trying to say a word; rewarding the kid for saying a letter makes it more likely that the kid will eventually say the whole word.
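
The lever-pressing example can be sketched as a loop that rewards anything at or above the current criterion and raises the criterion once the current approximation is reliable. This is only an illustration: the stage names follow the example above, while the trial data and the rule "advance after 3 rewards" are assumptions, not values from the lecture.

```python
def shape(observed_levels, stages, advance_after=3):
    """Reward successive approximations to a target response.

    observed_levels: index into `stages` of what the animal did on each trial
                     (hypothetical data).
    stages: behaviours ordered from crudest approximation to target response.
    advance_after: rewards needed at the current criterion before it is raised
                   (an assumed value).
    Returns (list of rewarded flags per trial, final criterion stage).
    """
    criterion = 0              # start by rewarding the crudest approximation
    rewards_at_criterion = 0
    rewarded = []
    for level in observed_levels:
        earned = level >= criterion
        rewarded.append(earned)
        if earned:
            rewards_at_criterion += 1
            # Up the ante once the current approximation has been rewarded enough.
            if rewards_at_criterion >= advance_after and criterion < len(stages) - 1:
                criterion += 1
                rewards_at_criterion = 0
    return rewarded, stages[criterion]


stages = ["look at lever", "approach lever", "touch lever", "press lever"]
trials = [0, 0, 0, 0, 1, 1, 1, 2, 1, 2, 2, 3]   # made-up behaviour on 12 trials
rewarded, final_stage = shape(trials, stages)
```

Trials 4 and 9 go unrewarded in this made-up run because the behaviour falls below the criterion that had just been raised, which is the "withhold the reward" step described above.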

Shaping and Chaining

Shaping:

Shaping through successive approximation builds a complex R incrementally
Initially, the contingency is introduced for simple behaviour (R)
As the rate of R improves, the contingency is moved to a more complex version of R
Gradually, it builds a complex R that the animal would never spontaneously produce

Chaining:

Chaining builds complex R sequences by linking together S→R→O (if S, then R, leads to O) conditions
Initially, train the animal to pick up an object
Next, reward it for picking it up and then throwing it
It allows a series of behaviours (as opposed to shaping, which simply elaborates on a simple response)

Shaping and chaining can be used together to train animals to complete incredibly complex behaviours. Both techniques require skill and patience from the trainer.

Keep an animal motivated and interested
Select proper training sequence
- Cannot move too fast

How to Get a Rat to Lever Press

IC in the Skinner Box

Outcomes (O):
- ± food delivery
- ± shock through wires in the floor (punishment)
Behaviour (R): rate of lever pressing
Context (S): light that signals box is “on”
Note that the animal is “free” in the chamber, with no experimenter intervention
- Free-operant learning
Also, many possible contingencies can be introduced

Positive Reinforcement:

Press lever (R) → GET FOOD

Positive Punishment

Press lever (R) → GET SHOCK

Negative Reinforcement

Press lever (R) → STOP SHOCK

Negative Punishment

Press lever (R) → STOP FOOD

Structure of the IC Skinner Box Experiment

Initially, the rat tries many things; eventually, it accidentally presses the lever, which produces a positive effect
Now it starts hanging around the lever and accidentally presses it again
The rat has learned a contingency: if light on (S), pressing lever (R) → food (O); it spends much of its day pressing and eating

Basic Pattern of IC

Generalizing & Discrimination

Influencing Factors

Quality of the Outcome

Appetitive stimulus: ‘pleasant’ event or outcome in the context of IC
Aversive stimulus: ‘unpleasant’ event or outcome in the context of IC

Relationship Between the Instrumental Behaviour and the Outcome

Positive contingency: The instrumental response causes an outcome/stimulus to appear
Negative contingency: The instrumental response causes a stimulus to disappear or be eliminated

Magnitude of Reinforcement

As magnitude increases:
- Acquisition of a response is faster
- Rate of responding is higher
- Resistance to extinction is greater
Ex: people work harder for $30/hr than $10/hr

Immediacy of Reinforcement

If reinforcement is immediate, responses are conditioned more effectively
- Ex: Addiction to drugs can happen quickly because the euphoric effects are felt almost instantly
As a rule, the longer the delay in reinforcement, the more slowly the response will be acquired
- Ex: Eating habits are hard to change because of the long delay between changing the habit and the payoff (better health, weight loss)
- Note: For rats, the delay between the behaviour and the reward or punishment should be at most one minute for the association to form

Level of motivation

Higher motivation leads to faster learning
Skinner found maximum motivation occurred when rats were food deprived for 24hrs — makes the rats more interested and motivated to obtain food

Consequences

Changes in instrumental behaviour are determined by the nature of the outcome, and whether or not the outcome is presented or eliminated
- Reinforcement: Where the relationship between the response (R) and the outcome (O) increases the probability of the response occurring
- Punishment: Where the relationship between the response (R) and outcome (O) decreases the probability of a response occurring

Reinforcement

Anything that strengthens a response (or increases the probability that the response will occur)

Primary and Secondary Reinforcers

Primary reinforcers fulfill basic physical needs for survival
- Do not depend on learning
- Ex: food, water, termination of pain
Secondary reinforcers are acquired or learned by association with other reinforcers
- Ex: money, praise, awards, good grades

Punishment

Anything that suppresses a response (or decreases the probability that the response will occur)

Types of Consequences and Their Procedures

Four main scenarios outline the consequences of behaviour:
1. Positive Reinforcement:
- Behaviour produces an appetitive stimulus
- Probability of behaviour increases
- Contingency is positive
- Human Example: Slot machines
- Experimental Procedure: Lever-pressing for food
2. Positive Punishment (Punishment)
- Behaviour produces an aversive stimulus
- Probability of behaviour decreases
- Contingency is positive
- Human example: ticket after speeding
- Experimental example: rats stop pressing a lever that produces shock

etc…
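
All four scenarios follow from two choices: whether the stimulus is appetitive or aversive, and whether the contingency is positive (response produces the stimulus) or negative (response removes it). A small lookup sketch of that 2x2, using the definitions given earlier in these notes; the function name is made up for illustration.

```python
def classify_consequence(stimulus, contingency):
    """Return (procedure name, effect on the probability of the behaviour).

    stimulus: "appetitive" or "aversive"
    contingency: "positive" (response produces the stimulus) or
                 "negative" (response removes or eliminates the stimulus)
    """
    table = {
        ("appetitive", "positive"): ("positive reinforcement", "increases"),
        ("aversive", "positive"): ("positive punishment", "decreases"),
        ("aversive", "negative"): ("negative reinforcement", "increases"),
        ("appetitive", "negative"): ("negative punishment", "decreases"),
    }
    return table[(stimulus, contingency)]


# Skinner box examples from the notes:
food = classify_consequence("appetitive", "positive")    # press lever -> get food
shock_off = classify_consequence("aversive", "negative")  # press lever -> stop shock
```

Note that "positive" and "negative" describe the contingency, not whether the outcome is pleasant: both kinds of reinforcement increase behaviour and both kinds of punishment decrease it.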

Instrumental vs. Classical Conditioning

Associations in IC

Associative Structure of Instrumental Conditioning

Originated with Thorndike
Role of Pavlovian mechanisms in instrumental conditioning
Focus on individual responses and their stimulus antecedents and outcomes (molecular approach)
- REMINDER: If S, Then R, Produces O (if you find yourself in a situation with particular stimuli, then a specific response will lead to a particular outcome)

Thorndike’s Law of Effect: S-R Learning

If a response in the presence of a stimulus results in a satisfying event then the S-R association is strengthened. If the response is followed by an annoying event then the S-R association is weakened.
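
A toy sketch of the Law of Effect as an update rule on a single S-R strength. The numeric scale (0 to 1) and the step size are assumptions made only for illustration; Thorndike's law as stated here does not specify any numbers.

```python
def law_of_effect(strength, event, step=0.1):
    """Update an S-R association strength after one trial.

    event: "satisfying" strengthens the association, "annoying" weakens it.
    step: assumed size of each change (illustrative only).
    The strength is kept within [0, 1].
    """
    if event == "satisfying":
        strength += step
    elif event == "annoying":
        strength -= step
    return min(1.0, max(0.0, strength))


strength = 0.5  # arbitrary starting strength
for event in ["satisfying", "satisfying", "annoying"]:
    strength = law_of_effect(strength, event)
```

The only thing stored is the S-R strength. Nothing about the identity of the outcome is kept, which is exactly the feature of the S-R account that the rest of this section questions.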

The reinforcer (O) serves to ‘stamp in’ the S-R association
Motivation for instrumental behaviour:
- Activation of the S-R association upon exposure to contextual stimuli (S), in the presence of which the response was previously reinforced
No learning about ‘O’ or ‘S-O’ or ‘R-O’
- The O was not learned about; rather, it was a means of learning
- THORNDIKE WAS NOT CORRECT — but he wasn’t entirely wrong either. S-R is learned about, it can be produced independently of what the O is. It’s really applied when overlearning (or habitual learning) occurs.
  - Back in the day Dean would smoke. He would light up in particular smoking contexts — it became habitual.
So O (outcome) MATTERS!
- Reward expectancy:
  - Can the expectancy of a particular outcome (S-O) modulate instrumental behaviour?
  - Does the expectancy of a certain outcome drive the response?
Earliest theory — Hull (1930), Spence (1956):
- Two factors motivate the instrumental response (TWO PROCESS THEORY):
  - S-R Association
    - The stimulus comes to evoke the response directly
  - S-O Association
    - Response is motivated by expectancy of reward (classical conditioning occurs here — context reflexively makes organism expect a reward)

The Modern Two-Process Theory

Rescorla & Solomon (1967)
- S-O Association (Pavlovian Learning) → Conditioned, central emotional state (positive or negative based on the reinforcer) → Response

Pavlovian-Instrumental Transfer Test

Phase 1: Lever-press → food learning (instrumental conditioning)
Phase 2: Tone → food learning (classical conditioning)
Phase 3 (Test Phase): the organism is given the opportunity to produce the instrumental response (a lever is present in the environment). Sometimes the tone plays while the lever is available, sometimes it doesn’t. Instrumental responding increases when the CS is present compared with when it is absent.
- Conclusion: the CS produces a reflexive conditioned expectancy in the organism, which makes the organism work harder (more lever pressing)

PIT in Humans…

Handgrip was stronger when the purple stimulus was present
Participants didn’t report knowing that they were squeezing any differently when the CS was present in the third phase vs. when it wasn’t

Conditioned Emotional State or Reward-Specific Expectancy

Does classical conditioning influence instrumental behaviour via a positive or negative emotional state (based on reinforcer valence) or do subjects acquire specific expectations of the reinforcer?
Is it any predicted reward that increases responding, or is it a particular reward? In the transfer test, the reward was the same in both phases. Could any work towards a reward in phase 1 combine with any prediction of reward in phase 2?

The Experiment:

Phase 1: Instrumental Conditioning
- Lever-pressing leads to chocolate reward
- Chain-pulling leads to cheese reward
Phase 2: Classical Conditioning
- Yellow light predicts when chocolate reward is given
- Red light predicts when cheese reward is given
Phase 3: Test Phase
- When lever-pressing, sometimes yellow light would be on, sometimes red light would be on, and sometimes neither would be on
- When chain-pulling, sometimes yellow light would be on, sometimes red light would be on, and sometimes neither would be on
Explanation:
- Lever pressing only increased when the yellow light was on (i.e., specific to the reward it was working for)
  - When the red light was on, it did not increase
- When the expectancy of reward and working towards reward are the same, effort in instrumental responding increases
  - If you predict cheese, you’re not gonna work harder to get chocolate (they are not generalizing across rewards)
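
The reward-specific expectancy account amounts to a matching rule: a cue boosts a response only when the cue and the response point to the same reward. A minimal sketch using the pairings from the experiment above; the function name is made up, and the chain-pulling predictions are implied by the explanation rather than reported explicitly in these notes.

```python
# Phase 1: which response earns which reward.
RESPONSE_OUTCOME = {"lever-press": "chocolate", "chain-pull": "cheese"}
# Phase 2: which light predicts which reward.
CUE_OUTCOME = {"yellow light": "chocolate", "red light": "cheese"}


def responding_increases(response, cue):
    """True if the cue should boost this response under reward-specific expectancy."""
    if cue is None:
        return False  # baseline: neither light is on
    # The boost only happens when expected reward and earned reward match.
    return CUE_OUTCOME[cue] == RESPONSE_OUTCOME[response]


# Phase 3 test conditions for lever-pressing:
lever_results = {
    cue: responding_increases("lever-press", cue)
    for cue in ["yellow light", "red light", None]
}
```

Under a general conditioned emotional state account, either light would boost either response, since both lights predict something good. The matching rule is what separates the two accounts.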

R-O Associations in Positive Reinforcement

The instrumental response

Stereotypy vs. behavioural variability
Relevance (belongingness and instinctive drift)

The outcome of the response

Quality and quantity
Positive and negative contrast

The relation or contingency between the response and outcome

Temporal contiguity
Contingency

The Instrumental Response

What is this R that is learned?
Initially, thought to be a rote motor program
However, if the normal motor program is blocked, the animal will use other methods to achieve the same ends
- Ex: wading/swimming; pressing lever with nose
R is a “behavioural unit”
- Not a single behaviour but a class of behaviours producing an effect
- Some cognitive psychologists would call it a goal or intention
Stereotypy vs. Response variability
- It is possible to maintain variability of responses using reinforcement
- However, unless variability is explicitly reinforced, responding will become more stereotypical
Thorndike’s belongingness
- Easier to train responses that ‘belong’ with the reinforcer
  - e.g., cannot train yawns or scratching as an escape response
Breland & Breland’s instinctive drift
- Extra responses that are performed instinctively because they are related to the reinforcer
- They compete with the response required by the training procedure
  - e.g., cannot teach raccoons to drop coins in a box

The Instrumental Reinforcer

Quantity of the reinforcer (Canadian dollars vs. rupees)
Quality of the reinforcer (Brussels sprouts vs. ice cream)
- Quantity and quality of reinforcer example:
  - Food-deprived rats were offered different amounts of different foods, and the amount of instrumental responding shown to receive each was measured
    - For acidic food, a larger amount does not increase instrumental output
    - For neutral food, output is higher for a medium amount than for a small one, but not for a larger amount
    - For sweet food, output increases as the amount increases

RECAP:

S-O association is reward expectancy (making a prediction of the outcome, e.g., predicting a reward)
- Example: Dean’s nephew. The kid recognized the hockey arena and predicted that popcorn was coming, so he immediately started thinking ‘reward’ after being exposed to that particular environment. This is an S-O association, or reward expectancy (an element of classical conditioning: a reflexive association between stimulus and outcome)
- Note that the influence of reward expectancy is that it increases the instrumental response to receive the reward, unless the expected reward and the reward earned by the instrumental response don’t align (see Pavlovian-instrumental transfer test)
R-O association is when we learn the association between making a response and the outcome it leads to. We learned this with positive reinforcement (the response leads to the presence of something good, which feeds back to fuel the response more).
- Response itself is being learned
- Reinforcer or outcome is being learned
- The link between them is also learned

The Instrumental Reinforcer

Does prior experience with a reinforcer influence IC?
Shifts in reinforcer quality and quantity
- Example: food deprived rats performed instrumental response for food
  - Phase 1: Groups 1 and 2 received small food reward after (2 pellets) — run same pace as groups 3 and 4
  - Phase 1: Groups 3 and 4 received large food reward after (22 pellets) — run same pace as groups 1 and 2
  - Phase 2:
    - Group 1 continues to receive 2 food pellets
      - Same pace ran in phase 2 as in phase 1 (line of best fit accounts for variability)
    - Group 3 continues to receive 22 food pellets
      - Same pace ran in phase 2 as in phase 1 (line of best fit accounts for variability)
    - (SL group, small-large group) Group 2 now receives 22 food pellets
      - Ran faster in phase 2 than in phase 1 — they are happy that they’re getting more; hell yeah I am running faster
    - Group 4 now receives 2 food pellets
      - Ran slower in phase 2 than in phase 1 — they are used to putting in effort for large food amount, so they reduce their instrumental response (it’s still a reward, but it’s a decrease in instrumental behaviour because prior experience matters — 2 is better than 0, but worse than 22)
- Conclusion: prior experience DOES matter — positive contrast (experienced in group 2) and negative contrast (experienced in group 4)
  - In other words, you shouldn’t ONLY look at the magnitude of the reward — you should also look at prior experiences
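
The four-group design can be summarized as a comparison of each group's reward in phase 2 against what it received in phase 1. A sketch using the pellet counts from the example above; the function name and output labels are made up for illustration.

```python
def predicted_shift(phase1_pellets, phase2_pellets):
    """Predict the change in running speed from phase 1 to phase 2."""
    if phase2_pellets > phase1_pellets:
        return "faster (positive contrast)"
    if phase2_pellets < phase1_pellets:
        return "slower (negative contrast)"
    return "same pace"


# Pellets per group as (phase 1, phase 2), taken from the example above.
groups = {1: (2, 2), 2: (2, 22), 3: (22, 22), 4: (22, 2)}
predictions = {g: predicted_shift(*pellets) for g, pellets in groups.items()}
```

Groups 1 and 4 both end up on 2 pellets, yet only group 4 is predicted to slow down. The current magnitude alone does not determine responding; the comparison with prior experience does.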
The Response-Reinforcer Relation
- Understanding the relationship between a response and its consequence is critical for efficient instrumental behaviour
  - First, temporal relation: The time between the response and the appearance of the reinforcement
    - The longer the delay between the response and the reinforcer, the harder the learning
    - EXAMPLE:
      - Rats lever press for food, and food pellets are delivered after different fixed delays
      - Conclusion: Immediate reinforcement is most effective
      - Is it possible to overcome the delay effect? YES, by marking the target instrumental response: the response produces a marker that sort of reminds you that your response has been taken into account
      - Example of marking:
        - Rats lever press for food; food is delivered after a 30-sec delay
        - Group 1: no signal
        - Group 2 (MARKING GROUP): 5-sec light right after the lever press (the light acts as a CS, or predictive stimulus, for the reward)
          - Learns very quickly
        - Group 3: 5-sec light right before food delivery
          - Doesn’t learn at all. Too much of a delay, and really an example of BLOCKING: the light is a perfect predictor of the reward 5 secs before it is received, which blocks the rats from understanding that they caused the light in the first place

Second, response-reinforcer contingency: The extent to which the response is necessary and sufficient for occurrence of the reinforcer (causal effect)
- E.G., you have to learn that you caused the reinforcer by producing the instrumental response (cause and effect)
- Skinner’s superstitious behaviour — idea that we create contingencies in our mind that don’t necessarily exist. The pigeons in skinner’s experiment would be doing weird things like pecking or jumping around (imitating the thing they did the first time they received randomly-timed reward)… accidental/adventitious reinforcement.
  - Contiguity was all that mattered according to Skinner — they created their own contingency, and as long as the temporal contiguity is there, then reinforcement can occur
- However, contrasting evidence suggests that contiguity is not the only explanation (Staddon & Simmelhag, 1971, re-did Skinner’s pigeon experiment):
  - Similar behavioural responses developed in many different pigeons
  - Food delivery increased strength only of terminal responses
  - Periodic presentation of reinforcer produces behavioural regularities based on the interval
  - conclusion — specific types of behaviours were reproduced that had food-getting qualities
  - Why are high terminal and low interim responses observed?
  - Periodic deliveries of food activate the feeding system and its corresponding pre-organized, species-typical foraging and feeding responses
    - Just after food:

Do R-O Associations Exist?

Instrumental devaluation
- Phase 1 — if you push a rod to the left, you get chocolate, and to the right, you get cheese
- Phase 2 — you devalue the reinforcer of one of them (e.g., overload them with cheese)
- Phase 3 — the rat will only produce the instrumental response that leads to the OTHER reinforcer (in this case, chocolate), that can only be explained by R-O associations
Colwill & Rescorla (1986):

Associative Structure of Instrumental Conditioning

Hierarchical S(R-O) conditioning — about knowing WHEN the R-O association exists
- S activates R (Thorndike, habitual behaviour)
- S also activates R-O association
- EXAMPLE: Response (not being a brat) leads to outcome (popcorn) only when exposed to a certain stimulus (the hockey arena)
- Context lets you know that R-O association is now active