Instrumental Conditioning: Comprehensive Study Notes (Lecture 2)

Instrumental Conditioning

Instrumental conditioning (also called operant conditioning) is learning the associations between a voluntary response and its consequences (outcomes).
The learner produces a response (emitted, not elicited) and the response is reinforced or punished, altering future behavior.
You operate on the environment to obtain rewards or avoid punishments; your action is instrumental in obtaining the outcome.
Key idea: learning is about how actions lead to consequences, shaping future propensity to respond.

Thorndike’s Law of Effect

Core idea: actions followed by favourable consequences tend to be repeated; actions followed by unfavourable consequences tend not to be repeated.
Everyday illustration:
- Example of repetition: I study for tests because I tend to get better marks.
- Example of avoidance: I do not skip tutorials after having previously received a Technical Fail.
Implication: the environment contingently strengthens or weakens actions based on their outcomes.

Operant Conditioning: Core Concepts

Also called instrumental conditioning.
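Distinction from classical (Pavlovian) conditioning:
- Operant conditioning involves a voluntary, emitted response; in classical conditioning the response is elicited by a stimulus.
- In operant conditioning the response is followed by consequences (outcomes) that reinforce or punish it; in classical conditioning the outcome does not depend on what the learner does.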
Learning involves the organism operating on the environment to obtain rewards and to avoid punishments.
Important terminology:
- Reinforcement: an outcome that strengthens the tendency to emit the response.
- Punishment: an outcome that weakens the tendency to emit the response.
- Positive/Negative are not about good/bad; they refer to whether something is added or removed.

Positive and Negative (Not Valence)

Positive means something is added to the environment following a response.
Negative means something is removed from the environment following a response.

Reinforcement and Punishment (Definitions)

Reinforcement: a reinforcer is an event following a response that strengthens the tendency to make that response.
- Skinner (1953): “The only defining characteristic of a reinforcer is that it reinforces.”
Punishment: an event following a response that weakens the tendency to make that response.

Putting it all together: The Response Matrix

Reinforcement tends to increase the instrumental response; Punishment tends to decrease it.
When reinforcing:
- Positive reinforcement: add a desired stimulus (a reward) to increase behaviour.
- Negative reinforcement: remove an aversive stimulus to increase behaviour.
When punishing:
- Positive punishment: add an aversive stimulus to decrease behaviour.
- Negative punishment: remove a desired stimulus to decrease behaviour.
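
The 2x2 matrix above can be sketched as a small lookup. This is only an illustrative study aid: the function name `classify_contingency` and the string labels are assumptions made for this example, not standard terminology from the lecture.

```python
def classify_contingency(stimulus_change: str, behaviour_change: str) -> str:
    """Return the name of the matching cell in the 2x2 response matrix.

    stimulus_change:  "added" or "removed" (what follows the response)
    behaviour_change: "increases" or "decreases" (effect on future responding)
    """
    sign = {"added": "positive", "removed": "negative"}[stimulus_change]
    process = {"increases": "reinforcement", "decreases": "punishment"}[behaviour_change]
    return f"{sign} {process}"


# Worked examples from these notes, encoded as (stimulus change, behaviour change).
examples = {
    "sticker for brushing teeth": ("added", "increases"),
    "criticism for asking a question": ("added", "decreases"),
    "spot cream removes painful acne": ("removed", "increases"),
    "phone privileges taken away for swearing": ("removed", "decreases"),
}
for name, (stimulus, behaviour) in examples.items():
    print(f"{name}: {classify_contingency(stimulus, behaviour)}")
```

The two questions are answered independently: first ask what happened to the stimulus, then ask what happened to the behaviour.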

Positive Reinforcement (example)

Example: A child brushes their teeth and receives a sticker.
- Sticker added → POSITIVE.
- The child is more likely to brush teeth in the future.
- Outcome: the behaviour of tooth brushing increases → REINFORCEMENT.

Positive Punishment (example)

Example: A student asks a question and is heavily criticized by the teacher.
- Criticism added → POSITIVE.
- The student is less likely to ask questions in the future.
- Outcome: asking questions decreases → PUNISHMENT.

Negative Reinforcement (example)

Example: You use a spot cream or patch; painful acne goes away.
- Acne reduced/removed → NEGATIVE.
- You’re more likely to use spot cream/patch next time you have painful acne.
- Outcome: use of spot treatment increases → REINFORCEMENT.

Negative Punishment (example)

Example: A child swears at his parents; phone privileges are taken away.
- Phone privileges removed → NEGATIVE.
- He’s less likely to swear at his parents in the future.
- Outcome: swearing decreases → PUNISHMENT.

Shaping

When an animal does not perform the desired behaviour yet, you reinforce progressively closer approximations to the target behaviour.
This technique builds complex behaviours by rewarding successive approximations.
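
A toy simulation can make the logic of successive approximations concrete. Everything in this sketch is an assumption made for illustration: responses are modelled as a single number (e.g., degrees of turning), they are drawn from a normal distribution around the learner's current typical response, and a reinforced response simply becomes the new typical response. It is not a model from the lecture.

```python
import random


def shape(target: float, start: float = 0.0, step: float = 10.0,
          spread: float = 15.0, max_trials: int = 10_000, seed: int = 0):
    """Toy model of shaping; returns (trials_needed or None, criteria_met).

    The trainer reinforces any response that meets the current criterion,
    then raises the criterion by `step` until it reaches `target`.
    """
    rng = random.Random(seed)
    typical = start                    # learner's current typical response
    criterion = min(step, target)      # first, easy approximation
    criteria_met = []
    for trial in range(1, max_trials + 1):
        response = rng.gauss(typical, spread)
        if response >= criterion:
            criteria_met.append(criterion)
            typical = response         # assumption: reinforced response becomes typical
            if criterion >= target:
                return trial, criteria_met
            criterion = min(criterion + step, target)  # demand a closer approximation
    return None, criteria_met


trials, criteria = shape(target=360.0)
print(f"Reached a full turn after {trials} trials, via {len(criteria)} criteria")
```

Under these assumptions, waiting for a full 360-degree response from the start would almost never pay off, because such a response is far outside what the learner initially emits; raising the criterion in small steps keeps reinforcement within reach.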

B.F. Skinner: Shaping a Pigeon to Turn Around

Classic demonstration of shaping: conditioning a pigeon to perform a new, complex action by reinforcing incremental steps toward the goal.

Problems with Punishment

Learners may simply avoid punishment rather than learning the desired behaviour (avoidance).
Punishment can inhibit all behaviour, not just the undesired one.
It is important to reinforce an alternative (adopt a competing, desirable behaviour).
Punishment can create dislike or fear of the punisher (Pavlovian conditioning).
Observational learning: learners may copy the punisher.
If punishment works, the punisher's use of punishment is itself reinforced by the drop in the undesired behaviour (e.g., violence is rewarded when it stops the behaviour), making further punishment more likely.

Reinforcement Schedules: The Rules of Timing and Frequency

A reinforcement schedule is a rule that determines the timing and frequency of reinforcement for a behaviour.
Core dimensions:
- Fixed vs Variable: Fixed means the requirement is constant and predictable; Variable means it changes unpredictably around an average.
- Ratio vs Interval: Ratio is based on the number of responses; Interval is based on the passage of time.
- Continuous vs Intermittent: Continuous means every response is reinforced/punished; Intermittent means only a subset of responses is.

Putting it all together: Schedule Matrix (conceptual)

The interaction of Ratio/Interval with Fixed/Variable determines the pattern of responding and persistence under reinforcement/punishment.
Example labels you may see:
- Fixed Ratio (FR)
- Fixed Interval (FI)
- Variable Ratio (VR)
- Variable Interval (VI)
Each leads to characteristic response rates and pausing patterns.

Rates of Responding with Different Reinforcement Schedules

The contingency between response and outcome greatly affects rate and persistence of responding.
Key generalizations:
- Ratio schedules tend to produce faster responding than interval schedules.
- Organisms learn that the reinforcer depends on the number of responses (ratio) rather than the passage of time (interval).
- Fixed schedules tend to produce pauses in responding; variable schedules tend to produce steadier responding.
- The predictability of fixed schedules lets organisms anticipate when the next reinforcer is due, so they tend to pause after each one.
- The unpredictable nature of variable schedules promotes consistent responding.

Fixed-Ratio (FR) – Rates of Responding

Reinforcement after a set number of responses.
Example: Reinforcement after every 5th response → FR-5.
Result: high response rate with a post-reinforcement pause after each reinforcement.
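
As a minimal sketch of the FR rule (the function name `fixed_ratio` is made up for this example), the reinforced responses are simply every `ratio`-th one:

```python
def fixed_ratio(n_responses: int, ratio: int) -> list[int]:
    """Return the 1-based response numbers that are reinforced under FR-`ratio`."""
    return [r for r in range(1, n_responses + 1) if r % ratio == 0]


print(fixed_ratio(20, 5))  # [5, 10, 15, 20]
```

Because the requirement never changes, the learner can tell exactly how much work remains after each reinforcer.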

Fixed-Interval (FI) – Rates of Responding

The first response after a fixed time interval is reinforced.
Example: First response after 30 seconds is reinforced → FI-30s.
Result: scalloped response pattern: pauses, then increasing rate as time to reinforcement approaches.
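
The FI rule can be sketched as follows. The function name `fixed_interval` and the press times are hypothetical, and the sketch assumes the interval is timed from the previous reinforcer (or from the session start at t = 0).

```python
def fixed_interval(response_times: list[float], interval: float) -> list[float]:
    """Return the times of the reinforced responses under an FI schedule.

    Only the first response made at least `interval` seconds after the
    previous reinforcer is reinforced; earlier responses earn nothing.
    """
    reinforced = []
    available_at = interval
    for t in sorted(response_times):
        if t >= available_at:
            reinforced.append(t)
            available_at = t + interval  # the clock restarts at reinforcement
    return reinforced


presses = [5, 12, 29, 31, 40, 58, 62, 70, 95]   # seconds since session start
print(fixed_interval(presses, 30))  # [31, 62, 95]
```

Note that the presses at 5, 12, and 29 seconds are wasted effort, which is consistent with responding being concentrated near the end of each interval.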

Variable-Ratio (VR) – Rates of Responding

Reinforcement after an average number of responses, varying unpredictably.
Example: Reinforcement roughly every 5 responses, but the exact number varies → VR-5.
Result: very high, steady rate of responding.
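
A VR schedule can be sketched by drawing each response requirement at random. This sketch assumes requirements are drawn uniformly from 1 to 2 × mean − 1 (so they average the stated mean); real experiments may use other distributions, and the function name `variable_ratio` is made up here.

```python
import random


def variable_ratio(n_responses: int, mean_ratio: int, seed: int = 0) -> list[int]:
    """Return the 1-based response numbers reinforced under VR-`mean_ratio`.

    Each requirement is drawn uniformly from 1 .. 2*mean_ratio - 1, so the
    requirements average `mean_ratio` but any single one is unpredictable.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    reinforced = []
    next_at = rng.randint(1, 2 * mean_ratio - 1)
    while next_at <= n_responses:
        reinforced.append(next_at)
        next_at += rng.randint(1, 2 * mean_ratio - 1)
    return reinforced


print(variable_ratio(40, 5))
```

Since the very next response might always be the one that pays off, there is no safe moment to pause.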

Variable-Interval (VI) – Rates of Responding

First response after an unpredictable time interval is reinforced.
Example: First response after an average of 30 seconds is reinforced, but interval length varies → VI-30s.
Result: moderate, steady response rate.
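
A VI schedule differs from FI only in that each waiting time is drawn at random. This sketch assumes intervals are drawn uniformly between 0 and 2 × mean (averaging the stated mean); the function name `variable_interval` and the steady one-press-per-second responding are hypothetical.

```python
import random


def variable_interval(response_times: list[float], mean_interval: float,
                      seed: int = 0) -> list[float]:
    """Return the times of the reinforced responses under a VI schedule.

    After each reinforcer a new waiting time is drawn uniformly from
    0 .. 2*mean_interval; the first response after it elapses is reinforced.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    reinforced = []
    available_at = rng.uniform(0, 2 * mean_interval)
    for t in sorted(response_times):
        if t >= available_at:
            reinforced.append(t)
            available_at = t + rng.uniform(0, 2 * mean_interval)
    return reinforced


presses = list(range(1, 301))            # one press per second for 5 minutes
print(variable_interval(presses, 30))
```

Responding faster would not bring reinforcers sooner here, but a long pause risks leaving an available reinforcer uncollected, which fits the moderate, steady rate described above.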

Skinner Box: Experimental Setup for Operant Conditioning

A device used to study operant conditioning in animals (rats, pigeons).
Examples of each contingency:
- Positive Reinforcement: Rat learns to press a lever to receive food.
- Negative Reinforcement: Rat learns to press a lever to stop receiving electric foot shocks.
- Positive Punishment: Rat learns not to press a second lever because pressing it delivers a foot shock.
- Negative Punishment: Rat learns to stop pressing a lever because pressing it takes away something it wants (e.g., access to food).

Drive Reduction Theory (Hull)

Motivation arises from biological needs that create drives.
Behaviour is aimed at reducing these drives, thereby restoring homeostasis.
Key ideas:
- Needs: biological requirements (e.g., food, water, warmth).
- Drives: internal states of tension or arousal triggered by unmet needs (e.g., hunger, thirst).
- Behaviour that reduces a drive is reinforced (e.g., eating reduces hunger, so the behaviour that led to food is strengthened).

Motivation and Learning: Connecting Conditioning Types

Pavlovian (classical) conditioning: learning about contingencies between biologically relevant stimuli (e.g., food, pain) and neutral stimuli (e.g., a bell).
- The learner learns to predict good or bad events.
- Quote from lecture: “I can predict when something good or bad will happen to me.”
Instrumental (operant) conditioning: learning the contingency between enacting a behaviour (e.g., saying “please”) and a motivationally relevant outcome (e.g., getting a cookie or not getting one).
- The learner learns to control outcomes through actions.
Question raised: must a contingency always exist for learning to occur? Example: everyday navigation, such as riding a bus and picking up the layout of a city, may involve implicit learning without any explicit reward/punishment contingency.

Tolman (1948): Cognitive Maps and Latent Learning

Experimental design: 3 groups of rats ran daily maze trials.
- One group received food at the maze end from the start (reinforced).
- Two groups received no reinforcement during the first 10 days.
Findings across training days:
- Measured by number of errors; more errors = poorer learning.
- The group reinforced with food showed faster reduction in errors over time.
On day 11:
- One of the two previously non-reinforced groups was given a food reward for the first time.
- This group quickly reached the maze end with few errors once reinforcement was introduced.
- The group that remained unrewarded continued to make many errors.
Cognitive maps and latent learning:
- Tolman proposed that rats formed cognitive maps of the maze even without reinforcement.
- Latent learning: learning that has occurred but is not observable until a motivation to demonstrate it is provided.
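
One way to picture a cognitive map is as a graph of connected places that is built during unrewarded wandering and only used once there is a reason to reach the goal. The sketch below is purely illustrative and is not Tolman's procedure or model: the maze layout, the place names, and the functions `explore` and `shortest_route` are all assumptions made for this example.

```python
from collections import deque


def explore(maze: dict[str, list[str]], walk: list[str]) -> dict[str, set[str]]:
    """Build a 'cognitive map' from an unrewarded walk through the maze.

    The map only records which places connect; no reward is involved.
    """
    known: dict[str, set[str]] = {}
    for here, there in zip(walk, walk[1:]):
        if there in maze[here]:  # only real corridors can be traversed
            known.setdefault(here, set()).add(there)
            known.setdefault(there, set()).add(here)
    return known


def shortest_route(known: dict[str, set[str]], start: str, goal: str):
    """Breadth-first search over the learned map; returns a path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(known.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


maze = {
    "start": ["A"],
    "A": ["start", "B", "dead_end"],
    "B": ["A", "goal"],
    "dead_end": ["A"],
    "goal": ["B"],
}
wander = ["start", "A", "dead_end", "A", "B", "goal", "B", "A", "start"]
cognitive_map = explore(maze, wander)            # learning, with no food anywhere
print(shortest_route(cognitive_map, "start", "goal"))  # ['start', 'A', 'B', 'goal']
```

The point of the analogy is that the map exists after exploration even though nothing in behaviour reveals it; the direct route only shows up once food at the goal gives a reason to use it.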

Instrumental Conditioning Summary

Instrumental (operant) conditioning involves making a voluntary response that leads to an outcome.
Classifications are based on two dimensions:
- Whether something is added or subtracted (Positive/Negative).
- Whether the behaviour is increased or decreased (Reinforcement/Punishment).
Shaping can be used when animals need to learn complex behaviours.
Reinforcement schedules are rules for delivering rewards/punishments and are categorized by:
- Fixed vs Variable (consistency over time or responses).
- Ratio vs Interval (dependence on number of responses vs time).
Latent learning refers to learning that occurs without an obvious motivator at the time of learning and is only expressed once a motivator is introduced.

Connections to Prior Knowledge and Real-World Relevance

Instrumental conditioning builds on Thorndike’s Law of Effect and Skinner’s operant theory, complemented by Hull’s drive reduction and Tolman’s cognitive maps.
Real-world applications include education (reinforcement strategies), animal training (shaping and schedules), and behavioural modification techniques.
The problems with punishment (avoidance, fear, and imitation of the punisher) raise ethical concerns and argue for preferring reinforcement-based strategies and shaping to promote desirable behaviours.
The interaction between Pavlovian cues and instrumental actions highlights how predictions and control over outcomes influence motivation and learning in everyday life.