Reinforcement and Operant Conditioning Notes

Operant Learning and Reinforcement

  • Operant: a class of behaviour that operates on the environment to produce a common environmental consequence.
  • Learning: a change in behaviour due to experience.
  • Operant Learning: a change in a class of behaviour as a function of the consequences that followed it.
  • Learning = Conditioning.

Effect of Consequences

  • Reinforcement increases behaviour when the consequence follows the behaviour.
  • Punishment decreases behaviour when the consequence follows the behaviour.
  • Descriptions:
    • Increase: Reinforce
    • Decrease: Punish

Effects of Reinforcing Consequences

  • Reinforcement effects on behaviour include:
    • Increase in frequency
    • Increase in duration
    • Increase in intensity
    • Increase in quickness (i.e., decrease in latency)
    • Increase in variability
  • Reinforcement leads to an increase in whatever the reinforcer is contingent on.

Two Ways of Reinforcing (Positive vs Negative)

  • Positive Reinforcement: add a stimulus to increase a behaviour.
  • Negative Reinforcement: remove a stimulus to increase a behaviour.
  • These are the two fundamental routes to reinforcement.

Rewards vs. Reinforcers

  • Distinction:
    • Reinforcer: a consequence that increases the probability of a behaviour.
    • Reward: an outcome that may or may not function as a reinforcer; not a technical term in the reinforcement framework by itself.

Reinforcers Maintain Behaviour

  • Graphical idea: reinforcement can maintain responding over time; increases in response probability can rise toward an asymptote depending on the reinforcer and conditions.
  • Key concept: conditioning trials shape performance toward a learned asymptote based on reinforcement history.
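The rise toward an asymptote can be sketched with a simple exponential-approach model. This is an assumed illustration only; the asymptote and rate parameters are not from the notes:

```python
import math

def response_probability(trial, asymptote=0.9, k=0.3):
    """Negatively accelerated learning curve rising toward an asymptote.

    An assumed exponential-approach model (not from the notes):
    p(trial) = asymptote * (1 - e^(-k * trial)).
    """
    return asymptote * (1 - math.exp(-k * trial))

# Early trials produce large gains; later trials approach the asymptote.
for n in [1, 5, 20]:
    print(f"trial {n:>2}: p = {response_probability(n):.2f}")
```

The exact functional form matters less here than the qualitative shape: gains per conditioning trial shrink as performance approaches the learned asymptote.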

Notes about Reinforcement (Functional Description)

  • Reinforcement is a functional description, not a theory.
  • It avoids circularity when used correctly: we describe the consequence as having functioned as a reinforcer for the response; we do not explain the increase in behaviour by saying the consequence was "reinforcing."
  • Correct Usage examples:
    • “The consequence (e.g., food) functioned as a reinforcer for the response (e.g., lever pressing).”
    • “The consequence (e.g., food) reinforced the response (e.g., lever pressing).”
  • Incorrect usage would imply the consequence inherently increased probability without specifying its function as a reinforcer.

Types of Reinforcers

  • Two main types:
    1) Unconditional (Primary) Reinforcer:
      • Properties derived from the species’ evolutionary history (phylogenetic importance).
      • Examples: food, water, sex, sleep, social interaction, certain sensory stimulation, escape from harm (e.g., extreme heat).
      • Effectiveness usually depends on some deprivation; often species-specific.
    2) Conditional (Secondary) Reinforcer:
      • Formerly neutral stimuli/events that acquire reinforcing power via a contingent relationship with primary reinforcers or other conditioned reinforcers.
  • Key idea: secondary reinforcers gain value because they signal or enable access to primary reinforcers.

Empirical Examples of Reinforcement Concepts

  • Liberman et al. (1973): Differential reinforcement of incompatible behaviours
    • Incompatible behaviours: e.g., “rational talk” vs “irrational talk.”
    • Rational talk was reinforced; irrational talk was placed on extinction (reinforcement withheld) and decreased.
  • Sullivan & Leon (1987): conditioned reinforcement in a perceptual/olfactory paradigm
    • Pups trained with peppermint odor conditioning showed differences in time spent near peppermint and in olfactory bulb activity (2-DG uptake) depending on conditioning groups.
    • Illustrates how neutral odors can acquire reinforcing properties through conditioning.

Delay Reduction Theory (Conditional Reinforcement)

  • Translation: The effectiveness of a stimulus as a conditional reinforcer is determined by the degree to which it is correlated with a reduction in the delay to terminal reinforcement.
  • Note: You do not need to memorize every detail on every slide, but understand the core idea that conditional reinforcers gain strength by predicting shorter delays to obtaining the primary reinforcement.
  • Formula (hyperbolic decay form shown in slides): V = A / (1 + K·D)
    • where:
      • V = value of the reinforcer (reinforcing effectiveness)
      • A = maximum reinforcing value (asymptote)
      • K = rate of discounting (how quickly value declines with delay)
      • D = delay to terminal reinforcement
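The decay can be sketched numerically. The parameter values below are illustrative assumptions, not values from the slides:

```python
def reinforcer_value(A, K, D):
    """Hyperbolic decay of reinforcing value with delay: V = A / (1 + K*D)."""
    return A / (1 + K * D)

# Illustrative parameters (assumed):
A = 100.0  # maximum reinforcing value at zero delay
K = 0.5    # discounting rate

# Value drops steeply at short delays, then levels off (hyperbolic shape).
for D in [0, 1, 5, 20]:
    print(f"D = {D:>2}: V = {reinforcer_value(A, K, D):.1f}")
```

Note the key property of a hyperbola: at D = 0 the value equals the asymptote A, and early increments of delay cost far more value than later ones.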

Conditional Reinforcement in a Choice Task

  • Example (pigeons):
    • Key A provided food 20% of the time plus a conditioned reinforcer (CR).
    • Key B provided food 50% of the time but no CR.
    • Result: pigeons preferred Key A despite less actual food, due to the presence of the conditioned reinforcer associated with Key A.
  • Interpretation: The conditioned reinforcer can drive choice when its predictive value for the primary reinforcement is high enough, illustrating the power of conditional reinforcement.

Variables Affecting Reinforcement

  • Contingency: degree of correlation between a behaviour and its consequence.
  • Contiguity: nearness in time (temporal contiguity) or space (spatial contiguity) between the operant response and the reinforcer.
    • High contiguity strengthens the effectiveness of the reinforcer; longer delays reduce effectiveness.
  • Hyperbolic Decay Function (conceptual):
    • The value of a reinforcer declines with delay according to a hyperbolic rule, captured by the formula above.

Reinforcer Magnitude and Characteristics

  • Magnitude (size) generally increases reinforcing effectiveness, but the relationship is not linear.
  • Increases in magnitude yield diminishing returns: bigger reinforcers do not always scale proportionally in effectiveness.
  • Unconditional reinforcers often show diminishing effects with greater magnitude due to satiation and other factors.
  • In short: reinforcer magnitude is not a simple linear function of effectiveness; context and the organism’s motivational state must be considered.

Reinforcer Characteristics, Task Demands, and Motivating Operations

  • Specific reinforcer used matters (e.g., chocolate vs sunflower seeds for a given animal).
  • Task characteristics matter (e.g., what behaviour is being reinforced—pecking for food vs a different target for a different species).
  • Motivating Operations (MOs): influence the effectiveness of reinforcers
    • Establishing Operations: increase the effectiveness of a reinforcer (e.g., deprivation increases value).
    • Abolishing Operations: decrease the effectiveness of a reinforcer (e.g., satiation reduces value).
  • Other considerations: competing contingencies of reinforcement (e.g., choosing between alternatives like study vs watching YouTube).

Premack Principle

  • Core idea: the more probable (high-probability) behaviour can reinforce the less probable (low-probability) behaviour.
  • Formal statement (Premack, 1965): Of any two responses, the more probable one will reinforce the less probable one.
  • Example: If a child prefers playing pinball to eating candy, allowing pinball access after each candy consumption reinforces candy eating.
  • Problems/Limitations:
    • Does not fully account for conditional reinforcement effects.
    • Low-probability behaviour can reinforce high-probability behaviour when deprivation is present (e.g., after deprivation of the low-probability behaviour).
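The Premack rule can be sketched as a simple comparison of baseline response probabilities. The behaviours and probability values below are hypothetical:

```python
def premack_reinforcer(baseline):
    """Given baseline probabilities (relative time freely allocated to each
    behaviour), return (reinforcer, target) pairs predicted by the Premack
    principle: the more probable behaviour can reinforce the less probable one.
    """
    pairs = []
    for hi, p_hi in baseline.items():
        for lo, p_lo in baseline.items():
            if p_hi > p_lo:
                pairs.append((hi, lo))
    return pairs

# Hypothetical free-operant baseline for a child:
baseline = {"pinball": 0.6, "candy": 0.3, "homework": 0.1}
# Pinball can reinforce candy or homework; candy can reinforce homework.
print(premack_reinforcer(baseline))
```

Note that this sketch captures only the basic Premack ordering; as the limitations above indicate, deprivation can reverse these predictions.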

Schedules of Reinforcement

  • Schedule of Reinforcement: a rule describing how reinforcement is delivered.
  • Schedule effects: different schedules produce unique patterns and rates of behaviour over time; long-term effects are predictable and observed across species.

Cumulative Records and Data Representation

  • Cumulative record: a plot of cumulative responses (y-axis) over time (x-axis).
  • Slope of the cumulative line indicates the rate of responding; steeper slope corresponds to higher response rates.
  • Comparing frequency vs cumulative frequency:
    • Frequency: number of responses in a given interval.
    • Cumulative frequency: total number of responses up to each time point; useful for visualizing rate changes over time.
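A minimal sketch of building a cumulative record from response timestamps and reading the rate off its slope (the session data are hypothetical):

```python
def cumulative_record(timestamps):
    """Return (time, cumulative count) pairs: the y-value steps up by 1
    at each response, as in a cumulative record plot."""
    return [(t, i + 1) for i, t in enumerate(sorted(timestamps))]

def rate(timestamps, t_start, t_end):
    """Responses per second in [t_start, t_end): the slope of the
    cumulative record over that window."""
    n = sum(t_start <= t < t_end for t in timestamps)
    return n / (t_end - t_start)

# Hypothetical session: fast responding early, slower responding later.
times = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
print(cumulative_record(times)[-1])        # final point: (50, 10)
print(rate(times, 0, 5), rate(times, 5, 50))  # steep early slope, shallow later slope
```

A steeper segment of the cumulative line corresponds to a higher local response rate, which is exactly what the slope comparison above computes.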

Types of Schedules (CRF and Intermittent)

  • Continuous Reinforcement (CRF): reinforcement is delivered after every instance of the target behaviour.
    • Pros: rapid acquisition and clear learning signal.
    • Cons: rare in natural environments; leads to faster extinction if reinforcement stops.
  • Intermittent (Partial) Reinforcement: reinforcement is delivered on some occasions only.
    • Four main types:
    • Fixed-Ratio (FR): reinforcement after a fixed number of responses (e.g., FR-120).
      • Generates a Post-Reinforcement Pause (PRP).
      • As ratio increases, PRP tends to increase; following PRP, run rates become steady.
    • Variable-Ratio (VR): reinforcement after a varying number of responses around an average (e.g., VR-360).
      • Ratios can be a list such as 1, 10, 20, 30, 60, 100, 180, 240, 300, 360, 420, 480, 540, 600, 660, 690, 690, 720, 739.
      • Mean ratio ≈ 360.
      • Characteristics: high and constant response rates; strong resistance to extinction; common in natural environments and gambling games.
    • Fixed-Interval (FI): reinforcement after a fixed amount of time has passed since the last reinforcement.
      • Produces a scalloped response pattern; responding increases gradually as the interval ends.
      • Not common in natural environments.
    • Variable-Interval (VI): intervals vary around an average (e.g., VI-3 min).
      • PRPs are rare and short; produces steady, moderate response rates.
      • Not as high as VR; common in natural environments.
    • Other schedules: Duration schedules (Fixed/Variable Duration) make reinforcement contingent on continuous performance for a set period (e.g., practicing guitar for 30 minutes); often used in practice settings, although such schedules may not deliver a reinforcer on every occasion.
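The ratio schedules above can be sketched as simple counting rules. The uniform draw used for the VR requirement is an assumption chosen only so that the average requirement equals the stated mean:

```python
import random

def simulate_fr(n_reinforcers, ratio):
    """FR schedule: every reinforcer requires exactly `ratio` responses."""
    return n_reinforcers * ratio  # total responses emitted

def simulate_vr(n_reinforcers, mean_ratio, rng):
    """VR schedule: each reinforcer requires a random number of responses,
    here drawn uniformly from 1..(2*mean_ratio - 1) so the average
    requirement is mean_ratio (an assumed distribution for illustration)."""
    return sum(rng.randint(1, 2 * mean_ratio - 1) for _ in range(n_reinforcers))

rng = random.Random(0)
print(simulate_fr(10, 120))                   # FR-120: exactly 1200 responses
print(simulate_vr(10_000, 360, rng) / 10_000)  # mean requirement, near 360
```

The FR requirement is perfectly predictable (hence the post-reinforcement pause), while the VR requirement varies around its mean, which is one reason VR responding is high and steady.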

Observing Schedules: Bed Making Example

  • Bed Making data show frequency and slope changes across days under different conditions.
  • Slopes (e.g., 0.35 vs 0.75) indicate different rates of responding under different reinforcement contexts.

Operant Extinction

  • Extinction: withholding reinforcers that maintain a behaviour.
  • Data example (cumulative responses) show a period of no responding followed by responses and occasional resets.
  • Spontaneous Recovery: extinguished behaviour can reappear in similar situations after a period without reinforcement; multiple extinction sessions across settings reduce this effect.
  • Reinstatement: recovery of an extinguished behaviour when the reinforcer is presented again on its own, independently of responding.
  • Extinction Burst: a temporary increase in the reinforced behaviour when reinforcement is withdrawn.
  • Extinction and Variability: increase in operant variability during extinction; organism may contact other sources of reinforcement.
  • Resistance to Extinction (PRE): partial reinforcement effect
    • Schedules that reinforce more intermittently tend to take longer to extinguish than those that reinforce less intermittently.
    • E.g., FR-100 vs. continuous reinforcement (CRF): during extinction, a behaviour previously reinforced on FR-100 persists through many more unreinforced responses than one previously on CRF, because long runs without reinforcement were already typical under FR-100.

Final Note on Extinction

  • Extinction is new learning, not forgetting.
  • Spontaneous recovery and reinstatement show that original learning persists in some form.
  • The core idea: learning is a change in behaviour due to experience; when reinforcement stops, the probability of the response decreases, reflecting new learning rather than simply forgetting.

Connections to Foundational Principles and Real-World Relevance

  • Reinforcement principles underpin many educational, clinical, and organizational practices:
    • Shaping behaviours through successive approximations using CRF and gradually changing schedules.
    • Using Premack and motivating operations to increase adherence to desired behaviours.
    • Understanding and leveraging contingencies and contiguity to design effective behavioural interventions.
    • Recognizing extinction dynamics in behavior change programs and how to minimize relapse (extinction bursts, spontaneous recovery).
  • Ethical and practical implications:
    • Use of reinforcement should respect welfare and autonomy; avoid coercive or harmful reinforcement strategies.
    • Consider individual differences in primary vs secondary reinforcement and motivational states (deprivation, satiation).
    • Be mindful of unintended reinforcement of undesired behaviours when designing schedules (e.g., inadvertently reinforcing avoidance behaviours).