Correlation Coefficient

Overview

  • Focus of lecture: understanding correlation, specifically Pearson’s correlation coefficient (r)

  • Central slogan: “Correlation does not imply causation” – reminder from statistics class

Fundamental Terms

  • Correlation / Association: Statistical relationship between two quantitative variables

    • Measured with Pearson’s r when the relationship is linear and variables are on interval/ratio scales

  • Variables in examples

    • Age vs. coordination in children

    • Product price vs. quality perception

    • Hours slept last night vs. next-day mood

  • Scatter Plot: Graphical display representing each participant’s paired scores as a single dot (X-axis = variable 1, Y-axis = variable 2)

Scatter Plots & Visualising Association

  • Visual inspection reveals form, direction, and strength of a relationship before statistical computation

  • Hours-slept vs. happy-mood fictional data (Table 11-1)

    • Axes labelled 0–12 hours (X) and 0–8 mood points (Y)

    • Dots form an upward band, suggesting positive linear trend

Types / Patterns of Correlation

  • Linear correlations

    • Positive: as X increases, Y increases – dots slope upward (/)

    • Negative: as X increases, Y decreases – dots slope downward (\)

  • Curvilinear (non-linear) correlations

    • Example: Yerkes–Dodson–like curve of performance vs. anxiety (low & high anxiety impair performance; moderate is optimal)

    • Pearson r inappropriate because it only captures linear relations and can return r≈0 even when a clear curved pattern exists

  • No correlation

    • Example: income vs. shoe size – dots scattered randomly, r≈0
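The point that Pearson r misses curved patterns can be checked directly. The sketch below (not from the lecture; the numbers are made up) builds a perfect inverted-U relationship, like the performance-vs.-anxiety example, and shows that r still comes out near zero:

```python
import numpy as np

# Symmetric inverted-U pattern (like performance vs. anxiety):
# performance peaks at moderate anxiety and falls off at both extremes.
anxiety = np.arange(0, 11, dtype=float)       # 0..10
performance = -(anxiety - 5.0) ** 2 + 25.0    # perfect inverted U

r = np.corrcoef(anxiety, performance)[0, 1]
print(round(r, 4))  # ≈ 0 despite a perfectly systematic curved relationship
```

Because the upward half of the curve cancels the downward half in the cross-products, the linear index r is blind to the obvious pattern; only a scatter plot reveals it.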

Pearson Correlation Coefficient (r)

  • Numerical index summarising direction & strength of a linear relationship

    • +1.00 → perfect positive linear association (all dots exactly on an upward-sloping line)

    • -1.00 → perfect negative linear association (all dots on a downward-sloping line)

    • 0.00 → no linear association

  • “Closer to perfect” = r’s absolute value |r| approaches 1; “closer to no correlation” = |r| approaches 0

Computing r: Z-Scores and Cross-Products

  1. Convert raw scores to Z-scores within each variable

    • Z = \frac{X - \mu}{\sigma}

    • Guarantees comparable, unit-free metrics where

      • High raw score → positive Z ; low raw score → negative Z

      • Magnitude reflects how far (in SD units) a score sits from its mean

  2. Form cross-products for each participant

    • Z_X Z_Y

    • Sign logic (Table)

      • (+)(+) → + : contributes to positive r

      • (-)(-) → + : contributes to positive r (both scores low)

      • (+)(-) or (-)(+) → – : contributes to negative r (one high, one low)

      • Zero Z from mid-range scores → null contribution, pushing r toward 0

  3. Sum the cross-products: \sum Z_X Z_Y

  4. Divide by sample size (N) to obtain mean cross-product → the correlation coefficient

    • r = \frac{\sum Z_X Z_Y}{N}
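The four steps above translate directly into code. A minimal NumPy sketch (the sleep/mood numbers are hypothetical, not the lecture's Table 11-1 data):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r as the mean cross-product of Z-scores.

    Z-scores use the population SD (divide by N), matching
    r = sum(Zx * Zy) / N.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()   # np.std defaults to the population SD
    zy = (y - y.mean()) / y.std()
    return np.sum(zx * zy) / len(x)

hours_slept = [5, 7, 8, 6, 6, 10, 8]   # hypothetical data
mood        = [2, 4, 7, 2, 3, 6, 5]
print(round(pearson_r(hours_slept, mood), 2))
```

Note the population SD: if sample SDs (N − 1) were used for the Z-scores, the matching formula would divide the summed cross-products by N − 1 instead.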

Example Applications

  • Sexual satisfaction vs. relationship satisfaction

    • Scatter plot shows an upward trend; computed r = .69, indicating a strong positive association

  • Hours slept vs. happy mood revisited later with and without outlier (see Outliers section)

Correlation vs. Causation

  • Three logical possibilities whenever r ≠ 0:

    1. X causes Y (X → Y)

    2. Y causes X (X ← Y)

    3. Spurious / third-variable explanation: some unmeasured factor influences both (X ← Z → Y)

  • Example causal web for couples:

    • Relationship satisfaction ←→ Sexual satisfaction

    • Both potentially influenced by communication quality, stress levels, health/fitness, etc.

  • Proper causal inference requires experimental manipulation, longitudinal models, or advanced statistical controls—not simple r

Outliers

  • Outlier: observation far from the pattern of the rest of the data

  • Why problematic? A single extreme pair can substantially inflate or deflate r

  • Demonstration with mood vs. sleep

    • Without outlier: r = .85 (strong positive)

    • After inserting one extreme case: r = -.11 (slightly negative; the single outlier essentially destroys the apparent relationship)
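A single extreme pair really can collapse r, as the demonstration claims. The sketch below uses made-up numbers (not the lecture's dataset) to reproduce the effect:

```python
import numpy as np

# Hypothetical sleep/mood data with a clear positive linear trend
hours = np.array([5., 6., 6., 7., 7., 8., 9., 10.])
mood  = np.array([2., 3., 4., 4., 5., 6., 7., 8.])

r_clean = np.corrcoef(hours, mood)[0, 1]

# One extreme pair: someone who slept 12 hours but reports the lowest mood
hours_out = np.append(hours, 12.0)
mood_out  = np.append(mood, 0.0)
r_outlier = np.corrcoef(hours_out, mood_out)[0, 1]

print(round(r_clean, 2), round(r_outlier, 2))  # strong positive vs. near zero
```

One point out of nine is enough to drag r from near-perfect down to roughly zero, which is why a scatter plot should always accompany the coefficient.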

  • Possible sources / interpretations

    1. Data entry or measurement error

    2. Genuine but rare phenomenon worth investigating

    3. Legitimate part of population distribution that must be retained

Dealing with Outliers

  • Winsorising: truncate extreme values to a specified cut-off (e.g., a percentile, or ±3 SD) to reduce undue influence

    • Example APA-style write-up: “Scores were Winsorised not to exceed three SD (Osborne & Overbay, 2004)… results unchanged.”

  • Other strategies: robust statistics, transformation, exclusion with justification, or separate analysis
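A minimal sketch of the ±3 SD Winsorising rule described above, using NumPy's `clip` and hypothetical scores (values beyond mean ± 3 SD are pulled back to the boundary rather than deleted):

```python
import numpy as np

# Hypothetical scores with one extreme value
scores = np.array([4., 5., 5., 6., 5., 4., 6., 5., 5., 6., 4., 5., 6., 50.])

m, sd = scores.mean(), scores.std()   # population mean and SD
lo, hi = m - 3 * sd, m + 3 * sd

# Winsorise: values outside mean ± 3 SD are set to the boundary value
winsorised = np.clip(scores, lo, hi)

print(scores.max(), round(winsorised.max(), 2))
```

One caveat worth noting: the outlier itself inflates the SD used to set the cut-off, so with very small samples a lone extreme case may never exceed ±3 SD; robust cut-offs (e.g., median-based) are a common alternative.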

Restriction in Range

  • Occurs when sample spans only a narrow band of possible values on one or both variables

  • Consequence: observed |r| biased toward 0 because variability necessary for correlation is artificially limited

    • Real-world example: studying SAT vs. GPA only at an elite university where SAT scores are uniformly high

  • Always examine sampling frame & consider whether restricted range masks a stronger population relationship
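The attenuation from restricted range can be simulated. This sketch (illustrative only, not the SAT/GPA data) generates a moderately strong linear relationship, then recomputes r on just the top slice of X, mimicking an elite-university subsample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a moderately strong linear relationship (stand-in for SAT vs. GPA)
n = 2000
x = rng.normal(0.0, 1.0, n)
y = 0.7 * x + rng.normal(0.0, 0.7, n)

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the sample to the top 10% of the x distribution
keep = x > np.quantile(x, 0.90)
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))
```

Nothing about the underlying relationship changes in the restricted subsample; only the variability of X shrinks, and |r| shrinks with it, which is exactly the bias toward 0 described above.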

Practical, Ethical & Methodological Implications

  • Report scatter plots alongside r to reveal form & outliers

  • Inspect for curvilinear patterns; avoid Pearson r if relationship is non-linear

  • Disclaim causal language when presenting correlations; suggest alternate hypotheses & confounders

  • Transparently document outlier handling and potential restriction-in-range limitations

  • Encourage replication and complementary designs (experimental or longitudinal) for causal claims