Correlation Coefficient
Overview
Focus of lecture: understanding correlation, specifically Pearson’s correlation coefficient (r)
Central slogan: “Correlation does not imply causation” – reminder from statistics class
Fundamental Terms
Correlation / Association: Statistical relationship between two quantitative variables
Measured with Pearson’s r when the relationship is linear and variables are on interval/ratio scales
Variables in examples
Age vs. coordination in children
Product price vs. quality perception
Hours slept last night vs. next-day mood
Scatter Plot: Graphical display representing each participant’s paired scores as a single dot (X-axis = variable 1, Y-axis = variable 2)
Scatter Plots & Visualising Association
Visual inspection reveals form, direction, and strength of a relationship before statistical computation
Hours-slept vs. happy-mood fictional data (Table 11-1)
Axes labelled 0–12 hours (X) and 0–8 mood points (Y)
Dots form an upward band, suggesting positive linear trend
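As a rough illustration of how each participant's paired scores become a single dot, here is a minimal text-only scatter sketch in Python. The hours/mood pairs are invented for illustration and are not the actual Table 11-1 values.

```python
# Text-only scatter sketch: each participant's (hours, mood) pair becomes
# one '*' on a character grid (X = hours slept, Y = mood). Data invented.

data = [(5, 2), (6, 2), (6, 3), (7, 4), (8, 4), (8, 5), (8, 7), (10, 6)]

def text_scatter(pairs, x_max=12, y_max=8):
    """Return a character grid; the top row is the highest mood score."""
    grid = [[' '] * (x_max + 1) for _ in range(y_max + 1)]
    for x, y in pairs:
        grid[y_max - y][x] = '*'
    return '\n'.join(''.join(row) for row in grid)

print(text_scatter(data))  # the '*'s form a rough upward band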
Types / Patterns of Correlation
Linear correlations
Positive: as X increases, Y increases – dots slope upward (/)
Negative: as X increases, Y decreases – dots slope downward (\)
Curvilinear (non-linear) correlations
Example: Yerkes–Dodson–like curve of performance vs. anxiety (low & high anxiety impair performance; moderate is optimal)
Pearson r inappropriate because it only captures linear relations and can return r≈0 even when a clear curved pattern exists
No correlation
Example: income vs. shoe size – dots scattered randomly, r≈0
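To see concretely why Pearson r misses curved patterns, the following sketch computes r for a perfect inverted-U relation (an invented symmetric quadratic standing in for the anxiety-performance curve; population-SD formula):

```python
# Sketch: Pearson r can be ~0 even for a perfect curved (inverted-U)
# relation, because the upward and downward halves cancel. Data invented.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

anxiety = list(range(11))                       # 0..10
performance = [-(a - 5) ** 2 + 25 for a in anxiety]  # peak at moderate anxiety

print(pearson_r(anxiety, performance))  # essentially zero despite a perfect curve
```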
Pearson Correlation Coefficient (r)
Numerical index summarising direction & strength of a linear relationship
+1.00 → perfect positive linear association (all dots exactly on an upward-sloping line)
-1.00 → perfect negative linear association (all dots on a downward-sloping line)
0.00 → no linear association
“Closer to perfect” means the absolute value |r| approaches 1; “closer to no correlation” means |r| approaches 0
Computing r: Z-Scores and Cross-Products
Convert raw scores to Z-scores within each variable
Z = \frac{X - \mu}{\sigma}
Guarantees comparable, unit-free metrics where
High raw score → positive Z ; low raw score → negative Z
Magnitude reflects how far (in SD units) a score sits from its mean
Form cross-products for each participant
Z_X Z_Y
Sign logic (Table)
(+)(+) → + : contributes to positive r
(-)(-) → + : contributes to positive r (both scores low)
(+)(-) or (-)(+) → – : contributes to negative r (one high, one low)
Zero Z from mid-range scores → null contribution, pushing r toward 0
Sum the cross-products: \sum Z_X Z_Y
Divide by sample size (N) to obtain mean cross-product → the correlation coefficient
r = \frac{\sum Z_X Z_Y}{N}
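The z-score cross-product recipe above can be sketched directly in Python. The hours/mood scores are invented illustration values (not the lecture's Table 11-1 data), and population SD is used, matching the mean-cross-product formula:

```python
# Pearson r via the z-score cross-product recipe: standardise each variable,
# multiply paired z-scores, and average. Data invented for illustration.

def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in values]

def pearson_r(xs, ys):
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)  # mean cross-product

hours = [5, 7, 8, 6, 6, 10, 8, 8]
mood  = [2, 4, 7, 2, 3, 6, 5, 4]
print(round(pearson_r(hours, mood), 2))  # ≈ .84 for these invented scores
```

Note how each high-with-high or low-with-low pair contributes a positive cross-product, exactly as in the sign-logic table.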
Example Applications
Sexual satisfaction vs. relationship satisfaction
Scatter plot shows an upward trend; computed r = .69, indicating a strong positive association
Hours slept vs. happy mood revisited later with and without outlier (see Outliers section)
Correlation vs. Causation
Three logical possibilities whenever r ≠ 0:
X causes Y (\rightarrow)
Y causes X (\leftarrow)
Spurious / third-variable explanation: some unmeasured factor influences both X & Y (\uparrow)
Example causal web for couples:
Relationship satisfaction ←→ Sexual satisfaction
Both potentially influenced by communication quality, stress levels, health/fitness, etc.
Proper causal inference requires experimental manipulation, longitudinal models, or advanced statistical controls—not simple r
Outliers
Outlier: observation far from the pattern of the rest of the data
Why problematic? A single extreme pair can substantially inflate or deflate r
Demonstration with mood vs. sleep
Without outlier: r = .85 (strong positive)
After inserting one extreme case: r = -.11 (slightly negative; the apparent relationship is essentially destroyed)
Possible sources / interpretations
Data entry or measurement error
Genuine but rare phenomenon worth investigating
Legitimate part of population distribution that must be retained
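A minimal numeric sketch of the outlier effect (invented data, so the values differ from the lecture's .85 and -.11, but the pattern is the same):

```python
# Sketch: one extreme case can swamp r. A strong positive correlation
# collapses toward zero when a single wild pair is added. Data invented.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

hours = [5, 6, 6, 7, 8, 8, 8, 10]
mood  = [2, 2, 3, 4, 4, 5, 7, 6]
r_clean = pearson_r(hours, mood)                     # strong positive

# One participant who slept 12 hours but reported mood 0:
r_outlier = pearson_r(hours + [12], mood + [0])      # collapses toward zero
print(round(r_clean, 2), round(r_outlier, 2))
```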
Dealing with Outliers
Winsorising: pull extreme values back to a specified boundary (e.g., a given percentile, or ±3 SD) to reduce undue influence
Example APA-style write-up: “Scores were Winsorised not to exceed three SD (Osborne & Overbay, 2004)… results unchanged.”
Other strategies: robust statistics, transformation, exclusion with justification, or separate analysis
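A hedged one-pass sketch of the ±3 SD rule mentioned above: any score beyond three SDs of the mean is pulled back to that boundary rather than deleted (the data and function name are invented for illustration):

```python
# One-pass ±3 SD Winsorising sketch: extreme values are clipped to the
# mean ± 3*SD boundary instead of being removed. Data invented.

def winsorize_3sd(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    lo, hi = mean - 3 * sd, mean + 3 * sd
    return [min(max(v, lo), hi) for v in values]

scores = [5] * 6 + [6] * 5 + [100]   # eleven typical scores plus one wild one
clipped = winsorize_3sd(scores)
print(clipped[-1])                   # the wild score is pulled below 100
```

In practice the bounds are often recomputed after clipping (or computed on the clean scores), since an outlier inflates the SD used to set its own boundary.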
Restriction in Range
Occurs when sample spans only a narrow band of possible values on one or both variables
Consequence: observed |r| biased toward 0 because variability necessary for correlation is artificially limited
Real-world example: studying SAT vs. GPA only at an elite university where SAT scores are uniformly high
Always examine sampling frame & consider whether restricted range masks a stronger population relationship
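The attenuation can be sketched under an assumed linear-plus-noise model (deterministic seed; all parameters invented), mimicking the elite-university example by keeping only the top of the X distribution:

```python
import random

# Range-restriction sketch: the same linear-plus-noise data yield a much
# smaller |r| when only a narrow band of X is sampled. Parameters invented.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

random.seed(1)
x = [random.uniform(0, 100) for _ in range(500)]   # full range of scores
y = [xi + random.gauss(0, 15) for xi in x]         # outcome = score + noise

keep = [(xi, yi) for xi, yi in zip(x, y) if xi > 80]   # "elite-only" subsample
r_full = pearson_r(x, y)
r_restricted = pearson_r([p[0] for p in keep], [p[1] for p in keep])
print(round(r_full, 2), round(r_restricted, 2))    # restricted |r| is far smaller
```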
Practical, Ethical & Methodological Implications
Report scatter plots alongside r to reveal form & outliers
Inspect for curvilinear patterns; avoid Pearson r if relationship is non-linear
Disclaim causal language when presenting correlations; suggest alternate hypotheses & confounders
Transparently document outlier handling and potential restriction-in-range limitations
Encourage replication and complementary designs (experimental or longitudinal) for causal claims