Correlation Coefficient

Overview

  • Focus of lecture: understanding correlation, specifically Pearson’s correlation coefficient (r)

  • Central slogan: “Correlation does not imply causation” – reminder from statistics class

Fundamental Terms

  • Correlation / Association: Statistical relationship between two quantitative variables

    • Measured with Pearson’s r when the relationship is linear and variables are on interval/ratio scales

  • Variables in examples

    • Age vs. coordination in children

    • Product price vs. quality perception

    • Hours slept last night vs. next-day mood

  • Scatter Plot: Graphical display representing each participant’s paired scores as a single dot (X-axis = variable 1, Y-axis = variable 2)

Scatter Plots & Visualising Association

  • Visual inspection reveals form, direction, and strength of a relationship before statistical computation

  • Hours-slept vs. happy-mood fictional data (Table 11-1)

    • Axes labelled 0–12 hours (X) and 0–8 mood points (Y)

    • Dots form an upward band, suggesting positive linear trend

Types / Patterns of Correlation

  • Linear correlations

    • Positive: as X increases, Y increases – dots slope upward (/)

    • Negative: as X increases, Y decreases – dots slope downward (\)

  • Curvilinear (non-linear) correlations

    • Example: Yerkes–Dodson–like curve of performance vs. anxiety (low & high anxiety impair performance; moderate is optimal)

    • Pearson r inappropriate because it only captures linear relations and can return r≈0 even when a clear curved pattern exists

  • No correlation

    • Example: income vs. shoe size – dots scattered randomly, r≈0
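The point that Pearson r misses curved patterns can be checked directly. The sketch below (not from the lecture; the numbers are made up) builds a perfect inverted-U relationship, like the performance-vs.-anxiety example, and shows that r still comes out near zero:

```python
import numpy as np

# Symmetric inverted-U pattern (like performance vs. anxiety):
# performance peaks at moderate anxiety and falls off at both extremes.
anxiety = np.arange(0, 11, dtype=float)       # 0..10
performance = -(anxiety - 5.0) ** 2 + 25.0    # perfect inverted U

r = np.corrcoef(anxiety, performance)[0, 1]
print(round(r, 4))  # ≈ 0 despite a perfectly systematic curved relationship
```

Because the upward half of the curve cancels the downward half in the cross-products, the linear index r is blind to the obvious pattern; only a scatter plot reveals it.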

Pearson Correlation Coefficient (r)

  • Numerical index summarising direction & strength of a linear relationship

    • +1.00 → perfect positive linear association (all dots exactly on an upward-sloping line)

    • -1.00 → perfect negative linear association (all dots on a downward-sloping line)

    • 0.00 → no linear association

  • “Closer to perfect” = r’s absolute value |r| approaches 1; “closer to no correlation” = |r| approaches 0

Computing r: Z-Scores and Cross-Products

  1. Convert raw scores to Z-scores within each variable

    • Z = \frac{X - \mu}{\sigma}

    • Guarantees comparable, unit-free metrics where

      • High raw score → positive Z ; low raw score → negative Z

      • Magnitude reflects how far (in SD units) a score sits from its mean

  2. Form cross-products for each participant

    • Z_X Z_Y

    • Sign logic (Table)

      • (+)(+) → + : contributes to positive r

      • (-)(-) → + : contributes to positive r (both scores low)

      • (+)(-) or (-)(+) → – : contributes to negative r (one high, one low)

      • Zero Z from mid-range scores → null contribution, pushing r toward 0

  3. Sum the cross-products: \sum Z_X Z_Y

  4. Divide by sample size (N) to obtain mean cross-product → the correlation coefficient

    • r = \frac{\sum Z_X Z_Y}{N}
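The four steps above translate directly into code. A minimal NumPy sketch (the sleep/mood numbers are hypothetical, not the lecture's Table 11-1 data):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r as the mean cross-product of Z-scores.

    Z-scores use the population SD (divide by N), matching
    r = sum(Zx * Zy) / N.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()   # np.std defaults to the population SD
    zy = (y - y.mean()) / y.std()
    return np.sum(zx * zy) / len(x)

hours_slept = [5, 7, 8, 6, 6, 10, 8]   # hypothetical data
mood        = [2, 4, 7, 2, 3, 6, 5]
print(round(pearson_r(hours_slept, mood), 2))
```

Note the population SD: if sample SDs (N − 1) were used for the Z-scores, the matching formula would divide the summed cross-products by N − 1 instead.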

Example Applications

  • Sexual satisfaction vs. relationship satisfaction

    • Scatter plot shows an upward trend; computed r = .69, indicating a strong positive association

  • Hours slept vs. happy mood revisited later with and without outlier (see Outliers section)

Correlation vs. Causation

  • Three logical possibilities whenever r ≠ 0:

    1. X causes Y (X → Y)

    2. Y causes X (X ← Y)

    3. Spurious / third-variable explanation: some unmeasured factor influences both (X ← Z → Y)

  • Example causal web for couples:

    • Relationship satisfaction ←→ Sexual satisfaction

    • Both potentially influenced by communication quality, stress levels, health/fitness, etc.

  • Proper causal inference requires experimental manipulation, longitudinal models, or advanced statistical controls—not simple r

Outliers

  • Outlier: observation far from the pattern of the rest of the data

  • Why problematic? A single extreme pair can substantially inflate or deflate r

  • Demonstration with mood vs. sleep

    • Without outlier: r = .85 (strong positive)

    • After inserting one extreme case: r = -.11 (slightly negative; the single outlier essentially destroys the apparent relationship)
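A single extreme pair really can collapse r, as the demonstration claims. The sketch below uses made-up numbers (not the lecture's dataset) to reproduce the effect:

```python
import numpy as np

# Hypothetical sleep/mood data with a clear positive linear trend
hours = np.array([5., 6., 6., 7., 7., 8., 9., 10.])
mood  = np.array([2., 3., 4., 4., 5., 6., 7., 8.])

r_clean = np.corrcoef(hours, mood)[0, 1]

# One extreme pair: someone who slept 12 hours but reports the lowest mood
hours_out = np.append(hours, 12.0)
mood_out  = np.append(mood, 0.0)
r_outlier = np.corrcoef(hours_out, mood_out)[0, 1]

print(round(r_clean, 2), round(r_outlier, 2))  # strong positive vs. near zero
```

One point out of nine is enough to drag r from near-perfect down to roughly zero, which is why a scatter plot should always accompany the coefficient.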

  • Possible sources / interpretations

    1. Data entry or measurement error

    2. Genuine but rare phenomenon worth investigating

    3. Legitimate part of population distribution that must be retained

Dealing with Outliers

  • Winsorising: truncate extreme values to a specified cut-off (e.g., a percentile, or ±3 SD) to reduce undue influence

    • Example APA-style write-up: “Scores were Winsorised not to exceed three SD (Osborne & Overbay, 2004)… results unchanged.”

  • Other strategies: robust statistics, transformation, exclusion with justification, or separate analysis
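A minimal sketch of the ±3 SD Winsorising rule described above, using NumPy's `clip` and hypothetical scores (values beyond mean ± 3 SD are pulled back to the boundary rather than deleted):

```python
import numpy as np

# Hypothetical scores with one extreme value
scores = np.array([4., 5., 5., 6., 5., 4., 6., 5., 5., 6., 4., 5., 6., 50.])

m, sd = scores.mean(), scores.std()   # population mean and SD
lo, hi = m - 3 * sd, m + 3 * sd

# Winsorise: values outside mean ± 3 SD are set to the boundary value
winsorised = np.clip(scores, lo, hi)

print(scores.max(), round(winsorised.max(), 2))
```

One caveat worth noting: the outlier itself inflates the SD used to set the cut-off, so with very small samples a lone extreme case may never exceed ±3 SD; robust cut-offs (e.g., median-based) are a common alternative.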

Restriction in Range

  • Occurs when sample spans only a narrow band of possible values on one or both variables

  • Consequence: observed |r| biased toward 0 because variability necessary for correlation is artificially limited

    • Real-world example: studying SAT vs. GPA only at an elite university where SAT scores are uniformly high

  • Always examine sampling frame & consider whether restricted range masks a stronger population relationship
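The attenuation from restricted range can be simulated. This sketch (illustrative only, not the SAT/GPA data) generates a moderately strong linear relationship, then recomputes r on just the top slice of X, mimicking an elite-university subsample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a moderately strong linear relationship (stand-in for SAT vs. GPA)
n = 2000
x = rng.normal(0.0, 1.0, n)
y = 0.7 * x + rng.normal(0.0, 0.7, n)

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the sample to the top 10% of the x distribution
keep = x > np.quantile(x, 0.90)
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))
```

Nothing about the underlying relationship changes in the restricted subsample; only the variability of X shrinks, and |r| shrinks with it, which is exactly the bias toward 0 described above.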

Practical, Ethical & Methodological Implications

  • Report scatter plots alongside r to reveal form & outliers

  • Inspect for curvilinear patterns; avoid Pearson r if relationship is non-linear

  • Disclaim causal language when presenting correlations; suggest alternate hypotheses & confounders

  • Transparently document outlier handling and potential restriction-in-range limitations

  • Encourage replication and complementary designs (experimental or longitudinal) for causal claims