Notes on the Correlation Coefficient and Its Properties

Pearson correlation coefficient (r)

  • Measures the strength and direction of a linear relationship between two variables (X and Y).

  • Ranges from -1 to +1; values near ±1 indicate a strong linear relationship, values near 0 a weak or absent linear relationship.

  • If all points lie exactly on a straight line with nonzero slope, r = +1.0 (positive slope) or -1.0 (negative slope); if there is no linear relationship, r ≈ 0.0.

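  • The statistic is unitless. Its magnitude is invariant to linear transformations of the data (X* = a + bX, Y* = c + dY with b ≠ 0, d ≠ 0); the sign is preserved when b and d have the same sign and reversed when they have opposite signs.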

  • Symmetric: r_{XY} = r_{YX}.

  • Undefined when the denominator is zero, i.e., when X or Y is constant (all points fall on a perfectly horizontal or vertical line).

  • Causation caveat: correlation does not imply causation; r captures association, not causal direction (especially in time-dependent or feedback processes).
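The symmetry, rescaling, and undefined-case properties above can be checked numerically. The following is a minimal sketch, not a reference implementation; the helper name `pearson_r` and the five data pairs are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson r from deviation scores; raises ValueError if X or Y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when X or Y has zero variance")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                    # symmetry: r_XY = r_YX

# Linear rescaling with positive multipliers (b = 2, d = 0.5) leaves r unchanged.
r_rescaled = pearson_r([10 + 2 * a for a in x], [-3 + 0.5 * b for b in y])

# A negative multiplier on one variable reverses the sign but not the magnitude.
r_flipped = pearson_r([-a for a in x], y)

# A constant variable makes the denominator zero, so r is undefined.
try:
    pearson_r([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
    constant_case = "defined"
except ValueError:
    constant_case = "undefined"

print(r, r_swapped, r_rescaled, r_flipped, constant_case)
```

Raising an error for the zero-variance case is one design choice; returning NaN would be another reasonable convention.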


Formulas

  • Conceptual (covariation) formula:
    \boxed{ r_{XY} = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2 \; \sum (Y_i - \bar Y)^2}} }

  • Computational (covariance-based) formula:
    \boxed{ r_{XY} = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n}{\sqrt{\left(\sum X_i^2 - (\sum X_i)^2/n\right) \; \left(\sum Y_i^2 - (\sum Y_i)^2/n\right)}} }

  • Note: these two forms are algebraically equivalent.
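A short sketch of the equivalence: both forms are coded directly from the formulas above and applied to the same made-up data. The function names `r_conceptual` and `r_computational` are illustrative, and both assume neither variable is constant.

```python
import math

def r_conceptual(x, y):
    """Deviation-score form: sum of cross-products over root of summed squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def r_computational(x, y):
    """Raw-sums form: needs only n and the five sums of X, Y, X^2, Y^2, XY."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    num = sum_xy - sum_x * sum_y / n
    den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den

# Hypothetical illustrative data (not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_a = r_conceptual(x, y)
r_b = r_computational(x, y)
print(r_a, r_b)  # the two values agree up to floating-point rounding
```

In floating-point arithmetic the raw-sums form can lose precision when the means are large relative to the spread, so the deviation form is usually the safer choice in software.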


Key properties of r

  • Linearity vs nonlinearity: r measures linear association; non-linear patterns can yield low |r| even if a strong relationship exists.

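  • Invariance under linear scale changes: r is unchanged by linear re-scaling of X and/or Y with positive multipliers; a negative multiplier on exactly one variable reverses the sign but leaves |r| unchanged.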

  • Outliers: magnitude of r is sensitive to outliers; can inflate or deflate the observed relationship.

  • Range restriction: narrowing the range of X or Y often reduces the observed r.

  • Levels of analysis: r depends on the unit of analysis (individuals vs groups) and can change with aggregation; inferring individual-level relationships from group-level correlations is the ecological fallacy.

  • Interpreting magnitude is context-dependent: small r can be meaningful in some contexts; very large r may still be insufficient for strong predictive utility depending on reliability and costs.

  • r^2 interpretation: proportion of variance in Y explained by X in the linear model.

    • In population terms: r^2 is the fraction of variance in Y accounted for by the linear relationship with X.

    • In sample terms: reflects fit of the sample regression line; use with caution.
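To illustrate the sample-level reading of r^2, the sketch below fits the least-squares line of Y on X and compares 1 - SSE/SST with r^2. The data are made up, and the variable names are illustrative.

```python
import math

# Hypothetical illustrative data (not from the notes).
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)      # total sum of squares of Y (SST)

r = sxy / math.sqrt(sxx * syy)

# Least-squares regression line of Y on X.
slope = sxy / sxx
intercept = my - slope * mx

# Residual sum of squares (SSE) around the fitted line.
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Proportion of Y variance accounted for by the line.
explained = 1 - sse / syy

print(r ** 2, explained)
```

For these values r = 0.8, so about 64% of the sample variance in Y is accounted for by the fitted line.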


Examples and context

  • Example (air-traffic controller data): r ≈ 0.75 indicates a strong positive linear relation between initial test score (X) and post-training performance (Y).

  • Outliers can drastically change r (e.g., from 0.14 to 0.45 with one extreme point); consider analyses with and without outliers.

  • Range restriction example: selecting on X can reduce observed r between X and Y because of reduced X variance.

  • Levels of analysis example: correlations can be high between branch averages but near zero within branches; aggregation changes the observed r (the ecological-fallacy problem).
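The range-restriction and levels-of-analysis examples can be reproduced on synthetic data. This is a sketch under stated assumptions: all values are constructed for illustration (a deterministic pseudo-noise pattern stands in for random error), and `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(x, y):
    """Pearson r from deviation scores (assumes neither variable is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# --- Range restriction ---
# Y = X + noise, where noise is a deterministic pattern in [-10, 10].
x_full = list(range(100))
noise = [(16 * i) % 21 - 10 for i in x_full]
y_full = [a + e for a, e in zip(x_full, noise)]
r_full = pearson_r(x_full, y_full)

# Select only the cases with high X (as in screening on a test score).
selected = [(a, b) for a, b in zip(x_full, y_full) if a >= 80]
x_sel = [a for a, _ in selected]
y_sel = [b for _, b in selected]
r_restricted = pearson_r(x_sel, y_sel)

# --- Levels of analysis ---
# Three branches whose means line up perfectly, while inside each branch
# the X and Y offsets are constructed to have zero cross-product.
centers = [(2, 2), (6, 6), (10, 10)]
x_offsets = [-1, 0, 1]
y_offsets = [1, -2, 1]
branches = [([cx + dx for dx in x_offsets], [cy + dy for dy in y_offsets])
            for cx, cy in centers]

r_within = [pearson_r(bx, by) for bx, by in branches]
r_between = pearson_r([c[0] for c in centers], [c[1] for c in centers])

print(r_full, r_restricted)
print(r_within, r_between)
```

The noise has the same spread in both analyses; only the variance of X shrinks under selection, which is what pulls the restricted r down. In the branch data, r is zero within every branch yet 1.0 across branch means.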


Interpreting the size of r

  • Context matters. A tiny r (e.g., r = 0.01) can be meaningful in some scenarios (e.g., survival or high-stakes decisions) and trivial in others.

  • Even large correlations (e.g., r ≈ 0.90) may fall short of what a context demands (e.g., when judged as a test-retest reliability coefficient, or when predictive validity is weighed against costs and benefits).

  • Practical interpretation often involves r^2 or utility considerations rather than r alone.


r^2 and alternative interpretations

  • r^2: proportion of variance in Y associated with variance in X; commonly reported in regression contexts.

  • Alternative interpretation (direct, not squared): in utility analysis, utility is a function of r (not r^2) and other factors; small r can still yield meaningful utility depending on costs, base rates, and other parameters.


Other correlation coefficients (overview)

  • There are many related indices; Pearson r is the default, with several important special cases:

Spearman's rho (rank correlation)
  • Used when data are ordinal or when outliers distort Pearson r.

  • Formula (based on ranks):
    \boxed{ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} }

  • d_i = difference between the ranks of X_i and Y_i for each pair; n = number of pairs. The formula is exact only when there are no tied ranks.
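A small sketch of the rank formula, checked against Pearson r computed on the ranks themselves. It assumes no tied values; the helper names `ranks` and `pearson_r` and the data are made up for illustration.

```python
import math

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def pearson_r(x, y):
    """Pearson r from deviation scores (assumes neither variable is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (not from the notes), no tied values.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

rx, ry = ranks(x), ranks(y)
n = len(x)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))

rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))   # shortcut formula
r_on_ranks = pearson_r(rx, ry)                 # Pearson r applied to the ranks

print(rho, r_on_ranks)
```

With tied values, average ranks are normally assigned and Pearson r on the ranks is the safer computation, since the shortcut formula is then only approximate.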

Phi coefficient (2x2 contingency data)
  • For dichotomous variables arranged in a 2x2 table, phi is the Pearson correlation on dichotomous data.

  • Formula:
    \boxed{ \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} }

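  • The maximum attainable phi depends on the marginal proportions; unless the marginals of the two variables match, phi cannot reach 1.0 even when the association is as strong as the marginals allow.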
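The sketch below computes phi from cell counts, confirms that it equals Pearson r on the underlying 0/1 scores, and shows a table whose marginals cap phi well below 1. The cell layout is an assumption (a and b in the first row, c and d in the second, with a = both variables coded 1), and the counts are made up.

```python
import math

def phi_from_table(a, b, c, d):
    """Phi from 2x2 cell counts; assumed layout: rows X=1, X=0 and columns Y=1, Y=0."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def pearson_r(x, y):
    """Pearson r from deviation scores (assumes neither variable is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical counts (not from the notes).
a, b, c, d = 30, 10, 10, 30
phi_table = phi_from_table(a, b, c, d)

# Expand the table into raw 0/1 scores and apply Pearson r.
xs = [1] * a + [1] * b + [0] * c + [0] * d
ys = [1] * a + [0] * b + [1] * c + [0] * d
phi_pearson = pearson_r(xs, ys)

# Unequal marginals: only 10% of cases have X = 1 but 50% have Y = 1.
# Every X = 1 case also has Y = 1 (b = 0), the strongest pattern these
# marginals permit, yet phi stays far below 1.
phi_max_example = phi_from_table(10, 0, 40, 50)

print(phi_table, phi_pearson, phi_max_example)
```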

Point-biserial correlation (continuous X, dichotomous Y)
  • When Y is 0/1 and X is continuous:
    \boxed{ r_{pb} = \frac{\bar X_1 - \bar X_0}{s_X} \sqrt{p q} }

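  • p = n1/n, q = n0/n, where n1 and n0 are the sizes of the Y = 1 and Y = 0 groups; \bar X_1 and \bar X_0 are the corresponding group means of X, and s_X is the standard deviation of all X scores (computed with n, not n - 1, in the denominator so that the formula equals Pearson r exactly).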

  • Testing r_{pb} = 0 is equivalent to the independent-samples (pooled-variance) t-test comparing the two group means.
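A sketch of both claims on made-up data: the point-biserial formula reproduces Pearson r on the 0/1 coding, and the t statistic derived from r matches the pooled-variance two-sample t. It assumes s_X is the population standard deviation (`statistics.pstdev`); variable names are illustrative.

```python
import math
import statistics

# Hypothetical illustrative data (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # continuous scores
y = [0, 0, 0, 1, 1, 1]               # dichotomous group membership

x1 = [a for a, g in zip(x, y) if g == 1]
x0 = [a for a, g in zip(x, y) if g == 0]
n, n1, n0 = len(x), len(x1), len(x0)
p, q = n1 / n, n0 / n

# Point-biserial formula, with s_X computed using n in the denominator.
mean_diff = statistics.mean(x1) - statistics.mean(x0)
r_pb = mean_diff / statistics.pstdev(x) * math.sqrt(p * q)

# Ordinary Pearson r on the same data.
mx, my = statistics.mean(x), statistics.mean(y)
sxy = sum((a - mx) * (g - my) for a, g in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((g - my) ** 2 for g in y)
r_pearson = sxy / math.sqrt(sxx * syy)

# t statistic for testing r = 0, with n - 2 degrees of freedom.
t_from_r = r_pb * math.sqrt(n - 2) / math.sqrt(1 - r_pb ** 2)

# Pooled-variance two-sample t statistic for the difference in group means.
pooled_var = ((n1 - 1) * statistics.variance(x1)
              + (n0 - 1) * statistics.variance(x0)) / (n - 2)
t_pooled = mean_diff / math.sqrt(pooled_var * (1 / n1 + 1 / n0))

print(r_pb, r_pearson, t_from_r, t_pooled)
```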

Biserial correlation (continuous X, artificially dichotomized Y)
  • When a continuous variable Y has been artificially dichotomized, a biserial estimate can recover the underlying r between X and the continuous Y under a normality assumption.

  • General idea: r_{bis} = ((\bar X_1 - \bar X_0) / s_X) × (pq / λ), where p and q are the group proportions and λ is the height (ordinate) of the standard normal curve at the dichotomy threshold. Equivalently, r_{bis} = r_{pb} × \sqrt{pq} / λ.

  • Note: because \sqrt{pq} / λ is always greater than 1 (about 1.25 when p = q = 0.5), biserial r is larger in magnitude than the corresponding point-biserial r; it is sensitive to departures from normality and has a larger standard error with unequal group sizes.
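A minimal sketch of the biserial estimate, using the standard form r_bis = ((mean of X in group 1 - mean of X in group 0) / s_X) × (pq / λ), with λ taken from Python's `statistics.NormalDist`. The data are made up, s_X is assumed to be the population standard deviation, and the dichotomy is assumed to cut a latent normal variable.

```python
import math
from statistics import NormalDist, mean, pstdev

# Hypothetical illustrative data (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # continuous scores
y = [0, 1, 0, 1, 0, 1]               # artificially dichotomized variable

x1 = [a for a, g in zip(x, y) if g == 1]
x0 = [a for a, g in zip(x, y) if g == 0]
p = len(x1) / len(x)
q = 1 - p

# Threshold on the standard normal scale that leaves proportion p above it,
# and the height (ordinate) of the normal curve at that threshold.
std_normal = NormalDist()
z_cut = std_normal.inv_cdf(q)
lam = std_normal.pdf(z_cut)

s_x = pstdev(x)
mean_diff = mean(x1) - mean(x0)

r_pb = mean_diff / s_x * math.sqrt(p * q)   # point-biserial
r_bis = mean_diff / s_x * (p * q / lam)     # biserial estimate

print(r_pb, r_bis, math.sqrt(p * q) / lam)
```

Because the estimate rescales r_pb by a factor above 1, it is not guaranteed to stay within [-1, +1] when the normality assumption fails.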

Other notes
  • Dichotomization can reduce the observed correlation; when possible, analyze with the original continuous variables or consider corrections (e.g., tetrachoric for paired dichotomies).

  • Tetrachoric correlation (not detailed here) estimates the correlation between two underlying continuous variables from a 2x2 table but relies on normality and is sensitive to sample size and nonnormality.


Practical cautions and concepts

  • Causation vs correlation: r alone cannot establish causation; observed relationships may be time-lagged, bidirectional, or due to a third variable.

  • Dynamic/feedback models: simple r may miss causal loops or time-dependent effects; advanced methods (e.g., two-stage least squares, LISREL) may be required for such structures.

  • Nonlinearity: r may understate the strength of a nonlinear relationship; consider scatterplots and nonlinear models when r is small but a clear pattern exists.

  • Scale transformations: r is invariant (up to sign) under linear rescaling; standardizing to z-scores does not change r.

  • Outliers and robustness: consider robust alternatives or with/without-outliers analyses to assess stability of the relationship.
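Two of these cautions are easy to demonstrate on constructed data: a perfect but nonlinear relationship that yields r = 0, and a single extreme point that changes r sharply. The values below are made up for illustration and do not reproduce the chapter's figures; `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(x, y):
    """Pearson r from deviation scores (assumes neither variable is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Nonlinearity: Y is an exact function of X (a symmetric parabola),
# yet the linear correlation is zero.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [a ** 2 for a in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Outliers: report r with and without the suspect point.
x_base = [1, 2, 3, 4, 5]
y_base = [3, 1, 4, 2, 3]
r_without = pearson_r(x_base, y_base)
r_with = pearson_r(x_base + [15], y_base + [12])   # one extreme point added

print(r_curve, r_without, r_with)
```

Here one added point moves r from roughly 0.14 to above 0.9, which is why a scatterplot and a with/without comparison are worth doing before interpreting r.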


Worked example (brief)

  • Data example: X = test score, Y = performance; a large positive r (e.g., r ≈ 0.75) indicates good predictive potential for screening.

  • Calculation details (summary):

    • Compute n and the sums \sum X_i, \sum Y_i, \sum X_i^2, \sum Y_i^2, and \sum X_i Y_i.

    • Use the computational formula to obtain r from these sums.
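As an illustration of these steps, here is the arithmetic for a made-up set of five (X, Y) pairs: (2, 1), (4, 3), (6, 2), (8, 5), (10, 4). These values are hypothetical and are not the chapter's data.

```latex
n = 5,\quad \sum X_i = 30,\quad \sum Y_i = 15,\quad
\sum X_i^2 = 220,\quad \sum Y_i^2 = 55,\quad \sum X_i Y_i = 106

r_{XY} = \frac{106 - (30)(15)/5}
              {\sqrt{\left(220 - 30^2/5\right)\left(55 - 15^2/5\right)}}
       = \frac{16}{\sqrt{(40)(10)}}
       = \frac{16}{20}
       = 0.80
```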


Quick references from the chapter

  • r is the Pearson product-moment correlation coefficient.

  • Other coefficients discussed include Spearman's rho, phi, point-biserial, and biserial correlations, with notes on when to use each.

  • Key cautions include: interpretation depends on context, level of analysis matters, and outliers/range restriction can substantially affect r.