Notes on the Correlation Coefficient and Its Properties

The correlation coefficient r measures the strength and direction of a linear relationship between two variables (X and Y).
r ranges over [-1, +1]; values near ±1 indicate a strong linear relationship, and values near 0 indicate a weak or no linear relationship.
If all points lie exactly on a straight line, r = +1.0 for a positive slope and r = -1.0 for a negative slope; if there is no linear relationship, r ≈ 0.0.
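The statistic is unitless, and its magnitude is invariant to linear transformations of the data (X* = a + bX, Y* = c + dY with b ≠ 0, d ≠ 0); r itself is unchanged when b and d have the same sign and reverses sign when they have opposite signs.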
Symmetric: r_{XY} = r_{YX}.
Undefined when the denominator is zero, i.e., when X or Y is constant (all points fall on a perfectly horizontal or vertical line).
Causation caveat: correlation does not imply causation; r captures association, not causal direction (especially in time-dependent or feedback processes).
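
The symmetry and linear-transformation properties can be checked numerically. The following is a minimal sketch in plain Python; the helper name `pearson_r`, the constants in the transformations, and the small data set are illustrative assumptions, not part of the notes.

```python
import math

def pearson_r(x, y):
    """Pearson r from deviation scores (illustrative helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_original = pearson_r(x, y)

# Positive slopes on both variables (b = 3, d = 0.5): r is unchanged.
r_same_sign = pearson_r([10 + 3 * v for v in x], [-1 + 0.5 * v for v in y])

# Negative slope on X only (b = -3): |r| is unchanged, the sign reverses.
r_flipped = pearson_r([10 - 3 * v for v in x], y)

# Swapping the roles of X and Y leaves r unchanged.
r_swapped = pearson_r(y, x)

print(r_original, r_same_sign, r_flipped, r_swapped)
```

In this sketch only the transformation with a negative slope on one variable changes anything, and it changes the sign of r, not its magnitude.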

Conceptual (covariation) formula:
$\boxed{ r_{XY} = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2 \; \sum (Y_i - \bar Y)^2}} }$
Computational (covariance-based) formula:
$\boxed{ r_{XY} = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n}{\sqrt{\left(\sum X_i^2 - (\sum X_i)^2/n\right) \; \left(\sum Y_i^2 - (\sum Y_i)^2/n\right)}} }$
Note: these two forms are algebraically equivalent.
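
As a quick check of that equivalence, the sketch below implements both forms in plain Python and compares them on a small made-up data set. The function names `r_deviation` and `r_raw_sums` are hypothetical labels chosen for this illustration.

```python
import math

def r_deviation(x, y):
    """Conceptual form: sums of products of deviations from the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_raw_sums(x, y):
    """Computational form: built only from n and five raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_a = r_deviation(x, y)
r_b = r_raw_sums(x, y)
print(r_a, r_b)  # both equal 6 / sqrt(60), up to floating-point rounding
```

For these data both numerators equal 6 and both denominators equal sqrt(10 × 6), so the two forms agree. With very large values and small variances the raw-sum form can lose precision in floating point, which is one reason to prefer the deviation form in software.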

Linearity vs nonlinearity: r measures linear association; non-linear patterns can yield low |r| even if a strong relationship exists.
Invariance under linear scale changes: r is unchanged by linear re-scaling of X and/or Y with positive slopes; a negative slope on one variable reverses only the sign of r.
Outliers: magnitude of r is sensitive to outliers; can inflate or deflate the observed relationship.
Range restriction: narrowing the range of X or Y often reduces the observed r.
Levels of analysis: r depends on the unit of analysis (individuals vs groups) and can change with aggregation; inferring individual-level relationships from group-level correlations is the ecological fallacy.
Interpreting magnitude is context-dependent: small r can be meaningful in some contexts; very large r may still be insufficient for strong predictive utility depending on reliability and costs.
r^2 interpretation: proportion of variance in Y explained by X in the linear model.
- In population terms: r^2 is the fraction of variance in Y accounted for by the linear relationship with X.
- In sample terms: reflects fit of the sample regression line; use with caution.
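
The sample interpretation of r^2 can be illustrated by fitting the least-squares line and comparing r^2 with 1 - SSE/SST. This is a minimal sketch on made-up data; the variable names are assumptions chosen for the illustration.

```python
import math

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)  # total sum of squares of Y (SST)

r = sxy / math.sqrt(sxx * syy)

# Least-squares regression of Y on X.
slope = sxy / sxx
intercept = my - slope * mx
predicted = [intercept + slope * a for a in x]

# Residual sum of squares (SSE) around the fitted line.
sse = sum((b - p) ** 2 for b, p in zip(y, predicted))

r_squared = r ** 2
explained_share = 1 - sse / syy
print(r_squared, explained_share)  # both are 0.6 for these data, up to rounding
```

Here SST = 6 and SSE = 2.4, so the fitted line accounts for 60% of the sample variance in Y, which matches r^2 = 36/60.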

Example (air-traffic controller data): r ≈ 0.75 indicates a strong positive linear relation between initial test score (X) and post-training performance (Y).
Outliers can drastically change r (e.g., from 0.14 to 0.45 with one extreme point); consider analyses with and without outliers.
Range restriction example: selecting on X can reduce observed r between X and Y because of reduced X variance.
Levels of analysis example: correlations can be high between branch averages but near zero within branches; aggregation changes the observed r, so group-level results should not be read as individual-level ones (the ecological fallacy).
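
The levels-of-analysis point can be reproduced with a small constructed data set. In the sketch below, the three "branches", their centers, and the within-branch offsets are all hypothetical values chosen so that the within-branch correlation is exactly zero while the branch means fall on a straight line.

```python
import math

def pearson_r(x, y):
    """Pearson r from deviation scores (illustrative helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: three branches with four people each.
offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # no X-Y association within a branch
centers = [(10, 10), (20, 20), (30, 30)]          # branch averages rise together
branches = [[(cx + dx, cy + dy) for dx, dy in offsets] for cx, cy in centers]

# Correlation among individuals inside each branch.
within_r = [pearson_r([p[0] for p in b], [p[1] for p in b]) for b in branches]

# Correlation between branch averages.
mean_x = [sum(p[0] for p in b) / len(b) for b in branches]
mean_y = [sum(p[1] for p in b) / len(b) for b in branches]
between_r = pearson_r(mean_x, mean_y)

# Correlation among all individuals pooled across branches.
pooled = [p for b in branches for p in b]
pooled_r = pearson_r([p[0] for p in pooled], [p[1] for p in pooled])

print(within_r, between_r, pooled_r)
```

In this construction the within-branch r is 0 in every branch, the r between branch averages is 1, and the pooled individual-level r is 800/812, so the answer depends entirely on which unit of analysis is used.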

Context matters. A tiny r (e.g., r = 0.01) can be meaningful in some scenarios (e.g., survival or high-stakes decisions) and trivial in others.
Even a large correlation (e.g., r ≈ 0.90) may fall short of what a context requires (e.g., as a test-retest reliability, or as a predictive validity weighed against costs and benefits).
Practical interpretation often involves r^2 or utility considerations rather than r alone.

r^2: proportion of variance in Y associated with variance in X; commonly reported in regression contexts.
Alternative interpretation (direct, not squared): in utility analysis, utility is a function of r (not r^2) and other factors; small r can still yield meaningful utility depending on costs, base rates, and other parameters.

There are many related indices; Pearson r is the default, with several important special cases:

For dichotomous variables arranged in a 2x2 table, phi is the Pearson correlation on dichotomous data.
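Formula (a, b, c, d are the cell counts of the 2x2 table read row by row, so a and d are the diagonal cells):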
$\boxed{ \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} }$
Maximal phi depends on marginals; cannot always reach 1.0 even with strong association.
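
Both points about phi can be checked in a few lines. The sketch below assumes a particular cell labeling (a = both 1, b = X is 1 and Y is 0, c = X is 0 and Y is 1, d = both 0) and uses a made-up table with unequal marginals; the function names are hypothetical.

```python
import math

def phi_from_table(a, b, c, d):
    """Phi from the four cell counts of a 2x2 table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def pearson_r(x, y):
    """Pearson r from deviation scores (illustrative helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

def expand_table(a, b, c, d):
    """Turn cell counts back into paired 0/1 scores (assumed labeling)."""
    x = [1] * (a + b) + [0] * (c + d)
    y = [1] * a + [0] * b + [1] * c + [0] * d
    return x, y

# Made-up table: X splits 50/50, Y splits 20/80, and cell c is empty,
# so the association is as strong as these marginals allow.
a, b, c, d = 20, 30, 0, 50

phi = phi_from_table(a, b, c, d)
x, y = expand_table(a, b, c, d)
r = pearson_r(x, y)
print(phi, r)  # both 0.5, up to floating-point rounding
```

Phi matches the Pearson r computed on the 0/1 scores, and it reaches only 0.5 even though no case has X = 0 with Y = 1, because the marginal splits differ.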

When Y is 0/1 and X is continuous:
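$\boxed{ r_{pb} = \frac{\bar X_1 - \bar X_0}{s_X} \sqrt{p q} }$
Here \bar X_1 and \bar X_0 are the mean X scores in the Y = 1 and Y = 0 groups, p = n_1/n, q = n_0/n, where n_1 and n_0 are group sizes, and s_X is the standard deviation of all X scores computed with divisor n.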
Equivalent to a t-test in significance testing of group means.
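
The point-biserial formula and its link to the t-test can be verified numerically. The sketch below uses made-up scores, computes s_X with divisor n, and compares the result with an ordinary Pearson r on the 0/1 coding and with the pooled-variance two-sample t statistic; all names are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson r from deviation scores (illustrative helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up scores: X is continuous, Y is a 0/1 group code.
x1 = [6, 7, 8, 9, 10]   # X scores where Y = 1
x0 = [3, 5, 6, 7]       # X scores where Y = 0
x = x1 + x0
y = [1] * len(x1) + [0] * len(x0)

n, n1, n0 = len(x), len(x1), len(x0)
p, q = n1 / n, n0 / n
m1, m0 = sum(x1) / n1, sum(x0) / n0
mx = sum(x) / n
s_x = math.sqrt(sum((v - mx) ** 2 for v in x) / n)  # divisor n, not n - 1

# Point-biserial formula versus Pearson r on the 0/1 coding.
r_pb = (m1 - m0) / s_x * math.sqrt(p * q)
r_direct = pearson_r(x, y)

# t statistic for H0: r = 0, with n - 2 degrees of freedom.
t_from_r = r_pb * math.sqrt(n - 2) / math.sqrt(1 - r_pb ** 2)

# Pooled-variance two-sample t statistic for the difference in means.
ss1 = sum((v - m1) ** 2 for v in x1)
ss0 = sum((v - m0) ** 2 for v in x0)
pooled_var = (ss1 + ss0) / (n - 2)
t_pooled = (m1 - m0) / math.sqrt(pooled_var * (1 / n1 + 1 / n0))

print(r_pb, r_direct, t_from_r, t_pooled)
```

The two r values agree and the two t statistics agree, which is the sense in which testing r_pb against zero is the same test as comparing the group means.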

When the 0/1 variable is an artificial dichotomy of an underlying continuous variable, a biserial estimate can recover the underlying r under a normality assumption.
General idea: r_{bis} = ((\bar X_1 - \bar X_0) / s_X) × (pq / λ), where λ is the height of the standard normal curve at the dichotomy threshold; the estimate depends on group proportions and normality.
Note: biserial r tends to be larger than the corresponding point-biserial r under the normality assumption; sensitive to normality and has larger standard error with unequal group sizes.
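
The relation between the two coefficients can be sketched with the standard library's `statistics.NormalDist`. The example assumes the usual form r_bis = ((mean of group 1 - mean of group 0) / s_X) × (pq / λ), takes s_X with divisor n, and uses made-up scores; treat it as an illustration under those assumptions.

```python
import math
from statistics import NormalDist

# Made-up scores: X is continuous; the 0/1 grouping is treated as an
# artificial split of an underlying normal variable (an assumption).
x1 = [6, 7, 8, 9, 10]   # X scores in the upper (coded 1) group
x0 = [3, 5, 6, 7]       # X scores in the lower (coded 0) group
x = x1 + x0

n, n1, n0 = len(x), len(x1), len(x0)
p, q = n1 / n, n0 / n
m1, m0 = sum(x1) / n1, sum(x0) / n0
mx = sum(x) / n
s_x = math.sqrt(sum((v - mx) ** 2 for v in x) / n)  # divisor n

# Height of the standard normal curve at the point that cuts off
# proportions p and q (the density is symmetric, so either tail works).
std_normal = NormalDist()
threshold = std_normal.inv_cdf(p)
lam = std_normal.pdf(threshold)

r_pb = (m1 - m0) / s_x * math.sqrt(p * q)
r_bis = (m1 - m0) / s_x * (p * q / lam)

print(r_pb, r_bis)
```

Because sqrt(pq) / λ is always greater than 1, r_bis exceeds r_pb in magnitude for the same data, and the gap widens as the split moves away from 50/50.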

Dichotomization can reduce the observed correlation; when possible, analyze with the original continuous variables or consider corrections (e.g., tetrachoric for paired dichotomies).
Tetrachoric correlation (not detailed here) estimates the correlation between two underlying continuous variables from a 2x2 table but relies on normality and is sensitive to sample size and nonnormality.

Causation vs correlation: r alone cannot establish causation; observed relationships may be time-lagged, bidirectional, or due to a third variable.
Dynamic/feedback models: simple r may miss causal loops or time-dependent effects; advanced methods (e.g., two-stage least squares, LISREL) may be required for such structures.
Nonlinearity: r may understate the strength of a nonlinear relationship; consider scatterplots and nonlinear models when r is small but a clear pattern exists.
Scale transformations: r is invariant under linear rescaling with positive slopes; standardizing to z-scores does not change r.
Outliers and robustness: consider robust alternatives or with/without-outliers analyses to assess stability of the relationship.
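
Two of these cautions, nonlinearity and outliers, are easy to see on toy data. The sketch below uses made-up values: a perfect quadratic relationship that yields r = 0, and a weakly related set of points whose r jumps when one extreme point is added.

```python
import math

def pearson_r(x, y):
    """Pearson r from deviation scores (illustrative helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Nonlinearity: Y is an exact function of X, yet the linear correlation is zero.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Outliers: a weak relationship among six made-up points...
x_base = [1, 2, 3, 4, 5, 6]
y_base = [3, 1, 4, 2, 5, 3]
r_without = pearson_r(x_base, y_base)

# ...and the same points plus one extreme observation at (20, 20).
r_with = pearson_r(x_base + [20], y_base + [20])

print(r_curve, r_without, r_with)
```

Reporting r both with and without the extreme point, alongside a scatterplot, shows how much of the apparent relationship rests on a single observation.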

Data example: X = test score, Y = performance; a large positive r (e.g., r ≈ 0.75) indicates good predictive potential for screening.
Calculation details (summary):
- Compute n and the sums: $\sum X_i$, $\sum Y_i$, $\sum X_i^2$, $\sum Y_i^2$, $\sum X_i Y_i$.
- Use the computational formula to obtain r from these sums.

r is the Pearson product-moment correlation coefficient.
Other coefficients discussed include Spearman's rho, phi, point-biserial, and biserial correlations, with notes on when to use each.
Key cautions include: interpretation depends on context, level of analysis matters, and outliers/range restriction can substantially affect r.
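
The range-restriction caution can be illustrated with a small deterministic example. In the sketch below, the scores, the repeating error pattern, and the cutoff of 13 are all made-up assumptions; the same X-Y relationship holds throughout, and only the range of X that is retained changes.

```python
import math

def pearson_r(x, y):
    """Pearson r from deviation scores (illustrative helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: Y equals X plus a fixed error pattern that is
# uncorrelated with X by construction.
xs = list(range(1, 21))
errors = [2, -2, -2, 2] * 5
ys = [x + e for x, e in zip(xs, errors)]

r_full = pearson_r(xs, ys)

# Keep only cases selected on X (hypothetical cutoff: X >= 13).
selected = [(x, y) for x, y in zip(xs, ys) if x >= 13]
r_restricted = pearson_r([s[0] for s in selected], [s[1] for s in selected])

print(r_full, r_restricted)
```

The error variance is the same in both samples, but the selected group has much less X variance, so r falls from sqrt(665/745) in the full range to sqrt(42/74) in the restricted range.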