Spearman's correlation for ordinal data (rho) notes

Pearson vs Spearman: key ideas

  • Pearson's correlation coefficient, r, is used for interval or ratio data under specific assumptions:

    • Data are interval or ratio scales.

    • Each variable is normally distributed.

    • There is a linear relationship between x and y (bivariate normal distribution).

  • If these assumptions are not met, or if data are ordinal, a different correlation method is appropriate: Spearman's correlation coefficient (Spearman's rho, often denoted as ρ or r_s).

When to use Spearman's correlation (rho)

  • Use Spearman's rho when data are ordinal (rank data).

    • Example: ranking participants in a race (1st, 2nd, 3rd, …).

    • The exact time differences between ranks do not matter; only the order matters.

  • Use Spearman's rho when there are problems with interval/ratio data such as outliers that affect Pearson's r.

    • Converting to ordinal data can reduce the influence of outliers.

  • Spearman's rho applies the Pearson correlation formula to the ranked data, i.e., it is the Pearson correlation computed on ranks.

  • It is especially useful when the data naturally fall into ranks, when there are extreme scores, or when the relationship is monotonic (see definition below).
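
To make the outlier point concrete, here is a minimal Python sketch. The data are invented for illustration (hypothetical training hours and scores, with one extreme score), and the helper names `pearson` and `simple_ranks` are assumptions, not part of the lecture. `simple_ranks` assumes there are no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simple_ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Hypothetical data: the last score is an extreme outlier,
# but it does not change the ordering of the observations.
hours = [1, 2, 3, 4, 5, 6, 7]
score = [2, 4, 5, 7, 8, 10, 200]

r = pearson(hours, score)
rho = pearson(simple_ranks(hours), simple_ranks(score))

print(f"Pearson r    = {r:.2f}")    # about 0.64
print(f"Spearman rho = {rho:.2f}")  # 1.00
```

Because the outlier keeps its place in the ordering, rho is unaffected, while r drops well below 1.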

How Spearman's rho is computed

  • Core idea: transform the data to ranks, then compute the Pearson correlation on those ranks.

  • Formal definitions:

    • Let X and Y be two variables with ranks $R(X)$ and $R(Y)$ for each observation.

    • Spearman's rho is the correlation between the ranked variables:

      $$r_s = \text{Corr}\big(R(X), R(Y)\big) = \frac{\text{Cov}(R(X), R(Y))}{\sigma_{R(X)}\,\sigma_{R(Y)}}$$

    • Equivalent numerically to the standard Pearson correlation formula applied to the rank-transformed data:

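      $$r_s = \frac{\sum_{i=1}^{n} (R_{X_i} - \bar{R}_X)(R_{Y_i} - \bar{R}_Y)}{\sqrt{\sum_{i=1}^{n} (R_{X_i} - \bar{R}_X)^2}\,\sqrt{\sum_{i=1}^{n} (R_{Y_i} - \bar{R}_Y)^2}}$$

    where $R_{X_i}$ and $R_{Y_i}$ are the ranks of the $i$-th observation on X and Y, and the bars denote the means of the ranks.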

  • A commonly cited closed form for data without ties:

    $$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

    where $d_i = R(x_i) - R(y_i)$ is the difference between the two ranks of observation $i$, and $n$ is the number of paired observations. This form assumes there are no ties.

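  • In practice, most software (e.g., SPSS) handles ties when computing rho, typically by giving tied values the average of the ranks they span and then applying the Pearson formula to those ranks.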

  • Important concepts related to Spearman: monotonicity

    • A monotonic relationship means that as x increases, y consistently moves in one direction (it tends either to increase or to decrease), though not necessarily at a constant rate.

    • Example: Conforming behaviors in crowds — as crowd size increases, the percent who conform increases, but the rate of increase may level off as the crowd gets very large.

    • When data are ranked, Spearman captures the direction of the association without requiring a constant rate of change.
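
The rank-then-correlate recipe in this section can be sketched in plain Python as follows. This is a minimal illustration, not how SPSS implements it: the function names (`average_ranks`, `pearson`, `spearman`, `spearman_no_ties`) and the data are assumptions made for the example, and ties are given average ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values tied with the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_no_ties(x, y):
    """Closed form 1 - 6*sum(d^2) / (n(n^2 - 1)); valid only without ties."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired data with no ties.
x = [10, 20, 30, 40, 50]
y = [12, 8, 25, 19, 40]

print(average_ranks(y))                   # [2.0, 1.0, 4.0, 3.0, 5.0]
print(f"{spearman(x, y):.3f}")            # 0.800
print(f"{spearman_no_ties(x, y):.3f}")    # 0.800

# With ties, the two tied 3s share rank (3 + 4) / 2 = 3.5.
print(average_ranks([3, 1, 3, 2]))        # [3.5, 1.0, 3.5, 2.0]
```

For the untied data the rank differences are -1, 1, -1, 1, 0, so the sum of squared differences is 4 and the closed form gives 1 - 24/120 = 0.8, matching the Pearson-on-ranks result.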

Example interpretation: ranking vs percentage data

  • Suppose you have conformity data expressed as a percent (continuous), and you convert this to ranks.

    • The rank-based analysis still reveals the direction of the relationship.

    • The magnitude is interpreted in terms of monotonic association, not a strictly linear slope.

    • Ranking can flatten differences when relationships are non-linear, but preserves the sign of the association.
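
As a rough illustration of the conformity example, the sketch below uses invented numbers (hypothetical crowd sizes and percentages that level off) to show what ranking does to such data. The helper names `pearson` and `simple_ranks` are assumptions for this example, and `simple_ranks` assumes no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simple_ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Hypothetical data: conformity rises with crowd size, then levels off.
crowd_size = [1, 2, 4, 8, 16, 32]
percent_conforming = [10, 30, 50, 62, 68, 70]

# Gaps between successive percentages shrink; gaps between ranks are always 1.
percent_gaps = [b - a for a, b in zip(percent_conforming, percent_conforming[1:])]
percent_ranks = simple_ranks(percent_conforming)
print(percent_gaps)   # [20, 20, 12, 6, 2]
print(percent_ranks)  # [1, 2, 3, 4, 5, 6]

r = pearson(crowd_size, percent_conforming)
rho = pearson(simple_ranks(crowd_size), percent_ranks)
print(f"Pearson r    = {r:.2f}")    # about 0.74
print(f"Spearman rho = {rho:.2f}")  # 1.00
```

Both coefficients are positive, so the direction is preserved. Rho reaches 1 because the ordering agrees perfectly, even though the rate of increase is far from constant.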

Output and interpretation in practice

  • In SPSS, to run Spearman correlation:

    • Use: Analyze → Correlate → Bivariate

    • Select Spearman from the dialog box to run the correlation coefficient.

  • Example interpretation from a study on hazardous drinking and stress (as discussed in the lecture):

    • Pearson's r (top panel) shows a positive association of moderate strength between stress and hazardous drinking ($r \approx 0.3$).

    • Spearman's rho (lower panel) shows the same direction and a similar magnitude, also around $0.3$ (sometimes slightly larger).

  • Practical takeaway: The output is read the same way for Pearson's r and Spearman's rho: the sign indicates direction (positive/negative), and the magnitude indicates strength of association. The choice between them depends on whether the assumptions for Pearson are met, or whether the data are ordinal or the relationship is monotonic but not strictly linear.

Practical implications and considerations

  • Consequences of converting to ordinal (rank) data:

    • May reduce the ability to capture the full strength of the relationship due to loss of information.

    • For example, ranking first vs second place in a race loses information about how much faster the first finisher was.

    • Ranking is robust to outliers and non-normal distributions, but rho then describes only how consistently the ordering of x matches the ordering of y; it says nothing about how large the differences between scores are.

  • When to prefer Spearman over Pearson:

    • Data are ordinal or not normally distributed.

    • The relationship is monotonic but not linear.

    • Outliers strongly affect Pearson's r, and ranking mitigates their influence.

  • Relationship to previous lecture on Pearson's r:

    • Pearson's r relies on interval/ratio measurement, normality, and linearity.

    • Spearman's rho relaxes these requirements by applying a rank transformation before computing the correlation.

  • Real-world relevance:

    • Useful in psychology, sociology, and other fields where data are often ordinal (e.g., Likert scales) or where outliers distort linear relationships.

  • Ethical/practical implications:

    • Choice of measurement scale and statistical method affects conclusions; misapplying Pearson to ordinal data can misrepresent relationships.

    • When reporting results, clearly state whether you used Pearson or Spearman and why, including any implications for interpretation and generalizability.