Spearman Correlation Analysis

Spearman Correlation Analysis: Nonparametric Approach to Association

This mini-lecture introduces the Spearman correlation analysis, a non-parametric statistical test. The primary goal is to empower learners to describe this test conceptually and identify appropriate research questions for its application, without requiring them to perform the underlying mathematical calculations.

Conceptual Overview of Spearman Correlation

The Spearman correlation analysis serves as the nonparametric equivalent of the Pearson correlation analysis. Unlike the Pearson correlation, it does not assume that the data are normally distributed. It is therefore an appropriate test when a research question concerns a correlation and one or both of the variables being assessed are non-normally distributed.

Re-evaluating Pearson Correlation Assumptions

To understand why Spearman correlation is necessary, it is helpful to revisit the assumptions of the Pearson correlation analysis:

Normally Distributed Data: Both variables (e.g., x and y) are assumed to be approximately normally distributed.
Scale of Measurement: Both variables are assumed to be measured on an interval or ratio scale.
No Outliers: The data should not contain extreme data values, known as outliers. An outlier may represent a coding or data entry error, and it can significantly distort the correlation estimate. Examples include an IQ of 6 or 300 when observations are centered around 100 with a standard deviation of 15.
Linear Association: The association between x and y is assumed to be linear, meaning the best explanation of the relationship between the variables is a straight line.

The Spearman correlation analysis relaxes the assumptions of normality and linearity, and it is far less sensitive to outliers. However, it still requires data to be on at least an ordinal scale and assumes a monotonic relationship.

Visualizing Assumption Violations

The lecture provides graphical examples of situations where Pearson correlation assumptions are not met:

Top-Left Graph: Assumptions are met (standard case for Pearson).
Top-Right Graph: An association between x and y exists, but it is nonlinear (e.g., quadratic). As x increases, y doesn't consistently increase or decrease in a linear fashion.
Bottom-Left Graph: An outlier in y is present, affecting the estimate of the correlation, even though x values are within a typical range.
Bottom-Right Graph: An outlier in both x and y is shown. Without this outlier, there might be no clear association between x and y. Including it can erroneously suggest a positive association because the outlier pulls the data in a particular direction.

The Core Mechanism: Rank Transformation

When Pearson correlation assumptions are not met, the Spearman correlation analysis (also known as the Spearman rank correlation coefficient) offers a solution. Its core principle involves transforming the raw measured data into ranks.

How Rank Transformation Works

For a given variable, observations are ordered from lowest to highest:

The observation with the lowest value receives a rank of 1.
The second lowest receives a rank of 2.
This continues until the highest observation receives a rank of n, where n is the total number of observations in the sample.
Handling Ties: If two or more observations have exactly the same value, each receives the average of the ranks they would otherwise occupy. For example, if the observations in 3rd and 4th place are tied, they both receive a rank of 3.5.
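
The ranking rule above, including the averaging of tied ranks, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name average_ranks and the sample values are assumptions made for this example.

```python
def average_ranks(values):
    """Return 1-based ranks for `values`; tied values share their average rank."""
    # Positions of the observations, ordered from lowest to highest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of observations tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) would occupy ranks start+1..end+1.
        shared_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


# Hypothetical data: the two 7s would occupy ranks 3 and 4, so both get 3.5.
print(average_ranks([2, 6, 7, 7, 50]))  # [1.0, 2.0, 3.5, 3.5, 5.0]
```

The ranks are returned in the original order of the observations, so they can be placed next to the raw values as in a rank table.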

Calculating Spearman Correlation

After rank-transforming both variables (x and y), the Spearman correlation is simply calculated by performing a Pearson correlation analysis on these rank-transformed data.
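
As a rough sketch of this two-step recipe (rank both variables, then run a Pearson correlation on the ranks), the following Python code uses only the standard library. The helper names ranks, pearson, and spearman and the five data pairs are assumptions for illustration; in practice a statistics package would normally do this work.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    # Average rank of v = (count of values below v) + (count tied with v + 1) / 2.
    return [
        sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
        for v in values
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical sample of five (x, y) pairs, not the data from the lecture.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(round(spearman(x, y), 3))  # 0.8
```

Here the ranks of x are 1 to 5 and the ranks of y are 1, 3, 2, 5, 4, so the coefficient reflects two swapped neighbouring pairs in an otherwise increasing pattern.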

Illustrative Example: IQ and TV Hours

Consider an example from Wikipedia investigating the theoretical correlation between a person's IQ and the number of hours of TV watched per week. Let's say we have 10 individuals:

Person	IQ (raw x)	TV Hours/Week (raw y)
1	86	2
2	95	6
3	90	7
4	93	8
5	101	50
…	…	…

Let's assume a Shapiro-Wilk test or QQ plot reveals that 'hours of TV per week' (y) is non-normally distributed, violating a key Pearson assumption. Therefore, a Spearman correlation analysis is appropriate.

Rank Transformation in Action

To conduct the Spearman analysis, both raw IQ and TV hours are rank-transformed:

Person	IQ (raw x)	Rank of x	TV Hours/Week (raw y)	Rank of y
1	86	1	2	1
2	95	5	6	3
3	90	3	7	4
4	93	4	8	5
5	101	7	50	10
…	…	…	…	…

The lowest IQ (86) gets rank 1, and the lowest TV hours (2) also gets rank 1. Ranks that do not appear in the excerpt belong to the individuals not shown. For example, an IQ of 90 has rank 3 and 6 TV hours has rank 3, so rank 2 for each variable is held by one of the omitted individuals. After obtaining the ranks for both variables, a Pearson correlation is calculated between these two sets of ranks.

Interpreting the Spearman Correlation Coefficient

The resulting correlation coefficient from a Spearman analysis is often denoted by the Greek letter rho (\rho), which visually resembles a 'p', to distinguish it from Pearson's 'r'.

Similar to Pearson's 'r', Spearman's \rho has the following interpretations:

A perfect positive correlation: \rho = +1
A perfect negative correlation: \rho = -1
No association: \rho = 0

If two variables are perfectly Spearman correlated, their ranks would increase together (Rank x1 with Rank y1, Rank x2 with Rank y2 etc.), leading to \rho = 1.

Why Rank Transformation Works: Overcoming Assumptions

Rank transformation effectively addresses several issues that plague Pearson correlation when its parametric assumptions are violated:

Nonlinear Relationships: If the relationship between x and y is nonlinear but monotonic (e.g., exponential or logarithmic), rank transformation linearizes it. For instance, an exponential curve between raw x and y becomes a perfectly positive linear relationship between the rank of x and the rank of y, because the lowest x corresponds to the lowest y, and so forth. A quadratic relationship is linearized only over a range of x in which the curve keeps rising or keeps falling.
Non-normally Distributed Data: When x and/or y are not normally distributed (e.g., y being bimodal, taking only low or high values), rank transformation replaces the raw values with the ranks 1 through n, whose spread does not depend on the shape of the original distribution. This mitigates the impact of non-normality on the correlation estimate.
Outliers: Outliers, which can disproportionately influence Pearson's 'r', have a reduced impact on Spearman's \rho. This is because an outlier, no matter how extreme its raw value, will only receive the highest or lowest rank (n or 1). It won't drastically alter the relative ordering of other data points, thus preventing it from exerting undue influence on the overall association between ranks.

For example, consider an outlier data point that might misleadingly suggest a relationship (x = 10, y = 30 when most data are around x = 5, y = 6). After rank transformation, this point simply holds the highest rank for x and for y. Its extreme distance from the rest of the data no longer matters, so it inflates the correlation far less than it would in a Pearson analysis, and the result may be non-significant if no true underlying association exists among the majority of the data points.
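
The effects described above can be checked numerically. The sketch below compares Pearson's r with Spearman's coefficient on two made-up samples: an exponential curve, and a loose cluster near x = 5, y = 6 joined by the extreme point (10, 30). All data values and helper names are assumptions chosen for illustration.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    return [
        sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
        for v in values
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical monotonic but nonlinear data: y doubles with each step in x.
x_curve = [1, 2, 3, 4, 5, 6, 7, 8]
y_curve = [2 ** v for v in x_curve]
r_curve = pearson(x_curve, y_curve)
rho_curve = spearman(x_curve, y_curve)

# Hypothetical cluster with little association, plus one extreme point (10, 30).
x_out = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 10.0]
y_out = [6.0, 4.5, 7.0, 5.0, 6.5, 5.5, 30.0]
r_out = pearson(x_out, y_out)
rho_out = spearman(x_out, y_out)

print(f"curve:   r = {r_curve:.2f}, rho = {rho_curve:.2f}")
print(f"outlier: r = {r_out:.2f}, rho = {rho_out:.2f}")
```

In this made-up sample, the single extreme point pulls Pearson's r close to 1, while the rank-based coefficient stays well below it. The exponential curve shows the opposite pattern: the rank correlation is 1, while Pearson's r falls short of 1 because the raw relationship is not a straight line.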

Deciding When to Use Spearman Correlation

To determine if a Spearman correlation is necessary, several methods can be employed:

Visual Inspection: Scatter plots are invaluable for visualizing the relationship between variables and identifying potential nonlinearities, non-normal distributions (e.g., clustering of points), or the presence of outliers.
Normality Checks: QQ plots (a graphical check) and the Shapiro-Wilk test (a formal statistical test) can be used to assess whether each variable is normally distributed. A clear violation of normality for one or both variables suggests using Spearman correlation instead of Pearson.
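
To illustrate what a QQ plot compares, the sketch below pairs each sorted observation with the standard normal quantile at the same position and then summarizes how straight the resulting points are with a Pearson correlation. This is only a numerical stand-in for looking at the plot, not the Shapiro-Wilk test. The plotting positions (i - 0.5) / n, the function names, and both samples are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def qq_points(sample):
    """Pair standard-normal quantiles with the sorted observations."""
    n = len(sample)
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))


def qq_straightness(sample):
    """Pearson correlation of the QQ points; near 1 means a nearly straight plot."""
    points = qq_points(sample)
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_s = sum(s for _, s in points) / n
    sts = sum((t - mean_t) * (s - mean_s) for t, s in points)
    stt = sum((t - mean_t) ** 2 for t, _ in points)
    sss = sum((s - mean_s) ** 2 for _, s in points)
    return sts / sqrt(stt * sss)


# Hypothetical samples: roughly symmetric scores and strongly skewed hours.
symmetric_scores = [82, 90, 95, 98, 100, 102, 105, 110, 118]
skewed_hours = [1, 2, 2, 3, 3, 4, 5, 6, 8, 50]

print(round(qq_straightness(symmetric_scores), 3))
print(round(qq_straightness(skewed_hours), 3))
```

The skewed sample gives a clearly lower value because its largest observation lies far above the straight line that the other points follow, which is the same pattern one would look for by eye in the plot.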

Assumptions of Spearman Correlation

While Spearman correlation makes fewer assumptions than Pearson correlation, it is not entirely assumption-free:

Scale of Measurement: Both variables must be measured on an ordinal, interval, or ratio scale. This is a broader requirement than Pearson's, allowing for ordinal data. However, nominal data (unordered categories) are still not suitable for Spearman correlation.
Monotonic Relationship: The relationship between the two variables must be monotonic. This means that as x increases, y either consistently tends to increase (monotonic increasing) or consistently tends to decrease (monotonic decreasing). Spearman correlation is not appropriate for complex non-monotonic relationships, such as sinusoidal patterns where y goes up and down with x.
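
The monotonicity requirement can be made concrete with a small, made-up example: a U-shaped relationship in which y is completely determined by x, yet the rank correlation is zero because y first falls and then rises. The helper names and data are assumptions for illustration.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    return [
        sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
        for v in values
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical U-shaped data: y = x squared, symmetric around x = 0.
x_u = [-3, -2, -1, 0, 1, 2, 3]
y_u = [v ** 2 for v in x_u]
rho_u = spearman(x_u, y_u)

# Restricting to x >= 0 leaves only the rising, monotonic half of the curve.
rho_half = spearman(x_u[3:], y_u[3:])

print(rho_u, rho_half)
```

On the full range the falling and rising halves cancel out in the ranks, whereas on the rising half alone the coefficient is 1. A coefficient near zero therefore does not rule out a strong non-monotonic relationship, which is why a scatter plot is worth checking first.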

Hypothesis Testing with Spearman Correlation

When conducting a Spearman correlation analysis, you are typically testing the following hypotheses:

Null Hypothesis (H_0): The ranks of the two variables are not associated with each other.
Alternative Hypothesis (H_1): The ranks of the two variables are associated with each other.
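
One way to see what testing H_0 means is a permutation test, sketched below. It is a swapped-in technique for illustration, not necessarily the method a statistics package applies. If the ranks are not associated, shuffling the ranks of y should produce coefficients as extreme as the observed one reasonably often. The function names, the number of permutations, the seed, and the data are all assumptions.

```python
import random
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    return [
        sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
        for v in values
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Spearman coefficient of x and y."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rank_x, rank_y = ranks(x), ranks(y)
    observed = abs(pearson(rank_x, rank_y))
    shuffled = rank_y[:]
    extreme = 0
    for _ in range(n_permutations):
        # Under H_0, any pairing of the y ranks with the x ranks is equally likely.
        rng.shuffle(shuffled)
        if abs(pearson(rank_x, shuffled)) >= observed - 1e-12:
            extreme += 1
    # Adding 1 to both counts treats the observed pairing as one permutation.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical data in which y always rises with x.
x_data = [1, 2, 3, 4, 5, 6, 7, 8]
y_data = [3, 4, 8, 9, 15, 16, 30, 31]
p_value = permutation_p_value(x_data, y_data)
print(p_value < 0.05)  # True
```

A small p-value means that few shuffled pairings produce a rank correlation as strong as the observed one, which is the evidence used to reject H_0.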

Rejecting the null hypothesis would lead to the conclusion that there is a significant association between the ranks of the two variables, which is the non-parametric equivalent of concluding an association between the variables themselves. This discussion sets the stage for future mini-lectures on nonparametric equivalents of t-tests.