All statistical tests make assumptions, especially distributional assumptions about numeric data, such as the shape of the distribution and its variance.
Regression and ANOVA analyses are considered "robust" to mild to moderate violations of distributional assumptions.
Larger sample sizes and equal group, cell, or condition sizes increase robustness.
However, robustness shouldn't be relied upon with highly non-normal data, small sample sizes, and very uneven group sizes.
Traditional parametric analyses are generally preferable due to their sensitivity and statistical power, provided assumptions aren't severely violated.
Non-parametric analyses are also called distribution-free analyses: they don’t require the data to follow particular distributions. They apply to both categorical and numeric data.
Categorical data:
Chi-square goodness of fit test: a single categorical variable
Chi-square test of independence: two categorical variables, independent design
McNemar’s test: two categorical variables, related/paired design
Non-normal numeric or ordinal data:
Spearman’s correlation: equivalent of Pearson’s correlation
Wilcoxon-Mann-Whitney rank sum test: equivalent of the independent samples t-test
Wilcoxon signed-rank test: equivalent of paired t-test
Kruskal-Wallis test: equivalent of one-way between-subjects ANOVA
Friedman Test: equivalent of one-way repeated-measures ANOVA
Numeric and Ordinal Data
Non-parametric tests rank numerical data and perform analyses on the ranks, addressing distributional problems and outliers.
Ranking involves ordering data from lowest to highest.
Downside of ranking:
Loss of information about the degree of difference: the difference between 18 and 19 (1 point) is treated the same as the difference between 21 and 35 (14 points)
However, this insensitivity to extreme values is exactly what makes ranking beneficial for skewed distributions and those with outliers
Tied ranks: when the data contain the same value, assign each tied observation the average of the ranks they would otherwise occupy:
For example, if the raw data had two people of the same age (19) who would occupy ranks 2 and 3, each receives a rank of 2.5
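The averaging-of-tied-ranks rule above can be checked in Python (an illustration, not part of the lecture materials; scipy's `rankdata` uses average ranks for ties by default, matching the convention described here):

```python
# Sketch: ranking with tied values, using average ranks for ties
# (the ages below are made up for illustration).
from scipy.stats import rankdata

ages = [18, 19, 19, 21, 35]
ranks = rankdata(ages)  # the two 19s would occupy ranks 2 and 3, so each gets 2.5
print(ranks)  # [1.  2.5 2.5 4.  5. ]
```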
Spearman's Correlation (Counterpart of Pearson's Correlation)
Use Spearman’s correlation when:
One or both variables are ordinal
Variable(s) are severely skewed
The relationship is non-linear but monotonic (consistent direction between the variables)
Spearman’s Correlation in Stata
Syntax: spearman var1 var2
Example: There is a very strong and statistically significant correlation, rs(23) = .83, p < .001. The more someone’s heart flutters with excitement at “Design and Stats”, the more time they spend thinking about stats.
Very similar in implementation and interpretation to Pearson’s correlation!
Write up is the same as Pearson’s correlation (except the “s” subscript).
Like Pearson’s correlation, Spearman’s correlation, denoted rs or ρ (rho), ranges from -1 to 1, with -1 or 1 indicating a perfect monotonic relationship
Same rules of thumb as Pearson’s correlation (Cohen, 1988):
Between 0.1 and 0.3 = small effect
Between 0.3 and 0.5 = medium effect
> 0.5 = large effect
Difference: Spearman’s correlation works by ranking the scores (lowest to highest) for each variable, and computing the correlation on the ranked scores
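That “rank first, then correlate” definition can be verified directly. The lecture’s commands are in Stata; as a cross-check (an illustration with made-up data, not the lecture’s dataset), Python’s scipy shows that `spearmanr` on the raw scores equals `pearsonr` computed on the ranked scores:

```python
# Sketch: Spearman's rho is Pearson's r on the ranks (made-up data).
from scipy.stats import spearmanr, pearsonr, rankdata

x = [2, 5, 1, 9, 7, 3]
y = [4, 6, 2, 10, 5, 8]

rho, p = spearmanr(x, y)                        # Spearman on raw scores
r_ranks, _ = pearsonr(rankdata(x), rankdata(y))  # Pearson on the ranks

print(round(rho, 4), round(r_ranks, 4))  # the two values are identical
```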
Mann-Whitney-Wilcoxon Rank Sum Test (Two Independent Groups; Counterpart of the Independent-Samples T-Test)
Also known as:
Mann-Whitney U
Wilcoxon rank-sum Ws
Example: Drug Use and Depression
Research question: Is there a difference in the experiences of depression in people who use different recreational drugs when clubbing?
Procedure: Drugs were taken on a Saturday night while clubbing. Depression scores were measured twice, on Sunday night and then again on Wednesday.
DV: depression (BDI) measured 1 day after drug use (Sunday)
IV: Drug (alcohol vs. ecstasy), between-subjects categorical variable
Design: Single-IV between-groups design with two levels
Why not an independent-samples t-test here? Because of:
Small sample size
Outliers
Non-normal data
Not-so-equal variances
So, we use the Rank Sum test instead
Stata syntax: ranksum DV, by(IV)
egen SunBDIRank = rank(SundayBDI)
H0: the ranks (and hence medians) do not differ between the two groups
If p < .05, reject H0, indicating statistical significance. Stata reports two p-values here; report the exact p.
Effect size:
r = z / √N
Sunday depression scores: r = 1.105 / √20 = 0.25
Same rules of thumb as correlation (Cohen, 1988):
Between 0.1 and 0.3 = small effect
Between 0.3 and 0.5 = medium effect
> 0.5 = large effect
Conclude: The day after a night of clubbing, depression scores do not significantly differ between those who drank alcohol (mean rank = 9.1) and those who took ecstasy (mean rank = 12.0), z = -1.11, p = .288, with a small effect size, r = .25.
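The rank-sum test and its r effect size can be sketched outside Stata too. Below is an illustration in Python with made-up scores (the lecture’s BDI data are not reproduced); scipy’s `ranksums` returns the z statistic directly, so r = |z| / √N follows from the formula in the notes:

```python
# Sketch: Wilcoxon-Mann-Whitney rank-sum test plus r effect size
# (made-up data; N is the total sample size across both groups).
import math
from scipy.stats import ranksums

alcohol = [15, 16, 12, 13, 14, 19, 11, 18, 10, 17]
ecstasy = [28, 35, 34, 24, 39, 32, 27, 29, 36, 33]

z, p = ranksums(alcohol, ecstasy)   # statistic is the large-sample z
n_total = len(alcohol) + len(ecstasy)
r = abs(z) / math.sqrt(n_total)     # same formula as in the notes

print(round(z, 2), round(p, 4), round(r, 2))
```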
Wilcoxon Signed-Rank Test (Two Related Groups; Counterpart of the Paired-Samples T-Test)
Not to be confused with the Wilcoxon rank-sum test
Example: Drug Use and Depression
Research question: Have people’s experiences with depression changed after one day of using drugs while clubbing, compared to four days later? But here, let’s focus on Alcohol.
Procedure: Drugs were taken on a Saturday night while clubbing. Depression scores were measured twice, on Sunday night and then again on Wednesday.
DV: depression (BDI)
IV: Day (1 day after alcohol use, Sunday; 4 days after alcohol use), within-subjects categorical variable
Design: A within-group design with a single factor comprising two levels
Check the normality of the difference scores (between the conditions/levels)
Here, the difference scores are not normal, and N is small (10 pairs of scores)
Stata syntax: signrank var1 = var2
Procedure: rank the absolute differences (ignoring the signs), then compare the ranks associated with positive vs. negative differences (e.g., sort diff_rank in Stata to inspect this). H0: no change in the ranks of the differences.
Report the exact p.
Effect size:
r = z / √N, where N is the number of pairs of observations
r = 1.99 / √10 = 0.63 (large effect)
When reporting, can include the median scores from each variable, or the sum of the positive and negative ranks
Conclude: A signed-rank test demonstrated that, after consuming alcohol and clubbing on Saturday night, depression scores significantly decreased from Sunday (Median = 16) to Wednesday (Median = 7.5), z = -1.99, p = .045, with a large effect size, r = .63
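The same logic can be sketched in Python with made-up paired scores (not the lecture’s dataset). scipy’s `wilcoxon` returns the W statistic rather than z, so the sketch recovers z from the standard normal approximation, z = (W − n(n+1)/4) / √(n(n+1)(2n+1)/24), and then applies r = |z| / √N with N = number of pairs:

```python
# Sketch: Wilcoxon signed-rank test on paired (made-up) scores,
# with z recovered manually so the r = |z|/sqrt(N) effect size can be computed.
import math
from scipy.stats import wilcoxon

sunday =    [16, 18, 15, 20, 14, 22, 17, 19, 13, 21]
wednesday = [ 7,  9, 10,  8, 12,  6, 11,  9,  8, 10]

w, p = wilcoxon(sunday, wednesday, method="approx")  # W = smaller rank sum
n = len(sunday)                                      # pairs of observations
mu = n * (n + 1) / 4
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)    # untied-variance formula
z = (w - mu) / sigma
r = abs(z) / math.sqrt(n)

print(p < .05, round(r, 2))
```

Note the manual z uses the variance formula without a tie correction, so it can differ very slightly from scipy’s internal approximation when the difference scores contain ties.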
Kruskal-Wallis Test (Multiple Independent Groups; Counterpart of the Between-Subjects One-Way ANOVA)
Used for violations of distributional assumptions, especially if coupled with small sample sizes and/or uneven group sizes
Extension of Previous Study:
Let’s say someone wanted to extend the previous study, rather than just looking at ecstasy vs alcohol, they wanted to explore ecstasy, alcohol, and water-only drinkers.
100 people were recruited: 10 were ecstasy users, 60 were alcohol drinkers, and 30 were water-only drinkers
Dataset: “drug3groups.dta”, with a categorical variable “Drug” and a numerical variable “SundayBDI”
kwallis DV, by(IV)
Like the Rank Sum test, it ranks the DV and then totals the ranks for different groups
test statistic: χ2
p < .05 reject H0, indicating there’s some difference in the mean ranks among the groups
Need to follow up with group comparisons!
H0: Ranks (median) among groups are the same
We use Dunn’s test of multiple comparisons to follow up on a statistically significant Kruskal-Wallis test:
Syntax: dunntest DV, by(IV) ma(bonferroni)
It conducts pair-wise comparison, just like running a Rank-sum test on each pair of groups
Need to adjust for the family-wise error rate (e.g., Bonferroni) to prevent inflation of Type I error, just as with post-hoc comparisons after ANOVA
Effect Size
Calculating the r effect size
r = z / √N, where N is the total sample size of the two groups being compared
For example, between ecstasy and alcohol: r = 1.09 / √70 = 0.13
There’s no significant difference in depression scores between alcohol and ecstasy, z = 1.09, p = .413, with a small effect size, r = .13.
There is a significant difference between ecstasy and water drinkers, z = 4.70, p < .001, r = .74, and between alcohol and water drinkers, z = 6.00, p < .001, r = .63: the mean ranks of depression were lower in water-only drinkers than in both other groups, both large effects.
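The omnibus step above can also be sketched in Python with made-up data for the three groups (the lecture’s drug3groups.dta is not reproduced here). scipy’s `kruskal` returns the H statistic, which is referred to a chi-square distribution, matching the χ2 test statistic Stata reports:

```python
# Sketch: Kruskal-Wallis omnibus test across three independent (made-up) groups.
from scipy.stats import kruskal

ecstasy = [28, 35, 24, 39, 32, 27, 29, 36, 33, 30]
alcohol = [15, 16, 12, 13, 14, 19, 11, 18, 10, 17]
water   = [ 5,  8,  3,  6,  9,  4,  7,  2,  6,  5]

h, p = kruskal(ecstasy, alcohol, water)
print(p < .05)  # a significant H still needs pairwise follow-ups
```

Dunn’s test itself is not in scipy; outside Stata, third-party packages (e.g. scikit-posthocs) reportedly provide an implementation for the pairwise follow-ups.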
Final Remarks
There is also a non-parametric equivalent of the one-way repeated-measures ANOVA: Friedman’s test. However, it has yet to be elegantly implemented in Stata
Ranking loses information on the nuances of the scale and differences
If the assumptions are met, yes, parametric analyses are more powerful (more likely to detect an effect)
But if assumptions are particularly badly violated, non-parametric tests are often more powerful and provide more reliable p-values.
Often, you’ll get the same results/conclusion from equivalent parametric and non-parametric tests (which doesn’t mean to try both and pick the preferred outcome!).
These non-parametric analyses are fairly simple, with no equivalents for more complex designs; alternatives in those cases include:
Ways of adjusting estimations within the parametric analyses (e.g. bootstrapping)
Ways to transform the data
More complex computational estimations (e.g. simulations)
Again, if and when we can (which is the majority of the time), parametric tests (t-tests, regression, ANOVAs) are better choices for analysing our data. However, it is useful to know when and why to use non-parametric tests.
Conclusions
Non-parametric analyses don’t have expectations about shapes of distributions, because they:
Apply to categorical data, and therefore, distributional shapes are irrelevant! or
Apply to non-normal or ordinal data, and rank the data before analysing
When it comes to numeric data that doesn’t meet distributional assumptions (or violations to distributional assumptions are coupled with other issues, e.g. small sample sizes or unequal groups), non-parametric analyses are an option
Despite being a different family of analyses with different methods, the practice is similar:
Understand the variables and the research question
Numerically and graphically describe the data
Conduct the analysis, and if necessary, make adjustments for multiple comparisons.
Interpret the results.
After this week’s lecture, you know:
What defines non-parametric analyses, and when they are appropriate to use
Spearman’s correlation, rank sum test, sign rank test, Kruskal-Wallis test
How to interpret the results of these tests
The strengths and limitations of non-parametric tests
In Stata, you should be able to:
Conduct the tests covered
Graph data in an appropriate way for the type of data (and its distribution)