Statistical Significance Testing
Introduction to Statistical Significance Testing
Many people, including researchers, misunderstand the function of statistical significance testing.
The probability value, or p-value, does not prove causation or practical importance.
It does not indicate the size or meaningfulness of the effect of independent variables on dependent variables.
The p-value gives the probability of obtaining data at least as extreme as those observed, assuming the null hypothesis is true.
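A minimal sketch of this interpretation (Python with NumPy; the sample, its mean shift, and the sample size are made-up for illustration): a p-value can be approximated by counting how often data generated under the null hypothesis produce a test statistic at least as extreme as the observed one.

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical observed sample; we test the null hypothesis mu = 0.
    observed = rng.normal(loc=0.4, scale=1.0, size=25)
    t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(len(observed)))

    # Generate data under the null (true mean 0) many times and count how
    # often the test statistic is at least as extreme as t_obs.
    n_sims, extreme = 50_000, 0
    for _ in range(n_sims):
        sample = rng.normal(loc=0.0, scale=1.0, size=len(observed))
        t_sim = sample.mean() / (sample.std(ddof=1) / np.sqrt(len(sample)))
        if abs(t_sim) >= abs(t_obs):
            extreme += 1

    print(f"simulated two-sided p-value ≈ {extreme / n_sims:.4f}")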
Objectives of the Lecture
The aim is to understand what statistical significance testing truly does and to evaluate statistics critically.
This lecture is the first of two on interpreting data; it focuses on statistical significance testing.
The next lecture will cover effect size and confidence intervals.
Significance testing, effect size, and confidence intervals are referred to collectively as "the big three."
Understanding Statistical Significance
Statistical significance testing was historically regarded as a definitive indicator of meaningful effects.
In reality, effect size and confidence intervals are more informative when interpreting data.
Concept of the Null Hypothesis
Null Hypothesis (H0): A statement that assumes no effect or no difference exists in the population.
Example (deliberately absurd, hypothetical): people who go swimming never get wet, i.e., swimming has no effect on wetness.
Confidence in rejecting the null hypothesis builds as a substantial number of samples show evidence that contradicts it.
Analogy: Jury System
The null hypothesis can be compared to a jury's presumption of innocence until proven guilty.
Evidence Required: Strong evidence, such as DNA or video footage, is needed to overturn the presumption of innocence; likewise, strong evidence is needed to reject the null hypothesis.
Statistical Significance Testing: Looks for evidence to reject the null hypothesis.
Obtaining Evidence via Sampling
Statistical significance testing starts with the sampling process.
Random sampling error arises because the mean of any given sample inevitably deviates somewhat from the population mean.
Most samples miss the population mean by varying amounts (see the sketch after the list of error types below).
Types of Error:
Random Sampling Error: Inherent and measurable.
Bias: Caused by poor sampling methods; not measurable.
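A minimal sketch of random sampling error (Python with NumPy; the population mean and standard deviation are illustrative assumptions): repeated random samples from the same population yield means that scatter around the true population mean.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 100.0, 15.0          # hypothetical population parameters

    # Five random samples of size 50 from the same population:
    for i in range(5):
        sample = rng.normal(mu, sigma, size=50)
        print(f"sample {i + 1}: mean = {sample.mean():6.2f}, "
              f"sampling error = {sample.mean() - mu:+5.2f}")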
Measuring Random Sampling Error
Random samples lead to a Sampling Distribution:
Definition: Distribution of sample statistics (e.g., mean) from all possible random samples of a specified size from the population.
The Central Limit Theorem states:
The mean of the sampling distribution of the sample mean (µ_x̄) equals the population mean (µ).
The sampling distribution becomes approximately normal (bell-shaped) with large enough sample sizes (commonly n ≥ 30), regardless of the population's shape.
The standard deviation of the sampling distribution, called the standard error, equals σ/√n and therefore decreases as sample size increases.
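The following sketch (Python with NumPy; a deliberately skewed, made-up population) checks all three claims empirically: the mean of many sample means approaches µ, their spread approaches σ/√n, and their distribution is roughly bell-shaped despite the skewed population.

    import numpy as np

    rng = np.random.default_rng(1)

    # A strongly skewed (non-normal) population: exponential with mean 2.
    mu, sigma = 2.0, 2.0            # for an exponential, mean = SD = scale
    n = 30                          # sample size

    # Draw 20,000 random samples of size n and keep each sample's mean:
    means = rng.exponential(scale=2.0, size=(20_000, n)).mean(axis=1)

    print(f"population mean mu           = {mu:.3f}")
    print(f"mean of sample means         = {means.mean():.3f}")   # ≈ mu
    print(f"theoretical SE, sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")
    print(f"observed SD of sample means  = {means.std():.3f}")    # ≈ SE
    # A histogram of `means` would look approximately bell-shaped.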
Central Limit Theorem and Its Role
Purpose: Provides the foundation for statistical significance testing by establishing which sampling distributions to expect under competing hypotheses:
Null Sampling Distribution: Represents all possible outcomes if there is no treatment effect.
Alternative Sampling Distribution: Represents outcomes if there is a treatment effect.
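To make the two distributions concrete, the sketch below (Python with NumPy; the effect size, population SD, and sample size are illustrative assumptions) simulates sample means both with no treatment effect and with a true effect of 0.5.

    import numpy as np

    rng = np.random.default_rng(2)
    n, sigma = 30, 1.0
    n_samples = 50_000

    # Null sampling distribution: sample means when the true effect is zero.
    null_means = rng.normal(0.0, sigma, size=(n_samples, n)).mean(axis=1)

    # Alternative sampling distribution: sample means under a true effect of 0.5.
    alt_means = rng.normal(0.5, sigma, size=(n_samples, n)).mean(axis=1)

    print(f"null:        mean ≈ {null_means.mean():+.3f}, SE ≈ {null_means.std():.3f}")
    print(f"alternative: mean ≈ {alt_means.mean():+.3f}, SE ≈ {alt_means.std():.3f}")
    # Both are centered on their true means with SE ≈ sigma/sqrt(n) ≈ 0.183.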
Power of Statistical Tests
Statistical significance testing helps answer: “Did something happen?”
It does not convey effect size or confidence in the result's reliability.
Type I Error (alpha error): Rejecting the null hypothesis when it is true; typically set at 0.05 (5%).
Type II Error (beta error): Failing to reject the null when the alternative hypothesis is true; often set at 0.20 (20%).
Statistical Power: Probability of correctly rejecting a false null hypothesis.
Visualizing Errors and Power
Type I Error: Concluding an effect exists when there is none (the rejection region in the far tail of the null distribution).
Type II Error: Missing a real effect by failing to reject the null hypothesis (the portion of the alternative distribution falling short of the critical value).
Power: Correct decision made by rejecting the null hypothesis when the alternative hypothesis is true.
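These regions can be computed directly from the two sampling distributions. The sketch below (Python with SciPy; the effect size, standard error, and one-sided test are illustrative assumptions) derives the critical value from alpha on the null distribution, then reads Type II error and power off the alternative distribution.

    import numpy as np
    from scipy import stats

    alpha = 0.05      # Type I error rate (chosen in advance)
    effect = 0.5      # hypothetical true mean under the alternative
    se = 0.2          # hypothetical standard error of the sample mean

    # One-sided test: null distribution N(0, se), alternative N(effect, se).
    critical = stats.norm.ppf(1 - alpha, loc=0.0, scale=se)

    # Type II error: alternative-distribution mass below the critical value.
    beta = stats.norm.cdf(critical, loc=effect, scale=se)
    power = 1 - beta

    print(f"critical value = {critical:.3f}")
    print(f"Type I error   = {alpha:.2f}")
    print(f"Type II error  = {beta:.3f}")
    print(f"power          = {power:.3f}")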
Critical Value and P-Value
Statistical tests (e.g., t-test, ANOVA) produce a p-value.
To determine significance, compare the p-value to the chosen alpha level (commonly 0.05).
Interpretation:
If p ≤ alpha, reject the null hypothesis.
If p > alpha, do not reject the null hypothesis.
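In practice the comparison is a one-liner. A minimal sketch (Python with SciPy; the two groups are made-up data) runs an independent-samples t-test and applies the decision rule.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    alpha = 0.05

    # Hypothetical measurements for a control and a treatment group:
    control = rng.normal(loc=5.0, scale=1.0, size=40)
    treated = rng.normal(loc=5.6, scale=1.0, size=40)

    t_stat, p_value = stats.ttest_ind(control, treated)

    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    if p_value <= alpha:
        print("p <= alpha: reject the null hypothesis")
    else:
        print("p > alpha: do not reject the null hypothesis")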
Practical Examples with P-Values
Example Comparing Cholesterol Drug Efficacy:
A p-value of 0.60 indicates insufficient evidence against the null hypothesis that the drug has no effect.
A p-value of 0.01 indicates strong evidence against that null hypothesis, consistent with the drug being effective.
Limitations of Statistical Significance Alone
Statistical significance can be manipulated through sample size.
Larger sample sizes shrink the standard error, so even trivial differences become statistically significant while remaining clinically irrelevant (demonstrated in the sketch after this list).
Power Analysis: Used to determine the sample size needed to detect an effect of a given size at a chosen alpha level and power.
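Both points can be made concrete. The sketch below (Python with SciPy and statsmodels; group means, SDs, and the target effect size are illustrative assumptions) first shows a clinically trivial difference becoming statistically significant at a huge sample size, then uses a power analysis to find the sample size needed for a conventional design (alpha = 0.05, power = 0.80).

    import numpy as np
    from scipy import stats
    from statsmodels.stats.power import TTestIndPower

    rng = np.random.default_rng(3)

    # A trivial true difference (0.02 units, SD = 1) is "significant" at huge n:
    n = 1_000_000
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.02, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_ind(a, b)
    print(f"n = {n:,} per group: p = {p_value:.2e} (tiny effect, tiny p)")

    # Power analysis: sample size per group to detect a medium effect
    # (Cohen's d = 0.5) with alpha = 0.05 and power = 0.80.
    n_required = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                             power=0.80,
                                             alternative='two-sided')
    print(f"required n per group ≈ {n_required:.0f}")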
Historical Context of Alpha Levels
The common alpha level of 0.05 is a historical convention, not a scientifically derived standard.
The threshold can vary with the stakes of the decision; for higher-stakes decisions, a lower alpha (such as 0.01) may be used.
Contextual Understanding of Statistics
Interpretation must always consider context, because statistics can convey an unwarranted sense of certainty.
Just because a result is statistically significant does not imply it is clinically relevant or useful.
Informed statistical analysis remains a skill requiring expertise; it is expected by high-quality journals and practiced by biostatisticians and epidemiologists.