p-values explained (StatQuest)
P-values: what they are and how to interpret
- P-values are numbers between 0 and 1 that quantify how confident we should be that there is a difference between two groups (e.g., drug A vs. drug B) based on the observed data.
- The closer the p-value is to 0, the more confident we are that the groups are different.
- A commonly used threshold (alpha) is \alpha = 0.05. If the p-value is less than this threshold, we conclude that there is a difference; if not, we fail to conclude a difference.
- The idea behind a 0.05 threshold: if there is actually no difference (the null hypothesis is true) and we repeated the exact same experiment many times, only about 5% of those experiments would yield a p-value below 0.05 (a false positive); the simulation sketch after this list illustrates this.
- Examples from the transcript:
- If the p-value is less than 0.05, we say drug A is different from drug B; if the p-value is 0.24, we are not confident that there is a difference.
- False positives (Type I errors): a small p-value can occur even when there is no true difference, due solely to random variation. The transcript calls this the “dreaded terminology alert.”
- Thresholds can be adjusted depending on context:
- A stricter threshold (e.g., \alpha = 0.00001) would reduce false positives to about 1 in 100,000 experiments.
- A looser threshold (e.g., \alpha = 0.2) would tolerate more false positives (about 2 out of 10 experiments).
- Example outcomes discussed:
- A p-value of 0.9 suggests no evidence that drug A and drug B differ (very high p-value).
- A p-value of 0.01 is treated as evidence of a difference in a particular run, but the same value can arise from random sampling alone when the null is true (a potential false positive if there is no true difference).
- Practical takeaway: a small p-value does not measure the size of the difference; it only indicates how unlikely the observed data would be if there were no true difference.
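To make the false positive idea concrete, here is a minimal simulation sketch (not from the transcript): both groups are drawn from the same distribution, so the null hypothesis is true by construction, and we count how often a two-sample t-test still reports p < 0.05. The sample sizes, normal distribution, and choice of t-test are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 10_000
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    # Both "drugs" come from the SAME distribution: the null is true.
    group_a = rng.normal(loc=0.0, scale=1.0, size=20)
    group_b = rng.normal(loc=0.0, scale=1.0, size=20)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < alpha:
        false_positives += 1  # a false positive: p < 0.05 despite no true difference

# With a true null, roughly 5% of experiments land below 0.05.
print(f"False positive rate: {false_positives / n_experiments:.3f}")  # ~0.05
```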
Hypothesis testing and the null hypothesis
- The process is framed as hypothesis testing.
- The null hypothesis is the assumption that the drugs are the same (no real difference).
- The p-value helps decide whether to reject the null hypothesis or not.
- A small p-value leads to rejecting the null hypothesis; a large p-value leads to failing to reject it (see the decision-rule sketch after this list).
- Important nuance: rejecting the null does not tell you how big the effect is; a tiny effect can have a small p-value with a large enough sample, and a large effect can have a non-significant p-value with a small sample.
- The transcript provides examples to illustrate this distinction:
- An experiment with a relatively large p-value (e.g., 0.24) can still show a 6-point difference between groups but may not be statistically significant with the given sample size.
- A larger study with a 1-point difference can yield a smaller p-value (e.g., 0.04) due to more information (larger sample).
- Takeaway: p-values assess evidence against the null hypothesis, not the magnitude of the difference.
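A minimal sketch of the reject/fail-to-reject workflow. The transcript does not name a specific test; a two-sample t-test is assumed here, and the measurements and function name are hypothetical.

```python
from scipy import stats

def test_difference(group_a, group_b, alpha=0.05):
    """Return the p-value and whether we reject the null of 'no difference'."""
    _, p_value = stats.ttest_ind(group_a, group_b)
    return p_value, p_value < alpha

# Hypothetical measurements for two small groups.
drug_a = [78, 85, 81, 90, 76]
drug_b = [72, 79, 74, 83, 70]

p, reject = test_difference(drug_a, drug_b)
print(f"p = {p:.3f}; {'reject' if reject else 'fail to reject'} the null")
```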
Experimental design: from single subject to larger samples
- A single comparison (one person on drug A and one on drug B) is insufficient to draw conclusions because random factors can strongly influence outcomes.
- Random and rare events can confound results: interactions with other medications, allergies, improper dosing, placebo effects, or mislabeling of the drug.
- Therefore, experiments are repeated with more participants to average out random quirks and obtain more reliable evidence.
- Example progression:
- Start with one person per drug: inconclusive because random variation can dominate.
- Add more participants: each drug tested on two people shows mixed outcomes, still inconclusive.
- Test on many people: large differences in cure rates suggest real differences (e.g., Drug A cures many more people than Drug B).
- Data from the transcript (illustrative numbers):
- Drug A: cured 1,043 out of 1,046 treated; not cured 3. Cure rate ≈ \frac{1043}{1046} \approx 0.997, or 99.7\%.
- Drug B: cured 2 out of 1,434 treated; not cured 1,432. Cure rate ≈ \frac{2}{1434} \approx 0.00140 or 0.14\%.
- With such a dramatic contrast (A's cure rate vastly higher than B's), it seems obvious that A is better, far beyond what random chance would explain; the test sketch after this list puts a p-value on the table.
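One way to attach a p-value to the cure-rate table above is Fisher's exact test on the 2x2 contingency table. The transcript does not specify which test is used; this is one standard choice for count data like this.

```python
from scipy import stats

#                cured  not cured
table = [[1043,      3],   # Drug A
         [   2,   1432]]   # Drug B

odds_ratio, p_value = stats.fisher_exact(table)
print(f"p = {p_value:.3g}")  # effectively zero: far beyond random chance
```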
Thresholds, false positives, and decision rules (detailed)
- Thresholds and their implications:
- Common threshold: \alpha = 0.05. If the computed p-value < 0.05, declare a difference.
- Stricter thresholds reduce false positives but may increase false negatives (missing real differences).
- Looser thresholds (e.g., \alpha = 0.2) accept more false positives in exchange for fewer false negatives.
- False positive concept illustration:
- When there is no true difference, about 5% of experiments will yield p < 0.05 (a false positive) if the threshold is 0.05.
- Alternate thresholds shown:
- A stricter threshold of 10^{-5} would reduce false positives to about once in 100,000 experiments.
- A looser threshold like 0.2 would accept false positives about 2 times in 10 experiments (the quick check after this list reproduces all three numbers).
- Real-world interpretation:
- If the p-value is below the chosen threshold, we say the result is statistically significant and that there is evidence for a difference.
- If the p-value is above the threshold, we do not conclude a difference.
- Specific example from the transcript:
- If a p-value is calculated as 0.24, we would not conclude a difference between Drug A and Drug B in that study.
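For a well-calibrated test, the p-value is uniformly distributed between 0 and 1 when the null hypothesis is true, so the long-run false positive rate equals the threshold \alpha itself. A quick arithmetic check of the three thresholds quoted above:

```python
# Expected false positives per batch of experiments, assuming a true null:
# the rate is simply alpha, so the expected count is alpha * n_experiments.
for alpha, n_experiments in [(0.05, 100), (0.00001, 100_000), (0.2, 10)]:
    expected = alpha * n_experiments
    print(f"alpha = {alpha}: about {expected:g} false positive(s) "
          f"per {n_experiments:,} experiments")
# alpha = 0.05  -> about 5 per 100 experiments
# alpha = 1e-05 -> about 1 per 100,000 experiments
# alpha = 0.2   -> about 2 per 10 experiments
```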
Interpreting p-values: effect size vs significance
- Important nuance: a small p-value does not tell you how large the difference is (the effect size).
- Two scenarios illustrate this distinction:
- Scenario 1: a relatively large p-value (e.g., 0.24) can occur even when there is a substantial difference (e.g., 6-point difference) if the sample size is small.
- Scenario 2: a smaller p-value (e.g., 0.04) can occur with a much smaller observed difference (e.g., 1-point difference) if the sample size is large (a simulation after this list reproduces both patterns).
- Takeaway: Statistical significance (p-value) is about the evidence against the null hypothesis given the sample size, not the practical importance or magnitude of the effect.
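A sketch reproducing both scenarios. The 6-point and 1-point differences come from the transcript; the sample sizes, standard deviation, and normal model are assumptions chosen so the pattern typically emerges (exact p-values will vary with the random seed).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Scenario 1: big difference (6 points), tiny sample -> often not significant.
small_a = rng.normal(100, 10, size=5)
small_b = rng.normal(106, 10, size=5)
_, p_small = stats.ttest_ind(small_a, small_b)

# Scenario 2: small difference (1 point), huge sample -> usually significant.
big_a = rng.normal(100, 10, size=10_000)
big_b = rng.normal(101, 10, size=10_000)
_, p_big = stats.ttest_ind(big_a, big_b)

print(f"6-point difference, n=5 per group:      p = {p_small:.3f}")
print(f"1-point difference, n=10,000 per group: p = {p_big:.3g}")
```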
Practical implications, caveats, and terminology
- Terminology:
- Hypothesis testing: the framework used to decide if observed data provide enough evidence to conclude a difference.
- Null hypothesis: the assumption that there is no difference between the groups (e.g., drugs are the same).
- P-value: the metric used to decide whether to reject the null hypothesis.
- Important caveat:
- A small p-value indicates statistical significance but does not measure the size of the effect.
- A large study can detect small differences as statistically significant; a small study may fail to detect meaningful differences.
- Real-world relevance:
- Balancing false positives and false negatives is important in decision-making, depending on the stakes (e.g., medical treatments vs. mundane decisions like predicting ice cream truck arrival times).
- Final takeaway from the transcript:
- The p-value is a tool for assessing evidence against the null hypothesis, not a direct measure of how large the difference is or how important it is.
- Always consider sample size, effect size, and scientific or practical importance in addition to the p-value when drawing conclusions.
- Decision rule (null hypothesis testing):
- If p < \alpha, reject the null hypothesis; otherwise, fail to reject it (applied to this section's quoted p-values in the snippet at the end).
- Common threshold:
- \alpha = 0.05 (most commonly used, but not universal).
- Example numerical values to recall:
- Threshold example: \alpha = 0.05, a false positive rate of about 5% under the null.
- Stricter example: \alpha = 0.00001 reduces false positives to about 1 in 100,000.
- Looser example: \alpha = 0.2 tolerates more false positives (about 2 in 10).
- Observed data example (p-values):
- p = 0.9 (no evidence of difference)
- p = 0.01 (evidence of difference in a particular run, but be wary of possible false positives depending on context)
- p = 0.24 (not significant)
- p = 0.04 (significant in a larger or more precise study)
- Effect size vs p-value (conceptual):
- P-value tests the null of no difference, not the magnitude of the difference; larger studies can yield significant p-values for small differences, and small studies can yield non-significant p-values for larger differences.
- Data notes from the transcript (for reference):
- Drug A example: cure rate ≈ \frac{1043}{1046} \approx 0.997 ⇒ 99.7\% cured.
- Drug B example: cure rate ≈ \frac{2}{1434} \approx 0.00140 ⇒ 0.14\% cured.
- Real-world caveat:
- Some observed differences may be due to random variation, placebo effects, allergies, drug interactions, mislabeling, or adherence issues; increasing sample size helps mitigate these concerns.
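Finally, a minimal recap of the decision rule, applied to the p-values quoted in this section at the conventional \alpha = 0.05:

```python
def decide(p_value, alpha=0.05):
    """Apply the standard decision rule: reject the null if p < alpha."""
    if p_value < alpha:
        return "reject the null (statistically significant)"
    return "fail to reject the null (not significant)"

for p in [0.9, 0.24, 0.04, 0.01]:
    print(f"p = {p}: {decide(p)}")
# p = 0.9  -> fail to reject
# p = 0.24 -> fail to reject
# p = 0.04 -> reject
# p = 0.01 -> reject
```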