p-values explained (StatQuest)
P-values: what they are and how to interpret
- P-values are numbers between 0 and 1 that quantify how confident we should be that there is a difference between two groups (e.g., drug A vs. drug B) based on the observed data.
- The closer the p-value is to 0, the more confident we are that the groups are different.
- A commonly used threshold (alpha) is \alpha = 0.05. If the p-value is less than this threshold, we conclude that there is a difference; if not, we fail to conclude a difference.
- The idea behind a 0.05 threshold: if there is actually no difference (the null hypothesis is true) and we repeated the exact same experiment many times, only about 5% of those experiments would yield a p-value below 0.05 (a false positive); the simulation sketch after this list illustrates this.
- Examples from the transcript:
- If the p-value is less than 0.05, we say drug A is different from drug B; if the p-value is 0.24, we are not confident that there is a difference.
- False positives (Type I errors): a small p-value can occur even when there is no true difference, due solely to random variation. The transcript calls this the “dreaded terminology alert.”
- Thresholds can be adjusted depending on context:
- A stricter threshold (e.g., \alpha = 0.00001) would reduce false positives to about 1 in 100,000 experiments.
- A looser threshold (e.g., \alpha = 0.2) would tolerate more false positives (about 2 out of 10 experiments).
- Example outcomes discussed:
- A p-value of 0.9 suggests no evidence that drug A and drug B differ (very high p-value).
- A p-value of 0.01 is treated as evidence of a difference in a particular run, but the same value can arise from random sampling alone when the null is true (a potential false positive if there is no true difference).
- Practical takeaway: a small p-value does not measure the size of the difference; it only indicates how unlikely the observed data would be if there were no true difference.
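To make the false positive idea concrete, here is a minimal simulation sketch (not from the transcript): both groups are drawn from the same distribution, so the null hypothesis is true by construction, and we count how often a two-sample t-test still reports p < 0.05. The sample sizes, normal distribution, and choice of t-test are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 10_000
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    # Both "drugs" come from the SAME distribution: the null is true.
    group_a = rng.normal(loc=0.0, scale=1.0, size=20)
    group_b = rng.normal(loc=0.0, scale=1.0, size=20)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < alpha:
        false_positives += 1  # a false positive: p < 0.05 despite no true difference

# With a true null, roughly 5% of experiments land below 0.05.
print(f"False positive rate: {false_positives / n_experiments:.3f}")  # ~0.05
```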
Hypothesis testing and the null hypothesis
- The process is framed as hypothesis testing.
- The null hypothesis is the assumption that the drugs are the same (no real difference).
- The p-value helps decide whether to reject the null hypothesis or not.
- A small p-value leads to rejecting the null hypothesis; a large p-value leads to failing to reject it (see the decision-rule sketch after this list).
- Important nuance: rejecting the null does not tell you how big the effect is; a tiny effect can have a small p-value with a large enough sample, and a large effect can have a non-significant p-value with a small sample.
- The transcript provides examples to illustrate this distinction:
- An experiment with a relatively large p-value (e.g., 0.24) can still show a 6-point difference between groups but may not be statistically significant with the given sample size.
- A larger study with a 1-point difference can yield a smaller p-value (e.g., 0.04) due to more information (larger sample).
- Takeaway: p-values assess evidence against the null hypothesis, not the magnitude of the difference.
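A minimal sketch of the reject/fail-to-reject workflow. The transcript does not name a specific test; a two-sample t-test is assumed here, and the measurements and function name are hypothetical.

```python
from scipy import stats

def test_difference(group_a, group_b, alpha=0.05):
    """Return the p-value and whether we reject the null of 'no difference'."""
    _, p_value = stats.ttest_ind(group_a, group_b)
    return p_value, p_value < alpha

# Hypothetical measurements for two small groups.
drug_a = [78, 85, 81, 90, 76]
drug_b = [72, 79, 74, 83, 70]

p, reject = test_difference(drug_a, drug_b)
print(f"p = {p:.3f}; {'reject' if reject else 'fail to reject'} the null")
```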
Experimental design: from single subject to larger samples
- A single comparison (one person on drug A and one on drug B) is insufficient to draw conclusions because random factors can strongly influence outcomes.
- Random and rare events can confound results: interactions with other medications, allergies, improper dosing, placebo effects, or mislabeling of the drug.
- Therefore, experiments are repeated with more participants to average out random quirks and obtain more reliable evidence.
- Example progression:
- Start with one person per drug: inconclusive because random variation can dominate.
- Add more participants: each drug tested on two people shows mixed outcomes, still inconclusive.
- Test on many people: large differences in cure rates suggest real differences (e.g., Drug A cures many more people than Drug B).
- Data from the transcript (illustrative numbers):
- Drug A: cured 1,043 out of 1,046 treated; not cured 3. Cure rate ≈ \frac{1043}{1046} \approx 0.997, or 99.7\%.
- Drug B: cured 2 out of 1,434 treated; not cured 1,432. Cure rate ≈ \frac{2}{1434} \approx 0.00140 or 0.14\%.
- With such a dramatic contrast (A's cure rate vastly higher than B's), it seems obvious that A is better, far beyond what random chance would explain; the test sketch after this list puts a p-value on the table.
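One way to attach a p-value to the cure-rate table above is Fisher's exact test on the 2x2 contingency table. The transcript does not specify which test is used; this is one standard choice for count data like this.

```python
from scipy import stats

#                cured  not cured
table = [[1043,      3],   # Drug A
         [   2,   1432]]   # Drug B

odds_ratio, p_value = stats.fisher_exact(table)
print(f"p = {p_value:.3g}")  # effectively zero: far beyond random chance
```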
Thresholds, false positives, and decision rules (detailed)
- Thresholds and their implications:
- Common threshold: \alpha = 0.05. If the computed p-value < 0.05, declare a difference.
- Stricter thresholds reduce false positives but may increase false negatives (missing real differences).
- Looser thresholds (e.g., \alpha = 0.2) accept more false positives in exchange for fewer false negatives.
- False positive concept illustration:
- When there is no true difference, about 5% of experiments will yield p < 0.05 (a false positive) if the threshold is 0.05.
- Alternate thresholds shown:
- A stricter threshold of 10^{-5} would reduce false positives to about once in 100,000 experiments.
- A looser threshold like 0.2 would accept false positives about 2 times in 10 experiments (the quick check after this list reproduces all three numbers).
- Real-world interpretation:
- If the p-value is below the chosen threshold, we say the result is statistically significant and that there is evidence for a difference.
- If the p-value is above the threshold, we do not conclude a difference.
- Specific example from the transcript:
- If a p-value is calculated as 0.24, we would not conclude a difference between Drug A and Drug B in that study.
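For a well-calibrated test, the p-value is uniformly distributed between 0 and 1 when the null hypothesis is true, so the long-run false positive rate equals the threshold \alpha itself. A quick arithmetic check of the three thresholds quoted above:

```python
# Expected false positives per batch of experiments, assuming a true null:
# the rate is simply alpha, so the expected count is alpha * n_experiments.
for alpha, n_experiments in [(0.05, 100), (0.00001, 100_000), (0.2, 10)]:
    expected = alpha * n_experiments
    print(f"alpha = {alpha}: about {expected:g} false positive(s) "
          f"per {n_experiments:,} experiments")
# alpha = 0.05  -> about 5 per 100 experiments
# alpha = 1e-05 -> about 1 per 100,000 experiments
# alpha = 0.2   -> about 2 per 10 experiments
```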
Interpreting p-values: effect size vs significance
- Important nuance: a small p-value does not tell you how large the difference is (the effect size).
- Two scenarios illustrate this distinction:
- Scenario 1: a relatively large p-value (e.g., 0.24) can occur even when there is a substantial difference (e.g., 6-point difference) if the sample size is small.
- Scenario 2: a smaller p-value (e.g., 0.04) can occur with a much smaller observed difference (e.g., 1-point difference) if the sample size is large (a simulation after this list reproduces both patterns).
- Takeaway: Statistical significance (p-value) is about the evidence against the null hypothesis given the sample size, not the practical importance or magnitude of the effect.
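A sketch reproducing both scenarios. The 6-point and 1-point differences come from the transcript; the sample sizes, standard deviation, and normal model are assumptions chosen so the pattern typically emerges (exact p-values will vary with the random seed).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Scenario 1: big difference (6 points), tiny sample -> often not significant.
small_a = rng.normal(100, 10, size=5)
small_b = rng.normal(106, 10, size=5)
_, p_small = stats.ttest_ind(small_a, small_b)

# Scenario 2: small difference (1 point), huge sample -> usually significant.
big_a = rng.normal(100, 10, size=10_000)
big_b = rng.normal(101, 10, size=10_000)
_, p_big = stats.ttest_ind(big_a, big_b)

print(f"6-point difference, n=5 per group:      p = {p_small:.3f}")
print(f"1-point difference, n=10,000 per group: p = {p_big:.3g}")
```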
Practical implications, caveats, and terminology
- Terminology:
- Hypothesis testing: the framework used to decide if observed data provide enough evidence to conclude a difference.
- Null hypothesis: the assumption that there is no difference between the groups (e.g., drugs are the same).
- P-value: the metric used to decide whether to reject the null hypothesis.
- Important caveat:
- A small p-value indicates statistical significance but does not measure the size of the effect.
- A large study can detect small differences as statistically significant; a small study may fail to detect meaningful differences.
- Real-world relevance:
- Balancing false positives and false negatives is important in decision-making, depending on the stakes (e.g., medical treatments vs. mundane decisions like predicting ice cream truck arrival times).
- Final takeaway from the transcript:
- The p-value is a tool for assessing evidence against the null hypothesis, not a direct measure of how large the difference is or how important it is.
- Always consider sample size, effect size, and scientific or practical importance in addition to the p-value when drawing conclusions.
- Decision rule (null hypothesis testing):
- If p < \alpha, reject the null hypothesis; otherwise, fail to reject it (applied to this section's quoted p-values in the snippet at the end).
- Common threshold:
- \alpha = 0.05 (most commonly used, but not universal).
- Example numerical values to recall:
- Threshold example: \alpha = 0.05, a false positive rate of about 5% under the null.
- Stricter example: \alpha = 0.00001 reduces false positives to about 1 in 100,000.
- Looser example: \alpha = 0.2 tolerates more false positives (about 2 in 10).
- Observed data example (p-values):
- p = 0.9 (no evidence of difference)
- p = 0.01 (evidence of difference in a particular run, but be wary of possible false positives depending on context)
- p = 0.24 (not significant)
- p = 0.04 (significant in a larger or more precise study)
- Effect size vs p-value (conceptual):
- P-value tests the null of no difference, not the magnitude of the difference; larger studies can yield significant p-values for small differences, and small studies can yield non-significant p-values for larger differences.
- Data notes from the transcript (for reference):
- Drug A example: cure rate ≈ \frac{1043}{1046} \approx 0.997 ⇒ 99.7\% cured.
- Drug B example: cure rate ≈ \frac{2}{1434} \approx 0.00140 ⇒ 0.14\% cured.
- Real-world caveat:
- Some observed differences may be due to random variation, placebo effects, allergies, drug interactions, mislabeling, or adherence issues; increasing sample size helps mitigate these concerns.
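Finally, a minimal recap of the decision rule, applied to the p-values quoted in this section at the conventional \alpha = 0.05:

```python
def decide(p_value, alpha=0.05):
    """Apply the standard decision rule: reject the null if p < alpha."""
    if p_value < alpha:
        return "reject the null (statistically significant)"
    return "fail to reject the null (not significant)"

for p in [0.9, 0.24, 0.04, 0.01]:
    print(f"p = {p}: {decide(p)}")
# p = 0.9  -> fail to reject
# p = 0.24 -> fail to reject
# p = 0.04 -> reject
# p = 0.01 -> reject
```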