
p-values explained (StatQuest)

P-values: what they are and how to interpret them

  • P-values are numbers between 0 and 1 that quantify how confident we should be that there is a difference between two groups (e.g., drug A vs. drug B) based on the observed data.
  • The closer the p-value is to 0, the more confident we are that the groups are different.
  • A commonly used threshold (alpha) is \alpha = 0.05. If the p-value is less than this threshold, we conclude that there is a difference; if not, we fail to conclude a difference.
  • The idea behind a 0.05 threshold: if there is actually no difference (the null hypothesis is true) and we repeated the exact same experiment many times, only about 5% of those experiments would yield a p-value below 0.05 (a false positive).
  • Examples from the transcript:
    • In the summary example, a p-value below 0.05 leads to concluding that drug A is different from drug B; a p-value of 0.24 leaves us unconvinced that there is a difference.
  • False positives (Type I errors): a small p-value can occur even when there is no true difference, due solely to random variation; the transcript flags the term with its “dreaded terminology alert” (see the simulation sketch after this list).
  • Thresholds can be adjusted depending on context:
    • A stricter threshold (e.g., \alpha = 0.00001) would reduce false positives to about 1 in 100,000 experiments.
    • A looser threshold (e.g., \alpha = 0.2) would tolerate more false positives (about 2 out of 10 experiments).
  • Example outcomes discussed:
    • A p-value of 0.9, far above any common threshold, provides no evidence that drug A and drug B differ.
    • A p-value of 0.01 can arise when a real difference exists, but it can also occur through random sampling under the null; the transcript presents it as evidence of a difference in one particular run, noting it would be a false positive if no true difference exists.
  • Practical takeaway: a small p-value does not measure the size of the difference; it only indicates how unlikely the observed data would be if there were no true difference.
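
A minimal simulation sketch of the false-positive idea above, assuming NumPy and SciPy are available (the group sizes, means, and the use of a t-test are illustrative choices, not from the transcript): when both groups are drawn from the same distribution, roughly 5% of experiments still produce p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both "drugs" sampled from the SAME distribution: the null is true.
    group_a = rng.normal(loc=50, scale=10, size=20)
    group_b = rng.normal(loc=50, scale=10, size=20)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        false_positives += 1

# Under the null, about 5% of runs land below the 0.05 threshold.
print(f"False-positive rate: {false_positives / n_experiments:.3f}")
```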

Hypothesis testing and the null hypothesis

  • The process is framed as hypothesis testing.
  • The null hypothesis is the assumption that the drugs are the same (no real difference).
  • The p-value helps decide whether to reject the null hypothesis or not.
  • A small p-value leads to rejecting the null hypothesis; a large p-value leads to failing to reject it (a minimal decision-rule sketch follows this section).
  • Important nuance: rejecting the null does not tell you how big the effect is; a tiny effect can have a small p-value with a large enough sample, and a large effect can have a non-significant p-value with a small sample.
  • The transcript provides examples to illustrate this distinction:
    • An experiment can show a 6-point difference between groups yet yield a p-value of 0.24, which is not statistically significant at the given sample size.
    • A larger study with only a 1-point difference can yield a smaller p-value (e.g., 0.04) because the larger sample carries more information.
  • Takeaway: p-values assess evidence against the null hypothesis, not the magnitude of the difference.
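
A minimal sketch of the reject / fail-to-reject decision rule described above; the function name decide and the default threshold of 0.05 are illustrative assumptions, not part of the transcript.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject the null hypothesis when p < alpha."""
    if p_value < alpha:
        return "reject the null (evidence of a difference)"
    return "fail to reject the null (no conclusion of a difference)"

print(decide(0.24))  # fail to reject -- the transcript's 0.24 example
print(decide(0.01))  # reject -- though this could still be a false positive
```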

Experimental design: from single subject to larger samples

  • A single comparison (one person on drug A and one on drug B) is insufficient to draw conclusions because random factors can strongly influence outcomes.
  • Random and rare events can confound results: interactions with other medications, allergies, improper dosing, placebo effects, or mislabeling of the drug.
  • Therefore, experiments are repeated with more participants to average out random quirks and obtain more reliable evidence.
  • Example progression:
    • Start with one person per drug: inconclusive because random variation can dominate.
    • Add more participants: each drug tested on two people shows mixed outcomes, still inconclusive.
    • Test on many people: large differences in cure rates suggest real differences (e.g., Drug A cures many more people than Drug B).
  • Data from the transcript (illustrative numbers):
    • Drug A: 10,443 cured out of 10,446 treated; 3 not cured. Cure rate: \frac{10443}{10446} \approx 0.997, i.e., 99.7\%.
    • Drug B: 2 cured out of 1,434 treated; 1,432 not cured. Cure rate: \frac{2}{1434} \approx 0.0014, i.e., 0.14\%.
  • With such a dramatic contrast (A's cure rate vastly higher than B's), it seems obvious that A is better, far beyond what random chance could explain; a sketch of how a p-value could be attached to this table follows.
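
The transcript does not say how a p-value would be computed for this table; as one hedged sketch, Fisher's exact test (SciPy's fisher_exact) is a standard choice for a 2×2 table of cured / not-cured counts.

```python
from scipy.stats import fisher_exact

#            cured  not cured
table = [[10443,      3],   # Drug A
         [    2,   1432]]   # Drug B

odds_ratio, p_value = fisher_exact(table)
# The p-value is vanishingly small, matching the intuition that random
# chance cannot plausibly explain so large a contrast in cure rates.
print(f"p = {p_value:.3g}")
```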

Thresholds, false positives, and decision rules (detailed)

  • Thresholds and their implications:
    • Common threshold: \alpha = 0.05. If the computed p-value < 0.05, declare a difference.
    • Stricter thresholds reduce false positives but may increase false negatives (missing real differences).
    • Looser thresholds (like \alpha = 0.2) tolerate more false positives but reduce false negatives; the appropriate balance depends on the stakes (see the arithmetic sketch after this list).
  • False positive concept illustration:
    • When there is no true difference, about 5% of experiments will yield p < 0.05 (a false positive) if the threshold is 0.05.
  • Alternate thresholds shown:
    • A stricter threshold of 10^{-5} would reduce false positives to about once in 100,000 experiments.
    • A looser threshold like 0.2 would accept false positives about 2 times in 10 experiments.
  • Real-world interpretation:
    • If the p-value is below the chosen threshold, we say the result is statistically significant and that there is evidence for a difference.
    • If the p-value is above the threshold, we do not conclude a difference.
  • Specific example from the transcript:
    • If a p-value is calculated as 0.24, we would not conclude a difference between Drug A and Drug B in that study.
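
A quick arithmetic sketch of the threshold trade-off above: when the null hypothesis is true, the expected number of false positives is simply \alpha times the number of repeated experiments (the experiment counts below are chosen to mirror the transcript's examples).

```python
# Expected false positives under the null: alpha * number of experiments.
for alpha, n_experiments in [(0.05, 100), (0.00001, 100_000), (0.2, 10)]:
    expected = alpha * n_experiments
    print(f"alpha = {alpha:>7}: ~{expected:g} false positive(s) "
          f"per {n_experiments:,} experiments")
```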

Interpreting p-values: effect size vs significance

  • Important nuance: a small p-value does not tell you how large the difference is (the effect size).
  • Two scenarios illustrate this distinction:
    • Scenario 1: a relatively large p-value (e.g., 0.24) can occur even when there is a substantial difference (e.g., 6-point difference) if the sample size is small.
    • Scenario 2: a smaller p-value (e.g., 0.04) can occur with a much smaller observed difference (e.g., 1-point difference) if the sample size is large.
  • Takeaway: statistical significance (the p-value) reflects the strength of evidence against the null hypothesis given the sample size, not the practical importance or magnitude of the effect (see the simulation sketch below).
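
A minimal simulation sketch of the two scenarios, assuming NumPy and SciPy; the means, the common standard deviation of 10, and the sample sizes are illustrative assumptions chosen to echo the transcript's 6-point and 1-point differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Scenario 1: a 6-point true difference, but only 5 subjects per group.
small_a = rng.normal(loc=50, scale=10, size=5)
small_b = rng.normal(loc=56, scale=10, size=5)
_, p_small = stats.ttest_ind(small_a, small_b)

# Scenario 2: a 1-point true difference, but 2,000 subjects per group.
large_a = rng.normal(loc=50, scale=10, size=2000)
large_b = rng.normal(loc=51, scale=10, size=2000)
_, p_large = stats.ttest_ind(large_a, large_b)

print(f"big effect, small sample:  p = {p_small:.2f}")   # typically not significant
print(f"small effect, big sample:  p = {p_large:.4f}")   # typically significant
```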

Practical implications, caveats, and terminology

  • Terminology:
    • Hypothesis testing: the framework used to decide if observed data provide enough evidence to conclude a difference.
    • Null hypothesis: the assumption that there is no difference between the groups (e.g., drugs are the same).
    • P-value: the metric used to decide whether to reject the null hypothesis.
  • Important caveat:
    • A small p-value indicates statistical significance but does not measure the size of the effect.
    • A large study can detect small differences as statistically significant; a small study may fail to detect meaningful differences.
  • Real-world relevance:
    • Balancing false positives and false negatives is important in decision-making, depending on the stakes (e.g., medical treatments vs. mundane decisions like predicting ice cream truck arrival times).
  • Final takeaway from the transcript:
    • The p-value is a tool for assessing evidence against the null hypothesis, not a direct measure of how large the difference is or how important it is.
    • Always consider sample size, effect size, and scientific or practical importance in addition to the p-value when drawing conclusions.

Quick mathematical recap and formulas (as reference)

  • Decision rule (null hypothesis testing):
    • If p < \alpha, reject the null hypothesis; otherwise, fail to reject it.
  • Common threshold:
    • \alpha = 0.05 (most commonly used, but not universal).
  • Example numerical values to recall:
    • Threshold example: \alpha = 0.05, a false positive rate of about 5% under the null.
    • Stricter example: \alpha = 0.00001 reduces false positives to about 1 in 100{,}000.
    • Looser example: \alpha = 0.2 tolerates more false positives (about 2 in 10).
  • Observed data example (p-values):
    • p = 0.9 (no evidence of difference)
    • p = 0.01 (evidence of difference in a particular run, but be wary of possible false positives depending on context)
    • p = 0.24 (not significant)
    • p = 0.04 (significant in a larger or more precise study)
  • Effect size vs p-value (conceptual):
    • P-value tests the null of no difference, not the magnitude of the difference; larger studies can yield significant p-values for small differences, and small studies can yield non-significant p-values for larger differences.
  • Data notes from the transcript (for reference):
    • Drug A example: cure rate \frac{10443}{10446} \approx 0.997 ⇒ 99.7\% cured.
    • Drug B example: cure rate \frac{2}{1434} \approx 0.0014 ⇒ 0.14\% cured.
  • Real-world caveat:
    • Some observed differences may be due to random variation, placebo effects, allergies, drug interactions, mislabeling, or adherence issues; increasing sample size helps mitigate these concerns.