DataC8 - 11.4. Error Probabilities

Contents

  • 11.4.1. Wrong Conclusions

  • 11.4.2. The Chance of an Error

  • 11.4.3. The Cutoff for the p-value is an Error Probability

  • 11.4.4. Data Snooping and p-Hacking

  • 11.4.5. Technical Note: The Other Kind of Error

11.4.1. Wrong Conclusions

  • In hypothesis testing, we compare two hypotheses: the null hypothesis ($H_0$) and the alternative hypothesis ($H_a$).

  • There are four possible outcomes when testing these hypotheses:

    • Test Favors the Null Hypothesis:

      • Null is True: Correct result (Accepting $H_0$)

      • Alternative is True: Error (Type II Error, also known as a False Negative, incorrectly accepting $H_0$)

    • Test Favors the Alternative Hypothesis:

      • Null is True: Error (Type I Error, also known as a False Positive, incorrectly rejecting $H_0$)

      • Alternative is True: Correct result (Rejecting $H_0$)
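The four outcomes above can be written as a small lookup. This is only an illustrative sketch; the function name `classify_outcome` and its arguments are hypothetical and do not come from the original notes.

```python
def classify_outcome(null_is_true: bool, test_favors_alternative: bool) -> str:
    """Name the outcome of a test, given the truth and the test's conclusion."""
    if null_is_true:
        # The null is true, so favoring the alternative is a false positive.
        return "Type I error" if test_favors_alternative else "correct"
    # The alternative is true, so favoring the null is a false negative.
    return "correct" if test_favors_alternative else "Type II error"


# Print all four combinations of truth and test conclusion.
for null_is_true in (True, False):
    for favors_alternative in (False, True):
        truth = "null true" if null_is_true else "alternative true"
        verdict = "favors alternative" if favors_alternative else "favors null"
        print(f"{truth:17} | {verdict:18} | "
              f"{classify_outcome(null_is_true, favors_alternative)}")
```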

11.4.2. The Chance of an Error

  • When testing whether a coin is fair, the hypotheses are defined as follows:

    • Null Hypothesis ($H_0$): The coin is fair (i.e., the outcomes resemble random draws from Heads and Tails).

    • Alternative Hypothesis ($H_a$): The coin is not fair.

  • Testing is based on 2000 coin tosses:

    • Expected heads if fair: $1000$ (i.e., $2000 / 2$)

    • Test statistic defined as:
      $\text{test statistic} = |\text{number of heads} - 1000|$

    • In the empirical distribution of the test statistic under the null hypothesis, the area (probability) to the right of 45 is just under 5%. Large values of the statistic favor the alternative hypothesis.

  • Therefore, using a 5% cutoff for the p-value leads to:

    • Conclusion: If the coin is fair, there is about a 5% chance that the test will incorrectly conclude the coin is unfair (Type I Error).
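The "just under 5%" figure can be checked with a short simulation of the fair-coin test. This is a rough sketch using only the Python standard library, not the original course code. The function name, the 10,000 repetitions, the seed, and the reading of "to the right of 45" as "45 or more" are all assumptions made for illustration.

```python
import random


def simulate_statistics(num_tosses=2000, repetitions=10_000, seed=0):
    """Simulate |heads - num_tosses / 2| for a fair coin, many times over."""
    rng = random.Random(seed)
    expected_heads = num_tosses / 2
    statistics = []
    for _ in range(repetitions):
        # Each of the num_tosses random bits stands for one toss; 1 means heads.
        heads = bin(rng.getrandbits(num_tosses)).count("1")
        statistics.append(abs(heads - expected_heads))
    return statistics


statistics = simulate_statistics()

# Proportion of simulated statistics in the region that favors the alternative.
tail_proportion = sum(s >= 45 for s in statistics) / len(statistics)
print(f"Proportion of simulated statistics at or above 45: {tail_proportion:.3f}")
```

Because every simulated coin here is fair, the printed proportion estimates how often the test would wrongly call a fair coin unfair.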

11.4.3. The Cutoff for the p-value is an Error Probability

  • The general principle states:

    • If you use a p-value cutoff of $p$%, and $H_0$ is in fact true, there is about a $p$% chance that the test will incorrectly reject $H_0$.

  • Error Probability Table:

    • This table outlines the four possible outcomes in hypothesis testing. The probabilities in its top row are calculated under the condition that $H_0$ is true; in that row, the p-value cutoff is the approximate probability of making an error (the Type I Error):

    • Null is True: Correct result (Accept $H_0$) / Type I Error (Reject $H_0$)

    • Alternative is True: Type II Error (Fail to reject $H_0$) / Correct result (Reject $H_0$)

  • The table is a fundamental representation of conclusions based on statistical tests.
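The link between the cutoff and the Type I error probability can be checked exactly for the 2000-toss coin example, without simulation. The sketch below is an illustration under that setup, and the names `null_dist`, `p_value`, and `type_1_error_rate` are hypothetical. Because the number of heads is a whole number, the error rate comes out close to the cutoff but not above it, which is why the principle says "about".

```python
from math import comb

N = 2000  # number of tosses, as in the coin example

# Exact distribution of the statistic |heads - 1000| when the coin is fair.
null_dist = {}
for heads in range(N + 1):
    stat = abs(heads - N // 2)
    null_dist[stat] = null_dist.get(stat, 0.0) + comb(N, heads) / 2**N


def p_value(observed_stat):
    """Chance, under the null, of a statistic at least as large as observed."""
    return sum(prob for stat, prob in null_dist.items() if stat >= observed_stat)


def type_1_error_rate(cutoff):
    """Chance, under the null, that the p-value is at or below the cutoff."""
    return sum(prob for stat, prob in null_dist.items() if p_value(stat) <= cutoff)


rates = {cutoff: type_1_error_rate(cutoff) for cutoff in (0.05, 0.01)}
for cutoff, rate in rates.items():
    print(f"cutoff {cutoff:.0%}: chance of rejecting a true null = {rate:.4f}")
```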

11.4.3.1. Controlling for the Error

  • Implementing a 1% cutoff is more stringent than a 5% cutoff, reducing the likelihood of rejecting $H_0$ if it is true.

  • Context in medical trials:

    • Null Hypothesis ($H_0$): The treatment has no effect; differences in outcomes are due to random variation.

    • Alternative Hypothesis ($H_a$): The treatment has an effect.

  • While a 1% cutoff reduces Type I Error probability, it does not entirely eliminate it:

    • Even at a 1% cutoff, if the treatment has no effect, there remains about a 1% chance of falsely concluding that it does, purely because of chance variation.

  • Random sampling is the source of this chance variation: data from a random sample can occasionally lead the test astray.

11.4.4. Data Snooping and p-Hacking

  • Scenario involving multiple research groups:

    • If 100 different groups run independent randomized controlled trials (RCTs) on a treatment that has no actual effect, each using a 1% cutoff, then on average one of them will incorrectly find a significant effect due to chance variation. The chance that at least one group does so is $1 - 0.99^{100} \approx 63\%$.

  • Importance of replication:

    • Other researchers should validate findings by replicating experiments to confirm or refute initial conclusions regarding the treatment's effects.

  • Issues with testing multiple hypotheses:

    • In trials assessing various effects of a drug, it is possible that some tests may show a treatment effect by randomness alone, even if the treatment is ineffective.

  • Recommendations when studying research:

    • Consider how many different hypotheses were tested before the one that was published was reported as statistically significant.

    • Caution is advised if multiple tests were conducted before arriving at a significant result, indicating possible data snooping or p-hacking, where data is manipulated or misused to produce significant results.

    • Validating reported results through replication is essential to confirm that the treatment effect exists.
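The arithmetic behind the 100-trial scenario is short enough to sketch directly. It assumes the trials are independent and that each has exactly a 1% chance of a false positive; neither assumption is guaranteed in practice, and the variable names are illustrative.

```python
num_trials = 100  # research groups testing the same ineffective treatment
cutoff = 0.01     # each group uses a 1% p-value cutoff

# On average, this many groups wrongly report a significant effect.
expected_false_positives = num_trials * cutoff

# Chance that at least one group reports an effect: 1 - P(no group does).
chance_at_least_one = 1 - (1 - cutoff) ** num_trials

print(f"Expected number of false positives: {expected_false_positives:.0f}")
print(f"Chance of at least one false positive: {chance_at_least_one:.1%}")
```

The same calculation applies to one group running many tests on the same data, which is why the number of hypotheses tested matters when reading a published result.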

11.4.5. Technical Note: The Other Kind of Error

  • Be aware that there is another error type:

    • Type II Error: Concluding the treatment has no effect when it truly does have an effect.

  • Acknowledgment of the dilemma in hypothesis testing:

    • Efforts to minimize Type I errors tend to increase Type II errors, and vice versa. This trade-off is critical in the design and interpretation of statistical tests.
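The trade-off can be illustrated with the 2000-toss coin test by comparing a 5% cutoff with a 1% cutoff. This is a sketch under stated assumptions: the "unfair" coin with a 53% chance of heads is hypothetical, chosen only to make the Type II error visible, and the helper names are illustrative.

```python
from math import exp, lgamma, log

N = 2000  # number of tosses, as in the coin example


def stat_distribution(p_heads):
    """Exact distribution of |heads - N/2| when P(heads) = p_heads."""
    dist = {}
    for heads in range(N + 1):
        # Binomial probability computed in log space to avoid overflow.
        log_prob = (lgamma(N + 1) - lgamma(heads + 1) - lgamma(N - heads + 1)
                    + heads * log(p_heads) + (N - heads) * log(1 - p_heads))
        stat = abs(heads - N // 2)
        dist[stat] = dist.get(stat, 0.0) + exp(log_prob)
    return dist


def tail(dist, threshold):
    """Chance that the statistic is at least the threshold."""
    return sum(prob for stat, prob in dist.items() if stat >= threshold)


def rejection_threshold(null_dist, cutoff):
    """Smallest statistic whose p-value under the null is at or below the cutoff."""
    return min(stat for stat in null_dist if tail(null_dist, stat) <= cutoff)


null_dist = stat_distribution(0.5)
biased_dist = stat_distribution(0.53)  # hypothetical unfair coin (assumption)

results = {}
for cutoff in (0.05, 0.01):
    threshold = rejection_threshold(null_dist, cutoff)
    type_1 = tail(null_dist, threshold)        # rejecting when the coin is fair
    type_2 = 1 - tail(biased_dist, threshold)  # not rejecting when it is biased
    results[cutoff] = (type_1, type_2)
    print(f"cutoff {cutoff:.0%}: reject at statistic >= {threshold}, "
          f"Type I = {type_1:.3f}, Type II = {type_2:.3f}")
```

Tightening the cutoff from 5% to 1% lowers the Type I error rate but raises the Type II error rate for this particular biased coin. The size of the increase depends on how unfair the coin actually is.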