DataC8 - 11.4. Error Probabilities

In hypothesis testing, we compare two hypotheses: the null hypothesis ($H_0$) and the alternative hypothesis ($H_a$).
There are four possible outcomes when testing these hypotheses:
- Test Favors the Null Hypothesis:
  - Null is True: Correct result (failing to reject $H_0$)
  - Alternative is True: Error (Type II Error, also known as a False Negative: failing to reject $H_0$ when it is false)
- Test Favors the Alternative Hypothesis:
  - Null is True: Error (Type I Error, also known as a False Positive: rejecting $H_0$ when it is true)
  - Alternative is True: Correct result (rejecting $H_0$)

When testing whether a coin is fair, the hypotheses are defined as follows:
- Null Hypothesis ($H_0$): The coin is fair (i.e., the outcomes resemble random draws from Heads and Tails).
- Alternative Hypothesis ($H_a$): The coin is not fair.
Testing is based on 2000 coin tosses:
- Expected heads if fair: $1000$ (i.e., $2000 / 2$)
- Test statistic defined as:
  $\text{test statistic} = |\text{number of heads} - 1000|$
- The empirical distribution of the test statistic under the null hypothesis places just under 5% of its area (probability) on values above 45, which are the values that favor the alternative hypothesis.
Therefore, using a 5% cutoff for the p-value leads to:
- Conclusion: If the coin is fair, there is about a 5% chance that the test will incorrectly conclude the coin is unfair (Type I Error).
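The claim about the area above 45 can be checked with a short simulation. This is a minimal sketch rather than the course's own code: the function names, the number of repetitions, and the seed are assumptions made for illustration, and it uses only the Python standard library.

```python
import random

def simulate_test_statistic(rng, num_tosses=2000):
    """Toss a fair coin num_tosses times and return |heads - num_tosses / 2|."""
    # Each of the num_tosses random bits is one toss; a 1 bit counts as heads.
    heads = bin(rng.getrandbits(num_tosses)).count("1")
    return abs(heads - num_tosses // 2)

def proportion_above(threshold, repetitions=10_000, seed=0):
    """Estimate P(test statistic > threshold) when the coin is fair."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    stats = [simulate_test_statistic(rng) for _ in range(repetitions)]
    return sum(s > threshold for s in stats) / repetitions

type_1_estimate = proportion_above(45)
print(f"Estimated chance of a statistic above 45 under the null: {type_1_estimate:.3f}")
```

Because the simulation is run under the null hypothesis, the printed proportion estimates how often a fair coin would be wrongly declared unfair when the test rejects for statistics above 45.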

The general principle states:
- If using a p-value cutoff of $\beta$%, there is about a $\beta$% chance of incorrectly rejecting $H_0$ if in fact $H_0$ is true.
Error Probability Table:
- This table outlines the four possible outcomes in hypothesis testing. The probabilities of the outcomes in which the null is true are calculated under the condition that $H_0$ is true, and the p-value cutoff is the probability of making a Type I error in that case:
- Test Favors the Null: Correct result if the null is true (fail to reject $H_0$) / Type II Error if the alternative is true (fail to reject $H_0$)
- Test Favors the Alternative: Type I Error if the null is true (reject $H_0$) / Correct result if the alternative is true (reject $H_0$)
The table is a fundamental representation of conclusions based on statistical tests.

Implementing a 1% cutoff is more stringent than a 5% cutoff, reducing the likelihood of rejecting $H_0$ if it is true.
Context in medical trials:
- Null Hypothesis ($H_0$): The treatment has no effect; differences in outcomes are due to random variation.
- Alternative Hypothesis ($H_a$): The treatment has an effect.
While a 1% cutoff reduces Type I Error probability, it does not entirely eliminate it:
- Even at a 1% cutoff, there remains a 1% chance of falsely concluding that the treatment has an effect (due to chance variation).
This remaining error comes from chance variation in the random sample (or random assignment), not from a flaw in the test.

Scenario involving multiple research groups:
- If 100 different groups independently run randomized controlled trials (RCTs) on a treatment that has no actual effect (using a 1% cutoff), then on average about one of them will incorrectly find a significant effect due to chance variation alone, and it is more likely than not that at least one does.
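The arithmetic behind this scenario can be sketched as follows, under the added assumption that the 100 trials are independent of one another and that each has exactly a 1% chance of a false positive:

$$\text{expected number of false positives} = 100 \times 0.01 = 1$$

$$P(\text{at least one false positive}) = 1 - (1 - 0.01)^{100} = 1 - 0.99^{100} \approx 0.634$$

So a false positive somewhere among the 100 groups is likely, but not guaranteed.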
Importance of replication:
- Other researchers should validate findings by replicating experiments to confirm or refute initial conclusions regarding the treatment's effects.
Issues with testing multiple hypotheses:
- In trials assessing various effects of a drug, it is possible that some tests may show a treatment effect by randomness alone, even if the treatment is ineffective.
Recommendations when studying research:
- Consider how many different hypotheses were tested before the one that was published was reported as statistically significant.
- Caution is advised if multiple tests were conducted before arriving at a significant result, indicating possible data snooping or p-hacking, where data is manipulated or misused to produce significant results.
- Validating reported results through replication is essential to confirm that the treatment effect exists.
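The effect of testing many hypotheses can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any real trial: it assumes the treatment has no effect on any outcome, that the tests are independent, and that each p-value is then uniformly distributed between 0 and 1. The choice of 20 tests per study and a 5% cutoff is hypothetical.

```python
import random

def fraction_with_false_positive(num_tests, cutoff, num_studies=10_000, seed=0):
    """Fraction of simulated studies in which at least one test looks significant."""
    rng = random.Random(seed)
    studies_with_hit = 0
    for _ in range(num_studies):
        # Assumption: under the null, each p-value is uniform on [0, 1)
        # and the tests within a study are independent.
        p_values = [rng.random() for _ in range(num_tests)]
        if min(p_values) < cutoff:
            studies_with_hit += 1
    return studies_with_hit / num_studies

NUM_TESTS = 20   # hypothetical number of hypotheses tested per study
CUTOFF = 0.05    # p-value cutoff applied to every test

simulated = fraction_with_false_positive(NUM_TESTS, CUTOFF)
theoretical = 1 - (1 - CUTOFF) ** NUM_TESTS
print(f"Simulated:   {simulated:.3f}")
print(f"Theoretical: {theoretical:.3f}")
```

Under these assumptions, well over half of the simulated studies report at least one "significant" result even though the treatment does nothing, which is why the number of hypotheses tested matters when reading a published finding.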

Be aware that there is another error type:
- Type II Error: Concluding the treatment has no effect when it truly does have an effect.
Acknowledgment of the dilemma in hypothesis testing:
- Efforts to minimize Type I errors tend to increase Type II errors, and vice versa. This trade-off is critical in the design and interpretation of statistical tests.
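The trade-off can be made concrete with the 2000-toss coin test. The sketch below computes exact binomial probabilities using only the standard library. The function names and the biased coin with a 0.53 chance of heads are hypothetical, chosen only to give the alternative a specific form so that a Type II error probability can be computed.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k heads in n tosses), computed through logs to avoid underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def rejection_probability(threshold, p, n=2000):
    """P(|heads - n/2| > threshold) when the chance of heads is p."""
    center = n // 2
    return sum(binomial_pmf(k, n, p) for k in range(n + 1)
               if abs(k - center) > threshold)

def smallest_threshold(cutoff, n=2000):
    """Smallest threshold whose Type I error probability is at most cutoff."""
    threshold = 0
    while rejection_probability(threshold, 0.5, n) > cutoff:
        threshold += 1
    return threshold

BIASED_P = 0.53  # hypothetical unfair coin, used only to evaluate Type II error

results = {}
for cutoff in (0.05, 0.01):
    threshold = smallest_threshold(cutoff)
    type_1 = rejection_probability(threshold, 0.5)           # reject although fair
    type_2 = 1 - rejection_probability(threshold, BIASED_P)  # miss the biased coin
    results[cutoff] = (threshold, type_1, type_2)
    print(f"cutoff {cutoff:.0%}: reject if statistic > {threshold}, "
          f"Type I = {type_1:.4f}, Type II = {type_2:.4f}")
```

Moving from the 5% cutoff to the 1% cutoff raises the rejection threshold, which lowers the Type I error probability but makes it more likely that this particular biased coin goes undetected.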