Exhaustive Guide to Statistical Hypothesis Testing and Significance Levels

Definition and Purpose of Hypothesis Testing

Methodological Goal: A statistical hypothesis test is a method used to determine if statistical data contains sufficient evidence to confirm an assumption or to reject it. It is widely applied in interpreting social surveys or testing new medical remedies.
Necessity of Rules: Specific rules and conventions must be applied during statistical analysis; otherwise, the interpretation of data would be unreliable.
The Scientific Approach: In statistics, we do not speak of "right" or "wrong" since nothing can be proven with absolute certainty. Instead, we use the term significant to describe the support for a statement based on calculated probabilities.

The Significance Level and P-Value

P-Value Definition: The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
The Significance Thresholds: * Significant: The assumption is considered significant if the significance level (P-value) is at most $5\%$ . * Highly Significant: The assumption is considered highly significant if the significance level is at most $1\%$ .
Stating Hypotheses: * Null Hypothesis ( $H_0$ ): The general statement or default position, often asserting that the result occurred by pure chance. Formally expressed as $p = p_0$ . * Assumption A (Alternative Hypothesis): The initial research hypothesis or assumption (e.g., $p > p_0$ or $p < p_0$ ).
Interpretation of Results: If the P-value is not low enough to reject $H_0$ , it does not necessarily mean $H_0$ is correct; often, it indicates the number of trials was not high enough to reject it statistically.

Detailed Example 1: Newborn Chick Preferences

The Assumption: A researcher believes a newly hatched chick prefers round grains (Assumption A: $p > 1/2$ ) over triangular ones, based on evolutionary theory.
The Null Hypothesis: The chick picks round or triangular grains with equal likelihood ( $H_0$ : $p = 1/2$ ).
Experiment Parameters: * Trials ( $n$ ): $10$ chicks. * Outcome: $8$ chicks chose round grains ( $\bigcirc$ ), and $2$ chose triangular ones ( $\Delta$ ), represented as $\bigcirc\bigcirc\bigcirc\Delta\bigcirc\bigcirc\bigcirc\Delta\bigcirc\bigcirc$ .
Mathematical Calculation: * To calculate the significance level, we sum the probability of the actual result and every event even more unlikely ( $k = 8, 9, 10$ ). * $P = \sum_{k=8}^{10} \binom{10}{k} \times \left(\frac{1}{2}\right)^k \times \left(\frac{1}{2}\right)^{10-k} = \sum_{k=8}^{10} \binom{10}{k} \times \left(\frac{1}{2}\right)^{10}$ * $P = 0.055 = 5.5\%$
Conclusion: Since $5.5\% > 5\%$ , the result is not statistically significant, so $H_0$ cannot be rejected. The researcher cannot confirm the preference for round grains.
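The sum in Example 1 can be checked numerically. The following is a minimal Python sketch rather than part of the original guide; the helper names `binom_pmf` and `upper_tail` are illustrative assumptions, and only the standard library is used.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k_obs, n, p):
    """P(X >= k_obs): the observed count and every larger count."""
    return sum(binom_pmf(k, n, p) for k in range(k_obs, n + 1))

# Example 1: 8 of 10 chicks chose round grains, H0: p = 1/2.
p_value = upper_tail(8, 10, 0.5)
print(f"P(X >= 8) = {p_value:.4f}")        # about 5.5%
print("significant at 5%:", p_value <= 0.05)
```

The three terms add up to $56/1024$, which is where the $5.5\%$ comes from.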

Detailed Example 2: Firework Reliability

Background: A producer claims $90\%$ of rockets ignite ( $p = 0.9$ ). Customers suspect the real percentage is lower ( $p < 0.9$ ).
Hypotheses: * $H_0$ : $p = 0.9$ * Assumption A: $p < 0.9$
Test Data: * $n = 100$ rockets tested. * $k = 85$ ignited (observed value). * $\mu = n \times p = 100 \times 0.9 = 90$ (expected number).
Significance Calculation: * We calculate the range for $k = 0$ to $85$ . * $P = \sum_{k=0}^{85} \binom{100}{k} \times 0.9^k \times 0.1^{100-k} = 0.073 = 7.3\%$
Conclusion: $7.3\% > 5\%$ . There is not enough statistical evidence to reject the producer's claim of $90\%$ reliability.
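Because Assumption A points downward here ( $p < 0.9$ ), the relevant tail is the lower one. A short Python sketch of that calculation follows; `lower_tail` is a hypothetical helper name, not something defined in the guide.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_tail(k_obs, n, p):
    """P(X <= k_obs): the observed count and every smaller count."""
    return sum(binom_pmf(k, n, p) for k in range(0, k_obs + 1))

# Example 2: 85 of 100 rockets ignited, H0: p = 0.9.
p_value = lower_tail(85, 100, 0.9)
print(f"P(X <= 85) = {p_value:.3f}")
print("reject H0 at 5%:", p_value <= 0.05)
```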

Rationale for Testing Ranges of k-values

Convention: Statistics requires testing the observed number and all values less likely (the tail of the distribution).
Mathematical Reason: Calculating only the single point probability (e.g., $k = 85$ ) leads to contradictory, nonsensical results, because with many trials every individual outcome is unlikely, even the expected one.
Contradiction Example: Suppose $n = 1000$ rockets and exactly $k = 900$ fire ( $90\%$ , matching the claim). * If we calculate only the probability for $k = 900$ : * $P = \binom{1000}{900} \times 0.9^{900} \times 0.1^{100} = 4.20\%$ * Since $4.20\% < 5\%$ , we would mistakenly reject the null hypothesis even though the result perfectly matches it. This is why the cumulative probability (the range) is essential.
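The contradiction can be reproduced in a few lines. This is an illustrative Python sketch under the same binomial model; the helper names are assumptions, and the cumulative value is printed rather than quoted because the guide does not state it.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_tail(k_obs, n, p):
    """P(X <= k_obs): the observed count and every smaller count."""
    return sum(binom_pmf(k, n, p) for k in range(0, k_obs + 1))

n, p0, k_obs = 1000, 0.9, 900

point = binom_pmf(k_obs, n, p0)    # single outcome only
tail = lower_tail(k_obs, n, p0)    # observed value plus the whole lower tail

print(f"P(X = 900)  = {point:.4f}")   # below 5% although the data match H0
print(f"P(X <= 900) = {tail:.4f}")    # far above 5%, so H0 is not rejected
```

The point probability shrinks as $n$ grows, while the tail probability for a result equal to the expected value stays near one half.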

One-sided vs. Two-sided Tests

One-sided Tests: Used when there is prior information or a clear direction for the assumption. * Criteria: * Reliable information exists before the test. * Only one side is of interest. * Assumptions are based on founded theories. * Previous tests show a clear tendency. * The other side makes no sense (e.g., a gustation test where being worse than guessing is impossible).
Two-sided Tests: Used when there is no prior information or the information is contradictory. * Criteria: * No information beforehand, even if the result shows a strong tendency. * Contradictory information exists. * Doubts remain and the other side is physically possible. * Calculation Logic: If testing two-sided, calculate the probability for one side and multiply it by $2$ . * Example 3 (Colours): Chicks tested for blue vs. red preference. $n = 20$ , observed $15$ red, $5$ blue. $H_0: p = 0.5$ , $A: p \neq 0.5$ . * $P = 2 \times \sum_{k=0}^{5} \binom{20}{k} \times 0.5^{20} = 2 \times 0.0207 = 0.041 = 4.1\%$ * Since $4.1\% < 5\%$ , it is significant that one colour is preferred (red).
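The doubling rule from Example 3 can be written as a small function. The sketch below is an assumption-laden illustration: `two_sided_p` is a made-up name, and it simply doubles the smaller tail (capped at 1), which matches the guide's rule for the symmetric case $p = 0.5$ .

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p(k_obs, n, p=0.5):
    """Double the smaller of the two tails that contain k_obs."""
    lower = sum(binom_pmf(k, n, p) for k in range(0, k_obs + 1))
    upper = sum(binom_pmf(k, n, p) for k in range(k_obs, n + 1))
    return min(1.0, 2 * min(lower, upper))

# Example 3: 5 blue and 15 red out of 20, H0: p = 0.5.
p_value = two_sided_p(5, 20)
print(f"two-sided P = {p_value:.4f}")
print("significant at 5%:", p_value <= 0.05)
```

Counting the $15$ red choices instead of the $5$ blue ones gives the same value, since the distribution is symmetric under $H_0$ .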

Conceptual Types of Errors

Type 1 Error ($\alpha$): The incorrect rejection of a true null hypothesis $H_0$ . This leads to the conclusion that a relationship exists when it does not. The risk of this error equals the significance level $\alpha$ chosen for the test, since $H_0$ is rejected whenever $P \le \alpha$ .
Type 2 Error ($\beta$): The failure to reject a false null hypothesis $H_0$ . The risk is harder to estimate because the true value of $p$ is unknown.
Example Cases: * Murder Trial: $H_0$ : Defendant is not guilty. Type 1 error: Sentencing an innocent person. Type 2 error: Acquitting a guilty person. * Company Prototype: $H_0$ : Old model is fine ( $p = p_0$ ). Type 1: Producing a new model that isn't actually better (losing money). Type 2: Not producing a better model (losing customers to competition).
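To make the two risks concrete, the firework test from Example 2 can be reused. The sketch below is hypothetical: the true ignition rate `p_true = 0.8` is an invented value chosen only for illustration, since the real $p$ is unknown, and the helper names are assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_tail(c, n, p):
    """P(X <= c)."""
    return sum(binom_pmf(k, n, p) for k in range(0, c + 1))

n, p0, alpha = 100, 0.9, 0.05

# One-sided rule: reject H0 when the observed count is <= cutoff,
# where cutoff is the largest value whose lower tail under H0 is <= alpha.
cutoff = max(c for c in range(n + 1) if lower_tail(c, n, p0) <= alpha)

# Type 1 risk: rejecting although H0 (p = 0.9) is true.
type1_risk = lower_tail(cutoff, n, p0)

# Type 2 risk: NOT rejecting although the true rate is lower.
p_true = 0.8  # hypothetical value, not taken from the guide
type2_risk = 1 - lower_tail(cutoff, n, p_true)

print(f"reject H0 if k <= {cutoff}")
print(f"Type 1 risk = {type1_risk:.3f}")
print(f"Type 2 risk (if p were {p_true}) = {type2_risk:.3f}")
```

The Type 2 risk changes with every assumed `p_true`, which is the reason it is harder to state than the Type 1 risk.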

The Critical Region (CR)

Definition: The set of values for which the null hypothesis can be rejected. This is the "light" variant of the significance level for non-mathematicians.
Example 7: New Chocolate Croissant (Schoggipfeli): * Master baker has $p = 0.9$ satisfaction. Tests new recipe with $n = 100$ . $\alpha = 5\%$ , two-sided. * Divide $\alpha$ : $2.5\%$ for the left tail (worse) and $2.5\%$ for the right tail (better). * Expected value $\mu = 90$ . * Upper Limit: Find the smallest $x$ such that $P(k \ge x) < 0.025$ . For $x = 96$ , $P = 2.4\%$ . So $96, 97, ..., 100$ is the right part of the region. * Lower Limit: Find the largest $y$ such that $P(k \le y) < 0.025$ . For $y = 83$ , $P = 2.1\%$ . So $0, 1, ..., 83$ is the left part of the region. * CR Result: $CR = \{0, 1, ..., 83\} \cup \{96, 97, ..., 100\}$ . * Values in the middle, $\overline{CR} = \{84, 85, ..., 95\}$ , show no significant difference.
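The search for the two limits in Example 7 can be automated. The following Python sketch scans all possible counts with exact binomial tails; the function and variable names are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_tail(y, n, p):
    """P(X <= y)."""
    return sum(binom_pmf(k, n, p) for k in range(0, y + 1))

def upper_tail(x, n, p):
    """P(X >= x)."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

n, p0, alpha = 100, 0.9, 0.05
half = alpha / 2  # two-sided: 2.5% per tail

# Largest y whose lower tail stays below 2.5%.
y = max(k for k in range(n + 1) if lower_tail(k, n, p0) < half)
# Smallest x whose upper tail stays below 2.5%.
x = min(k for k in range(n + 1) if upper_tail(k, n, p0) < half)

print(f"CR = {{0, ..., {y}}} u {{{x}, ..., {n}}}")
print(f"left tail  = {lower_tail(y, n, p0):.3f}")
print(f"right tail = {upper_tail(x, n, p0):.3f}")
```

Note that the region is not symmetric around $\mu = 90$ , because the binomial distribution with $p = 0.9$ is skewed.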

Critical Region with Standard Deviation (Approximation)

Formulae: * Expected Value: $\mu = n \times p$ * Standard Deviation: $\sigma = \sqrt{n \times p \times q}$
Estimation Table (Last 4 rows for significance): * $5\%$ significance, one-sided: $\mu \pm 1.64\sigma$ * $5\%$ significance, two-sided: $\mu \pm 1.96\sigma$ * $1\%$ significance, one-sided: $\mu \pm 2.33\sigma$ * $1\%$ significance, two-sided: $\mu \pm 2.58\sigma$
High n Examples: * Example 9 (Euro Coin): $n = 1000$ , $p = 0.5$ . $5\%$ level, two-sided. * Limits: $500 \pm 1.96 \times \sqrt{1000 \times 0.5 \times 0.5} = 500 \pm 31$ . * $CR = \{0, 1, ..., 469\} \cup \{531, 532, ..., 1000\}$ . * Example 10 (London Births 1664-1757): * $n = 1,436,587$ . Compare boys vs. girls ( $p = 0.5$ ). * Looking for $CR_{1\%}$ (one-sided for more boys): $\mu + 2.33\sigma = 718,293.5 + 2.33 \times \sigma = 719,690$ . * Observed boys were $737,629$ , which is highly significant.
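For large $n$ the limits follow directly from $\mu$ and $\sigma$ . The sketch below recomputes Examples 9 and 10 in Python; `mu_sigma` is a hypothetical helper, and the rounding (floor for a lower limit, ceil for an upper limit) is an assumption chosen to match the sets given above.

```python
from math import sqrt, floor, ceil

def mu_sigma(n, p):
    """Expected value and standard deviation of a binomial count."""
    return n * p, sqrt(n * p * (1 - p))

# Example 9: Euro coin, n = 1000, p = 0.5, 5% level, two-sided (factor 1.96).
mu, sigma = mu_sigma(1000, 0.5)
lower_limit = floor(mu - 1.96 * sigma)   # last value of the left part
upper_limit = ceil(mu + 1.96 * sigma)    # first value of the right part
print(f"CR = {{0, ..., {lower_limit}}} u {{{upper_limit}, ..., 1000}}")

# Example 10: London births, one-sided 1% level (factor 2.33).
n_births, boys = 1_436_587, 737_629
mu_b, sigma_b = mu_sigma(n_births, 0.5)
limit = ceil(mu_b + 2.33 * sigma_b)
z_obs = (boys - mu_b) / sigma_b          # observed distance in units of sigma
print(f"critical limit = {limit}, observed = {boys}, z = {z_obs:.1f}")
```

The observed number of boys lies more than $30\sigma$ above the expected value, far beyond the $2.33\sigma$ limit.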

Discussion of Particle Physics (CERN)

Higgs Particle: Confirmed with a significance of $5.9\sigma$ .
Probability of Certainty: This corresponds to a probability of $1$ in $550$ million that the result is due to chance, or a certainty of $99.99999982\%$ .
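The link between a number of standard deviations and a probability can be sketched with the normal distribution. The snippet assumes that the quoted odds refer to the one-sided tail of a standard normal distribution; the guide does not say so explicitly, and `one_sided_tail` is an illustrative name.

```python
from math import erfc, sqrt

def one_sided_tail(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))

tail = one_sided_tail(5.9)
odds = 1 / tail

print(f"tail probability = {tail:.3e}")
print(f"about 1 in {odds:,.0f}")
print(f"certainty = {100 * (1 - tail):.8f}%")
```

The same function returns roughly $2.5\%$ for $z = 1.96$ , which is the factor used for the two-sided $5\%$ level.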
Experiment Example: Physicists predict that an event occurs once in $1547$ collisions. In $150,000$ registered collisions: * $\mu = \frac{150,000}{1547} \approx 96.962$ * $\sigma = \sqrt{150,000 \times \frac{1}{1547} \times \frac{1546}{1547}} \approx 9.84$ * $\text{CR for } \alpha = 1\% \text{ (two-sided)}: 96.962 \pm 2.58 \times 9.84$ , which gives $CR = \{0, ..., 71\} \cup \{123, ..., 150,000\}$ .

Summary of Selected Exercise Solutions

Dice Manipulation (Exercise 1): * $100$ trials, $25$ sixes observed. Expected value $16.7$ . $P = 2.2\% < 5\%$ . Alice can reject $H_0$ and state manipulation is significant.
Sugar Packaging (Exercise 2): * $n = 75$ , $8$ weigh less than $1\text{ kg}$ . $H_0: p = 0.05$ . $P \approx 3.36\% < 5\%$ . Significant evidence that too many packages weigh too little.
Coke Bottle Lid (Exercise 4): * $7$ throws, $6$ Down ( $D$ ), $1$ Up ( $U$ ). Two-sided test: $P = 2 \times \sum_{k=0}^{1} \binom{7}{k} \times 0.5^7 = 2 \times \frac{8}{128} = 12.5\%$ . Not significant. * Second test: $20$ throws, $15$ $D$ , $5$ $U$ . One-sided test (using previous tendency): $P = 2.07\% < 5\%$ . Now significant.
Wine Connoisseur (Exercise 6): * One-sided triangular test ( $p = 1/3$ ). Person gets $7$ out of $10$ correct. $P = 1.97\% < 5\%$ . The result is significant: the person can be regarded as a connoisseur.
Vaccine Serum (Exercise 8): * New serum: $10$ out of $12$ survived. Old serum: $40\%$ survival ( $p = 0.4$ ). $P = 2 \times 3.06\%$ (two-sided) or $P = 1.7\% \dots$ Result is significant ( $3.06\% < 5\%$ ).
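Several of the exercise values can be recomputed with the same tail sums. The sketch below covers Exercises 1, 2, 4 and 6 only; the dictionary labels and helper names are illustrative assumptions, and each $H_0$ probability is taken from the corresponding line above.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_tail(k_obs, n, p):
    """P(X <= k_obs)."""
    return sum(binom_pmf(k, n, p) for k in range(0, k_obs + 1))

def upper_tail(k_obs, n, p):
    """P(X >= k_obs)."""
    return sum(binom_pmf(k, n, p) for k in range(k_obs, n + 1))

results = {
    "Ex. 1 dice, 25 or more sixes in 100": upper_tail(25, 100, 1 / 6),
    "Ex. 2 sugar, 8 or more light packs in 75": upper_tail(8, 75, 0.05),
    "Ex. 4 lid, 7 throws, two-sided": 2 * lower_tail(1, 7, 0.5),
    "Ex. 4 lid, 20 throws, one-sided": lower_tail(5, 20, 0.5),
    "Ex. 6 wine, 7 or more correct of 10": upper_tail(7, 10, 1 / 3),
}

for label, p_value in results.items():
    verdict = "significant" if p_value <= 0.05 else "not significant"
    print(f"{label}: P = {p_value:.2%} ({verdict})")
```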