Exhaustive Guide to Statistical Hypothesis Testing and Significance Levels

Definition and Purpose of Hypothesis Testing

  • Methodological Goal: A statistical hypothesis test is a method used to determine if statistical data contains sufficient evidence to confirm an assumption or to reject it. It is widely applied in interpreting social surveys or testing new medical remedies.

  • Necessity of Rules: Specific rules and conventions must be applied during statistical analysis; otherwise, the interpretation of data would be unreliable.

  • The Scientific Approach: In statistics, we do not speak of "right" or "wrong" since nothing can be proven with absolute certainty. Instead, we use the term significant to describe the support for a statement based on calculated probabilities.

The Significance Level and P-Value

  • P-Value Definition: The probability, assuming the null hypothesis is true, of obtaining by chance alone a result at least as extreme as the one observed.

  • The Significance Thresholds:
     * Significant: The assumption is considered significant if the significance level (P-value) is at most $5\%$.
     * Highly Significant: The assumption is considered highly significant if the significance level is at most $1\%$.

  • Stating Hypotheses:
     * Null Hypothesis ($H_0$): The general statement or default position, often asserting that the result occurred by pure chance. Formally expressed as $p = p_0$.
     * Assumption A (Alternative Hypothesis): The initial research hypothesis or assumption (e.g., $p > p_0$ or $p < p_0$).

  • Interpretation of Results: If the P-value is not low enough to reject $H_0$, this does not necessarily mean $H_0$ is correct; often, it indicates that the number of trials was too small to reject it statistically.
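
The definitions above can be sketched as a small set of helpers for binomial experiments. This is a minimal sketch, not a reference implementation: the function names (`binom_pmf`, `upper_tail`, `lower_tail`, `verdict`) are illustrative choices, and the final call uses a made-up input (9 successes in 10 trials) purely to show usage.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k_obs: int, n: int, p0: float) -> float:
    """P(X >= k_obs) under H0: p = p0 (used for Assumption A: p > p0)."""
    return sum(binom_pmf(k, n, p0) for k in range(k_obs, n + 1))


def lower_tail(k_obs: int, n: int, p0: float) -> float:
    """P(X <= k_obs) under H0: p = p0 (used for Assumption A: p < p0)."""
    return sum(binom_pmf(k, n, p0) for k in range(0, k_obs + 1))


def verdict(p_value: float) -> str:
    """Apply the 5% / 1% thresholds stated above."""
    if p_value <= 0.01:
        return "highly significant"
    if p_value <= 0.05:
        return "significant"
    return "not significant"


# Hypothetical input, for illustration only: 9 successes in 10 trials, p0 = 0.5.
print(verdict(upper_tail(9, 10, 0.5)))
```

A result of "not significant" only means that $H_0$ could not be rejected with the data at hand, in line with the interpretation rule above.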

Detailed Example 1: Newborn Chick Preferences

  • The Assumption: A researcher believes a newly hatched chick prefers round grains (Assumption A: $p > 1/2$) over triangular ones, based on evolutionary theory.

  • The Null Hypothesis: The chick picks round or triangular grains with equal likelihood ($H_0$: $p = 1/2$).

  • Experiment Parameters:
     * Trials ($n$): $10$ chicks.
     * Outcome: $8$ chicks chose round grains ($\bigcirc$), and $2$ chose triangular ones ($\Delta$), represented as $\bigcirc\bigcirc\bigcirc\Delta\bigcirc\bigcirc\bigcirc\Delta\bigcirc\bigcirc$.

  • Mathematical Calculation:
     * To calculate the significance level, we sum the probability of the actual result and every event even more unlikely ($k = 8, 9, 10$).
     * $P = \sum_{k=8}^{10} \binom{10}{k} \times \left(\frac{1}{2}\right)^k \times \left(\frac{1}{2}\right)^{10-k} = \sum_{k=8}^{10} \binom{10}{k} \times \left(\frac{1}{2}\right)^{10}$
     * $P = 0.055 = 5.5\%$

  • Conclusion: Since $5.5\% > 5\%$, the result is not statistically significant enough to reject $H_0$. The researcher cannot confirm the preference for round grains.
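
The sum for Example 1 can be checked numerically. The following is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
from math import comb

n, k_obs, p0 = 10, 8, 0.5  # 10 chicks, 8 chose round grains, H0: p = 1/2

# Upper tail: the observed result and every more extreme one (k = 8, 9, 10).
p_value = sum(
    comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(k_obs, n + 1)
)

print(f"P = {p_value:.4f}")                  # P = 0.0547 (exactly 56/1024)
print("reject H0 at 5%:", p_value <= 0.05)   # reject H0 at 5%: False
```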

Detailed Example 2: Firework Reliability

  • Background: A producer claims $90\%$ of rockets ignite ($p = 0.9$). Customers suspect the real percentage is lower ($p < 0.9$).

  • Hypotheses:
     * $H_0$: $p = 0.9$
     * Assumption A: $p < 0.9$

  • Test Data:
     * $n = 100$ rockets tested.
     * $k = 85$ ignited (observed value).
     * $\mu = n \times p = 100 \times 0.9 = 90$ (expected number).

  • Significance Calculation:
     * We sum the probabilities over the range $k = 0$ to $85$.
     * $P = \sum_{k=0}^{85} \binom{100}{k} \times 0.9^k \times 0.1^{100-k} = 0.073 = 7.3\%$

  • Conclusion: $7.3\% > 5\%$. There is not enough statistical evidence to reject the producer's claim of $90\%$ reliability.
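
The lower-tail sum for Example 2 can be reproduced as follows. This is a minimal standard-library sketch; the variable names are illustrative.

```python
from math import comb

n, k_obs, p0 = 100, 85, 0.9  # 100 rockets tested, 85 ignited, H0: p = 0.9

# Lower tail: the observed count and every smaller one (k = 0, ..., 85),
# because Assumption A claims p < 0.9.
p_value = sum(
    comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(0, k_obs + 1)
)

print(f"P = {p_value:.3f}")
print("reject H0 at 5%:", p_value <= 0.05)
```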

Rationale for Testing Ranges of k-values

  • Convention: Statistics requires testing the observed number and all values less likely (the tail of the distribution).

  • Mathematical Reason: Calculating only the single point probability (e.g., $k = 85$) leads to contradictory, nonsensical results.

  • Contradiction Example: Suppose $n = 1000$ rockets and exactly $k = 900$ fire ($90\%$, matching the claim).
     * If we calculate only the probability for $k = 900$:
     * $P = \binom{1000}{900} \times 0.9^{900} \times 0.1^{100} = 4.20\%$
     * Since $4.20\% < 5\%$, we would mistakenly reject the null hypothesis even though the result perfectly matches it. This is why the cumulative probability (the range) is essential.
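
The contrast between the single point probability and the cumulative probability can be made concrete. The sketch below assumes the same setting as the contradiction example ($n = 1000$, $k = 900$, $p = 0.9$); the variable names are illustrative.

```python
from math import comb

n, k_obs, p0 = 1000, 900, 0.9


def pmf(k: int) -> float:
    """Binomial probability of exactly k ignitions out of n under H0."""
    return comb(n, k) * p0**k * (1 - p0) ** (n - k)


point = pmf(k_obs)                                   # single value k = 900
cumulative = sum(pmf(k) for k in range(k_obs + 1))   # tail k = 0, ..., 900

print(f"point probability:      {point:.4f}")
print(f"cumulative probability: {cumulative:.4f}")
```

The point probability falls below $5\%$ simply because 1001 outcomes share the total probability, while the cumulative tail stays close to one half, so the test does not reject a result that matches the claim exactly.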

One-sided vs. Two-sided Tests

  • One-sided Tests: Used when there is prior information or a clear direction for the assumption.
     * Criteria:
         * Reliable information exists before the test.
         * Only one side is of interest.
         * Assumptions are based on founded theories.
         * Previous tests show a clear tendency.
         * The other side makes no sense (e.g., a gustation test where being worse than guessing is impossible).

  • Two-sided Tests: Used when there is no prior information or the information is contradictory.
     * Criteria:
         * No information beforehand, even if the result shows a strong tendency.
         * Contradictory information exists.
         * Doubts remain and the other side is physically possible.
     * Calculation Logic: If testing two-sided, calculate the probability for one side and multiply by a factor of $2$.
     * Example 3 (Colours): Chicks tested for blue vs. red preference. $n = 20$, observed $15$ red, $5$ blue. $H_0: p = 0.5$, $A: p \neq 0.5$.
     * $P = 2 \times \sum_{k=0}^{5} \binom{20}{k} \times 0.5^{20} = 2 \times 0.0207 = 0.041 = 4.1\%$
     * Since $4.1\% < 5\%$, it is significant that one colour is preferred (red).
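
The doubling rule for two-sided tests can be sketched for Example 3. This is a minimal sketch; the function name `two_sided_p` is an illustrative choice, and capping the result at 1 is an added safeguard not mentioned in the text.

```python
from math import comb


def two_sided_p(k_small: int, n: int, p0: float = 0.5) -> float:
    """Double the lower-tail probability P(X <= k_small) under H0: p = p0.

    k_small is the count on the less frequent side (here: 5 blue).
    The doubling rule is only symmetric for p0 = 0.5, as in Example 3.
    """
    one_side = sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(k_small + 1)
    )
    return min(1.0, 2 * one_side)


p_value = two_sided_p(5, 20)
print(f"P = {p_value:.4f}")                 # P = 0.0414
print("significant at 5%:", p_value <= 0.05)  # significant at 5%: True
```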

Conceptual Types of Errors

  • Type 1 Error ($\alpha$): The incorrect rejection of a true null hypothesis $H_0$. This leads to the conclusion that a relationship exists when it does not. The risk of this error is equal to the significance level ($\alpha = P$).

  • Type 2 Error ($\beta$): The failure to reject a false null hypothesis $H_0$. The risk is harder to estimate because the true value of $p$ is unknown.

  • Example Cases:
     * Murder Trial: $H_0$: Defendant is not guilty. Type 1 error: Sentencing an innocent person. Type 2 error: Acquitting a guilty person.
     * Company Prototype: $H_0$: Old model is fine ($p = p_0$). Type 1: Producing a new model that isn't actually better (losing money). Type 2: Not producing a better model (losing customers to competition).
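
Both error risks can be computed exactly once a rejection rule is fixed. The sketch below reuses the chick setting ($n = 10$, $H_0$: $p = 0.5$) and assumes the rule "reject $H_0$ when at least 9 chicks choose round grains", which is the smallest threshold whose tail probability stays below $5\%$. The true preference `p_true = 0.8` is a made-up value for illustration; in practice it is unknown, which is exactly why $\beta$ is hard to estimate.

```python
from math import comb


def upper_tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for a binomial count with success probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


n, p0 = 10, 0.5
reject_from = 9   # assumed rule: reject H0 when k >= 9
p_true = 0.8      # hypothetical true value, for illustration only

alpha = upper_tail(reject_from, n, p0)        # reject although H0 is true
beta = 1 - upper_tail(reject_from, n, p_true)  # keep H0 although p = p_true

print(f"Type 1 risk (alpha): {alpha:.4f}")   # Type 1 risk (alpha): 0.0107
print(f"Type 2 risk (beta):  {beta:.4f}")    # Type 2 risk (beta):  0.6242
```

With only ten trials, a chick population with a strong true preference would still go undetected in most experiments, which mirrors the earlier remark that a non-significant result often reflects too few trials.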

The Critical Region (CR)

  • Definition: The set of values for which the null hypothesis can be rejected. This is the "light" variant of the significance level for non-mathematicians.

  • Example 7: New Chocolate Croissant (Schoggipfeli):
     * A master baker has $p = 0.9$ satisfaction and tests a new recipe with $n = 100$, $\alpha = 5\%$, two-sided.
     * Divide $\alpha$: $2.5\%$ for the left tail (worse) and $2.5\%$ for the right tail (better).
     * Expected value $\mu = 90$.
     * Upper Limit: Find the smallest $x$ such that $P(k \ge x) < 0.025$. For $x = 96$, $P = 2.4\%$. So $96, 97, ..., 100$ is the right part of the critical region.
     * Lower Limit: Find the largest $y$ such that $P(k \le y) < 0.025$. For $y = 83$, $P = 2.1\%$. So $0, 1, ..., 83$ is the left part of the critical region.
     * CR Result: $CR = \{0, 1, ..., 83\} \cup \{96, 97, ..., 100\}$.
     * Values in the middle, $\overline{CR} = \{84, 85, ..., 95\}$ (the complement of the critical region), show no significant difference.
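
The search for the two limits in Example 7 can be sketched as an exact computation over the binomial distribution. The variable names are illustrative, and the brute-force tail sums are chosen for readability rather than speed.

```python
from math import comb

n, p0, alpha = 100, 0.9, 0.05
half = alpha / 2  # 2.5% per tail for the two-sided test

probs = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]

# Largest y whose lower tail P(k <= y) is still below 2.5%.
y = max(k for k in range(n + 1) if sum(probs[: k + 1]) < half)
# Smallest x whose upper tail P(k >= x) is still below 2.5%.
x = min(k for k in range(n + 1) if sum(probs[k:]) < half)

print(f"CR = {{0, ..., {y}}} U {{{x}, ..., {n}}}")  # CR = {0, ..., 83} U {96, ..., 100}
print(f"P(k <= {y}) = {sum(probs[: y + 1]):.3f}")
print(f"P(k >= {x}) = {sum(probs[x:]):.3f}")
```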

Critical Region with Standard Deviation (Approximation)

  • Formulae:
     * Expected Value: $\mu = n \times p$
     * Standard Deviation: $\sigma = \sqrt{n \times p \times q}$, with $q = 1 - p$

  • Estimation Table (last 4 rows for significance; for one-sided tests only the limit on the tested side applies):
     * $5\%$ significance, one-sided: $\mu \pm 1.64\sigma$
     * $5\%$ significance, two-sided: $\mu \pm 1.96\sigma$
     * $1\%$ significance, one-sided: $\mu \pm 2.33\sigma$
     * $1\%$ significance, two-sided: $\mu \pm 2.58\sigma$

  • High n Examples:
     * Example 9 (Euro Coin): $n = 1000$, $p = 0.5$. $5\%$ level, two-sided.
         * Limits: $500 \pm 1.96 \times \sqrt{1000 \times 0.5 \times 0.5} = 500 \pm 31$.
         * $CR = \{0, 1, ..., 469\} \cup \{531, 532, ..., 1000\}$.
     * Example 10 (London Births 1664-1757):
         * $n = 1,436,587$. Compare boys vs. girls ($p = 0.5$).
         * Looking for $CR_{1\%}$ (one-sided for more boys): $\mu + 2.33\sigma = 718,293.5 + 2.33 \times \sigma = 719,690$, with $\sigma = \sqrt{n \times 0.5 \times 0.5} \approx 599.3$.
         * Observed boys were $737,629$, which lies far inside the critical region and is therefore highly significant.
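
The approximation with $\mu$ and $\sigma$ can be sketched for Examples 9 and 10. The helper name `normal_limits` is an illustrative choice; rounding the limits outward (floor on the left, ceil on the right) is an assumption that reproduces the regions quoted above.

```python
from math import ceil, floor, sqrt


def normal_limits(n: int, p: float, z: float) -> tuple[float, float]:
    """Return (mu - z*sigma, mu + z*sigma) for a binomial count."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return mu - z * sigma, mu + z * sigma


# Example 9: Euro coin, 5% level, two-sided (z = 1.96).
lo, hi = normal_limits(1000, 0.5, 1.96)
coin_cr = (floor(lo), ceil(hi))
print(f"CR = {{0, ..., {coin_cr[0]}}} U {{{coin_cr[1]}, ..., 1000}}")

# Example 10: London births, 1% level, one-sided for more boys (z = 2.33).
_, births_hi = normal_limits(1_436_587, 0.5, 2.33)
births_limit = ceil(births_hi)
observed_boys = 737_629
print(f"upper limit: {births_limit}")
print("highly significant:", observed_boys >= births_limit)
```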

Discussion of Particle Physics (CERN)

  • Higgs Particle: Confirmed at a significance of $5.9\sigma$.

  • Probability of Certainty: This corresponds to a chance of $1$ in $550$ million, i.e. a certainty of $99.99999982\%$.

  • Experiment Example: Physicists forecast one incident per $1547$ collisions. For $150,000$ registered collisions:
     * $\mu = \frac{150,000}{1547} \approx 96.962$
     * $\sigma = \sqrt{150,000 \times \frac{1}{1547} \times \frac{1546}{1547}} \approx 9.84$
     * CR for $\alpha = 1\%$ (two-sided): $96.962 \pm 2.58 \times 9.84$, giving $CR = \{0, ..., 71\} \cup \{123, ..., 150,000\}$.
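
Both CERN figures can be checked with the standard library. The sketch assumes the sigma value is converted with the one-sided tail of the standard normal distribution, which is the reading that reproduces the order of magnitude quoted above; the function and variable names are illustrative.

```python
from math import ceil, erfc, floor, sqrt


def one_sided_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))


# Sigma level to odds: roughly 1 in 550 million for 5.9 sigma.
tail = one_sided_tail(5.9)
odds = 1 / tail
print(f"1 in {odds:,.0f}")

# Collision experiment: one incident expected per 1547 collisions.
n, p = 150_000, 1 / 1547
mu = n * p
sigma = sqrt(n * p * (1 - p))
left_max = floor(mu - 2.58 * sigma)   # last value of the left part
right_min = ceil(mu + 2.58 * sigma)   # first value of the right part
print(f"mu = {mu:.3f}, sigma = {sigma:.2f}")
print(f"CR = {{0, ..., {left_max}}} U {{{right_min}, ..., {n}}}")
```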

Summary of Selected Exercise Solutions

  • Dice Manipulation (Exercise 1):
     * $100$ trials, $25$ sixes observed. Expected value $16.7$. $P = 2.2\% < 5\%$. Alice can reject $H_0$ and state that manipulation is significant.

  • Sugar Packaging (Exercise 2):
     * $n = 75$, $8$ weigh less than $1\text{ kg}$. $H_0: p = 0.05$. $P \approx 3.36\% < 5\%$. Significant evidence that too many packages weigh too little.

  • Coke Bottle Lid (Exercise 4):
     * $7$ throws, $6$ Down ($D$), $1$ Up ($U$). Two-sided test: $P = 2 \times \sum_{k=0}^{1} \binom{7}{k} \times 0.5^7 = 12.5\%$. Not significant.
     * Second test: $20$ throws, $15$ $D$, $5$ $U$. One-sided test (using previous tendency): $P = 2.07\% < 5\%$. Now significant.

  • Wine Connoisseur (Exercise 6):
     * One-sided triangular test ($p = 1/3$). Person gets $7$ out of $10$ correct. $P = 1.97\% < 5\%$. This person is significantly a connoisseur.

  • Vaccine Serum (Exercise 8):
     * New serum: $10$ out of $12$ survived. Old serum: $40\%$ survival ($p = 0.4$). $P = 2 \times 3.06\%$ (two-sided) or $P = 1.7\%$ ... Result is significant ($3.06\% < 5\%$).
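
The P-values for Exercises 1, 2, 4 and 6 can be recomputed from the figures listed above. This is a minimal sketch; the helper names are illustrative, and the tail direction for each exercise is inferred from the stated conclusions.

```python
from math import comb


def pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k_obs: int, n: int, p: float) -> float:
    """P(X >= k_obs)."""
    return sum(pmf(k, n, p) for k in range(k_obs, n + 1))


def lower_tail(k_obs: int, n: int, p: float) -> float:
    """P(X <= k_obs)."""
    return sum(pmf(k, n, p) for k in range(k_obs + 1))


results = {
    "Exercise 1, dice (25 sixes in 100)": upper_tail(25, 100, 1 / 6),
    "Exercise 2, sugar (8 light in 75)": upper_tail(8, 75, 0.05),
    "Exercise 4, lid, two-sided (1 Up in 7)": 2 * lower_tail(1, 7, 0.5),
    "Exercise 4, lid, one-sided (5 Up in 20)": lower_tail(5, 20, 0.5),
    "Exercise 6, wine (7 of 10 correct)": upper_tail(7, 10, 1 / 3),
}

for name, p_value in results.items():
    flag = "significant" if p_value <= 0.05 else "not significant"
    print(f"{name}: P = {p_value:.2%} ({flag})")
```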