Lecture 3: Exhaustive Notes on Hypothesis Testing and Statistical Power

The Research Cycle and Statistical Modeling

The research process follows a cyclical trajectory, moving from vague curiosity to formal data analysis and conclusion. While this cycle is often idealised, it serves as the foundational structure for scientific inquiry.

  • Initial Stage: Developing a research question. This can range from a vague idea to a well-formulated query.

  • Hypothesis Generation: Translating research questions into clear, testable hypotheses as a prerequisite for data collection.

  • Data Collection and Descriptive Statistics: Identifying whether data will come from a laboratory experiment, fieldwork, or a computational model. Initial summaries typically involve calculating the mean ($\mu$) and variance ($\sigma^2$).

  • Analysis Hierarchy:

    • T-test: Used for comparing the means of two groups defined by a categorical variable.

    • ANOVA (Analysis of Variance): The extension of a T-test for three or more groups.

    • Linear Regression: Used for identifying relationships between continuous variables (straight lines).

    • ANCOVA (Analysis of Covariance): A combination of regression (continuous variables) and ANOVA (categorical variables).

    • Generalised Linear Models (GLMs): Used when the assumption of normality is relaxed, accounting for different error structures such as Poisson for counts or Binomial for proportions.

    • Generalised Additive Models (GAMs): Used when the assumption of linearity is relaxed.

  • Final Stage: Rejecting or failing to reject the null hypothesis ($H_0$) based on the information gathered.

Defining and Constructing Hypotheses

A hypothesis links a scientific question to a statistical model (e.g., T-tests, Chi-squared, or linear models). It aims to determine if observed differences are real or merely due to chance.

  • Falsifiability: Based on the philosophy of Karl Popper, a hypothesis must be falsifiable to be scientifically valid. If a hypothesis cannot be tested with data, it is irrelevant.

    • Example (Invisible Martian Heaters): If an individual claims global warming is caused by "invisible heaters" placed by "invisible Martians," the claim is not falsifiable because no data can prove or disprove the existence of invisible entities. Conversely, testing greenhouse gases like $\mathrm{CO_2}$ is scientific because we can monitor concentrations and perform experiments.

  • Null Hypothesis ($H_0$): The starting assumption that there is no effect, no difference, or no relationship (e.g., the means of two groups are equal).

  • Alternative Hypothesis ($H_1$): Assumes there is an effect, difference, or relationship. This is typically what the researcher hopes to find evidence for and usually contains an inequality ($>$, $<$, or $\neq$).

  • Mathematical Mirroring: If the alternative hypothesis is "greater than" ($>$), the null hypothesis must account for the opposite and equal possibilities ("less than or equal to," $\le$).

Practical Examples of Hypothesis Formulation

  • Question 1: Do different forests have different bird species richness?

    • Hypothesis: Bird species richness differs between forest types (native vs. plantation).

    • $H_0$: $\mu_{\text{native}} = \mu_{\text{plantation}}$

    • $H_1$: $\mu_{\text{native}} \neq \mu_{\text{plantation}}$

    • Note: This is non-directional (a two-tailed consideration).

  • Question 2: Does native forest support higher bird species richness than plantations?

    • Hypothesis: Bird species richness is greater in native forests than in plantations.

    • $H_0$: $\mu_{\text{native}} \le \mu_{\text{plantation}}$

    • $H_1$: $\mu_{\text{native}} > \mu_{\text{plantation}}$

    • Note: This is directional (one-tailed).

  • Question 3: Does water pollution reduce tadpole survival?

    • Hypothesis: Polluted water leads to lower tadpole survival rates.

    • $H_0$: $\text{Survival Rate}_{\text{Polluted}} \ge \text{Survival Rate}_{\text{Clean}}$

    • $H_1$: $\text{Survival Rate}_{\text{Polluted}} < \text{Survival Rate}_{\text{Clean}}$

  • Question 4: Does wind speed affect bee foraging?

    • Hypothesis: Bees spend less time foraging at higher wind speeds.

    • Note: Since wind speed is a continuous variable, this might be tested using a linear regression rather than a T-test.

Statistical Error Framework: Type I and Type II Errors

Statistical decisions are made in a context where the true state of nature ($H_0$ is true or false) is hidden from the researcher.

  • Decision Matrix:

    • True State: $H_0$ is True / Decision: Reject $H_0$: This is a Type I Error ($\alpha$). It is considered a serious error (falsely claiming an effect). The threshold is usually set at $5\%$.

    • True State: $H_0$ is True / Decision: Fail to Reject $H_0$: This is a correct decision ($1 - \alpha$).

    • True State: $H_0$ is False / Decision: Fail to Reject $H_0$: This is a Type II Error ($\beta$). This occurs when an effect exists, but the test fails to detect it.

    • True State: $H_0$ is False / Decision: Reject $H_0$: This is a correct decision, known as Power ($1 - \beta$).
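
The meaning of the Type I error rate can be checked by simulation. The sketch below is illustrative only: it assumes normally distributed data with a known standard deviation of 1 (so a z-test stands in for the T-test), and the function name, group size, and number of simulated experiments are hypothetical choices. Both groups are drawn from the same population, so $H_0$ is true in every simulated experiment, and any rejection is a false positive.

```python
import random
from statistics import NormalDist, mean

def type_i_error_rate(n_experiments=2000, n_per_group=30, alpha=0.05, seed=1):
    """Simulate experiments in which H0 is true and count false rejections."""
    rng = random.Random(seed)
    # Two-tailed critical value of the standard normal (sigma assumed known = 1)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two group means
    se = (1 / n_per_group + 1 / n_per_group) ** 0.5
    rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the same population: no real effect exists
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(a) - mean(b)) / se
        if abs(z) > z_crit:
            rejections += 1  # a Type I error
    return rejections / n_experiments

rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

The observed proportion of false rejections should land close to the chosen $\alpha$, which is what setting the threshold at $5\%$ means in practice.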

Statistical Power and Detectability

Power is the probability of rejecting a false null hypothesis. It depends fundamentally on the ratio of signal to noise: $\text{Detectability} = \frac{\text{Signal (Effect Size)}}{\text{Noise (Variability)}}$

  • Effect Size: The magnitude of the difference between means. A larger effect size is easier to detect. For example, a $20\,\text{cm}$ height difference between populations is easier to find than a $1\,\text{cm}$ difference.

  • Variability: Often represented by the standard error ($\text{SE}$). If the standard deviation of individuals is high, detection is harder.

  • Sample Size ($n$): Power increases monotonically as the number of samples increases because the standard error decreases ($\text{SE} = \frac{s}{\sqrt{n}}$).

  • Influence of Alpha ($\alpha$): There is a trade-off between error types. As $\alpha$ (Type I error rate) is set lower (e.g., from $0.05$ to $0.01$), the probability of a Type II error ($\beta$) increases, and power decreases.

  • Ethical Implications: In studies like environmental impact assessments (e.g., mining impacts on mammal populations), a "poor" study with very few replicates may fail to find a significant impact simply because it lacks the power to do so. This can be exploited unethically to "accept" the null hypothesis of "no impact," even though failing to reject $H_0$ is not evidence that $H_0$ is true.
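
The influence of effect size, variability, sample size, and $\alpha$ on power can be sketched numerically. The function below is a rough illustration, not an exact T-test power calculation: it assumes two groups of equal size, a common standard deviation, and a normal approximation to the test statistic. The name `approx_power` and the example numbers are hypothetical.

```python
from statistics import NormalDist

def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate two-tailed power for comparing two means (normal approximation)."""
    norm = NormalDist()
    # Noise: standard error of the difference between two group means
    se = sd * (2 / n_per_group) ** 0.5
    # Signal-to-noise ratio
    signal_to_noise = effect / se
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic lands in either rejection tail
    return norm.cdf(signal_to_noise - z_crit) + norm.cdf(-signal_to_noise - z_crit)

# Power rises with sample size for a fixed effect and standard deviation
for n in (5, 20, 80):
    print(n, round(approx_power(effect=1.0, sd=2.0, n_per_group=n), 2))

# Lowering alpha from 0.05 to 0.01 reduces power for the same design
print("alpha=0.01:", round(approx_power(1.0, 2.0, 80, alpha=0.01), 2))
```

With very few replicates the approximate power is low, which illustrates why an underpowered study can fail to detect an impact that is really there.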

The Problem of Multiple Testing

When conducting multiple independent tests, the cumulative probability of making at least one Type I error rises quickly with the number of tests.

  • Probability of no Type I errors in one test: $(1 - \alpha)$

  • Probability of no Type I errors in $n$ tests: $(1 - \alpha)^n$

  • Probability of making at least one Type I error: $1 - (1 - \alpha)^n$

  • Scenario: If you perform $50$ tests with $\alpha = 0.05$: $1 - (1 - 0.05)^{50} = 1 - (0.95)^{50} \approx 0.92$

  • Conclusion: With $50$ tests, there is a $92\%$ chance of at least one false positive. This necessitates moving toward complex linear models rather than performing dozens of individual T-tests.
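
The formula above is easy to evaluate for different numbers of tests. The short sketch below assumes the tests are independent, as in the derivation; the function name `familywise_error_rate` is a hypothetical label for the quantity $1 - (1 - \alpha)^n$.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests

# How quickly the chance of at least one false positive grows
for n_tests in (1, 5, 10, 50):
    print(n_tests, round(familywise_error_rate(0.05, n_tests), 3))
```

Even ten tests at $\alpha = 0.05$ push the chance of at least one false positive to roughly 40%.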

The T-Test and the Test Statistic

The T-test utilizes a specific test statistic compared against a T-distribution.

  • Formula Logic: $t = \frac{\text{Estimate} - \text{Value Under } H_0}{\text{Standard Error}}$

  • For two means: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

  • Critical Values: Traditionally, researchers looked up a "critical value" in a T-table based on the Degrees of Freedom ($df$). For a two-group test with pooled variances: $df = (n_1 + n_2) - 2$.

  • Decision Rule: If the calculated $t$-value is more extreme than the critical value (for a two-tailed test, if $|t|$ exceeds it), falling into the "red area" of the distribution tail, the result is statistically significant.

  • Two-tailed vs. One-tailed:

    • Two-tailed: Splits $\alpha = 5\%$ into two tails of $2.5\%$ each. Used when the direction of difference is unknown.

    • One-tailed: Puts all $5\%$ into one tail. Used when there is a clear prior expectation of directionality.
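
The test statistic for two means can be computed by hand in a few lines. The sketch below uses made-up species-richness counts purely for illustration, and the helper name `two_sample_t` is hypothetical. It follows the formula and the $df = (n_1 + n_2) - 2$ rule from these notes; the critical values are standard T-table entries for $df = 8$ at $\alpha = 0.05$.

```python
from statistics import mean, variance

def two_sample_t(x1, x2):
    """Return the t statistic for two means and df = n1 + n2 - 2."""
    n1, n2 = len(x1), len(x2)
    # Standard error of the difference, from the sample variances s1^2 and s2^2
    se = (variance(x1) / n1 + variance(x2) / n2) ** 0.5
    t = (mean(x1) - mean(x2)) / se
    df = n1 + n2 - 2
    return t, df

# Hypothetical bird species richness counts (not real data)
native = [12, 15, 14, 16, 13]
plantation = [9, 11, 10, 12, 8]

t, df = two_sample_t(native, plantation)

# Tabulated critical values for df = 8, alpha = 0.05
T_CRIT_TWO_TAILED = 2.306  # 2.5% in each tail
T_CRIT_ONE_TAILED = 1.860  # all 5% in the upper tail

print(f"t = {t:.2f}, df = {df}")
print("Two-tailed significant:", abs(t) > T_CRIT_TWO_TAILED)
print("One-tailed significant:", t > T_CRIT_ONE_TAILED)
```

The one-tailed critical value is smaller than the two-tailed one, so a difference in the expected direction is easier to declare significant, which is why the direction should be justified before looking at the data.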

Questions & Discussion

Question: If statistical power is $1 - \beta$, how do you calculate $\beta$ or determine it?

Response: Power is usually calculated before an experiment rather than after it. To calculate $\beta$ precisely, one must know the true effect size and the true variation (standard deviation) of the population. Since these values are typically unknown, scientists estimate them from previous work. With an estimated effect size and standard deviation, a chosen $\alpha$ level (usually $0.05$), and a set sample size $n$, one can calculate $\beta$. Functions like power.t.test in R let researchers specify all but one of these quantities and solve for the remaining one (e.g., determine the sample size needed to achieve $80\%$ power for a given effect size).
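
The sample-size version of this calculation can be sketched with a normal approximation. This is not what power.t.test does internally: that function works with the t distribution and will typically return a slightly larger sample size. The function name `approx_n_per_group` and the example effect size and standard deviation are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def approx_n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-tailed, two-group comparison."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = norm.inv_cdf(power)          # quantile matching the target power
    # Normal-approximation formula: n = 2 * ((z_alpha + z_power) * sd / effect)^2
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Estimated effect of 0.5 units with an estimated standard deviation of 1.0
print(approx_n_per_group(effect=0.5, sd=1.0))

# Halving the effect size roughly quadruples the required sample size
print(approx_n_per_group(effect=0.25, sd=1.0))
```

Because the required $n$ scales with the square of the noise-to-signal ratio, small expected effects demand far more replicates than large ones.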