Lecture 3: Exhaustive Notes on Hypothesis Testing and Statistical Power

The Research Cycle and Statistical Modeling

The research process follows a cyclical trajectory, moving from vague curiosity to formal data analysis and conclusion. While this cycle is often idealised, it serves as the foundational structure for scientific inquiry.

Initial Stage: Developing a research question. This can range from a vague idea to a well-formulated query.
Hypothesis Generation: Translating research questions into clear, testable hypotheses as a prerequisite for data collection.
Data Collection and Descriptive Statistics: Identifying whether data will come from a laboratory experiment, fieldwork, or a computational model. Initial summaries typically involve calculating the mean ( $\mu$ ) and variance ( $\sigma^2$ ).
Analysis Hierarchy:
- T-test: Used for comparing the means of two groups defined by a categorical variable.
- ANOVA (Analysis of Variance): The extension of a T-test for three or more groups.
- Linear Regression: Used for identifying relationships between continuous variables (straight lines).
- ANCOVA (Analysis of Covariance): A combination of regression (continuous variables) and ANOVA (categorical variables).
- Generalised Linear Models (GLMs): Used when the assumption of normality is relaxed, accounting for different error structures such as Poisson for counts or Binomial for proportions.
- Generalised Additive Models (GAMs): Used when the assumption of linearity is relaxed.
Final Stage: Rejecting or failing to reject the null hypothesis ( $H_0$ ) based on the information gathered.

Defining and Constructing Hypotheses

A hypothesis links a scientific question to a statistical model (e.g., T-tests, Chi-squared, or linear models). It aims to determine if observed differences are real or merely due to chance.

Falsifiability: Based on the philosophy of Karl Popper, a hypothesis must be falsifiable to be scientifically valid. If a hypothesis cannot be tested with data, it is irrelevant.
- Example (Invisible Martian Heaters): If an individual claims global warming is caused by "invisible heaters" placed by "invisible Martians," the claim is not falsifiable because no data can prove or disprove the existence of invisible entities. Conversely, testing greenhouse gases like $CO_2$ is scientific because we can monitor concentrations and perform experiments.
Null Hypothesis ( $H_0$ ): The starting assumption that there is no effect, no difference, or no relationship (e.g., the means of two groups are equal).
Alternative Hypothesis ( $H_1$ ): Assumes there is an effect, difference, or relationship. This is typically what the researcher hopes to find evidence for and usually contains an inequality (>, <, or $\neq$ ).
Mathematical Mirroring: If the alternative hypothesis is "greater than" (>), the null hypothesis must account for the opposite and equal possibilities ("less than or equal to," $\le$ ).

Practical Examples of Hypothesis Formulation

Question 1: Do different forests have different bird species richness?
- Hypothesis: Bird species richness differs between forest types (native vs. plantation).
- $H_0$ : $\mu_{\text{native}} = \mu_{\text{plantation}}$
- $H_1$ : $\mu_{\text{native}} \neq \mu_{\text{plantation}}$
- Note: This is non-directional (a two-tailed consideration).
Question 2: Does native forest support higher bird species richness than plantations?
- Hypothesis: Bird species richness is greater in native forests than in plantations.
- $H_0$ : $\mu_{\text{native}} \le \mu_{\text{plantation}}$
- $H_1$ : $\mu_{\text{native}} > \mu_{\text{plantation}}$
- Note: This is directional (one-tailed).
Question 3: Does water pollution reduce tadpole survival?
- Hypothesis: Polluted water leads to lower tadpole survival rates.
- $H_0$ : $\text{Survival Rate}_{\text{Polluted}} \ge \text{Survival Rate}_{\text{Clean}}$
- $H_1$ : $\text{Survival Rate}_{\text{Polluted}} < \text{Survival Rate}_{\text{Clean}}$
Question 4: Does wind speed affect bird foraging?
- Hypothesis: Birds spend less time foraging at higher wind speeds.
- Note: Since wind speed is a continuous variable, this might be tested using a linear regression rather than a T-test.
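
One possible way to write the hypotheses for Question 4, assuming a simple linear regression of foraging time on wind speed. The symbols $\beta_0$ , $\beta_1$ and $\varepsilon_i$ are introduced here for illustration and are not from the lecture. The directional hypothesis then becomes a statement about the slope:

```latex
\text{Foraging time}_i = \beta_0 + \beta_1 \, \text{Wind speed}_i + \varepsilon_i

H_0 : \beta_1 \ge 0
\qquad
H_1 : \beta_1 < 0
```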

Statistical Error Framework: Type I and Type II Errors

Statistical decisions are made in a context where the true state of nature ( $H_0$ is true or false) is hidden from the researcher.

Decision Matrix:
- True State: $H_0$ is True / Decision: Reject $H_0$ : This is a Type I Error, which occurs with probability $\alpha$ . It is considered a serious error (falsely claiming an effect). The acceptable rate is usually set at $5\%$ .
- True State: $H_0$ is True / Decision: Fail to Reject $H_0$ : This is a correct decision, with probability $1 - \alpha$ .
- True State: $H_0$ is False / Decision: Fail to Reject $H_0$ : This is a Type II Error, which occurs with probability $\beta$ . An effect exists, but the test fails to detect it.
- True State: $H_0$ is False / Decision: Reject $H_0$ : This is a correct decision; its probability is the Power of the test ( $1 - \beta$ ).
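
The four cells of the decision matrix can be explored by simulation. The sketch below is a minimal illustration, not part of the lecture. It assumes normally distributed data with a known standard deviation, so it uses a two-sided z-test rather than a T-test. The function name `rejection_rate` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean

def rejection_rate(true_diff, n=20, sigma=1.0, alpha=0.05, trials=4000, seed=1):
    """Fraction of simulated experiments in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of a difference between two group means (sigma assumed known)
    se = sigma * (2 / n) ** 0.5
    rejections = 0
    for _ in range(trials):
        group_a = [rng.gauss(0.0, sigma) for _ in range(n)]
        group_b = [rng.gauss(true_diff, sigma) for _ in range(n)]
        z = (mean(group_b) - mean(group_a)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# H0 true (no difference): every rejection is a Type I error
type_i_rate = rejection_rate(true_diff=0.0)
# H0 false (true difference of 1.0): every rejection is a correct decision
power = rejection_rate(true_diff=1.0)
type_ii_rate = 1 - power

print(f"Type I rate ~ {type_i_rate:.3f}")
print(f"Power ~ {power:.3f}, Type II rate ~ {type_ii_rate:.3f}")
```

With $H_0$ true, the rejection rate should sit close to the chosen $\alpha$ ; with $H_0$ false, the rejection rate estimates the power and its complement estimates $\beta$ .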

Statistical Power and Detectability

Power is the probability of rejecting a false null hypothesis. Detectability is fundamentally a ratio of signal to noise: $\text{Detectability} = \frac{\text{Signal (Effect Size)}}{\text{Noise (Variability)}}$

Effect Size: The magnitude of the difference between means. A larger effect size is easier to detect. For example, a $20\,\text{cm}$ height difference between populations is easier to find than a $1\,\text{cm}$ difference.
Variability: Often represented by the standard error ( $\text{SE}$ ). If the standard deviation of individuals is high, detection is harder.
Sample Size ( $n$ ): Power increases monotonically as the number of samples increases because the standard error decreases ( $\text{SE} = \frac{s}{\sqrt{n}}$ ).
Influence of Alpha ( $\alpha$ ): There is a trade-off between error types. As $\alpha$ (Type I error rate) is set lower (e.g., from $0.05$ to $0.01$ ), the probability of a Type II error ( $\beta$ ) increases, and power decreases.
Ethical Implications: In studies like environmental impact assessments (e.g., mining impacts on mammal populations), a "poor" study with very few replicates may fail to find a significant impact simply because it lacks the power to do so. Such a result can be misused to "accept" the null hypothesis of "no impact," even though failing to reject $H_0$ is not evidence that no impact exists.
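
How effect size, variability, sample size and $\alpha$ interact can be sketched numerically. The function below is a rough illustration under a normal approximation (two equal-sized groups, common standard deviation, two-sided test); it is not the exact T-based calculation. The name `approx_power` and the input values are hypothetical.

```python
from statistics import NormalDist

def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two means (normal approximation)."""
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5   # noise: standard error of the difference
    signal_to_noise = effect / se        # signal divided by noise
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic lands in either rejection region
    return z.cdf(signal_to_noise - z_crit) + z.cdf(-signal_to_noise - z_crit)

# Power rises with sample size for a fixed effect and standard deviation
for n in (5, 20, 80):
    print(n, round(approx_power(effect=1.0, sd=2.0, n_per_group=n), 3))

# Lowering alpha from 0.05 to 0.01 lowers power for the same design
print(round(approx_power(1.0, 2.0, 20, alpha=0.05), 3))
print(round(approx_power(1.0, 2.0, 20, alpha=0.01), 3))
```

When the effect is zero, the function returns $\alpha$ itself, which matches the Type I error rate in the decision matrix.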

The Problem of Multiple Testing

When conducting multiple independent tests, the cumulative probability of making at least one Type I error increases significantly.

Probability of no Type I errors in one test: $(1 - \alpha)$
Probability of no Type I errors in $n$ tests: $(1 - \alpha)^n$
Probability of making at least one Type I error: $1 - (1 - \alpha)^n$
Scenario: If you perform $50$ tests with $\alpha = 0.05$ : $1 - (1 - 0.05)^{50} = 1 - (0.95)^{50} \approx 0.92$
Conclusion: With $50$ tests, there is a $92\%$ chance of at least one false positive. This necessitates moving toward complex linear models rather than performing dozens of individual T-tests.
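
The formula above can be checked with a few lines of Python. This is a minimal sketch that assumes the tests are independent, as the formula does; the function name `familywise_error_rate` is chosen here for illustration.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

# The chance of at least one false positive grows quickly with the number of tests
for n_tests in (1, 5, 10, 50):
    print(n_tests, round(familywise_error_rate(n_tests), 3))
```

For $50$ tests the function reproduces the value of roughly $0.92$ worked out in the scenario.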

The T-Test and the Test Statistic

The T-test utilizes a specific test statistic compared against a T-distribution.

Formula Logic: $t = \frac{\text{Estimate} - \text{Value Under } H_0}{\text{Standard Error}}$
For two means: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
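Critical Values: Traditionally, researchers looked up a "critical value" in a T-table based on the Degrees of Freedom ( $df$ ). For a two-group test with pooled variances: $df = (n_1 + n_2) - 2$ . (The unpooled standard error shown for two means is the form used in Welch's test, which approximates $df$ differently.)
Decision Rule: If the absolute value of the calculated $t$ exceeds the critical value (so that $t$ falls into the "red area" of a distribution tail), the result is statistically significant. For a one-tailed test, only the tail in the hypothesized direction counts.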
Two-tailed vs. One-tailed:
- Two-tailed: Splits $\alpha = 5\%$ into two tails of $2.5\%$ each. Used when the direction of difference is unknown.
- One-tailed: Puts all $5\%$ into one tail. Used when there is a clear prior expectation of directionality.
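
The test statistic and the two kinds of tail area can be sketched with the Python standard library alone. The example uses the two-means formula and $df = (n_1 + n_2) - 2$ as given in these notes. The species-richness counts are made-up numbers for illustration, and the tail area is obtained by numerically integrating the T-distribution density with Simpson's rule, a stand-in for the T-table or a statistics package. All function names are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(sample_1, sample_2):
    """t for a difference of two means, using the two-means formula from the notes."""
    n1, n2 = len(sample_1), len(sample_2)
    se = math.sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    return (mean(sample_1) - mean(sample_2)) / se

def t_pdf(x, df):
    """Density of the T-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef) * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail_area(t, df, steps=2000):
    """P(T > t), via Simpson's rule on the density between 0 and t (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Hypothetical species-richness counts per site (made-up data, illustration only)
native = [14, 17, 15, 19, 16, 18]
plantation = [12, 13, 15, 11, 14, 13]

t = t_statistic(native, plantation)
df = len(native) + len(plantation) - 2

p_two_tailed = 2 * upper_tail_area(abs(t), df)  # H1: the means differ
p_one_tailed = upper_tail_area(t, df)           # H1: native > plantation

print(f"t = {t:.3f}, df = {df}")
print(f"two-tailed p = {p_two_tailed:.4f}, one-tailed p = {p_one_tailed:.4f}")
```

Because the observed difference lies in the hypothesized direction, the one-tailed p-value is exactly half the two-tailed one, which mirrors putting all of $\alpha$ into a single tail.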

Questions & Discussion

Question: If statistical power is $1 - \beta$ , how do you calculate $\beta$ or determine it?

Response: Usually, power is not calculated after an experiment, but before it. To calculate $\beta$ precisely, one must know the true effect size and the true variation (standard deviation) of the population. Since these values are typically unknown, scientists estimate them based on previous work. With an estimated effect size, a chosen $\alpha$ level (usually $0.05$ ), and a set sample size $n$ , one can calculate $\beta$ . Functions like power.t.test in R allow researchers to fix three variables and solve for the fourth (e.g., determine the sample size needed to achieve $80\%$ power for a given effect size).
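
As a rough Python analogue of the sample-size use case described in the response, the sketch below solves for the number of samples per group under a normal approximation. It is not a reimplementation of power.t.test: the exact T-based calculation tends to give a slightly larger $n$ . The function name `approx_n_per_group` and the effect sizes used are assumptions for illustration.

```python
import math
from statistics import NormalDist

def approx_n_per_group(effect, sd, power=0.80, alpha=0.05):
    """Approximate samples per group for a two-sided comparison of two means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile matching the target power
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)                 # round up to a whole number of samples

# Estimated effect size and standard deviation would come from previous work
print(approx_n_per_group(effect=1.0, sd=1.0))  # effect equal to one standard deviation
print(approx_n_per_group(effect=0.5, sd=1.0))  # halving the effect roughly quadruples n
```

The required sample size scales with the square of the noise-to-signal ratio, which is why small effects demand much larger studies.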