W3 L3 - Null Hypothesis Testing: Key Concepts and Notes

Introduction to Null Hypothesis Testing

The video explains why we use statistical models (inferential statistics) and introduces null hypothesis testing along with four key concepts: p value, statistical significance, effect size, and power.
Null hypothesis testing uses mathematical models to evaluate predictions about data; the process relies on inference from a sample to a population.

Six-Step Process of Null Hypothesis Testing

Step 1: Formulate a hypothesis that embodies our prediction before seeing the data.
- Example hypothesis: Those who sleep seven hours or more per night will show better memory performance than those who sleep less than seven hours.
- The hypothesis is formed after reviewing the literature and before data collection.
Step 2: Specify the null and the alternative (experimental) hypotheses.
- Alternative hypothesis (H1): those who sleep seven hours or more per night will show better memory performance than those who sleep less than seven hours.
- Null hypothesis (H0): there is no difference in memory performance as a function of sleep duration.
- In modern reporting style, we typically state the experimental hypothesis and infer the null hypothesis.
Step 3: Collect data relevant to the hypothesis.
- Two groups: <7 hours and ≥7 hours; n = 20 in each group.
- Descriptive statistics given: mean memory performance for <7 hours = 5.1; ≥7 hours = 6.7.
- Memory performance is scored on a 0–10 scale; the descriptive statistics shown also include each group's minimum, maximum, and standard deviation.
Step 4: Fit a model to the data that represents the alternative hypothesis and compute a statistical test.
- In the example, an independent samples t test is used to test the difference between the two sleep-duration groups.
- In short, the fourth step is to fit the model that tests the hypothesis and compute the test statistic.
Step 5: Compute the probability of obtaining a value of that statistic at least as extreme as the one observed, assuming the null hypothesis is true.
- This probability is the p value.
Step 6: Assess the statistical significance.
- The p value is used to determine significance; in the example, p = 0.007.
- In typical practice, a p value less than 0.05 is considered statistically significant.
Additional notes on practice:
- Modern reporting practice emphasizes stating the experimental hypothesis; the null is inferred.
- The course will go into p values in more depth later.
- The example uses an independent samples t test; the exact statistics are not the focus at this stage, just the process.
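The steps above can be sketched in code. This is a minimal illustration using only the Python standard library, not the analysis from the video: the scores are invented, the two-sided p value is an assumption (the notes do not say whether the test was one- or two-sided), and the helper names `t_pdf`, `two_sided_p`, and `independent_t_test` are hypothetical. The p value is obtained by numerically integrating the t density with Simpson's rule.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=20000):
    """P(|T| >= |t|) under H0, via Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def independent_t_test(x, y):
    """Step 4: pooled-variance independent samples t test of mean(y) - mean(x)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / df
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    t = (statistics.mean(y) - statistics.mean(x)) / se
    return t, df, two_sided_p(t, df)  # Step 5: p value under H0


# Step 3: invented scores on the 0-10 scale (NOT the data from the video).
short_sleep = [4, 5, 6, 5, 3, 6, 5, 4, 6, 5]
long_sleep = [6, 7, 5, 8, 6, 7, 6, 8, 7, 6]

t, df, p = independent_t_test(short_sleep, long_sleep)
alpha = 0.05  # chosen before looking at the data
# Step 6: compare the p value with alpha.
print(f"t({df}) = {t:.2f}, p = {p:.3f}, significant: {p < alpha}")
```

In practice a statistics package would compute the same quantities; the point here is only to show where each of the six steps appears.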

The P-Value: Definition and Common Misconceptions

Definition (conceptual): Under the assumption that the null hypothesis is true, the p value is the probability of obtaining a result at least as extreme as the one observed. A low p value therefore indicates that the observed data would be unlikely if the null hypothesis were true, which is taken as evidence against the null hypothesis and in favor of the alternative.
The p value is a probability about the sample, not about the truth of the null hypothesis or about the world:
- It is not the probability that the null hypothesis is true.
- It is not the probability that you are making the wrong decision.
- It is not the probability that if you ran the study again you would obtain the same result (replicability).
- It does not reflect the size of the effect.
- It is an abstract concept used as a decision tool for statistical significance.
The p value is tied to statistical significance:
- In psychology, dichotomous decisions are common: significant vs not significant.
- This dichotomy is relatively arbitrary; avoid interpreting p values as more than a signal of whether the observed data are unlikely under H0.
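To make the definition concrete, the sketch below estimates a p value by brute force: it repeatedly shuffles the group labels (which is what "the null hypothesis is true" means here) and counts how often the shuffled mean difference is at least as extreme as the observed one. This is a permutation test, a different technique from the t test used in the video, chosen only because it makes the definition visible; the function name and the data in the usage lines are hypothetical.

```python
import random


def permutation_p_value(x, y, reps=5000, seed=0):
    """Estimate P(|mean difference| >= observed | H0) by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(sum(y) / len(y) - sum(x) / len(x))
    pooled = list(x) + list(y)
    n_x = len(x)
    as_extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # under H0 the group labels are interchangeable
        first, second = pooled[:n_x], pooled[n_x:]
        diff = abs(sum(second) / len(second) - sum(first) / len(first))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            as_extreme += 1
    # Counting the observed arrangement itself keeps the estimate above 0.
    return (as_extreme + 1) / (reps + 1)


# Invented scores, for illustration only.
print(permutation_p_value([4, 5, 6, 5, 3], [6, 7, 5, 8, 6]))
```

Note that the result describes how unusual the data are under the null; nothing in the calculation refers to the probability that the null itself is true.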
Reporting and interpretation guidelines from the video:
- Always report the actual p value (to three decimals, e.g., p = 0.123).
- Do not report p values as "p < 0.05" without the exact value.
- Never report a p value of exactly 0; if a software outputs 0.000, report it as p < 0.001.
- The alpha level (e.g., 0.05) is predefined and compared to the p value to determine significance.
- Do not use terms like "nearly significant" or "trending"; these terms are discouraged.
- The alpha value and the p value are related but not the same: alpha is the threshold for deciding significance, set before analysis (e.g., $\alpha = 0.05$); if $p < \alpha$, the result is deemed statistically significant, and otherwise it is not.
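The reporting rules above are mechanical enough to write down as a small helper. This is a sketch with hypothetical function names (`format_p`, `is_significant`); it follows the notes' own formatting (three decimals, leading zero kept).

```python
def format_p(p):
    """Format a p value to three decimals; never report it as exactly 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p value must lie between 0 and 1")
    if p < 0.001:
        return "p < 0.001"  # covers software output such as 0.000
    return f"p = {p:.3f}"


def is_significant(p, alpha=0.05):
    """Decision rule: significant only if p is below the pre-set alpha."""
    return p < alpha


print(format_p(0.007), is_significant(0.007))
print(format_p(0.0), is_significant(0.0))
```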

Alpha, Significance Threshold, and Error Types

Alpha (significance level): the probability of a false positive if the null is true. In formula: $\alpha = P(\text{Reject } H_0 \mid H_0 \text{ true})$.
Common practice: $\alpha = 0.05$, though contexts may use different values.
Type I error (false positive): rejecting the null when it is true.
- Example: concluding that sleeping seven hours or more leads to better memory when there is no true difference.
Type II error (false negative): failing to reject the null when it is false.
- Example: concluding there is no difference when there actually is a difference.
Power: the probability of correctly rejecting the null when the alternative is true.
- Definition: $\text{Power} = 1 - \beta$, where $\beta = P(\text{Fail to reject } H_0 \mid H_1 \text{ true})$.
Relationship between alpha and power:
- Decreasing $\alpha$ lowers the probability of a Type I error but also lowers power (more false negatives).
- Increasing $\alpha$ increases power but raises the risk of Type I errors.
Power analyses (prospective): an important ethical consideration, because they help ensure a study is capable of detecting the expected effects.
Practical note: setting alpha, considering effect sizes, and planning power are part of the "chef-like" recipe for statistical analysis.
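These error rates can be checked by simulation. The sketch below runs many imaginary studies and records how often $p < \alpha$: when the true difference is zero, that fraction estimates the Type I error rate; when the true difference is nonzero, it estimates power. To stay within the standard library it uses a z test with a known standard deviation of 1 rather than the t test from the video, and all names and settings (`rejection_rate`, 2000 simulated studies, a true difference of 1 SD) are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean


def z_test_p(x, y, sigma=1.0):
    """Two-sided p value for a difference in means when the SD (sigma) is known."""
    se = sigma * math.sqrt(1 / len(x) + 1 / len(y))
    z = (fmean(y) - fmean(x)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_diff, n=20, alpha=0.05, sims=2000, seed=1):
    """Fraction of simulated two-group studies in which p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        if z_test_p(x, y) < alpha:
            rejections += 1
    return rejections / sims


type_i_rate = rejection_rate(true_diff=0.0)           # H0 true: should be near alpha
power_05 = rejection_rate(true_diff=1.0, alpha=0.05)  # H0 false: estimated power
power_01 = rejection_rate(true_diff=1.0, alpha=0.01)  # stricter alpha, same simulated data
print(type_i_rate, power_05, power_01)
```

Because the same seed is reused, the last two lines analyze identical simulated studies, so the stricter alpha can only reject as often or less often: the trade-off between Type I errors and power described above.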

Effect Size: Magnitude of the Effect

Effect size measures the strength of the relationship or difference, and it is often as important as, or more important than, the p value.
Why it matters:
- Large samples can yield statistically significant but practically tiny effects; thus reporting effect size helps assess practical significance.
- Policy and intervention decisions depend on the magnitude of the effect, not just statistical significance.
Common measures:
- For comparing two groups (mean difference): Cohen's d
- For regression models: R-squared ($R^2 = \frac{SS_{\text{reg}}}{SS_{\text{tot}}}$) or other effect size metrics depending on the model.
Example scale-based differences:
- Large effect: sleeping more yields an average increase of about 2 points on the 0–10 memory scale.
- Small effect: sleeping more yields an average increase of about 0.5 points.
In practice: report both p values and effect sizes (e.g., $d$, $R^2$) to inform policy or practical implications.
Note: Effect size and p value do not always move together; large samples can produce significant p values for small effects.
The course will introduce several effect-size measures (e.g., Cohen's d and $R^2$) depending on the model used.
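The two measures named above are short to compute. The sketch below uses the common pooled-standard-deviation form of Cohen's d, $d = (\bar{y} - \bar{x}) / s_{\text{pooled}}$, and the $SS_{\text{reg}} / SS_{\text{tot}}$ form of $R^2$; the notes do not specify which variant of d the course uses, so treat that choice, the function names, and the data as assumptions.

```python
import math
import statistics


def cohens_d(x, y):
    """Standardized mean difference (y minus x) using the pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(y) - statistics.mean(x)) / math.sqrt(pooled_var)


def r_squared(observed, fitted):
    """SS_reg / SS_tot; assumes the fitted values come from least squares with an intercept."""
    mean_obs = statistics.mean(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_reg = sum((f - mean_obs) ** 2 for f in fitted)
    return ss_reg / ss_tot


# Invented scores, for illustration only.
print(cohens_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))
```

Unlike the p value, neither quantity depends directly on sample size, which is why effect sizes are reported alongside p values.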

Statistical Power and Experimental Design

Power is the probability of detecting an effect when one exists: $\text{Power} = 1 - \beta$ .
Three primary factors affect power:
- Sample size: Larger samples provide greater statistical power.
- Effect size: Larger true effects are easier to detect; designs have more power for large effects than for small ones.
- Type I error rate (alpha): Decreasing $\alpha$ (e.g., from 0.05 to 0.01) decreases power if other factors remain constant; increasing $\alpha$ increases power but raises Type I error risk.
Power analyses are conducted prospectively to ensure there is sufficient power to detect the intended effects.
Ethical considerations: researchers should plan for adequate power to avoid wasted resources and false conclusions.
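A prospective power analysis combines the three factors above. The sketch below uses a normal approximation for a two-sided comparison of two equal-sized groups, with the effect expressed as a standardized difference d; dedicated power software based on the t distribution gives slightly larger sample sizes. The function names, the 80% power target, and the example value d = 0.5 are assumptions, not values from the video.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)  # expected z statistic if the true effect is d
    # Probability of landing beyond either critical value.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_for(d, power=0.80, alpha=0.05):
    """Approximate n per group for a target power (closed form, ignoring the far tail)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / d) ** 2)


n_needed = n_per_group_for(0.5)
print(n_needed, approx_power(0.5, n_needed))
```

Setting d to zero in `approx_power` returns alpha itself: with no true effect, the only way to reject is a false positive.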

Practical Considerations and Takeaways

The process can be thought of as following key ingredients in a recipe:
- Use null hypothesis testing methods with a pre-set alpha value.
- Consider effect size alongside the p value.
- Be mindful of Type I and Type II errors and plan the study's power to detect meaningful effects.
- Report p values precisely (to three decimals) and provide effect-size measures to inform interpretation and policy.
- Avoid misleading phrasing like "nearly significant" or "trends"; report exact p values and their context.
The overarching goal is to balance statistical decision making with practical significance, research ethics, and transparent reporting.

Example Recap from the Video

Hypothesis: Those who sleep seven hours or more per night will show better memory performance than those who sleep less than seven hours.
Null hypothesis: There is no difference in memory performance as a function of sleep duration.
Data: Two groups, n = 20 each; Mean memory scores: <7 hours = 5.1, ≥7 hours = 6.7; Memory scale 0–10; Descriptive stats shown (min, max, SD).
Statistical test: Independent samples t-test (as an example in the course).
Result: p-value = 0.007; Interpreted as statistically significant at the conventional threshold $\alpha = 0.05$ .
Emphasis on p-value interpretation: the p value is not the probability the null is true, nor the size of the effect; it is a tool for assessing significance under the null.
Emphasis on reporting practice: provide the exact p value, avoid phrases like "significant at p < 0.05" without the exact value, and report effect sizes as well.

Notation and Formulas to Remember

P-value concept (general): $p = P\left( \text{Test statistic at least as extreme as observed} \mid H_0 \right)$. The p value is the probability of observing a test statistic at least as extreme as the one calculated from the sample, assuming that the null hypothesis ($H_0$) is true.
Alpha level (predefined): $\alpha = 0.05\,\text{(typical)}$
Decision rule: reject $H_0$ if $p < \alpha$; otherwise, fail to reject $H_0$.
Type I error (false positive): This occurs when we reject the null hypothesis when it is actually true, leading us to incorrectly conclude that there is an effect or difference when in fact there is none.
Type II error (false negative): This occurs when we fail to reject the null hypothesis when it is actually false, resulting in the incorrect conclusion that there is no effect or difference when there truly is one. To minimize these errors, researchers must carefully design their studies and select appropriate sample sizes, ensuring that their findings are both reliable and valid.
Power of the test: $\text{Power} = 1 - \beta$, the probability of correctly rejecting the null hypothesis when it is false.
Important reminder: p values do not convey the probability that the null is true or the magnitude of the effect; they indicate how incompatible the data are with the null under repeated sampling.