Two-Group Designs: Model Comparison Approach (Lecture Notes)

Overview of Two-Group Designs

Purpose: compare two independent groups to see if membership in a group causes differences in a dependent variable (DV).
Between-subjects design: each subject belongs to one group (mutually exclusive groups), not both.
Representing populations: each group represents a population; samples drawn from two populations.
Benefit over single-group designs: allows estimating and comparing two means, enables broader research questions, and, when feasible, random assignment improves internal validity.
When random assignment is not possible: the design is still useful, but causal inference is limited.
Common path after two groups: extend to three or more groups via ANOVA.

Hypotheses and Models in a Two-Group Design

Research question: does group membership cause differences in the DV?
Null hypothesis (H0): the two population means are equal (no effect of group membership).
Alternative hypothesis (H1): the two population means are different (two-tailed by default in social sciences).
Directional hypotheses may lead to one-tailed tests in some contexts, but many fields default to two-tailed tests.
Two competing models for analysis:
- Restricted model (null model): assumes one common mean for both groups (no group effect).
- Full model (alternative model): allows two distinct group means.
Model comparison approach (ANOVA framework): test whether allowing two distinct means significantly improves model fit over a single common mean.

Key Notation and Concepts

Grand mean (under the restricted model): $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, denoted $\mu_0$ in the lecture notes.
Group means under the full model: $\mu_1$ and $\mu_2$ (often denoted $\mu_j$ with $j = 1, 2$).
Predicted values (y-hats):
- Restricted model: $\hat{y}^{(R)}_i = \mu_0$ for all $i$, where $\mu_0 = \bar{y}$ is the grand mean.
- Full model: $\hat{y}^{(F)}_i = \mu_j$ if observation $i$ is in group $j$ ($j = 1, 2$).
Sums of squared residuals:
- Restricted: $E_R = \sum_{i=1}^{n} (y_i - \mu_0)^2$
- Full: $E_F = \sum_{i=1}^{n} (y_i - \hat{y}^{(F)}_i)^2$
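As a minimal sketch of these definitions, the Python below computes the predicted values and both sums of squared residuals for a small made-up data set. The scores and the names `group_1`, `group_2`, `mean`, and `sse` are illustrative assumptions, not data or code from the lecture.

```python
# Hypothetical scores for two independent groups (illustrative only).
group_1 = [4.0, 5.0, 6.0]
group_2 = [7.0, 8.0, 9.0]
all_scores = group_1 + group_2


def mean(values):
    return sum(values) / len(values)


def sse(values, predicted):
    """Sum of squared residuals around a single predicted value."""
    return sum((y - predicted) ** 2 for y in values)


# Restricted model: every observation is predicted by the grand mean.
mu_0 = mean(all_scores)

# Full model: each observation is predicted by its own group mean.
mu_1, mu_2 = mean(group_1), mean(group_2)

e_r = sse(all_scores, mu_0)
e_f = sse(group_1, mu_1) + sse(group_2, mu_2)

print("grand mean:", mu_0, "group means:", mu_1, mu_2)
print("E_R:", e_r, "E_F:", e_f)
```

The full model can never fit worse than the restricted one, so `e_f` is at most `e_r`; the question the F-test asks is whether the reduction is large enough to justify the extra mean.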
Degrees of freedom:
- Restricted model: $df_R = n - 1$
- Full model: $df_F = n - 2$
F statistic for model comparison:
$F = \frac{(E_R - E_F) / (df_R - df_F)}{E_F / df_F}$
Note: here $df_R - df_F = 1$ for the two-group case, so equivalently $F = \frac{E_R - E_F}{E_F / df_F}$.
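The formula translates directly into a small helper. This is a sketch only: the function name `f_statistic` and the input values are hypothetical, chosen to show the arithmetic rather than taken from the lecture.

```python
def f_statistic(e_r, e_f, df_r, df_f):
    """Model-comparison F: reduction in error per extra parameter,
    divided by the error variance estimate from the full model."""
    numerator = (e_r - e_f) / (df_r - df_f)
    denominator = e_f / df_f
    return numerator / denominator


# Hypothetical summary values for n = 20 observations in two groups.
n = 20
f_value = f_statistic(e_r=90.0, e_f=60.0, df_r=n - 1, df_f=n - 2)
print(round(f_value, 3))
```

Writing the numerator with an explicit `df_r - df_f` keeps the helper usable when the course later moves to three or more groups, where the difference is no longer 1.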
Relation to t tests:
- The t statistic from a two-group independent-means test is related to F by $F = t^2$ when comparing two groups with the same model framework.
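The $F = t^2$ relation can be checked numerically. The sketch below uses made-up scores and assumes the pooled-variance (equal variances) independent-samples t test, which is the version that matches the model-comparison F exactly; all variable names are illustrative.

```python
import math

# Hypothetical scores (illustrative only).
group_1 = [12.0, 15.0, 11.0, 14.0]
group_2 = [9.0, 10.0, 8.0, 13.0]

n1, n2 = len(group_1), len(group_2)
m1, m2 = sum(group_1) / n1, sum(group_2) / n2
ss1 = sum((y - m1) ** 2 for y in group_1)
ss2 = sum((y - m2) ** 2 for y in group_2)

# Pooled-variance t statistic.
pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
t_value = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Model-comparison F on the same data.
scores = group_1 + group_2
grand_mean = sum(scores) / len(scores)
e_r = sum((y - grand_mean) ** 2 for y in scores)
e_f = ss1 + ss2
f_value = (e_r - e_f) / (e_f / (n1 + n2 - 2))

print(round(t_value ** 2, 6), round(f_value, 6))
```

Both printed numbers agree, because the full-model error variance $E_F / df_F$ is the same quantity as the pooled variance in the t test.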

How to interpret the F-test in this design

If H0 is true, restricting to one mean yields similar residuals to the unrestricted full model; the SSEs (sum of squared errors) are similar, and the restricted model is preferred for parsimony.
If H0 is false, the full model substantially reduces residuals (better fit) compared to the restricted model; the F statistic will be large, leading to rejection of H0.
p-value interpretation: the probability of observing an F at least as large as the computed one, assuming H0 is true.
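To make the p-value concrete, here is a rough standard-library sketch for the two-group case. It relies on the fact that an F with 1 numerator degree of freedom is the square of a t, so the upper-tail F probability equals the two-tailed t probability, and it integrates the t density numerically with Simpson's rule. The function names are hypothetical; in practice a statistics package would normally supply this value.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def p_value_f_1_df(f_value, df_f, steps=2000):
    """P(F >= f_value) for an F distribution with (1, df_f) degrees of freedom.

    Computed as P(|T| >= sqrt(f_value)) using Simpson's rule; steps must be even.
    """
    upper = math.sqrt(f_value)
    h = upper / steps
    total = t_pdf(0.0, df_f) + t_pdf(upper, df_f)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df_f)
    central_area = 2 * (h / 3) * total  # P(|T| <= sqrt(f_value))
    return 1 - central_area


# Example: F = 6.0 on (1, 4) degrees of freedom.
p = p_value_f_1_df(6.0, 4)
print(round(p, 4))
```

For these inputs the result falls a little above 0.05, so H0 would not be rejected at the conventional 5% level.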

Practical example: ADHD IQ study (illustrative numbers)

Setup: n = 12 total (6 ADHD, 6 control);
- Restricted model: assume one common mean for all 12 observations.
- Full model: allow two group means, one for ADHD and one for control.
Given means: ADHD group mean $\mu_1 = 104$; control group mean $\mu_2 = 98$.
Grand mean (restricted model): $\mu_0 = \bar{y} = 101$.
Residual sums of squares (example values):
- Restricted SSE: $E_R = 116$
- Full SSE: $E_F = 8$
Degrees of freedom:
- Restricted: $df_R = n - 1 = 11$
- Full: $df_F = n - 2 = 10$
F statistic:
- Numerator: $E_R - E_F = 108$
- Divide by df difference (1): 108
- Denominator: $E_F / df_F = 8 / 10 = 0.8$
- $F = \frac{108}{0.8} = 135$
Result: $F(1, 10) = 135.0$, $p < 0.001$ (significant well beyond the 0.05 level).
Interpretation: strong evidence that ADHD and control groups have different means on IQ; the two-group model provides a substantially better fit than the single common mean model.
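The notes give only summary values for this example. As a sketch, the raw scores below are invented so that they reproduce the stated group means, $E_R = 116$, and $E_F = 8$; they are not the data from the lecture.

```python
# Hypothetical IQ scores constructed to match the summary values in the notes.
adhd = [103, 103, 104, 104, 105, 105]    # mean 104
control = [97, 97, 98, 98, 99, 99]       # mean 98
scores = adhd + control

n = len(scores)
mu_0 = sum(scores) / n
mu_1 = sum(adhd) / len(adhd)
mu_2 = sum(control) / len(control)

e_r = sum((y - mu_0) ** 2 for y in scores)
e_f = sum((y - mu_1) ** 2 for y in adhd) + sum((y - mu_2) ** 2 for y in control)

df_r, df_f = n - 1, n - 2
f_value = ((e_r - e_f) / (df_r - df_f)) / (e_f / df_f)

print("means:", mu_1, mu_2, "grand mean:", mu_0)
print("E_R:", e_r, "E_F:", e_f)
print(f"F({df_r - df_f}, {df_f}) = {f_value:.1f}")
```

The scores within each group sit within one point of their group mean, which is why $E_F$ is so small relative to $E_R$ and the F ratio is so large.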

Practical example: Sleep study (sex differences)

Setup: six individuals (n = 6); DV = hours of sleep; groups defined by sex: male (n=3) and female (n=3).
Group means: males $\mu_1 = 8$ hours; females $\mu_2 = 6$ hours.
Grand mean (restricted model): $\mu_0 = \bar{y} = 7$ hours.
SSE values:
- Restricted SSE: $E_R = 10$
- Full SSE: $E_F = 4$
Degrees of freedom:
- Restricted: $df_R = n - 1 = 5$
- Full: $df_F = n - 2 = 4$
F statistic:
- Numerator: $E_R - E_F = 6$
- Denominator: $E_F / df_F = 4 / 4 = 1$
- $F = 6 / 1 = 6.0$
Result: $F(1,4) = 6.00$ ; p-value is greater than 0.05 (not statistically significant at 5% level).
Interpretation: despite a 2-hour mean difference (8 vs 6), the sample size and within-group variability yield insufficient evidence to declare a significant sex difference in sleep hours in this sample.
Practical note: small sample size and variability dampen the power to detect between-group differences, even when means differ by a nontrivial amount.
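The effect of sample size can be illustrated with a short sketch. The scores below are hypothetical values chosen to match the sleep example's means (8 and 6), $E_R = 10$, and $E_F = 4$; the loop then repeats the same pattern of scores to mimic larger samples with the same group means.

```python
def two_group_f(group_1, group_2):
    """Return (F, df_F) for the restricted-vs-full model comparison."""
    scores = group_1 + group_2
    n = len(scores)
    mu_0 = sum(scores) / n
    mu_1 = sum(group_1) / len(group_1)
    mu_2 = sum(group_2) / len(group_2)
    e_r = sum((y - mu_0) ** 2 for y in scores)
    e_f = sum((y - mu_1) ** 2 for y in group_1) + sum((y - mu_2) ** 2 for y in group_2)
    df_f = n - 2
    return (e_r - e_f) / (e_f / df_f), df_f


# Hypothetical hours of sleep (illustrative only).
males = [7, 8, 9]
females = [5, 6, 7]

results = []
for copies in (1, 2, 3):
    f_value, df_f = two_group_f(males * copies, females * copies)
    results.append((copies * 3, f_value, df_f))
    print(f"n per group = {copies * 3}: F(1, {df_f}) = {f_value:.2f}")
```

The mean difference stays at 2 hours throughout, yet F grows as observations are added, which is the sense in which a small sample limits power.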

How to conduct these analyses (conceptual workflow)

Step 1: Define two independent groups representing two populations.
Step 2: State H0: the two populations share the same mean; H1: the means differ.
Step 3: Compute grand mean under the restricted model: $\mu_0 = \bar{y}$.
Step 4: Compute group-specific means under the full model: $\mu_1$, $\mu_2$.
Step 5: Compute residuals and SSEs under both models:
- Restricted residuals: $y_i - \hat{y}^{(R)}_i$ with $\hat{y}^{(R)}_i = \mu_0$.
- Full residuals: $y_i - \hat{y}^{(F)}_i$ with $\hat{y}^{(F)}_i = \mu_j$ for group $j$.
Step 6: Compute SSEs: $E_R$ and $E_F$ as sums of squared residuals.
Step 7: Compute degrees of freedom: $df_R = n - 1$, $df_F = n - 2$.
Step 8: Compute the F statistic: $F = \dfrac{(E_R - E_F)/(df_R - df_F)}{E_F/df_F}$ and determine the p-value from the F distribution with $df_1 = df_R - df_F$ and $df_2 = df_F$ (here $df_1 = 1$).
Step 9: Decision rule: reject H0 if p-value < alpha (commonly 0.05) or if F exceeds the critical value from the F distribution.
Step 10: Report results in text and, if helpful, in a table with group means and SDs; include the exact F statistic and p-value.
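The workflow above can be sketched end to end as one function. Everything here is illustrative: the function name, the data, and the returned dictionary are assumptions, and the decision uses the critical-value form of Step 9 with the tabled 5% critical value for $F(1, 8)$, about 5.32, supplied by the caller.

```python
def two_group_model_comparison(group_a, group_b, critical_f):
    """Steps 3 to 9 for two independent groups; critical_f comes from an F table."""
    scores = list(group_a) + list(group_b)
    n = len(scores)

    # Steps 3 and 4: grand mean and group means.
    mu_0 = sum(scores) / n
    mu_a = sum(group_a) / len(group_a)
    mu_b = sum(group_b) / len(group_b)

    # Steps 5 and 6: sums of squared residuals under each model.
    e_r = sum((y - mu_0) ** 2 for y in scores)
    e_f = sum((y - mu_a) ** 2 for y in group_a) + sum((y - mu_b) ** 2 for y in group_b)

    # Steps 7 and 8: degrees of freedom and F.
    df_r, df_f = n - 1, n - 2
    f_value = ((e_r - e_f) / (df_r - df_f)) / (e_f / df_f)

    # Step 9: decision by critical value.
    return {
        "means": (mu_a, mu_b),
        "e_r": e_r,
        "e_f": e_f,
        "df": (df_r - df_f, df_f),
        "f": f_value,
        "reject_h0": f_value > critical_f,
    }


# Hypothetical scores (illustrative only); 5.32 is the 5% critical value for F(1, 8).
result = two_group_model_comparison([20, 22, 19, 24, 25], [28, 27, 30, 26, 29], critical_f=5.32)
df1, df2 = result["df"]
print(f"F({df1}, {df2}) = {result['f']:.2f}, reject H0: {result['reject_h0']}")
```

A report for Step 10 would add the group means and SDs and an exact p-value alongside the printed F.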

One-tailed vs two-tailed tests and practical considerations

Default in social sciences: two-tailed tests for means and correlation tests.
If a strong, a priori directional hypothesis exists (e.g., one group is expected to have the higher mean), a one-tailed test can be used. When the observed difference falls in the predicted direction, the one-tailed p-value is half the two-tailed p-value. Choosing the direction only after seeing the data is not a valid one-tailed test: it amounts to running a two-tailed test at alpha = 0.10 while claiming alpha = 0.05.
The shift toward automatic one-tailed p-values in some software (e.g., newer SPSS outputs) may reflect field or software trends; interpret carefully and stick with your preregistered plan.
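The relationship between one-tailed and two-tailed p-values for a symmetric statistic such as t can be written as a tiny helper. This is a sketch with a hypothetical function name and a made-up p-value, not output from any particular software.

```python
def one_tailed_p(p_two_tailed, effect_in_predicted_direction):
    """Convert a two-tailed p-value from a symmetric test (e.g., t) to one-tailed.

    The direction must have been predicted before looking at the data.
    """
    half = p_two_tailed / 2
    return half if effect_in_predicted_direction else 1 - half


# Hypothetical two-tailed p-value.
p_two = 0.08
p_same = one_tailed_p(p_two, effect_in_predicted_direction=True)
p_opposite = one_tailed_p(p_two, effect_in_predicted_direction=False)
print(p_same, p_opposite)
```

If the effect lands opposite to the prediction, the one-tailed p-value is large, so a directional test gives up any chance of detecting an effect in the unexpected direction.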
Small samples:
- Greater risk of Type II error; more variability reduces power to detect real differences.
- In such cases, one-tailed tests might be tempting but should only be used if there is a strong, justified directional hypothesis a priori.
Reporting practice:
- Always report p-values, especially for meta-analysis purposes.
- When possible, report exact p-values (e.g., p = 0.001) rather than merely stating p < 0.05.
- Include test statistics (e.g., F, t, or chi-square) and degrees of freedom; consider including group means and SDs in a table.

Connecting to broader statistics concepts

Model comparison and nested models: the restricted model is a special case of the full model (the case $\mu_1 = \mu_2$).
Intuition: restricted model captures parsimony (fewer parameters) while full model captures potential group differences; the F-test quantifies whether the extra parameters substantially improve fit.
Relationship to ANOVA: two-group design is the simplest ANOVA; adding more groups generalizes to one-way ANOVA with more factor levels.
Practical interpretation: a significant F indicates the difference between group means is larger than expected by chance given the within-group variability.
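The link to ANOVA can be made explicit with a short derivation. Here $n_j$ (the number of observations in group $j$) is notation introduced for this sketch; the other symbols follow the notes. Splitting each restricted residual as $(y_i - \mu_j) + (\mu_j - \mu_0)$ and noting that residuals around a group mean sum to zero within that group, so the cross term vanishes:

$$E_R = \sum_{j=1}^{2} \sum_{i \in \text{group } j} (y_i - \mu_0)^2 = \sum_{j=1}^{2} \sum_{i \in \text{group } j} (y_i - \mu_j)^2 + \sum_{j=1}^{2} n_j (\mu_j - \mu_0)^2 = E_F + \sum_{j=1}^{2} n_j (\mu_j - \mu_0)^2$$

So the numerator of F, $E_R - E_F$, is the between-groups sum of squares and $E_F$ is the within-groups sum of squares. As a check against the ADHD example: $6(104 - 101)^2 + 6(98 - 101)^2 = 108 = 116 - 8$.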

Data organization notes for researchers

Keep a separate grouping variable (e.g., group, sex, treatment) as a factor in the data set.
The DV should be a single variable (e.g., IQ, sleep hours) for all observations.
Two-group comparisons are most transparent when data are organized with one row per participant and a column for DV and one column for group membership.
When reporting to journals, consider both narrative text and a table with group means and SDs; tables are often preferred but may be costly to format.
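A minimal sketch of the one-row-per-participant layout, using only the Python standard library. The column names (`participant`, `group`, `score`) and the values are hypothetical; the point is that a single DV column plus a grouping column is enough to build the table of group sizes, means, and SDs.

```python
from collections import defaultdict
from statistics import mean, stdev

# One row per participant: an ID, the grouping variable, and the DV.
rows = [
    {"participant": 1, "group": "treatment", "score": 12.0},
    {"participant": 2, "group": "treatment", "score": 15.0},
    {"participant": 3, "group": "treatment", "score": 11.0},
    {"participant": 4, "group": "treatment", "score": 14.0},
    {"participant": 5, "group": "control", "score": 9.0},
    {"participant": 6, "group": "control", "score": 10.0},
    {"participant": 7, "group": "control", "score": 8.0},
    {"participant": 8, "group": "control", "score": 13.0},
]

# Collect the DV by group membership.
scores_by_group = defaultdict(list)
for row in rows:
    scores_by_group[row["group"]].append(row["score"])

# Descriptive table: n, mean, and sample SD per group.
summary = {
    group: {"n": len(values), "mean": mean(values), "sd": stdev(values)}
    for group, values in scores_by_group.items()
}

for group, stats in summary.items():
    print(f"{group:<10} n={stats['n']}  M={stats['mean']:.2f}  SD={stats['sd']:.2f}")
```

Because each participant appears in exactly one row with exactly one group label, the between-subjects structure of the design is visible directly in the data.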

What comes next (context within the course)

After mastering two-group designs, we’ll cover the two-assessment design (repeated measures) and nonparametric or distribution-free approaches to the F ratio.
Then we’ll extend to three or more groups and explore how ANOVA handles multiple comparisons and post-hoc tests.
The flow from t-tests and simple ANOVA toward model comparison makes future content more intuitive and cohesive.

Quick recap of key formulas (LaTeX)

Predicted values under the restricted model: $\hat{y}^{(R)}_i = \mu_0$, where $\mu_0$ is the grand mean.
Predicted values under the full model: $\hat{y}^{(F)}_i = \mu_j \quad (j = 1, 2)$.
SSE under restricted vs full models: $E_R = \sum_{i=1}^{n} (y_i - \hat{y}^{(R)}_i)^2, \quad E_F = \sum_{i=1}^{n} (y_i - \hat{y}^{(F)}_i)^2$.
Degrees of freedom: $df_R = n - 1, \ df_F = n - 2$.
F statistic: $F = \frac{(E_R - E_F)/(df_R - df_F)}{E_F/df_F}$.
Special case for two groups: $df_R - df_F = 1$, so $F = \frac{E_R - E_F}{E_F/df_F}$.
Relation to t-test: $F = t^2$ when comparing two groups with the same model structure.

Imagine you want to see if apples and oranges are different heights. You can't just pick one apple and one orange. You need a group of apples and a group of oranges.

"Two groups" just means you have two separate piles of things you want to compare. Like one pile of apples and one pile of oranges.

"Independent" means that what happens to an apple in your apple pile doesn't change anything about an orange in your orange pile. They are completely separate. If you pick a really big apple, it doesn't make an orange in the other pile bigger or smaller. They don't affect each other at all.

So, "differentiating two independent groups" simply means we are trying to find out if there's a real, noticeable difference between those two separate piles of things (like apples vs. oranges) when we measure something, like their average height or weight.