Paired (Dependent) Samples: Paired t-test and McNemar's Test - Key Concepts and Calculations
Paired (dependent) samples: Training vs. no training
Study design
Within each matched pair, one member randomly received training and the other did not; afterward, both take the same test.
Data are numerical scores (quantitative). Interpretation uses a matched-pairs (dependent) setup.
Significance level: α = 0.05 (given).
Alternative hypothesis is one-sided: training helps (i.e., the training group scores higher than the no-training group).
Therefore, the test focuses on the difference within each pair.
Define the difference for each pair:
d_i = x_i − y_i, for i = 1, …, n, where x_i is the trained member's score and y_i is the untrained member's score.
Here, the discussion centers on 12 pairs (n = 12).
Key conceptual shift
The “story” moves from two groups to one group of differences by constructing the difference column. This allows us to ignore the original two-group structure and analyze a single set of 12 differences.
This connects to earlier modules:
Module 5: confidence intervals (the one-sample perspective).
Module 6/7: significance testing (test statistic, p-values).
Hypotheses and setup
Null hypothesis: H0: μ_d = 0 (no mean difference).
Alternative hypothesis (one-sided):
If training helps, then the difference should be positive on average:
H1: μ_d > 0.
Population distribution assumptions (for the test statistic):
Numerical data (scores).
Dependent samples (paired differences).
Randomization present.
The population of differences should be normal or close to normal (this assumption matters most for small samples such as n = 12).
Computation details (difference-based analysis)
Compute the difference column:
d_i = x_i − y_i, where i = 1, …, 12.
Then compute the sample mean of the differences:
d̄ = (1/n) Σ d_i.
Compute the sample standard deviation of the differences:
s_d = √( Σ (d_i − d̄)² / (n − 1) ).
The comparison uses the paired (one-sample) t-statistic:
t = d̄ / (s_d / √n).
Degrees of freedom: df = n − 1 = 11.
Interpretation of the t-statistic remains the same as in Module 6: a large positive t supports H1; a negative or small t supports H0.
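The difference-based computation can be carried out by hand in a few lines of Python. The paired scores below are hypothetical: only the first two pairs (95 − 90 = 5 and 89 − 85 = 4) match the transcript, and the rest are invented to produce a small t like the one reported in class.

```python
import math

# Hypothetical paired scores (only the first two pairs match the
# transcript; the rest are invented for illustration).
x = [95, 89, 76, 92, 90, 53, 67, 88, 78, 85, 85, 79]  # with training
y = [90, 85, 79, 90, 91, 53, 68, 90, 75, 89, 84, 81]  # without training

# Step 1: the difference column d_i = x_i - y_i.
d = [xi - yi for xi, yi in zip(x, y)]
n = len(d)

# Step 2: sample mean of the differences.
d_bar = sum(d) / n

# Step 3: sample standard deviation of the differences (n - 1 denominator).
s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))

# Step 4: paired (one-sample) t-statistic with df = n - 1.
t = d_bar / (s_d / math.sqrt(n))
df = n - 1

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t = {t:.3f}, df = {df}")
```

With these invented numbers t ≈ 0.20, so the one-sided p-value is large and the decision matches the one described in the transcript.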
Example values from the transcript (illustrative)
Differences were computed (e.g., 95−90 = 5, 89−85 = 4, etc.).
The 12 difference values yield a mean difference d̄ and standard deviation s_d (values computed in class).
The t-statistic is reported as small in this example, leading to a large p-value.
Reported one-sided p-value: p ≈ 0.43 (one-sided, since H1: μ_d > 0).
Decision:
Since p ≈ 0.43 > α = 0.05, we do not reject the null hypothesis.
Meaning of not rejecting H0:
It might be true that training has no effect on test performance (i.e., the trained and untrained members score similarly on average).
Note on reporting:
In exams, you should write the explicit conclusion, such as: "Fail to reject H0 at α = 0.05; there is no evidence that training improves scores under this design."
Practical notes on the workflow
If software is available: use it to obtain the p-value for the t-statistic with df = 11.
If not: use the t-table (t_{α, df}) to determine the critical value and compare it with the observed t.
The key idea is to work with the differences; you can think of this as transforming a two-group problem into a one-group problem.
Remember the distinction between one-sided and two-sided tests and reflect that in the p-value interpretation.
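Both routes (software p-value vs. t-table critical value) can be illustrated with `scipy.stats.t`, assuming SciPy is available; `t_obs = 0.20` is a hypothetical observed value of roughly the size the transcript describes.

```python
from scipy.stats import t as t_dist

df = 11        # n - 1 for 12 pairs
t_obs = 0.20   # hypothetical observed t-statistic
alpha = 0.05

# Software route: one-sided p-value P(T_df > t_obs).
p_value = t_dist.sf(t_obs, df)

# Table route: critical value t_{alpha, df} for a one-sided test.
t_crit = t_dist.ppf(1 - alpha, df)

# Both routes must give the same decision.
reject = p_value < alpha        # p-value comparison
reject_crit = t_obs > t_crit    # critical-value comparison
print(round(p_value, 3), round(t_crit, 3), reject, reject_crit)
```

Here both comparisons agree: the p-value (≈ 0.42) exceeds α, and the observed t falls below the critical value (≈ 1.80), so we fail to reject H0 either way.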
Summary takeaways
For matched-pairs data with a numerical outcome, a paired t-test on the difference scores is appropriate.
Hypotheses focus on the mean difference; null is zero difference; one-sided alternative is often used when the question specifies a direction (e.g., training improves).
Critical steps: compute the differences, compute d̄ and s_d, compute t, determine df, obtain the p-value, compare to α, draw a conclusion.
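If SciPy (1.6 or later) is available, these steps reduce to one call to `scipy.stats.ttest_rel` with a one-sided alternative; the paired scores below are hypothetical.

```python
from scipy.stats import ttest_rel

# Hypothetical paired scores: x = with training, y = without training.
x = [95, 89, 76, 92, 90, 53, 67, 88, 78, 85, 85, 79]
y = [90, 85, 79, 90, 91, 53, 68, 90, 75, 89, 84, 81]

# Paired t-test with the one-sided alternative H1: mu_d > 0.
res = ttest_rel(x, y, alternative="greater")
print(round(res.statistic, 3), round(res.pvalue, 3))
```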
McNemar's test for matched categorical data (sibling puzzle experiment)
Context and data structure
Objective: determine whether there is a difference in the probability of solving a puzzle in less than one minute between older and younger siblings when data are paired.
Design: 70 paired siblings; each pair has two binary outcomes: time < 1 minute vs time ≥ 1 minute for older vs younger.
Data are categorical and paired (dependent samples).
Seven steps framework: categorical data, dependent samples, significance test, α given.
The 2x2 table used to summarize outcomes
Table layout (rows = older, columns = younger):
n11: older < 1 min and younger < 1 min = 25
n12: older < 1 min and younger ≥ 1 min = 18
n21: older ≥ 1 min and younger < 1 min = 10
n22: older ≥ 1 min and younger ≥ 1 min = 22
The two numbers that carry information about the difference between groups are the off-diagonal discordant counts:
n12 = 18 and n21 = 10.
The diagonal counts (n11, n22) do not contribute to the test statistic for McNemar's test.
Hypotheses
Null hypothesis: there is no difference in the discordant probabilities; i.e., P(older < 1 min and younger ≥ 1 min) = P(older ≥ 1 min and younger < 1 min). In practical terms: under H0, each discordant pair is equally likely to fall in either off-diagonal cell, so n12 and n21 should be similar.
Alternative hypothesis: two-sided, because either direction of difference is of interest (older may be more likely or younger may be more likely to solve quickly).
Therefore, a two-tailed test with α given (here α = 0.01).
Test statistic (large-sample McNemar’s test)
Focus on the off-diagonal counts: n12 = 18 and n21 = 10.
Large-sample z statistic (no continuity correction):
z = (n12 − n21) / √(n12 + n21).
With continuity correction (optional):
z = (|n12 − n21| − 1) / √(n12 + n21).
Using the given numbers:
Difference: n12 − n21 = 18 − 10 = 8.
Sum: n12 + n21 = 28.
Without correction: z = 8 / √28 ≈ 1.51 (as stated in the transcript).
With correction (if applied): z = 7 / √28 ≈ 1.32.
Distribution and large-sample rule of thumb:
The large-sample condition is that n12 + n21 > 20. Here 18 + 10 = 28 > 20, so the normal approximation is acceptable.
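The z computation needs only the two discordant counts and the standard normal distribution, so a stdlib-only sketch suffices:

```python
from math import sqrt
from statistics import NormalDist

# Discordant counts from the 2x2 table in the notes.
n12 = 18  # older < 1 min, younger >= 1 min
n21 = 10  # older >= 1 min, younger < 1 min

# Large-sample z statistic, no continuity correction.
z = (n12 - n21) / sqrt(n12 + n21)

# Two-sided p-value: double the upper-tail normal probability.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# Optional continuity-corrected version.
z_cc = (abs(n12 - n21) - 1) / sqrt(n12 + n21)

print(round(z, 3), round(p_two_sided, 3), round(z_cc, 3))
```

This reproduces the values in the notes: z ≈ 1.51 and a two-sided p-value ≈ 0.13.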
P-value and decision rule
For a two-sided test, double the one-sided tail probability associated with z ≈ 1.51: p ≈ 2 × P(Z > 1.51) ≈ 0.13.
Significance level given in the example: α = 0.01.
Since the p-value ≈ 0.13 > α, we fail to reject H0.
Conclusion: There is no evidence of a difference in the probability of solving the puzzle in less than one minute between older and younger siblings in this sample.
Key interpretation and nuances
McNemar’s test uses only the two discordant counts (n12 and n21); the diagonal counts do not affect the test statistic.
The test is specifically designed for paired dichotomous data; it assesses whether the two outcomes are equally likely across the two conditions (older vs younger in this setup).
The large-sample condition is an approximation; for small samples, the exact McNemar test is available and may be preferred when n12 + n21 is not large.
If the alternative had been one-sided (e.g., older more likely to be fast than younger), you would use a one-sided p-value (and not double).
Practical notes on the workflow
You need a 2x2 table of paired observations, with the focus on the off-diagonal cells n12 and n21.
Compute z using the formula above and consult a z-table to obtain the one-sided tail probability; double it for a two-sided test if applicable.
If n12 + n21 is not large enough, consider the exact McNemar test instead of the normal approximation.
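When n12 + n21 is small, the exact test treats n12 as Binomial(n12 + n21, 1/2) under H0. A stdlib-only sketch using this example's counts (where the normal approximation is in fact acceptable):

```python
from math import comb

n12, n21 = 18, 10
m = n12 + n21  # total number of discordant pairs

# Under H0 each discordant pair is equally likely to fall in either
# off-diagonal cell, so n12 ~ Binomial(m, 1/2).
k = max(n12, n21)
upper_tail = sum(comb(m, j) for j in range(k, m + 1)) / 2 ** m

# Two-sided exact p-value (doubled tail, capped at 1).
p_exact = min(1.0, 2 * upper_tail)

print(round(p_exact, 4))
```

If statsmodels is installed, the same test is available via `statsmodels.stats.contingency_tables.mcnemar(table, exact=True)`. Note that the exact p-value here (≈ 0.18) is somewhat larger than the normal approximation's ≈ 0.13, but leads to the same decision.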
Connections to broader principles
This test illustrates how to handle paired categorical data, a common situation when the same subjects are measured under two conditions.
It contrasts with the paired t-test by dealing with binary outcomes rather than continuous scores.
The concept of using only the discordant pairs to measure a difference is analogous to focusing on information that actually reflects a change between the paired conditions.
Final takeaway across both examples
When data are paired or matched, it is often advantageous to analyze differences directly (paired t-test for numerical outcomes; McNemar’s test for binary outcomes).
The null hypotheses typically express no difference (e.g., zero mean difference, or equal probability of discordant outcomes).
The choice of one-sided vs two-sided tests depends on the research question; the transcript emphasizes explicit direction when appropriate (one-sided in the training example).
Always verify sample size conditions (approximate normality of the differences for the paired t-test; n12 + n21 > 20 for McNemar) to decide whether to rely on asymptotic approximations or exact tests.
Common tools and references
Common test statistics discussed: t-statistic for paired observations, and z-statistic for McNemar’s test.
Tables discussed: z-table, t-table, and chi-square table.
In practice, software can provide exact p-values for McNemar and p-values for the paired t-test; if not, use the corresponding tables and the described formulas.
For paired (dependent) samples, statistical tests analyze differences directly:
Paired t-test (for numerical data)
Study Design: Compares two conditions on the same subjects (e.g., training vs. no training). Data are numerical scores.
Key Concept: Transforms the two-group problem into a one-sample problem by computing differences: d_i = x_i − y_i.
Hypotheses: Null hypothesis: H0: μ_d = 0 (no mean difference). Alternative hypothesis: H1: μ_d > 0 (e.g., training helps) or H1: μ_d ≠ 0 (two-sided).
Test Statistic: t = d̄ / (s_d / √n) with df = n − 1.
Decision: Compare the p-value to the significance level α. If p > α, fail to reject H0. (e.g., p ≈ 0.43 > α = 0.05, no evidence training improves scores).
McNemar's test (for matched categorical data)
Study Design: Compares paired binary outcomes (e.g., probability of solving a puzzle for older vs. younger siblings). Data are categorical.
Data Structure: Summarized in a 2x2 table. Only the off-diagonal discordant counts (n12 and n21) are used, representing changes between conditions.
Hypotheses: Null hypothesis: no difference in the discordant probabilities (n12 and n21 estimate the same quantity). Alternative hypothesis: two-sided, indicating a difference in either direction.
Test Statistic: Large-sample z-statistic: z = (n12 − n21) / √(n12 + n21). The large-sample condition is n12 + n21 > 20.
Decision: For a two-sided test, double the one-sided p-value. If p > α, fail to reject H0. (e.g., z ≈ 1.51, p ≈ 0.13 > α = 0.01, no evidence of a difference in puzzle-solving probability).
General Takeaways
Analyzing differences directly is advantageous for paired data to assess changes or effects within subjects.
Null hypotheses typically state no difference (zero mean difference or equal discordant probabilities).
The choice between one-sided and two-sided tests depends on the research question's directionality.
Always verify sample size conditions (approximate normality of the differences for the t-test; the sum of discordant counts n12 + n21 > 20 for McNemar's) to ensure the validity of the chosen approximation or test.