WK10: Statistical Inference - Comparing Two Means: Confidence Intervals for Two Independent Samples

Focus is on the difference between two population means.
Variable is measured across both populations.
Two independent samples are collected:
- Sample 1: Size $n_1$ from population 1 with unknown mean $\mu_1$ and standard deviation $\sigma_1$.
- Sample 2: Size $n_2$ from population 2 with unknown mean $\mu_2$ and standard deviation $\sigma_2$.
Samples are independent; one doesn't influence the other.
Sample sizes can differ.
Summary statistics:
- Sample 1 (from population 1): Sample mean $\bar{X}_1$, sample standard deviation $S_1$.
- Sample 2 (from population 2): Sample mean $\bar{X}_2$, sample standard deviation $S_2$.

Matched Pairs:
- Two samples of paired data (e.g., before/after measurements on the same individual).
- Example: Vitamin C content in tomatoes before and after cooking.
- A single sample of differences is analyzed.
- One-sample t-procedures are used on the differences.
- Summary statistics: Sample mean of differences $\bar{X}_d$, sample standard deviation of differences $S_d$, sample size (number of pairs) $n$.
Independent Samples:
- One sample from each population, independently obtained.
- No matching of individuals.
- Samples can be of different sizes.
- Summary statistics for Population 1: $\bar{X}_1$, $S_1$, $n_1$.
- Summary statistics for Population 2: $\bar{X}_2$, $S_2$, $n_2$.
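To make the matched-pairs summary statistics concrete, here is a minimal sketch that reduces paired measurements to a single sample of differences. The vitamin C values are hypothetical, invented purely for illustration, and the variable names are not from the notes.

```python
import statistics

# Hypothetical vitamin C readings for the same five tomatoes,
# before and after cooking (illustrative values only).
before = [22.0, 18.5, 25.1, 20.3, 19.8]
after = [18.2, 16.0, 21.4, 17.9, 17.1]

# Matched pairs: analyze one sample of within-pair differences.
differences = [b - a for b, a in zip(before, after)]

n = len(differences)                   # number of pairs
mean_d = statistics.mean(differences)  # sample mean of the differences
sd_d = statistics.stdev(differences)   # sample standard deviation (n - 1 divisor)

print(f"n = {n}, mean difference = {mean_d:.3f}, sd of differences = {sd_d:.3f}")
```

With independent samples there is no pairing to exploit, so each sample keeps its own mean, standard deviation, and size.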

Parameter of interest: Difference of population means ( $\mu_1 - \mu_2$ ).
Point estimate: Difference of sample means ( $\bar{X}_1 - \bar{X}_2$ ).
Standard deviation of the sampling distribution of the difference between sample means:
$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
Since $\sigma_1$ and $\sigma_2$ are usually unknown, we estimate them with $S_1$ and $S_2$.
Standard error of the difference between two sample means:
$\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$
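The standard deviation formula above follows from the independence of the two samples, because the variances of independent quantities add. A short derivation using only the symbols already defined:

```latex
\[
\begin{aligned}
\operatorname{Var}(\bar{X}_1 - \bar{X}_2)
  &= \operatorname{Var}(\bar{X}_1) + \operatorname{Var}(\bar{X}_2)
     && \text{(independent samples)} \\
  &= \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}, \\
\operatorname{SD}(\bar{X}_1 - \bar{X}_2)
  &= \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} .
\end{aligned}
\]
```

Note that the variances add even though the means are subtracted. Replacing each $\sigma$ with the corresponding $S$ gives the standard error.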

Randomness: Each sample must be randomly selected from its population.
Normality: Check the shape of the population distributions.
- Use dot plots or histograms for each sample.
- Small Samples ( $n_1 + n_2 < 15$ ): Dot plots should look approximately normal; if there is skewness or outliers, do not use t-procedures.
- Medium Samples ( $15 \le n_1 + n_2 < 40$ ): T-procedures are acceptable as long as the histograms show no strong skewness or outliers.
- Large Samples ( $n_1 + n_2 \ge 40$ ): T-procedures can be used even with skewed distributions.
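The sample-size guideline can be expressed as a small helper, sketched below. The function name is hypothetical, and assigning totals of exactly 15 and 40 to the medium and large categories is an assumption of this sketch. The helper only reports which rule applies; the plots still have to be inspected by eye.

```python
def t_procedure_guideline(n1: int, n2: int) -> str:
    """Return the normality guideline that applies to a combined sample size."""
    total = n1 + n2
    if total < 15:
        # Small samples: the data must look approximately normal.
        return "small: use t-procedures only if the data look approximately normal"
    if total < 40:
        # Medium samples: mild departures from normality are tolerated.
        return "medium: t-procedures are fine unless there is strong skewness or outliers"
    # Large samples: t-procedures tolerate skewed distributions.
    return "large: t-procedures can be used even with skewed distributions"


print(t_procedure_guideline(31, 29))
```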

General structure: Sample estimate $\pm$ (multiplier $\times$ standard error).
Confidence interval for independent samples: $(\bar{X}_1 - \bar{X}_2) \pm t^* \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$
- $t^*$ is the critical value from the t-distribution for the desired confidence level.
Degrees of Freedom (DOF):
- Conservative Approach: Use the smaller of $n_1 - 1$ and $n_2 - 1$. This gives a slightly larger t-multiplier and therefore a wider (more conservative) confidence interval.
 - Use this approach for manual calculations.
- Full DOF Approximation: Compute the degrees of freedom from the Welch-Satterthwaite equation, which is more accurate but more complex.
 - Accurate if both sample sizes are at least 5.
 - Software usually provides this DOF.
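For reference, the Welch-Satterthwaite approximation named above is usually written as follows, in the notation of these notes:

```latex
\[
df \approx
\frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}
     {\dfrac{1}{n_1 - 1}\left(\dfrac{S_1^2}{n_1}\right)^2
      + \dfrac{1}{n_2 - 1}\left(\dfrac{S_2^2}{n_2}\right)^2}
\]
```

The result is generally not a whole number, and it always lies between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$, which is why the conservative choice never overstates the degrees of freedom.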

Example: A study compares the resting pulse rates of regular exercisers and non-exercisers.
Sample 1 (Non-exercisers): $n_1 = 31$, $\bar{X}_1 = 75$, $S_1 = 9$.
Sample 2 (Exercisers): $n_2 = 29$, $\bar{X}_2 = 66$, $S_2 = 8$.
Goal: Construct a 95% confidence interval for the difference in mean pulse rates.
Direction: Non-exercisers - Exercisers.
Sample size condition satisfied ( $n_1 + n_2 = 60 > 40$ ), so t-procedures may be used.
Degrees of Freedom Calculation:
- Conservative Approach: $df = \min(31 - 1, 29 - 1) = 28$. For a 95% confidence level, the t-multiplier ( $t^*$ ) is 2.048 (from a t-table).
- Full DOF Approximation: $df \approx 57.86$. Round down to 50, the nearest table entry that does not exceed this value, giving $t^* = 2.009$.
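Both degrees-of-freedom calculations can be reproduced from the summary statistics alone. The sketch below uses only the standard library; the function names are hypothetical. Applied to the rounded summary statistics of the pulse-rate example, the full approximation comes out at roughly 58.

```python
def conservative_df(n1: int, n2: int) -> int:
    """Conservative degrees of freedom: the smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)


def welch_df(s1: float, n1: int, s2: float, n2: int) -> float:
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Pulse-rate example: non-exercisers (sample 1) vs. exercisers (sample 2).
df_cons = conservative_df(31, 29)
df_full = welch_df(9, 31, 8, 29)

print(f"conservative df = {df_cons}")
print(f"Welch-Satterthwaite df = {df_full:.2f}")
```

A printed t-table has no row for a fractional value like this, which is why the lookup falls back to the nearest smaller tabulated row.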

Using full DOF approximation ( $df = 50, t^* = 2.009$ ):
$(75 - 66) \pm 2.009 \sqrt{\frac{9^2}{31} + \frac{8^2}{29}} \approx 9 \pm 2.009 \times 2.195 \approx 9 \pm 4.411$
Resulting 95% confidence interval: (4.589, 13.411).
Interpretation: We are 95% confident that the difference in mean resting pulse rates between non-exercisers and exercisers (non-exercisers minus exercisers) is between 4.589 and 13.411 beats per minute.

If the confidence interval for the difference includes zero, we cannot claim a significant difference between the population means at the corresponding level.
If the confidence interval does not include zero, we can claim a significant difference.
In this example (4.589, 13.411), zero is not included, so at the 95% confidence level we can conclude that mean resting pulse rates differ between the two groups, with non-exercisers having the higher mean.

When calculating by hand, use the conservative approach for degrees of freedom. The resulting confidence interval will be slightly wider, but the interpretation and conclusion regarding the difference in means remain the same.
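As a check on this comparison, the sketch below computes the interval for the pulse-rate example with both multipliers from the t-table (2.009 for $df = 50$, 2.048 for $df = 28$) and reports whether zero falls inside. The function name is hypothetical, and the multipliers are typed in from the table rather than computed.

```python
import math


def two_sample_ci(xbar1, s1, n1, xbar2, s2, n2, t_star):
    """Confidence interval for mu1 - mu2 from independent-sample summary statistics."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)  # standard error of the difference
    diff = xbar1 - xbar2                         # point estimate
    margin = t_star * se                         # margin of error
    return diff - margin, diff + margin


# Non-exercisers (sample 1) minus exercisers (sample 2), 95% confidence.
ci_full = two_sample_ci(75, 9, 31, 66, 8, 29, t_star=2.009)  # df = 50 table row
ci_cons = two_sample_ci(75, 9, 31, 66, 8, 29, t_star=2.048)  # df = 28 table row

for label, (low, high) in [("full DOF", ci_full), ("conservative", ci_cons)]:
    excludes_zero = low > 0 or high < 0
    print(f"{label}: ({low:.3f}, {high:.3f}), excludes zero: {excludes_zero}")
```

Only the multiplier changes between the two runs, so the conservative interval is wider by a fixed proportion of the standard error.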