Comparing Two Population Parameters: Exhaustive Guide to Proportions and Means

Distinguishing between Independent and Dependent Sampling

Objective 1: Distinguish between independent and dependent sampling.
Definitions:
- Independent Sampling: A sampling method is considered independent when an individual selected for one sample does not dictate which individual is to be in a second sample.
- Dependent Sampling: A sampling method is dependent when an individual selected to be in one sample is used to determine the individual in the second sample.
- Matched-Pairs Samples: This is another term for dependent samples. In some cases, an individual might be matched against him- or herself (e.g., before-and-after measurements).
Examples of Distinguishing Sampling Methods:
- Example (a): Hotel Prices: A researcher compares the price of a one-night stay at a Holiday Inn Express versus a Red Roof Inn. She randomly selects 8 towns where the hotel locations are close to each other.
  - Classification: Dependent.
  - Reasoning: Once a location is picked for the Holiday Inn Express, the Red Roof Inn must be chosen from the same or a nearby location.
- Example (b): Comic Preferences: A polling agency obtained a random sample of 500 young adults (18-39) and 550 senior adults (60-89) and asked if they prefer Marvel or DC Comics.
  - Classification: Independent.
  - Reasoning: The selection of young adults in the first sample is not related to the selection of seniors in the second sample.

Hypothesis Testing for Two Population Proportions (Independent Samples)

Objective 2: Test hypotheses regarding two proportions from independent samples.
Sampling Distribution Principles:
- To conduct inference, the sampling distribution of the difference of two proportions must be determined. When the two samples are independent and sufficiently large, this distribution is approximately normal.
- Parameters and Statistics:
  - Suppose a simple random sample of size $n_1$ is taken from population 1 where $x_1$ individuals have a characteristic.
  - A simple random sample of size $n_2$ is independently taken from population 2 where $x_2$ individuals have the same characteristic.
  - The sample proportions are defined as $\hat{p}_1 = \frac{x_1}{n_1}$ and $\hat{p}_2 = \frac{x_2}{n_2}$.
Distribution Properties:
- The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal if requirements are met.
- Mean: $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$.
- Standard Deviation: $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$.
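The mean and standard deviation formulas above can be checked empirically. The following is a minimal simulation sketch, not part of the original notes: the population proportions (0.30 and 0.50), the sample sizes (200 each), the number of repetitions, and the function names are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_diff(p1, n1, p2, n2, reps=2000, seed=1):
    """Draw `reps` pairs of independent samples and return p1_hat - p2_hat for each."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        x1 = sum(rng.random() < p1 for _ in range(n1))  # successes in sample 1
        x2 = sum(rng.random() < p2 for _ in range(n2))  # successes in sample 2
        diffs.append(x1 / n1 - x2 / n2)
    return diffs

def theoretical_sd(p1, n1, p2, n2):
    """Standard deviation of p1_hat - p2_hat from the formula in the text."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical populations, chosen only for illustration.
diffs = simulate_diff(0.30, 200, 0.50, 200)
print("simulated mean:", round(mean(diffs), 3), "(theory: -0.2)")
print("simulated sd:  ", round(stdev(diffs), 3),
      "(theory:", round(theoretical_sd(0.30, 200, 0.50, 200), 3), ")")
```

The simulated mean and standard deviation should land close to $p_1 - p_2$ and to the formula value, with small differences due to simulation noise.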
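Standardized Test Statistic (Z-score):
- General form: $Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$, which follows an approximate standard normal distribution.
- Under $H_0: p_1 = p_2$, both sample proportions estimate the same value, so the test statistic uses the pooled estimate $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ in the standard error: $Z_{stat} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$. The results reported in the case studies of this section ($Z_{stat} = 5.787$ and $P\text{-value} = 0.1358$) match this pooled form.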
Requirements for the Hypothesis Test:
- Samples must be independently obtained via simple random sampling or a completely randomized experiment with two levels of treatment.
- Success-failure counts: $n_1 \hat{p}_1 (1 - \hat{p}_1) \ge 10$ and $n_2 \hat{p}_2 (1 - \hat{p}_2) \ge 10$.
- Independence of observations: Each sample size must be no more than 5% of its population size ($n_1 \le 0.05N_1$ and $n_2 \le 0.05N_2$).
Case Study 1: Diabetes Treatments (The ADOPT Study):
- Context: Does Avandia (a diabetes drug) increase heart attacks compared to other treatments?
- Data points:
  - Group 1 (Avandia): $n_1 = 1456$, $x_1 = 27$, $\hat{p}_1 = \frac{27}{1456} \approx 0.0185$.
  - Group 2 (Other): $n_2 = 2895$, $x_2 = 41$, $\hat{p}_2 = \frac{41}{2895} \approx 0.0142$.
- Requirement Verification:
  - $1456(0.0185)(1-0.0185) = 26.44 \ge 10$ (Check).
  - $2895(0.0142)(1-0.0142) = 40.53 \ge 10$ (Check).
- Hypotheses:
  - $H_0: p_1 = p_2$ (or $H_0: p_1 - p_2 = 0$).
  - $H_1: p_1 > p_2$ (or $H_1: p_1 - p_2 > 0$).
- Results: $Z_{stat} \approx 1.10$; $P\text{-value} = 0.1358$. (A right-tailed P-value of 0.1358 corresponds to $Z \approx 1.10$, not 1.20.)
- Conclusion: Since $0.1358 > 0.05$, we do not reject the null hypothesis. There is not sufficient evidence that the proportion of heart attacks is higher with Avandia.
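The calculation behind this case study can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name and its `alternative` argument are hypothetical, and the sketch assumes the pooled standard error under $H_0: p_1 = p_2$, which is the form that reproduces the reported P-value.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Two-proportion z-test with a pooled standard error (assumes H0: p1 = p2)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled estimate of the common proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    cdf = NormalDist().cdf
    if alternative == "greater":    # H1: p1 > p2
        p_value = 1 - cdf(z)
    elif alternative == "less":     # H1: p1 < p2
        p_value = cdf(z)
    else:                           # H1: p1 != p2
        p_value = 2 * (1 - cdf(abs(z)))
    return z, p_value

# ADOPT study counts from the case study above (right-tailed test).
z, p = two_proportion_z_test(27, 1456, 41, 2895, alternative="greater")
print(f"z = {z:.2f}, P-value = {p:.4f}")
```

With the ADOPT counts this gives a statistic of about 1.10 and a P-value of about 0.1358. Passing `alternative="two-sided"` handles tests of $H_1: p_1 \neq p_2$.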
Case Study 2: Unplugging from Devices (Harris Interactive Survey):
- Context: Is there a difference in feelings about returning to a time before being "plugged in" between adults (35-54) and baby-boomers (55+)?
- Data points:
  - Group 1 (Adults): $n_1 = 500$, $x_1 = 385$, $\hat{p}_1 = \frac{385}{500} = 0.77$.
  - Group 2 (Boomers): $n_2 = 500$, $x_2 = 300$, $\hat{p}_2 = \frac{300}{500} = 0.60$.
- Requirement Verification:
  - $500(0.77)(1-0.77) = 88.55 \ge 10$ (Check).
  - $500(0.60)(1-0.60) = 120 \ge 10$ (Check).
- Hypotheses:
  - $H_0: p_1 - p_2 = 0$.
  - $H_1: p_1 - p_2 \neq 0$.
- Results: $Z_{stat} = 5.787$; $P\text{-value} < 0.0001$.
- Conclusion: Since the P-value is less than $0.05$, we reject the null hypothesis. There is sufficient evidence that the two proportions differ.

Confidence Intervals for Two Population Proportions

Objective 3: Construct and interpret confidence intervals for the difference between two population proportions.
Formula for a $(1 - \alpha) \times 100\%$ Confidence Interval:
- Lower Bound: $(\hat{p}_1 - \hat{p}_2) - Z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
- Upper Bound: $(\hat{p}_1 - \hat{p}_2) + Z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
Requirement Check: Same as the hypothesis test requirements (independent simple random samples, $n\hat{p}(1-\hat{p}) \ge 10$ for each sample, $n \le 0.05N$ for each sample).
Case Study: Position on Divorce (Harris Interactive Survey):
- Groups: Religious ($n_1 = 970$, $x_1 = 834$, $\hat{p}_1 \approx 0.860$) and Not Religious ($n_2 = 1285$, $x_2 = 1157$, $\hat{p}_2 \approx 0.900$).
- Interval Calculation (95% Confidence):
  - Lower Bound: $-0.0679$
  - Upper Bound: $-0.0133$
- Interpretation: We are 95% confident that the proportion of religious people who find divorce acceptable is between $1.33$ and $6.79$ percentage points lower than that of non-religious individuals.
- Note: Since 0 is not included in the interval and both bounds are negative ($p_1 < p_2$), we conclude there is a significant difference.
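The interval in this case study can be reproduced with a short sketch. It uses only Python's standard library; the function name is hypothetical. Unlike a hypothesis test of $H_0: p_1 = p_2$, the interval uses the unpooled standard error, because no common proportion is assumed.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_{alpha/2}
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_crit * se, diff + z_crit * se

# Counts from the divorce survey above.
lower, upper = two_proportion_ci(834, 970, 1157, 1285)
print(f"({lower:.4f}, {upper:.4f})")  # (-0.0679, -0.0133)
```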

Sample Size Estimation for Proportions

Objective 4: Determine the sample size necessary for estimating the difference between two population proportions.
Calculation Rule: Always round up to the next integer for sample size calculations.
Formula Case 1: Prior estimates $\hat{p}_1$ and $\hat{p}_2$ are available:
- $n = n_1 = n_2 = [\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)] \left(\frac{Z_{\alpha/2}}{E}\right)^2$
Formula Case 2: Prior estimates are unavailable (use $\hat{p}_1 = \hat{p}_2 = 0.5$, so the bracketed term becomes $0.25 + 0.25 = 0.5$):
- $n = n_1 = n_2 = 0.5 \left(\frac{Z_{\alpha/2}}{E}\right)^2$
Case Study: Prenatal Care Mothers (15-19 years vs. 30-34 years):
- Goal: Estimate the difference within 2 percentage points ($E = 0.02$) with 95% confidence.
- (a) With Prior Estimates ($\hat{p}_1 = 0.98$, $\hat{p}_2 = 0.992$):
  - Result: $[0.98(0.02) + 0.992(0.008)] \left(\frac{1.96}{0.02}\right)^2 \approx 264.5$, so the sample size required is $n = 265$.
- (b) Without Prior Estimates (Using 0.5):
  - Result: $0.5 \left(\frac{1.96}{0.02}\right)^2 = 4802$, so the sample size required is $n = 4802$.
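The two sample-size formulas can be combined into one small helper. This is a sketch using only the standard library; the function and parameter names are hypothetical, and it rounds up as the calculation rule requires.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(margin, confidence=0.95, p1=None, p2=None):
    """Per-group sample size (n1 = n2) for estimating p1 - p2 within `margin`."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_{alpha/2}
    if p1 is None or p2 is None:
        variance_sum = 0.5  # no prior estimates: 0.5(0.5) + 0.5(0.5)
    else:
        variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(variance_sum * (z_crit / margin) ** 2)  # always round up

# Prenatal care example: E = 0.02, 95% confidence.
print(sample_size_two_proportions(0.02, p1=0.98, p2=0.992))  # 265
print(sample_size_two_proportions(0.02))                     # 4802
```

The large gap between the two results shows how much a usable prior estimate can reduce the required sample size when the proportions are far from 0.5.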

Inference about Two Means: Dependent Samples (Matched-Pairs)

Objective 1: Test hypotheses for a population mean from matched-pairs data.
Requirement Factors:
- Sample obtained by simple random sampling or matched-pair design.
- Samples are dependent.
- Differences are normally distributed with no outliers, or $n \ge 30$.
- Sample size $n \le 0.05N$.
Test Statistic for Matched Pairs:
- $t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$
- Follows Student's t-distribution with $n - 1$ degrees of freedom ($d$ stands for "difference"; $\bar{d}$ and $s_d$ are the mean and standard deviation of the $n$ paired differences).
Objective 2: Confidence Intervals for Matched-Pairs Data.
- Formula: $\bar{d} \pm t_{\alpha/2} \left(\frac{s_d}{\sqrt{n}}\right)$
- Point estimate $\pm$ margin of error.
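The matched-pairs statistic and interval above can be sketched as follows. The data here are made up purely for illustration and do not come from any case study in this guide. Because Python's standard library has no t-distribution, the critical value is passed in from a t-table; $t_{0.025} = 2.776$ for 4 degrees of freedom is assumed here. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(sample_1, sample_2, t_crit, mu_d=0.0):
    """Matched-pairs t statistic, degrees of freedom, and CI for the mean difference."""
    diffs = [a - b for a, b in zip(sample_1, sample_2)]  # one difference per pair
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)  # s_d / sqrt(n)
    t_stat = (d_bar - mu_d) / se
    ci = (d_bar - t_crit * se, d_bar + t_crit * se)
    return t_stat, n - 1, ci

# Hypothetical paired measurements (differences are 1, 2, 3, 4, 5).
first = [101, 104, 99, 110, 108]
second = [100, 102, 96, 106, 103]
t_stat, df, ci = paired_t(first, second, t_crit=2.776)
print(f"t = {t_stat:.3f}, df = {df}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
# t = 4.243, df = 4, 95% CI = (1.037, 4.963)
```

The key design point is that the two samples are reduced to a single list of differences first, after which the procedure is an ordinary one-sample t analysis.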
Case Study: Hotel Prices (Hampton Inn vs. La Quinta):
- Data collected from Dallas, Tampa Bay, St. Louis, Seattle, San Diego, Chicago, New Orleans, Phoenix, Atlanta, and Orlando ($n = 10$).
- Objective: Construct a 95% CI for the mean difference.
Case Study: Disney Wait Times:
- Comparing "Pirates of the Caribbean" wait times to "Tiana's Bayou Adventure" (formerly Splash Mountain) wait times.
- Data are paired by specific day and time of day (18 observations).
- Hypothesis: Wait times for Tiana's are longer than for Pirates.

Inference about Two Means: Independent Samples

Objective 1: Test hypotheses regarding the difference of two independent means.
Behrens-Fisher Problem: The case where population variances are unequal and unknown. The solution used is Welch's approximate t.
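Welch's t-test Statistic:
- $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
- Degrees of Freedom ($df$): Use the smaller of $n_1 - 1$ or $n_2 - 1$. This is a conservative choice for hand calculation; statistical software typically uses the Welch-Satterthwaite formula, which gives a $df$ at least as large and usually not an integer.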
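The statistic and both degrees-of-freedom choices can be computed from summary statistics alone. The sketch below is illustrative: the function name is hypothetical, the summary numbers are invented, and the Welch-Satterthwaite expression is the standard one used by software rather than something stated in these notes.

```python
from math import sqrt

def welch_t(mean1, sd1, n1, mean2, sd2, n2, null_diff=0.0):
    """Welch's t statistic plus conservative and Welch-Satterthwaite df."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2  # per-sample variance of the mean
    t_stat = ((mean1 - mean2) - null_diff) / sqrt(v1 + v2)
    df_conservative = min(n1 - 1, n2 - 1)
    df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t_stat, df_conservative, df_welch

# Hypothetical summary statistics, for illustration only.
t_stat, df_c, df_w = welch_t(78.0, 9.0, 12, 71.0, 12.0, 15)
print(f"t = {t_stat:.3f}, conservative df = {df_c}, Welch df = {df_w:.2f}")
# t = 1.731, conservative df = 11, Welch df = 24.93
```

The conservative $df$ leads to a larger critical value and therefore a test that rejects slightly less often than the software version.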
Requirements:
- Simple random samples or randomized experiment.
- Independent samples.
- Populations are normally distributed, or $n_1 \ge 30$ and $n_2 \ge 30$.
- Each sample size is no more than 5% of its population.
Comparison Caution on Pooling:
- Pooled two-sample t-tests assume equal population variances. Because equality of variance is hard to verify, Welch's t is the preferred default.
Case Study: Exam Paper Color:
- Question: Does the color of paper (White vs. Marine Blue) affect results?
- Hypotheses:
  - $H_0: \mu_1 = \mu_2$ (or $\mu_1 - \mu_2 = 0$).
  - $H_1: \mu_1 > \mu_2$ (or $\mu_1 - \mu_2 > 0$).
- Results: $t_0 \approx 2.216$; $P\text{-value} = 0.0175$.
- Conclusion: Since $0.0175 < 0.05$, reject the null hypothesis. There is evidence that scores are higher on white paper.
Objective 2: Confidence Intervals for Independent Means.
- Formula: $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
- Example: Salary by Degree:
  - Engineering ($n_1=80$, $\bar{x}_1=60,300$, $s_1=10,822$) vs. Psychology ($n_2=110$, $\bar{x}_2=34,200$, $s_2=10,698$).
  - 95% CI result: Between \$22,972.98 and \$29,224.03.
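The salary interval can be approximated from the summary statistics. This is a rough sketch: the function name is hypothetical, and the two critical values are assumed t-table values ($t_{0.025} \approx 1.990$ for the conservative $df = 79$, and $t_{0.025} \approx 1.974$ for a Welch-Satterthwaite $df$ of about 169). Because the published means and standard deviations are rounded, the output will be close to, but not identical to, the interval reported above.

```python
from math import sqrt

def welch_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """CI for mu1 - mu2 from summary statistics and a supplied t critical value."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se

salary = dict(mean1=60300, sd1=10822, n1=80, mean2=34200, sd2=10698, n2=110)

# Conservative df = min(80, 110) - 1 = 79 (assumed table value 1.990).
lo_c, hi_c = welch_ci(**salary, t_crit=1.990)
# Software-style df of about 169 (assumed table value 1.974).
lo_w, hi_w = welch_ci(**salary, t_crit=1.974)

print(f"df = 79:   ({lo_c:,.2f}, {hi_c:,.2f})")
print(f"df ~ 169:  ({lo_w:,.2f}, {hi_w:,.2f})")
```

The second interval comes out within a few dollars of the reported one, which suggests the reported bounds were produced by software using the larger $df$; the conservative interval is slightly wider.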

Summary: Which Method to Use?

Step 1: Determine the Parameter:
- Proportion ($p$): Use the normal distribution ($Z$), provided $n\hat{p}(1-\hat{p}) \ge 10$ for each sample and each sample size is $\le 5\%$ of its population.
- Mean ($\mu$): Categorize by sampling method.
Step 2: Determine Sampling Method:
- Independent Samples (Proportion): To test $H_0: p_1 = p_2$, use $Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ with the pooled estimate $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$. For a confidence interval, use the unpooled standard error $\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$.
- Independent Samples (Mean): Use Student's t with Welch's adjustment $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, with the conservative $df = \min(n_1-1, n_2-1)$.
- Dependent Samples (Mean/Matched-Pairs): Use Student's t with $t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$, $df = n - 1$.
Final Examples for Identification:
- Economic System Fair Poll: Comparing 200 Democrats and 160 Republicans (Independent, Proportion).
- Pulse Rates: Testing pulse before and after a fright (Dependent, Mean).
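The two-step decision in this summary can be written as a small lookup. This is only an illustrative sketch: the function name and the label strings are hypothetical, and combinations that the guide does not cover (such as dependent proportions) raise an error instead of guessing.

```python
def choose_method(parameter, sampling):
    """Map (parameter, sampling method) to the procedure named in the summary."""
    methods = {
        ("proportion", "independent"): "two-proportion z",
        ("mean", "independent"): "Welch's t, df = min(n1 - 1, n2 - 1)",
        ("mean", "dependent"): "matched-pairs t, df = n - 1",
    }
    key = (parameter.lower(), sampling.lower())
    if key not in methods:
        raise ValueError(f"combination not covered in this guide: {key}")
    return methods[key]

# The two identification examples above.
print(choose_method("proportion", "independent"))  # Democrats vs. Republicans poll
print(choose_method("mean", "dependent"))          # pulse before and after a fright
```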