Comparing Two Population Parameters: Exhaustive Guide to Proportions and Means

Distinguishing between Independent and Dependent Sampling

  • Objective 1: Distinguish between independent and dependent sampling.

  • Definitions:
    - Independent Sampling: A sampling method is considered independent when an individual selected for one sample does not dictate which individual is to be in a second sample.
    - Dependent Sampling: A sampling method is dependent when an individual selected to be in one sample is used to determine the individual in the second sample.
    - Matched-Pairs Samples: This is another term for dependent samples. In some cases, an individual might be matched against him- or herself (e.g., before-and-after measurements).

  • Examples of Distinguishing Sampling Methods:
    - Example (a): Hotel Prices: A researcher compares the price of a one-night stay at a Holiday Inn Express versus a Red Roof Inn. She randomly selects 8 towns where the hotel locations are close to each other.
      - Classification: Dependent.
      - Reasoning: Once a location is picked for the Holiday Inn Express, the Red Roof Inn must be chosen from the same or a nearby location.
    - Example (b): Comic Preferences: A polling agency obtained a random sample of 500 young adults (18-39) and 550 senior adults (60-89) and asked if they prefer Marvel or DC Comics.
      - Classification: Independent.
      - Reasoning: The selection of young adults in the first sample is not related to the selection of seniors in the second sample.

Hypothesis Testing for Two Population Proportions (Independent Samples)

  • Objective 2: Test hypotheses regarding two proportions from independent samples.

  • Sampling Distribution Principles:
    - To conduct inference, the sampling distribution of the difference of two proportions must be determined. With random sampling or random assignment and sufficiently large samples, this distribution is approximately normal.
    - Parameters and Statistics:
      - Suppose a simple random sample of size $n_1$ is taken from population 1, where $x_1$ individuals have a characteristic.
      - A simple random sample of size $n_2$ is independently taken from population 2, where $x_2$ individuals have the same characteristic.
      - The sample proportions are defined as $\hat{p}_1 = \frac{x_1}{n_1}$ and $\hat{p}_2 = \frac{x_2}{n_2}$.

  • Distribution Properties:
    - The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal if requirements are met.
    - Mean: $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$.
    - Standard Deviation: $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$.

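  • Standardized Test Statistic (Z-score):
    - $Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$
    - This follows an approximate standard normal distribution.
    - When testing $H_0: p_1 = p_2$, the standard error is commonly computed from the pooled estimate $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$, giving $Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$. The P-values reported in the case studies of this section are consistent with the pooled form.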

  • Requirements for the Hypothesis Test:
    - Samples must be independently obtained via simple random sampling or a completely randomized experiment with two levels of treatment.
    - Success-failure counts: $n_1 \hat{p}_1 (1 - \hat{p}_1) \ge 10$ and $n_2 \hat{p}_2 (1 - \hat{p}_2) \ge 10$.
    - Independence of observations: Each sample size must be no more than 5% of its population size ($n_1 \le 0.05N_1$ and $n_2 \le 0.05N_2$).

  • Case Study 1: Diabetes Treatments (The ADOPT Study):
    - Context: Does Avandia (a diabetes drug) increase heart attacks compared to other treatments?
    - Data points:
      - Group 1 (Avandia): $n_1 = 1456$, $x_1 = 27$, $\hat{p}_1 = \frac{27}{1456} \approx 0.0185$.
      - Group 2 (Other): $n_2 = 2895$, $x_2 = 41$, $\hat{p}_2 = \frac{41}{2895} \approx 0.0142$.
    - Requirement Verification:
      - $1456(0.0185)(1-0.0185) = 26.44 \ge 10$ (Check).
      - $2895(0.0142)(1-0.0142) = 40.53 \ge 10$ (Check).
    - Hypotheses:
      - $H_0: p_1 = p_2$ (or $H_0: p_1 - p_2 = 0$).
      - $H_1: p_1 > p_2$ (or $H_1: p_1 - p_2 > 0$).
    - Results: $Z_{stat} \approx 1.10$; $P\text{-value} = 0.1358$.
    - Conclusion: Since $0.1358 > 0.05$, we do not reject the null hypothesis. There is not sufficient evidence that the proportion of heart attacks is higher with Avandia.

  • Case Study 2: Unplugging from Devices (Harris Interactive Survey):
    - Context: Is there a difference in feelings about returning to a time before being "plugged in" between adults (35-54) and baby-boomers (55+)?
    - Data points:
      - Group 1 (Adults): $n_1 = 500$, $x_1 = 385$, $\hat{p}_1 = \frac{385}{500} = 0.77$.
      - Group 2 (Boomers): $n_2 = 500$, $x_2 = 300$, $\hat{p}_2 = \frac{300}{500} = 0.60$.
    - Requirement Verification:
      - $500(0.77)(1-0.77) = 88.55 \ge 10$ (Check).
      - $500(0.60)(1-0.60) = 120 \ge 10$ (Check).
    - Hypotheses:
      - $H_0: p_1 - p_2 = 0$.
      - $H_1: p_1 - p_2 \neq 0$.
    - Results: $Z_{stat} = 5.787$; $P\text{-value} < 0.0001$.
    - Conclusion: Since the P-value is less than $0.05$, we reject the null hypothesis.
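The two case studies above can be checked with a short script. This is a minimal sketch, not a definitive implementation: it assumes the pooled standard error under $H_0: p_1 = p_2$, and the function names `two_proportion_z_test` and `large_sample_ok` are made up for illustration. It uses only the Python standard library.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Pooled two-proportion z-test of H0: p1 = p2.

    Returns (z, p_value). `alternative` is "two-sided", "greater"
    (H1: p1 > p2), or "less" (H1: p1 < p2).
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Assumption of this sketch: pool the samples to estimate the common
    # proportion under H0.
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se

    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    if alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return z, p_value


def large_sample_ok(x, n):
    """Check the n * p_hat * (1 - p_hat) >= 10 requirement for one sample."""
    p_hat = x / n
    return n * p_hat * (1 - p_hat) >= 10


# Case Study 1 (ADOPT): one-sided test, H1: p1 > p2.
z_adopt, p_adopt = two_proportion_z_test(27, 1456, 41, 2895, alternative="greater")

# Case Study 2 (unplugging survey): two-sided test.
z_unplug, p_unplug = two_proportion_z_test(385, 500, 300, 500)

print(f"ADOPT:     z = {z_adopt:.2f}, P-value = {p_adopt:.4f}")
print(f"Unplugged: z = {z_unplug:.3f}, P-value = {p_unplug:.2e}")
```

With the ADOPT counts, the pooled statistic is about 1.10 with a one-sided P-value of about 0.1358; with the survey counts it is about 5.787.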

Confidence Intervals for Two Population Proportions

  • Objective 3: Construct and interpret confidence intervals for the difference between two population proportions.

  • Formula for a $(1 - \alpha) \times 100\%$ Confidence Interval:
    - Lower Bound: $(\hat{p}_1 - \hat{p}_2) - Z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
    - Upper Bound: $(\hat{p}_1 - \hat{p}_2) + Z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

  • Requirement Check: Same as the hypothesis test requirements (SRS, counts $\ge 10$, $n \le 0.05N$).

  • Case Study: Position on Divorce (Harris Interactive Survey):
    - Groups: Religious ($n_1 = 970$, $x_1 = 834$, $\hat{p}_1 \approx 0.860$) and Not Religious ($n_2 = 1285$, $x_2 = 1157$, $\hat{p}_2 \approx 0.900$).
    - Interval Calculation (95% Confidence):
      - Lower Bound: $-0.0679$
      - Upper Bound: $-0.0133$
    - Interpretation: We are 95% confident that the proportion of religious people who find divorce acceptable is between 1.33 and 6.79 percentage points lower than that of non-religious individuals.
    - Note: Since 0 is not included in the interval and both bounds are negative ($p_1 < p_2$), we conclude there is a significant difference.
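As a rough check of the interval formula above, the following sketch recomputes the divorce-survey bounds from the raw counts. The function name `two_proportion_ci` is hypothetical, and the critical value comes from the standard library's `NormalDist` and not from a printed table, so the last digit can differ slightly from hand calculations that use rounded proportions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Critical value z_{alpha/2}; for 95% confidence this is about 1.96.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    margin = z_crit * se
    diff = p1_hat - p2_hat
    return diff - margin, diff + margin


# Religious (group 1) vs. not religious (group 2), counts from the case study.
lower, upper = two_proportion_ci(834, 970, 1157, 1285)
print(f"95% CI for p1 - p2: ({lower:.4f}, {upper:.4f})")
```

Because both bounds are negative, the interval excludes 0, which is the basis for the conclusion in the note above.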

Sample Size Estimation for Proportions

  • Objective 4: Determine the sample size necessary for estimating the difference between two population proportions.

  • Calculation Rule: Always round up to the next integer for sample size calculations.

  • Formula Case 1: Prior estimates $\hat{p}_1$ and $\hat{p}_2$ are available:
    - $n = n_1 = n_2 = \left[\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)\right] \left(\frac{Z_{\alpha/2}}{E}\right)^2$

  • Formula Case 2: Prior estimates are unavailable:
    - $n = n_1 = n_2 = 0.5 \left(\frac{Z_{\alpha/2}}{E}\right)^2$

  • Case Study: Prenatal Care Mothers (15-19 years vs. 30-34 years):
    - Goal: Estimate the difference within 2 percentage points ($E = 0.02$) with 95% confidence.
    - (a) With Prior Estimates ($\hat{p}_1 = 0.98$, $\hat{p}_2 = 0.992$):
      - Result: $[0.98(0.02) + 0.992(0.008)] \left(\frac{1.96}{0.02}\right)^2 \approx 264.5$, so the sample size required is $n = 265$.
    - (b) Without Prior Estimates (Using 0.5):
      - Result: $0.5 \left(\frac{1.96}{0.02}\right)^2 = 4802$, so the sample size required is $n = 4802$.
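The two sample-size formulas can be wrapped in one small helper. This is a sketch under stated assumptions: the name `sample_size_two_proportions` and its parameters are illustrative, both groups get the same size $n$, and the result is rounded up as the calculation rule requires.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(margin, confidence=0.95,
                                p1_prior=None, p2_prior=None):
    """Per-group sample size for estimating p1 - p2 within `margin`."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    if p1_prior is None or p2_prior is None:
        # No prior estimates: use 0.5 for both, so 0.25 + 0.25 = 0.5.
        variance_term = 0.5
    else:
        variance_term = p1_prior * (1 - p1_prior) + p2_prior * (1 - p2_prior)
    # Always round up to the next integer.
    return ceil(variance_term * (z_crit / margin) ** 2)


# Prenatal-care case study, E = 0.02 at 95% confidence.
n_with_priors = sample_size_two_proportions(0.02, p1_prior=0.98, p2_prior=0.992)
n_without_priors = sample_size_two_proportions(0.02)
print(n_with_priors, n_without_priors)  # prints: 265 4802
```

The large gap between the two results shows how much a prior estimate near 0 or 1 reduces the required sample size.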

Inference about Two Means: Dependent Samples (Matched-Pairs)

  • Objective 1: Test hypotheses about the population mean difference using matched-pairs data.

  • Requirements:
    - The sample is obtained by simple random sampling or through a matched-pairs design.
    - The samples are dependent.
    - The differences are normally distributed with no outliers, or the number of pairs is large ($n \ge 30$).
    - The sample size is no more than 5% of the population size ($n \le 0.05N$).

  • Test Statistic for Matched Pairs:
    - $t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$
    - Follows Student's t-distribution with $n - 1$ degrees of freedom ($d$ stands for "difference").

  • Objective 2: Confidence Intervals for Matched-Pairs Data.
    - Formula: $\bar{d} \pm t_{\alpha/2} \left(\frac{s_d}{\sqrt{n}}\right)$
    - Point estimate $\pm$ margin of error.

  • Case Study: Hotel Prices (Hampton Inn vs. La Quinta):
    - Data collected from Dallas, Tampa Bay, St. Louis, Seattle, San Diego, Chicago, New Orleans, Phoenix, Atlanta, and Orlando ($n = 10$).
    - Objective: Construct a 95% CI for the mean difference.

  • Case Study: Disney Wait Times:
    - Comparing "Pirates of the Caribbean" wait times to "Tiana's Bayou Adventure" (formerly Splash Mountain).
    - Data is paired by specific day and time of day (18 observations).
    - Hypothesis: Wait times for Tiana's are longer than for Pirates.
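The matched-pairs test statistic and confidence interval from this section can be sketched as follows. The data here are made up purely for illustration (the guide does not list the hotel or wait-time observations), the function name `matched_pairs_summary` is hypothetical, and the critical value is passed in from a t-table because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev


def matched_pairs_summary(first, second, t_crit, mu_d0=0.0):
    """Return (t statistic, df, confidence interval) for paired data.

    Differences are computed as first - second. `t_crit` is t_{alpha/2}
    with n - 1 degrees of freedom, looked up by the caller.
    """
    if len(first) != len(second):
        raise ValueError("matched pairs need equal-length samples")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    d_bar = mean(diffs)       # sample mean of the differences
    s_d = stdev(diffs)        # sample standard deviation of the differences
    se = s_d / sqrt(n)
    t_stat = (d_bar - mu_d0) / se
    margin = t_crit * se
    return t_stat, n - 1, (d_bar - margin, d_bar + margin)


# Hypothetical paired measurements (NOT data from the case studies).
after = [12.0, 15.0, 19.0, 22.0]
before = [10.0, 11.0, 13.0, 14.0]

# 3.182 is the table value of t_{0.025} with 3 degrees of freedom.
t_stat, df, ci = matched_pairs_summary(after, before, t_crit=3.182)
print(f"t = {t_stat:.3f}, df = {df}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Working on the differences reduces the two-sample problem to a one-sample t procedure, which is why only one standard deviation, $s_d$, appears.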

Inference about Two Means: Independent Samples

  • Objective 1: Test hypotheses regarding the difference of two independent means.

  • Behrens-Fisher Problem: The case where population variances are unequal and unknown. The solution used is Welch's approximate t.

  • Welch's t-test Statistic:
    - $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
    - Degrees of Freedom ($df$): Use the smaller of $n_1 - 1$ and $n_2 - 1$. This is a conservative choice for hand calculation; statistical software typically uses the larger Welch-Satterthwaite approximation.

  • Requirements:
    - Simple random samples or randomized experiment.
    - Independent samples.
    - Populations are normally distributed, or $n_1 \ge 30$ and $n_2 \ge 30$.
    - Each sample size is no more than 5% of its population.

  • Comparison Caution on Pooling:
    - Pooled two-sample t-tests assume equal population variances. Because equality of variances is hard to verify, Welch's t is generally preferred.

  • Case Study: Exam Paper Color:
    - Question: Does the color of paper (White vs. Marine Blue) affect results?
    - Hypotheses:
      - $H_0: \mu_1 = \mu_2$ (or $\mu_1 - \mu_2 = 0$).
      - $H_1: \mu_1 > \mu_2$ (or $\mu_1 - \mu_2 > 0$).
    - Results: $t_0 \approx 2.216$; $P\text{-value} = 0.0175$.
    - Conclusion: Since $0.0175 < 0.05$, reject the null hypothesis. There is evidence that scores are higher on white paper.

  • Objective 2: Confidence Intervals for Independent Means.
    - Formula: $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
    - Example: Salary by Degree:
      - Engineer ($n_1 = 80$, $\bar{x}_1 = 60{,}300$, $s_1 = 10{,}822$) vs. Psychology ($n_2 = 110$, $\bar{x}_2 = 34{,}200$, $s_2 = 10{,}698$).
      - 95% CI result: Between $22,972.98 and $29,224.03.
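The Welch calculation can be sketched from summary statistics alone. The function name `welch_summary` is hypothetical. Besides the conservative degrees of freedom used in this guide, the sketch also reports the Welch-Satterthwaite approximation, which is an addition here and not part of the guide's hand-calculation method.

```python
from math import sqrt


def welch_summary(mean1, s1, n1, mean2, s2, n2, hypothesized_diff=0.0):
    """Return (standard error, t statistic, conservative df, Welch df)."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2   # per-sample variance of the mean
    se = sqrt(v1 + v2)
    t_stat = ((mean1 - mean2) - hypothesized_diff) / se
    # Conservative degrees of freedom, as used in this guide.
    df_conservative = min(n1 - 1, n2 - 1)
    # Welch-Satterthwaite approximation (extra; typical of software output).
    df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, t_stat, df_conservative, df_welch


# Salary example: group 1 (n = 80) vs. Psychology (n = 110).
se, t_stat, df_small, df_welch = welch_summary(60300, 10822, 80,
                                               34200, 10698, 110)
print(f"SE = {se:.2f}, t = {t_stat:.2f}, "
      f"conservative df = {df_small}, Welch df = {df_welch:.1f}")
```

Multiplying the standard error by the appropriate $t_{\alpha/2}$ critical value gives the margin of error for the confidence interval; the conservative degrees of freedom produce a slightly wider interval than the Welch-Satterthwaite value.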

Summary: Which Method to Use?

  • Step 1: Determine the Parameter:
    - Proportion ($p$): Use the normal distribution ($Z$), provided $n\hat{p}(1-\hat{p}) \ge 10$ and the sample size is $\le 5\%$ of the population.
    - Mean ($\mu$): Categorize by sampling method.

  • Step 2: Determine Sampling Method:
    - Independent Samples (Proportion): Use $Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$. For a test of $H_0: p_1 = p_2$, the standard error is commonly computed from the pooled estimate $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$.
    - Independent Samples (Mean): Use Student's t with Welch's adjustment, $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, $df = \min(n_1 - 1, n_2 - 1)$.
    - Dependent Samples (Mean/Matched-Pairs): Use Student's t with $t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$, $df = n - 1$.

  • Final Examples for Identification:
    - Economic System Fair Poll: Comparing 200 Democrats and 160 Republicans (Independent, Proportion).
    - Pulse Rates: Testing pulse before and after a fright (Dependent, Mean).
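The two-step decision above can be written as a small lookup. This is only an illustrative sketch: the function name `choose_method` and the label strings are made up, and combinations the summary does not cover (such as dependent proportions) raise an error instead of guessing.

```python
def choose_method(parameter, sampling):
    """Map (parameter, sampling method) to the procedure named in the summary."""
    methods = {
        ("proportion", "independent"): "two-proportion z (normal distribution)",
        ("mean", "independent"): "Welch's t, df = min(n1 - 1, n2 - 1)",
        ("mean", "dependent"): "matched-pairs t, df = n - 1",
    }
    key = (parameter.lower(), sampling.lower())
    if key not in methods:
        raise ValueError(f"combination not covered by this summary: {key}")
    return methods[key]


# The two identification examples from the summary.
poll_method = choose_method("proportion", "independent")   # Democrats vs. Republicans
pulse_method = choose_method("mean", "dependent")          # pulse before and after
print(poll_method)
print(pulse_method)
```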