Inference for a Difference in Proportions
6.3 Inference for a Difference in Proportions
Assignment Note: If you're working on the assignment, use the methods from section four, not the ones covered in this lecture.
A worked example from last semester (6.2, getting enough sleep) is available on Learn.
Inference About Two Proportions
Inference involves two categorical variables.
Explanatory Variable: Divides the population into two categories.
Response Variable: The proportion of cases meeting a certain criterion is compared across the two categories.
Example 1 (Single Proportion): Comparing students using Mac vs. Windows computers.
- Only one categorical variable: computer type.
Example 2 (Two Proportions): Comparing students studying abroad (semester exchange) from public vs. private schools.
- Two categorical variables: school type (public/private) and participation in a semester exchange.
Example 3 (Single Proportion): Comparing students from Christchurch vs. outside Christchurch.
- One categorical variable: origin (Christchurch or elsewhere).
Example 4 (Two Proportions): Comparing students receiving financial aid from Christchurch vs. outside Christchurch.
- Explanatory: origin.
- Response: receiving financial aid.
Standard Error for a Difference in Proportions
A standard error for a difference in proportions can be calculated.
Formula (Section 6.3): The formulas are complex.
Use of Excel: Excel will be used for calculations.
Sample Size: Larger sample sizes lead to smaller standard errors.
Census Data Example (2018)
Comparing the proportion of 25-64 year olds with tertiary education in New Zealand and Australia.
Population Proportions:
New Zealand: 38% (0.38).
Australia: 46% (0.46).
Since it's census data, treat it as the population.
Random Samples: If random samples of 50 people are taken from both populations, the sample proportions will vary.
- Objective: Understand the distribution of the statistic.
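The idea of repeatedly drawing samples of 50 from each population can be sketched as a small simulation. This is only an illustration in Python rather than the Excel workflow used in the course; the number of repetitions (`REPS`), the seed, and the helper name `sample_proportion` are assumptions made for the sketch.

```python
import random
import statistics

random.seed(101)  # fixed seed so the sketch is reproducible (arbitrary choice)

P_NZ, P_AUS = 0.38, 0.46  # population proportions from the census example
N = 50                    # sample size drawn from each population
REPS = 5000               # number of simulated pairs of samples (assumption)


def sample_proportion(p, n):
    """Proportion of 'successes' in one random sample of size n."""
    return sum(random.random() < p for _ in range(n)) / n


# One difference (New Zealand minus Australia) per pair of samples.
diffs = [sample_proportion(P_NZ, N) - sample_proportion(P_AUS, N)
         for _ in range(REPS)]

sim_mean = statistics.mean(diffs)
sim_sd = statistics.stdev(diffs)
print(f"mean of simulated differences: {sim_mean:.3f}")
print(f"spread of simulated differences: {sim_sd:.3f}")
```

The simulated differences should cluster around 0.38 - 0.46 = -0.08, with individual samples landing on either side of that value.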
Negative Difference: If calculated as New Zealand - Australia, the difference will be negative.
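Standard Error Calculation: Even though the statistic is a difference in proportions, the two fractions are added under the square root.
Always add: the variability from the two independent samples combines, and adding also keeps the quantity under the square root from going negative, which would give an imaginary number.
Imaginary numbers are not used in STAT 101.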
Smallest Standard Error: The smallest possible standard error is zero.
- A negative measure of variability is not possible.
Formula: Substitute numbers on paper first.
SE = \sqrt{p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2}
Sampling Distribution: Provided the sample sizes are large enough, the statistic will be approximately normally distributed and centered on the difference in population proportions.
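As a rough check of the standard error calculation for the census example, here is a minimal Python sketch (the course itself uses Excel). The function name `se_diff_proportions` and the second sample size of 500 are assumptions added for illustration only.

```python
import math


def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference in proportions.

    The two fractions are added under the square root, even though
    the statistic itself is a difference.
    """
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Census example: New Zealand 0.38, Australia 0.46, samples of 50 each.
se_small = se_diff_proportions(0.38, 50, 0.46, 50)

# Hypothetical larger samples of 500 each, to show the effect of sample size.
se_large = se_diff_proportions(0.38, 500, 0.46, 500)

print(f"SE with n = 50:  {se_small:.4f}")
print(f"SE with n = 500: {se_large:.4f}")
```

With samples of 50 the standard error comes out at roughly 0.098, and it shrinks when the sample sizes increase.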
Conditions for Normality
From Section 6.1: Distribution of a single sample proportion is normal if n * p and n * (1 - p) are at least 10.
Difference in Proportions: The same condition applies, but it must be checked for both categories (four checks in total).
Sample Sizes: Generally, large sample sizes will satisfy these conditions.
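The four checks can be written out as a short Python sketch. The function name `normal_conditions_met` and the small-sample case of n = 20 are hypothetical, added only to show one case that passes and one that fails.

```python
def normal_conditions_met(n1, p1, n2, p2, threshold=10):
    """Return True if n*p and n*(1-p) are at least `threshold` in both groups."""
    checks = [
        n1 * p1, n1 * (1 - p1),  # group 1: two checks
        n2 * p2, n2 * (1 - p2),  # group 2: two checks
    ]
    return all(value >= threshold for value in checks)


# Census example with samples of 50: 19, 31, 23 and 27 are all at least 10.
ok_50 = normal_conditions_met(50, 0.38, 50, 0.46)

# Hypothetical samples of 20: 20 * 0.38 = 7.6 falls below 10.
ok_20 = normal_conditions_met(20, 0.38, 20, 0.46)

print(ok_50, ok_20)
```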
Confidence Interval
Structure: Statistic ± Margin of Error.
Application: The same principle applies to a difference in proportions.
Population Proportions: Substitute sample proportions when population proportions are unknown.
Penguin Tagging Example
Variables:
Penguin tag type (metal vs. electronic).
Penguin survival over 10 years (yes/no).
Explanatory: tag type.
Response: survival.
Goal: 90% confidence interval for the difference in proportions.
Group Definitions:
- Group 1: Metal tags.
- Group 2: Electronic tags.
Sample Proportions:
- Metal tags: 33 survived out of 167 (approximately 20%).
- Electronic tags: 68 survived out of 189 (approximately 36%).
Sample sizes:
- n_1 = 167
- n_2 = 189
Checking Conditions: Verify n * p and n * (1 - p) are at least 10 for both groups before calculations.
- Metal tags: 33 survived and 167 - 33 = 134 did not => meets criteria.
- Electronic tags: 68 survived and 189 - 68 = 121 did not => meets criteria.
Formula Sheet: Formulas will be provided on a formula sheet for the exam.
Z-score: Check a Z table for the appropriate Z-score for a 90% confidence interval (approximately 1.645).
Implementation: Use the original fractions (33/167 and 68/189) in the Excel calculations rather than the rounded decimals, so that the numbers are not rounded off too soon.
Limits:
- The confidence interval for metal minus electronic runs from -0.24 to -0.09. On a number line, the whole interval sits below zero, so a zero difference in proportions is outside it.
Interpretation: With 90% confidence, the true 10-year survival proportion is between 9 and 24 percentage points higher for electronic-tagged penguins than for metal-tagged penguins.
- The setup can be reversed to work with positive numbers: for electronic minus metal, the interval is 0.09 to 0.24. Only the order of subtraction has changed.
Flexibility: As long as the groups are defined at the beginning, either approach is valid.
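The penguin interval can be reproduced with a short Python sketch, shown here as an alternative to the Excel calculation. It keeps the original fractions and takes the 90% Z-score from the standard library's `statistics.NormalDist` instead of a Z table; the variable names are assumptions.

```python
import math
from statistics import NormalDist

# Group 1: metal tags, Group 2: electronic tags (counts from the example).
x1, n1 = 33, 167
x2, n2 = 68, 189

p1, p2 = x1 / n1, x2 / n2  # keep the unrounded fractions
diff = p1 - p2             # metal minus electronic, so negative here

# Sample proportions stand in for the unknown population proportions.
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# 90% confidence leaves 5% in each tail, so use the 95th percentile.
z_star = NormalDist().inv_cdf(0.95)

margin = z_star * se
lower, upper = diff - margin, diff + margin
print(f"90% CI (metal - electronic): {lower:.2f} to {upper:.2f}")
```

Swapping the two groups at the top flips the signs and gives the interval for electronic minus metal.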
Game Show Example (Split or Steal)
Variables:
- Contestant's age (under 30 vs. over 30).
- Choice (split or steal).
Research Question: Are younger or older players more cooperative?
- Two-tailed test.
Proportions:
- P_O = proportion of over-30s that split.
- P_U = proportion of under-30s that split.
Sample Statistic:
- Sample proportion of over-30s that split = 217/378.
- Sample proportion of under-30s that split = 82/196.
Difference in Sample Proportions: 217/378 - 82/196, approximately 0.156 (15.6 percentage points).
Hypothesis Test: Determine if the difference is significant enough to conclude a difference in the populations.
Null Hypothesis: The two population proportions are the same. The null hypothesis does not give any particular values for them, so there are no hypothesized p1 and p2 to use in the standard error calculation.
Pooled Proportion: Combine both groups into one big group and calculate the proportion of people who split, regardless of whether they were under or over 30.
Pooled Proportion
Calculate the proportion of people that split from the combined group.
Pooled proportion = (299 that chose to split) / (total sample size) = (217 + 82) / (378 + 196) = 299 / 574
= 0.521 (rounded to three d.p.).
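The pooled proportion is simple enough to check by hand, but a few lines of Python make the combination explicit. The variable names are assumptions.

```python
# Counts from the Split or Steal example.
split_over, n_over = 217, 378    # over-30s who split, and group size
split_under, n_under = 82, 196   # under-30s who split, and group size

# Combine both groups into one and take the overall proportion that split.
pooled = (split_over + split_under) / (n_over + n_under)
print(round(pooled, 3))  # 0.521
```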
Standard Error: In the standard error formula for the test, p hat by itself (with no subscript) is the pooled proportion.
To answer whether younger or older players are more cooperative, label the groups O (over 30) and U (under 30) so the notation is consistent with the notes.
Hypotheses:
Null: No difference in population proportions, or P_O = P_U.
Alternative: There is a difference (two-tailed test).
Decision: Use a p-value.
Test Statistic: To get a p-value, first need a test statistic.
Formula: z = ((p hat_O - p hat_U) - 0) / SE
The 0 is the null value: no difference. In these examples the null hypothesis is always that there is no difference between the two population proportions. In theory, you could run a hypothesis test that starts from one population having a proportion 10% larger than the other, in which case the null value would be 0.1.
SE = \sqrt{(Pooled Proportion) * (1 - Pooled Proportion) / n_O + (Pooled Proportion) * (1 - Pooled Proportion) / n_U}
Plug in:
- Numerator: the difference in sample proportions worked out before (217/378 - 82/196, not rounded yet), minus zero for completeness.
- Denominator: the pooled proportion in both fractions, divided by n_O = 378 in the first and n_U = 196 in the second.
Resulting test statistic: 3.54 (to two decimal places).
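The test statistic can be checked with a short Python sketch in place of the Excel calculation. The two-tailed p-value step is an addition: the notes say the decision uses a p-value but do not state its value, so that part is an illustration using the standard library's `statistics.NormalDist`. Variable names are assumptions.

```python
import math
from statistics import NormalDist

# Group O: over 30, Group U: under 30 (counts of contestants who split).
x_o, n_o = 217, 378
x_u, n_u = 82, 196

p_hat_o, p_hat_u = x_o / n_o, x_u / n_u   # unrounded sample proportions
pooled = (x_o + x_u) / (n_o + n_u)        # pooled proportion under the null

# Standard error uses the pooled proportion in both fractions.
se = math.sqrt(pooled * (1 - pooled) / n_o + pooled * (1 - pooled) / n_u)

null_value = 0  # null hypothesis: no difference in population proportions
z = ((p_hat_o - p_hat_u) - null_value) / se

# Two-tailed p-value from the standard normal distribution (added step).
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}")
print(f"two-tailed p-value = {p_value:.4f}")
```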