Chapter 8: Sampling and Inference
8.1 Getting Ready
Focus on theoretical and practical issues related to sampling and statistical inference.
Important for understanding hypothesis testing in the next chapter.
Requires loading the county20.rda data set and attaching the DescTools and plotrix libraries.
8.2 Statistics and Parameters
Social scientists often seek to make general statements about populations (e.g., voters or funding patterns).
Data from entire populations is often unavailable, leading to reliance on samples.
Sample statistics are calculated and assumed to represent population parameters.
Table 8.1: Symbols and Formulas
Sample statistics: Mean ($\bar{x}$), Variance ($S^2$), Standard Deviation ($S$)
Population parameters: Mean ($\mu$), Variance ($\sigma^2$), Standard Deviation ($\sigma$)
Statistical inference involves generalizing from sample statistics to population parameters.
Example: Estimating how often individuals access the internet for political news using sample data.
Formulas
Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}$
Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_{i}}{N}$
Variance: $S^2 = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}$
Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_{i} - \mu)^{2}}{N}$
Standard Deviation: $S = \sqrt{\frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}}$
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_{i} - \mu)^{2}}{N}}$
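The difference between the sample and population versions of these formulas can be checked by hand in R. The sketch below uses a small made-up vector (not the county data), and the object names are chosen here for illustration only.

```r
# Toy data (hypothetical values, not from county20)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample statistics: divide the sum of squared deviations by n - 1
x_bar <- sum(x) / n
s2 <- sum((x - x_bar)^2) / (n - 1)
s <- sqrt(s2)

# Population parameters: treat x as the whole population and divide by N
mu <- sum(x) / n
sigma2 <- sum((x - mu)^2) / n
sigma <- sqrt(sigma2)

# Base R's var() and sd() use the n - 1 (sample) versions
all.equal(s2, var(x))
all.equal(s, sd(x))
```

Because the only difference is the denominator, the sample variance is always slightly larger than the population-style variance computed from the same values, and the gap shrinks as n grows.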
8.3 Sampling Error
Sampling error will always exist even with large, representative samples.
Example: Drawing colored ping pong balls from a bucket (6000 total, evenly distributed by color).
Random samples may not perfectly reflect the population distribution.
With random and unbiased selection, samples should approximate the population.
Margin of error is often reported (e.g., “±2 percentage points”).
Focus on county-level election returns from the 2020 presidential election for concrete examples.
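The ping pong ball example above can be simulated as a rough sketch. The notes give the bucket size (6000) and say the colors are evenly distributed, but not how many colors there are; the four colors and the sample size of 100 used here are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed so the draw can be repeated

# Hypothetical bucket: 6000 balls split evenly across four colors
# (the number of colors is an assumption for illustration)
colors <- c("red", "blue", "green", "yellow")
bucket <- rep(colors, each = 1500)

# Draw a random sample of 100 balls without replacement
draw <- sample(bucket, 100)

# Sample proportions by color; each population proportion is exactly 0.25
sample_props <- prop.table(table(factor(draw, levels = colors)))

# Sampling error: how far each sample proportion falls from 0.25
sampling_error <- sample_props - 0.25
round(sampling_error, 3)
```

Even with a fair draw, the sample proportions rarely equal 0.25 exactly; rerunning with a different seed gives a different set of small deviations.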
Example Code and Data Attributes
Dimensions and variable names for the county20 data set: dim(county20) outputs [1] 3152 10.
Variables include: state, county_fips, county_name, votes_gop20, votes_dem20, total_votes20, other_vote20, djtpct20, jrbpct20, d2pty20.
Variable of interest: Biden's percent of the two-party vote (d2pty20).
Summary Statistics
Mean, Median, and Histogram graphical displays.
Population-level statistics indicate skewness; mean > median. Prudent to investigate sampling and calculate means from random samples.
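The link between skewness and the mean-median gap can be illustrated with synthetic data. This is only a sketch: the exponential values below are a hypothetical stand-in, not the county returns.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Synthetic right-skewed values (hypothetical stand-in, not the county data)
x <- rexp(3000, rate = 1 / 30)

# In a right-skewed distribution the long upper tail pulls the mean above the median
c(mean = mean(x), median = median(x))
```

With the real data loaded, the same comparison would presumably use mean(county20$d2pty20, na.rm = TRUE) and median(county20$d2pty20, na.rm = TRUE).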
8.4 Sampling Distributions
A sampling distribution consists of numerous sample means computed from repeated samples.
Key characteristics of sampling distributions:
Mean of sampling distribution = population mean ($\mu = 34.04$).
Sample means cluster around the population mean, and the distribution of sample means approaches normality as n increases, as described by the Central Limit Theorem (CLT).
Key Properties of Sampling Distribution
Generally follows a normal distribution.
The mean of the sample means approximates the population mean.
The standard deviation of the sampling distribution is known as the standard error.
Central Limit Theorem (CLT): For sufficiently large samples (a common rule of thumb is n > 30), the distribution of sample means is approximately normal, largely regardless of the shape of the population distribution.
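These properties can be checked with a small simulation. The sketch below assumes a deliberately skewed hypothetical population (exponential, not the county data); the function name sample_mean_dist and the sample sizes are choices made here for illustration.

```r
set.seed(2020)  # arbitrary seed for reproducibility

# Hypothetical skewed population (exponential), not the county data
population <- rexp(10000, rate = 1)
mu <- mean(population)
sigma <- sqrt(mean((population - mu)^2))  # population SD (divide by N)

# Draw many samples of size n and keep each sample mean
sample_mean_dist <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

means_n5  <- sample_mean_dist(5)
means_n50 <- sample_mean_dist(50)

# Center of each sampling distribution vs. the population mean
c(mu = mu, n5 = mean(means_n5), n50 = mean(means_n50))

# Spread of each sampling distribution vs. sigma / sqrt(n)
c(sd_n5 = sd(means_n5), expected = sigma / sqrt(5))
c(sd_n50 = sd(means_n50), expected = sigma / sqrt(50))
```

Both sets of sample means center on the population mean, and the larger sample size gives a much tighter spread. Plotting hist(means_n5) next to hist(means_n50) would show the n = 50 distribution looking far closer to a normal curve.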
Simulating Sampling Distribution
Simulated by taking repeated samples from a known population.
R Code Example:
Draw 50 samples, each containing 50 counties.
Store average of each sample:
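    # Pre-allocate a vector to hold the 50 sample means
    sample_means50 <- numeric(50)
    for (i in 1:50) {
      # Draw 50 counties at random and average Biden's two-party share
      samp <- sample(county20$d2pty20, 50)
      sample_means50[i] <- mean(samp, na.rm = TRUE)
    }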
8.5 Confidence Intervals
A confidence interval (CI) gives a range of values around the sample estimate that is likely to contain the true population parameter.
For large random samples, the sampling distribution is approximately normal, which is what allows a CI to be built from the standard error.
Approximately 68% of sample means fall within one standard error of the population mean.
The standard error is calculated as: $S_{\bar{x}} = \frac{S}{\sqrt{n}}$
Example Calculation:
For n=100, estimate population mean and corresponding CI.
Interval Calculation
The lower limit (LL) and upper limit (UL) are found by subtracting a multiple of the standard error from the sample mean and adding the same multiple to it.
For sample of 100 counties:
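Assign the sample mean and standard error, then calculate the 68% interval:
    sample_mean <- 32.71
    se <- 1.711
    # 68% CI: one standard error on either side of the sample mean
    ci.68 <- c(sample_mean - 1 * se, sample_mean + 1 * se)
This gives an interval from 30.999 to 34.421, which contains the population mean of 34.04.
The confidence interval expresses the uncertainty that sampling introduces into the estimate.
Different samples yield different CIs; over many samples, a predictable share of them (about 68% for a 68% CI) will contain the true population mean.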
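The claim that a predictable share of intervals captures the population mean can be checked by simulation. The sketch below assumes a hypothetical normal population of 3000 values rather than the county data, and the object names are illustrative.

```r
set.seed(8)  # arbitrary seed for reproducibility

# Hypothetical population standing in for the county values
population <- rnorm(3000, mean = 34, sd = 17)
mu <- mean(population)

n <- 100
reps <- 1000

# For each sample, build a 68% CI (mean +/- 1 SE) and check whether it contains mu
covered <- replicate(reps, {
  samp <- sample(population, n)
  se <- sd(samp) / sqrt(n)
  ll <- mean(samp) - 1 * se
  ul <- mean(samp) + 1 * se
  ll <= mu && mu <= ul
})

mean(covered)  # share of intervals that contain the population mean
```

The resulting share should land near 0.68, though it will vary somewhat from run to run. Widening the interval to about two standard errors would raise the share to roughly 95%.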
8.6 Proportions
As with the mean, the sampling distribution of sample proportions centers on the population proportion.
Proportion Calculation:
Example of Sample Proportions
Sample proportions are computed from dichotomous variables, such as whether Biden won or lost each county.
The proportion of counties Biden won is estimated in R from several random samples.
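A proportion is simply the mean of a 0/1 variable, which the following sketch illustrates. It uses synthetic two-party percentages rather than county20$d2pty20; the names d2pty and won, the 50% cutoff, and the sample sizes are assumptions made for illustration.

```r
set.seed(77)  # arbitrary seed for reproducibility

# Hypothetical two-party percentages standing in for county20$d2pty20
d2pty <- runif(3000, min = 5, max = 95)

# Dichotomous indicator: 1 if the Democratic share exceeds 50, else 0
won <- as.numeric(d2pty > 50)

# The population proportion is the mean of the 0/1 indicator
pop_prop <- mean(won)

# Proportions from 50 random samples of 100 cases each
sample_props <- replicate(50, mean(sample(won, 100)))

c(population = pop_prop, mean_of_samples = mean(sample_props))
```

Individual sample proportions bounce around the population value, but their average sits close to it, just as sample means do.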
CIs for Proportions
CI formula for proportions:
Where standard errors for proportions are computed using:
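The formulas themselves are not shown in these notes. The standard large-sample versions are usually written as below; the notation ($p$ for the sample proportion, $S_{p}$ for its standard error, $z$ for the critical value) is an assumption and may differ from the chapter's own symbols.

```latex
p = \frac{\text{number of cases with the characteristic}}{n}
\qquad
S_{p} = \sqrt{\frac{p(1-p)}{n}}
\qquad
\text{CI} = p \pm z \cdot S_{p}
```

Here $z = 1$ gives the 68% interval used elsewhere in the chapter, and $z = 1.96$ gives the conventional 95% interval.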
8.7 Summary and Implications
The next chapter extends statistical inference to hypothesis testing.
Confidence intervals define a range around an estimate; they appear most often in public opinion polling, where media reports commonly present them as a margin of error.
Understanding CIs and how they are calculated is the foundation for hypothesis testing.
Exercises
Apply the chapter's concepts by calculating means, constructing CIs, and assessing how sample size affects the results. Answers should focus on the level of confidence and on how close the estimates are to the population parameters.