Chapter 8: Sampling and Inference

Focus on theoretical and practical issues related to sampling and statistical inference.
Important for understanding hypothesis testing in the next chapter.
Requires loading the county20.rda data set and attaching the DescTools and plotrix libraries.

Social scientists often seek to make general statements about populations (e.g., voters or funding patterns).
Data from entire populations is often unavailable, leading to reliance on samples.
Sample statistics are calculated and used to estimate population parameters.
Table 8.1: Symbols and Formulas
- Sample statistics: Mean ($\bar{x}$), Variance ($S^2$), Standard Deviation ($S$)
- Population parameters: Mean ($\mu$), Variance ($\sigma^2$), Standard Deviation ($\sigma$)
Statistical inference involves generalizing from sample statistics to population parameters.
Example: Estimating how often individuals access the internet for political news using sample data.

Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}$
- Population Mean:
  $\mu = \frac{\sum_{i=1}^{N} x_{i}}{N}$
Variance: $S^2 = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}$
- Population Variance:
  $\sigma^2 = \frac{\sum_{i=1}^{N}(x_{i} - \mu)^{2}}{N}$
Standard Deviation: $S = \sqrt{\frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}}$
- Population Standard Deviation:
  $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_{i} - \mu)^{2}}{N}}$
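
The formulas above can be checked by hand in R. The sketch below is only an illustration: it uses a made-up population of 1,000 values rather than the county20 data, and all object names are assumptions. It shows that the population formulas divide by N while the sample formulas (and R's built-in var() and sd()) divide by n - 1.

```r
set.seed(8)
# Hypothetical population of N = 1000 values (not the county20 data)
population <- rnorm(1000, mean = 50, sd = 10)
N <- length(population)

# Population parameters: divide by N
mu <- sum(population) / N
sigma2 <- sum((population - mu)^2) / N
sigma <- sqrt(sigma2)

# Sample statistics from a random sample of n = 100: divide by n - 1
x <- sample(population, 100)
n <- length(x)
xbar <- sum(x) / n
S2 <- sum((x - xbar)^2) / (n - 1)
S <- sqrt(S2)

# The built-in functions use the sample (n - 1) formulas
c(xbar = xbar, mean_builtin = mean(x))
c(S2 = S2, var_builtin = var(x))
c(S = S, sd_builtin = sd(x))
```

Because var() and sd() always use n - 1, a population variance has to be computed manually (as above) or by rescaling var() by (N - 1) / N.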

Sampling error will always exist even with large, representative samples.
- Example: Drawing colored ping pong balls from a bucket (6000 total, evenly distributed by color).
Random samples may not perfectly reflect the population distribution.
- With random and unbiased selection, samples should approximate the population.
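
The bucket example can be simulated as a rough sketch. The notes give only the total (6000 balls, evenly distributed), so the three colors and the sample size of 600 used here are assumptions made purely for illustration.

```r
set.seed(2020)
# Assumption: three colors with 2000 balls each (only the 6000 total is given)
bucket <- rep(c("orange", "white", "yellow"), each = 2000)

# Draw a random sample of 600 balls without replacement
draw <- sample(bucket, 600)

# Sample share of each color versus the population share of 1/3
sample_share <- prop.table(table(draw))
sampling_error <- sample_share - 1 / 3

round(sample_share, 3)
round(sampling_error, 3)
```

Rerunning with a different seed gives slightly different shares each time, which is the sampling error described above: the sample approximates the population without matching it exactly.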
Margin of error is often reported (e.g., “±2 percentage points”).
Focus on county-level election returns from the 2020 presidential election for concrete examples.

Dimensions and variable names for county20 data set:
- dim(county20) outputs: [1] 3152 10
- Variables include: state, county_fips, county_name, votes_gop20, votes_dem20, total_votes20, other_vote20, djtpct20, jrbpct20, d2pty20.
Variable of interest: Biden's percent of two-party vote (d2pty20).

Describe d2pty20 using the mean, the median, and a histogram.
Population-level statistics indicate positive (right) skew: the mean is greater than the median. It is prudent to investigate sampling by calculating means from random samples.
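
A minimal sketch of this kind of description, assuming a simulated right-skewed variable in place of county20$d2pty20 (the beta distribution and the object names are stand-ins, not the actual data):

```r
set.seed(34)
# Hypothetical right-skewed stand-in for county20$d2pty20 (values from 0 to 100)
pct <- 100 * rbeta(3152, shape1 = 2, shape2 = 4)

pop_mean <- mean(pct, na.rm = TRUE)
pop_median <- median(pct, na.rm = TRUE)
c(mean = pop_mean, median = pop_median)

# Right skew appears as a long upper tail, with the mean above the median
hist(pct, xlab = "Percent of two-party vote (simulated)",
     main = "Simulated population")
abline(v = pop_mean, lwd = 2)              # solid line: mean
abline(v = pop_median, lwd = 2, lty = 2)   # dashed line: median
```

With the real data, the same calls would be applied to county20$d2pty20 after loading county20.rda.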

A sampling distribution consists of numerous sample means computed from repeated samples.
Key characteristics of sampling distributions:
- The mean of the sampling distribution equals the population mean ($\mu = 34.04$).
- Sample means cluster around the population mean, and the distribution approaches normality as n increases (Central Limit Theorem, CLT).

The sampling distribution generally follows a normal distribution.
The mean of the sample means approximates the population mean.
The standard deviation of the sampling distribution is known as the standard error.
Central Limit Theorem (CLT): For sufficiently large sample sizes (a common rule of thumb is n > 30), the distribution of sample means will be approximately normal regardless of the shape of the population's distribution.
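
These properties can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is a made-up, strongly skewed exponential variable (not county20), and draw_means() is a hypothetical helper written for this example.

```r
set.seed(50)
# Hypothetical right-skewed population (a stand-in, not the county20 data)
population <- rexp(10000, rate = 1 / 34)
mu <- mean(population)
sigma <- sqrt(mean((population - mu)^2))   # population SD (divide by N)

# Hypothetical helper: means of `reps` random samples of size n
draw_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

means_10 <- draw_means(10)
means_100 <- draw_means(100)

# Center: both sampling distributions sit near the population mean
c(mu = mu, mean_n10 = mean(means_10), mean_n100 = mean(means_100))

# Spread: the SD of the sample means is close to sigma / sqrt(n)
c(sd_n10 = sd(means_10), theory_n10 = sigma / sqrt(10))
c(sd_n100 = sd(means_100), theory_n100 = sigma / sqrt(100))
```

Histograms of means_10 and means_100 (for example with hist()) should show the larger-n distribution looking narrower and more symmetric, even though the population itself is skewed.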

A sampling distribution can be simulated by taking repeated samples from a known population.
R Code Example:
- Draw 50 samples, each containing 50 counties.
- Store average of each sample:
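      sample_means50 <- numeric(50)            # storage for the 50 sample means
      for (i in 1:50) {
        samp <- sample(county20$d2pty20, 50)   # random sample of 50 counties
        sample_means50[i] <- mean(samp, na.rm = TRUE)
      }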

Confidence intervals (CIs) give a range of values that is likely to contain the true population parameter.
For large random samples, the sampling distribution of the mean is approximately normal, so CIs can be built from the standard error.
Approximately 68% of sample means fall within one standard error of the population mean, and about 95% fall within 1.96 standard errors.
The standard error of the mean is estimated as:
$S_{\bar{x}} = \frac{S}{\sqrt{n}}$
Example Calculation:
- For n=100, estimate population mean and corresponding CI.
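
As a sketch of this calculation, the code below draws one sample of n = 100 from a simulated stand-in population (an assumption; the real analysis would sample county20$d2pty20) and computes the standard error and a 95% interval by hand.

```r
set.seed(100)
# Hypothetical stand-in for county20$d2pty20
population <- 100 * rbeta(3152, shape1 = 2, shape2 = 4)

samp <- sample(population, 100)   # one random sample of 100 "counties"
n <- length(samp)
xbar <- mean(samp)                # point estimate of the population mean
se <- sd(samp) / sqrt(n)          # standard error: S / sqrt(n)

# 95% interval using the normal critical value of 1.96
ci_95 <- c(LL = xbar - 1.96 * se, UL = xbar + 1.96 * se)

c(xbar = xbar, se = se)
ci_95
```

The plotrix function std.error() should return the same standard error, and MeanCI() in DescTools reports a similar interval, though it is based on the t distribution by default and so will be slightly wider.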

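The lower limit (LL) and upper limit (UL) of a CI are found by subtracting and adding a multiple of the standard error to the sample mean.
- For a sample of 100 counties:
  Store the sample mean and standard error, then calculate the interval.
      xbar <- 32.71                                       # sample mean (n = 100)
      se <- 1.711                                         # standard error of the mean
      ci.68 <- c(LL = xbar - 1 * se, UL = xbar + 1 * se)  # 30.999 to 34.421
  This 68% interval contains the population mean of 34.04.
The confidence interval expresses the uncertainty that comes from sampling.
Different samples yield different CIs; over repeated sampling, a known share of them (about 68% of 68% intervals, for example) will contain the true population mean.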
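
The repeated-sampling interpretation can be checked with a simulation. This is a rough sketch: the population is simulated (not county20), and covers() is a hypothetical helper that reports whether one sample's interval contains the true mean.

```r
set.seed(71)
# Hypothetical stand-in for county20$d2pty20
population <- 100 * rbeta(3152, shape1 = 2, shape2 = 4)
mu <- mean(population)

# Hypothetical helper: does the interval xbar +/- z * se contain mu?
covers <- function(n, z) {
  samp <- sample(population, n)
  se <- sd(samp) / sqrt(n)
  abs(mean(samp) - mu) <= z * se
}

# Share of 2000 intervals (n = 100 each) that contain the population mean
coverage_68 <- mean(replicate(2000, covers(100, 1)))
coverage_95 <- mean(replicate(2000, covers(100, 1.96)))
c(coverage_68 = coverage_68, coverage_95 = coverage_95)
```

The two shares should land close to 0.68 and 0.95, with some simulation noise. Any single interval either contains the population mean or it does not; the percentage describes the long-run behavior of the procedure.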

As with means, the sampling distribution of sample proportions centers on the population proportion.
Proportion Calculation:
$\text{Proportion } p = \frac{\text{Successful outcomes}}{\text{Total trials}}$

Sample proportions summarize dichotomous variables, such as whether Biden won or lost a county.
- The proportion of counties won is calculated in R as the average of a win/loss indicator and can be compared across various samples.
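
One way to sketch this in R, assuming a simulated vote-share variable and a hypothetical indicator named biden_win (neither comes from the county20 file):

```r
set.seed(78)
# Hypothetical stand-in for county20$d2pty20
d2pty <- 100 * rbeta(3152, shape1 = 2, shape2 = 4)

# Dichotomous variable: 1 if the two-party share exceeds 50, else 0
biden_win <- as.numeric(d2pty > 50)

# The mean of a 0/1 variable is the proportion of 1s
P <- mean(biden_win)

# Proportions from 50 random samples of 100 counties each
sample_props <- replicate(50, mean(sample(biden_win, 100)))

c(population = P, mean_of_samples = mean(sample_props))
range(sample_props)
```

Individual sample proportions vary, but their average should sit close to the population proportion, mirroring the behavior of sample means.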

CI formula for proportions:
$c.i._{.95} = p \pm z_{.95} * S_p$
Where standard errors for proportions are computed using:
$S_p = \sqrt{\frac{p(1-p)}{n}}$
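
A worked sketch of these two formulas follows. The 0/1 data are simulated with an assumed win probability of 0.17 and an assumed sample size of 500, both chosen only for illustration.

```r
set.seed(83)
# Hypothetical win/loss indicator for 3152 counties (assumed probability 0.17)
biden_win <- rbinom(3152, size = 1, prob = 0.17)

samp <- sample(biden_win, 500)   # random sample of 500 counties
n <- length(samp)
p <- mean(samp)                  # sample proportion

S_p <- sqrt(p * (1 - p) / n)     # standard error of the proportion
z_95 <- qnorm(0.975)             # critical value for 95% confidence, about 1.96

ci_95 <- c(LL = p - z_95 * S_p, UL = p + z_95 * S_p)
c(p = p, S_p = S_p)
ci_95
```

Multiplying p, S_p, and the limits by 100 expresses the result in percentage points, which is how a margin of error is usually reported.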

Statistical inference is explored further in the move towards hypothesis testing.
Confidence intervals delineate ranges of plausible values and are commonly encountered in public opinion polling (e.g., margins of error reported in the media).
Understanding CIs and associated calculations forms the foundation for hypothesis-oriented investigations.

Apply the concepts learned by calculating means, constructing CIs, and assessing how sample size affects the results. Focus answers on the confidence attached to each estimate and on how close the estimates come to the population parameters.