Chapter 8: Sampling and Inference

8.1 Getting Ready

  • Focus on theoretical and practical issues related to sampling and statistical inference.

  • Important for understanding hypothesis testing in the next chapter.

  • Requires loading the county20.rda data set and attaching the DescTools and plotrix libraries.

8.2 Statistics and Parameters

  • Social scientists often seek to make general statements about populations (e.g., voters or funding patterns).

  • Data from entire populations is often unavailable, leading to reliance on samples.

  • Sample statistics are calculated and used as estimates of the corresponding population parameters.

  • Table 8.1: Symbols and Formulas

    • Sample statistics: Mean ($\bar{x}$), Variance ($S^2$), Standard Deviation ($S$)

    • Population parameters: Mean ($\mu$), Variance ($\sigma^2$), Standard Deviation ($\sigma$)

  • Statistical inference involves generalizing from sample statistics to population parameters.

  • Example: Estimating how often individuals access the internet for political news using sample data.

Formulas
  • Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}$

    • Population Mean:
      $\mu = \frac{\sum_{i=1}^{N} x_{i}}{N}$

  • Variance: $S^2 = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}$

    • Population Variance:
      $\sigma^2 = \frac{\sum_{i=1}^{N}(x_{i} - \mu)^{2}}{N}$

  • Standard Deviation: $S = \sqrt{\frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}}$

    • Population Standard Deviation:
      $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_{i} - \mu)^{2}}{N}}$
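
The sample and population formulas above differ only in the denominator (n - 1 versus N). A minimal sketch in base R, assuming a small made-up vector `x` that is not taken from county20, shows both versions side by side. R's built-in `var()` and `sd()` use the n - 1 denominator, so the population versions are computed by hand here.

```r
# Hypothetical data for illustration only (not from county20)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample statistics (n - 1 in the denominator)
x_bar <- sum(x) / n
s2 <- sum((x - x_bar)^2) / (n - 1)
s <- sqrt(s2)

# Population parameters, treating x as the entire population (N in the denominator)
N <- length(x)
mu <- sum(x) / N
sigma2 <- sum((x - mu)^2) / N
sigma <- sqrt(sigma2)

# R's var() and sd() implement the sample (n - 1) formulas
all.equal(s2, var(x))
all.equal(s, sd(x))
```

For this vector the population variance is 4 and the population standard deviation is 2, while the sample versions are slightly larger because they divide by 7 rather than 8.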

8.3 Sampling Error

  • Sampling error will always exist even with large, representative samples.

    • Example: Drawing colored ping pong balls from a bucket (6000 total, evenly distributed by color).

  • Random samples may not perfectly reflect the population distribution.

    • With random and unbiased selection, samples should approximate the population.

  • Margin of error is often reported (e.g., “±2 percentage points”).

  • Focus on county-level election returns from the 2020 presidential election for concrete examples.
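
The ping pong ball example can be simulated in a few lines. This is only a sketch: the notes say the bucket holds 6000 balls split evenly by color but do not say how many colors there are, so four colors (1500 balls each) and a sample of 100 balls are assumptions made here for illustration.

```r
set.seed(8)  # arbitrary seed so the sketch is reproducible

# Assumption: four colors, 1500 balls each (6000 balls in total)
colors <- c("red", "blue", "green", "yellow")
bucket <- rep(colors, each = 1500)

# Draw 100 balls at random, without replacement
draw <- sample(bucket, 100)

# Observed share of each color in the sample vs. the true share of 0.25
sample_share <- table(factor(draw, levels = colors)) / length(draw)
sampling_error <- sample_share - 0.25
print(round(sampling_error, 2))
```

Even though the draw is random and unbiased, the sample shares will rarely equal 0.25 exactly; the differences are the sampling error.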

Example Code and Data Attributes
  • Dimensions and variable names for county20 data set:

    • dim(county20) outputs: [1] 3152 10

    • Variables include:

    • state, county_fips, county_name, votes_gop20, votes_dem20, total_votes20, other_vote20, djtpct20, jrbpct20, d2pty20.

  • Variable of interest: Biden's percent of two-party vote (d2pty20).

Summary Statistics
  • Mean, Median, and Histogram graphical displays.

  • Population-level statistics indicate right skew (mean > median), so it is prudent to examine how means calculated from random samples behave.
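
The link between skew and the mean exceeding the median can be illustrated without the real data. The sketch below uses a simulated right-skewed variable, `d2pty_sim`, as a stand-in; the beta distribution and its parameters are assumptions chosen only to produce skewed percentages and are not estimates from county20.

```r
set.seed(75)  # arbitrary seed

# Simulated right-skewed percentages standing in for d2pty20 (NOT the real data)
d2pty_sim <- rbeta(3152, 2, 4) * 100

pop_mean <- mean(d2pty_sim)
pop_median <- median(d2pty_sim)
c(mean = pop_mean, median = pop_median)

# With a long right tail, the mean is pulled above the median
pop_mean > pop_median

# Histogram of the simulated variable (drawn only in an interactive session)
if (interactive()) {
  hist(d2pty_sim, xlab = "Simulated percent", main = "Right-skewed stand-in")
}
```

With county20 loaded, the same calls could be applied to `county20$d2pty20` (adding `na.rm = TRUE` if the variable has missing values).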

8.4 Sampling Distributions

  • A sampling distribution consists of numerous sample means computed from repeated samples.

  • Key characteristics of sampling distributions:

    • Mean of sampling distribution = population mean ($\mu = 34.04$).

    • Sample means cluster around the population mean, and as n increases their distribution approaches normality, per the Central Limit Theorem (CLT).

Key Properties of Sampling Distribution
  • Generally follows a normal distribution.

  • The mean of the sample means approximates the population mean.

  • Standard deviation of sampling distribution is known as standard error.

  • Central Limit Theorem (CLT): For sufficiently large sample sizes (a common rule of thumb is n > 30), the distribution of sample means will be approximately normal regardless of the shape of the population's distribution.

Simulating Sampling Distribution
  • Simulated by taking repeated samples from a known population.

  • R Code Example:

    • Draw 50 samples, each containing 50 counties.

    • Store average of each sample:
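      sample_means50 <- numeric(50)   # storage for the 50 sample means
      for (i in 1:50) {
        samp <- sample(county20$d2pty20, 50)            # draw 50 counties at random
        sample_means50[i] <- mean(samp, na.rm = TRUE)   # store this sample's mean
      }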
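
The properties listed earlier (mean of sample means close to the population mean, spread equal to the standard error, roughly normal shape) can be checked by simulation. The sketch below is self-contained, so it uses a simulated skewed population `pop` rather than county20; the beta distribution, the 1000 repetitions, and the use of `replicate()` in place of the `for` loop are all choices made here for illustration.

```r
set.seed(2020)  # arbitrary seed for reproducibility

# Stand-in population: 3152 simulated right-skewed percentages.
# This is NOT the county20 data; swap in county20$d2pty20 if it is loaded.
pop <- rbeta(3152, 2, 4) * 100

n_samples <- 1000  # more than the 50 in the notes, to smooth the picture
n <- 50            # counties per sample

# replicate() repeats the draw-and-average step n_samples times
sample_means <- replicate(n_samples, mean(sample(pop, n)))

# Mean of the sampling distribution vs. the population mean
c(mean_of_means = mean(sample_means), pop_mean = mean(pop))

# SD of the sampling distribution (the standard error) vs. sigma / sqrt(n)
sigma <- sqrt(mean((pop - mean(pop))^2))
c(sd_of_means = sd(sample_means), theoretical_se = sigma / sqrt(n))

# The histogram should look roughly bell-shaped even though pop is skewed
if (interactive()) {
  hist(sample_means, main = "Simulated sampling distribution", xlab = "Sample mean")
}
```

Raising `n` should narrow the histogram, since the standard error shrinks with the square root of the sample size.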

8.5 Confidence Intervals

  • A confidence interval (CI) is a range of values around a sample statistic that is likely to contain the true population parameter.

  • For large random samples, the sampling distribution is approximately normal, so CIs can be built from the normal distribution.

  • Approximately 68% of sample means fall within one standard error of the population mean.

  • The standard error is calculated as:
    $S_{\bar{x}} = \frac{S}{\sqrt{n}}$

  • Example Calculation:

    • For n=100, estimate population mean and corresponding CI.
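
The standard error formula translates directly into R. A minimal sketch, assuming a simulated stand-in population `pop` (not county20) from which one sample of 100 is drawn:

```r
set.seed(100)  # arbitrary seed

# Stand-in population (simulated; NOT county20)
pop <- rbeta(3152, 2, 4) * 100

samp <- sample(pop, 100)               # one random sample of n = 100
x_bar <- mean(samp)                    # point estimate of the population mean
se <- sd(samp) / sqrt(length(samp))    # standard error: S / sqrt(n)

c(estimate = x_bar, std_error = se)
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.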

Interval Calculation
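  • The Lower Limit (LL) and Upper Limit (UL) are found by subtracting and adding a multiple of the standard error to the sample mean.

    • For a sample of 100 counties:
      Assign the sample mean and standard error, then calculate the interval limits.
      xbar <- 32.71   # sample mean
      se <- 1.711     # standard error
      ci.68 <- c(LL = xbar - 1 * se, UL = xbar + 1 * se)   # 30.999 to 34.421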

  • A confidence interval quantifies the uncertainty that sampling introduces into an estimate.

  • Different samples yield different CIs; over many repeated samples, a predictable share of them (about 68% for a 68% CI) will contain the true population mean.
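
The claim that a known share of intervals captures the population mean can be checked by repeated sampling. The sketch below assumes a simulated stand-in population (not county20) and a hypothetical helper, `covers()`, written here for illustration; it builds an interval from each of 2000 samples and records how often the population mean falls inside.

```r
set.seed(68)  # arbitrary seed

pop <- rbeta(3152, 2, 4) * 100   # simulated stand-in population, NOT county20
mu <- mean(pop)                  # population mean, known because we built pop

# Hypothetical helper: TRUE if the interval from one sample contains mu
covers <- function(n, z) {
  samp <- sample(pop, n)
  se <- sd(samp) / sqrt(n)
  ll <- mean(samp) - z * se
  ul <- mean(samp) + z * se
  ll <= mu && mu <= ul
}

# Share of 2000 intervals that contain the population mean
coverage_68 <- mean(replicate(2000, covers(100, z = 1)))
coverage_95 <- mean(replicate(2000, covers(100, z = 1.96)))
c(coverage_68 = coverage_68, coverage_95 = coverage_95)
```

The two shares should land near 0.68 and 0.95, though they will vary somewhat from run to run with a different seed.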

8.6 Proportions

  • As with means, the sampling distribution of a sample proportion centers on the population proportion.

  • Proportion Calculation:
    $\text{Proportion } P = \frac{\text{Successful outcomes}}{\text{Total trials}}$

Example of Sample Proportions
  • Sample proportions summarize dichotomous variables, such as whether Biden won or lost a county.

    • The proportion of counties won is calculated in R for each of several samples, and the sample proportions are then averaged.
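
Because the mean of a 0/1 variable is a proportion, the same sampling approach used for means carries over. A minimal sketch, assuming a simulated stand-in for the two-party percentage and a hypothetical dummy variable `won` (1 if the simulated percentage exceeds 50); neither comes from county20.

```r
set.seed(50)  # arbitrary seed

# Simulated stand-in for Biden's two-party percentage (NOT county20)
pop <- rbeta(3152, 2, 4) * 100
won <- as.numeric(pop > 50)   # hypothetical dummy: 1 = won the county, 0 = lost

pop_prop <- mean(won)         # the mean of a 0/1 variable is a proportion

# 500 samples of 100 counties each, storing the sample proportion from each
sample_props <- replicate(500, mean(sample(won, 100)))

c(pop_proportion = pop_prop, mean_of_sample_props = mean(sample_props))
```

The average of the sample proportions should sit close to the population proportion, mirroring the result for sample means.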

CIs for Proportions
  • CI formula for proportions:
    $c.i._{.95} = p \pm z_{.95} \cdot S_p$
    Where the standard error for a proportion is computed using:
    $S_p = \sqrt{\frac{p(1-p)}{n}}$
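
A worked example of the two formulas, using made-up values (p = 0.20 from a sample of n = 100) that are assumptions for illustration rather than results from county20. The critical value comes from `qnorm(0.975)`, which is approximately 1.96.

```r
# Hypothetical inputs for illustration only
p <- 0.20   # sample proportion
n <- 100    # sample size

s_p <- sqrt(p * (1 - p) / n)   # standard error of the proportion: 0.04
z_95 <- qnorm(0.975)           # critical value for a 95% interval

ll <- p - z_95 * s_p           # lower limit
ul <- p + z_95 * s_p           # upper limit
round(c(LL = ll, UL = ul), 4)
```

Here the interval runs from roughly 0.12 to 0.28, so a sample proportion of 0.20 is consistent with a population proportion anywhere in that range at the 95% level.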

8.7 Summary and Implications

  • The next step in statistical inference is hypothesis testing.

  • Confidence intervals give a range around an estimate and are commonly used in public opinion polling (e.g., in media reporting of poll results).

  • Understanding CIs and associated calculations forms the foundation for hypothesis-oriented investigations.

Exercises
  • Apply the concepts from this chapter by calculating means, constructing CIs, and assessing how sample size affects the results. Answers should focus on confidence and on how close estimates are to the population parameters.