Samples and Sampling Distributions

Importance of Representative Samples

  • Quality of any statistical analysis is directly tied to how representative the sample data are of the target population.
    • If the sample is not representative, conclusions will be unreliable.
    • Therefore, the first step of any investigation is to design a sampling plan that yields representative data.
  • Sampling (studying a subset) saves time and money compared with a full census.
    • When done well, the information from a representative sample is “almost as good” as information from a census.
  • Key idea: REPRESENTATIVENESS generally requires introducing RANDOMNESS into the sampling procedure.

Random Sampling

  • Goal: give every member (or every possible sample) of the population a known, non-zero probability of selection.
  • Requires a clear, operational definition of the population.
  • Core vocabulary
    • Population: complete set of individuals/objects of interest.
    • Sample: subset of the population that is actually observed/recorded.
    • Random sample: sample chosen by a process governed purely by chance.

Voluntary Response & Selection Bias

  • Voluntary sampling (participants choose to opt in) is common in media & internet surveys.
    • Example context: online polls, call-in radio surveys, website feedback links.
  • Tends to over-represent individuals with strong opinions and under-represent individuals with weak/no opinions.
  • Voluntary response is a specific form of selection bias – the sampling mechanism systematically favors certain population segments.
  • Because of this bias, voluntary samples are generally NOT representative of the population.
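
The over-representation described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population of opinion scores, the opt-in probabilities in `respond_prob`, and the function name `simulate_poll` are all hypothetical and chosen purely for illustration.

```python
import random

def simulate_poll(pop_size=10_000, seed=0):
    """Compare an opt-in sample with an SRS of the same size."""
    rng = random.Random(seed)
    # Hypothetical population: opinions from 1 (strongly against) to 5 (strongly for).
    population = [rng.choice([1, 2, 3, 4, 5]) for _ in range(pop_size)]
    # Assumed opt-in rates: people with strong opinions (1 or 5) respond far more often.
    respond_prob = {1: 0.30, 2: 0.05, 3: 0.01, 4: 0.05, 5: 0.30}
    voluntary = [x for x in population if rng.random() < respond_prob[x]]
    # Simple random sample of the same size, drawn without replacement.
    srs = rng.sample(population, len(voluntary))

    def share_strong(xs):
        return sum(x in (1, 5) for x in xs) / len(xs)

    return share_strong(population), share_strong(voluntary), share_strong(srs)

pop_share, vol_share, srs_share = simulate_poll()
print(f"strong opinions: population={pop_share:.2f}, "
      f"voluntary={vol_share:.2f}, SRS={srs_share:.2f}")
```

Under these assumed rates, the opt-in sample should show a much larger share of strong opinions than the population, while the random sample of the same size should stay close to the population share.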

Simple Random Sampling (SRS)

  • Definition: A sample in which every possible sample of the same size, n, has the same probability of being selected.
  • Construction steps (when a complete list is available):
    1. Create a sampling frame – a complete list of every population member.
    2. Assign each member a unique ID number.
    3. Select ID numbers at random until n distinct IDs are chosen.
    • Methods: drawing numbers from a hat, using a random-number table, or computer-generated random integers.
  • Practical guidelines for random-number tables/computer tools:
    • Identify the correct number of digits (e.g., 4 digits when IDs run from 0001 to 8000).
    • Start at any row/column; read consecutive blocks of the chosen digit length.
    • Ignore numbers outside the valid ID range; keep reading until n distinct valid IDs are gathered.
  • Example: Selecting n students from 8,000 in a college registry → use registrar’s database + software to draw random IDs.
  • Advantages: Simplicity, well-understood theoretical properties, minimal bias when the frame is complete.
  • Main liability: Requires a complete sampling frame.
    • Often impossible for diffuse populations (e.g., “all potential customers of a mall”).
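
The construction steps and the random-number-table guidelines above can be sketched in code. This is a minimal illustration, assuming IDs run from 1 to the population size; the function names `srs_ids` and `srs_from_digit_stream` and the digit string in the usage line are hypothetical, not taken from any particular table or registry.

```python
import random

def srs_ids(population_size, n, seed=None):
    """Draw n distinct IDs from 1..population_size; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(range(1, population_size + 1), n)

def srs_from_digit_stream(digits, population_size, n):
    """Mimic reading a random-number table in consecutive fixed-width blocks."""
    width = len(str(population_size))  # e.g. 4 digits for 8,000 IDs
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        value = int(digits[start:start + width])
        # Ignore blocks outside the valid ID range and IDs already selected.
        if 1 <= value <= population_size and value not in chosen:
            chosen.append(value)
            if len(chosen) == n:
                return chosen
    raise ValueError("digit stream exhausted before n distinct valid IDs were found")

# Computer-generated draw: 50 student IDs out of 8,000.
print(srs_ids(8000, 50, seed=1)[:5])

# Table-style draw: blocks 0412, 9731, 0412, 8000, 0000, 7999.
print(srs_from_digit_stream("041297310412800000007999", 8000, 3))
```

In the table-style call, 9731 and 0000 fall outside the valid range and the second 0412 is a repeat, so the three IDs kept are 412, 8000, and 7999.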

Sampling-Frame Challenges

  • Many real-world populations lack an obvious, exhaustive list.
    • Market areas, hidden/rare populations, transitory groups.
  • Without a good frame, an SRS can’t be executed directly; alternative designs (cluster, stratified, systematic, multi-stage) may be needed.
  • Crafting an accurate frame is costly and time-consuming; errors in the frame introduce coverage bias.
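
Coverage bias can be made concrete with a simulation. The sketch below assumes a hypothetical population in which 3,000 of 10,000 members are missing from the frame and differ systematically from the listed members; the numbers and the name `coverage_demo` are invented for illustration.

```python
import random
from statistics import mean

def coverage_demo(seed=0, n=500, reps=200):
    """Average the sample mean over many SRS draws from a full vs. an incomplete frame."""
    rng = random.Random(seed)
    # Hypothetical population: listed members average about 50, unlisted about 30.
    listed = [rng.gauss(50, 10) for _ in range(7000)]
    unlisted = [rng.gauss(30, 10) for _ in range(3000)]
    population = listed + unlisted
    true_mean = mean(population)
    # SRS from the complete frame.
    full_frame = mean(mean(rng.sample(population, n)) for _ in range(reps))
    # SRS from a frame that omits the unlisted members.
    partial_frame = mean(mean(rng.sample(listed, n)) for _ in range(reps))
    return true_mean, full_frame, partial_frame

true_mean, full_frame, partial_frame = coverage_demo()
print(f"true={true_mean:.1f}, full frame={full_frame:.1f}, partial frame={partial_frame:.1f}")
```

Even though both designs select at random, the draws from the incomplete frame should center on the listed members' mean rather than the population mean. Randomness within a flawed frame does not remove the coverage error.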

Sampling Distributions & Random Variables

  • Think of a statistic (e.g., the sample mean) as a random variable because its value changes from sample to sample.
    • Notation: sample mean $\bar{X}$, population mean $\mu$.
  • Sampling distribution of $\bar{X}$: the probability distribution of $\bar{X}$ over all possible samples of size n from the population.
    • Captures variability attributable solely to sampling.
  • Key ideas:
    • Even if the population is fixed, the estimator varies because different random samples yield different values.
    • Understanding the sampling distribution lets us quantify uncertainty (e.g., build confidence intervals, perform hypothesis tests).
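
One way to see a sampling distribution is to approximate it by repeated sampling. The following sketch assumes a hypothetical skewed population of 5,000 values and draws 2,000 simple random samples of size 25; the population, the sizes, and the function name are illustrative choices only.

```python
import random
from statistics import mean, pstdev

def sampling_distribution_of_mean(population, n, reps, seed=0):
    """Return the sample means from `reps` simple random samples of size n."""
    rng = random.Random(seed)
    return [mean(rng.sample(population, n)) for _ in range(reps)]

# Hypothetical fixed population (right-skewed, mean near 20).
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1 / 20) for _ in range(5000)]

means = sampling_distribution_of_mean(population, n=25, reps=2000)
mu = mean(population)
print(f"population mean      = {mu:.2f}")
print(f"mean of sample means = {mean(means):.2f}")
print(f"spread of population = {pstdev(population):.2f}")
print(f"spread of sample means = {pstdev(means):.2f}")
```

The population never changes, yet the 2,000 sample means differ from one another. Their collection approximates the sampling distribution: it should center near the population mean and be noticeably less spread out than the individual population values.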

Point Estimation

  • A point estimator is a single-number statistic meant to be “close” to a population parameter.
    • Example estimators: $\bar{X}$ for $\mu$, $s$ for $\sigma$, $\hat{p}$ for population proportion $p$.
  • The observed value from a specific sample is called the point estimate.
  • Rationale: Because a census is usually impractical, we rely on point estimates as best guesses of the true, unknown parameters.
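
The distinction between an estimator (a rule) and an estimate (the number it produces for one sample) can be sketched as follows. The commute-time data and the "longer than 20 minutes" success criterion are hypothetical; `point_estimates` is an illustrative helper, not a standard function.

```python
from statistics import mean, stdev

def point_estimates(sample, is_success):
    """Apply three estimators to one observed sample and return the point estimates."""
    x_bar = mean(sample)                 # estimate of the population mean
    s = stdev(sample)                    # estimate of sigma (uses the n - 1 divisor)
    p_hat = sum(1 for x in sample if is_success(x)) / len(sample)  # estimate of p
    return x_bar, s, p_hat

# Hypothetical sample: commute times in minutes for 8 randomly chosen employees.
commute = [12, 18, 25, 31, 9, 22, 40, 15]
x_bar, s, p_hat = point_estimates(commute, is_success=lambda t: t > 20)
print(f"x_bar = {x_bar}, s = {s:.2f}, p_hat = {p_hat}")
```

A different random sample of 8 employees would feed different numbers through the same three rules and so yield different point estimates.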

Key Formulas & Symbols

  • Sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
  • Population mean: $\mu = E[X]$ (unknown, fixed)
  • Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2}$
  • Population standard deviation: $\sigma$ (unknown, fixed)
  • Sample proportion: $\hat{p} = \frac{\text{number of successes in sample}}{n}$
  • Point estimator notation summary
    • $\bar{X} \to \mu$
    • $s \to \sigma$
    • $\hat{p} \to p$
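
As a worked illustration of the three sample formulas, consider a hypothetical sample of n = 4 observations, 3, 5, 7, 9, where a "success" is assumed to mean a value greater than 4. The numbers are made up purely to show the arithmetic.

```latex
\begin{align*}
\bar{x} &= \frac{1}{4}(3 + 5 + 7 + 9) = \frac{24}{4} = 6 \\
s &= \sqrt{\frac{1}{4-1}\left[(3-6)^2 + (5-6)^2 + (7-6)^2 + (9-6)^2\right]}
   = \sqrt{\frac{20}{3}} \approx 2.58 \\
\hat{p} &= \frac{3}{4} = 0.75
\end{align*}
```

Lowercase $\bar{x}$ is used here because these are observed values (estimates) rather than the random variable $\bar{X}$.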

Practical, Ethical, & Philosophical Implications

  • Practical: Poor sampling → wasted resources + faulty decisions (policy, business strategy, medicine).
  • Ethical: Misleading inferences from biased samples can harm under-represented groups or propagate misinformation.
  • Philosophical: The demand for randomness underscores our acceptance that we cannot fully control or know reality; probability models formalize uncertainty.

Connections & Context

  • Builds on earlier probability concepts – random variables, probability distributions, expected value.
  • Sets the stage for upcoming topics: Central Limit Theorem, confidence intervals, hypothesis testing.
  • Reinforces core statistical principle: Variation is inevitable; understanding it is the key to learning from data.

Study Takeaways

  • Always start with a clearly defined population and an unbiased sampling method.
  • Voluntary response almost guarantees selection bias; avoid relying on it for serious inference.
  • A Simple Random Sample requires a complete sampling frame – feasible for well-cataloged populations, tough otherwise.
  • Treat statistics as random variables; their sampling distributions describe how much they vary from sample to sample.
  • Point estimators (e.g., $\bar{X}$, $s$, $\hat{p}$) provide best single-value guesses of population parameters when a census is not possible.