Samples and Sampling Distributions

The quality of any statistical analysis depends directly on how well the sample data represent the target population.
- If the sample is not representative, conclusions will be unreliable.
- Therefore, the first step of any investigation is to design a sampling plan that yields representative data.
Sampling (studying a subset) saves time and money compared with a full census.
- When done well, the information from a representative sample is “almost as good” as information from a census.
Key idea: REPRESENTATIVENESS generally requires introducing RANDOMNESS into the sampling procedure.

Goal: give every member (or every possible sample) of the population a known, non-zero probability of selection.
Requires a clear, operational definition of the population.
Core vocabulary
- Population: complete set of individuals/objects of interest.
- Sample: subset of the population that is actually observed/recorded.
- Random sample: sample chosen by a process governed purely by chance.

Voluntary sampling (participants choose to opt in) is common in media & internet surveys.
- Example context: online polls, call-in radio surveys, website feedback links.
Such samples tend to over-represent individuals with strong opinions and under-represent individuals with weak or no opinions.
Voluntary response is a specific form of selection bias – the sampling mechanism systematically favors certain population segments.
Because of this bias, voluntary samples are generally NOT representative of the population.
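The over-representation of strong opinions can be illustrated with a small simulation. This is only a sketch under a made-up response model: the opinion scores, the rule that the chance of opting in equals opinion strength, and the names `opt_in_poll` and `share_strong` are all assumptions for illustration, not part of the notes.

```python
import random

def opt_in_poll(population_size=100_000, seed=1):
    """Return (population, responders) under an assumed opt-in model."""
    rng = random.Random(seed)
    # Hypothetical opinion scores: -1 (strongly against) to +1 (strongly for).
    population = [rng.uniform(-1.0, 1.0) for _ in range(population_size)]
    # Assumed response model: chance of opting in equals opinion strength |score|.
    responders = [x for x in population if rng.random() < abs(x)]
    return population, responders

def share_strong(scores, cutoff=0.5):
    """Fraction of scores whose strength |score| exceeds the cutoff."""
    return sum(abs(x) > cutoff for x in scores) / len(scores)

population, responders = opt_in_poll()
print(f"strong opinions in population:    {share_strong(population):.2f}")
print(f"strong opinions among responders: {share_strong(responders):.2f}")
```

Under this assumed model, about half of the population holds a strong opinion, but roughly three quarters of the people who opt in do, so the poll misstates how intensely the population feels.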

Definition: A simple random sample (SRS) of size n is a sample selected so that every possible sample of size n has the same probability of being selected.
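As a small check on this definition, the sketch below enumerates every possible sample of size 2 from a hypothetical five-member population (the labels A to E are made up for illustration). If each of the possible samples is equally likely, every member also ends up with the same chance of being included, namely n/N.

```python
from fractions import Fraction
from itertools import combinations

population = ["A", "B", "C", "D", "E"]  # hypothetical labels, N = 5
n = 2

# Every possible sample of size n (order ignored, no repeats).
all_samples = list(combinations(population, n))
# Equal probability for each possible sample.
prob_each = Fraction(1, len(all_samples))

# Inclusion probability of a member: total probability of the samples containing it.
inclusion = {
    member: sum(prob_each for sample in all_samples if member in sample)
    for member in population
}

print(len(all_samples), prob_each)
print(inclusion)
```

Here N = 5 and n = 2, so there are 10 possible samples, each with probability 1/10, and each member is included with probability 2/5.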
Construction steps (when a complete list is available):
1. Create a sampling frame – a numbered list of every population member.
2. Assign each member a unique ID number.
3. Select numbers at random until n distinct IDs are chosen.
- Methods: drawing numbers from a hat, using a random-number table, or computer-generated random integers.
Practical guidelines for random-number tables/computer tools:
- Identify the correct number of digits (e.g., 4 digits for IDs 0001 through 8000).
- Start at any row/column; read consecutive blocks of the chosen digit length.
- Ignore numbers outside the valid ID range; keep reading until n distinct valid IDs are gathered.
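The table-reading guidelines above can be sketched as follows. The function name `ids_from_digit_stream` is hypothetical, and the digit string is a computer-generated stand-in for a row of a printed random-number table.

```python
import random

def ids_from_digit_stream(digits, n, max_id, width):
    """Read consecutive width-digit blocks; keep the first n distinct valid IDs."""
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        candidate = int(digits[start:start + width])
        # Skip blocks outside 1..max_id and IDs that were already selected.
        if 1 <= candidate <= max_id and candidate not in chosen:
            chosen.append(candidate)
            if len(chosen) == n:
                return chosen
    raise ValueError("digit stream exhausted before n distinct IDs were found")

rng = random.Random(42)
# Stand-in for a long row of a random-number table (hypothetical digits).
digit_stream = "".join(rng.choice("0123456789") for _ in range(400))
sample_ids = ids_from_digit_stream(digit_stream, n=10, max_id=8000, width=4)
print(sample_ids)
```

With 4-digit blocks and IDs 1 to 8000, about 20% of blocks (0000 and 8001 to 9999) are discarded, which is why the reader keeps going until n distinct valid IDs are collected.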
Example: Selecting n students from 8,000 in a college registry → use registrar’s database + software to draw random IDs.
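A minimal sketch of the software route for this example, using Python's standard `random.sample`, which draws without replacement. The sample size of 25, the seed, and the assumption that student IDs run from 1 to 8000 are illustrative choices, not details from the registry.

```python
import random

N_STUDENTS = 8000   # size of the sampling frame
n = 25              # hypothetical sample size

rng = random.Random(2024)               # fixed seed so the draw can be reproduced
student_ids = range(1, N_STUDENTS + 1)  # assumed IDs 1..8000
srs_ids = sorted(rng.sample(student_ids, n))

print(srs_ids)
```

Recording the seed alongside the selected IDs lets someone else re-run and audit the draw.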
Advantages: Simplicity, well-understood theoretical properties, minimal bias when the frame is complete.
Main liability: Requires a complete sampling frame.
- Often impossible for diffuse populations (e.g., “all potential customers of a mall”).

Many real-world populations lack an obvious, exhaustive list.
- Market areas, hidden/rare populations, transitory groups.
Without a good frame, a simple random sample cannot be drawn directly; alternative designs (cluster, stratified, systematic, multi-stage) may be needed.
Crafting an accurate frame is costly and time-consuming; errors in the frame introduce coverage bias.
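A rough simulation of coverage bias, under invented numbers: 30% of a hypothetical population is missing from the frame, and the missing members tend to have higher values. Drawing a perfectly random sample from the incomplete frame still misses the population mean.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: "listed" members appear on the frame, "unlisted" do not.
listed = [rng.gauss(50, 10) for _ in range(7000)]
unlisted = [rng.gauss(70, 10) for _ in range(3000)]
population = listed + unlisted
frame = listed  # incomplete frame: unlisted members can never be selected

true_mean = statistics.fmean(population)
sample = rng.sample(frame, 500)          # random sample, but only from the frame
estimate = statistics.fmean(sample)

print(f"population mean:        {true_mean:.1f}")
print(f"estimate from the frame: {estimate:.1f}")
```

The gap between the two numbers comes from the frame, not from the randomization, so taking a larger sample from the same frame would not remove it.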

Think of a statistic (e.g., the sample mean) as a random variable because its value changes from sample to sample.
- Notation: sample mean $\bar{X}$, population mean $\mu$.
Sampling distribution of $\bar{X}$: the probability distribution of $\bar{X}$ over all possible samples of size n from the population.
- Captures variability attributable solely to sampling.
Key ideas:
- Even if the population is fixed, the estimator varies because different random samples yield different values.
- Understanding the sampling distribution lets us quantify uncertainty (e.g., build confidence intervals, perform hypothesis tests).
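The sampling distribution can be approximated by simulation: fix a population, draw many random samples of size n, and record the sample mean each time. The sketch below assumes a made-up, skewed population of 10,000 values; the helper name `sample_means` and the repetition count are illustrative.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical fixed population: 10,000 skewed values with mean near 10.
population = [rng.expovariate(1 / 10) for _ in range(10_000)]
mu = statistics.fmean(population)

def sample_means(n, repetitions=2000):
    """Sample means from repeated simple random samples of size n."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repetitions)]

print(f"population mean: {mu:.2f}")
for n in (5, 25, 100):
    means = sample_means(n)
    center = statistics.fmean(means)
    spread = statistics.stdev(means)
    print(f"n={n:>3}: mean of sample means={center:.2f}, sd of sample means={spread:.2f}")
```

The population never changes, yet the sample means differ from draw to draw. In this sketch they center near the population mean, and their spread shrinks as n grows.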

A point estimator is a single-number statistic meant to be “close” to a population parameter.
- Example estimators: $\bar{X}$ for $\mu$, $s$ for $\sigma$, $\hat{p}$ for the population proportion $p$.
The observed value from a specific sample is called the point estimate.
Rationale: Because a census is usually impractical, we rely on point estimates as best guesses of the true, unknown parameters.

Sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Population mean: $\mu = E[X]$ (unknown, fixed)
Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2}$
Population standard deviation: $\sigma$ (unknown, fixed)
Sample proportion: $\hat{p} = \frac{\text{number of successes in sample}}{n}$
Point estimator notation summary (estimator → parameter it estimates)
- $\bar{X} \to \mu$
- $s \to \sigma$
- $\hat{p} \to p$
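A short worked example of computing the three point estimates from data. The measurements and the yes/no outcomes are invented for illustration; the formulas follow the definitions of $\bar{X}$, $s$ (with the n-1 divisor), and $\hat{p}$.

```python
import statistics

# Hypothetical sample of n = 8 measurements.
sample = [4.1, 5.0, 3.8, 4.6, 5.3, 4.9, 4.4, 5.1]
n = len(sample)

# Sample mean: sum of the observations divided by n.
x_bar = sum(sample) / n

# Sample standard deviation: squared deviations from x_bar, divided by n - 1.
s = (sum((x - x_bar) ** 2 for x in sample) / (n - 1)) ** 0.5

# Hypothetical yes/no outcomes (1 = success, 0 = failure).
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
p_hat = sum(outcomes) / len(outcomes)

print(f"x_bar = {x_bar:.3f}  (estimate of mu)")
print(f"s     = {s:.3f}  (estimate of sigma)")
print(f"p_hat = {p_hat:.2f}   (estimate of p)")

# Cross-check against the standard library, which also uses the n - 1 divisor.
print(f"statistics.stdev = {statistics.stdev(sample):.3f}")
```

Each printed number is a point estimate, the observed value of its estimator for this one sample. A different sample would give different values.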

Practical: Poor sampling → wasted resources + faulty decisions (policy, business strategy, medicine).
Ethical: Misleading inferences from biased samples can harm under-represented groups or propagate misinformation.
Philosophical: The demand for randomness underscores our acceptance that we cannot fully control or know reality; probability models formalize uncertainty.

Builds on earlier probability concepts – random variables, probability distributions, expected value.
Sets the stage for upcoming topics: Central Limit Theorem, confidence intervals, hypothesis testing.
Reinforces core statistical principle: Variation is inevitable; understanding it is the key to learning from data.

Always start with a clearly defined population and an unbiased sampling method.
Voluntary response almost guarantees selection bias; avoid relying on it for serious inference.
A Simple Random Sample requires a complete sampling frame – feasible for well-cataloged populations, tough otherwise.
Treat statistics as random variables; their sampling distributions describe how much they vary from sample to sample.
Point estimators (e.g., $\bar{X}, s, \hat{p}$) provide best single-value guesses of population parameters when a census is not possible.