Samples and Sampling Distributions

Importance of Representative Samples
- The quality of any statistical analysis is directly tied to how representative the sample data are of the target population.
  - If the sample is not representative, conclusions will be unreliable.
  - Therefore, the first step of any investigation is to design a sampling plan that yields representative data.
- Sampling (studying a subset) saves time and money compared with a full census.
- When done well, the information from a representative sample is “almost as good” as information from a census.
- Key idea: REPRESENTATIVENESS generally requires introducing RANDOMNESS into the sampling procedure.

Random Sampling
- Goal: give every member (or every possible sample) of the population a known, non-zero probability of selection.
- Requires a clear, operational definition of the population.
- Core vocabulary
  - Population: complete set of individuals/objects of interest.
  - Sample: subset of the population that is actually observed/recorded.
  - Random sample: sample chosen by a process governed purely by chance.

Voluntary Response & Selection Bias
- Voluntary sampling (participants choose to opt in) is common in media & internet surveys.
  - Example context: online polls, call-in radio surveys, website feedback links.
- Tends to over-represent individuals with strong opinions and under-represent individuals with weak or no opinions.
- Voluntary response is a specific form of selection bias: the sampling mechanism systematically favors certain population segments.
- Because of this bias, voluntary samples are generally NOT representative of the population.

Simple Random Sampling (SRS)
- Definition: A sample selected so that every possible sample of the same size, n, has the same probability of being selected.
- Construction steps (when a complete list is available):
  1. Create a sampling frame: a numbered list of every population member.
  2. Assign each member a unique ID number.
  3. Select numbers at random until n distinct IDs are chosen.
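The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `draw_srs`, the frame of 8,000 IDs, the sample size of 25, and the seed are all assumptions chosen for the example.

```python
import random


def draw_srs(frame_ids, n, seed=None):
    """Draw n distinct IDs from a sampling frame, one random pick at a time.

    Hypothetical helper for illustration: it keeps picking IDs uniformly at
    random and skips repeats until n distinct IDs have been collected.
    """
    frame = list(frame_ids)
    if len(set(frame)) != len(frame):
        raise ValueError("every member of the frame needs a unique ID")
    if not 0 <= n <= len(frame):
        raise ValueError("n must be between 0 and the frame size")

    rng = random.Random(seed)
    chosen = []
    seen = set()
    while len(chosen) < n:
        candidate = rng.choice(frame)  # each ID is equally likely on each pick
        if candidate not in seen:      # ignore IDs that were already selected
            seen.add(candidate)
            chosen.append(candidate)
    return chosen


# Assumed frame: members numbered 1 through 8000.
frame = range(1, 8001)
sample_ids = draw_srs(frame, n=25, seed=42)
print(sorted(sample_ids))
```

In practice, `random.sample(frame, n)` from the standard library does the same job in a single call; the explicit loop is shown only because it mirrors the "select until n distinct IDs are chosen" step.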
- Methods: drawing numbers from a hat, using a random-number table, or computer-generated random integers.
- Practical guidelines for random-number tables/computer tools:
  - Identify the correct number of digits (e.g., 4 digits for up to 8,000 IDs).
  - Start at any row/column; read consecutive blocks of the chosen digit length.
  - Ignore numbers outside the valid ID range; keep reading until n distinct valid IDs are gathered.
- Example: Selecting n students from 8,000 in a college registry → use the registrar’s database + software to draw random IDs.
- Advantages: Simplicity, well-understood theoretical properties, minimal bias when the frame is complete.
- Main liability: Requires a complete sampling frame.
  - Often impossible for diffuse populations (e.g., “all potential customers of a mall”).

Sampling-Frame Challenges
- Many real-world populations lack an obvious, exhaustive list.
  - Market areas, hidden/rare populations, transitory groups.
- Without a good frame, an SRS can’t be executed directly; alternative designs (cluster, stratified, systematic, multi-stage) may be needed.
- Crafting an accurate frame is costly and time-consuming; errors in the frame introduce coverage bias.

Sampling Distributions & Random Variables
- Think of a statistic (e.g., the sample mean) as a random variable because its value changes from sample to sample.
  - Notation: sample mean $\bar{X}$, population mean $\mu$.
- Sampling distribution of $\bar{X}$: the probability distribution of $\bar{X}$ over all possible samples of size n from the population.
  - Captures variability attributable solely to sampling.
- Key ideas:
  - Even if the population is fixed, the estimator varies because different random samples yield different values.
  - Understanding the sampling distribution lets us quantify uncertainty (e.g., build confidence intervals, perform hypothesis tests).
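One way to see a sampling distribution is to approximate it by simulation. The sketch below is only an illustration under stated assumptions: the population is a made-up list of 8,000 values, and the sample size (30), number of repetitions (2,000), and the helper name `simulate_sample_means` are arbitrary choices. It draws many simple random samples from a fixed population and records the sample mean each time.

```python
import random
import statistics

rng = random.Random(0)

# Assumed population: 8,000 made-up measurements (fixed once generated).
population = [rng.gauss(50, 10) for _ in range(8000)]
mu = statistics.fmean(population)  # population mean, known here only because we built the population


def simulate_sample_means(population, n, reps, rng):
    """Return the sample mean from each of `reps` simple random samples of size n."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]


means = simulate_sample_means(population, n=30, reps=2000, rng=rng)

print(f"population mean:          {mu:.2f}")
print(f"mean of sample means:     {statistics.fmean(means):.2f}")
print(f"std dev of sample means:  {statistics.stdev(means):.2f}")
print(f"std dev of population:    {statistics.pstdev(population):.2f}")
```

The population never changes between repetitions, yet the recorded means differ from sample to sample. Their average should land close to the population mean, and their spread should be noticeably smaller than the spread of the individual values.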

Point Estimation
- A point estimator is a single-number statistic meant to be “close” to a population parameter.
  - Example estimators: $\bar{X}$ for $\mu$, $s$ for $\sigma$, $\hat{p}$ for population proportion $p$.
- The observed value from a specific sample is called the point estimate.
- Rationale: Because a census is impractical, we rely on point estimates as best guesses of true, unknown parameters.
- Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
- Population mean: $\mu = E[X]$ (unknown, fixed)
- Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2}$
- Population standard deviation: $\sigma$ (unknown, fixed)
- Sample proportion: $\hat{p} = \frac{\text{number of successes in sample}}{n}$
- Point estimator notation summary
  - $\bar{X} \to \mu$
  - $s \to \sigma$
  - $\hat{p} \to p$

Practical, Ethical, & Philosophical Implications
- Practical: Poor sampling → wasted resources + faulty decisions (policy, business strategy, medicine).
- Ethical: Misleading inferences from biased samples can harm under-represented groups or propagate misinformation.
- Philosophical: The demand for randomness underscores our acceptance that we cannot fully control or know reality; probability models formalize uncertainty.

Connections & Context
- Builds on earlier probability concepts: random variables, probability distributions, expected value.
- Sets the stage for upcoming topics: Central Limit Theorem, confidence intervals, hypothesis testing.
- Reinforces core statistical principle: Variation is inevitable; understanding it is the key to learning from data.

Study Takeaways
- Always start with a clearly defined population and an unbiased sampling method.
- Voluntary response almost guarantees selection bias; avoid relying on it for serious inference.
- A Simple Random Sample requires a complete sampling frame: feasible for well-cataloged populations, tough otherwise.
- Treat statistics as random variables; their sampling distributions measure sampling variability.
- Point estimators (e.g., $\bar{X}$, $s$, $\hat{p}$) provide best single-value guesses of population parameters when a census is not possible.
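As a small worked illustration of the point estimators named in these notes (sample mean, sample standard deviation with the n − 1 divisor, and sample proportion), the sketch below computes all three from one sample. The data values, the "success" rule (a value of at least 6), and the function name `point_estimates` are assumptions made up for the example.

```python
import math


def point_estimates(values, is_success):
    """Return (sample mean, sample standard deviation, sample proportion).

    `is_success` is a hypothetical rule deciding which observations count
    as successes for the proportion.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to compute s")

    x_bar = sum(values) / n                                           # sample mean
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))    # divisor n - 1
    p_hat = sum(1 for x in values if is_success(x)) / n               # successes / n
    return x_bar, s, p_hat


# Made-up sample of five observations.
sample = [4, 8, 6, 5, 7]
x_bar, s, p_hat = point_estimates(sample, is_success=lambda x: x >= 6)

print(f"x_bar = {x_bar:.2f}")
print(f"s     = {s:.4f}")
print(f"p_hat = {p_hat:.2f}")
```

For this sample the sum is 30, so the sample mean is 6; the squared deviations sum to 10, so s is the square root of 10/4 = 2.5; and three of the five values meet the success rule, giving a sample proportion of 0.6. A different random sample from the same population would give different point estimates.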