Notes on Sampling Bias and Random Sampling
Sampling bias: occurs when the sample is not representative of the population, due to how the sample is picked or who responds.
- Key idea: bias enters when the selection procedure systematically favors or excludes part of the population, so the sample no longer mirrors it.
- This bias is very common in practice.
Population vs. sample (UT Tyler example)
- Population: all students at UT Tyler (qualitative description matters more than the raw count).
- Sample: a subset of students you actually survey (e.g., 500 students, 1,000, etc.).
- You don’t need to know the exact population size to define the population; you need to know who qualifies as members of the population.
- When estimating the average hours studied per week at UT Tyler, define the population as all UT Tyler students and the sample as the students you actually survey.
Common sample types (A–D) and why they can be biased
- A) Library-based sample: go to the library and ask all students there.
- Problem: not representative, because library-goers are not a random cross-section of students; many students study elsewhere or rarely visit the library.
- Consequence: tends to overestimate study hours, since students found in the library likely study more than average.
- B) Email all students asking how much they study.
- Intuition: this seems fair, since you are asking everyone; however, nonresponse bias can occur if a large fraction does not reply.
- Issue: some students never respond; if, say, 70% reply, you cannot assume they represent the remaining 30%.
- Concept: nonresponse bias; the sample is biased if nonrespondents differ from respondents.
- C) Sampling within a class (e.g., focus on students in this class).
- Problem: excludes remote students or those not enrolled in that class; also may over-represent students who are present and engaged.
- D) Focus on a specific subgroup (e.g., only graduate students or only students from one department).
- Problem: excludes other groups; introduces bias toward the characteristics of the chosen subgroup.
For all A–D, the common point is that some part of the population has no chance to be selected (or is less likely to be selected), which creates sampling bias.
- If certain groups cannot be selected, or are less likely to participate, the sample is biased.
- Examples: remote students, students who never use certain services, or students who never check email.
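The library example (A) can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions: the population's weekly study hours are drawn uniformly between 0 and 30, and a student's chance of being found in the library is assumed to grow in proportion to their study hours. The function name `simulate_library_bias` and all numbers are hypothetical.

```python
import random
import statistics

def simulate_library_bias(population_size=10_000, seed=0):
    """Compare the true mean study hours with the mean of a library-only sample."""
    rng = random.Random(seed)
    # Hypothetical population: weekly study hours spread uniformly over 0..30.
    hours = [rng.uniform(0, 30) for _ in range(population_size)]
    # Assumption: the chance of being in the library rises with study hours
    # (0 hours -> never there, 30 hours -> always there).
    library_sample = [h for h in hours if rng.random() < h / 30]
    return statistics.mean(hours), statistics.mean(library_sample)

population_mean, library_mean = simulate_library_bias()
print(f"population mean:     {population_mean:.1f} hours/week")
print(f"library sample mean: {library_mean:.1f} hours/week")
```

Under these assumptions the library sample mean comes out noticeably higher than the population mean, which is the overestimate described in (A). How large the gap is depends entirely on the assumed link between study hours and library use.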
How to avoid sampling bias: aim for random selection
- Random sampling means giving every case in the population the same chance of being selected.
- Simple random sample (SRS): every possible sample of size n has the same probability of being chosen.
- More advanced sampling schemes exist, but this class focuses on simple random sampling.
- The core idea: avoid systematic exclusion or over-representation of any subgroup.
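Drawing a simple random sample with software might look like the sketch below. The roster of 100 student IDs and the helper name `simple_random_sample` are hypothetical; the sketch relies on Python's standard `random.sample`, which draws distinct items without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    # random.sample selects without replacement, so no unit appears twice.
    return rng.sample(population, n)

# Hypothetical sampling frame: 100 student IDs.
roster = [f"student_{i:03d}" for i in range(1, 101)]
sample = simple_random_sample(roster, n=10, seed=42)
print(sample)
```

Fixing the seed only makes the draw reproducible for documentation; omit it for a fresh draw each time.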
Simple random sample: precise definitions and math
- Let N be the population size, and n the sample size.
- A simple random sample S of size n is a subset of the population such that every subset of size n has equal probability of being selected.
- The probability of selecting any particular sample S is:
  $P(S) = \frac{1}{\binom{N}{n}}.$
- If you draw with replacement (putting each name back after drawing), each draw is independent and each unit has probability $\frac{1}{N}$ of being chosen on any given draw.
- Probability a given unit is selected at least once in n draws with replacement:
  $P(\text{selected at least once}) = 1 - \biggl(1 - \frac{1}{N}\biggr)^n.$
- If you draw without replacement (do not put names back), each unit has probability $\frac{n}{N}$ of being included in the sample (since you select n out of N without replacement).
- The visual example: 100 students, drawing one name at a time.
- Jessica selected first: probability $\frac{1}{100}$.
- If you want a second name and you do not put Jessica back, Tom’s conditional probability on that draw becomes $\frac{1}{99}$, whereas Jessica’s probability on that second draw becomes 0.
- To keep the chance equal on every draw, put the name back; otherwise, accept the changing per-draw probabilities as part of the sampling design. Drawing without replacement is still fair overall, since each student's chance of ending up in the sample is $\frac{n}{N}$.
- Lottery analogy: a random draw with replacement gives equal, independent chances on every draw, because the pool never shrinks and so introduces no systematic bias.
- Modern practice: for very large populations, sampling with technology (randomization software, online panels) is common to handle large N efficiently while preserving randomness.
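The formulas in this section can be checked with exact arithmetic. The sketch below uses N = 100 from the name-drawing example; the sample size n = 10 is an assumption chosen only for illustration.

```python
from fractions import Fraction
from math import comb

N = 100  # population size, as in the 100-student example
n = 10   # sample size (assumed for illustration)

# Probability of one particular sample of size n under SRS: 1 / C(N, n).
p_specific_sample = Fraction(1, comb(N, n))

# With replacement: chance a given student is drawn at least once in n draws.
p_at_least_once = 1 - (1 - Fraction(1, N)) ** n

# Without replacement: a given student is in C(N-1, n-1) of the C(N, n) samples.
p_included = Fraction(comb(N - 1, n - 1), comb(N, n))

# Tom drawn second without replacement: not drawn first, then drawn from 99 names.
p_tom_second = Fraction(N - 1, N) * Fraction(1, N - 1)

print("P(specific sample)      =", p_specific_sample)
print("P(at least once, w/ r.) =", float(p_at_least_once))
print("P(included, w/o r.)     =", p_included)
print("P(Tom drawn second)     =", p_tom_second)
```

The last two values reduce to n/N and 1/N. So although Tom's conditional chance on the second draw is 1/99, his chance before any name is drawn is still 1/100, the same as Jessica's chance on the first draw.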
Random sampling in practice: philosophy and trade-offs
- Randomness is foundational in statistics: without randomness, you cannot rely on the sample to generalize.
- Random sampling can be implemented using traditional methods (e.g., a black box with name tags) or technology for large samples.
- Explicitly acknowledge that some data collection methods (like unsolicited emails with incentives) may be biased; randomization alone does not guarantee representativeness if nonresponse or coverage biases are present.
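To see why asking everyone does not by itself guarantee representativeness, here is a rough simulation of nonresponse. Every student is contacted, but the reply probability is assumed to rise with study hours (from 0.4 for 0 hours to 1.0 for 30 hours). This response model and the name `simulate_nonresponse` are invented for illustration, not taken from data.

```python
import random
import statistics

def simulate_nonresponse(population_size=10_000, seed=1):
    """Contact the whole population, but let reply probability depend on study hours."""
    rng = random.Random(seed)
    # Hypothetical population: weekly study hours spread uniformly over 0..30.
    hours = [rng.uniform(0, 30) for _ in range(population_size)]
    # Assumed response model: students who study more are more likely to reply.
    respondents = [h for h in hours if rng.random() < 0.4 + 0.6 * h / 30]
    response_rate = len(respondents) / population_size
    return statistics.mean(hours), statistics.mean(respondents), response_rate

true_mean, respondent_mean, rate = simulate_nonresponse()
print(f"response rate:   {rate:.0%}")
print(f"true mean:       {true_mean:.1f} hours/week")
print(f"respondent mean: {respondent_mean:.1f} hours/week")
```

With this assumed model about 70% of students reply, yet the respondents' average overstates the population average, because the missing 30% differ systematically from those who answered.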
Other forms of bias beyond sampling bias
- Question wording bias (leading or loaded questions):
- Example: two questions about speech against democracy that appear similar but steer respondents differently through phrasing:
- Q1: "Should speeches against democracy be allowed?"
- Q2: "Should speeches against democracy be forbidden?"
- Although the questions may be conceptually similar, the emphasis (allowed vs. forbidden) shifts focus and yields different results.
- Context bias (responses influenced by framing):
- Providing contextual descriptions (e.g., defining what counts as government programs) can nudge respondents toward a particular stance.
- This occurs in public surveys and media polls and can distort respondents' true preferences unless carefully controlled.
- Inaccurate or biased self-assessment (response bias):
- Anecdote: 93% of US students said they were in the top half for driving skill.
- Such self-assessment questions suffer from overconfidence or social desirability bias; they do not accurately measure actual skill.
- Remedy: use quantitative/measurable indicators (e.g., number of tickets in the past 3 years) rather than self-rated qualitative judgments.
- Nonresponse bias (the sample can be biased even when everyone was asked):
- If a large portion does not respond, the respondents may differ from the nonrespondents, biasing results.
- Coverage bias / selection bias:
- Some groups are never able to participate (e.g., seniors without email, remote students, etc.). If these groups are excluded, the sample is biased.
- Forced participation bias:
- Forcing individuals to respond can contaminate the data: unwilling participants may answer carelessly or untruthfully, which introduces response bias.
- Context- and policy-framing bias in media/elections:
- Questionnaires during campaigns may be designed to elicit preferred responses, intentionally or unintentionally.
- Example: questions about government programs labeled with favorable framing to influence responses.
- Quantitative vs. qualitative measurement bias:
- Quantitative questions (e.g., counts, tickets, years) tend to yield more reliable data than yes/no or vague judgments.
- Qualitative questions are more susceptible to misinterpretation and misreporting.
- Ice cream and drowning example (confounding and correlation vs causation):
- Claim: consumption of ice cream causes drowning.
- In reality, both ice cream consumption and drowning deaths rise in hot weather; temperature acts as a confounding variable (seasonality).
- Lesson: correlation does not imply causation; beware lurking variables that jointly influence two observed trends.
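The ice cream and drowning example can be sketched as a simulation in which temperature drives both variables and neither affects the other. All coefficients and noise levels are invented; `pearson` and `residuals` are small helper functions written here, not library calls.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(2)
days = 1000
# Confounder: daily temperature. Both outcomes depend on it, not on each other.
temperature = [rng.uniform(0, 35) for _ in range(days)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.1 * t + rng.gauss(0, 1) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
adjusted_corr = pearson(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))
print(f"raw correlation:                  {raw_corr:.2f}")
print(f"after adjusting for temperature:  {adjusted_corr:.2f}")
```

The raw correlation is clearly positive, while the correlation between the temperature-adjusted residuals is close to zero. That is the pattern expected when a lurking variable, rather than a causal link, produces the association.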
Practical implications and critical thinking
- Be skeptical of data collection methods: ask whether a sample is truly representative or if there is potential bias.
- Evaluate whether the sampling method and question design might influence results beyond the underlying reality.
- Recognize trade-offs between ideal rigorous sampling and practical constraints (cost, time, accessibility).
- When in doubt, document the sampling frame, response rates, and potential biases; present data with transparency about representativeness.
Quick takeaway checklist for avoiding bias in surveys
- Define population clearly (who qualifies).
- Ensure the sampling method gives every unit a chance to be selected (randomness).
- Prefer simple random samples or well-documented randomization schemes.
- Monitor and mitigate nonresponse (follow-ups, incentives, weighting when appropriate).
- Include diverse subgroups to avoid coverage bias; do not over-focus on a single subgroup.
- Use quantitative, precise questions to reduce ambiguity and overinterpretation.
- Be wary of question wording, framing, and context effects; randomize question order when possible.
- Be mindful of confounding variables; distinguish correlation from causation.
- Report sample size, response rate, and sampling method; discuss potential biases openly.
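As one illustration of the "weighting when appropriate" item, the sketch below reweights group averages by known population shares (a simple form of post-stratification). The groups, shares, and responses are entirely hypothetical, and the approach assumes that respondents within each group resemble that group's nonrespondents, which may not hold in practice.

```python
def poststratified_mean(responses, population_shares):
    """Weight each group's average response by that group's population share."""
    by_group = {}
    for group, value in responses:
        by_group.setdefault(group, []).append(value)
    return sum(share * sum(by_group[group]) / len(by_group[group])
               for group, share in population_shares.items())

# Hypothetical survey: graduate students are over-represented among respondents.
responses = [("undergrad", 10), ("undergrad", 14),
             ("grad", 20), ("grad", 22), ("grad", 24), ("grad", 18)]
# Assumed known shares of each group in the population.
population_shares = {"undergrad": 0.8, "grad": 0.2}

raw_mean = sum(value for _, value in responses) / len(responses)
weighted_mean = poststratified_mean(responses, population_shares)
print(f"unweighted mean: {raw_mean:.1f} hours/week")
print(f"weighted mean:   {weighted_mean:.1f} hours/week")
```

Weighting corrects only for imbalance in the variables used to form the groups; it cannot repair differences between respondents and nonrespondents inside a group.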
Connections to foundational principles
- Sampling bias relates to the core probabilistic idea that inference relies on representative data.
- Simple random sampling embodies the principle of equal opportunity for selection, which underpins unbiased inference.
- Nonresponse and framing biases illustrate why controlling methodology is as important as the data itself in statistical conclusions.
Final example recap (ice cream story)
- Observed correlation between ice cream consumption and drowning deaths across seasons.
- The underlying mechanism is a common cause (seasonal temperature) rather than a causal link between ice cream and drowning.
- Demonstrates the importance of considering lurking variables and avoiding naive causal conclusions from correlations.
Ethical and real-world relevance
- Biased sampling can lead to misleading public policy, misinformed consumer opinions, and skewed media narratives.
- Critical thinking and transparency about methods help combat misinformation and promote better decision-making.