Notes on Sampling Bias and Random Sampling

  • Sampling bias: occurs when the sample is not representative of the population, due to how the sample is picked or who responds.

    • Key idea: if you cannot pick a sample properly or if the sample is not representative, sampling bias occurs.
    • This bias is very common in practice.
  • Population vs. sample (UT Tyler example)

    • Population: all students at UT Tyler (qualitative description matters more than the raw count).
    • Sample: a subset of students you actually survey (e.g., 500 students, 1,000, etc.).
    • You don’t need to know the exact population size to define the population; you need to know who qualifies as members of the population.
    • When estimating the average hours studied per week at UT Tyler, define population as UT Tyler students; sample as the selected students you survey.
  • Common sample types (A–D) and why they can be biased

    • A) Library-based sample: go to the library and ask all students there.
    • Problem: not representative, because library-goers are not a random cross-section of students; many students study elsewhere or rarely study at all.
    • Consequence: tends to overestimate study hours, since students found in the library likely study more than average.
    • B) Email all students asking how much they study.
    • Intuition: fair, since you’re asking everyone; however, nonresponse bias can occur if a large fraction doesn’t reply.
    • Issue: some students never respond; if, say, 70% reply, you cannot assume they represent the remaining 30%.
    • Concept: nonresponse bias; sample is biased if nonrespondents differ from respondents.
    • C) Sampling within a class (e.g., focus on students in this class).
    • Problem: excludes remote students or those not enrolled in that class; also may over-represent students who are present and engaged.
    • D) Focus on a specific subgroup (e.g., only graduate students or only students from one department).
    • Problem: excludes other groups; introduces bias toward the characteristics of the chosen subgroup.
  • For all A–D, the common point is that some part of the population has no chance to be selected (or is less likely to be selected), which creates sampling bias.

    • If certain groups cannot be selected, or are less likely to participate, the sample is biased.
    • Examples: remote students, students who never use certain services, or students who never check email.
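
The library example (A) can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population size, the 20% share of library users, and the study-hour distributions are all hypothetical and chosen just to make the effect visible.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 students.
# Assumption: 20% use the library, and they study more on average.
population = []
for _ in range(10_000):
    uses_library = rng.random() < 0.20
    mean_hours = 18 if uses_library else 10
    hours = max(0.0, rng.gauss(mean_hours, 4))  # weekly study hours, never negative
    population.append((uses_library, hours))

true_mean = statistics.mean(h for _, h in population)

# Sample A: 500 students found in the library.
# Students who never go there have zero chance of being selected.
library_sample = [h for in_lib, h in population if in_lib][:500]
library_mean = statistics.mean(library_sample)

# Simple random sample of the same size from the whole population.
srs = rng.sample(population, 500)
srs_mean = statistics.mean(h for _, h in srs)

print(f"population mean: {true_mean:.2f}")
print(f"library sample:  {library_mean:.2f}")
print(f"random sample:   {srs_mean:.2f}")
```

Under these assumptions the library sample lands well above the population mean, while the random sample of the same size stays close to it.
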
  • How to avoid sampling bias: aim for random selection

    • Random sampling means giving every case in the population the same chance of being selected.
    • Simple random sample (SRS): every possible sample of size n has the same probability of being chosen.
    • More advanced sampling schemes exist, but this class focuses on simple random sampling.
    • The core idea: avoid systematic exclusion or over-representation of any subgroup.
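
One way to draw a simple random sample in software is Python's standard `random.sample`, which selects distinct items without replacement. The roster, the helper name `simple_random_sample`, and the seed below are hypothetical; this is a minimal sketch, not a prescribed procedure.

```python
import random

def simple_random_sample(roster, n, seed=None):
    """Draw n distinct units from roster without replacement.

    Every subset of size n is equally likely to be returned.
    """
    rng = random.Random(seed)
    return rng.sample(roster, n)

# Hypothetical roster of 100 students.
roster = [f"student_{i:03d}" for i in range(1, 101)]

sample = simple_random_sample(roster, 10, seed=42)
print(sample)
```

Fixing the seed makes the draw reproducible, which helps when the sampling procedure has to be documented.
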
  • Simple random sample: precise definitions and math

    • Let N be the population size, and n the sample size.
    • A simple random sample S of size n is a subset of the population such that every subset of size n has equal probability of being selected.
    • The probability of selecting any particular sample S is:
      P(S) = \frac{1}{\binom{N}{n}}.
    • If you draw with replacement (putting each name back after drawing), each draw is independent and each unit has probability \frac{1}{N} of being chosen on any given draw.
    • Probability a given unit is selected at least once in n draws with replacement:
      P(\text{selected at least once}) = 1 - \biggl(1 - \frac{1}{N}\biggr)^n.
    • If you draw without replacement (do not replace), each unit has probability \frac{n}{N} to be included in the sample (since you select n out of N without replacement).
    • The visual example: 100 students, drawing one name at a time.
    • Jessica selected first: probability \frac{1}{100}.
    • If you want a second name and you do not put Jessica back, probability for Tom becomes \frac{1}{99} on the conditional draw, whereas Jessica’s probability becomes 0 on that second draw.
    • Both designs are fair. Replacing each name keeps every draw at \frac{1}{N}; not replacing changes the draw-by-draw probabilities, but before drawing begins every student still has the same chance \frac{n}{N} of ending up in the sample.
    • Lottery analogy: drawing with replacement gives equal, independent chances on every draw; drawing without replacement shrinks the pool but favors no one, so neither design introduces systematic bias.
    • Modern practice: for very large populations, sampling with technology (randomization software, online panels) is common to handle large N efficiently while preserving randomness.
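
The probabilities in this section can be checked with exact arithmetic. The sketch below uses N = 100 from the name-drawing example; the sample size n = 10 is an assumption added only for illustration.

```python
from fractions import Fraction
from math import comb

N, n = 100, 10  # N from the 100-student example; n = 10 is assumed

# Probability of any one particular sample under SRS: 1 / C(N, n).
p_sample = Fraction(1, comb(N, n))

# Without replacement: (samples containing a given student) / (all samples).
p_incl_without = Fraction(comb(N - 1, n - 1), comb(N, n))  # simplifies to n/N

# With replacement: chance of being drawn at least once in n draws.
p_incl_with = 1 - (1 - Fraction(1, N)) ** n

# Sequential draws without replacement.
p_jessica_first = Fraction(1, N)
p_tom_second_given_jessica_first = Fraction(1, N - 1)
# Before any name is drawn, Tom's chance of being the second name is still 1/N:
# P(Tom not first) * P(Tom second | Tom not first).
p_tom_second = (1 - Fraction(1, N)) * Fraction(1, N - 1)

print("P(particular sample)       =", p_sample)
print("P(included, no replacement) =", p_incl_without)
print("P(included, replacement)    =", float(p_incl_with))
print("P(Tom is second name)       =", p_tom_second)
```

The last line shows why shrinking the pool is not unfair: conditional on Jessica being drawn first, Tom's chance is 1/99, but averaged over all possible first draws it is the same 1/100 as everyone else's.
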
  • Random sampling in practice: philosophy and trade-offs

    • Randomness is foundational in statistics: without randomness, you cannot rely on the sample to generalize.
    • Random sampling can be implemented using traditional methods (e.g., a black box with name tags) or technology for large samples.
    • Explicitly acknowledge that some data collection methods (like unsolicited emails with incentives) may be biased; randomization alone does not guarantee representativeness if nonresponse or coverage biases are present.
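
The point that contacting everyone does not protect against nonresponse bias can be sketched with a simulation. Everything here is hypothetical: the population, and in particular the assumption that students who study more are more likely to reply.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: weekly study hours for 10,000 students.
hours = [max(0.0, rng.gauss(12, 5)) for _ in range(10_000)]
true_mean = statistics.mean(hours)

def reply_probability(h):
    """Assumed response model: the more a student studies, the likelier a reply."""
    return min(0.9, max(0.1, 0.1 + 0.04 * h))

# Everyone is emailed, but only some students answer.
respondents = [h for h in hours if rng.random() < reply_probability(h)]
respondent_mean = statistics.mean(respondents)
response_rate = len(respondents) / len(hours)

print(f"response rate:   {response_rate:.0%}")
print(f"population mean: {true_mean:.2f}")
print(f"respondent mean: {respondent_mean:.2f}")
```

Because replying is related to the quantity being measured, the respondent mean overshoots the population mean even though every student was contacted.
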
  • Other forms of bias beyond sampling bias

    • Question wording bias (leading or loaded questions):
    • Example: two near-identical questions about anti-democratic speech that steer respondents differently through phrasing:
      • Q1: "Should speeches against democracy be allowed?"
      • Q2: "Should speeches against democracy be forbidden?"
    • Although the questions may be conceptually similar, the emphasis (allowed vs forbidden) shifts focus and yields different results.
    • Context bias (responses influenced by framing):
    • Providing contextual descriptions (e.g., defining what counts as government programs) can nudge respondents toward a particular stance.
    • This happens in public surveys and media polls, and it can distort the measured preference unless the context is carefully controlled.
    • Inaccurate or biased self-assessment (response bias):
    • Anecdote: 93% of US students said they were in the top half for driving skill.
    • Such self-assessment questions suffer from overconfidence or social desirability bias; they do not accurately measure actual skill.
    • Remedy: use quantitative/measurable indicators (e.g., number of tickets in the past 3 years) rather than self-rated qualitative judgments.
    • Nonresponse bias (the sample can be biased even when selection was random):
    • If a large portion does not respond, the respondent group may differ from nonrespondents, biasing results.
    • Coverage bias / selection bias:
    • Some groups are never able to participate (e.g., seniors without email, remote students, etc.). If these groups are excluded, the sample is biased.
    • Forced participation bias:
    • Forcing individuals to respond does not remove bias: reluctant participants may answer carelessly or untruthfully, which trades nonresponse bias for response bias.
    • Context- and policy-framing bias in media/elections:
    • Questionnaires during campaigns may be designed to elicit preferred responses, intentionally or unintentionally.
    • Example: questions about government programs labeled with favorable framing to influence responses.
    • Quantitative vs. qualitative measurement bias:
    • Quantitative questions (e.g., counts, tickets, years) tend to yield more reliable data than yes/no or vague judgments.
    • Qualitative questions are more susceptible to misinterpretation and misreporting.
    • Ice cream and drowning example (confounding and correlation vs causation):
    • Claim: consumption of ice cream causes drowning.
    • In reality, both ice cream consumption and drowning deaths rise in hot weather; temperature acts as a confounding variable (seasonality).
    • Lesson: correlation does not imply causation; beware lurking variables that jointly influence two observed trends.
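
The ice cream example can be reproduced with simulated data in which temperature drives both series and neither affects the other. All numbers are invented for illustration, and the helpers `pearson` and `residuals` are small hand-written functions, not library calls.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """What is left of y after removing a least-squares line on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

# Hypothetical daily data: temperature is the common cause.
temperature = [rng.uniform(0, 35) for _ in range(1000)]
ice_cream = [50 + 10 * t + rng.gauss(0, 40) for t in temperature]
drownings = [1 + 0.2 * t + rng.gauss(0, 1.5) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
# Correlation once the temperature effect is removed from both series.
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print(f"raw correlation:                  {raw_corr:.2f}")
print(f"after controlling for temperature: {partial_corr:.2f}")
```

The raw correlation is strongly positive, yet it largely disappears once temperature is accounted for, which is the signature of a lurking variable rather than a causal link.
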
  • Practical implications and critical thinking

    • Be skeptical of data collection methods: ask whether a sample is truly representative or if there is potential bias.
    • Evaluate whether the sampling method and question design might influence results beyond the underlying reality.
    • Recognize trade-offs between ideal rigorous sampling and practical constraints (cost, time, accessibility).
    • When in doubt, document the sampling frame, response rates, and potential biases; present data with transparency about representativeness.
  • Quick takeaway checklist for avoiding bias in surveys

    • Define population clearly (who qualifies).
    • Ensure the sampling method gives every unit a chance to be selected (randomness).
    • Prefer simple random samples or well-documented randomization schemes.
    • Monitor and mitigate nonresponse (follow-ups, incentives, weighting when appropriate).
    • Include diverse subgroups to avoid coverage bias; do not over-focus on a single subgroup.
    • Use quantitative, precise questions to reduce ambiguity and overinterpretation.
    • Be wary of question wording, framing, and context effects; randomize question order when possible.
    • Be mindful of confounding variables; distinguish correlation from causation.
    • Report sample size, response rate, and sampling method; discuss potential biases openly.
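
As a sketch of the "weighting when appropriate" item, the example below reweights respondents so that each group counts in proportion to its share of the population. The two groups, their respondent counts and means, and the population shares are all hypothetical.

```python
# Hypothetical respondent summary per group: (number of respondents, mean hours).
respondents = {"on_campus": (400, 14.0), "remote": (100, 9.0)}

# Assumed known population shares (for example, from enrollment records).
population_share = {"on_campus": 0.6, "remote": 0.4}

total = sum(n for n, _ in respondents.values())

# Unweighted: remote students are under-represented (20% of replies vs 40% of population).
unweighted_mean = sum(n * m for n, m in respondents.values()) / total

# Weighted: each group's mean counts according to its population share.
weighted_mean = sum(population_share[g] * m for g, (_, m) in respondents.items())

print(f"unweighted mean: {unweighted_mean:.1f}")  # 13.0
print(f"weighted mean:   {weighted_mean:.1f}")    # 12.0
```

Weighting of this kind only corrects imbalance between the groups used for weighting; it does not help if respondents differ from nonrespondents within a group.
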
  • Connections to foundational principles

    • Sampling bias relates to the core probabilistic idea that inference relies on representative data.
    • Simple random sampling embodies the principle of equal opportunity for selection, which underpins unbiased inference.
    • Nonresponse and framing biases illustrate why controlling methodology is as important as the data itself in statistical conclusions.
  • Final example recap (ice cream story)

    • Observed correlation between ice cream consumption and drowning deaths across seasons.
    • The underlying mechanism is a common cause (seasonal temperature) rather than a causal link between ice cream and drowning.
    • Demonstrates the importance of considering lurking variables and avoiding naive causal conclusions from correlations.
  • Ethical and real-world relevance

    • Biased sampling can lead to misleading public policy, misinformed consumer opinions, and skewed media narratives.
    • Critical thinking and transparency about methods help combat misinformation and promote better decision-making.