Lecture 3 – Sampling Frames and Frame Error (Stats 240)

  • Course context and aims

    • This lecture focuses on sampling frames, frame errors, and how to find and evaluate sampling frames.

    • Goal: understand where mistakes and bias come from in surveys to interpret results correctly.

    • Repetition as a learning strategy: recaps basic concepts before introducing new material.

    • Readings and quizzes go live this afternoon on Canvas; quiz helps reinforce fundamental principles.

  • Key concepts and terminology

    • Target population: the group we want to make inferences about.

    • Framing concepts:

    • Frame population (sampling frame): the population from which we actually draw the sample; the list or frame used to select units.

    • Sampled population: the population from the sampling frame that we end up sampling.

    • Respondents: people who actually respond and are eligible to respond.

    • Important caveat: respondents are seldom the same as the target or sampled populations; this mismatch is a major source of nonsampling error.

    • Coverage error: arises when the frame population from which the sample is drawn differs from the target population (in the lecture's diagram, the target and the frame are drawn as different shapes).

  • Accuracy, precision, bias, and random error

    • Accuracy: being on target; hitting the right value.

    • Precision: ability to produce similar results across repeated attempts.

    • You can be both accurate and precise: consistently on target.

    • Bias (systematic error): a non-random distortion that pulls results away from the true value; can be fatal to a study if not addressed.

    • Random error: unpredictable fluctuations due to chance; acceptable if random and well-distributed, as it tends to cancel out with more data.

    • Bias vs random error:

    • Bias affects validity and representativeness; not random, often clustered or linked to specific factors.

    • Random error affects reliability but can be mitigated with larger samples; bias remains a problem even with big samples.

    • Practical implication: aim for prevention of bias; random error can often be handled with design and increased sample size, but bias might persist or even worsen with certain actions (e.g., increasing sample size without addressing frame issues).

    • Bias can occur at any stage of a study; more complex designs increase opportunities for bias.
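
The contrast between bias and random error can be illustrated with a small simulation. This is a minimal sketch, not part of the lecture: the true mean (50), the systematic offset (3), and the measurement spread (10) are invented values, and `simulate_estimates` is a hypothetical helper name.

```python
import random
import statistics

def simulate_estimates(n, bias, reps=200, true_mean=50.0, sd=10.0, seed=0):
    """Return `reps` sample means, each computed from n noisy measurements
    that all share the same systematic offset `bias` (assumed values)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(true_mean + bias, sd) for _ in range(n)]
        estimates.append(statistics.fmean(sample))
    return estimates

for n in (25, 400):
    est = simulate_estimates(n, bias=3.0)
    spread = statistics.stdev(est)         # random error: shrinks as n grows
    offset = statistics.fmean(est) - 50.0  # bias: stays near 3 whatever n is
    print(f"n={n:4d}  spread={spread:.2f}  offset={offset:.2f}")
```

With the larger sample the spread of the estimates falls, but the offset from the true value does not, which is the sense in which more data cannot repair a systematic error.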

  • Nonsampling error vs sampling error

    • Nonsampling error: all errors not due to the act of sampling itself; occurs at every step of the survey process (problem specification, data collection, data entry, etc.).

    • Sampling error: error due to observing a sample instead of the full population; can be mitigated by random sampling and larger samples, assuming a good frame.

    • Sources of nonsampling error include:

    • Flawed problem specification or definition of what is being studied.

    • Incorrect target population or misidentified questions.

    • Incomplete information from respondents (nonresponse, partial responses).

    • Data management and entry errors.

    • Confounders and mis-specified causal paths (see examples below).

  • Confounding, mediation, and problem formulation

    • Confounder: a variable that influences both the exposure and the outcome, potentially creating a spurious association if not controlled.

    • Example 1 (coffee and pancreatic cancer): coffee consumption (exposure) appears to be associated with pancreatic cancer (outcome), but smoking is a confounder that affects both the exposure (smokers tend to drink more coffee) and the outcome (smoking-related carcinogens raise cancer risk).

    • Important caution: associations can be misleading if confounders are not considered; coffee may be a mediator or linked to smoking, not the true causal factor.

    • Example 2 (vaccination and death rates across countries): high-vaccination countries showing higher death rates could be due to other factors (e.g., age structure, lockdowns reducing flu burden, respiratory system damage from COVID, waning immunity over time) rather than vaccines increasing death risk. Emphasizes need to examine causal pathways, not just simple associations.

    • The big point: good problem formulation requires considering all relevant causal pathways and potential confounders to avoid misleading conclusions.

    • Michael Plank example: in New Zealand, excess deaths during COVID were negative when accounting for older population and other factors; highlights importance of robust interpretation of mortality data.
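
The confounding pattern in Example 1 can be sketched with simulated data. Everything here is an assumption made for illustration: the probabilities are invented, the outcome is a generic "disease", and coffee is given no effect on it by construction, so any crude association comes from smoking alone.

```python
import random

def simulate_people(n=100_000, seed=1):
    """Simulate (smoker, coffee, disease) triples with assumed probabilities."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        smoker = rng.random() < 0.3
        # Smokers are more likely to drink coffee (assumed 80% vs 40%).
        coffee = rng.random() < (0.8 if smoker else 0.4)
        # Disease risk depends on smoking only; coffee plays no role.
        disease = rng.random() < (0.10 if smoker else 0.02)
        people.append((smoker, coffee, disease))
    return people

def risk(people, coffee, smoker=None):
    """Disease risk among people with the given coffee (and optionally smoking) status."""
    group = [d for s, c, d in people if c == coffee and (smoker is None or s == smoker)]
    return sum(group) / len(group)

people = simulate_people()
crude_diff = risk(people, True) - risk(people, False)
within_smokers = risk(people, True, smoker=True) - risk(people, False, smoker=True)
within_nonsmokers = risk(people, True, smoker=False) - risk(people, False, smoker=False)

print(f"crude risk difference:      {crude_diff:.3f}")
print(f"among smokers:              {within_smokers:.3f}")
print(f"among non-smokers:          {within_nonsmokers:.3f}")
```

The crude comparison shows coffee drinkers at higher risk, while the comparisons within smokers and within non-smokers are close to zero. Stratifying on the confounder is one simple way to check whether an association survives.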

  • Prevention, planning, and data quality processes

    • Prevention is the preferred strategy to minimize nonsampling errors:

    • Preparation and planning as a core activity before data collection.

    • Pretesting (pilot testing) survey instruments with similar populations to ensure questions measure what they intend to measure and are understood.

    • Piloting the entire process to identify issues in flow and logistics.

    • Training and supervision to maintain consistency across interviewers and data collection staff.

    • Processing and quality assessment to monitor ongoing data collection quality.

    • Budget constraints often force smaller samples or less rigorous processes; trade-offs must be documented and acknowledged.

    • “Copying” in surveys (as discussed) refers to practical duplication during implementation to maintain consistency; this will be addressed in more detail later.

  • Framing errors and coverage bias (in-depth)

    • Four key terms in sampling:

    • Target population: the group of interest for inference.

    • Frame population (sampling frame): the actual list from which sampling units are drawn.

    • Sampled population: the portion of the frame that is actually sampled.

    • Respondents: those who respond and are eligible.

    • Frame errors are a major source of nonsampling error and are difficult or impossible to correct after data collection.

    • Visualizing the problem: target population vs sampling frame can differ in size and shape (e.g., oval vs square). Some groups may be included in the target but not in the frame (missing from frame); others in the frame may not respond (nonresponse); others may be ineligible.

    • Coverage error specifics:

    • Not everybody in the target population is in the sampling frame (undercoverage).

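    • Not everybody in the sampling frame responds (nonresponse); strictly this is a separate error from undercoverage, but it widens the gap between respondents and the target population in the same way.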

    • Causes of frame errors include:

    • Incomplete or outdated lists (e.g., old addresses, outdated telephone directories).

    • List type mismatches: sampling units vs. individuals vs. households vs. dwellings (e.g., landlines as sampling units for households, addresses as dwellings).

    • Data quality issues: duplicates, incorrect ages, erroneous identifiers (e.g., NHIs in New Zealand with duplicates).

    • Examples of sampling frames across contexts:

    • Phone surveys: landline frames target households; random-digit dialing targets phone numbers; cell-phone frames can give mixed results, for example reaching many non-teenagers when the target population is teenagers.

    • In-person surveys: address lists sample dwellings rather than individuals or households; you may encounter multiple residents per dwelling.

    • Business surveys: consumer loyalty programs or customer lists provide high-quality frames for businesses.

    • Social media and online platforms (e.g., Facebook): can generate frames via user activity profiling but raise ethical concerns and potential bias.

    • Geographic sampling: mesh blocks or geographic units with random starting points and directionality to create spatial frames.

    • Important caveat: frames are usually imperfect; you should always acknowledge limitations and potential biases when reporting results.
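
Some of the data quality issues listed in this section (duplicate identifiers, implausible ages) can be screened for before a sample is drawn. The following is a minimal sketch under stated assumptions: the records, the identifier format, and the age limits are all invented, and `audit_frame` is a hypothetical helper, not a procedure from the lecture.

```python
from collections import Counter

# Hypothetical frame records as (identifier, age) pairs; values are invented.
frame = [
    ("A001", 34),
    ("A002", 61),
    ("A001", 34),   # duplicate identifier
    ("A003", 207),  # implausible age
    ("A004", 19),
]

def audit_frame(records, min_age=0, max_age=120):
    """Return identifiers that appear more than once and identifiers
    whose recorded age falls outside the assumed plausible range."""
    counts = Counter(identifier for identifier, _ in records)
    duplicates = sorted(i for i, c in counts.items() if c > 1)
    bad_ages = sorted(i for i, age in records if not (min_age <= age <= max_age))
    return duplicates, bad_ages

duplicates, bad_ages = audit_frame(frame)
print(duplicates)  # ['A001']
print(bad_ages)    # ['A003']
```

A check like this only finds errors that are visible inside the list itself; it cannot reveal people who are missing from the frame altogether.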

  • Respondents, response rate, and the quality of the sampling frame

    • A random sample from a poor sampling frame can yield a biased sample, even if the sampling is random.

    • A low response rate (e.g., 15%) magnifies bias if nonrespondents differ systematically from respondents.

    • Therefore, to reduce bias you typically want both:

    • A good sampling frame that covers the target population well and minimizes undercoverage.

    • A reasonably high response rate; a large sample alone cannot compensate for a poor frame or high nonresponse bias.
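
The point that a random sample from a poor frame stays biased at any sample size can be sketched as follows. The population is invented for illustration: 70% "listed" people and 30% "unlisted" people with different average values of the survey variable, and a frame that contains only the listed group.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical target population (all numbers are assumptions).
listed = [rng.gauss(40, 5) for _ in range(7000)]
unlisted = [rng.gauss(60, 5) for _ in range(3000)]
target = listed + unlisted
frame = listed  # the frame misses the unlisted group entirely (undercoverage)

true_mean = statistics.fmean(target)

def sample_mean(population, n):
    """Mean of a simple random sample of size n drawn without replacement."""
    return statistics.fmean(rng.sample(population, n))

for n in (100, 2000):
    error_full = sample_mean(target, n) - true_mean   # sampling error only
    error_frame = sample_mean(frame, n) - true_mean   # sampling error plus coverage bias
    print(f"n={n:5d}  full-population error={error_full:6.2f}  frame error={error_frame:6.2f}")
```

Sampling from the full population gives an error that shrinks as n grows, while sampling from the incomplete frame leaves an error of roughly the same size however large the sample is.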

  • Practical solutions to frame errors

    • Prevention with a strong frame creation process is the best solution (e.g., substantial upfront work to construct a high-quality frame; example: GP survey frame development took 9 months within a 3-year project).

    • Post-stratification (weighting) to adjust the sample to resemble the target population when key variables are known in both populations:

    • Concept: stratify the sample by key variables and weight the strata so the sample proportions match population proportions.

    • Simple weight formula (example): for stratum h, the weight is
      $w_h = \frac{P_h}{S_h}$, where $P_h$ is the proportion of the population in stratum h and $S_h$ is the proportion of the sample in stratum h.

    • This approach helps align the sample with known population characteristics, but it relies on having good information about the target population and the strata that matter for the analysis.

    • Use census or other external data sources to inform post-stratification and frame assessment:

    • If census data provide reliable distribution of key variables, you can compare your sample to the census and adjust accordingly.

    • If census data will change (e.g., after a census ends), you may turn to other studies with similar populations to obtain the necessary target distributions.

    • When missing or incomplete information limits post-stratification, acknowledge the limitations and consider supplementary sampling frames to capture underrepresented groups (e.g., Maori GP survey example).

    • Other advanced methods (e.g., ratio and regression estimators) may be used in specialized epidemiological contexts, but they are expensive and require specialized skills; they are not typically part of stage-two coursework.

    • If a sampling frame is inherently flawed, sometimes the best practice is to recognize and report the limitations rather than attempt extensive statistical correction.
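
The post-stratification weight described in this section (population proportion divided by sample proportion for each stratum) can be worked through on a toy example. This is a sketch under stated assumptions: the two age strata, their counts, and their means are invented, and the function names are hypothetical.

```python
def poststratification_weights(population_props, sample_counts):
    """Weight for stratum h = population proportion / sample proportion.
    Assumes every stratum has at least one sampled unit."""
    total = sum(sample_counts.values())
    return {h: population_props[h] / (sample_counts[h] / total) for h in sample_counts}

def weighted_mean(stratum_means, sample_counts, weights):
    """Mean of the survey variable after applying one weight per stratum."""
    numerator = sum(weights[h] * sample_counts[h] * stratum_means[h] for h in sample_counts)
    denominator = sum(weights[h] * sample_counts[h] for h in sample_counts)
    return numerator / denominator

# Hypothetical inputs: the population is split 50/50, the sample 20/80.
population_props = {"under_40": 0.5, "40_plus": 0.5}
sample_counts = {"under_40": 20, "40_plus": 80}
stratum_means = {"under_40": 10.0, "40_plus": 20.0}

weights = poststratification_weights(population_props, sample_counts)
unweighted = (
    sum(sample_counts[h] * stratum_means[h] for h in sample_counts)
    / sum(sample_counts.values())
)
weighted = weighted_mean(stratum_means, sample_counts, weights)

print(weights)     # about 2.5 for under_40 and 0.625 for 40_plus
print(unweighted)  # about 18.0, pulled toward the over-represented stratum
print(weighted)    # about 15.0, matching the 50/50 population split
```

The under-represented stratum receives a weight above 1 and the over-represented stratum a weight below 1. The adjustment only corrects for the stratifying variable; it does not help if respondents differ from non-respondents within a stratum.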

  • Detection and handling of bias in reporting

    • After data collection, you should assess potential biases and discuss their likely direction and magnitude in the results.

    • Provide qualitative assessments of bias size (minimal vs large) and discuss potential effects on interpretation and applicability.

    • Document the limitations of the sampling frame and the possible impact on external validity (generalizability).

  • How to improve sampling frame quality in practice

    • Start with a clear research question to define the population of interest and the units of analysis (sampling unit).

    • Evaluate availability and quality of frames: access, cost, coverage, and accuracy.

    • Consider creating supplementary frames if necessary to cover missing groups.

    • Use verification questions at the start of interviews to confirm eligibility and prevent misclassified respondents (eligibility filters).

    • Aim for robust design and consistent conduct to minimize errors across data collection teams.

    • Recognize that practical constraints (budget, time, personnel) will influence the achievable frame quality and sample size; document what compromises were made and why.

  • Examples and practical takeaways from the lecture

    • Real-world sampling frame choices:

    • Electoral rolls, with addresses and demographic indicators (e.g., Maori descent in New Zealand) useful for contact lists.

    • Telephone directories (historical) and online directories; addresses for geographic or door-to-door sampling.

    • Customer lists and loyalty programs can yield high-quality sampling frames for business surveys.

    • Facebook and other platforms can provide rich frames but raise privacy and bias concerns and may involve complex consent issues.

    • Conceptual visuals to remember:

    • Target population (oval) vs frame population (square) conceptually illustrate undercoverage and frame gaps.

    • A key warning: increasing sample size does not fix a biased frame; a larger non-representative sample reproduces the same bias with a narrower, misleadingly confident margin of error.

  • Quick summary of takeaway messages

    • Sampling frames are a critical source of nonsampling error and can seriously bias results if not carefully constructed and evaluated.

    • Bias is a systematic error that cannot be fixed by merely collecting more data; prevention, careful design, and transparent reporting are essential.

    • Post-stratification and supplementary frames are common remedies when information about the target population is available, but they rely on good data and assumptions.

    • Always document limitations related to frame quality, coverage, response rates, and potential confounders; provide context for interpretation and generalizability.

  • Looking ahead

    • Next week: more on survey methods, nonresponse, and how to address nonresponse bias, as well as practical demonstrations of post-stratification and weighting.

    • Thomas Lundley will elaborate on more advanced bias analysis and ratio estimates in the upcoming session.

  • Reading list and assignments (as announced)

    • Readings on Canvas related to sampling frames, frame errors, and post-stratification (live this afternoon).

    • Quiz: fundamental principles of sampling frames and error; attempt multiple times to improve score.

  • Final practical note from the lecture

    • The instructor emphasized that everything in survey design—especially frame construction and bias analysis—serves to improve the reliability and interpretability of results, even if perfect corrections aren’t possible within limited resources.

  • Equations and key formulas used in this note

    • Bias as a product of undercoverage and frame difference, where $U$ is the proportion of the target population missing from the frame:
      $\text{Bias} = U \times \bigl(\bar{X}_{\text{frame}} - \bar{X}_{\text{not in frame}}\bigr)$

    • Post-stratification weight for stratum h:
      $w_h = \frac{P_h}{S_h}$

    • Conceptual weighting goal: adjust the sample so that the weighted sample resembles the target population on key stratifying variables.
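
The coverage bias formula can be checked numerically. The sketch below assumes that U is the proportion of the target population missing from the frame and that everyone in the frame belongs to the target population; the means and the value of U are invented, and `coverage_bias` is a hypothetical helper name.

```python
def coverage_bias(undercoverage, mean_in_frame, mean_not_in_frame):
    """Bias of the frame mean relative to the target mean:
    U * (mean of those in the frame - mean of those not in the frame)."""
    return undercoverage * (mean_in_frame - mean_not_in_frame)

# Hypothetical values: 30% of the target population is missing from the frame.
U = 0.3
mean_frame = 40.0
mean_not_in_frame = 60.0

# The target mean is a mixture of the covered and uncovered groups.
target_mean = (1 - U) * mean_frame + U * mean_not_in_frame

bias = coverage_bias(U, mean_frame, mean_not_in_frame)
print(bias)                      # about -6.0
print(mean_frame - target_mean)  # about -6.0, the same quantity computed directly
```

The bias vanishes when either U is zero (complete coverage) or the covered and uncovered groups have the same mean, which is why undercoverage matters most when the missing groups differ on the variable being measured.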