Chapter 1-5: Introduction to Data, Surveys, and Sampling

Survey Design and Data Collection

  • The speaker emphasizes practical challenges in survey design, especially around respondent burden and data quality.
  • Avoid asking respondents to perform their own arithmetic in surveys; surveys should be quick and to the point.
  • Do not mislead respondents with promises like a survey taking only a few minutes if it will actually take longer; honesty about length matters for response quality.
  • Survey design is itself a field of study; it requires careful planning and can be treated as its own course.
  • When planning surveys, you must consider how you’ll recruit participants and what population you intend to study.
  • The right approach is to aim for a census or a carefully constructed sample rather than a convenient but biased one (e.g., stopping passersby at whatever spot happens to be handy).

Population, Census, and Sampling

  • A census collects data from everyone in the population; historically in the United States, a census is conducted every ten years.
  • Even censuses face nonresponse and missing data issues; asking every person is often impossible or impractical.
  • If you cannot survey everyone, you use a sample to represent the population.
  • The population example often used is broad (e.g., all people in the United States) or narrower (e.g., people in Maine, students at UMS).
  • The speaker notes that sampling from a narrow or unrepresentative group (e.g., just UMS students) will yield results that may not generalize to the broader population.
  • The Depression-era example: a survey conducted by calling only wealthy people produced biased voting results, which illustrates how the sampling frame can skew outcomes.
  • Census data collection aims to include everyone, but it still faces nonresponse biases and practical barriers.

Census vs. Sampling and Representativeness

  • Census: data from every individual in the population; ideal but often infeasible.
  • Sampling: select a subset of the population to infer about the whole; requires careful design to avoid bias.
  • The key question: if you can’t survey everybody, how do you get a representative sample?
  • The standard answer: use a sample with well-defined probabilities of selection to enable generalization.
  • A common practical goal is a simple random sample: each member of the population has an equal chance of being selected, and each possible sample of size n has an equal chance of being chosen.
  • This principle helps ensure that the sample can reflect population characteristics and enable valid inferences.
  • The speaker mentions a census as a goal but acknowledges its impracticality and the need for sampling methods.
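The equal-chance idea above can be illustrated with a short Python sketch. This is not from the lecture: the population of 50 ID numbers, the sample size of 8, and the helper name `simple_random_sample` are all assumptions made for the example. It relies on the standard library's `random.sample`, which draws without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population: ID numbers 1..50 standing in for 50 people.
population = list(range(1, 51))
sample = simple_random_sample(population, 8, seed=1)

print(sorted(sample))   # 8 distinct IDs from the population
print(len(sample))      # 8
```

In practice the hard part is not the draw itself but having a complete list of the population to draw from.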

Random Sampling and Biases

  • Not every person has an equal chance of being chosen in non-random sampling; careful randomization is essential for validity.
  • The idea introduced: any sample of size n should be equally likely to be drawn from the population, which minimizes selection bias.
  • Real-world samples must consider who is reachable and willing to respond; the Depression-era and modern smartphone-era contexts illustrate changing sampling frames.
  • The population you study matters: asking questions about the US population is different from asking only a specific subpopulation (e.g., college students).
  • To obtain generalizable results, you must consider the population you want to learn about and design your sampling frame accordingly.
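A small simulation can make the point about subpopulations concrete. The sketch below is only an illustration: the population size, the share of students, and the weekly TV hours are invented numbers, not data from the lecture. It compares a convenience sample (students only) with a simple random sample from the whole population.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 10,000 people, 20% of them college students.
# Assumed (made-up) weekly TV hours: students watch less on average.
population = []
for i in range(10_000):
    is_student = i < 2_000
    hours = rng.gauss(8, 3) if is_student else rng.gauss(15, 5)
    population.append((is_student, max(hours, 0.0)))

true_mean = statistics.mean(h for _, h in population)

# Convenience sample: only students, because they are easy to reach.
students = [h for is_student, h in population if is_student]
convenience_mean = statistics.mean(rng.sample(students, 200))

# Simple random sample of the same size from the whole population.
srs_mean = statistics.mean(h for _, h in rng.sample(population, 200))

print(f"population mean:        {true_mean:.2f}")
print(f"convenience (students): {convenience_mean:.2f}")
print(f"simple random sample:   {srs_mean:.2f}")
```

Under these assumptions the students-only estimate lands far below the population mean, while the random sample lands close to it. Taking more students would not fix the gap, because the problem is who can be selected, not how many.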

Practical Guidelines for Surveys

  • Keep surveys quick and to the point; respondents should not feel the survey is onerous.
  • Avoid promising overly short completion times when surveys are long; be realistic about length.
  • If a survey is too long, consider breaking it into smaller, modular segments or redesigning items to reduce burden.
  • When deciding whom to sample, consider the target population and how representative the sample will be of that population.
  • The population framing (US, Maine, students, etc.) determines the interpretation and generalizability of results.
  • For large heterogeneous populations, a well-chosen probabilistic sample is more informative than a convenience sample.

Population, Samples, and Notation in Practice

  • Population: the entire group of interest (e.g., all adults in the United States).
  • Sample: a subset drawn from the population (e.g., 2,178 people surveyed for a study).
  • Parameters vs. Statistics:
    • Parameter: a numerical summary of the population (e.g., population mean μ, population proportion p).
    • Statistic: a numerical summary derived from the sample (e.g., sample mean \bar{X}, sample proportion \hat{p}).
  • The relationship: statistics are used to estimate parameters; the accuracy depends on sampling design and sample size.
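The parameter/statistic distinction can be sketched in Python. Everything here is assumed for illustration: the population is 100,000 simulated scores, and the cutoff of 600 for the proportion is arbitrary. The sample size of 2,178 simply echoes the example above.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 100,000 made-up test scores.
population = [rng.gauss(500, 100) for _ in range(100_000)]

# Parameters describe the whole population (usually unknown in practice).
mu = statistics.mean(population)
sigma = statistics.pstdev(population)   # divides by N
p = sum(x > 600 for x in population) / len(population)

# Statistics are computed from a sample and used to estimate the parameters.
sample = rng.sample(population, 2_178)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)            # divides by n - 1
p_hat = sum(x > 600 for x in sample) / len(sample)

print(f"mu    = {mu:.1f}   x_bar = {x_bar:.1f}")
print(f"sigma = {sigma:.1f}   s     = {s:.1f}")
print(f"p     = {p:.3f}   p_hat = {p_hat:.3f}")
```

Only in a simulation can both columns be printed side by side; with real data the left-hand values are what the right-hand values are trying to estimate.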

Sample Size and Representativeness: What Size Do We Need?

  • A commonly cited rule of thumb is that a fixed sample size (e.g., 30) is enough for many analyses; the speaker questions this rule and notes that the adequate size depends on the situation.
  • Example discussed: a study might use a sample of 2,178 people; the lecturer raises the question of why such a number is chosen and when a smaller sample would suffice.
  • The key idea: larger samples generally reduce sampling error, but the exact necessary size depends on the population variability and the desired precision.
  • The sample composition matters less than assumed for some questions; you can draw valid inferences even if the exact gender or subgroup composition is uneven, provided the sampling is random and representative of the population.
  • Hypothetical scenario: you could sample many more men than women or vice versa and still make valid population-level inferences if the sampling design is appropriate and the population proportions are accounted for in analysis.
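One way to see how precision depends on sample size is the usual normal-approximation margin of error for a proportion from a simple random sample, z * sqrt(p(1 - p)/n). The lecture notes do not state this formula, so treat the sketch below as an assumption-laden illustration; the function names and the default worst case p = 0.5 are choices made here.

```python
import math
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.95):
    """Approximate margin of error for a sample proportion (simple random sample)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * math.sqrt(p * (1 - p) / n)

def required_n(target_moe, p=0.5, confidence=0.95):
    """Smallest n whose approximate margin of error is at most target_moe."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z**2 * p * (1 - p) / target_moe**2)

for n in (30, 200, 2178):
    print(n, round(margin_of_error(n), 3))

print(required_n(0.03))  # sample size for roughly +/- 3 percentage points
```

With these defaults, n = 30 gives a margin of about 0.18 and n = 2,178 gives about 0.02, which is one plausible reason a study would want a sample in the thousands. The formula ignores nonresponse and complex designs, so real surveys often need more.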

Foundations of Statistical Inference

  • Statistics is the science of collecting, describing, and analyzing data; it is not just about data collection but also about inference and interpretation.
  • In the course, roughly one third is devoted to data analysis (descriptive and inferential stats) and one third to probability theory.
  • Inferential statistics uses data from samples to draw conclusions about populations, leveraging probability and sampling theory.
  • The population in a question could be broad (all people in the United States) or narrow (all students at a particular university); the population definition guides the analysis.
  • The lecture uses practical examples to connect theory to reality: TV-watching habits, voting behavior, education statistics, and gender gaps in SAT scores.

Notation, Terminology, and Conceptual Language

  • Population: the entire group of interest.
  • Sample: a subset drawn from the population.
  • Parameter: a numerical characteristic of the population (e.g., μ, σ, p).
  • Statistic: a numerical characteristic of the sample (e.g., \bar{X}, s, \hat{p}).
  • Population mean: \mu; population standard deviation: \sigma.
  • Sample mean: \bar{X}; sample standard deviation: s.
  • Aiming for clear language helps avoid memorization that is not concept-driven; the goal is understanding, not rote recall.

Two-Population Hypothesis Testing and Practical Inference

  • A key research question: do two populations differ in some parameter (e.g., mean, proportion)?
  • Hypothesis testing framework (illustrative form):
    • Null hypothesis: H_0: \mu_1 = \mu_2
    • Alternative hypothesis: H_a: \mu_1 \neq \mu_2 (two-sided) or other forms like greater/less than.
  • The course emphasizes learning the language and procedure rather than memorizing definitions for quizzes.
  • Hypothesis testing is a major tool for determining whether observed sample differences reflect true population differences or random chance.
  • Example discussion: comparing SAT scores by gender and education levels; the concept of sample composition and its effect on conclusions is nuanced and requires careful analysis.
  • The instructor notes that the exact allocation of sample sizes across subgroups (e.g., 200 men vs. 138 women) does not automatically invalidate conclusions about population means; what matters is representativeness and proper inference.
  • The broader takeaway: with random sampling and correct inference procedures, you can draw conclusions about population questions like voting intentions, TV-watching habits, or educational outcomes.
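A two-sample comparison with unequal group sizes might look like the sketch below. The notes do not say which test the course uses; this example assumes Welch's t statistic and, because both groups are large, approximates the p-value with the standard normal distribution instead of the t distribution. The scores are simulated, and the group sizes of 200 and 138 only echo the numbers mentioned above.

```python
import math
import random
from statistics import NormalDist, mean, variance

def welch_test(sample1, sample2):
    """Welch's t statistic with a large-sample (normal) two-sided p-value."""
    n1, n2 = len(sample1), len(sample2)
    # Standard error of the difference in means; variances are not pooled.
    se = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    t = (mean(sample1) - mean(sample2)) / se
    # Assumption: n1 and n2 are large enough to treat t as standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_value

rng = random.Random(7)
# Made-up scores; both groups come from the same distribution, so H_0 is true.
group_a = [rng.gauss(500, 100) for _ in range(200)]
group_b = [rng.gauss(500, 100) for _ in range(138)]

t, p_value = welch_test(group_a, group_b)
print(f"t = {t:.3f}, approximate two-sided p-value = {p_value:.3f}")
```

The unequal sizes enter only through the standard error, which is why a 200 versus 138 split does not by itself invalidate the comparison. For small samples the normal shortcut is too optimistic and a t distribution should be used instead.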

Real-World Relevance, Ethics, and Philosophical Context

  • Surveys and sampling connect to real-world problems such as political polls, consumer research, and social science investigations.
  • Ethical considerations include designing fair sampling frames, avoiding bias, and being transparent about limitations and potential sources of error.
  • Philosophical takeaway: data collection, sampling, and inference require careful judgment about what the data represent and how generalizable the conclusions are to the target population.

Quick Formulas and Notation Reference

  • Number of possible samples when choosing n from N: \binom{N}{n}
  • Probability a given individual is selected in a simple random sample of size n from N: P(i\in S) = \frac{n}{N}
  • Example: choosing 8 from 50: \binom{50}{8}
  • Population mean and standard deviation: \mu, \ \sigma
  • Sample mean and standard deviation: \bar{X}, \ s
  • Hypotheses for two-population mean comparison:
    • Null: H_0: \mu_1 = \mu_2
    • Alternative: H_a: \mu_1 \neq \mu_2
  • Key inferential objective: use sample statistics to make inferences about population parameters
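The counting formulas above can be checked with Python's standard library. The helper names below are made up for this example, and the brute-force check uses a tiny population of 6 so that every possible sample can be listed.

```python
import math
from itertools import combinations

def num_samples(N, n):
    """Number of distinct samples of size n from a population of size N."""
    return math.comb(N, n)

def inclusion_probability(N, n):
    """Chance that one given individual lands in a simple random sample."""
    return n / N

print(num_samples(50, 8))            # 536878650
print(inclusion_probability(50, 8))  # 0.16

# Brute-force check on a tiny population: list every sample of 3 from 6.
N, n = 6, 3
all_samples = list(combinations(range(N), n))
containing_0 = sum(1 for s in all_samples if 0 in s)
print(len(all_samples), num_samples(N, n))                         # 20 20
print(containing_0 / len(all_samples), inclusion_probability(N, n))  # 0.5 0.5
```

Individual 0 appears in half of the 20 equally likely samples, matching n/N = 3/6.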