Chapter 1-5: Introduction to Data, Surveys, and Sampling
Survey Design and Data Collection
- The speaker emphasizes practical challenges in survey design, especially around respondent burden and data quality.
- Avoid asking respondents to perform their own arithmetic in surveys; surveys should be quick and to the point.
- Do not mislead respondents with promises like a survey taking only a few minutes if it will actually take longer; honesty about length matters for response quality.
- Survey design is itself a field of study; it requires careful planning and can be treated as its own course.
- When planning surveys, you must consider how you’ll recruit participants and what population you intend to study.
- The right approach is to aim for a census or a carefully constructed sample rather than a convenient but biased one (e.g., stopping at an arbitrary spot and asking whoever passes by).
Population, Census, and Sampling
- A census collects data from everyone in the population; historically in the United States, a census is conducted every ten years.
- Even censuses face nonresponse and missing data issues; asking every person is often impossible or impractical.
- If you cannot survey everyone, you use a sample to represent the population.
- The population example often used is broad (e.g., all people in the United States) or narrower (e.g., people in Maine, students at UMS).
- The speaker notes that sampling from a narrow or biased group (e.g., only UMS students) yields results that may not generalize to the broader population.
- The Depression-era example: a survey that called only wealthy people gave a biased picture of the vote, illustrating how the sampling frame can skew outcomes.
- Census data collection aims to include everyone, but it still faces nonresponse biases and practical barriers.
Census vs. Sampling and Representativeness
- Census: data from every individual in the population; ideal but often infeasible.
- Sampling: select a subset of the population to infer about the whole; requires careful design to avoid bias.
- The key question: if you can’t survey everybody, how do you get a representative sample?
- The standard answer: use a sample with well-defined probabilities of selection to enable generalization.
- A common practical goal: design samples so that each member of the population has an equal chance of being selected, and each possible sample of size n has an equal chance of being chosen.
- This principle helps ensure that the sample can reflect population characteristics and enable valid inferences.
- The speaker mentions a census as a goal but acknowledges its impracticality and the need for sampling methods.
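The equal-chance principle above can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's procedure: the function name `simple_random_sample`, the ID-number frame, and the seed are all assumptions made for the example. The standard-library call `random.sample` draws without replacement, so every subset of size n is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every size-n subset is equally likely."""
    rng = random.Random(seed)  # the seed is only for reproducibility
    return rng.sample(list(population), n)

# Hypothetical sampling frame: ID numbers 1..50 standing in for people.
frame = range(1, 51)
chosen = simple_random_sample(frame, 8, seed=1)
print(sorted(chosen))
```

In practice the hard part is building the frame (the list of everyone eligible), not the draw itself.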
Random Sampling and Biases
- Not every person has an equal chance of being chosen in non-random sampling; careful randomization is essential for validity.
- The idea introduced: any sample of size n should be equally likely to be drawn from the population, which minimizes selection bias.
- Real-world samples must consider who is reachable and willing to respond; the Depression-era and modern smartphone-era contexts illustrate changing sampling frames.
- The population you study matters: asking questions about the US population is different from asking only a specific subpopulation (e.g., college students).
- To obtain generalizable results, you must consider the population you want to learn about and design your sampling frame accordingly.
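To make the effect of a skewed sampling frame concrete, here is a small simulation sketch. Everything in it is synthetic and assumed for illustration: the 30% "reachable" share, the 70% and 40% "yes" rates, and the sample size of 1,000 are made-up numbers, not figures from the lecture.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic population (made-up numbers): 30% are "reachable"; reachable
# people answer "yes" 70% of the time, everyone else 40% of the time.
population = []
for _ in range(100_000):
    reachable = rng.random() < 0.30
    says_yes = rng.random() < (0.70 if reachable else 0.40)
    population.append((reachable, says_yes))

def share_yes(people):
    """Proportion of 'yes' answers in a list of (reachable, says_yes) pairs."""
    return sum(yes for _, yes in people) / len(people)

true_share = share_yes(population)            # population parameter
srs = rng.sample(population, 1000)            # everyone is eligible
reachable_only = [p for p in population if p[0]]
biased = rng.sample(reachable_only, 1000)     # frame excludes most people

print(f"population:            {true_share:.3f}")
print(f"simple random sample:  {share_yes(srs):.3f}")
print(f"reachable-only sample: {share_yes(biased):.3f}")
```

Both samples have the same size; only the frame differs, so a larger sample from the reachable-only frame would not repair the gap.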
Practical Guidelines for Surveys
- Keep surveys quick and to the point; respondents should not feel the survey is onerous.
- Avoid promising overly short completion times when surveys are long; be realistic about length.
- If a survey is too long, consider breaking it into smaller, modular segments or redesigning items to reduce burden.
- When deciding whom to sample, consider the target population and how representative the sample will be of that population.
- The population framing (US, Maine, students, etc.) determines the interpretation and generalizability of results.
- For large heterogeneous populations, a well-chosen probabilistic sample is more informative than a convenience sample.
Population, Samples, and Notation in Practice
- Population: the entire group of interest (e.g., all adults in the United States).
- Sample: a subset drawn from the population (e.g., 2,178 people surveyed for a study).
- Parameters vs. Statistics:
- Parameter: a numerical summary of the population (e.g., population mean μ, population proportion p).
- Statistic: a numerical summary derived from the sample (e.g., sample mean \bar{X}, sample proportion \hat{p}).
- The relationship: statistics are used to estimate parameters; the accuracy depends on sampling design and sample size.
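The parameter/statistic distinction can be shown with a toy data set. This is a sketch under stated assumptions: the "weekly TV hours" population is generated artificially (mean 20, SD 6, both invented), and in real work the population values would not be available, which is exactly why the sample statistics are needed.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: weekly TV hours for 5,000 people (made-up values).
population = [rng.gauss(20, 6) for _ in range(5000)]

mu = statistics.fmean(population)      # parameter: population mean
sigma = statistics.pstdev(population)  # parameter: population SD (divides by N)

sample = rng.sample(population, 100)   # simple random sample of size 100
x_bar = statistics.fmean(sample)       # statistic: sample mean
s = statistics.stdev(sample)           # statistic: sample SD (divides by n - 1)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"x_bar = {x_bar:.2f}, s = {s:.2f}")
```

Note that `statistics.pstdev` and `statistics.stdev` differ in their divisor, which mirrors the \sigma versus s notation.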
Sample Size and Representativeness: What Size Do We Need?
- A common rule of thumb holds that a fixed sample size (e.g., 30) is enough for many analyses; the speaker questions it, noting that the adequate size depends on the situation.
- Example discussed: a study might use a sample of 2,178 people; the lecturer raises the questions of why such a number is chosen and when a smaller sample would suffice.
- The key idea: larger samples generally reduce sampling error, but the exact necessary size depends on the population variability and the desired precision.
- The sample composition matters less than assumed for some questions; you can draw valid inferences even if the exact gender or subgroup composition is uneven, provided the sampling is random and representative of the population.
- Hypothetical scenario: you could sample many more men than women or vice versa and still make valid population-level inferences if the sampling design is appropriate and the population proportions are accounted for in analysis.
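One way to see how precision depends on sample size is the usual large-sample margin of error for a proportion, z \sqrt{\hat{p}(1 - \hat{p})/n}. The sketch below assumes simple random sampling, a 95% confidence level (z = 1.96), and the worst case \hat{p} = 0.5; the lecture does not state this formula, so treat it as a standard illustration rather than the speaker's own calculation.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion (large n, SRS)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Compare the rule-of-thumb size, a mid-sized sample, and the 2,178 example.
for n in (30, 200, 2178):
    print(n, round(margin_of_error(0.5, n), 3))
```

Under these assumptions the margin falls from roughly 18 percentage points at n = 30 to about 2 points at n = 2,178, which suggests why surveys of that size are common.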
Foundations of Statistical Inference
- Statistics is the science of collecting, describing, and analyzing data; it is not just about data collection but also about inference and interpretation.
- In the course, roughly one third is devoted to data analysis (descriptive and inferential stats) and one third to probability theory.
- Inferential statistics uses data from samples to draw conclusions about populations, leveraging probability and sampling theory.
- The population in a question could be broad (all people in the United States) or narrow (all students at a particular university); the population definition guides the analysis.
- The lecture uses practical examples to connect theory to reality: TV-watching habits, voting behavior, education statistics, and gender gaps in SAT scores.
Notation, Terminology, and Conceptual Language
- Population: the entire group of interest.
- Sample: a subset drawn from the population.
- Parameter: a numerical characteristic of the population (e.g., μ, σ, p).
- Statistic: a numerical characteristic of the sample (e.g., \bar{X}, s, \hat{p}).
- Population mean: \mu; population standard deviation: \sigma.
- Sample mean: \bar{X}; sample standard deviation: s.
- Aiming for clear language helps avoid memorization that is not concept-driven; the goal is understanding, not rote recall.
Two-Population Hypothesis Testing and Practical Inference
- A key research question: do two populations differ in some parameter (e.g., mean, proportion)?
- Hypothesis testing framework (illustrative form):
- Null hypothesis: H_0: \mu_1 = \mu_2
- Alternative hypothesis: H_a: \mu_1 \neq \mu_2 (two-sided), or a one-sided form such as \mu_1 > \mu_2 or \mu_1 < \mu_2.
- The course emphasizes learning the language and procedure rather than memorizing definitions for quizzes.
- Hypothesis testing is a major tool for determining whether observed sample differences reflect true population differences or random chance.
- Example discussion: comparing SAT scores by gender and education levels; the concept of sample composition and its effect on conclusions is nuanced and requires careful analysis.
- The instructor notes that the exact allocation of sample sizes across subgroups (e.g., 200 men vs. 138 women) does not automatically invalidate conclusions about population means; what matters is representativeness and proper inference.
- The broader takeaway: with random sampling and correct inference procedures, you can draw conclusions about population questions like voting intentions, TV-watching habits, or educational outcomes.
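As a rough sketch of how the two-population comparison might be carried out, the code below computes a two-sample z statistic with unpooled variances and a two-sided p-value from the normal distribution. This is a large-sample approximation chosen for the example; the lecture does not specify a test, and a t-based procedure would be more usual for small samples. The function name `two_sample_z` and the simulated scores are hypothetical; the group sizes 200 and 138 echo the uneven split mentioned above.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def two_sample_z(sample1, sample2):
    """Return (z, two-sided p) for H_0: mu_1 = mu_2 via a normal approximation."""
    n1, n2 = len(sample1), len(sample2)
    # Standard error of the difference in means, variances not pooled.
    se = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    z = (fmean(sample1) - fmean(sample2)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up scores drawn from the same distribution, with unequal group sizes.
rng = random.Random(7)
group_a = [rng.gauss(500, 100) for _ in range(200)]
group_b = [rng.gauss(500, 100) for _ in range(138)]

z, p = two_sample_z(group_a, group_b)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The unequal group sizes enter only through the standard error, which is one reason an uneven split does not by itself invalidate the comparison.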
Real-World Relevance, Ethics, and Philosophical Context
- Surveys and sampling connect to real-world problems such as political polls, consumer research, and social science investigations.
- Ethical considerations include designing fair sampling frames, avoiding bias, and being transparent about limitations and potential sources of error.
- Philosophical takeaway: data collection, sampling, and inference require careful judgment about what the data represent and how generalizable the conclusions are to the target population.
Key Formulas and Notation
- Number of possible samples when choosing n from N: \binom{N}{n}
- Probability a given individual is selected in a simple random sample of size n from N: P(i\in S) = \frac{n}{N}
- Example: choosing 8 from 50: \binom{50}{8}
- Population mean and standard deviation: \mu, \ \sigma
- Sample mean and standard deviation: \bar{X}, \ s
- Hypotheses for two-population mean comparison:
- Null: H_0: \mu_1 = \mu_2
- Alternative: H_a: \mu_1 \neq \mu_2
- Key inferential objective: use sample statistics to make inferences about population parameters
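The counting formulas in this list can be checked numerically for the 8-from-50 example. The sketch uses only `math.comb` from the Python standard library; the variable names are illustrative. It also confirms the n/N selection probability by counting: a given individual appears in \binom{N-1}{n-1} of the \binom{N}{n} possible samples.

```python
import math

N, n = 50, 8

num_samples = math.comb(N, n)  # number of distinct size-8 subsets of 50
inclusion_prob = n / N         # chance a given individual is selected

# Same probability by counting the samples that contain a fixed individual.
via_counting = math.comb(N - 1, n - 1) / math.comb(N, n)

print(num_samples)     # 536878650
print(inclusion_prob)  # 0.16
print(math.isclose(inclusion_prob, via_counting))  # True
```

Even this small example has more than half a billion possible samples, which is why samples are drawn with a random number generator rather than listed.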