Chapter 1-5: Introduction to Data, Surveys, and Sampling

Survey Design and Data Collection

The speaker emphasizes practical challenges in survey design, especially around respondent burden and data quality.
Avoid asking respondents to perform their own arithmetic in surveys; surveys should be quick and to the point.
Do not mislead respondents with promises like a survey taking only a few minutes if it will actually take longer; honesty about length matters for response quality.
Survey design is itself a field of study; it requires careful planning and can be treated as its own course.
When planning surveys, you must consider how you’ll recruit participants and what population you intend to study.
The right approach is to aim for a census or a carefully constructed sample rather than a convenience sample, which is easy to collect but biased (e.g., stopping at an arbitrary spot to ask passersby).

Population, Census, and Sampling

A census collects data from everyone in the population; in the United States, a census is conducted every ten years.
Even censuses face nonresponse and missing data issues; asking every person is often impossible or impractical.
If you cannot survey everyone, you use a sample to represent the population.
The population in an example can be broad (e.g., all people in the United States) or narrower (e.g., people in Maine, students at UMS).
The speaker notes that sampling only from a narrow or unrepresentative group (e.g., just UMS students) yields results that may not generalize to the broader population.
The Depression-era example: a survey that called only wealthy people gave a biased picture of voting results, which illustrates how the sampling frame can skew outcomes.
Census data collection aims to include everyone, but it still faces nonresponse biases and practical barriers.

Census vs. Sampling and Representativeness

Census: data from every individual in the population; ideal but often infeasible.
Sampling: select a subset of the population to infer about the whole; requires careful design to avoid bias.
The key question: if you can’t survey everybody, how do you get a representative sample?
The standard answer: use a sample with well-defined probabilities of selection to enable generalization.
A common practical goal: design samples so that each member of the population has an equal chance of being selected, and each possible sample of size n has an equal chance of being chosen.
This principle helps ensure that the sample can reflect population characteristics and enable valid inferences.
The speaker mentions a census as a goal but acknowledges its impracticality and the need for sampling methods.
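
The equal-chance principle in this section can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own procedure: the population of 50 ID numbers, the sample size of 8, and the helper name simple_random_sample are assumptions made for the example.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, n)


# Hypothetical population: ID numbers 1..50 standing in for 50 people.
population = list(range(1, 51))
sample = simple_random_sample(population, 8, seed=1)
print(sorted(sample))
```

Because random.sample draws without replacement, no person can appear twice, and no person is favored over another.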

Random Sampling and Biases

Not every person has an equal chance of being chosen in non-random sampling; careful randomization is essential for validity.
The idea introduced: any sample of size n should be equally likely to be drawn from the population, which minimizes selection bias.
Real-world samples must consider who is reachable and willing to respond; the Depression-era and modern smartphone-era contexts illustrate changing sampling frames.
The population you study matters: asking questions about the US population is different from asking only a specific subpopulation (e.g., college students).
To obtain generalizable results, you must consider the population you want to learn about and design your sampling frame accordingly.
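
The effect of a skewed sampling frame can be illustrated with a small simulation. Everything here is made up for the sketch: a hypothetical population of 10,000 voters, 3,000 of them wealthy, with support for a candidate A set at 70% among the wealthy and 40% among everyone else. It is not a reconstruction of the historical survey.

```python
import random

rng = random.Random(0)

# Hypothetical population of 10,000 voters as (is_wealthy, supports_A) pairs.
population = (
    [(True, True)] * 2100 + [(True, False)] * 900        # 3,000 wealthy, 70% support A
    + [(False, True)] * 2800 + [(False, False)] * 4200   # 7,000 others, 40% support A
)


def share_supporting(people):
    """Proportion of the given people who support candidate A."""
    return sum(supports for _, supports in people) / len(people)


true_p = share_supporting(population)  # population proportion: 0.49

# Biased frame: only wealthy people can be reached.
wealthy_frame = [person for person in population if person[0]]
frame_estimate = share_supporting(rng.sample(wealthy_frame, 1000))

# Simple random sample from the whole population.
srs_estimate = share_supporting(rng.sample(population, 1000))

print(f"true proportion:      {true_p:.3f}")
print(f"wealthy-only frame:   {frame_estimate:.3f}")
print(f"simple random sample: {srs_estimate:.3f}")
```

Both samples have the same size; only the frame differs. The wealthy-only estimate lands near the wealthy subgroup's 70% rather than the population's 49%, and a larger sample from the same frame would not repair that.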

Practical Guidelines for Surveys

Keep surveys quick and to the point; respondents should not feel the survey is onerous.
Avoid promising overly short completion times when surveys are long; be realistic about length.
If a survey is too long, consider breaking it into smaller, modular segments or redesigning items to reduce burden.
When deciding whom to sample, consider the target population and how representative the sample will be of that population.
The population framing (US, Maine, students, etc.) determines the interpretation and generalizability of results.
For large heterogeneous populations, a well-chosen probabilistic sample is more informative than a convenience sample.

Population, Samples, and Notation in Practice

Population: the entire group of interest (e.g., all adults in the United States).
Sample: a subset drawn from the population (e.g., 2,178 people surveyed for a study).
Parameters vs. Statistics:
- Parameter: a numerical summary of the population (e.g., population mean μ, population proportion p).
- Statistic: a numerical summary derived from the sample (e.g., sample mean \bar{X}, sample proportion \hat{p}).
The relationship: statistics are used to estimate parameters; the accuracy depends on sampling design and sample size.
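
The distinction between a parameter and a statistic can be made concrete with a simulated population. This is a sketch under stated assumptions: the population of 100,000 weekly TV-hour values is invented, and the sample size of 2,178 is borrowed from the example above only for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: weekly TV hours for 100,000 adults (made-up values).
population = [max(0.0, rng.gauss(20, 5)) for _ in range(100_000)]

# Parameters: computed from the whole population, usually unknown in practice.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Statistics: computed from one simple random sample of 2,178 people.
sample = rng.sample(population, 2178)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)

print(f"parameter mu    = {mu:.2f}   statistic x_bar = {x_bar:.2f}")
print(f"parameter sigma = {sigma:.2f}   statistic s     = {s:.2f}")
```

In a real study only x_bar and s are observable; mu and sigma are available here only because the population was simulated.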

Sample Size and Representativeness: What Size Do We Need?

A common rule of thumb holds that a fixed sample size (e.g., 30) is enough for many analyses; the speaker questions this rule and notes that the required size depends on the situation.
Example discussed: a study might use a sample of 2,178 people; the lecturer raises the question of why such a number is chosen and when a smaller sample would suffice.
The key idea: larger samples generally reduce sampling error, but the exact necessary size depends on the population variability and the desired precision.
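
The link between sample size and sampling error can be checked by simulation. The sketch below assumes a made-up population of 20,000 values and compares three sample sizes (30, 300, and 2,178); the textbook approximation sigma / sqrt(n) for the spread of the sample mean is printed alongside for reference.

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 20,000 values with mean near 50 and sd near 10.
population = [rng.gauss(50, 10) for _ in range(20_000)]
sigma = statistics.pstdev(population)


def spread_of_sample_means(n, repetitions=200):
    """Standard deviation of the sample mean across repeated simple random samples."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repetitions)]
    return statistics.stdev(means)


spread = {n: spread_of_sample_means(n) for n in (30, 300, 2178)}
for n, observed in spread.items():
    # Sampling without replacement from a finite population gives a slightly
    # smaller spread than sigma / sqrt(n), most visibly for the largest n.
    print(f"n={n:>5}  observed={observed:.3f}  sigma/sqrt(n)={sigma / math.sqrt(n):.3f}")
```

The spread shrinks roughly with the square root of n, so a tenfold increase in sample size cuts the sampling error by only about a factor of three.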
The sample composition matters less than assumed for some questions; you can draw valid inferences even if the exact gender or subgroup composition is uneven, provided the sampling is random and representative of the population.
Hypothetical scenario: you could sample many more men than women or vice versa and still make valid population-level inferences if the sampling design is appropriate and the population proportions are accounted for in analysis.
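
One way to account for population proportions in analysis is to weight each subgroup's sample mean by its known share of the population (often called post-stratification weighting). The sketch below is an assumption-laden illustration, not the lecturer's method: the population is taken to be half men and half women, the sample holds 300 men and 100 women, and all values are simulated.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical setup: known population shares, but a sample that
# over-represents men. All values are made up for illustration.
population_share = {"men": 0.5, "women": 0.5}
sample = {
    "men": [rng.gauss(10, 3) for _ in range(300)],
    "women": [rng.gauss(20, 3) for _ in range(100)],
}

# Naive estimate: pool everyone, so the larger group dominates.
unweighted_mean = statistics.fmean(sample["men"] + sample["women"])

# Weighted estimate: each group's mean counts in proportion to its population share.
weighted_mean = sum(
    population_share[group] * statistics.fmean(values)
    for group, values in sample.items()
)

print(f"unweighted mean: {unweighted_mean:.2f}")
print(f"weighted mean:   {weighted_mean:.2f}")
```

With group means near 10 and 20 and equal population shares, the weighted estimate sits near 15, while the pooled estimate is pulled toward the over-sampled group.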

Foundations of Statistical Inference

Statistics is the science of collecting, describing, and analyzing data; it is not just about data collection but also about inference and interpretation.
In the course, roughly one third is devoted to data analysis (descriptive and inferential stats) and one third to probability theory.
Inferential statistics uses data from samples to draw conclusions about populations, leveraging probability and sampling theory.
The population in a question could be broad (all people in the United States) or narrow (all students at a particular university); the population definition guides the analysis.
The lecture uses practical examples to connect theory to reality: TV-watching habits, voting behavior, education statistics, and gender gaps in SAT scores.

Notation, Terminology, and Conceptual Language

Population: the entire group of interest.
Sample: a subset drawn from the population.
Parameter: a numerical characteristic of the population (e.g., μ, σ, p).
Statistic: a numerical characteristic of the sample (e.g., \bar{X}, s, \hat{p}).
Population mean: \mu; population standard deviation: \sigma.
Sample mean: \bar{X}; sample standard deviation: s.
Aiming for clear language helps avoid memorization that is not concept-driven; the goal is understanding, not rote recall.

Two-Population Hypothesis Testing and Practical Inference

A key research question: do two populations differ in some parameter (e.g., mean, proportion)?
Hypothesis testing framework (illustrative form):
- Null hypothesis: H_0: \mu_1 = \mu_2
- Alternative hypothesis: H_a: \mu_1 \neq \mu_2 (two-sided), or a one-sided form such as \mu_1 > \mu_2 or \mu_1 < \mu_2.
The course emphasizes learning the language and procedure rather than memorizing definitions for quizzes.
Hypothesis testing is a major tool for determining whether observed sample differences reflect true population differences or random chance.
Example discussion: comparing SAT scores by gender and education levels; the concept of sample composition and its effect on conclusions is nuanced and requires careful analysis.
The instructor notes that the exact allocation of sample sizes across subgroups (e.g., 200 men vs. 138 women) does not automatically invalidate conclusions about population means; what matters is representativeness and proper inference.
The broader takeaway: with random sampling and correct inference procedures, you can draw conclusions about population questions like voting intentions, TV-watching habits, or educational outcomes.
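
A two-population comparison of means can be sketched with the standard library alone. The function below uses a Welch-style standard error with a normal approximation for the p-value, which is reasonable for samples of this size but is a stand-in, not necessarily the procedure the course teaches. The function name two_sample_test and the simulated scores are hypothetical; the group sizes of 200 and 138 echo the example above.

```python
import math
import random
import statistics


def two_sample_test(sample_1, sample_2):
    """Two-sided test of H0: mu_1 = mu_2 using a large-sample normal approximation."""
    mean_1, mean_2 = statistics.fmean(sample_1), statistics.fmean(sample_2)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
    # Standard error of the difference in means; unequal group sizes are allowed.
    standard_error = math.sqrt(var_1 / len(sample_1) + var_2 / len(sample_2))
    statistic = (mean_1 - mean_2) / standard_error
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(statistic)))
    return statistic, p_value


rng = random.Random(11)

# Made-up test scores for two groups of unequal size.
group_1 = [rng.gauss(520, 100) for _ in range(200)]
group_2 = [rng.gauss(500, 100) for _ in range(138)]

statistic, p_value = two_sample_test(group_1, group_2)
print(f"test statistic = {statistic:.2f}, two-sided p-value = {p_value:.3f}")
```

The unequal group sizes enter only through the standard error, which is why an uneven split does not by itself invalidate the comparison.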

Real-World Relevance, Ethics, and Philosophical Context

Surveys and sampling connect to real-world problems such as political polls, consumer research, and social science investigations.
Ethical considerations include designing fair sampling frames, avoiding bias, and being transparent about limitations and potential sources of error.
Philosophical takeaway: data collection, sampling, and inference require careful judgment about what the data represent and how generalizable the conclusions are to the target population.

Quick Formulas and Notation Reference

Number of possible samples when choosing n from N: \binom{N}{n}
Probability a given individual is selected in a simple random sample of size n from N: P(i\in S) = \frac{n}{N}
Example: choosing 8 from 50: \binom{50}{8}
Population mean and standard deviation: \mu, \ \sigma
Sample mean and standard deviation: \bar{X}, \ s
Hypotheses for two-population mean comparison:
- Null: H_0: \mu_1 = \mu_2
- Alternative: H_a: \mu_1 \neq \mu_2
Key inferential objective: use sample statistics to make inferences about population parameters
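
The counting formulas in this reference can be checked directly in Python. The sketch uses the 8-from-50 example and also confirms the inclusion probability n/N by counting: the samples that contain a given individual are the \binom{N-1}{n-1} ways to choose the other members.

```python
import math
from fractions import Fraction

N, n = 50, 8

# Number of possible samples of size n from N.
number_of_samples = math.comb(N, n)
print(number_of_samples)  # 536878650

# Samples containing one particular individual: choose the other n-1 from N-1.
samples_with_individual = math.comb(N - 1, n - 1)

# Inclusion probability as an exact fraction; it reduces to n/N.
inclusion_probability = Fraction(samples_with_individual, number_of_samples)
print(inclusion_probability)  # 4/25
```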