Foundations of Statistics: Real-World Contexts, Population vs. Sample, Representativeness, and Sampling Error
Real-world presence of statistics
Statistics appear outside the classroom in everyday contexts: doctor’s office, politics, grocery stores, online articles, and commercials.
Doctor’s office examples:
Height percentile for babies (e.g., what percent of children are shorter than a given toddler).
Prevalence of diseases (e.g., how common a given illness is).
The common cold as an example of a frequently discussed illness.
Politics examples:
Polling numbers: asking people who they plan to vote for and reporting percentages on websites or TV.
Election outcomes and related statistics at the county or regional level (e.g., unemployment rates).
Grocery store examples:
Price per pound or price per ounce shown on stickers.
Commercials and online examples:
Advertisements claiming certain statistics (e.g., toothpaste ads saying "four out of five dentists agree").
Reported side-effect rates for medications, and the number of people who take a given medication, are also statistics.
Online articles (e.g., BuzzFeed) sometimes cite research, but the underlying data can be faulty, and statistics can be massaged or misinterpreted.
Other points:
Statistics can be encountered in many places outside of formal classes.
The class aims to develop critical tools for examining statistics more carefully (it emphasizes a critical angle).
The most practically applicable topic is chapter 2 on graphs, because graphs can mislead people if not used carefully.
Why we need statistics
Data by itself is not informative; statistics helps us transform raw data into understandable information.
Functions of statistics:
Organize and summarize information: raw data are just numbers; statistics turn them into usable summaries.
Communicate findings: turning numbers into a readable report or graph helps others understand what was found.
Personal example from the lecturer:
Worked in institutional research evaluating a program; created graphs and wrote a report on how the program benefited students.
Data set example: a data set with 365 rows (a long table) that would be hard to interpret without statistical tools or visualization.
Short definition: statistics is the practice of organizing, summarizing, and interpreting information.
Note on note-taking: the instructor may move quickly between slides; students can request a repeat if needed.
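The 365-row data set example above can be made concrete with a short sketch. The data here are invented (one made-up measurement per day for a year), and the helper name summarize is hypothetical; the point is only that a few summary numbers are easier to read than 365 raw rows.

```python
import random
import statistics

def summarize(values):
    """Reduce a list of numbers to a handful of summary statistics."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered),
    }

# Hypothetical data: one invented measurement per day for a year (365 rows).
rng = random.Random(0)
daily_values = [round(rng.gauss(50, 10), 1) for _ in range(365)]

summary = summarize(daily_values)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Six numbers replace a table that nobody would read row by row, which is the "organize and summarize" function described above.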
Research methodology: foundational concepts before analysis
Before collecting data, determine whom the research question is about (the population).
Population:
The entire group of people the research question concerns.
Usually not feasible to study the entire population due to size, cost, and practicality.
Example: The US census attempts to collect data from all Americans every ten years; it’s extremely challenging and not perfect.
The census is very diligent (people may knock on doors, reminders are sent), but no census is perfectly complete.
Sample:
A smaller subset of the population.
Data from a sample is used to generalize to the population.
Conceptually, the sample is a miniature population; the goal is to infer what would happen if the entire population were studied.
Example concepts to illustrate population vs sample:
How many cats does the average American own? Population could be households in the United States; a sample could be Conway, AR households, or coastal campus students.
If we study two samples from two different contexts (e.g., a small mountain village vs. a hospital in a large city), the results may differ due to context, not just sampling error.
Population vs sample in practice:
Often not feasible to study the entire population; samples are used instead.
If the sample comes from non-representative contexts, results can be biased and not generalizable.
Non-examples and the need for representativeness:
In a zombie apocalypse example, one sample from a mountain village and one from a city hospital yield different infection percentages (3% vs 2%), illustrating how context matters and why a sample from a single context may not reflect the population as a whole.
The more representative a sample is of the population, the more trustworthy the generalization.
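The cats-per-household question above can be sketched as a small simulation. Everything in it is an assumption made for illustration: the three contexts, their sizes, and the ownership weights are invented, not real survey figures. It compares a simple random sample of the whole population with a same-sized sample drawn from a single context.

```python
import random
import statistics

rng = random.Random(42)

def make_households(n, weights):
    """Invent n households owning 0-4 cats, with the given relative weights."""
    return rng.choices([0, 1, 2, 3, 4], weights=weights, k=n)

# Hypothetical population, split into three made-up contexts.
population_by_context = {
    "city": make_households(60_000, [70, 20, 7, 2, 1]),
    "suburb": make_households(30_000, [50, 30, 15, 4, 1]),
    "rural": make_households(10_000, [20, 30, 30, 15, 5]),
}
population = [cats for group in population_by_context.values() for cats in group]
population_mean = statistics.mean(population)

# Sample 1: simple random sample drawn from the whole population.
random_sample = rng.sample(population, 500)
# Sample 2: same size, but drawn from one context only.
rural_only_sample = rng.sample(population_by_context["rural"], 500)

random_mean = statistics.mean(random_sample)
rural_mean = statistics.mean(rural_only_sample)

print(f"population mean:        {population_mean:.2f} cats per household")
print(f"random sample mean:     {random_mean:.2f}")
print(f"rural-only sample mean: {rural_mean:.2f}")
```

Both samples have 500 households, so the gap between them comes from where the households were drawn, not from how many were drawn.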
Representativeness and sampling error
Representativeness:
A representative sample should be similar to the population; its results should closely resemble what would be found if the entire population were studied.
If the sample is not representative, conclusions about the population may be wrong.
Sampling error:
The difference between sample data and population data, a natural and unavoidable source of error. It is not the same as sampling bias, which is a systematic distortion caused by how the sample was selected.
Even with careful sampling, samples are not identical to the population.
Expressed conceptually as the difference between the sample statistic and the population parameter, e.g., \text{sampling error} = \hat{p} - p for a proportion.
If the population height is, say, 6 feet on average, a sample will almost certainly not yield exactly 6 feet; there will be some deviation due to sampling error.
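A minimal sketch of the height example, assuming an invented population of 100,000 heights centered near 72 inches (6 feet) with a spread of 3 inches. It draws five simple random samples of 50 people and reports each sample's sampling error, computed as the sample mean minus the population mean.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: 100,000 heights in inches, centered near 72 (6 feet).
population = [rng.gauss(72, 3) for _ in range(100_000)]
parameter = statistics.fmean(population)  # population mean (the parameter)

errors = []
for draw in range(1, 6):
    sample = rng.sample(population, 50)   # simple random sample, n = 50
    statistic = statistics.fmean(sample)  # sample mean (the statistic)
    error = statistic - parameter         # sampling error for this draw
    errors.append(error)
    print(f"sample {draw}: mean = {statistic:.2f} in, error = {error:+.2f} in")
```

Each draw lands near the population mean but not exactly on it, and the sign of the error varies from draw to draw. That is what distinguishes ordinary sampling error from a systematic skew.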
Examples to illustrate representativeness and errors:
Height example: sampling only horse jockeys, gymnasts, or basketball players would skew the estimate of the average American height; a truly representative sample would include a mix of people from across the population.
Jury decision-making research often uses college students as a sample, but the population includes actual jurors with different age ranges and experiences.
Practical takeaway:
Researchers aim for representative samples to generalize results, but perfect representativeness is rarely achievable; sampling error must be acknowledged and accounted for.
Real-world case studies to illustrate sampling issues
Jury decision making research:
Common practice in psychology is to use college students as participants, but the population of interest is actual jurors.
Key differences between college students and jurors:
Age range: students are typically 18–22, whereas jurors span a wider age range (possibly 18–90+).
Access to information: students may have greater access to academic resources; jurors may have different information exposure.
Previous jury experience: jurors may have served on juries before; students typically have little or no jury experience.
Demographics and worldview: college students often lean more liberal; jurors come from diverse backgrounds.
Implications:
Generalizing results from college students to all jurors is problematic; it depends on context and the degree to which the sample resembles the population.
Sometimes researchers broaden sampling to courthouses to obtain data from actual jurors, which can improve generalizability.
Takeaway about generalization:
Generalizing from a non-representative sample to a population is not guaranteed to be valid; researchers must acknowledge sampling limitations and strive for more representative samples when possible.
Practical guidance and ethical considerations
Graphs and data visualization:
Graphs can mislead if not used carefully; be critical about how data are presented (scales, selections, and framing matter).
Chapter 2 focus on graphs emphasizes real-world misinterpretation risks.
Critical examination of statistics:
Statistics can be massaged or misinterpreted; readers should seek to understand data sources, sampling methods, and potential biases.
Textbooks and research foundations:
Textbooks are built from individual research studies; understanding statistics helps connect textbook content to original research.
Ethical implications:
Accurate representation of statistics matters for policy, public opinion, and scientific credibility.
Misleading statistics can lead to poor decisions; researchers have a responsibility to report honestly and clearly.
Quick reference: key definitions and notation (LaTeX)
Population size: N
Sample size: n
Population parameter (true proportion): p
Sample statistic (sample proportion): \hat{p} = \frac{X}{n} where X = number of successes in the sample
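Expected value of the sample proportion (under simple random sampling): \mathbb{E}[\hat{p}] = p
Variance of the sample proportion (under simple random sampling): \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}
Margin of error (for a proportion, with z-value for the chosen confidence level): \text{MoE} = z \sqrt{\frac{p(1-p)}{n}}; in practice p is unknown and \hat{p} is substituted for it.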
General idea: a representative sample aims to approximate the population, but sampling error means the sample will almost always differ from the population to some extent.
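The notation above can be exercised with a short sketch. The poll numbers (520 of 1,000 respondents) are invented, the function names are hypothetical, and the sample proportion is plugged in for the unknown p, which is a common approximation rather than something these notes prescribe. The default z = 1.96 corresponds to a 95% confidence level.

```python
import math

def sample_proportion(successes, n):
    """p-hat = X / n."""
    return successes / n

def margin_of_error(p_hat, n, z=1.96):
    """MoE = z * sqrt(p(1-p)/n), with p-hat substituted for the unknown p."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return z * standard_error

# Hypothetical poll: 520 of 1,000 respondents favor a candidate.
x, n = 520, 1000
p_hat = sample_proportion(x, n)
moe = margin_of_error(p_hat, n)

print(f"sample proportion: {p_hat:.3f}")
print(f"margin of error:   {moe:.3f}")
print(f"interval:          {p_hat - moe:.3f} to {p_hat + moe:.3f}")
```

With these made-up numbers the margin of error is about 3 percentage points. Because n sits under a square root, quadrupling the sample size only halves the margin.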
Closing takeaway
Statistics is the tool that turns raw data into understandable, communicable information, enabling us to draw conclusions and generalize findings.
The key challenge is choosing samples that are representative of the population so that conclusions are valid and useful in the real world.
Always approach statistics with a critical eye: check data sources, sampling methods, and potential biases, especially when framing graphs or headlines.