Foundations of Statistics: Real-World Contexts, Population vs. Sample, Representativeness, and Sampling Error

Real-world presence of statistics

  • Statistics appear outside the classroom in everyday contexts: doctor’s office, politics, grocery stores, online articles, and commercials.

  • Doctor’s office examples:

    • Height percentile for babies (e.g., what percent of children are shorter than a given toddler).

    • Prevalence of diseases (e.g., how common a given illness is).

    • The common cold as an example of a frequently discussed illness.

  • Politics examples:

    • Polling numbers: asking people who they plan to vote for and reporting percentages on websites or TV.

    • Election outcomes and related statistics at the county or regional level (e.g., unemployment rates).

  • Grocery store examples:

    • Price per pound or price per ounce shown on stickers.

  • Commercials and online examples:

    • Advertisements claiming certain statistics (e.g., toothpaste ads saying "four out of five dentists agree").

    • Reported rates of medication side effects, and counts of how many people take a medication, are also statistics.

    • Online articles (e.g., BuzzFeed) sometimes cite research; the underlying data can be faulty, and statistics can be massaged or misinterpreted.

  • Other points:

    • Statistics can be encountered in many places outside of formal classes.

    • The class aims to develop critical tools for examining statistics more carefully (it emphasizes a critical angle).

  • The most practically applicable topic discussed is chapter 2 on graphs, because graphs can mislead people if not used carefully.
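
The height-percentile example from the doctor's office can be made concrete with a few lines of Python. This is a minimal sketch, not how clinical growth charts are actually built: the function name `percent_shorter` and the reference heights are hypothetical and invented purely for illustration.

```python
def percent_shorter(heights_cm, child_height_cm):
    """Return the percentage of reference heights strictly below the child's height."""
    if not heights_cm:
        raise ValueError("heights_cm must not be empty")
    shorter = sum(1 for h in heights_cm if h < child_height_cm)
    return 100 * shorter / len(heights_cm)


# Hypothetical reference heights (cm) for toddlers of the same age; invented data.
reference_heights_cm = [80, 82, 83, 84, 85, 85, 86, 87, 88, 90]

# 7 of the 10 reference values are below 87 cm, so this prints 70.0.
print(percent_shorter(reference_heights_cm, 87))
```

Read as a percentile, the result says that this toddler is taller than 70 percent of the (made-up) reference group.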

Why we need statistics

  • Data by itself is not informative; statistics helps us transform raw data into understandable information.

  • Functions of statistics:

    • Organize and summarize information: raw data are just numbers; statistics turn them into usable summaries.

    • Communicate findings: turning numbers into a readable report or graph helps others understand what was found.

  • Personal example from the lecturer:

    • Worked in institutional research evaluating a program; created graphs and wrote a report on how the program benefited students.

    • Data set example: a data set with 365 rows (a long table) that would be hard to interpret without statistical tools or visualization.

  • Short definition: statistics is where we organize, summarize, and interpret information.

  • Note on note-taking: the instructor may move quickly between slides; students can request a repeat if needed.
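
To illustrate the "organize and summarize" function on a table like the 365-row data set mentioned above, here is a small sketch using only Python's standard library. The values are randomly generated stand-ins, not the lecturer's data, and the variable names are assumptions made for the example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 365 made-up numeric rows standing in for a long table.
values = [rng.gauss(50, 10) for _ in range(365)]

# A handful of summary statistics replaces 365 raw numbers.
summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Six numbers are far easier to read and to communicate in a report than the raw table, which is the point of the short definition above.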

Research methodology: foundational concepts before analysis

  • Before collecting data, determine whom the research question is about (the population).

  • Population:

    • The entire group of individuals that the research question is about.

    • Usually not feasible to study the entire population due to size, cost, and practicality.

  • Example: The US census attempts to collect data from all Americans every ten years; it’s extremely challenging and not perfect.

    • The census is very diligent (people may knock on doors, reminders are sent), but no census is perfectly complete.

  • Sample:

    • A smaller subset of the population.

    • Data from a sample is used to generalize to the population.

    • Conceptually, the sample is a miniature population; the goal is to infer what would happen if the entire population were studied.

  • Example concepts to illustrate population vs sample:

    • How many cats does the average American own? Population could be households in the United States; a sample could be Conway, AR households, or coastal campus students.

    • If we study two samples from two different contexts (e.g., a small mountain village vs. a hospital in a large city), the results may differ due to context, not just sampling error.

  • Population vs sample in practice:

    • Often not feasible to study the entire population; samples are used instead.

    • If the sample comes from non-representative contexts, results can be biased and not generalizable.

  • Non-examples and the need for representativeness:

    • In a zombie apocalypse example, a sample from a mountain village and a sample from a city hospital yield different infection percentages (3% vs 2%), illustrating how context matters and why samples may not reflect the population as a whole.

    • The more representative a sample is of the population, the more trustworthy the generalization.
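
The zombie example can be simulated to show why context-specific samples may mislead. The sketch below is an illustration under stated assumptions: the group sizes and infection rates are hypothetical numbers chosen for the demonstration (they are not the percentages from the lecture), and the helper names `make_group` and `infected_share` are invented here.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible


def make_group(context, size, infection_rate):
    """Build a list of (context, infected) records with a given infection rate."""
    return [(context, rng.random() < infection_rate) for _ in range(size)]


# Hypothetical population: an isolated village, a city hospital, and everyone else.
population = (
    make_group("village", 1_000, 0.02)
    + make_group("hospital", 2_000, 0.60)
    + make_group("elsewhere", 97_000, 0.10)
)


def infected_share(people):
    """Proportion of records flagged as infected."""
    return sum(infected for _, infected in people) / len(people)


village_sample = [p for p in population if p[0] == "village"]
hospital_sample = [p for p in population if p[0] == "hospital"]
random_sample = rng.sample(population, 500)  # simple random sample of the whole population

print(f"population:      {infected_share(population):.3f}")
print(f"village only:    {infected_share(village_sample):.3f}")
print(f"hospital only:   {infected_share(hospital_sample):.3f}")
print(f"random sample:   {infected_share(random_sample):.3f}")
```

Under these assumptions the village sample understates the population rate, the hospital sample overstates it, and the random sample should land much closer to it, which is what representativeness buys.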

Representativeness and sampling error

  • Representativeness:

    • A representative sample should be similar to the population; its results should closely resemble what would be found if the entire population were studied.

    • If the sample is not representative, conclusions about the population may be wrong.

  • Sampling error:

    • The difference between a sample statistic and the corresponding population parameter, a natural and unavoidable source of error.

    • It is distinct from sampling bias: sampling error arises by chance even in a well-drawn random sample, whereas sampling bias is a systematic distortion caused by how the sample was selected.

    • Even with careful sampling, samples are not identical to the population.

    • Expressed conceptually: sampling error = sample statistic − population parameter, e.g., \hat{p} - p for a proportion.

    • If the population height is, say, 6 feet on average, a sample will almost certainly not yield exactly 6 feet; there will be some deviation due to sampling error.

  • Examples to illustrate representativeness and errors:

    • Height example: sampling only horse jockeys, gymnasts, or basketball players would skew the estimate of the average American height; a truly representative sample would include a mix of people from across the population.

    • Jury decision-making research often uses college students as a sample, but the population includes actual jurors with different age ranges and experiences.

  • Practical takeaway:

    • Researchers aim for representative samples to generalize results, but perfect representativeness is rarely achievable; sampling error must be acknowledged and accounted for.
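
The height example above can be checked with a short simulation. This is a sketch under assumptions: the population is a hypothetical set of 100,000 heights in inches generated around 72 (6 feet), and the function names `sampling_error` and `typical_error` are made up for this illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of heights in inches, centered near 72 (6 feet).
population = [rng.gauss(72, 3) for _ in range(100_000)]
mu = statistics.mean(population)  # the population parameter


def sampling_error(n):
    """Sample mean minus population mean for one simple random sample of size n."""
    sample = rng.sample(population, n)
    return statistics.mean(sample) - mu


def typical_error(n, repeats=200):
    """Average absolute sampling error across many repeated samples of size n."""
    return statistics.mean([abs(sampling_error(n)) for _ in range(repeats)])


for n in (10, 100, 1000):
    print(f"n={n:>4}: one sample's error = {sampling_error(n):+.3f} in, "
          f"typical |error| = {typical_error(n):.3f} in")
```

Every sample here is drawn at random, so the errors come from chance alone rather than from a skewed selection such as sampling only basketball players. Larger samples tend to shrink the error, but they do not remove it.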

Real-world case studies to illustrate sampling issues

  • Jury decision making research:

    • Common practice in psychology is to use college students as participants, but the population of interest is actual jurors.

    • Key differences between college students and jurors:

      • Age range: students are typically 18–22, whereas jurors span a wider age range (possibly 18–90+).

      • Access to information: students may have greater access to academic resources; jurors may have different information exposure.

      • Previous jury experience: jurors may have served on juries before; students typically have little or no jury experience.

      • Demographics and worldview: college students often lean more liberal; jurors come from diverse backgrounds.

    • Implications:

      • Generalizing results from college students to all jurors is problematic; it depends on context and the degree to which the sample resembles the population.

      • Sometimes researchers broaden sampling to courthouses to obtain data from actual jurors, which can improve generalizability.

  • Takeaway about generalization:

    • Generalizing from a non-representative sample to a population is not guaranteed to be valid; researchers must acknowledge sampling limitations and strive for more representative samples when possible.

Practical guidance and ethical considerations

  • Graphs and data visualization:

    • Graphs can mislead if not used carefully; be critical about how data are presented (scales, selections, and framing matter).

    • Chapter 2's focus on graphs emphasizes real-world misinterpretation risks.

  • Critical examination of statistics:

    • Statistics can be massaged or misinterpreted; readers should seek to understand data sources, sampling methods, and potential biases.

  • Textbooks and research foundations:

    • Textbooks are built from individual research studies; understanding statistics helps connect textbook content to original research.

  • Ethical implications:

    • Accurate representation of statistics matters for policy, public opinion, and scientific credibility.

    • Misleading statistics can lead to poor decisions; researchers have a responsibility to report honestly and clearly.
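
One way a graph's scale can mislead is a value axis that does not start at zero. The following sketch only does the arithmetic behind that effect; it assumes a simple bar chart, and the function name `apparent_ratio` and the two values are hypothetical choices for illustration.

```python
def apparent_ratio(smaller, larger, axis_start=0.0):
    """Ratio of the drawn bar lengths when the value axis starts at axis_start."""
    if axis_start >= smaller:
        raise ValueError("axis_start must be below the smaller value")
    return (larger - axis_start) / (smaller - axis_start)


# Hypothetical values for two bars: 50 and 52 (a 4 percent difference).
a, b = 50.0, 52.0

print(apparent_ratio(a, b))        # axis from 0: the bars differ by a factor of 1.04
print(apparent_ratio(a, b, 49.0))  # axis from 49: the second bar is drawn 3.0 times as long
```

The data are identical in both cases; only the axis choice changes how large the difference looks.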

Quick reference: key definitions and notation (LaTeX)

  • Population size: N

  • Sample size: n

  • Population parameter (true proportion): p

  • Sample statistic (sample proportion): \hat{p} = \frac{X}{n} where X = number of successes in the sample

  • Expected value of the sample proportion: \mathbb{E}[\hat{p}] = p

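  • Variance of the sample proportion (for a simple random sample): \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}

  • Margin of error (for a proportion, where z is the critical value for the chosen confidence level; in practice \hat{p} stands in for the unknown p): \text{MoE} = z \sqrt{\frac{p(1-p)}{n}}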

  • General idea: a representative sample aims to approximate the population, but sampling error means the sample will almost always differ from the population to some extent.
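
The proportion formulas in the quick reference can be tried out numerically. This is a minimal sketch using Python's standard library: the poll counts are hypothetical, the function names are invented for the example, and \hat{p} is plugged in for the unknown p, which is a common approximation rather than something stated in the notes.

```python
import math
from statistics import NormalDist


def sample_proportion(successes, n):
    """p_hat = X / n."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    return successes / n


def standard_error(p, n):
    """Square root of Var(p_hat) = p(1 - p) / n."""
    return math.sqrt(p * (1 - p) / n)


def margin_of_error(p, n, confidence=0.95):
    """MoE = z * sqrt(p(1 - p) / n), with z taken from the standard normal distribution."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * standard_error(p, n)


# Hypothetical poll: 520 of 1,000 respondents favor a candidate.
x, n = 520, 1000
p_hat = sample_proportion(x, n)
moe = margin_of_error(p_hat, n)  # p_hat used in place of the unknown p

print(f"p_hat = {p_hat:.3f}, 95% margin of error = {moe:.3f}")
```

With these made-up counts the margin of error comes out to roughly three percentage points. Because n sits under a square root, quadrupling the sample size only halves the margin.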

Closing takeaway

  • Statistics is the tool that turns raw data into understandable, communicable information, enabling us to draw conclusions and generalize findings.

  • The key challenge is choosing samples that are representative of the population so that conclusions are valid and useful in the real world.

  • Always approach statistics with a critical eye: check data sources, sampling methods, and potential biases, especially when framing graphs or headlines.