Notes on Generalizability, Sampling, Replication, and Error Management in Psychology
Generalizability Restriction: "In the People We Studied…"
Population size is denoted by N; sample size by n; relationship: 0 < n < N. In most studies, n is much closer to 0 than to N, and in some cases n = 1 (the case method).
Case method and famous individuals: studying single cases can yield deep insights into cognitive domains or abilities, e.g., memory (Akira Haraguchi memorized 100,000 digits of pi), consciousness (Henry Molaison, HM, with impaired time perspective), creativity (Alma Deutscher: piano sonata at 6; violin concerto at 9; full opera Cinderella at 10). These individuals are interesting in their own right and can provide important insights into how the rest of us work.
By contrast, the vast majority of studies use much larger samples: typically n = 10, 100, 1,000, or more. Larger samples increase confidence in findings because sampling error shrinks as sample size grows (the standard error of a mean falls roughly in proportion to 1/√n).
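The link between sample size and sampling error can be checked with a small simulation. The sketch below is illustrative only: the population (10,000 scores with mean 100 and SD 15), the function name spread_of_sample_means, and the number of trials are all assumptions made for the example, not values from any study.

```python
import random
import statistics

def spread_of_sample_means(population, n, trials=500, seed=0):
    """Standard deviation of the sample mean across repeated random samples of size n."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

# Hypothetical population: 10,000 scores from a normal distribution (mean 100, SD 15).
_rng = random.Random(42)
population = [_rng.gauss(100, 15) for _ in range(10_000)]

# The spread of the sample means is the sampling error; it should shrink as n grows.
for n in (10, 100, 1000):
    print(n, round(spread_of_sample_means(population, n), 2))
```

Under these assumptions the printed spread should fall by roughly a factor of √10 each time n grows tenfold, which is why going from n = 10 to n = 100 buys more precision than the raw numbers might suggest.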
Ideal world vs. practical reality: in an ideal world, n would be very close to N itself, but real-world constraints (time, funds, access to participants) force researchers to use the largest feasible n.
Generalizability: the goal is to generalize from the sample to the population—i.e., to conclude that what was observed in the sample would also have been observed if the entire population had been measured.
Case method limitations and value: while case studies provide rich, detailed information, they are not typically generalizable to everyone; they can nonetheless illuminate mechanisms, hypotheses, and processes applicable more broadly.
Random sampling is not always used because of practical constraints; many studies rely on samples of volunteers, often university students, who are not representative of the global population.
External validity and representativeness: Random sampling helps obtain a representative sample, but most psychology studies involve nonrandom samples.
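To see why representativeness can matter, here is a minimal sketch comparing a random sample with a convenience sample of students. Everything in it is hypothetical: the population of 10,000, the 20% share of students, and above all the assumption that students score about 10 points higher on the measured trait than everyone else.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: 2,000 students and 8,000 non-students.
# ASSUMPTION for illustration: students score higher on average (60 vs. 50).
population = (
    [("student", rng.gauss(60, 10)) for _ in range(2_000)]
    + [("non-student", rng.gauss(50, 10)) for _ in range(8_000)]
)
true_mean = statistics.fmean([score for _, score in population])

# Random sample: every member of the population is equally likely to be chosen.
random_sample = rng.sample(population, 200)
random_mean = statistics.fmean([score for _, score in random_sample])

# Convenience sample: only students are available to the researcher.
students = [person for person in population if person[0] == "student"]
convenience_sample = rng.sample(students, 200)
convenience_mean = statistics.fmean([score for _, score in convenience_sample])

print(f"population mean:         {true_mean:.1f}")
print(f"random sample mean:      {random_mean:.1f}")
print(f"convenience sample mean: {convenience_mean:.1f}")
```

The convenience sample misses the population mean only because the example assumes students differ from everyone else. If that assumption were dropped (both groups centered on the same value), the two samples would give similar answers, which is the situation in which a nonrandom sample does little harm.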
Notable statistics on sampling across psychology (Henrich et al., 2010):
About 96% of the people psychologists study come from countries that have just 12% of the world's population.
70% come from the United States alone.
Why nonrandom sampling persists: logistical challenges, accessibility, and the reality that researchers often study locally accessible populations (e.g., university students) when sampling the global population would be impractical or impossible.
Is nonrandom sampling a fatal flaw? No, for two reasons:
1) Sometimes the representativeness of a sample does not matter for the question at hand.
2) Sometimes there is a reasonable starting assumption that the people in the sample behave similarly to others in the population; if there is a compelling reason to expect differences, researchers test those possibilities with new studies.
Bottom line: learning about some people can tell us a lot, and it tells us more than learning about no people at all. External validity is a practical concern, but nonrandom sampling does not automatically invalidate insights.
Methods of Explanation: Discovering Why People Do What They Do
Replication means repeating a study with the same procedures and a new sample from the same population; estimating psychology's replication rate means doing this across many published studies.
In recent years, many outlets have claimed a replication crisis in psychology, but the reality is nuanced.
Replication studies and key findings:
Some teams (e.g., Open Science Collaboration, 2015) found surprisingly low replication rates among the studies they selected.
Others (e.g., Klein et al., 2014) found reasonably high replication rates among the studies they selected.
However, these replication rate estimates are not necessarily representative of psychology as a whole.
Important caveats about replication studies:
Sampling: replication projects often did not randomly sample the entire population of published studies; they selected particular kinds of studies (e.g., easier-to-conduct studies) from particular subfields (usually social psychology), with little representation from neuroscience, developmental, or clinical psychology.
Method fidelity: sometimes exact original methods could not be replicated (details missing or changed, mistakes in replication), so some replications are not true replications of the original work.
Journalistic interpretation: journalists may treat imperfect replications as evidence of a broader crisis, which may overstate the case.
National Academies of Sciences, Engineering, and Medicine (US): concluded that it is not helpful or justified to label psychology as being in a crisis; there is no definitive replication rate for psychology or any science, because the question is open-ended and the answer depends on context.
The bottom line: no one knows the exact replication rate of experiments across psychology or science in general; replication is a tool that increases confidence but does not guarantee certainty.
Type I and Type II Errors: what can go wrong when drawing conclusions from evidence
A Type I error (false positive): concluding there is a causal relation when there isn't one. Example: concluding that playing violent video games increases aggression when it does not.
A Type II error (false negative): concluding there is no causal relation when there is one. Example: concluding that playing violent video games does not increase aggression when it actually does.
The trade-off: reducing the likelihood of Type I errors often increases Type II errors, and vice versa.
Analogy: a home security system with detector sensitivity
If sensitivity is high: you detect all burglars (no Type II errors) but may have many false alarms (Type I errors).
If sensitivity is low: you avoid false alarms (no Type I errors) but may miss burglars (Type II errors).
Practical choice: researchers balance these errors, aiming to minimize the more serious error in a given situation while accepting some risk of the opposite error.
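The security-system analogy can be turned into a small calculation. In this sketch the sensor readings are hypothetical: quiet nights are assumed to follow a normal distribution with mean 0 and burglar nights one with mean 2 (both with SD 1), and the alarm sounds whenever a reading exceeds a threshold. A low threshold corresponds to high sensitivity.

```python
from statistics import NormalDist

# ASSUMED sensor readings: quiet nights ~ N(0, 1), burglar nights ~ N(2, 1).
quiet = NormalDist(mu=0.0, sigma=1.0)
burglar = NormalDist(mu=2.0, sigma=1.0)

def error_rates(threshold):
    """Return (false_alarm_rate, miss_rate) when the alarm sounds for readings above threshold."""
    false_alarm = 1.0 - quiet.cdf(threshold)  # Type I: alarm on a quiet night
    miss = burglar.cdf(threshold)             # Type II: no alarm on a burglar night
    return false_alarm, miss

# Raising the threshold (lowering sensitivity) trades false alarms for misses.
for threshold in (0.0, 1.0, 2.0, 3.0):
    false_alarm, miss = error_rates(threshold)
    print(f"threshold {threshold:.1f}: false alarms {false_alarm:.1%}, misses {miss:.1%}")
```

No threshold drives both error rates to zero as long as the two distributions overlap; the choice of threshold only decides which error is made more often. A significance threshold in a study plays the same role as the detector threshold here.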
Replication and error rates:
If researchers designed experiments to avoid all Type II errors, they would report many false discoveries (high discovery rate but low replication reliability).
If they designed to avoid all Type I errors, they would miss many true discoveries (high reliability but low discovery rate).
The National Academies note that even with a general estimate of replicability, we do not know the expected non-replicability level for a healthy science; perfect replication is neither attainable nor necessarily desirable.
Replication serves a crucial function: it strengthens the evidence for a causal relationship rather than proving it absolutely; repeated replication increases confidence in the relationship.
Key takeaway: replication does not prove causality, but it moves us closer to certainty about causal relationships when results are consistently reproduced.
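The trade-off between discovery and reliability discussed in this section can be illustrated with Bayes' rule. The sketch below is a simplified model, not an estimate for psychology: the function name share_of_true_positives, the two study designs, and especially the assumption that 10% of tested hypotheses are true are all hypothetical. Here alpha is the Type I error probability and power is 1 minus the Type II error probability.

```python
def share_of_true_positives(alpha, power, prior_true):
    """Fraction of 'significant' results that reflect a real effect (Bayes' rule)."""
    true_positives = power * prior_true
    false_positives = alpha * (1.0 - prior_true)
    return true_positives / (true_positives + false_positives)

# HYPOTHETICAL assumption: 10% of the hypotheses researchers test are actually true.
PRIOR_TRUE = 0.10

# Two illustrative designs at opposite ends of the trade-off.
designs = {
    "lenient (few Type II errors)": {"alpha": 0.20, "power": 0.95},
    "strict (few Type I errors)": {"alpha": 0.001, "power": 0.30},
}

for label, d in designs.items():
    found = d["power"] * PRIOR_TRUE * 1000  # true effects detected per 1,000 hypotheses
    share = share_of_true_positives(d["alpha"], d["power"], PRIOR_TRUE)
    print(f"{label}: {found:.0f} true effects found, {share:.0%} of positives are real")
```

Under these assumed numbers the lenient design finds most of the real effects, but only about a third of its positive results are real, so many would fail to replicate. The strict design's positives are almost all real, but it finds fewer than a third of the effects that exist.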
The Reliability Restriction: "It Is Likely That …"
A replication is an experiment that uses the same procedures as a previous experiment but with a new sample from the same population.
In recent years, headlines claimed that replications often fail to reproduce original results, fueling concerns about reliability. There is nuance:
Not all replications are true replications due to incomplete methods or variation in procedures.
Even when replications fail, this does not necessarily mean the original finding was wrong; it may reflect context, sample differences, or methodological issues.
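A short simulation shows how a replication can fail even though the original finding is correct. The sketch assumes a real but modest effect (0.4 standard deviations) and 30 participants per group, both hypothetical values. For simplicity it uses a two-sided z-test with a known SD of 1, a stand-in for the t-test a real study would more likely use.

```python
import random
from statistics import NormalDist, fmean

def study_is_significant(rng, effect, n, alpha=0.05):
    """Simulate one two-group study and apply a two-sided z-test (SD assumed known and equal to 1)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(effect, 1.0) for _ in range(n)]
    standard_error = (2.0 / n) ** 0.5  # SE of a difference between two means, SD = 1
    z = (fmean(treated) - fmean(control)) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha

# The effect is REAL in every simulated replication (effect=0.4 is an assumed value).
rng = random.Random(7)
replications = 2000
hits = sum(study_is_significant(rng, effect=0.4, n=30) for _ in range(replications))
success_rate = hits / replications
print(f"{success_rate:.0%} of replications of a real effect reached significance")
```

With these assumed values, well under half of the replications reach significance even though the effect exists in every one of them. A single failed replication is therefore weak evidence that the original finding was wrong, especially when samples are small.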
The take-home messages:
Replication is essential for assessing reliability, but it does not by itself determine the truth of a theory.
People should be cautious about generalizing replication headlines to the entire field.
The best science uses replication as a tool to refine, revise, or strengthen causal claims rather than as a binary judgment of truth vs. falsehood.
The Case for Representative Sampling and Cautious Generalization
External validity matters: random sampling helps ensure that findings generalize to the population; however, practical constraints often necessitate nonrandom sampling.
Even with nonrandom samples, research can be informative if it relies on a principled assumption that the sample behaves similarly to the population, or if differences are tested explicitly.
The bottom line: learning about some people can illuminate general patterns but may not capture every subgroup. It is better to learn from some people than from none, while remaining mindful of limits.
Practical and Ethical Implications for Research Practice
Researchers should be transparent about sampling methods, sample characteristics, and potential limits to generalizability.
When possible, use random sampling or at least clearly articulate why nonrandom sampling was necessary and how it might affect conclusions.
Emphasize replication, open data, and methodological transparency to advance understanding and reduce misinterpretation of results.
Communicate findings with appropriate caveats about generalizability and replication status to avoid overgeneralization or sensational headlines.
Key Takeaways for Exam Preparation
Generalizability requires careful consideration of population size (N) and sample size (n): 0 < n < N.
Larger samples increase confidence in results; random sampling aims to produce representative samples, but nonrandom sampling is common in psychology due to practical constraints.
The cherry-picking analogy (hand-selecting cases rather than drawing them at random) helps illustrate why random sampling tends to produce more generalizable findings than nonrandom selection.
Replication is a tool to assess reliability, not a guarantee of truth; the replication rate of psychology as a whole is unknown and context-dependent.
Type I errors (false positives) and Type II errors (false negatives) are competing risks; researchers balance these errors to optimize study design and conclusions.
The reliability restriction emphasizes that replications can help refine our understanding, but headlines about a crisis should be read with caution and awareness of methodological nuances.
Core mathematical/statistical ideas touched upon include sampling proportions, the logic of replication, and the balance of error types, with expressions such as 0 < n < N, as well as the role of sample size in confidence and the conceptual use of a significance threshold for controlling Type I error probability.