Full study
Chapter 9: Inferential Statistics – Making Sense of Significance, Confidence, and Inference
Overview: In social/policy research, data often come from random samples intended to represent a larger population; the goal is to make inferences about population parameters from sample estimates. The three basic components of statistical inference are:
Point estimates: the sample statistic as the best guess of the population parameter.
Precision: how close the estimate is likely to be to the true parameter (described by standard errors and confidence intervals).
Significance tests (hypothesis tests): whether observed differences/relationships are real or likely due to sampling fluctuation.
The Sampling Distribution: Foundation for inference
Imagine repeating the same sampling procedure many times and collecting the distribution of the resulting estimates (a sampling distribution).
With enough repetitions (e.g., 100–1,000 samples), the sampling distribution tends toward a normal shape; its mean centers at the population parameter P.
The center of the sampling distribution equals the population parameter (P) and the spread is governed by the standard error.
The normal shape makes inference tractable, since the distribution is defined by its mean and standard deviation.
The Standard Error (SE)
SE measures the typical distance between a sample statistic and the population parameter due to sampling variability.
For a proportion p (sample proportion) with population proportion P and sample size n: SE(p) = √(P(1 − P)/n).
In practice, P is unknown; we substitute the sample proportion p for P to compute SE.
Example: If P = 0.05 and n = 400, the SE for a proportion is √(0.05 × 0.95/400) ≈ 0.011.
For a mean, SE = S/√n, where S is the population standard deviation (substitute s, the sample SD, when S is unknown).
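The two SE formulas above can be sketched in a few lines of Python (the example values are the chapter's P = 0.05, n = 400):

```python
import math

def se_proportion(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def se_mean(s: float, n: int) -> float:
    """Standard error of a sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

# Chapter example: P = 0.05, n = 400
print(round(se_proportion(0.05, 400), 4))  # → 0.0109
```

Note how SE shrinks with the square root of n: quadrupling the sample size halves the standard error.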
The Empirical Rule (for a Normal Sampling Distribution)
68% of the distribution lies within ±1 SE of the center
95% within ±2 SE (more precisely, ±1.96 SE)
99.7% within ±3 SE
These rules underpin construction of approximate confidence intervals (CIs).
Confidence Intervals (CIs)
A CI provides a range in which the true parameter is likely to lie, given the sample and a chosen confidence level.
General form: estimate ± Z* × SE, where Z* is the number of SEs corresponding to the desired confidence level.
Common values:
95% CI: Z* = 1.96 (appropriate in large samples; in small samples, the t distribution is preferred).
90% CI: Z* = 1.65
99% CI: Z* = 2.58
For a 95% CI for a proportion using the basic empirical rule, you can approximate with p̂ ± 2 × SE, but exact calculation uses Z* = 1.96.
Important caveat: Confidence intervals reflect only sampling error; they do not account for other sources of error (measurement error, coverage error, nonresponse, data processing errors, causal inference error, etc.).
Confidence Intervals for Proportions: Worked example
Example: A hospital patient-satisfaction survey with n = 100 and p̂ = 0.67 (67% satisfied).
SE(p̂) = √(0.67 × 0.33/100) ≈ 0.047.
95% CI: 0.67 ± 1.96 × 0.047 = 0.67 ± 0.09, i.e., roughly (0.58, 0.76).
Interpretation: We are 95% confident that the true satisfaction rate lies between 58% and 76%.
Substitution principle: The population parameter P is unknown; we replace it with p̂ in SE calculations.
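The hospital worked example can be reproduced in Python (a minimal sketch of the z-based interval described above):

```python
import math

def ci_proportion(p_hat: float, n: int, z: float = 1.96):
    """Confidence interval for a proportion: p_hat ± z * sqrt(p_hat(1-p_hat)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Chapter example: n = 100, p_hat = 0.67
lo, hi = ci_proportion(0.67, 100)
print(f"95% CI: ({lo:.2f}, {hi:.2f})")  # → 95% CI: (0.58, 0.76)
```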
Confidence Intervals for Means
For a mean, use SE = s/√n (sample SD s) if the population SD S is unknown.
95% CI for the mean: x̄ ± 1.96 × SE (the t distribution is more accurate for small samples: use t_df instead of 1.96; with large samples, t ≈ Z).
Example: If sample mean wait time is 8.3 hours, s = 5.6, n = 100:
SE = 5.6/√100 = 0.56
95% CI using Z: 8.3 ± 1.96 × 0.56 = 8.3 ± 1.10, i.e., about 7.2 to 9.4 hours.
With the t distribution, the critical value would be slightly larger than 1.96 for 95% in small samples, yielding a similar but slightly wider CI.
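The wait-time example can be checked the same way; the t critical value 1.984 for df = 99 is an assumption of this sketch (the text only says the t value is slightly larger than 1.96):

```python
import math

def ci_mean(xbar: float, s: float, n: int, crit: float = 1.96):
    """Confidence interval for a mean: xbar ± crit * (s / sqrt(n))."""
    se = s / math.sqrt(n)
    return xbar - crit * se, xbar + crit * se

# Chapter example: mean wait 8.3 hours, s = 5.6, n = 100
lo, hi = ci_mean(8.3, 5.6, 100)                   # large-sample z interval
print(f"({lo:.2f}, {hi:.2f})")                    # → (7.20, 9.40)
lo_t, hi_t = ci_mean(8.3, 5.6, 100, crit=1.984)   # t_{99} critical value (assumed)
print(f"({lo_t:.2f}, {hi_t:.2f})")                # slightly wider
```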
Power, Significance, and Hypothesis Testing (Inference about Population Parameters)
Significance tests (hypothesis tests) assess whether a difference or relationship is real (not a fluke of sampling).
Key components:
Null hypothesis (H0): typically a statement of no difference/no effect (e.g., no difference between groups, slope = 0).
Alternative hypothesis (H1 or Ha): the statement being tested (e.g., there is a difference, slope ≠ 0).
p-value: the probability, under H0, of observing results as extreme as or more extreme than those observed.
Decision rule (conventional): reject H0 if p-value < chosen significance level (alpha), commonly 0.05, but other levels (0.10, 0.01) are also used depending on context.
Important nuance: a p-value is not the probability that H0 is true; it is the probability of obtaining the observed data (or more extreme) given that H0 is true.
Significance levels and practical significance can diverge: a result can be statistically significant but with a trivially small effect size (practical significance).
Example (t-test for difference in means): If the observed difference is −1.072 with SE = 0.620, then t = −1.072/0.620 ≈ −1.73,
and the p-value is about 0.084, which is not significant at the 5% level but would be at the 10% level, depending on the context.
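The t statistic and its two-sided p-value can be verified with a short sketch; it uses the normal approximation to the t distribution, which is adequate in large samples:

```python
import math

def two_sided_p_normal(t: float) -> float:
    """Two-sided p-value using the normal approximation to the t distribution."""
    phi = 0.5 * (1 + math.erf(abs(t) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# Chapter example: difference = -1.072, SE = 0.620
t = -1.072 / 0.620
print(round(t, 2))                      # → -1.73
print(round(two_sided_p_normal(t), 3))  # → 0.084
```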
Interpreting Significance: Practical vs Statistical Significance
Box 9.1 outlines sources of statistical significance and statistical insignificance (e.g., large differences or small SEs vs large SEs or tiny differences).
It is possible to have statistically significant results that are not practically significant (and vice versa).
Publication bias toward statistically significant results is a concern; consider robustness and practical relevance, not just p-values.
When many tests are performed, multiple comparison corrections (e.g., Bonferroni, Scheffé) may be necessary to control the overall error rate.
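A minimal sketch of the Bonferroni correction mentioned above (the p-values are made up for illustration):

```python
def bonferroni(p_values, alpha: float = 0.05):
    """Bonferroni correction: compare each p-value to alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Five tests at overall alpha = 0.05 → per-test threshold 0.01
print(bonferroni([0.003, 0.04, 0.02, 0.009, 0.6]))
# → [True, False, False, True, False]
```

Note that 0.04 and 0.02 would be "significant" at 0.05 individually but survive the correction only if they beat 0.01, illustrating how multiple-comparison adjustments guard against Type I error inflation.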
Hypothesis Testing in Regression and Related Tests
Significance testing in regression typically uses a t-test for the slope (or intercept) to assess whether a relationship is present.
Example: NELS data show a slope for hours of homework on test scores; t = 2.45 with p = 0.016 indicates a statistically significant relationship at conventional levels.
For regression, the null hypothesis is "Population slope β = 0"; the alternative is β ≠ 0.
For relationships between categorical variables, chi-square tests examine whether there is an association; the null is no relationship; the alternative is there is a relationship.
p-values for different tests (t, F, chi-square) are interpreted similarly: small p-values imply rejection of the null; large p-values imply insufficient evidence to reject the null.
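The chi-square statistic can be sketched directly from its formula; the observed and expected counts below are hypothetical, not from a real table:

```python
def chi_square(observed, expected) -> float:
    """Chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical 2x2 table flattened to four cells; in a real test the
# expected counts would come from the table's row and column margins.
obs = [30, 20, 10, 40]
exp = [20, 30, 20, 30]
print(round(chi_square(obs, exp), 2))  # → 16.67
```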
Practical Topics in Inference
Power and Type II errors: power = 1 − P(Type II error); power increases with larger samples, stronger effects, and a larger alpha (a smaller alpha decreases power).
Minimal Detectable Effect (MDE): the smallest effect size that a study has power to detect at a given alpha and sample size.
Lehr’s equation for planning sample size in comparing two means: n per group = 16/Δ², where Δ = (difference in means)/σ is the standardized difference in means (effect size).
Cohen’s conventions for effect sizes: small ≈ 0.2, medium ≈ 0.5, large ≈ 0.8 (in terms of standardized mean difference).
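Lehr's rule of thumb, combined with Cohen's conventions, gives quick sample-size estimates; a sketch (the rule assumes roughly 80% power at alpha = 0.05, two-sided):

```python
def lehr_n_per_group(diff: float, sigma: float) -> float:
    """Lehr's rule of thumb: n per group ≈ 16 / Delta^2,
    for ~80% power at alpha = 0.05 (two-sided)."""
    delta = diff / sigma  # standardized effect size
    return 16 / delta ** 2

# Sample sizes implied by Cohen's small/medium/large conventions
for delta in (0.2, 0.5, 0.8):
    print(delta, round(16 / delta ** 2))  # 0.2 → 400, 0.5 → 64, 0.8 → 25
```

Small effects are expensive: detecting Δ = 0.2 requires roughly 400 per group, versus about 25 per group for a large effect.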
Precision with Complex Sampling and Nonprobability Samples
Complex sampling (clustering, stratification, oversampling) requires adjusted standard errors; statistical software can often compute design-corrected SEs directly, or design effects can be applied to simple-random-sample SEs.
Nonprobability samples (convenience samples, voluntary samples) challenge the basis of inference; superpopulation concepts and model-based inference are used in some cases, but interpretations must be cautious.
Bootstrapping provides an alternative inference approach when standard error formulas are hard to obtain or when assumptions are dubious.
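A percentile bootstrap for a mean can be sketched with the standard library alone; the wait-time data below are made up for illustration:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, reps=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample with replacement, take empirical quantiles."""
    rng = random.Random(seed)
    n = len(data)
    boot = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(reps))
    lo = boot[int((alpha / 2) * reps)]
    hi = boot[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

waits = [2, 3, 5, 8, 8, 9, 11, 12, 14, 20]  # hypothetical wait times (hours)
print(bootstrap_ci(waits))
```

The appeal is that no SE formula is needed: the sampling distribution is approximated empirically, which helps when formulas are unavailable or their assumptions are dubious.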
Bayesian vs Frequentist Inference
Frequentist inference treats probability as long-run frequency and relies on sampling distributions and p-values without prior information.
Bayesian inference starts with prior probabilities and updates them with data to form posterior probabilities; priors can be subjective but provide a coherent framework for updating beliefs.
In practice, most applied work uses frequentist methods, though Bayesian intuition influences how researchers interpret results.
Summary of Chapter 9 Takeaways
Confidence intervals quantify precision; they reflect sampling error but not all sources of error.
Significance tests help determine whether observed patterns are unlikely under the null, but p-values do not convey practical importance.
Larger samples reduce SE and can render small effects statistically significant; always consider practical significance and power.
When multiple tests are performed, consider corrections to control for type I error inflation.
Chapter 10: Multivariate Statistics – Making Sense of Multiple Variables
What multivariate statistics is about
Real-world phenomena involve many variables; multivariate methods help analyze multiple independent and dependent variables simultaneously.
The centerpiece in many applied settings is multiple regression, which predicts a dependent variable y from several independent variables x1, x2, …, xk.
Core equation: y = a + b1·x1 + b2·x2 + ⋯ + bk·xk.
Interpretation:
The constant a is the predicted value of y when all x's equal 0 (often of limited substantive meaning).
Each coefficient bj is the predicted change in y for a one-unit increase in xj, holding all other x’s constant (the key advantage over simple regression).
R-squared (R²) is the proportion of variation in y explained by all independent variables together.
Adjusted R-squared adjusts for the number of predictors, giving a less biased estimate of explained variance when comparing models with different numbers of predictors.
Example from the text: earnings predicted by education (x1) and experience (x2), with:
R² = 0.57
b1 (education) = $3,292 per year
b2 (experience) = $415 per year
Interpretation of the example: with 4 more years of education, holding experience constant, predicted earnings rise by 4 × $3,292 = $13,168.
Dummy variables and interaction terms: an interaction term (e.g., b_int × Hypertension × Diabetes) modifies the slope for one predictor depending on the level of the interacting variable.
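The mechanics of a multiple-regression fit can be illustrated with a plain-Python least-squares solve via the normal equations; the data below are fabricated for illustration (not the chapter's earnings data):

```python
def ols(X, y):
    """OLS via the normal equations (X'X)b = X'y, solved by Gaussian elimination.
    X: list of predictor rows (without intercept); y: list of outcomes.
    Returns [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(r) for r in X]  # prepend intercept column
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    A = [row[:] + [b] for row, b in zip(XtX, Xty)]  # augmented matrix
    for c in range(k):  # forward elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [0.0] * k
    for c in reversed(range(k)):  # back substitution
        beta[c] = (A[c][k] - sum(A[c][j] * beta[j] for j in range(c + 1, k))) / A[c][c]
    return beta

# Fabricated data generated exactly from y = 1 + 2*x1 + 0.5*x2
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
y = [1 + 2 * x1 + 0.5 * x2 for x1, x2 in X]
print([round(b, 3) for b in ols(X, y)])  # → [1.0, 2.0, 0.5]
```

Because the fabricated data contain no noise, the fit recovers the generating coefficients exactly; with real data the coefficients are estimates with standard errors.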
Nonlinearity and Transformations
Not all relationships are linear; nonlinear patterns (e.g., U-shaped) can be modeled with polynomial terms (e.g., Experience and Experience$^2$) or by transforming the dependent/independent variables (log, exponential, etc.).
When a quadratic term is included, interpret the two coefficients jointly to understand the marginal effect at different levels of the predictor.
Example: Earnings = a + b_exp × Experience + b_exp2 × Experience²; a negative b_exp2 implies diminishing returns to experience (an inverted-U pattern). As before, each coefficient bj gives the effect of xj holding other predictors constant, and R² summarizes overall fit.
Formula summary (from the chapter):
Multiple regression: y = a + b1·x1 + b2·x2 + ⋯ + bk·xk
SE of a proportion: SE(p) = √(P(1 − P)/n), substituting the sample p for the unknown P
SE of a mean: SE(x̄) = S/√n
Confidence intervals: p̂ ± Z* × SE(p); x̄ ± Z* × SE(x̄) (or t_df for small samples)
Margin of error: MOE ≈ 2 × SE
Test statistic: t = (Estimate − Null)/SE
Chi-square: χ² = Σ (Observed − Expected)²/Expected
Power = 1 − Pr(Type II error)
Lehr’s equation: n = 16/Δ², where Δ = (difference in means)/σ
Potential outcomes: each unit i has Y_1i (treated) and Y_0i (untreated); with treatment indicator D_i ∈ {0, 1}, the observed outcome is Y_i = D_i·Y_1i + (1 − D_i)·Y_0i; if D_i = 1 we never see Y_0i, and if D_i = 0 we never see Y_1i; the individual treatment effect is ITE_i = Y_1i − Y_0i; the average treatment effect is ATE = E[Y_1i − Y_0i] = E[ITE_i] = E[Y(1) − Y(0)]
Mediation: indirect effect IE = a × b; direct effect DE = c′; total effect TE = DE + IE = c′ + ab
Relative risk examples: RR = 2.5 (marijuana/tobacco), RR = 1.5 (alcohol); an effect of −16 percentage points
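Interpreting quadratic coefficients jointly amounts to evaluating the marginal effect b1 + 2·b2·x at different predictor values; the coefficient values below are hypothetical:

```python
def marginal_effect(b1: float, b2: float, x: float) -> float:
    """Marginal effect of x when y = a + b1*x + b2*x^2:
    dy/dx = b1 + 2*b2*x (declines with x when b2 < 0)."""
    return b1 + 2 * b2 * x

# Hypothetical earnings coefficients: b1 = 1200, b2 = -20 (dollars per year)
b1, b2 = 1200.0, -20.0
print(marginal_effect(b1, b2, 5))   # → 1000.0 (early career: strong returns)
print(marginal_effect(b1, b2, 30))  # → 0.0 (peak of the inverted U)
```

With these hypothetical values, the marginal return to experience peaks at 30 years (where b1 + 2·b2·x = 0), the inverted-U shape the text describes.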
How to Use These Notes on the Exam
Distinguish correlation from causation in every question.
Identify possible alternative explanations: reverse causation, common causes, or unknown confounders.
Determine whether a study design achieves exogeneity (randomization, intervention) or if it relies on observational evidence with control variables.
Be ready to discuss potential mechanisms and mediators that explain how a proposed causal effect would work.
If given a scenario, sketch a simple path diagram showing possible causal pathways (direct, mediated, and confounded) and indicate where bias might arise.
When asked to critique a study, consider time order, replication across contexts, plausibility of mechanisms, and the adequacy of controls for confounding.
Theories, Models, and Research Questions
Real-world example: Broken windows theory applied to New York City crime and subway crime reduction (Kelling & Wilson, 1982; Bratton). The idea: addressing small disorders (vandalism, graffiti, public drinking, loitering) can prevent more serious crime. In NYC, quality-of-life policing targeted petty disorder; crime fell in the 1990s. Debates exist: other factors (end of crack epidemic, economy) also plausible explanations. The central aim of the chapter: define theory and models, discuss variables, relationships, and causal mechanisms, and show how path diagrams and logic models help plan, manage, and evaluate programs.
What Is a Theory?
Theory in social science often means a logical idea about how part of the world works. Centered on middle-range theory (Merton, 1967).
King, Keohane, and Verba (1994): A social science theory is a reasoned, precise speculation answering a research question, including why the proposed answer is correct (p. 19).
Theories prompt questions and guide the search for plausible explanations; theories can describe large-scale (e.g., war) or small-scale (e.g., reading ability) phenomena. Theories are practical because they illuminate how to change the world.
Theories are not automatically true; they must withstand questioning and empirical testing.
The Key Functions of Theories
Identify key variables: The broken windows theory highlights disorder as a key variable, potentially affecting crime.
Tell causal stories: The theory posits a causal mechanism where disorder signals lack of social control, emboldening criminals.
Recognize that a theory is one of many possible causes; outcomes almost always have multiple causes (economic conditions, demography, weather, etc.).
Theories produce probabilistic predictions: they describe what is likely to happen on average, not guaranteed outcomes in every case.
Theories explain variation: they account for longitudinal variation (over time) and cross-sectional variation (across places or groups). Visuals referenced: Fig. 2.1 (murder rate over time, longitudinal variation) and Fig. 2.2 (murder rates across large U.S. cities, cross-sectional variation).
Theories Generate Testable Hypotheses
Good theories yield observable implications; they should be falsifiable (Popper, 1959).
Examples from broken windows: more crime in neighborhoods with vandalism and graffiti; less crime where such disorder is reduced.
A theory can be tested against data and compared to alternative theories.
Theories Focus on Modifiable Variables
Some causes are nonmodifiable (economy downturns, weather, population age structure), but theory often targets modifiable variables because they offer policy/practice leverage (e.g., increasing patrols against disorder).
Other theories may study nonmodifiable factors to understand broader influences on crime.
Where Do Theories Come From?
Grand theories (paradigms) vs. middle-range theories. Grand theories include structural functionalism, symbolic interactionism, rational choice, Marxist materialism, Freudian psychoanalysis, critical theory, feminism, postmodernism, etc. They shape how researchers frame variables and mechanisms.
Rational choice: crime explained as opportunistic self-interest where rewards outweigh costs.
Some grand theories (critical theory, postmodernism) are antipositivist and question causation and empirical testing in human behavior, leading to different forms of theory.
Theories come from
Academic disciplines (economics, sociology, psychology).
Induction (building theory from empirical observations) and deduction (starting from principles and testing them).
Exploratory studies, lived experience, and practitioner observations (police chiefs noting patterns).
Qualitative research and exploratory work can generate theory and hypotheses.
Induction and Deduction; Testing Theories
Induction: theory emerges from empirical observations; testing with independent data is essential to falsify.
Deduction: theory starts from principles; testing requires new data not used to generate the theory.
The distinction matters because using the same data to both generate and test a theory yields a non-falsifiable test.
Question prompt: sidewalk litter – induction vs. deduction? Identify the test needed to evaluate a proposed relation (e.g., income and crime). (From Box discussions in the chapter).
Exploratory and Qualitative Research
Qualitative research provides insight into processes and influences in social settings (e.g., high-crime neighborhoods).
Qualitative insights can suggest variables (e.g., social control) and potential policy targets.
Theories, Norms, and Values
Theories are descriptive (positive) rather than normative; they describe what produces crime, not what ought to happen.
Theories are not value-free: different theories emphasize different causes and thus imply different policy options (e.g., deterrence vs. social investment).
Underlying assumptions influence theories; these assumptions may come from discipline, culture, or politics. It’s important to make assumptions explicit when interpreting or creating theories.
What Is a Model?
A model communicates a theory; it is a representation of the causal process.
Types: graphical (path diagrams) or mathematical (equations).
Path diagrams will be used extensively in this book (e.g., Fig. 2.3 Broken Windows path diagram).
Variables and Relationships
A model consists of variables (ovals) and relationships (arrows).
The plus sign (+) on an arrow indicates the direction of the relationship.
A variable is something that can take different values (it must vary) and can be independent (X) or dependent (Y) in the causal order.
Independent vs. Dependent Variables:
Independent variable (X): the cause, symbolized as X.
Dependent variable (Y): the effect, symbolized as Y.
In simple models: Class size (X) → Test scores (Y), with a − sign on the arrow indicating a negative relationship. Use + and − signs on the arrows to reflect directionality; some categorical variables may not have a direction.
Tip 8: Recognize levels of detail; simpler models for big-picture proposals vs. more detailed models for implementation and evaluation; ideally, each link has empirical backing or clearly labeled gaps.
Inputs, Activities, Outputs, and Outcomes
Logic models can include program implementation aspects: inputs (resources), activities (actions), outputs (immediate products).
Outcomes can be short-term, intermediate, and long-term.
Example: CDC logic model for hypertension management focusing on chronic care management (CCM) shows inputs, activities, outputs, and cascading outcomes from better treatment to fewer heart disease/strokes.
The caution: avoid over-emphasizing implementation details at the expense of clearly articulating the causal mechanism to desired outcomes.
Additional Issues in Theory Building
Interpretivist theory: contrasts with the quantitative, middle-range view; focuses on interpreting meanings, norms, and symbols; aims for understanding rather than causal prediction.
Does Theory Shape Observation? Observations can be theory-laden; survey question design can influence responses and shape observed relations.
Theories of the Independent Variable: Sometimes a theory may posit that both X and Y are effects of a common cause (e.g., socioeconomic disadvantage). Attacking X alone may be less effective; consider underlying common causes.
Moderators (interactions): A moderator changes the strength/direction of a relationship (e.g., teacher experience moderating the effect of instruction time on test scores). In path diagrams, moderators are shown as arrows toward the relationship they influence (Figure 2.11).
Hierarchical (multilevel) models and contextual variables: Some relationships operate across different units (students within classrooms within schools). Higher-level contextual variables can affect relationships at lower levels.
Theoretical research vs. empirical research: Theoretical work synthesizes existing theories to predict new situations; examples include Rosen (1981) on technology changes and earnings concentration in opera singers; Christensen & Remler (2009) on electronic health records adoption and system lock-in.
How to Find and Focus Research Questions
A research question is the motivation for a study; real-world research often involves messy, iterative processes: starting with a broad question, refining it through data access, feasibility, and theoretical framing.
Applied research questions often arise from policymakers and practitioners (e.g., do smaller classes improve learning? does telework increase productivity? does lowering speed limit reduce fatalities?).
The process includes defining the intervention (X) and expected outcomes (Y), considering intervening variables, and exploring unintended consequences.
Example questions: Does the JPS class-size reduction improve third-grade exit-test scores? Through what mechanisms (instruction time, individual attention) does it operate? Are there unintended consequences (reduced resources for libraries, arts, or after-school programs)?
Chapter provides practical guidance on forming research questions; the process involves/benefits from using model-building tools, developing a path model, and evaluating feasibility given data constraints.
Descriptive vs Causal Questions
Descriptive (what is) vs. causal (what if) questions require different methods and data.
Researchers should clearly define whether their primary aim is description or causal inference.
Positive vs Normative Framing of Questions
Theories are positive (describing how the world is) rather than normative (how it should be).
Framing a question positively (without normative assumptions) improves testability and empirical focus.
Example: Instead of asking, Why aren’t more young people interested in politics and voting? (normative), ask: How interested are young people in politics, and how often do they vote? Does interest predict voting behavior?
Generating Questions and Ideas
How to generate research questions:
Review scholarly literature for anomalies or unanswered questions.
Explore policy/practice concerns and current events to surface relevant problems.
Read widely across disciplines to gain new perspectives.
Maintain a notebook of ideas.
Heuristics for question generation (Andrew Abbott, 2004): analogies (e.g., voting as an economic transaction), reversals (asking why people vote when individual impact is small), cross-disciplinary borrowing to develop new insights.
Conclusion: Theories Are Practical
The broken windows example shows theory guiding policy decisions and practical action.
Theories help explain causes of social problems and guide interventions; logic models help show how programs work and how to improve them.
Path diagrams are essential tools for representing causal reasoning and guiding analysis.
Boxed Highlights, Key Terms, and Exercises
Key terms to know include: Causal mechanism, Causal and noncausal relationships, Cross-sectional variation, Ecological fallacy, Grand theory, Hierarchical models, Hypothesis, Independent and dependent variables, Logic model, Moderation, Intervening variable, Path diagram, Unit of analysis, etc.
Boxes cover practical points: Box 2.1 on independent/dependent variables; Box 2.2 on equations as models; Box 2.3 on what a logic model is; Box 2.4 on a real-world AIDS program example; Box 2.5 on critical questions to ask about theory; Box 2.6 on doing your own research with heuristics; Box 2.7 (not explicitly numbered here) on interpretivism and observation; Box 2.8 on other examples explaining logic models (inputs/activities/outputs).
Figures mentioned include:
Figure 2.3: Path diagram of Broken Windows Theory (disorder → crime).
Equations as models: Y = a + bX, where Y is the dependent variable, X is the independent variable, a is the intercept, and b is the slope.
Direction of relationships via sign on arrows:
Positive relationships get a + sign on the arrow; negative relationships a − sign. Signs multiply along a causal path: (−) × (+) × (+) = (−). Logic models flow from Inputs → Activities → Outputs → Outcomes.
Fig. 2.10: Common Cause illustration (independent variable and dependent variable both influenced by a common cause).
Fig. 2.11: Moderator in a Path Diagram (moderator effect on a relationship).
Box 2.1: Independent and Dependent Variables – naming and interpreting directions.
Box 2.2: Equations as Models – right-hand side vs. left-hand side variables; Y = a + bX.
Box 2.3: What Is a Logic Model? – definition and purpose.
Box 2.4: China AIDS Prevention Program – real-world logic-model example with practical tips.
Box 2.5: Critical Questions to Ask About Theory, Models, and Research Questions.
Box 2.6: Tips on Doing Your Own Research – practical guidance for developing questions and models.
Chapter Resources and Exercises (Overview)
EXERCISES 2.1–2.6 prompt you to create theories, identify variables, determine directions, specify units of analysis, and draft a logic model for a real-world program.
STUDENT STUDY SITE: online resources including self-quiz, eFlashcards, and related materials.
Key terms and glossary appear throughout, with opportunities to test understanding using the included questions.
Purpose and Value of Evidence
We want to make a difference in the world (education, health, crime reduction, arts, innovation, housing, leadership) and need evidence beyond personal experience to know what works and to persuade decision-makers with authority and resources.
Good evidence comes from well-made research. Evidence can take many forms: journal studies, internal analyses, government or foundation reports, performance briefs, program evaluations, needs assessments, or surveys of clients or employees.
Government and international data sources provide empirical evidence across topics like health services, education, labor markets, crime, housing, and environment. Examples of data sources: data.gov (U.S.), data.un.org (UN). Similar data portals exist in many countries.
The Internet era creates an abundance of studies and statistics, but we must know how to choose, interpret, and apply them. Good research is well designed and well made; trust in brand names is limited because each study is unique with strengths and weaknesses.
Research methods are the techniques and procedures that produce evidence (sampling, measurement instruments, planned comparisons, statistical techniques). Understanding methods helps us judge study quality and the strength of its evidence.
May the Best Methods Win
Understanding research methods helps us argue about evidence that supports or undermines our aims.
Controversy example: abstinence-only sex education vs comprehensive sex education.
Abstinence-only advocates argue against condom distribution; proponents of comprehensive education argue it better addresses real-life behaviors and reduces pregnancy and STIs.
A review by Douglas Kirby (2007) identified 115 studies on pregnancy prevention programs for U.S. teens (abstinence and comprehensive).
The takeaway: in public policy, neither side wins merely by citing a single study; the battle centers on how well the studies are designed and conducted (methods matter).
Research-Savvy People Rule
Research methods are essential whether you are a researcher, analyst, practitioner, or administrator.
Reasons to know methods:
Good research provides a factual basis for decisions and strengthens arguments.
In the information age, the ability to find, understand, and apply complex information is highly valuable to organizations.
Funding and policy-making demand evidence-based programs and management reforms; to win support, you must demonstrate methodological understanding.
Without method literacy, you are at a disadvantage in securing jobs, advancing, and obtaining financial and political support.
Research, Policy, and Practice
Research has become central to modern public policy and management, reflected in performance measurement, program evaluation, and evidence-based practices.
Performance Measurement
The idea: measure performance to manage and improve. Examples include data-driven crime tracking (e.g., CompStat in New York City).
Emphasis on measuring performance across education, health care, and other sectors.
The information revolution supplies data; logic models are used to decide what to measure; valid and reliable measurements are discussed in later chapters.
Evaluation Research
Core questions: Did a program have an impact? Did it improve outcomes? Also describes implementation processes.
Evaluation is a standard requirement for government and foundation funding; Rossi, Lipsey, & Freeman (2003) and C. Weiss (1997) are key references.
Evidence-Based Policy and Programs
Governments, businesses, and nonprofits increasingly favor evidence-based approaches.
Decision-makers compare programs for effectiveness and cost-effectiveness (magnitude of effect relative to cost).
Chapters teach how to identify, assess, and produce good evidence to support aims.
Evidence Can Mislead
Not all evidence is perfect; methodological flaws can mislead.
Misleading measurements: flawed data collection can inflate or distort outcomes (NCLB and the Houston TAAS vs. Stanford TAAS discrepancy).
Misleading samples: nonrandom samples (e.g., USA Today poll) can overstate or misrepresent population characteristics; proper random sampling (e.g., GSS) provides more accurate estimates.
Misleading correlations: correlation does not imply causation (fluoridated water example; Hillier et al. 2000 showed that controlling for age, sex, weight, lifestyle can remove the apparent link).
What Is Research?
Research is a social and intellectual activity involving systematic inquiry to describe and explain the world.
Primary vs Secondary Research:
Secondary research: searches and syntheses of published sources (not the primary focus of this text).
Primary research: original collection or analysis of data to answer new questions; includes both primary data collection and the original analysis of secondary data (e.g., existing surveys, administrative records).
Data terminology:
Data: refers to raw observations or a data set; not the published facts.
It Comes in Various Shapes and Sizes
Research varies: large-scale vs small-scale, snapshots vs longitudinal, lab experiments vs naturalistic observation, carefully planned interventions vs opportunistic discoveries, theoretical analyses, informal internal analyses.
Inventiveness and creativity are important; good research often involves new methods or clever strategies.
It’s Never Perfect; It’s Uncertain and Contingent; It Aims to Generalize; Bits and Pieces of a Puzzle
No study is perfect; good consumers spot weaknesses but also identify strengths.
Uncertainty is inherent; results are often probabilistic and context-dependent.
Generalizability: the extent to which results apply beyond the original setting; real-world studies are often less generalizable; researchers strive for generalizability but must acknowledge limits.
Empirical evidence is cumulative; rarely is a single study definitive; scientific consensus emerges within bounds of probability.
Explain Generalizability
Generalizability is the ability to apply research results beyond the exact setting (time, place, circumstances) studied.
Example: emergency-visit policies with out-of-pocket payments tested in one insurance plan may not apply to older, less healthy, lower-income populations with different incentives and behaviors; results may be limited to that context.
Although generalizability is a goal, real-world research has limitations; still, the evidence can inform policy and practice when interpreted carefully.
Global Warming and Scientific Consensus
Across thousands of studies on global warming, none alone proves human causation, but the body of evidence supports a consensus that warming is occurring and is very likely caused by human activity (UN IPCC 2007; U.S. Global Change Research Program 2009).
Establishing consensus requires years of diverse research and debate; consensus can be tempered by new evidence or contested by dissenting researchers.
The process involves competition and critique, especially peer review, in which researchers assess each other’s methods and conclusions.
Peer review is usually blind to avoid bias; readers should approach research with honest, critical thinking even when peer-reviewed.
Quantitative, Qualitative, and Mixed Methods; Triangulation
Research can be quantitative (numbers, statistics), qualitative (language, images, meanings), or a mix.
Mixed methods combine strengths of both approaches; triangulation uses multiple methods to confirm findings.
Qualitative research can be rigorous; numbers alone do not determine quality.
Chapter emphasizes that qualitative research is foundational (Chapter 3) and good quantitative work relies on solid qualitative groundwork; both perspectives enhance each other.
Applied vs Basic Research
Applied research: conducted to solve practical problems; has direct policy or practice implications (e.g., unemployment, smaller class sizes, policing strategies).
Basic research: pursuit of knowledge for its own sake; may be less immediately practical but builds theoretical foundations that inform policy and practice.
Both types advance knowledge, though the link from basic research to application is often indirect.
Descriptive and Causal Research
Descriptive research describes the world: what things are, the size of phenomena, and how variables relate (associations or correlations).
Causal research seeks to determine what would happen if we change something: estimating the effect of interventions or policy changes.
In practice, practitioners need both descriptive understanding and causal evidence for effective policies and programs.
Autism example: policymakers and practitioners seek descriptions (how many, where, severity) and causal understanding (what would reduce incidence or severity).
Distinguishing description from causation is central to the text; Part II covers description, Part IV covers causation.
Correlation Is Not Causation
It is easy to confuse a correlation with a causal effect.
Example: educated mothers and autism incidence is a correlation, not proof that education causes autism; confounding factors may explain the relationship.
The fluoridation example shows a spurious correlation that disappears when other factors are controlled (Hillier et al. 2000).
Important skill: distinguish correlation from causation and assess evidence of causal effects (Chapter 11 and beyond).
Epistemology: Ways of Knowing
Ways of knowing include direct measurement, trusted authorities, tradition, intuition, and common sense.
The book emphasizes the scientific method as a privileged, systematic approach to knowledge production.
Readers should question scientific knowledge just as they question common sense or tradition.
The Scientific Method
Key characteristics of the scientific method:
Systematic observation or measurement (including qualitative observation).
Logical explanation via theory or model that aligns with logic and established facts.
Prediction in the form of a hypothesis derived from the theory; falsifiability is preferred over post hoc explanations.
Openness: methods are documented and available for review to enable replication.
Skepticism: peer review and critique to identify shortcomings or alternative explanations.
The scientific method is a privileged form of knowing because it is transparent, logical, and evidence-based; however, science can be misrepresented or misused, so critical appraisal remains essential.
Understand that interpretations of the method vary across disciplines and over time.
Is There One Truth in Social Science?
The social world differs from natural sciences due to human consciousness, culture, history, and politics; social phenomena vary more across places and times, making knowledge more contingent.
Social science is shaped by language and socially constructed categories, influencing what is observed and how it is interpreted.
Some reject the relevance of the scientific method to social policy (antipositivism), while others defend a broader, pragmatic version of scientific realism.
The authors describe a stance of scientific realism: social reality exists and can be studied with methods modeled on science, despite social constructions.
Induction and Deduction
Researchers use either induction (from systematic observation to theory) or deduction (from theory to hypotheses/tests), or a combination.
Induction is common in qualitative research; in quantitative work, patterns may inspire theory.
Structuralists argue for starting with theory; most researchers use a mix: theory informs data collection and data leads to new theories.
Fresh data are required to truly test a theory or hypothesis; data cannot be used both to develop and definitively confirm a theory in the same way.
Research is often iterative, alternating between deduction and induction.
Approaching Research From Different Angles
The book addresses three perspectives:
Consuming Research: readers as researchers, policymakers, journalists, or students who digest and apply findings.
Commissioning Research: clients frame questions, approve methods (sampling, measures), review briefs, and decide on changes; the choice of researchers influences quality; open communication with researchers is essential.
Conducting Research: applied research in government, nonprofits, business, and consulting; researchers may have diverse backgrounds; informal research by practitioners is also valuable.
Ethics of Research: research involving human subjects raises ethical concerns that shape study design and methods. Case studies illustrate historical ethical breaches and the development of ethical standards.
Ethical Issues in Research; History and Principles
Historical abuses led to formal ethical principles and procedures for human subjects research:
Nuremberg Code (1947): informed consent, voluntary participation, no harm, beneficence.
Declaration of Helsinki (1964): ethical principles for medical research.
Belmont Report (1979): U.S. framework for ethics regulations (45 C.F.R. Part 46).
Core ethical standards (the Belmont standards):
Respect for persons: informed consent and voluntary participation.
Beneficence: minimize harm and maximize benefits.
Justice: fairness in the distribution of research benefits and burdens.
Many countries have ethics review processes (IRBs in the U.S.; analogous bodies elsewhere).
Informed consent involves ensuring understanding and voluntary participation; challenges include comprehension levels, language/cultural barriers, and power dynamics.
Privacy and confidentiality vary by research form and context (administrative data vs in-depth interviews; public data vs restricted data).
Informed consent and deception: debates about using deception versus transparency; safeguards include allowing withdrawal, debriefing, and minimizing risk.
Ethical issues depend on the form and context of research (qualitative interviews, measurement length, secondary data usage, laboratory experiments, randomized trials, quasi-/natural experiments).
The text signals that ethics must be considered throughout the research lifecycle, including policy applications and IRB processes.
Informed Consent: What It Entails
Informed consent requires understanding what participation involves and being competent to consent.
Voluntary consent may be complicated by power imbalances or potential consequences (e.g., benefits eligibility).
Challenges include language proficiency, reading level of consent forms, and cultural differences in interpreting participation.
Researchers must balance providing enough information with ensuring comprehension; different contexts raise different consent considerations.
Ethical Issues Depend on Form and Context
Confidentiality means different things across study types (statistical health data vs in-depth interviews about abuse).
Informed consent, confidentiality, deception, and acceptable data use vary by research method and context.
The chapters preview where ethical considerations will be discussed in detail (qualitative methods, measurement burden, secondary data, primary data collection, laboratory/causal experiments, randomized trials, quasi/natural experiments, and policy applications).
Conclusion: The Road Ahead
Research methods draw on multiple disciplines (sociology, economics, health sciences, education) and form a complex landscape.
Even experienced researchers struggle to communicate across disciplinary dialects; the goal is to cut through terminological confusion and understand core issues.
The book aims to equip readers to think critically about theory, models, and research questions and to engage with research as both consumers and producers.
Chapter Resources and Key Terms
Key terms introduced: Applied research, Basic research, Beneficence, Causal research, Contingent, Data, Deduction, Descriptive research, Epistemology, Evaluation research, Generalizability, Induction, Justice, Peer review, Performance measurement, Positivism, Primary data, Primary research, Relationships, Research methods, Respect for persons, Scientific method, Scientific realism, Secondary data, Secondary research, Spurious correlation, Structuralists.
Exercises (PROMPTS)
Battleship Research 1.1: Identify other policy debates where opponents use research; discuss role of research methods.
Research in the Corner Office 1.2: Identify leadership roles in your field that will use or commission research; consider interviewing someone.
Following the Trends 1.3: Provide examples of a performance measure, a program evaluation, and an evidence-based policy in your area.
Misleading Evidence 1.4: Find a news article about a study; assess whether criticisms relate to misleading measurements, samples, or correlations.
Descriptive vs Causal Research 1.5: Propose descriptive and causal questions for a social issue.
Ways of Knowing 1.6–1.7: Explore sources like Wikipedia and FiveThirtyEight for methodological alignment with the scientific method; evaluate sources and citations.
Ethical Research: Informed Consent 1.8–1.9: Find a study involving human subjects; discuss ethical issues; plan an interview-based study ensuring respect for persons, beneficence, and justice; define informed consent contents.
Study Site and Further Resources: Access the Sage Study Site for quizzes, eFlashcards, and additional resources.
Notes on Notable Examples Mentioned
Discrepancy between Houston students' TAAS scores and their Stanford Achievement Test results, criticized by the New York Times (Schemo & Fessenden, 2003), illustrating measurement issues and differences in test validity.
USA Today quick poll vs General Social Survey (GSS) on gun ownership showing how sampling methods affect results.
Hillier et al. (2000) UK study showing correlations between fluoridated water and bone fractures can be explained by confounding variables; demonstrates spurious correlation.
Milgram (1960s) obedience studies and Humphreys (1966–67) social observation study are classic ethical debates illustrating risk, deception, consent, and the balance of scientific value and ethical safeguards.
Belmont Report (1979) established foundational ethics standards: respect for persons, beneficence, justice, and the role of IRBs.
References to Frameworks and Theorists Mentioned
Rossi, Lipsey, & Freeman (2003) on evaluation research
C. Weiss (1997) on evaluation and policy
Hatry (2007); Kaplan & Harvard Business School (2009); Poister (2003) on performance measurement and management
Davies, Nutley, & Smith (2000) on evidence-based policy and practice
Scola (2012) on data-driven political campaigns
Becker & Becker (1998) on deduction in economics
Godfrey-Smith (2003); Bunge (1993) on scientific realism and philosophy of science
Hillier et al. (2000) on controlling for confounders in an observational study (UK)
UN IPCC (2007); U.S. Global Change Research Program (2009) on global warming consensus
Key Takeaways
The credibility of policy and practice hinges on the quality of methods; better-designed studies tend to produce more reliable conclusions.
No study is perfect; critical appraisal requires weighing weaknesses against strengths and the breadth of the evidence base.
Descriptive and causal research serve different purposes; understanding their differences is essential for applying research to real-world problems.
Ethical considerations are integral to all research with humans, with historical cases driving current safeguards and procedures.
Researchers and practitioners should be literate in methods to be effective consumers, commissioners, and conductors of research.
Secondary Data: Overview
Secondary data are data that already exist because they were collected for a prior administrative or research activity.
Importance in social and policy research: low-cost computing/storage and the Internet make secondary data widely available and useful.
Examples from flu tracking:
CDC uses administrative data: hospital ER visits for flu symptoms, OTC cold/flu medication purchases, vaccine uptake from surveys.
Real-time indicators include Google query searches about flu symptoms (Ginsberg et al., 2009).
Big Data context:
Big Data refers to vast stores of qualitative data (texts, images, audio, video) and quantitative data (tax records, spending, etc.), often linked and analyzed to solve problems.
Real-world Big Data examples: CDC’s Google Flu Trends; NYC’s sewer mapping to detect illegal grease dumping by restaurants (Feuer, 2013).
Learning goal of chapter: understand sources and types of existing data and how computing advances turn qualitative data into quantitative data; contrast with primary data collection (to be covered in the next chapter).
Real-world relevance: secondary data underpin many policy analyses and program evaluations; they enable large-scale, cost-effective research.
Big Data and the Virtual World
Our digital lives generate vast data streams across contexts: shopping, socializing, studying, collaborating online.
Much nonvirtual activity leaves electronic traces: medical tests, search terms, YouTube videos, etc.
Much of this data is qualitative (texts, images, audio, video). Examples: emails, blogs, tweets, webpages, government and organization documents.
Numerical/quantitative data also proliferate: taxes, spending, school attendance, crime, economic indicators; private sector data include financial transactions, inventory, payroll, prices, stock values.
Big Data enables problem-solving through data integration and analysis across sources.
Practical examples:
Google Flu Trends as an instance of Big Data in public health.
New York City example using Big Data to identify non-compliant grease disposal by linking sewer data to restaurant data (Feuer, 2013).
Takeaway: learning about data sources and structures is the first step toward leveraging Big Data in policy analysis.
Quantitative Data—and Their Forms
Definition: Quantitative data are information recorded, coded, and stored in numerical form. ("Data" is a plural noun; a "dataset" is a single collection.)
Quantitative data typically arise from measurement, though not always strictly; statistics provides the tools for analyzing them.
Quantitative data can be of two broad kinds:
Quantitative variables: numeric measurements representing quantities (e.g., age in years, income in dollars); examples include counts (ER visits, Google searches) and monetary amounts. These are inherently numeric variables.
Categorical variables recorded as numbers: many categorical variables are coded numerically for analysis, e.g., gender 1 = male, 2 = female; happiness 1 = very happy, 2 = somewhat happy, 3 = not happy; region 1 = Northeast, 2 = Midwest, 3 = South, 4 = West. Note: numerically coded categorical variables count as quantitative data because they can be sorted, counted, summarized, and analyzed; many statistical methods address categorical data specifically (Agresti, 2007). Example 1: years of education is numeric but can be grouped into categories (e.g., less than high school, high school, college) when analyzed against other variables like income and health status. Example 2: gender can be numerically coded (e.g., male = 1, female = 2) for use in statistical analyses like regression.
Qualitative data can be automatically coded into categories (qualitative → quantitative): example—Twitter messages with hashtags (#) indicating topic; hashtag counts yield quantitative trends.
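The hashtag-counting idea can be sketched in a few lines of Python; the sample messages below are invented for illustration:

```python
import re
from collections import Counter

# Invented sample messages for illustration
tweets = [
    "Feeling awful today #flu",
    "Got my shot! #vaccine #Flu",
    "Big game tonight #football",
    "Stocking up on tissues #flu",
]

def hashtag_counts(messages):
    """Extract hashtags and tally them: qualitative text becomes quantitative counts."""
    tags = []
    for msg in messages:
        # Lowercase so "#Flu" and "#flu" count as the same topic
        tags.extend(re.findall(r"#(\w+)", msg.lower()))
    return Counter(tags)

counts = hashtag_counts(tweets)
print(counts)  # Counter({'flu': 3, 'vaccine': 1, 'football': 1})
```

Aggregated over time, such counts become a quantitative trend line, which is essentially how search-query and social-media signals feed into tools like Google Flu Trends.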
Data forms/structures depend on aggregation level and time dimension (Table 6.1 referenced):
Forms include microdata, aggregated (ecological) data, single-measure data, and multilevel data.
Real-world linkages:
Administrative records, surveys, and other sources can generate both quantitative and coded qualitative data.
Data Management: Then and Now
Administrative data originate from MIS (management information systems): financial records, employee records, production records, client records, performance indicators.
Paper vs electronic storage varies; much data still in paper form, though electronic medical records growing.
Converting to research-ready form often requires flattening relational databases to flat-file formats for statistical analysis.
Relational databases contain multiple linked tables; statistical analysis generally uses flat files where rows = units of analysis and columns = variables (Figure 6.4).
Data cleaning is a crucial step: verify/correct/code fields, handle string data (e.g., “flu” vs “FLU” vs “influenza”); fix spelling variations; address inconsistent entries.
Data quality issues: inaccurate, incomplete, or inconsistent cases; range checks; logical consistency across variables; duplicate removal.
As Big Data expands, data cleaning becomes even more important.
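A minimal sketch of these cleaning steps in Python, using invented records and a hypothetical numeric code (23 = influenza): spelling variants are normalized to one coded category, implausible values fail a range check, and duplicate entries are dropped.

```python
# Invented raw administrative records for illustration
raw_records = [
    {"id": 1, "diagnosis": "flu", "age": 34},
    {"id": 2, "diagnosis": "FLU", "age": 29},
    {"id": 3, "diagnosis": "influenza", "age": 210},  # fails the age range check
    {"id": 2, "diagnosis": "FLU", "age": 29},         # duplicate of id 2
]

# Map spelling variants to one canonical numeric code (23 = influenza, hypothetical)
DIAGNOSIS_CODES = {"flu": 23, "influenza": 23}

def clean(records):
    """Drop duplicates, apply a range check, and code string fields numerically."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["id"] in seen:
            continue                      # remove duplicate entries
        seen.add(rec["id"])
        if not (0 <= rec["age"] <= 120):
            continue                      # range check: drop implausible ages
        code = DIAGNOSIS_CODES.get(rec["diagnosis"].strip().lower())
        cleaned.append({"id": rec["id"], "diagnosis_code": code, "age": rec["age"]})
    return cleaned

cleaned = clean(raw_records)
print(cleaned)  # two clean rows remain (ids 1 and 2), both coded 23
```

Real cleaning jobs add logical-consistency checks across variables (e.g., pregnancy coded for a male patient), but the shape of the work is the same.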
Privacy and ethics: nonidentifiability (de-identification) is essential when releasing administrative data; strip direct identifiers (names, SSNs) before release, while enabling linkage through secure, approved processes.
Flattening and integration challenges: turning relational data into flat files is time-consuming but often necessary; advanced analyses may require multi-level or panel data structures.
Administrative Records and Their Preparation for Research
Administrative records commonly used for policy research:
Financial and expenditure records; employee records; output production records; client records (patients, students); performance indicators.
Stored in MIS; some still on paper.
Adapting administrative data for research involves:
Data cleaning and formatting into flat-file structures suitable for software like SPSS, SAS, or Stata.
Verifying and recoding variables; ensuring numeric formats; handling missing data.
Potentially linking datasets across programs (e.g., Medicare, Social Security, cancer registry, death certificates) to address broader questions; requires security and de-identification.
Data management challenges:
Paper records present retrieval and coding challenges.
String formats require numeric coding (e.g., illness type coded as 23 for influenza).
Duplicate entries and inconsistent field types must be addressed.
Data cleaning tasks are time-consuming and often more work than the statistical analysis itself.
Flattening relational databases: the process of converting multi-table databases into flat, cross-sectional data suitable for analysis; multi-level analyses may require combining units of analysis across levels.
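One way to picture flattening, assuming a simple two-table database (patients and visits, both invented): the child table is aggregated and joined onto the parent table so that rows = units of analysis and columns = variables.

```python
# Parent table: one row per patient (the unit of analysis)
patients = [
    {"patient_id": 1, "age": 34},
    {"patient_id": 2, "age": 29},
]

# Child table: many rows per patient, as stored in a relational database
visits = [
    {"patient_id": 1, "date": "2013-01-05"},
    {"patient_id": 1, "date": "2013-02-11"},
    {"patient_id": 2, "date": "2013-01-20"},
]

def flatten(patients, visits):
    """Aggregate the child table, then join it onto the parent table."""
    visit_counts = {}
    for v in visits:
        visit_counts[v["patient_id"]] = visit_counts.get(v["patient_id"], 0) + 1
    return [
        {**p, "n_visits": visit_counts.get(p["patient_id"], 0)}
        for p in patients
    ]

flat = flatten(patients, visits)
# Each row is now one patient with a derived variable (n_visits),
# ready to load into statistical software as a flat file.
```

A multilevel analysis would instead keep both levels and link them, but for most cross-sectional statistical work this aggregate-and-join pattern is the workhorse.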
Administrative data privacy and ethics:
HIPAA regulates health data privacy; data must be nonidentifiable when used for research.
Access often requires secure procedures; data may be matched across sources with identifiers removed before release.
Protecting privacy becomes more challenging as data linkage increases; privacy scandals highlight the need for careful governance.
Data sources that produce published tables (aggregate data) vs. microdata:
Published aggregate tables (e.g., poverty by state) are easier to access but lack detail of microdata; larger studies may rely on microdata with proper governance.
Where Do Quantitative Data Come From?
Quantitative data come from diverse sources and methods:
Administrative record data, commercial transactions, sample surveys, and linked data across sources.
The Internet-era data include web traffic, search terms, online transactions, etc.
Creative researchers often combine/link data from multiple sources to create richer variable sets.
Administrative data continue to be central to program evaluation and policy analysis; increasingly, vendors provide commercially purchased data (e.g., pharmaceutical sales, bankruptcy data).
Public/private data linkages and data fusion are common, including GIS-based linking by geography.
Data quality/compatibility issues persist across sources; data cleaning and matching are essential steps before analysis.
Administrative Data: Data Availability, Ethics, and Public vs. Private Data
Public and private organizations collect administrative data; data may be used for research with appropriate approvals.
Ethical considerations:
Administrative records often contain private information not collected with consent for research; use is regulated.
HIPAA governs health information sharing; researchers must follow strict privacy rules.
A key ethical obligation is nonidentifiability: data released should not enable identification of individuals.
When linking confidential datasets, an approved process allows researchers to receive stripped data (identifiers removed) to preserve confidentiality.
Data access and privacy in practice:
PUMS (Public Use Microdata Samples) are de-identified microdata with restrictions on geography to protect privacy.
Census Research Data Centers (RDCs) provide secure facilities for accessing confidential microdata.
Data purchasing and licensing:
Some data are purchasable (e.g., IMS Health pharmaceutical sales) and may come from publicly available sources or proprietary providers.
Commercial vendors add value by cleaning/formatting data for research use.
Ethical questions for researchers:
How to access and use data while preserving privacy and complying with legal restrictions?
How to balance public value against individual privacy concerns?
Practical exercise prompts:
Identify administrative data you might have access to at work or school; describe storage, access, and ethical issues.
Published Data Tables and Data Archives
Many agencies publish aggregate data tables online; these are accessible and useful for macro-level analyses.
Example: Table 6.2 Aggregate Data Table (U.S. Census Bureau) showing poverty by state, with counts (in thousands) and percentages; notes explain data universe changes (e.g., pre- vs post-2006 ACS universe).
Important notes when using published data:
Always read notes/documentation to understand variables, sources, years, and calculations.
Properly cite the data source and download date since data can be updated or corrected.
Published data tables can be used for aggregate panel studies but lack micro-level detail.
The process of assembling data from published tables often involves carefully documenting definitions and sampling frames.
Where to Find Published Tables and Data Archives
Data archives and portals include:
ICPSR (Inter-university Consortium for Political and Social Research) – University of Michigan
GESIS (German Social Science Infrastructure Service)
Roper Center Public Opinion Archives – University of Connecticut
SDA (Survey Documentation and Analysis), UC Berkeley
CESSDA (Council of European Social Science Data Archives)
UK Data Archive (UKDA)
Data archives store microdata and documentation to facilitate reuse; access may require registration or formal agreements.
Ethics of public-use microdata:
Public-use microdata reveal individual responses; risk of re-identification exists, so many archives top-code sensitive values and limit geography.
RDCs provide controlled access for researchers needing more detailed data.
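Top-coding itself is simple to illustrate: values above a disclosure threshold are replaced by the threshold, masking the extreme values that could identify someone. The threshold and incomes below are invented.

```python
# Hypothetical disclosure threshold for a sensitive variable (income)
TOP_CODE = 150_000

incomes = [42_000, 87_500, 2_300_000, 61_000]

# Replace any value above the threshold with the threshold itself
topcoded = [min(x, TOP_CODE) for x in incomes]
print(topcoded)  # [42000, 87500, 150000, 61000]
```

The $2.3 million income, which might single out one respondent, becomes indistinguishable from other high earners, at the cost of some analytic detail at the top of the distribution.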
Public-use microdata examples:
CPS (Current Population Survey)
NHIS, NHANES, BRFSS, YRBSS, MEPS, NAEP, NELS, NHES, NAAL, PSID, NLSY, NCVS, AHS, HRS, SIPP, ANES, Eurobarometer, WVS, ISSP, ESS, and more.
Access tools and online analysis:
WEAT (Web Enabled Analysis Tool) for online analysis of certain datasets; NAEP Data Explorer; DataFerrett (DataWeb) for CPS
Public-use microdata vs. restricted data:
Some microdata are publicly downloadable; others require application or restricted access due to privacy.
Public Use Microdata: Examples, Access, and Ethics
Public-use microdata are provided by major surveys in many policy areas (education, health, labor, crime, housing, politics).
Data access options:
Downloadable public-use microdata
Online analysis tools (WEAT, NAEP Data Explorer, GSS Nesstar, ANES public data tools)
Some datasets provide both microdata and aggregated tables
Major surveys (Table 6.4 overview):
Health: NHIS, NHANES, NHCS (Healthcare), BRFSS, YRBSS, MEPS
Education: NAEP, NELS, NHES, NAAL, CPS (labor/economics aspects)
Labor and employment: SIPP, PSID, NLSY79/97, etc.
Crime and housing: NCVS, AHS, HRS (retirement), PSID families
Political/social attitudes: GSS, ANES
International: Eurobarometer, WVS, ISSP, ESS
Data documentation and codebooks are essential for understanding how data were collected, coded, weighted, and analyzed.
Ethics of public-use microdata:
Even when data are anonymized, detailed microdata can pose privacy risks when linked with other data.
Access often requires confidentiality agreements and may restrict certain geographic detail (e.g., PUMS geographies).
Secondary Qualitative Data and Linking Data
Secondary qualitative data reuse: UK Data Archive and other repositories archive qualitative data (interviews, focus groups, narratives) for secondary analysis.
Ethical considerations for secondary qualitative data:
Original consent may not cover secondary uses; possible mismatch with new research questions.
Additional consent or fit within the language of the original consent is often required.
Qualitative data can be linked with administrative data or aggregate data to create richer analyses; examples include linking interview data to neighborhood characteristics or to survey results with area-level data via GIS.
Linking data across sources is a core feature of Big Data research, enabling richer variable sets and more powerful analyses.
Some Limitations of Secondary Data
Availability biases: some data are more readily accessible than others; e.g., data on elderly populations are centralized under Medicare, while younger populations may be harder to study because their data are dispersed across sources.
Data availability can distort research questions: researchers may study topics that are well-covered by available data rather than the most important questions.
While data access is expanding, choice of data sources can constrain analyses and require compromises.
When to Collect Original Data?
Reasons to collect primary data include:
Small-area studies requiring data at city/neighborhood scales not available in microdata or publicly available data.
Need to measure variables not captured in existing data; missing or insufficient variables.
Desire for a specific combination of variables or up-to-date data.
Privacy/confidentiality restrictions limit reuse of existing data.
Next chapter focuses on collecting primary data (surveys, observations).
Conclusion: Chapter summarizes the sources and uses of quantitative data, highlighting the need to understand data provenance and quality for policy-relevant research.
Chapter Resources and Key Terms (Chapter 6)
Key terms to review (selected):
Aggregate (ecological) data, Big Data, Codebook, Cross-sectional data, Data archive, Data cleaning, Flat file, Longitudinal data, Metadata, Microdata, Multilevel (hierarchical) data, Nonidentifiable, Online data analysis tool, Panel data, Pooled cross sections, Prospective cohort, Quantitative data, Relational database, Time series, Unit of observation, and more.
Exercises focus on identifying data forms, sources, and methods for using secondary data, as well as exploring public-use microdata and online analysis tools.
Surveys and Other Primary Data
Chapter 7 introduces the collection of primary data through surveys and observation, plus other primary data sources (trained observation, instruments, experimental data).
Context: US economic indicators in 2010 provide a backdrop for the central role of surveys in understanding the economy (e.g., unemployment, consumer sentiment; 9.9% unemployment in April 2010; 7 in 10 adults believed the country was on the wrong track).
Surveys underpin knowledge about health, housing, crime, transportation, education, and more; they guide public policy and organizational management.
Core questions before conducting a survey:
Do you know enough about the topic to design questions? If not, qualitative methods (focus groups) may be needed first.
Does the information exist in another source? Avoid duplicating data collection if data already exist.
Can people provide the information you want? Some data may be hard for respondents to recall or measure accurately.
Will people provide truthful answers? Especially for sensitive topics.
Steps in the survey research process:
Identify the Population: Clearly define the target group for the survey.
Develop a Questionnaire: Create a set of questions designed to elicit the desired information.
Pretest Questionnaire: Test the questionnaire with a small group similar to the target population to ensure the wording is clear and unambiguous and to validate that the questions measure what they intend to measure.
Recruit and Train Interviewers: If applicable, select and equip interviewers with the necessary skills to avoid biasing interviewees.
Collect Data: Administer the survey to the identified population.
Analyze and Present Findings: Process the collected data and communicate the results.
Modes of survey data collection:
Intercept Interview Surveys: brief interviews typically conducted in high-traffic public spaces such as shopping malls; offer quick data collection and generally high response rates, though samples may not be representative of the broader population.
Household Interview Surveys: In-person interviews conducted in respondents' homes. Often high-quality data; advantages include high response rates; disadvantages include time, cost, the need for geographic clustering, potential social desirability bias, and interviewer effects; CAPI/CASI can improve administration and confidentiality. Common for sensitive topics like income or health.
Telephone Interview Surveys: Conducted over the phone, often for customer satisfaction or public opinion polling. Fast and cost-effective; use RDD for sampling; CATI for live data capture; high effort required to reach respondents; declining response rates; BRFSS as a major US example.
Automated Telephone Surveys (IVR/robocalls): Surveys where questions are asked and responses collected using automated voice systems. Very low cost but often low response rates; best for short, simple questionnaires; risk of leading questions and push polls.
Mail Self-Administered Surveys: Questionnaires sent via mail for respondents to complete and return on their own. Cost-effective when a complete sampling frame of mailing addresses exists; Dillman's Tailored Design Method (TDM) emphasizes multi-contact strategies and careful design to boost response rates; limitations include literacy requirements and potential skip-pattern and response problems.
Group Self-Administered Surveys: A practical and cost-effective method where surveys are completed by a group of respondents simultaneously, often in a classroom or meeting setting. Administered in settings like schools/worksites; advantages include efficiency and supervision; disadvantages include clustering effects and potential response bias due to group dynamics; YRBSS as a prominent example.
Web/Internet Surveys: Surveys administered online, typically via email invitations or website links. Rapid, low-cost, flexible; best when you have an established email list or use opt-in panels; can incorporate sophisticated skip patterns and multimedia; drawbacks include spam fatigue, panel attrition, duplicate responses, and tech issues across devices.
Establishment Surveys: Surveys focused on collecting data from companies and/or organizations rather than individuals. Involve multiple potential respondents within an organization; mixed-mode approaches may be used to reach diverse respondents; gatekeepers can make contact challenging.
Panel or longitudinal surveys: track the same respondents over time; challenges include attrition and maintaining contact information; use incentives and careful tracking to reduce loss to follow-up.
Mixing modes: mixed-mode surveys can improve coverage but introduce mode effects (responses may differ by mode), sample frame incompatibilities, and interpretation challenges.
Crafting a questionnaire: start with purpose/constructs; consider one or two essential questions; use mock tables to design questions that will yield needed analyses; replicate questions from established surveys where possible to enable comparability.
Opening questions matter: start with simple, relevant questions to engage respondents; sensitive or complex questions should be placed later; avoid early hard questions (e.g., income) to prevent dropout.
Closed-ended vs open-ended questions: strike a balance; open-ended questions provide rich data but are time-consuming to analyze; including too many open-ended questions often leads to discarded data.
19 principles for writing survey questions (Dillman, 2007):
Use simple words; avoid jargon.
Be concise; use complete sentences.
Avoid vague quantifiers; use precise response options.
Avoid overly detailed or unrealistic recall requests; provide bounded ranges.
Ensure equal numbers of positive/negative response options when using scales.
Distinguish undecided/neutral options; consider explicit neutral categories.
Present balanced wording to avoid bias in response options.
State both sides of attitude scales in the stem (e.g., “satisfied or dissatisfied” rather than “satisfied” only).
Use mutually exclusive response categories; avoid overlapping categories.
Use cognitive design to aid recall (priming).
Give appropriate time referents (e.g., “in the last 7 days”).
Ensure the question is technically precise (avoid ambiguity like “Do you own your home?” which might omit mortgages).
Use standardized questions where possible to enable comparisons.
Avoid yes/yes double negatives; avoid double-barreled questions.
Avoid asking respondents to perform unnecessary calculations; do the calculations during analysis when possible.
Ensure response categories cover realistic distributions (update ranges to reflect current distributions).
Consider the sequence of questions to minimize fatigue and bias.
Pretest to identify problems and revise accordingly.
Physical and graphical design: layout, instructions, navigation aids, shading, and cross-device consistency for web surveys; pretesting recommended.
Ethics of survey research:
Informed Consent: Ensuring participants understand the survey's purpose, risks, and benefits before agreeing to participate. A fundamental principle but implemented differently across modes; tacit consent often occurs in online surveys.
Pushing for High Response Rate: Balancing the need for sufficient data with avoiding undue pressure on potential respondents. High response rates are desirable but should not involve coercion or deception.
Overburdening Respondents: Designing surveys that are not excessively long or demanding, respecting respondents' time and effort.
Protecting Privacy and Confidentiality: Safeguarding personal information and ensuring responses cannot be linked back to individual participants. Anonymize data; use codes to separate identifiers; manage geocoding to avoid precise location disclosure.
Surveying Minors and Other Vulnerable Populations: Implementing additional safeguards and obtaining appropriate permissions when surveying individuals who may be more susceptible to coercion or harm. Special considerations for surveying vulnerable populations (children, prisoners, cognitively impaired, etc.).
Making Survey Data Available for Public Use: Considering the ethical implications of sharing data, including anonymization and data security. Public-use data require restrictions to limit identification; data-sharing must balance research value with privacy.
Geocoding and linking data raise privacy concerns; precise locations can enable identification; use aggregated location data when possible.
Other primary data sources:
Trained observation (quantitative coding of observed conditions; e.g., Sampson & Raudenbush’s Chicago neighborhoods project; street cleanliness scorecards in NYC using photographic standards).
Use of handheld devices to capture qualitative and quantitative data (e.g., ComNET project).
Scientific instruments (biometric readings, lab tests, brain imaging like fMRI/PET/EEG).
Data extraction algorithms and web crawling for extracting data from the Internet; Big Data methods (trawling) expand data sources beyond traditional surveys.
Conclusion: Surveys are central to primary data collection but are not the only method; other primary data sources complement surveys in policy research; the next steps focus on statistical analysis and interpretation of collected data.
Boxes 7.2, 7.3, 7.4, and 7.5 (Key Guidance)
BOX 7.2: Opening questions should be engaging and directly related to the topic; avoid early burdensome questions; examples compare two opening designs.
BOX 7.3: Example of an open-ended questionnaire that can be time-consuming to analyze; use sparingly and plan for qualitative analysis if used.
BOX 7.4: Critical questions to ask about surveys and other primary data (scope, mode, who conducted the survey, question wording, availability of questions, etc.).
BOX 7.5: Practical tips for doing your own survey (avoid duplicating existing data, mock tables, replicate standardized questions, write a purpose statement, pretest, etc.).
Exercise and Study Site Reminders
Exercises in the chapter encourage identifying data sources, choosing survey modes, designing questionnaires, and exploring online data tools.
The study site (www.sagepub.com/remler2e) offers a self-quiz, eFlashcards, and additional resources to reinforce learning.
Summary of Key Themes
Secondary data are foundational for policy research due to accessibility and cost advantages, but require careful attention to data provenance, quality, and ethics.
Big Data expands the potential to link diverse data sources (administrative, microdata, qualitative data) to generate richer insights, while also raising privacy concerns.
Quantitative data come in multiple forms and time dimensions; understanding units of observation vs. units of analysis and the time structure (cross-sectional, panel, time series) is critical for proper analysis.
Administrative data require substantial preparation (cleaning, formatting, de-identification) but remain a powerful source due to breadth and cost savings.
Public-use microdata and data archives democratize access to large, high-quality data sets, but come with ethical constraints and documentation requirements.
Primary data collection (surveys and other methods) remains essential when data do not exist or are not fit for purpose; choosing the right mode, designing rigorous questionnaires, and attending to ethics are crucial for valid results.
The field increasingly relies on mixed-methods and mixed-mode approaches to cover diverse populations, while being mindful of mode effects and sampling frame compatibility.
Key Formulas and Notation (LaTeX)
Rate example (per 100,000 inhabitants):
\text{rate} = \frac{\text{number of events}}{\text{population}} \times 10^5
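A minimal sketch of the rate formula above in Python, using hypothetical counts (the city and event numbers are made up for illustration):

```python
# Events per 100,000 inhabitants: rate = (events / population) * 10^5.
def rate_per_100k(events: int, population: int) -> float:
    """Convert a raw event count to a rate per 100,000 inhabitants."""
    return events * 100_000 / population

# Hypothetical example: 1,250 events in a population of 2,500,000.
print(rate_per_100k(1_250, 2_500_000))  # 50.0
```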
Poverty data notes (Table 6.2): units are in thousands (1,000s); e.g., United States: 33,311 (2000) to 42,868 (2009) below poverty (in thousands).
Data types shorthand:
Microdata: individual-level observations
Aggregated data: summarized by larger units (e.g., state-level averages)
Panel data: repeated measures on the same units over time
Time-series: measurements over time for a single or few units
Generalizability and External Validity
Generalizability: the extent to which findings from a study can be projected to a larger population, time period, or different contexts.
External validity is another term for generalizability (Shadish, Cook, & Campbell, 2002).
In practice, researchers care about what the study implies for the broader world, not just the specific sample.
Katrina example: CBS poll of 725 adults was used to infer thinking of a few hundred million people; the value lies in broader implications, not the exact individuals polled.
Population, Sampling Frame, and Generalizability
Population of interest: the entire group the study aims to learn about (e.g., all U.S. adults in the Katrina poll).
Sampling frame: a concrete list or operational representation from which the sample is drawn (e.g., voter lists, phone numbers, organizational rosters).
Parameters vs statistics: the study aims to learn about population characteristics (parameters); the sample yields statistics that estimate these parameters.
The closer the sample’s results are to true population parameters, the more generalizable the results.
Broader population, geography, time, and groups increase generalizability; studying many places and times tends to improve external validity.
Random (probability) sampling tends to be more generalizable than nonrandom sampling; small random samples often beat large nonprobability samples for generalizability.
Examples: Katrina poll (random sample) vs Red Cross shelter study (convenience/nonrandom sampling) – both informative but with different generalizability limits.
Question: which features of a sample affect generalizability? This is a prelude to later sections.
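A small simulation can illustrate the claim that a modest random sample often beats a much larger nonprobability sample; the population size, trait prevalence, and biased frame below are all hypothetical assumptions, not data from the chapter:

```python
# Sketch: small simple random sample vs. large biased convenience sample,
# both estimating a population proportion of 0.30 (hypothetical population).
import random

random.seed(42)

# Hypothetical population of 100,000 people; 30% have the trait of interest.
population = [1] * 30_000 + [0] * 70_000
random.shuffle(population)

# Small simple random sample (n = 500) drawn from the whole population.
srs = random.sample(population, 500)
srs_est = sum(srs) / len(srs)

# Large convenience sample (n = 20,000) drawn from a biased frame in which
# trait holders are heavily overrepresented (e.g., an opt-in panel).
biased_frame = [p for p in population if p == 1][:15_000] + \
               [p for p in population if p == 0][:5_000]
conv_est = sum(biased_frame) / len(biased_frame)

print("True proportion: 0.300")
print(f"Small random sample (n=500):         {srs_est:.3f}")   # close to 0.300
print(f"Large convenience sample (n=20,000): {conv_est:.3f}")  # badly biased
```

The random sample's error is governed only by sampling variability, while the convenience sample's error is dominated by frame bias that no increase in n can fix.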
Are Experiments More Generalizable?
Many biological, psychological, or economic processes are fairly universal, making findings generalizable even from small, unrepresentative samples (especially in controlled experiments).
Experiments are used to determine causal relationships ("what if" questions).
Examples:
Drug/medical trials often use clinical volunteers (generalizability can be limited).
Psychological experiments often use undergraduates and still yield generalizable laws of perception, cognition, and behavior.
Experimental economists study altruism and risk aversion with small samples (e.g., ultimatum game, prisoners’ dilemma).
Caveat: generalizability is not guaranteed; experiments can be criticized for homogeneous or idiosyncratic samples and limited external validity.
Replication and Meta-Analysis
Replication: repeating a study with different samples, places, times, or designs to test robustness and generalizability.
Replication enhances generalizability of findings from small or nonrandom samples.
Meta-analysis: pooling multiple studies to produce a larger, more generalizable estimate of a treatment effect or relationship.
Formal definitions: meta-analysis combines separate effects into a single, generalizable estimate.
Examples: air pollution and daily mortality across many cities; second-generation antipsychotics efficacy across 124 experiments (Davis, Chen, & Glick, 2003).
Applications: health, education, social work, criminal justice, job training, etc.
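As a sketch of how meta-analysis pools separate effects into a single estimate, the following uses fixed-effect inverse-variance weighting, a standard convention (the study effects and standard errors are made up for illustration):

```python
# Fixed-effect meta-analysis: weight each study by 1/SE^2 and average.
import math

# (effect estimate, standard error) for each hypothetical study
studies = [(0.30, 0.10), (0.25, 0.08), (0.40, 0.15), (0.28, 0.06)]

weights = [1 / se**2 for _, se in studies]          # inverse-variance weights
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))             # SE of the pooled estimate

print(f"Pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
```

Note that the pooled standard error is smaller than any single study's, which is the sense in which pooling yields a "larger, more generalizable" estimate.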
Relationships and Generalizability (Health and Happiness in Moldova)
Relationships among variables (not just descriptive percentages) tend to generalize better.
World Values Survey data: Moldova (small Eastern European country) has means and correlations similar to global patterns in health and happiness, despite Moldova’s low GDP per capita.
Global correlation (health vs happiness): \text{corr(Health, Happiness)} \approx 0.333 across all nations; Moldova sample (n \approx 974): \text{corr} \approx 0.333 (similar to the global pattern); a larger sample (n \approx 2{,}021): \text{corr} \approx 0.359.
Key survey-inference formulas:
Standard error of a proportion: SE = \sqrt{\frac{p(1-p)}{n}}
Approximate 95% confidence interval: \text{CI} = \hat{p} \pm 2 \times SE; example: 0.055 \pm 0.022 \rightarrow (0.033, 0.077)
Typical national poll margin of error: about \pm 3.1 percentage points
Required sample size for margin of error E: n \approx \frac{Z^2 \cdot p(1-p)}{E^2} = \frac{Z^2 \cdot 0.25}{E^2} (using the conservative p = 0.5)
Effective sample size n_{\text{eff}}: adjusts n for design effects such as clustering and weighting.
\text{Response rate} = \text{Contact rate} \times \text{Cooperation rate}
Measurement model: [\text{construct}] \rightarrow [\text{measure}] \leftarrow [\text{error}]
True-score decomposition: X_i = T_i + B_i + N_i, where T_i is the true score, B_i systematic bias, and N_i random noise; an unbiased, reliable measure has B_i \approx 0 and N_i \approx 0.
Cronbach’s alpha: values of about 0.70 are often considered acceptable, but the threshold depends on use (higher stakes require higher reliability).
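The standard error, confidence interval, sample size, and response rate formulas above can be sketched in Python; the inputs (p̂ = 0.055, n = 400, E = 0.031) are the illustrative values used in the notes:

```python
# Core survey-inference calculations for a sample proportion.
import math

def se_proportion(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def ci_approx(p: float, n: int) -> tuple[float, float]:
    """Approximate 95% CI: p-hat +/- 2*SE."""
    se = se_proportion(p, n)
    return (p - 2 * se, p + 2 * se)

def sample_size(E: float, z: float = 1.96, p: float = 0.5) -> float:
    """n ~ Z^2 * p(1-p) / E^2; p = 0.5 is the conservative worst case."""
    return z**2 * p * (1 - p) / E**2

def response_rate(contact: float, cooperation: float) -> float:
    """Response rate = contact rate * cooperation rate."""
    return contact * cooperation

print(ci_approx(0.055, 400))   # roughly (0.033, 0.077), as in the notes
print(sample_size(0.031))      # ~1,000 respondents for a +/-3.1-point margin
```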
Parallel Forms Reliability: Use different but equivalent forms of a test to assess consistency across versions; important when tests must change over time (e.g., yearly standardized tests). Administer the equivalent forms to the same respondents and measure the agreement between the two sets of scores; high agreement indicates both versions measure the same construct.
Reliability is necessary but not sufficient for validity; a reliable measure can still be biased or fail to capture the intended construct.
Figure 4.4 and related text illustrate reliability concepts (good vs. poor reliability); Figure 4.5 shows how increasing random error affects averages and confidence intervals; Figure 4.6 shows how reliability impacts relationships.
Reliability in qualitative research: validity and reliability concepts apply; intercoder reliability; code-recode reliability; qualitative validity concerns (does interpretation capture participants’ experiences consistently?).
Validity vs reliability: a practical contrast
A measure can be valid but unreliable, or reliable but invalid, or both, or neither (bull’s-eye analogy in Figure 4.7):
Reliable but not valid: shots clustered but off-target
Valid but not reliable: centered on target on average but widely dispersed
Both reliable and valid: clustered tightly around target
Neither: dispersed and off-target
Implications for measurement in practice:
For job performance measures, self-reports may be reliable (consistent) but not valid (biased upward);
Supervisor assessments may be more valid but face reliability concerns (inter-rater differences).
In qualitative research, validity and reliability translate to credible, trustworthy interpretations; intercoder reliability and code validity are key concerns.
The chapter emphasizes that validity and reliability are context-dependent and different measures may be valid for different purposes.
Levels of measurement, units of analysis, and data types
Levels of measurement (two broad types):
Quantitative Variables: Numbers that refer to actual quantities (e.g., age, income, hours worked, weight). Unit of measurement matters (e.g., dollars, kilograms).
Interval: Data with ordered categories where intervals between categories are equal, but there is no true zero point (e.g., temperature in Celsius or Fahrenheit).
Ratio: Data with ordered categories where intervals between categories are equal, and there is a true zero point, allowing for meaningful ratios (e.g., height, weight, income).
Categorical Variables: Numbers refer to categories; can be nominal or ordinal.
Nominal: Data that are purely categorical, without any intrinsic order or ranking (e.g., gender, race, religion).
Ordinal: Data with ordered categories, but the intervals between categories are not necessarily equal or meaningful (e.g., education level: high school, bachelor's, master's).
Important distinctions:
Level of measurement (nominal, ordinal, interval, ratio): affects allowable statistics.
Unit of measurement: the unit (e.g., dollars, kilograms) that defines the quantitative variable.
Unit of analysis: the object described by the measure (people, households, neighborhoods, organizations).
Box 4.9 clarifies unit vs level vs unit of analysis distinctions.
Examples:
Household income coded into 12 categories in ESS; although labeled in euros, this is a categorical variable (not a precise quantity) unless midpoints are used.
Income could be treated as a quantitative variable if midpoints are assigned to categories (midpoint approximation) or if a multi-item scale sums to a continuous score.
Turning categorical variables into quantitative measures
Dummy variables (indicator variables): two-value categories (0/1) used to represent presence/absence (e.g., Employed: 0=no, 1=yes).
Using dummy variables for multi-category nominal variables: create a separate dummy per category (e.g., White, Black, Hispanic, Asian, Other).
Midpoint approximation: for ordinal measures with categories, use midpoints of ranges to approximate a quantitative score (e.g., income categories become approximate euros).
Multi-item scales: add up ordinal indicators to form a composite score; often treated as quantitative for analysis.
Endpoint scales and thermometers: 1–7 or 1–10 scales with endpoints anchored; some argue equal-interval interpretation across the scale; feeling thermometers (0–100) used for attitudes toward groups or leaders.
Figure 4.8 example: feeling thermometer illustrating usage of scales to quantify attitudes.
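The dummy-variable and midpoint-approximation steps above can be sketched as follows (the category labels and income bands are hypothetical, not taken from the ESS):

```python
# Turning categorical variables into quantitative measures:
# one 0/1 dummy per nominal category, and midpoints for ordinal income bands.
respondents = [
    {"race": "White",    "income_band": "20k-40k"},
    {"race": "Black",    "income_band": "40k-60k"},
    {"race": "Hispanic", "income_band": "under 20k"},
]

categories = ["White", "Black", "Hispanic", "Asian", "Other"]
midpoints = {"under 20k": 10_000, "20k-40k": 30_000, "40k-60k": 50_000}

for r in respondents:
    # Indicator (dummy) variables: 1 if the respondent is in the category.
    r.update({f"race_{c}": int(r["race"] == c) for c in categories})
    # Midpoint approximation: ordinal band -> approximate quantitative income.
    r["income_approx"] = midpoints[r["income_band"]]

print(respondents[0]["race_White"], respondents[0]["income_approx"])  # 1 30000
```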
Levels of measurement and unit of analysis in practice
Unit of analysis vs level of measurement interact: a variable can be categorical at the individual level but become quantitative when aggregated to a geographic level (e.g., poverty rate by census tract).
Poverty example: individual poverty is a binary (categorical) variable; poverty rate by tract or county is a continuous quantitative measure.
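A minimal sketch of the poverty example, using made-up tract data: an individual-level binary (poor: 0/1) becomes a continuous rate when aggregated to the tract level:

```python
# Aggregating a categorical individual-level variable to a quantitative
# geographic-level variable (poverty rate by tract).
from collections import defaultdict

# (tract, poor) pairs for hypothetical individuals
people = [("tract_A", 1), ("tract_A", 0), ("tract_A", 0),
          ("tract_B", 1), ("tract_B", 1), ("tract_B", 0), ("tract_B", 0)]

counts = defaultdict(lambda: [0, 0])   # tract -> [poor count, total count]
for tract, poor in people:
    counts[tract][0] += poor
    counts[tract][1] += 1

rates = {t: poor / total for t, (poor, total) in counts.items()}
print(rates)  # tract_A: about 0.33, tract_B: 0.5
```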
The measurement in the real world: trade-offs and choices
Measurement is rarely perfect; practitioners balance validity, reliability, cost, and feasibility.
Costs and practicality:
Objective measures (clinical exams, bank records, hair analyses) can be more valid but expensive.
Longer questionnaires increase reliability but raise respondent burden and reduce response rates; shorter measures reduce burden but may sacrifice reliability.
Reliability tends to improve with more indicators; multi-item scales generally provide more reliable measurements, but longer instruments increase cost and respondent burden.
Validity-reliability trade-off: lengthy, nuanced assessments (e.g., essays) may be valid in capturing complex constructs but less reliable due to interrater variability and scoring concerns; shorter tests improve reliability but may oversimplify constructs.
Established measures often preferred for reliability/validity; inventing new measures risks lower comparability across time and studies.
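The point that reliability improves with more indicators is commonly formalized with the Spearman-Brown prophecy formula; the formula is standard psychometrics rather than something stated explicitly in these notes, so treat this as an illustrative sketch:

```python
# Spearman-Brown prophecy formula: projected reliability of a scale
# lengthened by factor k, given current reliability r.
def spearman_brown(r: float, k: float) -> float:
    """Return k*r / (1 + (k-1)*r), the reliability of the lengthened scale."""
    return k * r / (1 + (k - 1) * r)

# A single item with reliability 0.50, extended to a 4-item scale:
print(round(spearman_brown(0.50, 4), 3))  # 0.8
```

The gains are real but diminishing, which is exactly the trade-off the notes describe: each added indicator buys less reliability while adding cost and respondent burden.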
High-stakes measurement can induce behavior changes (gaming) that threaten validity; examples include test prep and coaching; public sector examples include fraud in performance-based pay schemes.
Multi-dimensional measures (dashboards) vs single headline indicators: EU and UN often use multiple dimensions; some argue for a single summary indicator for policy clarity; both approaches have trade-offs regarding comprehensiveness and comparability.
Measurement aggregation can obscure important differences across dimensions; a dashboard approach may be preferable for a fuller picture; aggregation weights are inherently arbitrary.
The chapter concludes with a call to thoughtful measurement: define concepts clearly, justify instrumentation and protocols, assess validity and reliability, and ask critical questions about measures.
Critical questions and practical guidance (Box summaries)
Box 4.10: Critical questions to ask about measurement
What is the purpose and origin of the measure?
What is the conceptual definition and its dimensions?
How is the measure operationalized (instruments, personnel, protocols)? Is it a single indicator or a multi-item scale? Is it a proxy or proxy reporting?
How valid is the measure (face, content, criterion-related; specific forms like concurrent, predictive, convergent, discriminant, nomological)?
How reliable is the measure (evidence and strength of reliability tests)?
What is the level of measurement?
Box 4.11: Tips on doing your own research: measurement planning steps (develop conceptual definitions, search for established measures, plan operationalization, decide on single item vs multi-item scales, consider proxies, test validity/reliability, review existing literature).
Key terms (glossary-style references)
Bias, measurement bias, random measurement error (noise)
Construct, latent vs manifest construct
Conceptualization, operationalization
Instrument, protocol, proxy, proxy respondent, proxy reporting
Indicator, composite measure, scale, index
Validity (face, content, criterion-related, convergent, discriminant, nomological)
Reliability (test-retest, interrater, split-half, internal consistency, Cronbach’s alpha, parallel forms)
Item Response Theory (IRT), computer-adaptive testing (CAT)
Levels of measurement (nominal, ordinal, interval, ratio)
Unit of measurement, unit of analysis
Dimensionality and multi-item scales (SF-36 dimensions; Rosenberg self-esteem scale)
Qualitative validity and reliability concepts (code validity, intercoder reliability)
Technical concepts: constructs, indicators, proxies, and the measurement model
Connections to theory, policy, and practice
Measurement translates abstract policy concepts (like poverty) into observable data, enabling evaluation, accountability, and resource allocation.
The debate over poverty measures illustrates how theory, politics, and data availability shape measurement choices and policy implications.
Logic models and theory-driven measurement connect theoretical propositions to empirical tests via carefully defined constructs and indicators.
The balance between validity and reliability, and the choice between single-item measures vs. multi-item scales, reflect practical trade-offs in policy research and program evaluation.
The chapter emphasizes that measurements are always context-dependent: a measure can be valid for one purpose and not for another; the same measure can have different validity in different settings or times.
Real-world relevance and ethical considerations
Measurement choices affect policy conclusions, program funding, and public accountability.
Cost, respondent burden, and ethical concerns