The Challenge of Descriptive Inference and Experiments

Chapter 5

Page 116

  • The Challenge of Descriptive Inference introduces biases in descriptive work and emphasizes the need to understand and quantify error in data.

  • Non-response bias is a cross-national data analogue to missing data: data are often missing in low-income countries due to inadequate data collection capacity, introducing systematic biases in cross-national comparisons.

  • Key idea: Researchers must understand the nature of error (measurement vs sampling; systematic vs random) to make valid inferences. Tools exist for quantitative data to estimate error and quantify uncertainty; qualitative data can benefit from the logic of these quantitative methods.

  • Random sampling error depends on two factors: (1) the size of the sample and (2) the amount of variation in the data. A small random sample may miss population variation; a larger sample increases the probability of reflecting the population.

  • The standard error (SE) captures random sampling error. It is used to quantify uncertainty in estimates.

  • Example intuition: Income variation in a city can be huge; a sample of 300 may miss population variance, while 1,000 may better capture it; if income variance is smaller across a city, 300 might suffice.

Page 117

  • Most students encounter the standard error via the margin of error (MoE) used in polls.

  • Example: a UPI/CVoter poll before the 2012 presidential election found Obama at 49% vs Romney at 48% among 1,000 likely voters, with a reported MoE of 3.5%. Interpretation: the population parameter (the share of Obama-leaning voters) lies within 49% ± 3.5% (i.e., between 45.5% and 52.5%) at the poll's stated confidence level.

  • A confidence interval is the range believed to contain the true population parameter at a stated level of confidence; we can never be 100% confident.

  • The standard error and margin of error relationship:

    • For sample proportions, the standard error is \text{std. error} = \sqrt{\frac{p(1-p)}{n}}, where p is the sample proportion and n is the sample size.

    • For sample means, the standard error is \text{std. error} = \frac{s}{\sqrt{n}}, where s is the sample standard deviation and n is the sample size.

  • Central Limit Theorem intuition: about 95% of the time, an estimate from a random sample falls within roughly two standard errors of the population value. This underpins the usual rule of thumb MoE ≈ 2 × SE for large samples.
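  • The proportion SE and the MoE ≈ 2 × SE rule of thumb can be sketched in a few lines of Python; the z-value 1.96 corresponds to a 95% confidence level, and the 49%/1,000-voter figures are the poll example from the text:

```python
import math

def se_proportion(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """MoE = z * SE; z is about 1.96 for a 95% confidence level."""
    return z * se_proportion(p, n)

# Obama at 49% among 1,000 likely voters (the poll example in the text)
se = se_proportion(0.49, 1000)     # about 0.0158, i.e. 1.6 percentage points
moe = margin_of_error(0.49, 1000)  # about 0.031, i.e. 3.1 percentage points
```

  Note the computed 3.1-point MoE sits a bit below the 3.5% the poll reported; published MoEs often reflect design effects or conservative rounding that the simple formula ignores.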

  • Practical implications:

    • Increasing sample size reduces SE and MoE, but the reduction is not uniform across all ranges; early increases in sample size yield larger reductions in SE than later increments (diminishing returns).

    • Example: A Washington Post–ABC poll with 2,345 respondents had an MoE of about 2%, versus roughly 3% for a comparable 1,000-respondent poll; the reduction is not proportional to the increase in n because the SE shrinks with the square root of the sample size.
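    The diminishing-returns point can be checked directly: because SE shrinks with √n, quadrupling the sample only halves the margin of error. A minimal sketch using the worst-case proportion p = 0.5 (the sample sizes besides 1,000 and 2,345 are illustrative):

```python
import math

def moe_pct(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Worst-case (p = 0.5) 95% margin of error, in percentage points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

for n in (250, 500, 1000, 2345, 5000):
    print(f"n = {n:5d}: MoE ~ {moe_pct(n):.1f} points")
# Going from 250 to 1,000 respondents cuts the MoE by about 3 points;
# going from 1,000 to 2,345 (the Post-ABC poll) gains only about 1 point more.
```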

  • When reporting uncertainty, researchers should not rely solely on the sample statistic; reporting SE and MoE (or CIs) is essential to convey precision.

Page 118

  • Confidence intervals and uncertainty: 90% confidence intervals are discussed for cross-country data; CIs reflect random sampling error.

  • The CPI example (Transparency International) uses a 178-country dataset and plots 90% CIs around each country’s CPI score. For Romania, a point estimate of 3.7 on a 1–10 scale has a 90% CI of approximately 3.3–4.2; the United States sits in a higher interval (roughly 6.5–7.7). These intervals illustrate the uncertainty around country scores.

  • Interpreting confidence intervals:

    • If the U.S. interval is entirely above Romania’s interval, we have statistical evidence (at that confidence level) that the U.S. CPI score is higher than Romania’s.

    • If intervals overlap (e.g., Italy vs Romania), we cannot be confident there is a difference.

    • The term "statistically significant difference" refers to differences larger than the error in the data; it does not necessarily imply a large substantive difference.
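    The overlap logic above can be made concrete. A minimal sketch in which the point estimates and SEs are hypothetical values back-solved from the intervals reported in the text (about 3.3–4.2 for Romania, 6.5–7.7 for the U.S.):

```python
def confidence_interval(estimate: float, se: float, z: float = 1.645):
    """90% CI: estimate +/- 1.645 * SE (use z = 1.96 for a 95% CI)."""
    return (estimate - z * se, estimate + z * se)

def intervals_overlap(a, b) -> bool:
    """True when the two (low, high) intervals share any points."""
    return a[0] <= b[1] and b[0] <= a[1]

romania = confidence_interval(3.7, 0.27)  # roughly (3.26, 4.14)
us = confidence_interval(7.1, 0.36)       # roughly (6.51, 7.69)

# Non-overlapping intervals: evidence, at this confidence level, of a real difference
print(intervals_overlap(romania, us))
```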

  • Another point: the 90% CI framing is common in this text, but the same logic applies with other confidence levels (e.g., 95%).

  • Reporting uncertainty is essential; it helps avoid over-claiming precision in reported numbers.

Page 119

  • The CPI example (Figure 4.3) shows the distribution of estimates across 178 countries with 90% CIs.

  • It is possible for confidence intervals to overlap heavily across many countries, underscoring the importance of reporting uncertainty rather than focusing solely on point estimates.

  • The authors discuss the meaning and limits of statistical significance and emphasize that ordinary language use of the word “significant” may mislead readers about the magnitude of differences.

Page 120

  • There is debate about what the standard error precisely accounts for. Traditionally, SE is a measure of random sampling error. Some scholars argue that SE also incorporates some random measurement error (i.e., SE is an estimate of all random error, including sampling and measurement). In practice, SEs and margins of error are calculated as if they measure random sampling error, and they do not inherently account for systematic error.

  • Systematic error (both sampling and measurement) is more difficult to handle; researchers may attempt to correct for it when population parameters are known or estimable.

  • Weighting data to correct for systematic sampling differences:

    • Weighting adjusts the influence of certain respondent groups so that the sample more closely resembles the population (e.g., if young people 18–30 are underrepresented in a sample, each young respondent can be given more weight).

    • Weighting can correct non-response bias and sampling biases when population parameters are known for weighting variables.

  • Practical examples of weighting for population accuracy:

    • Pew Research Center weights by household size, combined landline and cell phone usage, age, gender, education, race/ethnicity, and population density.
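  The weighting logic described above reduces to multiplying each respondent by population share ÷ sample share for their group. A minimal sketch with made-up numbers (the 18–30 group shares and the 0/1 responses are illustrative, not from the text):

```python
# Each record is (age_group, 0/1 response); young adults are underrepresented:
# they are 1/8 of this sample but assumed to be 25% of the population.
sample = [("18-30", 1),
          ("31+", 0), ("31+", 1), ("31+", 0), ("31+", 0),
          ("31+", 1), ("31+", 0), ("31+", 0)]

pop_share = {"18-30": 0.25, "31+": 0.75}  # known population parameters
n = len(sample)
samp_share = {g: sum(1 for grp, _ in sample if grp == g) / n for g in pop_share}
weight = {g: pop_share[g] / samp_share[g] for g in pop_share}  # 18-30 -> 2.0

raw_mean = sum(r for _, r in sample) / n                       # 0.375
weighted_mean = (sum(weight[g] * r for g, r in sample)
                 / sum(weight[g] for g, _ in sample))          # about 0.464
```

  Up-weighting the lone young respondent moves the estimate toward what a properly proportioned sample would have produced.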

  • In addition to sampling error, measurement error is a concern. Weighting can sometimes correct for measurement error if population parameters are known; otherwise, transparency about potential errors is essential.

  • The U.S. DHHS National Survey on Drug Use and Health (65,000 respondents) illustrates measurement error: self-reported marijuana use under-reported usage by about 20% relative to biological tests. Kilmer and Pacula used this to adjust California marijuana consumption estimates for policy analysis.
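  The Kilmer and Pacula style of adjustment is simple arithmetic once a validation study pins down the bias. A hedged sketch, assuming "under-reported by about 20%" means self-reports capture roughly 80% of true use (the user count below is hypothetical):

```python
# Hypothetical survey-based estimate, scaled up to the validated
# (biological-test) level under the assumed 80% capture rate.
reported_users = 1_000_000
capture_rate = 0.80
adjusted_users = reported_users / capture_rate
print(adjusted_users)  # 1250000.0
```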

  • Summary: There are tools to mitigate sampling and measurement error (better measurement, random sampling, tests of significance, data weighting), but problems can often be minimized rather than eliminated. Researchers should report standard errors and confidence intervals and discuss potential biases and limitations.

Page 121

  • The role of weighting continues: data weighting helps overcome non-response bias and sampling biases. If population parameters are known for weighting variables, weighting can improve accuracy; otherwise, researchers should be transparent about potential systematic error.

  • The text notes that qualitative data face similar issues with measurement and sampling error, though procedures are less standardized.

  • The Kilmer & Pacula example is revisited to illustrate adjusting for systematic measurement error using known population parameters.

  • The Pew Center notes that weighting by household characteristics and population features helps overcome non-response bias in survey data.

Page 122

  • Qualitative data and triangulation: Qualitative researchers combine multiple information sources (statements, observations, documents) to make inferences. This triangulation helps address limitations of any single data source and provides a more robust inference.

  • Transparency about the method, explicit acknowledgement of potential errors (random, systematic, measurement, sampling), and careful wording about uncertainty are essential even in qualitative work.

Page 123

  • Making descriptive inferences depends on the data type.

  • Nominal-level data: categories without an inherent order (e.g., regime classifications). Useful descriptive statistics include frequencies, percentages, valid percentages, and the mode; visualized via frequency tables or bar charts.

  • The example table (Table 4.2) classifies regimes into six types: Parliamentary democracy, Mixed democracy, Presidential democracy, Civilian dictatorship, Military dictatorship, Monarchic dictatorship, plus Missing data. The data show frequencies and percentages (e.g., Parliamentary democracy 56, 29.3% etc.).

  • The nominal data example helps illustrate how counts and proportions convey the distribution across categories.

Page 124

  • Ordinal-level data: categories with a meaningful order but not necessarily equal spacing between categories (e.g., Freedom House democracy status: Free, Partly free, Not free).

  • The Freedom House example shows how ordinal data can be summarized with frequencies, percentages, mode, and median (e.g., median = partly free; mode = free). It also notes limitations: broad categories may mask subtle differences (e.g., Argentina vs. Mexico both partly free but differ in degree).
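  Mode and median for ordinal data like the Freedom House statuses can be computed by sorting on the category order rather than on numeric values. A minimal sketch (the counts are illustrative, not the text's exact figures):

```python
from statistics import mode

order = ["Not free", "Partly free", "Free"]  # meaningful order, unequal spacing
ratings = ["Free"] * 87 + ["Partly free"] * 60 + ["Not free"] * 47

def ordinal_median(values, order):
    """Middle observation once values are sorted by their category rank."""
    ranked = sorted(values, key=order.index)
    return ranked[len(ranked) // 2]

print(mode(ratings))                   # 'Free' (most common category)
print(ordinal_median(ratings, order))  # 'Partly free' (middle observation)
```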

  • Visual tools include bar charts; the scale and axis labeling matter to avoid misinterpretation.

  • The text cautions against misrepresenting data via misleading scales or by presenting raw counts instead of percentages when sample size varies.

Page 125

  • The presentation of ordinal data with graphs: two bar charts for the same data can be misleading if scales differ or if counts are shown instead of percentages.

  • Figure 4.7 (poor design) versus Figure 4.8 (good design) illustrate how proper labeling, consistent scales, and uncertainty (confidence intervals) improve interpretability.

  • Confidence intervals around proportions can be displayed with error bars (e.g., 90% or 95% CI) to reflect sampling uncertainty.

  • Ordinal data across countries (e.g., Freedom House) demonstrate how confidence intervals help readers assess whether observed differences are statistically meaningful.

Page 126

  • The third level of measurement is interval level data, which is ordered and communicates precise differences.

  • An example: Vanhanen’s polyarchy index, which combines competition (percent of seats held by the largest party) and participation (electoral turnout). This yields precise numbers (e.g., Mexico 20.78, Argentina 26.14), a difference of 5.36.

  • GDP per capita figures also illustrate interval data with precise numerical differences (e.g., Mexico $10,047 vs Argentina $10,942; difference $895).

  • Interval data can be continuous or truncated (e.g., literacy rate cannot exceed 100%; age in years is count data with integer values).

  • For interval data, descriptive techniques include measures of central tendency and dispersion (mean, median, mode; standard deviation) rather than just frequencies and percentages.

  • The standard deviation is defined as \text{sd} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}, i.e., the square root of the average squared deviation from the mean.

  • Additional distribution descriptors include skewness (lopsidedness) and the shape of the distribution (normal vs skewed).
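  The central-tendency and dispersion toolkit for interval data, and the skewness point above, can be illustrated with a small right-skewed sample (the income figures are made up):

```python
from statistics import mean, median, pstdev

incomes = [22, 25, 27, 30, 31, 33, 35, 38, 120, 250]  # in $1,000s; right-skewed

print(mean(incomes))    # 61.1 -- pulled toward the long right tail
print(median(incomes))  # 32.0 -- barely moved by the two extreme values
print(pstdev(incomes))  # population sd, matching the 1/n definition in the text
```

  The gap between mean and median is itself a quick diagnostic of skew.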

Page 127

  • Skewed distributions (e.g., income) tend to have a mean that is pulled toward the tail; the median often better represents the central tendency for skewed data.

  • The CPI distribution across countries (Figure 4.10) is skewed, with many high-corruption countries and fewer low-corruption ones. In such distributions, the median (3.3) is less affected by extreme values than the mean (4.0).

  • When data are interval-level and skewed, researchers often focus on the median rather than the mean for descriptive summaries.

  • There is a brief discussion of a common pitfall: reporting the mean when the median would be more informative for skewed data.

  • Interval data allows a rich set of techniques beyond central tendency (e.g., dispersion, distribution characteristics).

Page 128

  • Visual representations of distributions include histograms, which group values into bins (e.g., 0.5 increments on a 1–10 CPI scale) to show the distribution shape.
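  The 0.5-wide binning behind such a histogram can be sketched directly; the scores below are illustrative, not the actual CPI data:

```python
from collections import Counter

scores = [1.4, 1.8, 2.2, 2.3, 2.4, 2.6, 2.7, 3.1, 3.3, 4.0, 5.2, 7.1, 9.3]

def bin_half(x: float):
    """Floor a score to its 0.5-wide bin, e.g. 2.3 -> (2.0, 2.5)."""
    lo = int(x * 2) / 2
    return (lo, lo + 0.5)

counts = Counter(bin_half(s) for s in scores)
for (lo, hi), c in sorted(counts.items()):
    print(f"{lo:.1f}-{hi:.1f}: {'#' * c}")  # text histogram, piled up at low scores
```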

  • The CPI histogram demonstrates a clearly skewed distribution with most countries scoring higher corruption (lower CPI) and a tail toward better scores.

  • As described, the mean is affected by extreme values, so the median is often preferred when describing skewed data.

  • The CPI histogram example also reinforces the principle that the distribution shape affects which descriptive statistics are most informative.

Page 129

  • The discussion expands on interpreting data across the three levels of measurement (nominal, ordinal, interval) and clarifies how the level of measurement determines appropriate descriptive techniques.

  • Qualitative data are often categorized into three sources: (1) survey/interview/focus group statements, (2) researcher observations, (3) documented information (e.g., articles, documents).

  • Qualitative data presentation can be narrative, including quotes and descriptive detail; however, this approach has to be tempered by concerns about representativeness and generalizability.

  • Triangulation (combining multiple data sources) is highlighted as a way to improve the validity of qualitative inferences by cross-checking evidence from different sources.

Page 130

  • Summing up descriptive inference: researchers should be clear about the concept being measured and ensure it is measurable, select data types appropriately, and recognize the challenges linked to aggregation that may mask differences.

  • Measures should minimize measurement error (e.g., social desirability bias, double-barreled questions) and sampling error (random sampling where possible).

  • Use random sampling to minimize sampling error where the population is not fully observed; construct sampling frames to minimize coverage bias; reduce non-response bias and missing data.

  • Report standard errors and confidence intervals to quantify uncertainty; discuss potential biases and limitations, including systematic errors when possible.

  • When population information is available, develop weights to correct for sampling and measurement errors; otherwise be transparent about limitations.

  • Students and researchers should acknowledge error and minimize it wherever possible; often, methodological limitations are reported in the paper’s methodology section to contextualize causal claims.

  • The text emphasizes that valid and reliable descriptive inferences are a prerequisite for valid and reliable causal inferences; subsequent chapters discuss experimental, large-n observational, and small-n observational approaches to causality.

Page 131

  • KEY TERMS list (aggregation, bias, census, cluster sampling, conceptualization, confidence interval, continuous, convenience sample, count data, coverage bias, cross-national data, disaggregation, double-barreled question, frequencies, histogram, index, etc.)

  • This glossary reinforces the vocabulary used to discuss descriptive inference and error.

Page 132

  • Additional KEY TERMS entries continue (inter-coder reliability, level of analysis, longitudinal data, margin of error, mean, median, mode, nominal-level data, non-response bias, normal distribution, operational definition, ordinal-level data, panel data, etc.).

  • The list culminates in a comprehensive glossary for the chapter.

Page 133

  • Contents of Chapter 5 begin: What Is an Experiment?; Why Control Means Stronger Inferences; Types of Experiments; Designing the Experiment; Analyzing and Presenting Results; Avoiding Mistakes; Natural Experiments; Conclusion; Key Terms.

  • Figure 5.1 shows historical growth in experimental methods in APSR articles: very few before 1975; gradual increase through the 1980s; substantial growth in the 1990s and 2000s; experiments are now a prominent method in political science.

  • The chapter positions experiments as the “gold standard” for causal inference in political science, while acknowledging trade-offs like internal vs external validity.

Page 134

  • The authors provide reasons for the rise of experiments: ease of computer/internet experimentation; cross-disciplinary collaboration (psychology, economics); advocacy by methodologists to promote causal inference through experiments.

  • Boxed reference to the Cambridge Handbook of Experimental Political Science by Druckman et al. is noted for further reading.

  • The text argues that experiments are especially valued for their ability to yield strong causal inferences and to test mechanisms under controlled conditions.

  • Figure 5.1 (reproduced) highlights the growth of experimental work over time.

Page 135

  • What is an experiment? A formal definition: an experiment is characterized by the researcher’s control over the data-generating process.

  • Distinction between observational and experimental designs: In observational designs, researchers do not manipulate data generation; in experiments, researchers intervene and manipulate a variable to observe outcomes.

  • Example: campaign advertising effects on votes. In an observational study, researchers might compare vote shares across markets with different ad allocations but cannot guarantee that ad allocation is independent of other factors like party identification.

  • In an experimental design, researchers manipulate which advertisements are shown and to whom, thereby introducing randomization and reducing confounding factors.

  • The key point is control over the data-generating process, which improves causal inference by reducing systematic differences across treatment groups.

Page 136

  • The importance of randomization: if exposure to the advertisements is assigned randomly, treatment is uncorrelated with potential confounders, allowing a cleaner causal estimate of advertising effects.

  • The hypothetical example with 60 voters (Democrats, Independents, Republicans) illustrates how partisanship can confound observational inferences: even if there is no true advertising effect, observed differences by ad exposure can arise from party identification linked to who sees which ads.

  • The figure (5.2) demonstrates that failing to account for party ID can lead to spurious conclusions about advertising effects.

  • The authors emphasize that random assignment creates equivalent groups on average, enabling a causal interpretation of observed differences.

  • With a larger sample, randomization yields comparable distributions across treatment conditions, facilitating robust inference.
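  The 60-voter hypothetical can be simulated to see what random assignment does: each party's 20 voters get split between the two groups, and any remaining imbalance is exactly the chance variation the text describes, which shrinks as n grows.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# 60 voters as in the text's hypothetical: equal thirds by party
voters = ["D"] * 20 + ["I"] * 20 + ["R"] * 20
random.shuffle(voters)

treatment, control = voters[:30], voters[30:]  # random assignment, 30 each

for party in "DIR":
    print(party, treatment.count(party), control.count(party))
# Party shares come out similar across groups on average; rerunning with a
# larger pool of voters makes the chance imbalance smaller still.
```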

Page 137

  • Randomized experiments vs observational studies: If party identification is controlled for (through random assignment and sample size), the treatment effect can be identified more accurately.

  • The text notes that in practice, even with randomization, some groups may differ by chance in small samples; larger samples reduce the likelihood of such imbalances.

  • The discussion underscores that randomization is a powerful tool for removing confounding factors that would otherwise bias causal estimates in observational settings.

Page 138

  • Internal validity vs external validity:

    • Internal validity refers to the degree to which a study establishes a credible causal effect within the experimental setting.

    • External validity concerns the generalizability of the findings to the real world.

  • Although laboratory experiments offer strong internal validity, they may suffer from lower external validity because subjects behave differently in a lab than in the field.

  • The chapter suggests balancing internal and external validity by using a mix of methods (lab experiments for mechanism, observational studies for generalizability).

  • The text notes that experiments can be designed to approximate real-world conditions, including more natural viewing contexts and realistic stimuli, to improve external validity.

  • A concrete example: Ansolabehere and Iyengar studied negative advertising with lab experiments that mimic real campaigns while trying to preserve realism (e.g., real candidates, event settings, naturalistic viewing environments).

  • The authors emphasize disguising the true purpose of the study to avoid subjects altering behavior due to awareness of being studied.

  • A second external validity threat: convenience samples (often used in lab experiments) may not represent the broader population; this issue is discussed with respect to college student participants.

  • They recount the practical challenges of recruiting non-student adults for lab experiments (costs, time, and representativeness concerns).

Page 139

  • Laboratory experiments: description of typical setups (computers, tasks) and a representative example of advertising experiments.

  • Two major points about lab experiments:

    • They often use a controlled environment to maximize internal validity.

    • They may lack external validity, but researchers justify this by stressing the value of understanding causal mechanisms and by using multiple methods to triangulate findings.

  • The text provides a concrete visualization of a lab experiment on advertising: participants see ads in a controlled setting and are asked to evaluate candidates afterward.

  • The chapter discusses how laboratory experiments can approximate real-world political campaigns by manipulating exposure to ads and measuring subsequent attitudes or intentions.

Page 140

  • Box 5.1: Genes and political attitudes (Natural experiment approach): a study by Alford, Funk, and Hibbing (2005) uses monozygotic vs dizygotic twins to test genetic transmission of political orientations.

  • Key findings: Identical twins show higher concordance in political attitudes than fraternal twins; genetic factors account for about half the variance in ideology, with shared environment accounting for about 11%.

  • Critiques: The Equal Environment Assumption may be violated; identical twins might be treated more similarly than fraternal twins, which could inflate genetic attributions.

  • The box notes that twin designs face methodological scrutiny, but they remain an important example of how experimental-like designs can inform causal inference about the origins of political attitudes.

  • The text also notes that genetics research is controversial and subject to ongoing debate about interpretation and limitations.

Page 141

  • Experimental design in Ostrom et al. (1992) on self-governance of common-pool resources (laboratory analog): Covenants with and without a sword.

  • Research question: Under what conditions can groups self-govern a common resource? The experiment simulates a common-pool resource with interdependent payoffs and tests the effects of communication and sanctions.

  • Design: A control group (no communication, no sanctions) and three treatment groups (communication only, sanctions only, both communication and sanctions).

  • Results: Control earned 32% of potential; sanctions-only group earned 38.8%; communication-only group earned 75%; combined communication and sanctions group earned 97%.

  • Implications: Communication emerged as a more important factor than sanctions; results challenge some prior political theory.

  • The authors argue that the laboratory’s controlled setting yields high internal validity, enabling clearer causal inferences about the mechanisms driving self-governance.

Page 142

  • External validity critique of laboratory experiments: lab results may not fully generalize to the real world due to simplified environments.

  • The authors discuss the need to balance internal validity with external validity by using a variety of methods and designing experiments that better approximate real-world conditions (e.g., more naturalistic settings, real campaigns, realistic stimuli).

Page 143

  • The text notes that laboratory experiments simplify the world to identify causal mechanisms, but the real world is complex with multiple information sources and contextual factors.

  • To mitigate external validity concerns, researchers emphasize the importance of naturalistic designs, transparency, and replication across contexts.

  • The authors discuss strategies to maximize external validity, including presenting realistic stimuli and not disclosing the study’s full purpose to prevent demand effects.

  • They stress the value of combining laboratory experiments with observational and field experiments to triangulate causal inferences.

Page 144

  • Labs and convenience samples: the Ansolabehere & Iyengar negative-advertising experiments relied on volunteer subjects in a lab; subject pools often consist of undergraduates, which may bias results.

  • The authors discuss the benefits of recruiting non-student adults and the practical costs associated with this approach (e.g., payment, space, staff).

  • They warn that even non-student samples may still be unrepresentative of the general population, and there are trade-offs between experimental control and representativeness.

  • The chapter notes the cost of experiments (e.g., paying participants, maintaining lab space, etc.) and acknowledges that large-scale, representative field experiments can be expensive.

  • In Ansolabehere & Iyengar’s work, subjects were more likely to be college-educated and African American than the general population, highlighting potential sampling differences.

Page 145

  • The discussion continues on participant recruitment and limits of representativeness in laboratory experiments.

  • Druckman & Kam (2011) address debates about the use of student samples and argue that concerns about external validity are sometimes overstated; the danger of student samples as a “narrow data base” may be overstated if the design and treatments approximate real-world decision contexts.

  • The authors emphasize the trade-off: laboratory experiments offer strong internal validity and can illuminate causal mechanisms, but researchers should be cautious about generalizing to broader populations.

Page 146

  • Survey experiments: These were developed to address issues of question wording and order effects that pollsters observed in public opinion polling.

  • Key idea: By randomly assigning respondents to different survey conditions (e.g., order of questions, phrasing, or treatments embedded in the survey), researchers can identify how design influences responses and how robust results are to wording and sequencing.

  • Classic examples include Pew Research Center’s Iraq War question experiment: shorter phrasing vs longer phrasing that included casualties risk altered respondents’ support levels for military action.

  • Findings illustrate that question wording can substantially affect expressed support, highlighting priming effects and the sensitivity of public opinion to framing.

  • The value of survey experiments lies in their relative affordability and ability to generalize to a broader population (when conducted with representative samples) and their capacity to test a wider range of treatments via internet surveys.

Page 147

  • Survey experiments can be conducted with representative samples or convenience samples. Internet-based surveys enable dynamic treatments, multimedia stimuli, and broad reach; they lower costs compared with traditional telephone/in-person surveys.

  • However, while survey experiments are often more externally valid than lab experiments due to their sampling frames, they can be limited in the kinds of treatments they can administer (less interactivity than lab-based experiments).

  • The use of internet surveys allows embedding videos or pictures and can support more complex manipulations than traditional surveys.

  • The authors discuss the trade-off: survey experiments tend to be simpler to implement than lab experiments but may yield lower internal validity; nevertheless, they can improve external validity by utilizing more representative samples.

Page 148

  • The section on survey experiments also covers convenience samples accessed via online platforms (e.g., Survey Monkey, Survey Gizmo, Zoomerang) and recruitment through online ads, including Mechanical Turk.

  • Mechanical Turk provides rapid data collection and a larger, more diverse participant pool relative to student samples, though it is not perfectly representative of the general population. Turk workers tend to be younger, more educated, and more likely to be female and to identify as Democrats or independents.

  • Despite limitations, MTurk has become a viable option for quick, cost-effective survey experiments with substantial variation across demographic variables; it often yields higher engagement and attentiveness than student-only samples.

  • The authors acknowledge MTurk’s limitations but argue that, compared with traditional convenience samples (like college students), MTurk can provide more representative and varied samples for many experimental purposes.

End of the excerpt

  • Across the pages, the overarching theme is a rigorous approach to descriptive inference that explicitly accounts for uncertainty, uses appropriate data types and measurement levels, and carefully considers the strengths and limits of different research designs (observational, lab, field, survey experiments).

  • The material emphasizes transparency about error sources (random vs systematic; measurement vs sampling), the role of weighting to correct biases, and the careful interpretation of statistical significance and confidence intervals.

  • It also introduces the experimental method as a powerful tool for causal inference, with explicit attention to internal and external validity and to the trade-offs involved in laboratory versus field or survey experiments, including practical considerations of recruitment, cost, and generalizability.

Key concepts to review

  • Error types: random sampling error, measurement error, systematic error, non-response bias.

  • Standard error and margin of error: formulas for proportions and means; Central Limit Theorem and confidence intervals.

  • Weighting: purpose, methods, and limitations; how weights correct sampling bias and non-response.

  • Data levels: nominal, ordinal, interval (and occasional mention of ratio data); appropriate descriptive tools for each.

  • Descriptive statistics: frequencies, percentages, valid percent, mode, median, mean, standard deviation, skewness, histogram, bar chart.

  • Visualization best practices: appropriate scales, labels, and explicit reporting of uncertainty (confidence intervals).

  • Qualitative data and triangulation: multiple sources, transparency, limitations.

  • Experimental designs: observational vs experimental, randomization, control over data-generating process, internal vs external validity, laboratory vs survey vs field experiments, mechanism testing.

  • Box 5.1 (genetics and politics): natural experiments using twins to infer heritability of political attitudes; limitations such as equal environment assumption.

  • Real-world accessibility and costs of experiments: recruitment challenges, subject pools, and the trade-offs between internal validity and generalizability.

Summary of formulas and key relationships

  • Standard error for a sample proportion:
    \text{std. error} = \sqrt{\frac{p(1-p)}{n}}

  • Standard error for a sample mean:
    \text{std. error} = \frac{s}{\sqrt{n}}

  • Standard deviation (population formula, computed over the sample):
    \text{sd} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}

  • Margin of error and CI relation: MoE ≈ 2 × SE for 95% CI (under normal approximation).

  • 95% Central Limit Theorem intuition: about 95% of sample estimates fall within ±2 SE of the population value.

  • Data weighting formulae are not explicit, but conceptually involve reweighting observations by known population shares to match population parameters.