Lee, Zhang: Legibility and State Capacity — Comprehensive Notes

LEGIBILITY AND STATE CAPACITY: COMPREHENSIVE NOTES

Core idea: state capacity partly derives from legibility — the breadth and depth of the state’s knowledge about its citizens and their activities — and legibility is crucial for effective, centralized governance.
The authors build an informational account of state capacity, linking legibility to the state’s ability to curb free-riding in collective action dilemmas, with a focus on taxation and public goods provision.
Key theoretical anchor: James Scott’s concept of making local practices legible to central officials, but with a twist that legibility also supports social order and control of opportunistic behavior, not only extraction.
Definition of legibility (as used in this article):
- (a) The state possesses information about local practices and populations.
- (b) This information is rendered in standardized forms (e.g., cadastral maps, birth certificates, property registers) that are understandable to state administrators.
Why legibility matters for state capacity:
- Enables monitoring private behavior and enforcing rules in collective action settings.
- Centralizes information to resolve free-riding in public goods provision and taxation.
- Provides a foundation for efficient social order and governance.
Two classic illustrations of legibility challenges before modern data systems:
- Local weights and measures: dozens of local pintes and aunes coexisted in eighteenth-century France (e.g., the pinte was 0.93 L in Paris, 1.99 L in Seine-en-Montagne, and 3.33 L in Précy-sous-Thil; there were at least seventeen different aunes). This shows how local knowledge is hard to aggregate for centralized planning.
- Naming conventions and identification: in pre-modern Britain, tracking property ownership, taxes, court records, etc., was hindered by common names (six given names such as John, William, and Thomas accounted for about 90% of males) and by the lack of standardized identifiers. Without standard forms, establishing who was who became an administrative nightmare.
Operational payoff: legibility helps resolve fiscal free-riding by enabling the state to identify contributors and under-contributors and to enforce tax rules.
Historical example highlighting legibility’s role in taxation: Vauban’s 1686 proposal to conduct annual censuses to know the number of subjects and their resources, wealth, and poverty — signaling the long-standing view that granular population information supports taxation and governance.
Pre-modern to modern shift: even with cash economies and cross-border earnings, the information problem persists; legibility remains essential for taxing and delivering public goods.
Swedish historical case as a concrete demonstration of legibility’s power:
- Beginning around 1540, royal bailiffs compiled tax registers of peasant households.
- In 1628, Sweden undertook early cadastral mapping (roughly 12,000 maps detailing boundaries, ownership, tenure, cultivation, and valuation).
- Ecclesiastical Laws of 1686 tasked clergy with maintaining parishioners’ tax lists.
- By 1920, ~80% of the economically active population in Sweden was registered with the tax authority.
- Legibility enabled broad-based taxation: between 1690 and 1873, Sweden could levy taxes in kind to support the army, mobilizing about 1.5% of the population for military service — higher militarization than several major powers.
- Takeaway: legibility underpinned stable, high-yield taxation and broad social order.

OPERATIONALIZING LEGIBILITY

Research goal: measure legibility across countries and over time using a coarse, widely comparable data series — national censuses.
Why census data? Broad geographic and temporal coverage; long-standing traditional role in population information; suitable for cross-national and cross-time comparisons.
Improvement over prior work: move beyond a binary “census conducted or not” to quantify the quality/accuracy of census data, focusing on age information.
Rationale for age accuracy as legibility proxy:
- Age information is collected in almost all censuses, enhancing comparability.
- The smoothness of true age distributions allows detection of data errors via age-heaping patterns.
- Age information is foundational for many administrative tasks (voting eligibility, military service, driver’s licenses, education, benefits).
Conceptual link: age-heaping indicates general legibility of population data; if age data are inaccurate, other administrative data are likely less legible as well.
Data-generating processes for census age data:
- Mechanism A: Low age-awareness in the population due to limited state-society interaction (remote regions, weak state presence) leads to inaccurate self-reports.
- Mechanism B: Enumerators may record ages inaccurately due to difficult field conditions, reliance on second-hand information, or personal biases.
- Both mechanisms produce age misreporting, interpreted as a broader legibility problem.
Historical anecdotes reinforcing enumeration challenges:
- Nepal 1961 example: a hill region enumerator relied on visible settlements and second-hand reporting, risking inaccurate counts.
- Frontier/remote areas in 19th-century United States faced logistical and security challenges to enumeration, increasing opportunities for shirking and undercounting.
The core empirical strategy: create a census-based legibility indicator by measuring age accuracy across censuses and countries.
Data availability and scope:
- 370 censuses spanning 1960–2012.
- Substantial temporal and geographic coverage; many observations are national-level, with some subnational units.
- About half of observations come from original data collection from national census reports.
Key empirical predictions: more legible populations (as measured by census age data accuracy) will exhibit better state performance in taxation and public goods provision.

DATA-GENERATING PROCESSES AND MEASURES

Core measurement tool: the Myers Index of age-heaping, complemented by the Whipple Index.
Why focus on the Myers Index?
- Accounts for mortality (unlike Whipple in some uses).
- Suitable for cross-country comparisons and for datasets with varying population sizes.
- Produces a continuous measure with interpretable scale, useful for regression analyses.
Whipple Index (definition and interpretation):
- Measures disproportionate reporting of ages ending in 0 or 5.
- Formula (conceptual):
  $W = \frac{N_{0,5}}{0.2 \times N} \times 100$, where $N_{0,5}$ is the number of records whose age ends in 0 or 5 and $N$ is the total number of records considered.
- Range: 100 (no preference for end-in-0/5) to 500 (all ages end in 0 or 5).
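A minimal Python sketch of the Whipple formula above. The function name `whipple_index` and the default age range of 23 to 62 are assumptions (that range is the conventional one for this index; these notes do not state which range the paper uses), and the sample ages are made up for illustration.

```python
def whipple_index(ages, low=23, high=62):
    """Whipple index for the ages that fall inside [low, high].

    The 23-62 default is the conventional range, assumed here;
    it is not taken from the notes.
    """
    considered = [a for a in ages if low <= a <= high]
    if not considered:
        raise ValueError("no ages fall inside the requested range")
    # Count records whose reported age ends in 0 or 5.
    heaped = sum(1 for a in considered if a % 5 == 0)
    # One fifth of ages should end in 0 or 5 when there is no heaping.
    return heaped / (0.2 * len(considered)) * 100


# Illustrative inputs (made up): one person at every age, then only round ages.
uniform_ages = list(range(23, 63))
rounded_ages = [30, 35, 40, 45, 50]

print(whipple_index(uniform_ages))  # no heaping: 100.0
print(whipple_index(rounded_ages))  # every age ends in 0 or 5: 500.0
```

The two inputs reproduce the endpoints of the stated range: 100 when exactly one fifth of ages end in 0 or 5, and 500 when all of them do.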
Myers Index (conceptual description):
- A blending technique that accounts for age distributions across ten overlapping age bins (e.g., [15–64], [16–65], …, [24–73]).
- For each bin, compute digit-ending distribution and then average across the ten starting points to obtain a single index.
- Scale: 0 to 90, where 0 indicates no age-digit heaping and 90 indicates extreme heaping on a single terminal digit.
- In this study, Myers is inverted so that higher values indicate greater legibility. Concretely, if $M$ is the Myers Index, legibility is defined as
  $L = 90 - M$.
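The description above can be sketched in Python as follows. This is a simplified illustration that follows the notes' wording (ten overlapping 50-year windows starting at ages 15 through 24, terminal-digit shares averaged across windows); it does not reproduce the classical Myers weighting scheme, and the paper's exact procedure may differ. The function names `blended_heaping_index` and `legibility` and the sample ages are assumptions.

```python
def blended_heaping_index(ages, first_start=15, width=50):
    """Simplified Myers-style index on a 0-90 scale.

    For each of ten starting ages, take the ages in
    [start, start + width) and compute the percentage share of each
    terminal digit; then average the shares across windows.
    """
    totals = [0.0] * 10
    windows_used = 0
    for start in range(first_start, first_start + 10):
        window = [a for a in ages if start <= a < start + width]
        if not window:
            continue  # skip windows with no records
        windows_used += 1
        for digit in range(10):
            count = sum(1 for a in window if a % 10 == digit)
            totals[digit] += 100.0 * count / len(window)
    if windows_used == 0:
        raise ValueError("no ages fall inside any window")
    averaged = [t / windows_used for t in totals]
    # Half the summed absolute deviation from an even 10% per digit.
    return 0.5 * sum(abs(share - 10.0) for share in averaged)


def legibility(myers_score):
    """Inversion used in the notes: higher values mean more legible."""
    return 90.0 - myers_score


# Illustrative inputs (made up).
even_ages = list(range(15, 74))   # one person at every age 15..73
heaped_ages = [30] * 20           # everyone reports age 30

print(blended_heaping_index(even_ages))    # 0.0
print(blended_heaping_index(heaped_ages))  # 90.0
print(legibility(blended_heaping_index(heaped_ages)))  # 0.0
```

A 50-year window contains each terminal digit exactly five times, which is why an even age distribution scores 0, while total heaping on one digit gives half of (90 + 9 × 10) = 90.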
Practical notes on index interpretation:
- The Myers score tends to be upwardly biased for very small populations (less than about 5,000 individuals).
- The analysis in the paper focuses on national-level data or large subnational units where the index is more reliable.
Data coverage and patterns (descriptive):
- Global sample: 370 censuses; mean Myers score ~8.21; wide range up to ~45.67.
- Distribution is highly right-skewed overall; after de-meaning by country, the distribution approximates normality.
- Regions with wider variation in legibility: Asia, Sub-Saharan Africa, and the Middle East/North Africa tend to show broader ranges; Western democracies show higher legibility (lower age-heaping).
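The de-meaning step mentioned in the descriptive patterns above removes each country's average score, so that only within-country variation remains. A minimal sketch, assuming records arrive as (country, score) pairs; the function name `demean_by_group`, the placeholder country labels, and the scores are all hypothetical.

```python
from collections import defaultdict


def demean_by_group(records):
    """Subtract each group's mean from its values, preserving input order.

    records: iterable of (group_label, value) pairs.
    """
    records = list(records)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1
    means = {group: sums[group] / counts[group] for group in sums}
    return [value - means[group] for group, value in records]


# Hypothetical scores for two placeholder countries.
sample = [("A", 2.0), ("A", 4.0), ("B", 30.0), ("B", 40.0)]
print(demean_by_group(sample))  # [-1.0, 1.0, -5.0, 5.0]
```

After de-meaning, the large gap between the two placeholder countries disappears, which is consistent with the note that skewness in the pooled distribution is driven by between-country differences.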
Validation strategy (link to other indicators):
- Subnational validity: relate Myers scores to birth registration and birth certificates where subnational data exist.
- National validity: relate inverted Myers to established state-capacity proxies (ICRG, WGI, FSI, BTI).
- Expected signs: positive correlation between legibility and both birth registration and birth certificates; positive correlation with measures of governance/capacity; negative correlation with fragile-state indicators (as appropriate).
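The validation checks above rest on simple correlations, reported both raw and logged. The sketch below shows how a log transform can change a Pearson correlation when one variable is strongly skewed. The helper `pearson` and the data are hypothetical; which variable the paper logs is not stated in these notes, so logging the second variable here is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Made-up data: the second variable grows multiplicatively (right-skewed).
indicator = [1.0, 2.0, 3.0, 4.0]
skewed = [10.0, 100.0, 1000.0, 10000.0]

raw_r = pearson(indicator, skewed)
logged_r = pearson(indicator, [math.log(v) for v in skewed])
print(round(raw_r, 2), round(logged_r, 2))
```

In this toy case the logged correlation is higher than the raw one, the same direction as the raw-versus-logged pattern reported in the validation results.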
What the Myers Index captures about legibility:
- A robust proxy for the state’s “presence on the ground” and reach into society.
- Sensitive to variation in the middle range of state capacity, where most countries reside, not just extreme high/low ends.

GLOBAL PATTERNS AND VALIDATION OF THE MYERS INDEX

Worldwide variation (summary):
- The world mean Myers score ~8.21; substantial cross-country skewness driven by between-country differences in legibility.
- The sub-sample of Asia, Sub-Saharan Africa, and Middle East/North Africa shows the widest range; Western states show high legibility (low heaping).
Subnational validation results (birth registration and birth certificate, N varies by data availability):
- Birth registration correlation with Myers (subnational): ~0.44 (raw) and ~0.52 (logged) for 393 observations.
- Birth certificate correlation (subnational): ~0.34 (raw) and ~0.41 (logged) for 282 observations.
National validity results (correlation with governance/performance indicators):
- ICRG total index: ~0.48 (raw) and ~0.61 (logged).
- Internal conflict, bureaucratic quality: positive correlations consistent with theory.
- WGI indicators (government effectiveness, political stability, rule of law, regulatory quality, control of corruption): generally positive and significant.
- FSI (Fragile States Index): negative correlations with legibility, as expected, since higher FSI scores indicate greater fragility.
- BTI (Bertelsmann Transformation Index): positive associations with legibility, particularly for state capacity dimensions like Stateness and overall index.
Interpretation: inverted Myers scores track with standard state-capacity measures and governance quality, supporting the Myers index as a valid proxy for legibility.

RESULTS: LEGIBILITY, TAXATION, AND COLLECTIVE GOODS

Research design for testing legibility effects on taxation and public goods:
- Two dependent variables at the subnational level: tax revenue (income tax collected by province) and tax ratio (tax revenue / province GDP).
- Controls: regional GDP per capita, distance from capital, population density, terrain ruggedness; all variables log-transformed due to skewness.
- Regression specification: ordinary least squares (OLS), robust to clustering by country; Myers scores lagged by one year to mitigate reverse causality; standardized variables (mean 0, SD 1) so coefficients reflect SD changes; legibility is inverted so higher values imply greater legibility.
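To make the standardization in the specification above concrete, here is a minimal sketch of why standardized coefficients read as SD changes. It fits a one-regressor OLS slope on z-scored variables; the controls, one-year lag, and country-clustered standard errors of the paper's models are deliberately omitted. The helper names `zscore` and `ols_slope` and the data are hypothetical.

```python
import statistics


def zscore(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant sequence")
    return [(v - mean) / sd for v in values]


def ols_slope(x, y):
    """Slope of the least-squares line of y on a single regressor x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Made-up values standing in for a predictor and an outcome.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [1.0, 3.0, 2.0, 5.0, 4.0]

beta_std = ols_slope(zscore(predictor), zscore(outcome))
print(round(beta_std, 3))  # 0.8
```

Read this way, a standardized coefficient of 0.8 means that a one SD increase in the predictor goes with a 0.8 SD increase in the outcome; the coefficients in Table 4 are interpreted on the same scale.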
Subnational sample and model:
- 12 countries: Argentina, Brazil, Greece, India, Indonesia, Italy, Mexico, Philippines, South Africa, Tanzania, Thailand, Turkey.
- N of observations: 399; N of countries: 12.
Key results (Table 4): legibility is positively and statistically significantly associated with tax outcomes.
- Tax Revenue (model 1): coefficient on Legibility = 0.319*.
- Tax Revenue (model 2): coefficient on Legibility = 0.104*.
- Tax Ratio (model 3): coefficient on Legibility = 0.0634*.
- Tax Ratio (model 4): coefficient on Legibility = 0.0587*.
- Significance: p < .05 for all legibility coefficients in the four models.
Substantive interpretation (based on standardized coefficients):
- A one standard deviation increase in legibility is associated with about a 10% increase in tax revenues (summary interpretation given in the text).
- Likewise, a one SD increase in legibility is associated with an increase of roughly 0.06 SD in the tax ratio (that is, about 6% of the dependent variable's standard deviation).
India example (illustrative quantitative illustration):
- Uttar Pradesh (UP) 2012 context: regional GDP per capita ≈ US$372; legibility in UP is in the bottom quartile; UP tax revenue ≈ US$2.3 billion in 2012.
- A one SD increase in legibility in UP implies roughly an additional ≈US$320 million in income tax revenue.
- Tax ratio implications: a one SD increase in legibility could raise UP’s tax ratio from about 3% to around 6.5%.
Public goods outcomes (national-level analysis; cross-national data):
- Data: up to 111 countries; decade averages for dependent variables due to data timing; controls include logged GDP per capita, democracy, population density, terrain ruggedness; legibility inverted so higher is more legible.
- Dependent variables: infant mortality rate (log), adult literacy rate, primary school enrollment rate.
- Regression framework: OLS with decade fixed effects and country-clustered standard errors.
- Results (Table 6, full covariates): legibility is statistically significant and in the hypothesized direction for all three outcomes:
- Infant mortality: legibility negatively associated (more legible states have lower infant mortality).
- Adult literacy: legibility positively associated (more legible states have higher literacy).
- Primary school enrollment: legibility positively associated (more legible states have higher enrollment).
- Magnitudes (illustrative): in Kenya (2000s series), a 1 SD rise in legibility is associated with approximately 17 fewer infant deaths per 1,000 births, about 11 percentage-point higher literacy, and about 3 percentage-point higher primary enrollment.
Interpretation and caveats:
- The authors stress that the results are suggestive and not causal due to data limitations, potential omitted variables, and measurement error.
- Nevertheless, the consistent positive association between legibility and both tax collection and public goods is robust to bootstrapped standard errors and persists under lag specifications.
- The results support the view that legibility enables the state to employ more effective fiscal instruments and extractive capacity, and to deliver public goods more efficiently.
Additional implications from the tax/public goods results:
- Legibility may facilitate a shift from indirect to direct taxation by providing better information on citizens’ economic activity, enabling broader and more effective taxation regimes.
- This aligns with the broader literature linking information, governance, and state-building (e.g., Jones 1988; Kiser and Sacks 2009; Martin, Mehrotra, and Prasad 2009).

INTERPRETIVE AND THEORETICAL IMPLICATIONS

The legibility framework helps reconcile different strands of state-capacity and development research by foregrounding information flows as a fundamental determinant of state performance.
The authors argue legibility is a central, previously underappreciated component of state capacity, contributing to both the extraction (taxation) and provision (public goods) branches of governance.
The data’s geographic and temporal breadth enables new research avenues, including:
- Using legibility itself as a dependent variable to study how states increase information at their disposal (incentives, artifacts, or both).
- Exploring legibility as a potential mediator or moderator in links between state legitimacy and capacity, as well as privacy and surveillance debates in contemporary politics.
The authors emphasize that legibility is most impactful in the middle range of state capacity — where most states reside — and not solely in extremes of state strength or weakness.
Broader theoretical implications for conflict and development:
- If legibility is crucial for both resource extraction and public provision, legible states may be better at preventing civil conflict by meeting citizens’ basic needs and avoiding under-provision or coercive overreach.
- The approach offers a way to decompose state capacity into its informational components, allowing sharper tests of competing explanations of development and conflict.

CONCLUSION AND FUTURE RESEARCH AGENDA

Main takeaway: legibility, defined as the breadth and depth of state knowledge about its citizens and activities, is a crucial, understudied component of state capacity that facilitates centralized monitoring and enforcement, reduces free-riding, and improves taxation and public goods provision.
The census-based Myers Index provides unprecedented temporal and geographic coverage for measuring legibility at national and subnational levels, enabling analyses that previously were not feasible.
The authors highlight several key contributions:
- Reintroduction of legibility as a core variable in the state-capacity literature.
- Theoretical linkage between legibility and the state’s role in controlling opportunistic behavior in collective action settings.
- An original, broadly applicable legibility indicator with wide cross-national and historical applicability.
Practical implications:
- Legibility is associated with both quantitative and qualitative changes in state power, enabling more effective revenue collection and broader deployment of fiscal instruments.
- The movement from indirect to direct taxation aligns with the expansion of comprehensive informational bases on citizens’ economic activity.
Future research directions suggested by the authors:
- Treat legibility as a dependent variable and explore how states increase information: incentives versus artifacts, and potential political constraints.
- Examine how legibility interacts with legitimacy and privacy debates, including contemporary surveillance concerns.
- Use the Myers Index to study intrastate conflict, civil war, and development, particularly focusing on middle-range states where variation is largest.
Acknowledgments: the paper notes support from NSF and other institutions and references a comprehensive bibliography for related state-capacity literature.

APPENDIX NOTES AND DATA CAVEATS (AS DISCUSSED IN THE TEXT)

The authors acknowledge data limitations, measurement error, and missing data, particularly in weaker states, which may temper the observed effects.
They perform robustness checks (e.g., bootstrapped standard errors) and discuss lag specifications to mitigate concerns about reverse causality.
The Myers Index’s broad coverage and historical applicability offer a novel research tool for future studies on state capacity and related outcomes.

KEY NUMERICAL REFERENCES FOR QUICK RECALL

Whipple Index:
$W = \frac{N_{0,5}}{0.2 \times N} \times 100$, where $N_{0,5}$ is the count of ages ending in 0 or 5 and $N$ is the total number of ages counted.
Range: 100 (no heaping) to 500 (extreme heaping).
Myers Index: scaled 0 to 90, with 0 indicating no heaping and 90 indicating extreme heaping; inverted in this study to produce legibility score $L$:
$L = 90 - M$,
where $M$ is the Myers score.
Sample and coverage: 370 censuses (1960–2012); national and subnational data; mean raw Myers ≈ 8.21; Canada (1991) ≈ 0.18; Pakistan (1973) ≈ 45.67; France and Switzerland show low heaping overall; Sierra Leone (2004) shows high heaping.
Subnational tax results (legibility coefficient on tax revenue or tax ratio):
- Tax Revenue (1): $\beta_{Leg} = 0.319^{*}$;
- Tax Revenue (2): $\beta_{Leg} = 0.104^{*}$;
- Tax Ratio (3): $\beta_{Leg} = 0.0634^{*}$;
- Tax Ratio (4): $\beta_{Leg} = 0.0587^{*}$.
Illustrative quantitative magnitudes:
- 1 SD increase in legibility → ~10% increase in tax revenues (from the pooled subnational sample of 12 countries).
- 1 SD increase in legibility → increase of ~0.06 SD in the tax ratio.
- India example (Uttar Pradesh, 2012): 1 SD legibility rise → ≈$320 million increase in income tax revenue; tax ratio increase from ~3% to ~6.5%.
- Kenya (2000s): 1 SD legibility rise → ≈17 fewer infant deaths per 1,000 births; ≈11 percentage-point rise in literacy; ≈3 percentage-point rise in enrollment.
Public goods outcomes (national level, decade averages): legibility negatively associated with infant mortality; positively associated with literacy and enrollment; results hold after controlling for GDP per capita, democracy, population density, terrain.

CONNECTIONS TO BROADER THEMES

This work situates legibility as a foundational component of state capacity, bridging informational infrastructure with both revenue extraction and service delivery.
It provides a data-rich platform to test long-standing questions about how states become more capable and how information regimes shape developmental trajectories.
The approach invites examination of the political economy of data generation, including incentives for governments to improve administrative records and the potential feedback with legitimacy and privacy concerns.