Lee, Zhang: Legibility and State Capacity — Comprehensive Notes
- Core idea: state capacity partly derives from legibility — the breadth and depth of the state’s knowledge about its citizens and their activities — and legibility is crucial for effective, centralized governance.
- The authors build an informational account of state capacity, linking legibility to the state’s ability to curb free-riding in collective action dilemmas, with a focus on taxation and public goods provision.
- Key theoretical anchor: James Scott’s concept of making local practices legible to central officials, but with a twist that legibility also supports social order and control of opportunistic behavior, not only extraction.
- Definition of legibility (as used in this article):
- (a) The state possesses information about local practices and populations.
- (b) This information is rendered in standardized forms (e.g., cadastral maps, birth certificates, property registers) that are understandable to state administrators.
- Why legibility matters for state capacity:
- Enables monitoring private behavior and enforcing rules in collective action settings.
- Centralizes information to resolve free-riding in public goods provision and taxation.
- Provides a foundation for efficient social order and governance.
- Two classic illustrations of legibility challenges before modern data systems:
- Local weights and measures: dozens of pints and aunes across locales (e.g., in eighteenth-century Paris, a pinte = 0.93 L, Seine-en-Montagne 1.99 L, Precy-sous-Thil 3.33 L; at least seventeen different aunes). This shows how local knowledge is hard to aggregate for centralized planning.
- Naming conventions and identification: in pre-modern Britain, tracking property ownership, taxes, court records, etc., was hindered by common names (about 90% of males shared just six given names, such as John, William, and Thomas) and by the lack of standardized identifiers, which illustrates the administrative nightmare of achieving legibility without standard forms.
- Operational payoff: legibility helps resolve fiscal free-riding by enabling the state to identify contributors and under-contributors and to enforce tax rules.
- Historical example highlighting legibility’s role in taxation: Vauban’s 1686 proposal to conduct annual censuses to know the number of subjects and their resources, wealth, and poverty — signaling the long-standing view that granular population information supports taxation and governance.
- Pre-modern to modern shift: even with cash economies and cross-border earnings, the information problem persists; legibility remains essential for taxing and delivering public goods.
- Swedish historical case as a concrete demonstration of legibility’s power:
- Beginning around 1540, royal bailiffs compiled tax registers of peasant households.
- In 1628, Sweden undertook early cadastral mapping (roughly 12,000 maps detailing boundaries, ownership, tenure, cultivation, and valuation).
- Ecclesiastical Laws of 1686 tasked clergy with maintaining parishioners’ tax lists.
- By 1920, ~80% of the economically active population in Sweden was registered with the tax authority.
- Legibility enabled broad-based taxation: between 1690 and 1873, Sweden could levy taxes in kind to support the army, mobilizing about 1.5% of the population for military service — higher militarization than several major powers.
- Takeaway: legibility underpinned stable, high-yield taxation and broad social order.
OPERATIONALIZING LEGIBILITY
- Research goal: measure legibility across countries and over time using a coarse, widely comparable data series — national censuses.
- Why census data? Broad geographic and temporal coverage; a long-standing role in collecting population information; suitable for cross-national and cross-time comparisons.
- Improvement over prior work: move beyond a binary “census conducted or not” to quantify the quality/accuracy of census data, focusing on age information.
- Rationale for age accuracy as legibility proxy:
- Age information is collected in almost all censuses, enhancing comparability.
- The smoothness of true age distributions allows detection of data errors via age-heaping patterns.
- Age information is foundational for many administrative tasks (voting eligibility, military service, driver’s licenses, education, benefits).
- Conceptual link: age-heaping indicates general legibility of population data; if age data are inaccurate, other administrative data are likely less legible as well.
- Data-generating processes for census age data:
- Mechanism A: Low age-awareness in the population due to limited state-society interaction (remote regions, weak state presence) leads to inaccurate self-reports.
- Mechanism B: Enumerators may record ages inaccurately due to difficult field conditions, reliance on second-hand information, or personal biases.
- Both mechanisms produce age misreporting, interpreted as a broader legibility problem.
- Historical anecdotes reinforcing enumeration challenges:
- Nepal 1961 example: a hill region enumerator relied on visible settlements and second-hand reporting, risking inaccurate counts.
- Frontier/remote areas in 19th-century United States faced logistical and security challenges to enumeration, increasing opportunities for shirking and undercounting.
- The core empirical strategy: create a census-based legibility indicator by measuring age accuracy across censuses and countries.
- Data availability and scope:
- 370 censuses spanning 1960–2012.
- Substantial temporal and geographic coverage; many observations are national-level, with some subnational units.
- About half of observations come from original data collection from national census reports.
- Key empirical predictions: more legible populations (as measured by census age data accuracy) will exhibit better state performance in taxation and public goods provision.
DATA-GENERATING PROCESSES AND MEASURES
- Core measurement tool: the Myers Index of age-heaping, complemented by the Whipple Index.
- Why focus on the Myers Index?
- Accounts for mortality (unlike Whipple in some uses).
- Suitable for cross-country comparisons and for datasets with varying population sizes.
- Produces a continuous measure with interpretable scale, useful for regression analyses.
- Whipple Index (definition and interpretation):
- Measures disproportionate reporting of ages ending in 0 or 5.
- Formula (conceptual):
$$W = \frac{N_{0,5}}{0.2 \times N} \times 100,$$
where $N_{0,5}$ is the number of records with ages ending in 0 or 5 and $N$ is the total number of records considered.
- Range: 100 (no preference for ages ending in 0 or 5) to 500 (all ages end in 0 or 5).
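As a rough illustration of the Whipple formula above, the following Python sketch computes $W$ from a list of reported ages. The function name `whipple_index` and the default 23 to 62 age window are assumptions: that window is the conventional choice in demography, but these notes do not say which range the authors use. The sample data are hypothetical.

```python
from typing import Iterable


def whipple_index(ages: Iterable[int], lo: int = 23, hi: int = 62) -> float:
    """Whipple Index: W = N_{0,5} / (0.2 * N) * 100.

    `lo` and `hi` bound the ages considered; 23-62 is an assumed default.
    """
    considered = [a for a in ages if lo <= a <= hi]
    if not considered:
        raise ValueError("no ages fall inside the requested window")
    n_total = len(considered)
    # Ages ending in 0 or 5 are exactly the multiples of 5.
    n_heaped = sum(1 for a in considered if a % 5 == 0)
    return n_heaped / (0.2 * n_total) * 100


# Hypothetical data: one person at every single year of age (no heaping).
even_ages = list(range(23, 63))
# Hypothetical data: everyone reports an age ending in 0 or 5.
heaped_ages = [25, 30, 35, 40, 45, 50, 55, 60]

print(whipple_index(even_ages))    # close to the lower bound of the range
print(whipple_index(heaped_ages))  # close to the upper bound of the range
```

The two hypothetical inputs sit at the two ends of the 100 to 500 range stated above.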
- Myers Index (conceptual description):
- A blending technique that accounts for age distributions across ten overlapping age bins (e.g., [15–64], [16–65], …, [24–73]).
- For each bin, compute digit-ending distribution and then average across the ten starting points to obtain a single index.
- Scale: 0 to 90, where 0 indicates no age-digit heaping and 90 indicates extreme heaping on a single terminal digit.
- In this study, Myers is inverted so that higher values indicate greater legibility. Concretely, if $M$ is the Myers Index, legibility is defined as
$$L = 90 - M.$$
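The blending procedure and the inversion can be sketched in Python as follows. This is one reading of the description above, not the authors' code: the names `myers_index` and `legibility` are hypothetical, the ten windows start at ages 15 through 24 and each span 50 years (matching the example bins), and halving the summed absolute deviations is an assumption made so that the result falls on the 0 to 90 scale.

```python
from collections import Counter
from typing import Iterable


def myers_index(ages: Iterable[int], first_start: int = 15, width: int = 50) -> float:
    """Blended terminal-digit index on the 0-90 scale described above.

    Ten overlapping windows are used: [first_start, first_start + width - 1]
    through [first_start + 9, first_start + width + 8].
    """
    by_age = Counter(ages)
    blended = [0] * 10  # blended count for each terminal digit 0..9
    for start in range(first_start, first_start + 10):
        for age in range(start, start + width):
            blended[age % 10] += by_age[age]
    total = sum(blended)
    if total == 0:
        raise ValueError("no ages fall inside the windows")
    shares = [100 * count / total for count in blended]
    # Assumed scaling: half the summed absolute deviation from the 10%
    # share each digit would have without heaping.
    return 0.5 * sum(abs(share - 10) for share in shares)


def legibility(myers: float) -> float:
    """Inverted score used in these notes: L = 90 - M."""
    return 90 - myers


# Hypothetical data: 100 people at every age 15..73 (no heaping).
uniform = [age for age in range(15, 74) for _ in range(100)]
# Hypothetical data: everyone reports an age ending in 0.
heaped = [age for age in range(20, 80, 10) for _ in range(100)]

print(myers_index(uniform), legibility(myers_index(uniform)))
print(myers_index(heaped), legibility(myers_index(heaped)))
```

Because every 50-year window contains each terminal digit exactly five times, a perfectly even age distribution scores at the bottom of the Myers scale, while complete heaping on one digit scores at the top.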
- Practical notes on index interpretation:
- The Myers score tends to be upwardly biased for very small populations (less than about 5,000 individuals).
- The analysis in the paper focuses on national-level data or large subnational units where the index is more reliable.
- Data coverage and patterns (descriptive):
- Global sample: 370 censuses; mean Myers score ~8.21; wide range up to ~45.67.
- Distribution is highly right-skewed overall; after de-meaning by country, the distribution approximates normality.
- Regions with wider variation in legibility: Asia, Sub-Saharan Africa, and the Middle East/North Africa tend to show broader ranges; Western democracies show higher legibility (lower age-heaping).
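The de-meaning step mentioned above (subtracting each country's own average Myers score) can be sketched as follows. The function name `demean_by_country` and the sample rows are hypothetical; the notes do not describe how the authors implemented this step.

```python
from collections import defaultdict
from statistics import mean
from typing import List, Tuple


def demean_by_country(rows: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Subtract each country's mean score from its census-level scores."""
    grouped = defaultdict(list)
    for country, score in rows:
        grouped[country].append(score)
    country_means = {country: mean(scores) for country, scores in grouped.items()}
    return [(country, score - country_means[country]) for country, score in rows]


# Hypothetical Myers scores for two countries with very different levels.
rows = [("A", 2.0), ("A", 4.0), ("B", 30.0), ("B", 40.0)]
print(demean_by_country(rows))
```

After de-meaning, only within-country variation remains, which is why the large between-country differences no longer drive the shape of the distribution.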
- Validation strategy (link to other indicators):
- Subnational validity: relate Myers scores to birth registration and birth certificates where subnational data exist.
- National validity: relate inverted Myers to established state-capacity proxies (ICRG, WGI, FSI, BTI).
- Expected signs: positive correlation between legibility and both birth registration and birth certificates; positive correlation with measures of governance/capacity; negative correlation with fragile-state indicators (as appropriate).
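A minimal sketch of the sign check behind this validation strategy, using a plain Pearson correlation. The helper `pearson` and all data values are hypothetical and serve only to show the expected directions; they are not taken from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical units: legibility (90 - Myers), birth registration rate (%),
# and a fragility score where higher means more fragile.
legibility_scores = [60.0, 70.0, 80.0, 85.0, 89.0]
birth_registration = [35.0, 50.0, 72.0, 90.0, 98.0]
fragility = [105.0, 95.0, 80.0, 60.0, 40.0]

print(pearson(legibility_scores, birth_registration) > 0)  # expected sign: positive
print(pearson(legibility_scores, fragility) < 0)           # expected sign: negative
```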
- What the Myers Index captures about legibility:
- A robust proxy for the state’s “presence on the ground” and reach into society.
- Sensitive to variation in the middle range of state capacity, where most countries reside, not just extreme high/low ends.
GLOBAL PATTERNS AND VALIDATION OF THE MYERS INDEX
- Worldwide variation (summary):
- The world mean Myers score ~8.21; substantial cross-country skewness driven by between-country differences in legibility.
- The sub-sample of Asia, Sub-Saharan Africa, and Middle East/North Africa shows the widest range; Western states show high legibility (low heaping).
- Subnational validation results (birth registration and birth certificate, N varies by data availability):
- Birth registration correlation with Myers (subnational): ~0.44 (raw) and ~0.52 (logged) for 393 observations.
- Birth certificate correlation (subnational): ~0.34 (raw) and ~0.41 (logged) for 282 observations.
- National validity results (correlation with governance/performance indicators):
- ICRG total index: ~0.48 (raw) and ~0.61 (logged).
- Internal conflict, bureaucratic quality: positive correlations consistent with theory.
- WGI indicators (government effectiveness, political stability, rule of law, regulatory quality, control of corruption): generally positive and significant.
- FSI (Fragile States Index): negative correlations overall; because higher FSI scores indicate greater fragility, the negative sign matches the expectation for legibility.
- BTI (Bertelsmann Transformation Index): positive associations with legibility, particularly for state capacity dimensions like Stateness and overall index.
- Interpretation: inverted Myers scores track with standard state-capacity measures and governance quality, supporting the Myers index as a valid proxy for legibility.
RESULTS: LEGIBILITY, TAXATION, AND COLLECTIVE GOODS
- Research design for testing legibility effects on taxation and public goods:
- Two dependent variables at the subnational level: tax revenue (income tax collected by province) and tax ratio (tax revenue / province GDP).
- Controls: regional GDP per capita, distance from capital, population density, terrain ruggedness; all variables log-transformed due to skewness.
- Regression specification: ordinary least squares (OLS) with standard errors clustered by country; Myers scores are lagged by one year to mitigate reverse causality; variables are standardized (mean 0, SD 1) so that coefficients are read in SD units; the Myers score is inverted so that higher values imply greater legibility.
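To make the "log, then standardize" part of this design concrete, here is a deliberately simplified Python sketch. It is bivariate only and omits the controls, the one-year lag, and the country-clustered standard errors described above, so it should not be read as the authors' model. The names `zscores` and `standardized_slope` and the data are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def zscores(values: Sequence[float]) -> List[float]:
    """Rescale values to mean 0 and SD 1."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]


def standardized_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """OLS slope of standardized log(y) on standardized log(x)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if min(x) <= 0 or min(y) <= 0:
        raise ValueError("log transform needs strictly positive values")
    zx = zscores([math.log(v) for v in x])
    zy = zscores([math.log(v) for v in y])
    # Both variables have mean 0, so the intercept is 0 and the slope is
    # sum(zx * zy) / sum(zx ** 2).
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)


# Hypothetical provinces: legibility scores and income tax revenue.
legibility_scores = [70.0, 75.0, 80.0, 85.0, 88.0]
tax_revenue = [1.1, 1.9, 2.4, 3.8, 4.1]
print(standardized_slope(legibility_scores, tax_revenue))
```

A slope of b from this kind of model reads as "a one SD increase in the (logged) predictor is associated with a change of b SD in the (logged) outcome", which is how the coefficients reported below are interpreted.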
- Subnational sample and model:
- 12 countries: Argentina, Brazil, Greece, India, Indonesia, Italy, Mexico, Philippines, South Africa, Tanzania, Thailand, Turkey.
- N of observations: 399; N of countries: 12.
- Key results (Table 4): legibility is positively and statistically significantly associated with tax outcomes.
- Tax Revenue (model 1): coefficient on Legibility = 0.319*.
- Tax Revenue (model 2): coefficient on Legibility = 0.104*.
- Tax Ratio (model 3): coefficient on Legibility = 0.0634*.
- Tax Ratio (model 4): coefficient on Legibility = 0.0587*.
- Significance: p < .05 for all legibility coefficients in the four models.
- Substantive interpretation (based on standardized coefficients):
- A one standard deviation increase in legibility is associated with about a 10% increase in tax revenues (summary interpretation given in the text).
- An equivalent one SD increase in legibility is associated with an increase of roughly 0.06 SD in the tax ratio.
- India example (illustrative magnitudes):
- Uttar Pradesh (UP) 2012 context: regional GDP per capita ≈ US$372; legibility in UP is in the bottom quartile; UP tax revenue ≈ US$2.3 billion in 2012.
- A one SD increase in legibility in UP implies roughly an additional ≈US$320 million in income tax revenue.
- Tax ratio implications: a one SD increase in legibility could raise UP’s tax ratio from about 3% to around 6.5%.
- Public goods outcomes (national-level analysis; cross-national data):
- Data: up to 111 countries; decade averages for dependent variables due to data timing; controls include logged GDP per capita, democracy, population density, terrain ruggedness; legibility inverted so higher is more legible.
- Dependent variables: infant mortality rate (log), adult literacy rate, primary school enrollment rate.
- Regression framework: OLS with decade fixed effects and country-clustered standard errors.
- Results (Table 6, full covariates): legibility is statistically significant and in the hypothesized direction for all three outcomes:
- Infant mortality: legibility negatively associated (more legible states have lower infant mortality).
- Adult literacy: legibility positively associated (more legible states have higher literacy).
- Primary school enrollment: legibility positively associated (more legible states have higher enrollment).
- Magnitudes (illustrative): in Kenya (2000s series), a 1 SD rise in legibility is associated with approximately 17 fewer infant deaths per 1,000 births, about 11 percentage-point higher literacy, and about 3 percentage-point higher primary enrollment.
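The decade averaging applied to the dependent variables above can be sketched as below. The helper name `decade_averages` and the yearly values are hypothetical; the notes do not specify how the authors handled decades with few observations.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict


def decade_averages(series: Dict[int, float]) -> Dict[int, float]:
    """Average a yearly series within each decade (keyed by the decade's first year)."""
    buckets = defaultdict(list)
    for year, value in series.items():
        buckets[year // 10 * 10].append(value)
    return {decade: mean(values) for decade, values in sorted(buckets.items())}


# Hypothetical infant mortality observations (deaths per 1,000 births).
infant_mortality = {1991: 80.0, 1995: 70.0, 2003: 60.0}
print(decade_averages(infant_mortality))
```

Averaging within decades lets sparsely observed outcomes be matched to census-based legibility scores even when the two are not measured in the same year.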
- Interpretation and caveats:
- The authors stress that the results are suggestive and not causal due to data limitations, potential omitted variables, and measurement error.
- Nevertheless, the consistent positive association between legibility and both tax collection and public goods is robust to bootstrapped standard errors and persists under lag specifications.
- The results support the view that legibility enables the state to employ more effective fiscal instruments and extractive capacity, and to deliver public goods more efficiently.
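For readers unfamiliar with the bootstrapped standard errors mentioned above, the following sketch shows a simple pairs bootstrap for a bivariate OLS slope. It is an assumption-laden illustration: the notes do not say which bootstrap variant the authors used, this version ignores clustering and controls, and the names `ols_slope` and `bootstrap_se` and the data are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Sequence


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope from a bivariate OLS regression of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def bootstrap_se(x: Sequence[float], y: Sequence[float],
                 reps: int = 500, seed: int = 0) -> float:
    """Pairs bootstrap: resample (x, y) rows with replacement, refit, take the SD."""
    if len(x) != len(y) or len(set(x)) < 2 or reps < 2:
        raise ValueError("need paired data with variation in x and reps >= 2")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(x)
    slopes = []
    while len(slopes) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # degenerate resample: slope undefined, draw again
        slopes.append(ols_slope(xs, [y[i] for i in idx]))
    return stdev(slopes)


# Hypothetical standardized data with some noise around a positive slope.
x = list(range(1, 11))
noise = [0.3, -0.2, 0.5, -0.4, 0.1, -0.6, 0.2, 0.4, -0.3, 0.0]
y = [2 * a + 1 + e for a, e in zip(x, noise)]
print(ols_slope(x, y), bootstrap_se(x, y))
```

The bootstrap standard error is the spread of the slope across resampled datasets; a small spread relative to the slope is what "robust to bootstrapped standard errors" refers to.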
- Additional implications from the tax/public goods results:
- Legibility may facilitate a shift from indirect to direct taxation by providing better information on citizens’ economic activity, enabling broader and more effective taxation regimes.
- This aligns with the broader literature linking information, governance, and state-building (e.g., Jones 1988; Kiser and Sacks 2009; Martin, Mehrotra, and Prasad 2009).
INTERPRETIVE AND THEORETICAL IMPLICATIONS
- The legibility framework helps reconcile different strands of state-capacity and development research by foregrounding information flows as a fundamental determinant of state performance.
- The authors argue legibility is a central, previously underappreciated component of state capacity, contributing to both the extraction (taxation) and provision (public goods) branches of governance.
- The data’s geographic and temporal breadth enables new research avenues, including:
- Using legibility itself as a dependent variable to study how states increase information at their disposal (incentives, artifacts, or both).
- Exploring legibility as a potential mediator or moderator in links between state legitimacy and capacity, as well as privacy and surveillance debates in contemporary politics.
- The authors emphasize that legibility is most impactful in the middle range of state capacity — where most states reside — and not solely in extremes of state strength or weakness.
- Broader theoretical implications for conflict and development:
- If legibility is crucial for both resource extraction and public provision, legible states may be better at preventing civil conflict by meeting citizens’ basic needs and avoiding under-provision or coercive overreach.
- The approach offers a way to decompose state capacity into its informational components, allowing sharper tests of competing explanations of development and conflict.
CONCLUSION AND FUTURE RESEARCH AGENDA
- Main takeaway: legibility, defined as the breadth and depth of state knowledge about its citizens and activities, is a crucial, understudied component of state capacity that facilitates centralized monitoring and enforcement, reduces free-riding, and improves taxation and public goods provision.
- The census-based Myers Index provides unprecedented temporal and geographic coverage for measuring legibility at national and subnational levels, enabling analyses that previously were not feasible.
- The authors highlight several key contributions:
- Reintroduction of legibility as a core variable in the state-capacity literature.
- Theoretical linkage between legibility and the state’s role in controlling opportunistic behavior in collective action settings.
- An original, broadly applicable legibility indicator with wide cross-national and historical applicability.
- Practical implications:
- Legibility is associated with both quantitative and qualitative changes in state power, enabling more effective revenue collection and broader deployment of fiscal instruments.
- The movement from indirect to direct taxation aligns with the expansion of comprehensive informational bases on citizens’ economic activity.
- Future research directions suggested by the authors:
- Treat legibility as a dependent variable and explore how states increase information: incentives versus artifacts, and potential political constraints.
- Examine how legibility interacts with legitimacy and privacy debates, including contemporary surveillance concerns.
- Use the Myers Index to study intrastate conflict, civil war, and development, particularly focusing on middle-range states where variation is largest.
- Acknowledgments: the paper notes support from NSF and other institutions and references a comprehensive bibliography for related state-capacity literature.
APPENDIX NOTES AND DATA CAVEATS (AS DISCUSSED IN THE TEXT)
- The authors acknowledge data limitations, measurement error, and missing data, particularly in weaker states, which may temper the observed effects.
- They perform robustness checks (e.g., bootstrapped standard errors) and discuss lag specifications to mitigate concerns about reverse causality.
- The Myers Index’s broad coverage and historical applicability offer a novel research tool for future studies on state capacity and related outcomes.
KEY NUMERICAL REFERENCES FOR QUICK RECALL
- Whipple Index:
$$W = \frac{N_{0,5}}{0.2 \times N} \times 100,$$
where $N_{0,5}$ is the count of ages ending in 0 or 5 and $N$ is the total number of ages counted.
Range: 100 (no heaping) to 500 (extreme heaping).
- Myers Index: scaled 0 to 90, with 0 indicating no heaping and 90 indicating extreme heaping; inverted in this study to produce legibility score $L$:
$$L = 90 - M,$$
where $M$ is the Myers score.
- Sample and coverage: 370 censuses (1960–2012); national and subnational data; mean raw Myers ≈ 8.21; Canada (1991) ≈ 0.18; Pakistan (1973) ≈ 45.67; France and Switzerland show low heaping overall; Sierra Leone (2004) shows high heaping.
- Subnational tax results (legibility coefficient on tax revenue or tax ratio):
- Tax Revenue (1): $\beta_{Leg} = 0.319^{*}$;
- Tax Revenue (2): $\beta_{Leg} = 0.104^{*}$;
- Tax Ratio (3): $\beta_{Leg} = 0.0634^{*}$;
- Tax Ratio (4): $\beta_{Leg} = 0.0587^{*}$.
- Illustrative quantitative magnitudes:
- 1 SD increase in legibility → ~10% increase in tax revenues (pooled subnational sample across the 12 countries).
- 1 SD increase in legibility → ~0.06 SD increase in tax ratio.
- India example (Uttar Pradesh, 2012): 1 SD legibility rise → ≈$320 million increase in income tax revenue; tax ratio increase from ~3% to ~6.5%.
- Kenya (2000s): 1 SD legibility rise → ≈17 fewer infant deaths per 1,000 births; ≈11 percentage-point rise in literacy; ≈3 percentage-point rise in enrollment.
- Public goods outcomes (national level, decade averages): legibility negatively associated with infant mortality; positively associated with literacy and enrollment; results hold after controlling for GDP per capita, democracy, population density, terrain.
CONNECTIONS TO BROADER THEMES
- This work situates legibility as a foundational component of state capacity, bridging informational infrastructure with both revenue extraction and service delivery.
- It provides a data-rich platform to test long-standing questions about how states become more capable and how information regimes shape developmental trajectories.
- The approach invites examination of the political economy of data generation, including incentives for governments to improve administrative records and the potential feedback with legitimacy and privacy concerns.