Lee, Zhang: Legibility and State Capacity — Comprehensive Notes

LEGIBILITY AND STATE CAPACITY: COMPREHENSIVE NOTES

  • Core idea: state capacity partly derives from legibility — the breadth and depth of the state’s knowledge about its citizens and their activities — and legibility is crucial for effective, centralized governance.
  • The authors build an informational account of state capacity, linking legibility to the state’s ability to curb free-riding in collective action dilemmas, with a focus on taxation and public goods provision.
  • Key theoretical anchor: James Scott’s concept of making local practices legible to central officials, but with a twist that legibility also supports social order and control of opportunistic behavior, not only extraction.
  • Definition of legibility (as used in this article):
    • (a) The state possesses information about local practices and populations.
    • (b) This information is rendered in standardized forms (e.g., cadastral maps, birth certificates, property registers) that are understandable to state administrators.
  • Why legibility matters for state capacity:
    • Enables monitoring private behavior and enforcing rules in collective action settings.
    • Centralizes information to resolve free-riding in public goods provision and taxation.
    • Provides a foundation for efficient social order and governance.
  • Two classic illustrations of legibility challenges before modern data systems:
    • Local weights and measures: units such as the pinte and the aune varied widely across locales in eighteenth-century France (e.g., the pinte was 0.93 L in Paris, 1.99 L in Seine-en-Montagne, and 3.33 L in Precy-sous-Thil; there were at least seventeen different aunes). This shows how local knowledge is hard to aggregate for centralized planning.
    • Naming conventions and identification: in pre-modern Britain, tracking property ownership, taxes, court records, etc., was hindered by the lack of standardized identifiers and by the prevalence of common names (roughly six given names, such as John, William, and Thomas, accounted for about 90% of males). This illustrates the administrative difficulty of achieving legibility without standard forms.
  • Operational payoff: legibility helps resolve fiscal free-riding by enabling the state to identify contributors and under-contributors and to enforce tax rules.
  • Historical example highlighting legibility’s role in taxation: Vauban’s 1686 proposal to conduct annual censuses to know the number of subjects and their resources, wealth, and poverty — signaling the long-standing view that granular population information supports taxation and governance.
  • Pre-modern to modern shift: even with cash economies and cross-border earnings, the information problem persists; legibility remains essential for taxing and delivering public goods.
  • Swedish historical case as a concrete demonstration of legibility’s power:
    • Beginning around 1540, royal bailiffs compiled tax registers of peasant households.
    • In 1628, Sweden undertook early cadastral mapping (roughly 12,000 maps detailing boundaries, ownership, tenure, cultivation, and valuation).
    • Ecclesiastical Laws of 1686 tasked clergy with maintaining parishioners’ tax lists.
    • By 1920, ~80% of the economically active population in Sweden was registered with the tax authority.
    • Legibility enabled broad-based taxation: between 1690 and 1873, Sweden could levy taxes in kind to support the army, mobilizing about 1.5% of the population for military service — higher militarization than several major powers.
    • Takeaway: legibility underpinned stable, high-yield taxation and broad social order.

OPERATIONALIZING LEGIBILITY

  • Research goal: measure legibility across countries and over time using a coarse, widely comparable data series — national censuses.
  • Why census data? Broad geographic and temporal coverage; long-standing traditional role in population information; suitable for cross-national and cross-time comparisons.
  • Improvement over prior work: move beyond a binary “census conducted or not” to quantify the quality/accuracy of census data, focusing on age information.
  • Rationale for age accuracy as legibility proxy:
    • Age information is collected in almost all censuses, enhancing comparability.
    • The smoothness of true age distributions allows detection of data errors via age-heaping patterns.
    • Age information is foundational for many administrative tasks (voting eligibility, military service, driver’s licenses, education, benefits).
  • Conceptual link: age-heaping indicates general legibility of population data; if age data are inaccurate, other administrative data are likely less legible as well.
  • Data-generating processes for census age data:
    • Mechanism A: Low age-awareness in the population due to limited state-society interaction (remote regions, weak state presence) leads to inaccurate self-reports.
    • Mechanism B: Enumerators may record ages inaccurately due to difficult field conditions, reliance on second-hand information, or personal biases.
    • Both mechanisms produce age misreporting, interpreted as a broader legibility problem.
  • Historical anecdotes reinforcing enumeration challenges:
    • Nepal 1961 example: a hill region enumerator relied on visible settlements and second-hand reporting, risking inaccurate counts.
    • Frontier/remote areas in 19th-century United States faced logistical and security challenges to enumeration, increasing opportunities for shirking and undercounting.
  • The core empirical strategy: create a census-based legibility indicator by measuring age accuracy across censuses and countries.
  • Data availability and scope:
    • 370 censuses spanning 1960–2012.
    • Substantial temporal and geographic coverage; many observations are national-level, with some subnational units.
    • About half of observations come from original data collection from national census reports.
  • Key empirical predictions: more legible populations (as measured by census age data accuracy) will exhibit better state performance in taxation and public goods provision.
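
The age-heaping idea above can be illustrated with a minimal Python sketch. It assumes a made-up population with one person at every age from 20 to 69 and a hypothetical rounding rule (`round_report`) standing in for Mechanisms A and B; neither comes from the paper. It only shows how a smooth true age distribution turns into visible spikes at terminal digits 0 and 5 once reports are rounded.

```python
from collections import Counter


def terminal_digit_shares(ages):
    """Return the share of ages ending in each terminal digit 0-9."""
    counts = Counter(age % 10 for age in ages)
    total = len(ages)
    return {digit: counts.get(digit, 0) / total for digit in range(10)}


def round_report(age):
    """Hypothetical misreporting rule (an assumption for illustration):
    ages ending in 4 or 6 are reported as the neighbouring multiple of 5,
    ages ending in 1 or 9 as the neighbouring multiple of 10."""
    digit = age % 10
    if digit in (4, 6):
        return age - digit + 5
    if digit == 1:
        return age - 1
    if digit == 9:
        return age + 1
    return age


# Made-up "true" ages: one person at each age from 20 to 69.
true_ages = list(range(20, 70))
reported_ages = [round_report(age) for age in true_ages]

true_shares = terminal_digit_shares(true_ages)
reported_shares = terminal_digit_shares(reported_ages)

for digit in range(10):
    print(digit, round(true_shares[digit], 2), round(reported_shares[digit], 2))
```

In the true distribution every terminal digit holds the same share, while the reported distribution piles up on 0 and 5 and empties out the neighbouring digits. That departure from smoothness is what the heaping indices summarize.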

DATA-GENERATING PROCESSES AND MEASURES

  • Core measurement tool: the Myers Index of age-heaping, complemented by the Whipple Index.
  • Why focus on the Myers Index?
    • Accounts for mortality, i.e., the decline in population counts at higher ages, through its blending procedure (unlike Whipple in some uses).
    • Suitable for cross-country comparisons and for datasets with varying population sizes.
    • Produces a continuous measure with interpretable scale, useful for regression analyses.
  • Whipple Index (definition and interpretation):
    • Measures disproportionate reporting of ages ending in 0 or 5.
    • Formula (conceptual):
      $W = \frac{N_{0,5}}{0.2 \times N} \times 100$, where $N_{0,5}$ is the number of records ending in 0 or 5 and $N$ is the total number of records considered.
    • Range: 100 (no preference for end-in-0/5) to 500 (all ages end in 0 or 5).
  • Myers Index (conceptual description):
    • A blending technique that accounts for age distributions across ten overlapping age bins (e.g., [15–64], [16–65], …, [24–73]).
    • For each bin, compute digit-ending distribution and then average across the ten starting points to obtain a single index.
    • Scale: 0 to 90, where 0 indicates no age-digit heaping and 90 indicates extreme heaping on a single terminal digit.
    • In this study, Myers is inverted so that higher values indicate greater legibility. Concretely, if $M$ is the Myers Index, legibility is defined as
      $L = 90 - M$.
  • Practical notes on index interpretation:
    • The Myers score tends to be upwardly biased for very small populations (less than about 5,000 individuals).
    • The analysis in the paper focuses on national-level data or large subnational units where the index is more reliable.
  • Data coverage and patterns (descriptive):
    • Global sample: 370 censuses; mean Myers score ~8.21; wide range up to ~45.67.
    • Distribution is highly right-skewed overall; after de-meaning by country, the distribution approximates normality.
    • Regions with wider variation in legibility: Asia, Sub-Saharan Africa, and the Middle East/North Africa tend to show broader ranges; Western democracies show higher legibility (lower age-heaping).
  • Validation strategy (link to other indicators):
    • Subnational validity: relate Myers scores to birth registration and birth certificates where subnational data exist.
    • National validity: relate inverted Myers to established state-capacity proxies (ICRG, WGI, FSI, BTI).
    • Expected signs: positive correlation between legibility and both birth registration and birth certificates; positive correlation with measures of governance/capacity; negative correlation with fragile-state indicators (as appropriate).
  • What the Myers Index captures about legibility:
    • A robust proxy for the state’s “presence on the ground” and reach into society.
    • Sensitive to variation in the middle range of state capacity, where most countries reside, not just extreme high/low ends.
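
The two indices and the inversion can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the Whipple window of ages 23 to 62 is the conventional choice and is assumed here, and `myers_index` follows the windowed description in these notes (ten 50-year windows starting at ages 15 to 24, terminal-digit shares averaged across windows) rather than any particular published implementation. The function names and test populations are hypothetical.

```python
def whipple_index(ages, low=23, high=62):
    """W = N_{0,5} / (0.2 * N) * 100 over ages in [low, high].

    The 23-62 window is an assumed convention, not taken from the notes.
    """
    considered = [a for a in ages if low <= a <= high]
    if not considered:
        raise ValueError("no ages fall inside the window")
    n_05 = sum(1 for a in considered if a % 5 == 0)
    return n_05 / (0.2 * len(considered)) * 100


def myers_index(ages, first_start=15, window=50):
    """Windowed heaping index on a 0-90 scale (assumed variant).

    For each of ten starting ages, compute terminal-digit shares inside
    a 50-year window, average the shares across windows, then take half
    the summed absolute deviation from 10 percent per digit.
    """
    avg_shares = [0.0] * 10
    used = 0
    for start in range(first_start, first_start + 10):
        in_window = [a for a in ages if start <= a < start + window]
        if not in_window:
            continue  # skip windows with no records
        used += 1
        for digit in range(10):
            hits = sum(1 for a in in_window if a % 10 == digit)
            avg_shares[digit] += hits / len(in_window)
    if used == 0:
        raise ValueError("no ages fall inside any window")
    avg_shares = [s / used for s in avg_shares]
    return 0.5 * sum(abs(100 * s - 10) for s in avg_shares)


def legibility(ages):
    """Inverted score L = 90 - M, so higher values mean more legible."""
    return 90 - myers_index(ages)


# Made-up populations: one person per age vs. everyone reporting a round age.
uniform = list(range(15, 74))
heaped = [20, 30, 40, 50, 60] * 10

print(whipple_index(uniform), whipple_index(heaped))
print(myers_index(uniform), myers_index(heaped))
print(legibility(uniform), legibility(heaped))
```

The uniform population sits at the no-heaping end of both scales and the fully heaped one at the extreme end, matching the ranges stated above. Real census tabulations would supply counts by single year of age rather than a list of individuals.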

GLOBAL PATTERNS AND VALIDATION OF THE MYERS INDEX

  • Worldwide variation (summary):
    • The world mean Myers score ~8.21; substantial cross-country skewness driven by between-country differences in legibility.
    • The sub-sample of Asia, Sub-Saharan Africa, and Middle East/North Africa shows the widest range; Western states show high legibility (low heaping).
  • Subnational validation results (birth registration and birth certificate, N varies by data availability):
    • Birth registration correlation with Myers (subnational): ~0.44 (raw) and ~0.52 (logged) for 393 observations.
    • Birth certificate correlation (subnational): ~0.34 (raw) and ~0.41 (logged) for 282 observations.
  • National validity results (correlation with governance/performance indicators):
    • ICRG total index: ~0.48 (raw) and ~0.61 (logged).
    • Internal conflict, bureaucratic quality: positive correlations consistent with theory.
    • WGI indicators (government effectiveness, political stability, rule of law, regulatory quality, control of corruption): generally positive and significant.
    • FSI (Fragile States Index): overall negative relationships where higher FSI means worse outcomes (negative sign matches expectation for legibility).
    • BTI (Bertelsmann Transformation Index): positive associations with legibility, particularly for state capacity dimensions like Stateness and overall index.
  • Interpretation: inverted Myers scores track with standard state-capacity measures and governance quality, supporting the Myers index as a valid proxy for legibility.
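
A minimal Python sketch of the validation logic follows. All numbers are made up for illustration and do not reproduce the paper's data or results. The notes do not say exactly which quantity is logged, so the sketch assumes that "logged" means taking the log of the Myers score and negating it so that higher values still mean more legible.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical units: Myers scores and birth registration rates (percent).
myers_scores = [0.5, 2.0, 5.0, 12.0, 30.0]
birth_registration = [99.0, 95.0, 85.0, 60.0, 35.0]

# Raw legibility: inverted Myers, L = 90 - M.
legibility_raw = [90 - m for m in myers_scores]

# Logged legibility (assumed definition): negative log of the Myers score.
legibility_log = [-math.log(m) for m in myers_scores]

r_raw = pearson(legibility_raw, birth_registration)
r_log = pearson(legibility_log, birth_registration)
print(round(r_raw, 3), round(r_log, 3))
```

Both correlations come out positive on this toy data, which is the expected sign for birth registration. For an indicator where higher values mean worse outcomes, such as the FSI, the same function would be expected to return a negative value.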

RESULTS: LEGIBILITY, TAXATION, AND COLLECTIVE GOODS

  • Research design for testing legibility effects on taxation and public goods:
    • Two dependent variables at the subnational level: tax revenue (income tax collected by province) and tax ratio (tax revenue / province GDP).
    • Controls: regional GDP per capita, distance from capital, population density, terrain ruggedness; all variables log-transformed due to skewness.
    • Regression specification: ordinary least squares (OLS), robust to clustering by country; Myers scores lagged by one year to mitigate reverse causality; standardized variables (mean 0, SD 1) so coefficients reflect SD changes; legibility is inverted so higher values imply greater legibility.
  • Subnational sample and model:
    • 12 countries: Argentina, Brazil, Greece, India, Indonesia, Italy, Mexico, Philippines, South Africa, Tanzania, Thailand, Turkey.
    • N of observations: 399; N of countries: 12.
  • Key results (Table 4): legibility is positively and statistically significantly associated with tax outcomes.
    • Tax Revenue (model 1): coefficient on Legibility = 0.319*.
    • Tax Revenue (model 2): coefficient on Legibility = 0.104*.
    • Tax Ratio (model 3): coefficient on Legibility = 0.0634*.
    • Tax Ratio (model 4): coefficient on Legibility = 0.0587*.
    • Significance: p < .05 for all legibility coefficients in the four models.
  • Substantive interpretation (based on standardized coefficients):
    • A one standard deviation increase in legibility is associated with about a 10% increase in tax revenues (summary interpretation given in the text).
    • An equivalent one SD increase in legibility is associated with roughly a 6% SD increase in the tax ratio (relative to the SD of the dependent variable).
  • India example (illustrative magnitudes):
    • Uttar Pradesh (UP) 2012 context: regional GDP per capita ≈ US$372; legibility in UP is in the bottom quartile; UP tax revenue ≈ US$2.3 billion in 2012.
    • A one SD increase in legibility in UP implies roughly an additional ≈US$320 million in income tax revenue.
    • Tax ratio implications: a one SD increase in legibility could raise UP’s tax ratio from about 3% to around 6.5%.
  • Public goods outcomes (national-level analysis; cross-national data):
    • Data: up to 111 countries; decade averages for dependent variables due to data timing; controls include logged GDP per capita, democracy, population density, terrain ruggedness; legibility inverted so higher is more legible.
    • Dependent variables: infant mortality rate (log), adult literacy rate, primary school enrollment rate.
    • Regression framework: OLS with decade fixed effects and country-clustered standard errors.
    • Results (Table 6, full covariates): legibility is statistically significant and in the hypothesized direction for all three outcomes:
      • Infant mortality: legibility negatively associated (more legible states have lower infant mortality).
      • Adult literacy: legibility positively associated (more legible states have higher literacy).
      • Primary school enrollment: legibility positively associated (more legible states have higher enrollment).
    • Magnitudes (illustrative): in Kenya (2000s series), a 1 SD rise in legibility is associated with approximately 17 fewer infant deaths per 1,000 births, about 11 percentage-point higher literacy, and about 3 percentage-point higher primary enrollment.
  • Interpretation and caveats:
    • The authors stress that the results are suggestive and not causal due to data limitations, potential omitted variables, and measurement error.
    • Nevertheless, the consistent positive association between legibility and both tax collection and public goods is robust to bootstrapped standard errors and persists under lag specifications.
    • The results support the view that legibility enables the state to employ more effective fiscal instruments and extractive capacity, and to deliver public goods more efficiently.
  • Additional implications from the tax/public goods results:
    • Legibility may facilitate a shift from indirect to direct taxation by providing better information on citizens’ economic activity, enabling broader and more effective taxation regimes.
    • This aligns with the broader literature linking information, governance, and state-building (e.g., Jones 1988; Kiser and Sacks 2009; Martin, Mehrotra, and Prasad 2009).
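
The standardization and lagging steps in the research design can be sketched in Python as below. This is a deliberately reduced illustration with made-up province data: it fits a bivariate regression only, leaving out the paper's controls and country-clustered standard errors, and it logs only the tax variable. The dictionaries and province labels are hypothetical.

```python
import math
from statistics import mean, stdev

# Hypothetical province-year data (not from the paper).
legibility = {("A", 2000): 70.0, ("B", 2000): 75.0, ("C", 2000): 80.0,
              ("D", 2000): 85.0, ("E", 2000): 88.0}
tax_revenue = {("A", 2001): 10.0, ("B", 2001): 14.0, ("C", 2001): 15.0,
               ("D", 2001): 21.0, ("E", 2001): 24.0}


def zscores(values):
    """Standardize to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Pair each outcome in year t with legibility in year t - 1 (one-year lag).
pairs = [(legibility[(prov, year - 1)], math.log(rev))
         for (prov, year), rev in tax_revenue.items()
         if (prov, year - 1) in legibility]

x = zscores([p[0] for p in pairs])  # lagged legibility, standardized
y = zscores([p[1] for p in pairs])  # log tax revenue, standardized

# With both variables centered, the OLS slope is sum(xy) / sum(xx).
beta = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
print(round(beta, 3))
```

Because both variables are standardized, `beta` reads as the change in the outcome, in SD units, associated with a one SD change in lagged legibility. In this bivariate case it equals the Pearson correlation; with controls added, as in the paper's models, that equality no longer holds.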

INTERPRETIVE AND THEORETICAL IMPLICATIONS

  • The legibility framework helps reconcile different strands of state-capacity and development research by foregrounding information flows as a fundamental determinant of state performance.
  • The authors argue legibility is a central, previously underappreciated component of state capacity, contributing to both the extraction (taxation) and provision (public goods) branches of governance.
  • The data’s geographic and temporal breadth enables new research avenues, including:
    • Using legibility itself as a dependent variable to study how states increase information at their disposal (incentives, artifacts, or both).
    • Exploring legibility as a potential mediator or moderator in links between state legitimacy and capacity, as well as privacy and surveillance debates in contemporary politics.
  • The authors emphasize that legibility is most impactful in the middle range of state capacity — where most states reside — and not solely in extremes of state strength or weakness.
  • Broader theoretical implications for conflict and development:
    • If legibility is crucial for both resource extraction and public provision, legible states may be better at preventing civil conflict by meeting citizens’ basic needs and avoiding under-provision or coercive overreach.
    • The approach offers a way to decompose state capacity into its informational components, allowing sharper tests of competing explanations of development and conflict.

CONCLUSION AND FUTURE RESEARCH AGENDA

  • Main takeaway: legibility, defined as the breadth and depth of state knowledge about its citizens and activities, is a crucial, understudied component of state capacity that facilitates centralized monitoring and enforcement, reduces free-riding, and improves taxation and public goods provision.
  • The census-based Myers Index provides unprecedented temporal and geographic coverage for measuring legibility at national and subnational levels, enabling analyses that previously were not feasible.
  • The authors highlight several key contributions:
    • Reintroduction of legibility as a core variable in the state-capacity literature.
    • Theoretical linkage between legibility and the state’s role in controlling opportunistic behavior in collective action settings.
    • An original, broadly applicable legibility indicator with wide cross-national and historical applicability.
  • Practical implications:
    • Legibility is associated with both quantitative and qualitative changes in state power, enabling more effective revenue collection and broader deployment of fiscal instruments.
    • The movement from indirect to direct taxation aligns with the expansion of comprehensive informational bases on citizens’ economic activity.
  • Future research directions suggested by the authors:
    • Treat legibility as a dependent variable and explore how states increase information: incentives versus artifacts, and potential political constraints.
    • Examine how legibility interacts with legitimacy and privacy debates, including contemporary surveillance concerns.
    • Use the Myers Index to study intrastate conflict, civil war, and development, particularly focusing on middle-range states where variation is largest.
  • Acknowledgments: the paper notes support from NSF and other institutions and references a comprehensive bibliography for related state-capacity literature.

Appendix notes and data caveats (as discussed in the text)

  • The authors acknowledge data limitations, measurement error, and missing data, particularly in weaker states, which may temper the observed effects.
  • They perform robustness checks (e.g., bootstrapped standard errors) and discuss lag specifications to mitigate concerns about reverse causality.
  • The Myers Index’s broad coverage and historical applicability offer a novel research tool for future studies on state capacity and related outcomes.

Key numerical references for quick recall

  • Whipple Index:
    $W = \frac{N_{0,5}}{0.2 \times N} \times 100$, where $N_{0,5}$ is the count of ages ending in 0 or 5 and $N$ is the total counted ages.
    Range: 100 (no heaping) to 500 (extreme heaping).
  • Myers Index: scaled 0 to 90, with 0 indicating no heaping and 90 indicating extreme heaping; inverted in this study to produce legibility score $L$:
    $L = 90 - M$,
    where $M$ is the Myers score.
  • Sample and coverage: 370 censuses (1960–2012); national and subnational data; mean raw Myers ≈ 8.21; Canada (1991) ≈ 0.18; Pakistan (1973) ≈ 45.67; France and Switzerland show low heaping overall; Sierra Leone (2004) shows high heaping.
  • Subnational tax results (legibility coefficient on tax revenue or tax ratio):
    • Tax Revenue (1): $\beta_{\text{Leg}} = 0.319^{*}$;
    • Tax Revenue (2): $\beta_{\text{Leg}} = 0.104^{*}$;
    • Tax Ratio (3): $\beta_{\text{Leg}} = 0.0634^{*}$;
    • Tax Ratio (4): $\beta_{\text{Leg}} = 0.0587^{*}$.
  • Illustrative quantitative magnitudes:
    • 1 SD increase in legibility → ~10% increase in tax revenues (pooled subnational sample across the 12 countries).
    • 1 SD increase in legibility → ~6% SD increase in tax ratio.
    • India example (Uttar Pradesh, 2012): 1 SD legibility rise → ≈$320 million increase in income tax revenue; tax ratio increase from ~3% to ~6.5%.
    • Kenya (2000s): 1 SD legibility rise → ≈17 fewer infant deaths per 1,000 births; ≈11 percentage-point rise in literacy; ≈3 percentage-point rise in enrollment.
  • Public goods outcomes (national level, decade averages): legibility negatively associated with infant mortality; positively associated with literacy and enrollment; results hold after controlling for GDP per capita, democracy, population density, terrain.

Connections to broader themes

  • This work situates legibility as a foundational component of state capacity, bridging informational infrastructure with both revenue extraction and service delivery.
  • It provides a data-rich platform to test long-standing questions about how states become more capable and how information regimes shape developmental trajectories.
  • The approach invites examination of the political economy of data generation, including incentives for governments to improve administrative records and the potential feedback with legitimacy and privacy concerns.