Suppose the true data-generating process for an outcome of interest is Y = β0 + β1T + β2X + u, for some treatment T and some spatial proximity factor X. If we cannot include X, the estimated treatment effect suffers from omitted variable bias: E(β̂1) = β1 + β2·Cov(T, X)/Var(T). The bias is nonzero whenever the omitted spatial factor affects the outcome (β2 ≠ 0) and is correlated with the treatment (Cov(T, X) ≠ 0). Spatial factors can therefore bias the estimators, so including them helps reduce bias in the model. Example: when estimating home prices with P = β0 + β1(year built) + u, omitted spatial factors such as comparable home prices in the surrounding areas can affect the estimate. Including spatial factors also helps with inference, because omitted variables lead to biased standard errors.
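As a check on the bias formula, here is a minimal simulation sketch (the coefficient values and DGP are hypothetical; numpy assumed available) comparing the short regression that omits X with β2·Cov(T, X)/Var(T):

```python
import numpy as np

np.random.seed(0)
n = 100_000
# Hypothetical DGP: the treatment T is correlated with the spatial factor X
X = np.random.normal(size=n)
T = 0.5 * X + np.random.normal(size=n)
u = np.random.normal(size=n)
b0, b1, b2 = 1.0, 2.0, 3.0          # assumed true coefficients
Y = b0 + b1 * T + b2 * X + u

# Short regression of Y on T only (omits X)
b1_short = np.cov(T, Y)[0, 1] / np.var(T, ddof=1)

# Omitted variable bias formula: b2 * Cov(T, X) / Var(T)
predicted_bias = b2 * np.cov(T, X)[0, 1] / np.var(T, ddof=1)

print(f"short-regression estimate: {b1_short:.3f}")
print(f"true b1 + predicted bias : {b1 + predicted_bias:.3f}")
```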
Spatial Correlation in Errors: If errors are correlated across space, standard errors computed using traditional methods will be underestimated, leading to overconfidence in statistical significance. Using a spatial weighting matrix allows for heteroskedasticity-robust and spatially corrected standard errors (e.g., HAC-type estimators for spatial models).
Eliminates time-invariant omitted variable bias - By including fixed effects, we remove bias stemming from unobservable, location-specific factors that do not change over time (e.g., regional culture, permanent geographic features). This rules out measuring any place-based interactions with the fixed effect (collinearity): writing the model as y = Xθ + Zδ + v, spatial differencing gives y − Gy = (X − GX)θ + (Z − GZ)δ + (v − Gv), which removes both GXθ and GZδ; because the same spatial weighting G applies to all the variables (GX, GZ, Gv), differencing cancels out these place-based effects. Too many fixed effects can cause overfitting. FE models help in identifying the causal effect of time-varying variables by ruling out confounding from unobserved, stable place-based characteristics. The empirical setting in which to use them is one with multiple observations per unit over time; they work best when we observe the same locations over multiple time periods, e.g., cities (see the within-transformation sketch after the list below).
❌ When there is little variation within units over time.
If most of the variation in the data is across locations (not over time), FE might not be useful.
Example: Studying how climate affects agricultural yields; since climate doesn’t change much in a given location, FE would wipe out the key explanatory variable.
✔ Causal effects net of permanent location differences.
FE models eliminate bias from unobserved, time-invariant characteristics.
Example: If we estimate the effect of minimum wage laws on employment, FE ensures that differences in state-level economic conditions don’t bias our results.
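A minimal sketch of the within (fixed-effects) transformation, using a hypothetical city-year panel with made-up columns `city`, `year`, `x`, and `y` (pandas and statsmodels assumed available):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical panel: one row per city-year
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2001, 2000, 2001, 2000, 2001],
    "x":    [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
    "y":    [2.0, 4.1, 7.0, 9.2, 11.9, 16.1],
})

# Within transformation: demean y and x by city, which absorbs any
# time-invariant, city-specific factor (the fixed effect)
demeaned = df[["y", "x"]] - df.groupby("city")[["y", "x"]].transform("mean")

# OLS on the demeaned data gives the fixed-effects slope on x
fe_fit = sm.OLS(demeaned["y"], demeaned["x"]).fit()
print(fe_fit.params)
```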
Possible interpretations of v include random shocks and cultural or institutional factors. When considering interactions between people near each other, an omitted factor could be the influence of peer inmates.
Peer Influence Affects Both Behavior and Recidivism
Prisoners who are surrounded by well-behaved inmates may be more likely to engage in good behavior themselves.
These same prisoners may also experience lower recidivism rates because they develop pro-social behaviors, acquire job skills, or avoid gang involvement.
If we don’t control for peer influence, we might mistakenly attribute lower recidivism entirely to individual good behavior, rather than recognizing the role of inmate interactions.
If peer effects are positive (i.e., being surrounded by well-behaved inmates encourages good behavior and reduces recidivism):
The estimated effect of good behavior on reducing recidivism will be overstated (upward bias) because it picks up both the true effect of behavior and the omitted effect of peer influence.
If peer effects are negative (e.g., violent inmates force others into gangs or illegal activity despite individual good behavior):
The estimated effect of good behavior might be understated (downward bias) because peer effects are suppressing the full benefits of good behavior on recidivism.
* In data with arbitrary spatial correlation, the homoskedastic t-statistic becomes dramatically inflated, leading to vast over-rejection of the null hypothesis. This spatial dependence can lead to biased inference if not properly accounted for in the standard errors. In many economic studies, spatial correlation arises because observations that are geographically close tend to have similar characteristics due to shared environmental, institutional, or economic factors. If this correlation is ignored, standard regression techniques may underestimate the true variability of the estimates, leading to overly narrow confidence intervals and inflated statistical significance. Ignoring spatial dependence leads to artificially small standard errors, increasing the likelihood of Type I errors (rejecting the null hypothesis when it is actually true). When errors are spatially autocorrelated, nearby observations do not provide completely independent information.
- (Go over how the matrix is constructed again.) Conley spatial errors are constructed by allowing for and estimating the covariance between observations within a set distance of each other. This is like clustering, but allowing observations to belong to multiple groups. Conley (1999) standard errors are designed to account for spatial dependence in regression models; they provide a way to correct standard errors for spatial autocorrelation and heteroskedasticity. Calculate the product of residuals for each pair of observations, ε̂_i·ε̂_j; these residual covariances capture the degree of spatial dependence between observations i and j. The resulting matrix accounts for spatial autocorrelation by incorporating information from neighboring observations, and the final standard errors are derived from the diagonal elements of the adjusted variance-covariance matrix. Conley standard errors are particularly useful in empirical settings where spatial correlation is expected; in environmental and urban economics, for example, pollution, real estate prices, or urban development data often have spatial spillovers. A key decision in applying Conley standard errors is choosing the maximum distance d_max beyond which spatial correlation is ignored (a minimal sketch follows the pros/cons list below).
Larger d_max (More Spatial Dependence Accounted For)
Pros:
Better correction for spatial correlation if residuals are correlated over long distances.
More conservative (wider) standard errors, reducing the risk of Type I errors.
Cons:
Higher computational burden (quadratic in the number of observations).
Potential over-adjustment, making estimates less efficient.
Smaller d_max (Only Local Correlations Considered)
Pros:
Reduces computational costs and simplifies estimation.
Can prevent over-smoothing and retain efficiency.
Cons:
May under-correct for spatial dependence, leading to underestimated standard errors.
Risk of Type I errors (false positives).
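A minimal numpy sketch of a Conley-type spatial HAC variance, under stated assumptions: a uniform kernel that weights all pairs within d_max equally (rather than a distance-decaying kernel), hypothetical simulated coordinates and regressors, and no small-sample correction:

```python
import numpy as np

def conley_se(y, X, coords, d_max):
    """Conley-type spatial HAC standard errors with a uniform kernel.
    y: (n,) outcome; X: (n, k) regressors incl. a constant;
    coords: (n, 2) locations in the same units as d_max."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Pairwise distances; kernel weight 1 within d_max, 0 beyond
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    kernel = (dist <= d_max).astype(float)

    # "Meat": sum over pairs of K(d_ij) * e_i * e_j * x_i x_j'
    score = X * resid[:, None]
    meat = score.T @ kernel @ score

    vcov = XtX_inv @ meat @ XtX_inv   # sandwich variance
    return beta, np.sqrt(np.diag(vcov))

# Hypothetical usage with simulated data
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=200)
beta, se = conley_se(y, X, coords, d_max=1.0)
print(beta, se)
```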
DIFF in DIFF assumptions
Parallel Trends assumption - Reason
The control group is a valid counterfactual for the changes we would have seen in the treated group absent treatment
Choose control groups with characteristics similar to the treated group's and not affected by spatial spillovers
No anticipation of treatment
Independent sampling
No simultaneous shocks
Treatment confined to the treated (SUTVA)
SUTVA assumes that the treatment assigned to one unit does not affect the outcomes of other units.
Avoid Using Immediate Neighboring Cities as Controls
If crime is displaced from San Francisco to nearby areas, those areas would experience indirect effects of the policy. Using them as a control group would lead to biased estimates because they are not truly untreated.
Consider More Distant Cities as Controls
A better approach would be to select control cities that are socioeconomically and demographically similar to San Francisco but far enough away to be unaffected by spillover effects.
This assumes parallel trends and similar characteristics
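A minimal sketch of the two-group, two-period diff-in-diff regression for a setting like the San Francisco example, using a hypothetical made-up city-year panel (statsmodels formula API assumed available):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: `treated` marks the policy city, `post` marks post-policy years
df = pd.DataFrame({
    "crime":   [50, 48, 40, 39, 52, 51, 49, 50],
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "post":    [0, 0, 1, 1, 0, 0, 1, 1],
})

# The coefficient on treated:post is the diff-in-diff estimate
did = smf.ols("crime ~ treated * post", data=df).fit()
print(did.params["treated:post"])
```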
Regression Discontinuity
Threshold: treatment is determined fully or partially by a threshold rule or cutoff
Need a continuous running variable that determines whether an observation is above or below the threshold
This may be problematic for spatial regression discontinuity. The most relevant assumption to consider concerns the running variable: units are assumed to be similar on either side of the boundary. However, if this boundary is also an administrative boundary, that can create issues: different laws and different services (from education to policing to housing policies) may apply to different racial groups on either side of the line. A jump at the boundary might therefore not solely reflect the differences created by the redlining boundary itself.
Violation of Continuity Assumption: If current public service provision differs sharply at the redlining boundary, then individuals on either side may differ in ways beyond just past redlining exposure (e.g., differences in education, policing, or housing policies). This makes it harder to attribute observed differences solely to redlining effects.
The threshold has to be the only thing changing at the redlining boundary; because there are administrative differences, other variables will also be affected by that line or threshold
These other effects can affect the outcome variable
| Feature | LASSO (Least Absolute Shrinkage and Selection Operator) | Random Forest |
| --- | --- | --- |
| Model Type | Linear, penalized regression | Nonlinear, ensemble of decision trees |
| Strengths | Feature selection, interpretability, works well with high-dimensional data | Captures complex nonlinear relationships, handles interactions well, robust to overfitting |
| Weaknesses | Struggles with nonlinear relationships, sensitive to multicollinearity | Less interpretable, requires more computation |
| Handling of Features | Shrinks some coefficients to zero (automatic selection) | Uses all features but can assess importance |
| Overfitting Risk | Lower due to regularization | Higher, but mitigated through bagging and averaging |
LASSO – Predicting the Impact of Policy Changes on Housing Prices
In estimating the effect of a new zoning law on property values, many potential predictors exist (e.g., income levels, school quality, crime rates).
LASSO helps by selecting the most relevant variables, reducing overfitting, and improving interpretability.
Random Forest – Forecasting Labor Market Outcomes
Suppose we want to predict wage growth based on education, experience, industry, and regional characteristics.
The relationship between these variables may be nonlinear and involve interactions (e.g., the effect of education may differ across industries).
Random Forest captures these complexities effectively, making it a strong tool for labor economics forecasting.
RF: sparse parameter space, many irrelevant covariates, when we care more about predicting an outcome than interpreting the parameters of our model, and when the outcome is determined by many interactions
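A minimal Random Forest sketch for the wage-growth example above, using scikit-learn on a synthetic stand-in dataset (the column names and data-generating process are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labor-market data
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "education":  rng.integers(10, 20, n),
    "experience": rng.integers(0, 40, n),
    "industry":   rng.integers(0, 5, n),
    "region":     rng.integers(0, 10, n),
})
# Nonlinear outcome with an education-by-industry interaction
df["wage_growth"] = (0.02 * df["education"] * (df["industry"] > 2)
                     + 0.01 * np.sqrt(df["experience"])
                     + rng.normal(0, 0.05, n))

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="wage_growth"), df["wage_growth"], random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_train, y_train)

print("test R^2:", rf.score(X_test, y_test))
print(dict(zip(X_train.columns, rf.feature_importances_.round(3))))
```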
9. Why Use Cross Validation?
Cross validation (CV) helps assess a machine learning model’s performance by splitting the data into multiple subsets. This approach reduces the risk of overfitting and provides a more reliable estimate of how the model will generalize to unseen data. For a model like LASSO, CV is especially useful for selecting the regularization parameter (λ), which controls the penalty for large coefficients and affects feature selection.
Procedure for Cross Validation with LASSO:
Split the Data: Divide the dataset into k folds (e.g., 5 or 10).
Train the Model: For each fold:
Use k−1 folds to train the LASSO model.
Vary the regularization parameter λ across a grid of possible values.
Validate the Model: Use the remaining fold to calculate the performance metric (e.g., MSE).
Repeat the Process: Rotate through all folds, so each fold serves as the validation set once.
Aggregate Results: Average the performance metric across all folds for each λ.
Select the Optimal λ: Choose the λ value that minimizes the average validation error.
Choosing the Evaluation Metric:
The most common choice is the Mean Squared Error (MSE):
MSE = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)²
Why Use MSE?
Directly Measures Prediction Error: MSE quantifies how close predictions are to actual outcomes, making it an intuitive measure of model accuracy.
Matches LASSO’s Objective: Since LASSO minimizes a penalized sum of squared errors, MSE aligns well with the model’s optimization goal.
Balances Bias-Variance Tradeoff: MSE reflects both underfitting (high bias) and overfitting (high variance), helping you find a λ value that generalizes best.
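A minimal sketch of this procedure using scikit-learn's LassoCV on synthetic data (the λ grid and data are hypothetical; scikit-learn calls the penalty `alpha` and scores folds by mean squared error):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only the first 3 of 50 predictors matter
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=500)

# 5-fold cross-validation over a grid of penalty values
lasso = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5).fit(X, y)

print("chosen penalty:", lasso.alpha_)
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```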
10. Machine Learning in Empirical Research with Many Control Variables and Limited Theory
In empirical economics, researchers often face situations where:
There are many possible control variables, but it's unclear which ones matter most.
Economic theory does not provide clear guidance on which variables to include or the functional form of the model.
Machine learning (ML) provides flexible and data-driven methods to handle these challenges effectively:
Feature Selection & Regularization
Methods like LASSO and Elastic Net automatically select the most relevant variables by shrinking less important coefficients toward zero.
This helps prevent overfitting while still accounting for key controls.
Example: Estimating the effect of a minimum wage increase on employment while selecting from hundreds of potential confounders (e.g., industry trends, firm characteristics, local economic indicators).
Capturing Complex Relationships
Models like Random Forests and Gradient Boosting capture nonlinear relationships and interactions without requiring a pre-specified functional form.
This is useful when theory does not dictate whether effects should be linear, quadratic, or involve interactions.
Example: Predicting consumer credit risk when factors like income, age, spending behavior, and debt levels interact in complex ways.
Handling High-Dimensional Data
With high-dimensional datasets (e.g., detailed firm- or individual-level data), ML techniques can efficiently process and extract the most relevant information.
Example: Identifying the key determinants of firm productivity from thousands of potential variables, such as workforce composition, technology adoption, and management practices.
11. 1. Raster
Description:
Raster data is a grid-based format where each cell (or pixel) represents a value, such as elevation, temperature, or satellite imagery. It is often used for continuous data.
Example in Economics:
A heatmap of nighttime lights (from NASA’s VIIRS data) used as a proxy for economic activity in regions where GDP data is sparse.
2. Polygon Shapefile
Description:
A polygon shapefile represents areas, such as countries, states, or land use zones, using multiple connected points. It is used to define geographic boundaries.
Example in Economics:
A shapefile of U.S. counties used to analyze unemployment rates across different regions.
3. Point Shapefile
Description:
A point shapefile consists of individual geographic points, typically representing locations of specific entities like cities, businesses, or events.
Example in Economics:
A dataset of factory locations used to study the effect of industrial clusters on local wages.
4. Line Shapefile
Description:
A line shapefile is composed of connected points representing linear features such as roads, rivers, or railways.
Example in Economics:
A shapefile of transportation networks used to examine how highway expansions affect regional trade and economic development.
5. Tabular Data by Administrative Unit
Description:
A tabular dataset where each row corresponds to an administrative unit (e.g., state, county, country) with associated statistics.
Example in Economics:
A table of GDP per capita by country, used to compare economic growth across different nations.
6. Point Data with Coordinates
Description:
A dataset where each row represents a specific location with latitude and longitude coordinates, often used for mapping or spatial analysis.
Example in Economics:
A dataset of microfinance loan recipients, including their exact locations, to analyze how access to credit influences local economic outcomes.
12. Primary Data Sources Needed
Enterprise Zone Boundaries (Polygon Shapefile)
A GIS file defining the geographic boundaries of the enterprise zones.
Used to determine which businesses are inside or outside the zone for the regression discontinuity (RD) design.
Business Location Data (Point Shapefile or Table with Coordinates)
A dataset containing business addresses with latitude and longitude coordinates.
Used to spatially join businesses to the enterprise zone boundaries and measure their distance from the boundary.
Business Income Data (Table Format)
A dataset containing annual business income for firms within and near enterprise zones.
Used as the outcome variable in the RD model.
Spatial Join to Assign Enterprise Zone Status
Use ArcGIS to spatially join business location points to the enterprise zone shapefile.
Create a binary indicator variable (EnterpriseZone = 1 if a business is inside the zone, 0 otherwise).
Calculate Distance to the Enterprise Zone Boundary
Use the "Near" or "Generate Near Table" tool in ArcGIS to compute the shortest distance from each business to the nearest boundary of the enterprise zone.
This distance serves as the running variable in the RD model.
Restrict Sample to a Bandwidth Around the Boundary
Apply a distance threshold to include only businesses within a specified range (e.g., 1 km) of the enterprise zone boundary.
This ensures a valid RD design by comparing businesses just inside and just outside the zone.
Merge Business Income Data
Join the business income table to the spatial dataset using a common identifier (e.g., business name or address).
Ensure each business has an income measure linked to its location and enterprise zone status.
Export the Final Dataset for Regression Analysis
Save the processed dataset as a table for use in statistical software (e.g., Stata, R, Python).
The final dataset should include business income (outcome variable), enterprise zone status (treatment variable), and distance to boundary (running variable).
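The ArcGIS steps above could equivalently be scripted; a minimal geopandas sketch, assuming hypothetical file names (`zones.shp`, `businesses.csv`) and columns (`lon`, `lat`, `income`), and assuming the zone shapefile uses a projected CRS so distances are in meters:

```python
import geopandas as gpd
import pandas as pd

# Hypothetical inputs: enterprise-zone polygons and business coordinates
zones = gpd.read_file("zones.shp")   # assumed to be in a projected CRS (meters)
biz = pd.read_csv("businesses.csv")  # columns: business_id, lon, lat, income
biz = gpd.GeoDataFrame(
    biz, geometry=gpd.points_from_xy(biz["lon"], biz["lat"]), crs="EPSG:4326"
).to_crs(zones.crs)

# Spatial join: EnterpriseZone = 1 if the business falls inside a zone
joined = gpd.sjoin(biz, zones[["geometry"]], how="left", predicate="within")
joined["EnterpriseZone"] = joined["index_right"].notna().astype(int)

# Running variable: distance to the nearest zone boundary
joined["DistanceToBoundary"] = joined.geometry.distance(zones.boundary.unary_union)

# Restrict to a bandwidth around the boundary (e.g., 1 km) and export
sample = joined[joined["DistanceToBoundary"] <= 1_000]
sample.drop(columns="geometry").to_csv("rd_sample.csv", index=False)
```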
Using the constructed dataset, estimate the spatial RD model:
Y_i = β0 + β1·EnterpriseZone_i + f(DistanceToBoundary_i) + ε_i
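A minimal local-linear version of this specification, assuming the `rd_sample.csv` constructed above and approximating f() with a linear term in signed distance that is allowed to differ on each side of the boundary:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("rd_sample.csv")

# Signed running variable: positive inside the zone, negative outside
df["dist"] = df["DistanceToBoundary"] * (2 * df["EnterpriseZone"] - 1)

# The coefficient on EnterpriseZone is the estimated jump at the boundary
rd = smf.ols("income ~ EnterpriseZone + dist + EnterpriseZone:dist",
             data=df).fit(cov_type="HC1")
print(rd.summary())
```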
13. Example 1: Nighttime Lights (VIIRS/DMSP-OLS Nighttime Lights Data)
Statistic/Dataset Created:
Construct an index of economic activity by aggregating nighttime light intensity at the regional or city level over time.
Compute changes in light intensity before and after policy interventions, such as infrastructure projects or tax incentives.
Economic Question:
How do large-scale infrastructure projects (e.g., new highways, electricity grid expansions) impact local economic activity in developing countries?
Nighttime light intensity can serve as a proxy for economic growth in areas where GDP data is unreliable or unavailable.
Example 2: Land Cover / Land Use Data (NDVI)
Statistic/Dataset Created:
Measure annual land use changes, specifically deforestation rates, in areas affected by agricultural policies, subsidies, or conservation efforts.
Looking at the red and near-infrared (NIR) bands: NDVI = (NIR − Red) / (NIR + Red)
Healthy vegetation reflects more NIR and absorbs more red light, resulting in high NDVI values
Compare NDVI values
Calculate the percentage of land converted from forest to agriculture over time in response to economic incentives.
Economic Question:
How do agricultural subsidies affect deforestation in tropical regions?
By linking land cover changes to economic policies, researchers can analyze whether subsidies intended to boost agricultural output lead to unintended environmental consequences.
14. 1. Cloud-Based Processing of Large Satellite Datasets
Why it Helps: Google Earth Engine (GEE) provides access to massive remote sensing datasets, such as the MODIS Vegetation Continuous Fields (VCF) or Hansen Global Forest Change datasets, without requiring users to download them.
How it Facilitates Computation: Instead of handling terabytes of satellite imagery locally, GEE processes the data on Google’s cloud infrastructure, enabling fast and efficient analysis of tree cover across the entire U.S. over a 20-year period.
Why it Helps: GEE allows users to easily filter and aggregate images over time using its Earth Engine API, which supports operations like averaging, masking, and zonal statistics.
How it Facilitates Computation: Users can define a spatial boundary (e.g., the U.S.) and a temporal range (e.g., 2004–2024), then apply pre-built functions to compute the average annual tree cover. This eliminates the need for manually handling and merging thousands of images.
Aggregation tools: ee.Reducer (e.g., ee.Reducer.mean())
GEE provides direct access to global satellite datasets, such as MODIS, Landsat, and Hansen’s Global Forest Change dataset, which include long-term records of tree cover.
Instead of downloading massive datasets and manually pre-processing them (e.g., cloud masking, radiometric corrections), users can work with already cleaned and pre-processed data in the cloud.
This significantly reduces the time and computational effort needed to collect, store, and manage satellite imagery.
GEE performs computations on Google’s cloud servers, allowing users to analyze vast geospatial datasets without requiring high-performance local hardware.
The MapReduce framework enables efficient processing of time-series data, making it easy to aggregate annual tree cover statistics over large spatial extents (such as the entire US).
Users can apply built-in functions to filter images by date, region, and cloud cover, and then compute summary statistics like annual averages with a few lines of JavaScript or Python.
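A minimal Earth Engine Python API sketch of this workflow, under stated assumptions: the MOD44B (MODIS VCF) collection ID and band name are written from memory and should be verified against the GEE data catalog, and the US boundary comes from the public TIGER states table:

```python
import ee
ee.Initialize()

# US boundary and the MODIS Vegetation Continuous Fields collection
# (collection ID / band name assumed -- verify in the GEE data catalog)
us = ee.FeatureCollection("TIGER/2018/States").union()
vcf = (ee.ImageCollection("MODIS/006/MOD44B")
       .select("Percent_Tree_Cover")
       .filterDate("2004-01-01", "2024-01-01"))

def annual_mean(year):
    # Mean percent tree cover over the US for one year
    year = ee.Number(year)
    img = vcf.filter(ee.Filter.calendarRange(year, year, "year")).mean()
    stat = img.reduceRegion(
        reducer=ee.Reducer.mean(),
        geometry=us.geometry(),
        scale=5000,          # coarse scale keeps the reduction cheap
        maxPixels=1e13,
    )
    return ee.Feature(None, {"year": year, "tree_cover": stat.get("Percent_Tree_Cover")})

annual = ee.FeatureCollection(ee.List.sequence(2004, 2023).map(annual_mean))
print(annual.getInfo())
```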