Big Data: New Tricks for Econometrics
Introduction: The Promise & Challenges of Big Data for Econometrics
- Computers mediate an ever-growing share of economic transactions; digital traces are automatically captured, stored and can be analyzed.
- Traditional econometric tools (e.g., OLS regression) are still valuable, but “big” datasets introduce challenges that demand new techniques:
- Sheer volume → need for distributed storage/processing.
- High‐dimensionality (many potential predictors) → variable selection and regularization become critical.
- Complex, possibly nonlinear relationships → flexible, nonparametric algorithms (machine learning) often outperform linear models.
- Hal Varian’s core advice to graduate students: “go to the computer science department and take a class in machine learning.”
- Collaboration between computer scientists & econometricians is expected to be as fruitful as earlier stat–CS partnerships.
- Traditional spreadsheets become unwieldy beyond ≈ 10⁶ rows; relational databases like MySQL help but plateau at several million observations or a few GB.
- NoSQL ("not only SQL") databases trade sophisticated querying for scalability—handle terabytes/petabytes across clusters.
- Large tech platforms (Amazon, Google, Microsoft) convert fixed IT costs to variable (pay-as-you-go) via cloud services, lowering entry barriers.
- Google’s internal stack & open-source analogs (Table 1):
- Google File System ↔ Hadoop File System (HDFS): store files too large for single machines.
- Bigtable ↔ Cassandra: distributed column-family stores.
- MapReduce ↔ Hadoop MapReduce: parallel data access & aggregation (“map” over shards, then “reduce” the per-key results; toy sketch after this list).
- Sawzall ↔ Pig: high-level language for MapReduce jobs.
- Go (itself open source, so no analog listed): compiled general-purpose language designed for concurrency.
- Dremel/BigQuery ↔ Hive, Drill, Impala: interactive SQL-like querying over petabyte-scale data (e.g., scan ~10³ TB in seconds).
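A toy illustration of the map-then-reduce pattern, simulated here in plain R on a single machine (word count is the classic example; the data and code are illustrative, not any actual Google/Hadoop API):

```r
# Word count in the MapReduce style, simulated in base R on one machine
docs   <- c("big data new tricks", "new tricks for econometrics")
mapped <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))  # "map": emit one key per word
counts <- tapply(rep(1, length(mapped)), mapped, sum)              # "reduce": sum values by key
counts
```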
Data Analysis Tasks & Terminology
- Four canonical goals in statistics/econometrics:
- Prediction
- Summarization / pattern discovery
- Estimation of parameters / structural modeling
- Hypothesis testing
- Field jargon:
- Machine Learning (ML): mainly prediction; pays explicit attention to computational cost and scale.
- Data Mining: prediction + summarization (pattern discovery).
- Data Science: umbrella covering prediction, summarization, manipulation, visualization, etc.
- Other synonyms: knowledge extraction, information discovery/harvesting, data archaeology, exploratory data analysis.
- After distributed preprocessing, analysts often materialize a manageable “skinny” table.
- When tables are still large, random subsamples (~0.1% at Google) often suffice for downstream statistical modeling.
- Exploratory Data Analysis (EDA) & cleaning remain artisanal; tools like OpenRefine & DataWrangler assist.
General Considerations for Prediction
- Goal: minimize out-of-sample loss (e.g., MSE, MAE) for new observations x_new.
- Overfitting dangers:
- n independent regressors fit n observations perfectly but generalize poorly.
- Three ML countermeasures:
- Regularization – penalize model complexity.
- Train–validation–test split – distinct data for estimation, model choice, and final evaluation.
- k-Fold Cross-Validation (CV):
- Step-by-step procedure:
- Partition data into k folds (s=1,…,k).
- Choose tuning parameter candidate.
- Fit on k−1 folds, predict on held-out fold s, record loss.
- Cycle through all folds & parameter values.
- Select the parameter minimizing average out-of-sample loss (the “1-SE rule” instead picks the simplest model within one standard error of the minimum); a sketch follows.
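A minimal hand-rolled sketch of the procedure above, using synthetic data and polynomial degree as the (illustrative) tuning parameter:

```r
# k-fold CV to choose a tuning parameter (here: polynomial degree in lm)
set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)   # illustrative data

k       <- 5
folds   <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment
degrees <- 1:6                                        # candidate tuning values
cv_mse  <- sapply(degrees, function(d) {
  mean(sapply(1:k, function(s) {
    train <- dat[folds != s, ]
    test  <- dat[folds == s, ]
    fit   <- lm(y ~ poly(x, d), data = train)         # fit on the k-1 other folds
    mean((test$y - predict(fit, newdata = test))^2)   # loss on held-out fold s
  }))
})
degrees[which.min(cv_mse)]   # parameter minimizing average out-of-sample loss
```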
- Economists historically cited “small-sample” excuses for in-sample fit measures; big data removes that alibi.
- When linear/logit models are too rigid, consider:
- Classification and Regression Trees (CART)
- Random Forests / Boosted Trees
- Penalized regression: LASSO, LARS, Elastic Net
- (Additional but not covered in detail: neural nets, deep learning, SVMs.)
Classification & Regression Trees (CART)
Concept
- Binary recursive partitioning: repeatedly split predictor space into rectangles (or hyper-rectangles) to minimize impurity (Gini, deviance, MSE).
- Produces human-readable decision rules.
- Handles nonlinearities, interactions & missing data naturally.
Titanic Example (rpart in R)
- Predict survival (0/1) using age & class.
- Learned rules (Table 2):
- 3rd-class passengers → “Died” (370/501).
- 1st/2nd-class children (<16 yrs) → “Lived” (34/36), etc.
- Misclassification ≈ 30% on test set.
- Partition plot (2D) visualizes rectangular regions; scalable to many predictors (non-visual).
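A sketch of the rpart fit described above; the data frame `titanic` and its column names (survived, pclass, age) are assumptions about how the data are laid out:

```r
# Classification tree for Titanic survival (sketch)
library(rpart)
fit <- rpart(factor(survived) ~ pclass + age, data = titanic, method = "class")
print(fit)                                   # human-readable split rules
pred <- predict(fit, titanic, type = "class")
mean(as.character(pred) != as.character(titanic$survived))  # misclassification rate
```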
Logistic vs Tree Contrast
- Logistic regression yielded tiny age coefficient (p≈0.07).
- Tree discovered sharp nonlinearity: survival high for children (<8.5 yrs) & low for elderly—pattern obscured in global logit.
- Takeaway: Trees uncover heterogeneity & thresholds that linear models may mask.
Pruning & Complexity Control
- “Large” trees overfit just like saturated regressions.
- Strategy: cost-complexity pruning, with CV choosing the optimal number of leaves.
- Analysts often pick the penalty one standard error above the minimum CV loss for parsimony (see the sketch below).
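rpart computes the cross-validated error for each complexity value automatically; a sketch of pruning with the 1-SE rule, continuing the hypothetical `fit` from above:

```r
# Cost-complexity pruning via rpart's built-in CV table
printcp(fit)                                      # xerror = CV loss per complexity value cp
tab    <- fit$cptable
best   <- which.min(tab[, "xerror"])
thresh <- tab[best, "xerror"] + tab[best, "xstd"] # minimum plus one standard error
cp_1se <- tab[tab[, "xerror"] <= thresh, "CP"][1] # simplest tree within the band
pruned <- prune(fit, cp = cp_1se)
```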
Conditional Inference Trees (ctree)
- Hypothesis-test-based splitting avoids bias toward predictors with many cut-points.
- Example tree for Titanic (Figure 4): first split on gender, then class, age, siblings/spouse; summarizes “Women & children first—especially in 1st class.”
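A conditional-inference-tree sketch using the partykit package (the older party package also provides ctree); same hypothetical `titanic` column names as above, plus sex and sibsp:

```r
# Conditional inference tree: splits chosen by permutation tests, not impurity
library(partykit)
ct <- ctree(factor(survived) ~ sex + pclass + age + sibsp, data = titanic)
plot(ct)   # first split on sex, then class/age: "women & children first"
```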
Boosting, Bagging, Bootstrap
- Bootstrap: resample with replacement to assess sampling variability.
- Bagging (Bootstrap AGGregatING): average predictions from many bootstrapped models → variance reduction, especially effective for unstable learners (trees).
- Boosting: sequentially re-weight misclassified observations; final prediction is weighted vote/average of many weak learners → often large accuracy gains.
- Combination yields Ensembles of trees (forests, gradient boosting machines) that routinely win predictive competitions.
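A hand-rolled bagging sketch for intuition (packages such as ipred and gbm automate bagging and boosting; the function and argument names here are illustrative):

```r
# Bagging: average class probabilities from trees grown on bootstrap resamples
library(rpart)
bagged_prob <- function(formula, data, newdata, B = 100) {
  preds <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap sample
    fit  <- rpart(formula, data = boot, method = "class")
    predict(fit, newdata)[, 2]                          # P(second class) per row
  })
  rowMeans(preds)   # aggregate: variance-reducing average over B trees
}
```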
Random Forests
- Canonical algorithm:
- Draw bootstrap sample of observations.
- At each split, consider random subset of predictors (mtry).
- Grow tree to full depth (no pruning).
- Aggregate many trees (majority vote or mean).
- Strengths: high accuracy, handles nonlinearities & high-dimensionality, built-in variable importance metrics.
- Weakness: interpretability (“black box”).
- HMDA mortgage-approval data: random forest misclassified 223/2,380 (better than logit’s 225 and ctree’s 228); top predictor dmi = denied mortgage insurance, race near bottom → mirrors ctree finding.
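A sketch with the randomForest package; the data frame `hmda` and its column names (deny as a factor response, dmi among the predictors) are assumptions about the mortgage data:

```r
# Random forest on the mortgage-approval data (sketch)
library(randomForest)
set.seed(1)
rf <- randomForest(deny ~ ., data = hmda, importance = TRUE)
rf$confusion     # out-of-bag confusion matrix (no separate test set needed)
varImpPlot(rf)   # variable importance: dmi near the top, race near the bottom
```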
Variable Selection in Linear Models
Penalized Regression Family
- Objective with Elastic Net penalty:
  $$\min_{b_0,\,b}\; \sum_t \big(y_t - b_0 - x_t' b\big)^2 \;+\; \lambda \sum_{p=1}^{P}\Big[(1-\alpha)\,\lvert b_p\rvert + \alpha\, b_p^2\Big]$$
- λ ≥ 0 tunes overall shrinkage.
- α=0 → LASSO (ℓ₁): induces sparsity (some b_p = 0 exactly).
- α=1 → Ridge (ℓ₂): shrinks coefficients toward zero but does not set them exactly to zero.
- 0 < α < 1 → Elastic Net: a hybrid of the two penalties.
- Computable efficiently (coordinate descent, LARS), enabling large-P problems.
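A sketch with glmnet, which implements this penalty via coordinate descent. Note that glmnet’s α convention is reversed relative to the formula above: in glmnet, alpha = 1 gives the LASSO and alpha = 0 gives ridge. The data here are synthetic and illustrative:

```r
# Penalized regression with glmnet; lambda chosen by 10-fold CV
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 50), 100, 50)                # illustrative n = 100, P = 50
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(100)     # only 3 true predictors
cvfit <- cv.glmnet(X, y, alpha = 1)                  # NB: alpha = 1 is LASSO in glmnet
plot(cvfit)                                          # CV loss along the lambda path
coef(cvfit, s = "lambda.1se")                        # sparse fit under the 1-SE rule
```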
Spike-and-Slab Bayesian Selection
- Indicator vector γ (length P) denotes inclusion/exclusion; prior P(γ_p = 1) = π.
- “Spike” at zero mass for excluded vars; “slab” diffuse Normal prior for included coefficients.
- MCMC draws yield posterior inclusion probabilities P(γ_p = 1 | data) and coefficient distributions.
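A sketch with the BoomSpikeSlab package (one of several spike-and-slab MCMC implementations; the package choice and the data frame `dat` are assumptions):

```r
# Spike-and-slab regression: MCMC over inclusion indicators and coefficients
library(BoomSpikeSlab)
fit <- lm.spike(y ~ ., data = dat, niter = 5000)
summary(fit)   # posterior inclusion probabilities and coefficient summaries
plot(fit)      # default plot shows inclusion probabilities per predictor
```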
Growth-Regression Application (Sala-i-Martin dataset, 72 countries, 42 vars)
- Compared four methods:
- Exhaustive CDF(0) search (2 × 10⁶ regressions)
- Bayesian Model Averaging (BMA)
- LASSO
- Spike-and-Slab
- Broad agreement on top predictors (initial GDP, Confucian share, life expectancy, equipment investment). Divergence on lower-ranked variables illustrates model uncertainty.
- LASSO & Bayesian methods far more computationally efficient than brute-force search.
Variable Selection for Time-Series: Bayesian Structural Time Series (BSTS)
- Motivation: choosing informative Google Trends predictors among billions of queries.
- Model components:
- Observation equation: y_t = ℓ_t + z_t + e_1t
- State equations:
- Level: ℓ_t = ℓ_{t−1} + b_{t−1} + e_2t
- Trend: b_t = b_{t−1} + e_3t
- Regression: z_t = β′x_t, with a spike-and-slab prior on β
- MCMC draws jointly sample variances, inclusion vector γ, coefficients, and latent states, enabling:
- Point & interval forecasts via Kalman filtering/smoothing.
- Posterior inclusion probabilities for each predictor.
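A sketch with the bsts package (Scott & Varian’s implementation of this model); the data frame `dat` with response y and candidate predictors, and the future-regressor object `future_x`, are assumptions:

```r
# BSTS: local linear trend plus spike-and-slab regression on external predictors
library(bsts)
ss    <- AddLocalLinearTrend(list(), dat$y)            # level + trend state components
model <- bsts(y ~ ., state.specification = ss, data = dat,
              niter = 5000, expected.model.size = 3)   # sparse prior over predictors
plot(model, "components")                              # trend vs regression decomposition
plot(model, "coefficients")                            # posterior inclusion probabilities
pred  <- predict(model, horizon = 12, newdata = future_x)  # needs future predictor values
```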
Housing Example
- Response: U.S. monthly new-home sales (HSN1FNSA).
- Google Correlate provided top 100 query series.
- BSTS selected terms like “appreciation rate” (positive) and “irs 1031” (negative) after dropping spurious ones (“oldies lyrics”).
- Incremental MAE fell from 0.52 (trend only) → 0.15 with two predictors (Figure 7) despite tiny training sample.
Causal Inference vs Pure Prediction
- Econometric toolkit for causality: IV, regression discontinuity, diff-in-diffs, natural & randomized experiments.
- ML traditionally focuses on prediction; yet causal modeling frameworks exist (e.g., Pearl’s graphical models) but are under-used in practice.
- Two core causal-estimation challenges (Angrist–Pischke decomposition):
- Modeling assignment rule (selection into treatment).
- Modeling counterfactual outcome for treated units.
- Both are prediction problems ⇒ ML can help improve estimates of treatment effects by reducing prediction error for selection bias & counterfactuals.
Advertising Effectiveness Illustration
- Problem: ad spend & sales both rise during holidays → confounding.
- Full experiments are costly; alternative: treat counterfactual as forecasting problem.
- Procedure:
- Build BSTS model on pre-treatment data (trend, seasonality, exogenous Google Trends, weather, etc.).
- Run ad campaign.
- Forecast “would-have-been” visits, compare to actual → estimate causal lift.
- Example (Figure 8): cumulative uplift 107 k visits over 55 days (≈ 27% relative), with credible intervals.
- A predictive model may outperform a naive randomized control group if it captures rich covariate information (e.g., city-specific weather).
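A sketch using Google’s open-source CausalImpact R package, which implements essentially this counterfactual-forecast procedure on top of bsts; the series names (`visits`, `controls`, `dates`) and the dates are placeholders:

```r
# Counterfactual forecasting for causal lift (CausalImpact, built on bsts)
library(CausalImpact)
series <- zoo::zoo(cbind(visits, controls), dates)  # response first, controls after
impact <- CausalImpact(series,
                       pre.period  = as.Date(c("2013-01-01", "2013-03-01")),
                       post.period = as.Date(c("2013-03-02", "2013-04-25")))
summary(impact)   # average and cumulative lift with credible intervals
plot(impact)      # observed vs counterfactual, pointwise and cumulative effect
```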
Model Uncertainty & Ensemble Learning
- Lesson from Netflix Prize: blend of 800+ models beat any single algorithm; even averaging top-2 submissions improved accuracy.
- The macro-forecasting literature recognized decades ago that the average of several forecasts typically beats any individual model’s forecast.
- Applied econometric papers implicitly address model uncertainty via multiple-specification tables; big-data context urges formal, systematic treatment (BMA, ensemble forecasts) rather than ad-hoc few specs.
Opportunities for Cross-Fertilization
- Extend ML methods to non-IID settings: panels, time series with serial correlation.
- Embed causal-inference structure (treatment, potential outcomes) within flexible ML estimators.
- Combine econometric identification strategies with ML regularization & prediction (e.g., “double/debiased ML” of Chernozhukov et al., post-dating this article).
Ethical & Practical Implications
- Lower cost of storage/compute democratizes big-data analytics → raises privacy, confidentiality concerns (especially with fine-grained transaction data).
- Interpretability vs accuracy trade-off: regulators & policy analysts may require transparent models; ensemble/tree methods need post-hoc explanation tools.
Suggested Further Reading & Resources
- Textbooks:
- Hastie, Tibshirani & Friedman (2009) – graduate-level bible of statistical learning.
- James, Witten, Hastie & Tibshirani (2013) – introductory level with R labs.
- Murphy (2012) – Bayesian perspective on ML.
- Venables & Ripley (2002) – Modern Applied Statistics with S: the classic applied-statistics reference with S/R code.