Big Data: New Tricks for Econometrics

Introduction: The Promise & Challenges of Big Data for Econometrics

  • Computers mediate an ever-growing share of economic transactions; digital traces are automatically captured, stored and can be analyzed.
  • Traditional econometric tools (e.g., OLS regression) are still valuable, but “big” datasets introduce challenges that demand new techniques:
    • Sheer volume → need for distributed storage/processing.
    • High‐dimensionality (many potential predictors) → variable selection and regularization become critical.
    • Complex, possibly nonlinear relationships → flexible, nonparametric algorithms (machine learning) often outperform linear models.
  • Hal Varian’s core advice to graduate students: “Go to the computer-science department and take a class in machine learning.”
  • Collaboration between computer scientists & econometricians is expected to be as fruitful as earlier stat–CS partnerships.

Tools to Manipulate Big Data

  • Traditional spreadsheets become unwieldy beyond ≈ 10⁶ rows; relational databases like MySQL help but plateau at several million observations or a few GB.
  • NoSQL ("not only SQL") databases trade sophisticated querying for scalability—handle terabytes/petabytes across clusters.
  • Large tech platforms (Amazon, Google, Microsoft) convert fixed IT costs to variable (pay-as-you-go) via cloud services, lowering entry barriers.
  • Google’s internal stack & open-source analogs (Table 1):
    • Google File System ↔ Hadoop File System (HDFS): store files too large for single machines.
    • Bigtable ↔ Cassandra: distributed column-family stores.
    • MapReduce ↔ Hadoop MapReduce: parallel data-access & aggregation (“map” shards, then “reduce”; a toy sketch follows this list).
    • Sawzall ↔ Pig: high-level language for MapReduce jobs.
    • Go (itself open source, so no separate analog): compiled language designed for concurrency.
    • Dremel/BigQuery ↔ Hive, Drill, Impala: interactive SQL-like querying over petabyte-scale data (e.g., scan 10³ TB in seconds).
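
  • To make the “map shards, then reduce” idea concrete, here is a toy, single-machine sketch in R; the shard contents and the word-count task are purely illustrative (a real MapReduce/Hadoop job distributes the work across many machines):

```r
# Toy illustration of the map/reduce pattern: count words across "shards".
# In a real cluster each shard would live on a different machine.
shards <- list(c("big", "data", "big"), c("data", "new", "tricks"))

# Map step: each shard independently produces partial counts.
partial <- lapply(shards, function(s) table(s))

# Reduce step: merge the partial counts word by word into a global tally.
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
}
Reduce(merge_counts, partial)
```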

Data Analysis Tasks & Terminology

  • Four canonical goals in statistics/econometrics:
    1. Prediction
    2. Summarization / pattern discovery
    3. Estimation of parameters / structural modeling
    4. Hypothesis testing
  • Field jargon:
    • Machine Learning (ML): mainly prediction; cares about computational constraints.
    • Data Mining: prediction + summarization (pattern discovery).
    • Data Science: umbrella covering prediction, summarization, manipulation, visualization, etc.
  • Other synonyms: knowledge extraction, information discovery/harvesting, data archaeology, exploratory data analysis.

From Extraction to “Small” Tables

  • After distributed preprocessing, analysts often materialize a manageable “skinny” table.
  • When tables are still large, random subsamples (~0.1% at Google) often suffice for downstream statistical modeling.
  • Exploratory Data Analysis (EDA) & cleaning remain artisanal; tools like OpenRefine & DataWrangler assist.

General Considerations for Prediction

  • Goal: minimize out-of-sample loss (e.g., MSE, MAE) for new observations x_new.
  • Overfitting dangers:
    • n independent regressors fit n observations perfectly but generalize poorly.
  • Three ML countermeasures:
    1. Regularization – penalize model complexity.
    2. Train–validation–test split – distinct data for estimation, model choice, and final evaluation.
    3. k-Fold Cross-Validation (CV) (see the R sketch after this list):
    • Step-by-step procedure:
      1. Partition data into k folds (s = 1, …, k).
      2. Choose tuning parameter candidate.
      3. Fit on k-1 folds, predict on held-out fold s, record loss.
      4. Cycle through all folds & parameter values.
      5. Select parameter minimizing average out-of-sample loss (often the “1-SE rule” picks the simplest model within one standard error of the minimum).
  • Economists historically cited “small-sample” excuses for in-sample fit measures; big data removes that alibi.
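
  • A minimal hand-rolled k-fold CV sketch in R, assuming a data frame dat with a numeric response y and using rpart’s complexity parameter cp as the tuning parameter (the grid values are illustrative):

```r
# k-fold CV to choose a tuning parameter by out-of-sample MSE
# (assumes a data frame `dat` with numeric response `y`).
library(rpart)
set.seed(1)

k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment
grid  <- c(0.001, 0.01, 0.1)                        # candidate complexity penalties

cv_loss <- sapply(grid, function(cp) {
  fold_mse <- sapply(1:k, function(s) {
    train <- dat[folds != s, ]
    test  <- dat[folds == s, ]
    fit   <- rpart(y ~ ., data = train, cp = cp)    # any learner could be swapped in
    mean((predict(fit, newdata = test) - test$y)^2) # loss on held-out fold s
  })
  mean(fold_mse)                                    # average out-of-sample loss
})

grid[which.min(cv_loss)]   # parameter minimizing average CV loss
```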

Flexible Regression-Like ML Tools

  • When linear/logit models are too rigid, consider:
    1. Classification and Regression Trees (CART)
    2. Random Forests / Boosted Trees
    3. Penalized regression: LASSO, LARS, Elastic Net
    • (Additional but not covered in detail: neural nets, deep learning, SVMs.)

Classification & Regression Trees (CART)

Concept

  • Binary recursive partitioning: repeatedly split predictor space into rectangles (or hyper-rectangles) to minimize impurity (Gini, deviance, MSE).
  • Produces human-readable decision rules.
  • Handles nonlinearities, interactions & missing data naturally.

Titanic Example (rpart in R)

  • Predict survival (0/1) using age & class.
  • Learned rules (Table 2):
    • 3rd-class passengers → “Died” (370/501).
    • 1st/2nd-class children (<16 yrs) → “Lived” (34/36), etc.
  • Misclassification ≈ 30% on test set.
  • Partition plot (2D) visualizes rectangular regions; scalable to many predictors (non-visual).
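
  • A minimal rpart sketch of this example; the data frame titanic (with columns survived, pclass, age) is assumed, since rpart does not ship Titanic data:

```r
# Fit a classification tree for survival from class and age
# (assumes a data frame `titanic` with columns survived (0/1), pclass, age).
library(rpart)

fit <- rpart(factor(survived) ~ pclass + age,
             data = titanic, method = "class")

print(fit)                                     # human-readable split rules
pred <- predict(fit, type = "class")
mean(pred != as.character(titanic$survived))   # misclassification rate
```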

Logistic vs Tree Contrast

  • Logistic regression yielded a tiny age coefficient (p ≈ 0.07).
  • Tree discovered sharp nonlinearity: survival high for children (<8.5 yrs) & low for elderly—pattern obscured in global logit.
  • Takeaway: Trees uncover heterogeneity & thresholds that linear models may mask.

Pruning & Complexity Control

  • “Large” trees overfit just like saturated regressions.
  • Strategy: cost-complexity pruning with CV to choose optimal #leaves.
  • Analysts often pick the penalty whose loss is within one standard error of the minimum (the “1-SE rule”) for parsimony.
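
  • A sketch of cost-complexity pruning with rpart’s built-in cross-validation, continuing the assumed titanic data frame from above:

```r
# Grow a deliberately large tree, then prune back using the CV error table.
library(rpart)

big_fit <- rpart(factor(survived) ~ pclass + age, data = titanic,
                 method = "class", cp = 0.001)        # small cp => large tree

printcp(big_fit)                                      # xerror = cross-validated error
best_cp <- big_fit$cptable[which.min(big_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_fit, cp = best_cp)               # or a larger cp (1-SE rule)
```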

Conditional Inference Trees (ctree)

  • Hypothesis-test-based splitting avoids bias toward predictors with many cut-points.
  • Example tree for Titanic (Figure 4): first split on gender, then class, age, siblings/spouse; summarizes “Women & children first—especially in 1st class.”
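
  • A ctree sketch using the partykit package; the titanic data frame (here also assumed to have sex and sibsp columns) is again hypothetical:

```r
# Conditional inference tree: splits chosen by permutation tests
# rather than impurity reduction.
library(partykit)

ct <- ctree(factor(survived) ~ sex + pclass + age + sibsp, data = titanic)
plot(ct)   # first split is typically on sex, echoing the Figure 4 tree
```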

Boosting, Bagging, Bootstrap

  • Bootstrap: resample with replacement to assess sampling variability.
  • Bagging (Bootstrap AGGregatING): average predictions from many bootstrapped models → variance reduction, especially effective for unstable learners (trees).
  • Boosting: sequentially re-weight misclassified observations; final prediction is weighted vote/average of many weak learners → often large accuracy gains.
  • Combination yields Ensembles of trees (forests, gradient boosting machines) that routinely win predictive competitions.
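
  • A hand-rolled bagging sketch for a regression tree, assuming a data frame dat with numeric response y (packages such as randomForest or gbm do this far more efficiently):

```r
# Bagging: average predictions from trees grown on bootstrap resamples.
library(rpart)
set.seed(1)

B     <- 100
preds <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)   # bootstrap sample of observations
  fit <- rpart(y ~ ., data = dat[idx, ])     # unstable base learner
  predict(fit, newdata = dat)                # predictions for every observation
})

bagged <- rowMeans(preds)                    # variance-reducing average
```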

Random Forests

  • Canonical algorithm:
    1. Draw bootstrap sample of observations.
    2. At each split, consider a random subset of predictors (mtry).
    3. Grow tree to full depth (no pruning).
    4. Aggregate many trees (majority vote or mean).
  • Strengths: high accuracy, handles nonlinearities & high-dimensionality, built-in variable importance metrics.
  • Weakness: interpretability (“black box”).
  • HMDA mortgage-approval data: random forest misclassified 223 of 2,380 (better than logit’s 225 and ctree’s 228); top predictor dmi = denied mortgage insurance, race near bottom → mirrors ctree finding.
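
  • A randomForest sketch of the canonical algorithm above, applied to a binary outcome in the spirit of the HMDA example; the data frame hmda with factor response deny is assumed:

```r
# Random forest for a binary outcome (assumes a data frame `hmda`
# with a factor response `deny` and predictors including dmi).
library(randomForest)
set.seed(1)

rf <- randomForest(deny ~ ., data = hmda,
                   ntree = 500,   # number of bootstrapped trees
                   mtry  = 3)     # predictors tried at each split

rf$confusion     # out-of-bag confusion matrix (out-of-sample flavor for free)
varImpPlot(rf)   # variable importance; per the text, dmi ranks at the top
```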

Variable Selection in Linear Models

Penalized Regression Family

  • Objective with the elastic-net penalty:
    min over (b_0, b) of  Σ_t (y_t - b_0 - x_t'b)^2 + λ Σ_{p=1}^{P} [ (1-α)|b_p| + α b_p^2 ]
    • λ ≥ 0 tunes overall shrinkage.
    • α = 0 → LASSO (ℓ₁ penalty): induces sparsity (some b_p = 0).
    • α = 1 → Ridge (ℓ₂ penalty): shrinks coefficients but rarely to exactly zero.
    • 0 < α < 1 → Elastic Net: hybrid of the two.
  • Computable efficiently (coordinate descent, LARS), enabling large-P problems.
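
  • A glmnet sketch for LASSO / ridge / elastic net; note that glmnet’s alpha convention is the reverse of the formula above (in glmnet, alpha = 1 is LASSO and alpha = 0 is ridge). The predictor matrix x and response y are assumed:

```r
# Penalized regression with glmnet; lambda chosen by 10-fold CV.
library(glmnet)

# x: numeric predictor matrix, y: response vector (assumed to exist)
lasso_cv <- cv.glmnet(x, y, alpha = 1)     # alpha = 1 -> LASSO in glmnet's convention
ridge_cv <- cv.glmnet(x, y, alpha = 0)     # alpha = 0 -> ridge
enet_cv  <- cv.glmnet(x, y, alpha = 0.5)   # in between -> elastic net

coef(lasso_cv, s = "lambda.1se")   # sparse coefficients at the 1-SE-rule lambda
```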

Spike-and-Slab Bayesian Selection

  • Indicator vector γ (length P) denotes inclusion/exclusion; prior P(γ_p = 1) = π.
  • “Spike” at zero mass for excluded vars; “slab” diffuse Normal prior for included coefficients.
  • MCMC draws yield posterior inclusion probabilities P(γ_p = 1 | data) and coefficient distributions.
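
  • A spike-and-slab regression sketch using the BoomSpikeSlab package (one possible implementation; the data frame dat with response y is assumed):

```r
# Spike-and-slab variable selection via MCMC (BoomSpikeSlab package).
library(BoomSpikeSlab)

ss_fit <- lm.spike(y ~ ., niter = 5000, data = dat)   # 5,000 MCMC draws
summary(ss_fit)           # posterior inclusion probabilities and coefficients
plot(ss_fit, "inclusion") # which predictors the sampler keeps in the model
```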

Growth-Regression Application (Sala-i-Martin dataset, 72 countries, 42 vars)

  • Compared four methods:
    • Exhaustive CDF(0) search (2 × 10⁶ regressions)
    • Bayesian Model Averaging (BMA)
    • LASSO
    • Spike-and-Slab
  • Broad agreement on top predictors (initial GDP, Confucian share, life expectancy, equipment investment). Divergence on lower-ranked variables illustrates model uncertainty.
  • LASSO & Bayesian methods far more computationally efficient than brute-force search.

Variable Selection for Time-Series: Bayesian Structural Time Series (BSTS)

  • Motivation: choosing informative Google Trends predictors among billions of queries.
  • Model components:
    • Observation equation: y_t = ℓ_t + z_t + e_{1t}
    • State equations:
      • Level: ℓ_t = ℓ_{t-1} + b_{t-1} + e_{2t}
      • Trend: b_t = b_{t-1} + e_{3t}
      • Regression: z_t = β'x_t, with a spike-and-slab prior on β.
  • MCMC draws jointly sample variances, inclusion vector γ\gamma, coefficients, and latent states, enabling:
    • Point & interval forecasts via Kalman filtering/smoothing.
    • Posterior inclusion probabilities for each predictor.
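
  • A bsts sketch of this setup: local linear trend plus spike-and-slab regression on external predictors. The response vector y, the predictor data frame predictors, and the future values future_x are all assumed:

```r
# Bayesian structural time series: level/trend state plus regression component
# with a spike-and-slab prior (assumes response `y` and data frame `predictors`).
library(bsts)

ss <- AddLocalLinearTrend(list(), y)              # level + trend state components

model <- bsts(y ~ ., state.specification = ss,
              data = predictors, niter = 5000,
              expected.model.size = 3)            # prior guess at #included regressors

plot(model, "coefficients")                       # posterior inclusion probabilities
pred <- predict(model, newdata = future_x)        # future predictor values (assumed)
plot(pred)                                        # point & interval forecasts
```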

Housing Example

  • Response: U.S. monthly new-home sales (HSN1FNSA).
  • Google Correlate provided top 100 query series.
  • BSTS selected terms like “appreciation rate” (positive) and “irs 1031” (negative) after dropping spurious ones (“oldies lyrics”).
  • Incremental MAE fell from 0.52 (trend only) to 0.15 with two predictors (Figure 7), despite a tiny training sample.

Causal Inference vs Pure Prediction

  • Econometric toolkit for causality: IV, regression discontinuity, diff-in-diffs, natural & randomized experiments.
  • ML traditionally focuses on prediction; yet causal modeling frameworks exist (e.g., Pearl’s graphical models) but are under-used in practice.
  • Two core causal-estimation challenges (Angrist–Pischke decomposition):
    1. Modeling assignment rule (selection into treatment).
    2. Modeling counterfactual outcome for treated units.
    • Both are prediction problems ⇒ ML can help improve estimates of treatment effects by reducing prediction error for selection bias & counterfactuals.

Advertising Effectiveness Illustration

  • Problem: ad spend & sales both rise during holidays → confounding.
  • Full experiments are costly; alternative: treat counterfactual as forecasting problem.
  • Procedure:
    1. Build BSTS model on pre-treatment data (trend, seasonality, exogenous Google Trends, weather, etc.).
    2. Run ad campaign.
    3. Forecast “would-have-been” visits, compare to actual → estimate causal lift.
  • Example (Figure 8): cumulative uplift 107 k visits over 55 days (≈ 27% relative), with credible intervals.
  • Predictive model may outperform naive randomized control groups if it captures rich covariate info (e.g., city-specific weather).
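
  • This counterfactual-forecasting procedure is packaged in Google’s CausalImpact R package (built on bsts). A sketch, assuming visits is a zoo/ts object with the response series first and control series after, and with illustrative pre/post dates:

```r
# Estimate the causal lift of a campaign by comparing actual outcomes to a
# BSTS counterfactual forecast (assumes a zoo/ts object `visits`).
library(CausalImpact)

pre.period  <- as.Date(c("2013-01-01", "2013-06-30"))   # illustrative dates
post.period <- as.Date(c("2013-07-01", "2013-08-24"))   # campaign period

impact <- CausalImpact(visits, pre.period, post.period)
summary(impact)   # estimated lift with credible intervals
plot(impact)      # actual vs. counterfactual and cumulative effect
```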

Model Uncertainty & Ensemble Learning

  • Lesson from Netflix Prize: blend of 800+ models beat any single algorithm; even averaging top-2 submissions improved accuracy.
  • Macro-forecast literature recognized decades ago that mean of projections > individual models.
  • Applied econometric papers implicitly address model uncertainty via multiple-specification tables; big-data context urges formal, systematic treatment (BMA, ensemble forecasts) rather than ad-hoc few specs.

Opportunities for Cross-Fertilization

  • Extend ML methods to non-IID settings: panels, time series with serial correlation.
  • Embed causal-inference structure (treatment, potential outcomes) within flexible ML estimators.
  • Combine econometric identification strategies with ML regularization & prediction (e.g., “double/debiased ML” of Chernozhukov et al., post-dating this article).

Ethical & Practical Implications

  • Lower cost of storage/compute democratizes big-data analytics → raises privacy, confidentiality concerns (especially with fine-grained transaction data).
  • Interpretability vs accuracy trade-off: regulators & policy analysts may require transparent models; ensemble/tree methods need post-hoc explanation tools.

Suggested Further Reading & Resources

  • Textbooks:
    • Hastie, Tibshirani & Friedman (2009) – graduate-level bible of statistical learning.
    • James, Witten, Hastie & Tibshirani (2013) – introductory level with R labs.
    • Murphy (2012) – Bayesian perspective on ML.
    • Venables & Ripley (2002) – Modern Applied Statistics with S: practical reference, with S/R implementations of many of the methods above.