Big Data: New Tricks for Econometrics
Introduction: The Promise & Challenges of Big Data for Econometrics
- Computers mediate an ever-growing share of economic transactions; digital traces are automatically captured, stored and can be analyzed.
- Traditional econometric tools (e.g., OLS regression) are still valuable, but “big” datasets introduce challenges that demand new techniques:
- Sheer volume → need for distributed storage/processing.
- High‐dimensionality (many potential predictors) → variable selection and regularization become critical.
- Complex, possibly nonlinear relationships → flexible, nonparametric algorithms (machine learning) often outperform linear models.
- Hal Varian’s core advice to graduate students: “go to the computer science department and take a class in machine learning.”
- Collaboration between computer scientists & econometricians is expected to be as fruitful as earlier stat–CS partnerships.
- Traditional spreadsheets become unwieldy beyond ≈ 10⁶ rows; relational databases like MySQL help but plateau at several million observations or a few GB.
- NoSQL ("not only SQL") databases trade sophisticated querying for scalability—handle terabytes/petabytes across clusters.
- Large tech platforms (Amazon, Google, Microsoft) convert fixed IT costs to variable (pay-as-you-go) via cloud services, lowering entry barriers.
- Google’s internal stack & open-source analogs (Table 1):
- Google File System ↔ Hadoop File System (HDFS): store files too large for single machines.
- Bigtable ↔ Cassandra: distributed column-family stores.
- MapReduce ↔ Hadoop MapReduce: parallel data access & aggregation (“map” over shards, then “reduce” the per-key results; toy sketch after this list).
- Sawzall ↔ Pig: high-level language for MapReduce jobs.
- Go (itself open source, so no analog listed): compiled general-purpose language designed for concurrency.
- Dremel/BigQuery ↔ Hive, Drill, Impala: interactive SQL-like querying over petabyte-scale data (e.g., scan ~10³ TB in seconds).
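A toy illustration of the map-then-reduce pattern, simulated here in plain R on a single machine (word count is the classic example; the data and code are illustrative, not any actual Google/Hadoop API):

```r
# Word count in the MapReduce style, simulated in base R on one machine
docs   <- c("big data new tricks", "new tricks for econometrics")
mapped <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))  # "map": emit one key per word
counts <- tapply(rep(1, length(mapped)), mapped, sum)              # "reduce": sum values by key
counts
```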
Data Analysis Tasks & Terminology
- Four canonical goals in statistics/econometrics:
- Prediction
- Summarization / pattern discovery
- Estimation of parameters / structural modeling
- Hypothesis testing
- Field jargon:
- Machine Learning (ML): mainly prediction; pays explicit attention to computational cost and scale.
- Data Mining: prediction + summarization (pattern discovery).
- Data Science: umbrella covering prediction, summarization, manipulation, visualization, etc.
- Other synonyms: knowledge extraction, information discovery/harvesting, data archaeology, exploratory data analysis.
- After distributed preprocessing, analysts often materialize a manageable “skinny” table.
- When tables are still large, random subsamples (~0.1% at Google) often suffice for downstream statistical modeling.
- Exploratory Data Analysis (EDA) & cleaning remain artisanal; tools like OpenRefine & DataWrangler assist.
General Considerations for Prediction
- Goal: minimize out-of-sample loss (e.g., MSE, MAE) for new observations x_new.
- Overfitting dangers:
- n independent regressors fit n observations perfectly but generalize poorly.
- Three ML countermeasures:
- Regularization – penalize model complexity.
- Train–validation–test split – distinct data for estimation, model choice, and final evaluation.
- k-Fold Cross-Validation (CV):
- Step-by-step procedure:
- Partition data into k folds (s=1,…,k).
- Choose tuning parameter candidate.
- Fit on k−1 folds, predict on held-out fold s, record loss.
- Cycle through all folds & parameter values.
- Select the parameter minimizing average out-of-sample loss (the “1-SE rule” instead picks the simplest model within one standard error of the minimum); a sketch follows.
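A minimal hand-rolled sketch of the procedure above, using synthetic data and polynomial degree as the (illustrative) tuning parameter:

```r
# k-fold CV to choose a tuning parameter (here: polynomial degree in lm)
set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)   # illustrative data

k       <- 5
folds   <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment
degrees <- 1:6                                        # candidate tuning values
cv_mse  <- sapply(degrees, function(d) {
  mean(sapply(1:k, function(s) {
    train <- dat[folds != s, ]
    test  <- dat[folds == s, ]
    fit   <- lm(y ~ poly(x, d), data = train)         # fit on the k-1 other folds
    mean((test$y - predict(fit, newdata = test))^2)   # loss on held-out fold s
  }))
})
degrees[which.min(cv_mse)]   # parameter minimizing average out-of-sample loss
```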
- Economists historically cited “small-sample” excuses for in-sample fit measures; big data removes that alibi.
- When linear/logit models are too rigid, consider:
- Classification and Regression Trees (CART)
- Random Forests / Boosted Trees
- Penalized regression: LASSO, LARS, Elastic Net
- (Additional but not covered in detail: neural nets, deep learning, SVMs.)
Classification & Regression Trees (CART)
Concept
- Binary recursive partitioning: repeatedly split predictor space into rectangles (or hyper-rectangles) to minimize impurity (Gini, deviance, MSE).
- Produces human-readable decision rules.
- Handles nonlinearities, interactions & missing data naturally.
Titanic Example (rpart in R)
- Predict survival (0/1) using age & class.
- Learned rules (Table 2):
- 3rd-class passengers → “Died” (370/501).
- 1st/2nd-class children (<16 yrs) → “Lived” (34/36), etc.
- Misclassification ≈ 30% on test set.
- Partition plot (2D) visualizes rectangular regions; scalable to many predictors (non-visual).
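A sketch of the rpart fit described above; the data frame `titanic` and its column names (survived, pclass, age) are assumptions about how the data are laid out:

```r
# Classification tree for Titanic survival (sketch)
library(rpart)
fit <- rpart(factor(survived) ~ pclass + age, data = titanic, method = "class")
print(fit)                                   # human-readable split rules
pred <- predict(fit, titanic, type = "class")
mean(as.character(pred) != as.character(titanic$survived))  # misclassification rate
```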
Logistic vs Tree Contrast
- Logistic regression yielded tiny age coefficient (p≈0.07).
- Tree discovered sharp nonlinearity: survival high for children (<8.5 yrs) & low for elderly—pattern obscured in global logit.
- Takeaway: Trees uncover heterogeneity & thresholds that linear models may mask.
Pruning & Complexity Control
- “Large” trees overfit just like saturated regressions.
- Strategy: cost-complexity pruning, with CV choosing the optimal number of leaves.
- Analysts often pick the penalty one standard error above the minimum CV loss for parsimony (see the sketch below).
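rpart computes the cross-validated error for each complexity value automatically; a sketch of pruning with the 1-SE rule, continuing the hypothetical `fit` from above:

```r
# Cost-complexity pruning via rpart's built-in CV table
printcp(fit)                                      # xerror = CV loss per complexity value cp
tab    <- fit$cptable
best   <- which.min(tab[, "xerror"])
thresh <- tab[best, "xerror"] + tab[best, "xstd"] # minimum plus one standard error
cp_1se <- tab[tab[, "xerror"] <= thresh, "CP"][1] # simplest tree within the band
pruned <- prune(fit, cp = cp_1se)
```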
Conditional Inference Trees (ctree)
- Hypothesis-test-based splitting avoids bias toward predictors with many cut-points.
- Example tree for Titanic (Figure 4): first split on gender, then class, age, siblings/spouse; summarizes “Women & children first—especially in 1st class.”
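A conditional-inference-tree sketch using the partykit package (the older party package also provides ctree); same hypothetical `titanic` column names as above, plus sex and sibsp:

```r
# Conditional inference tree: splits chosen by permutation tests, not impurity
library(partykit)
ct <- ctree(factor(survived) ~ sex + pclass + age + sibsp, data = titanic)
plot(ct)   # first split on sex, then class/age: "women & children first"
```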
Boosting, Bagging, Bootstrap
- Bootstrap: resample with replacement to assess sampling variability.
- Bagging (Bootstrap AGGregatING): average predictions from many bootstrapped models → variance reduction, especially effective for unstable learners (trees).
- Boosting: sequentially re-weight misclassified observations; final prediction is weighted vote/average of many weak learners → often large accuracy gains.
- Combination yields Ensembles of trees (forests, gradient boosting machines) that routinely win predictive competitions.
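A hand-rolled bagging sketch for intuition (packages such as ipred and gbm automate bagging and boosting; the function and argument names here are illustrative):

```r
# Bagging: average class probabilities from trees grown on bootstrap resamples
library(rpart)
bagged_prob <- function(formula, data, newdata, B = 100) {
  preds <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap sample
    fit  <- rpart(formula, data = boot, method = "class")
    predict(fit, newdata)[, 2]                          # P(second class) per row
  })
  rowMeans(preds)   # aggregate: variance-reducing average over B trees
}
```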
Random Forests
- Canonical algorithm:
- Draw bootstrap sample of observations.
- At each split, consider random subset of predictors (mtry).
- Grow tree to full depth (no pruning).
- Aggregate many trees (majority vote or mean).
- Strengths: high accuracy, handles nonlinearities & high-dimensionality, built-in variable importance metrics.
- Weakness: interpretability (“black box”).
- HMDA mortgage-approval data: random forest misclassified 223/2,380 (better than logit’s 225 and ctree’s 228); top predictor dmi = denied mortgage insurance, race near bottom → mirrors ctree finding.
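A sketch with the randomForest package; the data frame `hmda` and its column names (deny as a factor response, dmi among the predictors) are assumptions about the mortgage data:

```r
# Random forest on the mortgage-approval data (sketch)
library(randomForest)
set.seed(1)
rf <- randomForest(deny ~ ., data = hmda, importance = TRUE)
rf$confusion     # out-of-bag confusion matrix (no separate test set needed)
varImpPlot(rf)   # variable importance: dmi near the top, race near the bottom
```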
Variable Selection in Linear Models
Penalized Regression Family
- Objective with Elastic Net penalty:
  $$\min_{b_0,\,b}\; \sum_t \big(y_t - b_0 - x_t' b\big)^2 \;+\; \lambda \sum_{p=1}^{P}\Big[(1-\alpha)\,\lvert b_p\rvert + \alpha\, b_p^2\Big]$$
- λ ≥ 0 tunes overall shrinkage.
- α=0 → LASSO (ℓ₁): induces sparsity (some b_p = 0 exactly).
- α=1 → Ridge (ℓ₂): shrinks coefficients toward zero but does not set them exactly to zero.
- 0 < α < 1 → Elastic Net: a hybrid of the two penalties.
- Computable efficiently (coordinate descent, LARS), enabling large-P problems.
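A sketch with glmnet, which implements this penalty via coordinate descent. Note that glmnet’s α convention is reversed relative to the formula above: in glmnet, alpha = 1 gives the LASSO and alpha = 0 gives ridge. The data here are synthetic and illustrative:

```r
# Penalized regression with glmnet; lambda chosen by 10-fold CV
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 50), 100, 50)                # illustrative n = 100, P = 50
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(100)     # only 3 true predictors
cvfit <- cv.glmnet(X, y, alpha = 1)                  # NB: alpha = 1 is LASSO in glmnet
plot(cvfit)                                          # CV loss along the lambda path
coef(cvfit, s = "lambda.1se")                        # sparse fit under the 1-SE rule
```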
Spike-and-Slab Bayesian Selection
- Indicator vector γ (length P) denotes inclusion/exclusion; prior P(γ_p = 1) = π.
- “Spike” at zero mass for excluded vars; “slab” diffuse Normal prior for included coefficients.
- MCMC draws yield posterior inclusion probabilities P(γ_p = 1 | data) and coefficient distributions.
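A sketch with the BoomSpikeSlab package (one of several spike-and-slab MCMC implementations; the package choice and the data frame `dat` are assumptions):

```r
# Spike-and-slab regression: MCMC over inclusion indicators and coefficients
library(BoomSpikeSlab)
fit <- lm.spike(y ~ ., data = dat, niter = 5000)
summary(fit)   # posterior inclusion probabilities and coefficient summaries
plot(fit)      # default plot shows inclusion probabilities per predictor
```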
Growth-Regression Application (Sala-i-Martin dataset, 72 countries, 42 vars)
- Compared four methods:
- Exhaustive CDF(0) search (2 × 10⁶ regressions)
- Bayesian Model Averaging (BMA)
- LASSO
- Spike-and-Slab
- Broad agreement on top predictors (initial GDP, Confucian share, life expectancy, equipment investment). Divergence on lower-ranked variables illustrates model uncertainty.
- LASSO & Bayesian methods far more computationally efficient than brute-force search.
Variable Selection for Time-Series: Bayesian Structural Time Series (BSTS)
- Motivation: choosing informative Google Trends predictors among billions of queries.
- Model components:
- Observation equation: y_t = ℓ_t + z_t + e_1t
- State equations:
- Level: ℓ_t = ℓ_{t−1} + b_{t−1} + e_2t
- Trend: b_t = b_{t−1} + e_3t
- Regression: z_t = β′x_t, with a spike-and-slab prior on β
- MCMC draws jointly sample variances, inclusion vector γ, coefficients, and latent states, enabling:
- Point & interval forecasts via Kalman filtering/smoothing.
- Posterior inclusion probabilities for each predictor.
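A sketch with the bsts package (Scott & Varian’s implementation of this model); the data frame `dat` with response y and candidate predictors, and the future-regressor object `future_x`, are assumptions:

```r
# BSTS: local linear trend plus spike-and-slab regression on external predictors
library(bsts)
ss    <- AddLocalLinearTrend(list(), dat$y)            # level + trend state components
model <- bsts(y ~ ., state.specification = ss, data = dat,
              niter = 5000, expected.model.size = 3)   # sparse prior over predictors
plot(model, "components")                              # trend vs regression decomposition
plot(model, "coefficients")                            # posterior inclusion probabilities
pred  <- predict(model, horizon = 12, newdata = future_x)  # needs future predictor values
```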
Housing Example
- Response: U.S. monthly new-home sales (HSN1FNSA).
- Google Correlate provided top 100 query series.
- BSTS selected terms like “appreciation rate” (positive) and “irs 1031” (negative) after dropping spurious ones (“oldies lyrics”).
- Incremental MAE fell from 0.52 (trend only) → 0.15 with two predictors (Figure 7) despite tiny training sample.
Causal Inference vs Pure Prediction
- Econometric toolkit for causality: IV, regression discontinuity, diff-in-diffs, natural & randomized experiments.
- ML traditionally focuses on prediction; yet causal modeling frameworks exist (e.g., Pearl’s graphical models) but are under-used in practice.
- Two core causal-estimation challenges (Angrist–Pischke decomposition):
- Modeling assignment rule (selection into treatment).
- Modeling counterfactual outcome for treated units.
- Both are prediction problems ⇒ ML can help improve estimates of treatment effects by reducing prediction error for selection bias & counterfactuals.
Advertising Effectiveness Illustration
- Problem: ad spend & sales both rise during holidays → confounding.
- Full experiments are costly; alternative: treat counterfactual as forecasting problem.
- Procedure:
- Build BSTS model on pre-treatment data (trend, seasonality, exogenous Google Trends, weather, etc.).
- Run ad campaign.
- Forecast “would-have-been” visits, compare to actual → estimate causal lift.
- Example (Figure 8): cumulative uplift 107 k visits over 55 days (≈ 27% relative), with credible intervals.
- A predictive model may outperform a naive randomized control group if it captures rich covariate information (e.g., city-specific weather).
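A sketch using Google’s open-source CausalImpact R package, which implements essentially this counterfactual-forecast procedure on top of bsts; the series names (`visits`, `controls`, `dates`) and the dates are placeholders:

```r
# Counterfactual forecasting for causal lift (CausalImpact, built on bsts)
library(CausalImpact)
series <- zoo::zoo(cbind(visits, controls), dates)  # response first, controls after
impact <- CausalImpact(series,
                       pre.period  = as.Date(c("2013-01-01", "2013-03-01")),
                       post.period = as.Date(c("2013-03-02", "2013-04-25")))
summary(impact)   # average and cumulative lift with credible intervals
plot(impact)      # observed vs counterfactual, pointwise and cumulative effect
```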
Model Uncertainty & Ensemble Learning
- Lesson from Netflix Prize: blend of 800+ models beat any single algorithm; even averaging top-2 submissions improved accuracy.
- The macro-forecasting literature recognized decades ago that the average of several forecasts typically beats any individual model’s forecast.
- Applied econometric papers implicitly address model uncertainty via multiple-specification tables; big-data context urges formal, systematic treatment (BMA, ensemble forecasts) rather than ad-hoc few specs.
Opportunities for Cross-Fertilization
- Extend ML methods to non-IID settings: panels, time series with serial correlation.
- Embed causal-inference structure (treatment, potential outcomes) within flexible ML estimators.
- Combine econometric identification strategies with ML regularization & prediction (e.g., “double/debiased ML” of Chernozhukov et al., post-dating this article).
Ethical & Practical Implications
- Lower cost of storage/compute democratizes big-data analytics → raises privacy, confidentiality concerns (especially with fine-grained transaction data).
- Interpretability vs accuracy trade-off: regulators & policy analysts may require transparent models; ensemble/tree methods need post-hoc explanation tools.
Suggested Further Reading & Resources
- Textbooks:
- Hastie, Tibshirani & Friedman (2009) – graduate-level bible of statistical learning.
- James, Witten, Hastie & Tibshirani (2013) – introductory level with R labs.
- Murphy (2012) – Bayesian perspective on ML.
- Venables & Ripley (2002) – Modern Applied Statistics with S: the classic applied-statistics reference with S/R code.