Model Selection

Model Selection Purposes

  • Choosing explanatory variables based on the model's goals:

    • Prediction for new observations.

    • Describing relationships between variables.

Variable Selection Methods

  1. Stepwise Regression: Add/remove variables one by one.

  2. Best Subset Regression: Search over subsets of the predictor variables and identify the best-fitting one.

  3. Shrinkage: Fit a model with all predictors, shrinking the coefficients of unimportant ones toward zero.

Example Analysis

  • Factors affecting life expectancy include:

    • Adult mortality: number of deaths between ages 15 and 60 (per 1000).

    • Infant deaths (per 1000).

    • Per capita alcohol consumption.

    • Health expenditure (% of GDP per capita).

    • Average BMI.

  • Data collected from the WHO and UN (2014); missing data were accounted for.

Model Comparisons

  • Simple Model:

    • model.simple <- lm(`Life expectancy` ~ GDP)

  • Extended Model:

    • Includes all variables from the dataset.

    • Compare the $R^2$ values of the simple and extended models to assess fit.
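The comparison can be sketched as follows. The life expectancy data are not reproduced here, so the built-in mtcars data set stands in for them (an assumption), with mpg as the response and wt playing the role of GDP. The names model.extended, r2.simple and r2.extended are illustrative.

```r
# Stand-in data: mtcars replaces the life expectancy data (assumption).
model.simple   <- lm(mpg ~ wt, data = mtcars)   # one predictor
model.extended <- lm(mpg ~ ., data = mtcars)    # all remaining columns as predictors

r2.simple   <- summary(model.simple)$r.squared
r2.extended <- summary(model.extended)$r.squared

# R^2 cannot decrease when predictors are added, so adjusted R^2 is reported too.
adj.simple   <- summary(model.simple)$adj.r.squared
adj.extended <- summary(model.extended)$adj.r.squared

print(c(simple = r2.simple, extended = r2.extended))
print(c(simple = adj.simple, extended = adj.extended))
```

A larger $R^2$ for the extended model is expected by construction, which is why the later criteria penalize model size.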

Stepwise Selection Process

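  • Forward Selection: Begin with single-variable models and, at each step, add the variable that most improves the model.

  • Backward Selection: Start with all variables and, at each step, remove the one that contributes least.

  • Hybrid Approach: Combines forward and backward methods, so a variable can be added or removed at each step.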

Variable Selection Criteria

  • Adjusted $R^2$: Used for linear models; higher values preferred.

  • Akaike's Information Criterion (AIC): Lower values preferred.

  • Bayesian Information Criterion (BIC): Lower values preferred.
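As a rough illustration, the three criteria can be computed for two competing models with base R functions. The mtcars data and the models fit.small and fit.large are stand-ins chosen for this sketch, not the models from the example analysis.

```r
# Stand-in data: mtcars replaces the life expectancy data (assumption).
fit.small <- lm(mpg ~ wt, data = mtcars)
fit.large <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

criteria <- data.frame(
  model  = c("small", "large"),
  adj.r2 = c(summary(fit.small)$adj.r.squared, summary(fit.large)$adj.r.squared),
  AIC    = c(AIC(fit.small), AIC(fit.large)),
  BIC    = c(BIC(fit.small), BIC(fit.large))
)
print(criteria)

# AIC charges 2 per estimated parameter, BIC charges log(n),
# so BIC penalizes model size more heavily once n >= 8.
print(c(aic.penalty = 2, bic.penalty = log(nrow(mtcars))))
```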

step() Function in R

  • Performs stepwise model selection by AIC; the direction argument can be "forward", "backward", or "both".
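A minimal sketch of calling step() in each mode, again with mtcars as a stand-in for the life expectancy data (an assumption). The object names full, null, backward, hybrid and backward.bic are illustrative.

```r
# Stand-in data: mtcars replaces the life expectancy data (assumption).
full <- lm(mpg ~ ., data = mtcars)   # upper bound: all predictors
null <- lm(mpg ~ 1, data = mtcars)   # lower bound: intercept only

# Backward elimination from the full model, using AIC (k = 2 is the default).
backward <- step(full, direction = "backward", trace = 0)

# Hybrid search starting from the null model; scope bounds the search.
hybrid <- step(null,
               scope = list(lower = formula(null), upper = formula(full)),
               direction = "both", trace = 0)

# Setting k = log(n) makes step() use the BIC penalty instead of AIC.
backward.bic <- step(full, direction = "backward", trace = 0,
                     k = log(nrow(mtcars)))

print(formula(backward))
print(formula(hybrid))
print(formula(backward.bic))
```

Setting trace = 0 only suppresses the step-by-step log; leaving it at the default shows which variable is added or dropped at each step.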

Problems with Single-Direction Selection

  1. Once a variable has been added or removed, the decision is never revisited, which can lead to suboptimal models.

  2. Collinearity among predictors becomes a bigger problem.

  3. Repeated automated hypothesis testing inflates the Type I error rate.

Best Subsets Selection

  • Finds the best model of each size, for subsets of up to 8 variables.

  • Often uses BIC as the selection criterion.
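The idea can be sketched by brute force in base R: fit every subset of each size, keep the one with the lowest residual sum of squares, then compare the winners by BIC. This is only an illustration on the stand-in mtcars data (an assumption) with a small illustrative candidate list; dedicated packages such as leaps do the search far more efficiently, and the limit of 8 variables may reflect the default of its regsubsets() function.

```r
# Stand-in data: mtcars replaces the life expectancy data (assumption).
predictors <- c("wt", "hp", "disp", "qsec", "drat")   # illustrative candidates
max.size <- 3                                          # largest subset size searched

results <- data.frame()
for (size in seq_len(max.size)) {
  subsets <- combn(predictors, size, simplify = FALSE)   # all subsets of this size
  fits <- lapply(subsets, function(vars) {
    lm(reformulate(vars, response = "mpg"), data = mtcars)
  })
  rss <- sapply(fits, deviance)                          # residual sum of squares
  winner <- which.min(rss)                               # best model of this size
  results <- rbind(results, data.frame(
    size = size,
    variables = paste(subsets[[winner]], collapse = " + "),
    BIC = BIC(fits[[winner]])
  ))
}
print(results)

# Across sizes, choose the winner with the lowest BIC.
best <- results[which.min(results$BIC), ]
print(best)
```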

LASSO Overview

  • Regularizes coefficient estimates with an L1 penalty whose strength is set by a tuning parameter λ; larger λ shrinks more coefficients exactly to zero.
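To show how the tuning parameter drives coefficients to zero, here is a minimal coordinate-descent sketch in base R. It is not the implementation used in the example analysis (a package such as glmnet would normally be used); the functions soft and lasso.cd, the fixed iteration count, and the mtcars stand-in data are all assumptions made for illustration.

```r
# Soft-thresholding operator: shrinks z toward zero by lambda.
soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Minimal LASSO via coordinate descent on standardized predictors.
# Minimizes (1 / (2n)) * sum((y - X b)^2) + lambda * sum(abs(b)).
lasso.cd <- function(X, y, lambda, iterations = 200) {
  X <- scale(X)        # center and scale each predictor
  y <- y - mean(y)     # center the response, so no intercept is needed
  n <- nrow(X)
  b <- setNames(numeric(ncol(X)), colnames(X))
  for (it in seq_len(iterations)) {
    for (j in seq_len(ncol(X))) {
      partial <- y - X[, -j, drop = FALSE] %*% b[-j]   # residual ignoring predictor j
      z <- sum(X[, j] * partial) / n
      b[j] <- soft(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  b
}

# Stand-in data: mtcars replaces the life expectancy data (assumption).
X <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec", "drat")])
y <- mtcars$mpg
for (lambda in c(0.5, 2, 10)) {
  print(round(lasso.cd(X, y, lambda), 3))
}
```

Coefficients here are on the standardized scale. In practice λ is usually chosen by cross-validation rather than fixed by hand.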

Evaluating Final Model

  • Check that the final variables make sense and that the model meets the necessary assumptions.

  • Assess for multicollinearity using Variance Inflation Factor (VIF).
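The VIF of a predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it by hand in base R; the helper vif.manual and the model final are hypothetical names, and mtcars again stands in for the life expectancy data (an assumption). Values above roughly 5 to 10 are often treated as a warning sign, a common rule of thumb rather than something stated in the source.

```r
# Stand-in data: mtcars replaces the life expectancy data (assumption).
final <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on all the other predictors in the model.
vif.manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(name) {
    others <- X[, colnames(X) != name, drop = FALSE]
    r2 <- summary(lm(X[, name] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif.manual(final)
print(round(vifs, 2))
```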

Final Model Interpretation

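  • Adult Mortality and HIV/AIDS have negative coefficients: higher values are associated with lower life expectancy.

  • Alcohol, Total Expenditure, and Schooling have positive coefficients: higher values are associated with higher life expectancy.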