
Complex Models, Error, and Data Splitting

Model Usage: Prediction vs. Interpretation

  • Prediction:

    • Objective: Find the optimal \omega to achieve the best predictive performance.
    • Approach: Treat the model as a black box, focusing on the accuracy of predictions.
    • Example: Linear regression, where the goal is to predict future values without necessarily understanding the significance of each parameter.
  • Interpretation:

    • Objective: Understand the impact of different features on the outcome.
    • Approach: Focus on the values of \omega to analyze the importance of each feature.
    • Example: Housing prices, where linear regression is used to identify which features (e.g., location, size) have the most significant impact on house prices.
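The two uses can be sketched with one fitted model: prediction treats the fitted \omega as a black box, while interpretation reads the individual coefficients. A minimal NumPy sketch (feature names and data are made up for illustration):

```python
import numpy as np

# Hypothetical housing data: feature names and coefficients are invented.
rng = np.random.default_rng(0)
n = 200
size = rng.uniform(50, 200, n)       # house size in m^2
location = rng.uniform(0, 10, n)     # location score
price = 3.0 * size + 15.0 * location + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), size, location])   # intercept + features
omega, *_ = np.linalg.lstsq(X, price, rcond=None)   # fit by least squares

# Prediction: use the model as a black box on a new input.
new_house = np.array([1.0, 120.0, 7.5])
print("predicted price:", new_house @ omega)

# Interpretation: inspect omega to see each feature's impact.
print("coefficients (intercept, size, location):", omega)
```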

Complex Regression Models

  • Polynomial Regression:

    • A method to create more complex models by introducing polynomial terms.
    • Definition: A polynomial is a sum of monomials, where a monomial is a constant times a variable raised to a power.
  • General Form:

    • y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k
  • Polynomial Regression is Linear:

    • Polynomial regression is a linear model because it is linear in the coefficients \beta: the transformation is applied to the data (x, x^2, \dots, x^k), not to the coefficients.
    • The model remains a linear combination of transformed features, even though those features involve higher powers of x.
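This "linear in the coefficients" point is easy to see in code: build the transformed columns [1, x, x^2, ..., x^k] and solve an ordinary linear least-squares problem. A short sketch with synthetic data:

```python
import numpy as np

# Synthetic data from a cubic: y = 1 - 2x + 0.5x^3 + noise (made up).
rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0, 0.1, x.size)

k = 3
X = np.vander(x, k + 1, increasing=True)  # columns: x^0, x^1, ..., x^k
# The betas enter linearly, so this is plain linear least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted betas:", beta)
```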

Model Comparison and Complexity

  • Overfitting:

    • Occurs when the model learns the training data too well, capturing noise and specific patterns that do not generalize to new data.
    • Overfitted models perform poorly on unseen data because they memorize the training set instead of learning the underlying phenomenon.
  • Bayesian Information Criterion (BIC):

    • A metric to compare models while accounting for their complexity. It balances the model's error with the number of parameters.
    • Schematic formula: BIC = \text{Error} + \lambda \times \text{Number of Parameters}, where \lambda is a complexity penalty. (The standard definition is BIC = k \ln n - 2 \ln \hat{L}, with k parameters, n samples, and likelihood \hat{L}; the idea of "fit plus penalty" is the same.)
    • Goal: Minimize BIC to find the model with the best trade-off between low error and few parameters.
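A concrete sketch of using a BIC-style score to pick a polynomial degree, using the standard Gaussian-error form BIC = n ln(RSS/n) + k ln(n), which matches the "error plus parameter penalty" idea above (the data-generating quadratic is made up):

```python
import numpy as np

# Synthetic data from a quadratic: y = 2 + 3x - 4x^2 + noise.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 80)
y = 2.0 + 3.0 * x - 4.0 * x**2 + rng.normal(0, 0.2, x.size)

def bic(degree):
    """Gaussian-error BIC: n*ln(RSS/n) + k*ln(n), k = degree + 1."""
    X = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, k = x.size, degree + 1
    return n * np.log(rss / n) + k * np.log(n)

scores = {d: bic(d) for d in range(1, 7)}
best = min(scores, key=scores.get)
print("BIC per degree:", scores, "-> best degree:", best)
```

Higher-degree models always reduce the raw error, but the k ln(n) penalty makes BIC rise again once extra parameters stop paying for themselves.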

Extensions of Linear Models

  • Types:

    • Logistic Regression
    • Support Vector Machines (SVM)
    • Multi-Layer Perceptron (MLP)
  • Interaction Terms:

    • Used to model the interaction between features.
    • Example: Life expectancy, where the combined effect of alcohol consumption and liver cancer is greater than the sum of their individual effects.
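An interaction term is just another transformed feature: the product of two existing columns. A sketch comparing a main-effects-only model to one with an interaction column (variable names and data are illustrative, not real health data):

```python
import numpy as np

# Made-up data where the true model includes an interaction x1*x2.
rng = np.random.default_rng(3)
n = 300
x1 = rng.uniform(0, 1, n)   # e.g. alcohol consumption (invented scale)
x2 = rng.uniform(0, 1, n)   # e.g. a liver-disease indicator (invented)
y = 1.0 - 0.5 * x1 - 0.5 * x2 - 2.0 * x1 * x2 + rng.normal(0, 0.1, n)

X_main = np.column_stack([np.ones(n), x1, x2])   # main effects only
X_int = np.column_stack([X_main, x1 * x2])       # plus interaction term

def fit_rss(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

rss_main = fit_rss(X_main)
rss_int = fit_rss(X_int)
print(f"RSS main effects only: {rss_main:.2f}")
print(f"RSS with interaction:  {rss_int:.2f}")
```

The model stays linear in its coefficients; the interaction simply adds one more column to the design matrix.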

Data Splitting and Testing

  • Holdout Set:

    • A portion of the data used to mimic unseen data.
    • The model is trained on the training set and then evaluated on the holdout set (test set).
  • Data Leakage:

    • Occurs when information from the test set inadvertently influences training.
    • Example: Computing a feature such as the average budget or revenue over the entire dataset before splitting, so the feature encodes information about the test set.
  • Workflow:

    1. Split the dataset into a training set and a test set.
    2. Train the model using the training set.
    3. Test the model on the test set to evaluate its performance on unseen data.
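The three-step workflow above can be sketched with a simple random split (NumPy only; the data and split sizes are illustrative):

```python
import numpy as np

# Synthetic regression data (made up for the sketch).
rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.3, n)

# 1. Split into a training set (80%) and a test set (20%).
idx = rng.permutation(n)
train, test = idx[:400], idx[400:]

# 2. Train using the training set only.
omega, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# 3. Evaluate on the held-out test set, which mimics unseen data.
test_mse = np.mean((y[test] - X[test] @ omega) ** 2)
print("test MSE:", test_mse)
```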

Cross-Validation

  • Purpose:

    • To address the issue of arbitrary data splitting by using multiple test sets.
    • Provides a more robust measure of error by averaging the results across different splits.
  • Process:

    1. Split the data into multiple folds.
    2. Train the model on a subset of the folds and test on the remaining fold.
    3. Repeat this process, using different folds for testing each time.
    4. Average the error across all iterations.
  • Data Leakage:

    • Can still occur in cross-validation if not handled carefully.
    • It's essential to ensure that no information from the test sets leaks into the training sets.
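The four-step cross-validation process can be written by hand in a few lines (here k = 5 folds; the data is synthetic):

```python
import numpy as np

# Synthetic data: y = 1 + 2x + noise (invented for the sketch).
rng = np.random.default_rng(5)
n, k = 200, 5
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.2, n)

folds = np.array_split(rng.permutation(n), k)   # 1. split into k folds
errors = []
for i in range(k):
    test_idx = folds[i]                                   # held-out fold
    train_idx = np.concatenate(folds[:i] + folds[i + 1:]) # 2. train on rest
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))  # 3.

cv_error = np.mean(errors)   # 4. average error across all folds
print("per-fold MSE:", [f"{e:.3f}" for e in errors])
print("cross-validated MSE:", cv_error)
```

Note that any preprocessing (scaling, feature averages) must be fit inside each training fold, not on the full dataset, or the leakage described above reappears.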

Stratified Sampling

  • Issue:

    • Imbalanced datasets can lead to biased samples, where one group is overrepresented in the training or test set.
  • Solution:

    • Stratified sampling ensures that each group is represented proportionally in both the training and test sets.
    • Example: If a dataset has 60% group A and 40% group B, stratified sampling ensures that the training and test sets maintain this proportion.
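The 60/40 example above can be sketched by splitting each group separately and then combining the parts (group labels and sizes are illustrative):

```python
import numpy as np

# 100 samples: 60% group A, 40% group B, matching the example above.
rng = np.random.default_rng(6)
labels = np.array(["A"] * 60 + ["B"] * 40)

train_idx, test_idx = [], []
for group in np.unique(labels):
    # Shuffle within the group, then take 80% for train, 20% for test.
    idx = rng.permutation(np.flatnonzero(labels == group))
    cut = int(0.8 * idx.size)
    train_idx.extend(idx[:cut])
    test_idx.extend(idx[cut:])

train, test = labels[train_idx], labels[test_idx]
for name, part in [("train", train), ("test", test)]:
    print(f"{name}: {part.size} samples, {np.mean(part == 'A'):.0%} group A")
```

Because the split is done per group, both sets keep the original 60/40 proportion, which a plain random split would only match on average.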