
Week 3: Learning from Data

Coursework Update

  • Coursework release delayed due to departmental moderation.
  • PDF and early page expected by Wednesday.

Recap

  • Goal: Find a mapping function between x and y.
  • Mapping function has a specific form with parameters $\omega$.
  • Best parameters are found by minimizing error (e.g., finding the $\beta_0$ and $\beta_1$ that minimize the error).
  • More complex models were introduced along with methods for comparison.
  • Challenge: Models are trained on data, but their performance on unseen data is unknown.
  • Solution: Split data into training and test sets to mimic real-world scenarios.
  • Techniques like cross-validation are used for data splitting.
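A minimal sketch of the train/test split described above, using scikit-learn on invented synthetic data (the feature X, target y, and the 25% test fraction are illustrative assumptions, not from the lecture):

```python
# Hold out part of the data to mimic "unseen" data (synthetic data for illustration).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))           # synthetic feature
y = 2.0 * X.ravel() + rng.normal(0, 1, 100)     # noisy linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
```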

Bias-Variance Trade-off

  • Focus: Model complexity from the perspective of bias-variance trade-off.
  • Regularization: Using techniques like Lasso and Ridge regression.
  • Applicability: Primarily discussed in the context of regression but applicable to other models like regression trees.

K-Fold Cross-Validation Revisited

  • Data is split into training and test sets.
  • Model is trained on the training set and tested on the test set.
  • Process is repeated k times, resulting in k measures of error.
  • Average error is calculated.
  • Variance: The spread of error measurements across different test sets is considered.
  • High Variance: If a model trained on one subset performs poorly on a different test set, it indicates high variance.
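A short sketch of k-fold cross-validation with scikit-learn on the same kind of synthetic data as above (k = 5 and the linear model are assumptions for illustration); the k per-fold errors give both the average error and its spread:

```python
# k-fold CV: k error measurements, then their mean (average error) and spread (variance).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# cross_val_score returns negative MSE for this scoring; negate to get errors.
errors = -cross_val_score(LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error")
print("per-fold MSE:", errors)
print("mean error:", errors.mean(), "  spread (std):", errors.std())
```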

Polynomial Regression Example

  • Polynomial models of varying degree are considered (degree 1 corresponds to simple linear regression).
  • Expected Behavior: As complexity increases, training error decreases (the model "memorizes" the data).
  • Zero Error: Possible to achieve zero error on the training data with a sufficiently complex model.
  • Test Set Error: Initially decreases with complexity, but then starts to increase as the model overfits.
Underfitting
  • Occurs when the model is too simple and fails to capture the underlying patterns in the data.
Overfitting
  • Occurs when the model is too complex and memorizes the training data, including the noise.
Sweet Spot
  • The ideal level of complexity, where the model captures the underlying process that generated the data without fitting the noise.
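A hedged sketch of the degree sweep described above (the sinusoidal data, noise level, and chosen degrees are illustrative assumptions): training error keeps shrinking as the degree grows, while test error eventually rises again once the model overfits.

```python
# Training vs. test error as polynomial degree (model complexity) increases.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 30)     # nonlinear target plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=1)

for degree in (1, 3, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```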

Bias and Variance in Model Behavior

  • Underfitting Model: Makes systematic errors with low variance (consistent errors).
  • Overfitting Model: Performs well on data similar to the training set but makes significant errors on different data (high variance).

Bias vs. Variance Analogy (Target Shooting)

  • Low Bias, Low Variance: Shots are centered around the target and tightly grouped.
  • Low Bias, High Variance: Shots are centered around the target but widely scattered.
  • High Bias, Low Variance: Shots are consistently off-target but tightly grouped.
  • High Bias, High Variance: Shots are off-target and widely scattered.

Error Sources

  • Model Error: The model may be inherently wrong or unstable.
  • Intrinsic Uncertainty: Data may have inherent randomness or noise.

Components of Error

  • Bias: Arises from the model being too simple and missing patterns in the data (underfitting).
  • Variance: Arises from the model capturing noise in addition to the underlying patterns (overfitting).
  • Uncertainty: Inherent randomness in the data or the phenomenon being modeled.
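These three components correspond to the standard decomposition of the expected squared error at a point $x$ (written out here for reference; the notation $f$ for the true function, $\hat{f}$ for the fitted model, and $\sigma^2$ for the noise variance is assumed):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$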

Bias-Variance Trade-off Illustrated

  • As model complexity increases, bias decreases, but variance increases.
  • The goal is to find the optimal balance between bias and variance to minimize overall error.

Tuning Model Complexity

  • In polynomial regression, the degree of the polynomial controls complexity.
  • The optimal degree can be found by observing the trade-off between training and test error.
  • Regularization techniques are needed for models where complexity isn't as straightforward to tune.

Regularization Intuition

  • Regularization makes a model simpler by constraining its coefficients, effectively decreasing its complexity without changing its functional form.

Example: Linear Regression with Limited Data

  • Scenario: A linear regression model is trained on only two data points.
  • Problem: The model perfectly fits the training data (zero error) but may not generalize well to unseen data.
  • Solution: Make the model "forget" the data slightly by reducing the slope coefficient $\beta_1$ (in $y = \beta_0 + \beta_1 x$) of the regression line.
  • Decreasing the coefficient makes the training error increase but potentially improves test set performance.
  • This process is called shrinkage.
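A minimal sketch of the two-point scenario (the data points and penalty strength are invented for illustration): ordinary least squares fits the two points exactly, while a ridge penalty shrinks the slope, trading a little training error for stability.

```python
# Two training points: OLS fits them exactly; ridge shrinks the slope (shrinkage).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X = np.array([[1.0], [2.0]])   # two illustrative data points
y = np.array([1.0, 3.0])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS slope:  ", ols.coef_[0])    # 2.0 -- zero training error
print("Ridge slope:", ridge.coef_[0])  # smaller slope -- some training error, less variance
```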

Regularization Techniques

  • Goal: Systematically decrease the magnitudes of the coefficients (the $\beta$ values).
  • Method: Add a penalty term to the error function that accounts for the magnitude of the coefficients.

Ridge Regression

  • Objective: Minimize both the error and the magnitude of the coefficients.
  • $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
  • Modified Error Function: Minimize error + $\lambda$ × (sum of squared coefficients).
  • Effect of Lambda ($\lambda$): Controls the strength of the penalty.
    • $\lambda = 0$: Equivalent to linear regression.
    • Large $\lambda$: Penalizes large coefficients, causing them to shrink towards zero.
  • Cross-validation is used to find the best value of $\lambda$.
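A sketch of choosing $\lambda$ by cross-validation with scikit-learn's RidgeCV (the synthetic data and the candidate alpha grid are assumptions; alpha plays the role of $\lambda$):

```python
# Pick the penalty strength by cross-validation over a small grid of candidate values.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
true_beta = np.array([1.5, 0.0, -2.0, 0.5, 0.0])
y = X @ true_beta + rng.normal(0, 1, 200)

model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5).fit(X, y)
print("chosen alpha (lambda):", model.alpha_)
print("coefficients:", np.round(model.coef_, 3))
```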

Scaling Features

  • Importance: Features must be scaled to have a similar range to prevent features with larger ranges from dominating the minimization process.
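A sketch of the scaling step (the two synthetic features with very different ranges are an illustrative assumption): standardizing puts the features on a comparable scale so the penalty does not unfairly punish the feature measured in larger units.

```python
# Standardize features before a regularized fit so the penalty treats them equally.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(0, 1, 200),        # small-range feature
                     rng.normal(0, 1000, 200)])    # large-range feature
y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 200)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(np.round(model.named_steps["ridge"].coef_, 3))   # comparable coefficients after scaling
```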

Visual Example of Ridge Regression

  • Changing $\lambda$ in a polynomial regression model visually demonstrates the effect.
    • Low $\lambda$: The model fits the training data closely.
    • Increasing $\lambda$: The model becomes simpler and fits the data more smoothly.
    • High $\lambda$: The model approaches a flat line (all coefficients shrink towards zero).
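The same effect can be seen numerically rather than visually, sketched here on synthetic data (the degree, noise level, and alpha values are assumptions): as alpha grows, the polynomial's coefficients shrink and the fit flattens out.

```python
# As alpha (lambda) grows, the fitted polynomial's coefficients shrink towards zero.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 40)

for alpha in (1e-4, 1.0, 1e6):
    model = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                          StandardScaler(), Ridge(alpha=alpha)).fit(X, y)
    coefs = model.named_steps["ridge"].coef_
    print(f"alpha={alpha:>10}: largest |coefficient| = {np.abs(coefs).max():.4f}")
```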

Lasso Regression

  • Objective: Minimize both the error and the magnitude of the coefficients.
  • $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
  • Modified Error Function: Minimize error + $\lambda$ × (sum of absolute coefficients).

L1 vs. L2 Regularization

  • Lasso regression is also known as L1 regularization.
  • Ridge regression is also known as L2 regularization.
  • Difference: L1 uses the absolute value of coefficients, while L2 uses the square of coefficients.
  • The names come from the L1 and L2 norms of the coefficient vector.
    • L1 norm: $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$.
    • L2 norm: $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$ (ridge penalizes its square).

Feature Selection with Lasso

  • When using lasso regression, increasing $\lambda$ drives some coefficients exactly to zero.
  • A coefficient of zero means the corresponding feature is not used by the model.
  • Advantage: Lasso regression can therefore be used for feature selection, by driving the coefficients of irrelevant features to zero.
  • Under the L1 penalty, coefficients reach zero more quickly than under the L2 penalty, which only shrinks them towards zero.
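A sketch of lasso's feature-selection behaviour (the synthetic data with two irrelevant features and the chosen alpha values are assumptions): as alpha increases, the coefficients of the irrelevant features hit exactly zero.

```python
# Increasing alpha (lambda) drives some lasso coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
true_beta = np.array([1.5, 0.0, -2.0, 0.5, 0.0])   # features 2 and 5 are irrelevant
y = X @ true_beta + rng.normal(0, 1, 200)

for alpha in (0.01, 0.1, 1.0):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha}: ", np.round(coefs, 3))
```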

Implementation in Python

  • Ridge Regression: linear_model.Ridge
  • Lasso Regression: linear_model.Lasso
  • Why alpha: The penalty strength is called alpha in scikit-learn because lambda is a reserved Python keyword (used for lambda functions).
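A minimal usage sketch of the two estimators named above (the alpha values and synthetic data are illustrative assumptions; alpha corresponds to the $\lambda$ in the formulas):

```python
# scikit-learn's Ridge (L2 penalty) and Lasso (L1 penalty); alpha plays the role of lambda.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(0, 0.5, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
```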