
Week 3: Learning from Data

Coursework Update

  • Coursework release delayed due to departmental moderation.
  • PDF and early page expected by Wednesday.

Recap

  • Goal: Find a mapping function between x and y.
  • Mapping function has a specific form with parameters $\omega$.
  • Best parameters are found by minimizing error (e.g., finding the $\beta_0$ and $\beta_1$ that minimize the error).
  • More complex models were introduced along with methods for comparison.
  • Challenge: Models are trained on data, but their performance on unseen data is unknown.
  • Solution: Split data into training and test sets to mimic real-world scenarios.
  • Techniques like cross-validation are used for data splitting.
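A minimal sketch of the train/test split described above, using scikit-learn on invented synthetic data (the feature X, target y, and the 25% test fraction are illustrative assumptions, not from the lecture):

```python
# Hold out part of the data to mimic "unseen" data (synthetic data for illustration).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))           # synthetic feature
y = 2.0 * X.ravel() + rng.normal(0, 1, 100)     # noisy linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
```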

Bias-Variance Trade-off

  • Focus: Model complexity from the perspective of bias-variance trade-off.
  • Regularization: Using techniques like Lasso and Ridge regression.
  • Applicability: Primarily discussed in the context of regression but applicable to other models like regression trees.

K-Fold Cross-Validation Revisited

  • Data is split into training and test sets.
  • Model is trained on the training set and tested on the test set.
  • Process is repeated k times, resulting in k measures of error.
  • Average error is calculated.
  • Variance: The spread of error measurements across different test sets is considered.
  • High Variance: If a model trained on one subset performs poorly on a different test set, it indicates high variance.
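A short sketch of k-fold cross-validation with scikit-learn on the same kind of synthetic data as above (k = 5 and the linear model are assumptions for illustration); the k per-fold errors give both the average error and its spread:

```python
# k-fold CV: k error measurements, then their mean (average error) and spread (variance).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# cross_val_score returns negative MSE for this scoring; negate to get errors.
errors = -cross_val_score(LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error")
print("per-fold MSE:", errors)
print("mean error:", errors.mean(), "  spread (std):", errors.std())
```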

Polynomial Regression Example

  • Polynomial models of varying degree are considered (degree 1 corresponds to simple linear regression).
  • Expected Behavior: As complexity increases, training error decreases (the model "memorizes" the data).
  • Zero Error: Possible to achieve zero error on the training data with a sufficiently complex model.
  • Test Set Error: Initially decreases with complexity, but then starts to increase as the model overfits.
Underfitting
  • Occurs when the model is too simple and fails to capture the underlying patterns in the data.
Overfitting
  • Occurs when the model is too complex and memorizes the training data, including the noise.
Sweet Spot
  • The ideal level of complexity, where the model captures the underlying process that generated the data without fitting the noise.
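A hedged sketch of the degree sweep described above (the sinusoidal data, noise level, and chosen degrees are illustrative assumptions): training error keeps shrinking as the degree grows, while test error eventually rises again once the model overfits.

```python
# Training vs. test error as polynomial degree (model complexity) increases.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 30)     # nonlinear target plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=1)

for degree in (1, 3, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```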

Bias and Variance in Model Behavior

  • Underfitting Model: Makes systematic errors with low variance (consistent errors).
  • Overfitting Model: Performs well on data similar to the training set but makes significant errors on different data (high variance).

Bias vs. Variance Analogy (Target Shooting)

  • Low Bias, Low Variance: Shots are centered around the target and tightly grouped.
  • Low Bias, High Variance: Shots are centered around the target but widely scattered.
  • High Bias, Low Variance: Shots are consistently off-target but tightly grouped.
  • High Bias, High Variance: Shots are off-target and widely scattered.

Error Sources

  • Model Error: The model may be inherently wrong or unstable.
  • Intrinsic Uncertainty: Data may have inherent randomness or noise.

Components of Error

  • Bias: Arises from the model being too simple and missing patterns in the data (underfitting).
  • Variance: Arises from the model capturing noise in addition to the underlying patterns (overfitting).
  • Uncertainty: Inherent randomness in the data or the phenomenon being modeled.
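These three components correspond to the standard decomposition of the expected squared error at a point $x$ (written out here for reference; the notation $f$ for the true function, $\hat{f}$ for the fitted model, and $\sigma^2$ for the noise variance is assumed):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$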

Bias-Variance Trade-off Illustrated

  • As model complexity increases, bias decreases, but variance increases.
  • The goal is to find the optimal balance between bias and variance to minimize overall error.

Tuning Model Complexity

  • In polynomial regression, the degree of the polynomial controls complexity.
  • The optimal degree can be found by observing the trade-off between training and test error.
  • Regularization techniques are needed for models where complexity isn't as straightforward to tune.

Regularization Intuition

  • Regularization makes a model simpler by constraining its coefficients, effectively decreasing its complexity without changing its functional form.

Example: Linear Regression with Limited Data

  • Scenario: A linear regression model is trained on only two data points.
  • Problem: The model perfectly fits the training data (zero error) but may not generalize well to unseen data.
  • Solution: Make the model "forget" the data slightly by reducing the slope coefficient $\beta_1$ (in $y = \beta_0 + \beta_1 x$) of the regression line.
  • Decreasing the coefficient makes the training error increase but potentially improves test set performance.
  • This process is called shrinkage.
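A minimal sketch of the two-point scenario (the data points and penalty strength are invented for illustration): ordinary least squares fits the two points exactly, while a ridge penalty shrinks the slope, trading a little training error for stability.

```python
# Two training points: OLS fits them exactly; ridge shrinks the slope (shrinkage).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X = np.array([[1.0], [2.0]])   # two illustrative data points
y = np.array([1.0, 3.0])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS slope:  ", ols.coef_[0])    # 2.0 -- zero training error
print("Ridge slope:", ridge.coef_[0])  # smaller slope -- some training error, less variance
```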

Regularization Techniques

  • Goal: Systematically decrease the magnitudes of the coefficients (the $\beta$ values).
  • Method: Add a penalty term to the error function that accounts for the magnitude of the coefficients.

Ridge Regression

  • Objective: Minimize both the error and the magnitude of the coefficients.
  • $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
  • Modified Error Function: Minimize error + $\lambda$ × (sum of squared coefficients).
  • Effect of Lambda ($\lambda$): Controls the strength of the penalty.
    • $\lambda = 0$: Equivalent to linear regression.
    • Large $\lambda$: Penalizes large coefficients, causing them to shrink towards zero.
  • Cross-validation is used to find the best value of $\lambda$.
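A sketch of choosing $\lambda$ by cross-validation with scikit-learn's RidgeCV (the synthetic data and the candidate alpha grid are assumptions; alpha plays the role of $\lambda$):

```python
# Pick the penalty strength by cross-validation over a small grid of candidate values.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
true_beta = np.array([1.5, 0.0, -2.0, 0.5, 0.0])
y = X @ true_beta + rng.normal(0, 1, 200)

model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5).fit(X, y)
print("chosen alpha (lambda):", model.alpha_)
print("coefficients:", np.round(model.coef_, 3))
```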

Scaling Features

  • Importance: Features must be scaled to have a similar range to prevent features with larger ranges from dominating the minimization process.
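A sketch of the scaling step (the two synthetic features with very different ranges are an illustrative assumption): standardizing puts the features on a comparable scale so the penalty does not unfairly punish the feature measured in larger units.

```python
# Standardize features before a regularized fit so the penalty treats them equally.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(0, 1, 200),        # small-range feature
                     rng.normal(0, 1000, 200)])    # large-range feature
y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 200)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(np.round(model.named_steps["ridge"].coef_, 3))   # comparable coefficients after scaling
```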

Visual Example of Ridge Regression

  • Changing $\lambda$ in a polynomial regression model visually demonstrates the effect.
    • Low $\lambda$: The model fits the training data closely.
    • Increasing $\lambda$: The model becomes simpler and fits the data more smoothly.
    • High $\lambda$: The model approaches a flat line (all coefficients shrink towards zero).
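The same effect can be seen numerically rather than visually, sketched here on synthetic data (the degree, noise level, and alpha values are assumptions): as alpha grows, the polynomial's coefficients shrink and the fit flattens out.

```python
# As alpha (lambda) grows, the fitted polynomial's coefficients shrink towards zero.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 40)

for alpha in (1e-4, 1.0, 1e6):
    model = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                          StandardScaler(), Ridge(alpha=alpha)).fit(X, y)
    coefs = model.named_steps["ridge"].coef_
    print(f"alpha={alpha:>10}: largest |coefficient| = {np.abs(coefs).max():.4f}")
```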

Lasso Regression

  • Objective: Minimize both the error and the magnitude of the coefficients.
  • $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
  • Modified Error Function: Minimize error + $\lambda$ × (sum of absolute coefficients).

L1 vs. L2 Regularization

  • Lasso regression is also known as L1 regularization.
  • Ridge regression is also known as L2 regularization.
  • Difference: L1 uses the absolute value of coefficients, while L2 uses the square of coefficients.
  • The names come from the L1 and L2 norms of the coefficient vector.
    • L1 norm: $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$.
    • L2 norm: $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$ (ridge penalizes its square).

Feature Selection with Lasso

  • When using lasso regression, increasing $\lambda$ drives some coefficients exactly to zero.
  • A coefficient of zero means the corresponding feature is not used by the model.
  • Advantage: Lasso regression can therefore be used for feature selection, by driving the coefficients of irrelevant features to zero.
  • Under the L1 penalty, coefficients reach zero more quickly than under the L2 penalty, which only shrinks them towards zero.
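A sketch of lasso's feature-selection behaviour (the synthetic data with two irrelevant features and the chosen alpha values are assumptions): as alpha increases, the coefficients of the irrelevant features hit exactly zero.

```python
# Increasing alpha (lambda) drives some lasso coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
true_beta = np.array([1.5, 0.0, -2.0, 0.5, 0.0])   # features 2 and 5 are irrelevant
y = X @ true_beta + rng.normal(0, 1, 200)

for alpha in (0.01, 0.1, 1.0):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha}: ", np.round(coefs, 3))
```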

Implementation in Python

  • Ridge Regression: linear_model.Ridge
  • Lasso Regression: linear_model.Lasso
  • Why alpha: The penalty strength is called alpha in scikit-learn because lambda is a reserved Python keyword (used for lambda functions).
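A minimal usage sketch of the two estimators named above (the alpha values and synthetic data are illustrative assumptions; alpha corresponds to the $\lambda$ in the formulas):

```python
# scikit-learn's Ridge (L2 penalty) and Lasso (L1 penalty); alpha plays the role of lambda.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(0, 0.5, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
```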