Week 3: Learning from Data
Coursework Update
- Coursework release delayed due to departmental moderation.
- PDF and early page expected by Wednesday.
Recap
- Goal: Find a mapping function between x and y.
- Mapping function has a specific form with parameters $\omega$.
- Best parameters are found by minimizing an error function (e.g., finding the $\beta_0$ and $\beta_1$ of a linear model that minimize the error).
- More complex models were introduced along with methods for comparison.
- Challenge: Models are trained on data, but their performance on unseen data is unknown.
- Solution: Split data into training and test sets to mimic real-world scenarios.
- Techniques like cross-validation are used for data splitting.
Bias-Variance Trade-off
- Focus: Model complexity from the perspective of bias-variance trade-off.
- Regularization: Using techniques like Lasso and Ridge regression.
- Applicability: Primarily discussed in the context of regression but applicable to other models like regression trees.
K-Fold Cross-Validation Revisited
- Data is split into training and test sets.
- Model is trained on the training set and tested on the test set.
- Process is repeated k times, resulting in k measures of error.
- Average error is calculated.
- Variance: The spread of error measurements across different test sets is considered.
- High Variance: If a model trained on one subset performs poorly on a different test set, it indicates high variance.
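A minimal sketch of the procedure above, assuming scikit-learn's KFold on a synthetic dataset; the data, model, and number of folds are illustrative choices, not from the lecture:

```python
# k-fold cross-validation: train on k-1 folds, test on the held-out fold,
# repeat k times, then look at the mean and spread of the test errors.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)   # synthetic linear data with noise

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# The average error estimates generalization; the spread hints at variance across splits.
print(f"mean MSE: {np.mean(errors):.4f}, spread (variance of MSE): {np.var(errors):.6f}")
```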
Polynomial Regression Example
- Polynomial models of increasing degree are considered (degree 1 corresponds to ordinary linear regression).
- Expected Behavior: As complexity increases, training error decreases (the model "memorizes" the data).
- Zero Error: Possible to achieve zero error on the training data with a sufficiently complex model.
- Test Set Error: Initially decreases with complexity, but then starts to increase as the model overfits.
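A sketch of the pattern described above, using synthetic data and scikit-learn pipelines; the true function, noise level, and candidate degrees are assumptions made for illustration:

```python
# Training vs. test error as polynomial degree (model complexity) increases:
# training error keeps falling, test error falls and then rises (overfitting).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=60)   # noisy nonlinear data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for degree in [1, 3, 10, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```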
Underfitting
- Occurs when the model is too simple and fails to capture the underlying patterns in the data.
Overfitting
- Occurs when the model is too complex and memorizes the training data, including the noise.
Sweet Spot
- The ideal level of complexity, where the model captures the underlying process that generated the data without overfitting.
Bias and Variance in Model Behavior
- Underfitting Model: Makes systematic errors with low variance (consistent errors).
- Overfitting Model: Performs well on data similar to the training set but makes significant errors on different data (high variance).
Bias vs. Variance Analogy (Target Shooting)
- Low Bias, Low Variance: Shots are centered around the target and tightly grouped.
- Low Bias, High Variance: Shots are centered around the target but widely scattered.
- High Bias, Low Variance: Shots are consistently off-target but tightly grouped.
Error Sources
- Model Error: The model may be inherently wrong or unstable.
- Intrinsic Uncertainty: Data may have inherent randomness or noise.
Components of Error
- Bias: Arises from the model being too simple and missing patterns in the data (underfitting).
- Variance: Arises from the model capturing noise in addition to the underlying patterns (overfitting).
- Uncertainty: Inherent randomness in the data or the phenomenon being modeled.
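For squared-error loss these three components add up. Writing $f$ for the true function, $\hat{f}$ for the fitted model, and $\sigma^2$ for the irreducible noise, the standard decomposition of the expected error at a point $x$ is:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}
$$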
Bias-Variance Trade-off Illustrated
- As model complexity increases, bias decreases, but variance increases.
- The goal is to find the optimal balance between bias and variance to minimize overall error.
Tuning Model Complexity
- In polynomial regression, the degree of the polynomial controls complexity.
- The optimal degree can be found by observing the trade-off between training and test error.
- Regularization techniques are needed for models where complexity isn't as straightforward to tune.
Regularization Intuition
- Regularization constrains a model, effectively making it simpler than its unregularized form.
Example: Linear Regression with Limited Data
- Scenario: A linear regression model is trained on only two data points.
- Problem: The model perfectly fits the training data (zero error) but may not generalize well to unseen data.
- Solution: Make the model "forget" the data somewhat by reducing the slope of the regression line, i.e., the coefficient $\beta_1$ in $y = \beta_0 + \beta_1 x$.
- Decreasing the coefficient makes the training error increase but potentially improves test set performance.
- This process is called shrinkage (a small numerical sketch follows below).
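A small numerical sketch of this shrinkage idea; the two training points, the hypothetical held-out point, and the shrinkage factors are made up for illustration:

```python
# Two training points fix the regression line exactly (zero training error).
# Shrinking the slope beta1 raises the training error but can reduce error
# on an unseen point if the true relationship is flatter than the fitted line.
import numpy as np

x_train = np.array([1.0, 2.0])
y_train = np.array([1.0, 3.0])        # exact fit: beta0 = -1, beta1 = 2
x_test, y_test = 4.0, 5.0             # hypothetical unseen point

beta1_full = (y_train[1] - y_train[0]) / (x_train[1] - x_train[0])
for shrink in [1.0, 0.75, 0.5]:       # progressively shrink the slope
    beta1 = shrink * beta1_full
    beta0 = y_train.mean() - beta1 * x_train.mean()   # re-fit the (unpenalized) intercept
    train_mse = np.mean((y_train - (beta0 + beta1 * x_train)) ** 2)
    test_err = (y_test - (beta0 + beta1 * x_test)) ** 2
    print(f"beta1 = {beta1:.2f}: train MSE = {train_mse:.3f}, test error = {test_err:.3f}")
```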
Regularization Techniques
- Goal: Systematically shrink the magnitudes of the coefficients (the $\beta$ values).
- Method: Add a penalty term to the error function that accounts for the magnitude of the coefficients.
Ridge Regression
- Objective: Minimize both the error and the magnitude of the coefficients.
- $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
- Modified Error Function: Minimize error + $\lambda$ × (sum of squared coefficients).
- Effect of Lambda ($\lambda$): Controls the strength of the penalty.
- $\lambda = 0$: Equivalent to linear regression.
- Large $\lambda$: Penalizes large coefficients, causing them to shrink towards zero.
- Cross-validation is used to find the best value of $\lambda$ (see the sketch below).
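A minimal sketch of ridge shrinkage and of picking $\lambda$ by cross-validation, using scikit-learn's Ridge and RidgeCV (scikit-learn calls the penalty strength alpha); the synthetic data and candidate values are assumptions:

```python
# Ridge coefficients shrink towards zero as the penalty strength grows;
# cross-validation then picks the penalty that generalizes best.
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

for alpha in [0.01, 1.0, 100.0]:      # alpha plays the role of lambda
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha = {alpha:6.2f}: coefficients = {np.round(coefs, 2)}")

best = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("alpha chosen by cross-validation:", best.alpha_)
```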
Scaling Features
- Importance: Features must be scaled to have a similar range to prevent features with larger ranges from dominating the minimization process.
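A sketch of one common way to do this, standardizing features inside a pipeline before the penalized fit; the two feature scales below are deliberately mismatched to show why scaling matters:

```python
# Standardize features so the ridge penalty treats all coefficients on a
# comparable scale, rather than being dominated by the feature with the
# largest numeric range. Synthetic data for illustration.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = np.column_stack([
    rng.normal(scale=1.0, size=200),      # feature measured on a small scale
    rng.normal(scale=1000.0, size=200),   # feature measured on a much larger scale
])
y = 2.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print("coefficients on the standardized features:", model.named_steps["ridge"].coef_)
```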
Visual Example of Ridge Regression
- Changing $\lambda$ in a polynomial regression model visually demonstrates the effect.
- Low $\lambda$: Model fits the training data closely.
- Increasing $\lambda$: Model becomes simpler and fits the data more generally.
- High $\lambda$: Model approaches a flat line (the coefficients shrink towards zero).
Lasso Regression
- Objective: Minimize both the error and the magnitude of the coefficients.
- $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
- Modified Error Function: Minimize error + $\lambda$ × (sum of absolute coefficients).
L1 vs. L2 Regularization
- Lasso regression is also known as L1 regularization.
- Ridge regression is also known as L2 regularization.
- Difference: L1 uses the absolute value of coefficients, while L2 uses the square of coefficients.
- The names come from the L1 and L2 norms of the coefficient vector.
- L1 norm: $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$.
- L2 norm (squared): $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$.
Feature Selection with Lasso
- As $\lambda$ increases in lasso regression, some coefficients are driven exactly to zero.
- A zero coefficient means the corresponding feature is effectively removed from the model.
- Advantage: Lasso regression can therefore be used for feature selection, driving the coefficients of irrelevant features to zero.
- Under the L1 penalty, coefficients reach zero more quickly than under L2, which only shrinks them towards zero (a sketch follows below).
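A sketch of this behaviour with scikit-learn's Lasso; by construction only the first two features matter, and the candidate alpha values are arbitrary:

```python
# Lasso (L1) regularization drives the coefficients of irrelevant features
# exactly to zero as the penalty strength increases, selecting features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)   # features 2-4 are noise

for alpha in [0.01, 0.1, 1.0]:
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha = {alpha:4.2f}: coefficients = {np.round(coefs, 2)}")
```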
Implementation in Python
- Ridge Regression: linear_model.Ridge
- Lasso Regression: linear_model.Lasso
- Why alpha instead of lambda: lambda is a reserved keyword in Python (used for anonymous functions), so scikit-learn names the penalty-strength parameter alpha (minimal usage sketch below).
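A minimal usage sketch of the two estimators named above (variable names in the comment are placeholders); the penalty strength is passed as the alpha constructor argument:

```python
# Both estimators follow scikit-learn's usual fit/predict interface;
# the penalty strength lambda is the constructor argument `alpha`.
from sklearn import linear_model

ridge = linear_model.Ridge(alpha=1.0)   # L2 penalty
lasso = linear_model.Lasso(alpha=0.1)   # L1 penalty
# e.g. ridge.fit(X_train, y_train); predictions = ridge.predict(X_test)
```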