W6: Generalization & Overfitting


26 Terms

1
New cards

Understanding Generalization

How well a model performs on new, unseen data after training.

  • Identifies patterns instead of just memorizing training examples.

  • Build models that make accurate predictions on future data, not just past data.

  • Important because real-world data changes over time.

  • May work perfectly on training data but fail in real-world applications.

Measured by comparing the model's accuracy on training data vs test data.

  • Big drop in accuracy from training to testing = overfitting.

  • More training data helps improve generalization, but only up to a point.
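A minimal sketch of this train-vs-test comparison, assuming scikit-learn and a synthetic dataset (the decision tree and the data are illustrative only, not part of the card):

```python
# Compare training vs. test accuracy to gauge generalization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A deep, unconstrained tree tends to memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train={train_acc:.2f}  test={test_acc:.2f}  gap={train_acc - test_acc:.2f}")
# A large gap between training and test accuracy suggests overfitting.
```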

2
New cards

Overfitting

Happens when a machine learning model learns the training data too well, including noise and random details. Causes the model to perform well on training data but poorly on new data.

  • Occurs when a model is too complex for the training data or is trained for too long.

  • Too many features compared to examples (the "curse of dimensionality").

3
New cards

Underfitting

Model is too simple to learn patterns in the data, leading to poor performance on both training and test data.

  • A common example is using a linear model for a complex, nonlinear problem. No matter how much data you add, the model cannot capture the real relationship.

  • Easy to spot because the model performs poorly overall.

  • Fixing it requires using a more complex model or adding relevant features, but this must be balanced to avoid overfitting.

4
New cards

Bias-Variance Tradeoff

Goal is to balance bias and variance for a model that generalizes well to new data. Visualized as a U-shaped curve:

  • Increasing model complexity reduces bias (less underfitting).

  • But it increases variance (risk of overfitting).

  • The optimal model sits at the point where total error is minimized; striking this balance is key to building effective machine learning models.

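A minimal sketch of the U-shaped curve, assuming scikit-learn: sweeping polynomial degree stands in for model complexity, and the data is synthetic.

```python
# Sweep model complexity (polynomial degree) and compare train vs. validation error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # nonlinear target + noise
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for degree in (1, 3, 10, 20):   # low degree -> high bias; high degree -> high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    va = mean_squared_error(y_va, model.predict(X_va))
    print(f"degree={degree:2d}  train MSE={tr:.3f}  val MSE={va:.3f}")
# Validation error typically falls, bottoms out, then rises again as variance takes over.
```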
5
New cards

Bias

Error from oversimplifying the model →

  • leads to underfitting (missing important patterns).

6
New cards

Variance

Sensitivity to training data →

  • leads to overfitting (capturing noise).

7
New cards

Techniques to Prevent Overfitting

Techniques to simplify the model and improve generalization:

  • Regularization

  • Early Stopping

  • Feature Selection & Dimensionality Reduction

8
New cards

Regularization

Adds a penalty to large parameter values to control complexity.

  1. L1 (Lasso): Shrinks some parameters to zero (feature selection).

  2. L2 (Ridge): Reduces parameter magnitude without eliminating them.
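A minimal sketch of both penalties, assuming scikit-learn's Lasso and Ridge; the alpha value is illustrative, not a tuned choice.

```python
# L1 (Lasso) vs. L2 (Ridge): alpha controls the penalty strength.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives some coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients but keeps them nonzero

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```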

9
New cards

Early Stopping

Stops training when validation performance worsens, preventing the model from memorizing the training data.
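A minimal sketch of early stopping, using scikit-learn's gradient boosting as a stand-in (other boosting libraries offer similar options); the values are illustrative.

```python
# Training halts once the held-out validation score stops improving.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=1000,          # upper bound on boosting rounds
    validation_fraction=0.1,    # internal hold-out set used to monitor progress
    n_iter_no_change=10,        # stop after 10 rounds with no validation improvement
    random_state=0,
).fit(X, y)

print("boosting rounds actually used:", model.n_estimators_)
```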

10
New cards

Feature Selection & Dimensionality Reduction

Reduces input features to focus on the most relevant data.

  1. Filter methods: Rank features using statistical measures.

  2. Wrapper methods: Test feature subsets based on model performance.

  3. Embedded methods: Select features during training.
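A minimal sketch of the three families, assuming scikit-learn; k and the estimators are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)

# 1. Filter: rank features with a statistical test, keep the top k.
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# 2. Wrapper: recursively drop features based on a model's performance/weights.
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# 3. Embedded: let a regularized model select features while it trains.
X_embedded = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```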

11
New cards

Random Forest

Protects against overfitting by using multiple independently trained trees, making it robust and reliable with minimal tuning.

Prevents overfitting using three key techniques:

  • Bagging (Bootstrap Aggregating)

  • Feature Randomization

  • Out-of-Bag (OOB) Estimation

12
New cards

Bagging (Bootstrap Aggregating)

Each decision tree is trained on a random subset of the data, ensuring diversity.

  • Some data points appear multiple times in a tree’s training set, while others are left out.

  • Predictions from multiple trees are averaged (for regression) or voted on (for classification), so the model focuses on real patterns rather than noise, reducing overfitting.

13
New cards

Feature Randomization

Instead of considering all features at each split, Random Forest selects a random subset, forcing trees to explore different feature combinations.

  • Prevents the model from relying too heavily on specific features, leading to better generalization.

  • Typically, sqrt(n) features are used for classification and n/3 for regression, where n is the total number of features.

14
New cards

Out-of-Bag (OOB) Estimation

Since about one-third of the data is left out of each bootstrap sample, these unused data points act as a built-in validation set.

  • By evaluating model performance on these out-of-bag samples, Random Forest provides an unbiased estimate of its accuracy without needing a separate validation set, helping detect overfitting early.
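A minimal sketch of all three Random Forest techniques together, assuming scikit-learn's RandomForestClassifier; the hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,
    bootstrap=True,        # bagging: each tree sees a bootstrap sample of the data
    max_features="sqrt",   # feature randomization: sqrt(n) candidate features per split
    oob_score=True,        # evaluate each tree on the samples it never saw
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", round(forest.oob_score_, 3))
```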

15
New cards

XGBoost: Advanced Regularization for Better Generalization

Can deliver better performance because each new tree focuses on correcting the errors of earlier trees, but needs careful tuning of its regularization to generalize well.

Prevents overfitting using three key techniques:

  • Regularization

  • Tree Pruning

  • Learning Rate (Shrinkage)

These mechanisms make it a powerful and reliable model, known for its strong generalization and success in machine learning tasks.

16
New cards

Regularization

Unlike standard gradient boosting, XGBoost adds both L1 (Lasso) and L2 (Ridge) penalties to its objective function.

  • L1 helps by removing unnecessary features, making the model more sparse, while L2 limits the influence of any single feature, preventing it from dominating the model.

  • These penalty terms ensure a balance between fitting the data and maintaining generalization.

17
New cards

Tree Pruning

Instead of allowing trees to grow indefinitely, XGBoost prunes them when additional splits provide little benefit.

  • "grow and prune" strategy removes unnecessary branches, keeping the model simple and reducing overfitting.

  • Pruning decisions are guided by the regularized objective function, ensuring only valuable splits are kept.

18
New cards

Learning Rate (Shrinkage)

XGBoost reduces the impact of each tree by applying a small learning rate (typically 0.01 to 0.3).

  • This slows down learning, preventing the model from memorizing noise.

  • While a lower learning rate requires more trees for optimal performance, it results in better generalization and stability.
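A minimal sketch of these three levers, assuming XGBoost's scikit-learn wrapper; the values are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = XGBClassifier(
    reg_alpha=0.1,        # L1 penalty on leaf weights
    reg_lambda=1.0,       # L2 penalty on leaf weights
    gamma=1.0,            # minimum loss reduction required to keep a split (pruning)
    max_depth=4,          # also limits how far trees can grow
    learning_rate=0.05,   # shrinkage: small steps, more trees
    n_estimators=500,
).fit(X, y)

print("training accuracy:", round(model.score(X, y), 3))
```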

19
New cards

Improving Generalization

To improve model generalization, focus on three main areas: data preparation, algorithm choice, and training strategies.

  • Cross-validation

  • Feature Engineering

  • Ensemble Methods

By following these practices, you can make your model more reliable on new, unseen data.

20
New cards

Cross-validation

Instead of a single train-test split, use cross-validation (like k-fold) to assess model performance across multiple data partitions.

  • This helps catch overfitting early and gives a more reliable estimate of how the model will perform on new data.
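A minimal sketch of k-fold cross-validation, assuming scikit-learn; k = 5 and the model are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Score the model on 5 different train/validation partitions.
scores = cross_val_score(
    RandomForestClassifier(random_state=0),
    X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```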

21
New cards

Feature Engineering

Transform raw data into meaningful features by adding new ones, removing irrelevant ones, and handling issues like multicollinearity.

  • This helps the model focus on the important patterns and improves generalization.

22
New cards

Ensemble Methods

Combine multiple models (e.g., Random Forest and XGBoost) to create a stronger system.

  • Stacking and blending can help by reducing individual model biases and improving overall performance.
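A minimal sketch of stacking Random Forest and XGBoost, assuming scikit-learn and the XGBoost scikit-learn wrapper; the final estimator is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("xgb", XGBClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # learns how to weight the base models
)
print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean().round(3))
```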

23
New cards

Real-World Applications and Case Studies

Generalization and overfitting have key impacts in real-world applications, especially when using Random Forest and XGBoost. Here are a few examples:

  • Financial Fraud Detection

  • Healthcare Predictive Models

  • E-commerce Recommendation Systems

In all these cases, ensuring good generalization helps avoid overfitting and keeps models effective over time.

24
New cards

Financial Fraud Detection

Fraud models need to generalize to new fraud patterns.

  • Random Forest works well here because it handles imbalanced data (fraud cases are rare) and resists overfitting.

  • To keep detecting new fraud methods, companies regularly retrain models as fraud patterns change.

25
New cards

Healthcare Predictive Models

Models often struggle to generalize across diverse patient populations.

  • XGBoost helps by using regularization to focus on stable patterns and avoid irrelevant correlations.

  • These models also use domain knowledge and stratified sampling to ensure they work for different patient groups.

26
New cards

E-commerce Recommendation Systems

These systems need to generalize from past purchase data to predict future preferences.

  • Overfitting leads to outdated recommendations.

  • A hybrid approach using both Random Forest and XGBoost captures complex patterns and handles categorical data.

  • Ongoing testing and validation help ensure recommendations stay relevant.