Lecture Notes on Ensemble Learning and Regularization
Finalizing Groups
- Reminder to finalize project groups by the next class.
- Students can use Canvas to find or finalize their groups.
- Note: around 8-9 students have not yet found a group.
Lecture Audio Recording
- The session will be recorded for students to review later.
Ensemble Learning Overview
- Definition and Concept:
- "Ensemble" means a group of models or components.
- Ensemble learning combines multiple models to produce better predictive performance than the individual models.
- Example: ensemble casts in TV shows like "Friends," where no single main character dominates.
- Purpose of Ensemble Learning:
- Combines multiple weaker models to create a stronger overall model.
- Real-world applications: e.g., Kaggle competitions and the Netflix Prize challenge, where collaboration among teams improved prediction accuracy.
Types of Ensemble Learning
- Bagging:
- Reduces model variance.
- Helps to mitigate overfitting by training models on random subsets of the data.
- Boosting:
- Aims to reduce bias and improve accuracy by sequentially training models that focus on the errors made by previous models (both approaches are sketched below).
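A minimal sketch contrasting the two approaches, assuming scikit-learn 1.2 or later (which uses the `estimator` keyword); the dataset and hyperparameters are illustrative and not from the lecture:

```python
# Illustrative sketch: bagging vs. boosting on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: flexible trees trained independently on bootstrap samples (targets variance).
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # deep trees: low bias, high variance
    n_estimators=100,
    random_state=0,
).fit(X_train, y_train)

# Boosting: shallow trees trained sequentially, each focusing on the errors of the
# previous ones (targets bias).
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # "stumps": high bias, low variance
    n_estimators=100,
    random_state=0,
).fit(X_train, y_train)

print("bagging  test accuracy:", bagging.score(X_test, y_test))
print("boosting test accuracy:", boosting.score(X_test, y_test))
```

Note the deliberate choice of base models: bagging usually averages flexible, high-variance models, while boosting stacks many weak, high-bias ones.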
Concepts of Overfitting and Underfitting
- Overfitting:
- Occurs when a model learns the training data too well, capturing noise as important patterns and performing poorly on new data.
- High variance in model predictions.
- Underfitting:
- Happens when a model is too simple to capture the data's underlying structure, leading to missed patterns and poor performance; both failure modes are contrasted in the sketch below.
- High bias in predictions.
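One way to see both failure modes is to vary a model's flexibility and compare training and test accuracy. The sketch below uses k-nearest neighbors purely as an illustration (this example is not from the lecture): k = 1 effectively memorizes the training set, while a very large k is too rigid.

```python
# Illustrative sketch: over- vs. underfitting as model flexibility changes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15, 200):   # very flexible -> moderate -> very rigid
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:3d}  train acc={model.score(X_train, y_train):.2f}  "
          f"test acc={model.score(X_test, y_test):.2f}")

# Typical pattern: k=1 scores (near-)perfectly on the training data but shows a large
# train/test gap (overfitting, high variance); a very large k does poorly on both
# sets (underfitting, high bias).
```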
Bias-Variance Tradeoff
- Bias refers to the error introduced by approximating a real-world problem with a simpler model.
- Variance refers to the error introduced by the model's sensitivity to the fluctuations in the training set.
- An ideal model would achieve both low bias and low variance; the decomposition below makes the tradeoff explicit.
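For a regression model evaluated with squared error (a standard result, not derived in the lecture), the expected error at a point decomposes into exactly these terms:

```latex
% f(x): true function, \hat{f}(x): model fit on a random training set, \sigma^2: noise variance
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

In this picture, bagging primarily attacks the variance term, while boosting primarily attacks the bias term.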
Bootstrapping Technique for Bagging
- Bootstrapping:
- Random sampling with replacement.
- Creates multiple resampled datasets from a single dataset; training a different model on each reduces overfitting when their predictions are combined (see the sketch below).
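A minimal sketch of bootstrapping with NumPy (the tiny dataset is only for illustration):

```python
# Illustrative sketch: bootstrap resampling (random sampling with replacement).
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)              # stand-in for a training set of 10 examples

for i in range(3):                # build 3 bootstrap datasets from the one dataset
    # Draw n indices with replacement: some examples repeat, others are left out.
    idx = rng.integers(0, len(data), size=len(data))
    sample = data[idx]
    out_of_bag = np.setdiff1d(data, sample)    # examples not drawn this round
    print(f"bootstrap {i}: {sample}  out-of-bag: {out_of_bag}")

# In bagging, each bootstrap dataset trains its own model; the models' predictions
# are then averaged (regression) or majority-voted (classification).
```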
Decision Trees and Overfitting
- Decision trees are prone to overfitting due to their ability to split the training data into granular segments.
- Stopping criteria help manage overfitting by preventing too many splits and keeping the tree from growing too deep (e.g., a maximum depth or a minimum number of samples per leaf), as sketched below.
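A minimal sketch of common stopping criteria in scikit-learn's DecisionTreeClassifier (the specific parameter values are illustrative):

```python
# Illustrative sketch: constraining a decision tree with stopping criteria.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No stopping criteria: the tree keeps splitting until its leaves are (nearly) pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Stopping criteria: cap the depth and require a minimum number of samples per leaf.
constrained_tree = DecisionTreeClassifier(
    max_depth=4,             # stop after four levels of splits
    min_samples_leaf=20,     # never create a leaf with fewer than 20 examples
    random_state=0,
).fit(X_train, y_train)

for name, tree in [("unconstrained", full_tree), ("constrained", constrained_tree)]:
    print(f"{name:13s}  train acc={tree.score(X_train, y_train):.2f}  "
          f"test acc={tree.score(X_test, y_test):.2f}")

# Typically the unconstrained tree fits the training data almost perfectly but
# generalizes worse; the constrained tree trades a little training accuracy for a
# smaller train/test gap.
```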