Machine Learning Concepts Review

Lecture Details

  • Course: Machine Learning SEIS 763-02, Fall 2025

  • Lecture Topic: Ensemble Learning, Regularization, and Classification

Ensemble Learning

  • Definition: Ensemble Learning involves training multiple models and combining their predictions to enhance overall performance.

    • Utilizes multiple weaker models (weak learners) to create a more robust predictive model.

    • The final prediction is based on a combination of the outputs from each of the models trained.

    • Example Applications:

    • Kaggle competitions

    • Netflix dataset

Bagging and Boosting

  • Bagging (Bootstrap Aggregating)

    • Decreases the model’s variance, enhancing stability by averaging multiple models.

    • Utilizes subsets of the original training data sampled with replacement.

  • Boosting

    • Focuses on decreasing the model’s bias by sequentially training models.

    • Each subsequent model focuses on instances where previous models performed poorly.

Variance and Bias

  • Variance: The error due to the model's sensitivity to small fluctuations in the training set, typically leading to overfitting.

  • Bias: The error due to assumptions in the learning algorithm, leading to underfitting.

Bootstrapping

  • Definition: Bootstrapping is a resampling technique that involves creating subsets from the original dataset with replacement.

    • Subsets can be of equal or smaller size than the original dataset.

    • A single bootstrap iteration typically yields a sample in which some instances appear more than once and others are left out, so it is similar to, but distinct from, the original dataset (see the sketch below).
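
  • A minimal sketch of one bootstrap iteration using NumPy (the toy array and seed are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(42)
data = np.array([2.1, 3.5, 4.0, 5.2, 6.8])  # toy dataset (assumed)

# draw a sample of the same size as the original, with replacement
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # some values repeat, others are left out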

Bagging (Bootstrap Aggregating)

  • Application: Bagging can reduce the high variance characteristic often found in decision trees.

    • Defined Steps:

    • Randomly sample from the original dataset with replacement.

    • Build a decision tree from the sample.

    • For predictions on new instances, average predictions from all trees.

  • Random Forests: An extension of bagging for high-variance learners such as decision trees that additionally samples a random subset of the features, further decorrelating the individual trees (a minimal bagging sketch follows).
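
  • A minimal bagging sketch with scikit-learn's BaggingRegressor (the toy data and parameter values are assumptions, not from the lecture; the default base learner is a decision tree):

import numpy as np
from sklearn.ensemble import BaggingRegressor

# toy regression data (assumed for illustration)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel()

# each of the 10 trees is trained on a bootstrap sample of the rows
bagged = BaggingRegressor(n_estimators=10, random_state=42)
bagged.fit(X, y)

# the prediction for a new instance is the average of the trees' predictions
print(f"Bagged Prediction: {bagged.predict([[5.0]])[0]:.2f}")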

Random Forest Regression

  • Steps

    1. Generate n samples from the original data by drawing rows with replacement; for each sample, select a random subset of the m features without replacement.

    2. Create decision trees based on these samples.

    3. For a new instance, each tree predicts a value, and the average of all predictions is computed.

    • Reduces overfitting effectively due to the diversity of the models (see the sketch below).
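
  • A minimal sketch of the feature-subsampling idea from step 1; note that scikit-learn applies it at each split via the max_features parameter (the toy data and values are assumptions; the lecture's full example appears under Code Examples below):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy data with 4 features, only the first two informative (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# max_features limits how many features each split may consider,
# which decorrelates the trees whose predictions are then averaged
forest = RandomForestRegressor(n_estimators=100, max_features=2, random_state=42)
forest.fit(X, y)
print(f"Forest Prediction: {forest.predict(X[:1])[0]:.2f}")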

Boosting

  • Definition: Boosting refers to sequential training of ensemble models, where each model aims to improve on the mistakes of its predecessor.

    • Prominent boosting methods include AdaBoost, Gradient Boost, and XGBoost.

AdaBoost (Adaptive Boosting)

  • Initially, a model is trained.

    • Misclassified instances have their weights increased to influence subsequent models.

    • Each model is evaluated, and its weight in the final prediction is determined based on its accuracy.

    • The ensemble prediction is a weighted average of all model predictions (see the sketch below).
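
  • A minimal sketch showing the weights AdaBoost assigns to its weak learners (the toy data are an assumption; a fitted ensemble exposes the weights via estimator_weights_):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# toy binary classification data (assumed for illustration)
X, y = make_classification(n_samples=200, random_state=42)

boosted = AdaBoostClassifier(n_estimators=5, random_state=42)
boosted.fit(X, y)

# each weak learner's say in the weighted final prediction,
# determined by its accuracy on the reweighted training data
print(boosted.estimator_weights_)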

Gradient Boosting

  • Overview: Sequentially adds predictors, each one trained to correct errors from the previous model.

    • Optimizes by fitting each new predictor to the residual errors of the prior predictors (see the sketch below).
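
  • A minimal two-stage sketch of the residual-fitting idea using plain decision trees (the toy data, tree depths, and implicit learning rate of 1 are assumptions):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data (assumed for illustration)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel()

# stage 1: fit a shallow tree to the targets
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)

# stage 2: fit a second tree to the residual errors of the first
residuals = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# the ensemble prediction sums the staged predictions
y_pred = tree1.predict(X) + tree2.predict(X)
print(f"MSE after stage 2: {np.mean((y - y_pred) ** 2):.4f}")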

Regularization

  • Definition: Regularization is the technique of constraining a model to prevent overfitting by reducing its complexity.

    • Introduces penalties to the model fit to help manage model complexity.

Types of Regularization Techniques

  1. Lasso Regression (L1 Regularization)

    • Adds the absolute values of the coefficients as a penalty to the loss function; this can drive some coefficients exactly to zero, effectively performing feature selection.

  2. Ridge Regression (L2 Regularization)

    • Adds the squared values of the coefficients as a penalty to the loss function, shrinking coefficients and preventing extreme values without setting them to zero.

  3. Elastic Net: Combines the L1 and L2 penalties (see the sketch below).
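
  • A minimal sketch comparing the three penalized regressors in scikit-learn (the toy data and alpha values are assumptions):

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# toy data: only the first 2 of 10 features actually matter (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: drives some coefficients to zero
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: shrinks coefficients toward zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of the L1 and L2 penalties

print(lasso.coef_)  # note the exact zeros produced by the L1 penalty
print(ridge.coef_)
print(enet.coef_)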

Overfitting and Underfitting

  • Overfitting: When a model performs well on training data but poorly on unseen data due to excessive complexity.

  • Underfitting: Occurs when a model is too simple to capture the data’s structure.

Evaluating Classifiers

  • Metrics:

    • Model Accuracy: Proportion of correct predictions relative to the total.

    • Precision: Fraction of relevant instances among retrieved instances (TP/(TP + FP)).

    • Recall: Fraction of relevant instances that were retrieved (TP/(TP + FN)).

    • F1 Score: The harmonic mean of precision and recall, balancing the two (see the sketch below).
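
  • A minimal sketch of these metrics with scikit-learn (the toy label vectors are assumptions):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# toy true and predicted labels (assumed for illustration)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"Accuracy : {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP)
print(f"Recall   : {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN)
print(f"F1 Score : {f1_score(y_true, y_pred):.2f}")         # harmonic mean of precision and recall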

Imbalanced Datasets

  • Evaluating models is challenging when one class is underrepresented (class 0) compared to another (class 1); accuracy alone can be misleading because a model that always predicts the majority class still scores well.

  • Strategies to address class imbalance include collecting more data and resampling techniques such as oversampling and undersampling (detailed in the list below).

Strategies for Addressing Imbalance

  1. Collect more data: Gather additional examples, particularly of the minority class, to balance the dataset.

  2. Oversampling: Randomly duplicate instances of the minority class.

  3. Undersampling: Reduce the size of the majority class.

  4. Synthetic Augmentation: Use methods like SMOTE or ADASYN to generate new synthetic instances from the minority class.
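
  • A minimal sketch of random over- and undersampling with the imblearn package (the toy imbalanced dataset is an assumption):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# toy imbalanced data: roughly 90% majority class, 10% minority class (assumed)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# oversampling duplicates minority-class rows until the classes are balanced
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over))

# undersampling drops majority-class rows until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))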

Code Examples

  • Bagging with Random Forest:

from sklearn.ensemble import RandomForestRegressor

# X, y: training features and targets (assumed already defined)
model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X, y)

# predict for a single new instance
print(f"Model Prediction : {model.predict([[8.33]])[0]:.2f}")

  • Boosting with AdaBoost:

from sklearn.ensemble import AdaBoostRegressor

# X, y: training features and targets (assumed already defined)
model = AdaBoostRegressor()
model.fit(X, y)

# NEW_INSTANCE: a 2-D array with one row, matching the feature layout of X
print(f"AdaBoost Prediction: {model.predict(NEW_INSTANCE)[0]:.2f}")

  • Balancing the dataset with SMOTE from the imblearn package:
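
    • A minimal sketch, assuming X and y are an imbalanced training set; SMOTE synthesizes new minority-class instances rather than duplicating existing ones:

from imblearn.over_sampling import SMOTE

# X, y: imbalanced training features and labels (assumed already defined)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)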

Conclusion

The integration of techniques such as bagging and boosting into machine learning workflows can lead to more robust and accurate models. Proper regularization helps mitigate the risk of overfitting, while well-chosen performance metrics are essential for understanding the effectiveness of classifiers, especially on datasets with imbalanced classes.