2024-10-21-ensembles

Class Imbalance and Ensemble Learning

Overview

  • Course: CS 418 Introduction to Data Science

  • Instructor: Prof. Brian Ziebart

  • Date: October 21, 2024

  • Reading Assignment: CIML sections 5.5, 5.9, 13.1, 13.2


Announcements

  • Homework 3: Due 10/28 at 11:59 pm

  • Exam: Scheduled for 10/30 in class covering Neural Networks, Deep Learning, and LLMs


Previous Class Recap

  • Topics Covered:

    • Supervised Learning

    • Logistic Regression

    • Support Vector Machines

  • Process Overview:

    • Formulating a question or problem

    • Acquiring and cleaning data

    • Exploratory data analysis

    • Prediction and inference

    • Reporting, decisions, and solutions


Topic of the Day

  • Focus: Classification and practical issues related to data imbalance and evaluation metrics


Imbalanced Data Distribution

  • Definition: A training set is imbalanced when some classes contribute far more examples than others.

  • Impact: This creates a "needle in a haystack" problem: the rare class is hard to find, which can hinder the performance of ML algorithms.

  • Example: Identifying fraudulent transactions in a dataset of credit card histories.


Strategies for Addressing Imbalanced Data

Inducing a New Binary Distribution

  • Undersampling the Majority Class:

    • More computationally efficient but may discard valuable data.

  • Oversampling the Minority Class:

    • Increases data for underrepresented classes but can lead to overfitting.

    • Can be implemented by giving minority-class examples larger weights instead of simply duplicating instances.
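
The two strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference code: the helper names `undersample_majority` and `minority_weights`, the 95/5 toy split, and the choice to balance the classes exactly are all assumptions made for the example.

```python
import random

def undersample_majority(examples, labels, majority_label, seed=0):
    """Randomly drop majority-class examples until both classes have equal counts."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Keep every minority example and an equally sized random subset of the majority.
    kept = sorted(rng.sample(majority, len(minority)) + minority)
    return [examples[i] for i in kept], [labels[i] for i in kept]

def minority_weights(labels, minority_label):
    """Per-example weights so that both classes carry the same total weight."""
    n_min = sum(1 for y in labels if y == minority_label)
    n_maj = len(labels) - n_min
    ratio = n_maj / n_min
    return [ratio if y == minority_label else 1.0 for y in labels]

# Hypothetical toy data: 95 legitimate transactions (0) and 5 fraudulent ones (1).
labels = [0] * 95 + [1] * 5
examples = list(range(100))

X_small, y_small = undersample_majority(examples, labels, majority_label=0)
weights = minority_weights(labels, minority_label=1)
```

Undersampling leaves 10 examples here, which shows how much data it can throw away; the weighting version keeps all 100 examples and instead counts each minority example 19 times as heavily.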


Evaluation of Classification Performance

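  • Accuracy: Can be misleading when classes are imbalanced or when errors on one class matter more than errors on another; a classifier that always predicts the majority class can still score high accuracy.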

  • Specific Cases:

    • Medical records: predicting presence of cancer.

    • Document retrieval: identifying relevant documents based on queries.


Importance of Class Labels

  • True/False Positive & Negative Definitions:

    • True Positive (TP): Predicted positive, actually positive

    • False Positive (FP): Predicted positive, actually negative

    • False Negative (FN): Predicted negative, actually positive

    • True Negative (TN): Predicted negative, actually negative
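
As a small sketch of these four outcomes, the counts can be tallied directly from true and predicted labels. The function name `confusion_counts` and the ten-example toy labels are assumptions made for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical labels: 3 positives among 10 examples.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)   # (2, 1, 1, 6)
accuracy = (tp + tn) / len(y_true)                  # 0.8

# A classifier that always predicts "negative" finds no positives at all,
# yet its accuracy is still 0.7 on this data.
baseline = confusion_counts(y_true, [0] * len(y_true))
baseline_accuracy = (baseline[0] + baseline[3]) / len(y_true)
```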


Precision and Recall Metrics

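  • Precision: Fraction of positive predictions that are correct, TP / (TP + FP).

  • Recall: Fraction of actual positives that are predicted positive, TP / (TP + FN).

  • Precision-Recall Trade-off: Raising the decision threshold typically increases precision and lowers recall, and lowering it does the opposite, so the reported metrics depend on the chosen cutoff.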
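
The effect of the cutoff can be seen by scoring the same predictions at two thresholds. This is a sketch under stated assumptions: the scores are made-up classifier outputs, `precision_recall` is a hypothetical helper, and returning precision 1.0 when nothing is predicted positive is just one possible convention.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    # Assumed convention: precision is 1.0 if no example is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and classifier scores, sorted by score for readability.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

strict = precision_recall(y_true, scores, threshold=0.75)   # high precision, lower recall
lenient = precision_recall(y_true, scores, threshold=0.35)  # lower precision, full recall
```

With the strict cutoff only the two top-scored examples are flagged, both correctly, but one positive is missed; the lenient cutoff recovers all three positives at the cost of one false positive.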


F-score

  • Definition: The harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R), providing a single metric to evaluate models.

  • Weighted F-score: Used when one component matters more than the other: F_beta = (1 + beta^2) PR / (beta^2 P + R), where beta > 1 favors recall and beta < 1 favors precision.
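
A short sketch of the computation, using the standard F-beta form (1 + beta^2) PR / (beta^2 P + R), which reduces to the harmonic mean when beta = 1. The function name `f_beta` and the example precision/recall values are assumptions for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both components are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with precision 0.75 and recall 1.0.
f1 = f_beta(0.75, 1.0)            # balanced F1
f2 = f_beta(0.75, 1.0, beta=2.0)  # recall counted more heavily
```

Because this model's recall is its stronger component, the recall-weighted score `f2` comes out higher than `f1`.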


Ensemble Learning

  • Goal: Combine multiple weak classifiers into a strong classifier, without designing a new learning algorithm.

  • Approach:

    • Generate multiple classifiers that vote on predictions, leveraging their diversity.


Ensemble Methods

Bagging

  • Definition: Bootstrap Aggregation

  • Process:

    1. Generate multiple bootstrap samples from the training set.

    2. Train separate models on these samples.

    3. Average predictions for regression or use majority vote for classification.
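
The three steps above can be sketched for classification as follows. This is a toy illustration rather than a definitive implementation: the weak learner is an assumed one-dimensional threshold rule ("stump"), and the names `train_stump`, `bagging_fit`, `bagging_predict`, and the data are hypothetical.

```python
import random
from collections import Counter

def train_stump(points):
    """Fit a 1-D threshold rule: predict 1 when x >= threshold, else 0."""
    best_threshold, best_errors = None, None
    for candidate, _ in points:
        errors = sum(1 for x, y in points if (1 if x >= candidate else 0) != y)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

def bagging_fit(points, n_models=25, seed=0):
    """Steps 1 and 2: draw bootstrap samples and train one model per sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the training set, drawn with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Step 3: majority vote over the individual models' predictions."""
    votes = Counter(1 if x >= threshold else 0 for threshold in models)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, label) pairs.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(data)
low_prediction = bagging_predict(models, 0.0)
high_prediction = bagging_predict(models, 5.0)
```

Using an odd number of models avoids tied votes in the binary case; for regression, the vote would be replaced by an average of the models' outputs.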


Features of Bagging

  • Each bootstrap sample is drawn from the training set with replacement; using more bootstrap samples (and so more models) generally improves performance.


Boosting

  • Concept: Sequential training with an emphasis on difficult-to-classify training examples.

  • Mechanism:

    • Classifiers are trained one after another, each giving more weight to the training examples that earlier classifiers misclassified.

    • The final prediction is a weighted sum (weighted vote) of the individual classifiers.
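
One well-known instance of this mechanism is AdaBoost; the notes do not name a specific algorithm, so treating it as the example here is an assumption. The sketch below uses labels in {-1, +1}, one-dimensional threshold stumps as the weak learners, and hypothetical names (`stump_predict`, `best_stump`, `adaboost_fit`, `adaboost_predict`) and toy data.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` when x >= threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this classifier's vote weight
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Final prediction: sign of the weighted sum of the weak classifiers."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
```

On this toy data the best single stump gets 7 of 10 examples right, while the weighted sum of three stumps classifies all 10 training examples correctly.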


Conclusion

  • Takeaway: Combining weak classifiers can improve overall prediction accuracy, because different classifiers capture different aspects of the data.


Acknowledgments

  • Materials include contributions from the Berkeley DS 100 team and others.