2024-10-21-ensembles

Class Imbalance and Ensemble Learning

Overview

Course: CS 418 Introduction to Data Science
Instructor: Prof. Brian Ziebart
Date: October 21, 2024
Reading Assignment: CIML sections 5.5, 5.9, 13.1, 13.2

Announcements

Homework 3: Due 10/28 at 11:59 pm
Exam: Scheduled for 10/30 in class covering Neural Networks, Deep Learning, and LLMs

Previous Class Recap

Topics Covered:
- Supervised Learning
- Logistic Regression
- Support Vector Machines
Process Overview:
- Formulating a question or problem
- Acquiring and cleaning data
- Exploratory data analysis
- Prediction and inference
- Reporting, decisions, and solutions

Topic of the Day

Focus: Classification and practical issues related to data imbalance and evaluation metrics

Imbalanced Data Distribution

Definition: An imbalanced training set occurs when examples are drawn disproportionately from different classes.
Impact: This creates a "needle in a haystack" problem: the rare class contributes so few examples that many ML algorithms learn to ignore it.
Example: Identifying fraudulent transactions in a dataset of credit card histories.

Strategies for Addressing Imbalanced Data

Inducing a New Binary Distribution

Undersampling the Majority Class:
- More computationally efficient (less data to train on) but may discard valuable data.
Oversampling the Minority Class:
- Increases the data for underrepresented classes but can lead to overfitting, since the same minority examples are repeated.
- Can be implemented by giving minority instances larger weights instead of duplicating them.
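
The three options above can be sketched with the Python standard library alone. This is a minimal illustration, not the course's implementation: the toy dataset, the helper names (undersample_majority, oversample_minority, class_weights), and the choice to balance the classes exactly are all assumptions made for the example.

```python
import random
from collections import Counter

def split_by_size(examples):
    """Return (minority, majority) lists for binary labels 0/1."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def undersample_majority(examples, rng):
    """Randomly keep only as many majority examples as there are minority ones."""
    minority, majority = split_by_size(examples)
    return minority + rng.sample(majority, len(minority))

def oversample_minority(examples, rng):
    """Duplicate minority examples (drawn with replacement) until classes match."""
    minority, majority = split_by_size(examples)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def class_weights(examples):
    """Weights inversely proportional to class frequency (no data is copied)."""
    counts = Counter(label for _, label in examples)
    n = len(examples)
    return {label: n / (len(counts) * c) for label, c in counts.items()}

# Hypothetical toy data: 95 negatives and 5 positives, as (features, label).
data = [((i,), 0) for i in range(95)] + [((i,), 1) for i in range(95, 100)]
rng = random.Random(0)

under = undersample_majority(data, rng)   # 5 negatives + 5 positives
over = oversample_minority(data, rng)     # 95 negatives + 95 positives
weights = class_weights(data)             # each class gets equal total weight
```

With weights, each class contributes the same total weight to training (here 50 per class), which approximates oversampling for learners that accept per-instance weights.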

Evaluation of Classification Performance

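Accuracy: Misleading when classes differ in importance or frequency. A classifier that always predicts the majority class can score high accuracy while never finding a positive example.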
Specific Cases:
- Medical records: predicting presence of cancer.
- Document retrieval: identifying relevant documents based on queries.

Importance of Class Labels

True/False Positive & Negative Definitions:
- True Positive (TP): Predicted positive, actually positive
- False Positive (FP): Predicted positive, actually negative
- False Negative (FN): Predicted negative, actually positive
- True Negative (TN): Predicted negative, actually negative
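
As a small sketch of these four counts, the code below tallies them for a hypothetical "always predict negative" classifier on imbalanced labels. The function names and the label vectors are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical labels: 5 positives among 100 examples.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

counts = confusion_counts(y_true, always_negative)
print(counts)             # (0, 0, 5, 95)
print(accuracy(*counts))  # 0.95
```

The classifier is 95% accurate yet has zero true positives, which is why accuracy alone says little on imbalanced data.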

Precision and Recall Metrics

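Precision: Fraction of positive predictions that are correct, TP / (TP + FP).
Recall: Fraction of actual positives that are predicted positive, TP / (TP + FN).
Precision-Recall Trade-off: Lowering the cutoff threshold for predicting positive typically raises recall and lowers precision; raising it does the opposite. The threshold therefore changes how a model is evaluated.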
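
A minimal sketch of the trade-off, assuming a classifier that outputs a score per example and predicts positive when the score meets a threshold. The scores, labels, and the convention of reporting precision 1.0 when nothing is predicted positive are assumptions for this example.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: with no positive predictions, precision is 1.0.
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Hypothetical scores, sorted from most to least confident.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

results = {t: precision_recall(y_true, scores, t) for t in (0.75, 0.5, 0.15)}
for threshold, (p, r) in results.items():
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.75 to 0.15 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.57.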

F-score

Definition: The harmonic mean of precision (P) and recall (R), F = 2PR / (P + R), providing a single metric for evaluating models. It is high only when both precision and recall are high.
Weighted F-score: Used when one component (precision or recall) matters more than the other.
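
One common form of the weighted F-score is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 favors recall and beta < 1 favors precision. The sketch below assumes that form; the function name f_beta and the example values are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the ordinary F-score; beta > 1 weights recall more
    heavily, beta < 1 weights precision more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with low precision but perfect recall.
p, r = 0.5, 1.0
f1 = f_beta(p, r)                # about 0.667
f2 = f_beta(p, r, beta=2.0)      # about 0.833, rewards the high recall
f_half = f_beta(p, r, beta=0.5)  # about 0.556, penalizes the low precision
```

For the cancer-screening example, missing a case is costly, so a recall-weighted score (beta > 1) may be the more appropriate choice.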

Ensemble Learning

Goal: Combine multiple weak classifiers into a strong classifier without designing a new learning algorithm.
Approach:
- Generate multiple classifiers that vote on predictions, leveraging their diversity.

Ensemble Methods

Bagging

Definition: Bootstrap Aggregation
Process:
1. Generate multiple bootstrap samples from the training set.
2. Train separate models on these samples.
3. Average predictions for regression or use majority vote for classification.
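
The three steps above can be sketched as follows for classification. This is a toy illustration under stated assumptions: the base learner is a one-dimensional threshold rule (a "decision stump"), and the names train_stump, bagging_fit, and bagging_predict, along with the data, are invented for the example.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D threshold rule that minimizes errors on (x, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for side in (1, 0):  # label predicted when x >= threshold
            errors = sum(1 for x, y in sample
                         if (side if x >= threshold else 1 - side) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, side)
    _, threshold, side = best
    return lambda x: side if x >= threshold else 1 - side

def bagging_fit(train, n_models, rng):
    """Steps 1 and 2: draw bootstrap samples and train one model per sample."""
    models = []
    for _ in range(n_models):
        bootstrap = rng.choices(train, k=len(train))  # drawn with replacement
        models.append(train_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Step 3 (classification): majority vote over the individual models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical training set: label is 1 when x >= 10.
train = [(x, int(x >= 10)) for x in range(20)]
models = bagging_fit(train, n_models=25, rng=random.Random(0))

print(bagging_predict(models, 2), bagging_predict(models, 17))
```

An odd number of models avoids tied votes in the binary case. For regression, step 3 would average the model outputs instead of voting.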

Features of Bagging

Each bootstrap sample is drawn from the training set with replacement. Using more bootstrap samples (and therefore more models) generally improves performance.

Boosting

Concept: Sequential training that emphasizes difficult-to-classify training examples.
Mechanism:
- Classifiers are trained one at a time, each focusing on the examples that the previous classifiers got wrong.
- The final prediction is a weighted sum of the individual classifiers' predictions.
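
The notes above do not name a specific boosting algorithm. The sketch below uses AdaBoost, one well-known instance, with one-dimensional threshold rules as the weak classifiers. The function names, the labels in {+1, -1}, and the toy data are assumptions for illustration only.

```python
import math

def stump_predict(x, threshold, sign):
    """Weak classifier: predict `sign` when x >= threshold, else `-sign`."""
    return sign if x >= threshold else -sign

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform example weights
    ensemble = []                    # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        err, threshold, sign = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this classifier's vote weight
        ensemble.append((alpha, threshold, sign))
        # Misclassified examples gain weight; correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the weighted sum of the weak classifiers' predictions."""
    score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single threshold rule can classify perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_round = [adaboost_predict(adaboost_fit(xs, ys, 1), x) for x in xs]
three_rounds = [adaboost_predict(adaboost_fit(xs, ys, 3), x) for x in xs]
```

On this data a single stump misclassifies three of the ten points, while the weighted sum of three stumps classifies all ten training points correctly.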

Conclusion

Illustration: Combining weak classifiers can improve overall prediction accuracy, because different classifiers capture different aspects of the data and their combined vote is more reliable than any single one.

Acknowledgments

Materials include contributions from the Berkeley DS 100 team and others.