Class Imbalance and Ensemble Learning
Overview
Course: CS 418 Introduction to Data Science
Instructor: Prof. Brian Ziebart
Date: October 21, 2024
Reading Assignment: CIML sections 5.5, 5.9, 13.1, 13.2
Announcements
Homework 3: Due 10/28 at 11:59 pm
Exam: Scheduled for 10/30 in class covering Neural Networks, Deep Learning, and LLMs
Previous Class Recap
Topics Covered:
Supervised Learning
Logistic Regression
Support Vector Machines
Process Overview:
Formulating a question or problem
Acquiring and cleaning data
Exploratory data analysis
Prediction and inference
Reporting, decisions, and solutions
Topic of the Day
Focus: Classification and practical issues related to data imbalance and evaluation metrics
Imbalanced Data Distribution
Definition: A training set is imbalanced when one class contributes far more examples than another.
Impact: The rare class becomes a "needle in a haystack", and a learner can reach low training error while mostly ignoring it.
Example: Identifying fraudulent transactions in a dataset of credit card histories.
Strategies for Addressing Imbalanced Data
Inducing a New Binary Distribution
Undersampling the Majority Class:
Computationally cheaper, since the training set shrinks, but may discard valuable majority-class data.
Oversampling the Minority Class:
Increases the influence of the underrepresented class but can lead to overfitting, since the same minority examples are seen repeatedly.
Can be implemented by giving minority examples larger weights instead of duplicating them.
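The three options above (undersampling, oversampling by duplication, and reweighting) can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names, the toy data, and the choice to balance every class to equal size or equal total weight are assumptions made for the example, not something the lecture specifies.

```python
import random
from collections import Counter

def _group_by_class(examples, labels):
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class

def undersample_majority(examples, labels, seed=0):
    """Randomly keep only as many examples per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    n_min = min(len(xs) for xs in by_class.values())
    out = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, n_min)]
    rng.shuffle(out)
    return out

def oversample_minority(examples, labels, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    n_max = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        out.extend((x, y) for x in xs)  # keep every original example
        out.extend((rng.choice(xs), y) for _ in range(n_max - len(xs)))
    rng.shuffle(out)
    return out

def balancing_weights(labels):
    """Per-example weights chosen so each class has the same total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

# Toy data (hypothetical): 8 majority examples, 2 minority examples.
examples = list(range(10))
labels = [0] * 8 + [1] * 2

under = undersample_majority(examples, labels)
over = oversample_minority(examples, labels)
weights = balancing_weights(labels)
print(len(under), len(over), weights[0], weights[-1])
```

The weighted version keeps the training set at its original size, which is why it is often preferred over literal duplication when the learning algorithm accepts example weights.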
Evaluation of Classification Performance
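Accuracy: Misleading when classes are imbalanced or when some errors matter more than others; always predicting the majority class can score high accuracy while never finding a positive.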
Specific Cases:
Medical records: predicting presence of cancer.
Document retrieval: identifying relevant documents based on queries.
Importance of Class Labels
True/False Positive & Negative Definitions:
True Positive (TP): Predicted positive, actually positive
False Positive (FP): Predicted positive, actually negative
False Negative (FN): Predicted negative, actually positive
True Negative (TN): Predicted negative, actually negative
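To make the four counts concrete, here is a small sketch that tallies them from label lists and shows why accuracy alone can mislead. The helper name `confusion_counts` and the 2-in-100 toy data are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical imbalanced data: 2 positives among 100 examples.
y_true = [1] * 2 + [0] * 98
always_negative = [0] * 100  # a "classifier" that never predicts positive

tp, fp, fn, tn = confusion_counts(y_true, always_negative)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, fn, tn, accuracy)  # 0 0 2 98 0.98
```

The trivial classifier reaches 98% accuracy while missing every positive example, which is the motivation for precision and recall.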
Precision and Recall Metrics
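Precision: Fraction of positive predictions that are correct, TP / (TP + FP).
Recall: Fraction of actual positive examples that are predicted positive, TP / (TP + FN).
Precision-Recall Trade-off: Raising the decision threshold typically raises precision and lowers recall; lowering it does the opposite. A model should therefore be evaluated across thresholds, not at a single cutoff.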
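The effect of the cutoff threshold can be seen by computing precision and recall at two thresholds on the same scores. This is a sketch: the scores and labels are made up, and returning precision 1.0 when nothing is predicted positive is one convention among several, chosen here only to avoid a division by zero.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive iff score >= threshold."""
    tp = fp = fn = 0
    for t, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and t == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif t == 1:
            fn += 1
    # Assumed convention: precision is 1.0 if there are no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical classifier scores, sorted from most to least confident.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

strict = precision_recall(y_true, scores, threshold=0.75)
lenient = precision_recall(y_true, scores, threshold=0.35)
print(strict)   # precision 1.0, recall 2/3
print(lenient)  # precision 0.75, recall 1.0
```

The strict threshold makes no false alarms but misses one positive; the lenient threshold finds every positive at the cost of one false positive.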
F-score
Definition: The harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R), which gives a single number for comparing models.
Weighted F-score: Used when one component matters more than the other. F_beta = (1 + beta^2) PR / (beta^2 P + R) treats recall as beta times as important as precision.
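A short sketch of the F-score as a function, using the standard F-beta form in which beta = 1 recovers the harmonic mean. The function name and the example inputs are illustrative assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both components are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with perfect precision but only 50% recall.
p, r = 1.0, 0.5
f1 = f_beta(p, r)            # harmonic mean of precision and recall
f2 = f_beta(p, r, beta=2.0)  # recall-weighted, so the low recall costs more
print(round(f1, 3), round(f2, 3))
```

Because the harmonic mean is dominated by the smaller component, a model cannot earn a high F-score by excelling at only one of precision or recall.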
Ensemble Learning
Goal: Combine multiple weak classifiers into a strong classifier without having to design a new learning algorithm.
Approach:
Generate multiple classifiers that vote on predictions, leveraging their diversity.
Ensemble Methods
Bagging
Definition: Bootstrap Aggregation
Process:
Generate multiple bootstrap samples from the training set.
Train separate models on these samples.
Average predictions for regression or use majority vote for classification.
Features of Bagging
Each bootstrap sample is drawn from the training set with replacement; using more bootstrap samples generally improves performance.
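The bagging process can be sketched end to end with a very simple base learner. Everything here is an assumption made for illustration: the base model is a one-dimensional threshold classifier ("decision stump"), the data are made up, and 25 models is an arbitrary choice.

```python
import random
from collections import Counter

def stump_predict(stump, x):
    """stump = (t, sign): sign=+1 predicts 1 when x >= t, sign=-1 when x < t."""
    t, sign = stump
    return 1 if (x >= t) == (sign == 1) else 0

def train_stump(points):
    """Pick the threshold and direction with the fewest training errors."""
    xs = sorted({x for x, _ in points})
    best_errors, best_stump = None, None
    for t in xs + [xs[-1] + 1]:
        for sign in (1, -1):
            errors = sum(1 for x, y in points if stump_predict((t, sign), x) != y)
            if best_errors is None or errors < best_errors:
                best_errors, best_stump = errors, (t, sign)
    return best_stump

def bootstrap_sample(points, rng):
    """Draw len(points) examples uniformly at random, with replacement."""
    return [rng.choice(points) for _ in points]

def bagging_fit(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(points, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Majority vote over the individual models' predictions."""
    votes = Counter(stump_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data: small x is class 0, large x is class 1.
data = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 1.5), bagging_predict(models, 7.5))
```

Individual stumps trained on different bootstrap samples place their thresholds differently; the vote smooths out those differences. For regression, the vote would be replaced by an average of the models' predictions.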
Boosting
Concept: Sequential training with an emphasis on difficult-to-classify training examples.
Mechanism:
Classifiers are trained one at a time, each focusing on the examples that earlier classifiers got wrong.
The final prediction is a weighted sum of the individual classifiers' outputs.
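One well-known instance of this mechanism is AdaBoost. The notes do not name a specific boosting algorithm, so treat the following as a sketch of AdaBoost under assumed details: labels in {-1, +1}, one-dimensional threshold classifiers as the weak learners, and made-up data that no single threshold can separate.

```python
import math

def stump_predict(stump, x):
    """stump = (t, sign): predict sign when x >= t, otherwise -sign."""
    t, sign = stump
    return sign if x >= t else -sign

def train_weighted_stump(xs, ys, w):
    """Pick the stump with the smallest weighted training error."""
    best_err, best_stump = float("inf"), None
    for t in sorted(set(xs)) + [max(xs) + 1]:
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict((t, sign), x) != y)
            if err < best_err:
                best_err, best_stump = err, (t, sign)
    return best_stump, best_err

def adaboost_fit(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump, err = train_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)  # this classifier's vote weight
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * y * stump_predict(stump, x))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the weighted sum of the weak classifiers' votes."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data: positive at both ends, negative in the middle.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]

def training_accuracy(ensemble):
    return sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

one_round = adaboost_fit(xs, ys, rounds=1)
three_rounds = adaboost_fit(xs, ys, rounds=3)
print(training_accuracy(one_round), training_accuracy(three_rounds))
```

On this toy data a single stump can classify at most four of the six points, while the weighted vote of three stumps classifies all six, because each round shifts weight onto the points the previous stumps missed.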
Conclusion
Illustration: Combining weak classifiers can yield better overall prediction accuracy than any single one, because different classifiers capture different aspects of the data.
Acknowledgments
Materials include contributions from the Berkeley DS 100 team and others.