5 - Overfitting and Decision Threshold
Decision Boundary of a Classification Model
Definition:
A conceptual dividing line that separates different classes in a classification problem
Determined by machine learning algorithms based on training data
Importance:
More features can improve the model but make the boundary harder to visualize (beyond two features it becomes a surface or hyperplane)
The decision boundary reflects the mapping function the model has learned from the training examples
Visualization of Decision Boundary
Decision Tree Model Example:
Pictorial representation of decision boundary based on Age and Balance vs Default status
Illustrates how different splits can lead to different labels (Default vs Not Default)
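As a rough illustration (not from the original notes; scikit-learn and the toy Age/Balance numbers below are my own assumptions), a shallow decision tree can be fit on two features and queried to see which side of the learned boundary a point falls on:

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical (Age, Balance) examples; 1 = Default, 0 = Not Default.
    X = [[25, 1000], [40, 200], [35, 5000], [50, 300],
         [23, 150], [60, 7000], [45, 100], [30, 2500]]
    y = [0, 1, 0, 1, 1, 0, 1, 0]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # Each prediction reports which region of the learned (axis-aligned)
    # boundary a query point falls into.
    print(tree.predict([[28, 400], [55, 6000]]))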
Complexity of Algorithms
Different Models:
Different algorithms can learn boundaries of varying complexities
Model selection must consider both the algorithm's assumptions and the nature of the data
Stopping Criteria for Splitting
When to Stop Splitting:
Stop when a node is pure (all examples share one label), i.e., minimum impurity is reached
Avoid additional splits that yield no information gain (i.e., no decrease in Gini impurity for classification, or no decrease in variance for regression)
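To make the stopping rule concrete, here is a minimal sketch (my own example, assuming NumPy) that computes Gini impurity for a parent node and one candidate split; growth should stop when no split lowers the weighted child impurity:

    import numpy as np

    def gini(labels):
        # Gini impurity: 1 - sum of squared class proportions.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    parent = np.array([0, 0, 1, 1, 1, 1])
    left, right = parent[:3], parent[3:]           # one candidate split
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
    # Split only if the weighted child impurity is lower than the parent's;
    # otherwise there is no decrease in Gini and growth should stop.
    print(gini(parent), weighted)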
Overfitting in Decision Trees
Definition of Overfitting:
Full trees may memorize training data without learning general patterns
Model learns noise as patterns from training data, failing to generalize
Example of Overfitted Model:
Overly complex decision rules built on excessively narrow thresholds (e.g., fine-grained humidity levels)
Reasons for Overfitting
Cause:
Random noise or fluctuations in training data are learned as significant patterns
Evaluating Overfitting
Error Analysis:
Shows performance discrepancies between training and test sets as tree complexity increases
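A sketch of such an error analysis, assuming scikit-learn and synthetic data; the exact numbers will vary, but the widening train/test gap at larger depths is the point:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    for depth in [1, 3, 5, 10, None]:              # None = grow the full tree
        m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        print(depth, m.score(X_tr, y_tr), m.score(X_te, y_te))
    # Training accuracy keeps climbing with depth while test accuracy
    # plateaus or drops: the widening gap is the overfitting signature.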
Simpler Tree Structures
Goal:
Favor simpler models to avoid overfitting while still retaining predictive power
Importance of Majority Label:
Assign majority class label based on class distribution in node
Supervised Learning Goals
Primary Objective:
Discover patterns (for making predictions) that generalize well to unseen data
A model that works extremely well on training data will not necessarily work well on test data
Avoiding Overfitting: Simplifying Models
Techniques:
Early stopping in decision trees, e.g., by setting a maximum depth
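In scikit-learn terms (one possible realization; the parameter values are illustrative, not prescribed by the notes), early stopping maps onto constructor arguments:

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        max_depth=3,                 # early stopping: cap the tree depth
        min_samples_leaf=5,          # require at least 5 examples per leaf
        min_impurity_decrease=0.01,  # skip splits with negligible impurity gain
        random_state=0,
    )

How to choose a value like max_depth is exactly the hyperparameter-tuning question addressed next.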
Hyper-Parameter Tuning (How do we determine the maximum depth?)
Hyperparameter: A parameter whose value is set before the learning process begins, which can significantly affect the performance of the model.
Tuning ML hyperparameters (such as decision-tree depth) is a tedious yet crucial task, as the performance of an algorithm can be highly dependent on the choice of hyperparameters.
Process (to find the best hyperparameter values):
Search through a range of candidate values and select the combination that delivers the best performance on your data.
Pitfalls in Hyper-Parameter Tuning
Common Mistakes:
Using test data in the training or tuning process causes data leakage
Need to keep test data separate for unbiased evaluations
Hyper-Parameter Tuning with Validation Set
Recommendation:
Utilize a validation set for tuning decisions while reserving the test set for final evaluations
A common split is 80% training and 20% testing, with the validation set carved out of the training portion
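A minimal sketch of the resulting three-way split, assuming scikit-learn; the 80/20 and 75/25 proportions are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)   # toy data

    # First carve off the untouched test set (80/20).
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Then split the remaining 80% into training and validation (75/25,
    # i.e., 60/20/20 of the original data).
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=42)
    # Tune hyperparameters against (X_val, y_val); touch the test set only once.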
Grid Search vs Random Search
Grid Search:
Evaluates all combinations of hyperparameters within a pre-defined grid
Guarantees finding the best combination within the pre-defined grid, but is computationally expensive as the grid grows
Random Search:
Samples random combinations to evaluate instead of evaluating exhaustively
More efficient and often yields good results
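A hedged sketch contrasting the two searches, assuming scikit-learn and SciPy; the grids and distributions are arbitrary examples:

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)     # toy data

    # Grid search: every combination is evaluated (3 x 3 = 9 here;
    # with cv=5 that is 9 x 5 = 45 model fits).
    grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}
    gs = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5).fit(X, y)

    # Random search: only n_iter sampled combinations are evaluated.
    dists = {"max_depth": randint(2, 20), "min_samples_leaf": randint(1, 20)}
    rs = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), dists,
                            n_iter=9, cv=5, random_state=0).fit(X, y)
    print(gs.best_params_, rs.best_params_)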
K-Fold Cross Validation (another approach to hyperparameter tuning)
Concept:
Divides the dataset into K equal “folds” (partitions of the data)
The model is trained and evaluated multiple times, using a different training/validation partition each time, so that its performance is consistent and not reliant on any specific partition of the data.
Each fold serves as validation once while being excluded from training
Advantages:
Makes better use of the data and mitigates dependence on a single train-test split
Disadvantages:
High computational cost and time-consuming
Example: 9 hyperparameter options evaluated with 5 folds = 9 × 5 = 45 models built
Steps
1. Split the dataset into K folds.
2. For each hyperparameter combination, train the model K times, each time using K-1 folds for training and 1 fold for validation.
3. Compute the average performance metric across the K folds (e.g., accuracy, F1-score, RMSE).
4. Select the hyperparameter combination with the best average validation performance.
5. Train the final model on the entire dataset with the best hyperparameters.
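The steps above written out as a minimal manual sketch (assuming scikit-learn's KFold; in practice GridSearchCV wraps this loop):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, random_state=0)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)   # step 1: K folds

    best_depth, best_score = None, -np.inf
    for depth in [2, 3, 4, 5]:                     # candidate hyperparameter values
        scores = []
        for train_idx, val_idx in kf.split(X):     # step 2: each fold validates once
            m = DecisionTreeClassifier(max_depth=depth, random_state=0)
            m.fit(X[train_idx], y[train_idx])
            scores.append(m.score(X[val_idx], y[val_idx]))
        if np.mean(scores) > best_score:           # steps 3-4: average, keep the best
            best_depth, best_score = depth, np.mean(scores)

    # Step 5: retrain on the entire dataset with the best hyperparameter.
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)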
Standard Supervised Learning Process
Steps:
Split data into training and test sets (e.g., 80:20)
Apply simple train/validation split, or k-fold cross-validation on the training set only (e.g., 5-fold CV).
If using cross-validation, each iteration uses K-1 folds for training and 1 fold for validation.
Final model evaluation conducted on the untouched test set for fair results
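End to end, the process might look like this sketch (assuming scikit-learn; the candidate depths are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # Cross-validate candidate depths on the training set only (5-fold CV).
    cv_means = {d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                                   X_tr, y_tr, cv=5).mean()
                for d in [2, 4, 6, 8]}
    best = max(cv_means, key=cv_means.get)

    # Final, one-time evaluation on the untouched test set.
    final = DecisionTreeClassifier(max_depth=best, random_state=0).fit(X_tr, y_tr)
    print(best, final.score(X_te, y_te))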
Assigning Labels for Leaf Nodes
Leaf Nodes:
Represent outcomes in classification (predicted labels) or regression (predicted values)
Majority voting for classification or mean calculations for regression
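A tiny sketch of both rules with made-up leaf contents, assuming NumPy:

    import numpy as np

    leaf_classes = np.array([1, 1, 0, 1, 0])       # training labels landing in one leaf
    majority = np.bincount(leaf_classes).argmax()  # classification: majority vote -> 1
    p_positive = leaf_classes.mean()               # class-1 proportion (0.6); this is the
                                                   # probability the threshold below acts on

    leaf_targets = np.array([3.2, 2.8, 3.5])       # regression: predict the mean
    prediction = leaf_targets.mean()               # -> about 3.17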
Defining the Decision Threshold
Decision Threshold Role:
Converts predicted probabilities (the probability values accompanying predictions) into binary labels by comparing them against a selected threshold
The threshold should be chosen for the context rather than fixed at the default of 0.5; it is a tunable quantity, not a constant
Probabilistic Predictions and Impact of Decision Threshold
Example:
Predicted probabilities indicate likelihoods (e.g., chance of subscription cancellation)
A higher decision threshold leads to more conservative predictions (fewer classifications as positive)
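A hedged sketch of threshold adjustment, assuming scikit-learn's predict_proba; the 0.7 threshold is illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]        # P(positive class) per example

    default_labels = (proba >= 0.5).astype(int)    # default threshold
    conservative = (proba >= 0.7).astype(int)      # stricter: fewer positives
    print(default_labels.sum(), conservative.sum())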
Pros and Cons of Decision Tree Learning
Advantages:
Easy interpretation and understanding
Minimal data preparation required (categorical features do not need one-hot encoding)
Drawbacks:
Prone to overfitting; requires pruning
Tree structure is not very stable and can vary significantly with small changes in the training data, leading to high variance in model performance.
Predictive accuracy can be mediocre