5-Overfitting and Decision Threshold

Decision Boundary of a Classification Model

  • Definition:

    • A conceptual dividing line that separates different classes in a classification problem

    • Determined by machine learning algorithms based on training data

  • Importance:

    • More features can improve the model, but they make the decision boundary harder to visualize

    • The decision boundary reflects the mapping the model has learned from feature values to predicted classes

Visualization of Decision Boundary

  • Decision Tree Model Example:

    • Pictorial representation of decision boundary based on Age and Balance vs Default status

    • Illustrates how different splits can lead to different labels (Default vs Not Default)

Complexity of Algorithms

  • Different Models:

    • Different algorithms can learn boundaries of varying complexities

    • Model selection must consider both the algorithm's assumptions and the nature of the data

Stopping Criteria for Splitting

  • When to Stop Splitting:

    • Achieve minimum impurity (pure node with uniform label)

    • Avoid unnecessary additional splits that lead to no information gain (or decrease in Gini, or decrease in Variance)
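
  • Sketch: the stopping rule above can be made concrete with a small Gini-impurity calculation (illustrative code, not from the notes; the labels are made up):

      from collections import Counter

      def gini(labels):
          """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
          n = len(labels)
          return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

      print(gini(["Default"] * 5))          # 0.0 -> pure node, nothing left to split

      parent = ["Default", "Default", "Not Default", "Not Default"]
      left   = ["Default", "Not Default"]   # a candidate split whose children keep
      right  = ["Default", "Not Default"]   # the same class mix as the parent
      weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
      print(gini(parent) - weighted)        # 0.0 -> no decrease in Gini, so stop splitting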

Overfitting in Decision Trees

  • Definition of Overfitting:

    • Full trees may memorize training data without learning general patterns

    • Model learns noise as patterns from training data, failing to generalize

  • Example of Overfitted Model:

    • Complex decision rules based on overly narrow thresholds (e.g., very specific humidity levels)

Reasons for Overfitting

  • Cause:

    • Random noise or fluctuations in training data are learned as significant patterns

Evaluating Overfitting

  • Error Analysis:

    • Shows performance discrepancies between training and test sets as tree complexity increases
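
  • Sketch: a hedged way to reproduce this error analysis with scikit-learn, training trees of increasing depth and comparing training vs. test accuracy (the synthetic dataset and depth values are illustrative assumptions):

      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeClassifier

      X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      for depth in [1, 3, 5, 10, None]:     # None grows the full tree
          tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
          print(depth,
                round(tree.score(X_train, y_train), 3),   # training accuracy keeps rising
                round(tree.score(X_test, y_test), 3))     # test accuracy plateaus or drops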

Simpler Tree Structures

  • Goal:

    • Favor simpler models to avoid overfitting while still retaining predictive power

  • Importance of Majority Label:

    • Assign majority class label based on class distribution in node

Supervised Learning Goals

  • Primary Objective:

    • Discover patterns (making predictions) that generalize well to unseen data

    • A model that works extremely well on the training data will not necessarily work well on the test data

Avoiding Overfitting: Simplifying Models

  • Techniques:

    • Early stopping in decision tree by setting a maximum depth
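
  • Sketch: one possible way to express early stopping (pre-pruning) in scikit-learn; the parameter values are illustrative placeholders to be tuned, not recommendations:

      from sklearn.tree import DecisionTreeClassifier

      pruned_tree = DecisionTreeClassifier(
          max_depth=4,                   # stop splitting beyond this depth
          min_samples_leaf=20,           # each leaf keeps at least 20 training examples
          min_impurity_decrease=0.001,   # require a minimum impurity reduction to split
          random_state=0,
      )
      # pruned_tree.fit(X_train, y_train) would then grow a deliberately simpler tree.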

Hyper-Parameter Tuning (How do we determine the maximum depth?)

  • Hyperparameter: A parameter whose value is set before the learning process begins, which can significantly affect the performance of the model.

  • Tuning ML hyperparameters (such as decision-tree depth) is a tedious yet crucial task, as the performance of an algorithm can be highly dependent on the choice of hyperparameters.

  • Process (in order to find the best parameter):

    • Hyperparameter tuning involves searching over a range of candidate values to find the combination that delivers the best performance on your data.

Pitfalls in Hyper-Parameter Tuning

  • Common Mistakes:

    • Using test data in the training process causes data leakage

    • Need to keep test data separate for unbiased evaluations

Hyper-Parameter Tuning with Validation Set

  • Recommendation:

    • Utilize a validation set for tuning decisions while reserving the test set for final evaluations

    • A common split is 80% training and 20% testing, with part of the training portion held out as the validation set (see the sketch below)
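
  • Sketch: one way to set up the train/validation/test arrangement with scikit-learn, reusing the X and y arrays from the earlier depth sketch (the 80/20 and 75/25 proportions are one common choice, not a rule):

      from sklearn.model_selection import train_test_split

      # First reserve 20% of the data as the untouched test set.
      X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      # Then carve a validation set out of the remaining training data
      # (25% of it, i.e. 20% of the original dataset), used only for tuning decisions.
      X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)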

Grid Search vs Random Search

  • Grid Search:

    • Evaluates all combinations of hyperparameters within a pre-defined grid

    • Guarantees finding the best combination within the grid (provided the grid is well defined), but is computationally expensive

  • Random Search:

    • Evaluates randomly sampled combinations instead of exhaustively evaluating all of them

    • More efficient and often yields good results
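
  • Sketch: a side-by-side comparison using scikit-learn's GridSearchCV and RandomizedSearchCV (the parameter grid is an illustrative assumption):

      from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
      from sklearn.tree import DecisionTreeClassifier

      param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 10, 50]}

      # Grid search: evaluates all 4 x 3 = 12 combinations.
      grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)

      # Random search: samples only a fixed number of combinations (here 5 of the 12).
      rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                                n_iter=5, cv=5, random_state=0)

      # grid.fit(X_train, y_train); grid.best_params_ then gives the best combination found.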

K-Fold Cross Validation (another approach to hyperparameter tuning)

  • Concept:

    • Divides the dataset into K equal “folds” (subsets of the data)

    • The model is trained and evaluated multiple times, using different training and validation sets each time to ensure that the model's performance is consistent and not reliant on any specific partition of the data.

    • Each fold serves as validation once while being excluded from training

  • Advantages:

    • Better data usage and mitigates dependency on single train-test splits

  • Disadvantages:

    • High computational cost and time-consuming

  • If there are 9 hyperparameter options and 5 folds are used, 9 × 5 = 45 models are built

  • Steps:

    1. Split the dataset into K folds.

    2. Train the model K times, each time using K-1 folds for training and 1 fold for validation.

    3. Compute the average performance metric (e.g., accuracy, F1-score, RMSE).

    4. Select the best hyperparameter combination based on the highest validation performance.

    5. Train the final model using the entire dataset with the best hyperparameters.

Standard Supervised Learning Process

  • Steps:

    1. Split data into training and test sets (e.g., 80:20)

    2. Apply simple train/validation split, or k-fold cross-validation on the training set only (e.g., 5-fold CV).

      I. If using cross validation, each iteration uses k-1 folds for training and 1 fold for validation.

    3. Final model evaluation conducted on the untouched test set for fair results
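
  • Sketch: a concise end-to-end version of this process, assuming scikit-learn and a synthetic dataset (illustrative, not from the notes):

      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split, GridSearchCV
      from sklearn.tree import DecisionTreeClassifier

      X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)  # step 1

      search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                            {"max_depth": [2, 4, 6, 8]}, cv=5)      # step 2: 5-fold CV on the training set only
      search.fit(X_train, y_train)

      print(search.best_params_)
      print(search.score(X_test, y_test))                           # step 3: one final evaluation on the untouched test set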

Assigning Labels for Leaf Nodes

  • Leaf Nodes:

    • Represent outcomes in classification (predicted labels) or regression (predicted values)

    • Majority voting for classification or mean calculations for regression
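
  • Sketch: a toy illustration of how a leaf's prediction is formed (made-up labels and values, not library internals):

      from collections import Counter

      leaf_labels = ["Default", "Not Default", "Default", "Default"]
      majority_label = Counter(leaf_labels).most_common(1)[0][0]    # classification: "Default"

      leaf_targets = [120.0, 150.0, 90.0]
      mean_prediction = sum(leaf_targets) / len(leaf_targets)       # regression: 120.0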

Defining the Decision Threshold

  • Decision Threshold Role:

    • Converts the probability values accompanying predictions into binary labels based on a selected threshold

    • The threshold should be chosen based on context rather than fixed at 0.5; it can and should be adjusted to suit the application

Probabilistic Predictions and Impact of Decision Threshold

  • Example:

    • Predicted probabilities indicate likelihoods (e.g., chance of subscription cancellation)

    • A higher decision threshold leads to more conservative predictions (fewer classifications as positive)
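
  • Sketch: applying a custom decision threshold to probabilistic predictions, assuming a fitted classifier (e.g., the final_model from the cross-validation sketch above) and the held-out X_test; the 0.7 threshold is an illustrative assumption:

      proba = final_model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

      default_labels = (proba >= 0.5).astype(int)       # default threshold of 0.5
      strict_labels  = (proba >= 0.7).astype(int)       # higher threshold -> fewer positive predictions
                                                        # (more conservative classifications)
      print(default_labels.sum(), strict_labels.sum())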

Pros and Cons of Decision Tree Learning

  • Advantages:

    • Easy interpretation and understanding

    • Minimal data preparation required (e.g., no need for one-hot encoding of categorical features)

  • Drawbacks:

    • Prone to overfitting; requires pruning

    • Tree structure is not very stable and can vary significantly with small changes in the training data, leading to high variance in model performance.

    • Predictive accuracy can be mediocre