Definition:
A conceptual dividing line that separates different classes in a classification problem
Determined by machine learning algorithms based on training data
Importance:
More features can improve the model but make the decision boundary harder to visualize (hard to plot beyond two or three dimensions)
The trained model acts as a mapping from feature values to class labels; the decision boundary marks where that mapping switches between classes
Decision Tree Model Example:
Pictorial representation of a decision boundary over Age and Balance for predicting Default status
Illustrates how different splits lead to different predicted labels (Default vs. Not Default); see the sketch below
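A minimal sketch of this idea, assuming scikit-learn; the Age/Balance values and labels below are made up for illustration, not the figure's actual data:

```python
from sklearn.tree import DecisionTreeClassifier

# [Age, Balance] pairs and Default labels, made up for illustration
X = [[25, 500], [40, 12000], [35, 300], [55, 9000],
     [30, 150], [60, 20000], [45, 700], [50, 15000]]
y = ["Default", "Not Default", "Default", "Not Default",
     "Default", "Not Default", "Default", "Not Default"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each learned split (e.g., "Balance <= 850") carves the Age-Balance plane
# into rectangular regions; the region edges form the decision boundary.
print(tree.predict([[28, 400]]))  # lands in a "Default" region
```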
Different Models:
Different algorithms can learn boundaries of varying complexities
Selected model must consider both assumptions and data nature
When to Stop Splitting:
Stop when a node reaches minimum impurity (a pure node whose examples all share one label)
Avoid unnecessary additional splits that yield no information gain (equivalently, no decrease in Gini impurity or in variance); see the sketch below
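A small sketch of the impurity check, assuming Gini impurity as the criterion; the labels are made up:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions p_i; 0 means pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]

# Weighted child impurity after the split; both children are pure here,
# so the decrease in Gini (0.5 -> 0.0) justifies the split. A split whose
# decrease is 0 adds no information and should be skipped.
n = len(parent)
after = len(left) / n * gini(left) + len(right) / n * gini(right)
print(gini(parent) - after)  # 0.5
```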
Definition of Overfitting:
A fully grown tree may memorize the training data without learning general patterns
Model learns noise as patterns from training data, failing to generalize
Example of Overfitted Model:
Complex decision rules based on overly narrow thresholds (e.g., very specific humidity cutoffs)
Cause:
Random noise or fluctuations in training data are learned as significant patterns
Error Analysis:
As tree complexity increases, the gap between training and test performance widens: training error keeps falling while test error begins to rise
Goal:
Favor simpler models to avoid overfitting while still retaining predictive power
Importance of Majority Label:
Assign the majority class label to a node based on the class distribution of the training examples in it
Primary Objective:
Discover patterns and make predictions that generalize well to unseen data
A model that works extremely well on training data does not necessarily work well on test data
Techniques:
Early stopping in a decision tree by setting a maximum depth (see the sketch after this list)
Hyperparameter: A parameter whose value is set before the learning process begins, which can significantly affect the performance of the model.
Tuning ML hyperparameters (such as decision-tree depth) is a tedious yet crucial task, as the performance of an algorithm can be highly dependent on the choice of hyperparameters.
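A minimal sketch of early stopping via the max_depth hyperparameter, assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in [2, 5, None]:  # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# The full tree typically reaches ~1.0 training accuracy but generalizes
# worse than a depth-limited tree -- the signature of overfitting.
```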
Process (in order to find the best parameter):
Hyperparameter tuning involves searching through a range of candidate values to find the combination that delivers the best performance on your data
Common Mistakes:
Using test data in the training process causes data leakage
Need to keep test data separate for unbiased evaluations
Recommendation:
Utilize a validation set for tuning decisions while reserving the test set for final evaluations
A common split is 80% training and 20% testing
Grid Search:
Evaluates all combinations of hyperparameters within a pre-defined grid
Guarantees finding the best combination within the grid (if the grid is sufficiently large and well-defined) but is computationally expensive; see the sketch below
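A minimal grid-search sketch, assuming scikit-learn's GridSearchCV; the grid values are arbitrary illustrations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 3 x 3 = 9 combinations; with cv=5, 45 models are trained in total
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```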
Random Search:
Selects random combinations to evaluate instead of evaluating exhaustively
More efficient and often yields good results; see the sketch below
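A corresponding random-search sketch, assuming scikit-learn's RandomizedSearchCV; the parameter ranges and n_iter are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Sample 10 random combinations instead of evaluating the full grid
param_distributions = {"max_depth": list(range(2, 20)),
                       "min_samples_leaf": list(range(1, 50))}

search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            param_distributions, n_iter=10, cv=5,
                            random_state=0)
search.fit(X, y)
print(search.best_params_)
```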
Concept:
Divides the dataset into K equal-sized “folds”
The model is trained and evaluated multiple times, using different training and validation sets each time to ensure that the model's performance is consistent and not reliant on any specific partition of the data.
Each fold serves as validation once while being excluded from training
Advantages:
Better data usage and mitigates dependency on single train-test splits
Disadvantages:
High computational cost and time-consuming
If there are 9 hyperparameter combinations and 5 folds, 9 × 5 = 45 models are built
Steps:
1. Split the dataset into K folds.
2. Train the model K times, each time using K-1 folds for training and 1 fold for validation.
3. Compute the average performance metric (e.g., accuracy, F1-score, RMSE).
4. Select the best hyperparameter combination based on the highest average validation performance.
5. Train the final model using the entire dataset with the best hyperparameters (see the sketch below).
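A minimal sketch of these steps, assuming scikit-learn's KFold, synthetic data, and accuracy as the metric:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # step 1

scores = []
for train_idx, val_idx in kf.split(X):  # step 2: K train/validation rounds
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X[train_idx], y[train_idx])
    scores.append(tree.score(X[val_idx], y[val_idx]))

print(np.mean(scores))  # step 3: average accuracy across folds

# Steps 4-5: repeat the loop per candidate hyperparameter value, keep the
# best one, then refit on the entire dataset with it.
final_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
```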
Steps:
Split the data into training and test sets (e.g., 80:20)
Apply a simple train/validation split, or k-fold cross-validation, on the training set only (e.g., 5-fold CV)
If using cross-validation, each iteration uses k-1 folds for training and 1 fold for validation
Conduct the final model evaluation on the untouched test set for fair results (see the workflow sketch below)
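A minimal end-to-end sketch of this workflow, assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 80:20 split; the test set stays untouched during tuning
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold CV on the training set only
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 4, 6]}, cv=5)
search.fit(X_tr, y_tr)

# One final, unbiased evaluation on the held-out test set
print(search.score(X_te, y_te))
```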
Leaf Nodes:
Represent outcomes in classification (predicted labels) or regression (predicted values)
Majority voting for classification or mean calculation for regression (illustrated below)
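A toy illustration of both rules; the labels and values are made up:

```python
from collections import Counter

leaf_labels = ["Default", "Default", "Not Default"]  # training rows in one leaf
print(Counter(leaf_labels).most_common(1)[0][0])     # classification: "Default"

leaf_values = [3.2, 4.1, 3.8]                        # regression targets in a leaf
print(sum(leaf_values) / len(leaf_values))           # regression: mean = 3.7
```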
Decision Threshold Role:
Maps the predicted probability attached to a leaf to a binary label by comparing it against a selected threshold
The threshold should be chosen contextually rather than fixed at the default of 0.5
Example:
Predicted probabilities indicate likelihoods (e.g., chance of subscription cancellation)
A higher decision threshold leads to more conservative predictions (fewer examples classified as positive); see the sketch below
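A minimal sketch of threshold adjustment, assuming scikit-learn and synthetic data; the 0.8 threshold is an arbitrary illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

proba = tree.predict_proba(X)[:, 1]  # P(positive class), e.g. P(cancellation)

# Raising the threshold from the default 0.5 to 0.8 makes the model more
# conservative: fewer examples are classified as positive.
print((proba >= 0.5).sum(), (proba >= 0.8).sum())
```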
Advantages:
Easy interpretation and understanding
Minimal data preparation required (no feature scaling needed; categorical features can often be used without one-hot encoding)
Drawbacks:
Prone to overfitting; requires pruning
Tree structure is not very stable and can vary significantly with small changes in the training data, leading to high variance in model performance
Predictive accuracy can be mediocre