Decision Tree Models
Building a Default Decision Tree Model
Decision trees involve less data preparation and are easy to interpret, making them useful for initial data exploration and understanding relationships within the data.
They use decision splits with IF-THEN logic, creating branches based on specific conditions to predict outcomes.
They have a tree-like graphical structure that visually represents the decision-making process, enhancing interpretability.
Essential Discovery Tasks
Select an algorithm that suits the data type and problem at hand, considering factors like target variable type and desired model complexity.
Improve the model by refining its structure, optimizing parameters, and addressing issues like overfitting or underfitting.
Optimize the complexity of the model by finding the right balance between model accuracy and generalization ability, often through techniques like pruning.
Regularize and tune the hyperparameters of the model to prevent overfitting and improve its performance on unseen data.
Build ensemble models by combining multiple decision trees to enhance predictive accuracy and robustness.
Basics of Decision Trees
Analysis Goal: Predict the outcome based on the location in the plot by partitioning the data space into distinct regions.
X1 = first input variable, forming one axis of the plot.
X2 = second input variable, forming the other axis; together they give each case's location in the plot.
Decision rules:
IF X2 < 0.63 THEN branch left, indicating a specific threshold for X2 that determines the path taken in the tree.
ELSE IF X2 ≥ 0.63 THEN branch right, providing an alternative path when the condition is not met.
Tree components:
root node: The starting point of the tree that represents the entire dataset.
interior node: Nodes within the tree that represent decision points based on input features.
leaf node: Terminal nodes that represent the final predicted outcome.
Categorical target:
yes: 40%, no: 60% - the class distribution at one node (the proportions at any node sum to 100%).
Leaves differ in their distributions - for example, 70% yes in one leaf versus 55% yes in another - showing how the class distribution changes as the tree branches.
An interval target can also be used, allowing for regression tasks where the target is a continuous variable.
Prediction is based on where the new case falls within the decision tree, following the path determined by its feature values.
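The IF-THEN branching above can be sketched as a plain function, a minimal illustration using the example threshold X2 < 0.63 from the decision rules:

```python
# A minimal sketch of the example decision rule; the threshold (0.63)
# is taken from the text, and "left"/"right" name the two branches.
def branch(x2):
    """IF X2 < 0.63 THEN branch left, ELSE branch right."""
    return "left" if x2 < 0.63 else "right"

# A new case follows the path determined by its feature values.
print(branch(0.40))  # left
print(branch(0.90))  # right
```

A full tree is simply a nesting of such rules, applied from the root node down to a leaf.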
Decision Tree Nodes
The leaf nodes provide the predictions, with each leaf corresponding to a specific outcome or value.
Modifying the Model: Tree Structure
Decision Trees for Categorical Targets: Classification Trees
A classifier model makes decisions that assign cases to predefined categories.
Categorical target estimates - for example, 70%, 55%, 40%, 60% - are the class probabilities within the respective leaf nodes.
Decision Trees for Interval Targets: Regression Trees
Numeric predictions - 14, 25, 16, 7 - are continuous values predicted from the characteristics of the input data.
Handwriting Recognition Example
Pen positions, scaled from 0 to 100, are used as inputs to map each written character to one of the output categories 0-9.
IF-THEN/ELSE rules are applied to determine which digit is being written based on the pen positions.
Classification Tree Example
Two-way split with 10 leaves, dividing data into segments based on binary decisions.
Decision rules based on conditions such as X1 < 38.5 and X10 < 0.5, showing feature thresholds that determine the path taken through the tree.
Posterior Probability of Target Class
Defines several multivariate step functions, with each function representing how the probabilities of class membership change across varying feature values.
Each function corresponds to the posterior probability of a target class, providing insight into the likelihood of each class given the input features.
Decision Trees for Interval Targets: Regression Trees
Interval Target: Any numeric value within a certain range.
Regression Tree Example
Leaves = Boolean Rules: if RM ∈ {values} and NOX ∈ {values}, then MEDV=value, predicting a numeric outcome based on satisfying certain conditions.
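As an illustrative sketch of a regression tree (on synthetic data, not the RM/NOX/MEDV example itself), each leaf predicts a single constant numeric value, so the fitted surface is a step function:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for two numeric inputs and an interval target;
# all values here are hypothetical.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 30.0, 15.0) + rng.normal(0, 1, 200)

# Each leaf of the regression tree predicts one constant value.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
high = tree.predict([[8.0, 3.0]])[0]  # lands in the high-value region
low = tree.predict([[2.0, 3.0]])[0]   # lands in the low-value region
print(high, low)
```

Cases satisfying the same leaf's conditions all receive the same prediction, which is what the Boolean-rule view of the leaves expresses.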
Decision Tree Splits
A decision tree can have only two-way splits: False, as multi-way splits are also possible.
Improving the Decision Tree Model
Essential discovery tasks:
Select an algorithm.
Improve the model.
Optimize the complexity of the model.
Regularize and tune the hyperparameters of the model.
Build ensemble models.
Structure: the model is built by a recursive partitioning method, breaking the dataset into smaller and smaller subsets to create more refined decision rules.
Modifying the Model: Recursive Partitioning
Recursive partitioning
Top-down, greedy algorithm that makes locally optimal choices at each step.
Split search
Selecting the best split to predict whether the handwritten digit is 1, 7, or 9, optimizing which feature and threshold to use for each node.
Splitting criterion
Measures the reduction in variability of the target distribution to determine the best split.
Recursive Partitioning Steps
Identifies candidate splits, evaluating different features and thresholds to find the most effective partition.
Selects the best split based on the splitting criterion.
Continues until a stopping rule prevents growth, balancing model complexity and generalizability.
Splitting Criteria
Goal:
Reduce variability.
Increase purity.
The splitting criterion measures impurity reduction, using different measures for categorical and interval targets.
Splitting Criterion Formula
Impurity reduction = impurity index of the parent node − weighted average of the impurity index values for the child nodes, with weights proportional to the child node sizes.
Node Gini Index
The probability that any two elements of a group, chosen at random, with replacement, are different.
Gini Index formula: Gini = 1 − Σ p_i², where p_i is the proportion of class i at the node; this equals the probability of an interspecific encounter.
High diversity, low purity: high probability of an interspecific encounter (Gini near its maximum, 0.5 for two classes).
Low diversity, high purity: low probability of an interspecific encounter (Gini near 0).
Interval
For interval targets, reduce the variance of the target within each node to improve the precision of the prediction.
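The impurity measures above can be sketched as plain functions for a categorical target (a minimal illustration, not SAS's implementation):

```python
# Gini index and impurity reduction as plain functions; `counts` is a
# list of class counts at a node (binary or multiclass).
def gini(counts):
    """1 - sum(p_i^2): the chance that two random draws (with
    replacement) from the node belong to different classes."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_reduction(parent, children):
    """Parent impurity minus the size-weighted mean impurity of the children."""
    n = sum(parent)
    return gini(parent) - sum(sum(ch) / n * gini(ch) for ch in children)

print(gini([10, 0]))                                 # pure node: 0.0
print(gini([5, 5]))                                  # 50/50 node: 0.5
print(impurity_reduction([5, 5], [[5, 0], [0, 5]]))  # perfect split: 0.5
```

A split that separates the classes perfectly removes all of the parent's impurity, which is why the perfect split above scores the full 0.5.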
Split Search for Binary and Categorical Targets
Potential Split Points
Interval: Each unique value, testing every possible threshold.
Categorical: Each combination of all levels to decide how to best divide the categories.
For each input, the algorithm evaluates all possible splits and selects the best one.
The single best split is then chosen from all candidate splits, using a chi-squared-based criterion.
Branching Rule
if x_n < value, branch left
if x_n ≥ value, branch right
Use of classification matrix to gauge the accuracy of different branching rules.
Split Selection Metric
Logworth = −log10(chi-squared p-value); larger values indicate a stronger split.
Adjusted logworth corrects the p-value for the number of candidate splits considered (a multiple-comparison adjustment).
Select the split with the maximum adjusted logworth.
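A sketch of the logworth metric, assuming it is computed as −log10 of the p-value from a chi-squared test on the branch-by-target table (the counts below are hypothetical):

```python
from math import log10
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows are the two branches of a candidate
# split, columns are the target levels (yes, no).
table = [[30, 10],
         [12, 28]]
chi2, p, dof, expected = chi2_contingency(table)
logworth = -log10(p)  # assumed definition: -log10 of the p-value
print(round(logworth, 2))  # larger logworth = stronger evidence for the split
```

Working on the −log10 scale turns very small p-values into conveniently sized positive numbers, so the best split is simply the one with the largest (adjusted) logworth.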
Maximal Tree
Build the maximal tree, then generalize it by optimizing its complexity through pruning.
Redundant inputs? The splitting rules use only the inputs that are predictive of the target.
Types of Splits
Two-way splits.
Multi-way splits.
Multi-way splits give wider, shorter trees, but they do not necessarily outperform trees with binary splits: they take more time to grow and can be more challenging to interpret.
Number of Splits
If you build a decision tree model to predict a binary target, the rules in that decision tree are limited to two-way splits: False.
Optimizing the Complexity of a Decision Tree Model
Pruning.
Pruning
A tree that overfits the training data does not generalize well.
Largest tree (no stopping rules): splits until each observation has its own leaf.
Maximal tree (with some stopping rules): likely to be overfit, because it adapts to both the systematic variation (signal) and the random variation (noise).
Smaller tree (with strict stopping rules): fails to adapt sufficiently to the systematic variation (signal).
Model Selection
The maximal tree is grown top-down using the splitting criterion.
Pruning then works bottom-up; this is the approach used in Model Studio.
Pruning Steps
Starting from the maximal tree (n leaves), compare all possible subtrees with n − 1 leaves, each formed by removing one split, and keep the one with the optimal value of the model selection criterion; repeat on the result.
The model selection criterion is evaluated on validation data.
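Model Studio prunes using validation data; as a rough analogue outside Model Studio, scikit-learn's bottom-up cost-complexity pruning can be used to compare pruned subtrees on a holdout set (a sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data split into training and validation parts.
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Compute the pruning path of the maximal tree, then score every
# pruned subtree on validation data (preferring fewer leaves on ties).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
scores = []
for alpha in path.ccp_alphas:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    scores.append((t.score(X_val, y_val), t.get_n_leaves()))
best_score, n_leaves = max(scores, key=lambda s: (s[0], -s[1]))
print(best_score, n_leaves)
```

Each increase in the pruning strength `ccp_alpha` removes splits from the bottom of the tree, mirroring the sequence of subtrees described above.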
Regularizing and Tuning the Hyperparameters of a Machine Learning Model
Hyperparameters are settings, specified before training, that control how the model learns to map the inputs to the target.
Parameters
Parameters are values learned from the data during training to produce the optimal model.
Examples - regression: intercept and slopes; decision trees: splitting rules.
Tuning Options
Hyperparameter settings are specified externally rather than estimated from the data during training.
For decision trees, an example is the maximum tree depth, which can be set manually or chosen by automated tuning.
Autotuning Nodes: Autotuning is disabled in SAS Viya for Learners, and selecting it results in an error.
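Outside SAS Viya for Learners, automated tuning can be sketched with scikit-learn's cross-validated grid search over a hyperparameter such as maximum tree depth (illustrative only, not Model Studio's autotuning):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Cross-validated search over one hyperparameter: maximum tree depth.
X, y = make_classification(n_samples=300, random_state=1)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 4, 6, 8, None]},
    cv=5,
).fit(X, y)
print(search.best_params_)  # the depth with the best cross-validated accuracy
```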
Building Ensemble Models
An ensemble model aggregates the predictions of multiple models, built either with different algorithms or with the same algorithm applied to different samples or tuning settings.
Ensemble Model Explainability
Ensembles typically improve predictions but reduce interpretability compared with a single model.
Perturb and Combine Methods
Individual trees have an unstable structure (small changes in the data can change the splits), but an ensemble of them delivers stable performance.
Trees are good candidates for ensemble models.
Stability in Ensemble Methods
Multiple candidate splits on the same and different inputs that give similar performance.
[Table: competitor splits, listing each split's logworth and the min-max range of inputs X1 and X2]
Perturb and Combine Methods (P & C)
Create different models.
Create a single prediction.
Perturb
Create different models by resampling, subsampling, adding noise, adaptively reweighting, or randomly choosing from the competitor splits.
Combine
Create a single prediction by voting, weighted voting, or averaging.
Variance reduction.
Voting Combination
vote(T1, T2, T3) = prediction: the ensemble predicts the class chosen by the majority of the individual trees.
Averaging Combination
The ensemble's estimate is the average of the individual trees' predictions (for example, the mean of their posterior probabilities).
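The two combination rules can be sketched as plain functions (a minimal illustration):

```python
from collections import Counter

def vote(predictions):
    """Plurality vote over the individual models' class predictions."""
    return Counter(predictions).most_common(1)[0][0]

def average(estimates):
    """Mean of the individual models' numeric estimates."""
    return sum(estimates) / len(estimates)

print(vote(["yes", "no", "yes"]))   # yes
print(average([0.70, 0.55, 0.40]))  # 0.55
```

Voting suits class predictions; averaging suits posterior probabilities or interval-target predictions.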
Relationship of Models
The ensemble can approximate the true relationship between inputs and target more closely than a single tree, but that relationship cannot be interpreted directly from the combined model.
Examples of P & C Methods
Bagging.
Boosting.
Random Forest
Draw K bootstrap samples.
Build a tree on each bootstrap sample.
Combine the predictions by voting or averaging.
Bootstrap Sampling
Draw K bootstrap samples: random with replacement.
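Drawing K bootstrap samples can be sketched with NumPy (illustrative sizes):

```python
import numpy as np

# Each bootstrap sample is the size of the original data, drawn at
# random with replacement, so some cases repeat and others are left out.
rng = np.random.default_rng(42)
data = np.arange(10)
K = 3
samples = [rng.choice(data, size=len(data), replace=True) for _ in range(K)]
for s in samples:
    print(sorted(s))
```

The cases that a given sample leaves out form that tree's out-of-bag set, which becomes useful later for validation.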
Building Trees
Each tree is grown independently, and pruning the individual trees can be counterproductive.
Combine the Predictions
Classification: plurality vote of the predicted classes, or the mean of the posterior probabilities.
Interval: mean of the predicted values.
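As an illustrative sketch on synthetic data (not the course's example), a forest built this way can be compared against a single tree by mean cross-validated accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# One tree versus a forest (bootstrap samples + combined votes),
# compared by mean cross-validated accuracy on synthetic data.
X, y = make_classification(n_samples=500, n_informative=8, random_state=0)
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest_acc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5).mean()
print(tree_acc, forest_acc)  # the forest is usually the more stable and accurate
```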
Boosting
Features:
Dependent.
Focuses on misclassified cases.
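A minimal boosting sketch using scikit-learn's AdaBoost, whose rounds reweight the data so misclassified cases receive more attention (illustrative data, not the course's implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost builds trees one after another (dependent models); each
# round reweights the cases so misclassified ones get more attention.
X, y = make_classification(n_samples=300, random_state=2)
boost = AdaBoostClassifier(n_estimators=50, random_state=2).fit(X, y)
print(boost.score(X, y))  # accuracy of the boosted ensemble on the data it saw
```

Unlike bagging, the trees here are not independent: each one depends on the errors of those before it.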
Comparison of Tree-Based Models
Tree-based ensembles built by bagging and by boosting both produce predictions that are step functions of the inputs.
Perturb and Combine Methods Are Not Exclusive to Decision Trees
Perturb and combine methods are used only with decision trees: False.
Gradient Boosting Framework
Trees are built sequentially, so each depends on the earlier ones: starting at k = 1, case weights are updated after each of the M rounds based on the misclassification count, so later trees focus on the misclassified cases.
Forest Algorithm
Select a sample of cases, with replacement
Select a sample of input variables from all available inputs
The input that has the strongest association with the target is used in the splitting rule for that node.
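In scikit-learn's forest (a rough analogue of the algorithm above, not Model Studio's implementation), the size of the per-node input sample corresponds to the max_features hyperparameter:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# max_features="sqrt" samples about sqrt(p) of the p inputs at each
# node; the strongest of that sample is used in the splitting rule.
X, y = make_classification(n_samples=200, random_state=4)
rf = RandomForestClassifier(n_estimators=25, max_features="sqrt", random_state=4).fit(X, y)
print(rf.score(X, y))
```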
Forest Diversity
Each tree is created on a different sample of cases.
Each splitting rule is based on a different sample of inputs.
The individual models are more varied.
Forest Predictions
The forest is more stable and improves the predictions compared to a single decision tree.
Forest Model Studio
Forest algorithm in Model Studio: different combinations of cases (bagging) and inputs are used for the splits, which yields:
Greater diversity.
Better predictive accuracy.
Out-of-Bag Samples
In-bag sample (bagged data).
The cases withheld from each bootstrap sample form the out-of-bag sample.
The out-of-bag sample provides a separate validation data set for each tree.
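A sketch of the out-of-bag estimate using scikit-learn, where oob_score=True scores each case with the trees whose bootstrap sample left it out (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With oob_score=True, each case is scored by the trees that did not
# see it in their bootstrap sample - a built-in validation estimate.
X, y = make_classification(n_samples=400, random_state=3)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=3).fit(X, y)
print(rf.oob_score_)  # accuracy estimated on out-of-bag cases
```

This gives an honest performance estimate without setting aside a separate validation partition.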
Gradient Boosting versus Forest Models
Forests are based on bagging. Gradient boosting models are based on boosting.