Decision Tree Models

Building a Default Decision Tree Model

  • Decision trees involve less data preparation and are easy to interpret, making them useful for initial data exploration and understanding relationships within the data.

  • They utilize a decision-split with IF-THEN logic, creating branches based on specific conditions to predict outcomes.

  • They have a tree-like graphical structure that visually represents the decision-making process, enhancing interpretability.

    Essential Discovery Tasks

  • Select an algorithm that suits the data type and problem at hand, considering factors like target variable type and desired model complexity.

  • Improve the model by refining its structure, optimizing parameters, and addressing issues like overfitting or underfitting.

  • Optimize the complexity of the model by finding the right balance between model accuracy and generalization ability, often through techniques like pruning.

  • Regularize and tune the hyperparameters of the model to prevent overfitting and improve its performance on unseen data.

  • Build ensemble models by combining multiple decision trees to enhance predictive accuracy and robustness.

    Basics of Decision Trees

  • Analysis Goal: Predict the outcome based on the location in the plot by partitioning the data space into distinct regions.

  • X1 = the first input variable, forming one axis of the plot of the data space.

  • X2 = the second input variable, forming the other axis; thresholds on these inputs define the regions.

  • Decision rules:

    • IF X2 < 0.63 THEN branch left, indicating a specific threshold for X2 that determines the path taken in the tree.

    • ELSE IF X2 ≥ 0.63 THEN branch right, providing an alternative path when the condition is not met.

  • Tree components:

    • root node: The starting point of the tree that represents the entire dataset.

    • interior node: Nodes within the tree that represent decision points based on input features.

    • leaf node: Terminal nodes that represent the final predicted outcome.

  • Categorical target:

    • yes: 40%, no: 60% - the distribution of classes at a specific leaf.

    • The class distribution shifts from leaf to leaf as the tree branches; each leaf reports its own estimate (for example, yes: 70% at another leaf).

  • An interval target can also be used, allowing for regression tasks where the target is a continuous variable.

  • Prediction is based on where the new case falls within the decision tree, following the path determined by its feature values.
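The basics above can be sketched with scikit-learn: fit a small classification tree on synthetic two-input data and print its IF-THEN rules. The feature names X1/X2 and the 0.63 threshold mirror the notes, but the data here is made up, so the learned threshold is only approximately 0.63.

```python
# Sketch: a small classification tree on synthetic data, printed as IF-THEN
# rules. X1/X2 and the 0.63 threshold are illustrative; sklearn learns its
# own thresholds from the data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))      # columns: X1, X2 in [0, 1]
y = (X[:, 1] < 0.63).astype(int)          # target determined by an X2 threshold

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["X1", "X2"]))  # root, interior, leaf nodes
```

A new case is scored by following the branches its feature values satisfy until it reaches a leaf.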

    Decision Tree Nodes

  • The leaf nodes provide the predictions, with each leaf corresponding to a specific outcome or value.

    Modifying the Model: Tree Structure

  • Decision Trees for Categorical Targets: Classification Trees

    • The classifier model assigns each case to one of the predefined categories.

    • Categorical target estimates (70%, 55%, 40%, 60%) give the estimated probability of the target class within the respective leaf nodes.

  • Decision Trees for Interval Targets: Regression Trees

    • Numeric predictions (14, 25, 16, 7) provide continuous values as predictions based on the characteristics of the input data.
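The contrast between the two tree types can be shown side by side: a classification tree stores class probabilities at its leaves, while a regression tree stores the mean of the interval target. The data below is synthetic, purely for illustration.

```python
# Sketch: classification tree (class probabilities at leaves) versus
# regression tree (numeric leaf means) on synthetic single-input data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 1))

y_class = (X[:, 0] > 0.5).astype(int)                # categorical target
y_num = 10 * X[:, 0] + rng.normal(0, 0.1, 300)       # interval target

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_num)

print(clf.predict_proba([[0.9]]))   # posterior class probabilities at the leaf
print(reg.predict([[0.9]]))        # mean of the target values in the leaf
```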

    Handwriting Recognition Example

  • Using pen positions scaled from 0-100 for digits 0-9 to map input data to output categories.

  • IF-THEN/ELSE rules are applied to determine which digit is being written based on the pen positions.

    Classification Tree Example

  • Two-way split with 10 leaves, dividing data into segments based on binary decisions.

  • Decision rules based on conditions such as X1 < 38.5 and X10 < 0.5, showing feature thresholds that determine the path taken through the tree.

    Posterior Probability of Target Class

  • Defines several multivariate step functions, with each function representing how the probabilities of class membership change across varying feature values.

  • Each function corresponds to the posterior probability of a target class, providing insight into the likelihood of each class given the input features.

    Decision Trees for Interval Targets: Regression Trees

  • Interval Target: Any numeric value within a certain range.

    Regression Tree Example

  • Leaves = Boolean Rules: if RM ∈ {values} and NOX ∈ {values}, then MEDV=value, predicting a numeric outcome based on satisfying certain conditions.

    Decision Tree Splits

  • A decision tree can have only two-way splits: False, as multi-way splits are also possible.

    Improving the Decision Tree Model

  • Essential discovery tasks:

    • Select an algorithm.

    • Improve the model.

    • Optimize the complexity of the model.

    • Regularize and tune the hyperparameters of the model.

    • Build ensemble models.

  • The tree structure is built by a recursive partitioning method, which breaks the dataset into smaller subsets to create more refined decision rules.

    Modifying the Model: Recursive Partitioning

  • Recursive partitioning

    • Top-down, greedy algorithm that makes locally optimal choices at each step.

    Split search

  • Selecting the best split to predict whether the handwritten digit is 1, 7, or 9, optimizing which feature and threshold to use for each node.

    Splitting criterion

  • Measures the reduction in variability of the target distribution to determine the best split.

    Recursive Partitioning Steps

  • Identifies candidate splits, evaluating different features and thresholds to find the most effective partition.

  • Selects the best split based on the splitting criterion.

  • Continues until a stopping rule prevents growth, balancing model complexity and generalizability.
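The steps above can be sketched in miniature for a single interval input: try every unique value as a candidate threshold, score each split by how much it reduces Gini impurity, and keep the best (a greedy, locally optimal choice). This is a toy illustration, not a full tree grower.

```python
# Minimal sketch of one step of recursive partitioning: greedy split search
# on one interval input using Gini impurity reduction as the criterion.
import numpy as np

def gini(y):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (threshold, impurity reduction) for the best binary split on x."""
    parent = gini(y)
    best = (None, 0.0)
    for t in np.unique(x):                      # candidate splits
        left, right = y[x < t], y[x >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        # size-weighted average impurity of the two children
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best[1]:
            best = (t, parent - child)
    return best

x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # the split at 0.7 separates the classes cleanly
```

A real tree grower applies this search to every input at every node and recurses on each child until a stopping rule fires.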

    Splitting Criteria

  • Goal:

    • Reduce variability.

    • Increase purity.

  • The splitting criterion uses impurity-reduction measures, with different measures for categorical and interval targets.

    Splitting Criterion Formula

  • Δ = impurity_0 − ( (n_1/n_0)·impurity_1 + (n_2/n_0)·impurity_2 + (n_3/n_0)·impurity_3 + (n_4/n_0)·impurity_4 )

    • That is, the impurity index of the parent node minus the weighted average of the impurity index values for the child nodes.
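A small worked example of Δ: with made-up node sizes and impurity values for a four-way split, Δ is the parent impurity minus the size-weighted average of the child impurities.

```python
# Worked example of the impurity-reduction formula: parent impurity minus
# the size-weighted average of the child impurities (values are made up).
impurity_parent = 0.50
sizes = [40, 30, 20, 10]           # n_1..n_4; parent size n_0 = 100
impurities = [0.10, 0.20, 0.30, 0.40]

n0 = sum(sizes)
weighted_child = sum(n / n0 * imp for n, imp in zip(sizes, impurities))
delta = impurity_parent - weighted_child
print(delta)   # 0.50 - 0.20, i.e. about 0.30
```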

    Node Gini Index

  • The probability that any two elements of a group, chosen at random, with replacement, are different.

    Gini Index formula:

  • Gini index = 1 − Σᵢ pᵢ², where pᵢ is the proportion of class i at the node.

  • High diversity, low purity: probability of interspecific encounter (Gini index) of 0.69.

  • Low diversity, high purity: probability of interspecific encounter (Gini index) of 0.24.
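The definition above (the probability that two elements drawn at random with replacement belong to different classes) can be checked against the formula directly: for made-up class proportions, a simulation of paired draws should match 1 − Σ pᵢ².

```python
# Check that the Gini formula 1 - sum(p_i^2) equals the probability that two
# with-replacement draws land in different classes (proportions are made up).
import numpy as np

def gini(p):
    """Gini index from class proportions p."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

rng = np.random.default_rng(0)
p = [0.4, 0.6]                                    # class proportions at a node
draws = rng.choice(len(p), size=(100_000, 2), p=p)  # pairs of random draws
simulated = np.mean(draws[:, 0] != draws[:, 1])     # fraction of mismatched pairs

print(gini(p))        # 0.48 by the formula
print(simulated)      # close to 0.48 by simulation
```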

    Interval

  • Reduce the variance of the target to improve the precision of the prediction.

    Split Search for Binary and Categorical Targets

  • Potential Split Points

    • Interval: Each unique value, testing every possible threshold.

    • Categorical: Each combination of all levels to decide how to best divide the categories.

  • For each input, evaluates all possible splits and selects the best one.

  • Selects the single best split across all inputs, using the chi-squared statistic to compare candidate splits.

    Branching Rule

  • If x_n < value, branch left.

  • If x_n ≥ value, branch right.

    • Use of classification matrix to gauge the accuracy of different branching rules.

    Split Selection Metric

    • Logworth = −log₁₀(chi-squared p-value).

    • Adjusted logworth

  • Select split with max adjusted logworth
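The logworth metric can be computed with scipy: build the 2×2 contingency table a candidate split induces (branch by target class), run a chi-squared test, and take −log₁₀ of the p-value. The counts below are made up; note that scipy's test is a generic chi-squared test, not SAS's adjusted logworth.

```python
# Sketch of the logworth metric: -log10 of the chi-squared p-value for the
# contingency table produced by a candidate split (counts are made up).
import numpy as np
from scipy.stats import chi2_contingency

# rows: left/right branch; columns: target class counts
table = np.array([[80, 20],
                  [30, 70]])

chi2, p_value, dof, _ = chi2_contingency(table)
logworth = -np.log10(p_value)
print(logworth)   # larger logworth indicates a stronger split
```

The split search computes this for every candidate split and keeps the one with the maximum (adjusted) logworth.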

    Maximal Tree

  • Build the maximal tree first, then generalize it.

  • Optimize complexity through pruning.

    Redundant inputs?

  • The tree uses only the inputs that are predictive of the target, so redundant inputs are simply ignored.

    Types of Splits

  • Two-way splits.

  • Multi-way splits.

    • Multi-way splits make trees wider and shorter, but they do not necessarily outperform trees with binary splits; they take more time to grow and can be more challenging to interpret.

    Number of Splits

  • If you build a decision tree model to predict a binary target, the rules in that decision tree are limited to two-way splits: False.

    Optimizing the Complexity of a Decision Tree Model

  • Pruning.

    Pruning

  • A tree that is overfit to the training data does not generalize well.

    • Largest tree (no stopping rules): splits until each observation has its own leaf.

    • Maximal tree (with some stopping rules): likely to be overfit because it adapts to both the systematic variation (signal) and the random variation (noise).

    • Smaller tree (with strict stopping rules): fails to adapt sufficiently to the systematic variation (signal).

    Model Selection

  • The maximal tree is grown top-down.

  • Pruning then works bottom-up; this is the approach Model Studio uses.

    Pruning Steps

  • Starting from the maximal tree with n leaves, consider all possible subtrees with n − 1 leaves (each formed by removing one split), keep the subtree with the optimal value of the model selection criterion, and repeat on the result.

  • The model selection criterion is evaluated on validation data.
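The pruning idea can be sketched with scikit-learn's cost-complexity pruning path, standing in for Model Studio's subtree search: grow a large tree, generate a sequence of progressively pruned candidates, and pick the one that scores best on held-out validation data. The data is synthetic.

```python
# Sketch of bottom-up pruning: generate candidate pruned subtrees via the
# cost-complexity path, then select the one with the best validation score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:          # each alpha yields a smaller subtree
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)   # model selection on validation data
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```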

    Regularizing and Tuning the Hyperparameters of a Machine Learning Model

  • Hyperparameters are settings chosen before training; given the inputs and target, training then determines the parameters of the optimal model.

    Parameters

  • Parameters are learned from the data when the optimal model is trained.

  • Examples: in regression, the intercept and slopes; in decision trees, the splitting rules.

    Tuning Options

  • Hyperparameters are specified externally rather than learned from the inputs and target during training.

  • Decision tree example: the maximum tree depth, which can be set manually or tuned automatically.

  • Autotuning Nodes: Autotuning is disabled in SAS Viya for Learners, and selecting it results in an error.
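Since Autotuning is unavailable there, the same idea can be sketched outside Model Studio with scikit-learn's `GridSearchCV`: a cross-validated search over maximum tree depth. The grid values and data are illustrative.

```python
# Sketch of automated hyperparameter tuning: cross-validated search over
# maximum tree depth on synthetic data (grid values are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, None]},  # None = unrestricted depth
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```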

    Building Ensemble Models

  • An ensemble model aggregates the predictions of multiple models, built either with different algorithms or with the same algorithm applied to different samples or tuning settings.

    Ensemble Model Explainability

  • Ensemble models generally improve predictions at the cost of explainability and interpretability.

    Perturb and Combine Methods

  • Individual trees have an unstable structure (small changes in the data can change the splits), but an ensemble of them delivers stable performance.

  • Trees are good candidates for ensemble models.

    Stability in Ensemble Methods

  • Multiple candidate splits on the same and different inputs that give similar performance.

    • A competitor-splits table lists each candidate input (for example, X1 and X2) with its logworth and the min-max range of the split values.

    Perturb and Combine Methods (P & C)

  • Create different models.

  • Create a single prediction.

    Perturb

  • Create different models by resampling, subsampling, adding noise, adaptively reweighting, or randomly choosing from the competitor splits.

    Combine

  • Create a single prediction by voting, weighted voting, or averaging.

  • Variance reduction.
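The two combination rules above can be shown with made-up predictions from three "trees": a plurality vote over predicted classes, and an average over posterior probabilities.

```python
# Minimal sketch of the combine step: majority voting on predicted classes
# and averaging of posterior probabilities (all predictions are made up).
import numpy as np

# class predictions from three trees for four cases (rows = cases)
votes = np.array([[1, 0, 1],
                  [0, 0, 1],
                  [1, 1, 1],
                  [0, 0, 0]])

# plurality vote per case: class 1 wins when at least 2 of 3 trees say 1
vote_pred = (votes.sum(axis=1) >= 2).astype(int)
print(vote_pred)          # [1 0 1 0]

# posterior probabilities of class 1 from the same three trees, two cases
probs = np.array([[0.9, 0.6, 0.7],
                  [0.2, 0.4, 0.1]])
print(probs.mean(axis=1)) # averaged posterior probability per case
```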

    Voting Combination

  • Ensemble Model vote(T1, T2, T3) = prediction.

    Averaging Combination

  • For interval targets, the ensemble model averages the individual estimates to form the prediction.

    Relationship of Models

  • An ensemble model can come closer to the true relationship between the inputs and the target than a single tree.

  • That relationship, however, can no longer be interpreted directly from the model.

    Examples of P & C Methods

  • Bagging.

  • Boosting.

    Random Forest

  • Draw K bootstrap samples.

  • Build a tree on each bootstrap sample.

  • Combine the predictions by voting or averaging.

    Bootstrap Sampling

  • Draw K bootstrap samples: random with replacement.
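Bootstrap sampling is easy to demonstrate directly: each sample is drawn at random with replacement and has the same size as the original data, so some cases repeat while others are left out entirely.

```python
# Bootstrap sampling sketch: draw with replacement from n cases; repeated
# cases form the in-bag sample, and the never-drawn cases are left out.
import numpy as np

rng = np.random.default_rng(0)
n = 10
idx = rng.choice(n, size=n, replace=True)   # one bootstrap sample of indices
in_bag = set(idx)                           # distinct cases that were drawn
left_out = set(range(n)) - in_bag           # cases never drawn this round
print(sorted(in_bag), sorted(left_out))
```

Each of the K trees in the forest is then fit to its own such sample.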

    Building Trees

  • The trees are grown independently of one another, and pruning them can be counterproductive.

    Combine the Predictions

  • Classification: plurality vote of the predicted classes, or mean of the posterior probabilities.

  • Interval: mean of the predicted values.

    Boosting

    Features:

    • Dependent: each tree is built sequentially, based on the trees before it.

    • Focuses on misclassified cases by giving them more weight in later trees.

    Comparison of Tree-Based Models

  • Both bagging and boosting ensembles of trees still produce step functions, but combining many trees yields finer-grained steps than a single tree.

    Perturb and Combine Methods Are Not Exclusive to Decision Trees

  • Perturb and combine methods are used only with decision trees: False.

    Gradient Boosting Framework

  • The trees are dependent: starting at k = 1, each iteration updates the case weights based on the misclassification count, so subsequent trees focus on the misclassified cases.

    Forest Algorithm

  • Select a sample of cases, with replacement.

  • Select a sample of input variables from all available inputs.

  • The input that has the strongest association with the target is used in the splitting rule for that node.

    Forest Diversity

  • Each tree is created on a different sample of cases.

  • Each splitting rule is based on a different sample of inputs.

  • The individual models are more varied.

    Forest Predictions

  • The forest is more stable and improves the predictions compared to a single decision tree.

    Forest Model Studio

  • The forest algorithm in Model Studio extends bagging by using different combinations of cases and inputs for the splits.

  • Greater diversity.

  • Better predictive accuracy.

    Out-of-Bag Samples

  • In-bag sample (bagged data): the cases selected for a tree's bootstrap sample.

  • Out-of-bag sample: the withheld cases that were not selected for that tree.

  • The out-of-bag sample provides an additional validation data set for each tree.
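scikit-learn exposes this idea directly: with `oob_score=True`, each tree is evaluated on the cases it never saw, giving a built-in estimate of generalization accuracy without a separate validation split. The data is synthetic.

```python
# Sketch of out-of-bag validation: the forest reports accuracy measured on
# each tree's never-sampled (out-of-bag) cases.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0).fit(X, y)
print(forest.oob_score_)   # accuracy on out-of-bag cases
```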

    Gradient Boosting versus Forest Models

  • Forests are based on bagging. Gradient boosting models are based on boosting.
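The contrast can be sketched on the same synthetic data: the forest grows independent trees on bootstrap samples (bagging), while gradient boosting grows trees sequentially, each one fitting the errors of those before it. Cross-validation scores here are illustrative, not a general verdict on either method.

```python
# Sketch comparing the two ensembles: bagging-based forest versus
# sequential gradient boosting, scored by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

scores = {}
for model in (RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores[type(model).__name__] = cross_val_score(model, X, y, cv=5).mean()

print(scores)
```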