Decision Tree
How to Read a Node
Understanding how to interpret a node in a decision tree is crucial for utilizing this modeling technique effectively. A node represents a decision point in the tree that splits data based on certain criteria, guiding the flow through the tree until reaching an outcome.
Decision Tree Vocabulary
Key terms defined:
Node: A point in the tree where data is split based on certain feature criteria.
Branch: A line representing the outcomes of a node's decisions leading to further nodes or leaf nodes.
Leaf Node: A terminal node that signifies a final outcome or prediction.
Classification and Regression Trees
Decision trees can be categorized into two main types: Classification Trees and Regression Trees. The choice between them depends on the target variable:
Classification Trees: Used when the target variable is categorical. They predict class labels.
Regression Trees: Used when the target variable is continuous. They output numeric predictions.
Motivating Question
The fundamental question driving the use of a decision tree is: "How can we predict the value of a given data point (either numeric or categorical) based on the existing data?" This involves taking a new data point and predicting its value by assessing similar, previously known values without depending on distance metrics like those used in k-Nearest Neighbors (kNN).
Approach to Decision Trees
The technique involves drawing a minimal number of horizontal and vertical lines to partition the predictor space into segmented regions. Each region should ideally contain target values of a single class (or of similar numeric value), so that the partition can be represented in a decision tree format.
Example of Minimal Partitioning
To illustrate minimal partitioning using a decision tree, consider the following new values to classify based on the established model of age and income:
Age = 38, Income = $98k
Age = 45, Income = $75k
Age = 36, Income = $120k
These examples are used to demonstrate how the decision tree can classify incoming data points based on learned patterns from prior datasets.
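The classification step can be sketched in R. The training data, the labeling rule, and the variable name buyer below are all invented for illustration; only the three new points come from the example above.

```r
library(rpart)

# Hypothetical training data (age in years, income in $k); the labeling
# rule is made up so the example is self-contained.
set.seed(42)
train <- data.frame(
  age    = sample(20:65, 200, replace = TRUE),
  income = sample(30:150, 200, replace = TRUE)
)
train$buyer <- factor(ifelse(train$age < 40 & train$income > 90, "yes", "no"))

fit <- rpart(buyer ~ age + income, data = train, method = "class")

# The three new points from the example above
new_points <- data.frame(age = c(38, 45, 36), income = c(98, 75, 120))
pred <- predict(fit, new_points, type = "class")
pred
```

Each new point is routed down the tree, answering the question at each node, until it lands in a leaf whose majority class becomes the prediction.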
Data Dictionary for eBay Auctions
The context for applying decision trees is presented through a business problem involving online auctions. The objective is to better understand auction behaviors to improve profits:
Business Problem: Bidding wars on eBay are difficult to manage, wasting time and reducing profitability.
Goal: Develop a predictive model to identify online auctions with minimal activity to maximize time efficiency and profitability.
Data Source: The dataset used is eBayAuctions.csv, which includes both numeric and categorical variables related to online auction data.
R Packages
An understanding of R packages is essential for model building in the context where the data is derived from eBay auctions. It is important to note that:
The rpart() function in R automatically determines whether to construct a Classification Tree or a Regression Tree based on the type of the target variable (categorical or numeric, respectively).
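A minimal sketch of this auto-detection, using invented data: a factor target yields a classification tree, a numeric target a regression tree, and the fitted object records which in its method element.

```r
library(rpart)

set.seed(1)
df <- data.frame(x = runif(120), y = runif(120))
df$label <- factor(ifelse(df$x + df$y > 1, "high", "low"))

# Factor target -> rpart builds a classification tree ("class")
fit_class <- rpart(label ~ x + y, data = df)

# Numeric target -> rpart builds a regression tree ("anova")
fit_reg <- rpart(y ~ x + label, data = df)

fit_class$method  # "class"
fit_reg$method    # "anova"
```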
Data Preparation
In preparing the dataset, make sure that:
The target variable is a categorical variable (factor) for classification tasks, or a numeric variable for regression tasks.
Data types are correct and unnecessary variables are removed to avoid potential issues during analysis.
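A short sketch of this preparation. The column names (ID, Category, Duration, Competitive) are assumptions for illustration, not the actual eBayAuctions.csv schema.

```r
# Toy stand-in for the auction data; column names are hypothetical
auctions <- data.frame(
  ID          = 1:6,   # identifier: carries no predictive information
  Category    = c("Toys", "Books", "Toys", "Art", "Books", "Art"),
  Duration    = c(3, 5, 7, 3, 5, 7),
  Competitive = c(1, 0, 1, 0, 0, 1)
)

# The target must be a factor for a classification tree
auctions$Competitive <- factor(auctions$Competitive, labels = c("no", "yes"))

# Categorical predictors should also be stored as factors
auctions$Category <- factor(auctions$Category)

# Drop variables that should not enter the model
auctions$ID <- NULL

str(auctions)
```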
Model Building and Prediction
The process of building a model includes several considerations for performance evaluation, particularly for numeric and categorical outcomes:
Numeric Outcome Assessment: Utilize Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) as metrics for evaluating predictive performance. A benchmark can be derived from the mean of the training data.
Categorical Outcome Assessment: Measure performance using the Error Rate. A benchmark can be derived from the mode of the training data.
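The benchmarks above can be computed directly. This sketch uses tiny invented vectors: the numeric benchmark predicts the training mean for every record, the categorical benchmark predicts the training mode.

```r
# Numeric outcome: benchmark = mean of the training data
train_y <- c(10, 20, 30, 40)
valid_y <- c(20, 30)
pred    <- rep(mean(train_y), length(valid_y))   # always 25

rmse <- sqrt(mean((valid_y - pred)^2))           # 5
mape <- mean(abs((valid_y - pred) / valid_y)) * 100  # about 20.8

# Categorical outcome: benchmark = mode of the training data
train_lab <- c("yes", "yes", "no", "yes")
valid_lab <- c("no", "yes", "yes")
mode_lab  <- names(which.max(table(train_lab)))  # "yes"
error_rate <- mean(valid_lab != mode_lab)        # 1/3
```

Any fitted tree should beat these naive benchmarks to justify its complexity.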
Tree Splitting: How We Know Where to Split
The process of splitting nodes in decision trees focuses on achieving maximum node purity — meaning each node ideally contains data points of a single outcome.
Node Purity
Node Purity: Measures the homogeneity of the outcome variable within the node. The aim is to maximize this purity.
For binary outcomes, impurity can be calculated using the Gini impurity formula:

Gini impurity = 1 - p^2 - (1 - p)^2 = 2p(1 - p)

where p is the proportion of TRUE observations in a node after the split. The formula indicates:
p = 0 implies no TRUE outcomes (impurity = 0).
p = 1 implies all outcomes are TRUE (impurity = 0).
p = 0.5 indicates maximum impurity (0.5), as there is an even split between TRUE and FALSE outcomes.
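The formula is one line of R; evaluating it at the three cases above confirms the behavior:

```r
# Gini impurity for a binary outcome; p is the proportion of TRUE values
gini <- function(p) 1 - p^2 - (1 - p)^2   # equivalently 2 * p * (1 - p)

gini(0)    # 0   : no TRUE outcomes, pure node
gini(1)    # 0   : all TRUE outcomes, pure node
gini(0.5)  # 0.5 : even split, maximum impurity
```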
Splitting Process in R
During the construction of the tree, R evaluates all possible splits for each variable and selects the one that yields the lowest weighted impurity across the resulting child nodes.
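This search can be sketched by hand for one numeric predictor: try every midpoint between sorted values as a candidate cut, score each by the size-weighted Gini impurity of the two child nodes, and keep the lowest. The toy data below is invented so that the best cut is obvious.

```r
# Size-weighted Gini impurity of a binary outcome after a split
weighted_gini <- function(y_left, y_right) {
  g <- function(y) { p <- mean(y); 2 * p * (1 - p) }
  n <- length(y_left) + length(y_right)
  (length(y_left) / n) * g(y_left) + (length(y_right) / n) * g(y_right)
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

# Candidate cuts: midpoints between consecutive sorted x values
xs   <- sort(x)
cuts <- (head(xs, -1) + tail(xs, -1)) / 2
imp  <- sapply(cuts, function(cc) weighted_gini(y[x <= cc], y[x > cc]))

best <- cuts[which.min(imp)]   # 6.5 separates the classes perfectly
best
min(imp)                       # 0: both child nodes are pure
```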
When to Stop Splitting
Determining when to stop splitting nodes is critical in building an effective decision tree model:
Splitting is typically only considered if a node is not pure.
The rpart() function implements default stopping criteria, but users can set their own:
minsplit: The minimum number of records a node must contain before a split is attempted; a common guideline is around 2% of total records.
minbucket: The minimum number of records allowed in any terminal node (bucket) produced by a split, typically around 1-2% of total records.
cp: The complexity parameter, which specifies the minimum improvement in predictive power required for each split to be kept. Setting cp = -1 disables this criterion, so every split is retained.
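These criteria are passed through rpart.control(). A minimal sketch on invented data; the parameter values here are illustrative, not recommendations:

```r
library(rpart)

set.seed(7)
df <- data.frame(x1 = runif(300), x2 = runif(300))
df$y <- factor(ifelse(df$x1 > 0.5, "A", "B"))

# Explicit stopping criteria (illustrative values)
ctrl <- rpart.control(
  minsplit  = 6,    # a node needs at least 6 records before a split is tried
  minbucket = 3,    # every terminal node must keep at least 3 records
  cp        = 0.01  # a split must improve fit by at least 1% to be kept
)

fit <- rpart(y ~ x1 + x2, data = df, control = ctrl)
```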
Pruning
In decision tree models, it can be challenging to identify excessive branches; a tree with too many branches overfits and performs poorly on unseen data:
Underfit Model: A model that fails to capture the complexity of the data.
Overfit Model: A model that is too complex, capturing noise along with the underlying structure.
Pruning: The process of cutting back unnecessary branches in a decision tree. This can be conducted in R after model building.
Cross Validation and Pruning Methods
Option A: Cross-validation is a technique utilized to objectively choose key model parameters, using portions of the training dataset to validate model performance.
Option B: Pruning can further refine the model effectiveness by eliminating branches that do not contribute significantly to predictive accuracy, thus addressing the risk of overfitting.
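The two options combine naturally in rpart: the function cross-validates subtrees as it grows, storing the results in cptable, and prune() cuts the tree back to the cp with the lowest cross-validated error. A sketch on invented data:

```r
library(rpart)

set.seed(99)
df <- data.frame(x1 = runif(400), x2 = runif(400))
df$y <- factor(ifelse(df$x1 + df$x2 + rnorm(400, sd = 0.3) > 1, "yes", "no"))

# Grow a deliberately oversized tree (cp = -1 retains every split)
big <- rpart(y ~ x1 + x2, data = df,
             control = rpart.control(cp = -1, minsplit = 2))

# cptable reports CP against cross-validated error (xerror) per subtree;
# choose the cp with the lowest xerror and prune back to it
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)
```

The pruned tree can then be evaluated on holdout data to confirm the overfitting has been addressed.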