
Classification using Decision Tree
ADM3308
Telfer School of Management
University of Ottawa



Business Data Mining Outline

  • Classification

  • Decision Tree

  • Overfitting, Pruning the Tree

  • Measuring Classification Performance

  • Decision Tree Algorithms


Classification

Definition

  • Classes (categories) are pre-defined based on labels or a combination of labels.

  • Classification is a supervised method, requiring the data to be labeled.

  • The labeled data is called the "training" set.


Examples of Classification

  • Identify individuals with credit risks

    • Classes: {Approved, Not Approved}

  • Classify responders to a marketing campaign

    • Classes: {Yes, No}

  • Classify financial transactions

    • Classes: {Fraud, Non Fraud}

  • Classify patients based on symptoms

    • Classes: {Healthy, Not Healthy}

  • Pattern recognition of letters A-Z

  • Speech recognition of letters A-Z

  • Many More…


Classification Example

Letter Recognition

  • View letters as constructed from five components.

    • [Figure: sample letters A, B, C, D, E, and F built from these components]


Steps in Classification

  1. Train the model using labeled data.

    • The model is built with a chosen technique such as a Decision Tree, Neural Network, Regression, etc.

  2. Employ the model to classify (label) new data.

    • The trained model uses the labeled (training) data to predict classes (groups) for new, unseen instances.

    • Classification is therefore a supervised learning method (a minimal sketch follows this list).
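
These two steps can be illustrated with a minimal Python sketch using scikit-learn's DecisionTreeClassifier; the tiny, already-encoded dataset below is made up for illustration:

```python
# Minimal sketch of the two classification steps (made-up, already-encoded data).
from sklearn.tree import DecisionTreeClassifier

# Step 1: train the model using labeled data (attributes X_train, labels y_train).
X_train = [[65, 1], [75, 1], [50, 1]]   # encoded input attributes (illustrative)
y_train = ["Yes", "No", "Yes"]          # class labels
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Step 2: employ the model to classify (label) new, unseen data.
X_new = [[60, 0]]
print(model.predict(X_new))             # predicted class for the new record
```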


Main Classification Techniques

Statistical Techniques

  • Logistic Regression

  • Naïve Bayesian

Other Techniques

  • Decision Tree

  • Rule-based

  • Distance-based (K-Nearest Neighbor)

  • Neural Networks


Example of a Decision Tree

Classes

  • Pre-defined as: {A, B, C, D, F}

Rules

  • If x >= 90 then grade = A.

  • If 80 <= x < 90 then grade = B.

  • If 70 <= x < 80 then grade = C.

  • If 60 <= x < 70 then grade = D.

  • If x < 60 then grade = F.

Structure

  • [Decision tree diagram: split on x >= 90 → A; otherwise x >= 80 → B; x >= 70 → C; x >= 60 → D; else F (see the rule sketch below)]
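
Equivalently, the tree's rules can be written as nested conditions; a minimal Python sketch (the function name is illustrative):

```python
# The grade tree expressed directly as rules (function name is illustrative).
def grade(x):
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # -> B
```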


Formal Definition of Decision Tree

Given:

  • D = {t1, …, tn}, where ti = <ti1, …, tih>

  • Database schema contains attributes {A1, A2, …, Ah}

  • Classes: C = {C1, …, Cm}

  • A Decision or Classification Tree is associated with D such that:

    • Each internal node is labeled with an attribute, A_i

    • Each arc is labeled with a predicate applicable to the attribute at the parent node

    • Each leaf node is labeled with a class, C_j


Example of a Decision Tree with Attributes

Attributes (Input Variables):

  • Outlook = {sunny, overcast, rain}

  • Humidity = {1,…, 100}

  • Windy = {True, False}

Target Variable (Label):

  • Play = {Yes, No}

Training Dataset Example

  Rec# | Outlook  | Humidity | Windy | Play
  -----|----------|----------|-------|-----
  1    | Sunny    | 65       | True  | Yes
  2    | Rain     | 75       | True  | No
  …    | …        | …        | …     | …
  14   | Overcast | 50       | True  | Yes
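
As a sketch of how such a training set could be prepared for a tree-building algorithm, the three records shown above can be one-hot encoded and fitted with scikit-learn (the remaining records are omitted here, so the fitted tree is purely illustrative):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Only the three records displayed above; the remaining rows are omitted here.
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Rain", "Overcast"],
    "Humidity": [65, 75, 50],
    "Windy":    [True, True, True],
    "Play":     ["Yes", "No", "Yes"],
})

# One-hot encode the categorical attribute; Humidity and Windy are already numeric/boolean.
X = pd.get_dummies(data[["Outlook", "Humidity", "Windy"]])
y = data["Play"]

tree = DecisionTreeClassifier().fit(X, y)
```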


Decision Tree Structure

Components

  • Decision Tree Root (Attribute)

  • Child (Attribute)

  • Leaf (Class)

  • Arc (Values of the Attribute)


Leaves Contain Scores

  • Each leaf of a decision tree contains information for scoring.

    • This particular leaf could indicate that 11,112 training instances landed here.

    • Of those records, 96.5% belonged to class NO.

    • If used for classification, new records landing here would be classified as NO.

    • If used for scoring estimation, the score would be 0.965, indicating a high likelihood of NO.

    • Such scores could potentially inform marketing campaigns.
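
The leaf score is simply the proportion of the leaf's training records belonging to the class of interest; a tiny sketch using the (approximate) counts quoted above:

```python
# Leaf score = share of the leaf's training records in the class of interest.
records_in_leaf = 11112
records_no = 10723                      # roughly 96.5% of the leaf (derived from the example)
score_no = records_no / records_in_leaf
print(round(score_no, 3))               # -> 0.965
```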


Using Tree for Various Applications

Objectives

  • To select variables

  • To identify which variables are most important (a variable-importance sketch follows this list)

  • To choose variables to include in other models (e.g., the root variable)

  • Note that different choices of the target variable yield different trees containing different variables

  • To produce scores and probability estimates

  • To create rankings based on likelihood (e.g., higher or lower churn, most likely to respond)
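
As a sketch of variable selection, a fitted scikit-learn tree reports relative variable importances; this assumes the `tree` and encoded attribute table `X` from the earlier weather sketch:

```python
# Rank input variables by importance in a fitted tree
# (assumes `tree` and the encoded attribute table `X` from the earlier weather sketch).
ranking = sorted(zip(X.columns, tree.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```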


Handling Missing Values

  • Decision trees can manage missing values by designating "Null" as an allowable value.

  • Retaining Null values may sometimes be preferred over removing records with missing values or substituting missing values.

  • The presence of Null itself can carry predictive value; for example, individuals who leave many traces in a database (home purchases, marriage records, etc.) may be more likely to be interested in life insurance than those with many Null fields.
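
A minimal sketch of designating Null as an allowable value before encoding, using pandas (the column and values are illustrative):

```python
import pandas as pd

# Treat missing Outlook values as their own category rather than dropping the records.
records = pd.DataFrame({"Outlook": ["Sunny", None, "Rain"]})   # illustrative column
records["Outlook"] = records["Outlook"].fillna("Null")
encoded = pd.get_dummies(records["Outlook"])                   # "Null" becomes its own column
```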


Advantages of Decision Trees

  • Transparent in how predictions or classifications are derived

  • Simple to visualize

  • Easy to formulate rules

  • A decision tree serves as a graphical representation of a set of rules, making them straightforward to interpret and implement


Continuing Advantages of Decision Trees

  • Automatic variable selection

  • Capable of functioning without extensive handling of missing data

  • No necessity for the assumptions typically required by statistical models


Overfitting

Definition
  • Statistical models can yield highly complex representations of relationships between variables.

  • While the model fit may be excellent, its performance on unused (unseen) data may be disappointing due to its complexity.


Overfitting (cont’d)

  • A polynomial of degree (N-1) can fit exactly through N data points.

    • This 100% fit, however, provides no practical utility for unseen data.
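
A short numeric illustration of this point with NumPy, fitting a degree (N-1) polynomial through N made-up points and then evaluating it outside the training range:

```python
import numpy as np

# N made-up points from a simple underlying line, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)                    # N = 8
y = 2 * x + rng.normal(scale=0.1, size=8)

coeffs = np.polyfit(x, y, deg=7)            # degree N-1: passes exactly through all 8 points
print(np.polyval(coeffs, 1.2))              # at an unseen x, often far from the true value 2.4
```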


Pruning the Tree

Issues with Full Trees

  • A natural conclusion of the decision tree building algorithm is full purity in each leaf.

  • However, full trees are intricate and susceptible to overfitting the data.

  • Pruning reduces model complexity, thereby enhancing model stability.


Pruning Tree to Avoid Overfitting

  • On training data, each split reduces the error rate.

  • As the number of leaves increases, each leaf contains fewer records, so the outcome distribution within a leaf is less likely to be consistent across datasets.

  • Beyond a certain level of tree complexity, the error rate on unseen data increases; pruning should occur at this junction where error rates are minimized on unseen data.
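
One practical way to prune at that point is cost-complexity pruning, choosing the complexity parameter that minimizes error on data not used for training; a sketch with scikit-learn's ccp_alpha (the alpha grid and the partitions X_train/X_test, y_train/y_test are assumed):

```python
from sklearn.tree import DecisionTreeClassifier

# Try trees of decreasing complexity and keep the one that performs best on the
# test (validation) partition rather than on the training data.
# Assumes X_train, y_train, X_test, y_test exist; the alpha grid is arbitrary.
best_alpha, best_score = 0.0, 0.0
for alpha in [0.0, 0.001, 0.01, 0.05, 0.1]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
```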


Stopping Criteria

Decision Tree Building

  • The tree building process is recursive.

  • The initial split generates child nodes.

  • The same splitting method is then applied to each child node.

  • The process halts when one of the following conditions is met (see the parameter sketch after this list):

    • The number of records in a node falls to a defined minimum.

    • The depth of the tree reaches a defined limit.

    • No split that significantly improves the purity of a node's children can be found.
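
These stopping conditions map naturally onto common pre-pruning parameters; a minimal scikit-learn sketch (the thresholds are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=20,          # stop when a leaf would hold too few records
    max_depth=5,                  # stop when the tree reaches a defined depth
    min_impurity_decrease=0.01,   # stop when no split improves purity enough
)
```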


Partitioning the Data

Model Development and Testing

  • When developing a model on training data, there's a risk of overfitting, underscoring the need for evaluation on test data.

  • Some approaches utilize test data to select parameters or choose between models, leading to potential overfitting of the test data.

  • Solution: Apply the final selected model to a holdout partition (unseen data) for an unbiased performance estimate on the new data.

  • Note: During model building, algorithms see the "Training" and "Testing" data but never the "unseen (Holdout)" data.
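
A sketch of a three-way partition using two successive calls to scikit-learn's train_test_split (the proportions roughly follow the table in the next section; X and y are assumed):

```python
from sklearn.model_selection import train_test_split

# First carve off a 10% holdout partition that the algorithms never see,
# then split the rest into training and testing (validation) partitions.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(X, y, test_size=0.10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
# 0.25 of the remaining 90% gives ~22.5% testing and ~67.5% training overall.
```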


Training, Testing, and Unseen (Holdout) Datasets

  Dataset Type          | Percentage Estimate
  ----------------------|--------------------
  Training              | 60% ~ 70%
  Testing (Validation)  | 20% ~ 30%
  Unseen (Holdout)      | 10%


Cross Validation

Overview

  • Repeated partitioning is known as Cross-Validation.

  • The method is called k-fold cross-validation, e.g., k = 5.

  • For each fold, one portion of the data is set aside as test (validation) data while the remainder (4/5 of the data when k = 5) is used for training.

  • The validation folds are non-overlapping.

  • The process then repeats with the next portion used for testing and the rest for training, until every fold has served as test data once.

  • The average performance over the k runs is then reported.
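
A minimal sketch of 5-fold cross-validation with scikit-learn (the model choice and the data X, y are assumed):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Each of the k = 5 non-overlapping folds serves once as validation data while
# the other 4/5 of the data trains the model; the scores are then averaged.
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X, y, cv=5)
print(scores.mean())
```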


Confusion Matrix

Actual vs. Classified (Predicted) Values

                        | Classified Approved  | Classified Not Approved
  Actual Approved       | True Positive (TP)   | False Negative (FN)
  Actual Not Approved   | False Positive (FP)  | True Negative (TN)


Performance Measures using Confusion Matrix

Metrics

  • Accuracy (correctness):

    Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

  • Sensitivity (True Positive Rate):

    Sensitivity = \frac{TP}{TP + FN}

  • Specificity (True Negative Rate):

    Specificity = \frac{TN}{FP + TN}

  • Higher values of sensitivity, specificity, and accuracy indicate a better-performing classifier.
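
  • As a worked illustration with made-up counts, suppose TP = 40, FN = 20, FP = 10, TN = 30:

    Accuracy = (40 + 30) / (40 + 20 + 10 + 30) = 0.70

    Sensitivity = 40 / (40 + 20) ≈ 0.67

    Specificity = 30 / (10 + 30) = 0.75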


ROC Curve

Overview

  • The Receiver Operating Characteristic (ROC) curve illustrates the True Positive rate against the False Positive rate.

  • The area under the curve indicates performance:

    • Close to 1: classifier performs excellently.

    • Close to 0.5: classifier performs poorly (similar to random guessing).

    • Much less than 0.5: classifier performs significantly worse than random guesses.
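
A sketch of computing the ROC curve and its area with scikit-learn, using the predicted probability of the positive class (the fitted `model` and the test partition X_test, y_test are assumed from earlier sketches):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Assumes a fitted classifier `model` and the test partition X_test, y_test.
scores = model.predict_proba(X_test)[:, 1]                       # probability of the positive class
y_true = (np.asarray(y_test) == model.classes_[1]).astype(int)   # encode the positive class as 1
fpr, tpr, thresholds = roc_curve(y_true, scores)                 # points of the ROC curve
print(roc_auc_score(y_true, scores))                             # area under the curve
```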


Additional Classification Performance Metrics

  • True Positive Rate (Recall or Sensitivity):

    TP Rate = \frac{TP}{TP + FN}

  • False Positive Rate:

    FP Rate = \frac{FP}{FP + TN} = 1 - Specificity

  • Precision:

    Precision = \frac{TP}{TP + FP}

  • Misclassification Error:

    Misclassification Error = \frac{FP + FN}{TP + TN + FP + FN}

  • F-Measure:

    F-Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}
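
These metrics can also be computed directly with scikit-learn's metric functions; a sketch assuming actual and predicted labels with "Yes" as the positive class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumes actual labels y_test and a fitted `model`, with "Yes" as the positive class.
y_pred = model.predict(X_test)
print(precision_score(y_test, y_pred, pos_label="Yes"))   # Precision
print(recall_score(y_test, y_pred, pos_label="Yes"))      # Recall / Sensitivity / TP rate
print(f1_score(y_test, y_pred, pos_label="Yes"))          # F-Measure
print(1 - accuracy_score(y_test, y_pred))                 # Misclassification error
```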


Decision Tree Algorithms

Characteristics

  • Decision Tree algorithms vary based on tree construction, purity measurement, and pruning methods.

  • A useful algorithm for diverse real-world applications must:

    • Accept numeric attributes

    • Handle missing values

    • Maintain robustness in noisy scenarios

  • Basic algorithms require modification to meet these prerequisites.


Decision Tree Algorithms (Specific Examples)

ID3 Algorithm

  • Handles categorical values and selects split attributes with the highest information gain (lowest weighted entropy after the split).

  • Extensions to ID3 allow numeric attributes and employ the information gain ratio.

  • Further extensions add mechanisms for handling missing values and robustness against noise.

  • The end result of these extensions is C4.5, one of the best-known and most widely used decision tree learning algorithms (a sketch of the information-gain computation follows).
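
A small Python sketch of the entropy and information-gain computation that ID3-style algorithms use to choose split attributes (the example labels are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Reduction in entropy from splitting the labels by an attribute's values."""
    total = len(labels)
    remainder = sum(
        (count / total) * entropy([l for l, v in zip(labels, attribute_values) if v == value])
        for value, count in Counter(attribute_values).items())
    return entropy(labels) - remainder

# Made-up example: Play labels split by Outlook.
play = ["Yes", "No", "Yes", "Yes", "No"]
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain"]
print(information_gain(play, outlook))   # ~0.17 bits
```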


Decision Tree Algorithms (Continued)

C4.5 Algorithm

  • Can manage missing values.

  • Accepts continuous values.

  • Utilizes information gain ratio for optimal split determination.

  • Recognized as the best-known and widely-utilized learning algorithm.

C5 Algorithm

  • Commercial successor to C4.5.

  • C4.5 is open source while C5 is utilized in commercial packages such as IBM SPSS Modeler (formerly SPSS Clementine).


Decision Tree Algorithms (Final Aspects)

CART (Classification and Regression Tree)

  • Generates binary trees.

  • Effective for numeric (continuous) variables.

  • Employs “Reduction in Variance” technique to identify optimal splitting attributes.

  • The prediction at a leaf is the average of the numeric target variable for the records in that leaf, whereas C5 predicts the majority class in the leaf.

  • The performance is measured by RMSE (root mean squared error).
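
A sketch of a CART-style regression tree and its RMSE evaluation with scikit-learn; it assumes training and test partitions whose target variable is numeric:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Assumes training/test partitions whose target variable is numeric (continuous).
reg = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
predictions = reg.predict(X_test)        # each prediction is the average target value in the leaf

rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(rmse)
```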


References

  • Chapter 7, “Data Mining Techniques for Marketing, Sales, and Customer Relationship Management”, Third Edition, by Gordon Linoff and Michael Berry, Publisher: John Wiley, 2011.

  • Chapter 9, “Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner”, by Galit Shmueli, Peter C. Bruce, Amit V. Deokar, and Nitin R. Patel, Publisher: John Wiley, 2023.

  • Chapter 6, “Data Mining: Practical Machine Learning Tools and Techniques”, 4th Edition, by Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal, Publisher: Morgan Kaufmann, 2017.