
Classification using Decision Tree
ADM3308
Telfer School of Management
University of Ottawa



Business Data Mining Outline

  • Classification

  • Decision Tree

  • Overfitting, Pruning the Tree

  • Measuring Classification Performance

  • Decision Tree Algorithms


Classification

Definition

  • Classes (categories) are pre-defined based on labels or a combination of labels.

  • Classification is a supervised method, requiring the data to be labeled.

  • The labeled data is called the "training" set.


Examples of Classification

  • Identify individuals with credit risks

    • Classes: {Approved, Not Approved}

  • Classify responders to a marketing campaign

    • Classes: {Yes, No}

  • Classify financial transactions

    • Classes: {Fraud, Non Fraud}

  • Classify patients based on symptoms

    • Classes: {Healthy, Not Healthy}

  • Pattern recognition of letters A-Z

  • Speech recognition of letters A-Z

  • Many More…


Classification Example

Letter Recognition

  • View letters as constructed from five components.

    • [Figure: sample letters A, B, C, D, E, and F built from these components]


Steps in Classification

  1. Train the model using labeled data.

    • The model is built with a chosen technique such as a Decision Tree, Neural Network, Regression, etc.

  2. Employ the model to classify (label) new data.

    • The trained model uses the labeled (training) data to predict classes (groups) for new, unseen instances.

    • Classification is therefore a supervised learning method (a minimal sketch follows this list).
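
These two steps can be illustrated with a minimal Python sketch using scikit-learn's DecisionTreeClassifier; the tiny, already-encoded dataset below is made up for illustration:

```python
# Minimal sketch of the two classification steps (made-up, already-encoded data).
from sklearn.tree import DecisionTreeClassifier

# Step 1: train the model using labeled data (attributes X_train, labels y_train).
X_train = [[65, 1], [75, 1], [50, 1]]   # encoded input attributes (illustrative)
y_train = ["Yes", "No", "Yes"]          # class labels
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Step 2: employ the model to classify (label) new, unseen data.
X_new = [[60, 0]]
print(model.predict(X_new))             # predicted class for the new record
```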


Main Classification Techniques

Statistical Techniques

  • Logistic Regression

  • Naïve Bayesian

Other Techniques

  • Decision Tree

  • Rule-based

  • Distance-based (K-Nearest Neighbor)

  • Neural Networks


Example of a Decision Tree

Classes

  • Pre-defined as: {A, B, C, D, F}

Rules

  • If x >= 90 then grade = A.

  • If 80 <= x < 90 then grade = B.

  • If 70 <= x < 80 then grade = C.

  • If 60 <= x < 70 then grade = D.

  • If x < 60 then grade = F.

Structure

  • [Decision tree diagram: split on x >= 90 → A; otherwise x >= 80 → B; x >= 70 → C; x >= 60 → D; else F (see the rule sketch below)]
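
Equivalently, the tree's rules can be written as nested conditions; a minimal Python sketch (the function name is illustrative):

```python
# The grade tree expressed directly as rules (function name is illustrative).
def grade(x):
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # -> B
```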


Formal Definition of Decision Tree

Given:

  • D = {t1, …, tn}, where ti = <ti1, …, tih>

  • Database schema contains attributes {A1, A2, …, Ah}

  • Classes: C = {C1, …, Cm}

  • A Decision or Classification Tree is associated with D such that:

    • Each internal node is labeled with an attribute, A_i

    • Each arc is labeled with a predicate applicable to the attribute at the parent node

    • Each leaf node is labeled with a class, C_j


Example of a Decision Tree with Attributes

Attributes (Input Variables):

  • Outlook = {sunny, overcast, rain}

  • Humidity = {1,…, 100}

  • Windy = {True, False}

Target Variable (Label):

  • Play = {Yes, No}

Training Dataset Example

  Rec# | Outlook  | Humidity | Windy | Play
  -----|----------|----------|-------|-----
  1    | Sunny    | 65       | True  | Yes
  2    | Rain     | 75       | True  | No
  …    | …        | …        | …     | …
  14   | Overcast | 50       | True  | Yes
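
As a sketch of how such a training set could be prepared for a tree-building algorithm, the three records shown above can be one-hot encoded and fitted with scikit-learn (the remaining records are omitted here, so the fitted tree is purely illustrative):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Only the three records displayed above; the remaining rows are omitted here.
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Rain", "Overcast"],
    "Humidity": [65, 75, 50],
    "Windy":    [True, True, True],
    "Play":     ["Yes", "No", "Yes"],
})

# One-hot encode the categorical attribute; Humidity and Windy are already numeric/boolean.
X = pd.get_dummies(data[["Outlook", "Humidity", "Windy"]])
y = data["Play"]

tree = DecisionTreeClassifier().fit(X, y)
```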


Decision Tree Structure

Components

  • Decision Tree Root (Attribute)

  • Child (Attribute)

  • Leaf (Class)

  • Arc (Values of the Attribute)


Leaves Contain Scores

  • Each leaf of a decision tree contains information for scoring.

    • This particular leaf could indicate that 11,112 training instances landed here.

    • Of those records, 96.5% belonged to class NO.

    • If used for classification, new records landing here would be classified as NO.

    • If used for scoring estimation, the score would be 0.965, indicating a high likelihood of NO.

    • Such scores could potentially inform marketing campaigns.
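
The leaf score is simply the proportion of the leaf's training records belonging to the class of interest; a tiny sketch using the (approximate) counts quoted above:

```python
# Leaf score = share of the leaf's training records in the class of interest.
records_in_leaf = 11112
records_no = 10723                      # roughly 96.5% of the leaf (derived from the example)
score_no = records_no / records_in_leaf
print(round(score_no, 3))               # -> 0.965
```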


Using Tree for Various Applications

Objectives

  • To select variables

  • To identify which variables are most important (a variable-importance sketch follows this list)

  • To choose variables to include in other models (e.g., the root variable)

  • Note that different choices of the target variable yield different trees containing different variables

  • To produce scores and probability estimates

  • To create rankings based on likelihood (e.g., higher or lower churn, most likely to respond)
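
As a sketch of variable selection, a fitted scikit-learn tree reports relative variable importances; this assumes the `tree` and encoded attribute table `X` from the earlier weather sketch:

```python
# Rank input variables by importance in a fitted tree
# (assumes `tree` and the encoded attribute table `X` from the earlier weather sketch).
ranking = sorted(zip(X.columns, tree.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```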


Handling Missing Values

  • Decision trees can manage missing values by designating "Null" as an allowable value.

  • Retaining Null values may sometimes be preferred over removing records with missing values or substituting missing values.

  • The presence of Null itself can carry predictive value; for example, individuals who leave many traces in a database (home purchases, marriage records, etc.) may be more likely to be interested in life insurance than those with many Null fields.
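
A minimal sketch of designating Null as an allowable value before encoding, using pandas (the column and values are illustrative):

```python
import pandas as pd

# Treat missing Outlook values as their own category rather than dropping the records.
records = pd.DataFrame({"Outlook": ["Sunny", None, "Rain"]})   # illustrative column
records["Outlook"] = records["Outlook"].fillna("Null")
encoded = pd.get_dummies(records["Outlook"])                   # "Null" becomes its own column
```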


Advantages of Decision Trees

  • Transparent in how predictions or classifications are derived

  • Simple to visualize

  • Easy to formulate rules

  • A decision tree serves as a graphical representation of a set of rules, making them straightforward to interpret and implement


Continuing Advantages of Decision Trees

  • Automatic variable selection

  • Capable of functioning without extensive handling of missing data

  • No necessity for the assumptions typically required by statistical models


Overfitting

Definition
  • Statistical models can yield highly complex representations of relationships between variables.

  • While the model fit may be excellent, its performance on unused (unseen) data may be disappointing due to its complexity.


Overfitting (cont’d)

  • A polynomial of degree (N-1) can fit exactly through N data points.

    • This 100% fit, however, provides no practical utility for unseen data.
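
A short numeric illustration of this point with NumPy, fitting a degree (N-1) polynomial through N made-up points and then evaluating it outside the training range:

```python
import numpy as np

# N made-up points from a simple underlying line, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)                    # N = 8
y = 2 * x + rng.normal(scale=0.1, size=8)

coeffs = np.polyfit(x, y, deg=7)            # degree N-1: passes exactly through all 8 points
print(np.polyval(coeffs, 1.2))              # at an unseen x, often far from the true value 2.4
```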


Pruning the Tree

Issues with Full Trees

  • A natural conclusion of the decision tree building algorithm is full purity in each leaf.

  • However, full trees are intricate and susceptible to overfitting the data.

  • Pruning reduces model complexity, thereby enhancing model stability.


Pruning Tree to Avoid Overfitting

  • On training data, each split reduces the error rate.

  • As the number of leaves increases, each leaf contains fewer records, so the outcome distribution within a leaf is less likely to be consistent across datasets.

  • Beyond a certain level of tree complexity, the error rate on unseen data increases; pruning should occur at this junction where error rates are minimized on unseen data.
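
One practical way to prune at that point is cost-complexity pruning, choosing the complexity parameter that minimizes error on data not used for training; a sketch with scikit-learn's ccp_alpha (the alpha grid and the partitions X_train/X_test, y_train/y_test are assumed):

```python
from sklearn.tree import DecisionTreeClassifier

# Try trees of decreasing complexity and keep the one that performs best on the
# test (validation) partition rather than on the training data.
# Assumes X_train, y_train, X_test, y_test exist; the alpha grid is arbitrary.
best_alpha, best_score = 0.0, 0.0
for alpha in [0.0, 0.001, 0.01, 0.05, 0.1]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
```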


Stopping Criteria

Decision Tree Building

  • The tree building process is recursive.

  • The initial split generates child nodes.

  • The same splitting method is then applied to each child node.

  • The process halts when one of the following conditions is met (see the parameter sketch after this list):

    • The number of records in a node falls to a defined minimum.

    • The depth of the tree reaches a defined limit.

    • No split that significantly improves the purity of a node's children can be found.
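
These stopping conditions map naturally onto common pre-pruning parameters; a minimal scikit-learn sketch (the thresholds are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=20,          # stop when a leaf would hold too few records
    max_depth=5,                  # stop when the tree reaches a defined depth
    min_impurity_decrease=0.01,   # stop when no split improves purity enough
)
```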


Partitioning the Data

Model Development and Testing

  • When developing a model on training data, there's a risk of overfitting, underscoring the need for evaluation on test data.

  • Some approaches utilize test data to select parameters or choose between models, leading to potential overfitting of the test data.

  • Solution: Apply the final selected model to a holdout partition (unseen data) for an unbiased performance estimate on the new data.

  • Note: During model building, algorithms see the "Training" and "Testing" data but never the "unseen (Holdout)" data.
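
A sketch of a three-way partition using two successive calls to scikit-learn's train_test_split (the proportions roughly follow the table in the next section; X and y are assumed):

```python
from sklearn.model_selection import train_test_split

# First carve off a 10% holdout partition that the algorithms never see,
# then split the rest into training and testing (validation) partitions.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(X, y, test_size=0.10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
# 0.25 of the remaining 90% gives ~22.5% testing and ~67.5% training overall.
```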


Training, Testing, and Unseen (Holdout) Datasets

  Dataset Type          | Percentage Estimate
  ----------------------|--------------------
  Training              | 60% ~ 70%
  Testing (Validation)  | 20% ~ 30%
  Unseen (Holdout)      | 10%


Cross Validation

Overview

  • Repeated partitioning is known as Cross-Validation.

  • The method is called k-fold cross-validation, e.g., k = 5.

  • For each fold, one portion of the data is set aside as test (validation) data while the remainder (4/5 of the data when k = 5) is used for training.

  • The validation folds are non-overlapping.

  • The process then repeats with the next portion used for testing and the rest for training, until every fold has served as test data once.

  • The average performance over the k runs is then reported.
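
A minimal sketch of 5-fold cross-validation with scikit-learn (the model choice and the data X, y are assumed):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Each of the k = 5 non-overlapping folds serves once as validation data while
# the other 4/5 of the data trains the model; the scores are then averaged.
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X, y, cv=5)
print(scores.mean())
```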


Confusion Matrix

Actual vs. Classified (Predicted) Values

                        | Classified Approved  | Classified Not Approved
  Actual Approved       | True Positive (TP)   | False Negative (FN)
  Actual Not Approved   | False Positive (FP)  | True Negative (TN)


Performance Measures using Confusion Matrix

Metrics

  • Accuracy (correctness):

    Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

  • Sensitivity (True Positive Rate):

    Sensitivity = \frac{TP}{TP + FN}

  • Specificity (True Negative Rate):

    Specificity = \frac{TN}{FP + TN}

  • Higher values of sensitivity, specificity, and accuracy indicate a better-performing classifier.
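
  • As a worked illustration with made-up counts, suppose TP = 40, FN = 20, FP = 10, TN = 30:

    Accuracy = (40 + 30) / (40 + 20 + 10 + 30) = 0.70

    Sensitivity = 40 / (40 + 20) ≈ 0.67

    Specificity = 30 / (10 + 30) = 0.75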


ROC Curve

Overview

  • The Receiver Operating Characteristic (ROC) curve illustrates the True Positive rate against the False Positive rate.

  • The area under the curve indicates performance:

    • Close to 1: classifier performs excellently.

    • Close to 0.5: classifier performs poorly (similar to random guessing).

    • Much less than 0.5: classifier performs significantly worse than random guesses.
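
A sketch of computing the ROC curve and its area with scikit-learn, using the predicted probability of the positive class (the fitted `model` and the test partition X_test, y_test are assumed from earlier sketches):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Assumes a fitted classifier `model` and the test partition X_test, y_test.
scores = model.predict_proba(X_test)[:, 1]                       # probability of the positive class
y_true = (np.asarray(y_test) == model.classes_[1]).astype(int)   # encode the positive class as 1
fpr, tpr, thresholds = roc_curve(y_true, scores)                 # points of the ROC curve
print(roc_auc_score(y_true, scores))                             # area under the curve
```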


Additional Classification Performance Metrics

  • True Positive Rate (Recall or Sensitivity):

    TP Rate = \frac{TP}{TP + FN}

  • False Positive Rate:

    FP Rate = \frac{FP}{FP + TN} = 1 - Specificity

  • Precision:

    Precision = \frac{TP}{TP + FP}

  • Misclassification Error:

    Misclassification Error = \frac{FP + FN}{TP + TN + FP + FN}

  • F-Measure:

    F-Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}
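
These metrics can also be computed directly with scikit-learn's metric functions; a sketch assuming actual and predicted labels with "Yes" as the positive class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumes actual labels y_test and a fitted `model`, with "Yes" as the positive class.
y_pred = model.predict(X_test)
print(precision_score(y_test, y_pred, pos_label="Yes"))   # Precision
print(recall_score(y_test, y_pred, pos_label="Yes"))      # Recall / Sensitivity / TP rate
print(f1_score(y_test, y_pred, pos_label="Yes"))          # F-Measure
print(1 - accuracy_score(y_test, y_pred))                 # Misclassification error
```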


Decision Tree Algorithms

Characteristics

  • Decision Tree algorithms vary based on tree construction, purity measurement, and pruning methods.

  • A useful algorithm for diverse real-world applications must:

    • Accept numeric attributes

    • Handle missing values

    • Maintain robustness in noisy scenarios

  • Basic algorithms require modification to meet these prerequisites.


Decision Tree Algorithms (Specific Examples)

ID3 Algorithm

  • Handles categorical values and selects split attributes with the highest information gain (lowest weighted entropy after the split).

  • Extensions to ID3 allow numeric attributes and employ the information gain ratio.

  • Further extensions add mechanisms for handling missing values and robustness against noise.

  • The end result of these extensions is C4.5, one of the best-known and most widely used decision tree learning algorithms (a sketch of the information-gain computation follows).
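
A small Python sketch of the entropy and information-gain computation that ID3-style algorithms use to choose split attributes (the example labels are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Reduction in entropy from splitting the labels by an attribute's values."""
    total = len(labels)
    remainder = sum(
        (count / total) * entropy([l for l, v in zip(labels, attribute_values) if v == value])
        for value, count in Counter(attribute_values).items())
    return entropy(labels) - remainder

# Made-up example: Play labels split by Outlook.
play = ["Yes", "No", "Yes", "Yes", "No"]
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain"]
print(information_gain(play, outlook))   # ~0.17 bits
```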


Decision Tree Algorithms (Continued)

C4.5 Algorithm

  • Can manage missing values.

  • Accepts continuous values.

  • Utilizes information gain ratio for optimal split determination.

  • Recognized as the best-known and widely-utilized learning algorithm.

C5 Algorithm

  • Commercial successor to C4.5.

  • C4.5 is open source while C5 is utilized in commercial packages such as IBM SPSS Modeler (formerly SPSS Clementine).


Decision Tree Algorithms (Final Aspects)

CART (Classification and Regression Tree)

  • Generates binary trees.

  • Effective for numeric (continuous) variables.

  • Employs “Reduction in Variance” technique to identify optimal splitting attributes.

  • The prediction at a leaf is the average of the numeric target variable for the records in that leaf, whereas C5 predicts the majority class in the leaf.

  • The performance is measured by RMSE (root mean squared error).
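
A sketch of a CART-style regression tree and its RMSE evaluation with scikit-learn; it assumes training and test partitions whose target variable is numeric:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Assumes training/test partitions whose target variable is numeric (continuous).
reg = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
predictions = reg.predict(X_test)        # each prediction is the average target value in the leaf

rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(rmse)
```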


References

  • Chapter 7, “Data Mining Techniques for Marketing, Sales, and Customer Relationship Management”, Third Edition, by Gordon Linoff and Michael Berry, Publisher: John Wiley, 2011.

  • Chapter 9, “Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner”, by Galit Shmueli, Peter C. Bruce, Amit V. Deokar, and Nitin R. Patel, Publisher: John Wiley, 2023.

  • Chapter 6, “Data Mining: Practical Machine Learning Tools and Techniques”, 4th Edition, by Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal, Publisher: Morgan Kaufmann, 2017.