03 Classification Trees

Classification Trees

  • Prediction method: majority vote — each region predicts the most common class among its training observations.

Growing a Classification Tree

Recursive Binary Splitting

  • Replace RSS (the splitting criterion for regression trees) with the Classification Error Rate for making splits:

    • Classification Error = the number of misclassified training observations in a region.

    • Classification Error Rate = the fraction of observations in a region that do not belong to the majority class, i.e. 1 - (proportion of the majority class).
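The error rate above can be computed directly from a node's labels; this is a minimal sketch (the function name `classification_error_rate` is ours, not from the source):

```python
from collections import Counter

def classification_error_rate(labels):
    """1 minus the fraction of the majority class: the error a
    majority-vote prediction makes on this node's observations."""
    counts = Counter(labels)
    majority = max(counts.values())
    return 1 - majority / len(labels)
```

For example, a node holding three apples and one pear predicts "apple" and misclassifies one of four observations, giving an error rate of 0.25.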

Measures for Assessing Node Quality

Node Purity Metrics

  • Important to evaluate purity to improve prediction confidence:

    • Classification Error Rate: Focuses on misclassification rate.

    • Gini Index: Measures how often a randomly chosen element from the node would be mislabeled if it were labeled randomly according to the node's class proportions.

    • Entropy: Measures disorder or impurity within the node.

Characteristics of Node Purity

  • Smaller values in Gini Index and Entropy indicate higher node purity (clearer predictions).

  • Best purity occurs when all objects in a node belong to the same category (100% purity).

  • Worst purity happens when there is an equal distribution of classes (e.g., 50% apples and 50% pears).
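The three metrics and the pure vs. 50/50 behavior described above can be sketched as follows, assuming two-or-more-class label lists (function names are ours):

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum of squared class proportions.
    0 for a pure node, 0.5 for a two-class 50/50 node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum p * log2(p) over the class proportions.
    0 for a pure node, 1 bit for a two-class 50/50 node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

A node of all apples gives `gini == 0` and `entropy == 0` (best purity); a node of 50% apples and 50% pears gives `gini == 0.5` and `entropy == 1.0` (worst purity for two classes).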

Growing and Pruning the Tree

Growing Steps

  • Use Gini Index or Entropy during the recursive splitting to encourage splits that improve prediction confidence.
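One way to sketch that growing step for a single numeric feature: scan candidate thresholds and keep the one minimizing the weighted Gini index of the two children. This is an illustrative sketch, not the book's code; `best_split` and the exhaustive threshold scan are our simplifications.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Try every observed value of a numeric feature as a threshold
    and return (threshold, score) minimizing the size-weighted Gini
    index of the resulting left/right children."""
    best = (None, float("inf"))
    n = len(ys)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # degenerate split, skip
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best
```

On data where the classes separate cleanly at a value, the scan recovers that value with a weighted Gini of 0 (both children pure).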

Pruning Steps

  • Use Classification Error Rate during pruning to eliminate branches that do not improve accuracy.
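A hedged sketch of that pruning test: collapse two sibling leaves into a single majority-vote leaf whenever doing so does not increase the number of training misclassifications (the cost-complexity penalty is omitted here for brevity; `should_prune` is our name):

```python
from collections import Counter

def should_prune(left_labels, right_labels):
    """True when merging the two children into one majority-vote
    leaf does not increase the misclassification count."""
    def errors(labels):
        # misclassifications under majority vote in this node
        return len(labels) - max(Counter(labels).values())
    merged = left_labels + right_labels
    return errors(merged) <= errors(left_labels) + errors(right_labels)
```

A split that improves purity (e.g. Gini) but leaves the error count unchanged is exactly the kind of branch this test removes.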

Handling Qualitative Input Variables

  • Categorical variables can also be split using binary criteria:

    • Instead of a numeric threshold, the split partitions the set of categories into two subsets.

    • Each observation is sent left or right depending on which subset its category belongs to.
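The subset-based split above can be sketched in a few lines; `split_on_subset` is an illustrative name, not an API from the source:

```python
def split_on_subset(values, subset):
    """Binary split of a categorical feature: indices whose category
    is in `subset` go left, all others go right."""
    left = [i for i, v in enumerate(values) if v in subset]
    right = [i for i, v in enumerate(values) if v not in subset]
    return left, right
```

For example, splitting a color feature on the subset {"red"} sends the red observations left and everything else right; growing the tree would then score each candidate subset with the Gini index or entropy, just as with numeric thresholds.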