01 Decision Trees

Decision Trees

  • Decision trees are used for both regression and classification tasks.

  • They categorize data based on attribute values, creating a tree structure.

Structure of a Decision Tree

Definition and Terminology

  • A decision tree splits data based on attribute values, creating bins or regions that classify output:

    • Root Node: The starting point of the decision tree; it splits on the most informative attribute.

    • Internal Nodes: Nodes that split the data based on an attribute threshold.

    • Leaf Nodes: Endpoint nodes that give predictions (e.g., predicted log-salary).

Example of a Regression Tree
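As a concrete sketch, the snippet below fits a shallow regression tree on synthetic stand-in data (the feature names and coefficients are illustrative assumptions, not the original example's dataset) and prints its splits:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Illustrative synthetic data: predict log-salary from years of
# experience and hits (values are made up for the sketch).
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 20, 200),    # years
                     rng.integers(0, 200, 200)])  # hits
y = 4.0 + 0.1 * X[:, 0] + 0.005 * X[:, 1] + rng.normal(0, 0.2, 200)

# Depth-2 tree: one root split plus two internal splits,
# giving four leaf nodes (predicted log-salaries).
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["years", "hits"]))
```

Each leaf in the printed tree reports the mean response of the observations that fall into that region.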

Advantages of Decision Trees

  • Interpretability: Decision trees are easy to understand and explain, making them suitable for communication with non-experts.

  • Automatic Variable Importance: Attributes closer to the root are more significant for predictions.

Partitioning Input Space

  • Decision trees partition the input space into disjoint regions based on thresholds.

  • Residual Sum of Squares (RSS): The goal when creating splits is to minimize the overall RSS.
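With regions $R_1, \dots, R_J$ and $\hat{y}_{R_j}$ denoting the mean response of the training observations in region $R_j$, the objective can be written as:

```latex
\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i:\, x_i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2
```

Splits are chosen so that the resulting regions make this total squared deviation as small as possible.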

Recursive Binary Splitting

Definition

  • Decision trees are built using recursive binary splits:

    • Top-Down Approach: Starts from the root node, one split at a time.

    • Greedy Method: Chooses splits that minimize RSS at each step without considering future splits.

Splitting Process

  • Steps of recursive binary splitting:

    1. Select the best predictor and splitting value.

    2. Evaluate RSS for the resulting regions.

    3. Repeat for each sub-region until a stop criterion is met (e.g., a minimum number of observations in a node).
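The steps above can be sketched as a single greedy step: scan every predictor and candidate threshold, and keep the split with the lowest total RSS (a minimal sketch; a full tree would apply this recursively to each sub-region):

```python
import numpy as np

def best_split(X, y):
    """One step of recursive binary splitting: greedily pick the
    (feature, threshold) pair that minimizes the combined RSS of
    the two resulting regions."""
    best_feature, best_threshold, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        # Candidate thresholds: observed values (excluding the max,
        # so the right-hand region is never empty).
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            rss = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_feature, best_threshold, best_rss = j, t, rss
    return best_feature, best_threshold, best_rss
```

Because the search only looks one split ahead, it is greedy: a split that looks best now may preclude a better pair of splits later.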

Limitations of Decision Trees

  • Decision trees can become overly simplistic and perform poorly in terms of prediction accuracy.

  • They may overlook combinations of variables due to the greedy approach.

  • The prediction surface is piecewise constant (a step function over the regions), which approximates smooth relationships roughly.

Visualizing Decision Trees

  • Decision trees can be visualized to show splits and regions for better understanding.

  • An example regression tree maps each split to a corresponding region of the input space and that region's average response value.
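A toy one-dimensional version of this mapping, with made-up numbers chosen purely for illustration: a single split partitions the input into two regions, and each region predicts the average response of the observations it contains.

```python
import numpy as np

# Illustrative 1-D data: log-salary by years of experience.
years = np.array([1, 2, 3, 4, 5, 6, 7, 8])
log_salary = np.array([5.0, 5.1, 5.2, 5.1, 6.0, 6.2, 6.1, 6.3])

# One split at an assumed threshold of 4.5 produces two regions;
# each leaf predicts the mean response within its region.
threshold = 4.5
r1 = log_salary[years <= threshold]
r2 = log_salary[years > threshold]
print(f"R1 (years <= {threshold}): mean = {r1.mean():.2f}")
print(f"R2 (years >  {threshold}): mean = {r2.mean():.2f}")
```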