Decision Tree

Machine Learning Methods

Decision Tree Learning

Overview of Decision Trees
  • Definition: Decision trees are tree-based classifiers for instances represented as feature vectors.

  • Structure:

    • Nodes test features. Each node has one branch for each value of the feature, and leaves specify the classification category.

  • Functional Capability:

    • Can represent arbitrary conjunctions and disjunctions, covering any classification function over discrete feature vectors.

    • Can be transformed into a set of rules, exemplified in disjunctive normal form (DNF).

    • Example Rules:

      • If an object is red and circular, then classify it as positive (i.e., pos).

      • If red and circular then classify as A, if blue then classify as B.

      • If red and square then classify as B, if green then classify as C.

      • If red and triangular then classify as C.
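The first rule above can be sketched as nested feature tests; the `classify` function below is hypothetical, using the colors and shapes from the example. Each root-to-leaf path is a conjunction of tests, and the paths reaching the same label together form a disjunct of the DNF rule set:

```python
def classify(color, shape):
    """A tiny decision tree written as nested tests: the root tests 'color',
    a second-level node tests 'shape', and leaves return the class label."""
    if color == "red":
        if shape == "circle":
            return "pos"   # red AND circle -> pos
        return "neg"       # red but not circle
    return "neg"           # not red
```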

Properties of Decision Tree Learning
  • Handling of Continuous Features: Decision trees can manage continuous (real-valued) features by allowing splits based on a threshold.

    • Example: Splitting a feature, "length", into ranges such as length < 3 and length ≥ 3.

  • Types of Trees:

    • Classification trees: yield discrete class labels at the leaves.

    • Regression trees: provide real-valued outputs at the leaves.

  • Efficiency: Algorithms for finding a decision tree consistent with the training data run efficiently even on large datasets, making trees well suited to data mining tasks.

  • Adaptability:

    • Developed methods for addressing noisy training data (including both class and feature noise).

    • Techniques for handling missing feature values have also been implemented.
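The threshold-based splits mentioned above can be sketched as follows, assuming candidate thresholds at midpoints between adjacent sorted values; the data in the test is hypothetical, and entropy and information gain are defined later in these notes:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(pairs):
    """pairs: (value, label) tuples for one continuous feature.

    Returns the (threshold, gain) pair maximizing information gain for the
    binary split value < t versus value >= t. Candidate thresholds are the
    midpoints between adjacent distinct sorted values.
    """
    pairs = sorted(pairs)
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for v, label in pairs if v < t]
        right = [label for v, label in pairs if v >= t]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

For a feature like "length", this produces exactly the kind of `length < 3` / `length ≥ 3` split described above.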

Example of a Decision Tree

  • Context: A decision tree designed to determine if a day is suitable for playing tennis based on weather features (humidity, wind, outlook).

  • Classification Process: An example is classified by traversing the tree based on attribute values until reaching a leaf node, representing the classification result (either Yes or No).

Challenges in Decision Tree Learning

Structure of Instances
  • Instance Representation: Instances are portrayed by attribute-value pairs with fixed attributes (e.g., Temperature) having discrete values (e.g., Hot).

    • More complex scenarios arise for attributes with many possible values or with real-valued attributes.

  • Target Function: The focus is on discrete output values. The decision tree assigns a boolean classification (e.g., Yes or No) to each instance.

    • Extensions exist for multi-output scenarios and real-valued outcomes, though less common.

Issues Encountered
  • Training Data Quality: Training data may include errors, with decision tree methods maintaining robustness against classification and attribute value errors.

  • Handling Missing Values: Decision tree methods can function effectively even when some training examples have missing attribute values.

Decision Tree Induction

Algorithmic Foundation
Pseudocode for Tree Construction
DTree(examples, features) returns a tree
1. If all examples are in one category:
   - Return a leaf node with that category label.
2. Else if the set of features is empty:
   - Return a leaf node with the most common category label among examples.
3. Else:
   - Pick a feature F and create a node R.
   - For each value vi of F:
     - Let examplesi be the subset of examples with value vi for F.
     - Add an edge E to node R labeled with vi.
     - If examplesi is empty:
       - Attach a leaf node to E labeled with the most common class from examples.
     - Else:
       - Call DTree(examplesi, features - {F}) and attach the resulting tree under E.
4. Return the subtree rooted at R.
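The pseudocode above can be sketched in Python, assuming examples are (feature-dict, label) pairs and a tree is either a leaf label or a (feature, branches) tuple. For simplicity the sketch picks an arbitrary feature in step 3, where ID3 would use information gain:

```python
from collections import Counter

def dtree(examples, features):
    """Recursively build a decision tree.

    examples: list of (feature_dict, label) pairs.
    features: dict mapping feature name -> list of possible values.
    Returns a leaf label, or a (feature, {value: subtree}) tuple.
    """
    labels = [label for _, label in examples]
    # 1. All examples in one category: return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # 2. No features left to split on: return the majority label.
    if not features:
        return majority
    # 3. Pick a feature (arbitrarily here; ID3 picks by information gain).
    feature = next(iter(features))
    remaining = {f: v for f, v in features.items() if f != feature}
    branches = {}
    for value in features[feature]:
        subset = [(x, y) for x, y in examples if x[feature] == value]
        if not subset:
            # Empty subset: leaf labeled with the parent's majority class.
            branches[value] = majority
        else:
            branches[value] = dtree(subset, remaining)
    # 4. Return the subtree rooted at this node.
    return (feature, branches)
```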
Selecting the Optimal Feature for Splitting
  • Objective: Minimize the size of the resulting tree consistent with Occam’s razor.

  • Challenge: Finding the smallest decision tree is an NP-hard optimization problem.

    • A greedy search approach is common in top-down methods.

    • General principle in Machine Learning: "Greed is good."

  • Optimal Feature Selection: Features are selected to create subsets that are relatively pure and closer to being leaf nodes.

  • Heuristic Methods: Commonly uses information gain metrics developed in Quinlan's ID3 system (1979).

Measures of Feature Effectiveness

Information Gain
  • Definition: Information gain measures how well an attribute separates training examples according to target classifications.

  • Entropy: Represents the purity of a collection of examples.

Entropy Calculation
  • Entropy Formula for Binary Classification: $Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$

    • Where:

    • $p_1$: fraction of positive examples in set S.

    • $p_0$: fraction of negative examples in set S.

  • Entropy Behavior:

    • Pure Set (all examples in one category): Entropy is 0.

    • Mixed Set ($p_1 = p_0 = 0.5$): Maximum entropy of 1.

  • Generalization for Multi-class Problems: $Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$

    • where $c$ represents the number of categories.
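A direct translation of the entropy formula, covering both the binary and multi-class cases:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits.

    The same formula covers binary and multi-class sets; a category with
    probability 0 simply contributes nothing (lim p*log2(p) = 0 as p -> 0).
    """
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())
```

A pure set gives 0, an evenly mixed binary set gives 1, and a set split evenly over $c$ classes gives $\log_2(c)$.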

Example Calculation of Information Gain
  • Example of Classes: A small data set of examples, each an attribute combination labeled + or −.

  • Calculation Example:

    • Total positive: 4, negative: 2.

    • Compute entropy for different attributes and calculate information gain for the splits.

  • Information Gain Outcomes:

    • The resulting gain values quantify how effective splitting on attributes such as size, color, or shape is, identifying the split that most improves class purity.
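The gain calculation can be sketched for a hypothetical 4-positive / 2-negative data set; the attribute values below are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """Gain(S, F) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v)."""
    labels = [y for _, y in examples]
    n = len(examples)
    gain = entropy(labels)
    for value in set(x[feature] for x, _ in examples):
        subset = [y for x, y in examples if x[feature] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical data: 4 positive and 2 negative examples, as in the notes.
data = [
    ({"size": "big",   "color": "red"},  "+"),
    ({"size": "big",   "color": "red"},  "+"),
    ({"size": "small", "color": "red"},  "+"),
    ({"size": "small", "color": "red"},  "+"),
    ({"size": "big",   "color": "blue"}, "-"),
    ({"size": "small", "color": "blue"}, "-"),
]
```

Here splitting on color produces two pure subsets (gain equal to the full entropy, about 0.918 bits), while splitting on size leaves both subsets as mixed as the original (gain 0).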

Searching the Hypothesis Space

  • Methodology: Batch learning approach, processing all training instances at once instead of incremental updates.

  • Techniques used: Greedy hill-climbing search yields locally optimal solutions; it is guaranteed to find a tree consistent with any conflict-free training set (one with no identical feature vectors labeled differently).

Understanding Bias in Decision Tree Induction

  • Bias Nature: Information gain imposes a preference bias toward shorter trees that place high-gain attributes near the root.

  • Search Strategy: This is a search (preference) bias rather than a restriction of the hypothesis space: the search prefers minimal tree complexity.

Historical Context of Decision Tree Research

Timeline of Developments
  • 1950s-60s: Hunt and colleagues use exhaustive-search decision-tree methods to model human concept learning.

  • 1970s: Development of ID3 by Quinlan introduces the information gain heuristic; Breiman et al. develop CART.

  • 1980s: Additional improvements are introduced for handling noise, continuous features, and missing values, along with improved splitting criteria.

  • 1993: Release of Quinlan's updated decision-tree package (C4.5).

  • Present: Weka, a popular machine learning software suite provides a Java version of C4.5 known as J48.

Computational Complexity of Decision Trees

Complexity Analysis
  • Worst Case Scenario: Complete tree building where every path tests every feature.

    • For $n$ examples and $m$ features, at each of up to $m$ levels all $n$ examples are examined for each remaining feature, giving
      $O(n \cdot m^2)$

  • Practical Complexity: Typically close to linear in practice, since learned trees are rarely complete.

Overfitting and Generalization

Overfitting Issues
  • Concept: Overfitting occurs when the model perfectly classifies training data but generalizes poorly to unseen instances.

    • Caused by noise and small datasets that lead to errant fitting.

  • Concept Definition: A hypothesis $h$ overfits the training data if there exists another hypothesis $h'$ such that $h$ has lower error than $h'$ on the training data but higher error on unseen test data.

Overfitting Examples
  • Hypothetical Scenario: Testing Ohm's Law, fitting polynomial curves can lead to overfitting.

    • Overfitting Demonstration: High-degree polynomial fits may classify training points well but fail to generalize, where linear functions yield more reliable and general models.

Noise and Its Impact
  • Effect of Noise: Noise in class labels or feature values can mislead learning, producing spurious splits and misleading classifications.

  • Example of Noise Impact: Introducing conflicting instances can lead to unreliable classifications.

Overfitting Prevention Techniques

Pruning Methods
  • Core Approaches:

    • Pre-pruning: Halting tree growth when insufficient data for reliable decision-making.

    • Post-pruning: Developing the full tree and then removing ineffective branches.

  • Methods for Pruning:

    • Employing cross-validation to evaluate subtree utility.

    • Utilizing statistical tests to determine the significance of observed patterns.

    • Applying the Minimum Description Length (MDL) principle to assess complexity versus utility.

Reduced Error Pruning
  • Construction Technique:

    • Build a full tree from the training data, then repeatedly prune the node whose removal most improves accuracy on a held-out validation set, stopping when further pruning hurts validation accuracy.
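A minimal sketch of reduced error pruning, assuming the leaf-or-(feature, branches) tree representation used earlier. This variant prunes bottom-up and labels a pruned node with the majority class of the validation examples reaching it (a common alternative uses the training examples at each node for that label):

```python
from collections import Counter

def classify(tree, x):
    """Walk from the root to a leaf following x's attribute values."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[x[feature]]
    return tree

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def prune(tree, validation):
    """Bottom-up reduced-error pruning.

    At each internal node, compare the subtree's accuracy on the validation
    examples reaching it against a leaf predicting their majority class;
    keep whichever is at least as accurate.
    """
    if not isinstance(tree, tuple) or not validation:
        return tree
    feature, branches = tree
    pruned = {}
    for value, sub in branches.items():
        subset = [(x, y) for x, y in validation if x[feature] == value]
        pruned[value] = prune(sub, subset)
    tree = (feature, pruned)
    labels = [y for _, y in validation]
    majority = Counter(labels).most_common(1)[0][0]
    if labels.count(majority) / len(labels) >= accuracy(tree, validation):
        return majority  # the leaf does at least as well: prune
    return tree
```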

Challenges with Reduced Error Pruning
  • Data Utilization Concern: Holding out a validation set reduces the data available for growing the tree.

    • Learning Curve Impact: As learning curves show, accuracy generally drops as training instances are withheld, so validation-set size must be balanced against training-set size.

Cross-Validation Strategy without Data Loss
  • Approach: Run several trials of reduced error pruning with different random splits into growing and validation sets, recording the complexity (size) of the best pruned tree in each trial.

  • Finalizing Complexity: Average the pruned-tree complexities across trials, then grow a tree of that complexity breadth-first using all of the training data.

Additional Issues in Decision Tree Learning

  • Refinements Needed:

    • Engage better splitting criteria to enhance tree performance and mitigate biases.

    • Address challenges related to handling continuous feature values, predicting real-valued functions, and accounting for missing attribute data.

    • Consider misclassification costs and the integration of incremental learning methods like ID4 and ID5.

Overview of ID3 Algorithm

Characteristics of ID3
  • Foundation and Development: Developed by J. Ross Quinlan in 1979, building on information-theory principles established by Shannon in 1948.

  • Tree Construction: Top-down formation without backtracking, relying on information gain metrics for optimal classification attributes selection.

Information Gain Explained in ID3
  • Measurement: Information gain denotes the decrease in entropy after data partitioning based on attribute values.

    • Results in the generation of branches depending on attribute homogeneity.

  • Leaf Node Identification: Branches with zero entropy become terminal leaves; branches with nonzero entropy are split further until all data points are classified.

Advantages of ID3
  • Clarity in Predictions: Results in understandable rules extracted from training data.

  • Efficiency: Constructs fast and compact decision trees, requiring minimal attribute testing to classify data accurately.

  • Data Search Integrity: Utilization of the entire dataset aids in creating robust trees.

Disadvantages of ID3
  • Overfitting Potential: Small samples can be overfit, leading to unreliable classifications.

  • Single-Attribute Testing: Only one attribute is tested at a time, which may hinder performance.

  • Computational Load: Handling continuous features is resource-intensive, since many candidate thresholds must be sorted and evaluated to find effective splits.

Practical Application: The Simpsons Example

  • Dataset Characteristics:

    • Feature attributes: hair length, weight, and age; the class to predict is gender.

  • Entropy Calculation: Example for certain thresholds (e.g., Hair Length <= 5).

  • Decision Tree Formulation: Repeated evaluation and splitting based on information gain and entropy calculations leads to simple male/female classifications based on the given traits.
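The per-split entropy calculation can be sketched with hypothetical class counts; the actual class distribution for the Hair Length <= 5 split is not given in these notes, so the numbers below are illustrative only:

```python
import math

def entropy(pos, neg):
    """Entropy of a set with `pos` examples of one class and `neg` of the other."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# Hypothetical split on Hair Length <= 5 over 9 people: suppose class counts
# of (3, 1) on one side of the threshold and (1, 4) on the other.
h_left = entropy(3, 1)
h_right = entropy(1, 4)
weighted = 4/9 * h_left + 5/9 * h_right
gain = entropy(4, 5) - weighted  # entropy before split minus weighted average after
```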

Summary and Further Reading

Key Takeaways
  • Decision tree learning is a proficient method for concept learning and for learning discrete-valued functions.

  • Tree generation with the ID3 algorithm is characterized by greedy selection of attributes, maintaining simplicity while expanding only as required.

  • Addressing overfitting is vital for effective decision tree learning, with pruning methods essential for balancing complexity with performance.

  • The exploration of numerous extensions to ID3 indicates a rich landscape for ongoing research within decision tree methodologies.