Decision Tree
Machine Learning Methods
Decision Tree Learning
Overview of Decision Trees
Definition: Decision trees are tree-structured classifiers over instances represented as feature vectors.
Structure:
Nodes test features. Each node has one branch for each value of the feature, and leaves specify the classification category.
Functional Capability:
Can represent arbitrary conjunctions and disjunctions, covering any classification function over discrete feature vectors.
Can be transformed into a set of rules, exemplified in disjunctive normal form (DNF).
Example Rules:
If an object is red and a circle, classify it as positive (red ∧ circle → pos).
A multi-class tree yields one rule per leaf: red ∧ circle → A; blue → B; red ∧ square → B; green → C; red ∧ triangle → C.
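The multi-class rules above correspond to the root-to-leaf paths of a tree. A minimal sketch of that conversion, using a hypothetical nested-dict tree representation (the layout and the `tree_to_rules` helper are illustrative, not from any standard library):

```python
def tree_to_rules(tree, conditions=()):
    """Enumerate root-to-leaf paths as (conjunction, label) rules."""
    if not isinstance(tree, dict):            # a leaf: emit one rule
        return [(conditions, tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree,
                               conditions + ((tree["feature"], value),))
    return rules

# Tree encoding the example rules above
tree = {"feature": "color", "branches": {
    "blue": "B",
    "green": "C",
    "red": {"feature": "shape", "branches": {
        "circle": "A", "square": "B", "triangle": "C"}}}}

rules = tree_to_rules(tree)
# e.g. ((("color", "red"), ("shape", "circle")), "A") is among the 5 rules
```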
Properties of Decision Tree Learning
Handling of Continuous Features: Decision trees can manage continuous (real-valued) features by allowing splits based on a threshold.
Example: Splitting a feature, "length", into ranges such as length < 3 and length ≥ 3.
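A minimal sketch of such a threshold split, assuming examples are stored as dicts of feature values (the representation and the `threshold_split` name are hypothetical; real systems like C4.5 also choose the threshold itself, by maximizing information gain over candidate midpoints):

```python
def threshold_split(examples, feature, threshold):
    """Partition examples on a continuous feature at a fixed threshold."""
    below = [e for e in examples if e[feature] < threshold]
    above = [e for e in examples if e[feature] >= threshold]
    return below, above

days = [{"length": 2.5}, {"length": 4.0}, {"length": 3.0}]
short, rest = threshold_split(days, "length", 3)
# short holds the length-2.5 example; rest holds the other two
```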
Types of Trees:
Classification trees: yield discrete class labels at the leaves.
Regression trees: provide real-valued outputs at the leaves.
Efficiency: Algorithms for finding a decision tree consistent with the training data run efficiently on large datasets, enabling effective data mining.
Adaptability:
Developed methods for addressing noisy training data (including both class and feature noise).
Techniques for handling missing feature values have also been implemented.
Example of a Decision Tree
Context: A decision tree designed to determine if a day is suitable for playing tennis based on weather features (humidity, wind, outlook).
Classification Process: An example is classified by traversing the tree based on attribute values until reaching a leaf node, representing the classification result (either Yes or No).
Challenges in Decision Tree Learning
Structure of Instances
Instance Representation: Instances are portrayed by attribute-value pairs with fixed attributes (e.g., Temperature) having discrete values (e.g., Hot).
More complex scenarios arise for attributes with many values or with real-valued domains.
Target Function: The focus is on discrete output values. The decision tree assigns a boolean classification (e.g., Yes or No) to each instance.
Extensions exist for multi-output scenarios and real-valued outcomes, though less common.
Issues Encountered
Training Data Quality: Training data may contain errors; decision tree methods remain robust to both classification (label) errors and attribute-value errors.
Handling Missing Values: Decision tree methods can function effectively even when some training examples have missing attribute values.
Decision Tree Induction
Algorithmic Foundation
Pseudocode for Tree Construction
DTree(examples, features) returns a tree
1. If all examples are in one category:
- Return a leaf node with that category label.
2. Else if the set of features is empty:
- Return a leaf node with the most common category label among examples.
3. Else:
- Pick a feature F and create a node R.
- For each value vi of F:
- Let examplesi be the subset of examples with value vi for F.
- Add an edge E to node R labeled with vi.
- If examplesi is empty:
- Attach a leaf node to E labeled with the most common class from examples.
- Else:
- Call DTree(examplesi, features - {F}) and attach the resulting tree under E.
4. Return the subtree rooted at R.
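The pseudocode above can be sketched in Python. The dict-based tree representation is an assumption, and for brevity the splitting feature is chosen arbitrarily; ID3 would instead pick the feature with the highest information gain:

```python
from collections import Counter

def most_common_label(examples):
    """Majority class among (feature_dict, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtree(examples, features, feature_values):
    """Recursive top-down induction, mirroring the pseudocode above.

    examples: list of (feature_dict, label) pairs
    features: set of feature names still available for splitting
    feature_values: dict mapping each feature to its possible values
    """
    labels = {label for _, label in examples}
    if len(labels) == 1:                    # 1. all in one category
        return next(iter(labels))
    if not features:                        # 2. no features left
        return most_common_label(examples)
    f = next(iter(features))                # 3. pick a feature F (arbitrary here)
    node = {"feature": f, "branches": {}}
    for v in feature_values[f]:
        subset = [(x, y) for x, y in examples if x[f] == v]
        if not subset:                      # empty branch -> majority leaf
            node["branches"][v] = most_common_label(examples)
        else:
            node["branches"][v] = dtree(subset, features - {f}, feature_values)
    return node                             # 4. subtree rooted at R

examples = [({"color": "red"}, "+"), ({"color": "red"}, "+"),
            ({"color": "blue"}, "-")]
tree = dtree(examples, {"color"}, {"color": ["red", "blue", "green"]})
# red -> "+", blue -> "-", green (no examples) -> overall majority "+"
```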
Selecting the Optimal Feature for Splitting
Objective: Find the smallest tree consistent with the training data, in line with Occam's razor.
Challenge: Finding the smallest decision tree is an NP-hard optimization problem.
A greedy search approach is common in top-down methods.
General principle in Machine Learning: "Greed is good."
Optimal Feature Selection: Choose the feature whose splits produce subsets that are as pure as possible, since (nearly) pure subsets are closer to becoming leaf nodes.
Heuristic Methods: Commonly uses information gain metrics developed in Quinlan's ID3 system (1979).
Measures of Feature Effectiveness
Information Gain
Definition: Information gain measures the expected reduction in entropy from partitioning the examples on an attribute: $\text{Gain}(S, F) = \text{Entropy}(S) - \sum_{v \in \text{Values}(F)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$, where $S_v$ is the subset of $S$ with value $v$ for $F$.
Entropy: Measures the impurity (disorder) of a collection of examples; a pure collection has zero entropy.
Entropy Calculation
Entropy Formula for Binary Classification: $\text{Entropy}(S) = -p_1 \log_2 p_1 - p_0 \log_2 p_0$
Where:
$p_1$: fraction of positive examples in set S.
$p_0$: fraction of negative examples in set S.
Entropy Behavior:
Pure Set (all examples in one category): Entropy is 0.
Mixed Set ($p_1 = p_0 = 0.5$): Maximum entropy of 1.
Generalization for Multi-class Problems: $\text{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
where $c$ represents the number of categories and $p_i$ the fraction of examples in category $i$.
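The binary and multi-class formulas can be checked with a small helper (the function name and label-list representation are illustrative):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a multiset of class labels.

    Implements Entropy(S) = -sum_i p_i * log2(p_i) over the c classes.
    """
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

entropy(["+", "+", "-", "-"])   # mixed binary set -> 1.0 (maximum)
entropy(["+", "+", "+", "+"])   # pure set -> 0.0
```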
Example Calculation of Information Gain
Example of Classes: a small dataset of examples, each an attribute tuple labeled + or −.
Calculation Example:
With 4 positive and 2 negative examples: $\text{Entropy}(S) = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918$.
For each candidate attribute, compute the entropy of the subsets it induces and subtract their weighted average from $\text{Entropy}(S)$ to obtain the information gain.
Information Gain Outcomes:
Information Gain Outcomes: The computed gains quantify how effective each candidate attribute (e.g., size, color, shape) is as a split; the attribute with the highest gain is selected.
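A sketch of the gain computation, using made-up example data rather than the lecture's dataset (the `(feature_dict, label)` layout is an assumption):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, feature):
    """Gain(S, F) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    labels = [y for _, y in examples]
    n = len(examples)
    remainder = 0.0
    for v in {x[feature] for x, _ in examples}:
        sub = [y for x, y in examples if x[feature] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

# A perfectly predictive feature yields gain 1.0 on a 50/50 split;
# an uninformative one yields gain 0.0.
perfect = [({"color": "red"}, "+"), ({"color": "red"}, "+"),
           ({"color": "blue"}, "-"), ({"color": "blue"}, "-")]
useless = [({"size": "big"}, "+"), ({"size": "small"}, "+"),
           ({"size": "big"}, "-"), ({"size": "small"}, "-")]
```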
Searching the Hypothesis Space
Methodology: Batch learning approach, processing all training instances at once instead of incremental updates.
Techniques used: Greedy hill-climbing search yields only a locally optimal tree, but it is guaranteed to find a tree consistent with any conflict-free training set.
Understanding Bias in Decision Tree Induction
Bias Nature: Information gain introduces a preference for shorter trees that place high-information-gain attributes closest to the root.
Search Strategy: This is a search (preference) bias arising from the greedy search, not a restriction of the hypothesis space.
Historical Context of Decision Tree Research
Timeline of Developments
1950s-60s: Hunt and colleagues use exhaustive-search decision-tree methods to model human concept learning.
1970s: Development of ID3 by Quinlan introduces the information gain heuristic; Breiman et al. develop CART.
1980s: Further improvements are introduced for handling noise, continuous features, and missing values, along with better splitting criteria.
1993: Release of Quinlan's updated decision-tree package (C4.5).
Present: Weka, a popular machine learning software suite provides a Java version of C4.5 known as J48.
Computational Complexity of Decision Trees
Complexity Analysis
Worst Case Scenario: Complete tree building where every path tests every feature.
For $n$ examples and $m$ features: computing information gains at depth $i$ examines the remaining $m - i$ features for each example, so building a complete tree takes $\sum_{i=1}^{m} O(n(m-i)) = O(n m^2)$ time.
Practical Complexity: Learned trees are rarely complete, so runtime in practice is roughly linear in both the number of examples and the number of features.
Overfitting and Generalization
Overfitting Issues
Concept: Overfitting occurs when the model perfectly classifies training data but generalizes poorly to unseen instances.
Caused by noise in the training data and by datasets too small to be representative.
Concept Definition: A hypothesis $h$ overfits the training data if there exists another hypothesis $h'$ such that $h$ has lower error than $h'$ on the training data but higher error on unseen test data.
Overfitting Examples
Hypothetical Scenario: Fitting noisy measurements of Ohm's law with polynomial curves.
Overfitting Demonstration: A high-degree polynomial can fit every training point exactly yet generalize poorly, whereas a linear fit captures the underlying relationship and predicts new points more reliably.
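The contrast can be demonstrated without any ML library. A sketch with made-up noisy Ohm's-law readings ($V \approx 2I$), comparing a least-squares line through the origin against the exact interpolating polynomial:

```python
def lagrange(points, x):
    """Evaluate the polynomial that passes through every point at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical noisy (current, voltage) readings with true slope R = 2
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

# Linear fit through the origin: slope = sum(x*y) / sum(x*x)
slope = sum(x * y for x, y in data) / sum(x * x for x, _ in data)

# Both models fit the training data; compare them at held-out I = 6
linear_pred = slope * 6          # close to the true V = 12
cubic_pred = lagrange(data, 6)   # zero training error, wild extrapolation
```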
Noise and Its Impact
Effect of Noise: Errors in class labels or feature values can lead the learner to fit spurious patterns, yielding misleading classifications.
Example of Noise Impact: Introducing conflicting instances can lead to unreliable classifications.
Overfitting Prevention Techniques
Pruning Methods
Core Approaches:
Pre-pruning: Halting tree growth when insufficient data for reliable decision-making.
Post-pruning: Developing the full tree and then removing ineffective branches.
Methods for Pruning:
Employing cross-validation to evaluate subtree utility.
Utilizing statistical tests to determine the significance of observed patterns.
Applying the Minimum Description Length (MDL) principle to assess complexity versus utility.
Reduced Error Pruning
Construction Technique:
Grow a complete tree from the training data, measure accuracy on a held-out validation set, and repeatedly prune the node whose removal yields the highest validation accuracy, stopping when further pruning hurts.
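A simplified sketch of this procedure, assuming a nested-dict tree in which every node caches its training-majority label (the node layout and function names are assumptions, not a standard API):

```python
def classify(node, x):
    """Descend until a leaf; every node stores its majority label."""
    while "feature" in node:
        node = node["branches"][x[node["feature"]]]
    return node["majority"]

def accuracy(root, examples):
    return sum(classify(root, x) == y for x, y in examples) / len(examples)

def reduced_error_prune(node, root, validation):
    """Bottom-up reduced-error pruning, in place (a simplified sketch).

    An internal node is a dict with "feature", "branches", and "majority"
    (its training-majority label); a leaf has only "majority". A node is
    collapsed to a leaf whenever validation accuracy does not decrease.
    """
    if "feature" not in node:
        return
    for child in node["branches"].values():
        reduced_error_prune(child, root, validation)
    before = accuracy(root, validation)
    feature, branches = node.pop("feature"), node.pop("branches")
    if accuracy(root, validation) < before:     # pruning hurt: undo it
        node["feature"], node["branches"] = feature, branches

# Hypothetical tree whose "blue" subtree fits noise
root = {"feature": "color", "majority": "+", "branches": {
    "red": {"majority": "-"},
    "blue": {"feature": "size", "majority": "+", "branches": {
        "big": {"majority": "+"}, "small": {"majority": "-"}}}}}
validation = [({"color": "red", "size": "big"}, "-"),
              ({"color": "blue", "size": "small"}, "+"),
              ({"color": "blue", "size": "big"}, "+")]
reduced_error_prune(root, root, validation)
# The "blue" subtree collapses to a "+" leaf; the root split survives
```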
Challenges with Reduced Error Pruning
Data Utilization Concern: Potential wastage of valuable training data for validation purposes.
Learning Curve Impact: Accuracy grows with the number of training examples, so examples held out for validation come at a cost to the learned tree.
Cross-Validation Strategy without Data Loss
Algorithm Sketch: Grow trees breadth-first so growth can be stopped at any specified complexity; run several trials of reduced-error pruning on different random train/validation splits and record each pruned tree's complexity.
Finalizing Complexity: Take the average pruned-tree complexity across trials, then grow the final tree breadth-first from all the training data, stopping at that complexity.
Additional Issues in Decision Tree Learning
Refinements Needed:
Use better splitting criteria (such as gain ratio) to improve tree quality and mitigate the bias toward many-valued attributes.
Address challenges related to handling continuous feature values, predicting real-valued functions, and accounting for missing attribute data.
Consider misclassification costs and the integration of incremental learning methods like ID4 and ID5.
Overview of ID3 Algorithm
Characteristics of ID3
Foundation and Development: Developed by J. Ross Quinlan in 1979, building on the information-theory principles established by Shannon in 1948.
Tree Construction: Top-down formation without backtracking, relying on information gain metrics for optimal classification attributes selection.
Information Gain Explained in ID3
Measurement: Information gain denotes the decrease in entropy after data partitioning based on attribute values.
Splitting on the chosen attribute generates one branch per attribute value.
Leaf Node Identification: A branch whose examples have zero entropy (all one class) becomes a terminal leaf; branches with remaining entropy are processed further until all data points are classified.
Advantages of ID3
Clarity in Predictions: Results in understandable rules extracted from training data.
Efficiency: Builds compact trees quickly; classifying an instance tests only the attributes along a single root-to-leaf path.
Whole-Dataset Search: Every split decision uses all available training examples, which helps produce robust trees.
Disadvantages of ID3
Overfitting Potential: Risk associated with small sample sizes may lead to overfitting/unreliable classification.
Single-Attribute Testing: Only one attribute is tested at a time, which may hinder performance.
Computational Load: Classifying continuous features can be resource-intensive, since finding effective splits requires evaluating many candidate thresholds.
Practical Application: The Simpsons Example
Dataset Characteristics:
Feature attributes: hair length, weight, and age; the target class is gender.
Entropy Calculation: Example for certain thresholds (e.g., Hair Length <= 5).
Decision Tree Formulation: Repeatedly evaluate candidate splits by entropy and information gain, choosing the best at each node until the tree yields simple male/female predictions from the given traits.
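A sketch of the threshold-entropy computation with made-up data (the lecture's actual hair-length values and labels may differ):

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# Hypothetical (hair_length, gender) pairs
people = [(0, "M"), (2, "M"), (4, "F"), (6, "F"), (8, "F"), (10, "M")]
labels = [g for _, g in people]

left = [g for h, g in people if h <= 5]    # Hair Length <= 5
right = [g for h, g in people if h > 5]

gain = entropy(labels) \
    - (len(left) / len(people)) * entropy(left) \
    - (len(right) / len(people)) * entropy(right)
# gain ≈ 0.082: this threshold is only weakly informative for this data
```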
Summary and Further Reading
Key Takeaways
Decision tree learning is an effective method for concept learning and for learning discrete-valued functions.
The ID3 algorithm builds trees top-down by greedy attribute selection, keeping them simple and expanding only as the data requires.
Addressing overfitting is vital for effective decision tree learning, with pruning methods essential for balancing complexity with performance.
The exploration of numerous extensions to ID3 indicates a rich landscape for ongoing research within decision tree methodologies.