Decision Trees Lecture Notes
Decision Trees
- Decision Trees are non-parametric models used for both classification and regression.
- They are flexible: being non-parametric, their size is determined by the fitted tree rather than fixed in advance by the number of features (when the tree is built correctly).
- They can output:
- Categorical predictions (e.g., classifying a plant).
- Numerical predictions (e.g., predicting house prices).
Decision Tree Elements
- Decision Trees are constructed using nodes and branches.
- At each node, a feature is evaluated to split observations or direct a data point during prediction.
- Decision trees are built recursively by evaluating features and selecting the one that best splits the data at each node.
Types of Nodes
- Root Node: The starting node that evaluates the variable that best splits the data.
- Intermediate Nodes: Nodes where variables are evaluated but are not the final prediction nodes.
- Leaf Nodes: The final nodes where predictions (a category or a numerical value) are made.
If-Then Rule Set
- Decision Trees use a mutually exclusive and exhaustive if-then rule set for classification.
- Rules are learned sequentially from the training data.
- Each time a rule is learned, the covered tuples are removed.
- This process continues until a termination condition is met.
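The sequential rule-learning loop described above can be sketched as follows. This is a minimal illustration, not the notes' own algorithm: it greedily learns single-feature if-then rules, removing covered tuples after each rule, and terminates when no pure rule remains. All data and function names here are illustrative.

```python
def learn_rules(rows, labels):
    """Greedily learn (feature_index, value, label) rules; covered tuples
    are removed after each rule, until no rows remain or no pure rule exists."""
    rows, labels = list(rows), list(labels)
    rules = []
    while rows:
        best = None  # (covered_indices, rule)
        for f in range(len(rows[0])):
            for v in {r[f] for r in rows}:
                idx = [i for i, r in enumerate(rows) if r[f] == v]
                covered_classes = {labels[i] for i in idx}
                # accept only rules whose covered tuples share one class
                if len(covered_classes) == 1 and (best is None or len(idx) > len(best[0])):
                    best = (idx, (f, v, covered_classes.pop()))
        if best is None:  # termination condition: no pure rule can be learned
            break
        idx, rule = best
        rules.append(rule)
        rows = [r for i, r in enumerate(rows) if i not in idx]
        labels = [l for i, l in enumerate(labels) if i not in idx]
    return rules
```

Each learned rule reads as "if feature f has value v then predict the label"; removing the covered tuples is what makes the resulting rule set mutually exclusive.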
Advantages of Decision Trees
- Easy to interpret, providing a graphical and intuitive understanding of the algorithm's process.
- Require less data for training compared to other Machine Learning algorithms.
- Can be used for both Classification and Regression tasks.
- Simple to implement and understand.
Disadvantages of Decision Trees
- Prone to overfitting the training data and sensitive to outliers.
- Weak learners: single decision trees often require combination into ‘forests’ for stronger predictions.
- Bias towards features with many levels, potentially leading to biased predictions.
Quantifying Impurity
- Quantify uncertainty at each node.
- Calculate impurity first; it ranges from 0 for a pure node up to a maximum for an evenly mixed one.
- Low impurity means low uncertainty, and high impurity means high uncertainty.
- A simple measure is the misclassification error, Impurity = 1 − p, where p is the proportion of the most common class at the node.
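Taking p as the proportion of the majority class at a node (an assumption; the notes leave p underspecified), the Impurity = 1 − p formula can be sketched as:

```python
from collections import Counter

def misclassification_impurity(labels):
    """Impurity = 1 - p, with p the fraction of the majority class at the node."""
    counts = Counter(labels)
    p = max(counts.values()) / len(labels)
    return 1 - p

# A pure node has impurity 0; an even two-class mix has impurity 0.5.
```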
Decision Tree Induction Techniques
- Decision tree induction is a top-down, recursive, divide-and-conquer approach.
- Choose an attribute and split the larger training set into smaller training sets.
- Different algorithms control:
- Choosing the best attribute to split (attribute selection measure).
- Splitting criteria.
- Some algorithms include ID3 and CART.
- Issues include:
- Determining how to split records.
- Specifying the attribute test condition.
- Determining the best split.
- Determining when to stop splitting.
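The top-down, recursive, divide-and-conquer procedure above can be sketched as a minimal induction routine for categorical features, assuming entropy-based attribute selection (as in ID3); the data layout and helper names are illustrative, not from the notes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Pick the attribute whose split yields the lowest weighted entropy."""
    def split_entropy(a):
        total = 0.0
        for v in {r[a] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total
    return min(attrs, key=split_entropy)

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:          # stop: pure partition -> leaf
        return labels[0]
    if not attrs:                      # stop: no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)
    node = {"attr": a, "children": {}}
    for v in {r[a] for r in rows}:     # divide: one branch per attribute value
        sub_rows = [r for r in rows if r[a] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[a] == v]
        node["children"][v] = build_tree(sub_rows, sub_labels,
                                         [x for x in attrs if x != a])
    return node
```

The two `return` statements at the top are the stopping conditions ("when to stop splitting"); `best_attribute` is the attribute selection measure; the loop over values is the splitting criterion for categorical attributes.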
Attribute Selection Measures
- An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a data partition D into individual classes.
- Ideally, splitting D should result in pure partitions where all tuples belong to the same class.
- The best splitting criterion is the one that comes closest to producing such pure partitions.
Measures of Node Impurity
- Entropy
- Information Gain (used by ID3 algorithm)
- Gini Index (used by CART)
Entropy
- Entropy is an information theory metric measuring impurity or uncertainty in a group of observations.
- It is used to determine how a decision tree chooses to split data and measures how informative a node is.
- In Physics and Mathematics, entropy refers to the randomness or uncertainty of a random variable X.
- In information theory, it refers to the impurity in a group of examples.
- Splitting on any attribute results in the average entropy of resulting training subsets being less than or equal to the previous training set.
- Information gain is the decrease in entropy.
- Information gain computes the difference between entropy before split and average entropy after split.
- The attribute with the largest information gain is chosen as the splitting attribute.
- ID3 uses information gain as its attribute selection measure.
- Constructing a decision tree involves finding the attribute that returns the highest information gain (most homogeneous branches).
- The attribute with the highest information gain is chosen as the splitting attribute for node N.
Entropy Calculation
- Example: A=1,1,1,1,1,2,2,2,3
- There are 3 distinct classes.
- p1 = 5/9, p2 = 3/9, p3 = 1/9
- E = p1·log2(1/p1) + p2·log2(1/p2) + p3·log2(1/p3)
- Entropy of a Training Set:
- If there are k classes, E = −∑ pi·log2(pi) for i = 1, …, k, where pi denotes the number of occurrences of class i divided by the total number of instances.
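The entropy of the example set above can be checked numerically; this is a small sketch using the p_i = count/total definition:

```python
from collections import Counter
from math import log2

def entropy(values):
    """E = -sum(p_i * log2(p_i)), with p_i = count of class i / total."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

A = [1, 1, 1, 1, 1, 2, 2, 2, 3]
# entropy(A) = (5/9)log2(9/5) + (3/9)log2(3) + (1/9)log2(9) ≈ 1.3516
```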
Examples for computing Entropy
- Calculate the Entropy values for each
- E = −∑ pi·log2(pi)
- If all the observations belong to the same class, entropy would be 0: E = −(1·log2(1)) = 0
- Such a dataset does not have any impurity and would not be useful for learning.
- However, if a dataset has two classes, say half yellow and half purple, the entropy will be 1.
- E = −((0.5·log2(0.5)) + (0.5·log2(0.5))) = 1
- This kind of dataset is good for learning.
- The attribute with the highest information gain minimizes the information needed to classify the tuples in the resulting partitions.
- Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
- Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
- where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.
- Info(D) is known as the entropy of D.
- InfoA(D) = ∑j (|Dj|/|D|) × Info(Dj)
- The information gain for an attribute A is then Gain(A) = Info(D) − InfoA(D).
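Putting the two formulas together, information gain for a categorical attribute can be sketched as Info(D) minus the weighted average entropy of the partitions D_j (toy data; names are illustrative):

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy of the class labels in partition D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Gain(A) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j)."""
    n = len(labels)
    info_a = 0.0
    for v in set(attr_values):          # one partition D_j per attribute value
        sub = [l for a, l in zip(attr_values, labels) if a == v]
        info_a += len(sub) / n * info(sub)
    return info(labels) - info_a
```

An attribute that splits the data into pure partitions has gain equal to Info(D); an attribute whose partitions are as mixed as D itself has gain 0, which is why ID3 picks the attribute with the largest gain.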