Decision Trees Lecture Notes

Decision Trees

  • Decision Trees are non-parametric models used for both classification and regression.
  • They are flexible: as non-parametric models, their number of parameters is not fixed in advance, and (if the tree is built correctly) it does not grow simply because more features are added.
  • They can output:
    • Categorical predictions (e.g., classifying a plant).
    • Numerical predictions (e.g., predicting house prices).

Decision Tree Elements

  • Decision Trees are constructed using nodes and branches.
  • At each node, a feature is evaluated to split observations or direct a data point during prediction.
  • Decision trees are built recursively by evaluating features and selecting the one that best splits the data at each node.

Types of Nodes

  • Root Node: The starting node that evaluates the variable that best splits the data.
  • Intermediate Nodes: Nodes where variables are evaluated but are not the final prediction nodes.
  • Leaf Nodes: The final nodes, where predictions (a category or a numerical value) are made.
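The three node types above can be sketched as a small data structure. This is a minimal illustration, not a full implementation; the field names (`feature`, `threshold`, `prediction`) and the binary-split convention are assumptions, not from the notes.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    """A node in a binary decision tree.

    Root and intermediate nodes test a feature; leaf nodes carry a prediction.
    """
    feature: Optional[str] = None      # feature evaluated at this node (None for leaves)
    threshold: Optional[float] = None  # split point for that feature
    left: Optional["Node"] = None      # subtree for feature <= threshold
    right: Optional["Node"] = None     # subtree for feature > threshold
    prediction: Any = None             # value returned at a leaf

def predict(node: Node, x: dict) -> Any:
    """Direct a data point from the root down to a leaf, as described above."""
    while node.prediction is None:     # still at the root or an intermediate node
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction             # reached a leaf node

# Tiny hand-built tree: root splits on a hypothetical "height" feature.
tree = Node(feature="height", threshold=10.0,
            left=Node(prediction="shrub"), right=Node(prediction="tree"))
```

A call like `predict(tree, {"height": 4.0})` walks the single split and returns the leaf's category.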

If-Then Rule Set

  • Decision Trees use a mutually exclusive and exhaustive if-then rule set for classification.
  • Rules are learned sequentially from the training data.
  • Each time a rule is learned, the covered tuples are removed.
  • This process continues until a termination condition is met.
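The sequential rule-learning loop above can be sketched in a few lines. The rule scoring here (pick the single feature-value test whose covered tuples are purest, predict their majority class) is a toy stand-in for a real rule learner; the dataset shape (dicts with a `"label"` key) is an assumption.

```python
def learn_one_rule(data):
    """Toy rule learner: choose the (feature, value) test whose covered
    tuples are purest, and predict their majority class."""
    best = None
    for feature, value in {(f, row[f]) for row in data for f in row if f != "label"}:
        covered = [row for row in data if row[feature] == value]
        labels = [row["label"] for row in covered]
        purity = max(labels.count(c) for c in set(labels)) / len(labels)
        if best is None or purity > best[0]:
            best = (purity, feature, value, max(set(labels), key=labels.count))
    _, feature, value, label = best
    return feature, value, label

def learn_rule_set(data):
    """Sequential covering: learn a rule, remove the tuples it covers,
    and repeat until no training tuples remain (the termination condition)."""
    rules = []
    while data:
        feature, value, label = learn_one_rule(data)
        rules.append((feature, value, label))
        data = [row for row in data if row[feature] != value]  # drop covered tuples
    return rules

# Hypothetical toy data: each rule covers and removes part of the training set.
data = [{"color": "red", "label": "stop"},
        {"color": "green", "label": "go"},
        {"color": "red", "label": "stop"}]
rules = learn_rule_set(data)
```

Because covered tuples are removed each round, the learned rules are mutually exclusive over the training data and together exhaust it.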

Advantages of Decision Trees

  • Easy to interpret, providing a graphical and intuitive understanding of the algorithm's process.
  • Require less data for training compared to other Machine Learning algorithms.
  • Can be used for both Classification and Regression tasks.
  • Simple to implement and understand.

Disadvantages of Decision Trees

  • Prone to overfitting the training data and sensitive to outliers.
  • Weak learners: single decision trees are often combined into ‘forests’ (ensembles) to obtain stronger predictions.
  • Bias towards features with many levels, potentially leading to biased predictions.

Quantifying Impurity

  • To choose a split, we quantify the uncertainty at each node.
  • This is done by calculating the node's impurity, which ranges from 0 to 1.
  • Low impurity means low uncertainty, and high impurity means high uncertainty.
  • Impurity = 1 - p, where p is the probability of an object being classified to a particular class.
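A minimal sketch of the 1 - p impurity, under the assumption that p is read as the proportion of the majority class at the node (with that reading, this measure is also known as the misclassification error):

```python
def impurity(labels):
    """Impurity = 1 - p, taking p as the fraction of the majority class
    (an assumption about the notes' definition; a.k.a. misclassification error)."""
    p = max(labels.count(c) for c in set(labels)) / len(labels)
    return 1 - p
```

A pure node gives `impurity(["a", "a", "a"]) == 0` (no uncertainty), while an even two-class node gives 0.5.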

Decision Tree Induction Techniques

  • Decision tree induction is a top-down, recursive, divide-and-conquer approach.
  • Choose an attribute and split the larger training set into smaller training sets.
  • Different algorithms control:
    • Choosing the best attribute to split (attribute selection measure).
    • Splitting criteria.
  • Some algorithms include ID3 and CART.
  • Issues include:
    • Determining how to split records.
    • Specifying the attribute test condition.
    • Determining the best split.
    • Determining when to stop splitting.
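The top-down, recursive, divide-and-conquer procedure above can be sketched as an ID3-style induction on categorical features. This is a simplified sketch: the dataset shape (dicts with a `"label"` key), the stopping rules, and the tuple/dict tree representation are all assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Entropy of the class labels in a set of rows."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r["label"] for r in rows).values())

def best_split(rows, features):
    """Attribute selection: pick the feature whose split gives the lowest
    weighted average entropy (i.e., the highest information gain)."""
    def remainder(f):
        parts = {}
        for r in rows:
            parts.setdefault(r[f], []).append(r)
        return sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return min(features, key=remainder)

def build_tree(rows, features):
    """Recursively split the training set: choose the best attribute,
    partition the rows on it, and recurse on each smaller training set."""
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1 or not features:        # stop: pure node or no features left
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f = best_split(rows, features)
    branches = {}
    for r in rows:
        branches.setdefault(r[f], []).append(r)
    return (f, {v: build_tree(sub, [g for g in features if g != f])
                for v, sub in branches.items()})

# Hypothetical toy weather-style data with a single categorical feature.
data = [{"outlook": "sunny", "label": "no"},
        {"outlook": "sunny", "label": "no"},
        {"outlook": "overcast", "label": "yes"},
        {"outlook": "rain", "label": "yes"}]
tree = build_tree(data, ["outlook"])
```

Internal nodes are `(feature, {value: subtree})` pairs and leaves are bare class labels, so the recursion mirrors the divide-and-conquer description directly.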

Attribute Selection Measures

  • An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a data partition D into individual classes.
  • Ideally, splitting D should result in pure partitions, where all tuples in a partition belong to the same class.
  • The best splitting criterion is the one that most closely approximates this scenario.

Measures of Node Impurity

  • Entropy
  • Information Gain (used by ID3 algorithm)
  • Gini Index (used by CART)
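The Gini index mentioned above follows the standard CART formula Gini = 1 - Σ p_i², which is not spelled out in these notes; a minimal sketch:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum of squared class proportions (the standard
    CART impurity formula; 0 for a pure node)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```

A pure node scores 0, and an even two-class node scores 0.5, the maximum for two classes.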

Entropy

  • Entropy is an information theory metric measuring impurity or uncertainty in a group of observations.
  • It is used to determine how a decision tree chooses to split data and measures how informative a node is.
  • In Physics and Mathematics, entropy refers to the randomness or uncertainty of a random variable X.
  • In information theory, it refers to the impurity in a group of examples.
  • Splitting on any attribute results in the average entropy of the resulting training subsets being less than or equal to the entropy of the previous training set.
  • Information gain is this decrease in entropy.
  • It is computed as the difference between the entropy before the split and the weighted average entropy after the split.
  • The attribute with the largest information gain is chosen as the splitting attribute.

Attribute Selection Measures – Information Gain

  • ID3 uses information gain as its attribute selection measure.
  • Constructing a decision tree involves finding the attribute that returns the highest information gain (most homogeneous branches).
  • The attribute with the highest information gain is chosen as the splitting attribute for node N.

Entropy Calculation

  • Example: A = {1, 1, 1, 1, 1, 2, 2, 2, 3}
  • There are 3 distinct classes.
  • p_1 = 5/9, p_2 = 3/9, p_3 = 1/9
  • E = p_1 log(1/p_1) + p_2 log(1/p_2) + p_3 log(1/p_3)
  • Entropy of a Training Set:
    • If there are k classes, p_i (for i = 1, 2, ..., k) denotes the number of occurrences of class i divided by the total number of instances, and E = -Σ_i p_i log_2(p_i).

Examples for computing Entropy

  • Calculate the entropy values for each of the following cases:
  • E = -Σ_i p_i log_2(p_i)
  • If all the observations belong to the same class, entropy would be 0: E = -(1 × log_2(1)) = 0
  • Such a dataset does not have any impurity and would not be useful for learning.
  • However, if we have a dataset with, say, two classes, half yellow and half purple, the entropy will be one.
  • E = -((0.5 log_2(0.5)) + (0.5 log_2(0.5))) = 1
  • This kind of dataset has maximum impurity and is good for learning.
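The worked examples above (the set A, a single-class dataset, and an even two-class dataset) can be checked with a short entropy function; the entropy of A comes out to about 1.35 bits.

```python
from collections import Counter
from math import log2

def entropy(values):
    """E = -sum(p_i * log2(p_i)) over the class proportions in `values`."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

# The notes' example: A = {1, 1, 1, 1, 1, 2, 2, 2, 3}
A = [1, 1, 1, 1, 1, 2, 2, 2, 3]
print(entropy(A))                       # approximately 1.35 bits

print(entropy(["y"] * 6))               # single class: no impurity, E = 0
print(entropy(["y"] * 3 + ["p"] * 3))   # two equal classes: maximum impurity, E = 1
```
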

Attribute Selection Measures - Information Gain

  • The attribute with the highest information gain minimizes the information needed to classify the tuples in the resulting partitions.
  • Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

Attribute Selection Measures - Information Gain Formulae

  • Info(D) = -Σ_i p_i log_2(p_i) is the average amount of information needed to identify the class label of a tuple in D,
  • where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
  • Info(D) is also known as the entropy of D.
  • Info_A(D) = Σ_j (|D_j| / |D|) × Info(D_j) is the expected information still required to classify a tuple after partitioning D on attribute A into partitions D_1, ..., D_v.
  • Gain(A) = Info(D) - Info_A(D).

Attribute Selection Measures - Information Gain: Example

  • Calculate information gain for the attribute
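The notes do not reproduce the example's data table, so the sketch below computes Gain(A) = Info(D) - Info_A(D) on a hypothetical toy dataset where a "windy" attribute separates the labels perfectly, making its gain equal to the full entropy Info(D) = 1 bit.

```python
from collections import Counter
from math import log2

def info(rows):
    """Info(D): entropy of the class labels in partition D."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r["label"] for r in rows).values())

def info_a(rows, attr):
    """Info_A(D): weighted average entropy after partitioning D on attribute A."""
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return sum(len(p) / len(rows) * info(p) for p in parts.values())

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(rows) - info_a(rows, attr)

# Hypothetical toy data (the attribute name "windy" is illustrative only).
D = [{"windy": "yes", "label": "no"}, {"windy": "yes", "label": "no"},
     {"windy": "no", "label": "yes"}, {"windy": "no", "label": "yes"}]
print(gain(D, "windy"))
```

Since both partitions are pure, Info_A(D) = 0 and the gain is the entire 1 bit of entropy; any attribute with a smaller gain would be passed over at this node.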