Decision Trees Lecture Notes

Decision Trees

  • Decision Trees are non-parametric models used for both classification and regression.
  • They are flexible: as non-parametric models, their number of parameters is not fixed in advance, and (if the tree is built correctly) it does not grow simply because more features are added.
  • They can output:
    • Categorical predictions (e.g., classifying a plant).
    • Numerical predictions (e.g., predicting house prices).

Decision Tree Elements

  • Decision Trees are constructed using nodes and branches.
  • At each node, a feature is evaluated to split observations or direct a data point during prediction.
  • Decision trees are built recursively by evaluating features and selecting the one that best splits the data at each node.

Types of Nodes

  • Root Node: The starting node that evaluates the variable that best splits the data.
  • Intermediate Nodes: Nodes where variables are evaluated but are not the final prediction nodes.
  • Leaf Nodes: The final nodes, where predictions (a category or a numerical value) are made.
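The three node types above can be sketched as a small data structure. This is a minimal illustration, not a full implementation; the field names (`feature`, `threshold`, `prediction`) and the binary-split convention are assumptions, not from the notes.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    """A node in a binary decision tree.

    Root and intermediate nodes test a feature; leaf nodes carry a prediction.
    """
    feature: Optional[str] = None      # feature evaluated at this node (None for leaves)
    threshold: Optional[float] = None  # split point for that feature
    left: Optional["Node"] = None      # subtree for feature <= threshold
    right: Optional["Node"] = None     # subtree for feature > threshold
    prediction: Any = None             # value returned at a leaf

def predict(node: Node, x: dict) -> Any:
    """Direct a data point from the root down to a leaf, as described above."""
    while node.prediction is None:     # still at the root or an intermediate node
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction             # reached a leaf node

# Tiny hand-built tree: root splits on a hypothetical "height" feature.
tree = Node(feature="height", threshold=10.0,
            left=Node(prediction="shrub"), right=Node(prediction="tree"))
```

A call like `predict(tree, {"height": 4.0})` walks the single split and returns the leaf's category.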

If-Then Rule Set

  • Decision Trees use a mutually exclusive and exhaustive if-then rule set for classification.
  • Rules are learned sequentially from the training data.
  • Each time a rule is learned, the covered tuples are removed.
  • This process continues until a termination condition is met.
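The sequential rule-learning loop above can be sketched in a few lines. The rule scoring here (pick the single feature-value test whose covered tuples are purest, predict their majority class) is a toy stand-in for a real rule learner; the dataset shape (dicts with a `"label"` key) is an assumption.

```python
def learn_one_rule(data):
    """Toy rule learner: choose the (feature, value) test whose covered
    tuples are purest, and predict their majority class."""
    best = None
    for feature, value in {(f, row[f]) for row in data for f in row if f != "label"}:
        covered = [row for row in data if row[feature] == value]
        labels = [row["label"] for row in covered]
        purity = max(labels.count(c) for c in set(labels)) / len(labels)
        if best is None or purity > best[0]:
            best = (purity, feature, value, max(set(labels), key=labels.count))
    _, feature, value, label = best
    return feature, value, label

def learn_rule_set(data):
    """Sequential covering: learn a rule, remove the tuples it covers,
    and repeat until no training tuples remain (the termination condition)."""
    rules = []
    while data:
        feature, value, label = learn_one_rule(data)
        rules.append((feature, value, label))
        data = [row for row in data if row[feature] != value]  # drop covered tuples
    return rules

# Hypothetical toy data: each rule covers and removes part of the training set.
data = [{"color": "red", "label": "stop"},
        {"color": "green", "label": "go"},
        {"color": "red", "label": "stop"}]
rules = learn_rule_set(data)
```

Because covered tuples are removed each round, the learned rules are mutually exclusive over the training data and together exhaust it.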

Advantages of Decision Trees

  • Easy to interpret, providing a graphical and intuitive understanding of the algorithm's process.
  • Require less data for training compared to other Machine Learning algorithms.
  • Can be used for both Classification and Regression tasks.
  • Simple to implement and understand.

Disadvantages of Decision Trees

  • Prone to overfitting the training data and sensitive to outliers.
  • Weak learners: single decision trees are often combined into ‘forests’ (ensembles) to obtain stronger predictions.
  • Bias towards features with many levels, potentially leading to biased predictions.

Quantifying Impurity

  • To choose a split, we quantify the uncertainty at each node.
  • This is done by calculating the node's impurity, which ranges from 0 to 1.
  • Low impurity means low uncertainty, and high impurity means high uncertainty.
  • Impurity = 1 - p, where p is the probability of an object being classified to a particular class.
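A minimal sketch of the 1 - p impurity, under the assumption that p is read as the proportion of the majority class at the node (with that reading, this measure is also known as the misclassification error):

```python
def impurity(labels):
    """Impurity = 1 - p, taking p as the fraction of the majority class
    (an assumption about the notes' definition; a.k.a. misclassification error)."""
    p = max(labels.count(c) for c in set(labels)) / len(labels)
    return 1 - p
```

A pure node gives `impurity(["a", "a", "a"]) == 0` (no uncertainty), while an even two-class node gives 0.5.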

Decision Tree Induction Techniques

  • Decision tree induction is a top-down, recursive, divide-and-conquer approach.
  • Choose an attribute and split the larger training set into smaller training sets.
  • Different algorithms control:
    • Choosing the best attribute to split (attribute selection measure).
    • Splitting criteria.
  • Some algorithms include ID3 and CART.
  • Issues include:
    • Determining how to split records.
    • Specifying the attribute test condition.
    • Determining the best split.
    • Determining when to stop splitting.
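The top-down, recursive, divide-and-conquer procedure above can be sketched as an ID3-style induction on categorical features. This is a simplified sketch: the dataset shape (dicts with a `"label"` key), the stopping rules, and the tuple/dict tree representation are all assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Entropy of the class labels in a set of rows."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r["label"] for r in rows).values())

def best_split(rows, features):
    """Attribute selection: pick the feature whose split gives the lowest
    weighted average entropy (i.e., the highest information gain)."""
    def remainder(f):
        parts = {}
        for r in rows:
            parts.setdefault(r[f], []).append(r)
        return sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return min(features, key=remainder)

def build_tree(rows, features):
    """Recursively split the training set: choose the best attribute,
    partition the rows on it, and recurse on each smaller training set."""
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1 or not features:        # stop: pure node or no features left
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f = best_split(rows, features)
    branches = {}
    for r in rows:
        branches.setdefault(r[f], []).append(r)
    return (f, {v: build_tree(sub, [g for g in features if g != f])
                for v, sub in branches.items()})

# Hypothetical toy weather-style data with a single categorical feature.
data = [{"outlook": "sunny", "label": "no"},
        {"outlook": "sunny", "label": "no"},
        {"outlook": "overcast", "label": "yes"},
        {"outlook": "rain", "label": "yes"}]
tree = build_tree(data, ["outlook"])
```

Internal nodes are `(feature, {value: subtree})` pairs and leaves are bare class labels, so the recursion mirrors the divide-and-conquer description directly.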

Attribute Selection Measures

  • An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a data partition D into individual classes.
  • Ideally, splitting D should result in pure partitions, where all tuples in a partition belong to the same class.
  • The best splitting criterion is the one that most closely approximates this scenario.

Measures of Node Impurity

  • Entropy
  • Information Gain (used by ID3 algorithm)
  • Gini Index (used by CART)
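The Gini index mentioned above follows the standard CART formula Gini = 1 - Σ p_i², which is not spelled out in these notes; a minimal sketch:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum of squared class proportions (the standard
    CART impurity formula; 0 for a pure node)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```

A pure node scores 0, and an even two-class node scores 0.5, the maximum for two classes.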

Entropy

  • Entropy is an information theory metric measuring impurity or uncertainty in a group of observations.
  • It is used to determine how a decision tree chooses to split data and measures how informative a node is.
  • In Physics and Mathematics, entropy refers to the randomness or uncertainty of a random variable X.
  • In information theory, it refers to the impurity in a group of examples.
  • Splitting on any attribute results in the average entropy of the resulting training subsets being less than or equal to the entropy of the previous training set.
  • Information gain is this decrease in entropy.
  • It is computed as the difference between the entropy before the split and the weighted average entropy after the split.
  • The attribute with the largest information gain is chosen as the splitting attribute.

Attribute Selection Measures – Information Gain

  • ID3 uses information gain as its attribute selection measure.
  • Constructing a decision tree involves finding the attribute that returns the highest information gain (most homogeneous branches).
  • The attribute with the highest information gain is chosen as the splitting attribute for node N.

Entropy Calculation

  • Example: A = {1, 1, 1, 1, 1, 2, 2, 2, 3}
  • There are 3 distinct classes.
  • p_1 = 5/9, p_2 = 3/9, p_3 = 1/9
  • E = p_1 log(1/p_1) + p_2 log(1/p_2) + p_3 log(1/p_3)
  • Entropy of a Training Set:
    • If there are k classes, p_i (for i = 1, 2, ..., k) denotes the number of occurrences of class i divided by the total number of instances, and E = -Σ_i p_i log_2(p_i).

Examples for computing Entropy

  • Calculate the entropy values for each of the following cases:
  • E = -Σ_i p_i log_2(p_i)
  • If all the observations belong to the same class, entropy would be 0: E = -(1 × log_2(1)) = 0
  • Such a dataset does not have any impurity and would not be useful for learning.
  • However, if we have a dataset with, say, two classes, half yellow and half purple, the entropy will be one.
  • E = -((0.5 log_2(0.5)) + (0.5 log_2(0.5))) = 1
  • This kind of dataset has maximum impurity and is good for learning.
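The worked examples above (the set A, a single-class dataset, and an even two-class dataset) can be checked with a short entropy function; the entropy of A comes out to about 1.35 bits.

```python
from collections import Counter
from math import log2

def entropy(values):
    """E = -sum(p_i * log2(p_i)) over the class proportions in `values`."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

# The notes' example: A = {1, 1, 1, 1, 1, 2, 2, 2, 3}
A = [1, 1, 1, 1, 1, 2, 2, 2, 3]
print(entropy(A))                       # approximately 1.35 bits

print(entropy(["y"] * 6))               # single class: no impurity, E = 0
print(entropy(["y"] * 3 + ["p"] * 3))   # two equal classes: maximum impurity, E = 1
```
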

Attribute Selection Measures - Information Gain

  • The attribute with the highest information gain minimizes the information needed to classify the tuples in the resulting partitions.
  • Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

Attribute Selection Measures - Information Gain Formulae

  • Info(D) = -Σ_i p_i log_2(p_i) is the average amount of information needed to identify the class label of a tuple in D,
  • where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
  • Info(D) is also known as the entropy of D.
  • Info_A(D) = Σ_j (|D_j| / |D|) × Info(D_j) is the expected information still required to classify a tuple after partitioning D on attribute A into partitions D_1, ..., D_v.
  • Gain(A) = Info(D) - Info_A(D).

Attribute Selection Measures - Information Gain: Example

  • Calculate information gain for the attribute
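The notes do not reproduce the example's data table, so the sketch below computes Gain(A) = Info(D) - Info_A(D) on a hypothetical toy dataset where a "windy" attribute separates the labels perfectly, making its gain equal to the full entropy Info(D) = 1 bit.

```python
from collections import Counter
from math import log2

def info(rows):
    """Info(D): entropy of the class labels in partition D."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r["label"] for r in rows).values())

def info_a(rows, attr):
    """Info_A(D): weighted average entropy after partitioning D on attribute A."""
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return sum(len(p) / len(rows) * info(p) for p in parts.values())

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(rows) - info_a(rows, attr)

# Hypothetical toy data (the attribute name "windy" is illustrative only).
D = [{"windy": "yes", "label": "no"}, {"windy": "yes", "label": "no"},
     {"windy": "no", "label": "yes"}, {"windy": "no", "label": "yes"}]
print(gain(D, "windy"))
```

Since both partitions are pure, Info_A(D) = 0 and the gain is the entire 1 bit of entropy; any attribute with a smaller gain would be passed over at this node.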