R & C
Decision Tree
A decision tree is a supervised learning algorithm for classification and regression tasks. It splits the data into branches based on feature values, creating decision nodes until reaching final predictions, and it represents decisions and their possible outcomes in a tree-like structure, which makes the decision-making process transparent and interpretable.
Root Node
The decision-making process starts at the root node, representing the entire dataset. The model evaluates the best feature to split the data based on a chosen criterion like Gini Impurity, Information Gain (for classification), or Mean Squared Error (for regression).
How do we find the root node?
Calculate Impurity/Information Gain:
Gini Index: Measures impurity. Lower values are better.
Entropy: Measures disorder. Higher information gain is better.
Evaluate Each Feature:
For each feature, calculate the Gini Index or Entropy for all possible splits.
Determine the information gain for each split.
Select the Best Feature:
Choose the feature with the lowest Gini Index or highest information gain.
This feature becomes the root node.
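A minimal sketch of this selection step in Python, assuming a made-up toy dataset (the feature names, values, and labels below are illustrative, not from these notes):

```python
import numpy as np

def entropy(labels):
    """Entropy of a label array: -sum(p * log2(p)) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Entropy of the parent minus the weighted entropy of the child nodes."""
    parent = entropy(labels)
    weighted_children = 0.0
    for value in np.unique(feature):
        child = labels[feature == value]
        weighted_children += len(child) / len(labels) * entropy(child)
    return parent - weighted_children

# Toy data: two candidate features and a binary target (illustrative only)
outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast"])
windy   = np.array([True, False, True, False, True, False])
play    = np.array([0, 1, 0, 1, 1, 1])

gains = {
    "outlook": information_gain(outlook, play),
    "windy": information_gain(windy, play),
}
root = max(gains, key=gains.get)  # feature with the highest information gain
print(gains, "-> root node:", root)
```

Whichever feature yields the highest information gain (equivalently, the lowest weighted impurity of its children) is placed at the root.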
Splitting
The dataset is divided into subsets based on the selected feature's values, creating branches from the root node. This process continues recursively at each subsequent node.
Internal Node
These nodes represent decision points where the dataset is further split based on other features.
Leaf Node
These are terminal nodes that provide the final prediction.
For classification tasks, the leaf nodes contain class labels, while for regression tasks, they provide numerical outputs.
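A quick hedged illustration of the two kinds of leaf output, using scikit-learn with tiny made-up arrays:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [4]]  # a single toy feature

# Classification: leaves store class labels
clf = DecisionTreeClassifier(random_state=0).fit(X, ["no", "no", "yes", "yes"])
print(clf.predict([[1.5], [3.5]]))  # class labels, e.g. ['no' 'yes']

# Regression: leaves store numerical values (the mean of the training targets in that leaf)
reg = DecisionTreeRegressor(random_state=0).fit(X, [10.0, 12.0, 30.0, 33.0])
print(reg.predict([[1.5], [3.5]]))  # numerical outputs taken from the fitted leaves
```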
Stopping Criteria:
All data points in a node belong to the same class (pure node).
A maximum tree depth is reached.
Minimum samples per leaf or minimum information gain criteria are met.
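These stopping criteria map directly onto constructor parameters in scikit-learn's tree classes; a brief sketch, assuming the built-in iris dataset just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each parameter enforces one of the stopping criteria above
tree = DecisionTreeClassifier(
    max_depth=3,                 # stop when the maximum tree depth is reached
    min_samples_leaf=5,          # stop if a leaf would receive fewer samples than this
    min_impurity_decrease=0.01,  # stop if a split gains too little purity
    random_state=0,
)
tree.fit(X, y)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```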
How do we split?
Entropy
Gini Index
Pure split vs Impure split
A split is pure if all the resulting child nodes contain instances of only one class.
A split is impure if the resulting child nodes contain a mix of different classes.
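A tiny numerical check of both cases, assuming the usual Gini impurity formula 1 − Σ pᵢ² (also listed in the Gini vs Entropy table further down):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([1, 1, 1, 1]))  # 0.0 -> pure node (one class only)
print(gini([1, 1, 0, 0]))  # 0.5 -> maximally impure binary node
```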
For a huge dataset, which split method do we use?
Gini Impurity (categorical features); it is cheaper to compute than entropy.
Problem with the decision tree
Growing the tree to its full depth will cause overfitting of the data.
How do we reduce the overfitting of data?
Post Pruning
Pre Pruning
Post Pruning
We build the complete decision tree first and later cut it back (pruning).
For smaller data we will use ____ pruning
Post
Pre Pruning
We set some parameters during the construction of the tree (e.g., max features, max depth, minimum samples per split).
For huge data we will use
Pre Pruning
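A hedged sketch of both strategies in scikit-learn: pre-pruning through constructor limits set before training, and post-pruning through cost-complexity pruning (`ccp_alpha`) after growing the full tree. The dataset and the particular alpha chosen here are assumptions for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: limit the tree while it is being built
pre = DecisionTreeClassifier(max_depth=4, min_samples_split=20, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then cut it back with cost-complexity pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # one alpha picked purely for illustration
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

print("pre-pruned leaves:", pre.get_n_leaves(), "post-pruned leaves:", post.get_n_leaves())
```

In practice the alpha value would be chosen by cross-validation rather than picked from the middle of the path.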
Decision Tree vs Random Forest
| Feature | Decision Tree | Random Forest |
| --- | --- | --- |
| Algorithm Type | Single tree-based model | Ensemble of multiple decision trees |
| Overfitting | Prone to overfitting | Less prone to overfitting due to averaging |
| Bias-Variance Tradeoff | High variance, low bias | Lower variance, slightly higher bias |
| Interpretability | Easy to interpret and visualize | Harder to interpret due to multiple trees |
| Training Time | Faster training time | Slower training time due to multiple trees |
| Prediction Time | Faster prediction time | Slower prediction time due to averaging predictions |
| Accuracy | Can be less accurate due to overfitting | Generally more accurate due to ensemble approach |
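A small comparison sketch in scikit-learn; the synthetic dataset and the train/test split are assumptions made only to show the API:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Single tree: fast and interpretable, but prone to overfitting
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Ensemble: averages many randomized trees to reduce variance
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("decision tree :", tree.score(X_test, y_test))
print("random forest :", forest.score(X_test, y_test))
```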
How do we select a feature?
Using entropy (a measure of the randomness of a system).
Gini Index vs Entropy
| Aspect | Gini Index | Entropy |
| --- | --- | --- |
| Definition | Measures the impurity of a dataset. | Measures the disorder or uncertainty in a dataset. |
| Formula | Gini = 1 − Σᵢ pᵢ² | Entropy = −Σᵢ pᵢ log₂(pᵢ) |
| Range | 0 (pure) to 0.5 (maximum impurity for binary classification) | 0 (pure) to 1 (maximum disorder for binary classification) |
| Interpretation | Lower values indicate purer nodes. | Higher values indicate more disorder. |
| Calculation Complexity | Simpler and faster to compute. | More complex and computationally intensive. |
| Usage in Decision Trees | Often preferred due to computational efficiency. | Provides more information gain but is computationally heavier. |
| Sensitivity to Changes | Less sensitive to changes in the dataset. | More sensitive to changes in the dataset. |
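In scikit-learn, the choice between the two criteria is a single constructor argument; a brief sketch on the built-in iris data, used here only as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same data, two impurity criteria; the resulting trees can differ slightly
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(criterion, "-> depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```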
Types of Decision Trees - CART vs C4.5 vs ID3
| Aspect | CART | C4.5 | ID3 |
| --- | --- | --- | --- |
| Splitting Criterion | Gini Index | Information Gain Ratio | Information Gain |
| Tree Structure | Binary trees (two children per node) | Multi-way trees (multiple children per node) | Multi-way trees (multiple children per node) |
| Data Types | Continuous and categorical | Continuous and categorical | Categorical only |
| Pruning | Cost-complexity pruning | Error-based pruning | No pruning |
| Handling Missing Values | Surrogate splits | Assigns probabilities | Does not handle missing values |
| Advantages | Simple, fast, easy to interpret | Handles both data types, robust pruning | Simple, easy to understand |
| Disadvantages | Can overfit without pruning | More complex, computationally intensive | Can overfit, does not handle continuous data |
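Note that scikit-learn's trees implement an optimized version of CART, so every split is a binary "feature <= threshold" test and categorical features must be encoded numerically. A quick way to see the binary structure (iris is used only as an example):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
cart = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Every internal node is a binary "feature <= threshold" test, as in CART
print(export_text(cart, feature_names=iris.feature_names))
```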
Do we require Scaling?
Tree-based Algorithms: Algorithms like Decision Trees and Random Forests don't need scaling because they split data based on feature values, not distances.
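A quick sketch to confirm this: fitting the same tree on raw and standardized versions of a dataset should give the same predictions, since standardization preserves the ordering of feature values (iris and StandardScaler are used here just as an example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Expected to print True: the splits partition the data identically either way
print(np.array_equal(raw.predict(X), scaled.predict(X_scaled)))
```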