Module 5
Decision Tree Overview
Definition: A decision tree is a non-parametric supervised learning technique used for both classification and regression problems.
Functionality: It makes sequential, hierarchical decisions about the outcome variable based on predictor data.
Structure of Decision Tree
Components
Root Node: Beginning of the decision tree; the dataset begins to split from here based on various features.
Decision Nodes: Internal nodes created by splitting the root or another decision node; each one tests an attribute and branches on the result.
Leaf Nodes (Terminal Nodes): Nodes where no further splitting is possible.
Sub-tree: A subsection of the decision tree akin to a sub-graph.
Pruning: The process of cutting down some nodes to prevent overfitting in the model.
Algorithm Steps for Decision Tree
Begin with Root Node: Start with the complete dataset.
Attribute Selection: Use the Attribute Selection Measure (ASM) to find the best attribute in the dataset.
Subsets Creation: Split the dataset into subsets based on the values of the best attribute found.
Decision Nodes Generation: Create a decision node that tests the selected attribute.
Recursive Building: Repeat the process on each subset until no further splitting is possible; the final nodes are leaf nodes.
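The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes categorical attributes, uses information gain as the attribute selection measure, and the function names (entropy, information_gain, build_tree) and the toy rows are hypothetical, invented only for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Parent entropy minus the weighted entropy of the subsets made by attr."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


def build_tree(rows, labels, attrs):
    """Return a nested dict for a decision node, or a class label for a leaf."""
    # Leaf: the subset is pure, or there are no attributes left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Attribute selection: pick the attribute with the highest information gain.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"attribute": best, "children": {}}
    # Subset creation and recursion: one child per observed value of `best`.
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


# Hypothetical toy rows, invented for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
# {'attribute': 'outlook', 'children': {'rainy': 'no', 'sunny': 'yes'}}
```

In this toy data "outlook" separates the classes perfectly, so it becomes the root and both children are leaves.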
Attribute Selection Measures
Information Gain: Measures the change in entropy and how much information an attribute provides about the class.
Gini Index: Another measure of impurity used in creating decision trees, particularly in CART algorithms.
Information Gain Overview
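Calculation: Information Gain(S, A) = Entropy(S) - ∑v (|Sv| / |S|) * Entropy(Sv), where Sv is the subset of S in which attribute A takes the value v. In words: the entropy of the parent set minus the weighted average entropy of the subsets produced by the split.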
Entropy: A metric to measure impurity or disorder in a dataset.
Entropy Formula: Entropy(S) = -P(yes) * log2(P(yes)) - P(no) * log2(P(no))
Significance: At each node, the attribute with the highest information gain is chosen for the split.
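As a worked example of the entropy and information gain calculations, the sketch below works directly from class counts. The helper names and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy (base 2) of a class distribution given as a list of counts."""
    total = sum(counts)
    # Zero counts are skipped, following the convention 0 * log2(0) = 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain_from_counts(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy_from_counts(child) for child in child_counts
    )
    return entropy_from_counts(parent_counts) - weighted


# Hypothetical counts: 10 samples (5 yes, 5 no), split by an attribute into
# two subsets whose [yes, no] counts are [4, 1] and [1, 4].
parent = [5, 5]
children = [[4, 1], [1, 4]]
print(round(entropy_from_counts(parent), 3))                     # 1.0
print(round(information_gain_from_counts(parent, children), 3))  # 0.278
```

The parent set is evenly mixed, so its entropy is 1 bit; each subset has entropy of about 0.722, which leaves a gain of about 0.278.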
Gini Index Explanation
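Definition: Measures the impurity of a node, i.e. the probability that a randomly chosen sample would be misclassified if it were labeled at random according to the node's class distribution.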
Binary Splits Creation: Gini index is used for creating binary splits in CART algorithms.
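Gini Index Formula: Gini Index = 1 - ∑j (Pj)², where Pj is the proportion of samples belonging to class j.
Interpretation: For binary classification the Gini index ranges over [0, 0.5] (in general, from 0 up to 1 - 1/k for k classes); 0 means a pure node, and the attribute giving the lowest weighted Gini index is preferred for the split.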
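The Gini calculation can be sketched as follows. The function names and counts are hypothetical; weighted_gini shows one common way to score a binary split, by weighting each side's Gini index by its share of the samples.

```python
def gini(counts):
    """Gini index of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(left_counts, right_counts):
    """Size-weighted Gini index of the two sides of a binary split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)


print(gini([5, 5]))                             # 0.5 (evenly mixed binary node)
print(gini([10, 0]))                            # 0.0 (pure node)
print(round(weighted_gini([4, 1], [1, 4]), 2))  # 0.32
```

A split that lowers the weighted Gini index below the parent's value (here from 0.5 to 0.32) makes the child nodes purer than the parent.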
Overfitting and Tree Pruning
Overfitting: Occurs when a decision tree becomes too complex and fits noise in the training data rather than the underlying pattern, so it generalizes poorly to new data.
Mitigation Approaches:
Prepruning: Stop tree construction early if further splitting does not result in significant improvement.
Postpruning: Remove branches from a grown tree and use a separate dataset to determine the best-pruned tree.
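A prepruning rule is often just a stopping check that runs before each split. The sketch below is one possible form; the function name, the three criteria, and the default thresholds are all assumptions made for illustration, and real libraries expose their own equivalents.

```python
def should_stop(depth, n_samples, best_gain,
                max_depth=3, min_samples=5, min_gain=0.01):
    """Return True when a node should become a leaf instead of being split."""
    if depth >= max_depth:        # the tree is already deep enough
        return True
    if n_samples < min_samples:   # too few samples to split reliably
        return True
    return best_gain < min_gain   # the best split is not a significant improvement


print(should_stop(depth=1, n_samples=40, best_gain=0.20))   # False
print(should_stop(depth=1, n_samples=40, best_gain=0.001))  # True
print(should_stop(depth=3, n_samples=40, best_gain=0.20))   # True
```

A tree builder would call a check like this at every node and create a leaf whenever it returns True.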
Example Dataset for Decision Trees
Attributes: Outlook, Temperature, Humidity, and Windy are the predictors; Play Golf is the target class.
Classification Goal: Predict whether to play golf based on conditions expressed in the dataset.
Confusion Matrix Analysis
A confusion matrix is used to evaluate the performance of a classifier. It tabulates predicted against actual classes as counts of true positives, true negatives, false positives, and false negatives.
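The four counts can be tallied directly from actual and predicted labels. The sketch below assumes a binary problem with "yes" as the positive class; the labels are hypothetical, and accuracy, precision, and recall are shown as metrics commonly derived from the counts.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Hypothetical labels, invented for illustration only.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

c = confusion_counts(actual, predicted)
accuracy = (c["tp"] + c["tn"]) / len(actual)
precision = c["tp"] / (c["tp"] + c["fp"])
recall = c["tp"] / (c["tp"] + c["fn"])
print(c)                   # {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
print(round(accuracy, 3))  # 0.667
```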
Ensemble Learning Techniques
General Overview
Definition: Combining predictions from multiple models to enhance accuracy and resilience.
Types: Bagging, Boosting, Stacking, and Blending.
Comparison
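Bagging: Aims to decrease variance by training multiple models independently on bootstrap samples of the data and averaging (or voting on) their predictions.
Boosting: Aims to reduce bias by training weak learners sequentially, each one correcting the errors of its predecessors, and combining them into a strong learner.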
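To make the bagging side of the comparison concrete, here is a minimal sketch: each model is fit on its own bootstrap sample, and predictions are combined by majority vote. The base learner (a one-nearest-neighbor rule on a single numeric feature), the function names, and the data are hypothetical choices made to keep the example short.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most common prediction."""
    return Counter(predictions).most_common(1)[0][0]


def fit_nearest_neighbor(sample):
    """Hypothetical base learner: predict the label of the closest sample point."""
    return lambda x: min(sample, key=lambda row: abs(row[0] - x))[1]


def bagging_predict(train, x, fit, n_models=5, seed=0):
    """Fit n_models independently on bootstrap samples, then vote on x."""
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(train, rng)) for _ in range(n_models)]
    return majority_vote([model(x) for model in models])


# Hypothetical (feature, label) pairs, invented for illustration only.
train = [(1, "no"), (2, "no"), (3, "no"), (8, "yes"), (9, "yes"), (10, "yes")]
print(bagging_predict(train, 9.5, fit_nearest_neighbor))
```

Because every model sees a different resample, their individual errors tend to differ, and voting averages those differences out.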
Gradient Boosting Implementation
Objective: Reduce bias error by having each successive model correct the errors of the previous ones.
Key Steps:
Build a base model.
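Calculate pseudo-residuals (for squared-error loss, the actual values minus the current predictions) to measure the remaining errors.
Fit the next model to the pseudo-residuals and add its predictions, scaled by the learning rate, to the ensemble.
Repeat until acceptable accuracy is achieved or the chosen number of estimators is reached.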
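The key steps can be sketched for a regression problem with squared-error loss, where the pseudo-residuals are simply actual minus predicted values. This is an illustration under stated assumptions, not a reference implementation: the weak learner is a one-split regression stump on a single numeric feature, the inputs must contain at least two distinct values, and the function names, data, and parameter defaults are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    # Try every distinct value except the largest as a split threshold.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=20, learning_rate=0.3):
    """Return a prediction function built from a base model plus stumps."""
    base = sum(ys) / len(ys)  # step 1: the base model predicts the mean
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_estimators):
        # Step 2: pseudo-residuals for squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Step 3: fit the next model to the residuals and add a scaled correction.
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
mean_y = sum(ys) / len(ys)
model = gradient_boost(xs, ys)
print(mse(lambda x: mean_y, xs, ys))  # training error of the base model alone
print(mse(model, xs, ys))             # training error after boosting
```

The learning rate shrinks each correction, so more estimators are needed as it gets smaller; that trade-off is why the two parameters are usually tuned together.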
Parameter Tuning and Cross-Validation
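Use tools such as GridSearchCV to tune hyperparameters such as learning_rate and n_estimators.
Use K-Fold and Stratified K-Fold cross-validation to evaluate model performance: the data is divided into k folds, and each fold serves once as the validation set while the remaining folds are used for training. The stratified variant keeps the class proportions roughly equal across folds.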
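The splitting logic behind K-Fold and Stratified K-Fold can be sketched in plain Python. This illustrates the idea only and is not how scikit-learn implements it; the function names are hypothetical, and no shuffling is applied.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


def stratified_k_fold_indices(labels, k):
    """Yield folds whose class proportions roughly match the full dataset."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    # Deal each class's samples out to the folds in round-robin order.
    for indices in by_class.values():
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    for val in folds:
        held_out = set(val)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(val)


print([val for _, val in k_fold_indices(7, 3)])
# [[0, 1, 2], [3, 4], [5, 6]]

labels = ["yes", "yes", "yes", "yes", "no", "no"]
print([val for _, val in stratified_k_fold_indices(labels, 2)])
# [[0, 2, 4], [1, 3, 5]]
```

A parameter search would train and score the model once per fold for each candidate setting, then compare the average validation scores.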
Final Thoughts
A decision tree serves as a building block for various advanced algorithms, and understanding it is crucial for any data science practitioner.