Module 5

Decision Tree Overview

  • Definition: A decision tree is a non-parametric machine learning modeling technique used for regression and classification problems.

  • Functionality: It makes sequential, hierarchical decisions about the outcome variable based on predictor data.

Structure of Decision Tree

Components

  • Root Node: Beginning of the decision tree; the dataset begins to split from here based on various features.

  • Decision Nodes: Internal nodes produced by splitting the root or another decision node; each tests one attribute and branches on the result.

  • Leaf Nodes (Terminal Nodes): Nodes that are not split further; each holds the predicted class or value.

  • Sub-tree: A node together with all of its descendants, akin to a sub-graph.

  • Pruning: Removing branches that add little predictive value, in order to reduce overfitting.

Algorithm Steps for Decision Tree

  1. Begin with Root Node: Start with the complete dataset.

  2. Attribute Selection: Use the Attribute Selection Measure (ASM) to find the best attribute in the dataset.

  3. Subsets Creation: Split the dataset into subsets according to the values of the best attribute.

  4. Decision Nodes Generation: Create a decision node that tests the selected attribute.

  5. Recursive Building: Repeat the process on each subset until no further splitting is possible; the final nodes are leaf nodes.
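
The five steps above can be sketched as a short recursive function. This is a minimal ID3-style illustration that uses information gain as the attribute selection measure, not a production implementation. The function names, the tuple-based tree format, and the four-row toy dataset are all assumptions made for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Parent entropy minus the weighted entropy of the subsets split on attr."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs):
    """Return a leaf (a class label) or a decision node (attr, {value: subtree})."""
    # Leaf: every row has the same class, or no attributes remain to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the attribute with the highest information gain.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    # Steps 3-5: split into subsets and recurse on each one.
    branches = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attrs if a != best],
        )
    return (best, branches)

def predict(tree, row):
    """Walk from the root to a leaf, following the row's attribute values."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

# Hypothetical toy data, not taken from the module's dataset.
rows = [
    {"Outlook": "Sunny", "Windy": "No"},
    {"Outlook": "Sunny", "Windy": "Yes"},
    {"Outlook": "Rainy", "Windy": "No"},
    {"Outlook": "Rainy", "Windy": "Yes"},
]
labels = ["Yes", "Yes", "No", "No"]
tree = build_tree(rows, labels, ["Outlook", "Windy"])
```

In this toy data Outlook separates the classes perfectly, so it becomes the root and both branches end in leaves. The sketch raises a KeyError for attribute values it never saw during training; a fuller version would fall back to a majority class.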

Attribute Selection Measures

  • Information Gain: Measures the reduction in entropy produced by a split, that is, how much information an attribute provides about the class.

  • Gini Index: Another measure of impurity used in creating decision trees, particularly in CART algorithms.

Information Gain Overview

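  • Calculation: Information Gain(S, A) = Entropy(S) - ∑v (|Sv| / |S|) * Entropy(Sv), where Sv is the subset of S in which attribute A takes the value v. The subtracted term is the weighted average entropy of the subsets.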

  • Entropy: A metric to measure impurity or disorder in a dataset.

  • Entropy Formula: For a yes/no target, Entropy(S) = -P(yes) * log2(P(yes)) - P(no) * log2(P(no)); in general, Entropy(S) = -∑j Pj * log2(Pj) over the classes j.

  • Significance: At each node, the attribute with the highest information gain is chosen for the split.
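
A small worked calculation may make the formulas concrete. The sketch below assumes a yes/no target and represents each set as a (yes_count, no_count) pair; the counts are made up for illustration and the function names are not from any library.

```python
from math import log2

def binary_entropy(p_yes):
    """Entropy(S) for a yes/no target; a term with probability 0 contributes 0."""
    return -sum(p * log2(p) for p in (p_yes, 1 - p_yes) if p > 0)

def information_gain(parent, subsets):
    """parent and each subset are (yes_count, no_count) pairs."""
    total = sum(parent)
    # Weighted average entropy of the subsets produced by the split.
    weighted = sum(
        (yes + no) / total * binary_entropy(yes / (yes + no))
        for yes, no in subsets
    )
    return binary_entropy(parent[0] / total) - weighted

parent = (4, 4)                                          # entropy 1.0
gain_pure = information_gain(parent, [(4, 0), (0, 4)])   # perfect split
gain_mixed = information_gain(parent, [(3, 1), (1, 3)])  # partial split
gain_none = information_gain(parent, [(2, 2), (2, 2)])   # uninformative split
```

The perfect split recovers the full 1.0 bit of entropy, the partial split gains roughly 0.19, and the uninformative split gains nothing, so a tree builder would prefer them in that order.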

Gini Index Explanation

  • Definition: Measures how mixed the classes are within a node; a value of 0 means the node is pure.

  • Binary Splits Creation: Gini index is used for creating binary splits in CART algorithms.

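  • Gini Index Formula: Gini Index = 1 - ∑j Pj², where Pj is the proportion of class j in the node.

  • Interpretation: Ranges from 0 (pure node) to 0.5 for two classes, and up to 1 - 1/k for k classes; the attribute whose split gives the lowest weighted Gini index is preferred.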
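
The formula and its range can be checked with a few lines of code. This is a sketch with illustrative function names and made-up label lists; the weighted form for a binary split is how CART-style algorithms typically compare candidate splits.

```python
from collections import Counter

def gini(labels):
    """Gini index = 1 - sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a binary split: child Gini values weighted by child size."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

pure = gini(["Yes", "Yes", "Yes", "Yes"])          # one class only
balanced = gini(["Yes", "No"])                     # two classes, evenly mixed
three_way = gini(["a", "b", "c"])                  # three classes, evenly mixed
split = weighted_gini(["Yes", "Yes", "Yes", "No"],
                      ["No", "No", "No", "Yes"])
```

The evenly mixed three-class case exceeds 0.5, which shows that the 0.5 ceiling applies only to two-class problems.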

Overfitting and Tree Pruning

  • Overfitting: Occurs when a decision tree becomes too complex and fits noise in the training data rather than the underlying pattern, so it generalizes poorly to new data.

  • Mitigation Approaches:

    • Prepruning: Stop tree construction early if further splitting does not result in significant improvement.

    • Postpruning: Remove branches from a grown tree and use a separate dataset to determine the best-pruned tree.
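
One simple form of postpruning is reduced-error pruning, sketched below. It works bottom-up and replaces a subtree with a leaf whenever the leaf makes no more mistakes on a held-out validation set. The dictionary-based tree format, the "majority" field, and the sample tree and validation rows are assumptions made for this example; other postpruning schemes (such as cost-complexity pruning) work differently.

```python
def predict(node, row):
    """Follow branches until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        # Unseen attribute values fall back to the node's majority class.
        node = node["branches"].get(row[node["attr"]], node["majority"])
    return node

def errors(node, rows, labels):
    """Number of validation rows the (sub)tree misclassifies."""
    return sum(predict(node, r) != y for r, y in zip(rows, labels))

def prune(node, rows, labels):
    """Reduced-error pruning using only the validation rows that reach node.
    Note: this modifies the tree in place and also returns it."""
    if not isinstance(node, dict):
        return node
    for value, child in list(node["branches"].items()):
        keep = [i for i, r in enumerate(rows) if r[node["attr"]] == value]
        node["branches"][value] = prune(
            child, [rows[i] for i in keep], [labels[i] for i in keep]
        )
    leaf = node["majority"]
    # Collapse to a leaf if that does not increase validation errors.
    if errors(leaf, rows, labels) <= errors(node, rows, labels):
        return leaf
    return node

# Hypothetical over-grown tree: the Windy test under Sunny only fits noise.
tree = {
    "attr": "Outlook", "majority": "Yes",
    "branches": {
        "Sunny": {"attr": "Windy", "majority": "Yes",
                  "branches": {"Yes": "No", "No": "Yes"}},
        "Rainy": "No",
    },
}
val_rows = [
    {"Outlook": "Sunny", "Windy": "Yes"},
    {"Outlook": "Sunny", "Windy": "No"},
    {"Outlook": "Rainy", "Windy": "No"},
    {"Outlook": "Rainy", "Windy": "Yes"},
]
val_labels = ["Yes", "Yes", "No", "No"]
pruned = prune(tree, val_rows, val_labels)
```

Here the Windy subtree is collapsed to the leaf "Yes" because it misclassifies one validation row that the leaf gets right, while the Outlook split at the root is kept because removing it would add errors.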

Example Dataset for Decision Trees

  • Attributes: Outlook, Temperature, Humidity, and Windy are the predictors; Play Golf is the target.

  • Classification Goal: Predict whether to play golf based on conditions expressed in the dataset.

Confusion Matrix Analysis

  • A table of predicted versus actual classes used to evaluate a classifier; it counts true positives, true negatives, false positives, and false negatives.
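
The four counts, and the usual metrics derived from them, can be computed directly. The sketch below assumes a binary yes/no target with "Yes" as the positive class; the label lists are invented for illustration.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, TN, FP, FN for a binary classification."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

def metrics(c):
    """Accuracy, precision, and recall; 0.0 when a denominator is zero."""
    total = c["TP"] + c["TN"] + c["FP"] + c["FN"]
    pred_pos = c["TP"] + c["FP"]
    actual_pos = c["TP"] + c["FN"]
    return {
        "accuracy": (c["TP"] + c["TN"]) / total if total else 0.0,
        "precision": c["TP"] / pred_pos if pred_pos else 0.0,
        "recall": c["TP"] / actual_pos if actual_pos else 0.0,
    }

# Hypothetical labels for eight test examples.
actual = ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"]
counts = confusion_counts(actual, predicted)
scores = metrics(counts)
```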

Ensemble Learning Techniques

General Overview

  • Definition: Combining predictions from multiple models to enhance accuracy and resilience.

  • Types: Bagging, Boosting, Stacking, and Blending.

Comparison

  • Bagging: Aims to decrease variance by training multiple models independently, typically on bootstrap samples of the data, and averaging or voting over their predictions.

  • Boosting: Aims to reduce bias by converting weak learners into strong learners by sequentially correcting errors.
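
To illustrate the bagging side of this comparison, the sketch below trains several weak learners independently on bootstrap samples and combines them by majority vote. The threshold "stump" learner, the 0/1 labels, and the one-feature toy data are assumptions chosen to keep the example short.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Hypothetical weak learner: the best single threshold on one feature."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            wrong = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or wrong < best[0]:
                best = (wrong, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train each model independently on a bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(xs) indices with replacement.
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the independent models by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Hypothetical one-feature data: class 0 for small x, class 1 for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys)
```

Each stump sees a slightly different sample and may place its threshold differently; voting smooths out those differences, which is the variance reduction bagging aims for. Boosting, by contrast, trains its models one after another rather than independently.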

Gradient Boosting Implementation

  • Objective: Minimize bias error via successive models rectifying previous errors.

  • Key Steps:

    1. Build a base model.

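    2. Calculate the pseudo-residuals, the negative gradient of the loss with respect to the current predictions; for squared error these are the actual values minus the predicted values.

    3. Fit the next model to the pseudo-residuals and add its predictions, scaled by the learning rate, to the ensemble. (Reweighting the training samples is how AdaBoost works, not gradient boosting.)

    4. Repeat until the chosen number of estimators is reached or the validation error stops improving.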
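
The key steps can be sketched for regression with squared-error loss, where the pseudo-residuals are simply the actual values minus the current predictions. The one-split regression stump, the function names, and the six-point dataset are assumptions for illustration; library implementations use deeper trees and more general loss functions.

```python
def fit_stump(xs, residuals):
    """Regression stump: one threshold, predicting the mean residual per side."""
    best = None
    for t in sorted(set(xs))[:-1]:   # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                  # Step 1: base model is the mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):             # Step 4: repeat
        # Step 2: pseudo-residuals for squared error.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Step 3: fit the next model to the residuals and add it in.
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
baseline = lambda x: sum(ys) / len(ys)
```

Comparing mse(model, xs, ys) with mse(baseline, xs, ys) shows the training error falling as stumps are added. A smaller learning_rate generally needs a larger n_estimators, which is why the two are usually tuned together.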

Parameter Tuning and Cross-Validation

  • Use a search tool such as scikit-learn's GridSearchCV to tune hyperparameters such as learning_rate and n_estimators.

  • Use K-Fold or Stratified K-Fold cross-validation to evaluate the model on several different train/test partitions of the data; the stratified variant keeps the class proportions similar in every fold.
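
How the folds are formed can be sketched without any library. The functions below are simplified illustrations, not the scikit-learn implementations: they assign samples to folds round-robin and do no shuffling, and the label list is made up.

```python
from collections import defaultdict

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for n, f in enumerate(folds) if n != i for j in f]
        yield train, folds[i]

def stratified_k_fold_indices(labels, k):
    """Deal each class's samples round-robin so folds keep class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    pos = 0
    for idxs in by_class.values():
        for i in idxs:
            folds[pos % k].append(i)
            pos += 1
    for i in range(k):
        train = [j for n, f in enumerate(folds) if n != i for j in f]
        yield train, folds[i]

# Hypothetical imbalanced labels: six "Yes" and three "No".
labels = ["Yes"] * 6 + ["No"] * 3
plain = list(k_fold_indices(len(labels), 3))
stratified = list(stratified_k_fold_indices(labels, 3))
```

With these labels every stratified test fold contains two "Yes" and one "No", matching the 2:1 ratio of the full dataset. A grid search would run this loop once per candidate parameter combination and keep the combination with the best average score.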

Final Thoughts

  • A decision tree serves as a building block for various advanced algorithms, and understanding it is crucial for any data science practitioner.