Decision Tree Notes

Introduction to Decision Trees

What is Machine Learning?

  • Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without explicit programming.
  • It involves learning, predicting, and deciding based on data.
  • Helps in:
    • Describing data in new ways.
    • Categorizing data for better organization and retrieval.
    • Analyzing data to discover previously unseen patterns.
    • Recognition tasks (e.g., facial recognition, driver recognition, automated cars).

Types of Machine Learning

  • Supervised Learning:
    • The data and the answers (target variable) are already known.
    • Example: Predicting loan defaults based on historical loan data.
  • Unsupervised Learning:
    • Only the data is available, without pre-defined answers.
    • Used to group similar data points together.
    • Example: Grouping photos of trees and houses without prior knowledge of what they are.
  • Reinforcement Learning:
    • Data is received sequentially, one piece at a time, and the system learns based on feedback (positive or negative).
    • Similar to how humans learn from experiences.

Machine Learning Problems

  • Classification:
    • Problems with categorical solutions (yes/no, true/false, 0/1).
    • Example: Determining if an item belongs to a specific group.
  • Regression:
    • Problems involving the prediction of continuous values.
    • Example: Predicting product prices based on historical data.
  • Clustering:
    • Organizing data to find specific patterns.
    • Example: Product recommendation systems that group products based on user behavior.

Decision Trees

  • A tree-shaped diagram used to determine a course of action.
  • Each branch represents a possible decision or occurrence.
  • Primarily used for classification problems.
Classification Tools:
  • Naive Bayes and Logistic Regression: Suitable for simpler datasets.
  • Decision Trees: Effective for more complex data.
  • Random Forests: Used for very large datasets; a decision tree is an integral part of a random forest.

How to Identify a Random Vegetable

  • Start by asking questions to classify the vegetable.
    • Is it red?
      • If no, it might be an eggplant (purple).
      • If yes, proceed to the next question.
    • Is the diameter greater than two?
      • If no, it might be a red chili.
      • If yes, it might be a red bell pepper (capsicum).
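The question sequence above can be sketched as a small if/then function. This is a hypothetical, hand-written version of the rules a trained decision tree would encode, not output from any library:

```python
def classify_vegetable(is_red: bool, diameter: float) -> str:
    """Walk the two-question tree from the notes above."""
    if not is_red:
        # Not red: the example guesses a purple vegetable.
        return "eggplant"
    if diameter > 2:
        # Red and wide: likely a bell pepper.
        return "red bell pepper (capsicum)"
    # Red and narrow: likely a chili.
    return "red chili"

print(classify_vegetable(is_red=True, diameter=3.0))
```

Each `if` corresponds to one decision node; each `return` is a leaf.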

Problems Solved by Decision Trees

  • Classification:
    • Classification trees use logical if-then conditions to assign data to classes.
    • Example: Discriminating between types of flowers based on different features.
  • Regression:
    • Regression trees are used when the target variable is numerical.
    • A regression model is fit to the target variable using independent variables.
    • Each split is made based on the sum of squared error.

Advantages of Decision Trees

  • Simple to understand, interpret, and visualize.
  • Requires minimal data preparation.
  • Can handle both numerical and categorical data.
  • Nonlinear relationships between features do not affect the tree's performance.

Disadvantages of Decision Trees

  • Overfitting:
    • Occurs when the algorithm captures noise in the data, leading to solutions specific to the training data rather than general solutions.
  • High Variance:
    • The model can become unstable due to small variations in the data.
  • Low Bias:
    • Highly complex decision trees fit the training data very closely (low bias); combined with their high variance, this makes it difficult for the model to generalize to new data.

Decision Tree Terminology

  • Entropy:
    • Measure of randomness or unpredictability in the dataset.
    • High entropy indicates a mixed dataset where it is difficult to predict the class.
  • Information Gain:
    • Measure of the decrease in entropy after the dataset is split.
    • Splitting the data into subgroups reduces entropy and increases information gain.
    • Information Gain = Entropy(before split) - Entropy(after split)
  • Leaf Node:
    • Contains a classification or decision.
    • Represents the final outcome.
  • Decision Node:
    • Has two or more branches.
    • Splits the data into different parts.
  • Root Node:
    • The topmost decision node.
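The entropy and information-gain definitions above can be computed directly. Note one assumption this sketch makes explicit: the after-split entropy is conventionally the subset-size-weighted average of the children's entropies:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    after = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - after

# A perfectly mixed two-class node has entropy 1 bit;
# a pure split recovers all of it as information gain.
parent = ["yes", "yes", "no", "no"]
print(entropy(parent))                                           # 1.0
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A pure node (all one class) has entropy 0, which is why leaf nodes stop splitting.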

How a Decision Tree Works

  • The goal is to classify different types of data (e.g., animals) based on their features using a decision tree.
  • The process involves framing conditions to split the data in such a way that the information gain is maximized.
  • Entropy Formula:
    • Entropy = -Σ_{i=1}^{K} P_i * log2(P_i)
    • Where K is the number of classes and P_i is the proportion of samples in class i.
Steps:
  1. Calculate the entropy for the current dataset.
  2. Choose a condition that yields the highest information gain.
  3. Split the data based on the selected condition.
  4. Repeat the process for each branch until the entropy reaches a minimum value.

  • Example: Classifying animals by color and height:
    • The initial dataset has high entropy, with mixed animals (giraffes, tigers, monkeys, elephants).
    • Splitting the data based on color (e.g., yellow) reduces the entropy.
    • Further splitting based on height separates the animals into distinct groups.
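A minimal sketch of step 2 (choosing the condition with the highest information gain), using hypothetical animal records loosely modelled on the example above — the specific values are made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(records, feature_index):
    """Information gain from splitting on the given feature column."""
    labels = [r[-1] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[feature_index], []).append(r[-1])
    # Weighted entropy of the child nodes after the split.
    after = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - after

# Hypothetical records: (color, height, species).
animals = [
    ("yellow", "tall",  "giraffe"), ("yellow", "tall",  "giraffe"),
    ("yellow", "short", "tiger"),   ("yellow", "short", "tiger"),
    ("brown",  "short", "monkey"),  ("grey",   "tall",  "elephant"),
]

# Compare candidate splits; the tree greedily takes the larger gain first.
print(split_gain(animals, 0))  # gain from splitting on color
print(split_gain(animals, 1))  # gain from splitting on height
```

On this toy data the color split yields the higher gain, matching the example: split on color first, then on height.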

Use Case: Loan Repayment Prediction

  • Goal: Predict whether a customer will repay a loan.
  • Algorithm: Decision Tree.
Implementation Steps in Python:
  1. Import necessary packages:
    • numpy (as np) for numerical operations.
    • pandas (as pd) for data manipulation using DataFrames.
    • train_test_split from sklearn.model_selection to split the data into training and testing sets.
    • DecisionTreeClassifier from sklearn.tree for building the decision tree.
    • accuracy_score from sklearn.metrics for evaluating the model.
    • tree from sklearn to call the tree classifier.
  2. Load the data:
    • Use pd.read_csv() to load the dataset from a CSV file.
    • File path needs to be specified correctly.
  3. Explore the data:
    • Print the length of the dataset using len(data).
    • Print the shape of the dataset (number of rows and columns) using data.shape.
    • Display the first few rows of the dataset using data.head().
  4. Prepare the data:
    • Separate the data into features (X) and target (Y).
    • X contains the data used for prediction (e.g., initial payment, last payment, credit score, house number).
    • Y contains the target variable (whether the loan was repaid or not).
  5. Split the data into training and testing sets:
    • Use train_test_split(X, Y, test_size=0.3, random_state=100) to split the data.
    • test_size specifies the proportion of data to be used for testing (e.g., 0.3 means 30% for testing).
    • random_state ensures the split is reproducible.
  6. Train the decision tree:
    • Create a DecisionTreeClassifier object with specified parameters (e.g., criterion='entropy', random_state=100, max_depth=3, min_samples_leaf=5).
    • Fit the model to the training data using clf_entropy.fit(X_train, Y_train).
  7. Make predictions:
    • Use the trained model to make predictions on the test data using Y_pred = clf_entropy.predict(X_test).
  8. Evaluate the model:
    • Calculate the accuracy score using accuracy_score(Y_test, Y_pred). Multiply by 100 to express as a percentage.
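The eight steps can be combined into one runnable sketch. The original CSV and its path are not given in the notes, so this substitutes a synthetic dataset with hypothetical column names (initial_payment, last_payment, credit_score, house_number) and a made-up repayment rule; only the split parameters and classifier settings come from the steps above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2 substitute: synthetic data in place of the unavailable CSV.
rng = np.random.RandomState(100)
n = 500
data = pd.DataFrame({
    "initial_payment": rng.randint(100, 500, n),
    "last_payment": rng.randint(100, 500, n),
    "credit_score": rng.randint(300, 850, n),
    "house_number": rng.randint(1000, 9999, n),
})
# Made-up rule: good credit and a large last payment -> loan repaid.
data["result"] = np.where(
    (data["credit_score"] > 600) & (data["last_payment"] > 200), "yes", "no")

# Step 3: explore the data.
print(len(data), data.shape)
print(data.head())

# Step 4: separate features (X) and target (Y).
X = data[["initial_payment", "last_payment", "credit_score", "house_number"]]
Y = data["result"]

# Step 5: 70/30 train/test split, reproducible via random_state.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)

# Step 6: train an entropy-based tree with the parameters from the notes.
clf_entropy = DecisionTreeClassifier(
    criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, Y_train)

# Steps 7-8: predict and evaluate.
Y_pred = clf_entropy.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, Y_pred) * 100)
```

Because the synthetic target depends only on credit_score and last_payment, a depth-3 tree recovers the rule almost exactly; house_number, as expected, carries no signal.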

Conclusion

  • Decision trees are a powerful tool for classification and regression problems in machine learning.
  • They are easy to understand, interpret, and visualize.
  • Python and its machine learning libraries (e.g., scikit-learn) provide tools to implement decision trees effectively.