Decision Tree Notes

Introduction to Decision Trees

What is Machine Learning?

  • Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without explicit programming.
  • It involves learning, predicting, and deciding based on data.
  • Helps in:
    • Describing data in new ways.
    • Categorizing data for better organization and retrieval.
    • Analyzing data to discover previously unseen patterns.
    • Recognition tasks (e.g., facial recognition, driver recognition, automated cars).

Types of Machine Learning

  • Supervised Learning:
    • The data and the answers (target variable) are already known.
    • Example: Predicting loan defaults based on historical loan data.
  • Unsupervised Learning:
    • Only the data is available, without pre-defined answers.
    • Used to group similar data points together.
    • Example: Grouping photos of trees and houses without prior knowledge of what they are.
  • Reinforcement Learning:
    • Data is received sequentially, one piece at a time, and the system learns based on feedback (positive or negative).
    • Similar to how humans learn from experiences.

Machine Learning Problems

  • Classification:
    • Problems with categorical solutions (yes/no, true/false, 0/1).
    • Example: Determining if an item belongs to a specific group.
  • Regression:
    • Problems involving the prediction of continuous values.
    • Example: Predicting product prices based on historical data.
  • Clustering:
    • Organizing data to find specific patterns.
    • Example: Product recommendation systems that group products based on user behavior.

Decision Trees

  • A tree-shaped diagram used to determine a course of action.
  • Each branch represents a possible decision or occurrence.
  • Primarily used for classification problems.
Classification Tools:
  • Naive Bayes and Logistic Regression: Suitable for simpler datasets.
  • Decision Trees: Effective for more complex data.
  • Random Forests: Used for very large datasets; a decision tree is an integral part of a random forest.

How to Identify a Random Vegetable

  • Start by asking questions to classify the vegetable.
    • Is it red?
      • If no, it might be an eggplant (purple).
      • If yes, proceed to the next question.
    • Is the diameter greater than two?
      • If no, it might be a red chili.
      • If yes, it might be a red bell pepper (capsicum).
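The question sequence above can be sketched as a small if/then function. This is a hypothetical, hand-written version of the rules a trained decision tree would encode, not output from any library:

```python
def classify_vegetable(is_red: bool, diameter: float) -> str:
    """Walk the two-question tree from the notes above."""
    if not is_red:
        # Not red: the example guesses a purple vegetable.
        return "eggplant"
    if diameter > 2:
        # Red and wide: likely a bell pepper.
        return "red bell pepper (capsicum)"
    # Red and narrow: likely a chili.
    return "red chili"

print(classify_vegetable(is_red=True, diameter=3.0))
```

Each `if` corresponds to one decision node; each `return` is a leaf.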

Problems Solved by Decision Trees

  • Classification:
    • Classification trees use logical if-then conditions to assign data to classes.
    • Example: Discriminating between types of flowers based on different features.
  • Regression:
    • Regression trees are used when the target variable is numerical.
    • A regression model is fit to the target variable using independent variables.
    • Each split is made based on the sum of squared error.

Advantages of Decision Trees

  • Simple to understand, interpret, and visualize.
  • Requires minimal data preparation.
  • Can handle both numerical and categorical data.
  • Nonlinear relationships between features do not affect the tree's performance.

Disadvantages of Decision Trees

  • Overfitting:
    • Occurs when the algorithm captures noise in the data, leading to solutions specific to the training data rather than general solutions.
  • High Variance:
    • The model can become unstable due to small variations in the data.
  • Low Bias:
    • Highly complex decision trees fit the training data very closely (low bias); combined with their high variance, this makes it difficult for the model to generalize to new data.

Decision Tree Terminology

  • Entropy:
    • Measure of randomness or unpredictability in the dataset.
    • High entropy indicates a mixed dataset where it is difficult to predict the class.
  • Information Gain:
    • Measure of the decrease in entropy after the dataset is split.
    • Splitting the data into subgroups reduces entropy and increases information gain.
    • Information Gain = Entropy(before split) - Entropy(after split)
  • Leaf Node:
    • Contains a classification or decision.
    • Represents the final outcome.
  • Decision Node:
    • Has two or more branches.
    • Splits the data into different parts.
  • Root Node:
    • The topmost decision node.
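The entropy and information-gain definitions above can be computed directly. Note one assumption this sketch makes explicit: the after-split entropy is conventionally the subset-size-weighted average of the children's entropies:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    after = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - after

# A perfectly mixed two-class node has entropy 1 bit;
# a pure split recovers all of it as information gain.
parent = ["yes", "yes", "no", "no"]
print(entropy(parent))                                           # 1.0
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A pure node (all one class) has entropy 0, which is why leaf nodes stop splitting.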

How a Decision Tree Works

  • The goal is to classify different types of data (e.g., animals) based on their features using a decision tree.
  • The process involves framing conditions to split the data in such a way that the information gain is maximized.
  • Entropy Formula:
    • Entropy = -Σ_{i=1}^{K} P_i * log2(P_i)
    • Where K is the number of classes and P_i is the proportion of samples in class i.
Steps:
  1. Calculate the entropy for the current dataset.
  2. Choose a condition that yields the highest information gain.
  3. Split the data based on the selected condition.
  4. Repeat the process for each branch until the entropy reaches a minimum value.

  • Example: Classifying animals by color and height:
    • The initial dataset has high entropy, with mixed animals (giraffes, tigers, monkeys, elephants).
    • Splitting the data based on color (e.g., yellow) reduces the entropy.
    • Further splitting based on height separates the animals into distinct groups.
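A minimal sketch of step 2 (choosing the condition with the highest information gain), using hypothetical animal records loosely modelled on the example above — the specific values are made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(records, feature_index):
    """Information gain from splitting on the given feature column."""
    labels = [r[-1] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[feature_index], []).append(r[-1])
    # Weighted entropy of the child nodes after the split.
    after = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - after

# Hypothetical records: (color, height, species).
animals = [
    ("yellow", "tall",  "giraffe"), ("yellow", "tall",  "giraffe"),
    ("yellow", "short", "tiger"),   ("yellow", "short", "tiger"),
    ("brown",  "short", "monkey"),  ("grey",   "tall",  "elephant"),
]

# Compare candidate splits; the tree greedily takes the larger gain first.
print(split_gain(animals, 0))  # gain from splitting on color
print(split_gain(animals, 1))  # gain from splitting on height
```

On this toy data the color split yields the higher gain, matching the example: split on color first, then on height.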

Use Case: Loan Repayment Prediction

  • Goal: Predict whether a customer will repay a loan.
  • Algorithm: Decision Tree.
Implementation Steps in Python:
  1. Import necessary packages:
    • numpy (as np) for numerical operations.
    • pandas (as pd) for data manipulation using DataFrames.
    • train_test_split from sklearn.model_selection to split the data into training and testing sets.
    • DecisionTreeClassifier from sklearn.tree for building the decision tree.
    • accuracy_score from sklearn.metrics for evaluating the model.
    • tree from sklearn to call the tree classifier.
  2. Load the data:
    • Use pd.read_csv() to load the dataset from a CSV file.
    • File path needs to be specified correctly.
  3. Explore the data:
    • Print the length of the dataset using len(data).
    • Print the shape of the dataset (number of rows and columns) using data.shape.
    • Display the first few rows of the dataset using data.head().
  4. Prepare the data:
    • Separate the data into features (X) and target (Y).
    • X contains the data used for prediction (e.g., initial payment, last payment, credit score, house number).
    • Y contains the target variable (whether the loan was repaid or not).
  5. Split the data into training and testing sets:
    • Use train_test_split(X, Y, test_size=0.3, random_state=100) to split the data.
    • test_size specifies the proportion of data to be used for testing (e.g., 0.3 means 30% for testing).
    • random_state ensures the split is reproducible.
  6. Train the decision tree:
    • Create a DecisionTreeClassifier object with specified parameters (e.g., criterion='entropy', random_state=100, max_depth=3, min_samples_leaf=5).
    • Fit the model to the training data using clf_entropy.fit(X_train, Y_train).
  7. Make predictions:
    • Use the trained model to make predictions on the test data using Y_pred = clf_entropy.predict(X_test).
  8. Evaluate the model:
    • Calculate the accuracy score using accuracy_score(Y_test, Y_pred). Multiply by 100 to express as a percentage.
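The eight steps can be combined into one runnable sketch. The original CSV and its path are not given in the notes, so this substitutes a synthetic dataset with hypothetical column names (initial_payment, last_payment, credit_score, house_number) and a made-up repayment rule; only the split parameters and classifier settings come from the steps above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2 substitute: synthetic data in place of the unavailable CSV.
rng = np.random.RandomState(100)
n = 500
data = pd.DataFrame({
    "initial_payment": rng.randint(100, 500, n),
    "last_payment": rng.randint(100, 500, n),
    "credit_score": rng.randint(300, 850, n),
    "house_number": rng.randint(1000, 9999, n),
})
# Made-up rule: good credit and a large last payment -> loan repaid.
data["result"] = np.where(
    (data["credit_score"] > 600) & (data["last_payment"] > 200), "yes", "no")

# Step 3: explore the data.
print(len(data), data.shape)
print(data.head())

# Step 4: separate features (X) and target (Y).
X = data[["initial_payment", "last_payment", "credit_score", "house_number"]]
Y = data["result"]

# Step 5: 70/30 train/test split, reproducible via random_state.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)

# Step 6: train an entropy-based tree with the parameters from the notes.
clf_entropy = DecisionTreeClassifier(
    criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, Y_train)

# Steps 7-8: predict and evaluate.
Y_pred = clf_entropy.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, Y_pred) * 100)
```

Because the synthetic target depends only on credit_score and last_payment, a depth-3 tree recovers the rule almost exactly; house_number, as expected, carries no signal.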

Conclusion

  • Decision trees are a powerful tool for classification and regression problems in machine learning.
  • They are easy to understand, interpret, and visualize.
  • Python and its machine learning libraries (e.g., scikit-learn) provide tools to implement decision trees effectively.