Decision Tree Notes
Introduction to Decision Trees
What is Machine Learning?
- Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without explicit programming.
- It involves learning, predicting, and deciding based on data.
- Helps in:
- Describing data in new ways.
- Categorizing data for better organization and retrieval.
- Analyzing data to discover previously unseen patterns.
- Recognition tasks (facial, driver, and automated car recognition).
Types of Machine Learning
- Supervised Learning:
- The data and the answers (target variable) are already known.
- Example: Predicting loan defaults based on historical loan data.
- Unsupervised Learning:
- Only the data is available, without pre-defined answers.
- Used to group similar data points together.
- Example: Grouping photos of trees and houses without prior knowledge of what they are.
- Reinforcement Learning:
- Data is received sequentially, one piece at a time, and the system learns based on feedback (positive or negative).
- Similar to how humans learn from experiences.
Machine Learning Problems
- Classification:
- Problems with categorical solutions (yes/no, true/false, 0/1).
- Example: Determining if an item belongs to a specific group.
- Regression:
- Problems involving the prediction of continuous values.
- Example: Predicting product prices based on historical data.
- Clustering:
- Organizing data to find specific patterns.
- Example: Product recommendation systems that group products based on user behavior.
Decision Trees
- A tree-shaped diagram used to determine a course of action.
- Each branch represents a possible decision or occurrence.
- Primarily used for classification problems.
Classification Tools:
- Naive Bayes and Logistic Regression: Suitable for simpler datasets.
- Decision Trees: Effective for more complex data.
- Random Forests: Used for very large datasets; a decision tree is an integral part of a random forest.
How to Identify a Random Vegetable
- Start by asking questions to classify the vegetable.
- Is it red?
- If no, it might be an eggplant (purple).
- If yes, proceed to the next question.
- Is the diameter greater than two?
- If no, it might be a red chili.
- If yes, it might be a red bell pepper (capsicum).
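The question sequence above is just a chain of if-then checks. A minimal sketch in Python (the function name and the threshold's unit are assumptions; the notes only say "greater than two"):

```python
def identify_vegetable(is_red: bool, diameter: float) -> str:
    """Classify a vegetable using the two questions from the notes."""
    if not is_red:
        return "eggplant"  # not red -> likely the purple eggplant
    if diameter > 2:       # unit left unspecified in the notes
        return "red bell pepper (capsicum)"
    return "red chili"

print(identify_vegetable(is_red=True, diameter=1.0))  # red chili
```

Each question corresponds to a decision node, and each returned label to a leaf node.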
Problems Solved by Decision Trees
- Classification:
- Classification trees use logical if-then conditions to classify problems.
- Example: Discriminating between types of flowers based on different features.
- Regression:
- Regression trees are used when the target variable is numerical.
- A regression model is fit to the target variable using independent variables.
- Each split is made based on the sum of squared error.
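The sum-of-squared-error criterion can be sketched in pure Python: for each candidate threshold on a feature, compute the SSE of each child node around its mean, and keep the threshold with the lowest total. The toy data below is illustrative, chosen so the best split is obvious:

```python
def sse(values):
    """Sum of squared errors around the mean of a leaf."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Try each threshold on a single feature; return the split
    with the lowest total SSE across the two child nodes."""
    best = (None, float("inf"))
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy data: y jumps when x crosses 3, so the best threshold is 3.
x = [1, 2, 3, 4, 5, 6]
y = [10, 11, 10, 30, 31, 29]
print(best_split(x, y))
```

A real regression tree repeats this search over every feature at every node; scikit-learn's `DecisionTreeRegressor` does this with `criterion='squared_error'`.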
Advantages of Decision Trees
- Simple to understand, interpret, and visualize.
- Requires minimal data preparation.
- Can handle both numerical and categorical data.
- Nonlinear relationships between features and the target do not degrade its performance.
Disadvantages of Decision Trees
- Overfitting:
- Occurs when the algorithm captures noise in the data, leading to solutions specific to the training data rather than general solutions.
- High Variance:
- The model can become unstable due to small variations in the data.
- Low Bias:
- Highly complex decision trees fit the training data very closely (low bias), but combined with high variance this makes it difficult for the model to generalize to new data.
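Overfitting can be demonstrated on synthetic data: an unrestricted tree memorizes label noise in the training set, while a depth-limited tree does not. This is a sketch, not part of the original notes; the data and parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) that a deep tree will memorize.
X, y = make_classification(n_samples=400, n_features=8, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3,
                                 random_state=0).fit(X_train, y_train)

# The unrestricted tree fits the training data (noise included) almost
# perfectly, but that advantage does not carry over to unseen data.
for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```

Parameters like `max_depth` and `min_samples_leaf` (used later in the loan example) are the standard way to limit this in scikit-learn.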
Decision Tree Terminology
- Entropy:
- Measure of randomness or unpredictability in the dataset.
- High entropy indicates a mixed dataset where it is difficult to predict the class.
- Information Gain:
- Measure of the decrease in entropy after the dataset is split.
- Splitting the data into subgroups reduces entropy and increases information gain.
- Leaf Node:
- Contains a classification or decision.
- Represents the final outcome.
- Decision Node:
- Has two or more branches.
- Splits the data into different parts.
- Root Node:
- The topmost decision node.
How a Decision Tree Works
- The goal is to classify different types of data (e.g., animals) based on their features using a decision tree.
- The process involves framing conditions to split the data in such a way that the information gain is maximized.
- Entropy Formula:
- Entropy = −Σ pᵢ log₂(pᵢ), summed over i = 1 to n.
- Where n is the number of classes and pᵢ is the proportion of samples in class i.
Steps:
- Calculate the entropy for the current dataset.
- Choose a condition that yields the highest information gain.
- Split the data based on the selected condition.
- Repeat the process for each branch until the entropy reaches a minimum value.
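The steps above can be sketched with small entropy and information-gain helpers in pure Python. The animal split below is illustrative, following the color example in the notes:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the classes present."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy drop from splitting `parent` into the `children` subsets."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

animals = ["giraffe", "giraffe", "tiger", "tiger", "monkey", "elephant"]
# Hypothetical split on color: yellow animals vs. the rest.
yellow = ["giraffe", "giraffe", "tiger", "tiger"]
not_yellow = ["monkey", "elephant"]
print(round(information_gain(animals, [yellow, not_yellow]), 3))  # -> 0.918
```

At each decision node the algorithm evaluates candidate conditions this way and keeps the one with the highest information gain.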
- Example: Classifying animals with color and height.
- Initial dataset has high entropy, with mixed animals (giraffes, tigers, monkeys, elephants).
- Splitting the data based on color (e.g., Yellow) reduces the entropy.
- Further splitting based on height separates animals into distinct groups.
Use Case: Loan Repayment Prediction
- Goal: Predict whether a customer will repay a loan.
- Algorithm: Decision Tree.
Implementation Steps in Python:
- Import necessary packages:
- numpy (as np) for numerical operations.
- pandas (as pd) for data manipulation using DataFrames.
- train_test_split from sklearn.model_selection to split the data into training and testing sets.
- DecisionTreeClassifier from sklearn.tree for building the decision tree.
- accuracy_score from sklearn.metrics for evaluating the model.
- tree from sklearn to call the tree classifier.
- Load the data:
- Use pd.read_csv() to load the dataset from a CSV file.
- The file path needs to be specified correctly.
- Explore the data:
- Print the length of the dataset using len(data).
- Print the shape of the dataset (number of rows and columns) using data.shape.
- Display the first few rows of the dataset using data.head().
- Prepare the data:
- Separate the data into features (X) and target (Y).
- X contains the data used for prediction (e.g., initial payment, last payment, credit score, house number).
- Y contains the target variable (whether the loan was repaid or not).
- Split the data into training and testing sets:
- Use train_test_split(X, Y, test_size=0.3, random_state=100) to split the data.
- test_size specifies the proportion of data used for testing (e.g., 0.3 means 30% for testing).
- random_state ensures the split is reproducible.
- Train the decision tree:
- Create a DecisionTreeClassifier object with specified parameters (e.g., criterion='entropy', random_state=100, max_depth=3, min_samples_leaf=5).
- Fit the model to the training data using clf_entropy.fit(X_train, Y_train).
- Make predictions:
- Use the trained model to predict on the test data: Y_pred = clf_entropy.predict(X_test).
- Evaluate the model:
- Calculate the accuracy score using accuracy_score(Y_test, Y_pred); multiply by 100 to express it as a percentage.
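The steps above can be sketched end-to-end. The loan CSV itself is not included in the notes, so this sketch substitutes a small synthetic DataFrame; the column names mirror the features mentioned above but are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for pd.read_csv("loan_data.csv"); column names
# are hypothetical, mirroring the features listed in the notes.
rng = np.random.default_rng(100)
n = 200
data = pd.DataFrame({
    "initial_payment": rng.integers(100, 500, n),
    "last_payment": rng.integers(100, 15000, n),
    "credit_score": rng.integers(300, 850, n),
    "house_number": rng.integers(1000, 9999, n),
})
# A simple rule so the tree has a signal to learn.
data["result"] = np.where(data["credit_score"] > 550, "yes", "no")

print(len(data), data.shape)
X = data[["initial_payment", "last_payment", "credit_score", "house_number"]]
Y = data["result"]

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)

clf_entropy = DecisionTreeClassifier(
    criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, Y_train)

Y_pred = clf_entropy.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, Y_pred) * 100)
```

With real loan data, only the DataFrame construction changes (back to `pd.read_csv()`); the split, fit, predict, and scoring steps stay the same.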
Conclusion
- Decision trees are a powerful tool for classification and regression problems in machine learning.
- They are easy to understand, interpret, and visualize.
- Python and its machine learning libraries (e.g., scikit-learn) provide tools to implement decision trees effectively.