3 Decision Tree



21 Terms

1

Decision Tree

A decision tree is a supervised learning algorithm for classification and regression tasks. It splits data into branches based on feature values, creating decision nodes until reaching final predictions, and it represents decisions and their possible outcomes in a tree-like structure, making the decision-making process transparent and interpretable.

2

Root Node

The decision-making process starts at the root node, representing the entire dataset. The model evaluates the best feature to split the data based on a chosen criterion like Gini Impurity, Information Gain (for classification), or Mean Squared Error (for regression).

3

How do we find the root node?

  1. Calculate Impurity/Information Gain:

    • Gini Index: Measures impurity. Lower values are better.

    • Entropy: Measures disorder. Higher information gain is better.

  2. Evaluate Each Feature:

    • For each feature, calculate the Gini Index or Entropy for all possible splits.

    • Determine the information gain for each split.

  3. Select the Best Feature:

    • Choose the feature with the lowest Gini Index or highest information gain.

    • This feature becomes the root node.
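The steps above can be sketched in plain Python. This is a minimal sketch, assuming a made-up toy dataset; the feature names (`outlook`, `windy`) and helper functions are illustrative, not from any library:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(feature_values, labels):
    """Weighted Gini impurity after splitting on every value of a feature."""
    n = len(labels)
    total = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        total += len(subset) / n * gini(subset)
    return total

# Toy dataset: two candidate features, one binary label.
outlook = ["sunny", "sunny", "rain", "rain", "rain", "sunny"]
windy   = [True, False, True, False, True, False]
play    = ["no", "no", "yes", "yes", "no", "no"]

# The feature with the lowest weighted Gini (highest gain) becomes the root.
scores = {
    "outlook": weighted_gini(outlook, play),
    "windy": weighted_gini(windy, play),
}
root = min(scores, key=scores.get)
print(root, scores)  # outlook wins: its split yields purer children
```

Here `outlook` is chosen as the root because splitting on it produces purer subsets (lower weighted Gini) than splitting on `windy`.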

4

Splitting

The dataset is divided into subsets based on the selected feature's values, creating branches from the root node. This process continues recursively at each subsequent node.

5

Internal Node

These nodes represent decision points where the dataset is further split based on other features.

6

Leaf Node

These are terminal nodes that provide the final prediction.

For classification tasks, the leaf nodes contain class labels, while for regression tasks, they provide numerical outputs.

7

Stopping Criteria

■ All data points in a node belong to the same class (pure node).

■ A maximum tree depth is reached.

■ Minimum samples per leaf or minimum information gain criteria are met.
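These criteria become the base cases of the recursive tree-building routine. A minimal sketch, where the parameter names `max_depth` and `min_samples` are illustrative and not tied to any particular library:

```python
def should_stop(labels, depth, max_depth=3, min_samples=2):
    """Return True when any stopping criterion is met."""
    if len(set(labels)) == 1:      # pure node: only one class left
        return True
    if depth >= max_depth:         # maximum tree depth reached
        return True
    if len(labels) < min_samples:  # too few samples to split further
        return True
    return False

print(should_stop(["yes", "yes"], depth=1))        # pure node -> True
print(should_stop(["yes", "no"], depth=3))         # depth limit -> True
print(should_stop(["yes", "no", "yes"], depth=1))  # keep splitting -> False
```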

8
How do we split?

  • Entropy

  • Gini Index
9
Pure split vs Impure split

A split is pure if all the resulting child nodes contain instances of only one class.

A split is impure if the resulting child nodes contain a mix of different classes.
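The difference can be made concrete with Gini impurity, which is 0 for a pure node and positive for an impure one. A small sketch with made-up labels:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure_child = ["spam", "spam", "spam"]          # only one class
impure_child = ["spam", "ham", "spam", "ham"]  # mix of classes

print(gini(pure_child))    # 0.0 -> pure
print(gini(impure_child))  # 0.5 -> maximally impure binary node
```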
10
For a huge data set, what split method will we use?

Gini Impurity (categorical features), since it is cheaper to compute than entropy.

11
What is the problem with the decision tree?

A fully grown tree fits the training data too closely, causing overfitting.
12
How do we reduce the overfitting of data?

  • Post-pruning

  • Pre-pruning

13
Post-pruning

We grow the complete decision tree first, and later cut parts of it back (pruning).

14
For smaller data we will use ____ pruning

Post

15
Pre-pruning

We constrain the tree while it is being constructed, using parameters such as max features, max depth, and minimum samples per split.
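With scikit-learn (assuming it is installed), pre-pruning is just constructor arguments on the tree. A sketch on the built-in iris dataset; the specific parameter values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain the tree while it grows instead of pruning later.
clf = DecisionTreeClassifier(
    max_depth=3,           # stop splitting below this depth
    min_samples_split=10,  # a node needs at least 10 samples to split
    min_samples_leaf=5,    # every leaf keeps at least 5 samples
    max_features=2,        # consider at most 2 features per split
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```

For post-pruning, scikit-learn instead grows the full tree and cuts it back via the `ccp_alpha` (cost-complexity pruning) parameter.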

16
For huge data we will use

Pre-pruning

17
Decision Tree vs Random Forest

| Feature | Decision Tree | Random Forest |
| --- | --- | --- |
| Algorithm Type | Single tree-based model | Ensemble of multiple decision trees |
| Overfitting | Prone to overfitting | Less prone to overfitting due to averaging |
| Bias-Variance Tradeoff | High variance, low bias | Lower variance, slightly higher bias |
| Interpretability | Easy to interpret and visualize | Harder to interpret due to multiple trees |
| Training Time | Faster training time | Slower training time due to multiple trees |
| Prediction Time | Faster prediction time | Slower prediction time due to averaging predictions |
| Accuracy | Can be less accurate due to overfitting | Generally more accurate due to ensemble approach |

18
How do we select a feature?

Using entropy (a measure of the randomness of the data): the feature with the highest information gain, i.e. the largest drop in entropy, is selected.

19
Gini Index vs Entropy

| Aspect | Gini Index | Entropy |
| --- | --- | --- |
| Definition | Measures the impurity of a dataset. | Measures the disorder or uncertainty in a dataset. |
| Formula | $Gini = 1 - \sum_{i=1}^{n} p_i^2$ | $Entropy = -\sum_{i=1}^{n} p_i \log_2(p_i)$ |
| Range | 0 (pure) to 0.5 (maximum impurity for binary classification) | 0 (pure) to 1 (maximum disorder for binary classification) |
| Interpretation | Lower values indicate purer nodes. | Higher values indicate more disorder. |
| Calculation Complexity | Simpler and faster to compute. | More complex and computationally intensive. |
| Usage in Decision Trees | Often preferred due to computational efficiency. | Provides more information gain but is computationally heavier. |
| Sensitivity to Changes | Less sensitive to changes in the dataset. | More sensitive to changes in the dataset. |
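Both formulas and their ranges can be checked numerically for a binary node. A minimal sketch, taking a node's class proportions as input:

```python
import math

def gini(ps):
    # Gini = 1 - sum(p_i^2) over class proportions p_i.
    return 1.0 - sum(p * p for p in ps)

def entropy(ps):
    # Entropy = -sum(p_i * log2(p_i)); 0 * log2(0) is taken as 0.
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(gini([1.0]), entropy([1.0]))            # pure node: both 0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 50/50 binary node: Gini 0.5, entropy 1.0
```

This confirms the ranges above: for binary classification Gini peaks at 0.5 and entropy at 1.0, both at a 50/50 class mix.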

20
TYPES of Decision Trees: CART vs C4.5 vs ID3

| Aspect | CART | C4.5 | ID3 |
| --- | --- | --- | --- |
| Splitting Criterion | Gini Index | Information Gain Ratio | Information Gain |
| Tree Structure | Binary trees (two children per node) | Multi-way trees (multiple children per node) | Multi-way trees (multiple children per node) |
| Data Types | Continuous and categorical | Continuous and categorical | Categorical only |
| Pruning | Cost-complexity pruning | Error-based pruning | No pruning |
| Handling Missing Values | Surrogate splits | Assigns probabilities | Does not handle missing values |
| Advantages | Simple, fast, easy to interpret | Handles both data types, robust pruning | Simple, easy to understand |
| Disadvantages | Can overfit without pruning | More complex, computationally intensive | Can overfit, does not handle continuous data |

21
Do we require scaling?

No. Tree-based algorithms such as Decision Trees and Random Forests don't need scaling because they split data on feature-value thresholds, not on distances.
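This can be checked directly: an order-preserving rescaling of a feature leaves the best threshold split unchanged. A minimal pure-Python sketch with a made-up dataset:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_partition(xs, ys):
    """Return the (left, right) label partition of the best threshold split."""
    best, best_score = None, float("inf")
    for t in sorted(set(xs))[:-1]:  # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best, best_score = (left, right), score
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]

# Multiplying the feature by 1000 preserves the ordering of values,
# so the best split produces exactly the same partition of labels.
scaled = [x * 1000 for x in xs]
print(best_split_partition(xs, ys) == best_split_partition(scaled, ys))  # True
```

The threshold value itself changes under scaling, but the resulting partition of the data (and hence the tree's predictions) does not, which is why scaling is unnecessary.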