Tree Based Methods
Tree Based Models
Non-Linear
Classification and Regression Trees (CART) (Decision Trees)
Goal of the decision tree is to split the data into chunks of like observations (pure nodes)
Then use the tree structure to make inferences on data
Makes few assumptions about the input dataset
○ No linearity assumptions!
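A minimal sketch of this idea, assuming scikit-learn and its bundled iris dataset: fit a tree, print the learned splits, and route new points through them.

```python
# Minimal sketch (scikit-learn + bundled iris data): fit a decision tree,
# inspect the learned splits, and use the tree structure for predictions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Internal nodes are threshold tests on single features; leaves are the
# (nearly) pure chunks the tree was grown to produce.
print(export_text(tree, feature_names=data.feature_names))

# New observations are routed down the splits to a leaf to get a prediction.
print(tree.predict(data.data[:3]))
```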
Graph Theory
Gini of Split
Used to split decision tree chunks
Want the split with the lowest possible Gini Impurity
If multiple splits are tied, choose one at random
Compute the Gini for every candidate split across all predictors
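A minimal NumPy-only sketch of the calculation: node impurity is 1 minus the sum of squared class proportions, and the Gini of a split is the sample-weighted average of its two children.

```python
# Gini impurity of a node (1 - sum of squared class proportions) and of a
# candidate split (sample-weighted average of the child nodes' impurities).
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_of_split(left_labels, right_labels):
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini(left_labels) + (n_right / n) * gini(right_labels)

# Hypothetical candidate split: left chunk is pure, right chunk is mixed.
left = np.array([0, 0, 0, 0])
right = np.array([0, 1, 1, 1])
print(gini_of_split(left, right))  # 0.1875 -- pick the candidate with the lowest value
```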
How do we know to stop splitting?
Tree Depth - how many times do you want to split the data?
Minimum samples in leaf - how small do you want the leaf nodes?
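A minimal sketch, assuming scikit-learn and its bundled iris data, of how these two stopping rules are typically set:

```python
# Stopping rules as constructor arguments (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    max_depth=4,          # tree depth: at most 4 levels of splits
    min_samples_leaf=20,  # minimum samples in leaf: no leaf smaller than 20 samples
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```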
Feature Importance
Tells us how much each input variable (feature) contributes to the prediction of a model
It helps us understand which features matter most in determining the output
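A minimal sketch, assuming scikit-learn: after fitting, the tree exposes impurity-based per-feature importances that sum to 1.

```python
# Impurity-based feature importances from a fitted tree (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Larger values mean the feature contributed more impurity reduction overall.
for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```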
Decision Tree Split Types
Classification - Gini, Entropy, Log Loss
Regression - MSE, Absolute Error, Poisson
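In recent scikit-learn versions these split types map onto the criterion parameter; a minimal sketch:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier(criterion="gini")          # or "entropy", "log_loss"
reg = DecisionTreeRegressor(criterion="squared_error")  # or "absolute_error", "poisson"
```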
Cost Complexity Pruning (Weakest Link Pruning)
A very large tree may overfit the data, so we prune it by removing unnecessary branches
Start with the full deep tree with many terminal nodes (𝛼 = 0); as 𝛼 increases, the cost of having many terminal nodes grows and branches get pruned away
Use Cross Validation to find the optimal 𝛼; each 𝛼 corresponds to a particular subtree T, so choosing 𝛼 and choosing the tree size are effectively the same decision (see the sketch below)
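A minimal sketch, assuming scikit-learn and its breast cancer dataset: enumerate the candidate 𝛼 values from the full tree, then cross-validate over ccp_alpha.

```python
# Cost complexity pruning: enumerate the effective alphas of the full tree,
# then pick the one with the best cross-validated score (scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Alphas at which successive branches of the full tree would be pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

scores = [
    (alpha, cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
                            X, y, cv=5).mean())
    for alpha in path.ccp_alphas
]
best_alpha, best_score = max(scores, key=lambda s: s[1])
print(best_alpha, best_score)  # larger alpha => smaller (more heavily pruned) tree
```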
Improvement on Decision Trees
Decision Trees are weak learners with high variance
Trees grown on different samples or splits of the data will look very different from each other
We can use Decision Trees as the base for more complex models
Bootstrapping
Sub select the samples for the root node at random
Sampling with replacement, so the same observation can be selected more than once (and some not at all)
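A minimal NumPy sketch of drawing one bootstrap sample of row indices; the rows a tree never sees are its out-of-bag samples (used in the next section).

```python
# One bootstrap sample: n row indices drawn with replacement, so some rows
# repeat and roughly a third are left out ("out-of-bag") for that tree.
import numpy as np

rng = np.random.default_rng(0)
n = 10
boot_idx = rng.choice(n, size=n, replace=True)
oob_idx = np.setdiff1d(np.arange(n), boot_idx)
print(boot_idx)  # rows used to grow this tree (duplicates allowed)
print(oob_idx)   # rows this tree never saw
```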
Out of Bag (OOB) Error Estimation
For each training sample, average the error over only the trees that did not include it in their bootstrap sample
This is a valid estimate of the test error, since those trees never saw the given sample during training
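A minimal sketch, assuming scikit-learn: with oob_score=True, each training sample is predicted using only the trees whose bootstrap samples excluded it, giving a test-error estimate without a held-out set.

```python
# Out-of-bag accuracy from a random forest (scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0).fit(X, y)

# oob_score_ approximates test accuracy: every prediction behind it comes
# from trees that never saw that sample in their bootstrap sample.
print(rf.oob_score_)
```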
Random Forest (RF)
Sub-select the samples for the root node at random (using bootstrapping)
Sub-select the features at random at each split (sampling without replacement, can be selected only once)
Trees within the forest are not pruned
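A minimal sketch, assuming scikit-learn and its breast cancer dataset, contrasting plain bagged trees (all features considered at every split) with a random forest (random feature subset at each split); both bootstrap the rows and grow deep, unpruned trees.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=0)
forest = RandomForestClassifier(n_estimators=200,
                                max_features="sqrt",  # random feature subset at each split
                                random_state=0)       # trees left unpruned by default

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```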
Boosted Trees
Builds trees sequentially
For each data point, calculate the difference between the predicted value and the actual value.
These residuals represent what the first tree did not predict correctly.
This way, each new tree focuses on the mistakes of the previous trees
Fit each tree to the residuals of the previous trees instead of Ytrain
Fitting the data hard all at once may overfit, so Boosted Trees are a slow learner: each small tree, scaled by a learning rate, only nudges the fit (see the sketch below)
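A minimal sketch of the residual-fitting loop described above, using scikit-learn trees and the bundled diabetes data; each small tree is fit to the current residuals and added with a small learning rate so the model learns slowly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

learning_rate = 0.1
prediction = np.zeros_like(y, dtype=float)  # start from a zero prediction
trees = []

for _ in range(100):
    residuals = y - prediction                     # what the ensemble has not yet explained
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # nudge the fit toward the residuals
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training MSE shrinks as trees are added
```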
Iterative Random Forest (iRF)
The model is retrained multiple times, giving more weight to features that were consistently important in previous iterations.
This helps stabilize the identification of truly important features and reduce noise.
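A rough conceptual sketch only, not the published iRF algorithm (which reweights the feature choice inside every split): here each iteration hand-builds a small forest whose trees draw their feature subsets with probability proportional to the previous iteration's importances.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_samples, n_features = X.shape
k = int(np.sqrt(n_features))                 # features offered to each tree
weights = np.ones(n_features) / n_features   # start with uniform feature weights

for _ in range(5):  # outer iRF-style iterations
    importances = np.zeros(n_features)
    for _ in range(100):  # one small forest per iteration
        boot = rng.choice(n_samples, n_samples, replace=True)        # bootstrap rows
        feats = rng.choice(n_features, k, replace=False, p=weights)  # weighted feature subset
        tree = DecisionTreeClassifier(random_state=0).fit(X[boot][:, feats], y[boot])
        importances[feats] += tree.feature_importances_
    weights = importances + 1e-9
    weights /= weights.sum()  # features that matter repeatedly keep gaining weight

print(np.argsort(weights)[::-1][:5])  # indices of the most consistently important features
```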