Ch. 8 Data Mining


Tree Based Methods


14 Terms

1

Tree Based Models

Non-Linear

2

Classification and Regression Trees (CART) (Decision Trees)

The goal of the decision tree is to split the data into homogeneous chunks ("pure" nodes)

Then use the tree structure to make predictions on new data

Makes few assumptions about the input dataset

  • No linearity assumptions!
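A minimal sketch of fitting and querying a CART-style tree, assuming scikit-learn and its built-in iris data (neither is mentioned on the card):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)               # features and class labels
    tree = DecisionTreeClassifier(random_state=0)   # CART-style classifier
    tree.fit(X, y)                                  # recursively split into purer nodes
    print(tree.predict(X[:5]))                      # use the fitted tree for inference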

3

Graph Theory

(image)
4

Gini of Split

Used to choose where to split a node in a decision tree

Want the split with the lowest possible Gini impurity

If multiple splits are tied, randomly choose one

Computed for every candidate predictor (and split point)
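For reference (the card does not spell out the formula): a node with class proportions p_k has Gini impurity 1 − Σ p_k², and the Gini of a split is the sample-size-weighted average of the child nodes' impurities. A small sketch with plain NumPy and made-up class counts:

    import numpy as np

    def gini(counts):
        p = np.asarray(counts, dtype=float) / np.sum(counts)
        return 1.0 - np.sum(p ** 2)          # node impurity: 1 - sum(p_k^2)

    def gini_of_split(left_counts, right_counts):
        n_left, n_right = sum(left_counts), sum(right_counts)
        n = n_left + n_right
        # weighted average of the two child-node impurities
        return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

    print(gini_of_split([40, 10], [5, 45]))  # lower value = purer split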

5

How do we know to stop splitting?

Tree Depth - how many times do you want to split the data?

Minimum samples in leaf - how small do you want the leaf nodes?
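These stopping rules correspond to hyperparameters in libraries such as scikit-learn (the parameter names below are scikit-learn's, not the card's):

    from sklearn.tree import DecisionTreeClassifier

    # stop splitting once the tree is 3 levels deep
    # or a split would leave fewer than 10 samples in a leaf
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10)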

6

Feature Importance

Tells us how much each input variable (feature) contributes to the prediction of a model

It helps us understand which features matter most in determining the output
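A minimal sketch of reading these scores out of a fitted tree, assuming scikit-learn and its built-in iris data:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    # one score per input feature; larger = contributes more to the tree's splits
    print(tree.feature_importances_)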

7

Decision Tree Split Types

Classification - Gini, Entropy, Log Loss

Regression - MSE, Absolute Error, Poisson
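These map onto the criterion argument of scikit-learn's tree classes (the string names below are scikit-learn's, not the card's):

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    clf = DecisionTreeClassifier(criterion="entropy")        # also "gini" or "log_loss"
    reg = DecisionTreeRegressor(criterion="absolute_error")  # also "squared_error" (MSE) or "poisson"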

8

Cost Complexity Pruning (Weakest Link Pruning)

A very large tree may overfit the data, so we prune it by removing some of the unnecessary branches

Start with the full deep tree with many terminal nodes (𝛼 = 0); as 𝛼 increases, the cost of having many terminal nodes increases and branches get pruned away

Use cross-validation to find the optimal 𝛼; each 𝛼 corresponds to a particular subtree size |T|, so tuning 𝛼 and tuning tree size amount to the same thing
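In scikit-learn this shows up as the ccp_alpha parameter; a sketch (function and parameter names are scikit-learn's, the data choice is arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    path = full_tree.cost_complexity_pruning_path(X, y)   # candidate alphas for weakest-link pruning
    # refit with a larger alpha -> more terminal nodes are pruned away
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
    print(full_tree.get_n_leaves(), pruned.get_n_leaves())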

9

Improvement on Decision Trees

Decision Trees are weak learners with high variance

  • Trees fit to different splits of the data will be very different from each other

  • We can use Decision Trees as the base for more complex models

10

Bootstrapping

Sub-select the samples used for each tree's root node at random

Sampling is done with replacement, so an observation can be selected more than once (and some observations are left out entirely)
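A minimal sketch of drawing one bootstrap sample of row indices with NumPy (the data size here is made up):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    idx = rng.choice(n, size=n, replace=True)   # sample row indices with replacement
    print(sorted(idx))                          # some rows appear more than once...
    print(set(range(n)) - set(idx))             # ...and some are left "out of bag"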

11

Out of Bag (OOB) Error Estimation

For each sample, average the error from the trees whose bootstrap samples did not include it

This is a valid estimate of the test error, since the trees used to score a given sample never saw that sample during training
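In scikit-learn this is exposed through the oob_score option on bagged/forest models (parameter and attribute names are scikit-learn's):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
    print(rf.oob_score_)   # accuracy estimated only from trees that never saw each sample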

12

Random Forest (RF)

Sub-select the samples for the root node at random (using bootstrapping)

Sub-select the features at random at each split (sampling without replacement, can be selected only once)

Trees within the forest are not pruned
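A sketch of those three ingredients as scikit-learn parameters (an assumed mapping; the names are scikit-learn's, not the card's):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=500,     # many bootstrapped trees
        bootstrap=True,       # sample rows with replacement for each tree
        max_features="sqrt",  # random subset of features considered at each split
        max_depth=None,       # trees are grown deep and left unpruned
    )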

13

Boosted Trees

Builds trees sequentially

For each data point, calculate the difference between the predicted value and the actual value (the residual).

These residuals represent what the previous trees did not predict correctly.

Fit each new tree to the residuals of the previous trees instead of Ytrain

  • This way, each tree focuses on the mistakes of the trees before it.

Unlike a single large tree that fits the data hard and may overfit, a boosted model learns slowly (it is a slow learner)
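A toy sketch of the residual-fitting idea with two shallow trees (made-up data; the shrinkage factor lr is a common addition, not stated on the card):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 5, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    lr = 0.1                                                   # learn slowly
    tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
    resid = y - lr * tree1.predict(X)                          # what tree 1 did not capture
    tree2 = DecisionTreeRegressor(max_depth=2).fit(X, resid)   # fit to residuals, not y
    pred = lr * tree1.predict(X) + lr * tree2.predict(X)       # combined sequential prediction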

14

Iterative Random Forest (iRF)

The model is retrained multiple times, giving more weight to features that were consistently important in previous iterations.

This helps stabilize the identification of truly important features and reduce noise.
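A rough, simplified sketch of that loop (not the exact iRF algorithm; here the previous importances just bias which columns are offered to the next forest, and all variable names are made up):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)
    weights = np.full(X.shape[1], 1.0 / X.shape[1])     # start with uniform feature weights

    for _ in range(3):                                  # a few reweighting iterations
        cols = rng.choice(X.shape[1], size=X.shape[1], replace=True, p=weights)
        rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:, cols], y)
        imp = np.zeros(X.shape[1])
        np.add.at(imp, cols, rf.feature_importances_)   # map importances back to original columns
        weights = imp / imp.sum()                       # consistently important features gain weight

    print(weights)                                      # stabilized feature weights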