What is a node?
A decision/test on a predictor (e.g., is it raining?)
What is a branch?
Outcome of the test: left branch if the condition is true, right branch if it is false (e.g., is it windy?)
What is a leaf node?
Final outcome (e.g., bring a raincoat)
Each leaf node corresponds to a ______
decision rule
Each decision is a ___ split on the sample space
binary
What does splitting on a numerical variable look like?
X > 5 and X ≤ 5
What does splitting on a categorical variable look like?
X ∈ {A} and X ∈ {B, C}, OR
X ∈ {A, B} and X ∈ {C}
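The two split types can be sketched as tiny Python predicates (a minimal illustration; `numeric_split` and `categorical_split` are hypothetical helper names, not library functions):

```python
def numeric_split(x, threshold):
    """Numerical split: send x left if x <= threshold, else right."""
    return "left" if x <= threshold else "right"

def categorical_split(x, left_levels):
    """Categorical split: send x left if it falls in the chosen subset of levels."""
    return "left" if x in left_levels else "right"

print(numeric_split(3, 5))                 # left : X <= 5
print(categorical_split("A", {"A"}))       # left : X in {A}
print(categorical_split("C", {"A", "B"}))  # right: X in {C}
```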
What is the intuition behind the Gini index?
It measures how mixed or impure a set of observations is.
What is the formula for the Gini index?
Gini = 1 − Σ_{k=1}^{m} p_k²
What does m represent in the Gini formula?
The number of classes
What does p_k represent in the Gini formula?
The proportion of observations in class k
What type of measure is the Gini index?
An impurity measure
What is the best possible Gini value?
0 (all observations belong to the same class)
What is the worst possible Gini value?
1 − 1/m (all classes equally represented)
What does a Gini index of 0 mean?
The node is perfectly pure
What does a higher Gini index indicate?
More class mixing (more impurity).
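The Gini cards above can be checked with a short sketch (assuming a hypothetical helper `gini`, not library code):

```python
from collections import Counter

def gini(labels):
    """Gini = 1 - sum over the m classes of p_k^2, where p_k is the class proportion."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["A", "A", "A", "A"]))      # 0.0 : perfectly pure node (best value)
print(gini(["A", "B"]))                # 0.5 : 1 - 1/m with m = 2 classes (worst value)
print(gini(["A", "A", "B", "B", "C"])) # ~0.64 : more class mixing, more impurity
```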
How do you evaluate a potential split in a decision tree?
Compute the Gini index for the left and right regions.
How are the two region Gini values combined?
By taking their weighted average.
What weights are used in the weighted Gini average?
The proportion of observations in each region.
What do you compare the weighted Gini against?
The Gini index before the split.
When should you make a split?
When there is a large decrease in Gini impurity.
What is the goal of splitting in a classification tree?
To reduce impurity as much as possible.
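The split-evaluation cards above (Gini of each region, weighted by region size, compared to the parent) can be sketched as follows; `split_quality` is a hypothetical helper name:

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_quality(parent, left, right):
    """Decrease in impurity: parent Gini minus the size-weighted average of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["A", "A", "A", "B", "B", "B"]
# A perfect split separates the classes completely: large decrease.
print(split_quality(parent, ["A", "A", "A"], ["B", "B", "B"]))  # 0.5
# A useless split leaves both children as mixed as the parent: ~0 decrease.
print(split_quality(parent, ["A", "B"], ["A", "A", "B", "B"]))
```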
What impurity measure is used in regression trees?
Mean Squared Error (MSE)
What replaces the Gini index in regression trees?
Sum of squared differences from the mean
What value is stored in a regression tree leaf node?
The average of all observations in that region
How does this differ from classification trees?
Classification uses the majority class instead of an average
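The regression-tree cards can be illustrated directly: the leaf stores the region mean, and impurity is the sum of squared differences from that mean (hypothetical helper names):

```python
def leaf_prediction(ys):
    """A regression leaf stores the average of the responses in its region."""
    return sum(ys) / len(ys)

def region_sse(ys):
    """Sum of squared differences from the region mean (the regression impurity)."""
    mean = leaf_prediction(ys)
    return sum((y - mean) ** 2 for y in ys)

ys = [1.0, 2.0, 3.0, 10.0]
print(leaf_prediction(ys))  # 4.0
print(region_sse(ys))       # (1-4)^2 + (2-4)^2 + (3-4)^2 + (10-4)^2 = 50.0
```

A classification leaf would instead return the majority class of its region.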
What is the natural stopping point of a decision tree?
100% purity in each leaf node
Why is reaching 100% purity a problem?
It likely causes overfitting.
What does overfitting lead to?
Low predictive accuracy on new data
When do we typically stop growing a decision tree?
When the tree becomes large, before every leaf reaches full purity
Why stop when leaf nodes have few observations?
Small leaf sizes increase overfitting risk.
What does a small decrease in impurity indicate?
The split is not very useful.
What is the goal of stopping early?
To balance model complexity and accuracy.
What is Option 1 for building a smaller tree?
Only split if impurity reduction exceeds a high threshold (but this is short-sighted and may block good later splits)
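Option 1's threshold rule can be sketched as a one-line check (the threshold value is a hypothetical example, not a prescribed default):

```python
MIN_DECREASE = 0.05  # hypothetical impurity-decrease threshold

def should_split(parent_gini, weighted_child_gini, threshold=MIN_DECREASE):
    """Option 1: split only if the impurity decrease exceeds the threshold."""
    return parent_gini - weighted_child_gini > threshold

print(should_split(0.50, 0.30))  # True : large decrease, worth splitting
print(should_split(0.50, 0.48))  # False: tiny decrease, stop here
```

The short-sightedness mentioned on the card is exactly this: a split rejected here might have enabled a very good split one level deeper.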
What is Option 2 for building a smaller tree?
Grow a large tree then prune it back (allows flexibility and better final performance)
Why might a full decision tree perform poorly?
It is too complex and overfits the data. Therefore we prune back to reduce the number of splits
How does pruning affect training and testing accuracy?
It usually lowers training accuracy but can improve generalization to test data
What is a benefit of a smaller tree?
Better interpretability
What does cross-validation help select?
The optimal number of leaf (terminal) nodes
Why can’t we rely only on training data?
It leads to overfitting.
What does cv.tree() do automatically?
It performs the cross-validation (training/validation) splits for you.
What is the goal of cross-validation in trees?
Balance bias and variance.
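The fold construction that such a cross-validation helper performs behind the scenes can be sketched with the standard library (`k_fold_indices` is a hypothetical helper, not part of any tree package):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(10, 5)
print([len(f) for f in folds])  # [2, 2, 2, 2, 2]
# Each tree size is then scored on every held-out fold, and the size with the
# best average validation error balances bias (too small) against variance (too large).
```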