What is a node?
A decision/test on a predictor (e.g., is it raining?)
What is a branch?
Outcome of the test: left branch if the condition is true, right branch if it is false (e.g., is it windy?)
What is a leaf node?
Final outcome (e.g., bring a raincoat)
Each leaf node corresponds to a ______
decision rule
Each decision is a ___ split on the sample space
binary
What does splitting on a numerical variable look like?
X > 5 and X ≤ 5
What does splitting on a categorical variable look like?
X ∈ {A} and X ∈ {B, C}, OR
X ∈ {A, B} and X ∈ {C}
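The two split types can be sketched as tiny Python predicates (a minimal illustration; `numeric_split` and `categorical_split` are hypothetical helper names, not library functions):

```python
def numeric_split(x, threshold):
    """Numerical split: send x left if x <= threshold, else right."""
    return "left" if x <= threshold else "right"

def categorical_split(x, left_levels):
    """Categorical split: send x left if it falls in the chosen subset of levels."""
    return "left" if x in left_levels else "right"

print(numeric_split(3, 5))                 # left : X <= 5
print(categorical_split("A", {"A"}))       # left : X in {A}
print(categorical_split("C", {"A", "B"}))  # right: X in {C}
```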
What is the intuition behind the Gini index?
It measures how mixed or impure a set of observations is.
What is the formula for the Gini index?
Gini = 1 − Σ_{k=1}^{m} p_k²
What does m represent in the Gini formula?
The number of classes
What does p_k represent in the Gini formula?
The proportion of observations in class k
What type of measure is the Gini index?
An impurity measure
What is the best possible Gini value?
0 (all observations belong to the same class)
What is the worst possible Gini value?
1 − 1/m (all classes equally represented)
What does a Gini index of 0 mean?
The node is perfectly pure
What does a higher Gini index indicate?
More class mixing (more impurity).
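The Gini cards above can be checked with a short sketch (assuming a hypothetical helper `gini`, not library code):

```python
from collections import Counter

def gini(labels):
    """Gini = 1 - sum over the m classes of p_k^2, where p_k is the class proportion."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["A", "A", "A", "A"]))      # 0.0 : perfectly pure node (best value)
print(gini(["A", "B"]))                # 0.5 : 1 - 1/m with m = 2 classes (worst value)
print(gini(["A", "A", "B", "B", "C"])) # ~0.64 : more class mixing, more impurity
```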
How do you evaluate a potential split in a decision tree?
Compute the Gini index for the left and right regions.
How are the two region Gini values combined?
By taking their weighted average.
What weights are used in the weighted Gini average?
The proportion of observations in each region.
What do you compare the weighted Gini against?
The Gini index before the split.
When should you make a split?
When there is a large decrease in Gini impurity.
What is the goal of splitting in a classification tree?
To reduce impurity as much as possible.
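The split-evaluation cards above (Gini of each region, weighted by region size, compared to the parent) can be sketched as follows; `split_quality` is a hypothetical helper name:

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_quality(parent, left, right):
    """Decrease in impurity: parent Gini minus the size-weighted average of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["A", "A", "A", "B", "B", "B"]
# A perfect split separates the classes completely: large decrease.
print(split_quality(parent, ["A", "A", "A"], ["B", "B", "B"]))  # 0.5
# A useless split leaves both children as mixed as the parent: ~0 decrease.
print(split_quality(parent, ["A", "B"], ["A", "A", "B", "B"]))
```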
What impurity measure is used in regression trees?
Mean Squared Error (MSE)
What replaces the Gini index in regression trees?
Sum of squared differences from the mean
What value is stored in a regression tree leaf node?
The average of all observations in that region
How does this differ from classification trees?
Classification uses the majority class instead of an average
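The regression-tree cards can be illustrated directly: the leaf stores the region mean, and impurity is the sum of squared differences from that mean (hypothetical helper names):

```python
def leaf_prediction(ys):
    """A regression leaf stores the average of the responses in its region."""
    return sum(ys) / len(ys)

def region_sse(ys):
    """Sum of squared differences from the region mean (the regression impurity)."""
    mean = leaf_prediction(ys)
    return sum((y - mean) ** 2 for y in ys)

ys = [1.0, 2.0, 3.0, 10.0]
print(leaf_prediction(ys))  # 4.0
print(region_sse(ys))       # (1-4)^2 + (2-4)^2 + (3-4)^2 + (10-4)^2 = 50.0
```

A classification leaf would instead return the majority class of its region.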
What is the natural stopping point of a decision tree?
100% purity in each leaf node
Why is reaching 100% purity a problem?
It likely causes overfitting.
What does overfitting lead to?
Low predictive accuracy on new data
When do we typically stop growing a decision tree?
When the tree becomes large, before every leaf reaches full purity
Why stop when leaf nodes have few observations?
Small leaf sizes increase overfitting risk.
What does a small decrease in impurity indicate?
The split is not very useful.
What is the goal of stopping early?
To balance model complexity and accuracy.
What is Option 1 for building a smaller tree?
Only split if impurity reduction exceeds a high threshold (but this is short-sighted and may block good later splits)
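Option 1's threshold rule can be sketched as a one-line check (the threshold value is a hypothetical example, not a prescribed default):

```python
MIN_DECREASE = 0.05  # hypothetical impurity-decrease threshold

def should_split(parent_gini, weighted_child_gini, threshold=MIN_DECREASE):
    """Option 1: split only if the impurity decrease exceeds the threshold."""
    return parent_gini - weighted_child_gini > threshold

print(should_split(0.50, 0.30))  # True : large decrease, worth splitting
print(should_split(0.50, 0.48))  # False: tiny decrease, stop here
```

The short-sightedness mentioned on the card is exactly this: a split rejected here might have enabled a very good split one level deeper.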
What is Option 2 for building a smaller tree?
Grow a large tree then prune it back (allows flexibility and better final performance)
Why might a full decision tree perform poorly?
It is too complex and overfits the data. Therefore we prune back to reduce the number of splits
How does pruning affect training and testing accuracy?
It usually lowers training accuracy but can improve generalization to test data
What is a benefit of a smaller tree?
Better interpretability
What does cross-validation help select?
The optimal number of leaf (terminal) nodes
Why can’t we rely only on training data?
It leads to overfitting.
What does cv.tree() do automatically?
It performs the cross-validation (training/validation) splits for you.
What is the goal of cross-validation in trees?
Balance bias and variance.
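The fold construction that such a cross-validation helper performs behind the scenes can be sketched with the standard library (`k_fold_indices` is a hypothetical helper, not part of any tree package):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(10, 5)
print([len(f) for f in folds])  # [2, 2, 2, 2, 2]
# Each tree size is then scored on every held-out fold, and the size with the
# best average validation error balances bias (too small) against variance (too large).
```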