Lecture 9: Decision Trees and Random Forest 2



10 Terms

Card 1

Q: How do you pick a split in a decision tree?

A: Choose the variable and cutoff that best separate the data, typically by minimizing Gini impurity for classification.
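A minimal sketch of Gini-based split selection in Python (the function names and the exhaustive search over observed values are illustrative, not from the lecture):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(values, labels, cutoff):
    """Size-weighted Gini impurity of the two children after splitting at cutoff."""
    left = [y for x, y in zip(values, labels) if x <= cutoff]
    right = [y for x, y in zip(values, labels) if x > cutoff]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_cutoff(values, labels):
    """Pick the observed value whose split minimizes weighted impurity."""
    return min(sorted(set(values)), key=lambda c: split_impurity(values, labels, c))
```

For example, with values [1, 2, 3, 4] and labels ['a', 'a', 'b', 'b'], cutting at 2 produces two pure children, so the weighted impurity is zero.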

Card 2

Q: When do you stop growing a decision tree?

A: When nodes are pure (all one label) or contain too few data points to split further.
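Both stopping rules fit in a one-line check (the threshold of 5 is an illustrative choice, not a value from the lecture):

```python
def should_stop(labels, min_samples=5):
    """Stop splitting when the node is pure (one label)
    or has fewer than min_samples points (5 here is illustrative)."""
    return len(set(labels)) <= 1 or len(labels) < min_samples
```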

Card 3

Q: What is bootstrap aggregation (bagging)?

A: Resampling the data with replacement, training a model on each sample, and aggregating their results to reduce overfitting.
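A sketch of bagging in plain Python; `train_fn` stands in for any model-fitting routine, and the names are hypothetical:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) points with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def bag(train_fn, data, n_models=10, seed=0):
    """Train one model per bootstrap sample."""
    rng = random.Random(seed)
    return [train_fn(bootstrap(data, rng)) for _ in range(n_models)]

def bagged_predict(models, x):
    """Aggregate by majority vote (use the mean for regression)."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```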

Card 4

Q: Why does a fully grown single decision tree overfit?

A: It memorizes the training data perfectly, so it fails to generalize to new data.

Card 5

Q: How do random forests improve over single decision trees?

A: They grow many trees on bootstrapped samples and average (or vote across) their predictions to reduce variance.

Card 6

Q: What extra randomness is added in random forests?

A: Each split considers only a random subset of the predictor variables rather than all of them.
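A sketch of the per-split feature subsampling; the square-root default is a common convention for classification, assumed here rather than stated in the lecture:

```python
import random

def candidate_features(n_features, rng, m=None):
    """Return a random subset of feature indices to consider at one split.
    m defaults to int(sqrt(n_features)), a common classification choice."""
    if m is None:
        m = max(1, int(n_features ** 0.5))
    return rng.sample(range(n_features), m)
```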

Card 7

Q: In machine learning, what are the three key steps?

A: Train on past data, predict on new data, and evaluate performance.
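The three steps can be illustrated with a toy majority-class "model" (entirely hypothetical, just to show the train/predict/evaluate flow):

```python
from collections import Counter

def train(train_data):
    """Train: learn the majority label from past (x, y) pairs."""
    majority = Counter(y for _, y in train_data).most_common(1)[0][0]
    return lambda x: majority          # predict: the same label for any x

def evaluate(model, test_data):
    """Evaluate: fraction of new (x, y) pairs predicted correctly."""
    return sum(model(x) == y for x, y in test_data) / len(test_data)
```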

Card 8

Q: Why use ensemble models?

A: Because no single model is perfect; combining several models can improve accuracy.

Card 9

Q: What is stacking in ensemble modeling?

A: Using the outputs of several different models as new input features for a final model.

Card 10

Q: How does linear weighted stacking work?

A:

  1. Split the training data into two parts (train1 and train2).
  2. Train several base models (e.g., Random Forest, GLM, GBM, SVM) on train1.
  3. Score each model on train2, using the scores as new features.
  4. Combine these new features with the original features on train2.
  5. Train a final GLM (including interaction terms) on this combined data.
  6. Apply the stacked pipeline to the test data.
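The steps above can be sketched in Python; the base models and the final combiner below are hypothetical stand-ins (a real pipeline would use Random Forest, GLM, GBM, SVM, and a GLM combiner as listed):

```python
def stack(base_fits, final_fit, train1, train2):
    """Linear stacking sketch.
    base_fits: fit functions, each mapping data -> predict(x).
    final_fit: fits a combiner on (x, scores, y) rows built from train2."""
    fitted = [fit(train1) for fit in base_fits]            # fit base models on train1
    augmented = [(x, [m(x) for m in fitted], y)            # score on train2; the scores
                 for x, y in train2]                       # become new features alongside x
    combiner = final_fit(augmented)                        # train the final model
    return lambda x: combiner(x, [m(x) for m in fitted])   # stacked pipeline for new data

# Toy stand-ins: two "models" that ignore their training data,
# and a combiner that just sums the base-model scores.
double = lambda data: (lambda x: 2 * x)
plus_one = lambda data: (lambda x: x + 1)
sum_scores = lambda augmented: (lambda x, scores: sum(scores))
```

For example, `stack([double, plus_one], sum_scores, [], [(1, 3)])` returns a predictor whose output at x = 2 is (2 * 2) + (2 + 1) = 7.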