Ensemble Methods

Dimensions of Theta

In linear classifiers, there were as many thetas as features (x), meaning each feature had a corresponding weight.
With trees, features can be re-encountered, and rules can be established on the same feature to split the search space multiple times. This allows for more complex decision boundaries.
The same kind of model does not necessarily learn the same function from different data; different datasets can lead to different models, emphasizing the importance of data quality and representation.

Epigenetics

An analogy used to describe logistic models and model networks. Epigenetics is the study of changes in organisms caused by modification of gene expression rather than alteration of the genetic code itself. The analogy highlights the flexibility and adaptability of these models.

Classifier Predictions

y hat: The class label predicted by the classifier from the features of an example.
Binary Classification:
- If f(x) > 0.5, the prediction is the positive class (assuming the positive class is labeled 1). This threshold acts as the decision boundary; values above it are predicted positive, values below it negative.

Uncertainty:
- The prediction is the class with the maximum probability, and that probability indicates the confidence of the classifier.
- Example: A logistic regression whose predictions are close to 0.5 is uncertain around the decision boundary. Such predictions indicate the model is unsure about the outcome.
- If a value of 0.7 is associated with the positive class, it suggests some level of confidence but not certainty. The model is leaning towards the positive class.
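
A minimal sketch of the thresholding rule and the confidence reading described above. The function names and the example probabilities are illustrative assumptions, not part of the lecture.

```python
def hard_prediction(p_positive, threshold=0.5):
    """Return 1 (positive class) if the probability exceeds the threshold, else 0."""
    return 1 if p_positive > threshold else 0


def confidence(p_positive):
    """Probability assigned to whichever class gets predicted: max(p, 1 - p)."""
    return max(p_positive, 1.0 - p_positive)


# Hypothetical classifier outputs f(x) for three examples.
for p in (0.51, 0.7, 0.98):
    print(p, hard_prediction(p), confidence(p))
```

All three examples are predicted positive, but the first is barely past the boundary while the last is close to certain.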

Multi-Class Classification

Ground truth label for classes (e.g., a, b, c, d) as a vector where the index corresponds to the class. This is a one-hot encoded vector.
- Example:
- Ground truth (y list): a = 1, others = 0. Indicates that the true class is 'a'.
- Prediction (y hat): a value between 0 and 1 for class a. Represents the probability that the instance belongs to class 'a'.
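
As a rough illustration of the one-hot ground truth and the per-class prediction, assuming the classifier outputs one probability per class. The probability values below are made up for the example.

```python
classes = ["a", "b", "c", "d"]


def one_hot(label, classes):
    """Ground-truth vector: 1 at the index of the true class, 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]


def predict_class(probs, classes):
    """Pick the class with the largest predicted probability."""
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best], probs[best]


y = one_hot("a", classes)            # true class is 'a'
y_hat = [0.6, 0.2, 0.15, 0.05]       # hypothetical classifier output
label, prob = predict_class(y_hat, classes)
print(y, label, prob)
```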

Ensembling for Healthcare

Combine expertise from multiple individuals to get closer to the actual solution. Emphasizes collaborative decision-making.
Benefit from the expertise of each person surveyed, leveraging diverse knowledge to improve accuracy.

Pretrained Classifiers

Each classifier outputs a class probability for class 1 (the positive class). Probabilities indicate the likelihood of belonging to the positive class.
Example: Binary classification problem for a pumpkin (whether it has a particular disease).
Get probabilities from each person (m people). Aggregate these probabilities to make a final decision.
Hard predictions based on probabilities:
- Above 50% probability: yes.
- Below 50% probability: no.

Majority Vote

Take the majority vote to decide the final prediction. A simple ensembling technique.
Classifiers can be of any type, or the same type of classifier trained on different subsamples of the dataset. Allows for flexibility in model selection.
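
A small sketch of majority voting over hard predictions, assuming each of the m classifiers reports a probability for the positive class. The probabilities are hypothetical, and resolving ties in favor of the negative class is an arbitrary choice made here.

```python
from collections import Counter


def majority_vote(probabilities, threshold=0.5):
    """Turn each probability into a yes/no vote, then return the majority class."""
    votes = [1 if p > threshold else 0 for p in probabilities]
    counts = Counter(votes)
    # Ties go to the negative class (an assumption of this sketch).
    return 1 if counts[1] > counts[0] else 0


probs = [0.8, 0.4, 0.65]   # hypothetical outputs from m = 3 classifiers
final = majority_vote(probs)
print(final)
```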

Different Models, Different Thresholds

Training set is the same for all three models. Ensures comparability between models.
The three models each came up with a different threshold value, and each threshold has some error associated with it. This variability can be due to different model biases.
Example: Three different trees that are essentially just stumps (one little node).
Combine all these and see what the error is in the end. Evaluate the ensemble performance.

Small Depth Trees

Small depth trees can be literally just one level deep, and by ensembling these one-level trees we're able to get this cute aggregated decision boundary. Useful for creating simple, interpretable models.
Weak learners: Single decision stump, one single node with two children. These are simple models that perform slightly better than random guessing.
Alternative to one tree with greater depth. Offers a different approach to model complexity.
Advantage: Smaller search space over shallower trees. Reduces computational complexity.

Weighted Average

Instead of blindly averaging, assign weighting values to models. Give more importance to better-performing models.
Ensure weights are between 0 and 1 and sum to 1 to keep the averaged probabilities between 0 and 1. Maintains probabilistic interpretability.
Unweighted average: Set everything to the same weight ( $1/m$ ). Each model contributes equally.
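
A sketch of the weighted average, with the unweighted case as the default. The check that the weights are non-negative and sum to 1 is an assumption added here so that the result stays a valid probability; the example numbers are hypothetical.

```python
def weighted_average(probabilities, weights=None):
    """Combine m model probabilities; defaults to equal weights of 1/m."""
    m = len(probabilities)
    if weights is None:
        weights = [1.0 / m] * m
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * p for w, p in zip(weights, probabilities))


probs = [0.8, 0.4, 0.6]                                   # hypothetical model outputs
print(weighted_average(probs))                            # equal weights
print(weighted_average(probs, [0.5, 0.25, 0.25]))         # first model trusted more
```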

Addressing Model Mishmash

Feature scaling is one of the ways to address model mishmash. Standardizes the input features.
Ensure that you systematically introduce things to ensemble. Avoid haphazardly combining models.

Subsampling the Dataset

Create different predictors by sampling the training set. Train models on different subsets of the data.
Ensemble the output to see the prediction. Combine the predictions from multiple models.

Validation and Test Sets

Check error and accuracy on the validation set. Tune hyperparameters and evaluate model performance.
Hold the test set out completely. Provides an unbiased estimate of performance on unseen data.
Kaggle competition: Test set hidden, leaderboard generated. Motivates model improvement through competition.

Combining Models Technique

Train on training data, learn weights on validation data. Optimize model weights based on validation performance.
Use the logistic model and log loss for error. Common choices for classification tasks.
Calculate these things during training. Track performance metrics.
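
One way to picture learning the combination on validation data: the sketch below blends two already-trained models and picks the blend weight with the lowest log loss on the validation set. The grid search is a simple stand-in chosen for illustration, not the logistic model mentioned above, and the labels and probabilities are made up.

```python
import math


def log_loss(y_true, p_pred, eps=1e-12):
    """Average binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def best_blend_weight(y_val, p_a, p_b, steps=100):
    """Try w in {0, 1/steps, ..., 1} for w * p_a + (1 - w) * p_b."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(p_a, p_b)]
        loss = log_loss(y_val, blended)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Hypothetical validation labels and outputs of two pretrained models.
y_val = [1, 0, 1, 1, 0]
p_a = [0.9, 0.2, 0.8, 0.7, 0.1]   # fairly accurate model
p_b = [0.6, 0.5, 0.4, 0.5, 0.6]   # close to guessing
w, loss = best_blend_weight(y_val, p_a, p_b)
print(w, loss)
```

The base models are never retrained here; only the weight is fit, and only on validation data.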

Weighted Averaging

See how everything performs on the validation data and adjust. Fine-tune model weights based on performance.
Models are already trained. Weights are optimized post-training.
Weights are assigned based on validation performance. Prioritizes accurate models.

Ensemble Technique Motivation

Works really well. Ensembling often improves predictive performance.
The Netflix Prize competition (a prize of $1,000,000$ dollars) was won using ensembles (stacking technique). Highlights the effectiveness of ensembling.
Pretrained models and stacked outcomes based on validation set. Leverages existing models and optimizes their combination.
Predictions happen on the validation set, not the training set. Avoids overfitting on the training data.
Learning the weights is a separate step. Optimize weights separately.

Combining Outputs on Validation Set

Take a vote among them. Use a voting mechanism.
Average all of them with equal weights. Simple averaging.
Stacking: Weighted average based on validation dataset performance, up-weighting accurate models and down-weighting inaccurate models. A more sophisticated ensembling technique.

Classifier Variance

Want classifiers with variance. Diversity in models is beneficial.
How much would the model change given different datasets of size n? Measures model stability.
Sample data (training data) and split it randomly into equal-sized subsets (size d). Create multiple subsets of the training data.
Model with high variance can have different structure/parameters when fit on different subsets. Indicates sensitivity to the training data.

Data Subsets

Create subsets of training data and train a model on each. Generate diverse models.
Each tree trained on a subset might have different structures. Promotes model diversity.

Gini Index

$Gini = 0$ in the leftmost leaf nodes. Indicates perfect purity.
$Gini = 0.219$ in another node. Measures impurity or disorder.
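
For reference, a sketch of the Gini impurity of a node, computed as one minus the sum of squared class proportions. The class counts below are hypothetical; a 7-to-1 split happens to give about 0.219, but the actual counts behind the value quoted above are not given in these notes.

```python
def gini(class_counts):
    """Gini impurity of a node from its per-class example counts."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)


print(gini([5, 0]))   # pure node: every example has the same class
print(gini([7, 1]))   # hypothetical mostly-pure node, roughly 0.219
print(gini([4, 4]))   # evenly mixed binary node, the most impure case
```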
Decision trees tend to have high variance depending on sub-sampling. Sensitive to variations in the training data.

Generating Subsets

Randomly generate m subsets of the data. Create multiple subsets.
Have m models, each trained on a different subset. Generate diverse models.
Randomly select rows from the training dataset with replacement. Use bootstrapping.
Build a tree for each of the m subsets. Train each model.
Combine trees with equal vote, averaging, or weighted averaging. Ensemble the models.

Bootstrapping

Creating random subsets. A sampling technique.
Create random draws from training data (drawing from a hat). Sample with replacement.
With replacement (can encounter the same row multiple times). Allows for duplicates in the sample.
Train models on these datasets. Generate diverse models.
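
A sketch of drawing from the hat: each bootstrap dataset has as many rows as the original, and a row can be drawn more than once. The row labels, the number of datasets, and the seed are placeholders for illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    n = len(rows)
    # randint is inclusive on both ends, so valid indices are 0 .. n - 1.
    return [rows[rng.randint(0, n - 1)] for _ in range(n)]


rng = random.Random(0)                       # seeded so the run is repeatable
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]  # stand-ins for training rows
datasets = [bootstrap_sample(rows, rng) for _ in range(4)]
for d in datasets:
    print(d)
```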

Bootstrap Aggregating (Bagging)

Bootstrap: Picking from the hat. Sampling with replacement.
Aggregating: After training, get predictions and average them. Ensemble the models.
The bootstrap part is picking from the hat. Sampling with replacement.
The aggregating part comes after I have trained a model on the rows I picked from my hat. Combining predictions.
I get some sort of prediction from that model, do the same for each of the different models, and then aggregate their outcomes. Ensemble prediction.
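
Putting both parts together, a minimal bagging sketch under simplifying assumptions: one-dimensional inputs, binary labels, decision stumps as the base model, and a majority vote as the aggregation. The toy data, the number of models, and the helper names are all made up for the example.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-node tree: choose the threshold and leaf labels with fewest errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def bagging_fit(xs, ys, m, rng):
    """Bootstrap: train m stumps, each on rows drawn with replacement."""
    n = len(xs)
    models = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Aggregate: majority vote over the individual stump predictions."""
    votes = sum(model(x) for model in models)
    return 1 if votes * 2 > len(models) else 0


# Toy training set: small x is class 0, large x is class 1.
xs = [0.1, 0.4, 0.5, 0.9, 1.3, 1.6, 2.0, 2.2]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys, m=25, rng=random.Random(0))
print(bagging_predict(models, 0.2), bagging_predict(models, 2.1))
```

Swapping the vote for an average or a weighted average only changes `bagging_predict`.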

Generating Bootstraps

All I'm doing is I'm randomly drawing each of these rows, and then creating a little dataset from that. Creating subsets.
Notice that I'm not freaking out if I get repeaters. Duplicates are allowed.
This is with replacement. Bootstrapping.

Sample Training

Once I have created d1, d2, d3, d4, I train a model on each one. Creating diverse models.
I get some sort of decision tree from each one. Different trees.
Those decision trees are gonna give me different predictions, and then I can aggregate what they predicted. Ensemble predictions.
I can either weight them or take a vote. I can do any of those tricks that I talked about before. Weighted averaging.

Random Number Generator

You can use randint in Python, which is kinda nice. Simplifies random sampling.
You're gonna run x through each of the trees, see where each tree takes you in terms of its prediction, and then you can combine all the end predictions to get the overall prediction. Process for making predictions.
You can either take the majority vote or the average. Ensemble predictions.
You could weight some of them. Weighted averaging.

Random Features

Trees in and of themselves are going to have high variance. Sensitive to small changes in the data.
But collectively, we can sometimes combat this variance by using the ensemble. Ensembling reduces variance.

Bootstrapping

Trees are low bias, but high variance. Tradeoff between bias and variance.
Bootstrap sampling simulates equally likely datasets. Creates diverse datasets.

Combat Memorization

Combat the memorization effect that was happening when we were running our training to generate one decision tree by itself. Prevents overfitting.
Effectively lower the generalization error; our intuition is that we want to reduce variance while keeping the bias low as well. Improves generalization.

Bias Variance Trade Off

This is our bias-variance trade-off discussion: as complexity increases, we have higher variance and lower bias. Balancing model complexity.
The two curves diverge as complexity increases, and at the low-complexity end we have the opposite: higher bias with lower variance. Finding the right balance.

Problem: large datasets

With large datasets, we can often learn roughly the same tree over and over again, and averaging won't necessarily help because they might all be a little bit too related to each other. Lack of diversity.

Solution: subset of features

We want to only look at a subset of the features. Feature selection.
Sample the features without replacement to create new trees. Improves diversity.
Some trees see certain attributes, others see different ones. Different perspectives.

Random Selection

We are still going to create subsets of the training dataset by random selection with replacement (the bootstrap samples). Creating subsets of the data.
Now, when we are splitting in the decision tree, we are only going to search over our randomly selected features. Focusing on selected features.
We are going to select a certain number of features, and we're gonna do it without replacement. Ensuring diversity in feature selection.
That's going to ensure that our trees are different. Different perspectives.
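
A sketch of the feature-sampling step only, assuming features are identified by name. The feature names and the choice of two features per draw are hypothetical. Whether the draw happens once per tree or again at every split is an implementation choice; the sketch shows a single draw.

```python
import random


def sample_features(feature_names, k, rng):
    """Pick k distinct features (sampling without replacement)."""
    return rng.sample(feature_names, k)


features = ["age", "blood_pressure", "heart_rate", "bmi", "glucose"]  # hypothetical
rng = random.Random(0)
per_tree = [sample_features(features, 2, rng) for _ in range(3)]
for i, subset in enumerate(per_tree):
    print("tree", i, "searches splits over", subset)
```

Unlike the bootstrap rows, no feature can appear twice within one subset.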

Random Forest Key Takeaways

Run x through all of the trees and combine their predictions; the trees are going to be quite different from one another because they are going to rely on, for example, thresholding on different features. Heterogeneous models.
In a random forest, we are doing what we did before, which was bagging. Bootstrapping.

Ensemble Learning

Create little new datasets with replacement. Generate diverse datasets.
There are lots of things you're going to be selecting when you build a tree (min samples split, etc.). Tuning hyperparameters.

Random Forest Parameters Explanation

It is very common to test it against different models, e.g., a decision tree. Benchmarking performance.
A random forest ensembles trees trained on subsets of the dataset, where each tree only sees some of the features. Focuses on selected features.

Decision Tree Vs Random Forest

Two models: vanilla decision tree and random forest. Compare different approaches.
Assess error rate by increasing number of trees. Evaluate performance.
Increase the number of trees to a thousand, and you can see that the random forest reaches a pretty good error rate slightly earlier. More trees often improve performance.

Weakness

Forests of trees can be hard to interpret, since it is difficult to follow how the data is being split across many trees. Lack of interpretability.
The same goes for the resulting decision boundaries. Complex model behavior.

Boosting algorithms (Ensemble Method)

Ensembles can train m classifiers in parallel or sequentially; boosting trains them sequentially and focuses each new model on the examples the previous ones got wrong. Sequential learning.
Rather than doing everything in parallel, we're doing everything sequentially. Focuses on mistakes.

Sequential Classifier Learning

Learn this classifier. Train the model.
Fixated on things that I got wrong, and I'm gonna ruminate on them, I'm gonna make them bigger. Emphasizing mistakes.
They are problematic, so I need to focus on them. Giving importance to difficult examples.
Updated the weights, and now I'm, like, weighting certain misclassified points greater than everything else. Adjust weights.
Create another classifier by giving preferential treatment to the points that were misclassified. Iterative learning.
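
The notes do not spell out the reweighting formula, so the following is a sketch of one common form, the AdaBoost-style update. It assumes labels coded as -1 and +1 and starting weights that sum to 1; the example labels and predictions are made up.

```python
import math


def reweight(weights, y_true, y_pred):
    """One AdaBoost-style update: misclassified points gain weight, others lose it."""
    # Weighted error of the current classifier (weights are assumed to sum to 1).
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    err = min(max(err, 1e-10), 1 - 1e-10)        # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)      # how much say this classifier gets
    # y * p is +1 when correct and -1 when wrong, so wrong points are scaled up.
    new = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return [w / total for w in new], alpha


y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]     # the second point is misclassified (hypothetical)
weights = [0.2] * 5             # start with equal weights
new_weights, alpha = reweight(weights, y_true, y_pred)
print(new_weights, alpha)
```

The next classifier would then be trained with `new_weights`, so the misclassified point counts for more.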

AdaBoost

Each of these is a little one-node decision stump, which is the simplest classifier you could get, and then you combine them. Ensemble simple models.
The most famous of these boosting algorithms is known as AdaBoost. Popular boosting technique.
Works well because you have this sort of bull's eye shape. Suitable data distribution.
Pretty good for random