Ensemble Learning and Cross-Validation in Finance
Ensemble learning is a powerful meta-approach in machine learning designed to improve predictive accuracy and robustness by integrating predictions from multiple base models. It leverages the diversity of individual models to achieve better performance than any single model could achieve on its own. The core principle behind ensemble learning is that a group of models can often correct each other's errors, leading to more reliable and accurate predictions.
Key Ensemble Learning Techniques
Bagging: Fitting multiple decision trees on different data samples (bootstrap samples) and averaging the results to reduce variance and avoid overfitting. Bagging is particularly effective when the base models are complex and prone to overfitting.
Stacking: Fitting various model types on the same data and then training another model (a meta-learner or level-1 model) to optimally combine the predictions of the base models (level-0 models). Stacking can leverage the strengths of different model types to achieve higher accuracy.
Boosting: Sequentially adding ensemble members to correct prior model predictions by weighting the examples that were misclassified by previous models. Boosting focuses on reducing bias and creating a strong learner from a series of weak learners, with the final prediction being a weighted average of these predictions.
Bagging Ensemble Learning
Bagging (Bootstrap Aggregation) enhances ensemble diversity by training individual models on different subsets of the training data, drawn with replacement (bootstrapping). This process ensures that each model is exposed to slightly different training data, promoting independence and reducing the likelihood of overfitting.
The name Bagging is derived from Bootstrap AGGregatING.
It is based on two key components:
Bootstrap sampling: Creating multiple subsets of the original training data by sampling with replacement.
Aggregation: Combining the predictions of the individual models, typically through averaging (for regression) or voting (for classification).
Typically, bagging involves training a single machine learning algorithm, most commonly unpruned decision trees, on each bootstrap sample. The diversity in training data leads to variations in the learned models, and their predictions are then combined using simple statistical methods like voting or averaging to produce a final prediction.
A key aspect of bagging is the preparation of each dataset sample for training ensemble members:
Each model is trained on its own unique dataset sample.
Examples (rows) are randomly selected from the dataset but with replacement.
Replacement means a row can be selected multiple times for a given training dataset; this process is also known as bootstrap sampling.
Bagging's key elements are:
Bootstrap samples of the training dataset, providing diverse training sets for each model.
Unpruned decision trees fit on each sample, allowing for complex models that can capture intricate relationships in the data.
Simple voting or averaging of predictions to combine the outputs of individual models into a final, more robust prediction.
Bagging achieves diversity through variations in the bootstrapped replicas and the use of unstable base learners, whose fitted models change noticeably under small perturbations of the training data. Combining many such learners yields a more stable ensemble.
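The bootstrap-and-aggregate procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (bootstrap_sample, fit_1nn, fit_bagging, bagging_predict) are hypothetical, and a 1-nearest-neighbour regressor stands in for the unpruned decision tree as the high-variance base learner.

```python
import random
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so a row may appear several times."""
    return rng.choices(rows, k=len(rows))

def fit_1nn(rows):
    """Toy high-variance base learner (assumption): 1-nearest-neighbour on (x, y) rows."""
    def predict(x):
        return min(rows, key=lambda r: abs(r[0] - x))[1]
    return predict

def fit_bagging(rows, n_models=25, seed=0):
    """Fit one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_1nn(bootstrap_sample(rows, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Aggregate by averaging (regression); a classifier would vote instead."""
    return mean(m(x) for m in models)

# Toy data: a quadratic trend plus Gaussian noise.
noise = random.Random(42)
data = [(i / 10, (i / 10) ** 2 + noise.gauss(0, 0.1)) for i in range(50)]
models = fit_bagging(data)
print(bagging_predict(models, 2.05))
```

Each member sees a slightly different training set, so the individual 1-NN predictions differ, and the average smooths out part of that variance.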
Algorithms based on Bagging
Bagged Decision Trees (canonical bagging): The basic implementation of bagging using decision trees.
Random Forest: An extension of bagging that introduces additional randomness by selecting a random subset of features at each split in the decision trees.
Extra Trees (Extremely Randomized Trees): Similar to Random Forest but with even more randomization in the tree-building process.
Stacking Ensemble Learning
Stacked Generalization (stacking) leverages different model types on the same training data, using another model (meta-learner) to learn how to best combine the predictions of the base models. This approach allows the ensemble to capitalize on the strengths of different algorithms.
Ensemble members are level-0 models, which are the base models trained on the original data.
The model combining the predictions is a level-1 model (meta-learner), which learns how to weight the contributions of the base models.
More layers of models can be used, creating a hierarchical structure of learners.
Any machine learning model can be used as the meta-learner, but linear models like linear regression (for regression tasks) and logistic regression (for classification tasks) are commonly employed. The meta-learner is trained on the predictions of the base models, learning to correct their individual errors and combine their strengths.
Stacking's key elements are:
An unchanged training dataset, ensuring that all base models are trained on the same foundational data.
Different machine learning algorithms for each ensemble member, promoting diversity in the types of relationships captured in the data.
A machine learning model to learn how to combine predictions, allowing the ensemble to adaptively weight the contributions of each base model.
Diversity in stacking arises from the use of different machine learning models as ensemble members. A suite of models learned or constructed differently ensures varied assumptions and less correlated prediction errors, leading to improved generalization performance.
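A rough sketch of the level-0 / level-1 structure, under simplifying assumptions: two toy base models (a mean predictor and a 1-nearest-neighbour regressor), out-of-fold predictions as the meta-learner's inputs, and a one-parameter linear blend as the meta-learner. All function names here are hypothetical, and the interleaved folds are only appropriate for this IID toy data, not for time series.

```python
from statistics import mean

def fit_mean(rows):
    """Level-0 model A: always predicts the training mean of y."""
    m = mean(y for _, y in rows)
    return lambda x: m

def fit_1nn(rows):
    """Level-0 model B: predicts the y of the nearest training x."""
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]

def out_of_fold(rows, fit, k=5):
    """Predict every row with a model that never saw that row."""
    preds = [0.0] * len(rows)
    for fold in range(k):
        model = fit([r for i, r in enumerate(rows) if i % k != fold])
        for i, (x, _) in enumerate(rows):
            if i % k == fold:
                preds[i] = model(x)
    return preds

def fit_blend_weight(a, b, y):
    """Level-1 model: least-squares w for the blend w*a + (1 - w)*b."""
    num = sum((yi - bi) * (ai - bi) for ai, bi, yi in zip(a, b, y))
    den = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return num / den if den else 0.5

def mse(pred, y):
    return mean((p - t) ** 2 for p, t in zip(pred, y))

# Toy data: a quadratic trend with alternating noise.
rows = [(i / 10, (i / 10) ** 2 + 0.3 * (-1) ** i) for i in range(30)]
y = [t for _, t in rows]
oof_a = out_of_fold(rows, fit_mean)
oof_b = out_of_fold(rows, fit_1nn)
w = fit_blend_weight(oof_a, oof_b, y)

# Refit the base models on all data and combine them with the learned weight.
model_a, model_b = fit_mean(rows), fit_1nn(rows)

def stacked_predict(x):
    return w * model_a(x) + (1 - w) * model_b(x)

stacked_oof = [w * a + (1 - w) * b for a, b in zip(oof_a, oof_b)]
print(w, mse(oof_a, y), mse(oof_b, y), mse(stacked_oof, y))
```

Training the meta-learner on out-of-fold predictions rather than in-sample predictions keeps it from simply rewarding whichever base model overfits the most.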
Algorithms based on Stacking
Stacked Models (canonical stacking): The basic implementation of stacking.
Blending: A simplified version of stacking where the meta-learner is trained on a holdout set rather than through cross-validation.
Super Ensemble: A more complex form of stacking that can involve multiple layers of meta-learners.
Boosting Ensemble Learning
Boosting adapts training data to emphasize examples that previous models got wrong. The models are fit sequentially, with each model trying to correct the errors of its predecessors. This iterative process focuses on difficult-to-classify instances, gradually improving the ensemble's performance.
It employs simple decision trees (weak learners) with single or few decisions, ensuring that each model is quick to train and less prone to overfitting.
The predictions are combined via simple voting or averaging, weighted by performance, giving more influence to models that perform better on the training data.
The goal is to develop a "strong" learner from many "weak" learners, with each weak learner contributing to the overall predictive power of the ensemble.
Instead of training on random subsets of the data, boosting algorithms focus on specific examples (rows of data) based on whether they were correctly or incorrectly predicted by prior ensemble members. This adaptive resampling technique allows the algorithm to concentrate on the most challenging cases.
Boosting's key elements are:
Biasing training data toward hard-to-predict examples, allowing the ensemble to focus on the most informative instances.
Iteratively adding ensemble members to correct prior models' predictions, gradually improving the overall accuracy of the ensemble.
Combining predictions using a weighted average of models, giving more weight to models that perform better on the training data.
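The reweighting loop can be illustrated with a compact AdaBoost-style sketch that uses one-split decision stumps as weak learners. It is a simplified illustration on a hand-made one-dimensional dataset; the function names are hypothetical and labels are assumed to be +1 or -1.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (threshold, left_label, weighted_error) of the best one-split stump."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for left in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (left if x < t else -left) != y)
            if best is None or err < best[2]:
                best = (t, left, err)
    return best

def stump_predict(stump, x):
    t, left, _ = stump
    return left if x < t else -left

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1 / n] * n                      # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, ys, weights)
        err = max(stump[2], 1e-12)             # guard against log(1/0)
        if err >= 0.5:                         # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)  # better stumps get a larger say
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]        # no single threshold separates these
ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single stump can fit this label pattern, but the weighted vote of three stumps does, which is the "strong learner from weak learners" idea in miniature.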
Algorithms based on Boosting
AdaBoost (Adaptive Boosting): The canonical boosting algorithm that adjusts the weights of training instances based on the performance of previous models.
Gradient Boosting Machines: A generalization of AdaBoost that allows for the use of different loss functions and weak learners.
Stochastic Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost): An extension of Gradient Boosting that introduces randomness in the training process to further reduce overfitting and improve performance.
Walk-forward Modeling for Time Series Prediction
Challenges in Market Prediction
Market prediction poses unique challenges for machine learning, largely due to the inherent characteristics of financial data.
Low signal-to-noise ratio: Financial data often contains more noise than actual predictive signals, making it difficult to discern meaningful patterns.
Non-stationarity (regime switching): The statistical properties of financial time series change over time, meaning that patterns observed in the past may not hold in the future.
Market adaptation (reflexivity): Market participants react to predictions, which can alter the underlying patterns, creating a feedback loop that complicates the prediction process.
Issues with Traditional Cross-Validation
Train-Test Split: Using only a fraction of data for out-of-sample performance and biasing towards the most recent period, which may not be representative of the overall data distribution.
Cross-Validation: Training models on non-sequential data, resulting in data spillover, where information from the future leaks into the training set, leading to overly optimistic performance estimates.
Walk-Forward Train/Test
A more robust approach that simulates models in a sequence, periodically retraining to incorporate all data available at that point in time. This method provides a more realistic assessment of a model's performance in a dynamic environment.
Training Models
Create an end-of-year dates list to define the points at which the model will be retrained.
Slice out data up to that point to create the training dataset for each iteration.
Train a model with the data to learn the relationships between features and target variables.
Save the fitted model object to preserve the learned parameters for later use.
Rolling Window
Considering only the 90 days of data prior to each retraining gives a more focused view of recent market dynamics. In this setup, feature f01 has overtaken f02 in significance by the end of the time period, which shows the model adapting to changing market conditions.
Using Models
Loop through each trained model and predict the period of time until the next model becomes available, simulating real-time prediction scenarios.
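The training and prediction loops above can be sketched as follows. This is a toy illustration under stated assumptions: the "model" is just the mean of the target in the training window, fitted models are kept in a dictionary instead of being written to disk, and the names train_models, predict_walk_forward and window_days are hypothetical. Passing window_days=90 switches from an expanding window to the rolling window described above.

```python
from datetime import date, timedelta
from statistics import mean

def train_models(dates, targets, retrain_dates, window_days=None):
    """Fit one toy model per retraining date, using only data up to that date."""
    models = {}
    for cutoff in retrain_dates:
        # Expanding window by default; rolling window if window_days is given.
        start = cutoff - timedelta(days=window_days) if window_days else date.min
        sample = [t for d, t in zip(dates, targets) if start < d <= cutoff]
        if sample:
            models[cutoff] = mean(sample)      # "saving" the fitted model
    return models

def predict_walk_forward(dates, models):
    """Each model predicts only the period until the next model is available."""
    cutoffs = sorted(models)
    preds = {}
    for i, cutoff in enumerate(cutoffs):
        nxt = cutoffs[i + 1] if i + 1 < len(cutoffs) else date.max
        for d in dates:
            if cutoff < d <= nxt:
                preds[d] = models[cutoff]
    return preds

# Toy monthly data for three years; the target is simply the observation index.
dates = [date(y, m, 15) for y in (2019, 2020, 2021) for m in range(1, 13)]
targets = list(range(36))
retrain_dates = [date(2019, 12, 31), date(2020, 12, 31)]

expanding = train_models(dates, targets, retrain_dates)
rolling = train_models(dates, targets, retrain_dates, window_days=90)
preds = predict_walk_forward(dates, expanding)
print(expanding, rolling)
```

Observations before the first retraining date receive no prediction, because no model existed yet at that point in time.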
PurgedKFold Cross-Validation
Limitations of Traditional Cross-Validation
Traditional k-fold cross-validation fails in finance for two reasons. First, observations cannot be expected to be drawn from an IID (independent and identically distributed) process, so information leaks between the training and testing sets. Second, the testing set is used several times during the development of a model, resulting in multiple testing and selection bias.
The rest of this section focuses on the first reason, leakage.
Addressing Leakage
Purging: Removing from the training set any samples whose labels overlap in time with the labels in the test set, so that the two sets share no information.
Embargoing: Additionally removing training samples that immediately follow a test set, because serial correlation can carry test-set information into those observations.
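A minimal sketch of purging and embargoing for a single train/test split, assuming observations are indexed 0..n-1 in time order and the label of observation i is computed from data in the closed interval [i, i + horizon]. The function name and parameters are hypothetical and chosen for illustration.

```python
def purged_train_indices(n_obs, test_start, test_end, horizon, embargo=0):
    """Return training indices for a test block range(test_start, test_end).

    Drops training observations whose label window overlaps the test labels
    (purging) and a further `embargo` observations after the test block.
    """
    last_test_label_end = test_end - 1 + horizon
    train = []
    for i in range(n_obs):
        if test_start <= i < test_end:
            continue                           # test observation
        if i < test_start and i + horizon >= test_start:
            continue                           # purged: label reaches into the test block
        if i >= test_end and i <= last_test_label_end + embargo:
            continue                           # purged or embargoed after the test block
        train.append(i)
    return train

# 20 observations, labels look 2 steps ahead, test block is 8..11, embargo of 1.
print(purged_train_indices(20, 8, 12, horizon=2, embargo=1))
```

With horizon=0 and embargo=0 the function reduces to an ordinary split that removes only the test block itself.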
The walk-forward (WF) methodology is the most commonly used backtesting method in the literature. WF is a historical simulation of how the strategy would have performed in the past. Each strategy decision is based only on information available before the decision was made, which keeps the backtest free of look-ahead.
Advantages of CV Method
The test is not based on a specific (historical) scenario, providing a more generalized assessment of the model's performance.
CV evaluates k different scenarios, only one of which matches the historical sequence, offering a broader perspective on the model's robustness.
Every decision is made on sets of equal size, so each fold contributes equally to the overall evaluation.
Disadvantages of CV Method
A single backtest path is simulated, similar to WF, limiting the scope of the evaluation.
Per observation, one and only one forecast is generated, which may not capture the full range of possible outcomes.
CV has no clear historical interpretation, making it difficult to relate the results to how the model would have performed in real time.
The Combinatorial Purged Cross-Validation (CPCV) Backtesting Algorithm
Given a number of backtest paths targeted by the researcher, CPCV generates the exact number of training/testing set combinations needed to construct those paths, while purging training observations that contain leaked information. This helps keep the backtest free of leakage, although it does not remove every other source of bias.
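The counting behind CPCV can be sketched as follows, assuming the usual formulation in which the data are partitioned into N groups and every split uses k of them for testing. Under that assumption there are C(N, k) splits and (k / N) * C(N, k) backtest paths. The function names are hypothetical, and purging of the training groups is omitted to keep the example short.

```python
from itertools import combinations
from math import comb

def cpcv_splits(n_groups, n_test_groups):
    """Enumerate (train_groups, test_groups) pairs over all test combinations."""
    groups = range(n_groups)
    return [(tuple(g for g in groups if g not in test), test)
            for test in combinations(groups, n_test_groups)]

def n_backtest_paths(n_groups, n_test_groups):
    """Each group is tested in k/N of the splits, giving that many full paths."""
    return comb(n_groups, n_test_groups) * n_test_groups // n_groups

splits = cpcv_splits(6, 2)
print(len(splits), n_backtest_paths(6, 2))

# Every group appears in a test set the same number of times as there are paths.
times_tested = [sum(g in test for _, test in splits) for g in range(6)]
print(times_tested)
```

Because every group is tested equally often, the test forecasts can be recombined into several complete backtest paths instead of the single path produced by walk-forward or plain k-fold CV.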
PurgedKFoldCV Code
Two requirements for a dataframe:
It has to be a time series to maintain the temporal integrity of the data.
It needs target labels that are produced on future values to ensure that the model is predicting future outcomes based on past data.
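To make the second requirement concrete, here is a small sketch of forward-looking labels, using a plain list of prices as a stand-in for a DataFrame column. The function name forward_labels and the sign-of-return labeling rule are assumptions for illustration. Each label records the index t1 at which it is resolved, which is the information a purged splitter needs.

```python
def forward_labels(prices, horizon):
    """Return (t0, t1, label) tuples; label is the sign of the price change t0 -> t1."""
    out = []
    for t0 in range(len(prices) - horizon):
        t1 = t0 + horizon                      # the label depends on data up to t1
        change = prices[t1] - prices[t0]
        label = (change > 0) - (change < 0)    # +1, 0 or -1
        out.append((t0, t1, label))
    return out

prices = [100, 101, 99, 102, 103, 101]
labels = forward_labels(prices, horizon=2)
print(labels)
```

The last `horizon` observations get no label because their outcome lies beyond the end of the series.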
Keras Loss Functions
Loss functions compute the quantity that a model should minimize during training. Keras offers a variety of loss functions, including probabilistic, regression, and hinge losses, each designed for different types of tasks and data distributions.
Base Loss API
Loss class: The base class for losses. Subclass it to create new custom losses tailored to specific problems.
call(): Contains the logic for loss calculation using y_true and y_pred, which are the true targets and the model's predictions, respectively. This method defines how the loss is computed from the model's predictions.
Standalone usage of Losses
A loss is a callable with the signature loss_fn(y_true, y_pred, sample_weight=None); the optional sample weights give more importance to certain data points during training.
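The arithmetic behind a weighted loss call can be sketched in plain Python. This mirrors, rather than calls, Keras: it assumes a mean-squared-error loss and the default "sum over batch size" reduction, in which the weighted per-sample losses are summed and divided by the number of samples. The helper names are hypothetical.

```python
def squared_error_per_sample(y_true, y_pred):
    """One loss value per sample: the mean squared error over that sample's outputs."""
    return [sum((t - p) ** 2 for t, p in zip(ts, ps)) / len(ts)
            for ts, ps in zip(y_true, y_pred)]

def reduce_loss(per_sample, sample_weight=None):
    """Weight each sample's loss, then average over the batch size."""
    if sample_weight is None:
        sample_weight = [1.0] * len(per_sample)
    return sum(l * w for l, w in zip(per_sample, sample_weight)) / len(per_sample)

y_true = [[0.0, 1.0], [0.0, 0.0]]
y_pred = [[1.0, 1.0], [1.0, 0.0]]
per_sample = squared_error_per_sample(y_true, y_pred)
print(per_sample)
print(reduce_loss(per_sample))
print(reduce_loss(per_sample, sample_weight=[0.7, 0.3]))
```

Note that the weights scale each sample's contribution but the divisor stays the batch size, so weights that sum to less than the batch size shrink the reported loss.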
Creating Custom Losses
Any callable with the signature loss_fn(y_true, y_pred) that returns an array of losses (one per sample in the input batch) can be passed to compile() as a loss. This flexibility allows users to define custom loss functions tailored to their problem.
When writing custom loss functions, observe the following rules:
The loss function must take exactly two arguments: the true values and the predicted values. This keeps it compatible with the training process.
The loss must actually use y_pred (the predicted value); otherwise the gradient expression is undefined and training can raise an error. The gradient is necessary for the optimization process.
A Keras loss function can return NaN values during training for several reasons:
Missing Values in training dataset: Ensure that the dataset is preprocessed to handle missing values appropriately.
Loss is unable to get traction on training dataset: This may indicate that the learning rate is too high, or the model architecture is not well-suited for the data.
Exploding Gradients: Gradient clipping can be used to mitigate this issue, which occurs when the gradients become excessively large during training.
Dataset is not scaled: Scaling the dataset can help to stabilize the training process and prevent NaN values.
Dying ReLU problem: This occurs when ReLU neurons become inactive and stop learning. Using alternative activation functions like LeakyReLU can help.
Not a good choice of optimizer function: Experiment with different optimizers to find one that is well-suited for the problem at hand.
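Three of the remedies above (checking for missing values, scaling, and gradient clipping) can be illustrated with plain-Python helpers. These are sketches of the underlying arithmetic with hypothetical names, not Keras code; in Keras itself, clipping is usually requested through the optimizer's clipnorm or clipvalue arguments.

```python
import math
from statistics import mean, pstdev

def has_missing(rows):
    """True if any value in the dataset is NaN."""
    return any(math.isnan(v) for row in rows for v in row)

def standardize(column):
    """Rescale a feature to zero mean and unit variance (assumes it is not constant)."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def clip_by_norm(grad, max_norm):
    """Shrink a gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

rows = [[1.0, 200.0], [2.0, float("nan")], [3.0, 400.0]]
print(has_missing(rows))
print(standardize([1.0, 2.0, 3.0]))
print(clip_by_norm([3.0, 4.0], 1.0))
```

Clipping by norm keeps the direction of the gradient and only limits its length, which is why it is a common first remedy for exploding gradients.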