Say we have the following variables for an AI Factor Model (built on portfolio123.com).
Now, what happens in a K-fold CV test?
The AI model can build the portfolio across 21.6 years (in orange: Validation Holdout).
It builds the portfolio from the predictions of a model trained on the validation training periods (in green: Validation - Training).
Plus, we cross validate!
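To make the picture concrete, here is a minimal Python sketch (not Portfolio123 code; the weekly frequency and the 4 folds are my own assumptions to match the chart) of how K-fold carves the 2003-2024 date range into contiguous holdout blocks:

```python
# Illustrative only: how K-fold partitions a 2003-2024 date range
# into contiguous holdout blocks (orange) vs. training periods (green).
import pandas as pd
from sklearn.model_selection import KFold

dates = pd.date_range("2003-01-04", "2024-07-27", freq="W")  # assumed weekly dates

kf = KFold(n_splits=4, shuffle=False)  # contiguous blocks, no shuffling
for i, (train_idx, holdout_idx) in enumerate(kf.split(dates), start=1):
    print(f"Fold {i}: holdout {dates[holdout_idx[0]].date()} to "
          f"{dates[holdout_idx[-1]].date()}, trained on the other {len(train_idx)} weeks")
```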
Here is some ChatGPT on cross-validation with the ML model Extra Trees —>
“Why is K-Fold Cross-Validation Effective for Preventing Overfitting?
K-fold cross-validation is a method to prevent overfitting in machine learning models, including Extra Trees (Extremely Randomized Trees). Here's why:
1. More Reliable Performance Estimates
Instead of relying on a single train-test split, k-fold cross-validation splits the dataset into multiple train-test splits.
This ensures that the model is evaluated on different subsets of data, providing a more stable and unbiased estimate of performance.
2. Reduces Overfitting on a Single Training Set
If you train Extra Trees on just one train-test split, the model may overfit to that particular training data.
K-fold forces the model to train on different subsets and validate on different holdout sets, preventing it from memorizing patterns from a single training set.
3. Ensures Model Generalization
Overfitting occurs when a model performs well on training data but poorly on unseen data.
K-fold cross-validation tests the model on unseen data multiple times, ensuring that it generalizes well across different portions of the dataset.
4. Better Utilization of Data
With a standard train-test split, a portion of the data is never used for training.
In k-fold CV, every data point is used for both training and validation at some point, maximizing data efficiency.
5. Extra Trees & Overfitting Control
Extra Trees (Extremely Randomized Trees) are less prone to overfitting due to random feature splits, but they can still overfit if trained on small or unbalanced datasets.
K-fold CV ensures that every part of the data contributes to both training and validation, reducing the risk of the model learning noise instead of general patterns.
Example: How K-Fold Works in Extra Trees
1. The dataset is split into k folds (e.g., k = 4).
2. The model is trained on k-1 folds and tested on the remaining fold.
3. This process repeats k times, each time using a different fold as the test set.
4. The final performance metric is the average across all k iterations.
Summary: Why K-Fold Cross-Validation Prevents Overfitting
- More robust performance estimates
- Reduces model reliance on a specific train-test split
- Ensures model generalization across different subsets
- Prevents the model from memorizing data patterns
- Works well with tree-based models like Extra Trees”
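For anyone who wants to see this outside the platform, here is a hedged scikit-learn sketch of the same procedure: Extra Trees scored with 4-fold CV. The features and forward returns are synthetic stand-ins, not our factor data, and the hyperparameters are arbitrary:

```python
# Hedged sketch of K-fold CV with Extra Trees (synthetic data, not P123's pipeline).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))            # stand-in factor features
y = X[:, 0] * 0.5 + rng.normal(size=1000)  # stand-in forward returns

model = ExtraTreesRegressor(n_estimators=200, random_state=0)
cv = KFold(n_splits=4, shuffle=False)      # 4 contiguous folds, as in the example

scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("per-fold R^2:", scores.round(3))
print("mean R^2:", scores.mean().round(3))  # average across all k iterations
```

Note that shuffle=False keeps each fold as one contiguous block, which is closer to how the validation chart above looks; shuffled folds would mix neighboring periods across the train/holdout boundary.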
This (K-fold CV validation & training) also means we can take an AIFactorValidation factor and test the whole period for which we loaded the data in a Portfolio Strategy.
For this we take the Model(s) with the best results:
We click on the ML model (the “Fx” symbol):
And this window pops up:
Now we copy the string AIFactorValidation("6MRel + Feature Importance +JJJ K-Fold", "extra trees II") into a ranking system, include that ranking system in a portfolio strategy, and test the AIFactorValidation from 01/04/2003 to 07/27/2024.
And here is the Backtest Result —>
Again, the above test is still an out-of-sample test, because “K-fold cross-validation tests the model on unseen data multiple times”.
Yes, but for the earliest portfolios the model was trained partly on future data (the later training folds), which would not be available in live trading. On the other hand, a K-fold test makes sure the ML model gets much more training data; see the sketch below.
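Here is a small sketch of that look-ahead point, contrasting plain K-fold with a walk-forward split (scikit-learn's TimeSeriesSplit); the 12 periods are purely illustrative:

```python
# Sketch of the look-ahead concern: with plain K-fold, the earliest test fold
# is predicted by a model trained partly on LATER periods; a walk-forward
# scheme like TimeSeriesSplit never trains on the future.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

periods = np.arange(12)  # 12 time-ordered periods, 0 = oldest

print("K-fold (k=4):")
for train, test in KFold(n_splits=4, shuffle=False).split(periods):
    future = train[train > test.max()]
    print(f"  test {test.tolist()} | trained on {len(future)} future periods")

print("Walk-forward (TimeSeriesSplit):")
for train, test in TimeSeriesSplit(n_splits=3).split(periods):
    print(f"  test {test.tolist()} | train {train.tolist()} (past only)")
```

With K-fold, the first test block is predicted by a model trained entirely on later periods; walk-forward never lets that happen, but each early model then sees far less training data, which is exactly the trade-off above.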
Hmmm, still need to get my head around it…
Best Regards
Andreas