Failed Machine Learning Experiment: Training XGBoost Classifier with 1.5m signals

Daniel Stepanian

In 2022 I started creating trading strategies in Python, and I had some powerful ML-based strategies in mind, but had neither the knowledge nor the skills to code and test them. Now, although I still have no experience with professional machine learning or its deeper mathematics, I figured I could use AI to write the code (Sonnet 4.5) and to suggest model parameters (Grok Thinking).

Looking at many market price charts, I was under the impression that there are patterns which, combined with the right trading strategy and position optimization, could be exploited for at least a couple of percent of automated return. It's now clear that this is not true - the market usually presents a distorted picture. Still, tempted by the chance to check it myself in a quick prototype project, I ran an experiment to verify whether a hypothesis based on those earlier impressions could hold. I used two Jupyter notebooks: one for XGBoost model training and one for the strategy backtest.

First, I downloaded five years of 15-minute price data for the top 30 crypto tokens into Parquet files. Then I wrote an algorithm to find every price point followed by a drop bigger than 3% within the next ten 15-minute bars, and extracted the ten preceding price points with technical-analysis indicators as training data for an XGBoost classifier - the goal being to identify moments that precede price drops. 500k drop signals were found, and I added another 1 million random non-drop samples, for 1.5m training samples in total, with 20% of these held out for testing.
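
The labeling step can be sketched in pandas roughly like this (a minimal illustration, not the post's actual code - the lookahead and threshold match the description, but the function and column handling are my assumptions):

```python
import pandas as pd

def label_drop_signals(close: pd.Series, lookahead: int = 10,
                       drop_pct: float = -0.03) -> pd.Series:
    """Label bars after which price falls more than |drop_pct|
    within the next `lookahead` bars (current bar excluded)."""
    # Minimum close over the next `lookahead` bars, via a reversed rolling window
    future_min = close[::-1].rolling(lookahead, min_periods=1).min()[::-1].shift(-1)
    future_return = future_min / close - 1.0
    return (future_return <= drop_pct).astype(int)

# Toy 15-minute closes: a >3% drop follows the first two bars
prices = pd.Series([100.0, 100.0, 96.0, 100.0, 100.0, 100.0])
labels = label_drop_signals(prices)  # [1, 1, 0, 0, 0, 0]
```

Bars at the very end of the series get no label signal, since their lookahead window is incomplete.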

I've also normalized the drops, since a 3% drop on Bitcoin has a different magnitude than the same drop on Dogecoin. So I chose a drop threshold of -2 expressed as a z-score: drop_zscore = drop_pct / volatility. In other words, a qualifying drop is twice the token's typical volatility (based on standard deviation).
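
A sketch of that normalization (the rolling window of 96 bars, i.e. one day of 15-minute candles, is my assumption - the post does not state which window was used):

```python
import numpy as np
import pandas as pd

def drop_zscore(close: pd.Series, vol_window: int = 96) -> pd.Series:
    """Scale per-bar returns by rolling volatility so one threshold
    is comparable across tokens with very different volatility."""
    returns = close.pct_change()
    volatility = returns.rolling(vol_window).std()
    return returns / volatility

# Synthetic price path for illustration
rng = np.random.default_rng(0)
close = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.01, 200)))
z = drop_zscore(close)
is_drop = z < -2.0  # a move 2x the token's typical volatility
```

The first `vol_window` values are NaN until the rolling window fills, so those bars never trigger a signal.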
Then came a feature engineering process based on indicator momentum, volatility, and price differences. After data preparation, I trained XGBoost with parameters from Grok's recommendation:

**Recommended hyperparameters:**
- `max_depth`: 3-7 (prevents memorizing noise)
- `learning_rate`: 0.01-0.1 (smaller = better with more trees)
- `n_estimators`: 200-500 (with early stopping)
- `subsample` / `colsample_bytree`: 0.6-0.9 (prevents overfitting)
- `scale_pos_weight`: 3-10 (handles class imbalance)


The model performed very similarly on the train and test sets:

```
============================================================
TRAIN SET PERFORMANCE
============================================================
ROC-AUC Score: 0.6899

Classification Report:
              precision    recall  f1-score   support

   No Signal       0.93      0.62      0.74   3149036
      Signal       0.19      0.66      0.29    426220

    accuracy                           0.62   3575256
   macro avg       0.56      0.64      0.52   3575256
weighted avg       0.84      0.62      0.69   3575256

Confusion Matrix:
[[1938267 1210769]
 [ 144995  281225]]

============================================================
TEST SET PERFORMANCE (Unseen Data)
============================================================
ROC-AUC Score: 0.6761

Classification Report:
...
Train AUC: 0.6899
Test AUC:  0.6761
Difference: 0.0138
✓ Good generalization - minimal overfitting
```


Basic returns turned out to be the most important features for drop prediction. Yet there are far too many false positives, which would hurt a real portfolio.

*Confusion Matrix by Feature Space*

So I thought: maybe the right set of position parameters could save this signal and make it usable? I proceeded with the backtesting notebook. I loaded the model, built a backtesting trading simulation environment, and defined a set of position parameters: take-profit (TP), stop-loss (SL), entry delay, and cooldown. I tried a grid-search optimization approach - testing 900 parameter combinations to find the best one algorithmically. It took three hours on my local machine, and yet… every single scenario resulted in a 100% loss. The process failed miserably.
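
The grid search can be sketched roughly as follows. This is a toy short-side simulation under my own assumptions (enter short on a predicted drop, exit on TP/SL), not the post's actual backtester, and the grid values are illustrative rather than the real 900-combination grid:

```python
import numpy as np
from itertools import product

def run_backtest(prices, signals, tp, sl, delay, cooldown):
    """Toy short simulation: enter `delay` bars after a signal, exit on
    take-profit/stop-loss, then skip `cooldown` bars. Returns total return."""
    equity, i, n = 1.0, 0, len(prices)
    while i < n:
        if signals[i]:
            entry_i = i + delay
            if entry_i >= n:
                break
            entry = prices[entry_i]
            j = entry_i + 1
            while j < n:
                pnl = (entry - prices[j]) / entry  # short-position P&L
                if pnl >= tp or pnl <= -sl:
                    break
                j += 1
            pnl = (entry - prices[min(j, n - 1)]) / entry
            equity *= 1.0 + pnl
            i = j + cooldown
        else:
            i += 1
    return equity - 1.0

# Hypothetical parameter grid (81 combos; the post's run covered ~900)
grid = list(product([0.01, 0.02, 0.03],   # take-profit
                    [0.01, 0.02, 0.03],   # stop-loss
                    [0, 1, 2],            # entry delay, in bars
                    [4, 8, 16]))          # cooldown, in bars

# One combination on toy data: a correct signal before a 5% drop
prices = np.array([100.0, 100.0, 95.0, 95.0, 95.0])
signals = np.array([1, 0, 0, 0, 0])
demo = run_backtest(prices, signals, tp=0.03, sl=0.02, delay=0, cooldown=4)
```

Ranking all combinations is then just `max` over `run_backtest` results for each tuple in `grid`.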

It was nice working on this step by step with Cursor + Sonnet 4.5. I read a lot about XGBoost while building this, so simply telling the assistant what needed to be done and why, and watching it create neat notebooks that worked out of the box or after one or two debug-fix iterations, felt almost seamless. Working with Jupyter notebooks in Cursor is not convenient, though - after changes are applied in Agent mode, the notebook has to be closed, reopened, and rerun manually. So I ended up using Ask mode and pasting the code blocks in manually.
