vladik1314

Posted on Jun 2

I Built an ML Bitcoin Trading Bot — Then Proved It Had No Edge.

#machinelearning #python #datascience #playwright

A case study in building crypto trading infrastructure and, more importantly, testing it honestly — including when the honest answer is "no edge."

What I built

A clean, modular system (one responsibility per file) for trading BTC on Bybit:

Signal layer — a Random Forest over 20 engineered features (EMA gaps, RSI, MACD, Bollinger position, ATR, volume, momentum, time-of-day).
Regime detection — classifies the market into BULL / BEAR / NEUTRAL / CRASH / EUPHORIA and swaps strategy parameters accordingly.
Execution — multi-position management, trailing stops, 50% scale-out at TP1, 4h position aging, daily-loss and trade-rate risk limits.
Entry timing — Optimal Trade Entry using Fibonacci retracement zones, plus higher-timeframe and session filters.

The engineering is solid. The architecture is something I'd happily put in production. But good engineering and a profitable strategy are two completely different things — and conflating them is how people lose money.

The trap almost everyone falls into

The original model reported 52.9% accuracy and felt promising. But it was predicting the wrong thing: "will the next 15-minute candle close higher?" That target is:

Near-random — next-bar direction is dominated by microstructure noise; the theoretical ceiling is barely above 50%.
Disconnected from P&L — a +0.01% tick and a +2% rip are both labeled "up," but only one makes money after a 0.4% stop and fees.

A model can be "accurate" on a meaningless target and still lose every dollar. So I rebuilt the test.
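To make the gap between accuracy and P&L concrete, here is a minimal expectancy sketch. Only the 52.9% hit rate comes from the post; the average move sizes and the round-trip fee are illustrative assumptions, and `expectancy_pct` is a hypothetical helper, not part of the project.

```python
def expectancy_pct(win_rate, avg_win_pct, avg_loss_pct, round_trip_fee_pct):
    """Expected net P&L per trade, in percent of notional."""
    gross = win_rate * avg_win_pct - (1.0 - win_rate) * avg_loss_pct
    return gross - round_trip_fee_pct


# 52.9% is the accuracy quoted above. The 0.10% average move and the
# 0.11% round-trip cost are assumptions chosen only for illustration.
edge = expectancy_pct(
    win_rate=0.529,
    avg_win_pct=0.10,
    avg_loss_pct=0.10,
    round_trip_fee_pct=0.11,
)
print(f"expected P&L per trade: {edge:+.4f}%")
```

Under these assumed numbers the expectancy is negative: a small accuracy edge on small moves is eaten by costs.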

Testing it properly

Triple-barrier labeling. Instead of "next candle up?", I labeled each sample by what a real trade does: does price hit +0.5% before −0.5% within the next 8 bars? Indecisive samples (no real move) are dropped. Now the model predicts tradeable outcomes.
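A minimal sketch of that labeling rule, assuming plain lists of closes, highs, and lows. The function name, the tie-handling (dropping bars where both barriers are touched in the same candle), and the sample prices are assumptions for illustration; the project's actual implementation may differ.

```python
def triple_barrier_labels(closes, highs, lows,
                          up_pct=0.005, down_pct=0.005, horizon=8):
    """Label each bar 1 (upper barrier hit first), 0 (lower barrier hit
    first), or None (no decisive move within the horizon, so dropped)."""
    labels = []
    n = len(closes)
    for i in range(n):
        upper = closes[i] * (1 + up_pct)
        lower = closes[i] * (1 - down_pct)
        label = None
        # Look only at bars strictly after the entry bar.
        for j in range(i + 1, min(i + 1 + horizon, n)):
            hit_up = highs[j] >= upper
            hit_down = lows[j] <= lower
            if hit_up and hit_down:
                break  # order within the bar is unknown: treat as indecisive
            if hit_up:
                label = 1
                break
            if hit_down:
                label = 0
                break
        labels.append(label)
    return labels


# Made-up prices, for illustration only.
closes = [100.0, 100.2, 100.6, 100.1, 99.4]
highs = [100.1, 100.3, 100.7, 100.6, 100.2]
lows = [99.9, 100.0, 100.1, 99.9, 99.3]
labels = triple_barrier_labels(closes, highs, lows)
print(labels)
```

Samples labeled `None` are the indecisive ones and would be filtered out before training.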

Eliminated train/serve skew. I found the features were computed by two separate code paths (training vs live) — a classic silent bug where the model scores on slightly different inputs than it trained on. I refactored both to call one shared function. (There was even a third, broken copy that hardcoded constants for two features.)
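The shape of that refactor can be sketched as follows. The two features and every function name here are hypothetical stand-ins (the real system uses 20 features); the point is only that training and live scoring both go through one `compute_features` function.

```python
def ema(values, span):
    """Exponential moving average of a sequence, returning the final value."""
    alpha = 2.0 / (span + 1)
    out = values[0]
    for v in values[1:]:
        out = alpha * v + (1 - alpha) * out
    return out


def compute_features(closes):
    """Single source of truth: features from one window of closes."""
    return {
        "ema_gap": ema(closes, 3) / ema(closes, 5) - 1.0,
        "momentum": closes[-1] / closes[0] - 1.0,
    }


def build_training_rows(history, window):
    """Training path: slide the window over history, reuse compute_features."""
    return [
        compute_features(history[i - window:i])
        for i in range(window, len(history) + 1)
    ]


def live_features(recent_closes, window):
    """Live path: same function, applied to the most recent window."""
    return compute_features(recent_closes[-window:])


# Made-up closes, for illustration only.
history = [100.0, 100.4, 100.1, 100.9, 101.2, 100.8, 101.5]
rows = build_training_rows(history, window=5)
```

A cheap regression test for skew is to check that the last training row equals what the live path produces for the same data.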

Walk-forward validation. Chronological folds — train on the past, test on the future — so the model never sees data from its own future. No lookahead bias.
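A minimal sketch of expanding-window walk-forward folds in plain Python. The function name and parameters are assumptions; scikit-learn's `TimeSeriesSplit` provides similar splits. The optional `gap` is an addition not described in the post: because labels that look several bars ahead can overlap the fold boundary, leaving a few bars out of the end of each training set is one way to avoid that overlap.

```python
def walk_forward_splits(n_samples, n_folds, min_train, gap=0):
    """Yield (train_indices, test_indices) as ranges.

    Every test index comes after every train index, and `gap` samples
    are left out between the end of training and the start of testing.
    """
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1 or gap >= min_train:
        raise ValueError("not enough samples for the requested folds/gap")
    for k in range(n_folds):
        train_end = min_train + k * test_size
        yield range(0, train_end - gap), range(train_end, train_end + test_size)


folds = list(walk_forward_splits(n_samples=100, n_folds=4, min_train=20, gap=8))
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```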

Event-driven backtest. I wrote a backtester that replays the actual trade mechanics — stop-loss, scale-out, trailing stop, timeout — including fees and slippage, with pessimistic intrabar fill assumptions. Then I ran it strictly out-of-sample.
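The pessimistic intrabar assumption can be illustrated with a deliberately simplified single-trade replay. This sketch covers only stop, target, and timeout (no scale-out or trailing stop), fills exactly at the stop or target price (no gap handling), and uses a hypothetical function name and made-up parameters; it is not the project's backtester.

```python
def simulate_long(entry, bars, stop_pct, target_pct, fee_pct, max_bars):
    """Replay one long trade bar by bar.

    bars: list of (high, low, close) tuples for the bars after entry.
    fee_pct: assumed round-trip cost (fees plus slippage) as a fraction.
    Returns (net_return_fraction, exit_reason).
    """
    window = bars[:max_bars]
    if not window:
        raise ValueError("need at least one bar after entry")
    stop = entry * (1 - stop_pct)
    target = entry * (1 + target_pct)
    for high, low, close in window:
        # Stop is checked first: if one bar touches both levels,
        # assume the worse outcome happened first.
        if low <= stop:
            return -stop_pct - fee_pct, "stop"
        if high >= target:
            return target_pct - fee_pct, "target"
    # Neither level hit within max_bars: exit at the last close.
    return window[-1][2] / entry - 1 - fee_pct, "timeout"


# One bar that touches both the stop (99.6) and the target (100.8).
result = simulate_long(
    entry=100.0,
    bars=[(101.0, 99.5, 100.5)],
    stop_pct=0.004,
    target_pct=0.008,
    fee_pct=0.0011,
    max_bars=16,
)
print(result)
```

With the ambiguous bar above, the trade is booked as a stop-out rather than a win, which is the conservative choice.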

The results

Test	Result
15m next-candle direction (the illusion)	52.9%
15m triple-barrier, walk-forward	51.8%
1h ATR triple-barrier, out-of-sample	50.3%
Full backtest, real mechanics, 12 months	Win rate 32%, profit factor 0.37

Three independent tests, two timeframes, two labeling schemes, all converging on the same answer: a coin flip. The 32% win rate under real trade mechanics was even worse than the raw accuracy, because a 0.4% stop sits inside Bitcoin's 15-minute noise band, so 86% of trades were stopped out before the thesis could play out.
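For reference, win rate and profit factor can be computed from a list of per-trade net P&Ls as sketched below. The helper name and the sample trades are made up for illustration and are not the backtest's data. A profit factor below 1 means gross losses exceed gross profits.

```python
def win_rate_and_profit_factor(trade_pnls):
    """Return (win_rate, profit_factor) for a list of net per-trade P&Ls."""
    if not trade_pnls:
        raise ValueError("no trades")
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = sum(-p for p in trade_pnls if p < 0)
    win_rate = sum(1 for p in trade_pnls if p > 0) / len(trade_pnls)
    # With no losing trades the ratio is undefined; report infinity.
    profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf")
    return win_rate, profit_factor


# Made-up trades (percent of notional), for illustration only.
sample_pnls = [0.5, -0.4, -0.4, 0.3, -0.4]
wr, pf = win_rate_and_profit_factor(sample_pnls)
print(f"win rate {wr:.0%}, profit factor {pf:.2f}")
```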

The feature importances told the story too: the model leaned almost entirely on volatility features (ATR, Bollinger width) and essentially ignored the directional ones. It had learned to predict how big the next move would be — but not which way. Which is exactly the part that matters, and exactly the part that isn't predictable.

Why — and why that's not surprising

This is market efficiency in action. Technical indicators are public information, so any directional signal in them gets arbitraged away by faster participants almost instantly. The conclusion isn't "tune more parameters" (that's just overfitting noise until something looks good by chance). It's that TA-derived features don't predict BTC direction at retail timeframes. This is a well-documented result, and I confirmed it on my own data.

What I did next

If prediction doesn't work, the answer is to stop predicting and look for structural edges:

Funding-rate capture — a delta-neutral (long spot / short perp) strategy that harvests the perpetual funding premium regardless of direction. I built a monitor for it — and found that at current rates it pays less than the risk-free rate after fees, so the disciplined move is to wait for the regime where it's actually profitable.
Prediction-market arbitrage — I built a scanner for "underround" mispricings on Polymarket. It found that liquid markets are priced to the tick (every binary summed to exactly 1.001), with no accessible arbitrage — efficiency again.
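Both checks reduce to simple arithmetic, sketched below. Everything numeric here is an assumption for illustration: three funding intervals per day (8-hour funding), the funding rate, the annual cost, and the risk-free rate are not figures from the post, and the function names are hypothetical. Only the 1.001 price sum echoes the scanner result described above.

```python
def annualized_funding_yield(rate_per_interval, intervals_per_day=3):
    """Simple (non-compounded) annual yield from a constant funding rate."""
    return rate_per_interval * intervals_per_day * 365


def carry_beats_risk_free(rate_per_interval, annual_cost, risk_free_rate,
                          intervals_per_day=3):
    """True if funding income net of costs exceeds the risk-free rate."""
    net = annualized_funding_yield(rate_per_interval, intervals_per_day) - annual_cost
    return net > risk_free_rate


def underround_edge(yes_ask, no_ask, fee_per_share=0.0):
    """Profit per $1 payout from buying both sides of a binary market.

    Positive only when the two asks sum to less than 1 after fees.
    """
    return 1.0 - (yes_ask + no_ask) - fee_per_share


# Assumed inputs: 0.003% funding per 8h, 1% annual costs, 4% risk-free.
worth_it = carry_beats_risk_free(0.00003, annual_cost=0.01, risk_free_rate=0.04)
print("funding carry beats risk-free:", worth_it)

# Asks summing to 1.001 leave nothing to capture.
edge = underround_edge(yes_ask=0.601, no_ask=0.400)
print("underround edge is positive:", edge > 0)
```

With these assumed inputs neither check passes: the carry nets roughly 2.3% a year against a 4% hurdle, and a price sum above 1 is an overround, not an underround.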

Each negative result is itself a finding: it tells you precisely where not to put your money.

What this project actually demonstrates

Anyone can wire up an exchange API and curve-fit a backtest. The harder, rarer skill is knowing whether what you built is real — and being honest when it isn't. This project shows:

End-to-end ML pipeline design (feature engineering, labeling, validation, serving)
Awareness of the failure modes that fool most people (lookahead bias, train/serve skew, meaningless targets, overfitting)
Realistic backtesting with costs and conservative assumptions
The judgment to kill a strategy that doesn't work instead of shipping it

In trading, not losing money on a bad strategy is worth as much as finding a good one. This is the discipline behind that.

Code and full architecture on GitHub. Built with Python, scikit-learn, NumPy, and the Bybit v5 API.

DEV Community