I Trained a Crypto Quantile Predictor on 47M Klines. The Transformer Lost to LightGBM.
This is what 47.68 million klines, 27 LightGBM models, and one failed Transformer spike taught me about building a crypto quantile predictor that holds up under out-of-sample stress — and what the OOS calibration numbers actually showed when I stopped narrating and started measuring.
The honest answer to "is your trading edge real" is "wait two to four years for enough live round-trips and find out." That's what statistical significance actually requires for the Sharpe levels retail traders chase. Two years of paper trading. Four if your sample-per-day is thin.
I've watched enough people decide three months of decent paper is enough, flip to live trading, and blow up by month nine to know the math isn't the hard part. The hard part is the patience. I didn't have it either, so I ran a 3-week sprint to compress that wait into offline out-of-sample validation on historical klines. 47.68 million of them. This is what the data showed, what broke, and why the model I shipped is the boring one.
The data setup
Binance Vision archives are free and well-structured. I pulled 1-minute klines for 30 perpetual pairs across 2023-2026 — 1,093 monthly parquet files, 2.4 GB on disk, 47.68M rows. Integrity checks all passed (no gaps wider than 5 minutes, no negative volumes, no zero-spread rows).
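The integrity pass is nothing exotic. Roughly this shape, though the column names and parquet layout here are assumptions rather than my exact schema:

```python
import pandas as pd

def check_klines(path: str, max_gap_minutes: int = 5) -> dict:
    """Basic integrity checks on one month of 1-minute klines.

    Assumes an 'open_time' column in milliseconds plus 'high',
    'low', and 'volume'; zero spread is read as high == low.
    """
    df = pd.read_parquet(path).sort_values("open_time")
    ts = pd.to_datetime(df["open_time"], unit="ms")

    gap_minutes = ts.diff().dt.total_seconds().div(60).fillna(1.0)
    return {
        "rows": len(df),
        "max_gap_minutes": float(gap_minutes.max()),
        "gaps_ok": bool((gap_minutes <= max_gap_minutes).all()),
        "negative_volume_rows": int((df["volume"] < 0).sum()),
        "zero_spread_rows": int((df["high"] == df["low"]).sum()),
    }
```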
From that I generated 9.53M training rows by sampling every 5 minutes per pair. Each row had 10 numeric features (RSI, EMA gaps, ATR, realized volatility, return windows) and 3 categorical features. Each row also had 3 forward-return labels: returns at 5 minutes, 10 minutes, and 30 minutes ahead.
Three horizons because crypto's signal-to-noise ratio is awful at 5 minutes and decent at 30. I wanted the calibration data to tell me which horizon was actually predictable, not assume one.
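Label construction is the least glamorous and most error-prone part, so here is a sketch of the shape of it. Column names and the log-return choice are assumptions; the real pipeline also joins the feature columns per pair:

```python
import numpy as np
import pandas as pd

HORIZONS = {"ret_fwd_5m": 5, "ret_fwd_10m": 10, "ret_fwd_30m": 30}

def make_labels(klines: pd.DataFrame) -> pd.DataFrame:
    """Forward-return labels for one pair's 1-minute klines,
    sampled every 5 minutes. Assumes one row per minute and a
    'close' column."""
    out = klines[["open_time", "close"]].copy()
    for name, minutes in HORIZONS.items():
        # log return from the current bar to `minutes` bars ahead
        out[name] = np.log(out["close"].shift(-minutes) / out["close"])
    # keep every 5th minute; drop rows whose forward window runs
    # past the end of the file
    return out.iloc[::5].dropna(subset=list(HORIZONS))
```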
Why LightGBM became the crypto quantile predictor I shipped
I went with LightGBM quantile regression as the first attempt. Three reasons.
First: quantile regression gives you P10, P50, P90 instead of a single point estimate. For a trading gate you want "what's the floor of my downside under this signal" more than "what's the expected return." Point estimates lie. Tail quantiles don't lie as much.
Second: walk-forward CV is cheap with gradient boosting. I split the historical kline window (2023-Q4 through 2026-Q1, all already-closed bars at the time of writing in late April 2026) into three folds, training on past, testing on next-period OOS. 27 models total: 3 folds × 3 horizons × 3 quantiles. Trained in 45 minutes on a single M1 Max core.
Third: it's interpretable enough to debug. When fold 2 (2025-Q3 drift regime) showed negative TP/SL uplift on the 5-minute horizon, I could see in the feature importance plot that the model had over-weighted ATR — fixed by adjusting the lookback window.
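For the record, the 27-model grid is a plain nested loop. A minimal sketch with placeholder hyperparameters, not the tuned values from the sprint:

```python
import lightgbm as lgb

QUANTILES = [0.10, 0.50, 0.90]
HORIZONS = ["ret_fwd_5m", "ret_fwd_10m", "ret_fwd_30m"]

def train_grid(folds, feature_cols):
    """folds: time-ordered list of (train_df, test_df) splits.
    Returns {(fold_idx, horizon, quantile): fitted model}.
    Hyperparameters are illustrative defaults."""
    models = {}
    for i, (train_df, _test_df) in enumerate(folds):
        for horizon in HORIZONS:
            for q in QUANTILES:
                model = lgb.LGBMRegressor(
                    objective="quantile",
                    alpha=q,              # which quantile this model fits
                    n_estimators=400,
                    learning_rate=0.05,
                    num_leaves=63,
                )
                model.fit(train_df[feature_cols], train_df[horizon])
                models[(i, horizon, q)] = model
    return models
```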
OOS results across folds: P10 hit rate stayed within 0.6% of the 10% target. P90 hit rate stayed within 0.6% of the 90% target. Directional accuracy ranged 51.17-52.59% across the three folds, with the most recent regime (fold 3, 2026-Q1) hitting the high end. Tail-quantile improvement over a fixed-baseline TP/SL: roughly 10%, consistent across all three folds. Modest but real.
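The calibration check itself is one line per quantile. What I mean by "hit rate", assuming the realized returns and the OOS quantile predictions are aligned row for row:

```python
import numpy as np

def quantile_hit_rate(y_true: np.ndarray, y_pred_q: np.ndarray) -> float:
    """Share of OOS rows where the realized forward return landed at
    or below the predicted quantile. Calibrated P10 sits near 0.10,
    calibrated P90 near 0.90."""
    return float(np.mean(y_true <= y_pred_q))
```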
The Transformer spike that didn't beat the bar
Then I burned 3.5 hours on a Transformer spike. I wanted to see if attention could pick up cross-pair structure that gradient boosting was missing.
First attempt: PyTorch with MPS backend, attention layer hit NaN at epoch 4. Known PyTorch MPS attention instability on Apple Silicon — the softmax saturates when you have tiny-std return features that get z-scored to ±100σ before clipping.
Second attempt: tighter feature normalization (±5σ post-clip), still NaN at epoch 6. Different layer this time.
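What "tighter feature normalization" means in practice, as a sketch. The epsilon floor and the exact clip bound are my guesses at a reasonable shape, not the code from the spike:

```python
import numpy as np

def zscore_clip(x: np.ndarray, mean: np.ndarray, std: np.ndarray,
                clip_sigma: float = 5.0) -> np.ndarray:
    """Z-score with train-set stats, then clip so near-constant
    return features can't blow out to +/-100 sigma and saturate
    the attention softmax."""
    z = (x - mean) / np.maximum(std, 1e-8)
    return np.clip(z, -clip_sigma, clip_sigma)
```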
Third attempt: dropped MPS, ran on CPU. No NaN. Slow — about 10 minutes per epoch. Trained for 8 epochs, beat LightGBM on the OOS tail-quantile metric by exactly 0.40%.
The pre-set go/no-go bar was 5%. So the Transformer cleared 8% of the bar.
I cut it. Saved the spike report (spike_report.json with the FAIL_BAR verdict), saved the failed training scripts for reference, and skipped the planned full sweep that would have burned 6 more hours of CPU time and a week of deploy work for marginal gain.
This is the result I would have wanted to find: not the dramatic win, the boring confirmation. LightGBM with the current feature set already captures most of the extractable signal. A bigger model class isn't the bottleneck. More features (or more horizons, or different labels) might be.
What 30 minutes told me that 5 minutes didn't
Per-horizon uplift was the most useful thing the OOS analysis surfaced. On the 30-minute horizon, the model improved fixed-baseline TP/SL by +33%, +125%, +172% across the three quantiles tested. On the 5-minute horizon: -87%, +168%, +87%. The 30-minute numbers are uniformly positive. The 5-minute numbers swing wildly — which is honest about how noisy a 5-minute crypto forecast actually is.
I switched the default TP/SL planning horizon from 5 minutes (which the original v1 predictor used) to 30 minutes. Inference still runs on 5-minute scan cadence. The horizon change was config, not architecture.
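Concretely, it was one knob. A hypothetical config shape, not my actual keys:

```python
# Hypothetical planner config; the real keys differ.
PLANNER_CONFIG = {
    "scan_cadence_minutes": 5,     # inference cadence, unchanged
    "tp_sl_horizon_minutes": 30,   # was 5 in the v1 predictor
}
```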
Pivot to the winning book
The deployment plan originally targeted Book 14 — a new book I'd just enabled with no track record. I caught myself mid-sprint and pivoted to Book 13 instead, which had 41 round-trips and 51.22% win rate at the time. The reasoning: deploying a quantile-regression overlay onto a book with no history is guess plus guess. Onto a book with proof, it's guess plus proof.
The pivot also surfaced an unrelated bug. Three symbols in the B13 invert-long list — DOGE, AVAX, ATOM — were responsible for the bulk of the negative PnL across 28 round-trips, while six symbols in the invert-short list carried the positive contribution. I pruned the three losers from the invert list. Free win, no model change, just cleaning out the asymmetric bleeders. (If you want the post-mortem on a similar book-routing bug that cost me a week of bad numbers, it's in How a Missing book_id Kwarg Quietly Tanked My Inverted-Alpha Paper Trade.)
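The per-symbol attribution that surfaces this kind of bleeder is a one-liner. A sketch, assuming one row per closed round-trip with 'symbol' and 'pnl' columns:

```python
import pandas as pd

def pnl_by_symbol(round_trips: pd.DataFrame) -> pd.DataFrame:
    """Per-symbol PnL contribution across closed paper round-trips.
    Sorting by total PnL puts the asymmetric bleeders at the
    negative end."""
    agg = round_trips.groupby("symbol")["pnl"].agg(["sum", "count", "mean"])
    return agg.sort_values("sum")
```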
The honest scope
This is paper-only. None of it has touched live trading yet. The plan is: log the v2 predictor in shadow mode alongside v1 for 3-5 days, verify alignment >85% across regimes, then wire v2's TP/SL into B13 paper trades. After 30-50 paper round-trips with v2 active, if the net is positive and the win rate holds above 55%, $5 of USDT goes onto the winning symbol subset. Not before.
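The shadow-mode alignment check can be as simple as a per-scan agreement rate. What counts as agreement (same gate decision, same direction, TP/SL within a tolerance) is a design choice; this sketch assumes a categorical per-scan signal:

```python
import numpy as np

def alignment_rate(v1_signal: np.ndarray, v2_signal: np.ndarray) -> float:
    """Share of shadow-mode scans where v1 and v2 agree, e.g. on the
    gate decision or trade direction. Gate: wire v2 into the book
    only if this stays above 0.85 across regimes."""
    return float(np.mean(v1_signal == v2_signal))
```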
The OOS calibration was clean. That doesn't mean the model is right. It means the model is consistent under the data slices I tested, which is necessary but not sufficient. Live execution introduces fees, slippage, and regime shifts the historical sample didn't see. I'll know in 30-50 more paper round-trips whether the offline numbers transferred or not.
What I'd tell my past self
Three things from the sprint that surprised me.
The first: the Transformer FAIL_BAR was a faster, more honest signal than I expected. Spending 3.5 hours to confirm "the simple model already won" beats spending a week to confirm "the complex model didn't beat the simple one by enough." Run the spike, set a clear bar before you start, accept the verdict.
The second: tail-quantile calibration mattered more than directional accuracy. 52% directional accuracy sounds barely above random, and it is. P10 calibration within 0.6% of target across three regimes is genuinely useful — it lets the trading gate ask "what's my realistic downside under this signal" with a number it can trust.
The third: deploying onto the winning book is not the same as deploying onto the convenient book. I almost shipped onto B14 because it was newer and cleaner. B13 had 41 round-trips of proof. The pivot took 30 minutes of decision and saved an unknown amount of "wait what is this signal even doing" debugging later.
Subscribe + follow along
I'm running this whole stack on a single M1 Max — paper trading 30 pairs, MLX inference for the language pieces (the Apple Silicon write-up covers why that hardware choice mattered), LightGBM for the trading pieces. The earlier inverted-control bot post-mortem is the closest sibling to this one in spirit — both are the paper-trail of an idea that survived contact with reality. All numbers in this post (and the sprint reports they came from) are real. The wins and the FAIL_BARs both.
If you want the next post — probably the 30-50 paper round-trip post-mortem on whether the OOS numbers held in live execution — subscribe at sleepyquant.rest.
Come along for the ride — see me fall or thrive, whichever comes first.