Almost every strategy that dies in production looked great in a backtest. The backtest wasn't unlucky — it was wrong, in one of three specific, detectable ways. Here's each one, the exact test that catches it, and why your usual metrics never warn you.
1. Lookahead bias — the silent killer
It's almost never a deliberate shift(-1). It hides in subtle places:
- Structural indicators computed over the whole series — swing highs/lows, pivots, "the trend", regime labels. If the value at bar t depends on bars after t, every signal derived from it is contaminated.
- Global-statistic normalization — z-scoring with the full-sample mean/std, fitting a scaler on all data.
- Resampling/fills that peek: ffill after resample, using a daily close to trade the same day's open.
- Label leakage in ML: targets overlapping features in time; train/test folds sharing information.
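The global-statistic case is the easiest one to demonstrate. Here is a minimal sketch, assuming the feature is a pandas Series; the function names and the min_periods default are illustrative, not taken from any library:

```python
import numpy as np
import pandas as pd

def zscore_full_sample(x: pd.Series) -> pd.Series:
    """Leaky: every bar is scaled with a mean/std that includes future bars."""
    return (x - x.mean()) / x.std()

def zscore_expanding(x: pd.Series, min_periods: int = 20) -> pd.Series:
    """Causal (conservative): bar t is scaled using only bars strictly before t."""
    mean = x.expanding(min_periods=min_periods).mean().shift(1)
    std = x.expanding(min_periods=min_periods).std().shift(1)
    return (x - mean) / std

# Perturbation check: change the future and see whether the past moves.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = x.copy()
y.iloc[150:] += 5.0  # only bars 150+ differ

causal_same = np.allclose(
    zscore_expanding(x).iloc[:150].dropna(),
    zscore_expanding(y).iloc[:150].dropna(),
)
leaky_same = np.allclose(
    zscore_full_sample(x).iloc[:150],
    zscore_full_sample(y).iloc[:150],
)
print("causal feature unchanged before bar 150:", causal_same)
print("full-sample feature unchanged before bar 150:", leaky_same)
```

The same perturbation check works for any indicator, not just z-scores: if editing bars after t changes the value at or before t, the indicator leaks.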
Why metrics don't warn you: a leaking backtest produces a beautiful equity curve — high Sharpe, high win rate, shallow drawdowns. Those numbers can't distinguish a real edge from a leak, because a leak makes them all better.
The test — execution-delay scan: re-run the strategy delaying execution by 0, 1, 2, 3 bars.
- Clean edge: Sharpe decays gently and smoothly — no cliff.
- Lookahead: Sharpe is huge at delay 0 (or the illegal delay −1) and falls off a cliff at delay 1, often to ~0 or negative.
The smoothness is the proof. A vertical drop between delay 0 and 1 is damning.
Rule of thumb: always design and report at delay ≥ 1. If your edge needs same-bar execution, it's a leak, not an edge.
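A minimal sketch of the scan, assuming signal[t] is the position decided with data through the close of bar t and bar_returns[t] is the close-to-close return of bar t. Under that convention delay 0 is same-bar execution and delay 1 is the first bar you could actually trade. The helper names and the 252-period annualization are assumptions:

```python
import numpy as np
import pandas as pd

def sharpe(returns: pd.Series, periods_per_year: int = 252) -> float:
    """Annualized Sharpe of per-bar returns (risk-free rate assumed to be 0)."""
    r = returns.dropna()
    sd = r.std()
    return float(np.sqrt(periods_per_year) * r.mean() / sd) if sd > 0 else 0.0

def delay_scan(signal: pd.Series, bar_returns: pd.Series, delays=(0, 1, 2, 3)) -> pd.Series:
    """Sharpe per execution delay d: the position decided at bar t earns the return of bar t + d."""
    return pd.Series(
        {d: sharpe(signal.shift(d) * bar_returns) for d in delays},
        name="sharpe",
    )

# Deliberately leaky toy signal: it reads the sign of the very bar it trades.
rng = np.random.default_rng(0)
bar_returns = pd.Series(rng.normal(0.0, 0.01, 2000))
leaky_signal = np.sign(bar_returns)

scan = delay_scan(leaky_signal, bar_returns)
print(scan.round(2))
```

On this toy input the Sharpe is enormous at delay 0 and collapses to noise from delay 1 onward, which is the cliff described above. Transaction costs are left out on purpose so the scan isolates timing.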
2. Overfitting — the luckiest config, not an edge
The more configurations you tried, the more likely the "winner" is just the luckiest draw. A Sharpe of 2.0 means something very different after 1,000 trials than after 1.
- Deflated Sharpe Ratio (DSR): adjusts your Sharpe for how many configs you tried (plus short samples, skew, fat tails). Brutal and correct — the same track record can show DSR 0.97 as a one-shot and 0.01 once you admit it was the best of 300. Count every parameter you eyeballed and discarded.
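- PBO via CSCV (Probability of Backtest Overfitting via combinatorially symmetric cross-validation): feed it the per-period returns of every config you tried (one column each). It cuts the timeline into blocks, assigns half of them to in-sample and the other half to out-of-sample in every possible combination, picks the in-sample winner each time, and checks where that winner ranks out-of-sample. A PBO near or above 0.5 means your selection is essentially picking noise.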
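The CSCV procedure can be sketched in a few lines. This is an illustrative version, not a reference implementation: the function name, the choice of 8 blocks, and the use of per-period Sharpe as the performance measure are all assumptions.

```python
import numpy as np
from itertools import combinations

def pbo_cscv(returns_matrix, n_blocks: int = 8) -> float:
    """Share of IS/OOS splits where the in-sample winner lands at or below the OOS median."""
    m = np.asarray(returns_matrix, dtype=float)  # rows = periods, columns = configs
    t, n = m.shape
    blocks = np.array_split(np.arange(t), n_blocks)

    def col_sharpe(x):
        sd = x.std(axis=0, ddof=1)
        safe_sd = np.where(sd > 0, sd, 1.0)
        return np.where(sd > 0, x.mean(axis=0) / safe_sd, 0.0)

    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = np.concatenate([blocks[i] for i in is_ids])
        oos_rows = np.concatenate([blocks[i] for i in range(n_blocks) if i not in is_ids])
        best = int(np.argmax(col_sharpe(m[is_rows])))  # in-sample winner
        oos = col_sharpe(m[oos_rows])
        rank = 1 + int(np.sum(oos < oos[best]))        # 1 = worst, n = best
        omega = rank / (n + 1)                         # relative OOS rank in (0, 1)
        logits.append(np.log(omega / (1.0 - omega)))
    return float(np.mean(np.array(logits) <= 0.0))

# Synthetic check: 20 pure-noise configs, then the same set with one real edge added.
rng = np.random.default_rng(2)
noise = rng.normal(0.0, 0.01, size=(800, 20))
with_edge = noise.copy()
with_edge[:, 0] += 0.005  # config 0 gets a strong, persistent drift

print("PBO, noise only:", pbo_cscv(noise))
print("PBO, one real edge:", pbo_cscv(with_edge))
```

The important input discipline is the one stated in the bullet: every config you tried gets a column, including the ones you discarded. Leaving out the losers makes the PBO look better than it is.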
See Bailey & López de Prado on PSR/DSR, and Bailey-Borwein-López de Prado-Zhu on PBO.
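For the deflation itself, here is a minimal sketch following the PSR and expected-maximum-Sharpe formulas from those papers. It uses per-period (not annualized) Sharpe, and sr_std (the standard deviation of Sharpe ratios across your trials) is something you have to supply from your own trial log. The function names are illustrative.

```python
import numpy as np
from statistics import NormalDist

_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329

def probabilistic_sharpe(returns, sr_benchmark: float = 0.0) -> float:
    """P(true Sharpe > sr_benchmark), adjusted for sample length, skew and kurtosis."""
    r = np.asarray(returns, dtype=float)
    t = r.size
    sr = r.mean() / r.std(ddof=1)
    z = (r - r.mean()) / r.std(ddof=0)
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4)  # non-excess kurtosis, 3 for a normal distribution
    denom = np.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _NORMAL.cdf((sr - sr_benchmark) * np.sqrt(t - 1) / denom)

def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """Expected best Sharpe among n_trials configs that all have zero true edge."""
    if n_trials <= 1:
        return 0.0
    return sr_std * (
        (1.0 - EULER_GAMMA) * _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * np.e))
    )

def deflated_sharpe(returns, n_trials: int, sr_std: float) -> float:
    """PSR measured against the Sharpe you would expect from luck alone."""
    return probabilistic_sharpe(returns, expected_max_sharpe(n_trials, sr_std))

# Same synthetic track record, judged as a one-shot vs. the best of many trials.
rng = np.random.default_rng(1)
track = rng.normal(0.001, 0.01, 1000)
for n in (1, 10, 300):
    print(n, round(deflated_sharpe(track, n_trials=n, sr_std=0.05), 3))
```

With one trial the benchmark is zero and DSR reduces to plain PSR. As the trial count grows, the luck-only benchmark rises and the same returns score lower.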
3. Fantasy fills & understated costs
The most clarifying number: break-even cost — the per-trade cost (bps) at which net Sharpe hits zero. Compare it to what you actually pay:
- Break-even 102 bps vs real cost 3 bps → robust.
- Break-even 4 bps vs real cost 3 bps → you're trading for your broker.
High-turnover strategies die here. Futures traders: don't let the backtest fill your roll at the stale settlement price of an illiquid expiring contract — charge a conservative roll spread and confirm fills sit on the liquid contract.
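Break-even cost has a closed form if you model cost as proportional to turnover: net Sharpe is zero exactly when the mean net return is zero. A minimal sketch under that assumption, where turnover is the fraction of capital traded each period and the function name is illustrative:

```python
import numpy as np

def break_even_cost_bps(gross_returns, turnover) -> float:
    """Cost per unit of traded notional (in bps) at which mean net return, and so Sharpe, is zero.

    Assumes net_t = gross_t - (cost_bps / 1e4) * turnover_t.
    """
    g = np.asarray(gross_returns, dtype=float)
    tv = np.asarray(turnover, dtype=float)
    total_turnover = tv.sum()
    if total_turnover <= 0:
        return float("inf")  # a strategy that never trades pays no costs
    return float(1e4 * g.sum() / total_turnover)

# Toy example: 10 bps gross per period while trading half the book each period.
gross = np.full(100, 0.001)
turnover = np.full(100, 0.5)
be = break_even_cost_bps(gross, turnover)
print(round(be, 2))  # roughly 20 bps

# Sanity check: at the break-even cost the mean net return is about zero.
net = gross - (be / 1e4) * turnover
print(abs(net.mean()) < 1e-12)
```

Spread, fees, slippage and roll costs all need to be folded into the single "real cost" figure you compare against, otherwise the margin is overstated.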
4. Out-of-sample discipline that works
A single train/test split is one noisy draw. Use walk-forward: select parameters on each training window, score them on the next, unseen window, stitch the OOS pieces. The number that matters is the IS→OOS degradation — a real edge degrades a little; an overfit one collapses.
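A minimal walk-forward sketch, assuming you already have a DataFrame with one column of per-period returns per parameter set. The function name, the rolling fixed-length windows, and per-period Sharpe as the selection metric are assumptions; anchored (expanding) training windows are an equally valid choice.

```python
import numpy as np
import pandas as pd

def walk_forward(returns_by_config: pd.DataFrame, train: int, test: int) -> dict:
    """Pick the best config on each training window, score it on the next unseen window."""
    oos_parts, is_scores = [], []
    start = 0
    while start + train + test <= len(returns_by_config):
        tr = returns_by_config.iloc[start : start + train]
        te = returns_by_config.iloc[start + train : start + train + test]
        is_sharpe = tr.mean() / tr.std()
        best = is_sharpe.idxmax()           # selection uses training data only
        is_scores.append(is_sharpe.loc[best])
        oos_parts.append(te[best])          # winner's returns on the unseen window
        start += test                       # roll forward by one test window
    oos = pd.concat(oos_parts)              # stitched out-of-sample track record
    is_mean = float(np.mean(is_scores))
    oos_sharpe = float(oos.mean() / oos.std())
    return {
        "oos_returns": oos,
        "is_sharpe": is_mean,
        "oos_sharpe": oos_sharpe,
        "degradation": is_mean - oos_sharpe,
    }

# Synthetic example: five configs, one of which has a persistent edge.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(0.0, 0.01, (1000, 5)), columns=list("abcde"))
df["a"] += 0.005
res = walk_forward(df, train=200, test=100)
print(len(res["oos_returns"]), round(res["is_sharpe"], 2), round(res["oos_sharpe"], 2))
```

The in-sample score is biased upward by construction because it is the maximum over configs, so some positive degradation is normal. What you are looking for is whether the stitched OOS Sharpe survives at all.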
The honest pre-deployment checklist
- Build at execution delay ≥ 1; never report same-bar fills.
- Run the delay scan; if Sharpe does not decay smoothly, stop and find the leak.
- Count your trials; report DSR, not raw Sharpe; run PBO.
- Prefer a parameter that sits on a broad plateau of good performance over the single global peak.
- Charge real costs; confirm break-even beats them with margin.
- Confirm on walk-forward; report IS→OOS degradation.
A backtest that passes all of these isn't guaranteed to make money. But one that fails any of them is almost guaranteed to lose it.
I packaged correct, unit-tested implementations of all of these into a small numpy+pandas kit (PSR, Deflated Sharpe, PBO/CSCV, execution-delay scan, break-even cost, walk-forward) — one call to run_full_validation() prints a GO / CAUTION / NO-GO verdict. It's strategy-agnostic and never sees your alpha: you pass a returns series, it returns diagnostics.
If it's useful: https://924499172462.gumroad.com/l/quant-validation-kit
(The methodology above is enough to self-audit; the kit just runs every test for you in one call.)