Why Your Backtest Is Lying to You — 3 Tests That Catch Lookahead Bias, Overfitting, and Fantasy Fills

#datascience #machinelearning #python #testing

Almost every strategy that dies in production looked great in a backtest. The backtest wasn't unlucky — it was wrong, in one of three specific, detectable ways. Here's each one, the exact test that catches it, and why your usual metrics never warn you.

1. Lookahead bias — the silent killer

It's almost never a deliberate shift(-1). It hides in subtle places:

Structural indicators computed over the whole series — swing highs/lows, pivots, "the trend", regime labels. If the value at bar t depends on bars after t, every signal derived from it is contaminated.
Global-statistic normalization — z-scoring with the full-sample mean/std, fitting a scaler on all data.
Resampling/fills that peek: forward-filling a resampled bar that is stamped at the start of its period across the bars it summarizes, backward fills, using a daily close to trade the same day's open.
Label leakage in ML — targets overlapping features in time; train/test folds sharing information.
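
One way to check any single feature for this kind of leak is to alter only the future and see whether the value at bar t moves. The sketch below applies that idea to the z-scoring case from the list above. The helper names (zscore_leaky, zscore_causal, depends_on_future) and the synthetic series are illustrative assumptions, not part of any library.

```python
import numpy as np
import pandas as pd

def zscore_leaky(s):
    # Full-sample mean/std: the value at bar t depends on bars after t.
    return (s - s.mean()) / s.std()

def zscore_causal(s, min_periods=20):
    # Expanding statistics shifted by one bar: bar t is scaled using bars < t only.
    # (Including bar t itself would also be causal; the shift is the conservative choice.)
    mu = s.expanding(min_periods=min_periods).mean().shift(1)
    sd = s.expanding(min_periods=min_periods).std().shift(1)
    return (s - mu) / sd

def depends_on_future(feature_fn, s, t=100, bump=50.0):
    """True if feature_fn(s)[t] changes when only bars after t are altered."""
    altered = s.copy()
    altered.iloc[t + 1:] += bump  # perturb the future, leave bars <= t untouched
    return not np.isclose(feature_fn(s).iloc[t], feature_fn(altered).iloc[t])

rng = np.random.default_rng(0)
x = pd.Series(np.cumsum(rng.normal(0.1, 1.0, 500)))  # synthetic series, illustration only

leaky_flag = depends_on_future(zscore_leaky, x)
causal_flag = depends_on_future(zscore_causal, x)
print(leaky_flag, causal_flag)  # True False
```

The same perturbation check works for swing points, regime labels, or any other feature function: if the value at t changes when only later bars change, every signal built on it is contaminated.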

Why metrics don't warn you: a leaking backtest produces a beautiful equity curve — high Sharpe, high win rate, shallow drawdowns. Those numbers can't distinguish a real edge from a leak, because a leak makes them all better.

The test — execution-delay scan: re-run the strategy delaying execution by 0, 1, 2, 3 bars.

Clean edge: Sharpe decays gently and smoothly — no cliff.
Lookahead: Sharpe is huge at delay 0 (or the illegal delay −1) and falls off a cliff at delay 1, often to ~0 or negative.

The shape of the decay is the evidence: a smooth decline is what a real edge looks like, while a vertical drop between delay 0 and 1 is damning.

Rule of thumb: always design and report at delay ≥ 1. If your edge needs same-bar execution, it's a leak, not an edge.
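
A minimal sketch of the delay scan, under an assumed convention: bar_returns[t] is the return earned over bar t, and signal[t] is computed with data through the close of bar t, so delay 0 is same-bar execution and delay 1 is the earliest legal fill. The function names and the deliberately leaky synthetic signal are hypothetical, chosen only to show the cliff.

```python
import numpy as np
import pandas as pd

def sharpe(returns, periods_per_year=252):
    # Annualised Sharpe of a per-bar return series; NaNs (warm-up bars) are dropped.
    r = pd.Series(returns).dropna()
    sd = r.std()
    if sd == 0 or np.isnan(sd):
        return 0.0
    return float(np.sqrt(periods_per_year) * r.mean() / sd)

def delay_scan(signal, bar_returns, delays=(0, 1, 2, 3)):
    """Sharpe when the position decided at bar t is held over bar t + delay."""
    out = {}
    for d in delays:
        position = signal.shift(d)            # act d bars after the signal
        out[d] = sharpe(position * bar_returns)
    return pd.Series(out, name="sharpe")

rng = np.random.default_rng(1)
r = pd.Series(rng.normal(0.0, 0.01, 2000))    # synthetic returns with no real edge
leaky_signal = np.sign(r)                     # "knows" the sign of the bar it trades
scan = delay_scan(leaky_signal, r)
print(scan.round(2))
```

On this synthetic leak the delay-0 Sharpe is enormous and the delay-1 value sits near zero, which is the cliff described above. For a real strategy, pass your own signal and returns and adapt the convention to how your bars are stamped.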

2. Overfitting — the luckiest config, not an edge

The more configurations you tried, the more likely the "winner" is just the luckiest draw. A Sharpe of 2.0 means something very different after 1,000 trials than after 1.

Deflated Sharpe Ratio (DSR): adjusts your Sharpe for how many configs you tried (plus short samples, skew, fat tails). Brutal and correct — the same track record can show DSR 0.97 as a one-shot and 0.01 once you admit it was the best of 300. Count every parameter you eyeballed and discarded.
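PBO via CSCV: feed it the per-period returns of every config you tried (one column each). It cuts the history into blocks, forms every split that assigns half the blocks to in-sample and the rest to out-of-sample, picks the in-sample winner, and checks where it ranks out-of-sample. PBO is the share of splits where that winner lands at or below the out-of-sample median; a PBO of 0.5 or more means your selection is essentially picking noise.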
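
The CSCV procedure can be sketched in a few lines. This is a simplified illustration, not a reference implementation: the function name pbo_cscv, the choice of 8 blocks, the per-period Sharpe as the performance measure, and the synthetic data are all assumptions.

```python
from itertools import combinations
import numpy as np
import pandas as pd

def _sharpe(block):
    # Column-wise per-period Sharpe; configs with zero variance score 0.
    sd = block.std(axis=0, ddof=1)
    return np.divide(block.mean(axis=0), sd, out=np.zeros_like(sd), where=sd > 0)

def pbo_cscv(returns, n_blocks=8):
    """returns: DataFrame, rows = periods, columns = every config you tried."""
    R = returns.to_numpy(dtype=float)
    n_configs = R.shape[1]
    blocks = np.array_split(np.arange(R.shape[0]), n_blocks)
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = np.concatenate([blocks[i] for i in is_ids])
        oos_rows = np.concatenate([blocks[i] for i in range(n_blocks) if i not in is_ids])
        is_perf, oos_perf = _sharpe(R[is_rows]), _sharpe(R[oos_rows])
        best = np.argmax(is_perf)                          # in-sample winner
        rank = (oos_perf < oos_perf[best]).sum() + 1       # its OOS rank, 1..n_configs
        w = rank / (n_configs + 1)                         # relative rank in (0, 1)
        logits.append(np.log(w / (1 - w)))
    # Share of splits where the IS winner is at or below the OOS median.
    return float((np.array(logits) <= 0).mean())

rng = np.random.default_rng(2)
noise = pd.DataFrame(rng.normal(0.0, 0.01, size=(1000, 20)))  # 20 configs, no edge
pbo_noise = pbo_cscv(noise)

edge = noise.copy()
edge[0] = edge[0] + 0.01          # give config 0 a large, persistent drift
pbo_edge = pbo_cscv(edge)
print(pbo_noise, pbo_edge)
```

With one config that dominates in every block, the in-sample winner is also the out-of-sample winner in every split, so pbo_edge comes out at 0. The all-noise matrix has no such anchor, and its PBO is well above that.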

See Bailey & López de Prado on PSR/DSR, and Bailey-Borwein-López de Prado-Zhu on PBO.
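
For the Deflated Sharpe side, here is a sketch of the PSR and DSR formulas as I understand them from Bailey & López de Prado: PSR is the probability that the true Sharpe exceeds a benchmark given sample length, skew, and kurtosis, and DSR is PSR evaluated at the Sharpe you would expect from the best of N unskilled trials. The function names are hypothetical, and sr_std (the dispersion of per-period Sharpe across your trials) is an input you have to estimate from your own trial log; the 0.05 used below is an arbitrary placeholder.

```python
import math
from statistics import NormalDist
import numpy as np

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()

def psr(returns, sr_benchmark=0.0):
    """Probabilistic Sharpe Ratio, using the per-period (non-annualised) Sharpe."""
    r = np.asarray(returns, dtype=float)
    sr = r.mean() / r.std(ddof=1)
    z = (r - r.mean()) / r.std(ddof=0)
    skew, kurt = (z ** 3).mean(), (z ** 4).mean()   # kurt is non-excess (normal = 3)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _NORMAL.cdf((sr - sr_benchmark) * math.sqrt(r.size - 1) / denom)

def expected_max_sharpe(n_trials, sr_std):
    """Expected best per-period Sharpe among n_trials configs with no true edge."""
    if n_trials < 2:
        return 0.0                                  # a single trial needs no deflation
    return sr_std * ((1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
                     + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e)))

def deflated_sharpe(returns, n_trials, sr_std):
    return psr(returns, expected_max_sharpe(n_trials, sr_std))

rng = np.random.default_rng(3)
r = rng.normal(0.001, 0.01, 500)   # synthetic track record, illustration only
sr_std = 0.05                      # assumed cross-trial dispersion of Sharpe
dsr_one_shot = deflated_sharpe(r, n_trials=1, sr_std=sr_std)
dsr_best_of_300 = deflated_sharpe(r, n_trials=300, sr_std=sr_std)
print(round(dsr_one_shot, 3), round(dsr_best_of_300, 3))
```

The track record is identical in both calls; only the admitted trial count changes, and the score drops accordingly.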

3. Fantasy fills & understated costs

The most clarifying number: break-even cost — the per-trade cost (bps) at which net Sharpe hits zero. Compare it to what you actually pay:

Break-even 102 bps vs real cost 3 bps → robust.
Break-even 4 bps vs real cost 3 bps → you're trading for your broker.

High-turnover strategies die here. Futures traders: don't let the backtest fill your roll at the stale settlement price of an illiquid expiring contract — charge a conservative roll spread and confirm fills sit on the liquid contract.
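
Under a simple linear cost model the break-even cost has a closed form. If each period's net return is the gross return minus turnover times cost, net Sharpe reaches zero exactly when the mean net return does, which gives cost = mean(gross) / mean(turnover). The sketch below assumes that model (no market impact, cost charged per unit of traded notional); the function name and the toy numbers are illustrative.

```python
import pandas as pd

def break_even_cost_bps(gross_returns, turnover):
    """Per-trade cost in bps at which the mean net return (and so net Sharpe) is zero.

    gross_returns: per-period strategy returns before costs.
    turnover: per-period traded notional as a fraction of capital, e.g. |position change|.
    """
    g = pd.Series(gross_returns, dtype=float)
    to = pd.Series(turnover, dtype=float)
    if to.mean() <= 0:
        raise ValueError("strategy never trades; break-even cost is undefined")
    return float(1e4 * g.mean() / to.mean())

# Toy example: 10 bps average gross return per period, trading half the book per period.
gross = [0.001, 0.0, 0.002, 0.001]
traded = [1.0, 0.0, 1.0, 0.0]
be_bps = break_even_cost_bps(gross, traded)
print(round(be_bps, 2))  # 20.0

# Sanity check: charging exactly the break-even cost leaves a zero mean net return.
net = pd.Series(gross) - pd.Series(traded) * be_bps / 1e4
```

Compare the result with your all-in cost per trade (commission, half-spread, slippage, and for futures the roll spread), not just the commission line.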

4. Out-of-sample discipline that works

A single train/test split is one noisy draw. Use walk-forward: select parameters on each training window, score them on the next unseen window, and stitch the out-of-sample (OOS) pieces together. The number that matters is the in-sample (IS) to OOS degradation: a real edge degrades a little; an overfit one collapses.
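
A compact sketch of that loop, assuming you already have a matrix of per-period returns with one column per candidate config. The rolling (not anchored) training window, the window lengths, and the function name walk_forward are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def walk_forward(returns, train=252, test=63):
    """returns: DataFrame, rows = periods, columns = candidate configs."""
    oos_parts, is_sharpes = [], []
    start = 0
    while start + train + test <= len(returns):
        tr = returns.iloc[start:start + train]
        te = returns.iloc[start + train:start + train + test]
        is_sr = tr.mean() / tr.std()          # per-period Sharpe of each config, in-sample
        best = is_sr.idxmax()                 # select on the training window only
        is_sharpes.append(is_sr[best])
        oos_parts.append(te[best])            # score the pick on the next unseen window
        start += test                         # roll forward by one test window
    if not oos_parts:
        raise ValueError("not enough data for a single train/test window")
    oos = pd.concat(oos_parts)                # stitched out-of-sample track record
    return {
        "oos_returns": oos,
        "is_sharpe": float(np.mean(is_sharpes)),
        "oos_sharpe": float(oos.mean() / oos.std()),
    }

rng = np.random.default_rng(4)
noise = pd.DataFrame(rng.normal(0.0, 0.01, size=(1000, 50)))  # 50 configs, no edge
wf = walk_forward(noise)
print(round(wf["is_sharpe"], 3), round(wf["oos_sharpe"], 3))
```

On pure noise the selected in-sample Sharpe is comfortably positive while the stitched out-of-sample Sharpe hovers around zero. That gap is the degradation to report.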

The honest pre-deployment checklist

Build at execution delay ≥ 1; never report same-bar fills.
Run the delay scan; if Sharpe does not decay smoothly, stop and find the leak.
Count your trials; report DSR, not raw Sharpe; run PBO.
Prefer a parameter from a broad plateau of similar performance over the isolated global peak.
Charge real costs; confirm break-even beats them with margin.
Confirm on walk-forward; report IS→OOS degradation.

A backtest that passes all of these isn't guaranteed to make money. But one that fails any of them is almost guaranteed to lose it.

I packaged correct, unit-tested implementations of all of these into a small numpy+pandas kit (PSR, Deflated Sharpe, PBO/CSCV, execution-delay scan, break-even cost, walk-forward) — one call to run_full_validation() prints a GO / CAUTION / NO-GO verdict. It's strategy-agnostic and never sees your alpha: you pass a returns series, it returns diagnostics.

If it's useful: https://924499172462.gumroad.com/l/quant-validation-kit
(The methodology above is enough to self-audit; the kit just runs every test for you in one call.)