DEV Community: eedgee

Why Your Backtest Is Lying to You — 3 Tests That Catch Lookahead Bias, Overfitting, and Fantasy Fills

eedgee — Tue, 09 Jun 2026 14:06:22 +0000

Almost every strategy that dies in production looked great in a backtest. The backtest wasn't unlucky — it was wrong, in one of three specific, detectable ways. Here's each one, the exact test that catches it, and why your usual metrics never warn you.

1. Lookahead bias — the silent killer

It's almost never a deliberate shift(-1). It hides in subtle places:

Structural indicators computed over the whole series — swing highs/lows, pivots, "the trend", regime labels. If the value at bar t depends on bars after t, every signal derived from it is contaminated.
Global-statistic normalization — z-scoring with the full-sample mean/std, fitting a scaler on all data.
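Resampling/fills that peek: ffill after a resample whose bars are labeled at the start of their period (so a bar's close is carried forward from a timestamp at which it had not yet printed), or using a daily close to trade the same day's open.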
Label leakage in ML — targets overlapping features in time; train/test folds sharing information.
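To make the global-statistic case concrete, here is a minimal sketch that compares a full-sample z-score with an expanding-window one. The function names, the `min_periods` parameter, and the price series are made up for illustration, and it sticks to the standard library so it runs anywhere; with pandas you would more likely reach for expanding or rolling windows.

```python
from statistics import mean, pstdev


def zscore_full_sample(values):
    """Leaky: every point is scaled with statistics that include future bars."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def zscore_expanding(values, min_periods=3):
    """Leak-free: the z-score at index t uses only values[0..t]."""
    out = []
    for t in range(len(values)):
        window = values[: t + 1]
        if len(window) < min_periods:
            out.append(None)  # not enough history yet
            continue
        sigma = pstdev(window)
        out.append((values[t] - mean(window)) / sigma if sigma > 0 else 0.0)
    return out


prices = [100.0, 101.0, 99.0, 102.0, 98.0, 150.0]  # made-up series with a late jump
leaky = zscore_full_sample(prices)
clean = zscore_expanding(prices)

# Rewrite only the final (future) bar and see which earlier scores move.
altered = prices[:-1] + [90.0]
earlier_clean_unchanged = zscore_expanding(altered)[:-1] == clean[:-1]
earlier_leaky_unchanged = zscore_full_sample(altered)[:-1] == leaky[:-1]
print(earlier_clean_unchanged, earlier_leaky_unchanged)
```

The check at the end is the useful part: if editing a future bar changes a past feature value, that feature leaks.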

Why metrics don't warn you: a leaking backtest produces a beautiful equity curve — high Sharpe, high win rate, shallow drawdowns. Those numbers can't distinguish a real edge from a leak, because a leak makes them all better.

The test — execution-delay scan: re-run the strategy delaying execution by 0, 1, 2, 3 bars.

Clean edge: Sharpe decays gently and smoothly — no cliff.
Lookahead: Sharpe is huge at delay 0 (or the illegal delay −1) and falls off a cliff at delay 1, often to ~0 or negative.

The smoothness is the proof. A vertical drop between delay 0 and 1 is damning.

Rule of thumb: always design and report at delay ≥ 1. If your edge needs same-bar execution, it's a leak, not an edge.
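A minimal sketch of the delay scan, under an assumed convention: `signal[t]` is the position decided from information up to the close of bar t, and `bar_returns[t]` is the return earned over bar t. The helper names and the synthetic data are illustrative only. The demo signal deliberately cheats by reading the same bar's return, so delay 0 should look spectacular and delay 1 should look like noise.

```python
import math
import random
from statistics import mean, stdev


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe of per-bar returns (risk-free rate assumed to be zero)."""
    sd = stdev(returns)
    return 0.0 if sd == 0 else mean(returns) / sd * math.sqrt(periods_per_year)


def delay_scan(signal, bar_returns, delays=(0, 1, 2, 3)):
    """Sharpe when the position decided at bar t is applied to bar t + delay."""
    results = {}
    for d in delays:
        # The position decided at bar t - d earns the return of bar t.
        pnl = [signal[t - d] * bar_returns[t] for t in range(d, len(bar_returns))]
        results[d] = sharpe(pnl)
    return results


rng = random.Random(0)
bar_returns = [rng.gauss(0.0, 0.01) for _ in range(2000)]      # pure noise
leaky_signal = [1.0 if r > 0 else -1.0 for r in bar_returns]   # peeks at its own bar

scan = delay_scan(leaky_signal, bar_returns)
for d, s in scan.items():
    print(f"delay {d}: Sharpe {s:.2f}")
```

With a real strategy you would pass your own positions and returns and look at the shape of the curve across delays, not at any single number.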

2. Overfitting — the luckiest config, not an edge

The more configurations you tried, the more likely the "winner" is just the luckiest draw. A Sharpe of 2.0 means something very different after 1,000 trials than after 1.

Deflated Sharpe Ratio (DSR): adjusts your Sharpe for how many configs you tried (plus short samples, skew, fat tails). Brutal and correct — the same track record can show DSR 0.97 as a one-shot and 0.01 once you admit it was the best of 300. Count every parameter you eyeballed and discarded.
PBO via CSCV: feed it the per-period returns of every config you tried (one column each). It cuts the history into blocks, treats each combination of half the blocks as in-sample and the remaining half as out-of-sample, picks the in-sample winner, and checks where it ranks out-of-sample. A PBO of roughly 0.5 or higher means your selection is essentially picking noise.
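For the Deflated Sharpe adjustment, here is a minimal sketch following the formulas as I understand them from Bailey & López de Prado. Assumptions to note: all Sharpe ratios are per-period (not annualized), `kurt` is raw kurtosis (3 for a normal distribution), and `sr_variance` is the variance of the Sharpe ratios across the configs you tried. The function names and the example inputs are made up.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """PSR: probability that the true per-period Sharpe exceeds sr_benchmark."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _NORMAL.cdf((sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom)


def expected_max_sharpe(n_trials, sr_variance):
    """Expected best Sharpe among n_trials configs that all have zero true edge."""
    if n_trials < 2:
        return 0.0  # a single trial has no selection effect to correct for
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sr_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr, n_obs, n_trials, sr_variance, skew=0.0, kurt=3.0):
    """DSR: the PSR measured against the expected maximum Sharpe under the null."""
    benchmark = expected_max_sharpe(n_trials, sr_variance)
    return probabilistic_sharpe(sr, benchmark, n_obs, skew, kurt)


# Same made-up track record, judged as one shot and as the best of 300 trials.
one_shot = deflated_sharpe(sr=0.1, n_obs=1000, n_trials=1, sr_variance=0.001)
best_of_300 = deflated_sharpe(sr=0.1, n_obs=1000, n_trials=300, sr_variance=0.001)
print(f"one shot: {one_shot:.3f}, best of 300: {best_of_300:.3f}")
```

The only thing that changes between the two calls is the trial count, which is why an honest count matters more than any other input.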

See Bailey & López de Prado on PSR/DSR, and Bailey-Borwein-López de Prado-Zhu on PBO.
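For PBO, a compact sketch of CSCV as I read the method: split the history into an even number of blocks, use every combination of half the blocks as in-sample, and record where the in-sample winner ranks out-of-sample. The function name, the choice of 8 blocks, the per-period Sharpe as the performance measure, and the synthetic data are all assumptions for illustration.

```python
import math
import random
from itertools import combinations
from statistics import mean, stdev


def _sharpe(xs):
    sd = stdev(xs)
    return 0.0 if sd == 0 else mean(xs) / sd


def pbo_cscv(returns_by_config, n_blocks=8):
    """Probability of backtest overfitting; one equal-length return list per config."""
    n_configs = len(returns_by_config)
    block_len = len(returns_by_config[0]) // n_blocks  # trailing remainder is dropped
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    logits = []
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [i for b in is_blocks for i in blocks[b]]
        oos_idx = [i for b in range(n_blocks) if b not in is_blocks for i in blocks[b]]
        is_perf = [_sharpe([r[i] for i in is_idx]) for r in returns_by_config]
        oos_perf = [_sharpe([r[i] for i in oos_idx]) for r in returns_by_config]
        winner = max(range(n_configs), key=is_perf.__getitem__)
        # Relative out-of-sample rank of the in-sample winner, strictly inside (0, 1).
        rank = sum(p <= oos_perf[winner] for p in oos_perf)
        omega = rank / (n_configs + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    # Share of splits where the in-sample winner lands at or below the OOS median.
    return sum(l <= 0 for l in logits) / len(logits)


rng = random.Random(1)
noise = [[rng.gauss(0.0, 0.01) for _ in range(400)] for _ in range(20)]
with_edge = [row[:] for row in noise]
with_edge[0] = [r + 0.01 for r in with_edge[0]]  # give one config a genuine edge

pbo_noise = pbo_cscv(noise)
pbo_edge = pbo_cscv(with_edge)
print(f"PBO, pure noise: {pbo_noise:.2f}; PBO, one real edge: {pbo_edge:.2f}")
```

The contrast is the point: when one config really is better, the in-sample winner keeps winning out-of-sample and PBO stays low; when every config is noise, there is nothing for the selection to hold on to.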

3. Fantasy fills & understated costs

The most clarifying number: break-even cost — the per-trade cost (bps) at which net Sharpe hits zero. Compare it to what you actually pay:

Break-even 102 bps vs real cost 3 bps → robust.
Break-even 4 bps vs real cost 3 bps → you're trading for your broker.
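The break-even figure is simple arithmetic if you assume costs are proportional to turnover: net return per bar is gross return minus cost times turnover, so net Sharpe hits zero exactly where the net mean does. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def break_even_cost_bps(gross_returns, turnover):
    """Cost per unit of turnover (in bps) at which the net mean return is zero.

    gross_returns: per-bar strategy returns before costs, as fractions.
    turnover: per-bar traded notional as a fraction of capital.
    """
    total_turnover = sum(turnover)
    if total_turnover == 0:
        return float("inf")  # a strategy that never trades cannot be hurt by costs
    return sum(gross_returns) / total_turnover * 1e4


gross = [0.002, -0.001, 0.003, 0.0]   # made-up gross returns
traded = [1.0, 0.0, 2.0, 1.0]         # made-up turnover per bar
be = break_even_cost_bps(gross, traded)

real_cost_bps = 3.0                   # assumed all-in cost, as in the example above
print(f"break-even {be:.1f} bps, margin {be / real_cost_bps:.1f}x over real cost")
```

Feed it the turnover your backtest actually generates, including rolls, or the number will flatter you.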

High-turnover strategies die here. Futures traders: don't let the backtest fill your roll at the stale settlement price of an illiquid expiring contract — charge a conservative roll spread and confirm fills sit on the liquid contract.

4. Out-of-sample discipline that works

A single train/test split is one noisy draw. Use walk-forward: select parameters on each training window, score them on the next, unseen window, stitch the OOS pieces. The number that matters is the IS→OOS degradation — a real edge degrades a little; an overfit one collapses.
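A minimal walk-forward sketch, assuming you already have per-bar returns for each candidate parameter. The function name, the mean-return score, and the toy data are illustrative; swap in Sharpe or whatever you actually select on.

```python
from statistics import mean


def walk_forward(returns_by_param, train_len, test_len, score=mean):
    """Pick the best param on each training window, then score it on the next window."""
    n = len(next(iter(returns_by_param.values())))
    is_scores, oos_returns = [], []
    start = 0
    while start + train_len + test_len <= n:
        train = slice(start, start + train_len)
        test = slice(start + train_len, start + train_len + test_len)
        best = max(returns_by_param, key=lambda p: score(returns_by_param[p][train]))
        is_scores.append(score(returns_by_param[best][train]))
        oos_returns.extend(returns_by_param[best][test])  # stitch the unseen pieces
        start += test_len  # roll forward by one test window
    return {
        "is_mean": mean(is_scores),
        "oos_mean": score(oos_returns),
        "oos_returns": oos_returns,
    }


# Toy data: "fast" works for the first half of history, then stops working.
returns_by_param = {
    "fast": [0.01] * 6 + [-0.01] * 6,
    "slow": [0.0] * 12,
}
result = walk_forward(returns_by_param, train_len=4, test_len=2)
print(f"IS {result['is_mean']:.4f} -> OOS {result['oos_mean']:.4f}")
```

The gap between the two printed numbers is the IS→OOS degradation; the stitched `oos_returns` series is what you should treat as the track record.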

The honest pre-deployment checklist

Build at execution delay ≥ 1; never report same-bar fills.
Run the delay scan; if Sharpe does not decay smoothly across delays, stop and find the leak.
Count your trials; report DSR, not raw Sharpe; run PBO.
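Prefer a parameter that sits on a broad plateau of good results over the single global peak; a peak that vanishes when you nudge the parameter is usually noise.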
Charge real costs; confirm break-even beats them with margin.
Confirm on walk-forward; report IS→OOS degradation.

A backtest that passes all of these isn't guaranteed to make money. But one that fails any of them is almost guaranteed to lose it.

I packaged correct, unit-tested implementations of all of these into a small numpy+pandas kit (PSR, Deflated Sharpe, PBO/CSCV, execution-delay scan, break-even cost, walk-forward) — one call to run_full_validation() prints a GO / CAUTION / NO-GO verdict. It's strategy-agnostic and never sees your alpha: you pass a returns series, it returns diagnostics.

If it's useful: https://924499172462.gumroad.com/l/quant-validation-kit
(The methodology above is enough to self-audit; the kit just runs every test for you in one call.)

DIY OFAC SDN monitoring for crypto addresses — and where it silently breaks

eedgee — Tue, 02 Jun 2026 14:33:32 +0000

If your product touches crypto and you have any AML/sanctions obligation, sooner or later someone asks: "How do we know if an address we interact with lands on the OFAC SDN list?"

The reassuring part: the data is free. The U.S. Treasury publishes the Specially Designated Nationals (SDN) list, including the crypto addresses tied to sanctioned entities, as public downloads. Chainalysis even gives away a free sanctions screening API and an on-chain oracle. So the instinct is: I'll just poll it myself.

You can. It's also a deceptively deep little pipeline, and the ways it breaks are quiet — which is the dangerous kind. Here's the honest map of building it yourself.

The naive version

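# 1. download the SDN data (XML/CSV from treasury.gov)
# 2. extract the crypto addresses (the "Digital Currency Address" fields)
# 3. compare against the set you saw last time
# 4. if a watched address newly appears (or disappears), alert someone

sdn = fetch_sdn_list()
by_chain = extract_crypto_addresses(sdn)     # {"XBT": {...}, "ETH": {...}, ...}
# flatten to (chain, address) pairs: dicts don't support set arithmetic
current = {(chain, addr) for chain, addrs in by_chain.items() for addr in addrs}
added   = current - last_snapshot            # last_snapshot: set of pairs saved by the previous run
removed = last_snapshot - current
if my_watched & (added | removed):           # my_watched: set of (chain, address) pairs too
    notify("a watched address changed on the SDN list")
save(current)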

Ship it on a cron, done? Not quite. Here's where reality leaks in.

Where it silently breaks

1. The diff is harder than `==`

Addresses don't compare cleanly across chains:

Ethereum addresses appear in mixed case (EIP-55 checksum) in some sources and lowercase in others. 0xAbC… and 0xabc… are the same address; a naive set diff sees two. Normalize to a canonical form per chain before diffing, or you'll fire false alerts and miss real ones.
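Bitcoin is the opposite for Base58 addresses (legacy and P2SH): case is significant, so lowercasing them corrupts the address. Bech32 addresses are case-insensitive and conventionally written in lowercase. On top of that, you've got all three formats in play for what may be related holdings.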
OFAC re-lists and restructures entries. An address can move between SDN entries, or an entity can be re-added under a new listing. If you key your snapshot on the entry instead of the normalized address, a reshuffle looks like a churn of adds/removes that aren't real.
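The three points above can be sketched as a normalizer plus a diff keyed on the normalized address. This is a minimal sketch, not a complete rule set: the chain codes follow the "XBT"/"ETH" labels used earlier, other chains are passed through untouched (an assumption you would need to revisit per chain), and the addresses are made-up placeholders, not real or valid ones.

```python
def normalize_address(chain, address):
    """Return a canonical form for comparison; rules here cover ETH and XBT only."""
    addr = address.strip()
    if chain == "ETH":
        return addr.lower()          # EIP-55 casing is a checksum, not identity
    if chain == "XBT":
        if addr.lower().startswith("bc1"):
            return addr.lower()      # bech32 is case-insensitive
        return addr                  # Base58 (legacy, P2SH) is case-sensitive
    return addr                      # unknown chain: leave as-is (assumption)


def diff_snapshots(previous, current):
    """Diff two iterables of (chain, address) pairs on their normalized form."""
    prev = {(c, normalize_address(c, a)) for c, a in previous}
    curr = {(c, normalize_address(c, a)) for c, a in current}
    return curr - prev, prev - curr  # (added, removed)


# Placeholder data: the same ETH address in two casings, plus one new XBT address.
previous = {("ETH", "0xAbCdEf0123456789aBcDeF0123456789AbCdEf01")}
current = {
    ("ETH", "0xabcdef0123456789abcdef0123456789abcdef01"),
    ("XBT", "BC1QEXAMPLEPLACEHOLDER"),
}
added, removed = diff_snapshots(previous, current)
print(added, removed)
```

Because the key is the normalized address and nothing else, an address that moves between SDN entries produces no diff at all, which is the behavior you want for alerting.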

2. Delivery is the actual hard part

Detecting the change is maybe 30% of the work. Reliably telling someone is the other 70%:

A webhook that fails once and isn't retried is a silent miss. Your endpoint was redeploying for 90 seconds; the one alert that mattered fell on the floor.
No delivery log means you can't answer "were we notified?" — which is exactly the question an examiner or your own incident review will ask.
Unsigned webhooks mean the receiver can't trust the payload. You want HMAC-SHA256 signatures so the other side can verify it's really you.
The moment you add email and Telegram as channels, each has its own failure modes (bounces, rate limits, bot token expiry) and you're now running three delivery systems.
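Signing and retrying fit in a few lines with the standard library. A minimal sketch: the function names, the payload shape, and the `send` callable (anything that returns True on a successful delivery) are hypothetical, and a real sender would also persist the attempt log so it can answer "were we notified?".

```python
import hashlib
import hmac
import json
import time


def sign_payload(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 of the raw request body, hex-encoded."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


def deliver_with_retries(send, body, signature, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Call send(body, signature) until it succeeds; return the attempt log."""
    log = []
    for attempt in range(1, max_attempts + 1):
        try:
            ok = bool(send(body, signature))
        except Exception:            # a raised error counts as a failed attempt
            ok = False
        log.append({"attempt": attempt, "ok": ok})
        if ok:
            break
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return log


secret = b"shared-secret-placeholder"
body = json.dumps({"event": "sdn.address_added", "chain": "ETH"}).encode()
sig = sign_payload(secret, body)

calls = {"n": 0}

def flaky_send(payload, signature):
    """Stand-in endpoint that is down for the first two attempts."""
    calls["n"] += 1
    return calls["n"] >= 3

delays = []
log = deliver_with_retries(flaky_send, body, sig, sleep=delays.append)
print(log, delays)
```

Sign the exact bytes you send, not a re-serialized copy, or the receiver's recomputed digest will not match.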

3. The watcher dies and nobody watches the watcher

This is the one that actually bites people. Cron jobs fail silently. Treasury tweaks the XML schema and your parser throws — but only in the logs nobody reads. The poller has been dead for three weeks and everything looks fine because no news looks identical to good news. You need a dead-man's switch: something that alarms when the pipeline stops producing, not just when it finds a change.
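A dead-man's switch can be as small as a timestamp and a staleness check. The sketch below assumes a shared `state` store (a dict here; in practice a file, a database row, or a key-value entry) and hypothetical function names. The important design choice is that the check runs from a different scheduler than the poller, so the two do not fail together.

```python
import time


def record_success(state, now=None):
    """Call at the end of every poll that fetched and parsed without error."""
    state["last_success"] = time.time() if now is None else now


def pipeline_is_stale(state, max_silence_seconds, now=None):
    """True when the poller has been silent longer than allowed, or never ran."""
    now = time.time() if now is None else now
    last = state.get("last_success")
    return last is None or (now - last) > max_silence_seconds


state = {}  # stand-in for the shared store
two_hours = 2 * 3600

never_ran = pipeline_is_stale(state, two_hours, now=10_000.0)
record_success(state, now=10_000.0)
fresh = pipeline_is_stale(state, two_hours, now=10_000.0 + 3600)
stale = pipeline_is_stale(state, two_hours, now=10_000.0 + 3 * 3600)
print(never_ran, fresh, stale)
```

Note that `record_success` fires on every clean poll, including the ones that find no change; that is what lets you tell "no news" apart from "dead".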

4. Freshness vs. politeness

How often do you poll? Too rare and you're stale when it counts; too aggressive and you're hammering a government endpoint. You'll want conditional requests (ETag / If-Modified-Since), sane backoff, and a defensible "we re-check every N" story you can put in front of an auditor.
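A conditional fetch with the standard library might look like the sketch below. The function names and the URL are placeholders, and whether the endpoint you poll actually returns `ETag` or `Last-Modified` validators is an assumption you should verify; if it sends neither, this degrades to a plain download. Backoff on errors is left out to keep it short.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Attach the validators saved from the previous successful fetch, if any."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified); body is None when nothing changed."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:          # Not Modified: keep the old validators
            return None, etag, last_modified
        raise


# Placeholder URL; no network call is made here.
req = build_conditional_request(
    "https://example.invalid/sdn.xml",
    etag='"abc123"',
    last_modified="Tue, 02 Jun 2026 00:00:00 GMT",
)
print(req.header_items())
```

Store the returned validators next to your snapshot so each poll can send them back; a 304 response also gives you a cheap, loggable "we checked and nothing changed" record.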

What "done right" actually requires

If you build it yourself, get these four things right or don't bother:

Idempotent diffing on normalized, per-chain canonical addresses — not raw string equality, not entry-keyed snapshots.
Signed webhooks + retries with backoff, plus email/Telegram fan-out that degrades gracefully.
A delivery-status history you can point at to prove every detected change was actually dispatched.
A dead-man's switch on the pipeline itself, because silence is the failure you won't notice.

None of this is exotic. It's just boring, and easy to get 80%-right in a way that fails exactly when it matters. That gap — between "it runs" and "I'd stake an audit on it" — is the whole job.

Or don't build it

I got tired of watching every crypto team rebuild this same plumbing, so I packaged the boring layer as OFAC Alert: hourly-refreshed SDN data, normalized cross-chain diffing, HMAC-signed webhooks with retries, delivery history, batch screening, and a REST API (live docs). If the piece you actually want is "tell me the moment a watched address changes," that's exactly what its OFAC SDN change alerts do. The free tier monitors one address with no signup gate, so you can see the shape of it.

To be clear about scope: it is not a Chainalysis/TRM/Elliptic replacement — no risk scoring, no clustering, no enterprise contract. It's the monitoring-and-delivery layer for the free sanctions data, built once so it's reliable and not your problem.

But honestly — whether you use it or roll your own — get those four things right. The data being free is the easy part. Staking your compliance posture on a cron job is the part that keeps people up at night.