I built a backtest engine for SPY put credit spreads. 7 years of 1-minute option chains. 96 grid cells across (delta, DTE, profit-target, stop-loss). 16,024 trades that survived validation.
This post is the part most quant tutorials skip: the bugs, the lying simulators, and the one-line feature that humiliated my 3-layer composite signal. If you've ever shipped a backtest with great-looking numbers, you might want to read the bug log section.
## The first lesson: mid-fill backtests are lying to you
The naive engine filled at the bid-ask mid. Numbers looked great. Then I implemented what an actual desk does - post a limit at combo_ask + $0.04, wait for someone to cross it, accept rejection - and CAGR dropped 30–60% across the entire grid. Several cells flipped from positive to negative.
Real fill stats from one run:
| Run | Posted | Filled | Fill rate | Avg wait | Edge captured |
|---|---|---|---|---|---|
| 45Δ 30 DTE PT 50% No stop | 479 | 112 | 23.4% | 12.4 min | −$0.04 to −$0.07 |
| 45Δ 30 DTE PT 50% SL 100% | 633 | 155 | 24.5% | 12.2 min | −$0.04 to −$0.07 |
| 45Δ 30 DTE PT 50% SL 200% | 383 | 64 | 16.7% | 13.3 min | −$0.04 to −$0.07 |
You fill ~20–25% of orders, after a ~12-minute wait, at 4–7 cents worse than mid. The MM doesn't make money at the fill - they make it over the hold, when theta beats realized vol. Any backtest filling at mid is silently gifting your strategy that 4–7 cents per trade.
If you remember nothing else from this post: build a real fill model.
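A minimal sketch of what "a real fill model" means here: post a limit, watch subsequent quotes, and only count a fill when the market actually crosses your price, with rejection after a timeout. All names and the toy numbers are illustrative, not the engine's actual API:

```python
from dataclasses import dataclass

@dataclass
class Fill:
    filled: bool
    price: float = 0.0
    wait_minutes: int = 0

def try_limit_fill(limit_price, combo_bids, max_wait=30):
    """Pessimistic fill model for a short combo: we post a limit to SELL
    the spread and only count a fill when the best bid crosses our price.
    combo_bids: per-minute best bid for the combo after order placement.
    No fill within max_wait minutes means the order is rejected."""
    for minute, bid in enumerate(combo_bids[:max_wait], start=1):
        if bid >= limit_price:          # someone crossed our limit
            return Fill(True, limit_price, minute)
    return Fill(False)

# toy usage: the bid drifts up and crosses our 1.34 limit on minute 3
print(try_limit_fill(1.34, [1.30, 1.32, 1.35, 1.36]))
```

The key property is that the model is allowed to say "no fill" — that single capability is what knocks the mid-fill fantasy out of the numbers.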
## The single biggest finding: SL=100% IS the strategy
Same cell. Same fills. Same period. Toggle the stop loss:
| Configuration | Trades | Return | CAGR | MaxDD | Calmar |
|---|---|---|---|---|---|
| 10Δ 7 DTE PT 50% No stop | - | −100% | wipeout | breaker | - |
| 10Δ 7 DTE PT 50% SL 100% | 460 | +5,439% | +66.0% | 30.1% | 2.19 |
| 10Δ 7 DTE PT 75% SL 100% | 360 | +2,947% | +53.9% | 31.1% | 1.73 |
| 10Δ 7 DTE PT 25% SL 100% | 649 | +1,752% | +44.6% | 30.2% | 1.48 |
+5,439% with a stop. −100% without. Same cell.
A single bad Monday on a no-stop short-vol position takes the account to zero. The stop loss isn't a parameter you sweep - at short DTE it's the entire risk model.
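For concreteness, here is a sketch of the per-bar exit check this sweep implies — profit target and stop loss both expressed as fractions of the credit received. The function and its thresholds are illustrative, not the engine's actual code:

```python
def exit_reason(credit, mark, profit_target=0.5, stop_loss=1.0):
    """Exit logic for a short credit spread, marked per bar.
    credit: premium received at entry; mark: current cost to buy back.
    profit_target and stop_loss are fractions of the credit;
    stop_loss=None reproduces the no-stop configuration."""
    pnl = credit - mark                      # positive as the spread decays
    if pnl >= profit_target * credit:
        return "profit_target"
    if stop_loss is not None and pnl <= -stop_loss * credit:
        return "stop_loss"                   # loss reached SL x credit
    return None                              # hold

# sold for a 1.00 credit:
print(exit_reason(1.00, 0.50))   # buy-back at 0.50 -> 'profit_target'
print(exit_reason(1.00, 2.00))   # buy-back at 2.00 -> 'stop_loss' (SL 100%)
print(exit_reason(1.00, 1.20))   # small loss -> None, keep holding
```

SL=100% means exit once the mark reaches twice the credit; the table above says that one inequality is the difference between +5,439% and a blown-up account.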
## The trap nobody warns you about: SL=200%
I expected SL=200% to be the sensible compromise. Looser than 100% so I don't get noise-stopped, tighter than no-SL so a tail doesn't kill me. Wrong:
| SL setting | Trades | Return | CAGR | Sharpe | MaxDD |
|---|---|---|---|---|---|
| No stop | 112 | +30.6% | +3.8% | +0.23 | 30.1% |
| SL 100% | 155 | +28.0% | +3.5% | +0.22 | 29.3% |
| SL 200% ← trap | 64 | −16.5% | −2.5% | −0.17 | 31.2% |
Both no-stop and SL=100% made money. SL=200% lost.
Mechanics: by the time the loss has grown to 200% of credit, you're deep ITM and gamma is doing the marking, not theta. You stop out at the worst possible price, on the worst possible day, after letting the position breathe past the point of recovery. SL=100% stops you before gamma takes over. No SL at least gives you the chance to be bailed out by a recovery.
Pick tight or pick none. Never the middle.
## My fancy signal got beaten by a one-liner
I built a 3-layer composite - Premium / Danger / Stabilization scores, z-scored macro inputs, continuous Kelly multiplier. Then I ran t-tests across all 16,024 trades.
The strongest predictor in the entire feature set wasn't my composite. It was a single boolean flag on a free macro series - the kind of thing you write in three lines of pandas:
```python
# pseudocode - the actual flag stays with me
signal_on = (macro_series.diff(5) < 0) & (macro_series < macro_series.rolling(20).mean())
```
A t-stat over 8, on 5,000+ trades. Bonferroni-correct it across the entire feature space and it still wins.
The lesson generalizes: if your "edge" is a 47-feature gradient-boosted model, replace it with the most economically obvious single flag and check how much you actually lose. Often that flag does 80% of the work and your model is overfitting the residual.
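For readers who haven't run this kind of test: a per-feature comparison like the one above is just a two-sample t-statistic on trade PnL with the flag on vs off, plus a stricter per-test threshold when you test many features. This is a generic illustration (stdlib only), not the exact procedure used on the 16k trades:

```python
import math

def welch_t(a, b):
    """Welch's t-statistic comparing mean trade PnL with the flag on (a)
    vs off (b). Unequal variances and sample sizes are allowed."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Bonferroni: testing k features while keeping family-wise alpha at 0.05
# means each individual test must clear p < 0.05 / k.
k_features = 40
per_test_alpha = 0.05 / k_features   # 0.00125 per feature
```

With thousands of trades per bucket, a t-stat over 8 corresponds to a p-value so small that dividing alpha by any plausible feature count doesn't threaten it — which is what "survives Bonferroni" means.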
Two related findings worth knowing:
- The market pays the most right before it bites. Top-quintile VRP days had 66% winrate vs 74% baseline. By the time premium is that rich, something is actually wrong.
- Don't sell into rising fear with an inverted term curve. Wait for the curve to normalize OR for the fear to fade. Either is fine. Both being wrong is the worst regime in the data.
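The VRP-quintile number above comes from a standard bucket-and-average cut. A sketch of that computation, assuming a trades DataFrame with a `vrp` column (variance risk premium at entry) and a boolean `win` column — column names are illustrative:

```python
import pandas as pd

def winrate_by_quintile(trades: pd.DataFrame) -> pd.Series:
    """Bucket trades into VRP-at-entry quintiles and compute the winrate
    in each bucket. Label 5 is the richest-premium quintile."""
    q = pd.qcut(trades["vrp"], 5, labels=[1, 2, 3, 4, 5])
    return trades.groupby(q, observed=True)["win"].mean()
```

If the top bucket's winrate sits well below the overall baseline, as in the 66%-vs-74% figure above, the richest premium days are exactly the ones where selling hurts.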
## Multi-DTE pooling cost me 9pp of CAGR
Seemed obvious: instead of committing to one tenor, rank candidates across 30/45/60-DTE chains every entry, pick the best EV-per-dollar-at-risk:
| Tenor selection | Trades | Return | CAGR | Sharpe | MaxDD |
|---|---|---|---|---|---|
| Pool 30/45/60 DTE No stop | 112 | +30.6% | +3.8% | +0.23 | 30.1% |
| Pool 30/45/60 DTE SL 100% | 155 | +28.0% | +3.5% | +0.22 | 29.3% |
| Focused 30 DTE No stop | 82 | +70% | +12.5% | +0.72 | 27.5% |
The ranker correctly picked 30-DTE ~68% of the time. The other ~32% it picked 45-DTE specifically because the 30-DTE chain looked worse than usual that bar - which is the textbook definition of adverse selection.
More degrees of freedom = more ways for the optimizer to be wrong, not more ways to be right.
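The pooled ranker itself is almost trivially simple, which is part of the lesson — the adverse selection lives in the inputs, not in clever code. A sketch, with made-up field names and numbers:

```python
def pick_candidate(candidates):
    """Pooled-tenor selection: rank every candidate spread across all DTE
    chains by expected value per dollar of max risk and take the best.
    This is the step that adversely selects 45-DTE precisely when the
    30-DTE chain is quoting worse than usual."""
    return max(candidates, key=lambda c: c["ev"] / c["max_risk"])

cands = [
    {"dte": 30, "ev": 0.12, "max_risk": 4.0},   # 0.030 EV/$
    {"dte": 45, "ev": 0.15, "max_risk": 4.5},   # 0.033 EV/$ <- picked
    {"dte": 60, "ev": 0.14, "max_risk": 5.0},   # 0.028 EV/$
]
print(pick_candidate(cands)["dte"])   # 45
```

Nothing in `max()` knows *why* 45-DTE screens best on this bar; when the reason is that 30-DTE quotes just deteriorated, the ranker is systematically buying the worse trade.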
## The bug log
This is the part most quant blogs skip. Every one of these bugs would have inflated the headline numbers. Most were caught only after a code review:
| Bug | What it did | Damage |
|---|---|---|
| Mid-fill assumption | Filled at bid-ask average always | Flipped multiple losing strategies positive |
| Look-ahead in signal | Day-D signal used end-of-D data at 10:05 AM entry | Inflated CAGR ~5–10pp on most cells |
| Stale-quote acceptance | "Fills" at quotes that were no longer real liquidity | ~30% of fills had negative edge captured |
| EV-sorted tiebreak | Higher-EV candidate "filled" first when two crossed same bar | Subtle but real per-trade lift |
| Warmup sizing bug | Full Kelly applied before signal had any history | Cratered 2018 results |
| Validation walk-back mismatch | Validator used exact-date lookup, engine walked back 7 days | Bogus regression stats on weekend dates |
| Walk-the-limit (rejected) | Drop limit a penny each minute if unfilled | Caught before merge - adverse selection |
The pre-fix engine on the same data returned −1% to −8% CAGR across every cell. So when post-fix numbers turned positive, that wasn't simulator noise - it was what the bias was masking.
If your backtest looks great on the first run, you have a bug. Find it.
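The look-ahead bug in the table is the most common one, and it has a mechanical guard: when joining a daily signal onto intraday entries, forbid exact and future matches so a morning entry can only see completed days. A generic sketch using pandas `merge_asof` — not the engine's actual code, and it assumes daily values are stamped at the close:

```python
import pandas as pd

def lag_daily_signal(entry_times: pd.Series, daily_signal: pd.Series) -> pd.Series:
    """For each intraday entry timestamp, fetch the last daily signal whose
    stamp is strictly BEFORE the entry. allow_exact_matches=False is the
    guard: a 10:05 AM entry on day D can never see day D's end-of-day value."""
    entries = pd.DataFrame({"ts": pd.to_datetime(entry_times)}).sort_values("ts")
    daily = daily_signal.rename("signal").rename_axis("stamp").reset_index()
    merged = pd.merge_asof(entries, daily.sort_values("stamp"),
                           left_on="ts", right_on="stamp",
                           allow_exact_matches=False)
    return merged.set_index("ts")["signal"]
```

The fix is boring; the point is that it is structural. With the join written this way, the look-ahead class of bug cannot be reintroduced by a later edit to the signal code.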
## The sizing layer
```python
SIZING = {
    "kelly_default": 0.05,         # half-Kelly, conservative on purpose
    "kelly_max": 0.25,
    "kelly_f_hard_cap": 1.0,       # leveraged math can multiply to 1.25 without it
    "drawdown_breaker_pct": 0.30,  # halts the run, not the position
    "absolute_floor_pct": 0.50,    # secondary backstop
    "warmup_multiplier": 0.0,      # cratered 2018 when this was 1.0
    "vrp_on_mult": 2.5,            # 2.5 in the leveraged stress test
}
```
Half-Kelly is conservative on purpose. Mean-variance Kelly assumes Gaussian returns; short-vol returns are skewed left with fat tails. The "true" Kelly under fat tails is below the mean-variance Kelly, so half-Kelly isn't lazy - it's roughly correct.
The 30% drawdown breaker doesn't reduce position size or pause for a day. It halts the entire run. If your strategy needs 30% drawdown to work, you don't have a strategy, you have a martingale.
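To make the half-Kelly argument concrete, here is the textbook binary-outcome Kelly with the scaling and hard cap applied. The winrate and payoff numbers in the example are illustrative stand-ins for a PT-50%/SL-100% cell, not figures from the grid:

```python
def kelly_fraction(win_rate, avg_win, avg_loss, fraction=0.5, hard_cap=0.25):
    """Binary-outcome Kelly: f* = p - q/b with b = avg_win / avg_loss,
    then scale by `fraction` (half-Kelly) and clamp at hard_cap.
    Under fat left tails the true optimum sits below this Gaussian-ish
    estimate, which is why the scaling is a correction, not cowardice."""
    b = avg_win / abs(avg_loss)
    f_star = win_rate - (1.0 - win_rate) / b
    return max(0.0, min(fraction * f_star, hard_cap))

# illustrative cell: 72% winrate, wins ~0.5x credit, losses ~1.0x credit
print(kelly_fraction(0.72, 0.5, 1.0))   # full Kelly 0.16 -> half-Kelly 0.08
```

Note the asymmetry doing the work: at a 2:1 loss-to-win ratio, even a 72% winrate only justifies risking a few percent per trade, and halving that is what keeps one bad Monday survivable.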
## TL;DR
- Mid-fill backtests overstate CAGR by 30–60%. Build a real fill model.
- Stop loss at 100% of credit IS the game at short DTE. +5,400% with, −100% without.
- SL=200% is worse than no stop. Pick tight or pick none.
- The strongest signal in 16k trades was a one-line boolean flag.
- Multi-DTE pooling cost 9pp of CAGR. Adverse selection.
- Higher delta moves equity vol, not alpha.
- The market pays the most right before it bites.
- If your backtest looks great on the first run, find the bug.
The full write-up with all 96 cells, the leveraged stress preset, and where the Calmar-leader cell sits in the grid is on FlashAlpha: original article.
The actual constraint on reproducing this isn't the engine (the engine is bookkeeping) - it's having minute-resolution SPY chains going back to 2018 with surface-consistent IVs. That's what FlashAlpha's Alpha tier historical API is for.