If you've ever read a blog post claiming "short straddles when GEX is positive returns 30% a year" and wondered why the claim evaporates when you try to reproduce it, the answer is usually the same: the backtest cheated somewhere. Either it used data that didn't exist at the decision time, it labelled regimes using end-of-day aggregates that smear across the true intraday flip, or it skipped execution costs that swamp the signal.
FlashAlpha's Historical Analytics API makes the honest version of this backtest easier than the cheat: a single at parameter, minute-level regime labels back to 2018, and leak-free percentiles by design. This article walks through the whole pipeline.
What We're Testing
A simple, well-known hypothesis: SPY mean-reverts intraday when dealers are long gamma and trends when dealers are short gamma. Specifically, we'll test whether short-term (30-minute) returns after 15:30 ET differ by dealer regime. If the hypothesis is real, 15:30-to-close returns should be smaller and more mean-reverting during positive-gamma regimes and larger (in either direction) during negative-gamma regimes.
This is a textbook GEX claim. It's also exactly the kind of thing that looks real in a sloppy backtest and dies in a careful one. Let's do it carefully.
Step 1: Pull the Regime Labels
We need, for every trading day in the test window, the dealer regime at 15:30 ET. That's /v1/exposure/summary/{symbol} with at set to 15:30 on each day.
```python
import httpx
import pandas as pd
from tqdm import tqdm

API_KEY = "..."
BASE = "https://historical.flashalpha.com"

dates = pd.bdate_range("2022-01-03", "2025-12-31")
rows = []
with httpx.Client(headers={"X-Api-Key": API_KEY}, timeout=30) as c:
    for d in tqdm(dates):
        ts = d.strftime("%Y-%m-%dT15:30:00")
        r = c.get(f"{BASE}/v1/exposure/summary/SPY", params={"at": ts})
        if r.status_code != 200:
            continue  # holiday or missing data for this date
        j = r.json()
        rows.append({
            "date": d,
            "spot_1530": j["underlying_price"],
            "net_gex": j["exposures"]["net_gex"],
            "net_dex": j["exposures"]["net_dex"],
            "regime": j["regime"],
            "gamma_flip": j["gamma_flip"],
            "zero_dte_pct": j.get("zero_dte", {}).get("pct_of_total_gex", 0),
        })

gex = pd.DataFrame(rows).set_index("date")
```
About 1,000 business days. At roughly 200ms per call, this runs in ~3 minutes single-threaded. Run it once, cache to parquet, move on.
Leak check: every number in gex is computed from data available at or before 15:30 ET on that date. The net_gex number reflects OI from that morning's EOD load (yesterday's OI) plus minute-level greek recomputation from 15:30 spot and vols. Nothing later than 15:30 ET on the given date contributes.
Step 2: Pull the Forward Returns
For each date, we need the 15:30-to-16:00 ET return. The 15:30 spot is already in gex (the summary's underlying_price), so one stock-quote call per day, at 16:00, completes the pair.
```python
close_rows = []
with httpx.Client(headers={"X-Api-Key": API_KEY}, timeout=30) as c:
    for d in tqdm(dates):
        r = c.get(
            f"{BASE}/v1/stockquote/SPY",
            params={"at": d.strftime("%Y-%m-%dT16:00:00")},
        )
        if r.status_code == 200:
            close_rows.append({"date": d, "spot_close": r.json()["mid"]})

closes = pd.DataFrame(close_rows).set_index("date")
df = gex.join(closes, how="inner").dropna()
df["ret_30min"] = df["spot_close"] / df["spot_1530"] - 1
```
Leak check: the forward return uses 16:00 ET data to measure performance of a signal generated at 15:30 ET. There's a 30-minute gap, which is the signal window. No feature crosses the signal-to-outcome boundary.
Step 3: Condition on Regime
```python
by_regime = df.groupby("regime")["ret_30min"].agg(
    n="count",
    mean_bps=lambda x: x.mean() * 10000,
    std_bps=lambda x: x.std() * 10000,
    abs_mean_bps=lambda x: x.abs().mean() * 10000,
)
print(by_regime)
```
The interesting column is abs_mean_bps — the average absolute move. If the mean-reversion hypothesis is right, it should be smaller in positive_gamma than in negative_gamma. If the directional hypothesis is right (short puts in positive gamma), mean_bps should differ too.
Running this against the 2022–2025 window, you'll typically find a real but small effect: absolute 15:30-to-close moves are roughly 30–40% smaller in positive-gamma days. That's enough to matter for an intraday vol-selling strategy and not nearly enough to matter as a directional overlay. This is the kind of nuance that clean backtests produce and messy ones miss.
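Before trusting a gap like that, it's worth a significance check. The sketch below is a generic permutation test, not a FlashAlpha endpoint; it assumes df carries the regime and ret_30min columns built above:

```python
import numpy as np
import pandas as pd


def regime_gap_pvalue(df, n_perm=2000, seed=0):
    """One-sided permutation test: is the mean |30-min return| larger on
    negative-gamma days than on positive-gamma days? Shuffling the labels
    destroys any true association, giving a null distribution for the gap."""
    rng = np.random.default_rng(seed)
    absret = df["ret_30min"].abs().to_numpy()
    neg = (df["regime"] == "negative_gamma").to_numpy()
    observed = absret[neg].mean() - absret[~neg].mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(neg)
        null[i] = absret[p].mean() - absret[~p].mean()
    return observed, float((null >= observed).mean())
```

A p-value near 0.5 here means your 30-40% gap is noise; the effect in the text survives this test on the real data or it doesn't count.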
Step 4: Stratify on GEX Magnitude
The regime label is a coarse cut. The underlying net_gex is continuous. Bucket by deciles and look at the monotone relationship:
```python
df["gex_decile"] = pd.qcut(df["net_gex"], 10, labels=False)
by_decile = df.groupby("gex_decile").agg(
    gex_mean=("net_gex", "mean"),
    n=("ret_30min", "count"),
    abs_move_bps=("ret_30min", lambda x: x.abs().mean() * 10000),
)
print(by_decile)
```
What you want to see is a monotone curve — low GEX deciles should show larger absolute moves, high GEX deciles smaller. If the curve is flat or non-monotone, the regime signal isn't really tracking what you think it's tracking and the rest of the strategy won't work.
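One way to score that curve numerically, assuming the by_decile frame from above. This computes a Spearman correlation by hand (Pearson on ranks), so it needs nothing beyond pandas:

```python
import pandas as pd


def decile_monotonicity(by_decile):
    """Spearman rank correlation between decile number and mean |move|,
    computed as Pearson correlation on ranks. Near -1: moves shrink
    monotonically as net GEX rises. Near 0: the signal isn't tracking
    what you think it is."""
    ranks = by_decile["abs_move_bps"].rank()
    order = pd.Series(range(len(by_decile)), index=by_decile.index).rank()
    return order.corr(ranks)
```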
Step 5: Scale Up — Every Minute, Every Day
Once the daily-grid version works, the minute-level version is mechanical. Instead of one sample per day at 15:30, pull every minute from 9:30 to 16:00 and condition on intraday regime changes. That's 390 calls per day × 1,000 days = 390,000 calls.
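The only new piece is the timestamp grid itself. A sketch (minute_grid is a hypothetical helper) that assumes full regular sessions; half-days would need an exchange calendar:

```python
import pandas as pd


def minute_grid(start, end):
    """One timestamp per minute bar, 09:30 through 15:59 ET (390 bars),
    for each business day in the window."""
    days = pd.bdate_range(start, end)
    minutes = pd.timedelta_range("9h30min", "15h59min", freq="1min")
    return pd.DatetimeIndex([d + m for d in days for m in minutes])
```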
The minute-level grid lets you answer strictly better questions:
- What happens in the 15 minutes after a regime flip? Filter df to rows where regime[t] != regime[t-1], compute forward returns over varying windows.
- Do moves compress as spot approaches the gamma flip? Compute spot_to_flip_pct and stratify.
- Is 0DTE GEX a stronger signal intraday than all-expiry GEX? The zero_dte block in the summary gives you both; run the same test on each.
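The first of those questions can be sketched directly. This assumes a minute-indexed frame with regime and mid columns (hypothetical names, matching the fields used earlier):

```python
import pandas as pd


def post_flip_returns(minute_df, horizon=15):
    """Forward returns over `horizon` bars, sampled at regime flips.
    shift(-horizon) looks strictly forward, so the flip label itself
    never uses post-flip data."""
    flipped = minute_df["regime"] != minute_df["regime"].shift()
    flipped.iloc[0] = False  # first bar has no prior regime to compare
    fwd = minute_df["mid"].shift(-horizon) / minute_df["mid"] - 1
    return fwd[flipped].dropna()
```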
These are the questions the dealer-positioning literature claims to answer but rarely tests at the resolution of the claim. The Historical API makes them runnable.
The Three Traps This Setup Avoids
1. Rolling-Percentile Leakage
If your backtest uses VRP percentile or GEX percentile as a feature, make sure the percentile at time t is computed only from data strictly before t. FlashAlpha's VRP endpoint does this by default.
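If you need the same guarantee for a series the API doesn't cover, the leak-free rank is cheap to write yourself. A generic sketch that ranks each value only against strictly prior history:

```python
import pandas as pd


def leakfree_percentile(s: pd.Series) -> pd.Series:
    """Percentile of s[t] within s[:t] (strictly before t). The first
    value has no history, so it comes back NaN rather than a silently
    leaked rank."""
    out = []
    for i in range(len(s)):
        hist = s.iloc[:i]
        out.append((hist < s.iloc[i]).mean() if i else float("nan"))
    return pd.Series(out, index=s.index)
```

The O(n²) loop is fine for daily series; a rolling window or an online rank structure is the fix if you ever run it on minute bars.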
2. End-of-Day Smearing
Net GEX can flip intraday. A backtest that uses close-of-day GEX to label a 10:30 ET entry is measuring a feature that didn't exist yet. The Historical API's minute resolution removes this problem — you can label each bar with the regime as it stood at that bar's open.
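When regime timestamps and bar timestamps don't line up exactly, a backward as-of join gives each bar the last regime observed at or before its open. A sketch with hypothetical ts/regime column names:

```python
import pandas as pd


def label_bars(bars, regimes):
    """Attach to each bar the regime as of that bar's open.
    direction='backward' picks the most recent regime timestamp at or
    before each bar, so no future label bleeds in."""
    return pd.merge_asof(
        bars.sort_values("ts"),
        regimes.sort_values("ts"),
        on="ts",
        direction="backward",
    )
```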
3. Survivorship in Symbol Selection
If you backtest only on SPY, you don't have a survivorship problem. If you start adding single-names later, make sure your universe at each date reflects what was tradable then, not what's listed now. FlashAlpha's coverage endpoint is explicit about which symbols had data when.
Going Beyond Regime Labels
Everything above uses the scalar regime and net GEX as the signal. The full exposure/summary response is richer — it includes net DEX (directional dealer positioning), net VEX (vanna), net CHEX (charm), and interpretation text. For ML workflows, pull the whole response and let the model sort out which fields matter:
```python
def features_at(client, symbol, at_ts):
    s = client.get(
        f"{BASE}/v1/exposure/summary/{symbol}", params={"at": at_ts}
    ).json()
    v = client.get(
        f"{BASE}/v1/vrp/{symbol}", params={"at": at_ts}
    ).json()
    return {
        "spot": s["underlying_price"],
        "net_gex": s["exposures"]["net_gex"],
        "net_dex": s["exposures"]["net_dex"],
        "net_vex": s["exposures"]["net_vex"],
        "net_chex": s["exposures"]["net_chex"],
        "gamma_flip": s["gamma_flip"],
        "spot_to_flip_pct": (
            s["underlying_price"] - s["gamma_flip"]
        ) / s["underlying_price"],
        "zero_dte_pct": s.get("zero_dte", {}).get("pct_of_total_gex", 0),
        "vrp_20d": v["vrp"]["vrp_20d"],
        "vrp_z": v["vrp"]["z_score"],
        "vrp_pct": v["vrp"]["percentile"],
        "atm_iv": v["vrp"]["atm_iv"],
        "hv_20": v["vrp"]["rv_20d"],
    }
```
That's 13 features per timestamp from two API calls. For tabular models (XGBoost, LightGBM), this is already in the right shape. For transformers, pull the full nested responses and flatten lazily.
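Flattening lazily can be as simple as pandas' json_normalize. The payloads below are made up, shaped like the summary response above:

```python
import pandas as pd

# Hypothetical nested payloads shaped like the exposure/summary response.
payloads = [
    {"underlying_price": 475.2,
     "exposures": {"net_gex": 1.8e9, "net_dex": -4.1e8},
     "zero_dte": {"pct_of_total_gex": 0.42}},
    {"underlying_price": 473.9,
     "exposures": {"net_gex": -6.0e8, "net_dex": -2.2e8},
     "zero_dte": {"pct_of_total_gex": 0.55}},
]

# sep="." turns nesting into dotted column names, one row per timestamp.
flat = pd.json_normalize(payloads, sep=".")
```

Every nested field becomes a column, so new fields in the response show up in the frame without code changes.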
What You Actually Pay For
The cost of this backtest — end-to-end, against 4 years of history — is a function of how many minute-level samples you pull. Daily snapshots across 4 years ≈ 1,000 calls per endpoint per symbol. Minute-level is 390× that.
The alternative — building this dataset yourself — is a 3-to-6 month engineering project: ThetaData subscription, options parquets, BSM pipeline, SVI fitter, columnar store, minute-level partitioning, holidays/half-days handling, coverage reports. FlashAlpha's Historical API is the bought-not-built version of that stack.
Related
- Historical Options Analytics API — full overview
- Leak-Free VRP Percentiles
- SPY COVID Crash Replay — a concrete case study
- GEX Trading Guide
- VRP Z-Score Timing
Historical API is available on the Alpha tier. Get your free API key to start with the live endpoints.