tomasz dobrowolski

Posted on Jun 8 • Originally published at flashalpha.com

Quant Options Backtesting on Point-in-Time Data: The Complete Guide

#quant #python #trading #datascience


Summary

Backtesting options strategies is structurally harder than backtesting equity momentum or fixed-income carry. The payoff is path-dependent. The instrument set re-generates every week. Fills are wide and asymmetric. And the data problem (point-in-time chains, computed dealer exposure, arbitrage-free surfaces) is usually bigger than the signal problem.

This guide is written for a quant or systematic developer who wants to:

Research a dealer-positioning, VRP, or dispersion signal on real historical data.
Build a candidate set across the options universe, not just one ticker.
Run a fill model that reflects what you would have actually paid.
Avoid the ten systematic mistakes that inflate backtested Sharpe before live trading erases the edge.
Connect the same code to a live production signal without rewriting the data layer.

The historical endpoint set covers: GEX/DEX/VEX/CHEX exposure replay (/v1/exposure/*?date=), VRP time-series and percentiles (/v1/vrp/{t}/history), full option chains with greeks and IV at minute resolution (/v1/options/{t}?date=), IV surfaces and SVI parameters (/v1/surface/{t}?date= and /v1/adv_volatility/{t}?date=), and the universe screener (POST /v1/screener). History goes back to 2018. The response shape matches the live endpoints, with one exception for option quotes covered later in this guide: add or remove the date parameter to switch modes.

What systematic options research actually needs

Three requirements separate a backtest you can trust from a backtest that lies to you.

1. Point-in-time data

A request for /v1/exposure/gex/SPY?date=2020-03-16T14:30:00Z must return the state of the book exactly as it existed at that minute: open interest, greeks, and computed gamma exposure as of 14:30:00, not end-of-day OI stamped back and not a forward fill from the prior session.

This matters for two concrete reasons. First, dealer-positioning signals use OI to compute GEX, and end-of-day OI is not known at 14:30. Second, option greeks depend on implied vol, which moves continuously, so a 15:30 vol observation is not the same as a 14:30 vol observation on a volatile day. Any historical dataset that stamps a single daily value and calls it "intraday" is baking lookahead bias into every signal that uses it.

Gamma exposure is:

GEX(t) = Σ_k OI_k(t) · Γ_k(S_t, σ_k(t), T_k − t)

Each term depends on spot S_t, the per-strike IV σ_k(t), and time to expiry at exactly time t. Substituting end-of-day values for any of these introduces forward-looking information.
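As a rough illustration of the sum above, the sketch below computes a Black-Scholes gamma per strike and weights it by open interest. The Black-Scholes model, the zero-rate default, the function names, and the sample book are all assumptions for illustration; the API's actual GEX methodology (sign conventions, contract multiplier, spot scaling) is not described here.

```python
import math

def bs_gamma(spot: float, strike: float, iv: float, t_years: float, rate: float = 0.0) -> float:
    """Black-Scholes gamma (the same for calls and puts)."""
    if iv <= 0 or t_years <= 0:
        return 0.0
    d1 = (math.log(spot / strike) + (rate + 0.5 * iv**2) * t_years) / (iv * math.sqrt(t_years))
    pdf = math.exp(-0.5 * d1**2) / math.sqrt(2 * math.pi)
    return pdf / (spot * iv * math.sqrt(t_years))

def gex(spot: float, contracts: list) -> float:
    """Sum of OI_k * Gamma_k, with every input observed at the same time t."""
    return sum(c["oi"] * bs_gamma(spot, c["strike"], c["iv"], c["t_years"]) for c in contracts)

# Made-up two-strike book, purely illustrative.
book = [
    {"strike": 240.0, "iv": 0.75, "t_years": 5 / 365, "oi": 12_000},
    {"strike": 250.0, "iv": 0.70, "t_years": 5 / 365, "oi": 8_000},
]
gex_at_t = gex(250.0, book)      # spot as of the decision time
gex_later = gex(240.0, book)     # same book with a later spot gives a different number
```

The point of the last two lines is that swapping in a later spot (or later IVs) changes the result, which is exactly the lookahead the formula warns about.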

2. No lookahead bias

Lookahead is not only about timestamps. It also lives in:

OI revision. Exchanges publish preliminary OI and revise it the next morning. A historical dataset assembled from the revised numbers makes yesterday's GEX look different from what any live system would have computed.
Methodology drift. If the analytics vendor updates the GEX formula, historical replays under the new formula produce different numbers than a live system running the old formula would have produced. Archive raw responses if you need bit-exact reproducibility.
Surface parameterization. SVI calibration parameters (a, b, ρ, m, σ) are an end-of-day fit in this dataset. An intraday request returns the most recent prior EOD SVI fit. If your signal uses the SVI parameters at minute resolution, you have one observation per trading day, not per minute. Use the minute-level surface grid for intraday surface features, and SVI params for daily cross-sectional work.

3. Same schema, research to production

The single most underrated property of a historical API: research and production should call the same function. If your backtest calls hx.exposure_summary("SPY", at="2022-06-15T14:30:00") and your live signal calls fa.exposure_summary("SPY") and the response shapes differ, you will discover the discrepancy the first time the live system hits a field the backtest parser never saw. Every undocumented difference between historical and live response shapes is a production incident waiting to happen.

The FlashAlpha API uses the same JSON schema for both modes. The date parameter controls replay; omit it for live. One exception: the historical option-quote endpoint returns a flat array with renamed fields (implied_vol instead of iv, open_interest instead of oi) and historical-only fields (iv_bid, iv_ask, vanna, charm, rho). Write a thin adapter and test it against both modes before you backtest at scale.
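A minimal sketch of such an adapter, assuming each historical quote arrives as a plain dict. Only the two renames and the five historical-only fields named above are taken from the text; the function name, the keep_hist_only flag, and the sample rows are hypothetical, and how the live response nests its rows is not covered here.

```python
# Field renames named in the text: historical name -> live name.
HIST_TO_LIVE = {"implied_vol": "iv", "open_interest": "oi"}
# Fields the text lists as historical-only.
HIST_ONLY = {"iv_bid", "iv_ask", "vanna", "charm", "rho"}

def normalize_quote(row: dict, keep_hist_only: bool = False) -> dict:
    """Map one option-quote row onto live field names."""
    out = {}
    for key, value in row.items():
        if key in HIST_ONLY and not keep_hist_only:
            continue  # drop fields the live parser never sees
        out[HIST_TO_LIVE.get(key, key)] = value
    return out

# Illustrative rows, not real API output.
hist_row = {"strike": 240.0, "implied_vol": 0.72, "open_interest": 1500, "vanna": 0.01}
live_row = {"strike": 240.0, "iv": 0.72, "oi": 1500}
```

Because live rows pass through unchanged, the same function can sit in front of both modes, which makes it easy to test against each.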

The data: what the historical API covers

Dealer exposure replay (GEX/DEX/VEX/CHEX)

All four exposures replay at minute resolution from 2018, available via their own endpoints and via the unified exposure summary.

# Historical exposure summary, the one-call option
curl -H "X-Api-Key: YOUR_KEY" \
  "https://lab.flashalpha.com/v1/exposure/summary/SPY?date=2020-03-16T14:30:00Z"

# Or per-metric:
curl -H "X-Api-Key: YOUR_KEY" \
  "https://lab.flashalpha.com/v1/exposure/gex/SPY?date=2020-03-16T14:30:00Z"

from flashalpha_historical import FlashAlphaHistorical

hx = FlashAlphaHistorical(api_key="YOUR_KEY")

snap = hx.exposure_summary("SPY", at="2020-03-16T14:30:00")

print(f"Regime: {snap['regime']['label']}")
print(f"Net GEX: {snap['net_gex']:,.0f}")
print(f"Gamma flip: ${snap['regime']['gamma_flip']}")
print(f"Call wall: ${snap['levels']['call_wall']}")
print(f"Put wall:  ${snap['levels']['put_wall']}")

The response includes as_of, so you can confirm the timestamp is what you asked for. On March 16 2020 the net GEX was deeply negative; the COVID-crash dealer-positioning replay is the canonical stress-test for any strategy that conditions on GEX regime.

VRP time series and percentiles

The volatility risk premium (VRP) measures implied minus realized vol. The history endpoint returns a daily time series of VRP, z-score, and percentile rank against the trailing window, all point-in-time: the percentile for day T is computed using only data through day T.

curl -H "X-Api-Key: YOUR_KEY" \
  "https://lab.flashalpha.com/v1/vrp/SPY/history?lookback=252"

curl -H "X-Api-Key: YOUR_KEY" \
  "https://lab.flashalpha.com/v1/vrp/SPY?date=2022-10-14"

The vrp_percentile field is the fraction of days in the trailing window where VRP was below the current reading, computed with only prior observations. This is the operative number for a premium-selling trigger: if it is 85 or above, VRP was lower than the current reading on at least 85% of days in the lookback window, conditioning only on past data.
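To make "computed with only prior observations" concrete, here is a small sketch of a trailing percentile. It is an illustration of the idea, not the API's implementation; the function name, the strict "below" comparison, and the sample readings are assumptions.

```python
from typing import List, Optional

def trailing_percentile(values: List[float], window: int = 252) -> List[Optional[float]]:
    """Percent of the previous `window` readings strictly below each reading."""
    out: List[Optional[float]] = []
    for i, current in enumerate(values):
        history = values[max(0, i - window):i]   # strictly before i: no lookahead
        if len(history) < window:
            out.append(None)                     # not enough history yet
            continue
        below = sum(1 for v in history if v < current)
        out.append(100.0 * below / window)
    return out

vrp_readings = [1.0, 2.0, 3.0, 4.0, 2.5]         # made-up daily VRP values
pctiles = trailing_percentile(vrp_readings, window=3)
```

A percentile computed over the full sample would instead rank each reading against future observations, which is the lookahead to avoid.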

Full option chains at minute resolution

The full chain endpoint returns every listed contract (bid, ask, IV, delta, gamma, theta, vega, vanna, charm, rho, OI) for a symbol at any minute since 2018.

curl -H "X-Api-Key: YOUR_KEY" \
  "https://lab.flashalpha.com/v1/options/SPY?date=2020-03-16T14:30:00Z"

Honest resolution table for the chain endpoint:

Field	Resolution in history
Bid, ask, IV, delta, gamma, theta, vega	Minute-level (9:30 to 16:00 ET)
Vanna, charm, rho (historical-only)	Minute-level
Open interest	EOD-stamped (one value per trading day)
Volume	Always 0; use OI as a liquidity proxy
SVI-smoothed vol (`svi_vol`)	Always null (`svi_vol_gated: "backtest_mode"`)

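If your GEX calculation uses OI and you replay at minute resolution, you are using that day's opening OI for every minute of the session. That is what any live system would have done, since OI is published once per morning and does not update intraday. So the one-value-per-day stamp is correct for intraday GEX replay, not a limitation, provided the stamped value is the figure that was available that morning and not one settled after that day's close.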

IV surface and SVI parameters

The 50x50 implied vol surface grid (moneyness by maturity) evolves at minute resolution, driven by per-contract quotes. The SVI calibration parameters are EOD-stamped.

surface = hx.surface("SPY", at="2022-06-15T14:30:00")
adv     = hx.adv_volatility("SPY", at="2022-06-15T14:30:00")
# surface: minute-level 50x50 grid
# adv: EOD-stamped SVI params {a, b, rho, m, sigma} +
#      arbitrage-free flags + variance-swap strike

The universe screener: building a candidate set

Every cross-sectional strategy starts with a candidate set. The screener endpoint ranks and filters across the full symbol universe on GEX, VRP, IV rank, 0DTE contribution, and custom score formulas.

curl -X POST "https://lab.flashalpha.com/v1/screener" \
  -H "X-Api-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": {
      "op": "and",
      "conditions": [
        {"field": "vrp_percentile", "operator": "gte", "value": 75},
        {"field": "iv_rank",        "operator": "gte", "value": 50},
        {"field": "regime",         "operator": "eq",  "value": "positive_gamma"}
      ]
    },
    "sort": [{"field": "vrp_zscore", "direction": "desc"}],
    "limit": 30
  }'

In a backtest, run the screener at the decision timestamp on each rebalance date, then pull per-name history for those candidates only. This is both computationally efficient and methodologically correct: you only see names the screener would have surfaced on that date, not names that survived to the end of your sample.

Building a backtest: the full stack

The canonical loop:

Signal(t) → [screener] → Candidates(t) → [history] → Per-name data(t)
          → [fill model] → Fills(t) → [accum] → Metrics

Signal design

A signal is a function from the point-in-time state of one or more endpoints to a trade decision. Examples:

Dealer-positioning (GEX): Enter a momentum position when net GEX is negative and its 5-day moving average is declining.
VRP premium-selling: Sell a put credit spread when VRP percentile is at least 80 and IV rank is at least 50, exit when VRP z-score reverts below its median.
Dispersion: Long single-name straddle, short index straddle, when the ratio of single-name IV to index IV is below its 6-month median.

Signals must be computable from endpoint fields available point-in-time. GEX is available at minute resolution. SVI parameters in a time series are EOD, so your signal updates once per day. OI trajectory is EOD, so the "trajectory" is one observation per day.

Candidate set via screener

import requests

API = "https://lab.flashalpha.com"
HEADERS = {"X-Api-Key": "YOUR_KEY", "Content-Type": "application/json"}

def screener_candidates(date_str: str, vrp_pctile: int = 75, limit: int = 30):
    """Run the screener point-in-time on a rebalance date."""
    body = {
        "date": date_str,
        "filters": {
            "op": "and",
            "conditions": [
                {"field": "vrp_percentile", "operator": "gte", "value": vrp_pctile},
                {"field": "iv_rank",        "operator": "gte", "value": 40},
                {"field": "regime",         "operator": "in",
                 "value": ["positive_gamma", "neutral"]}
            ]
        },
        "sort":  [{"field": "vrp_zscore", "direction": "desc"}],
        "limit": limit
    }
    r = requests.post(f"{API}/v1/screener", headers=HEADERS, json=body)
    r.raise_for_status()
    return [row["symbol"] for row in r.json()["data"]]

candidates = screener_candidates("2022-10-03")

Per-name history retrieval

from flashalpha_historical import FlashAlphaHistorical

hx = FlashAlphaHistorical(api_key="YOUR_KEY")

def fetch_signal_state(symbol: str, ts: str) -> dict:
    """Fetch point-in-time VRP + exposure for a candidate."""
    vrp  = hx.vrp(symbol, at=ts)
    summ = hx.exposure_summary(symbol, at=ts)
    vol  = hx.volatility(symbol, at=ts)
    return {
        "symbol":     symbol,
        "as_of":      ts,
        "vrp_pctile": vrp["vrp_percentile"],
        "vrp_zscore": vrp["vrp_zscore"],
        "vrp_regime": vrp["regime"],
        "gex_regime": summ["regime"]["label"],
        "gamma_flip": summ["regime"]["gamma_flip"],
        "net_gex":    summ["net_gex"],
        "iv_rank":    vol["iv_rank"],
        "iv_atm_30d": vol["atm_iv_30d"],
    }

The fill model is the edge

This is where most backtests lie. The naive fill model uses the midpoint price at decision time. That model is wrong in every direction that matters for premium sellers and spread traders:

You sell at the bid, not the mid. For a short put spread with a mid of $1.20 and a $0.20-wide bid-ask, the bid is $1.10, so your actual fill is closer to $1.10 than $1.20. At 250 trades per year, that $0.10 slip compounds.
The bid-ask widens before earnings and into events. Exactly when VRP looks richest, spreads are widest.
Volume is zero in historical chains. Use OI as a liquidity proxy and apply a conservative fill model for low-OI strikes.

A tractable fill model:

Fill_sell = Mid(t) − α · (Ask(t) − Bid(t)) / 2
Fill_buy  = Mid(t) + α · (Ask(t) − Bid(t)) / 2

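Where α in [0.3, 0.7] is a market-impact parameter, the fraction of the half-spread you give up: α = 0 fills at the mid, and α = 1 fills at the bid for sells or the ask for buys. Use α = 0.5 as a baseline; test sensitivity before claiming edge.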

def fill_price(bid: float, ask: float, side: str, alpha: float = 0.5) -> float:
    mid = (bid + ask) / 2
    half_spread = (ask - bid) / 2
    if side == "sell":
        return mid - alpha * half_spread
    return mid + alpha * half_spread

def fill_spread(leg1, leg1_side, leg2, leg2_side, alpha=0.5) -> float:
    f1 = fill_price(leg1["bid"], leg1["ask"], leg1_side, alpha)
    f2 = fill_price(leg2["bid"], leg2["ask"], leg2_side, alpha)
    return f1 - f2

Metrics

Sharpe is a reliable summary only when returns are close to normal; options premium-selling has negative skewness and excess kurtosis. Report Sharpe as a directional indicator, but lean on Sortino, Calmar, and maximum drawdown for sizing. Win rate alone is misleading: a 70% win rate with a 3:1 loss-to-win payoff is a losing strategy (0.7 × 1 − 0.3 × 3 = −0.2 per unit of average win).
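The arithmetic behind that last claim, plus two of the metrics mentioned, can be sketched in a few lines. These are textbook definitions under simplifying assumptions (per-period Sortino with no annualization, drawdown as a fraction of the running peak); the function names and sample numbers are illustrative.

```python
import math

def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected P&L per trade, in the same units as avg_win and avg_loss."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss

def max_drawdown(equity: list) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def sortino(returns: list, target: float = 0.0) -> float:
    """Mean excess return over downside deviation (per period, not annualized)."""
    downside = math.sqrt(sum(min(0.0, r - target) ** 2 for r in returns) / len(returns))
    mean_excess = sum(returns) / len(returns) - target
    return mean_excess / downside if downside > 0 else float("inf")

edge = expectancy(win_rate=0.70, avg_win=1.0, avg_loss=3.0)   # negative despite 70% wins
dd = max_drawdown([100.0, 120.0, 90.0, 110.0])                # made-up equity curve
```

Calmar follows from the same pieces: annualized return divided by the maximum drawdown.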

Example strategies

Dealer-positioning momentum

When dealers are net short gamma, they buy into rallies and sell into drops to stay delta-neutral, amplifying directional moves. A momentum signal conditioned on negative-gamma regime should outperform an unconditional one.

At 9:45 ET, pull /v1/exposure/summary/{t} to determine the regime.
If net_gex < 0 and the 5-day EMA of net_gex is declining, the regime is "amplifying."
Compute the opening 15-minute return. If positive in an amplifying regime, go long a 1-week ATM call debit spread; if negative, a 1-week ATM put debit spread.
Exit at 14:30 or when the spread's mid reaches 2x its entry mid, whichever comes first.
Flat when regime is positive gamma or undefined.
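The rules above can be sketched as a single decision function. This is one reading of them: the EMA is seeded with the first observation, "declining" means today's EMA is below yesterday's, and the function and label names are made up for the example.

```python
def ema(values: list, span: int = 5) -> list:
    """Exponential moving average, seeded with the first observation."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def momentum_decision(net_gex_history: list, opening_return: float) -> str:
    """net_gex_history: daily net GEX, oldest first, ending with today's 9:45 reading."""
    if len(net_gex_history) < 2:
        return "flat"                       # regime undefined
    smoothed = ema(net_gex_history, span=5)
    amplifying = net_gex_history[-1] < 0 and smoothed[-1] < smoothed[-2]
    if not amplifying or opening_return == 0:
        return "flat"
    if opening_return > 0:
        return "long_call_debit_spread"
    return "long_put_debit_spread"

# Made-up net GEX path drifting from positive to negative.
falling = [1.0e9, 5.0e8, -2.0e8, -6.0e8]
```

Keeping the rule as a pure function of point-in-time inputs means the same code can be fed by the historical client in research and the live client in production.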

VRP premium-selling

Implied vol systematically overstates realized vol on average. When VRP is in the upper tercile, selling premium has historically had positive expected value.

Daily, gate on vrp_percentile ≥ 75 and vrp_regime == "rich".
Sell a 30-delta put credit spread 21 DTE, short put at the put wall, long put $5 further OTM.
Exit at 50% of initial credit or 7 DTE.
Apply the fill model with α = 0.5, cap at 2% notional per trade.
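The exit rule in the list above is simple enough to pin down in code. The sketch assumes "50% of initial credit" means half of the credit has been captured, that is, the spread can be bought back for half of what it was sold for; the function name and return labels are hypothetical.

```python
from typing import Optional

def should_exit(initial_credit: float, debit_to_close: float, dte: int,
                profit_take: float = 0.5, min_dte: int = 7) -> Optional[str]:
    """Return an exit reason, or None to keep holding."""
    captured = initial_credit - debit_to_close
    if captured >= profit_take * initial_credit:
        return "profit_target"
    if dte <= min_dte:
        return "time_stop"
    return None
```

In a backtest, debit_to_close should itself come from the fill model (buying back at mid plus α times the half-spread), not from the raw mid.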

Realized-vol dispersion

Index implied vol tends to trade rich relative to the level implied by its constituents' vols and their realized correlation; that excess is the correlation risk premium. The trade is long single-name straddles, short the index straddle, vega-neutral. Pull /v1/adv_volatility/{name} per constituent and for the index on the same timestamp.
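One common way to put a number on this is the implied correlation backed out of index and constituent implied vols, under the simplifying assumption of a single pairwise correlation shared by all constituents. The sketch below is that textbook calculation, not something the API is stated to provide; the weights and vols are made up.

```python
import math

def implied_correlation(index_iv: float, weights: list, constituent_ivs: list) -> float:
    """Solve sigma_I^2 = sum(w_i^2 s_i^2) + rho * [(sum w_i s_i)^2 - sum(w_i^2 s_i^2)] for rho."""
    own = sum((w * s) ** 2 for w, s in zip(weights, constituent_ivs))
    cross = sum(w * s for w, s in zip(weights, constituent_ivs)) ** 2 - own
    return (index_iv ** 2 - own) / cross

# Two-name toy index, equal weights, made-up vols.
weights = [0.5, 0.5]
ivs = [0.30, 0.40]
index_iv = math.sqrt(0.0925)            # chosen so that the implied correlation is 0.5
rho = implied_correlation(index_iv, weights, ivs)
```

Comparing this implied figure with subsequently realized correlation, computed point-in-time, is one way to track how rich index vol is relative to its constituents.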

Validation and pitfalls

Walk-forward, not in-sample optimization. Fit on a training window, evaluate on the next held-out window, roll forward.
Out-of-sample periods must include stress events. A backtest that excludes March 2020, August 2024, and Q4 2018 is an in-sample fit on calm markets.
Costs must be realistic enough to matter. If removing slippage changes Sharpe by more than 0.3, the result is being driven by the cost assumptions rather than the signal, so stress those assumptions before trusting it.
Regime sensitivity test. Split by VIX tercile. If it only works in one regime, say so.
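A walk-forward schedule is easy to get subtly wrong, so here is a minimal index-based sketch. The rolling (not expanding) training window and the step equal to the test size are choices made for the example; the function name is hypothetical.

```python
def walk_forward_splits(n_obs: int, train_size: int, test_size: int):
    """Yield (train, test) index ranges; each test window starts where its training window ends."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size               # roll forward by one test window

splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
```

Parameters are fit on each train range and then locked before touching the matching test range; the out-of-sample record is the concatenation of the test windows.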

The kinks and common mistakes

Kink 1: Survivorship bias. A fixed ticker list (today's S&P 500) excludes names delisted or merged before today but present at the time. Construct the universe from point-in-time index membership.
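A minimal sketch of a point-in-time universe lookup, assuming you have membership records with add and removal dates from some source. The symbols, dates, and function name below are entirely hypothetical.

```python
from datetime import date

# Hypothetical membership records: (symbol, added, removed); removed=None means still a member.
MEMBERSHIP = [
    ("AAA", date(2015, 1, 2), None),
    ("BBB", date(2016, 6, 1), date(2021, 3, 15)),   # later delisted or merged away
    ("CCC", date(2022, 9, 19), None),               # joined after the sample start
]

def universe_as_of(as_of: date, membership=MEMBERSHIP) -> list:
    """Symbols that were members on `as_of`, regardless of what happened afterwards."""
    return sorted(
        sym for sym, added, removed in membership
        if added <= as_of and (removed is None or as_of < removed)
    )
```

A 2020 rebalance then sees BBB but not CCC, which is the opposite of what a list built from today's membership would give.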

Kink 2: Lookahead in computed analytics. VRP percentile computed over the full sample ranks a 2019 reading against observations that had not happened yet. Verify the window is trailing.

Kink 3: Restatement and non-determinism. Recomputed analytics change if methodology updates. Archive raw responses for bit-exact reproducibility.

Kink 4: Ignoring fills and slippage. A 1-DTE short straddle mid-priced at $3.20 with a $0.25 bid-ask on each leg costs $0.50 round-trip when filled at the touch ($0.125 half-spread × 2 legs × 2 sides), 15.6% of premium. Apply the fill model and vary α from 0.3 to 0.7.
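To see how the cost in Kink 4 scales with α, the sketch below reads the $0.25 as the bid-ask on each of the straddle's two legs (an assumption) and charges α times the half-spread per leg on both entry and exit. The function name is hypothetical.

```python
def round_trip_cost(spread_per_leg: float, n_legs: int, alpha: float) -> float:
    """Cost versus mid of opening and closing: alpha * half-spread, per leg, paid twice."""
    half_spread = spread_per_leg / 2
    return 2 * n_legs * alpha * half_spread

mid_premium = 3.20
for alpha in (0.3, 0.5, 0.7, 1.0):
    cost = round_trip_cost(0.25, n_legs=2, alpha=alpha)
    print(f"alpha={alpha:.1f}  cost=${cost:.3f}  share of premium={cost / mid_premium:.1%}")
```

With α = 1 (fills at the touch) this gives the $0.50 round trip; the 0.3 to 0.7 range sits proportionally below it.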

Kink 5: Overfitting strike, expiry, and threshold. Many free parameters. Treat threshold as a hyper-parameter, fit on training only, test out-of-sample with it locked.

Kink 6: Regime dependence without disclosure. VRP selling wins slowly in calm markets, loses quickly in spikes. Report Calmar, worst drawdown, and drawdown duration. Backtest any drawdown breaker as a system parameter.

Kink 7: Point-in-time OI vs revised OI. Using finalized OI for a 14:30 signal is lookahead. Verify you use the preliminary morning publication.

Kink 8: Train/test leakage via shared normalization. Fit all preprocessors on training data only, then transform both train and test with frozen parameters.
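A small sketch of what "frozen parameters" means in practice, using a z-score as the preprocessor. The class name and data are illustrative; the same fit-then-transform pattern applies to any scaler or ranker.

```python
import statistics

class FrozenZScore:
    """Z-score whose mean and std come from the training window only."""

    def fit(self, train: list) -> "FrozenZScore":
        self.mean = statistics.fmean(train)
        self.std = statistics.pstdev(train)   # assumes the training data is not constant
        return self

    def transform(self, values: list) -> list:
        return [(v - self.mean) / self.std for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up feature values
test = [6.0, 7.0]
scaler = FrozenZScore().fit(train)    # fit once, on training data only
train_z = scaler.transform(train)
test_z = scaler.transform(test)       # test is transformed, never fitted
```

Fitting on train and test together would shift the mean toward the test values and leak their level into the training features.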

Kink 9: Research vs production data mismatch. The historical option-quote shape differs from live (flat array, renamed fields). Write one parser that handles both, test against both before deploying.

Kink 10: Assignment and pin risk near expiry. A short put ITM within 2 to 3 DTE carries early-assignment risk. For credit spreads, pin risk leaves a naked overnight position with gap risk. Flag any position at 1 DTE with the short strike within 1% of spot, and apply a conservative exit fill.
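The flag described in Kink 10 can be expressed as a one-line check. The function name and defaults are hypothetical; the 1 DTE and 1% thresholds come from the text.

```python
def pin_risk_flag(spot: float, short_strike: float, dte: int,
                  max_dte: int = 1, band: float = 0.01) -> bool:
    """True when the position is at or inside max_dte with the short strike within `band` of spot."""
    return dte <= max_dte and abs(spot - short_strike) / spot <= band
```

Flagged positions are the ones to close with a conservative exit fill instead of carrying into expiry.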

Worked example: a VRP backtest sketch

A condensed sketch of the entry side of a put credit spread backtest on SPY using the VRP signal. The short strike is anchored at the put wall; the sketch does not check the 30-delta target.

from datetime import date
import pandas as pd
import requests
from flashalpha_historical import FlashAlphaHistorical

API_BASE = "https://lab.flashalpha.com"
API_KEY  = "YOUR_KEY"
hx = FlashAlphaHistorical(api_key=API_KEY)

SIGNAL_OPEN_HOUR  = "15:30:00"
VRP_PCTILE_THRESH = 75
IV_RANK_THRESH    = 40
ALPHA_FILL        = 0.5
DTE_OPEN          = 21

start, end = date(2019, 1, 2), date(2024, 12, 31)
cal_days = pd.bdate_range(start, end, freq="B")
rebalance_dates = cal_days[::5]  # weekly

trades = []
for rd in rebalance_dates:
    # rd is a pandas Timestamp; take the date part so the time is not doubled
    ts = f"{rd.date().isoformat()}T{SIGNAL_OPEN_HOUR}"
    try:
        vrp_data = hx.vrp("SPY", at=ts)
        vol_data = hx.volatility("SPY", at=ts)
        exp_data = hx.exposure_summary("SPY", at=ts)
    except Exception:
        continue

    if (vrp_data["vrp_percentile"] < VRP_PCTILE_THRESH or
            vol_data["iv_rank"] < IV_RANK_THRESH or
            exp_data["regime"]["label"] not in ("positive_gamma", "neutral")):
        continue

    put_wall = exp_data["levels"]["put_wall"]
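    resp = requests.get(
        f"{API_BASE}/v1/options/SPY",
        headers={"X-Api-Key": API_KEY},
        params={"date": ts},
        timeout=30,
    )
    resp.raise_for_status()  # fail loudly instead of parsing an error body
    chain = resp.json()      # historical mode: flat list of contract dicts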

    short_put = next(
        (c for c in chain
         if c["option_type"] == "P"
         and abs(c["strike"] - put_wall) <= 2.5
         and DTE_OPEN - 3 <= c["days_to_expiry"] <= DTE_OPEN + 3), None)
    if short_put is None:
        continue
    long_put = next(
        (c for c in chain
         if c["option_type"] == "P"
         and c["strike"] == short_put["strike"] - 5
         and c["days_to_expiry"] == short_put["days_to_expiry"]), None)
    if long_put is None:
        continue

    # fill_price() is the helper defined in the fill-model section above
    short_fill = fill_price(short_put["bid"], short_put["ask"], "sell", ALPHA_FILL)
    long_fill  = fill_price(long_put["bid"],  long_put["ask"],  "buy",  ALPHA_FILL)
    net_credit = short_fill - long_fill
    if net_credit <= 0:
        continue

    trades.append({
        "open_date":    rd,
        "short_strike": short_put["strike"],
        "long_strike":  short_put["strike"] - 5,
        "net_credit":   round(net_credit, 4),
        "vrp_pctile":   vrp_data["vrp_percentile"],
    })

df = pd.DataFrame(trades)
print(f"Trades: {len(df)}")

What this sketch leaves out, deliberately, as kink-avoidance exercises: the exit loop, the assignment model, the drawdown breaker, and walk-forward parameter selection.

Tooling: endpoints and the MCP connector

Endpoint	Use	Tier
`GET /v1/exposure/gex/{t}?date=`	Net GEX by strike, gamma flip, walls	Alpha
`GET /v1/exposure/summary/{t}?date=`	Full dealer-positioning state	Alpha
`GET /v1/options/{t}?date=`	Full option chain: IV, greeks, OI, bid/ask	Alpha
`GET /v1/vrp/{t}/history`	VRP series, z-score, percentile	Alpha
`GET /v1/volatility/{t}?date=`	IV, realized vol, skew, term structure	Alpha
`GET /v1/surface/{t}?date=`	50x50 IV surface grid (minute-level)	Alpha
`GET /v1/adv_volatility/{t}?date=`	SVI params (EOD), variance surface, arb flags	Alpha
`POST /v1/screener`	Universe filter on GEX, VRP, IV rank, regime	Growth

# Historical replay
from flashalpha_historical import FlashAlphaHistorical
hx = FlashAlphaHistorical(api_key="YOUR_KEY")
snap = hx.exposure_summary("SPY", at="2022-06-15T14:30:00")

# Live production, same method names, no at=
from flashalpha import FlashAlpha
fa = FlashAlpha(api_key="YOUR_KEY")
snap = fa.exposure_summary("SPY")

For LLM-augmented research, the quant MCP connector exposes the full Historical API as callable tools:

{
  "mcpServers": {
    "flashalpha-quant": {
      "url": "https://lab.flashalpha.com/mcp-oauth/quant"
    }
  }
}

Conclusion

The systematic options research pipeline has three honest bottlenecks: point-in-time data, a realistic fill model, and discipline against the ten kinks that inflate backtested Sharpe before live trading reveals the truth.

Once the backtest is validated, the production migration is close to a one-parameter change: swap the historical client for the live one and remove at= from each call, keeping the option-quote adapter in place for the one endpoint whose shape differs. That is the architectural reason to use a pre-computed analytics layer rather than building a raw-chain pipeline from scratch.

This post was originally published on FlashAlpha. FlashAlpha provides pre-computed options analytics (GEX, DEX, VEX, CHEX, SVI surfaces, VRP) for 6,000+ US equities and ETFs via API. Free API key, no credit card.
