tomasz dobrowolski

Posted on May 30 • Originally published at flashalpha.com

Machine Learning on Options Data: An Honest Quant ML Guide

#machinelearning #quant #python #api

Options data is one of the densest public-markets modalities and one of the worst-served by ML toolchains. This is a direct tour of eight ML methodologies that have published research or plausible workflows on options, grouped by maturity, with the data shape each one needs, where historical replay actually delivers minute-level signal versus end-of-day, and the honest list of what none of this solves.

I maintain the calculator stack this is written against, so I'll be upfront about the bias: I sell pre-computed analytics on top of historical options data. For a trader the pitch has always been "you don't have to build the calculator yourself." For an ML engineer the pitch is sharper: you don't have to build the training corpus yourself. Most of the exposure summaries, volatility analytics, and quote streams that power the live API are also available point-in-time across the history window. Some pieces (SVI parameters, open interest, macro overlays) are EOD-stamped rather than minute-level, and I'll flag exactly where that matters per methodology.

If you want the case against before the survey, skip to the failures section at the end.

The pipeline problem (why this field is small)

Before any model trains, the data has to be shaped, and options data is uniquely punishing here.

Raw chains aren't features. A quote feed gives you bids and asks per strike. You still need spot, the forward curve, dividends, and a consistent Greeks pass before any of those numbers are usable as model inputs.
Surface fits are not optional. Without an arbitrage-free fit, your "implied volatility" feature is whatever Newton-Raphson converged to per strike, which means your skew, term, and butterfly features are noise plus signal. Models trained on that overfit to fit instability.
Multi-year minute-level coverage is large and hard to join. You need spot bars, option quotes per strike per expiration, open interest, macro overlays (VIX, VVIX, SKEW, MOVE), and event calendars. Assembling all of this from raw tick providers takes months, and then you own the maintenance.
Leakage hides in EOD. Most academic options datasets are end-of-day. Intraday strategies trained on EOD labels are quietly using settlement information that wouldn't have been known mid-session. Backtest looks great, live looks terrible.
Greeks consistency. If your training data has Greeks from one IV-surface assumption and your inference path uses another, you have silent training-serving skew. The feature drifts and you blame the model.

This is why most published ML-on-options papers use a tiny strike subset and a short window. Every methodology below assumes the data is shaped right. If it isn't, your research timeline is consumed before you ever read a paper.

Two SDKs, one wire format

There are two FlashAlpha Python packages, and conflating them is a common mistake.

flashalpha wraps the live service. High-level methods do not accept an at= kwarg.
flashalpha-historical wraps the historical service. Every high-level method requires at= for point-in-time replay.

# Recommended: dedicated historical SDK
# pip install flashalpha-historical
from flashalpha_historical import FlashAlphaHistorical

hx = FlashAlphaHistorical(api_key="YOUR_KEY")
snap = hx.exposure_summary("SPY", at="2020-03-16T15:30:00")
vol  = hx.volatility("SPY", at="2024-06-03T14:30:00")

# Fallback: hit historical REST directly with the same X-Api-Key
import requests
BASE = "https://historical.flashalpha.com"
HEADERS = {"X-Api-Key": "YOUR_KEY"}

def get(path, at, **params):
    params["at"] = at
    r = requests.get(f"{BASE}{path}", headers=HEADERS, params=params, timeout=60)
    r.raise_for_status()
    return r.json()

vol = get("/v1/volatility/SPY", at="2024-06-03T14:30:00")

What historical actually ships: minute-level vs EOD

"Same response shape as live" is true for the analytics surface. It is materially misleading for a handful of endpoints, and the resolution truth matters more than the schema.

Field	Resolution
Per-contract bid, ask, IV, Greeks, spot	Minute, 9:30 to 16:00 ET
50x50 surface grid values	Minute (driven by quotes)
SVI calibration parameters {a, b, ρ, m, σ}	EOD-stamped (one fit per trading day)
Open interest	EOD-stamped
Macro (VIX, VVIX, SKEW, MOVE)	EOD-stamped
Forward prices	EOD-stamped
Per-contract `svi_vol`	Always null in historical (`backtest_mode`)
Per-contract `volume`	Always 0 in historical (use `open_interest`)

That table dictates which methodology survives at minute resolution and which is bounded to EOD.

The maturity map

The eight are not equally mature.

Production-proven: realised-vol regression (1), deep hedging (4) for market makers and option desks.
Research-backed: sequence models on surfaces (3), vol-surface anomaly detection (5), causal inference and event studies (8).
Plausible but not production: regime classification on pre-computed dealer labels (2), generative path augmentation (6).
Speculative: GNNs on chains (7).

I include all eight because the data shape supports all eight. I am not selling all eight as alpha.

If you read nothing else, read this table. It pulls the rest of the article into one place: what each method is mature enough to ship, the resolution constraint that actually bites it, and the tier that unblocks it.

#	Methodology	Maturity	Resolution constraint	Tier to run it
1	Realised-vol regression	Production	Macro overlay is EOD; features otherwise minute	Growth live / Alpha replay
2	Regime classification	Plausible	Labels minute-level; classifier methodology drifts	Growth live / Alpha replay
3	Sequence models on surfaces	Research	Grid minute, SVI latent EOD	Alpha
4	Deep hedging / RL	Production (desks)	Greeks minute, OI EOD, full chain per step	Alpha
5	Surface anomaly detection	Research	Grid minute, SVI reference EOD	Alpha
6	Generative path augmentation	Plausible	Grid minute; can't synthesise unseen tails	Alpha
7	GNNs on chains	Speculative	Chain minute; no known PnL profile	Alpha
8	Causal inference / event studies	Research	Surface minute; statistical power is the limit	Alpha

The pattern is blunt: live feature engineering starts at Growth, and every method that needs point-in-time history is Alpha. The free tier inspects response shapes on a single non-index equity and nothing more.

1. Realised-volatility forecasting (regression)

The simplest, most-published, most reliably useful-at-the-margin target: predict realised vol at a forward horizon (1d, 5d, 21d) given the current implied surface and recent realised history.

Realised vol over horizon h is RV(t, h) = sqrt((252/h) · Σ r²), where the sum runs over the h daily log returns in the window and the factor 252 annualises. The vol risk premium is VRP = σ_IV − σ_RV. The forecasting target is the realised side; the implied side is a feature, because traders pricing protection are betting on the future return distribution and that signal leaks into level, skew, and term structure.
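A minimal sketch of the two definitions above, assuming daily log returns and annualised vols expressed as decimals. The function names and the sample numbers are illustrative, not part of either SDK. For a forecasting label, pass the h returns that follow t; for a feature, pass the h returns that precede it.

```python
import numpy as np

def realised_vol(log_returns, h, periods_per_year=252):
    """Annualised realised vol over the last h returns: sqrt((252/h) * sum(r^2))."""
    r = np.asarray(log_returns, dtype=float)[-h:]
    if r.size < h:
        raise ValueError("need at least h returns")
    return float(np.sqrt(periods_per_year / h * np.sum(r ** 2)))

def vol_risk_premium(implied_vol, realised):
    """VRP = sigma_IV - sigma_RV, both annualised and in the same units."""
    return implied_vol - realised

# Illustrative numbers only: 21 daily log returns of 1% each, 20% implied vol.
rv_21d = realised_vol([0.01] * 21, h=21)
vrp = vol_risk_premium(0.20, rv_21d)
```

Note that the sum uses raw squared returns with no mean subtraction, matching the formula in the text.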

Tree ensembles (XGBoost, LightGBM) and small recurrent models exploit this with a feature set that is easy to assemble: realised vol at multiple lookbacks, ATM 30-day IV, 25-delta risk reversal (skew), 30d-vs-90d term slope, 25-delta strangle vs ATM (butterfly), and the macro overlay (EOD-stamped in history).

vol = hx.volatility("SPY", at="2024-06-14T15:30:00")
# realised-vol ladders, ATM IV, skew, term structure
# field names follow the live /v1/volatility response
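The skew, term, and butterfly features listed above reduce to simple differences once you have four implied vols in hand. The sketch below takes plain floats rather than a response object, because the exact response field names are not reproduced here; the function name, the dictionary keys, and the sign conventions are my assumptions (desks differ on whether a risk reversal is call minus put or the reverse).

```python
def surface_features(atm_iv_30d, atm_iv_90d, put_25d_iv, call_25d_iv):
    """Skew, term slope, and butterfly from four implied vols (decimals, e.g. 0.18)."""
    return {
        # 25-delta risk reversal: call wing minus put wing (convention varies).
        "risk_reversal_25d": call_25d_iv - put_25d_iv,
        # Term slope: 90-day ATM minus 30-day ATM; positive means upward sloping.
        "term_slope_30_90": atm_iv_90d - atm_iv_30d,
        # Butterfly: average of the 25-delta wings (the strangle) minus ATM.
        "butterfly_25d": 0.5 * (put_25d_iv + call_25d_iv) - atm_iv_30d,
    }

# Illustrative inputs only.
features = surface_features(
    atm_iv_30d=0.15, atm_iv_90d=0.17, put_25d_iv=0.19, call_25d_iv=0.14
)
```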

What it doesn't solve: regime change (the model always lags) and event-driven shocks (FOMC and earnings follow a different conditional distribution; treat them as a separate feature regime). This is the Kaggle-shaped end of options ML. Useful as a building block, not as standalone institutional alpha.

2. Regime classification (dealer gamma, VRP)

Conditional strategies are where most options alpha actually lives: sell premium when VRP is rich, buy gamma when dealers are short, fade rallies in positive-gamma regimes, ride them in negative-gamma. Every one needs a label, and labels are where most ML projects on options quietly fail, either look-ahead-biased or arbitrary.

The exposure summary endpoint returns a categorical regime label per minute alongside net GEX, gamma flip, 0DTE contribution, and dealer-positioning interpretations, with the same classifier running across the replay window.

summary = hx.exposure_summary("SPY", at="2024-06-14T15:30:00")
# net GEX/DEX/VEX/CHEX, gamma flip, 0DTE contribution,
# dealer-hedging narrative, regime label

Honest caveat: the classifier evolves as bugs are fixed, so replay reflects current methodology, not a frozen-at-the-time snapshot. If you need bit-exact reproducibility against an older pull, archive your responses. The labels are descriptive, not causal: knowing the market is in a negative-gamma regime doesn't tell you when it ends. Pair with a separate transition-timing model. That's why this sits in "plausible but not production."
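Archiving can be as simple as writing each response to disk under a key built from endpoint, symbol, and timestamp, plus a content hash so a later pull can be compared against the stored one. This is a sketch under my own assumptions: the directory layout and function names are hypothetical, and the payload is whatever dict the call returned.

```python
import hashlib
import json
from pathlib import Path

def archive_response(root, endpoint, symbol, at, payload):
    """Write one JSON response to disk; return (path, sha256 of the canonical JSON)."""
    # Sorted keys and fixed separators make the hash independent of key order.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    safe_at = at.replace(":", "-")  # colons are not valid in Windows filenames
    path = Path(root) / endpoint / symbol / f"{safe_at}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path, digest

def load_archived(path):
    """Read an archived response back into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

If a re-pull for the same timestamp hashes differently from the archived copy, the methodology changed between pulls, and any model trained on the old labels should be evaluated against the archive rather than the fresh replay.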

3. Sequence models on IV surfaces (LSTM, transformer)

The surface at time t is an (n_strikes by n_expirations) tensor; the sequence over a day is a tensor time series. This is squarely the kind of input that patch-based transformers, TCNs, and dilated convolutions handle well. The canonical reference is Horvath, Muguruza, and Tomas, "Deep Learning Volatility" (2019), which uses neural nets to price options under rough vol; the inverse problem of predicting surface dynamics uses the same input shape.

SVI fits each expiration slice's total variance w(k) = a + b{ρ(k − m) + sqrt((k − m)² + σ²)}: five parameters per slice, arbitrage-free under explicit constraints.
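For readers who want the slice formula as code, here is a small sketch of raw SVI in NumPy. The function names are mine. The parameter check covers only the basic raw-SVI conditions (b ≥ 0, |ρ| < 1, σ > 0, and a non-negative minimum variance a + bσ·sqrt(1 − ρ²)); these are necessary, not sufficient, for an arbitrage-free slice.

```python
import math
import numpy as np

def svi_total_variance(k, a, b, rho, m, sigma):
    """Raw SVI: w(k) = a + b * (rho * (k - m) + sqrt((k - m)^2 + sigma^2))."""
    k = np.asarray(k, dtype=float)
    return a + b * (rho * (k - m) + np.sqrt((k - m) ** 2 + sigma ** 2))

def svi_params_valid(a, b, rho, m, sigma):
    """Basic parameter checks; necessary but not sufficient for no arbitrage."""
    if not (b >= 0 and abs(rho) < 1 and sigma > 0):
        return False
    # The minimum of w over k is a + b * sigma * sqrt(1 - rho^2).
    return a + b * sigma * math.sqrt(1.0 - rho ** 2) >= 0

def svi_implied_vol(k, t_years, a, b, rho, m, sigma):
    """Implied vol from total variance: sigma_IV = sqrt(w / T)."""
    return np.sqrt(svi_total_variance(k, a, b, rho, m, sigma) / t_years)

# Illustrative parameters only, not a fitted slice.
params = dict(a=0.02, b=0.1, rho=-0.5, m=0.0, sigma=0.2)
atm_vol = float(svi_implied_vol(0.0, 0.25, **params))
```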

Feature engineering, in increasing dimensional efficiency: raw IV grid (most expressive, hardest to train), SVI parameters per slice, or whole-surface SVI parameters as global state (lowest-dimensional latent).

Resolution caveat that bites here. The 50x50 grid evolves at minute resolution because it's driven by quotes. The SVI calibration parameters are EOD-stamped. An intraday at= returns the most recent EOD SVI parameters, not a fresh intraday calibration. If your model forecasts the SVI latent, you have one observation per trading day, not per minute. If you use the raw grid, the intraday tensor is real.

surface = hx.surface("SPY", at="2024-06-14T15:30:00")   # minute-level grid
adv     = hx.adv_volatility("SPY", at="2024-06-14T15:30:00")  # SVI (EOD) + arb checks

What it doesn't solve: sub-second mid-quote prediction. Market makers see flow you don't. Stay at multi-second horizons or longer.

4. Deep hedging and RL hedging

This is where the academic literature is strongest and industrial deployment most mature, inside market-maker and exotic-desk shops. You hold a path-dependent position and want a hedging policy that minimises transaction-cost-aware variance, CVaR, or another risk measure. Analytical delta hedging is provably suboptimal under transaction costs and jumps; neural-network policies trained on simulated and real paths beat it.

The objective: min over policy π of ρ(V_T − C_T), where V_T = V_0 + Σ π_t ΔS_t − Σ TC(π_t) is the terminal value of the hedging portfolio, C_T is the payoff owed on the position being hedged, and ρ is a convex risk measure. References: Buehler, Gonon, Teichmann, Wood, "Deep Hedging" (2018); Kolm and Ritter (2019); Cao, Chen, Hull, Poulos (2021).
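To make the objective concrete, the sketch below evaluates V_T − C_T for a given set of hedge positions and scores the result with empirical CVaR. It does not train a policy; it only computes the quantity a deep-hedging loss would minimise. Several things here are my assumptions rather than anything from the cited papers: C_T is read as the payoff owed at maturity, the transaction cost is proportional to the notional traded at each rebalance, and no cost is charged for unwinding at T.

```python
import numpy as np

def hedged_terminal_pnl(spot, positions, payoff, v0=0.0, cost_rate=0.0):
    """V_T - C_T per path.

    spot:      (n_paths, n_steps + 1) underlying prices
    positions: (n_paths, n_steps) hedge held over each step
    payoff:    (n_paths,) amount owed at maturity (C_T)
    """
    spot = np.asarray(spot, dtype=float)
    positions = np.asarray(positions, dtype=float)
    gains = np.sum(positions * np.diff(spot, axis=1), axis=1)
    # Assumed cost model: proportional to notional traded at each rebalance,
    # starting from a flat position.
    trades = np.diff(positions, axis=1, prepend=0.0)
    costs = cost_rate * np.sum(np.abs(trades) * spot[:, :-1], axis=1)
    return v0 + gains - costs - np.asarray(payoff, dtype=float)

def cvar(losses, alpha=0.95):
    """Empirical CVaR: mean of the worst (1 - alpha) share of losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    k = max(1, int(np.ceil(round((1.0 - alpha) * losses.size, 9))))
    return float(losses[-k:].mean())
```

A policy network would produce `positions` from the chain state at each step; the training loss would then be `cvar(-hedged_terminal_pnl(...))` or another risk measure in its place.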

The data requirement is the hardest in this article: the full option chain at every step of every rollout, with full Greeks, so the policy can choose its hedge.

chain = hx.option_quote("SPY", at="2024-06-14T15:30:00")
# flat array, renamed fields: implied_vol (not iv), open_interest (not oi),
# lastUpdate (camelCase). Historical-only: iv_bid, iv_ask, vanna, charm, rho.
# volume always 0 (use open_interest); svi_vol always null (backtest_mode);
# open_interest EOD-stamped.

The minute-level Greeks, bid/ask, and spot anchor are real and usable; EOD OI and the null intraday smoother are the honest limits. Deep hedging is a hedging technology, not an alpha. PnL comes from being a market maker; deep hedging shrinks the residual variance. If you're a directional trader, this isn't your methodology.

5. Vol-surface anomaly detection (unsupervised)

Mispriced surfaces produce structural arbitrage (butterfly, calendar, sticky-strike vs sticky-delta) and unusual structural shifts that precede directional moves. Both are unsupervised: you learn what normal looks like and flag deviations.

Autoencoder on the surface grid. Reconstruction error is the anomaly score. The 50x50 grid is minute-level historically, so this works at minute resolution.
Quote-vs-fit residual analysis. The deviation between minute-level quotes and the EOD SVI fit is itself a signal; a persistent large deviation often precedes a quote correction or flags a stale fit.

Reference: Ackerer, Tagasovska, Vatter, "Deep Smoothing of the Implied Volatility Surface" (2020). The residuals of any smoother are usable as anomaly signals.
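As a baseline before reaching for a neural autoencoder, the same reconstruction-error idea can be tried with PCA, which is the linear special case. To be clear, this swaps the autoencoder for a linear projection; it is a sanity-check baseline, not the method from the reference. The function names are mine, and the input is assumed to be flattened surface grids with no missing values.

```python
import numpy as np

def fit_linear_reconstructor(surfaces, n_components=3):
    """Fit a PCA basis on flattened surface grids of shape (n_samples, n_points)."""
    x = np.asarray(surfaces, dtype=float)
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(surface, mean, basis):
    """Root-mean-square residual after projecting onto the fitted basis."""
    centred = np.asarray(surface, dtype=float).ravel() - mean
    residual = centred - basis.T @ (basis @ centred)
    return float(np.sqrt(np.mean(residual ** 2)))
```

Fit on a window of "normal" minutes, then score each new grid; a threshold on the error (for example a high percentile of in-sample errors) is a separate choice you have to make and validate.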

adv = hx.adv_volatility("SPY", at="2024-06-14T15:30:00")
# SVI (EOD), variance grid, butterfly/calendar arb flags,
# variance-swap fair values, higher-order Greeks surfaces

What it doesn't solve: anomalies on illiquid strikes are mostly fit noise. Weight the residuals by liquidity (open interest or quote tightness, for example) so thin strikes don't dominate the score.

6. Generative models for vol path augmentation

The 8-year intraday history sounds long until you condition on a joint state: negative dealer gamma, VRP above its 90th percentile, earnings week, VIX between 18 and 22. The sample count drops to single digits. Generative models synthesise realistic-but-novel paths to augment training in those undersampled regions.

Reference: Wiese, Knobloch, Korn, Kretschmer, "Quant GANs" (2020); related approaches include TimeGAN, conditional GANs on surface tensors, and diffusion models on financial series.

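# iter_market_minutes and surface_grid_to_tensor are helpers you supply; they are not SDK calls.
def surface_tensors(symbol="SPY", start="2018-04-16", end="2026-01-01"):
    # A generator: yield is only valid inside a function.
    for ts in iter_market_minutes(start=start, end=end):
        grid = hx.surface(symbol, at=ts.isoformat())
        yield surface_grid_to_tensor(grid)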
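The loop above relies on an `iter_market_minutes` helper that the article does not define. One possible version, using only the standard library, is sketched here. It assumes naive timestamps in exchange local time, a 9:30 to 16:00 regular session, start inclusive and end exclusive, and weekdays only: exchange holidays and half days are not removed, so a real pipeline would filter against a trading calendar.

```python
from datetime import date, datetime, time, timedelta

def iter_market_minutes(start, end, open_time=time(9, 30), close_time=time(16, 0)):
    """Yield a naive datetime for each regular-session minute.

    start and end are ISO dates; start is inclusive, end is exclusive.
    Assumption: weekdays only; holidays and half days are NOT removed.
    """
    day = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while day < last:
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            ts = datetime.combine(day, open_time)
            session_end = datetime.combine(day, close_time)
            while ts < session_end:
                yield ts
                ts += timedelta(minutes=1)
        day += timedelta(days=1)
```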

What it doesn't solve: synthetic paths can't introduce tail behaviour that wasn't in the training set. A model trained on calm regimes will not invent a COVID-style crash. Generative augmentation expands the interior of your distribution, not its tails. Plan stress tests separately.

7. Graph neural nets on option chains

The most speculative section. I include it because the data shape is right, not because industry adoption is strong. Contracts on the same underlying form a natural graph: strikes connect via butterflies, expirations via calendars, related underlyings (SPX/SPY, sector ETFs and constituents) via vol-of-vol relationships. A GNN with these structural priors can handle the whole surface jointly rather than slice by slice.

chain = hx.option_quote("SPY", at="2024-06-14T15:30:00")
# nodes = per-strike contracts; edges via moneyness/expiry proximity;
# messages carry IV and Greek information

Worth reaching for cross-chain mispricing detection (SPX vs SPY drift, sector ETFs vs constituents) or research where the joint surface structure matters. It's a research direction, not a deployed methodology with a known PnL profile. Calibrate expectations accordingly.
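The graph construction itself is the easy part and can be sketched without any GNN library. The version below is one possible reading of "edges via moneyness/expiry proximity": it links neighbouring strikes within an expiry and the same strike across neighbouring expiries. The function name and the input format (a list of expiry and strike pairs) are my assumptions, and node features such as IV and Greeks would be attached separately.

```python
def build_chain_graph(contracts):
    """Build an undirected edge list over (expiry, strike) nodes.

    contracts: iterable of (expiry, strike) pairs for one underlying and one
    option type. Returns (nodes, edges) where edges are index pairs into nodes.
    """
    nodes = sorted(set(contracts))
    index = {node: i for i, node in enumerate(nodes)}
    edges = set()
    expiries = sorted({e for e, _ in nodes})
    # Strike edges: neighbouring strikes within the same expiry.
    for expiry in expiries:
        strikes = sorted(s for e, s in nodes if e == expiry)
        for lo, hi in zip(strikes, strikes[1:]):
            edges.add((index[(expiry, lo)], index[(expiry, hi)]))
    # Calendar edges: the same strike in neighbouring expiries.
    for near, far in zip(expiries, expiries[1:]):
        near_strikes = {s for e, s in nodes if e == near}
        for e, s in nodes:
            if e == far and s in near_strikes:
                edges.add((index[(near, s)], index[(far, s)]))
    return nodes, sorted(edges)
```

The resulting edge list can be converted to whatever tensor format your GNN framework expects.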

8. Causal inference and event studies

The most under-rated angle. Classical finance has decades of event-study methodology; modern ML adds heterogeneous treatment effect estimation (causal forests, X-learners, doubly-robust estimators give per-firm CATE rather than a single average), counterfactual surface construction (synthetic control on the surface tensor), and IV-crush prediction.

A concrete flow: pull the surface around every earnings announcement in the window. Use trading-day offsets, not calendar-day, and align to the actual announcement timestamp (most issuers report before-open or after-close).

from datetime import timedelta
import pandas_market_calendars as mcal

nyse = mcal.get_calendar("XNYS")

def trading_days_before(dt, n):
    # n-th trading day strictly before dt. The window ends the day before dt
    # so dt itself is never counted, and is padded for weekends and holidays.
    sched = nyse.schedule(start_date=dt - timedelta(days=2 * n + 10),
                          end_date=dt - timedelta(days=1))
    return sched.index[-n].to_pydatetime()

def pull_pre_post(symbol, announce_ts, when_announced):
    # announce_ts is assumed to fall on a trading day
    pre_day  = trading_days_before(announce_ts, 5)
    last_day = trading_days_before(announce_ts, 1)
    if when_announced == "before_open":
        # the announcement session itself is already post-event
        post_day = announce_ts
    else:
        # after-close: the first post-event session is the next trading day
        post_day = nyse.valid_days(announce_ts, announce_ts + timedelta(days=5))[1]
    pre  = hx.volatility(symbol, at=f"{pre_day.date()}T15:30:00")
    last = hx.volatility(symbol, at=f"{last_day.date()}T15:30:00")
    post = hx.volatility(symbol, at=f"{post_day.date()}T15:30:00")
    return pre, last, post

Earnings and FOMC dates are well known; the matched surface state isn't, and that's the bottleneck. What it doesn't solve: rare-event causal estimates have wide error bars regardless of ML sophistication. The framework cleans up the analysis; it doesn't manufacture statistical power.
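The point about statistical power is easy to see with a percentile bootstrap on a handful of events. The sketch below is generic resampling, not a causal estimator, and the six IV changes are made-up numbers used only to show the mechanics. The function name and the seed are mine.

```python
import numpy as np

def bootstrap_mean_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of a small sample."""
    x = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one resample of the events, drawn with replacement.
    idx = rng.integers(0, x.size, size=(n_resamples, x.size))
    means = x[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(x.mean()), float(lo), float(hi)

# Made-up relative IV changes (post / last - 1) for six events.
iv_changes = [-0.30, -0.25, -0.40, -0.10, -0.35, -0.20]
mean, lo, hi = bootstrap_mean_ci(iv_changes)
```

With only a few events per conditioning bucket, the interval stays wide no matter which estimator sits on top; more sophisticated models change the point estimate, not the number of independent observations.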

Leakage and point-in-time correctness

The single most common reason ML-on-options papers don't replicate live: training labels built from end-of-day or settlement data that wouldn't have been known at the decision time. The model looks fine in cross-validation and loses money in paper trading. The cause is upstream of the model.

The primitive against this is point-in-time replay. Every historical endpoint accepts at=YYYY-MM-DDTHH:mm:ss and returns the response as it would have been computed at that minute. Feature construction is leak-free by default if you ask for features at t and labels at t+h using at for both.
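Point-in-time features still leak if the train and test sets overlap through the label horizon. A minimal sketch of a chronological split with a purge gap follows; the function name is mine, and the horizon is measured in samples, which assumes evenly spaced observations.

```python
def chronological_split(timestamps, train_frac=0.8, horizon=0):
    """Split sample indices by time.

    The last `horizon` samples before the cut are dropped from training so that
    training labels (observed at t + horizon) do not reach into the test period.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = int(len(order) * train_frac)
    train = order[:max(0, cut - horizon)]
    test = order[cut:]
    return train, test
```

For example, with labels at t + 5 bars you would call `chronological_split(ts, 0.8, horizon=5)`, and the five bars just before the cut are discarded rather than trained on.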

Checklist:

Are your features as-of t or as-of t-1? Decide explicitly and verify.
Are your labels computed using only data timestamped ≤ t + horizon, not back-revised?
Are your regime labels from the same classifier across history, or did methodology drift?
Does your hold-out split respect time (chronological), not random?
Are event dates aligned to the announcement timestamp (before-open vs after-close)?
Are you treating EOD-stamped fields (SVI, OI, macro) as as-of-most-recent-close, not as-of-minute t?
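The last checklist item can be enforced mechanically with an as-of join. The sketch below uses `pandas.merge_asof` to attach to each minute row the most recent EOD row whose close is strictly earlier. This is the conservative reading: I am assuming you want the prior close even when the service returns a same-day stamp. The column names and the sample values in the usage are hypothetical.

```python
import pandas as pd

def attach_eod_fields(minute_df, eod_df, ts_col="ts", close_col="close_ts"):
    """Attach to each minute row the latest EOD row whose close is strictly before it."""
    left = minute_df.sort_values(ts_col)
    right = eod_df.sort_values(close_col)
    return pd.merge_asof(
        left, right,
        left_on=ts_col, right_on=close_col,
        direction="backward",        # only look into the past
        allow_exact_matches=False,   # a row stamped exactly at t is not yet known
    )

# Made-up values to show the alignment.
minute = pd.DataFrame({
    "ts": pd.to_datetime(["2024-06-13 15:59", "2024-06-14 09:30", "2024-06-14 15:30"]),
    "atm_iv": [0.120, 0.125, 0.123],
})
eod = pd.DataFrame({
    "close_ts": pd.to_datetime(["2024-06-12 16:00", "2024-06-13 16:00", "2024-06-14 16:00"]),
    "vix": [12.0, 12.5, 13.0],
})
joined = attach_eod_fields(minute, eod)
```

In the sample, the 15:30 row on June 14 picks up the June 13 close, never the June 14 close that had not happened yet.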

What ML on options won't solve

The credibility section. Every quant ML engineer has met a vendor who claimed everything.

Sub-second mid-quote prediction. Market makers see the flow; you don't. The information asymmetry is structural.
Regime change prediction. Detection works. Change prediction does not, robustly. Detection plus position sizing is the honest play.
Cross-asset macro. Options data tells you about this underlying. Rates, credit, FX, commodities come from elsewhere.
Survivorship bias. Cross-sectional ML on single-name options is biased toward winners that didn't get delisted. Index ETFs largely sidestep it; single-name work needs explicit handling.
Intraday SVI / OI / macro evolution. EOD-stamped here. If your architecture requires minute-level SVI dynamics, this dataset is not it.
"Same response shape" for everything. True for most analytics endpoints; not for optionquote, maxpain, or stock-summary macro objects. Write your client with awareness.

If your project requires any of the above, this stack is part of the answer, not the whole answer.

The substrate, not the strategy

Eight methodologies, grouped by maturity, mapped to verified endpoints. The honest pitch: this is the data substrate that compresses ML wall-clock from quarters to weeks. It does not manufacture alpha; it removes the pipeline tax. Pick the methodology that fits your research question, recognise that almost all require historical replay, and budget accordingly. For a single non-index equity sniff test, the free tier inspects response shapes. For real work you need the historical replay tier.

Originally published on FlashAlpha Research. Free API key, no credit card: flashalpha.com/pricing.
