DEV Community

tomasz dobrowolski

Posted on • Originally published at flashalpha.com

Look-Ahead Bias in Volatility Backtests — Why Most VRP Percentiles Silently Cheat (and How to Fix It)

Ask any quant what kills backtests and you'll get the same short list: survivorship bias, execution costs, overfitting. The one that gets mentioned less often — and is probably the most common in amateur-to-intermediate volatility research — is percentile leakage. It's a specific, boring, easy-to-miss form of look-ahead bias that makes dead strategies look alive, and it's baked into almost every "VRP percentile" or "IV rank" column you'll find in public datasets.


What Is Percentile Leakage in Options Backtesting?

Consider a feature you've seen a hundred times: VRP 20-day percentile. Take the current 20-day VRP, compare it to the distribution of historical VRPs, express as a percentile (0 to 100). High percentile = rich vol premium, time to sell. Low percentile = compressed, stand down.

Now, how did you compute the distribution?

The wrong way: grab all historical VRP values in your dataset (say 2018–2026), sort them, and on any given date t, compute the percentile of VRP[t] against that full sorted list.

That's leakage. In 2019, your "percentile" is scored against a distribution that includes March 2020, August 2024, and every other extreme event that hadn't happened yet. The 2019 percentile a real trader would have seen — using only data before 2019 — is different, often very different.

When you bet on that cheated percentile in your backtest, you are using information that did not exist.
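In pandas terms, the leaky version is a one-liner, which is exactly why it's so common. A minimal sketch on a toy VRP series (synthetic data standing in for a real feed):

```python
import numpy as np
import pandas as pd

# Toy daily VRP series standing in for real data
rng = np.random.default_rng(0)
dates = pd.bdate_range("2018-01-01", "2026-01-01")
vrp = pd.Series(rng.normal(2.0, 3.0, len(dates)), index=dates)

# LEAKY: every date is ranked against the FULL history,
# including observations that lie in its own future.
leaky_pct = vrp.rank(pct=True) * 100

# A 2019 date here is scored against 2020-2026 data
# that no trader in 2019 could have seen.
print(leaky_pct["2019-06-14"])
```

The convenience of `rank(pct=True)` is the trap: it silently ranks every row against the whole sample.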


How to Compute Walk-Forward Percentiles Without Look-Ahead Bias

At each date t, compute the percentile of VRP[t] against the distribution of VRP values from strictly before t. That's a walk-forward (or expanding-window) percentile. Every sample is scored against the knowledge available at its own moment, and nothing later.

Straightforward to describe. A minor nuisance to implement correctly:

  1. Store every historical VRP observation with its date.
  2. On each query, filter to observations dated strictly before the query date.
  3. Compute percentile against that filtered set.
  4. Do this efficiently — the naive implementation is O(n) per query across thousands of queries.

Most custom percentile code gets step 2 subtly wrong (off-by-one on the date filter), step 4 badly wrong (recomputing the full distribution per query), or both. A correct implementation shipped behind an API removes the excuse.
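One way to implement steps 1–4 without recomputing the distribution per query is to keep a running sorted list and binary-search it. This is a sketch of the technique, not FlashAlpha's actual implementation; note that `bisect.insort` on a Python list is still O(n) per insert, so very long histories would want a tree or array-based structure instead:

```python
import bisect

import numpy as np
import pandas as pd

def walk_forward_percentile(series: pd.Series) -> pd.Series:
    """Percentile of each value against STRICTLY earlier observations.

    The first observation has no history, so it comes back as NaN.
    """
    history: list[float] = []          # sorted past observations
    out = np.full(len(series), np.nan)
    for i, x in enumerate(series.to_numpy()):
        if history:
            # bisect_left counts values strictly below x: no off-by-one
            below = bisect.bisect_left(history, x)
            out[i] = 100.0 * below / len(history)
        bisect.insort(history, x)      # x is visible only to FUTURE queries
    return pd.Series(out, index=series.index)

s = pd.Series([1.0, 3.0, 2.0, 5.0])
print(walk_forward_percentile(s).tolist())  # → [nan, 100.0, 50.0, 100.0]
```

The strict inequality from step 2 lives in `bisect_left` (ties don't count as "below") and in inserting x only after scoring it.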


How FlashAlpha's Historical VRP API Avoids Look-Ahead Bias

FlashAlpha's Historical VRP endpoint is explicit about this:

VRP percentile and z-score are computed from snapshot rows with date strictly less than the query date, so at any historical point the percentile reflects what was knowable at that moment.

The filter is a SQL predicate in the query itself — not a convention, not a best-effort, not something you can accidentally bypass.

Ask for /v1/vrp/SPY?at=2022-06-14T15:30:00 and the percentile is computed against snapshot data strictly before that timestamp. Ask for the same symbol at a later date and the percentile changes, because the available history has grown.

curl -H "X-Api-Key: YOUR_API_KEY" \
  "https://historical.flashalpha.com/v1/vrp/SPY?at=2022-06-14T15:30:00"
{
  "symbol": "SPY",
  "vrp": {
    "vrp_20d": 8.11,
    "z_score": 2.84,
    "percentile": 100,
    "history_days": 60
  }
}

That percentile: 100 means this VRP reading is above every observation in the preceding 60-day window (the history_days: 60 in the response): computed honestly, as a trader at 15:30 ET on June 14, 2022 would actually have seen it, not as a backtest writer looking backwards from 2026 would retroactively assign it.


How Much Does Look-Ahead Bias Inflate Your Sharpe Ratio?

The magnitude of the leakage depends on where in the history you're testing. A backtest across a 2018–2026 window using full-sample percentiles will:

  • Under-assign extreme percentiles in early history (the extremes that hadn't yet happened are in the denominator).
  • Over-assign extreme percentiles in late history (mirror image).
  • Inflate Sharpe ratios for any strategy that triggers on percentile thresholds, because the threshold breaches are non-random in time.

In the VRP case specifically, a short-strangle strategy gated on "VRP percentile > 80" will typically show 15–30% higher Sharpe on full-sample percentiles than on walk-forward percentiles over a 5+ year window. That's enough to turn a live-unviable strategy into a slide-deck-impressive one. It's also exactly the kind of edge that evaporates the moment you try to trade it, because the live version doesn't get to cheat.
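You can watch the mechanism happen on synthetic data: score the same series both ways and compare which days breach the threshold. This is a toy illustration of the divergence, not a calibrated reproduction of the 15–30% figure:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.bdate_range("2018-01-01", "2026-01-01")
# Toy VRP: noisy base level plus a fat-tailed regime shift late in sample
vrp = pd.Series(rng.normal(2.0, 1.5, len(dates)), index=dates)
vrp.iloc[-400:] += rng.exponential(2.0, 400)

full_pct = vrp.rank(pct=True) * 100            # leaky: full-sample ranking

def wf_pct(s: pd.Series) -> pd.Series:
    # expanding-window percentile against strictly earlier values
    vals = s.to_numpy()
    out = np.full(len(vals), np.nan)
    for i in range(1, len(vals)):
        out[i] = 100.0 * np.mean(vals[:i] < vals[i])
    return pd.Series(out, index=s.index)

honest_pct = wf_pct(vrp)

full_triggers = full_pct > 80                  # what the backtest "sees"
honest_triggers = honest_pct > 80              # what a live trader saw
print("full-sample trigger days:  ", int(full_triggers.sum()))
print("walk-forward trigger days: ", int(honest_triggers.sum()))
print("days where they disagree:  ", int((full_triggers != honest_triggers).sum()))
```

The disagreement days are non-random in time, which is exactly why a threshold-gated strategy's Sharpe shifts when you fix the percentile.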


Python: Walk-Forward VRP Backtest Using the Historical API

Here's a practical research pattern: pull VRP daily across your test window, and at each date, record the percentile the API returns. That percentile is already walk-forward.

import httpx
import pandas as pd
from tqdm import tqdm

API_KEY = "..."
BASE = "https://historical.flashalpha.com"
dates = pd.bdate_range("2020-01-01", "2025-12-31")

rows = []
with httpx.Client(headers={"X-Api-Key": API_KEY}, timeout=30) as c:
    for d in tqdm(dates):
        r = c.get(f"{BASE}/v1/vrp/SPY", params={"at": d.strftime("%Y-%m-%d")})
        if r.status_code != 200:
            continue  # holidays / dates with no snapshot
        j = r.json()["vrp"]
        rows.append({
            "date": d,
            "vrp_20d": j["vrp_20d"],
            "vrp_pct": j["percentile"],
            "vrp_z": j["z_score"],
            # not every response includes these fields (the sample above
            # omits them), so guard with .get() instead of hard-keying
            "atm_iv": j.get("atm_iv"),
            "rv_20d": j.get("rv_20d"),
        })

vrp = pd.DataFrame(rows).set_index("date")
# vrp["vrp_pct"] is walk-forward by construction — no leakage
entries = vrp[vrp["vrp_pct"] > 80]

That entries dataframe is the set of honest trigger days for a "short strangle when VRP percentile > 80" study. Join forward returns (next-day or next-20-day underlying path / straddle P&L), measure edge. If it works with the walk-forward percentile, it has a chance of working live. If it only works with a full-sample percentile, you're looking at a statistical artifact.
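Joining forward outcomes onto those trigger days might look like the sketch below. `spot` is assumed to be a date-indexed close-price Series you've pulled separately, and `forward_edge` is a hypothetical helper name; this measures realized movement after entry, not strangle P&L:

```python
import numpy as np
import pandas as pd

def forward_edge(vrp: pd.DataFrame, spot: pd.Series, horizon: int = 20,
                 threshold: float = 80.0) -> pd.DataFrame:
    """Mean forward absolute move on trigger days vs. all days.

    A short-vol seller wants realized movement AFTER entry to be small,
    so compare |forward return| conditional on the percentile gate.
    """
    fwd_ret = spot.shift(-horizon) / spot - 1.0   # next-`horizon`-day return
    df = vrp.join(fwd_ret.rename("fwd_ret"), how="inner")
    df["abs_move"] = df["fwd_ret"].abs()
    triggered = df[df["vrp_pct"] > threshold]
    return pd.DataFrame({
        "n_days": [len(df), len(triggered)],
        "mean_abs_move": [df["abs_move"].mean(), triggered["abs_move"].mean()],
    }, index=["all", "triggered"])

# Toy data standing in for the API pull and a real price series
idx = pd.bdate_range("2020-01-01", "2021-01-01")
rng = np.random.default_rng(1)
vrp_df = pd.DataFrame({"vrp_pct": rng.uniform(0, 100, len(idx))}, index=idx)
spot = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))),
                 index=idx)
print(forward_edge(vrp_df, spot))
```

On the toy random data the conditional and unconditional moves should look similar; on real data, a genuine edge shows up as the "triggered" row diverging.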


Other Options Backtesting Features That Leak the Future

Percentiles are the most common case, but the same class of look-ahead bias hides in:

  • Rolling z-scores with expanding-window means and standard deviations. Same walk-forward fix required. FlashAlpha's z_score is computed from the same date-bounded snapshot set as the percentile.
  • Regime labels ("normal" vs "elevated") derived from thresholds on historical distributions. If the thresholds come from the full dataset, early samples get labelled against knowledge that didn't exist.
  • Volatility-of-volatility features. Any derived statistic that rolls over the entire history is a candidate for leakage.
  • Cross-asset signals that blend SPY and another underlying's history. The leakage compounds across instruments.

For anything you compute yourself on top of the Historical API's raw outputs, the rule is the same: at time t, use only data with timestamps strictly less than t. For percentile and z-score on VRP specifically, the endpoint handles it for you.
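For the z-score case, the same discipline is two method calls in pandas: expanding statistics shifted by one row, so date t's mean and standard deviation use only earlier data. A sketch of the rule, not the endpoint's exact formula:

```python
import numpy as np
import pandas as pd

def walk_forward_zscore(s: pd.Series) -> pd.Series:
    """Z-score of s[t] against observations strictly before t."""
    mu = s.expanding().mean().shift(1)       # mean of s[:t], excludes s[t]
    sd = s.expanding().std(ddof=1).shift(1)  # ditto for the std
    return (s - mu) / sd

s = pd.Series([1.0, 2.0, 3.0, 10.0])
print(walk_forward_zscore(s).round(3).tolist())  # → [nan, nan, 2.121, 8.0]
```

The `.shift(1)` is the whole fix: drop it and every z-score includes its own observation plus, with full-sample stats, everything after it.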


Why Look-Ahead Bias Is Worse for Machine Learning Models

For ML workflows, leakage is a larger problem because your model learns whatever signal exists — including whatever cheating signal the features leak. If VRP percentile is a feature and it's computed against the full dataset, the model will learn that "high percentile" is more informative in early history than it actually was, because the labelling itself encodes future information. Gradient boosters in particular are excellent at exploiting these micro-leaks.

Using the Historical API's walk-forward percentile as a feature removes that failure mode for VRP. Apply the same discipline to your other derived features, and your cross-validated metrics start matching your paper-trading metrics — which is the whole point.


The Real Point: Make the Right Thing Frictionless

Calculator correctness gets a lot of ink in quant writing. Leakage discipline gets much less. But leakage is where most real-world research gets quietly ruined — not in the calculator, but in the feature pipeline that feeds it.

The reason FlashAlpha's Historical VRP is notable isn't that percentiles are hard to compute. It's that shipping a historical API where the percentiles are honest by default means every user starts from a correct baseline. Research quality is defined by the lowest-friction option; make the right thing frictionless and most people will do the right thing.


