The `epsActual` That Wasn't: 15% of an LLM Backtest's Trades Were Decided on Data That Didn't Exist Yet

#python #machinelearning #finance #datascience

We were backtesting an LLM-driven earnings signal against a field called epsActual — the kind of field everyone treats as ground truth. It isn't.

About 41.4% of those "actual" values were different from what the vendor had first reported. About 15.3% differed enough to flip a tradeable decision. When we re-ran the backtest using only the values that actually existed at each decision date, the strategy kept ~73% of its returns and ~82% of its Sharpe. The rest was look-ahead bias — and it rode in through a field whose name promised it was final.

This is a writeup of how we found it, how we measured it honestly, and the one-line invariant that turns it from a silent inflation into a loud test failure.

The setup

The signal is a post-earnings drift play: at each earnings print, an LLM scores the release and we take a position. To backtest it you replay history — for every past print, reconstruct what the model would have decided, then check what happened next.

That reconstruction needs one obviously-trustworthy input: what the earnings number actually was. Our vendor exposes exactly that, in a field named epsActual. "Actual." Final. Settled. You query a print from two years ago and get a number back. What could go wrong?

The invisible killer

Vendor "actuals" are not frozen at print time. They get backfilled, corrected, and restated — sometimes the next day, sometimes months later. Restatements, late filings, parser fixes, standardization passes: all of them quietly rewrite history. The value you query today for a 2023 print is not, in general, the value that was available the day after that print.

This is textbook look-ahead bias, and it's especially dangerous here because it doesn't look like leakage. Nobody fed the model future data on purpose. It rode in on a field everyone trusts — and "actual" is about the most trustworthy-sounding name a field can have. A backtest built on today's epsActual is quietly asking the model to react to numbers that, on the decision date, did not yet exist.

How we measured it honestly

You can't detect this from a single snapshot of the database: by definition, the revision has already overwritten the original. So we built a forward-polling harness that polls the vendor on a schedule, snapshots every value we care about, and watches for changes over time. It accumulated ~1,400 snapshots in its first day of polling.
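A minimal sketch of what such a harness can look like, not our production code. The names `SnapshotLog`, `poll_once`, and the `fetch` callable are hypothetical, and the `(ticker, fiscal_period)` key is an assumption about how a print is identified. The point it illustrates is that the log is append-only and stamps each row with the time we observed the value, not the vendor's timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, str]  # hypothetical identifier for one print: (ticker, fiscal_period)


@dataclass(frozen=True)
class Snapshot:
    key: Key
    observed_at: datetime  # when WE saw the value, not the vendor's lastUpdated
    values: tuple          # tracked fields, in a fixed order


class SnapshotLog:
    """Append-only log; a row is stored only when the tracked values change."""

    def __init__(self, tracked_fields: List[str]):
        self.tracked_fields = tracked_fields
        self.history: Dict[Key, List[Snapshot]] = {}

    def record(self, key: Key, payload: dict, observed_at: datetime) -> bool:
        """Store the payload if it is new or differs from the last row. True if stored."""
        values = tuple(payload.get(f) for f in self.tracked_fields)
        rows = self.history.setdefault(key, [])
        if rows and rows[-1].values == values:
            return False  # nothing changed since the last poll
        rows.append(Snapshot(key, observed_at, values))
        return True


def poll_once(
    log: SnapshotLog,
    keys: Iterable[Key],
    fetch: Callable[[Key], dict],  # assumption: returns the vendor payload as a dict
    now: Optional[datetime] = None,
) -> int:
    """Poll every key once and return how many new or changed rows were stored."""
    now = now or datetime.now(timezone.utc)
    return sum(log.record(key, fetch(key), now) for key in keys)
```

The first row stored for a key is its first-seen value; every later row is a revision, whether or not the vendor's metadata admitted to one.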

The decision that mattered most:

Detect revisions by the value itself, not by the vendor's lastUpdated timestamp.

lastUpdated can't be trusted for this: it doesn't always change on silent backfills, and relying on it would have hidden exactly the revisions we were hunting. So change detection keys on the value tuple: if any tracked field changes between two snapshots, that's a revision, regardless of what the metadata claims.

# Revision = the tracked value-tuple changed between snapshots,
# NOT "the vendor bumped lastUpdated".
def is_revision(prev_snapshot, curr_snapshot, tracked_fields):
    prev = tuple(prev_snapshot[f] for f in tracked_fields)
    curr = tuple(curr_snapshot[f] for f in tracked_fields)
    return prev != curr

To quantify the trading impact, we compared two backtests over a four-month point-in-time window: a naive one using today's revised epsActual, and an as-of one using, for each print, only the value we had observed on or before the decision date.
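The as-of leg comes down to one lookup: given the observation history for a print, return the value that was known at the decision date. A minimal sketch, assuming the history is a list of `(observed_at, value)` pairs sorted by time; `value_as_of` and `Observation` are illustrative names, not part of any library.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

Observation = Tuple[datetime, float]  # (observed_at, epsActual as seen at that moment)


def value_as_of(history: List[Observation], decision_date: datetime) -> Optional[float]:
    """Return the latest value observed on or before decision_date.

    Returns None when nothing had been observed yet, which is the honest
    answer: the backtest must skip the print rather than borrow a later value.
    """
    times = [observed_at for observed_at, _ in history]  # must be sorted ascending
    i = bisect_right(times, decision_date)  # number of observations at or before the date
    return history[i - 1][1] if i else None
```

The naive leg is the same lookup with `decision_date` set to today, which is exactly why it sees every revision.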

What we found

41.4% of epsActual values (896/2163) differed between first-seen and final.
15.3% of cases (332/2163) differed enough to flip a tradeable decision — a sign change or a threshold crossing in the signal.
Over the four-month window, the as-of backtest retained ~73% of the naive backtest's returns and ~82% of its Sharpe. (The naive leg, built on final values, keeps drifting as the vendor keeps revising, so treat the ratios as more stable than the levels.)
Read inversely: roughly a quarter of the headline returns, and a fifth of the Sharpe, were look-ahead artifacts.
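To make "flip a tradeable decision" concrete, here is a simplified sketch. It assumes the signal is a single score and the position rule is a symmetric threshold; both are illustrative assumptions, and the `position` and `is_decision_flip` names are hypothetical, not our production rule.

```python
def position(signal: float, threshold: float) -> int:
    """Map a signal score to a position: +1 long, -1 short, 0 no trade."""
    if signal >= threshold:
        return 1
    if signal <= -threshold:
        return -1
    return 0


def is_decision_flip(signal_first_seen: float, signal_final: float, threshold: float) -> bool:
    """True when first-seen and final data lead to different positions.

    This covers both a sign change (long versus short) and a threshold
    crossing (trade versus no trade).
    """
    return position(signal_first_seen, threshold) != position(signal_final, threshold)
```

Under this rule, a value can be revised without flipping anything: a revision that moves the signal from 0.30 to 0.40 against a 0.25 threshold still ends up long. That is why the flip rate (15.3%) is well below the revision rate (41.4%).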

The encouraging half: most of the strategy survives honest data. The sobering half: a naive backtest overstated it by a wide margin, and a meaningful fraction of "winning" trades were decided on numbers that did not exist at decision time. A 15% decision-flip rate is not noise you can wave away.

Why this is structural, not a one-off

The natural reaction is "okay, we'll be careful with that field." That doesn't hold. The risk is reintroduced by every new feature, every new vendor, every rerun, every teammate who reaches for "the actual value." Carefulness is a property of a person on a good day; as-of correctness has to be a property of the pipeline.

So treat the question "could this value have been known at the decision time we're simulating?" as an invariant the code enforces and CI checks. A vendor "actual" is time-versioned reference data: it becomes valid only at the instant you first observe it. Use it to decide before that instant and you're using a value from the future.

That's exactly what the look-ahead invariant below checks — it requires valid_from <= feature_as_of:

from traceguard.validators.lookahead import validate_reference_timing

# The eps "actual" is time-versioned reference data: valid_from is when this
# specific value first existed (first-seen in our snapshots), feature_as_of is
# the decision moment we are simulating.
validate_reference_timing(
    valid_from=eps_first_seen,    # when this value actually existed
    feature_as_of=decision_date,  # the moment we're simulating
    kind="vendor_eps_actual",
)  # raises InvariantViolation if eps_first_seen > decision_date

When a value is used before its availability timestamp, the run fails loudly rather than silently inflating a Sharpe ratio.
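If you want to see the shape of the check without installing anything, here is a dependency-free stand-in. It is a sketch of the same `valid_from <= feature_as_of` rule, not traceguard's implementation; `LookAheadError`, `assert_known_at`, and `check_backtest_inputs` are hypothetical names.

```python
from datetime import datetime
from typing import Iterable, Tuple


class LookAheadError(AssertionError):
    """Raised when a value is used before it was first observed."""


def assert_known_at(valid_from: datetime, feature_as_of: datetime, kind: str) -> None:
    """Enforce valid_from <= feature_as_of for one input value."""
    if valid_from > feature_as_of:
        raise LookAheadError(
            f"{kind}: value first seen {valid_from.isoformat()} "
            f"but used at {feature_as_of.isoformat()}"
        )


def check_backtest_inputs(rows: Iterable[Tuple[str, datetime, datetime]]) -> int:
    """Check every (kind, valid_from, feature_as_of) row; return how many passed.

    The first violation raises, so a CI job that calls this over the
    backtest's inputs goes red instead of producing a flattering number.
    """
    checked = 0
    for kind, valid_from, feature_as_of in rows:
        assert_known_at(valid_from, feature_as_of, kind)
        checked += 1
    return checked
```

Subclassing `AssertionError` is a design choice for this sketch: it makes a violation read as a failed assertion in most test runners.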

Two kinds of look-ahead — don't conflate them

It's worth being precise about scope. There are two distinct kinds of look-ahead bias in LLM pipelines:

Training contamination — the model itself was pre-trained on the future you're predicting, so it "recalls" rather than reasons. That's a separate research problem (membership-inference tests, point-in-time LLMs, claim-level temporal verification), and it needs different tooling.
Harness / pipeline leakage — your code uses a value, prompt, or model that didn't exist at the simulated time. This story is entirely about this kind, and it's the kind a pipeline can be made to refuse structurally.

Both matter. They are not the same problem, and conflating them is how teams "fix" one and ship the other.

A checklist you can apply today

Treat every actual / final / reported vendor field as a moving target until you've proven otherwise with your own snapshots.
Detect revisions by value, not by the vendor's update timestamp.
Backtest on as-of (first-seen) data, and explicitly measure the gap against revised data. That gap is your look-ahead tax — quantify it instead of assuming it's zero.
Encode "known at decision time?" as a CI invariant, so the failure mode is a red test, not a flattering backtest.
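Measuring the look-ahead tax can be as small as the sketch below. It assumes you already have per-period returns for both legs, sums them as simple (not compounded) returns, and uses a per-period Sharpe with a zero risk-free rate; `sharpe` and `retention` are illustrative names, and the sample numbers in the usage are made up.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def sharpe(returns: Sequence[float]) -> float:
    """Per-period Sharpe ratio with a zero risk-free rate (not annualized)."""
    return mean(returns) / stdev(returns)


def retention(naive: Sequence[float], as_of: Sequence[float]) -> Dict[str, float]:
    """Fraction of the naive backtest's return and Sharpe that survives as-of data."""
    return {
        "return_retained": sum(as_of) / sum(naive),
        "sharpe_retained": sharpe(as_of) / sharpe(naive),
    }


# Usage with made-up per-period returns for the two legs.
naive_returns = [0.02, -0.01, 0.03, 0.01]
as_of_returns = [0.01, -0.01, 0.02, 0.01]
result = retention(naive_returns, as_of_returns)
tax = {name: 1.0 - value for name, value in result.items()}  # the look-ahead tax
```

With the ratios reported above, retained fractions of about 0.73 and 0.82 correspond to a tax of about 27% of returns and 18% of Sharpe.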

Limitations

One vendor, one field, a four-month window. The exact percentages are dataset-specific and should not be read as universal constants — your numbers will differ. And again: this addresses harness leakage only, not whether the model itself has seen the future.

The validators and point-in-time instrumentation here are part of traceguard — an open-source Python library for point-in-time-correct LLM instrumentation: a model registry that refuses anachronistic picks, a git-tracked prompt registry, canonical input hashing, and look-ahead invariants you call in CI. It's not a dashboard — it exports OpenTelemetry spans into Langfuse / Phoenix, so it sits underneath your observability stack and keeps the timeline honest.

pip install traceguard

If you've been burned by a backtest that looked great and meant nothing, I'd genuinely like to hear how it happened — that's the failure mode this is built to catch.