Eval Integrity: How We Found the Leakage and Why Our Baseline Lied

#ai #machinelearning #tutorial #python

Why this matters for agent developers

If you're building an AI agent that calls Chart Library, you're trusting our historical base rates. If those base rates are inflated by leakage, your agent's sizing, stop placement, and confidence calibration are all downstream of a lie. That's not acceptable for anyone we want as a customer, so we audited ourselves and published what we found.

Short version: our internal baseline for shape-embedding direction accuracy was 51.6%, barely above the 51.2% coin-flip floor. That 0.4-percentage-point lift was measured on a split that let the model find near-duplicates of more than half of the validation queries in the training set. Once we fixed the split, the number fell back to ~51.2%. We never had signal where we thought we did. Now we know.

Bug #1 — Same-symbol cross-split correlation

We split training from validation by date (train: before 2025, val: 2025). The problem: AAPL has a chart-pattern embedding for every trading day. Its embeddings on 2024-12-30 and 2025-01-02 are almost the same vector (same symbol, mostly the same preceding bars).

When we ran k-nearest-neighbor evaluation on val samples, the nearest training neighbor for AAPL 2025-01-02 was AAPL 2024-12-30 — which IS in the training set. The model wasn't finding 'similar historical patterns.' It was finding itself from a few days earlier.

53.6% of validation samples had a same-symbol training neighbor within 20 trading days
Direction-accuracy lift was driven almost entirely by this correlation
Symbol-disjoint splits (hold out tickers entirely, not dates) give honest numbers
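
One way to quantify this kind of overlap is to check, for each validation sample, whether its top-k training neighbors include the same symbol within a few trading days. The following is a minimal brute-force sketch, not the audit itself: the `Sample` type, the `same_symbol_leak_rate` function, and the integer trading-day index are hypothetical names chosen for illustration, and the code assumes cosine similarity and no zero-length embeddings.

```python
import heapq
import math
from typing import NamedTuple, Sequence


class Sample(NamedTuple):
    # Hypothetical record layout, for illustration only.
    symbol: str
    day: int                      # integer trading-day index, not a calendar date
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_symbol_leak_rate(train, val, k=5, max_gap=20):
    """Fraction of validation samples whose top-k training neighbors
    include the same symbol within `max_gap` trading days."""
    leaked = 0
    for query in val:
        # Brute-force k-NN: fine for a small audit sample, too slow at scale.
        neighbors = heapq.nlargest(
            k, train, key=lambda t: cosine(query.embedding, t.embedding)
        )
        if any(
            t.symbol == query.symbol and abs(t.day - query.day) <= max_gap
            for t in neighbors
        ):
            leaked += 1
    return leaked / len(val)
```

A rate well above zero on a date-based split suggests the evaluation is rewarding near-duplicate retrieval rather than pattern similarity. On a symbol-disjoint split the same check returns zero by construction.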

Bug #2 — Forward-return window leakage

The 5-day forward return for a training sample dated 2024-12-30 uses closing prices through 2025-01-06. That's inside the validation window. So the training label itself depends on bars the model supposedly hasn't seen.

This effect is smaller than Bug #1 (24,448 training rows affected, about 3K of them in our 1M-row random sample), but it adds to it. The fix is to purge and embargo a window at every split boundary, sized to the longest forward horizon. We use 10 trading days.
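
The boundary rule can be sketched in a few lines. This is an illustration under stated assumptions, not our pipeline code: `purge_and_embargo` is a hypothetical helper, samples are represented only by an integer trading-day index, `split_day` is the index of the first validation day, and a single `window` is applied on both sides of the boundary.

```python
def purge_and_embargo(train_days, val_days, split_day, window=10):
    """Drop samples too close to a date boundary.

    train_days, val_days: iterables of integer trading-day indices.
    split_day: index of the first validation trading day (assumed convention).
    window: trading days to clear on each side; should be at least the
            longest forward-return horizon used for labels.
    """
    # Purge: a training label at day d looks ahead up to d + window.
    # Keep only rows whose look-ahead ends before validation starts.
    kept_train = [d for d in train_days if d + window < split_day]

    # Embargo: skip the first `window` validation days, whose input bars
    # overlap most heavily with the last training rows.
    kept_val = [d for d in val_days if d >= split_day + window]

    return kept_train, kept_val
```

With `split_day=100` and `window=10`, training keeps days up to 89 and validation starts at day 110, so no training label depends on a bar that any retained validation sample is scored on.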

What we changed

Going forward, every evaluation on Chart Library's embedding quality uses:

Symbol-disjoint splits — 70% of tickers in train, 15% in val, 15% in test. No ticker appears in more than one split.
Purge-embargo of 10 trading days at any remaining date boundary (e.g. walk-forward).
Sample-size reporting on every reported metric, with confidence intervals.
Open publication of the baseline so future model updates have to beat an honest number, not an inflated one.
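
As a rough sketch of the first and third items, the snippet below assigns whole tickers to splits and attaches a confidence interval to an accuracy figure. Both helpers are hypothetical: `split_symbols` is one simple way to get a 70/15/15 ticker split (a seeded shuffle), and `wilson_interval` uses the standard Wilson score interval, which is a choice made for this example rather than a statement of which interval we report.

```python
import math
import random


def split_symbols(symbols, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign each ticker to exactly one of train/val/test."""
    unique = sorted(set(symbols))          # sort first so the seed is reproducible
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    return {
        "train": set(unique[:n_train]),
        "val": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),   # remainder goes to test
    }


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Rows are then routed by looking up their ticker in the returned sets, so every embedding for a given symbol lands in the same split. The interval width shrinks roughly with the square root of the sample size, which is why a fraction-of-a-point lift means little without the sample size next to it.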

What this implies about the product

Pure shape-similarity direction accuracy on a symbol-disjoint, embargoed holdout is at or near 51%, which is essentially a coin flip. This isn't a flaw in our embeddings; it's the actual state of the problem. Predicting 5-day direction from a single chart shape is one of the hardest signal-extraction problems in finance, and it's well documented in the academic literature that pure-price features have a very low information ratio before conditioning on regime, liquidity, and volume.

The leverage is not in the average. It's in the cohort. When you condition on regime bucket + sector + liquidity + event proximity, the conditional distribution of outcomes is materially different from the unconditional one. That's why we're building toward a Conditional Distribution API — one call, filter by context, get back path percentiles with sample size.
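
To make the idea of conditioning concrete, here is a toy sketch of filtering a cohort by context and returning percentiles alongside the sample size. It is not the Conditional Distribution API, which is still being built: the `conditional_distribution` function, the `fwd_return` field, the context keys, and the `min_n` cutoff are all assumptions for illustration, and it summarizes a single forward return rather than full paths.

```python
import statistics


def conditional_distribution(rows, min_n=30, **context):
    """Summarize forward returns for rows matching every context filter.

    rows: iterable of dicts with context keys plus a 'fwd_return' value
          (hypothetical schema).
    min_n: below this cohort size, report n but withhold percentiles.
    """
    cohort = [
        r["fwd_return"]
        for r in rows
        if all(r.get(key) == value for key, value in context.items())
    ]
    n = len(cohort)
    if n < min_n:
        return {"n": n, "percentiles": None}

    # quantiles(n=10) returns 9 cut points: the 10th, 20th, ..., 90th percentiles.
    cuts = statistics.quantiles(cohort, n=10, method="inclusive")
    return {"n": n, "percentiles": {"p10": cuts[0], "p50": cuts[4], "p90": cuts[8]}}
```

A call such as `conditional_distribution(rows, regime="bull", sector="tech")` returns the cohort size with its percentiles, or only the size when the cohort is too thin to summarize. Returning `n` every time lets the caller decide how much weight the numbers deserve.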

INFO: If you're evaluating a historical-pattern vendor and their baseline looks too good, ask them: (1) how the splits are constructed, (2) what the embargo window is, and (3) how same-symbol overlap is handled. If they can't answer, the numbers are probably lying.

Our ongoing commitment

We'll keep publishing what we find, including when our own numbers go the wrong direction. Agent developers should be able to trust the calibration of any base rate we expose. That means honest audits, honest baselines, and honest docs — even when the honest answer is less impressive than the marketing one.

Originally published at chartlibrary.io. Chart Library is the stock-market memory for AI agents — free Sandbox tier at chartlibrary.io/developers.