<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Li Zhuojun</title>
    <description>The latest articles on DEV Community by Li Zhuojun (@lizhuojunx86).</description>
    <link>https://dev.to/lizhuojunx86</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3990737%2Fecf68c91-8302-4910-8f26-ae04f9489100.jpg</url>
      <title>DEV Community: Li Zhuojun</title>
      <link>https://dev.to/lizhuojunx86</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lizhuojunx86"/>
    <language>en</language>
    <item>
      <title>The `epsActual` That Wasn't: 15% of an LLM Backtest's Trades Were Decided on Data That Didn't Exist Yet</title>
      <dc:creator>Li Zhuojun</dc:creator>
      <pubDate>Thu, 18 Jun 2026 11:16:59 +0000</pubDate>
      <link>https://dev.to/lizhuojunx86/the-epsactual-that-wasnt-15-of-an-llm-backtests-trades-were-decided-on-data-that-didnt-exist-17k3</link>
      <guid>https://dev.to/lizhuojunx86/the-epsactual-that-wasnt-15-of-an-llm-backtests-trades-were-decided-on-data-that-didnt-exist-17k3</guid>
      <description>&lt;p&gt;We were backtesting an LLM-driven earnings signal against a field called &lt;code&gt;epsActual&lt;/code&gt; — the kind of field everyone treats as ground truth. It isn't.&lt;/p&gt;

&lt;p&gt;About &lt;strong&gt;41.4%&lt;/strong&gt; of those "actual" values were &lt;em&gt;different&lt;/em&gt; from what the vendor had first reported. About &lt;strong&gt;15.3%&lt;/strong&gt; differed enough to flip a tradeable decision. When we re-ran the backtest using only the values that actually existed at each decision date, the strategy kept ~&lt;strong&gt;73%&lt;/strong&gt; of its returns and ~&lt;strong&gt;82%&lt;/strong&gt; of its Sharpe. The rest was look-ahead bias — and it rode in through a field whose name promised it was final.&lt;/p&gt;

&lt;p&gt;This is a writeup of how we found it, how we measured it honestly, and the one-line invariant that turns it from a silent inflation into a loud test failure.&lt;/p&gt;

&lt;h2&gt;The setup&lt;/h2&gt;

&lt;p&gt;The signal is a post-earnings drift play: at each earnings print, an LLM scores the release and we take a position. To backtest it you replay history — for every past print, reconstruct what the model &lt;em&gt;would&lt;/em&gt; have decided, then check what happened next.&lt;/p&gt;

&lt;p&gt;That reconstruction needs one obviously-trustworthy input: what the earnings number actually &lt;em&gt;was&lt;/em&gt;. Our vendor exposes exactly that, in a field named &lt;code&gt;epsActual&lt;/code&gt;. "Actual." Final. Settled. You query a print from two years ago and get a number back. What could go wrong?&lt;/p&gt;

&lt;h2&gt;The invisible killer&lt;/h2&gt;

&lt;p&gt;Vendor "actuals" are not frozen at print time. They get backfilled, corrected, and restated — sometimes the next day, sometimes months later. Restatements, late filings, parser fixes, standardization passes: all of them quietly rewrite history. &lt;strong&gt;The value you query today for a 2023 print is not, in general, the value that was available the day after that print.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is textbook look-ahead bias, and it's especially dangerous here because it doesn't &lt;em&gt;look&lt;/em&gt; like leakage. Nobody fed the model future data on purpose. It rode in on a field everyone trusts — and "actual" is about the most trustworthy-sounding name a field can have. A backtest built on today's &lt;code&gt;epsActual&lt;/code&gt; is quietly asking the model to react to numbers that, on the decision date, did not yet exist.&lt;/p&gt;

&lt;h2&gt;How we measured it honestly&lt;/h2&gt;

&lt;p&gt;You can't detect this from a single snapshot of the database — by definition the revision has already overwritten the original. So we built a &lt;strong&gt;forward-polling harness&lt;/strong&gt;: poll the vendor on a schedule, snapshot every value we care about, and watch for changes over time. It had accumulated ~&lt;strong&gt;1,400 snapshots&lt;/strong&gt; in the first day of polling.&lt;/p&gt;

&lt;p&gt;The decision that mattered most:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Detect revisions by the value itself, not by the vendor's &lt;code&gt;lastUpdated&lt;/code&gt; timestamp.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;lastUpdated&lt;/code&gt; is not a trustworthy change signal: it is not always bumped on silent backfills, and relying on it would have hidden exactly the revisions we were hunting. So change detection keys on the &lt;strong&gt;value-tuple&lt;/strong&gt;: if any tracked field changes between two snapshots, that's a revision, regardless of what the metadata claims.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Revision = the tracked value-tuple changed between snapshots,
# NOT "the vendor bumped lastUpdated".
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_revision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev_snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;curr_snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracked_fields&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev_snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tracked_fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;curr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curr_snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tracked_fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;curr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
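
&lt;p&gt;A minimal sketch of how value-based detection can feed a first-seen log. It is an illustration rather than the harness itself: &lt;code&gt;SnapshotLog&lt;/code&gt;, &lt;code&gt;observe&lt;/code&gt;, the ticker, and the dates are hypothetical names and made-up data. Each poll is stored only when the tracked value-tuple differs from the last stored one, so every stored row carries the time that value was first observed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone


class SnapshotLog:
    """Append-only log of polled values, keyed by print (hypothetical helper)."""

    def __init__(self, tracked_fields):
        self.tracked_fields = tuple(tracked_fields)
        # print_id -> list of (first_seen, value_tuple), oldest first
        self.versions = {}

    def observe(self, print_id, payload, seen_at):
        """Record one poll result; return True if it is a new version."""
        values = tuple(payload[f] for f in self.tracked_fields)
        history = self.versions.setdefault(print_id, [])
        if history and history[-1][1] == values:
            return False  # same value-tuple as last time: nothing changed
        history.append((seen_at, values))
        return True


log = SnapshotLog(tracked_fields=["epsActual"])
t0 = datetime(2024, 1, 10, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 11, tzinfo=timezone.utc)
t2 = datetime(2024, 3, 1, tzinfo=timezone.utc)

# lastUpdated is deliberately ignored: it is not in tracked_fields.
first = log.observe("ACME-2023Q4", {"epsActual": 1.02, "lastUpdated": "a"}, t0)
repeat = log.observe("ACME-2023Q4", {"epsActual": 1.02, "lastUpdated": "b"}, t1)
revised = log.observe("ACME-2023Q4", {"epsActual": 0.97, "lastUpdated": "b"}, t2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here the second poll is dropped even though &lt;code&gt;lastUpdated&lt;/code&gt; changed, and the third is kept even though it did not.&lt;/p&gt;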



&lt;p&gt;To quantify the &lt;em&gt;trading&lt;/em&gt; impact, we compared two backtests over the same four-month point-in-time window: a &lt;strong&gt;naive&lt;/strong&gt; one using today's revised &lt;code&gt;epsActual&lt;/code&gt;, and an &lt;strong&gt;as-of&lt;/strong&gt; one that, for each print, uses only the value first seen on or before the decision date.&lt;/p&gt;
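
&lt;p&gt;The difference between the two legs comes down to which version of a value the lookup returns. A rough sketch, assuming each print's history is a list of (first-seen date, value) pairs sorted by date; &lt;code&gt;value_as_of&lt;/code&gt;, &lt;code&gt;value_naive&lt;/code&gt;, and the sample numbers are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from bisect import bisect_right
from datetime import date


def value_as_of(history, decision_date):
    """Latest value whose first-seen date is on or before decision_date.

    history is a list of (first_seen, value) pairs sorted by first_seen.
    Returns None when nothing had been observed yet.
    """
    first_seen_dates = [seen for seen, _ in history]
    idx = bisect_right(first_seen_dates, decision_date)
    if idx == 0:
        return None
    return history[idx - 1][1]


def value_naive(history):
    """What a query made today returns: the most recent revision."""
    return history[-1][1]


# Made-up history for one print: first reported 1.02, later revised to 0.97.
history = [(date(2024, 1, 10), 1.02), (date(2024, 3, 1), 0.97)]
decision_date = date(2024, 1, 11)

as_of_eps = value_as_of(history, decision_date)  # the value known back then
naive_eps = value_naive(history)                 # the value from the future
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; when nothing had been observed yet is a design choice in this sketch: it forces the caller to decide how to handle a missing value instead of silently borrowing a later one.&lt;/p&gt;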

&lt;h2&gt;What we found&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;41.4%&lt;/strong&gt; of &lt;code&gt;epsActual&lt;/code&gt; values (896/2163) differed between first-seen and final.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15.3%&lt;/strong&gt; of cases (332/2163) differed enough to flip a tradeable decision — a sign change or a threshold crossing in the signal.&lt;/li&gt;
&lt;li&gt;Over the four-month window, the as-of backtest retained ~&lt;strong&gt;73%&lt;/strong&gt; of the naive backtest's returns and ~&lt;strong&gt;82%&lt;/strong&gt; of its Sharpe. (The naive leg, built on final values, keeps drifting as the vendor keeps revising, so treat the &lt;em&gt;ratio&lt;/em&gt; as more stable than the levels.)&lt;/li&gt;
&lt;li&gt;Read inversely: roughly 27% of the headline returns, and roughly 18% of the Sharpe, were look-ahead artifacts.&lt;/li&gt;
&lt;/ul&gt;
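
&lt;p&gt;The headline shares follow directly from the counts above, and the flip rule can be written down explicitly. In this sketch, &lt;code&gt;share&lt;/code&gt;, &lt;code&gt;is_decision_flip&lt;/code&gt;, and the &lt;code&gt;threshold&lt;/code&gt; parameter are hypothetical; the actual signal and its threshold are not shown in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


def is_decision_flip(first_seen_signal, final_signal, threshold):
    """True when a revision would change the trade.

    A flip is a sign change, or the signal moving across the trade
    threshold (hypothetical: minimum absolute signal needed to trade).
    """
    sign_change = 0 > first_seen_signal * final_signal
    traded_then = abs(first_seen_signal) >= threshold
    traded_now = abs(final_signal) >= threshold
    return sign_change or traded_then != traded_now


changed_pct = share(896, 2163)   # values that differ first-seen vs final
flipped_pct = share(332, 2163)   # cases that flip a tradeable decision
returns_lost_pct = 100 - 73      # share of naive returns not retained
sharpe_lost_pct = 100 - 82       # share of naive Sharpe not retained
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;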

&lt;p&gt;The encouraging half: most of the strategy survives honest data. The sobering half: a naive backtest overstated it by a wide margin, and a meaningful fraction of "winning" trades were decided on numbers that did not exist at decision time. A 15% decision-flip rate is not noise you can wave away.&lt;/p&gt;

&lt;h2&gt;Why this is structural, not a one-off&lt;/h2&gt;

&lt;p&gt;The natural reaction is "okay, we'll be careful with that field." That doesn't hold. The risk is reintroduced by every new feature, every new vendor, every rerun, every teammate who reaches for "the actual value." Carefulness is a property of a person on a good day; &lt;strong&gt;as-of correctness has to be a property of the pipeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So treat the question &lt;em&gt;"could this value have been known at the decision time we're simulating?"&lt;/em&gt; as an invariant the code enforces and CI checks. A vendor "actual" is &lt;strong&gt;time-versioned reference data&lt;/strong&gt;: it only becomes valid at the instant you first observed it. Use it to decide &lt;em&gt;before&lt;/em&gt; that instant and you're using a value from the future.&lt;/p&gt;

&lt;p&gt;That's exactly what the look-ahead invariant below checks — it requires &lt;code&gt;valid_from &amp;lt;= feature_as_of&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;traceguard.validators.lookahead&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_reference_timing&lt;/span&gt;

&lt;span class="c1"&gt;# The eps "actual" is time-versioned reference data: valid_from is when this
# specific value first existed (first-seen in our snapshots), feature_as_of is
# the decision moment we are simulating.
&lt;/span&gt;&lt;span class="nf"&gt;validate_reference_timing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;valid_from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;eps_first_seen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# when this value actually existed
&lt;/span&gt;    &lt;span class="n"&gt;feature_as_of&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decision_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# the moment we're simulating
&lt;/span&gt;    &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_eps_actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# raises InvariantViolation if eps_first_seen &amp;gt; decision_date
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a value is used before its availability timestamp, the run fails loudly rather than silently inflating a Sharpe ratio.&lt;/p&gt;
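
&lt;p&gt;For readers who want to see the shape of the check without installing anything, here is a standalone stand-in written for this post. It is not traceguard's implementation: &lt;code&gt;LookaheadError&lt;/code&gt; and &lt;code&gt;assert_known_at&lt;/code&gt; are hypothetical names, and the dates are made up. It only illustrates the comparison and how it can run as a plain test in CI:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone


class LookaheadError(AssertionError):
    """Raised when a value is used before it was first observed."""


def assert_known_at(valid_from, feature_as_of, kind):
    """Require valid_from to be on or before feature_as_of, else fail loudly."""
    if valid_from > feature_as_of:
        raise LookaheadError(
            f"{kind}: first seen {valid_from.isoformat()}, "
            f"used at {feature_as_of.isoformat()}"
        )


def test_eps_actual_is_point_in_time():
    decision = datetime(2024, 1, 11, tzinfo=timezone.utc)
    first_seen = datetime(2024, 1, 10, tzinfo=timezone.utc)
    revised_seen = datetime(2024, 3, 1, tzinfo=timezone.utc)

    # Value observed before the decision: allowed.
    assert_known_at(first_seen, decision, kind="vendor_eps_actual")

    # Revision observed after the decision: must be refused.
    try:
        assert_known_at(revised_seen, decision, kind="vendor_eps_actual")
    except LookaheadError:
        return
    raise AssertionError("look-ahead was not detected")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Subclassing &lt;code&gt;AssertionError&lt;/code&gt; is one option among several; the point is that the violation surfaces as a failed test rather than a warning in a log.&lt;/p&gt;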

&lt;h2&gt;Two kinds of look-ahead — don't conflate them&lt;/h2&gt;

&lt;p&gt;It's worth being precise about scope. There are &lt;strong&gt;two&lt;/strong&gt; distinct kinds of look-ahead bias in LLM pipelines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Training contamination&lt;/strong&gt; — the model itself was pre-trained on the future you're predicting, so it "recalls" rather than reasons. That's a separate research problem (membership-inference tests, point-in-time LLMs, claim-level temporal verification), and it needs different tooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harness / pipeline leakage&lt;/strong&gt; — your code uses a value, prompt, or model that didn't exist at the simulated time. &lt;em&gt;This story is entirely about this kind&lt;/em&gt;, and it's the kind a pipeline can be made to refuse structurally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both matter. They are not the same problem, and conflating them is how teams "fix" one and ship the other.&lt;/p&gt;

&lt;h2&gt;A checklist you can apply today&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Treat every &lt;code&gt;actual&lt;/code&gt; / &lt;code&gt;final&lt;/code&gt; / &lt;code&gt;reported&lt;/code&gt; vendor field as a &lt;strong&gt;moving target&lt;/strong&gt; until you've proven otherwise with your own snapshots.&lt;/li&gt;
&lt;li&gt;Detect revisions by &lt;strong&gt;value&lt;/strong&gt;, not by the vendor's update timestamp.&lt;/li&gt;
&lt;li&gt;Backtest on &lt;strong&gt;as-of (first-seen)&lt;/strong&gt; data, and explicitly measure the gap against revised data. That gap is your look-ahead tax — quantify it instead of assuming it's zero.&lt;/li&gt;
&lt;li&gt;Encode "known at decision time?" as a &lt;strong&gt;CI invariant&lt;/strong&gt;, so the failure mode is a red test, not a flattering backtest.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Limitations&lt;/h2&gt;

&lt;p&gt;One vendor, one field, a four-month window. The exact percentages are dataset-specific and should not be read as universal constants — your numbers will differ. And again: this addresses harness leakage only, not whether the model itself has seen the future.&lt;/p&gt;




&lt;p&gt;The validators and point-in-time instrumentation here are part of &lt;a href="https://github.com/lizhuojunx86/traceguard" rel="noopener noreferrer"&gt;&lt;strong&gt;traceguard&lt;/strong&gt;&lt;/a&gt; — an open-source Python library for point-in-time-correct LLM instrumentation: a model registry that refuses anachronistic picks, a git-tracked prompt registry, canonical input hashing, and look-ahead invariants you call in CI. It's not a dashboard — it exports OpenTelemetry spans into Langfuse / Phoenix, so it sits &lt;em&gt;underneath&lt;/em&gt; your observability stack and keeps the timeline honest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;traceguard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've been burned by a backtest that looked great and meant nothing, I'd genuinely like to hear how it happened — that's the failure mode this is built to catch.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>finance</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
