<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gauthier Piarrette</title>
    <description>The latest articles on DEV Community by Gauthier Piarrette (@gauthierpiarrette).</description>
    <link>https://dev.to/gauthierpiarrette</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1348309%2F9e0209ce-fd9f-40c5-af9a-fed0f8f2906e.jpeg</url>
      <title>DEV Community: Gauthier Piarrette</title>
      <link>https://dev.to/gauthierpiarrette</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gauthierpiarrette"/>
    <language>en</language>
    <item>
      <title>Your ML Model Is Training on the Future</title>
      <dc:creator>Gauthier Piarrette</dc:creator>
      <pubDate>Wed, 11 Feb 2026 16:59:41 +0000</pubDate>
      <link>https://dev.to/gauthierpiarrette/your-ml-model-is-training-on-the-future-2dig</link>
      <guid>https://dev.to/gauthierpiarrette/your-ml-model-is-training-on-the-future-2dig</guid>
      <description>&lt;p&gt;Your model is training on the future. Not metaphorically. A single wrong join operator lets feature values from after the label event leak into every training row. At 10 million labels with 50 features, that's hundreds of millions of corrupted values. The worst part: your offline metrics will actually &lt;em&gt;improve&lt;/em&gt;, because future information is genuinely predictive, just not available at inference time.&lt;/p&gt;

&lt;p&gt;Leakage rarely crashes your pipeline. It just makes your model look smarter than it really is. This is what point-in-time (PIT) joins prevent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You're building a churn prediction model. You have a labels table (one row per prediction target):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;label_time&lt;/th&gt;
&lt;th&gt;churned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-06-15&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-03-01&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And a features table (many rows per entity, evolving over time):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;updated_at&lt;/th&gt;
&lt;th&gt;country&lt;/th&gt;
&lt;th&gt;tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-01-10&lt;/td&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-04-22&lt;/td&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-08-01&lt;/td&gt;
&lt;td&gt;UK&lt;/td&gt;
&lt;td&gt;pro&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the label at &lt;code&gt;2025-06-15&lt;/code&gt;, which feature row should you use?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wrong&lt;/strong&gt;: The latest row (Aug 1). The user moved to UK &lt;em&gt;after&lt;/em&gt; the churn event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong&lt;/strong&gt;: All rows. You'd have duplicates and future data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correct&lt;/strong&gt;: The Apr 22 row. It's the most recent row &lt;em&gt;before&lt;/em&gt; the label time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the label at &lt;code&gt;2025-03-01&lt;/code&gt;, the correct row is Jan 10.&lt;/p&gt;

&lt;p&gt;This is a point-in-time join: for each label, find the most recent feature row that was available &lt;em&gt;before&lt;/em&gt; the label event.&lt;/p&gt;

&lt;h3&gt;
  
  
  What goes wrong without it
&lt;/h3&gt;

&lt;p&gt;Here's the actual training data produced by each approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Naive join (latest row globally):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;label_time&lt;/th&gt;
&lt;th&gt;churned&lt;/th&gt;
&lt;th&gt;country&lt;/th&gt;
&lt;th&gt;tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-06-15&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;UK&lt;/td&gt;
&lt;td&gt;pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-03-01&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;td&gt;UK&lt;/td&gt;
&lt;td&gt;pro&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both labels get the Aug 1 snapshot, data from &lt;em&gt;after&lt;/em&gt; the churn event. The model learns "UK, pro" predicts churn, but the user was still "US, pro" when the churn decision happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Point-in-time correct join:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;label_time&lt;/th&gt;
&lt;th&gt;churned&lt;/th&gt;
&lt;th&gt;country&lt;/th&gt;
&lt;th&gt;tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-06-15&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-03-01&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each label sees only the features that existed at prediction time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Invariant
&lt;/h2&gt;

&lt;p&gt;Some features aren't available the instant they're recorded. Consider a "30-day rolling spend" computed in a nightly batch at 2 AM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The feature is &lt;strong&gt;timestamped&lt;/strong&gt; as &lt;code&gt;2025-06-14&lt;/code&gt; (yesterday's data)&lt;/li&gt;
&lt;li&gt;The feature becomes &lt;strong&gt;available&lt;/strong&gt; at &lt;code&gt;2025-06-15 02:00&lt;/code&gt; (when the batch completes)&lt;/li&gt;
&lt;li&gt;A label at &lt;code&gt;2025-06-15 00:00&lt;/code&gt; (midnight) should &lt;strong&gt;not&lt;/strong&gt; use this feature. It didn't exist yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gap between "timestamped" and "available" is called &lt;strong&gt;embargo&lt;/strong&gt;. Without it, you use features that appear to exist but hadn't actually been computed yet.&lt;/p&gt;

&lt;p&gt;The formal rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feature_time &amp;lt; label_time - embargo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;feature_time&lt;/code&gt; is when the feature value was recorded&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;label_time&lt;/code&gt; is when the prediction target was observed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;embargo&lt;/code&gt; is the buffer for computation lag (default: 0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strict less-than. Not less-than-or-equal. If a feature was recorded at the exact same time as the label, it may contain information about the outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Join Window
&lt;/h2&gt;

&lt;p&gt;Given a label row &lt;code&gt;(key=42, time=June 15, embargo=1d, lookback=365d)&lt;/code&gt;, the join window is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Earliest valid feature: June 15 - 365d = June 15, 2024
Latest valid feature:   June 15 - 1d   = June 14, 2025

Window: [June 15 2024, June 14 2025)    ← half-open: includes left, excludes right
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From our features table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jan 10: inside the window. Candidate.&lt;/li&gt;
&lt;li&gt;Apr 22: inside the window. Candidate. &lt;strong&gt;Most recent, so this one is selected.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Aug 1: outside the window (after June 14). Blocked by embargo + future rule.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what that looks like on a timeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd4sy7k0ipwb2s4xwa9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd4sy7k0ipwb2s4xwa9p.png" alt="Timeline showing the join window for user 42 with label time June 15, 2025 and embargo of 1 day. The green lookback window spans June 2024 to June 2025. Feature rows from Jan 10 and Apr 22 are candidates inside the window, with Apr 22 selected as the most recent. The Aug 1 feature is blocked by embargo and future rules." width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The general form with all three parameters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hds5p8sokpusjbenmvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hds5p8sokpusjbenmvd.png" alt="Diagram showing the Point-in-Time Join Window with four zones on a timeline: " start="" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three parameters control the window:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max lookback&lt;/strong&gt; (&lt;code&gt;L&lt;/code&gt;): Ignore features older than this. A 365-day lookback means a feature value from 2 years ago won't be used. Too stale to be meaningful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embargo&lt;/strong&gt; (&lt;code&gt;E&lt;/code&gt;): Buffer for computation lag. A 1-day embargo excludes features timestamped within 1 day of the label, because those values may not have been available in production yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max staleness&lt;/strong&gt; (&lt;code&gt;S&lt;/code&gt;, optional): If the best available feature is older than this threshold, treat it as missing (null) rather than using outdated data. Must be between &lt;code&gt;E&lt;/code&gt; and &lt;code&gt;L&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation in SQL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ROW_NUMBER (works everywhere)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;churned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;feature_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt;                        &lt;span class="c1"&gt;-- strict less-than&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'365 days'&lt;/span&gt; &lt;span class="c1"&gt;-- lookback bound&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;churned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Aug 1 row is excluded by &lt;code&gt;f.updated_at &amp;lt; l.label_time&lt;/code&gt;. The remaining candidates are ranked by recency within each &lt;code&gt;(user_id, label_time)&lt;/code&gt; partition. After &lt;code&gt;WHERE rn = 1&lt;/code&gt;, the result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;label_time&lt;/th&gt;
&lt;th&gt;churned&lt;/th&gt;
&lt;th&gt;country&lt;/th&gt;
&lt;th&gt;feature_time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-06-15&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;2025-04-22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;2025-03-01&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;2025-01-10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each label gets exactly one feature row: the most recent one before the label time. When the LEFT JOIN finds no matching feature, it still produces one row (with all &lt;code&gt;f.*&lt;/code&gt; columns as NULL), and &lt;code&gt;ROW_NUMBER()&lt;/code&gt; assigns it &lt;code&gt;rn = 1&lt;/code&gt;, so labels with missing features are preserved automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding embargo:&lt;/strong&gt; shift the upper bound from &lt;code&gt;label_time&lt;/code&gt; to &lt;code&gt;label_time - 1 day&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ASOF JOIN (DuckDB, ClickHouse, KDB+)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;churned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="n"&gt;ASOF&lt;/span&gt; &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ASOF JOIN is purpose-built for this: it walks both tables in sorted order and finds the last feature before each label time. No windowing, no ranking. A single merge pass. Significantly faster than ROW_NUMBER at scale, but only supported in DuckDB, ClickHouse, and KDB+ (not PostgreSQL, Snowflake, or BigQuery as of 2025).&lt;/p&gt;

&lt;p&gt;With embargo, shift the left side of the comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;ASOF&lt;/span&gt; &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake 1: &lt;code&gt;&amp;lt;=&lt;/code&gt; instead of &lt;code&gt;&amp;lt;&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- WRONG: allows same-timestamp features&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt;

&lt;span class="c1"&gt;-- CORRECT: strict less-than&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One character. If your feature pipeline runs at midnight and the label event is at midnight, &lt;code&gt;&amp;lt;=&lt;/code&gt; includes the feature computed &lt;em&gt;at&lt;/em&gt; the label time. That feature may already reflect the outcome, for example a "daily active users" count that includes the churn day itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 2: Missing the LEFT in LEFT JOIN
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- WRONG: drops labels where no feature exists&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;-- CORRECT: keeps all labels, nulls for missing features&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An inner join silently drops labels where no feature exists within the lookback window, shrinking your training set and biasing it toward entities with complete histories. Use LEFT JOIN. Missing features should be null, not dropped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 3: Joining on the latest row globally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- WRONG: uses the most recent feature regardless of when the label occurred&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives every label the user's &lt;em&gt;current&lt;/em&gt; state, not their state at prediction time. A label from January uses August data. A label from 2022 uses 2025 data. The model trains on the future for every historical label.&lt;/p&gt;

&lt;p&gt;This is the single most common source of temporal leakage. The query is simple, fast, and produces plausible-looking results. There's no error, no warning, and offline metrics actually &lt;em&gt;improve&lt;/em&gt; because future information is genuinely predictive, just not available at inference time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 4: No lookback bound
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- RISKY: could match a feature from 5 years ago&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_time&lt;/span&gt;
    &lt;span class="c1"&gt;-- No lower bound on how old the feature can be!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without a max lookback, a label from 2025 could match a feature from 2019. A user's country or product tier from five years ago is noise, not signal. Bound the window: &lt;code&gt;AND f.updated_at &amp;gt;= l.label_time - INTERVAL '365 days'&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automating the Invariant
&lt;/h2&gt;

&lt;p&gt;These queries work for one feature table. In practice you're joining 10-20 feature sources to the same label set, each with different embargos and lookback requirements. The SQL gets unwieldy fast, and every join is another place to get the invariant wrong. I kept finding the same temporal bugs in production pipelines, so I built &lt;a href="https://timefence.dev" rel="noopener noreferrer"&gt;Timefence&lt;/a&gt; to handle it declaratively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timefence&lt;/span&gt;

&lt;span class="c1"&gt;# Declare sources, features, and labels
&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timefence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/users.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;txns&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timefence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/transactions.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;country&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timefence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;rolling_spend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timefence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;txns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spend_30d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embargo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timefence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/labels.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;churned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Build with guaranteed temporal correctness, then audit
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timefence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rolling_spend&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_lookback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;365d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timefence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rolling_spend&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ALL CLEAN - no temporal leakage detected
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It uses ASOF JOIN when possible, falls back to ROW_NUMBER, and verifies the temporal invariant on the output. &lt;a href="https://github.com/timefence/timefence" rel="noopener noreferrer"&gt;Open-source on GitHub&lt;/a&gt; if you want to try it.&lt;/p&gt;

&lt;h2&gt;
  
  
  "But I Already Split by Time"
&lt;/h2&gt;

&lt;p&gt;A common objection: "I use a time-based train/test split, so I'm already preventing leakage."&lt;/p&gt;

&lt;p&gt;Time-based splits prevent &lt;em&gt;period-level&lt;/em&gt; leakage: no test-period data in training. But they don't prevent &lt;em&gt;row-level&lt;/em&gt; leakage: a training label from March using a feature snapshot from April is temporal leakage, even though both dates fall within the training window. You need both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time-based splits&lt;/strong&gt;: No test-period data in training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PIT joins&lt;/strong&gt;: No future-of-label data in any individual training row&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Use &lt;code&gt;&amp;lt;&lt;/code&gt;, not &lt;code&gt;&amp;lt;=&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Same-timestamp features may contain the outcome&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Always &lt;code&gt;LEFT JOIN&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Missing features should be null, not dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bound the lookback window&lt;/td&gt;
&lt;td&gt;Stale features are worse than missing features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add embargo for lagged features&lt;/td&gt;
&lt;td&gt;If a feature has computation delay, account for it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verify after building&lt;/td&gt;
&lt;td&gt;Audit the output, don't just trust the SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Get PIT joins right, and your offline metrics honestly reflect production performance. Get them wrong, and you ship a model trained on the future.&lt;/p&gt;




&lt;p&gt;What's the most subtle data leakage bug you've found in a training pipeline? I'd love to hear what patterns other teams run into.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
