Gauthier Piarrette
Your ML Model Is Training on the Future

Your model is training on the future. Not metaphorically. A single wrong join operator lets feature values from after the label event leak into every training row. At 10 million labels with 50 features, that's hundreds of millions of corrupted values. The worst part: your offline metrics will actually improve, because future information is genuinely predictive, just not available at inference time.

Leakage rarely crashes your pipeline. It just makes your model look smarter than it really is. This is what point-in-time (PIT) joins prevent.

The Problem

You're building a churn prediction model. You have a labels table (one row per prediction target):

user_id  label_time  churned
42       2025-06-15  true
42       2025-03-01  false

And a features table (many rows per entity, evolving over time):

user_id  updated_at  country  tier
42       2025-01-10  US       free
42       2025-04-22  US       pro
42       2025-08-01  UK       pro

For the label at 2025-06-15, which feature row should you use?

  • Wrong: The latest row (Aug 1). The user moved to UK after the churn event.
  • Wrong: All rows. You'd have duplicates and future data.
  • Correct: The Apr 22 row. It's the most recent row before the label time.

For the label at 2025-03-01, the correct row is Jan 10.

This is a point-in-time join: for each label, find the most recent feature row that was available before the label event.
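The selection rule is simple enough to sketch in plain Python. This is a toy in-memory version of the tables above; `pit_lookup` is a hypothetical helper name, not part of any library:

```python
from datetime import date

# Toy in-memory copy of the features table: (user_id, updated_at, country, tier)
features = [
    (42, date(2025, 1, 10), "US", "free"),
    (42, date(2025, 4, 22), "US", "pro"),
    (42, date(2025, 8, 1),  "UK", "pro"),
]

def pit_lookup(rows, user_id, label_time):
    """Most recent feature row strictly before label_time, or None."""
    candidates = [r for r in rows if r[0] == user_id and r[1] < label_time]
    return max(candidates, key=lambda r: r[1], default=None)

pit_lookup(features, 42, date(2025, 6, 15))  # -> the Apr 22 row ("US", "pro")
pit_lookup(features, 42, date(2025, 3, 1))   # -> the Jan 10 row ("US", "free")
```

The `default=None` matters: a label with no valid feature history gets a null, not a crash and not a silently dropped row.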

What goes wrong without it

Here's the actual training data produced by each approach:

Naive join (latest row globally):

user_id  label_time  churned  country  tier
42       2025-06-15  true     UK       pro
42       2025-03-01  false    UK       pro

Both labels get the Aug 1 snapshot, data from after the churn event. The model learns "UK, pro" predicts churn, but the user was still "US, pro" when the churn decision happened.

Point-in-time correct join:

user_id  label_time  churned  country  tier
42       2025-06-15  true     US       pro
42       2025-03-01  false    US       free

Each label sees only the features that existed at prediction time.

The Invariant

Some features aren't available the instant they're recorded. Consider a "30-day rolling spend" computed in a nightly batch at 2 AM:

  • The feature is timestamped as 2025-06-14 (yesterday's data)
  • The feature becomes available at 2025-06-15 02:00 (when the batch completes)
  • A label at 2025-06-15 00:00 (midnight) should not use this feature. It didn't exist yet

This gap between "timestamped" and "available" is called embargo. Without it, you use features that appear to exist but hadn't actually been computed yet.

The formal rule:

feature_time < label_time - embargo

Where:

  • feature_time is when the feature value was recorded
  • label_time is when the prediction target was observed
  • embargo is the buffer for computation lag (default: 0)

Strict less-than. Not less-than-or-equal. If a feature was recorded at the exact same time as the label, it may contain information about the outcome.

The Join Window

Given a label row (key=42, time=June 15, embargo=1d, lookback=365d), the join window is:

Lower bound (inclusive): June 15 - 365d = June 15, 2024
Upper bound (exclusive): June 15 - 1d   = June 14, 2025

Window: [June 15 2024, June 14 2025)    ← half-open: includes left, excludes right

From our features table:

  • Jan 10: inside the window. Candidate.
  • Apr 22: inside the window. Candidate. Most recent, so this one is selected.
  • Aug 1: outside the window (after June 14). Blocked by embargo + future rule.

Here's what that looks like on a timeline:

[Timeline: label at June 15, 2025 with a 1-day embargo. The lookback window spans June 2024 to June 2025; the Jan 10 and Apr 22 feature rows are candidates inside the window, with Apr 22 selected as the most recent. The Aug 1 feature is blocked by the embargo and future rules.]

The general form with all three parameters:

[Diagram: the general point-in-time join window, with four zones laid out on a timeline.]

Three parameters control the window:

  • Max lookback (L): Ignore features older than this. A 365-day lookback means a feature value from 2 years ago won't be used. Too stale to be meaningful.
  • Embargo (E): Buffer for computation lag. A 1-day embargo excludes features timestamped within 1 day of the label, because those values may not have been available in production yet.
  • Max staleness (S, optional): If the best available feature is older than this threshold, treat it as missing (null) rather than using outdated data. Must be between E and L.
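Applied to the example values above, the three parameters reduce to a few lines of arithmetic. The constants and variable names here are illustrative:

```python
from datetime import date, timedelta

LOOKBACK  = timedelta(days=365)  # L: ignore anything older
EMBARGO   = timedelta(days=1)    # E: buffer for computation lag
STALENESS = timedelta(days=90)   # S: best match older than this -> null

label_time    = date(2025, 6, 15)
feature_times = [date(2025, 1, 10), date(2025, 4, 22), date(2025, 8, 1)]

# Half-open candidate window [label_time - L, label_time - E)
lo, hi = label_time - LOOKBACK, label_time - EMBARGO

# Most recent candidate inside the window, if any
best = max((ft for ft in feature_times if lo <= ft < hi), default=None)

# Max staleness: a technically valid but ancient value becomes a null
if best is not None and label_time - best > STALENESS:
    best = None
```

Here `best` ends up as the Apr 22 row: Jan 10 and Apr 22 both fall inside the window, Aug 1 is excluded, and Apr 22 (54 days old) passes the staleness check.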

Implementation in SQL

ROW_NUMBER (works everywhere)

WITH ranked AS (
    SELECT
        l.user_id,
        l.label_time,
        l.churned,
        f.country,
        f.updated_at AS feature_time,
        ROW_NUMBER() OVER (
            PARTITION BY l.user_id, l.label_time
            ORDER BY f.updated_at DESC
        ) AS rn
    FROM labels l
    LEFT JOIN features f
        ON f.user_id = l.user_id
        AND f.updated_at < l.label_time                        -- strict less-than
        AND f.updated_at >= l.label_time - INTERVAL '365 days' -- lookback bound
)
SELECT user_id, label_time, churned, country, feature_time
FROM ranked
WHERE rn = 1

The Aug 1 row is excluded by f.updated_at < l.label_time. The remaining candidates are ranked by recency within each (user_id, label_time) partition. After WHERE rn = 1, the result:

user_id  label_time  churned  country  feature_time
42       2025-06-15  true     US       2025-04-22
42       2025-03-01  false    US       2025-01-10

Each label gets exactly one feature row: the most recent one before the label time. When the LEFT JOIN finds no matching feature, it still produces one row (with all f.* columns as NULL), and ROW_NUMBER() assigns it rn = 1, so labels with missing features are preserved automatically.

Adding embargo: shift the upper bound from label_time to label_time - 1 day:

AND f.updated_at < l.label_time - INTERVAL '1 day'

ASOF JOIN (DuckDB, ClickHouse, KDB+)

SELECT l.user_id, l.label_time, l.churned, f.country
FROM labels l
ASOF LEFT JOIN features f
    ON l.user_id = f.user_id
    AND l.label_time > f.updated_at

ASOF JOIN is purpose-built for this: it walks both tables in sorted order and finds the last feature before each label time. No windowing, no ranking. A single merge pass. Significantly faster than ROW_NUMBER at scale, but support is narrower: DuckDB, ClickHouse, and kdb+ have it natively, and Snowflake added its own ASOF JOIN (with a MATCH_CONDITION clause) in 2024, while PostgreSQL and BigQuery still lack it as of 2025.

With embargo, shift the left side of the comparison:

ASOF LEFT JOIN features f
    ON l.user_id = f.user_id
    AND l.label_time - INTERVAL '1 day' > f.updated_at

Common Mistakes

Mistake 1: <= instead of <

-- WRONG: allows same-timestamp features
WHERE f.updated_at <= l.label_time

-- CORRECT: strict less-than
WHERE f.updated_at < l.label_time

One character. If your feature pipeline runs at midnight and the label event is at midnight, <= includes the feature computed at the label time. That feature may already reflect the outcome, for example a "daily active users" count that includes the churn day itself.

Mistake 2: Missing the LEFT in LEFT JOIN

-- WRONG: drops labels where no feature exists
FROM labels l
JOIN features f ON ...

-- CORRECT: keeps all labels, nulls for missing features
FROM labels l
LEFT JOIN features f ON ...

An inner join silently drops labels where no feature exists within the lookback window, shrinking your training set and biasing it toward entities with complete histories. Use LEFT JOIN. Missing features should be null, not dropped.

Mistake 3: Joining on the latest row globally

-- WRONG: uses the most recent feature regardless of when the label occurred
WITH latest AS (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY user_id ORDER BY updated_at DESC
    ) AS rn
    FROM features
)
SELECT l.*, f.country
FROM labels l
LEFT JOIN latest f ON l.user_id = f.user_id AND f.rn = 1

This gives every label the user's current state, not their state at prediction time. A label from January uses August data. A label from 2022 uses 2025 data. The model trains on the future for every historical label.

This is the single most common source of temporal leakage. The query is simple, fast, and produces plausible-looking results. There's no error and no warning, and for the reason covered at the start, your offline metrics even improve.

Mistake 4: No lookback bound

-- RISKY: could match a feature from 5 years ago
FROM labels l
LEFT JOIN features f
    ON f.user_id = l.user_id
    AND f.updated_at < l.label_time
    -- No lower bound on how old the feature can be!

Without a max lookback, a label from 2025 could match a feature from 2019. A user's country or product tier from five years ago is noise, not signal. Bound the window: AND f.updated_at >= l.label_time - INTERVAL '365 days'.

Automating the Invariant

These queries work for one feature table. In practice you're joining 10-20 feature sources to the same label set, each with different embargoes and lookback requirements. The SQL gets unwieldy fast, and every join is another place to get the invariant wrong. I kept finding the same temporal bugs in production pipelines, so I built Timefence to handle it declaratively:

import timefence

# Declare sources, features, and labels
users = timefence.Source("data/users.parquet", keys=["user_id"], timestamp="updated_at")
txns  = timefence.Source("data/transactions.parquet", keys=["user_id"], timestamp="created_at")

country       = timefence.Feature(source=users, columns=["country"])
rolling_spend = timefence.Feature(source=txns, name="spend_30d", embargo="1d", sql="...")

labels = timefence.Labels("data/labels.parquet", keys=["user_id"], label_time="label_time", target=["churned"])

# Build with guaranteed temporal correctness, then audit
result = timefence.build(labels=labels, features=[country, rolling_spend], output="train.parquet", max_lookback="365d")
report = timefence.audit("train.parquet", features=[country, rolling_spend], keys=["user_id"], label_time="label_time")
# ALL CLEAN - no temporal leakage detected

It uses ASOF JOIN when possible, falls back to ROW_NUMBER, and verifies the temporal invariant on the output. Open-source on GitHub if you want to try it.

"But I Already Split by Time"

A common objection: "I use a time-based train/test split, so I'm already preventing leakage."

Time-based splits prevent period-level leakage: no test-period data in training. But they don't prevent row-level leakage: a training label from March using a feature snapshot from April is temporal leakage, even though both dates fall within the training window. You need both:

  • Time-based splits: No test-period data in training
  • PIT joins: No future-of-label data in any individual training row
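A quick illustration of the gap between the two checks. Both timestamps below fall inside a training period ending June 30, so a time-based split is satisfied, yet the row still trains on the future (the dates reuse the worked example; the variable names are illustrative):

```python
from datetime import date

train_end = date(2025, 6, 30)

# A March label joined to an April feature snapshot: both inside
# the training period, so a period-level split never flags it.
label_time   = date(2025, 3, 1)
feature_time = date(2025, 4, 22)

in_train_period = label_time <= train_end and feature_time <= train_end
row_level_safe  = feature_time < label_time

print(in_train_period)  # True: the split check passes
print(row_level_safe)   # False: the row is still leaked
```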

Summary

Rule                             Why
Use <, not <=                    Same-timestamp features may contain the outcome
Always LEFT JOIN                 Missing features should be null, not dropped
Bound the lookback window        Stale features are worse than missing features
Add embargo for lagged features  If a feature has computation delay, account for it
Verify after building            Audit the output, don't just trust the SQL

Get PIT joins right, and your offline metrics honestly reflect production performance. Get them wrong, and you ship a model trained on the future.


What's the most subtle data leakage bug you've found in a training pipeline? I'd love to hear what patterns other teams run into.
