Gauthier Piarrette
Your ML Model Is Training on the Future

Your model is training on the future. Not metaphorically. A single wrong join operator lets feature values from after the label event leak into every training row. At 10 million labels with 50 features, that's hundreds of millions of corrupted values. The worst part: your offline metrics will actually improve, because future information is genuinely predictive, just not available at inference time.

Leakage rarely crashes your pipeline. It just makes your model look smarter than it really is. This is what point-in-time (PIT) joins prevent.

The Problem

You're building a churn prediction model. You have a labels table (one row per prediction target):

user_id  label_time  churned
42       2025-06-15  true
42       2025-03-01  false

And a features table (many rows per entity, evolving over time):

user_id  updated_at  country  tier
42       2025-01-10  US       free
42       2025-04-22  US       pro
42       2025-08-01  UK       pro

For the label at 2025-06-15, which feature row should you use?

  • Wrong: The latest row (Aug 1). The user moved to UK after the churn event.
  • Wrong: All rows. You'd have duplicates and future data.
  • Correct: The Apr 22 row. It's the most recent row before the label time.

For the label at 2025-03-01, the correct row is Jan 10.

This is a point-in-time join: for each label, find the most recent feature row that was available before the label event.
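The selection rule is simple enough to sketch in plain Python. This is a toy in-memory version of the tables above; `pit_lookup` is a hypothetical helper name, not part of any library:

```python
from datetime import date

# Toy in-memory copy of the features table: (user_id, updated_at, country, tier)
features = [
    (42, date(2025, 1, 10), "US", "free"),
    (42, date(2025, 4, 22), "US", "pro"),
    (42, date(2025, 8, 1),  "UK", "pro"),
]

def pit_lookup(rows, user_id, label_time):
    """Most recent feature row strictly before label_time, or None."""
    candidates = [r for r in rows if r[0] == user_id and r[1] < label_time]
    return max(candidates, key=lambda r: r[1], default=None)

pit_lookup(features, 42, date(2025, 6, 15))  # -> the Apr 22 row ("US", "pro")
pit_lookup(features, 42, date(2025, 3, 1))   # -> the Jan 10 row ("US", "free")
```

The `default=None` matters: a label with no valid feature history gets a null, not a crash and not a silently dropped row.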

What goes wrong without it

Here's the actual training data produced by each approach:

Naive join (latest row globally):

user_id  label_time  churned  country  tier
42       2025-06-15  true     UK       pro
42       2025-03-01  false    UK       pro

Both labels get the Aug 1 snapshot, data from after the churn event. The model learns "UK, pro" predicts churn, but the user was still "US, pro" when the churn decision happened.

Point-in-time correct join:

user_id  label_time  churned  country  tier
42       2025-06-15  true     US       pro
42       2025-03-01  false    US       free

Each label sees only the features that existed at prediction time.

The Invariant

Some features aren't available the instant they're recorded. Consider a "30-day rolling spend" computed in a nightly batch at 2 AM:

  • The feature is timestamped as 2025-06-14 (yesterday's data)
  • The feature becomes available at 2025-06-15 02:00 (when the batch completes)
  • A label at 2025-06-15 00:00 (midnight) should not use this feature. It didn't exist yet

This gap between "timestamped" and "available" is called embargo. Without it, you use features that appear to exist but hadn't actually been computed yet.

The formal rule:

feature_time < label_time - embargo

Where:

  • feature_time is when the feature value was recorded
  • label_time is when the prediction target was observed
  • embargo is the buffer for computation lag (default: 0)

Strict less-than. Not less-than-or-equal. If a feature was recorded at the exact same time as the label, it may contain information about the outcome.

The Join Window

Given a label row (key=42, time=June 15, embargo=1d, lookback=365d), the join window is:

Lower bound (inclusive): June 15 - 365d = June 15, 2024
Upper bound (exclusive): June 15 - 1d   = June 14, 2025

Window: [June 15 2024, June 14 2025)    ← half-open: includes left, excludes right

From our features table:

  • Jan 10: inside the window. Candidate.
  • Apr 22: inside the window. Candidate. Most recent, so this one is selected.
  • Aug 1: outside the window (after June 14). Blocked by embargo + future rule.

Here's what that looks like on a timeline:

[Timeline: label at June 15, 2025 with a 1-day embargo. The lookback window spans June 2024 to June 2025; the Jan 10 and Apr 22 feature rows are candidates inside the window, with Apr 22 selected as the most recent. The Aug 1 feature is blocked by the embargo and future rules.]

The general form with all three parameters:

[Diagram: the general point-in-time join window, with four zones laid out on a timeline.]

Three parameters control the window:

  • Max lookback (L): Ignore features older than this. A 365-day lookback means a feature value from 2 years ago won't be used. Too stale to be meaningful.
  • Embargo (E): Buffer for computation lag. A 1-day embargo excludes features timestamped within 1 day of the label, because those values may not have been available in production yet.
  • Max staleness (S, optional): If the best available feature is older than this threshold, treat it as missing (null) rather than using outdated data. Must be between E and L.
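Applied to the example values above, the three parameters reduce to a few lines of arithmetic. The constants and variable names here are illustrative:

```python
from datetime import date, timedelta

LOOKBACK  = timedelta(days=365)  # L: ignore anything older
EMBARGO   = timedelta(days=1)    # E: buffer for computation lag
STALENESS = timedelta(days=90)   # S: best match older than this -> null

label_time    = date(2025, 6, 15)
feature_times = [date(2025, 1, 10), date(2025, 4, 22), date(2025, 8, 1)]

# Half-open candidate window [label_time - L, label_time - E)
lo, hi = label_time - LOOKBACK, label_time - EMBARGO

# Most recent candidate inside the window, if any
best = max((ft for ft in feature_times if lo <= ft < hi), default=None)

# Max staleness: a technically valid but ancient value becomes a null
if best is not None and label_time - best > STALENESS:
    best = None
```

Here `best` ends up as the Apr 22 row: Jan 10 and Apr 22 both fall inside the window, Aug 1 is excluded, and Apr 22 (54 days old) passes the staleness check.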

Implementation in SQL

ROW_NUMBER (works everywhere)

WITH ranked AS (
    SELECT
        l.user_id,
        l.label_time,
        l.churned,
        f.country,
        f.updated_at AS feature_time,
        ROW_NUMBER() OVER (
            PARTITION BY l.user_id, l.label_time
            ORDER BY f.updated_at DESC
        ) AS rn
    FROM labels l
    LEFT JOIN features f
        ON f.user_id = l.user_id
        AND f.updated_at < l.label_time                        -- strict less-than
        AND f.updated_at >= l.label_time - INTERVAL '365 days' -- lookback bound
)
SELECT user_id, label_time, churned, country, feature_time
FROM ranked
WHERE rn = 1

The Aug 1 row is excluded by f.updated_at < l.label_time. The remaining candidates are ranked by recency within each (user_id, label_time) partition. After WHERE rn = 1, the result:

user_id  label_time  churned  country  feature_time
42       2025-06-15  true     US       2025-04-22
42       2025-03-01  false    US       2025-01-10

Each label gets exactly one feature row: the most recent one before the label time. When the LEFT JOIN finds no matching feature, it still produces one row (with all f.* columns as NULL), and ROW_NUMBER() assigns it rn = 1, so labels with missing features are preserved automatically.

Adding embargo: shift the upper bound from label_time to label_time - 1 day:

AND f.updated_at < l.label_time - INTERVAL '1 day'

ASOF JOIN (DuckDB, ClickHouse, KDB+)

SELECT l.user_id, l.label_time, l.churned, f.country
FROM labels l
ASOF LEFT JOIN features f
    ON l.user_id = f.user_id
    AND l.label_time > f.updated_at

ASOF JOIN is purpose-built for this: it walks both tables in sorted order and finds the last feature before each label time. No windowing, no ranking. A single merge pass. Significantly faster than ROW_NUMBER at scale, but support is narrower: DuckDB, ClickHouse, and kdb+ have it natively, and Snowflake added its own ASOF JOIN (with a MATCH_CONDITION clause) in 2024, while PostgreSQL and BigQuery still lack it as of 2025.

With embargo, shift the left side of the comparison:

ASOF LEFT JOIN features f
    ON l.user_id = f.user_id
    AND l.label_time - INTERVAL '1 day' > f.updated_at

Common Mistakes

Mistake 1: <= instead of <

-- WRONG: allows same-timestamp features
WHERE f.updated_at <= l.label_time

-- CORRECT: strict less-than
WHERE f.updated_at < l.label_time

One character. If your feature pipeline runs at midnight and the label event is at midnight, <= includes the feature computed at the label time. That feature may already reflect the outcome, for example a "daily active users" count that includes the churn day itself.

Mistake 2: Missing the LEFT in LEFT JOIN

-- WRONG: drops labels where no feature exists
FROM labels l
JOIN features f ON ...

-- CORRECT: keeps all labels, nulls for missing features
FROM labels l
LEFT JOIN features f ON ...

An inner join silently drops labels where no feature exists within the lookback window, shrinking your training set and biasing it toward entities with complete histories. Use LEFT JOIN. Missing features should be null, not dropped.

Mistake 3: Joining on the latest row globally

-- WRONG: uses the most recent feature regardless of when the label occurred
WITH latest AS (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY user_id ORDER BY updated_at DESC
    ) AS rn
    FROM features
)
SELECT l.*, f.country
FROM labels l
LEFT JOIN latest f ON l.user_id = f.user_id AND f.rn = 1

This gives every label the user's current state, not their state at prediction time. A label from January uses August data. A label from 2022 uses 2025 data. The model trains on the future for every historical label.

This is the single most common source of temporal leakage. The query is simple, fast, and produces plausible-looking results. There's no error and no warning, and for the reason covered at the start, your offline metrics even improve.

Mistake 4: No lookback bound

-- RISKY: could match a feature from 5 years ago
FROM labels l
LEFT JOIN features f
    ON f.user_id = l.user_id
    AND f.updated_at < l.label_time
    -- No lower bound on how old the feature can be!

Without a max lookback, a label from 2025 could match a feature from 2019. A user's country or product tier from five years ago is noise, not signal. Bound the window: AND f.updated_at >= l.label_time - INTERVAL '365 days'.

Automating the Invariant

These queries work for one feature table. In practice you're joining 10-20 feature sources to the same label set, each with different embargoes and lookback requirements. The SQL gets unwieldy fast, and every join is another place to get the invariant wrong. I kept finding the same temporal bugs in production pipelines, so I built Timefence to handle it declaratively:

import timefence

# Declare sources, features, and labels
users = timefence.Source("data/users.parquet", keys=["user_id"], timestamp="updated_at")
txns  = timefence.Source("data/transactions.parquet", keys=["user_id"], timestamp="created_at")

country       = timefence.Feature(source=users, columns=["country"])
rolling_spend = timefence.Feature(source=txns, name="spend_30d", embargo="1d", sql="...")

labels = timefence.Labels("data/labels.parquet", keys=["user_id"], label_time="label_time", target=["churned"])

# Build with guaranteed temporal correctness, then audit
result = timefence.build(labels=labels, features=[country, rolling_spend], output="train.parquet", max_lookback="365d")
report = timefence.audit("train.parquet", features=[country, rolling_spend], keys=["user_id"], label_time="label_time")
# ALL CLEAN - no temporal leakage detected

It uses ASOF JOIN when possible, falls back to ROW_NUMBER, and verifies the temporal invariant on the output. Open-source on GitHub if you want to try it.

"But I Already Split by Time"

A common objection: "I use a time-based train/test split, so I'm already preventing leakage."

Time-based splits prevent period-level leakage: no test-period data in training. But they don't prevent row-level leakage: a training label from March using a feature snapshot from April is temporal leakage, even though both dates fall within the training window. You need both:

  • Time-based splits: No test-period data in training
  • PIT joins: No future-of-label data in any individual training row
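A quick illustration of the gap between the two checks. Both timestamps below fall inside a training period ending June 30, so a time-based split is satisfied, yet the row still trains on the future (the dates reuse the worked example; the variable names are illustrative):

```python
from datetime import date

train_end = date(2025, 6, 30)

# A March label joined to an April feature snapshot: both inside
# the training period, so a period-level split never flags it.
label_time   = date(2025, 3, 1)
feature_time = date(2025, 4, 22)

in_train_period = label_time <= train_end and feature_time <= train_end
row_level_safe  = feature_time < label_time

print(in_train_period)  # True: the split check passes
print(row_level_safe)   # False: the row is still leaked
```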

Summary

Rule                             Why
Use <, not <=                    Same-timestamp features may contain the outcome
Always LEFT JOIN                 Missing features should be null, not dropped
Bound the lookback window        Stale features are worse than missing features
Add embargo for lagged features  If a feature has computation delay, account for it
Verify after building            Audit the output, don't just trust the SQL

Get PIT joins right, and your offline metrics honestly reflect production performance. Get them wrong, and you ship a model trained on the future.


What's the most subtle data leakage bug you've found in a training pipeline? I'd love to hear what patterns other teams run into.
