I built a feature store in pure Python to finally understand the point-in-time join

#python #machinelearning #dataengineering #mlops

For a long time the phrase "feature store" sat in my head as a box on an architecture diagram. Feast, Tecton, Databricks Feature Store. I knew they served features to models and I knew they were a Real Thing that ML platform teams care about, but I could not have told you what the hard part actually was. So I did the thing that always works for me: I rebuilt a tiny version from scratch, with no pandas and no pyarrow, just the Python standard library and sqlite. The project is called Asof, and building it taught me that the whole category exists to solve a single deceptively nasty problem.

That problem is the point-in-time join, and once it clicked I could not unsee it.

Live demo and system map: https://hajirufai.github.io/asof/
Code: https://github.com/hajirufai/asof

The one problem a feature store is really for

Say you are training a churn model. You have a label: customer c1 churned on January 15. You want to attach features to that label, things like the customer's rolling 7 day spend, how many orders they placed recently, their account age. Those features change over time, so you have a whole history of them.

Here is the trap. If you join "customer c1's features" to "the January 15 label" by just grabbing c1's most recent feature row, you will probably grab a row from January 20 or February 3, because that is the latest one sitting in your table. You just trained your model on data from the future of the thing it is trying to predict. Your offline accuracy looks fantastic. Then you ship it, and in production the model only ever sees the past, and it falls apart.

That gap between what your training pipeline saw and what your serving pipeline sees has a name: train/serve skew. The specific flavour above, using data from after the label timestamp, is label leakage. The point-in-time join is the fix, and it is the thing a feature store gets right so you do not have to remember to.

The rule is simple to say and easy to get wrong: for a label at time t, use the most recent feature value with a timestamp less than or equal to t, and nothing newer.
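
To make the rule concrete, here is a minimal sketch of an as-of lookup for a single customer. It is not Asof's code: the `history` data, the `as_of` helper, and the use of day numbers in place of real timestamps are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical history for customer c1: (day_of_year, rolling_7d_spend),
# already sorted by timestamp. Day 20 is January 20, day 34 is February 3.
history = [(5, 40.0), (12, 55.0), (20, 0.0), (34, 0.0)]


def as_of(history, t):
    """Return the most recent value with timestamp <= t, or None if there is none."""
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, t)  # index of the first row strictly after t
    return history[i - 1][1] if i else None


label_day = 15                        # c1 churned on January 15
leaky = history[-1][1]                # "latest row in the table": taken from day 34
correct = as_of(history, label_day)   # taken from day 12, the last row at or before t
```

The leaky value comes from a row written after the customer had already churned, which is exactly the information the model will not have at serving time.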

What Asof actually is

Asof keeps two stores, which is the split most feature stores share:

an offline store, the full timestamped history of every feature, used to build training sets, and
an online store, just the latest value per entity, used to serve a model at inference where latency matters.

flowchart LR
  S[CSV / JSONL sources] --> FV[Feature views]
  FV --> OFF[(Offline store<br/>full history)]
  OFF -->|get_historical_features<br/>as-of join| TRAIN[Point-in-time<br/>training set]
  OFF -->|materialize| ON[(Online store<br/>latest per entity)]
  ON -->|get_online_features| SERVE[Feature server to model]

You register definitions, load history, and then two operations carry the weight. get_historical_features builds a training set with the as-of join. get_online_features serves the freshest values. In between, materialize sweeps the offline history into the online store.

Defining things looks like this:

from datetime import timedelta
from asof import FeatureStore, Entity, FeatureView, Field, ValueType

store = FeatureStore()

customer = Entity("customer", join_key="customer_id")
view = FeatureView(
    name="customer_stats",
    entity=customer,
    schema=[Field("rolling_7d_spend", ValueType.FLOAT), Field("tier", ValueType.STRING)],
    ttl=timedelta(days=14),
)
store.apply(customer, view)

The as-of join, the slow way and the fast way

The obviously correct version is a nested loop. For every label row, scan every feature row for that entity, throw away anything newer than the label, throw away anything older than the label by more than the TTL, and keep the latest of what remains. With n label rows and m feature rows that is O(n times m), and it is the version I wrote first, on purpose, because it is very hard to get subtly wrong.

for ent_row in entity_rows:
    best = None
    for fr in feature_rows:
        if fr[join_key] != ent_row[join_key]:
            continue
        if fr["event_timestamp"] > ent_row["event_timestamp"]:
            continue  # this is the future. never use it.
        if ttl and (ent_row["event_timestamp"] - fr["event_timestamp"]) > ttl:
            continue  # too old, expired
        if best is None or fr["event_timestamp"] >= best["event_timestamp"]:
            best = fr

That is fine for a unit test and hopeless for real data. The fast version groups both sides by entity key, sorts each side by timestamp, and then sweeps a single pointer through the sorted feature events as the sorted label times move forward, remembering the most recent valid row as it goes. After the sort it is one linear pass, O(n plus m), so the sorting itself, at O(n log n plus m log m), becomes the dominant cost.

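# ent_list and feats hold the label rows and feature rows for one entity key
ent_list.sort(key=lambda pair: pair[1]["event_timestamp"])
feats.sort(key=lambda r: r["event_timestamp"])  # stable sort: ties keep source order
ptr = 0
latest = None
for idx, ent_row in ent_list:
    label_ts = ent_row["event_timestamp"]
    # advance past every feature row at or before this label time
    while ptr < len(feats) and feats[ptr]["event_timestamp"] <= label_ts:
        latest = feats[ptr]
        ptr += 1
    chosen = latest
    # only the newest candidate needs the TTL check: if it is expired, so is everything older
    if chosen is not None and ttl and (label_ts - chosen["event_timestamp"]) > ttl:
        chosen = None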

The sweep is the same idea as the merge step of a sort-merge join in a database, just specialized so it keeps the latest match instead of every match. On the bundled benchmark it runs over 300 times faster than the nested loop at 50,000 feature rows against 10,000 label rows, and here is the part I care about more: I never have to take its correctness on trust. The test suite runs both implementations on random data every single run and asserts they return identical results. If the fast path ever disagrees with the brute-force path, CI goes red.

rows (feat x label)     keys    fast (s)    naive (s)   speedup
5000 x 1000             50      0.0034      0.1582      46.1
20000 x 4000            200     0.0197      2.6587      134.9
50000 x 10000           500     0.0470      15.7845     336.0
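
The "run both and compare" check can be sketched in a few lines. This is a simplified stand-in rather than the repo's test suite: it handles a single entity, uses small integer timestamps and `(ts, value)` tuples, and the names `brute_force`, `sweep`, and `check_agreement` are hypothetical.

```python
import random


def brute_force(feats, label_times, ttl):
    """Reference path: scan every feature row for every label time."""
    out = []
    for t in label_times:
        best = None
        for ts, value in feats:
            if ts > t or t - ts > ttl:
                continue  # from the future, or expired
            if best is None or ts >= best[0]:
                best = (ts, value)  # >= means the last row wins a tie
        out.append(best)
    return out


def sweep(feats, label_times, ttl):
    """Fast path: sort both sides, then advance one pointer."""
    feats = sorted(feats, key=lambda r: r[0])  # stable: ties keep source order
    order = sorted(range(len(label_times)), key=label_times.__getitem__)
    out = [None] * len(label_times)
    ptr, latest = 0, None
    for i in order:
        t = label_times[i]
        while ptr < len(feats) and feats[ptr][0] <= t:
            latest = feats[ptr]
            ptr += 1
        if latest is not None and t - latest[0] <= ttl:
            out[i] = latest
    return out


def check_agreement(trials=200, seed=0):
    """Assert both paths agree on random inputs; return the number of trials run."""
    rng = random.Random(seed)
    for _ in range(trials):
        # A narrow timestamp range makes ties and expired rows common.
        feats = [(rng.randint(0, 30), rng.random()) for _ in range(rng.randint(0, 40))]
        labels = [rng.randint(0, 40) for _ in range(rng.randint(0, 20))]
        ttl = rng.randint(0, 15)
        assert sweep(feats, labels, ttl) == brute_force(feats, labels, ttl)
    return trials
```

Keeping the timestamp range deliberately small is the useful design choice here: sparse random data rarely produces ties or expired rows, so it would not exercise the edge cases where the two paths are most likely to disagree.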

Proving the leak instead of asserting it

I did not want a README that says "naive joins leak, trust me." I wanted the repo to show it. So there is a third join in the code, naive_last_value_join, which does exactly the wrong thing on purpose: it grabs each entity's single most recent feature value and ignores the label timestamp entirely. That is the shortcut people actually reach for.

The demo builds the training set both ways and counts how many rows differ. On the example dataset every label row comes out different, and each difference is a value the naive join pulled from that customer's future. The benchmark gate in CI fails the build if the naive join ever stops leaking, which means the demonstration cannot quietly rot into something that no longer proves the point. A gate that can never reject anything is just decoration, so this one actively checks that the wrong path is detectably wrong.
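
A toy version of that comparison, as a sketch only: the data, `as_of_value`, and `last_value` are invented for illustration, and the repo's `naive_last_value_join` has its own signature that is not reproduced here.

```python
def as_of_value(history, t):
    """Latest value with timestamp <= t; history is a list of (ts, value)."""
    best = None
    for ts, value in history:
        if ts <= t and (best is None or ts >= best[0]):
            best = (ts, value)
    return None if best is None else best[1]


def last_value(history, t):
    """The deliberately wrong shortcut: ignore t and take the newest row."""
    return max(history, key=lambda row: row[0])[1] if history else None


# Hypothetical per-customer feature history and label timestamps.
history = {
    "c1": [(10, 40.0), (20, 55.0), (30, 0.0)],
    "c2": [(5, 12.0), (18, 19.5)],
}
labels = [("c1", 15), ("c1", 25), ("c2", 18), ("c2", 9)]

leaked = [
    (cust, t)
    for cust, t in labels
    if as_of_value(history[cust], t) != last_value(history[cust], t)
]

# The gate: fail loudly if the wrong path ever stops being detectably wrong.
assert leaked, "naive join no longer leaks on this dataset"
```

In this toy data three of the four label rows differ. The one that matches, c2 at time 18, is a label that happens to sit exactly on the customer's newest feature row, which is the only situation where the shortcut is safe.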

The live demo has a little interactive timeline where you drag the label time t and watch the as-of join pick the right event while the naive join keeps clutching the last one, even when it is sitting in the future. Watching the two numbers diverge made the whole concept concrete for me in a way no paragraph did.

The boring details that actually matter

A few things bit me, and they are the kind of thing the production feature stores quietly handle:

Timestamps. Everything normalizes to timezone-aware UTC on the way in, and gets stored as integer epoch microseconds. That means sqlite orders them exactly and the TTL math is plain integer arithmetic, no float drift, no naive-versus-aware datetime explosions halfway through a join.
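
One way to do that normalization with only the standard library is sketched below. The helper name `to_epoch_micros`, the example dates, and the choice to treat naive datetimes as UTC are assumptions for this example, not necessarily what Asof does.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(dt):
    """Convert a datetime to integer microseconds since the Unix epoch, in UTC."""
    if dt.tzinfo is None:
        # Assumption for this sketch: naive datetimes are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Floor-dividing two timedeltas yields an exact int, with no float in between.
    return (dt - EPOCH) // timedelta(microseconds=1)


# A label at 12:00 UTC and a feature written at 14:00 in a UTC+5 zone (09:00 UTC).
label = to_epoch_micros(datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc))
feature = to_epoch_micros(
    datetime(2024, 1, 15, 14, 0, tzinfo=timezone(timedelta(hours=5)))
)
ttl = timedelta(days=14) // timedelta(microseconds=1)

age = label - feature            # plain integer subtraction
is_fresh = 0 <= age <= ttl       # the feature is in the past and within the TTL
```

Note that the feature's wall-clock time, 14:00, looks later than the label's 12:00. Comparing those without normalizing would wrongly reject it as a value from the future.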

Ties. When two feature rows share the exact same timestamp at or before t, which one wins? You have to pick a rule and stick to it, because the fast path and the slow path must agree. Asof keeps the last row in source order, deterministically, in both implementations. I found this out the hard way when the benchmark mismatched on dense data: the stable sort kept ties in source order, so the sweep ended on the last tied row, while the brute-force loop kept the first one it saw, and they disagreed. Changing one comparison in the brute-force loop from greater-than to greater-than-or-equal fixed it.
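
The effect of that one comparison is easy to reproduce in isolation. The sketch below uses a hypothetical `pick` helper and two rows with the same timestamp; it is an illustration of the tie rule, not code from the repo.

```python
def pick(rows, t, keep_last_on_tie):
    """Brute-force as-of pick over (ts, value) rows, scanned in source order."""
    best = None
    for ts, value in rows:
        if ts > t:
            continue
        if best is None:
            best = (ts, value)
        elif (ts >= best[0]) if keep_last_on_tie else (ts > best[0]):
            best = (ts, value)
    return best


rows = [(10, "first"), (10, "second")]  # identical timestamps, different values

strict = pick(rows, 10, keep_last_on_tie=False)    # '>' keeps the first tied row
relaxed = pick(rows, 10, keep_last_on_tie=True)    # '>=' keeps the last tied row

# A stable sort followed by "take the last row at or before t" also lands on the last one.
swept = sorted(rows, key=lambda r: r[0])[-1]
```

Only the greater-than-or-equal variant agrees with the sorted sweep, which is why the two paths diverged on dense data and not on sparse data.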

Materialize must not regress. Sweeping offline rows into the online store is latest-wins, and the upsert refuses to move an entity backward in time. So if you re-run an old window after a newer one, it does not clobber the fresh value. That makes materialize idempotent and safe to retry, which is the only way a scheduled job is allowed to behave.
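
A guarded upsert of this kind can be written directly in sqlite. The table layout and the `upsert_latest` helper below are assumptions for illustration, not Asof's actual schema, and the `ON CONFLICT ... DO UPDATE ... WHERE` form needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE online (entity_key TEXT PRIMARY KEY, event_ts INTEGER, value REAL)"
)


def upsert_latest(conn, entity_key, event_ts, value):
    """Write a row only if it is at least as new as the one already stored."""
    conn.execute(
        """
        INSERT INTO online (entity_key, event_ts, value) VALUES (?, ?, ?)
        ON CONFLICT(entity_key) DO UPDATE SET
            event_ts = excluded.event_ts,
            value = excluded.value
        WHERE excluded.event_ts >= online.event_ts
        """,
        (entity_key, event_ts, value),
    )


upsert_latest(conn, "c1", 200, 55.0)  # the newer window lands first
upsert_latest(conn, "c1", 100, 40.0)  # stale re-run of an old window: ignored
upsert_latest(conn, "c1", 200, 55.0)  # retry of the same window: no visible change

row = conn.execute(
    "SELECT event_ts, value FROM online WHERE entity_key = ?", ("c1",)
).fetchone()
```

Putting the guard in the `WHERE` clause of the upsert keeps the check and the write in a single statement, so there is no read-then-write window in which a stale job could overwrite a fresher value.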

What I would build next

This is a learning project, not a Feast replacement, and the gaps are deliberate. On-demand transforms, where a feature is computed at request time from other features, would be the next thing. Push sources for streaming updates. A proper feature server with auth instead of the little stdlib http.server inspector I ship now. Maybe a Parquet-backed offline store instead of sqlite once the history gets big.

But the core is done, and it does the one hard thing correctly. If you have ever nodded along to "feature store" without being totally sure what the difficult part was, the difficult part is this: never, not by one microsecond, let a feature from the future touch a label from the past. Everything else is plumbing.

The code is on GitHub at https://github.com/hajirufai/asof, MIT licensed, zero dependencies, with the interactive system map at https://hajirufai.github.io/asof/. It runs on nothing but the standard library, so you can clone it and poke at the join in about thirty seconds.