Todd Sullivan

When Your Training Data Pipeline Has Three Different Ideas About the Same Thing

If you're building ML pipelines that consume data from multiple API endpoints, you've probably hit this: the same thing — a product, a user, a record — arrives in three subtly different shapes depending on which path it took to get to you.

We hit this in a computer vision training pipeline recently. The pipeline synthesises training images for product classifiers — takes seed images of known products, composites them into scene images, generates bounding box annotations, trains a model. Standard stuff.

The bug: seed images were being silently dropped during dataset preparation. Not erroring — just gone. The model would train on an incomplete dataset and we'd only notice when accuracy came back lower than expected.

Root cause: the UID lookup used an exact string match, but three different API callers were sending the same product reference in three different formats:

'Tesco Cornflakes Cereal 500G'        # raw label, spaces preserved
'tesco_cornflakes_cereal_500g'        # stringToFilename output, lowercase underscored
'Tesco_Cornflakes_Cereal_500G'        # case-preserved underscored (from external productCode)

The on-disk index used case-preserved underscored filenames. So if your reference arrived in either of the other two shapes (raw label or lowercase underscored), the exact match failed and your seed images were quietly dropped. No exception. No warning. Just a smaller dataset than you thought you had.
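
To make the failure mode concrete, here's a minimal sketch of that exact-match lookup. The names and paths are hypothetical, not the actual pipeline code:

# Index keyed by case-preserved underscored filenames, as on disk.
seed_index = {
    "Tesco_Cornflakes_Cereal_500G": "seeds/Tesco_Cornflakes_Cereal_500G/",
}

def collect_seed_dirs(refs):
    dirs = []
    for ref in refs:
        path = seed_index.get(ref)  # exact string match
        if path is not None:
            dirs.append(path)
        # No else branch: an unresolved ref just vanishes.
    return dirs

# Only the case-preserved ref resolves; the other two are silently dropped.
print(collect_seed_dirs([
    "Tesco Cornflakes Cereal 500G",
    "tesco_cornflakes_cereal_500g",
    "Tesco_Cornflakes_Cereal_500G",
]))  # -> ['seeds/Tesco_Cornflakes_Cereal_500G/']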

Why This Happens

Three different API routes, built at different times, by different people, each making a reasonable local decision about how to normalise a string. The bug only appears when you try to join across them using the output of one as the key into an index built from another.

The fix was to make the lookup tolerant — normalise both the incoming ref and the index key before comparison, so any of the three shapes resolves to the same entry.

def normalise_uid(uid: str) -> str:
    # Collapse all three shapes to the same lowercase, underscored form.
    return uid.lower().replace(" ", "_")

Two lines of logic. But the reason you need them is worth understanding.
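
Applied to the lookup, that means normalising both sides of the join. Again a sketch with hypothetical names, building on the earlier one:

# Rebuild the index with normalised keys...
normalised_index = {normalise_uid(key): path for key, path in seed_index.items()}

def resolve_seed_dir(ref):
    # ...and normalise the incoming ref the same way, so any of the
    # three shapes resolves to the same entry.
    return normalised_index.get(normalise_uid(ref))

Normalising the index once at build time, rather than at every comparison, keeps the tolerance in a single place.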

The Broader Pattern

Silent data loss in ML pipelines is particularly nasty because:

  1. It doesn't fail loudly. The pipeline completes successfully. The model trains. You get results. You just don't realise the results are for a smaller, different dataset than you intended.

  2. The signal is weak. Lower accuracy could be bad data, bad hyperparameters, distribution shift, or a dozen other things. You might spend days investigating the model before you look at the pipeline.

  3. It only manifests at scale. In dev, you're running with a handful of products. Everyone has clean, matching UIDs. In production, you have hundreds of products, multiple API callers, and the mismatch rate goes up.

What to Add to Your Pipeline

If you're building training data pipelines that consume product/entity references from multiple sources:

  • Assert dataset size at each stage. Expected 120 seed images for this batch? Assert that before training starts.
  • Log dropped items explicitly. Don't silently skip — log the UID that couldn't be resolved so you can catch shape mismatches immediately. This and the assertion above are sketched in the code after this list.
  • Normalise at ingestion, not lookup. Standardise the UID format the moment it enters your system, rather than trying to be tolerant at every lookup point downstream.
  • Cross-reference your callers. If you have multiple API endpoints that all feed the same pipeline, explicitly document which normalisation each one applies. It'll be someone else's problem in six months, and that someone might be you.
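
Here's what the first two points might look like in practice. A sketch under the same assumptions as before: `prepare_seed_images` is a hypothetical name, and `resolve_seed_dir` is the normalised lookup from earlier:

import logging

logger = logging.getLogger("dataset_prep")

def prepare_seed_images(refs, expected_count):
    resolved, dropped = [], []
    for ref in refs:
        path = resolve_seed_dir(ref)
        if path is None:
            dropped.append(ref)
        else:
            resolved.append(path)

    # Log every UID that failed to resolve so shape mismatches
    # show up in the logs instead of vanishing.
    for ref in dropped:
        logger.warning("unresolved seed UID, dropping: %r", ref)

    # Fail loudly, before training starts, if the batch is smaller than expected.
    if len(resolved) != expected_count:
        raise ValueError(
            f"expected {expected_count} seed images, "
            f"resolved {len(resolved)}, dropped {len(dropped)}"
        )
    return resolved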

The actual ML work — model architecture, training loops, hyperparameter tuning — gets a lot of attention. The data pipeline that feeds it is equally important and tends to get much less scrutiny. Bugs there don't throw exceptions. They just quietly make your model worse.
