Amershi et al. (2019) studied AI development workflows at Microsoft and reported a now-famous number: ~70% of AI development time is spent on data preparation and feature engineering. Not modeling. Not deployment. Not evaluation. Data wrangling.
If you're building anything ML-adjacent in 2026, that number is still mostly true — and it's still mostly avoidable if you treat data infrastructure as a first-class system instead of an afterthought you write inside notebooks.
## Where the 70% actually goes
The survey of 551 ML practitioners pointed at four time sinks:
- Schema reconciliation — same entity, four upstream sources, four shapes
- Imputation and outlier handling — null rates that move week-over-week
- Feature pipelines that drift silently — train/serve skew nobody owns
- Labeling and re-labeling — the cost line nobody budgets for
The single most common antipattern is treating each of these as a notebook exercise instead of a deployable artifact.
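Schema reconciliation is the most mechanical of the four sinks, and it is also the easiest to turn into a deployable artifact instead of ad-hoc notebook cells. A minimal sketch, assuming a hypothetical `Customer` entity that arrives in two different shapes from two upstream sources (the field names and formats here are illustrative, not from the paper):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# One canonical schema, however many upstream shapes exist.
@dataclass(frozen=True)
class Customer:
    customer_id: str
    email: str
    signup_ts: int  # unix seconds, UTC

def from_crm(row: dict) -> Customer:
    # Hypothetical CRM export: "id" + ISO-8601 timestamp.
    ts = int(
        datetime.fromisoformat(row["created_at"])
        .replace(tzinfo=timezone.utc)
        .timestamp()
    )
    return Customer(
        customer_id=str(row["id"]),
        email=row["email"].lower(),
        signup_ts=ts,
    )

def from_billing(row: dict) -> Customer:
    # Hypothetical billing system: "cust_id" + epoch milliseconds.
    return Customer(
        customer_id=row["cust_id"],
        email=row["contact_email"].lower(),
        signup_ts=row["created_ms"] // 1000,
    )
```

The point is not these particular adapters; it's that each adapter is a unit-testable function in a versioned package, so "same entity, four shapes" becomes four small functions converging on one dataclass instead of four divergent notebook cells.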
## What actually cuts the tax
- Treat features as code. Versioned, tested, deployable. Feast, Tecton, or just a Python package with semantic versioning — pick one, don't keep all three.
- Schema-on-write, not schema-on-read. Push validation upstream of the warehouse, not into the model training loop. Great Expectations + dbt tests are the floor.
- Synthetic labels via weak supervision for the long tail. Snorkel-style labeling functions can eliminate 60–80% of the labeling cost for non-safety-critical labels.
- Train/serve parity tests in CI. Compute a feature in your offline pipeline, compute it again at serve time, assert equality on a frozen sample. Most production ML failures live in this gap.
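The parity test in the last bullet is the cheapest of these to adopt: it's an ordinary test that runs in CI. A minimal sketch, assuming a hypothetical feature (`log1p` of 30-day spend) implemented once in the offline pipeline and once in the serving path, with the frozen sample checked into the repo as a fixture:

```python
import math

def offline_log_spend(row: dict) -> float:
    # Batch pipeline version of the feature (e.g. mirrors SQL/dbt logic).
    return math.log1p(row["spend_30d"])

def serving_log_spend(payload: dict) -> float:
    # Online version, reimplemented inside the serving code path.
    return math.log1p(payload["spend_30d"])

# Frozen sample: a small, pinned set of real-ish rows, committed to the repo
# so the test is deterministic across CI runs.
FROZEN_SAMPLE = [
    {"spend_30d": 0.0},
    {"spend_30d": 12.5},
    {"spend_30d": 999.0},
]

def test_train_serve_parity():
    for row in FROZEN_SAMPLE:
        offline = offline_log_spend(row)
        online = serving_log_spend(row)
        # Tolerate only float noise; any logic divergence fails loudly.
        assert math.isclose(offline, online, rel_tol=1e-9), (row, offline, online)
```

In a real system the two functions would live in different packages (or even languages), so the test typically calls the serving endpoint or a shared transform library rather than a local function; the structure is the same.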
## The bigger pattern
Software engineering took 30 years to develop the discipline of build systems, dependency management, and CI. ML engineering is collapsing that timeline into about five. The teams shipping reliable ML in 2026 aren't smarter than the teams that aren't — they've just stopped treating data prep as a research activity and started treating it as a platform.
Citation: Amershi, S., Begel, A., Bird, C., et al. (2019). Software Engineering for Machine Learning: A Case Study. ICSE-SEIP 2019.