
Animesh Choubey

Originally published at Medium

ML Systems: The Part They Skip in the Diagram

How to draw an owl

This classic meme, in all its simplicity, explains more about ML systems in organizations than most articles you’ll find online. On the surface, tutorials promise a neat, step-by-step journey: define the objective, align on success metrics, collect data, train the killer model — and then… “draw the rest of the owl.” Wait — where’s the guidance on the hard part?

The truth is, the missing steps aren’t technical — they’re contextual. Everything changes when humans get involved. Every ML system in a real organization starts with a spreadsheet, a Slack thread, and the inevitable question: “Can we override this if it looks wrong?” The problem isn’t that companies can’t build models; they can. The challenge is that those models must operate in organizations that were never designed to handle probabilistic, uncertain decisions.

Frameworks Don’t Fit Reality

Out-of-the-box frameworks almost never fit real problems. Most prebuilt ML frameworks assume objectives are stable, feedback loops are clean, and success can be neatly expressed as a metric that converges over time. Real businesses don’t work that way: priorities shift mid-quarter, incentives change across stakeholders, and signals are partial, delayed, or misleading. Failures rarely happen because the model is wrong — they occur because models misalign with how decisions are actually made.

Picture this: you’re building a pricing system, a demand forecast, or a ranking algorithm. The internet’s ML “bible” tells you exactly how this should go: define the objective, collect data, train, validate offline, deploy, iterate. Clean. Reproducible. Comforting. But the moment your system meets reality, cracks appear:

  • Pricing managers override prices
  • Promotions distort demand
  • Leadership changes the goal from revenue to margin overnight

The framework didn’t break because it was technically flawed; it broke because it assumed organizational clarity that almost no real system enjoys.

Decision Ownership > Modeling

Mathematics is rarely the bottleneck; decision ownership is.

In most mature organizations, producing a forecast, score, or recommendation is technically solvable — often to a level that is directionally “good enough.” The ecosystem has largely matured: techniques for handling sparsity, seasonality, cold start, delayed labels, and noisy signals are well understood; open-source libraries and cloud platforms abstract away much of the heavy lifting; and entire teams have built similar systems repeatedly across domains.

What remains far harder is defining who is accountable when the model’s output collides with human judgment, legacy processes, or shifting business incentives. When a forecast contradicts a merchant’s intuition, a pricing recommendation threatens a short-term target, or a ranking change risks upsetting a key account, the system enters an organizational gray zone. Decisions get deferred, overridden, or selectively applied — often without being logged or fed back into the model.

This isn’t a technical failure; it’s a structural one. No tutorial explains who has the authority to trust the model, who bears the cost when it’s wrong, or how exceptions propagate through the system.

Without explicit ownership of that decision loop, accuracy degrades into a vanity metric — optimizable, defensible, and largely disconnected from outcomes. The path forward is rarely a better model; it’s clearer interfaces between prediction and action: defined decision rights, auditable overrides, and feedback loops that treat human intervention as signal, not noise.
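
To make that less abstract, here is a minimal sketch in Python of what an auditable override could look like, with hypothetical field names: a record of who intervened, what they applied instead, and why, converted into a signal the next retraining run can consume rather than discard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class OverrideRecord:
    """One human intervention on a model output, logged as signal."""
    prediction_id: str   # links back to the model output being overridden
    model_value: float   # what the model recommended
    applied_value: float # what the organization actually did
    actor: str           # who holds the decision right for this call
    reason: str          # free-text intent: urgency, risk, relationships, downstream effects
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def to_training_signal(record: OverrideRecord) -> dict:
    """Turn an override into something the next retraining run can learn from."""
    return {
        "prediction_id": record.prediction_id,
        "residual": record.applied_value - record.model_value,  # how far the human moved the decision
        "override_reason": record.reason,                       # context the model did not have
        "actor": record.actor,
    }

# Example: a pricing manager lowers a recommended price ahead of a key-account renewal.
override = OverrideRecord(
    prediction_id="price-sku-123-2024-10-17",
    model_value=19.99,
    applied_value=17.49,
    actor="pricing_manager_emea",
    reason="key account renewal next week; holding price to protect the relationship",
)
print(to_training_signal(override))
```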

Humans Aren’t Bugs — They’re Features

Most production ML systems fail not because humans interfere, but because the system was designed as if humans wouldn’t. In reality, overrides, exceptions, and gut-feel interventions are not anomalies; they are expressions of information the model does not yet have — context about urgency, risk, relationships, or downstream consequences that rarely appear in training data. Treating these interventions as noise severs the very feedback loops the system depends on to learn.

Production ML is inherently socio-technical. Predictions do not operate in isolation; they interact with incentives, trust, accountability, and judgment. When human actions are ignored, partially logged, or stripped of intent, the system learns a distorted version of reality — one where recommendations appear to be followed when they are not, or where outcomes seem random despite high offline accuracy.

Designing for human involvement — making overrides explicit, measurable, and auditable — is not a compromise on automation. It is the only way automated systems remain aligned as they scale.
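
Here is one way that distortion can be surfaced, sketched with pandas and illustrative column names: reconcile what the model recommended against what was actually applied, so recommendations that were quietly ignored stop looking like recommendations that were followed.

```python
import pandas as pd

# Hypothetical logs: what the model recommended vs. what was actually applied.
recommendations = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3", "a4"],
    "recommended": [19.99, 12.50, 8.00, 30.00],
})
applied_actions = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3", "a4"],
    "applied": [19.99, 11.00, 8.00, 24.00],
    "override_logged": [False, True, False, False],  # was an explicit override recorded?
})

audit = recommendations.merge(applied_actions, on="prediction_id")
audit["followed"] = (audit["recommended"] - audit["applied"]).abs() < 0.01

# Silent overrides: the action diverged from the recommendation, but nobody logged why.
silent = audit[~audit["followed"] & ~audit["override_logged"]]
print(f"Acceptance rate: {audit['followed'].mean():.0%}")
print(f"Silent overrides: {len(silent)} of {len(audit)}")
```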

Organizations, Not Models, Kill ML Systems

Most ML systems don’t fail at inference time; they fail in meetings. Leaders ask for certainty, while ML offers probabilities. Middle layers optimize for predictability, while ML introduces variance. Each layer is acting rationally within its incentives — but together, they create an environment where probabilistic systems struggle to survive.

The model may be statistically sound, but it enters an organization designed to reward confidence, not calibrated uncertainty.

This is why many production failures have little to do with data drift or model decay. The real friction shows up earlier: when a 70% confidence score meets a leadership culture that expects yes-or-no answers; when a recommendation challenges a plan already socialized upward; when accountability for outcomes is diffuse, but blame for variance is immediate. In such systems, ML is tolerated as long as it confirms intuition — and quietly sidelined the moment it complicates decision-making.

The organization doesn’t reject the model explicitly; it renders it irrelevant.

Systems Move Boundaries; They Don’t Replace Decisions

A production ML system does not have to replace decisions to be valuable; often, its role is to move the decision boundary. If nothing changes when a model is introduced — no thresholds, no defaults, no escalation paths — then the system doesn’t exist yet, regardless of how advanced the modeling is. Many failures begin right here, quietly, under the assumption that better predictions automatically translate into better decisions.
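
As a rough illustration of what “moving the boundary” can mean in code (the thresholds and route names are assumptions, not a prescription): the model never replaces the decision, it only changes which cases are auto-approved, which fall back to the existing default process, and which are escalated to the person who owns the call.

```python
def route_decision(score: float, auto_approve_at: float = 0.90, escalate_below: float = 0.60) -> str:
    """Route a prediction into an action path instead of replacing the decision.

    The thresholds are illustrative; in practice they come out of the tolerable-error
    conversation with the people who own the decision.
    """
    if score >= auto_approve_at:
        return "auto_approve"          # high confidence: the boundary moves, humans skip this case
    if score < escalate_below:
        return "escalate_to_owner"     # low confidence: the existing decision owner takes over
    return "apply_default_process"     # middle band: nothing changes yet, and that is fine

for s in (0.95, 0.72, 0.41):
    print(s, "->", route_decision(s))
```

The exact cut-offs matter less than the fact that they exist, are visible, and can be renegotiated with the decision owner.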

It is also naïve to assume that a few months of modeling experience can — or should — override years of domain judgment. In most organizations, decision authority is earned through context, risk ownership, and credibility, not accuracy metrics. Veterans who still hold that authority are not irrational obstacles to automation; if they remain influential without ever deeply learning the technology, it is precisely because they are solving a different class of problem.

Any system that attempts to hijack that authority outright is almost guaranteed to fail — unless its perceived impact is small enough to be ignored.

LLMs: Amplifiers, Not Fixes

LLMs make it astonishingly easy to build systems that sound intelligent. They collapse months of feature engineering, templating, and UX work into a few prompts and APIs. Prototypes appear overnight, demos impress instantly, and perceived intelligence skyrockets. But beneath the fluency, the same old problems remain — only louder.

Trust is still unresolved. Ownership is still ambiguous. Accountability is still missing.

We have simply reached them faster.

A Framework That Survives Reality

Start with the decision. Define tolerable errors. Plan for overrides. Measure trust before optimizing accuracy. These are not optional; they are the foundation for a system that can survive reality. Accept upfront that organizational dynamics, incentives, and human behavior will force compromises — and design with them, not against them.
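
One way to hold a team to that ordering, sketched with hypothetical names: a small spec that has to be filled in before any modeling begins, capturing the decision, its owner, the tolerable error, the override path, and the trust metric that will be watched before accuracy.

```python
from dataclasses import dataclass

@dataclass
class DecisionSpec:
    """Filled in before modeling starts; the model serves this spec, not the other way around."""
    decision: str         # the concrete decision the output informs
    decision_owner: str   # who can accept, override, or escalate it
    tolerable_error: str  # the error band the business can absorb, in business terms
    override_path: str    # how exceptions are made, logged, and fed back
    trust_metric: str     # what gets watched before accuracy

spec = DecisionSpec(
    decision="weekly reorder quantity per SKU",
    decision_owner="category manager",
    tolerable_error="±15% vs. realized demand; stockouts on A-items are not tolerable",
    override_path="overrides entered in the planning tool with a reason code, reviewed monthly",
    trust_metric="share of recommendations accepted without modification, per category",
)
print(spec)
```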

Real-world ML systems aren’t defined by architecture diagrams, model choice, or algorithmic sophistication — we have thousands of those. They are defined by the decisions they inform, the incentives that shape behavior, and the humans who live with the outcomes. Models are just one component; the system only works when prediction, action, and accountability form a coherent loop.

Until we design for that reality, we will keep shipping models that work — and systems that don’t.

But when we embrace it, ML stops being a technical exercise and becomes a decision-support ecosystem: probabilistic yet trusted, flexible yet auditable, sophisticated yet aligned with human judgment. That is the framework that doesn’t just run — it endures.
