Praveen Govindaraj

Posted on May 21

What falls between

#machinelearning #ai #datascience #agenticai

The seams between agents are where reality leaks in. Most teams don’t have a name for that yet.

Photo by Joseph Frank on Unsplash

Fifth in a short series. The first four pieces were about what production-grade agentic systems require: an honest reckoning with cost (one), a plumbing layer (two), a tolerance for asymmetric risk (three), and a discipline of policy as code (four). This piece is about the place where multi-agent systems most often break, and the reason they break there. It is a place that has no name on most architecture diagrams, and that absence is part of the problem.

If you have ever watched a kitchen pass during dinner service — the long stainless-steel counter where finished plates wait for the runner — you will have noticed something that any cook can tell you, and that no diagram of a restaurant ever captures.

Most of the failures of a busy service do not happen in the cooking. They happen at the pass. A dish sits a minute too long under the heat lamp. A garnish goes on the wrong plate. A ticket gets placed under another ticket and forgotten. The cook did their job. The runner did their job. The thing that failed is the place between them. And the head chef, if they are any good, spends most of their attention not on the cooking but on that two-meter strip of metal where one person’s work becomes another’s.

Agentic systems have a pass. Almost nobody is watching it.

The places no diagram shows

Consider any production workflow that uses more than one agent — which, in 2026, means most of them. You will be shown a diagram. The diagram will have boxes. The boxes will be labelled with what each agent does. There will be arrows between the boxes, indicating that one agent’s output becomes the next one’s input. The diagram will look clean.

The boxes are not the problem. Each agent, taken alone, is usually fine. It has been tested. It has a known input shape and a known output shape. Engineers have prompted it, evaluated it, retried it under load. The box is the smallest possible unit of “this works.”

The arrows are where the work is. The arrows are also, on most diagrams, drawn as straight lines — implying instantaneous, lossless, well-defined transfer. They are not. They are little zones of indeterminacy. Things happen in those arrows that nobody scheduled, that no test covers, that no metric counts.

What happens? A field gets renamed in the upstream schema, but the downstream agent is still looking for the old name. A claim type that nobody trained on shows up, and the upstream extractor returns null in a field that the downstream scorer treats as zero. The policy version was bumped on Tuesday, but only the decide-step was updated; the score-step still applies the old thresholds and the system as a whole now violates the rule it was supposed to enforce. None of these are the agent’s fault. None of these would show up in any of the agents’ unit tests. They live in the arrows.
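Two of these failures fit in a dozen lines. What follows is a minimal sketch, not anyone's production scorer: the field names (`claim_amount`, `amount`), the scoring formula, and both functions are hypothetical, invented only to show how a null or a renamed field turns silently into a zero.

```python
def score_lenient(payload: dict) -> float:
    # The quiet failure: a null or missing amount is coerced to 0.0,
    # so "unknown" becomes "zero risk" without anyone deciding that.
    amount = payload.get("claim_amount") or 0.0
    return min(amount / 10_000, 1.0)


def score_strict(payload: dict) -> float:
    # Same scorer, but null and missing are refused instead of reinterpreted.
    amount = payload.get("claim_amount")
    if amount is None:
        raise ValueError("claim_amount is missing or null; refusing to score")
    return min(amount / 10_000, 1.0)


# Upstream returned null for a claim type it was never trained on.
null_payload = {"claim_id": "C-1", "claim_amount": None}
# Upstream renamed the field; downstream still looks for the old name.
renamed_payload = {"claim_id": "C-2", "amount": 5000.0}

lenient_scores = [score_lenient(null_payload), score_lenient(renamed_payload)]
# lenient_scores == [0.0, 0.0]
```

Both agents pass their own unit tests here. The lenient scorer returns a perfectly valid number for both payloads. Only the strict version makes the seam failure visible, and it does so by refusing to answer.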


I have come to believe this is the single most underweighted truth in agentic engineering today. The hype is about what each agent can do. The reality is about what happens between agents. We are spending almost all of our attention on the cooking and almost none on the pass.

What each agent actually sees

To understand why the seams are so dangerous, you have to be honest about how much an agent doesn’t see.

An agent has a context window. The context window is finite. The things in the context window are the things the agent can reason about. Everything else may as well not exist. This is true for the largest models in the world. It is true for the smallest. It is the ground truth of how language models work, and no amount of marketing about “shared context” or “unified memory” changes it.

When you compose three agents into a workflow, you are not creating a single intelligence with three faculties. You are creating three small, separate intelligences who each have access to a partial view of the world. The extractor knows about PDFs and OCR. The scorer knows about risk models and historical priors. The decision agent knows about approval ladders and audit format. They do not know each other’s domains. They do not, in general, see each other’s reasoning.

What they share is whatever you, the engineer, made an effort to put on both sides of the seam. By default, that is almost nothing. Often it is a single string — an ID that points back to a database row that nobody is loading the same way. The famous “shared context” of multi-agent systems is, in practice, a thin overlap, much smaller than the boxes in the diagram suggest.

This is the architectural lie at the heart of most multi-agent demos. The demo shows three agents working together on a task, and the audience reads “they’re collaborating.” What is actually happening is that one agent is producing some output, and another agent is taking some words from that output and treating them as inputs, and the second agent has no real understanding of what the first agent meant. They are not collaborating. They are passing notes under the desk.

This works fine when the notes are simple and the desk is small. It stops working when the notes are complex and the consequences of misreading them are large. Which is to say: it stops working precisely when you start putting agents in production.

The cost of an unclear handoff

To make this concrete, watch what happens when you let an agent summarise the state of the world in prose, and then hand that prose to another agent.

Here is an extractor agent producing a summary of an incoming claim. It is a fluent summary. It uses words like “straightforward” and “probably approve.” A human reading this would have a clear sense of what the extractor thinks. The fluency is exactly the trap. Because the summary is in prose, and the next agent in the workflow is a language model, the next agent will read the prose and form its own interpretation. And that interpretation is not deterministic.

Run the same prose through the same scoring agent ten times, on the same model, in the same hour, and you will get different scores. Not wildly different — the model is consistent enough at the surface level — but different in the ways that matter. Sometimes the scorer will pick up on “straightforward” and route to auto-approval. Sometimes it will pick up on “probably” and route to human review. Sometimes it will see “basement” and trigger a fraud check pattern that has nothing to do with what the extractor was trying to communicate.

Each of these is, taken alone, a defensible reading. The prose supports all three. The scorer is doing what it was asked to do — interpret the natural-language input. The problem is not the scorer. The problem is the handoff itself.

I have watched teams spend months tuning the prompts of their downstream agents to “be more consistent” without ever realising that the inconsistency is upstream of the prompt. The downstream agent is being asked to interpret an ambiguous artifact. It will, with great fluency, produce different interpretations on different runs. Tuning the agent does not fix this. The fix is to stop handing the agent prose.

Typed seams

Here is what the same handoff looks like when it is treated as a contract instead of a chat message.

The handoff has a name. It has required fields, each with a type. It has a validator that runs at the moment of write, and again at the moment of read. If the upstream agent forgets a required field, the write fails and the workflow halts at that step — not three steps later, when the missing field surfaces as a strange downstream behaviour, but right there, in the agent that produced the bad output.
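A minimal sketch of such a contract, using only the Python standard library. Everything named here is an assumption for illustration: the `ClaimExtraction` contract, its fields, and the `write_handoff` and `read_handoff` helpers are hypothetical, and a real system might reach for a schema library instead of hand-rolled checks.

```python
from dataclasses import MISSING, asdict, dataclass, fields


class HandoffError(ValueError):
    """Raised when a payload does not satisfy the handoff contract."""


@dataclass(frozen=True)
class ClaimExtraction:
    """Hypothetical contract for the extractor -> scorer seam."""
    claim_id: str
    claim_type: str
    amount_cents: int
    schema_version: int = 1


def validate(payload: dict) -> ClaimExtraction:
    declared = {f.name: f for f in fields(ClaimExtraction)}
    unknown = sorted(set(payload) - set(declared))
    if unknown:
        raise HandoffError(f"fields not in contract: {unknown}")
    # A field with no default is required.
    missing = sorted(
        name for name, f in declared.items()
        if f.default is MISSING and name not in payload
    )
    if missing:
        raise HandoffError(f"required fields missing: {missing}")
    for name, value in payload.items():
        expected = declared[name].type
        # Exact type match: None, floats and bools are all rejected for int.
        if type(value) is not expected:
            raise HandoffError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return ClaimExtraction(**payload)


def write_handoff(payload: dict) -> dict:
    """Upstream side: validate before anything leaves the agent."""
    return asdict(validate(payload))


def read_handoff(stored: dict) -> ClaimExtraction:
    """Downstream side: validate again before anything is trusted."""
    return validate(stored)
```

With this in place, `write_handoff({"claim_id": "C-2", "claim_type": "water_damage"})` raises `HandoffError` inside the extractor step, naming `amount_cents` as the missing field, and the scorer never runs on a half-filled record.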

If the downstream agent tries to read a field the contract doesn’t include, that is a compile error. Not a runtime guess. The system refuses to ship a workflow that reads from a handoff that doesn’t define what is being read.

This sounds, to anyone who has shipped a piece of software in the last forty years, deeply unremarkable. Of course you would type your interfaces. Of course you would validate your messages. The IDL pattern is older than most engineers writing AI code. What is remarkable is how rare this is in agentic systems shipped today, and how much surprise teams express when they discover that introducing typed handoffs makes their multi-agent systems suddenly stable.

So write it down. Typed. Named. Versioned. With required fields and a validator that the runtime enforces. The cost of this is small — a few extra lines of declaration per agent boundary. The benefit is that you stop running a coin-toss workflow and start running an engineered one.

An aside, for anyone tempted to say “but the model writes better when you let it talk freely.” This is sometimes true at the level of a single response. It is almost never true at the level of a multi-agent workflow. Free-form prose between agents is a local optimisation that costs the system its global coherence. Pay the small price of a schema. The downstream stability is worth more than the upstream eloquence, by a factor of about a hundred.

The shared spine

Once you have typed handoffs at every seam, a question arises naturally: where do the handoffs live? Who owns them? Who governs them? What happens when one of them needs to change?

The answer is structural, and it is the part of multi-agent design that almost no one has a vocabulary for.

The agents in a working multi-agent system do not actually talk to each other. They talk to a shared spine — a registry that holds the typed handoffs, the audit record, the governance rules, and the coordination protocol. Each agent reads from the spine and writes to the spine. No agent has a direct line to another. The architecture is, to use a phrase from an older era, a hub-and-spoke. The hub is what makes the whole thing work.

This is not a trivial detail. It changes how the system behaves under stress. When you need to change a handoff, you change one declaration in one place, and the runtime can verify that all readers and all writers are still compatible. When you need to audit a decision, you read the spine; the trace is already there, in one record, by name. When you need to add a new agent, you connect it to the spine; you do not need to update three other agents to know about the new one.
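The "change one declaration, verify every reader" property can be sketched in a few lines. This is an in-memory toy under stated assumptions, not a description of any existing platform: the `Spine` class, its method names, and the handoff and agent names in the usage note are all hypothetical.

```python
class Spine:
    """Hypothetical in-memory registry of handoff declarations."""

    def __init__(self):
        self._fields = {}   # handoff name -> {field name: type name}
        self._readers = {}  # handoff name -> {agent name: set of fields read}

    def incompatible_readers(self, handoff: str, new_fields: dict) -> dict:
        """Return {agent: fields it reads that new_fields no longer defines}."""
        broken = {}
        for agent, reads in self._readers.get(handoff, {}).items():
            lost = sorted(reads - set(new_fields))
            if lost:
                broken[agent] = lost
        return broken

    def declare(self, handoff: str, new_fields: dict) -> None:
        """Create or change a handoff; refuse if any registered reader breaks."""
        broken = self.incompatible_readers(handoff, new_fields)
        if broken:
            raise ValueError(f"change to {handoff!r} breaks readers: {broken}")
        self._fields[handoff] = dict(new_fields)

    def register_reader(self, handoff: str, agent: str, reads: set) -> None:
        """Connect an agent; refuse reads the declaration does not define."""
        undefined = sorted(set(reads) - set(self._fields.get(handoff, {})))
        if undefined:
            raise LookupError(f"{agent!r} reads undefined fields: {undefined}")
        self._readers.setdefault(handoff, {})[agent] = set(reads)
```

Declare `claim_extraction` with `claim_id` and `amount_cents`, register a scorer that reads both, and a later attempt to rename `amount_cents` fails at `declare`, before any workflow runs. Adding a new field succeeds, because no existing reader depends on its absence.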

Compare this to the alternative, agents passing prose summaries to each other directly, point to point, and you see the structural fragility immediately. A point-to-point system of n agents has up to n(n − 1) directed connections, one for every ordered pair of agents, and each connection is a seam that can fail. The hub-and-spoke system has n, all centralised, all monitored, all governable. The point-to-point system grows in complexity quadratically. The hub-and-spoke system grows linearly. This difference compounds rapidly as soon as the workflow has more than three steps.
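The arithmetic is worth running once. The sketch below counts directed connections under the worst-case assumption that every agent may hand off to every other agent; real point-to-point workflows are usually sparser, so treat the quadratic figure as an upper bound. The function names are illustrative.

```python
def point_to_point_seams(n: int) -> int:
    # One directed connection for every ordered pair of distinct agents.
    return n * (n - 1)


def hub_and_spoke_seams(n: int) -> int:
    # One connection per agent, each to the shared spine.
    return n


seam_counts = [
    (n, point_to_point_seams(n), hub_and_spoke_seams(n)) for n in (3, 6, 12)
]
# seam_counts == [(3, 6, 3), (6, 30, 6), (12, 132, 12)]
```

At three agents the gap is six seams against three. At a dozen it is 132 against twelve.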

The agents are the tools, and the tools are interchangeable. You can swap one for another, upgrade one, retrain one, replace one with a deterministic rule. What you cannot swap is the spine. Think of it as the bench in a workshop. The bench is the architecture. It is where work meets work. It is the place where the whole system becomes governable, or doesn’t.

The teams that get this right are the teams whose multi-agent systems survive the second year of production. The teams that don’t will spend that second year discovering, painfully, that their dozen agents are actually a dozen unrelated systems sharing nothing but a Slack channel.

A test you can run on Monday

If you want to know whether your team has thought about this — whether you are building a workshop or a Slack channel — there is a simple diagnostic.

Pick any two adjacent agents in your production workflow. Find an engineer who works on the upstream agent. Find a different engineer who works on the downstream agent. Put them in a room. Ask each of them, separately, to write down — on paper, no looking — the exact list of fields that pass between their two agents, with types.

Compare the lists.

If the lists match, your seams are typed; you are running a workshop. If the lists don’t match, you have just found, in five minutes, the place where production will fail next quarter. Almost no team passes this test on the first try. Many teams fail it spectacularly — different field names, different types for the same field, fields one engineer thinks are required and the other thinks are optional, fields one team didn’t know existed.
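Once the two pieces of paper are typed up, the comparison itself is mechanical. A small sketch, assuming each engineer's list is transcribed as a mapping from field name to type name; the function and the example fields are hypothetical.

```python
def compare_field_lists(upstream: dict, downstream: dict) -> dict:
    """Diff two engineers' descriptions of the same handoff."""
    shared = set(upstream) & set(downstream)
    return {
        "only_upstream": sorted(set(upstream) - set(downstream)),
        "only_downstream": sorted(set(downstream) - set(upstream)),
        "type_mismatch": sorted(k for k in shared if upstream[k] != downstream[k]),
    }


# What the upstream engineer wrote down.
upstream_list = {"claim_id": "str", "amount": "float", "claim_type": "str"}
# What the downstream engineer wrote down.
downstream_list = {"claim_id": "str", "amount": "int", "category": "str"}

report = compare_field_lists(upstream_list, downstream_list)
# report == {
#     "only_upstream": ["claim_type"],
#     "only_downstream": ["category"],
#     "type_mismatch": ["amount"],
# }
```

Three empty lists means the seam is agreed. Anything else is the list of places to look before production finds them for you.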

The exercise costs nothing. It does not require a tool, a vendor, or a quarter of investment. It requires a piece of paper and twenty minutes. The fact that almost no team does this is the most damning thing you can say about the current state of agentic engineering practice.

The fix, when you discover the discrepancy, is not a meeting. The fix is to put the handoff in source — typed, named, versioned, validated. So that the next time those two engineers each write down the fields, they are writing the same thing, because they are both reading from the same declaration.

The workshop, not the army

I want to close with an observation about how we talk about multi-agent systems, because I think the language we use is part of why we keep getting the architecture wrong.

The dominant metaphor in agentic marketing, in 2026, is military. We talk about “swarms” of agents. We talk about “armies.” We talk about “agent teams” doing “missions.” The metaphor implies a kind of centralised command — a general handing down orders, a chain of command, a hierarchy of obedience. It also implies, more subtly, that the agents themselves are the locus of capability. The general gives the orders; the agents carry them out.

This metaphor is wrong, and the systems built under it are fragile in proportion to how seriously their architects took it.

The right metaphor is older, and quieter, and almost never used in the marketing. It is the workshop. A workshop has specialists. The specialists do not report to each other. They share a bench. The bench has tools laid out in a known place. There is a master craftsman who keeps the bench in order — sharpens the tools, organises the materials, knows what each specialist needs and ensures it is there. The work flows around the bench, not through any one person.

The locus of capability in a workshop is not the specialists. It is the bench. Replace any specialist with another, and the workshop continues. Lose the bench, and the workshop is finished, no matter how skilled the specialists are.

This is the architecture that survives. It is older than computing, older than industry, older than civilisation as we usually count it. It is the way human work has been coordinated, in every settled tradition, since people have had work to coordinate. The agentic moment is, at its best, a chance to apply this old wisdom to a new substrate. The agents are tools. The workflow is the work. The spine is the bench. Build the bench first.

The previous piece in this series ended with the line “build it like it has to last.”

I want to end this one with the practical corollary, the thing every old craftsman knows and every new engineer has to learn the hard way.

The series will pause here. The pieces describe a discipline that exists, that has been tested, and that is not particularly fashionable. Whether you find a platform that respects it, or build one, or stitch one together from open-source pieces, the test is the same: can two engineers draw the same handoff? The job is to be able to answer yes.