DEV Community

I Run a Fleet of AI Agents in Production — Here's the Architecture That Keeps Them Honest

Mike on February 27, 2026

Everyone's building AI agents. Tutorials show you how to make one. "Build an AI agent in 15 minutes!" Great. Now build twelve of them. Give them ...

Read full post

Matthew Hou • Feb 28

"We're building fleets and forgetting to install brakes" — that stat (88% of orgs had security incidents with AI agents, only 47% monitor them) is damning.

The one-agent-one-job principle is the right call. The Vercel case study backs this up from a different angle: they had one agent with 15 tools at 80% accuracy, cut it to 2 tools and hit 100%. Same model. The failure was in the tool surface, not the reasoning.

Curious about one thing: how do you handle the cases where an agent's one job requires context from another agent's domain? Like if the crash tracker detects a pattern that needs telemetry data to diagnose. Do agents communicate, or does a human bridge the gap? That handoff design is where I've seen most multi-agent systems get messy.

Mike • Mar 1

@signalstack nailed it: we do the same, orchestrator translates between agents using structured summary packets with a strict schema. The receiving agent never seees raw output from the sender, just typed parameters. Honestly this whole cross-domain handoff topic deserves its own article.

Guilherme Zaia • Feb 28

The unsexy truth: your supervisor pattern only works because you kept orchestration deterministic. Most teams fail here—they LLM-route tasks, then wonder why prod behavior is stochastic. One gap: you mention multi-LLM councils for high-stakes decisions but skip the latency cost. Council consensus (3+ models voting) adds 2-5s per decision. For crash triage that's fine. For real-time telemetry? You need fallback to single-model with confidence thresholds. Also—your $0.02/task assumes agents don't retry on transient failures. What's your exponential backoff strategy? In .NET distributed systems, we'd use Polly with jittered retry + circuit breakers. Without that, one flaky API turns your cost model into roulette. The 'padded room' cliffhanger better include filesystem sandboxing—agents writing to shared volumes is the #1 way orgs turn 'no credentials' into 'oops, deleted logs'.

Mike • Mar 1

Fair points. Worth noting this is an in-house agent system, not user-facing — so latency isn't a hard constraint. That said, council voting only triggers for high-stakes decisions; most tasks just use schema validation + confidence thresholds on a single model. Retries and circuit breaking happen at the proxy level, transparent to the agent. Filesystem sandboxing is covered in part 2 — agents get ephemeral scratch space only.

TylerDurden1983 • Mar 17

This setup is exactly what I've envision for mechanicalsheep.com, fleets of agents rented out to perform jobs for buyers. Just launched and looking for early agent builders! The idea is offset/cover the overhead cost of your agents, optimally make a profit.

signalstack • Feb 28

The cross-agent context thing is where 'one job' architectures get complicated in practice. Hit this exact problem running a similar setup.

What worked: the orchestrator never passes raw agent output directly to another agent. It sends structured summary packets — a defined schema that strips the crash tracker's output down to just [pattern_type, affected_endpoint, timestamp_range] before injecting it into the telemetry analyzer's context. The receiving agent doesn't know it came from another agent. It just got parameters.

This matters because when you let agents pass full context to each other, the receiving agent latches onto whatever the sending agent was most confident about — including stuff that's totally irrelevant to its job. You end up with reasoning chains: Agent A's conclusion becomes Agent B's premise becomes Agent C's hallucinated 'fact.' The summary packet forces you to be explicit about what actually transfers at each handoff.

Second benefit: it keeps each agent's prompt surface minimal. The telemetry analyzer shouldn't know about crash classification logic. When it does, you get weird bleed.

For the genuinely ambiguous cross-domain cases, humans bridge the gap. But for structured handoffs, the orchestrator-as-translator pattern has been the cleanest approach I've found.

klement Gunndu • Feb 27

The cost engineering breakdown is the most useful part — running 80% of agents on Haiku-tier and reserving frontier for the 20% that need reasoning is exactly how we got our per-task cost under control too.

Mike • Feb 28

Yeah, the surprising part is how many tasks run perfectly fine on the cheapest tier — even Gemini 2.5 Flash handles crash classification, threshold alerts, and structured extraction just fine. Once you audit what actually needs reasoning vs. pattern matching, the frontier calls shrink fast.