
I Run a Fleet of AI Agents in Production — Here's the Architecture That Keeps Them Honest

Mike on February 27, 2026

Everyone's building AI agents. Tutorials show you how to make one. "Build an AI agent in 15 minutes!" Great. Now build twelve of them. Give them ...

Guilherme Zaia

The unsexy truth: your supervisor pattern only works because you kept orchestration deterministic. Most teams fail here: they LLM-route tasks, then wonder why prod behavior is stochastic.

One gap: you mention multi-LLM councils for high-stakes decisions but skip the latency cost. Council consensus (3+ models voting) adds 2-5s per decision. For crash triage that's fine. For real-time telemetry? You need a fallback to single-model with confidence thresholds.

Also: your $0.02/task figure assumes agents don't retry on transient failures. What's your exponential backoff strategy? In .NET distributed systems we'd use Polly with jittered retry plus circuit breakers; without that, one flaky API turns your cost model into roulette.

The 'padded room' cliffhanger had better include filesystem sandboxing. Agents writing to shared volumes is the #1 way orgs turn 'no credentials' into 'oops, deleted logs'.
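
Since not everyone here is on .NET, here's roughly the same policy as a minimal Python sketch. Class names, thresholds, and timings are all illustrative, not from the article's stack:

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised while the breaker is open: fail fast instead of hammering a flaky API."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures   # consecutive failures before tripping
        self.reset_after = reset_after     # seconds to stay open before probing again
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open")
            self.opened_at = None          # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


def call_with_backoff(fn, attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    """Full-jitter backoff: sleep a random value in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise                          # never retry into an open circuit
        except Exception:
            if n == attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** n)))
```

The jitter matters as much as the backoff: a fleet of agents retrying in lockstep is its own thundering herd.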

Mike

Fair points. Worth noting this is an in-house agent system, not user-facing — so latency isn't a hard constraint. That said, council voting only triggers for high-stakes decisions; most tasks just use schema validation + confidence thresholds on a single model. Retries and circuit breaking happen at the proxy level, transparent to the agent. Filesystem sandboxing is covered in part 2 — agents get ephemeral scratch space only.
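
For the curious, a rough sketch of how the two paths compose. The models are stand-in callables and the threshold is made up; schema validation is elided:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    label: str
    confidence: float


# Stand-ins: any callable mapping a task to a Verdict works here.
cheap_model: Callable[[str], Verdict] = lambda task: Verdict("crash_loop", 0.91)
council: list[Callable[[str], Verdict]] = [cheap_model, cheap_model, cheap_model]

CONFIDENCE_FLOOR = 0.85  # illustrative; in practice tuned per task type


def decide(task: str) -> Verdict:
    verdict = cheap_model(task)
    if verdict.confidence >= CONFIDENCE_FLOOR:
        return verdict  # fast path: a single validated model covers most tasks
    # Low confidence (or flagged high-stakes): majority vote across the council.
    votes = [model(task) for model in council]
    top_label, _ = Counter(v.label for v in votes).most_common(1)[0]
    return max((v for v in votes if v.label == top_label), key=lambda v: v.confidence)
```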

Matthew Hou

"We're building fleets and forgetting to install brakes" — that stat (88% of orgs had security incidents with AI agents, only 47% monitor them) is damning.

The one-agent-one-job principle is the right call. The Vercel case study backs this up from a different angle: they had one agent with 15 tools at 80% accuracy, cut it to 2 tools and hit 100%. Same model. The failure was in the tool surface, not the reasoning.

Curious about one thing: how do you handle the cases where an agent's one job requires context from another agent's domain? Like if the crash tracker detects a pattern that needs telemetry data to diagnose. Do agents communicate, or does a human bridge the gap? That handoff design is where I've seen most multi-agent systems get messy.

Mike

@signalstack nailed it: we do the same. The orchestrator translates between agents using structured summary packets with a strict schema, so the receiving agent never sees raw output from the sender, just typed parameters. Honestly, this whole cross-domain handoff topic deserves its own article.

klement Gunndu

The cost engineering breakdown is the most useful part — running 80% of agents on Haiku-tier and reserving frontier for the 20% that need reasoning is exactly how we got our per-task cost under control too.

Mike

Yeah, the surprising part is how many tasks run fine on the cheapest tier: even Gemini 2.5 Flash handles crash classification, threshold alerts, and structured extraction. Once you audit what actually needs reasoning vs. pattern matching, the frontier calls shrink fast.
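
The audit basically collapses into a routing table. Task names here are hypothetical and the frontier model id is a placeholder:

```python
# Cheap tier for pattern matching; frontier only where the audit says "needs reasoning".
MODEL_BY_TASK = {
    "crash_classification":  "gemini-2.5-flash",
    "threshold_alerts":      "gemini-2.5-flash",
    "structured_extraction": "gemini-2.5-flash",
    "root_cause_analysis":   "frontier-reasoning-model",  # placeholder id
}


def pick_model(task_type: str) -> str:
    # Default to the cheap tier; the expensive tier is opt-in per task type.
    return MODEL_BY_TASK.get(task_type, "gemini-2.5-flash")
```

Defaulting to cheap and opting in to frontier keeps new task types from silently burning money.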

signalstack

The cross-agent context thing is where 'one job' architectures get complicated in practice. Hit this exact problem running a similar setup.

What worked: the orchestrator never passes raw agent output directly to another agent. It sends structured summary packets — a defined schema that strips the crash tracker's output down to just [pattern_type, affected_endpoint, timestamp_range] before injecting it into the telemetry analyzer's context. The receiving agent doesn't know it came from another agent. It just got parameters.
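
In code the packet is nothing fancy. Something like this, with the field types and the raw payload shape assumed (ours carries a few more fields):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashSummaryPacket:
    """The only thing the telemetry analyzer ever sees from the crash tracker."""
    pattern_type: str                 # e.g. "error_spike", "regression"
    affected_endpoint: str            # e.g. "/api/v2/sync"
    timestamp_range: tuple[str, str]  # ISO-8601 (start, end)


def to_packet(raw: dict) -> CrashSummaryPacket:
    # Orchestrator-side translation: anything outside the schema is dropped here,
    # so the sender's reasoning and confidence never leak into the receiver's prompt.
    return CrashSummaryPacket(
        pattern_type=raw["pattern_type"],
        affected_endpoint=raw["affected_endpoint"],
        timestamp_range=(raw["window"]["start"], raw["window"]["end"]),
    )
```

Freezing the dataclass is deliberate: nothing downstream can mutate the handoff record.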

This matters because when you let agents pass full context to each other, the receiving agent latches onto whatever the sending agent was most confident about — including stuff that's totally irrelevant to its job. You end up with reasoning chains: Agent A's conclusion becomes Agent B's premise becomes Agent C's hallucinated 'fact.' The summary packet forces you to be explicit about what actually transfers at each handoff.

Second benefit: it keeps each agent's prompt surface minimal. The telemetry analyzer shouldn't know about crash classification logic. When it does, you get weird bleed.

For the genuinely ambiguous cross-domain cases, humans bridge the gap. But for structured handoffs, the orchestrator-as-translator pattern has been the cleanest approach I've found.

TylerDurden1983

This setup is exactly what I've envisioned for mechanicalsheep.com: fleets of agents rented out to perform jobs for buyers. Just launched and looking for early agent builders! The idea is to offset or cover the overhead cost of your agents, and ideally make a profit.