DEV Community

I Run a Fleet of AI Agents in Production — Here's the Architecture That Keeps Them Honest

Mike on February 27, 2026

Everyone's building AI agents. Tutorials show you how to make one. "Build an AI agent in 15 minutes!" Great. Now build twelve of them. Give them ...
Vic Chen

The observability layer you describe is exactly what gets skipped in early agent deployments. The pattern of having each agent emit structured logs with a "reasoning trace" field is underrated - it lets you do post-hoc debugging without having to replay prompts. One thing I would add: have you experimented with a "disagreement protocol" where agents can flag low-confidence decisions for human review? In my experience, the failure mode is usually silent overconfidence, not obvious errors.
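The reasoning-trace idea fits in a few lines of logging code. A minimal sketch of what such a structured record could look like — field names and the `crash_triage_agent` logger are illustrative, not from the article:

```python
import json
import logging

logger = logging.getLogger("crash_triage_agent")

def log_decision(task_id, decision, confidence, reasoning_trace):
    """Emit one structured log record per agent decision.

    The reasoning_trace field captures the model's intermediate steps,
    so decisions can be debugged post hoc without replaying the prompt.
    """
    record = {
        "task_id": task_id,
        "decision": decision,
        "confidence": confidence,
        "reasoning_trace": reasoning_trace,  # list of short step strings
    }
    logger.info(json.dumps(record))
    return record
```

Because each record is a single JSON line, post-hoc debugging is just grepping and diffing logs rather than re-running prompts.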

Mike

Silent overconfidence is exactly the failure mode -- we hit it with the crash tracker when a model update silently shifted classification boundaries. No errors, just gradually worse accuracy. Caught it only through prompt-response log diffing.

We don't have a formal disagreement protocol yet, but that's a great idea. Right now for high-stakes decisions we run cross-evaluation with multiple LLMs (consensus, structured voting, adversarial debate) and escalate to humans when confidence is low. A more explicit "flag and defer" mechanism for low-confidence outputs would be a natural next step.
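The consensus-plus-escalation logic can be sketched compactly. This is my own minimal take on structured voting with a human fallback — the confidence floor and tie-breaking rules are illustrative assumptions, not the actual production logic:

```python
from collections import Counter

CONFIDENCE_FLOOR = 0.7  # assumed threshold, not from the article

def council_decision(votes):
    """votes: list of (label, confidence) pairs, one per model.

    Majority label wins; if there is no strict majority, or the
    winners' mean confidence is below the floor, defer to a human.
    """
    counts = Counter(label for label, _ in votes)
    label, n = counts.most_common(1)[0]
    if n <= len(votes) // 2:
        return ("escalate_to_human", None)  # no strict majority
    confs = [c for vote_label, c in votes if vote_label == label]
    mean_conf = sum(confs) / len(confs)
    if mean_conf < CONFIDENCE_FLOOR:
        return ("escalate_to_human", mean_conf)  # agreement, but weak
    return (label, mean_conf)
```

The key property is that both failure modes (models disagreeing, and models agreeing without conviction) route to the same "flag and defer" path.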

I cover the multi-LLM evaluation setup in Part 2.

Vic Chen

The adversarial debate approach is genuinely underrated for catching silent overconfidence -- it forces divergence into the open instead of letting consensus paper over it.

We ran into a very similar structural problem building the data pipeline at 13F Insight. When you're aggregating 13F filings from multiple data vendors, you occasionally get different position numbers for the same fund and quarter. The naive move is to pick the most recent source or take an average. But that's exactly wrong -- it trains your system to silently resolve disagreements instead of surfacing them.

What actually worked: if two sources diverge by more than some threshold, the record gets flagged and deferred for human review rather than auto-resolved. It's the same instinct as your 'escalate to humans when confidence is low' -- the system should be loudly uncertain rather than quietly wrong.
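In code, that flag-and-defer rule is tiny, which is part of why it's easy to skip. A sketch of the pattern — the 2% relative threshold and the return shape are my illustrative choices, not 13F Insight's actual values:

```python
def reconcile(values, rel_threshold=0.02):
    """values: position numbers for the same fund/quarter from
    different vendors.

    If the relative spread across vendors exceeds the threshold,
    flag the record for human review instead of auto-resolving.
    """
    lo, hi = min(values), max(values)
    if lo == 0 or (hi - lo) / abs(lo) > rel_threshold:
        return {"status": "needs_review", "values": values}
    # sources agree within tolerance; any of them is acceptable
    return {"status": "resolved", "value": values[0]}
```

Note there's deliberately no averaging branch: the only outcomes are "close enough to accept" or "loudly uncertain."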

The log diffing approach you mentioned for catching boundary shifts is something I want to steal for the pipeline. The failure mode where accuracy degrades without any error signal is the one that's hardest to defend against operationally.

Vic Chen

The prompt-response log diffing approach for catching silent boundary shifts is smart — that's exactly the kind of thing that's invisible to standard metrics. No errors, no latency spikes, just quietly degrading quality. In a financial context that's especially dangerous because the model might still sound confident while misclassifying edge cases that only matter at the tail.

The structured voting + adversarial debate setup is the right direction for high-stakes calls. One thing I've found useful: tracking inter-model disagreement rate over time as a leading indicator. If your ensemble starts agreeing less frequently on a class of inputs, it often precedes detectable accuracy drops by several days. Cheaper than waiting for users to complain.
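The disagreement-rate tracker is cheap to implement as a rolling window. A sketch of what I mean — window size and alert threshold here are illustrative, tune them to your traffic:

```python
from collections import deque

class DisagreementMonitor:
    """Rolling inter-model disagreement rate as a leading indicator
    of silent quality drift."""

    def __init__(self, window=500, alert_rate=0.15):
        self.recent = deque(maxlen=window)  # True = models disagreed
        self.alert_rate = alert_rate

    def record(self, labels):
        """labels: one predicted label per model for the same input."""
        self.recent.append(len(set(labels)) > 1)

    def rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_alert(self):
        return self.rate() > self.alert_rate
```

You get a drift signal from data you're already producing (the ensemble's votes), with no labeled ground truth required.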

Will check out Part 2 — curious how you handle the credential/permission scoping when agents are generating their own fixes. That's a non-trivial trust problem.

Matthew Hou

"We're building fleets and forgetting to install brakes" — that stat (88% of orgs had security incidents with AI agents, only 47% monitor them) is damning.

The one-agent-one-job principle is the right call. The Vercel case study backs this up from a different angle: they had one agent with 15 tools at 80% accuracy, cut it to 2 tools and hit 100%. Same model. The failure was in the tool surface, not the reasoning.

Curious about one thing: how do you handle the cases where an agent's one job requires context from another agent's domain? Like if the crash tracker detects a pattern that needs telemetry data to diagnose. Do agents communicate, or does a human bridge the gap? That handoff design is where I've seen most multi-agent systems get messy.

Mike

@signalstack nailed it: we do the same — the orchestrator translates between agents using structured summary packets with a strict schema. The receiving agent never sees raw output from the sender, just typed parameters. Honestly this whole cross-domain handoff topic deserves its own article.

Guilherme Zaia

The unsexy truth: your supervisor pattern only works because you kept orchestration deterministic. Most teams fail here—they LLM-route tasks, then wonder why prod behavior is stochastic.

One gap: you mention multi-LLM councils for high-stakes decisions but skip the latency cost. Council consensus (3+ models voting) adds 2-5s per decision. For crash triage that's fine. For real-time telemetry? You need fallback to single-model with confidence thresholds.

Also—your $0.02/task assumes agents don't retry on transient failures. What's your exponential backoff strategy? In .NET distributed systems, we'd use Polly with jittered retry + circuit breakers. Without that, one flaky API turns your cost model into roulette.

The 'padded room' cliffhanger better include filesystem sandboxing—agents writing to shared volumes is the #1 way orgs turn 'no credentials' into 'oops, deleted logs'.
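For anyone outside .NET who wants the retry half of that Polly setup, here's a minimal language-agnostic sketch of full-jitter exponential backoff (attempt counts and delays are illustrative defaults, and a real deployment would pair this with a circuit breaker):

```python
import random
import time

def retry_with_jitter(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `call` (a zero-arg function that raises on transient
    failure) with full-jitter exponential backoff.

    Full jitter sleeps a uniform random amount in [0, base * 2^attempt],
    which spreads out retry storms from many concurrent agents.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted the budget; surface the failure
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Capping `max_attempts` is what keeps the cost model bounded: a flaky API costs at most 4x per task, not open-ended roulette.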

Mike

Fair points. Worth noting this is an in-house agent system, not user-facing — so latency isn't a hard constraint. That said, council voting only triggers for high-stakes decisions; most tasks just use schema validation + confidence thresholds on a single model. Retries and circuit breaking happen at the proxy level, transparent to the agent. Filesystem sandboxing is covered in part 2 — agents get ephemeral scratch space only.

signalstack

The cross-agent context thing is where 'one job' architectures get complicated in practice. Hit this exact problem running a similar setup.

What worked: the orchestrator never passes raw agent output directly to another agent. It sends structured summary packets — a defined schema that strips the crash tracker's output down to just [pattern_type, affected_endpoint, timestamp_range] before injecting it into the telemetry analyzer's context. The receiving agent doesn't know it came from another agent. It just got parameters.

This matters because when you let agents pass full context to each other, the receiving agent latches onto whatever the sending agent was most confident about — including stuff that's totally irrelevant to its job. You end up with reasoning chains: Agent A's conclusion becomes Agent B's premise becomes Agent C's hallucinated 'fact.' The summary packet forces you to be explicit about what actually transfers at each handoff.
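A summary packet like that is just a small typed record plus a renderer. A sketch under my own assumptions (the field types and helper names are illustrative; only the three field names come from the comment above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrashSummaryPacket:
    """Typed handoff record. Everything else in the crash tracker's
    raw output is deliberately dropped before the orchestrator
    injects this into the telemetry analyzer's context."""
    pattern_type: str
    affected_endpoint: str
    timestamp_range: tuple  # (start_iso, end_iso)

def to_prompt_params(packet):
    """Render the packet as neutral parameters. The receiving agent
    never learns that another agent produced them."""
    return {
        "pattern_type": packet.pattern_type,
        "affected_endpoint": packet.affected_endpoint,
        "timestamp_range": list(packet.timestamp_range),
    }
```

Making the dataclass frozen and exhaustive is the point: anything not named in the schema physically cannot leak from Agent A's reasoning into Agent B's premises.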

Second benefit: it keeps each agent's prompt surface minimal. The telemetry analyzer shouldn't know about crash classification logic. When it does, you get weird bleed.

For the genuinely ambiguous cross-domain cases, humans bridge the gap. But for structured handoffs, the orchestrator-as-translator pattern has been the cleanest approach I've found.

klement Gunndu

The cost engineering breakdown is the most useful part — running 80% of agents on Haiku-tier and reserving frontier for the 20% that need reasoning is exactly how we got our per-task cost under control too.

Mike

Yeah, the surprising part is how many tasks run perfectly well on the cheapest tier — even Gemini 2.5 Flash handles crash classification, threshold alerts, and structured extraction without issue. Once you audit what actually needs reasoning vs. pattern matching, the frontier calls shrink fast.
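That audit often ends up encoded as a simple routing table. A sketch of the idea — the task names echo this thread, but the model identifiers and the routing function are placeholders, not the actual production router:

```python
# Tasks that are pattern matching, not reasoning, go to the cheap tier.
CHEAP_TASKS = {"crash_classification", "threshold_alert", "structured_extraction"}

def pick_model(task_type, needs_reasoning=False):
    """Route pattern-matching tasks to the cheapest tier and reserve
    the frontier model for tasks that genuinely need reasoning."""
    if task_type in CHEAP_TASKS and not needs_reasoning:
        return "cheap-tier-model"
    return "frontier-model"
```

The `needs_reasoning` escape hatch matters: the same task type can occasionally hit an ambiguous case that's worth a frontier call.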