Mike
I Run a Fleet of AI Agents in Production — Here's the Architecture That Keeps Them Honest

Duck mission control: specialized rubber ducks in padded cubicles

Everyone's building AI agents. Tutorials show you how to make one. "Build an AI agent in 15 minutes!" Great. Now build twelve of them. Give them access to your analytics, your crash reports, your codebase, your telemetry pipeline, and your user acquisition channels. Run them every day. Sleep well at night.

That's a different tutorial. And judging by the numbers, most people are skipping it: according to the State of AI Agent Security 2026 report, 88% of organizations reported confirmed or suspected security incidents involving AI agents in the past year, while only 47% of deployed agents receive any active monitoring. We're building fleets and forgetting to install brakes.

I built a company-wide system of AI agents — not a chatbot, not a copilot, a fleet of about a dozen specialized bots running hundreds of tasks per day across almost every team. Analytics, crash monitoring, code review, telemetry analysis, user channel scanning. Each one has a job. None of them have credentials.

Here's how the architecture works.

One Agent, One Job

The first design decision was the most important: no general-purpose agents.

It's tempting to build one smart agent that can "do everything." Query analytics, check crash reports, review code, scan forums. Give it a massive prompt, a dozen tools, and broad API credentials. It'll figure it out.

It will also figure out how to do things you never intended. The blast radius of a general-purpose agent is your entire infrastructure.

Instead, every agent in the system has exactly one responsibility:

  • Crash tracker — monitors crash reporting services, classifies crash patterns, flags regressions
  • Analytics agent — queries dashboards, spots anomalies, generates reports
  • Telemetry analyzer — processes app telemetry, identifies performance degradation
  • Code reviewer — scans for quality issues, suggests improvements
  • Channel scanner — watches user acquisition streams (forums, social media) for sentiment and opportunities
  • PR creator — takes findings from other agents and autonomously drafts pull requests

The orchestrator dispatches. Specialized agents execute. This is the supervisor-agent pattern. The routing layer — which agent gets which task — is deterministic: config-driven rules, not LLM reasoning. You don't want stochastic decision-making in the control plane. But for high-stakes decisions (is this anomaly real? should we alert?), the orchestrator can invoke multi-LLM evaluation — council discussions, structured voting, adversarial debate — before acting. Deterministic routing, intelligent validation.

It works for the same reason microservices work: small, focused units are easier to test, monitor, debug, and — crucially — contain when they misbehave.

A crash tracker that goes haywire can't accidentally query your revenue data. It doesn't have access. It doesn't even know revenue data exists.

Yes, squint at this and it looks like a job queue with fancy workers. That's intentional. The orchestration layer is deliberately boring: deterministic routing, structured queues, config-driven dispatch. The workers are the non-deterministic part, and the architecture's entire job is containing that non-determinism. Treating agents as regular distributed-systems citizens — with all the operational discipline that implies — is what makes them safe to run unsupervised.
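The deterministic routing described above can be sketched in a few lines. This is a minimal illustration, not the production code: the table contents, task-type names, and function names are hypothetical, but the shape is the point — the control plane is a lookup, not a model call.

```python
# Hypothetical sketch of config-driven dispatch: the task type alone
# decides which agent runs. No LLM reasoning in the control plane.
# ROUTES and route_task are illustrative names, not from the article.

ROUTES = {
    "crash_report": "crash-tracker",
    "metrics_query": "analytics-agent",
    "telemetry_batch": "telemetry-analyzer",
    "code_scan": "code-reviewer",
}

def route_task(task: dict) -> str:
    """Return the single agent responsible for this task type."""
    agent = ROUTES.get(task["type"])
    if agent is None:
        # Unknown task types fail loudly instead of guessing.
        raise ValueError(f"no agent registered for task type {task['type']!r}")
    return agent
```

Because routing is a plain mapping, it can be unit-tested exhaustively and diffed in code review — properties you give up the moment a model picks the route.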

Not Every Agent Needs a Frontier Brain

Here's where cost engineering comes in. People default to running every agent on the most expensive model available. That's like hiring a senior architect to sort your mail.

A crash log classifier? Runs fine on a small model — Haiku-tier or open-weight. It's pattern matching against known categories — fast, cheap, reliable. The telemetry analyzer that just flags threshold breaches? Same tier.

The analysis synthesizer that takes outputs from six agents and produces a coherent executive summary? That one gets the frontier model. The PR creator that needs to understand code context and write meaningful commit messages? Frontier.

When 80% of your fleet runs on models that cost 1/50th of the frontier tier, your average cost per task drops dramatically. The expensive models earn their cost on the 20% of tasks that actually need reasoning. Everything else is glorified JSON transformation, and you should price it accordingly.
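A tier assignment like this can live in the same kind of static config as the routing table. The sketch below uses placeholder tier labels and agent names; the real mapping and model choices are whatever your own audit of "reasoning vs. pattern matching" produces.

```python
# Illustrative model-tier config: cheap models for classification and
# threshold checks, frontier models only where synthesis or code
# understanding is required. All names here are placeholders.

MODEL_TIERS = {
    "crash-tracker": "small",        # pattern matching vs. known categories
    "telemetry-analyzer": "small",   # threshold breach detection
    "channel-scanner": "small",
    "analysis-synthesizer": "frontier",  # cross-agent executive summaries
    "pr-creator": "frontier",            # needs real code context
}

def pick_model(agent: str) -> str:
    # Default to the cheap tier: an agent must be explicitly promoted,
    # never implicitly given the expensive model.
    return MODEL_TIERS.get(agent, "small")
```

Defaulting to the cheap tier inverts the usual failure mode: forgetting to configure an agent costs you accuracy on one task, not a surprise invoice.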

For context: the fleet's average cost per task is around $0.02. Frontier model calls average $0.15 each, but they're only 20% of volume. The monthly bill for running the entire fleet — hundreds of tasks per day — stays under $500. Compare that to a single senior engineer's daily rate.

The Padded Room with a Mail Slot

This is the part that makes people uncomfortable, and it's also the part that lets me sleep at night.

Every agent lives in a container with no outbound network access except to its local sidecar proxy. No API keys, no tokens, no direct access to any service. The container can compute and talk to exactly one thing: the proxy on loopback.

If you've worked with service meshes (Envoy, Istio), the pattern is familiar — a sidecar proxy sits next to each agent container and mediates all external communication. The agent calls proxy/analytics/query. The proxy injects authentication, forwards the request to the actual analytics service, gets the response, strips any auth metadata, and returns clean data to the agent.

The agent never sees a credential. It can still trigger actions that use credentials — that's delegated authority, and it's real power. But the agent can't exfiltrate tokens, can't connect to unexpected services, and can't expand its own permissions. The proxy enforces rate limits, request quotas, and maximum response sizes per agent role. If the crash tracker suddenly starts making 10x its normal request volume, the proxy throttles it before it overwhelms downstream systems.

Think of it as a padded room with a mail slot. The agent slides requests through the slot. Answers come back. But the door doesn't open. The agent doesn't know what's on the other side of the wall. It doesn't even know which building it's in.

Here's what a request looks like through the mail slot:

{
  "action": "query",
  "service": "analytics",
  "params": { "metric": "dau", "range": "7d" },
  "workflow_id": "wf-7829",
  "agent": "analytics-agent"
}

The proxy validates this against the agent's role definition, injects auth, forwards it, and returns clean data. An unknown service value? Rejected. An action not in the agent's role? Rejected. Rate limit exceeded? Queued or rejected. The agent doesn't get an error message explaining why — it just gets "not available." This minimizes service-discovery leakage — the agent can't even enumerate what endpoints exist.
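The proxy-side gate can be sketched as follows. The role definitions and the opaque "not available" rejection are modeled on the description above; field names match the JSON request, but everything else is an assumed, simplified implementation.

```python
# Minimal sketch of the proxy's role check. One role per agent, listing
# exactly which services and actions it may touch. ROLES, validate, and
# handle are illustrative names.

ROLES = {
    "analytics-agent": {"services": {"analytics"}, "actions": {"query"}},
    "crash-tracker": {"services": {"crash-reports"}, "actions": {"query", "list"}},
}

def validate(request: dict) -> bool:
    """True only if the agent's role permits this exact service + action."""
    role = ROLES.get(request.get("agent"))
    if role is None:
        return False
    return (request.get("service") in role["services"]
            and request.get("action") in role["actions"])

def handle(request: dict) -> dict:
    if not validate(request):
        # Deliberately uninformative: no hint about what exists or why
        # the call failed, so the agent can't enumerate the service map.
        return {"status": "not available"}
    # ...inject auth, forward upstream, strip auth metadata, return data...
    return {"status": "ok"}
```

Note that the rejection branch returns the same answer for "service doesn't exist," "action not in role," and "unknown agent" — that uniformity is what prevents discovery by probing.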

Here's the flow visually:

┌─────────────────────┐    loopback only     ┌─────────────────────┐
│   Agent Container   │ ──────────────────→  │    Sidecar Proxy    │
│                     │  proxy/analytics/    │                     │
│ • No network egress │      query           │ • Validates role    │
│ • No credentials    │                      │ • Injects auth      │
│ • No service        │ ←──────────────────  │ • Rate limits       │
│   discovery         │  clean JSON data     │ • Strips metadata   │
│                     │                      │ • Logs everything   │
└─────────────────────┘                      └────────┬────────────┘
                                                      │
                                                      │ authenticated
                                                      │ request
                                                      ▼
                                             ┌─────────────────────┐
                                             │  External Service   │
                                             │  (Analytics, Git,   │
                                             │   Crash Reporting)  │
                                             └─────────────────────┘

This consistent data interface works as a universal abstraction layer. Whether the underlying source is a SQL database, an Elasticsearch cluster, a third-party API, or a codebase repository — the agent queries the same proxy interface. The proxy translates.

Coding Agents as CLI Subprocesses

One pattern I didn't expect to use so heavily: running a coding agent CLI as a subprocess inside agent workflows.

Some agents don't need to be LLM wrappers themselves. The code review agent, for example, identifies areas for improvement using a cheap model, then invokes a coding agent via CLI to actually understand the code context, generate fixes, and create PRs. The agent orchestrates; the coding CLI does the heavy lifting.

This subprocess runs in its own sandbox with hard limits: max runtime, max tokens, max diff size, read-only access to the repo (writes go through a staging area), and a forbidden-paths list that includes auth modules and CI configs. The coding agent can propose new test cases alongside its changes, but it can't modify existing tests or test infrastructure — it can't "fix" a failing test by weakening the assertion.
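A rough shape for those hard limits, assuming a generic CLI: the command name `coding-agent` and the specific limits below are placeholders, and the diff gate is a simplified stand-in for whatever your staging area enforces.

```python
# Sketch of invoking a coding-agent CLI as a bounded subprocess, plus a
# gate on the diff it proposes. "coding-agent" and all limits are
# hypothetical; existing-test protection would be a separate check.
import subprocess

MAX_RUNTIME_S = 300      # hard wall-clock limit for the CLI
MAX_DIFF_LINES = 400     # reject oversized diffs outright
FORBIDDEN_PATHS = ("auth/", "ci/", ".github/workflows/")

def run_coding_cli(prompt: str, repo: str) -> str:
    """Run the CLI with a timeout; raises TimeoutExpired if it hangs."""
    result = subprocess.run(
        ["coding-agent", "--repo", repo, "--prompt", prompt],
        capture_output=True, text=True, timeout=MAX_RUNTIME_S,
    )
    return result.stdout

def diff_allowed(changed_files: list, added_lines: int) -> bool:
    """Gate a proposed diff before it leaves the staging area."""
    if added_lines > MAX_DIFF_LINES:
        return False
    return not any(f.startswith(FORBIDDEN_PATHS) for f in changed_files)
```

The diff gate runs outside the sandbox, on the sandbox's output — so even a subprocess that ignores its instructions can't land a change that touches the forbidden paths.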

The PR creator bot works similarly — it collects findings from multiple agents, synthesizes them, then invokes the coding CLI to draft the actual changes with full codebase context. The result: autonomous bots that search for improvements, draft fixes, and open PRs — all without a human writing a single line of code.

Humans still review and merge. Obviously. We haven't lost our minds entirely.

Log Everything, Trust Nothing

If you can't observe it, you can't trust it. And with a fleet of autonomous agents making decisions all day, trust needs to be earned through data, not assumed through vibes.

Append-only logging. Every proxy request, every LLM prompt and response, every decision point — logged to an immutable store. Auth headers and tokens are never logged; prompts and responses go through structured redaction (PII and secret scrubbing) before write. This isn't "standard backend logging." With traditional services you log requests and errors. With AI agents you also need to log reasoning — the full prompt, the full response, the confidence signals (where the model provides them), and which model produced which output. When an agent starts classifying crashes differently than it did last week, you need to diff the prompts and responses, not just the status codes.

Correlation IDs across agent workflows. When the orchestrator dispatches a task to three agents, every log entry carries the same workflow ID. You can reconstruct the entire multi-agent conversation from dispatch to result.
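A single correlated, redacted log entry might look like this. The field names and the toy redaction regex are illustrative — real secret scrubbing needs a proper scanner, not one pattern.

```python
# Sketch of an append-only log record carrying the workflow correlation
# ID plus the full (redacted) prompt and response. Field names are
# illustrative; SECRET_RE is a toy stand-in for structured redaction.
import json
import re
import time

SECRET_RE = re.compile(r"(sk-[A-Za-z0-9]+|Bearer \S+)")

def log_entry(workflow_id: str, agent: str, prompt: str, response: str) -> str:
    record = {
        "ts": time.time(),
        "workflow_id": workflow_id,  # same ID across every agent in the task
        "agent": agent,
        "prompt": SECRET_RE.sub("[REDACTED]", prompt),
        "response": SECRET_RE.sub("[REDACTED]", response),
    }
    return json.dumps(record)
```

Redacting at write time, before the immutable store, matters: an append-only log that captured a secret has captured it forever.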

This paid off when the crash tracker started silently misclassifying reports. No errors, no alerts — it was just gradually less accurate. A model update had shifted its classification boundaries. Because we had full prompt-response logging with correlation IDs, we could diff the tracker's outputs across two weeks. The pattern was clear: shorter responses, lower confidence signals, and a category distribution that had drifted from baseline. Without immutable prompt-response logs, this would have been invisible until someone noticed bad data in a report weeks later.

Modular architecture is observability for free. Because each agent is single-purpose and containerized, you get independent monitoring per agent. Dashboard shows the crash tracker is slow? You know exactly where to look. The analytics agent's error rate is climbing? It's not contaminating the telemetry analyzer. Each agent is its own observability boundary.

Unit testing with synthetic data. Every agent includes instructions for generating synthetic data for its domain. A crash tracker gets synthetic crash reports. An analytics agent gets synthetic dashboards. They can be tested in isolation — with mocked LLMs for deterministic CI runs, and with real LLMs for integration tests.

One caveat: if the LLM generates both the test data and the responses, you're testing the model against itself — a hallucination echo chamber. The synthetic data templates are human-authored, seeded from real production incidents and known edge cases. The LLM gets to respond to the synthetic inputs, but it doesn't get to define what "hard" looks like. That's your job.
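In practice the human-authored fixtures can be as simple as a list of inputs with expected outputs, run against the classifier whether it's mocked or live. Fixture contents and names below are invented for illustration.

```python
# Sketch of human-authored synthetic fixtures: the inputs and expected
# categories are fixed by a person, seeded from real incidents. The
# model (or a mock) only gets to answer, never to define the cases.

SYNTHETIC_CRASHES = [
    {"signal": "SIGSEGV", "thread": "main", "expected_category": "native-crash"},
    {"signal": "OOM", "thread": "worker-3", "expected_category": "memory"},
]

def check_classifier(classify) -> bool:
    """Run a classifier (mocked for CI, real LLM for integration tests)
    against the fixed fixtures and require every case to pass."""
    return all(classify(c) == c["expected_category"] for c in SYNTHETIC_CRASHES)
```

The same fixture list serves both test modes: a deterministic mock in CI proves the plumbing, and an occasional real-model run proves the prompt still holds up.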

Sandboxed environments for prototyping. New agents start in a sandbox — same container isolation, same proxy interface, but pointed at synthetic data. You can prototype a new "security scanner" agent without it ever touching production services. When it's ready, you point the proxy at the real endpoints. The agent doesn't know the difference. It was always just sliding paper through a mail slot.

How It All Fits Together

Here's a single workflow traced end to end — a telemetry spike turning into a pull request:

  1. Orchestrator receives a scheduled telemetry analysis task. It creates a workflow (wf-8341), selects the telemetry analyzer agent, and dispatches.

  2. Telemetry analyzer (running on a cheap model) queries proxy/telemetry/metrics for the last 24 hours. The proxy validates the request against the agent's role, injects authentication, forwards it, and returns clean data. The agent flags a 3x latency regression on the payments endpoint.

  3. Orchestrator receives the flag. Because it's a potential regression (high stakes), it triggers cross-evaluation: the same data goes to two additional models. All three agree — this is real, not noise.

  4. Orchestrator dispatches the finding to the PR creator agent with a new JIT token scoped to read:source-code, create-pr, create-branch.

  5. PR creator invokes a coding agent CLI as a subprocess. The CLI runs in a sandbox with read-only repo access, a forbidden-paths list, and hard limits on runtime and diff size. It identifies the likely cause (a missing database index on a recently added column), drafts a migration, and adds a new benchmark test.

  6. PR creator opens a pull request with the fix, the telemetry evidence, and a link to the workflow trace (wf-8341).

  7. Everything is logged: every proxy call, every LLM prompt and response, every decision point — all carrying wf-8341 as the correlation ID. The token expires. The containers reset.

  8. A human reviews the PR, checks the telemetry evidence, and merges. Or doesn't.

Total time: about 4 minutes. Total cost: under $0.30 (one frontier model call for the PR, cheap models for everything else). Human time: the 2 minutes it takes to review the PR.
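The JIT token minted in step 4 can be pictured as a short-lived claims object. This sketch skips the signing a real JWT would carry (Part 2's territory, per the author); the function names and TTL are assumptions.

```python
# Toy sketch of a just-in-time scoped token: short-lived, carrying only
# the scopes this one workflow needs. A production token would be a
# signed JWT; here it's a plain dict for illustration.
import time

def mint_jit_token(workflow_id: str, scopes: list, ttl_s: int = 600) -> dict:
    return {
        "workflow_id": workflow_id,
        "scopes": list(scopes),
        "exp": time.time() + ttl_s,  # expires after the workflow window
    }

def token_allows(token: dict, scope: str) -> bool:
    return scope in token["scopes"] and time.time() < token["exp"]
```

The key property: the token names what this workflow may do, not what the agent in general may do — so a compromised step holds a credential that dies with the workflow.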

What's Next

This architecture keeps agents productive and observable. But observability without security is just surveillance theater — congratulations, you can now watch your agents leak data in high definition.

In Part 2, I'll cover the security model that makes all of this safe: zero-trust with JIT tokens via JWT, RBAC for agents, a container proxy that means no credential ever touches an agent, and the meta-workflow — a special agent that analyzes logs from all other agents, identifies problems, and stages PRs to fix them. The system facilitates its own improvement, with human review at every step.

Because the boring parts — tokens, proxies, role definitions, logging — are what make the ambitious parts possible.


This is Part 1 of a two-part series on multi-agent AI architecture in production. Part 2 covers security, JIT tokens, and self-healing workflows.

Top comments (13)

Vic Chen

The observability layer you describe is exactly what gets skipped in early agent deployments. The pattern of having each agent emit structured logs with a "reasoning trace" field is underrated - it lets you do post-hoc debugging without having to replay prompts. One thing I would add: have you experimented with a "disagreement protocol" where agents can flag low-confidence decisions for human review? In my experience, the failure mode is usually silent overconfidence, not obvious errors.

Mike

Silent overconfidence is exactly the failure mode -- we hit it with the crash tracker when a model update silently shifted classification boundaries. No errors, just gradually worse accuracy. Caught it only through prompt-response log diffing.

We don't have a formal disagreement protocol yet, but that's a great idea. Right now for high-stakes decisions we run cross-evaluation with multiple LLMs (consensus, structured voting, adversarial debate) and escalate to humans when confidence is low. A more explicit "flag and defer" mechanism for low-confidence outputs would be a natural next step.

I cover the multi-LLM evaluation setup in Part 2.

Vic Chen

The adversarial debate approach is genuinely underrated for catching silent overconfidence -- it forces divergence into the open instead of letting consensus paper over it.

We ran into a very similar structural problem building the data pipeline at 13F Insight. When you're aggregating 13F filings from multiple data vendors, you occasionally get different position numbers for the same fund and quarter. The naive move is to pick the most recent source or take an average. But that's exactly wrong -- it trains your system to silently resolve disagreements instead of surfacing them.

What actually worked: if two sources diverge by more than some threshold, the record gets flagged and deferred for human review rather than auto-resolved. It's the same instinct as your 'escalate to humans when confidence is low' -- the system should be loudly uncertain rather than quietly wrong.

The log diffing approach you mentioned for catching boundary shifts is something I want to steal for the pipeline. The failure mode where accuracy degrades without any error signal is the one that's hardest to defend against operationally.

Vic Chen

The prompt-response log diffing approach for catching silent boundary shifts is smart — that's exactly the kind of thing that's invisible to standard metrics. No errors, no latency spikes, just quietly degrading quality. In a financial context that's especially dangerous because the model might still sound confident while misclassifying edge cases that only matter at the tail.

The structured voting + adversarial debate setup is the right direction for high-stakes calls. One thing I've found useful: tracking inter-model disagreement rate over time as a leading indicator. If your ensemble starts agreeing less frequently on a class of inputs, it often precedes detectable accuracy drops by several days. Cheaper than waiting for users to complain.

Will check out Part 2 — curious how you handle the credential/permission scoping when agents are generating their own fixes. That's a non-trivial trust problem.

Matthew Hou

"We're building fleets and forgetting to install brakes" — that stat (88% of orgs had security incidents with AI agents, only 47% monitor them) is damning.

The one-agent-one-job principle is the right call. The Vercel case study backs this up from a different angle: they had one agent with 15 tools at 80% accuracy, cut it to 2 tools and hit 100%. Same model. The failure was in the tool surface, not the reasoning.

Curious about one thing: how do you handle the cases where an agent's one job requires context from another agent's domain? Like if the crash tracker detects a pattern that needs telemetry data to diagnose. Do agents communicate, or does a human bridge the gap? That handoff design is where I've seen most multi-agent systems get messy.

Mike

@signalstack nailed it: we do the same, orchestrator translates between agents using structured summary packets with a strict schema. The receiving agent never sees raw output from the sender, just typed parameters. Honestly this whole cross-domain handoff topic deserves its own article.

Guilherme Zaia

The unsexy truth: your supervisor pattern only works because you kept orchestration deterministic. Most teams fail here—they LLM-route tasks, then wonder why prod behavior is stochastic. One gap: you mention multi-LLM councils for high-stakes decisions but skip the latency cost. Council consensus (3+ models voting) adds 2-5s per decision. For crash triage that's fine. For real-time telemetry? You need fallback to single-model with confidence thresholds. Also—your $0.02/task assumes agents don't retry on transient failures. What's your exponential backoff strategy? In .NET distributed systems, we'd use Polly with jittered retry + circuit breakers. Without that, one flaky API turns your cost model into roulette. The 'padded room' cliffhanger better include filesystem sandboxing—agents writing to shared volumes is the #1 way orgs turn 'no credentials' into 'oops, deleted logs'.

Mike

Fair points. Worth noting this is an in-house agent system, not user-facing — so latency isn't a hard constraint. That said, council voting only triggers for high-stakes decisions; most tasks just use schema validation + confidence thresholds on a single model. Retries and circuit breaking happen at the proxy level, transparent to the agent. Filesystem sandboxing is covered in part 2 — agents get ephemeral scratch space only.

signalstack

The cross-agent context thing is where 'one job' architectures get complicated in practice. Hit this exact problem running a similar setup.

What worked: the orchestrator never passes raw agent output directly to another agent. It sends structured summary packets — a defined schema that strips the crash tracker's output down to just [pattern_type, affected_endpoint, timestamp_range] before injecting it into the telemetry analyzer's context. The receiving agent doesn't know it came from another agent. It just got parameters.

This matters because when you let agents pass full context to each other, the receiving agent latches onto whatever the sending agent was most confident about — including stuff that's totally irrelevant to its job. You end up with reasoning chains: Agent A's conclusion becomes Agent B's premise becomes Agent C's hallucinated 'fact.' The summary packet forces you to be explicit about what actually transfers at each handoff.

Second benefit: it keeps each agent's prompt surface minimal. The telemetry analyzer shouldn't know about crash classification logic. When it does, you get weird bleed.

For the genuinely ambiguous cross-domain cases, humans bridge the gap. But for structured handoffs, the orchestrator-as-translator pattern has been the cleanest approach I've found.

klement Gunndu

The cost engineering breakdown is the most useful part — running 80% of agents on Haiku-tier and reserving frontier for the 20% that need reasoning is exactly how we got our per-task cost under control too.

Collapse
 
nesquikm profile image
Mike

Yeah, the surprising part is how many tasks run perfectly fine on the cheapest tier — even Gemini 2.5 Flash handles crash classification, threshold alerts, and structured extraction just fine. Once you audit what actually needs reasoning vs. pattern matching, the frontier calls shrink fast.
