DEV Community

Mike

My AI Agents Create Their Own Bug Fixes — But None of Them Have Credentials

Zero-trust duck room: a bouncer-proxy checks JWT tokens while a detective duck watches the fleet

In Part 1, I described the architecture of a fleet of single-purpose AI agents: one job per agent, containerized isolation, cheap LLMs for simple tasks, frontier models for reasoning, append-only logging, and a consistent proxy interface.

That's the skeleton. But architecture without security is just organized chaos with good diagrams.

Here's a stat that should keep you up at night: according to the State of AI Agent Security 2026 report, 45.6% of teams still use shared API keys for agent-to-agent authentication, and only 14.4% have full security approval for their entire AI agent fleet. We're building autonomous systems and authenticating them like it's 2019.

Here's the part that actually matters: how these agents do powerful things — querying sensitive data, creating pull requests, analyzing telemetry — without ever holding dangerous permissions. And how the system improves itself over time without anyone trusting a bot with a merge button.

To be precise about "no credentials": no stored API keys. No standing tokens. No secrets in environment variables, config files, or prompts. Credentials are minted per workflow run, injected into the sidecar proxy — never into the container — and expire within minutes. The agents cannot leak what they never hold.

The Intern With the Admin Password

Let's talk about how most people give AI agents access to things.

Step 1: Create API credentials. Step 2: Paste them into the agent's environment variables. Step 3: Hope the agent only uses them for what you intended. Step 4: Forget about the credentials. Step 5: Read about it on the front page of the internet.

This is the "just in case" model. The agent has standing credentials — always valid, broadly scoped, sitting in a config file or environment variable like a house key under the doormat. Maybe you rotate them quarterly. Maybe.

With traditional software, this is already risky. With AI agents, it's genuinely terrifying. These are systems that take orders from text. Their behavior is shaped by prompts, which can be manipulated. A prompt injection attack on an agent with standing database credentials isn't a theoretical risk — it's business as usual.

You wouldn't give an intern the admin password on their first day. Don't give it to a bot that will confidently get things wrong on a regular basis.

From "Just in Case" to "Just in Time"

The core principle: agents have zero standing permissions. No stored credentials. No API keys. No database passwords. Not in environment variables, not in config files, not in prompts, not anywhere inside the container. Zero.

When a workflow needs an agent to do something, the orchestrator creates a short-lived, narrowly-scoped JWT for exactly the services that agent needs to query — and only for the duration of that workflow run.

Here's the lifecycle:

Orchestrator receives task
  → Creates JIT JWT: {agent: "telemetry", scope: "read:telemetry", workflow: "wf-7829", exp: 5min}
  → Configures container proxy with this token
  → Agent runs, makes requests through proxy
  → Proxy injects JWT into outbound requests
  → Workflow completes
  → Token expires within minutes
  → Nothing persists

The agent never sees the token. The token lives in the proxy configuration, injected by the orchestrator. The agent calls proxy/telemetry/query, the proxy adds Authorization: Bearer <jwt>, forwards the request, gets the response, strips auth headers, and returns clean data.
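The lifecycle above can be sketched in a few lines. This is a hand-rolled HS256 mint purely to keep the example self-contained; a real orchestrator would use a vetted JWT library (e.g. jose), and every name here is illustrative rather than taken from the actual system:

```typescript
import { createHmac } from "node:crypto";

// Claims mirror the lifecycle diagram: agent, scope, workflow run, expiry.
interface JitClaims {
  agent: string;    // which agent this token is minted for
  scope: string;    // e.g. "read:telemetry"
  workflow: string; // workflow run id
  exp: number;      // unix seconds; minutes from now, never hours
}

const b64url = (data: string): string =>
  Buffer.from(data).toString("base64url");

// Hypothetical sketch of the orchestrator's token mint. HS256 is hand-rolled
// here only so the example has no dependencies; use a real JWT library.
function mintJitToken(claims: JitClaims, secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const payload = b64url(JSON.stringify(claims));
  const signature = createHmac("sha256", secret)
    .update(`${header}.${payload}`)
    .digest("base64url");
  return `${header}.${payload}.${signature}`;
}

// Minted per workflow run and handed to the proxy config, never to the agent:
const token = mintJitToken(
  {
    agent: "telemetry",
    scope: "read:telemetry",
    workflow: "wf-7829",
    exp: Math.floor(Date.now() / 1000) + 5 * 60, // expires in 5 minutes
  },
  "orchestrator-signing-secret" // illustrative; load from a KMS in practice
);
```

The key design point is where the return value goes: into the proxy's configuration, not into the agent's environment or context.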

No credentials in data. Not in prompts, not in agent context, not in logs. The agent literally cannot leak what it doesn't have. You can't social-engineer a password out of someone who was never told it. A prompt injection attack on a read-only agent gets you... the ability to ask the proxy for data the agent was already authorized to request. Congratulations, you've hacked your way into doing exactly what the agent was supposed to do anyway. On a write-capable agent (like the PR creator), the risk is more real — but it's still confined to the agent's specific role, its rate limits, and the mandatory human-in-the-loop review before anything merges.

To be clear: secretless doesn't mean harmless. The agent can still trigger actions through the proxy — that's delegated authority, and it's real power. But the blast radius is capped by the token's scope, the proxy's rate limits, and the role's action allowlist. A compromised agent can waste your compute budget for 5 minutes. It can't steal long-lived credentials or open arbitrary outbound connections, and any data access is limited to the narrow scope of its short-lived token.

No long-lived tokens. Every token is created per workflow run and expires when the workflow completes. There's nothing to rotate because there's nothing that persists. Your credential rotation policy is "tokens die automatically, every time." The security team's favorite rotation schedule is "never needs one."

If you want the academic framing: the recent Agentic JWT paper formalizes this as "intent tokens" — JWTs that bind each agent action to a specific user intent, workflow step, and agent identity checksum. It's the same principle: scope tokens to intent, not to identity. We arrived at the same pattern independently; it's nice to see it getting formal treatment.

RBAC: Roles Are for Bots Too

Role-Based Access Control isn't just for humans. Every agent type has a role definition. Here's a subset:

roles:
  crash-tracker:
    services: [crash-reporting]
    actions: [read]
    data: [crash-reports, stack-traces]
    limits: { max_requests_per_min: 30, max_response_size: 50kb }

  analytics-agent:
    services: [analytics-dashboard]
    actions: [read, query]
    data: [user-metrics, funnel-data]
    limits: { max_requests_per_min: 20, max_response_size: 200kb }

  code-reviewer:
    services: [code-repository]
    actions: [read, create-pr]
    data: [source-code, pull-requests]
    forbidden_paths: [auth/*, .ci/*, security/*]
    limits: { max_diff_lines: 500, max_runtime: 300s }

  pr-creator:
    services: [code-repository]
    actions: [read, create-pr, create-branch]
    data: [source-code]
    forbidden_paths: [auth/*, .ci/*, security/*]
    test_policy: can_add_new, cannot_modify_existing
    limits: { max_diff_lines: 500, max_files_changed: 10 }

  # telemetry-analyzer, channel-scanner, etc. follow the same pattern

The crash tracker can read crash reports. That's it. Not "crash reports and also maybe telemetry if it asks nicely." The proxy enforces these roles structurally — the agent can't request outside its role because the proxy doesn't have endpoints for services the role doesn't include.

This is the key distinction: roles are defined in config, not in prompts. The security model is structural, not behavioral. You're not saying "please only query analytics" in the system prompt and hoping the LLM listens. You're saying "the only endpoint that exists is analytics" at the infrastructure level. Prompt injection can't circumvent a wall that has no door.
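A minimal sketch of what "the only endpoint that exists" means in practice, with hypothetical role and function names: the proxy derives its route table from the role config, so a request outside the role fails because no route was ever registered for it, not because a permission check said no.

```typescript
// Role shapes mirror the YAML config above (trimmed to the relevant fields).
type Role = { services: string[]; actions: string[] };

const roles: Record<string, Role> = {
  "crash-tracker": { services: ["crash-reporting"], actions: ["read"] },
  "analytics-agent": { services: ["analytics-dashboard"], actions: ["read", "query"] },
};

// Illustrative routing decision: structural, not behavioral.
function routeRequest(
  agentRole: string,
  service: string,
  action: string
): { allowed: boolean; reason: string } {
  const role = roles[agentRole];
  if (!role || !role.services.includes(service)) {
    // Not "forbidden" -- the route simply was never registered for this role.
    return { allowed: false, reason: "no such endpoint for this role" };
  }
  if (!role.actions.includes(action)) {
    return { allowed: false, reason: "action outside role" };
  }
  return { allowed: true, reason: "ok" };
}
```

No prompt can talk its way past this, because there is no code path where the model's output influences the decision.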

Validation: Trust, But Verify. Actually, Don't Trust.

Every agent output goes through validation before anything happens. This is not optional. It's not a "nice to have." It's a stage in the workflow pipeline that cannot be skipped.

For routine outputs — crash classifications, metric summaries — schema validation is enough. The output either matches the expected structure or it doesn't. Zod schemas, strict mode, no exceptions.
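The strict-shape idea looks roughly like this. In the real pipeline this is a Zod schema with `.strict()`; the hand-rolled check below is only to keep the sketch dependency-free, and the crash-classification fields are invented for illustration:

```typescript
// Stand-in for a Zod .strict() schema: the output must have exactly the
// expected keys with the expected primitive types, and nothing extra.
type FieldType = "string" | "number" | "boolean";

function validateStrict(
  output: unknown,
  schema: Record<string, FieldType>
): boolean {
  if (typeof output !== "object" || output === null) return false;
  const obj = output as Record<string, unknown>;
  // Strict mode analogue: extra keys are a hard failure, not a warning.
  if (Object.keys(obj).length !== Object.keys(schema).length) return false;
  return Object.entries(schema).every(
    ([key, type]) => key in obj && typeof obj[key] === type
  );
}

// Hypothetical expected shape of a crash classification:
const crashSchema: Record<string, FieldType> = {
  category: "string",
  severity: "number",
  transient: "boolean",
};
```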

For consequential decisions — "should we alert the team about this anomaly?", "is this PR worth creating?" — I use cross-evaluation with multiple LLMs. The same question goes to 2–3 models, and the system measures consensus: council discussions, structured voting with confidence scores, adversarial debate, and model-as-judge evaluation.

A caveat: multi-LLM consensus isn't magic. Models share training data and can converge on the same mistake — correlated failures are real. Cross-evaluation works best when paired with deterministic checks: schema validation, static analysis, and regression tests that don't care what any model thinks. The LLMs catch the subtle stuff; the deterministic checks catch the obvious stuff. Together they cover more than either alone.
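Combining the two, here is a hedged sketch of confidence-weighted voting with a deterministic veto. The shapes and names are assumptions for illustration, not the actual implementation:

```typescript
// One vote per model, with a self-reported confidence in [0, 1].
interface Vote {
  model: string;
  verdict: "yes" | "no";
  confidence: number;
}

// Deterministic checks (schema, static analysis, regression tests) veto
// everything: even a unanimous "yes" is discarded if they fail.
function consensus(votes: Vote[], deterministicChecksPassed: boolean): "yes" | "no" {
  if (!deterministicChecksPassed) return "no";
  const weight = (v: "yes" | "no") =>
    votes.filter(x => x.verdict === v).reduce((sum, x) => sum + x.confidence, 0);
  return weight("yes") > weight("no") ? "yes" : "no";
}
```

The veto ordering is the point: model agreement is advisory, the deterministic gate is mandatory.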

Integrated test suites with synthetic data. Each agent can be instructed to generate synthetic test data for its domain. This means:

  • CI runs with mocked LLMs (deterministic, fast, for regression testing)
  • Integration tests with real LLMs (for evaluation and quality assessment)
  • New agents can be added without regression risk — they're tested in isolation first
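The mocked-LLM setup is the simple dependency-injection pattern you'd expect. This sketch uses invented names (not the real interfaces), and keeps the calls synchronous for brevity where real model calls would be async:

```typescript
// The agent depends on a small interface, never on a concrete model client.
interface LLM {
  complete(prompt: string): string;
}

// CI swaps in a deterministic stub keyed on the exact prompt.
class MockLLM implements LLM {
  constructor(private canned: Record<string, string>) {}
  complete(prompt: string): string {
    // Same prompt, same answer, every CI run.
    return this.canned[prompt] ?? "UNKNOWN";
  }
}

// Hypothetical agent logic under test:
function classifyCrash(llm: LLM, report: string): string {
  return llm.complete(`classify: ${report}`);
}
```

Integration tests bind the same interface to a real model client instead; the agent code doesn't change.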

Evaluation isn't a phase. It runs on every output, every time.

The Meta-Workflow: The System That Fixes Itself

This is my favorite part. And the part people don't believe until they see it.

There's a special workflow — the meta-workflow — that doesn't serve users or teams directly. Its job is to analyze the logs from all other agents. It runs under its own role: read-only access to the log store, write access to the code repository (for staging PRs), and nothing else.

Remember the append-only logging from Part 1? Every prompt, every response, every decision, every proxy call? The meta-workflow reads all of it.

Here's what it does:

Separates happy paths from failure paths. Most agent runs succeed quietly. The meta-workflow builds a baseline of "normal" — typical response times, common classifications, expected output shapes. Then it flags the runs that deviate. Not based on error codes alone — based on behavioral patterns. "The crash tracker classified 47 reports today, but 12 of them took 3x longer than average and returned unusually short classifications." That's not an error. It's a degradation trend that a simple health check would miss.
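The baseline-and-deviation idea reduces to something like the sketch below: compute a baseline from recent runs, then flag runs that are much slower than average while producing unusually short output. The 3x factor mirrors the example above; the field names and thresholds are illustrative assumptions:

```typescript
// One entry per agent run, pulled from the append-only log store.
interface RunLog {
  runId: string;
  durationMs: number;
  outputChars: number;
}

// Flag runs that are >factor x slower than baseline AND produce less than
// half the typical output length -- the crash-tracker pattern from the text.
function flagDegradedRuns(runs: RunLog[], factor = 3): RunLog[] {
  const n = Math.max(runs.length, 1);
  const meanMs = runs.reduce((s, r) => s + r.durationMs, 0) / n;
  const meanChars = runs.reduce((s, r) => s + r.outputChars, 0) / n;
  return runs.filter(
    r => r.durationMs > factor * meanMs && r.outputChars < meanChars / 2
  );
}
```

None of these runs returned an error code; the signal only exists relative to the fleet's own history.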

Detects security anomalies. Did an agent make an unusual sequence of proxy requests? Did the telemetry agent suddenly start querying twice as often with different parameters? The meta-workflow flags access-pattern drift, unusual request sequences, and anything that looks like exploration rather than execution.

Stages PRs with proposed fixes. When the meta-workflow identifies a concrete problem — a prompt that's producing lower-quality outputs, a workflow configuration that's making redundant proxy calls — it uses a coding agent CLI to draft a pull request with the proposed fix, along with the log evidence that triggered it.

A prompt that's causing classification drift? The meta-workflow drafts an updated prompt, tests it against synthetic data, and opens a PR with the diff, a test report, and the specific log entries that showed the degradation. The reviewer doesn't just see "AI thinks this is better" — they see the receipts.

The quality gates are strict and enforced via branch protection rules and a dedicated bot account with limited repository permissions: PRs from the meta-workflow can't modify test files, can't touch auth code, can't change CI configuration, and must pass the full test suite before they're even visible for review. The agent can propose changes to prompts, configs, and workflow logic. It can't propose changes to its own guardrails. That's a hard boundary.
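The real enforcement lives in branch protection and the restricted bot account, but the path gate itself is simple. A sketch with assumed prefix patterns mirroring the role config above (`auth/*`, `.ci/*`, `security/*`); detecting "modified vs newly added" test files would need diff status and is omitted here:

```typescript
// Paths the meta-workflow may never touch, per its role definition.
const FORBIDDEN_PREFIXES = ["auth/", ".ci/", "security/"];
// Conventional test-file naming; real repos would match their own layout.
const TEST_PATTERN = /\.(test|spec)\.[jt]s$/;

function prAllowed(changedFiles: string[]): { ok: boolean; violations: string[] } {
  const violations = changedFiles.filter(
    f =>
      FORBIDDEN_PREFIXES.some(prefix => f.startsWith(prefix)) ||
      TEST_PATTERN.test(f) // meta-workflow PRs cannot touch test files
  );
  return { ok: violations.length === 0, violations };
}
```

Returning the violating paths, not just a boolean, keeps the rejection auditable in the same logs the meta-workflow reads.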

Is it noisy? Sometimes. Log analysis produces false positives, and not every staged PR is worth merging. But the signal-to-noise ratio improves over time — because the meta-workflow's own analysis prompts (not its guardrails or security config) are subject to the same improvement cycle. The distinction matters: it can get better at spotting problems, but it can't loosen its own constraints.

The system facilitates its own improvement over time. With one non-negotiable constraint:

Human-in-the-loop everywhere. An agent can create a PR but not merge it. An agent can flag an anomaly but not take corrective action. An agent can draft an alert but a human decides whether to send it. No irreversible actions happen without human approval.

The continuous improvement loop looks like this:

Agents run workflows
  → Append-only logs accumulate
  → Meta-workflow analyzes logs (daily / on-demand)
  → Identifies: degradation trends, security anomalies, quality drifts
  → Stages PRs with fixes + test evidence
  → Human reviews and merges (or doesn't)
  → Next cycle, the improved system produces better logs
  → Meta-workflow sees improvement (or spots new issues)
  → Repeat

This isn't self-modifying AI in the scary sci-fi sense. It's a CI/CD pipeline where one of the contributors happens to be an AI that reads its own logs. The human is always in the loop. The merge button is always manual.

Threat Model: What Can Still Go Wrong

No architecture is bulletproof. Here's what I worry about and how the system mitigates each:

| Threat | Mitigation | Residual risk |
| --- | --- | --- |
| Prompt injection via agent input | Agents can't expand permissions; the proxy enforces role boundaries regardless of what the agent requests | Agent could waste compute within its allowed scope |
| Data exfiltration through crafted outputs | Agents have no network egress; output goes through the proxy, which strips sensitive metadata | Agent could encode data in its "normal" outputs; monitoring catches anomalous output patterns |
| Confused deputy (proxy misusing its authority) | Proxy is deterministic, minimal-state, rule-based, and audited; no LLM in the proxy path | Misconfigured role definitions could over-scope access |
| Log poisoning (agent manipulating its own audit trail) | Append-only log store; agents write through a separate logging channel they can't read or modify | A compromised logging pipeline upstream of the store |
| Self-reinforcing bugs (meta-workflow making things worse) | PRs can't modify tests, auth, or CI; full test suite must pass; human reviews every merge | Subtle quality regressions that pass tests |
| Correlated LLM failures in cross-evaluation | Deterministic checks (schema validation, static analysis) run alongside LLM evaluation | Novel failure modes that neither LLMs nor deterministic checks catch |

The honest answer: this system reduces the blast radius and raises the cost of attacks. It doesn't eliminate risk. Nothing does. But "the agent can waste 5 minutes of compute within its allowed scope" is a very different threat profile from "the agent has the database password."

The Boring Parts Are the Point

If you've read this far, you might have noticed a pattern: most of this article is about tokens, proxies, role configs, and logging. Not about the AI. Not about the prompts. Not about which model is smartest.

That's intentional.

The interesting parts of a multi-agent system — self-healing workflows, autonomous PR creation, cross-model evaluation — are only possible because the boring parts are solid. JIT tokens mean you don't wake up to a credential leak. Container proxies mean prompt injection is a nuisance, not a catastrophe. RBAC means a misbehaving agent can't cascade. Append-only logs mean the meta-workflow has something to analyze.

The boring infrastructure is the product. The AI agents are just the tenants.

If you're building multi-agent systems, don't start with the prompts. Start with the proxy. Start with the token lifecycle. Start with the logging pipeline. Get the padded room right, then worry about what the agent inside it is saying.

The Cloud Security Alliance's Agentic Trust Framework puts it well: "No AI agent should be trusted by default, regardless of purpose or claimed capability." The framework maps five core elements — identity, behavior, data governance, segmentation, incident response — that align with everything described in this series. It's worth reading if you're designing agent infrastructure.

Once the foundation is solid, the ambitious parts take care of themselves.


This is Part 2 of a two-part series on multi-agent AI architecture in production. Part 1 covers agent architecture, container isolation, tiered LLMs, and observability.

The multi-LLM evaluation patterns mentioned in this article (council, voting, debate, judge) are open-source in mcp-rubber-duck.
