A single Copilot CLI run against a FastAPI application produced four distinct issues, three of them security-relevant. The code worked. Tests passed. The endpoint did what was asked. None of the issues would surface during a demo or a code review focused on functionality.
User input rendered as raw HTML
The application tracks satellite data. Satellite names come from user input. The agent rendered them directly into HTML templates in four separate locations:
html += f"<strong>{t.risk}</strong>: {t.sat1} vs {t.sat2}"
No escaping. Four blocks, same pattern. A single-purpose security scanning agent found all four and applied markupsafe.escape(). A general-purpose agent reviewing the same code caught three of four, missing one buried in a conditional branch.
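The fix is mechanical: escape at the point of interpolation. A minimal sketch using the stdlib's html.escape (the agent applied the equivalent markupsafe.escape); the object `t` and its fields are assumed from the snippet above:

```python
from html import escape

def render_row(t) -> str:
    # Escape every user-controlled field before it reaches the HTML string.
    return f"<strong>{escape(t.risk)}</strong>: {escape(t.sat1)} vs {escape(t.sat2)}"
```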
The difference isn't model quality. The security-focused agent had a narrower scope and explicit instructions to scan for unescaped user input in template rendering. Scope and prompt specificity determined the outcome.
Health endpoint that lies to the load balancer
The agent built a /health endpoint. It returned HTTP 200 unconditionally, including when the database was unreachable.
Kubernetes liveness and readiness probes interpret 200 as "this instance is healthy, keep routing traffic." An instance that returns 200 with a dead database stays in the rotation. Users hit it. Requests fail. The cluster thinks everything is fine.
The correct response is 503 (Service Unavailable). The orchestrator's verification caught this because runtime behavior checks are part of the quality gate surface, not just static analysis.
This one's subtle. The endpoint "works" in every test environment where the database is actually running. It only fails in the exact production scenario it was designed to protect against.
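The corrected semantics are simple to state. A minimal, framework-agnostic sketch (in a FastAPI route you would set response.status_code from the returned code; db_ok stands in for a real connectivity probe such as a SELECT 1):

```python
from http import HTTPStatus

def health_payload(db_ok: bool) -> tuple[int, dict]:
    """Return (status_code, body) for a /health endpoint."""
    if db_ok:
        return HTTPStatus.OK, {"status": "ok"}
    # 503 tells Kubernetes probes to pull this instance from rotation.
    return HTTPStatus.SERVICE_UNAVAILABLE, {"status": "degraded", "database": "unreachable"}
```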
Exception details returned to clients
Error handlers used str(e) as the response body:
except Exception as e:
return {"error": str(e)}
Database connection strings, file paths, internal state. All returned directly to whoever triggered the error. In a security audit this is an information disclosure finding. In a FastAPI app behind an API gateway, it's a path to mapping internal infrastructure.
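The standard remedy is to log the full exception server-side and return only a generic message. A sketch, assuming a correlation ID pattern (the error_id field is an illustrative convention, not something from the audited code):

```python
import logging
import uuid

logger = logging.getLogger("app")

def safe_error_response(e: Exception) -> dict:
    # Log the full traceback server-side with an opaque ID the client
    # can quote in a support request; never echo str(e) to the caller.
    error_id = str(uuid.uuid4())
    logger.exception("unhandled error id=%s", error_id)
    return {"error": "internal server error", "error_id": error_id}
```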
Deprecated datetime API
datetime.utcnow() has been deprecated since Python 3.12. The replacement is datetime.now(timezone.utc). The agent also used time.time() for uptime tracking, which is affected by NTP clock adjustments and can report negative uptime if the system clock steps backward. time.monotonic() exists specifically for this case.
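The replacements side by side, as a sketch:

```python
from datetime import datetime, timezone
import time

# Replaces deprecated datetime.utcnow(): returns an aware timestamp.
started_at = datetime.now(timezone.utc)

# Replaces time.time() for uptime: monotonic() never steps backward,
# so elapsed time can't go negative after an NTP adjustment.
_t0 = time.monotonic()

def uptime_seconds() -> float:
    return time.monotonic() - _t0
```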
Neither of these will cause a production outage today. Both are the kind of technical debt that accumulates when generated code isn't checked against current language standards.
Why this matters
None of these bugs required a sophisticated analysis to find. They're patterns: unescaped user input in templates, unconditional success responses in health checks, raw exception strings in error responses, deprecated stdlib usage. Each one is a known category with a known fix.
The problem is attention. A general-purpose agent optimizing for "make this feature work" doesn't allocate attention to these categories unless explicitly prompted. The feature works. The tests pass. The agent moves on.
This is where orchestration changes the economics. Instead of one agent covering everything, specialized agents with narrow scopes check specific categories. A security auditor scans for injection and information disclosure. A runtime checker validates health endpoint semantics. Each agent's prompt is focused enough that known bug patterns get caught.
The alternative is what most developers do today: manually reprompt. "Now check for XSS." "Now add proper error handling." "Now fix the health check to actually check health." We measured this on the same codebase: 14 follow-up prompts to bring the standalone output to the same level. Each prompt required reading the previous output, identifying what was wrong, and writing a specific correction. About 45 minutes of continuous supervision.
The orchestrated run took 22 minutes, unattended. 7 premium requests vs 15. Zero human review cycles.
Swarm Orchestrator v5.0.0
The tool that caught these is open source. It wraps existing agent CLIs (Copilot, Claude Code, Codex) and adds verification, quality gates, and parallel execution. It doesn't generate code. It delegates code generation and verifies the output against outcome-based checks: git diff, build success, test pass, runtime behavior.
v5.0.0 adds three features relevant to this problem:
Spec-aware planning reads the quality gate configuration before generating agent prompts. Security requirements, test coverage thresholds, and configuration standards get injected before agents write code, not discovered through iteration afterward.
SARIF output exports quality gate violations as SARIF 2.1.0 JSON compatible with GitHub code scanning. Same PR annotation workflow teams already use for CodeQL.
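For reference, the shape of a minimal SARIF 2.1.0 log as consumed by GitHub code scanning. Field names come from the SARIF specification; the tool name, rule ID, message, and location here are invented for illustration, not the orchestrator's actual output:

```python
import json

# Minimal SARIF 2.1.0 log: one tool, one result. All values illustrative.
sarif = {
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {
            "name": "swarm-gates",
            "rules": [{"id": "unescaped-template-input"}],
        }},
        "results": [{
            "ruleId": "unescaped-template-input",
            "level": "error",
            "message": {"text": "User input rendered into HTML without escaping."},
            "locations": [{"physicalLocation": {
                "artifactLocation": {"uri": "app/routes.py"},
                "region": {"startLine": 42},
            }}],
        }],
    }],
}

payload = json.dumps(sarif)
```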
Per-project gate configuration via .swarm/gates.yaml lets teams override thresholds and disable gates that don't apply to their project type.
1,386 passing tests, 84 source files, 7 documented benchmarks. The release notes include commit hashes for every bug fix.
What categories of bugs do you consistently find in AI-generated code that could be caught by a specialized check rather than manual review?