DEV Community

chengkai

What I'd Tell a Manager About Running AI Agents on a Real Codebase

The Problem No One Writes About for Managers

Most writing about AI agents is aimed at engineers. "Here's how to prompt it. Here's the framework. Here's the benchmark."

If you're a manager or director, that's not the question keeping you up at night. The question is: how do you know the agents are actually doing what they say?

I've been running three AI agents from three different companies — Claude, Codex, and Gemini — on a production-grade infrastructure project for several months. Not demos. Real code, real deployments, a live Kubernetes cluster with Vault, Istio, Jenkins, and ArgoCD.

Here's what I'd tell someone managing engineers who are adopting AI agents — or thinking about it.


Agents Lie. Not on Purpose. But They Lie.

The first thing I learned: agents report success the same way regardless of whether they succeeded.

Codex completed a task involving a broken container registry. The deploy failed. It committed anyway and described the commit as "ready for amd64 clusters." Not deceptive — just optimistic about what "done" means.

Gemini ran 158 tests and reported them as passing. They were passing — but with environment variables pre-loaded. The same tests run in a clean environment: 108 pass, 50 skip. Both numbers are true. Only one is what the project needed.
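
Catching this kind of discrepancy is mechanical: rerun the suite with the environment scrubbed. A minimal sketch of the idea, where the variable name is made up and the inner echo stands in for the real test runner:

```shell
# A pre-loaded variable that lets tests "pass" in the dirty environment.
export DATABASE_URL="postgres://preloaded"

# `env -i` launches the child with an empty environment (only PATH restored),
# so anything the tests silently depend on disappears.
result="$(env -i PATH="$PATH" sh -c 'echo "DATABASE_URL=${DATABASE_URL:-unset}"')"
echo "$result"   # in a real suite, replace the inner echo with the test command
```

Anything that passes in the parent shell but skips or fails under env -i was leaning on pre-loaded state, which is exactly the 158-versus-108 gap.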

This isn't a flaw in the agents. It's a flaw in assuming that "I'm done" means the same thing to an agent as it does to a senior engineer who has shipped a production incident.

What this means for managers: completion reports from agents need the same scrutiny as completion reports from a junior engineer you don't know yet. Trust but verify — every time, not just when something smells wrong.


Governance Is the Product

On this project, I run agents against something I call AGENTS.md — a file in the repo that encodes the rules every agent must follow, every session.

The rules look like this:

  • First command in every session: hostname && uname -n — verify you're on the right machine
  • Commit your own work — self-commit is your sign-off
  • No credentials in task specs — reference env var names only
  • Never run git rebase, git reset --hard, or git push --force on shared branches
  • Update memory-bank to report completion — this is how you communicate back
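
The first rule is cheap to make mechanical. A sketch of a session-start guard, where the expected hostname is a placeholder, not the project's real machine:

```shell
# Session-start guard: succeed only when the agent is on the intended machine.
check_host() {
  [ "$(hostname)" = "$1" ]
}

# "build-host-01" is a hypothetical expected machine name.
if check_host "build-host-01"; then
  echo "host check passed"
else
  echo "host check failed: on $(hostname), expected build-host-01"
fi
```

An agent that runs this as its first command and stops on failure can't repeat the wrong-machine mistake described below.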

Rules exist because agents have failed in exactly these ways before. Gemini started work on the wrong machine twice. Codex reverted intentional decisions across session restarts without noticing. Once you encode the rule, the failure rate drops to near zero — for that failure mode.

The file is checked into the repo. Every task spec includes a line: "Read AGENTS.md before starting." It's boring governance infrastructure. It's also the reason the project still has coherent state after months and hundreds of agent actions.

What this means for managers: if your engineers are using agents without written rules, they are running on trust. That works fine until it doesn't. The right time to write the rules is before the first production incident, not after.


Proof of Work: The Only Metric That Matters

I require a specific form of completion evidence from every agent:

  1. A commit SHA — verified independently via gh api
  2. A PR URL — not "I opened a PR," but a URL I can click
  3. A memory-bank update — a file in the repo that records what was done, what the output was, and what was verified

Agents that report "done" without a commit SHA get sent back. Agents that report a SHA I can't verify get sent back. Codex fabricated commit SHAs twice early in the project — not intentionally, but it reported hashes that didn't exist on the remote. The verification step caught both.
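
The verification step is a one-liner around the gh CLI. A sketch of the gate, where OWNER/REPO and the all-zeros SHA are placeholders: gh api returns the commit object only if the SHA exists on the remote, and a fabricated hash gets a 404 and a non-zero exit status.

```shell
reported_sha="0000000000000000000000000000000000000000"   # placeholder SHA

if command -v gh >/dev/null 2>&1 \
   && gh api "repos/OWNER/REPO/commits/$reported_sha" --jq .sha >/dev/null 2>&1; then
  verdict="verified"
else
  # Missing CLI, missing auth, or a hash the remote has never seen:
  # all of these mean the completion report is not accepted yet.
  verdict="rejected"
fi
echo "$verdict: $reported_sha"
```

The important property is that the default is rejection: the report is only accepted when the remote positively confirms the hash.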

This is proof-of-work, not micromanagement. A senior engineer submitting a PR with failing CI and saying "should be fine" is the same problem at a different abstraction level.

What this means for managers: if your team is using agents and the only evidence of completion is "the agent said it's done," you don't have a process. You have hope. The fix is simple: require a URL or a SHA for anything that touches production.


The Security Review Story

One of the things I've built into the project: every pull request gets reviewed by GitHub Copilot before merging. Not instead of human review — in addition to it.

The last PR had six Copilot comments. Real issues, among them:

  • KUBECONFIG merge overwriting existing entries — a bug that would have silently clobbered other cluster contexts
  • Kubeconfig file created with world-readable permissions — a credentials exposure issue
  • Column parse logic too broad — would have silently matched wrong lines in vcluster list output
  • DNS label validation missing — a user-supplied name with invalid characters would propagate downstream and fail in an opaque way
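
The last item is the kind of check that costs a few lines at the input boundary. A sketch of a DNS-1123 label guard, with a made-up sample name:

```shell
# DNS-1123 label: lowercase alphanumerics and hyphens, must start and end
# with an alphanumeric, at most 63 characters.
is_dns_label() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$'
}

name="My_Cluster"   # hypothetical user-supplied name
if is_dns_label "$name"; then
  echo "ok: $name"
else
  echo "invalid DNS label: $name"   # fail loudly here, not opaquely downstream
fi
```

Rejecting the name at entry turns an opaque downstream failure into an immediate, explainable one.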

These weren't caught in the implementation review. They weren't caught in testing. Copilot flagged all six in under two minutes.

I want to be careful here: Copilot is not magic. It flags issues, not solutions. A human still needs to evaluate each one and decide what to fix. But as a first-pass security screen for AI-generated code, it's faster and more consistent than relying on any individual to remember every OWASP category on every review.

One more thing: Copilot's review quality depends on what it knows about your project. The repo has a .github/copilot-instructions.md checked in alongside AGENTS.md — it encodes project-specific conventions Copilot should enforce: bash 3.2 compatibility rules, the --interactive-sudo vs --prefer-sudo distinction, the if-block threshold. Without it, Copilot reviews against generic best practices. With it, it reviews against your standards. Same governance principle as AGENTS.md, different audience.
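
For flavor, a sketch of what such an instructions file can look like. The headings and wording here are illustrative, not the project's actual file:

```markdown
# Copilot review instructions (illustrative sketch)

- Scripts must stay bash 3.2 compatible: flag associative arrays,
  `${var,,}` lowercasing, and other bash 4+ features.
- Flag sudo handling that does not distinguish --interactive-sudo
  from --prefer-sudo per the project convention.
- Flag functions whose nested if-blocks exceed the project threshold;
  suggest extracting a helper instead.
```

Each line is a rule Copilot would otherwise have no way to know, which is the whole point of checking the file in.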

What this means for managers: AI-generated code needs AI-assisted review. The failure mode of "Codex wrote it, it passed CI, we shipped it" misses the category of bugs that don't show up in tests but do show up in production security audits. Copilot-as-reviewer is a $10/month line on a budget that's already paying for the agent. Use it — and write the instructions file so it knows what it's reviewing against.


What the Role of the Human Actually Is

Here's the honest version of what I do in this workflow, stripped of the flattering framing:

I read memory-bank files. I review completion reports. I decide which agent gets the next task based on what the task needs. I catch when an agent drifts outside its scope. I merge PRs after humans and Copilot have reviewed them. I write the governance rules when we hit a new failure mode.

That's it. The agents do most of the work. The humans do the coordination that requires judgment.

If you're a manager trying to figure out how to use AI agents well: the answer is not to get out of the way. It's to get better at the coordination layer. Better at knowing which agents to trust with which tasks. Better at reading completion reports critically. Better at encoding lessons learned so the same failure doesn't happen twice.

The engineers who are good at this aren't the ones who can prompt the best. They're the ones who treat agents the same way they treat junior engineers: with clear specs, clear success criteria, and verification that doesn't take their word for it.


The Practical Checklist

If I were briefing a manager whose team is adopting agents, this is what I'd say:

Write the rules before you need them. Create an AGENTS.md equivalent before the first production incident. Encode: what machines agents are allowed to work on, what branches they can and cannot touch, what counts as proof of completion.

Require proof-of-work. For anything touching production: a commit SHA, a PR URL, a CI status. "The agent said it's done" is not a completion status.

Add Copilot (or equivalent) to every PR. Not as a replacement for human review — as a first pass that catches the stuff humans miss after reading 400 lines of agent-generated code.

Assign by failure mode, not capability. Your agents have different ways of being wrong. Route tasks so that the failure mode does the least damage in that context.

Update the rules when you hit a new failure. Agents are consistent. If one fails in a new way, encode the fix in the rules immediately. The same failure will happen again without it.


What I'd Tell You to Watch Out For

The trap I've seen: teams adopt agents, see 2-3x productivity gains in the first few weeks, and remove oversight because "the agents are great." Then an agent reverts an intentional architectural decision because it wasn't in the task spec. Or ships a credentials exposure because no one was checking the OWASP list. Or reports 158 passing tests when 50 of them only pass because of environment setup.

The agents are good. They're also consistent in their failure modes. The governance infrastructure is what turns "sometimes works" into "reliably works."

Build the governance layer while the team is still small enough to care. It's much harder to retrofit after the first incident.


I've been writing about this project publicly — the technical details are at k3d-manager on GitHub. The earlier articles in this series go deeper on agent strengths, failure modes, and the coordination layer. This one is for the people who need to decide whether to trust the process, not just run it.
