Rene Zander

Lots Of People Are Demoing AI Agents. Almost Nobody's Shipping Them The Right Way.

Lots of people are demoing AI agents. Almost nobody's shipping them the right way.

Conference stages are packed with live demos of agents writing Terraform, spinning up Kubernetes clusters, and generating Helm charts on command. The audience claps. The tweet goes viral. And then... nothing ships.

Here's the uncomfortable truth: the gap between "look what my agent can do" and "this runs in production every day" is enormous. I've been on both sides. I spent years as an Enterprise Architect watching organizations spin up AI pilots that never graduated. Now I run my own infrastructure with Claude as the core agent — not as a demo, not as a proof of concept, but as the actual engine that keeps things moving.

This is how I did it, and what most people are getting wrong.

The problem: agents without harnesses

Here's a stat that should bother you: 89% of teams running AI agents have observability. Only 52% have evals.

Read that again. The industry built agents that can generate Terraform, write Helm charts, and scaffold entire CI pipelines. Impressive stuff. But almost nobody built the harness to know whether the output is actually safe to ship.

We gave AI agents the power to write infrastructure code, then forgot to build the quality gates that tell us if that code will blow up in production. That's not an AI problem — that's an engineering problem. And it's the reason most agent deployments are still stuck in pilot purgatory.

Observability tells you what happened. Evals tell you whether what happened was correct. Without evals, you're flying blind with an autopilot you can't verify.
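To make the distinction concrete, here's a minimal sketch of what an eval adds on top of observability: instead of only logging what the agent produced, you score the output against explicit checks. The two Terraform checks below are illustrative placeholders, not a real policy:

```python
# Minimal eval sketch: score agent output against explicit checks
# instead of just logging it. The checks are illustrative only.

def evaluate(output: str, checks) -> float:
    """Return the fraction of checks the agent's output passes."""
    results = [bool(check(output)) for check in checks]
    return sum(results) / len(results)

# Hypothetical checks for generated Terraform:
terraform_checks = [
    lambda tf: "required_providers" in tf,  # provider versions are pinned
    lambda tf: "0.0.0.0/0" not in tf,       # no wide-open ingress rules
]

score = evaluate('terraform { required_providers { aws = {} } }', terraform_checks)
```

An observability stack records that `score` was computed; an eval is the `evaluate` call itself, run on every output before it ships.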

My approach: build the pipeline around AI

Most teams bolt AI onto their existing pipeline. They add a Copilot here, an agent there, maybe a ChatGPT wrapper that generates boilerplate. The pipeline stays the same — AI is just a faster typist.

I did the opposite. I built the pipeline around AI from the ground up.

  • Claude is the core agent. Not an assistant, not a sidebar. The primary operator.
  • The developer experience is optimized for Claude. Every prompt, every constraint, every guardrail is designed for how the agent works — not retrofitted onto a human workflow.
  • Every dev area has a custom CLI. Task management, deployments, monitoring — each domain has its own purpose-built command-line interface.
  • The infra is fully declarative. Define the desired state, and it deploys. No manual steps, no click-ops.
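The "define = deploy" idea boils down to reconciliation: diff the declared desired state against the observed state and emit the actions needed to converge. This is a toy sketch of that loop — the resource names and shapes are made up:

```python
# Sketch of declarative infra: a reconciler that diffs desired state
# against actual state and returns the converging actions.
# Resource names and specs below are illustrative.

def reconcile(desired: dict, actual: dict) -> list[str]:
    """Return the actions that converge `actual` toward `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name} -> {spec}")
        elif actual[name] != spec:
            actions.append(f"update {name} -> {spec}")
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions

desired = {"web": {"replicas": 3}, "worker": {"replicas": 1}}
actual = {"web": {"replicas": 2}, "cache": {"replicas": 1}}
actions = reconcile(desired, actual)
```

Real tooling (Terraform, Kubernetes controllers) does exactly this at scale; the point is that the agent only ever edits the desired state, never pokes at the actual one.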

The agent isn't an add-on. It's the engine. That distinction matters because it changes how you think about every other part of the system.

The harness: trust, but verify — every single time

If you take one thing from this article, let it be this: the harness is more important than the agent.

An AI agent without quality gates is a liability. A mediocre agent with a tight harness will outperform a brilliant agent running unchecked. Here's what my harness looks like — all local, all pre-commit:

  • Linters run automatically via Husky pre-commit hooks. No exceptions, no skipping.
  • Unit tests execute before code leaves the machine. If tests fail, the commit is rejected. Period.
  • Explicit verification prompts. "Did you test this?" and "Did you verify the data flow end-to-end?" are baked into the workflow, not left to human memory.
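The gate logic itself is small. This is a sketch of the pre-commit check runner — in practice it would be invoked from a Husky pre-commit hook, and the `npm` commands are placeholders for whatever linter and test runner you use:

```python
import subprocess
import sys

# Sketch of a pre-commit gate: run each check in order and block the
# commit on the first failure. The npm commands are placeholders.

CHECKS = [
    ["npm", "run", "lint"],  # placeholder lint command
    ["npm", "test"],         # placeholder test command
]

def run_gate(checks, run=subprocess.run) -> bool:
    """Return True only if every check exits 0; stop at the first failure."""
    for cmd in checks:
        if run(cmd).returncode != 0:
            print(f"commit blocked: {' '.join(cmd)} failed", file=sys.stderr)
            return False
    return True
```

A `.husky/pre-commit` file would just execute this script and propagate its exit code, so a failing check rejects the commit.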

Notice what's missing? A remote CI server.

The gates are on the dev machine. By the time code gets pushed, it's already passed linting, testing, and verification. This isn't about skipping CI — it's about shifting quality left to the point where the agent operates. If the agent generates bad code, it gets caught before it ever leaves my laptop.

Token economics: stop wasting 80% of your budget

Most people running AI agents are hemorrhaging tokens. They pipe everything through MCP (Model Context Protocol), load massive contexts, and wonder why their agent bills look like cloud computing invoices from 2015.

My rules for token efficiency:

  • CLI over MCP — every time. A well-designed CLI returns exactly what the agent needs in a fraction of the tokens. MCP is flexible but verbose.
  • Port MCPs to lightweight CLIs. If you're using an MCP server for a specific domain, ask yourself: could this be a 50-line CLI script that returns structured output? Usually, yes.
  • Purpose-built CLIs per dev area. One CLI for task management, one for deployments, one for monitoring. Each returns minimal, structured data.
  • Fewer tokens = faster = cheaper. This isn't just about cost. Smaller contexts mean faster responses and fewer hallucinations.

Here's the bottom line: if your agent workflow costs more than the engineer it replaces, you haven't optimized the process. You've just automated the waste.

Multi-model by design: one model is a single point of failure

This is the part most people don't want to hear: running everything through a single model is a risk.

One model means one set of biases, one set of blind spots, one failure mode. That's fine for a chatbot. It's not fine for infrastructure.

My setup uses Claude as the primary agent, but delegates coding tasks to Codex. A cheap Plus plan covers most of the review work. But here's where it gets interesting:

  1. Claude drafts the plan. Architecture decisions, task breakdown, implementation strategy.
  2. Codex reviews before execution. A different model with different training data and different assumptions.
  3. Different model = different blind spots. What Claude accepts without question, Codex might flag. And vice versa.
  4. The review catches assumptions the primary agent won't question. This is the AI equivalent of a code review — you want a different perspective, not an echo chamber.
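The plan-then-review loop can be sketched in a few lines. The two model calls are stubbed with plain functions here — in a real setup they would call the Claude and Codex CLIs or APIs, and those integrations are assumptions, not part of this sketch:

```python
# Sketch of the plan -> cross-model review loop. Model calls are
# stubbed; real integrations with Claude/Codex are assumed, not shown.

def cross_review(task: str, planner, reviewer) -> dict:
    """Draft a plan with one model, then review it with a different one."""
    plan = planner(task)
    findings = reviewer(plan)
    return {"plan": plan, "findings": findings, "approved": not findings}

# Stand-in models for illustration:
def draft_plan(task):
    return f"plan for: {task}"

def flag_assumptions(plan):
    # A second model with different blind spots flags risky assumptions.
    return ["touches prod without a rollback step"] if "prod" in plan else []

result = cross_review("migrate the prod database", draft_plan, flag_assumptions)
```

Execution only proceeds when `approved` is true; otherwise the findings go back to the planner, the same way review comments go back to a PR author.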

Single-model pipelines are the new single point of failure. If your entire infrastructure workflow depends on one model's judgment, you've built a fragile system with extra steps.

The vertical: insourcing CI

Once the harness is tight and the token economics make sense, something interesting becomes possible: you can insource the entire CI pipeline.

  • Build happens locally. No waiting for remote runners to spin up.
  • Docker images push straight to prod. The build artifact goes directly where it needs to go.
  • Infra is declarative — define = deploy. Write the desired state, and the system converges to it.
  • No waiting for CI server queues. When you're a team of one (plus agents), a CI queue is just latency with no upside.
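The insourced pipeline is just an ordered list of local steps that aborts on the first failure. This sketch assumes a Docker-plus-kubectl stack; the registry name, image tag, and manifest path are made up:

```python
import subprocess

# Sketch of the insourced CI flow: build locally, push the image,
# apply the declared state. Registry, tag, and paths are placeholders.

STEPS = [
    ["docker", "build", "-t", "registry.example.com/app:latest", "."],
    ["docker", "push", "registry.example.com/app:latest"],
    ["kubectl", "apply", "-f", "deploy/"],
]

def ship(steps, run=subprocess.run) -> list[list[str]]:
    """Run each step in order; abort on the first non-zero exit.
    Returns the steps that actually ran."""
    ran = []
    for cmd in steps:
        ran.append(cmd)
        if run(cmd).returncode != 0:
            break
    return ran
```

Because the pre-commit gates already ran, this script never sees untested code — which is the only reason pushing straight to prod from a laptop is defensible.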

This only works because the harness is tight. Pre-commit hooks, linters, and tests give you the confidence to ship from your laptop. Without those gates, shipping from local would be reckless. With them, it's the fastest path to production.

What most people get wrong

They treat the agent like a chatbot. That's the root cause of most failed agent deployments.

Here's the pattern I see over and over:

| What people do | What actually works |
| --- | --- |
| Ask Claude to "write me some Terraform" | Build a DX where Claude operates inside constraints |
| Trust the output because "AI is smart" | Verify with linters, tests, and explicit prompts |
| Run everything through MCP | Optimize for token cost and speed with CLIs |
| Use one model for everything | Let Claude delegate to Codex — different bias, different blind spots |

The difference isn't the agent. The difference is the system around it.

AI agents in infrastructure work

But only if you build the harness first.

The agent is the easy part. Any competent engineer can get Claude or GPT to generate Terraform. The hard part — the part that separates demos from production — is everything around it: the quality gates, the token optimization, the multi-model verification, the declarative infrastructure that makes it all reproducible.

If you're stuck in pilot purgatory, the fix isn't a better agent. It's a better harness.


I'm writing more about the practitioner side of platform engineering + AI. If you're building with agents in production — not just demoing them — I'd like to hear from you.

What's your biggest blocker getting agents into production? Drop it in the comments.
