DEV Community

Matthew Hou


Your AI Agent Doesn't Need More Intelligence — It Needs Better Plumbing

If you're building with AI right now, you've probably had this moment: the demo works perfectly, you ship it, and then production surfaces every edge case the model never considered. Hallucinated IDs. Ignored constraints. Fluent, confident, wrong output.

You're not alone. I've been there — and based on the conversations happening in my comment sections, a lot of you are hitting the exact same wall.

Last week I watched a demo where an AI agent processed a customer refund using a hallucinated customer ID. The LLM was confident. The code was clean. The refund went through. Nobody caught it for three minutes.

That three-minute gap is the entire story of AI in production right now.

What your comments taught me

After my METR study post, @leob left a comment that reframed how I think about this:

"Maybe we should move away from the idea of using AI tools for 'coding' only, and use it more in an 'advisory' role instead — as virtual brainstorming buddies."

That stuck with me. Because the reliability problem isn't about the AI's reasoning — it's about us treating generation as the finished product instead of the starting point. @signalstack put it even more sharply: "Generation got cheap. Verification didn't."

Those two comments are basically the thesis of this article.

The demo-to-production gap is a plumbing problem

Most AI demos are one prompt, one model call, one result. It looks like magic. Then you ship it and discover the model hallucinates, ignores constraints, and produces outputs that are fluent but subtly wrong.

The fix isn't a better model. It's better plumbing.

When I started running AI workflows daily, I assumed the bottleneck would be model quality. It wasn't. The bottleneck was everything around the model: input validation, output verification, retry logic, state management, error handling.

The boring stuff. The plumbing.

What "plumbing" actually looks like

Here's the architecture shift that made my AI workflows reliable:

Before: User request → LLM call → output to user

After: User request → input cleaning → LLM call → output validation → decision gate (pass/retry/escalate) → formatting → output to user

That "decision gate" is the key piece most people skip. It's where you check: did the model actually follow the constraints? Is this output structurally valid? Does this make sense given what we know?

Sometimes the gate triggers a retry with a modified prompt. Sometimes it routes to a different model. Sometimes it just says "I can't confidently answer this" — which is infinitely better than confidently being wrong.
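The whole pipeline fits in a few dozen lines. Here's a minimal sketch of the "after" architecture — `call_llm`, `clean_input`, and `validate_output` are hypothetical stand-ins for whatever model client and checks you actually use, not a real API:

```python
def clean_input(text: str) -> str:
    # Input cleaning: strip stray whitespace before the model sees it.
    # Real pipelines would also normalize encoding, redact secrets, etc.
    return " ".join(text.split())

def validate_output(output: str) -> bool:
    # Placeholder structural check; in practice, validate against a schema
    # (see pattern #1 below).
    return bool(output.strip())

def handle_request(text: str, call_llm, max_retries: int = 3) -> str:
    prompt = clean_input(text)
    for attempt in range(max_retries):
        output = call_llm(prompt)
        if validate_output(output):        # decision gate: pass
            return output.strip()          # formatting step
        # decision gate: retry with a modified prompt
        prompt = f"{prompt}\n\nPrevious answer failed validation; try again."
    # decision gate: escalate with an explicit failure instead of guessing
    return "I can't confidently answer this."
```

The shape matters more than the specifics: every model call is wrapped in cleaning on the way in and a gate on the way out, and the gate has three exits, not one.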

The cost reality nobody talks about

Token prices are dropping. People see this and think "AI is getting cheaper."

Not exactly.

A single model call is cheap. A reliable system rarely uses a single call. One user request might trigger: generation, evaluation, regeneration, formatting, tool calls. The user sees one answer. The backend ran a small workflow.

I've seen my per-request cost go up 3-5x after adding proper validation layers. But my error rate dropped by an order of magnitude. That trade-off is worth it every time.

The analogy I keep coming back to: saying "tokens are cheap, therefore AI is cheap" is like saying screws are cheap, therefore airplanes are cheap.

Three patterns that actually work

1. Validate outputs against a schema, not vibes

Don't just check whether the output "looks right." Define a concrete schema for what you expect. If your agent is supposed to return JSON with specific fields, validate every field. If it's generating code, run it against your test suite before accepting it.
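As a sketch of what "validate every field" means in practice — the schema and field names here are illustrative, and a real project might use a library like `jsonschema` or Pydantic instead of hand-rolling this:

```python
import json

# Hypothetical schema for a refund decision returned by the model.
REFUND_SCHEMA = {
    "customer_id": str,
    "amount": (int, float),
    "reason": str,
}

def validate_refund(raw: str):
    """Parse model output and check every expected field, not just 'looks right'.
    Returns the parsed dict on success, None on any failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None                        # fluent prose instead of JSON? rejected
    for field, expected_type in REFUND_SCHEMA.items():
        if field not in data or not isinstance(data[field], expected_type):
            return None                    # missing field or wrong type? rejected
    if set(data) - set(REFUND_SCHEMA):
        return None                        # unexpected extra fields? rejected
    return data
```

Note the explicit rejection of extra fields — models love to helpfully add keys you never asked for, and downstream code that iterates over the payload will trip on them.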

2. Build retry loops with variation

When validation fails, don't just retry with the same prompt. Modify something: add the error message as context, simplify the request, try a different model. I typically cap at 3 retries before escalating to a human or returning an explicit failure.
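One way to sketch that loop — the `models` list and `validate` callback are assumptions about your setup, not a prescribed interface. The key moves are feeding the rejection reason back into the prompt and escalating through a fallback model list:

```python
def generate_with_retries(request, models, validate, max_retries=3):
    """Retry with variation instead of repeating the same call.
    `models` is a list of callables, most-preferred first; `validate`
    returns None on success or an error message on failure."""
    prompt = request
    for attempt in range(max_retries):
        # Vary the model: later attempts fall through to backup models.
        model = models[min(attempt, len(models) - 1)]
        output = model(prompt)
        error = validate(output)
        if error is None:
            return output
        # Vary the prompt: include the rejection reason as context.
        prompt = f"{request}\n\nYour previous answer was rejected: {error}. Fix this."
    raise RuntimeError("Exceeded retry budget; escalate to a human.")
```

Capping at 3 and raising an explicit error keeps the failure visible — a silent infinite retry loop is how you burn a month's token budget overnight.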

3. Separate the "thinking" from the "doing"

Let the LLM reason about what to do. Then have a separate, deterministic system actually execute it. The LLM decides "refund customer X $50." A validation layer checks: does customer X exist? Is $50 within the refund policy? Only then does the actual API call happen.
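The executor side of that split can be completely boring, deterministic code — which is exactly the point. A sketch, with the customer lookup and refund limit as stand-ins for your real database and policy:

```python
REFUND_LIMIT = 100.00
KNOWN_CUSTOMERS = {"cust_42"}  # hypothetical; in practice, query your database

def execute_refund(proposal: dict) -> str:
    """The LLM proposed this action; deterministic code decides whether it runs."""
    customer = proposal.get("customer_id")
    amount = proposal.get("amount", 0)
    if customer not in KNOWN_CUSTOMERS:
        return "rejected: unknown customer"   # a hallucinated ID stops here
    if not 0 < amount <= REFUND_LIMIT:
        return "rejected: amount outside refund policy"
    # Only now does the real side effect happen (the API call is stubbed out).
    return f"refunded {customer} ${amount:.2f}"
```

The hallucinated-refund demo from the opening dies at the first check. The model can be as confident as it likes; it never touches the payment API directly.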

The uncomfortable truth

Nobody knows what software engineering looks like in 2 years. That's terrifying. The tools change faster than anyone can keep up.

But also — making AI reliable is just engineering. Every powerful but unreliable technology goes through this phase. Databases needed ACID. Networks needed TCP. AI needs its own reliability layer.

The engineers who figure out this plumbing will be the ones building things that actually work. The ones chasing the next model release will keep rebuilding their demos.

The only way I've found to stay sane through this is to build in public and learn from people who push back on my assumptions. @mahima_heydev pointed out in my last post that the real hidden cost isn't time — it's confidence. People ship changes they don't fully understand. That observation changed how I think about validation layers: they're not just catching bugs, they're preserving your ability to trust your own system.

What I want to hear from you

If you're running AI in production — what does your plumbing look like? Are you hand-rolling validation, using a framework, or still flying without a net?

I genuinely appreciate every one of you who takes the time to share what you're seeing. Some of the best architectural decisions in my projects started as a sentence someone left in a comment. This isn't a platitude — it's literally how my last three posts evolved.

If something here doesn't match your experience, I want to know. That's how this gets better.


P.S. I package what I learn into tools. If you want executable workflow files that add validation gates and retry logic to your AI workflows automatically: 3 Skill Files.

Top comments (4)

Cynthia Shen

This one is super helpful!

Matthew Hou

Really appreciate that! The plumbing side is so unglamorous compared to model upgrades, but it's where most of the actual reliability comes from.

leob

This:

Separate the "thinking" from the "doing"

Matthew Hou

Yeah that's probably the single biggest unlock. Once you stop letting the agent think and act in the same step, half the weird failure modes just disappear. It's counterintuitive because it feels slower, but the reliability gain more than makes up for it.