I've been building AI agents in production for the past two years. Not demos. Not weekend projects. Systems that real users talk to every day and get angry at when they break.
And the pattern I keep seeing? Engineers building elaborate machinery around the model. Custom orchestration layers. Hand-rolled retry logic. Massive tool routing systems. All to solve problems the LLM would already solve if you just let it.
Here's what I'd rip out if I could go back.
1. Custom Tool Selection Logic
You built a classifier that decides which tool the agent should use. Maybe a regex-based router. Maybe a whole separate model call just to pick the right function.
Stop.
Modern LLMs are shockingly good at tool selection when you give them well-named, well-described tools. The problem was never the model. It was your tool descriptions.
```typescript
// Bad: vague tool name, model guesses wrong
{ name: "search", description: "Searches for things" }

// Good: specific name, clear scope, model nails it
{ name: "search_customer_accounts", description: "Search customer accounts by account ID, customer name, or date range. Returns subscription status, plan details, and usage history." }
```
The fix isn't a smarter router. It's better tool design. Name your tools like you're writing an API for a junior dev who's never seen your codebase. Be embarrassingly specific.
Tool selection metrics can look great while the final answer is still garbage. I've seen this firsthand. The agent picks the right tool 95% of the time but still gives wrong answers because the tool descriptions don't explain what the returned data actually means.
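For example, a description that documents what the returned fields actually mean might look like this (field names and values here are illustrative, not a real schema):

```typescript
// Hypothetical tool definition: the description explains the meaning
// of the returned data, not just what the tool does
const searchCustomerAccounts = {
  name: "search_customer_accounts",
  description:
    "Search customer accounts by account ID, customer name, or date range. " +
    "Returns: subscription_status ('active', 'past_due', or 'canceled'; " +
    "'past_due' means payment failed but access continues for 14 days), " +
    "plan ('starter', 'pro', or 'enterprise'), and usage_history " +
    "(API calls for the last 90 days, timestamps in UTC).",
};
```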
2. Prompt Chains for Multi-Step Reasoning
I used to build 4-5 step prompt chains for anything complex. Break the problem down. Feed output A into prompt B. Parse the result. Feed it into prompt C.
Turns out a single well-structured system prompt with clear instructions handles most of this natively. The model already knows how to decompose problems. You just need to tell it what your constraints are and what good output looks like.
```typescript
// Instead of chaining 3 prompts:
// 1. "Classify the user intent"
// 2. "Based on intent X, gather context"
// 3. "Now generate the answer"
// Just do this:
const systemPrompt = `You are a support agent for a SaaS platform.
When a user asks a question:
1. Identify whether they need account info, billing help, or technical support
2. Use the appropriate tool to get the data
3. Answer in plain English with the specific details they asked for
If you're unsure about intent, ask one clarifying question. Never guess.`
```
The chain approach also creates a hidden problem. Each step is a failure point. And debugging a 4-step chain when something breaks on step 3 is miserable. A single prompt with clear instructions is easier to observe, easier to eval, and fails more gracefully.
3. Retrieval Complexity Before Retrieval Quality
This one hurts because I've done it myself.
You spend two weeks building a hybrid retrieval pipeline. BM25 plus vector search plus re-ranking. Beautiful architecture. Looks great in a diagram.
Then you realize the actual problem is that your knowledge base documents are written in a way the model can't parse. Or your chunking strategy splits the answer across two chunks and neither one makes sense alone.
The retrieval pipeline doesn't matter if the underlying data is messy.
Before you optimize the search algorithm, ask yourself:
- If I showed this chunk to a human with no context, would they understand the answer?
- Are my documents written for the model or for the original author's brain?
- Am I chunking at logical boundaries or just every 500 tokens?
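On that last point, here's a minimal sketch of chunking markdown docs at heading boundaries instead of fixed token windows (illustrative only; your logical boundaries might be FAQ entries, sections, or table rows):

```typescript
// Split markdown into chunks at heading boundaries so each chunk is a
// self-contained section instead of an arbitrary 500-token slice
function chunkByHeadings(markdown: string): string[] {
  const lines = markdown.split("\n");
  const chunks: string[] = [];
  let current: string[] = [];

  for (const line of lines) {
    // Start a new chunk at every h1-h3 heading
    if (/^#{1,3} /.test(line) && current.length > 0) {
      chunks.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join("\n").trim());
  return chunks.filter((c) => c.length > 0);
}
```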
I've seen teams where retrieval "works" but answers are still wrong because the reference data itself contains outdated or incorrect information. That's not a retrieval problem. That's a data quality problem wearing a retrieval costume.
4. Custom Guardrails That Block Legitimate Use
You built a content filter. It catches bad inputs. Great.
Then users start complaining that normal questions get blocked. Someone asks about "terminating a contract" and the guardrail flags "terminating." Someone asks about "explosive growth" in their metrics and that trips another filter.
Rule-based guardrails at scale become a whack-a-mole game you can't win.
The LLM itself is already pretty good at understanding intent and context. Instead of building regex walls around the model, build guardrails INTO the model's instructions. Tell it what topics are off-limits. Tell it what information it should never reveal. Tell it to redirect gracefully instead of stonewalling.
```typescript
// Instead of: regex filter that blocks "kill", "terminate", "destroy"
// Try this in your system prompt:
`If a user asks about topics outside your domain (account management and billing),
politely redirect them. Never share internal system details, API keys,
or other customer data. You can decline requests, but always explain why
and suggest what you CAN help with.`
```
Guardrails and permissions are product design, not just safety theater. Treat them that way.
5. Agent Memory as a Separate System
You have your agent's database over here. Its memory system over there. A vector store somewhere else. And glue code holding all of it together with prayers and setTimeout.
The real question is simpler than the architecture you built: what does the agent actually need to remember between sessions?
Most agents don't need a sophisticated memory system. They need a well-structured context window. The conversation history plus a few key facts about the user. That's it. The model handles the rest.
When you DO need persistent memory, keep it close to your data. Don't build a separate memory service that has to sync with your database. Store memory where your data lives. Query it with the same tools.
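A minimal sketch of what that can look like, assuming Postgres via the same `pg` client the rest of the app already uses (table and column names are illustrative):

```typescript
// Agent memory as a plain table in the app's existing database,
// queried with the same client the agent's other tools use
import { Pool } from "pg";

const pool = new Pool(); // same connection config as the rest of the app

// Persist a fact worth remembering across sessions
async function rememberFact(userId: string, fact: string): Promise<void> {
  await pool.query(
    "INSERT INTO agent_memory (user_id, fact, created_at) VALUES ($1, $2, now())",
    [userId, fact]
  );
}

// Load the handful of facts that go back into the context window
async function recallFacts(userId: string, limit = 5): Promise<string[]> {
  const { rows } = await pool.query(
    "SELECT fact FROM agent_memory WHERE user_id = $1 ORDER BY created_at DESC LIMIT $2",
    [userId, limit]
  );
  return rows.map((r) => r.fact);
}
```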
The moment your agent's memory can't see its own database, you've created an integration problem disguised as a feature.
6. Sub-Agent Orchestration for Everything
Multi-agent architectures are seductive. One agent plans. One retrieves. One generates. One validates. They talk to each other through a message bus. It looks amazing on a whiteboard.
In production it's a nightmare to debug. When the answer is wrong, which agent broke? The planner? The retriever? The generator? You end up building observability tooling just to trace what happened across four agents when one would have been fine.
Start with one agent. Push it until it genuinely can't handle the complexity. Only THEN split into specialized sub-agents with clear, narrow responsibilities.
The rule I use: a sub-agent should exist only when the parent agent's context window literally can't hold the information it needs. Not because "separation of concerns" sounds good in a design doc.
Specialized agents make sense for high-context tasks where the prompt would blow up the token budget. General agents handle 80% of use cases with less operational overhead. Know which one you're building and why.
7. Evaluations That Test Happy Paths
This is the one that bites hardest.
You write 50 eval cases. The agent passes 48 of them. Ship it.
Then users find the 200 edge cases you didn't think of. The model hallucinates an account ID. It confidently answers a question it should have said "I don't know" to. It uses data from one customer to answer another customer's question.
Good evals don't test whether the agent CAN answer correctly. They test whether it WILL answer correctly under pressure.
Build evals that target failure modes (a sketch of the first case follows this list):
- What happens when the tool returns empty results?
- What happens when two tools return conflicting information?
- What happens when the user asks something slightly outside the agent's domain?
- What happens when the context is ambiguous?
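Here's what the first case can look like as a test, using vitest; `runAgent` is a hypothetical helper standing in for however you invoke your agent:

```typescript
// Failure-mode eval: stub the tool to return empty results and assert
// the agent admits it found nothing instead of inventing data
import { test, expect } from "vitest";

// hypothetical: wire this to your actual agent entrypoint
declare function runAgent(opts: {
  tools: { name: string; handler: () => Promise<unknown> }[];
  userMessage: string;
}): Promise<string>;

test("agent does not hallucinate when the tool returns no results", async () => {
  const emptyTool = {
    name: "search_customer_accounts",
    handler: async () => ({ results: [] }), // simulate an empty result set
  };

  const answer = await runAgent({
    tools: [emptyTool],
    userMessage: "What plan is account 4821 on?",
  });

  // Should admit it found nothing, not fabricate a plan or account ID
  expect(answer.toLowerCase()).toMatch(/couldn't find|no account|not found/);
});
```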
The eval suite is the real moat. Not the model. Not the prompts. Not the architecture. The team that can systematically find and fix failure modes ships better agents than the team with the fancier framework.
The Uncomfortable Truth
Most of the complexity in your agent isn't making it smarter. It's making it harder to debug, harder to eval, and harder to change.
The best agent architectures I've built are embarrassingly simple. One model. Clear system prompt. Well-named tools. Good data. Ruthless evals.
Everything else is either premature optimization or an expensive lesson waiting to happen.
What's the most over-engineered thing you've built into an agent that turned out to be unnecessary?
Top comments (24)
This is one of those posts that quietly exposes how much “architecture ego” we all go through when building AI agents.
The tool selection part especially hit. I’ve definitely overbuilt routing logic thinking I was being clever, only to realize the model just needed better tool names and clearer descriptions all along.
Same with evals… it’s always the happy path that looks great in demos and the weird 2% edge cases that completely break everything in production.
"Architecture ego" is the perfect name for it. That phase where you're building for the whiteboard instead of the user. I've been there more times than I'd like to admit.
And yeah the eval thing is brutal. 48/50 passing feels great until production shows you the 200 cases you never thought of. The edge cases don't show up in demos. They show up at 2am when a real user finds a path you didn't test.
Solid list. I'd add an 8th one: the framework itself.
I see teams reach for LangChain or CrewAI before they've even written a single raw API call. Then they spend days debugging abstraction layers instead of debugging their actual prompt. The framework becomes the product, not the agent.
Most of the time, a direct API call + a well-structured system prompt gets you 90% there. You see exactly what goes in, what comes out, and where it breaks. No magic, no hidden chains, no "why did the framework inject that into my context."
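For anyone who hasn't tried it, "direct API call" really can be this small. A sketch against the OpenAI chat completions endpoint (model name and env var are illustrative; adapt to your provider):

```typescript
// One fetch, one system prompt, no framework in between
async function ask(systemPrompt: string, userMessage: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: userMessage },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content; // exactly what came back, no magic
}
```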
Frameworks have their place when you genuinely need complex orchestration. But if you're building your first agent, they're the definition of premature optimization.
This should've been in the article honestly. I've seen this exact pattern. Teams adopt something like CrewAI on day one and then spend more time debugging the framework than their actual agent logic.
Frameworks earn their spot when the orchestration genuinely gets complex. But that's usually way later than people think. Add abstraction only when the pain of not having it is real.
The tool selection point landed for me. I've ripped out a routing layer for exactly this reason: the model was already better at it once the tool descriptions stopped being lazy.
The guardrails advice in section 4 worries me, though. Swapping regex filters for prompt instructions isn't removing overengineering, it's removing a layer of defense. Regex walls that flag "terminate" are bad, sure. But prompt-based guardrails fail in a fundamentally different way than structural ones. A regex filter fails loudly -- the user complains, you see it in logs, you fix the rule. A jailbroken system prompt fails silently. The model leaks the internal API schema or answers a question it was told to decline, and nobody notices until the wrong user finds it. The article frames this as "rule-based filters vs. prompt instructions" when production systems that handle real user data need both -- structural constraints for hard boundaries, prompt instructions for the nuanced stuff regex can't touch. Would you really rely on a system prompt alone to enforce "never reveal other customers' data" in a multi-tenant agent?
Great callout. You're right that I oversimplified this one.
To be clear, I'm not saying ditch structural guardrails entirely. For hard boundaries like PII filtering, prompt injection detection, multi-tenant data isolation, you absolutely want code-level enforcement. Those should never depend on the model "choosing" to behave.
What I was getting at is the pattern where teams build such aggressive rule-based systems that the agent can barely function. Blocking "terminate" when someone asks about terminating a contract. Flagging "explosive" in "explosive growth." At that point your guardrails are creating more problems than they solve.
The right approach is layered. Structural constraints for the stuff that must never happen (data leakage, PII exposure, injection attacks). Prompt instructions for the nuanced judgment calls that regex can't handle (topic boundaries, tone, graceful redirects). The model is already good at the second category. Let it do that job while your code handles the hard walls.
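A rough sketch of that layering (all names here are hypothetical, not a real API):

```typescript
// Hard wall: tenant isolation enforced in code, before any data reaches
// the model, regardless of what the prompt says
declare const db: {
  accounts: { findById(id: string): Promise<{ tenantId: string; plan: string }> };
}; // hypothetical database client

function requireSameTenant(sessionTenantId: string, resourceTenantId: string): void {
  if (sessionTenantId !== resourceTenantId) {
    throw new Error("Cross-tenant access denied");
  }
}

async function getAccountTool(sessionTenantId: string, accountId: string) {
  const account = await db.accounts.findById(accountId);
  requireSameTenant(sessionTenantId, account.tenantId);
  return account;
}

// Soft layer: the nuanced judgment calls live in the system prompt
const guardrailPrompt = `Stay within account management and billing.
Politely redirect anything else and explain what you CAN help with.`;
```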
To answer your question directly: no, I would not rely on a system prompt alone for multi-tenant data isolation. That's a code-level problem. But I also wouldn't build a regex dictionary with 500 banned words to handle conversational boundaries. That's where the model shines.
I appreciate this practical approach to engineering with LLMs. A useful list of good practices and healthy reminders. Thanks for sharing!
Thanks Julien! Glad it resonated.
This is painfully accurate.
I keep catching myself building “agent logic” for things the LLM already does better out of the box.
Especially stuff like:
• intent parsing
• formatting / structuring output
• even basic reasoning steps
At some point you realize you’re not building an agent — you’re rebuilding a worse version of the model around it.
The biggest trap for me was thinking:
“more architecture = more reliability”
But in practice it often becomes:
more layers → more drift → harder debugging
What actually worked better:
• keeping flows deterministic where possible
• using LLM as a component, not the whole system
• only adding “agent behavior” when the problem is truly dynamic
Curious — where do you personally draw the line between
“this needs an agent” vs “this is just a workflow”?
Spot on with "more architecture = more reliability" being a trap. I fell into that exact thinking early on building agents in TypeScript.
For your question about agent vs workflow, here's the line I use now.
If you can draw the logic as a flowchart with known branches, it's a workflow. Use deterministic code. The LLM call is just one step in the pipeline handling the parts that need language understanding.
If the next step genuinely depends on what the model discovers at runtime and you can't predict the branches ahead of time, that's when you need an agent. The model has to decide what to do next based on what it just learned.
In practice, like 80% of what people build as "agents" is really just a workflow with an LLM step in the middle. And that's fine. Workflows are easier to test, easier to debug, and way more predictable.
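If it helps, here's the contrast in code (`callModel`, `executeTool`, and the lookups are hypothetical helpers, not a real API):

```typescript
declare function callModel(prompt: string): Promise<string>;   // hypothetical
declare function executeTool(action: string): Promise<string>; // hypothetical
declare function lookupBilling(q: string): Promise<string>;    // hypothetical
declare function lookupTechnical(q: string): Promise<string>;  // hypothetical

// Workflow: branches are known at design time; the LLM is one step
async function handleTicket(message: string): Promise<string> {
  const intent = await callModel(`Classify as billing or technical: ${message}`);
  return intent.includes("billing") ? lookupBilling(message) : lookupTechnical(message);
}

// Agent: the model decides the next step based on what it just learned
async function runAgentLoop(message: string): Promise<string> {
  let context = message;
  while (true) {
    const action = await callModel(
      `Context:\n${context}\nPick a tool to call, or reply "ANSWER: ..."`
    );
    if (action.startsWith("ANSWER:")) return action.slice("ANSWER:".length).trim();
    context += "\n" + (await executeTool(action)); // branch unknown until runtime
  }
}
```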
The real unlock for me was treating these as a spectrum. Start with a deterministic workflow. Let the LLM handle the fuzzy parts. Only hand over control flow to the model when the problem actually demands it.
This “flowchart vs unknown branches” framing is really solid.
One thing I kept running into though — things that start as workflows slowly drift into agent territory.
You add one “just let the model decide here”… then another… and suddenly your nice deterministic pipeline turns into something you can’t fully reason about anymore.
Feels like the real challenge isn’t just choosing workflow vs agent —
it’s preventing workflows from silently becoming agents over time.
This one! Realized this when memory was confusing the user. Changed it to memory + a well-defined prompt and it performed much better.
Great points all over!
And there is also the case where you need to run millions of prompts per month at scale and have to use cheaper models to not bankrupt your co… then prompt chaining, sub-agents, etc. become relevant again.
The "LLM already handles it" point is the one that took me longest to internalize. The cleanest test I use: if I can describe the validation rule in a sentence the LLM could understand, I try the prompt-only version first and only add code when it measurably fails.
Where I still write explicit code: anything where a wrong output has an irreversible side effect (money movement, deletes, external API calls with cost). The LLM doesn't need to decide if an SQL query is read-only — the SQL parser does that deterministically. Judgment vs ambiguity is usually the right dividing line.
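For example, the read-only check can be a few deterministic lines, assuming the node-sql-parser npm package (exact AST shape may vary by version):

```typescript
// Deterministic check: let a SQL parser, not the LLM, decide if a
// statement is read-only
import { Parser } from "node-sql-parser";

function isReadOnly(sql: string): boolean {
  const parser = new Parser();
  const ast = parser.astify(sql); // throws on unparseable SQL, a safe default
  const statements = Array.isArray(ast) ? ast : [ast];
  // Only plain SELECTs pass; anything with side effects gets rejected
  return statements.every((stmt) => stmt.type === "select");
}
```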
Agree with most of this, but I want to push back on #5 (separate memory systems being overengineered).
For short-lived agents — a chatbot session, a one-off automation — yes, conversation history plus a few key facts is enough. But for agents that run continuously over hundreds or thousands of sessions, conversation history literally doesn't exist between sessions. The context window resets every time.
I run an autonomous agent that's done 1100+ sessions. Without a separate memory system, session 500 has zero knowledge of what happened in sessions 1-499. No context about contacts, decisions, what worked, what failed. Every session would start from scratch.
The memory system I ended up with isn't complex: a knowledge graph with typed nodes (contacts, facts, sessions, insights) and a Python API with a few query patterns (search, contact lookup, fact retrieval). Maybe 500 lines of code total. The ROI is massive because the alternative is the agent rediscovering the same information every session.
Where I fully agree: don't build the memory system first. I started with flat markdown files and only migrated to a graph after the flat files became unmanageable around session 300. If your agent only runs 10 sessions, markdown is fine. The mistake is building graph infrastructure on day one. The other mistake is assuming you'll never need it.
The broader principle holds though — start simple, add complexity only when you hit a real wall. I just want to flag that the wall comes sooner than expected when agents are long-running.
This is perhaps one of the most useful articles on the topic of AI programming - applied information based on practice, not just philosophical reasoning.
Thank you!
Appreciate that. Yeah I tried to keep it to stuff I've actually hit in production, not theory. Glad it came through that way.