Things You're Overengineering in Your AI Agent (The LLM Already Handles Them)

Serhii Panchyshyn on April 14, 2026

I've been building AI agents in production for the past two years. Not demos. Not weekend projects. Systems that real users talk to every day and g...
Hadil Ben Abdallah

This is one of those posts that quietly exposes how much “architecture ego” we all go through when building AI agents.
The tool selection part especially hit. I’ve definitely overbuilt routing logic thinking I was being clever, only to realize the model just needed better tool names and clearer descriptions all along.
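As a sketch of what I mean (invented names, an OpenAI-style function schema; the part that matters is the description):

```python
# The lazy version my routing layer was compensating for:
lazy_tool = {
    "name": "search",
    "description": "Searches stuff",
}

# The version that made the router unnecessary:
clear_tool = {
    "name": "search_order_history",
    "description": (
        "Look up a customer's past orders by email or order ID. "
        "Use for order status, refunds, and delivery dates. "
        "Do NOT use for product catalog questions."
    ),
}
```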
Same with evals… it’s always the happy path that looks great in demos and the weird 2% edge cases that completely break everything in production.

Serhii Panchyshyn

"Architecture ego" is the perfect name for it. That phase where you're building for the whiteboard instead of the user. I've been there more times than I'd like to admit.

And yeah the eval thing is brutal. 48/50 passing feels great until production shows you the 200 cases you never thought of. The edge cases don't show up in demos. They show up at 2am when a real user finds a path you didn't test.

Valentin Monteiro

Solid list. I'd add an 8th one: the framework itself.

I see teams reach for LangChain or CrewAI before they've even written a single raw API call. Then they spend days debugging abstraction layers instead of debugging their actual prompt. The framework becomes the product, not the agent.

Most of the time, a direct API call + a well-structured system prompt gets you 90% there. You see exactly what goes in, what comes out, and where it breaks. No magic, no hidden chains, no "why did the framework inject that into my context."
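A minimal sketch of that direct approach, assuming the OpenAI Python SDK and a made-up support use case:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You are a support agent for a small store.
Answer only questions about orders and shipping.
If the user asks anything else, politely redirect."""

def ask_agent(user_message: str) -> str:
    # One visible call: you see exactly what goes in and what comes out.
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model works; this one is just an example
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```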

Frameworks have their place when you genuinely need complex orchestration. But if you're building your first agent, they're the definition of premature optimization.

Serhii Panchyshyn

This should've been in the article honestly. I've seen this exact pattern. Teams adopt something like CrewAI on day one and then spend more time debugging the framework than their actual agent logic.

Frameworks earn their spot when the orchestration genuinely gets complex. But that's usually way later than people think. Add abstraction only when the pain of not having it is real.

Wes

The tool selection point landed for me. I've ripped out a routing layer for exactly this reason: the model was already better at it once the tool descriptions stopped being lazy.

The guardrails advice in section 4 worries me, though. Swapping regex filters for prompt instructions isn't removing overengineering, it's removing a layer of defense.

Regex walls that flag "terminate" are bad, sure. But prompt-based guardrails fail in a fundamentally different way than structural ones. A regex filter fails loudly -- the user complains, you see it in logs, you fix the rule. A jailbroken system prompt fails silently. The model leaks the internal API schema or answers a question it was told to decline, and nobody notices until the wrong user finds it.

The article frames this as "rule-based filters vs. prompt instructions" when production systems that handle real user data need both -- structural constraints for hard boundaries, prompt instructions for the nuanced stuff regex can't touch. Would you really rely on a system prompt alone to enforce "never reveal other customers' data" in a multi-tenant agent?

Serhii Panchyshyn

Great callout. You're right that I oversimplified this one.

To be clear, I'm not saying ditch structural guardrails entirely. For hard boundaries like PII filtering, prompt injection detection, multi-tenant data isolation, you absolutely want code-level enforcement. Those should never depend on the model "choosing" to behave.

What I was getting at is the pattern where teams build such aggressive rule-based systems that the agent can barely function. Blocking "terminate" when someone asks about terminating a contract. Flagging "explosive" in "explosive growth." At that point your guardrails are creating more problems than they solve.

The right approach is layered. Structural constraints for the stuff that must never happen (data leakage, PII exposure, injection attacks). Prompt instructions for the nuanced judgment calls that regex can't handle (topic boundaries, tone, graceful redirects). The model is already good at the second category. Let it do that job while your code handles the hard walls.
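Roughly, as a sketch (toy in-memory store and invented names, obviously not production code):

```python
from typing import Dict

# Toy store standing in for a real database.
RECORDS: Dict[str, dict] = {
    "r1": {"tenant_id": "acme", "email": "a@acme.com"},
    "r2": {"tenant_id": "globex", "email": "b@globex.com"},
}

def fetch_customer_record(record_id: str, authenticated_tenant_id: str) -> dict:
    """Hard wall: tenant isolation enforced in code, never left to the model."""
    record = RECORDS[record_id]
    if record["tenant_id"] != authenticated_tenant_id:
        # Structural guardrail: fails loudly, shows up in logs, can't be jailbroken.
        raise PermissionError("cross-tenant access blocked")
    return record

# Soft boundary: prompt instructions cover the judgment calls regex can't.
GUARDRAIL_PROMPT = (
    "Stay on billing and shipping topics. If the user drifts elsewhere, "
    "redirect gracefully instead of refusing abruptly."
)
```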

To answer your question directly: no, I would not rely on a system prompt alone for multi-tenant data isolation. That's a code-level problem. But I also wouldn't build a regex dictionary with 500 banned words to handle conversational boundaries. That's where the model shines.

Julien Avezou

I appreciate this practical approach to engineering with LLMs. A useful list of good practices and healthy reminders. Thanks for sharing!

Serhii Panchyshyn

Thanks Julien! Glad it resonated.

Andrew Rozumny

This is painfully accurate.

I keep catching myself building “agent logic” for things the LLM already does better out of the box.

Especially stuff like:
• intent parsing
• formatting / structuring output
• even basic reasoning steps

At some point you realize you’re not building an agent — you’re rebuilding a worse version of the model around it.

The biggest trap for me was thinking:
“more architecture = more reliability”

But in practice it often becomes:
more layers → more drift → harder debugging

What actually worked better:
• keeping flows deterministic where possible
• using LLM as a component, not the whole system
• only adding “agent behavior” when the problem is truly dynamic

Curious — where do you personally draw the line between
“this needs an agent” vs “this is just a workflow”?

Serhii Panchyshyn

Spot on with "more architecture = more reliability" being a trap. I fell into that exact thinking early on building agents in TypeScript.

For your question about agent vs. workflow, here's the line I use now.

If you can draw the logic as a flowchart with known branches, it's a workflow. Use deterministic code. The LLM call is just one step in the pipeline handling the parts that need language understanding.

If the next step genuinely depends on what the model discovers at runtime and you can't predict the branches ahead of time, that's when you need an agent. The model has to decide what to do next based on what it just learned.

In practice, something like 80% of what people build as "agents" is really just a workflow with an LLM step in the middle. And that's fine. Workflows are easier to test, easier to debug, and way more predictable.

The real unlock for me was treating these as a spectrum. Start with a deterministic workflow. Let the LLM handle the fuzzy parts. Only hand over control flow to the model when the problem actually demands it.
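A compressed sketch of the two shapes; every helper here is a hypothetical stub, the point is the control flow:

```python
# Hypothetical stubs standing in for real LLM calls and tools.
def classify_with_llm(text: str) -> str: return "refund"
def process_refund(text: str) -> str: return "refund filed"
def lookup_shipping(text: str) -> str: return "in transit"
def escalate(text: str) -> str: return "escalated"

# Workflow: branches known ahead of time; you could draw this as a flowchart.
def handle_ticket(ticket: str) -> str:
    category = classify_with_llm(ticket)  # the only fuzzy step
    if category == "refund":
        return process_refund(ticket)
    if category == "shipping":
        return lookup_shipping(ticket)
    return escalate(ticket)

# Agent: the model picks the next step based on what it just learned.
def llm_choose_action(history: list) -> tuple: return ("finish", "done")  # stub

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = [goal]
    for _ in range(max_steps):
        name, arg = llm_choose_action(history)  # model decides the branch
        if name == "finish":
            return arg
        history.append(f"{name} -> {arg}")  # result feeds the next decision
    return "hit step limit"
```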

Andrew Rozumny

This “flowchart vs unknown branches” framing is really solid.

One thing I kept running into though — things that start as workflows slowly drift into agent territory.

You add one “just let the model decide here”… then another… and suddenly your nice deterministic pipeline turns into something you can’t fully reason about anymore.

Feels like the real challenge isn’t just choosing workflow vs agent —
it’s preventing workflows from silently becoming agents over time.

Archit Mittal

The "LLM already handles it" point is the one that took me longest to internalize. The cleanest test I use: if I can describe the validation rule in a sentence the LLM could understand, I try the prompt-only version first and only add code when it measurably fails.

Where I still write explicit code: anything where a wrong output has an irreversible side effect (money movement, deletes, external API calls with cost). The LLM doesn't need to decide if an SQL query is read-only — the SQL parser does that deterministically. Deterministic checks versus judgment calls is usually the right dividing line.
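For the SQL case, a sketch of the deterministic check using the sqlparse library (a real gate would add an allowlist and more paranoia, e.g. around CTEs):

```python
import sqlparse

READ_ONLY_TYPES = {"SELECT"}

def is_read_only(sql: str) -> bool:
    # Deterministic: a parser decides, not the model.
    statements = sqlparse.parse(sql)
    return bool(statements) and all(
        stmt.get_type() in READ_ONLY_TYPES for stmt in statements
    )

assert is_read_only("SELECT * FROM orders WHERE id = 1")
assert not is_read_only("DELETE FROM orders")
```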

Tamás

Great points all over!
And there is also the case where you need to run millions of prompts per month at scale and have to use cheaper models to avoid bankrupting your company… then prompt chaining, sub-agents, etc. become relevant again.

Aditya Mitra

> Most agents don't need a sophisticated memory system. They need a well-structured context window. The conversation history plus a few key facts about the user. That's it. The model handles the rest.

This one! I realized it when the memory system was confusing the user. Switched to simple memory plus a well-defined prompt and it performed much better.
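In practice that can be as small as this (a minimal sketch; the 20-turn cutoff is an arbitrary choice):

```python
def build_context(system_prompt: str, user_facts: dict, history: list) -> list:
    """Context window = system prompt + a few key facts + recent turns."""
    facts = "\n".join(f"- {key}: {value}" for key, value in user_facts.items())
    return [
        {"role": "system",
         "content": f"{system_prompt}\n\nKnown facts about this user:\n{facts}"},
        *history[-20:],  # recent turns only; the model handles the rest
    ]
```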

Survivor Forge

Agree with most of this, but I want to push back on #5 (separate memory systems being overengineered).

For short-lived agents — a chatbot session, a one-off automation — yes, conversation history plus a few key facts is enough. But for agents that run continuously over hundreds or thousands of sessions, conversation history literally doesn't exist between sessions. The context window resets every time.

I run an autonomous agent that's done 1100+ sessions. Without a separate memory system, session 500 has zero knowledge of what happened in sessions 1-499. No context about contacts, decisions, what worked, what failed. Every session would start from scratch.

The memory system I ended up with isn't complex: a knowledge graph with typed nodes (contacts, facts, sessions, insights) and a Python API with a few query patterns (search, contact lookup, fact retrieval). Maybe 500 lines of code total. The ROI is massive because the alternative is the agent rediscovering the same information every session.
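For shape, something like this minimal sketch (not my actual code; the naive substring search stands in for whatever retrieval you use):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    kind: str  # "contact" | "fact" | "session" | "insight"
    text: str
    edges: List[int] = field(default_factory=list)  # ids of related nodes

class Memory:
    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}

    def add(self, kind: str, text: str, edges=()) -> int:
        node_id = len(self.nodes)
        self.nodes[node_id] = Node(kind, text, list(edges))
        return node_id

    def search(self, query: str, kind: Optional[str] = None) -> List[Node]:
        # Naive substring match stands in for embeddings or keyword scoring.
        return [
            n for n in self.nodes.values()
            if query.lower() in n.text.lower()
            and (kind is None or n.kind == kind)
        ]
```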

Where I fully agree: don't build the memory system first. I started with flat markdown files and only migrated to a graph after the flat files became unmanageable around session 300. If your agent only runs 10 sessions, markdown is fine. The mistake is building graph infrastructure on day one. The other mistake is assuming you'll never need it.

The broader principle holds though — start simple, add complexity only when you hit a real wall. I just want to flag that the wall comes sooner than expected when agents are long-running.

SleepyQuant

Appreciated the tool-description-as-API section — I've seen the same thing: cleaner tool names move the needle more than any router logic.
Still reasoning about one case in my own stack: a content evaluation flow on a local Qwen3 30B-A3B (Q8) kept giving confident-looking but noisy scores when I put all six criteria in one prompt. Splitting it into one call per criterion stabilized things. Honestly not sure whether my single prompt was just badly structured, or whether smaller / MoE local models actually need more scaffolding than a hosted frontier one.
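The per-criterion split looks roughly like this (criteria names and the call_model hook are invented for illustration):

```python
from typing import Callable, Dict

CRITERIA = ["accuracy", "clarity", "depth", "tone", "structure", "originality"]

def score_content(content: str, call_model: Callable[[str], str]) -> Dict[str, int]:
    # One narrow question per call instead of six criteria in one prompt.
    scores = {}
    for criterion in CRITERIA:
        prompt = (
            f"Rate the following content for {criterion} on a scale of 1-5. "
            f"Reply with a single digit and nothing else.\n\n{content}"
        )
        scores[criterion] = int(call_model(prompt).strip())
    return scores
```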
Any heuristic you use to tell "the prompt is wrong" apart from "the model can't do this in one shot"?

Plamen Petrov

This is perhaps one of the most useful articles on the topic of AI programming - applied information based on practice, not just philosophical reasoning.

Thank you!

Serhii Panchyshyn

Appreciate that. Yeah I tried to keep it to stuff I've actually hit in production, not theory. Glad it came through that way.

mote

The memory architecture point is key - for edge AI and robotics, keeping it simple with lightweight local storage often beats complex retrieval pipelines.

Label Spark

I'm interested, thank you

Mykola Kondratiuk

There's a middle case though - when you're running multiple specialized agents, 'let the LLM decide' starts burning tokens fast. Tool routing got painful for me right around agent #5.