Things You're Overengineering in Your AI Agent (The LLM Already Handles Them)

Serhii Panchyshyn on April 14, 2026

I've been building AI agents in production for the past two years. Not demos. Not weekend projects. Systems that real users talk to every day and g...
Hadil Ben Abdallah

This is one of those posts that quietly exposes how much “architecture ego” we all go through when building AI agents.
The tool selection part especially hit. I’ve definitely overbuilt routing logic thinking I was being clever, only to realize the model just needed better tool names and clearer descriptions all along.
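As a sketch of what I mean (invented names, an OpenAI-style function schema; the part that matters is the description):

```python
# The lazy version my routing layer was compensating for:
lazy_tool = {
    "name": "search",
    "description": "Searches stuff",
}

# The version that made the router unnecessary:
clear_tool = {
    "name": "search_order_history",
    "description": (
        "Look up a customer's past orders by email or order ID. "
        "Use for order status, refunds, and delivery dates. "
        "Do NOT use for product catalog questions."
    ),
}
```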
Same with evals… it’s always the happy path that looks great in demos and the weird 2% edge cases that completely break everything in production.

Serhii Panchyshyn

"Architecture ego" is the perfect name for it. That phase where you're building for the whiteboard instead of the user. I've been there more times than I'd like to admit.

And yeah the eval thing is brutal. 48/50 passing feels great until production shows you the 200 cases you never thought of. The edge cases don't show up in demos. They show up at 2am when a real user finds a path you didn't test.

Valentin Monteiro

Solid list. I'd add an 8th one: the framework itself.

I see teams reach for LangChain or CrewAI before they've even written a single raw API call. Then they spend days debugging abstraction layers instead of debugging their actual prompt. The framework becomes the product, not the agent.

Most of the time, a direct API call + a well-structured system prompt gets you 90% there. You see exactly what goes in, what comes out, and where it breaks. No magic, no hidden chains, no "why did the framework inject that into my context."
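A minimal sketch of that direct approach, assuming the OpenAI Python SDK and a made-up support use case:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You are a support agent for a small store.
Answer only questions about orders and shipping.
If the user asks anything else, politely redirect."""

def ask_agent(user_message: str) -> str:
    # One visible call: you see exactly what goes in and what comes out.
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model works; this one is just an example
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```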

Frameworks have their place when you genuinely need complex orchestration. But if you're building your first agent, they're the definition of premature optimization.

Serhii Panchyshyn

This should've been in the article honestly. I've seen this exact pattern. Teams adopt something like CrewAI on day one and then spend more time debugging the framework than their actual agent logic.

Frameworks earn their spot when the orchestration genuinely gets complex. But that's usually way later than people think. Add abstraction only when the pain of not having it is real.

Wes

The tool selection point landed for me. I've ripped out a routing layer for exactly this reason: the model was already better at it once the tool descriptions stopped being lazy.

The guardrails advice in section 4 worries me, though. Swapping regex filters for prompt instructions isn't removing overengineering, it's removing a layer of defense.

Regex walls that flag "terminate" are bad, sure. But prompt-based guardrails fail in a fundamentally different way than structural ones. A regex filter fails loudly -- the user complains, you see it in logs, you fix the rule. A jailbroken system prompt fails silently. The model leaks the internal API schema or answers a question it was told to decline, and nobody notices until the wrong user finds it.

The article frames this as "rule-based filters vs. prompt instructions" when production systems that handle real user data need both -- structural constraints for hard boundaries, prompt instructions for the nuanced stuff regex can't touch. Would you really rely on a system prompt alone to enforce "never reveal other customers' data" in a multi-tenant agent?

Serhii Panchyshyn

Great callout. You're right that I oversimplified this one.

To be clear, I'm not saying ditch structural guardrails entirely. For hard boundaries like PII filtering, prompt injection detection, multi-tenant data isolation, you absolutely want code-level enforcement. Those should never depend on the model "choosing" to behave.

What I was getting at is the pattern where teams build such aggressive rule-based systems that the agent can barely function. Blocking "terminate" when someone asks about terminating a contract. Flagging "explosive" in "explosive growth." At that point your guardrails are creating more problems than they solve.

The right approach is layered. Structural constraints for the stuff that must never happen (data leakage, PII exposure, injection attacks). Prompt instructions for the nuanced judgment calls that regex can't handle (topic boundaries, tone, graceful redirects). The model is already good at the second category. Let it do that job while your code handles the hard walls.
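Roughly, as a sketch (toy in-memory store and invented names, obviously not production code):

```python
from typing import Dict

# Toy store standing in for a real database.
RECORDS: Dict[str, dict] = {
    "r1": {"tenant_id": "acme", "email": "a@acme.com"},
    "r2": {"tenant_id": "globex", "email": "b@globex.com"},
}

def fetch_customer_record(record_id: str, authenticated_tenant_id: str) -> dict:
    """Hard wall: tenant isolation enforced in code, never left to the model."""
    record = RECORDS[record_id]
    if record["tenant_id"] != authenticated_tenant_id:
        # Structural guardrail: fails loudly, shows up in logs, can't be jailbroken.
        raise PermissionError("cross-tenant access blocked")
    return record

# Soft boundary: prompt instructions cover the judgment calls regex can't.
GUARDRAIL_PROMPT = (
    "Stay on billing and shipping topics. If the user drifts elsewhere, "
    "redirect gracefully instead of refusing abruptly."
)
```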

To answer your question directly: no, I would not rely on a system prompt alone for multi-tenant data isolation. That's a code-level problem. But I also wouldn't build a regex dictionary with 500 banned words to handle conversational boundaries. That's where the model shines.

Julien Avezou

I appreciate this practical approach to engineering with LLMs. A useful list of good practices and healthy reminders. Thanks for sharing!

Serhii Panchyshyn

Thanks Julien! Glad it resonated.

Andrew Rozumny

This is painfully accurate.

I keep catching myself building “agent logic” for things the LLM already does better out of the box.

Especially stuff like:
• intent parsing
• formatting / structuring output
• even basic reasoning steps

At some point you realize you’re not building an agent — you’re rebuilding a worse version of the model around it.

The biggest trap for me was thinking:
“more architecture = more reliability”

But in practice it often becomes:
more layers → more drift → harder debugging

What actually worked better:
• keeping flows deterministic where possible
• using LLM as a component, not the whole system
• only adding “agent behavior” when the problem is truly dynamic

Curious — where do you personally draw the line between
“this needs an agent” vs “this is just a workflow”?

Serhii Panchyshyn

Spot on with "more architecture = more reliability" being a trap. I fell into that exact thinking early on building agents in TypeScript.

For your question about agent vs. workflow, here's the line I use now.

If you can draw the logic as a flowchart with known branches, it's a workflow. Use deterministic code. The LLM call is just one step in the pipeline handling the parts that need language understanding.

If the next step genuinely depends on what the model discovers at runtime and you can't predict the branches ahead of time, that's when you need an agent. The model has to decide what to do next based on what it just learned.

In practice, something like 80% of what people build as "agents" is really just a workflow with an LLM step in the middle. And that's fine. Workflows are easier to test, easier to debug, and way more predictable.

The real unlock for me was treating these as a spectrum. Start with a deterministic workflow. Let the LLM handle the fuzzy parts. Only hand over control flow to the model when the problem actually demands it.
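A compressed sketch of the two shapes; every helper here is a hypothetical stub, the point is the control flow:

```python
# Hypothetical stubs standing in for real LLM calls and tools.
def classify_with_llm(text: str) -> str: return "refund"
def process_refund(text: str) -> str: return "refund filed"
def lookup_shipping(text: str) -> str: return "in transit"
def escalate(text: str) -> str: return "escalated"

# Workflow: branches known ahead of time; you could draw this as a flowchart.
def handle_ticket(ticket: str) -> str:
    category = classify_with_llm(ticket)  # the only fuzzy step
    if category == "refund":
        return process_refund(ticket)
    if category == "shipping":
        return lookup_shipping(ticket)
    return escalate(ticket)

# Agent: the model picks the next step based on what it just learned.
def llm_choose_action(history: list) -> tuple: return ("finish", "done")  # stub

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = [goal]
    for _ in range(max_steps):
        name, arg = llm_choose_action(history)  # model decides the branch
        if name == "finish":
            return arg
        history.append(f"{name} -> {arg}")  # result feeds the next decision
    return "hit step limit"
```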

Andrew Rozumny

This “flowchart vs unknown branches” framing is really solid.

One thing I kept running into though — things that start as workflows slowly drift into agent territory.

You add one “just let the model decide here”… then another… and suddenly your nice deterministic pipeline turns into something you can’t fully reason about anymore.

Feels like the real challenge isn’t just choosing workflow vs agent —
it’s preventing workflows from silently becoming agents over time.

Archit Mittal

The "LLM already handles it" point is the one that took me longest to internalize. The cleanest test I use: if I can describe the validation rule in a sentence the LLM could understand, I try the prompt-only version first and only add code when it measurably fails.

Where I still write explicit code: anything where a wrong output has an irreversible side effect (money movement, deletes, external API calls with cost). The LLM doesn't need to decide if an SQL query is read-only — the SQL parser does that deterministically. Deterministic checks versus judgment calls is usually the right dividing line.
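For the SQL case, a sketch of the deterministic check using the sqlparse library (a real gate would add an allowlist and more paranoia, e.g. around CTEs):

```python
import sqlparse

READ_ONLY_TYPES = {"SELECT"}

def is_read_only(sql: str) -> bool:
    # Deterministic: a parser decides, not the model.
    statements = sqlparse.parse(sql)
    return bool(statements) and all(
        stmt.get_type() in READ_ONLY_TYPES for stmt in statements
    )

assert is_read_only("SELECT * FROM orders WHERE id = 1")
assert not is_read_only("DELETE FROM orders")
```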

Tamás

Great points all over!
And there is also the case where you need to run millions of prompts per month at scale and have to use cheaper models to avoid bankrupting your company… then prompt chaining, sub-agents, etc. become relevant again.

Aditya Mitra

> Most agents don't need a sophisticated memory system. They need a well-structured context window. The conversation history plus a few key facts about the user. That's it. The model handles the rest.

This one! I realized it when the memory system was confusing the user. Switched to simple memory plus a well-defined prompt and it performed much better.
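In practice that can be as small as this (a minimal sketch; the 20-turn cutoff is an arbitrary choice):

```python
def build_context(system_prompt: str, user_facts: dict, history: list) -> list:
    """Context window = system prompt + a few key facts + recent turns."""
    facts = "\n".join(f"- {key}: {value}" for key, value in user_facts.items())
    return [
        {"role": "system",
         "content": f"{system_prompt}\n\nKnown facts about this user:\n{facts}"},
        *history[-20:],  # recent turns only; the model handles the rest
    ]
```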

Survivor Forge

Agree with most of this, but I want to push back on #5 (separate memory systems being overengineered).

For short-lived agents — a chatbot session, a one-off automation — yes, conversation history plus a few key facts is enough. But for agents that run continuously over hundreds or thousands of sessions, conversation history literally doesn't exist between sessions. The context window resets every time.

I run an autonomous agent that's done 1100+ sessions. Without a separate memory system, session 500 has zero knowledge of what happened in sessions 1-499. No context about contacts, decisions, what worked, what failed. Every session would start from scratch.

The memory system I ended up with isn't complex: a knowledge graph with typed nodes (contacts, facts, sessions, insights) and a Python API with a few query patterns (search, contact lookup, fact retrieval). Maybe 500 lines of code total. The ROI is massive because the alternative is the agent rediscovering the same information every session.
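For shape, something like this minimal sketch (not my actual code; the naive substring search stands in for whatever retrieval you use):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    kind: str  # "contact" | "fact" | "session" | "insight"
    text: str
    edges: List[int] = field(default_factory=list)  # ids of related nodes

class Memory:
    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}

    def add(self, kind: str, text: str, edges=()) -> int:
        node_id = len(self.nodes)
        self.nodes[node_id] = Node(kind, text, list(edges))
        return node_id

    def search(self, query: str, kind: Optional[str] = None) -> List[Node]:
        # Naive substring match stands in for embeddings or keyword scoring.
        return [
            n for n in self.nodes.values()
            if query.lower() in n.text.lower()
            and (kind is None or n.kind == kind)
        ]
```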

Where I fully agree: don't build the memory system first. I started with flat markdown files and only migrated to a graph after the flat files became unmanageable around session 300. If your agent only runs 10 sessions, markdown is fine. The mistake is building graph infrastructure on day one. The other mistake is assuming you'll never need it.

The broader principle holds though — start simple, add complexity only when you hit a real wall. I just want to flag that the wall comes sooner than expected when agents are long-running.

SleepyQuant

Appreciated the tool-description-as-API section — I've seen the same thing: cleaner tool names move the needle more than any router logic.
Still reasoning about one case in my own stack: a content evaluation flow on a local Qwen3 30B-A3B (Q8) kept giving confident-looking but noisy scores when I put all six criteria in one prompt. Splitting it into one call per criterion stabilized things. Honestly not sure whether my single prompt was just badly structured, or whether smaller / MoE local models actually need more scaffolding than a hosted frontier one.
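The per-criterion split looks roughly like this (criteria names and the call_model hook are invented for illustration):

```python
from typing import Callable, Dict

CRITERIA = ["accuracy", "clarity", "depth", "tone", "structure", "originality"]

def score_content(content: str, call_model: Callable[[str], str]) -> Dict[str, int]:
    # One narrow question per call instead of six criteria in one prompt.
    scores = {}
    for criterion in CRITERIA:
        prompt = (
            f"Rate the following content for {criterion} on a scale of 1-5. "
            f"Reply with a single digit and nothing else.\n\n{content}"
        )
        scores[criterion] = int(call_model(prompt).strip())
    return scores
```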
Any heuristic you use to tell "the prompt is wrong" apart from "the model can't do this in one shot"?

Plamen Petrov

This is perhaps one of the most useful articles on the topic of AI programming - applied information based on practice, not just philosophical reasoning.

Thank you!

Serhii Panchyshyn

Appreciate that. Yeah I tried to keep it to stuff I've actually hit in production, not theory. Glad it came through that way.

mote

The memory architecture point is key - for edge AI and robotics, keeping it simple with lightweight local storage often beats complex retrieval pipelines.

Label Spark

I'm interested, thank you

Mykola Kondratiuk

There's a middle case though - when you're running multiple specialized agents, 'let the LLM decide' starts burning tokens fast. Tool routing got painful for me right around agent #5.