There's a belief deeply embedded in the dev community that LLMs are black boxes we need to "tame" with infrastructure. That if you don't wrap the model in five layers of your own logic, you'll lose control. That manual retry, hand-rolled context management, artisanal response parsers — all of it is necessary because "you can't trust the model".
With all due respect: that's mostly wrong.
I'm not saying LLMs are perfect. I'm saying something more specific and more uncomfortable: we're re-implementing, in fragile and hard-to-maintain code, functionality the model already has built in. And we're doing it because it gives us a sense of control. That feeling is a comfortable lie.
I know exactly what I'm talking about because I did it. I opened my repo this week and found the corpse.
Over-Engineering AI Agents: The Damage Inventory
Context: I have an agent in production that processes queries, maintains multi-turn conversations, and calls external tools. A real system, with real users, processing real load. I built it eight months ago when I was just starting to understand how agents actually work.
This week I read a Dev.to post about things you over-engineer in your agents. I went straight to the repo. What I found was basically a collection of my own insecurities turned into code.
Sin #1: The Hand-Rolled Retry System
This one hurts the most because it has tests. Tests I was proud of. Look at this:
// What I wrote 8 months ago — 87 lines to do this
class LLMRetryManager {
  private maxAttempts: number;
  private backoffMs: number;
  private contextWindow: ConversationContext[];

  constructor(config: RetryConfig) {
    this.maxAttempts = config.maxAttempts ?? 3;
    this.backoffMs = config.backoffMs ?? 1000;
    this.contextWindow = [];
  }

  // Handled context trimming by hand
  private trimContext(messages: Message[]): Message[] {
    const MAX_TOKENS = 4000; // hardcoded, obviously
    let totalTokens = 0;
    const trimmed: Message[] = [];
    // Counted tokens in a completely wrong way
    for (const msg of messages.reverse()) { // .reverse() also mutates the caller's array — bonus bug
      const estimatedTokens = msg.content.length / 4; // 💀
      if (totalTokens + estimatedTokens < MAX_TOKENS) {
        trimmed.unshift(msg);
        totalTokens += estimatedTokens;
      } else {
        break; // just cut it off, no system prompt preservation
      }
    }
    return trimmed;
  }

  async execute(prompt: string, attempt = 0): Promise<string> {
    try {
      const context = this.trimContext(this.contextWindow);
      const response = await callLLM(context, prompt);
      // stored response in local context
      this.contextWindow.push({ role: 'assistant', content: response });
      return response;
    } catch (error) {
      if (attempt >= this.maxAttempts) throw error;
      // exponential backoff that wasn't actually exponential
      await sleep(this.backoffMs * attempt);
      return this.execute(prompt, attempt + 1);
    }
  }
}
Eighty-seven lines. With tests. To re-implement, badly, what the OpenAI SDK already does. To re-implement, worse, the context management the model handles when you pass the message array correctly.
What it should be:
// What replaced those 87 lines — with the modern SDK
import OpenAI from 'openai';
const client = new OpenAI();
// The SDK handles retry with real exponential backoff by default
// maxRetries is configurable — you don't need to re-implement it
async function callAgent(messages: OpenAI.ChatCompletionMessageParam[]) {
  // The model handles context — you just maintain the messages array
  // No need to count tokens by hand for the basic flow
  const response = await client.chat.completions.create({
    model: 'gpt-4o',
    messages, // full history, the model knows what to do with it
    // If you need token control, use max_tokens on the output
    // Not hand-trimming the input array
  });
  return response.choices[0].message.content;
}
// For retry specific to your business logic, sure — that makes sense
// But for network errors and rate limiting: the SDK already handles it
The difference isn't just line count. It's that my hand-rolled version had a bug in the trimming that cut the system prompt in long conversations. It took me three weeks to find that bug. The SDK doesn't have that bug because the people who wrote it understand the API better than I do.
Sin #2: The Structured Response Parser
I was asking the LLM for JSON. The LLM would sometimes send JSON wrapped in markdown. Reasonable solution: parse it. My actual solution: 140 lines of regex and fallbacks.
// The monster I built
function parseStructuredResponse(raw: string): AgentAction {
  // tried to strip markdown fences
  let cleaned = raw.replace(/```json\n?/g, '').replace(/```\n?/g, '');
  // tried to find JSON inside the text
  const jsonMatch = cleaned.match(/\{[\s\S]*\}/);
  if (jsonMatch) {
    cleaned = jsonMatch[0];
  }
  try {
    return JSON.parse(cleaned);
  } catch {
    // fallback to field-specific regex — I'm serious
    const action = cleaned.match(/"action":\s*"([^"]+)"/);
    const params = cleaned.match(/"params":\s*(\{[^}]+\})/);
    // ... 80 more lines of this
  }
}
The solution the ecosystem already had and I ignored: structured outputs.
import { zodResponseFormat } from 'openai/helpers/zod';
import { z } from 'zod';
// Define the schema once
const AgentActionSchema = z.object({
  action: z.enum(['search', 'calculate', 'respond', 'ask_clarification']),
  params: z.record(z.string()),
  reasoning: z.string().optional(),
});

// The model guarantees the structure — no parsing needed
const response = await client.beta.chat.completions.parse({
  model: 'gpt-4o-2024-08-06', // structured outputs require this model or newer
  messages,
  response_format: zodResponseFormat(AgentActionSchema, 'agent_action'),
});

// Already typed, already validated, already your object
const action = response.choices[0].message.parsed;
// action.action is 'search' | 'calculate' | 'respond' | 'ask_clarification'
// TypeScript knows it. No parsing. No regex.
One hundred and forty lines of brittle regex versus ten lines of schema. And the schema also documents the API contract.
Sin #3: Manual Tool Orchestration
This one is more subtle. When I implemented the tool calling system, I built an orchestration loop that decided when to call tools, how to interpret results, when to hand back to the model. Real business logic tangled up with plumbing the SDK already handles.
I touched on this sideways in the post about multi-agent systems as distributed systems problems — coordination complexity tends to accumulate in layers that didn't need to exist.
The modern SDK has client.beta.chat.completions.runTools() which handles the full loop. You register the tools, the model decides when to use them, the SDK runs the loop, you get the final response. You don't re-implement the protocol.
The Common Mistakes That Put You on This Path
Mistake 1: Legitimate distrust overgeneralized. There are things you can't trust the model on — complex mathematical reasoning, dates and times, post-cutoff information. But that legitimate distrust gets generalized to everything: "I can't trust the model to manage context", "I can't trust the model to structure output". That's where the over-engineering starts.
Mistake 2: Building for the model from two years ago. GPT-3.5 in 2022 needed a lot more scaffolding. Current models are fundamentally better at following instructions, maintaining structure, and handling context. The code you wrote to tame GPT-3.5 might be actively making your GPT-4o experience worse.
Mistake 3: Not reading the SDK changelog. The OpenAI, Anthropic, and Google SDKs have all updated massively in the last year. Functionality you had to implement by hand in 2023 exists as a method in the SDK in 2025. I didn't read it. I paid the price in lines of code.
Mistake 4: Premature orchestration. Similar to what I saw building the Buenos Aires bus sonification experiment — the temptation to build the coordination system before you have clear use cases. With agents: you build the retry framework, the state management, the orchestration — before you know what specific problem you're solving.
Mistake 5: Tests that validate the wrong complexity. My LLMRetryManager tests were good tests of bad code. They validated that my retry system worked as I designed it — not that the agent's behavior was correct. When I deleted the retry system and used the SDK's, the tests became obsolete. That should have told me something earlier.
This over-engineering pattern isn't exclusive to AI agents. I saw it in the Rust runtimes for TypeScript ecosystem too — sometimes the extra control layer introduces more problems than it solves.
FAQ: Over-Engineering in AI Agents
When DOES it make sense to have your own retry system?
When your retry logic is specific to business domain, not to the network. The SDK handles rate limits and transient network errors. You handle: "if the model says it doesn't have enough information, I query the database and retry". That logic is yours. The other kind belongs to the SDK.
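A sketch of that domain-level retry, the kind of loop that IS yours to write. `askModel` and `fetchExtraContext` are stand-ins for your own functions, and the `INSUFFICIENT_INFO` sentinel is an assumption about how your prompt asks the model to signal missing data:

```typescript
// Domain retry: enrich context from your own database when the model
// says it can't answer, then ask again. Network retries stay in the SDK.
type AskModel = (query: string, context: string[]) => Promise<string>;
type FetchContext = (query: string) => Promise<string>;

async function answerWithEnrichment(
  query: string,
  askModel: AskModel,
  fetchExtraContext: FetchContext,
  maxDomainRetries = 2,
): Promise<string> {
  const context: string[] = [];
  for (let i = 0; i <= maxDomainRetries; i++) {
    const answer = await askModel(query, context);
    // Business rule: the prompt instructs the model to emit this sentinel
    // when it lacks information. That contract is ours, not the SDK's.
    if (!answer.includes('INSUFFICIENT_INFO')) return answer;
    context.push(await fetchExtraContext(query));
  }
  throw new Error('Could not answer even with enriched context');
}
```

Note what this function does NOT contain: no backoff, no rate-limit handling, no transport errors. Those belong one layer down.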
Does structured outputs work with all models?
No. It requires gpt-4o-2024-08-06 or later, and gpt-4o-mini-2024-07-18 or later from OpenAI. For Anthropic, the approach is different — tool use with schema. For local models with Ollama, it depends on the model and version. Check compatibility before adopting.
Isn't it better to have your own context control for cost optimization?
Yes, but there's a difference between intelligent context optimization and badly-done manual trimming. For real production cost optimization: you use embeddings for selective context retrieval (RAG), you don't slice the array by hand. The manual trimming I was doing didn't optimize costs — it just broke long conversations.
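The difference is easy to show. A minimal sketch of selective retrieval: score stored snippets against the query embedding and keep only the top-k, instead of slicing the message array by position. The embeddings here are plain `number[]` toy vectors; in production they would come from an embeddings API:

```typescript
// Cosine similarity between two embedding vectors
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface Snippet { text: string; embedding: number[]; }

// Keep the k snippets most relevant to the query, regardless of
// where they sit in the conversation history
function topKContext(query: number[], snippets: Snippet[], k: number): string[] {
  return [...snippets]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k)
    .map((s) => s.text);
}
```

Position-based trimming throws away whatever happens to be old; relevance-based retrieval throws away whatever happens to be irrelevant. Only the second one is cost optimization.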
What about security? Don't I need to validate model responses before executing actions?
Absolutely. This is the layer where you DO want your own code. Validating that the action is in the allowed set, that parameters satisfy business invariants, that the user has permissions for the requested action. That's yours. What isn't yours: parsing the JSON the model generates when you could use structured outputs.
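A sketch of that validation layer, the code that SHOULD be yours. The `AgentAction` shape mirrors the schema from Sin #2; the specific invariants and the permissions model are illustrative assumptions about your domain:

```typescript
interface AgentAction {
  action: string;
  params: Record<string, string>;
}

// Closed set of actions the agent may ever execute
const ALLOWED_ACTIONS = new Set(['search', 'calculate', 'respond', 'ask_clarification']);

function validateAction(
  candidate: AgentAction,
  userPermissions: Set<string>,
): { ok: true; action: AgentAction } | { ok: false; reason: string } {
  // 1. The action must be in the allowed set: never execute arbitrary names
  if (!ALLOWED_ACTIONS.has(candidate.action)) {
    return { ok: false, reason: `unknown action: ${candidate.action}` };
  }
  // 2. Business invariant example: search needs a non-empty query
  if (candidate.action === 'search' && !candidate.params.query?.trim()) {
    return { ok: false, reason: 'search requires a non-empty query' };
  }
  // 3. The requesting user must be allowed to run this action
  if (!userPermissions.has(candidate.action)) {
    return { ok: false, reason: `user lacks permission for ${candidate.action}` };
  }
  return { ok: true, action: candidate };
}
```

Structured outputs guarantee the shape; this layer guarantees the semantics. Both are necessary, and only one of them is your job.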
Is it worth refactoring code that works?
Depends on "works". If it works and it's not going to change: maybe not. But my code "worked" with a silent bug in long conversations. The technical debt of re-implementing what the SDK does is that when the SDK improves — and it has improved a lot — you don't get it automatically. You're stuck with your two-year-old implementation.
Are there cases where over-engineering agents is the right call?
Yes: when you have very specific constraints (can't use the official SDK, have compliance requirements, need support for highly custom models). Or when the abstraction layer you're given isn't enough for your use case — there are security scenarios where you need fine-grained control over the protocol. But those are the exception. Most projects aren't in that situation.
The Real Cost of Illusory Control
I deleted 340 lines this week. Eighty-seven from retry, one hundred and forty from parsing, the rest from redundant orchestration. The system does exactly the same thing. The tests that matter still pass. The long-conversation bug — which I discovered while reviewing for this post — is gone.
The cost wasn't just the time writing those lines. It was time debugging bugs the SDK doesn't have. It was cognitive overhead every time someone new touches the code. It was the false sense that I understood what was happening because I had written it.
There's a version of this I've seen in other domains — the temptation to build from scratch because trusting something external feels like losing your footing. I thought about it when I was looking at the pneumatic display with compressed air: sometimes building the most primitive layer makes artistic or technical sense. In production with a deadline: almost never.
The question I ask myself now before writing any infrastructure layer around a model: does this already exist in the SDK? Does the model already handle it? If the answer is yes and my implementation doesn't add something specific to my domain, it's over-engineering.
It's not a lack of control. It's choosing where you spend the control you actually have.