I Over-Engineered My Agent (Here's What I Learned)

I spent 3 months building the perfect agent architecture.

It had everything:

  • Multi-tier memory system
  • Hierarchical planning
  • Self-reflection loops
  • Multi-agent coordination
  • Custom embedding pipeline
  • Sophisticated error recovery

It was beautiful.

Users hated it.

Here's what went wrong.

The vision

I was building a coding assistant. My architecture looked like this:

┌─────────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR AGENT                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   PLANNER   │  │  EXECUTOR   │  │  REVIEWER   │          │
│  │    AGENT    │──│    AGENT    │──│    AGENT    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│         │                │                │                  │
│         ▼                ▼                ▼                  │
│  ┌─────────────────────────────────────────────────┐        │
│  │              SHARED MEMORY SYSTEM                │        │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌───────┐  │        │
│  │  │ Working │ │ Episodic│ │Semantic │ │ Long  │  │        │
│  │  │ Memory  │ │ Memory  │ │ Memory  │ │ Term  │  │        │
│  │  └─────────┘ └─────────┘ └─────────┘ └───────┘  │        │
│  └─────────────────────────────────────────────────┘        │
│                           │                                  │
│         ┌─────────────────┴─────────────────┐               │
│         ▼                                   ▼               │
│  ┌─────────────┐                    ┌─────────────┐         │
│  │    TOOLS    │                    │  REFLECTION │         │
│  │   (15 MCP)  │                    │    MODULE   │         │
│  └─────────────┘                    └─────────────┘         │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Impressive, right?

What I built

Multi-agent system

Three specialized agents:

class PlannerAgent:
    """Creates detailed execution plans"""
    def __init__(self, llm):
        self.llm = llm

    def plan(self, task):
        return self.llm.create(
            system="You are a planning specialist...",
            messages=[{"role": "user", "content": f"Plan: {task}"}]
        )

class ExecutorAgent:
    """Executes individual steps"""
    def __init__(self, llm):
        self.llm = llm

    def execute(self, step, tools):
        return self.llm.create(
            system="You are an execution specialist...",
            tools=tools,
            messages=[{"role": "user", "content": f"Execute: {step}"}]
        )

class ReviewerAgent:
    """Reviews and critiques work"""
    def __init__(self, llm):
        self.llm = llm

    def review(self, work):
        return self.llm.create(
            system="You are a code review specialist...",
            messages=[{"role": "user", "content": f"Review: {work}"}]
        )
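
The orchestrator glued these together. A simplified sketch of the control flow (the handle/retrieve/store names are illustrative, not my real code):

class Orchestrator:
    def __init__(self, planner, executor, reviewer, memory, tools):
        self.planner = planner
        self.executor = executor
        self.reviewer = reviewer
        self.memory = memory
        self.tools = tools

    def handle(self, task):
        context = self.memory.retrieve(task)              # memory lookup before anything runs
        plan = self.planner.plan(f"{context}\n\n{task}")  # LLM call #1
        results = []
        for step in plan.steps:
            # one LLM call per plan step
            results.append(self.executor.execute(step, self.tools))
        review = self.reviewer.review(results)            # one more LLM call
        self.memory.store(task, results, review)
        return results

Count the model calls: even a trivial request touched every box in the diagram.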

Four-tier memory

class MemorySystem:
    def __init__(self):
        self.working = WorkingMemory()      # Current task
        self.episodic = EpisodicMemory()    # Past conversations
        self.semantic = SemanticMemory()    # Learned facts (vector DB)
        self.long_term = LongTermMemory()   # Persistent storage

Reflection loop

def execute_with_reflection(self, task):
    for attempt in range(3):
        result = self.executor.execute(task)

        reflection = self.reflector.reflect(result)

        if reflection.is_good:
            return result

        task = self.improve_task(task, reflection.feedback)

    return result

15 specialized tools

tools:
  - read_file
  - read_file_range
  - write_file
  - patch_file
  - search_content
  - search_files
  - search_symbols
  - run_command
  - run_tests
  - lint_code
  - format_code
  - git_status
  - git_diff
  - git_commit
  - analyze_dependencies

The problems

Problem 1: It was slow

User: "Add a print statement to main.py"

My agent:
1. Planner analyzes task (2s)
2. Creates 5-step plan (3s)
3. Executor reads file (1s)
4. Executor makes change (2s)
5. Reviewer checks work (3s)
6. Reflection evaluates (2s)
7. Memory updates (1s)

Total: 14 seconds

Simple agent:
1. Read file (0.5s)
2. Make change (1s)

Total: 1.5 seconds

14 seconds for a one-line change. Users didn't wait.

Problem 2: It was expensive

Simple request token usage:

Planner: 2,000 tokens
Executor: 1,500 tokens
Reviewer: 1,800 tokens
Reflection: 1,200 tokens
Memory queries: 800 tokens
Orchestration: 500 tokens

Total: 7,800 tokens

vs.

Simple agent: 1,200 tokens

6.5x the tokens. For the same result.

Problem 3: It was fragile

# Things that broke regularly:

# Planner and Executor disagreed on format
PlannerOutput: {"steps": ["read", "modify", "save"]}
ExecutorExpected: {"step": "read", "params": {...}}

# Memory retrieval returned irrelevant context
Query: "How to add logging?"
Retrieved: "User prefers dark mode" (from 3 weeks ago)

# Reflection loop got stuck
Reflection: "Code could be better"
Attempt 2: Same code
Reflection: "Code could be better"
Attempt 3: Same code

Problem 4: It was unpredictable

Same request, different runs:

Run 1: Planner creates 3 steps
Run 2: Planner creates 7 steps
Run 3: Planner creates 1 step (skips planning)

Run 1: Reviewer approves
Run 2: Reviewer requests changes
Run 3: Reviewer gets confused about what to review

Users couldn't trust it.

Problem 5: It was impossible to debug

User: "Why did it delete my file?"

Me: "Let me check..."
- Orchestrator logs: "Delegated to executor"
- Planner logs: "Step 3: clean up"
- Executor logs: "Executed: delete temp files"
- Memory logs: "Retrieved: user likes clean workspaces"
- Reflection logs: "Approved cleanup"

Me: "Uh... multiple agents agreed it was a good idea?"

No single point of responsibility.

The rewrite

I threw it all away. Started over with this:

class SimpleAgent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools
        self.messages = []

    def run(self, user_input):
        self.messages.append({"role": "user", "content": user_input})

        while True:
            response = self.llm.create(
                messages=self.messages,
                tools=self.tools
            )

            if response.tool_call:
                # Tool call: execute it, feed the result back, loop again
                result = self.execute_tool(response.tool_call)
                self.messages.append({"role": "tool", "content": result})
            else:
                # Plain text answer: we're done
                self.messages.append({"role": "assistant", "content": response.content})
                return response.content

    def execute_tool(self, tool_call):
        return self.tools.call(tool_call.name, tool_call.params)

50 lines. No planner. No reviewer. No reflection. No multi-tier memory.
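
The tools object is just a tiny registry: a name-to-function map that doubles as the tool list handed to the model. A sketch under my own naming (ToolRegistry and llm_client are placeholders, not a real library API):

class ToolRegistry:
    def __init__(self):
        self.functions = {}
        self.schemas = []   # tool descriptions for the model

    def register(self, name, description, fn):
        self.functions[name] = fn
        self.schemas.append({"name": name, "description": description})

    def call(self, name, params):
        # dispatch a model tool call to the matching Python function
        return self.functions[name](**params)

tools = ToolRegistry()
tools.register("read", "Read a file", lambda path: open(path).read())

agent = SimpleAgent(llm_client, tools)
agent.run("Add a print statement to main.py")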

The tools

# From 15 to 4
tools:
  - name: read
    description: Read a file
  - name: write
    description: Write to a file
  - name: search
    description: Search in files
  - name: run
    description: Run a command

The result

Same request: "Add a print statement to main.py"

Simple agent:
1. Read file (0.5s)
2. Write file (1s)

Total: 1.5 seconds
Tokens: 1,200
Success rate: 95%

What I learned

Lesson 1: Start with the simplest thing

# Week 1: This
response = llm.create(messages + [user_input])

# Week 2: Add tools if needed
response = llm.create(messages + [user_input], tools=basic_tools)

# Week 3+: Add complexity only when you hit real limits

Not:

# Week 1: Build the perfect architecture
orchestrator = Orchestrator(
    planner=PlannerAgent(),
    executor=ExecutorAgent(),
    reviewer=ReviewerAgent(),
    memory=FourTierMemory(),
    reflection=ReflectionModule()
)

Lesson 2: Multi-agent is usually wrong

Multi-agent sounds cool. In practice:

Single agent:
- One context
- One decision maker
- Clear responsibility
- Easy to debug

Multi-agent:
- Context passing overhead
- Coordination bugs
- Blame diffusion
- Debugging nightmare

Use multi-agent when you have genuinely different capabilities that can't share context. That's rare.

Lesson 3: Memory is a premature optimization

# What I built
memory = VectorDB() + EpisodicStore() + FactDatabase() + WorkingMemory()

# What I needed
messages = []  # That's it. Conversation history.

Add memory when users complain about forgetting. Not before.
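
And when I do add memory now, the first version is just capping conversation history. A minimal sketch (max_messages=40 is an arbitrary cutoff I picked, not a recommendation):

def trim_history(messages, max_messages=40):
    # Keep the first message (the original task) plus the most recent turns.
    # Five lines, instead of a four-tier memory system.
    if len(messages) <= max_messages:
        return messages
    return messages[:1] + messages[-(max_messages - 1):]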

Lesson 4: Reflection loops are token burners

# What I thought
"Self-reflection will catch errors and improve quality!"

# What happened
Agent: *does thing*
Reflector: "Could be better"
Agent: *does same thing slightly differently*
Reflector: "Could be better"
Agent: *does same thing again*
# 3x the tokens, same result

Reflection works in research papers. In production, it mostly burns money.
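
If you keep a reflection loop anyway, at least stop paying when it stops making progress. A sketch of the guard I'd add today (same illustrative executor/reflector objects as above):

def execute_with_reflection(self, task, max_attempts=3):
    previous = None
    for _ in range(max_attempts):
        result = self.executor.execute(task)
        if result == previous:
            return result   # identical output twice: stop burning tokens
        reflection = self.reflector.reflect(result)
        if reflection.is_good:
            return result
        previous = result
        task = self.improve_task(task, reflection.feedback)
    return result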

Lesson 5: More tools = more confusion

# With 15 tools
AI: "Should I use search_content or search_files or search_symbols?"
AI: "Should I use write_file or patch_file?"
AI: *picks wrong one*

# With 4 tools
AI: "I need to search" → uses search
AI: "I need to write" → uses write

Fewer tools = clearer decisions.

Lesson 6: Speed matters more than intelligence

User preference:

Fast + slightly wrong → "I can fix that, thanks!"
Slow + perfect → "Is it frozen? *closes tab*"

Users will tolerate imperfection. They won't tolerate waiting.

The new philosophy

Before:
"How do I make my agent smarter?"

After:
"How do I make my agent simpler?"

Simple tools with Gantz

Now I use Gantz Run with minimal tools:

# gantz.yaml
tools:
  - name: read
    script:
      shell: cat "{{path}}"

  - name: write
    script:
      shell: echo "{{content}}" > "{{path}}"

  - name: search
    script:
      shell: rg "{{query}}" . | head -30

  - name: run
    script:
      shell: "{{command}}"

Four tools that cover 95% of coding tasks.

When I add complexity now

Only when I have evidence:

"Users are asking the same questions repeatedly"
→ Maybe add memory

"Simple search isn't finding relevant content"
→ Maybe add better retrieval

"Tasks require genuine coordination"
→ Maybe add another agent

Not because it's cool. Because users need it.

Summary

What I over-engineered:

  • Multi-agent coordination (didn't need it)
  • Four-tier memory (conversation history was enough)
  • Reflection loops (burned tokens)
  • 15 specialized tools (4 was enough)
  • Hierarchical planning (overkill for most tasks)

What I learned:

  • Start with the simplest thing that works
  • Add complexity only when you hit real limits
  • Speed beats intelligence
  • One agent beats three
  • Users don't care about architecture

The best agent is the one that gets out of the way.


Have you over-engineered an agent? What did you cut?
