I spent 3 months building the perfect agent architecture.
It had everything:
- Multi-tier memory system
- Hierarchical planning
- Self-reflection loops
- Multi-agent coordination
- Custom embedding pipeline
- Sophisticated error recovery
It was beautiful.
Users hated it.
Here's what went wrong.
The vision
I was building a coding assistant. My architecture looked like this:
┌─────────────────────────────────────────────────────────────┐
│                     ORCHESTRATOR AGENT                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   PLANNER   │  │  EXECUTOR   │  │  REVIEWER   │          │
│  │    AGENT    │──│    AGENT    │──│    AGENT    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│         │                │                │                 │
│         ▼                ▼                ▼                 │
│  ┌─────────────────────────────────────────────────┐        │
│  │              SHARED MEMORY SYSTEM               │        │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌───────┐  │        │
│  │  │ Working │ │ Episodic│ │Semantic │ │ Long  │  │        │
│  │  │ Memory  │ │ Memory  │ │ Memory  │ │ Term  │  │        │
│  │  └─────────┘ └─────────┘ └─────────┘ └───────┘  │        │
│  └─────────────────────────────────────────────────┘        │
│                           │                                 │
│         ┌─────────────────┴─────────────────┐               │
│         ▼                                   ▼               │
│  ┌─────────────┐                     ┌─────────────┐        │
│  │    TOOLS    │                     │ REFLECTION  │        │
│  │  (15 MCP)   │                     │   MODULE    │        │
│  └─────────────┘                     └─────────────┘        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Impressive, right?
What I built
Multi-agent system
Three specialized agents:
class PlannerAgent:
    """Creates detailed execution plans"""
    def plan(self, task):
        return self.llm.create(
            system="You are a planning specialist...",
            messages=[{"role": "user", "content": f"Plan: {task}"}]
        )

class ExecutorAgent:
    """Executes individual steps"""
    def execute(self, step, tools):
        return self.llm.create(
            system="You are an execution specialist...",
            tools=tools,
            messages=[{"role": "user", "content": f"Execute: {step}"}]
        )

class ReviewerAgent:
    """Reviews and critiques work"""
    def review(self, work):
        return self.llm.create(
            system="You are a code review specialist...",
            messages=[{"role": "user", "content": f"Review: {work}"}]
        )
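The orchestrator glued these together. Roughly like this — a simplified sketch, where the plan.steps shape and the memory.store call are stand-ins for a lot more plumbing:

class Orchestrator:
    def __init__(self, planner, executor, reviewer, memory, tools):
        self.planner = planner
        self.executor = executor
        self.reviewer = reviewer
        self.memory = memory
        self.tools = tools

    def handle(self, task):
        # Every request paid for a planning call, one execution call per step,
        # and a review call -- even a one-line edit.
        plan = self.planner.plan(task)
        results = [self.executor.execute(step, self.tools) for step in plan.steps]
        review = self.reviewer.review(results)
        self.memory.store(task, results, review)
        return results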
Four-tier memory
class MemorySystem:
    def __init__(self):
        self.working = WorkingMemory()      # Current task
        self.episodic = EpisodicMemory()    # Past conversations
        self.semantic = SemanticMemory()    # Learned facts (vector DB)
        self.long_term = LongTermMemory()   # Persistent storage
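Every request fanned out across all four tiers before the model even saw the task. Something like this (the tier query methods are illustrative names, not a real API):

def retrieve_context(memory, query):
    # One lookup per tier on every request -- most of it never got used
    context = []
    context += memory.working.current_task()
    context += memory.episodic.recent(query, limit=5)
    context += memory.semantic.search(query, top_k=5)  # vector similarity
    context += memory.long_term.lookup(query)
    return context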
Reflection loop
def execute_with_reflection(self, task):
    for attempt in range(3):
        result = self.executor.execute(task)
        reflection = self.reflector.reflect(result)
        if reflection.is_good:
            return result
        task = self.improve_task(task, reflection.feedback)
    return result
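improve_task just folded the critique back into the prompt. A sketch of the idea, not the exact code:

def improve_task(self, task, feedback):
    # Append the reviewer's critique and try again
    return f"{task}\n\nThe previous attempt was rejected because:\n{feedback}"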
15 specialized tools
tools:
- read_file
- read_file_range
- write_file
- patch_file
- search_content
- search_files
- search_symbols
- run_command
- run_tests
- lint_code
- format_code
- git_status
- git_diff
- git_commit
- analyze_dependencies
The problems
Problem 1: It was slow
User: "Add a print statement to main.py"
My agent:
1. Planner analyzes task (2s)
2. Creates 5-step plan (3s)
3. Executor reads file (1s)
4. Executor makes change (2s)
5. Reviewer checks work (3s)
6. Reflection evaluates (2s)
7. Memory updates (1s)
Total: 14 seconds
Simple agent:
1. Read file (0.5s)
2. Make change (1s)
Total: 1.5 seconds
14 seconds for a one-line change. Users didn't wait.
Problem 2: It was expensive
Simple request token usage:
Planner: 2,000 tokens
Executor: 1,500 tokens
Reviewer: 1,800 tokens
Reflection: 1,200 tokens
Memory queries: 800 tokens
Orchestration: 500 tokens
Total: 7,800 tokens
vs.
Simple agent: 1,200 tokens
6.5x more expensive. For the same result.
Problem 3: It was fragile
# Things that broke regularly:
# Planner and Executor disagreed on format
PlannerOutput: {"steps": ["read", "modify", "save"]}
ExecutorExpected: {"step": "read", "params": {...}}
# Memory retrieval returned irrelevant context
Query: "How to add logging?"
Retrieved: "User prefers dark mode" (from 3 weeks ago)
# Reflection loop got stuck
Reflection: "Code could be better"
Attempt 2: Same code
Reflection: "Code could be better"
Attempt 3: Same code
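The planner/executor handoff shows why. Two prompts, two implicit formats, no contract between them. The shapes below are hypothetical, but this is exactly the failure mode:

# What the planner emitted
plan = {"steps": ["read", "modify", "save"]}

# What the executor's glue code assumed each step to look like
def run_step(step):
    name, params = step["step"], step["params"]   # TypeError: step is a plain string
    print(f"Running {name} with {params}")

for step in plan["steps"]:
    run_step(step)   # crashes on the very first step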
Problem 4: It was unpredictable
Same request, different runs:
Run 1: Planner creates 3 steps
Run 2: Planner creates 7 steps
Run 3: Planner creates 1 step (skips planning)
Run 1: Reviewer approves
Run 2: Reviewer requests changes
Run 3: Reviewer gets confused about what to review
Users couldn't trust it.
Problem 5: It was impossible to debug
User: "Why did it delete my file?"
Me: "Let me check..."
- Orchestrator logs: "Delegated to executor"
- Planner logs: "Step 3: clean up"
- Executor logs: "Executed: delete temp files"
- Memory logs: "Retrieved: user likes clean workspaces"
- Reflection logs: "Approved cleanup"
Me: "Uh... multiple agents agreed it was a good idea?"
No single point of responsibility.
The rewrite
I threw it all away. Started over with this:
class SimpleAgent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools
        self.messages = []

    def run(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        while True:
            response = self.llm.create(
                messages=self.messages,
                tools=self.tools
            )
            if response.tool_call:
                # Keep the assistant's tool call in history so the next
                # turn sees both the call and its result
                self.messages.append({"role": "assistant", "content": response.content,
                                      "tool_call": response.tool_call})
                result = self.execute_tool(response.tool_call)
                self.messages.append({"role": "tool", "content": result})
            else:
                self.messages.append({"role": "assistant", "content": response.content})
                return response.content

    def execute_tool(self, tool_call):
        return self.tools.call(tool_call.name, tool_call.params)
50 lines. No planner. No reviewer. No reflection. No multi-tier memory.
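Using it is just as boring. (llm and tools here are whatever client and registry objects you already have — the only contract is the create/call interface above.)

agent = SimpleAgent(llm=my_llm_client, tools=my_tool_registry)
print(agent.run("Add a print statement to main.py"))
print(agent.run("Now run the tests"))  # history carries over in self.messages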
The tools
# From 15 to 4
tools:
  - name: read
    description: Read a file
  - name: write
    description: Write to a file
  - name: search
    description: Search in files
  - name: run
    description: Run a command
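The YAML is just a single place to keep the definitions; at runtime something turns it into the tools list the model sees. A minimal loader sketch — the file name and schema keys are assumptions, not from my actual setup:

import yaml

def load_tools(path="tools.yaml"):
    # One name and one sentence per tool -- small enough that the
    # model rarely has to think about which one to pick
    spec = yaml.safe_load(open(path))
    return [{"name": t["name"], "description": t["description"]} for t in spec["tools"]]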
The result
Same request: "Add a print statement to main.py"
Simple agent:
1. Read file (0.5s)
2. Write file (1s)
Total: 1.5 seconds
Tokens: 1,200
Success rate: 95%
What I learned
Lesson 1: Start with the simplest thing
# Week 1: This
response = llm.create(messages + [user_input])
# Week 2: Add tools if needed
response = llm.create(messages + [user_input], tools=basic_tools)
# Week 3+: Add complexity only when you hit real limits
Not:
# Week 1: Build the perfect architecture
orchestrator = Orchestrator(
    planner=PlannerAgent(),
    executor=ExecutorAgent(),
    reviewer=ReviewerAgent(),
    memory=FourTierMemory(),
    reflection=ReflectionModule()
)
Lesson 2: Multi-agent is usually wrong
Multi-agent sounds cool. In practice:
Single agent:
- One context
- One decision maker
- Clear responsibility
- Easy to debug
Multi-agent:
- Context passing overhead
- Coordination bugs
- Blame diffusion
- Debugging nightmare
Use multi-agent when you have genuinely different capabilities that can't share context. That's rare.
Lesson 3: Memory is a premature optimization
# What I built
memory = VectorDB() + EpisodicStore() + FactDatabase() + WorkingMemory()
# What I needed
messages = [] # That's it. Conversation history.
Add memory when users complain about forgetting. Not before.
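And when that complaint does arrive, the first fix is small: persist the message list and reload it next session. A sketch — the file path and helpers are mine, not a real memory system:

import json

def save_history(messages, path="history.json"):
    with open(path, "w") as f:
        json.dump(messages, f)

def load_history(path="history.json"):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return []  # no history yet -- start fresh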
Lesson 4: Reflection loops are token burners
# What I thought
"Self-reflection will catch errors and improve quality!"
# What happened
Agent: *does thing*
Reflector: "Could be better"
Agent: *does same thing slightly differently*
Reflector: "Could be better"
Agent: *does same thing again*
# 3x the tokens, same result
Reflection works in research papers. In production, it mostly burns money.
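If you keep a reflection pass anyway, at least make the stuck loop detectable — one comparison is enough. (executor and reflector are the same stand-ins as in the earlier snippet.)

def execute_with_reflection(executor, reflector, task, max_attempts=3):
    previous = None
    for _ in range(max_attempts):
        result = executor.execute(task)
        if result == previous:
            # The model is repeating itself -- more loops only burn tokens
            return result
        reflection = reflector.reflect(result)
        if reflection.is_good:
            return result
        previous = result
        task = f"{task}\n\nFeedback from review:\n{reflection.feedback}"
    return result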
Lesson 5: More tools = more confusion
# With 15 tools
AI: "Should I use search_content or search_files or search_symbols?"
AI: "Should I use write_file or patch_file?"
AI: *picks wrong one*
# With 4 tools
AI: "I need to search" → uses search
AI: "I need to write" → uses write
Fewer tools = clearer decisions.
Lesson 6: Speed matters more than intelligence
User preference:
Fast + slightly wrong → "I can fix that, thanks!"
Slow + perfect → "Is it frozen? *closes tab*"
Users will tolerate imperfection. They won't tolerate waiting.
The new philosophy
Before:
"How do I make my agent smarter?"
After:
"How do I make my agent simpler?"
Simple tools with Gantz
Now I use Gantz Run with minimal tools:
# gantz.yaml
tools:
  - name: read
    script:
      shell: cat "{{path}}"
  - name: write
    script:
      shell: echo "{{content}}" > "{{path}}"
  - name: search
    script:
      shell: rg "{{query}}" . | head -30
  - name: run
    script:
      shell: "{{command}}"
Four tools. Covers 95% of coding tasks.
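Under the hood this is just string templating plus a shell call. A rough Python equivalent — not Gantz's actual implementation, and quoting/safety are deliberately ignored here:

import subprocess

def run_tool(shell_template, **params):
    # Naive {{placeholder}} substitution -- fine for a sketch, not for untrusted input
    command = shell_template
    for key, value in params.items():
        command = command.replace("{{" + key + "}}", str(value))
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout

# e.g. run_tool('rg "{{query}}" . | head -30', query="TODO")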
When I add complexity now
Only when I have evidence:
"Users are asking the same questions repeatedly"
→ Maybe add memory
"Simple search isn't finding relevant content"
→ Maybe add better retrieval
"Tasks require genuine coordination"
→ Maybe add another agent
Not because it's cool. Because users need it.
Summary
What I over-engineered:
- Multi-agent coordination (didn't need it)
- Four-tier memory (conversation history was enough)
- Reflection loops (burned tokens)
- 15 specialized tools (4 was enough)
- Hierarchical planning (overkill for most tasks)
What I learned:
- Start with the simplest thing that works
- Add complexity only when you hit real limits
- Speed beats intelligence
- One agent beats three
- Users don't care about architecture
The best agent is the one that gets out of the way.
Have you over-engineered an agent? What did you cut?