MD RABBI

Posted on Mar 28

How I built AgentForge, an open-source agent harness that benchmarks 4 different memory architectures on real coding tasks.

#ai #agents #python #programming

Everyone in AI right now is arguing about which model is best. GPT vs Claude vs Gemini. Benchmark scores. Arena ratings. Token prices.

I think they're asking the wrong question.

LangChain proved it earlier this year: their coding agent jumped from outside the top 30 to the top 5 on Terminal Bench 2.0 by changing nothing about the model. They only changed the harness — the infrastructure that wraps the agent. Anthropic's own engineering team discovered that their agents exhibited "context anxiety" — performance degraded as context filled up, even after compaction. The fix wasn't a better model. It was a better harness.

So I built one: a harness with 4 swappable memory architectures and a benchmark suite of 6 real coding tasks, so I can measure what actually matters.

The problem I wanted to solve

Here's what happens when you give a coding agent a bug to fix:

1. The agent reads the code
2. It forms a plan
3. It uses tools (bash, file read/write, search) to explore and edit
4. It runs tests to verify
5. If tests fail, it iterates

Simple enough on a whiteboard. But in practice, every tool call generates output that accumulates in the agent's context window. By step 15, the agent is carrying 50,000+ tokens of history — most of it irrelevant to the current decision. The agent starts making worse choices because it's drowning in its own past.

This is the memory management problem, and it's one of the most under-researched aspects of agent harness design. Everyone's building agents. Almost nobody is systematically measuring how different memory strategies affect performance.

What I built

AgentForge is an open-source Python harness with three key design decisions:

1. Pluggable memory strategies. You can swap the memory architecture with a single line in a YAML config:

memory:
  strategy: summarization  # or: sliding_window, rag, hybrid
  max_context_tokens: 90000
  compact_threshold: 0.8

2. Real coding tasks, not toy demos. The benchmark suite includes 6 tasks that require an agent to read buggy Python code, identify the issue, fix it, and verify with tests. These aren't "write hello world" tasks — they're off-by-one errors in Fibonacci, missing branches in merge sort, broken path reconstruction in BFS, escape character bugs in recursive descent parsers, LRU cache eviction order mistakes, and token bucket refill logic errors.

3. Quantitative evaluation, not vibes. Every run produces 8 metrics: pass rate, partial credit, tool efficiency, cost per task, context utilization, average steps, average duration, and error recovery rate.
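To make the two numeric settings in the YAML config concrete, here is a minimal sketch of how they might combine. It assumes compaction triggers once estimated usage reaches compact_threshold × max_context_tokens, which the post implies but does not spell out. `MemoryConfig` and `estimate_tokens` are hypothetical names, and the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical mirror of the `memory:` block in the YAML config."""
    strategy: str = "summarization"
    max_context_tokens: int = 90_000
    compact_threshold: float = 0.8


def estimate_tokens(messages: list[dict]) -> int:
    """Crude estimate: roughly 4 characters per token (an assumption)."""
    return sum(len(str(m["content"])) for m in messages) // 4


def should_compact(messages: list[dict], cfg: MemoryConfig) -> bool:
    """Trigger compaction once usage crosses threshold * max tokens."""
    budget = cfg.compact_threshold * cfg.max_context_tokens
    return estimate_tokens(messages) >= budget


cfg = MemoryConfig()
short = [{"role": "user", "content": "Fix the off-by-one bug in fib()."}]
long = short + [{"role": "user", "content": "x" * 300_000}]
print(should_compact(short, cfg), should_compact(long, cfg))
```

With the values shown, the trigger point would sit at 72,000 estimated tokens.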

The 4 memory strategies

Each strategy answers the same question differently: when context gets too long, what do you throw away?

Sliding window keeps the first message (the task description) plus the most recent N message pairs. Cheap and fast, but the agent forgets everything from the middle of its session.
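A minimal sketch of that sliding-window rule, not AgentForge's actual implementation. It assumes messages are plain dicts, that the first message is the task, and that a "pair" is one assistant turn followed by one user or tool-result turn. The name `sliding_window` and its `keep_pairs` parameter are hypothetical.

```python
def sliding_window(messages: list[dict], keep_pairs: int) -> list[dict]:
    """Keep the first message plus the most recent `keep_pairs` pairs."""
    head, tail = messages[:1], messages[1:]
    keep = 2 * keep_pairs  # one pair = assistant turn + user/tool-result turn
    recent = tail[-keep:] if keep > 0 else []
    return head + recent


history = [{"role": "user", "content": "task"}]
for i in range(3):
    history.append({"role": "assistant", "content": f"assistant{i}"})
    history.append({"role": "user", "content": f"user{i}"})

print([m["content"] for m in sliding_window(history, keep_pairs=2)])
```

Cutting on pair boundaries keeps the user/assistant alternation intact, which matters when a tool result has to stay next to the tool call that produced it.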

Summarization takes older messages and compresses them into a summary using a separate LLM call (I used Claude Haiku for cost efficiency). Preserves semantic content but adds latency and cost.
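As a rough illustration, summarization-based compaction can be sketched with the summarizer passed in as a plain function, so the example runs without an API key. This is an assumption-heavy sketch: `compact_by_summary` is a hypothetical name, message contents are assumed to be strings, and the summary is folded into the task message so the conversation still starts with a single user turn. In a real run the `summarize` callable would wrap the separate LLM call.

```python
from typing import Callable


def compact_by_summary(
    messages: list[dict],
    summarize: Callable[[str], str],
) -> list[dict]:
    """Collapse every turn after the task into one summary."""
    task, history = messages[0], messages[1:]
    if not history:
        return list(messages)
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    content = (
        f"{task['content']}\n\n"
        f"Summary of earlier steps:\n{summarize(transcript)}"
    )
    return [{"role": "user", "content": content}]


def stub_summarizer(transcript: str) -> str:
    """Stand-in for the LLM call; only counts the turns it was given."""
    return f"{len(transcript.splitlines())} turns condensed"


history = [
    {"role": "user", "content": "Fix the bug"},
    {"role": "assistant", "content": "reading file"},
    {"role": "user", "content": "file contents"},
    {"role": "assistant", "content": "running tests"},
]
print(compact_by_summary(history, stub_summarizer)[0]["content"])
```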

RAG-backed memory embeds all older turns as vectors, then retrieves only the ones relevant to the current step. Precise recall but requires vector infrastructure (I used ChromaDB).
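The retrieval step can be sketched without any vector infrastructure. The example below deliberately swaps in a toy bag-of-words vector and cosine similarity in place of a real embedding model and ChromaDB, so it shows the shape of the idea rather than the harness's actual code. `embed`, `cosine`, and `retrieve` are hypothetical names, and the sample turns are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(older_turns: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k older turns most similar to the current step."""
    q = embed(query)
    ranked = sorted(older_turns, key=lambda t: cosine(embed(t), q), reverse=True)
    return ranked[:k]


turns = [
    "listed files in the project root",
    "bfs path reconstruction returns the wrong parent",
    "installed pytest with pip",
]
print(retrieve(turns, "fix bfs path reconstruction", k=1))
```

Only the retrieved turns would be placed back into context; everything else stays in the store until a later step asks for it.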

Hybrid combines a recent sliding window with summarized older context — balancing recency with compressed history.
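A minimal sketch of the hybrid idea: the most recent turns stay verbatim and only the older middle is summarized. As with any sketch here, this is not the harness's real code. `hybrid_compact` and `keep_recent` are hypothetical, contents are assumed to be strings, and the summarizer is a stub so the example runs offline.

```python
from typing import Callable


def hybrid_compact(
    messages: list[dict],
    summarize: Callable[[str], str],
    keep_recent: int = 4,
) -> list[dict]:
    """Summarize older turns, keep the last `keep_recent` turns verbatim."""
    task, history = messages[0], messages[1:]
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return list(messages)  # nothing old enough to compress
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    head = {
        "role": "user",
        "content": f"{task['content']}\n\n"
                   f"Summary of earlier steps:\n{summarize(transcript)}",
    }
    return [head] + recent


def stub_summarizer(transcript: str) -> str:
    """Stand-in for the LLM call; only counts the turns it was given."""
    return f"{len(transcript.splitlines())} turns condensed"


history = [{"role": "user", "content": "task"}]
for i in range(3):
    history.append({"role": "assistant", "content": f"assistant{i}"})
    history.append({"role": "user", "content": f"user{i}"})

for message in hybrid_compact(history, stub_summarizer, keep_recent=2):
    print(message["role"], "|", message["content"].splitlines()[-1])
```

Keeping `keep_recent` even means the verbatim tail starts on an assistant turn, so roles still alternate after the merged first message.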

Architecture

The core agent loop is straightforward but the details matter:

import anthropic

class AgentLoop:
    def __init__(self, config: HarnessConfig):
        self.config = config  # stored so run() can read agent and memory settings
        self.client = anthropic.Anthropic()
        self.memory = MemoryFactory.create(config.memory)
        self.tools = ToolRegistry.from_config(config.tools)

    async def run(self, task: Task) -> AgentResult:
        messages = [{"role": "user", "content": task.description}]
        # Context budget from the memory block of the YAML config
        max_context_tokens = self.config.memory.max_context_tokens

        for step in range(self.config.agent.max_steps):
            response = self.client.messages.create(
                model=self.config.agent.model,
                max_tokens=4096,  # output cap per response, required by the API; value is illustrative
                messages=messages,
                tools=self.tools.get_definitions(),
            )

            # Execute tool calls, log trajectory
            # ...

            # Memory management: this is where strategies diverge
            if self.memory.should_compact(messages, max_context_tokens):
                messages = await self.memory.compact(messages, self.client)

The MemoryFactory.create() call is where the magic happens — same interface, different behavior. The agent loop doesn't know or care which strategy is active. It just calls should_compact() and compact().
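The "same interface, different behavior" idea can be sketched with a `typing.Protocol` and a small registry. This is a hedged illustration, not the project's code: only the names `MemoryFactory.create`, `should_compact`, and `compact` come from the post. `MemoryStrategy`, `SlidingWindowMemory`, the registry, and the choice to pass a strategy name (rather than the config object the post passes) are assumptions.

```python
import asyncio
from typing import Protocol


class MemoryStrategy(Protocol):
    """The two calls the agent loop relies on."""

    def should_compact(self, messages: list[dict], max_tokens: int) -> bool: ...

    async def compact(self, messages: list[dict], client: object) -> list[dict]: ...


class SlidingWindowMemory:
    """Hypothetical strategy: keep the task plus the last `keep` messages."""

    def __init__(self, keep: int = 4) -> None:
        if keep < 1:
            raise ValueError("keep must be at least 1")
        self.keep = keep

    def should_compact(self, messages: list[dict], max_tokens: int) -> bool:
        # Rough estimate of ~4 characters per token (an assumption).
        used = sum(len(str(m["content"])) for m in messages) // 4
        return used >= max_tokens

    async def compact(self, messages: list[dict], client: object) -> list[dict]:
        # No LLM call is needed here, so `client` goes unused.
        return messages[:1] + messages[1:][-self.keep:]


class MemoryFactory:
    """Maps a strategy name to a class; the loop only sees MemoryStrategy."""

    _registry: dict[str, type] = {"sliding_window": SlidingWindowMemory}

    @classmethod
    def create(cls, strategy: str) -> MemoryStrategy:
        if strategy not in cls._registry:
            raise ValueError(f"unknown memory strategy: {strategy!r}")
        return cls._registry[strategy]()


memory = MemoryFactory.create("sliding_window")
history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
compacted = asyncio.run(memory.compact(history, client=None))
print(len(compacted))
```

Adding a new strategy then means registering one more class; the loop itself does not change.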

Every step is logged into a Trajectory object that records the LLM call, tool invocations, reasoning text, token counts, and timing. This trajectory is what the evaluation engine analyzes.
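One plausible shape for such a record, sketched with dataclasses. Only the name `Trajectory` and the kinds of data it holds come from the post; `Step`, the field names, and the helper properties are hypothetical, and the sample values are made up to exercise the code.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One loop iteration: what the model said, did, and cost."""
    index: int
    reasoning: str
    tool_calls: list[str]
    input_tokens: int
    output_tokens: int
    duration_s: float


@dataclass
class Trajectory:
    """Ordered log of steps for a single task run."""
    task_id: str
    steps: list[Step] = field(default_factory=list)

    def log(self, step: Step) -> None:
        self.steps.append(step)

    @property
    def total_tokens(self) -> int:
        return sum(s.input_tokens + s.output_tokens for s in self.steps)

    @property
    def total_duration_s(self) -> float:
        return sum(s.duration_s for s in self.steps)


trajectory = Trajectory(task_id="example-task")
trajectory.log(Step(0, "Read the file first.", ["file_read"], 1200, 150, 2.5))
trajectory.log(Step(1, "Run the tests.", ["bash_execute"], 1500, 90, 4.0))
print(len(trajectory.steps), trajectory.total_tokens, trajectory.total_duration_s)
```

Metrics such as average steps or cost per task can then be derived from the logged steps after the run, instead of being tracked inside the loop.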

The results

I ran all 6 tasks with the default config (summarization memory, Claude Sonnet, 25 max steps) and here's what came back:

Metric	Value
Pass rate	100%
Avg steps	5.8
Cost per task	$0.07
Avg duration	30s
Tool efficiency	81.8%
Context utilization	16.9%
Error recovery	100%

6 for 6. Every bug found and fixed. Average cost under a dime.

But the interesting number isn't the pass rate — it's the context utilization at 16.9%. These tasks were short enough that the agent never hit the memory compaction threshold. The context window never filled up. Which means on these tasks, the memory strategy doesn't matter.
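A quick back-of-the-envelope check shows how far these runs were from compaction. It assumes the 16.9% figure is measured against the 90,000-token max_context_tokens from the YAML example and that the 0.8 compact_threshold applies to the same budget; the post does not state either explicitly.

```python
# Values taken from the YAML example and the results table.
max_context_tokens = 90_000
compact_threshold = 0.8
context_utilization = 0.169

tokens_used = round(context_utilization * max_context_tokens)
trigger_at = round(compact_threshold * max_context_tokens)
headroom = trigger_at - tokens_used

print(f"used about {tokens_used} tokens, compaction at {trigger_at}, headroom {headroom}")
```

Under those assumptions the runs used roughly 15,000 tokens against a 72,000-token trigger, so usage would have had to grow more than fourfold before any strategy did anything.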

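That's actually an important finding. For short tasks (my rough cutoff is ~20 tool calls), memory management is overhead with no benefit: compaction never fired here, so no summarization call was ever made and every strategy would have sent the model the same context. The cost gap only opens once the threshold is crossed, where summarization pays for an extra LLM call per compaction and sliding window does not.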

The real test will come with harder, multi-file tasks where the agent needs 50+ steps and the context window actually fills up. That's where I expect the strategies to diverge — and where I expect summarization and hybrid to significantly outperform sliding window.

What the trajectory data reveals

Looking at the raw trajectories, patterns emerge that metrics alone don't capture:

The agent's first move matters. On 5 of 6 tasks, the agent's first action was file_read to examine the buggy code. On the BFS task, it started with bash_execute to explore the directory first. The runs that read the code before acting needed fewer total steps, though with a single counterexample that is an observation, not yet a trend.

Error recovery is real. On the JSON parser task, the agent's initial test showed escape characters weren't working. Rather than guessing, it wrote a targeted diagnostic test to isolate the exact failure, then made a precise fix. That's the Plan → Act → Observe → Reflect loop working as designed.

Self-correction matters more than getting it right the first time. The Fibonacci task was solved in 7 steps — the agent initially changed the wrong variable, caught it when tests failed, and corrected course. The harness's trajectory logging made this visible. Without logging, you'd only see "pass" and miss the recovery.

The model-based judge

Quantitative metrics tell you what happened. But they don't tell you how well the agent reasoned. So I built a model-based judge — a separate LLM that reads the full trajectory and scores it on 5 dimensions:

Reasoning coherence — does each step follow logically from the last?
Plan adherence — does the agent follow its own stated plan?
Safety — does it avoid destructive operations?
Tool usage quality — does it read before writing, test after changing?
Error handling — does it diagnose errors or just retry blindly?

Each dimension gets a 1-5 score with written reasoning. This is the LLM-as-a-judge pattern that Anthropic and others have written about: using an LLM to judge another LLM's behavior. It catches things pass/fail metrics miss entirely: an agent can pass all tests but take a reckless, inefficient path to get there.
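A judge's output is only useful if it can be validated. The sketch below assumes the judge is asked to reply with JSON keyed by the five dimensions, each holding a `score` and a `reasoning` string. That response format, the key names, and `parse_judgement` are all assumptions for illustration, not the harness's documented schema.

```python
import json
from dataclasses import dataclass

# Hypothetical key names for the five dimensions listed above.
DIMENSIONS = (
    "reasoning_coherence",
    "plan_adherence",
    "safety",
    "tool_usage_quality",
    "error_handling",
)


@dataclass(frozen=True)
class DimensionScore:
    score: int
    reasoning: str


def parse_judgement(raw: str) -> dict[str, DimensionScore]:
    """Parse the judge's JSON reply and enforce the 1-5 range."""
    data = json.loads(raw)
    result = {}
    for name in DIMENSIONS:
        entry = data[name]  # KeyError if the judge skipped a dimension
        score = int(entry["score"])
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: score {score} is outside 1-5")
        result[name] = DimensionScore(score, str(entry["reasoning"]))
    return result


reply = json.dumps({name: {"score": 4, "reasoning": "example"} for name in DIMENSIONS})
scores = parse_judgement(reply)
print(sum(s.score for s in scores.values()) / len(scores))
```

Rejecting malformed or out-of-range replies up front keeps a flaky judge from silently skewing the averages.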

Multi-agent: does it help?

I also built a Planner → Executor → Reviewer pipeline where three specialized agents coordinate:

Planner decomposes the task into sub-tasks
Executor solves each sub-task
Reviewer checks the work and requests revisions (up to 2 rounds)

The coordination overhead is measurable — more LLM calls, more tokens, more cost. For simple single-file bugs, it's overkill. But the architecture is ready for the harder tasks where decomposition genuinely helps. I'm planning to test it on multi-file refactoring tasks where the planner's decomposition could reduce the executor's cognitive load per step.
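The control flow of that pipeline can be sketched with three plain callables standing in for the agents. This is a simplified illustration under stated assumptions: `run_pipeline` and its signature are hypothetical, the reviewer signals approval by returning `None`, and the cap of 2 revision rounds is applied per sub-task.

```python
from typing import Callable, Optional


def run_pipeline(
    task: str,
    plan: Callable[[str], list[str]],
    execute: Callable[[str, Optional[str]], str],
    review: Callable[[str, str], Optional[str]],
    max_revisions: int = 2,
) -> list[str]:
    """Plan, then execute each sub-task with up to `max_revisions` redo rounds."""
    results = []
    for sub_task in plan(task):
        output = execute(sub_task, None)
        for _ in range(max_revisions):
            feedback = review(sub_task, output)
            if feedback is None:  # reviewer approved
                break
            output = execute(sub_task, feedback)
        results.append(output)
    return results


# Stub agents so the sketch runs without any LLM calls.
calls: list[tuple[str, Optional[str]]] = []


def stub_plan(task: str) -> list[str]:
    return [f"{task}: locate bug", f"{task}: apply fix"]


def stub_execute(sub_task: str, feedback: Optional[str]) -> str:
    calls.append((sub_task, feedback))
    return f"done ({sub_task})"


def strict_review(sub_task: str, output: str) -> Optional[str]:
    return "please revise"  # never approves, to show the cap


outputs = run_pipeline("demo", stub_plan, stub_execute, strict_review)
print(len(outputs), len(calls))
```

Even in this toy form the overhead is visible: a reviewer that never approves turns 2 sub-tasks into 6 executor calls.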

CI integration: the agent in the loop

The most practical feature might be the CI integration. When a PR's tests fail, a GitHub Actions workflow automatically:

1. Detects the failing tests
2. Generates a task definition from the PR diff
3. Runs the AgentForge agent on the failure
4. Posts a formatted comment on the PR with the agent's analysis

It's the agent harness as a CI/CD tool — not replacing developers, but giving them a second pair of eyes that can diagnose why tests broke.
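The first step, detecting the failing tests, could look something like the sketch below. It assumes the project's tests run under pytest and that the workflow has the captured console output, then pulls test IDs from the `FAILED ...` lines of pytest's short test summary. `failing_tests` is a hypothetical helper, and the file and test names in the sample are invented.

```python
import re

# Matches pytest short-summary lines such as:
#   FAILED path/to/test_file.py::test_name - AssertionError: ...
FAILED_LINE = re.compile(r"^FAILED (.+?)(?: - (.*))?$")


def failing_tests(pytest_output: str) -> list[tuple[str, str]]:
    """Return (test id, short message) for each FAILED summary line."""
    failures = []
    for line in pytest_output.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            failures.append((match.group(1), match.group(2) or ""))
    return failures


# Made-up sample output, for illustration only.
sample = "\n".join([
    "FAILED tests/test_bfs.py::test_shortest_path - AssertionError: assert [1, 3] == [1, 2, 3]",
    "FAILED tests/test_lru.py::test_eviction_order",
    "===== 2 failed, 4 passed in 0.31s =====",
])
for test_id, message in failing_tests(sample):
    print(test_id, "|", message)
```

The extracted IDs and messages, together with the PR diff, would be the raw material for the generated task definition.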

What I'd build next

Three things are on my immediate roadmap:

Harder tasks. The current suite is single-file bugs. Real-world engineering involves cross-file changes, dependency chains, and ambiguous requirements. I want tasks where the memory strategy has to kick in because the context window fills up.

Head-to-head memory comparison. Running the same task suite with all 4 strategies side by side, with statistical significance testing. The hypothesis: summarization and hybrid will outperform sliding window on long tasks, but sliding window will be cheaper on short ones. I want the data to prove or disprove it.

Agent-to-agent evaluation. Instead of a fixed judge, have agents evaluate each other's trajectories. This creates a richer evaluation signal and tests whether the evaluation itself is robust.
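For the head-to-head comparison on the roadmap, one option among several is an exact paired permutation (sign-flip) test on a per-task metric such as step count. The sketch below is an illustration of that technique, not a committed design; `paired_permutation_p` is a hypothetical name, and no real results are implied.

```python
from itertools import product


def paired_permutation_p(a: list[float], b: list[float]) -> float:
    """Two-sided exact sign-flip test on paired per-task differences."""
    if len(a) != len(b):
        raise ValueError("both strategies must be scored on the same tasks")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Under the null hypothesis each difference is equally likely to flip sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Smallest p-value this test can produce with 6 paired tasks:
print(paired_permutation_p([1.0] * 6, [0.0] * 6))
```

With only 6 tasks there are 64 sign patterns, so the smallest possible two-sided p-value is 2/64 ≈ 0.031. That is another argument for growing the task suite before reading much into strategy differences.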

Why this matters for the industry

The harness engineering conversation is exploding right now. Anthropic published a blog post showing that context resets with structured handoff artifacts were essential for their long-running agent harness. LangChain demonstrated that the harness — not the model — determines agent reliability. The industry consensus in March 2026 is clear: the model is the engine, but the harness is the car.

AgentForge is my contribution to that conversation. It's small — 6 tasks, 4 strategies, one benchmark run. But it's open, it's measurable, and it's designed to make the harness decisions pluggable and comparable.

If you're building agents and you're not systematically benchmarking your harness architecture, you're flying blind. The model will get better every quarter. Your harness is what compounds.

Try it yourself:

git clone https://github.com/Mrabbi3/agentforge
cd agentforge
pip install -e ".[dev]"
python quickstart.py  # verify everything works
export ANTHROPIC_API_KEY="sk-ant-..."
agentforge benchmark --config configs/default.yaml

Links:

GitHub: github.com/Mrabbi3/agentforge
Dashboard: Live results viewer

I'm MD Rabbi, a CS student at Stockton University building AI agent systems. I previously reproduced Google's Med-PaLM M paper using BLIP-2 with LoRA fine-tuning. I'm looking for Research Engineer roles at AI companies working on agent systems. If you're building in this space, I'd love to connect — reach out on LinkedIn or GitHub.
