DEV Community: Nikhil tiwari

Building a Website with Anthropic's Generator-Evaluator Loop (Harness Engineering)

Nikhil tiwari — Sun, 17 May 2026 08:18:22 +0000

Anthropic recently published their harness design for long-running apps — a GAN-inspired architecture where a Generator builds and an Evaluator critiques in a loop. I replicated it with Kiro CLI to build a marketing website autonomously.

Result: 12 iterations, 3.5 hours, zero manual coding.

The Architecture

Planner (1x) → Generator ↔ Evaluator (12x)

Each agent is a separate CLI process — clean slate, no shared context:

#!/bin/bash
# orchestrator.sh: spawns a fresh Kiro CLI session for every agent invocation

KIRO="/Users/me/.local/bin/kiro-cli"  # adjust to your own install path

# Run one agent in a new non-interactive session; $1 is the prompt file to follow
invoke_kiro() {
  "$KIRO" chat --no-interactive --trust-all-tools \
    "Read $1 and execute all instructions."
}

# Planner (once)
invoke_kiro "prompts/planner.md"

# Generator ↔ Evaluator loop
for ((i=1; i<=12; i++)); do
  # Stop the dev server left running by the previous evaluator pass
  pkill -f "next dev" 2>/dev/null; sleep 2
  invoke_kiro "prompts/generator.md"
  invoke_kiro "prompts/evaluator.md"
done

Communication is only through files:

Planner writes .harness/spec.md
Evaluator writes .harness/eval-report.md
Generator reads both
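
To make the file handoff concrete, here is a minimal Python sketch of what the Generator's inputs look like on disk. The function name `read_handoff` and the choice to return `None` when no report exists are my own assumptions for illustration; in the actual harness the agent simply reads the files itself.

```python
from pathlib import Path
import tempfile

HARNESS_DIR = ".harness"  # directory name taken from the article


def read_handoff(project_root: Path) -> dict:
    """Collect the two files the Generator reads before it starts work."""
    harness = project_root / HARNESS_DIR
    spec = harness / "spec.md"
    report = harness / "eval-report.md"
    if not spec.exists():
        raise FileNotFoundError("Planner has not written spec.md yet")
    return {
        "spec": spec.read_text(encoding="utf-8"),
        # Assumption: no report exists on the first pass, before the Evaluator has run.
        "eval_report": report.read_text(encoding="utf-8") if report.exists() else None,
    }
```

Because the files are the only shared state, each agent can be restarted or swapped out without the others noticing.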

The Evaluator Uses Playwright

This is the key differentiator. The evaluator doesn't just read code — it browses the live site:

# evaluator.md (excerpt)

### Start Server
nohup npx next dev --port 3000 > /dev/null 2>&1 & disown; sleep 8

### Test with Playwright MCP
1. browser_navigate to http://localhost:3000
2. browser_snapshot at 1440x900, 768x1024, 375x812
3. browser_click all interactive elements
4. Score: Design Quality, Originality, Craft, Functionality

### Always provide feedback — never just "PASS"

The Design Skill (Anti-AI-Slop)

Both Generator and Evaluator reference Anthropic's frontend design skill:

## NEVER
- Inter, Roboto, Arial, Space Grotesk
- Purple gradients on white backgrounds
- Predictable card layouts

## ALWAYS
- Bold, distinctive font choices
- Unexpected layouts, asymmetry
- Commit fully to a direction

Designs that lean on generic AI patterns are capped at 5/10, which forces creative risk-taking.
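
The cap is enforced by the Evaluator following its prompt, not by code, but the rule itself is simple enough to sketch. The snippet below is a hypothetical illustration that only checks fonts from the NEVER list; `capped_design_score` and `GENERIC_FONTS` are names I made up.

```python
GENERIC_FONTS = {"inter", "roboto", "arial", "space grotesk"}  # fonts from the NEVER list
GENERIC_CAP = 5  # generic patterns score at most 5/10


def capped_design_score(raw_score: int, fonts_used: list[str]) -> int:
    """Cap the design score when any font from the NEVER list is in use."""
    uses_generic = any(font.strip().lower() in GENERIC_FONTS for font in fonts_used)
    return min(raw_score, GENERIC_CAP) if uses_generic else raw_score
```

With this rule, a site that would otherwise earn 8/10 drops to 5/10 if it ships with Inter, while a score already below the cap is left alone.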

Iteration Progression

Iteration	What Happened	Design Score
1	Generic dark site, Inter font	5/10
2	Amber palette pivot, custom SVGs	6/10
3	Fixed a11y, responsive, fonts	6/10
4	Terminal Noir pivot — IBM Plex Mono, grain, scanlines	7/10
5-12	Refinement — animations, reduced-motion, polish	7-8/10

The biggest jumps came from pivots, not refinements.

Key Lessons

Separate generator from evaluator: a model grading its own output tends to go easy on it
Clean slate per invocation: a fresh session each time prevents context anxiety
Playwright > code review: browsing the live site catches visual bugs that reading code misses
Penalize generic patterns: otherwise the model converges on AI slop
Pivots > polish: tell the generator to scrap and restart if scores stagnate
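
One way to decide when scores have stagnated is to compare the most recent scores against the one just before them. This is a sketch under my own assumptions (a window of 3 iterations and "no score above the baseline" as the trigger); the harness itself leaves the judgment to the prompts.

```python
def should_pivot(scores: list[int], window: int = 3) -> bool:
    """Return True when none of the last `window` scores beats the score before them."""
    if len(scores) <= window:
        return False  # not enough history to call it stagnation
    recent = scores[-window:]
    baseline = scores[-window - 1]
    return max(recent) <= baseline
```

When this returns True, the next generator prompt could include an explicit instruction to abandon the current direction instead of refining it.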

Two Architectures

I built templates for both use cases:

Frontend (design-focused):

No sprints, single build
5-15 iterations of generate → evaluate → improve
Scoring: Design Quality, Originality, Craft, Functionality

Full-stack (correctness-focused):

Sprint-based with contracts
Generator + Evaluator negotiate "done" before coding
Pass/fail per sprint, max 3 retries
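
The pass/fail rule for the full-stack variant can be sketched as a small retry loop. I am reading "max 3 retries" as one initial attempt plus up to three more, which is an assumption, and `build` and `passes` are hypothetical callables standing in for the Generator and Evaluator sessions.

```python
from typing import Callable

MAX_RETRIES = 3  # "max 3 retries" from the article


def run_sprint(build: Callable[[int], None], passes: Callable[[], bool]) -> bool:
    """Build once, then retry up to MAX_RETRIES times until the evaluator passes."""
    for attempt in range(1 + MAX_RETRIES):
        build(attempt)   # Generator works on the sprint contract
        if passes():     # Evaluator checks the contract: pass or fail
            return True
    return False         # sprint failed after all retries
```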

Try It

git clone https://github.com/Mnemo-mcp/Harness
cd Harness
./orchestrator.sh

Requires: Kiro CLI + Playwright MCP configured in .kiro/settings/mcp.json.

The model is the engine. The harness is what makes it drive straight.

I Built Persistent Memory for AI Coding Assistants — Here's How It Works

Nikhil tiwari — Sun, 10 May 2026 21:42:13 +0000

Every time you open a new AI chat, your assistant forgets everything. I fixed that.

The Problem

If you use Cursor, Claude Code, or Amazon Q regularly, you've probably hit this wall:

You explain your project architecture in Monday's chat. On Tuesday, you open a new session and start from scratch. You paste the same context, re-explain the same patterns, and re-describe the same service boundaries — every single time.

This isn't a minor inconvenience.

For large codebases, every AI interaction starts with 10 minutes of context-loading before you can ask anything useful. For teams, every developer builds their own mental model of the codebase in isolation, while the AI assistant knows none of it.

I've been building production systems on Azure for several years — .NET microservices, KEDA autoscaling, Azure Service Bus pipelines. Our codebase has 40+ services, clean architecture patterns, vendor integration handlers, and years of architectural decisions that live entirely in people's heads.

Every time I opened a new AI chat, I was manually transferring that knowledge into a chat window.

So I built Mnemo.

What Mnemo Does

Mnemo is a local MCP (Model Context Protocol) server that gives AI coding assistants persistent, structured knowledge about your codebase.

One command initializes it.

After that, every AI chat automatically knows:

Your project's architecture and patterns
Your API endpoints
Your engineering decisions
Who owns which part of the codebase
Errors you've already debugged and how you fixed them
Incidents, code reviews, and team knowledge

It works with Cursor, Claude Code, Amazon Q, and any MCP-compatible AI client.

The moment a new AI chat session starts, Mnemo automatically loads your project context.

You never paste architecture descriptions again.

How It Works Technically

When you run:

mnemo init

inside your project, several things happen.

1. AST-Based Codebase Parsing

Mnemo parses your codebase using real language parsers — not regex or grep.

C# → Roslyn
Python → ast
TypeScript → TypeScript Compiler API

This allows Mnemo to understand:

Method signatures
Class hierarchies
Interface implementations
Dependency relationships

From this, it builds a compact repo map — a structured representation of your codebase shape.

Not the full source code.

Just the architecture-level understanding required for AI context.
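
To show what a repo map can look like, here is a minimal sketch for the Python case using the standard `ast` module. It is not Mnemo's actual implementation; `repo_map` is a name I chose, and it only handles top-level classes and functions.

```python
import ast


def repo_map(source: str) -> list[str]:
    """Return class and function signatures found in `source`, without bodies."""
    entries = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            bases = ", ".join(ast.unparse(base) for base in node.bases)
            entries.append(f"class {node.name}({bases})" if bases else f"class {node.name}")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    entries.append(f"  def {item.name}({ast.unparse(item.args)})")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append(f"def {node.name}({ast.unparse(node.args)})")
    return entries


SAMPLE = '''
class StripeHandler(BasePaymentHandler):
    def charge(self, amount: float) -> bool:
        return True

def make_handler(name):
    return StripeHandler()
'''

for line in repo_map(SAMPLE):
    print(line)
```

The output keeps the class hierarchy and method signatures and drops everything else, which is what keeps the map small enough to hand to an assistant. `ast.unparse` requires Python 3.9 or later.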

2. Architecture Detection

Mnemo scans the codebase for structural signals:

Common handler inheritance
IRepository<T> patterns
Command/query separation
Event-driven conventions
DI registration styles

Using these signals, it classifies your architecture automatically:

Clean Architecture
CQRS
Event-Driven
Hexagonal
Repository Pattern
Handler Pattern

This becomes part of the persistent project memory.

3. MCP Server Initialization

Mnemo launches a local MCP server process that exposes tools AI assistants can call.

At the start of each AI chat session, the assistant calls:

mnemo_recall

Mnemo then returns a structured context payload containing:

Repo map
Architecture profile
Recent engineering decisions
Error/debug history
Current task context

Everything is stored locally inside:

.mnemo/

Currently:

JSON files store structured memory
A vector store powers semantic search

No source code leaves your machine.
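
As a rough picture of JSON-backed memory, the sketch below appends engineering decisions to a file and returns the most recent ones. The file name `decisions.json` and both function names are hypothetical; the article only says that JSON files live under `.mnemo/`.

```python
import json
from pathlib import Path
import tempfile


def remember_decision(store_dir: Path, text: str) -> None:
    """Append one decision to a JSON list stored on local disk."""
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / "decisions.json"  # hypothetical file name
    decisions = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    decisions.append(text)
    path.write_text(json.dumps(decisions, indent=2), encoding="utf-8")


def recall_decisions(store_dir: Path, limit: int = 5) -> list[str]:
    """Return up to `limit` of the most recently stored decisions."""
    path = store_dir / "decisions.json"
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))[-limit:]
```

Plain JSON keeps the memory easy to inspect and to review in a diff if the directory is committed.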

Why MCP Matters

MCP (Model Context Protocol) is an open protocol from Anthropic that standardizes how AI assistants connect to tools and external context.

Think of it like USB for AI tooling.

Instead of every AI platform building proprietary integrations:

Any MCP server can provide tools
Any MCP-compatible AI client can consume them

Mnemo implements MCP, meaning it works across:

Cursor
Claude Code
Amazon Q
Kiro
Other MCP-compatible tools

Mnemo isn't tied to a single AI assistant.

It's an intelligence layer that upgrades all of them.
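
Under the hood, MCP messages are JSON-RPC 2.0, and a client invokes a tool with the `tools/call` method. The sketch below builds such a request for `mnemo_recall`; passing an empty `arguments` object is my assumption, since the article does not list the tool's parameters.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Assumption: mnemo_recall is called without arguments.
print(tool_call_request(1, "mnemo_recall", {}))
```

Any client that can send this message can use the server, which is why one Mnemo install serves several assistants.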

What the AI Actually Sees

When the assistant calls mnemo_recall, it receives structured project context like this:

## Project Context
Architecture: Clean Architecture + CQRS
Patterns: Repository (9 interfaces), Handler pattern (12 handlers), DI container

## Decisions
- Use handler pattern for vendor-specific logic
- Auth service uses cache-aside with 5min TTL

## Repo Map
PaymentService/Handlers/
  - StripeHandler
  - PayPalHandler
  - SquareHandler

AuthService/Services/
  - TokenService : ITokenService

With this context loaded, the AI understands:

How the codebase is structured
Which patterns are expected
Existing architectural conventions
Historical engineering decisions

So when you ask:

"Add a new payment handler"

The generated implementation:

Inherits from BasePaymentHandler
Follows existing conventions
Registers correctly in DI
Matches existing architecture

Without that context, an assistant tends to generate generic code that ignores the system's existing conventions.

Installation

Option A: VS Code Extension (Easiest)

Install the Mnemo extension from the VS Code Marketplace
Open a project
Click "Initialize Mnemo?"

Done.

The extension automatically:

Downloads the Mnemo binary
Initializes the repository
Configures MCP

No Python required.

Option B: Homebrew (macOS/Linux)

brew tap Mnemo-mcp/tap
brew install mnemo

Then:

cd your-project
mnemo init

Option C: pip (All Platforms)

pip install mnemo

Or from source:

git clone https://github.com/Mnemo-mcp/Mnemo.git
cd Mnemo
pip install -e .

Then:

cd your-project
mnemo init

Final Thoughts

Mnemo started as a solution to a frustrating problem:

AI assistants forget everything between sessions.

For small projects, that's annoying.

For large production systems, it's a major productivity bottleneck.

Mnemo gives AI coding assistants persistent architectural memory, allowing them to operate with real understanding of your codebase instead of stateless guesses.

I'd love feedback — especially from teams managing large, distributed systems.

What project context do you find yourself re-explaining most often?