A methodology for keeping AI-assisted projects honest across weeks of sessions
Most developers who use Claude Code for anything beyond a single file have hit the same wall.
You spend a week building something real. The architecture is in your head. The decisions are in your head. The tradeoffs are in your head. Then you start a new session and Claude Code starts cold, with no memory of any of it. You paste some context, it gets going, and three hours later you realize it quietly drifted from the design you spent days thinking through. Nothing broke. The tests pass. But the code no longer matches the architecture you intended.
This is not a Claude Code problem. It is a context problem. And it gets worse the longer the project runs.
I ran into this building AI Ranger, a passive network observability tool that detects which AI providers are being called across an organization's machines. The project spans a Rust agent, a Python gateway, Go workers, ClickHouse, Postgres, RabbitMQ, and a React dashboard. Weeks of work, hundreds of decisions, multiple phases. The kind of project where losing track of why something was built a certain way costs real time.
Here is the methodology I developed to keep it under control.
The Core Idea: Three Roles, Not One
Most people use Claude Code as a single assistant that does everything. You describe a task, it implements it, you review the diff. This works fine for isolated tasks. It breaks down for complex projects.
The methodology splits the work across three roles:
| Role | Who | What they do |
|---|---|---|
| Project Owner | You | Vision, domain expertise, final decisions |
| Supervisor | Claude (long-lived chat) | Architecture, planning, prompts, reviews |
| Executor | Claude Code | Plans, implements, updates docs, reports back |
The key insight is that the Supervisor and the Executor are two different AI instances with two different jobs.
The Supervisor is a long-lived Claude chat session that holds the full project history. It never writes code. Its job is to think through problems, make architectural decisions, and translate those decisions into precise prompts for the Executor. It also reviews everything the Executor produces and catches drift before it compounds.
The Executor is Claude Code. It operates in short-lived sessions with no memory of previous work. It is excellent at execution and poor at judgment about the overall system. So you do not ask it for judgment. You give it precise instructions and review its output carefully.
The Project Owner - you - never writes prompts for Claude Code directly. You have a conversation with the Supervisor, who translates your vision into something the Executor can act on without going off the rails.
The External Memory
The Supervisor compensates for the Executor's cold starts through three documents that act as persistent memory across every session:
ARCHITECTURE.md is the source of truth for what the system is. Every component, interface, data structure, and phase boundary. The Executor reads this at the start of every session. It cannot hold the whole project in mind, but it can read the document that does.
DECISIONS.md is the record of why the system is the way it is. Every significant decision, including the ones that were reversed, with full reasoning. Why this database and not that one. Why this library was rejected. Why a feature was deferred. It answers the question every future contributor will ask: "why is it done this way?"
CLAUDE.md is the standing instructions file. It accumulates rules as the project matures. No magic numbers. No business logic in main.rs. Database access uses ORMs. Health endpoints on every HTTP service. Every rule was added because something went wrong or almost went wrong without it.
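Concretely, a CLAUDE.md under this methodology ends up reading like a short rulebook. This fragment is illustrative only - the rules are the ones listed above, the exact wording and layout are hypothetical:

```markdown
# Standing instructions

- No magic numbers. Named constants only.
- No business logic in main.rs; main only wires components together.
- Database access goes through the ORM.
- Every HTTP service exposes a health endpoint.
```

Each line earns its place the same way: a rule gets added only after something went wrong, or nearly did, without it.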
These documents persist across sessions. The AI's memory is ephemeral. The documents are not.
Two Loops
Once at Project Start
Before any code is written, the Supervisor and Project Owner establish the foundation:
- Project Owner shares the vision with the Supervisor
- Supervisor produces ARCHITECTURE.md with the full system design, divided into phases
- Supervisor writes DECISIONS.md and CLAUDE.md
- Executor reads the docs, asks clarifying questions, and signals readiness
The plan is not fixed. Phases can change, priorities can shift, decisions get revised. All of this happens in conversation with the Supervisor, who updates the docs to reflect reality. DECISIONS.md records every pivot and why it happened.
For Every Task
Every task follows the same four-step cycle:
Step 1: Supervisor writes the planning prompt. Scoped, constrained, with explicit acceptance criteria. Always ends with: "Show me the plan. Wait for my approval before writing anything."
Step 2: Executor proposes a full plan. Lists every file it will create or modify and why. No code written yet. Supervisor reviews, may refine, then approves.
Step 3: Executor implements and reports. Implements on approval. Updates all relevant docs. Reports back with a full summary of every file touched, every decision made, and any blockers encountered.
Step 4: Supervisor reviews. Reads the summary. Verifies it matches the intent. Catches drift. Either closes the task or sends a correction prompt.
For small isolated fixes, you can skip straight to the Executor. Use judgment. If it touches architecture, use the full loop.
What the Prompts Actually Look Like
This is where most writeups stay abstract. Here are the real prompts.
Planning prompt (before any code)
Read ARCHITECTURE.md before doing anything.
We are adding IP range matching as a third detection method.
Before writing any code, produce a plan covering:
- Which files you will create or modify
- The exact function signatures you will add
- How this fits the existing detection order: SNI > DNS > IpRange
- Which tests you will add
Constraints: only Anthropic has dedicated IP ranges. Do not add
ip_ranges to CDN-backed providers. Use the ipnet crate. No magic strings.
Show me the plan. Wait for my approval before writing anything.
Execution prompt (after plan approval)
Good plan. One clarification: the IP range fallback must only fire if
both SNI and DNS produced no match, not just SNI.
Priority order: SNI > DNS > IpRange.
Proceed with all changes as outlined. After implementation confirm:
- Test count has increased (at least 3 new tests)
- cargo clippy passes clean
- Show me a summary of every file touched
Review after action
The Executor reports back. You do not just move on. You verify:
Executor: 3 new tests passing. Files touched: classifier/providers.rs, providers/providers.toml, event.rs, main.rs. ARCHITECTURE.md updated.
Supervisor: Clean. Before moving on, confirm the fallback order in main.rs is SNI > DNS > IpRange and not SNI > IpRange > DNS. Show me that code block.
This step is where drift gets caught. Things that look correct in the summary are sometimes subtly wrong in the implementation.
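The fallback order the Supervisor keeps re-checking can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AI Ranger's actual code: the `Provider` fields, the `demo_providers` helper, and the CIDR (a documentation range, not a real provider range) are all hypothetical, and the hand-rolled IPv4 prefix check stands in for the `ipnet` crate to keep the sketch dependency-free:

```rust
use std::net::Ipv4Addr;

/// Hypothetical provider entry; field names are illustrative only.
struct Provider {
    name: &'static str,
    sni_suffix: &'static str,
    dns_suffix: &'static str,
    /// CIDR as (network, prefix length); real code would use the ipnet crate.
    ip_range: Option<(Ipv4Addr, u8)>,
}

/// True if `ip` falls inside the `net`/`prefix` CIDR block.
fn in_range(ip: Ipv4Addr, net: Ipv4Addr, prefix: u8) -> bool {
    let mask: u32 = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    (u32::from(ip) & mask) == (u32::from(net) & mask)
}

/// Detection order is SNI > DNS > IpRange: the IP range fallback
/// fires only when both SNI and DNS produced no match.
fn classify(
    providers: &[Provider],
    sni: Option<&str>,
    dns: Option<&str>,
    dst_ip: Ipv4Addr,
) -> Option<&'static str> {
    if let Some(host) = sni {
        if let Some(p) = providers.iter().find(|p| host.ends_with(p.sni_suffix)) {
            return Some(p.name);
        }
    }
    if let Some(host) = dns {
        if let Some(p) = providers.iter().find(|p| host.ends_with(p.dns_suffix)) {
            return Some(p.name);
        }
    }
    providers
        .iter()
        .find(|p| p.ip_range.map_or(false, |(net, len)| in_range(dst_ip, net, len)))
        .map(|p| p.name)
}

/// Illustrative provider table; the CIDR is a documentation range (RFC 5737).
fn demo_providers() -> Vec<Provider> {
    vec![Provider {
        name: "anthropic",
        sni_suffix: "anthropic.com",
        dns_suffix: "anthropic.com",
        ip_range: Some((Ipv4Addr::new(203, 0, 113, 0), 24)),
    }]
}

fn main() {
    let providers = demo_providers();
    // SNI wins when present; the IP range only matters with no SNI/DNS match.
    assert_eq!(
        classify(&providers, Some("api.anthropic.com"), None, Ipv4Addr::new(1, 1, 1, 1)),
        Some("anthropic")
    );
    assert_eq!(
        classify(&providers, None, None, Ipv4Addr::new(203, 0, 113, 7)),
        Some("anthropic")
    );
    println!("fallback order checks passed");
}
```

The review question in the dialogue above is exactly about the order of these three blocks: swapping the DNS and IP range checks would still compile and still pass naive tests, which is why the Supervisor asks to see the code block rather than trusting the summary.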
The Phase Audit
At the end of every phase, before merging or moving on, the Supervisor runs a structured audit:
Read ARCHITECTURE.md, README.md, and DECISIONS.md in full.
We are doing a final Phase 1 audit. Do not fix anything. Audit only.
Check every Phase 1 deliverable. Compare every data structure in the docs
against the actual code. Check every README claim against what actually
exists. Scan the codebase for TODO, FIXME, and HACK comments.
Produce a report: Aligned / Misaligned / Missing from docs / Missing from code.
Wait for my review before making any changes.
The Phase 1 audit on AI Ranger caught 37 discrepancies. The README advertised a traffic measurement feature that had been deliberately removed weeks earlier. A field annotated as Phase 1 in the architecture was correctly marked Phase 5 in the code - a discrepancy that would have sent the first contributor down the wrong path. None of these showed up in tests. All of them would have cost a new contributor real time.
Pivots Are Documented, Not Hidden
Every project has pivots. In most AI-assisted projects they just happen silently. The code changes, the original plan is forgotten, and nobody knows why the current approach differs from what was designed.
Here, every pivot goes into DECISIONS.md immediately. What was planned, what changed, why.
The Windows capture layer in AI Ranger went through three designs. SIO_RCVALL could not capture IPv6. Then ETW NDIS-PacketCapture required undocumented IOCTLs activated via netsh, making it fragile. The final solution was ETW DNS-Client, which gives hostname and PID directly from the OS DNS resolver - which turned out to be better than raw packets for this use case anyway. Three designs, two pivots, all documented. A new contributor reading DECISIONS.md knows not just what the current approach is but why two apparently reasonable alternatives were rejected.
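Written out, the entry for that pivot might look like this in DECISIONS.md - an illustrative sketch of the format, not the project's actual file:

```markdown
## Windows capture: ETW DNS-Client (supersedes two earlier designs)

- **Planned:** raw socket capture via SIO_RCVALL.
- **Rejected:** SIO_RCVALL cannot capture IPv6.
- **Then tried:** ETW NDIS-PacketCapture.
- **Rejected:** requires undocumented IOCTLs activated via netsh; too fragile.
- **Final:** ETW DNS-Client. Gives hostname and PID directly from the OS
  DNS resolver - better than raw packets for this use case.
```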
What About Existing Codebases?
The methodology was developed on a greenfield project but the principles apply to existing codebases with one key adaptation. You cannot write ARCHITECTURE.md upfront for a system that already exists.
Instead, scope the external memory per module. Before touching a part of the codebase, ask the Executor to read that module and produce a focused document covering the relevant files, dependencies, and interfaces. Upload it to the Supervisor and start the planning conversation from there. You can build a full feature across several modules this way, producing a small architecture document for each area before you touch it. Keep them or delete them when done. Either way they give you the foundation to run the full process on any existing codebase.
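A scoping prompt for that first step might read like this - hypothetical wording following the same pattern as the prompts above, with the module name invented for illustration:

```
Read the payments module and its direct dependencies. Do not modify any code.
Produce PAYMENTS-ARCHITECTURE.md covering: every file and its role, the
public interfaces other modules call, the data structures that cross the
module boundary, and any external services involved.
Wait for my review before doing anything else.
```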
What This Produces
By the end of Phase 1, AI Ranger had zero TODO comments in the codebase, 49 passing tests, and documentation that accurately described what the code actually did.
By the end of Phase 2, it had a full backend with FastAPI, Go workers, RabbitMQ, Postgres, and ClickHouse - enterprise-grade configuration management, health endpoints on every service, k8s-compatible architecture, per-service Dockerfiles, and an integration test suite covering the full pipeline, all verified by CI on every push.
None of this happened by accident. It happened because the Supervisor maintained the architectural vision across dozens of Claude Code sessions, caught drift early through regular audits, and documented every significant decision as it was made.
The AI does not get better at your project over time.
But your external memory does. And that is what makes the difference.
This methodology built AI Ranger - an open source passive agent that tells you which AI tools are running across your organization's machines. No proxies, no certificate installation, no content inspection. The repo: github.com/ai-ranger-io/ai-ranger
