<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Esan Mohammad</title>
    <description>The latest articles on DEV Community by Esan Mohammad (@esankhan).</description>
    <link>https://dev.to/esankhan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F535191%2F1258a937-0655-4b14-aa8e-1d0c48b86e10.png</url>
      <title>DEV Community: Esan Mohammad</title>
      <link>https://dev.to/esankhan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/esankhan"/>
    <language>en</language>
    <item>
      <title>I rebuilt my open-source AI coding agent that routes each pipeline stage to a different LLM</title>
      <dc:creator>Esan Mohammad</dc:creator>
      <pubDate>Tue, 05 May 2026 15:55:01 +0000</pubDate>
      <link>https://dev.to/esankhan/i-rebuilt-my-open-source-ai-coding-agent-that-routes-each-pipeline-stage-to-a-different-llm-5fph</link>
      <guid>https://dev.to/esankhan/i-rebuilt-my-open-source-ai-coding-agent-that-routes-each-pipeline-stage-to-a-different-llm-5fph</guid>
      <description>&lt;p&gt;The pattern in AI coding tools has been bugging me for a while.&lt;/p&gt;

&lt;p&gt;You sign up for one of them. You agree to a per-seat subscription. You get exactly one model: the one the vendor picked for you.&lt;/p&gt;

&lt;p&gt;Underneath, the whole thing is glued to that vendor’s SDK, so even if you wanted to swap models, you couldn’t without forking. Then the next month, a better model ships from a different vendor, and you’re stuck.&lt;/p&gt;

&lt;p&gt;That way of building locks users out of one of the most valuable properties of LLMs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;They are swappable, comparable, and increasingly cheap.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I built &lt;strong&gt;Anvil&lt;/strong&gt; — an open-source coding agent that takes a one-line feature request and ships a PR end-to-end.&lt;/p&gt;

&lt;p&gt;The thing that makes it different is &lt;strong&gt;per-stage model routing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A single pipeline run can cycle through three or four different LLMs, each picked for what it is actually good at.&lt;/p&gt;

&lt;h2&gt;The pitch in one example&lt;/h2&gt;

&lt;p&gt;Here is an actual run from yesterday:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clarify     →  Ollama qwen3:14b      (local)          ~ $0.00
plan        →  Claude Sonnet 4.6     (deep analysis)  ~ $0.05
build       →  Ollama qwen3:14b      (local)          ~ $0.00
test        →  Ollama qwen3:14b      (local)          ~ $0.00
validate    →  Claude Haiku 4.5      (cheap, fast)    ~ $0.01
review      →  Claude Sonnet 4.6     (judgment)       ~ $0.08
ship        →  Ollama qwen3:14b      (local git ops)  ~ $0.00
──────────
~ $0.14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read that top to bottom.&lt;/p&gt;

&lt;p&gt;That is a fully reviewed, tested, PR’d feature for &lt;strong&gt;fourteen cents&lt;/strong&gt; in cloud spend.&lt;/p&gt;

&lt;p&gt;Most stages ran free on a local model. Premium models only showed up where premium models actually move the quality needle: planning the work and reviewing the result.&lt;/p&gt;

&lt;h2&gt;The routing is just config&lt;/h2&gt;

&lt;p&gt;The routing is not hardcoded. It is a YAML file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ~/.anvil/stage-policy.yaml&lt;/span&gt;
&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;clarify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;capability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reasoning&lt;/span&gt;
    &lt;span class="na"&gt;complexity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;S&lt;/span&gt;
    &lt;span class="na"&gt;prefer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cheap&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;premium&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;capability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reasoning&lt;/span&gt;
    &lt;span class="na"&gt;complexity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;L&lt;/span&gt;
    &lt;span class="na"&gt;prefer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;premium&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;capability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code&lt;/span&gt;
    &lt;span class="na"&gt;complexity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;M&lt;/span&gt;
    &lt;span class="na"&gt;prefer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cheap&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;premium&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;capability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reasoning&lt;/span&gt;
    &lt;span class="na"&gt;complexity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;L&lt;/span&gt;
    &lt;span class="na"&gt;prefer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;premium&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You declare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the capability each stage needs&lt;/li&gt;
&lt;li&gt;the complexity of the task&lt;/li&gt;
&lt;li&gt;the tier preference order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resolver walks &lt;code&gt;~/.anvil/models.yaml&lt;/code&gt; and picks the cheapest model that matches.&lt;/p&gt;
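
&lt;p&gt;For illustration, here is roughly what that resolution could look like (a sketch; the type and field names are my assumptions, not Anvil's actual internals):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Tier = "local" | "cheap" | "premium";

// Sketch of a models.yaml entry after parsing. Field names are assumptions.
interface ModelEntry {
  id: string;             // e.g. "ollama/qwen3:14b"
  tier: Tier;
  capabilities: string[]; // e.g. ["code", "reasoning"]
  costPerMTok: number;    // blended $ per million tokens; 0 for local models
}

interface StagePolicy {
  capability: string;
  complexity: "S" | "M" | "L";
  prefer: Tier[];
}

// Walk the preferred tiers in order; within a tier, take the cheapest
// model that advertises the required capability.
function resolveModel(policy: StagePolicy, models: ModelEntry[]): ModelEntry | undefined {
  for (const tier of policy.prefer) {
    const match = models
      .filter((m) =&gt; m.tier === tier &amp;&amp; m.capabilities.includes(policy.capability))
      .sort((a, b) =&gt; a.costPerMTok - b.costPerMTok)[0];
    if (match) return match;
  }
  return undefined; // nothing in models.yaml satisfies the policy
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;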

&lt;h2&gt;Eight providers, one pipeline&lt;/h2&gt;

&lt;p&gt;Anvil ships with eight LLM provider adapters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude&lt;/li&gt;
&lt;li&gt;OpenAI&lt;/li&gt;
&lt;li&gt;Gemini&lt;/li&gt;
&lt;li&gt;OpenRouter&lt;/li&gt;
&lt;li&gt;OpenCode&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;Gemini CLI&lt;/li&gt;
&lt;li&gt;Google ADK&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every adapter speaks the same streaming format, throws the same &lt;code&gt;UpstreamError&lt;/code&gt; on retryable failures, and reports cost the same way.&lt;/p&gt;
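
&lt;p&gt;Conceptually, the shared contract is something like this (a minimal sketch; &lt;code&gt;UpstreamError&lt;/code&gt; is the name used above, the other names are my assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch of the uniform adapter surface. Illustrative names only.
interface CompletionChunk {
  text: string;
  usage?: { inputTokens: number; outputTokens: number; costUsd: number };
}

// Thrown by every adapter on retryable upstream failures (429s, timeouts),
// so the pipeline can treat all providers the same way.
class UpstreamError extends Error {
  constructor(message: string, public retryable: boolean, public status?: number) {
    super(message);
  }
}

interface ProviderAdapter {
  readonly name: string; // "claude", "openai", "ollama", ...
  // Same streaming shape regardless of the vendor's wire format underneath;
  // each implementation calls the vendor's HTTP API directly with fetch().
  complete(prompt: string, model: string): AsyncIterable&lt;CompletionChunk&gt;;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;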

&lt;p&gt;What is deliberately not in there: &lt;strong&gt;vendor SDKs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every HTTP adapter is hand-rolled with &lt;code&gt;fetch()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;@anthropic-ai/sdk&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;openai&lt;/code&gt; package.&lt;/p&gt;

&lt;p&gt;No LangChain.&lt;/p&gt;

&lt;p&gt;No Vercel AI SDK.&lt;/p&gt;

&lt;p&gt;If a vendor deprecates a model or reworks its SDK tomorrow, your code keeps compiling.&lt;/p&gt;

&lt;p&gt;That is the whole point of being provider-agnostic: you cannot be agnostic if you are importing the vendor’s TypeScript types.&lt;/p&gt;

&lt;p&gt;If a model 429s mid-run, the chain-walker burns it for the rest of that run and falls through to the next entry in the same tier. Same provider or different provider — your call.&lt;/p&gt;
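
&lt;p&gt;A sketch of that fallback walk, reusing the adapter types from the sketch above (illustrative, not the actual chain-walker):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Walk a tier's chain, skipping anything already burned this run.
async function runWithFallback(
  chain: ProviderAdapter[],                 // one tier, in priority order
  burned: Set&lt;string&gt;,                      // adapters that failed this run
  call: (a: ProviderAdapter) =&gt; Promise&lt;string&gt;
): Promise&lt;string&gt; {
  for (const adapter of chain) {
    if (burned.has(adapter.name)) continue; // burned for the rest of the run
    try {
      return await call(adapter);
    } catch (err) {
      if (err instanceof UpstreamError &amp;&amp; err.retryable) {
        burned.add(adapter.name);           // e.g. a 429 mid-run
        continue;                           // fall through to the next entry
      }
      throw err;                            // non-retryable: abort the stage
    }
  }
  throw new Error("every model in the tier is exhausted");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;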

&lt;h2&gt;What ships in v0.1.0&lt;/h2&gt;

&lt;p&gt;I just cut MVP 2: &lt;strong&gt;v0.1.0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is what is in the box, framed by why each piece keeps runs cheap and quality high.&lt;/p&gt;

&lt;h3&gt;9-stage pipeline runner&lt;/h3&gt;

&lt;p&gt;Small, focused stages mean each agent call is short and can run on the cheapest model that is good enough.&lt;/p&gt;

&lt;p&gt;Chain fallback also means one rate-limited provider does not kill the run.&lt;/p&gt;

&lt;h3&gt;Hybrid retrieval&lt;/h3&gt;

&lt;p&gt;Anvil combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vector search&lt;/li&gt;
&lt;li&gt;BM25&lt;/li&gt;
&lt;li&gt;project graph retrieval&lt;/li&gt;
&lt;li&gt;cross-encoder reranking&lt;/li&gt;
&lt;li&gt;AST chunking via tree-sitter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sharp context lets cheaper models do work that would otherwise need a premium model.&lt;/p&gt;

&lt;p&gt;The build stage rarely needs a frontier model because it is already looking at the right code.&lt;/p&gt;
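
&lt;p&gt;One common way to fuse the vector and BM25 result lists is reciprocal rank fusion. I am not claiming Anvil uses exactly this, but it shows the shape of the problem:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Reciprocal rank fusion: merge several ranked result lists into one.
// k = 60 is the conventional damping constant from the RRF literature.
function rrf(lists: string[][], k = 60): string[] {
  const score = new Map&lt;string, number&gt;();
  for (const list of lists) {
    list.forEach((id, rank) =&gt; {
      score.set(id, (score.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Highest fused score first; a cross-encoder would rerank the top slice.
  return [...score.keys()].sort((a, b) =&gt; score.get(b)! - score.get(a)!);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;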

&lt;h3&gt;Long-term memory&lt;/h3&gt;

&lt;p&gt;Anvil has long-term memory with bi-temporal validity and code-fact drift detection.&lt;/p&gt;

&lt;p&gt;Agents do not re-derive what they have already learned, and stale memories get pruned automatically.&lt;/p&gt;

&lt;p&gt;No tokens burned rediscovering a pattern you fixed last week.&lt;/p&gt;
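
&lt;p&gt;Bi-temporal here means each fact tracks two timelines: when it was true in the code, and when the agent learned it. A rough sketch (field names are mine, not Anvil's schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative bi-temporal memory fact. Field names are assumptions.
interface MemoryFact {
  subject: string;   // e.g. "payments-service"
  fact: string;      // e.g. "retries Kafka publishes up to 3 times"
  validFrom: Date;   // when this became true in the codebase
  validTo?: Date;    // closed when drift detection sees the code change
  recordedAt: Date;  // when the agent learned it
}

// Drift detection closes the validity window instead of serving stale memory.
function expire(fact: MemoryFact, when: Date): MemoryFact {
  return { ...fact, validTo: when };
}

// Only facts with an open validity window get injected into prompts.
function isCurrent(fact: MemoryFact, now: Date): boolean {
  return fact.validTo === undefined || fact.validTo &gt; now;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;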

&lt;h3&gt;Convention engine&lt;/h3&gt;

&lt;p&gt;Recurring review complaints get promoted to deterministic rules.&lt;/p&gt;

&lt;p&gt;Once a mistake has been called out twice, the rule engine catches it at lint time instead of review time.&lt;/p&gt;

&lt;p&gt;That is zero LLM tokens.&lt;/p&gt;
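
&lt;p&gt;The shape of a promoted rule might look like this (a sketch under my assumptions; the real rule format is in the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative convention rule, promoted after repeated review findings.
interface ConventionRule {
  id: string;          // e.g. "handlers-must-await-promises"
  occurrences: number; // promotion threshold: called out twice
  pattern: RegExp;     // deterministic check, no LLM call involved
  message: string;
}

// Runs at lint time for zero tokens.
function lint(line: string, rules: ConventionRule[]): string[] {
  return rules
    .filter((r) =&gt; r.pattern.test(line))
    .map((r) =&gt; r.message);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;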

&lt;h3&gt;Plan validator&lt;/h3&gt;

&lt;p&gt;The plan validator catches issues before any code is written, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing tests&lt;/li&gt;
&lt;li&gt;wrong stage routing&lt;/li&gt;
&lt;li&gt;undocumented rollback strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cheapest place to fix anything is the planning stage.&lt;/p&gt;
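
&lt;p&gt;A toy version of those checks (illustrative only; the real validator inspects the actual plan artifacts):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Toy plan shape and checks. Names are assumptions, not Anvil's artifact format.
interface PlanTask { repo: string; stage: string; hasTests: boolean }
interface Plan { tasks: PlanTask[]; rollback?: string }

const KNOWN_STAGES = new Set(["clarify", "plan", "build", "test", "validate", "review", "ship"]);

function validatePlan(plan: Plan): string[] {
  const issues: string[] = [];
  if (plan.tasks.some((t) =&gt; !t.hasTests)) issues.push("missing tests");
  if (plan.tasks.some((t) =&gt; !KNOWN_STAGES.has(t.stage))) issues.push("wrong stage routing");
  if (!plan.rollback) issues.push("undocumented rollback strategy");
  return issues; // caught before a single line of code is written
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;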

&lt;h3&gt;Multi-pass PR review&lt;/h3&gt;

&lt;p&gt;The PR review system includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;evidence gating&lt;/li&gt;
&lt;li&gt;scope matching&lt;/li&gt;
&lt;li&gt;knowledge-base context&lt;/li&gt;
&lt;li&gt;dismissal filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Premium spend lands where it actually moves the quality needle.&lt;/p&gt;

&lt;h3&gt;OpenTelemetry and cost ledger&lt;/h3&gt;

&lt;p&gt;Every adapter call attaches a real &lt;code&gt;gen_ai.usage.cost&lt;/code&gt; from a vendored LiteLLM pricing table.&lt;/p&gt;

&lt;p&gt;No estimates. No surprises.&lt;/p&gt;
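
&lt;p&gt;The arithmetic is simple once the pricing table is local (a sketch; the per-token prices below are placeholders, not the vendored table's real numbers):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Cost from a static, vendored pricing table. Prices here are placeholders.
const pricing: Record&lt;string, { inPerMTok: number; outPerMTok: number }&gt; = {
  "claude-sonnet":    { inPerMTok: 3.0, outPerMTok: 15.0 }, // placeholder numbers
  "ollama/qwen3:14b": { inPerMTok: 0, outPerMTok: 0 },      // local models are free
};

// Attached to the call's span as gen_ai.usage.cost: measured, not estimated.
function costUsd(model: string, inTok: number, outTok: number): number {
  const p = pricing[model];
  if (!p) throw new Error(`no pricing entry for ${model}`);
  return (inTok * p.inPerMTok + outTok * p.outPerMTok) / 1_000_000;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;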

&lt;p&gt;Anvil is MIT-licensed, runs locally, has no hosted plan, and sends no telemetry home.&lt;/p&gt;

&lt;h2&gt;How to try it&lt;/h2&gt;

&lt;p&gt;Install the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @esankhan3/anvil-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up a project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;anvil init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the dashboard and ship:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;anvil dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard hosts both the React UI and the WebSocket backend in a single Node process.&lt;/p&gt;

&lt;p&gt;Open the browser tab and you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pipeline view&lt;/li&gt;
&lt;li&gt;run history&lt;/li&gt;
&lt;li&gt;knowledge graph&lt;/li&gt;
&lt;li&gt;memory inspector&lt;/li&gt;
&lt;li&gt;settings UI for provider keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have Ollama installed, you can run the cheap-tier stages fully offline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
ollama pull qwen3:14b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have an OpenCode Zen subscription, you do not even need a GPU. It can replace the entire local tier with hosted open-weight coding models.&lt;/p&gt;

&lt;p&gt;For the full walkthrough — prerequisites, provider keys, and troubleshooting — check the getting-started doc in the repo.&lt;/p&gt;

&lt;h2&gt;What is intentionally not in v0.1.0&lt;/h2&gt;

&lt;p&gt;A few things are intentionally not in this release.&lt;/p&gt;

&lt;h3&gt;No hosted plan&lt;/h3&gt;

&lt;p&gt;No hosted plan. No SaaS.&lt;/p&gt;

&lt;p&gt;This is by design. Hosting is a different business, and I want to keep the project unencumbered.&lt;/p&gt;

&lt;h3&gt;No vendor SDKs&lt;/h3&gt;

&lt;p&gt;Same reason as above.&lt;/p&gt;

&lt;p&gt;The goal is provider-agnostic infrastructure, not a wrapper around one vendor’s client library.&lt;/p&gt;

&lt;h3&gt;No durable execution yet&lt;/h3&gt;

&lt;p&gt;Today the pipeline is “Pattern 1”: audit log plus state-file granularity, not cross-process step replay.&lt;/p&gt;

&lt;p&gt;That is the next big thing on the roadmap.&lt;/p&gt;

&lt;h3&gt;Memory-layer vector retrieval is still in progress&lt;/h3&gt;

&lt;p&gt;Vector retrieval is stubbed today in the memory layer. Knowledge-core retrieval is fully featured.&lt;/p&gt;

&lt;p&gt;Sleep-time (idle-period) population of memory embeddings is in flight.&lt;/p&gt;

&lt;h2&gt;Who this is for&lt;/h2&gt;

&lt;p&gt;If you have ever felt the pull of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I should be able to swap this model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;only to realize SDK lock-in makes it a project-level rewrite, this might be for you.&lt;/p&gt;

&lt;p&gt;If you have watched cloud LLM costs creep into a budget that should have been local-model-cheap, this might be for you.&lt;/p&gt;

&lt;p&gt;If you maintain a multi-repo project and existing tools force you to think one repo at a time, this is definitely for you.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/esanmohammad/Anvil" rel="noopener noreferrer"&gt;https://github.com/esanmohammad/Anvil&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I would love feedback, especially on the per-stage routing model.&lt;/p&gt;

&lt;p&gt;Does it match how you would want to spend tokens? What stages would you route differently?&lt;/p&gt;

&lt;p&gt;Drop a comment.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>llm</category>
    </item>
    <item>
      <title>I was tired of re-explaining my project to Claude every session</title>
      <dc:creator>Esan Mohammad</dc:creator>
      <pubDate>Wed, 22 Apr 2026 13:02:00 +0000</pubDate>
      <link>https://dev.to/esankhan/i-was-tired-of-re-explaining-my-project-to-claude-every-session-5d38</link>
      <guid>https://dev.to/esankhan/i-was-tired-of-re-explaining-my-project-to-claude-every-session-5d38</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmklanuf37js2boi50gv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmklanuf37js2boi50gv.gif" alt="Anvil" width="760" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd start a new Claude Code session and spend the first ten minutes pasting files.&lt;/p&gt;

&lt;p&gt;"Here's the API gateway. Here's the user service. The gateway talks to users over HTTP. Users publish to this Kafka topic. The payments service consumes it. The shared types live in this package. Here's the schema."&lt;/p&gt;

&lt;p&gt;Next day. New session. Same ten minutes.&lt;/p&gt;

&lt;p&gt;Some days I'd realize halfway through that I'd already burned half my context window on orientation and hadn't gotten to the actual problem yet. That was the moment I knew this was broken.&lt;/p&gt;

&lt;p&gt;Our work project spans five repos. TypeScript, Go, Python, Kafka, shared Postgres. Real production stuff. And every AI tool I tried treated my project like a blank slate every time I opened it. The better the model, the more it noticed the gaps, the more it guessed. And when it guessed wrong, it guessed wrong confidently.&lt;/p&gt;

&lt;p&gt;I got tired of it. I spent three months of weekends building the thing I wished existed. It ended up being two things.&lt;/p&gt;

&lt;h2&gt;The first thing: a pipeline with a built-in knowledge graph&lt;/h2&gt;

&lt;p&gt;The core idea is stupid simple. Parse the project once. Build a compact architectural summary. Inject that summary into every agent call.&lt;/p&gt;

&lt;p&gt;I call it Anvil Pipeline. Here's what it actually does:&lt;/p&gt;

&lt;p&gt;It walks every repo in your project. Uses tree-sitter to extract functions, classes, interfaces, types, imports. Builds a graph where nodes are symbols and edges are the relationships between them. Then it looks across repos and auto-detects the connections between them — Kafka topic producers and consumers, HTTP routes and their callers, shared TypeScript interfaces, protobuf definitions, Docker Compose service links, environment variables that reference other services. Fourteen detection strategies in total.&lt;/p&gt;
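
&lt;p&gt;The graph itself is simple in shape (a sketch; the node and edge fields are my guesses at the structure, not the real schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative graph shapes. Field names are guesses, not the real schema.
type SymbolKind = "function" | "class" | "interface" | "type";

interface SymbolNode {
  id: string;        // e.g. "users-service:src/publish.ts#publishUserEvent"
  repo: string;
  file: string;
  name: string;
  kind: SymbolKind;  // extracted by tree-sitter
}

interface CrossRepoEdge {
  from: string;      // SymbolNode ids
  to: string;
  relation: "calls" | "imports" | "produces_topic" | "consumes_topic" | "http_route";
  detector: string;  // which of the fourteen detection strategies found it
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;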

&lt;p&gt;The output is a &lt;code&gt;GRAPH_REPORT.md&lt;/code&gt; file per repo. It's designed to be low-token — a compact architectural overview, not a dump of code. That file gets injected into every agent prompt.&lt;/p&gt;

&lt;p&gt;The first time I ran it and started a Claude session, the agent just... knew. Knew which services I had. Knew Kafka was between them. Knew which types were shared. I didn't paste anything. I saved a conservative 20,000 tokens in the first session alone, probably more.&lt;/p&gt;

&lt;p&gt;That alone would have been enough. But while I was there I built the pipeline part.&lt;/p&gt;

&lt;p&gt;Anvil Pipeline takes a feature description and runs it through eight stages: clarify the intent, produce a high-level plan, break it into per-repo requirements, write technical specs, generate task lists, build the code, validate with build and test commands, ship as pull requests. Each stage writes artifacts to disk. Each stage is resumable.&lt;/p&gt;

&lt;p&gt;The resumability part matters. If you've run long agent sessions you know: Claude's auth expires. Your budget hits its limit. Your laptop goes to sleep. The dashboard crashes. Any of these kills a naive agent loop and you lose everything.&lt;/p&gt;

&lt;p&gt;Anvil checkpoints at every stage. When auth expires, the pipeline pauses, sends a browser notification, auto-opens the re-login page, and resumes from the same spot once you're authenticated. I built this part after losing a 40-minute run to a five-second auth check. The kind of thing where you walk away to grab water and come back to nothing.&lt;/p&gt;

&lt;h2&gt;The second thing: a plug-and-play MCP server&lt;/h2&gt;

&lt;p&gt;The other half of my frustration was smaller but more constant. AI tools making up function names. Imports that don't exist. Helpers I never wrote.&lt;/p&gt;

&lt;p&gt;The fix here is also stupid simple: give the model actual tools to look up the code, instead of asking it to recall from training.&lt;/p&gt;

&lt;p&gt;Code Search MCP is a standalone MCP server. Any MCP client picks it up — Claude Code, Claude Desktop, Cursor, whatever you're using next month. One line to install in Claude Code:&lt;/p&gt;
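
&lt;p&gt;(The package name below is an assumption on my part; the repo README has the exact command.)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical invocation: check the repo README for the real package name.
claude mcp add code-search -- npx -y @esankhan3/code-search-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;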

&lt;p&gt;That's it. Claude now has eleven new tools, including the ones I actually use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;search_code&lt;/code&gt; for hybrid search — vector plus BM25 plus graph expansion plus cross-encoder reranking&lt;/li&gt;
&lt;li&gt;&lt;code&gt;find_callers&lt;/code&gt; — everywhere a function is called across all your repos&lt;/li&gt;
&lt;li&gt;&lt;code&gt;find_dependencies&lt;/code&gt; — what a function depends on&lt;/li&gt;
&lt;li&gt;&lt;code&gt;impact_analysis&lt;/code&gt; — what breaks if you change this file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;impact_analysis&lt;/code&gt; has turned out to be the one I use most. I did not expect that. Before, I'd ask Claude "what breaks if I remove this" and get a plausible-sounding guess. Now I get a real answer, because the tool walks the graph.&lt;/p&gt;
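
&lt;p&gt;Under the hood, "walks the graph" is reverse reachability over the caller edges. A minimal sketch (illustrative, not the real tool):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Impact analysis as reverse reachability: everything that (transitively)
// calls the changed symbol can break.
function impact(changed: string, callersOf: Map&lt;string, string[]&gt;): Set&lt;string&gt; {
  const hit = new Set&lt;string&gt;([changed]);
  const queue = [changed];
  while (queue.length &gt; 0) {
    const current = queue.shift()!;
    for (const caller of callersOf.get(current) ?? []) {
      if (!hit.has(caller)) {
        hit.add(caller);
        queue.push(caller);
      }
    }
  }
  hit.delete(changed); // report only the blast radius, not the symbol itself
  return hit;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;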

&lt;p&gt;The part I spent the most time on is incremental indexing. Codebases change constantly. Re-embedding the whole thing on every commit is expensive and slow. So I built four layers of skip logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git SHA at the repo level. If the repo's HEAD hasn't moved, skip entirely.&lt;/li&gt;
&lt;li&gt;Git diff at the file level. Only files that changed.&lt;/li&gt;
&lt;li&gt;SHA-256 at the content level. Files that changed but ended up with the same content get skipped too.&lt;/li&gt;
&lt;li&gt;Embedding diff at the chunk level. Only new chunks are embedded. Existing embeddings in LanceDB are preserved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A typical "I changed 2 files" reindex embeds about 5 new chunks instead of redoing the whole repo.&lt;/p&gt;

&lt;p&gt;Embeddings are provider-agnostic. Ollama is the default, which means it runs free and local out of the box. If you want better quality you can plug in Voyage, Mistral/Codestral, OpenAI, Gemini, or any OpenAI-compatible endpoint. I did not want to force anyone into a specific cloud.&lt;/p&gt;

&lt;h2&gt;Why both live in the same repo&lt;/h2&gt;

&lt;p&gt;I almost split them. They target different users. Pipeline is for teams doing feature work across repos. Code Search MCP is for any developer using any AI assistant. Different install stories, different mental models.&lt;/p&gt;

&lt;p&gt;But they share the hard parts. The tree-sitter parsing, the cross-repo edge detection, the embedding pipeline. Splitting them meant maintaining two copies of all that.&lt;/p&gt;

&lt;p&gt;So they ship together, under the Anvil umbrella. Use either. Use both. Use neither — the code is MIT, so rip out the parts you want.&lt;/p&gt;

&lt;h2&gt;What I care about, technically&lt;/h2&gt;

&lt;p&gt;Everything runs on your machine. Dashboard, pipeline, knowledge graph, indexing — all local.&lt;/p&gt;

&lt;p&gt;No telemetry. No analytics. No crash reporters. No phone-home. I checked twice.&lt;/p&gt;

&lt;p&gt;No account system. Nothing to sign up for.&lt;/p&gt;

&lt;p&gt;Your code only goes to the LLM provider you explicitly select. Anvil never proxies or stores it.&lt;/p&gt;

&lt;p&gt;MIT licensed. Every line auditable.&lt;/p&gt;

&lt;p&gt;I wanted AI tooling that didn't compromise on any of this. There is a lot of AI dev tooling out there now, but most of it sends your code through someone's SaaS. I didn't want that for my own work, and I figured other people might not want it either.&lt;/p&gt;

&lt;h2&gt;What's still weak&lt;/h2&gt;

&lt;p&gt;The 8-stage pipeline is opinionated. It works for how I work. If your workflow doesn't fit "describe, plan, code, ship" it'll feel stiff.&lt;/p&gt;

&lt;p&gt;The cross-repo detection strategies cover my stack. GraphQL federation, event sourcing, and some message queue patterns aren't handled yet. I'm collecting edge cases.&lt;/p&gt;

&lt;p&gt;The dashboard isn't pretty. I spent the time on correctness.&lt;/p&gt;

&lt;h2&gt;If you try it&lt;/h2&gt;

&lt;p&gt;The repo is at &lt;a href="https://github.com/esanmohammad/Anvil" rel="noopener noreferrer"&gt;https://github.com/esanmohammad/Anvil&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Code Search MCP is one line to install. Pipeline takes a config file — &lt;code&gt;anvil init&lt;/code&gt; walks you through it.&lt;/p&gt;

&lt;p&gt;I'd love to hear what breaks. Especially the cross-repo detection, and which of the 11 MCP tools you actually use in practice. I suspect two or three should be cut and I don't know which ones yet.&lt;/p&gt;

&lt;p&gt;This is my first time shipping a side project in public. Feedback welcome, roasts too.&lt;/p&gt;

&lt;p&gt;If you found any of this useful, the repo is on GitHub and I'd love a star — it helps other people find it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
