The pattern in AI coding tools has been bugging me for a while.
You sign up for one of them. You agree to a per-seat subscription. You get exactly one model: the one the vendor picked for you.
Underneath, the whole thing is glued to that vendor’s SDK, so even if you wanted to swap models, you couldn’t without forking. Then the next month, a better model ships from a different vendor, and you’re stuck.
That way of building locks users out of one of the most valuable properties of LLMs:
They are swappable, comparable, and increasingly cheap.
So I built Anvil — an open-source coding agent that takes a one-line feature request and ships a PR end-to-end.
The thing that makes it different is per-stage model routing.
A single pipeline run can cycle through three or four different LLMs, each picked for what it is actually good at.
The pitch in one example
Here is an actual run from yesterday:
clarify → Ollama qwen3:14b (local) ~ $0.00
plan → Claude Sonnet 4.6 (deep analysis) ~ $0.05
build → Ollama qwen3:14b (local) ~ $0.00
test → Ollama qwen3:14b (local) ~ $0.00
validate → Claude Haiku 4.5 (cheap, fast) ~ $0.01
review → Claude Sonnet 4.6 (judgment) ~ $0.08
ship → Ollama qwen3:14b (local git ops) ~ $0.00
──────────
~ $0.14
Read that top to bottom.
That is a fully reviewed, tested, PR’d feature for fourteen cents in cloud spend.
Most stages ran free on a local model. Premium models only showed up where premium models actually move the quality needle: planning the work and reviewing the result.
The routing is just config
The routing is not hardcoded. It is a YAML file:
# ~/.anvil/stage-policy.yaml
stages:
  clarify:
    capability: reasoning
    complexity: S
    prefer: [local, cheap, premium]
  plan:
    capability: reasoning
    complexity: L
    prefer: [premium]
  build:
    capability: code
    complexity: M
    prefer: [local, cheap, premium]
  review:
    capability: reasoning
    complexity: L
    prefer: [premium]
You declare:
- the capability each stage needs
- the complexity of the task
- the tier preference order
The resolver walks ~/.anvil/models.yaml and picks the cheapest model that matches.
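To make that concrete, here is a minimal sketch of what that resolution step can look like. The model-entry shape and field names below are my illustration, not Anvil's actual models.yaml schema:

// Minimal resolver sketch. ModelEntry, tier names, and costPerMTok are
// illustrative assumptions, not Anvil's actual schema.
type Tier = "local" | "cheap" | "premium";

interface ModelEntry {
  id: string;              // e.g. "ollama/qwen3:14b"
  tier: Tier;
  capabilities: string[];  // e.g. ["code", "reasoning"]
  costPerMTok: number;     // blended $/1M tokens; 0 for local models
}

interface StagePolicy {
  capability: string;
  complexity: "S" | "M" | "L";
  prefer: Tier[];
}

// Walk the preference order; within the first tier that has a match,
// take the cheapest model advertising the required capability.
function resolveModel(policy: StagePolicy, models: ModelEntry[]): ModelEntry | undefined {
  for (const tier of policy.prefer) {
    const candidates = models
      .filter((m) => m.tier === tier && m.capabilities.includes(policy.capability))
      .sort((a, b) => a.costPerMTok - b.costPerMTok);
    if (candidates.length > 0) return candidates[0];
  }
  return undefined; // no configured model satisfies the policy
}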
Eight providers, one pipeline
Anvil ships with eight LLM provider adapters:
- Claude
- OpenAI
- Gemini
- OpenRouter
- OpenCode
- Ollama
- Gemini CLI
- Google ADK
Every adapter speaks the same streaming format, throws the same UpstreamError on retryable failures, and reports cost the same way.
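Roughly, that shared contract could look like the sketch below. The type names are illustrative, not Anvil's actual interfaces:

// Illustrative sketch of a shared adapter contract.
interface StreamChunk {
  text: string;            // incremental output text
  inputTokens?: number;    // usage, reported on the final chunk
  outputTokens?: number;
  costUsd?: number;        // computed from a pricing table, not estimated
}

class UpstreamError extends Error {
  constructor(message: string, public readonly retryable: boolean) {
    super(message);
  }
}

interface ProviderAdapter {
  readonly id: string;     // e.g. "claude", "ollama"
  // Every adapter yields the same chunk shape and throws UpstreamError
  // on retryable failures (429s, 5xx, timeouts).
  complete(prompt: string, model: string): AsyncIterable<StreamChunk>;
}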
What is deliberately not in there: vendor SDKs.
Every HTTP adapter is hand-rolled with fetch().
No @anthropic-ai/sdk.
No openai package.
No LangChain.
No Vercel AI SDK.
If a model is dropped tomorrow, your code keeps compiling.
That is the whole point of being provider-agnostic: you cannot be agnostic if you are importing the vendor’s TypeScript types.
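As an illustration of what hand-rolled means in practice, here is a generic raw-fetch call against Anthropic's public Messages API — not Anvil's actual adapter code — reusing the UpstreamError shape from the sketch above:

// Generic sketch of a raw-fetch call to Anthropic's Messages API.
// No vendor SDK, no vendor types — just HTTP.
async function claudeComplete(prompt: string, model: string): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY ?? "",
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model,
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) {
    // Map retryable upstream failures to the one shared error type.
    throw new UpstreamError(`HTTP ${res.status}`, res.status === 429 || res.status >= 500);
  }
  const data = await res.json();
  // The Messages API returns content as an array of blocks.
  return data.content.map((b: { text?: string }) => b.text ?? "").join("");
}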
If a model 429s mid-run, the chain-walker burns it for the rest of that run and falls through to the next entry in the same tier. Same provider or different provider — your call.
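A minimal sketch of that chain-walking, assuming the ModelEntry and UpstreamError shapes from the sketches above (the burn-list lives for one run):

// Sketch of tier fallback over candidates already ordered by the stage's
// tier preference. Names are illustrative.
async function runWithFallback(
  candidates: ModelEntry[],
  call: (m: ModelEntry) => Promise<string>,
  burned: Set<string>,       // models that already failed during this run
): Promise<string> {
  for (const model of candidates) {
    if (burned.has(model.id)) continue;
    try {
      return await call(model);
    } catch (err) {
      if (err instanceof UpstreamError && err.retryable) {
        burned.add(model.id);  // skip this model for the rest of the run
        continue;              // fall through to the next entry
      }
      throw err;               // non-retryable failures stop the run
    }
  }
  throw new Error("all candidate models are burned or unavailable");
}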
What ships in v0.1.0
I just cut MVP 2: v0.1.0.
Here is what is in the box, framed by why each piece keeps runs cheap and quality high.
9-stage pipeline runner
Small, focused stages mean each agent call is short and can run on the cheapest model that is good enough.
Chain fallback also means one rate-limited provider does not kill the run.
Hybrid retrieval
Anvil combines:
- vector search
- BM25
- project graph retrieval
- cross-encoder reranking
- AST chunking via tree-sitter
Sharp context lets cheaper models do work that would otherwise need a premium model.
The build stage rarely needs a frontier model because it is already looking at the right code.
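As a sketch of how several retrievers can feed one ranked context, here is a generic reciprocal-rank-fusion pass. The exact fusion step is my assumption; the cross-encoder reranking then sits on top of whatever list comes out:

// Reciprocal rank fusion over multiple ranked lists (vector, BM25, graph).
// A generic illustration, not Anvil's exact scoring.
function fuseRankings(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      // Each list contributes 1 / (k + rank); documents found by several
      // retrievers accumulate a higher fused score.
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// Usage: fuseRankings([vectorHits, bm25Hits, graphHits]) yields one list
// that a cross-encoder can rerank before chunks go into the prompt.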
Long-term memory
Anvil has long-term memory with bi-temporal validity and code-fact drift detection.
Agents do not re-derive what they have already learned, and stale memories get pruned automatically.
No tokens burned rediscovering a pattern you fixed last week.
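To sketch what bi-temporal validity can mean for a code fact — the field names are illustrative, not Anvil's schema:

// A fact carries two timelines: when it was true in the codebase, and when
// the agent learned it. Drift detection closes validTo when the code moves on.
interface MemoryFact {
  id: string;
  statement: string;   // e.g. "auth middleware lives in packages/gateway"
  validFrom: Date;     // when the fact became true in the codebase
  validTo?: Date;      // set once drift detection invalidates it
  recordedAt: Date;    // when the agent learned it (transaction time)
}

// A fact is usable only if it has been recorded and has not been invalidated.
function isUsable(fact: MemoryFact, now = new Date()): boolean {
  return fact.recordedAt <= now
    && fact.validFrom <= now
    && (fact.validTo === undefined || fact.validTo > now);
}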
Convention engine
Recurring review complaints get promoted to deterministic rules.
Once a mistake has been called out twice, the rule engine catches it at lint time instead of review time.
That is zero LLM tokens.
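A sketch of what a promoted rule could look like once it is deterministic — the shape here is illustrative:

// Once a review complaint has recurred, it becomes a plain check that runs
// at lint time with zero LLM tokens. Real rules could be AST-based.
interface ConventionRule {
  id: string;
  description: string;   // e.g. "don't use console.log in src/"
  pattern: RegExp;        // deterministic check
}

function lintLine(rule: ConventionRule, file: string, line: string, lineNo: number): string | null {
  return rule.pattern.test(line)
    ? `${file}:${lineNo} violates "${rule.description}"`
    : null;
}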
Plan validator
The plan validator catches issues before any code is written, including:
- missing tests
- wrong stage routing
- undocumented rollback strategies
The cheapest place to fix anything is the planning stage.
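As a sketch, that kind of validator can be a short list of deterministic checks over the plan. The Plan shape and checks below are illustrative, not Anvil's actual validator:

// Deterministic plan checks; shapes and messages are illustrative.
interface Plan {
  steps: { stage: string; description: string }[];
  tests: string[];
  rollback?: string;
}

function validatePlan(plan: Plan, knownStages: Set<string>): string[] {
  const problems: string[] = [];
  if (plan.tests.length === 0) problems.push("plan has no tests");
  if (!plan.rollback) problems.push("plan has no documented rollback strategy");
  for (const step of plan.steps) {
    if (!knownStages.has(step.stage)) {
      problems.push(`step "${step.description}" routes to unknown stage "${step.stage}"`);
    }
  }
  return problems; // empty array means the plan is clear to build
}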
Multi-pass PR review
The PR review system includes:
- evidence gating
- scope matching
- knowledge-base context
- dismissal filtering
Premium spend lands where it actually moves the quality needle.
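To illustrate how those passes can stack, here is a sketch of a comment-filtering pipeline; the shapes and pass order are my assumptions:

// Sketch of stacked review-comment filters; names are illustrative.
interface ReviewComment {
  file: string;
  body: string;
  evidence?: string;         // quoted diff hunk or doc reference backing the claim
  inDiffScope: boolean;      // does the comment touch files actually changed?
  previouslyDismissed: boolean;
}

function filterComments(comments: ReviewComment[]): ReviewComment[] {
  return comments
    .filter((c) => c.evidence !== undefined)  // evidence gating
    .filter((c) => c.inDiffScope)             // scope matching
    .filter((c) => !c.previouslyDismissed);   // dismissal filtering
}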
OpenTelemetry and cost ledger
Every adapter call attaches a real gen_ai.usage.cost from a vendored LiteLLM pricing table.
No estimates. No surprises.
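A rough sketch of attaching that attribute with the OpenTelemetry API — the pricing argument here is a stand-in for the vendored table:

import { trace } from "@opentelemetry/api";

// Wrap an adapter call in a span and attach a computed cost.
const tracer = trace.getTracer("anvil");

async function recordUsage(
  inputTokens: number,
  outputTokens: number,
  price: { input: number; output: number },  // $/1M tokens from the pricing table
) {
  return tracer.startActiveSpan("llm.call", (span) => {
    const cost = (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
    span.setAttribute("gen_ai.usage.input_tokens", inputTokens);
    span.setAttribute("gen_ai.usage.output_tokens", outputTokens);
    span.setAttribute("gen_ai.usage.cost", cost);  // real dollars, from the table
    span.end();
    return cost;
  });
}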
Anvil is MIT-licensed, runs locally, has no hosted plan, and sends no telemetry home.
How to try it
Install the CLI:
npm install -g @esankhan3/anvil-cli
Set up a project:
anvil init
Open the dashboard and ship:
anvil dashboard
The dashboard hosts both the React UI and the WebSocket backend in a single Node process.
Open the browser tab and you get:
- pipeline view
- run history
- knowledge graph
- memory inspector
- settings UI for provider keys
If you have Ollama installed, you can run the cheap-tier stages fully offline:
brew install ollama
ollama pull qwen3:14b
If you have an OpenCode Zen subscription, you do not even need a GPU. It can replace the entire local tier with hosted open-coding models.
For the full walkthrough — prerequisites, provider keys, and troubleshooting — check the getting-started doc in the repo.
What is intentionally not in v0.1.0
A few things are intentionally not in this release.
No hosted plan
No hosted plan. No SaaS.
This is by design. Hosting is a different business, and I want to keep the project unencumbered.
No vendor SDKs
Same reason as above.
The goal is provider-agnostic infrastructure, not a wrapper around one vendor’s client library.
No durable execution yet
Today the pipeline is “Pattern 1”: audit log plus state-file granularity, not cross-process step replay.
That is the next big thing on the roadmap.
Memory-layer vector retrieval is still in progress
Vector retrieval is stubbed today in the memory layer. Knowledge-core retrieval is fully featured.
Sleeptime population of memory embeddings is in flight.
Who this is for
If you have ever thought "I should be able to swap this model," only to realize SDK lock-in makes that a project-level rewrite, this might be for you.
If you have watched cloud LLM costs creep into a budget that should have been local-model-cheap, this might be for you.
If you maintain a multi-repo project and existing tools force you to think one repo at a time, this is definitely for you.
Repo: https://github.com/esanmohammad/Anvil
I would love feedback, especially on the per-stage routing model.
Does it match how you would want to spend tokens? What stages would you route differently?
Drop a comment.