The pattern in AI coding tools has been bugging me for a while.
You sign up for one of them. You agree to a per-seat subscription. You get exactly one model: the one the vendor picked for you.
Underneath, the whole thing is glued to that vendor’s SDK, so even if you wanted to swap models, you couldn’t without forking. Then the next month, a better model ships from a different vendor, and you’re stuck.
That way of building locks users out of one of the most valuable properties of LLMs:
They are swappable, comparable, and increasingly cheap.
So I built Anvil — an open-source coding agent that takes a one-line feature request and ships a PR end-to-end.
The thing that makes it different is per-stage model routing.
A single pipeline run can cycle through three or four different LLMs, each picked for what it is actually good at.
The pitch in one example
Here is an actual run from yesterday:
clarify → Ollama qwen3:14b (local) ~ $0.00
plan → Claude Sonnet 4.6 (deep analysis) ~ $0.05
build → Ollama qwen3:14b (local) ~ $0.00
test → Ollama qwen3:14b (local) ~ $0.00
validate → Claude Haiku 4.5 (cheap, fast) ~ $0.01
review → Claude Sonnet 4.6 (judgment) ~ $0.08
ship → Ollama qwen3:14b (local git ops) ~ $0.00
──────────
~ $0.14
Read that top to bottom.
That is a fully reviewed, tested, PR’d feature for fourteen cents in cloud spend.
Most stages ran free on a local model. Premium models only showed up where premium models actually move the quality needle: planning the work and reviewing the result.
The routing is just config
The routing is not hardcoded. It is a YAML file:
# ~/.anvil/stage-policy.yaml
stages:
  clarify:
    capability: reasoning
    complexity: S
    prefer: [local, cheap, premium]
  plan:
    capability: reasoning
    complexity: L
    prefer: [premium]
  build:
    capability: code
    complexity: M
    prefer: [local, cheap, premium]
  review:
    capability: reasoning
    complexity: L
    prefer: [premium]
You declare:
- the capability each stage needs
- the complexity of the task
- the tier preference order
The resolver walks ~/.anvil/models.yaml and picks the cheapest model that matches.
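To make that concrete, here is a minimal sketch of what that resolution step can look like. The model-entry shape and field names below are my illustration, not Anvil's actual models.yaml schema:

// Minimal resolver sketch. ModelEntry, tier names, and costPerMTok are
// illustrative assumptions, not Anvil's actual schema.
type Tier = "local" | "cheap" | "premium";

interface ModelEntry {
  id: string;              // e.g. "ollama/qwen3:14b"
  tier: Tier;
  capabilities: string[];  // e.g. ["code", "reasoning"]
  costPerMTok: number;     // blended $/1M tokens; 0 for local models
}

interface StagePolicy {
  capability: string;
  complexity: "S" | "M" | "L";
  prefer: Tier[];
}

// Walk the preference order; within the first tier that has a match,
// take the cheapest model advertising the required capability.
function resolveModel(policy: StagePolicy, models: ModelEntry[]): ModelEntry | undefined {
  for (const tier of policy.prefer) {
    const candidates = models
      .filter((m) => m.tier === tier && m.capabilities.includes(policy.capability))
      .sort((a, b) => a.costPerMTok - b.costPerMTok);
    if (candidates.length > 0) return candidates[0];
  }
  return undefined; // no configured model satisfies the policy
}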
Eight providers, one pipeline
Anvil ships with eight LLM provider adapters:
- Claude
- OpenAI
- Gemini
- OpenRouter
- OpenCode
- Ollama
- Gemini CLI
- Google ADK
Every adapter speaks the same streaming format, throws the same UpstreamError on retryable failures, and reports cost the same way.
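Roughly, that shared contract could look like the sketch below. The type names are illustrative, not Anvil's actual interfaces:

// Illustrative sketch of a shared adapter contract.
interface StreamChunk {
  text: string;            // incremental output text
  inputTokens?: number;    // usage, reported on the final chunk
  outputTokens?: number;
  costUsd?: number;        // computed from a pricing table, not estimated
}

class UpstreamError extends Error {
  constructor(message: string, public readonly retryable: boolean) {
    super(message);
  }
}

interface ProviderAdapter {
  readonly id: string;     // e.g. "claude", "ollama"
  // Every adapter yields the same chunk shape and throws UpstreamError
  // on retryable failures (429s, 5xx, timeouts).
  complete(prompt: string, model: string): AsyncIterable<StreamChunk>;
}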
What is deliberately not in there: vendor SDKs.
Every HTTP adapter is hand-rolled with fetch().
No @anthropic-ai/sdk.
No openai package.
No LangChain.
No Vercel AI SDK.
If a model is dropped tomorrow, your code keeps compiling.
That is the whole point of being provider-agnostic: you cannot be agnostic if you are importing the vendor’s TypeScript types.
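As an illustration of what hand-rolled means in practice, here is a generic raw-fetch call against Anthropic's public Messages API — not Anvil's actual adapter code — reusing the UpstreamError shape from the sketch above:

// Generic sketch of a raw-fetch call to Anthropic's Messages API.
// No vendor SDK, no vendor types — just HTTP.
async function claudeComplete(prompt: string, model: string): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY ?? "",
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model,
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) {
    // Map retryable upstream failures to the one shared error type.
    throw new UpstreamError(`HTTP ${res.status}`, res.status === 429 || res.status >= 500);
  }
  const data = await res.json();
  // The Messages API returns content as an array of blocks.
  return data.content.map((b: { text?: string }) => b.text ?? "").join("");
}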
If a model 429s mid-run, the chain-walker burns it for the rest of that run and falls through to the next entry in the same tier. Same provider or different provider — your call.
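A minimal sketch of that chain-walking, assuming the ModelEntry and UpstreamError shapes from the sketches above (the burn-list lives for one run):

// Sketch of tier fallback over candidates already ordered by the stage's
// tier preference. Names are illustrative.
async function runWithFallback(
  candidates: ModelEntry[],
  call: (m: ModelEntry) => Promise<string>,
  burned: Set<string>,       // models that already failed during this run
): Promise<string> {
  for (const model of candidates) {
    if (burned.has(model.id)) continue;
    try {
      return await call(model);
    } catch (err) {
      if (err instanceof UpstreamError && err.retryable) {
        burned.add(model.id);  // skip this model for the rest of the run
        continue;              // fall through to the next entry
      }
      throw err;               // non-retryable failures stop the run
    }
  }
  throw new Error("all candidate models are burned or unavailable");
}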
What ships in v0.1.0
I just cut MVP 2: v0.1.0.
Here is what is in the box, framed by why each piece keeps runs cheap and quality high.
9-stage pipeline runner
Small, focused stages mean each agent call is short and can run on the cheapest model that is good enough.
Chain fallback also means one rate-limited provider does not kill the run.
Hybrid retrieval
Anvil combines:
- vector search
- BM25
- project graph retrieval
- cross-encoder reranking
- AST chunking via tree-sitter
Sharp context lets cheaper models do work that would otherwise need a premium model.
The build stage rarely needs a frontier model because it is already looking at the right code.
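As a sketch of how several retrievers can feed one ranked context, here is a generic reciprocal-rank-fusion pass. The exact fusion step is my assumption; the cross-encoder reranking then sits on top of whatever list comes out:

// Reciprocal rank fusion over multiple ranked lists (vector, BM25, graph).
// A generic illustration, not Anvil's exact scoring.
function fuseRankings(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      // Each list contributes 1 / (k + rank); documents found by several
      // retrievers accumulate a higher fused score.
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// Usage: fuseRankings([vectorHits, bm25Hits, graphHits]) yields one list
// that a cross-encoder can rerank before chunks go into the prompt.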
Long-term memory
Anvil has long-term memory with bi-temporal validity and code-fact drift detection.
Agents do not re-derive what they have already learned, and stale memories get pruned automatically.
No tokens burned rediscovering a pattern you fixed last week.
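To sketch what bi-temporal validity can mean for a code fact — the field names are illustrative, not Anvil's schema:

// A fact carries two timelines: when it was true in the codebase, and when
// the agent learned it. Drift detection closes validTo when the code moves on.
interface MemoryFact {
  id: string;
  statement: string;   // e.g. "auth middleware lives in packages/gateway"
  validFrom: Date;     // when the fact became true in the codebase
  validTo?: Date;      // set once drift detection invalidates it
  recordedAt: Date;    // when the agent learned it (transaction time)
}

// A fact is usable only if it has been recorded and has not been invalidated.
function isUsable(fact: MemoryFact, now = new Date()): boolean {
  return fact.recordedAt <= now
    && fact.validFrom <= now
    && (fact.validTo === undefined || fact.validTo > now);
}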
Convention engine
Recurring review complaints get promoted to deterministic rules.
Once a mistake has been called out twice, the rule engine catches it at lint time instead of review time.
That is zero LLM tokens.
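A sketch of what a promoted rule could look like once it is deterministic — the shape here is illustrative:

// Once a review complaint has recurred, it becomes a plain check that runs
// at lint time with zero LLM tokens. Real rules could be AST-based.
interface ConventionRule {
  id: string;
  description: string;   // e.g. "don't use console.log in src/"
  pattern: RegExp;        // deterministic check
}

function lintLine(rule: ConventionRule, file: string, line: string, lineNo: number): string | null {
  return rule.pattern.test(line)
    ? `${file}:${lineNo} violates "${rule.description}"`
    : null;
}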
Plan validator
The plan validator catches issues before any code is written, including:
- missing tests
- wrong stage routing
- undocumented rollback strategies
The cheapest place to fix anything is the planning stage.
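As a sketch, that kind of validator can be a short list of deterministic checks over the plan. The Plan shape and checks below are illustrative, not Anvil's actual validator:

// Deterministic plan checks; shapes and messages are illustrative.
interface Plan {
  steps: { stage: string; description: string }[];
  tests: string[];
  rollback?: string;
}

function validatePlan(plan: Plan, knownStages: Set<string>): string[] {
  const problems: string[] = [];
  if (plan.tests.length === 0) problems.push("plan has no tests");
  if (!plan.rollback) problems.push("plan has no documented rollback strategy");
  for (const step of plan.steps) {
    if (!knownStages.has(step.stage)) {
      problems.push(`step "${step.description}" routes to unknown stage "${step.stage}"`);
    }
  }
  return problems; // empty array means the plan is clear to build
}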
Multi-pass PR review
The PR review system includes:
- evidence gating
- scope matching
- knowledge-base context
- dismissal filtering
Premium spend lands where it actually moves the quality needle.
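To illustrate how those passes can stack, here is a sketch of a comment-filtering pipeline; the shapes and pass order are my assumptions:

// Sketch of stacked review-comment filters; names are illustrative.
interface ReviewComment {
  file: string;
  body: string;
  evidence?: string;         // quoted diff hunk or doc reference backing the claim
  inDiffScope: boolean;      // does the comment touch files actually changed?
  previouslyDismissed: boolean;
}

function filterComments(comments: ReviewComment[]): ReviewComment[] {
  return comments
    .filter((c) => c.evidence !== undefined)  // evidence gating
    .filter((c) => c.inDiffScope)             // scope matching
    .filter((c) => !c.previouslyDismissed);   // dismissal filtering
}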
OpenTelemetry and cost ledger
Every adapter call attaches a real gen_ai.usage.cost from a vendored LiteLLM pricing table.
No estimates. No surprises.
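A rough sketch of attaching that attribute with the OpenTelemetry API — the pricing argument here is a stand-in for the vendored table:

import { trace } from "@opentelemetry/api";

// Wrap an adapter call in a span and attach a computed cost.
const tracer = trace.getTracer("anvil");

async function recordUsage(
  inputTokens: number,
  outputTokens: number,
  price: { input: number; output: number },  // $/1M tokens from the pricing table
) {
  return tracer.startActiveSpan("llm.call", (span) => {
    const cost = (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
    span.setAttribute("gen_ai.usage.input_tokens", inputTokens);
    span.setAttribute("gen_ai.usage.output_tokens", outputTokens);
    span.setAttribute("gen_ai.usage.cost", cost);  // real dollars, from the table
    span.end();
    return cost;
  });
}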
Anvil is MIT-licensed, runs locally, has no hosted plan, and sends no telemetry home.
How to try it
Install the CLI:
npm install -g @esankhan3/anvil-cli
Set up a project:
anvil init
Open the dashboard and ship:
anvil dashboard
The dashboard hosts both the React UI and the WebSocket backend in a single Node process.
Open the browser tab and you get:
- pipeline view
- run history
- knowledge graph
- memory inspector
- settings UI for provider keys
If you have Ollama installed, you can run the cheap-tier stages fully offline:
brew install ollama
ollama pull qwen3:14b
If you have an OpenCode Zen subscription, you do not even need a GPU. It can replace the entire local tier with hosted open-coding models.
For the full walkthrough — prerequisites, provider keys, and troubleshooting — check the getting-started doc in the repo.
What is intentionally not in v0.1.0
A few things are intentionally not in this release.
No hosted plan
No hosted plan. No SaaS.
This is by design. Hosting is a different business, and I want to keep the project unencumbered.
No vendor SDKs
Same reason as above.
The goal is provider-agnostic infrastructure, not a wrapper around one vendor’s client library.
No durable execution yet
Today the pipeline is “Pattern 1”: audit log plus state-file granularity, not cross-process step replay.
That is the next big thing on the roadmap.
Memory-layer vector retrieval is still in progress
Vector retrieval is stubbed today in the memory layer. Knowledge-core retrieval is fully featured.
Sleeptime population of memory embeddings is in flight.
Who this is for
If you have ever thought "I should be able to swap this model," only to realize SDK lock-in makes that a project-level rewrite, this might be for you.
If you have watched cloud LLM costs creep into a budget that should have been local-model-cheap, this might be for you.
If you maintain a multi-repo project and existing tools force you to think one repo at a time, this is definitely for you.
Repo: https://github.com/esanmohammad/Anvil
I would love feedback, especially on the per-stage routing model.
Does it match how you would want to spend tokens? What stages would you route differently?
Drop a comment.