I am a solo founder. I do not have a lab or a team of researchers. I have a bill I pay every month, a family I am building this for, and a stubborn habit of not trusting a number until I have measured it myself.
Two things bugged me about the AI coding agents I was using. The first was the cost. Every request, easy or hard, went to the most expensive model available. The second was quieter: I did not actually know where my code was going when the agent wrote it.
So I built something to fix both, and I measured it carefully, because I would rather tell you the honest number than a flattering one. Here is what I found, including the parts that are not flattering.
The idea. Most coding requests are not hard. A cheaper or local model handles them fine. Only a small slice genuinely needs a frontier model. So instead of paying frontier prices on everything, the system routes each request to the cheapest model that can actually do the job, checks the result, and escalates to a frontier model only when the check fails. Verified answers get cached, so work you have done before comes back fast. Router, verifier, frontier backstop, cache. I am holding the engineering details, but the idea is not the hard part. Measuring it honestly is.
The numbers. Same benchmark, same harness, across all of them. HumanEval+, 164 problems.
I want to be careful with the words, because it matters. This is parity. It ties the frontier models. It does not beat them. Anyone who tells you their cheap setup beats the frontier on accuracy is either measuring wrong or selling something. What I am claiming is narrower and more useful: you can land in the same accuracy band as the frontier without paying frontier prices on every request. The cheap model alone was 84.8%. The routing and verification is what closes the gap to 94.5%.
The cost, which is the actual point. Measured over 313 production requests, from the real usage logs. Blended cost came out to about $0.002 per request, versus about $0.017 for the frontier equivalent. Roughly 8x cheaper for work in the same accuracy band. On that run, 96% of requests were served by the cheap tier, and about 3.7% escalated.
There is a second win I did not expect. Verified answers are cached, and a cache hit returns in about 0.16 seconds, which in my testing was 24 to 185x faster than solving it fresh. The more you code, the more of your work is instant. I will be honest that this compounds with real usage, and I am early, so I am watching it, not overclaiming it.
Where it is weak (you should know before you trust it).
The hardest problems still escalate to a frontier model. That is by design. On hard, multi-step problems the savings shrink, because more of them escalate. It is not a cheap model doing frontier work by magic. It is the right model for each job, with a backstop.
HumanEval+ is a benchmark. Real-world code is messier, and I am still measuring that part honestly rather than pretending the benchmark settles it.
The verifier is only as good as the checks it runs. Give it weak tests and the gate is weaker.
The privacy half. The other reason I built this: your code should not quietly leave your machine. So it runs on infrastructure you control, isolated per tenant, and wiped when you are done. That is not a feature I bolt on later. It is why I started.
Where I am. I am opening it to a first group of testers. If a cheaper, private coding agent that ties the frontier on accuracy is useful to you, I would genuinely like your help pressure-testing it. Tell me where the numbers do not hold up. I will keep publishing results as more people run it, the ugly ones included, because the honesty is the whole reason to trust a number from a guy you have never met.
Thanks for reading. If you have questions about the method, ask. I am around.

Top comments (0)