Jonathan Murray

Posted on Jun 6

We built a coding harness that beats frontier models using open ones. It's in open beta.

#ai #coding #programming #opensource

Here is the bet we made: build software memory-first, not model-first, and it will outperform.

Everyone else is racing to wrap the next model. We did the opposite. We built the memory layer first, then routing, then tool-calling, and now the recursive engine. The model became a swappable part.

Today that bet has a name: Backboard Development Studio. It starts with the R-CLI, a coding harness now in open beta.

The headline result? It beats frontier models using open ones. Keep reading: the numbers and a promo code are below.

Test it.

The beta is open. Two lines and you are running.

# macOS / Linux
curl -fsSL https://app.backboard.io/api/cli | bash

# Windows (PowerShell)
irm https://app.backboard.io/api/cli/windows | iex

Get your API key: https://app.backboard.io

Promo code: DEVTOCLI, for credit toward inference while you put it through its paces. Submit it in the Promo box in the top-right corner of the billing page.

The hypothesis, stated plainly

Model-first thinking says: pick the smartest model, prompt it well, hope it remembers.

Memory-first thinking says: give the system real persistence, real routing, real recall, and a "smaller" model will outwork a "smarter" one that forgets everything between turns.

We believed the second one. So we built it. The R-CLI is powered by our memory algorithms (the same ones that rank #1 on LoCoMo and LongMemEval) and runs on Backboard's unified API: memory, routing across 17,000+ models, RAG, and stateful threads behind one key.

Then we tested it in public. That part did not go quietly.

The numbers we're getting on internal test runs this week

92% on Terminal Bench 2.1 running Codex 5.5
70% on Terminal Bench 2.1 running GLM 5.1, an open-source model
Up to 30% fewer tokens and up to 90% lower cost than the closed harnesses
0% of your code used to train anyone's model (please read the terms and conditions of your favorite harnesses)

Read that second line again. An open model, inside our harness, posting numbers that go toe to toe with Claude Code, at a fraction of the cost.

And to be clear: we are not the cheap open-source alternative. We run the full frontier lineup too. We just happen to beat frontier results with open models like GLM 5.1 and DeepSeek V4. Same harness, your choice of brain.

Then it gets weird: /expert mode

You do not have to pick one model. You can use two in a single task.

Try /expert mode: plan with Opus 4.7, execute with DeepSeek V4.

The expensive model architects. The fast cheap one ships. The harness orchestrates the handoff. Frontier reasoning where it counts, frontier-beating cost where it does not. One command.
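For readers who want to picture the handoff, here is a minimal sketch of a generic plan-then-execute loop. It is not the R-CLI's implementation and it calls no Backboard API: `Model`, `run_expert`, `stub_planner`, and `stub_executor` are hypothetical names made up for illustration, and the stubs stand in for real model calls so the example runs offline.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption): takes a prompt, returns text.
Model = Callable[[str], str]


def run_expert(task: str, planner: Model, executor: Model) -> List[str]:
    """Plan once with the planner model, then run each step with the executor."""
    plan = planner(f"Break this task into short numbered steps:\n{task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    results: List[str] = []
    for step in steps:
        # The executor sees the task plus earlier results, so each step
        # carries context forward instead of starting cold.
        context = "\n".join(results)
        prompt = f"Task: {task}\nDone so far:\n{context}\nNow do: {step}"
        results.append(executor(prompt))
    return results


# Stub models (assumptions) so the sketch runs without any API access.
def stub_planner(prompt: str) -> str:
    return "1. write the function\n2. add a test"


def stub_executor(prompt: str) -> str:
    step = prompt.rsplit("Now do: ", 1)[1]
    return f"done: {step}"


if __name__ == "__main__":
    for line in run_expert("add a slugify helper", stub_planner, stub_executor):
        print(line)
```

The design point is that the expensive model is called once for the plan, while the cheaper model is called once per step. That split is where the cost difference would come from in a setup like this.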

Nobody else is selling that, because nobody else built memory and routing first.

A developer tried to take it apart in public

We launched. A serious builder showed up in the comments and pushed back hard.

Well-tooled local repo. His own RAG, skills, memory, a knowledge graph he had clearly invested months in. He ran the CLI and came back with a fair verdict: "kind of specific, not super helpful for a setup like mine."

Serious builder. Serious objection. The strongest one a developer can make: "I already hand-built the thing you are selling."

Then one fact flipped the whole conversation.

The fact that ended the argument

The R-CLI is stateful by default.

The persistence he was hand-building? The session-priming file he writes and re-reads every time? The weekly cron jobs auditing how often his agents drift? The pre-commit hooks keeping them on the rails?

Native on our side. Not a layer you bolt on. The default behavior. That is what memory-first actually means in your terminal.
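To make the maintenance tax concrete, here is a rough sketch of the kind of hand-rolled session-priming file described above: state is written at the end of a session and re-read at the start of the next. This is an illustration of the do-it-yourself pattern, not the R-CLI's internals. The file name, the state shape, and the function names are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def load_session(path: Path) -> dict:
    """Re-read the priming file at session start; start empty if none exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"notes": [], "turns": 0}


def save_session(path: Path, state: dict) -> None:
    """Write the state back so the next session can pick it up."""
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def run_turn(path: Path, note: str) -> dict:
    """One 'session': load prior state, record a note, persist the result."""
    state = load_session(path)
    state["notes"].append(note)
    state["turns"] += 1
    save_session(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        priming = Path(tmp) / "session.json"  # hypothetical file name
        run_turn(priming, "repo uses pytest")
        state = run_turn(priming, "prefer small commits")
        print(state["turns"], state["notes"])
```

Every piece of this (the file format, the load and save calls, and the discipline to run them on every session) is something the developer has to own and keep working.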

So for him it was never "adopt a whole new ecosystem." It was a harness swap: keep your own RAG, memory, and graph, drop the maintenance tax.

The thread went from "not for me" to "let me talk to your CLI lead." A demo call got booked. The objection did not get argued away. It got dissolved by a capability he did not know was there.

The lesson we took: the pitch was never "we are better." It was "you are doing by hand what we do by default." A developer handed us that line for free.

Four pillars. Miss one and it does not ship.

Best in the world. Performance is the bar, not a tagline. We ran benchmarks internally because we expect to be measured.

Easiest to use. One key. The same key that runs your R-CLI also unlocks memory, routing, multi-agent, and parallel tool calls, all behind one integrated surface. No stitching eight services together and praying the glue holds.

Most accessible. Frontier coding quality, your choice of model to get there. Closed, open, or mixed in one workflow. GLM 5.1 and DeepSeek V4 are the proof, not the promise.

People stay by choice. Any model, your own embeddings, modular layers, your data exportable through real endpoints. No lock-in, no theatrics, no fear-mongering. If you stay, it is because the flexibility is unrivaled.

One more thing

The R-CLI is the first surface of Backboard Development Studio. The IDE is close.

Same engine, same performance, plus multi-agent sessions, Pi extension integrations, and pre-built coding-themed skills. The CLI is the foundation. We nail the harness with the community first. Then the IDE lands on something already proven.

Come argue with us

The best feedback we have gotten so far came from someone telling us we were wrong. He pushed, we answered, he booked a call, his team switched.

So: paste the command, claim your key, redeem DEVTOCLI, and try to break it. Then drop a comment with what held up, what did not, and what your current setup still does better.

Memory-first or model-first. We made our bet. Come test it.

Backboard.io is full-stack, model-agnostic AI infrastructure. Backboard Development Studio is our recursive coding environment, stateful by default, built on the unified API.

DEV Community