DEV Community: Backboard.io

We bet against the GPU arms race. Here's what shipped.

Jonathan Murray — Thu, 02 Jul 2026 13:15:25 +0000

On July 1 we announced four things at once. The press release version is here. This is the version for people who actually build things.

The short story: while the industry spends hundreds of billions on new hardware, we took the opposite bet. Get more out of the GPUs that already exist, and keep everything inside the customer's own environment.

Here's what came out of it.

BackboardQuant: compression that doesn't lobotomize the model

Everyone who has quantized a model knows the trade: smaller and faster, but dumber. The interesting engineering problem was making that trade disappear.

BackboardQuant (yes, we call it BBQ) compresses models by up to 70% with functionally no quality loss. In our testing, compressed models retained full-precision performance while running up to 2.7x faster.

What that means in practice: one GPU doing the work of two or three. If you're serving models at scale, that's your inference bill cut by more than half without touching your architecture. It ships built into our enterprise deployments.
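
To make that arithmetic concrete, here is a minimal sketch. It assumes inference cost scales linearly with GPU-hours and that the speedup converts directly into serving capacity. The function names are illustrative, and 2.7x is the best-case figure quoted above, so real savings will depend on your workload.

```python
import math

def gpus_needed(baseline_gpus: int, speedup: float) -> int:
    """GPUs required to serve the same throughput after a given speedup."""
    return math.ceil(baseline_gpus / speedup)

def cost_reduction(speedup: float) -> float:
    """Fraction of GPU-hours saved, assuming cost is linear in GPU-hours."""
    return 1 - 1 / speedup

print(gpus_needed(10, 2.7))          # 4
print(f"{cost_reduction(2.7):.0%}")  # 63%
```

At a 2x speedup the same formula gives exactly half, which is why "more than half" only holds when the speedup lands above 2x.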

Backboard Studio: the benchmark result we didn't expect

We built Studio because frontier-lab coding tools are excellent and priced like it. The goal was matching them at a fraction of the cost.

The result on Terminal-Bench 2.1, the neutral public harness for agentic coding:

Backboard Studio running Claude Opus 4.8: 79.8%
Opus 4.8 on its own harness result: 74.6%

The harness matters more than people think. Same model, better scaffolding, five points better.

The part I care most about: running GLM 5.2, an open-source model, Studio clears 72%. That's frontier-class agentic coding with no proprietary model in the loop. Pair that with a built-in token optimizer that cuts frontier model usage by up to 30%, and "up to 90% cheaper" stops sounding like marketing.

Studio runs in the cloud or fully self-hosted, so proprietary code never leaves your infrastructure. It's available now.

Nash: one app instead of shadow AI

Every enterprise we talk to has the same problem: employees are pasting company data into whatever chat app they found. The fix isn't a ban, it's a sanctioned option that's better than what they'd find on their own.

Nash gives users thousands of models across text and image in one chat app, with memory that stays out of the model providers' hands. Consumer and enterprise, live at hellonash.ai.

Memory: still #1, and you can check

Backboard ranks first on LoCoMo and LongMemEval, the two leading independent AI memory benchmarks. We published the results and the harnesses so you can reproduce them yourself:

LoCoMo results
LongMemEval results

If you find a problem with our methodology, open an issue. We mean that.

The throughline: sovereign by design

None of these are separate products bolted together. The whole stack, API, application layer, and models, can run inside a customer's own cloud. Data never leaves. For governments, hospitals, and banks, that's the difference between "we'd love to use AI" and actually using it.

One more thing, because it matters to us: all of this was built in Nepean, Ontario, by a team made up entirely of graduates of Canadian universities, colleges, and CEGEPs. The default assumption is that this kind of work only happens in San Francisco. It doesn't.

Try it

If you write code, Backboard Studio is the fastest way to see whether any of this holds up. Run it against whatever you're using now and compare the bill.

Questions about the benchmarks, the compression numbers, or the harness? Ask in the comments. I'll answer.

R-CLI: an open-source model harness that beats Claude Code

Jonathan Murray — Tue, 09 Jun 2026 13:56:17 +0000

R-CLI by Backboard.io hit 92% on Terminal-Bench 2.1, the standard benchmark for autonomous coding agents, placing it on top of the global leaderboard using Codex 5.5. Yes, we beat OpenAI using their own model. That's great, but we're more excited about the next point.

Try it - use the DEVTOCLI promo code in your Backboard.io account.

Inside R-CLI, Backboard.io's coding harness, an open-source model just beat Claude Code at coding.

Not matched. Beat. On Terminal-Bench 2.1, Backboard.io's R-CLI running GLM 5.1 (fully open source) scores 70%. Claude Code running Opus 4.7 scores 69.7%. The open model is in front, and it costs a fraction of what Claude Code costs to run.

R-CLI is the coding surface of Backboard.io, the full-stack, model-agnostic AI platform. If you have been looking for an open-source Claude Code alternative, this is the one that does not ask you to trade away performance to get it.

The numbers

Setup	Model	Open source?	Terminal-Bench 2.1
Backboard.io R-CLI	GLM 5.1	Yes	70%
Backboard.io R-CLI	Codex 5.5	No	92%
Anthropic Claude Code	Opus 4.7	No	69.7%

Two results worth sitting with:

With an open-source model, R-CLI beats Claude Code. No proprietary model required to get past the best closed coding agent.
With a frontier model (Codex 5.5), R-CLI hits 92%. The same harness scales up when you want maximum capability.

The harness is the product, not the model

Here is the part most people get backwards. A coding agent's score is not mostly about the model. It is about the harness around it: how it plans, how it manages context, how it recovers from mistakes, how it delegates work.

R-CLI is built on Backboard.io's RLM, our recursive coding engine. Instead of stuffing one giant context window and hoping the model keeps track, the RLM breaks work into bounded child contexts and delegates off the main model. The orchestration does the heavy lifting. That is why a 70% open-source result is even possible: the harness closes the gap that the model alone would leave open.
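
The RLM's internals are not described in this post, so what follows is only a toy sketch of the general idea of recursive delegation with bounded contexts, not Backboard.io's implementation. `solve`, `MAX_CONTEXT_ITEMS`, and the `handle` callback are hypothetical names. In a real agent the callback would be a model call, and the merge step would combine child summaries.

```python
from typing import Callable, List

MAX_CONTEXT_ITEMS = 4  # hypothetical budget for a single context

def solve(items: List[str], handle: Callable[[List[str]], str]) -> str:
    """Recursively split work until every call fits in a bounded context."""
    if len(items) <= MAX_CONTEXT_ITEMS:
        # Small enough: handle it directly in one bounded child context.
        return handle(items)
    mid = len(items) // 2
    left = solve(items[:mid], handle)
    right = solve(items[mid:], handle)
    # The parent only ever sees the two child summaries, never the raw items.
    return handle([left, right])

seen_sizes = []

def fake_model(chunk: List[str]) -> str:
    """Stand-in for a model call; records how much context it was given."""
    seen_sizes.append(len(chunk))
    return f"summary of {len(chunk)} items"

result = solve([f"file_{i}.py" for i in range(10)], fake_model)
print(result, max(seen_sizes))
```

No single call receives more than `MAX_CONTEXT_ITEMS` inputs, however long the original list is. That is the property the paragraph above describes.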

Swap the model, keep the harness. Run GLM 5.1 to beat Claude Code on open source. Run Codex 5.5 to hit 92%. Same R-CLI underneath.

We destroy them on cost

Performance parity would already be a story. Cost is where it stops being close.

Run R-CLI on an open-source model and you are not paying a per-token premium to a frontier lab at all. Self-host it and the marginal cost of a coding run approaches your own compute. Even when you choose to run R-CLI on a top closed model like Codex, the recursive engine does the same work for meaningfully less than the raw harness, because it is not burning tokens on a bloated single context.

Better score. Open model. A fraction of the cost. Pick all three.

Your code never leaves your VPC

Every closed coding tool, Claude Code, Codex, Copilot, Cursor, ships your source to a vendor's API to function. For a lot of teams that is a hard stop: defence, intelligence, regulated health and finance, anyone with real IP to protect.

Because R-CLI can run entirely on an open-source model, it can also run fully on-prem and air-gapped. Frontier-level coding with zero code leaving your infrastructure. The GLM 5.1 result is the proof that on-prem is not a downgrade. You are not choosing between privacy and performance anymore.

Frontier coding, air-gapped. That combination did not exist until now.

Run it yourself

R-CLI is in alpha right now. We are bringing developers in to run it on their own repos, on the model of their choice, and report back with real numbers, not scripted praise.

Request alpha access: backboard.io

Once you are in, the flow is simple:

Install R-CLI and drop in your Backboard.io API key.
Point it at a model. Choose GLM 5.1 (open source) to reproduce the 70%, Codex 5.5 for 92%, or your own on-prem deployment.
Run it on your codebase and check the result against your own tasks.

We are not asking you to trust the leaderboard. We are asking you to run it and see.

One key, the whole stack

Here is what that Backboard.io API key actually unlocks. It does not just run R-CLI.

The same key gives R-CLI native access to the top coding models, Codex, Opus, and the rest, with nothing else to wire up. And the moment you want to build the software around your code, the same key already reaches the entire Backboard.io platform:

17,000+ models for agents, chatbots, and anything else you are building, routed behind one key.
Memory and stateful threads, so what you build remembers users across conversations.
Agentic RAG over your own documents.
Voice (text-to-speech and speech-to-text), image, web search, and parallel tool calls, all on the same key.

You are not standing up a coding tool here, then a model gateway, then a memory service, then a voice provider. You add one API key and you can build software, ship agentic AI, add voice and image, and run tool calls, all from the same place. R-CLI writes the code. Backboard.io is the stack the code runs on.

Why we built it

Backboard.io's thesis is simple: the best AI infrastructure should be the most open and the most accessible, not the most locked down. R-CLI is that thesis applied to coding: the same one-key platform that gives you memory, model routing, and RAG, now pointed at your codebase. The best score on Terminal-Bench 2.1 with an open model, runnable on your own hardware, at a cost that makes closed tools hard to justify.

The open-source Claude Code alternative is not a compromise version. It is the better one.

Request alpha access: backboard.io

We're still the only one to hit #1 on both LoCoMo and LongMemEval. Here is how to use it.

Jonathan Murray — Sat, 06 Jun 2026 20:32:52 +0000

Backboard is #1 on LoCoMo and LongMemEval, the two academic benchmarks for long-term AI memory, and we got there without changing the original guidelines. Other companies have gamed the benchmarks by using newer models with bigger context windows. This post explains why the result matters anyway, what it actually measures, and how to use the memory that earned it.

What these benchmarks test

These are not "find a fact in a wall of text" tests. They measure whether a system can build, maintain, and reason over memory across many conversations.

LoCoMo (Long-term Conversational Memory) evaluates very long-term memory over multi-session dialogues that span weeks. It tests single-session recall, cross-session reasoning, temporal reasoning, outside knowledge, and adversarial questions.

LongMemEval scores five distinct abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates (noticing when a fact about the user changes), and abstention (knowing when it does not know). Its own paper reports that commercial assistants and long-context models lose around 30% accuracy on sustained memory.

That last point is the whole story.

Why we do not advertise the score much anymore

A few honest notes about the result.

We are still #1 on the original academic benchmarks. Other systems have since posted high numbers too, but they got there by pointing a stronger model at the problem and leaning on ever-larger context windows. At the top, everyone is near the ceiling of what these tests can even measure, so the raw number stops being interesting. What is interesting is how you got there.

The difference is where the work happens. We solve memory at the message level. Memory is built as the conversation happens, fact by fact, then retrieved when relevant. We do not stuff a giant context window to paper over a memory architecture that cannot actually remember. A bigger context window is brute force, and the benchmarks already show brute force degrades on long horizons. Message-level memory is the thing the test is supposed to reward. Fixing problems with brute force isn't scalable over months or years, and it pushes users toward inflated token usage and higher spend. No thanks.
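
A rough sketch of the spend argument, under loud assumptions: word count stands in for tokens, and a naive word-overlap ranking stands in for real retrieval. None of this is Backboard's implementation. It only contrasts resending the whole history with sending a handful of stored facts.

```python
def word_count(texts):
    """Crude token proxy: whitespace-separated words."""
    return sum(len(t.split()) for t in texts)

def stuffed_prompt_size(history, question):
    """Brute force: resend every prior message along with the new question."""
    return word_count(history) + word_count([question])

def retrieved_prompt_size(facts, question, k=3):
    """Message-level: send only the k stored facts that best match the question."""
    q = set(question.lower().split())
    ranked = sorted(facts, key=lambda f: len(q & set(f.lower().split())), reverse=True)
    return word_count(ranked[:k]) + word_count([question])

facts = ["User moved from Chicago to Toronto", "User is named Sarah"]
question = "Which city did the user move to"

for turns in (100, 1000):
    history = [f"message number {i} of the conversation" for i in range(turns)]
    print(turns, stuffed_prompt_size(history, question), retrieved_prompt_size(facts, question))
```

The stuffed prompt grows with every turn, while the retrieved prompt stays the same size. That is the gap that compounds over months of conversation.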

We did not run these benchmarks ourselves. Third-party organizations did. We do not build for benchmarks and we do not tune to a leaderboard. We build the best memory product for our customers. It just happens to be the best.

One more thing, and we will not name names: several of the top open-source memory projects on GitHub run on Backboard for their paid cloud offering. The thing people benchmark against us is, in some cases, us. We think that is funny.

So we let the score sit quietly and we ship the product. Here is how to use it.

How to use it

The memory that tops these benchmarks is one parameter. Memory is stored on the assistant, so pass memory="Auto", reuse the same assistant_id, and facts carry across every conversation.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    # Conversation 1: a fact is extracted and stored at the message level
    await client.send_message(
        "My name is Sarah. I just moved from Chicago to Toronto.",
        assistant_id="your-assistant-id",
        memory="Auto",
    )

    # Conversation 2: new thread, same assistant, memory recalled
    reply = await client.send_message(
        "Where do I live now?",
        assistant_id="your-assistant-id",
        memory="Auto",
    )
    print(reply.content)  # Toronto

asyncio.run(main())

JavaScript (Node 18+)

const send = (body) =>
  fetch("https://app.backboard.io/api/threads/messages", {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  }).then((r) => r.json());

await send({
  content: "My name is Sarah. I just moved from Chicago to Toronto.",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

const reply = await send({
  content: "Where do I live now?",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

console.log(reply.content);

cURL

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "My name is Sarah. I just moved from Chicago to Toronto.", "assistant_id": "your-assistant-id", "memory": "Auto"}'

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Where do I live now?", "assistant_id": "your-assistant-id", "memory": "Auto"}'

This maps directly to what the benchmarks reward

Each benchmark ability is just a memory mode in practice:

Knowledge updates (Sarah moved cities): memory="Auto" saves the new fact and supersedes the old one, no code from you.
Multi-session reasoning: facts live on the assistant, so they cross threads automatically. Reuse the assistant_id.
Higher-accuracy retrieval: switch memory="Auto" to memory_pro="Auto" when precision matters more than cost.
Abstention: with memory="Readonly", the assistant recalls what it has and does not invent what it does not.

# Precision retrieval over everything the assistant knows
response = await client.send_message(
    "What were my project deadlines?",
    assistant_id="your-assistant-id",
    memory_pro="Auto",
)

The point

The benchmark number says we are first. The architecture says why it will hold: memory at the message level, not a context window stretched to hide a weaker design. You do not have to take the leaderboard's word for it. Set memory="Auto" and feel the difference in your own app.

Grab a key and try it: app.backboard.io

Memory docs: docs.backboard.io/concepts/memory

Chat with your documents: agentic RAG in a few lines

Jonathan Murray — Sat, 06 Jun 2026 11:44:18 +0000

RAG is not new. Chunk a document, embed the chunks, store them in a vector database, run a retrieval step on each query, then feed the results to the model. Every team building with AI has wired this up at least once. It works, and it is also a stack of moving parts you have to assemble and keep running: a parser, an embedding model, a vector store, a retriever, and the glue between them.

Backboard does nothing novel here. It just puts the whole thing behind one API. Upload a file, wait for it to index, ask a question. Retrieval happens automatically inside the same send_message call you already use. The point is not a new idea, it is that it is all unified and easy.

Three steps

Upload a document to an assistant.
Wait for it to reach indexed status.
Ask a question. RAG runs on its own.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    # 1. Create an assistant and upload a file to it
    assistant = await client.create_assistant(
        name="Docs Assistant",
        system_prompt="Answer questions using the uploaded documents.",
    )
    document = await client.upload_document_to_assistant(
        assistant.assistant_id,
        "knowledge-base.pdf",
    )

    # 2. Wait until the document is indexed
    while True:
        status = await client.get_document_status(document.document_id)
        if status.status == "indexed":
            break
        if status.status == "error":
            raise RuntimeError(status.status_message)
        await asyncio.sleep(2)

    # 3. Ask. Retrieval happens inside send_message
    reply = await client.send_message(
        "What are the key points in the document?",
        assistant_id=assistant.assistant_id,
    )
    print(reply.content)
    print(f"Files used: {reply.retrieved_files_count}")

asyncio.run(main())
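
The polling loop above waits forever if a document never finishes indexing. Here is a minimal sketch of the same loop with a deadline. `wait_until_indexed` and its `get_status` callback are hypothetical helpers, not part of the SDK. The "indexed" and "error" status strings come from the example above, and "processing" in the demo is just a placeholder for any other status.

```python
import asyncio
import time
from typing import Awaitable, Callable

async def wait_until_indexed(
    get_status: Callable[[], Awaitable[str]],
    timeout: float = 300.0,
    interval: float = 2.0,
) -> None:
    """Poll until the status is "indexed"; raise on "error" or when time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = await get_status()
        if status == "indexed":
            return
        if status == "error":
            raise RuntimeError("document indexing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document not indexed after {timeout} seconds")
        await asyncio.sleep(interval)

async def _demo():
    # Fake status source so the sketch runs without a network call.
    states = iter(["processing", "processing", "indexed"])

    async def fake_status() -> str:
        return next(states)

    await wait_until_indexed(fake_status, timeout=5, interval=0.01)
    print("indexed")

asyncio.run(_demo())
```

In real code, `get_status` would be a small wrapper that awaits `client.get_document_status(document.document_id)` and returns its `.status` field.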

JavaScript (Node 18+)

import { readFileSync } from "node:fs";

const KEY = "YOUR_API_KEY";
const base = "https://app.backboard.io/api";

// 1. Create an assistant
const assistant = await fetch(`${base}/assistants`, {
  method: "POST",
  headers: { "X-API-Key": KEY, "Content-Type": "application/json" },
  body: JSON.stringify({
    name: "Docs Assistant",
    system_prompt: "Answer questions using the uploaded documents.",
  }),
}).then((r) => r.json());

// Upload a file (multipart, no Content-Type header so fetch sets the boundary)
const form = new FormData();
form.append("file", new Blob([readFileSync("knowledge-base.pdf")]), "knowledge-base.pdf");

const document = await fetch(
  `${base}/assistants/${assistant.assistant_id}/documents`,
  { method: "POST", headers: { "X-API-Key": KEY }, body: form }
).then((r) => r.json());

// 2. Poll until indexed
let status;
do {
  await new Promise((r) => setTimeout(r, 2000));
  status = await fetch(`${base}/documents/${document.document_id}/status`, {
    headers: { "X-API-Key": KEY },
  }).then((r) => r.json());
} while (status.status !== "indexed" && status.status !== "error");
if (status.status === "error") throw new Error(status.status_message);

// 3. Ask
const reply = await fetch(`${base}/threads/messages`, {
  method: "POST",
  headers: { "X-API-Key": KEY, "Content-Type": "application/json" },
  body: JSON.stringify({
    content: "What are the key points in the document?",
    assistant_id: assistant.assistant_id,
  }),
}).then((r) => r.json());

console.log(reply.content);
console.log(`Files used: ${reply.retrieved_files_count}`);

cURL

# 1. Create an assistant
curl -X POST "https://app.backboard.io/api/assistants" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "Docs Assistant", "system_prompt": "Answer questions using the uploaded documents."}'

# Upload a file (use the assistant_id from above)
curl -X POST "https://app.backboard.io/api/assistants/ASSISTANT_ID/documents" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@knowledge-base.pdf"

# 2. Check status until it returns "indexed"
curl "https://app.backboard.io/api/documents/DOCUMENT_ID/status" \
  -H "X-API-Key: YOUR_API_KEY"

# 3. Ask a question
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "What are the key points in the document?", "assistant_id": "ASSISTANT_ID"}'

That is the whole RAG pipeline. No vector database to provision, no embedding service to call, no retriever to write. You uploaded a file and asked a question.

What "agentic" means here

You never call a retrieval endpoint. When you send a message to an assistant that has documents, Backboard decides what to fetch and pulls the relevant chunks with hybrid search (keyword and vector together), then answers. The response tells you what it used:

reply = await client.send_message(
    "Summarize section 3.",
    assistant_id=assistant.assistant_id,
)
print(reply.retrieved_files)        # filenames used as context
print(reply.retrieved_files_count)  # how many

Want deeper or shallower retrieval? Set tok_k on the assistant. It is the number of chunks pulled per query (default 10).

assistant = await client.create_assistant(
    name="Docs Assistant",
    system_prompt="Answer using the documents.",
    tok_k=20,  # retrieve more context per query
)
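
If hybrid search is new to you, here is a toy sketch of the general technique: blend a keyword score with a vector-similarity score, then keep the top k chunks. This is not how Backboard implements it. Character trigram counts stand in for a real embedding model, and `alpha` is a hypothetical weighting knob.

```python
import math
from collections import Counter

def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query words that appear verbatim in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def trigram_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: character trigram counts."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hybrid_search(query, chunks, k=10, alpha=0.5):
    """Rank chunks by a weighted blend of keyword and vector scores; keep the top k."""
    qv = trigram_vector(query)
    scored = [
        (alpha * keyword_score(query, c) + (1 - alpha) * cosine(qv, trigram_vector(c)), c)
        for c in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

chunks = [
    "Refund policy: refunds within 30 days",
    "Office hours are 9 to 5",
    "Shipping takes 3 days",
]
print(hybrid_search("refund policy", chunks, k=1))
```

Raising k returns more chunks per query, which is the same trade the chunks-per-query setting above controls: more context for the model, more tokens per call.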

Two scopes

Where you upload decides who can see the document:

Assistant scope (upload_document_to_assistant): shared across every thread under that assistant. Use it for a knowledge base, product docs, or policies that all users should query.
Thread scope (upload_document_to_thread): visible only in that one conversation. Use it for a file a single user drops into a single chat.

# This file is only visible in one conversation
await client.upload_document_to_thread(thread_id, "meeting-notes.pdf")

Same upload, same query, different reach. No extra config.
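
One way to picture the two scopes is the mental model below. It is a sketch, not SDK code. The `Assistant` class and `visible_documents` method are hypothetical, and it assumes a thread sees its own uploads plus everything uploaded to its assistant.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class Assistant:
    # Assistant scope: shared by every thread under this assistant.
    documents: Set[str] = field(default_factory=set)
    # Thread scope: keyed by thread id, visible only inside that thread.
    thread_documents: Dict[str, Set[str]] = field(default_factory=dict)

    def visible_documents(self, thread_id: str) -> Set[str]:
        """Documents a query in this thread can draw on."""
        return self.documents | self.thread_documents.get(thread_id, set())

assistant = Assistant()
assistant.documents.add("knowledge-base.pdf")
assistant.thread_documents["thread-a"] = {"meeting-notes.pdf"}

print(sorted(assistant.visible_documents("thread-a")))
print(sorted(assistant.visible_documents("thread-b")))
```

The second thread sees only the shared knowledge base, because the meeting notes never leave the conversation they were dropped into.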

Supported files

PDFs, Office files (.docx, .pptx, .xlsx), text and data (.txt, .csv, .md, .json, .xml), source code in most languages, and images. Upload it, and it is searchable once indexed.

The point

Agentic RAG is not a new trick. The win is that you do not build it. One upload, one status check, one message, and your assistant answers from your documents with retrieval handled inside the call. It is all in the same API, and that is the entire feature.

Grab a key and try it: app.backboard.io

Documents docs: docs.backboard.io/concepts/documents

Your memory, your data: read, edit, export, delete

Jonathan Murray — Fri, 05 Jun 2026 10:47:11 +0000

Most AI memory features are a black box. The assistant remembers things about your users, but you cannot see what it stored, you cannot fix a wrong fact, and you definitely cannot take the data with you if you leave. Your users' information lives in someone else's system on someone else's terms.

We want you to be here by choice, not by force.

Backboard treats memory as your data. Every memory an assistant holds is readable, editable, exportable, and deletable through the API. No black box. If you want to inspect it, you can. If you want to leave, you take it with you.

Here is the full lifecycle.

Read: see everything the assistant knows

List every memory on an assistant. Results are paginated, and omitting the page fetches all of them.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")
    assistant_id = "your-assistant-id"

    memories = await client.get_memories(assistant_id, page=1, page_size=25)
    for m in memories.memories:
        print(f"[{m.id}] {m.content}")
    print(f"Total: {memories.total_count}")

asyncio.run(main())

JavaScript (Node 18+)

const assistantId = "your-assistant-id";

const res = await fetch(
  `https://app.backboard.io/api/assistants/${assistantId}/memories?page=1&page_size=25`,
  { headers: { "X-API-Key": "YOUR_API_KEY" } }
);
const data = await res.json();

for (const m of data.memories) {
  console.log(`[${m.id}] ${m.content}`);
}
console.log(`Total: ${data.total_count}`);

cURL

curl "https://app.backboard.io/api/assistants/your-assistant-id/memories?page=1&page_size=25" \
  -H "X-API-Key: YOUR_API_KEY"
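
If an assistant holds a lot of memories and you would rather walk them a page at a time, a small async generator keeps the loop tidy. This is a sketch. `iter_all_memories` and the `fetch_page` callback are hypothetical, and the `memories` and `total_count` fields are assumed to match the JSON shape used in the examples above.

```python
import asyncio

async def iter_all_memories(fetch_page, page_size=25):
    """Yield memories page by page until total_count items have been seen."""
    page, seen = 1, 0
    while True:
        result = await fetch_page(page=page, page_size=page_size)
        for memory in result["memories"]:
            yield memory
        seen += len(result["memories"])
        # Stop on an empty page too, so a wrong total_count cannot loop forever.
        if not result["memories"] or seen >= result["total_count"]:
            break
        page += 1

async def _demo():
    # Fake backing store so the sketch runs without a network call.
    store = [{"id": i, "content": f"fact {i}"} for i in range(60)]

    async def fake_fetch(page, page_size):
        start = (page - 1) * page_size
        return {"memories": store[start:start + page_size], "total_count": len(store)}

    collected = [m async for m in iter_all_memories(fake_fetch)]
    print(len(collected))  # 60

asyncio.run(_demo())
```

In real code, `fetch_page` would wrap the list endpoint shown above and return its parsed JSON.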

You can also search semantically instead of listing everything:

results = await client.search_memories(
    assistant_id,
    query="user interface preferences",
    limit=5,
)
for m in results["memories"]:
    print(f"[{m.get('score', 0):.2f}] {m['content']}")

Edit: fix what is wrong

A user gets promoted, changes a preference, corrects a detail. Update the memory in place. You can also add a fact manually.

Python

# Add a fact yourself
await client.add_memory(
    assistant_id,
    content="User prefers dark mode in all applications",
    metadata={"source": "manual", "confidence": "high"},
)

# Update an existing memory
await client.update_memory(
    assistant_id,
    memory_id,
    content="Updated preference: user prefers system theme",
)

JavaScript (Node 18+)

// Add a fact
await fetch(`https://app.backboard.io/api/assistants/${assistantId}/memories`, {
  method: "POST",
  headers: { "X-API-Key": "YOUR_API_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({
    content: "User prefers dark mode in all applications",
    metadata: { source: "manual", confidence: "high" },
  }),
});

// Update a memory
await fetch(
  `https://app.backboard.io/api/assistants/${assistantId}/memories/${memoryId}`,
  {
    method: "PUT",
    headers: { "X-API-Key": "YOUR_API_KEY", "Content-Type": "application/json" },
    body: JSON.stringify({ content: "Updated preference: user prefers system theme" }),
  }
);

cURL

# Add
curl -X POST "https://app.backboard.io/api/assistants/your-assistant-id/memories" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "User prefers dark mode in all applications", "metadata": {"source": "manual"}}'

# Update
curl -X PUT "https://app.backboard.io/api/assistants/your-assistant-id/memories/MEMORY_ID" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Updated preference: user prefers system theme"}'

Export: take it with you

There is no special export format to learn. List every memory and write it to a file. Because the list endpoint returns all memories when you omit the page, a full export is a few lines.

Python

import json

# Fetch all memories (omit page to get everything)
all_memories = await client.get_memories(assistant_id)

export = [{"id": m.id, "content": m.content} for m in all_memories.memories]

with open("memory_export.json", "w") as f:
    json.dump(export, f, indent=2)

print(f"Exported {len(export)} memories")

JavaScript (Node 18+)

import { writeFileSync } from "node:fs";

const res = await fetch(
  `https://app.backboard.io/api/assistants/${assistantId}/memories`,
  { headers: { "X-API-Key": "YOUR_API_KEY" } }
);
const data = await res.json();

const exportData = data.memories.map((m) => ({ id: m.id, content: m.content }));
writeFileSync("memory_export.json", JSON.stringify(exportData, null, 2));

console.log(`Exported ${exportData.length} memories`);

cURL

curl "https://app.backboard.io/api/assistants/your-assistant-id/memories" \
  -H "X-API-Key: YOUR_API_KEY" \
  -o memory_export.json

Plain JSON, your fields, on your disk. That is the export.

Delete: remove one or wipe the slate

Delete a single memory, or reset every memory on an assistant. Reset removes them from both the database and the vector store and is irreversible.

Python

# Delete one memory
await client.delete_memory(assistant_id, memory_id)

# Delete all memories for an assistant (irreversible)
result = await client.reset_memories(assistant_id)
print(result["message"])

JavaScript (Node 18+)

// Delete one memory
await fetch(
  `https://app.backboard.io/api/assistants/${assistantId}/memories/${memoryId}`,
  { method: "DELETE", headers: { "X-API-Key": "YOUR_API_KEY" } }
);

cURL

# Delete one memory
curl -X DELETE "https://app.backboard.io/api/assistants/your-assistant-id/memories/MEMORY_ID" \
  -H "X-API-Key: YOUR_API_KEY"

For users exercising a delete request, that one call removes their data for good.

The point

Memory is only useful if you trust it, and you trust it when you can see it, fix it, take it, and remove it. Backboard exposes the whole lifecycle through the API: read every fact, edit the wrong ones, export the lot as plain JSON, delete on demand. The data the assistant stores about your users is yours, and you are never locked in.

Grab a key and try it: app.backboard.io

Memory API: docs.backboard.io/sdk/memory

Give your AI memory in one parameter

Jonathan Murray — Thu, 04 Jun 2026 10:39:59 +0000

By default, an LLM forgets you the moment a conversation ends. Start a new chat and it has no idea who you are, what you told it last week, or what you prefer. For a real product, that is a dealbreaker. Users expect the app to remember.

The standard fix is a memory pipeline you build yourself. Extract the important facts from each conversation. Turn them into embeddings. Store the vectors in a database. On every new message, run a similarity search, pull the relevant facts, and inject them into the prompt. That is a meaningful chunk of engineering, and you maintain it forever.

Backboard collapses that into one parameter: memory. Set it to "Auto" and your assistant remembers.

The one parameter

Memory is stored on the assistant, so pass the same assistant_id and memory="Auto". Facts the user shares in one conversation are recalled in the next.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    # Conversation 1: tell it something
    await client.send_message(
        "My name is Sarah. I work at Google as a software engineer.",
        assistant_id="your-assistant-id",
        memory="Auto",
    )

    # Conversation 2: new thread, same assistant, it remembers
    reply = await client.send_message(
        "What do you remember about me?",
        assistant_id="your-assistant-id",
        memory="Auto",
    )
    print(reply.content)  # name, employer, and role

asyncio.run(main())

JavaScript (Node 18+)

const send = (body) =>
  fetch("https://app.backboard.io/api/threads/messages", {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  }).then((r) => r.json());

await send({
  content: "My name is Sarah. I work at Google as a software engineer.",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

const reply = await send({
  content: "What do you remember about me?",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

console.log(reply.content);

cURL

# Save: memory="Auto" extracts and stores facts
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "My name is Sarah. I work at Google as a software engineer.", "assistant_id": "your-assistant-id", "memory": "Auto"}'

# Recall: same assistant, new conversation
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "What do you remember about me?", "assistant_id": "your-assistant-id", "memory": "Auto"}'

No embedding step. No vector database. No retrieval code. One parameter, and the assistant extracts the facts, stores them, and recalls them when they are relevant.

What `"Auto"` actually does

Behind that single value, Backboard runs the full loop:

Extraction pulls key facts from the conversation, like "works at Google" or "prefers dark mode."
Storage saves them to a semantic knowledge base tied to the assistant.
Retrieval finds the relevant facts on future messages and feeds them to the model.

It works across every thread under the same assistant, which is exactly the behavior you want: the user is remembered no matter which conversation they are in.
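
For intuition only, here is a toy version of that extract, store, retrieve loop. It is nothing like the real system: regular expressions stand in for model-based extraction, a dict stands in for the semantic knowledge base, and keyword hints stand in for retrieval. `ToyMemory` and everything inside it are hypothetical.

```python
import re

class ToyMemory:
    """Facts keyed by attribute, so a newer value replaces the older one."""

    PATTERNS = {
        "name": re.compile(r"my name is (\w+)", re.I),
        "employer": re.compile(r"i work at (\w+)", re.I),
        "city": re.compile(r"moved (?:from \w+ )?to (\w+)", re.I),
    }
    HINTS = {"name": ["name", "who"], "employer": ["work", "job"], "city": ["live", "city"]}

    def __init__(self):
        self.facts = {}

    def extract_and_store(self, message: str) -> None:
        """Extraction + storage: pull known fact types out of a message."""
        for key, pattern in self.PATTERNS.items():
            match = pattern.search(message)
            if match:
                self.facts[key] = match.group(1)  # supersedes any earlier value

    def retrieve(self, question: str) -> dict:
        """Retrieval: return only the facts whose hints appear in the question."""
        q = question.lower()
        return {k: v for k, v in self.facts.items() if any(h in q for h in self.HINTS[k])}

memory = ToyMemory()
memory.extract_and_store("My name is Sarah. I work at Google as a software engineer.")
memory.extract_and_store("I just moved from Chicago to Toronto.")
print(memory.retrieve("Where do I live now?"))  # {'city': 'Toronto'}
```

Because the store belongs to the `ToyMemory` instance rather than to any one conversation, every conversation that shares it sees the same facts. That mirrors memory living on the assistant rather than the thread.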

The modes

memory is a per-turn parameter. Pass it on each call where you want memory active. Pick one value:

Parameter	Value	Saves?	Retrieves?	Use it when
`memory`	`"Auto"`	Yes	Yes	The recommended default for most apps
`memory`	`"Readonly"`	No	Yes	Recall facts without writing new ones
`memory`	`"off"`	No	No	One-off requests that should not be remembered
`memory_pro`	`"Auto"`	Yes	Yes	You need higher-accuracy recall and accept higher cost
`memory_pro`	`"Readonly"`	No	Yes	High-accuracy recall only

memory and memory_pro cannot be used together in the same message. Use memory for everyday recall and memory_pro when accuracy matters more than cost.

# Higher-accuracy retrieval
response = await client.send_message(
    "What were my project deadlines?",
    assistant_id="your-assistant-id",
    memory_pro="Auto",
)
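Because the two parameters are mutually exclusive, it can help to catch a bad combination before the request leaves your code. The helper below is a hypothetical client-side check, not part of the SDK: `build_message_payload` is an invented name, and the accepted values are simply the ones listed in the table above.

```python
from typing import Optional

VALID_MEMORY = {"Auto", "Readonly", "off"}
VALID_MEMORY_PRO = {"Auto", "Readonly"}


def build_message_payload(
    content: str,
    assistant_id: Optional[str] = None,
    memory: Optional[str] = None,
    memory_pro: Optional[str] = None,
) -> dict:
    """Build a request body, rejecting combinations the modes table rules out."""
    if memory is not None and memory_pro is not None:
        raise ValueError("memory and memory_pro cannot be used in the same message")
    if memory is not None and memory not in VALID_MEMORY:
        raise ValueError(f"unknown memory mode: {memory!r}")
    if memory_pro is not None and memory_pro not in VALID_MEMORY_PRO:
        raise ValueError(f"unknown memory_pro mode: {memory_pro!r}")

    # Only include the keys the caller actually set.
    payload = {"content": content}
    if assistant_id is not None:
        payload["assistant_id"] = assistant_id
    if memory is not None:
        payload["memory"] = memory
    if memory_pro is not None:
        payload["memory_pro"] = memory_pro
    return payload


print(build_message_payload("What were my project deadlines?", "your-assistant-id", memory_pro="Auto"))
```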

When you want manual control

"Auto" covers most apps. When you need to manage memory directly, the assistant exposes full CRUD: list, add, search, update, and delete. You own the data and can export it whenever you want.

# Add a fact yourself
await client.add_memory(
    assistant_id,
    content="User prefers dark mode in all applications",
)

# Semantic search over what the assistant knows
results = await client.search_memories(
    assistant_id,
    query="user interface preferences",
    limit=5,
)
for m in results["memories"]:
    print(m["content"])

The point

Persistent memory is usually a project: an extraction pipeline, a vector store, retrieval code, and ongoing upkeep. Backboard makes it a parameter. Set memory="Auto", reuse the assistant, and your AI remembers your users across every conversation. When you need precision or control, switch to memory_pro or manage memories directly. No database required.

Grab a key and try it: app.backboard.io

Memory docs: docs.backboard.io/concepts/memory

Stop letting your hackathon API keys rot

Jonathan Murray — Wed, 03 Jun 2026 22:12:15 +0000

You've got OpenAI, Anthropic, Gemini, and xAI credits sitting in separate dashboards. Plug them all into one API and get free state management, courtesy of Dev.to and MLH.

If you've done a hackathon or run a startup, you have API credits scattered everywhere. OpenAI from one event. Anthropic from another. Gemini and xAI from your last sprint. All sitting in separate dashboards, half-used, slowly expiring.

Backboard fixes that. One API, your keys, every model.

Bring your own key

Drop in keys from any of these providers and route across all of them behind a single Backboard API:

OpenAI
Anthropic
OpenRouter
Google Gemini
xAI
Cohere
ElevenLabs

You keep your credits. You keep your rates. You stop stitching seven SDKs together. One key in front of all of them, with memory, routing, and stateful threads built in.

Free state management, on the house

Memory is the part everyone skips at a hackathon because it's a pain to build. Not here. State management on Backboard is free, brought to you by Dev.to and MLH.

Stateful threads at the message level. No vector DB to spin up, no session glue code. Your agent remembers across the whole build.

Add your keys in 30 seconds

Sign in at app.backboard.io
Go to Dashboard → API Keys
Paste your provider keys
Ship

pip install backboard-sdk
# or
npm install backboard-sdk

from backboard import Backboard

bb = Backboard(api_key="your_backboard_key")

# Your OpenAI, Anthropic, Gemini keys are already wired in.
# Memory and state come free.
thread = bb.threads.create(assistant_id="your_assistant")
bb.messages.create(thread_id=thread.id, content="Remember this for later.")

Got tokens from a hackathon? Credits your startup was granted? Put them to work instead of letting them expire.

Add your keys: app.backboard.io/dashboard/api-keys

Stateful AI without a database: threads and assistants

Jonathan Murray — Wed, 03 Jun 2026 10:17:28 +0000

LLMs are stateless. Every API call to a raw model is a blank slate. The model has no idea what was said two messages ago. So the moment you want a chatbot that remembers the conversation, you are on the hook for state.

The usual answer is infrastructure. Spin up Postgres to store message history. Add Redis to cache sessions. Stand up a vector database for long-term memory. Write the code that loads history, trims it to fit the context window, stitches it into every prompt, and saves the new turn. That is a lot of plumbing before the bot says hello.

Backboard handles state for you. Two ideas replace the whole stack: threads and assistants. You never run a database.

The model

Three things, nested:

Message is one turn. A user message in, an assistant reply out.
Thread is one conversation. An ordered list of messages. Pass its thread_id on the next call and the model sees the full history.
Assistant is the profile above the thread. It holds the name, default instructions, tools, and memory. One assistant can own many threads, for example one thread per end-user.

Memory lives on the assistant, so it is shared across every thread under it. History lives on the thread. Both persist on Backboard's side. Nothing to provision.
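One way to picture the nesting is as three small data classes. This is only a mental model, assuming nothing about how Backboard stores things: the class and field names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Thread:
    thread_id: str
    messages: list[Message] = field(default_factory=list)   # history lives here


@dataclass
class Assistant:
    assistant_id: str
    instructions: str = ""
    memories: list[str] = field(default_factory=list)        # shared by every thread
    threads: dict[str, Thread] = field(default_factory=dict)


assistant = Assistant("asst_demo", instructions="Be concise.")
assistant.threads["t1"] = Thread("t1", [Message("user", "My favorite color is blue.")])
assistant.threads["t2"] = Thread("t2")
assistant.memories.append("Favorite color: blue")

# t2 has no history of its own, yet the assistant-level memory is reachable from it.
print(len(assistant.threads["t2"].messages), assistant.memories)
```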

Threads: state within one conversation

Send a first message and a thread is created automatically. The response hands you a thread_id. Pass it back on the next call and the conversation continues with full context.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    first = await client.send_message("My favorite color is blue.")

    # Same thread: the model remembers the previous turn
    second = await client.send_message(
        "What did I just tell you?",
        thread_id=first.thread_id,
    )
    print(second.content)  # "You told me your favorite color is blue."

asyncio.run(main())

JavaScript (Node 18+)

const send = (body) =>
  fetch("https://app.backboard.io/api/threads/messages", {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  }).then((r) => r.json());

const first = await send({ content: "My favorite color is blue." });

// Same thread: pass the thread_id back
const second = await send({
  content: "What did I just tell you?",
  thread_id: first.thread_id,
});

console.log(second.content);

cURL

# First message, thread auto-created
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "My favorite color is blue."}'

# Continue: pass the thread_id from the first response
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "What did I just tell you?", "thread_id": "THREAD_ID_FROM_FIRST_RESPONSE"}'

No history table. No prompt-stitching code. The thread_id is your conversation state, and Backboard stores it. When a thread gets long enough to crowd the context window, Backboard summarizes older messages automatically so you do not have to manage trimming.

Assistants: state across conversations

A thread remembers one chat. An assistant remembers the user across many chats. Memory is stored per assistant, so to carry facts into a brand new conversation you reuse the same assistant_id and start a fresh thread.

Python

# Conversation 1
await client.send_message(
    "I'm allergic to peanuts.",
    assistant_id="your-assistant-id",
    memory="Auto",
)

# Conversation 2: new thread, same assistant, memory carries over
reply = await client.send_message(
    "Any dietary restrictions you remember?",
    assistant_id="your-assistant-id",
    memory="Auto",
)
print(reply.content)  # "You mentioned you're allergic to peanuts."

JavaScript (Node 18+)

await send({
  content: "I'm allergic to peanuts.",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

const reply = await send({
  content: "Any dietary restrictions you remember?",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

console.log(reply.content);

cURL

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "I am allergic to peanuts.", "assistant_id": "your-assistant-id", "memory": "Auto"}'

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Any dietary restrictions you remember?", "assistant_id": "your-assistant-id", "memory": "Auto"}'

This is the part that normally requires a vector database: embedding facts, storing vectors, running similarity search on every request. Here it is one parameter, memory="Auto", and the assistant owns it.

When to pass what

Goal	Pass
Keep talking in the same chat	The same `thread_id` every call
New chat, but remember the user	Omit `thread_id`, reuse the same `assistant_id` with `memory="Auto"`
One assistant, many users	One `assistant_id`, a separate `thread_id` per user

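That last row is the whole pattern for a multi-user app. One assistant defines your AI. Each user gets their own thread. Conversation history stays separated without a schema you designed, a migration you ran, or a database you babysit. Memory is the exception: it lives on the assistant and is shared across its threads, so in this layout use memory="Auto" only for facts that every user of that assistant may see.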
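On the application side, the multi-user pattern comes down to remembering which thread_id belongs to which user. The sketch below assumes a hypothetical `send(content, thread_id)` callable that returns a dict with `content` and `thread_id`; `fake_send` stands in for the real API call so the example runs offline, and the in-memory dict is for illustration only.

```python
from typing import Callable, Optional

# Hypothetical signature: send(content, thread_id) -> {"content": ..., "thread_id": ...}
SendFn = Callable[[str, Optional[str]], dict]


class UserThreads:
    """Keeps one thread_id per end user so each user continues their own chat."""

    def __init__(self, send: SendFn) -> None:
        self._send = send
        self._threads: dict[str, str] = {}

    def chat(self, user_id: str, content: str) -> str:
        reply = self._send(content, self._threads.get(user_id))
        # The first call for a user has no thread_id; remember the one returned.
        self._threads[user_id] = reply["thread_id"]
        return reply["content"]

    def thread_for(self, user_id: str) -> Optional[str]:
        return self._threads.get(user_id)


def fake_send(content: str, thread_id: Optional[str]) -> dict:
    # Stand-in for the API: creates a thread id when none is passed.
    fake_send.counter += 1
    return {"content": f"echo: {content}", "thread_id": thread_id or f"thread-{fake_send.counter}"}


fake_send.counter = 0

app = UserThreads(fake_send)
app.chat("alice", "Hi")
app.chat("bob", "Hello")
app.chat("alice", "Still me")
print(app.thread_for("alice"), app.thread_for("bob"))
```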

The point

Stateless models force you to build a state layer. Backboard makes that layer part of the API. Threads hold the conversation. Assistants hold the profile and the memory. Both persist server-side. You ship a stateful, multi-user AI app and never write a line of database code.

Grab a key and try it: app.backboard.io

Architecture in full: docs.backboard.io/concepts/architecture

Send your first AI message in one API call

Jonathan Murray — Tue, 02 Jun 2026 16:41:37 +0000

Most AI tutorials start with a setup checklist. Pick a model provider. Create an account. Wire up a vector database for memory. Stand up a server to hold conversation state. Glue it all together. Then, finally, you send a message.

Backboard skips all of that. One API call sends your first message. A thread, an assistant, memory, and routing across thousands of models are already running behind that single call. You do not assemble the stack. It is the stack.

Here is the whole thing.

Step 1: Get a key

Sign up at app.backboard.io, go to Settings then API Keys, and copy your key. New accounts get $5 in free credits for 30 days. No credit card.

That is the only setup. Keep your key server-side, never in frontend or mobile code.

Step 2: Send the message

Pick your language. Same call in all three.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    response = await client.send_message(
        "Hello! Tell me a fun fact about space."
    )

    print("Reply:", response.content)
    print("Thread ID:", response.thread_id)
    print("Assistant ID:", response.assistant_id)

asyncio.run(main())

JavaScript (Node 18+)

No install needed. Just fetch.

const response = await fetch("https://app.backboard.io/api/threads/messages", {
  method: "POST",
  headers: {
    "X-API-Key": "YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    content: "Hello! Tell me a fun fact about space.",
  }),
});

const result = await response.json();

console.log("Reply:", result.content);
console.log("Thread ID:", result.thread_id);
console.log("Assistant ID:", result.assistant_id);

cURL

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello! Tell me a fun fact about space."}'

Run it. You get a reply. That is your first AI message.

What just happened

You sent one string. Backboard did the rest:

Created a thread. The thread_id in the response is a live conversation. Send the next message with it and the model remembers what was said.
Created an assistant. The assistant_id is a reusable profile. Attach memory, documents, and tools to it later without changing your call.
Picked a model. No provider config required. It defaulted to openai / gpt-4o. You can change that with two parameters, shown below.

No vector DB. No state server. No provider SDK. One call.

Continue the conversation

Pass the thread_id back. The model now has context.

follow_up = await client.send_message(
    "Make it shorter.",
    thread_id=response.thread_id,
)
print(follow_up.content)  # knows you mean the space fact

That is stateful conversation with zero extra infrastructure.

Swap the model with two parameters

One key gives you thousands of models. Change the provider and model per message. Same thread, same code.

response = await client.send_message(
    "Explain quantum computing simply.",
    llm_provider="anthropic",
    model_name="claude-sonnet-4-20250514",
)

Want a different model next turn? Change two strings. You are never locked to one provider.

Turn on memory

Add memory="Auto" and the assistant remembers facts across conversations, not just within one thread.

# Thread 1: tell it something
await client.send_message(
    "My name is Sarah and I prefer dark mode.",
    assistant_id="your-assistant-id",
    memory="Auto",
)

# Thread 2, same assistant: it remembers
reply = await client.send_message(
    "What do you remember about me?",
    assistant_id="your-assistant-id",
    memory="Auto",
)
print(reply.content)  # "Your name is Sarah and you prefer dark mode."

Persistent memory, one parameter. No database to provision.

The point

The first call is one line because the platform is full-stack. Memory, model routing, RAG, and stateful threads sit behind a single key. You start with a working AI message, then turn on capabilities as you need them by adding parameters, not services.

Full docs: docs.backboard.io

The Hidden Challenge of Multi-LLM Context Management

Jonathan Murray — Fri, 24 Apr 2026 20:19:51 +0000

Why token counting isn't a solved problem when building across providers

Building AI products that span multiple LLM providers involves a challenge most developers don't anticipate until they hit it: context windows are not interoperable.

On the surface, managing context in a multi-LLM system seems straightforward. You track how long conversations get, trim when needed, and move on. In practice, it's considerably more complex — and if you're routing requests across providers like OpenAI, Anthropic, Google, Cohere, or xAI, there's a fundamental mismatch that can break your product in subtle ways.

The Tokenization Problem

Every major LLM provider uses its own tokenizer. These tokenizers don't agree. The same block of text produces different token counts depending on which model processes it. The difference is often 10–20%, sometimes more.

What this means in practice: a conversation that fits comfortably in one model's context window may silently overflow another's. A prompt routed to OpenAI might count as 1,200 tokens; the same prompt routed to Claude might count as 1,450. That gap matters.

Where It Breaks

The failure modes tend to show up at the boundaries. When you switch providers mid-conversation, the new model has to ingest the full prior context. If your context management layer was calibrated to the previous model's tokenizer, the new model may see a context that's already at or over the limit — before it's even responded to anything new.

This produces three common failure patterns:

Unexpected context-window overflow: the conversation that worked before now breaches the limit
Inconsistent truncation: different models truncate at different points, changing what prior context the model actually sees
Unpredictable routing failures: the numbers your system used don't match the numbers the model actually used

Why Simple Estimates Fail

The instinct is to maintain a single "token estimate" with a generous safety margin. The problem is that the margin you'd need varies by provider, model version, and content type (code tokenizes differently than prose). A margin calibrated for one use case will either be too tight for another, causing failures, or too generous, causing unnecessary truncation that degrades conversation quality.
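A small worked example shows why no single margin is right. All numbers here are invented for illustration: one estimate of 1,200 tokens, a hypothetical 1,400-token limit, and two target models whose own tokenizers would report 1,210 and 1,450 for the same text.

```python
LIMIT = 1_400        # hypothetical context limit shared by both target models
ESTIMATE = 1_200     # count from the tokenizer the system was calibrated on
ACTUAL = {"model_a": 1_210, "model_b": 1_450}   # hypothetical per-model counts


def padded_fits(estimate: int, margin: float, limit: int) -> bool:
    """True if the estimate, padded by the safety margin, is within the limit."""
    return estimate * (1 + margin) <= limit


def outcome(margin: float, actual: int) -> str:
    """Compare what the padded estimate predicts with what the model would see."""
    predicted_ok = padded_fits(ESTIMATE, margin, LIMIT)
    really_ok = actual <= LIMIT
    if predicted_ok and not really_ok:
        return "overflow"
    if not predicted_ok and really_ok:
        return "needless truncation"
    return "ok"


for margin in (0.10, 0.25):
    for model, actual in ACTUAL.items():
        print(f"margin={margin:.0%} {model}: {outcome(margin, actual)}")
```

With a 10% margin the request to model_b overflows; with a 25% margin the request to model_a gets trimmed even though it would have fit. Neither setting is correct for both targets.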

The Solution: Provider-Aware Token Counting

A robust multi-LLM context management layer makes token counting provider-specific. Rather than maintaining a single estimate, it measures each prompt the way the actual target model will measure it. The routing layer uses these per-provider measurements to make decisions before requests are sent.

This lets the system stay ahead of context limits: it knows when a conversation is approaching an edge, trims or compresses history calibrated to the specific model receiving the request, and avoids the pricing and failure surprises that come from miscounted tokens.
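A minimal sketch of the idea, not our production component: each provider gets its own counting function, and trimming uses the counter for whichever model will receive the request. The provider names, limits, and both counters are placeholders. Real code would call each provider's own tokenizer or token-counting endpoint instead of counting words.

```python
from typing import Callable


def count_sparse(text: str) -> int:
    # Placeholder counter: one token per whitespace-separated word.
    return len(text.split())


def count_dense(text: str) -> int:
    # Placeholder counter that reports roughly 20% more tokens for the same text.
    words = len(text.split())
    return words + words // 5 + 1


COUNTERS: dict[str, Callable[[str], int]] = {
    "provider_a": count_sparse,
    "provider_b": count_dense,
}
LIMITS = {"provider_a": 14, "provider_b": 14}   # hypothetical context limits


def trim_for(provider: str, history: list[str]) -> list[str]:
    """Drop the oldest messages until the history fits the target's own count."""
    count = COUNTERS[provider]
    kept = list(history)
    while len(kept) > 1 and sum(count(m) for m in kept) > LIMITS[provider]:
        kept.pop(0)
    return kept


history = ["my favorite color is blue", "what did I just tell you", "make it shorter"]
for provider in COUNTERS:
    print(provider, len(trim_for(provider, history)))
```

The same three messages fit provider_a untouched but lose their oldest turn on provider_b, which is the decision a single shared estimate cannot make correctly for both.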

The end result is what users should see: a smooth conversation experience, regardless of which model is serving it. The complexity of "every model speaks a slightly different token language" stays inside the infrastructure layer, invisible to the people using the product.

This is the approach we've taken in our adaptive context window management component, and it's become a foundational part of how we think about multi-LLM routing more broadly.

Rob Imbeault
Apr 17, 2026

Why LLM Reasoning Is Breaking AI Infrastructure (And How to Fix It)

Jonathan Murray — Fri, 24 Apr 2026 20:18:05 +0000

If you've tried building anything serious on top of large language models (LLMs) recently, you've probably run into this:

"Thinking" is supposed to make models better. In practice, it makes your infrastructure worse.

This isn't a model problem—it's an infrastructure and abstraction problem. And it's getting worse as teams scale across multiple AI providers.

Let's break down exactly where things go wrong.

The Illusion of "Just Turn On Reasoning"

At a high level, LLM reasoning sounds straightforward:

Turn reasoning on → better answers
Turn reasoning off → cheaper, faster

But in production systems, reality looks very different.

What actually happens:

Models sometimes don't reason even when explicitly prompted
Models over-reason on trivial queries, wasting tokens
Behavior is inconsistent across providers and model versions

Instead of predictable performance, you get variability.

You're no longer just building an AI product—you're debugging model behavior at runtime.

The Fragmentation Problem in LLM Reasoning

One of the biggest hidden challenges in AI infrastructure today is fragmentation.

Every major provider has implemented reasoning differently:

OpenAI → reasoning effort levels (low, medium, high)
Anthropic (Claude) → explicit reasoning token budgets
Google AI (Gemini) → hybrid approaches depending on model version

That's just input configuration.

Output fragmentation is even worse:

Some models return separate reasoning blocks
Others provide summarized reasoning
Some mix reasoning directly into standard responses

There is:

No shared schema
No standardized interface
No predictable structure

What this means for developers:

If you're building a multi-model AI system, you now need:

Input normalization layers
Output parsing logic per provider
Custom handling for reasoning formats

At this point, "simple API routing" becomes complex middleware engineering.
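To give a sense of what that middleware looks like, here is a deliberately small sketch of both halves: translating one unified setting into two input conventions, and collapsing three output shapes into one. The style names, dictionary keys, and the effort-to-budget table are all placeholders invented for this example, not real provider field names.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical effort-to-budget table; real values would be tuned per model.
BUDGET_FOR_EFFORT = {"low": 1_024, "medium": 4_096, "high": 16_384}


@dataclass(frozen=True)
class Reasoning:
    effort: str  # "low", "medium", or "high"


def to_provider_params(style: str, reasoning: Reasoning) -> dict:
    """Translate one unified setting into a provider-shaped request fragment."""
    if reasoning.effort not in BUDGET_FOR_EFFORT:
        raise ValueError(f"unknown effort: {reasoning.effort!r}")
    if style == "effort_levels":      # providers that take low / medium / high
        return {"effort_level": reasoning.effort}
    if style == "token_budget":       # providers that take an explicit token budget
        return {"max_reasoning_tokens": BUDGET_FOR_EFFORT[reasoning.effort]}
    raise ValueError(f"unknown style: {style!r}")


def normalize_output(raw: dict) -> dict:
    """Collapse differently shaped responses into {"reasoning", "answer"}."""
    reasoning: Optional[str] = None
    if "reasoning_block" in raw:          # separate reasoning block
        reasoning = raw["reasoning_block"]
    elif "reasoning_summary" in raw:      # summarized reasoning
        reasoning = raw["reasoning_summary"]
    # Otherwise reasoning is mixed into the text or absent, so it stays None.
    return {"reasoning": reasoning, "answer": raw["text"]}


print(to_provider_params("effort_levels", Reasoning("medium")))
print(to_provider_params("token_budget", Reasoning("medium")))
print(normalize_output({"text": "42", "reasoning_summary": "Added the numbers."}))
```

Even this toy version needs a branch per convention on the way in and another on the way out, and each new provider or model version adds more.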

AI Cost Optimization Becomes a Moving Target

Reasoning doesn't just impact performance—it breaks cost predictability.

Billing inconsistencies across providers:

Some expose reasoning tokens explicitly
Others bundle them into total usage
Some introduce custom billing fields

Now you're not just optimizing latency or quality.

You're building a cost translation layer across providers.

This adds complexity to:

Forecasting
Budget control
Scaling decisions
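A cost translation layer, in its simplest form, maps each provider's usage report onto one record before pricing it. The three input shapes below mirror the three billing styles listed above, but the key names, the prices, and the assumption that reasoning tokens are billed at the output rate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int      # visible output only, when the provider separates it
    reasoning_tokens: int   # 0 when the provider does not break it out


def normalize_usage(raw: dict) -> Usage:
    """Map three hypothetical usage shapes onto one record."""
    if "reasoning_tokens" in raw:                 # reasoning exposed explicitly
        return Usage(raw["input"], raw["output"], raw["reasoning_tokens"])
    if "billing" in raw:                          # custom billing field
        b = raw["billing"]
        return Usage(b["prompt"], b["completion"], b.get("thinking", 0))
    # Bundled: only a total is reported, so reasoning cannot be separated out.
    return Usage(raw["input"], raw["total"] - raw["input"], 0)


def cost_usd(usage: Usage, in_price: float, out_price: float) -> float:
    """Prices are per million tokens; reasoning is assumed billed as output."""
    billable_out = usage.output_tokens + usage.reasoning_tokens
    return (usage.input_tokens * in_price + billable_out * out_price) / 1_000_000


explicit = normalize_usage({"input": 1000, "output": 200, "reasoning_tokens": 300})
bundled = normalize_usage({"input": 1000, "total": 1500})
print(explicit, cost_usd(explicit, 2.0, 10.0))
print(bundled, cost_usd(bundled, 2.0, 10.0))
```

Both records price out the same, but only the first one tells you how much of the bill was reasoning, which is the gap that makes forecasting hard.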

Why Multi-Model Switching Breaks Systems

In theory, switching between LLM providers should improve reliability and cost efficiency.

In practice, it introduces system instability.

Even within a single provider:

Different endpoints behave differently
Input formats change
Output schemas change
Reasoning structures vary

Now add state management:

What context should persist?
How do you maintain reasoning continuity?
How do you prevent token explosion?

The result:

Most teams either:

Abandon portability, or
Build fragile adapter layers that constantly break

The Real Problem: Lack of Abstraction

After working through these challenges, one thing becomes clear:

The core issue isn't reasoning—it's the absence of a unified abstraction layer.

Developers today are forced to:

Learn multiple reasoning systems
Normalize different response formats
Track multiple billing models
Rebuild state handling for each provider

This is not scalable.

What "Unified LLM Reasoning" Should Look Like

To make AI infrastructure truly production-ready, reasoning needs to be abstracted.

A unified system should provide:

A single reasoning parameter
Direct control over reasoning budgets
Consistent behavior across models
Standardized input/output formats

The impact:

Developers can:

Tune reasoning without provider lock-in
Switch models without rewriting logic
Maintain consistent state across systems

And most importantly:

Stop thinking about thinking.

The Uncomfortable Truth About Scaling AI Systems

If you're working with LLMs and haven't encountered these issues yet—you will.

Complexity compounds rapidly when you:

Add a second provider
Enable reasoning features
Optimize for cost
Maintain persistent context

At that point:

You're no longer building your product. You're building AI infrastructure.

The Future of AI Platforms

Short-term impact of a unified abstraction layer:

Reduced engineering time (weeks to months saved)
Lower debugging overhead
More predictable cost structures

Long-term shift:

The winning AI platforms won't be defined by model quality alone.

They will be defined by:

Interoperability (model interchangeability)
Statefulness (persistent, portable context)

That's the real unlock in the next phase of AI development.

Quick Audit for Your AI Stack

If you're currently integrating multiple LLM providers, ask yourself:

How many reasoning formats are you handling?
How portable is your state management layer?
How predictable are your AI costs?

If those answers aren't clean and consistent:

You're already paying the infrastructure tax.

Rob Imbeault
Apr 20, 2026
