Backboard is #1 on LoCoMo and LongMemEval, the two academic benchmarks for long-term AI memory, and it got there without changing the original guidelines. Other companies have gamed the results by using newer models with bigger context windows. This post explains why the result matters anyway, what it actually measures, and how to use the memory that earned it.
What these benchmarks test
These are not "find a fact in a wall of text" tests. They measure whether a system can build, maintain, and reason over memory across many conversations.
LoCoMo (Long-term Conversational Memory) evaluates very long-term memory over multi-session dialogues that span weeks. It tests single-session recall, cross-session reasoning, temporal reasoning, outside knowledge, and adversarial questions.
LongMemEval scores five distinct abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates (noticing when a fact about the user changes), and abstention (knowing when it does not know). Its own paper reports that commercial assistants and long-context models lose around 30% accuracy on sustained memory.
That last point is the whole story.
Why we do not advertise the score much anymore
A few honest notes about the result.
We are still #1 on the original academic benchmarks. Other systems have since posted high numbers too, but they got there by pointing a stronger model at the problem and leaning on ever-larger context windows. At the top, everyone is near the ceiling of what these tests can even measure, so the raw number stops being interesting. What is interesting is how you got there.
The difference is where the work happens. We solve memory at the message level. Memory is built as the conversation happens, fact by fact, then retrieved when relevant. We do not stuff a giant context window to paper over a memory architecture that cannot actually remember. A bigger context window is brute force, and the benchmarks already show brute force degrades on long horizons. Message-level memory is the thing the test is supposed to reward. Brute force does not scale over months or years, and it pushes users toward inflated token usage and higher spend. No thanks.
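To make the contrast concrete, here is a minimal toy sketch of the idea, not Backboard's implementation. The `MessageLevelMemory` class, its regex-based extractor, and the filler conversation are all hypothetical stand-ins; a real system would extract facts with a model. It only shows the shape of the approach: ingest each message as it arrives, keep short facts, and build the prompt from the relevant facts instead of the full history.

```python
import re
from dataclasses import dataclass, field


@dataclass
class MessageLevelMemory:
    """Toy store (hypothetical): short facts built message by message."""
    facts: dict = field(default_factory=dict)

    def ingest(self, message: str) -> None:
        # Hypothetical extractor: regexes stand in for a real fact extractor.
        name = re.search(r"My name is (\w+)", message)
        if name:
            self.facts["name"] = name.group(1)
        city = re.search(r"moved from \w+ to (\w+)", message)
        if city:
            self.facts["city"] = city.group(1)

    def retrieve(self, question: str) -> list[str]:
        # Return only the facts relevant to the question, not the whole history.
        q = question.lower()
        wanted = []
        if "live" in q or "city" in q:
            wanted.append("city")
        if "name" in q:
            wanted.append("name")
        return [f"{key}: {self.facts[key]}" for key in wanted if key in self.facts]


# One useful message followed by a long run of irrelevant ones.
history = ["My name is Sarah. I just moved from Chicago to Toronto."] + [
    f"Filler message number {i} about nothing in particular." for i in range(200)
]

memory = MessageLevelMemory()
for msg in history:
    memory.ingest(msg)

question = "Where do I live now?"
stuffed_prompt = "\n".join(history + [question])                   # brute force
memory_prompt = "\n".join(memory.retrieve(question) + [question])  # message level

# Word counts as a rough proxy for tokens sent to the model.
print(len(stuffed_prompt.split()), len(memory_prompt.split()))
```

The stuffed prompt grows with every message in the history, while the memory prompt stays roughly the size of the facts that matter for the question.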
We did not run these benchmarks ourselves. Third-party organizations did. We do not build for benchmarks and we do not tune to a leaderboard. We build the best memory product we can for our customers. It just happens to top the leaderboard too.
One more thing, and we will not name names: several of the top open-source memory projects on GitHub run on Backboard for their paid cloud offering. The thing people benchmark against us is, in some cases, us. We think that is funny.
So we let the score sit quietly and we ship the product. Here is how to use it.
How to use it
The memory that tops these benchmarks is one parameter. Pass memory="Auto" and reuse the same assistant_id: facts are stored on the assistant and carry across every conversation.
Python
pip install backboard-sdk
import asyncio

from backboard import BackboardClient


async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    # Conversation 1: a fact is extracted and stored at the message level
    await client.send_message(
        "My name is Sarah. I just moved from Chicago to Toronto.",
        assistant_id="your-assistant-id",
        memory="Auto",
    )

    # Conversation 2: new thread, same assistant, memory recalled
    reply = await client.send_message(
        "Where do I live now?",
        assistant_id="your-assistant-id",
        memory="Auto",
    )
    print(reply.content)  # Toronto


asyncio.run(main())
JavaScript (Node 18+)
// Run as an ES module (for example a .mjs file) so top-level await works.
const send = (body) =>
  fetch("https://app.backboard.io/api/threads/messages", {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  }).then((r) => r.json());

await send({
  content: "My name is Sarah. I just moved from Chicago to Toronto.",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

const reply = await send({
  content: "Where do I live now?",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});
console.log(reply.content);
cURL
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "My name is Sarah. I just moved from Chicago to Toronto.", "assistant_id": "your-assistant-id", "memory": "Auto"}'

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Where do I live now?", "assistant_id": "your-assistant-id", "memory": "Auto"}'
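If you would rather not install the SDK, the same REST call can be sketched with the Python standard library alone. The endpoint, headers, and body fields are taken from the cURL example above. The helper names `build_message_request` and `send_message` are hypothetical, and the assumption that the reply is JSON comes from the JavaScript example, which parses it with `r.json()`.

```python
import json
import urllib.request

# Endpoint copied from the cURL example above.
API_URL = "https://app.backboard.io/api/threads/messages"


def build_message_request(content: str, assistant_id: str, api_key: str,
                          memory: str = "Auto") -> urllib.request.Request:
    """Build (but do not send) the POST request from the cURL example."""
    body = {"content": content, "assistant_id": assistant_id, "memory": memory}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def send_message(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the reply, assumed to be JSON.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request is side-effect free; nothing is sent here.
req = build_message_request(
    "Where do I live now?", "your-assistant-id", "YOUR_API_KEY"
)
print(req.get_method(), req.full_url)
```

Call `send_message(req)` once a real key and assistant ID are in place. Keeping the build step separate from the send step makes the payload easy to inspect before anything goes over the network.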
This maps directly to what the benchmarks reward
Each benchmark ability is just a memory mode in practice:
- Knowledge updates (Sarah moved cities): memory="Auto" saves the new fact and supersedes the old one, no code from you.
- Multi-session reasoning: facts live on the assistant, so they cross threads automatically. Reuse the assistant_id.
- Higher-accuracy retrieval: switch memory="Auto" to memory_pro="Auto" when precision matters more than cost.
- Abstention: with memory in Readonly, the assistant recalls what it has and does not invent what it does not.
# Precision retrieval over everything the assistant knows
# (run inside an async function, such as main() in the Python example)
response = await client.send_message(
    "What were my project deadlines?",
    assistant_id="your-assistant-id",
    memory_pro="Auto",
)
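For intuition about the knowledge-update and abstention items in the list above, here is a small toy model of the two behaviours. It is a sketch under stated assumptions and says nothing about how Backboard works internally: `Fact`, `FactStore`, and the `turn` counter are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fact:
    value: str
    turn: int  # position in the conversation when the fact was stated


@dataclass
class FactStore:
    """Toy store (hypothetical) illustrating supersede and abstain."""
    current: dict = field(default_factory=dict)
    superseded: list = field(default_factory=list)

    def update(self, key: str, value: str, turn: int) -> None:
        # Knowledge update: a newer, different value replaces the older one,
        # and the older one is kept only as history.
        old = self.current.get(key)
        if old is not None and old.value != value:
            self.superseded.append((key, old))
        self.current[key] = Fact(value, turn)

    def answer(self, key: str) -> Optional[str]:
        # Abstention: return None instead of guessing when nothing is stored.
        fact = self.current.get(key)
        return fact.value if fact else None


store = FactStore()
store.update("city", "Chicago", turn=1)
store.update("city", "Toronto", turn=2)  # Sarah moved

print(store.answer("city"))      # Toronto
print(store.answer("employer"))  # None
```

The old value is not deleted, only demoted, which is what lets a system answer "where do I live now?" and "where did I live before?" differently.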
The point
The benchmark number says we are first. The architecture says why it will hold: memory at the message level, not a context window stretched to hide a weaker design. You do not have to take the leaderboard's word for it. Set memory="Auto" and feel the difference in your own app.
Grab a key and try it: app.backboard.io
Memory docs: docs.backboard.io/concepts/memory