Jonathan Murray for Backboard.io

We're still the only one to hit #1 on both LoCoMo and LongMemEval. Here is how to use it.

Backboard is #1 on LoCoMo and LongMemEval, the two academic benchmarks for long-term AI memory, and it got there without changing the original guidelines. Other companies have gamed the benchmarks by using newer models with bigger context windows. This post explains why the result still matters, what it actually measures, and how to use the memory that earned it.

What these benchmarks test

These are not "find a fact in a wall of text" tests. They measure whether a system can build, maintain, and reason over memory across many conversations.

LoCoMo (Long-term Conversational Memory) evaluates very long-term memory over multi-session dialogues that span weeks. It tests single-session recall, cross-session reasoning, temporal reasoning, outside knowledge, and adversarial questions.

LongMemEval scores five distinct abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates (noticing when a fact about the user changes), and abstention (knowing when it does not know). Its own paper reports that commercial assistants and long-context models show roughly a 30% accuracy drop when they have to remember information across sustained interactions.

That last point is the whole story.

Why we do not advertise the score much anymore

A few honest notes about the result.

We are still #1 on the original academic benchmarks. Other systems have since posted high numbers too, but they got there by pointing a stronger model at the problem and leaning on ever-larger context windows. At the top, everyone is near the ceiling of what these tests can even measure, so the raw number stops being interesting. What is interesting is how you got there.

The difference is where the work happens. We solve memory at the message level. Memory is built as the conversation happens, fact by fact, then retrieved when relevant. We do not stuff a giant context window to paper over a memory architecture that cannot actually remember. A bigger context window is brute force, and the benchmarks already show that brute force degrades over long horizons. Message-level memory is the thing the test is supposed to reward. Brute force does not scale over months or years, and it pushes users toward inflated token usage and higher spend. No thanks.
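
To make "message level" concrete, here is a toy sketch in Python. It is an illustration only, not Backboard's implementation: the ToyMemory class, its observe and recall methods, and the keyword-match retrieval are all hypothetical, and fact extraction is assumed to have already happened. It shows the two ideas from the paragraph above: facts are written as each message arrives (a newer value supersedes the older one), and recall returns only the relevant facts rather than the whole history.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    key: str    # what the fact is about, e.g. "city"
    value: str  # current value, e.g. "Toronto"
    turn: int   # index of the message that last wrote this fact


@dataclass
class ToyMemory:
    """Hypothetical store: one current value per key, newest write wins."""
    facts: dict[str, Fact] = field(default_factory=dict)
    turn: int = 0

    def observe(self, extracted: dict[str, str]) -> None:
        """Record the facts extracted from one message (extraction is out of scope)."""
        self.turn += 1
        for key, value in extracted.items():
            self.facts[key] = Fact(key, value, self.turn)  # supersedes any older value

    def recall(self, query: str) -> list[Fact]:
        """Return only the facts whose key appears as a word in the query."""
        words = set(query.lower().replace("?", "").split())
        return [f for f in self.facts.values() if f.key in words]


memory = ToyMemory()
memory.observe({"name": "Sarah", "city": "Chicago"})  # message 1
memory.observe({"city": "Toronto"})                    # message 2: the user moved
hits = memory.recall("Which city do I live in?")
print([(f.key, f.value) for f in hits])  # [('city', 'Toronto')]
```

The prompt that reaches the model then carries one short fact instead of every prior message, which is where the token savings come from.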

We did not run these benchmarks ourselves. Third-party organizations did. We do not build for benchmarks and we do not tune to a leaderboard. We build the best memory product we can for our customers. It just happens to top the benchmarks too.

One more thing, and we will not name names: several of the top open-source memory projects on GitHub run on Backboard for their paid cloud offering. The thing people benchmark against us is, in some cases, us. We think that is funny.

So we let the score sit quietly and we ship the product. Here is how to use it.

How to use it

The memory that tops these benchmarks takes one parameter. Pass memory="Auto" when you send a message, reuse the same assistant_id, and facts carry across every conversation.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    # Conversation 1: a fact is extracted and stored at the message level
    await client.send_message(
        "My name is Sarah. I just moved from Chicago to Toronto.",
        assistant_id="your-assistant-id",
        memory="Auto",
    )

    # Conversation 2: new thread, same assistant, memory recalled
    reply = await client.send_message(
        "Where do I live now?",
        assistant_id="your-assistant-id",
        memory="Auto",
    )
    print(reply.content)  # expected to mention Toronto

asyncio.run(main())

JavaScript (Node 18+)

// Save as an .mjs file (or set "type": "module" in package.json) so top-level await works.
const send = (body) =>
  fetch("https://app.backboard.io/api/threads/messages", {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  }).then((r) => r.json());

await send({
  content: "My name is Sarah. I just moved from Chicago to Toronto.",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

const reply = await send({
  content: "Where do I live now?",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

console.log(reply.content);

cURL

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "My name is Sarah. I just moved from Chicago to Toronto.", "assistant_id": "your-assistant-id", "memory": "Auto"}'

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Where do I live now?", "assistant_id": "your-assistant-id", "memory": "Auto"}'
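
If you would rather call the REST endpoint from Python without installing the SDK, the same request can be built with the standard library. This is a minimal sketch that assumes only what the cURL example shows (the URL, the X-API-Key header, and the three JSON fields); the build_message_request helper is a name made up for illustration, and the response shape is not assumed here.

```python
import json
import urllib.request

# Endpoint copied from the cURL example above.
API_URL = "https://app.backboard.io/api/threads/messages"


def build_message_request(api_key: str, content: str, assistant_id: str,
                          memory: str = "Auto") -> urllib.request.Request:
    """Build (but do not send) the same POST request as the cURL example."""
    payload = {"content": content, "assistant_id": assistant_id, "memory": memory}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_message_request("YOUR_API_KEY", "Where do I live now?", "your-assistant-id")
print(req.get_method(), req.full_url)
print(json.loads(req.data))

# To actually send it:
#   with urllib.request.urlopen(req) as resp:
#       body = json.load(resp)
# urlopen raises urllib.error.HTTPError on a 4xx/5xx status, so wrap it in try/except.
```

Building the request separately from sending it makes the payload easy to inspect or log before any tokens are spent.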

This maps directly to what the benchmarks reward

Each benchmark ability is just a memory mode in practice:

  • Knowledge updates (Sarah moved cities): memory="Auto" saves the new fact and supersedes the old one, no code from you.
  • Multi-session reasoning: facts live on the assistant, so they cross threads automatically. Reuse the assistant_id.
  • Higher-accuracy retrieval: switch memory="Auto" to memory_pro="Auto" when precision matters more than cost.
  • Abstention: with memory set to Readonly, the assistant recalls what it has and does not invent what it does not.

# Precision retrieval over everything the assistant knows
# (run inside an async function, with client created as in the Python example above)
response = await client.send_message(
    "What were my project deadlines?",
    assistant_id="your-assistant-id",
    memory_pro="Auto",
)
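
One way to keep the mode choice in a single place is a small helper that returns the right keyword argument. This is a sketch under assumptions: memory_options is a hypothetical helper, not part of the SDK; the exact spelling "Readonly" as a parameter value is inferred from the list above; and because this post does not say whether memory_pro accepts a read-only mode, the helper refuses that combination rather than guessing.

```python
def memory_options(precise: bool = False, read_only: bool = False) -> dict[str, str]:
    """Return the keyword argument that selects a memory mode for send_message."""
    if precise and read_only:
        # Not covered by the post, so do not guess at the API's behavior.
        raise ValueError("precise + read_only is not documented here")
    if precise:
        return {"memory_pro": "Auto"}
    # "Readonly" as a value for memory is an assumption based on the list above.
    return {"memory": "Readonly" if read_only else "Auto"}


print(memory_options())                # {'memory': 'Auto'}
print(memory_options(precise=True))    # {'memory_pro': 'Auto'}
print(memory_options(read_only=True))  # {'memory': 'Readonly'}
```

Inside an async function you would then write await client.send_message("What were my project deadlines?", assistant_id="your-assistant-id", **memory_options(precise=True)).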

The point

The benchmark number says we are first. The architecture says why it will hold: memory at the message level, not a context window stretched to hide a weaker design. You do not have to take the leaderboard's word for it. Set memory="Auto" and feel the difference in your own app.

Grab a key and try it: app.backboard.io

Memory docs: docs.backboard.io/concepts/memory
