A three-part story about retrieval engineering, grounding truth, and what 93% accuracy actually costs.
66.9% accuracy. Zero cloud calls. Under one millisecond.
Part 1: The Benchmark that confuses…
Six weeks ago I sat down to run VEKTOR Slipstream through the LoCoMo benchmark. LoCoMo is the standard test for long-term conversational memory in AI systems. Ten multi-session conversations, 1,986 questions, categories covering single-hop recall, multi-hop reasoning, temporal queries, adversarial questions, and commonsense inference. Every serious memory system paper cites it. Mem0 cites it. Zep cites it. EverMemOS cites it.
Our first score: 1.3% F1.
Not 13%. Not 31%. One point three percent. Below random guessing on some categories.
The obvious assumption was that something was broken in our code. And some things were. But the deeper we dug, the more we realized the benchmark itself had problems that nobody talks about openly.
What LoCoMo Actually Tests
The setup is simple on paper. Feed a system the conversation history. Ask it questions. Score the answers with token-level F1 matching. A correct answer that uses different phrasing than the gold label can score near zero: "7 May 2023" and "May 7th, 2023" share only two of three tokens, so a semantically identical answer gets penalized.
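To make the scoring concrete, here is a minimal sketch of token-level F1. The normalization (lowercase, strip punctuation) is a simplification of what a real evaluation harness does, and `tokenF1` is an illustrative helper, not the official LoCoMo scorer:

```typescript
// Token-level F1: multiset overlap between prediction and gold tokens.
function tokenF1(prediction: string, gold: string): number {
  // Simplified normalization: lowercase, drop punctuation, split on whitespace.
  const norm = (s: string) =>
    s.toLowerCase().replace(/[^\w\s]/g, " ").split(/\s+/).filter(Boolean);
  const pred = norm(prediction);
  const ref = norm(gold);
  if (pred.length === 0 || ref.length === 0) return 0;
  // Count overlapping tokens (multiset intersection).
  const counts = new Map<string, number>();
  for (const t of ref) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of pred) {
    const c = counts.get(t) ?? 0;
    if (c > 0) { overlap++; counts.set(t, c - 1); }
  }
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / ref.length;
  return (2 * precision * recall) / (precision + recall);
}

// "7 May 2023" vs "May 7th, 2023": "7" and "7th" do not match, so the
// score drops to 2/3 despite the answers being identical in meaning.
```

A system that answers dates in a different-but-correct format bleeds points on every temporal question this way.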
The field has mostly moved away from F1 toward LLM-as-judge scoring, which is more forgiving and arguably more accurate. Mem0 reports 62.47% on their old algorithm and 91.6% on their new one. EverMemOS reports 93%. These numbers are not comparable to the original paper’s F1 scores. They are measuring different things with different judges.
We discovered this the hard way after spending two weeks trying to understand why our scores were stuck in single digits.
The Corrupted Labels
While debugging, I found this GitHub repository: dial481/locomo-audit. A systematic audit of the LoCoMo dataset, examining all 1,540 non-adversarial questions for ground truth errors.
Finding: 99 score-corrupting errors in 1,540 questions. 6.4% of the benchmark penalizes correct answers.
The error types are damning. HALLUCINATION errors, where the gold answer contains facts not present anywhere in the conversation transcript. TEMPORAL_ERROR cases where date arithmetic in the gold label is simply wrong. ATTRIBUTION_ERROR questions where the answer names the wrong speaker.
Then there is the commonsense category. 45 of 47 commonsense questions in conversation 0 have the answer field set to “undefined.” Not a wrong answer. A missing one. The benchmark ships with nearly the entire commonsense category unscored, yet every system that runs against it takes a zero on those questions.
The theoretical maximum score on LoCoMo, given the corrupted labels, is around 93%. Which happens to be exactly where EverMemOS lands.
Our Adjusted Score
Once we stripped the 45 undefined-answer questions from conversation 0 and scored only on valid questions, our numbers changed substantially. 154 of 199 questions in conversation 0 have valid gold answers. On those questions, with gpt-5.4-mini as both the answering model and the judge, VEKTOR Slipstream scores 66.9% accuracy.
That beats Mem0’s old algorithm (62.47%) on a valid subset of the benchmark.
It is still well below Zep (78.94%) and Memori (81.95%) and nowhere near EverMemOS (93%). Those gaps are real and I want to explain what creates them, because the answer is interesting.
Part 2: Building the Retrieval Pipeline
Where We Started
Our initial 1.3% F1 had three independent bugs, all discovered in sequence.
Bug one: the storage layer was not initializing. VEKTOR uses bge-small-en-v1.5 via ONNX for local inference. The boot sequence was running initBM25Schema in a setImmediate callback, which meant the FTS5 tables did not exist when the first remember() calls fired. Every write silently failed. Every recall returned empty results. The LLM answered every question "unknown", which scores zero on F1 and zero with any judge.
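The fix is an ordering guarantee: writes must await schema creation instead of racing a setImmediate callback. A sketch of that pattern, using hypothetical initBM25Schema() and remember() functions that mirror the bug description rather than VEKTOR's real API:

```typescript
// Hypothetical names mirroring the bug description; not VEKTOR's actual API.
let tablesCreated = false;
let schemaReady: Promise<void> | null = null;

async function initBM25Schema(): Promise<void> {
  // Stands in for CREATE VIRTUAL TABLE ... USING fts5(...).
  tablesCreated = true;
}

function ensureSchema(): Promise<void> {
  // Memoized so concurrent callers share one init instead of racing it.
  if (!schemaReady) schemaReady = initBM25Schema();
  return schemaReady;
}

async function remember(text: string): Promise<string> {
  // Before the fix, init ran in setImmediate and this write could hit a
  // database with no FTS5 tables, failing silently.
  await ensureSchema();
  if (!tablesCreated) throw new Error("FTS5 tables missing");
  return `stored: ${text}`; // stands in for the actual FTS5 insert
}
```

The silent-failure part is the lesson: a write path that swallows "no such table" errors turns a boot-ordering bug into a 1.3% benchmark score.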
Bug two: the session date format. LoCoMo stores timestamps as “1:56 pm on 8 May, 2023”. We were passing this string to JavaScript’s Date constructor, which returns Invalid Date. So our relative date resolution (converting "yesterday" to an absolute date) never fired. Questions about what happened "yesterday" in session 1 sent the LLM a memory containing the word "yesterday" with no date anchor.
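Formats like this need a hand-rolled parser. A minimal sketch written against the "1:56 pm on 8 May, 2023" shape shown above (the function name is ours, and a production version would want more lenient matching):

```typescript
const MONTHS: Record<string, number> = {
  january: 0, february: 1, march: 2, april: 3, may: 4, june: 5,
  july: 6, august: 7, september: 8, october: 9, november: 10, december: 11,
};

// Parses "1:56 pm on 8 May, 2023"; new Date() rejects this format outright.
function parseLocomoTimestamp(raw: string): Date | null {
  const m = raw.trim().toLowerCase()
    .match(/^(\d{1,2}):(\d{2})\s*(am|pm)\s+on\s+(\d{1,2})\s+([a-z]+),?\s+(\d{4})$/);
  if (!m) return null;
  let hour = parseInt(m[1], 10) % 12;
  if (m[3] === "pm") hour += 12; // 12-hour clock to 24-hour
  const month = MONTHS[m[5]];
  if (month === undefined) return null;
  return new Date(parseInt(m[6], 10), month, parseInt(m[4], 10), hour, parseInt(m[2], 10));
}
```

With an absolute anchor in hand, "yesterday" in a session dated 8 May 2023 resolves to 7 May 2023 instead of staying a dangling relative reference.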
Bug three: the minScore filter in our eval harness was set to 0.0, which cut every cross-encoder result with a negative logit. Cross-encoder ms-marco-MiniLM-L-6-v2 returns logits, not probabilities. A logit of -7 means “not very relevant” but it is still the best match in the candidate set. Filtering at 0 cut everything, leaving the LLM with empty context.
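The fix is to select by rank rather than by an absolute threshold, or to map logits through a sigmoid if a probability-like cutoff is genuinely needed. A sketch (the `Scored` shape is illustrative, not our eval harness's real type):

```typescript
interface Scored { id: string; logit: number }

// Cross-encoder logits are unbounded; the best candidate can be negative.
function sigmoid(logit: number): number {
  return 1 / (1 + Math.exp(-logit));
}

// Rank-based selection: a logit of -7 is weak evidence, but if it is the
// best match in the pool it should still reach the LLM's context.
function selectCandidates(results: Scored[], k: number): Scored[] {
  return [...results].sort((a, b) => b.logit - a.logit).slice(0, k);
}
```

A minScore of 0.0 looks harmless in a config file precisely because it reads like "no filter" when the scores are probabilities; with logits it means "discard everything below 50% confidence", which for hard queries is everything.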
Fixing these three bugs moved us from 1.3% to 33.7% F1 in one run.
The Retrieval Stack
After the basic bugs were fixed, we spent three weeks iterating on the retrieval pipeline. Here is what we built and what actually moved the numbers.
Stage 1: Bi-encoder draft. bge-small-en-v1.5 (384 dimensions, quantized, ONNX) runs cosine similarity over all stored memories. This is our draft pass. Fast, cheap, imprecise. Returns the top 60 candidates.
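The draft pass is nothing more exotic than cosine similarity over the stored 384-dimensional vectors. For reference:

```typescript
// Cosine similarity between two embedding vectors, as used in the draft pass.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero.
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
```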
Stage 2: BM25 + RRF fusion. Three parallel BM25 searches over an FTS5 index: the raw query, a Porter-stemmed variant (so “attending” matches “attend”), and a separate search for each proper noun in the query. All three lists get fused via Reciprocal Rank Fusion with k=15. This catches exact keyword matches that semantic search misses. “Sweden” is a good example. The memory “Caroline moved from Sweden 4 years ago” scores 0.63 cosine similarity against the query “Where did Caroline move from” because the semantic content is spread across many Caroline memories. But BM25 on “Sweden” hits it directly.
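Reciprocal Rank Fusion itself reduces to a few lines. A sketch with the k=15 constant mentioned above (the memory IDs in the test are illustrative):

```typescript
// RRF: each ID scores sum over lists of 1 / (k + rank), where rank is its
// 1-based position in that list. IDs appearing high in several lists win.
function rrfFuse(lists: string[][], k = 15): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, idx) => {
      const rank = idx + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

The appeal of RRF is that it needs no score normalization: BM25 scores and cosine similarities live on incompatible scales, but ranks are always comparable.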
Stage 3: Cross-encoder reranking. ms-marco-MiniLM-L-6-v2 scores each (query, candidate) pair jointly. This is the spec-decoding insight applied to retrieval. The bi-encoder embeds query and document independently. The cross-encoder sees both simultaneously, which is dramatically more accurate but too slow to run on thousands of documents. Running it on the top 30 candidates gives you big-model accuracy at small-model cost. Before cross-encoder reranking, our scores on single-hop questions were around 28%. After, mid-40s.
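The shape of that final stage, with the model call stubbed out (`crossEncoderScore` stands in for the ONNX inference and is an assumption, not VEKTOR's real API):

```typescript
interface Memory { id: string; text: string }

// Rerank a short candidate list with a jointly-scored (query, doc) model.
function rerank(
  query: string,
  candidates: Memory[],
  crossEncoderScore: (query: string, doc: string) => number,
  topK = 30,
): Memory[] {
  // The cross-encoder is expensive, so it only sees the short list that
  // the cheaper bi-encoder and BM25 stages produced.
  return candidates
    .map((m) => ({ m, score: crossEncoderScore(query, m.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((x) => x.m);
}
```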
Additional layers that helped: A persistent entity index (proper nouns mapped to memory IDs), question type classification (routing single-hop vs multi-hop to different retrieval strategies), an agentic sufficiency check that reformulates the query when key entities are missing from the top results, and a temporal index that stores ISO date extractions for date-arithmetic queries.
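Of those layers, the entity index is the simplest to picture. One possible shape, with a deliberately crude proper-noun heuristic (this is an illustrative data structure, not VEKTOR's actual schema, which would use real NER):

```typescript
// Maps proper nouns to the set of memory IDs that mention them.
class EntityIndex {
  private index = new Map<string, Set<string>>();

  add(memoryId: string, text: string): void {
    // Crude heuristic: capitalized tokens that are not sentence-initial.
    // This misses names at position 0 and catches some false positives;
    // a real implementation would run named-entity recognition.
    text.split(/\s+/).forEach((tok, i) => {
      const word = tok.replace(/[^A-Za-z]/g, "");
      if (i > 0 && /^[A-Z]/.test(word)) {
        if (!this.index.has(word)) this.index.set(word, new Set());
        this.index.get(word)!.add(memoryId);
      }
    });
  }

  lookup(entity: string): string[] {
    return [...(this.index.get(entity) ?? [])];
  }
}
```

At query time, any proper noun in the question becomes a direct lookup, bypassing similarity search entirely for exact-entity questions.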
What did not work: Semantic triple extraction. The idea was to store structured facts (“Caroline attended LGBTQ support group on 7 May 2023”) alongside raw turns. This is exactly what Memori does and it gets 81.95%. When we implemented it, scores dropped 7 points. The triples flooded the candidate pool with low-quality facts that crowded out the actual high-quality raw turn memories. The cross-encoder window is 30 slots. If 20 of them are mediocre extracted facts, the LLM gets worse context than with 20 raw turns.
The right implementation of triple extraction requires replacing raw turns rather than augmenting them. That is an architectural change, not a config flag.
The Final Numbers
After six weeks of iteration, VEKTOR Slipstream with gpt-5.4-mini as the answering model and judge:
Category                          F1      Judge Accuracy
Single-hop                        34.7%   51.6%
Multi-hop                         57.0%   79.1%
Temporal                          21.8%   46.2%
Adversarial                       46.3%   70.4%
Commonsense                       6.3%    9.4%
Total                             34.9%   52.8%
Adjusted (valid questions only)   45.1%   66.9%
Multi-hop at 79.1% is legitimately strong. The MAGMA graph layer (co-occurrence and temporal edges between entities) is doing real work on questions that require connecting two facts across sessions.
Adversarial at 70.4% is also solid. Speaker scoping, where we extract the named person from the question and boost memories attributed to that speaker, handles most adversarial framing correctly.
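A sketch of speaker scoping, assuming each memory carries a speaker attribution; the boost constant and the naive name extraction are illustrative, not VEKTOR's tuned values:

```typescript
interface Mem { text: string; speaker: string; score: number }

// Boost memories attributed to any person named in the question.
function speakerBoost(question: string, memories: Mem[], boost = 0.2): Mem[] {
  // Naive extraction: any capitalized word. This also catches the
  // sentence-initial word; real name extraction would filter those.
  const names = new Set(question.match(/\b[A-Z][a-z]+\b/g) ?? []);
  return memories
    .map((m) => (names.has(m.speaker) ? { ...m, score: m.score + boost } : m))
    .sort((a, b) => b.score - a.score);
}
```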
Single-hop at 51.6% is where the benchmark is telling us the architecture needs to change.
Part 3: What 93% Actually Costs, and VEKTOR’s Real Differentiator
The Architecture Gap
EverMemOS achieves 93% on LoCoMo. Mem0’s new algorithm achieves 91.6%. Both use fundamentally different architectures than VEKTOR.
EverMemOS uses four separate storage backends: MongoDB for document storage, Elasticsearch for BM25 search with jieba tokenization, Milvus for vector similarity with HNSW indexing, and Redis for caching. It extracts three distinct memory types in parallel on every ingestion: Episodes (narrative summaries), Foresights (time-bounded predictions), and EventLogs (atomic facts). When you ask “when did Caroline go to the LGBTQ support group,” EverMemOS queries EventLogs first. The EventLog contains “Caroline attended LGBTQ support group on 7 May 2023” as a clean structured fact. The retrieval precision is near-perfect because there is no noise.
Mem0’s new algorithm uses a single-pass ADD-only extraction approach with entity linking. Every extracted fact becomes an independent record. Contradictions survive alongside each other with timestamps. “Caroline lives in Sweden [2019]” and “Caroline lives in Australia [2023]” both exist in the store, and the LLM reasons about the transition rather than getting a silently overwritten record.
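The ADD-only pattern is easy to sketch. Field names here are illustrative, not Mem0's actual schema; the point is that nothing is ever updated or deleted, so contradictions coexist with timestamps:

```typescript
interface Fact { subject: string; predicate: string; object: string; year: number }

// Append-only fact store: contradictory facts survive side by side.
class AddOnlyStore {
  private facts: Fact[] = [];

  add(f: Fact): void {
    this.facts.push(f); // never update, never delete
  }

  // Return all facts for a (subject, predicate) pair in time order, so
  // the LLM can reason about the transition instead of seeing one
  // silently-overwritten record.
  history(subject: string, predicate: string): Fact[] {
    return this.facts
      .filter((f) => f.subject === subject && f.predicate === predicate)
      .sort((a, b) => a.year - b.year);
  }
}
```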
Both approaches require cloud API calls at ingestion time. EverMemOS needs an LLM to extract Episodes, Foresights, and EventLogs from every conversation chunk. Mem0 needs an LLM to extract and deduplicate facts. The ingestion pipeline is the retrieval quality.
VEKTOR does not require a cloud API call at ingestion time. The retrieval quality comes from the retrieval pipeline itself, not from expensive preprocessing. This is a deliberate architectural tradeoff.
The Numbers That Actually Matter in Production
Here is the retrieval latency comparison:
Mem0: 0.71 seconds per query (cloud API call required)
EverMemOS: 200–500ms (Elasticsearch + Milvus + reranker)
VEKTOR: sub-millisecond (local SQLite + ONNX, no network call)
At 100 queries per second, Mem0 requires 71 server-seconds of retrieval time. VEKTOR requires less than one.
Token efficiency is the other axis. Mem0 new algorithm uses 6,956 tokens per retrieval call on average. EverMemOS is similar. VEKTOR surfaces 1,500–2,000 tokens of context. At scale, the difference between 7,000 tokens per query and 1,500 tokens per query compounds into significant cost.
The benchmark measures accuracy. It does not measure latency, cost, data sovereignty, or the ability to run completely offline. For many production use cases, these constraints matter more than whether the system scores 66.9% or 91.6% on a benchmark with 6.4% corrupted labels.
What VEKTOR Gets Right
VEKTOR’s architectural bet is that retrieval quality should come from a sophisticated local retrieval pipeline rather than from expensive cloud-dependent preprocessing. The pipeline we built after six weeks of iteration, bge-small bi-encoder followed by BM25 fusion followed by cross-encoder reranking, achieves 79.1% judge accuracy on multi-hop questions locally with zero cloud dependency at query time.
That is a real result. It means a developer can embed VEKTOR in a desktop application, a mobile app, or an air-gapped enterprise deployment and get competitive memory quality without sending conversation data to a cloud API.
The gap between 66.9% and 93% is real and it comes from the semantic triple extraction approach. We tried it, it made things worse with our current architecture, and we understand why. The right implementation requires replacing raw turn storage with structured fact storage, which is the next major architectural work.
But 66.9% beating Mem0’s previous algorithm at under one millisecond retrieval latency and zero cloud API cost is a genuinely useful product. That is the honest benchmark story.
What Comes Next
The next version of VEKTOR Slipstream will implement proper MemCell extraction: segmenting conversations at topic boundaries and storing episode-level summaries rather than raw turns. Combined with the current retrieval pipeline, this should push single-hop accuracy past 65% and overall adjusted judge above 72%.
The benchmark numbers will keep improving. More importantly, the retrieval latency will stay under one millisecond, the data will stay on your device, and the API key requirement will stay optional.
That is a different product than Mem0 or EverMemOS. It is not a lesser version of those systems. It is a different architectural tradeoff serving a different set of production constraints.
VEKTOR Slipstream is a one-time-purchase AI memory SDK for Node.js. The benchmark code used in this article is available at www.vektormemory.com. The LoCoMo dataset is published by Snap Research under its original license.
If any information quoted here is incorrect, outdated, or non-factual, please advise and the article will be updated accordingly.
The locomo-audit repository referenced in Part 1 is at github.com/dial481/locomo-audit.