Vektor Memory

Posted on Jun 12

79% on LongMemEval: How We Beat Full-Context GPT-4 with a Local SQLite Database

#agents #ai #database #llm

A benchmark result that changes what we thought was possible for local persistent agent vector memory

We ran VEKTOR Slipstream against LongMemEval this week and got a result that impressed us.

79.0%. That is 12 points above full-context GPT-4.

To understand why that number matters, you need to understand what LongMemEval is actually testing, why it is hard, and what it took to get there.

What LongMemEval Is and Why It Is the Hardest Memory Benchmark
Most memory benchmarks test the easy case.

They test whether your system can retrieve a fact that was stored recently, in a clean format, with an obvious query. That is approximately what happens in a controlled demo. It is not what happens in production.

LongMemEval is different. It was designed specifically to stress-test the failure modes of real memory systems over real conversations. The benchmark contains 500 questions drawn from genuine multi-session chat histories, with an average of 344 memory items per question. The questions are distributed across seven categories, each targeting a specific failure mode:

Single-session retrieval tests whether you can answer a question from a single conversation correctly. Sounds easy. The catch is that the answer is buried in a long session, surrounded by noise, and the query phrasing bears no resemblance to how the answer was stored.

Multi-session reasoning asks you to connect facts across conversations that happened at different times. “What did the user say about their job last month?” requires knowing that those memories exist and linking them.

Temporal reasoning tests date-anchored facts. “Where was the user living when they started their new job?” requires understanding which memories belong to which time window.

Knowledge updates test whether your system correctly invalidates old facts. If a user says, “I moved to San Francisco” after previously saying, “I live in Los Angeles,” the correct answer to “Where does the user live?” is San Francisco. Systems that append rather than supersede fail this category consistently.

Abstention tests whether your system knows when it does not know. Many systems hallucinate an answer rather than say “I don’t have that information.” Abstention at 90% means VEKTOR declined to answer when it lacked the information, nine times out of ten.

The baseline in this benchmark is brutal. Full-context GPT-4, where the entire conversation history is stuffed into the context window, scores 67%. That is the system where the model literally sees everything and does not have to do anything intelligent with storage. VEKTOR, running on local SQLite, beat it by 12 points.

The Four Versions We Ran to Get Here

We did not start at 79%. We started at 48.6% and ran four iterations to understand what was failing and why.

v1 (48.6%) was a naive implementation: store every turn as raw memory and retrieve it by vector similarity. The immediate failure was obvious. Questions like “What did the user say about their sister’s wedding?” returned semantically similar memories about events, parties, and celebrations. Technically correct retrieval. Wrong answer.

v2 (57.1%) added BM25 keyword search fused with semantic search via Reciprocal Rank Fusion. This improved single-session recall significantly. Multi-session questions still failed because the system had no way to reason about when memories occurred relative to each other.

v3 (55.2%) was a step backward. We introduced aggressive deduplication and contradiction detection, which accidentally removed valid memories that looked similar but referred to different time periods. Lesson: deduplication needs temporal awareness, not just semantic similarity.

v4 (79.0%) introduced what we are calling routed ingest, and it is the architectural decision that drove the result.

Routed Ingest — The Strategy That Changed Everything

The core insight behind routed ingest is simple: different types of memories benefit from fundamentally different storage strategies.

Before this, every conversation turn was stored the same way. Raw text, embedded, inserted. The problem is that “I moved to San Francisco last Tuesday” and “I prefer dark mode” and “the payment API went live yesterday” are three completely different types of information. Treating them identically is why most memory systems plateau in the 55 to 65% range.

Routed ingest assigns each memory to one of two pipelines at write time:

Extraction pipeline for complex, cross-session, time-sensitive information. The raw turn is sent to an LLM with a structured prompt that extracts discrete factual statements. “Sarah moved to San Francisco in March 2026.” “The user’s sister got married on 14 June.” “Project X launched on 3 April.” These extracted facts are stored as clean, independently queryable memories with resolved dates, named entities, and explicit subjects.

Raw storage for single-session conversational turns, preference statements, and questions where the original phrasing is the important artifact. These go in as-is because transformation does not improve them and can introduce errors.

The routing decision is made by classifying the question type:

Temporal reasoning → extraction pipeline
Multi-session → extraction pipeline
Knowledge updates → extraction pipeline
Single-session → raw storage
Abstention questions → raw storage

The benchmark results by type tell the story directly:

temporal-reasoning 100.0% (15/15)
single-session-assistant 86.7% (13/15)
single-session-user 80.0% (16/20)
multi-session 75.0% (15/20)
abstention 90.0% (9/10)
knowledge-update 66.7% (10/15)
single-session-preference 50.0% (5/10)
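As a check on the headline figure, the per-type counts above can be summed directly. The sketch below uses only the counts listed in the results; the variable names are ours for illustration.

```python
# Per-type (correct, total) counts copied from the results above.
results = {
    "temporal-reasoning": (15, 15),
    "single-session-assistant": (13, 15),
    "single-session-user": (16, 20),
    "multi-session": (15, 20),
    "abstention": (9, 10),
    "knowledge-update": (10, 15),
    "single-session-preference": (5, 10),
}

correct = sum(c for c, _ in results.values())
total = sum(t for _, t in results.values())
overall = 100 * correct / total

print(f"{correct}/{total} = {overall:.1f}%")  # 83/105 = 79.0%
```

The listed counts sum to 105 questions, of which 83 were answered correctly, which matches the 79.0% headline.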

Temporal reasoning at 100% is the most striking number. Every single date-anchored question was answered correctly. That is because extracted facts carry explicit date context that survives across sessions, and the temporal index can be queried by date range rather than relying on semantic similarity alone.

Multi-session at 75% with a 30-point improvement over v3 confirms that extraction is the right strategy for cross-session reasoning. The extracted facts give the system discrete, searchable statements rather than walls of conversation text.
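The routing rule described in this section can be sketched as a plain lookup table. This is a minimal illustration, not the SDK's code: the `Pipeline` enum, the `route` function, the label strings, and the fallback to raw storage for unknown types are all assumptions, and the classifier that produces the type label is not shown.

```python
from enum import Enum


class Pipeline(Enum):
    EXTRACTION = "extraction"
    RAW = "raw"


# Routing table as listed in this section; the key strings are illustrative labels.
ROUTES = {
    "temporal-reasoning": Pipeline.EXTRACTION,
    "multi-session": Pipeline.EXTRACTION,
    "knowledge-update": Pipeline.EXTRACTION,
    "single-session": Pipeline.RAW,
    "abstention": Pipeline.RAW,
}


def route(memory_type: str) -> Pipeline:
    """Pick a pipeline; unknown types fall back to raw storage (an assumption)."""
    return ROUTES.get(memory_type, Pipeline.RAW)


print(route("temporal-reasoning"))  # Pipeline.EXTRACTION
```

Keeping the rule as data rather than branching logic makes it cheap to add a third pipeline later without touching the callers.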

What Full-Context GPT-4 Cannot Do That We Can

The comparison that surprises people most is beating full-context GPT-4 by 12 points.

Full-context GPT-4 on this benchmark means every conversation in the history is concatenated into a single massive prompt, and GPT-4 answers the question with the entire history visible. No retrieval. No selection. Just read everything and answer.

That approach has a hard ceiling, and it is lower than you might expect.

First, the context window fills up. GPT-4’s context limit means that very long histories get truncated. Information from older sessions simply disappears.

Second, and more interesting, is the attention problem. LLMs do not read a 200,000 token context the way a human reads a document. Attention is not uniformly distributed. Facts buried in the middle of a long context are systematically underweighted relative to facts at the beginning or end. The “lost in the middle” phenomenon is well documented in the research literature and measurable in benchmark performance.

Third, there is no disambiguation. When the same name appears in multiple contexts with different associated facts, the model struggles to track which fact belongs to which temporal context. Everything is simultaneous rather than sequenced.

VEKTOR’s temporal index solves this directly. Memories are stored with explicit date anchors, indexed by a dedicated timeline table, and retrieved with date-range filtering. The question “Where was Sarah living when she started her new job in March?” can be answered by retrieving memories tagged to March rather than scanning the entire history and hoping attention lands on the right passage.

The Architecture Behind the Numbers

Three components drove the benchmark result. They are all in the open SDK and available to anyone building on VEKTOR.

vektor_timeline is a secondary SQLite table that indexes every memory with an extracted ISO date. When a question contains temporal markers, the retrieval pipeline boosts memories from the relevant date range before running semantic search. This is why temporal reasoning hit 100%.
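A date-indexed side table of this kind might look like the following. Only the table name `vektor_timeline` comes from the description above; the column names, the `memories` table, the sample rows and their dates, and the `memories_in_range` helper are assumptions for illustration. The sketch shows the date-range lookup only, not the boost applied before semantic search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
    -- Hypothetical schema: one row per (memory, ISO date) pair.
    CREATE TABLE vektor_timeline (
        memory_id INTEGER NOT NULL REFERENCES memories(id),
        iso_date  TEXT NOT NULL  -- 'YYYY-MM-DD' sorts chronologically as text
    );
    CREATE INDEX idx_timeline_date ON vektor_timeline(iso_date);
""")

# Invented sample data, for illustration only.
rows = [
    (1, "Sarah moved to San Francisco.", "2026-03-02"),
    (2, "Sarah started her new job.", "2026-03-16"),
    (3, "Project X launched.", "2026-04-03"),
]
for mem_id, content, iso_date in rows:
    conn.execute("INSERT INTO memories VALUES (?, ?)", (mem_id, content))
    conn.execute("INSERT INTO vektor_timeline VALUES (?, ?)", (mem_id, iso_date))


def memories_in_range(start: str, end: str) -> list[str]:
    """Return memory contents dated within [start, end], oldest first."""
    cur = conn.execute(
        """SELECT m.content FROM vektor_timeline t
           JOIN memories m ON m.id = t.memory_id
           WHERE t.iso_date BETWEEN ? AND ?
           ORDER BY t.iso_date""",
        (start, end),
    )
    return [row[0] for row in cur]


march = memories_in_range("2026-03-01", "2026-03-31")
```

Storing dates as ISO strings lets an ordinary B-tree index serve range queries, since lexical order and chronological order coincide.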

BM25 + RRF dual-channel recall fuses keyword search with semantic search using Reciprocal Rank Fusion. The two channels find different memories. Semantic search finds conceptually similar content. BM25 finds memories containing specific names, dates, and technical terms that do not have obvious semantic neighbors. RRF blends the rankings without requiring a learned fusion model. This is why proper noun recall improved dramatically from v1 to v2.
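Reciprocal Rank Fusion itself is small enough to show in full. The sketch below is a generic implementation of the standard formula, not VEKTOR's code: the function name, the sample IDs, and the constant `k = 60` (the value commonly used in the RRF literature) are assumptions, since the post does not state which constant the SDK uses.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["m3", "m1", "m7"]  # hypothetical semantic-channel ranking
keyword = ["m1", "m9", "m3"]   # hypothetical BM25-channel ranking
fused = rrf_fuse([semantic, keyword])
print(fused)  # ['m1', 'm3', 'm9', 'm7']
```

Because only ranks enter the formula, the two channels never need their raw scores normalized against each other, which is what makes the fusion work without a learned model.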

Entity indexing extracts named entities from every stored memory and builds a secondary index. Queries containing proper names use entity lookup to retrieve memories associated with that person, place, or project, then expand through graph edges to related memories. This is the pathfinding layer that handles the “What language does Sarah use?” class of question.
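The lookup-then-expand idea can be sketched with two dictionaries. Everything here is a hypothetical stand-in: the in-memory `entity_index` and `edges` structures, the function names, and the `hops` parameter are assumptions, and the entity extraction step is skipped by passing entity names in directly.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the entity index and graph edges.
entity_index: dict[str, set[int]] = defaultdict(set)
edges: dict[int, set[int]] = defaultdict(set)


def index_memory(memory_id: int, entities: list[str]) -> None:
    """Record which named entities a memory mentions (case-insensitive)."""
    for name in entities:
        entity_index[name.lower()].add(memory_id)


def link(a: int, b: int) -> None:
    """Add an undirected edge between two related memories."""
    edges[a].add(b)
    edges[b].add(a)


def lookup(entity: str, hops: int = 1) -> set[int]:
    """Memories mentioning the entity, expanded `hops` steps along edges."""
    found = set(entity_index.get(entity.lower(), set()))
    frontier = set(found)
    for _ in range(hops):
        frontier = {n for m in frontier for n in edges.get(m, set())} - found
        found |= frontier
    return found


index_memory(1, ["Sarah"])
index_memory(2, ["Project X"])
index_memory(3, ["Sarah", "Project X"])
link(2, 4)  # memory 4 mentions no entity but is related to memory 2
```

With this data, `lookup("Project X")` reaches memory 4 through the edge from memory 2, even though memory 4 never names the project.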

The full recall pipeline runs in this order for every query:

Classify question type (temporal / multi-session / single-session / adversarial)
Embed query vector
Semantic candidate retrieval (top 60 from 2000 recent memories)
Timeline boost if temporal markers detected
BM25 keyword search, stem table search
Entity lookup and graph traversal
RRF fusion across all channels
Layer 6 additive reranking (importance + strength + causal weight)
Return top K

The whole pipeline runs on a local SQLite database. No API calls. No cloud infrastructure. No vector database cloud service or embedding costs. The latency is under 20ms on a laptop.
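The additive reranking step in the pipeline above can be illustrated as a weighted sum added to the fused score. The three signal names come from the pipeline description; the `Candidate` class, the weights of 0.1, the score ranges, and the sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    memory_id: str
    fused_score: float    # score from the fusion step
    importance: float     # the three signals named in the pipeline;
    strength: float       # their ranges and weights here are assumptions
    causal_weight: float


def rerank(cands: list[Candidate], top_k: int = 3,
           w_imp: float = 0.1, w_str: float = 0.1,
           w_causal: float = 0.1) -> list[str]:
    """Add weighted signals to the fused score and return the top_k IDs."""
    def final(c: Candidate) -> float:
        return (c.fused_score + w_imp * c.importance
                + w_str * c.strength + w_causal * c.causal_weight)
    ranked = sorted(cands, key=final, reverse=True)
    return [c.memory_id for c in ranked[:top_k]]


cands = [
    Candidate("a", 0.50, 0.1, 0.1, 0.0),
    Candidate("b", 0.45, 0.9, 0.5, 0.2),
    Candidate("c", 0.30, 0.2, 0.2, 0.1),
]
top = rerank(cands, top_k=2)
print(top)  # ['b', 'a']
```

In this example, candidate `b` starts behind `a` on fused score but overtakes it once its higher importance and strength are added in.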

The Two Categories That Need Work

Knowledge updates at 66.7% is the most interesting. This category tests whether the system correctly answers questions about facts that changed over time. “The user used to live in Los Angeles but moved to San Francisco. Where do they live?” The correct answer requires not only retrieving the more recent memory but also understanding that it supersedes the earlier one.

Our contradiction detection handles this well when the two facts are stored close together and share clear semantic overlap. It struggles when the update is phrased differently from the original or arrives in a different session context. The AUDN loop detects the contradiction but sometimes downgrades both memories rather than cleanly invalidating the older one. We need a harder supersession model, probably one that extracts a canonical attribute (location, job title, or relationship status) and explicitly marks all previous values for that attribute as expired.

Single-session-preference at 50% is trickier. Preference statements like “I prefer dark mode” or “I like concise responses” are stored correctly but recalled unreliably because they are low-importance, short, and semantically flat. They do not activate many recall channels. The fix is a dedicated preference namespace with its own retrieval path, bypassing importance scoring and prioritizing recency and specificity.

Both weaknesses are fixable. The architectural interventions are clear. This is what a benchmark is for.

What This Means for Anyone Building AI Agents

The headline number is 79%. The practical implication is more specific than that.

If you are building an agent that needs to remember things across sessions, you have a few architectural options. You can use full-context injection, which does not scale and has a ceiling around 67% on this benchmark. You can use a vector database with naive retrieval, which plateaus around 55 to 62%. Or you can use an intelligent memory system with routed ingest, temporal indexing, and multi-channel recall.

The gap between those options is not marginal. It is the difference between an agent that answers “where was Sarah living when she started her new job?” correctly and one that either hallucinates or says it does not know.

For production applications, especially in domains like personal assistants, customer service agents, research tools, and coding assistants, the quality of memory retrieval is directly proportional to user trust. Users notice when an agent forgets things. They notice when it contradicts itself. They notice when it cannot connect two facts they told it in the same week.

The benchmark says VEKTOR handles 79% of these cases correctly. The failure cases are known, the interventions are clear, and the architecture is local-first with no cloud dependency.

Next Steps and What We Are Building Toward

The two weaker categories give us a concrete roadmap for further testing on v5.

Knowledge updates need a supersession model that tracks canonical attributes per entity and explicitly expires stale values. The data model is straightforward: every memory gets an attribute type tag at extraction time, and any new memory with the same attribute type for the same entity triggers invalidation of older values.
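One way that data model could be expressed in SQLite is sketched below. This is an assumption about a design that has not shipped, not a description of the SDK: the `facts` table, its columns, the helper names, and the sample dates are all hypothetical. It also assumes facts arrive in chronological order; out-of-order ingestion would need a comparison on `stated_on` before expiring anything.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        id         INTEGER PRIMARY KEY,
        entity     TEXT NOT NULL,
        attribute  TEXT NOT NULL,   -- e.g. 'location', 'job_title'
        value      TEXT NOT NULL,
        stated_on  TEXT NOT NULL,   -- ISO date the fact was stated
        expired    INTEGER NOT NULL DEFAULT 0
    )
""")


def upsert_fact(entity: str, attribute: str, value: str, stated_on: str) -> None:
    """Expire prior values for (entity, attribute), then insert the new one."""
    with conn:  # one transaction: expire and insert succeed or fail together
        conn.execute(
            "UPDATE facts SET expired = 1 "
            "WHERE entity = ? AND attribute = ? AND expired = 0",
            (entity, attribute),
        )
        conn.execute(
            "INSERT INTO facts (entity, attribute, value, stated_on) "
            "VALUES (?, ?, ?, ?)",
            (entity, attribute, value, stated_on),
        )


def current_value(entity: str, attribute: str) -> Optional[str]:
    """Return the single unexpired value, or None if nothing is stored."""
    row = conn.execute(
        "SELECT value FROM facts "
        "WHERE entity = ? AND attribute = ? AND expired = 0",
        (entity, attribute),
    ).fetchone()
    return row[0] if row else None


upsert_fact("user", "location", "Los Angeles", "2026-01-10")
upsert_fact("user", "location", "San Francisco", "2026-03-02")
print(current_value("user", "location"))  # San Francisco
```

The expired row is kept rather than deleted, so a question about where the user used to live can still be answered from the same table.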

Preference recall needs a dedicated lightweight pathway that does not compete with importance-weighted retrieval. Preferences should not be ranked against architectural decisions or deployment failures. They should be in their own bucket, retrieved in full when the session starts.
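A separate bucket of this kind could be as simple as its own table read in full at session start. The sketch is hypothetical: the `preferences` table, its columns, the timestamps, and `load_preferences` are illustrative names, not part of the current SDK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE preferences (
        id         INTEGER PRIMARY KEY,
        content    TEXT NOT NULL,
        created_at TEXT NOT NULL  -- ISO timestamp
    )
""")
# Invented timestamps, for illustration only.
conn.executemany(
    "INSERT INTO preferences (content, created_at) VALUES (?, ?)",
    [
        ("I prefer dark mode", "2026-05-01T09:00:00"),
        ("I like concise responses", "2026-05-20T14:30:00"),
    ],
)


def load_preferences() -> list[str]:
    """Return every stored preference, newest first, with no scoring step."""
    cur = conn.execute(
        "SELECT content FROM preferences ORDER BY created_at DESC, id DESC"
    )
    return [row[0] for row in cur]


session_preferences = load_preferences()
```

Because nothing is ranked or filtered, a short, semantically flat statement cannot lose out to a higher-importance memory.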

Beyond the immediate fixes, the routed ingest strategy opens up a broader architectural direction. Once you are classifying memories at write time, you can route them to specialized indexes rather than a single general-purpose vector store. Temporal facts to a timeline index. Entity facts to an entity graph. Procedural knowledge to a task index. The benchmark shows that specialization beats generalization significantly.

VEKTOR v1.7.2, which carries the architecture that produced this result, is completing testing ahead of release. v1.6.3 is live now. The SDK is local-first and available at vektormemory.com.

VEKTOR Slipstream is a local-first persistent memory SDK for AI agents. No cloud required. vektormemory.com


Top comments (1)

mote • Jun 18

79% on LongMemEval with local SQLite against full-context GPT-4 is a strong result. I've been working on moteDB — an embedded multimodal database for AI agents — and reached the same conclusion: context windows are a bottleneck pretending to be a feature. Once memory lives on-device in a real database, the architecture shifts from "how much fits in the prompt" to "what's relevant right now."

One thing that bit me: naive similarity search on SQLite started choking around 50K entries. I ended up sharding by recency and building a tiered index — not elegant but it keeps latency under 10ms for the hot tier.

How's your retrieval latency holding up at scale? Did you hit the same wall or find a cleaner approach?