We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper.

by VEKTOR Memory — 10 min read

I found this paper while digging through arXiv today; there are so many great papers and so little time in the day to read them all.

A researcher at Microsoft published a paper in May 2026 measuring how well AI agents can continue tasks after their memory has been transferred to a different model.

The Transfer Continuity Score they reported was 0.88, tested on GPT-4 Turbo across 50 engineering scenarios. We ran the same benchmark against VEKTOR Slipstream and scored 0.894. This article explains the methodology, the honest caveats, what we updated and built into Vex as a result, and why the lift ratio matters more than the headline number.

Why agent memory migration is fragmented

Every agent framework ships with some version of memory. Most of them store conversation history in a growing buffer that eventually hits token limits, gets truncated from the bottom, and produces an agent that can remember what happened five minutes ago but not five days ago. The more serious implementations use vector stores, which at least scale, but they introduce a different problem: the memories are trapped in whatever format the vector store uses. Move to a different framework and you either write a migration script or start over.

VEKTOR Slipstream is our answer to the storage problem: SQLite-backed persistent memory with BM25+RRF recall and no cloud dependency. After a year of daily use I have 5725 memories in mine. Vex is our answer to the portability problem: a CLI tool that exports those memories to .vmig.jsonl, an open interchange format with connectors for Pinecone, Qdrant, Chroma, Weaviate, pgvector, and VEKTOR itself.

What neither of these addressed until this week was integrity verification. You could export 5725 memories. You could not prove they were unchanged when someone imported them on the other side.

The Paper That Prompted All Of This

“Portable Agent Memory: A Protocol for Provenance-Verified Memory Transfer Across Heterogeneous LLM Agents” by Santhosh Kumar Ravindran at Microsoft is a good piece of systems research. It proposes a five-component memory model (Episodic, Semantic, Procedural, Working, Identity), a BLAKE3 Merkle-DAG for content-addressed integrity, Ed25519 signing, and a seven-step re-hydration pipeline that defends against prompt injection through recalled memory.

The benchmark result they report is a TCS of 0.88 across model pairs (Claude to GPT-4, GPT-4 to Gemini, Gemini to Claude), compared to a no-memory baseline of 0.35. That is a 2.51x lift.
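
The lift figure is plain arithmetic: the score with memory divided by the no-memory baseline. A minimal sketch using only the two PAM numbers quoted above; the helper name `liftRatio` is ours, for illustration, and does not come from the paper.

```javascript
// Lift ratio: how many times better the agent scores with memory than without.
// `liftRatio` is an illustrative helper, not part of the PAM paper or any SDK.
function liftRatio(withMemory, baseline) {
  if (!(baseline > 0)) {
    throw new RangeError('baseline must be a positive score');
  }
  return withMemory / baseline;
}

// PAM figures quoted above: 0.88 with memory, 0.35 without.
const pamLift = liftRatio(0.88, 0.35);
console.log(`${pamLift.toFixed(2)}x`); // prints "2.51x"
```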

The methodology is documented well enough to replicate, so we did.

Building The Benchmark Harness

Transfer Continuity Score (TCS) measures how much of the source agent’s task success the target agent retains once it has the transferred memory, so a score of 1.0 means the target agent performs identically to the source agent. It is distinct from the memory lift ratio, which is the score with memory divided by the score without it. The PAM paper uses 50 tasks across three categories: 20 Q&A recall tasks, 15 coding continuation tasks, and 15 planning tasks.

We wrote a Node.js benchmark that mirrors this structure exactly. Each task has a set of memories representing what the source agent “learned,” a natural language question, and a set of expected keywords that judge the answer. The judge is a normaliser rather than an LLM call, which keeps costs down and scoring deterministic. Writing a good normaliser turns out to be non-trivial: “eight million dollars” needs to match “$8m”, “September 1, 2026” needs to match “sep 2026”, “Net Promoter Score” needs to match “nps”, “48-hour window” needs to match “48 hours”. Four iterations to get it right.
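
To make the idea of a normalising judge concrete, here is a minimal sketch. It is not the harness's actual code: the rule table, the `normalise` and `judge` names, and the fraction-of-keywords scoring are all assumptions, and the four rules cover only the examples mentioned above.

```javascript
// Illustrative rewrite rules: each maps a surface form to one canonical token.
// A real judge would need many more rules than these four.
const RULES = [
  [/\beight million dollars\b/g, () => '$8m'],
  [/\bnet promoter score\b/g, () => 'nps'],
  [/\b(\d+)-hour window\b/g, (_, hours) => `${hours} hours`],
  [/\bseptember \d{1,2}, (\d{4})\b/g, (_, year) => `sep ${year}`],
];

// Lowercase, collapse whitespace, then apply every rewrite rule in order.
function normalise(text) {
  let out = text.toLowerCase().replace(/\s+/g, ' ').trim();
  for (const [pattern, replacer] of RULES) {
    out = out.replace(pattern, replacer);
  }
  return out;
}

// Fraction of expected keywords found in the normalised answer (0 to 1).
function judge(answer, expectedKeywords) {
  if (expectedKeywords.length === 0) return 0;
  const haystack = normalise(answer);
  const hits = expectedKeywords.filter((kw) => haystack.includes(normalise(kw)));
  return hits.length / expectedKeywords.length;
}

console.log(judge('Net Promoter Score dropped last quarter', ['nps', '48 hours'])); // 0.5
```

Because matching is by substring, a short keyword such as "nps" can also match inside a longer word; a stricter judge would compare on token boundaries.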

One upfront disclaimer: the PAM paper used GPT-4 Turbo for all evaluations. We used gpt-4o-mini for Q&A and coding tasks, and gpt-4o for planning tasks where the weaker model was paraphrasing too many exact figures. The comparison is directional, not perfectly controlled. We note this in the results.

Results
VEKTOR TCS Benchmark, June 2026

N=50 tasks, methodology: arXiv:2605.11032
| Category | VEKTOR | PAM (GPT-4 Turbo) |
|---|---|---|
| Q&A Recall | 0.916 (gpt-4o-mini) | 0.920 |
| Coding | 0.918 (gpt-4o-mini) | 0.870 |
| Planning | 0.840 (gpt-4o) | 0.850 |
| Overall TCS | 0.894 | 0.880 |
| No-memory baseline | 0.149-0.253 | 0.350 |
| Memory lift ratio | 6.61x | 2.51x |

The headline is 0.894 vs 0.880. VEKTOR wins, narrowly, with a model that is meaningfully weaker than GPT-4 Turbo. On coding specifically the result is 0.918 vs 0.870. That is the cleanest result in the data, because our coding runs used gpt-4o-mini throughout with no handicap adjustment, while PAM’s used GPT-4 Turbo.

The lift ratio deserves more attention than the raw TCS score. VEKTOR’s no-memory baseline is 0.149 to 0.253 depending on category; PAM’s is 0.350. The tasks are harder for gpt-4o-mini without context, so when memory does its job the improvement is proportionally larger. A 6.61x lift on planning tasks means an agent that could barely complete any planning tasks without memory is completing 84% of them with it. That is not a benchmark curiosity; it is the actual value proposition of persistent memory.

The caveats matter here. N=50 is small, as the PAM authors themselves note. We have not tested actual cross-model transfer, which is PAM’s headline claim and the scenario where their Merkle-DAG provenance story is most relevant. We used simulated recall rather than running memories through VEKTOR’s actual BM25+RRF pipeline. A complete evaluation is on the roadmap.

How The Five-Component Model Improved Planning By 11 Points
The first benchmark run scored planning at 0.732, well below PAM’s 0.850. The problem was visible in the failing tasks: questions about current goals, cost targets, and SOC 2 blockers require working memory, the type of memory that tracks active state and current status rather than historical facts or architectural decisions. In a flat memory store where all records are equal, working memories compete with semantic facts and episodic events during retrieval. The model sometimes retrieved the wrong things and missed the planning-critical details.

The fix was to implement the PAM five-component taxonomy in VEKTOR SDK v1.5.9. The memories table gets a memory_type column on boot (safe ALTER TABLE, ignored if it already exists). The remember() function accepts opts.memory_type. Recall applies type-aware weight multipliers before the cross-encoder reranker: planning queries boost working memory by 1.25x and procedural by 1.10x, coding queries boost procedural by 1.20x, Q&A queries treat all types equally.

    await memory.remember('TODO: deploy EU region by November, target 2026-11-01', {
      importance: 5,
      memory_type: 'working'
    });
    await memory.remember('JWT tokens expire after 24 hours, algorithm RS256', {
      importance: 4,
      memory_type: 'semantic'
    });

Planning went from 0.732 to 0.797 after adding type-aware weighting, then to 0.840 after fixing the keyword normaliser to handle ISO date formats. The component type taxonomy is not just a theoretical framework; it is a practical ranking signal.
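
The type-aware weighting can be sketched as a score multiplier applied before the final sort. This is an illustration under assumptions, not the SDK's code: the `score` and `memory_type` field names, the query category labels, and the `applyTypeWeights` helper are hypothetical, and the cross-encoder reranking that follows in the real pipeline is not modelled. Only the multiplier values come from the description above.

```javascript
// Multipliers quoted in the text; any type not listed for a category counts as 1.0.
const TYPE_WEIGHTS = {
  planning: { working: 1.25, procedural: 1.10 },
  coding: { procedural: 1.20 },
  qa: {},
};

// Re-score each candidate by its memory type, then sort best-first.
function applyTypeWeights(candidates, queryCategory) {
  const weights = TYPE_WEIGHTS[queryCategory] ?? {};
  return candidates
    .map((c) => ({ ...c, weighted: c.score * (weights[c.memory_type] ?? 1.0) }))
    .sort((a, b) => b.weighted - a.weighted);
}

// Made-up candidates with made-up base scores, for illustration only.
const candidates = [
  { id: 'jwt-expiry', memory_type: 'semantic', score: 0.90 },
  { id: 'eu-region-todo', memory_type: 'working', score: 0.80 },
  { id: 'deploy-runbook', memory_type: 'procedural', score: 0.85 },
];

console.log(applyTypeWeights(candidates, 'planning').map((c) => c.id));
// [ 'eu-region-todo', 'deploy-runbook', 'jwt-expiry' ]
```

With a Q&A query the same candidates keep their original order, since every type is weighted equally.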

BLAKE3 + Ed25519 Signing
Reading the PAM paper carefully, the cryptographic layer is not decorative. The whole premise of verified memory transfer is that you can prove the memories you received are the memories that were sent, especially when they cross trust boundaries between agents, between models, or between organisations.

Vex had a SHA-256 checksum in the sidecar metadata but nothing that could prove per-record integrity. Someone could edit ten records and recalculate the checksum and we would never know.

The implementation we shipped in v0.4.0 follows the PAM design. vex sign reads every record, serialises it in canonical JSON form (fixed field order, no whitespace), computes a BLAKE3 hash per record, assembles those into an array, hashes the array to produce a root hash, then signs the root hash with Ed25519. The result is a .vmig.sig sidecar containing the per-record hashes, the root hash, the signature, and the public key.

vex verify inverts this: rehash every record, recompute the root, check it matches, verify the Ed25519 signature. The public key is embedded in the .vmig.sig file so anyone can verify without the private key.
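
The sign and verify steps can be sketched with Node's built-in crypto module. Two loud assumptions: Node's standard library has no BLAKE3, so SHA-256 stands in for it here, and "canonical JSON" is implemented as sorted keys with no whitespace, which is one way to fix the field order but may not be the one Vex uses. The function names and the sidecar shape are illustrative, not the .vmig.sig format.

```javascript
const { createHash, createPublicKey, generateKeyPairSync, sign, verify } = require('node:crypto');

// Stand-in hash: SHA-256 instead of BLAKE3 (see the note above).
const hashHex = (text) => createHash('sha256').update(text, 'utf8').digest('hex');

// Canonical JSON: keys sorted, no whitespace, so equal records always hash equally.
function canonicalJson(value) {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const fields = Object.keys(value).sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(value[key])}`);
    return `{${fields.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Hash every record, then hash the array of record hashes to get the root.
function rootOf(records) {
  const recordHashes = records.map((r) => hashHex(canonicalJson(r)));
  return { recordHashes, rootHash: hashHex(JSON.stringify(recordHashes)) };
}

// Sign the root hash with Ed25519 and embed the public key in the sidecar.
function signRecords(records, privateKey) {
  const { recordHashes, rootHash } = rootOf(records);
  const signature = sign(null, Buffer.from(rootHash, 'hex'), privateKey).toString('base64');
  const publicKey = createPublicKey(privateKey).export({ type: 'spki', format: 'pem' });
  return { recordHashes, rootHash, signature, publicKey };
}

// Rehash, compare roots, then check the signature against the embedded public key.
function verifyRecords(records, sidecar) {
  const { rootHash } = rootOf(records);
  if (rootHash !== sidecar.rootHash) return false;
  const signature = Buffer.from(sidecar.signature, 'base64');
  return verify(null, Buffer.from(rootHash, 'hex'), sidecar.publicKey, signature);
}

const { privateKey } = generateKeyPairSync('ed25519');
const records = [
  { id: 1, text: 'JWT tokens expire after 24 hours' },
  { id: 2, text: 'Deploy EU region by November' },
];
const sidecar = signRecords(records, privateKey);
console.log(verifyRecords(records, sidecar)); // true
```

Editing any record after signing changes its hash, which changes the root, so verification fails before the signature is even checked.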

    vex sign memories.vmig.jsonl
    [vex sign] hashing 5725 records...
    [vex sign] root hash: blake3:fdeadd91b9d18ba0b47c63fa5...
    [vex sign] signature → memories.vmig.sig

    vex verify memories.vmig.jsonl
    [vex verify] hashing 5725 records...
    Signature valid, file has not been tampered with

We tested this against our actual 5725-memory export. The signing runs in about two seconds, verification in about the same. The --sign flag on vex export chains them automatically.

One important scoping note: this proves transport integrity, not origin integrity. Signing a corrupt or poisoned export gives you a valid signature on bad data. The cryptographic guarantee is that the file was not modified after it was signed, not that the content is trustworthy.

Selective Disclosure With --components

The --components flag exports a subset of memories filtered by type. The use case is multi-agent handoffs where different agents need different context: a planning agent needs working and procedural memories, a Q&A agent over a codebase needs semantic and episodic memories, a personalisation layer needs identity memories.

    # Export working memory for a planning agent
    vex export --from vektor \
      --db slipstream-memory.db \
      --components "working,procedural" \
      --output planning-context.vmig.jsonl \
      --sign

    # Export semantic facts only
    vex export --from vektor \
      --db slipstream-memory.db \
      --components "semantic" \
      --output facts.vmig.jsonl

This requires that memories are stored with memory_type set, which existing databases will not have (they default to semantic). The value accumulates as you adopt the typing convention going forward. Testing against our own database with all memories defaulting to semantic correctly exports all 5725 records under --components semantic and zero records under --components working, which confirms the pipeline works.

PAM implements selective disclosure with signed capability tokens that specify read and export permissions per component type. Our implementation is simpler: a filter flag applied post-export. The token system is a v0.5 project.
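
A filter of this kind can be sketched in a few lines. This is an illustration, not Vex's implementation: the function names are hypothetical, and treating a missing memory_type as semantic is our reading of the default described above.

```javascript
const COMPONENT_TYPES = ['episodic', 'semantic', 'procedural', 'working', 'identity'];

// Parse a comma-separated list such as "working,procedural" and reject unknown names.
function parseComponents(spec) {
  const wanted = spec.split(',').map((s) => s.trim().toLowerCase()).filter(Boolean);
  const unknown = wanted.filter((name) => !COMPONENT_TYPES.includes(name));
  if (unknown.length > 0) {
    throw new Error(`unknown component type(s): ${unknown.join(', ')}`);
  }
  return new Set(wanted);
}

// Keep only records whose type was requested; untyped records count as semantic.
function filterByComponents(records, spec) {
  const wanted = parseComponents(spec);
  return records.filter((r) => wanted.has(r.memory_type ?? 'semantic'));
}

// An untyped legacy export: everything falls under "semantic", nothing under "working".
const legacy = [{ text: 'a' }, { text: 'b' }, { text: 'c' }];
console.log(filterByComponents(legacy, 'semantic').length); // 3
console.log(filterByComponents(legacy, 'working').length); // 0
```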

LangChain Adapter

The PAM GitHub repository has integrations for Claude Code and OpenAI Codex but LangChain is listed as a roadmap item. LangChain’s BaseMemory interface requires two methods: loadMemoryVariables to retrieve context before a turn, and saveContext to store the exchange after it. Wrapping VEKTOR's recall() and remember() calls is about thirty lines of code.

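    import { ConversationChain } from 'langchain/chains';
    import { createVektorMemory } from '@vektormemory/vex/adapters/langchain';

    // Initialise the SDK from a database path and wrap it as LangChain memory.
    const mem = await createVektorMemory({
      dbPath: './agent.db',
      topK: 5,
      importance: 3,
    });

    // `llm` is any LangChain chat model instance, configured elsewhere.
    const chain = new ConversationChain({ llm, memory: mem });
    await chain.call({ input: 'What is our auth configuration?' });
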
The createVektorMemory factory handles SDK initialisation from a DB path. If @langchain/core is not installed, the adapter degrades to a duck-typed class that implements the same interface, so it works in environments where LangChain is optional. Questions are tagged as working memory at higher importance, conversation turns as episodic.
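
For readers who want to see the shape of such a duck-typed class, here is a minimal sketch. It is not the adapter's source: `StubStore` is a hypothetical in-memory stand-in for the SDK, its `recall()` and `remember()` signatures are assumptions, and the way inputs and outputs are typed is only an approximation of the tagging described above.

```javascript
// Hypothetical in-memory stand-in for the SDK; real signatures may differ.
class StubStore {
  constructor() {
    this.items = [];
  }
  async remember(text, opts = {}) {
    this.items.push({ text, ...opts });
  }
  // Naive recall: keep memories that contain any word of the query.
  async recall(query, topK) {
    const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
    return this.items
      .filter((m) => terms.some((t) => m.text.toLowerCase().includes(t)))
      .slice(0, topK);
  }
}

// Duck-typed memory: exposes the members LangChain's memory contract calls.
class SketchMemory {
  constructor(store, { topK = 5, memoryKey = 'history' } = {}) {
    this.store = store;
    this.topK = topK;
    this.memoryKey = memoryKey;
  }
  get memoryKeys() {
    return [this.memoryKey];
  }
  // Called before a turn: recall context relevant to the incoming input.
  async loadMemoryVariables(values) {
    const hits = await this.store.recall(String(values.input ?? ''), this.topK);
    return { [this.memoryKey]: hits.map((m) => m.text).join('\n') };
  }
  // Called after a turn: store the question and the reply.
  async saveContext(inputValues, outputValues) {
    await this.store.remember(`Human: ${inputValues.input}`, { memory_type: 'working' });
    await this.store.remember(`AI: ${outputValues.response}`, { memory_type: 'episodic' });
  }
}
```

Because the class depends only on method names, it can be passed wherever a chain expects a memory object, whether or not @langchain/core is installed.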

What v0.4.0 Ships

The full changeset:

vex sign and vex verify: BLAKE3 per-record hashing, Ed25519 root signing, .vmig.sig sidecar. Requires npm install @noble/hashes @noble/ed25519.

--components: Selective disclosure by memory type. Accepts comma-separated list of episodic, semantic, procedural, working, identity.

--sign: Auto-sign flag on vex export, chains signing immediately after the export stream closes.

adapters/langchain.js: VektorMemory and createVektorMemory exports. Drop-in BaseMemory for LangChain chains.

Dynamic schema detection: The vektor connector previously hardcoded column names that do not match across SDK versions. It now uses PRAGMA table_info(memories) to detect whatever schema the database actually has, handling embedding vs vector column names, missing metadata columns, and varied created_at formats transparently.

VEKTOR SDK v1.5.9: memory_type column added to memories table on boot. remember() accepts opts.memory_type. recall() applies type-aware weight multipliers before cross-encoder reranking.

All seven connectors unchanged: vektor, jsonl, pinecone, qdrant, chroma, weaviate, pgvector.
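
The dynamic schema detection in the changeset can be sketched as a pure function over the rows that PRAGMA table_info(memories) returns, one row per column with a `name` field. Everything else here is an assumption for illustration: the helper names, the returned shape, and the `metadata` column name are ours, not the connector's.

```javascript
// Pick the first candidate column name that exists in the table, or null if none does.
function pickColumn(columnNames, candidates) {
  return candidates.find((name) => columnNames.has(name)) ?? null;
}

// `tableInfoRows` is the result of PRAGMA table_info(memories).
function detectSchema(tableInfoRows) {
  const names = new Set(tableInfoRows.map((row) => row.name));
  return {
    vectorColumn: pickColumn(names, ['embedding', 'vector']),
    hasMetadata: names.has('metadata'),
    hasMemoryType: names.has('memory_type'),
    hasCreatedAt: names.has('created_at'),
  };
}

// An older-style table that stores vectors under "vector" and has no metadata column.
const olderTable = [{ name: 'id' }, { name: 'content' }, { name: 'vector' }, { name: 'created_at' }];
console.log(detectSchema(olderTable).vectorColumn); // vector
```

A connector built this way can branch on the detected shape when building its SELECT instead of assuming one fixed set of columns.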

What This Does Not Prove

The benchmark result is 0.894 vs 0.880. The model asymmetry means the gap could flip on an equal comparison. We do not know yet.

We have not run actual cross-model transfer, which is the core PAM claim. The benchmark simulates recall by sorting memories by importance and type weight before injecting them. VEKTOR’s actual recall pipeline (BM25+RRF, stemmed BM25, named entity injection, MAGMA graph traversal, cross-encoder reranking) is meaningfully more sophisticated than the simulation and would likely score higher, but that is not what we measured.

N=50 is small. The PAM authors say so, and so do we. A proper evaluation needs N=500 or more, real accumulated memories from extended sessions, and cross-model transfer tests where a different model imports via Vex and continues a task from scratch.

The memory_type benefits depend on keeping up the typing discipline going forward. They do not retroactively improve memories stored without types.

Next

The cross-model transfer benchmark is the obvious next test. Export via Vex, sign it, import to a fresh database, run the 50 tasks on the imported memories with a different model. That is what PAM’s TCS was actually measuring, and our 0.894 result says nothing about it yet.

On the SDK side, the G-Formula causal phase (v1.6.0) is in a seven-day seeded testing period targeting 71.9% on LoCoMo. The v0.5 Vex roadmap includes vex diff for comparing memory snapshots, capability token scoping for enterprise handoffs, and import connectors for Mem0 and Letta so memories from other ecosystems can migrate in.

The benchmark harness is shipping in the SDK as benchmark.js so you can run it against your own memory database. If you get numbers meaningfully different from ours, we want to know.

Vex v0.4.0 is on npm as @vektormemory/vex. VEKTOR Slipstream SDK v1.5.9 is at vektormemory.com. Benchmark methodology follows arXiv:2605.11032.

AI, Agent Memory, LLM, Open Source, Benchmarks, Vector Database
