DEV Community

t49qnsx7qt-kpanks

MnemoPay v1.4.0: 77.2% on LongMemEval, 1M-op stress test, and what the architecture actually looks like

The thing that bugs me about most "agent memory" benchmarks is they test retrieval on short histories. Your agent handles 10 sessions, you confirm it remembers your name, you ship. LongMemEval tests something harder: given a long multi-session history with conflicting updates and temporal gaps, does the agent retrieve the right fact at the right time? 500 questions, judged by GPT-4o against oracle answers.

We ran it on @mnemopay/sdk v1.4.0. Oracle score: 77.2%. The weakest bucket was multi-session at 66.9%, which is the hardest category and the one we're actively working on. Every published baseline we could find for comparable setups was lower. That's the number — no editorializing.

Here's what's underneath it.

Memory that forgets properly

Most SDKs treat memory as a key-value store with timestamps. Write, retrieve, done. The problem is that agents accumulate contradictory information over time. "User prefers morning meetings" written three months ago competes with "user now works nights" written last week. A flat retrieval system doesn't resolve that — it just returns both and hopes the LLM figures it out.

MnemoPay applies Ebbinghaus-style forgetting curves to stored memories. Every memory entry has a recency score and a retrieval frequency score. Memories that haven't been accessed in a while decay toward lower salience. Memories that keep getting retrieved get reinforced, loosely in the spirit of Hebbian learning. The model that retrieves from memory sees a ranked, time-aware list, not a flat blob.
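
To make the decay-plus-reinforcement idea concrete, here's a minimal sketch. It is not the SDK's internal formula: the `MemoryEntry` shape, the exponential retention curve, and the 1.5x stability growth on retrieval are all assumptions picked for illustration.

```typescript
// Hypothetical shape of a stored memory; field names are illustrative, not the SDK's.
interface MemoryEntry {
  content: string;
  lastAccessedMs: number; // timestamp of the most recent write or retrieval
  stabilityMs: number;    // how slowly this memory decays; grows with each retrieval
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Ebbinghaus-style retention: R = exp(-elapsed / stability), in (0, 1].
function salience(entry: MemoryEntry, nowMs: number): number {
  const elapsed = Math.max(0, nowMs - entry.lastAccessedMs);
  return Math.exp(-elapsed / entry.stabilityMs);
}

// Reinforcement on retrieval: reset the clock and make future decay slower.
// The 1.5 growth factor is an arbitrary illustrative choice.
function reinforce(entry: MemoryEntry, nowMs: number): MemoryEntry {
  return { ...entry, lastAccessedMs: nowMs, stabilityMs: entry.stabilityMs * 1.5 };
}

// Rank candidates by current salience, highest first, without mutating the input.
function rank(entries: MemoryEntry[], nowMs: number): MemoryEntry[] {
  return [...entries].sort((a, b) => salience(b, nowMs) - salience(a, nowMs));
}

const now = 100 * DAY_MS;
const stale: MemoryEntry = {
  content: "user prefers morning meetings",
  lastAccessedMs: now - 90 * DAY_MS,
  stabilityMs: 30 * DAY_MS,
};
const fresh: MemoryEntry = {
  content: "user now works nights",
  lastAccessedMs: now - 7 * DAY_MS,
  stabilityMs: 30 * DAY_MS,
};

const ranked = rank([stale, fresh], now);
console.log(ranked.map((m) => m.content)); // the week-old memory ranks first
```

With these numbers, the three-month-old preference has decayed to roughly 5% salience while the week-old one sits near 79%, so the newer fact wins the ranking without the older one being deleted.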

The 66.9% multi-session score tells you this works but isn't complete. Multi-session queries require threading context across gaps where the agent wasn't running. That's harder. We know what's missing and it's on the roadmap, but I'm not going to claim we've solved it.

Anomaly detection on behavioral drift

One of the nastier problems in production agent deployments is that agents can develop behavioral drift — they start doing things differently than they were doing them last week, not because anything changed in their instructions, but because their memory state shifted. If you're not monitoring for that, you won't catch it until something goes wrong.

v1.4.0 ships EWMA anomaly detection on behavioral signals. Each agent has a rolling baseline for its decision patterns. When observed behavior deviates past a threshold, the SDK flags it. The math is straightforward:

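// EWMA anomaly detection: simplified view
const alpha = 0.15; // smoothing factor
// Score the observation against the baseline *before* folding it in;
// updating first would pull the baseline toward the outlier and dampen the deviation.
const deviation = Math.abs(observed - ewma) / Math.max(Math.abs(ewma), Number.EPSILON);
if (deviation > ANOMALY_THRESHOLD) {
  await flagAnomaly(agentId, { observed, ewma, deviation });
}
ewma = alpha * observed + (1 - alpha) * ewma; // then update the rolling baseline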

This runs on every payment event, not just periodically. Canary honeypots supplement it — deliberate tripwires in the memory store that should never be retrieved. If they get retrieved, something's wrong with the retrieval path.
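
The canary idea can be sketched in a few lines. This is an illustration under assumptions, not the SDK's code: the `RetrievedMemory` shape, the canary IDs, and `screenRetrieval` are all hypothetical names.

```typescript
// Hypothetical record shape; the SDK's real retrieval results may differ.
interface RetrievedMemory {
  id: string;
  content: string;
}

// IDs of planted tripwire entries that no legitimate query should ever match.
const canaryIds = new Set<string>(["canary-7f3a", "canary-b21c"]);

// Split a retrieval result into safe entries and tripped canaries.
function screenRetrieval(
  results: RetrievedMemory[],
): { safe: RetrievedMemory[]; tripped: string[] } {
  const safe: RetrievedMemory[] = [];
  const tripped: string[] = [];
  for (const r of results) {
    if (canaryIds.has(r.id)) {
      tripped.push(r.id); // a canary surfaced: the retrieval path is suspect
    } else {
      safe.push(r);
    }
  }
  return { safe, tripped };
}

const screened = screenRetrieval([
  { id: "mem-001", content: "user prefers async communication" },
  { id: "canary-7f3a", content: "planted entry that should never be returned" },
]);
console.log(screened.tripped); // [ 'canary-7f3a' ]
```

A non-empty `tripped` list is the signal to alert on; the canary entries themselves are withheld from whatever gets passed to the model.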

Memory integrity with Merkle hashing

Every time a memory is written or updated, v1.4.0 hashes it into a Merkle chain. You can verify at any point that a memory state hasn't been tampered with — by the agent, by a bug, or by an adversary. Forget operations are auditable. The hash of what was forgotten is retained.

This matters more than it sounds for anything with financial stakes. If your agent managed a budget last month and you need to audit what it knew when it made a decision, you need tamper-evident memory history. That's what this is.
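
Here's a rough sketch of what tamper-evident memory history can look like, using a plain SHA-256 hash chain from Node's built-in `crypto` module. The SDK's actual Merkle structure may well differ; the `ChainEntry` layout and the `append`/`verify` helpers are assumptions for illustration only.

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string): string => createHash("sha256").update(s).digest("hex");

// Hypothetical log entry; the SDK's real storage format is not shown in this post.
interface ChainEntry {
  op: "write" | "forget";
  contentHash: string; // hash of the memory content; the content itself isn't needed to verify
  prevHash: string;    // hash of the previous entry, or GENESIS for the first one
  hash: string;        // sha256 over op, contentHash and prevHash
}

const GENESIS = "0".repeat(64);

// Append an operation, linking it to the hash of the entry before it.
function append(chain: ChainEntry[], op: ChainEntry["op"], content: string): ChainEntry[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const contentHash = sha256(content);
  const hash = sha256(`${op}:${contentHash}:${prevHash}`);
  return [...chain, { op, contentHash, prevHash, hash }];
}

// Recompute every link; any edited or reordered entry breaks the chain.
function verify(chain: ChainEntry[]): boolean {
  let prev = GENESIS;
  for (const e of chain) {
    if (e.prevHash !== prev) return false;
    if (e.hash !== sha256(`${e.op}:${e.contentHash}:${e.prevHash}`)) return false;
    prev = e.hash;
  }
  return true;
}

let chain: ChainEntry[] = [];
chain = append(chain, "write", "user prefers async communication");
chain = append(chain, "forget", "user prefers async communication");
console.log(verify(chain)); // true

// Tamper with the first entry's content hash; verification now fails.
const tampered = chain.map((e, i) =>
  i === 0 ? { ...e, contentHash: sha256("something else") } : e,
);
console.log(verify(tampered)); // false
```

Note that the forget entry stores only the hash of what was forgotten, so the deletion is auditable without keeping the content around.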

Payment rails + portable agent credit score

The other half of MnemoPay is payments. Stripe, Paystack, and Lightning are all live. Real rails, not mock transactions. Every payment event generates an HMAC-SHA256-signed receipt, so you have a cryptographically verifiable trail of what the agent spent and when.
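
For anyone who hasn't worked with signed receipts, this is roughly what signing and verifying with HMAC-SHA256 looks like using Node's built-in `crypto` module. The `Receipt` fields, the canonical serialization, and the shared-secret handling are assumptions for the sketch, not the SDK's actual receipt format.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical receipt shape; field names are illustrative.
interface Receipt {
  agentId: string;
  amount: number;
  currency: string;
  rail: string;
  timestamp: string;
}

// Serialize with a fixed field order so signer and verifier hash identical bytes.
function canonicalize(r: Receipt): string {
  return JSON.stringify([r.agentId, r.amount, r.currency, r.rail, r.timestamp]);
}

function sign(r: Receipt, secret: string): string {
  return createHmac("sha256", secret).update(canonicalize(r)).digest("hex");
}

function verifyReceipt(r: Receipt, signature: string, secret: string): boolean {
  const expected = Buffer.from(sign(r, secret), "hex");
  const given = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === given.length && timingSafeEqual(expected, given);
}

const receipt: Receipt = {
  agentId: "agent-123",
  amount: 4.99,
  currency: "usd",
  rail: "stripe",
  timestamp: "2025-01-01T00:00:00Z",
};
const secret = "demo-secret-do-not-use";
const signature = sign(receipt, secret);

console.log(verifyReceipt(receipt, signature, secret));                    // true
console.log(verifyReceipt({ ...receipt, amount: 499 }, signature, secret)); // false
```

The constant-time comparison matters: a plain `===` on the hex strings can leak how many leading characters matched.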

On top of that, v1.4.0 ships a portable agent credit score (300-850). The score is built from behavioral signals: payment consistency, anomaly history, retrieval accuracy over time, identity verification status. The idea is that agents operating across different systems shouldn't have to start their reputation from scratch every time. An agent that's been running reliably for six months should carry that history.

The score is embedded in the JWT the SDK issues, so any system that accepts the token can read the score without making a separate call.
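
On the consuming side, reading a claim out of a JWT payload looks something like the sketch below. The claim name `credit_score` is a guess for illustration (check the docs for the real one), and this only decodes the payload. A real consumer has to verify the token's signature before trusting anything in it.

```typescript
// Decode (NOT verify) the payload segment of a compact JWT.
function decodeJwtPayload(token: string): Record<string, unknown> {
  const parts = token.split(".");
  const payload = parts[1];
  if (parts.length !== 3 || payload === undefined) {
    throw new Error("not a compact JWT");
  }
  const json = Buffer.from(payload, "base64url").toString("utf8");
  return JSON.parse(json) as Record<string, unknown>;
}

// Read the score claim and sanity-check it against the 300-850 range.
// "credit_score" is a hypothetical claim name.
function readCreditScore(token: string): number {
  const score = decodeJwtPayload(token)["credit_score"];
  if (typeof score !== "number" || score < 300 || score > 850) {
    throw new Error("missing or out-of-range credit score claim");
  }
  return score;
}

// Build an unsigned demo token just to exercise the decoder.
const demoToken = [
  Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" })).toString("base64url"),
  Buffer.from(JSON.stringify({ sub: "agent-123", credit_score: 742 })).toString("base64url"),
  "signature-not-checked-here",
].join(".");

console.log(readCreditScore(demoToken)); // 742
```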

import { MnemoPay } from '@mnemopay/sdk';

const client = new MnemoPay({ apiKey: process.env.MNEMOPAY_API_KEY });

// store a memory with automatic decay + Merkle hash
await client.memory.store(agentId, {
  content: 'user prefers async communication over meetings',
  context: 'onboarding-session-3',
  importance: 0.85
});

// retrieve with decay-weighted ranking
const memories = await client.memory.recall(agentId, {
  query: 'user communication preferences',
  limit: 5
});

// initiate a payment — HMAC receipt generated automatically
const payment = await client.payments.charge({
  agentId,
  amount: 4.99,
  currency: 'usd',
  rail: 'stripe',
  memo: 'API call batch — vector search'
});
// read the agent credit score from the session token
const { score, tier, flags } = await client.identity.getCreditScore(agentId);
// score: 742, tier: 'trusted', flags: []

Numbers

  • 800+ tests, all passing
  • 1M-op production stress test with zero data corruption
  • 3,549 npm downloads since March 21 (331/week run rate)
  • Python SDK: 297 downloads
  • Apache 2.0 license, no lock-in
  • Three MCP directory listings: Smithery, ClawHub, mcpservers.org
npm install @mnemopay/sdk
# or
pip install mnemopay

Docs and the LongMemEval methodology write-up are at https://mnemopay.com. If you're building agents that handle money or need auditable memory history, that's the place to start.
