DEV Community: Baran Özdemir

Why agents need memory that improves itself

Baran Özdemir — Thu, 11 Jun 2026 23:55:59 +0000

"Agent memory" usually means a vector database: embed everything the user said, query by similarity, paste the top matches into the prompt. It's a useful trick, but it isn't memory. It's a lookup table that never learns, never forgets correctly, and can't tell you what was true last month versus today. An agent built on it doesn't get smarter the longer you run it — it just accumulates more haystack to search.

The name Eidentic is deliberate: an agent without memory has no identity. We think real memory needs four things working together.

1. Facts with a lifetime

Plain vector recall has no concept of time. If a user was on the starter plan in March and the team plan in June, both sentences sit in the index with equal weight, and the model picks whichever embeds closer. That's how agents confidently tell you yesterday's truth.

Eidentic stores facts in a temporal knowledge graph where each fact carries a validity interval. New information supersedes the old without deleting it: the agent can answer "what plan are they on now" and "what plan were they on in April" from the same store, and contradictions resolve instead of piling up. Memory that can't reason about time isn't memory — it's a cache.
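To make the validity-interval idea concrete, here is a minimal sketch. The `Fact` shape and the `TemporalFacts` class are hypothetical names invented for this illustration, not Eidentic's actual schema or API, and the dates are made up to match the March/June example above. It assumes facts are recorded in chronological order.

```typescript
// Hypothetical types for illustration only; not Eidentic's real schema.
interface Fact {
  subject: string;
  predicate: string;
  value: string;
  validFrom: Date;
  validTo: Date | null; // null means the fact is still valid
}

class TemporalFacts {
  private readonly facts: Fact[] = [];

  // Record a new value. The fact it supersedes is kept, but its interval is closed.
  // Assumes calls arrive in chronological order.
  record(subject: string, predicate: string, value: string, at: Date): void {
    for (const f of this.facts) {
      if (f.subject === subject && f.predicate === predicate && f.validTo === null) {
        f.validTo = at;
      }
    }
    this.facts.push({ subject, predicate, value, validFrom: at, validTo: null });
  }

  // Return the value whose interval [validFrom, validTo) contains the given instant.
  valueAt(subject: string, predicate: string, at: Date): string | undefined {
    const t = at.getTime();
    const hit = this.facts.find(
      (f) =>
        f.subject === subject &&
        f.predicate === predicate &&
        f.validFrom.getTime() <= t &&
        (f.validTo === null || t < f.validTo.getTime()),
    );
    return hit?.value;
  }
}

const planFacts = new TemporalFacts();
planFacts.record("user-42", "plan", "starter", new Date("2026-03-01T00:00:00Z"));
planFacts.record("user-42", "plan", "team", new Date("2026-06-01T00:00:00Z"));

const aprilPlan = planFacts.valueAt("user-42", "plan", new Date("2026-04-15T00:00:00Z"));
const currentPlan = planFacts.valueAt("user-42", "plan", new Date("2026-06-10T00:00:00Z"));
console.log(aprilPlan, currentPlan);
```

Both questions are answered from the same store because the older fact is closed rather than deleted.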

2. Memory the agent edits itself

People don't store a transcript of every conversation; they keep a running summary and revise it. Eidentic gives agents self-editing memory blocks — compact, structured notes the agent rewrites as it learns — plus passive extraction that pulls salient facts out of every turn automatically. You don't write ingestion pipelines or decide what to remember; the agent maintains its own working memory.
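As a rough picture of what a self-editing block could look like, the sketch below models a size-capped note and one edit operation an agent might issue as a tool call. `MemoryBlock` and `rewriteBlock` are hypothetical names for this illustration; the post does not describe Eidentic's real block format.

```typescript
// Hypothetical shape for illustration; Eidentic's real memory blocks may differ.
interface MemoryBlock {
  label: string;
  value: string;
  limit: number; // maximum characters, so the block stays prompt-sized
}

// The kind of edit an agent could issue as a tool call: swap one snippet for another.
function rewriteBlock(block: MemoryBlock, oldText: string, newText: string): MemoryBlock {
  if (!block.value.includes(oldText)) {
    throw new Error(`"${oldText}" not found in block "${block.label}"`);
  }
  // A replacer function avoids special treatment of "$" sequences in newText.
  const value = block.value.replace(oldText, () => newText);
  if (value.length > block.limit) {
    throw new Error(`Block "${block.label}" would exceed ${block.limit} characters`);
  }
  return { ...block, value };
}

const profile: MemoryBlock = {
  label: "user_profile",
  value: "Plan: starter. Prefers email.",
  limit: 200,
};
const updatedProfile = rewriteBlock(profile, "Plan: starter", "Plan: team");
console.log(updatedProfile.value);
```

The character limit is the design choice worth noting: a block that cannot grow without bound forces revision instead of accumulation.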

3. Consolidation between sessions

If memory only ever grows, retrieval gets slower and noisier over time — the opposite of improving. Eidentic runs sleep-time consolidation: between sessions it compresses and merges what was learned, so the next session starts knowing more without a larger prompt. This is the step that makes memory self-improving rather than merely cumulative.
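The shape of a consolidation pass (many raw notes in, fewer out) can be sketched as below. This is a deliberately simple stand-in under stated assumptions: `RawNote` and `consolidate` are hypothetical names, notes are merged by an exact key, and the newest note wins. The real step presumably does something richer than this; the post does not specify how.

```typescript
// Hypothetical note shape for illustration only.
interface RawNote {
  key: string; // what the note is about, e.g. "plan"
  text: string;
  seenAt: number; // logical timestamp of the turn it came from
}

// Keep only the most recent note per key, ordered by when it was last seen.
function consolidate(notes: RawNote[]): RawNote[] {
  const latest = new Map<string, RawNote>();
  for (const note of notes) {
    const previous = latest.get(note.key);
    if (previous === undefined || note.seenAt > previous.seenAt) {
      latest.set(note.key, note);
    }
  }
  return Array.from(latest.values()).sort((a, b) => a.seenAt - b.seenAt);
}

const rawNotes: RawNote[] = [
  { key: "plan", text: "User is on the starter plan", seenAt: 1 },
  { key: "contact", text: "User prefers email", seenAt: 2 },
  { key: "plan", text: "User moved to the team plan", seenAt: 3 },
  { key: "contact", text: "User prefers email", seenAt: 4 },
];
const compactNotes = consolidate(rawNotes);
console.log(compactNotes.map((n) => n.text));
```

Four raw notes become two, so the working set that feeds the next session's prompt shrinks even though more was learned.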

4. Recall you can trust

Lexical and vector retrieval each miss things the other catches, so Eidentic fuses both with reciprocal-rank fusion and returns results with citations. An answer drawn from memory can point at the session and the fact it came from — which matters the moment an agent does anything consequential.
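Reciprocal-rank fusion itself is a small, well-known formula: each result scores the sum of 1 / (k + rank) over every list it appears in. The sketch below is a generic version, not Eidentic's implementation; the function name, the fact ids, and the constant k = 60 (a common default for this technique) are assumptions for illustration.

```typescript
// Generic reciprocal-rank fusion. Ranks start at 1; k dampens the weight of top positions.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60, // common default for RRF; an assumption here, not a documented Eidentic setting
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Hypothetical fact ids returned by two retrievers, best match first.
const lexicalHits = ["f3", "f1", "f7"];
const vectorHits = ["f1", "f9", "f3"];
const fused = reciprocalRankFusion([lexicalHits, vectorHits]);
console.log(fused.map((r) => r.id));
```

Items that both retrievers return ("f1" and "f3") rise above items only one of them found, which is the property the paragraph above relies on. Because only ranks are used, the two retrievers' raw scores never need to be on the same scale.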

Why this is the hard part

Wiring a model to a tool loop is a weekend. Memory that stays correct as it grows — across contradictions, across time, without ballooning the prompt — is the part teams underestimate and then rebuild three times. It's also what separates a demo from an agent you'd put in front of real users for months.

An agent without memory has no identity. Eidentic gives agents theirs — and keeps it honest as the world changes.

This isn't theoretical: on long histories it's measurably better and cheaper than stuffing everything into context, as the benchmarks show. If you're building something an agent has to remember, start with the docs or the source.

Memory beats full context on LongMemEval — and the wins we don't get

Baran Özdemir — Thu, 11 Jun 2026 23:48:44 +0000

A common objection to agent memory is that you don't need it: context windows are huge now, so just put the whole history in the prompt. We wanted a real answer, not a vibe, so we ran two public long-term-memory benchmarks against a full-context baseline. Here's what we found — including the case where the baseline wins.

The setup

We compared two configurations on the same questions. The full-context baseline stuffs the entire conversation history into the prompt. Eidentic memory ingests the history into its four-tier engine and retrieves only what each question needs. Both use the same model and the same LLM judge. We ran the full sets — no sampling — and we're publishing wins and losses together.

LongMemEval: memory wins across the board

LongMemEval uses long histories: roughly 115k tokens across ~50 sessions, with 500 questions in total. This is where memory should help, and it does: 55.2% overall vs 41.0% for full context, a 14.2-point gain, with wins on all six question types.

Question type	Full context	Eidentic memory
Single-session · user	67.1%	84.3%
Single-session · assistant	73.2%	92.9%
Single-session · preference	3.3%	26.7%
Multi-session	27.8%	42.1%
Temporal reasoning	20.3%	34.6%
Knowledge update	66.7%	70.5%
Overall	41.0%	55.2%
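For readers who want the per-type gaps rather than the raw scores, the short script below derives them from the table. The numbers are copied from the table as published; the variable names are just for this illustration.

```typescript
// [question type, full-context %, Eidentic memory %], copied from the table above.
const longMemEval: Array<[string, number, number]> = [
  ["Single-session · user", 67.1, 84.3],
  ["Single-session · assistant", 73.2, 92.9],
  ["Single-session · preference", 3.3, 26.7],
  ["Multi-session", 27.8, 42.1],
  ["Temporal reasoning", 20.3, 34.6],
  ["Knowledge update", 66.7, 70.5],
];

// Gain in percentage points, rounded to one decimal to hide floating-point noise.
const gains = longMemEval.map(([type, full, memory]) => ({
  type,
  gain: Math.round((memory - full) * 10) / 10,
}));

const byGain = gains.slice().sort((a, b) => b.gain - a.gain);
for (const { type, gain } of byGain) {
  console.log(`${type}: +${gain.toFixed(1)} points`);
}
```

The widest gap is on single-session preference questions (+23.4 points) and the narrowest is on knowledge update (+3.8 points).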

The cost difference is the other half of the story. Memory answers each question with about 2,550 tokens of retrieved context; the baseline spends about 99,435 re-reading the whole history every time — up to ~39× fewer tokens for the better score. Retrieval isn't just more accurate here, it's dramatically cheaper.
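The ratio is easy to check from the two averages quoted above. The helper name below is hypothetical, and the figures are the approximate per-question token counts from this post.

```typescript
// How many times more context tokens the baseline reads than retrieval does.
function contextRatio(fullContextTokens: number, retrievedTokens: number): number {
  return fullContextTokens / retrievedTokens;
}

const longMemEvalRatio = contextRatio(99_435, 2_550);
console.log(`${longMemEvalRatio.toFixed(1)}x fewer context tokens per question`);
```

This counts only the context supplied per question; it says nothing about ingestion or consolidation cost, which the post does not quantify here.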

LoCoMo: where full context still wins

LoCoMo has a much smaller haystack. When the entire history comfortably fits in the window, brute force is hard to beat: the model can see everything at once, and single- and multi-hop questions don't need retrieval. Here the full-context baseline comes out 7.8 points ahead. Memory still uses far fewer tokens (~893 vs ~19,030), but on a small history that trade-off doesn't pay for itself on accuracy.

The larger the history, the more memory wins — on accuracy and on cost. On small histories, full context stays competitive. We'd rather you know both numbers than just the flattering one.

What this means in practice

If your agent's conversations are short and bounded, you may not need a memory engine at all — and we'll tell you that. But the moment histories grow past what you want to pay to re-read on every turn, retrieval-based memory wins twice: better answers, far fewer tokens. That crossover arrives quickly in real products.

Full methodology, the harness, and the raw per-question records are in the benchmarks docs, and the runner lives in the repo. Reproduce it, and tell us where we're wrong.

Introducing Eidentic

Baran Özdemir — Thu, 11 Jun 2026 23:47:57 +0000

Today we're releasing Eidentic, an open-source TypeScript SDK for building AI agents with self-improving memory and the production fundamentals built in — not bolted on. It's Apache-2.0, with no enterprise tier, and it runs on Node, Bun, Deno, and the edge.

The two things you keep rebuilding

Every serious agent eventually needs the same two things, and most stacks make you assemble both yourself.

The first is memory that actually improves. Not a vector store you query and paste into a prompt, but something that remembers across sessions, resolves contradictions, and gets sharper the longer it runs. The second is the production layer: durable runs, cost limits that are actually enforced, multi-tenant isolation, sandboxed tools, evals that gate CI. In most ecosystems that layer shows up late, as an enterprise add-on, or never.

Eidentic ships both, in one composable package, fully open.

Thirty seconds to a memory-backed agent

npm install eidentic

import { Agent, AIModel, SqliteStore } from "eidentic";
import { anthropic } from "@ai-sdk/anthropic";

const agent = new Agent({
  id: "support",
  instructions: "You are a support agent. Remember the user.",
  model: new AIModel(anthropic("claude-sonnet-4-5")),
  store: new SqliteStore("./eidentic.sqlite"),
});

for await (const ev of agent.query("What did we decide last week?", {
  sessionId: "u-42",
})) {
  if (ev.type === "stream.delta") process.stdout.write(ev.delta.text);
}

The agent recalls prior sessions for that sessionId inside query(), with citations, and consolidates what it learned while idle. Swap SqliteStore for @eidentic/libsql or @eidentic/postgres and the agent code doesn't change — that's the ports-and-adapters design running through the whole SDK.
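The ports-and-adapters idea can be shown in miniature without any Eidentic code. In the sketch below, `TranscriptStore`, `MapStore`, `LogStore`, and `recordTurn` are all hypothetical names invented for this illustration; they are not the SDK's real store interface.

```typescript
// Hypothetical port, for illustration only; not Eidentic's actual store interface.
interface TranscriptStore {
  append(sessionId: string, message: string): void;
  read(sessionId: string): string[];
}

// Adapter 1: keeps one array per session in a Map.
class MapStore implements TranscriptStore {
  private readonly sessions = new Map<string, string[]>();
  append(sessionId: string, message: string): void {
    const messages = this.sessions.get(sessionId) ?? [];
    messages.push(message);
    this.sessions.set(sessionId, messages);
  }
  read(sessionId: string): string[] {
    return (this.sessions.get(sessionId) ?? []).slice();
  }
}

// Adapter 2: a flat append-only log, filtered on read.
class LogStore implements TranscriptStore {
  private readonly log: Array<{ sessionId: string; message: string }> = [];
  append(sessionId: string, message: string): void {
    this.log.push({ sessionId, message });
  }
  read(sessionId: string): string[] {
    return this.log.filter((e) => e.sessionId === sessionId).map((e) => e.message);
  }
}

// Caller code depends only on the port, so either adapter can be passed in unchanged.
function recordTurn(store: TranscriptStore, sessionId: string, message: string): number {
  store.append(sessionId, message);
  return store.read(sessionId).length;
}

const adapters: TranscriptStore[] = [new MapStore(), new LogStore()];
const turnCounts = adapters.map((store) => {
  recordTurn(store, "u-42", "We decided to ship on Friday.");
  return recordTurn(store, "u-42", "Reminder set.");
});
console.log(turnCounts);
```

`recordTurn` never names a concrete class, which is the same reason swapping the store package would leave agent code untouched.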

What's in the box

A four-tier memory engine: lexical + vector recall, self-editing memory blocks, a temporal knowledge graph, and sleep-time consolidation.
Durable execution — checkpoint and resume with exactly-once tool dispatch.
Enforced cost ceilings, rate limits, quotas, and multi-tenant isolation.
Sandboxed tools, deny-by-default permissions, and one-call GDPR erasure.
An eval harness with a CI pass-rate gate, an MCP host + server with OAuth, and A2A.
First-class React hooks, a Next.js handler, and a CLI.
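Of the items above, exactly-once tool dispatch is the least obvious, so here is one common way to approximate it: key each tool call by run and step, and return the recorded result when a resumed run replays the step. `DispatchLedger` and `ToolCall` are hypothetical names; this is a sketch of the general technique, not a description of how Eidentic implements it.

```typescript
// Hypothetical types for illustration only.
type ToolCall = { runId: string; step: number; tool: string; input: string };

class DispatchLedger {
  // In a real system this ledger would live in durable storage; a Map is used
  // here only to keep the sketch self-contained.
  private readonly results = new Map<string, string>();

  dispatch(call: ToolCall, execute: (input: string) => string): string {
    const key = `${call.runId}:${call.step}`;
    const recorded = this.results.get(key);
    if (recorded !== undefined) {
      return recorded; // replay after resume: reuse the result, do not run the tool again
    }
    const result = execute(call.input);
    this.results.set(key, result);
    return result;
  }
}

let emailsSent = 0;
const sendEmail = (input: string): string => {
  emailsSent += 1;
  return `sent:${input}`;
};

const ledger = new DispatchLedger();
const emailCall: ToolCall = { runId: "run-1", step: 3, tool: "send_email", input: "welcome" };
const firstResult = ledger.dispatch(emailCall, sendEmail);
const replayedResult = ledger.dispatch(emailCall, sendEmail); // simulates resuming from a checkpoint
console.log(firstResult, replayedResult, emailsSent);
```

A crash between running the tool and recording its result would still allow a second execution, so a production version needs the ledger write and the side effect to be atomic, or the tool itself to be idempotent.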

Honest about where we are

Eidentic is pre-1.0 and stabilizing toward v1. We'd rather over-disclose gaps than oversell, so we publish our benchmarks in full — including the ones we lose. On LongMemEval, memory beats a full-context baseline by 14.2 points at up to ~39× fewer tokens; on LoCoMo's smaller haystack, full context still wins. Both runs are public.

Get started

Read the docs, browse the source on GitHub, or clone an example for Next.js, React, or Express. If you build something with it, we'd love to hear about it.