We Installed a Hippocampus in AI, Modeled After the Human Brain

zhide liu — Tue, 23 Jun 2026 09:16:34 +0000

The AntGroup AFX team is open-sourcing Hebb Mind, a memory system for AI agents. Instead of just "store and retrieve," it brings the brain's full memory loop — encode → consolidate → activate → forget — into engineering.

GitHub: https://github.com/afx-team/hebb-mind

Docs: https://afx-team.github.io/hebb-mind/quick-start.html

How do we make AI remember context today?

The most common approach is a handful of MEMORY.md, AGENTS.md, and CLAUDE.md files paired with grep (keyword search over text). Whatever you need to remember gets appended line by line; when you need it, you search. The appeal is that it's dead simple — zero install, zero dependencies, zero cost — and you can start right now.

But the limits show up fast: there's no structure. The longer you use it, the more your memories pile up flat inside one ever-growing file. Phrase your query differently and you miss the hit; scale it up and "memory" degrades into "a tangled mess."

A step up is the more systematic option: a vector store plus semantic retrieval (i.e., RAG). After each turn you make a model call to extract the key points, write them into a store indexed "by semantic similarity," and next time you ask, you recall "the few entries closest in meaning." This solves grep's rephrasing problem and gives you genuine semantic search.
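To make "the few entries closest in meaning" concrete, here is a minimal sketch of similarity-based recall. It assumes each stored note already has an embedding; the hand-made 3-dimensional vectors and the names `cosine`, `top_k`, and `STORE` are illustrative stand-ins, not part of any real vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k stored texts whose vectors are closest to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical embeddings: real ones would come from an embedding model.
STORE = [
    ("prefers tabs over spaces", [0.9, 0.1, 0.0]),
    ("deploys on Fridays are banned", [0.0, 0.2, 0.9]),
    ("uses 4-wide indentation in Python", [0.8, 0.3, 0.1]),
]

# A query about code style lands near the two style notes, whatever its wording.
print(top_k([1.0, 0.2, 0.0], STORE))
```

Note that every entry competes on distance alone: nothing here knows which note is newer, more important, or related to another.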

But hold these up against a real brain and you'll notice a few missing dimensions:

Neither one organizes and settles scattered experiences into structured knowledge during idle time (a .md is a puddle of text, a vector store is a heap of vectors — nothing gets filed away);
Neither one forgets on purpose — an offhand remark from three years ago carries the same weight as a preference you stressed yesterday;
Recall is just "pull up the closest few entries." It can't give you the brain's feeling of one cue lighting up a whole region.

In short: .md + grep is "good enough," and a vector store gets you "search by meaning." But even if model vendors push the context window from 200K to 1M tokens, the real problem remains, because what our brains do has never been simple storage plus retrieval. It's a complete loop: encode → consolidate → activate → forget.

We rewrote it the way the brain takes notes

We tried a range of off-the-shelf options, and none of them behaved intelligently enough. So we decided to look hard at the structure of the human brain and its memory machinery to see if they held any clues. The map of major brain regions that researchers broadly agree on is where the secret of memory is hidden.

Step one, encoding. What you just experienced isn't immediately carved into the deep brain. The hippocampus first stashes it away quickly and cheaply — like jotting it on a sticky note.

Step two, consolidation — and it happens while you sleep. This is the step everyone overlooks, and also the most important one. During sleep, the hippocampus replays the day's experiences at high speed to the cortex, weaving the new information into your existing web of knowledge, pass after pass. That's why after a good night's sleep you often feel like you've "figured it out" or "locked it in" — memory gets cemented while you rest. And the stuff that's important, the stuff that fits what you already know, cements fastest and firmest.

Step three, retrieval, which works by "spreading activation." Remembering isn't an exact table lookup. One cue ("that café") travels along the association network and lights up "the project we discussed that day" and "the person sitting across from me." Even when the cue is only half there, the brain fills in the rest.
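As a rough illustration of spreading activation, the sketch below pushes activation outward from one cue across a weighted association graph, weakening it at every hop. The graph, the `decay` and `threshold` values, and the function name `spread` are all assumptions chosen for the example; this is a textbook-style toy, not Hebb Mind's retrieval code.

```python
def spread(graph, cue, decay=0.5, threshold=0.1):
    """Propagate activation from a cue; each hop multiplies by weight * decay."""
    activation = {cue: 1.0}
    frontier = [cue]
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor, weight in graph.get(node, {}).items():
                incoming = activation[node] * weight * decay
                # Keep only the strongest path into a node and drop faint signals.
                if incoming >= threshold and incoming > activation.get(neighbor, 0.0):
                    activation[neighbor] = incoming
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return activation


# Hypothetical association network built around the café example.
GRAPH = {
    "that café": {"project we discussed": 0.9, "person across from me": 0.8},
    "project we discussed": {"launch deadline": 0.6, "that café": 0.9},
}

for memory, level in spread(GRAPH, "that café").items():
    print(f"{level:.3f}  {memory}")
```

A single cue reaches "launch deadline" two hops away even though the cue never mentions it, which is the "one point lights up a region" effect in miniature.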

Step four, forgetting — an active ability. What you use often and used recently sticks around; what you haven't touched in a long time fades exponentially on its own. This isn't "the disk filled up and overflowed" — it's deliberate. The brain wants your head clear when you're thinking, not stuffed with years of stale noise.
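A small sketch of that fading, assuming a simple exponential curve whose half-life stretches with each past use. The 7-day base half-life and the `1 + uses` scaling are hypothetical choices for illustration only.

```python
def retention(idle_days, uses, base_half_life=7.0):
    """Fraction of a memory's strength left after `idle_days` without use.

    Every past use lengthens the half-life, so well-used memories fade slower.
    """
    half_life = base_half_life * (1 + uses)
    return 0.5 ** (idle_days / half_life)


print(round(retention(idle_days=7, uses=0), 3))   # → 0.5
print(round(retention(idle_days=7, uses=9), 3))   # → 0.933
print(round(retention(idle_days=70, uses=0), 3))  # → 0.001
```

After one idle week an unused memory has lost half its strength, while one that was used nine times has barely moved.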

By contrast: today's mainstream AI memory approaches drop exactly the three steps that matter most — consolidation, spreading activation, and active forgetting — and keep only "store" and "retrieve."

Following this machinery, we built Hebb Mind — and it's now open source. "Hebb" comes from the neuroscientist Donald Hebb, who in 1949 proposed what's now known as Hebbian theory, usually boiled down to one line: "neurons that fire together, wire together." That's the underlying mechanism by which the brain forms associations and leaps from one point to a whole region — and it's the starting point for Hebb Mind's design: memories that often appear together get linked, and memories that get used often stick around longer.
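The "fire together, wire together" idea can be sketched as a link table whose weights grow whenever two memories are recalled in the same step. The class name `AssociationGraph`, the method names, and the fixed learning rate are hypothetical; the sketch shows the principle, not Hebb Mind's data model.

```python
from collections import Counter
from itertools import combinations


class AssociationGraph:
    """Link weights grow each time two memories are activated together."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = Counter()  # (id_a, id_b) -> link strength

    def co_activate(self, memory_ids):
        # Strengthen every pair that fired in the same recall.
        for a, b in combinations(sorted(set(memory_ids)), 2):
            self.weights[(a, b)] += self.learning_rate

    def strongest_links(self, memory_id, n=3):
        linked = [(pair, w) for pair, w in self.weights.items() if memory_id in pair]
        linked.sort(key=lambda item: item[1], reverse=True)
        return [(a if b == memory_id else b, w) for (a, b), w in linked[:n]]


graph = AssociationGraph()
graph.co_activate(["cafe", "project", "alice"])
graph.co_activate(["cafe", "project"])
print(graph.strongest_links("cafe"))  # → [('project', 0.2), ('alice', 0.1)]
```

Because "cafe" and "project" fired together twice, their link is now the strongest one leading out of "cafe".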

So the core of Hebb Mind isn't "yet another RAG framework." It's that biological loop, ported one-to-one into engineering:

New things land in the "hippocampus" first; during "sleep" they're moved to the "cortex"; when you think, multiple paths light up and one point leads to a whole region; what goes unused slowly fades.

▲ The loop as a system: integration layer → four-stage memory loop → local storage

Concretely:

Encode: the first stop for every new memory isn't long-term storage — it's a working-memory buffer that maps to the "hippocampus." Fast, cheap, format-agnostic, no extra processing, readable and writable verbatim. This step is fully local and needs no external cloud service.
Consolidate: a "sleep-like" background job runs on a schedule every day (the system defaults to evening, as if drifting off), replaying that day's scattered notes in the buffer to a large model, session by session, as whole transcripts — the engineering version of the brain's "high-speed replay" during sleep. In a single pass the model decides: which category this memory belongs to, how to distill and rewrite it, how important it is, and how to resolve conflicts with existing memories. The cleaned-up result is written into long-term memory, then the buffer's drafts are cleared — emptying the hippocampus to make room for new experiences.
Activate: when looking something up, three paths light up at once — "semantically close," "literal match," and "tag-related" are all recalled together, then merged and ranked. After a hit, it follows the context of the same conversation and its related tags to pull in "that whole region" too. This is the engineering version of "one point leads to a whole region, and partial cues still get completed."
Forget: how long a memory survives is decided dynamically by "how often it's used × how important it is × how long since it was last used." High-frequency, important memories are practically nailed into long-term storage; obscure, idle ones quickly collapse and get reclaimed. Literally use it or lose it.
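The four steps above can be sketched as one small class. This is a toy under stated assumptions, not Hebb Mind's implementation: `MemoryLoop`, `Memory`, and `toy_distill` are hypothetical names, word overlap stands in for embedding similarity, reciprocal rank fusion is one plausible way to "merge and rank" (the source does not say which method is used), and the forgetting score is simply uses × importance × an exponential recency factor.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    tags: frozenset = frozenset()
    importance: float = 0.5
    uses: int = 0
    last_used: float = 0.0  # days since an arbitrary start


class MemoryLoop:
    def __init__(self, distill):
        self.buffer = []        # "hippocampus": raw notes kept verbatim
        self.long_term = []     # "cortex": distilled Memory records
        self.distill = distill  # stand-in for the model call made during "sleep"

    def encode(self, session_id, text):
        self.buffer.append((session_id, text))  # no processing on write

    def consolidate(self):
        # Replay the buffer session by session, each as one whole transcript.
        sessions = defaultdict(list)
        for session_id, text in self.buffer:
            sessions[session_id].append(text)
        for notes in sessions.values():
            self.long_term.extend(self.distill("\n".join(notes)))
        self.buffer.clear()     # make room for new experiences

    def activate(self, query, tags=(), k=3, now=0.0):
        words = set(query.lower().split())

        def overlap(i):
            return len(words & set(self.long_term[i].text.lower().split()))

        ids = range(len(self.long_term))
        paths = [
            sorted((i for i in ids if overlap(i)), key=lambda i: -overlap(i)),   # "semantic" stand-in
            [i for i in ids if query.lower() in self.long_term[i].text.lower()],  # literal match
            [i for i in ids if self.long_term[i].tags & set(tags)],               # tag match
        ]
        # Reciprocal rank fusion: memories found by several paths rise to the top.
        scores = defaultdict(float)
        for ranking in paths:
            for rank, i in enumerate(ranking, start=1):
                scores[i] += 1.0 / (60 + rank)
        hits = [self.long_term[i] for i in sorted(scores, key=scores.get, reverse=True)[:k]]
        for memory in hits:     # being recalled counts as a use
            memory.uses += 1
            memory.last_used = now
        return hits

    def forget(self, now, min_score=0.05, half_life=30.0):
        def score(m):
            return (1 + m.uses) * m.importance * 0.5 ** ((now - m.last_used) / half_life)

        self.long_term = [m for m in self.long_term if score(m) >= min_score]


def toy_distill(transcript):
    # Hypothetical distiller: one Memory per line, tagged by a keyword check.
    return [Memory(line, frozenset({"style"}) if "indent" in line else frozenset())
            for line in transcript.splitlines()]


loop = MemoryLoop(toy_distill)
loop.encode("s1", "use 4-space indent in Python")
loop.encode("s1", "CI runs on every push")
loop.encode("s2", "never deploy on Friday")
loop.consolidate()
recalled = [m.text for m in loop.activate("indent", tags=["style"])]
loop.forget(now=120)
print(recalled, [m.text for m in loop.long_term])
```

In this run the only memory that was ever recalled is also the only one that survives 120 idle days, which is the "use it or lose it" behavior in miniature.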

These four steps draw on several well-known theories in cognitive science: Complementary Learning Systems (CLS), Tulving's multiple memory systems, sharp-wave-ripple replay during sleep, ACT-R activation, and the Ebbinghaus forgetting curve. We took this research as the starting point for the design and wired it into the code one piece at a time.

So does it actually work? Straight to the numbers

However nice the mechanism sounds, it has to prove itself. The numbers below all come from public datasets, and they were produced by the same code that ships in the real product — not some "lab-only special." What you install is the same build, and all of it is reproducible.

LongMemEval (500 long-horizon memory questions, a widely used industry benchmark)

Finding it (retrieval recall): Top-5 hit 99.0%, Top-10 hit 99.4%; four of the six question types hit 100% outright.
Getting it right (end-to-end answer accuracy): 79.0% — using the dataset's official neutral prompt with no task-specific tuning, already closing in on the theoretical ceiling (about 82%) you'd get assuming retrieval were 100% perfect.
On the same 500 questions under the same official protocol, Zep's published numbers are Top-10 retrieval hit 95.5% and answer accuracy 71.2% — same questions, same scoring; on both "finding it" and "getting it right," Hebb Mind comes out ahead.
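For readers unfamiliar with the metric, here is a minimal sketch of how a Top-k hit rate can be computed, assuming a question counts as a hit when at least one gold evidence item appears in the first k results. The function names and toy data are illustrative and are not taken from the LongMemEval harness; the last lines only redo the arithmetic on the figures quoted above.

```python
def hit_at_k(ranked_ids, gold_ids, k):
    """True if any gold evidence id appears among the first k retrieved ids."""
    return any(doc_id in gold_ids for doc_id in ranked_ids[:k])


def hit_rate(results, k):
    """Share of questions with at least one gold id in the top k."""
    return sum(hit_at_k(ranked, gold, k) for ranked, gold in results) / len(results)


# Toy data (hypothetical): (retrieved ids in rank order, gold evidence ids)
RESULTS = [
    (["a", "b", "c"], {"a"}),
    (["d", "e", "f"], {"f"}),
    (["g", "h", "i"], {"z"}),
    (["j", "k", "l"], {"k"}),
]
print(hit_rate(RESULTS, k=1))  # → 0.25
print(hit_rate(RESULTS, k=3))  # → 0.75

# Arithmetic on the reported figures.
print(round(79.0 / 82.0, 3))   # → 0.963 (share of the ~82% ceiling reached)
print(round(99.4 - 95.5, 1))   # → 3.9 (Top-10 gap vs. Zep, percentage points)
print(round(79.0 - 71.2, 1))   # → 7.8 (answer-accuracy gap vs. Zep, percentage points)
```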

▲ Ability to find the evidence — ahead at every depth

▲ Accuracy when actually answering

LoCoMo (multi-turn long conversations spanning dozens of sessions, built to test "do you still remember what was said several sessions ago")

On what this dataset measures best — "can you find it" — Top-10 hit 95.75% (94.14% even with rerank off). Under the same model and the same config, it edges out the published MemPalace numbers.

▲ LoCoMo recall, same-protocol comparison

These are just the two most representative datasets. If you want the full scores on every dataset (more benchmarks, with each table carrying its protocol, k value, version, embedding/rerank config, and the exact rerun command), the complete benchmark report is open — you're welcome to check it out:

📊 Full benchmark report: afx-team.github.io/hebb-mind/benchmarks/

Beyond the scores — how to choose

Scores alone aren't enough. There are plenty of memory frameworks, each with its own approach. Put a few of the commonly compared ones side by side and it roughly looks like this:

Framework	Deployment	LLM call on write?	Active forgetting	License	Strength / when to pick it
Hebb Mind	Local-first, zero external services	No — only consolidation does	Yes — staged, dynamic TTL	MIT	Local & private + faithful mechanism + full neuroscience loop
mem0	Self-hosted / cloud, needs a vector store	Yes	No	Apache core + paid cloud	Earliest mover, widest ecosystem and integrations
Zep / Graphiti	Cloud-first; self-host needs a graph DB	Yes — builds a graph on write	No	Apache-2.0 (Graphiti)	Temporal reasoning + bitemporal knowledge graph
Letta (MemGPT)	Self-hosted / cloud, needs Postgres	Yes — agent runtime	No	Apache-2.0	Stateful-agent runtime + self-editing memory
Cognee	Self-hosted (graph + vector backend)	Yes — LLM extraction	No	Apache core	Vector + graph hybrid, pluggable backends

▲ A factual comparison, not a benchmark — each tool has its place

mem0 got out early and is widely adopted; Zep's temporal knowledge graph is genuinely strong on "when did it happen, what came before what" questions; Letta turned the "stateful agent" into a complete runtime; MemPalace is easy to integrate. Most of them are cloud-shaped, and a write requires a model call up front — if you want something managed with a ready-made ecosystem, any of them is a reasonable pick.

Hebb Mind, by contrast, aims to be more biologically faithful in mechanism and more elegant in engineering: local-first, multi-agent compatible, a complete memory loop, MIT-licensed — and right now, the only row that checks all four boxes at once is Hebb Mind.

Three commands, running locally

The numbers hold up, and getting from install to a running service is just as simple:

pipx install hebb-mind
hebb setup
hebb service install

Once installed, it runs as an OS-level background service (macOS / Linux / Windows all supported, user-level by default, no admin rights needed), bringing up a local API and a web console — open it in your browser to manage memories by hand.

More importantly, Hebb Mind has two core strengths:

Local-first, zero external services: writes and retrieval are fully offline, running on the bundled local embedding model. Only when the "consolidation" step actually needs to do its work do you need to configure a large-model key (skip it and nothing errors out — that step just silently no-ops). At the model layer it supports common LLMs including Claude / GPT / Llama / Qwen / GLM / Kimi.
Built for agents, not for chat: native integration with Claude Code and Codex — one command to install, and your coding assistant remembers your project, your preferences, and the traps you've hit across sessions, with no need to re-explain every time.

Once installed, you have a local service: a memory loop that organizes itself and also forgets, running without any cloud.

Wiring it into your coding assistant: Claude Code / Codex

Claude Code integration

Claude Code is the deepest integration, giving you two layers at once. One is the MCP tools — letting Claude actively call write_memory / search_memory / consolidate inside a conversation, deciding for itself what to record and what to look up. The other is the Hooks auto-memory layer, the invisible one: when a session opens, cross-session memories are recalled and injected into context automatically; every message you send is automatically de-noised, deduplicated, and written to memory; and when the session ends, consolidation is triggered automatically. No intervention required.

hebb claude-code install --scope user    # injects hooks + MCP; restart Claude Code to apply

Restart Claude Code after installing so the hooks and MCP tools take effect.

Codex integration

Codex goes through MCP, and the command Hebb Mind provides installs it automatically:

hebb codex install --scope user
codex mcp list                          # verify the install succeeded

After installation, Codex can call tools like write_memory / search_memory / consolidate / ingest_conversation.
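For a sense of what such a call looks like on the wire, the sketch below builds an MCP `tools/call` request, which is a JSON-RPC 2.0 message. The envelope follows the MCP specification; the argument name `query` is a hypothetical placeholder, since the real input schema is whatever the server advertises through `tools/list`.

```python
import json


def tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query" is an assumed argument name, used here only for illustration.
request = tool_call(1, "search_memory", {"query": "indentation preference"})
print(json.dumps(request, indent=2))
```

In practice the MCP client inside Codex assembles and sends this message itself; the sketch is only meant to show the shape of the request.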

Web console

To make management easier, we also ship a web console. Run the following command to open it:

hebb console

Closing thoughts

An agent's memory shouldn't be just "storage + retrieval." What makes the brain truly remarkable is that it organizes during sleep, forgets on purpose, and leaps from one point to a whole region. That's the loop we've open-sourced. It's still early, but as the numbers above show, the direction is holding up.

There's a lot left for us to polish. If you also think "AI memory" matters, come kick the tires, open PRs, and drop a star 🌟:

GitHub: https://github.com/afx-team/hebb-mind
Docs: https://afx-team.github.io/hebb-mind/quick-start.html
Install: pip install hebb-mind or pipx install hebb-mind

DEV Community: zhide liu