DEV Community: Pragadeesh

Why AI agents need three types of memory (and how I built all of them)

Pragadeesh — Sat, 04 Jul 2026 20:34:57 +0000

Most "agent memory" today is one thing wearing three hats: a vector database. You embed the past, you retrieve the nearest neighbor, you paste it into the prompt. It works until it doesn't, and when it doesn't, you can't tell whether the agent forgot a fact, forgot an experience, or never learned the skill.

The human brain does not do this. It keeps three separate systems. I built an agent that does the same, and measured what it buys you.

The problem

I gave a language model a database it had never seen (Northwind) and asked it to write SQL. Cold, with no help, a strong model gets it right about a quarter of the time. Not because it can't write SQL, but because it doesn't know the schema, doesn't remember what worked last time, and has no sense of how to approach a question of a given shape. Those are three different kinds of not-knowing, and a single vector store treats them as one.

The brain analogy

Cognitive science splits long-term memory into three kinds:

Semantic memory is what things mean. The capital of France. That Orders.CustomerID is a foreign key to Customers.
Episodic memory is what happened to you. The query you ran last Tuesday that actually returned the right count.
Procedural memory is how to do things. Riding a bike. The fact that for an aggregate-over-a-join you filter before you count.

You do not retrieve "how to ride a bike" by finding the most similar past bike ride. Procedural memory is a different mechanism. That is the insight the whole project is built on.

Cognee for two of them

Cognee gives me a knowledge graph with four verbs: remember, recall, forget, improve. I use it for the first two memory types.

Semantic: before the benchmark runs, I load the Northwind schema once. Tables, columns, foreign keys. It is never forgotten because the schema never changes.

Episodic: every time the agent gets an answer right, I store the question-and-SQL pair as an episode, tagged with a dataset name that encodes the question type. Later, recall surfaces the ones that match the current question.

One hard-won detail: Cognee's graph extraction normalizes column names to snake_case during ingestion, which quietly poisons a schema full of PascalCase Northwind columns. The fix was to keep the raw schema text I fed in and inject that directly, using the graph only for episode retrieval. If you take one practical thing from this post, take that.
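A minimal sketch of that fix, assuming the prompt is assembled as plain text. The function name `build_prompt`, the prompt wording, and the abbreviated schema string are illustrative and not taken from the project; the point is only that the schema reaches the model verbatim while recalled episodes come from the memory layer.

```python
def build_prompt(raw_schema: str, episodes: list[tuple[str, str]], question: str) -> str:
    """Assemble a prompt from verbatim schema text plus recalled episodes."""
    # The schema is injected exactly as it was written, so PascalCase
    # column names never pass through any normalizing extraction step.
    lines = ["Schema (verbatim, original casing):", raw_schema.strip(), ""]
    if episodes:
        lines.append("Past questions that were answered correctly:")
        for past_question, past_sql in episodes:
            lines.append(f"Q: {past_question}")
            lines.append(f"SQL: {past_sql}")
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("SQL:")
    return "\n".join(lines)


# Abbreviated, hand-written excerpt of a Northwind-style schema (illustrative).
RAW_SCHEMA = """
Customers(CustomerID, CompanyName)
Orders(OrderID, CustomerID, OrderDate)  -- Orders.CustomerID -> Customers.CustomerID
"""

prompt = build_prompt(
    RAW_SCHEMA,
    [("How many orders are there?", "SELECT COUNT(*) FROM Orders;")],
    "How many customers have placed at least one order?",
)
print(prompt)
```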

Synapse for the third

Procedural memory needed a different engine, so I used Synapse-DB. Every question gets hashed to a bucket by its type (its intent plus the tables it touches), not its wording. Each attempt writes a thought with a success score: 1.0 for right, 0.0 for wrong. Synapse reinforces the winners and lets the losers decay, Hebbian-style.
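The bucketing and scoring idea can be sketched in a few lines. This is not Synapse-DB's API: `type_key`, `SalienceTable`, the learning rate, and the decay factor are all assumptions chosen for illustration, and the update rule is a simple reinforce-and-decay scheme in the spirit of the description rather than the engine's actual mechanism.

```python
import hashlib


def type_key(intent: str, tables: list[str]) -> str:
    """Hash a question's type (intent plus tables touched), ignoring its wording."""
    canonical = intent.lower() + "|" + ",".join(sorted(t.lower() for t in tables))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class SalienceTable:
    """Toy reinforcement-and-decay store: one salience value per (bucket, thought)."""

    def __init__(self, learning_rate: float = 0.3, decay: float = 0.9):
        self.learning_rate = learning_rate
        self.decay = decay
        self.salience: dict[tuple[str, str], float] = {}

    def record(self, bucket: str, thought: str, score: float) -> None:
        # Every existing trace fades a little on each write...
        for key in self.salience:
            self.salience[key] *= self.decay
        # ...and the trace just used moves toward its success score
        # (1.0 for a right answer, 0.0 for a wrong one).
        old = self.salience.get((bucket, thought), 0.0)
        self.salience[(bucket, thought)] = old + self.learning_rate * (score - old)


bucket = type_key("count", ["Orders", "Customers"])
table = SalienceTable()
for _ in range(5):
    table.record(bucket, "filter before you count", 1.0)
    table.record(bucket, "count, then filter", 0.0)
```

Because the key is built from the intent and a sorted table list, two differently worded questions of the same type land in the same bucket.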

The per-type memory that a vector store cannot give you lives in how those buckets key the episodes: a successful query is filed under its type's hash, and recall returns past queries of the same type, not merely the nearest neighbor by wording. That is how the agent accumulates, across hundreds of attempts, a sense of how to approach a type of problem rather than a single lookalike answer.
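To make the contrast with nearest-neighbor retrieval concrete, here is a small in-memory stand-in for type-keyed episodic recall. It does not use Cognee; the `EpisodeStore` class and the plain-string type keys are hypothetical and exist only to show that recall is driven by the question's type rather than its wording.

```python
from collections import defaultdict


class EpisodeStore:
    """In-memory stand-in: successful (question, SQL) pairs filed under a type key."""

    def __init__(self) -> None:
        self._by_type: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def remember(self, type_key: str, question: str, sql: str) -> None:
        self._by_type[type_key].append((question, sql))

    def recall(self, type_key: str, limit: int = 3) -> list[tuple[str, str]]:
        # Return the most recent episodes of the same type, newest last.
        # Wording plays no part in the lookup.
        return self._by_type.get(type_key, [])[-limit:]


store = EpisodeStore()
store.remember(
    "count|customers,orders",
    "How many customers ordered in 1997?",
    "SELECT COUNT(DISTINCT CustomerID) FROM Orders WHERE OrderDate LIKE '1997%';",
)
store.remember(
    "count|customers,orders",
    "Number of buyers with any purchase?",
    "SELECT COUNT(DISTINCT CustomerID) FROM Orders;",
)
# Similar wording to the first question, but a different type of problem.
store.remember(
    "list|customers",
    "Which customers ordered in 1997? List their names.",
    "SELECT CompanyName FROM Customers;",
)

same_type = store.recall("count|customers,orders")
unseen_type = store.recall("sum|orders")
```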

One honest caveat I found while building the demo: in the Synapse build I used, the best-next lookup returns a global salience signal and does not filter by state hash (an unseen hash returns the same thing as a heavily-used one). So the per-type differentiation is carried by the hash-keyed episodic recall, and Synapse plays the role of the global reinforcement-and-decay signal. That signal is what drives the forget step below, and it is real. I would rather tell you exactly where each memory type is doing its work than sell a cleaner story than the one I measured.

The forget() insight

Here is the piece no other approach has. Early in training, the agent sometimes gets a question right by luck and stores a misleading episode. That bad memory then pollutes recall for every similar question.

I let Synapse decide what to forget. At a mid-run checkpoint, any question bucket that Synapse has watched fail more than three times gets its early episodes pruned from Cognee via forget. The procedural signal cleans the episodic store. The agent self-heals without a human in the loop. That is all four Cognee verbs, driven by a principled architecture instead of a cron job.
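The checkpoint logic can be sketched without either library. Everything named here is an assumption: the `Episode` record, the dictionary shapes, and in particular `early_cutoff_epoch`, since the post does not say how "early" is defined. The threshold follows the text: a bucket is pruned only when it has failed more than three times.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    epoch: int
    question: str
    sql: str


def prune_early_episodes(
    episodes: dict[str, list[Episode]],
    failure_counts: dict[str, int],
    max_failures: int = 3,
    early_cutoff_epoch: int = 2,
) -> dict[str, int]:
    """Drop early episodes from buckets that failed more than `max_failures` times.

    Returns how many episodes were pruned per bucket. Mutates `episodes` in place.
    """
    pruned: dict[str, int] = {}
    for bucket, failures in failure_counts.items():
        if failures > max_failures and bucket in episodes:
            keep = [e for e in episodes[bucket] if e.epoch > early_cutoff_epoch]
            pruned[bucket] = len(episodes[bucket]) - len(keep)
            episodes[bucket] = keep
    return pruned


episodes = {
    "count|orders": [
        Episode(1, "q1", "SELECT 1"),
        Episode(2, "q2", "SELECT 2"),
        Episode(5, "q3", "SELECT 3"),
    ],
    "list|customers": [Episode(1, "q4", "SELECT 4")],
}
failure_counts = {"count|orders": 4, "list|customers": 3}

pruned = prune_early_episodes(episodes, failure_counts)
```

In the real system the failure counts would come from the procedural layer and the deletion would go through the episodic store's forget call; this sketch only shows the decision rule.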

The benchmark numbers

I ran two agents on the same model, claude-haiku-4-5, over 50 training and 10 hold-out questions.

Vanilla, no memory: 26%.
Memory agent: 58%.
Gap: +32 percentage points, for a total API cost of £0.465.

I used Haiku rather than Sonnet to keep the run inside budget. It is a weaker base model, which is why the absolute numbers are modest. But weak or strong, both agents ran the same model, so the gap is the memory layer and nothing else.

I'm publishing the honest version. The run completed 8 of 10 epochs before a network blip killed it. Learning plateaued at 58% from epoch 3 rather than climbing to the 70s, because the hardest JOIN questions failed on the first pass and so never seeded an episode to learn from. Hold-out stayed flat. None of that touches the headline: the memory layer, and nothing else, moved the same model from 26% to 58%.

What comes next

Three fixes, in order of impact. Seed the hard buckets with a handful of correct episodes so the plateau breaks. Add a retry around the API call so a dropped packet doesn't cost you an overnight run. And persist Synapse state across runs so the procedural memory compounds over weeks, not epochs. The architecture is the contribution. The numbers say it works. The next numbers say how far it goes.
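The second fix is the cheapest to illustrate. Below is a minimal retry-with-exponential-backoff wrapper; the function name, the attempt count, the delays, and the choice of exceptions to retry on are assumptions, and a real client library may raise its own error types that would need to be listed instead.

```python
import time


def call_with_retry(call, attempts=4, base_delay=1.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run `call()`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demo: a fake API call that fails twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky_api_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network blip")
    return "ok"


# `sleep` is injected so the demo records the delays instead of waiting.
result = call_with_retry(flaky_api_call, sleep=delays.append)
```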

I built a Graph database to catch money launderers. Here's what I actually learned.

Pragadeesh — Tue, 07 Apr 2026 21:08:44 +0000

I want to say upfront: I have not caught any money launderers. I built a database. Whether it would actually catch money launderers in production is a question I can't answer yet, because I have zero production users. That caveat matters and I'll come back to it.

Here's what happened.

The problem I kept reading about

Every AML compliance team I could find publicly describing their stack was running some version of the same setup: a graph database for relationship traversal, a vector database or fuzzy matching library for name similarity, and a service layer stitching them together. Quantexa runs Spark plus Elasticsearch plus PostgreSQL plus a graph layer. ComplyAdvantage built a transformer-based name embedding model and runs it against FAISS for sanctions screening, while keeping a separate proprietary graph database for entity relationships. Neo4j has published an architecture diagram explicitly recommending that you pair their graph database with Pinecone for the vector part.

These are not small companies running shoddy systems. These are well-funded teams with smart engineers. They built this way because no single system did both things natively. So every team independently arrived at the same two-component architecture.

I wanted to know if that was actually necessary.

The core idea

Vector Symbolic Architecture is a field from cognitive computing that represents concepts as high-dimensional binary vectors and uses simple bitwise operations to encode relationships. XOR two vectors and you get a binding that associates them. Permute a vector and you get a role-encoded version of it. Bundle vectors together and you get a superposition that's close to all of them.

The interesting property for a database: if you encode a typed edge as bind(permute(subject), bind(permute(relation), permute(object))), you can query it back with just the subject and relation vectors, because XOR is its own inverse. The object vector emerges from the query. No index traversal. No query planning. Just bitwise arithmetic on fixed-size vectors.
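The encode-and-query round trip is short enough to show end to end. This sketch uses Python integers as bit vectors and assumes that permute is a bit rotation with a different rotation amount per role (1, 2, and 3 here); the post does not specify the permutation, so treat those choices as illustrative.

```python
import random

DIM = 10_048                      # bits per hypervector
MASK = (1 << DIM) - 1

# Assumed role-specific rotation amounts (not specified in the post).
SUBJECT_ROLE, RELATION_ROLE, OBJECT_ROLE = 1, 2, 3


def random_hypervector(rng: random.Random) -> int:
    return rng.getrandbits(DIM)


def permute(v: int, shift: int) -> int:
    """Rotate a DIM-bit vector left by `shift` positions."""
    shift %= DIM
    return ((v << shift) | (v >> (DIM - shift))) & MASK


def unpermute(v: int, shift: int) -> int:
    """Inverse of permute: rotate right by `shift` positions."""
    return permute(v, DIM - (shift % DIM))


def bind(a: int, b: int) -> int:
    return a ^ b                  # XOR is its own inverse


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


rng = random.Random(0)
subject, relation, obj = (random_hypervector(rng) for _ in range(3))
other_relation = random_hypervector(rng)

# Encode the typed edge.
edge = bind(permute(subject, SUBJECT_ROLE),
            bind(permute(relation, RELATION_ROLE), permute(obj, OBJECT_ROLE)))

# Query with subject and relation only: XOR them back out, undo the object role.
query = bind(permute(subject, SUBJECT_ROLE), permute(relation, RELATION_ROLE))
recovered = unpermute(bind(edge, query), OBJECT_ROLE)

# Querying with the wrong relation yields a vector unrelated to the object.
wrong_query = bind(permute(subject, SUBJECT_ROLE), permute(other_relation, RELATION_ROLE))
noise = unpermute(bind(edge, wrong_query), OBJECT_ROLE)
```

With an exact edge vector the object comes back bit for bit. With the wrong relation the result sits roughly half the bits away from the true object, which is what lets a nearest-neighbor lookup tell a hit from a miss.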

I thought: what if you built a graph database where every entity and every typed edge is stored as one of these binary hypervectors? You'd get graph traversal and vector similarity search in the same data structure. One HNSW index. One memory-mapped hash table. One gRPC call for a multi-hop chain.

So I built it.

What it actually took

Longer than I expected. The HNSW implementation took weeks to get right - there's a subtle bug where the entry point offset is stored as a layer-0 byte offset but gets treated as a layer-N offset after restart, which causes out-of-bounds memory access that only manifests on the second search after loading a large graph. Finding that took a while.

The persistence layer was harder still. I initially had the edge lookup table - the structure that maps subjectId XOR relationId -> objectId for O(1) chain traversal - as a ConcurrentHashMap in JVM heap. Which meant every server restart wiped it. Chain queries would return null until you re-ingested all the edges. For a 1.87M entity dataset that takes 12 hours. I fixed it this week with a memory-mapped WAL - each edge gets appended as 16 bytes to a mapped file before the map update, and on startup the log is replayed into a fresh HashMap in about 4 milliseconds. The fix is obvious in retrospect. I'm embarrassed it took this long to ship.
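The shape of that fix, sketched in Python rather than on the JVM. Two things here are assumptions: the 16-byte record is taken to be two 8-byte integers (the lookup key and the object id), and the log is written with ordinary buffered appends plus fsync instead of a memory-mapped file, which keeps the example short but is a swapped-in technique.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<qq")     # assumed layout: 8-byte key, 8-byte object id


def append_edge(path: str, subject_id: int, relation_id: int, object_id: int) -> None:
    """Append one edge to the log; call this before updating the in-memory map."""
    with open(path, "ab") as log:
        log.write(RECORD.pack(subject_id ^ relation_id, object_id))
        log.flush()
        os.fsync(log.fileno())    # make the record durable before returning


def replay(path: str) -> dict[int, int]:
    """Rebuild the (subjectId XOR relationId) -> objectId table from the log."""
    table: dict[int, int] = {}
    if not os.path.exists(path):
        return table
    with open(path, "rb") as log:
        data = log.read()
    usable = len(data) - len(data) % RECORD.size   # ignore a torn final record
    for key, object_id in RECORD.iter_unpack(data[:usable]):
        table[key] = object_id    # later records win, like repeated map puts
    return table


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "edges.wal")
    append_edge(log_path, subject_id=7, relation_id=2, object_id=99)
    append_edge(log_path, subject_id=99, relation_id=2, object_id=123)
    restored = replay(log_path)   # what a restart would do
    log_size = os.path.getsize(log_path)

# Two-hop chain after "restart": 7 -[2]-> 99 -[2]-> 123
hop1 = restored[7 ^ 2]
hop2 = restored[hop1 ^ 2]
```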

The MiniLM encoder integration was surprisingly painless - ONNX Runtime on the JVM, 90MB model, produces 384-dimension float embeddings that get projected to 10,048-bit binary vectors via a seeded random matrix. The projection is deterministic so it regenerates on startup from the seed rather than persisting 30MB of projection weights to disk.
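A sketch of the projection step under one common assumption: each output bit is the sign of a dot product between the embedding and a Gaussian random row (sign random projection). The post only says "seeded random matrix", so the Gaussian entries and the sign threshold are guesses, and the demo uses tiny dimensions (8 floats to 64 bits) to stay fast in pure Python.

```python
import random


def make_projection(seed: int, in_dim: int, out_bits: int) -> list[list[float]]:
    """Regenerate the projection matrix deterministically from a seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_bits)]


def project_to_bits(embedding: list[float], projection: list[list[float]]) -> int:
    """Map a float embedding to a binary vector: bit i is 1 if row i's dot product is positive."""
    bits = 0
    for i, row in enumerate(projection):
        if sum(w * x for w, x in zip(row, embedding)) > 0.0:
            bits |= 1 << i
    return bits


SEED = 42
projection = make_projection(SEED, in_dim=8, out_bits=64)
embedding = [0.12, -0.40, 0.33, 0.05, -0.21, 0.77, -0.09, 0.50]

code = project_to_bits(embedding, projection)

# After a "restart", the same seed yields the same matrix and the same code,
# so nothing but the seed needs to be persisted.
code_after_restart = project_to_bits(embedding, make_projection(SEED, 8, 64))
```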

The Spring Boot starter took about as long as the core engine. Not because it was technically hard but because there are a lot of edge cases in autoconfiguration - what happens when Micrometer isn't on the classpath, how the gRPC channel pool interacts with graceful shutdown, how to wire the Watch API without creating a circular dependency in the bean graph. That kind of thing.

The Panama Papers demo

I loaded the full ICIJ Offshore Leaks dataset - Panama Papers, Paradise Papers, and Pandora Papers combined - 1.87 million entities. It took 12 hours on my machine because each entity name has to go through MiniLM for the initial float embedding, and I was running 4 parallel encoder threads on a consumer CPU.

The resulting database is about 7GB on disk. Vector store, HNSW layers, entity index, edge log. All memory-mapped. Server starts in about 5 seconds and the whole thing sits in about 4.5GB of off-heap memory - no JVM heap pressure because the hot path runs entirely on the Foreign Function and Memory API.

The benchmark numbers I'm willing to stand behind:

4-hop beneficial ownership chain traversal: 3.65ms average on 1.87M entities
Fuzzy entity screening (name match across all three leaks): 886ms
Shell company risk scoring: 290-1748ms depending on graph depth

The 886ms for screening is slower than I'd like. It's going through HNSW on 1.87M vectors and there's real room to optimize the query path. The 3.65ms for chain traversal is the number I'm most confident in - it's a tight operation and I've measured it many times.

The shell risk score is the thing I find most interesting to think about. During ingestion, every Panama Papers entity vector contributes to a majority-vote tally across all 10,048 bit positions. After ingestion, I threshold the tally at 50% to get a prototype vector - the statistical centroid of roughly half a million real offshore shell companies in binary hypervector space. Any new entity gets a risk score of Hamming(entity, prototype) / 10048. It's not a classifier. It has no labels. It's just: how far is this entity, in the high-dimensional space of financial crime patterns, from the average Panama Papers company?
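The tally-and-threshold step is ordinary majority-vote bundling, and it fits in a few lines. The sketch below uses 8-bit vectors so the arithmetic can be checked by hand; the function names are illustrative, and treating an exact 50% tie as a 0 bit is an assumption.

```python
def build_prototype(vectors: list[int], dim: int) -> int:
    """Majority-vote bundle: a bit is set if more than half the vectors set it."""
    tally = [0] * dim
    for v in vectors:
        for i in range(dim):
            tally[i] += (v >> i) & 1
    prototype = 0
    for i, count in enumerate(tally):
        if 2 * count > len(vectors):      # strictly more than 50%; ties become 0
            prototype |= 1 << i
    return prototype


def risk_score(entity: int, prototype: int, dim: int) -> float:
    """Normalized Hamming distance between an entity and the prototype."""
    return bin(entity ^ prototype).count("1") / dim


DIM = 8
shells = [0b11110000, 0b11100001, 0b10110000]
prototype = build_prototype(shells, DIM)

near = risk_score(0b11110000, prototype, DIM)   # identical to the prototype
far = risk_score(0b00001111, prototype, DIM)    # every bit differs
```

In a streaming ingest the tally would be kept as running counters and thresholded once at the end, which is why no labels or training pass are needed.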

I genuinely don't know if that's useful in production. It's theoretically interesting. Whether a compliance analyst would trust it for an actual screening decision is a different question.

What the architecture gets wrong

Three things bother me.

First, the analogy query - client.analogy("Mossack Fonseca", "registers").isTo("Panama") - has no equivalent in Neo4j or any vector database I know of. Find me entities that have the same structural relationship to their jurisdiction that Mossack Fonseca has to Panama. That's genuinely novel. I've not seen it elsewhere. But I also can't tell you whether a compliance engineer running sanctions screening at 3am would ever ask that question, or whether it's one of those capabilities that's elegant in theory and never quite fits a real workflow.

Second, the benchmarks are on Windows on my laptop. Every time I post numbers someone will ask about Linux, about cloud VMs, about comparative results against Neo4j on the same hardware. Those numbers don't exist yet. That's a gap.

Third, this is a solo project. No enterprise is going to depend on a database maintained by one person. The path from "interesting technical work" to "thing banks will run in production" is long and requires more than good code. It requires SOC2 certification, enterprise support contracts, multi-year stability guarantees, and a team that's not going to disappear. I'm one person. That's not a path I can walk alone.

Why I'm posting this

I spent a few months on this. I want to find out if the underlying idea - unified binary representation for graph traversal and vector search, no separate vector database, embeddable in a JVM process - is actually useful to people building financial crime detection systems, or whether it's an interesting technical exercise that doesn't map to any real production need.

If you work on AML systems, financial crime technology, graph databases, or JVM infrastructure and you have an opinion about whether any of this is useful, I'd rather hear it now than spend another six months building features nobody wants.

GitHub: https://github.com/Pragadeesh-19/HammingStore

The AML demo: https://github.com/Pragadeesh-19/hammingstore-aml-demo