A few weeks ago I wrote about using a graph database for AI memory instead of vector search. The mug cakes story, the CO_REFERENCED_WITH feedback loop, the overnight consolidation cycles. That post was about the architecture that convinced me graph-backed memory works.
So, in the interest of fairness, this one is about everything I got wrong.
v1 proved the concept could work! v2 was about finding out that a graph with loose types is just a fancier document store.
The RELATES_TO Problem
In v1, everything connected to everything via RELATES_TO.
Cooking RELATES_TO wife. Sleep RELATES_TO productivity. Diet RELATES_TO goals. All technically true. None of it useful.
The problem didn't show up immediately. With 50 nodes, traversal works fine even with generic edges. You ask "what's related to this goal?" and you get back a handful of things that are, in fact, related. It feels like it's working.
Turns out, though, that at 300 nodes your graph is a disgusting, chewed-up hairball. Everything is 1-2 hops from everything else. You ask "what supports this goal?" and the answer is: we don't know, because RELATES_TO doesn't distinguish support from correlation from coincidence.
So I killed RELATES_TO entirely and replaced it with 18 typed edges across 6 categories:
Motivation & Causation: MOTIVATED_BY, CAUSED_BY
Structural: ENABLES, BLOCKS, REQUIRES, PART_OF
Knowledge: SUPPORTS, CONTRADICTS, INFLUENCES
Goal & Commitment: COMMITTED_TO, FOLLOWED_THROUGH, ADVANCES
Organization: TAGGED_WITH, ABOUT
Lifecycle: EVOLVED_INTO, SUPERSEDES, DUPLICATES
Provenance: EXTRACTED_FROM
Forcing the system to pick a typed edge forced it to actually reason about the relationship. A typed edge is a claim: SUPPORTS means something different from CONTRADICTS. When the LLM has to choose between them, it can't be lazy about it. BECAUSE IT WILL BE LAZY.
Of course, the LLM will drift back to inventing generic types within days if you let it. So there's a server-side guardrail: set-membership validation against the 18 allowed types. Non-schema edges get rejected with an error listing the alternatives.
```javascript
const VALID_EDGE_TYPES = new Set([
  'MOTIVATED_BY', 'CAUSED_BY', 'ENABLES', 'BLOCKS',
  'REQUIRES', 'PART_OF', 'SUPPORTS', 'CONTRADICTS',
  'INFLUENCES', 'COMMITTED_TO', 'FOLLOWED_THROUGH',
  'ADVANCES', 'TAGGED_WITH', 'ABOUT', 'EVOLVED_INTO',
  'SUPERSEDES', 'DUPLICATES', 'EXTRACTED_FROM'
]);

if (!VALID_EDGE_TYPES.has(edgeType)) {
  throw new Error(
    `Invalid edge type "${edgeType}". ` +
    `Allowed: ${[...VALID_EDGE_TYPES].join(', ')}`
  );
}
```
The LLM sees that error and corrects itself. Hard constraints where it counts.
15 Labels, 4 Layers
v1 had loosely typed nodes. "Facts, decisions, preferences, patterns, commitments, observations." No formal taxonomy, no lifecycle rules. It was pretty much just vibes; a classic "yeah that sounds right" decision. A "fact" and an "observation" had the same properties and the same lifespan.
I didn't realize why that was a problem until I tried to build consolidation. Which nodes should get archived after 30 days? Which ones should live forever? Turns out you can't write lifecycle rules for node types that don't have distinct lifecycles.
v2 has 15 knowledge labels across 4 semantic layers. The test for whether a label deserves to exist: what happens to this node in 6 months? If two labels have the same answer, they should be the same label.
| Layer | Labels | Lifecycle |
|---|---|---|
| Identity (who you are) | Attribute, Preference, Belief, Value, Skill | Evergreen. Only supersession or contradiction removes them |
| Direction (where you're headed) | Goal, Project, Commitment, Decision | Active until completed or abandoned. Commitments never auto-archive |
| Intelligence (what the system observes) | Behavior, Insight, Opportunity, Event | Behaviors evolve into Insights. Opportunities expire. Events are immutable |
| World (external knowledge) | Reference, Resource | Stale after disuse |
Here's what I axed or renamed (the new names are more informative, IMO):
- Fact became Attribute. "Alex weighs 178 lbs" is an attribute, not a fact floating in space.
- Observation became Behavior. An observation without a pattern is noise.
- Pattern became Insight. Patterns only earn that label when confirmed across multiple data points.
- Emotion got killed entirely. Stale 20 minutes after creation. No lifecycle, no value.
- Idea became Reference. Ideas are external input until someone acts on them.
- Metric became Attribute with source: "health_import". Just attributes with provenance tracking.
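One way to make the lifecycle column from the table actually enforceable is to encode the taxonomy as data that consolidation jobs look up, instead of hardcoding rules per label. A minimal sketch of that idea; the label names come from the table, but every rule field here (`evergreen`, `staleAfterDays`, and so on) is my own illustrative invention, not the real implementation:

```javascript
// Encoding the 4-layer, 15-label taxonomy as a lookup table so lifecycle
// rules live in one place. Field names and the 90-day staleness window
// are illustrative assumptions.
const LABEL_LIFECYCLES = {
  // Identity: evergreen, only supersession or contradiction removes them
  Attribute:  { layer: 'identity', evergreen: true },
  Preference: { layer: 'identity', evergreen: true },
  Belief:     { layer: 'identity', evergreen: true },
  Value:      { layer: 'identity', evergreen: true },
  Skill:      { layer: 'identity', evergreen: true },
  // Direction: active until completed or abandoned; never auto-archived
  Goal:       { layer: 'direction', autoArchive: false },
  Project:    { layer: 'direction', autoArchive: false },
  Commitment: { layer: 'direction', autoArchive: false },
  Decision:   { layer: 'direction', autoArchive: false },
  // Intelligence: what the system observes
  Behavior:    { layer: 'intelligence', evolvesInto: 'Insight' },
  Insight:     { layer: 'intelligence' },
  Opportunity: { layer: 'intelligence', expires: true },
  Event:       { layer: 'intelligence', immutable: true },
  // World: external knowledge, stale after disuse
  Reference: { layer: 'world', staleAfterDays: 90 },
  Resource:  { layer: 'world', staleAfterDays: 90 },
};

// The "what happens in 6 months?" test, as a predicate a maintenance
// job could call before archiving anything.
function isEvergreen(label) {
  return Boolean(LABEL_LIFECYCLES[label]?.evergreen);
}
```

The point of the data-driven shape: if two labels end up with identical rule objects, that's the signal they should probably be one label.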
The Dedup Gate Got Serious
v1 had a write-time similarity check. Single threshold. It caught obvious duplicates and missed everything else.
The failure mode I didn't anticipate: same-session rewording. During a long conversation, the extraction system would generate multiple phrasings of the same insight minutes apart. "Alex works in bursts" at 2:15 PM and "Alex has a sprinter mentality" at 2:22 PM. Same thing. Different enough embeddings to slip past the threshold.
Two changes fixed it.
First, a two-tier threshold. Items created within the past hour get a 0.75 similarity threshold, aggressive enough to catch rewording. Items older than an hour get 0.85, loose enough to let genuinely similar but distinct items coexist. The recency window was the key insight. Duplicate risk is highest within the same conversation.
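The two-tier logic fits in a few lines. A sketch, with invented function and parameter names rather than the actual pipeline code:

```javascript
// Two-tier dedup threshold: candidates created within the last hour are
// compared aggressively (0.75); older candidates loosely (0.85).
const RECENT_WINDOW_MS = 60 * 60 * 1000; // 1 hour

function dedupThreshold(existingCreatedAtMs, nowMs = Date.now()) {
  const ageMs = nowMs - existingCreatedAtMs;
  return ageMs < RECENT_WINDOW_MS ? 0.75 : 0.85;
}

function isDuplicate(similarity, existingCreatedAtMs, nowMs = Date.now()) {
  return similarity >= dedupThreshold(existingCreatedAtMs, nowMs);
}
```

So a 0.78-similarity match against a node written seven minutes ago gets caught as a rewording, while the same score against a week-old node is allowed to coexist.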
Second, the gate went cross-label. In v1, it only checked within the same node type. So "Alex prefers direct feedback" could exist as both a Preference and a Behavior because they got classified differently by the extraction LLM. v2 checks across all 15 labels. Existing node always wins, regardless of label mismatch. A DUPLICATES edge gets logged for audit.
The whole write pipeline is also mutex-serialized. One write at a time. I know that sounds like a bottleneck. But concurrent creates on a graph with a dedup gate produce race conditions that are miserable to debug. Two nodes with 0.92 similarity both pass the check because they ran simultaneously, and a week later you have SO MANY DUPES.
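In Node, serializing writes doesn't need a real lock, just a promise chain. A minimal sketch of the idea, not the actual pipeline code:

```javascript
// Promise-chain mutex: every write waits for the previous one to settle,
// so the dedup check + create always run as one critical section.
class WriteMutex {
  constructor() {
    this.tail = Promise.resolve();
  }
  run(task) {
    const result = this.tail.then(() => task());
    this.tail = result.catch(() => {}); // a failed write must not block the next
    return result;
  }
}
```

With `mutex.run(() => dedupCheckAndCreate(node))` wrapping every write, two 0.92-similarity nodes can't both pass the gate, because the second one doesn't start its similarity check until the first is committed.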
One more thing I learned the hard way: when a node gets superseded, you need to migrate its edges. All the context edges (ABOUT, TAGGED_WITH, SUPPORTS, etc.) move to the new node. Provenance edges (EXTRACTED_FROM, SUPERSEDES) stay on the old one for audit. Without this, supersession just creates orphans. Connected knowledge becomes disconnected, and your graph traversal hits dead ends you can't explain.
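The move/keep split is just a partition over edge types. A sketch of that rule, assuming the two provenance types named above; the exact membership of the provenance set is whatever your schema treats as audit trail:

```javascript
// Supersession edge migration: context edges re-point at the new node,
// provenance edges stay on the old node for audit.
const PROVENANCE_EDGES = new Set(['EXTRACTED_FROM', 'SUPERSEDES']);

function migrationPlan(edges) {
  const move = []; // re-point at the superseding node
  const keep = []; // leave on the superseded node
  for (const edge of edges) {
    (PROVENANCE_EDGES.has(edge.type) ? keep : move).push(edge);
  }
  return { move, keep };
}
```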
CO_REFERENCED_WITH Is Dead. Long Live co_reference_count.
The feedback loop was the centerpiece of the v1 post. It was also the first thing I had to tear out (RIP).
Quick recap: nodes retrieved together got a CO_REFERENCED_WITH edge. More co-retrieval meant a stronger edge. Stronger edges meant those nodes were more likely to show up in future retrievals. Retrieval fed co-retrieval, co-retrieval strengthened edges, stronger edges pulled in more nodes.
A self-reinforcing loop with no circuit breaker. Popular nodes got more popular. The graph calcified around whatever topics came up most in the first few weeks.
v1's fix (covered in the original post) replaced CO_REFERENCED_WITH with an access event log. That was better, but it was still a separate data structure tracking behavior that should just live on the edges themselves.
v2 killed the access event nodes too. Now co_reference_count is a property directly on typed edges. When graph expansion traverses an existing edge, that edge's count increments. No new edges created. No separate tracking nodes.
The reinforcement is still Hebbian (things that fire together wire together). But now it's scoped:
- +0.05 confidence per co-retrieval, capped at 1.0
- Only strengthens existing typed edges
- Never creates new edges
That last constraint is the important one. The system can reinforce connections it already knows about. It cannot hallucinate new ones. No more phantom relationships. The graph learns which connections matter without inventing connections that don't exist.
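The whole update fits in one small function. A sketch with illustrative property names; the real storage layer obviously differs:

```javascript
// Scoped Hebbian update: only an edge that already exists gets reinforced.
// +0.05 confidence per co-retrieval, capped at 1.0; count just increments.
function reinforce(edge) {
  if (!edge) return null; // no edge, no update: never create one here
  return {
    ...edge,
    co_reference_count: edge.co_reference_count + 1,
    confidence: Math.min(1.0, edge.confidence + 0.05),
  };
}
```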
Vector Similarity in Cypher Will Ruin Your Day
Early on, I computed cosine similarity inside Cypher using REDUCE. The embeddings were 3072-dimensional. With 500+ nodes, that's millions of interpreted loop operations per query.
250+ seconds per retrieval.
I moved the vector math to JavaScript. Native Float64 operations.
Under 1 second.
Graph databases are not vector databases. They're great at traversal. They are genuinely terrible at bulk numerical computation. I should have known this, but I had to watch a query spin for four minutes before it sank in.
The pattern that works:
- JS computes cosine similarity over all embeddings. Returns top-k seed nodes.
- Graph traverses 1-2 hops along typed edges. Expands context around those seeds.
- Reciprocal Rank Fusion merges both result sets.
Vector similarity finds the starting points. Graph traversal understands the connections between them. Each system does what it's built for.
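The JS side of that split is unexciting, which is the point. A sketch of both halves, cosine similarity in a plain loop and a standard Reciprocal Rank Fusion merge (the k = 60 constant is the conventional RRF default, not something specific to this system):

```javascript
// Cosine similarity over raw embedding arrays: native float math,
// no interpreted Cypher loops.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Reciprocal Rank Fusion: score(id) = sum over rankings of 1 / (k + rank).
// Merges the vector-similarity ranking with the graph-expansion ranking.
function rrfMerge(rankings, k = 60) {
  const scores = new Map();
  for (const ranked of rankings) {
    ranked.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((x, y) => y[1] - x[1])
    .map(([id]) => id);
}
```

RRF is a good fit here precisely because the two result sets have incomparable scores (cosine similarity vs. hop distance); fusing on rank sidesteps the calibration problem entirely.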
Side note: Memgraph 3.8 shipped native vector indexes after I built this. I benchmarked it at 31x faster than my JS implementation, with 100% recall@10. The Atomic GraphRAG pattern (single Cypher query: vector search into graph expansion) cuts the whole flow down to one call. The principle still holds: don't compute similarity in interpreted Cypher. Use native indexes or do it externally.
Forgetting Got Smarter Too
v1 had "nightly dedup, weekly patterns, monthly compression." Correct in concept. I hadn't actually specified what any of those did.
v2 has three named algorithms.
Hebbian Reinforcement runs on every retrieval. Co-retrieval strengthens typed edges, +0.05 confidence, capped at 1.0. Never creates new edges, only reinforces existing ones. This is the remembering side. Things that get used together become more tightly associated over time.
Synaptic Pruning runs monthly. Archives weak, unused edges. The criteria: not referenced in 30+ days, confidence below 0.3, co_reference_count below 3. But it protects critical edges: EXTRACTED_FROM, SUPERSEDES, COMMITTED_TO, EVOLVED_INTO. You can prune stale associations. You should not prune provenance or commitments, even old ones.
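The criteria reduce to a single predicate the monthly job runs over every edge. A sketch using the thresholds described above, with illustrative property names:

```javascript
// Synaptic Pruning predicate: archive edges that are old, weak, and rarely
// co-referenced, but never touch provenance or commitment edges.
const PROTECTED_EDGES = new Set([
  'EXTRACTED_FROM', 'SUPERSEDES', 'COMMITTED_TO', 'EVOLVED_INTO',
]);
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function shouldPrune(edge, nowMs = Date.now()) {
  if (PROTECTED_EDGES.has(edge.type)) return false;
  return (
    nowMs - edge.lastReferencedAt > THIRTY_DAYS_MS &&
    edge.confidence < 0.3 &&
    edge.co_reference_count < 3
  );
}
```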
Schema Formation also runs monthly. This one compresses recurring patterns into canonical representations. Edges with co_reference_count >= 10 and confidence above 0.8 are candidates. Community detection (Louvain via Memgraph's MAGE library) finds clusters of 5+ related nodes that keep getting activated together. Those clusters become candidates for consolidation into higher-level knowledge.
There's also an auto-staleness layer for deprecated architecture. When I replaced my old dispatcher with the wake-up engine, every node that referenced the dispatcher became stale. Rather than cleaning those up manually, the monthly maintenance pass auto-archives items referencing systems that no longer exist. Evergreen types (Preference, Belief, Value, Skill, Behavior, Commitment) are protected from this. The system can forget what architecture it used to run on. It should not forget what it knows about the person it's built for.
What I'd Tell You Before You Build This
These are the things I figured out the slow way.
Type your edges or don't bother with a graph. A graph where everything RELATES_TO everything is a slower document store. The whole point is that edge types carry meaning. If your edges don't encode how things are connected, you're paying the complexity cost without getting the benefit.
Your dedup gate is your most important feature. Not your retrieval, not your embedding model, not your consolidation algorithm. A clean graph beats a complete graph. Noise accumulates faster than you think.
Don't compute vectors in your graph query language. Graph databases do traversal. Vector databases do similarity. If you're writing REDUCE over 3072-dimensional arrays in Cypher, stop. I lost a weekend to this.
Every label needs a lifecycle answer. "What happens to this node in 6 months?" If you don't know, you don't have a schema. You have a bucket.
Feedback loops hide in systems that learn from their own behavior. Any time your system creates signal from its own output, you need a circuit breaker. CO_REFERENCED_WITH taught me this. The system was getting better at retrieving the things it had already retrieved. That looks like learning. It isn't.
Server-side guardrails beat prompt engineering. The LLM will drift. It will invent new edge types. It will create nodes that don't match your schema. It will store duplicates with slightly different wording. Hard constraints at the write layer are the only defense that actually holds over time.
Where This Leaves Things
The v1 post ended with "bigger context windows won't save you." Still true. But I'd add that a messy graph won't save you either.
The gap isn't "use a graph instead of vectors." It's "use a graph with discipline." I suppose the same can be said about vector dbs...but the graph still just makes sense to me. Typed edges that carry meaning. Labels with lifecycle rules. A dedup gate that treats cleanliness as a first-class concern. And the willingness to kill features that create more noise than signal.
The system that asked "do you like to cook?" is the same system that now tracks 15 types of knowledge across 4 semantic layers, connected by 18 typed edges. It still runs consolidation at 2 AM. It still mimics sleep. It just has a vocabulary for what it knows now, instead of a pile of things that are vaguely related.
*This is Part 2 of a series on building graph-backed AI memory. Part 1 is here.*