Timur Fatykhov

My AI Agent Forgot My Flight. So I Gave It a Brain.

How a simple flight status question exposed a core limitation of vector-only retrieval for relational memory — and why graphs help.

The Night Everything Broke

It was 11 PM on a Friday. My flight was delayed. Bad weather. I asked my AI agent, Nous, a simple question: What's the latest on my flight? Give me the confirmation code, travel dates, and updated arrival time.
Nous knew all of this. I had told it two weeks earlier. The confirmation code, the flight number, my travel dates — all stored as facts in its memory system. But when I asked, it drew a blank. It couldn't connect the dots.
I had spent three weeks building what I thought was a sophisticated memory architecture: five memory types (episodic, semantic, procedural, working, censors), PostgreSQL with pgvector embeddings, a graph edges table, and even spreading activation inspired by cognitive science. And at the moment it mattered, it failed the simplest test a human brain passes without thinking.
That failure became the most important feature I've shipped.

The Illusion of Understanding
Here's what most AI agent builders don't realize until it's too late: storing information is not memory. A database full of facts is not a mind. The difference between a filing cabinet and a brain isn't storage capacity — it's the connections between what's stored.
If you're building an AI agent with memory, you are doing OS design whether you realize it or not. You're making decisions about memory allocation, garbage collection, access patterns, and process scheduling — the same problems operating system designers have been solving for decades, just at a higher level of abstraction. The question isn't whether your agent needs a memory architecture. It already has one. The question is whether you designed it intentionally.
When I diagnosed the failure, the root cause was embarrassingly clear. I queried the live graph and found 35 edges total: 24 decision-to-decision links, 10 fact-to-decision links, 1 episode-to-decision link. And zero fact-to-fact edges. Zero fact-to-episode edges. Every factual node was a disconnected island.
My agent knew my flight number. It knew my confirmation code. It knew my travel dates. But it had no way to know these facts were about the same trip. The graph existed in the schema, but the wiring was missing. It was a brain with neurons but no synapses.

Why Vector Search Isn't Enough
The most common starting point for AI agent memory in 2025–2026 is some variation of RAG: embed everything into vectors, store them in a vector database, retrieve by cosine similarity. It works surprisingly well for simple fact lookup. But it breaks down the moment you need to reason across related pieces of information.
One recent paper describes this as "contextual tunneling." SYNAPSE (Jiang et al., January 2026) defines it as agents getting stuck in narrow semantic neighborhoods — retrieving facts that are textually similar to the query but missing facts that are semantically related through context, causality, or temporal proximity.
When I searched for "flight delay," vector similarity returned facts about flights. But it didn't traverse to the confirmation code (different semantic space), to the travel dates (a temporal entity, not a flight entity), or to the original booking episode (an event, not a fact). Each of those lived in a different corner of the embedding space, connected only by the invisible thread of "this is all one trip."
This isn't unique to Nous. It is a recurring limitation of vector-only memory when retrieval depends on explicit relational, temporal, or causal structure.

What the Research Says
I've been tracking the agent memory research space closely. Over the past year, I've read through more than a dozen papers, and a notable pattern is emerging: graph-structured retrieval can outperform flat vector retrieval on multi-hop and relational recall tasks.
SYNAPSE (Jiang et al., 2026) models agent memory as a dynamic graph where relevance emerges from spreading activation — borrowed directly from Collins & Loftus' 1975 cognitive model. It uses lateral inhibition and temporal decay to highlight relevant subgraphs while suppressing noise. On the LoCoMo benchmark, it outperforms state-of-the-art systems on temporal and multi-hop reasoning tasks.
MAGMA (Jiang et al., 2026) goes further: it represents each memory across four orthogonal graph views — semantic, temporal, causal, and entity — and formulates retrieval as policy-guided traversal. This solves exactly the failure I experienced: a flight fact, a person fact, and a confirmation code live in different semantic views but share temporal and entity edges.
A-MEM (2025) showed that dynamically linked, self-organizing memory can outperform more static memory setups. Memories aren't static entries — they evolve, link, and sometimes contradict each other.
The comprehensive survey "Memory in the Age of AI Agents" (47 authors, Dec 2025) distinguished three memory dynamics: formation, evolution, and retrieval. Most systems focus only on formation and retrieval. Evolution — the ongoing process of relinking, consolidating, and forgetting — is where the real intelligence lives.
And just this month, A-MAC (Zhang et al., March 2026) formalized what the others implied: memory admission itself is a structured decision. Not everything should be remembered, and what you remember should be scored across five interpretable dimensions: future utility, factual confidence, semantic novelty, temporal recency, and content type.
Building the Fix: Graph-Augmented Recall
I shipped the fix as a four-phase update that transforms Nous's recall system from flat vector search to graph-augmented retrieval. Here's what each phase does:

Phase 1: Graph Expansion. Every recall_deep query now follows graph edges. When you retrieve a fact, the system also pulls its 1-hop neighbors — related facts, connected decisions, linked episodes. A query about a flight number now surfaces the confirmation code, the travel dates, and the booking context.
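A minimal sketch of that 1-hop expansion, using an in-memory edge list (the function and node names here are illustrative, not Nous's actual API):

```python
# Sketch of 1-hop graph expansion. Traversal is undirected, matching a
# polymorphic edge table where either endpoint can be the "source".

def expand_one_hop(seed_ids, edges):
    """Return the seed nodes plus every node one edge away from a seed."""
    seeds = set(seed_ids)
    neighbors = set()
    for src, dst in edges:
        if src in seeds:
            neighbors.add(dst)
        if dst in seeds:
            neighbors.add(src)
    return seeds | neighbors

# Hypothetical memory graph for the flight scenario.
edges = [
    ("flight_number", "confirmation_code"),
    ("flight_number", "travel_dates"),
    ("booking_episode", "flight_number"),
    ("unrelated_fact", "other_fact"),
]
hits = expand_one_hop({"flight_number"}, edges)
# The flight fact now drags in the code, the dates, and the booking episode.
```

In the real system the neighbor lookup is a SQL join against the edges table rather than a Python loop, but the shape of the operation is the same.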

Phase 2: Cross-Type Linking. This was the missing piece. The system now creates polymorphic edges across memory types: fact-to-fact, fact-to-decision, fact-to-episode, episode-to-decision. When a new fact is learned, a FactGraphLinker handler fires on the EventBus, computing embedding similarity against existing decisions and creating "evidence_for" edges automatically. No manual wiring.
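The auto-linking step can be sketched as a similarity scan over existing decisions. The 0.80 threshold matches the cross-type linking config mentioned later in this post, but the function names and data shapes are illustrative, and the real FactGraphLinker is triggered by the EventBus rather than called directly:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def auto_link(fact_id, fact_emb, decisions, threshold=0.80):
    """Yield (source, target, relation) edges for sufficiently similar decisions."""
    return [
        (fact_id, dec_id, "evidence_for")
        for dec_id, dec_emb in decisions.items()
        if cosine(fact_emb, dec_emb) >= threshold
    ]

# Toy 2-d embeddings: the new fact is close to one decision, far from the other.
decisions = {"book_flight": [1.0, 0.1], "buy_desk": [0.0, 1.0]}
edges = auto_link("fact_confirmation_code", [0.9, 0.2], decisions)
```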

Phase 3: Contradiction Detection. When new information conflicts with existing memories, the system uses LLM classification to create "contradicts" or "supersedes" edges. Old facts aren't deleted — they're marked as superseded, maintaining an audit trail. This mirrors how human memory handles updates: the old memory doesn't vanish, it gets contextualized.

Phase 4: Spreading Activation. Inspired by SYNAPSE and the Collins & Loftus model, the system implements density-gated spreading activation for multi-hop retrieval. Activation flows through the graph with configurable decay (default 0.5 per hop), and density gating prevents activation from spreading through highly connected hub nodes that would add noise.

Under the Hood: The Schema
The graph edge table is polymorphic — it connects any memory type to any other:

CREATE TABLE brain.graph_edges (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_id UUID NOT NULL,
    target_id UUID NOT NULL,
    source_type VARCHAR(20) NOT NULL DEFAULT 'decision',
    target_type VARCHAR(20) NOT NULL DEFAULT 'decision',
    agent_id VARCHAR(100) NOT NULL,
    relation VARCHAR(50) NOT NULL CHECK (relation IN (
        'supports', 'contradicts', 'supersedes',
        'related_to', 'caused_by', 'informed_by',
        'evidence_for', 'discussed_in', 'extracted_from'
    )),
    weight FLOAT DEFAULT 1.0,
    auto_linked BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(source_id, target_id, relation),
    CHECK (source_type IN ('decision','fact','episode','procedure')),
    CHECK (target_type IN ('decision','fact','episode','procedure'))
);

The key design choices: source_type/target_type make it polymorphic without foreign keys to every table. The relation enum constrains edge semantics — you can't create arbitrary relationships, only ones the system knows how to traverse. auto_linked flags edges created by the FactGraphLinker versus manually established ones. And the unique constraint on (source_id, target_id, relation) prevents duplicate edges while allowing multiple relationship types between the same pair of nodes.
The Recursive CTE: How Activation Spreads
The heart of the retrieval engine is a recursive Common Table Expression that simulates spreading activation. Vector search produces seed nodes with initial scores; the CTE then propagates activation through graph edges:

WITH RECURSIVE activation AS (
    -- Base case: seed nodes from vector search
    SELECT id, node_type, score AS activation, 0 AS depth
    FROM (VALUES
        ('uuid-1'::UUID, 'fact', 0.92),
        ('uuid-2'::UUID, 'episode', 0.87)
    ) AS seeds(id, node_type, score)
    UNION ALL
    -- Recursive case: spread to neighbors with decay
    SELECT
        CASE WHEN e.source_id = a.id
             THEN e.target_id ELSE e.source_id END,
        CASE WHEN e.source_id = a.id
             THEN e.target_type ELSE e.source_type END,
        a.activation * COALESCE(e.weight, 1.0) * :decay,
        a.depth + 1
    FROM activation a
    JOIN brain.graph_edges e
        ON (e.source_id = a.id OR e.target_id = a.id)
    WHERE a.depth < :max_depth
        AND e.relation != 'contradicts'
)
SELECT id, node_type, SUM(activation) AS total_activation
FROM activation
GROUP BY id, node_type
ORDER BY total_activation DESC
LIMIT 20;

This is where the Fan Effect from cognitive science comes in: activation should be diluted as it spreads across multiple neighbors, so a node with one edge receives the full decayed signal while a node connected to ten others passes only a fraction to each. (The simplified CTE above applies a uniform per-hop decay; the density gating described earlier is what keeps highly connected hubs from flooding the results.) The SUM(activation) aggregation means nodes reached via multiple paths accumulate higher activation — exactly how associative recall works in biological memory: you remember something more strongly when multiple associations point to it.
The contradicts exclusion is deliberate — you don't want contradicted facts gaining activation through the very nodes that disprove them.
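A toy Python version of fan-diluted spreading (plain dict graph, illustrative node names, not Nous's actual engine) shows how degree-based dilution and multi-path summation interact:

```python
from collections import defaultdict

def spread(seeds, edges, decay=0.5, max_depth=2):
    """Spreading activation with fan-effect dilution.

    seeds: {node: initial_activation}; edges: undirected (a, b) pairs.
    Each node splits its outgoing signal evenly among its neighbors,
    and nodes reached via several paths sum their incoming activation.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    total = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(max_depth):
        nxt = defaultdict(float)
        for node, act in frontier.items():
            fan = len(neighbors[node]) or 1
            for n in neighbors[node]:
                nxt[n] += act * decay / fan   # dilute across the fan
        for node, act in nxt.items():
            total[node] += act
        frontier = nxt
    return dict(total)

edges = [("flight", "code"), ("flight", "dates"), ("episode", "code")]
scores = spread({"flight": 1.0}, edges)
# "episode" is reached only at hop 2, through "code", with a small score;
# "flight" accumulates extra activation back from both of its neighbors.
```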
The Cold Start Problem: Hybrid Fallback
There's a practical challenge the research papers don't emphasize enough: on a fresh system, the graph is sparse. In the first days or weeks of use, there aren't enough edges for spreading activation to find anything useful. The CTE runs, finds no neighbors, and returns only the vector search seeds — adding latency for no benefit.
Nous handles this with density-gated activation: before running the recursive CTE, the system computes a graph density metric (average edges per unique node). If density is below the threshold (default: 3.0), it falls back to standard vector search with simple 1-hop neighbor expansion:

def should_use_spreading_activation(settings, cached_density):
    mode = settings.spreading_activation_enabled.lower()
    if mode == "true":
        return True   # Force on
    if mode == "false":
        return False  # Force off
    # "auto" mode: activate only when the graph is dense enough
    return cached_density >= settings.spreading_activation_density_threshold

This "auto" mode means a fresh Nous instance behaves like a standard RAG system — fast, reliable, and limited. As the graph fills in through use and auto-linking, the system gradually transitions to full spreading activation. No manual switch, no cliff edge. The graph earns its way into the retrieval pipeline.
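The density metric itself is simple (total edges divided by unique nodes); a sketch:

```python
def graph_density(edges):
    """Average edges per unique node; 0.0 for an empty graph."""
    nodes = set()
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
    return len(edges) / len(nodes) if nodes else 0.0

# A fresh graph: 2 edges over 3 nodes, well below the 3.0 threshold,
# so "auto" mode falls back to plain vector search.
sparse = [("a", "b"), ("b", "c")]
density = graph_density(sparse)
```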
Performance: HNSW Tuning for Graph Seeding
Spreading activation is only as good as its seeds. Every recall starts with a pgvector similarity search to find the initial nodes, which means HNSW index performance is critical. Nous uses HNSW (Hierarchical Navigable Small World) indexes on all five memory type tables:

CREATE INDEX idx_facts_embedding ON heart.facts
    USING hnsw(embedding vector_cosine_ops);

The default pgvector HNSW parameters (m=16, ef_construction=64) work for most workloads, but there are trade-offs worth understanding:

  • m (connections per layer) — Higher values improve recall accuracy at the cost of index size and build time. For agent memory where you have thousands to tens of thousands of vectors (not millions), the default of 16 is adequate. If you're seeing seed quality issues, bump to 32.
  • ef_construction (build-time search width) — Controls index quality during construction. Higher values produce a better graph at the cost of slower inserts. For memory systems where writes happen at conversation pace (not batch ingestion), 64 is fine.
  • ef_search (query-time search width) — The runtime knob. Default is 40. Nous currently uses the default, but for graph seeding, where seed quality directly determines activation quality, bumping this to 100-200 at query time is a recommended next optimization. The marginal latency cost is negligible compared to the downstream impact of bad seeds on activation spread.

The practical insight: pgvector's HNSW is fast enough that the bottleneck in spreading activation isn't the vector search — it's the recursive CTE. With a 2-hop depth limit and a 20-node result cap, the CTE adds roughly 5-15ms on a well-indexed PostgreSQL instance. That's negligible for a system that's about to spend 500ms+ on an LLM call.

The configuration is straightforward: NOUS_GRAPH_RECALL_ENABLED=True, max depth of 2, decay of 0.5, cross-type linking threshold at 0.80 cosine similarity, contradiction detection on. It runs in production on PostgreSQL with pgvector — no separate graph database needed.

The Architecture vs. What It Learns
This distinction matters, and most writing about AI agents blurs it: there's a difference between the architecture — the infrastructure that ships on day one — and the knowledge the system acquires through use. Nous starts empty. It has a brain, but no memories. Everything it knows, it learned.

The Infrastructure (Built In)
These are the structural components — the cognitive machinery that makes learning possible:
  • The cognitive loop — Sense → Frame → Recall → Deliberate → Act → Monitor → Learn. This is the processing cycle, the equivalent of a brain's neural architecture. It runs the same way on day one as on day one thousand.
  • Five memory type schemas — The database tables and embedding infrastructure for episodes, facts, decisions, procedures, and censors. Think of these as empty filing systems: the drawers exist, but they're empty until the system starts interacting.
  • The graph edges table and spreading activation engine — The mechanism for connecting memories across types. The plumbing that enables "this flight fact is related to that confirmation code fact." This was the missing piece that caused the failure — the infrastructure existed but wasn't wired into the recall pipeline.
  • Sleep consolidation — A 5-phase offline maintenance process modeled on biological sleep: reviewing pending decision outcomes, pruning stale censors, compressing old episodes, reflecting across sessions to extract patterns as durable facts, and generalizing recurring behaviors into procedures. Phases 1 (review) and 4-5 (reflect/generalize) are fully operational — the system already converts episodic conversations into semantic facts through cross-session pattern recognition. Phases 2-3 (pruning and compression) have the scaffolding in place but are still being deepened. This is the episodic-to-semantic pipeline in action: short-term conversational memory consolidates into long-term knowledge, the way human sleep consolidates short-term memory into long-term storage.
  • Memory decay and confidence scoring — Brier-scored calibration tracking, confidence decay over time, and freshness weighting. The math that ensures memories fade appropriately and the system knows how much to trust its own recall.
  • The EventBus and cross-type auto-linking — The reactive wiring (like the FactGraphLinker) that fires when new memories form, automatically creating graph edges. This is infrastructure — the handler is built in, but it only creates edges when there are memories to connect.

What It Learns (Starts Empty)
These are the contents that accumulate through interaction. On a fresh Nous instance, all of these are zero:
  • Facts — Extracted from conversations, not hardcoded. My flight number, my confirmation code, my preferences for Celsius, where I live — all learned through dialogue. The system extracts facts proactively when it detects useful information, but it doesn't ship with any.
  • Episodes — Every conversation creates episodic memories with summaries. These are the "what happened" layer, and they only exist because interactions happened.
  • Decisions — Recorded choices with context, reasoning, confidence levels, and calibration tracking. The decision schema is architecture; the actual decisions and the patterns they reveal are learned.
  • Procedures (Skills) — This is a key distinction. Skills are learned, not pre-loaded. They can be taught from URLs, local files, or inline markdown. A skill might be "how to review a pull request" or "how to search the Serper API" — registered through use, not shipped as features. The trigger patterns that auto-activate skills during recall are part of the learning, not the architecture.
  • Censors (Guardrails) — The censor mechanism is architecture — the ability to match patterns and block or warn. But specific censors are learned from experience and user rules. "Never commit directly to main" is a censor that exists because the user established that rule. "Never store API keys as facts" exists because that's a security lesson. A fresh instance has no censors.
  • Graph edges — The connections between all of the above. Auto-created by the linking infrastructure as memories form, but starting at zero. The 35 edges I found during the failure diagnosis were the sum total of what the system had wired up over weeks of use — and the missing cross-type edges were the gap that caused the failure.

The punchline: Nous is an architecture for learning, not a pre-trained knowledge base. The infrastructure enables a system that gets smarter with every interaction — building its own knowledge graph, developing its own skills, establishing its own guardrails. What you get out of the box is a brain. What you get after months of use is a mind.

Why This Matters Beyond Engineering
The implications of graph-structured agent memory extend beyond developer productivity tools. Two stand out:

Trust and Auditability. In regulated industries — finance, healthcare, legal — being able to trace why an agent made a decision is often more valuable than the decision itself. A flat vector store returns "these were the most similar documents." A graph with typed edges returns "this fact was extracted from this conversation, which informed this decision, which was later contradicted by this newer fact." That's an interpretable causal explanation. When an auditor asks "why did the agent recommend this?", the graph provides a traversable answer chain, not a similarity score.

Persistent Identity and Personalization. Memory isn't a database feature — it's the foundation of identity. An agent that remembers your preferences, learns from your corrections, and builds a model of your work patterns over months is qualitatively different from one that starts fresh each session. This is the "Digital Twin" trajectory: AI partners that develop persistent, evolving models of the people and systems they work with.
The graph is what makes this possible — not just storing facts about a user, but understanding how those facts relate to each other, how they change over time, and which ones matter in which contexts.

What I Got Wrong (And What the Field Gets Wrong)
The biggest lesson from this experience isn't technical — it's philosophical. I had the right architecture on paper. The graph_edges table existed. The neighbors() function existed. The spreading activation concept was in the roadmap. But none of it was wired into the actual recall pipeline.

This is the same mistake I see across the industry. Teams build sophisticated memory schemas, implement vector stores, maybe add a knowledge graph layer — and then retrieve exclusively via embedding similarity. The graph is decorative, not functional. The StructMemEval benchmark (Shutova et al., February 2026) confirmed this at the research level: LLMs can solve structured memory tasks when prompted with structure, but they don't autonomously recognize when to apply it. The agent needs to be explicitly wired to traverse its own graph — it won't discover the capability on its own.

Another thing I got wrong: treating all memory operations as writes. The comprehensive survey "Memory in the Age of AI Agents" (Hu et al., December 2025) identified "memory evolution" as the most neglected dynamic. Most systems — mine included, until recently — focus on storing and retrieving. But memory is alive: facts get stale, confidence should decay, contradictions should be detected and resolved. Forgetting isn't a bug; it's a feature.

The ICLR Signal
The fact that ICLR 2026 dedicated an entire workshop (MemAgents) to memory for agentic systems reflects rising research focus on memory as a key bottleneck. The workshop framing was telling: agent memory is fundamentally different from LLM memorization. It's online, interaction-driven, and under the agent's control.
A reasonable takeaway from the workshop framing is that memory should be treated as part of the cognitive loop, not as a passive log. Episodic memories should consolidate into semantic knowledge. Explicit facts should eventually become implicit weights. Memory management should be an active process, not a storage problem.
We're at an inflection point. A growing body of work suggests that the next major leap in agent capability may not come from bigger context windows or better models — but from memory systems that actually work like memory.

What's Next
Three things I'm building toward, informed by this research:
1. Memory admission control. Not everything should be stored. A-MAC's five-factor scoring (future utility, factual confidence, semantic novelty, temporal recency, content type) provides a principled framework. Right now Nous stores too aggressively — the next evolution is learning what to forget.
2. Deeper consolidation. The basic episodic-to-semantic pipeline is live — sleep consolidation already extracts patterns from conversations and stores them as durable facts. But the compress and prune phases need full implementation: old episodes should be distilled into summaries, stale facts should decay gracefully, and the system should learn what to forget, not just what to remember. The Episodic Memory paper (Pink et al., 2025) maps the full roadmap, and we're partway through it.
3. Multi-view graphs. MAGMA's four-view approach (semantic, temporal, causal, entity) is the right target. Currently, Nous has a single graph with typed edges. Separating into orthogonal views would enable query-adaptive traversal — an "Intent-Aware Router" that detects the nature of a query and selects the corresponding relational view. A "Why" query triggers a topological sort on causal edges, ensuring causes precede effects in context. A "When" query traverses temporal timelines. A "Who" query walks entity edges. Decoupling the memory representation from the retrieval logic this way would improve both reasoning accuracy and token efficiency.
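The admission-control idea in point 1 can be sketched as a weighted blend of A-MAC's five dimensions; the weights and threshold below are illustrative, not values from the paper:

```python
# Sketch of admission scoring: each candidate memory is rated on five
# interpretable dimensions in [0, 1], and only high-scoring candidates
# are stored. Weights and threshold are made up for illustration.

WEIGHTS = {
    "future_utility": 0.30,
    "factual_confidence": 0.25,
    "semantic_novelty": 0.20,
    "temporal_recency": 0.15,
    "content_type": 0.10,
}

def admit(scores, threshold=0.5):
    """Return (should_store, combined_score) for a candidate memory."""
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return total >= threshold, total

# A confirmed booking detail: useful, high-confidence, novel -> store it.
ok, score = admit({
    "future_utility": 0.9,
    "factual_confidence": 0.95,
    "semantic_novelty": 0.7,
    "temporal_recency": 1.0,
    "content_type": 0.8,
})
```

Small talk, by contrast, would score low on future utility and novelty and fall under the threshold, which is exactly the "stop storing too aggressively" behavior described above.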
The flight failure was a gift. It turned a theoretical architecture gap into a production incident with clear symptoms, a diagnosable root cause, and a measurable fix. That's how systems get better — not by anticipating every failure, but by learning from each one and wiring the fix into the system so it can't happen again.
Your agent has amnesia. Mine did too. The cure isn't more storage — it's better connections.

P.S. The Nous source is here: https://github.com/tfatykhov/nous/blob/main/readme_new.md

References
[1] Jiang, H. et al. "SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation." arXiv:2601.02744, January 2026.
[2] Jiang, D. et al. "MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents." arXiv:2601.03236, January 2026.
[3] Jiang, D. et al. "Anatomy of Agentic Memory: Taxonomy and Empirical Analysis." arXiv:2602.19320, February 2026.
[4] Shutova, A. et al. "Evaluating Memory Structure in LLM Agents (StructMemEval)." arXiv:2602.11243, February 2026.
[5] Zhang, G. et al. "Adaptive Memory Admission Control for LLM Agents (A-MAC)." arXiv:2603.04549, March 2026.
[6] Pink, M. et al. "Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents." arXiv:2502.06975, February 2025.
[7] "Memory in the Age of AI Agents: A Survey." arXiv:2512.13564, December 2025 (updated January 2026). 47 authors.
[8] Collins, A. M. & Loftus, E. F. "A Spreading-Activation Theory of Semantic Processing." Psychological Review, 82(6), 1975.
[9] Tulving, E. "Episodic and Semantic Memory." In Organization of Memory, 1972.
[10] Minsky, M. The Society of Mind. Simon & Schuster, 1986.
[11] Kostka, A. & Chudziak, J. A. "Evaluating Theory of Mind and Internal Beliefs in LLM-Based Multi-Agent Systems." arXiv:2603.00142, March 2026.
[12] ICLR 2026 MemAgents Workshop: Memory for LLM-Based Agentic Systems.
[13] Xu, W. et al. "A-MEM: Agentic Memory for LLM Agents." arXiv:2502.12110, February 2025.
