I Didn't Trust the AI Memory Until I Built This

Gowrideepa Gorige — Tue, 19 May 2026 08:27:04 +0000

When AI Memory Stopped Being a Gimmick: A Retrospective on Building Nexus Core

An engineering retrospective on persistent context, stateless regret, and the slow realization that memory systems are not magic — but they are sometimes necessary.

1. Introduction: Why AI Memory Seemed Unreliable at First

I spent the better part of two years being openly skeptical of AI memory systems. Not dismissive in the hand-wavy way — I had specific, technical reasons. I had watched teams bolt on vector stores to LLM pipelines and end up with systems that confidently recalled the wrong things, surfaced outdated context at the worst moments, and introduced failure modes that were genuinely harder to debug than the problems they were supposed to solve.

My default position was simple: if you cannot trust what your system remembers, you are better off designing around memory entirely. Stateless was predictable. Stateless was testable. Stateless was something I could reason about at 2 AM when something broke in production.

Then we started building Nexus Core, and that position became increasingly difficult to defend.

Nexus Core is an AI operating system layer — an orchestration substrate that coordinates intelligent agents across heterogeneous environments. It is not a single model or a single pipeline. It is a system that needs to operate continuously, across sessions, across users, and across contexts that evolve over time. The moment we moved from prototype into any kind of real use, the stateless assumption started accumulating debt. This article is about when that debt came due, what we did about it, and what we actually learned from it — including the parts that did not work the way we expected.

2. Why I Avoided Memory Systems in AI Agents

The skepticism was not arbitrary. It came from observing what happened when teams added memory to LLM-based systems without thinking carefully about what "memory" actually needed to do.

The first problem is retrieval quality. Vector similarity search is a proximity heuristic, not a facts lookup. When you ask an embedding model to find relevant past context, you get the context that is semantically close to the current query — which is not the same thing as the context that is actually relevant to the current decision. These diverge constantly in practice. A user who asked about "deployment configuration" six sessions ago will have that context surface when they ask about "deploying to staging," even if the old configuration was for a completely different service that has since been deprecated.

The second problem is confidence laundering. LLMs are bad at expressing uncertainty about retrieved context. When a model is given retrieved text and told "here is relevant context from past sessions," it tends to treat that context as authoritative unless it has been specifically tuned not to. This creates a category of failure where the model does not hallucinate from its weights; it hallucinates from stale retrieved data. That failure mode is worse, in some ways, because it looks more grounded.

The third problem is operational. Adding a memory layer adds latency, storage, retrieval infrastructure, and a new class of correctness bugs that are orthogonal to everything else you are already managing. For teams already stretched on infrastructure, this is real cost.

So the original architecture of Nexus Core was designed around passing context explicitly, window-managing conversation history, and structuring tasks to be self-contained where possible. For a while, this worked.

3. The Limitations of Stateless LLM-Based Systems in Production

The theoretical cleanliness of stateless design degrades in contact with real workloads. Here is what that actually looked like.

First, context windows are not free. As sessions get longer and tasks get more complex, the cost of passing full history on every inference call grows quickly. We were making architectural decisions — about what to include in context, what to truncate, what to summarize — that were essentially ad-hoc memory decisions dressed up as prompt engineering. We were building a memory system. We were just building a bad one, implicitly, without admitting it.

Second, users do not operate in discrete, self-contained sessions. This is obvious in retrospect, but it took seeing it fail concretely to internalize it. A user who set up an integration on Tuesday does not re-explain it on Thursday. A user who changed a preference last week expects that preference to persist. When it does not — when the system behaves as if every session is the first — the experience is not "neutral." It is actively frustrating, and users correctly interpret it as the system being dumb.

Third, agents coordinated by Nexus Core need shared operational context. If an agent has already attempted a remediation step and failed, other agents — and future invocations of the same agent — need to know that. Without persistent state, you get repeated attempts at approaches that have already been proven not to work. We saw this happen. It is embarrassing to watch.

The stateless design was not wrong for the reasons I thought it might be wrong. It was wrong because the system we were actually building was not stateless in practice — it was just badly stateful, with the state living in user expectations and never getting written down.

4. The Breaking Point: A Real Failure Caused by Missing Memory

There was a specific incident that forced the conversation.

We had a configuration-heavy workflow where a user was stepping through a multi-stage environment setup over several sessions. The workflow involved dependencies between steps — step three had a precondition that required knowing the output of step one. In a stateful system, this is trivial. In our stateless design, the user was expected to re-anchor the system at the start of each session by providing relevant context.

The user did not do this consistently. In one session, they skipped the anchoring because they had done it multiple times and, reasonably, expected the system to remember. The agent proceeded through step three using what it inferred from the current prompt. It inferred incorrectly. The resulting configuration was wrong in a way that was not immediately obvious: it passed validation but produced incorrect behavior at runtime.

The user spent three hours debugging something that the system had caused by not knowing something it should have been tracking. When they traced it back, they were right to be frustrated. There was no graceful recovery path. We had to walk the configuration back manually.

This was not a model failure. The model did exactly what it was designed to do given the context it had. This was a system design failure. We had built a system that could not be trusted with tasks that required continuity, and then put it in front of workloads that required continuity.

After this incident, the conversation about memory shifted from "should we?" to "how do we do this without making things worse?"

5. The Decision to Experiment with Persistent Context Storage

We approached this with deliberate caution. The goal was not to build a comprehensive memory system. The goal was to introduce the minimum viable persistence layer that would address the class of failures we were seeing, without introducing significant new complexity or new failure modes.

We defined success criteria before writing any code:

The system must not confidently act on stale memory without flagging it
Retrieval latency must be within an acceptable envelope for the use cases we were targeting
The memory layer must be independently inspectable and correctable by operators
We must be able to disable memory for specific agents or workflows without system-wide changes

We also defined what we were not trying to do: we were not trying to give agents permanent autobiographical memory, we were not trying to build a semantic knowledge base, and we were not trying to make the system behave as if it had perfect recall. Perfect recall is not a good goal for a system that processes language. Perfect recall with no mechanism for conflict resolution produces confident wrong answers, not useful ones.

The experiment started with a small subset of workflows — specifically, the multi-session configuration tasks that had caused the most friction. The constraint of a limited scope kept the work tractable and forced us to think concretely about what needed to be persisted versus what we were tempted to persist out of instinct.

6. How We Designed a Lightweight Memory Layer

The design settled on three components, which we tried to keep as independent as possible.

Sessions. Each user-agent interaction is tracked as a named session with a persistent identifier. Sessions carry metadata — creation time, last active time, the agent types involved, and a short structured summary of the session's purpose. The summary is generated at session close by the model itself, with a constrained output format that forces it to be concrete about what was accomplished, what was decided, and what remains unresolved. These summaries are what gets retrieved most often, not raw transcript.

Event logs. Within sessions, significant events are written to an append-only log. "Significant" required definition — we landed on: any decision that affects external state, any user-provided constraint or preference, any failure and its context, and any checkpoint in a multi-step workflow. This is not a transcript. It is a structured record of things that happened that would be consequential to know later.
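
To make the shape of these two components concrete, here is a minimal in-memory sketch. It is not the Nexus Core schema: every class and field name below is hypothetical, chosen only to mirror the session metadata, the three-part summary, and the four categories of significant event described above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class EventKind(Enum):
    # The four categories of "significant" event named in the text.
    EXTERNAL_STATE_DECISION = "external_state_decision"
    USER_CONSTRAINT = "user_constraint"
    FAILURE = "failure"
    CHECKPOINT = "checkpoint"


@dataclass(frozen=True)
class Event:
    session_id: str
    kind: EventKind
    detail: str
    recorded_at: datetime


@dataclass
class SessionSummary:
    # Constrained format: what was accomplished, decided, and left open.
    accomplished: list
    decided: list
    unresolved: list


@dataclass
class Session:
    session_id: str
    created_at: datetime
    last_active_at: datetime
    agent_types: list
    summary: Optional[SessionSummary] = None  # filled in at session close


class EventLog:
    """In-memory stand-in for an append-only log: no update, no delete."""

    def __init__(self) -> None:
        self._events = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def for_session(self, session_id: str) -> tuple:
        return tuple(e for e in self._events if e.session_id == session_id)
```

Freezing `Event` and exposing only `append` is one way to keep the "structured record, not a transcript" property honest in code; a real store would enforce the same thing at the database level.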

Embeddings index. Session summaries and a subset of event log entries are embedded and stored in a vector index. This is what enables semantic retrieval — finding past sessions relevant to a current context without requiring exact string match. We use this sparingly and with explicit confidence thresholds. Retrieval results below a threshold are not passed to the model unless the operator has explicitly configured the workflow to allow lower-confidence recall.
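
The threshold gate can be sketched in a few lines. This assumes similarity scores normalized to the range 0 to 1 and a default threshold of 0.75; both are illustrative assumptions, as are the names `Candidate` and `filter_by_confidence`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    item_id: str
    similarity: float  # assumed to lie in [0, 1]; higher means closer


def filter_by_confidence(candidates, threshold=0.75, allow_low_confidence=False):
    """Drop candidates below the threshold unless the workflow opts in.

    allow_low_confidence models the operator-level override described
    in the text; the 0.75 default is a placeholder, not a tuned value.
    """
    if allow_low_confidence:
        kept = list(candidates)
    else:
        kept = [c for c in candidates if c.similarity >= threshold]
    # Best match first, so callers can truncate to a fixed budget.
    return sorted(kept, key=lambda c: c.similarity, reverse=True)
```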

The retrieval path is worth describing in more detail because it is where most of the interesting engineering lives. When a new session starts, we do a retrieval pass against the index using the initial user message and session metadata as the query. We return at most three candidate past sessions, each represented by its structured summary and relevant event log excerpts. This context is prepended to the system prompt with explicit framing: "the following is retrieved context from past sessions; treat it as potentially relevant but not authoritative." We tested variants of this framing and the explicit qualification made a measurable difference in how the model handled ambiguous or conflicting retrieved information.
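
The prompt assembly step described above might look roughly like this. The framing sentence and the cap of three sessions come from the text; the function name, the input shape, and the exact layout of the prompt are assumptions made for illustration.

```python
FRAMING = (
    "The following is retrieved context from past sessions; "
    "treat it as potentially relevant but not authoritative."
)
MAX_SESSIONS = 3


def build_system_prompt(base_prompt, retrieved):
    """Prepend framed memory to the base system prompt.

    retrieved is a list of (summary, excerpts) pairs, best match first.
    With nothing retrieved, the base prompt is returned unchanged so the
    model is never told that memory exists when none was supplied.
    """
    if not retrieved:
        return base_prompt
    sections = []
    for i, (summary, excerpts) in enumerate(retrieved[:MAX_SESSIONS], start=1):
        lines = [f"Past session {i}: {summary}"]
        lines += [f"  - {excerpt}" for excerpt in excerpts]
        sections.append("\n".join(lines))
    return "\n\n".join([FRAMING, *sections, base_prompt])
```

Keeping the framing in a single constant makes it easy to version and test, which matters given how sensitive model behavior is to that wording.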

The whole system writes to Postgres for sessions and events, with pgvector for the embedding index. This was a deliberate choice to avoid adding a specialized vector database to the stack. The retrieval performance is adequate for our query volumes and the operational simplicity is worth it.

7. What Actually Changed After Adding Memory

The honest answer is: some things got meaningfully better, and some things we expected to improve did not change much.

What improved: multi-session workflow continuity. The specific class of failure that had motivated the project — users losing state across sessions on configuration-heavy tasks — dropped sharply. The system now had the information it needed to not repeat questions users had already answered and not revisit approaches that had already been tried. Users noticed this. The feedback was not "the system remembers things now" — it was "the system seems less frustrating," which is the right proxy for the thing we were actually trying to fix.

What improved less than expected: the handling of user preferences. We had hoped that storing user preferences as event log entries would lead to consistent personalization behavior across sessions. In practice, the model's use of retrieved preferences was inconsistent. It would apply some preferences reliably and ignore others, with no obvious pattern that we could tune around. We eventually concluded that preference application is a retrieval and prompting problem that requires more deliberate architecture than generic event log retrieval — it probably warrants its own dedicated lookup path rather than being folded into general context retrieval.

What did not change: performance on single-session tasks. As expected, adding memory had no effect on tasks that were already self-contained. The retrieval overhead existed, but if the returned context was not relevant, it did not hurt performance in any measurable way — though it also provided no benefit.

8. Edge Cases That Made Memory Difficult

This is where the work got genuinely hard.

Conflicting memories. Users change their minds. They also contradict themselves across sessions without realizing it. When retrieved context contains two events that point in different directions — "user prefers minimal logging" from session 4, "user requested verbose debug output" from session 9 — the model needs to resolve the conflict. In our initial implementation, it did not do this well. It would sometimes average the two signals into something that neither event supported. We eventually added explicit conflict detection at retrieval time — when the semantic similarity between retrieved events is high but their content is contradictory, we flag this and surface it explicitly in the system prompt rather than passing both events silently. This forces the model to reason about the conflict rather than paper over it.
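
A rough sketch of that check follows. The similarity half is plain cosine similarity; the contradiction half is deliberately left as a caller-supplied predicate (`contradicts`), because the text does not say how contradiction is judged. The 0.8 similarity cutoff and all names are assumptions.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_conflicts(events, contradicts, min_similarity=0.8):
    """Return pairs of event texts that are about the same thing but disagree.

    events is a list of (text, embedding) pairs. contradicts(text_a, text_b)
    is a hypothetical predicate; it could be a rule over structured fields
    or a separate model call.
    """
    conflicts = []
    for (text_a, emb_a), (text_b, emb_b) in combinations(events, 2):
        if cosine(emb_a, emb_b) >= min_similarity and contradicts(text_a, text_b):
            conflicts.append((text_a, text_b))
    return conflicts


def conflict_notice(conflicts):
    """Render flagged pairs as explicit lines for the system prompt."""
    return "\n".join(
        f'Conflicting records: "{a}" vs "{b}". Resolve before acting.'
        for a, b in conflicts
    )
```

The pairwise loop is quadratic, which is acceptable only because the retrieved set is small.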

Stale data. This one is obvious in theory and annoying in practice. A session from eight months ago about a deployment configuration for a service that has since been completely redesigned is not helpful — it is actively misleading if the model treats it as current. We added time decay to retrieval scoring, which reduces the effective retrieval weight of older sessions. We also added an explicit staleness flag in the retrieved context framing when events are older than a configurable threshold, prompting the model to weight them with more skepticism.
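
One common way to implement this is exponential decay with a half-life, sketched below. The text does not specify the decay function or the threshold values, so the exponential form, the 90-day half-life, and the 180-day staleness cutoff are all assumptions.

```python
from datetime import timedelta

# Hypothetical defaults; the text says only that the threshold is configurable.
HALF_LIFE = timedelta(days=90)
STALE_AFTER = timedelta(days=180)


def decayed_score(similarity, age, half_life=HALF_LIFE):
    """Halve the retrieval weight for every half_life of age."""
    return similarity * 0.5 ** (age / half_life)


def frame_event(text, age, stale_after=STALE_AFTER):
    """Prefix old events with an explicit staleness marker for the prompt."""
    if age > stale_after:
        return f"[possibly stale, recorded {age.days} days ago] {text}"
    return text
```

Under these assumed values, an eight-month-old session (about 240 days) keeps roughly 16% of its original similarity score, so it can still surface, but only when little else matches.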

Hallucinated recall. This is the one that bothered me most. In some sessions, the model would confidently reference things that were not in the retrieved context and were not in its weights — things that seemed like they could have been in past sessions but were not. It was generating plausible-sounding history. We identified that this was more likely to occur when the retrieved context was thin — when there was some past context but not much of it, the model appeared to interpolate. We added minimum content thresholds for retrieval — if we cannot return at least a meaningful summary with concrete event entries, we return nothing rather than a sparse context that invites interpolation.
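
The all-or-nothing rule is simple to express. In this sketch "meaningful" is approximated by a word count on the summary and a minimum number of events; those proxies and their default values are assumptions, not the thresholds used in practice.

```python
def usable_context(summary, events, min_summary_words=20, min_events=2):
    """Return retrieved context only if it is substantial enough to pass on.

    Returns None, rather than a sparse context, when the summary is
    missing or too short, or when too few concrete events are attached.
    """
    if summary is None:
        return None
    if len(summary.split()) < min_summary_words or len(events) < min_events:
        return None
    return {"summary": summary, "events": list(events)}
```

Returning `None` forces the caller to take the no-memory path explicitly instead of passing a thin context that invites interpolation.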

9. Debugging Memory Correctness Issues

The hardest part of operating this system is verifying that retrieved context is correct and that the model is using it correctly.

Debugging a wrong inference is straightforward when the cause is in the current context. You read the prompt, you see the bad input, you fix it. Debugging a wrong inference caused by retrieved memory is harder because you now need to trace back through the retrieval — what was retrieved, why it was retrieved, whether the retrieved content was accurate, and whether the model interpreted it the way it should have.

We built three operational tools to make this tractable.

The first is a retrieval trace log. Every inference call that uses memory writes a log entry containing the query used for retrieval, the candidate sessions considered, the sessions that were returned, and the confidence scores. This lets us reconstruct the retrieval context for any inference after the fact.
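
A trace entry of this kind could be serialized as one JSON line per inference call. The field names below are hypothetical; the content (query, candidates considered, sessions returned, confidence scores) follows the description above.

```python
import json
from datetime import datetime, timezone


def trace_entry(inference_id, query, considered, returned_ids, now=None):
    """Serialize one retrieval trace as a JSON line.

    considered maps every candidate session id to its confidence score;
    returned_ids lists the subset that was actually passed to the model.
    """
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "inference_id": inference_id,
            "logged_at": now.isoformat(),
            "query": query,
            "considered": considered,
            "returned": [
                {"session_id": sid, "score": considered[sid]}
                for sid in returned_ids
            ],
        },
        sort_keys=True,
    )
```

Logging the rejected candidates alongside the returned ones is what makes it possible to answer "why was this retrieved instead of that" after the fact.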

The second is a memory inspector — a simple internal interface that lets operators view the event log and session summaries for any user or session. This was essential for the conflict cases: when a user reports unexpected behavior, you need to be able to look at what the system believes about that user and compare it to what is actually true.

The third is a correction API. Operators can flag specific event log entries as incorrect or outdated, and can write corrective entries that will supersede older ones in retrieval. This is not elegant, but it provides a recovery path for the cases where the memory is wrong and the wrongness is causing active harm.
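
The flag-and-supersede behavior can be sketched as follows. This is an in-memory illustration with hypothetical names; it mutates entries for brevity, whereas a store that is strictly append-only would record flags and supersessions as additional rows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    entry_id: str
    text: str
    flagged: bool = False                # marked incorrect or outdated
    superseded_by: Optional[str] = None  # id of the corrective entry, if any


class CorrectableLog:
    def __init__(self) -> None:
        self._entries = {}

    def add(self, entry: Entry) -> None:
        self._entries[entry.entry_id] = entry

    def flag(self, entry_id: str) -> None:
        self._entries[entry_id].flagged = True

    def supersede(self, old_id: str, correction: Entry) -> None:
        """Write a corrective entry and point the old one at it."""
        self.add(correction)
        self._entries[old_id].superseded_by = correction.entry_id

    def retrievable(self) -> list:
        """Entries eligible for retrieval: not flagged, not superseded."""
        return [
            e for e in self._entries.values()
            if not e.flagged and e.superseded_by is None
        ]
```

Old entries stay in the log for auditing; they are only excluded from retrieval.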

The broader lesson here is that any system that maintains state about users needs to be operable in the same way that a database is operable. You need to be able to read it, audit it, and correct it. Treating the memory layer as a black box — just an input to the model — is a guarantee that you will not be able to debug the failure modes that matter most.

10. Tradeoffs: Latency, Storage Cost, and Complexity

None of this is free. Here is what it actually costs.

Latency. Each session start incurs a retrieval pass — embedding the query, running the ANN search, fetching session summaries and event excerpts, and assembling the context. In our current setup, this adds roughly 80–150ms to session initialization at the 95th percentile. For asynchronous workflows, this is immaterial. For latency-sensitive interactive flows, it is noticeable. We mitigate this with session pre-warming — for recurring users, we trigger retrieval in the background when we see the first keepalive ping, before the user sends their first message. This eliminates most of the perceived latency.
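
The pre-warming idea can be sketched with asyncio: start retrieval as a background task on the first keepalive, then await that task when the first message arrives. `PrewarmCache` and the `retrieve` callable are hypothetical names, and the sketch omits eviction and error handling.

```python
import asyncio


class PrewarmCache:
    def __init__(self, retrieve):
        self._retrieve = retrieve  # async callable: user_id -> context
        self._tasks = {}

    def on_keepalive(self, user_id):
        """Start retrieval once per user; later pings reuse the same task.

        Must be called from inside a running event loop.
        """
        if user_id not in self._tasks:
            self._tasks[user_id] = asyncio.create_task(self._retrieve(user_id))

    async def context_for(self, user_id):
        """Return pre-warmed context if available, else retrieve now."""
        task = self._tasks.pop(user_id, None)
        if task is None:
            return await self._retrieve(user_id)  # cold path
        return await task
```

If the user's first message arrives before the background task finishes, `context_for` simply waits for the remainder, so the worst case is no slower than retrieving on demand.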

Storage cost. Session summaries and event logs are small. The embedding index is the cost driver, and it scales with the number of embedded items rather than the raw size of sessions. We prune the index aggressively — items older than 12 months are archived out of the live index unless explicitly tagged as persistent. For our current user base, storage cost is not a primary concern, but it would become one at scale without careful pruning policy.
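
The pruning rule reduces to a partition over the index. In this sketch "12 months" is approximated as 365 days and each item is a plain tuple; both are simplifying assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def partition_for_archive(items, now, max_age=timedelta(days=365)):
    """Split index items into those kept live and those to archive.

    items is an iterable of (item_id, embedded_at, persistent) tuples.
    Items tagged persistent stay in the live index regardless of age.
    """
    keep, archive = [], []
    for item_id, embedded_at, persistent in items:
        if persistent or now - embedded_at <= max_age:
            keep.append(item_id)
        else:
            archive.append(item_id)
    return keep, archive
```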

Complexity. This is the real cost. The memory layer adds a component that has its own operational needs, its own failure modes, and its own debugging requirements. It adds surface area to every inference call. It requires ongoing curation (conflict detection, staleness handling, correction tooling), all of which need maintenance. Teams considering this need to be honest about whether they have the operational capacity to run it well. A memory system that is not actively maintained degrades. Stale data accumulates, corrections do not get made, and the system becomes less reliable than stateless would have been.

11. What Surprised Me After Real-World Testing

A few things that I did not anticipate going in.

The value of structured summaries over raw transcripts. I expected retrieval from raw session transcripts to work reasonably well. It does not, or at least not as well as retrieval from structured summaries. The model-generated summaries — which force the system to express what was accomplished, what was decided, and what remains open — are a much better retrieval target than full conversation text. The forcing function of the summary format makes the information more reliable and more useful than free-form retrieval from raw text.

How often users expect the system to know more than it does. Once users understood that the system had memory, they started assuming it tracked things we had not built tracking for. They expected it to remember offhand comments, implied preferences, and contextual details that we had not written event log entries for. Managing expectations about the scope of memory turned out to be as important as the memory implementation itself.

How much the framing of retrieved context matters. The difference between "the following is authoritative context" and "the following is potentially relevant context from past sessions" is not subtle at the model level. The framing affects how the model weights retrieved information relative to current context and its own priors. Getting this framing right required iteration, and it remains one of the more fragile parts of the system — small changes to the framing language produce noticeable differences in model behavior.

Agents trusting each other's memories is a separate, harder problem. We built user memory. Agent-to-agent shared memory, where one agent writes context that another agent reliably reads and uses correctly, has turned out to be significantly harder. The coordination semantics, the write conflicts, and the question of authority over shared state are all more complex than single-user memory. We have not solved this well yet.

12. Final Thoughts: When AI Memory Is Worth It and When It Is Not

After everything, my view on AI memory systems has shifted — but not to the point of enthusiasm. It has shifted to something like: memory is appropriate for a specific class of problems, and it is inappropriate, or at least overkill, for everything else.

It is worth it when:

Your system is designed for multi-session continuity and users will reasonably expect past context to carry forward
The cost of re-establishing context on every session is high and falls on the user
The workflows involve sequential decisions where prior outputs are inputs to future steps
You have the operational infrastructure to inspect, correct, and maintain the memory layer

It is probably not worth it when:

Sessions are short, self-contained, and high-volume — the operational overhead will not be justified by the benefit
Your retrieval quality is not good enough to trust — and you should test this rigorously before committing
You cannot build the correction and inspection tooling — a memory system you cannot debug is worse than no memory
The primary goal is personalization without continuity — recommendation-style personalization usually has better-fit architectures

The core mental model I keep returning to: memory systems do not make systems smarter. They make systems more informed. Whether being more informed leads to better behavior depends entirely on the quality of the information being stored and the reliability of the retrieval. Both of those are engineering problems with real failure modes, and both require sustained attention to get right.

For Nexus Core, the memory layer has been net positive. The class of failures that motivated it has been reduced. The system handles multi-session workflows more reliably, and users encounter fewer situations where they have to repeat context that should have been retained. But it has added operational complexity that we are still managing, and there are categories of memory-related failure that we introduced through the implementation that we did not have before.

That is the tradeoff, and I think it is the honest way to describe what memory systems do. They exchange one set of problems for another. Whether the exchange is worth it depends on your specific workload, your operational capacity, and your willingness to build and maintain the tooling that makes the memory layer trustworthy.

For us, at this point in the project, it was worth it. I would not universalize that conclusion.

This article reflects implementation experience from the Nexus Core AI OS project. Specific performance numbers reflect our infrastructure configuration and workload characteristics; results will vary.

DEV Community: Gowrideepa Gorige