R Hiroshini

Posted on May 19

"The Bug That Forced Us to Add Agent Memory"

#agents #ai #llm #machinelearning


Project: Nexus Core AI OS
Stack: Hindsight (persistent memory) · cascadeflow (runtime intelligence & routing)

1. Introduction

I didn't plan to build a memory system. Nobody on the team did. When we started Nexus Core AI OS, we drew the architecture the way most people do: agents receive input, produce output, and move on. Clean, stateless, predictable. The kind of diagram that looks great in a Notion doc and makes sense for approximately the first week of real usage.

Then the bugs started.

Not crashes, exactly. More like the system quietly forgetting things it should have known. Agents repeating work they'd already done. Routing decisions that made no sense in context. A pipeline that looked healthy in isolation but behaved like it had amnesia the moment you strung more than two tasks together.

This article is about what we found when we dug into it, why stateless design eventually stopped working for us, and what we built to fix it. It's also about the specific bug — a deceptively small one — that forced us to take memory seriously rather than treating it as a nice-to-have.

2. The Problem We Faced

Nexus Core AI OS is built around the idea of a composable AI operating layer — a system where multiple specialized agents collaborate on tasks that range from research and synthesis to code generation, scheduling, and decision support. Each agent has a defined scope. They communicate through a shared runtime managed by cascadeflow, and their outputs are meant to chain together coherently.

The phrase "meant to chain together" is doing a lot of work in that sentence.

In practice, what we found was that coherence degraded the moment tasks required any form of backward reference. An agent asked to summarize findings would produce a summary that contradicted what a previous agent had concluded ten minutes earlier — not because either agent was wrong in isolation, but because neither one knew what the other had done. An agent responsible for routing a task to the right sub-agent would make a decision that was perfectly reasonable if this were the first task in the session, but made no sense given the three tasks that had already completed.

The system was consistent. It was just consistently amnesiac.

We had a lot of conversations early on about whether this was actually a problem or just a limitation we should work around at the application layer. Looking back, that framing was wrong. The amnesia wasn't a quirk of our architecture. It was the architecture. We'd built a system where each agent call was a fresh transaction, and we were surprised that the system had no memory of prior transactions.

3. Why Stateless AI Systems Became a Problem

Statelessness is a virtue in a lot of contexts. It makes systems easier to test, easier to reason about, and easier to scale horizontally. When you're building a simple request-response pipeline, it's exactly the right default. Each call is independent. Nothing leaks between requests. You can replay, retry, and parallelize without worrying about hidden shared state.

The problem is that "agent collaboration on a complex, evolving task" is not a simple request-response pipeline. It's closer to a conversation — and conversations depend entirely on shared context. You can't have a coherent conversation with someone who forgets everything you said the moment you stop speaking. That's not a feature. That's a fundamental communication failure.

In a multi-agent system, the equivalent failure looks like this: Agent A completes a research task and hands off a set of findings to Agent B. Agent B uses those findings to produce a plan. Agent C is then asked to evaluate whether the plan is consistent with the original constraints. But Agent C has no access to what Agent A found or what constraints were in place at the start. Agent C evaluates the plan in a vacuum. The evaluation is useless at best and actively misleading at worst.

We hit this pattern repeatedly. The longer the task chain, the worse it got. Agents were making locally reasonable decisions that were globally incoherent, because "globally" required a shared understanding of history that didn't exist.

There's also a subtler version of this problem: agents were re-doing work. Not because they were misconfigured, but because they genuinely didn't know the work had already been done. We'd see an agent kick off a data retrieval task that a prior agent had already completed and cached. The prior agent's work was sitting there, but the current agent had no way to know that. So it made another retrieval call, spent time and tokens doing it, and produced a result that was nearly identical to what we already had.

Statelessness was costing us real money and real latency, on top of the correctness problems.

4. The Bug That Forced Us to Add Agent Memory

The bug that finally made this impossible to ignore was, on the surface, a small one.

We had a multi-step task running through Nexus Core: an agent pipeline responsible for analyzing a set of documents, extracting key decisions, and producing a structured brief. The pipeline was: extract → classify → synthesize → format. Four agents, four discrete steps, outputs chained sequentially.

The bug appeared in the synthesize step. Intermittently — maybe one in every six or seven runs — the synthesis agent would produce a brief that omitted entire sections that the classify agent had explicitly flagged as high-priority. Not just downweighting them. Omitting them entirely, as if they'd never been classified at all.

At first, we assumed it was a prompt issue. We tweaked the synthesis prompt, ran it again, and it worked fine for a few cycles. Then it dropped sections again.

We assumed it was a model temperature issue. Lowered it. Still happened.

Then we added more explicit logging, and we found the actual cause. The classify agent was writing its output to a shared context object that the synthesize agent was supposed to read from. But cascadeflow, at that point, was managing agent invocations without any guaranteed ordering of context reads relative to upstream writes. In high-throughput conditions — when prior tasks were still completing and flushing their state — the synthesize agent would sometimes read the context object before the classify agent had fully written its classification results. It wasn't reading stale data. It was reading incomplete data.

The synthesize agent didn't know the data was incomplete. It just processed what it received and produced a brief based on a partial view of the classification output. The sections that had been classified after the synthesize agent read the context simply didn't exist from its perspective.

This was a race condition, but the root cause went deeper than timing. The race was possible because we had no durable, session-scoped record of what each agent had produced. Every agent was reading from a transient, in-memory context object that could be partially written, fully written, or not yet written depending on when you looked at it. There was no persistent ground truth. There was no authoritative record of "here is what the classify agent concluded, and it is complete."
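The interleaving can be reproduced deterministically without threads. What follows is a minimal sketch, not the actual Nexus Core code: `classify_writer` and `synthesize` are hypothetical names, a plain dict stands in for the transient context object, and a generator simulates the writer being interrupted between writes.

```python
# Hypothetical reproduction of the partial-read bug; names are illustrative.
from typing import Dict, Iterator, List


def classify_writer(context: Dict[str, List[str]], sections: List[str]) -> Iterator[None]:
    """Write classification results one section at a time, yielding between
    writes to simulate the writer being paused mid-flush."""
    context["high_priority"] = []
    for section in sections:
        context["high_priority"].append(section)
        yield


def synthesize(context: Dict[str, List[str]]) -> List[str]:
    """Build the brief from whatever is in the context right now.
    Nothing tells the reader whether the writer has finished."""
    return list(context.get("high_priority", []))


sections = ["budget", "timeline", "risks"]

# Unlucky interleaving: the reader runs after only one section was written.
context: Dict[str, List[str]] = {}
writer = classify_writer(context, sections)
next(writer)  # the first section is written, then the writer is paused
partial_brief = synthesize(context)

# Lucky interleaving: the writer finishes before the reader runs.
for _ in writer:
    pass
complete_brief = synthesize(context)
```

Both calls to `synthesize` succeed and both results look like valid briefs. Only the second one contains every flagged section, which is why the bug was intermittent and silent.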

We needed memory. Not caching. Not a bigger context window. Actual persistent, queryable, agent-scoped memory that could be written once and read reliably by any downstream agent, regardless of timing.

5. How Hindsight Helped Us Build Persistent Memory

Hindsight is the persistent memory layer we integrated into Nexus Core to solve this. The core idea behind Hindsight is that agent outputs — not just final results, but intermediate states, decisions, and observations — are written to a durable store with enough structure to be queried meaningfully by downstream agents.

When we integrated Hindsight, we changed the architecture in a specific way. Each agent, on completion, writes a memory record to Hindsight. That record contains the agent's output, its classification or category tags, the task context it was operating in, and a timestamp. Before any agent begins its work, it queries Hindsight for relevant prior records from the current session. If it finds them, it incorporates them. If it doesn't find them, it proceeds without them — but at least the absence of prior records is a deliberate, queryable state, not an accident of timing.

This solved the race condition directly. The synthesize agent no longer reads from a transient shared context object. It queries Hindsight for the classify agent's output. Hindsight only returns a record if it has been written and committed. Partial writes don't appear in query results. So if the classify agent hasn't finished, the synthesize agent gets back an empty result set — and we handle that case explicitly, either by waiting or by flagging the dependency as unmet.
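The "only committed records are visible" behavior can be illustrated with a small in-memory stand-in. This is a sketch under assumptions, not the Hindsight API: `CommitGatedStore`, `stage`, `commit`, and `query` are hypothetical names chosen for the example.

```python
# Hypothetical in-memory stand-in for a commit-gated store; not the Hindsight API.
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (session_id, agent_name)


class CommitGatedStore:
    def __init__(self) -> None:
        self._staged: Dict[Key, List[str]] = {}
        self._committed: Dict[Key, Tuple[str, ...]] = {}

    def stage(self, session_id: str, agent: str, item: str) -> None:
        """Accumulate partial output. Staged items are invisible to readers."""
        key = (session_id, agent)
        if key in self._committed:
            raise ValueError(f"record for {key} is already committed")
        self._staged.setdefault(key, []).append(item)

    def commit(self, session_id: str, agent: str) -> None:
        """Publish the staged output in one step as an immutable record."""
        key = (session_id, agent)
        if key in self._committed:
            raise ValueError(f"record for {key} is already committed")
        self._committed[key] = tuple(self._staged.pop(key, []))

    def query(self, session_id: str, agent: str) -> List[Tuple[str, ...]]:
        """Return committed records only; an empty list means 'not ready yet'."""
        record = self._committed.get((session_id, agent))
        return [] if record is None else [record]


store = CommitGatedStore()
store.stage("s1", "classify", "budget: high")
visible_mid_write = store.query("s1", "classify")      # writer not finished
store.stage("s1", "classify", "risks: high")
store.commit("s1", "classify")
visible_after_commit = store.query("s1", "classify")   # complete record
```

The reader never sees a half-written record: it gets either an empty result set, which the caller can treat as an unmet dependency, or the complete output.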

The more interesting benefit was what happened to the longer task chains. Agents started becoming genuinely context-aware in a way they hadn't been before. When a downstream agent queries Hindsight and finds records from three prior steps in the same session, it has real information to work with. It can see what was already attempted, what was already found, what decisions were already made. It's not starting from zero.

We did run into design questions that took some iteration to resolve. The main one was granularity: what should a memory record contain, and at what level of detail? If you write too much, the query results become noisy — downstream agents are pulling in paragraphs of prior output and trying to integrate all of it, which adds latency and increases the chance of the agent's context being polluted with irrelevant history. If you write too little, the records aren't useful.

We settled on a structure where each Hindsight record has a short structured summary — the key output of the agent in a compact, retrievable form — alongside a reference to the full output stored separately. Downstream agents query the summaries first. If they need the full output, they fetch it explicitly. This keeps the common case fast while preserving the full fidelity of each agent's work.
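One way to picture that record shape is shown below. The field names, the `FullOutputStore` class, and the `make_record` helper are assumptions made for illustration; the real Hindsight record schema may differ.

```python
# Hypothetical record layout: a compact summary plus a reference to the full output.
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class MemoryRecord:
    session_id: str
    agent: str
    task_type: str
    summary: str            # short, retrievable form of the agent's key output
    full_output_ref: str    # pointer to the full output, stored separately
    tags: Tuple[str, ...] = ()
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class FullOutputStore:
    """Stand-in for wherever full outputs live (blob store, table, and so on)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def put(self, text: str) -> str:
        ref = uuid.uuid4().hex
        self._blobs[ref] = text
        return ref

    def get(self, ref: str) -> str:
        return self._blobs[ref]


def make_record(blobs: FullOutputStore, session_id: str, agent: str, task_type: str,
                summary: str, full_output: str, tags: Iterable[str] = ()) -> MemoryRecord:
    """Store the full output out of line and keep only a reference in the record."""
    ref = blobs.put(full_output)
    return MemoryRecord(session_id, agent, task_type, summary, ref, tuple(tags))


blobs = FullOutputStore()
full_text = "Section 1: budget (high priority). Section 2: risks (high priority)."
record = make_record(blobs, "s1", "classify", "classification",
                     "2 sections flagged high-priority", full_text,
                     tags=("high-priority",))
fetched = blobs.get(record.full_output_ref)  # explicit fetch, only when needed
```

Downstream agents would normally see only `record.summary`; the full text costs an extra lookup, so the common path stays small.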

6. Runtime Costs and Agent Instability

Memory solved the correctness problem. It introduced a different one.

The first version of our Hindsight integration was naive about write timing and query load. Every agent was writing to Hindsight synchronously at completion and querying Hindsight synchronously at the start of each task. This meant that the runtime cost of agent invocations went up significantly — we were adding latency at both ends of every agent call, and in high-parallelism scenarios where multiple agents were running concurrently, the Hindsight store was fielding a large number of simultaneous read and write operations.

We also saw a new kind of agent instability that we hadn't anticipated. Some agents, when they pulled a large amount of prior context from Hindsight, would start behaving inconsistently. The issue was that we were giving the agents too much history without enough filtering. An agent responsible for a narrow, well-defined task was receiving memory records from entirely different task types in the same session, because our initial query logic wasn't filtering by task type or relevance. The agent's effective context was being diluted by irrelevant history, and its outputs reflected that.

These two problems — runtime cost and context dilution — were related. Both were symptoms of the same underlying issue: we hadn't thought carefully enough about when memory should be written, when it should be read, and how much of it any given agent actually needs.

7. How cascadeflow Helped Us Control Runtime

cascadeflow is the runtime intelligence and routing layer in Nexus Core. It manages how agent invocations are scheduled, sequenced, and routed — including decisions about which agent should handle a given task, when to run agents in parallel versus sequentially, and how to handle failures and retries.

When we hit the runtime cost and context dilution problems, cascadeflow became the place where we addressed them.

The first change was moving memory operations out of the agents themselves and into the cascadeflow orchestration layer. Rather than having each agent query and write Hindsight directly, cascadeflow handles memory at the transition points between agents. Before invoking an agent, cascadeflow queries Hindsight for relevant prior context and passes only the relevant subset to the agent as part of its input. After an agent completes, cascadeflow handles the write to Hindsight. The agents themselves don't need to know anything about the memory layer.

This had two immediate effects. First, it centralized the logic for deciding what "relevant prior context" means. Instead of each agent implementing its own filtering logic (which they were doing poorly and inconsistently), cascadeflow applies a consistent relevance model: query by session ID, by task type, and by recency, with a configurable maximum record count. Agents receive a clean, filtered slice of history rather than a dump of everything that's happened.

Second, it allowed cascadeflow to make memory reads and writes asynchronous where safe to do so. Not all agents need prior context before they can start. For agents that don't have upstream dependencies in the task graph, cascadeflow can invoke them immediately and write their output to Hindsight in the background after they complete. For agents that do depend on prior output, cascadeflow enforces the dependency explicitly and ensures the required Hindsight records exist before invocation.
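The relevance model described above (session ID, task type, recency, and a maximum record count) could look roughly like this. It is a sketch only: `Record` and `select_context` are hypothetical names, and this is not cascadeflow's actual query logic.

```python
# Hypothetical relevance filter; not the cascadeflow implementation.
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Record:
    session_id: str
    task_type: str
    created_at: float  # seconds since epoch
    summary: str


def select_context(records: Iterable[Record], session_id: str,
                   task_types: Set[str], max_records: int = 5) -> List[Record]:
    """Keep records from this session and the wanted task types,
    newest first, capped at max_records."""
    matching = [
        r for r in records
        if r.session_id == session_id and r.task_type in task_types
    ]
    newest_first = sorted(matching, key=lambda r: r.created_at, reverse=True)
    return newest_first[:max_records]


history = [
    Record("s1", "classification", 100.0, "old classify"),
    Record("s1", "research", 110.0, "research notes"),
    Record("s1", "classification", 120.0, "new classify"),
    Record("s2", "classification", 130.0, "other session"),
]

latest_only = select_context(history, "s1", {"classification"}, max_records=1)
all_relevant = select_context(history, "s1", {"classification"})
```

Keeping the filter in one function is the point: every agent gets the same definition of "relevant" instead of each one inventing its own.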

The routing improvements were significant as well. cascadeflow's routing decisions — which agent handles which task — were previously made purely based on task type. After integrating with Hindsight, cascadeflow can now route based on prior session context. If an agent has already handled a related task in the current session and produced output that's relevant to the new task, cascadeflow can factor that into the routing decision. It can also detect when a requested task is substantially similar to one that's already been completed, and short-circuit the invocation entirely by returning the cached result from Hindsight.

That last capability — duplicate task detection — turned out to be more valuable than we expected. In complex pipelines with conditional branching, it's not uncommon for the same subtask to be requested multiple times via different paths. Without Hindsight-aware routing, each request would result in a separate agent invocation. With it, cascadeflow catches the duplicate and returns the prior result, which both reduces cost and keeps the session history clean.
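A simplified version of duplicate detection is sketched below. Note the substitution: the text describes detecting tasks that are "substantially similar", while this example only catches exact duplicates after canonicalizing the parameters. `task_fingerprint` and `DedupingRouter` are hypothetical names, and parameters are assumed to be JSON-serializable.

```python
# Hypothetical exact-match deduplication; real similarity detection would be fuzzier.
import hashlib
import json
from typing import Any, Callable, Dict


def task_fingerprint(session_id: str, task_type: str, params: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order does not change the fingerprint."""
    canonical = json.dumps(
        {"session": session_id, "type": task_type, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupingRouter:
    def __init__(self, run_agent: Callable[[str, Dict[str, Any]], str]) -> None:
        self._run_agent = run_agent
        self._results: Dict[str, str] = {}
        self.invocations = 0

    def handle(self, session_id: str, task_type: str, params: Dict[str, Any]) -> str:
        key = task_fingerprint(session_id, task_type, params)
        if key in self._results:
            return self._results[key]  # short-circuit: no agent invocation
        self.invocations += 1
        result = self._run_agent(task_type, params)
        self._results[key] = result
        return result


router = DedupingRouter(lambda task_type, params: f"{task_type} done")
first = router.handle("s1", "retrieve", {"source": "crm", "limit": 10})
second = router.handle("s1", "retrieve", {"limit": 10, "source": "crm"})  # same task
calls_after_duplicate = router.invocations
router.handle("s2", "retrieve", {"source": "crm", "limit": 10})  # different session
```

Including the session ID in the fingerprint keeps the session boundary intact: the same request in a different session is treated as new work.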

8. Architecture of Nexus Core AI OS

The current architecture of Nexus Core looks like this:

Request Layer — Incoming tasks enter through a structured request format that specifies the task type, the session ID, any explicit parameters, and a priority level. The request layer validates the input and hands it to cascadeflow.

cascadeflow Runtime — cascadeflow parses the task, queries Hindsight for relevant session context, selects the appropriate agent or agent chain, and constructs the input package for each agent invocation. It manages the execution graph — which agents run in parallel, which run sequentially, which depend on the output of others. It handles retries, timeouts, and fallbacks. After each agent completes, cascadeflow writes the output to Hindsight before moving to the next step in the graph.

Agent Layer — Individual agents are single-purpose: one agent for research, one for classification, one for synthesis, one for formatting, and so on. Agents receive a structured input package that includes the task parameters and a filtered slice of prior session context. They produce a structured output. They do not interact with Hindsight or cascadeflow directly.

Hindsight Memory Store — Hindsight maintains a session-scoped, durable record of every agent output in the current session. Records are structured with a summary, a reference to full output, metadata tags, and a session identifier. The store supports queries by session, by task type, by recency, and by tag. cascadeflow is the primary client; agents never query Hindsight directly.

Output Layer — Final outputs are assembled by cascadeflow from the results of the terminal agents in the task graph. For multi-step pipelines, the output layer handles collation and formatting before returning the result.

One thing worth noting: we kept the session boundary explicit. When a new session starts, it gets a new session ID. Hindsight records from prior sessions are not automatically available to agents in the new session. This was a deliberate choice — cross-session memory introduces a whole category of problems around relevance decay, privacy, and context pollution that we weren't ready to solve. Intra-session memory was the problem we needed to solve, and we solved it without coupling it to the harder cross-session problem.

9. Before vs After Adding Memory

The difference in behavior is significant enough that it's worth being concrete about.

Before:

A five-step pipeline would complete successfully in terms of each agent returning an output. But if you read the final output carefully, you'd often find internal inconsistencies: a summary that missed key points from the classification step, a plan that didn't reflect constraints established in the research step, a formatted brief that used different terminology than the upstream agents had agreed on. The system was completing tasks. It wasn't completing them coherently.

Re-running the same pipeline on the same input would sometimes produce noticeably different results — not because of intentional randomness, but because the race conditions in context reads meant the agents were sometimes working with different subsets of prior output.

Duplicate work was common. We estimated that roughly 15-20% of agent invocations in complex pipelines were redundant — agents redoing work that had already been done by an earlier agent in the same session.

After:

Coherence is substantially better. Downstream agents have access to a consistent, complete record of what upstream agents concluded. Terminology is more consistent. Constraints established early in the pipeline are honored by agents later in the pipeline, because those agents can see that the constraints were established and what they were.

Race conditions in context reads are gone. Hindsight only returns committed records, so agents either get complete prior context or get nothing — and we handle both cases explicitly.

Duplicate work has dropped to near zero for task types where cascadeflow can detect similarity. Latency for repeat-pattern tasks is significantly lower because cascadeflow can return cached results without invoking an agent at all.

The failure modes have changed too. Before, failures were often silent — an agent would produce output that looked complete but wasn't, because it had worked with incomplete input. After, dependency failures are explicit. If an agent's required prior context isn't in Hindsight when it's needed, cascadeflow flags the dependency as unmet and handles it through the retry or fallback path. We know when something is wrong.
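The explicit dependency check might be sketched as follows. This is an illustration under assumptions: `UnmetDependencyError`, `invoke`, and the `PIPELINE` mapping are hypothetical, a set of agent names stands in for "records committed to Hindsight", and the commit is recorded synchronously for simplicity.

```python
# Hypothetical dependency gate; a set of names stands in for committed records.
from typing import Callable, Dict, List, Optional, Set


class UnmetDependencyError(RuntimeError):
    """Raised when an agent's upstream output has not been committed."""

    def __init__(self, agent: str, missing: List[str]) -> None:
        super().__init__(f"{agent} is missing committed output from: {', '.join(missing)}")
        self.agent = agent
        self.missing = missing


def invoke(agent: str, depends_on: Dict[str, List[str]], committed: Set[str],
           run: Callable[[str], str]) -> str:
    """Run the agent only if every upstream agent has committed its output."""
    missing = sorted(up for up in depends_on.get(agent, []) if up not in committed)
    if missing:
        # The caller decides what happens next: wait, retry, or fall back.
        raise UnmetDependencyError(agent, missing)
    output = run(agent)
    committed.add(agent)
    return output


PIPELINE = {
    "extract": [],
    "classify": ["extract"],
    "synthesize": ["classify"],
    "format": ["synthesize"],
}

committed: Set[str] = set()
failure: Optional[UnmetDependencyError] = None
try:
    invoke("synthesize", PIPELINE, committed, lambda name: f"{name} output")
except UnmetDependencyError as exc:
    failure = exc  # loud failure instead of a brief built on partial input

for step in ("extract", "classify", "synthesize"):
    invoke(step, PIPELINE, committed, lambda name: f"{name} output")
```

Running `synthesize` too early now produces an error that names the missing upstream step, rather than a plausible brief with sections silently dropped.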

10. Unexpected Problems We Faced

A few things surprised us during this process.

Memory made some agents more conservative. When agents have access to prior session context, they sometimes anchor too heavily on it. An agent asked to propose a solution would see a prior agent's proposal in its Hindsight context and produce something very similar. The prior proposal wasn't necessarily wrong; it was simply sitting right there, and the path of least resistance was to build on it rather than reason independently. We're still tuning the prompt patterns to counteract this.

Session ID management is harder than it looks. We assumed that managing session IDs would be trivial — generate an ID when a session starts, attach it to everything, done. In practice, the definition of a "session" in a multi-tenant, asynchronous system is genuinely ambiguous. Is a new request from the same user 30 minutes later part of the same session? What about a request that's triggered programmatically rather than by a user action? We ended up with an explicit session lifecycle API rather than trying to infer session boundaries from timing heuristics.
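An explicit lifecycle means sessions begin and end only when a caller says so. The sketch below shows the general shape; `SessionManager` and its method names are hypothetical and are not the actual Nexus Core API.

```python
# Hypothetical session lifecycle API; boundaries are explicit, never inferred from timing.
import uuid
from typing import Set


class SessionManager:
    def __init__(self) -> None:
        self._open: Set[str] = set()

    def open_session(self) -> str:
        """Start a session and return its ID. Callers attach this ID to every request."""
        session_id = uuid.uuid4().hex
        self._open.add(session_id)
        return session_id

    def require_open(self, session_id: str) -> None:
        """Reject requests for unknown or closed sessions instead of guessing."""
        if session_id not in self._open:
            raise KeyError(f"session {session_id!r} is not open")

    def close_session(self, session_id: str) -> None:
        """End the session. Later requests must open a new one."""
        self.require_open(session_id)
        self._open.remove(session_id)


manager = SessionManager()
sid = manager.open_session()
manager.require_open(sid)      # accepted: the session is open
manager.close_session(sid)

try:
    manager.require_open(sid)  # rejected: the session was explicitly closed
    rejected_after_close = False
except KeyError:
    rejected_after_close = True
```

A request arriving 30 minutes later, or one triggered programmatically, belongs to a session only if it carries the ID of a session that is still open.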

Hindsight query performance at scale required more thought than we gave it initially. Our initial implementation scanned the full record table and filtered by session ID. That was fine for development. It was not fine with many concurrent sessions each generating dozens of Hindsight records. We had to add proper indexing and think more carefully about data locality.
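The effect of indexing by session can be demonstrated with SQLite from the Python standard library. SQLite is used here only as a stand-in, since the post does not say what storage backs Hindsight; the table layout and index name are assumptions.

```python
# SQLite as a stand-in store; the schema and index name are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        session_id TEXT NOT NULL,
        task_type TEXT NOT NULL,
        created_at REAL NOT NULL,
        summary TEXT NOT NULL
    )
    """
)
# Composite index: equality on session_id, then ordering by created_at.
conn.execute(
    "CREATE INDEX idx_records_session_created ON records (session_id, created_at)"
)
conn.executemany(
    "INSERT INTO records (session_id, task_type, created_at, summary) VALUES (?, ?, ?, ?)",
    [
        ("s1", "extraction", 100.0, "extracted 12 decisions"),
        ("s1", "classification", 110.0, "3 flagged high-priority"),
        ("s2", "extraction", 105.0, "extracted 4 decisions"),
        ("s1", "synthesis", 120.0, "brief drafted"),
    ],
)

QUERY = (
    "SELECT summary FROM records "
    "WHERE session_id = ? ORDER BY created_at DESC LIMIT ?"
)
recent = [row[0] for row in conn.execute(QUERY, ("s1", 2))]

# The fourth column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("s1", 2)))
uses_index = "idx_records_session_created" in plan
```

With the composite index, the lookup for one session no longer depends on how many records other sessions have written, and the newest-first ordering comes from the index rather than a separate sort.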

cascadeflow's routing decisions became more complex with memory. The routing logic that was once a straightforward task-type lookup is now a function of task type, session context, available agents, and prior Hindsight records. The complexity is justified — the routing decisions are genuinely better — but it's also harder to debug when something routes incorrectly. We built a routing trace log that records every factor cascadeflow considered when making a routing decision, which has been invaluable.
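A routing trace can be as simple as one structured record per decision. In the sketch below, `RoutingTrace`, `route`, and the routing rule itself (prefer an agent that has already produced output in the session, otherwise take the default for the task type) are assumptions for illustration, not cascadeflow's actual logic.

```python
# Hypothetical routing trace; the routing rule is a simplified assumption.
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RoutingTrace:
    session_id: str
    task_type: str
    candidates: List[str]    # agents registered for the task type
    prior_agents: List[str]  # agents with records in this session
    chosen_agent: str
    reason: str


def route(session_id: str, task_type: str, agents_by_type: Dict[str, List[str]],
          prior_agents: List[str]) -> Tuple[str, RoutingTrace]:
    """Pick an agent and record every factor that went into the choice."""
    candidates = agents_by_type.get(task_type, [])
    if not candidates:
        raise LookupError(f"no agent registered for task type {task_type!r}")
    reused = [agent for agent in candidates if agent in prior_agents]
    if reused:
        chosen, reason = reused[0], "agent already produced output in this session"
    else:
        chosen, reason = candidates[0], "default agent for task type"
    trace = RoutingTrace(session_id, task_type, list(candidates),
                         list(prior_agents), chosen, reason)
    return chosen, trace


agent, trace = route(
    "s1",
    "synthesis",
    {"synthesis": ["synth-fast", "synth-deep"]},
    prior_agents=["extractor", "synth-deep"],
)
trace_line = json.dumps(asdict(trace), sort_keys=True)  # one log line per decision
```

When something routes incorrectly, the log line shows which candidates were considered, what session history was visible, and why the winner was chosen.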

Memory introduced a new category of correctness bug. Before memory, bugs were mostly about missing information. After memory, we started seeing bugs about wrong information persisting. If an upstream agent produced an incorrect output and it got written to Hindsight, downstream agents would faithfully incorporate the incorrect output into their own reasoning. The error propagated silently through the pipeline. We added a validation step where cascadeflow checks downstream agent outputs for consistency with upstream Hindsight records, and flags discrepancies for review rather than passing them through unchecked.

11. Lessons Learned

The biggest one: stateless is the right default for infrastructure, but it's the wrong default for systems that are supposed to reason over time.

Agents that collaborate on a task are doing something that is inherently stateful. Each agent's work is predicated on what came before. Trying to layer collaboration on top of a stateless architecture means you're constantly fighting the architecture. You end up with implicit shared state (the transient context object that caused our race condition) instead of explicit, managed shared state. Implicit shared state is worse in every way — harder to test, harder to debug, harder to reason about.

The lesson we took from the specific bug was to be skeptical of shared state that isn't managed by a purpose-built system. Our transient context object was shared state. We just didn't think of it that way because it looked like a data structure rather than a database. It had all the problems of shared state — race conditions, partial writes, no atomicity — without any of the tools that purpose-built state management systems provide.

Memory and routing are not independent concerns. One of the best decisions we made was integrating cascadeflow's routing with Hindsight's memory layer. If we'd built them as separate systems that the agents had to coordinate manually, we would have ended up with the same class of problems we started with — just at a higher level. The orchestration layer needs to understand the memory layer to make good decisions.

Start with the smallest memory footprint that solves your problem, and add to it deliberately. Our initial Hindsight integration wrote too much and queried too much. The discipline of thinking carefully about what each agent actually needs — not what it might find useful, but what it actually needs — produced a cleaner system.

Explicit failure modes are better than silent degradation. The before state, where agents would quietly work with incomplete context and produce plausible-but-wrong output, was worse than the after state, where missing context is an explicit, handleable failure. Build systems that fail loudly when something is wrong, even if loud failures feel worse at first. They're better.

12. Final Thoughts

We didn't set out to build a memory system. We set out to build a good AI operating layer, and eventually the absence of memory made that impossible. The bug that forced the issue was small — a race condition in a shared context object — but it was pointing at something fundamental about the design.

Nexus Core is a more capable system now than it was before. The agents are more coherent, the pipelines are more reliable, and the failure modes are more understandable. None of that happened because we followed a design principle about memory-first architectures. It happened because we built something, watched it fail in interesting ways, understood why it was failing, and fixed the actual problem instead of the symptom.

Hindsight gave us the durability and queryability we needed to make memory a first-class concept. cascadeflow gave us the orchestration intelligence to use that memory without making the agents themselves more complex. The combination is what made it work.

There are still open problems. Cross-session memory. Memory relevance decay. Agents that over-anchor on prior context. These are real problems and we'll eventually have to solve them. But they're the right problems to be working on — problems that come from having memory, not from lacking it.

If you're building something similar and you're finding that your agents are inconsistent, redundant, or context-unaware, the answer probably isn't a better prompt. It's probably that your system doesn't remember anything, and you've been treating that as a configuration problem when it's actually a design one.

Nexus Core AI OS — built with Hindsight for persistent memory and cascadeflow for runtime intelligence and routing.
