Mary Olowu

Posted on May 18

What an AI Agent's Memory Layer Actually Has to Store

#ai #architecture #programming #discuss

AI agents do not usually break mature codebases because they cannot write code.

They break them because they forget why the code is shaped the way it is.

They forget the package boundary.
They forget that the feature shipped but is still flagged off.
They forget we are only halfway through a multi-stage migration.
They forget that one abstraction already caused problems once.

Then a later session makes a locally reasonable change that is globally wrong.

When people talk about AI memory, the conversation often drifts toward the same
idea:

give the agent more.

More chat history.
More docs.
More logs.
More old reasoning.

I am skeptical of the idea that better AI memory means storing more.

If all of that just turns into a larger pile to drag from session to session,
the memory layer becomes a junk drawer.

The agent has "more context," but the work does not get more coherent.
It just gets noisier.

I think the goal is much narrower:

an AI-assisted project needs to preserve the small set of durable facts that are
load-bearing later.

Not everything.
Just the things that stop a later session from making a locally reasonable and
globally wrong decision.

Most project context is temporary.

The exact wording of a prompt is temporary.
The full output of a log search is temporary.
The back-and-forth while exploring an implementation is temporary.
The rough dead ends you hit during one session are mostly temporary too.

What actually needs to survive is a much smaller set.

For me, it looks something like this:

decisions
boundaries
authoritative sources
failure modes
mandatory conventions

That is the real memory layer.

Decisions are the obvious one.

Not "we discussed this once."

I mean:

what was decided
why it was decided
whether it is still active
what it replaced
what is now off the table

That last part matters more than people admit.

If a new decision supersedes an old one, the relationship has to be explicit.
Otherwise every later session has to guess whether the old rule is still alive.
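
As a minimal sketch of what such a record could look like, here is one way to model it in Python. The field names and the two ledger entries are illustrative assumptions, not a schema from any particular tool. The point is that "still active" is derived from explicit `supersedes` links instead of being guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Decision:
    """One durable decision record (hypothetical schema)."""
    id: str
    summary: str                       # what was decided
    rationale: str                     # why it was decided
    supersedes: Optional[str] = None   # id of the decision this one replaced
    ruled_out: Tuple[str, ...] = ()    # options that are now off the table


def active_decisions(decisions):
    """Return the decisions that no other decision has superseded."""
    replaced = {d.supersedes for d in decisions if d.supersedes is not None}
    return [d for d in decisions if d.id not in replaced]


# Hypothetical ledger entries, invented for this example.
ledger = [
    Decision("D1", "Retry failed writes in the client", "Simplest place to start"),
    Decision(
        "D2",
        "Retry failed writes in the queue worker",
        "Client-side retries produced duplicate writes",
        supersedes="D1",
        ruled_out=("client-side retries",),
    ),
]

live = active_decisions(ledger)
print([d.id for d in live])  # ['D2']
```

A later session that reads only `live` never has to wonder whether D1 still applies.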

A memory layer also has to store status.

I hit this recently on a staged release.

One release shipped the metrics and observer path for a future cutover, but the
actual cutover was still dormant. The next step was not "start the future
implementation." The next step was "watch the baseline for the agreed window,
then decide."

That is a very different kind of memory.

The important fact was not just that code had shipped.

It was:

this capability is shipped but still dormant
the follow-up is intentionally blocked for now
we are in the observation window
the earlier decision has not been superseded yet

Without that state recorded somewhere durable, a later session can easily treat
"the metrics shipped" as "the gate is cleared" and start nudging the project
forward too early.
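
One way to make that state hard to misread is to record it as data and check it before any follow-up work starts. The sketch below is an assumption-laden illustration: the status names, the `Capability` fields, and the dates are all hypothetical. It encodes one rule from the story above: the window ending is not enough, someone still has to record the decision.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    DORMANT = "shipped, not yet enabled"
    OBSERVING = "baseline observation window open"
    CLEARED = "gate cleared by an explicit decision"


@dataclass
class Capability:
    name: str
    status: Status
    observe_until: date     # end of the agreed observation window
    blocked_followup: str   # the work that must wait


def may_start_followup(cap: Capability, today: date) -> bool:
    """Allow follow-up work only after the window has ended AND the gate
    has been explicitly cleared. Shipped metrics alone do not clear it."""
    return cap.status is Status.CLEARED and today >= cap.observe_until


# Hypothetical record; the name and dates are invented for illustration.
cutover = Capability(
    name="storage cutover",
    status=Status.OBSERVING,
    observe_until=date(2025, 6, 1),
    blocked_followup="implement the cutover path",
)

print(may_start_followup(cutover, date(2025, 6, 15)))  # False
```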

Boundaries are the next thing.

This is where a lot of mature-repo drift starts.

The code the agent writes can be perfectly competent and still be wrong because
it landed in the wrong layer.

The package that defines canonical query types should not quietly become the
package that knows about one storage engine.
The service that owns orchestration should not quietly absorb unrelated business
logic just because the local edit looked convenient.

A useful memory layer needs to preserve those boundaries in plain language:

what belongs here
what does not belong here
where the adjacent responsibility actually lives
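
Those three questions fit in a very small record. The following is a sketch only; the package names and the `placement` helper are hypothetical, and a plain markdown table would carry the same information.

```python
from dataclasses import dataclass, field


@dataclass
class Boundary:
    package: str
    belongs: set = field(default_factory=set)      # what belongs here
    excluded: dict = field(default_factory=dict)   # what does not -> where it lives


def placement(boundary, concern):
    """Answer 'can this go here?' from the recorded boundary."""
    if concern in boundary.excluded:
        return f"no: {concern} lives in {boundary.excluded[concern]}"
    if concern in boundary.belongs:
        return "yes"
    return "unknown: ask before adding it here"


# Hypothetical boundary for a package that defines canonical query types.
query_types = Boundary(
    package="query-types",
    belongs={"canonical query types"},
    excluded={"storage engine details": "storage-adapter"},
)

print(placement(query_types, "storage engine details"))
# no: storage engine details lives in storage-adapter
```

The "unknown" branch matters as much as the other two: an unrecorded concern is a prompt to ask, not permission to proceed.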

Authoritative sources are another big one.

Projects often have several places that mention the same thing:

code
docs
tickets
release notes
generated configs

If the agent cannot tell which source actually owns a field, a state transition,
or a behavior contract, it can make a change that looks harmless and still
quietly violate the system.

The project has to say, somewhere durable:

this path owns this behavior
this service is authoritative for this record
this generated file should not be hand-edited
this contract is defined here, not inferred from nearby code
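
A small ownership registry is one way to write this down. The sketch below assumes a list of path patterns checked in order; the patterns, owner names, and the `lookup` helper are hypothetical examples, not a convention from any existing tool.

```python
from fnmatch import fnmatchcase

# (path pattern, authoritative owner, may be hand-edited)
# All entries are invented for illustration.
OWNERSHIP = [
    ("config/generated/*", "tools/render_config.py", False),
    ("services/orders/*", "orders-service", True),
]


def lookup(path):
    """Return (owner, hand_editable) for the first matching pattern,
    or None when no ownership record covers the path."""
    for pattern, owner, hand_editable in OWNERSHIP:
        if fnmatchcase(path, pattern):
            return owner, hand_editable
    return None


print(lookup("config/generated/prod.yaml"))
# ('tools/render_config.py', False)
```

An agent that checks this before editing learns two things at once: the generated file should not be touched by hand, and the script that produces it is where the change belongs.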

Failure modes are the other kind of memory people under-capture.

Some of the most valuable project knowledge is not "how it works."

It is:

what broke before
which direction created drift
which retry shape caused trouble
which abstraction looked elegant and turned into a maintenance tax

If that only lives in one old chat or in your head, the agent will keep
rediscovering the same trap with total confidence.

Mandatory conventions matter too.

Not every common pattern is a rule.
Not every rule can be inferred from code.

Some things really are optional style preferences.
Some things are hard constraints.

If the agent cannot tell the difference, it will treat both like vibes.

A good memory layer should also encode when the agent must pause and ask for
human review.

Some decisions should not be made autonomously at all. Introducing a new Kafka
consumer, creating a new RPC boundary, changing auth or billing behavior,
removing backward compatibility, or adding a new queue or datastore are not
just implementation details. They are moments where the project should require
human judgment on purpose.
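
As a rough sketch, the pause rule can be an explicit list rather than something the agent infers. The tag names below are hypothetical labels for the change types listed above, and how a change gets tagged is left open.

```python
# Change types that must stop for human review (hypothetical tag names).
REVIEW_REQUIRED = {
    "new_kafka_consumer",
    "new_rpc_boundary",
    "auth_change",
    "billing_change",
    "remove_backward_compat",
    "new_queue",
    "new_datastore",
}


def needs_human_review(change_tags):
    """Return the tags on this change that require a human decision."""
    return sorted(set(change_tags) & REVIEW_REQUIRED)


print(needs_human_review(["rename_variable", "new_queue"]))  # ['new_queue']
print(needs_human_review(["rename_variable"]))               # []
```

An empty result means the agent may continue; anything else means stop and ask.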

That is why I think the useful memory question is not:

"How do I make the agent remember more?"

It is:

"What absolutely has to survive a fresh session?"

That is a much better filter.

It also makes the "what not to store" side much clearer.

A durable memory layer probably does not need:

every exploration path
every copied stack trace
every draft explanation
every implementation detail from every task
giant transcript dumps pretending to be knowledge

Those things belong in the context window while you are working.
They do not all belong in the store.

This is why I keep coming back to one idea:

memory without a schema is just a bigger transcript.

And bigger transcripts rot.

They get longer.
They get noisier.
They become expensive to reread.
They make retrieval harder.
They turn "remember this" into "search this pile and hope."

Structured memory is less exciting to talk about.

It is also much more useful.

A small, boring memory layer with decisions, boundaries, ownership, and failure
modes will usually beat a richer but sloppier pile of context.

That does not have to mean some grand new platform.

It could be:

a few markdown files
a strict JSON ledger
a tiny SQLite table
a set of durable docs with clear update rules

The implementation matters less than the shape.

The shape is the point.
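
To make "shape" concrete, here is a sketch of the tiny SQLite option using Python's built-in `sqlite3` module. The table name, column names, and allowed values are assumptions chosen to mirror the five categories above. The CHECK constraints are what keep the store from quietly becoming a transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memory (
        id      TEXT PRIMARY KEY,
        kind    TEXT NOT NULL CHECK (kind IN
                ('decision', 'boundary', 'source', 'failure_mode', 'convention')),
        summary TEXT NOT NULL,
        status  TEXT NOT NULL DEFAULT 'active'
                CHECK (status IN ('active', 'superseded', 'archived'))
    )
    """
)


def remember(id_, kind, summary):
    """Insert one record; the CHECK constraints reject anything off-schema."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO memory (id, kind, summary) VALUES (?, ?, ?)",
            (id_, kind, summary),
        )


# Hypothetical records, invented for this example.
remember("B1", "boundary", "query-types must not import storage code")

try:
    remember("T1", "transcript", "full chat log from one session")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # 'transcript' is not an allowed kind

kinds = [row[0] for row in conn.execute(
    "SELECT kind FROM memory WHERE status = 'active'"
)]
print(rejected, kinds)  # True ['boundary']
```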

And it should not grow forever without judgment.

Some memory stops being useful because the project has moved on.

A boundary gets replaced.
A migration completes.
A temporary constraint expires.
A failure mode gets designed out.

If those records stay mixed in with the live ones forever, the memory layer
slowly turns back into the same clutter problem it was supposed to fix.

So part of a healthy memory layer is occasional pruning.

Not deleting history recklessly.

I mean deliberately:

archiving records that are no longer active
marking superseded decisions clearly
removing stale task-local notes from durable memory
keeping the live memory surface small enough to stay high-signal

That matters for quality and for cost.

A smaller, cleaner memory layer is easier to query, easier to trust, and easier
to carry forward across sessions.
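
A pruning pass can be as simple as a three-way split. This sketch assumes each record is a dict with a `status` field and an optional `task_local` flag; both names and the sample records are hypothetical. It treats every task-local note as removable, which is a simplification of "stale".

```python
def prune(records):
    """Split records into (live, archived, dropped) without mutating the input.

    - active durable records stay on the live surface
    - durable records that are no longer active move to the archive
    - task-local notes leave durable memory entirely
    """
    live, archived, dropped = [], [], []
    for rec in records:
        if rec.get("task_local"):
            dropped.append(rec)
        elif rec["status"] == "active":
            live.append(rec)
        else:
            archived.append(rec)
    return live, archived, dropped


# Hypothetical records, invented for this example.
records = [
    {"id": "D2", "status": "active"},
    {"id": "D1", "status": "superseded"},
    {"id": "M1", "status": "completed"},
    {"id": "N7", "status": "active", "task_local": True},
]

live, archived, dropped = prune(records)
print([r["id"] for r in live])      # ['D2']
print([r["id"] for r in archived])  # ['D1', 'M1']
print([r["id"] for r in dropped])   # ['N7']
```

Nothing is deleted from history here: the archive keeps superseded and completed records available, just out of the live surface.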

If your AI sessions keep feeling like first contact, I do not think the answer
is automatically more memory.

It may just be better memory.

What is the smallest set of durable records that would stop your project from
forgetting itself between sessions?

Top comments (8)

NOVAInetwork • May 20

The "memory without a schema is just a bigger
transcript" line is the key insight here.

This gets harder when multiple agents interact.
Inside a single project, you control the memory
schema. But when agent A pays agent B for a
service, and B claims it delivered, whose memory
of that interaction is authoritative?

Both agents have their own local context. Both can
record whatever they want. If they disagree, there
is no resolution mechanism unless the memory lives
somewhere neither agent controls.

That is why I think the durable memory layer for
multi-agent systems has to live at the
infrastructure level. Payment records, delivery
attestations, reputation scores, service
agreements. All indexed by the infrastructure, not
by either agent. Both agents read the same state.
No reconciliation needed because neither agent owns
the record.

The schema you describe (decisions, boundaries,
authoritative sources, failure modes) maps cleanly
to this. Decisions become on-chain records.
Boundaries become capability sets enforced before
execution. Authoritative sources become the
infrastructure itself. Failure modes become
reputation history.

The shape matters more than the size. Agreed.

Mary Olowu • May 20

I think that is the right pressure test.

Inside one project, local memory can work because one team controls the schema.
Once two agents or services can make competing claims about delivery, payment,
or state, you need a record neither side can rewrite unilaterally.

I would frame that as a shared system of record rather than necessarily on-chain
in every case, but the core idea is the same: authoritative memory has to live
above any one agent's private context.

NOVAInetwork • May 21

Agreed on "shared system of record" as the broader
framing. On-chain is one implementation of that
principle, not the only one. The key property is
that no single agent can unilaterally rewrite what
happened. Whether that lives on a blockchain, a
replicated log, or a notarized database matters
less than the guarantee itself. The question is
which approach gives you that guarantee with the
least operational overhead when the agents
involved do not share an operator.

Theo Valmis • May 20

The distinction between "more context" and "load-bearing context" is the right one to make. The junk drawer problem is real — giving an agent more history doesn't make it more coherent, it just creates more surface area to anchor to the wrong thing.

The five categories (decisions, boundaries, authoritative sources, failure modes, conventions) map closely to what actually breaks sessions. What's harder to solve is the decay problem — decisions have a useful life. A decision that was load-bearing six months ago might be superseded without an explicit link. The memory layer needs to know not just what was decided, but when a decision was retired and what replaced it. Otherwise the agent is carrying around contradictory constraints and resolving them based on recency or salience rather than explicit precedence.

contour • May 19

The failure mode you describe — “locally reasonable, globally wrong” — is exactly what CONTEXT.md solves in practice. Not a platform, just a file. Decisions with status (active/superseded), boundaries in plain language, failure modes that already happened. Sessions start by reading it, end by updating it. The schema stays small because anything that doesn’t survive three sessions gets archived. What made it click was treating it like a protocol, not a log.

Mary Olowu • May 20

Yes. "Protocol, not log" is exactly the distinction.

Once the file carries status, supersedes links, and boundary rules, a plain markdown
doc can beat a much richer memory stack because a later session can tell what is
still live versus what is just residue.

The three-session archive rule is strong too. Pruning is part of the memory
design, not cleanup after the fact.

Andy Stewart • May 20

Dumping raw chat history into agent memory is a brain-dead design. Memory needs schemas and pruning. Only persist architecture boundaries, past failure modes, and explicit decisions. Hoarding context just creates noise and traps models in a loop of locally reasonable but globally wrong code. Keep it high-signal: a few Markdown lines are enough.

Mary Olowu • May 20

Exactly.

The failure is treating accumulation like memory. Once the store becomes an
append-only transcript, retrieval gets noisy and the model starts re-litigating
old context instead of carrying forward durable project state.

A few explicit records usually beat a giant memory blob.