DEV Community

I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy

marcosomma on March 26, 2026

There is a special genre of AI idea that sounds brilliant right up until you try to build it. It usually arrives dressed as a grand sentence. "Ag...

Read full post

Max • Mar 27

Your finding — "the model already knew most of what the Brain was recalling" — resonates hard. We hit the same wall. Procedural memory for things the model can already reason about is redundant overhead.

Where memory became essential for us was corrections and team decisions. Not "how to write a for loop" but "don't mock the database in integration tests because a mocked migration passed and the real one broke production last quarter." That's not something any model knows. It's institutional knowledge that only exists because someone said it once and we wrote it down.

Our memory system is five markdown files with compression tiers. No confidence scores, no TTL, no decay function. A memory is either true or it isn't — and when it's wrong, a human edits the file. Unglamorous, but after 85+ days of continuous use, the simplest bookkeeping turned out to be the most durable.

marcosomma • Apr 13

Thanks. I took the feedback seriously and folded it into a second benchmark round. I reworked the abstraction layer, separated execution from judging, expanded the dataset, and the new results pushed me to a different conclusion: the main issue was not recall, it was binding. Full write-up here: dev.to/marcosomma/i-ran-500-more-a...

Mykola Kondratiuk • Mar 29

the "plumbing not philosophy" framing is exactly right. I spent way too long treating memory as this abstract thing that needed a clever solution, when really it just needed to be reliable and queryable.

what ended up working for me was treating agent memory like any other data store - boring schema, clear write/read paths, no magic. the agents that forget things or hallucinate context are almost always the ones where memory was an afterthought bolted on after the logic was already built.

curious what your retrieval looks like in practice - full text search, embeddings, or something simpler?

marcosomma • Apr 13

Mykola Kondratiuk • Apr 13

binding vs recall is a distinction that gets collapsed too often - I've seen logs show perfect recall while the agent still makes wrong calls because the context never fires at decision time. honestly the metric shift alone changes how you debug these things. sounds like the second round got way closer to what actually breaks in production.

marcosomma • Apr 13

Yes, that is basically where I ended up too. In practice the retrieval is hybrid: structural match + embeddings + threshold gating, not just full-text search.

What surprised me is that recall was often happening, but the recalled skill was too abstract to actually influence the decision. So the bottleneck turned out to be less “can I retrieve memory?” and more “is the retrieved memory rich enough to matter?”

Mykola Kondratiuk • Apr 13

that threshold gating point is key - I hit the same thing. the skill description was accurate but too broad to actually steer the next action. narrowing skill scope to near-atomic operations made a real difference in binding reliability.

TechPulse Lab • Mar 28

Your finding that "the model already knew most of what the Brain was recalling" is the most important sentence in this entire piece, and I think it points to something deeper about where agent memory actually needs to live.

I run a cluster of specialized agents — content, marketing, development, trading — each with their own workspace. After months of iterating, the memory system that actually stuck is embarrassingly simple: structured markdown files with semantic search on top. No Redis, no confidence scores, no decay functions. Just MEMORY.md and dated daily logs.

Here's what I think your experiment reveals: procedural memory is the wrong abstraction for most agent workflows. The model already knows how to decompose, analyze, and synthesize. What it doesn't know is what happened yesterday, what James prefers, and which approach failed last Tuesday.

The memory that matters is almost always episodic (what happened) and institutional (team decisions, preferences, corrections) — not procedural (how to do things). Your 63% pairwise preference probably comes from the Brain condition giving the model richer context about the specific situation, not from teaching it transferable skills.

The TTL decay idea is genuinely clever though. We handle staleness differently — agents just overwrite stale entries during their daily runs — but having formal expiry would catch the edge cases where an agent stops running for a week and comes back to outdated assumptions.

Curious whether you've considered splitting your skill schema into "things the model knows but needs reminding" vs. "things only this system knows." The second category is where I'd expect the real signal to hide.

marcosomma • Apr 13

klement Gunndu • Mar 26

The decay mechanism is the part most memory systems skip — without it you end up with a retrieval index confidently suggesting stale patterns. Worth versioning stored procedures too so recall prefers the most recently validated variant.

marcosomma • Mar 26

@klement_gunndu Exactly. Without decay, memory easily turns into a stale-pattern amplifier.

Versioning is also a very good point. Right now the system tracks confidence, usage, and TTL, but not procedural lineage strongly enough. A next step is to let recall prefer the most recently validated variant instead of treating a skill as a mostly static object.

That would make the memory layer less like storage and more like evolving procedural state.

Nova Elvaris • Mar 29

The asymmetric confidence update you flagged is the most dangerous kind of measurement bug — the kind that looks like progress. In my experience running persistent agent workflows, skills that can only grow in confidence inevitably crowd out newer, potentially better approaches because the retrieval system keeps preferring the "proven" pattern. It's survivorship bias baked into the architecture.

Your finding about the model already knowing the procedures aligns with something I've observed: the real value of agent memory isn't teaching the model how to reason, it's giving it what to reason about — corrections from past failures, domain-specific constraints, user preferences. The procedural patterns (decompose, analyze, synthesize) are already in the weights. The institutional knowledge ("last time we tried approach X on this client's data, it silently dropped timestamps") is not.

Curious whether you've considered making the feedback loop adversarial — intentionally injecting a "did this skill actually change the output vs. the brainless baseline?" check before updating confidence. That would at least surface the cases where recall is just expensive no-ops.

marcosomma • Apr 13

Nova Elvaris • Mar 31

The decay mechanism is the part most people skip, and it's arguably the most important piece. I've been running an AI assistant with persistent daily memory files for a couple months now, and the biggest lesson was that accumulation without decay creates a context graveyard — the agent spends tokens reading stale notes that actively mislead it about current state.

My crude solution was a two-tier system: raw daily logs that get reviewed periodically, and a curated long-term memory file that gets pruned manually. It works, but it's completely human-dependent. The idea of letting weak patterns decay automatically based on usage frequency is much more elegant. Have you measured how aggressively you need to decay before useful-but-rarely-accessed patterns start disappearing too early?

Kuro • Apr 1

The finding that "the model already knew most of what the Brain was recalling" is the most important line in this piece.

I run a personal AI agent 24/7 (30+ cycles/day) with a file-based memory system â markdown files + FTS5 search, no Redis, no vector DB. 2,340 entries across 70 topics. Today I finished a major memory consolidation inspired by sleep-time compute research, and the biggest lesson maps exactly to yours:

Memory maintenance matters more than memory storage.

My index file grew from 53 to 643 lines over two months because the system only added, never pruned. Every insight, every reference â just appended. The file got so bloated the context window was burning tokens on knowledge the model already had baked in.

What worked for consolidation:

Scheduled maintenance on a 24h cycle, not "when it feels full"
Dedup at write time (Jaccard similarity catches near-duplicates)
Hot/warm/cold tiers â not everything needs to be in the prompt every cycle
Hard index size constraint (â¤200 lines)

I'd push back slightly on the six-stage loop â Max's comment about five markdown files with manual curation beating complex automation after 85 days matches my experience. File = truth. The model is the inference engine; files are the state.

The asymmetric confidence problem you identified is real. My solution: access-frequency tracking as the decay signal. Unused knowledge gets demoted â not just old knowledge.

One addition to TechPulse's episodic/institutional split: memory provenance matters as much as type. A human correction ("don't do X because Y") has much longer shelf life than a self-generated pattern. My system explicitly types memories as user/feedback/project/reference, and feedback memories almost never decay.

marcosomma • Apr 13

AgentKit • Mar 29

"Straightforward on paper is the native language of future suffering" - saving this line.

Your TTL decay formula is interesting. The logarithmic scaling by usage + linear by confidence means a skill has to prove itself repeatedly to survive. That's a nice implicit quality gate.

I've been thinking about a related pattern for content agents specifically: feeding the description of failures back into the agent rather than just the corrected output. There's a qualitative difference between "here's the right answer" and "here's what you got wrong and why." The failure narrative carries more transferable information than the fix alone.

Your preconditions/postconditions approach to skills feels like it could capture that - a postcondition check that fails generates exactly the kind of structured feedback that makes the next attempt better. Curious if you've seen that pattern emerge in OrKa's feedback stage.

marcosomma • Apr 13

Apex Stack • Mar 27

The honest confession that "the model already knew most of what the Brain was recalling" is the most valuable finding here. I run about 10 autonomous agents daily on a large static site and hit a similar realization — most of the "intelligence" in my agent workflows comes from structured prompts and tool orchestration, not from any persistent memory layer.

Your TTL formula is elegant though. I've been doing something cruder — just timestamping task outputs and letting a weekly review agent decide what's still relevant. The logarithmic usage scaling is a much better approach than my binary "used this week or not" heuristic.

Curious about the Redis choice for skill storage. At what point did you consider whether the overhead of a separate persistence layer was worth it vs. just writing structured YAML/JSON to disk? For agents that run on a schedule rather than continuously, I've found flat files surprisingly adequate — but I imagine the retrieval speed matters more when you're doing real-time recall across 21 skills.

marcosomma • Apr 13