<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kaelii</title>
    <description>The latest articles on DEV Community by Kaelii (@kaelbit).</description>
    <link>https://dev.to/kaelbit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3794803%2Fc6f90c25-040a-451e-99a7-3d78695e1f42.png</url>
      <title>DEV Community: Kaelii</title>
      <link>https://dev.to/kaelbit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kaelbit"/>
    <language>en</language>
    <item>
      <title>How We Architected a Cognitive Memory Engine for AI Agents (10MB Rust Binary)</title>
      <dc:creator>Kaelii</dc:creator>
      <pubDate>Sun, 01 Mar 2026 20:32:18 +0000</pubDate>
      <link>https://dev.to/kaelbit/how-we-architected-a-cognitive-memory-engine-for-ai-agents-10mb-rust-binary-1m8f</link>
      <guid>https://dev.to/kaelbit/how-we-architected-a-cognitive-memory-engine-for-ai-agents-10mb-rust-binary-1m8f</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/kaelbit/adding-a-lifecycle-to-ai-agent-memory-372i"&gt;The previous article&lt;/a&gt; introduced engram-rs's three-layer memory architecture and design motivation. This one tackles a more specific question: &lt;strong&gt;how does retrieval quality not degrade as memories accumulate?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer lives in the scoring algorithms. Here's a visual breakdown of five core mechanisms.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Use It or Lose It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl5qx0ua7u6ltgecl9zn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl5qx0ua7u6ltgecl9zn.png" alt="Memory Lifecycle" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left panel: a memory that's never recalled after storage. Importance decays smoothly, sinking to the bottom layer.&lt;/p&gt;

&lt;p&gt;Right panel: a memory that gets periodically recalled. Each retrieval triggers an &lt;strong&gt;activation boost&lt;/strong&gt; (yellow dots), pushing importance back up. The red dashed line shows the unrecalled trajectory for comparison.&lt;/p&gt;

&lt;p&gt;This isn't a feature — it's the system's first principle: &lt;strong&gt;a memory's survival is determined by how often it's used.&lt;/strong&gt; Retrieval isn't just a read operation — it's also a vote telling the system this memory still matters.&lt;/p&gt;

&lt;p&gt;The result? After hundreds of consolidation epochs, frequently used knowledge stays prominent, stale noise naturally sinks, and retrieval quality doesn't degrade as total memory count grows.&lt;/p&gt;
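&lt;p&gt;A toy simulation makes the two panels concrete. This is only a sketch: the decay rate, boost size, and recall cadence below are illustrative constants, not engram-rs's actual ones.&lt;/p&gt;

```python
import math

def simulate(epochs, recall_every=None, decay_rate=0.02, boost=0.15, floor=0.01):
    """Trajectory of one memory's importance: exponential decay each epoch,
    optionally punctuated by an activation boost whenever it is recalled."""
    importance = 1.0
    history = []
    for epoch in range(1, epochs + 1):
        importance = max(floor, importance * math.exp(-decay_rate))
        if recall_every and epoch % recall_every == 0:
            importance = min(1.0, importance + boost)  # activation boost on recall
        history.append(importance)
    return history

never_recalled = simulate(200)             # left panel: sinks steadily
recalled = simulate(200, recall_every=25)  # right panel: boosted back up
```

&lt;p&gt;Run both for the same number of epochs and the periodically recalled memory ends well above the unrecalled one: retrieval acts as the vote described above.&lt;/p&gt;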




&lt;h2&gt;
  
  
  2. Exponential Decay, Not Linear
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxen8aek0lizpoe6ydjx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxen8aek0lizpoe6ydjx7.png" alt="Ebbinghaus Forgetting Curve" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The previous article used &lt;code&gt;importance × e^(-decay_rate × idle_hours / 168)&lt;/code&gt; for retrieval-time recency weighting. But how does importance itself decay? That's what actually determines whether a memory lives or dies.&lt;/p&gt;

&lt;p&gt;Three curves show the decay trajectories for each memory kind:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kind&lt;/th&gt;
&lt;th&gt;Half-life&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;episodic&lt;/td&gt;
&lt;td&gt;~35 epochs&lt;/td&gt;
&lt;td&gt;"Yesterday's debug log" — should fade if unused&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;semantic&lt;/td&gt;
&lt;td&gt;~58 epochs&lt;/td&gt;
&lt;td&gt;"Auth uses OAuth2" — knowledge decays slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;procedural&lt;/td&gt;
&lt;td&gt;~173 epochs&lt;/td&gt;
&lt;td&gt;"Deploy steps" — procedures should almost never fade&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The floor is 0.01. Memories never truly reach zero — given a precise enough query, a sunken memory can still be retrieved. This mirrors a human memory property: you think you've forgotten, but the right cue pulls it back.&lt;/p&gt;

&lt;p&gt;Why exponential instead of linear? Linear decay has a fatal flaw: &lt;strong&gt;the cliff.&lt;/strong&gt; The moment importance linearly decrements to zero, the memory is permanently lost with no chance of recovery. Exponential decay never reaches zero — it just gets closer and closer, leaving an infinitely long tail.&lt;/p&gt;
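&lt;p&gt;The half-lives in the table pin down per-epoch decay rates via rate = ln(2) / half_life. A minimal sketch, assuming importance is multiplied by e^(-rate) each consolidation epoch and floored at 0.01:&lt;/p&gt;

```python
import math

HALF_LIVES = {"episodic": 35, "semantic": 58, "procedural": 173}  # in epochs
FLOOR = 0.01  # importance never decays below this

def decay(importance, kind, epochs_idle):
    rate = math.log(2) / HALF_LIVES[kind]  # half-life to per-epoch rate
    return max(FLOOR, importance * math.exp(-rate * epochs_idle))

# After 100 idle epochs, an episodic memory has lost most of its importance,
# while a procedural one is still comfortably above half strength.
```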




&lt;h2&gt;
  
  
  3. Logarithmic Saturation for Reinforcement
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl0jyprfechae6hbptha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl0jyprfechae6hbptha.png" alt="Reinforcement Signals" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a memory is stored repeatedly or recalled multiple times, its weight increases. But the growth curve is logarithmic, not linear.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rep_bonus  = 0.17 × ln(1 + repetition_count),  cap 0.7
access_bonus = 0.12 × ln(1 + access_count),    cap 0.55
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why logarithmic?&lt;/p&gt;

&lt;p&gt;Consider a counterexample: if rep_bonus were linear (say, 0.1 × count, cap 0.5), then a memory stored 5 times would max out its bonus. The 6th, 50th, and 500th submission — all identical in effect. You can't distinguish "mentioned a few times" from "repeatedly emphasized."&lt;/p&gt;

&lt;p&gt;Logarithmic growth pushes the saturation point out to ~60 reps and ~100 accesses (where each bonus reaches its cap). The first few interactions matter most, then returns diminish while still contributing. This matches human learning research — spaced repetition works, but each additional review yields less marginal benefit.&lt;/p&gt;
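&lt;p&gt;Both bonus formulas are small enough to sketch verbatim (coefficients and caps as quoted above):&lt;/p&gt;

```python
import math

def rep_bonus(repetition_count):
    # 0.17 * ln(1 + n), capped at 0.7
    return min(0.7, 0.17 * math.log(1 + repetition_count))

def access_bonus(access_count):
    # 0.12 * ln(1 + n), capped at 0.55
    return min(0.55, 0.12 * math.log(1 + access_count))

# Early interactions dominate: going from 1 to 5 repetitions adds more
# than going from 50 to 500, which is mostly absorbed by the cap.
steps = [round(rep_bonus(n), 3) for n in (1, 5, 50, 500)]
```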




&lt;h2&gt;
  
  
  4. Additive Biases Instead of Multiplicative
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69dbbk1q32uu1yfqzxmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69dbbk1q32uu1yfqzxmg.png" alt="Kind × Layer Weight Bias" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A memory's final weight is also influenced by its kind and layer. The chart shows the weight effect for all nine combinations (3 kinds × 3 layers):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;procedural + core ranks highest (+0.15 + 0.1 = +0.25)&lt;/li&gt;
&lt;li&gt;episodic + buffer ranks lowest (-0.1 - 0.1 = -0.2)&lt;/li&gt;
&lt;li&gt;semantic + working is the baseline (0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why emphasize "additive"?&lt;/p&gt;

&lt;p&gt;An earlier version used multiplication: procedural memories ×1.3, core layer ×1.2. Sounds reasonable, but 1.3 × 1.2 = 1.56, while episodic × buffer = 0.8 × 0.8 = 0.64. The gap between the highest and lowest is &lt;strong&gt;2.4×&lt;/strong&gt; — procedural + core would systematically crush everything else, regardless of how relevant the content actually is.&lt;/p&gt;

&lt;p&gt;Additive biases compress this ratio to under 1.6×. Kind and layer still influence ranking, but not enough to override the semantic relevance signal itself.&lt;/p&gt;
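&lt;p&gt;The contrast is easy to check numerically. A sketch assuming a baseline weight of 1.0 before biases are applied (the multiplicative factors are the ones quoted from the earlier version):&lt;/p&gt;

```python
# Earlier (multiplicative) scheme vs the current additive biases.
KIND_MULT = {"procedural": 1.3, "semantic": 1.0, "episodic": 0.8}
LAYER_MULT = {"core": 1.2, "working": 1.0, "buffer": 0.8}

KIND_BIAS = {"procedural": 0.15, "semantic": 0.0, "episodic": -0.1}
LAYER_BIAS = {"core": 0.1, "working": 0.0, "buffer": -0.1}

# Spread between the best (procedural + core) and worst (episodic + buffer) combos:
mult_ratio = (KIND_MULT["procedural"] * LAYER_MULT["core"]) / (
    KIND_MULT["episodic"] * LAYER_MULT["buffer"])         # 1.56 / 0.64 = 2.4375

base = 1.0  # hypothetical baseline weight
add_ratio = (base + KIND_BIAS["procedural"] + LAYER_BIAS["core"]) / (
    base + KIND_BIAS["episodic"] + LAYER_BIAS["buffer"])  # 1.25 / 0.8 = 1.5625
```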




&lt;h2&gt;
  
  
  5. Sigmoid Score Compression
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oidc8hp9kblaqzwlffv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oidc8hp9kblaqzwlffv.png" alt="Sigmoid Score Compression" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final ranking score combines semantic relevance, memory weight, and time decay. This raw score is mapped through a sigmoid to the 0–1 range:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score = 2 / (1 + e^(-2x)) - 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why not just clamp at 1.0?&lt;/p&gt;

&lt;p&gt;Because clamping destroys information. Say two memories have raw scores of 1.3 and 2.1 — after clamping, both become 1.0, and the system thinks they're "equally good." The sigmoid approaches 1.0 asymptotically but never reaches it, preserving discrimination in the high-score region.&lt;/p&gt;

&lt;p&gt;The shaded area in the chart represents the ranking information that sigmoid preserves — the differences that a hard clamp would flatten.&lt;/p&gt;
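&lt;p&gt;Algebraically this sigmoid is tanh(x). A quick sketch comparing it against a hard clamp shows what survives:&lt;/p&gt;

```python
import math

def squash(raw):
    # 2 / (1 + e^(-2x)) - 1, which is algebraically tanh(raw)
    return 2 / (1 + math.exp(-2 * raw)) - 1

def clamp(raw):
    return min(1.0, raw)

# Clamping makes 1.3 and 2.1 indistinguishable; the sigmoid keeps them apart.
clamped = (clamp(1.3), clamp(2.1))     # both 1.0
squashed = (squash(1.3), squash(2.1))  # roughly 0.86 vs 0.97
```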




&lt;h2&gt;
  
  
  The Full Scoring Formula
&lt;/h2&gt;

&lt;p&gt;Putting all five mechanisms together, a memory's final retrieval score is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;weight = importance + rep_bonus + access_bonus + kind_bias + layer_bias

raw = relevance × (1 + 0.4 × weight + 0.2 × recency)

score = sigmoid(raw)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;relevance&lt;/code&gt; comes from a hybrid of semantic embeddings and BM25 keyword search, &lt;code&gt;recency&lt;/code&gt; is time-based exponential decay, and &lt;code&gt;importance&lt;/code&gt; is the value after per-epoch exponential decay (counteracted by activation boosts on recall).&lt;/p&gt;
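&lt;p&gt;Assembled end to end, the pipeline reads as follows. The example input values are made up for illustration; the formula itself is the one quoted above:&lt;/p&gt;

```python
import math

def final_score(relevance, importance, rep_bonus, access_bonus,
                kind_bias, layer_bias, recency):
    weight = importance + rep_bonus + access_bonus + kind_bias + layer_bias
    raw = relevance * (1 + 0.4 * weight + 0.2 * recency)
    return 2 / (1 + math.exp(-2 * raw)) - 1  # sigmoid compression

# A relevant, heavily reinforced procedural/core memory...
hot = final_score(0.9, 0.8, 0.5, 0.4, 0.15, 0.1, 0.9)
# ...outranks an equally relevant but decayed episodic/buffer one.
cold = final_score(0.9, 0.05, 0.0, 0.0, -0.1, -0.1, 0.1)
```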

&lt;p&gt;No magic numbers — every coefficient maps to an explainable cognitive mechanism.&lt;/p&gt;




&lt;h2&gt;
  
  
  Specs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Rust, single binary, zero external dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;~100 MB RSS in production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;SQLite, one .db file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search&lt;/td&gt;
&lt;td&gt;Semantic embeddings + BM25 (with CJK tokenization)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Platforms&lt;/td&gt;
&lt;td&gt;Linux, macOS, Windows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kael-bit/engram-rs" rel="noopener noreferrer"&gt;github.com/kael-bit/engram-rs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rust</category>
      <category>rag</category>
      <category>mcp</category>
    </item>
    <item>
      <title>My AI Agent Ran for a Week — Here's How It Remembers Things</title>
      <dc:creator>Kaelii</dc:creator>
      <pubDate>Sat, 28 Feb 2026 16:37:25 +0000</pubDate>
      <link>https://dev.to/kaelbit/my-ai-agent-ran-for-a-week-heres-how-it-remembers-things-2b65</link>
      <guid>https://dev.to/kaelbit/my-ai-agent-ran-for-a-week-heres-how-it-remembers-things-2b65</guid>
      <description>&lt;p&gt;I run a 24/7 AI agent connected to Telegram. It handles daily tasks, spawns sub-agents, and runs scheduled jobs.&lt;/p&gt;

&lt;p&gt;Most agent frameworks have some kind of memory mechanism — a markdown file that gets loaded at session start, where the agent writes things down and reads them back next time. Basic persistence works fine.&lt;/p&gt;

&lt;p&gt;But after running it for a while, I noticed a problem: &lt;strong&gt;the memory file kept growing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent dumped everything in — debug logs, temporary state, duplicate information, long-outdated decisions. The file grew and grew, useful information buried under noise. And there was no cleanup mechanism — things went in, nothing came out.&lt;/p&gt;

&lt;p&gt;I realized the agent didn't just need "the ability to remember things." It needed a memory system with a lifecycle: what to remember, how long to keep it, and when to forget. So I plugged in an external memory service to replace the markdown file. But having the tool didn't mean the problem was solved — &lt;strong&gt;the hardest part was teaching the AI to store and retrieve correctly.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfall #1: It Doesn't Store Anything
&lt;/h2&gt;

&lt;p&gt;After setting up the memory service, I wrote storage instructions in the system prompt. The first version used a table: what to store, what tags to use, what category. Clean and structured — looked great to me.&lt;/p&gt;

&lt;p&gt;The result? The agent stored almost nothing.&lt;/p&gt;

&lt;p&gt;A 30-minute conversation where I corrected two mistakes, confirmed a technical approach, and set a rule — it didn't store a single one. After the session ended, all of it was gone.&lt;/p&gt;

&lt;p&gt;The reason is simple: an LLM's instinct is to &lt;strong&gt;respond&lt;/strong&gt;, not to &lt;strong&gt;record&lt;/strong&gt;. It'll go all-out answering your question, but it won't spontaneously think "is there something worth remembering in this conversation?" No matter how clean the table is, it won't pause mid-conversation to consult it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall #2: Scattered Instructions
&lt;/h2&gt;

&lt;p&gt;After discovering the storage problem, I switched to more forceful imperative instructions with lots of emphasis markers. "CRITICAL: When I correct you, store first, then reply." "HIGHEST PRIORITY: User feedback." "⚠️ Don't miss storing."&lt;/p&gt;

&lt;p&gt;It worked better, but the rules were scattered across different parts of the prompt. CRITICAL appeared three times, ⚠️ twice, 🚫 once. They competed for attention. When everything screams "I'm the most important," nothing is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall #3: Over-Compression
&lt;/h2&gt;

&lt;p&gt;Realizing the prompt was too long, I did an aggressive trim. It backfired — shorter, yes, but the crucial guidance on &lt;em&gt;when&lt;/em&gt; to store got cut too. The agent went passive: it only stored things when I explicitly said "remember this," no longer proactively extracting decisions and lessons from conversations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Structure That Finally Worked
&lt;/h2&gt;

&lt;p&gt;After about two or three weeks of iteration, I converged on a stable structure. Four lines, each with a clear function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Principle&lt;/strong&gt;: Store everything valuable, store it immediately, never batch. Over-storing costs nothing, forgetting costs everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action rule&lt;/strong&gt;: User corrects you → store FIRST, then reply. If you think "I'll store this later," you're already wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to store&lt;/strong&gt;: Identity, preferences, decisions, constraints, lessons, milestone recaps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What not to store&lt;/strong&gt;: Command output, step-by-step narration, info already in code/config files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each line gets an emoji prefix (🧠⚠️✅🚫). Not for decoration — they're visual anchors that help the model parse the structure at a glance. All in one compact block, not scattered across multiple sections.&lt;/p&gt;
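&lt;p&gt;A sketch of what such a block might look like (the wording here is hypothetical, reconstructed from the four lines above, not the actual system prompt):&lt;/p&gt;

```text
🧠 MEMORY: Store everything valuable, immediately, never batch.
   Over-storing costs nothing; forgetting costs everything.
⚠️ When the user corrects you: store FIRST, then reply.
✅ Store: identity, preferences, decisions, constraints, lessons, milestone recaps.
🚫 Don't store: command output, step-by-step narration, info already in code/config.
```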

&lt;p&gt;Two specific wording changes made the biggest difference:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"Over-storing costs nothing, forgetting costs everything" — eliminated the agent's hesitation. It stopped agonizing over "is this worth storing?" because the answer is always "storing it can't hurt."&lt;/li&gt;
&lt;li&gt;"Store FIRST, then reply" — solved the timing problem. After finishing a reply, the agent often forgot to store. Forcing store-before-reply meant corrections actually stuck.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Resume: The First Thing After Waking Up
&lt;/h2&gt;

&lt;p&gt;With storing solved, there was still the retrieval problem. Each new session starts as a blank slate — the agent needs to know what it already knows.&lt;/p&gt;

&lt;p&gt;I wrote a hard rule in the prompt: &lt;strong&gt;the first action of every session must be calling the resume endpoint, no exceptions.&lt;/strong&gt; Before replying to the user, before reading files, before anything.&lt;/p&gt;

&lt;p&gt;Resume doesn't return every memory in full (that would blow up the context). Instead, it returns an &lt;strong&gt;index&lt;/strong&gt; — like a table of contents listing all topics and how many memories each contains. When the agent needs details on a specific topic, it pulls them on demand.&lt;/p&gt;

&lt;p&gt;This design resolves a fundamental tension: the agent needs the confidence of "everything is saved" to be willing to store, but the context window can't actually load all memories. The index gives you both.&lt;/p&gt;
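&lt;p&gt;For illustration, the index might be shaped like this (a hypothetical payload, not engram-rs's actual response format):&lt;/p&gt;

```python
# Topics and counts only; full memories are fetched per topic on demand.
resume_index = {
    "topics": [
        {"name": "project-decisions", "count": 23},
        {"name": "user-preferences", "count": 9},
        {"name": "lessons", "count": 14},
    ],
    "total_memories": 46,
}
```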

&lt;p&gt;But the real pitfall with resume wasn't the design — it was that &lt;strong&gt;it often didn't get triggered&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A long-running agent continuously accumulates context. The framework periodically compresses it (compaction). The compressed summary preserves the rough outline of the conversation but loses details. The problem: the summary looks "good enough" — the agent reads the compressed context, thinks it knows what's going on, and starts working, &lt;strong&gt;not feeling any need to call resume&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is a variant of hallucination: the agent gets false confidence from the compressed summary, believing it has enough context, when it's actually lost a mass of specifics — the exact wording of a rule, the reasoning behind a decision, the lesson from last time's mistake.&lt;/p&gt;

&lt;p&gt;Writing "MANDATORY FIRST ACTION" in the prompt wasn't enough. Because the post-compaction context might already contain a seemingly reasonable conversation history, the agent prioritizes responding to that context over following a rule that "doesn't seem urgent."&lt;/p&gt;

&lt;p&gt;My final solution wasn't a prompt rule — it was a &lt;strong&gt;file hook&lt;/strong&gt;. I created a &lt;code&gt;WORKFLOW_AUTO.md&lt;/code&gt; that the framework force-loads after every compaction. The file says one thing: call resume. No matter how the context gets compressed, the agent reads this file and triggers the resume call.&lt;/p&gt;

&lt;p&gt;Moving a critical behavior from "a rule in the prompt" to "a hook in the filesystem" is a completely different level of reliability. Prompts can be ignored. File loading is deterministic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Triggers: Reflexes Before Actions
&lt;/h2&gt;

&lt;p&gt;Once, the agent did something I had explicitly told it not to do. It wasn't being defiant — it had "remembered" the rule (it was in the memory service), but it didn't think to check before executing the action.&lt;/p&gt;

&lt;p&gt;This led me to add a trigger mechanism. When the agent learns a lesson, it stores it with a trigger tag (e.g., &lt;code&gt;trigger:git-push&lt;/code&gt;). Before executing a related action, the prompt instructs it to check for relevant lessons first.&lt;/p&gt;

&lt;p&gt;It's like muscle memory — no conscious recall needed. When the relevant action comes up, the lesson surfaces automatically. Far more reliable than depending on the agent to "remember."&lt;/p&gt;
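&lt;p&gt;A minimal sketch of the mechanism (the data shapes, helper name, and lesson text are illustrative, not engram-rs's actual API; the trigger:git-push tag format is the one from the text above):&lt;/p&gt;

```python
# Lessons carry trigger tags; a pre-action check surfaces matching lessons
# before the corresponding action runs.
MEMORIES = [
    {"text": "Never force-push to main.",
     "tags": ["lesson", "trigger:git-push"]},
    {"text": "Run the staging smoke test before deploying.",
     "tags": ["lesson", "trigger:deploy"]},
]

def lessons_for(action):
    """Return every stored lesson tagged for this action."""
    tag = "trigger:" + action
    return [m["text"] for m in MEMORIES if tag in m["tags"]]

# Before a git push, the agent surfaces its stored lesson automatically:
warnings = lessons_for("git-push")
```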




&lt;h2&gt;
  
  
  After One Week
&lt;/h2&gt;

&lt;p&gt;The agent has been running stably for a week now. Over a hundred memories, automatically clustered into a dozen-plus topics. It remembers who I am, the project's technical decisions, and mistakes it made before. Context restoration after a session restart takes about 300ms.&lt;/p&gt;

&lt;p&gt;Looking back, the biggest lesson isn't technical — it's that &lt;strong&gt;the essence of prompt engineering isn't "what to say" but "how to say it so the model actually listens."&lt;/strong&gt; The same rule, scattered vs. consolidated, passive vs. active voice, with or without explaining why — the difference in effectiveness is night and day.&lt;/p&gt;

&lt;p&gt;The memory system's architecture matters, of course. But if the agent won't use it, the best architecture in the world is useless.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Memory service: &lt;a href="https://github.com/kael-bit/engram-rs" rel="noopener noreferrer"&gt;engram&lt;/a&gt; — single Rust binary, MCP-compatible. Agent framework: &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>rag</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Adding a Lifecycle to AI Agent Memory</title>
      <dc:creator>Kaelii</dc:creator>
      <pubDate>Fri, 27 Feb 2026 10:59:22 +0000</pubDate>
      <link>https://dev.to/kaelbit/adding-a-lifecycle-to-ai-agent-memory-372i</link>
      <guid>https://dev.to/kaelbit/adding-a-lifecycle-to-ai-agent-memory-372i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This isn't a product pitch. I just want to share some real problems I ran into while building persistent memory for an AI agent, and the approach I ended up with. The code is open source — my approach might not be the best one, and I'd love to hear how others are tackling the same problems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When building memory for an agent, the most immediate question is: &lt;strong&gt;once you've stored hundreds of memories, how do you make sure the most relevant ones surface during retrieval?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Information has a shelf life. Yesterday's debug log, last week's temporary workaround, last month's architecture decision — they all have very different levels of importance. If every memory is treated equally, retrieval results get flooded with stale noise, and the actually valuable stuff gets buried.&lt;/p&gt;

&lt;p&gt;My approach was to give memories a &lt;strong&gt;lifecycle&lt;/strong&gt; — new information starts in an observation period, valuable stuff gets promoted upward, and outdated entries naturally sink to the bottom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three-Layer Design: Buffer → Working → Core
&lt;/h2&gt;

&lt;p&gt;I settled on a three-layer structure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug0gqq9l0nzr0ikgh2cg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug0gqq9l0nzr0ikgh2cg.png" width="800" height="619"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most ephemeral information ("just ran a test", "build passed") stays in Buffer and naturally sinks. The genuinely valuable stuff floats up over time. No manual curation needed — the system filters on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decay: Letting Priority Shift Over Time
&lt;/h2&gt;

&lt;p&gt;Decay doesn't delete data. It adjusts &lt;strong&gt;retrieval ranking to reflect recency&lt;/strong&gt;. The longer a memory goes unused, the lower it ranks in search results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;decay_score = importance × e^(−decay_rate × idle_hours / 168)

Buffer:   decay_rate = 5.0   → sinks within days of inactivity
Working:  decay_rate = 1.0   → takes weeks to noticeably drop
Core:     decay_rate = 0.01  → practically permanent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's one special case — procedural knowledge (deployment steps, coding standards, etc.). These get a decay rate of 0.01 regardless of layer, because process knowledge shouldn't lose priority over time. It doesn't matter if you haven't looked up "how to deploy" in a month — it needs to be there when you need it.&lt;/p&gt;

&lt;p&gt;An early mistake I made: &lt;strong&gt;applying uniform decay to all memories&lt;/strong&gt;. The result was that the agent kept losing track of deployment procedures and had to ask again every time. Once I differentiated by memory type, the problem went away.&lt;/p&gt;
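&lt;p&gt;The fix is a tiny amount of logic. A sketch using the constants above (the function names are illustrative):&lt;/p&gt;

```python
import math

LAYER_RATES = {"buffer": 5.0, "working": 1.0, "core": 0.01}

def decay_rate(layer, kind):
    if kind == "procedural":
        return 0.01  # process knowledge keeps its priority regardless of layer
    return LAYER_RATES[layer]

def decay_score(importance, layer, kind, idle_hours):
    # importance * e^(-decay_rate * idle_hours / 168)
    return importance * math.exp(-decay_rate(layer, kind) * idle_hours / 168)
```

&lt;p&gt;A week-idle procedural memory in Buffer barely moves, while an ordinary episodic entry in the same layer has already sunk.&lt;/p&gt;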

&lt;h2&gt;
  
  
  Repetition = Reinforcement
&lt;/h2&gt;

&lt;p&gt;Human memory has a well-known property: repeated exposure strengthens retention. I mimicked this in the system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe88kko18v4ij7f1alma1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe88kko18v4ij7f1alma1.png" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The more often the same knowledge is mentioned, the more "durable" it becomes — higher importance, still ranks well even after decay. This wasn't part of the original design; it was added after noticing in practice that the agent kept failing to recall things I'd told it multiple times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval: Semantic + Keyword Hybrid
&lt;/h2&gt;

&lt;p&gt;Storing memories is only half the problem — you also need to find them. Retrieval uses a hybrid strategy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih62znbch0tfj8fy7ei7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih62znbch0tfj8fy7ei7.png" width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One real-world gotcha I ran into: &lt;strong&gt;short CJK queries produce unreliable embeddings&lt;/strong&gt;. For example, searching "部署" (deploy) — the embedding model returns nearly identical similarity scores for all Chinese-language memories, making discrimination impossible. The fix was a special case: for short CJK queries, reduce the weight of semantic search and lean harder on keyword matching.&lt;/p&gt;
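&lt;p&gt;A sketch of that special case (the 0.2/0.8 and 0.6/0.4 weight splits and the length cutoff are illustrative, not the actual values):&lt;/p&gt;

```python
def is_cjk(ch):
    # CJK Unified Ideographs block
    return ord(ch) in range(0x4E00, 0xA000)

def search_weights(query):
    """Return (semantic_weight, keyword_weight) for the hybrid retriever."""
    has_cjk = any(is_cjk(ch) for ch in query)
    if has_cjk and len(query) in range(1, 5):  # short CJK query
        return (0.2, 0.8)  # lean on BM25 keyword matching
    return (0.6, 0.4)      # default hybrid split

semantic_w, keyword_w = search_weights("部署")  # short CJK: keyword-heavy
```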

&lt;h2&gt;
  
  
  Why SQLite
&lt;/h2&gt;

&lt;p&gt;This might be the most controversial choice, but I think it fits the use case well.&lt;/p&gt;

&lt;p&gt;My scenario is single-agent use with hundreds to a few thousand memories. At this scale, SQLite's read/write performance is more than sufficient, and it comes with built-in SQL queries and FTS5 full-text search — no extra dependencies needed.&lt;/p&gt;

&lt;p&gt;The end result: &lt;strong&gt;the entire system compiles to a single binary, runs directly on any machine, and all data lives in one &lt;code&gt;.db&lt;/code&gt; file&lt;/strong&gt;. Backup is &lt;code&gt;cp&lt;/code&gt;. Migration is &lt;code&gt;scp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Of course, for scenarios with many agents writing concurrently or significantly larger data volumes, the storage choice would need to be reconsidered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results So Far
&lt;/h2&gt;

&lt;p&gt;It's been running for a few days now, with 80+ memories distributed across the three layers. Ephemeral information in Buffer typically sinks within hours to a day, while valuable entries gradually promote to Working and Core.&lt;/p&gt;

&lt;p&gt;One interesting case: the agent genuinely stops repeating past mistakes — because lesson-type memories are tagged with triggers, and those triggers fire automatically before related operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Questions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to organize memories at scale?&lt;/strong&gt; 80 entries is manageable; what about 800? I've since built a self-organizing topic tree (k-means clustering), but that's a separate discussion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-agent memory sharing&lt;/strong&gt; — the system supports multiple agents on a single instance via namespace isolation, but how agents could safely share subsets of memory is still an open question.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation metrics&lt;/strong&gt; — how do you quantify "memory quality"? Right now I'm eyeballing logs, which isn't exactly scientific.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Code is on &lt;a href="https://github.com/kael-bit/engram-rs" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Written in Rust, MIT licensed.&lt;/p&gt;

&lt;p&gt;If you're working on agent memory too, I'd love to hear from you — especially around how you handle memory lifecycle management.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>rust</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
