The Blind Spot in Every Agent Memory System
If you've built an AI agent — whether it's a coding assistant, a customer
support bot, or an autonomous workflow — you've seen this pattern:
- Session 1: Agent tries to edit a production config file directly. Everything breaks. You intervene.
- Session 2: Same situation. Agent tries the exact same thing again.
Why? Because the agent has no memory of what went wrong last time. It
remembers facts ("the API endpoint is https://..."), but it doesn't remember
judgments ("direct production edits caused an outage — propose changes instead
of executing them").
This is the blind spot in every major agent memory system today.
## Two Kinds of Memory
Current systems (Mem0, LangGraph MemorySaver, vector stores) are built for
semantic memory:
| | Semantic Memory | Episodic Memory |
|---|---|---|
| What it stores | Facts, preferences, history | Decisions, judgments, outcomes |
| Query | "What does the user prefer?" | "How should I handle this?" |
| Feedback | None | Utility-weighted: was it right? |
| Ranking | Cosine similarity only | Similarity × (1 + α · utility score) |
Semantic memory answers "what is relevant?" Episodic memory answers "what has
been proven correct?"
## The Utility Flywheel
The core idea is simple. When an agent makes a judgment, you store it:
```python
memory.store(
    trigger="User asks agent to modify config.json in production",
    judgment="Production config changes must be confirmed with the user first",
    reasoning="Direct writes have caused outages before. Propose, don't execute.",
    domain="ops",
)
```
Later, when a similar situation arises, you search:
```python
results = memory.search("Can I edit the production config?", use_utility=True)
```
The key is `use_utility=True`. Instead of pure cosine similarity, it ranks by:
```
rank_score = cosine_similarity × (1 + α · utility_score)
```
where `utility_score = adoptions / (adoptions + corrections)` and α is a weight that controls how strongly utility influences the ranking.
Every time the judgment is verified as correct, its utility goes up. Every
time it's corrected, it goes down. Over time, the flywheel converges: proven
judgments naturally float to the top.
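To make the arithmetic concrete, here is a minimal sketch of the scoring rule above. It is not the library's internal code: the `JudgmentStats` class, the `rank_score` helper, the default `alpha=1.0`, and the similarity values are all assumptions made for illustration. It also assumes a judgment with no feedback yet has utility 0, so its ranking falls back to plain cosine similarity.

```python
from dataclasses import dataclass


@dataclass
class JudgmentStats:
    """Feedback counters for one stored judgment (hypothetical structure)."""

    adoptions: int = 0    # times the judgment was verified as correct
    corrections: int = 0  # times the judgment was corrected

    def utility_score(self) -> float:
        total = self.adoptions + self.corrections
        # Assumption: with no feedback at all, utility is treated as 0.
        return self.adoptions / total if total else 0.0


def rank_score(cosine_similarity: float, stats: JudgmentStats, alpha: float = 1.0) -> float:
    """rank_score = cosine_similarity * (1 + alpha * utility_score)."""
    return cosine_similarity * (1.0 + alpha * stats.utility_score())


# Two near-duplicate judgments; the wrong one is slightly closer to the query.
wrong = JudgmentStats(adoptions=1, corrections=4)    # utility 0.2
correct = JudgmentStats(adoptions=4, corrections=1)  # utility 0.8

wrong_score = rank_score(0.92, wrong)      # roughly 0.92 * 1.2
correct_score = rank_score(0.90, correct)  # roughly 0.90 * 1.8

# The proven judgment wins despite the lower cosine similarity.
print(correct_score > wrong_score)
```

With these made-up numbers, cosine similarity alone would rank the wrong judgment first (0.92 vs 0.90). The utility multiplier reverses that order once feedback has accumulated.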
## The Numbers: 0.40 → 0.90 Precision
We built a synthetic benchmark: 10 scenarios, each with a correct and a wrong
judgment that look nearly identical to an embedder. Then we measured which one
ranks first.
| Metric | Cosine only | +Utility Flywheel |
|---|---|---|
| Precision@1 | 0.40 | 0.90 |
| Mean rank of correct judgment | 1.90 | 1.30 |
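Pure cosine retrieval (the standard approach) ranks the right judgment first in
only 4 of the 10 scenarios, which is no better than chance when the correct and
wrong judgments look nearly identical to the embedder. The utility flywheel
raises that to 9 of 10.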
> The benchmark is fully reproducible: `pip install episodic-judgment sentence-transformers && python benchmarks/judgment_recall.py`
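For readers unfamiliar with the two metrics in the table, here is a small sketch of how they can be computed from a list of ranks. The function names are my own, and `toy_ranks` is made-up illustration data, not the benchmark's results. Each entry is the 1-based position at which the correct judgment was retrieved in one scenario.

```python
from typing import Sequence


def precision_at_1(ranks: Sequence[int]) -> float:
    """Fraction of scenarios where the correct judgment is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_rank(ranks: Sequence[int]) -> float:
    """Average 1-based position of the correct judgment (lower is better)."""
    return sum(ranks) / len(ranks)


# Toy data for four hypothetical scenarios (not the benchmark's numbers).
toy_ranks = [1, 2, 1, 3]

print(precision_at_1(toy_ranks))  # 0.5
print(mean_rank(toy_ranks))       # 1.75
```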
## When NOT to Use This
This library is not a replacement for Mem0 or vector stores. Use it when:
- ✅ Your agent makes decisions that have consequences
- ✅ You have a way to verify those decisions (user feedback, outcome
detection)
- ✅ You want the agent to learn from experience over time
Don't use it if:
- ❌ Your agent only needs facts and preferences (use semantic memory)
- ❌ You can't provide verification feedback (utility stays at 0)
- ❌ You need high-scale retrieval (>10K records) — the current version scans
all rows
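The scale limit in the last bullet is easier to see with a sketch of what a full scan looks like. This is an illustrative brute-force search written for this post, not the library's actual implementation. The `scan_search` function and the toy two-dimensional embeddings are assumptions. The point is that every query scores every stored row, so cost grows linearly with the number of records.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scan_search(
    query_vec: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (record_id, embedding) row against the query: O(n) per search."""
    scored = [(cosine(query_vec, emb), rid) for rid, emb in rows]
    scored.sort(reverse=True)  # highest similarity first
    return [(rid, score) for score, rid in scored[:top_k]]


# Toy 2-dimensional embeddings; real embeddings have hundreds of dimensions.
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
top = scan_search([1.0, 0.1], rows, top_k=2)
print([rid for rid, _ in top])  # ['a', 'c']
```

At a few thousand records this is fine. Beyond that, an approximate nearest-neighbor index is the usual way to avoid touching every row.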
## The Bigger Picture
I believe the next generation of AI agents won't be distinguished by their
base models — they'll be distinguished by their operational memory: the
accumulated wisdom of thousands of past decisions.
This library is a small step in that direction. It's MIT licensed, ~300 lines
of core code, and designed to be the simplest thing that works.
→ GitHub: episodic-memory (https://github.com/fk965/episodic-memory)
I'd love to hear from others building agents. Have you hit the "same mistake
every session" problem? How are you solving it today?
---
Built from an internal system running in production. The utility flywheel
concept was validated against real agent data with 3,957+ judgment events.