DEV Community

TamDDD

# AI Agents Remember Facts But Can't Learn From Mistakes: Here's a Fix

Tags: ai, agents, machinelearning, python, opensource

## The Blind Spot in Every Agent Memory System

If you've built an AI agent — whether it's a coding assistant, a customer
support bot, or an autonomous workflow — you've seen this pattern:

Session 1: Agent tries to edit a production config file directly.
Everything breaks. You intervene.

Session 2: Same situation. Agent tries the exact same thing again.

Why? Because the agent has no memory of what went wrong last time. It
remembers facts ("the API endpoint is https://..."), but it doesn't remember
judgments ("direct production edits caused an outage — propose changes instead
of executing them").

This is the blind spot in every major agent memory system today.

## Two Kinds of Memory

Current systems (Mem0, LangGraph MemorySaver, vector stores) are built for
semantic memory:

| | Semantic Memory | Episodic Memory |
|---|---|---|
| What it stores | Facts, preferences, history | Decisions, judgments, outcomes |
| Query | "What does the user prefer?" | "How should I handle this?" |
| Feedback | None | Utility-weighted: was it right? |
| Ranking | Cosine similarity only | Similarity × utility score |

Semantic memory answers "what is relevant?" Episodic memory answers "what has
been proven correct?"

## The Utility Flywheel

The core idea is simple. When an agent makes a judgment, you store it:


```python
# `memory` is an already-initialized store from the library (setup not shown here).
memory.store(
    trigger="User asks agent to modify config.json in production",
    judgment="Production config changes must be confirmed with the user first",
    reasoning="Direct writes have caused outages before. Propose, don't execute.",
    domain="ops",
)
```

Later, when a similar situation arises, you search:

```python
results = memory.search("Can I edit the production config?", use_utility=True)
```

The key is `use_utility=True`. Instead of ranking by pure cosine similarity, it ranks by:

`rank_score = cosine_similarity × (1 + α · utility_score)`

where `utility_score = adoptions / (adoptions + corrections)` and α controls how much weight utility gets relative to similarity.

Every time the judgment is verified as correct, its utility goes up. Every time it's corrected, it goes down. Over time, the flywheel converges: proven judgments naturally float to the top.
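To make the ranking formula concrete, here is a minimal sketch in plain Python. It is not the library's code: the `JudgmentStats` class, the default `alpha=1.0`, and the choice to treat a judgment with no feedback as utility 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgmentStats:
    """Hypothetical feedback counters for one stored judgment."""
    adoptions: int = 0    # times the judgment was verified as correct
    corrections: int = 0  # times the judgment was corrected

    def utility_score(self) -> float:
        total = self.adoptions + self.corrections
        # Assumption: with no feedback yet, utility is 0 (avoids dividing by zero).
        return self.adoptions / total if total else 0.0


def rank_score(cosine_similarity: float, stats: JudgmentStats, alpha: float = 1.0) -> float:
    """rank_score = cosine_similarity * (1 + alpha * utility_score)."""
    return cosine_similarity * (1 + alpha * stats.utility_score())


# Two near-identical candidates: the wrong one is slightly closer to the query.
wrong = JudgmentStats(adoptions=1, corrections=4)  # utility 0.2
right = JudgmentStats(adoptions=4, corrections=1)  # utility 0.8

wrong_score = rank_score(0.82, wrong)  # 0.82 * 1.2
right_score = rank_score(0.80, right)  # 0.80 * 1.8

# The proven judgment outranks the one that is merely more similar.
print(right_score > wrong_score)
```

Under these assumptions, a judgment with no feedback gets a multiplier of 1, so ranking falls back to plain cosine similarity until adoptions or corrections start to accumulate.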

## The Numbers: 0.40 → 0.90 Precision

We built a synthetic benchmark: 10 scenarios, each with a correct and an incorrect judgment that look nearly identical to an embedder. Then we measured which of the two ranks first.

| Metric | Cosine only | + Utility Flywheel |
|---|---|---|
| Precision@1 | 0.40 | 0.90 |
| Mean rank of correct judgment | 1.90 | 1.30 |

Pure cosine retrieval (the standard approach) ranks the right judgment first only 40% of the time, which is no better than a coin flip between the correct and incorrect judgment in each pair. Adding the utility flywheel raises that to 90%.

> The benchmark is fully reproducible: `pip install episodic-judgment sentence-transformers && python benchmarks/judgment_recall.py`
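For readers unfamiliar with the two metrics in the table, this is roughly how they are computed from the rank at which the correct judgment appears in each scenario. The `toy_ranks` list is made-up illustration data, not the benchmark's actual output, and the function names are hypothetical.

```python
def precision_at_1(ranks):
    """Fraction of scenarios where the correct judgment is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_rank(ranks):
    """Average 1-based rank of the correct judgment across scenarios."""
    return sum(ranks) / len(ranks)


# Made-up ranks for five scenarios (1 = correct judgment came first).
toy_ranks = [1, 2, 1, 3, 1]

print(precision_at_1(toy_ranks))  # 3 of 5 scenarios hit rank 1
print(mean_rank(toy_ranks))       # (1 + 2 + 1 + 3 + 1) / 5
```

Precision@1 only rewards a first-place hit, while mean rank also shows how far down the correct judgment falls when it misses, which is why the table reports both.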

## When NOT to Use This

This library is not a replacement for Mem0 or vector stores. Use it when:

- ✅ Your agent makes decisions that have consequences
- ✅ You have a way to verify those decisions (user feedback, outcome detection)
- ✅ You want the agent to learn from experience over time

Don't use it if:

- ❌ Your agent only needs facts and preferences (use semantic memory)
- ❌ You can't provide verification feedback (utility stays at 0)
- ❌ You need high-scale retrieval (>10K records), since the current version scans all rows
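The scale caveat is easier to see with a sketch of what a full scan looks like. This is an assumption about the general shape of such a search, not the library's implementation: the `scan_search` function, the tuple layout of `rows`, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scan_search(query_vec, rows, alpha=1.0, top_k=3):
    """Score every row, then sort: cost grows linearly with the number of rows.

    rows: list of (judgment_text, embedding, utility_score) tuples (hypothetical layout).
    """
    scored = [
        (cosine(query_vec, emb) * (1 + alpha * utility), text)
        for text, emb, utility in rows
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


rows = [
    ("propose changes first", [0.9, 0.1], 0.8),  # proven judgment
    ("edit directly", [1.0, 0.0], 0.0),          # closest match, but no positive feedback
    ("unrelated", [0.0, 1.0], 1.0),              # high utility, zero similarity
]
results = scan_search([1.0, 0.0], rows)
print(results[0][1])
```

Every query touches every stored embedding, which is fine for a few thousand records but is the reason an index would be needed beyond that. The last row also shows that utility only rescales similarity: a judgment with zero similarity stays at zero no matter how often it was adopted.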

## The Bigger Picture

I believe the next generation of AI agents won't be distinguished by their base models. They'll be distinguished by their operational memory: the accumulated wisdom of thousands of past decisions.

This library is a small step in that direction. It's MIT licensed, ~300 lines of core code, and designed to be the simplest thing that works.

→ GitHub: [episodic-memory](https://github.com/fk965/episodic-memory)

I'd love to hear from others building agents. Have you hit the "same mistake every session" problem? How are you solving it today?

---

Built from an internal system running in production. The utility flywheel concept was validated against real agent data with 3,957+ judgment events.

Top comments (1)

TamDDD

author here, happy to answer questions