<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TamDDD</title>
    <description>The latest articles on DEV Community by TamDDD (@fk965).</description>
    <link>https://dev.to/fk965</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3991226%2F38c83aa9-ff30-4e4f-b7a6-753de4f50b2d.png</url>
      <title>DEV Community: TamDDD</title>
      <link>https://dev.to/fk965</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fk965"/>
    <language>en</language>
    <item>
      <title>AI Agents Remember Facts But Can't Learn From Mistakes: Here's a Fix</title>
      <dc:creator>TamDDD</dc:creator>
      <pubDate>Thu, 18 Jun 2026 15:59:38 +0000</pubDate>
      <link>https://dev.to/fk965/ai-agents-remember-facts-but-cant-learn-from-mistakes-heres-a-fix-tags-ai-agents-2ml9</link>
      <guid>https://dev.to/fk965/ai-agents-remember-facts-but-cant-learn-from-mistakes-heres-a-fix-tags-ai-agents-2ml9</guid>
      <description>&lt;h2&gt;
  
  
  The Blind Spot in Every Agent Memory System
&lt;/h2&gt;

&lt;p&gt;If you've built an AI agent, whether a coding assistant, a customer support bot, or an autonomous workflow, you've seen this pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session 1:&lt;/strong&gt; Agent tries to edit a production config file directly. Everything breaks. You intervene.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session 2:&lt;/strong&gt; Same situation. Agent tries &lt;em&gt;the exact same thing again&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Why? Because the agent has no memory of &lt;em&gt;what went wrong last time&lt;/em&gt;. It remembers facts ("the API endpoint is https://..."), but it doesn't remember judgments ("direct production edits caused an outage; propose changes instead of executing them").&lt;/p&gt;

&lt;p&gt;This is the blind spot in every major agent memory system today.&lt;/p&gt;

&lt;h2&gt;Two Kinds of Memory&lt;/h2&gt;

&lt;p&gt;Current systems (Mem0, LangGraph MemorySaver, vector stores) are built for &lt;strong&gt;semantic memory&lt;/strong&gt;. The table below contrasts it with episodic memory:&lt;/p&gt;

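&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Semantic Memory&lt;/th&gt;
&lt;th&gt;Episodic Memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it stores&lt;/td&gt;
&lt;td&gt;Facts, preferences, history&lt;/td&gt;
&lt;td&gt;Decisions, judgments, outcomes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query&lt;/td&gt;
&lt;td&gt;"What does the user prefer?"&lt;/td&gt;
&lt;td&gt;"How should I handle this?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feedback&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Utility-weighted: was it right?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ranking&lt;/td&gt;
&lt;td&gt;Cosine similarity only&lt;/td&gt;
&lt;td&gt;Similarity × utility score&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;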

&lt;p&gt;Semantic memory answers "what is relevant?" Episodic memory answers "what has been proven correct?"&lt;/p&gt;

&lt;h2&gt;The Utility Flywheel&lt;/h2&gt;

&lt;p&gt;The core idea is simple. When an agent makes a judgment, you store it:&lt;/p&gt;



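&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# `memory` is the library's memory store; its construction is not shown in this post.
memory.store(
    trigger="User asks agent to modify config.json in production",
    judgment="Production config changes must be confirmed with the user first",
    reasoning="Direct writes have caused outages before. Propose, don't execute.",
    domain="ops",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Later, when a similar situation arises, you search:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;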
  results = memory.search("Can I edit the production config?", use_utility=True)

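&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The key is &lt;code&gt;use_utility=True&lt;/code&gt;. Instead of ranking by pure cosine similarity, the search ranks by:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;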
  rank_score = cosine_similarity × (1 + α · utility_score)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, &lt;code&gt;utility_score = adoptions / (adoptions + corrections)&lt;/code&gt;, and α sets how strongly utility affects the ranking.&lt;/p&gt;

&lt;p&gt;Every time the judgment is verified as correct, its utility goes up. Every time it's corrected, it goes down. Over time, the flywheel converges: proven judgments naturally float to the top.&lt;/p&gt;

&lt;h2&gt;The Numbers: 0.40 → 0.90 Precision&lt;/h2&gt;

&lt;p&gt;We built a synthetic benchmark: 10 scenarios, each with a correct and a wrong judgment that look nearly identical to an embedder. Then we measured which one ranks first.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Cosine only&lt;/th&gt;
&lt;th&gt;+Utility Flywheel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Precision@1&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean rank of correct&lt;/td&gt;
&lt;td&gt;1.90&lt;/td&gt;
&lt;td&gt;1.30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Pure cosine retrieval (the standard approach) ranks the right judgment first only 40% of the time. The utility flywheel brings that to 90%.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The benchmark is fully reproducible: &lt;code&gt;pip install episodic-judgment sentence-transformers &amp;amp;&amp;amp; python benchmarks/judgment_recall.py&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;When NOT to Use This&lt;/h2&gt;

&lt;p&gt;This library is not a replacement for Mem0 or vector stores. Use it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Your agent makes decisions that have consequences&lt;/li&gt;
&lt;li&gt;✅ You have a way to verify those decisions (user feedback, outcome detection)&lt;/li&gt;
&lt;li&gt;✅ You want the agent to learn from experience over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't use it if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Your agent only needs facts and preferences (use semantic memory)&lt;/li&gt;
&lt;li&gt;❌ You can't provide verification feedback (utility stays at 0)&lt;/li&gt;
&lt;li&gt;❌ You need high-scale retrieval (more than 10K records): the current version scans all rows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The Bigger Picture&lt;/h2&gt;

&lt;p&gt;I believe the next generation of AI agents won't be distinguished by their base models. They'll be distinguished by their operational memory: the accumulated wisdom of thousands of past decisions.&lt;/p&gt;

&lt;p&gt;This library is a small step in that direction. It's MIT licensed, ~300 lines of core code, and designed to be the simplest thing that works.&lt;/p&gt;

&lt;p&gt;→ GitHub: &lt;a href="https://github.com/fk965/episodic-memory"&gt;episodic-memory&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd love to hear from others building agents. Have you hit the "same mistake every session" problem? How are you solving it today?&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Built from an internal system running in production. The utility flywheel concept was validated against real agent data with 3,957+ judgment events.&lt;/p&gt;
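
&lt;p&gt;As a rough illustration of the ranking rule from The Utility Flywheel section, here is a minimal, self-contained sketch. It is not the library's code: the &lt;code&gt;Judgment&lt;/code&gt; class, its field names, the helper functions, the similarity values, and the choice of α = 1.0 are all assumptions made for this example.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Hypothetical record; field names are illustrative, not the library's schema."""
    text: str
    similarity: float      # cosine similarity to the query, assumed computed elsewhere
    adoptions: int = 0     # times the judgment was verified as correct
    corrections: int = 0   # times the judgment was corrected


def utility_score(j):
    """adoptions / (adoptions + corrections); 0.0 when there is no feedback yet."""
    total = j.adoptions + j.corrections
    return j.adoptions / total if total else 0.0


def rank_score(j, alpha=1.0):
    """cosine_similarity * (1 + alpha * utility_score); alpha=1.0 is an assumed default."""
    return j.similarity * (1 + alpha * utility_score(j))


def rank(judgments, alpha=1.0):
    """Return the judgments sorted by rank_score, best first."""
    return sorted(judgments, key=lambda j: rank_score(j, alpha), reverse=True)


# Made-up numbers: two judgments that look almost identical to an embedder.
candidates = [
    Judgment("Edit the production config directly",
             similarity=0.82, adoptions=0, corrections=3),
    Judgment("Propose the change and wait for confirmation",
             similarity=0.80, adoptions=4, corrections=1),
]

cosine_only = sorted(candidates, key=lambda j: j.similarity, reverse=True)
with_utility = rank(candidates)

print("cosine only  :", cosine_only[0].text)
print("with utility :", with_utility[0].text)
```

&lt;p&gt;With these made-up numbers, cosine-only ordering puts the direct-edit judgment first, while the utility-weighted ordering puts the propose-and-confirm judgment on top, because its adoption history lifts its score.&lt;/p&gt;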

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
