Self-Correcting Systems

Posted on May 24

Why AI Memory Systems Fail at Uncertainty

Preserve unresolved questions instead of forcing premature closure

#ai #agents #automation #productivity

AI memory has an obvious failure mode: it can forget.

The quieter failure mode is worse: it can resolve too much.

Most memory systems want to compress the past into something neat: a summary, a decision, a preference, a corrected belief. That is useful when the matter is settled. But a lot of serious work is not settled. It is provisional, contested, partly evidenced, or waiting for the next signal.

If your memory system cannot preserve that state, it will flatten uncertainty into fake clarity. The agent may sound organized, but it will inherit a cleaned-up version of reality.

That is the failure mode: not forgetting, but remembering too cleanly.

Sycophancy is memory that agrees too much. Premature closure is memory that resolves too much. Both come from the same root: a memory system that optimizes for comfort and efficiency over judgment.

Why Summaries and Corrections Are Not Enough

Summaries are not neutral. Every summary chooses what matters, what disappears, and what tone the future inherits.

When a model compresses a messy debate into "User decided X," it may save tokens while deleting the pressure that produced the decision: the rejected alternatives, the uncertainty boundary, the condition under which X might stop being true.

This is how long-term AI systems become confident for the wrong reason.

They do not only hallucinate facts. They hallucinate settlement.

They turn:

There are three competing interpretations. One is currently stronger, but the evidence is incomplete.

into:

User believes interpretation one.

That looks efficient. It is actually a loss of judgment.

A product example is simple. A founder says, "The offer is not working." A rushed memory system records: "Offer failed." But maybe the offer did not fail. Maybe distribution was weak. Maybe the audience was wrong. Maybe the landing page was unclear. Maybe the offer is good but unproven.

"Offer failed" is a clean summary.

"Offer unproven; distribution and audience mismatch unresolved" is a better memory.

The first prematurely closes the idea. The second preserves the question.

Correction memory is powerful because it preserves where thinking changed: you believed X, the evidence pointed to Y, and future behavior should adjust.

But not every valuable memory fits that shape. Sometimes you do not have a correction yet. Sometimes you have tension. Sometimes a pattern keeps appearing but does not have enough evidence to become a claim. Sometimes your old belief was not false; it was incomplete, context-bound, or waiting for a better frame.

That means a durable memory system needs more than preferences, summaries, and corrections. It needs unresolved memory: a place for things that are not ready to collapse into a conclusion.

The danger is obvious: open questions can become procrastination with better formatting. That is why unresolved memory needs structure, review triggers, and forced-resolution rules. The point is not to reward ambiguity. The point is to preserve uncertainty only while it is still doing work.

The Three-Layer Frame

Memory type	It says	Best use	Typical lifespan	Agent should surface it when...	Failure mode
Summary memory	"Here is what happened."	Fast continuity	Days to weeks	The task only needs current state	Deletes uncertainty
Correction memory	"Here is what changed."	Preventing repeated mistakes	Long-lived, superseded when stale	A current plan repeats a known failure	Turns revisions into doctrine
Unresolved memory	"Here is what remains open."	Preserving live questions	Weeks to months, never permanent by default	A decision touches an active uncertainty boundary	Becomes drag if not triaged

Continuity needs summary. Judgment needs correction. Discovery needs unresolved memory.

The Architecture of Uncertainty

A good unresolved-memory entry is not a vague note to self. It should preserve the state of knowledge at the time of reasoning:

Core fields:

Question: what is actually unresolved?
Tags / scope: which project, domain, or decision does this touch?
Live interpretations: what are the plausible explanations?
Uncertainty boundary: what is not known yet?
Next evidence needed: what would make the question sharper?
Review policy / TTL: when does this need to narrow, move, or die?

Advanced fields, only when the stakes justify them:

Confidence range per interpretation: weak / moderate / strong, or a probability range like 20-40%.
Falsification condition: what would weaken or kill each interpretation?
Linked memories: related corrections, decisions, summaries, or gates.
Status: open, narrowed, moved to gate, moved to correction, resolved, archived.

That structure keeps uncertainty from becoming laziness. Without it, "keep an open mind" becomes an excuse to never decide.
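As a minimal sketch, the core and advanced fields can be modeled as a small data structure. The names here (`OpenQuestion`, `Interpretation`, `is_due`) are illustrative assumptions, not part of any existing tool, and the sample entry is made up for the demonstration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Status values taken from the field list above.
STATUSES = {"open", "narrowed", "moved to gate", "moved to correction", "resolved", "archived"}


@dataclass
class Interpretation:
    claim: str
    confidence: str = "weak"            # weak / moderate / strong
    falsified_if: Optional[str] = None  # advanced field, optional


@dataclass
class OpenQuestion:
    question: str
    tags: list[str]
    interpretations: list[Interpretation]
    uncertainty_boundary: str
    next_evidence: str
    review_date: date                   # the TTL: when this must narrow, move, or die
    status: str = "open"
    linked_memories: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def is_due(self, today: date) -> bool:
        """True when an open or narrowed question has reached its review date."""
        return self.status in {"open", "narrowed"} and today >= self.review_date


# Hypothetical entry for illustration.
q = OpenQuestion(
    question="Is the slowdown algorithmic or data-shaped?",
    tags=["search-api", "performance"],
    interpretations=[
        Interpretation("Algorithmic complexity", "moderate"),
        Interpretation("Data distribution", "moderate"),
    ],
    uncertainty_boundary="No production profiling sample yet.",
    next_evidence="Trace p95 requests by tenant size and cache-hit status.",
    review_date=date(2026, 5, 29),
)
print(q.is_due(date(2026, 5, 30)))  # True
```

Making the review date a required field is a deliberate choice in this sketch: an entry without a TTL cannot be created at all.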

A Better Template

Add one file beside your correction log:

open_questions.md

Use this core template:

## [date] — [question title]
Status:
open / narrowed / moved to gate / moved to correction / resolved / archived

Tags:
[project/domain/decision]

Question:
What is unresolved?

Live interpretations:
1. [Interpretation] — why this is plausible
2. [Interpretation] — why this is plausible
3. [Interpretation] — why this is plausible

Uncertainty boundary:
What do we not know yet?

Next evidence needed:
What would make this clearer?

Review policy / TTL:
If no new evidence arrives by [date or condition], then [decide / move to gate / archive].

For high-stakes questions, add confidence ranges and falsification conditions:

Interpretation: [...]
Confidence: weak / moderate / strong, or [20-40%]
Falsified if: [...]
Current strongest read: [...]
Linked memories: [...]
Debiasing check: what would I believe if this interpretation were inconvenient?

Exact percentages can create fake precision. Use them only if you are actually tracking outcomes and calibration. For most personal systems, start with weak / moderate / strong until you have enough predictions to know whether your confidence means anything.
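One common way to check whether your percentages mean anything is the Brier score: the mean squared difference between the probability you stated and what actually happened. The sketch below is a minimal version, and the forecasts in it are invented for illustration.

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated probability and outcome (0 is perfect)."""
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    return sum((p - float(outcome)) ** 2 for p, outcome in forecasts) / len(forecasts)


# Hypothetical resolved predictions: (stated probability, did it happen?)
resolved = [(0.8, True), (0.3, False), (0.6, False), (0.9, True)]
score = brier_score(resolved)
print(round(score, 3))  # 0.125
```

For reference, always answering 50% scores 0.25, so a tracked score that is not clearly below that suggests the percentages are not yet earning their precision.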

Concrete Examples

The examples below use the advanced fields because the questions are high-stakes enough to justify the extra structure. A daily personal note does not need this much ceremony.

Coding:

## 2026-05-24 — Is the slowdown algorithmic or data-shaped?
Tags:
search-api, performance, production

Question:
Is the latency spike caused by the algorithm, the data distribution, or the caching layer?

Live interpretations:
1. Algorithmic complexity — moderate — local profiling shows a slower path on larger inputs.
   Falsified if: production traces show constant-time behavior after cache miss removal.
2. Data distribution — moderate — slow requests cluster around unusually large tenant records.
   Falsified if: tenant size does not correlate with p95 latency.
3. Cache behavior — weak — recent cache-key change may be causing misses.
   Falsified if: hit rate remains stable across the spike window.

Current strongest read:
Algorithmic complexity is leading, but production traces are missing.

Uncertainty boundary:
No production profiling sample yet.

Next evidence needed:
Trace p95 requests by tenant size and cache-hit status.

Linked memories:
- corrections.md: "Do not optimize generated assumptions before profiling."
- gates.md: "Performance fix accepted only after p95 improves on production-like data."

Review policy / TTL:
If traces are not collected by Friday, stop debating and instrument first.

Status:
open

Hiring:

## 2026-05-24 — Is the candidate underqualified, or is the role underspecified?
Tags:
hiring, team-design, operations

Question:
Is the candidate actually underqualified, or is the team interviewing against an unclear role?

Live interpretations:
1. Candidate underqualified — moderate — answers were shallow on system design.
   Falsified if: a work sample shows strong practical judgment under realistic constraints.
2. Role underspecified — moderate — interviewers asked for different success criteria.
   Falsified if: the team can agree on three non-negotiable outcomes before the next interview.
3. Interview process weak — weak — questions may not reflect real work.
   Falsified if: structured work sample produces the same signal.

Current strongest read:
Role underspecification is leading because interview feedback conflicts.

Uncertainty boundary:
No agreed role scorecard or work sample yet.

Next evidence needed:
A written role scorecard and one realistic work sample.

Review policy / TTL:
If the team cannot define the role by Friday, pause the hire rather than rejecting the candidate.

Status:
open

The point is not to keep questions open forever. The point is to stop weak summaries from killing live hypotheses before evidence arrives.

Retrieval Hygiene

Open questions are expensive memory. You should not load all of them into every session.

Use these rules:

Put the currently active open questions in state.md.
Tag every open question by project, domain, and decision type.
Load only entries whose tags match the task.
Give unresolved items a separate namespace or metadata field if you use a vector store.
Decay or archive questions that miss their TTL without producing evidence.
Run a periodic epistemic audit: what stayed open, what narrowed, what became a gate, what should be killed?

For agent systems, unresolved memory should carry explicit metadata:

epistemic_status: unresolved
confidence_range: [weak / moderate / strong]
review_date: [...]
surface_when: [matching project/tag/decision]

If you use embeddings or a vector database, keep unresolved items filterable. A simple rule is: retrieve only when tags overlap and semantic similarity is high enough to matter. The exact threshold depends on your system, but the principle is stable: unresolved memory should be opt-in by relevance, not dumped into every context window.

If your tool supports frontmatter, the same structure can look like this:

---
epistemic_status: unresolved
tags: [market-entry, distribution]
status: open
confidence_range: moderate
review_date: 2026-06-07
surface_when: [market-entry, pricing, distribution]
---

In Obsidian with Dataview, a due-question view can be as simple as:

TABLE review_date, tags
WHERE epistemic_status = "unresolved" AND status = "open" AND review_date <= date(today)
SORT review_date ASC

Otherwise retrieval becomes context pollution. Too many unresolved questions will make the agent hesitant, noisy, and expensive to run.
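The retrieval rule (surface an entry only when tags overlap and similarity is high enough) can be sketched as a two-stage filter. This is an illustration under stated assumptions: the `embedding` key, the toy three-dimensional vectors, and the 0.75 threshold are placeholders, and a real system would take its vectors and its threshold from its own embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_surface(entry: dict, task_tags: set[str], task_vec: list[float],
                   threshold: float = 0.75) -> bool:
    """Opt-in retrieval: unresolved, tags overlap, AND similarity clears the threshold."""
    if entry["epistemic_status"] != "unresolved":
        return False
    if not task_tags & set(entry["surface_when"]):
        return False
    return cosine(entry["embedding"], task_vec) >= threshold


# Toy vectors stand in for real embeddings (assumption for this sketch).
entry = {
    "epistemic_status": "unresolved",
    "surface_when": ["market-entry", "pricing", "distribution"],
    "embedding": [0.9, 0.1, 0.0],
}
print(should_surface(entry, {"pricing"}, [1.0, 0.0, 0.0]))  # True
print(should_surface(entry, {"hiring"}, [1.0, 0.0, 0.0]))   # False: no tag overlap
print(should_surface(entry, {"pricing"}, [0.0, 0.0, 1.0]))  # False: not similar enough
```

Checking tags before similarity keeps the cheap test first, so most entries are rejected without any vector math.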

Skip unresolved memory for low-stakes tasks, live incidents, breaking news, trading decisions, or deadline-heavy execution where the cost of hesitation is higher than the cost of a rough decision. This layer is for questions with enough future impact to justify carrying them.

The Lifecycle Matters

The files are not separate boxes. Entries migrate: an open question can narrow, split, become a gate, become a correction, or reopen after new evidence.

Migration paths:

open question
  -> gate when the question becomes testable
  -> correction when evidence changes behavior
  -> decision when a path is chosen despite uncertainty
  -> archived when no longer decision-relevant
  -> reopened when new evidence changes the frame

Example lifecycle:

open_questions.md
Question: Is the product weak, or has distribution not reached the right readers?
Status: open until 100 targeted readers or 14 days.

gates.md
Gate: If 100 targeted readers produce no clicks, saves, replies, or buys, revise the positioning.

corrections.md
Correction: "Shipping is not conversion." Publishing created an asset; distribution remained untested.

decisions.md
Decision: Keep the product live at $12 while testing distribution; reject building a second product until the gate resolves.

If 100 targeted readers respond strongly but no one buys, the question can reopen:

open_questions.md
New question: Is the article strong but the Gumroad page under-converting?

You do not always replace old beliefs. Sometimes you contextualize them, narrow them, or reopen them under new evidence.
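The migration paths can be enforced with a small transition table, so an entry cannot change state without a declared path and a recorded reason. This is a sketch, not a prescribed design: the moves out of "open" follow the paths listed above, while the moves out of the other states, and the omission of narrowing and splitting, are simplifying assumptions.

```python
# Allowed moves between states. The moves out of "open" follow the migration
# paths in the article; the moves out of the other states are assumptions.
TRANSITIONS = {
    "open": {"gate", "correction", "decision", "archived"},
    "gate": {"correction", "decision", "archived", "open"},
    "correction": {"open"},
    "decision": {"open"},
    "archived": {"open"},
}


def migrate(current: str, target: str, reason: str) -> str:
    """Return the new state, refusing undeclared moves or moves without a reason."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    if not reason.strip():
        raise ValueError("every migration needs a recorded reason")
    return target


state = "open"
state = migrate(state, "gate", "question became testable: 100 targeted readers")
state = migrate(state, "correction", "publishing created an asset; distribution untested")
state = migrate(state, "open", "readers responded strongly but no one bought")
print(state)  # open
```

Requiring a reason on every move is what keeps reopening honest: a question returns to "open" only when someone can name the evidence that changed the frame.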

Cross-Examination Prompt

The power is not in having several files. The power is in making them argue.

Use this prompt:

Read state.md, corrections.md, gates.md, and open_questions.md.
Use only open questions whose tags match the current task.
For each relevant open question:
- Check whether it conflicts with a previous correction or active gate.
- Classify it as productive uncertainty, retreaded error, lingering task, or avoidance.
- Flag anything older than 30 days without new evidence or a reviewed TTL.
- Separate what is known from what is assumed.
Do not resolve the question unless the missing evidence is present.

This catches the biggest failure mode: using "unresolved" as a mask for not wanting to accept an answer.
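The 30-day flag in the prompt can also be computed before the agent reads anything, so the model spends its attention on classification rather than date arithmetic. A minimal sketch of one reading of that rule: an open entry is stale when its most recent activity (creation, new evidence, or a TTL review) is more than 30 days old. The dictionary keys are hypothetical.

```python
from datetime import date, timedelta


def stale_questions(entries: list[dict], today: date, max_age_days: int = 30) -> list[str]:
    """Titles of open entries with no evidence or TTL review in the last max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for e in entries:
        if e["status"] != "open":
            continue
        # Most recent sign of life: creation, new evidence, or a TTL review.
        touches = (e["created"], e.get("last_evidence"), e.get("ttl_reviewed"))
        last_touch = max(d for d in touches if d)
        if last_touch < cutoff:
            stale.append(e["title"])
    return stale


# Hypothetical entries for illustration.
entries = [
    {"title": "Product weak or distribution untested?", "status": "open",
     "created": date(2026, 4, 1)},
    {"title": "Slowdown algorithmic or data-shaped?", "status": "open",
     "created": date(2026, 4, 1), "last_evidence": date(2026, 5, 20)},
]
print(stale_questions(entries, today=date(2026, 5, 24)))
# ['Product weak or distribution untested?']
```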

Anti-Patterns

Unresolved memory can rot too.

Infinite openness / ambiguity addiction: treating non-commitment as sophistication after enough evidence exists.
Vague intuition: preserving a feeling without naming what would make it testable.
False balance: treating all interpretations as equal when one has stronger evidence.
Identity-protective uncertainty: keeping a question open because closure threatens ego, sunk cost, ideology, or self-image.
No review trigger: creating open loops that never return to the work.
No decision relevance: keeping questions open that do not affect any future action.

The fix is triage. Every open question needs a review trigger, evidence target, or decision link. If a question cannot influence a future decision, it may not belong in the file.
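The triage rule is simple enough to check mechanically. In this sketch the three key names are assumptions chosen to mirror the sentence above; an entry with none of them filled in is marked as a candidate for removal rather than deleted automatically.

```python
def triage(entry: dict) -> str:
    """Classify an open question as 'keep' or 'drop candidate'."""
    anchors = ("review_trigger", "evidence_target", "decision_link")
    # Blank or whitespace-only values do not count as anchors.
    has_anchor = any(str(entry.get(key) or "").strip() for key in anchors)
    return "keep" if has_anchor else "drop candidate"


print(triage({"question": "Is the role underspecified?", "review_trigger": "Friday"}))
print(triage({"question": "Something feels off about pricing"}))
```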

Privacy and Team Context

Open questions are often more sensitive than corrections. Corrections describe what was wrong. Open questions describe what might be wrong: doubts about strategy, competence, relationships, markets, architecture, or timing. Keep private unresolved memory local by default. Do not load it into every cloud agent. Separate public examples from real records.

In team or multi-agent systems, unresolved memory also needs ownership:

Who owns the question?
Who can resolve it?
What evidence standard is required?
Which users or agents should be allowed to see it?

Without ownership and resolution authority, shared open questions become political fog.

How to Know It Is Working

Measure the system by behavior, not elegance. Track:

decisions that were delayed until missing evidence arrived,
assumptions that moved from open question to gate,
corrections generated from resolved questions,
repeated mistakes avoided,
prediction accuracy over time,
project outcomes after review triggers.

If open questions never change decisions, they are decoration. If they slow the right decisions and accelerate the right closures, they are infrastructure.

For a manual setup, audit every two weeks while the system is new, then monthly once it stabilizes.

Five active questions per project is usually plenty; beyond that, you are probably journaling instead of governing uncertainty. Keep the total active set under 20 to 25 unless you have automated retrieval and review. Everything else should move to archive, gate, decision, or correction.

During the audit, ask:

Which open question changed a decision?
Which one has no new evidence?
Which one is older than its TTL?
Which one should become a gate, correction, decision, or archive?
Which one am I keeping open because the answer is inconvenient?

Two useful KPIs: the percentage of open questions that resolve or migrate within 30 days, and the percentage of resolved questions that later prevented a repeated mistake.
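Both KPIs are simple ratios once each entry records when it opened, when it left the open state, and whether it later prevented a repeat. The sketch below assumes hypothetical keys (`opened`, `closed`, `prevented_repeat`) and counts archiving as a migration; adjust both to match your own records.

```python
from datetime import date


def audit_kpis(entries: list[dict], window_days: int = 30) -> dict[str, float]:
    """Two percentages from the audit: timely migration and mistakes prevented."""
    moved_states = {"resolved", "gate", "correction", "decision", "archived"}
    timely = [
        e for e in entries
        if e["status"] in moved_states
        and (e["closed"] - e["opened"]).days <= window_days
    ]
    resolved = [e for e in entries if e["status"] == "resolved"]
    prevented = [e for e in resolved if e.get("prevented_repeat")]
    return {
        "migrated_within_window_pct": 100 * len(timely) / len(entries) if entries else 0.0,
        "resolved_that_prevented_repeat_pct": 100 * len(prevented) / len(resolved) if resolved else 0.0,
    }


# Hypothetical audit data for illustration.
entries = [
    {"status": "resolved", "opened": date(2026, 5, 1), "closed": date(2026, 5, 20),
     "prevented_repeat": True},
    {"status": "gate", "opened": date(2026, 4, 1), "closed": date(2026, 5, 20)},
    {"status": "open", "opened": date(2026, 5, 10)},
    {"status": "resolved", "opened": date(2026, 5, 1), "closed": date(2026, 5, 10)},
]
print(audit_kpis(entries))
# {'migrated_within_window_pct': 50.0, 'resolved_that_prevented_repeat_pct': 50.0}
```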

Agent System Prompt

Use this as a standing instruction for agents that read your memory:

When using my memory, preserve epistemic status. Do not treat unresolved questions as settled facts. Surface unresolved memory only when its tags or decision scope match the current task. Separate what is known, inferred, contested, and missing. If a relevant open question conflicts with a correction or gate, flag the conflict before recommending action. During execution sprints, deadlines, or low-stakes tasks, default to closure unless the unresolved item would materially change the decision.

Sources and Adjacent Work

This article is not claiming uncertainty management is new. Richards Heuer's Psychology of Intelligence Analysis formalized Analysis of Competing Hypotheses inside intelligence work. Philip Tetlock and the Good Judgment Project made calibration, probability updates, and forecasting discipline legible to a wider audience. Science has falsification, competing models, and peer review. Engineering has incident postmortems and decision records. Law has bracketing, standards of proof, and unresolved factual questions.

The point here is narrower: personal AI memory systems need the same discipline. If they do not preserve epistemic status, uncertainty boundaries, and review triggers, they will compress unresolved questions into confident summaries.

Related areas worth studying:

Richards Heuer's Psychology of Intelligence Analysis
Good Judgment / Superforecasting
Context rot and retrieval drift
Human-in-the-loop evaluation
Scientific falsification and competing hypotheses
Engineering decision records and postmortems

How to Start Tonight

Create open_questions.md.

Write one entry for a question you keep circling but cannot honestly resolve yet.

Use four rules:

Name at least two live interpretations.
Give each interpretation a confidence band and falsification condition.
Name what evidence is missing.
Name the TTL or review trigger.

Then ask your agent:

Read open_questions.md.
Tell me which current decision is being treated as settled even though the record says it is still unresolved.
Tell me which open question is productive uncertainty, and which one is avoidance.
Do not resolve a question unless the missing evidence is present.

If the agent slows you down in the right place, the file is working.

Correction memory protects you from repeating what failed. Unresolved memory protects you from killing what has not been understood yet.

If you have not built the first layer yet, start with correction memory. Once your system can preserve where you were wrong, add unresolved memory to preserve what should not be settled yet.

This is the second layer of the correction-memory framework: preserve the state of knowledge at the time of reasoning, including what was known, what was inferred, what was contested, and what evidence was still missing.

Top comments (10)

xulingfeng • May 25

Totally agree on the scale angle. Been running a split setup — active context vs archival store — and the tagging overhead is real. The expiry question is the one I haven’t cracked yet either. Curious what threshold you land on for TTLs.

Self-Correcting Systems • May 25

I’m leaning toward separating retention from influence. I don’t want to delete much of the journey. The middle is often where the useful signal lives: what changed, what failed, what got rejected, and what evidence moved the decision. A future model with a larger context window may be able to read the full archive and surface patterns we can’t see yet. But not every memory should stay active. My current rule is: archive generously, influence selectively. So expiry doesn’t mean “delete this.” It means “this memory can’t steer decisions anymore unless it gets reviewed.” Active context stays small. Archival memory keeps the full path. Corrections and unresolved questions sit between the two with TTLs, review triggers, and status labels. The hard part is preventing layers from corrupting each other. Old memories should inform future choices, but stale memories shouldn’t silently govern them. Curious how you’re thinking about that boundary: when does something stop steering decisions and move into archive in your setup?

xulingfeng • May 27

The scale angle is the one that keeps surprising me — people focus on per-call latency but the real bottleneck is context management as you scale up. The split strategy you mentioned sounds similar to what we settled on: separate short-term and long-term stores with different eviction policies. What's your split ratio looking like?

Self-Correcting Systems • May 27 • Edited

That is exactly the part that surprised me too. At small scale, the obvious problem looks
like latency or context window size. At larger scale, the harder problem becomes deciding
what gets promoted, compressed, evicted, or allowed to influence an answer.

I don’t have a universal split ratio yet. In my own setup I think of it less as a fixed
percentage and more as different authority lanes:

short-term/session memory: current task state, open decisions, recent corrections
working project memory: active files, current constraints, live source-of-truth notes
long-term memory: durable principles, repeated failures, identity/context, archived lessons

The practical rule I’m moving toward is: short-term memory should be cheap to overwrite,
long-term memory should be hard to promote into, and corrections should have a longer
half-life than ordinary preferences.

So if I had to describe the ratio today, it’s probably something like: keep the active
context small enough to stay operational, but reserve long-term storage for things that
change future behavior, not just things that happened.

xulingfeng • May 24

Nice writeup. One thing I'd add: Why AI Memory Resolves Too Much — And What to Preserve Inste can be tricky when you scale, but the core insight here is solid. Thanks for sharing the details.

Self-Correcting Systems • May 24

Appreciate it. I agree — scale is the hard part. A single open_questions.md works for
one operator, but once you have lots of memories or multiple agents, unresolved items
need metadata, TTLs, ownership, and retrieval filters or they turn into context
pollution. That’s probably the next layer I need to pressure-test more.

xulingfeng • May 27

The writeup was a fun one to put together — the memory resolution problem is deceptively complex once you dig into it. My biggest takeaway was that metadata tagging upfront saves a ton of pain later when you need to prune or reorganize. Glad it resonated!

Self-Correcting Systems • May 27

That makes sense. The deceptively complex part is exactly what pulled me into this too.

At first it looks like a storage problem: save more context, retrieve more context,
summarize when needed. But once the memory starts influencing real decisions, the hard
part becomes resolution: which memory wins, which one only provides context, which one is
stale, and which one should block an action.

I agree on metadata tagging. Retrofitting metadata after the archive grows is painful
because the system already has habits by then. Even simple fields like source type,
freshness, status, authority, and allowed action make later pruning/reorganization much
cleaner.

The thing I’m trying to avoid now is “memory as a pile of useful notes.” The goal is
closer to memory as governed evidence.
