AI memory has an obvious failure mode: it can forget.
The quieter failure mode is worse: it can resolve too much.
Most memory systems want to compress the past into something neat: a summary, a decision, a preference, a corrected belief. That is useful when the matter is settled. But a lot of serious work is not settled. It is provisional, contested, partly evidenced, or waiting for the next signal.
If your memory system cannot preserve that state, it will flatten uncertainty into fake clarity. The agent may sound organized, but it will inherit a cleaned-up version of reality.
That is the failure mode: not forgetting, but remembering too cleanly.
Sycophancy is memory that agrees too much. Premature closure is memory that resolves too much. Both come from the same root: a memory system that optimizes for comfort and efficiency over judgment.
Why Summaries and Corrections Are Not Enough
Summaries are not neutral. Every summary chooses what matters, what disappears, and what tone the future inherits.
When a model compresses a messy debate into "User decided X," it may save tokens while deleting the pressure that produced the decision: the rejected alternatives, the uncertainty boundary, the condition under which X might stop being true.
This is how long-term AI systems become confident for the wrong reason.
They do not only hallucinate facts. They hallucinate settlement.
They turn:
"There are three competing interpretations. One is currently stronger, but the evidence is incomplete."
into:
"User believes interpretation one."
That looks efficient. It is actually a loss of judgment.
A product example is simple. A founder says, "The offer is not working." A rushed memory system records: "Offer failed." But maybe the offer did not fail. Maybe distribution was weak. Maybe the audience was wrong. Maybe the landing page was unclear. Maybe the offer is good but unproven.
"Offer failed" is a clean summary.
"Offer unproven; distribution and audience mismatch unresolved" is a better memory.
The first prematurely closes the idea. The second preserves the question.
Correction memory is powerful because it preserves where thinking changed: you believed X, evidence moved you to Y, and future behavior should adjust.
But not every valuable memory fits that shape. Sometimes you do not have a correction yet. Sometimes you have tension. Sometimes a pattern keeps appearing but does not have enough evidence to become a claim. Sometimes your old belief was not false; it was incomplete, context-bound, or waiting for a better frame.
That means a durable memory system needs more than preferences, summaries, and corrections. It needs unresolved memory: a place for things that are not ready to collapse into a conclusion.
The danger is obvious: open questions can become procrastination with better formatting. That is why unresolved memory needs structure, review triggers, and forced-resolution rules. The point is not to reward ambiguity. The point is to preserve uncertainty only while it is still doing work.
The Three-Layer Frame
| Memory type | It says | Best use | Typical lifespan | Agent should surface it when... | Failure mode |
|---|---|---|---|---|---|
| Summary memory | "Here is what happened." | Fast continuity | Days to weeks | The task only needs current state | Deletes uncertainty |
| Correction memory | "Here is what changed." | Preventing repeated mistakes | Long-lived, superseded when stale | A current plan repeats a known failure | Turns revisions into doctrine |
| Unresolved memory | "Here is what remains open." | Preserving live questions | Weeks to months, never permanent by default | A decision touches an active uncertainty boundary | Becomes drag if not triaged |
Continuity needs summary. Judgment needs correction. Discovery needs unresolved memory.
The Architecture of Uncertainty
A good unresolved-memory entry is not a vague note to self. It should preserve the state of knowledge at the time of reasoning:
Core fields:
- Question: what is actually unresolved?
- Tags / scope: which project, domain, or decision does this touch?
- Live interpretations: what are the plausible explanations?
- Uncertainty boundary: what is not known yet?
- Next evidence needed: what would make the question sharper?
- Review policy / TTL: when does this need to narrow, move, or die?
Advanced fields, only when the stakes justify them:
- Confidence range per interpretation: weak / moderate / strong, or a probability range like 20-40%.
- Falsification condition: what would weaken or kill each interpretation?
- Linked memories: related corrections, decisions, summaries, or gates.
- Status: open, narrowed, moved to gate, moved to correction, resolved, archived.
That structure keeps uncertainty from becoming laziness. Without it, "keep an open mind" becomes an excuse to never decide.
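For readers who keep memory in code rather than markdown, the fields above could be sketched as a small data structure. This is a minimal sketch, not a reference implementation: the class names, the `problems` method, and the sample date are assumptions made for illustration, and the check for "at least two interpretations" is one possible way to enforce the structure.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Interpretation:
    claim: str
    why_plausible: str
    confidence: Optional[str] = None    # advanced field: weak / moderate / strong
    falsified_if: Optional[str] = None  # advanced field


@dataclass
class OpenQuestion:
    question: str
    tags: list[str]
    interpretations: list[Interpretation]
    uncertainty_boundary: str
    next_evidence: str
    review_by: Optional[date]           # review policy / TTL
    status: str = "open"
    linked_memories: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List the reasons this entry is still a vague note to self."""
        issues = []
        if not self.tags:
            issues.append("no tags or scope")
        if len(self.interpretations) < 2:
            issues.append("fewer than two live interpretations")
        if not self.uncertainty_boundary.strip():
            issues.append("no uncertainty boundary")
        if not self.next_evidence.strip():
            issues.append("no next evidence named")
        if self.review_by is None:
            issues.append("no review policy or TTL")
        return issues


entry = OpenQuestion(
    question="Is the latency spike algorithmic, data-shaped, or cache-related?",
    tags=["search-api", "performance"],
    interpretations=[
        Interpretation("Algorithmic complexity", "slower path on larger inputs"),
        Interpretation("Data distribution", "slow requests cluster on large tenants"),
    ],
    uncertainty_boundary="No production profiling sample yet.",
    next_evidence="Trace p95 requests by tenant size and cache-hit status.",
    review_by=date(2026, 5, 29),  # illustrative placeholder date
)
vague = OpenQuestion("Something feels off", [], [], "", "", None)

print(entry.problems())  # a well-formed entry reports nothing
print(vague.problems())  # a vague note reports every missing core field
```

The design choice is that an entry is allowed to exist without a conclusion, but not without a boundary, an evidence target, and a review date.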
A Better Template
Add one file beside your correction log:
open_questions.md
Use this core template:
## [date] — [question title]
Status:
open / narrowed / moved to gate / moved to correction / resolved / archived
Tags:
[project/domain/decision]
Question:
What is unresolved?
Live interpretations:
1. [Interpretation] — why this is plausible
2. [Interpretation] — why this is plausible
3. [Interpretation] — why this is plausible
Uncertainty boundary:
What do we not know yet?
Next evidence needed:
What would make this clearer?
Review policy / TTL:
If no new evidence arrives by [date or condition], then [decide / move to gate / archive].
For high-stakes questions, add confidence ranges and falsification conditions:
Interpretation: [...]
Confidence: weak / moderate / strong, or [20-40%]
Falsified if: [...]
Current strongest read: [...]
Linked memories: [...]
Debiasing check: what would I believe if this interpretation were inconvenient?
Exact percentages can create fake precision. Use them only if you are actually tracking outcomes and calibration. For most personal systems, start with weak / moderate / strong until you have enough predictions to know whether your confidence means anything.
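One way to find out whether your confidence means anything is to log predictions and score them later. The sketch below assumes you record each prediction as a pair (a band or a probability, plus whether it came true); the function names and record shapes are hypothetical. The Brier score used here is the standard mean squared error between a stated probability and the 0/1 outcome.

```python
from collections import defaultdict


def hit_rate_by_band(records):
    """records: iterable of (band, came_true) pairs, e.g. ("strong", True)."""
    totals = defaultdict(lambda: [0, 0])  # band -> [hits, count]
    for band, came_true in records:
        totals[band][0] += int(came_true)
        totals[band][1] += 1
    return {band: hits / count for band, (hits, count) in totals.items()}


def brier_score(forecasts):
    """forecasts: iterable of (probability, came_true). Lower is better."""
    forecasts = list(forecasts)
    return sum((p - float(outcome)) ** 2 for p, outcome in forecasts) / len(forecasts)


log = [("strong", True), ("strong", True), ("strong", False), ("weak", False)]
print(hit_rate_by_band(log))

# Always answering 50% scores 0.25, so a score near that carries no information.
print(brier_score([(0.5, True), (0.5, False)]))
```

If "strong" does not come true clearly more often than "weak" in your own log, the bands are decoration, and switching to percentages would only add decimals to the same noise.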
Concrete Examples
The examples below use the advanced fields because the questions are high-stakes enough to justify the extra structure. A daily personal note does not need this much ceremony.
Coding:
## 2026-05-24 — Is the slowdown algorithmic or data-shaped?
Tags:
search-api, performance, production
Question:
Is the latency spike caused by the algorithm, the data distribution, or the caching layer?
Live interpretations:
1. Algorithmic complexity — moderate — local profiling shows a slower path on larger inputs.
Falsified if: production traces show constant-time behavior after cache miss removal.
2. Data distribution — moderate — slow requests cluster around unusually large tenant records.
Falsified if: tenant size does not correlate with p95 latency.
3. Cache behavior — weak — recent cache-key change may be causing misses.
Falsified if: hit rate remains stable across the spike window.
Current strongest read:
Algorithmic complexity is leading, but production traces are missing.
Uncertainty boundary:
No production profiling sample yet.
Next evidence needed:
Trace p95 requests by tenant size and cache-hit status.
Linked memories:
- corrections.md: "Do not optimize generated assumptions before profiling."
- gates.md: "Performance fix accepted only after p95 improves on production-like data."
Review policy / TTL:
If traces are not collected by Friday, stop debating and instrument first.
Status:
open
Hiring:
## 2026-05-24 — Is the candidate underqualified, or is the role underspecified?
Tags:
hiring, team-design, operations
Question:
Is the candidate actually underqualified, or is the team interviewing against an unclear role?
Live interpretations:
1. Candidate underqualified — moderate — answers were shallow on system design.
Falsified if: a work sample shows strong practical judgment under realistic constraints.
2. Role underspecified — moderate — interviewers asked for different success criteria.
Falsified if: the team can agree on three non-negotiable outcomes before the next interview.
3. Interview process weak — weak — questions may not reflect real work.
Falsified if: structured work sample produces the same signal.
Current strongest read:
Role underspecification is leading because interview feedback conflicts.
Uncertainty boundary:
No agreed role scorecard or work sample yet.
Next evidence needed:
A written role scorecard and one realistic work sample.
Review policy / TTL:
If the team cannot define the role by Friday, pause the hire rather than rejecting the candidate.
Status:
open
The point is not to keep questions open forever. The point is to stop weak summaries from killing live hypotheses before evidence arrives.
Retrieval Hygiene
Open questions are expensive memory. You should not load all of them into every session.
Use these rules:
- Put the currently active open questions in state.md.
- Tag every open question by project, domain, and decision type.
- Load only entries whose tags match the task.
- Give unresolved items a separate namespace or metadata field if you use a vector store.
- Decay or archive questions that miss their TTL without producing evidence.
- Run a periodic epistemic audit: what stayed open, what narrowed, what became a gate, what should be killed?
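The decay rule in that list can be sketched as a small triage pass. This is an illustration under stated assumptions: the entry shape (`title`, `status`, `review_date`, `new_evidence`) and the two action labels are hypothetical, and a real system would likely read them from file metadata.

```python
from datetime import date


def triage(entries, today):
    """Map each overdue open question to a suggested action.

    entries: dicts with 'title', 'status', 'review_date' (date), and
    'new_evidence' (bool: did anything arrive since the last review?).
    """
    actions = {}
    for e in entries:
        if e["status"] != "open" or e["review_date"] > today:
            continue  # not open, or still inside its TTL
        # Past TTL: evidence earns a review, silence earns the archive.
        actions[e["title"]] = "review" if e["new_evidence"] else "archive"
    return actions


entries = [
    {"title": "Pricing or distribution?", "status": "open",
     "review_date": date(2026, 6, 7), "new_evidence": True},
    {"title": "Rewrite the onboarding?", "status": "open",
     "review_date": date(2026, 6, 1), "new_evidence": False},
    {"title": "Cache or algorithm?", "status": "open",
     "review_date": date(2026, 7, 1), "new_evidence": False},
]
print(triage(entries, today=date(2026, 6, 10)))
```

The third entry is left alone because its review date has not arrived yet; the pass only touches questions that have used up their TTL.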
For agent systems, unresolved memory should carry explicit metadata:
epistemic_status: unresolved
confidence_range: [weak / moderate / strong]
review_date: [...]
surface_when: [matching project/tag/decision]
If you use embeddings or a vector database, keep unresolved items filterable. A simple rule is: retrieve only when tags overlap and semantic similarity is high enough to matter. The exact threshold depends on your system, but the principle is stable: unresolved memory should be opt-in by relevance, not dumped into every context window.
If your tool supports frontmatter, the same structure can look like this:
epistemic_status: unresolved
tags: [market-entry, distribution]
status: open
confidence_range: moderate
review_date: 2026-06-07
surface_when: [market-entry, pricing, distribution]
In Obsidian with Dataview, a due-question view can be as simple as:
TABLE review_date, tags
WHERE epistemic_status = "unresolved" AND status = "open" AND review_date <= date(today)
SORT review_date ASC
Without filters like these, retrieval becomes context pollution. Too many unresolved questions will make the agent hesitant, noisy, and expensive to run.
Skip unresolved memory for low-stakes tasks, live incidents, breaking news, trading decisions, or deadline-heavy execution where the cost of hesitation is higher than the cost of a rough decision. This layer is for questions with enough future impact to justify carrying them.
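The retrieval rule described earlier in this section (surface an unresolved item only when tags overlap and similarity is high enough) might look like the sketch below. Everything here is an assumption for illustration: the memory record shape, the `min_similarity` default of 0.75, and the toy two-dimensional embeddings. A real setup would use your vector store's own metadata filters and a threshold tuned to your embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relevant_open_questions(task_tags, task_vec, memories, min_similarity=0.75):
    """Return titles of unresolved memories that pass both gates, best match first."""
    hits = []
    for m in memories:
        if m["epistemic_status"] != "unresolved":
            continue  # summaries and corrections are retrieved elsewhere
        if not set(task_tags) & set(m["surface_when"]):
            continue  # gate 1: tags must overlap
        score = cosine(task_vec, m["embedding"])
        if score >= min_similarity:  # gate 2: similarity must be high enough
            hits.append((score, m["title"]))
    return [title for score, title in sorted(hits, reverse=True)]


memories = [
    {"title": "A", "epistemic_status": "unresolved",
     "surface_when": ["pricing"], "embedding": [1.0, 0.0]},
    {"title": "B", "epistemic_status": "unresolved",
     "surface_when": ["hiring"], "embedding": [1.0, 0.0]},   # wrong tags
    {"title": "C", "epistemic_status": "unresolved",
     "surface_when": ["pricing"], "embedding": [0.0, 1.0]},  # unrelated content
]
print(relevant_open_questions(["pricing", "distribution"], [1.0, 0.0], memories))
```

Requiring both gates is the point: tags alone pull in everything filed under a broad project, and similarity alone pulls in questions from unrelated decisions that happen to use the same words.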
The Lifecycle Matters
The files are not separate boxes. Entries migrate: an open question can narrow, split, become a gate, become a correction, or reopen after new evidence.
Migration paths:
open question
-> gate when the question becomes testable
-> correction when evidence changes behavior
-> decision when a path is chosen despite uncertainty
-> archived when no longer decision-relevant
-> reopened when new evidence changes the frame
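If entries live in code, the migration paths can be enforced as a small transition table. This is a sketch under assumptions: the state names follow the list above, but which states are allowed to reopen, the requirement for a written reason, and the `migrate` helper itself are illustrative choices, not part of the original framework.

```python
# Allowed moves. Only the paths out of "open" come from the list above;
# letting every other state reopen is an assumption of this sketch.
ALLOWED = {
    "open": {"gate", "correction", "decision", "archived"},
    "gate": {"open"},
    "correction": {"open"},
    "decision": {"open"},
    "archived": {"open"},
}


def migrate(entry, new_state, reason):
    """Move an entry to a new state, recording why; reject undefined paths."""
    if new_state not in ALLOWED.get(entry["state"], set()):
        raise ValueError(f"no path from {entry['state']!r} to {new_state!r}")
    if not reason.strip():
        raise ValueError("a migration needs a stated reason")
    entry["history"].append((entry["state"], new_state, reason))
    entry["state"] = new_state
    return entry


q = {"state": "open", "history": []}
migrate(q, "gate", "question became testable: 100 targeted readers")
migrate(q, "open", "readers responded strongly but no one bought")
print(q["state"], len(q["history"]))
```

Keeping the history means a reopened question carries the record of why it was closed, which is what separates reopening under new evidence from simply changing your mind.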
Example lifecycle:
open_questions.md
Question: Is the product weak, or has distribution not reached the right readers?
Status: open until 100 targeted readers or 14 days.
gates.md
Gate: If 100 targeted readers produce no clicks, saves, replies, or buys, revise the positioning.
corrections.md
Correction: "Shipping is not conversion." Publishing created an asset; distribution remained untested.
decisions.md
Decision: Keep the product live at $12 while testing distribution; reject building a second product until the gate resolves.
If 100 targeted readers respond strongly but no one buys, the question can reopen:
open_questions.md
New question: Is the article strong but the Gumroad page under-converting?
You do not always replace old beliefs. Sometimes you contextualize them, narrow them, or reopen them under new evidence.
Cross-Examination Prompt
The power is not in having separate files. The power is in making them argue.
Use this prompt:
Read state.md, corrections.md, gates.md, and open_questions.md.
Use only open questions whose tags match the current task.
For each relevant open question:
- Check whether it conflicts with a previous correction or active gate.
- Classify it as productive uncertainty, retreaded error, lingering task, or avoidance.
- Flag anything older than 30 days without new evidence or a reviewed TTL.
- Separate what is known from what is assumed.
Do not resolve the question unless the missing evidence is present.
This catches the biggest failure mode: using "unresolved" as a mask for not wanting to accept an answer.
Anti-Patterns
Unresolved memory can rot too.
- Infinite openness / ambiguity addiction: treating non-commitment as sophistication after enough evidence exists.
- Vague intuition: preserving a feeling without naming what would make it testable.
- False balance: treating all interpretations as equal when one has stronger evidence.
- Identity-protective uncertainty: keeping a question open because closure threatens ego, sunk cost, ideology, or self-image.
- No review trigger: creating open loops that never return to the work.
- No decision relevance: keeping questions that do not affect any future action.
The fix is triage. Every open question needs a review trigger, evidence target, or decision link. If a question cannot influence a future decision, it may not belong in the file.
Privacy and Team Context
Open questions are often more sensitive than corrections. Corrections describe what was wrong. Open questions describe what might be wrong: doubts about strategy, competence, relationships, markets, architecture, or timing. Keep private unresolved memory local by default. Do not load it into every cloud agent. Separate public examples from real records.
In team or multi-agent systems, unresolved memory also needs ownership:
- Who owns the question?
- Who can resolve it?
- What evidence standard is required?
- Which users or agents should be allowed to see it?
Without ownership and resolution authority, shared open questions become political fog.
How to Know It Is Working
Measure the system by behavior, not elegance. Track:
- decisions that were delayed until missing evidence arrived,
- assumptions that moved from open question to gate,
- corrections generated from resolved questions,
- repeated mistakes avoided,
- prediction accuracy over time,
- project outcomes after review triggers.
If open questions never change decisions, they are decoration. If they slow the right decisions and accelerate the right closures, they are infrastructure. For a manual setup, audit every two weeks while the system is new, then monthly once it stabilizes. Five active questions per project is usually plenty; beyond that, you are probably journaling instead of governing uncertainty. Keep the total active set under 20 to 25 unless you have automated retrieval and review. Everything else should move to archive, gate, decision, or correction.
During the audit, ask:
- Which open question changed a decision?
- Which one has no new evidence?
- Which one is older than its TTL?
- Which one should become a gate, correction, decision, or archive?
- Which one am I keeping open because the answer is inconvenient?
Two useful KPIs: the percentage of open questions that resolve or migrate within 30 days, and the percentage of resolved questions that later prevented a repeated mistake.
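Both KPIs can be computed from a simple log. The sketch below assumes each question is recorded with an `opened` date, a `closed` date (or `None` if still open), and a `prevented_repeat` flag you set by hand during audits; those field names are hypothetical. It also assumes the first KPI is measured against every question ever opened, so very recent questions count against it until they close.

```python
from datetime import date


def kpis(questions, window_days=30):
    """Return (% resolved or migrated within the window, % of closed that prevented a repeat)."""
    closed = [q for q in questions if q["closed"] is not None]
    fast = [q for q in closed if (q["closed"] - q["opened"]).days <= window_days]
    pct_fast = 100 * len(fast) / len(questions) if questions else 0.0
    prevented = sum(1 for q in closed if q["prevented_repeat"])
    pct_prevented = 100 * prevented / len(closed) if closed else 0.0
    return pct_fast, pct_prevented


log = [
    {"opened": date(2026, 5, 1), "closed": date(2026, 5, 10), "prevented_repeat": True},
    {"opened": date(2026, 5, 1), "closed": date(2026, 6, 15), "prevented_repeat": False},
    {"opened": date(2026, 5, 1), "closed": None, "prevented_repeat": False},
    {"opened": date(2026, 5, 1), "closed": date(2026, 5, 31), "prevented_repeat": False},
]
pct_fast, pct_prevented = kpis(log)
print(round(pct_fast, 1), round(pct_prevented, 1))
```

The second number depends entirely on honest tagging during audits, so treat it as a prompt for review rather than a precise measurement.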
Agent System Prompt
Use this as a standing instruction for agents that read your memory:
When using my memory, preserve epistemic status. Do not treat unresolved questions as settled facts. Surface unresolved memory only when its tags or decision scope match the current task. Separate what is known, inferred, contested, and missing. If a relevant open question conflicts with a correction or gate, flag the conflict before recommending action. During execution sprints, deadlines, or low-stakes tasks, default to closure unless the unresolved item would materially change the decision.
Sources and Adjacent Work
This article is not claiming uncertainty management is new. Richards Heuer's Psychology of Intelligence Analysis formalized Analysis of Competing Hypotheses inside intelligence work. Philip Tetlock and the Good Judgment Project made calibration, probability updates, and forecasting discipline legible to a wider audience. Science has falsification, competing models, and peer review. Engineering has incident postmortems and decision records. Law has bracketing, standards of proof, and unresolved factual questions.
The point here is narrower: personal AI memory systems need the same discipline. If they do not preserve epistemic status, uncertainty boundaries, and review triggers, they will compress unresolved questions into confident summaries.
Related areas worth studying:
- Richards Heuer's Psychology of Intelligence Analysis
- Good Judgment / Superforecasting
- Context rot and retrieval drift
- Human-in-the-loop evaluation
- Scientific falsification and competing hypotheses
- Engineering decision records and postmortems
How to Start Tonight
Create open_questions.md.
Write one entry for a question you keep circling but cannot honestly resolve yet.
Use four rules:
- Name at least two live interpretations.
- Give each interpretation a confidence band and falsification condition.
- Name what evidence is missing.
- Name the TTL or review trigger.
Then ask your agent:
Read open_questions.md.
Tell me which current decision is being treated as settled even though the record says it is still unresolved.
Tell me which open question is productive uncertainty, and which one is avoidance.
Do not resolve a question unless the missing evidence is present.
If the agent slows you down in the right place, the file is working.
Correction memory protects you from repeating what failed. Unresolved memory protects you from killing what has not been understood yet.
If you have not built the first layer yet, start with correction memory. Once your system can preserve where you were wrong, add unresolved memory to preserve what should not be settled yet.
This is the second layer of the correction-memory framework: preserve the state of knowledge at the time of reasoning, including what was known, what was inferred, what was contested, and what evidence was still missing.
Top comments (2)
Nice writeup. One thing I'd add: Why AI Memory Resolves Too Much — And What to Preserve Inste can be tricky when you scale, but the core insight here is solid. Thanks for sharing the details.
Appreciate it. I agree, scale is the hard part. A single open_questions.md works for one operator, but once you have lots of memories or multiple agents, unresolved items need metadata, TTLs, ownership, and retrieval filters or they turn into context pollution. That's probably the next layer I need to pressure-test more.