We had a system that passed every internal test and still failed in production.
Not catastrophically. Worse — subtly.
The assistant would recall details correctly for a few turns, then drift. It would contradict earlier answers, forget constraints, or re-ask questions it had already resolved. No crashes, no errors, just a slow erosion of coherence.
At small scale, it looked like user inconsistency. At larger scale, it became obvious: the system wasn’t stateful in any meaningful way.
Why This Problem Exists
Most AI systems today are stateless by design. Every request is treated as an isolated prompt, with “memory” simulated by stuffing previous messages into the context window.
That works until it doesn’t.
1. Context Windows Are Finite
Even with large windows, you're always trading off:
- More history → less room for reasoning
- Less history → loss of continuity
Eventually, something gets dropped.
2. Token Relevance ≠ Semantic Importance
Truncation strategies (last N messages, token limits, etc.) assume recency equals importance.
In reality:
- A constraint from 10 messages ago might be critical
- The last 3 messages might be irrelevant chatter
The model has no inherent way to distinguish that.
3. LLMs Don’t “Track State”
They infer state from text, not from structured memory. That means:
- No guarantees of consistency
- No persistent grounding
- No notion of truth beyond the current prompt
So “memory” becomes a probabilistic reconstruction, not an actual system capability.
What I Tried (And Why It Broke)
1. Naive Message Buffering
```python
context = last_n_messages(conversation, n=10)
```
Problem:
- Lost long-term constraints
- Reintroduced previously resolved ambiguity
Increasing n just delayed failure.
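The failure mode is easy to reproduce. Here is a minimal sketch, where `last_n_messages` is a hypothetical helper mirroring the snippet above:

```python
from collections import deque

def last_n_messages(conversation, n=10):
    """Keep only the n most recent messages; everything older is dropped."""
    return list(deque(conversation, maxlen=n))

# A hard constraint stated early in the conversation...
conversation = [{"role": "user", "content": "Hard constraint: no external APIs"}]
# ...followed by enough chatter to push it out of the window.
conversation += [{"role": "user", "content": f"message {i}"} for i in range(12)]

context = last_n_messages(conversation, n=10)
# The constraint is gone -- the model never sees it again.
assert all("no external APIs" not in m["content"] for m in context)
```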
2. Summarization Layers
```python
summary = summarize(conversation_history)
context = summary + recent_messages
```
Why it seemed promising:
- Compresses history
- Keeps token usage predictable
Where it failed:
- Summaries drifted over time
- Important details were abstracted away
- Errors compounded silently
Once a summary misrepresented something, everything downstream inherited that mistake.
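The compounding can be sketched with a deliberately naive summarizer standing in for the LLM call. Real summarizers are lossy in less predictable ways, but the same dynamic applies: each periodic re-summarization compresses the previous summary, and losses accumulate silently.

```python
def summarize(text, max_len=60):
    # Stand-in for an LLM summarization call: naive truncation.
    # Real summaries lose detail less predictably, but they still lose it.
    return text[:max_len]

history = ("User requires low latency. "
           "User forbids external APIs. "
           "User prefers FastAPI.")

summary = history
for _ in range(3):
    # Each pass re-summarizes the previous summary, not the original history.
    summary = summarize(summary, max_len=len(summary) * 2 // 3)

# After a few passes the "no external APIs" constraint has been
# summarized away, and nothing downstream knows it ever existed.
assert "APIs" not in summary
```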
3. Vector Store Retrieval
```python
relevant_chunks = vector_db.search(query_embedding)
context = relevant_chunks + recent_messages
```
Better, but not reliable:
- Retrieval depended heavily on query phrasing
- Missed implicit context (e.g., constraints not mentioned in the query)
- Retrieved similar information, not necessarily correct or current
Also, relevance scoring doesn’t understand temporal validity.
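The phrasing sensitivity can be sketched with a toy bag-of-words similarity standing in for learned embeddings; the `search` function and example chunks are illustrative, not a real vector store. The stale preference shares vocabulary with the query and wins, while the current fact, phrased differently, is never retrieved.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; real encoders are better,
    # but the phrasing sensitivity carries over.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "user prefers the django framework",  # stale preference
    "switched to fastapi last week",      # current, but phrased differently
    "latency budget is strict",
]

def search(query, k=1):
    scored = sorted(chunks, key=lambda c: cosine(embed(query), embed(c)), reverse=True)
    return scored[:k]

# Top hit is the stale chunk; nothing in the score says it was superseded.
top = search("which framework does the user prefer?")[0]
```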
4. Hybrid (Buffer + Summary + Retrieval)
At this point we had:
- Recent messages
- Periodic summaries
- Retrieved context
It looked robust.
It wasn’t.
Failure mode: conflicting context sources
Example:
- Summary says user prefers A
- Retrieval pulls older chunk saying B
- Recent message implies C
The model tries to reconcile all three — and often picks the wrong one.
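A minimal sketch makes the problem concrete: once the sources are concatenated into one prompt, all three claims arrive with equal standing, and nothing tells the model which is authoritative or most recent.

```python
# Three context sources, each asserting a different framework preference.
summary = "User prefers FastAPI."                # from the summarizer
retrieved = "User said they prefer Flask."       # stale chunk from the vector store
recent = "Actually, let's try Django for this."  # latest message

# Naive assembly flattens them into undifferentiated text.
prompt = "\n".join([summary, retrieved, recent])

# All three conflicting values are present in the final prompt.
conflicts = [f for f in ("FastAPI", "Flask", "Django") if f in prompt]
```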
The Turning Point
The key realization was uncomfortable:
We were treating memory as a compression problem, not a state management problem.
Everything we built assumed:
- Context is just text
- Memory is just more text
But real systems don’t work like that.
State needs:
- Structure
- Versioning
- Conflict resolution
- Explicit updates
We weren’t building memory. We were building increasingly complex prompt assembly.
A Better Way (Still Imperfect)
We shifted toward a more explicit memory model.
1. Structured Memory Slots
```json
{
  "user_preferences": {
    "language": "python",
    "framework": "fastapi"
  },
  "constraints": [
    "low latency",
    "no external APIs"
  ]
}
```
This forced:
- Clear updates
- No silent drift
- Easier validation
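A minimal sketch of what "clear updates, no silent drift" might look like in Python. The `Memory` class and its slot names are illustrative, not the actual implementation: the point is that every write goes through one method, so an unknown slot fails loudly instead of drifting.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    # Hypothetical slot layout mirroring the JSON above.
    user_preferences: dict = field(default_factory=lambda: {
        "language": "python",
        "framework": "fastapi",
    })
    constraints: list = field(default_factory=lambda: [
        "low latency",
        "no external APIs",
    ])

    ALLOWED_PREF_KEYS = {"language", "framework"}

    def set_preference(self, key, value):
        # All updates funnel through here: unknown slots raise
        # instead of silently creating new state.
        if key not in self.ALLOWED_PREF_KEYS:
            raise KeyError(f"unknown preference slot: {key}")
        self.user_preferences[key] = value
```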
2. Event-Based Updates
- Extract events from conversations
- Update only affected fields
```python
if event == "user_changed_preference":
    memory["user_preferences"]["framework"] = "django"
```
This reduced cascading errors.
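A sketch of the update step, assuming events have already been extracted from the conversation as simple tuples; the event shapes and the `apply_event` helper are hypothetical. Only the field named by the event changes, which is what limits cascading errors.

```python
def apply_event(memory, event):
    """Apply one extracted event, touching only the affected field."""
    kind, key, value = event
    if kind == "preference_changed":
        memory["user_preferences"][key] = value
    elif kind == "constraint_added":
        if value not in memory["constraints"]:
            memory["constraints"].append(value)
    return memory

memory = {
    "user_preferences": {"language": "python", "framework": "fastapi"},
    "constraints": ["low latency"],
}

apply_event(memory, ("preference_changed", "framework", "django"))
apply_event(memory, ("constraint_added", None, "no external APIs"))
```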
3. Context Assembly with Priority
Not all memory is equal.
We introduced tiers:
- Hard constraints (must include)
- Active task context
- Relevant historical memory
- Recent messages
Instead of:

```python
context = everything_we_have
```

We moved to:

```python
context = prioritize(memory, task, recency)
```
Still heuristic-driven, but far more stable.
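A minimal sketch of tiered assembly under a budget; word count stands in for real token counting, and the tier numbers and `prioritize` signature are illustrative. Hard constraints always make it in, then lower tiers fill whatever budget remains.

```python
def prioritize(memory_items, budget):
    """Greedy assembly: fill the context budget tier by tier.

    memory_items: list of (tier, text), lower tier = higher priority.
    budget: rough token budget, approximated here by word count.
    """
    context, used = [], 0
    for tier, text in sorted(memory_items, key=lambda item: item[0]):
        cost = len(text.split())
        if tier == 0:
            # Hard constraints are always included, even over budget.
            context.append(text)
            used += cost
        elif used + cost <= budget:
            context.append(text)
            used += cost
    return context

items = [
    (0, "constraint: no external APIs"),
    (1, "active task: add caching layer"),
    (2, "history: user once asked about Flask"),
    (3, "recent: what about redis?"),
]

# With a tight budget, low-value history is dropped before
# constraints, task context, or the recent message.
context = prioritize(items, budget=13)
```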
4. Explicit Conflict Handling
- Detect conflicts before prompt assembly
- Resolve or annotate them
Note: Previous preference was Flask, latest input indicates FastAPI.
This improved consistency more than expected.
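A sketch of the annotation approach: detect the disagreement before prompt assembly and state both values explicitly, rather than letting the model silently pick a winner. The `detect_conflict` helper is hypothetical.

```python
def detect_conflict(memory_value, incoming_value, field):
    """Return an annotation when sources disagree, else None.

    The annotation names both values and which one is newer, so
    resolution is explicit instead of delegated to the model.
    """
    if memory_value is not None and memory_value != incoming_value:
        return (f"Note: previous {field} was {memory_value}, "
                f"latest input indicates {incoming_value}.")
    return None

note = detect_conflict("Flask", "FastAPI", "framework preference")
```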
Where This Still Breaks
Even with structured memory:
- Extraction is imperfect (LLMs still interpret events)
- Some context is inherently ambiguous
- Over-structuring reduces flexibility
- You still hit token limits eventually
And importantly:
- You’re now maintaining a state system, not just calling an API
During this phase, I also experimented with a memory-focused system like Memwyre (https://memwyre.tech) to explore alternative abstractions around persistence and retrieval.
Takeaways
- Context windows are not memory systems — they’re buffers
- Summarization introduces silent corruption over time
- Retrieval helps, but doesn’t solve state consistency
- If your system needs memory, model it explicitly
- Prioritization matters more than volume
- Conflict resolution should not be left to the model
Most importantly:
If your AI behaves inconsistently, it’s probably not a model issue.
It’s a state management problem wearing a prompt engineering disguise.