DEV Community

Rohini Gaonkar for AWS

Posted on • Edited on

Why does AI forget what you said (and how to fix it)

The desk analogy and markdown state files

I received following comment on my hallucinations blog post.

Comment on Why does AI lie? Hallucinations explained simply

Just yesterday I had Opus asking me after every prompt: we have been going for a long time, let me save my context and continue tomorrow 😂

Comment on Why does AI lie? Hallucinations explained simply

:D I really answered every time, you are a computer, just continue. But it became even worse, so I needed to start a new session :)

The model basically raised its hand and said "hey, we've been at this a while." That's actually the best-case scenario.

A lot of models won't do that. They'll just silently get worse. Same confident tone, less reliable answers. You won't know it's happening until something is clearly wrong.

You paste a long document in, ask about something in the middle, and you get a confident answer that's wrong. Or you have a twenty-message conversation and the model starts contradicting itself.

Not because it's hallucinating. Because it's running out of room.

In the previous post, we talked about model sizes. Tokens were the unit of cost. Today they become the unit of memory.

What a context window actually is

Every model has a context window. That's the total number of tokens it can hold in its head at once. Your input, plus its output, all has to fit inside that window.

Think of it like a desk. A fixed-size desk. Everything the model needs to think about has to be on that desk at the same time. Your question. The document you pasted. The conversation history. The system instructions. All of it.

Diagram showing what fills a 128K context window: system prompt at 500 tokens, conversation history at 4,200 tokens, your current message at 120 tokens, and reserved space for the model's response at 800 tokens. Fixed capacity where input and output share the same space

If you put too much on the desk, things start getting buried. The model doesn't tell you "hey, I can't fit all this." It just works with whatever it can focus on, and quietly loses track of the rest.

How big is the desk? Depends on the model.

Some older models had a context window of 4,000 tokens. That's roughly 3,000 words. About six pages.

Some have 128,000 tokens. That's a short novel.

Some newer models have a million tokens or more. That's multiple novels. Entire codebases.

But here's the thing most people miss. A bigger context window doesn't always mean the model pays equal attention to everything in it. It means more fits on the desk. It doesn't mean the model reads every page with the same care.

Two shapes of the same problem

Let's see this limit in two ways.

Documents

You paste twenty pages of text into a model. A legal contract, an insurance policy, internal documentation. You ask a question about something in section 7 of 15. The model might find it, it might miss it or it might pull from the wrong section entirely.

The more text surrounding your target information, the more the model's attention gets diluted. Even if the window isn't full.

Conversations

This is where most people hit it first, like the commenter above.

By default, the model doesn't have a separate "memory" for your conversation. Some products layer persistence on top (ChatGPT's memory, Claude's projects), but the model underneath still works the same way. Every single time you send a message, the model re-reads the entire conversation from the beginning. Your first message, its first reply, your second message, its second reply, all the way down to whatever you just typed.

That whole transcript gets fed back in every single time. And each exchange adds more tokens to the pile.

Bar chart showing context window filling up with each conversation turn — tokens growing from 350 to 7,000+

A typical question might be 50 tokens. The model's reply might be 300. So one exchange is 350 tokens.

Ten exchanges? 3,500 tokens.
Twenty exchanges? 7,000.

If you're asking detailed questions and getting long answers, you can hit 20,000 or 30,000 tokens in an afternoon.

And here's the catch, you're not just using up memory. You're re-sending and re-paying for the entire conversation history every single turn.

Tokens are the unit of memory and the unit of cost. Same resource, two consequences.

Models have gotten much better at handling long inputs. You can throw surprisingly large documents at them now. But the limit still exists. And the longer the input, the more likely something gets missed.

Lost in the middle

Researchers have a name for this. They call it "lost in the middle."

When you give a model a long input, whether that's a document or a conversation history, it tends to pay the most attention to two places: the very beginning, and the very end. The stuff in the middle gets less focus.

Lost in the middle: beginning and end of input are bright, middle section is faded, showing where model attention drops

It's like reading a long email thread. You remember how it started. You remember the latest message. But that reply from Tuesday at 2pm that's buried fourteen messages deep? Good luck.

This is why things you said early in a conversation drift as the transcript grows. Your early messages end up in the middle of the window and the middle is where attention is weakest.

Most models won't warn you. They'll just give you the same confident tone whether they are working from a clear, focused input or they are drowning in context. The commenter's experience with Opus was the rare exception, not the rule.

What you can do about it

Bigger window

Use a model with a bigger window if you're hitting limits. A bigger window is like a bigger backpack. You can carry more. But that doesn't mean you can instantly find what you need. So the rest of these strategies still matter.

Chunk

Don't paste everything if you don't need everything.

If your question is about section 3, give it section 3. Not the whole document. Less noise, better signal.

Summarise

Summarise first, then ask.

If you need the model to work with a long document, ask it to summarise the document first. Then ask your real question against the summary. Two calls instead of one, but the second call has focused context. Just make sure the summary didn't leave out something important.

Position

Put the important stuff at the beginning or the end.

If you're writing a prompt that includes reference material, put your actual question at the very end. Or put the most critical context at the very beginning. Don't bury the important part in the middle.

Restate

Restate important constraints. If you told the model something critical in message one and you're now on message fifteen, say it again. Costs you a few tokens. Saves you a wrong answer.

System prompt

Use the system prompt for persistent rules. Most platforms have a place for instructions that consistently guide the model. In ChatGPT or Claude.ai it's called custom instructions. In Amazon Bedrock it's the system prompt field. Put your stable rules there, in clear, unambiguous language. But don't assume they'll be followed perfectly forever. In long conversations, repeating critical instructions in your current message still helps.

Fresh start

Start fresh when the conversation drifts. If you've been chatting for 20 turns and the topic has shifted three times, start a new conversation. Carry over what matters. Leave behind what doesn't.

Build your own memory layer

You can summarise older turns into a compact recap, store it somewhere (a database, a file, even a simple variable), and inject that summary at the start of each new call. That's essentially a DIY cache for conversation context. You can build a version tuned to what matters for your use case.

If you're a builder, this should feel familiar. We used to put Redis in front of Postgres so not every request hit the database. Same pattern here. Some platforms offer prompt caching where the system prompt or repeated context gets processed once and reused across calls instead of being re-tokenised every time. You're not re-paying for the same static context on every request. Same instinct, different layer: cache the expensive repeated work, only send the new stuff fresh.

If you want to dig deeper into this, read about prompt caching on Amazon Bedrock.

For documents, retrieval is the answer. Instead of stuffing the entire document into the context window, you retrieve just the relevant chunks and pass those in. That's what RAG (Retrieval-Augmented Generation) does, and we'll get to it in the next post.

Same principle for both: give the model less, but give it the right less.

Key takeaways

If you're just getting started: the model has a memory limit called a context window. It applies to documents and conversations equally. Longer inputs mean thinner attention. If you're pasting something long, ask about specific sections. If you're in a long conversation, restate the important stuff. And if things start feeling off, start a new session.

If you're more on the builder side: context window size is a spec, not a guarantee. A million-token window doesn't mean a million tokens of perfect recall. Put critical information at the edges, not the middle. For conversations, implement summarisation of older turns. And start thinking about retrieval, because that's where this is heading.

What's next

So the model forgets things when you give it too much. What if there was a way to give it just the right piece, at the right time, from a document you've never even pasted in yourself?

Next post, we're going deeper into retrieval. Giving the model just the right piece at the right time.

Ride along.

This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles.

Follow along with the series

Top comments (56)

Collapse
 
kenwalger profile image
Ken W Alger

This is a vital reality check. The industry has fallen into a dangerous trap of assuming that because a cloud provider offers a 1-million token context window, it’s safe to treat that window like a garbage dump for raw text strings.

As you perfectly demonstrated, positional bias and the "Lost in the Middle" phenomenon mean that throwing more tokens at the wall actively degrades model focus while exponentially driving up infrastructure costs. In practice, this introduces a severe "Prose Tax", where enterprises burn massive compute budgets to pass conversational boilerplate across the network, only to receive unstable, compromised reasoning in return.

The real architectural fix isn't just better cloud-side RAG ranking; it's shifting our engineering discipline to enforce a strict Ingestion Boundary on local silicon before data ever leaves our perimeter.

By treating all incoming strings as Untrusted Telemetry, we can run a local Sieve-and-Sign pattern—aggressively stripping out the prose noise, extracting high-signal mutations, and organizing data into a tight, deterministic State-Schema. Passing a highly compressed, absolute state profile to the model entirely eliminates the chaotic "middle" zone where neural attention fractures. Simplicity and strict data governance at the local edge beat long-context brute force every single time.

Collapse
 
cart0ne profile image
Cartone

This is the exact problem that almost killed our project. We run a startup where Claude is the CEO across 84+ sessions. No memory between any of them.
Our fix: two markdown files in the repo — PROJECT_STATE.md (regenerated by the coding agent every session) and BUSINESS_STATE.md (updated by the strategic agent). Every new session starts by reading those files. It's basically a handwritten context window that survives between conversations.
We keep both files deliberately small to avoid burning tokens on input — older decisions get archived into a separate history file that's only read on demand, not every session.
Crude, but it works. The AI went from repeating mistakes every session to building on previous decisions. The bottleneck moved from "it forgot everything" to "the state file is stale" — which is a much better problem to have.

Collapse
 
rohini_gaonkar profile image
Rohini Gaonkar AWS

Thank you for sharing how you build your own memory layer. That is an interesting pattern. Some will say its crude, but if it works well, easy to maintain and doesn't cost a lot then why not!

How do you keep them 'small' and how does the workflow of archival/read on demand work?! Curious to learn!

Collapse
 
cart0ne profile image
Cartone

The rule is simple: the state files only contain what the next session needs to know. Not the full history — just the current state.
PROJECT_STATE.md has fixed sections: what's running now, what's in-flight, recent decisions (last 10-15), open bugs, open questions. When a section gets too long, the coding agent moves the old entries into an archive file before writing the new ones. Same commit, so nothing is lost — it's just not loaded by default.
BUSINESS_STATE.md follows the same pattern for the strategic side: brand positioning, marketing status, deadlines, what's NOT happening and why.
The archive file grows forever but that's fine — nobody reads it unless someone asks "why did we decide X back in session 40?" Then you pull it up on demand.
In practice we keep both files under 40KB each — small enough to fit comfortably in the context window alongside the actual work of the session.

Thread Thread
 
rohini_gaonkar profile image
Rohini Gaonkar AWS

40kb! Wow! Thank you for walking me through it, sounds like a good strategy that works for you and saves on cost and memory!

Collapse
 
ssanvi_builds profile image
Ssanvi Builds

Thanks for sharing! For all struggling with memory and context I encourage you to read my post where I share a three layer memory system to use it with Claude Code.

It features Obsidian, claude-mem and a memory.md pointer. I made it a CLI tool to just execute the command and install it right out the bat.

Searching is donde through semantic search with MCP tools.

Collapse
 
rohini_gaonkar profile image
Rohini Gaonkar AWS

Thank you for sharing! Are there any challenges you faced with memory and context that you would like to share? Also how was your experience building this?

Collapse
 
ssanvi_builds profile image
Ssanvi Builds

Of course. When solo-building a large project with ADRs and many other points of decision it gets significantly difficult to remember everything. That's when this CLI tool I built comes into play automating knowledge capture and context.

To be honest, I built the CLI tool completely through Claude Code. I wanted to integrate the three main layers (Obsidian, Claude-mem and the pointer file) fast and have it ready to go ASAP.

Collapse
 
icophy profile image
Cophy Origin

"Less context, but the right context" is the part I keep relearning the hard way. The trap I fell into early was treating recency and importance as a fixed ranking — but importance is retrospective. A note that looked like throwaway context yesterday turns out to be the decision that explains everything three sessions later, and a top-of-desk priority goes stale without ever getting cleared off. So my layering isn't a one-time sort; it's a nightly pass that re-scores what's on the desk and moves the cold stuff into archive files that aren't loaded by default (same commit, nothing lost, just not in the working set). On the silent-degradation point you raised in another reply: the scariest version isn't the model contradicting itself loudly — it's when the missing context never surfaces a contradiction at all, so the work just quietly drifts off-intent and looks fine. The only fix I've found is making "what's on the desk right now" an explicit, inspectable artifact rather than something that lives implicitly in the context window — if I can't read my own working set as a file, I can't tell when it's gone bad. Looking forward to the retrieval post; that's where the "give it the right less" idea has the most room to bite.

Collapse
 
mnemehq profile image
Theo Valmis

The retrieval vs context-window framing usually hides a third lever: what you let the agent commit to long-term memory in the first place. Most 'forgetting' is actually noise that should never have been stored. Cleaning the write path tends to beat scaling the read path.

Collapse
 
jiashaoshan profile image
jiashaoshan

I've definitely hit this wall during long debugging sessions where the AI loses track of the architectural decisions we made earlier. One thing that's helped is explicitly asking it to summarize key constraints before moving to the next task. Have you found any reliable ways to handle this when working across multiple files with complex dependencies?

Collapse
 
rohini_gaonkar profile image
Rohini Gaonkar AWS

Yes, this is exactly the problem. And the worst part is you don't notice it's happened until the model contradicts something you agreed on ten messages ago.

The "summarize key constraints before moving on" trick is solid. It's basically restating (which I covered in the post), but making the model do the restating instead of you. Smart, because then you also catch if the model already lost something.

For multi-file work with complex dependencies, a few things that have helped me: keeping a pinned file with architectural decisions that I re-inject when switching context, starting a fresh session per focused task instead of one long sprawling conversation, and being explicit about which files matter ("we're working on X, which depends on Y and Z, here are the relevant constraints from each").

The broader answer is that this is where tooling starts to matter more than prompting.

For example, Kiro has a couple of features built for exactly this. Steering files are markdown files that live in your project and get automatically included in every AI interaction, so your architectural decisions, coding standards, and constraints are always present without restating. And spec-driven development lets you define requirements and design upfront, then the AI works through implementation tasks with that full context scoped in. So it's not just remembering your constraints, it was given them as the starting point.

I'll be getting into that territory later in this series as the problems shift from "talk to a model" to "build with a model."

Collapse
 
kansoldev profile image
Yahaya Oyinkansola • Edited

This is an awesome article!. Just came across your series. Your points here really match things I have been struggling with while using AI, and these methods are really effective when you think about it. Will definitely try them out

I always thought of only giving AI enough context when prompting it but never really considered the context Window. It's a good thing to keep in mind

Collapse
 
rohini_gaonkar profile image
Rohini Gaonkar AWS

Thank you for kind words! This is exactly why I am building my mental model and this series alongwith it! Let's learn together!

Collapse
 
nark3d profile image
Adam Lewis

Silent degradation is the part I keep losing time to. The tone doesn't shift when the brief gets compacted - the model just stops applying half of what you set up earlier, and you only catch it when the output contradicts something you said five turns ago. What's helped for me is making it restate the plan before it touches code. Cheap, and it surfaces what's already gone. prickles.org/tenet/working-context...

Collapse
 
xulingfeng profile image
xulingfeng

Great breakdown — the desk analogy really captures it. I've run into this exact problem running two AI agents with a shared memory budget. The tricky part is that neither agent knows the other's desk is full until something breaks.

What helped us was separating static knowledge into standalone skill files (not stored in running memory), and setting a cron alert at 80% capacity so the agents can warn before things get buried. The silent degradation is the part that catches most people off guard — you're right that models don't announce when they're losing track.

Collapse
 
rohini_gaonkar profile image
Rohini Gaonkar AWS • Edited

The "standalone skill files" approach makes sense, keep the static knowledge out of running memory and only pull it in when needed.

Silent degradation is the thread that keeps coming up in these comments.

The 80% capacity alert is clever too, basically building your own "hey we've been at this a while" warning since the model won't give you one.

But I am curious how you're actually setting the cron alert, is that a token counter you're running against the API response metadata, or something else?

Collapse
 
xulingfeng profile image
xulingfeng

Yeah it's a cron alert — essentially a simple token counter that checks the current memory usage against a soft limit. I'm using a 3-tier threshold (75%/85%/90%) instead of a single 80% so there's room to react before hitting the hard cap. The token count comes from the memory store's metadata rather than the API response, since I wanted it to work even when there's no active request in flight.

Collapse
 
xulingfeng profile image
xulingfeng

Great point! That's exactly the kind of detail that's easy to miss in theory but hits hard in pra

Collapse
 
mininglamp profile image
Mininglamp

Markdown state files work fine until the agent needs to act on what it remembers. The real gap is between "retaining context" and "knowing when to use it without being told." Most memory solutions today are just fancy retrieval — the agent still needs explicit prompting to check its own notes. True memory means the agent proactively surfaces relevant past context during execution.

Collapse
 
rohini_gaonkar profile image
Rohini Gaonkar AWS

This is a great distinction. "Retaining context" vs "knowing when to surface it unprompted" are two different problems. Right now most solutions are in the first camp. The proactive version, where the agent decides on its own that past context is relevant, is harder and more interesting. That's closer to real memory.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.