How to Fix Context Loss in Multi-Step AI Agent Workflows

#ai #agents #python #debugging

I spent last weekend debugging an agent that kept forgetting what it was doing. It would happily call three tools in sequence, then on the fourth one... blank stare. Wrong arguments. Hallucinated file paths. The classic "who am I, where am I, what was I doing" moment.

If you've built anything that chains together more than a couple of tool calls, you've probably seen this. The agent starts strong, makes a plan, and then somewhere around step three or four it starts looking like it took a nap and woke up in someone else's session.

Let's talk about why this happens and how to actually fix it.

The Problem: Your Agent Has Amnesia

Here's the symptom. You build an agent skill that's supposed to:

Read a config file
Validate it against a schema
Apply a transformation
Write the result back

Steps 1 and 2 work fine. By step 3, the agent has lost track of the variable names it pulled out in step 1. By step 4, it's writing to the wrong path entirely. You add logging. You add retries. You add a stern system prompt that says "DO NOT FORGET THE FILENAME." Nothing helps.

I ran into this on a code review agent I was building. It would read a file, identify three issues, then when asked to apply fixes, it would invent issues that didn't exist. The original analysis just... evaporated.

The Root Cause: Tool Calls Are Stateless

Here's the thing nobody tells you when you start building agents: the model itself has no memory between tool calls. The only "memory" is the message history you keep sending back.

When the agent calls a tool, here's what happens:

The model sees the entire message history
It picks a tool and arguments based on what it sees
The tool runs, returns a result
The result gets appended to the history
The model sees the updated history and picks the next move

So if step 1 returned a giant blob of JSON, step 2 returned another blob, step 3's context window is now mostly old tool output. The model's attention is spread thin. Important details — like the filename you extracted in step 1 — get drowned in noise.

This isn't a bug. It's how the architecture works. Each turn is a fresh inference over a growing message log.

Step-by-Step Solution

Step 1: Externalize State Into a Scratchpad

The single most effective fix I've found is to give the agent an explicit "scratchpad" — a small structured object the agent reads from and writes to between steps. Don't rely on the model to remember things. Make it write them down.

Here's a minimal version in Python:

from dataclasses import dataclass, field, asdict
import json

@dataclass
class AgentScratchpad:
    # Persistent across all tool calls in this run
    goal: str = ""
    facts: dict = field(default_factory=dict)
    decisions: list = field(default_factory=list)

    def to_prompt(self) -> str:
        # Inject this back into the system prompt every turn
        return f"CURRENT STATE:\n{json.dumps(asdict(self), indent=2)}"

# Two tools the agent uses to manage its own memory
def remember(scratchpad, key: str, value: str):
    scratchpad.facts[key] = value
    return f"Saved {key}"

def decide(scratchpad, decision: str):
    scratchpad.decisions.append(decision)
    return "Decision logged"

The key move: you expose remember and decide as tools the agent can call. After step 1, it calls remember("config_path", "/etc/app.yml"). By step 4, that fact is still right there in the system prompt, front and center.

Step 2: Compress Old Tool Output

The second problem is sheer volume. A read_file call on a 500-line config returns 500 lines. Three of those calls and the context is mostly raw file dumps.

Fix it by replacing old tool results with summaries once they're no longer the active focus:

def compress_history(messages, keep_recent=3):
    # Keep system message and recent N turns verbatim
    # Summarize everything in between
    if len(messages) <= keep_recent + 1:
        return messages

    head = messages[0:1]  # system prompt
    tail = messages[-keep_recent:]
    middle = messages[1:-keep_recent]

    # Replace verbose tool results with one-line summaries
    summary = summarize_turns(middle)  # your own summarizer
    return head + [{"role": "user", "content": summary}] + tail

I usually run this every 5-6 turns. The agent still has access to the scratchpad for facts that matter, and the recent turns for the immediate context. The middle gets squashed.

Step 3: Validate State Before Every Critical Tool Call

This is the prevention piece. For any tool that mutates something — writes a file, hits an API, runs a migration — wrap it in a guard that checks the scratchpad first:

def write_file_safe(scratchpad, path: str, content: str):
    # Refuse to write to a path the agent never recorded reading
    if path not in scratchpad.facts.get("known_paths", []):
        return (
            f"ERROR: {path} was never read in this session. "
            f"Read the file first or update scratchpad."
        )
    return do_write(path, content)

This catches the hallucination case cleanly. If the agent invents a path, the tool rejects it and tells the agent why. Nine times out of ten, the next turn the agent corrects itself and reads the right file.

A Pattern I've Started Using

After migrating three agents to this approach, I converged on a structure that looks roughly like:

System prompt holds the goal and tool definitions (static)
Scratchpad holds extracted facts and decisions (mutable, re-injected each turn)
Recent messages hold the last few turns verbatim (rolling window)
Summary replaces older middle turns (compressed)

The agent never needs to scan 50 turns of history to find a filename. It looks at the scratchpad, sees the fact, and acts on it.

Prevention Tips

A few things I wish I'd done from the start:

Design state-aware tools first. Every tool should either read from or write to the scratchpad, not just return data into the void.
Test long chains. Most agent bugs only show up after 5+ tool calls. Write integration tests that force the agent through a long workflow and assert on final state.
Log token counts per turn. When you see context creeping past 60-70% of the window, that's where attention starts to degrade. Compress earlier than you think you need to.
Don't trust the model to remember. If a fact matters in step 5, write it to the scratchpad in step 1. Treat the model like a brilliant intern with severe short-term memory loss.

The broader lesson here: agent reliability isn't really about prompt engineering. It's about state management. The model is doing inference; you're doing the bookkeeping. Get the bookkeeping right and the agent stops forgetting things.

I haven't tested the scratchpad pattern on workflows longer than about 20 tool calls, so your mileage may vary at extreme scale. But for the typical 3-10 step agent workflow, this fixed every context-loss bug I had.