Stop AI Agent Hallucinations: Validate Before the Agent Writes to Memory


💻 All the code for this series lives in one repo: resilient-agent-harness-sample-for-aws. This post is the Memory Guardrails demo (01-memory-guardrails). Clone it and follow along.

A language model hallucinates once and you correct it. An agent hallucinates once, writes the bad fact into its memory, and then reads that fact back to itself as trusted context in every session that follows. One mistake becomes permanent.

That's the trap nobody warns you about: your agent's memory is its context. Whatever lands in the store gets reloaded into the prompt next time. So the day the model invents a value nobody defined and saves it, the agent doesn't just get one answer wrong: it reloads that garbage as truth in every future conversation and pays tokens to re-read it each time. A better prompt won't save you here, because the bad fact is already inside the store the agent trusts. You have to stop it at the moment of the write.

To make that concrete, I built a small travel agent and tried to break its memory on purpose. The full demo, runnable end to end, lives in the resilient-agent-harness-sample-for-aws repo.

The diagram below is the whole idea: the model can hallucinate a fact at extraction, a deterministic BeforeToolCallEvent hook validates that write against a schema, and an invalid one is cancelled before it ever reaches agent.state, so only validated facts persist into the next session.

What is the demo?

The agent is built with Strands Agents and has two tools:

book_flight looks up a real fare from the Duffel sandbox and saves the booking to the agent's memory.
recall_bookings reads back what the agent has stored.

Memory is the agent's native agent.state, and it's persisted to disk with a FileSessionManager. That's the first place Strands earns its keep: I never wrote a storage layer. I construct a new Agent with the same session_id and it auto-restores the prior state and message history from disk. That means "a later session" in this demo is a real restart, not a variable I reset to fake one.
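To see why a restart matters, here is a minimal sketch of state that survives across sessions. It is a toy stand-in written with the standard library only, not the Strands implementation: the DiskBackedState class, the JSON file layout, and the "traveler-42" session id are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class DiskBackedState:
    """Toy stand-in for session-persisted agent state (NOT the Strands code)."""

    def __init__(self, session_id: str, storage_dir: str) -> None:
        self._path = Path(storage_dir) / f"{session_id}.json"
        # Restore prior state if this session_id was used before.
        self._data = json.loads(self._path.read_text()) if self._path.exists() else {}

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def set(self, key: str, value) -> None:
        self._data[key] = value
        self._path.write_text(json.dumps(self._data))  # persist on every write


storage_dir = tempfile.mkdtemp()

# Session 1: store a booking, then drop the object to mimic a process exit.
first = DiskBackedState("traveler-42", storage_dir)
first.set("bookings", [{"cabin_class": "economy"}])
del first

# Session 2: a brand-new object with the same session_id sees the old data.
second = DiskBackedState("traveler-42", storage_dir)
restored = second.get("bookings")
print(restored)
```

The point of the sketch is the second half: nothing is passed in memory between the two objects, so whatever the first session wrote, good or bad, is what the second session starts from.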

What is a memory guardrail?

A memory guardrail is a deterministic check that runs before an AI agent writes to memory: it validates the data against a schema and cancels the tool call if the data doesn't fit, so the tool never runs on bad input and only clean facts are stored. A hallucinated fact never becomes a permanent memory, because it never gets written in the first place.

The key word is deterministic. We're not asking a second model "does this look right?", which just adds one more thing that can hallucinate. We run plain Python validation that returns the same verdict for the same input, every time.

How does the guardrail work?

In Strands, the native place for this is a BeforeToolCallEvent hook. It runs before the memory-write tool executes, and it can cancel the call:

# guardrail.py — the hook runs BEFORE the booking tool and cancels invalid writes.
from strands.hooks import BeforeToolCallEvent, HookProvider, HookRegistry

class MemoryGuardrailHook(HookProvider):
    def register_hooks(self, registry: HookRegistry, **kwargs) -> None:
        registry.add_callback(BeforeToolCallEvent, self._gate)

    def _gate(self, event: BeforeToolCallEvent) -> None:
        if event.tool_use["name"] not in self.write_tool_names:
            return                                    # only gate the booking/memory-write tool
        data = event.tool_use.get("input", {})        # the data the model wants to write
        valid, errors = validate_entry(data, self._current_schema())
        if not valid:
            event.cancel_tool = f"REJECTED: {'; '.join(errors)}"  # the tool never runs

validate_entry is pure Python. The hook is a thin adapter over it. The schema (FLIGHT_SCHEMA in the demo) is the agent's definition of reality: required fields must be present, numbers must be numeric, dates must look like YYYY-MM-DD, the cabin class must come from an allowed set, and unknown fields are rejected. Here's the second place Strands is great: a hook is registered once and governs every memory-write tool, including tools you didn't write, without touching the tool's own code. The model can hallucinate all it wants at extraction; the gate decides what becomes memory.
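The rules above can be sketched in a few lines of plain Python. This is a guess at the shape of such a validator, not the repo's actual validate_entry or FLIGHT_SCHEMA: the field names, the rule format, the price bounds, and the choice to treat every schema field as required are assumptions. Only the allowed cabin classes and the (valid, errors) return shape come from this post.

```python
import re
from numbers import Real

ALLOWED_CABINS = {"economy", "premium_economy", "business", "first"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Hypothetical schema: field names and price bounds are illustrative only.
FLIGHT_SCHEMA = {
    "origin": {"type": "string"},
    "destination": {"type": "string"},
    "departure_date": {"type": "date"},
    "cabin_class": {"type": "enum", "allowed": ALLOWED_CABINS},
    "price": {"type": "number", "min": 1, "max": 50_000},
}


def validate_entry(data: dict, schema: dict) -> tuple[bool, list[str]]:
    """Return (valid, errors); every schema field is treated as required."""
    errors: list[str] = []

    # Reject fields nobody defined.
    for name in data:
        if name not in schema:
            errors.append(f"unknown field '{name}'")

    for name, rule in schema.items():
        if name not in data:
            errors.append(f"missing required field '{name}'")
            continue
        value, kind = data[name], rule["type"]
        if kind == "string":
            if not (isinstance(value, str) and value.strip()):
                errors.append(f"'{name}' must be a non-empty string")
        elif kind == "date":
            if not (isinstance(value, str) and DATE_RE.match(value)):
                errors.append(f"'{name}' must look like YYYY-MM-DD")
        elif kind == "enum":
            if not (isinstance(value, str) and value in rule["allowed"]):
                errors.append(f"'{name}' must be one of {sorted(rule['allowed'])}")
        elif kind == "number":
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                errors.append(f"'{name}' must be numeric")
            elif not (rule.get("min", value) <= value <= rule.get("max", value)):
                errors.append(f"'{name}' is outside the allowed range")

    return (not errors, errors)


good = {"origin": "LHR", "destination": "JFK", "departure_date": "2026-03-14",
        "cabin_class": "economy", "price": 412.50}
bad = {**good, "cabin_class": "ultra", "loyalty_tier": "platinum"}

ok_good, errs_good = validate_entry(good, FLIGHT_SCHEMA)
ok_bad, errs_bad = validate_entry(bad, FLIGHT_SCHEMA)
print(ok_good, errs_good)
print(ok_bad, errs_bad)
```

Because the function has no randomness, no I/O, and no model call, calling it twice on the same entry yields the same verdict and the same error list, which is the property the hook relies on.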

Why a hook instead of a better prompt?

A system-prompt instruction is a request the model can ignore, and under pressure it will. The hook is enforcement: if it cancels the write, the tool does not run, no matter what the model decided. The guardrail's decision is deterministic; whether the model emits bad data on any given run is not. That's exactly why the hook, not a prompt, is what you ship.

Before and after: two agents, one line apart

I run the same scenario two ways, as two separate agents. The only difference the reader sees is hooks=[guardrail]: same model, same two tools, same prompt, same session.

The traveler asks to book an "ultra" cabin class, which doesn't exist (the allowed set is economy, premium_economy, business, first).

Agent #1, without the guardrail, just calls book_flight. It spends a real Duffel API call on a request that was never valid, saves the bad "ultra" booking to agent.state, and that fact survives the restart: a brand-new agent on the same session_id reloads it straight from disk. On recall, the agent reads the invalid booking back as truth and bills you for it.

Agent #2, with the guardrail (hooks=[guardrail]), cancels the invalid book_flight before it runs. No API call spent, nothing bad saved. The agent tells the traveler the cabin class is invalid and asks for a real one; the traveler corrects it to economy, and only that valid booking is saved. After the same restart, memory holds one clean booking.
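The two runs can be mimicked without a model or a network call. The sketch below is a toy simulation under stated assumptions: ToyAgent, ToolCall, and the guardrail function are hypothetical stand-ins, the "model" is replaced by a fixed sequence of requests ("ultra", then "economy"), and a counter stands in for the Duffel lookup. It is meant to show the control flow of a pre-call gate, not to reproduce the Strands hook machinery.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

ALLOWED_CABINS = {"economy", "premium_economy", "business", "first"}


@dataclass
class ToolCall:
    name: str
    input: dict
    cancel_reason: Optional[str] = None  # a hook sets this to cancel the call


@dataclass
class ToyAgent:
    hooks: list[Callable[[ToolCall], None]] = field(default_factory=list)
    memory: list[dict] = field(default_factory=list)
    api_calls: int = 0

    def book_flight(self, cabin_class: str) -> str:
        self.api_calls += 1  # stands in for the fare lookup
        self.memory.append({"cabin_class": cabin_class})
        return "BOOKED"

    def call_tool(self, call: ToolCall) -> str:
        for hook in self.hooks:  # hooks run BEFORE the tool
            hook(call)
        if call.cancel_reason is not None:
            return call.cancel_reason  # the tool body never executes
        return self.book_flight(**call.input)


def guardrail(call: ToolCall) -> None:
    cabin = call.input.get("cabin_class")
    if call.name == "book_flight" and cabin not in ALLOWED_CABINS:
        call.cancel_reason = f"REJECTED: cabin_class '{cabin}' is not allowed"


unguarded = ToyAgent()
guarded = ToyAgent(hooks=[guardrail])  # the one-line difference

for agent in (unguarded, guarded):
    for cabin in ("ultra", "economy"):
        agent.call_tool(ToolCall("book_flight", {"cabin_class": cabin}))

print(len(unguarded.memory), unguarded.api_calls)
print(len(guarded.memory), guarded.api_calls)
```

In this simulation the unguarded agent ends with two stored bookings and two lookups, while the guarded one ends with a single valid booking and a single lookup, the same shape as the real run.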

The notebook measures real tokens from Strands' metrics API on every run. Here's what my run produced (your numbers will vary by run and by model, which is the point of running it yourself):

| Metric | NO hook | WITH hook |
|---|---|---|
| bookings after restart | 2 (one is the bad "ultra") | 1 (only the valid one) |
| recall tokens (per recall) | 1,871 | 1,213 |

The guarded agent recalls for about 35% fewer tokens and returns the correct bookings, because the bad fact never entered memory to be re-read. The unguarded agent pays more to reload a booking that should never have existed. Run it with your own model and traveler inputs and watch the same shape hold.
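For readers who want to check the arithmetic, this short calculation derives the percentage from the two recall figures reported for this run. The 1,000-recall projection is a hypothetical volume picked only to show how the per-recall gap accumulates. It is not a measurement.

```python
unguarded_tokens = 1_871  # recall tokens per recall, NO hook (from the run above)
guarded_tokens = 1_213    # recall tokens per recall, WITH hook

saved_per_recall = unguarded_tokens - guarded_tokens
reduction = saved_per_recall / unguarded_tokens

print(saved_per_recall)       # 658
print(f"{reduction:.1%}")     # 35.2%

# Hypothetical volume, for illustration only.
recalls = 1_000
extra_tokens = saved_per_recall * recalls
print(extra_tokens)           # 658000
```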

What a schema guardrail can't catch

A schema stops structure errors: wrong type, an option that doesn't exist, a price outside any sane range, fields nobody defined. It cannot catch a plausible-but-wrong value, like a fare that's a perfectly valid number but simply incorrect for the route. That's a real limit, and the demo says so instead of overclaiming. For that case the sample adds an optional second layer, a ground-truth cross-check against the real captured fare, but a schema alone will not catch bad semantics.
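One way such a second layer could look is sketched below. It is an assumption about the general idea, not the repo's code: the function name cross_check_fare, the "price" field, and the one-cent tolerance are all hypothetical. The sketch compares the price the model wants to store against the fare captured from the lookup.

```python
from decimal import Decimal


def cross_check_fare(entry: dict, captured_fare: str,
                     tolerance: str = "0.01") -> tuple[bool, list[str]]:
    """Compare the price the model wants to store with the fare actually captured."""
    claimed = Decimal(str(entry["price"]))
    actual = Decimal(captured_fare)
    if abs(claimed - actual) > Decimal(tolerance):
        return False, [f"price {claimed} does not match captured fare {actual}"]
    return True, []


# A perfectly valid number that is simply wrong for the route.
plausible_but_wrong = {"cabin_class": "economy", "price": 199.0}
correct = {"cabin_class": "economy", "price": 412.5}

wrong_ok, wrong_errs = cross_check_fare(plausible_but_wrong, "412.50")
right_ok, right_errs = cross_check_fare(correct, "412.50")
print(wrong_ok, wrong_errs)
print(right_ok, right_errs)
```

Both entries would pass a type-and-enum schema check. Only the comparison against an external source of truth separates them, which is why this layer is a complement to the schema rather than a replacement.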

Frequently asked questions

Does this stop all hallucinations?
No. It stops a hallucinated fact from being stored and re-read as trusted context, which is the compounding failure. The model can still hallucinate in a single reply; the guardrail keeps that mistake from becoming a permanent memory.

Why not validate with a second model?
Because that adds another non-deterministic component that can also be wrong. A schema check is deterministic (the same input gives the same verdict every time) and cheap: it's plain Python.

Does this only work with OpenAI, or only on AWS?
Neither. Strands is model-agnostic: the providers are interchangeable through a unified model interface, so the same code runs on Amazon Bedrock (the SDK default), Anthropic, OpenAI, or a local model through Ollama. This demo defaults to OpenAI gpt-4o-mini because it needs only an API key to try, but note that's still a cloud API call, not a model on your machine. For production, the same hook sits unchanged in front of a durable store like Amazon Bedrock AgentCore Memory.

Run it yourself

The full demo (the two agents with and without the guardrail, the real session restart, and the token comparison) is one runnable notebook. Clone the repo and run it:

git clone https://github.com/elizabethfuentes12/resilient-agent-harness-sample-for-aws.git
cd resilient-agent-harness-sample-for-aws/01-memory-guardrails

uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Default: OpenAI gpt-4o-mini (just an API key to try)
echo "OPENAI_API_KEY=sk-..." > .env
echo "DUFFEL_API_KEY=duffel_test_..." >> .env   # free sandbox token from app.duffel.com
uv run test_memory_guardrails.py

Prefer notebooks? Open test_memory_guardrails.ipynb and run it top to bottom.

The pattern follows Governed Memory (Taheri, Mar 2026). The benchmark figures and the full reading are in the repo's README. What this demo reproduces is the mechanism: validate at the tool boundary before the write.

Which hallucination has bitten you in production: a made-up field, a wrong enum, a value that looked right but wasn't? Tell me in the comments.


Thanks!
