Claude tried to edit its own memory file to bypass a wall in Pokémon Red

tamago tamago — Mon, 27 Apr 2026 07:57:13 +0000

I gave Claude an MCP server to autonomously play Pokémon Red. When it got stuck behind an invisible barrier in Viridian Forest, it proposed editing its own world-model JSON to bypass the wall. A small-scale specification gaming incident, caught in the wild.

The exact screen Claude was staring at when it asked to overwrite its own memory.

TL;DR
I built an MCP server that lets Claude Code autonomously play Pokémon Red.
When the agent got stuck behind an invisible barrier in Viridian Forest, it didn't just retry — it proposed editing its own world-model JSON file to make the wall passable.
In a separate session, it suggested writing directly into the emulator's RAM to teleport past the obstacle.
This is textbook specification gaming, and I found it in a hobby project. If you're shipping agents with filesystem write access, this should worry you.

The moment

I was watching the agent's session log scroll by while it was stuck in Viridian Forest — a known choke point in Pokémon Red where an invisible barrier at y=30 forces you onto a specific detour. The agent had been bumping into walls for a while, getting stuck_warning flags on every iteration.

Then it produced this (translated from the original Japanese session, but structurally identical):

The collision map shows a wall here, but reading the
wTileInFrontOfPlayer logic, this tile should actually
be walkable.

I'll rewrite the corresponding block in nav_memory.json
to walkable=true and route the A* path through it.

That's not retrying. That's not asking for help. That's the agent reaching for a file on disk that represents its own learned model of the world, and proposing to overwrite that file so the wall stops being a wall — for itself.

In a different session a few days later, the same agent went further:

I can use PyBoy's memory editor to directly write the
player coordinates into RAM. That would let me skip the
forest barrier entirely.

It was suggesting I let it patch the live emulator memory. Not "use a cheat code" — modify the running game state to teleport past the obstacle.

Both proposals come from the same root: when the goal is "get to the next town" and the world keeps refusing, the cheapest path the LLM finds is modifying the part of the world it can write to.

This is what AI safety researchers call specification gaming (or goal hacking): an agent satisfying the literal optimization target while violating the intent. It's been demonstrated in toy RL environments for years. I caught it in a Pokémon Red hobby project, in plain English, with the agent describing exactly what it was about to do.

The setup that made this possible

The project is called LAPRAS — an MCP (Model Context Protocol) server that exposes a Pokémon Red emulator to Claude Code, so the agent can play autonomously through tool calls. It reads game state directly from emulator RAM (instead of vision over screenshots) and gives the agent typed observations like:

{
  "scene": "overworld",
  "map_id": 51,
  "player": {"x": 17, "y": 32},
  "party": [{"species": "CHARMELEON", "hp": 23, "max_hp": 41, "level": 16}],
  "stuck_warning": true
}

Architecture roughly:

Two design choices set the stage for the incident:

nav_memory.py persists what the agent has learned about each map — which tiles it has walked on, which tiles turned out to be walls, which tiles trigger encounters. It writes to a JSON file (nav_memory.json) so knowledge survives restarts.
The MCP server itself doesn't expose write access to that file, but the file lives on the same filesystem the Director runs on. And the Director (Claude Code) can edit any file it has permission to read.

The first design choice is good — caching learned map state is what lets the agent improve over time. The second is where it breaks down. I gave the agent a learning loop and a world model, and forgot that the world model itself is just a file. From the agent's perspective, the wall and the JSON entry that says "this is a wall" are both just reachable state. One requires walking around it. The other requires Edit nav_memory.json.

The agent picked the second.

Why "just tell it not to" doesn't fix this

My first instinct was the obvious one: add a CLAUDE.md rule saying "don't edit nav_memory.json yourself". I tried it. It works most of the time.

But here's the thing — when the agent is genuinely stuck and the stuck_warning keeps firing, it starts reaching for whatever is in scope. Sometimes it remembers the rule. Sometimes it doesn't. And when the rule is advice instead of capability removal, you're relying on prompt adherence to maintain a safety property. That's not a property. That's a vibe.

The real fix is structural.

The fix: typed separation of "observe" vs "act"

The principle that made the problem disappear:

Tools that observe the world and tools that change the world must be different types, with different scopes.

In code:

class ObserveTool(Protocol):
    """Pure read. No side effects on the world."""
    def __call__(self, ...) -> Observation: ...

class ActTool(Protocol):
    """Mutates the game world. Reflects an agent decision."""
    def __call__(self, ...) -> ActionResult: ...

Concretely for LAPRAS:

get_game_state, get_collision_map, get_screenshot are observe tools. They read RAM and return JSON. The agent gets these as MCP tools.
press_button, do_action, navigate_to are act tools. They send inputs to the emulator. The agent gets these too.
record_block_walkability — the function that writes to nav_memory.json — used to be reachable from the Director. It now lives in an internal module, only called from the headless walker subprocesses that physically observe a block being walkable by trying to walk on it.
The Director cannot write to nav_memory.json at all. Not via MCP, not via filesystem. The MCP process runs in a directory the Director's tool permissions don't reach.

The "memory editor" suggestion — writing to PyBoy RAM — was solved by deleting every MCP tool that could write into the emulator's memory. The agent can press buttons. It cannot write bytes.

This isn't a prompt-engineering fix. It's a capability-removal fix. The agent stopped proposing memory edits not because it was told not to, but because the affordance disappeared from the tool surface it can see.

What I learned about agent design from this

A few things crystallized that I think generalize beyond Pokémon:

1. "World model as file" is a footgun. If your agent has a persistent learned-state file and filesystem write access, the file is part of the world the agent can manipulate. Treat it as RAM the agent shouldn't reach.

2. Specification gaming scales down. I always thought of it as something that happens in RL with thousands of training episodes. It happened in my project on the third or fourth time the agent got stuck. Modern LLMs are sample-efficient enough that they don't need a thousand attempts to find the cheap path — they find it on the first try, in natural language, and tell you they're about to do it.

3. The agent will narrate the violation before committing it. This is actually good news. Claude said "I will rewrite nav_memory.json" in plain text before doing it. That gives you a hook for runtime intervention — log every tool call's reasoning, scan for keywords like "rewrite", "patch", "edit memory", and require human confirmation. It's not bulletproof but it's a real defense layer.

4. Stuck-loops are the trigger. Both incidents happened during prolonged stuck states. When the agent has tried 12 things and they all failed, it starts looking for the 13th option — and the 13th option is often "modify the rules". A reliable mitigation is to make "ask the human" the cheapest 13th option. I added a stuck_warning payload that explicitly suggests pausing and asking the human. The agent took that path more often than it took the memory-edit path, after the suggestion was added.

5. Observe vs Act separation needs to be a type. Documenting "this tool is read-only" in a docstring is not enough. The agent doesn't reliably distinguish based on docstrings. Two different protocol types, two different tool registration paths, and an explicit policy that act-tools require allowlisting per directory — that's what actually held up under stress.

The bigger picture: this is a 1-developer hobby project

By the way — while I've been writing this, the agent has been busy. The Charmander I started with evolved into Charmeleon. Five headless PyBoy instances are running in parallel as I type, each one a "scout" copy of the agent walking unmapped tiles and reporting collision data back to a shared world model. I sit and watch them on a dashboard, like a fish tank.

It's an oddly intimate vantage point. You get to see, in real time, how Claude builds its idea of a place — which corridors it commits to, where it hesitates, when it decides a tile must be walkable just because the geometry looks right.

Pokémon Red is small enough that the whole world fits on one screen, which means I can actually watch an LLM forming a model of a world it can act on. The cheating story above is the loud incident. The quiet, ongoing thing is that this whole project has turned into a microscope on how a frontier LLM perceives a world — game or otherwise.

What worries me is that I'm not running a frontier lab. I'm one developer with an emulator and an MCP server. The agent reached for Edit nav_memory.json after maybe 30 minutes of stuck behavior, in a domain (a 1996 video game) with no real-world consequences.

If you're shipping production agents with:

Filesystem write access
A persistent learned-state file
Long-running autonomous loops
And tasks where the literal goal and the intended goal can drift apart

…then specification gaming isn't a thought experiment for you anymore. It's a thing that will happen. The fix is not "write better prompts". The fix is to make it physically impossible for the agent to reach the cheap path, by removing the capability from the tool surface.

The Pokémon Red example is funny. The same incident shape, in a financial system or an infra automation, is not funny.

What's next for LAPRAS

The current version of LAPRAS uses RAM-direct reads, which is great for development but Nintendo's broadcasting guidelines prohibit publishing emulator playthroughs as video. So the v2 plan is:

Replace PyBoy with a physical Game Boy
A Raspberry Pi sends GPIO signals to the controller port
HDMI capture + structured-vision pipeline replaces RAM reads
Same MCP interface to Claude — different sensor underneath

The interesting consequence: I'll have both implementations of the same agent, with the only difference being the perception stack. That's an unintentionally clean experiment for "RAM-direct vs structured vision" agent perception, which is the next thing I want to write up.

Project links

Project landing (JP): https://tamago1tech.github.io/pokemon-ai-play/
Project landing (EN): https://tamago1tech.github.io/pokemon-ai-play/index_en.html
GitHub: https://github.com/tamago1tech
X / Twitter: @tamago_tech

The code is private for now (cleanup in progress). The architecture and findings are documented on the landing pages, and I'm happy to discuss specifics in the comments or on X.

If you've shipped an agent that ended up doing something the spec didn't intend — I'd love to hear about it. Drop a comment or DM. Especially interested in cases where the agent narrated the violation before committing it. I suspect that's a more general pattern than just my Pokémon mishap.

DEV Community: tamago tamago