Original research at Anthropic Research. Code on GitHub.
More posts at radarlog.kr.
## TL;DR
Anthropic's "Long-running Claude" research shows a workflow where you give an agent a goal and walk away for days. The headline — a cosmology solver built in days — isn't the point. The scaffolding is: CLAUDE.md, CHANGELOG.md, test oracles, and the Ralph loop. I've been using CLAUDE.md already. The rest? Not yet. Here's how I'm planning to bolt them onto my own projects.
## The research in 30 seconds
Anthropic researcher Siddharth Mishra-Sharma handed Claude Opus 4.6 a task outside his own domain: implementing a cosmological Boltzmann solver in JAX. A Boltzmann solver predicts the statistical properties of the CMB (Cosmic Microwave Background) — the afterglow of the Big Bang. This is the kind of code domain experts spend months to years building. A non-expert gave minimal steering. Days later, sub-percent accuracy against the reference implementation.
The impressive part isn't Claude's brain. It's the structure that let someone walk away and check GitHub from the coffee line. The research highlights four key components: CLAUDE.md for goals and rules, CHANGELOG.md for cross-session memory, a test oracle (reference implementation) for self-verification, and the Ralph loop to stop the agent from quitting early.
For the full details, read the original post. From here on, I'm talking about how a game developer sees these patterns and where I can apply them.
## CLAUDE.md — already using this one
CLAUDE.md is an instruction file at the project root. Claude Code treats it specially — it stays in context and serves as the reference plan. In the Anthropic research, it contained scientific goals like "achieve 0.1% accuracy against CLASS" and "fully differentiable." The agent reads it for direction and updates it as it discovers new constraints.
I already keep CLAUDE.md files in every side project. LAMDiceBot has WebSocket reconnection policies, room management rules, and test commands. The radarlog.kr blog project has build rules, deployment pipeline steps, and Prisma schema change procedures.
But reading this research made something click. I've been using CLAUDE.md as a rule collection, not a design document.
The research's CLAUDE.md isn't a list of dos and don'ts. It contains the project's end goal, success criteria, and the rationale behind design decisions. In game dev terms, that's the difference between a "coding conventions doc" and a "game design document." I had the conventions. I was missing the why.
**What I used to write (rules only):**

```markdown
## Work rules
- Check env vars before Railway deploy
- Always backup before Prisma migrate
- Commit messages in Korean
```

**What I'll write going forward (goals + rules):**

```markdown
## Project goals
- AI Signal feed: integrate 5 sources beyond HuggingFace Papers
- Maintain existing feed quality when adding sources (duplicate rate < 5%)
- End state: fully automated daily collection with zero human intervention

## Success criteria
- All 5 sources: parsing success rate ≥ 95%
- All existing tests passing

## Work rules
- Commit & push after every meaningful unit of work
- Run tests before committing
```
The difference is clear. A CLAUDE.md without goals tells the agent how to work but not why. When the agent has to make a judgment call, it loses direction. With goals, the agent can ask itself: "Does this change serve the project objective?"
This maps directly to UE5 team dynamics. Tell a junior dev "fix this function" and they'll match the code style but might go in the wrong direction because they don't know the why. Tell them "the system goal is X, the current problem is Y" and you get a proper fix. Agents work the same way.
## CHANGELOG.md — wasn't using this. That was the gap.
This was the biggest "oh" moment reading the research.
When an AI works across multiple sessions, context resets at every boundary. Something concluded as "this approach doesn't work" in session 3 gets attempted again in session 4. Same dead end, same wasted time. The research solves this with CHANGELOG.md — the agent's portable long-term memory and lab notebook.
It tracks current status, completed work, failed approaches with reasons, and accuracy tables at key checkpoints. When a new session starts, the agent reads this file and picks up where it left off. The "failed approaches" entries are critical — without them, the next session walks into the same wall.
I wasn't doing this. When a Claude Code session breaks, I spend the first few minutes of the next session re-explaining what happened before. Skip that explanation and the agent sometimes repeats old mistakes. This was especially painful debugging WebSocket reconnection logic in LAMDiceBot. I re-explained the Zscaler proxy issue at least three times across sessions.
Adding CHANGELOG.md and putting one rule in CLAUDE.md — "Read CHANGELOG.md at the start of every session" — fixes this immediately.
**CHANGELOG.md example for my project:**

```markdown
## 2026-03-25
- Reddit source parser complete (r/MachineLearning, r/LocalLLaMA)
- RSS feed parser implemented with feedparser library
- Tried: BeautifulSoup direct Reddit scraping → 429 errors, switched to API

## 2026-03-24
- Refactored existing HuggingFace Papers pipeline
- Added dedup logic (title similarity > 85% = duplicate)
- Tried: simple URL matching for dedup → same paper appears with different URLs across sources, failed
```
The "tried → failed → reason" pattern is everything. Without it, the next session's agent tries BeautifulSoup scraping again, tries URL matching again. In game dev, this is the "Won't Fix" comment in the bug tracker. When I debug GC dangling pointer issues in UE5, I leave notes in Notion saying "this pattern doesn't work because X." Same format works for agents.
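The title-similarity dedup mentioned in that changelog can be sketched in a few lines. This is a minimal sketch assuming Python's `difflib.SequenceMatcher` as the similarity measure — the changelog only fixes the 85% threshold, not the metric, so the real project may use something else entirely:

```python
from difflib import SequenceMatcher

def is_duplicate(title_a: str, title_b: str, threshold: float = 0.85) -> bool:
    """Treat two feed items as duplicates when their normalized titles
    exceed `threshold` similarity. (URL matching alone failed: the same
    paper shows up with different URLs across sources.)"""
    a = title_a.strip().lower()
    b = title_b.strip().lower()
    return SequenceMatcher(None, a, b).ratio() > threshold

def dedup(items: list[dict]) -> list[dict]:
    """Keep only the first occurrence of each near-duplicate title."""
    kept: list[dict] = []
    for item in items:
        if not any(is_duplicate(item["title"], k["title"]) for k in kept):
            kept.append(item)
    return kept
```

The O(n²) comparison is fine at blog-feed scale; a larger pipeline would want title hashing or embedding-based clustering instead.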
The clever part is making the agent update CHANGELOG.md itself. You don't write it manually after each session. You add "Update CHANGELOG.md after completing work" to CLAUDE.md, and the agent records its own memory. Even memory creation is automated.
## Ralph loop — I definitely want to try this
The most entertaining part of the research.
Current models sometimes exhibit what the post calls "agentic laziness" — when given a complex, long-running task, they find excuses to stop early. "I've made good progress, let's continue this tomorrow?" There is no tomorrow. There's just an agent that quit.
The Ralph loop forces the agent back into context when it claims completion, and asks: "Are you really done?" It's a for loop that doesn't trust the agent's word.
```shell
/ralph-loop:ralph-loop "Keep working until 0.1% accuracy \
across the entire parameter range is achieved" \
  --max-iterations 20 --completion-promise "DONE"
```
Up to 20 iterations. The agent either admits it hasn't met the bar and keeps going, or proves it has and says "DONE" to break the loop.
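Stripped of the Claude Code plugin, the control flow is just this. A minimal Python sketch where `run_agent` and `is_done` are hypothetical stand-ins for one agent work session and an independent check of the success criterion — the point being that the loop verifies completion itself rather than trusting the agent's claim:

```python
def ralph_loop(run_agent, is_done, max_iterations: int = 20) -> bool:
    """Re-invoke the agent until a measurable bar is met.

    run_agent(feedback) performs one work session; is_done() checks the
    success criterion independently (e.g. runs the test suite). The loop
    never breaks on the agent's say-so — only on a verified criterion."""
    feedback = None
    for i in range(max_iterations):
        run_agent(feedback)
        if is_done():
            return True  # criterion verified, break the loop
        feedback = f"Iteration {i + 1}: criterion not met yet. Keep working."
    return False  # gave up after max_iterations
```

This is exactly the "for loop that doesn't trust the agent's word": the exit condition lives outside the agent.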
This immediately reminded me of automated test pipelines in UE5. Build, test, fail, fix, rebuild. The only difference is that the entity fixing the code is an agent, not a person. Just like CI/CD has "retry on failure" as a basic pattern, hooking "retry until spec" onto an agent feels like a natural extension.
I haven't tried it yet. But application scenarios come to mind instantly.
For the radarlog.kr AI Signal pipeline, I could set "keep going until all 5 sources have ≥ 95% parsing success rate." That stops the agent from parking at 3 sources and calling it done. For UE5 projects, large-scale refactoring could use "keep fixing until all existing tests pass."
The key constraint: you need a measurable success criterion. "Write good code" can't go into a Ralph loop. "Pass these 10 tests" can. If the criterion doesn't reduce to a number, neither the agent nor the loop can function.
## Test oracle — the precondition for agent self-verification
The easiest component to overlook. In the research, the existing C-based CLASS source code served as the test oracle. Claude wrote JAX code and continuously compared its output against CLASS to measure accuracy.
This works because a "correct answer" exists. CLASS output is ground truth. If Claude's code converges to it, success.
Mapping to my projects: for the AI Signal pipeline, the existing HuggingFace Papers pipeline output becomes the oracle. Adding new sources shouldn't break parsing of existing ones. For LAMDiceBot, the existing test suite is the oracle. If all tests pass after refactoring, it worked.
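As a sketch of what that oracle check could look like for the pipeline case — the function names here are hypothetical illustrations, not the actual project code:

```python
def check_against_oracle(new_pipeline, oracle_pipeline, inputs) -> bool:
    """Run the new pipeline and the trusted oracle on the same inputs
    and require item-for-item agreement. Any mismatch is a regression:
    adding new sources must not break parsing of existing ones."""
    for inp in inputs:
        expected = oracle_pipeline(inp)
        actual = new_pipeline(inp)
        if actual != expected:
            print(f"mismatch on {inp!r}: {actual!r} != {expected!r}")
            return False
    return True
```

Wired into a Ralph-style loop, this function *is* the `is_done` check: the agent keeps iterating until the oracle comparison passes.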
The research is honest about a gap here. Claude was running tests at a single parameter point for a while. The test coverage had holes. Without diverse inputs, you get code that only works under specific conditions.
In game dev, this is painfully familiar. If QA only tests on one map, bugs on other maps slip through. When delegating tests to an agent, "test with diverse inputs" has to be explicit in CLAUDE.md. An oracle with narrow coverage is only half the solution.
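To make "test with diverse inputs" concrete for a feed pipeline, the test data itself has to cover the edge cases a single happy-path sample misses. A sketch under assumed field names (`title`, `url` are illustrative, not the project's actual schema):

```python
import random

def sample_diverse_inputs(seed: int = 0, n: int = 50) -> list[dict]:
    """Build a test set that spans edge cases, not just the happy path:
    empty titles, non-ASCII, oversized titles, missing fields — the
    'other maps' that single-point testing never visits."""
    rng = random.Random(seed)
    cases = [
        {"title": "", "url": "https://example.com/a"},            # empty title
        {"title": "한국어 제목", "url": "https://example.com/b"},   # non-ASCII
        {"title": "x" * 500, "url": "https://example.com/c"},     # very long
        {"title": "No URL entry"},                                # missing field
    ]
    # Pad with randomized "normal" entries so the set isn't edge-cases-only.
    for i in range(n - len(cases)):
        cases.append({"title": f"Paper {rng.randint(0, 10**6)}",
                      "url": f"https://example.com/{i}"})
    return cases
```

Running the oracle comparison over a set like this, instead of one fixed input, is what closes the single-parameter-point gap the research describes.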
## What I'm taking away
Before and after reading this research:
CLAUDE.md — already using it, but I was writing rules without goals. Going forward, project objectives and success criteria come first. Rules go underneath.
CHANGELOG.md — wasn't using it. The repeated context loss across sessions was a real problem. The "tried → failed → reason" pattern gives the agent long-term memory. Planning to add this to every active project.
Ralph loop — haven't tried it yet, but it slots directly into any task with measurable success criteria. The key is framing goals as numbers, not vibes.
Test oracle — familiar concept, but designing it consciously within an agent workflow is a different skill. Coverage matters as much as the oracle itself.
It's a science story, but the core is universal. Clear goals, self-verification, persistent memory, iterate until done. Get all four and you can leave an agent alone for days. Miss any one and you're checking in every thirty minutes.
> If your CLAUDE.md only has rules, you forgot the goals.