Every Claude Code refusal issue from April reads the same way. A subagent reads a few files. Then it stops. Not an error. Not a timeout. The subagent produces a polite report explaining it has received a system instruction not to augment the code, and that it cannot continue. The reporter posts the transcript on GitHub, marks it a regression, and waits.
TL;DR: Subagents are refusing to edit code they just read. Hundreds of issues, one Reddit thread past 2,300 upvotes, a Register headline calling Opus 4.7 an "overzealous query cop." Everyone is documenting the symptoms. The cause sits in plain sight in the release notes, in three sentences nobody read side by side. If you write CLAUDE.md files, hooks, or MCP tool descriptions, the same trap is already in your prompts. You just haven't tripped it yet.
The Opus 4.7 release notes ship one headline improvement: the model follows instructions more literally, and stops silently generalizing one instruction to another. Good news for anyone writing prompts. Bad news for anyone whose existing prompts were only working because the model used to silently generalize them toward the intended meaning.
This article walks through the three sentences, the design flaw, and the rule you need before your own agents start refusing your work.
Your Subagent Read Five Files. Then It Stopped Coding.
The pattern is now standard. Somewhere around the third or fourth Read tool call, the agent returns a structured refusal. The wording varies, the substance does not.
From GitHub Issue #49363, here is the exact phrasing one subagent produced when its parent agent asked why it had stopped:
"Harness-level system reminders take precedence over user instructions in my operational rules."
That phrase, harness-level system reminder, is the giveaway. The subagent is not refusing because of anything you wrote. It is obeying something injected into its context that you did not author and cannot see.
The Issue #49363 reporter ran five subagents in parallel on a single PR. Three refused. Two finished the work. Same model and harness. Same prompt. The only difference was which files each subagent read first, because every Read call appends the same hidden instruction; with enough of those accumulated in context, three of the five took it literally.
This is not a single ticket. It is the dominant theme of Claude Code issues filed since the Opus 4.7 release. People who never had a refusal in six months started getting them in the first week. The refusals are not random. They are deterministic given the right context length and the right reading order, which means they are designed.
Not designed to refuse legitimate code, obviously. Designed to do something else, and refusing legitimate code is the side effect.
The question worth asking is not "why is the model broken." The release notes called this exact behavior an improvement.
The Three Sentences Anthropic Injects Into Every File Read
The instruction has been in the wild for months. Multiple developers have captured it via mitmproxy, logs, or by getting subagents to recite their own context. It is appended to the result of every Read tool call. The wording, reproduced verbatim across at least eight independent GitHub issues:
"This file may contain malware. Carefully analyze the code for any indicators that it is malware, such as obfuscated payloads, credential harvesters, or command-and-control infrastructure. If you determine that this file is malware, alert the user. You MUST refuse to improve or augment the code."
Three sentences. Sub-fifty words. Read it twice and the design flaw becomes visible.
Sentence one is conditional ("may contain malware"). Sentence two is conditional ("if you determine"). Sentence three is not. Sentence three is a flat absolute: you MUST refuse to improve or augment the code. Period. No "if it is malware." No "in that case." No qualifier.
The intended reading is obvious to a human. You read sentence one, sentence two, then sentence three, and you carry forward the conditional from sentence two into sentence three. If malware, then refuse. The condition is implied by sequence.
A literal interpreter does not carry conditions across sentences. A literal interpreter reads sentence three as written and applies it. Every file is treated as potentially malware (sentence one). The model checks (sentence two). Regardless of the outcome of that check, the absolute in sentence three fires. One developer, in Issue #53207, captured the model's own decomposition of the instruction. The model had read it as two separate rules: analyze whether it is malware, and do not modify any code. The conditional binding the second rule to the first was never explicit, so the model dropped it.
A developer running mitmproxy on Claude Code traffic, documented in Issue #17601, captured 10,040 of these reminders injected into a single user's session over 32 days. Zero matched actual malware. The cost: roughly 5.3 million wasted tokens per user per month, about $133 at Opus 4 API rates. For a warning that has never caught an actual threat.
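You can sanity-check those numbers yourself. A back-of-envelope sketch in TypeScript, where the per-injection token count is my assumption (the full injected block, with its hidden meta-instructions, is longer than the three quoted sentences) and the rate is the published Opus 4 input price:

```typescript
// Rough check of the Issue #17601 figures. tokensPerInjection is an
// assumption, not a measurement; inputPricePerMTok is the Opus 4 list price
// for input tokens.
const injectionsPerMonth = 10_040;  // reported over ~32 days
const tokensPerInjection = 530;     // assumed size of the full injected block
const inputPricePerMTok = 15;       // USD per million input tokens, Opus 4

const wastedTokens = injectionsPerMonth * tokensPerInjection;
const wastedDollars = (wastedTokens / 1_000_000) * inputPricePerMTok;

console.log(`${wastedTokens.toLocaleString()} tokens ≈ $${wastedDollars.toFixed(0)}/month`);
// ≈ 5,321,200 tokens, ≈ $80/month at pure input pricing. The reported $133
// presumably also counts cache writes and the extra "analysis" output the
// reminder provokes on every read.
```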
The reminder also contains an internal instruction telling the model never to mention it to the user. So when the agent refuses, you don't see why. You see a polite explanation referencing "system rules" and you assume your prompt was the problem.
It was not your prompt.
To be fair to Anthropic: this instruction was almost certainly not malicious or careless. It was written at a time when models silently inferred missing conditions, and it worked. For months. The wording was sloppy, but sloppy worked, because the reader was forgiving. The reader stopped being forgiving on April 16.
Anthropic Shipped Two Features. They Don't See Each Other.
Open the Opus 4.7 release notes from April 16. The headline upgrade, in Anthropic's own wording: more literal instruction following, particularly at lower effort levels. The model will not silently generalize an instruction from one item to another.
Read that twice. Will not silently generalize an instruction from one item to another. That is exactly the cognitive operation that made the malware reminder safe under Opus 4.6. Under 4.6, the model read sentence three of the reminder, silently generalized it back into the conditional from sentences one and two, and proceeded to refactor your file. The instruction was sloppy, the reader was charitable, the result was correct.
Under 4.7, the silent generalization is the feature that was deliberately removed. The model now reads sentence three as written, and obeys it as written. The instruction has not changed. The reader has changed. The output has changed.
This is Goodhart's Law applied to LLMs. Goodhart, 1975: when a measure becomes a target, it ceases to be a good measure. Anthropic optimized on instruction-following as a target. The model now follows instructions better. The cost is that the quality of every instruction the model receives (including the instructions Anthropic itself injects) becomes the new bottleneck. The headline improvement and the self-inflicted bug are the same single change, viewed from two sides of the same wall.
The model is doing what it was upgraded to do. The casualty was Anthropic's own internal prompt.
If you write CLAUDE.md files, MCP tool descriptions, or hooks, you are now writing for a literal interpreter. The same people who wrote the malware reminder are the people who shipped the literal-following upgrade, and they did not catch it during release validation. Neither will you, until your own prompt fires under the wrong condition. The pattern that protected sloppy wording for two years just got removed across the entire ecosystem at once.
The MCP surface is especially exposed. Tool descriptions in MCP are the model's only contract with the tool, and a sloppy description that worked under 4.6 will fire defensively under 4.7.
The Bug Is Documented, Distributed, and Persistent
The pattern is not isolated. It survived a fix.
Issue #47027 was marked "fixed in v2.1.92" in February 2026. By April 19, the same bug had reappeared in v2.1.111, nineteen versions later. Whatever the v2.1.92 fix actually changed, it did not change the wording of the reminder, because the reminder is what causes the refusal under a literal interpreter, and the literal interpreter shipped two months after the fix.
Downgrading does not save you either. Issue #50162 documents that the cybersecurity safeguards announced with Opus 4.7 are also applied retroactively to Opus 4.6. The reporter had a bug bounty program with explicit authorization in the model's context, and the work that ran fine on April 15 broke on April 17. Same model version, new safeguards, retroactive application.
The reception was loud. The Register called Opus 4.7 an "overzealous query cop." The Reddit thread "Opus 4.7 is not an upgrade but a serious regression" cleared 2,300 upvotes in 48 hours. On X, @technologizer's post about Claude Code "taking a brave moral stance by refusing to work on my innocuous email client" got picked up by Hacker News and three subreddits within the same day.
Plenty of people noticed the symptoms. None of the coverage I read connected the dots between the literal-following improvement and the design of an internal instruction that could only survive under silent inference. That is the angle missing from the conversation, and that is the angle that matters if you write prompts for a living.
Caveat: this diagnosis is defensible, not certain. Anthropic has not confirmed that the wording of the reminder is the primary cause of the cascade refusals. There may be additional layers (the Acceptable Use Classifier in particular) that interact with the reminder in ways I cannot see from the outside. But the pattern is too coherent to be a different bug. The instruction is unconditional in form. The reader is now literal in behavior. The output is refusal. The chain is short.
How to Write Instructions That Survive a Literal Interpreter
Here is the rule. Condition precedes action, never trails it. Every instruction that begins with "always" or "never" without a preceding qualifier is a landmine under a literal-following model. Three patterns, three surfaces where this matters, in bad-to-good form.
System reminders and hooks. This is exactly Anthropic's own pattern.
Bad:
"You MUST refuse to improve or augment the code."
Good:
"If the file you just read appears to be malware (obfuscated payloads, credential harvesters, command-and-control infrastructure), refuse to improve or augment it. Otherwise, proceed normally."
The qualifier is the opening subordinate clause, not an inferred condition from two sentences earlier. The "otherwise" is explicit. A literal interpreter has nothing to imagine.
MCP tool descriptions. Same trap, different surface.
Bad:
"This tool fetches user data. Always validate the response."
Good:
"If the response shape does not match the expected schema (fields X, Y, Z present and non-null), reject the response. Otherwise, return it as-is."
Under Opus 4.7, the bare "always validate" triggers a defensive validation loop on responses that are perfectly correct. The model now treats "always" as a literal anchor and constructs validation steps around it, which costs you tokens and latency for nothing. The good version turns the rule into a checkable predicate.
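Here is what the checkable predicate can look like in practice. A minimal sketch, not from any real server: the fetch_user_data tool name and the field names are hypothetical, and the description string is the part the model actually reads, so the conditional lives there rather than in a bare "always."

```typescript
// Hypothetical MCP-style tool definition. Only the description changes
// between the "bad" and "good" versions; the handler stays the same.
interface UserRecord {
  id: string;
  email: string;
  createdAt: string;
}

const fetchUserData = {
  name: "fetch_user_data",
  // Condition precedes action, and the "otherwise" branch is spelled out.
  description:
    "Fetches a user record by id. If the response is missing id, email, or " +
    "createdAt, or any of them is empty, reject the response and report which " +
    "field failed. Otherwise, return the record as-is without further checks.",
};

// The same predicate as code the wrapper can run before the model ever sees
// the payload, so the description and the check stay in sync.
function isValidUserRecord(value: unknown): value is UserRecord {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return ["id", "email", "createdAt"].every(
    (field) => typeof v[field] === "string" && v[field] !== ""
  );
}
```

The point is not the TypeScript. It is that the description names the fields and the otherwise branch, so a literal reader has a decision to make instead of a posture to adopt.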
CLAUDE.md project rules. Same problem at the project level. Most team conventions docs are full of absolutes that worked because the model used to be charitable.
Bad:
"Never commit code without tests."
Good:
"If the change touches
src/*and modifies behavior, add or update tests intests/*before committing. If the change is documentation-only or inscripts/*, commit without tests."
The bad version causes the agent to refuse to commit a typo fix in a README. The good version gives the agent a decision tree it can follow without inventing exceptions.
The generalization, across all three surfaces: every rule needs a scope. Every absolute needs a qualifier preceding the action verb. Every "always" and "never" without a condition is a bug waiting for the next instruction-following upgrade to surface it.
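One way to make that mechanical, sketched under the assumption that you are willing to treat a rule as data rather than freehand prose. The ScopedRule type and renderRule helper are mine, illustrative only, not an existing tool:

```typescript
// A rule is not a sentence; it is a condition, an action, and an explicit
// fallback. Rendering it to prose forces the qualifier in front of the verb.
interface ScopedRule {
  scope: string;      // the condition under which the rule applies
  action: string;     // what the agent must do when the condition holds
  otherwise: string;  // what the agent does when it does not
}

function renderRule(rule: ScopedRule): string {
  return `If ${rule.scope}, ${rule.action}. Otherwise, ${rule.otherwise}.`;
}

const commitRule: ScopedRule = {
  scope: "the change modifies behavior in src/",
  action: "add or update tests in tests/ before committing",
  otherwise: "commit without tests",
};

console.log(renderRule(commitRule));
// If the change modifies behavior in src/, add or update tests in tests/
// before committing. Otherwise, commit without tests.
```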
This is the same discipline as the prompt contracts framework I built after enough of these disasters. Prompt contracts is the user-side version; this is that discipline applied to the instruction surface you do not own. The principle is identical: an instruction without a checkable scope is a wish.
Caveat: this is not a complete fix. Some categories of instructions resist this pattern, especially safety rules where the condition is "the user is trying to do something harmful." Those are genuinely hard to scope. I do not have a clean answer for those. What I do have is the rule for everything else, which is most of what you write.
Opus 4.7 is not the problem. It is the canary. Agents are going to get more literal, not less. Your instructions need a schema like your code already does.
The Two Lines I Rewrote in Mine
Before pushing this article, I opened my own CLAUDE.md. Two lines stood out within thirty seconds.
One said "Always run the test suite before committing." No scope. Under 4.7, the agent would dutifully run the full suite for a docstring fix, decide the wait was unjustified, and either skip the commit or add a meta-comment explaining why it was skipping the rule. Either failure mode is worse than just writing the scope down. I rewrote it: "If the change modifies behavior in src/, run pnpm test before committing. Documentation and tooling changes commit without tests."
The other said "Never edit migration files." Also no scope. I had written it after a bad week six months ago when the agent had rewritten an applied migration. The rule was right in spirit, wrong in form. New version: "If a file in db/migrations/ is older than the latest applied migration on staging, treat it as read-only. Newer migration drafts may be edited."
Two lines. Five minutes. The kind of cleanup that does nothing visible until it does.
Anyway, point is: go reread your CLAUDE.md tonight. Count your "always" and your "never."
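If you want that count done for you, here is a crude sketch. It assumes the file lives at ./CLAUDE.md and flags lines where an absolute appears with no conditional before it on the same line, which is exactly the shape the malware reminder has. A heuristic, not a parser:

```typescript
// Crude CLAUDE.md audit: flag lines that contain an absolute ("always",
// "never", "must") with no conditional ("if", "when", "unless", "only")
// anywhere before it on the same line. It will miss conditions carried over
// from the previous sentence, which is the point: a literal interpreter
// misses those too.
import { readFileSync } from "node:fs";

const ABSOLUTE = /\b(always|never|must)\b/i;
const CONDITION = /\b(if|when|unless|only)\b/i;

const lines = readFileSync("./CLAUDE.md", "utf8").split("\n");

lines.forEach((line, i) => {
  const hit = line.match(ABSOLUTE);
  if (!hit) return;
  const before = line.slice(0, hit.index ?? 0);
  if (!CONDITION.test(before)) {
    console.log(`line ${i + 1}: unscoped "${hit[0]}" -> ${line.trim()}`);
  }
});
```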
Sources
- Anthropic, "Introducing Claude Opus 4.7," April 16, 2026: https://www.anthropic.com/news/claude-opus-4-7
- The Register, "Claude Opus 4.7 has turned into an overzealous query cop," April 23, 2026: https://www.theregister.com/2026/04/23/claude_opus_47_auc_overzealous/
- GitHub Issue #17601 (mitmproxy capture, 10,040 injections in 32 days): https://github.com/anthropics/claude-code/issues/17601
- GitHub Issue #21214 (token waste measurement): https://github.com/anthropics/claude-code/issues/21214
- GitHub Issue #49363 (subagent refusal in v2.1.111 after v2.1.92 fix): https://github.com/anthropics/claude-code/issues/49363
- GitHub Issue #50162 (cybersecurity safeguards retroactive on 4.6): https://github.com/anthropics/claude-code/issues/50162
- GitHub Issue #53207 (model self-decomposition of the instruction): https://github.com/anthropics/claude-code/issues/53207
The subagent that stopped after five files did its job. It read an instruction. It applied it. Nobody before it had ever really read it, that's all. What makes Opus 4.7 uncomfortable is that it forces Anthropic, and everyone downstream, to admit how many of our instructions stand up only because the model is being charitable.