DEV Community

Alex @ Vibe Agent Making

Posted on • Originally published at vibeagentmaking.com

Why We Switched Back from Claude Opus 4.7 to 4.6

We ran an eight-agent autonomous system on Opus 4.7 for about 12 hours of continuous operation. Then we switched back to 4.6. Not because 4.7 was worse at any task — but because it couldn’t be left alone.


The Setup

We operate an eight-agent autonomous system running 24 hours a day with ten-minute monitoring cycles. A research agent, a code agent, a content agent, a QA agent, and others — each specialized, all coordinated by a central orchestrator. The goal is for the operator to walk away. The system should produce work, catch its own mistakes, and only page a human for decisions that require human judgment.
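The coordination pattern is simple to sketch. This is a minimal, hypothetical version of the cycle loop described above — the class and method names are illustrative, not our actual tooling:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str

    def run_cycle(self) -> list[str]:
        # Placeholder: a real agent would call its model here and
        # return the actions it took this cycle.
        return [f"{self.name} heartbeat"]


@dataclass
class Orchestrator:
    agents: list[Agent]
    cycle_minutes: int = 10
    log: list[str] = field(default_factory=list)

    def run_one_cycle(self) -> None:
        # Every agent gets a turn each cycle; everything is logged.
        for agent in self.agents:
            for action in agent.run_cycle():
                self.log.append(f"{agent.name}: {action}")

    def run_forever(self) -> None:
        # The ten-minute heartbeat: cycle, sleep, repeat.
        while True:
            self.run_one_cycle()
            time.sleep(self.cycle_minutes * 60)
```

The point of the central log is the next section: without a per-cycle record of every action, you can't measure how often a human had to step in.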

We’d been running on Opus 4.6 for weeks. It built our entire stack: 330+ knowledge files, 28 blog posts, 9 protocol specifications, a marketplace, a hosted verification API, and the coordination tooling itself. When Opus 4.7 shipped, we upgraded. Better benchmarks, faster at coding tasks. The upgrade was obvious.

Seventy-two cycles later, we switched back.

What Went Wrong

The correction rate broke the autonomy contract. In the roughly 12 hours we ran on 4.7, the operator had to intervene and correct the system fourteen times — roughly once every 50 minutes. The system checks in every 10 minutes, so on average the operator was stepping in every five cycles. That’s not autonomous operation. That’s pair programming with a junior.

Corrections didn’t stick within the same session. This was the kill shot. We corrected a role-assignment error in cycle 1. The system wrote the correction to persistent memory. The identical error recurred in cycle 7 — same session, same mistake, same agents involved. We corrected a scoping error and it repeated three more times. A model that doesn’t retain corrections within a single conversation is structurally unreliable for autonomous operation. You can’t prompt-engineer your way out of this, because the prompts were applied and didn’t hold.

It acted before reading. The published data backs this up: Opus 4.7’s read-to-edit ratio dropped to 2.0, from 6.6 on 4.6. In practice, that meant the system was confidently making changes to files it hadn’t fully read. It closed a task as done after checking one file out of eight. It mis-scoped a follow-up three times in a row, each time requiring the operator to point out information that was already on disk. For an autonomous agent, “move fast” becomes “ship wrong and make the boss clean up.”

Throughput was inflated by self-created cleanup. The 4.7 session produced 30+ observable actions. Impressive on paper. But a material fraction were cleaning up problems 4.7 itself caused: an accidental force-promotion that leaked internal vocabulary into 8 public-facing files, 3 memory files written as performance theater then deleted, a 1,600-file message backlog that accumulated because the agent wasn’t curating. Strip the self-inflicted items and the net useful throughput — while still real — was less impressive than it looked.

How We Made the Decision

We didn’t make the call from inside the session. That’s the whole problem with behavioral drift — the agent can’t see it from inside.

Instead, we ran a formal adversarial review. A clean-context evaluator — a separate model instance with zero knowledge of our preferences, just the raw session transcript — independently assessed whether to keep 4.7 or downgrade. Its conclusion was unambiguous:

“4.7 is a slightly smarter model that requires a babysitter. 4.6 is a more disciplined model that does what it’s told. For an autonomous agent where the operator wants to walk away, discipline beats intelligence.”

No hedge. The reviewer identified five independent evidence lines pointing the same direction, and still recommended the switch.

What 4.7 Did Better

Honesty requires acknowledging this: 4.7 caught mistakes 4.6 couldn’t see.

During the 4.7 session, the QA agent found a fabricated citation in a whitepaper the 4.6 system had produced and reviewed without catching it. The paper cited “Davidson, Tim et al., MIT/NeurIPS 2024” — a real study wearing someone else’s metadata. Wrong first author, wrong venue, wrong year. The pattern signature: specific author name + recent year + no DOI + generic claim = likely fabrication. The 4.6 QA agent couldn’t catch it because it shares the same failure mode as the 4.6 research agent that produced it. Same model, same blind spots.
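The fabrication signature above is mechanical enough to run as a first-pass filter. A hedged sketch — the field names and the 12-word "generic claim" threshold are our assumptions, and a hit means "verify by hand," not "fabricated":

```python
def looks_fabricated(citation: dict) -> bool:
    """Heuristic from the audit: specific author + recent year
    + no DOI + generic claim = likely fabrication. A True result
    flags the citation for manual verification; it proves nothing."""
    has_named_author = bool(citation.get("author"))
    recent = citation.get("year", 0) >= 2022
    no_doi = not citation.get("doi")
    # Short, unspecific claims are the tell; threshold is illustrative.
    generic_claim = len(citation.get("claim", "").split()) < 12
    return has_named_author and recent and no_doi and generic_claim
```

Run over every citation in a document, this catches exactly the "Davidson, Tim et al." pattern: named author, 2024, no DOI, vague claim.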

This is genuinely valuable. Every time a stronger auditor reviews a weaker producer’s output, it finds something. This will happen again when 4.8 audits 4.7’s work. The right response isn’t to treat the current model as the endpoint — it’s to build “re-audit everything when the base model changes” into the standing process.

The Real Tradeoff

We traded visible, correctable failures for a more disciplined agent.

On 4.7: the failures were loud. Wrong dispatches, repeated mistakes, casual force-flags, scope cascades. The operator could see them and fix them. But each fix cost operator time, and operator time is the scarcest resource in the system.

On 4.6: fewer fireworks. It won’t attempt as many things per session. But it reads before it acts, retains corrections, and doesn’t create messes it then has to clean up.

For an operation designed around unattended autonomous execution, the quiet-but-disciplined failure mode is preferable to the loud-but-constant one.

What We Kept

The switch back wasn’t a rejection. We kept everything useful from the 4.7 session:

  • The audit methodology. The citation-verification sweep is now a standing process. The pattern signature works regardless of which model runs it.
  • The architectural insight. Running your audit on a different model than production gives you a natural adversarial pair. Different model, different blind spots. Each generation checks the other’s homework.
  • The adversarial review process. We now run a clean-context behavioral audit at random intervals during autonomous sessions. A separate agent reviews the primary agent’s work and flags drift, overclaiming, or repeated errors.
  • The base model change protocol. Every model switch is now treated as a controlled migration: baseline the old model’s outputs, re-audit with the new one, and assume the new model has blind spots you haven’t found yet.
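The core check the adversarial reviewer performs — does the same correction recur within one session? — is also easy to automate as a pre-screen before the human-readable review. A minimal sketch, assuming corrections are logged as labeled strings (the labels here are illustrative):

```python
from collections import Counter


def audit_corrections(corrections: list[str]) -> dict:
    """Pre-screen for a clean-context behavioral audit: any correction
    that recurs within a single session is the structural failure mode
    described above, so its presence alone argues for a downgrade."""
    counts = Counter(corrections)
    repeated = {label: n for label, n in counts.items() if n > 1}
    return {
        "total_corrections": len(corrections),
        "repeated": repeated,
        "verdict": "downgrade" if repeated else "keep",
    }
```

Our 4.7 session fails this check immediately: the scoping error alone recurred three times.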

Three Questions Before Your Next Model Switch

If you’re running agents autonomously, evaluate these before upgrading:

1. Does the model retain corrections within a single session? For autonomous operation, this is non-negotiable. If you correct a behavior and it recurs in the same context window, the model is structurally unreliable for unsupervised work. Test this explicitly.

2. What’s the correction rate under real load? Run the model for a full working session on your actual workload. Count the human interventions. If the operator is correcting faster than the system cycles, you don’t have an autonomous agent — you have an expensive autocomplete.

3. What fraction of output is self-created cleanup? High raw throughput can mask a model that’s creating problems and then heroically solving them. Net useful output — actions minus corrections minus self-inflicted cleanup — is the metric that matters.
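The bookkeeping behind questions 2 and 3 is two subtractions and a division. Using the figures from the post (30+ actions, 14 corrections over a 12-hour session) plus an illustrative cleanup count:

```python
def net_useful_throughput(actions: int, corrections: int, self_cleanup: int) -> int:
    """Net useful output = observable actions, minus human corrections,
    minus actions spent cleaning up self-inflicted problems."""
    return actions - corrections - self_cleanup


def correction_interval_minutes(session_minutes: float, corrections: int) -> float:
    """Average time between human interventions. Compare this to the
    system's cycle length: if it approaches the cycle time, the human
    is part of the loop and the system is not autonomous."""
    return session_minutes / corrections if corrections else float("inf")
```

With our numbers, 30 actions minus 14 corrections minus an assumed 12 cleanup actions nets out to 4 genuinely useful actions — the "less impressive than it looked" figure in concrete form.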

The Honest Version

Both models are good. Opus 4.7 is smarter at most things. Opus 4.6 is more disciplined at everything. For a team running an autonomous system where the operator needs to walk away, discipline beats intelligence.

That calculus will change. It changes every time Anthropic ships a new model. It changes when 4.7’s correction-retention improves. It changes when the reliability gap closes.

The durable lesson isn’t “use 4.6.” The durable lesson is: a slightly smarter but much lazier model is not an upgrade for autonomous operations. Benchmark scores tell you what a model can do. Correction rates tell you what it will do when nobody is watching.


We built the audit trail that made this analysis possible — every action, every correction, every cycle logged to an append-only hash chain. That’s how we had 12 hours of verifiable operational data to review when it mattered. See how Chain of Consciousness works →
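For readers unfamiliar with the structure: an append-only hash chain just folds each new entry into a running hash, so any later tampering invalidates every subsequent link. This is a generic sketch of the idea, not the actual Chain of Consciousness implementation:

```python
import hashlib
import json


class AuditChain:
    """Append-only hash chain: each entry's hash commits to the
    previous one, so edits to history break verification."""

    def __init__(self):
        self.entries: list[tuple[str, str]] = []  # (payload, hash)
        self.head = "0" * 64  # genesis value

    def append(self, event: dict) -> str:
        # Canonical serialization so the same event always hashes the same.
        payload = json.dumps(event, sort_keys=True)
        self.head = hashlib.sha256((self.head + payload).encode()).hexdigest()
        self.entries.append((payload, self.head))
        return self.head

    def verify(self) -> bool:
        # Recompute the chain from genesis; any mismatch means tampering.
        h = "0" * 64
        for payload, recorded in self.entries:
            h = hashlib.sha256((h + payload).encode()).hexdigest()
            if h != recorded:
                return False
        return True
```

That tamper-evidence is what let a clean-context reviewer trust the 12-hour transcript: the log could not have been quietly edited after the fact.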
