
Jangwook Kim

Posted on • Originally published at jangwook.net

Anthropic's April Double Release — How Opus 4.7 and Managed Agents Change Agent Development

I refreshed the Anthropic blog twice in the second week of April. Managed Agents public beta on April 8th, Claude Opus 4.7 on April 16th. In the same month, they shipped upgrades to both the "infrastructure layer" and the "model layer."

My first reaction was genuine excitement. SWE-bench Pro at 64.3% is a 10.9-point improvement over the previous version, and Managed Agents promised to take the session infrastructure I've been maintaining by hand and let Anthropic run it instead. Then I started reading the community reaction, and the picture got complicated.

What Opus 4.7 Actually Changed

Four changes were announced at the April 16 launch.

Benchmark numbers: SWE-bench Pro at 64.3% (+10.9 points from 4.6), CursorBench at 70% (+12 points). For coding agents, this is a genuine improvement.

High-resolution image support: Up to 2576px, 3.75MP, expanded from the previous 1568px/1.15MP. For UI testing automation agents or screenshot-based workflows, this is a real upgrade (a request sketch follows below).

The task_budget parameter: This is the change I'm most interested in, even though it launched in beta. You can now set a token budget for the entire agent loop. Activate it with the task-budgets-2026-03-13 header; the minimum is 20k tokens. It's advisory, not a hard cap: as token consumption approaches the budget, the model tries to finish within it rather than stopping abruptly.

xhigh effort level: A new tier above high, for tasks that benefit from deeper reasoning passes.
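On the image side, the request shape is the standard messages API image content block; only the resolution ceiling changed. Here's a minimal sketch of a screenshot-review call, reusing the model string from the task_budget example below (the filename and prompt are illustrative):

import base64

import anthropic

client = anthropic.Anthropic()

# Load a full-resolution screenshot; Opus 4.7 accepts up to 2576px / 3.75MP
# without downscaling (previously 1568px / 1.15MP)
with open("checkout_page.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Does the checkout button render correctly at this breakpoint?",
                },
            ],
        }
    ],
)
print(response.content[0].text)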

Here's how task_budget looks in practice:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    extra_headers={
        "anthropic-beta": "task-budgets-2026-03-13"
    },
    # task_budget: token budget for the entire agent loop
    # minimum 20000, advisory (not a hard cap)
    task_budget=50000,
    messages=[
        {
            "role": "user",
            "content": "Find all deprecated API calls in this repository's Python files and replace them with the current versions."
        }
    ]
)

I couldn't run this directly without an Anthropic API key; the code above is based on the official documentation and release notes. The advisory behavior is something I plan to explore alongside a Managed Agents production deployment.

What Makes Managed Agents Different

Claude Managed Agents, which moved to public beta on April 8th, is conceptually straightforward. The agent execution environment you used to manage yourself — sandboxes, session state, permission validation, long-running containers — is now operated by Anthropic's platform.

Core features per the official documentation:

  • Isolated sandbox: Bash commands, file operations, web search, and MCP server execution run in an isolated environment
  • Session state persistence: Filesystem and conversation context are preserved across tasks that run for minutes or hours
  • Credential security: API keys and secrets are handled through permission delegation, not direct agent exposure
  • Multi-agent coordination: Still in research preview, but lets multiple agents collaborate on shared workflows

Pricing is $0.08 per session-hour plus standard Claude API token costs. The docs explicitly note that idle time counts toward billed session-hours.
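No request code ships with this post, so here's a purely hypothetical sketch of what starting a managed session might look like. The client.beta.agents namespace, the sessions.create and run calls, the tool identifiers, and the credential-delegation syntax are all my guesses at a plausible shape, not documented API:

import anthropic

client = anthropic.Anthropic()

# HYPOTHETICAL API SHAPE: the namespace, methods, and fields below are
# illustrative assumptions, not documented calls
session = client.beta.agents.sessions.create(
    model="claude-opus-4-7",
    tools=["bash", "file_edit", "web_search"],  # hypothetical tool identifiers
    # Credentials are delegated by reference; the agent never sees the raw key
    credentials={"github": "delegation:github-readonly"},  # hypothetical syntax
)

# Filesystem and conversation context persist across tasks in the same session
session.run("Clone the repo and audit the CI config for flaky steps.")
session.run("Fix the flakiest test you found and open a draft PR.")

Whatever the real shape turns out to be, that session object is metered at $0.08 per hour from creation, idle time included.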

Notion, Rakuten, and Sentry are named as early production adopters. Notion reported 90% cost reduction and 85% latency improvement; Rakuten claimed 97% error reduction across 70+ business units; Sentry shipped a patch agent in "weeks." These are impressive numbers — though I'd note they compare against previously unstable self-managed infrastructure, so take them as upper-bound estimates.

The Good: Real Reduction in Agent Infrastructure Overhead

Running this blog's automation system myself, I know exactly how tedious agent session management gets. The code defending against session drops, context loss, and silent failures in long-running tasks often outweighs the actual business logic. At some point, you're spending more engineering hours on plumbing than on what the agent is supposed to do.

If Managed Agents genuinely removes that burden — and Sentry's "weeks to ship" story is real — the value proposition is clear.

The five agentic workflow patterns I've written about map naturally onto Managed Agents. Running an orchestrator-subagent structure on top of Managed Agents means the platform handles state recovery and context synchronization that you'd otherwise code yourself.

task_budget also points in the right direction. Letting the model self-prioritize within a budget tends to produce better completion rates than hard cutoffs that leave tasks in broken intermediate states.

The Bad: Benchmark-to-Practice Gap and Hidden Costs

Then the first 24 hours of post-launch community feedback arrived, and a worrying pattern emerged.

Based on developer feedback aggregated at byteiota.com, some power users described Opus 4.7 as "legendarily bad." The specific complaints converge on three things.

Safety overfitting: The threshold for detecting malicious code patterns apparently got tightened to the point where standard network calls and ordinary library usage were being refused. This works fine in a controlled benchmark environment where conservative behavior correlates with accuracy — but in real workflows, it's friction.

Literal instruction parsing: The model seems to interpret instructions more literally than 4.6, prioritizing explicit compliance over flexible reasoning. People who rely on inferring intent from context are finding it less cooperative.

Output style shift: A stronger preference for bullet-point formatting over prose. Some people prefer this; others find it harmful for creative or nuanced tasks.

The hidden cost I care about most isn't a behavior issue. It's the new tokenizer, which uses 1–1.35× more tokens for the same text. Published per-token pricing didn't change, but real costs can rise by up to 35%.

I've written before about the actual economics of running AI agents, and a tokenizer change of this magnitude means redoing cost projections from scratch. Anthropic's decision not to foreground this in the launch announcement deserves criticism.

Cost Reality: How Much Did It Actually Go Up?

Working from available numbers, here's a rough scenario comparison:

Scenario                           Opus 4.6 baseline   Opus 4.7 estimate (tokens +25%)
Simple Q&A (1K tokens)             $0.005              ~$0.006
Code review (10K tokens)           $0.05               ~$0.063
Long-running agent (100K tokens)   $0.50               ~$0.625

Add Managed Agents session cost ($0.08/hour) on top of that. A one-hour agent task means adding $0.08 to whatever the token bill comes to. For short batch tasks, that's expensive. For multi-hour complex tasks where you'd otherwise pay an engineer to manage the infrastructure, it may come out cheaper.
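A few lines of Python make the trade-off concrete. The 1.25× multiplier and the $0.08/hour session rate come from the numbers above; the baseline token costs are the table's, so substitute your own:

def estimated_cost(baseline_token_cost, tokenizer_multiplier=1.25,
                   session_hours=0.0, session_rate=0.08):
    """Token bill inflated by the new tokenizer, plus Managed Agents session time."""
    return baseline_token_cost * tokenizer_multiplier + session_hours * session_rate

# Reproducing the table above
for label, baseline in [("Simple Q&A", 0.005),
                        ("Code review", 0.05),
                        ("Long-running agent", 0.50)]:
    print(f"{label}: ${estimated_cost(baseline):.4f}")

# The same 100K-token agent run as a one-hour Managed Agents session
print(f"With one session-hour: ${estimated_cost(0.50, session_hours=1.0):.4f}")  # $0.7050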

Why task_budget Matters Now

task_budget is the quietest feature in this release. The press covered benchmark numbers and flashy Managed Agents case studies, but for developers running agents over extended periods, this parameter might be the most practically significant change.

The problem it solves: you can't reliably predict how long a complex refactoring agent will run or how many tokens it'll consume. max_tokens limits the length of a single response, not the total cost of a multi-turn agent loop. task_budget fills that gap.
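To make that distinction concrete, here's a minimal sketch of a multi-turn tool-use loop. max_tokens resets on every iteration, while task_budget (passed the same way as in the earlier example, with the same unverified-by-me caveat) covers the loop as a whole. TOOLS and run_tools are placeholders for real tool definitions and execution:

import anthropic

client = anthropic.Anthropic()

TOOLS = []  # your tool definitions (bash, file edit, ...) go here

def run_tools(content):
    # Placeholder: execute each tool_use block and return matching tool_result blocks
    return [{"type": "tool_result", "tool_use_id": block.id, "content": "(stub)"}
            for block in content if block.type == "tool_use"]

messages = [{"role": "user", "content": "Replace all deprecated API calls in src/."}]

while True:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,  # caps one response; resets every turn
        extra_headers={"anthropic-beta": "task-budgets-2026-03-13"},
        task_budget=50000,  # advisory budget for the whole loop
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # model finished (or wrapped up within budget)
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": run_tools(response.content)})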

The advisory mechanism is the interesting design choice. Rather than hard-stopping at the limit, the model adjusts its own priorities as the budget approaches — skipping lower-priority exploration to focus on core tasks. Whether this works reliably in practice is something I want to test with my daily-post agent, which currently has no budget guardrail.

Managed Agents and the Development Process

When I first heard about Managed Agents, I assumed it was "Claude API with a sandbox attached." The documentation changed my mind.

The biggest shift is state management. Running agent sessions yourself, you keep hitting the same three problems: context loss when tools chain across multiple turns; restart costs when a session unexpectedly terminates; credential security when an agent needs access to GitHub, DBs, or external APIs without direct key exposure.

Managed Agents handles all three at the platform layer. Sessions that get interrupted resume with filesystem state intact. Credentials go through permission delegation. Context persists across long-running tasks.

Sentry's "weeks to ship" story might not be an exaggeration — some teams spend months building this infrastructure layer themselves.

The limitation worth naming: you're locked into Claude models. Teams running mixed-provider agent fleets will accumulate Anthropic dependency without a clean exit path.

Who Should and Shouldn't Use Opus 4.7

Community feedback and official docs together make the fit assessment fairly clear.

Where Opus 4.7 earns its keep: Complex multi-file refactoring, full-codebase analysis agents, legacy migration work, automated test coverage expansion. Tasks that take a long time to complete and genuinely benefit from deeper reasoning over many steps. High-res image workflows.

Where Opus 4.7 creates friction: Everyday coding assistance. Asking it to "fix this type error" or "refactor this function" is wasteful — Claude Sonnet 4.6 is faster and cheaper for that. Any workflow where conservative safety filtering could cause unexpected refusals. Creative writing or open-ended reasoning where rigid literal interpretation gets in the way.

The Bigger Picture for April's Releases

Reading April's Anthropic releases as a single narrative is interesting. Last month's performance-decline controversy shook community confidence; Anthropic came back one month later with benchmark numbers and a new infrastructure offering.

But developer reactions are maturing toward "benchmarks matter less than real-world behavior." SWE-bench Pro doesn't guarantee performance on your specific codebase, and "legendarily bad" feedback from power users is hard to dismiss.

My conclusion: Opus 4.7 is a genuine coding benchmark improvement, but I'm not doing a wholesale migration until the safety-overfitting reports clarify. task_budget and xhigh are tools I want to experiment with immediately. Managed Agents is my default infrastructure choice for any new agent project starting from scratch — but not worth migrating existing stable systems to. And the tokenizer cost impact needs to be explicitly calculated for every team's budget.

In one month, Anthropic answered "how do we build agents" at both the model layer and the infrastructure layer simultaneously. The answers aren't perfect. But the question is right.


Executability assessment (based on source review)

The task_budget code example and Managed Agents feature descriptions in this post are based on official documentation at platform.claude.com/docs and release notes. I was unable to configure a direct execution environment without an Anthropic API key, so the actual advisory behavior of task_budget and session billing mechanics were not directly verified. Everything here is based on documented design and public case studies — not first-hand execution logs.
