Claude Sonnet 5 Beats Opus 4.8 on Knowledge Work at Lower Cost

#ai #programming #tech #product

Anthropic released Claude Sonnet 5, which beats Sonnet 4.6 on every published benchmark and scores 1,618 on GDPval-AA v2, edging past the larger Opus 4.8 at 1,615. The model is available now at an introductory discount through August 2026.

Key facts

Sonnet 5 scores 1,618 on GDPval-AA v2, beating Opus 4.8.
SWE-bench Pro: 63.2% (Sonnet 5) vs 58.1% (Sonnet 4.6).
Terminal-Bench 2.1: 80.4% (Sonnet 5) vs 67.0% (Sonnet 4.6).
OSWorld-Verified: 81.2% (Sonnet 5) vs 78.5% (Sonnet 4.6).
Available at an introductory discount through August 2026.

Anthropic calls Claude Sonnet 5 its most agentic Sonnet yet, according to The Decoder. The model can build plans on its own and use tools like browsers and terminals, closing the gap to the pricier Opus series.

Benchmark gains across the board

Anthropic's published benchmarks show Sonnet 5 beating its predecessor Sonnet 4.6 in every tested category while gaining ground on Opus 4.8 [per the article]. On agentic coding, Sonnet 5 hits 63.2 percent on SWE-bench Pro, up from 58.1 percent for Sonnet 4.6; Opus 4.8 sits at 69.2 percent. On Terminal-Bench 2.1, Sonnet 5 scores 80.4 percent versus Sonnet 4.6's 67.0 percent. For multidisciplinary reasoning (Humanity's Last Exam), Sonnet 5 reaches 57.4 percent with tools, nearly matching Opus 4.8 at 57.9 percent. On computer use (OSWorld-Verified), Sonnet 5 posts 81.2 percent compared to 78.5 percent for its predecessor.
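To make the size of these gains easier to compare, the following minimal Python sketch computes the percentage-point difference between Sonnet 5 and Sonnet 4.6 on the three benchmarks where both scores are quoted. The scores are copied from the figures above; the names SCORES and point_gains are illustrative only.

```python
# Scores in percent as quoted in the article: (Sonnet 5, Sonnet 4.6).
SCORES = {
    "SWE-bench Pro": (63.2, 58.1),
    "Terminal-Bench 2.1": (80.4, 67.0),
    "OSWorld-Verified": (81.2, 78.5),
}


def point_gains(scores):
    """Return the percentage-point gain of the newer model per benchmark."""
    # Round to one decimal to hide floating-point noise in the subtraction.
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


if __name__ == "__main__":
    for name, gain in point_gains(SCORES).items():
        print(f"{name}: +{gain} points")
```

By this arithmetic the largest jump is on Terminal-Bench 2.1 (13.4 points), followed by SWE-bench Pro (5.1 points) and OSWorld-Verified (2.7 points).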

On the knowledge work benchmark GDPval-AA v2, which tests AI on real-world knowledge tasks, Sonnet 5 actually beats the larger Opus 4.8, scoring 1,618 to Opus's 1,615. Anthropic says feedback from early-access partners told the same story: Sonnet 5 acts far more agentically than previous versions, which shows up in how it handles search tasks, for example.

Cybersecurity context

This launch comes as the US government blocks two of Anthropic's most capable models, Mythos 5 and Fable 5, over cybersecurity concerns. Anthropic is clearly eager to get ahead of any similar worries. Sonnet 5 wasn't trained on cybersecurity tasks, the company says, and in tests for risky capabilities like writing software exploits, it scores far below both Opus 4.8 and Mythos 5.

Sonnet 5 does score a bit higher than its predecessor on these tasks, though. So Anthropic has switched on cyber safeguards by default. They flag and block risky cyber usage in real time, on par with the protections already in place for Claude Opus 4.7 and 4.8. They're dialed back compared to Fable 5's guardrails, which users complained about almost immediately. Anthropic says it views the overall cybersecurity risk from Sonnet 5 as low.

The model is available now on all Anthropic platforms at an introductory discount, with pricing rising to standard Sonnet rates after August 2026.

What to watch

Watch for enterprise adoption metrics after the introductory discount ends in August 2026. Also monitor whether the US government imposes any restrictions on Sonnet 5 given its improved agentic capabilities, and whether Anthropic releases a new Opus model to keep the Opus line ahead of Sonnet.

Source: the-decoder.com

[Updated 01 Jul via the_decoder]

Developer Simon Willison noted that Sonnet 5 uses a new tokenizer that produces roughly 30% more tokens for the same English text compared to Sonnet 4.6, effectively raising costs by about 40% [per Simon Willison]. The model also drops support for the sampling parameters temperature, top_p, and top_k, and has adaptive thinking enabled by default. It offers a 1 million token context window and 128,000 maximum output tokens.
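For readers planning a migration, here is a rough Python sketch of the two practical consequences mentioned in the update. It assumes that the cost of a given text scales as the token ratio times the per-token price ratio, and that a request is represented as a plain dict; neither the request shape nor the helper names come from Anthropic's documentation. The article gives no per-token prices, so price_ratio defaults to 1.0, and the roughly 40% figure is Willison's and is not derived here.

```python
# Sampling parameters the article says Sonnet 5 no longer supports.
UNSUPPORTED_SAMPLING_PARAMS = ("temperature", "top_p", "top_k")


def cost_multiplier(token_ratio, price_ratio=1.0):
    """Relative cost of the same text on the new model.

    token_ratio: tokens on new model / tokens on old model (e.g. 1.3).
    price_ratio: per-token price new / old (assumed 1.0 when unknown).
    """
    return token_ratio * price_ratio


def strip_sampling_params(request):
    """Return a copy of a request dict without the dropped sampling keys."""
    return {
        key: value
        for key, value in request.items()
        if key not in UNSUPPORTED_SAMPLING_PARAMS
    }


if __name__ == "__main__":
    # Hypothetical request; "example-model" is a placeholder, not a real model ID.
    old_request = {"model": "example-model", "max_tokens": 1024, "temperature": 0.2}
    print(strip_sampling_params(old_request))
    print(cost_multiplier(1.3))
```

With an unchanged per-token price, 30% more tokens means about 30% more cost for the same text, so any larger figure depends on pricing or usage details not covered in this article.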

Originally published on gentic.news
