<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James AI</title>
    <description>The latest articles on DEV Community by James AI (@jamesai).</description>
    <link>https://dev.to/jamesai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877702%2F86cba880-2f93-4898-9d70-e49e5eff0342.jpg</url>
      <title>DEV Community: James AI</title>
      <link>https://dev.to/jamesai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jamesai"/>
    <language>en</language>
    <item>
      <title>Opus 4.7 First Look: I Tested the Day-Old Model Against 3 Other Claudes on 10 Real Tasks</title>
      <dc:creator>James AI</dc:creator>
      <pubDate>Fri, 17 Apr 2026 15:41:52 +0000</pubDate>
      <link>https://dev.to/jamesai/opus-47-first-look-i-tested-the-day-old-model-against-3-other-claudes-on-10-real-tasks-3cj6</link>
      <guid>https://dev.to/jamesai/opus-47-first-look-i-tested-the-day-old-model-against-3-other-claudes-on-10-real-tasks-3cj6</guid>
      <description>&lt;p&gt;&lt;em&gt;Evaluated on April 17, 2026 using &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;AgentHunter Eval&lt;/a&gt; v0.4.0&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Anthropic released &lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on April 17, 2026. I ran it through the same 10-task evaluation I used for Opus 4.6, Sonnet 4.6, and Haiku 4.5 — this time with real token tracking so I could report dollar cost, not just pass rate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ha3yamac6vde2v7mqgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ha3yamac6vde2v7mqgc.png" alt="Four Claude models on a benchmarking apparatus — Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku 4.5" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tasks Passed&lt;/th&gt;
&lt;th&gt;Avg Time&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;th&gt;Cost / Task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.4s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.559&lt;/td&gt;
&lt;td&gt;$0.056&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;10/10&lt;/td&gt;
&lt;td&gt;9.8s&lt;/td&gt;
&lt;td&gt;$0.437&lt;/td&gt;
&lt;td&gt;$0.044&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;10/10&lt;/td&gt;
&lt;td&gt;9.8s&lt;/td&gt;
&lt;td&gt;$0.110&lt;/td&gt;
&lt;td&gt;$0.011&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;8/10&lt;/td&gt;
&lt;td&gt;4.6s&lt;/td&gt;
&lt;td&gt;$0.030&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qn7frq46e6b3sctqvle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qn7frq46e6b3sctqvle.png" alt="Ranking dashboard showing pass rate and total cost per model" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.7 is the new accuracy king, and it's also faster than 4.6.&lt;/strong&gt; It costs ~27% more than 4.6 in total ($0.56 vs $0.44) but finishes tasks 14% faster on average. If you're already paying for Opus 4.6, the upgrade is easy to justify: same accuracy, lower latency, a modest cost bump.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sonnet 4.6 is the sleeper.&lt;/strong&gt; Perfect 10/10 accuracy at &lt;strong&gt;1/5 the cost&lt;/strong&gt; of Opus 4.7. Unless you specifically need the extra edge Opus brings on adversarial tasks, Sonnet is the right default for most production agent work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10 Tasks
&lt;/h2&gt;

&lt;p&gt;Five coding tasks, five writing tasks. All graded by an independent LLM judge against human-written pass/fail criteria.&lt;/p&gt;
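
&lt;p&gt;For context, a task file pairs a prompt with those pass/fail criteria. I don't show one in this post, so here's an illustrative sketch of the shape; the field names are representative, not the framework's exact schema (the real files live in the repo linked at the bottom):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Illustrative only: approximate field names; see tasks/ in the repo for the real schema
task: create-word-count-cli
prompt: "Write a word-count CLI that prints line, word, and character counts for a file."
criteria:   # human-written pass/fail criteria the LLM judge grades against
  - "The script runs without errors on a sample input file"
  - "The output includes line, word, and character counts"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;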

&lt;h3&gt;
  
  
  Coding (5 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create a word count CLI&lt;/td&gt;
&lt;td&gt;PASS (4.1s)&lt;/td&gt;
&lt;td&gt;PASS (5.0s)&lt;/td&gt;
&lt;td&gt;PASS (4.8s)&lt;/td&gt;
&lt;td&gt;PASS (2.7s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fix a sorting bug&lt;/td&gt;
&lt;td&gt;PASS (3.8s)&lt;/td&gt;
&lt;td&gt;PASS (3.8s)&lt;/td&gt;
&lt;td&gt;PASS (2.9s)&lt;/td&gt;
&lt;td&gt;PASS (2.2s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analyze CSV sales data&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL&lt;/strong&gt; (3.3s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write unit tests&lt;/td&gt;
&lt;td&gt;PASS (13.3s)&lt;/td&gt;
&lt;td&gt;PASS (17.8s)&lt;/td&gt;
&lt;td&gt;PASS (13.6s)&lt;/td&gt;
&lt;td&gt;PASS (7.5s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refactor repetitive code&lt;/td&gt;
&lt;td&gt;PASS (5.8s)&lt;/td&gt;
&lt;td&gt;PASS (7.2s)&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;PASS (3.0s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Writing &amp;amp; Docs (5 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write a professional email&lt;/td&gt;
&lt;td&gt;PASS (9.5s)&lt;/td&gt;
&lt;td&gt;PASS (12.4s)&lt;/td&gt;
&lt;td&gt;PASS (9.7s)&lt;/td&gt;
&lt;td&gt;PASS (4.0s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize a technical doc&lt;/td&gt;
&lt;td&gt;PASS (8.3s)&lt;/td&gt;
&lt;td&gt;PASS (9.6s)&lt;/td&gt;
&lt;td&gt;PASS (8.0s)&lt;/td&gt;
&lt;td&gt;PASS (4.1s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup shell script&lt;/td&gt;
&lt;td&gt;PASS (5.3s)&lt;/td&gt;
&lt;td&gt;PASS (5.7s)&lt;/td&gt;
&lt;td&gt;PASS (7.9s)&lt;/td&gt;
&lt;td&gt;PASS (3.3s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convert JSON to CSV&lt;/td&gt;
&lt;td&gt;PASS (8.6s)&lt;/td&gt;
&lt;td&gt;PASS (8.6s)&lt;/td&gt;
&lt;td&gt;PASS (10.7s)&lt;/td&gt;
&lt;td&gt;PASS (5.4s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write a project README&lt;/td&gt;
&lt;td&gt;PASS (20.6s)&lt;/td&gt;
&lt;td&gt;PASS (22.7s)&lt;/td&gt;
&lt;td&gt;PASS (31.6s)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL&lt;/strong&gt; (10.0s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Opus 4.7 is faster than 4.6, not slower
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtljt1syfr948emy4sq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtljt1syfr948emy4sq.png" alt="Speed comparison: Opus 4.7 trail 1.16x faster than Opus 4.6" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the surprise. Model version bumps usually trade off speed for capability — bigger model, longer generations. Opus 4.7 is the opposite: &lt;strong&gt;8.4s average vs 4.6's 9.8s&lt;/strong&gt;, a 14% improvement. On the README task specifically (the longest task in the suite), 4.7 finished in 20.6s vs 4.6's 22.7s.&lt;/p&gt;

&lt;p&gt;Same pass rate, lower latency, ~27% higher cost. For interactive agent workloads where latency matters, the upgrade is worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sonnet 4.6 is the cost-adjusted winner
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcke1mi5gq6gcx5gk097.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcke1mi5gq6gcx5gk097.png" alt="Balance scale: Sonnet at $0.11 vs Opus 4.7 at $0.56, both scoring 10/10" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonnet 4.6 matches Opus 4.7's 10/10 accuracy on this suite at &lt;strong&gt;$0.11 total vs $0.56&lt;/strong&gt; — &lt;strong&gt;5× cheaper&lt;/strong&gt;. The gap between Sonnet and Opus used to be "Sonnet is fine if you're okay with 90% accuracy." As of this benchmark, there's no accuracy gap on these 10 tasks.&lt;/p&gt;

&lt;p&gt;A caveat on where Opus may still earn its premium: this suite doesn't include adversarial inputs, long-context reasoning, or multi-step planning. For narrow, well-specified tasks like these, Sonnet is enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Haiku 4.5 regressed on two tasks
&lt;/h3&gt;

&lt;p&gt;Haiku failed the CSV analysis and README tasks it passed in the previous run (and, going the other way, passed the unit-test task it previously failed). The benchmark is deterministic on success criteria, but model output is stochastic, so individual tasks can flip on single-run evals. Still, 8/10 at &lt;strong&gt;1/20th the cost of Opus 4.7&lt;/strong&gt; is extraordinary for high-volume, latency-sensitive workloads.&lt;/p&gt;

&lt;p&gt;The failure modes were informative: on the CSV task Haiku produced the right summary but missed two of the four success criteria (for example, it didn't create the separate analysis file the rubric expected). On the README it produced a shorter doc that missed one required section. Both look correctable with better prompting.&lt;/p&gt;
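
&lt;p&gt;As a concrete example of what "better prompting" means here (wording invented for illustration, not the actual task file), making the file requirement explicit would likely rescue the CSV run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical prompt tweak, not the real task definition
prompt: |
  Analyze the provided sales CSV and print a summary.
  Also write that summary to a separate file named analysis.md;
  the grader checks that this file exists.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;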

&lt;h3&gt;
  
  
  4. Writing tasks are still commodity
&lt;/h3&gt;

&lt;p&gt;Emails, summaries, shell scripts, and JSON conversion were a clean sweep for all four models; Haiku's README miss was the only writing failure. The quality gap mostly opens on code-reasoning tasks — and even that gap has narrowed significantly with the 4.6+ models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in This Benchmark
&lt;/h2&gt;

&lt;p&gt;Three things I added since the last post:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real token tracking.&lt;/strong&gt; The agent script now parses the &lt;code&gt;usage&lt;/code&gt; field from the Anthropic API response and emits a &lt;code&gt;USAGE: input=X output=Y model=Z&lt;/code&gt; line the eval engine picks up. Combined with a pricing map in the framework, this lets us report $/task accurately instead of eyeballing cost tiers.&lt;/p&gt;
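
&lt;p&gt;A minimal sketch of that plumbing, assuming the raw Messages API response has been saved to &lt;code&gt;response.json&lt;/code&gt; (the file name and prices are placeholders; &lt;code&gt;usage.input_tokens&lt;/code&gt;, &lt;code&gt;usage.output_tokens&lt;/code&gt;, and &lt;code&gt;model&lt;/code&gt; are real fields in the API response):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch only: file name and prices are placeholders
IN=$(jq -r '.usage.input_tokens' response.json)
OUT=$(jq -r '.usage.output_tokens' response.json)
MODEL=$(jq -r '.model' response.json)

# The line the eval engine scrapes from the agent's stdout
echo "USAGE: input=$IN output=$OUT model=$MODEL"

# Dollar cost from a per-million-token pricing map (numbers illustrative)
PRICE_IN=15
PRICE_OUT=75
echo "scale=4; ($IN * $PRICE_IN + $OUT * $PRICE_OUT) / 1000000" | bc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;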

&lt;p&gt;&lt;strong&gt;Head-to-head compare view.&lt;/strong&gt; Pick any two models on &lt;a href="https://eval.agenthunter.io/compare?a=opus-4-7&amp;amp;b=sonnet" rel="noopener noreferrer"&gt;eval.agenthunter.io/compare&lt;/a&gt; to see per-task wins, speed delta, and cost delta side-by-side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;README badges.&lt;/strong&gt; If your agent scored well, drop a shields-style badge in your README:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;![&lt;/span&gt;&lt;span class="nv"&gt;AgentHunter&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://eval.agenthunter.io/badge/opus-4-7.svg?metric=pass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  My updated recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best of the best&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus 4.7&lt;/td&gt;
&lt;td&gt;Fastest perfect scorer. Upgrade from 4.6.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production default&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;10/10 accuracy at 1/5 the cost of Opus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High-volume, latency-sensitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;2× faster than Sonnet, 1/4 the cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Writing-only workloads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;Near-tie on writing; Haiku is by far the cheapest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Reproduce this yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @agenthunter/eval task &lt;span class="nt"&gt;-c&lt;/span&gt; tasks/01-create-cli-tool.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
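
&lt;p&gt;To run the full suite rather than a single task, loop over the task files (assuming they all live under &lt;code&gt;tasks/&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Same command as above, once per task file
for t in tasks/*.yaml; do
  npx @agenthunter/eval task -c "$t"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;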



&lt;p&gt;Raw data for all runs: &lt;a href="https://github.com/OrrisTech/agent-eval/tree/main/results" rel="noopener noreferrer"&gt;github.com/OrrisTech/agent-eval/tree/main/results&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interactive results: &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;eval.agenthunter.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>Sonnet 4.6 vs Haiku 4.5 vs Opus 4.6: I Tested 3 Claude Models on 10 Real Tasks</title>
      <dc:creator>James AI</dc:creator>
      <pubDate>Wed, 15 Apr 2026 02:38:11 +0000</pubDate>
      <link>https://dev.to/jamesai/sonnet-46-vs-haiku-45-vs-opus-46-i-tested-3-claude-models-on-10-real-tasks-4mn6</link>
      <guid>https://dev.to/jamesai/sonnet-46-vs-haiku-45-vs-opus-46-i-tested-3-claude-models-on-10-real-tasks-4mn6</guid>
      <description>&lt;p&gt;&lt;em&gt;Evaluated on April 15, 2026 using &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;AgentHunter Eval&lt;/a&gt; v0.3.1&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Which Claude model should you use for your agent? I tested the latest versions of all three on the same 10 tasks to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tasks Passed&lt;/th&gt;
&lt;th&gt;Avg Time&lt;/th&gt;
&lt;th&gt;Cost Tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus 4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.4s&lt;/td&gt;
&lt;td&gt;$$$$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.2s&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Haiku 4.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.9s&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only Opus 4.6 scored a perfect 10/10. Sonnet 4.6 and Haiku 4.5 both stumbled on the same task. Haiku was &lt;strong&gt;2.5x faster&lt;/strong&gt; than the other two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10 Tasks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Coding (5 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create a CLI tool&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;PASS (3.9s)&lt;/td&gt;
&lt;td&gt;PASS (2.4s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fix a sorting bug&lt;/td&gt;
&lt;td&gt;PASS (3.8s)&lt;/td&gt;
&lt;td&gt;PASS (5.8s)&lt;/td&gt;
&lt;td&gt;PASS (1.9s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analyze CSV data&lt;/td&gt;
&lt;td&gt;PASS (5.7s)&lt;/td&gt;
&lt;td&gt;PASS (4.4s)&lt;/td&gt;
&lt;td&gt;PASS (3.3s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write unit tests&lt;/td&gt;
&lt;td&gt;PASS (16.2s)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL&lt;/strong&gt; (14.6s)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL&lt;/strong&gt; (4.5s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refactor repetitive code&lt;/td&gt;
&lt;td&gt;PASS (4.9s)&lt;/td&gt;
&lt;td&gt;PASS (3.8s)&lt;/td&gt;
&lt;td&gt;PASS (2.4s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Writing &amp;amp; Docs (5 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write a professional email&lt;/td&gt;
&lt;td&gt;PASS (10.6s)&lt;/td&gt;
&lt;td&gt;PASS (10.3s)&lt;/td&gt;
&lt;td&gt;PASS (4.2s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize a technical doc&lt;/td&gt;
&lt;td&gt;PASS (8.8s)&lt;/td&gt;
&lt;td&gt;PASS (8.1s)&lt;/td&gt;
&lt;td&gt;PASS (3.6s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create a backup shell script&lt;/td&gt;
&lt;td&gt;PASS (6.0s)&lt;/td&gt;
&lt;td&gt;PASS (7.2s)&lt;/td&gt;
&lt;td&gt;PASS (3.0s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convert JSON to CSV&lt;/td&gt;
&lt;td&gt;PASS (8.5s)&lt;/td&gt;
&lt;td&gt;PASS (12.2s)&lt;/td&gt;
&lt;td&gt;PASS (4.6s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write a project README&lt;/td&gt;
&lt;td&gt;PASS (25.0s)&lt;/td&gt;
&lt;td&gt;PASS (32.0s)&lt;/td&gt;
&lt;td&gt;PASS (8.7s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Test writing is the hardest task for smaller models
&lt;/h3&gt;

&lt;p&gt;Both Sonnet 4.6 and Haiku 4.5 failed the "write unit tests" task. This task requires generating a test file with correct assertions against a provided calculator function — it demands precise understanding of both the source code and test framework patterns. Only Opus 4.6 handled it correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: For tasks requiring multi-file reasoning (reading source + generating corresponding tests), Opus is worth the cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Haiku 4.5 is absurdly fast
&lt;/h3&gt;

&lt;p&gt;At 3.9s average, Haiku is &lt;strong&gt;2.5x faster than Sonnet 4.6&lt;/strong&gt; (10.2s) and &lt;strong&gt;2.4x faster than Opus 4.6&lt;/strong&gt; (9.4s). For the README task, Haiku took 8.7s vs Sonnet's 32.0s — a 3.7x difference.&lt;/p&gt;

&lt;p&gt;With 9/10 pass rate at that speed, Haiku is the clear winner for high-volume, latency-sensitive workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Sonnet 4.6 is surprisingly slow
&lt;/h3&gt;

&lt;p&gt;Sonnet 4.6 averaged 10.2s — actually slower than Opus 4.6 (9.4s) while passing fewer tasks (9 vs 10). This is unexpected: Sonnet is supposed to be the balanced middle option, but on these tasks, Opus delivers better accuracy at comparable speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Writing tasks are easy for everyone
&lt;/h3&gt;

&lt;p&gt;All three models went a perfect 5/5 on the writing and documentation tasks. Emails, summaries, shell scripts, READMEs — no differentiation. The quality gap only appears on complex code-reasoning tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  My updated recommendation
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complex coding (tests, multi-file)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;Only model that passes all tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Simple coding + writing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;2.5x faster, 90% pass rate, cheapest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;General purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Good balance, but Haiku may be better for most tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @agenthunter/eval task &lt;span class="nt"&gt;-c&lt;/span&gt; task.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All evaluation data: &lt;a href="https://github.com/OrrisTech/agent-eval/tree/main/results" rel="noopener noreferrer"&gt;github.com/OrrisTech/agent-eval/tree/main/results&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full interactive results: &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;eval.agenthunter.io&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;AgentHunter Eval&lt;/a&gt; — the open-source AI agent evaluation platform. &lt;code&gt;npx @agenthunter/eval task&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>I Benchmarked 12 MCP Servers, Here's What I Found</title>
      <dc:creator>James AI</dc:creator>
      <pubDate>Tue, 14 Apr 2026 03:48:19 +0000</pubDate>
      <link>https://dev.to/jamesai/i-benchmarked-12-mcp-servers-heres-what-i-found-1124</link>
      <guid>https://dev.to/jamesai/i-benchmarked-12-mcp-servers-heres-what-i-found-1124</guid>
      <description>&lt;h1&gt;
  
  
  We Benchmarked 12 MCP Servers — Here's What We Found
&lt;/h1&gt;

&lt;p&gt;The Model Context Protocol (MCP) ecosystem has exploded — over 10,000 servers on the official registry, 97 million monthly SDK downloads. But which MCP servers are actually good?&lt;/p&gt;

&lt;p&gt;We built &lt;a href="https://github.com/OrrisTech/agent-eval" rel="noopener noreferrer"&gt;agent-eval&lt;/a&gt;, an open-source evaluation framework, and used it to benchmark 12 popular MCP servers across 5 dimensions: &lt;strong&gt;Capability&lt;/strong&gt;, &lt;strong&gt;Reliability&lt;/strong&gt;, &lt;strong&gt;Efficiency&lt;/strong&gt;, &lt;strong&gt;Safety&lt;/strong&gt;, and &lt;strong&gt;Developer Experience&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;For each server, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connected via stdio transport and discovered all available tools&lt;/li&gt;
&lt;li&gt;Used Claude to auto-generate test tasks based on each tool's schema&lt;/li&gt;
&lt;li&gt;Executed every task multiple times to measure reliability&lt;/li&gt;
&lt;li&gt;Scored output quality using LLM-as-judge (Claude Sonnet 4)&lt;/li&gt;
&lt;li&gt;Measured latency, success rate, and safety (prompt injection resistance)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All evaluation code is &lt;a href="https://github.com/OrrisTech/agent-eval" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. You can reproduce these results yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @agenthunter/eval init
npx @agenthunter/eval run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rankings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Reliability&lt;/th&gt;
&lt;th&gt;Efficiency&lt;/th&gt;
&lt;th&gt;Safety&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 🥇&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;context7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 🥈&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-fetch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 🥉&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;93&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;notion-mcp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Productivity&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-datetime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Utilities&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-everything&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reference&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-sequential-thinking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-filesystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filesystem&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;playwright-mcp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-sqlite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;63&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-git&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DevTools&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;55&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-puppeteer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reliability varies wildly
&lt;/h3&gt;

&lt;p&gt;Of the 12 servers tested, 5 achieved 80%+ reliability. Five, however, fell below 50%: &lt;strong&gt;mcp-filesystem&lt;/strong&gt; (14%), &lt;strong&gt;playwright-mcp&lt;/strong&gt; (30%), &lt;strong&gt;mcp-sqlite&lt;/strong&gt; (10%), &lt;strong&gt;mcp-git&lt;/strong&gt; (4%), and &lt;strong&gt;mcp-puppeteer&lt;/strong&gt; (0%). Low reliability usually means the server crashes, times out, or returns errors for valid inputs.&lt;/p&gt;
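
&lt;p&gt;If you want to spot-check a flaky server yourself, one crude smoke test is to pipe a single JSON-RPC &lt;code&gt;initialize&lt;/code&gt; request into it over stdio (field values follow the 2024-11-05 MCP spec; a healthy server answers on stdout, a broken one errors out or hangs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Crude stdio smoke test against the memory server used in this benchmark
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke-test","version":"0.0.1"}}}' | npx -y @modelcontextprotocol/server-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;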

&lt;h3&gt;
  
  
  2. Efficiency is generally excellent
&lt;/h3&gt;

&lt;p&gt;Average latency across all servers was 491ms. 9/12 servers scored 90+ on efficiency, meaning sub-second response times. MCP's stdio transport is inherently fast since there's no network overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Safety scores reveal gaps
&lt;/h3&gt;

&lt;p&gt;9/12 servers scored a perfect 100 on safety. The three that didn't (mcp-memory at 89, mcp-everything at 97, and mcp-git at 98) dropped points on the prompt-injection and scope-violation checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Individual Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  context7
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 89/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 100%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1756ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 83 | Rel 100 | Eff 87 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-fetch
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Web&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 86/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 90%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 640ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 73 | Rel 90 | Eff 99 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-memory
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 82/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 9&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 27&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 93%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 63 | Rel 93 | Eff 100 | Safe 89 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  notion-mcp
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Productivity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 82/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 22&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 44&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 97%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 643ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 55 | Rel 97 | Eff 98 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-datetime
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Utilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 81/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 30&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 73%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 2ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 70 | Rel 73 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-everything
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Reference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 75/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 13&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 39&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 74%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 2621ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 66 | Rel 74 | Eff 78 | Safe 97 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-sequential-thinking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 71/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 100%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 15 | Rel 100 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-filesystem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Filesystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 68/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 14&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 28&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 14%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 73 | Rel 14 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  playwright-mcp
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 68/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 20&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 30%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 212ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 62 | Rel 30 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-sqlite
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 63/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 10%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 63 | Rel 10 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-git
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: DevTools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 55/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 15&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 45&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 4%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 18ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 40 | Rel 4 | Eff 100 | Safe 98 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-puppeteer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 47/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 7&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 14&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 0%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 0ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 51 | Rel 0 | Eff 50 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Scores Are Calculated
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;What we measure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;Task completion rate + output quality (LLM-as-judge)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;Success rate across multiple runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Response latency (sub-500ms = 100, &amp;gt;10s = 0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Safety&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;Prompt injection resistance, scope violations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dev Experience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;Documentation quality, error messages, schema clarity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Overall Score&lt;/strong&gt; = weighted average of all dimensions, scaled to 0-100.&lt;/p&gt;
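
&lt;p&gt;As a sanity check, plugging context7's breakdown (Cap 83, Rel 100, Eff 87, Safe 100, DX 70) into those weights reproduces its published score:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;echo "0.30*83 + 0.25*100 + 0.20*87 + 0.15*100 + 0.10*70" | bc
# prints 89.30, which rounds to the published 89
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;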

&lt;h2&gt;
  
  
  Reproduce These Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OrrisTech/agent-eval
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-eval
bun &lt;span class="nb"&gt;install
&lt;/span&gt;bun run &lt;span class="nt"&gt;--filter&lt;/span&gt; agent-eval build

&lt;span class="c"&gt;# Evaluate a single server&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'agent:
  name: "mcp-memory"
  protocol: mcp
  endpoint: "npx -y @modelcontextprotocol/server-memory"
  capabilities: ["memory"]
eval:
  runs: 3'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; agent-eval.yaml

&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key npx @agenthunter/eval run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We're expanding to evaluate A2A agents and REST API agents. If you'd like your MCP server benchmarked, &lt;a href="https://github.com/OrrisTech/agent-eval/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; or submit a PR to our server list.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Evaluations run on 2026-04-14 using agent-eval v0.3.1. Scores may vary between runs due to LLM non-determinism. Full raw data available in the &lt;a href="https://github.com/OrrisTech/agent-eval/tree/main/results" rel="noopener noreferrer"&gt;results directory&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
