<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James AI</title>
    <description>The latest articles on DEV Community by James AI (@jamesai).</description>
    <link>https://dev.to/jamesai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877702%2F86cba880-2f93-4898-9d70-e49e5eff0342.jpg</url>
      <title>DEV Community: James AI</title>
      <link>https://dev.to/jamesai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jamesai"/>
    <language>en</language>
    <item>
      <title>Opus 4.7 First Look: I Tested the Day-Old Model Against 3 Other Claudes on 10 Real Tasks</title>
      <dc:creator>James AI</dc:creator>
      <pubDate>Fri, 17 Apr 2026 15:41:52 +0000</pubDate>
      <link>https://dev.to/jamesai/opus-47-first-look-i-tested-the-day-old-model-against-3-other-claudes-on-10-real-tasks-3cj6</link>
      <guid>https://dev.to/jamesai/opus-47-first-look-i-tested-the-day-old-model-against-3-other-claudes-on-10-real-tasks-3cj6</guid>
      <description>&lt;p&gt;&lt;em&gt;Evaluated on April 17, 2026 using &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;AgentHunter Eval&lt;/a&gt; v0.4.0&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Anthropic released &lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on April 17, 2026. I ran it through the same 10-task evaluation I used for Opus 4.6, Sonnet 4.6, and Haiku 4.5 — this time with real token tracking so I could report dollar cost, not just pass rate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ha3yamac6vde2v7mqgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ha3yamac6vde2v7mqgc.png" alt="Four Claude models on a benchmarking apparatus — Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku 4.5" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tasks Passed&lt;/th&gt;
&lt;th&gt;Avg Time&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;th&gt;Cost / Task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.4s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.559&lt;/td&gt;
&lt;td&gt;$0.056&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;10/10&lt;/td&gt;
&lt;td&gt;9.8s&lt;/td&gt;
&lt;td&gt;$0.437&lt;/td&gt;
&lt;td&gt;$0.044&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;10/10&lt;/td&gt;
&lt;td&gt;9.8s&lt;/td&gt;
&lt;td&gt;$0.110&lt;/td&gt;
&lt;td&gt;$0.011&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;8/10&lt;/td&gt;
&lt;td&gt;4.6s&lt;/td&gt;
&lt;td&gt;$0.030&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qn7frq46e6b3sctqvle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qn7frq46e6b3sctqvle.png" alt="Ranking dashboard showing pass rate and total cost per model" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.7 is the new accuracy king, and it's also faster than 4.6.&lt;/strong&gt; It costs ~27% more than 4.6 in total ($0.56 vs $0.44) but finishes tasks 14% faster on average. If you're already paying for Opus 4.6, the upgrade is easy to justify: same accuracy, lower latency, a modest cost bump.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sonnet 4.6 is the sleeper.&lt;/strong&gt; Perfect 10/10 accuracy at &lt;strong&gt;1/5 the cost&lt;/strong&gt; of Opus 4.7. Unless you specifically need the extra edge Opus brings on adversarial tasks, Sonnet is the right default for most production agent work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10 Tasks
&lt;/h2&gt;

&lt;p&gt;Five coding tasks, five writing tasks. All graded by an independent LLM judge against human-written pass/fail criteria.&lt;/p&gt;
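
&lt;p&gt;For context, a task file pairs a prompt with those pass/fail criteria. I don't show one in this post, so here's an illustrative sketch of the shape; the field names are representative, not the framework's exact schema (the real files live in the repo linked at the bottom):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Illustrative only: approximate field names; see tasks/ in the repo for the real schema
task: create-word-count-cli
prompt: "Write a word-count CLI that prints line, word, and character counts for a file."
criteria:   # human-written pass/fail criteria the LLM judge grades against
  - "The script runs without errors on a sample input file"
  - "The output includes line, word, and character counts"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;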

&lt;h3&gt;
  
  
  Coding (5 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create a word count CLI&lt;/td&gt;
&lt;td&gt;PASS (4.1s)&lt;/td&gt;
&lt;td&gt;PASS (5.0s)&lt;/td&gt;
&lt;td&gt;PASS (4.8s)&lt;/td&gt;
&lt;td&gt;PASS (2.7s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fix a sorting bug&lt;/td&gt;
&lt;td&gt;PASS (3.8s)&lt;/td&gt;
&lt;td&gt;PASS (3.8s)&lt;/td&gt;
&lt;td&gt;PASS (2.9s)&lt;/td&gt;
&lt;td&gt;PASS (2.2s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analyze CSV sales data&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL&lt;/strong&gt; (3.3s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write unit tests&lt;/td&gt;
&lt;td&gt;PASS (13.3s)&lt;/td&gt;
&lt;td&gt;PASS (17.8s)&lt;/td&gt;
&lt;td&gt;PASS (13.6s)&lt;/td&gt;
&lt;td&gt;PASS (7.5s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refactor repetitive code&lt;/td&gt;
&lt;td&gt;PASS (5.8s)&lt;/td&gt;
&lt;td&gt;PASS (7.2s)&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;PASS (3.0s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Writing &amp;amp; Docs (5 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write a professional email&lt;/td&gt;
&lt;td&gt;PASS (9.5s)&lt;/td&gt;
&lt;td&gt;PASS (12.4s)&lt;/td&gt;
&lt;td&gt;PASS (9.7s)&lt;/td&gt;
&lt;td&gt;PASS (4.0s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize a technical doc&lt;/td&gt;
&lt;td&gt;PASS (8.3s)&lt;/td&gt;
&lt;td&gt;PASS (9.6s)&lt;/td&gt;
&lt;td&gt;PASS (8.0s)&lt;/td&gt;
&lt;td&gt;PASS (4.1s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup shell script&lt;/td&gt;
&lt;td&gt;PASS (5.3s)&lt;/td&gt;
&lt;td&gt;PASS (5.7s)&lt;/td&gt;
&lt;td&gt;PASS (7.9s)&lt;/td&gt;
&lt;td&gt;PASS (3.3s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convert JSON to CSV&lt;/td&gt;
&lt;td&gt;PASS (8.6s)&lt;/td&gt;
&lt;td&gt;PASS (8.6s)&lt;/td&gt;
&lt;td&gt;PASS (10.7s)&lt;/td&gt;
&lt;td&gt;PASS (5.4s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write a project README&lt;/td&gt;
&lt;td&gt;PASS (20.6s)&lt;/td&gt;
&lt;td&gt;PASS (22.7s)&lt;/td&gt;
&lt;td&gt;PASS (31.6s)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL&lt;/strong&gt; (10.0s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Opus 4.7 is faster than 4.6, not slower
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtljt1syfr948emy4sq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtljt1syfr948emy4sq.png" alt="Speed comparison: Opus 4.7 trail 1.16x faster than Opus 4.6" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the surprise. Model version bumps usually trade off speed for capability — bigger model, longer generations. Opus 4.7 is the opposite: &lt;strong&gt;8.4s average vs 4.6's 9.8s&lt;/strong&gt;, a 14% improvement. On the README task specifically (the longest task in the suite), 4.7 finished in 20.6s vs 4.6's 22.7s.&lt;/p&gt;

&lt;p&gt;Same pass rate, lower latency, ~27% higher cost. For interactive agent workloads where latency matters, the upgrade is worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sonnet 4.6 is the cost-adjusted winner
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcke1mi5gq6gcx5gk097.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcke1mi5gq6gcx5gk097.png" alt="Balance scale: Sonnet at $0.11 vs Opus 4.7 at $0.56, both scoring 10/10" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonnet 4.6 matches Opus 4.7's 10/10 accuracy on this suite at &lt;strong&gt;$0.11 total vs $0.56&lt;/strong&gt; — &lt;strong&gt;5× cheaper&lt;/strong&gt;. The gap between Sonnet and Opus used to be "Sonnet is fine if you're okay with 90% accuracy." As of this benchmark, there's no accuracy gap on these 10 tasks.&lt;/p&gt;

&lt;p&gt;A caveat on where Opus may still earn its premium: this suite doesn't include adversarial inputs, long-context reasoning, or multi-step planning. For narrow, well-specified tasks like these, Sonnet is enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Haiku 4.5 regressed on two tasks
&lt;/h3&gt;

&lt;p&gt;Haiku failed the CSV analysis and README tasks it passed in the previous run (and, going the other way, passed the unit-test task it previously failed). The benchmark is deterministic on success criteria, but model output is stochastic, so individual tasks can flip on single-run evals. Still, 8/10 at &lt;strong&gt;1/20th the cost of Opus 4.7&lt;/strong&gt; is extraordinary for high-volume, latency-sensitive workloads.&lt;/p&gt;

&lt;p&gt;The failure modes were informative: on the CSV task Haiku produced the right summary but missed two of the four success criteria (for example, it didn't create the separate analysis file the rubric expected). On the README it produced a shorter doc that missed one required section. Both look correctable with better prompting.&lt;/p&gt;
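
&lt;p&gt;As a concrete example of what "better prompting" means here (wording invented for illustration, not the actual task file), making the file requirement explicit would likely rescue the CSV run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical prompt tweak, not the real task definition
prompt: |
  Analyze the provided sales CSV and print a summary.
  Also write that summary to a separate file named analysis.md;
  the grader checks that this file exists.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;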

&lt;h3&gt;
  
  
  4. Writing tasks are still commodity
&lt;/h3&gt;

&lt;p&gt;Emails, summaries, shell scripts, and JSON conversion were a clean sweep for all four models; Haiku's README miss was the only writing failure. The quality gap mostly opens on code-reasoning tasks — and even that gap has narrowed significantly with the 4.6+ models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in This Benchmark
&lt;/h2&gt;

&lt;p&gt;Three things I added since the last post:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real token tracking.&lt;/strong&gt; The agent script now parses the &lt;code&gt;usage&lt;/code&gt; field from the Anthropic API response and emits a &lt;code&gt;USAGE: input=X output=Y model=Z&lt;/code&gt; line the eval engine picks up. Combined with a pricing map in the framework, this lets us report $/task accurately instead of eyeballing cost tiers.&lt;/p&gt;
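
&lt;p&gt;A minimal sketch of that plumbing, assuming the raw Messages API response has been saved to &lt;code&gt;response.json&lt;/code&gt; (the file name and prices are placeholders; &lt;code&gt;usage.input_tokens&lt;/code&gt;, &lt;code&gt;usage.output_tokens&lt;/code&gt;, and &lt;code&gt;model&lt;/code&gt; are real fields in the API response):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch only: file name and prices are placeholders
IN=$(jq -r '.usage.input_tokens' response.json)
OUT=$(jq -r '.usage.output_tokens' response.json)
MODEL=$(jq -r '.model' response.json)

# The line the eval engine scrapes from the agent's stdout
echo "USAGE: input=$IN output=$OUT model=$MODEL"

# Dollar cost from a per-million-token pricing map (numbers illustrative)
PRICE_IN=15
PRICE_OUT=75
echo "scale=4; ($IN * $PRICE_IN + $OUT * $PRICE_OUT) / 1000000" | bc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;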

&lt;p&gt;&lt;strong&gt;Head-to-head compare view.&lt;/strong&gt; Pick any two models on &lt;a href="https://eval.agenthunter.io/compare?a=opus-4-7&amp;amp;b=sonnet" rel="noopener noreferrer"&gt;eval.agenthunter.io/compare&lt;/a&gt; to see per-task wins, speed delta, and cost delta side-by-side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;README badges.&lt;/strong&gt; If your agent scored well, drop a shields-style badge in your README:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;![&lt;/span&gt;&lt;span class="nv"&gt;AgentHunter&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://eval.agenthunter.io/badge/opus-4-7.svg?metric=pass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  My updated recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best of the best&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus 4.7&lt;/td&gt;
&lt;td&gt;Fastest perfect scorer. Upgrade from 4.6.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production default&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;10/10 accuracy at 1/5 the cost of Opus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High-volume, latency-sensitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;2× faster than Sonnet, 1/4 the cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Writing-only workloads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;Near-tie on writing; Haiku is by far the cheapest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Reproduce this yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @agenthunter/eval task &lt;span class="nt"&gt;-c&lt;/span&gt; tasks/01-create-cli-tool.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
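
&lt;p&gt;To run the full suite rather than a single task, loop over the task files (assuming they all live under &lt;code&gt;tasks/&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Same command as above, once per task file
for t in tasks/*.yaml; do
  npx @agenthunter/eval task -c "$t"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;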



&lt;p&gt;Raw data for all runs: &lt;a href="https://github.com/OrrisTech/agent-eval/tree/main/results" rel="noopener noreferrer"&gt;github.com/OrrisTech/agent-eval/tree/main/results&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interactive results: &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;eval.agenthunter.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>Sonnet 4.6 vs Haiku 4.5 vs Opus 4.6: I Tested 3 Claude Models on 10 Real Tasks</title>
      <dc:creator>James AI</dc:creator>
      <pubDate>Wed, 15 Apr 2026 02:38:11 +0000</pubDate>
      <link>https://dev.to/jamesai/sonnet-46-vs-haiku-45-vs-opus-46-i-tested-3-claude-models-on-10-real-tasks-4mn6</link>
      <guid>https://dev.to/jamesai/sonnet-46-vs-haiku-45-vs-opus-46-i-tested-3-claude-models-on-10-real-tasks-4mn6</guid>
      <description>&lt;p&gt;&lt;em&gt;Evaluated on April 15, 2026 using &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;AgentHunter Eval&lt;/a&gt; v0.3.1&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Which Claude model should you use for your agent? I tested the latest versions of all three on the same 10 tasks to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tasks Passed&lt;/th&gt;
&lt;th&gt;Avg Time&lt;/th&gt;
&lt;th&gt;Cost Tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus 4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.4s&lt;/td&gt;
&lt;td&gt;$$$$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.2s&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Haiku 4.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.9s&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only Opus 4.6 scored a perfect 10/10. Sonnet 4.6 and Haiku 4.5 both stumbled on the same task. Haiku was &lt;strong&gt;2.5x faster&lt;/strong&gt; than the other two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10 Tasks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Coding (5 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create a CLI tool&lt;/td&gt;
&lt;td&gt;PASS (4.7s)&lt;/td&gt;
&lt;td&gt;PASS (3.9s)&lt;/td&gt;
&lt;td&gt;PASS (2.4s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fix a sorting bug&lt;/td&gt;
&lt;td&gt;PASS (3.8s)&lt;/td&gt;
&lt;td&gt;PASS (5.8s)&lt;/td&gt;
&lt;td&gt;PASS (1.9s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analyze CSV data&lt;/td&gt;
&lt;td&gt;PASS (5.7s)&lt;/td&gt;
&lt;td&gt;PASS (4.4s)&lt;/td&gt;
&lt;td&gt;PASS (3.3s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write unit tests&lt;/td&gt;
&lt;td&gt;PASS (16.2s)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL&lt;/strong&gt; (14.6s)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL&lt;/strong&gt; (4.5s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refactor repetitive code&lt;/td&gt;
&lt;td&gt;PASS (4.9s)&lt;/td&gt;
&lt;td&gt;PASS (3.8s)&lt;/td&gt;
&lt;td&gt;PASS (2.4s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Writing &amp;amp; Docs (5 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write a professional email&lt;/td&gt;
&lt;td&gt;PASS (10.6s)&lt;/td&gt;
&lt;td&gt;PASS (10.3s)&lt;/td&gt;
&lt;td&gt;PASS (4.2s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize a technical doc&lt;/td&gt;
&lt;td&gt;PASS (8.8s)&lt;/td&gt;
&lt;td&gt;PASS (8.1s)&lt;/td&gt;
&lt;td&gt;PASS (3.6s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create a backup shell script&lt;/td&gt;
&lt;td&gt;PASS (6.0s)&lt;/td&gt;
&lt;td&gt;PASS (7.2s)&lt;/td&gt;
&lt;td&gt;PASS (3.0s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convert JSON to CSV&lt;/td&gt;
&lt;td&gt;PASS (8.5s)&lt;/td&gt;
&lt;td&gt;PASS (12.2s)&lt;/td&gt;
&lt;td&gt;PASS (4.6s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write a project README&lt;/td&gt;
&lt;td&gt;PASS (25.0s)&lt;/td&gt;
&lt;td&gt;PASS (32.0s)&lt;/td&gt;
&lt;td&gt;PASS (8.7s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Test writing is the hardest task for smaller models
&lt;/h3&gt;

&lt;p&gt;Both Sonnet 4.6 and Haiku 4.5 failed the "write unit tests" task. This task requires generating a test file with correct assertions against a provided calculator function — it demands precise understanding of both the source code and test framework patterns. Only Opus 4.6 handled it correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: For tasks requiring multi-file reasoning (reading source + generating corresponding tests), Opus is worth the cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Haiku 4.5 is absurdly fast
&lt;/h3&gt;

&lt;p&gt;At 3.9s average, Haiku is &lt;strong&gt;2.5x faster than Sonnet 4.6&lt;/strong&gt; (10.2s) and &lt;strong&gt;2.4x faster than Opus 4.6&lt;/strong&gt; (9.4s). For the README task, Haiku took 8.7s vs Sonnet's 32.0s — a 3.7x difference.&lt;/p&gt;

&lt;p&gt;With 9/10 pass rate at that speed, Haiku is the clear winner for high-volume, latency-sensitive workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Sonnet 4.6 is surprisingly slow
&lt;/h3&gt;

&lt;p&gt;Sonnet 4.6 averaged 10.2s — actually slower than Opus 4.6 (9.4s) while passing fewer tasks (9 vs 10). This is unexpected: Sonnet is supposed to be the balanced middle option, but on these tasks, Opus delivers better accuracy at comparable speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Writing tasks are easy for everyone
&lt;/h3&gt;

&lt;p&gt;All three models went a perfect 5/5 on the writing and documentation tasks. Emails, summaries, shell scripts, READMEs — no differentiation. The quality gap only appears on complex code-reasoning tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  My updated recommendation
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complex coding (tests, multi-file)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;Only model that passes all tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Simple coding + writing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;2.5x faster, 90% pass rate, cheapest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;General purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Good balance, but Haiku may be better for most tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @agenthunter/eval task &lt;span class="nt"&gt;-c&lt;/span&gt; task.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All evaluation data: &lt;a href="https://github.com/OrrisTech/agent-eval/tree/main/results" rel="noopener noreferrer"&gt;github.com/OrrisTech/agent-eval/tree/main/results&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full interactive results: &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;eval.agenthunter.io&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with &lt;a href="https://eval.agenthunter.io" rel="noopener noreferrer"&gt;AgentHunter Eval&lt;/a&gt; — the open-source AI agent evaluation platform. &lt;code&gt;npx @agenthunter/eval task&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>I Benchmarked 12 MCP Servers, Here's What I Found</title>
      <dc:creator>James AI</dc:creator>
      <pubDate>Tue, 14 Apr 2026 03:48:19 +0000</pubDate>
      <link>https://dev.to/jamesai/i-benchmarked-12-mcp-servers-heres-what-i-found-1124</link>
      <guid>https://dev.to/jamesai/i-benchmarked-12-mcp-servers-heres-what-i-found-1124</guid>
      <description>&lt;h1&gt;
  
  
  We Benchmarked 12 MCP Servers — Here's What We Found
&lt;/h1&gt;

&lt;p&gt;The Model Context Protocol (MCP) ecosystem has exploded — over 10,000 servers on the official registry, 97 million monthly SDK downloads. But which MCP servers are actually good?&lt;/p&gt;

&lt;p&gt;We built &lt;a href="https://github.com/OrrisTech/agent-eval" rel="noopener noreferrer"&gt;agent-eval&lt;/a&gt;, an open-source evaluation framework, and used it to benchmark 12 popular MCP servers across 5 dimensions: &lt;strong&gt;Capability&lt;/strong&gt;, &lt;strong&gt;Reliability&lt;/strong&gt;, &lt;strong&gt;Efficiency&lt;/strong&gt;, &lt;strong&gt;Safety&lt;/strong&gt;, and &lt;strong&gt;Developer Experience&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;For each server, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connected via stdio transport and discovered all available tools&lt;/li&gt;
&lt;li&gt;Used Claude to auto-generate test tasks based on each tool's schema&lt;/li&gt;
&lt;li&gt;Executed every task multiple times to measure reliability&lt;/li&gt;
&lt;li&gt;Scored output quality using LLM-as-judge (Claude Sonnet 4)&lt;/li&gt;
&lt;li&gt;Measured latency, success rate, and safety (prompt injection resistance)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All evaluation code is &lt;a href="https://github.com/OrrisTech/agent-eval" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. You can reproduce these results yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @agenthunter/eval init
npx @agenthunter/eval run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rankings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Reliability&lt;/th&gt;
&lt;th&gt;Efficiency&lt;/th&gt;
&lt;th&gt;Safety&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 🥇&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;context7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 🥈&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-fetch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 🥉&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;93&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;notion-mcp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Productivity&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-datetime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Utilities&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-everything&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reference&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-sequential-thinking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-filesystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filesystem&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;playwright-mcp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-sqlite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;63&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-git&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DevTools&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;55&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mcp-puppeteer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reliability varies wildly
&lt;/h3&gt;

&lt;p&gt;Of the 12 servers tested, 5 achieved 80%+ reliability. Five, however, fell below 50%: &lt;strong&gt;mcp-filesystem&lt;/strong&gt; (14%), &lt;strong&gt;playwright-mcp&lt;/strong&gt; (30%), &lt;strong&gt;mcp-sqlite&lt;/strong&gt; (10%), &lt;strong&gt;mcp-git&lt;/strong&gt; (4%), and &lt;strong&gt;mcp-puppeteer&lt;/strong&gt; (0%). Low reliability usually means the server crashes, times out, or returns errors for valid inputs.&lt;/p&gt;
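
&lt;p&gt;If you want to spot-check a flaky server yourself, one crude smoke test is to pipe a single JSON-RPC &lt;code&gt;initialize&lt;/code&gt; request into it over stdio (field values follow the 2024-11-05 MCP spec; a healthy server answers on stdout, a broken one errors out or hangs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Crude stdio smoke test against the memory server used in this benchmark
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke-test","version":"0.0.1"}}}' | npx -y @modelcontextprotocol/server-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;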

&lt;h3&gt;
  
  
  2. Efficiency is generally excellent
&lt;/h3&gt;

&lt;p&gt;Average latency across all servers was 491ms. 9/12 servers scored 90+ on efficiency, meaning sub-second response times. MCP's stdio transport is inherently fast since there's no network overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Safety scores reveal gaps
&lt;/h3&gt;

&lt;p&gt;9/12 servers scored a perfect 100 on safety. The three that didn't (mcp-memory at 89, mcp-everything at 97, and mcp-git at 98) dropped points on the prompt-injection and scope-violation checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Individual Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  context7
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 89/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 100%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1756ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 83 | Rel 100 | Eff 87 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-fetch
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Web&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 86/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 90%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 640ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 73 | Rel 90 | Eff 99 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-memory
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 82/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 9&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 27&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 93%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 63 | Rel 93 | Eff 100 | Safe 89 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  notion-mcp
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Productivity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 82/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 22&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 44&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 97%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 643ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 55 | Rel 97 | Eff 98 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-datetime
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Utilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 81/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 30&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 73%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 2ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 70 | Rel 73 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-everything
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Reference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 75/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 13&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 39&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 74%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 2621ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 66 | Rel 74 | Eff 78 | Safe 97 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-sequential-thinking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 71/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 100%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 15 | Rel 100 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-filesystem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Filesystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 68/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 14&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 28&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 14%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 73 | Rel 14 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  playwright-mcp
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 68/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 20&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 30%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 212ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 62 | Rel 30 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-sqlite
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 63/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 10%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 63 | Rel 10 | Eff 100 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-git
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: DevTools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 55/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 15&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 45&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 4%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 18ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 40 | Rel 4 | Eff 100 | Safe 98 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  mcp-puppeteer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt;: Browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 47/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools discovered&lt;/strong&gt;: 7&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks generated&lt;/strong&gt;: 14&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success rate&lt;/strong&gt;: 0%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avg latency&lt;/strong&gt;: 0ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breakdown&lt;/strong&gt;: Cap 51 | Rel 0 | Eff 50 | Safe 100 | DX 70&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Scores Are Calculated
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;What we measure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;Task completion rate + output quality (LLM-as-judge)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;Success rate across multiple runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Response latency (sub-500ms = 100, &amp;gt;10s = 0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Safety&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;Prompt injection resistance, scope violations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dev Experience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;Documentation quality, error messages, schema clarity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Overall Score&lt;/strong&gt; = weighted average of all dimensions, scaled to 0-100.&lt;/p&gt;
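
&lt;p&gt;As a sanity check, plugging context7's breakdown (Cap 83, Rel 100, Eff 87, Safe 100, DX 70) into those weights reproduces its published score:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;echo "0.30*83 + 0.25*100 + 0.20*87 + 0.15*100 + 0.10*70" | bc
# prints 89.30, which rounds to the published 89
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;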

&lt;h2&gt;
  
  
  Reproduce These Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OrrisTech/agent-eval
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-eval
bun &lt;span class="nb"&gt;install
&lt;/span&gt;bun run &lt;span class="nt"&gt;--filter&lt;/span&gt; agent-eval build

&lt;span class="c"&gt;# Evaluate a single server&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'agent:
  name: "mcp-memory"
  protocol: mcp
  endpoint: "npx -y @modelcontextprotocol/server-memory"
  capabilities: ["memory"]
eval:
  runs: 3'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; agent-eval.yaml

&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key npx @agenthunter/eval run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We're expanding to evaluate A2A agents and REST API agents. If you'd like your MCP server benchmarked, &lt;a href="https://github.com/OrrisTech/agent-eval/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; or submit a PR to our server list.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Evaluations run on 2026-04-14 using agent-eval v0.3.1. Scores may vary between runs due to LLM non-determinism. Full raw data available in the &lt;a href="https://github.com/OrrisTech/agent-eval/tree/main/results" rel="noopener noreferrer"&gt;results directory&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
