<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vilius</title>
    <description>The latest articles on DEV Community by Vilius (@vystartasv).</description>
    <link>https://dev.to/vystartasv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F133303%2F50baa34e-e011-4576-8b1a-5974d272fc34.jpg</url>
      <title>DEV Community: Vilius</title>
      <link>https://dev.to/vystartasv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vystartasv"/>
    <language>en</language>
    <item>
      <title>The End of the US Cloud Monopoly: AI Balkanization Is Here to Stay</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Sat, 13 Jun 2026 08:28:42 +0000</pubDate>
      <link>https://dev.to/vystartasv/the-end-of-the-us-cloud-monopoly-ai-balkanization-is-here-to-stay-4g68</link>
      <guid>https://dev.to/vystartasv/the-end-of-the-us-cloud-monopoly-ai-balkanization-is-here-to-stay-4g68</guid>
      <description>&lt;p&gt;By Vilius Vystartas | June 2026&lt;/p&gt;

&lt;p&gt;The single, globally unified internet is gone. What's replacing it is a patchwork of sovereign AI zones, each running its own stack on its own hardware with its own rules.&lt;/p&gt;

&lt;p&gt;This isn't a prediction. It's already happening, and the next three years will cement it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Broke the Monopoly
&lt;/h2&gt;

&lt;p&gt;The US government's approach to AI regulation — treating frontier model weights as controlled munitions — had an unintended consequence. By demonstrating that access to US cloud infrastructure can be switched off by regulatory decree, they forced every non-US government and enterprise to build a backup plan.&lt;/p&gt;

&lt;p&gt;The January 2025 AI Diffusion Rule created a three-tier world: unrestricted allies, capped nations (50,000 GPUs/year), and total embargoes. For the 140+ countries in Tier 2, US cloud services became inherently unstable. You can't build a national AI strategy on a faucet that might turn off.&lt;/p&gt;

&lt;p&gt;The DeepSeek R1 moment in January 2025 proved the point: a Chinese quant hedge fund trained a frontier reasoning model on nerfed hardware for $5.6 million. Export controls didn't stop the frontier. They just accelerated the development of independent stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Zones
&lt;/h2&gt;

&lt;p&gt;The tech industry is splitting into three distinct legal and architectural zones, each with its own economics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The US Zone&lt;/strong&gt; — high-performance, high-surveillance closed models. OpenAI, Anthropic, Google. Restricted to US citizens and close Tier 1 allies. The best models, the most monitoring, the least legal recourse if you're outside its borders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The European Zone&lt;/strong&gt; — regulated, open-source-first, locally hosted. Data privacy is the architecture, not a compliance checkbox. France's Mistral, Germany's Aleph Alpha, the fragmented but determined GAIA-X federation. GDPR compliance isn't overhead — it's the product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Asian/Non-Western Zone&lt;/strong&gt; — independent stacks operating entirely outside the Western financial and regulatory sphere. DeepSeek, Alibaba's Qwen, Baidu's Ernie. Huawei Ascend chips replacing NVIDIA. No US venture capital, no US cloud, no US export license risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sovereign AI Infrastructure Boom
&lt;/h2&gt;

&lt;p&gt;Every non-US government with ambition is building national AI infrastructure. Not optional. Existential.&lt;/p&gt;

&lt;p&gt;Europe is scattering exascale systems across the continent — Germany's JUPITER, Finland's LUMI, Italy's LEONARDO. Federated, fragmented, deliberately not dependent on any single provider.&lt;/p&gt;

&lt;p&gt;The Middle East is placing bigger bets. Saudi Arabia's $40 billion AI fund. UAE's G42 building Condor Galaxy on Cerebras hardware, then selling a $1.5 billion stake to Microsoft — on condition it cut Chinese ties. The message: even your sovereign compute comes with geopolitical strings.&lt;/p&gt;

&lt;p&gt;India's $1.25 billion IndiaAI Mission aims for 10,000+ GPUs through Yotta's Shakti Cloud. But at Tier 2's 50,000 GPU cap, the ambition outstrips the allocation.&lt;/p&gt;

&lt;p&gt;Japan's SoftBank committed nearly a billion to AI datacenters. ABCI 3.0 is operational with H100 clusters.&lt;/p&gt;

&lt;p&gt;Every single one runs some form of open-weight model. Because closed APIs can be switched off at a whim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Weights Won
&lt;/h2&gt;

&lt;p&gt;The question of whether enterprises would choose open-source models over closed APIs is settled. They will.&lt;/p&gt;

&lt;p&gt;Meta's Llama 3.1 405B proved open models can match GPT-4 class performance. Mistral proved European sovereignty models are commercially viable. DeepSeek proved frontier reasoning can be open. The entire ecosystem shifted from "can open models compete?" to "how do we productionize our chosen open model?"&lt;/p&gt;

&lt;p&gt;The calculus is simple: slightly lower benchmark scores in exchange for complete operational certainty. No API key to revoke. No pricing change that breaks your margin. No geopolitical event that cuts your access.&lt;/p&gt;

&lt;p&gt;This has driven massive innovation in on-device and on-premises deployment. Models under 70 billion parameters — often under 10 billion — that run on corporate hardware rather than centralized server farms. Microsoft's Phi-4, Apple's on-device models, Google's Gemma, Meta's Llama 3.2 small variants. The edge is where the action is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Death of the API Wrapper
&lt;/h2&gt;

&lt;p&gt;The venture capital correction is brutal but predictable. Startups whose entire value proposition was piping data to a third-party US API can't raise international funding. Their core product can be wiped out overnight by a single regulatory pen stroke — or a pricing change, or a model deprecation, or a geopolitical event.&lt;/p&gt;

&lt;p&gt;Jasper went from $1.5 billion valuation to significant layoffs. The entire "GPT wrapper" category is being reframed as a 2023-2024 anomaly.&lt;/p&gt;

&lt;p&gt;The new valuation premium isn't about who uses the flashiest model. It's about who owns the proprietary training data to build independent, in-house models. Palantir's stock surge, BloombergGPT's financial data moat, healthcare AI companies valued on unique patient datasets — the market is betting on data ownership, not API access.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Nobody's Saying About the Hardware
&lt;/h2&gt;

&lt;p&gt;This entire scenario depends on one unresolved bottleneck: TSMC manufactures over 90% of advanced AI chips.&lt;/p&gt;

&lt;p&gt;The US CHIPS Act ($52.7 billion), the European Chips Act (€43 billion), and TSMC's own global fab expansion (Arizona 4nm, Japan operational, Germany planned) are all trying to address this. But fabrication takes years. The RISC-V ecosystem is promising but a decade behind CUDA in maturity and tooling.&lt;/p&gt;

&lt;p&gt;The real risk isn't that US export controls will stop frontier AI development. DeepSeek proved they won't. The risk is that the hardware supply chain itself becomes a weapon — and every sovereign AI zone discovers that independence at the architectural level means nothing without independence at the fab level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-5 Year Outlook
&lt;/h2&gt;

&lt;p&gt;The trajectory is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;US cloud APIs become a premium product for US-aligned customers only&lt;/li&gt;
&lt;li&gt;Open-weight models become the global enterprise default&lt;/li&gt;
&lt;li&gt;National datacenters proliferate in every region that can afford them&lt;/li&gt;
&lt;li&gt;Data ownership replaces model access as the primary valuation driver&lt;/li&gt;
&lt;li&gt;The supply chain question remains the unresolved wildcard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a temporary fragmentation that heals with better policy. It's a permanent structural shift. The unified global technology ecosystem that defined the last two decades is over. The question isn't whether the balkanization happens — it's whether your infrastructure is ready for it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cloud</category>
      <category>opensource</category>
      <category>devtools</category>
    </item>
    <item>
      <title>We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 26 May 2026 22:46:48 +0000</pubDate>
      <link>https://dev.to/vystartasv/we-asked-10-llms-to-write-efficient-code-only-4-got-better-47gf</link>
      <guid>https://dev.to/vystartasv/we-asked-10-llms-to-write-efficient-code-only-4-got-better-47gf</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every LLM can write code that works. The question is: can they write code that's &lt;em&gt;efficient&lt;/em&gt; — and does telling them to be efficient actually help?&lt;/p&gt;

&lt;p&gt;I tested 10 models on 10 coding tasks, each in two phases: &lt;strong&gt;unprompted&lt;/strong&gt; (the model writes its own code) and &lt;strong&gt;prompted&lt;/strong&gt; (explicitly told to write clean, DRY, efficient code). That's 200 API calls, $0.56 total. The results are... not what most prompt engineers would predict.&lt;/p&gt;

&lt;p&gt;GPT-5.4 was the only model where prompting gave a substantial boost (+0.20). For most models, the "write efficient code" prompt was meaningless or actively harmful.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Metric Works
&lt;/h2&gt;

&lt;p&gt;Each task has a known &lt;strong&gt;optimal token budget&lt;/strong&gt; — the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The &lt;strong&gt;efficiency score&lt;/strong&gt; is &lt;code&gt;optimal_tokens / actual_tokens&lt;/code&gt;, capped at 1.0.&lt;/p&gt;

&lt;p&gt;A score of 0.63 means the model used about 1.6x the optimal — not bad. A score of 0.43 means it used about 2.3x the optimal. The gap between unprompted and prompted tells you whether the "write efficient code" instruction actually changes behaviour.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Leaderboard (Sorted by Prompted Efficiency)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Unprompted&lt;/th&gt;
&lt;th&gt;Prompted&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;Frugal&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Correctness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇&lt;/td&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.63&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.20&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;$0.096&lt;/td&gt;
&lt;td&gt;78% → 85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;Qwen 3.6 Plus&lt;/td&gt;
&lt;td&gt;0.44&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.60&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+0.17&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;$0.158&lt;/td&gt;
&lt;td&gt;78% → 87%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;0.54&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.58&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.003&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92% both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;DeepSeek Chat&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;td&gt;91% → 80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;$0.121&lt;/td&gt;
&lt;td&gt;92% both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;LFM 2 24B A2B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.54&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;-0.06&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;$0.001&lt;/td&gt;
&lt;td&gt;90% → 80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Mistral Large 2411&lt;/td&gt;
&lt;td&gt;0.54&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;td&gt;-0.08&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;$0.050&lt;/td&gt;
&lt;td&gt;90% → 82%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;td&gt;-0.01&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;$0.020&lt;/td&gt;
&lt;td&gt;92% → 90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Cohere Command A&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.60&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.44&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-0.17&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;$0.071&lt;/td&gt;
&lt;td&gt;90% → 82%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;+0.09&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;$0.029&lt;/td&gt;
&lt;td&gt;76% → 86%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Stands Out
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GPT-5.4 Is the Prompt Whisperer
&lt;/h3&gt;

&lt;p&gt;GPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were &lt;strong&gt;config-generation&lt;/strong&gt; (+0.81 — went from 12 inline JSON blocks to a template loop), &lt;strong&gt;html-from-data&lt;/strong&gt; (+0.71), and &lt;strong&gt;magic-strings&lt;/strong&gt; (+0.38 — switched to an Enum). It's the only model in the batch where the "write efficient code" instruction consistently produces different (and better) output.&lt;/p&gt;

&lt;p&gt;The cost is notable — $0.10 for 20 tasks is mid-range, not cheap, not expensive. But the efficiency gain is real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemma 4 31B: The Quiet Winner
&lt;/h3&gt;

&lt;p&gt;Half of Gemma 4's tasks were already "frugal" — naturally efficient without being told. It scored 92% correctness on both phases at just $0.003 total. That's a 40x cost advantage over GPT-5.4 with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cohere Command A: Prompting Backfires
&lt;/h3&gt;

&lt;p&gt;Cohere Command A had the &lt;strong&gt;highest unprompted efficiency&lt;/strong&gt; in the batch (0.60) — it naturally writes concise code. But when told "write efficient code," it ballooned output on several tasks. &lt;strong&gt;html-from-data&lt;/strong&gt; went from a tight 45-token solution to a 600+-token monstrosity (-0.92 gap). The prompt made it overthink.&lt;/p&gt;

&lt;p&gt;Lesson: if a model is already efficient, don't prompt it to be more efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen 3.6 Plus: Second Place, Slowest
&lt;/h3&gt;

&lt;p&gt;Qwen 3.6 Plus scored second in prompted efficiency (+0.17 improvement) but took &lt;strong&gt;26 minutes&lt;/strong&gt; for 20 tasks — by far the slowest model. The efficiency gain is real (especially on html-from-data where it went from hardcoded rows to a map/join pattern), but you're waiting for it. Batch workloads only.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Kimi Surprise
&lt;/h3&gt;

&lt;p&gt;Kimi K2.6 had the lowest unprompted efficiency (0.34 — verbose, boilerplate-heavy code) but improved the most at the bottom end (+0.09). Still last place, but the prompt actually helped it compress — which is the opposite of the Cohere effect. Some models need the nudge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Frugality: What Does It Mean?
&lt;/h3&gt;

&lt;p&gt;"Frugal" means the model naturally produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50% — half their tasks were already efficient. GPT-5.4, DeepSeek Chat, and Kimi K2.6 were only 30% frugal — they needed the prompt to tighten up.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;Models&lt;/th&gt;
&lt;th&gt;Behaviour&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt-responsive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-5.4, Qwen 3.6 Plus&lt;/td&gt;
&lt;td&gt;Efficiency improves substantially with prompting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt-neutral&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6&lt;/td&gt;
&lt;td&gt;Prompt has little effect (±0.04)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt-antagonistic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LFM 2 24B A2B, Mistral Large 2411, Cohere Command A&lt;/td&gt;
&lt;td&gt;Efficiency &lt;em&gt;drops&lt;/em&gt; when prompted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The prompt-antagonistic group is the most interesting. These models know how to write efficient code (0.54-0.60 unprompted), but the explicit instruction triggers over-engineering — they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric.&lt;/p&gt;

&lt;p&gt;If the prompt says "write efficient code" and the model responds by writing &lt;em&gt;more&lt;/em&gt; tokens, something in the training signal is misaligned.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Picks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best prompted efficiency:&lt;/strong&gt; GPT-5.4 — 0.63, $0.10 for 20 tasks. The only model where prompting reliably improves output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best value overall:&lt;/strong&gt; Gemma 4 31B — 0.58 prompted, 92% correctness, $0.003. Absurd price/performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best natural efficiency:&lt;/strong&gt; Cohere Command A — 0.60 unprompted. Don't prompt it, just let it work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most consistent:&lt;/strong&gt; Claude Sonnet 4 — 92% correctness on both phases, small +0.04 efficiency gain. Reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip if you're in a hurry:&lt;/strong&gt; Qwen 3.6 Plus — 26 minutes for 20 tasks. Great efficiency gains, terrible latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch list:&lt;/strong&gt; Kimi K2.6 — low base efficiency but the prompt actually helps. Worth retesting with a better prompt.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash — each with a known optimal token budget for a correct, DRY solution. Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and middleware decorator pattern (Python/Flask).&lt;/p&gt;

&lt;p&gt;Each model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended. Scoring: &lt;strong&gt;efficiency_ratio = optimal_tokens / actual_tokens&lt;/strong&gt; (capped at 1.0). Correctness scored against expected output patterns.&lt;/p&gt;

&lt;p&gt;Total cost: &lt;strong&gt;$0.56&lt;/strong&gt; for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. Max tokens: 600.&lt;/p&gt;

&lt;p&gt;Full results: &lt;a href="https://benchmarks.workswithagents.dev" rel="noopener noreferrer"&gt;benchmarks.workswithagents.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>programming</category>
    </item>
    <item>
      <title>10 Models Tested: From 81.6% to 10%. The Free Tier is a Full-On Gamble.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 26 May 2026 22:42:59 +0000</pubDate>
      <link>https://dev.to/vystartasv/10-models-tested-from-816-to-10-the-free-tier-is-a-full-on-gamble-4kfc</link>
      <guid>https://dev.to/vystartasv/10-models-tested-from-816-to-10-the-free-tier-is-a-full-on-gamble-4kfc</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I tested another 10 models across the same 10 agent coding tasks. Four of them were free-tier models — and the range was absurd: Owl Alpha scored 76.7% with zero hard fails, Laguna M.1 scored 10% and produced garbage on 9 out of 10 tasks. The free tier is not free if it costs you debugging time.&lt;/p&gt;

&lt;p&gt;Total cost for all 10 models: &lt;strong&gt;$0.10&lt;/strong&gt;. The paid models (6 of 10) came to $0.10 combined.&lt;/p&gt;




&lt;h2&gt;
  
  
  Batch 12 Leaderboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;P/P/F&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Grok 4.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7/3/0&lt;/td&gt;
&lt;td&gt;$0.017&lt;/td&gt;
&lt;td&gt;39.9s&lt;/td&gt;
&lt;td&gt;Paid (xAI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;Perceptron Mk1&lt;/td&gt;
&lt;td&gt;79.9%&lt;/td&gt;
&lt;td&gt;8/1/1&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;29.3s&lt;/td&gt;
&lt;td&gt;Paid (Perceptron)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;Owl Alpha (free)&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;td&gt;5/5/0&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;83.0s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;xAI: Grok Build 0.1&lt;/td&gt;
&lt;td&gt;75.0%&lt;/td&gt;
&lt;td&gt;5/4/1&lt;/td&gt;
&lt;td&gt;$0.034&lt;/td&gt;
&lt;td&gt;95.3s&lt;/td&gt;
&lt;td&gt;Paid (xAI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;OpenAI: GPT Chat Latest&lt;/td&gt;
&lt;td&gt;73.3%&lt;/td&gt;
&lt;td&gt;6/2/2&lt;/td&gt;
&lt;td&gt;$0.043&lt;/td&gt;
&lt;td&gt;18.7s&lt;/td&gt;
&lt;td&gt;Paid (OpenAI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Mistral Medium 3.5&lt;/td&gt;
&lt;td&gt;71.6%&lt;/td&gt;
&lt;td&gt;6/2/2&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;td&gt;12.6s&lt;/td&gt;
&lt;td&gt;Paid (Mistral)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Nemotron 3 Nano Omni (free)&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;4/2/4&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;23.5s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Laguna XS.2 (free)&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;td&gt;3/3/4&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;28.7s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Baidu CoBuddy (free)&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;4/0/6&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;362.4s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Laguna M.1 (free)&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;td&gt;1/0/9&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;89.8s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Headlines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Grok 4.3 (81.6%, $0.017, 39.9s)&lt;/strong&gt; — Grok's latest release takes the batch with zero hard fails. Seven clean passes, three partials. Process-monitor was the only full pass it earned that 4.3's competitors missed. xAI's Grok line is quietly consistent — 4.1 Fast (76.7%), 4.20 (75%), and now 4.3 (81.6%) — all within striking distance of the 80%+ club without crossing into premium pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perceptron Mk1 (79.9%, $0.002, 29.3s)&lt;/strong&gt; — A brand new family debuts at nearly 80%, with eight passes — the most in the batch — for two-tenths of a cent. The one failure (regex-extract at 17%) is a known weakness for small models. At this price-to-pass ratio, Perceptron Mk1 is the value story of this batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Owl Alpha (free, 76.7%, 83.0s)&lt;/strong&gt; — A free model with zero hard fails and 5 full passes. That's the standout free-tier result. Takes 2x longer than paid models for some tasks (24s on csv-stats vs 1-3s for the field), but the code is functional. If latency isn't critical, this is usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Free Tier Lottery
&lt;/h2&gt;

&lt;p&gt;Four free models. Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Owl Alpha&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Usable&lt;/strong&gt; — zero hard fails, 5/10 full passes. Slow but functional.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron 3 Nano Omni&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Mixed&lt;/strong&gt; — half of tasks hit output cap at 400 tokens. Hit or miss.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laguna XS.2&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Unreliable&lt;/strong&gt; — 400-token cap kills complex responses.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Baidu CoBuddy&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Frustrating&lt;/strong&gt; — 362 seconds total. Half the tasks hit output cap at 399 tokens. Waiting 6 minutes for 40% accuracy is not a good trade.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laguna M.1&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Broken&lt;/strong&gt; — 1/10 passes. Every response capped at 400 tokens. Do not use.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The free tier cap of 399-400 output tokens is the real problem. Models like Laguna M.1 and CoBuddy truncate every response, turning what could be a partial into a fail. Owl Alpha works despite the cap because its outputs are concise enough to fit.&lt;/p&gt;

&lt;p&gt;Pay $0.002 for Perceptron Mk1 and get 8/10 passes, or use Laguna M.1 free and get 1/10. The math is not subtle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Disappointments
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPT Chat Latest (73.3%, $0.043)&lt;/strong&gt; — OpenAI's catch-all endpoint was solid on easy tasks (file-parse, csv-stats, sql-query all passed) but fell apart on fix-bug (0%) with a lengthy, expensive hallucination. The most expensive model in the batch and it doesn't crack 75%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistral Medium 3.5 (71.6%, $0.008)&lt;/strong&gt; — Fastest model in the batch at 12.6s total, but the process-monitor task hit a 504 Gateway Timeout and scored 0%. A timeout fail on a model that otherwise looks strong carries a disproportionate penalty — without it, Medium 3.5 would be at 79.5%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Laguna M.1 (10%)&lt;/strong&gt; — The worst score in any batch I've run. Seven of its task responses were blank 400-token output cap fills. Not worth listing on OpenRouter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Price/Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;$/%-pt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Owl Alpha (free)&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron 3 Nano Omni (free)&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laguna XS.2 (free)&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Baidu CoBuddy (free)&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laguna M.1 (free)&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perceptron Mk1&lt;/td&gt;
&lt;td&gt;79.9%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;$0.0024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Medium 3.5&lt;/td&gt;
&lt;td&gt;71.6%&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;td&gt;$0.0108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.3&lt;/td&gt;
&lt;td&gt;81.6%&lt;/td&gt;
&lt;td&gt;$0.017&lt;/td&gt;
&lt;td&gt;$0.0209&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xAI: Grok Build 0.1&lt;/td&gt;
&lt;td&gt;75.0%&lt;/td&gt;
&lt;td&gt;$0.034&lt;/td&gt;
&lt;td&gt;$0.0450&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT Chat Latest&lt;/td&gt;
&lt;td&gt;73.3%&lt;/td&gt;
&lt;td&gt;$0.043&lt;/td&gt;
&lt;td&gt;$0.0584&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Free models dominate the $/%-pt table by definition, but only Owl Alpha is actually usable. Among paid models, Perceptron Mk1 at $0.0024/%-pt is the efficiency winner — 24x cheaper per point than GPT Chat Latest.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Picks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best overall:&lt;/strong&gt; Grok 4.3 — 81.6%, 39.9s, $0.017. Cleanest leaderboard of the batch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best value (paid):&lt;/strong&gt; Perceptron Mk1 — 79.9%, $0.002 total. Eight passes for two-tenths of a cent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best free model:&lt;/strong&gt; Owl Alpha — 76.7%, zero hard fails. The only free model I'd ship with in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fastest:&lt;/strong&gt; Mistral Medium 3.5 — 12.6s for all 10 tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip entirely:&lt;/strong&gt; Laguna M.1 and all Laguna free-tier variants. 10% is not testable.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Same setup as previous batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 400. Temperature: 0.1. Pattern-matching scoring against expected outputs.&lt;/p&gt;

&lt;p&gt;Pre-flight verification caught zero failures this batch. Total cost: &lt;strong&gt;$0.10&lt;/strong&gt;. Total dataset: &lt;strong&gt;168 models tested&lt;/strong&gt; across cloud and local.&lt;/p&gt;

&lt;p&gt;Full results and per-task scores: &lt;a href="https://benchmarks.workswithagents.dev" rel="noopener noreferrer"&gt;benchmarks.workswithagents.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 26 May 2026 18:48:33 +0000</pubDate>
      <link>https://dev.to/vystartasv/i-tested-10-more-models-five-brand-new-families-debuted-none-scored-below-75-9fj</link>
      <guid>https://dev.to/vystartasv/i-tested-10-more-models-five-brand-new-families-debuted-none-scored-below-75-9fj</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I ran another 10 models through the same agent coding benchmark. Five of them were from completely untested families — Sao10k, Anthracite, Inflection, Mancer, Undi95 — and every single one scored 75% or higher on its first try. This is getting harder to keep up with.&lt;/p&gt;

&lt;p&gt;Two more models tied the all-time record at 90%. The cheapest model ever tested cost $0.0001 for a full 10-task benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  The New 90% Club Members
&lt;/h2&gt;

&lt;p&gt;Eight models have now hit 90% on this benchmark. Batch 11 added two:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistral Large 2411 (90%, $0.008, 46s)&lt;/strong&gt; — Mistral's November 2024 flagship matches their current Large 3. Sometimes the first version is still the best one. Zero hard fails, clean passes on 8/10 tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek Chat V3-0324 (90%, $0.002, 73s)&lt;/strong&gt; — The older V3 variant from March 2024 matches the original DeepSeek Chat at 90%. Every time I test a DeepSeek variant, it lands at 80-90%. The family is remarkably consistent.&lt;/p&gt;

&lt;p&gt;The 90% club now includes: DeepSeek Chat (original), DeepSeek Chat V3-0324, Qwen3 Coder 30B, Nemotron 3 Nano 30B, Codestral 2508, Mistral Large 2411, MiniMax M2 Her, and Baidu Ernie 4.5 300B. Eight models. Seven of them cost less than a cent per full benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Families, First Try
&lt;/h2&gt;

&lt;p&gt;Every new family debuted at 75% or higher. That's an impressive hit rate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sao10k&lt;/td&gt;
&lt;td&gt;L3.1 Euryale 70B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;29s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sao10k&lt;/td&gt;
&lt;td&gt;L3 Lunaris 8B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthracite&lt;/td&gt;
&lt;td&gt;Magnum V4 72B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;td&gt;35s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mancer&lt;/td&gt;
&lt;td&gt;Weaver&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;30s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Undi95&lt;/td&gt;
&lt;td&gt;Remm Slerp L2 13B&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;31s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inflection&lt;/td&gt;
&lt;td&gt;Inflection 3 Productivity&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;$0.012&lt;/td&gt;
&lt;td&gt;42s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*Inflection 3 result is provisional — awaiting lab response. Will update in due course.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L3 Lunaris 8B at $0.0001 is the cheapest model I've ever tested.&lt;/strong&gt; A full 10-task benchmark for one ten-thousandth of a dollar. At this price, there's no reason not to test a model before you ship with it. Lunaris scored 85% — competitive with models that cost 100x more.&lt;/p&gt;

&lt;p&gt;The Sao10k family (L3.1 Euryale 70B and L3 Lunaris 8B) is the standout. Both models scored 85%, both are fine-tunes of Llama 3.1/3, and both cost almost nothing. Community fine-tunes continue to punch above their weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Recoveries
&lt;/h2&gt;

&lt;p&gt;Two Qwen models from my previous failed batch completed successfully this time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3 8B (80%, $0.02, 543s)&lt;/strong&gt; — Needed &lt;code&gt;per_call_timeout: 300&lt;/code&gt; to finish. The model is competent (6 passes, 4 partials, zero fails) but painfully slow. Each API call takes 100-120 seconds on OpenRouter. Use it as a background job, not a real-time agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen Plus 2025-07-28 (80%, $0.001, 19s)&lt;/strong&gt; — The dated variant works perfectly with &lt;code&gt;enable_thinking: false&lt;/code&gt;. 80% at $0.0009 is great value. But use the current &lt;code&gt;qwen/qwen-plus&lt;/code&gt; ID instead — it scores 85% and doesn't need the dated suffix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Price/Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;$/%-pt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L3 Lunaris 8B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.0001&lt;/td&gt;
&lt;td&gt;$0.0001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek Chat V3-0324&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;$0.0017&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3.1 Euryale 70B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;$0.0021&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remm Slerp L2 13B&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;$0.0020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mancer Weaver&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;$0.0041&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthracite Magnum V4 72B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;td&gt;$0.0066&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Large 2411&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;td&gt;$0.0093&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inflection 3 Productivity&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;$0.012&lt;/td&gt;
&lt;td&gt;$0.0156&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;$0.020&lt;/td&gt;
&lt;td&gt;$0.0254&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ratio between cheapest and most expensive $/%-pt is 254x. Lunaris at $0.0001/%-pt vs Qwen3 8B at $0.0254/%-pt — same tier of score, wildly different cost profiles.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Picks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best overall:&lt;/strong&gt; Mistral Large 2411 — 90%, 46s, $0.008&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best value:&lt;/strong&gt; L3 Lunaris 8B — 85%, $0.0001 total. Absurd price/performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best new family debut:&lt;/strong&gt; Sao10k — both models at 85% first try. Watch this line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fastest:&lt;/strong&gt; L3 Lunaris 8B — 20 seconds for all 10 tasks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Same setup as the previous 10 batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 600 (Qwen models), 300 (everyone else). Temperature: 0.1. Pattern-matching scoring against expected outputs.&lt;/p&gt;

&lt;p&gt;Pre-flight verification caught zero failures this batch. All 10 candidates passed the simple-prompt test. Total cost: $0.05 for the core 8 models, then $0.02 for the Qwen recovery run. Total dataset: &lt;strong&gt;158 models tested&lt;/strong&gt; across cloud and local.&lt;/p&gt;

&lt;p&gt;Full results and per-task scores: &lt;a href="https://benchmarks.workswithagents.dev" rel="noopener noreferrer"&gt;benchmarks.workswithagents.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
    <item>
      <title>Two Models Just Hit 90% on Agent Coding. One Cost Less Than a Penny.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 26 May 2026 09:46:02 +0000</pubDate>
      <link>https://dev.to/vystartasv/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny-12d2</link>
      <guid>https://dev.to/vystartasv/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny-12d2</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ten more models through the same 10 agent coding tasks. Two tied the all-time record. One cost $0.0002. The other hit the score at $0.0018 — cheaper than most models scoring 70%.&lt;/p&gt;

&lt;p&gt;Batch 10 was the cheapest one yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Leaders
&lt;/h2&gt;

&lt;p&gt;Two models scored 90% with zero hard fails, joining MiniMax M2 Her and Baidu Ernie 4.5 300B as the highest-scoring models on this benchmark:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3 Coder 30B A3B&lt;/strong&gt; — 90% in 28 seconds, $0.0004. An efficient coder that doesn't burn budget on thinking tokens it doesn't need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek Chat (original)&lt;/strong&gt; — 90% in 59 seconds, $0.0018. The original DeepSeek Chat still competes with modern models on agent coding. Newer doesn't always mean better.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Surprises
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LFM 2 24B A2B (85%, $0.0002, 15s) is the cheapest model I've ever tested.&lt;/strong&gt; Liquid's debut family is absurdly cost-effective. A full 10-task benchmark for literally $0.0002. At this price/performance ratio, there's no excuse not to test a model before committing to a more expensive alternative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistral Small 3.2 (85%, $0.0004)&lt;/strong&gt; is a clear upgrade. The Small line went 75% → 85% across versions — a ten-point jump at the same budget tier. Mistral keeps improving the right things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3 14B scored 0% across all 10 tasks.&lt;/strong&gt; Mandatory thinking mode that can't be suppressed at 300 tokens means every request times out before producing output. Skip for agent coding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cydonia 24B V4.1 (80%, $0.001)&lt;/strong&gt; debuts a new family from TheDrummer. Zero hard fails. Watch this one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Duds
&lt;/h2&gt;

&lt;p&gt;Qwen3.7 Max (85%, $0.13, 295 seconds) scored the same as budget models costing 300x less. Thinking mode tax at work — the accuracy is there, but you'll wait five minutes and pay for every second.&lt;/p&gt;

&lt;p&gt;Claude Opus 4 (80%, $0.10, 76s) had one hard fail. For a top-tier premium model at $0.10 per 10 tasks, that's below expectations. It's not a bad model — it's overkill for agent coding at a tight token budget.&lt;/p&gt;

&lt;p&gt;Aion 1.0 (80%) had two hard fails and was the slowest at 160 seconds. The architecture is interesting, but it's not ready for production agent work.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Picks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best overall:&lt;/strong&gt; Qwen3 Coder 30B A3B — 90%, 28s, $0.0004&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best value:&lt;/strong&gt; LFM 2 24B A2B — 85%, $0.0002 total. Ridiculous price/performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fastest:&lt;/strong&gt; LFM 2 24B A2B — 15 seconds flat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most improved:&lt;/strong&gt; Mistral Small 3.2 — 75% → 85% across versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip entirely:&lt;/strong&gt; Qwen3 14B for agent tasks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing — tested against each model via OpenRouter. Max tokens: 300. Temperature: 0.1. Results scored by pattern matching against expected outputs. Pre-flight verification caught 2 models (Ernie 4.5 21B — HTTP 429, Trinity Mini — empty content) before they wasted the batch.&lt;/p&gt;

&lt;p&gt;Total batch cost: $0.14 across 9 models. Qwen3.7 Max alone accounted for $0.13 of that — thinking tax.&lt;/p&gt;

&lt;p&gt;Total models tested: 148 (up from 138).&lt;/p&gt;

&lt;p&gt;Full results and per-task scores: &lt;a href="https://benchmarks.workswithagents.dev" rel="noopener noreferrer"&gt;benchmarks.workswithagents.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because you should.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Hype Correction</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Sat, 23 May 2026 10:47:25 +0000</pubDate>
      <link>https://dev.to/vystartasv/the-hype-correction-54mc</link>
      <guid>https://dev.to/vystartasv/the-hype-correction-54mc</guid>
      <description>&lt;p&gt;&lt;em&gt;Weekly roundup, May 23, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Google and Microsoft just told us the same thing from opposite directions.&lt;/p&gt;

&lt;p&gt;Google IO this week was an AI firehose. &lt;a href="https://blog.google/technology/ai/" rel="noopener noreferrer"&gt;Gemini 3.5 Flash&lt;/a&gt; — faster, cheaper, "better at agentic tasks." &lt;a href="https://workspace.google.com/blog/" rel="noopener noreferrer"&gt;Gemini Spark&lt;/a&gt; — an agent that orchestrates other agents. Omni video gen. The Ultra tier dropped from $250/month to $100. The keynote ran for hours. I watched excerpts. I'm still tired.&lt;/p&gt;

&lt;p&gt;And then there's &lt;a href="https://www.theverge.com/microsoft" rel="noopener noreferrer"&gt;Microsoft&lt;/a&gt;, quietly removing the Copilot button from Office 365. Letting users remap the Copilot key on their keyboards. Two companies. Same message. Different volume.&lt;/p&gt;

&lt;p&gt;The message is: AI costs real money, and someone has to pay for it.&lt;/p&gt;




&lt;p&gt;Google is still spending. The IO presentation felt like a company that hasn't blinked yet — or can't afford to look like it's blinking. But look closer. Free tier usage is now capped by compute, not prompt count. The bean counters are in the room. They're just not on stage.&lt;/p&gt;

&lt;p&gt;Microsoft blinked first. The Copilot button that was mandatory hardware is now optional. The Office integration nobody asked for is being rolled back. Not because Copilot doesn't work — because shoving it into every surface made people resent it. Forced adoption. The worst kind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/@LinusTechTips" rel="noopener noreferrer"&gt;The WAN Show&lt;/a&gt; put it well: "Are the cracks showing that AI might not be what the people want?" The answer was no — but forcing people to use it in ONE particular way is deeply unpopular. People want agency. Even with their AI agents.&lt;/p&gt;




&lt;p&gt;Meanwhile, the economics are reshaping faster than the demos.&lt;/p&gt;

&lt;p&gt;xAI is renting its entire GPU capacity to Anthropic. &lt;a href="https://www.bloomberg.com/technology" rel="noopener noreferrer"&gt;$1.25 billion a month&lt;/a&gt;. For three years. Elon built the data center without a plan. Anthropic needs the compute. The deal makes sense for both. But it also tells you something about where the money is going — not to training new models. To running them.&lt;/p&gt;

&lt;p&gt;Anthropic is projecting its &lt;a href="https://www.reuters.com/technology/" rel="noopener noreferrer"&gt;first profitable quarter&lt;/a&gt;. That's a bigger milestone than any benchmark score. It means someone figured out how to sell AI without losing money on every inference. Nobody else has done it yet. OpenAI hasn't. Google's AI division hasn't. Anthropic might be first.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The pressing issues nobody's fixed yet:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent reliability.&lt;/strong&gt; Every keynote shows agents doing things. Few of them work reliably outside the demo. Gemini 3.5 Flash claims better agentic performance. I'll believe it when an agent books a haircut without hallucinating the date, the time, and the hairdresser's name. We're still at the stage where the impressive part is that it almost worked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context management.&lt;/strong&gt; Agents forget. Not "forget" in the cute way. Forget in the way where you build a multi-session workflow, and somewhere around turn 40 the agent starts responding to a conversation from three days ago. Memory is still unsolved. Everyone's adding more context. Nobody's figured out what to remove.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cost of "free."&lt;/strong&gt; Google's free tier cap change matters. It means the free-usage era is ending — not with an announcement, but with a restriction. More complex prompts hit the cap faster. Try asking an agent to do real work and you'll notice. The meter is running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The evaluation gap.&lt;/strong&gt; We have no good way to measure whether an agent is actually better. Perplexity scores are meaningless for multi-turn tool use. Every company has internal evals. None of them agree. The market is pricing models as if better benchmarks mean better agents. They don't.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What I'm watching:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google's Gemini Spark — the agent orchestrator. The idea of one agent delegating to others isn't new. But Google baking it into Workspace is. If it works, it's the first time an AI demo about "managing your RSVPs" actually saves time instead of creating it. Big if.&lt;/p&gt;

&lt;p&gt;The xAI/Anthropic deal. Three years of guaranteed compute at this price is a structural shift. It means Anthropic bet that inference costs will go down — or that they can charge enough to cover them. Either way, it sets a floor on what running agents at scale actually costs.&lt;/p&gt;

&lt;p&gt;Microsoft's retreat. Not because Copilot is failing. Because forced adoption is. The lesson: build something people choose to use, or don't build it at all. The Copilot key was a bet that default placement beats user preference. It lost.&lt;/p&gt;




&lt;p&gt;The week wasn't about breakthroughs. It was about gravity.&lt;/p&gt;

&lt;p&gt;Money, attention, and patience are finite. The AI boom spent two years ignoring that. This week, it started paying attention.&lt;/p&gt;

&lt;p&gt;Gravity doesn't break things. It just makes them settle where they belong. The hype cycle is finally in freefall — and for the first time, that's good news. The companies that survive the correction won't be the ones with the best demos. They'll be the ones that figured out how to make this stuff actually work, at a price the market can stomach.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>devtools</category>
    </item>
    <item>
      <title>The Agent Told Me To</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Thu, 21 May 2026 14:29:15 +0000</pubDate>
      <link>https://dev.to/vystartasv/008-and-3500-lines-the-complete-failure-of-a-deterministic-agent-harness-5ki</link>
      <guid>https://dev.to/vystartasv/008-and-3500-lines-the-complete-failure-of-a-deterministic-agent-harness-5ki</guid>
      <description>&lt;p&gt;I have a theory about why agent suggestions land so heavy.&lt;/p&gt;

&lt;p&gt;It's not that the suggestions are good. Half of them are terrible — wrong approach, wrong abstraction, wrong thing to build entirely. But they have impact. A colleague says "maybe we should add a reconciliation layer" and you nod and continue. An agent says "a reconciliation layer with idempotency keys and a ledger table" and you're already opening a new repo.&lt;/p&gt;

&lt;p&gt;Same suggestion. Different weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  The confidence gap
&lt;/h2&gt;

&lt;p&gt;Agents don't hedge. They don't trail off. They don't say "I'm not sure about this but." They say "here's what you should build" with the same certainty they use to say "the sky is blue."&lt;/p&gt;

&lt;p&gt;A person who sounds that confident is either an expert or a fool. We know which one we're listening to. An agent that sounds that confident could be either — and that ambiguity works in its favour. The articulate wrong answer beats the hesitant right one every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed as a trap
&lt;/h2&gt;

&lt;p&gt;The agent gave you a file structure, three migration scripts, and a Dockerfile in one response. You can push a PR in ten minutes. The cost of trying is zero.&lt;/p&gt;

&lt;p&gt;Except it's not zero. It's attention. It's context. It's the next six weeks of your life explaining to people why this exists. It's the maintenance load of something nobody asked for that now has tests, docs, and a deployment pipeline because it was easier to ship than to evaluate.&lt;/p&gt;

&lt;p&gt;The cost of building is never zero. But the agent made the first 80% free, and that's the part your brain sees.&lt;/p&gt;

&lt;h2&gt;
  
  
  Responsibility diffusion
&lt;/h2&gt;

&lt;p&gt;When you build from your own idea and it fails, that's on you. The whole thing. The meeting where you pitch it, the architecture you chose, the reason it existed.&lt;/p&gt;

&lt;p&gt;When you build from the agent's idea and it fails, the failure feels different. You were just executing. The agent suggested it. You followed the recommendation. It's not your fault the recommendation was bad.&lt;/p&gt;

&lt;p&gt;This is a lie your brain tells itself to avoid the sting. But it's a convenient lie, and we take it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it keeps happening
&lt;/h2&gt;

&lt;p&gt;Because there's no feedback loop.&lt;/p&gt;

&lt;p&gt;The agent doesn't know you deleted its code. It doesn't know that thing you built has zero users. It doesn't feel the maintenance. Next session, when the context is fresh, it will suggest the same pattern again — because the pattern is correct in theory. The theory never experiences reality.&lt;/p&gt;

&lt;p&gt;So every fresh session is a new pitch. The same idea, the same confident delivery, the same "this looks right" feeling. The agent doesn't learn. The veto gate has to be you — and we're bad at vetoing confident-sounding things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I do now
&lt;/h2&gt;

&lt;p&gt;I started asking questions. Who needs this? Has anyone asked for it? What if I do nothing?&lt;/p&gt;

&lt;p&gt;It's not a perfect filter. But it catches the worst ones — the ones where the agent was so articulate that I almost forgot to check whether I wanted the thing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
      <category>engineering</category>
    </item>
    <item>
      <title>The Protocol Stack Nobody Talks About</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Wed, 20 May 2026 10:01:58 +0000</pubDate>
      <link>https://dev.to/vystartasv/the-protocol-stack-nobody-talks-about-6gl</link>
      <guid>https://dev.to/vystartasv/the-protocol-stack-nobody-talks-about-6gl</guid>
      <description>&lt;p&gt;Six agent protocols launched in the last year. Everyone's obsessing over model selection. The operating surface around the model is what actually breaks.&lt;/p&gt;

&lt;p&gt;Google I/O opened today with a flood of agent demos. Prompts becoming apps. Vibe coding going production. The spectacle is real. But the thing that determines whether any of this works isn't on stage. It's the quiet protocol stack underneath — MCP, A2A, AGUI, and their contested cousins.&lt;/p&gt;

&lt;p&gt;Most teams can tell you which LLM they're using. Almost none can answer: which tools should the agent see? Who else can it delegate to? Where does the human approve or cancel?&lt;/p&gt;

&lt;p&gt;Those three questions are the stack. Here's what sits at each layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP: tools are a security boundary, not a feature toggle
&lt;/h2&gt;

&lt;p&gt;MCP is the most successful agent protocol by far. 14,000 GitHub repos tagged with it. Every major agent platform supports it. An agent connects to an MCP server, gets a list of callable tools, and can actually do work instead of just chatting.&lt;/p&gt;

&lt;p&gt;But here's what nobody says out loud: there's no registry. No &lt;code&gt;mcp search&lt;/code&gt;. No way for an agent to discover servers programmatically. The 14,000 number is GitHub tag-counting, not a registered directory. Smithery.ai lists about 6,700 — and you browse that with your eyes, not an API. An agent can't ask "find me an MCP server for Salesforce" and get an answer. Discovery is a person reading lists.&lt;/p&gt;

&lt;p&gt;That's not a protocol. That's a treasure hunt.&lt;/p&gt;

&lt;p&gt;Tool access enables arbitrary code execution and arbitrary data access. MCP was designed for high-trust environments. Now it's everywhere. Invariant Labs has published research on tool poisoning — malicious instructions hidden in tool descriptions that influence agents through the very metadata meant to make tools discoverable.&lt;/p&gt;

&lt;p&gt;MCP gets the agent close to the work. It doesn't decide whether the agent should do the work. That's on you.&lt;/p&gt;

&lt;h2&gt;
  
  
  A2A: coordination isn't free
&lt;/h2&gt;

&lt;p&gt;No single agent does everything. A procurement agent needs a supplier agent. A travel agent needs a hotel agent. A software agent needs a security reviewer. Work is distributed across owners, domains, and expertise.&lt;/p&gt;

&lt;p&gt;A2A turns that distribution into something agents can reason about. The agent card is the primitive — a published contract describing what a remote agent is, what it does, which skills it exposes, and how to interact with it.&lt;/p&gt;

&lt;p&gt;The cost: coordination adds another surface where latency, failure, permissions, and observability can break. If your agent delegates to another agent, the workflow gets more flexible and less predictable at the same time.&lt;/p&gt;

&lt;p&gt;A2A isn't right for every product. A single agent with a small tool set may not need coordination at all. The right question: does this workflow require delegated expertise or authority outside the primary agent?&lt;/p&gt;

&lt;h2&gt;
  
  
  AGUI: the human control layer nobody builds until it's too late
&lt;/h2&gt;

&lt;p&gt;An agent that's long-running, non-deterministic, and touching external systems needs more than a final answer. Humans need to observe it working, approve sensitive steps, inspect state, understand why it's waiting.&lt;/p&gt;

&lt;p&gt;Chatbots don't handle this. Neither do traditional web apps built for request-response.&lt;/p&gt;

&lt;p&gt;AGUI is the open candidate for this layer: streaming, shared state, front-end tool calls, custom events, steering, cancellation. It's the protocol most teams will ignore until their agents start doing real work and generating real bugs. They'll wire a model to tools, build a nice chat component, then discover what the agent is really doing — and retroactively bolt on approval buttons, logs, and progress spinners.&lt;/p&gt;

&lt;p&gt;None of those are fixes for the root issue: finding the right control points, understanding what the agent is trying to do, and figuring out where the human needs to approve, deny, edit, or cancel.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three that aren't standards (yet)
&lt;/h2&gt;

&lt;p&gt;A2UI, AP2, and X402 all have real use cases but sit in contested territory.&lt;/p&gt;

&lt;p&gt;A2UI is Google's answer to agent-generated interfaces — declarative UI instead of arbitrary HTML. Right direction, narrower scope than the full human control problem.&lt;/p&gt;

&lt;p&gt;AP2 and X402 both tackle agent payments. AP2 handles commercial trust and user authorization (60+ collaborators including Mastercard, PayPal, American Express). X402 is Coinbase's HTTP-native machine-to-machine settlement. Payments is the most crowded layer because it's the most valuable. Everyone wants in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boring stuff wins
&lt;/h2&gt;

&lt;p&gt;Teams over-focus on model selection and under-specify everything around it. They know which LLM they want. They don't know which tools the agent can or should see. They have a prototype that calls APIs but no interaction model for user approval. They can imagine multiple agents coordinating but have no way to enforce or validate that.&lt;/p&gt;

&lt;p&gt;The actual work lives in those questions. The protocol stack isn't glamorous. Neither is infrastructure. But six months from now, the teams that figured out their operating surface will be the ones whose agents still run.&lt;/p&gt;

&lt;p&gt;The ones that just picked a model won't know what hit them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opinion</category>
      <category>developer</category>
    </item>
    <item>
      <title>Build It, Then Kill It</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 19 May 2026 19:27:36 +0000</pubDate>
      <link>https://dev.to/vystartasv/build-it-then-kill-it-4l9b</link>
      <guid>https://dev.to/vystartasv/build-it-then-kill-it-4l9b</guid>
      <description>&lt;p&gt;The hardest thing after building agent infrastructure for a few months isn't building more. It's stopping.&lt;/p&gt;

&lt;p&gt;Noticing you built a tagging system for a knowledge base of 60 entries. A daily cleanup script for things that never get dirty. A schema that has more fields than data points. These aren't failures — the system works fine. They're architecture that arrived before the problem did.&lt;/p&gt;

&lt;p&gt;Here's what I've learned about not building things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kill the curator before it kills your morning
&lt;/h2&gt;

&lt;p&gt;We had a daily memory curator. Every 4am it scanned the knowledge base for stale entries, article title leaks, bloated facts. It ran for weeks. How many things did it delete? Three. Total. Across its entire lifetime.&lt;/p&gt;

&lt;p&gt;Sixty facts don't rot daily. Nothing rots that fast. The curator wasn't maintaining anything — it was creating work to justify its existence. Reports about 32% budget health. Reports about 0 superseded entries found. Reports about reports.&lt;/p&gt;

&lt;p&gt;Folded it into the weekly auditor instead. Same coverage, 87% fewer runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your schema is bigger than your data
&lt;/h2&gt;

&lt;p&gt;We built a structured tagging system: every fact needed a domain, a project, a type. &lt;code&gt;domain:infra&lt;/code&gt;, &lt;code&gt;project:wwa&lt;/code&gt;, &lt;code&gt;type:architecture&lt;/code&gt;. Queryable. Organized. Professional.&lt;/p&gt;

&lt;p&gt;For 60 facts.&lt;/p&gt;

&lt;p&gt;The schema had more rules than the data had entries. We were maintaining metadata for a phone book that fits on a postcard.&lt;/p&gt;

&lt;p&gt;Facts are flat again. A source line and content. Tag when you'll actually query by that dimension — not because the schema says to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't bolt on what the model doesn't support
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://youtube.com/shorts/p0Zat2QNzkc" rel="noopener noreferrer"&gt;paper dropped this week&lt;/a&gt;: agents communicating via raw embedding vectors instead of text. 8% better, 2.4× faster, 75% fewer tokens. Beautiful piece of research.&lt;/p&gt;

&lt;p&gt;Can we implement it? No. This is a model architecture change — a connector between output layers bypassing tokenization. We sit above the model. The interface is text in, text out. No provider exposes raw embeddings.&lt;/p&gt;

&lt;p&gt;Sometimes the right answer is "that's not at our layer." Not every good idea is yours to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  The best infrastructure is what you didn't have to make
&lt;/h2&gt;

&lt;p&gt;The invisible wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not adding a retry loop when the real fix was fixing the thing that fails&lt;/li&gt;
&lt;li&gt;Not building a dashboard when the problem is visible enough without one&lt;/li&gt;
&lt;li&gt;Not writing a migration when the old data still works fine&lt;/li&gt;
&lt;li&gt;Not adding a field to the schema when you can just put it in the content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The conference reaction article published today took 15 minutes to write. Same template every time: what they're right about, what breaks, the real gap. No fleet numbers. No credentials. Just the take.&lt;/p&gt;

&lt;p&gt;That's the model. Building less leaves more room for having opinions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test
&lt;/h2&gt;

&lt;p&gt;Ask yourself: if nobody maintained this for a month, would anyone notice?&lt;/p&gt;

&lt;p&gt;If the answer is no, you didn't build infrastructure. You built busywork.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opinion</category>
    </item>
    <item>
      <title>Google Declared the Agentic Era at I/O. Here's What They Got Wrong.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 19 May 2026 18:50:42 +0000</pubDate>
      <link>https://dev.to/vystartasv/google-declared-the-agentic-era-at-io-heres-what-they-got-wrong-3pda</link>
      <guid>https://dev.to/vystartasv/google-declared-the-agentic-era-at-io-heres-what-they-got-wrong-3pda</guid>
      <description>&lt;p&gt;Google I/O 2026 was an agentic coming-out party. Agent-first IDE. Autonomous debugging. Vibe coding to production. Chrome DevTools for agents. The message was clear: agents aren't a feature anymore, they're the platform.&lt;/p&gt;

&lt;p&gt;Great. Here's what the demo doesn't show you.&lt;/p&gt;

&lt;h2&gt;
  
  
  They're Right About the Direction
&lt;/h2&gt;

&lt;p&gt;Google Antigravity — an agent-first IDE — is overdue. Native IDE support that an agent can drive instead of hacking together terminal sessions and subprocess calls? Finally.&lt;/p&gt;

&lt;p&gt;Chrome DevTools for agents is bigger than it sounds. Agents inspecting pages, filling forms, running Lighthouse audits — this solves real infrastructure pain. Browser automation has been brittle for years. Google building this into the platform validates the direction.&lt;/p&gt;

&lt;p&gt;Vibe coding as a first-class workflow — fine. The speed of code generation is real and nobody disputes it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's What Actually Breaks
&lt;/h2&gt;

&lt;p&gt;The I/O narrative is "prompt to production." The reality is "prompt to production to 3am incident."&lt;/p&gt;

&lt;p&gt;Speed was never the bottleneck. Code generation is fast. What breaks is: the agent didn't check if the endpoint still exists. The docs were stale. The rate limit wasn't documented. The error format changed. The migration ran against the wrong database.&lt;/p&gt;

&lt;p&gt;Google's answer is more tools. Antigravity. AI Studio integration. Deployment pipelines. The actual answer is infrastructure — memory so agents don't rediscover the same failure twice, decision protocols so they stop before they break things, verification gates that catch errors before they reach production.&lt;/p&gt;

&lt;p&gt;An IDE that generates code without verifying it is just a faster way to break production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Gap
&lt;/h2&gt;

&lt;p&gt;Google is selling velocity. The market needs reliability.&lt;/p&gt;

&lt;p&gt;The impressive demo is an agent building a full-stack app in 30 seconds. The impressive product is an agent running for a week without eating itself.&lt;/p&gt;

&lt;p&gt;Agent-first IDEs are finally here. The question isn't whether they can generate code — they can. The question is whether they can run without constant supervision.&lt;/p&gt;

&lt;p&gt;Nobody's answered that yet.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opinion</category>
    </item>
    <item>
      <title>A Button That Generates Terrible Prompts</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 19 May 2026 18:12:37 +0000</pubDate>
      <link>https://dev.to/vystartasv/a-button-that-generates-terrible-prompts-dkb</link>
      <guid>https://dev.to/vystartasv/a-button-that-generates-terrible-prompts-dkb</guid>
      <description>&lt;p&gt;Click it. Get a cursed coding prompt. That's it.&lt;/p&gt;

&lt;p&gt;Ten flavors. Fresh AI-generated nonsense every time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Sort elements by git blame. Most-blamed floats to the top."&lt;/li&gt;
&lt;li&gt;"Validate an email by sending it an email. Timeout: 3 seconds."&lt;/li&gt;
&lt;li&gt;"Number database migrations by Fibonacci only. Skip one, it deletes itself."&lt;/li&gt;
&lt;li&gt;"CSS-only dark mode. Toggle by resizing your browser to exactly 777px."&lt;/li&gt;
&lt;li&gt;"A React form where each input is a separate micro-frontend. One fails, the field silently disappears."&lt;/li&gt;
&lt;li&gt;"A pub/sub system where subscribers get events based on open source contributions. First-timers always last."&lt;/li&gt;
&lt;li&gt;"A REST API that always returns 200 OK. The real status code is buried somewhere in the JSON."&lt;/li&gt;
&lt;li&gt;"A CI pipeline that fails if commits lack emoji, unless the emoji is ironic."&lt;/li&gt;
&lt;li&gt;"A rate limiter that replies with a passive-aggressive haiku about your resource consumption."&lt;/li&gt;
&lt;li&gt;"A Python package where every import triggers an HTTP request. Works offline if you imported on a Tuesday."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://workswithagents.dev/cursed-prompts" rel="noopener noreferrer"&gt;Try it →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>fun</category>
    </item>
    <item>
      <title>Power Sockets Don't Need Certification — and Neither Should Agent Infrastructure</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 19 May 2026 15:46:51 +0000</pubDate>
      <link>https://dev.to/vystartasv/power-sockets-dont-need-certification-and-neither-should-agent-infrastructure-1gin</link>
      <guid>https://dev.to/vystartasv/power-sockets-dont-need-certification-and-neither-should-agent-infrastructure-1gin</guid>
      <description>&lt;p&gt;I'm tired of talking about plumbing.&lt;/p&gt;

&lt;p&gt;Every conversation about AI agents right now is about infrastructure. What protocol. What format. What discovery mechanism. How to hand off. How to authenticate. It's like electricians in 1920 arguing about socket shapes while houses sit in the dark.&lt;/p&gt;

&lt;p&gt;Power sockets work. You plug something in. It gets power. Nobody asks for a new socket standard. Nobody certifies sockets. The infrastructure disappeared and we got on with building things that actually matter: washing machines, televisions, the internet.&lt;/p&gt;

&lt;p&gt;Agent infrastructure should be the same. Boring. Invisible. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The State We're In
&lt;/h2&gt;

&lt;p&gt;Right now my agents spend too much time figuring out how to talk to things. An endpoint returns 200 but the docs are stale. Rate limits aren't documented. The error format changes between versions. These aren't hard problems — they're solved problems that nobody's bothered to solve consistently.&lt;/p&gt;

&lt;p&gt;This isn't a technology gap. It's an expectation gap. We don't expect APIs to be agent-ready, so they aren't. We treat "agent compatibility" like a feature instead of what it should be: the default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Expect
&lt;/h2&gt;

&lt;p&gt;We're not a certification body. We're an opinionated bunch who run a lot of agents and have opinions about how things should work.&lt;/p&gt;

&lt;p&gt;Three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;llms.txt at your domain root.&lt;/strong&gt; This isn't fancy. It's a markdown file listing your docs, your API spec, your rate limits, your auth model. Machines read it. Humans can too. It costs nothing to add and tells agents where to start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An OpenAPI 3.1 spec that's actually accurate.&lt;/strong&gt; Not "accurate when we wrote it." Accurate now. If your spec says an endpoint returns a &lt;code&gt;widget_id&lt;/code&gt; and it actually returns &lt;code&gt;id&lt;/code&gt;, fix the spec or fix the code. Agents trust what they read. Bad specs waste everyone's tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error responses in a consistent format.&lt;/strong&gt; A &lt;code&gt;429&lt;/code&gt; should include &lt;code&gt;Retry-After&lt;/code&gt;. A &lt;code&gt;400&lt;/code&gt; should say what was wrong. A &lt;code&gt;500&lt;/code&gt; should not return a 200 with "error" buried in JSON. This isn't new. REST APIs have been doing this for humans for years. Agents are just less forgiving.&lt;/p&gt;

&lt;p&gt;That's it. Three things. If your API does these, my agents can use it. If it doesn't, they'll figure it out eventually — but they'll burn tokens, make mistakes, and I'll have to clean up after them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bet
&lt;/h2&gt;

&lt;p&gt;The bet isn't that agent infrastructure is complicated. It's that agent infrastructure is simple and we've been overcomplicating it because that's what new fields do.&lt;/p&gt;

&lt;p&gt;We don't need a certification program. We don't need badges. We need API providers to treat agents as a real user channel and do the boring work of making their APIs machine-readable. The same way they already make them human-readable.&lt;/p&gt;

&lt;p&gt;When electricians stopped arguing about sockets, we got skyscrapers. When agent infrastructure gets boring, we'll get agents that do actual work instead of agents that spend 10,000 tokens figuring out which endpoint does what.&lt;/p&gt;

&lt;p&gt;I'm not waiting for a standards body. I'm building for APIs that meet these expectations. If yours does, my agents will find it, use it, and maybe write a skill for it. If it doesn't, they'll still try — but I'd rather they didn't have to.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We run a fleet of agents at workswithagents.dev. Everything we build is CC BY 4.0. If your API has llms.txt and an OpenAPI spec, we'll probably test against it eventually. Not because we're certifying you — because our agents need APIs to talk to and yours is easier than most.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>api</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
