<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vilius</title>
    <description>The latest articles on DEV Community by Vilius (@vystartasv).</description>
    <link>https://dev.to/vystartasv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F133303%2F50baa34e-e011-4576-8b1a-5974d272fc34.jpg</url>
      <title>DEV Community: Vilius</title>
      <link>https://dev.to/vystartasv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vystartasv"/>
    <language>en</language>
    <item>
      <title>Five Things to Check When Delivering Fast</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Fri, 03 Jul 2026 15:43:46 +0000</pubDate>
      <link>https://dev.to/vystartasv/five-things-to-check-when-delivering-fast-2k73</link>
      <guid>https://dev.to/vystartasv/five-things-to-check-when-delivering-fast-2k73</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the follow-up to &lt;a href="https://dev.to/vystartasv/what-actually-changed-in-two-weeks-3mki"&gt;What Actually Changed in Two Weeks&lt;/a&gt;. That one was about setting up a project for AI-speed delivery. This one is about something I keep re-learning on every fast delivery.&lt;/p&gt;




&lt;p&gt;You start shipping faster with AI. The code works, the feature lands, it feels good.&lt;/p&gt;

&lt;p&gt;Then a few weeks later the feedback comes back, and some of it catches you off guard. Not because anything is broken — but because a few things that seemed obvious to you weren't obvious to the other side.&lt;/p&gt;

&lt;p&gt;No drama. It happens. Here are five things I'm learning to check earlier.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. What does "done" look like from their side?
&lt;/h3&gt;

&lt;p&gt;To me, done means working software. To someone else it might mean pixel-match with a design. Both are valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helps:&lt;/strong&gt; A quick "what does good enough look like to you?" before the work starts. One sentence can save a lot of back and forth.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. When will they actually look at it?
&lt;/h3&gt;

&lt;p&gt;Sending something doesn't mean it gets reviewed immediately. It lands in a queue like everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helps:&lt;/strong&gt; Naming a review date alongside the delivery date. "I'll share this Tuesday — could you take a look by Friday?" Turns silence from a mystery into a signal.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. What needs to be perfect vs what can be improved later?
&lt;/h3&gt;

&lt;p&gt;Not everything in the feedback is the same weight. A label change and a broken flow are different things. Without saying so upfront, everything looks like an emergency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helps:&lt;/strong&gt; Two buckets agreed early. "Here's what I'll get right before it ships. Here's what I'd revisit in a follow-up." Makes the first feedback session more productive.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Could they see something before the full delivery?
&lt;/h3&gt;

&lt;p&gt;The first time someone sees your work often sets the tone. Showing one page or one flow halfway through can catch mismatches before they multiply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helps:&lt;/strong&gt; A mid-point check-in. "First page is ready — want to see if this matches what you had in mind?" Five minutes that can save a round of revisions.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Do they have the full picture?
&lt;/h3&gt;

&lt;p&gt;You've been living in this feature. You know what's intentional and what you'd still tweak. They just see what's on screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helps:&lt;/strong&gt; A quick walkthrough at handover. "Here's what's working, here's what I'd improve with more time, here's where I'm unsure." Gives them the right lens to review through.&lt;/p&gt;




&lt;p&gt;None of this is groundbreaking. It's just the stuff that's easy to skip when you're moving fast — and it turns out skipping it doesn't save time. It costs it later.&lt;/p&gt;

&lt;p&gt;The good news is that when you do check these, the feedback loop compresses to a couple of hours of fixes instead of feeling like a bigger deal than it is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is the sequel to "What Actually Changed in Two Weeks." Same project. Different lesson.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>development</category>
    </item>
    <item>
      <title>More Watts, Less Light</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Mon, 29 Jun 2026 21:11:45 +0000</pubDate>
      <link>https://dev.to/vystartasv/more-watts-less-light-53ne</link>
      <guid>https://dev.to/vystartasv/more-watts-less-light-53ne</guid>
      <description>&lt;p&gt;Token burn and business outcomes are not correlated. More burn means more inefficiency, not more value.&lt;/p&gt;

&lt;h2&gt;
  
  
  The electricity problem
&lt;/h2&gt;

&lt;p&gt;Imagine you walk into a dark room. Turning on a light helps you see. Turning on every light in the building does not help you see better. It's still the same room. Now every surface is equally lit, the contrast is gone, and you're paying for power you didn't use.&lt;/p&gt;

&lt;p&gt;Tokens work the same way. A focused prompt with clear scope is the single overhead light over your desk. A sprawling prompt with unlimited exploration is every light in the building — you're burning power, not producing insight.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tokens are electricity, not output. More throughput doesn't mean more value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've had weeks where I burned through my allocation and looked back at the end to find nothing concrete. Code that worked but went unused. Exploratory branches that dead-ended. Agents that generated plausible-looking output that didn't survive first review. A lot of motion. Not much progress.&lt;/p&gt;

&lt;p&gt;The ceiling stops you from doing that indefinitely. It forces a moment of reflection: did this burn produce anything real? If the answer is no, more capacity isn't the fix. More discipline is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three patterns I now use instead
&lt;/h2&gt;

&lt;p&gt;I started paying attention to what actually ships versus what just burns context. I gave the patterns names so I could catch myself faster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RTK — Read The Knowledgebase.&lt;/strong&gt; A focused 15-minute read of the codebase, identifying the exact files and exact changes, saves 200K+ tokens of exploratory waste. The agent doesn't discover the shape of the task — it executes against a known one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caveman — compress before you prompt.&lt;/strong&gt; Strip greetings, filler words ("I think", "basically", "Let me know if that makes sense"), and closing courtesies. Every word in your prompt multiplies across every response token. Less fluff in means less fluff out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ponytail — spec the minimum viable solution.&lt;/strong&gt; "Robust", "scalable", "enterprise-grade", "comprehensive" — these words invite scope creep. Specify the simplest thing that works: "a Map with TTL, not Redis." An agent given clear constraints ships faster than one given permission to over-engineer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't hacks. They're accepting that the constraint is real and learning to work inside it before asking for more room.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question the ceiling forces you to answer
&lt;/h2&gt;

&lt;p&gt;From a business perspective, the question isn't "how many tokens did we use" or even "how much code did we generate." It's "what shipped that wouldn't have shipped otherwise, and was it worth what it cost?"&lt;/p&gt;

&lt;p&gt;If the answer is unclear, more capacity isn't going to fix it. A bigger budget for waste just means you waste more, faster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The real question:&lt;/strong&gt; if you had unlimited tokens tomorrow, what would you do differently? If you can't articulate a concrete answer — not "ship faster" but "ship which thing, that I can't ship now" — then more capacity is just paying for more waste.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Two perspectives, same person
&lt;/h2&gt;

&lt;p&gt;As the person running the agents day to day, I feel the ceiling. I'd like more room to let exploration run longer before cutting it off. That's an honest feeling.&lt;/p&gt;

&lt;p&gt;As someone accountable for what actually ships, I need the ceiling to exist. It forces a discipline that unlimited capacity never would. It turns "try everything" into "try the right thing." It turns motion into direction.&lt;/p&gt;

&lt;p&gt;The two perspectives aren't in conflict — they're the same person at different zoom levels. The engineering self wants headroom. The business self wants proof that headroom produces value. Both are correct. The tension between them is the whole point.&lt;/p&gt;

&lt;p&gt;Not every team needs to raise the ceiling. Some genuinely do — running many parallel experiments, rapid prototyping, exploring multiple approaches. For them, more throughput would produce measurable value. But that's not most teams. Most teams are fine where they are. Their current output meets their current needs. The ceiling is a signal, not a problem to solve by default. A moment to ask: is this constraint stopping something valuable, or just stopping something busy?&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual skill
&lt;/h2&gt;

&lt;p&gt;I brush against the limit most weeks. Sometimes it's because the work justified every token. Sometimes it's because I let the agent run too long on inefficient prompts.&lt;/p&gt;

&lt;p&gt;More electricity doesn't mean better light. More tokens don't mean better outcomes. The skill is knowing the difference.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>How to onboard an existing project with AI tools</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Fri, 26 Jun 2026 08:39:33 +0000</pubDate>
      <link>https://dev.to/vystartasv/how-to-onboard-an-existing-project-with-ai-tools-270l</link>
      <guid>https://dev.to/vystartasv/how-to-onboard-an-existing-project-with-ai-tools-270l</guid>
      <description>&lt;p&gt;You cloned a mature project, pointed an AI agent at it, and it produced garbage. It didn't know the auth flow, tripped on schema quirks, and kept writing code that didn't fit. You blamed the model. It wasn't the model's fault.&lt;/p&gt;

&lt;p&gt;The problem isn't the agent's ability to code. It's your project's ability to be coded &lt;em&gt;by&lt;/em&gt; an agent. This guide fixes that — one phase at a time. You don't need to finish all of them today. Do one, come back next week. The goal is progress, not perfection.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before you start — take stock
&lt;/h2&gt;

&lt;p&gt;Not every project needs the same treatment. Spend 10 minutes assessing what you're working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it have docs? Are they accurate?&lt;/li&gt;
&lt;li&gt;Is there a test suite? Unit, integration, E2E?&lt;/li&gt;
&lt;li&gt;What's the auth situation? Service account, MFA, SSO?&lt;/li&gt;
&lt;li&gt;What does the README claim vs what's actually true?&lt;/li&gt;
&lt;li&gt;Is there a committed schema (GraphQL, OpenAPI) or is the contract tribal knowledge?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You're looking for gaps in what an agent needs to work effectively. You'll fill them one at a time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase one: Documentation baseline
&lt;/h2&gt;

&lt;p&gt;An agent relearns your project every session unless you give it somewhere to look. That somewhere is a &lt;code&gt;docs/&lt;/code&gt; folder at project root. Every plan, spec, architecture decision, and bug fix goes there. Organized home, reduced mental load.&lt;/p&gt;

&lt;h3&gt;
  
  
  What lives in &lt;code&gt;docs/&lt;/code&gt;:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/strong&gt; — how the agent should write code &lt;em&gt;for this project&lt;/em&gt;. Conventions to follow, patterns to avoid, gotchas that aren't obvious from the code. This is more important than the README in 2026. Start with one, keep it honest. (If your tool prefers &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;CURSOR_RULES&lt;/code&gt;, use those — same purpose, different name.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;docs/BUGS.md&lt;/code&gt;&lt;/strong&gt; — bug catalog with root causes, symptoms, and fixes. The same issue doesn't need to be fixed twice. The agent reads this before starting new work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;docs/LESSONS.md&lt;/code&gt;&lt;/strong&gt; — things that went wrong. Architectural decisions that aged poorly. What the agent should never do. This is durable institutional memory — more valuable than any lint rule. New team members (human or agent) read this first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;docs/TECH-DEBT.md&lt;/code&gt;&lt;/strong&gt; — anti-pattern inventory with a phased fix plan. The agent checks this before refactoring so it doesn't step into known traps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;docs/SPECS/&lt;/code&gt;&lt;/strong&gt; — feature specs and implementation plans. The agent works from a written spec, not memory. One plan per feature, organised by status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;docs/SCHEMA.md&lt;/code&gt;&lt;/strong&gt; (optional) — data model reference. Committed types, API contracts, field descriptions. If your project has a formal schema, document it. If not, the existing code is the contract — and that's fine.&lt;/p&gt;

&lt;p&gt;Frontend docs cross-reference backend docs. One entry point. The agent goes there first, guesses second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup in one paste — single source of truth
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; docs/SPECS &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; AGENTS.md &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
# AGENTS.md

Before writing code: read this file, then check docs/ for known issues,
the data model, and project conventions. Tests must pass after changes.

All project knowledge lives in docs/. Single source of truth.
Read docs/BUGS.md, docs/LESSONS.md, and docs/TECH-DEBT.md before any work.
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;docs/BUGS.md docs/LESSONS.md docs/TECH-DEBT.md
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ docs/ + AGENTS.md created — agent entry point is set"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste that in your project root. Creates &lt;code&gt;docs/&lt;/code&gt; as the single source of truth and &lt;code&gt;AGENTS.md&lt;/code&gt; as the agent's entry point. The agent reads &lt;code&gt;AGENTS.md&lt;/code&gt; first, everything else branches from &lt;code&gt;docs/&lt;/code&gt;. Works with Claude Code, Codex, Cursor, Pi — any agent that respects project docs.&lt;/p&gt;

&lt;p&gt;The agent only links to &lt;code&gt;AGENTS.md&lt;/code&gt;. That file points to &lt;code&gt;docs/&lt;/code&gt;. One entry point, everything discoverable from there.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ The &lt;code&gt;/init&lt;/code&gt; trap
&lt;/h3&gt;

&lt;p&gt;Your favourite CLI will scan your project and generate documentation. Some of it will be wrong — wrong architecture labels, wrong folder purposes, wrong dependency descriptions. Service-layered vs atomic architecture look identical to a scanner. They aren't.&lt;/p&gt;

&lt;p&gt;Read every line it wrote. Find what's incorrect. Fix it. The effort isn't generating text — it's curation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase two: Testing — the trust threshold
&lt;/h2&gt;

&lt;p&gt;This is the hard part. It determines how much AI can assist and how much trust you can hand over. Evals, regression safety, autonomous debugging — everything lives or dies on whether this works.&lt;/p&gt;

&lt;p&gt;Don't have a full suite yet? Start with one smoke test. One spec that proves auth works. Expand later — the first domino matters more than full coverage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication — the blocker
&lt;/h3&gt;

&lt;p&gt;Most mature apps won't let an agent in without MFA, SSO, or some redirect chain. If a service account exists, use it. Otherwise:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ideally, use a persistent browser profile.&lt;/strong&gt; Login once, the profile saves, reuses forever. Covers MFA, SSO, redirect chains — handled on day one, never touched again. Playwright makes this trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx playwright open &lt;span class="nt"&gt;--save-storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.auth/profile.json
&lt;span class="c"&gt;# login manually once in the headed browser window&lt;/span&gt;
&lt;span class="c"&gt;# profile.json now saves the session — reuse everywhere&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you can't create a dedicated profile, &lt;strong&gt;clone your actual profile&lt;/strong&gt;. Less ideal — you're coupling test setup to your personal session — but it works and gets you moving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use CDP.&lt;/strong&gt; Chrome DevTools Protocol sounds elegant. Every time I've tried it, it flakes. Connection drops, session expires mid-test, weird race conditions. More time debugging the connection than writing tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  After auth — the details
&lt;/h3&gt;

&lt;p&gt;Login scripts. Cookie popup dismissals. Wait-for-element by specific selectors. Build these once, commit them, forget them. The agent inherits the same setup — same profile, same scripts, same waits.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase three: MCP servers — extending what the agent can do
&lt;/h2&gt;

&lt;p&gt;Once docs and auth are in place, the next step is giving the agent tools it doesn't have natively. MCP servers do that — they're project-local services the agent discovers and calls automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two that matter for project onboarding:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright MCP&lt;/strong&gt;&lt;br&gt;
Drives a real browser. The agent navigates to URLs, clicks buttons, reads the DOM, takes screenshots. Uses the same persistent profile from phase two, so auth just works. Drop a link and describe the bug — the agent reproduces it, inspects the mismatch, and writes a fix. No manual reproduction steps, no screenshot pasting.&lt;/p&gt;

&lt;p&gt;Configured in the project's &lt;code&gt;.mcp.json&lt;/code&gt;, pointing at a dedicated Playwright instance with the persistent Chrome profile. The agent discovers it automatically on session start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context7 MCP&lt;/strong&gt;&lt;br&gt;
Gives the agent up-to-date documentation and code examples for any library or framework. When the agent needs to use an unfamiliar API, it queries Context7 instead of guessing or hallucinating. Covers the full ecosystem — React, Fastify, GraphQL, Playwright, everything with published docs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to get these
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Playwright:&lt;/strong&gt; &lt;code&gt;npm install -D @playwright/test&lt;/code&gt; then &lt;code&gt;npx playwright install chromium&lt;/code&gt;. The MCP server comes from &lt;code&gt;@anthropic-ai/mcp-playwright&lt;/code&gt; or configure it manually via the &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP specification&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context7:&lt;/strong&gt; Available at &lt;a href="https://context7.com" rel="noopener noreferrer"&gt;context7.com&lt;/a&gt; — install their MCP server and configure it in your project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth setup:&lt;/strong&gt; Run &lt;code&gt;npx playwright open --save-storage=.auth/profile.json&lt;/code&gt;, login once, reuse everywhere.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Phase four: Feedback loop
&lt;/h2&gt;

&lt;p&gt;Now you have docs, auth'd tests, and tool access. What connects them is the workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a spec&lt;/li&gt;
&lt;li&gt;Agent implements against it&lt;/li&gt;
&lt;li&gt;Tests run — pass or fail&lt;/li&gt;
&lt;li&gt;Agent fixes or you review the output&lt;/li&gt;
&lt;li&gt;Commit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the cadence. It doesn't need tooling — a single cycle proves the chain works. The loop matters before the automation does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Useful add-ons (not inbuilt, go install these)
&lt;/h2&gt;

&lt;p&gt;Beyond the setup, there are third-party tools worth knowing about. These aren't built into any framework — you find and install them yourself:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveman&lt;/strong&gt; — compressed specification writing for agent-friendly specs. Cuts token count ~75% while staying precise. Useful when you need the agent to work from a spec that's dense enough to fit in context and precise enough to not hallucinate around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTK (Rust Token Killer)&lt;/strong&gt; — CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Filters and compresses command outputs (&lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;git&lt;/code&gt;, &lt;code&gt;cargo test&lt;/code&gt;, &lt;code&gt;pytest&lt;/code&gt;, etc.) before they reach your LLM context. Single Rust binary, 100+ supported commands, &amp;lt;10ms overhead. Install via &lt;code&gt;brew install rtk&lt;/code&gt; or &lt;code&gt;rtk init --agent hermes&lt;/code&gt;. When work needs millions of tokens, RTK is the lever — cuts the noise, keeps the signal. &lt;a href="https://github.com/rtk-ai/rtk" rel="noopener noreferrer"&gt;github.com/rtk-ai/rtk&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ponytail&lt;/strong&gt; — efficiency patterns for agentic development. Developer-laziness-as-virtue philosophy: ship what you own, validate what you don't. Covers code compression, structural laziness patterns, and avoiding over-engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this unlocks
&lt;/h2&gt;

&lt;p&gt;Once docs, auth, and MCP tools exist, you've passed the threshold. Not the finish line — but the point where the agent becomes useful instead of dead weight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drop a link, describe the bug — the agent debugs it itself.&lt;/strong&gt;&lt;br&gt;
Playwright MCP opens the browser, navigates, interacts, reads the DOM. One prompt: "the dropdown on this page shows wrong values" → agent reproduces, identifies the mismatch between API response and render, fixes it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feed design screenshots — the agent builds from them.&lt;/strong&gt;&lt;br&gt;
Pixel-perfect? Not yet. Close enough that your job shifts from building to reviewing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask for a new field — the agent checks the schema, migrates if needed, builds.&lt;/strong&gt;&lt;br&gt;
If your project has a committed schema (GraphQL, OpenAPI, types file), the agent reuses what exists or creates the migration. If it doesn't, the agent works from the existing code as the implicit contract. Either way, backend and frontend stay consistent without manual coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hand off to a colleague on another OS.&lt;/strong&gt;&lt;br&gt;
Windows, macOS, Linux — same setup, same agent, same reliability. I handed over a solution on a 2-hour call. The agent drove a persistent Chrome browser until deployment issues were resolved. No SSH. Just "fix this."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask it to fix code smells across the codebase.&lt;/strong&gt;&lt;br&gt;
With the test suite as a safety net, the agent can refactor confidently.&lt;/p&gt;




&lt;h2&gt;
  
  
  2026 best practices that changed the game
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent-specific docs over human READMEs.&lt;/strong&gt;&lt;br&gt;
CLAUDE.md / AGENTS.md is the most important file in the project now. READMEs tell humans how to run it. Agent docs tell the agent how to &lt;em&gt;think&lt;/em&gt; about it. Different audiences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document the data model.&lt;/strong&gt;&lt;br&gt;
If your project has a formal schema (GraphQL, OpenAPI, protobuf), commit it and make it the source of truth. If it doesn't, document the implicit contract — key types, API shapes, field meanings — in &lt;code&gt;docs/SCHEMA.md&lt;/code&gt;. The agent reads this before touching data-layer code. The goal isn't a perfect schema, it's less guesswork.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth persistence as infrastructure.&lt;/strong&gt;&lt;br&gt;
Persistent browser profiles aren't a convenience. They're infrastructure. Without them, the agent spends half its context on re-authenticating. With them, every session starts ready to work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LESSONS.md as institutional memory.&lt;/strong&gt;&lt;br&gt;
What went wrong last sprint. Why the agent should never mutate cache in a certain path. This prevents more bugs than any lint rule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document the anti-patterns, not just the patterns.&lt;/strong&gt;&lt;br&gt;
What &lt;em&gt;not&lt;/em&gt; to do is more valuable than what to do. An agent can guess the pattern. It can't guess the mistake you made three months ago unless you wrote it down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP servers extend tool reach.&lt;/strong&gt;&lt;br&gt;
The trend in 2026 is project-local MCP servers that give the agent capabilities it doesn't have natively — browser driving (Playwright MCP), documentation lookup (Context7 MCP). Configure them in the project, commit the config, the agent discovers them automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  What still needs you
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-file refactors with implicit dependencies&lt;/strong&gt; — the agent misses ripple effects across distant modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product judgment&lt;/strong&gt; — anything that needs taste, not correctness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First-time architectural patterns&lt;/strong&gt; — you scaffold the structure, the agent fills it in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curating the docs baseline&lt;/strong&gt; — the initial curation is still a human task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ratio shifts hard. Most of what took hours now takes minutes of review.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to stop
&lt;/h2&gt;

&lt;p&gt;You don't need full coverage. You need &lt;em&gt;enough&lt;/em&gt; coverage to trust the output. One auth'd E2E test, one docs baseline, one MCP server — that's the minimum viable setup. Everything else is compound returns.&lt;/p&gt;

&lt;p&gt;If today wasn't the day — that's fine. The project will be here tomorrow. Pick one thing, move it forward, stop when you've made progress.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Phases one and two are the hard ones. The rest is where the compound returns live. You can stop after any of them.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>tutorial</category>
      <category>agents</category>
    </item>
    <item>
      <title>Three Loops, No Ship</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Thu, 25 Jun 2026 21:59:12 +0000</pubDate>
      <link>https://dev.to/vystartasv/three-loops-no-ship-2pg0</link>
      <guid>https://dev.to/vystartasv/three-loops-no-ship-2pg0</guid>
      <description>&lt;p&gt;I spent three iterations on an auto-fix pipeline that still doesn't work reliably. Here's what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loop 1
&lt;/h2&gt;

&lt;p&gt;Wrote a background script. Pull tickets from Azure DevOps, run them through a local model, hand to a coding agent, push the result.&lt;/p&gt;

&lt;p&gt;Poll → triage → fix → push.&lt;/p&gt;

&lt;p&gt;Worked 40% of the time on trivial tickets. Anything that crossed file boundaries or needed real context — stalled or hallucinated.&lt;/p&gt;

&lt;p&gt;I shipped it anyway. That was naive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loop 2
&lt;/h2&gt;

&lt;p&gt;Made it smarter. Pre-selected relevant files. Broke big tickets into subtasks. Turned complex edits into atomic steps with verification between each.&lt;/p&gt;

&lt;p&gt;Got it to 55% or so. But every fix created two new edge cases. The complexity was compounding faster than the reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loop 3
&lt;/h2&gt;

&lt;p&gt;Went all in. Embeddings for dedup. Multi-repo routing. Auto-revert. A learning loop that fed failures back into future runs.&lt;/p&gt;

&lt;p&gt;The model server started dying. 890 memory errors in a day.&lt;/p&gt;

&lt;p&gt;Root cause: two independent consumers hitting the same local model server, each with its own retry loop. When memory filled up, retries amplified instead of staggering. The system was making itself worse.&lt;/p&gt;

&lt;p&gt;Fixes were simple in hindsight — stop retrying OOM, serialize access, use the local binary not npx. But the pattern kept repeating: add more to fix the last thing, break something else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'm At
&lt;/h2&gt;

&lt;p&gt;The pipeline still only works on easy tickets. Hard ones need a human. After three rounds, the main thing I learned is that local models hit a wall before your ambition does — not in quality, in working memory.&lt;/p&gt;

&lt;p&gt;And adding features doesn't fix reliability gaps. It just moves them around.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The 507 retry spiral taught me more than any successful deploy this year. Because it was entirely my fault. Not the model's, not the framework's. I built concurrent consumers with independent retry loops and expected them to coordinate. They didn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'll do a fourth loop. Smaller. A dedicated fast model for cheap work, the big model only for editing. One consumer at a time.&lt;/p&gt;

&lt;p&gt;Might work. Might be loop 5's prologue.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;I'm looking for people building similar things.&lt;/strong&gt; Local agent pipelines, auto-fix loops, small-model orchestration — the stuff that's not quite working yet but you keep iterating on.&lt;/p&gt;

&lt;p&gt;No Slack. No Discord. No newsletter. Just people who build this stuff and want to compare notes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What media would you gravitate around?&lt;/strong&gt; A private GitHub org? A Telegram group? Occasional calls? Reply or find me — curious what works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Failure post, not a success story. If you're building something similar — don't retry OOM, serialize your consumers, and measure what your model server can actually hold.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>localllms</category>
    </item>
    <item>
      <title>What actually changed in two weeks</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Thu, 25 Jun 2026 15:35:43 +0000</pubDate>
      <link>https://dev.to/vystartasv/what-actually-changed-in-two-weeks-3mki</link>
      <guid>https://dev.to/vystartasv/what-actually-changed-in-two-weeks-3mki</guid>
      <description>&lt;p&gt;I built a large feature. That's not what this is about.&lt;/p&gt;

&lt;p&gt;What changed is the baseline — the standards, docs, and automation that exist now and didn't two weeks ago. Everything after this will be built on top of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Automated tests now ship with new features
&lt;/h2&gt;

&lt;p&gt;QA testers were testing. The product was covered. What didn't exist was automation — no E2E suite, no unit tests for new work, no repeatable spec.&lt;/p&gt;

&lt;p&gt;Now it does. The manual QA cycle stays. The automation catches what humans miss on the tenth pass.&lt;/p&gt;

&lt;p&gt;Quality leap going forward. Human hours saved. The next feature ships with both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The baseline is set
&lt;/h2&gt;

&lt;p&gt;Knowledge lives in the repo. Bug catalog with root causes — so the same thing doesn't get fixed twice. Tech debt inventory with a phased plan. Testing strategy documented, not assumed. GraphQL schema committed and validated against — drift gets caught before it ships. Pre-commit hooks that enforce the standards automatically.&lt;/p&gt;

&lt;p&gt;The frontend and backend documentation are cross-referenced as single sources of truth. The agent instructions point to the right places. Everything new builds on what's already written.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema-first development
&lt;/h2&gt;

&lt;p&gt;The workflow is now: if the schema accommodates the new field, reuse what exists. If it doesn't, the schema update creates the new structure, the data migrates, and everything stays consistent.&lt;/p&gt;

&lt;p&gt;No guessing. No drift. One source of truth for what the data looks like.&lt;/p&gt;




&lt;p&gt;The feature is what you see. The baseline is what you don't — and it matters more.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Running a local coding agent on a Mac Mini — the actual setup</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Sun, 21 Jun 2026 16:17:07 +0000</pubDate>
      <link>https://dev.to/vystartasv/running-a-local-coding-agent-on-a-mac-mini-the-actual-setup-47bo</link>
      <guid>https://dev.to/vystartasv/running-a-local-coding-agent-on-a-mac-mini-the-actual-setup-47bo</guid>
      <description>&lt;p&gt;Running a local coding agent on a Mac Mini&lt;/p&gt;

&lt;h1&gt;
  
  
  Running a local coding agent on a Mac Mini — the actual setup
&lt;/h1&gt;

&lt;p&gt;By Vilius Vystartas&lt;/p&gt;

&lt;p&gt;I have an agent that does my low-stakes coding. File edits, test fixes, build verification. The kind of work you'd normally do yourself but it's faster to delegate. It also writes Playwright tests, reviews code, updates documentation, and runs deploys.&lt;/p&gt;

&lt;p&gt;It runs locally — Mac Mini M4, 24 GB. No cloud API calls for the coding part. The orchestration layer still uses a cheap cloud model for planning and routing. The actual file editing is done by Pi, a coding agent that connects to oMLX, an OpenAI-compatible local LLM server.&lt;/p&gt;

&lt;p&gt;The same setup can drive Claude Code, Codex, or any coding agent that speaks OpenAI-compatible API. Pi is what I use, but the oMLX server works with anything.&lt;/p&gt;

&lt;p&gt;All the model names, config files, and paths are inside the script at the bottom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two models
&lt;/h2&gt;

&lt;p&gt;I keep two and swap depending on the task. The 24 GB can't hold both at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One as good as I can have on this machine&lt;/strong&gt; — 9B class, ~20 tok/s. Primary coding model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Another fast&lt;/strong&gt; — 4B class, ~27 tok/s. File edits, quick fixes, daily tasks.&lt;/p&gt;

&lt;p&gt;The swap script moves one out, brings the other in, restarts the server. Takes about 5 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Pi does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;File edits and refactoring&lt;/li&gt;
&lt;li&gt;Writing and fixing tests (Playwright, unit tests)&lt;/li&gt;
&lt;li&gt;Build verification&lt;/li&gt;
&lt;li&gt;Code review&lt;/li&gt;
&lt;li&gt;Documentation updates&lt;/li&gt;
&lt;li&gt;Running deploys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anything more complex than a one-liner goes through RPC mode. The orchestration layer writes a prompt, Pi executes, the result comes back. No tmux, no process wrangling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pi extensions — what they do, why I use them
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pix-optimizer&lt;/strong&gt; — ponytail + caveman (lazy dev mode and token compression). Keeps Pi output tight and skips boilerplate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;context-mode&lt;/strong&gt; — workspace routing and tool call interception. Keeps Pi from wandering into the wrong directories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pi-subagents&lt;/strong&gt; — spawns sub-agents. Parallel work without blocking the main session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pi-workflow-engine&lt;/strong&gt; — multi-step task orchestration. Lets Pi handle sequences without losing context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pi-mcp-adapter&lt;/strong&gt; — MCP server connectivity. Connects to context7 and scrapling for external tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;@fgladisch/pi-caveman&lt;/strong&gt; — additional compression on top of pix-optimizer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Known issues
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Can only keep one model loaded at a time. Two = OOM. Swap script handles it.&lt;/li&gt;
&lt;li&gt;Thinking mode must be disabled. Defaults to chain-of-thought, kills speed.&lt;/li&gt;
&lt;li&gt;Full chat history in prompts crashes the local model. Prompts must be just the files and changes.&lt;/li&gt;
&lt;li&gt;Print mode skips safety controls. Use RPC mode for anything non-trivial.&lt;/li&gt;
&lt;li&gt;First request after a model swap can time out. Retry once.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The blueprint
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://workswithagents.dev/static/setup-local-llm-pi.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>agents</category>
      <category>macos</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What happens when your OpenRouter key gets stolen? Nothing. Then you move on.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 16 Jun 2026 02:53:06 +0000</pubDate>
      <link>https://dev.to/vystartasv/what-happens-when-your-openrouter-key-gets-stolen-nothing-then-you-move-on-okn</link>
      <guid>https://dev.to/vystartasv/what-happens-when-your-openrouter-key-gets-stolen-nothing-then-you-move-on-okn</guid>
      <description>&lt;p&gt;I woke up to a bill that wasn't mine. Balance zeroed, burned on a model I don't use. Someone found my OpenRouter key in an exposed env variable and ran it dry.&lt;/p&gt;

&lt;p&gt;That's it. No alert. No threshold. No "maybe check this." Just a zeroed balance and the lesson.&lt;/p&gt;

&lt;p&gt;I know what you're thinking — rate limits. Secret audits. Budget caps. Yeah. Living in the real world doesn't always work that way. You push things, you trust the token in &lt;code&gt;export OPENROUTER_KEY=sk-...&lt;/code&gt; stays where you left it. It doesn't. A scumbag finds it and your API key becomes their API key.&lt;/p&gt;

&lt;p&gt;The annoying part isn't even the money. It's the rethinking. Where else am I exposed?&lt;/p&gt;

&lt;p&gt;Then you go looking for help. It's not in the dropdown. Not within easy reach of credit history or billing. Not available to tired eyes at 3am when you're trying to figure out what the hell just happened. There's no button to report. No obvious kill switch. Just the knowledge base telling you to be more careful.&lt;/p&gt;

&lt;p&gt;I'm not dropping an X message. Don't care to waste even more time. Support shouldn't be optional. A spending cap should be obvious. An alert for a 3000% spike should exist by default. The "report abuse" button shouldn't require a site drill.&lt;/p&gt;

&lt;p&gt;None of that was there. So I took the hit, added a hard limit, scrubbed my env files, moved on.&lt;/p&gt;

&lt;p&gt;The scumbag? Nobody. They sweep keys from GitHub repos and deployment logs a hundred times a day. I'm not going to find them.&lt;/p&gt;

&lt;p&gt;But I'm going to have a spending cap next time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>personal</category>
    </item>
    <item>
      <title>The End of the US Cloud Monopoly: AI Balkanization Is Here to Stay</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Sat, 13 Jun 2026 08:28:42 +0000</pubDate>
      <link>https://dev.to/vystartasv/the-end-of-the-us-cloud-monopoly-ai-balkanization-is-here-to-stay-4g68</link>
      <guid>https://dev.to/vystartasv/the-end-of-the-us-cloud-monopoly-ai-balkanization-is-here-to-stay-4g68</guid>
      <description>&lt;p&gt;By Vilius Vystartas | June 2026&lt;/p&gt;

&lt;p&gt;The single, globally unified internet is gone. What's replacing it is a patchwork of sovereign AI zones, each running its own stack on its own hardware with its own rules.&lt;/p&gt;

&lt;p&gt;This isn't a prediction. It's already happening, and the next three years will cement it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Broke the Monopoly
&lt;/h2&gt;

&lt;p&gt;The US government's approach to AI regulation — treating frontier model weights as controlled munitions — had an unintended consequence. By demonstrating that access to US cloud infrastructure can be switched off by regulatory decree, they forced every non-US government and enterprise to build a backup plan.&lt;/p&gt;

&lt;p&gt;The January 2025 AI Diffusion Rule created a three-tier world: unrestricted allies, capped nations (50,000 GPUs/year), and total embargoes. For the 140+ countries in Tier 2, US cloud services became inherently unstable. You can't build a national AI strategy on a faucet that might turn off.&lt;/p&gt;

&lt;p&gt;The DeepSeek R1 moment in January 2025 proved the point: a Chinese quant hedge fund trained a frontier reasoning model on nerfed hardware for $5.6 million. Export controls didn't stop the frontier. They just accelerated the development of independent stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Zones
&lt;/h2&gt;

&lt;p&gt;The tech industry is splitting into three distinct legal and architectural zones, each with its own economics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The US Zone&lt;/strong&gt; — high-performance, high-surveillance closed models. OpenAI, Anthropic, Google. Restricted to US citizens and close Tier 1 allies. The best models, the most monitoring, the least legal recourse if you're outside its borders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The European Zone&lt;/strong&gt; — regulated, open-source-first, locally hosted. Data privacy is the architecture, not a compliance checkbox. France's Mistral, Germany's Aleph Alpha, the fragmented but determined GAIA-X federation. GDPR compliance isn't overhead — it's the product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Asian/Non-Western Zone&lt;/strong&gt; — independent stacks operating entirely outside the Western financial and regulatory sphere. DeepSeek, Alibaba's Qwen, Baidu's Ernie. Huawei Ascend chips replacing NVIDIA. No US venture capital, no US cloud, no US export license risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sovereign AI Infrastructure Boom
&lt;/h2&gt;

&lt;p&gt;Every non-US government with ambition is building national AI infrastructure. Not optional. Existential.&lt;/p&gt;

&lt;p&gt;Europe is scattering exascale systems across the continent — Germany's JUPITER, Finland's LUMI, Italy's LEONARDO. Federated, fragmented, deliberately not dependent on any single provider.&lt;/p&gt;

&lt;p&gt;The Middle East is placing bigger bets. Saudi Arabia's $40 billion AI fund. UAE's G42 building Condor Galaxy on Cerebras hardware, then selling a $1.5 billion stake to Microsoft — on condition it cut Chinese ties. The message: even your sovereign compute comes with geopolitical strings.&lt;/p&gt;

&lt;p&gt;India's $1.25 billion IndiaAI Mission aims for 10,000+ GPUs through Yotta's Shakti Cloud. But at Tier 2's 50,000 GPU cap, the ambition outstrips the allocation.&lt;/p&gt;

&lt;p&gt;Japan's SoftBank committed nearly a billion to AI datacenters. ABCI 3.0 is operational with H100 clusters.&lt;/p&gt;

&lt;p&gt;Every single one runs some form of open-weight model. Because closed APIs can be switched off at a whim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Weights Won
&lt;/h2&gt;

&lt;p&gt;The question of whether enterprises would choose open-source models over closed APIs is settled. They will.&lt;/p&gt;

&lt;p&gt;Meta's Llama 3.1 405B proved open models can match GPT-4 class performance. Mistral proved European sovereignty models are commercially viable. DeepSeek proved frontier reasoning can be open. The entire ecosystem shifted from "can open models compete?" to "how do we productionize our chosen open model?"&lt;/p&gt;

&lt;p&gt;The calculus is simple: slightly lower benchmark scores in exchange for complete operational certainty. No API key to revoke. No pricing change that breaks your margin. No geopolitical event that cuts your access.&lt;/p&gt;

&lt;p&gt;This has driven massive innovation in on-device and on-premises deployment. Models under 70 billion parameters — often under 10 billion — that run on corporate hardware rather than centralized server farms. Microsoft's Phi-4, Apple's on-device models, Google's Gemma, Meta's Llama 3.2 small variants. The edge is where the action is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Death of the API Wrapper
&lt;/h2&gt;

&lt;p&gt;The venture capital correction is brutal but predictable. Startups whose entire value proposition was piping data to a third-party US API can't raise international funding. Their core product can be wiped out overnight by a single regulatory pen stroke — or a pricing change, or a model deprecation, or a geopolitical event.&lt;/p&gt;

&lt;p&gt;Jasper went from $1.5 billion valuation to significant layoffs. The entire "GPT wrapper" category is being reframed as a 2023-2024 anomaly.&lt;/p&gt;

&lt;p&gt;The new valuation premium isn't about who uses the flashiest model. It's about who owns the proprietary training data to build independent, in-house models. Palantir's stock surge, BloombergGPT's financial data moat, healthcare AI companies valued on unique patient datasets — the market is betting on data ownership, not API access.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Nobody's Saying About the Hardware
&lt;/h2&gt;

&lt;p&gt;This entire scenario depends on one unresolved bottleneck: TSMC manufactures over 90% of advanced AI chips.&lt;/p&gt;

&lt;p&gt;The US CHIPS Act ($52.7 billion), the European Chips Act (€43 billion), and TSMC's own global fab expansion (Arizona 4nm, Japan operational, Germany planned) are all trying to address this. But fabrication takes years. The RISC-V ecosystem is promising but a decade behind CUDA in maturity and tooling.&lt;/p&gt;

&lt;p&gt;The real risk isn't that US export controls will stop frontier AI development. DeepSeek proved they won't. The risk is that the hardware supply chain itself becomes a weapon — and every sovereign AI zone discovers that independence at the architectural level means nothing without independence at the fab level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-5 Year Outlook
&lt;/h2&gt;

&lt;p&gt;The trajectory is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;US cloud APIs become a premium product for US-aligned customers only&lt;/li&gt;
&lt;li&gt;Open-weight models become the global enterprise default&lt;/li&gt;
&lt;li&gt;National datacenters proliferate in every region that can afford them&lt;/li&gt;
&lt;li&gt;Data ownership replaces model access as the primary valuation driver&lt;/li&gt;
&lt;li&gt;The supply chain question remains the unresolved wildcard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a temporary fragmentation that heals with better policy. It's a permanent structural shift. The unified global technology ecosystem that defined the last two decades is over. The question isn't whether the balkanization happens — it's whether your infrastructure is ready for it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cloud</category>
      <category>opensource</category>
      <category>devtools</category>
    </item>
    <item>
      <title>We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 26 May 2026 22:46:48 +0000</pubDate>
      <link>https://dev.to/vystartasv/we-asked-10-llms-to-write-efficient-code-only-4-got-better-47gf</link>
      <guid>https://dev.to/vystartasv/we-asked-10-llms-to-write-efficient-code-only-4-got-better-47gf</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every LLM can write code that works. The question is: can they write code that's &lt;em&gt;efficient&lt;/em&gt; — and does telling them to be efficient actually help?&lt;/p&gt;

&lt;p&gt;I tested 10 models on 10 coding tasks, each in two phases: &lt;strong&gt;unprompted&lt;/strong&gt; (the model writes its own code) and &lt;strong&gt;prompted&lt;/strong&gt; (explicitly told to write clean, DRY, efficient code). That's 200 API calls, $0.56 total. The results are... not what most prompt engineers would predict.&lt;/p&gt;

&lt;p&gt;GPT-5.4 was the only model where prompting gave a substantial boost (+0.20). For most models, the "write efficient code" prompt was meaningless or actively harmful.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Metric Works
&lt;/h2&gt;

&lt;p&gt;Each task has a known &lt;strong&gt;optimal token budget&lt;/strong&gt; — the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The &lt;strong&gt;efficiency score&lt;/strong&gt; is &lt;code&gt;optimal_tokens / actual_tokens&lt;/code&gt;, capped at 1.0.&lt;/p&gt;

&lt;p&gt;A score of 0.63 means the model used about 1.6x the optimal — not bad. A score of 0.43 means it used about 2.3x the optimal. The gap between unprompted and prompted tells you whether the "write efficient code" instruction actually changes behaviour.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Leaderboard (Sorted by Prompted Efficiency)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Unprompted&lt;/th&gt;
&lt;th&gt;Prompted&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;Frugal&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Correctness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇&lt;/td&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.63&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.20&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;$0.096&lt;/td&gt;
&lt;td&gt;78% → 85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;Qwen 3.6 Plus&lt;/td&gt;
&lt;td&gt;0.44&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.60&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+0.17&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;$0.158&lt;/td&gt;
&lt;td&gt;78% → 87%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;0.54&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.58&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.003&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92% both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;DeepSeek Chat&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;td&gt;91% → 80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;$0.121&lt;/td&gt;
&lt;td&gt;92% both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;LFM 2 24B A2B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.54&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;-0.06&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;$0.001&lt;/td&gt;
&lt;td&gt;90% → 80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Mistral Large 2411&lt;/td&gt;
&lt;td&gt;0.54&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;td&gt;-0.08&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;$0.050&lt;/td&gt;
&lt;td&gt;90% → 82%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;td&gt;-0.01&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;$0.020&lt;/td&gt;
&lt;td&gt;92% → 90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Cohere Command A&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.60&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.44&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-0.17&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;$0.071&lt;/td&gt;
&lt;td&gt;90% → 82%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;+0.09&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;$0.029&lt;/td&gt;
&lt;td&gt;76% → 86%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Stands Out
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GPT-5.4 Is the Prompt Whisperer
&lt;/h3&gt;

&lt;p&gt;GPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were &lt;strong&gt;config-generation&lt;/strong&gt; (+0.81 — went from 12 inline JSON blocks to a template loop), &lt;strong&gt;html-from-data&lt;/strong&gt; (+0.71), and &lt;strong&gt;magic-strings&lt;/strong&gt; (+0.38 — switched to an Enum). It's the only model in the batch where the "write efficient code" instruction consistently produces different (and better) output.&lt;/p&gt;

&lt;p&gt;The cost is notable — $0.10 for 20 tasks is mid-range, not cheap, not expensive. But the efficiency gain is real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemma 4 31B: The Quiet Winner
&lt;/h3&gt;

&lt;p&gt;Half of Gemma 4's tasks were already "frugal" — naturally efficient without being told. It scored 92% correctness on both phases at just $0.003 total. That's a 40x cost advantage over GPT-5.4 with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cohere Command A: Prompting Backfires
&lt;/h3&gt;

&lt;p&gt;Cohere Command A had the &lt;strong&gt;highest unprompted efficiency&lt;/strong&gt; in the batch (0.60) — it naturally writes concise code. But when told "write efficient code," it ballooned output on several tasks. &lt;strong&gt;html-from-data&lt;/strong&gt; went from a tight 45-token solution to a 600+-token monstrosity (-0.92 gap). The prompt made it overthink.&lt;/p&gt;

&lt;p&gt;Lesson: if a model is already efficient, don't prompt it to be more efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen 3.6 Plus: Second Place, Slowest
&lt;/h3&gt;

&lt;p&gt;Qwen 3.6 Plus scored second in prompted efficiency (+0.17 improvement) but took &lt;strong&gt;26 minutes&lt;/strong&gt; for 20 tasks — by far the slowest model. The efficiency gain is real (especially on html-from-data where it went from hardcoded rows to a map/join pattern), but you're waiting for it. Batch workloads only.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Kimi Surprise
&lt;/h3&gt;

&lt;p&gt;Kimi K2.6 had the lowest unprompted efficiency (0.34 — verbose, boilerplate-heavy code) but improved the most at the bottom end (+0.09). Still last place, but the prompt actually helped it compress — which is the opposite of the Cohere effect. Some models need the nudge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Frugality: What Does It Mean?
&lt;/h3&gt;

&lt;p&gt;"Frugal" means the model naturally produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50% — half their tasks were already efficient. GPT-5.4, DeepSeek Chat, and Kimi K2.6 were only 30% frugal — they needed the prompt to tighten up.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;Models&lt;/th&gt;
&lt;th&gt;Behaviour&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt-responsive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-5.4, Qwen 3.6 Plus&lt;/td&gt;
&lt;td&gt;Efficiency improves substantially with prompting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt-neutral&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6&lt;/td&gt;
&lt;td&gt;Prompt has little effect (±0.04)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt-antagonistic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LFM 2 24B A2B, Mistral Large 2411, Cohere Command A&lt;/td&gt;
&lt;td&gt;Efficiency &lt;em&gt;drops&lt;/em&gt; when prompted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The prompt-antagonistic group is the most interesting. These models know how to write efficient code (0.54-0.60 unprompted), but the explicit instruction triggers over-engineering — they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric.&lt;/p&gt;

&lt;p&gt;If the prompt says "write efficient code" and the model responds by writing &lt;em&gt;more&lt;/em&gt; tokens, something in the training signal is misaligned.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Picks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best prompted efficiency:&lt;/strong&gt; GPT-5.4 — 0.63, $0.10 for 20 tasks. The only model where prompting reliably improves output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best value overall:&lt;/strong&gt; Gemma 4 31B — 0.58 prompted, 92% correctness, $0.003. Absurd price/performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best natural efficiency:&lt;/strong&gt; Cohere Command A — 0.60 unprompted. Don't prompt it, just let it work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most consistent:&lt;/strong&gt; Claude Sonnet 4 — 92% correctness on both phases, small +0.04 efficiency gain. Reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip if you're in a hurry:&lt;/strong&gt; Qwen 3.6 Plus — 26 minutes for 20 tasks. Great efficiency gains, terrible latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch list:&lt;/strong&gt; Kimi K2.6 — low base efficiency but the prompt actually helps. Worth retesting with a better prompt.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash — each with a known optimal token budget for a correct, DRY solution. Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and middleware decorator pattern (Python/Flask).&lt;/p&gt;

&lt;p&gt;Each model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended. Scoring: &lt;strong&gt;efficiency_ratio = optimal_tokens / actual_tokens&lt;/strong&gt; (capped at 1.0). Correctness scored against expected output patterns.&lt;/p&gt;

&lt;p&gt;Total cost: &lt;strong&gt;$0.56&lt;/strong&gt; for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. Max tokens: 600.&lt;/p&gt;

&lt;p&gt;Full results: &lt;a href="https://benchmarks.workswithagents.dev" rel="noopener noreferrer"&gt;benchmarks.workswithagents.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>programming</category>
    </item>
    <item>
      <title>10 Models Tested: From 81.6% to 10%. The Free Tier is a Full-On Gamble.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 26 May 2026 22:42:59 +0000</pubDate>
      <link>https://dev.to/vystartasv/10-models-tested-from-816-to-10-the-free-tier-is-a-full-on-gamble-4kfc</link>
      <guid>https://dev.to/vystartasv/10-models-tested-from-816-to-10-the-free-tier-is-a-full-on-gamble-4kfc</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I tested another 10 models across the same 10 agent coding tasks. Four of them were free-tier models — and the range was absurd: Owl Alpha scored 76.7% with zero hard fails, Laguna M.1 scored 10% and produced garbage on 9 out of 10 tasks. The free tier is not free if it costs you debugging time.&lt;/p&gt;

&lt;p&gt;Total cost for all 10 models: &lt;strong&gt;$0.10&lt;/strong&gt;. The paid models (6 of 10) came to $0.10 combined.&lt;/p&gt;




&lt;h2&gt;
  
  
  Batch 12 Leaderboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;P/P/F&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Grok 4.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7/3/0&lt;/td&gt;
&lt;td&gt;$0.017&lt;/td&gt;
&lt;td&gt;39.9s&lt;/td&gt;
&lt;td&gt;Paid (xAI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;Perceptron Mk1&lt;/td&gt;
&lt;td&gt;79.9%&lt;/td&gt;
&lt;td&gt;8/1/1&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;29.3s&lt;/td&gt;
&lt;td&gt;Paid (Perceptron)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;Owl Alpha (free)&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;td&gt;5/5/0&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;83.0s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;xAI: Grok Build 0.1&lt;/td&gt;
&lt;td&gt;75.0%&lt;/td&gt;
&lt;td&gt;5/4/1&lt;/td&gt;
&lt;td&gt;$0.034&lt;/td&gt;
&lt;td&gt;95.3s&lt;/td&gt;
&lt;td&gt;Paid (xAI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;OpenAI: GPT Chat Latest&lt;/td&gt;
&lt;td&gt;73.3%&lt;/td&gt;
&lt;td&gt;6/2/2&lt;/td&gt;
&lt;td&gt;$0.043&lt;/td&gt;
&lt;td&gt;18.7s&lt;/td&gt;
&lt;td&gt;Paid (OpenAI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Mistral Medium 3.5&lt;/td&gt;
&lt;td&gt;71.6%&lt;/td&gt;
&lt;td&gt;6/2/2&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;td&gt;12.6s&lt;/td&gt;
&lt;td&gt;Paid (Mistral)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Nemotron 3 Nano Omni (free)&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;4/2/4&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;23.5s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Laguna XS.2 (free)&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;td&gt;3/3/4&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;28.7s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Baidu CoBuddy (free)&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;4/0/6&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;362.4s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Laguna M.1 (free)&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;td&gt;1/0/9&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;89.8s&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Headlines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Grok 4.3 (81.6%, $0.017, 39.9s)&lt;/strong&gt; — Grok's latest release takes the batch with zero hard fails. Seven clean passes, three partials. Process-monitor was the only full pass it earned that 4.3's competitors missed. xAI's Grok line is quietly consistent — 4.1 Fast (76.7%), 4.20 (75%), and now 4.3 (81.6%) — all within striking distance of the 80%+ club without crossing into premium pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perceptron Mk1 (79.9%, $0.002, 29.3s)&lt;/strong&gt; — A brand new family debuts at nearly 80%, with eight passes — the most in the batch — for two-tenths of a cent. The one failure (regex-extract at 17%) is a known weakness for small models. At this price-to-pass ratio, Perceptron Mk1 is the value story of this batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Owl Alpha (free, 76.7%, 83.0s)&lt;/strong&gt; — A free model with zero hard fails and 5 full passes. That's the standout free-tier result. Takes 2x longer than paid models for some tasks (24s on csv-stats vs 1-3s for the field), but the code is functional. If latency isn't critical, this is usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Free Tier Lottery
&lt;/h2&gt;

&lt;p&gt;Four free models. Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Owl Alpha&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Usable&lt;/strong&gt; — zero hard fails, 5/10 full passes. Slow but functional.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron 3 Nano Omni&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Mixed&lt;/strong&gt; — half of tasks hit output cap at 400 tokens. Hit or miss.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laguna XS.2&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Unreliable&lt;/strong&gt; — 400-token cap kills complex responses.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Baidu CoBuddy&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Frustrating&lt;/strong&gt; — 362 seconds total. Half the tasks hit output cap at 399 tokens. Waiting 6 minutes for 40% accuracy is not a good trade.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laguna M.1&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Broken&lt;/strong&gt; — 1/10 passes. Every response capped at 400 tokens. Do not use.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The free tier cap of 399-400 output tokens is the real problem. Models like Laguna M.1 and CoBuddy truncate every response, turning what could be a partial into a fail. Owl Alpha works despite the cap because its outputs are concise enough to fit.&lt;/p&gt;

&lt;p&gt;Pay $0.002 for Perceptron Mk1 and get 8/10 passes, or use Laguna M.1 free and get 1/10. The math is not subtle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Disappointments
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPT Chat Latest (73.3%, $0.043)&lt;/strong&gt; — OpenAI's catch-all endpoint was solid on easy tasks (file-parse, csv-stats, sql-query all passed) but fell apart on fix-bug (0%) with a lengthy, expensive hallucination. The most expensive model in the batch and it doesn't crack 75%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistral Medium 3.5 (71.6%, $0.008)&lt;/strong&gt; — Fastest model in the batch at 12.6s total, but the process-monitor task hit a 504 Gateway Timeout and scored 0%. A timeout fail on a model that otherwise looks strong carries a disproportionate penalty — without it, Medium 3.5 would be at 79.5%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Laguna M.1 (10%)&lt;/strong&gt; — The worst score in any batch I've run. Seven of its task responses were blank 400-token output cap fills. Not worth listing on OpenRouter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Price/Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;$/%-pt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Owl Alpha (free)&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron 3 Nano Omni (free)&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laguna XS.2 (free)&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Baidu CoBuddy (free)&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Laguna M.1 (free)&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perceptron Mk1&lt;/td&gt;
&lt;td&gt;79.9%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;$0.0024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Medium 3.5&lt;/td&gt;
&lt;td&gt;71.6%&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;td&gt;$0.0108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.3&lt;/td&gt;
&lt;td&gt;81.6%&lt;/td&gt;
&lt;td&gt;$0.017&lt;/td&gt;
&lt;td&gt;$0.0209&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xAI: Grok Build 0.1&lt;/td&gt;
&lt;td&gt;75.0%&lt;/td&gt;
&lt;td&gt;$0.034&lt;/td&gt;
&lt;td&gt;$0.0450&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT Chat Latest&lt;/td&gt;
&lt;td&gt;73.3%&lt;/td&gt;
&lt;td&gt;$0.043&lt;/td&gt;
&lt;td&gt;$0.0584&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Free models dominate the $/%-pt table by definition, but only Owl Alpha is actually usable. Among paid models, Perceptron Mk1 at $0.0024/%-pt is the efficiency winner — 24x cheaper per point than GPT Chat Latest.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Picks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best overall:&lt;/strong&gt; Grok 4.3 — 81.6%, 39.9s, $0.017. Cleanest leaderboard of the batch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best value (paid):&lt;/strong&gt; Perceptron Mk1 — 79.9%, $0.002 total. Eight passes for two-tenths of a cent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best free model:&lt;/strong&gt; Owl Alpha — 76.7%, zero hard fails. The only free model I'd ship with in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fastest:&lt;/strong&gt; Mistral Medium 3.5 — 12.6s for all 10 tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip entirely:&lt;/strong&gt; Laguna M.1 and all Laguna free-tier variants. 10% is not testable.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Same setup as previous batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 400. Temperature: 0.1. Pattern-matching scoring against expected outputs.&lt;/p&gt;

&lt;p&gt;Pre-flight verification caught zero failures this batch. Total cost: &lt;strong&gt;$0.10&lt;/strong&gt;. Total dataset: &lt;strong&gt;168 models tested&lt;/strong&gt; across cloud and local.&lt;/p&gt;

&lt;p&gt;Full results and per-task scores: &lt;a href="https://benchmarks.workswithagents.dev" rel="noopener noreferrer"&gt;benchmarks.workswithagents.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 26 May 2026 18:48:33 +0000</pubDate>
      <link>https://dev.to/vystartasv/i-tested-10-more-models-five-brand-new-families-debuted-none-scored-below-75-9fj</link>
      <guid>https://dev.to/vystartasv/i-tested-10-more-models-five-brand-new-families-debuted-none-scored-below-75-9fj</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I ran another 10 models through the same agent coding benchmark. Five of them were from completely untested families — Sao10k, Anthracite, Inflection, Mancer, Undi95 — and every single one scored 75% or higher on its first try. This is getting harder to keep up with.&lt;/p&gt;

&lt;p&gt;Two more models tied the all-time record at 90%. The cheapest model ever tested cost $0.0001 for a full 10-task benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  The New 90% Club Members
&lt;/h2&gt;

&lt;p&gt;Eight models have now hit 90% on this benchmark. Batch 11 added two:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistral Large 2411 (90%, $0.008, 46s)&lt;/strong&gt; — Mistral's November 2024 flagship matches their current Large 3. Sometimes the first version is still the best one. Zero hard fails, clean passes on 8/10 tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek Chat V3-0324 (90%, $0.002, 73s)&lt;/strong&gt; — The older V3 variant from March 2024 matches the original DeepSeek Chat at 90%. Every time I test a DeepSeek variant, it lands at 80-90%. The family is remarkably consistent.&lt;/p&gt;

&lt;p&gt;The 90% club now includes: DeepSeek Chat (original), DeepSeek Chat V3-0324, Qwen3 Coder 30B, Nemotron 3 Nano 30B, Codestral 2508, Mistral Large 2411, MiniMax M2 Her, and Baidu Ernie 4.5 300B. Eight models. Seven of them cost less than a cent per full benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Families, First Try
&lt;/h2&gt;

&lt;p&gt;Every new family debuted at 75% or higher. That's an impressive hit rate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sao10k&lt;/td&gt;
&lt;td&gt;L3.1 Euryale 70B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;29s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sao10k&lt;/td&gt;
&lt;td&gt;L3 Lunaris 8B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthracite&lt;/td&gt;
&lt;td&gt;Magnum V4 72B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;td&gt;35s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mancer&lt;/td&gt;
&lt;td&gt;Weaver&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;30s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Undi95&lt;/td&gt;
&lt;td&gt;Remm Slerp L2 13B&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;31s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inflection&lt;/td&gt;
&lt;td&gt;Inflection 3 Productivity&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;$0.012&lt;/td&gt;
&lt;td&gt;42s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*Inflection 3 result is provisional — awaiting lab response. Will update in due course.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L3 Lunaris 8B at $0.0001 is the cheapest model I've ever tested.&lt;/strong&gt; A full 10-task benchmark for one ten-thousandth of a dollar. At this price, there's no reason not to test a model before you ship with it. Lunaris scored 85% — competitive with models that cost 100x more.&lt;/p&gt;

&lt;p&gt;The Sao10k family (L3.1 Euryale 70B and L3 Lunaris 8B) is the standout. Both models scored 85%, both are fine-tunes of Llama 3.1/3, and both cost almost nothing. Community fine-tunes continue to punch above their weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Recoveries
&lt;/h2&gt;

&lt;p&gt;Two Qwen models from my previous failed batch completed successfully this time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3 8B (80%, $0.02, 543s)&lt;/strong&gt; — Needed &lt;code&gt;per_call_timeout: 300&lt;/code&gt; to finish. The model is competent (6 passes, 4 partials, zero fails) but painfully slow. Each API call takes 100-120 seconds on OpenRouter. Use it as a background job, not a real-time agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen Plus 2025-07-28 (80%, $0.001, 19s)&lt;/strong&gt; — The dated variant works perfectly with &lt;code&gt;enable_thinking: false&lt;/code&gt;. 80% at $0.0009 is great value. But use the current &lt;code&gt;qwen/qwen-plus&lt;/code&gt; ID instead — it scores 85% and doesn't need the dated suffix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Price/Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;$/%-pt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L3 Lunaris 8B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.0001&lt;/td&gt;
&lt;td&gt;$0.0001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek Chat V3-0324&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;$0.0017&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3.1 Euryale 70B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;$0.0021&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remm Slerp L2 13B&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;td&gt;$0.0020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mancer Weaver&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;$0.0041&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthracite Magnum V4 72B&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;td&gt;$0.0066&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Large 2411&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;td&gt;$0.0093&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inflection 3 Productivity&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;$0.012&lt;/td&gt;
&lt;td&gt;$0.0156&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;$0.020&lt;/td&gt;
&lt;td&gt;$0.0254&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ratio between cheapest and most expensive $/%-pt is 254x. Lunaris at $0.0001/%-pt vs Qwen3 8B at $0.0254/%-pt — same tier of score, wildly different cost profiles.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Picks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best overall:&lt;/strong&gt; Mistral Large 2411 — 90%, 46s, $0.008&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best value:&lt;/strong&gt; L3 Lunaris 8B — 85%, $0.0001 total. Absurd price/performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best new family debut:&lt;/strong&gt; Sao10k — both models at 85% first try. Watch this line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fastest:&lt;/strong&gt; L3 Lunaris 8B — 20 seconds for all 10 tasks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Same setup as the previous 10 batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 600 (Qwen models), 300 (everyone else). Temperature: 0.1. Pattern-matching scoring against expected outputs.&lt;/p&gt;

&lt;p&gt;Pre-flight verification caught zero failures this batch. All 10 candidates passed the simple-prompt test. Total cost: $0.05 for the core 8 models, then $0.02 for the Qwen recovery run. Total dataset: &lt;strong&gt;158 models tested&lt;/strong&gt; across cloud and local.&lt;/p&gt;

&lt;p&gt;Full results and per-task scores: &lt;a href="https://benchmarks.workswithagents.dev" rel="noopener noreferrer"&gt;benchmarks.workswithagents.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
    <item>
      <title>Two Models Just Hit 90% on Agent Coding. One Cost Less Than a Penny.</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 26 May 2026 09:46:02 +0000</pubDate>
      <link>https://dev.to/vystartasv/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny-12d2</link>
      <guid>https://dev.to/vystartasv/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny-12d2</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ten more models through the same 10 agent coding tasks. Two tied the all-time record. One cost $0.0002. The other hit the score at $0.0018 — cheaper than most models scoring 70%.&lt;/p&gt;

&lt;p&gt;Batch 10 was the cheapest one yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Leaders
&lt;/h2&gt;

&lt;p&gt;Two models scored 90% with zero hard fails, joining MiniMax M2 Her and Baidu Ernie 4.5 300B as the highest-scoring models on this benchmark:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3 Coder 30B A3B&lt;/strong&gt; — 90% in 28 seconds, $0.0004. An efficient coder that doesn't burn budget on thinking tokens it doesn't need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek Chat (original)&lt;/strong&gt; — 90% in 59 seconds, $0.0018. The original DeepSeek Chat still competes with modern models on agent coding. Newer doesn't always mean better.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Surprises
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LFM 2 24B A2B (85%, $0.0002, 15s) is the cheapest model I've ever tested.&lt;/strong&gt; Liquid's debut family is absurdly cost-effective. A full 10-task benchmark for literally $0.0002. At this price/performance ratio, there's no excuse not to test a model before committing to a more expensive alternative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistral Small 3.2 (85%, $0.0004)&lt;/strong&gt; is a clear upgrade. The Small line went 75% → 85% across versions — a ten-point jump at the same budget tier. Mistral keeps improving the right things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3 14B scored 0% across all 10 tasks.&lt;/strong&gt; Mandatory thinking mode that can't be suppressed at 300 tokens means every request times out before producing output. Skip for agent coding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cydonia 24B V4.1 (80%, $0.001)&lt;/strong&gt; debuts a new family from TheDrummer. Zero hard fails. Watch this one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Duds
&lt;/h2&gt;

&lt;p&gt;Qwen3.7 Max (85%, $0.13, 295 seconds) scored the same as budget models costing 300x less. Thinking mode tax at work — the accuracy is there, but you'll wait five minutes and pay for every second.&lt;/p&gt;

&lt;p&gt;Claude Opus 4 (80%, $0.10, 76s) had one hard fail. For a top-tier premium model at $0.10 per 10 tasks, that's below expectations. It's not a bad model — it's overkill for agent coding at a tight token budget.&lt;/p&gt;

&lt;p&gt;Aion 1.0 (80%) had two hard fails and was the slowest at 160 seconds. The architecture is interesting, but it's not ready for production agent work.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Picks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best overall:&lt;/strong&gt; Qwen3 Coder 30B A3B — 90%, 28s, $0.0004&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best value:&lt;/strong&gt; LFM 2 24B A2B — 85%, $0.0002 total. Ridiculous price/performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fastest:&lt;/strong&gt; LFM 2 24B A2B — 15 seconds flat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most improved:&lt;/strong&gt; Mistral Small 3.2 — 75% → 85% across versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip entirely:&lt;/strong&gt; Qwen3 14B for agent tasks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing — tested against each model via OpenRouter. Max tokens: 300. Temperature: 0.1. Results scored by pattern matching against expected outputs. Pre-flight verification caught 2 models (Ernie 4.5 21B — HTTP 429, Trinity Mini — empty content) before they wasted the batch.&lt;/p&gt;

&lt;p&gt;Total batch cost: $0.14 across 9 models. Qwen3.7 Max alone accounted for $0.13 of that — thinking tax.&lt;/p&gt;

&lt;p&gt;Total models tested: 148 (up from 138).&lt;/p&gt;

&lt;p&gt;Full results and per-task scores: &lt;a href="https://benchmarks.workswithagents.dev" rel="noopener noreferrer"&gt;benchmarks.workswithagents.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because you should.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
