<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Darko from Kilo</title>
    <description>The latest articles on DEV Community by Darko from Kilo (@kilocode).</description>
    <link>https://dev.to/kilocode</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3596172%2Fd582ef62-3145-486c-9eb1-cc50dfb22f58.png</url>
      <title>DEV Community: Darko from Kilo</title>
      <link>https://dev.to/kilocode</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kilocode"/>
    <language>en</language>
    <item>
      <title>The GitHub Copilot Bill Came Due. Here's What Engineering Leaders Should Do.</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Mon, 08 Jun 2026 08:15:18 +0000</pubDate>
      <link>https://dev.to/kilocode/the-github-copilot-bill-came-due-heres-what-engineering-leaders-should-do-ig5</link>
      <guid>https://dev.to/kilocode/the-github-copilot-bill-came-due-heres-what-engineering-leaders-should-do-ig5</guid>
      <description>&lt;p&gt;Right now, as I write this, our team is on the floor at the Gartner Summit, and there's one conversation happening in every hallway and coffee line: what just happened to our GitHub Copilot bill?&lt;/p&gt;

&lt;p&gt;It's the trending topic of the day for a reason. On June 1, Copilot's usage-based billing went live for everyone, and the people feeling it hardest are software engineering leaders who woke up this week to discover that a line item they'd treated as fixed for three years is now a variable cost that swings with their team's most productive days.&lt;/p&gt;

&lt;p&gt;A few weeks ago, &lt;a href="https://blog.kilo.ai/p/the-github-copilot-news-is-just-the" rel="noopener noreferrer"&gt;we wrote that this was coming&lt;/a&gt; — the era of subsidized, all-you-can-eat AI was over, and the only honest path forward was paying for what you use. And it's happening this week.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we're hearing on the floor
&lt;/h2&gt;

&lt;p&gt;This isn't just our read in a vendor blog. Many engineering leaders we've talked to this week are in the same scramble: how to get ahead of a bill that's suddenly a moving target.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually changed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/" rel="noopener noreferrer"&gt;GitHub has moved from seat-based pricing to an access-plus-consumption model&lt;/a&gt;: your subscription funds a monthly credit pool, and you pay for everything beyond it.&lt;/p&gt;

&lt;p&gt;Copilot now bills by &lt;strong&gt;GitHub AI Credits&lt;/strong&gt;, calculated on token consumption — input, output, and cached — at per-model API rates. Code completions and Next Edit Suggestions stay free and unmetered, so if autocomplete is your whole workflow, little changes. But everything agentic — chat, agent mode, multi-step sessions, tool calls — is now metered, and Copilot code review now also burns GitHub Actions minutes on top of credits. Once your allowance is gone, you pay overage, or you're cut off.&lt;/p&gt;
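
&lt;p&gt;To make the arithmetic concrete, here is a minimal sketch of how a per-token bill adds up. The model names and per-million-token prices are placeholders invented for illustration, not GitHub's or any provider's actual rates, and the sketch works in dollars because the credit conversion isn't covered here.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    """Hypothetical USD price per one million tokens, by token type."""
    input_per_mtok: float
    output_per_mtok: float
    cached_per_mtok: float


# Placeholder rates for illustration only; substitute real per-model prices.
EXAMPLE_RATES = {
    "frontier-model": ModelRate(5.00, 25.00, 0.50),
    "small-model": ModelRate(0.25, 1.25, 0.025),
}


def session_cost_usd(model: str, input_tokens: int, output_tokens: int, cached_tokens: int) -> float:
    """Sum the three metered token types at the model's per-million-token rates."""
    rate = EXAMPLE_RATES[model]
    total = (
        input_tokens * rate.input_per_mtok
        + output_tokens * rate.output_per_mtok
        + cached_tokens * rate.cached_per_mtok
    )
    return total / 1_000_000


if __name__ == "__main__":
    # One agentic session: a large context that is re-read from cache on each tool call.
    for name in EXAMPLE_RATES:
        cost = session_cost_usd(name, input_tokens=400_000, output_tokens=60_000, cached_tokens=3_000_000)
        print(f"{name}: ${cost:.2f}")
```

&lt;p&gt;With these made-up rates, the same session comes to $5.00 on the frontier tier and $0.25 on the small one. The point is the shape of the formula: cached tokens are cheap per unit, but long agentic sessions re-read them so often that they can rival the output cost.&lt;/p&gt;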

&lt;h2&gt;
  
  
  The real problem is that nobody can predict the bill
&lt;/h2&gt;

&lt;p&gt;Teams can plan around a higher bill. What they can't plan around is one that swings unpredictably from one week to the next — and that's what most of this week's complaints are really about.&lt;/p&gt;

&lt;p&gt;Developers are watching credits evaporate in ways they can't anticipate. One Pro+ user &lt;a href="https://www.ghacks.net/2026/06/02/github-copilot-usage-based-billing-takes-effect-drawing-developer-backlash-over-rapid-credit-depletion/" rel="noopener noreferrer"&gt;burned through roughly 8% of their monthly allotment in two hours&lt;/a&gt; and projected the whole thing gone in under two days. Another &lt;a href="https://www.ghacks.net/2026/06/02/github-copilot-usage-based-billing-takes-effect-drawing-developer-backlash-over-rapid-credit-depletion/" rel="noopener noreferrer"&gt;spent more than $6 on a single change request&lt;/a&gt; and called the consumption impossible to predict. A session using Claude 4.8 to fix some site issues &lt;a href="https://www.ghacks.net/2026/06/02/github-copilot-usage-based-billing-takes-effect-drawing-developer-backlash-over-rapid-credit-depletion/" rel="noopener noreferrer"&gt;ate 1,180 credits&lt;/a&gt; (about 16% of a Pro+ monthly allowance) for results the developer called mediocre. One person watched a &lt;a href="https://findskill.ai/blog/github-copilot-too-expensive-alternatives-2026/" rel="noopener noreferrer"&gt;single file review, with no code changes, consume 20%&lt;/a&gt; of their monthly allowance.&lt;/p&gt;

&lt;p&gt;At the org level, people are circulating projections of monthly costs jumping &lt;a href="https://techcrunch.com/2026/05/30/what-a-joke-github-copilots-new-token-based-billing-spurs-consternation-among-devs/" rel="noopener noreferrer"&gt;from $29 to $750&lt;/a&gt; and &lt;a href="https://techjournal.org/github-copilot-token-billing-backlash" rel="noopener noreferrer"&gt;from $50 to $3,000&lt;/a&gt; in heavy agentic workflows. A &lt;a href="https://techjournal.org/github-copilot-token-billing-backlash" rel="noopener noreferrer"&gt;"Goodbye, Copilot"&lt;/a&gt; post has been shared thousands of times, and &lt;a href="https://techcrunch.com/2026/05/30/what-a-joke-github-copilots-new-token-based-billing-spurs-consternation-among-devs/" rel="noopener noreferrer"&gt;TechCrunch called it the end of Copilot's golden age&lt;/a&gt;.&lt;/p&gt;
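
&lt;p&gt;The projections in these reports are simple linear extrapolations, and they are easy to reproduce. The sketch below assumes the burn rate stays constant, which agentic sessions rarely do; the function names are ours, and the inputs are the rounded figures quoted above.&lt;/p&gt;

```python
def hours_until_exhausted(fraction_used: float, hours_elapsed: float) -> float:
    """Linear projection: hours of further use before the allowance reaches 100%."""
    if not (1 >= fraction_used > 0):
        raise ValueError("fraction_used must be greater than 0 and at most 1")
    if not hours_elapsed > 0:
        raise ValueError("hours_elapsed must be positive")
    burn_per_hour = fraction_used / hours_elapsed
    return (1.0 - fraction_used) / burn_per_hour


def implied_allowance(credits_spent: float, fraction_of_allowance: float) -> float:
    """Back out the monthly allowance implied by one session's share of it."""
    if not fraction_of_allowance > 0:
        raise ValueError("fraction_of_allowance must be positive")
    return credits_spent / fraction_of_allowance


if __name__ == "__main__":
    # 8% gone after two hours of use.
    print(f"{hours_until_exhausted(0.08, 2.0):.0f} more hours of use at this rate")
    # 1,180 credits reported as about 16% of the monthly allowance.
    print(f"implied allowance: about {implied_allowance(1180, 0.16):,.0f} credits")
```

&lt;p&gt;At 8% per two hours, the remaining 92% lasts about 23 more hours of active use, which squares with the "under two days" estimate. The 1,180-credit session implies an allowance of roughly 7,375 credits, though that number inherits the rounding in "about 16%".&lt;/p&gt;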

&lt;p&gt;The r/github thread that's been climbing all week reads the same way, and the sharpest complaints aren't about price at all. One developer described being forced into &lt;a href="https://www.reddit.com/r/github/comments/1ttcpw0/github_copilots_new_creditbased_pricing_is/" rel="noopener noreferrer"&gt;"token anxiety,"&lt;/a&gt; micromanaging every click to survive the month. Another nailed the unit mismatch: you bought a seat, and now every agentic run feels like &lt;a href="https://www.reddit.com/r/github/comments/1ttcpw0/github_copilots_new_creditbased_pricing_is/" rel="noopener noreferrer"&gt;"leaving a taxi meter running in another room."&lt;/a&gt; And this one should land for anyone who signs off on a budget — a developer whose org hadn't even finished configuring its credit pools wrote that, at his burn rate, &lt;a href="https://www.reddit.com/r/github/comments/1ttcpw0/github_copilots_new_creditbased_pricing_is/" rel="noopener noreferrer"&gt;"finance will be getting a hefty bill because management isn't up to date on plan changes."&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To be precise: this isn't a hidden markup. Copilot charges standard per-model API rates — one commenter noted the models cost &lt;a href="https://www.reddit.com/r/github/comments/1ttcpw0/github_copilots_new_creditbased_pricing_is/" rel="noopener noreferrer"&gt;"exactly the same price as direct from OpenAI and Anthropic."&lt;/a&gt; The price was never the subsidy — the flat subscription was. Now that it's gone, you're just seeing what agentic coding actually costs.&lt;/p&gt;

&lt;p&gt;Here's the kicker for anyone responsible for a budget: &lt;strong&gt;you couldn't even trust the preview.&lt;/strong&gt; GitHub's Billing Preview tool was meant to estimate costs before the switch, but it runs on discounted credits, so the number it showed enterprises was &lt;em&gt;lower&lt;/em&gt; than what they'll actually pay. GitHub also warns that older IDE and extension versions can display inaccurate pricing. Meanwhile, some heavy users found the projected spend wildly &lt;em&gt;higher&lt;/em&gt; than what their actual usage suggested.&lt;/p&gt;

&lt;p&gt;Either way, many couldn't get a number they trusted.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is already happening at enterprise scale
&lt;/h2&gt;

&lt;p&gt;The individual horror stories are the visible edge of something bigger: at the org level, agentic coding is outrunning the budgets attached to it.&lt;/p&gt;

&lt;p&gt;Uber is the clearest example. It burned through its &lt;em&gt;entire&lt;/em&gt; 2026 AI coding tools budget in four months — by April — and has since capped employee spending at $1,500 a month. Its CTO for Mobility and Delivery, Praveen Neppalli Naga, confirmed the blowout to &lt;em&gt;The Information&lt;/em&gt;. And it wasn't even on Copilot — it was Claude Code and Cursor. The dynamic isn't vendor-specific: agentic workflows burn tokens faster than flat per-seat budgets were built to absorb. GitHub's change just forces every other org to confront the same math.&lt;/p&gt;

&lt;p&gt;And the optimistic case is a trap: even as per-token prices fall, enterprise bills won't drop in step, because agentic workflows burn far more tokens per task and providers won't pass all the savings through. If your plan assumes prices will just come down, retire that assumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  What leaders can do right now
&lt;/h2&gt;

&lt;p&gt;The basics everyone here is trading notes on are the right place to start — so let's start there, then go one step further than the band-aids.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Analyze your actual usage — against the real rates.&lt;/strong&gt; Pull your usage report now and model your team's real token consumption against the standard metered rates, not any discounted preview.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Put spend governance in place before overages start.&lt;/strong&gt; GitHub has rolled out hard spending caps and user-level budgets with a "stop at limit" option. Set the ceiling now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prioritize workloads — match the model to the task.&lt;/strong&gt; The unpredictability is worst when every action defaults to the most expensive frontier model. Reserve the heavy models for the work that needs them, and stop spending premium credits on trivial completions and boilerplate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't let one vendor own your meter.&lt;/strong&gt; The first three steps are damage control — they make a bad position survivable. The real exposure is having bet your roadmap on one provider's pricing and model availability, and this week showed how that feels when it shifts overnight.&lt;/li&gt;
&lt;/ol&gt;
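
&lt;p&gt;Steps 2 and 3 can be sketched in a few lines. This is an illustration of the idea only, not GitHub's or Kilo's API: the task categories, model tier names, and the &lt;code&gt;Budget&lt;/code&gt; class are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical mapping from task category to model tier.
ROUTES = {
    "boilerplate": "small-model",
    "refactor": "mid-model",
    "architecture": "frontier-model",
}


def pick_model(task_kind: str) -> str:
    """Match the model to the task; unknown task kinds fall back to the cheapest tier."""
    return ROUTES.get(task_kind, "small-model")


@dataclass
class Budget:
    """A hard cap with stop-at-limit behavior: spend that would exceed it is refused."""
    limit_usd: float
    spent_usd: float = 0.0

    def try_spend(self, amount_usd: float) -> bool:
        if self.spent_usd + amount_usd > self.limit_usd:
            return False  # stop at limit: nothing is charged
        self.spent_usd += amount_usd
        return True


if __name__ == "__main__":
    budget = Budget(limit_usd=10.0)
    for kind, estimated_cost in [("architecture", 6.0), ("refactor", 5.0), ("boilerplate", 0.5)]:
        allowed = budget.try_spend(estimated_cost)
        print(kind, pick_model(kind), "allowed" if allowed else "blocked")
```

&lt;p&gt;Defaulting unknown work to the cheapest tier is a deliberate choice in this sketch: the expensive model becomes something a task has to earn rather than the fallback.&lt;/p&gt;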

&lt;p&gt;Developers are already voting with their feet, running hybrid stacks: burn the Copilot allocation, then route the rest elsewhere. It's a smart stopgap — but still a workaround for a problem you shouldn't have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model freedom is the durable answer
&lt;/h2&gt;

&lt;p&gt;This is the world Kilo was built for. We have always focused on open source, transparent pricing, bring-your-own-key (BYOK) support, and genuine model choice. When the prevailing wisdom said everyone would consolidate onto one or two providers, we bet on flexibility.&lt;/p&gt;

&lt;p&gt;The principle is simple: you shouldn't have to care which vendor controls the model, or what their next pricing change does to your workflow. Bring your own keys, run any model across any provider, and see exactly what you'll pay.&lt;/p&gt;

&lt;h3&gt;
  
  
  500+ Models, One Place
&lt;/h3&gt;

&lt;p&gt;Pick the best model for every task — coding, planning, debugging, agentic work — ranked by real-world usage across 500+ hosted options, and switch the moment the economics change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/gateway" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhooxs38ocljb5crykfw.jpeg" alt="Kilo Gateway showing 500+ models ranked by real-world usage across coding, planning, and agentic tasks" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Different work wants different models: a frontier model that earns its cost on orchestration is wasteful for a quick refactor. We show you which models lead in Code, Plan, Debug, Ask, Review, and Orchestrator, based on real usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/gateway" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fny2jhkczreiumkobzxik.jpeg" alt="Kilo model recommendations by task type — Code, Plan, Debug, Ask, Review, and Orchestrator" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  We Know Which Model Fits the Job
&lt;/h3&gt;

&lt;p&gt;Matching the model to the task is the single most effective way to keep agentic costs sane — and it's hard to do by hand, prompt by prompt. Kilo Bench measures cost versus performance across the most capable coding models on Terminal Bench 2.0, so the trade-off between completion rate and cost per attempt is a number you can see, not a guess.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/bench" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ul1hm8i8pg5a53ij1ef.jpeg" alt="Kilo Bench chart showing cost vs completion rate across frontier coding models on Terminal Bench 2.0" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you don't have to make that call on every request. With &lt;a href="https://kilo.ai/features/auto-model" rel="noopener noreferrer"&gt;Auto Model&lt;/a&gt;, smart routing automatically selects the optimal model for each task, across tiers that balance cost and capability — no manual switching required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Granular Usage Analytics
&lt;/h3&gt;

&lt;p&gt;The hardest thing for a leader this week is simply &lt;em&gt;seeing&lt;/em&gt; where the spend goes. One developer complained GitHub had &lt;a href="https://www.reddit.com/r/github/comments/1ttcpw0/github_copilots_new_creditbased_pricing_is/" rel="noopener noreferrer"&gt;"made tracking your spending as difficult as possible."&lt;/a&gt; That visibility is exactly what Kilo gives you: a complete view of how your teams use AI, with spending tracked down to the individual developer.&lt;/p&gt;

&lt;p&gt;Slice usage however the question demands — by date, feature, model, mode, provider, or project — at individual or organization scope.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/features/analytics" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljq4qa7hvzzzu5ir0ews.jpeg" alt="Kilo analytics dashboard showing usage sliced by date, model, mode, provider, and project" width="628" height="1228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And see exactly where the money goes: cost broken out by model turns "why is the bill so high" into a chart you can act on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/features/analytics" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feptlq5vmyuf21gp0h1i0.jpeg" alt="Kilo cost breakdown chart showing spend by model" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Governance: You Decide What Your Org Can Use
&lt;/h3&gt;

&lt;p&gt;Model freedom doesn't mean a free-for-all. Kilo gives administrators 62 providers and 681 models to draw on — and full control over which your organization can actually use, with a default model set centrally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/enterprise" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtrdjc6xvzqpgv6v3mpw.jpeg" alt="Kilo admin panel showing model governance — 62 providers and 681 models with org-level controls" width="722" height="964"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can disable specific models for organization members, so spend governance isn't just a cap on a bill — it's control over what runs in the first place. That's exactly the kind of structural control the leaders we're talking to are scrambling to put in place this week.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/enterprise" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygszofl0dbmhfo7bn4h7.jpeg" alt="Kilo model disable controls showing per-member model restrictions for org admins" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Model choice shouldn't be a premium feature, and open source is the foundation that stays stable when closed systems reprice overnight. It's good to see the broader market arriving at the same place.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;PS. Looking for model freedom? Try out &lt;a href="https://kilo.ai/pricing/kilo-pass" rel="noopener noreferrer"&gt;Kilo Pass&lt;/a&gt; — instant access to 500+ models, transparent pricing, and never any surcharge.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>vibecoding</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Inside Kilo Speed: The Engineer Who Teaches Teams How to Think in Agents</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Wed, 13 May 2026 12:17:35 +0000</pubDate>
      <link>https://dev.to/kilocode/inside-kilo-speed-the-engineer-who-teaches-teams-how-to-think-in-agents-cbf</link>
      <guid>https://dev.to/kilocode/inside-kilo-speed-the-engineer-who-teaches-teams-how-to-think-in-agents-cbf</guid>
      <description>&lt;p&gt;&lt;em&gt;How to manage your agent team, from someone who coaches Kilo customers in agentic engineering.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rebecca Dodd&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;May 12, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you're learning a new discipline, especially on the job, the theory behind it can feel like an abstract nice-to-have, while practice is the thing that's actually useful. Learning by doing is absolutely a valid way to upskill, but in &lt;a href="https://www.linkedin.com/in/marius-wichtner/" rel="noopener noreferrer"&gt;Marius Wichtner&lt;/a&gt;'s experience, grasping the conceptual foundation of agentic engineering is what makes the practical steps make sense.&lt;/p&gt;

&lt;p&gt;Before joining Kilo Code, Marius was already training engineering teams on working with generative AI. At Kilo, he does the same for enterprise clients in Kilo Speedruns: one-hour sessions designed to give teams a fast, practical orientation on agentic software development. He's run them for companies across industries, and now he's sharing the foundations of those lessons (and his specific practices for each) here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to delegate effectively&lt;/li&gt;
&lt;li&gt;How to scale across concurrent workstreams&lt;/li&gt;
&lt;li&gt;How to maintain judgment and recover when things go wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. How to Delegate: The Team Lead Model and the Plan
&lt;/h2&gt;

&lt;p&gt;The mental model Marius uses to explain agentic engineering—both in client speedruns and in how he structures his own work—is the team lead.&lt;/p&gt;

&lt;p&gt;Team leads don't spend all day writing code, and the same was true even before agentic tools existed. They were in pairing sessions, answering questions, reviewing output, and deciding what to merge. "Those were always the people that were only in meetings and they got called by all the juniors," Marius says. "They were just solving the last 20% of the problem."&lt;/p&gt;

&lt;p&gt;In this model, the agent takes care of execution work, while the engineer operates as the team lead. The 80% that agents handle well—code generation, boilerplate, well-scoped subtasks—is work that the team lead delegates. The 20% that still requires the engineer is the judgment work: architectural decisions, what to merge, and recognizing when the agent has drifted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmm8h2pbihosw5oblm5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmm8h2pbihosw5oblm5d.png" alt="Parallel development with the engineer acting as team lead" width="800" height="300"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Parallel development with the engineer acting as team lead&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The engineers who transition most naturally into agentic workflows are often the ones who were already operating this way: team leads and architects who had developed the habit of switching contexts and reviewing output rather than writing it. Everyone else has to learn that mode of working, which starts with understanding the difference between a specification and a plan.&lt;/p&gt;

&lt;p&gt;A specification captures what the user wants. It doesn't change based on the current state of the codebase. It's set from the user demand, and it stays set. A plan is &lt;em&gt;how&lt;/em&gt; you intend to build the thing given where the code actually is right now. "A plan is dependent on your state of the code," says Marius. "Plans usually get thrown away very quickly."&lt;/p&gt;

&lt;p&gt;When Marius works with an agent on complex tasks (especially those with important architecture decisions), he asks it to write its plan to a Markdown file before it starts executing so he can review it. Writing the plan first forces a shared understanding of what's actually being built: the engineer reviews it, asks questions, and surfaces problems before execution begins. It's the refinement stage of traditional software engineering, with a much faster feedback loop.&lt;/p&gt;

&lt;p&gt;Plans, done right, function as constraints. Marius thinks of this as keeping an agent in the acceptable solution space: the set of outputs you will actually accept. The further an agent drifts from a confirmed plan, the more likely it is to end up somewhere that requires starting over. Forcing the plan upfront dramatically increases the probability of staying on track.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiah6bpjjibnnl0scu8fu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiah6bpjjibnnl0scu8fu.png" alt="Plans help to keep your agent within the acceptable solution space" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Plans help to keep your agent within the acceptable solution space&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The plan also acts as a contract: it documents the approach the agent intends to take, so when it does something unexpected later, there's a reference point. "You can always reiterate to the agent, 'We decided to implement this plan. Why have you decided otherwise?'"&lt;/p&gt;

&lt;h2&gt;
  
  
  2. How to Scale: Parallelism and the Context Rot Problem
&lt;/h2&gt;

&lt;p&gt;Even with a solid plan in place, there's a natural limit to how far a single agent session can take you: context rot. As a session grows, accumulating conversation history, prior decisions, and intermediate code states, the agent starts losing coherence. Tasks that were reasonable at the start become unpredictable midway through. Early decisions can come back to bite you. At some point, recovery means starting over.&lt;/p&gt;

&lt;p&gt;Most engineers treat this as a nuisance and work around it by brute force: shorter sessions, more restarts. Marius treats it as a signal that the work hasn't been decomposed correctly. "If you have a huge feature and you develop on it for the whole week, you will keep having context rot," Marius says. "It makes much more sense to plan out what you want to implement ahead of time and then develop each of the sub-problems individually in small context windows."&lt;/p&gt;

&lt;p&gt;This is where parallelism comes in: you run multiple agents simultaneously, each working on a specific sub-problem. But parallel agents writing to the same file system will conflict (the same reason Git was invented). You need each agent working in its own isolated environment.&lt;/p&gt;

&lt;p&gt;To address this, Marius first built a solution into his own custom IDE and then built Kilo's &lt;a href="https://blog.kilo.ai/i/192608130/the-agent-manager" rel="noopener noreferrer"&gt;Agent Manager&lt;/a&gt;, a tool for running multiple agent sessions simultaneously, each in its own isolated workspace with its own file system. Instead of supervising agents one at a time, an engineer can delegate across several concurrent workstreams and review the results as they come in. Things that look good get merged; things that don't get discarded without the cost of untangling a week of compounded decisions.&lt;/p&gt;
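
&lt;p&gt;How Agent Manager isolates its workspaces isn't described here. One common way to get the same effect by hand is Git worktrees, which give each agent its own checkout and branch off a single repository. The sketch below only builds the command; the directory layout and the &lt;code&gt;agent/&lt;/code&gt; branch prefix are assumptions for illustration.&lt;/p&gt;

```python
from pathlib import Path


def worktree_add_command(repo_root: Path, task_slug: str, base_ref: str = "main") -> list[str]:
    """Build the argv for `git worktree add`: a sibling directory on a new branch."""
    worktree_path = repo_root.parent / f"{repo_root.name}-{task_slug}"
    branch = f"agent/{task_slug}"  # hypothetical naming convention
    return [
        "git", "-C", str(repo_root),
        "worktree", "add",
        "-b", branch,          # create a new branch for this agent
        str(worktree_path),    # where the isolated checkout lives
        base_ref,              # commit-ish the new branch starts from
    ]


if __name__ == "__main__":
    # Print one command per sub-problem; pass each list to
    # subprocess.run(cmd, check=True) to actually create the worktrees.
    for slug in ["login-form", "rate-limiter"]:
        print(" ".join(worktree_add_command(Path("/work/app"), slug)))
```

&lt;p&gt;Each agent then works in its own directory, and a result that doesn't pass review can be dropped by removing that worktree and its branch without touching the others.&lt;/p&gt;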

&lt;p&gt;Not every task demands the multi-agent treatment. Marius works across three categories depending on complexity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznu2ohrx3tdl032rthw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznu2ohrx3tdl032rthw5.png" alt="How Marius routes tasks based on their complexity" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;How Marius routes tasks based on their complexity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easy tasks:&lt;/strong&gt; Things like adding documentation, writing a unit test, or well-scoped bug fixes go to a fully autonomous cloud workflow. The developer writes the spec, the agent executes, the developer reviews the diff. No supervision is required mid-session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard tasks:&lt;/strong&gt; Implementing a complex feature spanning UI and backend, or anything with meaningful architectural decisions, gets handled locally with Agent Manager. The developer supervises multiple agents working in parallel on decomposed subtasks, stays close to the work, and makes the judgment calls as diffs come in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unclear tasks:&lt;/strong&gt; When the outcome isn't well-defined, it's hard to write a spec precise enough to constrain the agent toward a single solution. For these, Marius runs multiple agents in parallel against the same spec and compares the results. Instead of splitting work, the parallelism here is about generating variants and selecting the best one. The engineer's job is choosing the right route.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. How to Stay on Track: Context Engineering and Judgment
&lt;/h2&gt;

&lt;p&gt;Context engineering, as Marius defines it, is how you structure and optimize the context of the agent. The goal is to limit an agent to doing exactly what you want, over time, in your codebase. It's the ongoing work of keeping agents oriented, and knowing how to reorient them when they've drifted.&lt;/p&gt;

&lt;p&gt;For upfront orientation, Marius uses Handy, a speech-to-text tool, to interact with agents verbally before locking in a plan. A lot of the context that matters for a task lives in the engineer's head and never gets written down, because it's too tedious to type everything out. Speaking it aloud removes that barrier, and an LLM can distil the rough transcript into a precise problem statement. The rough transcript also becomes the raw material for the plan the agent writes before executing.&lt;/p&gt;

&lt;p&gt;When an agent session ends—whether it hit a context limit or simply reached a natural stopping point—continuing the work is usually straightforward. The original prompts, the Git diff (Agent Manager measures the delta from when the session started), and the current state of the codebase give a new agent enough to pick up where the previous one left off. Tools like &lt;a href="https://repomix.com/" rel="noopener noreferrer"&gt;Repomix&lt;/a&gt; can help with collecting specific file trees for this purpose. All of this can happen locally or in GitHub, where an issue describes the task, the PR contains the changes, and the history provides the thread. Most agents can continue from that context without much intervention.&lt;/p&gt;
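
&lt;p&gt;As a rough sketch of that handoff, the function below assembles the original prompts, the diff since the session started, and any extra notes into one block of text for a fresh agent. The section headings and function name are our own invention, not a format that Agent Manager or Repomix prescribes.&lt;/p&gt;

```python
def build_handoff_context(original_prompts: list[str], git_diff: str, notes: str = "") -> str:
    """Combine prompts, diff, and optional notes into one handoff document."""
    sections = [
        "## Original prompts",
        "\n\n".join(original_prompts),
        "## Changes since the session started (git diff)",
        git_diff.strip() or "(no changes yet)",
    ]
    if notes.strip():
        # Context that lives in the engineer's head and is not in the code or prompts.
        sections += ["## Context not captured in code", notes.strip()]
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    print(build_handoff_context(
        original_prompts=["Add rate limiting to the public API."],
        git_diff="diff --git a/api/limits.py b/api/limits.py\n+MAX_REQUESTS = 100\n",
        notes="The billing team depends on the current response headers.",
    ))
```

&lt;p&gt;The optional notes field is the part worth dwelling on: it is the only place for context that neither the prompts nor the diff can carry.&lt;/p&gt;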

&lt;p&gt;What this process makes visible is what's actually irreplaceable: the context that isn't captured anywhere. Code and prompts are always an approximation, because there are causal relationships in software that are hard to capture in prompts or code alone. Some of them, like another team's architectural decision creating a dependency you didn't know about, can be surfaced and handed off. Others only become visible when you run the code, or only once it runs at scale. An agent can't know what hasn't surfaced yet; that's still the engineer's job.&lt;/p&gt;

&lt;p&gt;This is the difference between just coding and software &lt;em&gt;engineering&lt;/em&gt;. The easy mistake with agentic work is treating it as a handoff: you describe what you want, the agent builds it, you ship it. In that approach, the critical last 20% can get lost: things like evaluating architectural choices and catching when an agent has veered off course. These require engineering judgment, and they're often much harder than the first 80%.&lt;/p&gt;

&lt;p&gt;The mental shift Marius describes is about learning to apply engineering judgment at the right moments, across multiple concurrent threads, rather than sequentially inside a single one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Read the other posts in our Kilo Speed series:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://blog.kilo.ai/p/inside-kilo-speed-how-one-engineer-52c" rel="noopener noreferrer"&gt;Inside Kilo Speed: How One Engineer is Replatforming Our VS Code Extension in a Month&lt;/a&gt; (Mar 11)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.kilo.ai/p/inside-kilo-speed-how-our-head-of" rel="noopener noreferrer"&gt;Inside Kilo Speed: How Our Head of Data Shipped an Identity Resolution System Before His First Full Day&lt;/a&gt; (Feb 20)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.kilo.ai/p/inside-kilo-speed-how-one-engineer-dcb" rel="noopener noreferrer"&gt;Inside Kilo Speed: How One Engineer Built Cloud Agents in a Week&lt;/a&gt; (Feb 4)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.kilo.ai/p/inside-kilo-speed-how-one-engineer-971" rel="noopener noreferrer"&gt;Inside Kilo Speed: How One Engineer Shipped an MVP in His First Week&lt;/a&gt; (Jan 28)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.kilo.ai/p/inside-kilo-speed-how-one-engineer" rel="noopener noreferrer"&gt;Inside Kilo Speed: How One Engineer Shipped an AI Adoption Dashboard in Two Days&lt;/a&gt; (Jan 21)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>learning</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Cowboy Coder Is Back. This Time, They Scale</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Wed, 13 May 2026 12:00:53 +0000</pubDate>
      <link>https://dev.to/kilocode/cowboy-coder-is-back-this-time-they-scale-266n</link>
      <guid>https://dev.to/kilocode/cowboy-coder-is-back-this-time-they-scale-266n</guid>
      <description>&lt;h1&gt;
  
  
  Cowboy Coder Is Back. This Time, They Scale
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Andrew Storms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;May 11, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I should start by admitting I'm part of the problem.&lt;/p&gt;

&lt;p&gt;I can still draw the architecture of code I wrote three years ago from memory. The data flow, the edge cases, the reasoning behind every choice that looks strange at first glance. Ask me to do the same for a feature I shipped last month with help from an agent, and I can tell you what it does and why we built it. The deeper model, the one that lives at the level of individual functions, isn't there.&lt;/p&gt;

&lt;p&gt;That's not laziness, and it's not a lapse in review. I read every diff. An agent does a closer pass alongside me. I can speak to the intent and shape of what I'm approving. But the deep mental model, the one you actually need at 2am when something breaks and the agent isn't helping you debug, isn't forming the way it used to.&lt;/p&gt;

&lt;p&gt;I'm a CISO who still writes code, and this worries me. It should worry anyone managing engineers right now, because it isn't just me. Across the industry, AI coding agents are quietly reviving the single worst antipattern in software engineering history. We just don't recognize it yet, because it's wearing different clothes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25dekrce36z9g3pav21r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25dekrce36z9g3pav21r.png" alt="Cowboy Coder Is Back" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Remember the cowboy?
&lt;/h2&gt;

&lt;p&gt;If you've managed engineers long enough, you know the cowboy. The one who disappears for a weekend and comes back Monday with a full rewrite nobody asked for. The one who, somehow, is the only person who understands the gnarly billing module, the auth flow, the deployment pipeline. The one whose decisions land in production faster than the team can review them.&lt;/p&gt;

&lt;p&gt;Cowboys aren't heroes, by the way. The hero is the engineer who pulls the 2am save when production breaks. The cowboy is the one who created the conditions that made the 2am save necessary in the first place. Heroes clean up. Cowboys cause.&lt;/p&gt;

&lt;p&gt;For twenty years, our industry has been quietly learning how to build engineering organizations that don't depend on this person. Code review. Pair programming. Design docs and RFCs. Collective code ownership. Postmortems that look at process, not blame. The whole inheritance from XP, agile, and DevOps was, in large part, a response to the lesson that cowboy culture feels productive and is actually corrosive.&lt;/p&gt;

&lt;p&gt;It worked. Not perfectly, but the average engineering team today is far more resilient than the average team in 2005.&lt;/p&gt;

&lt;p&gt;Then the agents arrived.&lt;/p&gt;

&lt;p&gt;Watch what happens on teams that have adopted Claude, Cursor, Copilot, Codex, and the rest without changing how they work. An engineer prompts an agent. The agent emits eight hundred lines of code. The engineer skims it, sees the tests pass, and merges. Repeat, ten times a day, across the team.&lt;/p&gt;

&lt;p&gt;The output is enormous. The velocity charts look incredible. And underneath, something is going wrong that nobody is naming yet.&lt;/p&gt;

&lt;p&gt;Nobody on the team has reasoned through that code. The "author" couldn't walk you through it under questioning. They didn't write it, they prompted it. The reviewer couldn't either; they had thirty other PRs in the queue, and half the time the reviewer is another agent. Six months from now, when something breaks at 2am, the engineer who gets paged will be debugging code that has, in any meaningful sense, no human author at all.&lt;/p&gt;

&lt;p&gt;This is the cowboy pattern. The weekend rewrite, the opaque module, the knowledge silo, the tech debt nobody quite remembers creating. Same antipattern, new substrate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's actually worse
&lt;/h2&gt;

&lt;p&gt;The cowboy archetype, for all its damage, had one redeeming feature: somewhere, in one human brain, the model of the system existed. Bus factor of one.&lt;/p&gt;

&lt;p&gt;Development driven by agents, without comprehension, produces bus factor zero. The code enters the repository with nobody understanding it. There is no expert to consult, because the "expert" was a probability distribution that has since moved on to the next prompt.&lt;/p&gt;

&lt;p&gt;The social brakes that used to slow cowboys down are also gone. Cowboys had egos, reputations, and peers who could push back in code review. Agents have none of these. They don't sulk when overruled, don't take credit, don't feel shame when prod breaks. The friction that used to make cowboy culture limit itself in healthy teams, the simple fact that other humans were watching, is absent.&lt;/p&gt;

&lt;p&gt;And there's a new accountability sink. When the cowboy shipped a bad rewrite, you knew whose desk to visit. When an agent ships a bad rewrite, the conversation goes "well, the AI wrote it" and everyone shrugs. The blame diffuses into the tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  What managers should do now
&lt;/h2&gt;

&lt;p&gt;The good news: the playbook for fixing this already exists. We wrote it the last time. It needs updating, not reinventing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Require comprehension, not just approval.&lt;/strong&gt; Before any meaningful PR written with an agent gets merged, the author should be able to walk through it without asking the agent again. If they can't explain why a function exists, the PR isn't ready. This is the most impactful change you can make, and the one I'd benefit from most personally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cap PR size, hard.&lt;/strong&gt; Code review evolved assuming limited human throughput on both sides. Agents broke that assumption. A PR of 50 lines can be meaningfully reviewed; a PR of 800 lines gets approved without thought. Set a limit, enforce it in tooling, and force large changes to be decomposed.&lt;/p&gt;
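
&lt;p&gt;One way to enforce a cap in tooling is a small CI check. The sketch below is a minimal illustration, not a prescribed setup: it assumes your pipeline pipes the output of &lt;code&gt;git diff --numstat&lt;/code&gt; into it, and the function names and the 400-line budget are placeholders chosen for the example.&lt;/p&gt;

```python
# Hypothetical CI helper: flag a PR whose diff exceeds a line budget.
# MAX_CHANGED_LINES is an illustrative number, not a figure from the article.
MAX_CHANGED_LINES = 400


def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for row in numstat.splitlines():
        parts = row.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        added, deleted, _path = parts
        # Binary files are reported as "-\t-\tpath"; they carry no line counts.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def pr_within_budget(numstat: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """Return True when the diff is small enough to review properly."""
    return limit >= changed_lines(numstat)
```

&lt;p&gt;A CI job could call &lt;code&gt;pr_within_budget&lt;/code&gt; and fail the build when it returns &lt;code&gt;False&lt;/code&gt;, which pushes authors to split the change rather than ask reviewers to skim it.&lt;/p&gt;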

&lt;p&gt;&lt;strong&gt;Tag agent involvement and track it.&lt;/strong&gt; Make AI authorship a first-class piece of metadata on commits and PRs. Watch incident rates, time to debug, and refactor cost on modules where agents wrote most of the code, and compare against the rest. You can't manage what you can't see, and right now most engineering orgs are flying blind on this.&lt;/p&gt;
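
&lt;p&gt;As a rough sketch of what that tracking could look like, suppose the team records agent involvement as a commit-message trailer. The trailer name &lt;code&gt;Agent-Assisted&lt;/code&gt;, the function names, and the "top-level directory equals module" rule are all assumptions made for this example, not an established convention.&lt;/p&gt;

```python
from collections import Counter

# "Agent-Assisted" is a made-up trailer name for illustration;
# use whatever key your team agrees on.
TRAILER_KEY = "agent-assisted"


def is_agent_assisted(message: str) -> bool:
    """True if the last paragraph of a commit message has 'Agent-Assisted: yes'."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == TRAILER_KEY:
            return value.strip().lower() == "yes"
    return False


def agent_share_by_module(commits):
    """Map top-level directory to the fraction of its commits that were agent-assisted.

    `commits` is an iterable of (message, changed_paths) pairs.
    """
    touched, assisted = Counter(), Counter()
    for message, paths in commits:
        modules = {path.split("/", 1)[0] for path in paths}
        flagged = is_agent_assisted(message)
        for module in modules:
            touched[module] += 1
            if flagged:
                assisted[module] += 1
    return {module: assisted[module] / touched[module] for module in touched}
```

&lt;p&gt;Joining that per-module share against incident and time-to-debug data is what turns the tag into something you can manage against.&lt;/p&gt;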

&lt;p&gt;&lt;strong&gt;Protect the loop of deliberate practice.&lt;/strong&gt; Junior engineers who never struggle through a hard bug don't become senior engineers who can debug under pressure. Build in rotations without agents, pair on hard problems, and make "can debug from scratch" part of your leveling criteria. The seniors riding herd on agents today learned their craft the hard way. The next cohort needs a path to the same skill, or you'll wake up in five years with a team that can prompt fluently and reason about nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reframe tech debt as unread code.&lt;/strong&gt; The most dangerous code in your repository is no longer the bad code. It's the unread code, modules that work today and that nobody on the team has actually internalized. Schedule comprehension audits. Assign engineers to read and document modules written by agents that they didn't author themselves. Treat unread code as a liability on the books.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not an argument against AI
&lt;/h2&gt;

&lt;p&gt;The agents are useful. The productivity gains are real. I use them every day, and I'm not giving them up.&lt;/p&gt;

&lt;p&gt;The point is that the technical productivity of these tools is arriving faster than the organizational practices needed to absorb them. The teams that already had healthy engineering culture, the kind with code review that actually reviews, sustainable pace, and collective ownership, will adapt and thrive. The teams that quietly tolerated cowboys are about to have a much worse problem, at much greater scale, with no single person to point at.&lt;/p&gt;

&lt;p&gt;And the rest of us, the ones who can still picture the flow of code we wrote three years ago but no longer build that same depth of model with the new stuff, need to be honest that the muscle is atrophying. Mine is. Yours probably is too.&lt;/p&gt;

&lt;p&gt;The cowboy didn't go away. The cowboy scaled, with a million tokens of context. The work of engineering management is to recognize the pattern in its new form and apply the lessons we already learned the last time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
    </item>
    <item>
      <title>7 Unexpected Ways AI Makes Your Team Faster</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Mon, 11 May 2026 11:58:46 +0000</pubDate>
      <link>https://dev.to/kilocode/7-unexpected-ways-ai-makes-your-team-faster-4hp1</link>
      <guid>https://dev.to/kilocode/7-unexpected-ways-ai-makes-your-team-faster-4hp1</guid>
      <description>&lt;p&gt;Most enterprise teams adopt AI coding tools expecting one thing: faster code output. And sure, that happens. But the teams getting the most out of AI are finding speed in places they didn't anticipate. The decisions, the handoffs, the context switches, the organizational friction that quietly eats weeks off every quarter. That's where the real time goes, and that's where AI has the most room to compress it.&lt;/p&gt;

&lt;p&gt;Here are seven of those less-obvious wins, based on what we're seeing across engineering orgs using Kilo at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Decisions that don't stall in Slack threads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dfvvfhb94fhkai0mq95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dfvvfhb94fhkai0mq95.png" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of good engineering decisions happen in Slack threads. Two or three people hash out an approach, agree on a direction, maybe sketch out some pseudocode in a message. Then someone has to take all of that context, switch to their IDE, reconstruct the conversation in their head, and actually implement it. That translation step is where momentum dies. The idea was clear in the thread, but by the time someone sits down to build it, they're re-reading messages and second-guessing what the team actually agreed on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/slack" rel="noopener noreferrer"&gt;Kilo for Slack&lt;/a&gt;&amp;nbsp;can read the full thread context, understand what the team discussed, and start implementing directly from the conversation. Instead of someone manually distilling a Slack thread into a ticket and then into code, Kilo picks up the intent from the discussion itself, with all the nuance that multiple contributors added along the way. The gap between "we agreed on an approach" and "someone started building it" shrinks from hours or days to minutes.&lt;/p&gt;

&lt;p&gt;For engineering teams, this changes the rhythm of how work gets kicked off. Conversations become the starting point for implementation, not a precursor to yet another handoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Code contributions from people who aren't engineers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexjfjcsj1elq8qirmid1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexjfjcsj1elq8qirmid1.png" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Product managers, designers, data analysts, and other non-engineering team members can now use AI agents to write and submit code. They can describe what they need, have an agent generate a PR, and push it up for an&amp;nbsp;&lt;a href="https://kilo.ai/docs/automate/code-reviews/overview" rel="noopener noreferrer"&gt;AI-powered review&lt;/a&gt;. A few years ago, that PR would have been dead on arrival: the code might work, but it might not follow the team's conventions, handle edge cases, or meet the bar for production.&lt;/p&gt;

&lt;p&gt;Kilo's&amp;nbsp;&lt;a href="https://kilo.ai/code-reviewer" rel="noopener noreferrer"&gt;Code Reviewer&lt;/a&gt;&amp;nbsp;changes that equation. When a non-engineer submits a PR, the reviewer analyzes it against performance, security, style, and test coverage, then gives structured feedback the contributor can actually act on. The contributor iterates with their agent based on that feedback, resubmits, and the cycle repeats until the code reaches an acceptable level. Each round takes minutes, not days waiting for a human reviewer to find time.&lt;/p&gt;

&lt;p&gt;The impact for enterprise teams is significant: work that used to require an engineer's time from start to finish can now arrive as a reviewable PR from someone outside the engineering org. Engineers still own the final approval, but they're reviewing and approving instead of building from scratch. That frees up engineering bandwidth in a way that no amount of "write code faster" tooling can match.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Onboarding that doesn't require a sherpa&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffygccbn7rlmgda3v9vmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffygccbn7rlmgda3v9vmg.png" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New engineers joining a large codebase used to spend their first few weeks in a fog. They read docs that were three sprints out of date, pinged senior devs with questions that felt stupid, and took twice as long on their first PRs because they didn't understand the conventions yet.&lt;/p&gt;

&lt;p&gt;AI changes the dynamic. When a new hire can point an agent at the repo and ask "how does authentication work in this service?" or "what's the pattern for adding a new API endpoint here?", they get answers grounded in the actual code, not someone's best recollection of how things worked six months ago. Kilo's Ask mode works well here, providing read-only answers powered by&amp;nbsp;&lt;a href="https://kilo.ai/docs/customize/context/codebase-indexing" rel="noopener noreferrer"&gt;codebase indexing&lt;/a&gt;. New devs ramp in days instead of weeks, and senior devs get fewer interruptions.&lt;/p&gt;

&lt;p&gt;The compounding effect matters: every engineer who onboards faster is productive sooner, and every senior engineer who isn't answering onboarding questions is shipping their own work.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Documentation that actually updates&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnu1ly2xyxc4i6jrmum50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnu1ly2xyxc4i6jrmum50.png" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every engineering team says they value documentation. Almost none of them have enough of it, because writing docs is tedious and the codebase moves faster than anyone can document manually.&lt;/p&gt;

&lt;p&gt;AI flips the economics. Generating docs from code is exactly the kind of structured, pattern-heavy task where AI agents perform well. A developer can point a&amp;nbsp;&lt;a href="https://kilo.ai/docs/code-with-ai/platforms/cloud-agent" rel="noopener noreferrer"&gt;webhook-triggered Cloud Agent&lt;/a&gt;&amp;nbsp;at a new PR and get a first draft of internal docs, API references, or architecture decision records in minutes. That draft still needs a human to review and refine, but the difference between "edit a draft" and "write from scratch" is the difference between documentation existing and not existing.&lt;/p&gt;

&lt;p&gt;For enterprise teams, this pays off across the org. Knowledge stops being locked in individual developers' heads. Teams that depend on each other's services can actually find out how those services work. The "bus factor" for any given system gets a lot less scary.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Maintenance work that stops being a black hole&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3888xz7oqc4xqionp5hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3888xz7oqc4xqionp5hb.png" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every codebase has a backlog of maintenance tasks that never rise to the top of the sprint: dependency upgrades, test coverage gaps, deprecated API migrations, lint rule enforcement. Each one is individually small, but collectively they represent weeks of accumulated drag on the team.&lt;/p&gt;

&lt;p&gt;AI agents can handle a lot of this at volume. Kilo's orchestration capabilities let you break down a large maintenance initiative (say, migrating from one logging library to another across 200 files) into subtasks and distribute them to agents running in parallel. What used to be a quarter-long slog becomes a focused effort measured in hours.&lt;/p&gt;
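
&lt;p&gt;The decomposition step itself is simple to picture. The sketch below is generic Python, not Kilo's orchestration API: &lt;code&gt;plan_batches&lt;/code&gt;, &lt;code&gt;run_batches&lt;/code&gt;, and the &lt;code&gt;worker&lt;/code&gt; callback are hypothetical names, and the worker stands in for whatever actually hands a batch of files to an agent.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def plan_batches(paths, batch_size=20):
    """Split a file list into sorted, fixed-size batches, one per subtask."""
    ordered = sorted(paths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def run_batches(batches, worker, max_workers=4):
    """Run `worker` on each batch in parallel; results come back in batch order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, batches))
```

&lt;p&gt;Sorting before batching keeps files from the same directory together, so each subtask tends to touch one area of the codebase and its resulting PR stays reviewable.&lt;/p&gt;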

&lt;p&gt;The net effect is that the maintenance backlog actually shrinks instead of growing indefinitely. Teams spend less time working around known issues and more time building features that move the product forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Cross-team requests that don't take a sprint&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4v86xa15pzskurafjtdt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4v86xa15pzskurafjtdt.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In larger orgs, teams constantly need small things from each other. A backend team needs a new field exposed in an API. A frontend team needs a config change. A platform team needs a migration script. Each request is maybe a day of work for the team that owns the code, but it sits in their backlog for two weeks because it's nobody's priority.&lt;/p&gt;

&lt;p&gt;When the requesting team can use AI to draft the change themselves (using agents that understand the target repo's patterns and conventions), the dynamic shifts. Instead of filing a ticket and waiting, they can open a PR with a well-formed change and ask the owning team to review it. The owning team spends minutes reviewing instead of a day implementing, and the requesting team isn't blocked for two weeks.&lt;/p&gt;

&lt;p&gt;This might be the single most impactful change AI enables in enterprise settings, and it almost never shows up in productivity benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Consistency that doesn't depend on tribal knowledge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3kx2d3brwdgaekcug0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3kx2d3brwdgaekcug0d.png" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most large codebases have a "right way" to do things that isn't fully captured in any linter config or style guide. It lives in the heads of engineers who've been around a while, and it gets enforced inconsistently through code review when those engineers happen to be reviewers.&lt;/p&gt;

&lt;p&gt;AI can formalize this. Kilo's custom modes and rules system lets teams encode their conventions, patterns, and preferences so that every developer (and every agent) follows the same playbook. New patterns get adopted uniformly instead of unevenly, and deprecated patterns stop spreading through the codebase via copy-paste.&lt;/p&gt;

&lt;p&gt;For enterprise teams managing large, long-lived codebases, this is arguably the most valuable thing AI can do. Consistency across a large codebase reduces cognitive load for everyone who touches it, which makes everything else on this list work better.&lt;/p&gt;




&lt;p&gt;None of these seven things are what most people think of when they hear "AI makes developers faster." They're not about generating code in fewer keystrokes. They're about removing the organizational friction, the coordination overhead, and the knowledge gaps that slow engineering teams down far more than typing speed ever did.&lt;/p&gt;

&lt;p&gt;If your team is evaluating AI tooling and only measuring lines of code generated or time to first commit, you're probably missing the real value. The teams getting the biggest returns are the ones that recognized AI as a way to make the whole system move faster, not just individual contributors.&lt;/p&gt;

&lt;p&gt;To see how Kilo fits into your engineering org, check out our&amp;nbsp;&lt;a href="https://kilo.ai/enterprise" rel="noopener noreferrer"&gt;enterprise plans&lt;/a&gt;&amp;nbsp;or&amp;nbsp;&lt;a href="https://kilo.ai/contact-sales" rel="noopener noreferrer"&gt;talk to our team&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Hermes vs. OpenClaw - When to Reach for Which Agent</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Fri, 08 May 2026 10:53:38 +0000</pubDate>
      <link>https://dev.to/kilocode/hermes-vs-openclaw-when-to-reach-for-which-agent-58bp</link>
      <guid>https://dev.to/kilocode/hermes-vs-openclaw-when-to-reach-for-which-agent-58bp</guid>
      <description>&lt;h1&gt;
  
  
  Hermes vs. OpenClaw — When to Reach for Which Agent
&lt;/h1&gt;

&lt;p&gt;Last week, someone in the &lt;a href="https://kilo.ai/discord" rel="noopener noreferrer"&gt;Kilo Discord&lt;/a&gt; asked: "Should I switch from OpenClaw to Hermes?" I've seen this question pop up a dozen times since Hermes launched in February. It's the right question to ask — both are open source, both connect to your chat apps, both run tools and remember things. On paper, they look almost identical.&lt;/p&gt;

&lt;p&gt;But after running both for the past two months, I think the feature checklists are a distraction — the design philosophies are where they actually diverge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnohgzdxzcwtuy4dwh6rg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnohgzdxzcwtuy4dwh6rg.jpeg" alt="Hermes vs. OpenClaw" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The One-Sentence Difference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hermes&lt;/strong&gt; packages a gateway around a learning agent.&lt;br&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt; packages an agent around a messaging gateway.&lt;/p&gt;

&lt;p&gt;That distinction sounds abstract, but it has practical consequences for how you configure and interact with each tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Hermes Gets Right
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://hermes-agent.nousresearch.com/" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; comes from Nous Research and launched in February 2026. It's hit about 135,000 GitHub stars as of this writing. The headline feature is what they call a "learning loop" — the agent creates and evolves its own skills based on what it does.&lt;/p&gt;

&lt;p&gt;From their &lt;a href="https://hermes-agent.nousresearch.com/docs/user-guide/features/overview" rel="noopener noreferrer"&gt;feature docs&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-improving skills:&lt;/strong&gt; The agent generates procedural knowledge from experience. Run the same task type a hundred times, and Hermes actually gets better at it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Five sandbox backends:&lt;/strong&gt; Local execution, Docker, SSH, Singularity, and Modal. You pick how isolated you want command execution to be.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subagent delegation:&lt;/strong&gt; Spawn child agents with isolated contexts and terminals. Parallel workstreams without context pollution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broader browser/voice stack:&lt;/strong&gt; Browserbase, Browser Use, Firecrawl, local Chrome, plus native voice in Discord channels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Hermes &lt;a href="https://blakecrosley.com/guides/hermes" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; is worth reading even if you don't use it — the provider matrix alone covers 19+ providers with detailed auth flows.&lt;/p&gt;

&lt;p&gt;What impressed me most was the checkpoint system. Before Hermes touches files, it snapshots your working directory, so you can run &lt;code&gt;/rollback&lt;/code&gt; if something goes wrong. I've used this more times than I'd like to admit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenClaw Gets Right
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; has been around longer and has the larger community — roughly 369,000 GitHub stars and 13,700+ community-built skills. It started as a personal assistant project by &lt;a href="https://twitter.com/steipete" rel="noopener noreferrer"&gt;Peter Steinberger&lt;/a&gt; and grew into something much bigger.&lt;/p&gt;

&lt;p&gt;OpenClaw is fundamentally a &lt;strong&gt;gateway&lt;/strong&gt;. The &lt;a href="https://docs.openclaw.ai" rel="noopener noreferrer"&gt;docs&lt;/a&gt; are explicit: "The Gateway is the single source of truth for sessions, routing, and channel connections."&lt;/p&gt;

&lt;p&gt;What that means in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Channel breadth:&lt;/strong&gt; Discord, Google Chat, iMessage, Matrix, Microsoft Teams, Signal, Slack, Telegram, WhatsApp, Zalo, WebChat. One Gateway process handles all of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent routing:&lt;/strong&gt; Isolated sessions per agent, workspace, or sender. You can run different agents for different purposes through the same gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile nodes:&lt;/strong&gt; iOS and Android apps that pair with the gateway for camera, canvas, and device actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massive skill ecosystem:&lt;/strong&gt; 13,700+ community skills covering everything from email to calendar to flight check-ins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture assumes you want one always-on process that routes messages to agents. That's different from Hermes's model of "here's an agent runtime that can talk to various platforms."&lt;/p&gt;

&lt;h2&gt;
  
  
  Known Pitfalls
&lt;/h2&gt;

&lt;p&gt;Both tools have well-documented failure modes that the communities are vocal about. Worth knowing before you commit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hermes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-evaluation almost always passes.&lt;/strong&gt; Hermes evaluates its own work to decide if a task succeeded. The problem: it almost always thinks it did well, even when it didn't. This means the skills it auto-generates from "successful" tasks can encode errors. You need external validation for anything important.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-learning overwrites manual edits.&lt;/strong&gt; The same system that auto-generates skills also overwrites your customizations. If you've spent time tuning a skill for a specific workflow, the agent may "self-improve" it back into something generic. Power users find this maddening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maturity gap.&lt;/strong&gt; With only 11 releases compared to OpenClaw's 137, Hermes simply hasn't been tested at the same scale. Fewer updates mean fewer chances to break things, but that's not the same as proven stability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Updates break things.&lt;/strong&gt; This is the most consistent complaint in the community. Users report roughly a 25% chance that any given update will break response delivery, cron jobs, or webhooks. The development process lacks the staging/testing discipline you'd expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory is unreliable.&lt;/strong&gt; Agents forget instructions, cross-contaminate data between projects, and repeat mistakes. Memory retention issues are the #1 driver of user churn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosting is the real barrier.&lt;/strong&gt; Docker setup, SSH configuration, YAML files, security hardening, 24/7 uptime — users consistently report spending more time on infrastructure than on their actual agent workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trade-offs
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://screenshotone.com/blog/hermes-agent-versus-openclaw/" rel="noopener noreferrer"&gt;comparison on ScreenshotOne&lt;/a&gt; put it well: Hermes is "agent-first" while OpenClaw is "gateway-first."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hermes&lt;/strong&gt; optimizes for the agent becoming more capable over time. It's built for people who want autonomous agents that learn from experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw&lt;/strong&gt; optimizes for a persistent assistant you can message from anywhere. It's built for people who want infrastructure they can talk to.&lt;/p&gt;

&lt;p&gt;Neither approach is wrong. But they lead to different outcomes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Hermes&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native skill evolution&lt;/td&gt;
&lt;td&gt;Skills are static (community-maintained)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandbox options&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 backends (local, Docker, SSH, Singularity, Modal)&lt;/td&gt;
&lt;td&gt;Docker, SSH, local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Channel breadth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7 messaging platforms&lt;/td&gt;
&lt;td&gt;24+ platforms and plugins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Community size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~135k stars, growing fast&lt;/td&gt;
&lt;td&gt;~369k stars, larger skill library&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6+ options including cloud services&lt;/td&gt;
&lt;td&gt;Local Chrome + managed profiles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ACP support (VS Code, Zed, JetBrains)&lt;/td&gt;
&lt;td&gt;CLI + browser control UI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Security Considerations
&lt;/h2&gt;

&lt;p&gt;This matters more than people think. A &lt;a href="https://www.reddit.com/r/selfhosted/comments/1r9yrw1/if_youre_selfhosting_openclaw_heres_every/" rel="noopener noreferrer"&gt;Reddit thread&lt;/a&gt; documented OpenClaw's 2026 security incidents: 6 CVEs, 341+ malicious skills identified in the community repository, 135,000+ exposed instances found by Shodan.&lt;/p&gt;

&lt;p&gt;OpenClaw grew fast. Some security assumptions that made sense for a personal tool on a laptop became dangerous when people started running it on public VPSes with open ports.&lt;/p&gt;

&lt;p&gt;Hermes, being newer, has &lt;a href="https://medium.com/@sathishkraju/i-switched-from-openclaw-to-hermes-agent-heres-what-nobody-told-me-5f33a746b6ca" rel="noopener noreferrer"&gt;zero reported agent-specific CVEs&lt;/a&gt; as of April 2026. That's not because it's inherently more secure — it just hasn't had the same scale of exposure. Give it time.&lt;/p&gt;

&lt;p&gt;Both projects now have sandboxing options and approval flows. But if you're deploying either on a server, audit the defaults. Neither assumes you're running on a hardened production box.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Pick Hermes
&lt;/h2&gt;

&lt;p&gt;Hermes is the better choice if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want an agent that improves at tasks over time&lt;/li&gt;
&lt;li&gt;You need multiple sandbox backends (especially Modal for cloud execution)&lt;/li&gt;
&lt;li&gt;You're doing research-style workflows with subagent delegation&lt;/li&gt;
&lt;li&gt;You want tight IDE integration via ACP&lt;/li&gt;
&lt;li&gt;You're willing to trade ecosystem size for a more capable core agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The learning loop is what justifies choosing Hermes over OpenClaw. If you're running the same types of tasks repeatedly — data analysis, code review, research synthesis — Hermes will genuinely get better at them.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Pick OpenClaw
&lt;/h2&gt;

&lt;p&gt;OpenClaw is the better choice if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to message your assistant from everywhere (24+ platforms)&lt;/li&gt;
&lt;li&gt;You need the existing skill ecosystem (13,700+ skills)&lt;/li&gt;
&lt;li&gt;You want mobile nodes for phone camera/canvas integration&lt;/li&gt;
&lt;li&gt;You're building team infrastructure, not just a personal agent&lt;/li&gt;
&lt;li&gt;You value stability over cutting-edge features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your primary use case is "I want to message my AI from WhatsApp and have it do things on my computer," OpenClaw has that nailed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Problem
&lt;/h2&gt;

&lt;p&gt;This doesn't get discussed enough. Running either agent autonomously is expensive if you're not careful. Every message sends the full conversation history to the API, so costs compound within a session.&lt;/p&gt;
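
&lt;p&gt;To make the compounding concrete, here is a minimal sketch of the arithmetic. It assumes every turn adds a fixed number of tokens and that the whole history is resent as input on each request; the function name, the 500-token turn size, and the &lt;code&gt;reset_every&lt;/code&gt; parameter are illustrative assumptions, not settings from either project. Real sessions also carry model output and tool results in the history, so treat the numbers as a lower bound.&lt;/p&gt;

```python
def total_input_tokens(turns, tokens_per_turn, reset_every=None):
    """Sum the input tokens billed when each request resends the full history."""
    total = 0
    history = 0
    for turn in range(1, turns + 1):
        history += tokens_per_turn  # the new message joins the history
        total += history            # the whole history is sent as input
        if reset_every and turn % reset_every == 0:
            history = 0             # a session reset drops the accumulated history
    return total


# Hypothetical session: 100 turns of 500 tokens each.
no_reset = total_input_tokens(turns=100, tokens_per_turn=500)
with_reset = total_input_tokens(turns=100, tokens_per_turn=500, reset_every=10)
print(no_reset, with_reset)  # 2525000 275000
```

&lt;p&gt;Under these assumptions, the same 100 turns cost roughly nine times more input tokens without resets, because the total grows with the square of the session length rather than linearly.&lt;/p&gt;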

&lt;p&gt;Users in the community report anywhere from $1-3/day on budget models to $130+/day on Claude Opus for heavy agentic use. The fix is aggressive session resets and picking appropriate models per task tier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality-sensitive work:&lt;/strong&gt; Claude Opus 4.6 (expensive, best agentic performance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily driver:&lt;/strong&gt; GPT 5.4 (thinking mode on medium+) or MiniMax M2.7&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget automation:&lt;/strong&gt; Qwen 3.5/3.6 (free on OpenRouter), GLM-5.1, Kimi K2.5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flat-rate subscriptions (MiniMax at $10-20/month, Ollama Pro Cloud at $20/month) are rapidly replacing per-token billing as the community default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Use
&lt;/h2&gt;

&lt;p&gt;I run both, and the community data confirms this is a growing pattern. The specific architecture that works: &lt;strong&gt;OpenClaw as orchestrator&lt;/strong&gt; (planning, decomposition, multi-step coordination, scheduling) and &lt;strong&gt;Hermes as execution specialist&lt;/strong&gt; (fast, repeatable task loops). They communicate via ACP.&lt;/p&gt;

&lt;p&gt;OpenClaw handles my day-to-day messaging — it's the interface I talk to from Telegram. I've been using it for months and the skill ecosystem covers most of what I need.&lt;/p&gt;

&lt;p&gt;Hermes runs on research tasks where I want the learning loop. When I'm doing a series of similar analyses, Hermes's skill evolution actually matters.&lt;/p&gt;

&lt;p&gt;I could probably consolidate — Hermes's docs actually note that it's the "successor to OpenClaw" and they have a migration command (&lt;code&gt;hermes claw migrate&lt;/code&gt;) — but I haven't felt the urgency. They solve different problems well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Both projects are actively developed. Both have real communities. Both work.&lt;/p&gt;

&lt;p&gt;Hermes is younger, more ambitious architecturally, and smaller in ecosystem. OpenClaw is more mature, broader in integrations, and has had more security scrutiny (for better and worse).&lt;/p&gt;

&lt;p&gt;The 30% of developers who &lt;a href="https://www.kucoin.com/blog/hermes-agent-vs-openclaw-which-open-source-ai-agent-wins-in-2026" rel="noopener noreferrer"&gt;switched from OpenClaw to Hermes&lt;/a&gt; cite "maintenance fatigue" from debugging community skills and wanting the learning loop. The 35% who stayed on OpenClaw cite integrations and ecosystem breadth.&lt;/p&gt;

&lt;p&gt;Pick based on what you actually need. If you want a persistent assistant you can message, OpenClaw. If you want an agent that improves itself, Hermes.&lt;/p&gt;

&lt;p&gt;Or run both — they're free, and the resource overhead of a second process is negligible.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hermes-agent.nousresearch.com/" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; — official site&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/user-guide/features/overview" rel="noopener noreferrer"&gt;Hermes docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; — official site&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.openclaw.ai" rel="noopener noreferrer"&gt;OpenClaw docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://screenshotone.com/blog/hermes-agent-versus-openclaw/" rel="noopener noreferrer"&gt;Detailed comparison on ScreenshotOne&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hermes</category>
      <category>openclaw</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Mistral Medium 3.5 is Live in Kilo</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Fri, 08 May 2026 10:44:59 +0000</pubDate>
      <link>https://dev.to/kilocode/mistral-medium-35-is-live-in-kilo-code-43b4</link>
      <guid>https://dev.to/kilocode/mistral-medium-35-is-live-in-kilo-code-43b4</guid>
      <description>&lt;h1&gt;
  
  
  Mistral Medium 3.5 is Live in Kilo
&lt;/h1&gt;

&lt;p&gt;We're thrilled to announce that the public preview version of &lt;a href="https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5" rel="noopener noreferrer"&gt;Mistral Medium 3.5&lt;/a&gt; is now live in Kilo. This is Mistral's first blended model (it merges instruction-following, reasoning, and coding into a single 128B dense model) and it puts the lab instantly back on the OSS map.&lt;/p&gt;

&lt;p&gt;If it's seemed quiet on the Mistral front for a while, that's because they've been heads-down building. This new model is a major leap for the lab, and the focus on agentic work — coding and agentic engineering — benefits all of us.&lt;/p&gt;

&lt;p&gt;Mistral's &lt;a href="https://kilo.ai/models/mistral-medium-3-5" rel="noopener noreferrer"&gt;new flagship&lt;/a&gt; is a dense 128B model with a 256k context window, built from the ground up for long-horizon agentic work. It merges instruction-following, reasoning, and coding into a single set of weights, with configurable reasoning effort so you can dial it up for a gnarly refactor or keep it light for a quick edit. It scores 77.6% on SWE-Bench Verified, putting it ahead of Devstral 2 and models like Qwen3.5 397B A17B. The vision encoder was trained from scratch to handle variable image sizes, and the whole thing can run self-hosted on as few as four GPUs.&lt;/p&gt;

&lt;p&gt;And Mistral is sticking to their OSS principles: the new model shipped with &lt;a href="https://huggingface.co/mistralai/Mistral-Medium-3.5-128B" rel="noopener noreferrer"&gt;open weights&lt;/a&gt; under a modified MIT license.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tb2vkgpukmt4y5chd6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tb2vkgpukmt4y5chd6u.png" alt="Mistral Medium 3.5" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a serious new model for serious engineering tasks, and Mistral users will find that it's now the default for the Mistral Vibe CLI and Le Chat. And with Kilo, anybody can use the model among hundreds of other top models and always find the right tools for the job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Mistral Medium 3.5 Everywhere You Use Kilo
&lt;/h2&gt;

&lt;p&gt;The new model is available in the Kilo Gateway, so you can use it everywhere with a single login.&lt;/p&gt;

&lt;h3&gt;
  
  
  VS Code Extension
&lt;/h3&gt;

&lt;p&gt;The upgraded Kilo Code VS Code extension now surfaces Mistral Medium 3.5 in the model switcher. Pick it for any task where you want a model that can hold a lot of context, reason through complexity, and produce structured output your codebase can actually consume.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kilo Code CLI
&lt;/h3&gt;

&lt;p&gt;Running Kilo from the terminal? Mistral Medium 3.5 is available there too. It's a strong choice for longer CLI sessions — dependency upgrades, test generation, CI investigations — where you want the model working steadily without losing the thread.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Agents
&lt;/h3&gt;

&lt;p&gt;Kilo Code's cloud agent infrastructure is where Mistral Medium 3.5 really opens up. Kick off sessions powered by this model, walk away, and come back to finished branches or draft PRs. The model was built specifically for async, multi-tool work — running long stretches reliably, calling tools in sequence, producing structured handoffs. That makes it a natural fit for the tasks you want to delegate completely: module refactors, issue triage, test coverage gaps, incident investigations.&lt;/p&gt;

&lt;h3&gt;
  
  
  KiloClaw
&lt;/h3&gt;

&lt;p&gt;Mistral Medium 3.5 is available as a model option across KiloClaw recipes. Whether you're running a personal claw or a work claw, you can now back those workflows with a model that handles complex, multi-step reasoning without breaking a sweat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It in Kilo Today
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/models/mistral-medium-3-5" rel="noopener noreferrer"&gt;Mistral Medium 3.5&lt;/a&gt; is priced at $1.50 per million input tokens and $7.50 per million output tokens through the API. For a frontier-class 128B model at this capability level, that's competitive — especially for agentic runs that justify the context and reasoning headroom.&lt;/p&gt;

&lt;p&gt;At a &lt;a href="https://artificialanalysis.ai/models/mistral-medium-3-5/providers?blend=3-1" rel="noopener noreferrer"&gt;blended price&lt;/a&gt; of $3 per million tokens for general chat, and just $1.56 per million tokens for long-context summarization, it's more affordable than it might look at first glance.&lt;/p&gt;
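
&lt;p&gt;For readers who want to check the blended figures, a blended price is just a weighted average of the input and output rates. The sketch below is a rough illustration: the 3:1 input-to-output ratio comes from the &lt;code&gt;blend=3-1&lt;/code&gt; parameter in the link above, while the 99:1 ratio for long-context summarization is an assumption, chosen only because it reproduces the $1.56 figure. The function name is hypothetical.&lt;/p&gt;

```python
def blended_price(input_price, output_price, input_parts, output_parts=1):
    """Weighted average price per million tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_price * input_parts + output_price * output_parts) / total_parts


# Listed API prices in USD per million tokens: 1.50 input, 7.50 output.
chat = blended_price(1.50, 7.50, input_parts=3)            # 3:1 mix
summarization = blended_price(1.50, 7.50, input_parts=99)  # assumed 99:1 mix
print(round(chat, 2), round(summarization, 2))  # 3.0 1.56
```

&lt;p&gt;The more input-heavy the workload, the closer the blended price gets to the $1.50 input rate.&lt;/p&gt;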

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dn0fdtyvma846yvywg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dn0fdtyvma846yvywg3.png" alt="Mistral Medium 3.5 blended pricing" width="710" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plus, if you grab a &lt;a href="https://kilo.ai/features/kilo-pass" rel="noopener noreferrer"&gt;Kilo Pass&lt;/a&gt; you can embrace a healthy discount :)&lt;/p&gt;

&lt;p&gt;Open the model switcher in the &lt;a href="https://www.producthunt.com/products/kilocode/launches/kilo-code-v7-for-vs-code" rel="noopener noreferrer"&gt;latest version of our VS Code extension&lt;/a&gt;, select it in your CLI agent config, or choose it as the backing model for your next KiloClaw recipe. It's available now in public preview — we'd love to hear what you build with it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://blog.kilo.ai/p/mistral-medium-35-is-live-in-kilo" rel="noopener noreferrer"&gt;the Kilo Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kilocode</category>
    </item>
    <item>
      <title>KiloClaw in VS Code, Kilo CLI in KiloClaw</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Fri, 08 May 2026 10:42:14 +0000</pubDate>
      <link>https://dev.to/kilocode/kiloclaw-in-vs-code-kilo-cli-in-kiloclaw-53k0</link>
      <guid>https://dev.to/kilocode/kiloclaw-in-vs-code-kilo-cli-in-kiloclaw-53k0</guid>
      <description>&lt;h1&gt;
  
  
  KiloClaw in VS Code, Kilo CLI in KiloClaw
&lt;/h1&gt;

&lt;h3&gt;
  
  
  When your AI agent lives inside your AI coding assistant (and vice versa)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;By Brendan O'Leary · May 04, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Last week in the Kilo Discord, someone asked if they could SSH into their KiloClaw instance from VS Code. Not to use Kilo Code — just to edit their agent's AGENTS.md file directly. A few messages later, another person asked how to get KiloClaw chat working inside their editor.&lt;/p&gt;

&lt;p&gt;Same underlying need from two directions: how do I talk to my always-on agent while I'm in the middle of writing code?&lt;/p&gt;

&lt;p&gt;Kilo Code shipped answers to both in April. KiloClaw now has a &lt;a href="https://github.com/Kilo-Org/kilocode/pull/7960" rel="noopener noreferrer"&gt;native chat panel&lt;/a&gt; inside the VS Code extension. And the Kilo CLI — which ships built into every KiloClaw instance — got &lt;a href="https://github.com/Kilo-Org/kilocode/pull/8218" rel="noopener noreferrer"&gt;org-aware &lt;code&gt;/kiloclaw&lt;/code&gt; support&lt;/a&gt; so you can manage your cloud agent from the terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhust836hl8cegzu150cm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhust836hl8cegzu150cm.jpeg" alt="KiloClaw and Kilo Code side by side in VS Code" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;KiloClaw in VS Code&lt;/strong&gt; means you open the KiloClaw chat panel alongside your Kilo Code sidebar. You're editing code with Kilo Code's agent on one side, and on the other you have your KiloClaw agent that's running on a server somewhere — doing background work, monitoring things, managing tasks. Interactive coding in one panel, autonomous agent in the other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kilo CLI in KiloClaw&lt;/strong&gt; means your cloud-hosted KiloClaw instance has the full Kilo CLI available. Your agent can use &lt;code&gt;kilo run&lt;/code&gt; to spin up coding sessions on its own projects, use &lt;code&gt;kilo pr&lt;/code&gt; to check out and review pull requests, or invoke any of the 500+ models through the same interface you use locally.&lt;/p&gt;

&lt;p&gt;Josh from the Kilo team &lt;a href="https://discord.com/channels/1349288496988160052/vscode" rel="noopener noreferrer"&gt;said it plainly in Discord&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;KiloClaw ships with Kilo CLI built-in. We are also working to integrate KiloClaw inside of the extension. Being able to start a session, pick it up with KiloClaw, set KiloClaw to do work autonomously, etc. is pretty powerful.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setting up KiloClaw in VS Code
&lt;/h2&gt;

&lt;p&gt;The panel shipped in &lt;a href="https://github.com/Kilo-Org/kilocode/pull/7960" rel="noopener noreferrer"&gt;v7.2.20&lt;/a&gt; and is available now if you have a KiloClaw instance.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update your VS Code extension to the latest version&lt;/li&gt;
&lt;li&gt;In the sidebar, click the KiloClaw icon (chat bubble) or open the Command Palette and run &lt;strong&gt;KiloClaw&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If you already have a KiloClaw instance configured through Kilo Gateway, you'll see the chat panel with your agent's conversation history&lt;/li&gt;
&lt;li&gt;If you don't have one yet, you'll get a setup view that walks you through provisioning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The panel uses the same Stream Chat WebSocket as the web UI, so messages appear in real time. Your agent's responses stream in, and the panel restores automatically when you reopen VS Code.&lt;/p&gt;

&lt;p&gt;One detail I noticed: it uses the same &lt;code&gt;kilo-ui&lt;/code&gt; component library as the rest of the extension. Markdown rendering, buttons, toast notifications all match. Doesn't feel bolted on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Kilo CLI inside KiloClaw
&lt;/h2&gt;

&lt;p&gt;If you're running KiloClaw (either self-hosted via OpenClaw or on Kilo's managed hosting), the Kilo CLI is already there. Your agent can invoke it directly.&lt;/p&gt;

&lt;p&gt;A few patterns I've been using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your KiloClaw agent watches a repo for new PRs and uses &lt;code&gt;kilo pr &amp;lt;number&amp;gt;&lt;/code&gt; to check them out and run a review session. Results come back to you over Telegram, Discord, or wherever you get KiloClaw messages.&lt;/li&gt;
&lt;li&gt;You tell your agent "refactor the authentication module" and it uses &lt;code&gt;kilo run&lt;/code&gt; with the right model and mode to do the work, commits the result, and opens a PR for you to review.&lt;/li&gt;
&lt;li&gt;Your agent has access to multiple repos and can run separate Kilo sessions in each one, coordinating changes that span services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;/kiloclaw&lt;/code&gt; command in the CLI now supports organization contexts too. If you've selected a team via &lt;code&gt;/teams&lt;/code&gt;, running &lt;code&gt;/kiloclaw&lt;/code&gt; resolves to that org's KiloClaw instance rather than your personal one. Useful if your company has a shared agent for CI tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why both
&lt;/h2&gt;

&lt;p&gt;Kilo Code in VS Code is interactive. You're pair-programming with it. It sees your editor state, your file tree, your terminal output. It works in your context.&lt;/p&gt;

&lt;p&gt;KiloClaw is persistent and autonomous. It runs when you're asleep, handles background tasks, monitors systems, processes incoming requests. It works in its own context, on its own machine.&lt;/p&gt;

&lt;p&gt;Having both accessible from the same editor means you can tell your KiloClaw agent to start a background task while you keep coding, check in on what it found overnight, or hand off a tedious refactor while you work on the interesting parts. When it finishes, the results show up right there in your editor.&lt;/p&gt;

&lt;p&gt;I've been doing this for the last week. Writing code with Kilo Code, glancing over at KiloClaw to see what my agent turned up from the research I asked it to do that morning. No tab switching, no opening a separate app. It's there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough edges
&lt;/h2&gt;

&lt;p&gt;This is new. A few things to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The KiloClaw panel requires Kilo Gateway authentication. If you're using Kilo Code with just a bare API key and no Kilo account, you won't see the panel.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;/kiloclaw&lt;/code&gt; command in the CLI only works when connected to Kilo Gateway. Same prerequisite.&lt;/li&gt;
&lt;li&gt;Error handling &lt;a href="https://github.com/Kilo-Org/kilocode/pull/9643" rel="noopener noreferrer"&gt;got improved this week&lt;/a&gt; — there was an issue where failures in the WebSocket connection could leave the panel in a bad state. That's fixed in the latest release.&lt;/li&gt;
&lt;li&gt;Documentation is still catching up. There's an &lt;a href="https://github.com/Kilo-Org/kilocode/pull/9134" rel="noopener noreferrer"&gt;open PR to add a "Setting Up Other Tools" page&lt;/a&gt; for KiloClaw that should cover this in more detail once it lands.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Your coding assistant and your autonomous agent used to be separate tools with separate UIs. Now they share the same extension, the same underlying engine, and the same model ecosystem. I expect the boundary between "interactive coding agent" and "background autonomous agent" to keep blurring.&lt;/p&gt;

&lt;p&gt;I use both daily. KiloClaw runs my email checks, monitors Discord, handles blog research (it's writing this post right now, actually). Kilo Code handles the interactive stuff — writing features, debugging, reviewing diffs. Having them in the same window means I stop context-switching between tools to check on what my agent is doing.&lt;/p&gt;

&lt;p&gt;If you're running KiloClaw already, update your VS Code extension and try the panel. If you're just using Kilo Code, the &lt;code&gt;/kiloclaw&lt;/code&gt; command in the CLI is how you'd set up your first instance.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://blog.kilo.ai/p/kiloclaw-in-vs-code-kilo-cli-in-kiloclaw" rel="noopener noreferrer"&gt;the Kilo Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>The Arrival of GPT-5.5: OpenAI’s New Deep-Thinking Powerhouse</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Mon, 27 Apr 2026 09:19:50 +0000</pubDate>
      <link>https://dev.to/kilocode/the-arrival-of-gpt-55-openais-new-deep-thinking-powerhouse-53fk</link>
      <guid>https://dev.to/kilocode/the-arrival-of-gpt-55-openais-new-deep-thinking-powerhouse-53fk</guid>
      <description>&lt;p&gt;OpenAI recently &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;rolled out GPT-5.5&lt;/a&gt; and its heavy-duty sibling, GPT-5.5 Pro, and everybody wants to put them to the test.&lt;/p&gt;

&lt;p&gt;If you feel like the model landscape is moving faster and faster, you're right. OpenAI's chief data scientist &lt;a href="https://techcrunch.com/2026/04/23/openai-chatgpt-gpt-5-5-ai-model-superapp/" rel="noopener noreferrer"&gt;told TechCrunch&lt;/a&gt; this week that "the last two years have been surprisingly slow," but what he meant is that now we're really moving — &lt;em&gt;now we're cooking with gas&lt;/em&gt;. And that's a good thing for consumers.&lt;/p&gt;

&lt;p&gt;These SOTA models aren't just becoming smarter and more comprehensive; they're also becoming more token-efficient for larger tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's new?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.5&lt;/strong&gt; is OpenAI's latest release for complex professional workloads, building on GPT-5.4 with stronger reasoning, higher reliability, and improved token efficiency on hard tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; is OpenAI's high-capability model optimized for deep reasoning and accuracy on complex, high-stakes workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both new models are now available in the &lt;a href="https://kilo.ai/gateway" rel="noopener noreferrer"&gt;Kilo Gateway&lt;/a&gt; and GPT-5.5 is one of our top recommended models out of the gate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/gateway" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj58omox92dhf8sqjohad.png" alt="GPT-5.5 shown as a top recommended model in the Kilo Gateway model selector" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A New Standard for Complex Work
&lt;/h2&gt;

&lt;p&gt;GPT-5.5 is particularly impressive when it comes to coding and reasoning, and the kind of computer-use and browser skills needed by always-on agents like KiloClaw:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Terminal-Bench 2.0&lt;/strong&gt; (Command-line workflows &amp;amp; tool coordination): 82.7% &lt;em&gt;(vs. GPT-5.4: 75.1% | Claude Opus 4.7: 69.4%)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expert-SWE&lt;/strong&gt; (Internal long-horizon coding tasks ~20 hours): 73.1% &lt;em&gt;(vs. GPT-5.4: 68.5%)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GDPval&lt;/strong&gt; (Knowledge work across 44 occupations): 84.9% &lt;em&gt;(vs. GPT-5.4: 83.0% | Claude Opus 4.7: 80.3%)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OSWorld-Verified&lt;/strong&gt; (Operating real computer environments): 78.7% &lt;em&gt;(vs. GPT-5.4: 75.0% | Claude Opus 4.7: 78.0%)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BrowseComp&lt;/strong&gt;: 84.4% &lt;em&gt;(GPT-5.5 Pro scores 90.1%)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But benchmarks are only half the story. &lt;strong&gt;We had the privilege of pre-testing the alpha release of GPT-5.5&lt;/strong&gt;, and we're ready to share what this means for builders, agents, and the broader AI ecosystem. First of all, it's exciting to see OpenAI continuing to bridge the gap between execution and high-level strategy. Coming just two days after the release of GPT-5.4 Image 2, a stunning new image generation model for multimodal workflows, GPT-5.5 covers a lot of bases for professional workloads. &lt;strong&gt;This new model can transform how engineering teams scale their most complex autonomous workflows.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our testing, GPT-5.5 has proven to be tremendously capable at long-context tasks and agentic coding. Where previous-generation models would occasionally lose the plot during massive refactoring jobs or deep reasoning over large codebases, GPT-5.5 stays locked in.&lt;/p&gt;

&lt;p&gt;More importantly for our ecosystem, it has become a formidable daily driver for KiloClaw as well as an excellent fit for getting a new claw up and running and exploring new use cases. We've been using it to run always-on agents handling highly complex, multi-step professional work, and the reliability jump is palpable.&lt;/p&gt;

&lt;p&gt;As we noted in our recent &lt;a href="https://blog.kilo.ai/p/we-gave-claude-opus-47-and-kimi-k26" rel="noopener noreferrer"&gt;deep dive comparing Claude Opus 4.7 and Moonshot's Kimi K2.6&lt;/a&gt;, the frontier of AI is fiercely competitive right now. While Opus 4.7 and Kimi K2.6 brought massive leaps in their own right, GPT-5.5 introduces a new class of autonomous capability that specifically targets professional, high-stakes workflows where fewer retries and higher reliability directly translate to better outcomes.&lt;/p&gt;

&lt;p&gt;GPT-5.5 is definitely crushing a wide range of benchmarks, which fits with our experience testing the model in Kilo Code and KiloClaw. Significantly, it topped the Artificial Analysis Intelligence Index by 3 points, &lt;a href="https://artificialanalysis.ai/articles/openai-gpt5-5-is-the-new-leading-AI-model" rel="noopener noreferrer"&gt;breaking a three-way tie&lt;/a&gt; with Anthropic and Google.&lt;/p&gt;

&lt;p&gt;In our testing, GPT-5.5 did have some issues with UI-related design tasks, but we found that more specific instructions helped resolve some of those problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  So which one should you use?
&lt;/h2&gt;

&lt;p&gt;GPT-5.5 is priced higher than GPT-5.4, reflecting its heavy-duty reasoning capabilities. OpenAI has pushed pricing up again with this release.&lt;/p&gt;

&lt;p&gt;Even so, GPT-5.5 &lt;strong&gt;($5 / Mtok input, $30 / Mtok output, $0.50 / Mtok cache)&lt;/strong&gt; is more approachable than it might look from the outside. The 5.5 series is more token-efficient than 5.4. For hard tasks, this efficiency often results in a &lt;em&gt;lower&lt;/em&gt; actual cost per completed task because the model more often gets it right on the first try, without needing endless prompt engineering or loop retries.&lt;/p&gt;

&lt;p&gt;GPT-5.5 often reaches higher-quality outputs with fewer retries, so it can be more token-efficient on real workflows even at higher reasoning effort. And good news for Kilo Coders: &lt;strong&gt;it's the most token-efficient at coding workflows.&lt;/strong&gt;&lt;/p&gt;
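
&lt;p&gt;One way to reason about cost per completed task is to divide the cost of a single attempt by the chance that the attempt succeeds. The sketch below uses the GPT-5.5 prices quoted above; everything else is an assumption for illustration, including the token counts, the success rates, the function names, and the reading of the "cache" rate as the price of cached input tokens. It is not a benchmark result.&lt;/p&gt;

```python
# Prices quoted in the post, in USD per million tokens.
INPUT_PRICE = 5.00
OUTPUT_PRICE = 30.00
CACHED_INPUT_PRICE = 0.50  # assumption: the "cache" rate applies to cached input


def attempt_cost(input_tokens, cached_tokens, output_tokens):
    """Cost in USD of one attempt; cached_tokens is the cached share of the input."""
    fresh_tokens = input_tokens - cached_tokens
    return (
        fresh_tokens * INPUT_PRICE
        + cached_tokens * CACHED_INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    ) / 1_000_000


def cost_per_completed_task(cost_of_attempt, success_rate):
    """Expected spend until one attempt succeeds, assuming independent retries."""
    return cost_of_attempt / success_rate


# Hypothetical workload: 200k input tokens (150k of them cached), 20k output tokens.
one_attempt = attempt_cost(200_000, 150_000, 20_000)
first_try = cost_per_completed_task(one_attempt, success_rate=1.0)
with_retries = cost_per_completed_task(one_attempt, success_rate=0.5)
print(round(one_attempt, 3), round(first_try, 3), round(with_retries, 3))
```

&lt;p&gt;In this toy setup, a model that succeeds only half the time costs twice as much per finished task at the same per-token price, which is why a pricier model with fewer retries can still come out ahead.&lt;/p&gt;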

&lt;p&gt;We would also like to echo OpenAI's own advice here: "Higher reasoning can use more tokens, so customers should match reasoning effort to the task."&lt;/p&gt;

&lt;p&gt;In-memory prompt caching is &lt;strong&gt;not supported&lt;/strong&gt; for GPT-5.5. Caching for this model relies exclusively on extended prompt caching. During inference, the model caches tokens from previous requests directly on GPU-local storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does it Claw?
&lt;/h2&gt;

&lt;p&gt;We're excited to see what Kilo users around the world do with it. Like the new Opus, it's super smart. But is it &lt;em&gt;too smart&lt;/em&gt; for daily tasks? Or will it become your daily driver?&lt;/p&gt;

&lt;p&gt;My prediction is that GPT-5.5 will compete more directly with the latest Opus release for coding, but be more of a top-agent driver in Hermes and OpenClaw workflows like &lt;a href="https://kilo.ai/kiloclaw" rel="noopener noreferrer"&gt;KiloClaw&lt;/a&gt;: sub-agents will likely need to use smaller models or OSS models to remain cost-efficient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/gateway" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvprep86439j667bsv7f.png" alt="GPT-5.5 and GPT-5.5 Pro shown in the Kilo model selector" width="680" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That said, the only way to really&lt;/p&gt;

</description>
      <category>openai</category>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Shell Security Plugin</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Mon, 27 Apr 2026 09:16:14 +0000</pubDate>
      <link>https://dev.to/kilocode/shell-security-plugin-4p16</link>
      <guid>https://dev.to/kilocode/shell-security-plugin-4p16</guid>
      <description>&lt;p&gt;I ran &lt;code&gt;openclaw security audit&lt;/code&gt; on my instance the other day and got back a wall of text. Six findings — one critical, three warnings, two informational. I stared at it for a minute, scrolled through the nested objects, and thought: "Okay, but what should I actually &lt;em&gt;do&lt;/em&gt; about this?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!E7DS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4eebf49-f3ea-4980-a9bc-40fae53963f1_1080x298.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50a7c4amv6n4pmbb73n4.png" alt="Screenshot of openclaw security audit JSON output showing six findings" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's the gap the new &lt;a href="https://github.com/Kilo-Org/shell-security" rel="noopener noreferrer"&gt;Shell Security&lt;/a&gt; plugin fills. It takes that same audit output, sends the findings (not your secrets, not your config) to the &lt;a href="https://kilo.ai" rel="noopener noreferrer"&gt;KiloCode&lt;/a&gt; Security Advisor API, and gives you back a prioritized report with specific remediation steps. The whole thing happens in your chat — Telegram, Slack, the Control UI, wherever you talk to your agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;The plugin is a thin bridge between two things that already exist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;openclaw security audit&lt;/code&gt;&lt;/strong&gt; — the built-in CLI command that checks your local config for common security foot-guns (weak models without sandboxing, exposed runtime tools, missing trusted proxies, multi-user setups without isolation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KiloCode's Security Advisor API&lt;/strong&gt; — an endpoint that takes those findings and returns expert analysis with context-specific remediation guidance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The plugin runs the audit locally, packages the JSON output, and sends it off. What comes back is a markdown report that covers what was found, why it matters, and what to do about it — organized by priority.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing it
&lt;/h2&gt;

&lt;p&gt;It's currently a dev release, with a stable release coming soon!&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins &lt;span class="nb"&gt;install&lt;/span&gt; @kilocode/shell-security
openclaw plugins &lt;span class="nb"&gt;enable &lt;/span&gt;shell-security
openclaw gateway restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gateway restart is a one-time thing after install. If you're talking to your agent through Slack or Telegram, you'll see a brief connection blip and then it's back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two ways to run it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Slash command (recommended):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The slash command runs the plugin directly and renders the full report. It bypasses the LLM's summarization layer entirely, so you get the complete output regardless of which model you're running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can also just say "run a security checkup" or "audit my OpenClaw config" and the agent will call the tool. One thing to know: if you're running a smaller model (Haiku, GPT-x-nano), it might paraphrase or truncate the report. Capable models like Sonnet or the latest GPT models handle it fine. When in doubt, use the slash command.&lt;/p&gt;

&lt;h2&gt;
  
  
  First-run authentication
&lt;/h2&gt;

&lt;p&gt;The first time you run it, the plugin prompts you to connect your KiloCode account through a device auth flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a URL in your browser&lt;/li&gt;
&lt;li&gt;Enter a code&lt;/li&gt;
&lt;li&gt;Sign in or create a free account&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;/security-checkup&lt;/code&gt; again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After that, the token is saved and you never see the auth flow again. There's a gateway reload on first auth (the plugin writes the token to your config), but subsequent runs are instant.&lt;/p&gt;

&lt;p&gt;If you're running OpenClaw in CI or a container, you can skip the interactive flow entirely by setting &lt;code&gt;KILOCODE_API_KEY&lt;/code&gt; as an environment variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What gets sent (and what doesn't)
&lt;/h2&gt;

&lt;p&gt;This matters. Your OpenClaw instance has access to your filesystem, your API keys, your chat history. The plugin doesn't send any of that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The JSON output of &lt;code&gt;openclaw security audit&lt;/code&gt; — finding IDs and summaries, no secrets&lt;/li&gt;
&lt;li&gt;Your OpenClaw version and plugin version&lt;/li&gt;
&lt;li&gt;Your instance's public IP (for optional remote probes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not sent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Config file contents&lt;/li&gt;
&lt;li&gt;API keys, secrets, or tokens&lt;/li&gt;
&lt;li&gt;Chat history&lt;/li&gt;
&lt;li&gt;Workspace files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything goes over HTTPS, authenticated with your KiloCode account token.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the report looks like
&lt;/h2&gt;

&lt;p&gt;On my instance, the report came back with findings grouped by severity — the critical one about small models running without sandboxing at the top, followed by the warnings about trusted proxies and multi-user heuristics, and then the informational items. Each finding includes context about why it's a risk and concrete steps to fix it.&lt;/p&gt;

&lt;p&gt;It's... a lot of text right now. The formatting still needs work — the dev release is functional but not polished. There's also a bug where the KiloClaw call-to-action shows up even if you're already a KiloClaw user. These are known rough edges that'll get smoothed out before the stable release.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is useful
&lt;/h2&gt;

&lt;p&gt;Running &lt;code&gt;openclaw security audit&lt;/code&gt; is already good practice. But JSON output requires you to interpret each finding yourself, look up what the check IDs mean, and figure out the right remediation. The Security Advisor layer turns those findings into specific guidance you can act on immediately.&lt;/p&gt;

&lt;p&gt;For anyone running OpenClaw as a personal assistant (which is most of us), the security surface is real. Your agent has shell access, filesystem access, web browsing. A misconfigured model fallback or an unintended multi-user exposure means your agent could be manipulated by untrusted input. Having something that checks this and explains the results in plain language saves you from reading JSON and guessing at severity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current status
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.npmjs.com/package/@kilocode/shell-security" rel="noopener noreferrer"&gt;npm package&lt;/a&gt; is live and the &lt;a href="https://github.com/Kilo-Org/shell-security" rel="noopener noreferrer"&gt;source is on GitHub&lt;/a&gt; under MIT license. A stable release is coming — the main work remaining is formatting improvements and fixing the conditional CTA logic.&lt;/p&gt;

&lt;p&gt;Install it, run &lt;code&gt;/shell-security&lt;/code&gt;, see what it finds. It takes about thirty seconds.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>New VS Code Extension - Week Three: Memory, Stability, and Moving at Kilo Speed Into the Future</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Fri, 24 Apr 2026 08:16:10 +0000</pubDate>
      <link>https://dev.to/kilocode/new-vs-code-extension-week-three-memory-stability-and-moving-at-kilo-speed-into-the-future-21cd</link>
      <guid>https://dev.to/kilocode/new-vs-code-extension-week-three-memory-stability-and-moving-at-kilo-speed-into-the-future-21cd</guid>
      <description>&lt;p&gt;Three weeks ago we GA'd the &lt;a href="https://blog.kilo.ai/p/new-kilo-for-vs-code-is-live" rel="noopener noreferrer"&gt;completely rebuilt Kilo Code extension&lt;/a&gt; for VS Code. &lt;a href="https://blog.kilo.ai/p/new-vs-code-extension-week-one-what/" rel="noopener noreferrer"&gt;Week one&lt;/a&gt; was about what we were hearing and what we were shipping. &lt;a href="https://blog.kilo.ai/p/new-vs-code-week-two/" rel="noopener noreferrer"&gt;Week two&lt;/a&gt; was about addressing the most urgent feedback and bumps.&lt;/p&gt;

&lt;p&gt;This week is about the two other areas of frequent feedback and challenges: &lt;strong&gt;memory usage on Windows&lt;/strong&gt; and &lt;strong&gt;session stability under sustained use&lt;/strong&gt;. Both are materially better now than they were a week ago. Neither is 100% fixed and "done": we can see from open GitHub issues that some of you still hit rough edges. But the experience is significantly improved, especially on Windows when using Agent Manager.&lt;/p&gt;

&lt;p&gt;Across the week we shipped &lt;strong&gt;80+ Kilo PRs&lt;/strong&gt; and merged &lt;strong&gt;three more upstream OpenCode releases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!NfA1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100f1bf8-af1b-4b8d-b345-313a8792c17e_1536x1024.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmll6qxzvcm3w3h0saqv.png" alt="Screenshot showing the Kilo VS Code extension Agent Manager interface with multiple active sessions" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Windows Memory: A Big Step Forward
&lt;/h2&gt;

&lt;p&gt;This is the one we know has caused the most pain. Users on Windows reported the Kilo core process climbing into multiple GB of RAM within minutes of opening Agent Manager and staying there. A handful of you sent us heap snapshots (thank you), which helped us track down the root cause of some harder-to-reproduce issues.&lt;/p&gt;

&lt;p&gt;The high-level story: Agent Manager was polling git status and diffs through the Kilo core subprocess, and on Windows the combination of IPC round-trips, diff payload sizes, and allocator behavior meant freed memory wasn't being returned to the OS cleanly. In v7.2.20 we've restructured that path (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9046" rel="noopener noreferrer"&gt;#9046&lt;/a&gt;) and made the extension much more careful about what it holds in memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent Manager's git work now runs directly in the extension host, not through the core process.&lt;/li&gt;
&lt;li&gt;We cap how much of any single diff we'll read into memory, so opening a very large file no longer causes a spike the allocator can't recover from.&lt;/li&gt;
&lt;li&gt;We also tuned the allocator on the core process itself to release memory back to the OS more promptly on Windows.&lt;/li&gt;
&lt;/ul&gt;
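
&lt;p&gt;The capping idea in the second bullet can be sketched roughly as follows. This is Python for brevity and is not the extension's actual code (see the linked PR for that). The 1 MiB limit and the name &lt;code&gt;read_capped&lt;/code&gt; are assumptions for illustration.&lt;/p&gt;

```python
import io

MAX_DIFF_BYTES = 1024 * 1024  # hypothetical cap; the real limit is not stated


def read_capped(stream, limit=MAX_DIFF_BYTES):
    """Read at most `limit` bytes and report whether the input was cut off."""
    data = stream.read(limit + 1)  # one extra byte reveals truncation
    truncated = len(data) > limit
    return data[:limit], truncated


# Usage with an in-memory stand-in for a large diff.
diff = io.BytesIO(b"+" * 5000)
chunk, truncated = read_capped(diff, limit=1000)
print(len(chunk), truncated)
```

&lt;p&gt;Reading one byte past the limit lets the caller tell "exactly at the cap" apart from "cut off" without ever holding the whole diff in memory.&lt;/p&gt;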

&lt;p&gt;If you were running on a downgraded 5.x build because of memory issues, this is the release to come back on. If you're still seeing unbounded growth, please keep the issues coming — the heap-snapshot command we added this cycle (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9034" rel="noopener noreferrer"&gt;#9034&lt;/a&gt;) makes those reports much easier to act on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Session Stability: Fewer Interruptions
&lt;/h2&gt;

&lt;p&gt;The second theme was sessions getting interrupted mid-flow — usually recoverable by sending another message or re-opening the session/extension. Most of the reports we got traced back to a handful of specific state-machine edges, and those are now meaningfully better.&lt;/p&gt;

&lt;p&gt;The one we heard about most often was sessions ending up stuck — most visibly when VS Code was closed while a suggestion prompt was still showing, which left the session permanently marked busy and any follow-up message queued forever. Sessions now go idle correctly while waiting on a suggestion response (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9199" rel="noopener noreferrer"&gt;#9199&lt;/a&gt;). A related set of stuck states around the end-of-plan flow — where "Start new session" and "Continue here" didn't reliably transition you into the handover session — also got fixed, so those buttons now move you into a new session that stays visibly busy until the handover summary lands (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9245" rel="noopener noreferrer"&gt;#9245&lt;/a&gt;, &lt;a href="https://github.com/Kilo-Org/kilocode/pull/9300" rel="noopener noreferrer"&gt;#9300&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Everyday chat behavior got a lot smoother too. The most common irritation was the chat view snapping back to the bottom while you were trying to read earlier context during a streaming response; that no longer happens, and scrolling back through long sessions now correctly reloads earlier history from the virtualized list (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9236" rel="noopener noreferrer"&gt;#9236&lt;/a&gt;, &lt;a href="https://github.com/Kilo-Org/kilocode/pull/9194" rel="noopener noreferrer"&gt;#9194&lt;/a&gt;). Switching between long sessions in Agent Manager — which used to briefly freeze the UI — is now near-instant, with the chat view self-healing if messages arrived while it was in the background (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/8911" rel="noopener noreferrer"&gt;#8911&lt;/a&gt;). Smaller queue and layout fixes also landed around follow-up prompts and tool output interleaving.&lt;/p&gt;

&lt;p&gt;Finally, a nice performance-and-stability win from the community: &lt;a href="https://github.com/IamCoder18" rel="noopener noreferrer"&gt;@IamCoder18&lt;/a&gt; landed visibility-aware git polling plus resolution caching in Agent Manager's git stats poller (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/8703" rel="noopener noreferrer"&gt;#8703&lt;/a&gt;), meaningfully reducing the number of git subprocesses the extension spawns on repos with many worktrees.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Capabilities This Cycle
&lt;/h2&gt;

&lt;p&gt;Stability was the priority, but we still shipped meaningful new capability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fork sessions from any user message&lt;/strong&gt; — both in Agent Manager (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9207" rel="noopener noreferrer"&gt;#9207&lt;/a&gt;) and in the sidebar (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9244" rel="noopener noreferrer"&gt;#9244&lt;/a&gt;). Branch at any point without losing the original.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KiloClaw chat panel in VS Code&lt;/strong&gt; — the KiloClaw group chat experience now lives directly inside the editor (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/7960" rel="noopener noreferrer"&gt;#7960&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Folder @-mentions&lt;/strong&gt; — reference a folder with @ and include its top-level file contents as context (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9023" rel="noopener noreferrer"&gt;#9023&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autocomplete backend prewarm&lt;/strong&gt; — inline completions are ready on the first keystroke without having to open the Kilo sidebar first, and autocomplete state refreshes when workspace folders change (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9305" rel="noopener noreferrer"&gt;#9305&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heap snapshots from the Command Palette&lt;/strong&gt; — capture a snapshot of the bundled Kilo core directly from VS Code (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9034" rel="noopener noreferrer"&gt;#9034&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Contribute on GitHub" CTA in Marketplace&lt;/strong&gt; — a subtle footer link inviting contributions of new skills, modes, and MCP servers (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/9099" rel="noopener noreferrer"&gt;#9099&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Upstream OpenCode
&lt;/h2&gt;

&lt;p&gt;Three more OpenCode upstream releases merged this cycle — v1.4.4, v1.4.5, and v1.4.6 — bringing continued improvements to session sync, provider compatibility, Windows terminal handling, and the underlying AI SDK layer. Building on a shared open-source foundation continues to pay off: work from the broader OpenCode community lands in Kilo automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Codebase Indexing Progress
&lt;/h2&gt;

&lt;p&gt;Community contributor &lt;a href="https://github.com/shssoichiro" rel="noopener noreferrer"&gt;@shssoichiro&lt;/a&gt;'s codebase indexing work (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/6966" rel="noopener noreferrer"&gt;#6966&lt;/a&gt;) remains active. The branch is being kept current against main, review iterations are ongoing, and we're closing in on a form we can land. This is a substantial feature and we want to get it right — thank you for the sustained effort here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community Update
&lt;/h2&gt;

&lt;p&gt;Some numbers and names from this cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;80+ PRs merged&lt;/strong&gt; on top of the upstream OpenCode work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 upstream OpenCode releases merged&lt;/strong&gt; — v1.4.4, v1.4.5, and v1.4.6.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple stable releases&lt;/strong&gt; promoted to the marketplace through the period, with v7.2.20 as the current stable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you to community contributors whose work landed or continued this cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/shssoichiro" rel="noopener noreferrer"&gt;@shssoichiro&lt;/a&gt; — continued work on codebase indexing (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/6966" rel="noopener noreferrer"&gt;#6966&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/IamCoder18" rel="noopener noreferrer"&gt;@IamCoder18&lt;/a&gt; — visibility-aware git polling in GitStatsPoller (&lt;a href="https://github.com/Kilo-Org/kilocode/pull/8703" rel="noopener noreferrer"&gt;#8703&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And broad thanks to every community member who filed heap snapshots, reproduction steps, and Discord reports, and who sustained the long-running Windows performance thread (&lt;a href="https://github.com/Kilo-Org/kilocode/issues/8030" rel="noopener noreferrer"&gt;#8030&lt;/a&gt;). That conversation is the reason we had the signal we needed to tackle the memory work head-on this week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving at Kilo Speed Into the Future
&lt;/h2&gt;

&lt;p&gt;This is the last of the regular weekly updates in this series. The core issues that we highlighted in Week 1 — rate limiting, Plan/Ask strictness, human-in-the-loop controls, config resilience, and Windows memory — are either resolved or meaningfully better. We will continue to focus on smoothing out the rough edges in the near future.&lt;/p&gt;

&lt;p&gt;We will also be driving Kilo further towards the vision of where agentic coding is going, enabling engineering teams to ship at Kilo Speed safely and confidently, faster than ever before. We are excited about this future and believe that the new V7 is a strong foundation to build on. Agent Manager continues to improve for those who like to run multiple agent sessions in parallel, and will only become more useful as models grow more capable and need less oversight. And when a particular change or workstyle requires closer agent supervision and pair programming, you can do that too. The AI landscape is evolving quickly and models keep advancing, and the tools we use need to keep pace.&lt;/p&gt;

&lt;p&gt;To everyone who showed up over these three weeks — the issue filers, the PR authors, the Discord commenters, the prerelease testers, the heap-snapshot senders, and the folks who point to the future with feature requests — &lt;strong&gt;thank you&lt;/strong&gt;. Your feedback, issues, and pull requests are genuinely what makes this community great. We value every piece of it, and we'll keep making the extension better because of it.&lt;/p&gt;

&lt;p&gt;See you in the release notes.&lt;/p&gt;

&lt;p&gt;— Josh and Mark&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move at Kilo Speed.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vscode</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The future of Product Managers</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Thu, 23 Apr 2026 12:54:39 +0000</pubDate>
      <link>https://dev.to/kilocode/the-future-of-product-managers-4k28</link>
      <guid>https://dev.to/kilocode/the-future-of-product-managers-4k28</guid>
      <description>&lt;p&gt;A product leader we know has 15 years of experience shipping developer tools. He spent a decade at a household name. He is, genuinely, one of the best product minds we've encountered in this industry.&lt;/p&gt;

&lt;p&gt;He can't get a conversation for a group PM role.&lt;/p&gt;

&lt;p&gt;That is a signal, not a market blip.&lt;/p&gt;

&lt;p&gt;We've spent a lot of time talking about what AI is doing to engineers – how one developer with the right tools now ships what used to require a team of five. But we had an adjacent question: what happens to product managers?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep6jgi2nbd309il72zk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep6jgi2nbd309il72zk2.png" alt="Illustration representing the collapse of the traditional PM-to-engineer shipping funnel" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Shipping isn't a funnel anymore
&lt;/h2&gt;

&lt;p&gt;For years, software development worked like a funnel. PMs turned customer insights into specs. Engineers turned specs into code. The funnel created a natural place for the PM to sit – upstream, owning the translation layer.&lt;/p&gt;

&lt;p&gt;Shipping was expensive. So you needed someone to decide what was worth shipping.&lt;/p&gt;

&lt;p&gt;That's no longer true. Shipping is close to free now. So what is a PM's role now that the funnel has collapsed and PMs aren't filtering a very costly resource (engineering time)? Is there still a place for PMs in this new world?&lt;/p&gt;

&lt;p&gt;As former PMs ourselves, we're watching this shift from two very different vantage points. At Kilo, there are about 40 people and one PM. We operate with a &lt;a href="https://blog.kilo.ai/p/our-engineers-own-a-number" rel="noopener noreferrer"&gt;WAUzer (Weekly Active User) model&lt;/a&gt; – every engineer owns a single product area and is accountable for the weekly active users in that area. Every Monday, Evgeny would stand up for two minutes: here's what I did on cloud agents, here are the numbers, here's my target for next week. He was fast. He was accountable. And across those product areas, we saw roughly 10% week-over-week growth.&lt;/p&gt;

&lt;p&gt;The product hat shifted to engineers. And it worked.&lt;/p&gt;

&lt;p&gt;But it didn't work everywhere: the VS Code extension had too much surface area for one engineer to own clearly. So we brought in Josh. He runs a pod. He decides what gets built. Traditional PM model.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://asksolo.ai" rel="noopener noreferrer"&gt;Solo&lt;/a&gt; (Asher's company), it's just two people – one developer – moving at a pace that would have required a team of 10 three years ago. No PM at all. No coordination layer. The product question and the building question sit with the same person.&lt;/p&gt;

&lt;p&gt;Two different experiments. Same conclusion forming.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's always been vibe coding
&lt;/h2&gt;

&lt;p&gt;"PMs were the original vibe coders. We wrote the spec, and the engineers were our LLMs."&lt;/p&gt;

&lt;p&gt;That framing came out of a conversation between us. Because if the spec-to-code handoff is getting absorbed by AI tooling – if engineers can hold the product context and build without a translation layer – then the PM role has to move. The question is where.&lt;/p&gt;

&lt;p&gt;We see two paths forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path one: shift left toward go-to-market.&lt;/strong&gt; The thing that's genuinely hard, even in an AI-native company, is knowing what to build. Not technically – but commercially. What will people pay for? What problem are we actually solving? Who is the buyer, and do we have them before we build?&lt;/p&gt;

&lt;p&gt;That's where PMs might land. Not writing specs, but sitting closer to sales, customer research, and market discovery to orchestrate the product strategy and business rationale for building a feature. A big portion of the PM's role will be saying no to features to prevent bloat, and identifying customers who are willing to pay for features before building them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path two: the long thin layer – engineers who wear the product hat.&lt;/strong&gt; Each engineer owns their area completely. Customer conversations, support, metrics, roadmap decisions – all of it. No handoff, no telephone game.&lt;/p&gt;

&lt;p&gt;The upside is accountability. The downside is that it requires people who can go wide – technically sharp AND commercially minded AND customer-facing. That's a rare profile. And at some point, a customer doesn't want your one thin area. They want the whole package.&lt;/p&gt;

&lt;p&gt;Both paths are real. You'll see companies betting on each.&lt;/p&gt;

&lt;p&gt;The traditional shipping funnel is gone. It's dead in startups now and will die in F100s over the next 5 years. The people who figure out the new shape of product ownership – whether that's engineers, PMs who've shifted left, or something we don't have a name for yet – are the ones who'll be standing in three years.&lt;/p&gt;

&lt;p&gt;The senior product leader we mentioned will land somewhere. His experience is real. But the role he's looking for may not look like what it used to. The best thing any PM can do right now is stop waiting for the old model to come back and start experimenting with new models.&lt;/p&gt;

&lt;p&gt;Developers are working in the future. PMs need to join them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>We Gave Claude Opus 4.7 and Kimi K2.6 the Same Workflow Orchestration Spec</title>
      <dc:creator>Darko from Kilo</dc:creator>
      <pubDate>Thu, 23 Apr 2026 12:47:55 +0000</pubDate>
      <link>https://dev.to/kilocode/we-gave-claude-opus-47-and-kimi-k26-the-same-workflow-orchestration-spec-1b9m</link>
      <guid>https://dev.to/kilocode/we-gave-claude-opus-47-and-kimi-k26-the-same-workflow-orchestration-spec-1b9m</guid>
      <description>&lt;p&gt;&lt;a href="https://www.kimi.com/blog/kimi-k2-6" rel="noopener noreferrer"&gt;Kimi K2.6&lt;/a&gt; launched on April 20, 2026, four days after Anthropic released &lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;Claude Opus 4.7&lt;/a&gt;. We gave both models the same spec for FlowGraph, a persistent workflow orchestration API with DAG validation, atomic worker claims, lease expiry recovery, pause/resume/cancel, and SSE event streaming. Then we reviewed the code and reproduced the edge cases the models' own tests did not cover.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!3LI_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a9dd82-4071-49c4-9392-942df218e832_1944x1220.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftedrcmifweg0hliyzz7w.png" alt="Scorecard table comparing Claude Opus 4.7 and Kimi K2.6 across benchmark categories" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Claude Opus 4.7 scored &lt;strong&gt;91/100&lt;/strong&gt; and Kimi K2.6 scored &lt;strong&gt;68/100&lt;/strong&gt; on the same build. Kimi K2.6 reached 75% of Claude Opus's score at &lt;strong&gt;19% of the cost&lt;/strong&gt;, but the 23-point gap sits in lease handling, scheduling, and live streaming (the parts its own tests never exercised).&lt;/p&gt;
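
&lt;p&gt;For readers who want to check the arithmetic behind the TL;DR, here is a short Python sketch. It uses only the two scores and the 19% cost figure stated above. The "points per unit of cost" metric is our own crude illustration, not part of the scoring rubric.&lt;/p&gt;

```python
claude_score, kimi_score = 91, 68
kimi_relative_cost = 0.19  # Kimi K2.6 cost as a fraction of Claude Opus 4.7's

# Share of Claude Opus 4.7's score that Kimi K2.6 reached.
score_ratio = kimi_score / claude_score

# Points earned per unit of cost, with Claude's cost normalized to 1.0.
# This is an illustrative metric only; it ignores which parts failed.
claude_points_per_cost = claude_score / 1.0
kimi_points_per_cost = kimi_score / kimi_relative_cost

print(f"score ratio: {score_ratio:.0%}")
print(f"points per unit cost: Claude {claude_points_per_cost:.0f}, Kimi {kimi_points_per_cost:.0f}")
```

&lt;p&gt;A ratio like this says nothing about where the missing points sit, which is why the rest of the comparison looks at the specific failures rather than the headline number.&lt;/p&gt;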

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!CFgp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdf3acc-d975-4947-b5ae-d38e4ca523e3_1642x298.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flluikub5rkrh308py0xw.png" alt="Pricing comparison table showing Claude Opus 4.7 at roughly 5x input and 6x output cost vs Kimi K2.6" width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Opus 4.7 runs at roughly 5x the input cost and 6x the output cost of Kimi K2.6. That is the gap we wanted to pressure-test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Workflow Orchestration Spec
&lt;/h2&gt;

&lt;p&gt;A workflow engine runs jobs like a nightly settlement: fetch captured payments, charge customers, send receipts, publish analytics. Four steps with dependencies between them, retries when a step fails, and recovery when a worker crashes mid-step. Temporal, Airflow, and AWS Step Functions all solve the same problem at different scales.&lt;/p&gt;
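
&lt;p&gt;As a rough illustration of what DAG validation means for a job like this, the Python sketch below orders the four settlement steps and rejects cycles using the standard library's &lt;code&gt;graphlib&lt;/code&gt; (Python 3.9+). The dependency edges and the helper name &lt;code&gt;validate_dag&lt;/code&gt; are assumptions for illustration. They do not come from the FlowGraph spec or from either model's output.&lt;/p&gt;

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map for the nightly settlement: step -> prerequisites.
settlement = {
    "fetch_captured_payments": set(),
    "charge_customers": {"fetch_captured_payments"},
    "send_receipts": {"charge_customers"},
    "publish_analytics": {"charge_customers"},
}


def validate_dag(steps):
    """Return a valid execution order, or raise ValueError on bad input."""
    for step, deps in steps.items():
        unknown = deps - set(steps)
        if unknown:
            raise ValueError(f"{step} depends on unknown steps: {sorted(unknown)}")
    try:
        # static_order() raises CycleError if the graph is not acyclic.
        return list(TopologicalSorter(steps).static_order())
    except CycleError as exc:
        raise ValueError(f"workflow contains a cycle: {exc.args[1]}") from exc


order = validate_dag(settlement)
print(order)
```

&lt;p&gt;Rejecting unknown dependencies and cycles when a workflow is defined keeps the scheduler from ever waiting on a step that can never become runnable.&lt;/p&gt;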

&lt;p&gt;Most of our API comparisons test a wide range of skills (architecture, auth, filtering, error handling). For this test we wanted a single deep build where correctness was the main axis. A workflow engine with DAG validation, atomic step claims, lease expiry recovery, retry scheduling, and pause/resume/cancel semantics has objectively right and wrong answers. Either two workers can win the same step or they can't. Either an expired lease is recovered or it isn't. Either a step becomes runnable when its dependencies succeed or it doesn't.&lt;/p&gt;

&lt;p&gt;The spec also calls out at-least-once execution, deterministic scheduling across all eligible steps, and SQLite as the source of truth. The full spec is 1,042 lines and covers 20 endpoints across workflow definitions, runs, workers, events, health, and metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt
&lt;/h2&gt;

&lt;p&gt;We ran both tests in &lt;a href="https://kilocode.ai/" rel="noopener noreferrer"&gt;Kilo CLI&lt;/a&gt; and gave both models the same prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Read @spec.md and build the project in the current directory. Treat @spec.md as the source of truth. Do not simplify this into a mock, toy app, or basic CRUD scaffold. Create all code, configuration, Prisma schema, tests, and README needed for a runnable project. Work autonomously and continue until the implementation is complete. Before you finish, install dependencies, run the test suite, fix any failures you can reproduce, and make sure the project is runnable."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude Opus 4.7 ran on high thinking mode. Kimi K2.6 ran on thinking mode. Each model worked in its own empty directory with no shared state.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Each Model Produced
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!1-Kt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F235d6a11-70ea-4fdc-9e2c-cfa06d1e8392_2716x302.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttgxo0b114jmma0n9rge.png" alt="Side-by-side comparison of project output from Claude Opus 4.7 and Kimi K2.6" width="800" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Opus 4.7 finished in about 20 minutes. Kimi K2.6 took longer on the clock, but we are not scoring elapsed time here. Kimi K2.6 was released the day of this test and provider availability is still limited. Wall-clock comparisons against a model as well-supported as Claude Opus 4.7 would distort the picture. Expect that gap to close as more providers host Kimi K2.6.&lt;/p&gt;

&lt;p&gt;Both models delivered the project shape we asked for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prisma with SQLite as the source of truth&lt;/li&gt;
&lt;li&gt;Hono routes for workflow definitions, runs, worker actions, events, health, and metrics&lt;/li&gt;
&lt;li&gt;Conditional &lt;code&gt;updateMany&lt;/code&gt; for step claiming&lt;/li&gt;
&lt;li&gt;Retry and lease-expiry scheduling&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;RunEvent&lt;/code&gt; table for audit logs&lt;/li&gt;
&lt;li&gt;Readmes with setup instructions and at-least-once execution notes&lt;/li&gt;
&lt;/ul&gt;
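&lt;p&gt;For readers unfamiliar with the conditional-update pattern in the list above, the sketch below shows the idea: the status check and the write happen in one operation, and the caller looks at how many rows changed. The in-memory &lt;code&gt;updateMany&lt;/code&gt; here only mimics the shape of Prisma's &lt;code&gt;updateMany&lt;/code&gt;, which returns a &lt;code&gt;count&lt;/code&gt;. The row fields and the &lt;code&gt;runnable&lt;/code&gt; status name are assumptions, and this is not either model's code.&lt;/p&gt;

```typescript
// Hypothetical row shape; field and status names are assumptions.
type StepRow = { id: string; status: string; workerId: string | null };

// In-memory stand-in for the step table. A real implementation would issue a
// single conditional UPDATE against SQLite instead.
const rows: StepRow[] = [{ id: "s1", status: "runnable", workerId: null }];

// Only rows matching every field in `where` are changed, and the number of
// changed rows is returned to the caller.
function updateMany(
  where: { id: string; status: string },
  data: { status: string; workerId: string },
): { count: number } {
  const matches = rows
    .filter((r) => r.id === where.id)
    .filter((r) => r.status === where.status);
  matches.forEach((r) => {
    r.status = data.status;
    r.workerId = data.workerId;
  });
  return { count: matches.length };
}

// A claim succeeds only if this call was the one that flipped the row.
function tryClaim(stepId: string, workerId: string): boolean {
  const result = updateMany(
    { id: stepId, status: "runnable" },
    { status: "running", workerId },
  );
  return result.count === 1;
}

const first = tryClaim("s1", "worker-a");
const second = tryClaim("s1", "worker-b");
console.log(first, second); // true false
```

&lt;p&gt;The second worker loses because the row no longer matches the condition. In a database, that check and write are a single statement, which is what keeps two workers from winning the same step.&lt;/p&gt;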

&lt;h2&gt;
  
  
  Both Models Said Their Tests Passed
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.7 ran 31 tests across 6 files. Every test passed. Kimi K2.6 ran 20 tests inside a single file. Every test passed.&lt;/p&gt;

&lt;p&gt;If we had stopped there, the two implementations would look close. They weren't. A direct code review plus targeted reproductions against isolated SQLite databases surfaced &lt;strong&gt;one real bug in Claude Opus 4.7 and six in Kimi K2.6&lt;/strong&gt;. We will show each one with the line that causes it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Opus 4.7: One Real Bug
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multi-expired lease recovery leaves retryable siblings on a failed run
&lt;/h3&gt;

&lt;p&gt;The spec says that when a step exhausts retries, the parent run fails and every other non-terminal step becomes &lt;code&gt;blocked&lt;/code&gt;. Claude Opus 4.7's recovery path handles this correctly for a single expired lease. With two expired leases in the same recovery pass, it can undo its own block.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;src/services/workers.ts&lt;/code&gt;, &lt;code&gt;runRecovery()&lt;/code&gt; loads every expired &lt;code&gt;running&lt;/code&gt; step into memory and iterates:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!u_NU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1d7f06-15bd-4804-84cf-29e47f8aebd5_1994x652.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s1mzjru9xpiywcrqdms.png" alt="Code snippet showing runRecovery() iterating over expired steps in Claude Opus 4.7" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the first iteration exhausts retries for one step, &lt;code&gt;failRunDueToDeadStep()&lt;/code&gt; fires, the run becomes &lt;code&gt;failed&lt;/code&gt;, and every other non-succeeded step is set to &lt;code&gt;blocked&lt;/code&gt;. That is correct.&lt;/p&gt;

&lt;p&gt;The problem is the second iteration. &lt;code&gt;handleLeaseExpiry()&lt;/code&gt; updates by &lt;code&gt;id&lt;/code&gt; only:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!pecT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91043bb9-9896-4557-8cf3-7d0f7eaba7a9_1994x492.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0jl4lv6eu9gmv9cyatb.png" alt="Code snippet showing handleLeaseExpiry() updating by id without a status guard" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no guard on &lt;code&gt;status&lt;/code&gt;, so a step that was just marked &lt;code&gt;blocked&lt;/code&gt; by the prior failure cascade gets updated back to &lt;code&gt;waiting_retry&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We reproduced it with a run containing two expired running steps: &lt;code&gt;a&lt;/code&gt; with &lt;code&gt;maxAttempts = 1&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; with &lt;code&gt;maxAttempts = 2&lt;/code&gt;. After recovery:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!kx6e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52105575-cea2-4065-8e2d-c2ae18964a2a_1994x276.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8vyx9ib2lz1xo3f9acz.png" alt="Reproduction output showing step b incorrectly set to waiting_retry after the run had already failed" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step &lt;code&gt;b&lt;/code&gt; should have been &lt;code&gt;blocked&lt;/code&gt; because the run had already failed. Instead it is eligible to be claimed again on the next &lt;code&gt;/workers/claim&lt;/code&gt; call.&lt;/p&gt;

&lt;p&gt;Claude Opus 4.7's test suite does not cover this case. It tests single-step lease expiry in isolation.&lt;/p&gt;
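&lt;p&gt;One way to close this gap is to re-check the step's current status before touching it, for example by matching on both &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt; rather than &lt;code&gt;id&lt;/code&gt; alone. The sketch below is an in-memory model of the two-step reproduction, not Claude Opus 4.7's code: the &lt;code&gt;recover&lt;/code&gt; and &lt;code&gt;blockSiblings&lt;/code&gt; names, the &lt;code&gt;attempt&lt;/code&gt; field, and the &lt;code&gt;dead&lt;/code&gt; status are assumptions.&lt;/p&gt;

```typescript
// Hypothetical in-memory model; the "dead" status and all field names are assumptions.
type Step = { id: string; status: string; attempt: number; maxAttempts: number };

// Failure cascade: once a step is out of retries, block every other non-succeeded step.
function blockSiblings(steps: Step[], failedId: string): void {
  steps
    .filter((s) => s.id !== failedId)
    .filter((s) => s.status !== "succeeded")
    .forEach((s) => {
      s.status = "blocked";
    });
}

// One recovery pass over the steps that were running when the pass started.
// With `guarded` set, a step is skipped if an earlier iteration already moved
// it out of "running", mirroring a status condition on the update.
function recover(steps: Step[], guarded: boolean): void {
  const expired = steps.filter((s) => s.status === "running");
  for (const step of expired) {
    if (guarded) {
      if (step.status !== "running") continue;
    }
    if (step.attempt >= step.maxAttempts) {
      step.status = "dead";
      blockSiblings(steps, step.id);
    } else {
      step.status = "waiting_retry";
    }
  }
}

const makeRun = (): Step[] => [
  { id: "a", status: "running", attempt: 1, maxAttempts: 1 },
  { id: "b", status: "running", attempt: 1, maxAttempts: 2 },
];

const unguardedRun = makeRun();
recover(unguardedRun, false);
const guardedRun = makeRun();
recover(guardedRun, true);

console.log(unguardedRun.map((s) => s.status)); // dead, waiting_retry
console.log(guardedRun.map((s) => s.status)); // dead, blocked
```

&lt;p&gt;Without the guard, step &lt;code&gt;b&lt;/code&gt; ends up in &lt;code&gt;waiting_retry&lt;/code&gt; on a failed run, matching the reproduction. With it, &lt;code&gt;b&lt;/code&gt; stays &lt;code&gt;blocked&lt;/code&gt;.&lt;/p&gt;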

&lt;h3&gt;
  
  
  Smaller contract risks
&lt;/h3&gt;

&lt;p&gt;Two smaller issues turned up in review but did not need a full reproduction.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The claim path reads &lt;code&gt;maxClaims * 10&lt;/code&gt; candidates. That is fine most of the time, but a queue with many skipped candidates at the front can hide valid work farther down the ordered list.&lt;/li&gt;
&lt;li&gt;The SSE stream subscribes after replay finishes and treats an unknown &lt;code&gt;afterEventId&lt;/code&gt; as "replay everything." The spec does not define unknown-cursor behavior explicitly, so this is more a looseness than a bug.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Kimi K2.6: Six Confirmed Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Claim ordering is not global across runs
&lt;/h3&gt;

&lt;p&gt;The spec requires that when multiple steps are eligible, claim order is &lt;code&gt;priority&lt;/code&gt; descending, then &lt;code&gt;availableAt&lt;/code&gt; ascending, then &lt;code&gt;createdAt&lt;/code&gt; ascending, &lt;strong&gt;across all eligible steps&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kimi K2.6's claim loop orders steps inside each run, then iterates runs in whatever order the database returns them:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!qSqZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969de824-79bf-47e6-9b12-64e756b69c02_1994x1082.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkb0eumgvqvye0902qgqc.png" alt="Code snippet showing Kimi K2.6's claim loop ordering steps per run instead of globally" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We reproduced this with two active runs on the same queue. One had a step at &lt;code&gt;priority = 10&lt;/code&gt;. The other had a step at &lt;code&gt;priority = 100&lt;/code&gt;. The call to &lt;code&gt;POST /workers/claim&lt;/code&gt; returned the priority 10 step first.&lt;/p&gt;
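&lt;p&gt;The ordering rule itself is small once every eligible step is in one list. The sketch below shows a comparator for it. The &lt;code&gt;Eligible&lt;/code&gt; shape, the use of epoch milliseconds for timestamps, and the &lt;code&gt;pickClaims&lt;/code&gt; helper are assumptions for illustration, not code from either implementation.&lt;/p&gt;

```typescript
// Hypothetical shape for an eligible step; timestamps are assumed to be epoch milliseconds.
type Eligible = {
  id: string;
  runId: string;
  priority: number;
  availableAt: number;
  createdAt: number;
};

// priority descending, then availableAt ascending, then createdAt ascending.
function compareClaimOrder(a: Eligible, b: Eligible): number {
  if (a.priority !== b.priority) return b.priority - a.priority;
  if (a.availableAt !== b.availableAt) return a.availableAt - b.availableAt;
  return a.createdAt - b.createdAt;
}

// Sorts one flat list covering every run, instead of sorting inside each run.
function pickClaims(eligible: Eligible[], maxClaims: number): Eligible[] {
  return [...eligible].sort(compareClaimOrder).slice(0, maxClaims);
}

const eligible: Eligible[] = [
  { id: "low", runId: "run-1", priority: 10, availableAt: 1000, createdAt: 900 },
  { id: "high", runId: "run-2", priority: 100, availableAt: 2000, createdAt: 950 },
  { id: "high-earlier", runId: "run-3", priority: 100, availableAt: 1500, createdAt: 990 },
];

const claimed = pickClaims(eligible, 2).map((s) => s.id);
console.log(claimed); // high-earlier, then high
```

&lt;p&gt;With Prisma, the same effect would more likely come from a single query over all eligible steps with a three-key &lt;code&gt;orderBy&lt;/code&gt;, so that the database does the sorting rather than application code.&lt;/p&gt;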

&lt;h3&gt;
  
  
  2. SSE is replay-only, not live
&lt;/h3&gt;

&lt;p&gt;The spec requires that &lt;code&gt;GET /runs/:id/events/stream&lt;/code&gt; replays stored events and then switches to live streaming.&lt;/p&gt;

&lt;p&gt;Kimi K2.6's stream reads every persisted event, writes them to the stream, and then starts a keepalive timer. Nothing subscribes to new events. The file &lt;code&gt;src/lib/events.ts&lt;/code&gt; even defines an &lt;code&gt;emitAndBroadcast&lt;/code&gt; function and a subscriber map, but the route never wires to them:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!G7L1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0c7d956-6690-4f15-b98a-7c586f81efe5_1994x813.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx28ly90tld8rpkoj38qg.png" alt="Code snippet showing Kimi K2.6's SSE route with unused emitAndBroadcast function and no live subscription" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clients receive replayed history once, then silence. The README still claims live streaming.&lt;/p&gt;
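&lt;p&gt;A replay-then-live stream can be sketched without any HTTP machinery. The example below is a minimal in-memory illustration, assuming events carry increasing numeric ids. The &lt;code&gt;EventLog&lt;/code&gt; class, the &lt;code&gt;openStream&lt;/code&gt; function, and the event type names are hypothetical and are not taken from Kimi K2.6's &lt;code&gt;src/lib/events.ts&lt;/code&gt;.&lt;/p&gt;

```typescript
type RunEvent = { id: number; type: string };
type Listener = (event: RunEvent) => void;

// In-memory stand-in for the persisted event table plus a broadcaster.
class EventLog {
  private stored: RunEvent[] = [];
  private listeners: Listener[] = [];

  append(type: string): void {
    const event: RunEvent = { id: this.stored.length + 1, type };
    this.stored.push(event);
    this.listeners.forEach((listener) => listener(event));
  }

  readAll(): RunEvent[] {
    return [...this.stored];
  }

  // Returns a function that removes the listener again.
  subscribe(listener: Listener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }
}

// Subscribes before replaying so nothing emitted during replay is lost, then
// de-duplicates by id so no event is delivered twice.
function openStream(log: EventLog, send: (event: RunEvent) => void): () => void {
  let lastSentId = 0;
  const deliver = (event: RunEvent): void => {
    if (event.id > lastSentId) {
      lastSentId = event.id;
      send(event);
    }
  };
  const pending: RunEvent[] = [];
  let replaying = true;
  const unsubscribe = log.subscribe((event) => {
    if (replaying) pending.push(event);
    else deliver(event);
  });
  log.readAll().forEach(deliver); // replay stored history
  replaying = false;
  pending.forEach(deliver); // flush anything that arrived mid-replay
  return unsubscribe;
}

const log = new EventLog();
log.append("run.created");
log.append("step.claimed");

const received: number[] = [];
const close = openStream(log, (event) => {
  received.push(event.id);
});
log.append("step.succeeded"); // arrives live, after the replay
close();
log.append("run.succeeded"); // not delivered: the client has disconnected

console.log(received); // ids 1, 2, 3
```

&lt;p&gt;A replay-only stream would stop at id 2 here. The buffering step matters once the replay read is asynchronous, because events can then arrive while history is still being written out.&lt;/p&gt;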

&lt;h3&gt;
  
  
  3. Expired leases can still be completed
&lt;/h3&gt;

&lt;p&gt;The heartbeat endpoint rejects expired leases. The &lt;code&gt;complete&lt;/code&gt; and &lt;code&gt;fail&lt;/code&gt; endpoints do not. We reproduced this by claiming a step, forcing &lt;code&gt;leaseExpiresAt&lt;/code&gt; into the past, and calling &lt;code&gt;POST /step-runs/:id/complete&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!Fr-U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761cb687-e134-49dc-95c3-10ad448b6690_1994x115.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae9gs8h3zbftn128rg8x.png" alt="Reproduction output showing a step marked succeeded despite an expired lease" width="800" height="46"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The step was marked &lt;code&gt;succeeded&lt;/code&gt; on an expired lease. The spec treats lease expiry as a failed attempt. A worker can crash, its lease can expire, recovery can schedule a retry for the next worker, and the original worker can still phone in a "success" afterwards.&lt;/p&gt;
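&lt;p&gt;The missing check is the same one the heartbeat endpoint already performs. A minimal sketch follows, assuming a hypothetical &lt;code&gt;LeasedStep&lt;/code&gt; shape with an epoch-millisecond &lt;code&gt;leaseExpiresAt&lt;/code&gt;. The 409 status code used for rejections is our assumption, since the article does not quote what the spec requires here.&lt;/p&gt;

```typescript
// Hypothetical shapes; field names and the 409 status code are assumptions.
type LeasedStep = { id: string; status: string; workerId: string; leaseExpiresAt: number };
type CompleteResult = { ok: true } | { ok: false; status: number; reason: string };

// Completes a step only if the caller still holds a live lease on it.
function completeStep(step: LeasedStep, workerId: string, now: number): CompleteResult {
  if (step.status !== "running") {
    return { ok: false, status: 409, reason: "step is not running" };
  }
  if (step.workerId !== workerId) {
    return { ok: false, status: 409, reason: "lease is held by another worker" };
  }
  if (now >= step.leaseExpiresAt) {
    return { ok: false, status: 409, reason: "lease expired" };
  }
  step.status = "succeeded";
  return { ok: true };
}

const leased: LeasedStep = { id: "s1", status: "running", workerId: "w1", leaseExpiresAt: 1000 };
const late = completeStep(leased, "w1", 2000);
console.log(late.ok, leased.status); // false running
```

&lt;p&gt;In a real implementation the lease check would belong in the same conditional update as the status change, so that a recovery pass running at the same moment cannot interleave with it.&lt;/p&gt;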

&lt;h3&gt;
  
  
  4. "No active version" returns 404 instead of 409
&lt;/h3&gt;

&lt;p&gt;The spec: if there is no active version and no explicit &lt;code&gt;version&lt;/code&gt;, return &lt;code&gt;409&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Kimi K2.6 raises &lt;code&gt;NOT_FOUND&lt;/code&gt; (404):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!sIdy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddd5455-d0e0-41f3-bc21-c466d9dbac36_1994x223.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frel8pz5asyg0qpornpo4.png" alt="Code snippet showing Kimi K2.6 returning 404 instead of the spec-required 409" width="800" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Validation is narrower than the spec
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;CreateRunSchema&lt;/code&gt; and &lt;code&gt;CompleteSchema&lt;/code&gt; use &lt;code&gt;z.record(z.any())&lt;/code&gt; for &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;metadata&lt;/code&gt;, and &lt;code&gt;output&lt;/code&gt;. The spec allows arbitrary JSON payloads. A string, array, or number payload is rejected even though the spec accepts it.&lt;/p&gt;
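&lt;p&gt;For comparison, "arbitrary JSON" can be expressed as a small recursive check. The hand-rolled guard below is a sketch that avoids any library. The &lt;code&gt;Json&lt;/code&gt; type and &lt;code&gt;isJson&lt;/code&gt; name are ours, and a zod-based project would more likely use a recursive schema or &lt;code&gt;z.unknown()&lt;/code&gt; in its place.&lt;/p&gt;

```typescript
// Any value that can be produced by parsing JSON text.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function isJson(value: unknown): value is Json {
  if (value === null) return true;
  if (typeof value === "string") return true;
  if (typeof value === "boolean") return true;
  // NaN and Infinity have no JSON representation.
  if (typeof value === "number") return Number.isFinite(value);
  if (Array.isArray(value)) return value.every((item) => isJson(item));
  if (typeof value === "object") {
    // Accept plain objects only, so class instances such as Date are rejected.
    if (Object.getPrototypeOf(value) !== Object.prototype) return false;
    return Object.values(value).every((item) => isJson(item));
  }
  return false;
}

console.log(isJson("plain string"), isJson([1, 2, 3]), isJson(42)); // true true true
console.log(isJson({ a: [1, { b: null }] })); // true
console.log(isJson(undefined), isJson(() => 1), isJson(Number.NaN)); // false false false
```

&lt;p&gt;A record-only schema accepts just the object case in the second line, which is why string, array, and number payloads are rejected.&lt;/p&gt;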

&lt;h3&gt;
  
  
  6. The clean build path fails
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;npm test&lt;/code&gt; passes. &lt;code&gt;npm run build&lt;/code&gt; does not:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!zw-b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcee959c-d406-4978-84b3-ea1df77cf582_1994x223.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fny1bkicxeizlrtcr1n4u.png" alt="Terminal output showing npm run build failing on a clean checkout" width="800" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;package.json&lt;/code&gt; defines &lt;code&gt;npm start&lt;/code&gt; as &lt;code&gt;node dist/index.js&lt;/code&gt;, which depends on the build output, so the documented build-and-start flow is broken on a clean checkout.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Each Model Said About Itself
&lt;/h2&gt;

&lt;p&gt;Both models produced end-of-run summaries claiming their implementations were complete and all tests passed. The test claims were technically true: every test each model wrote did pass. Neither summary flagged the issues above.&lt;/p&gt;

&lt;p&gt;Claude Opus 4.7's summary was mostly accurate. It described its recovery path, atomic claim pattern, and event persistence correctly. The one thing it missed was the multi-expired lease interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!XtV5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61573fdc-7a07-4b69-a032-2d13a83af77d_1928x2664.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4u3uoo4gf95fbegylel.png" alt="Claude Opus 4.7's end-of-run summary describing its implementation" width="800" height="1105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6's summary claimed deterministic global scheduling and live SSE streaming. Both of those claims are in the README too. The code does not deliver either.&lt;/p&gt;

&lt;p&gt;"My tests pass" is not the same thing as "my implementation is correct." Both models understood the spec well enough to build most of it. Neither model wrote tests that would have caught its own worst behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!yJJ0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a961ff-359e-447e-9896-0fd38db22966_1928x2672.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw8fy3agli5qficyaeg7.png" alt="Kimi K2.6's end-of-run summary incorrectly claiming deterministic scheduling and live SSE streaming" width="800" height="1109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scoring
&lt;/h2&gt;

&lt;p&gt;We scored each model on the spec, weighted by how much each category mattered for a correctness-first workflow engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!QItW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8401fa44-af87-47e1-a9b9-c68f9e769c57_1504x854.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5pg0q2iy3l61hx0qxoc.png" alt="Scoring breakdown table weighted by category importance for a correctness-first workflow engine" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Opus 4.7 lost points on the reproduced recovery bug, the bounded claim scan, and the SSE cursor fallback.&lt;/p&gt;

&lt;p&gt;Kimi K2.6 lost points on the six confirmed issues above. The biggest hits are in recovery, scheduling, and streaming, which is exactly where the spec's hardest requirements live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost vs Quality
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://substackcdn.com/image/fetch/$s_!P7SD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89be5073-989b-47ed-9398-62efea3ebc75_1038x300.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw8xfb6ej6wehsh3bj7o.png" alt="Cost vs quality table showing Kimi K2.6 at roughly 4x cheaper per point than Claude Opus 4.7" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is about 4x cheaper per point. The missing 23 points are in step-leasing, scheduling, and event streaming, which is where the hardest spec requirements live. Those are the parts that separate "the endpoints exist" from "the system behaves correctly under load."&lt;/p&gt;
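&lt;p&gt;The ratios quoted in this post follow directly from the two scores and the two run costs ($3.56 and $0.67). The short calculation below reproduces them, and the type and variable names are only for illustration.&lt;/p&gt;

```typescript
type Result = { model: string; score: number; costUsd: number };

const opus: Result = { model: "Claude Opus 4.7", score: 91, costUsd: 3.56 };
const kimi: Result = { model: "Kimi K2.6", score: 68, costUsd: 0.67 };

// Dollars spent per point scored.
const costPerPoint = (r: Result): number => r.costUsd / r.score;

const scoreRatio = kimi.score / opus.score; // share of the score reached
const costRatio = kimi.costUsd / opus.costUsd; // share of the cost paid
const perPointRatio = costPerPoint(opus) / costPerPoint(kimi); // how much cheaper per point

console.log(scoreRatio.toFixed(2), costRatio.toFixed(2), perPointRatio.toFixed(2)); // 0.75 0.19 3.97
console.log(opus.score - kimi.score); // 23
```

&lt;p&gt;That is where "75% of the score at 19% of the cost" and "about 4x cheaper per point" come from.&lt;/p&gt;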

&lt;h2&gt;
  
  
  Where Open-Weight Models Stand Right Now
&lt;/h2&gt;

&lt;p&gt;This test sits inside a pattern we've been tracking for a while. MiniMax M2.7 matched Claude Opus 4.6's detection rate on &lt;a href="https://blog.kilocode.ai/" rel="noopener noreferrer"&gt;our last three-part benchmark&lt;/a&gt;. GLM-5.1 scored five points behind Claude Opus 4.6 on &lt;a href="https://blog.kilo.ai/p/we-tested-minimax-m27-against-claude" rel="noopener noreferrer"&gt;our job queue spec&lt;/a&gt;. Kimi K2.6 landed 23 points behind Claude Opus 4.7 here on a harder spec, but still produced the right shape of the system on the first pass.&lt;/p&gt;

&lt;p&gt;The gap on surface coverage &lt;strong&gt;has narrowed meaningfully over the last year.&lt;/strong&gt; The gap on correctness inside hard code paths (lease recovery, cross-run scheduling, streaming semantics) is still there. For work where the bugs only show up under contention or mid-crash, frontier proprietary models are the safer choice today. For work where you need the scaffold, the tables, the endpoint surface, and a starting test suite, open-weight models like Kimi K2.6 are close enough that the price delta matters.&lt;/p&gt;

&lt;p&gt;Kimi K2.6's current pricing ($0.95 input / $4 output per million tokens) is a starting point, not a floor. Moonshot AI releases open weights, which means Kimi K2.6 will end up hosted on multiple providers, &lt;strong&gt;with pricing and latency converging on whoever runs it most efficiently.&lt;/strong&gt; That is already playing out with MiniMax M2.5, which became the #1 most-used model across every mode in Kilo Code in the months after release. Price competition tends to pull these numbers down further as more hosts come online.&lt;/p&gt;

&lt;p&gt;Being open-weight also means you can &lt;strong&gt;self-host&lt;/strong&gt; or &lt;strong&gt;fine-tune&lt;/strong&gt; Kimi K2.6 if you have data residency requirements, custom workflows, or a cost profile that makes API-only models impractical at scale. That is not a capability Claude Opus 4.7 offers at any price.&lt;/p&gt;

&lt;p&gt;None of that changes the correctness findings above. It does reframe them. At $0.67 with a careful review pass, Kimi K2.6 is a real option now. At $3.56 with fewer corrections needed, Claude Opus 4.7 is the safer call. Which trade-off wins depends on the work. A year ago, that choice did not really exist at this level of complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For building the scaffold of a complex backend:&lt;/strong&gt; Kimi K2.6 did well. It produced the right project shape, the right tables, the right endpoint surface, and a test suite that passed. For prototyping, exploring a design, or generating a starting point you plan to review carefully, the $0.67 run is a good deal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For systems where state-machine correctness matters:&lt;/strong&gt; Claude Opus 4.7 pulled clearly ahead. The two implementations look similar in shape but diverge in the code paths that are hard to test casually (lease expiry, cross-run ordering, SSE, expired-lease rejection). If the project needs to behave correctly when leases expire, when multiple runs compete for workers, or when events need to flow live to clients, Claude Opus 4.7's output is closer to something you could ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On trusting model self-reports:&lt;/strong&gt; Both models said they were done. One was mostly right. The other had six spec-level issues in shipped code. "Tests pass" is a necessary signal. It is not a sufficient one for work this correctness-sensitive. A review pass plus a few targeted reproductions closed the gap between what the models said and what they actually built.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on Kimi K2.6 Speed
&lt;/h2&gt;

&lt;p&gt;Kimi K2.6 was released the day of this test. Provider availability is limited right now, so the current wall-clock timings understate the model's real speed. We saw similar adoption curves on previous open-weight releases from MiniMax and Z.ai as more providers came online. We expect Kimi K2.6's elapsed time (and its effective cost) to keep dropping as that happens.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Testing performed using &lt;a href="https://kilocode.ai/" rel="noopener noreferrer"&gt;Kilo Code&lt;/a&gt;, a free open-source AI coding assistant for &lt;a href="https://marketplace.visualstudio.com/items?itemName=kilocode.Kilo-Code" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; and &lt;a href="https://plugins.jetbrains.com/plugin/28350-kilo-code" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt; with 2,300,000+ Kilo Coders.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
