DEV Community: Darko from Kilo

Inside Kilo Speed: The Engineer Who Teaches Teams How to Think in Agents

Darko from Kilo — Wed, 13 May 2026 12:17:35 +0000

How to manage your agent team, from someone who coaches Kilo customers in agentic engineering.

Rebecca Dodd
May 12, 2026

When you're learning a new discipline—especially on the job—learning the theory behind it can feel like an abstract nice-to-have, while practice is the thing that's actually useful. Learning by doing is absolutely a valid way to upskill, but in Marius Wichtner's experience, grasping the conceptual foundation of agentic engineering helps to make the practical steps make sense.

Before joining Kilo Code, Marius was already training engineering teams on working with generative AI. At Kilo, he does the same for enterprise clients in Kilo Speedruns: one-hour sessions designed to give teams a fast, practical orientation on agentic software development. He's run them for companies across industries, and now he's sharing the foundations of those lessons (and his specific practices for each) here:

How to delegate effectively
How to scale across concurrent workstreams
How to maintain judgment and recover when things go wrong

1. How to Delegate: The Team Lead Model and the Plan

The mental model Marius uses to explain agentic engineering—both in client speedruns and in how he structures his own work—is the team lead.

Team leads don't spend all day writing code, and the same was true even before agentic tools existed. They were in pairing sessions, answering questions, reviewing output, and deciding what to merge. "Those were always the people that were only in meetings and they got called by all the juniors," Marius says. "They were just solving the last 20% of the problem."

In this model, the agent takes care of execution work, while the engineer operates as the team lead. The 80% that agents handle well—code generation, boilerplate, well-scoped subtasks—is work that the team lead delegates. The 20% that still requires the engineer is the judgment work: architectural decisions, what to merge, and recognizing when the agent has drifted.

[Figure: Parallel development with the engineer acting as team lead]

The engineers who transition most naturally into agentic workflows are often the ones who were already operating this way: team leads and architects who had developed the habit of switching contexts and reviewing output rather than writing it. Everyone else has to learn that mode of working, which starts with understanding the difference between a specification and a plan.

A specification captures what the user wants. It doesn't change based on the current state of the codebase. It's set from the user demand, and it stays set. A plan is how you intend to build the thing given where the code actually is right now. "A plan is dependent on your state of the code," says Marius. "Plans usually get thrown away very quickly."

When Marius works with an agent on complex tasks (especially those with important architecture decisions), he asks it to write its plan to a markdown file before it starts executing so he can review it. Asking the agent to write its plan first forces a shared understanding of what's actually being built. You review it, ask questions, and surface problems before execution begins. It's the refinement stage of traditional software engineering, but the difference now is that the feedback loop is much faster.

Plans, done right, function as constraints. Marius thinks of this as keeping an agent in the acceptable solution space: the set of outputs you will actually accept. The further an agent drifts from a confirmed plan, the more likely it ends up somewhere that requires starting over. Forcing the plan upfront dramatically increases the probability of staying on track.

[Figure: Plans help to keep your agent within the acceptable solution space]

The plan also acts as a contract: it documents the approach the agent intends to take, so when it does something unexpected later, there's a reference point. "You can always reiterate to the agent, 'We decided to implement this plan. Why have you decided otherwise?'"

2. How to Scale: Parallelism and the Context Rot Problem

Even with a solid plan in place, there's a natural limit to how far a single agent session can take you: context rot. As a session grows, accumulating conversation history, prior decisions, and intermediate code states, the agent starts losing coherence. Tasks that were reasonable at the start become unpredictable midway through. Early decisions can come back to bite you. At some point, recovery means starting over.

Most engineers treat this as a nuisance and work around it by brute force: shorter sessions, more restarts. Marius treats it as a signal that the work hasn't been decomposed correctly. "If you have a huge feature and you develop on it for the whole week, you will keep having context rot," Marius says. "It makes much more sense to plan out what you want to implement ahead of time and then develop each of the sub-problems individually in small context windows."

This is where parallelism comes in: you run multiple agents simultaneously, each working on a specific sub-problem. But parallel agents writing to the same file system will conflict, which is the same class of problem that branching in version control tools like Git exists to manage. You need each agent working in its own isolated environment.

To address this, Marius first built a solution into his own custom IDE and later built Kilo's Agent Manager: a tool for running multiple agent sessions simultaneously, each in its own isolated workspace with its own file system. Instead of supervising agents one at a time, an engineer can delegate across several concurrent workstreams and review the results as they come in. Things that look good get merged; things that don't get discarded without the cost of untangling a week of compounded decisions.
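The article doesn't say how Agent Manager implements isolation. As a rough sketch of the general idea, one common way to give each agent its own checkout is `git worktree`. The helper below only builds the commands; the `agent/` branch prefix and the directory layout are assumptions made for illustration.

```python
def worktree_commands(repo_root: str, worktree_root: str, subtasks: list[str]) -> list[list[str]]:
    """Build one `git worktree add` command per subtask (nothing is executed here)."""
    commands = []
    for name in subtasks:
        branch = f"agent/{name}"  # hypothetical branch naming scheme
        path = f"{worktree_root.rstrip('/')}/{name}"  # one directory per agent
        # `git worktree add -b <branch> <path>` creates a new branch and checks
        # it out in its own directory, separate from the main working tree.
        commands.append(["git", "-C", repo_root, "worktree", "add", "-b", branch, path])
    return commands
```

Each command can be passed to `subprocess.run(cmd, check=True)`. Because every agent then writes to a different directory and branch, a bad result can be thrown away by deleting one worktree.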

Not every task demands the multi-agent treatment. Marius works across three categories depending on complexity:

[Figure: How Marius routes tasks based on their complexity]

Easy tasks: Things like adding documentation, writing a unit test, or well-scoped bug fixes go to a fully autonomous cloud workflow. The developer writes the spec, the agent executes, the developer reviews the diff. No supervision is required mid-session.

Hard tasks: Implementing a complex feature spanning UI and backend, or anything with meaningful architectural decisions, gets handled locally with Agent Manager. The developer supervises multiple agents working in parallel on decomposed subtasks, stays close to the work, and makes the judgment calls as diffs come in.

Unclear tasks: When the outcome isn't well-defined, it's hard to write a spec precise enough to constrain the agent toward a single solution. For these, Marius runs multiple agents in parallel against the same spec and compares the results. Instead of splitting work, the parallelism here is about generating variants and selecting the best one. The engineer's job is choosing the right route.
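The three routes above can be written down as a small decision function. This is only a sketch of the reasoning: the `Task` fields and the `Route` names are made up for illustration and are not part of any Kilo API.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    CLOUD_AUTONOMOUS = "autonomous cloud agent, review the diff afterwards"
    LOCAL_PARALLEL = "supervised local agents on decomposed subtasks"
    PARALLEL_VARIANTS = "several agents on the same spec, pick the best result"


@dataclass
class Task:
    title: str
    outcome_is_clear: bool             # can you write a precise spec?
    has_architectural_decisions: bool  # does it need judgment mid-flight?


def route(task: Task) -> Route:
    """Pick a workflow using the three categories described above."""
    if not task.outcome_is_clear:
        return Route.PARALLEL_VARIANTS  # unclear: generate variants and compare
    if task.has_architectural_decisions:
        return Route.LOCAL_PARALLEL     # hard: stay close and supervise
    return Route.CLOUD_AUTONOMOUS       # easy: spec in, diff out
```

In practice the two booleans are the engineer's judgment call, which is the point of the model: the code is trivial, and deciding the inputs is the work.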

3. How to Stay on Track: Context Engineering and Judgment

Context engineering, as Marius defines it, is how you structure and optimize the context of the agent. The goal is to limit an agent to doing exactly what you want, over time, in your codebase. It's the ongoing work of keeping agents oriented, and knowing how to reorient them when they've drifted.

For upfront orientation, Marius uses Handy, a speech-to-text tool, to interact with agents verbally before locking in a plan. A lot of the context that matters for a task lives in the engineer's head and never gets written down, because it's too tedious to type everything out. Speaking it aloud removes that barrier, and an LLM can distil the rough transcript into a precise problem statement. The rough transcript also becomes the raw material for the plan the agent writes before executing.

When an agent session ends—whether it hit a context limit or simply reached a natural stopping point—continuing the work is usually straightforward. The original prompts, the Git diff (Agent Manager measures the delta from when the session started), and the current state of the codebase give a new agent enough to pick up where the previous one left off. Tools like Repomix can help with collecting specific file trees for this purpose. All of this can happen locally or in GitHub, where an issue describes the task, the PR contains the changes, and the history provides the thread. Most agents can continue from that context without much intervention.
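As a minimal sketch of that handoff, the function below stitches the original prompts and a diff into one prompt for a fresh session. The section titles and the closing instruction are assumptions; the diff text itself could come from something like `git diff <start-commit>`.

```python
def build_handoff(original_prompts: list[str], git_diff: str, notes: str = "") -> str:
    """Combine the pieces a fresh agent session needs into one prompt."""
    sections = [
        ("Original prompts", "\n\n".join(original_prompts)),
        ("Changes since the session started", git_diff),
    ]
    if notes:
        # Context that lives only in the engineer's head has to be added by hand.
        sections.append(("Additional context from the engineer", notes))
    body = "\n\n".join(f"## {title}\n{text.strip()}" for title, text in sections)
    return body + "\n\nContinue the task from the current state of the codebase."
```

The optional `notes` argument is where uncaptured context goes, since neither the prompts nor the diff will contain it.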

What this process makes visible is what's actually irreplaceable: the context that isn't captured anywhere. Code and prompts are always an approximation—there are causal relationships in software that are hard to capture in prompts or code alone. Some of them, like another team's architectural decision creating a dependency you didn't know about, can be surfaced and handed off. Others only become visible when you run the code or at scale. An agent can't know what hasn't surfaced yet—that's still the engineer's job.

This is the difference between just coding and software engineering. The easy mistake with agentic work is treating it as a handoff: you describe what you want, the agent builds it, you ship it. In that approach, the critical last 20% can get lost: things like evaluating architectural choices and catching when an agent has veered off course. These require engineering judgment, and they're often much harder than the first 80%.

The mental shift Marius describes is about learning to apply engineering judgment at the right moments, across multiple concurrent threads, rather than sequentially inside a single one.

Cowboy Coder Is Back. This Time, They Scale

Darko from Kilo — Wed, 13 May 2026 12:00:53 +0000

Andrew Storms
May 11, 2026

I should start by admitting I'm part of the problem.

I can still draw the architecture of code I wrote three years ago from memory. The data flow, the edge cases, the reasoning behind every choice that looks strange at first glance. Ask me to do the same for a feature I shipped last month with help from an agent, and I can tell you what it does and why we built it. The deeper model, the one that lives at the level of individual functions, isn't there.

That's not laziness, and it's not a lapse in review. I read every diff. An agent does a closer pass alongside me. I can speak to the intent and shape of what I'm approving. But the deep mental model, the one you actually need at 2am when something breaks and the agent isn't helping you debug, isn't forming the way it used to.

I'm a CISO who still writes code, and this worries me. It should worry anyone managing engineers right now, because it isn't just me. Across the industry, AI coding agents are quietly reviving the single worst antipattern in software engineering history. We just don't recognize it yet, because it's wearing different clothes.

Remember the cowboy?

If you've managed engineers long enough, you know the cowboy. The one who disappears for a weekend and comes back Monday with a full rewrite nobody asked for. The one who, somehow, is the only person who understands the gnarly billing module, the auth flow, the deployment pipeline. The one whose decisions land in production faster than the team can review them.

Cowboys aren't heroes, by the way. The hero is the engineer who pulls the 2am save when production breaks. The cowboy is the one who created the conditions that made the 2am save necessary in the first place. Heroes clean up. Cowboys cause.

For twenty years, our industry has been quietly learning how to build engineering organizations that don't depend on this person. Code review. Pair programming. Design docs and RFCs. Collective code ownership. Postmortems that look at process, not blame. The whole inheritance from XP, agile, and DevOps was, in large part, a response to the lesson that cowboy culture feels productive and is actually corrosive.

It worked. Not perfectly, but the average engineering team today is far more resilient than the average team in 2005.

Then the agents arrived.

Watch what happens on teams that have adopted Claude, Cursor, Copilot, Codex, and the rest without changing how they work. An engineer prompts an agent. The agent emits eight hundred lines of code. The engineer skims it, sees the tests pass, and merges. Repeat, ten times a day, across the team.

The output is enormous. The velocity charts look incredible. And underneath, something is going wrong that nobody is naming yet.

Nobody on the team has reasoned through that code. The "author" couldn't walk you through it under questioning. They didn't write it, they prompted it. The reviewer couldn't either; they had thirty other PRs in the queue, and half the time the reviewer is another agent. Six months from now, when something breaks at 2am, the engineer who gets paged will be debugging code that has, in any meaningful sense, no human author at all.

This is the cowboy pattern. The weekend rewrite, the opaque module, the knowledge silo, the tech debt nobody quite remembers creating. Same antipattern, new substrate.

Why it's actually worse

The cowboy archetype, for all its damage, had one redeeming feature: somewhere, in one human brain, the model of the system existed. Bus factor of one.

Development driven by agents, without comprehension, produces bus factor zero. The code enters the repository with nobody understanding it. There is no expert to consult, because the "expert" was a probability distribution that has since moved on to the next prompt.

The social brakes that used to slow cowboys down are also gone. Cowboys had egos, reputations, and peers who could push back in code review. Agents have none of these. They don't sulk when overruled, don't take credit, don't feel shame when prod breaks. The friction that used to make cowboy culture limit itself in healthy teams, the simple fact that other humans were watching, is absent.

And there's a new accountability sink. When the cowboy shipped a bad rewrite, you knew whose desk to visit. When an agent ships a bad rewrite, the conversation goes "well, the AI wrote it" and everyone shrugs. The blame diffuses into the tooling.

What managers should do now

The good news: the playbook for fixing this already exists. We wrote it the last time. It needs updating, not reinventing.

Require comprehension, not just approval. Before any meaningful PR written with an agent gets merged, the author should be able to walk through it without asking the agent again. If they can't explain why a function exists, the PR isn't ready. This is the most impactful change you can make, and the one I'd benefit from most personally.

Cap PR size, hard. Code review evolved assuming limited human throughput on both sides. Agents broke that assumption. A PR of 50 lines can be meaningfully reviewed; a PR of 800 lines gets approved without thought. Set a limit, enforce it in tooling, and force large changes to be decomposed.
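One way to enforce a cap in tooling, sketched under assumptions: a CI step runs `git diff --numstat` against the target branch and feeds the output to a check like the one below. The 400-line default is an arbitrary placeholder, not a recommendation from this post.

```python
def changed_lines(numstat_output: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # git reports "-" for binary files
        total += int(added) + int(deleted)
    return total


def within_cap(numstat_output: str, max_lines: int = 400) -> bool:
    """Return True when the change is small enough to review properly."""
    return changed_lines(numstat_output) <= max_lines
```

A CI job can fail the build when `within_cap` returns False, which forces the author to split the change before review.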

Tag agent involvement and track it. Make AI authorship a first-class piece of metadata on commits and PRs. Watch incident rates, time to debug, and refactor cost on modules where agents wrote most of the code, and compare against the rest. You can't manage what you can't see, and right now most engineering orgs are flying blind on this.
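A lightweight way to start, sketched here as an assumption rather than an established convention: record agent involvement as a commit trailer and count it. The `Agent-Assisted` trailer name is hypothetical; any key your team agrees on works the same way.

```python
def is_agent_assisted(commit_message: str, trailer_key: str = "Agent-Assisted") -> bool:
    """Check the trailer block (last paragraph) of a commit message."""
    for line in reversed(commit_message.strip().splitlines()):
        if not line.strip():
            break  # trailers live in the final paragraph only
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == trailer_key.lower():
            return value.strip().lower() in {"yes", "true"}
    return False


def agent_share(commit_messages: list[str]) -> float:
    """Fraction of commits that carry the agent trailer."""
    if not commit_messages:
        return 0.0
    tagged = sum(is_agent_assisted(m) for m in commit_messages)
    return tagged / len(commit_messages)
```

Run per module, the same count gives the split needed to compare incident rates between agent-heavy code and the rest.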

Protect the loop of deliberate practice. Junior engineers who never struggle through a hard bug don't become senior engineers who can debug under pressure. Build in rotations without agents, pair on hard problems, and make "can debug from scratch" part of your leveling criteria. The seniors riding herd on agents today learned their craft the hard way. The next cohort needs a path to the same skill, or you'll wake up in five years with a team that can prompt fluently and reason about nothing.

Reframe tech debt as unread code. The most dangerous code in your repository is no longer the bad code. It's the unread code, modules that work today and that nobody on the team has actually internalized. Schedule comprehension audits. Assign engineers to read and document modules written by agents that they didn't author themselves. Treat unread code as a liability on the books.

This is not an argument against AI

The agents are useful. The productivity gains are real. I use them every day, and I'm not giving them up.

The point is that the technical productivity of these tools is arriving faster than the organizational practices needed to absorb them. The teams that already had healthy engineering culture, the kind with code review that actually reviews, sustainable pace, and collective ownership, will adapt and thrive. The teams that quietly tolerated cowboys are about to have a much worse problem, at much greater scale, with no single person to point at.

And the rest of us, the ones who can still picture the flow of code we wrote three years ago but no longer build that same depth of model with the new stuff, need to be honest that the muscle is atrophying. Mine is. Yours probably is too.

The cowboy didn't go away. The cowboy scaled, with a million tokens of context. The work of engineering management is to recognize the pattern in its new form and apply the lessons we already learned the last time.

7 Unexpected Ways AI Makes Your Team Faster

Darko from Kilo — Mon, 11 May 2026 11:58:46 +0000

Most enterprise teams adopt AI coding tools expecting one thing: faster code output. And sure, that happens. But the teams getting the most out of AI are finding speed in places they didn't anticipate. The decisions, the handoffs, the context switches, the organizational friction that quietly eats weeks off every quarter. That's where the real time goes, and that's where AI has the most room to compress it.

Here are seven of those less-obvious wins, based on what we're seeing across engineering orgs using Kilo at scale.

1. Decisions that don't stall in Slack threads

A lot of good engineering decisions happen in Slack threads. Two or three people hash out an approach, agree on a direction, maybe sketch out some pseudocode in a message. Then someone has to take all of that context, switch to their IDE, reconstruct the conversation in their head, and actually implement it. That translation step is where momentum dies. The idea was clear in the thread, but by the time someone sits down to build it, they're re-reading messages and second-guessing what the team actually agreed on.

Kilo for Slack can read the full thread context, understand what the team discussed, and start implementing directly from the conversation. Instead of someone manually distilling a Slack thread into a ticket and then into code, Kilo picks up the intent from the discussion itself, with all the nuance that multiple contributors added along the way. The gap between "we agreed on an approach" and "someone started building it" shrinks from hours or days to minutes.

For engineering teams, this changes the rhythm of how work gets kicked off. Conversations become the starting point for implementation, not a precursor to yet another handoff.

2. Code contributions from people who aren't engineers

Product managers, designers, data analysts, and other non-engineering team members are able to use AI agents to write and submit code. They can describe what they need, have an agent generate a PR, and push it up for an AI-powered review. Some years ago that PR would have been dead on arrival. The code might work, but it might not follow the team's conventions, handle edge cases, or meet the bar for production.

Kilo's Code Reviewer changes that equation. When a non-engineer submits a PR, the reviewer analyzes it against performance, security, style, and test coverage, then gives structured feedback the contributor can actually act on. The contributor iterates with their agent based on that feedback, resubmits, and the cycle repeats until the code reaches an acceptable level. Each round takes minutes, not days waiting for a human reviewer to find time.

The impact for enterprise teams is significant: work that used to require an engineer's time from start to finish can now arrive as a reviewable PR from someone outside the engineering org. Engineers still own the final approval, but they're reviewing and approving instead of building from scratch. That frees up engineering bandwidth in a way that no amount of "write code faster" tooling can match.

3. Onboarding that doesn't require a sherpa

New engineers joining a large codebase used to spend their first few weeks in a fog. They read docs that were three sprints out of date, pinged senior devs with questions that felt stupid, and took twice as long on their first PRs because they didn't understand the conventions yet.

AI changes the dynamic. When a new hire can point an agent at the repo and ask "how does authentication work in this service?" or "what's the pattern for adding a new API endpoint here?", they get answers grounded in the actual code, not someone's best recollection of how things worked six months ago. Kilo's Ask mode works well here, providing read-only answers powered by codebase indexing. New devs ramp in days instead of weeks, and senior devs get fewer interruptions.

The compounding effect matters: every engineer who onboards faster is productive sooner, and every senior engineer who isn't answering onboarding questions is shipping their own work.

4. Documentation that actually updates

Every engineering team says they value documentation. Almost none of them have enough of it, because writing docs is tedious and the codebase moves faster than anyone can document manually.

AI flips the economics. Generating docs from code is exactly the kind of structured, pattern-heavy task where AI agents perform well. A developer can point a webhook-triggered Cloud Agent at a new PR and get a first draft of internal docs, API references, or architecture decision records in minutes. That draft still needs a human to review and refine, but the difference between "edit a draft" and "write from scratch" is the difference between documentation existing and not existing.

For enterprise teams, this pays off across the org. Knowledge stops being locked in individual developers' heads. Teams that depend on each other's services can actually find out how those services work. The "bus factor" for any given system gets a lot less scary.

5. Maintenance work that stops being a black hole

Every codebase has a backlog of maintenance tasks that never rise to the top of the sprint: dependency upgrades, test coverage gaps, deprecated API migrations, lint rule enforcement. Each one is individually small, but collectively they represent weeks of accumulated drag on the team.

AI agents can handle a lot of this at volume. Kilo's orchestration capabilities let you break down a large maintenance initiative (say, migrating from one logging library to another across 200 files) into subtasks and distribute them to agents running in parallel. What used to be a quarter-long slog becomes a focused effort measured in hours.
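To make the decomposition concrete, here is a generic sketch of splitting a file list into batches and working through them concurrently. It does not use Kilo's orchestration API; `migrate_batch` is a hypothetical callable standing in for whatever dispatches one agent on one batch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_parallel(
    files: list[str],
    batch_size: int,
    migrate_batch: Callable[[list[str]], object],
    max_workers: int = 4,
) -> list[object]:
    """Run `migrate_batch` on each batch concurrently, results in batch order."""
    batches = chunk(files, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(migrate_batch, batches))
```

With 200 files and a batch size of 25, this produces eight independent subtasks, each small enough to review on its own.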

The net effect is that the maintenance backlog actually shrinks instead of growing indefinitely. Teams spend less time working around known issues and more time building features that move the product forward.

6. Cross-team requests that don't take a sprint

In larger orgs, teams constantly need small things from each other. A backend team needs a new field exposed in an API. A frontend team needs a config change. A platform team needs a migration script. Each request is maybe a day of work for the team that owns the code, but it sits in their backlog for two weeks because it's nobody's priority.

When the requesting team can use AI to draft the change themselves (using agents that understand the target repo's patterns and conventions), the dynamic shifts. Instead of filing a ticket and waiting, they can open a PR with a well-formed change and ask the owning team to review it. The owning team spends minutes reviewing instead of a day implementing, and the requesting team isn't blocked for two weeks.

This might be the single most impactful change AI enables in enterprise settings, and it almost never shows up in productivity benchmarks.

7. Consistency that doesn't depend on tribal knowledge

Most large codebases have a "right way" to do things that isn't fully captured in any linter config or style guide. It lives in the heads of engineers who've been around a while, and it gets enforced inconsistently through code review when those engineers happen to be reviewers.

AI can formalize this. Kilo's custom modes and rules system lets teams encode their conventions, patterns, and preferences so that every developer (and every agent) follows the same playbook. New patterns get adopted uniformly instead of unevenly, and deprecated patterns stop spreading through the codebase via copy-paste.

For enterprise teams managing large, long-lived codebases, this is arguably the most valuable thing AI can do. Consistency across a large codebase reduces cognitive load for everyone who touches it, which makes everything else on this list work better.

None of these seven things are what most people think of when they hear "AI makes developers faster." They're not about generating code in fewer keystrokes. They're about removing the organizational friction, the coordination overhead, and the knowledge gaps that slow engineering teams down far more than typing speed ever did.

If your team is evaluating AI tooling and only measuring lines of code generated or time to first commit, you're probably missing the real value. The teams getting the biggest returns are the ones that recognized AI as a way to make the whole system move faster, not just individual contributors.

To see how Kilo fits into your engineering org, check out our enterprise plans or talk to our team.

Hermes vs. OpenClaw - When to Reach for Which Agent

Darko from Kilo — Fri, 08 May 2026 10:53:38 +0000

Last week, someone in the Kilo Discord asked: "Should I switch from OpenClaw to Hermes?" I've seen this question pop up a dozen times since Hermes launched in February. It's the right question to ask — both are open source, both connect to your chat apps, both run tools and remember things. On paper, they look almost identical.

But after running both for the past two months, I think the feature checklists are a distraction — the design philosophies are where they actually diverge.

The One-Sentence Difference

Hermes packages a gateway around a learning agent.
OpenClaw packages an agent around a messaging gateway.

That distinction sounds abstract, but it has practical consequences for how you configure and interact with each tool.

What Hermes Gets Right

Hermes Agent comes from Nous Research and launched in February 2026. It's hit about 135,000 GitHub stars as of this writing. The headline feature is what they call a "learning loop" — the agent creates and evolves its own skills based on what it does.

From their feature docs:

Self-improving skills: The agent generates procedural knowledge from experience. Run the same task type a hundred times, and Hermes actually gets better at it.
Five sandbox backends: Local execution, Docker, SSH, Singularity, and Modal. You pick how isolated you want command execution to be.
Subagent delegation: Spawn child agents with isolated contexts and terminals. Parallel workstreams without context pollution.
Broader browser/voice stack: Browserbase, Browser Use, Firecrawl, local Chrome, plus native voice in Discord channels.

The Hermes documentation is worth reading even if you don't use it — the provider matrix alone covers 19+ providers with detailed auth flows.

What impressed me most was the checkpoint system. Before Hermes touches files, it snapshots your working directory. If something goes wrong, you can run /rollback. I've used this more times than I'd like to admit.
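The idea behind a checkpoint is simple enough to sketch in a few lines. This is not how Hermes implements it (the post doesn't say); it is a naive copy-based version for illustration, and the function names are made up.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(workdir: str) -> str:
    """Copy the working directory to a temp location and return that path."""
    checkpoint = Path(tempfile.mkdtemp(prefix="checkpoint-")) / "snapshot"
    shutil.copytree(workdir, checkpoint)
    return str(checkpoint)


def rollback(workdir: str, checkpoint: str) -> None:
    """Replace the working directory with the snapshot's contents."""
    shutil.rmtree(workdir)  # drop everything changed since the snapshot
    shutil.copytree(checkpoint, workdir)
```

A full copy is wasteful on large repositories, so a real tool would more likely store deltas, but the contract is the same: snapshot before the agent writes, restore on demand.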

What OpenClaw Gets Right

OpenClaw has been around longer and has the larger community — roughly 369,000 GitHub stars and 13,700+ community-built skills. It started as a personal assistant project by Peter Steinberger and grew into something much bigger.

OpenClaw is fundamentally a gateway. The docs are explicit: "The Gateway is the single source of truth for sessions, routing, and channel connections."

What that means in practice:

Channel breadth: Discord, Google Chat, iMessage, Matrix, Microsoft Teams, Signal, Slack, Telegram, WhatsApp, Zalo, WebChat. One Gateway process handles all of them.
Multi-agent routing: Isolated sessions per agent, workspace, or sender. You can run different agents for different purposes through the same gateway.
Mobile nodes: iOS and Android apps that pair with the gateway for camera, canvas, and device actions.
Massive skill ecosystem: 13,700+ community skills covering everything from email to calendar to flight check-ins.

The architecture assumes you want one always-on process that routes messages to agents. That's different from Hermes's model of "here's an agent runtime that can talk to various platforms."

Known Pitfalls

Both tools have well-documented failure modes that the communities are vocal about. Worth knowing before you commit.

Hermes:

Self-evaluation always passes. Hermes evaluates its own work to decide if a task succeeded. The problem: it almost always thinks it did well, even when it didn't. This means the skills it auto-generates from "successful" tasks can encode errors. You need external validation for anything important.
Self-learning overwrites manual edits. The same system that auto-generates skills also overwrites your customizations. If you've spent time tuning a skill for a specific workflow, the agent may "self-improve" it back into something generic. Power users find this maddening.
Maturity gap. With only 11 releases compared to OpenClaw's 137, Hermes simply hasn't been tested at the same scale. Fewer updates mean fewer chances to break things, but that's not the same as proven stability.

OpenClaw:

Updates break things. This is the most consistent complaint in the community. Users report roughly a 25% chance that any given update will break response delivery, cron jobs, or webhooks. The development process lacks the staging/testing discipline you'd expect.
Memory is unreliable. Agents forget instructions, cross-contaminate data between projects, and repeat mistakes. Memory retention issues are the #1 driver of user churn.
Self-hosting is the real barrier. Docker setup, SSH configuration, YAML files, security hardening, 24/7 uptime — users consistently report spending more time on infrastructure than on their actual agent workflows.
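To put the "updates break things" figure in perspective, here is a quick calculation. It takes the community-reported 25% at face value and assumes each update breaks independently of the others; both are simplifications.

```python
def p_at_least_one_break(per_update_risk: float, updates: int) -> float:
    """Probability that at least one of `updates` independent updates breaks."""
    return 1 - (1 - per_update_risk) ** updates


# Four updates at a 25% risk each: 1 - 0.75**4 = 0.68359375
print(round(p_at_least_one_break(0.25, 4), 3))  # prints 0.684
```

Under those assumptions, applying four updates gives roughly a two-in-three chance of hitting at least one breakage, which is a reason to pin versions and test updates before rolling them out.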

Trade-offs

A comparison on ScreenshotOne put it well: Hermes is "agent-first" while OpenClaw is "gateway-first."

Hermes optimizes for the agent becoming more capable over time. It's built for people who want autonomous agents that learn from experience.

OpenClaw optimizes for a persistent assistant you can message from anywhere. It's built for people who want infrastructure they can talk to.

Neither approach is wrong. But they lead to different outcomes:

Dimension	Hermes	OpenClaw
Learning	Native skill evolution	Skills are static (community-maintained)
Sandbox options	5 backends (local, Docker, SSH, Singularity, Modal)	Docker, SSH, local
Channel breadth	7 messaging platforms	24+ platforms and plugins
Community size	~135k stars, growing fast	~369k stars, larger skill library
Browser providers	6+ options including cloud services	Local Chrome + managed profiles
IDE integration	ACP support (VS Code, Zed, JetBrains)	CLI + browser control UI

Security Considerations

This matters more than people think. A Reddit thread documented OpenClaw's 2026 security incidents: 6 CVEs, 341+ malicious skills identified in the community repository, 135,000+ exposed instances found by Shodan.

OpenClaw grew fast. Some security assumptions that made sense for a personal tool on a laptop became dangerous when people started running it on public VPSes with open ports.

Hermes, being newer, has zero reported agent-specific CVEs as of April 2026. That's not because it's inherently more secure — it just hasn't had the same scale of exposure. Give it time.

Both projects now have sandboxing options and approval flows. But if you're deploying either on a server, audit the defaults. Neither assumes you're running on a hardened production box.

When to Pick Hermes

Hermes is the better choice if:

You want an agent that improves at tasks over time
You need multiple sandbox backends (especially Modal for cloud execution)
You're doing research-style workflows with subagent delegation
You want tight IDE integration via ACP
You're willing to trade ecosystem size for a more capable core agent

The learning loop is what justifies choosing Hermes over OpenClaw. If you're running the same types of tasks repeatedly — data analysis, code review, research synthesis — Hermes will genuinely get better at them.

When to Pick OpenClaw

OpenClaw is the better choice if:

You want to message your assistant from everywhere (24+ platforms)
You need the existing skill ecosystem (13,700+ skills)
You want mobile nodes for phone camera/canvas integration
You're building team infrastructure, not just a personal agent
You value stability over cutting-edge features

If your primary use case is "I want to message my AI from WhatsApp and have it do things on my computer," OpenClaw has that nailed.

The Cost Problem

This doesn't get discussed enough. Running either agent autonomously is expensive if you're not careful. Every message sends the full conversation history to the API, so costs compound within a session.
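
A minimal sketch of why costs compound, assuming every request resends the whole history and every message is the same size. Both are simplifications, and the token counts are illustrative rather than taken from any provider.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int) -> int:
    """Total input tokens billed over a session when each request
    resends the full history (all user and assistant messages so far)."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_message  # the new user message joins the history
        total += history               # the full history is sent as input
        history += tokens_per_message  # the assistant reply joins the history
    return total


ten_turns = cumulative_input_tokens(10, 500)
twenty_turns = cumulative_input_tokens(20, 500)
```

In this model, input tokens grow with the square of the number of turns: doubling a session from 10 to 20 turns quadruples the billed input.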

Users in the community report anywhere from $1-3/day on budget models to $130+/day on Claude Opus for heavy agentic use. The fix is aggressive session resets and picking appropriate models per task tier:

Quality-sensitive work: Claude Opus 4.6 (expensive, best agentic performance)
Daily driver: GPT 5.4 (thinking mode on medium+) or MiniMax M2.7
Budget automation: Qwen 3.5/3.6 (free on OpenRouter), GLM-5.1, Kimi K2.5

Flat-rate subscriptions (MiniMax at $10-20/month, Ollama Pro Cloud at $20/month) are rapidly replacing per-token billing as the community default.

What I Actually Use

I run both — and the community data confirms this is a growing pattern. The specific architecture that works: OpenClaw as orchestrator (planning, decomposition, multi-step coordination, scheduling) and Hermes as execution specialist (fast, repeatable task loops). They communicate via the ACP protocol.

OpenClaw handles my day-to-day messaging — it's the interface I talk to from Telegram. I've been using it for months and the skill ecosystem covers most of what I need.

Hermes runs on research tasks where I want the learning loop. When I'm doing a series of similar analyses, Hermes's skill evolution actually matters.

I could probably consolidate — Hermes's docs actually note that it's the "successor to OpenClaw" and they have a migration command (hermes claw migrate) — but I haven't felt the urgency. They solve different problems well.

Summary

Both projects are actively developed. Both have real communities. Both work.

Hermes is younger, more ambitious architecturally, and smaller in ecosystem. OpenClaw is more mature, broader in integrations, and has had more security scrutiny (for better and worse).

The 30% of developers who switched from OpenClaw to Hermes cite "maintenance fatigue" from debugging community skills and wanting the learning loop. The 35% who stayed on OpenClaw cite integrations and ecosystem breadth.

Pick based on what you actually need. If you want a persistent assistant you can message, OpenClaw. If you want an agent that improves itself, Hermes.

Or run both — they're free, and the resource overhead of a second process is negligible.

Mistral Medium 3.5 is Live in Kilo

Darko from Kilo — Fri, 08 May 2026 10:44:59 +0000

We're thrilled to announce that the public preview version of Mistral Medium 3.5 is now live in Kilo. This is Mistral's first blended model (it merges instruction-following, reasoning, and coding into a single 128B dense model) and it puts the lab instantly back on the OSS map.

If it's seemed quiet on the Mistral front for a while, that's because they've been heads-down building. This new model is a major leap for the lab, and the focus on agentic work — coding and agentic engineering — benefits all of us.

Mistral's new flagship is a dense 128B model with a 256k context window, built from the ground up for long-horizon agentic work. It merges instruction-following, reasoning, and coding into a single set of weights, with configurable reasoning effort so you can dial it up for a gnarly refactor or keep it light for a quick edit. It scores 77.6% on SWE-Bench Verified, putting it ahead of Devstral 2 and models like Qwen3.5 397B A17B. The vision encoder was trained from scratch to handle variable image sizes, and the whole thing can run self-hosted on as few as four GPUs.

And Mistral is sticking to their OSS principles: the new model shipped with open weights under a modified MIT license.

This is a serious new model for serious engineering tasks, and Mistral users will find that it's now the default for the Mistral Vibe CLI and Le Chat. And with Kilo, anyone can use it alongside hundreds of other top models and pick the right tool for each job.

Use Mistral Medium 3.5 Everywhere You Use Kilo

The new model is available in the Kilo Gateway, so you can use it everywhere with a single login.

VS Code Extension

The upgraded Kilo Code VS Code extension now surfaces Mistral Medium 3.5 in the model switcher. Pick it for any task where you want a model that can hold a lot of context, reason through complexity, and produce structured output your codebase can actually consume.

Kilo Code CLI

Running Kilo from the terminal? Mistral Medium 3.5 is available there too. It's a strong choice for longer CLI sessions — dependency upgrades, test generation, CI investigations — where you want the model working steadily without losing the thread.

Cloud Agents

Kilo Code's cloud agent infrastructure is where Mistral Medium 3.5 really opens up. Kick off sessions powered by this model, walk away, and come back to finished branches or draft PRs. The model was built specifically for async, multi-tool work — running long stretches reliably, calling tools in sequence, producing structured handoffs. That makes it a natural fit for the tasks you want to delegate completely: module refactors, issue triage, test coverage gaps, incident investigations.

KiloClaw

Mistral Medium 3.5 is available as a model option across KiloClaw recipes. Whether you're running a personal claw or a work claw, you can now back those workflows with a model that handles complex, multi-step reasoning without breaking a sweat.

Try It in Kilo Today

Mistral Medium 3.5 is priced at $1.50 per million input tokens and $7.50 per million output tokens through the API. For a frontier-class 128B model at this capability level, that's competitive — especially for agentic runs that justify the context and reasoning headroom.

At a blended price of $3 per million tokens for general chat, and just $1.56 per million tokens for long-context summarization, it's more affordable than it might look at first glance.
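
The blended figures can be reproduced with a weighted average. The post does not state the token mixes behind them; the 3:1 and 99:1 input-to-output ratios below are assumptions that happen to yield the quoted numbers.

```python
def blended_price(input_price: float, output_price: float, input_share: float) -> float:
    """Weighted average price per million tokens, where input_share is the
    fraction of all tokens that are input tokens (0.0 to 1.0)."""
    return input_price * input_share + output_price * (1.0 - input_share)


# Prices are from the post; the token mixes are assumed, not stated by the source.
general_chat = blended_price(1.50, 7.50, 0.75)        # 3 input tokens per output token
long_summarization = blended_price(1.50, 7.50, 0.99)  # 99 input tokens per output token
```

The more input-heavy the workload, the closer the blended price gets to the $1.50 input rate.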

Plus, if you grab a Kilo Pass you can embrace a healthy discount :)

Open the model switcher in the latest version of our VS Code extension, select it in your CLI agent config, or choose it as the backing model for your next KiloClaw recipe. It's available now in public preview — we'd love to hear what you build with it.

Originally published on the Kilo Blog.

KiloClaw in VS Code, Kilo CLI in KiloClaw

Darko from Kilo — Fri, 08 May 2026 10:42:14 +0000

When your AI agent lives inside your AI coding assistant (and vice versa)

By Brendan O'Leary · May 04, 2026

Last week in the Kilo Discord, someone asked if they could SSH into their KiloClaw instance from VS Code. Not to use Kilo Code — just to edit their agent's AGENTS.md file directly. A few messages later, another person asked how to get KiloClaw chat working inside their editor.

Same underlying need from two directions: how do I talk to my always-on agent while I'm in the middle of writing code?

Kilo Code shipped answers to both in April. KiloClaw now has a native chat panel inside the VS Code extension. And the Kilo CLI — which ships built into every KiloClaw instance — got org-aware /kiloclaw support so you can manage your cloud agent from the terminal.

What this looks like in practice

KiloClaw in VS Code means you open the KiloClaw chat panel alongside your Kilo Code sidebar. You're editing code with Kilo Code's agent on one side, and on the other you have your KiloClaw agent that's running on a server somewhere — doing background work, monitoring things, managing tasks. Interactive coding in one panel, autonomous agent in the other.

Kilo CLI in KiloClaw means your cloud-hosted KiloClaw instance has the full Kilo CLI available. Your agent can use kilo run to spin up coding sessions on its own projects, use kilo pr to check out and review pull requests, or invoke any of the 500+ models through the same interface you use locally.

Josh from the Kilo team said it plainly in Discord:

KiloClaw ships with Kilo CLI built-in. We are also working to integrate KiloClaw inside of the extension. Being able to start a session, pick it up with KiloClaw, set KiloClaw to do work autonomously, etc. is pretty powerful.

Setting up KiloClaw in VS Code

The panel shipped in v7.2.20 and is available now if you have a KiloClaw instance.

Update your VS Code extension to the latest version
In the sidebar, click the KiloClaw icon (chat bubble) or open the Command Palette and run KiloClaw
If you already have a KiloClaw instance configured through Kilo Gateway, you'll see the chat panel with your agent's conversation history
If you don't have one yet, you'll get a setup view that walks you through provisioning

The panel uses the same Stream Chat WebSocket as the web UI, so messages appear in real time. Your agent's responses stream in, and the panel restores automatically when you reopen VS Code.

One detail I noticed: it uses the same kilo-ui component library as the rest of the extension. Markdown rendering, buttons, toast notifications all match. Doesn't feel bolted on.

Using Kilo CLI inside KiloClaw

If you're running KiloClaw (either self-hosted via OpenClaw or on Kilo's managed hosting), the Kilo CLI is already there. Your agent can invoke it directly.

A few patterns I've been using:

Your KiloClaw agent watches a repo for new PRs and uses kilo pr <number> to check them out and run a review session. Results come back to you over Telegram, Discord, or wherever you get KiloClaw messages.
You tell your agent "refactor the authentication module" and it uses kilo run with the right model and mode to do the work, commits the result, and opens a PR for you to review.
Your agent has access to multiple repos and can run separate Kilo sessions in each one, coordinating changes that span services.

The /kiloclaw command in the CLI now supports organization contexts too. If you've selected a team via /teams, running /kiloclaw resolves to that org's KiloClaw instance rather than your personal one. Useful if your company has a shared agent for CI tasks.

Why both

Kilo Code in VS Code is interactive. You're pair-programming with it. It sees your editor state, your file tree, your terminal output. It works in your context.

KiloClaw is persistent and autonomous. It runs when you're asleep, handles background tasks, monitors systems, processes incoming requests. It works in its own context, on its own machine.

Having both accessible from the same editor means you can tell your KiloClaw agent to start a background task while you keep coding, check in on what it found overnight, or hand off a tedious refactor while you work on the interesting parts. When it finishes, the results show up right there in your editor.

I've been doing this for the last week. Writing code with Kilo Code, glancing over at KiloClaw to see what my agent turned up from the research I asked it to do that morning. No tab switching, no opening a separate app. It's there.

Rough edges

This is new. A few things to know:

The KiloClaw panel requires Kilo Gateway authentication. If you're using Kilo Code with just a bare API key and no Kilo account, you won't see the panel.
The /kiloclaw command in the CLI only works when connected to Kilo Gateway. Same prerequisite.
Error handling got improved this week — there was an issue where failures in the WebSocket connection could leave the panel in a bad state. That's fixed in the latest release.
Documentation is still catching up. There's an open PR to add a "Setting Up Other Tools" page for KiloClaw that should cover this in more detail once it lands.

What's next

Your coding assistant and your autonomous agent used to be separate tools with separate UIs. Now they share the same extension, the same underlying engine, and the same model ecosystem. I expect the boundary between "interactive coding agent" and "background autonomous agent" to keep blurring.

I use both daily. KiloClaw runs my email checks, monitors Discord, handles blog research (it's writing this post right now, actually). Kilo Code handles the interactive stuff — writing features, debugging, reviewing diffs. Having them in the same window means I stop context-switching between tools to check on what my agent is doing.

If you're running KiloClaw already, update your VS Code extension and try the panel. If you're just using Kilo Code, the /kiloclaw command in the CLI is how you'd set up your first instance.

Originally published on the Kilo Blog.

The Arrival of GPT-5.5: OpenAI’s New Deep-Thinking Powerhouse

Darko from Kilo — Mon, 27 Apr 2026 09:19:50 +0000

OpenAI recently rolled out GPT-5.5 and its heavy-duty sibling, GPT-5.5 Pro, and everybody wants to put them to the test.

If you feel like the model landscape is moving faster and faster, you're right. OpenAI's chief data scientist told TechCrunch this week that "the last two years have been surprisingly slow," but what he meant is that now we're really moving — now we're cooking with gas. And that's a good thing for consumers.

These SOTA models aren't just becoming smarter and more comprehensive, they're also becoming more token-efficient for larger tasks.

What's new?

GPT-5.5 is OpenAI's latest release for complex professional workloads, building on GPT-5.4 with stronger reasoning, higher reliability, and improved token efficiency on hard tasks.
GPT-5.5 Pro is OpenAI's high-capability model optimized for deep reasoning and accuracy on complex, high-stakes workloads.

Both new models are now available in the Kilo Gateway and GPT-5.5 is one of our top recommended models out of the gate.

A New Standard for Complex Work

GPT-5.5 is particularly impressive when it comes to coding and reasoning, and the kind of computer-use and browser skills needed by always-on agents like KiloClaw:

Terminal-Bench 2.0 (Command-line workflows & tool coordination): 82.7% (vs. GPT-5.4: 75.1% | Claude Opus 4.7: 69.4%)
Expert-SWE (Internal long-horizon coding tasks ~20 hours): 73.1% (vs. GPT-5.4: 68.5%)
GDPval (Knowledge work across 44 occupations): 84.9% (vs. GPT-5.4: 83.0% | Claude Opus 4.7: 80.3%)
OSWorld-Verified (Operating real computer environments): 78.7% (vs. GPT-5.4: 75.0% | Claude Opus 4.7: 78.0%)
BrowseComp: 84.4% (GPT-5.5 Pro scores 90.1%)

But benchmarks are only half the story. We had the privilege of pre-testing the alpha release of GPT-5.5, and we're ready to share what this means for builders, agents, and the broader AI ecosystem. First of all, it's exciting to see OpenAI continuing to bridge the gap between execution and high-level strategy. Coming just two days after the release of GPT-5.4 Image 2, a stunning new image generation model for multimodal workflows, GPT-5.5 covers a lot of bases for professional workloads. This new model can transform how engineering teams scale their most complex autonomous workflows.

In our testing, GPT-5.5 has proven to be tremendously capable at long-context tasks and agentic coding. Where previous generation models would occasionally lose the plot during massive refactoring jobs or deep-reasoning requirements for large codebases, GPT-5.5 stays locked in.

More importantly for our ecosystem, it has become a formidable daily driver for KiloClaw as well as an excellent fit for getting a new claw up and running and exploring new use cases. We've been using it to run always-on agents handling highly complex, multi-step professional work, and the reliability jump is palpable.

As we noted in our recent deep dive comparing Claude Opus 4.7 and Moonshot's Kimi K2.6, the frontier of AI is fiercely competitive right now. While Opus 4.7 and Kimi K2.6 brought massive leaps in their own rights, GPT-5.5 introduces a new class of autonomous capability that specifically targets professional, high-stakes workflows where fewer retries and higher reliability directly translate to better outcomes.

GPT-5.5 is definitely crushing a wide range of benchmarks, which fits with our experience testing the model in Kilo Code and KiloClaw. Significantly, it topped the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google.

In our testing, GPT-5.5 did have some issues with UI-related design tasks, but we found that more specific instructions helped resolve some of those problems.

So which one should you use?

GPT-5.5 is priced higher than GPT-5.4, reflecting its heavy-duty reasoning capabilities.

Even so, GPT-5.5 ($5 / Mtok input, $30 / Mtok output, $0.50 / Mtok cache) is more approachable than it might look from the outside. The 5.5 series is more token-efficient than 5.4. On hard tasks, that efficiency often results in a lower actual cost per completed task, because the model more often gets it right on the first try, without endless prompt engineering or retry loops.
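
The cost-per-completed-task argument can be made concrete with a small sketch. The GPT-5.5 prices are the ones quoted above. The token counts, the success rates, and the cheaper comparison model's prices are hypothetical placeholders, not measurements, and retries are assumed to be independent.

```python
def cost_per_attempt(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Dollar cost of one attempt; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def expected_cost_per_completed_task(attempt_cost: float, success_rate: float) -> float:
    """With independent retries, the expected number of attempts is 1 / success_rate."""
    return attempt_cost / success_rate


# GPT-5.5 prices from the post; token counts and success rate are hypothetical.
gpt55_attempt = cost_per_attempt(200_000, 20_000, 5.00, 30.00)
gpt55_task = expected_cost_per_completed_task(gpt55_attempt, 0.8)

# A hypothetical model at half the price but with a lower first-try success rate.
cheap_attempt = cost_per_attempt(200_000, 20_000, 2.50, 15.00)
cheap_task = expected_cost_per_completed_task(cheap_attempt, 0.35)
```

With these made-up numbers, the cheaper model costs less per attempt ($0.80 versus $1.60) but more per completed task (about $2.29 versus $2.00). The real comparison depends entirely on the success rates you measure on your own workloads.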

GPT-5.5 often reaches higher-quality outputs with fewer retries, so it can be more token-efficient on real workflows even at higher reasoning effort. And good news for Kilo Coders: it's the most token-efficient on coding workflows.

We would also like to echo OpenAI's own advice here: "Higher reasoning can use more tokens, so customers should match reasoning effort to the task."

In-memory prompt caching is not supported for GPT-5.5. Caching for this model relies exclusively on extended prompt caching. During inference, the model caches tokens from previous requests directly on GPU-local storage.

Does it Claw?

We're excited to see what Kilo users around the world do with it. Like the new Opus, it's super smart. But is it too smart for daily tasks? Or will it become your daily driver?

My prediction is that GPT-5.5 will compete more directly with the latest Opus release for coding, but be more of a top-agent driver in Hermes and OpenClaw workflows like KiloClaw: sub-agents will likely need to use smaller models or OSS models to remain cost-efficient.

That said, the only way to really

Shell Security Plugin

Darko from Kilo — Mon, 27 Apr 2026 09:16:14 +0000

I ran openclaw security audit on my instance the other day and got back a wall of text. Six findings — one critical, three warnings, two informational. I stared at it for a minute, scrolled through the nested objects, and thought: "Okay, but what should I actually do about this?"

That's the gap the new Shell Security plugin fills. It takes that same audit output, sends the findings (not your secrets, not your config) to the KiloCode Security Advisor API, and gives you back a prioritized report with specific remediation steps. The whole thing happens in your chat — Telegram, Slack, the Control UI, wherever you talk to your agent.

What it does

The plugin is a thin bridge between two things that already exist:

openclaw security audit — the built-in CLI command that checks your local config for common security foot-guns (weak models without sandboxing, exposed runtime tools, missing trusted proxies, multi-user setups without isolation)
KiloCode's Security Advisor API — an endpoint that takes those findings and returns expert analysis with context-specific remediation guidance

The plugin runs the audit locally, packages the JSON output, and sends it off. What comes back is a markdown report that covers what was found, why it matters, and what to do about it — organized by priority.

Installing it

It's currently dev-only but will be released soon!

openclaw plugins install @kilocode/shell-security
openclaw plugins enable shell-security
openclaw gateway restart

The gateway restart is a one-time thing after install. If you're talking to your agent through Slack or Telegram, you'll see a brief connection blip and then it's back.

Two ways to run it

Slash command (recommended):

This runs the plugin directly and renders the full report. It bypasses the LLM's summarization layer entirely, so you get the complete output regardless of which model you're running.

Natural language:

You can also just say "run a security checkup" or "audit my OpenClaw config" and the agent will call the tool. One thing to know: if you're running a smaller model (Haiku, GPT-x-nano), it might paraphrase or truncate the report. Capable models like Sonnet or GPT's latest handle it fine. When in doubt, use the slash command.

First-run authentication

The first time you run it, the plugin prompts you to connect your KiloCode account through a device auth flow:

Open a URL in your browser
Enter a code
Sign in or create a free account
Run /security-checkup again

After that, the token is saved and you never see the auth flow again. There's a gateway reload on first auth (the plugin writes the token to your config), but subsequent runs are instant.

If you're running OpenClaw in CI or a container, you can skip the interactive flow entirely by setting KILOCODE_API_KEY as an environment variable.

What gets sent (and what doesn't)

This matters. Your OpenClaw instance has access to your filesystem, your API keys, your chat history. The plugin doesn't send any of that.

Sent:

The JSON output of openclaw security audit — finding IDs and summaries, no secrets
Your OpenClaw version and plugin version
Your instance's public IP (for optional remote probes)

Not sent:

Config file contents
API keys, secrets, or tokens
Chat history
Workspace files

Everything goes over HTTPS, authenticated with your KiloCode account token.
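
One way to think about the sent/not-sent split is as an allowlist applied before anything leaves the machine. The sketch below is not the plugin's code: the field names (`id`, `severity`, `summary`, `findings`) and the payload shape are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical field names; the real audit JSON schema may differ.
ALLOWED_FINDING_FIELDS = ("id", "severity", "summary")


def build_payload(audit: dict, openclaw_version: str,
                  plugin_version: str, public_ip: str) -> dict:
    """Keep only allowlisted fields from each finding and attach version info."""
    findings = [
        {key: finding[key] for key in ALLOWED_FINDING_FIELDS if key in finding}
        for finding in audit.get("findings", [])
    ]
    return {
        "findings": findings,
        "openclaw_version": openclaw_version,
        "plugin_version": plugin_version,
        "public_ip": public_ip,
    }
```

An allowlist fails closed: a new field added to the audit output later is dropped by default instead of being sent by accident.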

What the report looks like

On my instance, the report came back with findings grouped by severity — the critical one about small models running without sandboxing at the top, followed by the warnings about trusted proxies and multi-user heuristics, and then the informational items. Each finding includes context about why it's a risk and concrete steps to fix it.

It's... a lot of text right now. The formatting still needs work — the dev release is functional but not polished. There's also a bug where the KiloClaw call-to-action shows up even if you're already a KiloClaw user. These are known rough edges that'll get smoothed out before the stable release.

Why this is useful

Running openclaw security audit is already good practice. But JSON output requires you to interpret each finding yourself, look up what the check IDs mean, and figure out the right remediation. The Security Advisor layer turns those findings into specific guidance you can act on immediately.

For anyone running OpenClaw as a personal assistant (which is most of us), the security surface is real. Your agent has shell access, filesystem access, web browsing. A misconfigured model fallback or an unintended multi-user exposure means your agent could be manipulated by untrusted input. Having something that checks this and explains the results in plain language saves you from reading JSON and guessing at severity.

Current status

The npm package is live and the source is on GitHub under MIT license. A stable release is coming — the main work remaining is formatting improvements and fixing the conditional CTA logic.

Install it, run /shell-security, see what it finds. It takes about thirty seconds.

New VS Code Extension - Week Three: Memory, Stability, and Moving at Kilo Speed Into the Future

Darko from Kilo — Fri, 24 Apr 2026 08:16:10 +0000

Three weeks ago we GA'd the completely rebuilt Kilo Code extension for VS Code. Week one was about what we were hearing and what we were shipping. Week two was about addressing the most urgent feedback and bumps.

This week is about the two other areas of frequent feedback and challenges: memory usage on Windows and session stability under sustained use. Both are materially better now than they were a week ago. Neither is 100% fixed and "done": we can see from open GitHub issues that some of you still hit rough edges. But the experience is significantly improved, especially on Windows when using Agent Manager.

Across the week we shipped 80+ Kilo PRs and merged three more upstream OpenCode releases.

Windows Memory: A Big Step Forward

This is the one we know has caused the most pain. Users on Windows reported the Kilo core process climbing into multiple GB of RAM within minutes of opening Agent Manager and staying there. A handful of you sent us heap snapshots (thank you), which helped us track down the root cause of some harder-to-reproduce issues.

The high-level story: Agent Manager was polling git status and diffs through the Kilo core subprocess, and on Windows the combination of IPC round-trips, diff payload sizes, and allocator behavior meant freed memory wasn't being returned to the OS cleanly. In v7.2.20 we've restructured that path (#9046) and made the extension much more careful about what it holds in memory:

Agent Manager's git work now runs directly in the extension host, not through the core process.
We cap how much of any single diff we'll read into memory, so opening a very large file no longer causes a spike the allocator can't recover from.
We also tuned the allocator on the core process itself to release memory back to the OS more promptly on Windows.
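
The diff cap in the second point can be illustrated with a short sketch. This is not the extension's code (the function name and the way truncation is reported are assumptions), and it is written in Python purely for illustration.

```python
import io


def read_capped(stream, max_bytes: int) -> tuple[bytes, bool]:
    """Read at most max_bytes from a binary stream.

    Returns (data, truncated). One extra byte is requested so that
    a stream of exactly max_bytes is not reported as truncated.
    """
    data = stream.read(max_bytes + 1)
    truncated = len(data) > max_bytes
    return data[:max_bytes], truncated


# Example: a 100-byte "diff" read with a 10-byte cap.
preview, was_truncated = read_capped(io.BytesIO(b"x" * 100), 10)
```

Because the cap bounds the read itself, peak memory stays proportional to the cap, not to the size of the file being diffed.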

If you were running on a downgraded 5.x build because of memory issues, this is the release to come back on. If you're still seeing unbounded growth, please keep the issues coming — the heap-snapshot command we added this cycle (#9034) makes those reports much easier to act on.

Session Stability: Fewer Interruptions

The second theme was sessions getting interrupted mid-flow — usually recoverable by sending another message or re-opening the session/extension. Most of the reports we got traced back to a handful of specific state-machine edges, and those are now meaningfully better.

The one we heard about most often was sessions ending up stuck — most visibly when VS Code was closed while a suggestion prompt was still showing, which left the session permanently marked busy and any follow-up message queued forever. Sessions now go idle correctly while waiting on a suggestion response (#9199). A related set of stuck states around the end-of-plan flow — where "Start new session" and "Continue here" didn't reliably transition you into the handover session — also got fixed, so those buttons now move you into a new session that stays visibly busy until the handover summary lands (#9245, #9300).

Everyday chat behavior got a lot smoother too. The most common irritation was the chat view snapping back to the bottom while you were trying to read earlier context during a streaming response; that no longer happens, and scrolling back through long sessions now correctly reloads earlier history from the virtualized list (#9236, #9194). Switching between long sessions in Agent Manager — which used to briefly freeze the UI — is now near-instant, with the chat view self-healing if messages arrived while it was in the background (#8911). Smaller queue and layout fixes also landed around follow-up prompts and tool output interleaving.

Finally, a nice performance-and-stability win from the community: @IamCoder18 landed visibility-aware git polling plus resolution caching in Agent Manager's git stats poller (#8703), meaningfully reducing the number of git subprocesses the extension spawns on repos with many worktrees.
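
A rough sketch of the two ideas in that change, under loud assumptions: the class and method names are invented, and "resolution" is taken here to mean a per-worktree lookup (such as finding the repository root) whose result is stable enough to cache. The actual PR may resolve something else.

```python
from typing import Callable, Dict, List


class GitStatsPollerSketch:
    """Hypothetical poller: skips work while hidden, caches per-worktree lookups."""

    def __init__(self, resolve: Callable[[str], str],
                 fetch_stats: Callable[[str], dict]) -> None:
        self._resolve = resolve          # expensive lookup, e.g. spawning git
        self._fetch_stats = fetch_stats  # per-tick stats query
        self._resolved: Dict[str, str] = {}
        self.visible = True

    def _resolved_for(self, worktree: str) -> str:
        # Resolve each worktree once and reuse the answer on later ticks.
        if worktree not in self._resolved:
            self._resolved[worktree] = self._resolve(worktree)
        return self._resolved[worktree]

    def tick(self, worktrees: List[str]) -> Dict[str, dict]:
        # While the panel is hidden, do no git work at all.
        if not self.visible:
            return {}
        return {wt: self._fetch_stats(self._resolved_for(wt)) for wt in worktrees}
```

With many worktrees, the savings come from both sides: hidden panels spawn nothing, and visible ones skip the repeated lookup on every tick.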

New Capabilities This Cycle

Stability was the priority, but we still shipped meaningful new capability:

Fork sessions from any user message — both in Agent Manager (#9207) and in the sidebar (#9244). Branch at any point without losing the original.
KiloClaw chat panel in VS Code — the KiloClaw group chat experience now lives directly inside the editor (#7960).
Folder @-mentions — reference a folder with @ and include its top-level file contents as context (#9023).
Autocomplete backend prewarm — inline completions are ready on the first keystroke without having to open the Kilo sidebar first, and autocomplete state refreshes when workspace folders change (#9305).
Heap snapshots from the Command Palette — capture a snapshot of the bundled Kilo core directly from VS Code (#9034).
"Contribute on GitHub" CTA in Marketplace — a subtle footer link inviting contributions of new skills, modes, and MCP servers (#9099).

Upstream OpenCode

Three more OpenCode upstream releases merged this cycle — v1.4.4, v1.4.5, and v1.4.6 — bringing continued improvements to session sync, provider compatibility, Windows terminal handling, and the underlying AI SDK layer. Building on a shared open-source foundation continues to pay off: work from the broader OpenCode community lands in Kilo automatically.

Codebase Indexing Progress

Community contributor @shssoichiro's codebase indexing work (#6966) remains active. The branch is being kept current against main, review iterations are ongoing, and we're closing in on a form we can land. This is a substantial feature and we want to get it right — thank you for the sustained effort here.

Community Update

Some numbers and names from this cycle:

80+ PRs merged on top of the upstream OpenCode work.
3 upstream OpenCode releases merged — v1.4.4, v1.4.5, and v1.4.6.
Multiple stable releases promoted to the marketplace through the period, with v7.2.20 as the current stable.

Thank you to community contributors whose work landed or continued this cycle:

@shssoichiro — continued work on codebase indexing (#6966).
@IamCoder18 — visibility-aware git polling in GitStatsPoller (#8703).

And broad thanks to every community member who filed heap snapshots, reproduction steps, and Discord reports, and to everyone who sustained the long-running Windows performance thread (#8030). That conversation is the reason we had the signal we needed to tackle the memory work head-on this week.

Moving at Kilo Speed Into the Future

This is the last of the regular weekly updates in this series. The core issues that we highlighted in Week 1 — rate limiting, Plan/Ask strictness, human-in-the-loop controls, config resilience, and Windows memory — are either resolved or meaningfully better. We will continue to focus on smoothing out the rough edges in the near future.

We will also be driving Kilo further towards the vision of where agentic coding is going, enabling engineering teams to ship at Kilo Speed safely and confidently, faster than ever before. We are excited about this future and believe that the new V7 is on a strong foundation to build on. Agent Manager continues to improve for those who like to run multiple agent sessions in parallel, and will only become more useful as models continue to improve and become more capable and need less oversight. And when a particular change or workstyle requires closer agent supervision and pair programming, you can do that too. The AI landscape is evolving quickly and models keep advancing, and the tools we use need to keep pace.

To everyone who showed up over these three weeks — the issue filers, the PR authors, the Discord commenters, the prerelease testers, the heap-snapshot senders, and the folks who point to the future with feature requests — thank you. Your feedback, issues, and pull requests are genuinely what makes this community great. We value every piece of it, and we'll keep making the extension better because of it.

See you in the release notes.

— Josh and Mark

Move at Kilo Speed.

The future of Product Managers

Darko from Kilo — Thu, 23 Apr 2026 12:54:39 +0000

A product leader we know has 15 years of experience shipping developer tools. He spent a decade at a household name. He is, genuinely, one of the best product minds we've encountered in this industry.

He can't get a conversation for a group PM role.

That is a signal, not a market blip.

We've spent a lot of time talking about what AI is doing to engineers – how one developer with the right tools now ships what used to require a team of five. But we had an adjacent question: what happens to product managers?

Shipping isn't a funnel anymore

For years, software development worked like a funnel. PMs turned customer insights into specs. Engineers turned specs into code. The funnel created a natural place for the PM to sit – upstream, owning the translation layer.

Shipping was expensive. So you needed someone to decide what was worth shipping.

That's no longer true. Shipping is close to free now. So what is a PM's role now that the funnel has collapsed and PMs aren't filtering a very costly resource (engineering time)? Is there still a place for PMs in this new world?

As former PMs ourselves, we're watching this shift from two very different vantage points. At Kilo, there are about 40 people and one PM. We operate with a WAUzer (Weekly Active User) model – every engineer owns a single product area and is accountable for the weekly active users in that area. Every Monday, Evgeny would stand up for two minutes: here's what I did on cloud agents, here are the numbers, here's my target for next week. He was fast. He was accountable. And across those product areas, we saw roughly 10% week-over-week growth.

The product hat shifted to engineers. And it worked.

But it didn't work everywhere – the VS Code extension had too much surface area for one engineer to own clearly. So we brought in Josh. He runs a pod. He decides what gets built. Traditional PM model.

At Solo (Asher's company), it's just two people – one developer – moving at a pace that would have required a team of 10 three years ago. No PM at all. No coordination layer. The product question and the building question sit with the same person.

Two different experiments. Same conclusion forming.

It's always been vibe coding

"PMs were the original vibe coders. We wrote the spec, and the engineers were our LLMs."

That framing came out of a conversation between us. Because if the spec-to-code handoff is getting absorbed by AI tooling – if engineers can hold the product context and build without a translation layer – then the PM role has to move. The question is where.

We see two paths forward.

Path one: shift left toward go-to-market. The thing that's genuinely hard, even in an AI-native company, is knowing what to build. Not technically – but commercially. What will people pay for? What problem are we actually solving? Who is the buyer, and do we have them before we build?

That's where PMs might land. Not writing specs, but sitting closer to sales, customer research, and market discovery to orchestrate the product strategy and business rationale for building a feature. A big portion of the PM's role will be saying no to features to prevent bloat, and identifying customers who are willing to pay for a feature before it gets built.

Path two: the long thin layer – engineers who wear the product hat. Each engineer owns their area completely. Customer conversations, support, metrics, roadmap decisions – all of it. No handoff, no telephone game.

The upside is accountability. The downside is that it requires people who can go wide – technically sharp AND commercially minded AND customer-facing. That's a rare profile. And at some point, a customer doesn't want your one thin area. They want the whole package.

Both paths are real. You'll see companies betting on each.

The traditional shipping funnel is gone. It's dead in startups now and will die in F100s over the next 5 years. The people who figure out the new shape of product ownership – whether that's engineers, PMs who've shifted left, or something we don't have a name for yet – are the ones who'll be standing in three years.

The senior product leader we mentioned will land somewhere. His experience is real. But the role he's looking for may not look like what it used to. The best thing any PM can do right now is stop waiting for the old model to come back and start experimenting with new models.

Developers are working in the future. PMs need to join them.

We Gave Claude Opus 4.7 and Kimi K2.6 the Same Workflow Orchestration Spec

Darko from Kilo — Thu, 23 Apr 2026 12:47:55 +0000

Kimi K2.6 launched on April 20, 2026, four days after Anthropic released Claude Opus 4.7. We gave both models the same spec for FlowGraph, a persistent workflow orchestration API with DAG validation, atomic worker claims, lease expiry recovery, pause/resume/cancel, and SSE event streaming. Then we reviewed the code and reproduced the edge cases the models' own tests did not cover.

TL;DR: Claude Opus 4.7 scored 91/100 and Kimi K2.6 scored 68/100 on the same build. Kimi K2.6 reached 75% of Claude Opus 4.7's score at 19% of the cost, but the 23-point gap sits in lease handling, scheduling, and live streaming (the parts its own tests never exercised).

Pricing

Claude Opus 4.7 runs at roughly 5x the input cost and 6x the output cost of Kimi K2.6. That is the gap we wanted to pressure-test.

Why a Workflow Orchestration Spec

A workflow engine runs jobs like a nightly settlement: fetch captured payments, charge customers, send receipts, publish analytics. Four steps with dependencies between them, retries when a step fails, and recovery when a worker crashes mid-step. Temporal, Airflow, and AWS Step Functions all solve the same problem at different scales.

Most of our API comparisons test a wide range of skills (architecture, auth, filtering, error handling). For this test we wanted a single deep build where correctness was the main axis. A workflow engine with DAG validation, atomic step claims, lease expiry recovery, retry scheduling, and pause/resume/cancel semantics has objectively right and wrong answers. Either two workers can win the same step or they can't. Either an expired lease is recovered or it isn't. Either a step becomes runnable when its dependencies succeed or it doesn't.

The spec also calls out at-least-once execution, deterministic scheduling across all eligible steps, and SQLite as the source of truth. The full spec is 1,042 lines and covers 20 endpoints across workflow definitions, runs, workers, events, health, and metrics.
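
To make the DAG-validation requirement concrete, here is a minimal TypeScript sketch of one common approach, Kahn's algorithm. It is not taken from the spec or from either model's output: the `StepDef` shape, the `validateDag` name, and the error messages are all hypothetical, and the sketch assumes step ids are unique.

```typescript
// Hypothetical step definition; the real spec's field names may differ.
interface StepDef {
  id: string;
  dependsOn: string[];
}

// Returns one valid execution order if the steps form a DAG.
// Throws when a dependency is unknown or when a cycle exists.
// Assumes step ids are unique.
function validateDag(steps: StepDef[]): string[] {
  const ids = new Set(steps.map((s) => s.id));
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();

  for (const s of steps) {
    indegree.set(s.id, s.dependsOn.length);
    for (const dep of s.dependsOn) {
      if (!ids.has(dep)) throw new Error(`unknown dependency: ${dep}`);
      dependents.set(dep, [...(dependents.get(dep) ?? []), s.id]);
    }
  }

  // Start from steps with no dependencies and peel the graph layer by layer.
  const ready = steps.filter((s) => s.dependsOn.length === 0).map((s) => s.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const remaining = indegree.get(next)! - 1;
      indegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }

  // Any step never reached still has an unmet dependency, so it sits on a cycle.
  if (order.length !== steps.length) throw new Error("cycle detected");
  return order;
}

// The nightly settlement example, with an assumed dependency layout.
const settlementOrder = validateDag([
  { id: "fetch", dependsOn: [] },
  { id: "charge", dependsOn: ["fetch"] },
  { id: "receipts", dependsOn: ["charge"] },
  { id: "analytics", dependsOn: ["charge"] },
]);
```

The same in-degree bookkeeping also answers the runtime question of when a step becomes runnable: once every dependency has succeeded, its remaining count reaches zero.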

The Prompt

We ran both tests in Kilo CLI and gave both models the same prompt:

"Read @spec.md and build the project in the current directory. Treat @spec.md as the source of truth. Do not simplify this into a mock, toy app, or basic CRUD scaffold. Create all code, configuration, Prisma schema, tests, and README needed for a runnable project. Work autonomously and continue until the implementation is complete. Before you finish, install dependencies, run the test suite, fix any failures you can reproduce, and make sure the project is runnable."

Claude Opus 4.7 ran on high thinking mode. Kimi K2.6 ran on thinking mode. Each model worked in its own empty directory with no shared state.

What Each Model Produced

Claude Opus 4.7 finished in about 20 minutes. Kimi K2.6 took longer on the clock, but we are not scoring elapsed time here. Kimi K2.6 was released the day of this test and provider availability is still limited. Wall-clock comparisons against a model as well-supported as Claude Opus 4.7 would distort the picture. Expect that gap to close as more providers host Kimi K2.6.

Both models delivered the project shape we asked for:

Prisma with SQLite as the source of truth
Hono routes for workflow definitions, runs, worker actions, events, health, and metrics
Conditional updateMany for step claiming
Retry and lease-expiry scheduling
A RunEvent table for audit logs
Readmes with setup instructions and at-least-once execution notes

Both Models Said Their Tests Passed

Claude Opus 4.7 ran 31 tests across 6 files. Every test passed. Kimi K2.6 ran 20 tests inside a single file. Every test passed.

If we had stopped there, the two implementations would look close. They weren't. A direct code review plus targeted reproductions against isolated SQLite databases surfaced one real bug in Claude Opus 4.7 and six in Kimi K2.6. We will show each one with the line that causes it.

Claude Opus 4.7: One Real Bug

Multi-expired lease recovery leaves retryable siblings on a failed run

The spec says that when a step exhausts retries, the parent run fails and every other non-terminal step becomes blocked. Claude Opus 4.7's recovery path handles this correctly for a single expired lease. With two expired leases in the same recovery pass, it can undo its own block.

In src/services/workers.ts, runRecovery() loads every expired running step into memory and iterates:

If the first iteration exhausts retries for one step, failRunDueToDeadStep() fires, the run becomes failed, and every other non-succeeded step is set to blocked. That is correct.

The problem is the second iteration. handleLeaseExpiry() updates by id only:

There is no guard on status, so a step that was just marked blocked by the prior failure cascade gets updated back to waiting_retry.

We reproduced it with a run containing two expired running steps: a with maxAttempts = 1 and b with maxAttempts = 2. After recovery:

Step b should have been blocked because the run had already failed. Instead it is eligible to be claimed again on the next /workers/claim call.

Claude Opus 4.7's test suite does not cover this case. It tests single-step lease expiry in isolation.
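
One way to close this kind of gap is to make the retry update conditional on the step still being in the running state. The sketch below illustrates the idea with an in-memory map instead of a database. It is not the code from either implementation: `StepRow`, `updateIfRunning`, and `recoverExpired` are hypothetical names, and the rule that retries are exhausted when `attempt >= maxAttempts` is an assumption.

```typescript
type StepStatus = "running" | "waiting_retry" | "blocked" | "failed" | "succeeded";

// Hypothetical row shape for illustration only.
interface StepRow {
  id: string;
  status: StepStatus;
  attempt: number;
  maxAttempts: number;
}

// Stand-in for a conditional update (WHERE id = ? AND status = 'running').
// Returns the number of rows changed, as a conditional update would report.
function updateIfRunning(rows: Map<string, StepRow>, id: string, next: StepStatus): number {
  const row = rows.get(id);
  if (!row || row.status !== "running") return 0;
  row.status = next;
  return 1;
}

function recoverExpired(rows: Map<string, StepRow>, expiredIds: string[]): void {
  for (const id of expiredIds) {
    const row = rows.get(id);
    if (!row) continue;
    if (row.attempt >= row.maxAttempts) {
      // Retries exhausted (assumed rule): fail the step, then block its siblings.
      if (updateIfRunning(rows, id, "failed") === 0) continue;
      rows.forEach((other) => {
        if (other.status !== "succeeded" && other.status !== "failed") {
          other.status = "blocked";
        }
      });
    } else {
      // The status guard turns this into a no-op if the cascade above
      // already moved the step to blocked earlier in the same pass.
      updateIfRunning(rows, id, "waiting_retry");
    }
  }
}

// Same scenario as the reproduction: a has maxAttempts = 1, b has maxAttempts = 2.
const recoveryRows = new Map<string, StepRow>([
  ["a", { id: "a", status: "running", attempt: 1, maxAttempts: 1 }],
  ["b", { id: "b", status: "running", attempt: 1, maxAttempts: 2 }],
]);
recoverExpired(recoveryRows, ["a", "b"]);
```

With the guard in place, step b stays blocked after the pass instead of returning to waiting_retry. In a real database the guard belongs inside the update statement itself, so the check and the write happen atomically.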

Smaller contract risks

Two smaller issues turned up in review but did not need a full reproduction.

The claim path reads maxClaims * 10 candidates. That is fine most of the time, but a queue with many skipped candidates at the front can hide valid work farther down the ordered list.
The SSE stream subscribes after replay finishes and treats an unknown afterEventId as "replay everything." The spec does not define unknown-cursor behavior explicitly, so this is more a looseness than a bug.

Kimi K2.6: Six Confirmed Issues

1. Claim ordering is not global across runs

The spec requires that when multiple steps are eligible, claim order is priority descending, then availableAt ascending, then createdAt ascending, across all eligible steps.

Kimi K2.6's claim loop orders steps inside each run, then iterates runs in whatever order the database returns them:

We reproduced this with two active runs on the same queue. One had a step at priority = 10. The other had a step at priority = 100. The call to POST /workers/claim returned the priority 10 step first.
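
For illustration, the ordering rule quoted above can be expressed as a single comparator applied to the flattened list of eligible steps from every run. This is a minimal sketch, not the code from either model: the `EligibleStep` shape and function names are hypothetical, and timestamps are assumed to be epoch milliseconds.

```typescript
// Hypothetical shape; timestamps assumed to be epoch milliseconds.
interface EligibleStep {
  id: string;
  runId: string;
  priority: number;
  availableAt: number;
  createdAt: number;
}

// Priority descending, then availableAt ascending, then createdAt ascending.
function compareForClaim(a: EligibleStep, b: EligibleStep): number {
  return (
    b.priority - a.priority ||
    a.availableAt - b.availableAt ||
    a.createdAt - b.createdAt
  );
}

// Flatten the steps of every run first, then sort once, so the order is
// global rather than per run.
function orderClaims(stepsByRun: EligibleStep[][]): EligibleStep[] {
  const all = ([] as EligibleStep[]).concat(...stepsByRun);
  return all.sort(compareForClaim);
}

// Two runs on the same queue, mirroring the reproduction above.
const claimOrder = orderClaims([
  [{ id: "low", runId: "run-1", priority: 10, availableAt: 0, createdAt: 0 }],
  [{ id: "high", runId: "run-2", priority: 100, availableAt: 0, createdAt: 1 }],
]);
```

In a SQL-backed implementation the same effect usually comes from one query with a three-column ORDER BY across all eligible steps, rather than a sort in application code.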

2. SSE is replay-only, not live

The spec requires that GET /runs/:id/events/stream replays stored events and then switches to live streaming.

Kimi K2.6's stream reads every persisted event, writes them to the stream, and then starts a keepalive timer. Nothing subscribes to new events. The file src/lib/events.ts even defines an emitAndBroadcast function and a subscriber map, but the route never wires to them:

Clients receive replayed history once, then silence. The README still claims live streaming.
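
As a rough sketch of what "replay, then live" can look like, the example below subscribes to a broadcaster before reading stored events, buffers anything that arrives during replay, and uses the last delivered id to drop duplicates. Everything here is hypothetical and in-memory: `EventBus`, `streamEvents`, and the assumption that event ids are positive and increase monotonically within a run are illustrative, not details from the spec or from Kimi K2.6's code.

```typescript
interface RunEvent {
  id: number; // assumed positive and monotonically increasing per run
  type: string;
}

type Listener = (e: RunEvent) => void;

// Minimal in-memory broadcaster standing in for a per-run subscriber map.
class EventBus {
  private listeners = new Set<Listener>();

  subscribe(fn: Listener): () => void {
    this.listeners.add(fn);
    return () => {
      this.listeners.delete(fn);
    };
  }

  publish(e: RunEvent): void {
    this.listeners.forEach((fn) => fn(e));
  }
}

// Subscribes first, replays stored events, then forwards live ones.
// Returns a function that stops the live subscription.
function streamEvents(
  bus: EventBus,
  loadStored: () => RunEvent[],
  send: Listener,
): () => void {
  let lastSent = 0;
  let replaying = true;
  const buffered: RunEvent[] = [];

  // Deliver only events newer than the last one sent, which removes
  // duplicates from the overlap between replay and live delivery.
  const deliver = (e: RunEvent): void => {
    if (e.id > lastSent) {
      lastSent = e.id;
      send(e);
    }
  };

  const unsubscribe = bus.subscribe((e) => {
    if (replaying) buffered.push(e);
    else deliver(e);
  });

  loadStored().forEach(deliver);
  replaying = false;
  buffered.forEach(deliver);
  return unsubscribe;
}
```

Subscribing before the replay read is the important ordering choice: an event published while history is being loaded ends up either in the stored set or in the buffer, so it is not silently lost.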

3. Expired leases can still be completed

The heartbeat endpoint rejects expired leases. The complete and fail endpoints do not. We reproduced this by claiming a step, forcing leaseExpiresAt into the past, and calling POST /step-runs/:id/complete:

The step was marked succeeded on an expired lease. The spec treats lease expiry as a failed attempt. A worker can crash, its lease can expire, recovery can schedule a retry for the next worker, and the original worker can still phone in a "success" afterwards.
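
A guard of roughly the following shape would reject that late "success". This is a hedged sketch rather than either model's code: `ClaimedStep`, `completeStep`, the `leaseOwner` field, and the result codes are hypothetical names, and the lease is treated as expired once `leaseExpiresAt` is at or before the current time.

```typescript
// Hypothetical shape; leaseExpiresAt assumed to be epoch milliseconds.
interface ClaimedStep {
  status: "running" | "succeeded" | "failed" | "waiting_retry";
  leaseOwner: string;
  leaseExpiresAt: number;
}

type CompleteResult =
  | { ok: true }
  | { ok: false; code: "NOT_RUNNING" | "NOT_OWNER" | "LEASE_EXPIRED" };

// Applies the same lease checks to completion that a heartbeat would apply.
function completeStep(step: ClaimedStep, workerId: string, now: number): CompleteResult {
  if (step.status !== "running") return { ok: false, code: "NOT_RUNNING" };
  if (step.leaseOwner !== workerId) return { ok: false, code: "NOT_OWNER" };
  if (step.leaseExpiresAt <= now) return { ok: false, code: "LEASE_EXPIRED" };
  step.status = "succeeded";
  return { ok: true };
}

// A worker reporting success one second after its lease ran out.
const staleStep: ClaimedStep = { status: "running", leaseOwner: "w1", leaseExpiresAt: 1000 };
const staleResult = completeStep(staleStep, "w1", 2000);
```

Against a real database, the ownership and expiry conditions would need to sit in the same conditional update that writes the new status. A separate read followed by a write leaves a window for the recovery pass to run in between.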

4. "No active version" returns 404 instead of 409

The spec: if there is no active version and no explicit version, return 409.

Kimi K2.6 raises NOT_FOUND (404):

5. Validation is narrower than the spec

CreateRunSchema and CompleteSchema use z.record(z.any()) for input, metadata, and output. The spec allows arbitrary JSON payloads. A string, array, or number payload is rejected even though the spec accepts it.
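
To show what "arbitrary JSON" means in practice, here is a small plain-TypeScript type guard that accepts any JSON value, not only objects. It deliberately avoids any schema library, so it is a stand-in for the idea and not a drop-in replacement for the schemas named above. `JsonValue` and `isJsonValue` are hypothetical names, and the guard assumes input without circular references.

```typescript
type JsonValue =
  | null
  | boolean
  | number
  | string
  | JsonValue[]
  | { [key: string]: JsonValue };

// Accepts strings, finite numbers, booleans, null, arrays, and plain objects.
// Assumes the input has no circular references.
function isJsonValue(value: unknown): value is JsonValue {
  if (value === null) return true;
  switch (typeof value) {
    case "string":
    case "boolean":
      return true;
    case "number":
      // NaN and Infinity have no JSON representation.
      return Number.isFinite(value);
    case "object":
      if (Array.isArray(value)) return value.every((item) => isJsonValue(item));
      return Object.values(value as object).every((item) => isJsonValue(item));
    default:
      // undefined, functions, symbols, and bigint are not JSON.
      return false;
  }
}
```

A record-only schema corresponds to the last branch of `JsonValue` alone, which is why string, array, and number payloads fall through.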

6. The clean build path fails

npm test passes. npm run build does not:

package.json expects npm start to run node dist/index.js, so the documented build-and-start flow is broken on a clean checkout.

What Each Model Said About Itself

Both models produced end-of-run summaries claiming their implementations were complete and all tests passed. Both were technically true. Neither flagged the issues above.

Claude Opus 4.7's summary was mostly accurate. It described its recovery path, atomic claim pattern, and event persistence correctly. The one thing it missed was the multi-expired lease interaction.

Kimi K2.6's summary claimed deterministic global scheduling and live SSE streaming. Both of those claims are in the README too. The code does not deliver either.

"My tests pass" is not the same thing as "my implementation is correct." Both models understood the spec well enough to build most of it. Neither model wrote tests that would have caught its own worst behavior.

Scoring

We scored each model on the spec, weighted by how much each category mattered for a correctness-first workflow engine.

Claude Opus 4.7 lost points on the reproduced recovery bug, the bounded claim scan, and the SSE cursor fallback.

Kimi K2.6 lost points on the six confirmed issues above. The biggest hits are in recovery, scheduling, and streaming, which is exactly where the spec's hardest requirements live.

Cost vs Quality

Kimi K2.6 is about 4x cheaper per point. The missing 23 points are in step-leasing, scheduling, and event streaming, which is where the hardest spec requirements live. Those are the parts that separate "the endpoints exist" from "the system behaves correctly under load."
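
The ratios can be rechecked from the scores and the per-run costs quoted elsewhere in this write-up ($3.56 for Claude Opus 4.7 and $0.67 for Kimi K2.6). The snippet below only redoes that arithmetic; the variable names are illustrative.

```typescript
// Scores and run costs as quoted in this write-up.
const runs = {
  opus: { score: 91, costUsd: 3.56 },
  kimi: { score: 68, costUsd: 0.67 },
};

const costPerPoint = (r: { score: number; costUsd: number }): number =>
  r.costUsd / r.score;

// Share of Claude Opus 4.7's score that Kimi K2.6 reached: about 0.75.
const scoreRatio = runs.kimi.score / runs.opus.score;

// Kimi K2.6's run cost as a share of Claude Opus 4.7's: about 0.19.
const costRatio = runs.kimi.costUsd / runs.opus.costUsd;

// How many times more Claude Opus 4.7 cost per point scored: about 3.97.
const perPointRatio = costPerPoint(runs.opus) / costPerPoint(runs.kimi);

// Absolute gap in points: 23.
const pointGap = runs.opus.score - runs.kimi.score;
```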

Where Open-Weight Models Stand Right Now

This test sits inside a pattern we've been tracking for a while. MiniMax M2.7 matched Claude Opus 4.6's detection rate on our last three-part benchmark. GLM-5.1 scored five points behind Claude Opus 4.6 on our job queue spec. Kimi K2.6 landed 23 points behind Claude Opus 4.7 here on a harder spec, but still produced the right shape of the system on the first pass.

The gap on surface coverage has narrowed meaningfully over the last year. The gap on correctness inside hard code paths (lease recovery, cross-run scheduling, streaming semantics) is still there. For work where the bugs only show up under contention or mid-crash, frontier proprietary models are the safer choice today. For work where you need the scaffold, the tables, the endpoint surface, and a starting test suite, open-weight models like Kimi K2.6 are close enough that the price delta matters.

Kimi K2.6's current pricing ($0.95 / $4 per million tokens) is a starting point, not a floor. Moonshot AI releases open weights, which means Kimi K2.6 will end up hosted on multiple providers, with pricing and latency converging on whoever runs it most efficiently. That is already playing out with MiniMax M2.5, which became the #1 most-used model across every mode in Kilo Code in the months after release. Price competition tends to pull these numbers down further as more hosts come online.

Being open-weight also means you can self-host or fine-tune Kimi K2.6 if you have data residency requirements, custom workflows, or a cost profile that makes API-only models impractical at scale. That is not a capability Claude Opus 4.7 offers at any price.

None of that changes the correctness findings above. It does reframe them. At $0.67 with a careful review pass, Kimi K2.6 is a real option now. At $3.56 with fewer corrections needed, Claude Opus 4.7 is the safer call. Which trade-off wins depends on the work. A year ago, that choice did not really exist at this level of complexity.

Takeaways

For building the scaffold of a complex backend: Kimi K2.6 did well. It produced the right project shape, the right tables, the right endpoint surface, and a test suite that passed. For prototyping, exploring a design, or generating a starting point you plan to review carefully, the $0.67 run is a good deal.

For systems where state-machine correctness matters: Claude Opus 4.7 pulled clearly ahead. The two implementations look similar in shape but diverge in the code paths that are hard to test casually (lease expiry, cross-run ordering, SSE, expired-lease rejection). If the project needs to behave correctly when leases expire, when multiple runs compete for workers, or when events need to flow live to clients, Claude Opus 4.7's output is closer to something you could ship.

On trusting model self-reports: Both models said they were done. One was mostly right. The other had six spec-level issues in shipped code. "Tests pass" is a necessary signal. It is not a sufficient one for work this correctness-sensitive. A review pass plus a few targeted reproductions closed the gap between what the models said and what they actually built.

A Note on Kimi K2.6 Speed

Kimi K2.6 was released the day of this test. Provider availability is limited right now, so the current wall-clock timings understate the model's real speed. We saw similar adoption curves on previous open-weight releases from MiniMax and Z.ai as more providers came online. We expect Kimi K2.6's elapsed time (and its effective cost) to keep dropping as that happens.

Testing performed using Kilo Code, a free open-source AI coding assistant for VS Code and JetBrains with 2,300,000+ Kilo Coders.

Enterprise AI Has a Trust Problem. We’re Hearing It Firsthand.

Darko from Kilo — Thu, 23 Apr 2026 12:41:29 +0000

The last few weeks have been chaotic for anyone paying attention to the AI tooling market. Cursor is set to sell to SpaceX. Anthropic pulled the rug on subscription pricing for businesses. And in the middle of all that noise, our conversations with enterprise teams have been converging on the same frustrations.

The specifics differ by industry. The underlying problem is consistent: walled gardens and pricing uncertainty.

Their Ceiling Is Your Ceiling

Take infrastructure trust. A top-three auto manufacturer came to us because their developers were hitting Cursor rate limits and couldn't build while they waited for them to reset. That same company had a second concern, quieter but more significant: they suspected the frontier lab powering their primary tool had oversold capacity and was running into compute headroom issues.

Whether that was actually true didn't matter. The perception had already taken root. If your workflow depends on one lab's availability, their ceiling is your ceiling.

Then there's cost visibility. A Director of DevEx at one of the world's largest banks came to us because his developers had existing model agreements with frontier labs, negotiated at the enterprise level, and he wanted them to actually use those models instead of routing everything through a middleman — which isn't possible on vendor-locked tools. On top of that, the other tools he'd evaluated gave him no visibility into token-level costs. When you can't see what you're paying for, you're trusting a vendor's math on your own spend.

A platform engineer at one of the UK's largest retailers had a similar frustration: his colleague was evaluating a tool with an opaque credit system and finding that developers burned through credits fast when they asked what he called "some juicy questions of the codebase." They wanted powerful models, but they also wanted to know what those models were costing.

Routing and Compliance Shouldn't Be Optional

For others, the issue is routing and compliance. A healthcare software CEO was simultaneously in contract negotiations with two different vendors when he reached out. He wanted to know if there was a more open alternative before he signed with either, and was already writing his own model routing layer internally (a CEO, doing infrastructure work) because "the world changes too much to bet on any one solution."

A separate healthcare data company came to us for a specific technical reason: they work with PHI and can't route that data through outside vendor infrastructure, but they still need frontier models for tasks that don't touch patient data. They needed one tool that could route differently based on what was actually in the request. That's not an unusual ask. It's compliance.

And then there's the on-prem and sovereignty tier. A defense contractor with CUI requirements told us that on-prem model routing wasn't optional, it was a contractual necessity. A cloud CTO asked for mixed inference on day one, with some calls going to self-hosted models, others to their existing AWS Bedrock commitments, and the rest through our gateway, because running models is literally his business and single-vendor inference lock-in was a risk he'd already mapped out. The platform engineer at the UK retailer liked the tool he'd been using personally for 18 months, but said plainly, "obviously I can't bring that to my work environment." He needed enterprise data controls with his company's own Bedrock models underneath.

The AI champion at a major fast food chain put it most directly: closed vendors are building something that looks a lot like OpenClaw but locked inside their own walled garden, and that's precisely why model-agnostic infrastructure matters to her. The capability isn't the moat. Who controls access to the models is.

The Data Backs This Up

We see this play out in our usage data too, and the numbers are striking. On an average day this month, Kilo users are actively running 348 different models. Yesterday, the top 10 by usage came from six different labs: MiniMax, StepFun, xAI, ByteDance, Anthropic, and NVIDIA. MiniMax was #1 by request volume. The three most popular models combined covered only half of all usage, and a full third of Kilo traffic goes to labs that most people wouldn't have recognized 18 months ago.

Nearly half of Kilo users run models from more than one lab in a given month, and that share grew from 29% to 46% over the last six weeks. Among organizational customers specifically, 42% used models from two or more labs in a single week, generating 1.1 million requests routed to 19 different labs. The number of labs with 1,000+ weekly active users on Kilo grew from 8 in January to 12 in April.

People also aren't just switching between projects. Yesterday, 15% of users routed to two or more models within a single hour. Power users average five labs a month. The average Kilo employee, who has every model available and no spend cap, draws from 5.7 labs per month. Even internally, with unlimited access, nobody settles on one lab. Multi-model isn't a power-user quirk anymore. It's becoming the default way developers work.

Cursor & SpaceX: The Cost of Structural Dependency

The Cursor/SpaceX deal is worth understanding through this lens. Cursor built a genuinely good product and still ended up in a position where the models at the core of their product were controlled by companies now competing directly against them. The $60 billion acquisition option and access to a million H100s are the price of buying out of that structural dependency: training their own models so they're not reliant on infrastructure providers who also ship competing tools. That's not a Cursor problem. That's just what it costs to not be dependent on your competitors.

The auto manufacturer waiting on rate limits, the bank that can't see its token costs, the healthcare company that can't route PHI externally, the defense contractor with on-prem requirements, the retailer who loved a tool he couldn't bring to work. These are all expressions of the same structural problem. When you don't own the model layer, the decisions of whoever does become your constraints.

And as frontier labs move further into tooling, the likelihood of those constraints tightening only goes up. One enterprise customer said it plainly: "I do not like vendor lock-in. All the features that these big companies are making to try and lure you in and get vendor lock-in on their flagship models is not something I'm interested in."

He's not alone. The market is moving toward infrastructure that stays out of the way, routing intelligently to whatever model fits the task, showing you exactly what it costs, and not requiring you to trust a vendor's judgment about which models you should have access to. The walled garden is a bet that lock-in wins. Increasingly, the developers and enterprise teams we talk to are betting the other way.

Kilo is the all-in-one agentic engineering platform, open-source and model-agnostic. Install the VS Code extension or get started at app.kilo.ai.