<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tessl</title>
    <description>The latest articles on DEV Community by Tessl (@tessl-io).</description>
    <link>https://dev.to/tessl-io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3865880%2Fae4ef80f-404f-4ed5-849f-f94683a6e7b0.png</url>
      <title>DEV Community: Tessl</title>
      <link>https://dev.to/tessl-io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tessl-io"/>
    <language>en</language>
    <item>
      <title>Warp goes open source, betting agents and community can outpace closed rivals</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Sun, 05 Jul 2026 07:53:34 +0000</pubDate>
      <link>https://dev.to/tessl-io/warp-goes-open-source-betting-agents-and-community-can-outpace-closed-rivals-1hal</link>
      <guid>https://dev.to/tessl-io/warp-goes-open-source-betting-agents-and-community-can-outpace-closed-rivals-1hal</guid>
      <description>&lt;p&gt;&lt;a href="https://www.warp.dev/" rel="noopener noreferrer"&gt;Warp&lt;/a&gt;, the developer tooling startup behind the modern terminal of the same name, is open-sourcing its client — and tying that move to a broader shift in how it believes software will be built.&lt;/p&gt;

&lt;p&gt;With AI agents now capable of handling much of the implementation work, Warp argues the real constraint lies in defining what to build, coordinating tasks, and verifying outputs. Opening up the codebase, it says, allows a wider pool of contributors to take on that role, effectively supervising a growing fleet of agents.&lt;/p&gt;

&lt;p&gt;“The biggest bottleneck to development is no longer writing code – it’s all the human-in-the-loop activities around the code: speccing the product and verifying behavior, and frankly, we are limited in what our internal team can do and the pace we want to move at,” Warp founder and CEO &lt;a href="https://www.linkedin.com/in/zachlloyd/" rel="noopener noreferrer"&gt;Zach Lloyd&lt;/a&gt; wrote in a &lt;a href="https://www.warp.dev/blog/warp-is-now-open-source" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;OpenAI is a founding sponsor of the new &lt;a href="https://github.com/warpdotdev/warp" rel="noopener noreferrer"&gt;Warp repo on GitHub&lt;/a&gt;, with Warp’s agent workflows powered by GPT models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gboxug5blrj9v9lyt7i.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gboxug5blrj9v9lyt7i.gif" alt="Warp in action" width="720" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Warp in action&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A different kind of open source
&lt;/h2&gt;

&lt;p&gt;Warp isn’t just publishing its code and inviting pull requests. It’s proposing a more structured setup, where agents handle coding, planning, and testing, while human contributors focus on direction and validation.&lt;/p&gt;

&lt;p&gt;This means ideas flow in through public GitHub issues, are picked up by agents &lt;a href="https://tessl.io/blog/as-coding-agents-become-collaborative-co-workers-orchestration-takes-center-stage/" rel="noopener noreferrer"&gt;orchestrated&lt;/a&gt; via Warp’s internal platform, &lt;a href="https://www.warp.dev/oz" rel="noopener noreferrer"&gt;Oz&lt;/a&gt;, and are then reviewed by both the community and the core team.&lt;/p&gt;

&lt;p&gt;The aim is to increase throughput without expanding headcount.&lt;/p&gt;

&lt;p&gt;“Open-sourcing with an agent-powered repo is our vision of how software will be built in the future,” Lloyd said. “Humans managing agents at scale to build production-grade software is the model, and implementing this model in the open will allow software to improve most quickly.”&lt;/p&gt;

&lt;p&gt;The company says it already has confidence in code generated this way, pointing to internal use of agents for implementation-heavy tasks. Opening that process up, it argues, should accelerate development further — and surface ideas it might not arrive at on its own.&lt;/p&gt;

&lt;p&gt;“We’ve found that agents can handle the implementation heavy lifting really well,” Lloyd continued. “That frees contributors to focus on the higher-leverage work: shaping what gets built and making sure it’s right.”&lt;/p&gt;

&lt;p&gt;This approach follows a recent trend in which software development shifts from writing code to managing the systems that produce it. That shift is what &lt;a href="https://docs.tessl.io/" rel="noopener noreferrer"&gt;Tessl is explicitly building around&lt;/a&gt;: an agent enablement platform for managing the &lt;a href="https://docs.tessl.io/introduction-to-tessl/context-lifecycle" rel="noopener noreferrer"&gt;context&lt;/a&gt; that coding agents rely on, treating agent &lt;a href="https://docs.tessl.io/create/creating-skills" rel="noopener noreferrer"&gt;skills&lt;/a&gt; and context as software that needs to be built, evaluated, and continuously updated as systems evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Warp uses openness as a lever
&lt;/h2&gt;

&lt;p&gt;Warp is explicit about the competitive backdrop behind its decision. It points to “highly funded, closed-source competitors” and acknowledges it can’t match them on pricing or resourcing.&lt;/p&gt;

&lt;p&gt;Instead, it’s positioning openness as a lever — not just to attract contributors, but to move faster by distributing parts of the development process. The addition of wider support for open models, including systems such as &lt;a href="https://tessl.io/blog/kimi-k26-agent-skills-evaluation/" rel="noopener noreferrer"&gt;Kimi&lt;/a&gt;, &lt;a href="https://tessl.io/blog/minimax-m2-1-marries-scale-efficiency-and-multi-language-software-development/" rel="noopener noreferrer"&gt;MiniMax&lt;/a&gt;, and &lt;a href="https://qwen.ai/" rel="noopener noreferrer"&gt;Qwen&lt;/a&gt;, along with a new “auto (open)” routing mode that selects the most suitable model for a given task, reinforces that stance.&lt;/p&gt;

&lt;p&gt;Alongside the open-source release, Warp is also expanding how much users can customize the environment, letting them choose between a plain terminal and a more fully featured setup with built-in agents, diff views, and file navigation tools.&lt;/p&gt;

&lt;p&gt;Ultimately, it’s a bid to differentiate on flexibility and pace in a fiercely competitive market.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents change the equation
&lt;/h2&gt;

&lt;p&gt;Underlying all of this is a view about what AI agents actually change.&lt;/p&gt;

&lt;p&gt;Warp argues the biggest gains are no longer in code generation itself, but in offloading the surrounding work — planning, coordination, verification — that has traditionally slowed development cycles. If agents can handle the bulk of execution, then expanding the pool of people who can guide and review that work becomes the next step.&lt;/p&gt;

&lt;p&gt;That’s where open source comes in. Rather than scaling an internal team, Warp is betting that a community working alongside agents can iterate faster and push the product in directions a smaller group might miss.&lt;/p&gt;

&lt;p&gt;It’s a notable contrast to other recent moves in the market, where &lt;a href="https://thenewstack.io/cal-com-codebase-security-ai/" rel="noopener noreferrer"&gt;some companies have pulled back from open development&lt;/a&gt; over security concerns tied to AI. Warp is taking the opposite view: that agents make openness more valuable, not less.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>Tessl Academy is live (in preview) — and there are two ways in</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Sat, 04 Jul 2026 07:13:53 +0000</pubDate>
      <link>https://dev.to/tessl/tessl-academy-is-live-in-preview-and-there-are-two-ways-in-2a1h</link>
      <guid>https://dev.to/tessl/tessl-academy-is-live-in-preview-and-there-are-two-ways-in-2a1h</guid>
      <description>

&lt;p&gt;We just shipped the first version of &lt;a href="https://tessl.co/kuh" rel="noopener noreferrer"&gt;Tessl Academy&lt;/a&gt;, a hands-on curriculum for building, evaluating, and running skills for coding agents. It's early. Two courses are up — &lt;strong&gt;Skill Foundations&lt;/strong&gt; and &lt;strong&gt;Tuning Your Agent&lt;/strong&gt; — with more on the way. We'd rather get it in front of you now and shape it with your feedback than polish it in private for another month.&lt;/p&gt;

&lt;p&gt;Here's the idea. Most of us are already using coding agents, but the results swing between magic and mess. The Academy is about closing that gap: moving from one-off AI coding experiments to workflows you can repeat and trust. Skills are the thread running through every lesson — small, reusable instructions your agent loads on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two ways to take it
&lt;/h3&gt;

&lt;p&gt;We built the Academy so you can learn whichever way suits you right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Read it.&lt;/strong&gt; Every lesson works as a plain read on the site. No install, no setup — open a lesson and go. Good for a commute, a coffee, or deciding whether the hands-on version is worth your time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Run it.&lt;/strong&gt; Install a course once, then ask your agent — Claude Code, Cursor, Codex, or Tessl Agent — to walk you through a lesson. It guides you one step at a time, waits while you work, and hands off to the next lesson when you're done. You learn skills by building one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same content, two speeds. Start by reading and switch to hands-on whenever you like — the &lt;a href="https://tessl.co/kuh" rel="noopener noreferrer"&gt;Quickstart&lt;/a&gt; gets you running in about four steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  It's a preview, and your feedback shapes it
&lt;/h3&gt;

&lt;p&gt;This is genuinely a first cut. Some lessons will land, some won't, and the roadmap past these two courses is still open. That's where you come in: tell us what's confusing, what's missing, and what you'd want to learn next.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Join the conversation in our &lt;a href="https://discord.com/invite/jbb2vHnHZQ" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Or email me directly: &lt;a href="mailto:alan@tessl.io"&gt;&lt;strong&gt;alan@tessl.io&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll be reading everything. Expect the Academy to move quickly over the coming weeks, and the fastest way to influence where it goes is to try it and tell me what you think.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tessl.co/kuh" rel="noopener noreferrer"&gt;&lt;strong&gt;Start with the Quickstart →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>The model's solved, now comes the hard part: Reviewability as the bottleneck</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Fri, 03 Jul 2026 07:33:36 +0000</pubDate>
      <link>https://dev.to/tessl-io/the-models-solved-now-comes-the-hard-part-reviewability-as-the-bottleneck-gp3</link>
      <guid>https://dev.to/tessl-io/the-models-solved-now-comes-the-hard-part-reviewability-as-the-bottleneck-gp3</guid>
      <description>&lt;p&gt;It's something you'll likely be hearing more and more: the model is no longer the big sticking point in AI engineering. The question keeping teams up at night is how to build reliable, governable systems around it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/" rel="noopener noreferrer"&gt;Kilo&lt;/a&gt;, the &lt;a href="https://tessl.io/blog/inside-kilo-code-an-open-source-ai-coding-agent-with-plans-to-reshape-software-development/" rel="noopener noreferrer"&gt;open source coding agent built on VS Code&lt;/a&gt;, recently crossed three million downloads and processed more than 40 trillion tokens. And the lessons that volume of real-world usage produced had little to do with model intelligence, and everything to do with reviewability, context, and operational control.&lt;/p&gt;

&lt;h2&gt;
  
  
  ‘Task size should be bounded by what a human can review in a single sitting’
&lt;/h2&gt;

&lt;p&gt;Forty trillion tokens sounds like some sort of success metric, but Kilo's own assessment is a little more measured. At that volume, small problems in the surrounding system become expensive fast. A missing context file becomes repeated tool calls; a poorly scoped task produces a diff too large for any engineer to sensibly review; and a vague permission setting becomes a blocker the moment a second team tries to adopt the tool.&lt;/p&gt;

&lt;p&gt;The conclusion Kilo drew from its own usage data was pointed: task size should be bounded by what a human can review in a single sitting. If the output can't be reviewed, it can't be trusted, and if it can't be trusted, it won't be merged.&lt;/p&gt;

&lt;p&gt;To illustrate the point, Kilo describes splitting a single feature into three parallel workstreams (a billing API endpoint, a test suite, and a documentation update), each handled by a separate agent with a narrow, explicit instruction. One diff touches the endpoint, another touches the tests, and a third touches the docs. If the tests fail, the failure is scoped. If the docs agent guesses, the mistake is visible.&lt;/p&gt;

&lt;p&gt;“The job changes from ‘write every line’ to ‘design the loop,’” &lt;a href="https://www.linkedin.com/in/olearycrew/" rel="noopener noreferrer"&gt;Brendan O'Leary&lt;/a&gt;, developer relations engineer at Kilo, &lt;a href="https://blog.kilo.ai/p/what-we-learned-from-3-million-downloads" rel="noopener noreferrer"&gt;writes in a blog post&lt;/a&gt;. “You decide the task boundary, the model, the permissions, the environment, and the verification step. The agent writes code. You decide whether that code should exist.”&lt;/p&gt;

&lt;p&gt;Kilo's findings fit into a broader pattern emerging elsewhere in the industry. Sourcegraph, the code intelligence platform, recently analysed 1,281 agent runs across more than 40 enterprise-scale open source repositories and &lt;a href="https://tessl.io/blog/coding-agent-failure-patterns-large-codebases/" rel="noopener noreferrer"&gt;found that the gap between success and failure&lt;/a&gt; had almost nothing to do with the underlying model.&lt;/p&gt;

&lt;p&gt;"The difference between complete failure and near-perfect completion wasn't intelligence — it was efficient access to context," Stephanie Jarmak, agent advocate at Sourcegraph, said.&lt;/p&gt;

&lt;p&gt;One benchmark task saw an agent make 96 tool calls over 84 minutes without proper retrieval tooling. The same task, with the right infrastructure in place, took five calls and under five minutes.&lt;/p&gt;

&lt;p&gt;The lesson from both Kilo and Sourcegraph is that the systems surrounding the model increasingly determine the outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  The infrastructure around the model is the engineering challenge
&lt;/h2&gt;

&lt;p&gt;Kilo's experience also surfaced a more granular picture of what production-grade agentic engineering actually requires. The full loop — plan, scope, run, verify, review, merge — needs dedicated infrastructure at every step. Planning needs modes and file-backed handoffs. Scoping needs explicit permissions and task boundaries. Running needs model choice, tool calls, and environment isolation. Verification needs tests, CI integration, and sometimes a second agent with fresh context. Review needs a diff a human can understand. When any one part is missing, the agent may still produce code, but the team just won't trust it enough to merge.&lt;/p&gt;

&lt;p&gt;But reviewability is only part of that picture. OpenAI's most &lt;a href="https://tessl.io/blog/how-enterprises-are-scaling-ai-5-patterns-from-openai/" rel="noopener noreferrer"&gt;recent enterprise guidance&lt;/a&gt;, drawing on deployments at companies including BBVA, Philips, and JetBrains, shows that organisations seeing the most traction are those focused on evaluation systems, context management, orchestration, and governance — not on which model sits underneath.&lt;/p&gt;

&lt;p&gt;"The organisations that win with AI won't be the ones that tried it first — they'll be the ones that operationalised it best," said Sanj Bhayro, OpenAI's managing director for EMEA.&lt;/p&gt;

&lt;p&gt;The emerging picture is of a new engineering layer forming around AI systems: evaluation tooling that runs against real codebases, shared context registries, permission controls, usage analytics, and observability infrastructure. Kilo's own roadmap reflects this directly — its next priorities centre on portable sessions that survive moving between VS Code, the terminal, Slack, and cloud environments, and on ensuring every agent workflow ends in an artifact a human can judge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance before autonomy: teams won't adopt what they can't explain
&lt;/h2&gt;

&lt;p&gt;One of the less obvious lessons from Kilo's experience is the difference between individual and team adoption. Individual developers adopt tools when they save time. Teams adopt them when they can explain the risk to finance, to security, and to whoever owns the production environment.&lt;/p&gt;

&lt;p&gt;“That means agentic engineering needs controls that feel boring until you need them,” O’Leary notes.&lt;/p&gt;

&lt;p&gt;Kilo learned that the hard way. Its early free credits attracted tens of thousands of throwaway accounts, generating billing pressure, infrastructure strain, and weeks of engineering time spent in merge conflicts rather than shipping product. The experience sharpened Kilo's thinking on what enterprise-grade agentic tooling actually needs: model allowlists, usage visibility before a billing surprise arrives, permission prompts that can block tool calls, isolated cloud environments for sensitive work, and source visibility for security review.&lt;/p&gt;

&lt;p&gt;Those requirements map directly onto the questions Kilo found developers asking about any open source AI tool: can I inspect what runs against my code? Can I bring my own model key? Can I control which models my team is allowed to use? Can I see usage before a bill arrives? Can I keep sensitive work local? And crucially — can I leave if the product stops fitting how my team works?&lt;/p&gt;

&lt;h2&gt;
  
  
  The infrastructure layer is still being built
&lt;/h2&gt;

&lt;p&gt;Sourcegraph's retrieval findings, OpenAI's governance lessons, and Kilo's focus on reviewability all point toward the same challenge: reliable AI systems depend on reliable infrastructure around the model.&lt;/p&gt;

&lt;p&gt;Kilo's own roadmap frames the next phase in three parts: portable, meaning sessions that survive moving between VS Code, the terminal, Slack, and cloud environments; governed, meaning teams can set model policies, inspect usage, and control permissions; and review-first, meaning every agent workflow ends in an artifact a human can judge — a diff, a test result, a PR comment, a deployment preview.&lt;/p&gt;

&lt;p&gt;Forty trillion tokens and three million downloads later, Kilo's conclusion is that generating code is only part of the problem. Teams still need ways to review it, verify it, govern it, and trust it. The model may be good enough, but the systems around it are still being built.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>How context travels in a multi-agent world</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Thu, 02 Jul 2026 06:29:33 +0000</pubDate>
      <link>https://dev.to/tessl-io/how-context-travels-in-a-multi-agent-world-1j87</link>
      <guid>https://dev.to/tessl-io/how-context-travels-in-a-multi-agent-world-1j87</guid>
      <description>&lt;p&gt;Engineering teams building with AI agents have largely solved the single-agent problem. The harder challenge arrives when capabilities get split across multiple independently deployed agents — each owned by a different team, each running on its own release cadence. Keeping a coherent conversation alive across those boundaries turns out to be one of the messier architectural questions in production agent systems today, and one that Tessl's own work on &lt;a href="https://tessl.io/blog/the-hidden-cost-of-agentic-software-development-why-context-engineering-matters/" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt; and &lt;a href="https://tessl.io/blog/how-i-scan-my-agent-context-across-github-with-skill-inventory/" rel="noopener noreferrer"&gt;skill sprawl&lt;/a&gt; has been circling from a different angle.&lt;/p&gt;

&lt;p&gt;Microsoft's Industry Solutions Engineering (&lt;a href="https://microsoft.github.io/code-with-engineering-playbook/ISE/" rel="noopener noreferrer"&gt;ISE&lt;/a&gt;) team, which embeds with clients on complex technical engagements, has &lt;a href="https://devblogs.microsoft.com/ise/a2a-context-passing-multi-agent-systems/" rel="noopener noreferrer"&gt;published&lt;/a&gt; a detailed account of how they tackled that context problem in a recent engagement.&lt;/p&gt;

&lt;p&gt;The team was working with Agent2Agent (&lt;a href="https://a2a-protocol.org/latest/" rel="noopener noreferrer"&gt;A2A&lt;/a&gt;), an open agent communication protocol originally &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" rel="noopener noreferrer"&gt;developed by Google&lt;/a&gt; and now maintained by a cross-vendor technical steering committee &lt;a href="https://www.linuxfoundation.org/press/linux-foundation-launches-the-agent2agent-protocol-project-to-enable-secure-intelligent-communication-between-ai-agents" rel="noopener noreferrer"&gt;at the Linux Foundation&lt;/a&gt;. They needed coordinator agents to hand off conversational history to domain agents that shared no infrastructure and had no persistent memory. Where the Model Context Protocol (&lt;a href="https://tessl.io/blog/what-s-new-in-mcp/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;) standardises how agents connect to tools and data, &lt;a href="https://tessl.io/blog/mcp-vs-a2a/" rel="noopener noreferrer"&gt;A2A operates at a different level&lt;/a&gt;: it defines how agents communicate with each other as peers, passing tasks and messages across service boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared storage creates dependencies agents shouldn't have
&lt;/h2&gt;

&lt;p&gt;Microsoft says it evaluated three core approaches before settling on the one that worked best.&lt;/p&gt;

&lt;p&gt;The first option had domain agents read from a shared storage layer, using a common identifier to retrieve conversation history. Its appeal is a minimal message size and a single source of truth, but it requires every domain agent to have credentials and connectivity to storage owned by another team, a dependency that becomes unwieldy fast when agents cross organisational lines. The second option made each domain agent stateful, maintaining its own record of the conversation. However, the operational overhead of running per-agent storage, handling state migration during deployments, and avoiding divergence between agents ruled this out early.&lt;/p&gt;

&lt;p&gt;The third approach, and the one Microsoft ultimately adopted, sends summarised conversation history directly inside each message payload. Domain agents receive everything they need to respond coherently without touching any external store. After every ten conversational turns, the history gets summarised to keep payload sizes manageable — a deliberate trade-off between fidelity and performance that the team acknowledges.&lt;/p&gt;

&lt;p&gt;"Summarisation is not without risk," they write. “Every summarisation step can lose important details or introduce inaccuracies, potentially degrading the quality of downstream responses.”&lt;/p&gt;

&lt;p&gt;The counter, though, is that passing full conversation history unchecked creates its own problems — models fed too much context can lose coherence, prioritising volume over relevance. The 10-turn threshold is the team's answer to that, and one they say can be tuned against observed output quality. They also note that summarisation creates a natural security boundary, giving the coordinator fine-grained control over what each agent sees — something that storage-level access controls, applied uniformly, cannot easily replicate.&lt;/p&gt;

&lt;p&gt;“The coordinator controls exactly what context each domain agent receives,” the company writes. “Sensitive information from one part of the conversation can be excluded from the context sent to a specific agent.”&lt;/p&gt;
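
&lt;p&gt;The payload-passing pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Microsoft's implementation and not the A2A message format: the &lt;code&gt;Coordinator&lt;/code&gt; class, the &lt;code&gt;summarise&lt;/code&gt; callable, the &lt;code&gt;exclude&lt;/code&gt; filter, and the payload field names are all hypothetical. Only the ten-turn interval comes from the article.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable

SUMMARY_INTERVAL = 10  # turns between summarisation passes (from the article; tunable)


@dataclass
class Coordinator:
    # Hypothetical summariser: any callable that condenses text.
    # In practice this would likely be a model call.
    summarise: Callable[[str], str]
    summary: str = ""
    recent_turns: list = field(default_factory=list)

    def record_turn(self, speaker: str, text: str) -> None:
        """Buffer a turn; fold the buffer into the running summary every N turns."""
        self.recent_turns.append(f"{speaker}: {text}")
        if len(self.recent_turns) == SUMMARY_INTERVAL:
            combined = "\n".join([self.summary, *self.recent_turns]).strip()
            self.summary = self.summarise(combined)
            self.recent_turns = []

    def build_payload(
        self,
        task: str,
        exclude: Callable[[str], bool] = lambda turn: False,
    ) -> dict:
        """Build a self-contained message for a stateless domain agent.

        Turns flagged by `exclude` are withheld from this particular agent.
        The dictionary shape is illustrative only.
        """
        visible = [turn for turn in self.recent_turns if not exclude(turn)]
        return {
            "task": task,
            "context": {"summary": self.summary, "recent_turns": visible},
        }
```

&lt;p&gt;Because the domain agent receives the summary and recent turns inside the message, it needs no access to the coordinator's storage. Note that this sketch filters only turns that have not yet been summarised. A real coordinator would presumably also need to keep sensitive details out of the summary itself.&lt;/p&gt;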

&lt;h2&gt;
  
  
  Context architecture is a governance decision with engineering consequences
&lt;/h2&gt;

&lt;p&gt;Microsoft makes the case that keeping domain agents stateless pays dividends well beyond the immediate context problem. Agents that carry no state of their own are easier to reason about — their behaviour is determined entirely by what arrives in the message, making failures straightforward to diagnose.&lt;/p&gt;

&lt;p&gt;"Debugging reduces to inspecting the input message rather than reconstructing an agent's internal state history," Microsoft writes. "This also simplifies deployments — any agent instance can be replaced or rolled back without state migration."&lt;/p&gt;

&lt;p&gt;Those properties matter most when agents operate across team or organisational boundaries, where auditability and predictability carry real weight.&lt;/p&gt;

&lt;p&gt;For engineering teams in that position, the choice of context pattern has consequences well beyond message size. It determines who owns the conversational record, who can audit what any given agent was told, and how safely the system can grow when new agents are added or team boundaries change.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>Your agents keep making the same mistakes. Nobody has time to fix it.</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Wed, 01 Jul 2026 07:18:31 +0000</pubDate>
      <link>https://dev.to/tessl/your-agents-keep-making-the-same-mistakes-nobody-has-time-to-fix-it-5b0p</link>
      <guid>https://dev.to/tessl/your-agents-keep-making-the-same-mistakes-nobody-has-time-to-fix-it-5b0p</guid>
      <description>&lt;p&gt;Your agents keep making the same mistakes. Nobody has time to fix it.&lt;/p&gt;

&lt;p&gt;AI coding agents are getting better at the tasks you give them direct feedback on. Everything else stays broken.&lt;/p&gt;

&lt;p&gt;You leave the same comment in code review three sprints in a row. There's a recurring task that could run as an automation but it's on the backlog because no one has time to stop and systematize it. The context your agents need to do better work — updated conventions, patterns from past PRs, recurring fixes — exists in your commit history and session logs. Nobody has time to extract it and package it up.&lt;/p&gt;

&lt;p&gt;Agent enablement is real work. It just never gets done.&lt;/p&gt;

&lt;h2&gt;
  
  
  What teams usually do
&lt;/h2&gt;

&lt;p&gt;Most teams handle this one of three ways: rely on PR review to catch the same errors week after week, schedule occasional cleanup sprints to update skills and conventions (that never actually get scheduled), or accept that their agents plateau.&lt;/p&gt;

&lt;p&gt;All three require engineers to stop building to maintain the thing that's supposed to help them build faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Tessl Agent — open beta
&lt;/h2&gt;

&lt;p&gt;Today we're launching Tessl Agent.&lt;/p&gt;

&lt;p&gt;Point it at a repo. It scans your PRs, coding agent session logs, and tickets continuously. When it spots a recurring error pattern, it creates a skill to address it and opens a PR. When it finds a task your team runs manually every week, it turns it into a GitHub Actions workflow. Then it asks if you want it to keep doing that automatically: daily, weekly, or on a schedule you set.&lt;/p&gt;

&lt;p&gt;Tessl Agent is built to get you to stop using it interactively. You work with it, and at the end of each session it says: &lt;em&gt;I could set some of these up as recurring actions. I could create a CI/CD check for this.&lt;/em&gt; The goal is that most of the recurring work — finding optimizations, catching agent mistakes, updating context — runs on a trigger and files issues without you having to ask.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/1QaAPfsEqYQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What it looks like in practice
&lt;/h2&gt;

&lt;p&gt;The use case we use most at Tessl: setting up an agentic code review harness.&lt;/p&gt;

&lt;p&gt;You type something like &lt;em&gt;set up agentic code review&lt;/em&gt; or &lt;em&gt;I want to spend less time reviewing code&lt;/em&gt;. Tessl Agent scans your PRs, your issue tracker, and your coding agent session logs. It surfaces what's there: your style guide, common agent failure patterns, comments your team leaves repeatedly in review. Then it walks you through building on that.&lt;/p&gt;

&lt;p&gt;First, it creates a code review skill that maps to your team's best practices. Unlike a one-click tool you forget about, this is a skill you own; you can update it, augment it, share it across workflows. From that point, every PR gets agentic review automatically. Then it sets up a recurring loop that optimises that review over time, so the quality of automated review improves as your codebase evolves.&lt;/p&gt;

&lt;p&gt;You spend time reviewing code and shipping features, knowing the routine work is handled.&lt;/p&gt;

&lt;h2&gt;
  
  
  It works alongside your coding agent, not instead of it
&lt;/h2&gt;

&lt;p&gt;Tessl Agent is not a replacement for Claude Code, Codex, or whatever you're using. It runs in the background. You don't context-switch to it mid-session.&lt;/p&gt;

&lt;p&gt;It's also provider-agnostic — it works with CodeRabbit, GitHub Actions, and your existing stack. It's not tied to any one coding agent, which matters when you want something that works across your whole development workflow, not just within one tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The compounding effect
&lt;/h2&gt;

&lt;p&gt;This is what loop engineering looks like in practice. Each automated improvement creates the conditions for the next one.&lt;/p&gt;

&lt;p&gt;A skill that encodes a common pattern means your agent makes that class of error less often. An automated workflow that runs weekly means recurring tasks get systematised instead of repeated. At some point you look up and 40 or 50% of your PRs don't have a human looking at them. You never had to run a big initiative to make that happen. You got started, kept building, and over time delegated more to the agent.&lt;/p&gt;

&lt;p&gt;That's the path toward a software factory. Not a big-bang platform migration, but incremental agent enablement that compounds week over week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Tessl Agent is in open beta and free to try. Download the Tessl CLI, run &lt;code&gt;tessl&lt;/code&gt;, and open a session. A good starting point: pull up the last month of your team's coding agent sessions and ask what's broken and what's taking up a lot of your time. The findings tend to be immediately useful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tessl.co/4td" rel="noopener noreferrer"&gt;Try Tessl Agent&lt;/a&gt; for free or &lt;a href="https://tessl.co/ayj" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>The hidden cost of agentic software development: why context engineering matters</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Tue, 30 Jun 2026 06:22:53 +0000</pubDate>
      <link>https://dev.to/tessl-io/the-hidden-cost-of-agentic-software-development-why-context-engineering-matters-2oo3</link>
      <guid>https://dev.to/tessl-io/the-hidden-cost-of-agentic-software-development-why-context-engineering-matters-2oo3</guid>
      <description>&lt;p&gt;AI token bills are becoming one of the fastest-growing line items in engineering budgets, and most teams have little visibility into where the money is actually going. &lt;a href="https://thenewstack.io/github-copilot-token-billing/" rel="noopener noreferrer"&gt;GitHub recently abandoned&lt;/a&gt; flat-rate pricing for its Copilot coding agent in favour of token-based billing — a move that sent some subscribers' projected costs up tenfold overnight. Anthropic, too, is increasingly moving toward consumption-based API token pricing — a direction that has developers bracing for a potential cost surge, with many VPs of engineering already exploring whether open-weight models can absorb more of their workload.&lt;/p&gt;

&lt;p&gt;However you slice and dice it, the message from the market is clear: token costs are now a governance problem. When consumption is opaque and billing is variable, engineering leaders lose the ability to forecast spend, set budgets, or hold teams accountable — the same control problems that plagued cloud costs a decade ago, before &lt;a href="https://learn.microsoft.com/en-us/cloud-computing/finops/overview" rel="noopener noreferrer"&gt;FinOps&lt;/a&gt; became a discipline in its own right.&lt;/p&gt;

&lt;p&gt;Tessl has already run this experiment. When it &lt;a href="https://tessl.io/blog/why-were-changing-our-default-eval-model/" rel="noopener noreferrer"&gt;switched its default eval solver&lt;/a&gt; from Claude Sonnet 4.6 to the open-weight GLM 5.1 — a lower-cost model it uses to measure whether agent skills are working — it found that skills-equipped agents agreed on the right outcome in 88.5% of tasks, at an overall eval cost roughly 28% lower.&lt;/p&gt;

&lt;p&gt;Recent &lt;a href="https://arxiv.org/abs/2601.14470" rel="noopener noreferrer"&gt;research&lt;/a&gt; from Concordia University puts some empirical weight behind the over-arching concern — and its findings may surprise engineering leaders who assume they know where their agent spending is concentrated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context engineering is the cost lever
&lt;/h2&gt;

&lt;p&gt;The paper, titled &lt;a href="https://arxiv.org/pdf/2601.14470" rel="noopener noreferrer"&gt;&lt;em&gt;Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering&lt;/em&gt;&lt;/a&gt;, analysed execution traces from 30 software development tasks run through &lt;a href="https://github.com/openbmb/ChatDev" rel="noopener noreferrer"&gt;ChatDev&lt;/a&gt;, an open-source multi-agent framework that simulates a software development team, complete with agents assigned roles such as programmer, tester, and code reviewer. The researchers, led by &lt;a href="https://www.linkedin.com/in/emad-shihab-8099523/" rel="noopener noreferrer"&gt;Emad Shihab&lt;/a&gt; at Concordia's Data-driven Analysis of Software lab, mapped token consumption across six development stages: design, coding, code completion, code review, testing, and documentation.&lt;/p&gt;

&lt;p&gt;The headline finding is that code review accounts for an average of 59.4% of all token consumption — by far the largest single cost centre. Initial code generation, by contrast, comes in at just 8.6%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk0uy6av18qjg2rti30jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk0uy6av18qjg2rti30jq.png" alt="ChatDev with GPT-5 reasoning (Credit: Concordia University)" width="800" height="682"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;ChatDev with GPT-5 reasoning (Credit: Concordia University)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason proffered in the report is structural: in a conversational multi-agent system, agents engaged in code review repeatedly pass the full codebase back and forth on every turn, accumulating what the researchers call a "communication tax." Across all tasks, input tokens — context being fed into the model — made up 53.9% of total consumption, compared to 24.4% for output tokens.&lt;/p&gt;

&lt;p&gt;In other words, the agents are spending more tokens communicating context to each other than they are generating new work.&lt;/p&gt;
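
&lt;p&gt;A minimal sketch of why that happens, written in Python with hypothetical numbers (the function name and token counts are illustrative assumptions, not figures from the paper): if every review turn re-sends the full codebase plus all earlier replies, input tokens pile up far faster than output tokens.&lt;/p&gt;

```python
def review_loop_tokens(codebase_tokens, turns, reply_tokens):
    """Return (input_tokens, output_tokens) for a review loop in which every
    turn re-sends the full codebase plus all earlier replies."""
    input_total = 0
    output_total = 0
    history = 0  # tokens of earlier replies carried forward as context
    for _ in range(turns):
        input_total += codebase_tokens + history  # context fed in this turn
        output_total += reply_tokens              # new text generated this turn
        history += reply_tokens
    return input_total, output_total


# Hypothetical sizes: a 4,000-token codebase, six turns, 500-token replies.
inp, out = review_loop_tokens(codebase_tokens=4000, turns=6, reply_tokens=500)
print(inp, out)
```

&lt;p&gt;With these assumed sizes, six review turns consume 31,500 input tokens to produce 3,000 output tokens, which is the shape of the "communication tax" the researchers describe.&lt;/p&gt;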

&lt;p&gt;The coding stage is the one notable outlier: it runs output-heavy, with 58% output tokens versus just 6.9% input, which makes intuitive sense — a single instruction can yield hundreds of lines of code. Every other stage, including testing and documentation, is dominated by input tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftfysl3epnqr8a3n55qrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftfysl3epnqr8a3n55qrq.png" alt="Phase-by-phase token ratio breakdown" width="800" height="302"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Phase-by-phase token ratio breakdown (Credit: Concordia University)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Know your cost map before the bill arrives
&lt;/h2&gt;

&lt;p&gt;For teams running agents in production, the research offers a way to think about cost prediction based on the nature of the work. A greenfield project with substantial initial coding will look very different from a refactoring or review-heavy effort, which will be dominated by the expensive, input-heavy code review cycle. The researchers suggest that inserting a human checkpoint before the iterative code review loop begins could prevent a significant amount of unnecessary token burn, pointing to where the real inefficiency lies.&lt;/p&gt;

&lt;p&gt;“This suggests that the primary cost of agentic software engineering lies not in initial code generation but in the iterative, conversational process of refinement and verification,” the report notes.&lt;/p&gt;
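
&lt;p&gt;To make that split concrete, here is a back-of-the-envelope sketch in Python. It applies the two stage averages quoted above to a hypothetical total of one million tokens. The function name and the total are assumptions for illustration, and the averages come from ChatDev runs, so real workloads may differ.&lt;/p&gt;

```python
# Per-mille shares of total tokens, taken from the averages quoted above.
CODE_REVIEW_PER_MILLE = 594  # code review: 59.4%
CODING_PER_MILLE = 86        # initial code generation: 8.6%


def split_tokens_by_stage(total_tokens):
    """Split a token total into code review, initial coding, and the rest."""
    review = total_tokens * CODE_REVIEW_PER_MILLE // 1000
    coding = total_tokens * CODING_PER_MILLE // 1000
    other = total_tokens - review - coding  # remaining four stages combined
    return {"code_review": review, "coding": coding, "other_stages": other}


# Hypothetical total of one million tokens for a single project.
print(split_tokens_by_stage(1_000_000))
```

&lt;p&gt;On those assumptions, roughly 594,000 of the million tokens go to code review and only 86,000 to initial coding, so trimming the review loop is where a budget owner would look first.&lt;/p&gt;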

&lt;p&gt;There are important caveats. The study used a single framework and a single model — GPT-5 — across 30 tasks. ChatDev is primarily a research framework rather than a production tool, so the specific percentages may not map directly onto commercial agents. The authors are candid about these limitations. However, the underlying dynamic — that verification and refinement loops, where agents repeatedly ingest large amounts of existing code, are structurally more expensive than generation — is likely to hold across conversational multi-agent architectures more broadly.&lt;/p&gt;

&lt;p&gt;The research also connects to a growing body of practitioner thinking on context engineering: keeping token costs down is less about the model and more about how carefully you manage what gets passed into it. A &lt;a href="https://tessl.io/registry/skills/github/muratcankoylan/Agent-Skills-for-Context-Engineering/context-fundamentals" rel="noopener noreferrer"&gt;community-contributed skill&lt;/a&gt; already in the &lt;a href="https://tessl.io/registry" rel="noopener noreferrer"&gt;Tessl registry&lt;/a&gt; cites this line of research directly, framing context engineering — loading only what's needed, compressing history, applying strict retrieval thresholds — as the practical discipline for keeping agent costs under control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tessl.io/blog/improving-your-skills-with-tessl-evals/" rel="noopener noreferrer"&gt;Tessl's evals layer&lt;/a&gt; adds another dimension: by running paired evaluations across models and measuring turn count, cost per task, and skill performance side by side, engineering teams can make data-driven decisions about which model delivers the best results for their specific workloads, rather than relying on headline accuracy scores that can mask significant cost differences underneath.&lt;/p&gt;

&lt;p&gt;As token-based billing becomes the norm, understanding where tokens actually go is a prerequisite for running agents responsibly.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>Why Warp is betting engineering leaders are done picking a favourite coding agent</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Mon, 29 Jun 2026 06:48:20 +0000</pubDate>
      <link>https://dev.to/tessl/why-warp-is-betting-engineering-leaders-are-done-picking-a-favourite-coding-agent-4c70</link>
      <guid>https://dev.to/tessl/why-warp-is-betting-engineering-leaders-are-done-picking-a-favourite-coding-agent-4c70</guid>
      <description>&lt;p&gt;Engineering leaders have spent the past year trying to get their teams to adopt AI coding tools as quickly as possible. Now, a new set of questions has taken over: how do you measure whether any of it is worth the money, and how do you stop agents from running unchecked on production systems?&lt;/p&gt;

&lt;p&gt;Developer tooling company &lt;a href="https://www.warp.dev/" rel="noopener noreferrer"&gt;Warp&lt;/a&gt;, an open agentic development environment built from the terminal up, thinks the answer isn't picking a single agent and standardising on it — it's giving teams a way to run several at once, compare them, and govern all of them from a single control plane.&lt;/p&gt;

&lt;p&gt;As Tessl wrote back in February, orchestration &lt;a href="https://tessl.io/blog/as-coding-agents-become-collaborative-co-workers-orchestration-takes-center-stage/" rel="noopener noreferrer"&gt;has emerged&lt;/a&gt; as a discipline in its own right: a dedicated layer of tooling for coordinating, supervising and directing multiple agents running in parallel. That same month, Warp &lt;a href="https://www.warp.dev/blog/oz-orchestration-platform-cloud-agents" rel="noopener noreferrer"&gt;launched Oz&lt;/a&gt; as a cloud platform for running and managing coding agents at scale.&lt;/p&gt;

&lt;p&gt;Now, Warp is taking things a step further. In May, &lt;a href="https://www.warp.dev/blog/multi-harness-cloud-agent-orchestration" rel="noopener noreferrer"&gt;the company expanded Oz&lt;/a&gt; into what it's calling the first multi-harness control plane — meaning teams can now run Claude Code, Codex and Warp Agent simultaneously through a single interface, rather than committing to any one of them.&lt;/p&gt;

&lt;p&gt;Tessl caught up with Warp CEO &lt;a href="https://www.linkedin.com/in/zachlloyd/" rel="noopener noreferrer"&gt;Zach Lloyd&lt;/a&gt; to discuss how engineering leaders are thinking about agent fleets, what the harness layer actually changes, and where the lines between autonomy and human oversight are really being drawn.&lt;/p&gt;

&lt;h2&gt;
  
  
  "The wild west": how the agent gold rush became a budget problem
&lt;/h2&gt;

&lt;p&gt;Zach spent several years at Google, leading engineering on Docs and Sheets before co-founding &lt;a href="https://techcrunch.com/2017/10/17/selfmade-helps-businesses-post-better-photos-online/" rel="noopener noreferrer"&gt;photo-editing startup SelfMade&lt;/a&gt;. He later served as interim CTO at Time, before founding Warp in 2020 and raising north of $70 million in funding from the likes of Sequoia, Google Ventures, Figma co-founder Dylan Field, and Salesforce co-founder Marc Benioff.&lt;/p&gt;

&lt;p&gt;That background — building collaborative tools at Google scale, then navigating the startup world — gives Zach a particular vantage point on how quickly the engineering tooling landscape has moved. A year and a half ago, he says, most companies were still trying to get developers to use AI autocomplete tools. Then, about a year ago, the conversation moved to interactive agents — Claude Code, Codex, Warp — where engineers were directing tools to build features and fix issues end to end.&lt;/p&gt;

&lt;p&gt;Now, he says, that phase too has largely passed — and the CFO's arrival in the conversation is perhaps the clearest sign of it.&lt;/p&gt;

&lt;p&gt;"Companies right now have moved from a '&lt;em&gt;can we get people to adopt&lt;/em&gt;' mindset to a '&lt;em&gt;how do you measure ROI&lt;/em&gt;' mindset," Zach explained. "They're paying a lot of money for these tools, and the CFO has gotten involved. All these costs are showing up, and so they are thinking through how to go from the wild west, where every engineer is just spending as much as they can on different agents, to a world where they're still creating as much productivity as possible. But they want to measure it, they want to put quotas and budgets in place, and they also want to use different agents for different types of tasks."&lt;/p&gt;

&lt;p&gt;That last point is central to Warp's multi-harness bet. Rather than standardising on a single agent, Zach argues that engineering teams want the ability to route different tasks to different agents depending on what each does best — while keeping the governance layer consistent across all of them.&lt;/p&gt;

&lt;p&gt;"The biggest trend that we see is: can you use open-weight models for some tasks when you [don't] have to be at the frontier?" Zach said. "The way that we're positioning Oz is that you can basically not lock into one source of intelligence. You can use Claude Code, you can use Codex, you can use open-weight models, but you can still confidently invest in a layer of infrastructure for governance that is not tightly coupled to any one particular agent."&lt;/p&gt;

&lt;p&gt;The economics driving that are already visible. Open-weight models — DeepSeek, Kimi, Qwen — have gone from lagging well behind the frontier to matching it on many tasks, and at a fraction of the inference cost. Tessl also recently &lt;a href="https://tessl.io/blog/why-were-changing-our-default-eval-model/" rel="noopener noreferrer"&gt;switched its default eval model&lt;/a&gt; from Claude Sonnet 4.6 to GLM 5.1 for exactly this reason — finding that for skill evaluation work, a cheaper open-weight model produced near-identical signal at meaningfully lower cost.&lt;/p&gt;

&lt;p&gt;Elsewhere, AI agent startup Lindy &lt;a href="https://thenewstack.io/lindy-deepseek-anthropic-switch/" rel="noopener noreferrer"&gt;recently moved 100% of its traffic&lt;/a&gt; from Anthropic to DeepSeek v4, with CEO Flo Crivello &lt;a href="https://x.com/Altimor/status/2062389885437366342" rel="noopener noreferrer"&gt;claiming the company&lt;/a&gt; would be saving millions in the process.&lt;/p&gt;

&lt;p&gt;It's worth noting that Warp has been &lt;a href="https://tessl.io/blog/warp-goes-open-source-betting-agents-and-community-can-outpace-closed-rivals/" rel="noopener noreferrer"&gt;doubling down on openness more broadly&lt;/a&gt;, open-sourcing its client &lt;a href="https://tessl.io/blog/warp-goes-open-source-betting-agents-and-community-can-outpace-closed-rivals/" rel="noopener noreferrer"&gt;earlier this year&lt;/a&gt; and using Oz itself to manage the repo — agents handle the implementation, community contributors handle direction and verification.&lt;/p&gt;

&lt;p&gt;“We now have a lot of confidence in code that is generated by Oz with our rules, context and verification, so anyone contributing should have a high chance of success coding a feature correctly,” Zach &lt;a href="https://www.warp.dev/blog/warp-is-now-open-source" rel="noopener noreferrer"&gt;said at the time&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The move also serves as a live test of Warp's own thesis — if the orchestration layer is good enough to run a public repo at scale, it's good enough for enterprise teams to trust with their own.&lt;/p&gt;

&lt;p&gt;“Leaning on agents creates pressure for us to nail orchestration, memory, handoff, and all of the other parts of agentic engineering that are core to our business,” Zach continued. “There’s a virtuous loop here.”&lt;/p&gt;

&lt;p&gt;That loop extends to customers too. The things that matter most — &lt;a href="https://tessl.io/blog/the-hidden-cost-of-agentic-software-development-why-context-engineering-matters/" rel="noopener noreferrer"&gt;context management&lt;/a&gt;, memory, audit logs — can all be separated from the agent itself, Zach argues. That's the point of Oz: a container layer for all of it, so that when the best model or harness changes — and Zach is clear that it will, every few months — teams aren't starting from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model isn't enough: why the harness and context matter just as much
&lt;/h2&gt;

&lt;p&gt;The natural question is whether multi-harness is a solution in search of a problem. If Claude Code and Warp Agent can both run on Anthropic models, what is the harness actually changing?&lt;/p&gt;

&lt;p&gt;Zach's answer is that performance is a function of three things working together: the model, the harness, and the context.&lt;/p&gt;

&lt;p&gt;"The harness is what feeds the context in," Zach said. "You want a harness that is good at managing the context window — when do you take different sources of external context and put them in? If you put too much context in, the model has to summarise and it loses information on the current task. How you manage that context window is really important. Different harnesses excel at different things — Claude Code is a great harness, Codex is a really good harness, Warp's agent harness is [also] really good."&lt;/p&gt;

&lt;p&gt;The model and the harness are table stakes. The third element — organisational context — is where Warp is investing most heavily right now, through what it calls cross-harness memory. The idea is that as agents complete tasks, the system captures what worked and surfaces it automatically in future runs, across whichever harness is being used.&lt;/p&gt;

&lt;p&gt;"Every time one of these agents runs, it does some task, and maybe in the course of figuring out some problem, with the guidance of a human, they arrive at some solution," Zach said. "What you don't want to do is throw that away and start from scratch next time. If you have a memory system, think of it as a layer that is observing what all of your agents are doing and being like: this seems like an important thing to remember."&lt;/p&gt;

&lt;p&gt;Cross-harness agent memory is currently in research preview with a small number of pilot customers.&lt;/p&gt;

&lt;h2&gt;
  
  
  More autonomy, more controls: Warp's answer to an uncomfortable balancing act
&lt;/h2&gt;

&lt;p&gt;The tension at the heart of Oz's pitch is one that Zach doesn't try to resolve so much as manage. On the one hand, the platform promises agents that can handle complex, long-running tasks — migrations, production deployments — with less human oversight. On the other, the same release adds approval gates, per-user authentication, and least-privilege permissions.&lt;/p&gt;

&lt;p&gt;Those two things pull in opposite directions.&lt;/p&gt;

&lt;p&gt;"I think there's a fundamental tension, but I think it's necessary," Zach said. "From talking to our customers, I don’t think companies are ready to be fully hands off. The ideal system at this moment looks like a factory floor, where you want to put stuff that can be automated through an automation process, but then you want a human to step in and say: ‘&lt;em&gt;was this done right&lt;/em&gt;’?"&lt;/p&gt;

&lt;p&gt;The logic Zach applies is essentially risk-tiering. The parts of the stack where errors are cheapest get automated first; the parts where they are most costly stay human-supervised longest.&lt;/p&gt;

&lt;p&gt;"The parts that can be most automated are the parts where the risks are lowest — this is common sense," Zach said. "Making changes to our website is way lower risk than making changes to our data. So you'll see more and more of the guardrails go away on the low risk things before they go on the high risk things."&lt;/p&gt;

&lt;p&gt;As for who inside an enterprise actually draws those lines, Zach says it's rarely one team. Platform teams or dedicated AI developer productivity functions tend to lead, with security always involved and finance increasingly so.&lt;/p&gt;

&lt;p&gt;"The security team is always involved — probably the team that's most scared," Zach said. "Increasingly there is a cost management component. What's the budget for this? What's the token budget per engineer? What's the way that you see ROI? It's starting to become a significant line item for all of these customers."&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals: measuring the factory floor
&lt;/h2&gt;

&lt;p&gt;Which brings the conversation to &lt;a href="https://tessl.io/blog/improving-your-skills-with-tessl-evals/" rel="noopener noreferrer"&gt;evals&lt;/a&gt; — how teams actually know whether any of this is working. Zach's framing here draws again on the factory floor analogy: what you want, ultimately, is a bird's eye view of how work flows from idea to shipped product.&lt;/p&gt;

&lt;p&gt;Warp has built a live version of this for its own open-source repository at &lt;a href="https://build.warp.dev/" rel="noopener noreferrer"&gt;build.warp.dev&lt;/a&gt;, where anyone can pull up a view of how issues move through the agent pipeline. Zach uses it as a reference point for what enterprise teams should be aiming for.&lt;/p&gt;

&lt;p&gt;"The things you can measure are throughput of code as one basic measurement," Zach said. "Ideally, in a more sophisticated world, you would go all the way from measuring throughput of code to throughput of user or customer impact — be able to tie back: ‘&lt;em&gt;a ticket came in asking for this feature, an agent was able to build it, it cost this number of dollars or tokens, and in production it was used by XYZ customers&lt;/em&gt;’. That's the dream loop. The code part is not that hard — that's where we can just deliver."&lt;/p&gt;

&lt;p&gt;Token efficiency per PR is the baseline metric Warp currently offers. The harder problem — tying agent output to business outcomes — remains what Zach calls the “holy grail.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The agent builder: a new role that doesn't require an engineering background
&lt;/h2&gt;

&lt;p&gt;One of the more striking parts of the conversation is what Zach describes happening to engineering teams themselves as agent fleets become the norm — at Warp and at the companies it works with.&lt;/p&gt;

&lt;p&gt;The background profile of engineers Warp hires hasn't changed much, he says. What has changed is what they do.&lt;/p&gt;

&lt;p&gt;"The day to day of a software engineer now is not about writing code," Zach said. "It's about: can you accurately specify a user requirement to an agent? Can you make sure that the technical plan an agent comes up with makes sense? Is it building in the right part of the codebase? Is it repeating a bunch of code? Is it using the same quality of abstraction that a human would use?"&lt;/p&gt;

&lt;p&gt;Beyond that shift in existing roles, Warp has also introduced a new function it calls the agent builder — a full-time role focused on building internal automations using agents. Notably, the people filling it don't come from engineering backgrounds.&lt;/p&gt;

&lt;p&gt;"The people who are in this role are people with product and design backgrounds," Zach said. "They are not engineers by training, and I don't think you need that. For internal tooling use cases you can hire people who are more generic builders. One of the cool things that's come out of all this new technology is a democratisation of who gets to build stuff."&lt;/p&gt;

&lt;p&gt;The caveat is that this only holds where the stakes are low — customer-facing product, he implies, is a different matter. "As long as it's not customer-facing, I think it's pretty much fine for that to work that way," Zach said.&lt;/p&gt;

&lt;p&gt;Among the companies Warp works with, Zach sees two distinct camps emerging. Larger organisations with dedicated developer productivity teams are building their own internal software factories from scratch — the complexity is manageable if you have the headcount. Smaller ones are buying, because the build cost simply doesn't justify the investment. What they share, he says, is the destination: a centralised system where agents handle the routine work and humans focus on the exceptions.&lt;/p&gt;

&lt;p&gt;What that means in practice for engineering leaders is less about which agent to pick and more about building the layer around it — the governance, the memory, the measurement — that makes any agent trustworthy enough to run at scale.&lt;/p&gt;

&lt;p&gt;For all the variation in how companies are approaching this — different tools, different team structures, different risk tolerances — Zach sees them all heading toward the same place.&lt;/p&gt;

&lt;p&gt;"The goal of most companies right now is to get to what I would call an internal software factory: a centralised system where agents are taking in issues, judging, building, verifying, pushing," Zach said. "They don't want to do that for 100% of the issues, and they don't want to take humans out of the loop. But they're all trying to stand up this same kind of machine. And different companies are further along on this journey than others."&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>See You at AI Engineering World's Fair 2026</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Sun, 28 Jun 2026 07:51:55 +0000</pubDate>
      <link>https://dev.to/tessl/see-you-at-ai-engineering-worlds-fair-2026-1ede</link>
      <guid>https://dev.to/tessl/see-you-at-ai-engineering-worlds-fair-2026-1ede</guid>
      <description>&lt;p&gt;Next week, the Tessl team is heading to &lt;strong&gt;AI Engineering World's Fair 2026&lt;/strong&gt;, and we couldn't be more excited to spend a few days with the community talking about the future of AI engineering.&lt;/p&gt;

&lt;p&gt;If you're attending, come and find us at &lt;strong&gt;Booth L-G48&lt;/strong&gt;. We'll be demoing our latest product, sharing what we've been building, and talking all things agentic development with engineering teams from around the world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Come and meet the team
&lt;/h2&gt;

&lt;p&gt;At Tessl, we believe &lt;strong&gt;skills are the new code. Treat them that way.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tessl enables development teams to continuously build, test, distribute and optimize agent skills with the security and governance of enterprise software.&lt;/p&gt;

&lt;p&gt;Throughout the event, our technical team will be running live demos at the booth and chatting with attendees about everything from coding agents and agent workflows to evaluation, context management and harness engineering. Whether you're just getting started or already deploying agents in production, we'd love to hear what you're building.&lt;/p&gt;

&lt;p&gt;We're also running a competition throughout the conference, with prizes including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  🎁 Ray-Ban Meta Smart Glasses&lt;/li&gt;
&lt;li&gt;  🎟️ A ticket to &lt;strong&gt;AI DevCon&lt;/strong&gt; in New York this November&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fv65s0ia1n7m330sbhccp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fv65s0ia1n7m330sbhccp.png" alt="prizes" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Unveiling Tessl Agent
&lt;/h2&gt;

&lt;p&gt;AI agents shouldn't just write software—they should continuously improve how software gets built.&lt;/p&gt;

&lt;p&gt;At AI Engineering World's Fair, we'll be unveiling &lt;strong&gt;Tessl Agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build your software factory, one workflow at a time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tessl Agent makes your agents more autonomous over time. It continuously scans your pull requests, session logs and tickets for recurring mistakes and opportunities, automatically opens improvement PRs, turns repeated patterns into automated workflows, and ships them through GitHub Actions—creating a software factory that compounds week after week without slowing feature delivery.&lt;/p&gt;

&lt;p&gt;If you'd like to see it in action, stop by the booth for a live demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  The conversation we're most excited about: Harness Engineering
&lt;/h2&gt;

&lt;p&gt;Every conference has a theme. This year, we think it'll be &lt;strong&gt;Harness Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AI models are getting smarter every month. The challenge is everything around them.&lt;/p&gt;

&lt;p&gt;Agents need context. They need evaluation, testing, guardrails, observability and workflows that help them operate reliably in production. In short, they need a harness.&lt;/p&gt;

&lt;p&gt;We believe Harness Engineering is becoming one of the defining disciplines of modern AI engineering, and we're looking forward to hearing how the community is tackling these challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Catch our talks
&lt;/h2&gt;

&lt;p&gt;We're delighted to have two Tessl speakers presenting on &lt;strong&gt;Thursday, July 2&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coding Agents Don't Scale Themselves. Neither Do Your Teams: The Rise of Agent Enablement
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;🕜 1:30–1:50 PM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Patrick Debois, AI Product Engineer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coding agents are transforming software development, but the context that drives them is still managed with ad hoc prompts, copied rule files and undocumented practices.&lt;/p&gt;

&lt;p&gt;Patrick introduces the &lt;strong&gt;Context Development Lifecycle&lt;/strong&gt;—a framework for treating context with the same engineering discipline we've spent decades applying to code—and explores how teams can build a feedback loop that continuously improves agent performance over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Harness Engineering: The New Core Skill for Agentic Developers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;🕝 2:50–3:10 PM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dru Knox, Head of Product &amp;amp; Design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As coding agents become more capable, success depends less on writing code and more on upgrading your codebase so agents can reliably succeed.&lt;/p&gt;

&lt;p&gt;Dru introduces the core loop of Harness Engineering, the common improvements teams are making today, and how Tessl's Harness Engineering Agent helps developers scale those improvements across their software factory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Join our community event
&lt;/h2&gt;

&lt;p&gt;We're also hosting an evening fireside discussion:&lt;/p&gt;

&lt;h3&gt;
  
  
  Harness Engineering: Building Reliable AI Systems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;📅 Wednesday, July 1 | 6:00 PM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Featuring &lt;strong&gt;Steve Yegge&lt;/strong&gt; and &lt;strong&gt;Dru Knox&lt;/strong&gt;, this conversation explores the emerging discipline of Harness Engineering and what it takes to move AI systems beyond experimentation into reliable production software.&lt;/p&gt;

&lt;p&gt;Together they'll discuss the systems surrounding AI models—from context and evaluation to testing, observability and guardrails—followed by audience Q&amp;amp;A and networking with the AI engineering community.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Reserve your place:&lt;/strong&gt; &lt;a href="https://luma.com/7f31tcht" rel="noopener noreferrer"&gt;https://luma.com/7f31tcht&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Leadership dinner
&lt;/h2&gt;

&lt;p&gt;Alongside the conference, we're also hosting an invite-only leadership dinner, bringing together engineering leaders and AI practitioners for an evening of conversation about the future of agentic development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fug0itruj0qk2a44sjuj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fug0itruj0qk2a44sjuj3.png" alt="pvt dinner" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're looking forward to sharing ideas with some of the people helping define where this industry goes next.&lt;/p&gt;

&lt;h2&gt;
  
  
  See you next week
&lt;/h2&gt;

&lt;p&gt;AI Engineering World's Fair has become one of the best places to connect with the people shaping the future of software engineering, and we can't wait to be part of it.&lt;/p&gt;

&lt;p&gt;Whether you want to see &lt;strong&gt;Tessl Agent&lt;/strong&gt; in action, chat about Harness Engineering, attend one of our talks, or simply swap ideas about building reliable AI systems, we'd love to meet you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Come and see us at Booth L-G48.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Or, if you'd like to guarantee some time with the team, &lt;strong&gt;book a meeting with us through the AI Engineering World's Fair app.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>aie</category>
    </item>
    <item>
      <title>How Small Can an Agent Model Get? The Nemotron Floor</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Sat, 27 Jun 2026 06:21:19 +0000</pubDate>
      <link>https://dev.to/tessl-io/how-small-can-an-agent-model-get-the-nemotron-floor-5gne</link>
      <guid>https://dev.to/tessl-io/how-small-can-an-agent-model-get-the-nemotron-floor-5gne</guid>
      <description>&lt;p&gt;Most model comparisons ask which model is best. This one starts with a model that never even produced a single result.&lt;/p&gt;

&lt;p&gt;We tested NVIDIA's open-weight Nemotron family, from the 30B Nano to the 120B Super, on a benchmark of real-world coding tasks. These are the kind of models an indie developer on a tight budget, or an enterprise cutting inference cost and keeping data in-house, would run.&lt;/p&gt;

&lt;p&gt;The main finding is that model size is not a dial you turn for a little more quality; it is a threshold. Below a certain capability floor a model cannot drive an agent loop at all, which is why the smallest variant we tried, Nano 12B, produced nothing to score.&lt;/p&gt;

&lt;p&gt;Above the floor, the question stops being which model is cheapest and becomes which one clears the bar your work actually needs: Nano 30B is an extremely cheap workhorse for narrow, well-scoped jobs, while Super 120B is the size that holds up on demanding multi-step agent work.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;agent size floor&lt;/strong&gt; is the minimum model capacity below which a model cannot reliably complete the act-observe-decide loop an agent depends on. Below it you don't get a slower or sloppier agent, you get a non-agent: a model that reads the task, takes a few steps, and never converges. For anyone choosing a model, this changes the question from "which is cheaper" to "which clears the floor for my work", and that is the question to answer first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the numbers come from
&lt;/h2&gt;

&lt;p&gt;Every scenario in the evaluation is a real-world agent task tied to a published skill, scored on two axes: instruction-following (does the agent do what it was told, in the way it was told) and task-completion (does it reach the goal). The overall score weights instruction-following at 4 and task-completion at 3, then divides by 7. Each task runs with and without the skill, so the lift from the skill is visible directly. The tasks and skills are public, in the &lt;a href="https://huggingface.co/datasets/tesslio/task-evals-for-skills" rel="noopener noreferrer"&gt;task-evals-for-skills dataset&lt;/a&gt;, so you can inspect any scenario yourself.&lt;/p&gt;
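
&lt;p&gt;As a quick sanity check, the weighting can be written out in a few lines. This is an illustrative sketch, not the evaluation's own code: the function name &lt;code&gt;overall_score&lt;/code&gt; is our own, and the inputs are the rounded headline averages from the results table in this article, so the outputs agree with the Overall column only to rounding.&lt;/p&gt;

```python
# Weights stated in the article: instruction-following 4, task-completion 3.
INSTRUCTION_WEIGHT = 4
COMPLETION_WEIGHT = 3


def overall_score(instruction_following, task_completion):
    """Weighted mean of the two axes, on the same 0-100 scale as the inputs."""
    weighted = (INSTRUCTION_WEIGHT * instruction_following
                + COMPLETION_WEIGHT * task_completion)
    return weighted / (INSTRUCTION_WEIGHT + COMPLETION_WEIGHT)


# Headline averages from the results table:
# (instruction following, goal completion).
headline = {
    "Super 120B baseline": (31.3, 68.4),
    "Super 120B with skill": (49.2, 69.3),
    "Nano 30B baseline": (19.0, 46.6),
    "Nano 30B with skill": (26.0, 51.3),
}

for label, (instruction, completion) in headline.items():
    print(label, round(overall_score(instruction, completion), 1))
```

&lt;p&gt;Run as written, this prints 47.2, 57.8, 30.8 and 36.8, matching the Overall column. Because the weighting is linear, applying it to the averaged axes gives the same number as averaging per-task overall scores over the same set of tasks.&lt;/p&gt;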

&lt;p&gt;This design is deliberate. The tasks are derived from published skills, so they mirror the work teams write skills for, not contrived benchmark puzzles. That changes what a low score means. For a model that can do the work, the gap that remains is instruction-following: doing the job the way it was asked. For a model that cannot reach the goal even on ordinary work, the problem is more fundamental than guidance.&lt;/p&gt;

&lt;p&gt;Both models were served the same way, through OpenHands on Bedrock, and graded by the same judges. That gives close to a thousand paired scenarios for each model. Every comparison below is apples-to-apples within NVIDIA, with no cross-harness confound and no provider pricing to reconcile. Cost is solve-only dollars per task, taken from each run's measured token usage. Neither model triggered a single rubric-gaming flag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two sizes, two different walls
&lt;/h2&gt;

&lt;p&gt;Here are the headline results, baseline → with-skill.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Goal completion&lt;/th&gt;
&lt;th&gt;Instruction following&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;th&gt;$/task&lt;/th&gt;
&lt;th&gt;Near-zero solves (overall &amp;lt; 25)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Super 120B&lt;/td&gt;
&lt;td&gt;68.4 → 69.3&lt;/td&gt;
&lt;td&gt;31.3 → 49.2&lt;/td&gt;
&lt;td&gt;47.2 → 57.8&lt;/td&gt;
&lt;td&gt;0.083&lt;/td&gt;
&lt;td&gt;19% → 22%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nano 30B&lt;/td&gt;
&lt;td&gt;46.6 → 51.3&lt;/td&gt;
&lt;td&gt;19.0 → 26.0&lt;/td&gt;
&lt;td&gt;30.8 → 36.8&lt;/td&gt;
&lt;td&gt;0.040&lt;/td&gt;
&lt;td&gt;43% → 38%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two sizes hit their limits for different reasons. Super 120B can mostly finish. Its goal completion sits near 69, and the skill barely moves it, adding only 0.9 points. What it struggles with is doing the task the prescribed way: the skill adds 17.9 points of instruction-following. Super has the capability, and is helped by the guidance the skill provides.&lt;/p&gt;

&lt;p&gt;Nano 30B, the smaller model, has the opposite problem. Reliable completion is where it wavers. Goal completion is 46.6, and 43% of its baseline attempts come back near-zero. It is close enough to the floor that the loop itself is the bottleneck, not the formatting of the answer.&lt;/p&gt;

&lt;p&gt;There is a pattern hiding in those averages, and it matters just as much as the averages do. With these agents you rarely get a mediocre run. You mostly get a near-finished result or a near-total miss. With the skill, Super scores 75 or above on 40% of tasks and misses badly on 22%. Nano flips that shape: it scores 75 or above on only 11% of tasks and misses badly on 38%. Scale does not make the agent gently better. It changes which of the two outcomes you get most of the time. This is why the average is only a rough guide to any single run: averaging "mostly great" with "mostly broken" produces a number that few individual runs actually land on.&lt;/p&gt;

&lt;p&gt;It also means Nano is not uniformly weak. On well-scoped tasks, like calling a documented API or following a focused doc-retrieval skill, it clears the usable bar often enough to be worth a look. Its trouble is the longer, multi-step work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where scale helps, and where skills help
&lt;/h2&gt;

&lt;p&gt;Scale and skills are not competing answers to the same question. They do different jobs, and the eval shows where each one pays off. The familiar story, one we have told ourselves, is that a relevant skill can let a cheaper model catch a pricier one. That holds, with one condition: the model has to be capable enough to act on the skill in the first place.&lt;/p&gt;

&lt;p&gt;Start with the job scale does. Going from 30B to 120B, a 4x jump in parameters, buys 16.4 overall points at baseline. That is scale carrying a model over the floor, to where it can complete the task at all. Adding a skill to Nano 30B buys 6.0 points, which takes it to 36.8, still below Super with no skill at all (47.2). Below the floor, there was not yet enough capability for a skill to build on.&lt;/p&gt;

&lt;p&gt;A skill is a multiplier, and on a model above the floor the multiplier can be large. The same skill lifts Super by 17.9 points on instruction following, while barely touching its goal completion (up 0.9). This reflects where Super had room to grow. It could already finish most tasks, so the skill's gain showed up in instruction-following, not completion. A skill can help a model finish too; Super simply had little completion headroom left. The two are a sequence, not a contest. Get a model over the floor, then a skill delivers outsized returns.&lt;/p&gt;

&lt;p&gt;The effect is sharpest skill by skill, and it shows how much a skill can do for a capable model. A Brave Search location skill adds 76 points of instruction-following for Super. A Neon auth skill adds 68. On Nano those same skills add 1 point and 0 points respectively, because there is no capability yet for the guidance to land on. Match a skill to a model that can act on it and the payoff is substantial.&lt;/p&gt;

&lt;p&gt;Single tasks tell the same story. On the stripe_ai_upgrade-stripe scenario, the skill takes Super from a complete miss to a perfect 100, while the same skill on the same task leaves Nano at 0. The skill is doing the work in the first case and has nothing to build on in the second. Across the set there are 163 tasks where Super clears the usable bar and Nano comes back near-zero, the kind of gap a skill alone will not close.&lt;/p&gt;

&lt;p&gt;The same pattern emerges in effort. Nano 30B takes more turns than the larger model (29.9 with skill, against Super's 24.5) for roughly two-thirds of the overall score (36.8 against 57.8). Its turns split into two habits: when it fails outright it gives up fast, in around ten turns, and when it engages it grinds for thirty or more to reach a middling result. Below the floor the extra guidance adds turns and cost (24.7 to 29.9, cost up 25%) without a matching gain, because the model cannot act on it efficiently yet. Above the floor, a model puts a skill's instructions to work; below it, capability has to catch up first.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the cheaper model is not necessarily the better value
&lt;/h2&gt;

&lt;p&gt;Here is where the intuition most teams carry into a self-hosting decision breaks. Nano costs half what Super does per task, $0.040 against $0.083, so the natural conclusion is that Nano is the better value and Super is the option you reach for only when you must.&lt;/p&gt;

&lt;p&gt;The per-task price leaves out one thing: failures. With the skill, Nano comes back with a near-zero result on 38% of tasks, against Super's 22%. Every one of those is a retry, and retries cost money the per-task price never shows. Count them and the model that looked cheaper per task can end up costing more for each result you can actually use.&lt;/p&gt;

&lt;p&gt;Points-per-dollar makes Nano look like the bargain, 928 against Super's 694. But that number only rewards cheapness, not quality: a model that regularly does the wrong thing but is extremely cheap will still score well on it. So decide the quality you need first, then compare pricing.&lt;/p&gt;
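
&lt;p&gt;The retry argument can be made concrete with a back-of-the-envelope model. The sketch below is our own simplification, not part of the evaluation: it assumes every attempt is independent and that a failed attempt is simply retried, so the expected number of attempts is one over the success rate. The helper name &lt;code&gt;cost_per_usable_result&lt;/code&gt; is hypothetical, and the inputs are the rounded per-task prices and with-skill rates quoted in this article.&lt;/p&gt;

```python
def cost_per_usable_result(cost_per_attempt, success_rate):
    """Expected spend to get one usable result when failed attempts are retried.

    Assumes independent attempts, so the expected number of attempts is
    1 / success_rate (the mean of a geometric distribution).
    """
    if success_rate == 0:
        raise ValueError("a model that never succeeds has no finite cost per result")
    return cost_per_attempt / success_rate


# Rounded figures quoted in the article (with-skill rates).
models = {
    "Super 120B": {"cost": 0.083, "near_zero": 0.22, "top": 0.40},
    "Nano 30B": {"cost": 0.040, "near_zero": 0.38, "top": 0.11},
}

for name, figures in models.items():
    # Lenient bar: anything that is not a near-zero result counts as usable.
    lenient = cost_per_usable_result(figures["cost"], 1 - figures["near_zero"])
    # Strict bar: only results scoring 75 or above count as usable.
    strict = cost_per_usable_result(figures["cost"], figures["top"])
    print(name, "lenient:", round(lenient, 3), "strict:", round(strict, 3))
```

&lt;p&gt;Under these assumptions the ranking depends on the bar. If any result that is not near-zero is good enough, Nano stays cheaper, at roughly $0.06 per usable result against roughly $0.11 for Super. If only a score of 75 or above will do, the order flips, at roughly $0.36 for Nano against roughly $0.21 for Super.&lt;/p&gt;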

&lt;p&gt;Cost is also only half the decision. The other half is fit. Nano earns its low price on the well-scoped tasks it does reliably, while on the longer, multi-step work, Super is worth paying for. The value is in matching each model to the work it can do, not in naming one model the cheapest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which size fits your work?
&lt;/h2&gt;

&lt;p&gt;The findings turn into a simple rule of thumb. Reach for Nano 30B when the task is narrow and well-scoped: a documented API call, a focused doc-retrieval job, or a single-file change, run at high volume where a passable result or a cheap retry is acceptable. It costs half as much per task and is small enough to self-host on consumer hardware, which makes it a genuine workhorse.&lt;/p&gt;

&lt;p&gt;Reach for Super 120B when the work is multi-step or longer-horizon, when the result has to be usable on the first try, or when you cannot predict the shape of the tasks coming in. It is the first open-weight size that reliably clears the floor for real agent work, and the place to start for anything headed to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding your floor
&lt;/h2&gt;

&lt;p&gt;This study only exists because NVIDIA ships an open-weight size ladder you can self-host. That ladder lets you match the model to the job, and step up only when the smaller one cannot clear your quality floor. The framing to carry away is the smallest usable agent, not the fastest or cheapest on paper.&lt;/p&gt;

&lt;p&gt;So when it comes to choosing a model, don't start with price or parameter count. The model that looks like a bargain on the invoice could be the one quietly costing you. Take the work you actually need done, set the quality bar it has to clear, and measure which models successfully clear it. That is the comparison that will predict what actually works for you, and it is worth running before you commit to a model. Once the decision is made, the &lt;a href="https://tessl.io/registry" rel="noopener noreferrer"&gt;Tessl Registry&lt;/a&gt; is where you find the skills that take it the rest of the way.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>AI Agent Governance: 10 Takeaways from Engineering Leaders on Agentic Development</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Fri, 26 Jun 2026 08:31:25 +0000</pubDate>
      <link>https://dev.to/tessl-io/ai-agent-governance-10-takeaways-from-engineering-leaders-on-agentic-development-4ph0</link>
      <guid>https://dev.to/tessl-io/ai-agent-governance-10-takeaways-from-engineering-leaders-on-agentic-development-4ph0</guid>
      <description>&lt;p&gt;Agentic development starts as a productivity story, but at scale it quickly becomes a governance problem.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://tessl.io/devcon" rel="noopener noreferrer"&gt;AI Native DevCon&lt;/a&gt; London, we hosted a set of Chatham House Rule roundtables with senior engineering leaders from a range of organizations. I won’t attribute comments to individuals or companies, but the patterns were strikingly consistent: agentic development is moving from an individual tooling conversation into an enterprise operating model question.&lt;/p&gt;

&lt;p&gt;The first wave was familiar enough: devs tried GitHub Copilot, Cursor, Claude Code, Codex, Devin and similar tools, and many found obvious value. They wrote code faster, produced tests faster, explored ideas faster, and in some cases revived work that had been sitting in the backlog because it was too costly to attempt.&lt;/p&gt;

&lt;p&gt;The interesting question is what happens once agents stop being a personal accelerator and start touching the way an engineering organization works. At that point, the problem shifts from “does the tool help?” to “can we make this safe, repeatable, measurable, and economically sane?”&lt;/p&gt;

&lt;p&gt;That shift is why I think the most useful frame is &lt;strong&gt;AI agent governance&lt;/strong&gt;. It means the systems that let teams move faster without losing control, including identity, permissions, context, evals, model routing, cost visibility, policy, ownership, and feedback loops.&lt;/p&gt;

&lt;p&gt;On a side note, you can hear my talk “skills are the new code”, where I share my personal framework for agent governance and a proposed approach to enterprise agent enablement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=KpfnldjO3Iw" rel="noopener noreferrer"&gt;Watch on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s now look at the 10 main takeaways from our roundtable.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Agent adoption starts with enthusiasm, but scaling it requires deliberate rollout
&lt;/h2&gt;

&lt;p&gt;Most organizations seem to start the same way: give developers access to AI coding tools and let the motivated teams run.&lt;/p&gt;

&lt;p&gt;This is the right instinct at the start, because the space is moving too quickly for a purely top-down programme to discover all the useful patterns. Bottom-up energy creates learning quickly. It also surfaces where agents are genuinely useful, rather than where a transformation deck hoped they might be.&lt;/p&gt;

&lt;p&gt;But it also creates fragmentation.&lt;/p&gt;

&lt;p&gt;Different teams adopt different tools, build different prompts, store skills in different repos, and develop different assumptions about what is safe enough to automate. One group may use agents for test generation, another for code review, another for product specs, another for deployment automation. Before long, the organization can have dozens of useful experiments that don’t yet add up to a system.&lt;/p&gt;

&lt;p&gt;The trick is not to kill the experimentation but to create a path from local learning to shared practice.&lt;/p&gt;

&lt;p&gt;The first wave of adoption was mostly about individual productivity. The next wave has to be about repeatable, governed team workflows. That means rollout phases, clear ownership, a view of which tools are approved for which classes of work, and a way to convert the best local experiments into standards others can reuse.&lt;/p&gt;

&lt;p&gt;This is a familiar pattern from cloud and DevOps: the early adopters prove what is possible, then the platform forms around them. The difference this time is that the cycle is much faster, and the unit being governed is not just infrastructure or code, but the agentic workflow itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The strongest ROI case is not productivity. It is increased &lt;em&gt;ambition&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;A lot of the public conversation around AI in software development is still framed around productivity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Can engineers do the same work faster?&lt;/li&gt;
&lt;li&gt;  Can teams ship more with the same number of people?&lt;/li&gt;
&lt;li&gt;  Can the business do the same with less?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many business leaders will look for savings, and it would be naive to pretend otherwise. It is also worth acknowledging that some of this is hard to say openly in a group setting, however intimate. In practice, some leaders will seek to capitalize on productivity by doing the same work with fewer people, reducing costs, or slowing future hiring.&lt;/p&gt;

&lt;p&gt;But the roundtables reinforced a concern I have had for a while: if we hype AI productivity too aggressively, we may slow adoption by making people fear what adoption means.&lt;/p&gt;

&lt;p&gt;If the internal narrative is mostly about headcount reduction, people will defend themselves. They may hide the real gains, avoid showing how much faster a workflow became, or keep their best agent patterns private because sharing them feels like making the case for fewer people.&lt;/p&gt;

&lt;p&gt;That is not a cultural foundation for transformation. A better frame is &lt;em&gt;ambition.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Agents make prototypes cheaper. They let senior engineers explore ideas that have been trapped behind calendar time. They change the build-versus-buy equation, because a capability that once required an RFP and a vendor project may now be plausible for a small internal team to try.&lt;/p&gt;

&lt;p&gt;This is the version of the story that leaders should emphasize publicly and internally. The question should be “what can we now attempt that we previously would not have attempted?”&lt;/p&gt;

&lt;p&gt;That framing does not deny the economics, but it does point them in a healthier direction. The long-term narrative should not be about lowering the floor, but about raising the ceiling. If AI is understood as a way to increase ambition rather than quietly reduce capacity, more people will lean in, and the organization is more likely to discover the compounding benefits.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Why context engineering is becoming a first-class engineering asset
&lt;/h2&gt;

&lt;p&gt;Agents are only as useful as the context they can apply.&lt;/p&gt;

&lt;p&gt;That context includes specs, tests, policies, architecture guidance, product requirements, runbooks, coding conventions, incident patterns, security rules, and domain language. Most organizations already have some of this knowledge, but it is rarely as clean or discoverable as the agentic era requires. Some of it lives in docs, some in Slack, some in tickets, some in code comments, and a great deal of it lives in people’s heads.&lt;/p&gt;

&lt;p&gt;In the pre-agent world, weak documentation was annoying but survivable. A dev could ask the person who knew the system, or learn the convention through review comments. In the agentic world, missing context becomes a direct limit on what the agent can do.&lt;/p&gt;

&lt;p&gt;This is why skills matter.&lt;/p&gt;

&lt;p&gt;Skills turn tacit engineering knowledge into reusable context that agents can apply. They are not just prompts with nicer packaging; they are a way to encode how an organization wants work done, from API usage to security checks to writing style to deployment workflow.&lt;/p&gt;

&lt;p&gt;This is also where Tessl’s view of agentic development comes in. If agents are going to participate across the SDLC, organizations need a way to collaboratively develop, discover, evaluate, and improve the context those agents rely on. Skills and evals are two sides of that problem: skills package the knowledge agents need, while evals show whether that knowledge actually improved the outcome.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fl09sikdgfvlxvyhajr3h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fl09sikdgfvlxvyhajr3h.png" alt="sdlc" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you see context this way, and move the mental framework from SDLC → CDLC (the Context Development Lifecycle, illustrated above), documentation stops being a hygiene task and becomes infrastructure. The teams that write down how they work, keep that knowledge current, and make it available to agents will have a structural advantage over teams that treat context as tribal knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Cost matters, but the wrong framing leads to the wrong decisions
&lt;/h2&gt;

&lt;p&gt;Model costs are becoming real.&lt;/p&gt;

&lt;p&gt;In the earliest adoption phase, many teams did not feel the cost directly. Usage was limited, pilots were small, and in some cases vendor pricing or subsidies made the economics look less material than they would eventually become. But that phase is ending…&lt;/p&gt;

&lt;p&gt;As agents become part of daily development, cost shows up in more places: large context windows, repeated attempts, long-running tasks, model upgrades, autonomous workflows, and agents that call other tools in loops.&lt;/p&gt;

&lt;p&gt;A prompt that is cheap as a one-off experiment can become expensive when it runs across hundreds of devs every day, each with a large repo context, multiple retries, and a frontier model selected by default.&lt;/p&gt;

&lt;p&gt;This is why &lt;em&gt;AI FinOps&lt;/em&gt; needs to become a real discipline!&lt;/p&gt;

&lt;p&gt;The cloud analogy is useful (but only up to a point). In cloud, cost followed infrastructure usage. In AI, cost follows cognition-like work: reasoning, context, retries, tool calls, evals, and orchestration. That makes it harder to map spend to value, because the bill may be attached to a workflow that saved a week of engineering time, avoided a security incident, accelerated a customer feature, or simply produced three bad attempts before a human rewrote it.&lt;/p&gt;

&lt;p&gt;Even in the few weeks since these roundtables took place, awareness of AI costs has increased substantially. That will continue as agent adoption broadens. Leaders will need visibility into where spend goes, which models are used for which tasks, where context is being wasted, and which workflows justify their cost because they improve delivery, quality, risk, or ambition.&lt;/p&gt;

&lt;p&gt;The wrong answer is to suppress usage blindly. The better answer is to manage it deliberately: model routing, caching, context discipline, budgets, observability, and evals that help teams know whether cheaper options are good enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Model routing will be part of AI agent governance
&lt;/h2&gt;

&lt;p&gt;There was broad agreement that not every task should use the largest or most expensive frontier model. A good example is how we’ve recently switched Tessl’s default eval model from Sonnet 4.6 to GLM 5.1. The principle is easy to accept, but the operational question is harder: how does an organization know which model is good enough for which job?&lt;/p&gt;

&lt;p&gt;The answer will not be one model; it will be &lt;em&gt;routing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Frontier models will remain valuable for ambiguous reasoning, complex planning, and tasks where the cost of a poor answer is high. Smaller models may be better for bounded, repeatable work where the task is well specified and the output can be validated. Open models have become capable enough that, for many narrow tasks, they may be more than sufficient and much cheaper. Local or private deployments may make sense when data sensitivity, latency, or control matters more than raw capability.&lt;/p&gt;

&lt;p&gt;The risk is that every team solves this independently. One team standardises on Claude Code, another on Cursor, another on Codex, another experiments with open models, and the organization ends up with duplicated eval work and no shared view of quality, cost, or risk.&lt;/p&gt;

&lt;p&gt;This is why model routing belongs inside AI agent governance. The decision should depend on the task, the data, the quality bar, the blast radius, the cost, and the validation available. The real capability is not choosing a favorite model; it is building the measurement and routing layer that lets teams use the right model for the right task.&lt;/p&gt;

&lt;p&gt;The important test is not whether a smaller model works once. It is whether it meets the quality bar repeatedly under realistic inputs, with the context and constraints the workflow will actually have in production.&lt;/p&gt;
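
&lt;p&gt;To make the idea of a routing layer concrete, here is a minimal sketch of one possible rule: pick the cheapest model whose measured pass rate clears the quality bar, and exclude models that cannot run where sensitive data is allowed. Everything in it is an assumption for illustration. The &lt;code&gt;ModelOption&lt;/code&gt; fields, the &lt;code&gt;route&lt;/code&gt; function, the model labels, and the numbers are hypothetical placeholders, not measurements and not a description of how any particular product routes requests. A real router would also weigh blast radius and the validation available for the task.&lt;/p&gt;

```python
from dataclasses import dataclass
from operator import ge  # function form of "greater than or equal to"


@dataclass(frozen=True)
class ModelOption:
    name: str                 # hypothetical label, not a real model ID
    cost_per_task: float      # dollars per task, as measured by your own evals
    pass_rate: float          # share of eval runs that met the quality bar
    private_deployment: bool  # True if it can run where sensitive data is allowed


def route(options, quality_bar, sensitive_data):
    """Return the cheapest option that clears the bar, or None if none does."""
    eligible = [
        option for option in options
        if ge(option.pass_rate, quality_bar)
        and (option.private_deployment or not sensitive_data)
    ]
    if not eligible:
        return None  # nothing qualifies: escalate to a human or revisit the bar
    return min(eligible, key=lambda option: option.cost_per_task)


# Placeholder catalog: every value here is illustrative.
catalog = [
    ModelOption("small-open", 0.04, 0.62, True),
    ModelOption("large-open", 0.08, 0.78, True),
    ModelOption("frontier-hosted", 0.60, 0.93, False),
]

print(route(catalog, quality_bar=0.60, sensitive_data=True).name)
print(route(catalog, quality_bar=0.90, sensitive_data=False).name)
print(route(catalog, quality_bar=0.90, sensitive_data=True))
```

&lt;p&gt;With this placeholder catalog, the three calls print &lt;code&gt;small-open&lt;/code&gt;, &lt;code&gt;frontier-hosted&lt;/code&gt; and &lt;code&gt;None&lt;/code&gt;. The last case is the useful one: when no model clears both the quality bar and the data constraint, the router says so instead of quietly choosing a model that falls short.&lt;/p&gt;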

&lt;h2&gt;
  
  
  6. Why AI agent governance is becoming the enterprise security bottleneck
&lt;/h2&gt;

&lt;p&gt;Cost is rising, but security is still the concern most likely to limit enterprise adoption.&lt;/p&gt;

&lt;p&gt;The risks are easy to understand once you stop thinking about agents as chatbots and start thinking about them as actors inside the development environment. A coding agent running with a developer’s credentials may be able to access internal repositories, package registries, logs, deployment systems, tickets, customer data, and production-adjacent systems. If that agent can browse the web, install packages, execute scripts, or move data between systems, the blast radius changes materially.&lt;/p&gt;

&lt;p&gt;This does not mean the right answer is to block agents. It means the trust model has to mature.&lt;/p&gt;

&lt;p&gt;One useful mental model from the roundtables was to treat agents like new employees or interns. You would not give an intern every credential and full production access on day one. You would start with a defined scope, observe their work, review their decisions, and expand trust over time. Agents need a version of the same path.&lt;/p&gt;

&lt;p&gt;That path includes identity, entitlements, sandboxing, audit trails, tool restrictions, policy enforcement, and incident response. It also includes a decision about whether the agent acts as the human, as a separate identity, or as a constrained delegated identity. Without that, security teams are left with a choice between approving risky autonomy or blocking usage entirely.&lt;/p&gt;

&lt;p&gt;There is also an important cost dynamic here. In many enterprises, security constraints currently limit usage, which means they also shield the organization from the full cost curve. If only a small number of teams can use agents in limited ways, the token bill remains constrained. Once identity, permissions, sandboxing, and audit controls mature, adoption will expand, and costs that were previously hidden by limited rollout will become much more visible.&lt;/p&gt;

&lt;p&gt;So security may be the immediate bottleneck, but cost is waiting behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. As coding gets cheaper, alignment becomes the bottleneck
&lt;/h2&gt;

&lt;p&gt;Agents reduce the cost of implementation, but that does not mean the organization automatically moves faster. It means the bottleneck moves.&lt;/p&gt;

&lt;p&gt;If code becomes cheaper to produce, the relative cost of everything around code increases: product clarity, architecture decisions, security approvals, change management, compliance, release coordination, and cross-team alignment. Several leaders described a version of the same pattern, where teams can now build faster than the organization can decide, approve, or absorb.&lt;/p&gt;

&lt;p&gt;This changes the economics of software delivery.&lt;/p&gt;

&lt;p&gt;For years, engineering organizations optimised heavily against duplication. Build the shared capability once, coordinate across teams, extract commonality, and reuse the platform. That instinct still matters, but the trade-off changes when implementation becomes cheaper and coordination remains expensive. In some cases, duplicating a capability inside a clear domain boundary may be more effective than forcing multiple teams through a shared dependency.&lt;/p&gt;

&lt;p&gt;This is not an argument against architecture. It is an argument for architecture that recognises where the bottleneck has moved.&lt;/p&gt;

&lt;p&gt;Agentic development works best when work has clear ownership, limited dependencies, strong tests, and a constrained blast radius. It struggles when success depends on many teams agreeing before anything can move. The practical leadership question is therefore not just “how do we make developers faster?” but “what will become the constraint once they are?”&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Enterprise AI agent governance needs explicit, automated controls
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Foi6y9ef9sd6zfx4ubbf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Foi6y9ef9sd6zfx4ubbf5.png" alt="meme" width="799" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most organizations already have controls for software delivery: code review, change management, access approval, security review, compliance checks, deployment gates, incident response, and audit logging.&lt;/p&gt;

&lt;p&gt;The problem is that many of those controls were designed for humans.&lt;/p&gt;

&lt;p&gt;They rely on judgement, institutional memory, informal interpretation, or manual process. People know what the policy really means. Reviewers know when something feels risky. Security teams know which exceptions matter. Auditors accept a workflow because they recognise the human pattern behind it.&lt;/p&gt;

&lt;p&gt;Agents force these assumptions into the open.&lt;/p&gt;

&lt;p&gt;If a policy is ambiguous, an agent cannot reliably follow it. If a control depends on a human noticing something subtle, it may not scale. If a process is only documented in training material, it is not agent-ready. If an approval exists mainly so another team can find out what is happening, it may need to be redesigned.&lt;/p&gt;

&lt;p&gt;This is governance debt, and agentic development exposes it.&lt;/p&gt;

&lt;p&gt;The answer is not to invent an entirely new governance model from scratch. It is to make existing controls explicit, automated, and measurable. That means clearer policies, better identity systems, structured workflows, automated checks, traceability across agent actions, and evals that test whether the agent is actually following the standards it was given.&lt;/p&gt;

&lt;p&gt;You cannot govern what you cannot see, and you cannot improve what you cannot evaluate. That is why skills, observability, and evals belong in the same conversation as security.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Standardization matters, but premature standardization can kill learning
&lt;/h2&gt;

&lt;p&gt;Every organization adopting agents faces the same tension: how much freedom should teams have?&lt;/p&gt;

&lt;p&gt;Too little standardization creates chaos. Too much standardization too early kills discovery.&lt;/p&gt;

&lt;p&gt;The roundtables surfaced many examples of parallel experimentation: multiple teams creating skills, multiple repositories collecting prompts, different approaches to code review, different rules for test generation, different ideas about how much autonomy is acceptable. Some duplication happened because teams wanted control. Some happened because they did not know someone else had already solved the problem.&lt;/p&gt;

&lt;p&gt;Early duplication is not always bad. It can be how teams learn. It can reveal which patterns work across different environments, and it can create local champions who are credible because they solved a real problem rather than followed a mandate.&lt;/p&gt;

&lt;p&gt;But local learning only becomes organizational advantage if it becomes visible.&lt;/p&gt;

&lt;p&gt;The healthiest pattern is to let teams experiment, make the work discoverable, then converge deliberately. That requires communities of practice, internal demos, shared repos, skill registries, lightweight review processes, and a platform team that sees its job as amplifying the good patterns rather than suppressing all variation.&lt;/p&gt;

&lt;p&gt;The question is not whether to standardise. The question is &lt;em&gt;when&lt;/em&gt;. Experimentation should be broad while the organization is learning. Production patterns should become intentional once that learning starts to repeat.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. The talent model is shifting from writing code to directing, verifying, and integrating work
&lt;/h2&gt;

&lt;p&gt;Agentic development changes what great engineering looks like.&lt;/p&gt;

&lt;p&gt;It does not remove the need for engineering skill. If anything, judgement becomes more important. But the work shifts from producing every line of code to defining the task, supplying the context, delegating to agents, verifying the output, integrating the result, and knowing when something is subtly wrong.&lt;/p&gt;

&lt;p&gt;Some engineers will thrive in that environment. They are comfortable with ambiguity, orchestration, and context switching. They can hold the goal in their head while inspecting partial outputs. They know how to specify, review, and correct without needing to manually produce every detail.&lt;/p&gt;

&lt;p&gt;Others may struggle, especially if their identity is tied primarily to deep, single-threaded implementation or writing every line by hand. That style of work will not disappear, but it will become part of a larger system in which humans increasingly design and supervise the machinery of software creation.&lt;/p&gt;

&lt;p&gt;One analogy that came up in the discussions was the shift from building the furniture to building or operating the factory that builds the furniture. Another is management: working with agents can feel like defining work, delegating it, reviewing the output, and intervening when needed.&lt;/p&gt;

&lt;p&gt;That does not mean every engineer becomes a people manager. It means more engineers will need management-like skills for systems of agents: specification, delegation, verification, feedback, and accountability.&lt;/p&gt;

&lt;p&gt;The emerging role is less “the person who writes all the code” and more “the person who ensures the right system gets built.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts: What are the main blockers for enterprise agent adoption?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Blocker&lt;/th&gt;
&lt;th&gt;What leaders are seeing&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Agents inherit human permissions, touch sensitive systems, browse the web, or act without enough containment.&lt;/td&gt;
&lt;td&gt;It limits rollout today, but also defines the trust model for everything that follows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Usage grows through larger context windows, repeated runs, frontier models, and always-on workflows.&lt;/td&gt;
&lt;td&gt;AI FinOps becomes a durable discipline, not a one-off optimisation project.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model deployment&lt;/td&gt;
&lt;td&gt;Frontier models are powerful, but many enterprise tasks may be better served by smaller, open, or specialised models.&lt;/td&gt;
&lt;td&gt;The capability to route work across models becomes more strategic than picking a single model.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context&lt;/td&gt;
&lt;td&gt;Agents need specs, policies, tests, docs, runbooks, examples, and domain language to do useful work reliably.&lt;/td&gt;
&lt;td&gt;Context becomes infrastructure, and weak documentation becomes an adoption blocker.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alignment&lt;/td&gt;
&lt;td&gt;Implementation gets cheaper, while decisions, approvals, architecture, and cross-team coordination still move at human speed.&lt;/td&gt;
&lt;td&gt;The bottleneck moves from writing code to agreeing what should be built and how it should fit.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most of the roundtable discussion reinforced what enterprise leaders already feel: agentic development is useful, the tools are improving quickly, and adoption is uneven.&lt;/p&gt;

&lt;p&gt;From my perspective, three novel points stood out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Hyping AI productivity can hinder adoption&lt;/strong&gt;. If the story inside a company is mostly about doing the same work with fewer people, employees will quite reasonably hear a threat. A better transformation narrative is ambition: agents let teams attempt more, build more, explore more, and pursue work that previously looked out of reach. This shift turns the question around and focuses on nurturing an enterprise culture that empowers devs (rather than scaring them!).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;We need AI FinOps!&lt;/strong&gt; Managing AI costs is not a short-lived problem that disappears once models get cheaper. As agents become embedded in development workflows, usage expands, model choice diversifies, and context-heavy workflows become normal. Cost needs to be observed, managed, and tied to value.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;In the enterprise, the security bottleneck currently shields organizations from the full cost curve&lt;/strong&gt;. Many companies are not yet seeing the true cost of broad agent adoption because security constraints are limiting usage. Once the controls mature, adoption will expand, and the cost question will become much sharper.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The next generation of engineering teams won’t be defined by how many agents they use, but by how well they govern them.&lt;/p&gt;

&lt;p&gt;At Tessl, this is the approach we’re building towards: agent governance rooted in context, evaluations, and security. A practical place to start is to point your coding agent at the Tessl CLI and ask it to evaluate your context. It is a simple way to assess the quality of your context, understand where the gaps are, and think about what governance will need to cover next.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>Cursor's new leaderboard shows teams the most popular plugins, skills and MCPs</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Thu, 25 Jun 2026 06:28:02 +0000</pubDate>
      <link>https://dev.to/tessl-io/cursors-new-leaderboard-shows-teams-the-most-popular-plugins-skills-and-mcps-3d1n</link>
      <guid>https://dev.to/tessl-io/cursors-new-leaderboard-shows-teams-the-most-popular-plugins-skills-and-mcps-3d1n</guid>
      <description>&lt;p&gt;As engineering teams adopt more agent tooling, keeping track of what's actually running across an organisation has become its own problem. Plugins, skills, and MCP servers get configured differently by different developers, with no shared view of what teammates are using, what's proven out, or what's worth standardising on. The result is a sprawl of JSON config files and scattered settings that nobody has full visibility into.&lt;/p&gt;

&lt;p&gt;Cursor's &lt;a href="https://x.com/cursor_ai/status/2069512593887092811" rel="noopener noreferrer"&gt;latest update&lt;/a&gt; takes aim at that. Version 3.9, &lt;a href="https://cursor.com/changelog/customize" rel="noopener noreferrer"&gt;released June 22&lt;/a&gt;, introduces what the company calls a "Customize" page — a single interface for managing plugins, skills, MCP servers, subagents, rules, commands, and hooks across an organisation, controllable at user, team, or workspace level.&lt;/p&gt;

&lt;h2&gt;
  
  
  A leaderboard that shows what teammates actually use
&lt;/h2&gt;

&lt;p&gt;The headline feature is a leaderboard showing which plugins, skills, and MCPs are most used both within a team and across the broader Cursor community. For skills, the leaderboard surfaces how many times each has been used by the team in the past 30 days, and what proportion of those invocations were agent-initiated versus human-initiated — useful signal for understanding which skills are genuinely being put to work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fx1nmtjr1itfpozza2tya.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fx1nmtjr1itfpozza2tya.gif" alt="Leaderboard (Skills)" width="799" height="471"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Leaderboard (Skills)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For plugins, teams can see how many teammates have already added a given plugin, and click through to add it to their own setup in one step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fiaolt90w5b7rowyssqe0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fiaolt90w5b7rowyssqe0.gif" alt="Leaderboard (plugins)" width="720" height="423"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Leaderboard (plugins)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Previously, there was no way to see what teammates had configured — adoption was an individual, manual process with no shared signal. The leaderboard turns it into a discovery surface driven by real usage data, drawing on both internal team behaviour and community-wide trends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Canvases, shared dashboards, and broader marketplace support
&lt;/h2&gt;

&lt;p&gt;The update also introduces prebuilt plugin canvases — shared, interactive dashboards that render live data from partner tools directly inside Cursor. The Atlassian canvas, for instance, pulls a real-time view of Jira issues, sprint progress, and project documents into the editor, giving teams a live window into their project state without switching context. Teams get a ready-made starting point they can open and reuse, rather than building the wiring themselves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fagf4e38i1fqso35sa84s.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fagf4e38i1fqso35sa84s.gif" alt="Plugin canvas" width="599" height="406"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Plugin canvas&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Team marketplaces, which allow organisations to distribute private plugins internally, &lt;a href="https://cursor.com/changelog/customize#new-team-marketplaces" rel="noopener noreferrer"&gt;now also support&lt;/a&gt; GitLab, Bitbucket, and Azure DevOps repositories — previously the feature was limited to GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor's bigger picture: SpaceX, a GitHub challenger, and a quiet acqui-hire
&lt;/h2&gt;

&lt;p&gt;The update lands at a moment when Cursor is the most closely watched company in developer tools. &lt;a href="https://www.cnbc.com/2026/06/16/spacex-spcx-cursor-acquisition-ipo.html" rel="noopener noreferrer"&gt;SpaceX recently confirmed&lt;/a&gt; a $60 billion all-stock deal to acquire Cursor's parent company Anysphere — the largest acquisition of a venture-backed startup on record. Around the same time, &lt;a href="https://thenewstack.io/cursor-origin-github-disruption/" rel="noopener noreferrer"&gt;Cursor unveiled Origin&lt;/a&gt;: an agent-native code hosting platform designed as a challenger to GitHub, which has been logging hundreds of incidents over the past year as it struggles to keep pace with the volume of code AI agents are generating.&lt;/p&gt;

&lt;p&gt;Elsewhere, &lt;a href="https://thenewstack.io/cursor-acquires-continue-coding/" rel="noopener noreferrer"&gt;Cursor also quietly absorbed&lt;/a&gt; open-source coding assistant &lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;Continue&lt;/a&gt;, in an acqui-hire that shut down the product and handed its codebase to the community under its existing Apache 2.0 licence.&lt;/p&gt;

&lt;p&gt;For engineering leaders already managing Cursor deployments at scale, the governance question is only going to grow as agent tooling becomes more embedded in how teams work. A unified control plane and a usage leaderboard won't resolve every challenge, but they give platform teams something they didn't have before: a clear view of what's actually running.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>The new Tessl review: now you decide what "good" looks like</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Wed, 24 Jun 2026 06:41:25 +0000</pubDate>
      <link>https://dev.to/tessl/the-new-tessl-review-now-you-decide-what-good-looks-like-581n</link>
      <guid>https://dev.to/tessl/the-new-tessl-review-now-you-decide-what-good-looks-like-581n</guid>
      <description>&lt;h2&gt;
  
  
  The new Tessl review: now you decide what "good" looks like
&lt;/h2&gt;

&lt;p&gt;For a while now Tessl has been able to review the quality of your skills straight out of the box. By simply running &lt;code&gt;tessl skill review&lt;/code&gt; you get a score against Anthropic's best practices with no setup required. That is a sensible default and it has served most people well, but a default is still somebody else's opinion that you or your organisation might look at and disagree with.&lt;/p&gt;

&lt;p&gt;Today we are launching a new version of Tessl’s review functionality. It does three new things: it reviews your skills &lt;strong&gt;agentically&lt;/strong&gt; with greater accuracy, lets you define what &lt;strong&gt;good&lt;/strong&gt; actually means for your skills, and keeps a shareable &lt;strong&gt;history&lt;/strong&gt; of your skill review runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with one definition of good
&lt;/h2&gt;

&lt;p&gt;On one of my skills, the current review provides a quality score of 82%. The description review scores a perfect 100%, but the content section drops to 55%, with conciseness at 1 out of 3 and progressive disclosure at 1 out of 3.&lt;/p&gt;

&lt;p&gt;In some people’s view, nothing is wrong with the skill, but the judge is marking it down for keeping one tight, self-contained skill rather than spreading it across five files. That is a reasonable position and it is Anthropic's position. But what if your org prefers larger, consolidated skills? In that case, an 82 is punishing me for doing exactly what we want. We might even have further constraints that my skill fails to meet but that the review overlooks entirely, giving me a false sense of quality.&lt;/p&gt;

&lt;p&gt;Here’s a video of the new Tessl review in action:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=2O2cQ2x_nbo" rel="noopener noreferrer"&gt;Watch on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Offering a more accurate review
&lt;/h2&gt;

&lt;p&gt;The new Tessl review is invoked using &lt;code&gt;tessl review run&lt;/code&gt; from the CLI or via the agent (but make sure it’s calling the new version!). You need to pass a workspace name, which is where your review results will be stored.&lt;/p&gt;

&lt;p&gt;One of the bigger changes is under the hood. Whereas the previous review used an LLM as a judge in a single pass, the new version uses an agent. It takes more turns, gathers more information about the skill and its associated files, and reaches a better, more grounded verdict. You will still see some variation between runs, since an LLM judge is non-deterministic by its very nature, but the results are more accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining what good skills look like for your organization
&lt;/h2&gt;

&lt;p&gt;This is the exciting part, because it changes how a review decides what’s right: the new review lets you pass your own rubric, as a plugin, and review against it.&lt;/p&gt;

&lt;p&gt;We’ve made a plugin called &lt;code&gt;review-plugin-creator&lt;/code&gt; that walks you through building a custom review plugin. This allows you to fork the Anthropic best practices if you only wish to change a few things, so everything sensible stays in place by default and you only change what you disagree with. In my case I flipped a single rule, the one that punishes consolidated skills.&lt;/p&gt;

&lt;p&gt;The creator produces a plugin holding your guidelines and rubric. To use it on a &lt;code&gt;tessl review run&lt;/code&gt;, you can point to it locally in the file system, or link to a private or public plugin on the Tessl Registry.&lt;/p&gt;

&lt;p&gt;Run the same skill again, this time with your rules, and you’ll see updated scores. In my case, the consolidated skill now scores full marks on conciseness and progressive disclosure, and the content section reflects what my org actually values rather than what a generic default assumes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seeing your reviews
&lt;/h2&gt;

&lt;p&gt;Everything you see at the CLI is also on the Tessl Registry. Head to your workspace and you will find your review plugin alongside a full history of review runs. Each run shows the same breakdown you get in the terminal, plus the plugin that produced it, so you always know which definition of good a score was measured against.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fo5wy2gtfkjs2j9wmibx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fo5wy2gtfkjs2j9wmibx0.png" alt="image1" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In your workspace settings you can set a default review plugin. From then on every review run from that workspace uses it automatically. You can still override it per run with the &lt;code&gt;--review-plugin&lt;/code&gt; flag whenever you need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rest of the toolkit
&lt;/h2&gt;

&lt;p&gt;A few more commands worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;tessl review list --workspace &amp;lt;workspace-name&amp;gt;&lt;/code&gt; lists every review run against a workspace.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;tessl review view &amp;lt;review-id&amp;gt;&lt;/code&gt; opens a single run and shows its full output.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;tessl review fix&lt;/code&gt; is the new home for the &lt;code&gt;--optimize&lt;/code&gt; behaviour you already know from our previous review. It agentically applies fixes to the skill based on a review outcome and can update your &lt;code&gt;SKILL.md&lt;/code&gt; directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What does this mean for the old command?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;tessl skill review&lt;/code&gt; is not going anywhere yet. We have deliberately left it in place so nothing breaks for anyone relying on it today, although you may see a deprecation message. That said, &lt;code&gt;tessl review run&lt;/code&gt; is where all the work is going from here, so please move across and start using it so you’re not caught out when we do turn off the older review feature. We’ll also be releasing updates to our GitHub Actions soon to make use of the new Tessl review functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it now
&lt;/h2&gt;

&lt;p&gt;The new Tessl review is live and you can use it today. Do note that you’ll need a free account in order to use the Tessl review command (you can check the full documentation &lt;a href="https://docs.tessl.io/improving-your-skills/tessl-review?utm_source=website&amp;amp;utm_medium=website&amp;amp;utm_content=header-banner" rel="noopener noreferrer"&gt;here&lt;/a&gt;). There is plenty more to come and we will keep you posted as it lands. For now, run it against your own skills, write a rubric that matches how your team actually thinks about quality, then tell us how it performs in your environment. Your feedback shapes what we build next.&lt;/p&gt;

&lt;p&gt;Customise Tessl review: &lt;a href="https://tessl.io/registry/tessl/review-plugin-creator?utm_source=website&amp;amp;utm_medium=website&amp;amp;utm_content=header-banner" rel="noopener noreferrer"&gt;https://tessl.io/registry/tessl/review-plugin-creator&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Learn more about Tessl: &lt;a href="https://tessl.io" rel="noopener noreferrer"&gt;https://tessl.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
  </channel>
</rss>
