<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tessl</title>
    <description>The latest articles on DEV Community by Tessl (@tessl).</description>
    <link>https://dev.to/tessl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12956%2Fa0174916-e61b-4172-b5d6-29c9445932f5.png</url>
      <title>DEV Community: Tessl</title>
      <link>https://dev.to/tessl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tessl"/>
    <language>en</language>
    <item>
      <title>Claude Fable 5 vs Opus 4.8: The Mythos Hype Meets Reality</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Sun, 14 Jun 2026 06:39:18 +0000</pubDate>
      <link>https://dev.to/tessl/claude-fable-5-vs-opus-48-the-mythos-hype-meets-reality-od3</link>
      <guid>https://dev.to/tessl/claude-fable-5-vs-opus-48-the-mythos-hype-meets-reality-od3</guid>
      <description>&lt;p&gt;For months, the most interesting model at Anthropic was one we could not use. Mythos was the internal system the company said was too capable to release, the one that found software vulnerabilities at a level that tripped its own safety thresholds. On June 9, 2026, that tier went public for the first time, as Claude Fable 5. Opus 4.8, the model anchoring production coding agents, suddenly had a successor that's a full capability class above it.&lt;/p&gt;

&lt;p&gt;This raises two questions for anyone running coding agents. The practical one is whether you should move your fleet from Opus 4.8 to Fable 5. The bigger one is whether a Mythos-class model, the tier Anthropic held back as too capable to ship, lives up to what the name promised. This article answers both, and the numbers tell a more interesting story than the announcement did.&lt;/p&gt;

&lt;p&gt;We ran both models through the same evaluation, close to 1000 shared scenarios scored twice each, once with no skill supplied and once with the relevant skill in context. The short answer, as of mid-2026, is that Opus 4.8 is still the better value for most agent fleets, and the gap between the Mythos hype and the measured reality is the real story in the data.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Mythos-class model is a tier of Claude that sits above the Opus class in capability&lt;/strong&gt;. It reaches a threshold Anthropic considers high-risk, particularly at discovering and exploiting software vulnerabilities. Fable 5 and Mythos 5 are the same underlying model with the same capabilities. What separates them is the safeguards: Fable 5 is the public version that ships with safety classifiers, while Mythos 5, restricted to approved partners, runs without them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the industry expected from a Mythos-class model
&lt;/h2&gt;

&lt;p&gt;Before launch, the speculation was not subtle. Across Reddit, X, and a run of explainer posts, Mythos was framed as the model that would change how agents work, not just how well they answer. The recurring predictions clustered around four capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Restructuring a large codebase in one coherent pass.&lt;/li&gt;
&lt;li&gt;  Spotting security flaws that experienced engineers miss.&lt;/li&gt;
&lt;li&gt;  Working unsupervised for hours on a single hard problem.&lt;/li&gt;
&lt;li&gt;  Acting like a collaborator, not an assistant you steer turn by turn.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of the four, the cybersecurity claim was the one with hard evidence behind it. Through Project Glasswing, roughly 50 early partners with Mythos Preview access reported finding more than 10,000 high or critical severity vulnerabilities, and the program has since expanded past 150 organizations. Anthropic's CPO Mike Krieger called it "the most capable class of systems we've built." That is the dream the name sold: a model so powerful it stayed in the lab.&lt;/p&gt;

&lt;p&gt;What reached the public is narrower, and deliberately so. The model you can actually use is Fable 5, the Mythos-class system wrapped in safety classifiers. Whether it delivers comes down to the gap between that promise and what was released.&lt;/p&gt;

&lt;h2&gt;
  
  
  The headline numbers: Claude Fable 5 vs Opus 4.8
&lt;/h2&gt;

&lt;p&gt;Every scenario in the evaluation is a real agent task tied to a published skill, scored on two axes: instruction-following (does the agent do what it was told, in the way it was told) and task-completion (does it reach the goal). The overall score weights instruction-following at 4 and task-completion at 3, then divides by 7. Each task runs with and without the skill, so the lift from the skill is visible directly. The tasks and skills are public, in the &lt;a href="https://huggingface.co/datasets/tesslio/task-evals-for-skills" rel="noopener noreferrer"&gt;task-evals-for-skills dataset&lt;/a&gt;, so you can inspect any scenario yourself.&lt;/p&gt;
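
&lt;p&gt;As a rough illustration of that weighting, here is a minimal Python sketch. The function name is ours, not part of the eval harness, and the inputs are the rounded with-skill axis scores from the headline table.&lt;/p&gt;

```python
def overall_score(instruction_following: float, task_completion: float) -> float:
    """Blend the two axes with weights 4 and 3, then divide by 7."""
    return (4 * instruction_following + 3 * task_completion) / 7


# Rounded with-skill axis scores (instruction-following, task-completion).
fable = overall_score(89.3, 97.8)
opus = overall_score(88.0, 97.4)

print(f"Fable 5: {fable:.1f}, Opus 4.8: {opus:.1f}")
```

&lt;p&gt;Because instruction-following carries the larger weight, a model that reaches the goal but ignores the skill's conventions loses more than one that follows them and falls slightly short.&lt;/p&gt;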

&lt;p&gt;This design is deliberate. The tasks come from published skills, so they mirror the real work teams write skills for, not frontier puzzles meant to find a model's ceiling. That is why task-completion runs high for both models and why the signal that separates them is instruction-following: doing the work the specific way the skill asks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension (with skill)&lt;/th&gt;
&lt;th&gt;Fable 5&lt;/th&gt;
&lt;th&gt;Opus 4.8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overall score&lt;/td&gt;
&lt;td&gt;92.9&lt;/td&gt;
&lt;td&gt;92.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall score (no skill, baseline)&lt;/td&gt;
&lt;td&gt;75.7&lt;/td&gt;
&lt;td&gt;74.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall lift from the skill&lt;/td&gt;
&lt;td&gt;+17.2&lt;/td&gt;
&lt;td&gt;+17.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instruction-following&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;88.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task-completion&lt;/td&gt;
&lt;td&gt;97.8&lt;/td&gt;
&lt;td&gt;97.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turns to complete&lt;/td&gt;
&lt;td&gt;16.9&lt;/td&gt;
&lt;td&gt;16.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens per task&lt;/td&gt;
&lt;td&gt;9,025&lt;/td&gt;
&lt;td&gt;10,687&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;List price (input / output, per MTok)&lt;/td&gt;
&lt;td&gt;$10 / $50&lt;/td&gt;
&lt;td&gt;$5 / $25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per task (average)&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$0.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Points per dollar&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;125&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On the 917 scenarios both models ran, Fable 5 leads on overall score by 0.9 points (92.9 to 92.0). Scenario by scenario, the two tie on 61% of tasks, Fable wins 24%, and Opus wins 16%, at a two-point threshold. A capability class above Opus, and on everyday agent skill tasks the quality difference is inside the noise.&lt;/p&gt;
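
&lt;p&gt;A scenario-by-scenario comparison of this kind can be sketched as follows. This is an assumption-laden illustration: the score pairs are invented, and treating a difference of exactly two points as a win rather than a tie is our choice, not something the eval specifies.&lt;/p&gt;

```python
from collections import Counter


def classify(fable_score: float, opus_score: float, threshold: float = 2.0) -> str:
    """Label one scenario as a Fable win, an Opus win, or a tie."""
    diff = fable_score - opus_score
    if diff >= threshold:
        return "fable"
    if -diff >= threshold:
        return "opus"
    return "tie"


# Hypothetical (Fable, Opus) overall scores for five scenarios.
pairs = [(95.0, 94.0), (88.0, 80.0), (70.0, 91.5), (100.0, 100.0), (92.0, 90.0)]
tally = Counter(classify(f, o) for f, o in pairs)

for outcome, count in tally.items():
    print(f"{outcome}: {count / len(pairs):.0%}")
```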

&lt;p&gt;One caveat sits underneath that number. The 917 are the tasks both models completed and scored. Fable 5 refused 26 that Opus 4.8 finished, and we excluded them, so the near-tie is measured only on the tasks Fable agreed to do. That exclusion turns out to be the most revealing part of the comparison, and we return to it below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why agent skill evaluation matters more than the model upgrade
&lt;/h2&gt;

&lt;p&gt;Here is the number that reframes the comparison. The skill adds about 17 overall points to both models: +17.2 for Fable 5 and +17.5 for Opus 4.8. The model upgrade from Opus 4.8 to Fable 5 adds less than 1 point on shared tasks. The context you supply moves the agent far more than the frontier tier you pick.&lt;/p&gt;

&lt;p&gt;The lift concentrates in instruction-following, where both models gain more than 27 points from the skill, while task-completion gains under 5. Both models can usually reach the goal on their own. What they cannot do reliably without a skill is follow the specific conventions, constraints, and steps a real task demands. That is what a good skill encodes.&lt;/p&gt;

&lt;p&gt;Skill receptivity is how much an agent's output improves when you supply a relevant skill. It shows up mostly as better instruction-following. It matters because it can outweigh the model choice, which is the practical case for investing in &lt;a href="https://tessl.io/registry" rel="noopener noreferrer"&gt;agent skills&lt;/a&gt; before chasing the newest tier. Running the same task with and without the skill, then measuring the difference, is a task eval. It is also the only way to know whether a model upgrade earns its price on your workload, which is what &lt;a href="https://tessl.io/blog/introducing-task-evals-measure-whether-your-skills-actually-work/" rel="noopener noreferrer"&gt;agent skill evaluation&lt;/a&gt; is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The price gap is the deciding factor for most teams
&lt;/h2&gt;

&lt;p&gt;On the agent skill tasks we measured, the trade comes down to paying a steep premium for a marginal gain. Fable 5 lists at $10 per million input tokens and $50 per million output tokens against Opus 4.8's $5 and $25, exactly twice across every token category, including cache reads and writes. For that, across our 917 shared scenarios, you get an overall score of 92.9 versus 92.0, a 0.9-point edge that sits well inside the range where the two are interchangeable. This is the everyday-agent-work picture, not a verdict on the marquee Mythos capabilities our eval does not test.&lt;/p&gt;

&lt;p&gt;Token behavior softens the unit price but does not close it. Across the 917 shared scenarios Fable 5 generated about 16% fewer output tokens per task (9,025 versus 10,687), so the real cost per task lands at $1.25 against $0.74, a 73% premium rather than a clean 2x. The value gap is the number to remember: Opus 4.8 returns 125 points per dollar to Fable 5's 74, about 69% more quality for every dollar spent.&lt;/p&gt;
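
&lt;p&gt;The arithmetic behind those ratios is simple enough to reproduce. The sketch below uses only the rounded figures from the headline table, so its results can differ by a few points from the numbers quoted in the text, which were presumably computed from unrounded per-task costs.&lt;/p&gt;

```python
def points_per_dollar(score: float, cost_per_task: float) -> float:
    """Quality bought per dollar: overall score divided by average task cost."""
    return score / cost_per_task


# Rounded with-skill figures from the headline table.
fable = {"score": 92.9, "cost": 1.25, "output_tokens": 9025}
opus = {"score": 92.0, "cost": 0.74, "output_tokens": 10687}

token_saving = 1 - fable["output_tokens"] / opus["output_tokens"]
cost_premium = fable["cost"] / opus["cost"] - 1
fable_ppd = points_per_dollar(fable["score"], fable["cost"])
opus_ppd = points_per_dollar(opus["score"], opus["cost"])

print(f"Fable 5 output-token saving: {token_saving:.1%}")
print(f"Fable 5 per-task cost premium: {cost_premium:.1%}")
print(f"Points per dollar: Fable 5 {fable_ppd:.0f}, Opus 4.8 {opus_ppd:.0f}")
```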

&lt;p&gt;For a single session the difference is cents. For a fleet running thousands of agent tasks a day, it is the line item your finance team will ask about, and twice the price for under a point of quality on the tasks most teams actually run is not an easy answer to give them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fable refuses work Opus completes without issues
&lt;/h2&gt;

&lt;p&gt;The most consequential difference between Fable 5 and Opus 4.8 is not on the scoreboard. It is the safety layer that defines the Mythos class.&lt;/p&gt;

&lt;p&gt;Fable 5 ships with safeguards covering four domains: cybersecurity, biology and chemistry, distillation, and frontier LLM development. For the first three, a triggered request is blocked. Anthropic's design can hand the blocked request to Opus 4.8 and inform the user, but that fallback is opt-in rather than the default, so in a stock harness like ours the blocked requests simply came back as refusals.&lt;/p&gt;

&lt;p&gt;The fourth domain worked differently during this run. By Anthropic's own documentation, requests touching frontier AI development were not refused or even flagged. The model quietly steered or fine-tuned its answer instead, with no notice to the user. That silent manipulation drew the sharpest backlash, and on June 11, the day after this run, Anthropic switched it to a visible classifier like the other three while conceding the restrictions had been "overly conservative." Because it never produced a refusal, that domain leaves no mark in our numbers; any effect would surface only as quietly weaker answers.&lt;/p&gt;

&lt;p&gt;A Mythos-class model routes some requests to a weaker model by design, so your harness needs to detect the fallback rather than trust that every response came from Fable. And the affected domains are exactly the ones you most want to check yourself, which is the practical edge of &lt;a href="https://tessl.io/blog/the-tessl-registry-now-has-security-scores-powered-by-snyk/" rel="noopener noreferrer"&gt;context governance and security&lt;/a&gt;: catch the regression in an eval, not in production.&lt;/p&gt;

&lt;p&gt;Our run shows how that plays out, and it is not flattering. Fable 5 refused 26 of the roughly 940 tasks it attempted, returning a usage-policy block with a refusal stop reason instead of doing the work, while Opus 4.8 completed and scored every one of them. What Fable refused is the revealing part. Four were defensive security reviews, including "review this Flask application for security vulnerabilities before deploying it," blocked as "violative cyber content." Five were routine bioinformatics tasks, such as running quality control on a single-cell RNA-seq file. One was a literature review on the landscape of AI-assisted drug discovery. A model from the class Anthropic markets for finding vulnerabilities in critical software declined to audit a Flask app for the developer who owns it. Anthropic's own "overly conservative" admission lands hardest here.&lt;/p&gt;
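
&lt;p&gt;A harness can surface these cases instead of silently dropping them. The sketch below is a minimal illustration, not our actual harness: the &lt;code&gt;RunResult&lt;/code&gt; record, the scenario names, and the exact stop-reason strings are assumptions, so check what your own client library records.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

REFUSAL = "refusal"  # assumed stop-reason value; confirm against your own logs


@dataclass
class RunResult:
    scenario: str  # hypothetical scenario identifier
    stop_reason: str
    score: Optional[float] = None


def refused_scenarios(results: list[RunResult]) -> set[str]:
    """Scenarios the model declined rather than attempted."""
    return {r.scenario for r in results if r.stop_reason == REFUSAL}


def shared_scores(a: list[RunResult], b: list[RunResult]) -> dict[str, tuple[float, float]]:
    """Pair scores only for scenarios both models completed and scored."""
    done_a = {r.scenario: r.score for r in a if r.stop_reason != REFUSAL and r.score is not None}
    done_b = {r.scenario: r.score for r in b if r.stop_reason != REFUSAL and r.score is not None}
    return {s: (done_a[s], done_b[s]) for s in sorted(set(done_a).intersection(done_b))}


fable_runs = [RunResult("flask-security-review", REFUSAL), RunResult("auth0-login", "end_turn", 95.0)]
opus_runs = [RunResult("flask-security-review", "end_turn", 93.0), RunResult("auth0-login", "end_turn", 96.0)]

print("Refused by Fable 5:", refused_scenarios(fable_runs))
print("Shared scenarios:", shared_scores(fable_runs, opus_runs))
```

&lt;p&gt;Reporting the refused set alongside the shared scores keeps the exclusion visible, which matters when the refusals cluster in the domains you care about most.&lt;/p&gt;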

&lt;p&gt;On the security tasks Fable did complete, it was competitive. Across 51 authentication and security skill scenarios, from Auth0, Better Auth, and Bitwarden, Fable 5 averaged 95.0 with the skill against Opus 4.8's 96.6, a near-tie. The lesson is not that one model is safe and the other is not. It is that a Mythos-class model will sometimes refuse the defensive work you most need done, and only an eval on your own tasks will tell you where.&lt;/p&gt;

&lt;h2&gt;
  
  
  Did Fable deliver on the Mythos promise?
&lt;/h2&gt;

&lt;p&gt;Our evaluation answers the question that matters for a deployment decision: how both models handle hundreds of real, skill-driven agent tasks across dozens of tool ecosystems, which is the work most teams actually run coding agents on. The marquee Mythos feats sit outside this eval, but the day-to-day behavior it captures is exactly what you are buying when you point a fleet at a model.&lt;/p&gt;

&lt;p&gt;What the data does show is where Fable's extra capability surfaces in normal use. Grouped by the organization that owns the skill, Fable 5 pulls ahead on web-research and scraping workloads: Apify (+7.8 overall), Google Gemini (+4.6), Tavily (+3.4), and Firecrawl (+2.7). If your agents fetch, map, and extract from the open web, Fable 5 is the stronger pick. Opus 4.8 holds its ground where Fable regresses: Mastra (-7.3), Auth0 (-4.5), and Axiom (-2.5).&lt;/p&gt;

&lt;p&gt;So the Mythos dream of an autonomous collaborator is not what most teams will buy on day one. What they will buy is a model that is marginally better at instruction-following, meaningfully better at web research, twice the price, and gated by classifiers that occasionally hand the job to Opus 4.8 anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use each
&lt;/h2&gt;

&lt;p&gt;Choose Opus 4.8 if you run a coding-agent fleet at scale and care about cost per task. The quality difference is inside the noise for most workloads, Opus returns far more points per dollar, and it has no fallback layer to design around.&lt;/p&gt;

&lt;p&gt;Choose Fable 5 if your agents do heavy web research and scraping, if you need its reasoning depth on long-horizon tasks, or if you have a workload that genuinely benefits from the capability class above Opus. Budget for the roughly 73% per-task premium, and build fallback detection into your harness from day one. If your work touches the classifier domains, confirm the model is not silently routing to Opus 4.8 before you depend on it.&lt;/p&gt;

&lt;p&gt;Fable's edge shows up when you build around it, not when you swap it into an Opus 4.8 pipeline unchanged. Fable is the more autonomous model, but that edge only pays off in flows built for it: longer unsupervised runs, larger units of work, less step-by-step steering.&lt;/p&gt;

&lt;p&gt;For almost everyone, the larger lever is neither model. The skill adds about 17 points; the model upgrade adds less than 1. Standardize the model in your tessl.json, prove the switch with an eval before you roll it to the fleet, and watch for the tasks a Mythos-class model quietly declines to do.&lt;/p&gt;

&lt;p&gt;Want to see how a skill changes your own agent's behavior, on your own tasks, across both models? Start with the &lt;a href="https://tessl.io/registry" rel="noopener noreferrer"&gt;Tessl Registry&lt;/a&gt; and run the eval before you switch.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>Same quality, a quarter of the cost: Should DeepSeek Flash be your model of choice?</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Thu, 11 Jun 2026 06:59:02 +0000</pubDate>
      <link>https://dev.to/tessl/same-quality-a-quarter-of-the-cost-should-deepseek-flash-be-your-model-of-choice-1c85</link>
      <guid>https://dev.to/tessl/same-quality-a-quarter-of-the-cost-should-deepseek-flash-be-your-model-of-choice-1c85</guid>
      <description>&lt;p&gt;&lt;strong&gt;$0.0236&lt;/strong&gt; is how much DeepSeek V4 Flash costs to run a complete agentic task, skill included, on the Fireworks price sheet. Claude Haiku 4.5 costs $0.10 for the same task. Sonnet 4.6 costs $0.30.&lt;/p&gt;

&lt;p&gt;On quality, Flash scores 82.3 in our evals and Haiku scores 82.9. With skills applied, the evals point to the two being comparable, yet Haiku costs four times as much.&lt;/p&gt;

&lt;p&gt;We ran 19 model configurations through the same benchmark harness. The tasks were real agentic tasks, and for each we measured total token counts and priced them at the rates the provider charges. To be honest, the value story we expected to find was "cheap models are a trap." What we found instead was more interesting, and particularly useful if you're running agents at any kind of scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, the Pro comparison
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 ships two tiers: Pro and Flash. In our eval runs, Pro costs &lt;strong&gt;$0.183/task&lt;/strong&gt; and Flash costs &lt;strong&gt;$0.0236/task&lt;/strong&gt;. That's a &lt;strong&gt;7.7× price gap within the same model family&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What you get for the extra spend is only three points: on the eval results, Pro scores 85.3 and Flash scores 82.3. At scale, choosing Pro over Flash costs roughly an extra &lt;strong&gt;$19,000/year&lt;/strong&gt; at 10,000 tasks/month and an extra &lt;strong&gt;$190,000/year&lt;/strong&gt; at 100,000 tasks/month. That is a lot to pay for a three-point difference that may be hard to notice in practice.&lt;/p&gt;
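
&lt;p&gt;The annual figures follow directly from the per-task prices. A minimal sketch of the arithmetic, assuming a constant monthly task volume (the function name is ours):&lt;/p&gt;

```python
PRO_COST_PER_TASK = 0.183
FLASH_COST_PER_TASK = 0.0236


def extra_annual_cost(tasks_per_month: int) -> float:
    """Extra yearly spend from running Pro instead of Flash at a steady volume."""
    return (PRO_COST_PER_TASK - FLASH_COST_PER_TASK) * tasks_per_month * 12


for volume in (10_000, 100_000):
    print(f"{volume:,} tasks/month: ${extra_annual_cost(volume):,.0f}/year extra")
```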

&lt;h2&gt;
  
  
  Points-per-dollar
&lt;/h2&gt;

&lt;p&gt;Dividing the eval score by the cost per task gives points per dollar, a single ratio of quality to cost. It is useful as long as the model's overall quality already satisfies your needs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score (w/ skill)&lt;/th&gt;
&lt;th&gt;$/task&lt;/th&gt;
&lt;th&gt;pts/$&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;82.3&lt;/td&gt;
&lt;td&gt;$0.024&lt;/td&gt;
&lt;td&gt;3,482&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;82.9&lt;/td&gt;
&lt;td&gt;$0.097&lt;/td&gt;
&lt;td&gt;829&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;85.3&lt;/td&gt;
&lt;td&gt;$0.183&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM 5.1&lt;/td&gt;
&lt;td&gt;90.4&lt;/td&gt;
&lt;td&gt;$0.200&lt;/td&gt;
&lt;td&gt;451&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;90.8&lt;/td&gt;
&lt;td&gt;$0.296&lt;/td&gt;
&lt;td&gt;303&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
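
&lt;p&gt;The ranking can be recomputed from the table's score and $/task columns. This sketch uses the rounded costs shown, so the ratios it prints will not match the pts/$ column exactly, but the ordering comes out the same.&lt;/p&gt;

```python
# (score with skill, cost per task in dollars), as listed in the table.
models = {
    "DeepSeek V4 Flash": (82.3, 0.024),
    "Haiku 4.5": (82.9, 0.097),
    "DeepSeek V4 Pro": (85.3, 0.183),
    "GLM 5.1": (90.4, 0.200),
    "Sonnet 4.6": (90.8, 0.296),
}

# Points per dollar for each model, ranked from best value to worst.
value = {name: score / cost for name, (score, cost) in models.items()}
ranking = sorted(value, key=value.get, reverse=True)

for name in ranking:
    print(f"{name:20s} {value[name]:8.0f} pts/$")
```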

&lt;h2&gt;
  
  
  The number your cost model is probably missing
&lt;/h2&gt;

&lt;p&gt;Cost-per-token is the number everyone tends to quote and often mistakenly use as the most important factor in making a decision. It's also the number that will quietly blow your budget if you're not watching turns per solve as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr60224o4620wv6dkop8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr60224o4620wv6dkop8i.png" alt="tokens/turn" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flash averages around 20 turns per task, which is pretty manageable. But the worst-case runs in our dataset hit roughly 10× that. This isn’t unusual for models in this class, but in dollar terms it means a single task can cost as much as 10 average tasks. Multiply that across thousands of concurrent agent runs and you may start to have a budget problem that didn't show up in your per-token estimate.&lt;/p&gt;

&lt;p&gt;Most teams don't catch this because agent frameworks surface token counts by default. Turn count, the variable that actually drives fat-tail cost explosions, often needs to be logged explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrument your agents for turns, not just tokens.&lt;/strong&gt; Know your median and your 95th percentile. Set your timeout policies against the 95th, not the median, or you're either killing valid runs or absorbing surprise bills.&lt;/p&gt;
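
&lt;p&gt;Once turn counts are logged, the two numbers take a few lines to compute. A minimal sketch using Python's standard library; the turn counts below are invented for illustration, with one fat-tail run included.&lt;/p&gt;

```python
import statistics


def turn_budget(turns: list[int]) -> tuple[float, float]:
    """Return the median and 95th-percentile turn count for a set of runs."""
    median = statistics.median(turns)
    # n=20 yields cut points at 5%, 10%, ..., 95%; the last one is the 95th percentile.
    p95 = statistics.quantiles(turns, n=20, method="inclusive")[-1]
    return median, p95


# Hypothetical logged turn counts for 15 runs of the same task type.
logged_turns = [12, 15, 18, 19, 20, 20, 21, 22, 24, 25, 26, 30, 35, 48, 210]
median, p95 = turn_budget(logged_turns)

print(f"median turns: {median}, 95th percentile: {p95:.1f}")
```

&lt;p&gt;A turn cap set near the 95th percentile lets most long but valid runs finish, while still bounding the cost of a run that has gone off the rails.&lt;/p&gt;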

&lt;h2&gt;
  
  
  The skill is doing half the work
&lt;/h2&gt;

&lt;p&gt;One thing worth being very direct about here is that Flash's 82.3 score is a &lt;strong&gt;skill-augmented score&lt;/strong&gt;. Without a skill, Flash scores 64.1. The skill adds +18.2 points.&lt;/p&gt;

&lt;p&gt;That lift is real, but very conditional on the skill being precise, well-scoped, and actually relevant to the task. A vague skill will drag you back down closer to the 64.1 baseline, whereas a sharp one gets you 82.3.&lt;/p&gt;

&lt;p&gt;This matters more than most model evaluations acknowledge, because a model you test in a playground usually runs without a skill or relevant context, so you see only its raw capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going further: find cheaper models and test them yourself
&lt;/h2&gt;

&lt;p&gt;The analysis above shows the cheapest hosted options we measured. But there are two obvious next steps if you want to push it further, and both are more accessible than you might think.&lt;/p&gt;

&lt;p&gt;Every model in this benchmark that isn't a GPT, Claude, or Gemini model has publicly available weights. That includes DeepSeek V4 Flash and GLM 5.1, and you can run all of them yourself. When you do, the marginal token cost drops to near zero. You're paying for compute (GPU rental or owned infra), not per-call pricing.&lt;/p&gt;

&lt;p&gt;The maths of self-hosting only makes sense above a certain volume threshold, since the ops overhead and GPU costs aren't free. But if you're running tens of thousands of agentic tasks per month, the crossover point is lower than you'd expect.&lt;/p&gt;

&lt;p&gt;The skill in this benchmark is doing +18.2 points of work. The question worth asking is: where did that skill come from, and how do you know it's any good?&lt;/p&gt;

&lt;p&gt;The Tessl registry is a good place to start: it shows the quality, impact, and security posture of a skill. Before you write a skill from scratch, check whether one already exists and has eval data behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate your skills properly.&lt;/strong&gt; You can run two types of evaluation: reviews (automated quality assessment of whether your skill is well-structured) and task evals (end-to-end runs that measure whether the skill actually improves agent performance on real tasks). The task eval output is exactly the kind of "with skill / without skill" delta that the Flash benchmark is built on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use skill quality as a model selection input.&lt;/strong&gt; The 18-point lift Flash gets from a well-scoped skill isn't a fixed number; it depends on the skill and the tasks. A skill that has been evaluated by Tessl with a high task eval score gives you confidence that the lift is real and reproducible. A skill that's never been evaluated is a variable you can't account for in your cost modelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your own workload, not someone else's benchmark.&lt;/strong&gt; The task eval system lets you define scenarios from your actual codebase and run them. That's the self-evaluation framework described above.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaways, flat out
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;DeepSeek V4 Flash at $0.0236/task is the value pick.&lt;/strong&gt; Haiku costs 4× more for 0.6 points. Pro costs 7.7× more for 3 points.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Set a quality floor before you rank by cost.&lt;/strong&gt; pts/$ flatters cheap-and-weak models. Above 80 points, it's a real signal.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Instrument for turns, not just tokens.&lt;/strong&gt; Your 95th percentile turn count is the budget variable nobody's logging.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The skill is doing half the work.&lt;/strong&gt; A bad skill collapses your score back to baseline. Evaluate your skills — with task evals, not vibes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You can run this yourself.&lt;/strong&gt; 20-30 tasks, turn logging, a spreadsheet, and Tessl's eval system.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Self-hosting open source models is a real option.&lt;/strong&gt; The weights are public, the ops trade-off is real. You should run your own evals with your models to see if they can be substituted in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tier name told you Flash was cheap; the data says it's also good. Now you have the tools to find out whether that holds for what &lt;em&gt;you're&lt;/em&gt; building.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
    <item>
      <title>AI Coding Agent Accuracy: Opus 4.7 vs 4.8</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Tue, 09 Jun 2026 07:23:08 +0000</pubDate>
      <link>https://dev.to/tessl/ai-coding-agent-accuracy-opus-47-vs-48-3051</link>
      <guid>https://dev.to/tessl/ai-coding-agent-accuracy-opus-47-vs-48-3051</guid>
      <description>&lt;p&gt;You are deciding whether to roll your default agent model from Opus 4.7 to 4.8. The release notes promise improvements, the leaderboard moves a fraction of a point, so you shrug, schedule the upgrade for a quiet Friday, and move on.&lt;/p&gt;

&lt;p&gt;We ran both versions through the same skills evaluation, roughly 850 scenarios solved twice each, and on the headline metric they finished level. Underneath the tie, though, 4.8 reached the same answers in four fewer turns and for measurably less money, so the upgrade that looks like a non-event on the scoreboard turns out to be a real efficiency gain in the place that actually bills you: the agent loop.&lt;/p&gt;

&lt;p&gt;AI agent evaluation measures how an agent behaves on real tasks rather than only scoring its final answer, tracking cost, turns, and reliability across paired runs. The reason to bother is that two models can post the same score while spending very different amounts of work to reach it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two versions, one eval harness
&lt;/h2&gt;

&lt;p&gt;Both models ran the identical setup. Every scenario is solved twice, once with no help and once with the relevant skill installed, so we can isolate what the skill contributes from what the base model already knows. We score three things: instruction following (did the agent do what the skill tells it to do), task completion (did it reach the goal), and an overall blend weighted toward instruction following. We also flag integrity issues, like an agent peeking at the grading rubric instead of solving the task.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is the incumbent. In our runs it is a strong agent that leans heavily on skills to reach its ceiling, and it explores a lot of paths to get there.&lt;/p&gt;

&lt;p&gt;Opus 4.8 is the point release. It posts the same ceiling with a skill installed, but it starts from a higher floor without one, and it gets to the answer with noticeably less wandering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI coding agent accuracy stops being the story
&lt;/h2&gt;

&lt;p&gt;Here is the head-to-head on the shared scenario set, all with the relevant skill installed unless noted.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overall score&lt;/td&gt;
&lt;td&gt;91.9&lt;/td&gt;
&lt;td&gt;92.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Baseline score, no skill&lt;/td&gt;
&lt;td&gt;71.4&lt;/td&gt;
&lt;td&gt;74.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task completion&lt;/td&gt;
&lt;td&gt;97.1&lt;/td&gt;
&lt;td&gt;97.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instruction following&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turns per task&lt;/td&gt;
&lt;td&gt;19.2&lt;/td&gt;
&lt;td&gt;15.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens per task&lt;/td&gt;
&lt;td&gt;7,820&lt;/td&gt;
&lt;td&gt;9,763&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per task, API pricing&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;about 5% lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integrity flags raised&lt;/td&gt;
&lt;td&gt;10.2%&lt;/td&gt;
&lt;td&gt;7.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The overall accuracy gap is 0.2 points. If you stopped reading at the row labeled "overall score," you would conclude nothing changed. Three other rows complicate that picture.&lt;/p&gt;

&lt;p&gt;The first is the baseline. Without any skill, 4.8 scores 74.1 against 4.7's 71.4, a gain of about 2.7 points, and its no-skill instruction following climbed from the high 50s into the low 60s. The ceiling is shared because the skill pulls both versions up to roughly the same place. The floor is where 4.8 actually improved, and that has a practical consequence: 4.8 depends on the skill slightly less to do good work. This suggests some of the knowledge previously only present in skills has been trained into the model weights.&lt;/p&gt;

&lt;p&gt;The second is turns. 4.8 finishes the average task in 15.0 turns versus 19.2 for 4.7, a 21% reduction. In an agent loop, a turn is a full round trip of context, reasoning, and tool use. Cutting four turns off the average task lowers latency, reduces the chances for an agent to talk itself into a wrong path, and, as we will see, lowers cost.&lt;/p&gt;

&lt;p&gt;The third is integrity. The eval flags runs where the agent took a shortcut, like reading the grading rubric or reaching outside its workspace. Those flags dropped from 10.2% of shared runs to 7.9%. 4.8 is modestly more disciplined about how it reaches an answer. This matches Anthropic’s claims about 4.8 being more honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading the cost: turns, not tokens
&lt;/h2&gt;

&lt;p&gt;Look again at two rows that seem to contradict each other. 4.8 produces more output per task, 9,763 tokens against 7,820, yet it costs about 5% less.&lt;/p&gt;

&lt;p&gt;This is because output volume does not dominate agentic cost. The dominant term is the context replayed on every turn. Each turn re-sends the accumulated conversation and tool results, and in long agent runs that cached input swamps the fresh output the model writes. Fewer turns means fewer replays, so 4.8 can be more verbose inside each turn and still come out ahead, because it takes four fewer turns to converge.&lt;/p&gt;
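
&lt;p&gt;A toy model makes the mechanism concrete. The sketch below is not our billing code: the per-token prices and the 4,000 tokens of context added per turn are hypothetical values chosen for illustration, and the turn counts are the table's averages rounded to whole turns. It only shows how replayed input grows with the square of the turn count while output grows linearly.&lt;/p&gt;

```python
# Toy cost model. Prices and context size are illustrative assumptions,
# not figures from the eval or from any published price list.
CACHED_INPUT_PRICE = 0.5e-6   # hypothetical dollars per replayed input token
OUTPUT_PRICE = 25e-6          # hypothetical dollars per output token


def task_cost(turns, output_tokens, context_per_turn=4000):
    """Estimate the cost of one task where every turn replays prior context."""
    # Turn k re-sends roughly k * context_per_turn tokens, so the total
    # replayed input is context_per_turn * (1 + 2 + ... + turns).
    replayed_tokens = context_per_turn * turns * (turns + 1) // 2
    return replayed_tokens * CACHED_INPUT_PRICE + output_tokens * OUTPUT_PRICE


# Average turns (rounded) and output tokens from the comparison table above.
cost_47 = task_cost(turns=19, output_tokens=7820)
cost_48 = task_cost(turns=15, output_tokens=9763)
print(f"4.7: ${cost_47:.3f}  4.8: ${cost_48:.3f}")
```

&lt;p&gt;Under these assumptions 4.8 comes out cheaper despite writing more output. The size of the gap depends entirely on the assumed prices and context size, so treat it as an illustration of the mechanism rather than a reproduction of the roughly 5% we measured.&lt;/p&gt;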

&lt;p&gt;Model cards show only the per-token rate, which sets the price of a unit of work. Turn count sets how many units the model decides to spend. A point release that holds accuracy flat while spending 21% fewer turns is working on that second term, which is the one that scales with your usage.&lt;/p&gt;

&lt;p&gt;The same dynamic shows up in how each version absorbs a skill. Adding the relevant skill is not free: it pulls in instructions and reference material the agent has to process, and the question is how efficiently the model turns that overhead into a result.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Effect of installing the skill&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overall score gain&lt;/td&gt;
&lt;td&gt;+20.5&lt;/td&gt;
&lt;td&gt;+18.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost increase&lt;/td&gt;
&lt;td&gt;+38%&lt;/td&gt;
&lt;td&gt;+12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn increase&lt;/td&gt;
&lt;td&gt;+41%&lt;/td&gt;
&lt;td&gt;+14%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On 4.7, switching on a skill added 41% more turns to cash in a 20 point accuracy gain. On 4.8, the same class of skill buys nearly the same gain for much less turn and cost overhead. 4.8 treats a skill more like a shortcut and less like an invitation to explore. If you run agent skills at scale, that lower skill tax compounds across every task you ship.&lt;/p&gt;
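
&lt;p&gt;One rough way to put a number on the skill tax is to divide the score gain by the cost increase. The ratio below is our own illustrative metric, not something the eval reports; the inputs are copied from the table above.&lt;/p&gt;

```python
# Figures copied from the skill-effect table above.
skill_effect = {
    "Opus 4.7": {"score_gain": 20.5, "cost_increase_pct": 38},
    "Opus 4.8": {"score_gain": 18.0, "cost_increase_pct": 12},
}


def points_per_cost_pct(effect):
    """Score points gained for each 1% of extra cost (an illustrative ratio)."""
    return effect["score_gain"] / effect["cost_increase_pct"]


for model, effect in skill_effect.items():
    print(f"{model}: {points_per_cost_pct(effect):.2f} points per 1% added cost")
# Opus 4.7: 0.54 points per 1% added cost
# Opus 4.8: 1.50 points per 1% added cost
```

&lt;p&gt;By this measure 4.8 extracts close to three times as much score from each unit of skill overhead, even though its absolute gain is slightly smaller.&lt;/p&gt;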

&lt;h2&gt;
  
  
  The one place 4.8 regressed
&lt;/h2&gt;

&lt;p&gt;A fair comparison reports where the new version loses ground. Per scenario, the record is close to a wash: 4.8 scored higher on 23% of shared tasks, tied on 61%, and scored lower on 17%, using a two point threshold. The interesting part is that the losses cluster.&lt;/p&gt;

&lt;p&gt;4.8 regressed on web research and scraping skill families. Firecrawl tasks dropped 3.3 points on average across 72 scenarios. LangChain dropped 2.9 points across 48. Smaller families like Tavily and Apify fell further, 10.4 and 7.6 points, though on fewer tasks. Meanwhile 4.8 improved on infrastructure, auth, and code tooling: Cloudflare gained 4.5 points across 38 scenarios, Auth0 gained 4.3 across 18, and Mastra gained 10.1 across 10.&lt;/p&gt;

&lt;p&gt;The aggregate hid this completely, because the gains and losses nearly cancel. Only a per domain breakdown surfaces it. That is the whole argument for paired skill evals over a single leaderboard number: the headline can be a tie while two coherent shifts run in opposite directions underneath it.&lt;/p&gt;
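
&lt;p&gt;The shape of that breakdown is easy to sketch. The example below uses made-up paired scores and hypothetical family names, and it assumes a difference of exactly two points counts as a win or loss rather than a tie. It is a minimal illustration of the idea, not our eval pipeline.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 2.0  # two point threshold; treating it as inclusive is an assumption

# Hypothetical paired results: (skill family, old score, new score).
paired_runs = [
    ("scraping", 90.0, 85.0),
    ("scraping", 88.0, 87.0),
    ("infra", 80.0, 86.0),
    ("infra", 84.0, 84.0),
]


def classify(delta, threshold=THRESHOLD):
    """Label one paired scenario as a win, loss, or tie for the new model."""
    if delta >= threshold:
        return "win"
    if -delta >= threshold:
        return "loss"
    return "tie"


def summarize(runs):
    """Return the overall mean delta, per-family mean deltas, and outcomes."""
    deltas_by_family = defaultdict(list)
    for family, old, new in runs:
        deltas_by_family[family].append(new - old)
    overall = mean(d for ds in deltas_by_family.values() for d in ds)
    per_family = {f: mean(ds) for f, ds in deltas_by_family.items()}
    outcomes = {f: [classify(d) for d in ds] for f, ds in deltas_by_family.items()}
    return overall, per_family, outcomes


overall, per_family, outcomes = summarize(paired_runs)
print(overall)     # 0.0: the aggregate looks like a tie
print(per_family)  # {'scraping': -3.0, 'infra': 3.0}
```

&lt;p&gt;The overall delta is zero while the two families move three points in opposite directions, which is the pattern a single leaderboard number cannot show.&lt;/p&gt;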

&lt;h3&gt;
  
  
  When to roll forward to 4.8
&lt;/h3&gt;

&lt;p&gt;Roll forward to 4.8 if your agents run long, multi turn tasks where turn count, latency, and cost matter, which is most production agent work. You get the same accuracy ceiling, a higher floor before skills, a 21% turn reduction, a cheaper skill tax, and fewer integrity flags. If your workloads lean on infrastructure, auth, or general code tooling, 4.8 is flat to clearly better.&lt;/p&gt;

&lt;p&gt;Test before you roll forward if your agents live in the scrape, crawl, and summarize world. The web research regression is small in absolute terms but consistent across the families we measured. Run your own A/B on your top scraping workflows first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway: measure behavior, not the changelog
&lt;/h2&gt;

&lt;p&gt;A skeptic has two reasonable objections. The first: a flat score is just no improvement, so why care? Because two models can tie on accuracy while one takes 21% fewer turns and about 5% less budget to get there. The second: these are our eval harness costs, not yours. That is true of the absolute figures, but the relative differences in turns, tokens, and cost reflect model behavior, which does generalize.&lt;/p&gt;

&lt;p&gt;Measure each release on behavior, on your own tasks, with skills installed and with them stripped out, and look at the per domain breakdown before you trust the average.&lt;/p&gt;

&lt;p&gt;Want to see how your own stack behaves across a model upgrade? Browse the Tessl Registry to find the skills your agents depend on, then run the same paired evaluations we used here to measure what actually changed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>agentskills</category>
      <category>agents</category>
    </item>
    <item>
      <title>AI Native DevCon Day 2: From Agent Demos to Operating Models</title>
      <dc:creator>Rohan Sharma</dc:creator>
      <pubDate>Wed, 03 Jun 2026 06:40:09 +0000</pubDate>
      <link>https://dev.to/tessl/ai-native-devcon-day-2-from-agent-demos-to-operating-models-51hf</link>
      <guid>https://dev.to/tessl/ai-native-devcon-day-2-from-agent-demos-to-operating-models-51hf</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Day 2 of AI Native DevCon shifted from agent capability to operating discipline. The strongest sessions focused on how teams can run AI-native delivery with clearer context pipelines, measurable agent behavior, safer execution boundaries, and better organizational ownership.&lt;/p&gt;

&lt;p&gt;The scale showed up in the numbers too. Across the two days, DevCon brought together 650+ in-person registrations, around 2,000 online registrations, and a packed mix of sessions, workshops, hallway conversations, and practical lessons.&lt;/p&gt;

&lt;p&gt;Day 2 leaned into workshops. That shift mattered because the second day was less about proving agents can do useful work and more about showing how teams can make that work repeatable.&lt;/p&gt;

&lt;p&gt;Hey there, welcome back. &lt;a href="https://www.linkedin.com/in/rohan-sharma-9386rs/" rel="noopener noreferrer"&gt;Rohan Sharma&lt;/a&gt; here again, continuing the DevCon series.&lt;/p&gt;

&lt;p&gt;Day 1 gave us the framing, including &lt;a href="https://www.linkedin.com/in/guypo/" rel="noopener noreferrer"&gt;Guy Podjarny&lt;/a&gt;’s core point that skills should be treated like real software assets. Day 2 picked up from there and moved into the operating details. Once agents are inside daily engineering work, platform and product teams need to decide what changes first, who owns those changes, and how the results are measured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5t7ov7qd0wqaq0ohk0s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5t7ov7qd0wqaq0ohk0s.jpg" alt="day1" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Talks that shaped Day 2
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Harness engineering beyond code
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/marcsloan/" rel="noopener noreferrer"&gt;Marc Sloan&lt;/a&gt; from Tessl focused on the next gap many teams are hitting. Code context is increasingly structured, but product and design context still lives in external systems such as Figma, Notion, and Linear. Pulling that context live can reduce staleness, but it introduces drift in evals, versioning, and reproducibility.&lt;/p&gt;

&lt;p&gt;The practical lesson was to stop treating external product and design context as random reference material. Teams need a defined layer between the repository and those external systems, with clear versioning so evaluations can be replayed against known context snapshots.&lt;/p&gt;

&lt;p&gt;Without that, agents can produce work that looks technically correct while missing the product constraint that actually mattered. That is a very expensive kind of almost-right.&lt;/p&gt;

&lt;h3&gt;
  
  
  From vibes to metrics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/simonobstbaum/" rel="noopener noreferrer"&gt;Simon Obstbaum&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/robertgwilloughby/" rel="noopener noreferrer"&gt;Rob Willoughby&lt;/a&gt; from Tessl delivered a session focused on a challenge many engineering leaders are currently facing. Their distinction between output evals and trajectory evals is operationally important. A good answer is not enough if the agent used risky tools, skipped required checks, or ignored policy steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypw3qaky2rov21ea2q8j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypw3qaky2rov21ea2q8j.jpg" alt="rob" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The useful measurement model came down to activation, trajectory, and outcome. Did the right skill trigger? Did the agent follow the right steps? Was the final result actually useful and correct?&lt;/p&gt;

&lt;p&gt;The good part was the emphasis on partial compliance. Pass or fail is too blunt for agent workflows. If a workflow degrades halfway through, teams need to know where it happened, not just that something felt off.&lt;/p&gt;
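
&lt;p&gt;As a rough illustration of partial compliance, the sketch below scores a trajectory by how many required steps appear in order and reports the first one that was skipped. The step names and the scoring rule are hypothetical, not the method presented in the session, and the function assumes at least one required step.&lt;/p&gt;

```python
def trajectory_compliance(required_steps, observed_steps):
    """Return (fraction of required steps seen in order, first missed step)."""
    position = 0
    completed = 0
    first_missed = None
    for step in required_steps:
        try:
            # Search only after the previously matched step to respect ordering.
            position = observed_steps.index(step, position) + 1
            completed += 1
        except ValueError:
            if first_missed is None:
                first_missed = step
    return completed / len(required_steps), first_missed


# Hypothetical workflow: the agent skipped the test run before opening a PR.
required = ["read_spec", "run_tests", "open_pr"]
observed = ["read_spec", "edit_file", "open_pr"]
score, missed = trajectory_compliance(required, observed)
print(round(score, 2), missed)  # 0.67 run_tests
```

&lt;p&gt;A pass or fail check would only say this run failed. The partial score says it was two thirds compliant and points at the step where it degraded.&lt;/p&gt;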

&lt;h3&gt;
  
  
  Benchmarking beyond the model
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://uk.linkedin.com/in/amit-kushwaha28" rel="noopener noreferrer"&gt;Amit Kushwaha&lt;/a&gt; highlighted why many current benchmarks miss real agent behavior. Agent systems run long traces with tool calls, context accumulation, and latency bottlenecks that one-shot benchmark numbers do not capture.&lt;/p&gt;

&lt;p&gt;For teams choosing infrastructure, the warning was clear. Do not optimize only for model speed. Real agent workloads involve tools, memory, caches, retries, and long-running traces.&lt;/p&gt;

&lt;p&gt;The better benchmark is closer to production reality, with multi-turn tasks, tool latency, tail latency, and cache behavior over time. Otherwise teams risk picking systems that look great in a chart and struggle in the actual workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safe execution boundaries for agents
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/shelajev/" rel="noopener noreferrer"&gt;Oleg Šelajev&lt;/a&gt; from Docker covered a problem every platform team eventually sees. An unconstrained agent can make high-impact changes in the wrong environment. Sandboxing is not optional once agents are allowed to execute.&lt;/p&gt;

&lt;p&gt;The practical takeaway was to treat environment policy as part of the harness. Filesystem access, network access, secrets, and permissions all need clear boundaries before agents are given the ability to act.&lt;/p&gt;

&lt;p&gt;This is how teams lower blast radius. Not by hoping the agent behaves nicely, but by designing the room it is allowed to move around in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not write prompts, write software
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/jbaruch" rel="noopener noreferrer"&gt;Baruch Sadogursky&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/maceybaker/" rel="noopener noreferrer"&gt;Macey Baker&lt;/a&gt; from Tessl reinforced an idea that keeps proving useful in production. Break behavior into modular skills instead of maintaining one giant prompt. This makes agent behavior easier to test, review, and reuse.&lt;/p&gt;

&lt;p&gt;The message was not “write a better mega prompt.” It was to turn repeatable behavior into composable skills that match real workflow stages. That gives teams something they can review, test, improve, and share across repos.&lt;/p&gt;

&lt;p&gt;If you try one thing from this workshop, use the materials and skill templates as a starting point. Prototype one small skill pipeline in your own environment before trying to scale the pattern across every repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kept coming up across the day
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Context quality is now a platform responsibility
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/marcsloan/" rel="noopener noreferrer"&gt;Marc Sloan&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/smithshaun/" rel="noopener noreferrer"&gt;Shaun Smith&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/in/john-groetzinger/" rel="noopener noreferrer"&gt;John Groetzinger&lt;/a&gt; approached this from different angles, but the operational message was consistent. Context delivery is becoming an engineering system, not documentation hygiene. Teams need predictable context pipelines for both humans and agents.&lt;/p&gt;

&lt;p&gt;The next step is ownership. Teams need to know who maintains context sources, how often they refresh, and how changes are versioned. Context also needs observability so teams can trace which inputs shaped an agent decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Agent performance needs production-grade telemetry
&lt;/h3&gt;

&lt;p&gt;The sessions from &lt;a href="https://www.linkedin.com/in/simonobstbaum/" rel="noopener noreferrer"&gt;Simon Obstbaum&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/robertgwilloughby/" rel="noopener noreferrer"&gt;Rob Willoughby&lt;/a&gt; from Tessl, plus &lt;a href="https://uk.linkedin.com/in/amit-kushwaha28" rel="noopener noreferrer"&gt;Amit Kushwaha&lt;/a&gt; from NVIDIA and &lt;a href="https://www.linkedin.com/in/justincormack/" rel="noopener noreferrer"&gt;Justin Cormack&lt;/a&gt;, former CTO at Docker, made this very concrete. Teams need to measure how agents worked, not only what they returned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36vlk57fuoml49x7hh5c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36vlk57fuoml49x7hh5c.jpg" alt="justin" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trajectory metrics belong next to existing quality signals. If your dashboards already show test health, release health, or incident trends, agent workflow quality should sit in the same operational view.&lt;/p&gt;

&lt;p&gt;The benchmark scenarios should also look like real work. Multi-turn, tool-heavy, slightly messy, and full of the same constraints your teams face every day. Justin’s observability point connected neatly here too. Teams need runtime signals that can reveal agent-induced drift before it becomes a bigger production problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Adoption is an organizational design problem, not a tooling checkbox
&lt;/h3&gt;

&lt;p&gt;Talks from &lt;a href="https://www.linkedin.com/in/tammuzdubnov/" rel="noopener noreferrer"&gt;Tammuz Dubnov&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/birgittaboeckeler/" rel="noopener noreferrer"&gt;Birgitta Böckeler&lt;/a&gt; from Thoughtworks showed that adoption succeeds when review structures, ownership boundaries, and team rituals evolve with the tooling.&lt;/p&gt;

&lt;p&gt;That means setting explicit contribution boundaries for AI-assisted changes and updating review criteria. The diff still matters, but so does the path the agent took to produce it. Birgitta’s adoption data made this especially grounded by showing where hidden costs appear, including review load, technical debt, and maintainability when speed becomes the only metric.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Workshops made the ideas practical
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/jbaruch" rel="noopener noreferrer"&gt;Baruch Sadogursky&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/maceybaker/" rel="noopener noreferrer"&gt;Macey Baker&lt;/a&gt; from Tessl, along with &lt;a href="https://www.linkedin.com/in/alfonso-graziano/" rel="noopener noreferrer"&gt;Alfonso Graziano&lt;/a&gt; from Nearform, helped turn the bigger Day 2 ideas into something teams could actually try. The workshop-heavy format made the day feel less like theory and more like practice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/derekashmore/" rel="noopener noreferrer"&gt;Derek Ashmore&lt;/a&gt;’s packed workshop, &lt;strong&gt;“The AI Agent Testing Pyramid,”&lt;/strong&gt; focused on the different levels of testing agent systems need. For those following from home, you can attempt it on your own by following &lt;a href="https://github.com/AsperitasConsulting/research-summarizer-agent" rel="noopener noreferrer"&gt;this repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxs4pmkc5rz7jh3wy667.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxs4pmkc5rz7jh3wy667.jpg" alt="derek" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lamis-mukta/" rel="noopener noreferrer"&gt;Aashrey Tiku&lt;/a&gt; from Anthropic worked through a hands-on session on shipping a managed agent. It was a useful bridge between agent concepts and the practical work of packaging, managing, and operating an agent with the right boundaries.&lt;/p&gt;

&lt;p&gt;That mattered because AI-native development is still new enough that people need patterns they can test, not just concepts they can nod along to. Alfonso’s spec-driven angle fit well here because prompts become far more useful when they are turned into testable, production-ready specifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Agent enablement needs real ownership
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/anatomic/" rel="noopener noreferrer"&gt;Ian Thomas&lt;/a&gt; from Meta and &lt;a href="https://www.linkedin.com/in/katie-roberts-3bbb2316/" rel="noopener noreferrer"&gt;Katie Roberts&lt;/a&gt; from Nearform made the enablement side feel practical. Rollouts work better when platform safeguards are paired with updated team rituals, clear ownership, and realistic guidance for brownfield systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpu0182z5excdycdvlnl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpu0182z5excdycdvlnl.jpg" alt="ian" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Katie’s legacy advice was especially useful. AI should help teams modernize incrementally, not generate another fragile layer on top of systems that are already hard to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you missed Day 1, &lt;a href="https://www.youtube.com/watch?v=akZ85mG5HXY" rel="noopener noreferrer"&gt;start here&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Day 2 was workshop-heavy. If you missed the &lt;a href="https://www.youtube.com/watch?v=akZ85mG5HXY" rel="noopener noreferrer"&gt;Day 1 virtual stream&lt;/a&gt;, start with these talks before digging into the workshop themes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/guypo/" rel="noopener noreferrer"&gt;Guy Podjarny&lt;/a&gt;, Tessl&lt;/strong&gt; - Skills are the new Code&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/dglawson" rel="noopener noreferrer"&gt;Dana Lawson&lt;/a&gt;, Netlify&lt;/strong&gt; - Built for Humans. Now Agents Are Here.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/jimbomoss/" rel="noopener noreferrer"&gt;James Moss&lt;/a&gt;, Tessl&lt;/strong&gt; - Using skills to pay the bills&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/talliran/" rel="noopener noreferrer"&gt;Liran Tal&lt;/a&gt;, Snyk&lt;/strong&gt; - Your AI Agent Installed Malware Because a SKILL.md Told It To&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/ryanlopopolo/?_l=en_US" rel="noopener noreferrer"&gt;Ryan Lopopolo&lt;/a&gt;, OpenAI&lt;/strong&gt; - Harness Engineering&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://be.linkedin.com/in/patrickdebois" rel="noopener noreferrer"&gt;Patrick Debois&lt;/a&gt;, Tessl&lt;/strong&gt; - The Rise of Agent Enablement&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/shachar-azriel-215748127/" rel="noopener noreferrer"&gt;Shachar Azriel&lt;/a&gt;, Baz&lt;/strong&gt; - Executable Specs&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/may-walterr/" rel="noopener noreferrer"&gt;May Walter&lt;/a&gt;, Hud&lt;/strong&gt; - Runtime Intelligence for Continuous Agentic Performance Optimization&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.linkedin.com/in/dave-farley-a67927" rel="noopener noreferrer"&gt;&lt;strong&gt;Dave Farley&lt;/strong&gt;&lt;/a&gt; - Vibe Coding: Is this really the best we can do?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That set gives the right foundation for Day 2 across skills, context, verification, security, harnesses, runtime feedback, and team enablement.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Native DevCon is not over yet!
&lt;/h2&gt;

&lt;p&gt;We are already working on the next AI DevCon, and yes, we are very excited to say that AI DevCon NYC is officially on the way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rcrewhayjs1otwkikvh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rcrewhayjs1otwkikvh.jpg" alt="devcon nyc" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If Day 1 gave the frame and Day 2 showed the operating model, NYC is where the conversation gets even more practical. Expect more on skills, harnesses, agent safety, context systems, benchmarking, product workflows, and what it really takes to make AI-native delivery work inside teams.&lt;/p&gt;

&lt;p&gt;Super-early-bird seats are available now. If you want to be in the room for the next round of conversations, this is the time to grab a spot.&lt;/p&gt;

&lt;p&gt;In the meantime, &lt;a href="https://tessl.io/newsletter/" rel="noopener noreferrer"&gt;register for the AI DevCon newsletter&lt;/a&gt;. We will release the content shared over the conference, including selected highlights, session clips, notes, slide decks, and workshop materials as they are published.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>security</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AI Native DevCon Day 1: Making AI Agents Ready for Enterprise</title>
      <dc:creator>Rohan Sharma</dc:creator>
      <pubDate>Tue, 02 Jun 2026 08:37:13 +0000</pubDate>
      <link>https://dev.to/tessl/ai-native-devcon-day-1-making-ai-agents-ready-for-enterprise-1e50</link>
      <guid>https://dev.to/tessl/ai-native-devcon-day-1-making-ai-agents-ready-for-enterprise-1e50</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Day 1 of &lt;strong&gt;AI Native DevCon&lt;/strong&gt; was a practical reality check for AI-native software development. IRL tickets sold out, and the room was packed with 650+ builders. Agents are moving beyond demos, and teams now need better skills, context, verification, security, and enablement to make them dependable.&lt;/p&gt;

&lt;p&gt;Hey there! Welcome back. &lt;a href="https://www.linkedin.com/in/rohan-sharma-9386rs/" rel="noopener noreferrer"&gt;Rohan Sharma&lt;/a&gt; here 👋&lt;/p&gt;

&lt;p&gt;The first day of DevCon felt less like a normal developer conference and more like the industry collectively agreeing on something important. Coding agents are powerful, but they do not become production-ready on their own. Production readiness is earned through reliability, testing, and governance, not just compelling demonstrations.&lt;/p&gt;

&lt;p&gt;The common thread was reliability. How do we make AI-native development work for teams, not just polished demos?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41v3c2mrdbgo0bi0li78.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41v3c2mrdbgo0bi0li78.jpg" alt="1780295131363.jpeg" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://uk.linkedin.com/in/simonmaple" rel="noopener noreferrer"&gt;Simon Maple&lt;/a&gt; opened DevCon by setting the frame. The question is no longer whether AI changes software development. It is how teams, platforms, and engineering cultures adapt now that agents are part of the workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sessions that shaped Day 1
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Skills are the new Code
&lt;/h3&gt;

&lt;p&gt;&lt;a href="//linkedin.com/in/guypo/"&gt;Guy Podjarny&lt;/a&gt;’s keynote, &lt;strong&gt;“Skills are the new Code”&lt;/strong&gt;, gave the day its strongest early thesis. The instructions, skills, and context we give agents are becoming a real unit of software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3r64e92kpxj30dr84yt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3r64e92kpxj30dr84yt.png" alt="guypo" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If a skill shapes agent behaviour, it needs intent, review, testing, versioning, and maintenance. Your &lt;code&gt;SKILL.md&lt;/code&gt; file cannot be the chaotic group chat of your engineering process. It needs structure.&lt;/p&gt;

&lt;p&gt;Teams are already relying on agent instructions. The missing piece is treating those instructions like production assets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Platforms built for humans now need to work for agents
&lt;/h3&gt;

&lt;p&gt;&lt;a href="//linkedin.com/in/dglawson"&gt;Dana Lawson&lt;/a&gt; from Netlify focused on a practical platform challenge. Most dev tools still assume a human is reading logs, checking previews, and interpreting CLI output.&lt;/p&gt;

&lt;p&gt;Agents need something different. They need structured signals, machine-readable feedback, and clear next actions. Otherwise they guess, retry blindly, or break something with full confidence.&lt;/p&gt;

&lt;p&gt;Giving agents human-only logs is often insufficient. The data may be available, but agents need structured, machine-readable context to use it effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  From solo skill hacks to organizational enablement
&lt;/h3&gt;

&lt;p&gt;&lt;a href="//linkedin.com/in/jimbomoss/"&gt;James Moss&lt;/a&gt; from Tessl took the skills conversation into team territory with &lt;strong&gt;“Using skills to pay the bills”&lt;/strong&gt;. Solo agents are easy to experiment with. Team agents are where things get messy.&lt;/p&gt;

&lt;p&gt;Every developer can end up with different instructions, different context, and slightly different agent behaviour. If that layer is not shared, reviewed, and versioned, the team does not have one AI workflow. It has ten confused ones wearing the same hoodie.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://be.linkedin.com/in/patrickdebois" rel="noopener noreferrer"&gt;Patrick Debois&lt;/a&gt; expanded that idea in &lt;strong&gt;“Coding Agents Don’t Scale Themselves. Neither Do Your Teams.”&lt;/strong&gt; Organizations cannot simply roll out agent tooling and expect consistent results. Adoption requires enablement, governance, platform thinking, shared practices, and ways to measure whether these systems are genuinely improving outcomes.&lt;/p&gt;

&lt;p&gt;Taken together, both talks pointed to the same conclusion: successful agent adoption is less about the model and more about how teams operationalize it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills are also a supply-chain risk
&lt;/h3&gt;

&lt;p&gt;&lt;a href="//linkedin.com/in/talliran/"&gt;Liran Tal&lt;/a&gt;’s “Your AI Agent Installed Malware Because a &lt;a href="http://skill.md/" rel="noopener noreferrer"&gt;SKILL.md&lt;/a&gt; Told It To” focused on an often-overlooked security challenge. If a skill can influence agent behaviour, it becomes part of your supply chain.&lt;/p&gt;

&lt;p&gt;Teams need to audit skills, understand what they instruct agents to do, and avoid blindly installing context files because they look useful. Cute name, dangerous permissions. Classic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Harness engineering makes agent-first development serious
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/ryanlopopolo/?_l=en_US" rel="noopener noreferrer"&gt;Ryan Lopopolo&lt;/a&gt; from OpenAI discussed &lt;strong&gt;Harness Engineering&lt;/strong&gt;, a useful phrase for what agent-first development needs. Agents need the right context, sensible tool access, clear boundaries, verification loops, and feedback when something goes wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw87yflphkrz2b1kbkffv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw87yflphkrz2b1kbkffv.png" alt="ryan" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One practical takeaway was that &lt;strong&gt;"give the model the entire repository" is not a deployment strategy&lt;/strong&gt;. Effective agents need carefully scoped context, access to the right tools, and clear boundaries around what they can and cannot do. More context is not always better context.&lt;/p&gt;

&lt;p&gt;Ryan also emphasized the importance of &lt;strong&gt;verification and feedback loops&lt;/strong&gt;. Agents can generate code quickly, but production use requires mechanisms to evaluate outputs, catch mistakes, and continuously improve performance. The goal is not autonomous agents operating without oversight. It is systems where agents can work independently while remaining observable and accountable.&lt;/p&gt;

&lt;p&gt;The framing made agent-first engineering feel less vague. Agents can execute more, but humans still need to design the operating environment. Less typing every line, more steering the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kept coming up across the day
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Context is becoming infrastructure
&lt;/h3&gt;

&lt;p&gt;Across Guy Podjarny’s keynote, James Moss’ team workflow talk, &lt;a href="http://mozilla.ai/" rel="noopener noreferrer"&gt;Mozilla.ai&lt;/a&gt;’s cq, and &lt;a href="//linkedin.com/in/robertoverweg/"&gt;Robert Overweg&lt;/a&gt;’s shared-brain session, there was a clear thread running through the discussions.&lt;br&gt;&lt;br&gt;
Context is not background information anymore. It is infrastructure.&lt;/p&gt;

&lt;p&gt;The teams that get real value from agents will not be the ones with the longest prompts. They will be the ones with reusable, maintained, structured context that both humans and agents can trust. Your agent context should not look like your Downloads folder. We all know what that looks like. 😅&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Verification is the new bottleneck
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/shachar-azriel-215748127/" rel="noopener noreferrer"&gt;Shachar Azriel&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/simonmartinelli/" rel="noopener noreferrer"&gt;Simon Martinelli&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/may-walterr/" rel="noopener noreferrer"&gt;May Walter&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/in/dave-farley-a67927" rel="noopener noreferrer"&gt;Dave Farley&lt;/a&gt; all circled the same problem from different angles. Generating code is getting easier. Knowing whether that code is correct, safe, aligned with intent, and maintainable is the hard part.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8xdg3r46zwajnigxl72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8xdg3r46zwajnigxl72.png" alt="lieven" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If an AI workflow only optimizes for output speed, it becomes a very fast confusion machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. AI-native development is a team-design problem
&lt;/h3&gt;

&lt;p&gt;The organizational talks made the discussion feel more mature than the usual “everyone becomes 10x” stuff. AI changes review processes, team boundaries, product workflows, release safety, and how work moves from idea to production.&lt;/p&gt;

&lt;p&gt;The better advice was boring in the best way. Train people properly, revisit workflows often, keep humans focused on judgment and architecture, and measure outcomes instead of tool adoption.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Security cannot be bolted on later
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/jkcso/" rel="noopener noreferrer"&gt;Joseph Katsioloudes&lt;/a&gt; from GitHub and Liran Tal from Snyk made security feel immediate. AI can help scale security knowledge, but it also creates new failure modes such as unsafe generated code, malicious skills, supply-chain exposure, prompt injection, and leaky context.&lt;/p&gt;

&lt;p&gt;In other words, your agent may be smart, but please do not hand it the production keys and a Red Bull.&lt;/p&gt;

&lt;h2&gt;
  
  
  A few talks we'd &lt;a href="https://www.youtube.com/watch?v=akZ85mG5HXY" rel="noopener noreferrer"&gt;recommend watching&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;For teams trying to move from experimentation to real AI-native practice, these sessions are worth shortlisting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Guy Podjarny, Tessl&lt;/strong&gt; - Skills are the new Code&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dana Lawson, Netlify&lt;/strong&gt; - Built for Humans. Now Agents Are Here.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;James Moss, Tessl&lt;/strong&gt; - Using skills to pay the bills&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Liran Tal, Snyk&lt;/strong&gt; - Your AI Agent Installed Malware Because a SKILL.md Told It To&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ryan Lopopolo, OpenAI&lt;/strong&gt; - Harness Engineering&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Patrick Debois, Tessl&lt;/strong&gt; - The Rise of Agent Enablement&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Shachar Azriel, Baz&lt;/strong&gt; - Executable Specs&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;May Walter, Hud&lt;/strong&gt; - Runtime Intelligence for Continuous Agentic Performance Optimization&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dave Farley&lt;/strong&gt; - Vibe Coding - Is this really the best we can do?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That mix gives a strong picture of the day: context, skills, harnesses, verification, runtime feedback, security, and team enablement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The party bit
&lt;/h2&gt;

&lt;p&gt;After a full day of agent talk, context talk, and slightly scary security talk, the evening party was a good reset. People got to continue the hallway conversations, meet speakers, and process the day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld0ytji4c1d46fse8fkw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld0ytji4c1d46fse8fkw.jpg" alt="party" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Honestly, conferences need this part. Some of the best ideas do not happen in the session room. They happen when someone says, “wait, we had the same problem,” and a conversation turns into a new idea, solution, or connection. 😄&lt;/p&gt;

&lt;h2&gt;
  
  
  A small look at Day 2
&lt;/h2&gt;

&lt;p&gt;Day 2 continues the same themes, with more hands-on sessions and a few focused talks worth tracking through notes or recordings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Harness engineering beyond code - product &amp;amp; design constraints for agents&lt;/strong&gt; by &lt;a href="//linkedin.com/in/marcsloan/"&gt;Marc Sloan&lt;/a&gt;, Tessl&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Benchmarking the Agent Era: Measuring Performance Beyond the LLM&lt;/strong&gt; by &lt;a href="https://uk.linkedin.com/in/amit-kushwaha28" rel="noopener noreferrer"&gt;Amit Kushwaha&lt;/a&gt;, NVIDIA&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Connecting Context - Exploring Future Transports&lt;/strong&gt; by &lt;a href="//linkedin.com/in/smithshaun/"&gt;Shaun Smith&lt;/a&gt;, Hugging Face&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You’re absolutely right, it was your home directory!&lt;/strong&gt; by &lt;a href="//linkedin.com/in/shelajev/"&gt;Oleg Šelajev&lt;/a&gt;, Docker&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Don’t Write Prompts, Write Software&lt;/strong&gt; by &lt;a href="//linkedin.com/in/jbaruch"&gt;Baruch Sadogursky&lt;/a&gt; and &lt;a href="//linkedin.com/in/maceybaker/"&gt;Macey Baker&lt;/a&gt;, Tessl&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Day 1 gave the frame. Day 2 goes deeper into harnesses, skills, benchmarking, context, and agent safety.&lt;/p&gt;

&lt;p&gt;We'll be sharing more highlights, key takeaways, and session content from Day 2 over the coming weeks.&lt;/p&gt;

&lt;p&gt;If you'd like to follow along and get the latest updates as they're released, &lt;a href="https://tessl.io/newsletter/" rel="noopener noreferrer"&gt;sign up for the newsletter&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The main takeaway
&lt;/h2&gt;

&lt;p&gt;Day 1 made one thing clear. AI-native development is growing up.&lt;/p&gt;

&lt;p&gt;The strongest talks were not about replacing developers or chasing the latest model release. They were about the engineering work around agents: skills, context, harnesses, verification, security, and team enablement.&lt;/p&gt;

&lt;p&gt;And yes, your coding agent still has commitment issues. But after Day 1, at least the industry has a better couples therapy plan.&lt;/p&gt;

&lt;p&gt;Thank you for joining AI Native DevCon, whether you were in the room or following along virtually.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>security</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AI Native DevCon’26: The London conference for developers building with AI</title>
      <dc:creator>Rohan Sharma</dc:creator>
      <pubDate>Thu, 21 May 2026 06:06:32 +0000</pubDate>
      <link>https://dev.to/tessl/ai-native-devcon26-the-london-conference-for-developers-building-with-ai-4nm9</link>
      <guid>https://dev.to/tessl/ai-native-devcon26-the-london-conference-for-developers-building-with-ai-4nm9</guid>
      <description>&lt;p&gt;The bottleneck moved from writing code to governing it.&lt;/p&gt;

&lt;p&gt;The promise was 2× throughput. The reality is 2× the review queue, 2× the security exposure, and a CI signal you can no longer trust. &lt;a href="https://tessl.io/devcon" rel="noopener noreferrer"&gt;AI Native DevCon&lt;/a&gt; 2026 is for the engineering leaders who have to figure out how to ship anyway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tessl.io/devcon" rel="noopener noreferrer"&gt;AI Native DevCon&lt;/a&gt; 2026 lands at The Brewery in London on June 1 and 2, with a hybrid track for remote attendees. This is the conference for VPs of engineering, CTOs, platform owners, security leads, and senior engineers running agents in production, or about to. 500+ builders. Four tracks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp2a7mihm3ub05mz53mx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp2a7mihm3ub05mz53mx.png" alt="sponsors" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/guypo/" rel="noopener noreferrer"&gt;Guy Podjarny&lt;/a&gt;, founder of &lt;a href="https://tessl.io" rel="noopener noreferrer"&gt;Tessl&lt;/a&gt;, organizer of &lt;a href="https://tessl.io/devcon" rel="noopener noreferrer"&gt;AI Native DevCon&lt;/a&gt;, and previously of &lt;a href="https://snyk.io/" rel="noopener noreferrer"&gt;Snyk&lt;/a&gt;, frames the 2026 question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If 2025 was the year coding agents started showing real promise, 2026 is the year we figure out how they hold up in production. The challenge is no longer getting an agent to work, it is getting it to work consistently across teams, codebases, and environments without constant human correction.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The schedule is organized around four tracks: &lt;strong&gt;context engineering&lt;/strong&gt; (building with agents), &lt;strong&gt;agent orchestration&lt;/strong&gt; (verification when CI is no longer enough), &lt;strong&gt;organizational enablement&lt;/strong&gt; (coordination at agent throughput), and &lt;strong&gt;agent enablement&lt;/strong&gt; (security and governance). Each maps to a problem most teams are already hitting.&lt;/p&gt;

&lt;p&gt;The agenda is built around the problems these teams actually have right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. “We do not know how to build with agents yet.”
&lt;/h2&gt;

&lt;p&gt;How the engineer’s role is changing, and what products designed for humans need to do once agents start using them. By 2026, that question lands on every platform team. This is the context engineering track.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/ryanlopopolo/" rel="noopener noreferrer"&gt;&lt;strong&gt;Ryan Lopopolo&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(OpenAI)&lt;/strong&gt;, &lt;em&gt;Harness Engineering&lt;/em&gt;. Concrete patterns for systems where humans set direction and agents execute, including the review and approval surfaces that keep it safe at scale.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/dglawson" rel="noopener noreferrer"&gt;&lt;strong&gt;Dana Lawson&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(Netlify, CTO)&lt;/strong&gt;, &lt;em&gt;Built for Humans. Now Agents Are Here.&lt;/em&gt; What changes in a developer platform when half the users are non-human, and the API and UX decisions Netlify made in response.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/anatomic/" rel="noopener noreferrer"&gt;&lt;strong&gt;Ian Thomas&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(Meta)&lt;/strong&gt;, &lt;em&gt;AI Native Engineering&lt;/em&gt;. How a large engineering org is restructuring workflows around agent-assisted development.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://uk.linkedin.com/in/steve-ruiz-61a150239" rel="noopener noreferrer"&gt;&lt;strong&gt;Steve Ruiz&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(tldraw)&lt;/strong&gt;, &lt;em&gt;Agents on the canvas&lt;/em&gt;. Interaction patterns for visual agents, with shipping examples you can copy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. “We can generate code. We cannot verify it.”
&lt;/h2&gt;

&lt;p&gt;CI is no longer evidence of correctness. Two years of agent-generated code has proved it. The agent orchestration track is about what to put in its place.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/justincormack/" rel="noopener noreferrer"&gt;&lt;strong&gt;Justin Cormack&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(ex-Docker CTO)&lt;/strong&gt;, &lt;em&gt;When Tests Lie&lt;/em&gt;. Runtime signals that flag agent-introduced drift before it reaches users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.linkedin.com/in/dave-farley-a67927/" rel="noopener noreferrer"&gt;Dave Farley&lt;/a&gt; (Founder &amp;amp; CEO of Continuous Delivery Ltd. - 250k on YouTube),&lt;/strong&gt; &lt;em&gt;Vibe Coding, really?&lt;/em&gt; The ideas that may actually survive the AI programming revolution, beyond hype, demos, and generated boilerplate.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/fowlerchad/" rel="noopener noreferrer"&gt;&lt;strong&gt;Chad Fowler&lt;/strong&gt;&lt;/a&gt;, &lt;em&gt;Regenerative Software&lt;/em&gt;. An architectural model where components are regenerated rather than patched, and what verification looks like when code is short-lived by design.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. “AI writes code faster than teams can coordinate.”
&lt;/h2&gt;

&lt;p&gt;Two years into coding-agent adoption, throughput is up roughly 2×. Coordination cost scaled with it. The organizational enablement track covers review, ownership, and team structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ng9y6sw60tvf7ann6jw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ng9y6sw60tvf7ann6jw.png" alt="guypo" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/guypo/" rel="noopener noreferrer"&gt;&lt;strong&gt;Guy Podjarny&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(Tessl)&lt;/strong&gt;, &lt;em&gt;Skills are the new Code&lt;/em&gt; (keynote). The case for &lt;a href="https://tessl.io/registry" rel="noopener noreferrer"&gt;treating skills as proper software&lt;/a&gt;: versioned, tested, owned, reviewed. With the Tessl Registry now holding 2,000+ evaluated skills, the talk covers what that means for repo structure and review process.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/birgittaboeckeler/" rel="noopener noreferrer"&gt;&lt;strong&gt;Birgitta Böckeler&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(Thoughtworks)&lt;/strong&gt;, &lt;em&gt;State of Play: AI Coding Assistants&lt;/em&gt; (keynote). Two years of field data on which adoption patterns work and which create future technical debt.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/patrickdebois/" rel="noopener noreferrer"&gt;&lt;strong&gt;Patrick Debois&lt;/strong&gt;&lt;/a&gt;, &lt;em&gt;The Rise of Agent Enablement&lt;/em&gt;. &lt;a href="https://tessl.io/agent-enablement" rel="noopener noreferrer"&gt;Agent Enablement&lt;/a&gt; is the function that owns reliable agent adoption inside an engineering org. It defines standards for skills, evals, and workflows, and sits next to DevOps and Platform Engineering. Patrick’s session covers who owns it, what they do, and how teams formalize it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. “Your AI is a new attack surface.”
&lt;/h2&gt;

&lt;p&gt;Vulnerability classes that did not exist 18 months ago, and the controls most teams have not put in place yet. This is the agent enablement track from a security and governance angle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/talliran/" rel="noopener noreferrer"&gt;&lt;strong&gt;Liran Tal&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(Snyk)&lt;/strong&gt;, &lt;em&gt;Your AI Agent Installed Malware Because a SKILL.md Told It To&lt;/em&gt;. Live demo of prompt-injection via SKILL.md manifests, with the threat model and mitigations.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://linkedin.com/in/jkcso" rel="noopener noreferrer"&gt;&lt;strong&gt;Joseph Katsioloudes&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(GitHub)&lt;/strong&gt;, &lt;em&gt;Code Security Reinvented&lt;/em&gt;. How SAST, secret scanning, and review need to change for AI-generated code.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/jack-wotherspoon/" rel="noopener noreferrer"&gt;&lt;strong&gt;Jack Wotherspoon&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(Google)&lt;/strong&gt;, &lt;em&gt;Humans vs. Slop&lt;/em&gt;. New rules for open source maintainers when an unknown share of contributors are agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why engineering leaders should attend
&lt;/h2&gt;

&lt;p&gt;Five things your team brings back to Monday:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A verification model that does not assume CI catches the regression&lt;/li&gt;
&lt;li&gt;Threat models for prompt-injection and SKILL.md attacks, with mitigations&lt;/li&gt;
&lt;li&gt;Team structures and review workflows that scale with agent throughput&lt;/li&gt;
&lt;li&gt;A working definition of Agent Enablement as a discipline, including ownership and scope&lt;/li&gt;
&lt;li&gt;A model for evaluating skills before they go org-wide, with review patterns and KPIs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtxq925k99a5393mqw2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtxq925k99a5393mqw2d.png" alt="crowd" width="799" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosts and the wider lineup
&lt;/h2&gt;

&lt;p&gt;Hosted by &lt;a href="https://www.linkedin.com/in/sammyhepburn/" rel="noopener noreferrer"&gt;&lt;strong&gt;Sam Hepburn&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/patrickdebois/" rel="noopener noreferrer"&gt;&lt;strong&gt;Patrick Debois&lt;/strong&gt;&lt;/a&gt;. Day-one keynote from &lt;a href="https://x.com/lievenscheire" rel="noopener noreferrer"&gt;&lt;strong&gt;Lieven Scheire&lt;/strong&gt;&lt;/a&gt; on AI from outside the engineering bubble. The wider roster covers agent observability, MCP transports, runtime intelligence, brownfield adoption, and team-level adoption metrics, with practitioners from Anthropic, OpenAI, NVIDIA, Adobe, Hugging Face, Mozilla.ai, Cisco, Nearform, GitHub, and many more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdvcmrgty5seazs7mhf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdvcmrgty5seazs7mhf6.png" alt="speakers" width="800" height="857"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full speaker list and abstracts: &lt;a href="https://tessl.io/devcon" rel="noopener noreferrer"&gt;tessl.io/devcon&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dates:&lt;/strong&gt; June 1 and 2, 2026&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format:&lt;/strong&gt; 2 days in-person or 1 day virtual&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Venue:&lt;/strong&gt; The Brewery, Barbican, London.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register:&lt;/strong&gt; &lt;a href="https://luma.com/aidevcon-ldn26?coupon=R30" rel="noopener noreferrer"&gt;https://luma.com/aidevcon-ldn26?coupon=R30&lt;/a&gt; (&lt;code&gt;R30&lt;/code&gt; auto-applies at checkout and knocks 30% off)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bringing a team?&lt;/strong&gt; Contact us at &lt;a href="https://tessl.io/get-in-touch/" rel="noopener noreferrer"&gt;tessl.io/get-in-touch&lt;/a&gt;, and we can arrange a group purchase discount.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See you at the event!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>career</category>
      <category>eventsinyourcity</category>
    </item>
    <item>
      <title>Stop trusting your agent skills with vibes. Eliminate the context security risk.</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Fri, 15 May 2026 04:55:29 +0000</pubDate>
      <link>https://dev.to/tessl/stop-trusting-your-agent-skills-with-vibes-eliminate-the-context-security-risk-1jld</link>
      <guid>https://dev.to/tessl/stop-trusting-your-agent-skills-with-vibes-eliminate-the-context-security-risk-1jld</guid>
      <description>&lt;p&gt;When you install an npm package, you can run &lt;code&gt;npm audit&lt;/code&gt;. When you install a Python package, there's &lt;code&gt;pip-audit&lt;/code&gt;. But when you install plugins that give your AI agent new skills and rules, you know, things that directly shape how it reasons and what it does, what do you run?&lt;/p&gt;

&lt;p&gt;If your answer is "nothing", you're not alone, and that's why I built &lt;code&gt;tessl-audit&lt;/code&gt;! You can check it out on &lt;a href="https://github.com/AI-Native-Dev-Community/tessl-audit" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://www.npmjs.com/package/tessl-audit" rel="noopener noreferrer"&gt;npm&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more than you think
&lt;/h2&gt;

&lt;p&gt;Agent plugins are &lt;em&gt;instructions&lt;/em&gt; that get loaded into your AI agent's context. A plugin with a security issue doesn't just expose a server endpoint. It can influence the agent's behaviour in ways that are subtle and hard to detect, perhaps nudging it toward unsafe patterns, exposing data it shouldn't, or simply making it worse at its job.&lt;/p&gt;

&lt;p&gt;Ask yourself these three questions about your agent skills, and if the answer to any of them is no, you’re seconds away from being able to say yes, with tessl-audit.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Have all your skills been security scanned?&lt;/strong&gt; If so, what was the result?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Can you prove your skills are any good?&lt;/strong&gt; Quality scores tell you how well-written and complete a plugin is. A low score means the agent is getting poor guidance.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Do your skills and plugins actually help?&lt;/strong&gt; Uplift scores measure whether a plugin improves agent task performance compared to a vanilla agent alone.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://tessl.io/devcon" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbraf14e4s66n7ibzuwuk.png" alt="Join us at AI Native DevCon" width="800" height="267"&gt;&lt;/a&gt;&lt;br&gt;Join us at AI Native DevCon (use C0DE30 for a 30% discount)&lt;/p&gt;
&lt;h2&gt;
  
  
  Why not try it right now?
&lt;/h2&gt;

&lt;p&gt;It’s a free open source tool that uses Tessl under the covers. If you have a Tessl project with plugins installed, just run this in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npx tessl-audit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, is that it? Absolutely, that's it. It reads your &lt;code&gt;tessl.json&lt;/code&gt;, fetches live data from the registry for every plugin, and prints a report in about 30 seconds.&lt;/p&gt;

&lt;p&gt;The script begins by looking through all the context files listed in your &lt;code&gt;tessl.json&lt;/code&gt; manifest. This should complete pretty quickly, and you’ll soon see the table below, with a breakdown of your project context and the types of warnings that have been picked up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0rrz9ig4r2nebvw87p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0rrz9ig4r2nebvw87p3.png" alt="image1" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, the tool gives a posture summary of all of your context, giving more details of the riskiest skills in your project and what the issues are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9xwxk46mxgxqvjtqios.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9xwxk46mxgxqvjtqios.png" alt="img2" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can click through on any of these links to see the actual issues in the registry web UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsib0z1ar0osa3lfxvrau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsib0z1ar0osa3lfxvrau.png" alt="img3" width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, the tool suggests next steps: the CLI commands to use (you can also have an agent call these) to optimize your skills and to create and run evals on them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwtr6gssymroeyl5g4cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwtr6gssymroeyl5g4cf.png" alt="img4" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The "so what" for each finding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Advisory, Risky, or Critical security status?
&lt;/h3&gt;

&lt;p&gt;The report prints each flagged plugin with its warning codes and a direct link to the full security report on the registry. No need to chase them down: the security posture report shows the full summary in one listing and lets you deep dive where needed. Just open the link, read the finding, and decide if it applies to your use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality below 80%?
&lt;/h3&gt;

&lt;p&gt;The plugin you’re using is giving your agent incomplete or poorly-structured guidance. Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tessl skill review --optimize workspace/plugin-name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs a quality review and applies automatic improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  No uplift data?
&lt;/h3&gt;

&lt;p&gt;The plugin has never been evaluated against real tasks — so you have no idea if it's helping or hurting. Fix that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tessl scenario generate --count 5 workspace/plugin-name
tessl eval run workspace/plugin-name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate a set of test scenarios from the plugin, then run the eval. You'll get a concrete uplift score showing whether the plugin is worth keeping.&lt;/p&gt;
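
&lt;p&gt;The two follow-ups above can be folded into one small helper. This is only a sketch, not part of &lt;code&gt;tessl-audit&lt;/code&gt;: the function name &lt;code&gt;next_steps&lt;/code&gt; and its three arguments (plugin name, quality percentage, and whether uplift data exists) are assumptions for illustration, and you would read those values off the report yourself. It prints the commands shown above rather than running them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical helper (not part of tessl-audit): print the follow-up commands for one plugin.
# Assumed arguments, copied by hand from the report: plugin name, quality %, uplift data "yes" or "no".
next_steps() {
  plugin="$1"
  quality="$2"
  has_uplift="$3"

  # Quality below 80%: suggest the review-and-optimize command
  if [ "$quality" -lt 80 ]; then
    echo "tessl skill review --optimize $plugin"
  fi

  # No uplift data: suggest generating scenarios, then running the eval
  if [ "$has_uplift" != "yes" ]; then
    echo "tessl scenario generate --count 5 $plugin"
    echo "tessl eval run $plugin"
  fi
}

next_steps workspace/plugin-name 72 no
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a quality score of 72 and no uplift data, the call on the last line prints all three commands. A plugin at 80% or above that already has uplift data prints nothing.&lt;/p&gt;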

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;Every team that uses AI agents is building a dependency graph of skills, rules, and knowledge, just like they build a dependency graph of packages. The tooling for auditing that graph is still being built, but the risks are real and growing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tessl-audit&lt;/code&gt; is a small, practical step: one command, zero installation, actionable output. Run it today and find out what your agent is actually working with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npx tessl-audit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;code&gt;tessl-audit&lt;/code&gt; requires the Tessl CLI (no worries, it’s already a dependency) and an authenticated Tessl session (just create a free account if you haven’t got one). You’ll also need a &lt;code&gt;tessl.json&lt;/code&gt;, the context manifest file, in order to run the &lt;code&gt;tessl-audit&lt;/code&gt; tool.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful docs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://docs.tessl.io/evaluate/evaluate-skill-quality-using-scenarios" rel="noopener noreferrer"&gt;Evaluate skill quality using scenarios&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.tessl.io/evaluate/evaluating-skills" rel="noopener noreferrer"&gt;Review a skill against best practices&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://tessl.io/registry/tessl-labs/skill-optimizer" rel="noopener noreferrer"&gt;Skill Optimizer plugin&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>security</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Tessl Admin Guide: Organizations, Workspaces, and Roles</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Thu, 14 May 2026 06:45:55 +0000</pubDate>
      <link>https://dev.to/tessl/tessl-admin-guide-organizations-workspaces-and-roles-4m75</link>
      <guid>https://dev.to/tessl/tessl-admin-guide-organizations-workspaces-and-roles-4m75</guid>
      <description>&lt;p&gt;Just signed up to Tessl? Wondering about the next steps for rolling it out to your team? This article walks you through managing your top-level Organization: inviting users, setting policies, creating workspaces, assigning workspace membership, and defining &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;roles&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Organizations and Workspaces work in Tessl
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Organizations&lt;/em&gt;&lt;/strong&gt; are top-level entities, often representing the billing or corporate entity. Each Organization contains &lt;strong&gt;&lt;em&gt;Workspaces&lt;/em&gt;&lt;/strong&gt;, which provide role-based access for users across the company.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8e06vahqpkgmjfc731a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8e06vahqpkgmjfc731a.png" alt="A diagram showing a top level Organization, with many workspaces below. Some with Search,Install, and Publish permissions, some with just Install and Publish, and one with no access." width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up your Tessl Organization
&lt;/h2&gt;

&lt;p&gt;Organizations are sometimes created during the presales phase of acquiring Tessl, or may be created later. If one has not been created, it will be created automatically when you create your first workspace. If prompted, click &lt;strong&gt;Create workspace&lt;/strong&gt; and name it after your team (e.g. YourCompanyName-Engineering) to start.&lt;/p&gt;

&lt;p&gt;Note that workspace names must currently be unique and will appear in plugin names in search results. This matters most when plugins are published publicly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festzqydqhxluufhxo384.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festzqydqhxluufhxo384.png" alt="View of the registry page where a Create workspace button is being displayed." width="800" height="1466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workspace should now be visible from the main interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitmjpzw8kyrt0ii7ag52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitmjpzw8kyrt0ii7ag52.png" alt="The workspace selector will appear, displaying the workspaces you have access to,  with sub menu items like eval runs, projects, etc dependant on your permissions." width="800" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can now view the organization by clicking your Account, where your name is displayed, at the bottom left.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl7k6wrz7zz90yaf4zkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl7k6wrz7zz90yaf4zkm.png" alt="By selecting your account/profile, the organization will be displayed with sub menu of members, settings, admin keys, depending on your permissions." width="800" height="1167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the organization is created, navigate to its &lt;strong&gt;&lt;em&gt;Settings&lt;/em&gt;&lt;/strong&gt;, rename it to your company name, and use the toggle to specify whether users can publicly share &lt;a href="https://docs.tessl.io/create/creating-skills" rel="noopener noreferrer"&gt;skills&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdjvzjeg5zdqw9ceb5im.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdjvzjeg5zdqw9ceb5im.png" alt="Organization settings displays an organization name, the ability to save, and an option to block public tile publishing by toggling a selector." width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating and managing Users in Tessl
&lt;/h3&gt;

&lt;p&gt;Next, invite users to your organization by navigating to the Organization’s &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; menu and assigning the workspaces each user will have access to. Users are created with the &lt;strong&gt;&lt;em&gt;members&lt;/em&gt;&lt;/strong&gt; role, which lets them see, search, and install skills from the chosen workspaces. Permissions can be promoted from the Workspace &lt;strong&gt;Members&lt;/strong&gt; menu, covered later in this guide. Users will need to accept the invite they are sent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4g6r3dl7lo16v46uut0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4g6r3dl7lo16v46uut0.png" alt="Invite member screen displays an email address, a selection of workspaces that can be added to the user specified." width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once a user has been created, you can elevate them to Admin so they can create workspaces or manage users. To do so, navigate to the Organization &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; screen, click the three dots under &lt;strong&gt;&lt;em&gt;Actions&lt;/em&gt;&lt;/strong&gt;, and assign an appropriate role. Some common configurations are described in the examples later in this guide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fsvjyckwcicbqzup30w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fsvjyckwcicbqzup30w.png" alt="Expanding the options menu, which is three dots, next to each name yields a submenu with change role and remove" width="800" height="774"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Admin keys
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrfooi6l0jsv9cxkw57k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrfooi6l0jsv9cxkw57k.png" alt="image.png" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Admin keys are for integrations and applications that require programmatic access across workspaces. They are typically used for automation, and each key can be given an expiration of up to one year.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Workspaces and Users in Tessl
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo5e2gi1574xb0he4gaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo5e2gi1574xb0he4gaj.png" alt="On the side menu of the screen, users can select all plugins, eval runs, projects and members from a specified workspace." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the workspace drop-down to navigate workspaces. Navigate to &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; at the workspace level to specify &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;Roles&lt;/a&gt; for users who require more capabilities within the workspace, such as running evaluations, publishing or managing users.&lt;/p&gt;

&lt;p&gt;To modify a user, search for their name, select their checkbox, choose a &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;role&lt;/a&gt;, and click the &lt;strong&gt;Add&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9fog1k29i17joa6vga0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9fog1k29i17joa6vga0.png" alt="The role selector allows user to select consumer, member, publisher, manager and owner when adding a user to a workspace." width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example role configurations for your team(s)
&lt;/h2&gt;

&lt;p&gt;The following users demonstrate common configurations and &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;roles&lt;/a&gt; that may be used when rolling Tessl out:&lt;/p&gt;

&lt;h3&gt;
  
  
  Samira - Org. Admin
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Samira&lt;/strong&gt;, the administrator and skills champion, needs to manage all workspaces, assign users, and create new workspaces. Make her an Organization admin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8q5xoajbwb8bfdcf8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8q5xoajbwb8bfdcf8m.png" alt="A diagram showing Samira with admin privileges at the Organization level , giving her full permissions on the workspaces below as a result" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Eddie - Lead Engineer
&lt;/h3&gt;

&lt;p&gt;Another user, &lt;strong&gt;Eddie&lt;/strong&gt;, might be a member of an engineering workspace. He needs to use plugins (skills) that have been published, but may also need to publish skills within the engineering workspace for others on his team. This could mean Eddie has the publisher &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;role&lt;/a&gt; in certain workspaces. He may also have the Member role in other workspaces where he only needs to search and install.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwgzuipack4tlngxvr26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwgzuipack4tlngxvr26.png" alt="A diagram showing an organization with several workspaces. The user has publisher permission on several, giving search, install, and publish rights. In several other workspaces the user is only a member, providing more limited permissions like Search and Install. One workspace is no access because they were not given permissions." width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Jennifer - Manager
&lt;/h3&gt;

&lt;p&gt;Jennifer may need to add users to a workspace that she owns, publish skills, and possibly remove other managers. Typically she would be given the "Owner" or "Manager" workspace permission, depending on whether she needs to remove other owners or delete the workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Joe - New hire engineer
&lt;/h3&gt;

&lt;p&gt;Finally, Joe, a new hire, needs to search and install skills from the engineering workspace, but should not be able to share or create skills until he has gained a little more experience. Joe would be added to “engineering” with just the “consumer” role.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps!
&lt;/h2&gt;

&lt;p&gt;Now that your users are in and have roles assigned across the different workspaces, you and your users can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Start creating &lt;a href="https://docs.tessl.io/create/creating-skills" rel="noopener noreferrer"&gt;new skills&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Evaluate the effectiveness of new or existing skills using &lt;a href="https://docs.tessl.io/evaluate/evaluating-skills" rel="noopener noreferrer"&gt;Reviews&lt;/a&gt; and &lt;a href="https://docs.tessl.io/evaluate/evaluate-skill-quality-using-scenarios" rel="noopener noreferrer"&gt;Evals&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Publish those skills to the Tessl registry to share them with your users and agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Let us know what you think! Tessl would love to hear from you through any one of our &lt;a href="https://docs.tessl.io/support/giving-feedback" rel="noopener noreferrer"&gt;feedback channels (Discord, Email, CLI Feedback, etc)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://tessl.io/blog/tessl-admin-guide-organizations-workspaces-and-roles/" rel="noopener noreferrer"&gt;Tessl.blogs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tutorial</category>
      <category>agents</category>
    </item>
    <item>
      <title>GPT-5.5 is OpenAI's best model. But paying more for it makes no sense.</title>
      <dc:creator>Rohan Sharma</dc:creator>
      <pubDate>Wed, 06 May 2026 13:13:28 +0000</pubDate>
      <link>https://dev.to/tessl/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense-2227</link>
      <guid>https://dev.to/tessl/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense-2227</guid>
      <description>&lt;p&gt;We added OpenAI’s gpt-5.5 model to our eval suite the day it launched. We ran 1,742 tests overall, covering more than 45 task scenarios across 11 real engineering skills. Each scenario was run 6 times, and the figures in this post are averages across those runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The gpt-5.5 model has the highest raw capability of any OpenAI model we've tested. When it uses agent skills and performs the same tasks, it pretty much ties with gpt-5.4 on score but costs 63% more per run.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best Codex model out of the box?&lt;/td&gt;
&lt;td&gt;gpt-5.5: 75.6 avg baseline, highest in the family&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Codex model with skills loaded?&lt;/td&gt;
&lt;td&gt;gpt-5.4 and gpt-5.5 tie at 89.3 and 89.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worth the 63% price premium over gpt-5.4?&lt;/td&gt;
&lt;td&gt;With this data, we don’t think so&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Any scenario where it wins?&lt;/td&gt;
&lt;td&gt;Latency: 89.5s vs 135.4s for gpt-5.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Should you use gpt-5.3 instead?&lt;/td&gt;
&lt;td&gt;No, oddly enough, gpt-5.3 costs 47% more than gpt-5.4 for a worse result because of the token bloat.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The one-line verdict: gpt-5.5 is the most capable Codex model we've benchmarked, but when agent skills guide the task, it performs pretty much identically to gpt-5.4, which costs roughly 39% less per run ($0.30 against $0.49). The more interesting story is gpt-5.3, which costs more per run than gpt-5.4 and scores worse because of its token bloat. gpt-5.5 is also, of course, more expensive per token.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Takeaways
&lt;/h2&gt;

&lt;p&gt;The most counterintuitive thing in this data: gpt-5.5 and gpt-5.4 score within 0.1 points of each other when given domain skills, 89.4 vs 89.3. The self-sufficiency story holds directionally, but these two models are functionally the same on skill-augmented work. The question is purely cost.&lt;/p&gt;




&lt;p&gt;The gpt-5.3 story is sharper. The headline numbers put it at 83.9 with skills against 89.3 for gpt-5.4, a 5.4 point gap. It also costs $0.44 per run against $0.30 for gpt-5.4. You pay more and get less, which is a complete description of a bad deal.&lt;/p&gt;

&lt;p&gt;You pay $0.49/run for 89.4 points with gpt-5.5. You pay $0.30/run for 89.3 points with gpt-5.4. The only dimension where gpt-5.5 leads is latency, at 89.5s against 135.4s. If you're running latency-constrained agents and can absorb the cost, it's a defensible choice. Otherwise you're paying a 63% premium for 0.1 points.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Stacks Up
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Task Scores (using agent skill)&lt;/th&gt;
&lt;th&gt;Cost/run&lt;/th&gt;
&lt;th&gt;Score/$&lt;/th&gt;
&lt;th&gt;Avg lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claude-opus-4-7&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;93&lt;/td&gt;
&lt;td&gt;+12.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cursor:composer-2&lt;/td&gt;
&lt;td&gt;89.6&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;389&lt;/td&gt;
&lt;td&gt;+15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;89.4&lt;/td&gt;
&lt;td&gt;$0.49&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;td&gt;+13.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;298&lt;/td&gt;
&lt;td&gt;+15.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3-codex&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;$0.44&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;td&gt;+18.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;td&gt;$1.05&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;+10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;gpt-5.5 and gpt-5.4 are functionally interchangeable on skill performance. The question is whether saving roughly 46 seconds per run (89.5s against 135.4s) is worth an extra $0.19.&lt;/p&gt;
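
&lt;p&gt;To make that trade-off concrete, here is a minimal sketch that recomputes the premium, the latency saving, and the score per dollar from the figures quoted in this post. The &lt;code&gt;runs&lt;/code&gt; dictionary and the helper names are illustrative assumptions, not part of the Tessl platform.&lt;/p&gt;

```python
# Figures copied from this post: with-skill score, cost per run (USD), latency (s).
runs = {
    "gpt-5.5": {"score": 89.4, "cost": 0.49, "latency_s": 89.5},
    "gpt-5.4": {"score": 89.3, "cost": 0.30, "latency_s": 135.4},
}


def score_per_dollar(model):
    """Points earned per dollar spent on one run."""
    return runs[model]["score"] / runs[model]["cost"]


def compare(fast, cheap):
    """Summarize what switching from `cheap` to `fast` costs and buys."""
    f, c = runs[fast], runs[cheap]
    return {
        "premium_pct": (f["cost"] / c["cost"] - 1) * 100,
        "extra_cost": f["cost"] - c["cost"],
        "seconds_saved": c["latency_s"] - f["latency_s"],
        "score_gain": f["score"] - c["score"],
    }


summary = compare("gpt-5.5", "gpt-5.4")
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

&lt;p&gt;Swapping in your own per-run cost and latency numbers gives a quick way to check whether the faster model pays for itself in your workload.&lt;/p&gt;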

&lt;h2&gt;
  
  
  What We Tested
&lt;/h2&gt;

&lt;p&gt;This benchmark runs on &lt;a href="https://tessl.io" rel="noopener noreferrer"&gt;Tessl&lt;/a&gt;, an agentic evaluation platform. A skill is a &lt;code&gt;SKILL.md&lt;/code&gt; file: a structured markdown document containing rules, patterns, and examples for a specific domain. For the baseline run, the agent sees only the task prompt with no additional context. For the with-skill run, the &lt;code&gt;SKILL.md&lt;/code&gt; is loaded into the agent's context alongside the task: same model, same task, same rubric. The score delta is the lift. The platform runs each scenario in both configurations and automatically scores the output against a pre-written rubric checklist.&lt;/p&gt;

&lt;p&gt;Each scenario was run 6 times and scored independently; all figures are averaged across those runs.&lt;/p&gt;
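
&lt;p&gt;As a rough illustration of how lift falls out of those runs, the sketch below averages six baseline scores and six with-skill scores for one scenario and takes the difference. The per-run scores are made-up placeholders and &lt;code&gt;lift&lt;/code&gt; is a hypothetical helper, not a Tessl API.&lt;/p&gt;

```python
from statistics import mean


def lift(baseline_runs, with_skill_runs):
    """Average with-skill score minus average baseline score."""
    return mean(with_skill_runs) - mean(baseline_runs)


# Hypothetical per-run rubric scores for a single scenario (6 runs each).
baseline = [70.0, 72.0, 68.0, 71.0, 69.0, 70.0]
with_skill = [85.0, 88.0, 84.0, 86.0, 87.0, 86.0]

print(f"baseline avg:   {mean(baseline):.1f}")
print(f"with-skill avg: {mean(with_skill):.1f}")
print(f"lift:           {lift(baseline, with_skill):+.1f}")
```

&lt;p&gt;A negative result from the same calculation means the skill made the model score worse on that scenario.&lt;/p&gt;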

&lt;p&gt;&lt;strong&gt;Why rubric checklists?&lt;/strong&gt; Because the scenarios have objectively right answers. "Does the agent delete &lt;code&gt;.eslintrc.json&lt;/code&gt; and create &lt;code&gt;eslint.config.js&lt;/code&gt;?" is not a matter of opinion. Neither is "Does it use PKCE method S256?" or "Does it call &lt;code&gt;pipeline()&lt;/code&gt; instead of chaining &lt;code&gt;.pipe()&lt;/code&gt;?" Binary criteria eliminate evaluation noise wherever possible.&lt;/p&gt;

&lt;p&gt;Example rubric: &lt;em&gt;Modernize the Linting Setup for a Node.js Library&lt;/em&gt;, 11 criteria, 101 points.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Points&lt;/th&gt;
&lt;th&gt;Pass condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;neostandard installed&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;neostandard present in devDependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;standard uninstalled&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;standard absent from devDependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flat config file&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;eslint.config.js or .mjs exists, not .eslintrc*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;neostandard in config&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Config imports from neostandard and calls neostandard()&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lint script uses eslint&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;package.json lint script runs eslint ., not neostandard . or standard .&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;migrate command used&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Instructions reference npx neostandard --migrate to generate the config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lint:fix script present&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;lint:fix script runs eslint . --fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI uses non-fix run&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;CI config runs lint without --fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;standard config removed&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;No top-level standard key in package.json&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lint-staged uses eslint&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Pre-commit hook runs eslint --fix, not neostandard or standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eslint@9 installed&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;eslint at version 9.x in devDependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A model that migrates the config correctly but leaves &lt;code&gt;standard&lt;/code&gt; in devDependencies scores 91/101. One that creates &lt;code&gt;eslint.config.js&lt;/code&gt; alongside &lt;code&gt;.eslintrc.json&lt;/code&gt; instead of replacing it scores 0 on three criteria at once.&lt;/p&gt;
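
&lt;p&gt;A binary rubric like this is simple to score mechanically. The sketch below is one possible way to do it, assuming each criterion either passes in full or earns nothing; the &lt;code&gt;RUBRIC&lt;/code&gt; mapping and &lt;code&gt;score&lt;/code&gt; function are illustrative, not the platform's implementation. The weights are copied from the table as listed, where they add up to 100, so totals computed here can differ by a point from the 101-point figures quoted in the text.&lt;/p&gt;

```python
# Criterion weights copied from the example rubric table.
RUBRIC = {
    "neostandard installed": 10,
    "standard uninstalled": 10,
    "Flat config file": 10,
    "neostandard in config": 10,
    "lint script uses eslint": 10,
    "migrate command used": 10,
    "lint:fix script present": 8,
    "CI uses non-fix run": 8,
    "standard config removed": 8,
    "lint-staged uses eslint": 8,
    "eslint@9 installed": 8,
}


def score(passed):
    """Return (earned, possible) for the set of criteria that passed."""
    unknown = set(passed) - set(RUBRIC)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    earned = sum(points for name, points in RUBRIC.items() if name in passed)
    return earned, sum(RUBRIC.values())


# Example: everything passes except removing `standard` from devDependencies.
passed = set(RUBRIC) - {"standard uninstalled"}
earned, possible = score(passed)
print(f"{earned}/{possible}")
```

&lt;p&gt;Because every criterion is all-or-nothing, a single mistake that touches several checks removes all of their points at once.&lt;/p&gt;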

&lt;p&gt;All skills and rubrics are published at &lt;a href="https://tessl.io/registry/simon/skills" rel="noopener noreferrer"&gt;simon/skills on the Tessl registry&lt;/a&gt;. Full eval results for this run &lt;a href="https://tessl.io/registry/simon/skills/evals" rel="noopener noreferrer"&gt;can be found here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Baseline scores (no skill), sorted by highest average
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;docs&lt;/th&gt;
&lt;th&gt;fastify&lt;/th&gt;
&lt;th&gt;init&lt;/th&gt;
&lt;th&gt;lint&lt;/th&gt;
&lt;th&gt;node&lt;/th&gt;
&lt;th&gt;node-core&lt;/th&gt;
&lt;th&gt;oauth&lt;/th&gt;
&lt;th&gt;octocat&lt;/th&gt;
&lt;th&gt;skill-opt&lt;/th&gt;
&lt;th&gt;snip&lt;/th&gt;
&lt;th&gt;ts&lt;/th&gt;
&lt;th&gt;Avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claude-opus-4-7&lt;/td&gt;
&lt;td&gt;85.7&lt;/td&gt;
&lt;td&gt;80.9&lt;/td&gt;
&lt;td&gt;79.7&lt;/td&gt;
&lt;td&gt;92.9&lt;/td&gt;
&lt;td&gt;73.7&lt;/td&gt;
&lt;td&gt;91.6&lt;/td&gt;
&lt;td&gt;75.7&lt;/td&gt;
&lt;td&gt;84.7&lt;/td&gt;
&lt;td&gt;85.0&lt;/td&gt;
&lt;td&gt;60.1&lt;/td&gt;
&lt;td&gt;78.8&lt;/td&gt;
&lt;td&gt;80.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;89.9&lt;/td&gt;
&lt;td&gt;71.8&lt;/td&gt;
&lt;td&gt;63.6&lt;/td&gt;
&lt;td&gt;94.4&lt;/td&gt;
&lt;td&gt;64.6&lt;/td&gt;
&lt;td&gt;72.3&lt;/td&gt;
&lt;td&gt;73.6&lt;/td&gt;
&lt;td&gt;85.5&lt;/td&gt;
&lt;td&gt;83.2&lt;/td&gt;
&lt;td&gt;54.7&lt;/td&gt;
&lt;td&gt;78.3&lt;/td&gt;
&lt;td&gt;75.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cursor:composer-2&lt;/td&gt;
&lt;td&gt;84.3&lt;/td&gt;
&lt;td&gt;74.7&lt;/td&gt;
&lt;td&gt;61.6&lt;/td&gt;
&lt;td&gt;94.1&lt;/td&gt;
&lt;td&gt;65.4&lt;/td&gt;
&lt;td&gt;78.8&lt;/td&gt;
&lt;td&gt;73.1&lt;/td&gt;
&lt;td&gt;78.5&lt;/td&gt;
&lt;td&gt;82.3&lt;/td&gt;
&lt;td&gt;58.5&lt;/td&gt;
&lt;td&gt;65.5&lt;/td&gt;
&lt;td&gt;74.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;87.6&lt;/td&gt;
&lt;td&gt;66.7&lt;/td&gt;
&lt;td&gt;71.1&lt;/td&gt;
&lt;td&gt;84.5&lt;/td&gt;
&lt;td&gt;62.3&lt;/td&gt;
&lt;td&gt;77.4&lt;/td&gt;
&lt;td&gt;77.5&lt;/td&gt;
&lt;td&gt;80.5&lt;/td&gt;
&lt;td&gt;80.8&lt;/td&gt;
&lt;td&gt;50.9&lt;/td&gt;
&lt;td&gt;75.9&lt;/td&gt;
&lt;td&gt;74.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;80.2&lt;/td&gt;
&lt;td&gt;67.3&lt;/td&gt;
&lt;td&gt;60.2&lt;/td&gt;
&lt;td&gt;84.9&lt;/td&gt;
&lt;td&gt;60.4&lt;/td&gt;
&lt;td&gt;76.5&lt;/td&gt;
&lt;td&gt;72.9&lt;/td&gt;
&lt;td&gt;75.3&lt;/td&gt;
&lt;td&gt;63.8&lt;/td&gt;
&lt;td&gt;47.5&lt;/td&gt;
&lt;td&gt;66.5&lt;/td&gt;
&lt;td&gt;68.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3-codex&lt;/td&gt;
&lt;td&gt;63.5&lt;/td&gt;
&lt;td&gt;65.4&lt;/td&gt;
&lt;td&gt;52.1&lt;/td&gt;
&lt;td&gt;76.5&lt;/td&gt;
&lt;td&gt;62.4&lt;/td&gt;
&lt;td&gt;75.3&lt;/td&gt;
&lt;td&gt;77.9&lt;/td&gt;
&lt;td&gt;68.3&lt;/td&gt;
&lt;td&gt;70.5&lt;/td&gt;
&lt;td&gt;42.1&lt;/td&gt;
&lt;td&gt;66.4&lt;/td&gt;
&lt;td&gt;65.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  With-skill scores, sorted by highest average
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;docs&lt;/th&gt;
&lt;th&gt;fastify&lt;/th&gt;
&lt;th&gt;init&lt;/th&gt;
&lt;th&gt;lint&lt;/th&gt;
&lt;th&gt;node&lt;/th&gt;
&lt;th&gt;node-core&lt;/th&gt;
&lt;th&gt;oauth&lt;/th&gt;
&lt;th&gt;octocat&lt;/th&gt;
&lt;th&gt;skill-opt&lt;/th&gt;
&lt;th&gt;snip&lt;/th&gt;
&lt;th&gt;ts&lt;/th&gt;
&lt;th&gt;Avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claude-opus-4-7&lt;/td&gt;
&lt;td&gt;96.7&lt;/td&gt;
&lt;td&gt;98.9&lt;/td&gt;
&lt;td&gt;82.3&lt;/td&gt;
&lt;td&gt;97.2&lt;/td&gt;
&lt;td&gt;95.1&lt;/td&gt;
&lt;td&gt;84.7&lt;/td&gt;
&lt;td&gt;94.3&lt;/td&gt;
&lt;td&gt;97.7&lt;/td&gt;
&lt;td&gt;99.7&lt;/td&gt;
&lt;td&gt;92.9&lt;/td&gt;
&lt;td&gt;88.0&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cursor:composer-2&lt;/td&gt;
&lt;td&gt;95.6&lt;/td&gt;
&lt;td&gt;93.9&lt;/td&gt;
&lt;td&gt;85.7&lt;/td&gt;
&lt;td&gt;96.4&lt;/td&gt;
&lt;td&gt;94.0&lt;/td&gt;
&lt;td&gt;92.3&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;94.5&lt;/td&gt;
&lt;td&gt;93.7&lt;/td&gt;
&lt;td&gt;85.3&lt;/td&gt;
&lt;td&gt;70.4&lt;/td&gt;
&lt;td&gt;89.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;96.1&lt;/td&gt;
&lt;td&gt;86.0&lt;/td&gt;
&lt;td&gt;81.8&lt;/td&gt;
&lt;td&gt;96.3&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.6&lt;/td&gt;
&lt;td&gt;91.7&lt;/td&gt;
&lt;td&gt;92.1&lt;/td&gt;
&lt;td&gt;96.0&lt;/td&gt;
&lt;td&gt;86.5&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;89.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;97.1&lt;/td&gt;
&lt;td&gt;76.9&lt;/td&gt;
&lt;td&gt;80.0&lt;/td&gt;
&lt;td&gt;98.1&lt;/td&gt;
&lt;td&gt;84.8&lt;/td&gt;
&lt;td&gt;93.7&lt;/td&gt;
&lt;td&gt;91.6&lt;/td&gt;
&lt;td&gt;95.7&lt;/td&gt;
&lt;td&gt;94.6&lt;/td&gt;
&lt;td&gt;90.9&lt;/td&gt;
&lt;td&gt;79.0&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3-codex&lt;/td&gt;
&lt;td&gt;96.9&lt;/td&gt;
&lt;td&gt;86.1&lt;/td&gt;
&lt;td&gt;80.4&lt;/td&gt;
&lt;td&gt;90.2&lt;/td&gt;
&lt;td&gt;75.9&lt;/td&gt;
&lt;td&gt;77.1&lt;/td&gt;
&lt;td&gt;93.1&lt;/td&gt;
&lt;td&gt;92.3&lt;/td&gt;
&lt;td&gt;77.3&lt;/td&gt;
&lt;td&gt;79.4&lt;/td&gt;
&lt;td&gt;74.1&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;62.9&lt;/td&gt;
&lt;td&gt;88.9&lt;/td&gt;
&lt;td&gt;74.8&lt;/td&gt;
&lt;td&gt;92.1&lt;/td&gt;
&lt;td&gt;66.3&lt;/td&gt;
&lt;td&gt;77.7&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;85.9&lt;/td&gt;
&lt;td&gt;80.7&lt;/td&gt;
&lt;td&gt;86.0&lt;/td&gt;
&lt;td&gt;61.1&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Lift: what skills actually added per model, sorted by highest average lift
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;docs&lt;/th&gt;
&lt;th&gt;fastify&lt;/th&gt;
&lt;th&gt;init&lt;/th&gt;
&lt;th&gt;lint&lt;/th&gt;
&lt;th&gt;node&lt;/th&gt;
&lt;th&gt;node-core&lt;/th&gt;
&lt;th&gt;oauth&lt;/th&gt;
&lt;th&gt;octocat&lt;/th&gt;
&lt;th&gt;skill-opt&lt;/th&gt;
&lt;th&gt;snip&lt;/th&gt;
&lt;th&gt;ts&lt;/th&gt;
&lt;th&gt;Avg lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3-codex&lt;/td&gt;
&lt;td&gt;+33.4&lt;/td&gt;
&lt;td&gt;+20.7&lt;/td&gt;
&lt;td&gt;+28.3&lt;/td&gt;
&lt;td&gt;+13.7&lt;/td&gt;
&lt;td&gt;+13.5&lt;/td&gt;
&lt;td&gt;+1.8&lt;/td&gt;
&lt;td&gt;+15.2&lt;/td&gt;
&lt;td&gt;+24.0&lt;/td&gt;
&lt;td&gt;+6.8&lt;/td&gt;
&lt;td&gt;+37.3&lt;/td&gt;
&lt;td&gt;+7.7&lt;/td&gt;
&lt;td&gt;+18.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cursor:composer-2&lt;/td&gt;
&lt;td&gt;+11.3&lt;/td&gt;
&lt;td&gt;+19.2&lt;/td&gt;
&lt;td&gt;+24.1&lt;/td&gt;
&lt;td&gt;+2.3&lt;/td&gt;
&lt;td&gt;+28.6&lt;/td&gt;
&lt;td&gt;+13.5&lt;/td&gt;
&lt;td&gt;+10.8&lt;/td&gt;
&lt;td&gt;+16.0&lt;/td&gt;
&lt;td&gt;+11.4&lt;/td&gt;
&lt;td&gt;+26.8&lt;/td&gt;
&lt;td&gt;+4.9&lt;/td&gt;
&lt;td&gt;+15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;+9.5&lt;/td&gt;
&lt;td&gt;+10.2&lt;/td&gt;
&lt;td&gt;+8.9&lt;/td&gt;
&lt;td&gt;+13.6&lt;/td&gt;
&lt;td&gt;+22.5&lt;/td&gt;
&lt;td&gt;+16.3&lt;/td&gt;
&lt;td&gt;+14.1&lt;/td&gt;
&lt;td&gt;+15.2&lt;/td&gt;
&lt;td&gt;+13.8&lt;/td&gt;
&lt;td&gt;+40.0&lt;/td&gt;
&lt;td&gt;+3.1&lt;/td&gt;
&lt;td&gt;+15.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;+6.2&lt;/td&gt;
&lt;td&gt;+14.2&lt;/td&gt;
&lt;td&gt;+18.2&lt;/td&gt;
&lt;td&gt;+1.9&lt;/td&gt;
&lt;td&gt;+23.7&lt;/td&gt;
&lt;td&gt;+16.3&lt;/td&gt;
&lt;td&gt;+18.1&lt;/td&gt;
&lt;td&gt;+6.6&lt;/td&gt;
&lt;td&gt;+12.8&lt;/td&gt;
&lt;td&gt;+31.8&lt;/td&gt;
&lt;td&gt;+0.9&lt;/td&gt;
&lt;td&gt;+13.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-opus-4-7&lt;/td&gt;
&lt;td&gt;+11.0&lt;/td&gt;
&lt;td&gt;+18.0&lt;/td&gt;
&lt;td&gt;+2.6&lt;/td&gt;
&lt;td&gt;+4.3&lt;/td&gt;
&lt;td&gt;+21.4&lt;/td&gt;
&lt;td&gt;-6.9&lt;/td&gt;
&lt;td&gt;+18.6&lt;/td&gt;
&lt;td&gt;+13.0&lt;/td&gt;
&lt;td&gt;+14.7&lt;/td&gt;
&lt;td&gt;+32.8&lt;/td&gt;
&lt;td&gt;+9.2&lt;/td&gt;
&lt;td&gt;+12.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;-17.3&lt;/td&gt;
&lt;td&gt;+21.6&lt;/td&gt;
&lt;td&gt;+14.6&lt;/td&gt;
&lt;td&gt;+7.2&lt;/td&gt;
&lt;td&gt;+5.9&lt;/td&gt;
&lt;td&gt;+1.2&lt;/td&gt;
&lt;td&gt;+16.4&lt;/td&gt;
&lt;td&gt;+10.6&lt;/td&gt;
&lt;td&gt;+16.9&lt;/td&gt;
&lt;td&gt;+38.5&lt;/td&gt;
&lt;td&gt;-5.4&lt;/td&gt;
&lt;td&gt;+10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Reading the lift table.&lt;/strong&gt; A few observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;claude-opus-4-7 &lt;code&gt;node-core&lt;/code&gt;: -6.9.&lt;/strong&gt; Opus starts at 91.6 baseline on Node.js internals, the highest raw score on any skill for any model in the benchmark. Adding a skill that prescribes specific patterns for primordials and commit message format on top of a model that already knows the material produced interference, not uplift. The skill was written to close a gap that Opus doesn't have.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;gpt-5-codex &lt;code&gt;docs&lt;/code&gt;: -17.3.&lt;/strong&gt; The same skill that boosted gpt-5.3-codex by 33.4 points degraded gpt-5-codex by 17.3. The Diátaxis framework is highly prescriptive about structure: tutorial titles must start with verbs, reference sections must contain no instruction. gpt-5-codex starts at an 80.2 baseline for docs and already produces fluent, correct-seeming prose, so the skill's structural constraints appear to conflict with its default output style. A high baseline does not predict positive lift.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;gpt-5-codex &lt;code&gt;ts&lt;/code&gt;: -5.4.&lt;/strong&gt; Same pattern. A 66.5 baseline on TypeScript drops to 61.1 with the skill. The TypeScript skill enforces branded types and zero &lt;code&gt;any&lt;/code&gt;, rules that require restructuring code rather than extending it. For a model with established TypeScript habits, the prescriptive guidance appears to create noise rather than correct the specific gaps.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;claude-opus-4-7 &lt;code&gt;init&lt;/code&gt;: +2.6.&lt;/strong&gt; The lowest positive lift in the Opus row (other models have smaller positive cells, such as gpt-5.5's +0.9 on &lt;code&gt;ts&lt;/code&gt;). Claude Opus is the model that introduced the &lt;code&gt;AGENTS.md&lt;/code&gt; convention; it was already near-ceiling on this skill before any context was added.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;gpt-5.4 &lt;code&gt;snip&lt;/code&gt;: +40.0.&lt;/strong&gt; The single highest lift cell in the entire dataset. snipgrapher's private CLI documentation gives a model that knows nothing a complete specification for a tool it's never encountered. gpt-5.4's strong instruction-following amplifies that advantage cleanly.&lt;/li&gt;
&lt;/ul&gt;
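
&lt;p&gt;The baselines quoted in these observations follow from the two tables: a baseline is the with-skill score minus the lift. A minimal Python sketch of that arithmetic, using only cells from the gpt-5-codex rows. The function and variable names are illustrative and not part of the benchmark tooling:&lt;/p&gt;

```python
# With-skill scores and lifts for gpt-5-codex, copied from the tables above.
with_skill = {"docs": 62.9, "ts": 61.1}
lift = {"docs": -17.3, "ts": -5.4}


def baseline(skill):
    """Baseline score = with-skill score minus the lift the skill added."""
    return round(with_skill[skill] - lift[skill], 1)


def average_lift(cells):
    """Mean of a row of per-skill lifts, rounded to one decimal like the table."""
    return round(sum(cells) / len(cells), 1)


# The eleven per-skill lifts from the gpt-5-codex row of the lift table.
gpt5_codex_row = [-17.3, 21.6, 14.6, 7.2, 5.9, 1.2, 16.4, 10.6, 16.9, 38.5, -5.4]

print(baseline("docs"))              # 80.2
print(baseline("ts"))                # 66.5
print(average_lift(gpt5_codex_row))  # 10.0
```

&lt;p&gt;A negative lift therefore means the with-skill score landed below the baseline, which is the case for both gpt-5-codex cells shown here.&lt;/p&gt;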

&lt;h3&gt;
  
  
  The cost of running gpt-5.5 vs the alternatives
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost/run (with skill)&lt;/th&gt;
&lt;th&gt;Time (with skill)&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Score/$&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cursor:composer-2&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;152.0s&lt;/td&gt;
&lt;td&gt;89.6&lt;/td&gt;
&lt;td&gt;389&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;135.4s&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;298&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3-codex&lt;/td&gt;
&lt;td&gt;$0.44&lt;/td&gt;
&lt;td&gt;87.9s&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;$0.49&lt;/td&gt;
&lt;td&gt;89.5s&lt;/td&gt;
&lt;td&gt;89.4&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-opus-4-7&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;158.9s&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;$1.05&lt;/td&gt;
&lt;td&gt;136.2s&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
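
&lt;p&gt;The Score/$ column appears to be the with-skill score divided by the cost per run. A short Python sketch that recomputes it from the displayed cells, assuming that definition; the names are illustrative. Because the table shows costs rounded to the cent, recomputed values can differ from the printed column by about a point:&lt;/p&gt;

```python
# (cost per run in USD, average with-skill score), copied from the table above.
runs = {
    "cursor:composer-2": (0.23, 89.6),
    "gpt-5.4": (0.30, 89.3),
    "gpt-5.3-codex": (0.44, 83.9),
    "gpt-5.5": (0.49, 89.4),
    "claude-opus-4-7": (1.00, 93.4),
    "gpt-5-codex": (1.05, 78.7),
}


def score_per_dollar(cost, score):
    """Benchmark points bought per dollar of run cost (assumed definition)."""
    return score / cost


# Rank models from best to worst value per dollar.
ranked = sorted(runs, key=lambda m: score_per_dollar(*runs[m]), reverse=True)

for model in ranked:
    cost, score = runs[model]
    print(f"{model}: {score_per_dollar(cost, score):.1f}")
```

&lt;p&gt;Sorting by this ratio reproduces the row order of the table, with cursor:composer-2 first and gpt-5-codex last.&lt;/p&gt;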

&lt;h3&gt;
  
  
  More details about the 11 skills and scenarios
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;fastify-best-practices&lt;/code&gt;:&lt;/strong&gt; Fastify has strong opinions, and the skill encodes them. Scenarios: &lt;em&gt;Security Hardening for a Healthcare Web API&lt;/em&gt; (CORS scoped to two named origins, CSP + HSTS headers, HTTPS redirect; a wildcard &lt;code&gt;*&lt;/code&gt; or a missing header scores zero); &lt;em&gt;Authentication Service for a SaaS Platform&lt;/em&gt; (passwords migrated from bcrypt to argon2id, in-memory rate limiting replaced with Redis for multi-instance correctness, SIGTERM handled with &lt;code&gt;close-with-grace&lt;/code&gt;); &lt;em&gt;Protecting a Product Catalogue API from Overload&lt;/em&gt; (does it reach for &lt;code&gt;@fastify/under-pressure&lt;/code&gt; or invent its own backpressure loop?); &lt;em&gt;Order Management API with PostgreSQL&lt;/em&gt; (uses &lt;code&gt;@fastify/postgres&lt;/code&gt; with correct pool lifecycle, not raw &lt;code&gt;pg&lt;/code&gt;); &lt;em&gt;Consistent Error Handling for a Multi-Tenant SaaS API&lt;/em&gt; (typed &lt;code&gt;createError&lt;/code&gt;, uniform JSON shape, no stack traces to clients).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;node-best-practices&lt;/code&gt;:&lt;/strong&gt; The patterns in this skill diverge from what you'd find on Stack Overflow. Scenarios: &lt;em&gt;Hardening Logging in a Fintech API&lt;/em&gt; (pino must redact auth tokens and raw card fields before they reach the SIEM; masking after the fact doesn't count); &lt;em&gt;Webhook Receiver Service&lt;/em&gt; (structured logging of sensitive payment provider fields, graceful shutdown under concurrent in-flight requests); &lt;em&gt;Fix Throughput Degradation in a High-Load API Gateway&lt;/em&gt; (&lt;code&gt;dns.lookup()&lt;/code&gt; saturating the libuv thread pool; the fix is &lt;code&gt;dns.resolve4()&lt;/code&gt; and &lt;code&gt;UV_THREADPOOL_SIZE&lt;/code&gt;, not a caching layer); &lt;em&gt;High-Throughput Merchant DNS Routing Service&lt;/em&gt; (concurrent resolution under load, observable thread pool saturation).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;snipgrapher&lt;/code&gt;:&lt;/strong&gt; A custom internal CLI with a non-public API. The model has never seen its documentation. At baseline, every model is essentially guessing (avg 50-60/100). With the skill, agents either follow the spec or they don't. Scenarios: &lt;em&gt;Automating Changelog Snippet Images in CI&lt;/em&gt; (correct flag order, env var overrides, pipeline integration) and &lt;em&gt;Code Snippet Image Pipeline for Documentation Site&lt;/em&gt; (batch rendering, profile configuration). This skill delivers the highest lift in the benchmark for five of the six models, ranging from about 27 to 40 points; the one exception is cursor:composer-2, whose +28.6 on &lt;code&gt;node&lt;/code&gt; edges out its +26.8 here. The reason: it encodes knowledge that does not exist on the internet. Public skills are becoming less necessary as frontier models grow stronger. Private tooling is where skills still dominate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;typescript-magician&lt;/code&gt;:&lt;/strong&gt; Not "add types to this function." Scenarios: &lt;em&gt;Domain-Safe Payment Processing Types&lt;/em&gt; (branded types for &lt;code&gt;AccountId&lt;/code&gt;, &lt;code&gt;PaymentId&lt;/code&gt;, &lt;code&gt;RefundId&lt;/code&gt;; plain type aliases don't count, and &lt;code&gt;as&lt;/code&gt; casts score zero); &lt;em&gt;Product Catalog API for an E-Commerce Platform&lt;/em&gt; (TypeBox schemas inferred as TypeScript types end-to-end, internal cost fields stripped from public responses, no &lt;code&gt;any&lt;/code&gt;); &lt;em&gt;Eliminate &lt;code&gt;any&lt;/code&gt; from a Data Pipeline Utility Library&lt;/em&gt; (&lt;code&gt;tsc&lt;/code&gt; output captured before and after, zero &lt;code&gt;any&lt;/code&gt; remaining, no &lt;code&gt;@ts-ignore&lt;/code&gt;); &lt;em&gt;Project Bootstrap: Node.js TypeScript Service&lt;/em&gt; (native &lt;code&gt;--strip-types&lt;/code&gt;, no &lt;code&gt;ts-node&lt;/code&gt;, no build step, no &lt;code&gt;tsc&lt;/code&gt; in the start script).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;oauth&lt;/code&gt;:&lt;/strong&gt; Is the implicit flow explicitly removed? Is PKCE method S256? Is the refresh token replaced on rotation? Scenarios: &lt;em&gt;Add User Authentication to a Fastify API&lt;/em&gt; (full Authorization Code + PKCE flow with &lt;code&gt;@fastify/oauth2&lt;/code&gt;, state verification, token rotation); &lt;em&gt;OAuth Login Integration for a Fastify Web App&lt;/em&gt; (CSRF-hardened flow, &lt;code&gt;@fastify/session&lt;/code&gt; for state, correct cookie flags).&lt;/p&gt;
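
&lt;p&gt;The "PKCE method S256" check refers to the transform defined in RFC 7636: the code challenge is the unpadded base64url encoding of the SHA-256 digest of the code verifier. A minimal Python sketch of that transform using only the standard library. The helper names are illustrative, and the scenarios themselves use &lt;code&gt;@fastify/oauth2&lt;/code&gt; rather than hand-written helpers like these:&lt;/p&gt;

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes=32):
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def s256_challenge(verifier):
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()    # kept by the client, sent with the token request
challenge = s256_challenge(verifier)  # sent with the authorization request
print(len(verifier), len(challenge))  # 43 43
```

&lt;p&gt;The client sends the challenge with &lt;code&gt;code_challenge_method=S256&lt;/code&gt; up front and reveals the verifier only when exchanging the authorization code, so an intercepted code is useless on its own.&lt;/p&gt;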

&lt;p&gt;&lt;strong&gt;&lt;code&gt;linting-neostandard-eslint9&lt;/code&gt;:&lt;/strong&gt; ESLint v9's flat config is a breaking change. Scenarios checked whether agents actually migrated, not just created a new config alongside the old one. Is &lt;code&gt;.eslintrc.json&lt;/code&gt; gone? Is &lt;code&gt;standard&lt;/code&gt; removed from devDependencies? Scenarios: &lt;em&gt;Modernize the Linting Setup&lt;/em&gt; (two variants: &lt;code&gt;envparser&lt;/code&gt; open-source library and &lt;code&gt;payments-api&lt;/code&gt; service); &lt;em&gt;Add Linting to the Inventory Service&lt;/em&gt; (neostandard from scratch); &lt;em&gt;Set Up Automated Lint Enforcement&lt;/em&gt; (husky + lint-staged pre-commit hook, CI step that blocks on violations).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;documentation&lt;/code&gt;:&lt;/strong&gt; Based on the Diátaxis framework. The skill teaches agents when to write a tutorial vs a how-to vs reference vs explanation. Scenarios: &lt;em&gt;Restructure Documentation for a Configuration Library&lt;/em&gt; (sprawling &lt;code&gt;confz&lt;/code&gt; README split into four Diátaxis types, tutorial title must start with a verb, reference section must contain no instruction); &lt;em&gt;Getting Started Guide for a CLI Deployment Tool&lt;/em&gt; (&lt;code&gt;shipctl&lt;/code&gt; onboarding tutorial with Goal→Prerequisites→Numbered steps→Verifiable result structure, no conceptual digressions in the steps).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;init&lt;/code&gt;:&lt;/strong&gt; Writing &lt;code&gt;AGENTS.md&lt;/code&gt; / &lt;code&gt;CLAUDE.md&lt;/code&gt; files that actually help AI assistants. Scenarios: &lt;em&gt;Set Up Agent Instructions for a Growing Python Monorepo&lt;/em&gt; (3-year-old codebase, multiple service packages, identify the three constraints that cause the most agent damage); &lt;em&gt;Set Up Agent Instructions for a Node.js Monorepo&lt;/em&gt; (workspace-aware package manager, per-package test commands, legacy directory exclusion); &lt;em&gt;Audit and Slim Down a Bloated AGENTS.md&lt;/em&gt; (what to cut, what to keep, signal vs noise after a year of uncurated growth); &lt;em&gt;Set Up Agent Instructions for a Growing Monorepo&lt;/em&gt; (hierarchical root-level vs per-package instructions, discoverability filtering).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;octocat&lt;/code&gt;:&lt;/strong&gt; GitHub CLI patterns and correct flag usage. Scenarios: &lt;em&gt;Automate Feature Branch PR Submission&lt;/em&gt; (correct &lt;code&gt;gh pr create&lt;/code&gt; flags, CI polling with &lt;code&gt;gh run watch&lt;/code&gt;, merge only after checks pass); &lt;em&gt;Preparing Commits for a Node.js Core Module Contribution&lt;/em&gt; (subsystem prefix, 72-char subject, &lt;code&gt;Reviewed-By&lt;/code&gt; trailers, the format changelog toolbots parse); &lt;em&gt;Prepare Node.js Core Contribution Commits&lt;/em&gt; (backport workflow, correct metadata for automated release pipelines); &lt;em&gt;Automate Pull Request Workflow&lt;/em&gt; (reusable shell script, idempotent, surfaces CI failures before merge).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;nodejs-core&lt;/code&gt;:&lt;/strong&gt; Contributing to Node.js core: primordials, commit message format, native addons with &lt;code&gt;AsyncWorker&lt;/code&gt;. Scenarios: &lt;em&gt;Product Catalog Caching Service&lt;/em&gt; (async-cache-dedupe, concurrency control to prevent thundering herd on a rate-limited upstream); &lt;em&gt;Microservice Routing Layer: Latency Spike Investigation&lt;/em&gt; (diagnosing &lt;code&gt;UV_THREADPOOL_SIZE&lt;/code&gt; exhaustion, &lt;code&gt;dns.lookup()&lt;/code&gt; blocking the pool); &lt;em&gt;Diagnose and Fix V8 Performance Regression in Analytics Processor&lt;/em&gt; (&lt;code&gt;--prof&lt;/code&gt;, &lt;code&gt;--trace-opt&lt;/code&gt;, reading &lt;code&gt;isolate-*.log&lt;/code&gt;, acting on deoptimization reasons).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;skill-optimizer&lt;/code&gt;:&lt;/strong&gt; Meta: given a poorly-written skill or benchmark report, improve it or interpret it correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;gpt-5.5 is a better model than gpt-5.4 on raw capability, and on latency it is not close. For everything else, they are the same model at different price points. Pay the 63% premium if you need the speed. Skip it if you care about cost or value per dollar.&lt;/p&gt;

&lt;p&gt;The model to actually avoid is gpt-5.3-codex. It costs 47% more per run than gpt-5.4 and scores 5.4 points worse. If you are running gpt-5.3-codex today, the case for switching to gpt-5.4 is strong on both cost and performance.&lt;/p&gt;
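
&lt;p&gt;The 63% and 47% premiums and the 5.4-point gap can be recomputed from the cost table. A small Python sketch, with illustrative names and the per-run costs and scores copied from that table:&lt;/p&gt;

```python
def premium_pct(cost, reference_cost):
    """Extra cost of `cost` over `reference_cost`, as a whole-number percentage."""
    return round((cost - reference_cost) / reference_cost * 100)


# Cost per run (USD) and average with-skill score, from the cost table.
gpt_5_4 = {"cost": 0.30, "score": 89.3}
gpt_5_5 = {"cost": 0.49, "score": 89.4}
gpt_5_3_codex = {"cost": 0.44, "score": 83.9}

print(premium_pct(gpt_5_5["cost"], gpt_5_4["cost"]))        # 63
print(premium_pct(gpt_5_3_codex["cost"], gpt_5_4["cost"]))  # 47
print(round(gpt_5_4["score"] - gpt_5_3_codex["score"], 1))  # 5.4
```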

&lt;p&gt;Frontier models are becoming more self-sufficient. The ROI on domain skills is concentrating in genuinely proprietary knowledge: your internal APIs, your custom tooling, patterns that simply aren't on the internet. Snipgrapher lifted every model by 27 to 40 points because no model had ever seen its documentation. ESLint v9 flat config lifted them by 2 to 14 points because capable models already know it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post was written by &lt;a href="//uk.linkedin.com/in/simonmaple"&gt;Simon Maple&lt;/a&gt; and was originally published on &lt;a href="https://tessl.io/blog/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense/" rel="noopener noreferrer"&gt;tessl.io/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Simon Maple is the Head of Developer Relations at Tessl, and AI Native Dev co-host. Previously, Simon was the Field CTO, and VP Developer Relations at Snyk, ZeroTurnaround, and IBM. He became a Java Champion in 2014, JavaOne Rockstar speaker in 2014 and 2017, Duke’s Choice award winner, Virtual JUG founder and organiser, and London Java Community co-leader.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>openai</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
