<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tessl</title>
    <description>The latest articles on DEV Community by Tessl (@tessl-io).</description>
    <link>https://dev.to/tessl-io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3865880%2Fae4ef80f-404f-4ed5-849f-f94683a6e7b0.png</url>
      <title>DEV Community: Tessl</title>
      <link>https://dev.to/tessl-io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tessl-io"/>
    <language>en</language>
    <item>
      <title>Why Your Gemini Bill Doesn't Match the Model Names</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Mon, 15 Jun 2026 05:24:38 +0000</pubDate>
      <link>https://dev.to/tessl-io/why-your-gemini-bill-doesnt-match-the-model-names-9nk</link>
      <guid>https://dev.to/tessl-io/why-your-gemini-bill-doesnt-match-the-model-names-9nk</guid>
      <description>&lt;h2&gt;
  
  
  Why Your Gemini Bill Doesn't Match the Model Names
&lt;/h2&gt;

&lt;p&gt;tl;dr - &lt;em&gt;Across roughly 3,300 paired skill-eval runs, Gemini 3.5 Flash cost $1.05 per task against Gemini 3.1 Pro's $0.66, for scores that were effectively identical: 88.6 versus 87.9.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pricing is even stranger when you look at the actual task costs. &lt;strong&gt;Gemini 3.5 Flash&lt;/strong&gt; and &lt;strong&gt;Gemini 3 Flash Preview&lt;/strong&gt; are separated by almost 8× in per-task cost, while &lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; comes in cheaper than 3.5 Flash despite its higher list price. The invoice does not follow the naming hierarchy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the numbers come from
&lt;/h2&gt;

&lt;p&gt;The benchmark ran every task twice, once with the relevant skill applied and once without, across four Gemini models in OpenHands, totaling roughly 800 tasks per model. Rather than relying on dashboard estimates, we pulled per-call token counts directly from agent session logs and computed costs using Google's published per-token prices. We then compared the resulting per-task costs across models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The headline data
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;$/task (w/ skill)&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Pts per $&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Turns&lt;/th&gt;
&lt;th&gt;List $/Mtok&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3.1 Flash Lite&lt;/td&gt;
&lt;td&gt;$0.035&lt;/td&gt;
&lt;td&gt;70.2&lt;/td&gt;
&lt;td&gt;2,006&lt;/td&gt;
&lt;td&gt;0.31M&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 Flash Preview&lt;/td&gt;
&lt;td&gt;$0.135&lt;/td&gt;
&lt;td&gt;85.4&lt;/td&gt;
&lt;td&gt;633&lt;/td&gt;
&lt;td&gt;0.63M&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3.1 Pro Preview&lt;/td&gt;
&lt;td&gt;$0.66&lt;/td&gt;
&lt;td&gt;87.9&lt;/td&gt;
&lt;td&gt;132&lt;/td&gt;
&lt;td&gt;0.65M&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3.5 Flash&lt;/td&gt;
&lt;td&gt;$1.05&lt;/td&gt;
&lt;td&gt;88.6&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;1.41M&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things stand out from this data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Cost order and name order are uncorrelated. &lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; is cheaper per task than &lt;strong&gt;Gemini 3.5 Flash&lt;/strong&gt; despite carrying a higher per-token list price, while &lt;strong&gt;Gemini 3.5 Flash&lt;/strong&gt; and &lt;strong&gt;Gemini 3.1 Flash Lite&lt;/strong&gt;, which sit in the same product family, differ by roughly 30× in actual spend. Model names describe intended positioning, but they are a poor guide to real-world agent costs.&lt;/li&gt;
&lt;li&gt;  Scores do improve with each model generation, which is a genuine positive trend and a good reason to track releases, but capability gains do not automatically translate to cost reductions.&lt;/li&gt;
&lt;li&gt;  Finally, the practical value pick is Gemini 3 Flash Preview, which lands within about three points of the leading models at roughly one-fifth the per-task cost of 3.1 Pro and about one-eighth that of 3.5 Flash, making it the most efficient option for workloads where a score in the 85 range is acceptable.&lt;/li&gt;
&lt;/ul&gt;
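
&lt;p&gt;As a rough check on the table, the "Pts per $" column can be sketched as score divided by per-task cost. The helper name below is hypothetical, and because the inputs are the rounded table values, the result can differ from the published column by a point or so for the costlier models.&lt;/p&gt;

```python
def points_per_dollar(score, cost_per_task):
    """Score points obtained per dollar of per-task spend."""
    return score / cost_per_task


# Rounded (score, $/task with skill) pairs copied from the table above.
table = {
    "3.1 Flash Lite": (70.2, 0.035),
    "3 Flash Preview": (85.4, 0.135),
    "3.1 Pro Preview": (87.9, 0.66),
    "3.5 Flash": (88.6, 1.05),
}

for model, (score, cost) in table.items():
    print(f"{model}: {points_per_dollar(score, cost):.0f} points per dollar")
```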

&lt;h2&gt;
  
  
  Why volume beats unit price
&lt;/h2&gt;

&lt;p&gt;The cost of an agentic task is the product of two variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task cost = price-per-token × tokens the model decides to spend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Model names establish the first variable. The second is determined at runtime by the model's behavior on the specific task, and it only becomes visible after you read your session logs.&lt;/p&gt;

&lt;p&gt;For Gemini 3.5 Flash, the per-task cost breaks down as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Non-cached input: &lt;code&gt;$0.72&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Cache-read input: &lt;code&gt;$0.14&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Output (including thinking): &lt;code&gt;$0.19&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dominant driver is input volume. Gemini 3.5 Flash sent 1.41 million tokens of context across 39 agent turns per task. Pro sent roughly half that volume across 26 turns, and even at its higher list price of $2.00 per million tokens, its lower volume resolves to a lower total bill.&lt;/p&gt;
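
&lt;p&gt;A minimal sketch of that kind of breakdown, assuming per-call usage has already been summed from session logs. The field names, token counts, and rates below are hypothetical placeholders, not figures from the benchmark or from any real price list; substitute the provider's published prices.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts summed over every call in one task (hypothetical fields)."""
    uncached_input: int
    cached_input: int
    output: int  # includes thinking tokens


@dataclass
class Rates:
    """USD per million tokens; placeholder values, not a real price list."""
    uncached_input: float
    cached_input: float
    output: float


def task_cost(usage, rates):
    """Return the three cost components in USD plus their total."""
    parts = {
        "uncached_input": usage.uncached_input * rates.uncached_input / 1_000_000,
        "cached_input": usage.cached_input * rates.cached_input / 1_000_000,
        "output": usage.output * rates.output / 1_000_000,
    }
    parts["total"] = sum(parts.values())
    return parts


# Illustrative numbers only.
example = task_cost(
    Usage(uncached_input=400_000, cached_input=1_000_000, output=20_000),
    Rates(uncached_input=1.50, cached_input=0.15, output=9.00),
)
print(example)
```

&lt;p&gt;Keeping the three components separate makes it easy to see which one dominates a given task before comparing models.&lt;/p&gt;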

&lt;p&gt;A model with a cheaper per-token rate that takes more turns to reach an answer will erode its own discount. It is also worth noting that 63-75% of input across these runs was cache-read, which means the effective sensitivity to turn count is even higher than raw list prices suggest: the multiplier is accumulating in your session logs, not on your pricing page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills move cost by tier
&lt;/h2&gt;

&lt;p&gt;Adding a relevant skill to each run changed per-task cost in opposite directions depending on which model ran it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Pro&lt;/strong&gt; saw cost drop $0.20 per task (-23%) while the score gained 20 points. The model used fewer turns and less exploratory backtracking, which suggests it was able to act on the structured guidance directly rather than discovering the solution path through iteration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;3.5 Flash&lt;/strong&gt; was essentially flat, with cost shifting by less than $0.03 in either direction.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;3 Flash Preview and Flash Lite&lt;/strong&gt; each spent slightly more tokens for marginal score gains (+$0.03 and +$0.01 respectively).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The underlying pattern is consistent: a skill compresses the solution path for a model capable of following structured guidance precisely, reducing turn count and therefore total cost. For a model still resolving ambiguity through exploration, the same skill adds context to process rather than a shortcut to apply, and the cost holds steady or rises marginally. A skill is a shortcut for a capable model and overhead for a weaker one.&lt;/p&gt;

&lt;p&gt;In practical terms, this produces two clear operating points. Pro with a relevant skill at $0.66 per task is the most cost-efficient route to top-tier performance. Gemini 3 Flash Preview with a skill at $0.135 per task delivers roughly five to seven times the score-per-dollar of the two leaders, for a score about three points lower, which is a reasonable trade for many workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measure, don't assume
&lt;/h2&gt;

&lt;p&gt;Four takeaways from this data that apply beyond this specific benchmark:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1/ Do not budget from the rate card.&lt;/strong&gt; Cost your workload based on measured tokens and turns on your specific tasks, with your specific prompts, in your specific agent harness. Per-token list prices are a useful first filter for ordering candidates, not a reliable predictor of relative spend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2/ Read cost at the session layer.&lt;/strong&gt; Aggregate dashboards can show $0 while spend accumulates in the background. Token usage needs to come from raw API responses or agent session logs to be trusted for budgeting purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3/ Watch turn count first.&lt;/strong&gt; The 39-versus-26 turn gap between 3.5 Flash and Pro is the primary cause of the price inversion observed here, and turn count is the variable most commonly absent from observability tooling. It is the multiplier on everything else in the cost equation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4/ Re-measure when models update.&lt;/strong&gt; Gemini 3.5 Flash is a newer release than Gemini 3 Flash Preview and scores higher, but it costs roughly eight times more in this agentic context. Capability improvements and cost improvements are independent variables, and any cost benchmark needs to be re-run with each version update rather than assumed to hold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;These results come from a single agent harness (OpenHands), a single benchmark with explicit skill-relevance disclosure, and a specific sample window. Different tasks, prompt structures, and turn-length patterns will shift the absolute numbers and may shift the relative rankings. The finding to carry forward is not a specific model recommendation but a methodology: in agentic settings, cost rankings are not derivable from per-token rates alone, and the ranking that applies to your workload depends on that workload's specific behavioral profile.&lt;/p&gt;

&lt;p&gt;A model name is a pricing tier, not a cost forecast. In agentic workflows, the deciding variable is how many tokens the model chooses to spend to reach an answer, a figure visible only after you run the work and read the logs. The rate card gives you one of the two inputs; only measurement gives you both.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next: which skills actually earn their tokens? In these runs, 42% produced significant performance gains while 5% were net overhead. We’ll follow up on this analysis in the next post.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agentskills</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Claude Fable 5 vs Opus 4.8: The Mythos Hype Meets Reality</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Sun, 14 Jun 2026 06:39:18 +0000</pubDate>
      <link>https://dev.to/tessl/claude-fable-5-vs-opus-48-the-mythos-hype-meets-reality-od3</link>
      <guid>https://dev.to/tessl/claude-fable-5-vs-opus-48-the-mythos-hype-meets-reality-od3</guid>
      <description>&lt;p&gt;For months, the most interesting model at Anthropic was one we could not use. Mythos was the internal system the company said was too capable to release, the one that found software vulnerabilities at a level that tripped its own safety thresholds. On June 9, 2026, that tier went public for the first time, as Claude Fable 5. Opus 4.8, the model anchoring production coding agents, suddenly had a successor that's a full capability class above it.&lt;/p&gt;

&lt;p&gt;This raises two questions for anyone running coding agents. The practical one is whether you should move your fleet from Opus 4.8 to Fable 5. The bigger one is whether a Mythos-class model, the tier Anthropic held back as too capable to ship, lives up to what the name promised. This article answers both, and the numbers tell a more interesting story than the announcement did.&lt;/p&gt;

&lt;p&gt;We ran both models through the same evaluation, close to 1,000 shared scenarios scored twice each, once with no skill supplied and once with the relevant skill in context. The short answer, as of mid-2026, is that Opus 4.8 is still the better value for most agent fleets, and the gap between the Mythos hype and the measured reality is the real story in the data.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Mythos-class model is a tier of Claude that sits above the Opus class in capability&lt;/strong&gt;. It reaches a threshold Anthropic considers high-risk, particularly at discovering and exploiting software vulnerabilities. Fable 5 and Mythos 5 are the same underlying model with the same capabilities. What separates them is the safeguards: Fable 5 is the public version that ships with safety classifiers, while Mythos 5, restricted to approved partners, runs without them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the industry expected from a Mythos-class model
&lt;/h2&gt;

&lt;p&gt;Before launch, the speculation was not subtle. Across Reddit, X, and a run of explainer posts, Mythos was framed as the model that would change how agents work, not just how well they answer. The recurring predictions clustered around four capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Restructuring a large codebase in one coherent pass.&lt;/li&gt;
&lt;li&gt;  Spotting security flaws that experienced engineers miss.&lt;/li&gt;
&lt;li&gt;  Working unsupervised for hours on a single hard problem.&lt;/li&gt;
&lt;li&gt;  Acting like a collaborator, not an assistant you steer turn by turn.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of the four, the cybersecurity claim was the one with hard evidence behind it. Through Project Glasswing, roughly 50 early partners with Mythos Preview access reported finding more than 10,000 high or critical severity vulnerabilities, and the program has since expanded past 150 organizations. Anthropic's CPO Mike Krieger called it "the most capable class of systems we've built." That is the dream the name sold: a model so powerful it stayed in the lab.&lt;/p&gt;

&lt;p&gt;What reached the public is narrower, and deliberately so. The model you can actually use is Fable 5, the Mythos-class system wrapped in safety classifiers. Whether it delivers comes down to the gap between that promise and what was released.&lt;/p&gt;

&lt;h2&gt;
  
  
  The headline numbers: Claude Fable 5 vs Opus 4.8
&lt;/h2&gt;

&lt;p&gt;Every scenario in the evaluation is a real agent task tied to a published skill, scored on two axes: instruction-following (does the agent do what it was told, in the way it was told) and task-completion (does it reach the goal). The overall score weights instruction-following at 4 and task-completion at 3, then divides by 7. Each task runs with and without the skill, so the lift from the skill is visible directly. The tasks and skills are public, in the &lt;a href="https://huggingface.co/datasets/tesslio/task-evals-for-skills" rel="noopener noreferrer"&gt;task-evals-for-skills dataset&lt;/a&gt;, so you can inspect any scenario yourself.&lt;/p&gt;
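
&lt;p&gt;The weighting described above can be sketched in a few lines. The function name is hypothetical; the inputs are the with-skill sub-scores reported in the table in this section.&lt;/p&gt;

```python
def overall_score(instruction_following, task_completion):
    """Weighted overall score: instruction-following x4, task-completion x3, over 7."""
    return (4 * instruction_following + 3 * task_completion) / 7


# With-skill sub-scores from the table in this section.
fable = overall_score(89.3, 97.8)
opus = overall_score(88.0, 97.4)
print(round(fable, 1), round(opus, 1))
```

&lt;p&gt;Rounded to one decimal, these come out at 92.9 and 92.0, matching the reported overall scores.&lt;/p&gt;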

&lt;p&gt;This design is deliberate. The tasks come from published skills, so they mirror the real work teams write skills for, not frontier puzzles meant to find a model's ceiling. That is why task-completion runs high for both models and why the signal that separates them is instruction-following: doing the work the specific way the skill asks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension (with skill)&lt;/th&gt;
&lt;th&gt;Fable 5&lt;/th&gt;
&lt;th&gt;Opus 4.8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overall score&lt;/td&gt;
&lt;td&gt;92.9&lt;/td&gt;
&lt;td&gt;92.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall score (no skill, baseline)&lt;/td&gt;
&lt;td&gt;75.7&lt;/td&gt;
&lt;td&gt;74.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall lift from the skill&lt;/td&gt;
&lt;td&gt;+17.2&lt;/td&gt;
&lt;td&gt;+17.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instruction-following&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;88.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task-completion&lt;/td&gt;
&lt;td&gt;97.8&lt;/td&gt;
&lt;td&gt;97.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turns to complete&lt;/td&gt;
&lt;td&gt;16.9&lt;/td&gt;
&lt;td&gt;16.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens per task&lt;/td&gt;
&lt;td&gt;9,025&lt;/td&gt;
&lt;td&gt;10,687&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;List price (input / output, per MTok)&lt;/td&gt;
&lt;td&gt;$10 / $50&lt;/td&gt;
&lt;td&gt;$5 / $25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per task (average)&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$0.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Points per dollar&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;125&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On the 917 scenarios both models ran, Fable 5 leads on overall score by 0.9 points (92.9 to 92.0). Scenario by scenario, the two tie on 61% of tasks, Fable wins 24%, and Opus wins 16%, at a two-point threshold. A capability class above Opus, and on everyday agent skill tasks the quality difference is inside the noise.&lt;/p&gt;
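
&lt;p&gt;A sketch of how a scenario-by-scenario tally at a two-point threshold might be computed. The function names and sample scores are hypothetical, and treating a gap of exactly two points as a tie is an assumption, since the write-up does not say which side of the threshold it falls on.&lt;/p&gt;

```python
from collections import Counter


def outcome(fable_score, opus_score, threshold=2.0):
    """Call a scenario a tie unless one model leads by more than the threshold."""
    diff = fable_score - opus_score
    if diff > threshold:
        return "fable"
    if -diff > threshold:
        return "opus"
    return "tie"


def win_rates(pairs, threshold=2.0):
    """Share of scenarios per outcome, given (fable_score, opus_score) pairs."""
    counts = Counter(outcome(f, o, threshold) for f, o in pairs)
    return {k: counts[k] / len(pairs) for k in ("tie", "fable", "opus")}


# Hypothetical per-scenario scores, for illustration only.
sample = [(95.0, 94.0), (90.0, 80.0), (70.0, 85.0), (88.0, 88.5)]
print(win_rates(sample))
```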

&lt;p&gt;One caveat sits underneath that number. The 917 are the tasks both models completed and scored. Fable 5 refused 26 that Opus 4.8 finished, and we excluded them, so the near-tie is measured only on the tasks Fable agreed to do. That exclusion turns out to be the most revealing part of the comparison, and we return to it below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why agent skill evaluation matters more than the model upgrade
&lt;/h2&gt;

&lt;p&gt;Here is the number that reframes the comparison. The skill adds about 17 overall points to both models: +17.2 for Fable 5 and +17.5 for Opus 4.8. The model upgrade from Opus 4.8 to Fable 5 adds less than 1 point on shared tasks. The context you supply moves the agent far more than the frontier tier you pick.&lt;/p&gt;

&lt;p&gt;The lift concentrates in instruction-following, where both models gain more than 27 points from the skill, while task-completion gains under 5. Both models can usually reach the goal on their own. What they cannot do reliably without a skill is follow the specific conventions, constraints, and steps a real task demands. That is what a good skill encodes.&lt;/p&gt;

&lt;p&gt;Skill receptivity is how much an agent's output improves when you supply a relevant skill. It shows up mostly as better instruction-following. It matters because it can outweigh the model choice, which is the practical case for investing in &lt;a href="https://tessl.io/registry" rel="noopener noreferrer"&gt;agent skills&lt;/a&gt; before chasing the newest tier. Running the same task with and without the skill, then measuring the difference, is a task eval. It is also the only way to know whether a model upgrade earns its price on your workload, which is what &lt;a href="https://tessl.io/blog/introducing-task-evals-measure-whether-your-skills-actually-work/" rel="noopener noreferrer"&gt;agent skill evaluation&lt;/a&gt; is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The price gap is the deciding factor for most teams
&lt;/h2&gt;

&lt;p&gt;On the agent skill tasks we measured, the trade comes down to paying a steep premium for a marginal gain. Fable 5 lists at $10 per million input tokens and $50 per million output tokens against Opus 4.8's $5 and $25, exactly twice across every token category, including cache reads and writes. For that, across our 917 shared scenarios, you get an overall score of 92.9 versus 92.0, a 0.9-point edge that sits well inside the range where the two are interchangeable. This is the everyday-agent-work picture, not a verdict on the marquee Mythos capabilities our eval does not test.&lt;/p&gt;

&lt;p&gt;Token behavior softens the unit price but does not close it. Across the 917 shared scenarios Fable 5 generated about 16% fewer output tokens per task (9,025 versus 10,687), so the real cost per task lands at $1.25 against $0.74, a 73% premium rather than a clean 2x. The value gap is the number to remember: Opus 4.8 returns 125 points per dollar to Fable 5's 74, about 69% more quality for every dollar spent.&lt;/p&gt;

&lt;p&gt;For a single session the difference is cents. For a fleet running thousands of agent tasks a day, it is the line item your finance team will ask about, and twice the price for under a point of quality on the tasks most teams actually run is not an easy answer to give them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fable refuses work Opus completes without issues
&lt;/h2&gt;

&lt;p&gt;The most consequential difference between Fable 5 and Opus 4.8 is not on the scoreboard. It is the safety layer that defines the Mythos class.&lt;/p&gt;

&lt;p&gt;Fable 5 ships with safeguards covering four domains: cybersecurity, biology and chemistry, distillation, and frontier LLM development. For the first three, a triggered request comes back as a refusal. Anthropic's design hands the request to Opus 4.8 and informs the user, but that fallback is opt-in rather than a default, so in a stock harness like ours the blocked requests were simply refused.&lt;/p&gt;

&lt;p&gt;The fourth domain worked differently during this run. By Anthropic's own documentation, requests touching frontier AI development were not refused or even flagged. The model quietly steered or adjusted its answer instead, with no notice to the user. That silent manipulation drew the sharpest backlash, and on June 11, the day after this run, Anthropic switched it to a visible classifier like the other three while conceding the restrictions had been "overly conservative." Because it never produced a refusal, that domain leaves no mark in our numbers; any effect would surface only as quietly weaker answers.&lt;/p&gt;

&lt;p&gt;A Mythos-class model routes some requests to a weaker model by design, so your harness needs to detect the fallback rather than trust that every response came from Fable. And the affected domains are exactly the ones you most want to check yourself, which is the practical edge of &lt;a href="https://tessl.io/blog/the-tessl-registry-now-has-security-scores-powered-by-snyk/" rel="noopener noreferrer"&gt;context governance and security&lt;/a&gt;: catch the regression in an eval, not in production.&lt;/p&gt;

&lt;p&gt;Our run shows how that plays out, and it is not flattering. Fable 5 refused 26 of the roughly 940 tasks it attempted, returning a usage-policy block with a refusal stop reason instead of doing the work, while Opus 4.8 completed and scored every one of them. What Fable refused is the revealing part. Four were defensive security reviews, including "review this Flask application for security vulnerabilities before deploying it," blocked as "violative cyber content." Five were routine bioinformatics tasks, such as running quality control on a single-cell RNA-seq file. One was a literature review on the landscape of AI-assisted drug discovery. A model from the class Anthropic markets for finding vulnerabilities in critical software declined to audit a Flask app for the developer who owns it. Anthropic's own "overly conservative" admission lands hardest here.&lt;/p&gt;
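
&lt;p&gt;One way a harness might separate these cases is sketched below. It assumes each response is available as a dict with a "stop_reason" field and a "model" field naming the model that actually served the request; those field names, the "refusal" value, and the model identifiers are assumptions for illustration, so check them against the responses your own harness receives.&lt;/p&gt;

```python
def classify_response(response, requested_model):
    """Label a response dict as 'refused', 'fallback', or 'ok'."""
    # A refusal stop reason means the work was not done at all.
    if response.get("stop_reason") == "refusal":
        return "refused"
    # A different serving model suggests the request was rerouted.
    served_by = response.get("model", "")
    if not served_by.startswith(requested_model):
        return "fallback"
    return "ok"


# Hypothetical model identifiers and response payloads.
print(classify_response({"model": "fable-5", "stop_reason": "end_turn"}, "fable-5"))
print(classify_response({"model": "fable-5", "stop_reason": "refusal"}, "fable-5"))
print(classify_response({"model": "opus-4.8", "stop_reason": "end_turn"}, "fable-5"))
```

&lt;p&gt;Counting the three labels per skill or per task family in an eval run shows where refusals and reroutes cluster before they reach production.&lt;/p&gt;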

&lt;p&gt;On the security tasks Fable did complete, it was competitive. Across 51 authentication and security skill scenarios, from Auth0, Better Auth, and Bitwarden, Fable 5 averaged 95.0 with the skill against Opus 4.8's 96.6, a near-tie. The lesson is not that one model is safe and the other is not. It is that a Mythos-class model will sometimes refuse the defensive work you most need done, and only an eval on your own tasks will tell you where.&lt;/p&gt;

&lt;h2&gt;
  
  
  Did Fable deliver on the Mythos promise?
&lt;/h2&gt;

&lt;p&gt;Our evaluation answers the question that matters for a deployment decision: how both models handle hundreds of real, skill-driven agent tasks across dozens of tool ecosystems, which is the work most teams actually run coding agents on. The marquee Mythos feats sit outside this eval, but the day-to-day behavior it captures is exactly what you are buying when you point a fleet at a model.&lt;/p&gt;

&lt;p&gt;What the data does show is where Fable's extra capability surfaces in normal use. Grouped by the organization that owns the skill, Fable 5 pulls ahead on web-research and scraping workloads: Apify (+7.8 overall), Google Gemini (+4.6), Tavily (+3.4), and Firecrawl (+2.7). If your agents fetch, map, and extract from the open web, Fable 5 is the stronger pick. Opus 4.8 holds its ground where Fable regresses: Mastra (-7.3), Auth0 (-4.5), and Axiom (-2.5).&lt;/p&gt;

&lt;p&gt;So the Mythos dream of an autonomous collaborator is not what most teams will buy on day one. What they will buy is a model that is marginally better at instruction-following, meaningfully better at web research, twice the price, and gated by classifiers that occasionally hand the job to Opus 4.8 anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use each
&lt;/h2&gt;

&lt;p&gt;Choose Opus 4.8 if you run a coding-agent fleet at scale and care about cost per task. The quality difference is inside the noise for most workloads, Opus returns far more points per dollar, and it has no fallback layer to design around.&lt;/p&gt;

&lt;p&gt;Choose Fable 5 if your agents do heavy web research and scraping, if you need its reasoning depth on long-horizon tasks, or if you have a workload that genuinely benefits from the capability class above Opus. Budget for the roughly 73% per-task premium, and build fallback detection into your harness from day one. If your work touches the classifier domains, confirm the model is not silently routing to Opus 4.8 before you depend on it.&lt;/p&gt;

&lt;p&gt;Fable's edge shows up when you build around it, not when you swap it into an Opus 4.8 pipeline unchanged. Fable is the more autonomous model, but that edge only pays off in flows built for it: longer unsupervised runs, larger units of work, less step-by-step steering.&lt;/p&gt;

&lt;p&gt;For almost everyone, the larger lever is neither model. The skill adds about 17 points; the model upgrade adds less than 1. Standardize the model in your tessl.json, prove the switch with an eval before you roll it to the fleet, and watch for the tasks a Mythos-class model quietly declines to do.&lt;/p&gt;

&lt;p&gt;Want to see how a skill changes your own agent's behavior, on your own tasks, across both models? Start with the &lt;a href="https://tessl.io/registry" rel="noopener noreferrer"&gt;Tessl Registry&lt;/a&gt; and run the eval before you switch.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>The State of the AI Coding Stack: Agent Skills, Harnesses, and Enablement at AI Native DevCon London 2026</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Fri, 12 Jun 2026 08:08:04 +0000</pubDate>
      <link>https://dev.to/tessl-io/the-state-of-the-ai-coding-stack-agent-skills-harnesses-and-enablement-at-ai-native-devcon-con</link>
      <guid>https://dev.to/tessl-io/the-state-of-the-ai-coding-stack-agent-skills-harnesses-and-enablement-at-ai-native-devcon-con</guid>
      <description>&lt;p&gt;Curating a conference is a way of taking the industry's temperature. The choices (what gets a keynote, what gets a breakout, what nobody bothers arguing about anymore) tell you roughly where things actually stand. AI Native DevCon London wrapped on June 2nd. Two days, three tracks, 41 talks. Here's what the program revealed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context and skills — the new unit of software
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=KpfnldjO3Iw" rel="noopener noreferrer"&gt;Guy Podjarny&lt;/a&gt;'s opening keynote at &lt;a href="https://tessl.io/" rel="noopener noreferrer"&gt;Tessl&lt;/a&gt; was a state-of-the-industry overview: where AI-native development actually stands, and where the stack is heading. The &lt;a href="https://tessl.io/blog/context-development-lifecycle-better-context-for-ai-coding-agents/" rel="noopener noreferrer"&gt;Context Development Lifecycle&lt;/a&gt; he laid out named the full layers of the emerging development stack: tools giving models arms and legs, context and skills guiding them, harnesses as deterministic frameworks constraining the probabilistic model, and harnesses composing into factory lines and pipelines, all the way up to full automated development processes. At scale, governance follows: knowing which skills are in use across an organisation, whether they are secure, who owns them. The argument he closed on was that humans should now be living in the CDLC and leaving the SDLC to the agents. The two days that followed played out largely as confirmation of that picture, with each section of the program filling in a different layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3r64e92kpxj30dr84yt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3r64e92kpxj30dr84yt.png" alt="guy" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The baseline he started from was already a shift from last year. A meaningful chunk of the previous program had been making the case for context engineering: why it matters, how to produce it, who owns it. This year those sessions were gone. A skill, in the CDLC, is context with a defined boundary: named, versioned, testable, installable. The same discipline applied to libraries for twenty years, now applied one rung higher. &lt;a href="https://docs.tessl.io/" rel="noopener noreferrer"&gt;"Skills are the New Code"&lt;/a&gt; wasn't a claim to debate; it was the premise the conference built from.&lt;/p&gt;

&lt;p&gt;Which means the problems that come after that premise are the ones that filled the schedule. The most immediate is sprawl: the moment an organisation has more context files than anyone can track, with no versioning, no approval flows, and no shared source of truth, it has a supply chain problem, which is exactly what &lt;a href="https://www.youtube.com/watch?v=6VRKZQ3pmoU" rel="noopener noreferrer"&gt;James Moss&lt;/a&gt; at &lt;a href="https://tessl.io/" rel="noopener noreferrer"&gt;Tessl&lt;/a&gt; walked through. &lt;a href="https://www.youtube.com/watch?v=78KQVxMTAQ4" rel="noopener noreferrer"&gt;John Groetzinger&lt;/a&gt; at &lt;a href="https://cisco.com/" rel="noopener noreferrer"&gt;Cisco&lt;/a&gt; showed the other end of that: knowledge pipelined so that engineers can read it and agents can consume it from the same source, without divergence. Meanwhile the protocol that moves context between systems, MCP, is still being actively shaped, and &lt;a href="https://www.youtube.com/watch?v=nfwNjmZSKMY" rel="noopener noreferrer"&gt;Shaun Smith&lt;/a&gt; from &lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; mapped where the choices being made now will determine how much of the problem it can actually solve.&lt;/p&gt;

&lt;p&gt;Skills also showed up in surfaces nobody was quite thinking about yet. &lt;a href="https://www.youtube.com/watch?v=7Sy2gSoxu7o" rel="noopener noreferrer"&gt;Steve Ruiz&lt;/a&gt; at &lt;a href="https://tldraw.com/" rel="noopener noreferrer"&gt;tldraw&lt;/a&gt; demonstrated the canvas as a live build environment where agents and humans work side by side on the same shared space, creating shapes, annotating, iterating in real time, with multiple agent instances running concurrently on the same document. &lt;a href="https://www.youtube.com/watch?v=Uo-Y7AtPlas" rel="noopener noreferrer"&gt;Lars Trieloff&lt;/a&gt; at &lt;a href="https://adobe.com/" rel="noopener noreferrer"&gt;Adobe&lt;/a&gt; went further with his browser-native agent project: rather than putting the agent in a sidebar alongside a web app, the agentic loop runs inside the browser tab itself, controlling the browser from within. The app becomes the sidebar. The agent is the primary surface.&lt;/p&gt;

&lt;p&gt;And once skills are a shipped artifact, they inherit what every shipped artifact has: a supply chain and an attack surface. &lt;a href="https://www.youtube.com/watch?v=oJGX8GYLWxg" rel="noopener noreferrer"&gt;Liran Tal&lt;/a&gt; at &lt;a href="https://snyk.io/" rel="noopener noreferrer"&gt;Snyk&lt;/a&gt; brought the data: scanning publicly circulating skills on ClawHub, his team found that roughly one in seven had security issues, including malware distribution, credential harvesting, and known vulnerabilities embedded in SKILL.md files that agents were reading and trusting without verification. He called the attack pattern &lt;em&gt;toxic flows&lt;/em&gt; and drew the parallel to the early npm ecosystem: the same supply chain problems, the same trust assumptions, now in natural language rather than code. The security problem didn't disappear when the industry moved from shipping code to shipping context. It followed.&lt;/p&gt;

&lt;p&gt;None of which matters much if skills don't actually work. &lt;a href="https://www.youtube.com/watch?v=4d3-Zrmf9Wo" rel="noopener noreferrer"&gt;Simon Obstbaum&lt;/a&gt; from Stanford's Software Engineering Productivity Research Group and &lt;a href="https://www.youtube.com/watch?v=4d3-Zrmf9Wo" rel="noopener noreferrer"&gt;Rob Willoughby&lt;/a&gt; at &lt;a href="https://tessl.io/" rel="noopener noreferrer"&gt;Tessl&lt;/a&gt; brought data: productivity measurements across 150,000 engineers, and what agent instruction-following actually looks like when you run 500 skills against 1,000 tasks systematically. Skills that look reasonable in isolation behave differently at scale and across model-harness combinations. &lt;a href="https://www.linkedin.com/in/jbaruch/" rel="noopener noreferrer"&gt;Baruch Sadogursky&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/macebake/" rel="noopener noreferrer"&gt;Macey Baker&lt;/a&gt; from &lt;a href="https://tessl.io/" rel="noopener noreferrer"&gt;Tessl&lt;/a&gt; brought the practical counterpart in their workshop, &lt;em&gt;Don't Write Prompts, Write Software&lt;/em&gt;, which ran the same premise as a hands-on exercise rather than a talk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Harness engineering
&lt;/h2&gt;

&lt;p&gt;If skills are the new code, the harness is the new framework. Guy Podjarny had named it in his keynote: &lt;a href="https://tessl.io/blog/context-development-lifecycle-better-context-for-ai-coding-agents/" rel="noopener noreferrer"&gt;"deterministic software that wraps a probabilistic model,"&lt;/a&gt; sitting above models, tools, and context alike. Not every team will build their own, but more and more will substantially customise one, and the conference's most technically dense track was about what that looks like in practice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=c8bE0cj7vHY" rel="noopener noreferrer"&gt;Ryan Lopopolo&lt;/a&gt; at &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; put forward the term &lt;em&gt;harness engineering&lt;/em&gt; and defined it precisely: "making context around what it means to do a good job legible, and then just-in-time surfaced to the agent over the course of its trajectories." The harness is the deterministic layer around the probabilistic model, the part that can be reasoned about, tested, and enforced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw87yflphkrz2b1kbkffv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw87yflphkrz2b1kbkffv.png" alt="ryan lopopolo" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key insight in Ryan's talk was about where to spend human attention, not where to put automated checks. He keeps himself out of the loop during a run. Let the agent execute, wait for the PR. When the PR is wrong, the right response isn't to fix it: it's to encode the problem as a permanent guardrail so it can't recur. "I never want to give the same review feedback twice." The harness progressively shifts those checks left into lints, tests, and reviewer agents; the human stays at the end, treating the agent the way they'd treat any teammate whose code needs a convincing argument before it merges.&lt;/p&gt;

&lt;p&gt;The guardrails work at several levels. &lt;a href="https://www.youtube.com/watch?v=gc5_ICZg9tg" rel="noopener noreferrer"&gt;Joseph Katsioloudes&lt;/a&gt; at &lt;a href="https://github.com/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; covered how AI changes the security equation from both sides: helping find and fix vulnerabilities faster, while also raising the stakes when generation outpaces review. &lt;a href="https://www.youtube.com/watch?v=bFBNXIoLkW4" rel="noopener noreferrer"&gt;Oleg Šelajev&lt;/a&gt; at &lt;a href="https://docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; covered the execution environment itself: a running agent needs a real sandbox, not good intentions, and that sandbox belongs in the harness design from day one rather than as a later security addition. &lt;a href="https://www.youtube.com/watch?v=PsEnuv3S5I0" rel="noopener noreferrer"&gt;Luke Marsden&lt;/a&gt; at &lt;a href="https://helix.ml/" rel="noopener noreferrer"&gt;HelixML&lt;/a&gt; went further in practice, giving each agent its own full desktop and reporting honestly on what production infrastructure actually looks like when agents run on it.&lt;/p&gt;

&lt;p&gt;The browser is becoming part of this layer too. &lt;a href="https://www.youtube.com/watch?v=BV7RYioryKE" rel="noopener noreferrer"&gt;Maximiliano Firtman&lt;/a&gt;, Founder &amp;amp; Professor of &lt;a href="https://codemia.io/" rel="noopener noreferrer"&gt;Codemia&lt;/a&gt;, presented WebMCP, which was in a Chrome 149 origin trial at the time of the conference: a standard that lets a page declare its own tools so an agent can call them directly rather than scraping the DOM. Rather than guessing at the UI from pixels and selectors, the website author writes the contract; the agent works from that contract instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dark factory: agent-first development at scale
&lt;/h2&gt;

&lt;p&gt;Follow the logic of the harness far enough, with humans providing intent at the start and guardrails ensuring quality throughout, and you reach the endpoint several talks were already operating from. &lt;a href="https://www.youtube.com/watch?v=wuGJNWhUOoE" rel="noopener noreferrer"&gt;Paul Stack&lt;/a&gt; at &lt;a href="https://swamp-club.com/" rel="noopener noreferrer"&gt;Swamp Club&lt;/a&gt; was direct about it from the first slide: "The distinction here is between writing code and building the machine that writes the code." Not a single line of production code written by hand since January. The company threw away its previous codebase and started fresh. Agents write every line, pull requests from humans are rejected outright, the team works on design constraints, architecture, and operational guidelines. Changes arrive as issues to discuss. The factory floor runs, and the humans are at the edges.&lt;/p&gt;

&lt;p&gt;That pattern showed up at every scale. &lt;a href="https://www.youtube.com/watch?v=kbvqRWY-bUs" rel="noopener noreferrer"&gt;Don Syme&lt;/a&gt; at &lt;a href="https://githubnext.com/" rel="noopener noreferrer"&gt;GitHub Next&lt;/a&gt; demonstrated it at repository level with automated triage, maintenance, and contribution workflows for an open-source project, with humans staying in the loop via PRs and issues while agents handle the throughput. At the enterprise end, &lt;a href="https://www.youtube.com/watch?v=mB74LGAmmV0" rel="noopener noreferrer"&gt;Daniel Jones&lt;/a&gt; from &lt;a href="https://re-cinq.com/" rel="noopener noreferrer"&gt;ReCinq&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=mB74LGAmmV0" rel="noopener noreferrer"&gt;Tomasz Maj&lt;/a&gt; at &lt;a href="https://odevo.com/" rel="noopener noreferrer"&gt;Odevo&lt;/a&gt; walked through Odevo's transformation: a case study in what it actually looks like when an engineering team stops writing every line and starts setting direction, with developers reporting months without writing production code themselves.&lt;/p&gt;

&lt;p&gt;The argument is usually made about greenfield systems. &lt;a href="https://www.youtube.com/watch?v=5SKh-FmjX7U" rel="noopener noreferrer"&gt;Katie Roberts&lt;/a&gt; at &lt;a href="https://nearform.com/" rel="noopener noreferrer"&gt;Nearform&lt;/a&gt; addressed the harder and more common case: code that already exists and can't be started over. What does it actually mean to apply AI-native practices to a brownfield codebase? Stop maintaining, start evolving, but the evolution has to start somewhere real. A last-minute panel featuring &lt;a href="https://www.youtube.com/watch?v=1grkxo4cyKY" rel="noopener noreferrer"&gt;Stephane Jourdan&lt;/a&gt; from &lt;a href="https://anyshift.io/" rel="noopener noreferrer"&gt;AnyShift&lt;/a&gt;, &lt;a href="https://www.youtube.com/watch?v=1grkxo4cyKY" rel="noopener noreferrer"&gt;Simon Rohrer&lt;/a&gt; from &lt;a href="https://home.saxo/" rel="noopener noreferrer"&gt;Saxo Bank&lt;/a&gt;, and &lt;a href="https://www.youtube.com/watch?v=1grkxo4cyKY" rel="noopener noreferrer"&gt;Pini Reznik&lt;/a&gt; from &lt;a href="https://re-cinq.com/" rel="noopener noreferrer"&gt;ReCinq&lt;/a&gt; tackled the organisational side of that same shift, moving from DevOps pipelines to prompt-driven workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Product/engineer interaction
&lt;/h2&gt;

&lt;p&gt;When humans sit at the ends of an automated pipeline, what they hand in and what they take out carries all the weight. That made the product/engineer interface one of the most contested areas of the program: not whether it's changing, but how fast and in whose favour.&lt;/p&gt;

&lt;p&gt;The PM's role is the most visibly changed. Rather than writing requirements for engineers, &lt;a href="https://www.youtube.com/watch?v=M6Nhl4zR0uk" rel="noopener noreferrer"&gt;Emma Burrows&lt;/a&gt; at &lt;a href="https://rezonant.io/" rel="noopener noreferrer"&gt;Rezonant&lt;/a&gt; argued the job becomes building a &lt;em&gt;product brain&lt;/em&gt;, a structured, queryable knowledge base that agents can draw from, and orchestrating from above. Leverage comes from how well the product thinking is encoded, not from how many engineers receive it.&lt;/p&gt;

&lt;p&gt;The spec is the other piece. &lt;a href="https://www.youtube.com/watch?v=aWrGSM5vVyc" rel="noopener noreferrer"&gt;Shachar Azriel&lt;/a&gt; at &lt;a href="https://baz.co/" rel="noopener noreferrer"&gt;Baz&lt;/a&gt; made requirements executable: the same document that drives the build becomes the verification layer that checks whether the output is right. &lt;a href="https://www.youtube.com/watch?v=odbNXv9xXjc" rel="noopener noreferrer"&gt;Simon Martinelli&lt;/a&gt;, a consultant with seventeen years of enterprise practice, brought the practitioner's view of what spec-driven development looks like applied to large-scale modernisation: extracting use cases from legacy code, cutting team sizes, replacing sprint cycles with continuous flow. &lt;a href="https://www.linkedin.com/in/alfonso-graziano/" rel="noopener noreferrer"&gt;Alfonso Graziano&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/steve-goode-2488853/" rel="noopener noreferrer"&gt;Steve Goode&lt;/a&gt; from &lt;a href="https://nearform.com/" rel="noopener noreferrer"&gt;Nearform&lt;/a&gt; took the same theme into a workshop, &lt;em&gt;Spec-Driven Development: From Prompting to Production-Ready Systems&lt;/em&gt;, running teams through the full process themselves.&lt;/p&gt;

&lt;p&gt;The design and product constraints on agents are less discussed but equally real. &lt;a href="https://www.youtube.com/watch?v=tf6VNGH3tRk" rel="noopener noreferrer"&gt;Marc Sloan&lt;/a&gt; at &lt;a href="https://tessl.io/" rel="noopener noreferrer"&gt;Tessl&lt;/a&gt; asked what agents need from product and design to produce work that's actually usable rather than just technically correct. &lt;a href="https://www.youtube.com/watch?v=Ex1Zu0qel8M" rel="noopener noreferrer"&gt;Matthias Lübken&lt;/a&gt; approached the same question from the embedding side, with a concrete client case: a business workflow product for processing after-sales email enquiries, where the coding agent primitives had to be adapted for a domain that has nothing to do with software development, covering tool design, session lifecycle, and output contracts.&lt;/p&gt;

&lt;p&gt;In practice, the signal that teams are navigating this well shows up in metrics. &lt;a href="https://www.youtube.com/watch?v=KzhjnILSP0Y" rel="noopener noreferrer"&gt;Tammuz Dubnov&lt;/a&gt; at &lt;a href="https://autonomyai.io/" rel="noopener noreferrer"&gt;Autonomy AI&lt;/a&gt; found that when their PM started writing code directly, merge rate became the number that told them whether adaptation was actually happening. &lt;a href="https://www.youtube.com/watch?v=tJUAef_dBtU" rel="noopener noreferrer"&gt;Christopher Batey&lt;/a&gt; at &lt;a href="https://www.cecg.io/" rel="noopener noreferrer"&gt;Core Engineering Consulting Group&lt;/a&gt; documented what product teams kept having to relearn every quarter as the dynamics between product and engineering shifted under their feet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification: evaluating what AI coding agents produce
&lt;/h2&gt;

&lt;p&gt;Generating code at pace has been solved. Trusting what was generated hasn't.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=libNzUdL9eM" rel="noopener noreferrer"&gt;Dave Farley&lt;/a&gt; at &lt;a href="https://www.youtube.com/c/ContinuousDelivery" rel="noopener noreferrer"&gt;Continuous Delivery&lt;/a&gt; put the challenge to the room plainly. He wrote &lt;em&gt;Continuous Delivery&lt;/em&gt; and has been watching software engineering practices for decades, and his question was simple: is vibe coding actually the best approach the industry has arrived at? The engineering practices that make software trustworthy haven't changed: small changes, tight feedback loops, verification at each step. Agents change how fast you produce; they don't change whether you need to be able to trust what you produced. His summary was blunt: "We sped up the coding bit. That was the easy part of software development."&lt;/p&gt;

&lt;p&gt;The problem is that the tools for establishing that trust haven't kept pace. &lt;a href="https://www.youtube.com/watch?v=xHxfeWtkXrM" rel="noopener noreferrer"&gt;Justin Cormack&lt;/a&gt;, formerly CTO at &lt;a href="https://docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;, looked at what happens when tests pass but the agent has still produced something wrong (the lying tests problem), and argued that the answer is observability: instrument what the agent actually does during a run rather than just checking outputs afterward. &lt;a href="https://www.youtube.com/watch?v=-6SNbcE3C9o" rel="noopener noreferrer"&gt;May Walter&lt;/a&gt; at &lt;a href="https://hud.io/" rel="noopener noreferrer"&gt;HUD&lt;/a&gt; took the runtime angle, with agents that instrument their own execution to surface blind spots before they become merged failures. &lt;a href="https://www.youtube.com/watch?v=guhTp2Q8VX0" rel="noopener noreferrer"&gt;Amit Kushwaha&lt;/a&gt; at &lt;a href="https://nvidia.com/" rel="noopener noreferrer"&gt;NVIDIA&lt;/a&gt; pushed on the benchmarking question: when you're measuring agent performance rather than model performance, the metrics have to be built differently. &lt;a href="https://www.linkedin.com/in/derekashmore/" rel="noopener noreferrer"&gt;Derek Ashmore&lt;/a&gt; ran a hands-on workshop on the agent testing pyramid, working through how the engineering disciplines that make software trustworthy translate directly into agentic systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human and organisational friction
&lt;/h2&gt;

&lt;p&gt;All of this is running through organisations and through people, and several talks were honest about what that costs.&lt;/p&gt;

&lt;p&gt;The platform question is structural. When a significant share of your users are agents rather than humans, the design decisions change. &lt;a href="https://www.youtube.com/watch?v=ItTJpQz35CY" rel="noopener noreferrer"&gt;Dana Lawson&lt;/a&gt; at &lt;a href="https://netlify.com/" rel="noopener noreferrer"&gt;Netlify&lt;/a&gt; framed this as agent experience, a discipline distinct from developer experience. The API surfaces, the structured error responses, the event-driven capabilities: all of it looks different when the consumer is a model. &lt;a href="https://www.youtube.com/watch?v=pyYKOLEnsZk" rel="noopener noreferrer"&gt;Hannah Foxwell&lt;/a&gt;, advising independently across platform engineering and AI, built her talk around a conviction she put plainly: "speed requires safety." That framed her read of what agentic development actually does to team structure. Two roles are becoming more prominent. The product engineer is an engineer who doesn't need to ask permission of a product manager before improving the product, tightening the loop between what users need and what ships. The forward-deployed engineer takes that further: an empowered engineer embedded side by side with users, able to see a gap and fix it on the spot. Both patterns point toward smaller, higher-agency teams with better developer-to-product-manager ratios. The floor on how small you can go, though, is set by something agents can't replace. An agent cannot hold the pager. On-call needs a sustainable rota, which puts the minimum viable team at around four people, always with a primary and a secondary available.&lt;/p&gt;

&lt;p&gt;Open source is feeling the pressure in a specific way. &lt;a href="https://www.youtube.com/watch?v=HgA0spItnZI" rel="noopener noreferrer"&gt;Jack Wotherspoon&lt;/a&gt; at &lt;a href="https://google.com/" rel="noopener noreferrer"&gt;Google&lt;/a&gt; reported that when agents generate contributions at scale, human maintainers become the bottleneck, not on quality but on bandwidth. Communities that took years to build are having to rewrite their norms in months.&lt;/p&gt;

&lt;p&gt;The talk that stayed with the room longest was &lt;a href="https://www.youtube.com/watch?v=ACL7_EsfIio" rel="noopener noreferrer"&gt;Dave Kerr&lt;/a&gt;'s at &lt;a href="https://www.mckinsey.com/our-people/dave-kerr" rel="noopener noreferrer"&gt;McKinsey&lt;/a&gt;. Using his own bipolar disorder diagnosis as a framework, he worked through the dysregulators he recognises in AI-era engineering: the short dopamine cycle of fast feedback ("very, very addictive"), and what he calls &lt;em&gt;Attentional Leng Che&lt;/em&gt;, named after the grindcore band Leng Che whose name means death by a thousand cuts, meaning attention destroyed by constant short-cycle work and the pressure to be across everything at once. The concept he built toward, &lt;em&gt;maladaptive creativity&lt;/em&gt;, is about the palace of stuff one person can now create, where what they have built is already very different from their own mental model, and more different still from what any colleague's mental model will be. It gave a name to something a lot of people in the room were privately recognising.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent enablement: how AI agents improve between runs
&lt;/h2&gt;

&lt;p&gt;The question of how an agent gets better between runs turns out to be separate from how it performs during one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=tTcxVv8HHNw" rel="noopener noreferrer"&gt;Lamis Mukta&lt;/a&gt; at &lt;a href="https://anthropic.com/" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; presented the approach her team calls &lt;em&gt;dreaming&lt;/em&gt;, a batch, asynchronous process in which a fleet of sub-agents reviews transcripts from recent interactions, identifies patterns where agents consistently failed or lacked context, and updates the memory store. The next day's agents run smarter without any human having diagnosed the gap. It's consolidation, not retrieval. Ryan Lopopolo had set up the intuition earlier in the day: every agent interruption, failed build, and review comment is evidence that context was missing at the point it was needed. Dreaming is the systematic way to close that loop without requiring a human to notice the pattern first. Lamis and &lt;a href="https://www.linkedin.com/in/aashreytiku/" rel="noopener noreferrer"&gt;Aashrey Tiku&lt;/a&gt;, both from &lt;a href="https://anthropic.com/" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;, also ran a workshop the same day for teams ready to ship their first managed agent.&lt;/p&gt;

&lt;p&gt;The same principle applies at team level. &lt;a href="https://www.youtube.com/watch?v=-sYhcsy5OwI" rel="noopener noreferrer"&gt;Edouard Maleix&lt;/a&gt;, a freelance consultant, showed that when teams explicitly trace which AI-generated decision produced which outcome, they build up a shared picture of where the gaps are and the errors compound less over time. &lt;a href="https://www.youtube.com/watch?v=o-IunU6b1t8" rel="noopener noreferrer"&gt;Brian Douglas&lt;/a&gt; at &lt;a href="https://papercompute.com/" rel="noopener noreferrer"&gt;Paper Compute&lt;/a&gt; focused on domain knowledge: capturing agent sessions, extracting what they learned, and feeding it back into future runs so the institutional knowledge compounds rather than evaporating when the session closes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Organisational enablement
&lt;/h2&gt;

&lt;p&gt;Knowing what individual agents and teams are learning is one thing. Building the organisational practice to support and scale it is something else.&lt;/p&gt;

&lt;p&gt;My own &lt;a href="https://www.youtube.com/watch?v=I9RWrW32QEw" rel="noopener noreferrer"&gt;talk on agent enablement&lt;/a&gt; mapped four layers where that practice has to take hold. At the developer level, the key shift is stopping the instinct to jump in and fix agent mistakes, and instead improving the system that produces the code so the mistake can't recur. At the team level, the team lead is now accountable for agent performance alongside human performance: agents are team members, and metrics like turn count per task and agent retrospectives start making that visible. The platform layer is where a dedicated agentic enablement team builds the shared infrastructure individual teams shouldn't each be reinventing — skill registry, shared harness, central observability, eval platform — with self-serve as the north star. And at the organisational level, the VP Engineering's job is governance, cross-team KPIs for agent quality, and making the return on investment legible enough to justify the investment at all. Engineering management has accumulated decades of practice around developing human engineers. Almost none of that thinking has been applied to agents yet. The framing, &lt;em&gt;agent enablement&lt;/em&gt;, names the gap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flntigkohcadz90w7vzfn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flntigkohcadz90w7vzfn.jpg" alt="patric" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSdbQwFRzffqW7PjnYEQVU3xBEqcVCgIHWYyNLcXrjRu-oGgEA/viewform" rel="noopener noreferrer"&gt;Tessl Agent&lt;/a&gt; is the product we are building around this thesis. A managed agent that ships with the skill registry, harness, observability, and evals teams would otherwise have to assemble themselves. Private beta access is open if you want to try it on your own enablement layer.&lt;/p&gt;

&lt;p&gt;Part of what makes that hard is that organisational knowledge is quietly eroding. &lt;a href="https://www.youtube.com/watch?v=AHIY1XccX_E" rel="noopener noreferrer"&gt;Peter Wilson and Davide Eynard&lt;/a&gt; at &lt;a href="https://mozilla.ai/" rel="noopener noreferrer"&gt;Mozilla.ai&lt;/a&gt; pointed at something concrete: Stack Overflow contributions plummeted after ChatGPT launched. The institutional Q&amp;amp;A knowledge that used to accumulate in public threads stopped growing, because people stopped writing it down. Their project &lt;em&gt;cq&lt;/em&gt;, "Stack Overflow for agents", tries to rebuild that layer inside organisations: agents hit a problem, solve it, and save the solution in a queryable knowledge base that the rest of the team's agents can draw from too. &lt;a href="https://www.youtube.com/watch?v=rmxRlpi7xN4" rel="noopener noreferrer"&gt;Robert Overweg&lt;/a&gt; took a practical approach to the same problem at company level: building a knowledge system using Obsidian, GitHub, and Telegram so that nothing valuable is lost when a session closes, with one brain, no filtering, and all company context accessible to every agent run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=NVzKtCSeoRA" rel="noopener noreferrer"&gt;Ian Thomas&lt;/a&gt; at &lt;a href="https://meta.com/" rel="noopener noreferrer"&gt;Meta&lt;/a&gt; showed what building this practice looks like at scale. His team's maturity model gives teams a structured self-assessment across six dimensions of AI adoption, with regular workshops to track how they're actually progressing, not through self-reported progress but through discussion, voting, and regular revisits. He was candid about what still needs proving out: code quality drift over years of generated code, and how to prevent review from becoming the new bottleneck as generation accelerates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=tFffUnSq7VA" rel="noopener noreferrer"&gt;Birgitta Böckeler&lt;/a&gt; at &lt;a href="https://www.thoughtworks.com/" rel="noopener noreferrer"&gt;ThoughtWorks&lt;/a&gt; closed the conference with a stated purpose: help the room see the forest for the trees after two days of individual talks. She walked the four-year arc from autocomplete to harness engineering and named the costs that have accumulated alongside the capability: security exposure, code quality drift, token spend, cognitive load, and a review crisis where throughput has long since outrun the ability to trust what was produced. Then she landed on what she called the biggest risk underneath all of it. The pressure to ship more and faster is pushing teams toward cognitive surrender, displacing system-two thinking with AI: not engaging deeply with large changes, not teaching junior engineers, using the most expensive model because it was faster than solving the problem. Surrender, she argued, takes more forms than just the cognitive one. The question she closed on was directed at everyone in the room with any ability to shape how this unfolds: &lt;em&gt;"If you are a person of influence in your engineering organisation, are you creating an environment that leads to surrender?"&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the program adds up to
&lt;/h2&gt;

&lt;p&gt;Looking back at two days: the questions that dominated last year's conversation (does context matter, should specs drive the build) are now starting points. The live debates are about managing context at scale, building harnesses that make agent output predictable, verifying what agents produce without surrendering the speed, and connecting all of it back to the humans and organisations running it.&lt;/p&gt;

&lt;p&gt;The further out you go, into how individuals and agents retain what they've learned and how organisations build the practice of working with agents as a workforce, the less resolved it gets. Which is roughly where you'd expect the frontier to be.&lt;/p&gt;

&lt;p&gt;One moment that put all of it in perspective: comedian &lt;a href="https://www.lievenscheire.com/ai-show" rel="noopener noreferrer"&gt;Lieven Scheire&lt;/a&gt; gave a keynote explaining to a room full of engineers exactly what it is they do all day, and closed it by solving Where's Wally with AI. Sometimes it takes someone from outside the industry to show the room what it's actually building.&lt;/p&gt;

&lt;p&gt;Thank you to all speakers who brought real work and real honesty to the stage. Thank you to everyone who submitted a talk, whether it made the program or not. The quality of the submissions is what makes curation possible. And thank you to everyone who came, asked questions, and kept the hallway conversations going as long as you did.&lt;/p&gt;

&lt;p&gt;The next event is already in the works. New York is next — &lt;a href="https://luma.com/aidevcon-nyc2026" rel="noopener noreferrer"&gt;AI Native DevCon NYC 2026&lt;/a&gt; is open for registration.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>devops</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Same quality, a quarter of the cost: Should DeepSeek Flash be your model of choice?</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Thu, 11 Jun 2026 06:59:02 +0000</pubDate>
      <link>https://dev.to/tessl/same-quality-a-quarter-of-the-cost-should-deepseek-flash-be-your-model-of-choice-1c85</link>
      <guid>https://dev.to/tessl/same-quality-a-quarter-of-the-cost-should-deepseek-flash-be-your-model-of-choice-1c85</guid>
      <description>&lt;p&gt;&lt;strong&gt;$0.0236&lt;/strong&gt; is how much DeepSeek V4 Flash costs to run a complete agentic task, skill included, on the Fireworks price sheet. Claude Haiku 4.5 costs $0.10 for the same task. Sonnet 4.6 costs $0.30.&lt;/p&gt;

&lt;p&gt;On quality, Flash scores 82.3 in our evals and Haiku scores 82.9. With skills applied, the evals point to the two being comparable, yet Haiku costs roughly four times as much.&lt;/p&gt;

&lt;p&gt;We ran 19 model configurations through the same benchmark harness. The tasks were real agentic tasks; we measured total token counts and applied the pricing each provider charges. To be honest, the value story we expected to find was "cheap models are a trap." What we found instead was more interesting, and particularly useful if you're running agents at any kind of scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, the Pro comparison
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 ships two tiers: Pro and Flash. In our eval runs, Pro costs &lt;strong&gt;$0.183/task&lt;/strong&gt; and Flash costs &lt;strong&gt;$0.0236/task&lt;/strong&gt;. That's a &lt;strong&gt;7.7× price gap within the same model family&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What the extra spend buys is only three points: on the eval results, Pro scores 85.3 and Flash scores 82.3. At scale, choosing Pro costs roughly an extra &lt;strong&gt;$19,000/year&lt;/strong&gt; at 10,000 tasks/month and an extra &lt;strong&gt;$190,000/year&lt;/strong&gt; at 100,000 tasks/month. That is a lot to pay for a three-point difference that may be hard to notice in practice.&lt;/p&gt;
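&lt;p&gt;The arithmetic behind those annual figures can be sketched in a few lines. This is a minimal illustration, assuming a constant monthly task volume and the per-task costs quoted above; the function and constant names are hypothetical and not part of our eval harness.&lt;/p&gt;

```python
# Per-task costs quoted in the article (USD).
PRO_COST_PER_TASK = 0.183
FLASH_COST_PER_TASK = 0.0236


def extra_annual_cost(tasks_per_month, expensive=PRO_COST_PER_TASK, cheap=FLASH_COST_PER_TASK):
    """Extra yearly spend from choosing the pricier model at a fixed monthly volume."""
    return (expensive - cheap) * tasks_per_month * 12


def price_ratio(expensive=PRO_COST_PER_TASK, cheap=FLASH_COST_PER_TASK):
    """How many times more the pricier model costs per task."""
    return expensive / cheap


if __name__ == "__main__":
    for volume in (10_000, 100_000):
        print(f"{volume:,} tasks/month: ${extra_annual_cost(volume):,.0f} extra per year")
    print(f"Price ratio: {price_ratio():.2f}x")
```

&lt;p&gt;Swapping in your own volume and per-task costs shows how quickly a small per-task difference compounds over a year.&lt;/p&gt;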

&lt;h2&gt;
  
  
  Points-per-dollar
&lt;/h2&gt;

&lt;p&gt;Dividing eval score by cost per task gives points per dollar, a ratio of quality to cost. It is a useful comparison, so long as the model's overall quality satisfies your needs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score (w/ skill)&lt;/th&gt;
&lt;th&gt;$/task&lt;/th&gt;
&lt;th&gt;pts/$&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;82.3&lt;/td&gt;
&lt;td&gt;$0.024&lt;/td&gt;
&lt;td&gt;3,482&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;82.9&lt;/td&gt;
&lt;td&gt;$0.097&lt;/td&gt;
&lt;td&gt;829&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;85.3&lt;/td&gt;
&lt;td&gt;$0.183&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM 5.1&lt;/td&gt;
&lt;td&gt;90.4&lt;/td&gt;
&lt;td&gt;$0.200&lt;/td&gt;
&lt;td&gt;451&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;90.8&lt;/td&gt;
&lt;td&gt;$0.296&lt;/td&gt;
&lt;td&gt;303&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
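
&lt;p&gt;As a rough sketch of how a ranking like this could be reproduced, the snippet below applies a quality floor first and only then sorts by points per dollar. Scores and costs are copied from this post (Flash uses the $0.0236 figure from the text). The function name and the floor value are our own assumptions, and the output can differ slightly from the pts/$ column because the $/task column is rounded.&lt;/p&gt;

```python
# Scores and $/task are copied from the post; because the published $/task
# values are rounded, results can differ slightly from the pts/$ column.
MODELS = {
    "DeepSeek V4 Flash": (82.3, 0.0236),
    "Haiku 4.5": (82.9, 0.097),
    "DeepSeek V4 Pro": (85.3, 0.183),
    "GLM 5.1": (90.4, 0.200),
    "Sonnet 4.6": (90.8, 0.296),
}

def rank_by_value(models, quality_floor):
    """Drop models below the floor, then sort by points per dollar (descending)."""
    eligible = [
        (name, score / cost)
        for name, (score, cost) in models.items()
        if score >= quality_floor
    ]
    return sorted(eligible, key=lambda item: item[1], reverse=True)

for name, pts_per_dollar in rank_by_value(MODELS, quality_floor=80):
    print(f"{name}: {pts_per_dollar:,.0f} pts/$")
```

&lt;p&gt;Raising the floor to 90 leaves only GLM 5.1 and Sonnet 4.6 in the ranking, which is the point of setting the floor before comparing on cost.&lt;/p&gt;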

&lt;h2&gt;
  
  
  The number your cost model is probably missing
&lt;/h2&gt;

&lt;p&gt;Cost-per-token is the number everyone tends to quote and often mistakenly use as the most important factor in making a decision. It's also the number that will quietly blow your budget if you're not watching turns per solve as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr60224o4620wv6dkop8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr60224o4620wv6dkop8i.png" alt="tokens/turn" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flash averages around 20 turns per task, which is pretty manageable. But the worst-case runs in our dataset hit roughly 10× that. This isn’t unusual for models in this class, but in dollar terms, that's a single task costing as much as 10 average tasks. Multiply that across thousands of concurrent agent runs and you may start to have a budget problem that didn't show up in your per-token estimate.&lt;/p&gt;

&lt;p&gt;The reason most teams don't catch this is that agent frameworks surface token counts by default. Turn count, the variable that actually drives fat-tail cost explosions, often needs to be logged explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrument your agents for turns, not just tokens.&lt;/strong&gt; Know your median and your 95th percentile. Set your timeout policies against the 95th, not the median, or you're either killing valid runs or absorbing surprise bills.&lt;/p&gt;
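
&lt;p&gt;A minimal sketch of what that instrumentation could report, assuming you already log one turn count per task. The helper names and the sample data are hypothetical, and the percentile uses the simple nearest-rank definition.&lt;/p&gt;

```python
import statistics

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]

def turn_budget_summary(turns_per_task):
    """Summarise logged turn counts so the tail is visible next to the median."""
    return {
        "mean": statistics.mean(turns_per_task),
        "median": statistics.median(turns_per_task),
        "p95": percentile_nearest_rank(turns_per_task, 95),
    }

# Hypothetical turn counts: nine ordinary runs and one fat-tail run.
sample_turns = [18, 22, 19, 21, 20, 17, 23, 20, 19, 200]
print(turn_budget_summary(sample_turns))
```

&lt;p&gt;In this toy sample the median stays at 20 while the single long run pulls the mean up and defines the 95th percentile. With only ten samples the nearest-rank 95th percentile is simply the maximum, so use a larger log before setting real timeouts.&lt;/p&gt;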

&lt;h2&gt;
  
  
  The skill is doing half the work
&lt;/h2&gt;

&lt;p&gt;One thing worth being very direct about here is that Flash's 82.3 score is a &lt;strong&gt;skill-augmented score&lt;/strong&gt;. Without a skill, Flash scores 64.1. The skill adds +18.2 points.&lt;/p&gt;

&lt;p&gt;That lift is real, but very conditional on the skill being precise, well-scoped, and actually relevant to the task. A vague skill will drag you back down closer to the 64.1 baseline, whereas a sharp one gets you 82.3.&lt;/p&gt;

&lt;p&gt;This matters more than most model evaluations acknowledge since the model you test in a playground doesn’t usually use a skill or relevant context, but just raw capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going further: find cheaper models and test them yourself
&lt;/h2&gt;

&lt;p&gt;The analysis above shows the cheapest hosted options we measured. But there are two obvious next steps if you want to push it further, and both are more accessible than you might think.&lt;/p&gt;

&lt;p&gt;Every model in this benchmark that isn't a GPT, Claude, or Gemini model has publicly available weights. That includes DeepSeek V4 Flash and GLM 5.1: you can run all of them yourself. When you do, the marginal token cost drops to near zero. You're paying for compute (GPU rental or owned infra), not per-call pricing.&lt;/p&gt;

&lt;p&gt;The maths of self-hosting only makes sense above a certain volume threshold, since the ops overhead and GPU costs aren't free. But if you're running tens of thousands of agentic tasks per month, the crossover point is lower than you'd expect.&lt;/p&gt;

&lt;p&gt;The skill in this benchmark is doing +18.2 points of work. The question worth asking is: where did that skill come from, and how do you know it's any good?&lt;/p&gt;

&lt;p&gt;The Tessl registry is a good place to start: it lets you look at the quality, impact, and security posture of a skill. Before you write a skill from scratch, check whether one already exists and has eval data behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate your skills properly.&lt;/strong&gt; You can run two types of evaluation: reviews (automated quality assessment of whether your skill is well-structured) and task evals (end-to-end runs that measure whether the skill actually improves agent performance on real tasks). The task eval output is exactly the kind of "with skill / without skill" delta that the Flash benchmark is built on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use skill quality as a model selection input.&lt;/strong&gt; The 18-point lift Flash gets from a well-scoped skill isn't a fixed number; it depends on the skill and the tasks. A skill that has been evaluated by Tessl with a high task eval score gives you confidence that the lift is real and reproducible. A skill that's never been evaluated is a variable you can't account for in your cost modelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your own workload, not someone else's benchmark.&lt;/strong&gt; The task eval system lets you define scenarios from your actual codebase and run them. That's the self-evaluation framework described above.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaways, flat out
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;DeepSeek V4 Flash at $0.0236/task is the value pick.&lt;/strong&gt; Haiku costs 4× more for 0.6 points. Pro costs 7.7× more for 3 points.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Set a quality floor before you rank by cost.&lt;/strong&gt; pts/$ flatters cheap-and-weak models. Above 80 points, it's a real signal.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Instrument for turns, not just tokens.&lt;/strong&gt; Your 95th percentile turn count is the budget variable nobody's logging.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The skill is doing half the work.&lt;/strong&gt; A bad skill collapses your score back to baseline. Evaluate your skills — with task evals, not vibes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You can run this yourself.&lt;/strong&gt; 20-30 tasks, turn logging, a spreadsheet, and Tessl's eval system.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Self-hosting open source models is a real option.&lt;/strong&gt; The weights are public, the ops trade-off is real. You should run your own evals with your models to see if they can be substituted in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tier name told you Flash was cheap; the data says it's also good. Now you have the tools to find out whether that holds for what &lt;em&gt;you're&lt;/em&gt; building.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
    <item>
      <title>Opus 4.8 tops the LLM leaderboard with 95% on skill evals</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Wed, 10 Jun 2026 05:54:17 +0000</pubDate>
      <link>https://dev.to/tessl-io/opus-48-tops-the-llm-leaderboard-with-95-on-skill-evals-1lnf</link>
      <guid>https://dev.to/tessl-io/opus-48-tops-the-llm-leaderboard-with-95-on-skill-evals-1lnf</guid>
      <description>&lt;p&gt;We added Claude Opus 4.8 to our ongoing model benchmark. It scored 95% with skill context, which puts it 1.6 points above Opus 4.7 and 2.3 points above Cursor's Composer 2.5 Fast. It is also, by a meaningful margin, the slowest model we have tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Opus 4.8 scores 95% with skill context, taking the top spot from Opus 4.7.&lt;/li&gt;
&lt;li&gt;  Its 81% baseline is the highest recorded in this benchmark: higher than every other model's baseline, and higher than gpt-5-codex scores even with skills loaded (78.7%).&lt;/li&gt;
&lt;li&gt;  All three independent judges agreed within two points, the tightest spread we have seen across nine models. Previous high-variance models swung over seven points between judges.&lt;/li&gt;
&lt;li&gt;  On matched runs, Opus 4.8 takes roughly 671 seconds per eval. Composer 2.5 averages 327 seconds on the same pairs. Composer 2.5 Fast averages 215 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How the benchmark works
&lt;/h2&gt;

&lt;p&gt;We test models against a set of engineering skills. Each skill is a structured context document that tells an agent how to work correctly in a specific domain. The 11 skills in this benchmark cover: API documentation, Fastify server patterns, project initialisation, ESLint/neostandard linting, Node.js best practices, Node.js core contribution conventions, OAuth 2.0 security patterns, GitHub automation via the Octocat API, skill optimisation, code snippet rendering with Snipgrapher, and TypeScript configuration.&lt;/p&gt;

&lt;p&gt;Each skill has five scenarios. Each scenario runs twice: once with the skill loaded (with-skill) and once without (baseline). That gives us the lift score (how much the skill context actually helps). Every run is scored independently by three LLM judges (Sonnet, GPT-5.5, and Opus 4.7), and we average the results. We covered why we use three judges, and what happens when you use only one, in a previous post. The short version: a single judge can swing results by over seven points depending on which model family it belongs to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Opus 4.8 lands
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Baseline&lt;/th&gt;
&lt;th&gt;Avg With-Skill&lt;/th&gt;
&lt;th&gt;Lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claude:claude-opus-4-8&lt;/td&gt;
&lt;td&gt;81.0%&lt;/td&gt;
&lt;td&gt;95.0%&lt;/td&gt;
&lt;td&gt;+14.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude:claude-opus-4-7&lt;/td&gt;
&lt;td&gt;80.8%&lt;/td&gt;
&lt;td&gt;93.4%&lt;/td&gt;
&lt;td&gt;+12.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cursor:composer-2.5-fast&lt;/td&gt;
&lt;td&gt;79.6%&lt;/td&gt;
&lt;td&gt;92.7%&lt;/td&gt;
&lt;td&gt;+13.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cursor:composer-2.5&lt;/td&gt;
&lt;td&gt;79.0%&lt;/td&gt;
&lt;td&gt;92.1%&lt;/td&gt;
&lt;td&gt;+13.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cursor:composer-2&lt;/td&gt;
&lt;td&gt;74.2%&lt;/td&gt;
&lt;td&gt;89.6%&lt;/td&gt;
&lt;td&gt;+15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codex:gpt-5.5&lt;/td&gt;
&lt;td&gt;75.5%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;+13.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codex:gpt-5.4&lt;/td&gt;
&lt;td&gt;74.1%&lt;/td&gt;
&lt;td&gt;89.3%&lt;/td&gt;
&lt;td&gt;+15.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codex:gpt-5.3&lt;/td&gt;
&lt;td&gt;65.5%&lt;/td&gt;
&lt;td&gt;83.9%&lt;/td&gt;
&lt;td&gt;+18.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codex:gpt-5-codex&lt;/td&gt;
&lt;td&gt;68.7%&lt;/td&gt;
&lt;td&gt;78.7%&lt;/td&gt;
&lt;td&gt;+10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Opus 4.8, Opus 4.7, and the two Composer 2.5 variants are now meaningfully separated from the rest of the field. Everything below Composer 2.5 sits at 89-90% or lower, a gap of around 3 points that has been stable across our last several benchmark runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The baseline number
&lt;/h2&gt;

&lt;p&gt;Opus 4.8 scores 81% without any skill context. That is higher than the baseline of Composer 2.5, GPT-5.5, GPT-5.4, and every other model in the field, and higher than gpt-5-codex scores even with skills loaded (78.7%). Every other model in this benchmark needs scaffolding to reach the floor Opus 4.8 starts from.&lt;/p&gt;

&lt;p&gt;The implication for skill deployment is worth spelling out. Weaker baseline models get more absolute value from skill context because they need it more: gpt-5.3, for example, shows the largest lift at 18.4 points, starting from 65.5%. Opus 4.8 gets +14 lift, but it starts at 81%, which is a different category of floor. The skill is pushing a strong model further rather than compensating for a weak one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-skill breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;With-Skill&lt;/th&gt;
&lt;th&gt;Lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;linting&lt;/td&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;+0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nodejs-core&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;td&gt;+9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;skill-optimizer&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;td&gt;+10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fastify&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;td&gt;+16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;documentation&lt;/td&gt;
&lt;td&gt;86%&lt;/td&gt;
&lt;td&gt;97%&lt;/td&gt;
&lt;td&gt;+11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;octocat&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;td&gt;+13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;oauth&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;td&gt;+19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;node&lt;/td&gt;
&lt;td&gt;69%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;+25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snipgrapher&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;+36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;init&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;+12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;typescript&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;86%&lt;/td&gt;
&lt;td&gt;+5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Take a look at these skills &lt;a href="https://tessl.io/registry/simon/skills/evals" rel="noopener noreferrer"&gt;here, including the scenarios&lt;/a&gt; I ran against them.&lt;/p&gt;

&lt;p&gt;Linting is near-perfect with or without skills. The rubric checks binary outcomes like whether a file was deleted or a package removed, which gives judges nothing to disagree about and leaves almost no room for the model to fail on the with-skill run either.&lt;/p&gt;

&lt;p&gt;Snipgrapher is the outlier: a 58% baseline rising to 94% with skill context, a 36-point lift and the largest we have recorded for any model on any skill. Snipgrapher asks agents to follow a rendering specification they have never encountered before, so without the skill most agents approximate and with it they follow the spec. The gap is that large because the tool is genuinely obscure with no training signal for it.&lt;/p&gt;

&lt;p&gt;Node best practices follows a similar pattern at +25. The baseline of 69% reflects how much the model has to infer from general coding knowledge alone. The skill provides the specific idioms and patterns that push the score to 94%.&lt;/p&gt;

&lt;p&gt;The typescript result is the recurring problem. Both Opus 4.7 and Composer 2.5 showed a regression in this skill: the model's own assumptions about TypeScript seem to conflict with the skill's guidance rather than build on it. At 81% baseline and only 86% with skill, Opus 4.8 gets just five points of lift where every other skill, apart from the already saturated linting skill, gets at least nine. The pattern is consistent enough across models that it points to a skill design problem rather than a model problem. If TypeScript configuration is central to your workflow, this is worth investigating before deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  The judges agreed, which is unusual
&lt;/h2&gt;

&lt;p&gt;In the judges post we documented a 7.3-point swing for Opus 4.7 across the three judges, with GPT-5.5 grading it at 89.2% and Opus giving itself 96.5%. We attributed the high Opus-as-judge score partly to self-judge bias. Opus gave itself a 4.6-point boost over what the other judges awarded.&lt;/p&gt;

&lt;p&gt;For Opus 4.8 the spread was just two points: Sonnet gave it 96%, Opus 4.7 gave it 95%, and GPT-5.5 gave it 94%. That is the tightest cross-judge agreement we have seen for any model in this benchmark. A strict judge and a generous one converge when the answer leaves no room for interpretation. Opus 4.8 got there more consistently than any model we have tested, which is why the judges stopped disagreeing.&lt;/p&gt;
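
&lt;p&gt;For reference, the headline score and the spread are simple aggregates over the per-judge scores. A small sketch using the three Opus 4.8 numbers quoted above (the function name is ours, not part of the harness):&lt;/p&gt;

```python
import statistics

def aggregate_judges(scores_by_judge):
    """Return the mean score across judges and the max-min spread between them."""
    scores = list(scores_by_judge.values())
    return statistics.mean(scores), max(scores) - min(scores)

# Per-judge scores for Opus 4.8 as reported in this post.
opus_4_8 = {"Sonnet": 96, "Opus 4.7": 95, "GPT-5.5": 94}
mean_score, spread = aggregate_judges(opus_4_8)
print(f"mean={mean_score}, spread={spread}")
```

&lt;p&gt;Tracking the spread alongside the mean is a cheap way to notice when a single judge would have told a different story.&lt;/p&gt;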

&lt;p&gt;This also has an implication for eval cost. A model that produces consistently unambiguous outputs could potentially be scored with a single strict judge without much risk of inflation. You would still want to verify that for your specific rubrics, but the data suggests the three-judge overhead is less necessary here than it was for previous models.&lt;/p&gt;

&lt;h2&gt;
  
  
  It is slower, and that cost compounds
&lt;/h2&gt;

&lt;p&gt;We measured timing on matched skill and judge pairs: the same scenarios and judges for every model, which keeps the comparison fair. Opus 4.8 averaged 671 seconds per eval run. Composer 2.5 averaged 327 seconds on the same pairs and Composer 2.5 Fast averaged 215 seconds, roughly two to three times faster.&lt;/p&gt;

&lt;p&gt;For a one-off task the latency barely registers. In an agentic loop over hundreds of sequential tasks, Composer 2.5 Fast completes three full runs in the time Opus 4.8 finishes one, and that gap turns into hours at scale.&lt;/p&gt;
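
&lt;p&gt;To put "hours at scale" in concrete terms, here is a back-of-the-envelope sketch. It assumes tasks run strictly one after another and that the per-eval averages above carry over to your tasks. The task count of 500 is an arbitrary illustration.&lt;/p&gt;

```python
# Average seconds per eval run, as reported in this post.
SECONDS_PER_EVAL = {
    "Opus 4.8": 671,
    "Composer 2.5": 327,
    "Composer 2.5 Fast": 215,
}

def sequential_hours(model, task_count):
    """Wall-clock hours to run task_count evals back to back, with no parallelism."""
    return SECONDS_PER_EVAL[model] * task_count / 3600

for model in SECONDS_PER_EVAL:
    print(f"{model}: {sequential_hours(model, 500):.1f} h for 500 sequential tasks")
```

&lt;p&gt;Parallel runs shrink the wall-clock gap but not the ratio between the models.&lt;/p&gt;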

&lt;p&gt;Pick Opus 4.8 when task accuracy has downstream consequences. When throughput is the binding constraint, Composer 2.5 Fast is three times faster and only 2.3 points behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  How these numbers were produced
&lt;/h2&gt;

&lt;p&gt;Every score in this post is averaged across three independent judges: Sonnet, GPT-5.5, and Opus 4.7. We do not publish single-judge scores. The Opus 4.8 runs used the same 11 skills and 5 scenarios per skill as every prior model in this benchmark, so the comparison is direct.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentskills</category>
      <category>productivity</category>
      <category>security</category>
    </item>
    <item>
      <title>AI Coding Agent Accuracy: Opus 4.7 vs 4.8</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Tue, 09 Jun 2026 07:23:08 +0000</pubDate>
      <link>https://dev.to/tessl/ai-coding-agent-accuracy-opus-47-vs-48-3051</link>
      <guid>https://dev.to/tessl/ai-coding-agent-accuracy-opus-47-vs-48-3051</guid>
      <description>&lt;p&gt;You are deciding whether to roll your default agent model from Opus 4.7 to 4.8. The release notes promise improvements, the leaderboard moves a fraction of a point, so you shrug, schedule the upgrade for a quiet Friday, and move on.&lt;/p&gt;

&lt;p&gt;We ran both versions through the same skills evaluation, roughly 850 scenarios solved twice each, and on the headline metric they finished level. Underneath the tie, though, 4.8 reached the same answers in four fewer turns and for measurably less money, so the upgrade that looks like a non-event on the scoreboard turns out to be a real efficiency gain in the place that actually bills you: the agent loop.&lt;/p&gt;

&lt;p&gt;AI agent evaluation measures how an agent behaves on real tasks rather than only scoring its final answer, tracking cost, turns, and reliability across paired runs. The reason to bother is that two models can post the same score while spending very different amounts of work to reach it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two versions, one eval harness
&lt;/h2&gt;

&lt;p&gt;Both models ran the identical setup. Every scenario is solved twice, once with no help and once with the relevant skill installed, so we can isolate what the skill contributes from what the base model already knows. We score three things: instruction following (did the agent do what the skill tells it to do), task completion (did it reach the goal), and an overall blend weighted toward instruction following. We also flag integrity issues, like an agent peeking at the grading rubric instead of solving the task.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is the incumbent. In our runs it is a strong agent that leans heavily on skills to reach its ceiling, and it explores a lot of paths to get there.&lt;/p&gt;

&lt;p&gt;Opus 4.8 is the point release. It posts the same ceiling with a skill installed, but it starts from a higher floor without one, and it gets to the answer with noticeably less wandering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI coding agent accuracy stops being the story
&lt;/h2&gt;

&lt;p&gt;Here is the head-to-head on the shared scenario set, all with the relevant skill installed unless noted.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overall score&lt;/td&gt;
&lt;td&gt;91.9&lt;/td&gt;
&lt;td&gt;92.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Baseline score, no skill&lt;/td&gt;
&lt;td&gt;71.4&lt;/td&gt;
&lt;td&gt;74.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task completion&lt;/td&gt;
&lt;td&gt;97.1&lt;/td&gt;
&lt;td&gt;97.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instruction following&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turns per task&lt;/td&gt;
&lt;td&gt;19.2&lt;/td&gt;
&lt;td&gt;15.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens per task&lt;/td&gt;
&lt;td&gt;7,820&lt;/td&gt;
&lt;td&gt;9,763&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per task, API pricing&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;about 5% lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integrity flags raised&lt;/td&gt;
&lt;td&gt;10.2%&lt;/td&gt;
&lt;td&gt;7.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The overall accuracy gap is 0.2 points. If you stopped reading the row labeled "overall score," you would conclude nothing changed. Three other rows complicate that picture.&lt;/p&gt;

&lt;p&gt;The first is the baseline. Without any skill, 4.8 scores 74.1 against 4.7's 71.4, a 2.7 point gain, and its no-skill instruction following climbed from the high 50s into the low 60s. The ceiling is shared because the skill pulls both versions up to roughly the same place. The floor is where 4.8 actually improved, and that has a practical consequence: 4.8 depends on the skill slightly less to do good work. This suggests some of the knowledge previously only present in skills has been trained into the model weights.&lt;/p&gt;

&lt;p&gt;The second is turns. 4.8 finishes the average task in 15.0 turns versus 19.2 for 4.7, a 21% reduction. In an agent loop, a turn is a full round trip of context, reasoning, and tool use. Cutting four turns off the average task lowers latency, reduces the chances for an agent to talk itself into a wrong path, and, as we will see, lowers cost.&lt;/p&gt;

&lt;p&gt;The third is integrity. The eval flags runs where the agent took a shortcut, like reading the grading rubric or reaching outside its workspace. Those flags dropped from 10.2% of shared runs to 7.9%. 4.8 is modestly more disciplined about how it reaches an answer. This matches Anthropic’s claims about 4.8 being more honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading the cost: turns, not tokens
&lt;/h2&gt;

&lt;p&gt;Look again at two rows that seem to contradict each other. 4.8 produces more output per task, 9,763 tokens against 7,820, yet it costs about 5% less.&lt;/p&gt;

&lt;p&gt;This is because output volume does not dominate agentic cost. The dominant term is the context replayed on every turn. Each turn re-sends the accumulated conversation and tool results, and in long agent runs that cached input swamps the fresh output the model writes. Fewer turns means fewer replays, so 4.8 can be more verbose inside each turn and still come out ahead, because it takes four fewer turns to converge.&lt;/p&gt;
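
&lt;p&gt;A toy cost model makes the mechanism easier to see. Everything numeric below except the output-token totals and the rounded turn counts is a hypothetical placeholder: the prices, the base context size, and the tool-result size are assumptions chosen for illustration, not published rates or measurements from our runs.&lt;/p&gt;

```python
# Every price and size below is a hypothetical placeholder, not a published rate.
CACHED_INPUT_PRICE = 0.5e-6   # dollars per replayed context token (assumed)
OUTPUT_PRICE = 25e-6          # dollars per output token (assumed)
BASE_CONTEXT_TOKENS = 10_000  # system prompt, skill, and task statement (assumed)
TOOL_TOKENS_PER_TURN = 2_000  # tool results appended each turn (assumed)

def task_cost(turns, total_output_tokens):
    """Cost of one task when every turn replays the accumulated context."""
    output_per_turn = total_output_tokens / turns
    context = BASE_CONTEXT_TOKENS
    cost = 0.0
    for _ in range(turns):
        # Pay for the replayed context plus the fresh output written this turn.
        cost += context * CACHED_INPUT_PRICE + output_per_turn * OUTPUT_PRICE
        # The output and tool results become part of the next turn's context.
        context += output_per_turn + TOOL_TOKENS_PER_TURN
    return cost

# Turn counts are the table averages rounded to whole turns.
many_short_turns = task_cost(turns=19, total_output_tokens=7_820)
fewer_long_turns = task_cost(turns=15, total_output_tokens=9_763)
print(f"19 turns: ${many_short_turns:.3f}  15 turns: ${fewer_long_turns:.3f}")
```

&lt;p&gt;Under these assumed parameters the 15-turn run comes out cheaper despite writing more output, because the replayed context grows with every extra turn. The size of the gap depends entirely on the placeholder values, so plug in your own prices and context sizes before drawing conclusions.&lt;/p&gt;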

&lt;p&gt;Model cards only show the per-token rate that sets the price of a unit of work, while turn count sets how many units the model decides to spend. A point release that holds accuracy flat while spending 21% fewer turns is working on that second term, which is the one that scales with your usage.&lt;/p&gt;

&lt;p&gt;The same dynamic shows up in how each version absorbs a skill. Adding the relevant skill is not free: it pulls in instructions and reference material the agent has to process, and the question is how efficiently the model turns that overhead into a result.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Effect of installing the skill&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Opus 4.8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overall score gain&lt;/td&gt;
&lt;td&gt;+20.5&lt;/td&gt;
&lt;td&gt;+18.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost increase&lt;/td&gt;
&lt;td&gt;+38%&lt;/td&gt;
&lt;td&gt;+12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn increase&lt;/td&gt;
&lt;td&gt;+41%&lt;/td&gt;
&lt;td&gt;+14%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On 4.7, switching on a skill added 41% more turns to cash in a 20 point accuracy gain. On 4.8, the same class of skill buys nearly the same gain for much less turn and cost overhead. 4.8 treats a skill more like a shortcut and less like an invitation to explore. If you run agent skills at scale, that lower skill tax compounds across every task you ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one place 4.8 regressed
&lt;/h2&gt;

&lt;p&gt;A fair comparison reports where the new version loses ground. Per scenario, the record is close to a wash: 4.8 scored higher on 23% of shared tasks, tied on 61%, and scored lower on 17%, using a two point threshold. The interesting part is that the losses cluster.&lt;/p&gt;

&lt;p&gt;4.8 regressed on web research and scraping skill families. Firecrawl tasks dropped 3.3 points on average across 72 scenarios. LangChain dropped 2.9 points across 48. Smaller families like Tavily and Apify fell further, 10.4 and 7.6 points, though on fewer tasks. Meanwhile 4.8 improved on infrastructure, auth, and code tooling: Cloudflare gained 4.5 points across 38 scenarios, Auth0 gained 4.3 across 18, and Mastra gained 10.1 across 10.&lt;/p&gt;

&lt;p&gt;The aggregate hid this completely, because the gains and losses nearly cancel. Only a per domain breakdown surfaces it. That is the whole argument for paired skill evals over a single leaderboard number: the headline can be a tie while two coherent shifts run in opposite directions underneath it.&lt;/p&gt;
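
&lt;p&gt;A per-family breakdown of this kind can be sketched in a few lines. The sample deltas below are hypothetical, not data from the eval, and treating a delta of exactly two points as a win or loss is our assumption about how the threshold applies.&lt;/p&gt;

```python
from collections import defaultdict

def classify(delta, threshold=2.0):
    """Label a per-scenario score delta (new minus old) as win, tie, or loss."""
    if delta >= threshold:
        return "win"
    if -delta >= threshold:
        return "loss"
    return "tie"

def breakdown_by_family(deltas):
    """deltas: iterable of (family, score_delta) pairs. Returns per-family stats."""
    grouped = defaultdict(list)
    for family, delta in deltas:
        grouped[family].append(delta)
    report = {}
    for family, values in grouped.items():
        labels = [classify(v) for v in values]
        report[family] = {
            "mean_delta": sum(values) / len(values),
            "win": labels.count("win"),
            "tie": labels.count("tie"),
            "loss": labels.count("loss"),
        }
    return report

# Hypothetical per-scenario deltas, not data from the eval.
sample = [("scraping", -4.0), ("scraping", -3.0), ("scraping", 1.0),
          ("infra", 5.0), ("infra", 4.0), ("infra", 0.0)]
overall_mean = sum(delta for _, delta in sample) / len(sample)
print(f"overall mean delta: {overall_mean}")
print(breakdown_by_family(sample))
```

&lt;p&gt;In this toy sample the overall mean is close to zero while the two families move in opposite directions, which is the same shape as the result described above.&lt;/p&gt;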

&lt;h3&gt;
  
  
  When to roll forward to 4.8
&lt;/h3&gt;

&lt;p&gt;Roll forward to 4.8 if your agents run long, multi-turn tasks where turn count, latency, and cost matter, which is most production agent work. You get the same accuracy ceiling, a higher floor before skills, a 21% turn reduction, a cheaper skill tax, and fewer integrity flags. If your workloads lean on infrastructure, auth, or general code tooling, 4.8 is flat to clearly better.&lt;/p&gt;

&lt;p&gt;Test before you roll forward if your agents live in the scrape, crawl, and summarize world. The web research regression is small in absolute terms but consistent across the families we measured. Run your own A/B on your top scraping workflows first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway: measure behavior, not the changelog
&lt;/h2&gt;

&lt;p&gt;A skeptic has two reasonable objections. The first: a flat score is just no improvement, so why care? Because two models can tie on accuracy while one spends 28% more turns (19.2 against 15.0) and about 5% more budget to get there. The second: these are our eval harness costs, not yours. That is true of the absolute numbers, but the relative differences in turns, tokens, and cost reflect model behavior, which does generalize.&lt;/p&gt;

&lt;p&gt;Measure each release on behavior, on your own tasks, with skills both installed and stripped out, and look at the per-domain breakdown before you trust the average.&lt;/p&gt;

&lt;p&gt;Want to see how your own stack behaves across a model upgrade? Browse the Tessl Registry to find the skills your agents depend on, then run the same paired evaluations we used here to measure what actually changed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>agentskills</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why We're Changing Our Default Eval Model</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Mon, 08 Jun 2026 04:23:30 +0000</pubDate>
      <link>https://dev.to/tessl-io/why-were-changing-our-default-eval-model-50i4</link>
      <guid>https://dev.to/tessl-io/why-were-changing-our-default-eval-model-50i4</guid>
      <description>&lt;p&gt;We're changing the default solver model in our eval harness from Claude Sonnet 4.6 to GLM 5.1. This is the default we provide to everyone running evals on the platform. For most of the work the harness does, a frontier model gives you the strongest possible signal. However, that's more signal than the job needs and the difference is where eval budgets quietly leak. The question that decides how much you should be paying is whether a given eval run is measuring the model or measuring the skill.&lt;/p&gt;

&lt;p&gt;The principle behind it: specify the model only when you care about the model. When your eval exists to answer "does &lt;em&gt;this specific model&lt;/em&gt; ship well?", you have to run that exact model. When it exists to answer "does &lt;em&gt;this skill&lt;/em&gt; improve agent behavior, and has anything regressed?", you don't need a specific model, you need a representative one.&lt;/p&gt;

&lt;p&gt;We put this to the test on our own &lt;a href="https://tessl.io/blog/a-proposed-framework-for-evaluating-skills-research-eng-blog/" rel="noopener noreferrer"&gt;skill-evaluation harness&lt;/a&gt; and validated GLM 5.1 against Sonnet 4.6, the model it replaces as the default. We lost almost none of the signal skill authors rely on, and the eval bill went down. This post is the reasoning behind the switch, and a framework you can apply to your own eval stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two questions, one eval harness
&lt;/h2&gt;

&lt;p&gt;Our harness runs a large skill-evaluation suite: roughly 500 skills across about 850 tasks, each run twice, with the skill and without it. We score three things: instruction following (did the agent do what the skill tells it to do), task completion (did it reach the goal), and an overall blend weighted toward instruction following.&lt;/p&gt;

&lt;p&gt;Lift is the difference between an agent's behavior with a skill and without it, and it's the number a skill author reads, because it isolates the skill's effect from the model's baseline.&lt;/p&gt;

&lt;p&gt;Two models are in play on every run. The judge grades the trajectories; we keep it fixed and strong because the judge's grading on the rubric determines lift. The solver is the agent doing the task, and it's the free variable. Because each agentic trajectory is longer than a judging round, the solver dominates eval cost, so the practical question is whether we can swap the default solver for something cheaper without losing the lift signal.&lt;/p&gt;

&lt;p&gt;To answer that, you need to know which of two questions your harness is answering.&lt;/p&gt;

&lt;p&gt;The first is "does &lt;em&gt;this specific model&lt;/em&gt; ship well?" If you are deciding which model goes into production, no proxy will do, because the model is the subject. The second is "does &lt;em&gt;this skill&lt;/em&gt; change agent behavior, and not regress?" Here the model isn't the subject but an instrument for reading the skill, and an instrument only needs to be accurate enough to reproduce the signal you act on.&lt;/p&gt;

&lt;p&gt;Most day-to-day skill development is the second question. You are iterating on a skill, watching whether the lift goes up, guarding against regressions. The specific solver underneath barely matters, as long as it tracks the frontier closely enough. The right default for that work is the cheapest model that faithfully reproduces the lift.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to evaluate AI agents without paying frontier prices for every run
&lt;/h2&gt;

&lt;p&gt;The obvious objection: a cheaper model is cheaper because it's worse, so won't the signal degrade with it? That depends on which signal. The absolute levels do degrade; the lift mostly doesn't.&lt;/p&gt;

&lt;p&gt;We ran roughly 850 tasks across 500 skills head-to-head on both GLM 5.1 and Sonnet 4.6: same tasks, same judge, same with-and-without protocol. Then we correlated the per-skill lift.&lt;/p&gt;

&lt;p&gt;At the skill level, across those 500 skills, the lift correlation was r = 0.72 (Spearman 0.69). If a skill lifts Sonnet, it often lifts GLM by a similar amount, and the correlation holds when you decompose it. This matters because a single headline number can hide a saturation artifact. Instruction-following lift, where almost all the signal lives (standard deviation 26), came in at r = 0.71. Task completion lift, which is small and near-saturated but carries the rare unlocks, came in at r = 0.74. The two models agree on each dimension separately, and on the size of the lift, not just on the blended headline number.&lt;/p&gt;

&lt;p&gt;For a screening tool, the number to watch is decision agreement. On the binary call every author actually makes, "does this skill help?", the two models agreed &lt;strong&gt;88.5%&lt;/strong&gt; of the time, and where they differ they differ in a safe direction: GLM is mildly conservative, with a mean lift of 22.3 against Sonnet's 24.3 and a regression slope around 0.76. It won't over-credit a skill, which is what you want for a regression guard.&lt;/p&gt;
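
&lt;p&gt;As a minimal sketch of how these screening numbers can be computed, the Python below takes per-skill lift from two solvers and reports the Pearson correlation and the decision agreement. The lift values, the function names, and the use of zero as the "helps" threshold are illustrative assumptions, not our harness code or data.&lt;/p&gt;

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def decision_agreement(xs, ys, threshold=0.0):
    """Share of skills where both solvers make the same 'does it help?' call."""
    same = sum((x > threshold) == (y > threshold) for x, y in zip(xs, ys))
    return same / len(xs)


# Hypothetical per-skill lift: with-skill score minus without-skill score.
frontier_lift = [30.0, 12.0, -4.0, 45.0, 2.0, 18.0]
cheap_lift = [24.0, 9.0, 1.0, 33.0, 3.0, 15.0]

r = pearson(frontier_lift, cheap_lift)
agreement = decision_agreement(frontier_lift, cheap_lift)
print(f"lift correlation r = {r:.2f}, decision agreement = {agreement:.1%}")
```

&lt;p&gt;In this toy data the two solvers disagree only on the third skill, where the lift is close to zero on both. That is the kind of marginal case where a borderline call is worth re-running on the frontier model.&lt;/p&gt;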

&lt;p&gt;For skill authors the takeaway is simple: the thing you act on, the sign and rough size of a skill's lift, reads the same on either model, so run the cheap one by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  The limits of a cheap screen
&lt;/h2&gt;

&lt;p&gt;The two models diverge on fine-grained, low-impact flagging. GLM catches roughly half the skills Sonnet rates as low-impact (under 5 points of lift), and on the rare outright-negative skills the overlap is smaller still. However, with only about two tasks per skill, plus the irreducible noise any LLM judge carries, the marginal cases are precisely where any two models disagree. The disagreement is concentrated where the evidence is thinnest, not spread across the confident calls.&lt;/p&gt;

&lt;p&gt;This means that GLM is the cheap, fast screen you run constantly while developing skills and guarding against regressions. When a decision hinges on a single borderline skill or on which model you ship, you escalate to the model you care about. The screen narrows the field and the frontier model makes the final call. You're not trading accuracy for cost so much as spending accuracy where the decisions are and throughput everywhere else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost story
&lt;/h2&gt;

&lt;p&gt;At our current API pricing, the typical task is about 1.5x cheaper on GLM. GLM is cheaper on 83% of tasks and at least 1.5x cheaper on 52% of them, and on per-token API list price the gap widens to 2 to 3x. The one-line version: cheaper on the large majority of tasks, typically by around 1.5x, and by 2 to 3x per token.&lt;/p&gt;

&lt;p&gt;Total eval spend is about 1.4x cheaper, roughly 28% lower, which is narrower than the per-task figure. The reason is a heavy cost tail: about 17% of tasks are runaway, chatty trajectories that loop or burn far more tokens than the median, with one task alone reading 2.1M cached tokens. The aggregate gets pulled by that tail rather than by the typical task.&lt;/p&gt;

&lt;p&gt;That tail is something we can optimize. The gap between the typical task at 1.5x and the aggregate at 1.4x comes from those runaway trajectories. Tightening turn and loop limits, and improving how the harness drives long trajectories, collapses the tail toward the median, which moves the aggregate toward the roughly 1.5x the typical task already shows. GLM is cheaper today on most tasks, and it sits on a cost curve we can keep pushing down.&lt;/p&gt;
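
&lt;p&gt;To make the difference between the typical task and the aggregate concrete, here is a small sketch with made-up per-task costs. It assumes the runaway trajectory happens on the cheaper solver; the dollar figures and variable names are hypothetical and are not taken from our logs.&lt;/p&gt;

```python
from statistics import median

# Hypothetical per-task costs in dollars for the same five tasks.
# The last task is a runaway trajectory on the cheap solver.
frontier_cost = [0.30, 0.45, 0.60, 0.36, 0.90]
cheap_cost = [0.20, 0.30, 0.40, 0.24, 1.40]

# Typical saving: the median of the per-task cost ratios.
ratios = [f / c for f, c in zip(frontier_cost, cheap_cost)]
typical_ratio = median(ratios)

# Aggregate saving: total spend on one solver over total spend on the other.
aggregate_ratio = sum(frontier_cost) / sum(cheap_cost)

print(f"typical task: {typical_ratio:.2f}x cheaper")
print(f"total spend:  {aggregate_ratio:.2f}x cheaper")
```

&lt;p&gt;Four of the five tasks are 1.5x cheaper, yet the single runaway task drags the total close to parity. The median ignores the tail; the bill does not.&lt;/p&gt;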

&lt;h2&gt;
  
  
  How to apply this to your own stack
&lt;/h2&gt;

&lt;p&gt;The principle generalizes well beyond our harness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Default your skill-development and regression evals to a cheap, SOTA-correlated solver.&lt;/strong&gt; The volume of runs lives here.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pin the frontier model only for ship decisions and borderline single-skill calls.&lt;/strong&gt; Here, a decision actually turns on the accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GLM 5.1 is now our default solver and is configurable in the eval runner. So before your next eval run, ask what that eval is actually answering: are you measuring the model, or measuring the skill? If it's the skill, what's the cheapest instrument that still moves when the skill moves?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agentskills</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Your benchmarks are lying to you, and your judge is to blame!</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Tue, 19 May 2026 09:18:00 +0000</pubDate>
      <link>https://dev.to/tessl-io/your-benchmarks-are-lying-to-you-and-your-judge-is-to-blame-2k20</link>
      <guid>https://dev.to/tessl-io/your-benchmarks-are-lying-to-you-and-your-judge-is-to-blame-2k20</guid>
      <description>&lt;p&gt;Last week I &lt;a href="https://tessl.io/blog/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense/" rel="noopener noreferrer"&gt;published&lt;/a&gt; a benchmark comparing &lt;strong&gt;six models&lt;/strong&gt; across &lt;strong&gt;eleven agent skills&lt;/strong&gt;. The numbers in that post are averages, and we did not explain why.&lt;/p&gt;

&lt;p&gt;When I shared the data internally, Maria from our AI Research team pointed out something that we should take very seriously: &lt;strong&gt;an LLM judge is likely to favour outputs from its own model family&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I ran the full benchmark again with a second judge, then a third, to see if this hypothesis held any water. The scores shifted, the rankings moved, and one model swung 47 percentage points on a single skill depending on the judge who graded it. If you are publishing or trusting eval numbers from a single LLM judge, you are partly benchmarking judge preference rather than model capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR for this post
&lt;/h2&gt;

&lt;p&gt;We’re all busy people, so here’s the tl;dr.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The same benchmark, graded by three different LLM judges, produced different scores and different rankings. &lt;strong&gt;One model swung 47 points on a single skill&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Sonnet&lt;/strong&gt; is the most generous judge, &lt;strong&gt;GPT-5.5&lt;/strong&gt; the strictest. The gap between them averages 6.9 points across all models and skills.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;LLM judges favour their own model family&lt;/strong&gt;. Opus gave itself a 4.6 point boost over what the other two judges awarded it.&lt;/li&gt;
&lt;li&gt;  Rankings only stay stable at the top of the agents we tested. &lt;strong&gt;opus-4-7 held first place&lt;/strong&gt; under all three judges. Everything below it moved position.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Eval criteria built around binary, verifiable things&lt;/strong&gt; (file deleted or not, flag enabled or not) produce stable scores regardless of the judge. &lt;strong&gt;Skills that require qualitative judgment can easily swing 25 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Fix: &lt;strong&gt;run multiple judges and average&lt;/strong&gt;. If you know which model you will mostly use in development, favour that model as a judge. Design rubrics with yes/no criteria wherever the task allows it.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Six models, eleven skills, five scenarios per agent skill, one rubric. The only variable was the scoring model, i.e. the judge: Sonnet, GPT-5.5, and Opus-4-7 each graded every run independently. The figures in our main benchmark are averaged across all three. This post is about what happens before the averaging.&lt;/p&gt;

&lt;h2&gt;
  
  
  The raw results
&lt;/h2&gt;

&lt;p&gt;The Tessl UI shows eval results as an average of all the runs, but you can still see all the information about the scenarios/tasks that were set and how they fared with and without the skills, on the &lt;a href="https://tessl.io/registry/simon/skills/evals" rel="noopener noreferrer"&gt;Tessl registry here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3z39t8zbyuznps4tso55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3z39t8zbyuznps4tso55.png" alt="image1" width="799" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Criteria are pretty easy to create with Tessl. First, &lt;a href="https://docs.tessl.io/introduction-to-tessl/installation" rel="noopener noreferrer"&gt;install the Tessl command-line tool and authenticate with a free account&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://get.tessl.io | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simply ask your agent to create them using Tessl, or just run the following at the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tessl scenario generate &amp;lt;path/to/tile&amp;gt; --count=5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then download them to disk to validate them and make any changes you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tessl scenario download --last
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now run them, choosing the model that runs the scenarios with the &lt;code&gt;--agent&lt;/code&gt; flag and the model that judges the output with the &lt;code&gt;--scorer-agent&lt;/code&gt; flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tessl eval run &amp;lt;path/to/tile&amp;gt; --agent=claude:claude-opus-4-6 --scorer-agent codex:gpt-5.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Judge Strictness: Sonnet Grades Easiest, GPT-5.5 Grades Hardest
&lt;/h2&gt;

&lt;p&gt;Averaged across all six models and all eleven skills, here is what each judge returned:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Judge&lt;/th&gt;
&lt;th&gt;Avg without-skill&lt;/th&gt;
&lt;th&gt;Avg with-skill&lt;/th&gt;
&lt;th&gt;Avg lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;76.1&lt;/td&gt;
&lt;td&gt;90.3&lt;/td&gt;
&lt;td&gt;14.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus-4-7&lt;/td&gt;
&lt;td&gt;72.6&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;15.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;70.7&lt;/td&gt;
&lt;td&gt;83.4&lt;/td&gt;
&lt;td&gt;12.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sonnet grades most generously. GPT-5.5 is the strictest: 6.9 points lower than Sonnet on average with-skill scores and 5.4 points lower on without-skill scores. If your pipeline scores agents with Sonnet as the default judge, your numbers are probably 5 to 7 points higher than a stricter grader would return. That gap is real and it is not uniformly distributed across models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rankings Shift
&lt;/h2&gt;

&lt;p&gt;Here are the per-judge leaderboards for with-skill performance, using the same models and rubrics with only the judge swapped:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;th&gt;Opus-4-7&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;opus-4-7 (94.5)&lt;/td&gt;
&lt;td&gt;opus-4-7 (89.2)&lt;/td&gt;
&lt;td&gt;opus-4-7 (96.5)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;gpt-5.4 (92.7)&lt;/td&gt;
&lt;td&gt;gpt-5.5 (88.4)&lt;/td&gt;
&lt;td&gt;gpt-5.5 (92.3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;gpt-5.3 (91.9)&lt;/td&gt;
&lt;td&gt;composer (88.0)&lt;/td&gt;
&lt;td&gt;composer (90.3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;composer (90.5)&lt;/td&gt;
&lt;td&gt;gpt-5.4 (86.5)&lt;/td&gt;
&lt;td&gt;gpt-5.4 (88.8)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;gpt-5.5 (87.4)&lt;/td&gt;
&lt;td&gt;gpt-5.3 (75.7)&lt;/td&gt;
&lt;td&gt;gpt-5.3 (84.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;gpt-5-codex (85.1)&lt;/td&gt;
&lt;td&gt;gpt-5-codex (72.9)&lt;/td&gt;
&lt;td&gt;gpt-5-codex (78.1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The leaderboard format shows position but hides distances. Here are the raw scores:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;th&gt;Opus-4-7&lt;/th&gt;
&lt;th&gt;Avg&lt;/th&gt;
&lt;th&gt;Swing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;opus-4-7&lt;/td&gt;
&lt;td&gt;94.5&lt;/td&gt;
&lt;td&gt;89.2&lt;/td&gt;
&lt;td&gt;96.5&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;7.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;composer&lt;/td&gt;
&lt;td&gt;90.5&lt;/td&gt;
&lt;td&gt;88.0&lt;/td&gt;
&lt;td&gt;90.3&lt;/td&gt;
&lt;td&gt;89.6&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;87.4&lt;/td&gt;
&lt;td&gt;88.4&lt;/td&gt;
&lt;td&gt;92.3&lt;/td&gt;
&lt;td&gt;89.4&lt;/td&gt;
&lt;td&gt;4.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;92.7&lt;/td&gt;
&lt;td&gt;86.5&lt;/td&gt;
&lt;td&gt;88.8&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3&lt;/td&gt;
&lt;td&gt;91.9&lt;/td&gt;
&lt;td&gt;75.7&lt;/td&gt;
&lt;td&gt;84.0&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;16.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;85.1&lt;/td&gt;
&lt;td&gt;72.9&lt;/td&gt;
&lt;td&gt;78.1&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;td&gt;12.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The swing column is the gap between the highest and lowest score any judge gave a model. composer swings 2.5 points, meaning all three judges broadly agree on it. &lt;strong&gt;gpt-5.3 swings 16.2 points&lt;/strong&gt;, which is more than the gap between first and last place in the averaged rankings.&lt;/p&gt;
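
&lt;p&gt;The Avg and Swing columns are simple to recompute yourself. The sketch below uses the with-skill scores from the table above; the dictionary layout and variable names are just one illustrative way to hold them.&lt;/p&gt;

```python
# With-skill scores from the table above, as (Sonnet, GPT-5.5, Opus-4-7) per model.
scores = {
    "opus-4-7": (94.5, 89.2, 96.5),
    "composer": (90.5, 88.0, 90.3),
    "gpt-5.5": (87.4, 88.4, 92.3),
    "gpt-5.4": (92.7, 86.5, 88.8),
    "gpt-5.3": (91.9, 75.7, 84.0),
    "gpt-5-codex": (85.1, 72.9, 78.1),
}

# Swing: highest minus lowest score any judge gave the model.
swing = {model: round(max(s) - min(s), 1) for model, s in scores.items()}

# Average across the three judges.
average = {model: round(sum(s) / len(s), 1) for model, s in scores.items()}

# Gap between first and last place in the averaged ranking.
spread_of_averages = round(max(average.values()) - min(average.values()), 1)

for model in scores:
    print(f"{model:12s} avg={average[model]:5.1f} swing={swing[model]:4.1f}")
print(f"first-to-last gap in averages: {spread_of_averages}")
```

&lt;p&gt;The averaged first-to-last gap comes out at 14.7 points (93.4 for opus-4-7 against 78.7 for gpt-5-codex), so the 16.2 point swing on gpt-5.3 is larger than the whole spread of the averaged leaderboard.&lt;/p&gt;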

&lt;p&gt;&lt;strong&gt;opus-4-7 holds first place under every judge&lt;/strong&gt;, and that is the one stable finding in the table. Everything else shifts. gpt-5.3 sits third under Sonnet and falls to fifth under both GPT-5.5 and Opus. gpt-5.5 sits fifth under Sonnet and climbs to second under the other two. The Sonnet-only leaderboard, which is what most default Tessl runs would have produced, gives a flattering picture of gpt-5.3 and an unflattering one of gpt-5.5.&lt;/p&gt;

&lt;p&gt;Judge choice also affects how much credit each model gets for using skill context. Lift scores, meaning the gap between baseline and with-skill performance, vary considerably by judge:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Sonnet lift&lt;/th&gt;
&lt;th&gt;GPT-5.5 lift&lt;/th&gt;
&lt;th&gt;Opus lift&lt;/th&gt;
&lt;th&gt;Avg lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3&lt;/td&gt;
&lt;td&gt;16.1&lt;/td&gt;
&lt;td&gt;16.2&lt;/td&gt;
&lt;td&gt;22.9&lt;/td&gt;
&lt;td&gt;18.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;composer&lt;/td&gt;
&lt;td&gt;16.9&lt;/td&gt;
&lt;td&gt;13.8&lt;/td&gt;
&lt;td&gt;15.4&lt;/td&gt;
&lt;td&gt;15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;16.8&lt;/td&gt;
&lt;td&gt;13.5&lt;/td&gt;
&lt;td&gt;15.3&lt;/td&gt;
&lt;td&gt;15.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;10.2&lt;/td&gt;
&lt;td&gt;15.1&lt;/td&gt;
&lt;td&gt;16.2&lt;/td&gt;
&lt;td&gt;13.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;opus-4-7&lt;/td&gt;
&lt;td&gt;14.0&lt;/td&gt;
&lt;td&gt;9.7&lt;/td&gt;
&lt;td&gt;14.1&lt;/td&gt;
&lt;td&gt;12.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;11.3&lt;/td&gt;
&lt;td&gt;8.3&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Opus judge gives gpt-5.3 a lift of 22.9 points&lt;/strong&gt;. Sonnet and GPT-5.5 give it about 16 (16.1 and 16.2). The rubric was identical. The disagreement is purely about whether gpt-5.3's output counted as genuine compliance or a close approximation, and a single judge cannot tell you which reading is correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Judge Bias Is Measurable
&lt;/h2&gt;

&lt;p&gt;The results split along model family lines, though not symmetrically.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Own judge score&lt;/th&gt;
&lt;th&gt;Other judges avg&lt;/th&gt;
&lt;th&gt;Boost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;opus-4-7 (Opus judge)&lt;/td&gt;
&lt;td&gt;96.5&lt;/td&gt;
&lt;td&gt;91.9&lt;/td&gt;
&lt;td&gt;+4.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5 (GPT-5.5 judge)&lt;/td&gt;
&lt;td&gt;88.4&lt;/td&gt;
&lt;td&gt;89.9&lt;/td&gt;
&lt;td&gt;-1.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Opus case is unambiguous. &lt;strong&gt;Opus gives itself 96.5&lt;/strong&gt;; Sonnet gives it 94.5; GPT-5.5 gives it 89.2. The 7.3 point gap between Opus-as-judge and GPT-5.5-as-judge for the same model on the same runs is entirely a grading artefact. The gpt-5.5 case does not follow the same pattern: GPT-5.5 actually scores its own model lower than the other two judges do, and Opus gives gpt-5.5 its highest score at 92.3. Self-favour exists but is not symmetric, and its size and direction vary by model and judge pairing.&lt;/p&gt;

&lt;p&gt;The practical consequence for the Opus case specifically: if you are using Claude models to grade Claude outputs, expect a systematic upward bias of 4 to 5 points. It does not show up as a bias in your data; it just looks like good scores.&lt;/p&gt;
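
&lt;p&gt;If you want to check your own pipeline for this, the boost is just the score a model gets from its own judge minus the mean score from the other judges. A minimal sketch using the with-skill scores from the raw-score table; the data layout and the function name are illustrative assumptions.&lt;/p&gt;

```python
from statistics import mean

# With-skill scores from the raw-score table, keyed by model, then by judge.
scores = {
    "opus-4-7": {"sonnet": 94.5, "gpt-5.5": 89.2, "opus-4-7": 96.5},
    "gpt-5.5": {"sonnet": 87.4, "gpt-5.5": 88.4, "opus-4-7": 92.3},
}


def self_judge_boost(model, by_judge):
    """Own-judge score minus the mean score from the other judges."""
    own = by_judge[model]
    others = [score for judge, score in by_judge.items() if judge != model]
    return own - mean(others)


boosts = {
    model: self_judge_boost(model, by_judge)
    for model, by_judge in scores.items()
}
for model, boost in boosts.items():
    print(f"{model}: {boost:+.2f}")
```

&lt;p&gt;A positive number means the model's own judge is more generous to it than the other judges are. Unrounded, the two boosts are about 4.65 and -1.45 points, which matches the table above up to rounding.&lt;/p&gt;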

&lt;h2&gt;
  
  
  The Averaged Picture
&lt;/h2&gt;

&lt;p&gt;Given all this variance, here is what you get when you average across all three judges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg baseline&lt;/th&gt;
&lt;th&gt;Avg with-skill&lt;/th&gt;
&lt;th&gt;Avg lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;opus-4-7&lt;/td&gt;
&lt;td&gt;80.8&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;12.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;composer&lt;/td&gt;
&lt;td&gt;74.2&lt;/td&gt;
&lt;td&gt;89.6&lt;/td&gt;
&lt;td&gt;15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;75.5&lt;/td&gt;
&lt;td&gt;89.4&lt;/td&gt;
&lt;td&gt;13.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;74.1&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;15.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3&lt;/td&gt;
&lt;td&gt;65.5&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;18.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;68.7&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The lift column deserves a second look. &lt;strong&gt;gpt-5.3 shows the largest average lift at 18.4 points&lt;/strong&gt;, which sounds like a strength, but its baseline of 65.5 is also the weakest of any model: 3.2 points below gpt-5-codex and 8.6 points below the next weakest, gpt-5.4. It benefits most from skill context and starts furthest behind without it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the gpt-5.3 Drop Actually Tells Us
&lt;/h2&gt;

&lt;p&gt;gpt-5.3 scored 91.9 under Sonnet, 75.7 under GPT-5.5, and 84.0 under Opus. Two judges independently came back with substantially lower scores, which means the Sonnet-only number was inflated, and that inflation was specific to gpt-5.3 in a way it was not specific to gpt-5.4 or composer.&lt;/p&gt;

&lt;p&gt;The pattern in the per-skill data points to one cause: gpt-5.3 produces outputs that are in the right direction but not precisely correct. Sonnet, the most generous judge, gives partial credit. GPT-5.5, the strictest, does not. If you care about whether your agent follows a spec exactly rather than approximately, GPT-5.5's score is the more informative one. If you care about general capability, the average of three judges is probably right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do About It
&lt;/h2&gt;

&lt;p&gt;Single-judge evals are benchmarking judge preference as much as model capability. Running multiple judges and averaging the results fixes this: three independent judges will smooth out individual preferences and give you a number that is harder to game and more stable across reruns.&lt;/p&gt;

&lt;p&gt;Beyond averaging, design your rubric for binary criteria wherever the task allows it. "Is the file deleted?" is a better eval item than "how well did the agent explain the migration?" The first gives every judge the same answer. The second gives every judge a different one.&lt;/p&gt;

&lt;p&gt;The models that perform consistently across all three judges in this benchmark, gpt-5.4 and composer, share one characteristic: their outputs are correct rather than approximately correct. A strict grader and a generous one disagree less when the answer is unambiguously right.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agentskills</category>
      <category>security</category>
    </item>
    <item>
      <title>Stop trusting your agent skills with vibes. Eliminate the context security risk.</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Fri, 15 May 2026 04:55:29 +0000</pubDate>
      <link>https://dev.to/tessl/stop-trusting-your-agent-skills-with-vibes-eliminate-the-context-security-risk-1jld</link>
      <guid>https://dev.to/tessl/stop-trusting-your-agent-skills-with-vibes-eliminate-the-context-security-risk-1jld</guid>
      <description>&lt;p&gt;When you install an npm package, you can run &lt;code&gt;npm audit&lt;/code&gt;. When you install a Python package, there's &lt;code&gt;pip-audit&lt;/code&gt;. But when you install plugins that give your AI agent new skills and rules, you know, things that directly shape how it reasons and what it does, what do you run?&lt;/p&gt;

&lt;p&gt;If your answer is "nothing", you're not alone, and that's why I built &lt;code&gt;tessl-audit&lt;/code&gt;! You can check it out on &lt;a href="https://github.com/AI-Native-Dev-Community/tessl-audit" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://www.npmjs.com/package/tessl-audit" rel="noopener noreferrer"&gt;npm&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more than you think
&lt;/h2&gt;

&lt;p&gt;Agent plugins are &lt;em&gt;instructions&lt;/em&gt; that get loaded into your AI agent's context. A plugin with a security issue doesn't just expose a server endpoint. It can influence the agent's behaviour in ways that are subtle and hard to detect, perhaps nudging it toward unsafe patterns, exposing data it shouldn't, or simply making it worse at its job.&lt;/p&gt;

&lt;p&gt;Ask yourself these three questions about your agent skills, and if the answer to any of them is no, you’re seconds away from being able to say yes, with tessl-audit.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Have all your skills been security scanned?&lt;/strong&gt; If so, what was the result?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Can you prove your skills are any good?&lt;/strong&gt; Quality scores tell you how well-written and complete a plugin is. A low score means the agent is getting poor guidance.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Do your skills and plugins actually help?&lt;/strong&gt; Uplift scores measure whether a plugin improves agent task performance compared to a vanilla agent alone.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why not try it right now?
&lt;/h2&gt;

&lt;p&gt;It’s a free open source tool that uses Tessl under the covers. If you have a Tessl project with plugins installed, just run this in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npx tessl-audit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, is that it? Absolutely, that's it. It reads your &lt;code&gt;tessl.json&lt;/code&gt;, fetches live data from the registry for every plugin, and prints a report in about 30 seconds.&lt;/p&gt;

&lt;p&gt;The script begins by looking through all the context files it finds in the &lt;code&gt;tessl.json&lt;/code&gt; manifest file. This should complete pretty quickly, and you’ll soon see the table below, with a breakdown of your project context and the types of warnings that have been picked up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0rrz9ig4r2nebvw87p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0rrz9ig4r2nebvw87p3.png" alt="image1" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, the tool gives a posture summary of all of your context, giving more details of the riskiest skills in your project and what the issues are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9xwxk46mxgxqvjtqios.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9xwxk46mxgxqvjtqios.png" alt="img2" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can click through on any of these links to see the actual issues in the registry web UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsib0z1ar0osa3lfxvrau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsib0z1ar0osa3lfxvrau.png" alt="img3" width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, the tool suggests next steps: the CLI commands to use (you can also have an agent call these) to optimize your skills and to create and run evals on them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwtr6gssymroeyl5g4cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwtr6gssymroeyl5g4cf.png" alt="img4" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The "so what" for each finding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Advisory, Risky, or Critical security status?
&lt;/h3&gt;

&lt;p&gt;The report prints each flagged plugin with its warning codes and a direct link to the full security report on the registry. There is no need to chase them down: the security posture report shows the full summary in one listing, so you can deep dive where needed. Just open the link, read the finding, and decide if it applies to your use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality below 80%?
&lt;/h3&gt;

&lt;p&gt;The plugin you’re using is giving your agent incomplete or poorly-structured guidance. Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tessl skill review --optimize workspace/plugin-name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs a quality review and applies automatic improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  No uplift data?
&lt;/h3&gt;

&lt;p&gt;The plugin has never been evaluated against real tasks, so you have no idea if it's helping or hurting. Fix that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tessl scenario generate --count 5 workspace/plugin-name
tessl eval run workspace/plugin-name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate a set of test scenarios from the plugin, then run the eval. You'll get a concrete uplift score showing whether the plugin is worth keeping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;Every team that uses AI agents is building a dependency graph of skills, rules, and knowledge, just like they build a dependency graph of packages. The tooling for auditing that graph is still being built, but the risks are real and growing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tessl-audit&lt;/code&gt; is a small, practical step: one command, zero installation, actionable output. Run it today and find out what your agent is actually working with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npx tessl-audit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;code&gt;tessl-audit&lt;/code&gt; requires the Tessl CLI (no worries, it’s already a dependency) and an authenticated Tessl session (just create a free account if you haven’t got one). You’ll also need a &lt;code&gt;tessl.json&lt;/code&gt;, which is a context manifest tile, in order to run the &lt;code&gt;tessl-audit&lt;/code&gt; tool.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful docs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://docs.tessl.io/evaluate/evaluate-skill-quality-using-scenarios" rel="noopener noreferrer"&gt;Evaluate skill quality using scenarios&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.tessl.io/evaluate/evaluating-skills" rel="noopener noreferrer"&gt;Review a skill against best practices&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://tessl.io/registry/tessl-labs/skill-optimizer" rel="noopener noreferrer"&gt;Skill Optimizer plugin&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>security</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Tessl Admin Guide: Organizations, Workspaces, and Roles</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Thu, 14 May 2026 06:45:55 +0000</pubDate>
      <link>https://dev.to/tessl/tessl-admin-guide-organizations-workspaces-and-roles-4m75</link>
      <guid>https://dev.to/tessl/tessl-admin-guide-organizations-workspaces-and-roles-4m75</guid>
      <description>&lt;p&gt;Just signed up to Tessl? Wondering about the next steps for rolling Tessl out to your team? This article walks you through managing your top-level Organization, inviting your users, setting policy items, creating your workspaces, assigning membership to those workspaces, and defining user &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;roles&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Organizations and Workspaces work in Tessl
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Organizations&lt;/em&gt;&lt;/strong&gt; are top-level entities, often representing the billing or corporate entity. Each Organization contains &lt;strong&gt;&lt;em&gt;Workspaces&lt;/em&gt;&lt;/strong&gt;, which provide role-based access to the various users across the company.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8e06vahqpkgmjfc731a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8e06vahqpkgmjfc731a.png" alt="A diagram showing a top level Organization, with many workspaces below. Some with Search,Install, and Publish permissions, some with just Install and Publish, and one with no access." width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up your Tessl Organization
&lt;/h2&gt;

&lt;p&gt;Organizations are sometimes created during the presales phase of acquiring Tessl, or may be created later. If one has not been created, it will be created automatically when you create your first workspace. If prompted, click &lt;strong&gt;Create workspace&lt;/strong&gt; and name it after your team (e.g. YourCompanyName-Engineering) to start.&lt;/p&gt;

&lt;p&gt;Note that workspace names must currently be unique and will appear in plugin names in search results. This is most noticeable if the plugins are published publicly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festzqydqhxluufhxo384.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festzqydqhxluufhxo384.png" alt="View of the registry page where a Create workspace button is being discplayed." width="800" height="1466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workspace should now be visible from the main interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitmjpzw8kyrt0ii7ag52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitmjpzw8kyrt0ii7ag52.png" alt="The workspace selector will appear, displaying the workspaces you have access to,  with sub menu items like eval runs, projects, etc dependant on your permissions." width="800" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The organization can now be viewed by clicking your Account, where your name is displayed, at the bottom left.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl7k6wrz7zz90yaf4zkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl7k6wrz7zz90yaf4zkm.png" alt="By selecting your account/profile, the organization will be displayed with sub menu of members, settings, admin keys, depending on your permissions." width="800" height="1167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the organization is created, navigate to its &lt;strong&gt;&lt;em&gt;Settings&lt;/em&gt;&lt;/strong&gt;, rename it to your company name, and use the toggle to specify whether users can publicly share &lt;a href="https://docs.tessl.io/create/creating-skills" rel="noopener noreferrer"&gt;skills&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdjvzjeg5zdqw9ceb5im.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdjvzjeg5zdqw9ceb5im.png" alt="Organization settings displayes an organization name, the ability to save, and an option to block public tile publishing by toggling a selector." width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating and managing Users in Tessl
&lt;/h3&gt;

&lt;p&gt;Next, invite users to your organization by navigating to the Organization’s &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; menu and assigning the workspaces the users will have access to. Users will be created with the &lt;strong&gt;&lt;em&gt;member&lt;/em&gt;&lt;/strong&gt; role, able to see, search and install skills from the chosen workspaces. Permissions can be elevated from the Workspace &lt;strong&gt;Members&lt;/strong&gt; menu, which is discussed below. Users will need to accept the invite they are sent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4g6r3dl7lo16v46uut0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4g6r3dl7lo16v46uut0.png" alt="Invite member screen displayes an email address, a selection of workspaces that can be added to the user specified." width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once a user is created, you can elevate them to Admin so they can create workspaces or manage users. To do so, navigate to the Organization &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; screen, click the three dots under &lt;strong&gt;&lt;em&gt;Actions&lt;/em&gt;&lt;/strong&gt;, and assign an appropriate role. Examples of some common configurations are provided below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fsvjyckwcicbqzup30w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fsvjyckwcicbqzup30w.png" alt="Expanding the options menu, which is three dots, next to each name yields a submenu with change role and remove" width="800" height="774"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Admin keys
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrfooi6l0jsv9cxkw57k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrfooi6l0jsv9cxkw57k.png" alt="image.png" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Admin keys are for integrations and applications where programmatic access is required across workspaces. They are typically used for automation, and an expiration of up to one year can be set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Workspaces and Users in Tessl
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo5e2gi1574xb0he4gaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo5e2gi1574xb0he4gaj.png" alt="On the side menu of the screen, users can select all plugins, eval runs, projects and members from a specified workspace." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the workspace drop-down to navigate workspaces. Navigate to &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; at the workspace level to specify &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;Roles&lt;/a&gt; for users who require more capabilities within the workspace, such as running evaluations, publishing or managing users.&lt;/p&gt;

&lt;p&gt;To modify a user, search for their name, select their checkbox, choose a &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;role&lt;/a&gt;, and click the &lt;strong&gt;Add&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9fog1k29i17joa6vga0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9fog1k29i17joa6vga0.png" alt="The role selector allows user to select consumer, member, publisher, manager and owner when adding a user to a workspace." width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example role configurations for your team(s)
&lt;/h2&gt;

&lt;p&gt;The following users demonstrate common configurations and &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;roles&lt;/a&gt; that may be used when rolling Tessl out:&lt;/p&gt;

&lt;h3&gt;
  
  
  Samira - Org. Admin
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Samira&lt;/strong&gt;, the administrator and skills champion, needs to manage all workspaces, assign users, and create new workspaces. Make her an Organization admin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8q5xoajbwb8bfdcf8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8q5xoajbwb8bfdcf8m.png" alt="A diagram showing Samira with admin privileges at the Organization level , giving her full permissions on the workspaces below as a result" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Eddie - Lead Engineer
&lt;/h3&gt;

&lt;p&gt;Another user, &lt;strong&gt;Eddie&lt;/strong&gt;, might be a member of an engineering workspace. He needs to use plugins (skills) that have been published, but may also need to publish skills within the engineering workspace for others on his team. This could mean Eddie has the publisher &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;role&lt;/a&gt; in certain workspaces. He may also have the member role in other workspaces where he only needs to search and install.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwgzuipack4tlngxvr26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwgzuipack4tlngxvr26.png" alt="A diagram showing an organization with several workspaces. The user has publisher permission on several, giving search. install, and publish rights. Several other workspaces the user is only a member, providing more limited permissions like Search and Install. One workspace is no access because they were not given permissions." width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Jennifer - Manager
&lt;/h3&gt;

&lt;p&gt;Jennifer may need to add users to a workspace that she owns, publish skills, and possibly remove other managers. Typically she would be given the workspace role "owner" or "manager", depending on whether she needs to remove other owners or delete the workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Joe - New hire engineer
&lt;/h3&gt;

&lt;p&gt;Finally, Joe, a new hire, can search and install skills from the engineering workspace, but cannot share or create skills until later, after he has gained a little more experience. Joe would be made a member of “engineering” with just the “consumer” role.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps!
&lt;/h2&gt;

&lt;p&gt;Now that your users are in and have been assigned roles in the different workspaces, you and your users can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Start creating &lt;a href="https://docs.tessl.io/create/creating-skills" rel="noopener noreferrer"&gt;new skills&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Evaluate the effectiveness of new or existing skills using &lt;a href="https://docs.tessl.io/evaluate/evaluating-skills" rel="noopener noreferrer"&gt;Reviews&lt;/a&gt; and &lt;a href="https://docs.tessl.io/evaluate/evaluate-skill-quality-using-scenarios" rel="noopener noreferrer"&gt;Evals&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Publish those skills to the Tessl registry to share them with your users and agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Let us know what you think! Tessl would love to hear from you through any one of our &lt;a href="https://docs.tessl.io/support/giving-feedback" rel="noopener noreferrer"&gt;feedback channels (Discord, Email, CLI Feedback, etc)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://tessl.io/blog/tessl-admin-guide-organizations-workspaces-and-roles/" rel="noopener noreferrer"&gt;Tessl.blogs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tutorial</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
