<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tessl</title>
    <description>The latest articles on DEV Community by Tessl (@tessl-io).</description>
    <link>https://dev.to/tessl-io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3865880%2Fae4ef80f-404f-4ed5-849f-f94683a6e7b0.png</url>
      <title>DEV Community: Tessl</title>
      <link>https://dev.to/tessl-io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tessl-io"/>
    <language>en</language>
    <item>
      <title>Your benchmarks are lying to you, and your judge is to blame!</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Tue, 19 May 2026 09:18:00 +0000</pubDate>
      <link>https://dev.to/tessl-io/your-benchmarks-are-lying-to-you-and-your-judge-is-to-blame-2k20</link>
      <guid>https://dev.to/tessl-io/your-benchmarks-are-lying-to-you-and-your-judge-is-to-blame-2k20</guid>
      <description>&lt;p&gt;Last week I &lt;a href="https://tessl.io/blog/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense/" rel="noopener noreferrer"&gt;published&lt;/a&gt; a benchmark comparing &lt;strong&gt;six models&lt;/strong&gt; across &lt;strong&gt;eleven agent skills&lt;/strong&gt;. The numbers in that post are averages, and we did not explain why.&lt;/p&gt;

&lt;p&gt;When I shared the data internally, Maria from our AI Research team pointed out something that we should take very seriously: &lt;strong&gt;an LLM judge is likely to favour outputs from its own model family&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I ran the full benchmark again with a second judge, then a third, to see if this hypothesis held any water. The scores shifted, the rankings moved, and one model swung 47 percentage points on a single skill depending on the judge who graded it. If you are publishing or trusting eval numbers from a single LLM judge, you are partly benchmarking judge preference rather than model capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR for this post
&lt;/h2&gt;

&lt;p&gt;We’re all busy people, so here’s the tl;dr.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The same benchmark, graded by three different LLM judges, produced different scores and different rankings. &lt;strong&gt;One model swung 47 points on a single skill&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Sonnet&lt;/strong&gt; is the most generous judge, &lt;strong&gt;GPT-5.5&lt;/strong&gt; the strictest. The gap between them averages 6.9 points across all models and skills.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;LLM judges can favour their own model family&lt;/strong&gt;. Opus gave itself a 4.6 point boost over what the other two judges awarded it, while GPT-5.5 scored its own model 1.5 points below the other judges' average.&lt;/li&gt;
&lt;li&gt;  Rankings only stay stable at the extremes of the agents we tested. &lt;strong&gt;opus-4-7 held first place&lt;/strong&gt; under all three judges and gpt-5-codex held last. The four models in between moved position.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Eval criteria built around binary, verifiable things&lt;/strong&gt; (file deleted or not, flag enabled or not) produce stable scores regardless of the judge. &lt;strong&gt;Skills that require qualitative judgment can easily swing 25 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Fix: &lt;strong&gt;run multiple judges and average&lt;/strong&gt;. If you know which model you’ll mostly use in development, favour that model as the judge. Design rubrics with yes/no criteria wherever the task allows it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://tessl.io/devcon" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbraf14e4s66n7ibzuwuk.png" alt="Join us at AI Native DevCon" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;
Join us at AI Native DevCon (use C0DE30 for 30% discount)



&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Six models, eleven skills, five scenarios per agent skill, one rubric. The only variable was the scoring model, i.e. the judge: Sonnet, GPT-5.5, and Opus-4-7 each graded every run independently. The figures in our main benchmark are averaged across all three. This post is about what happens before the averaging.&lt;/p&gt;

&lt;h2&gt;
  
  
  The raw results
&lt;/h2&gt;

&lt;p&gt;The Tessl UI shows eval results as an average of all the runs, but you can still see all the information about the scenarios/tasks that were set and how they fared with and without the skills, on the &lt;a href="https://tessl.io/registry/simon/skills/evals" rel="noopener noreferrer"&gt;Tessl registry here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3z39t8zbyuznps4tso55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3z39t8zbyuznps4tso55.png" alt="image1" width="799" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Criteria are pretty easy to create with Tessl. First, &lt;a href="https://docs.tessl.io/introduction-to-tessl/installation" rel="noopener noreferrer"&gt;install the Tessl command line tool and authenticate with a free account&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://get.tessl.io | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simply ask your agent to create them using Tessl, or just run the following at the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tessl scenario generate &amp;lt;path/to/tile&amp;gt; --count=5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then download them to disk to validate them and make any changes you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tessl scenario download --last
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now run them, choosing the model that runs the scenarios with the &lt;code&gt;--agent&lt;/code&gt; flag and the model that judges the output with the &lt;code&gt;--scorer-agent&lt;/code&gt; flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tessl eval run &amp;lt;path/to/tile&amp;gt; --agent=claude:claude-opus-4-6 --scorer-agent codex:gpt-5.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Judge Strictness: Sonnet Grades Easiest, GPT-5.5 Grades Hardest
&lt;/h2&gt;

&lt;p&gt;Averaged across all six models and all eleven skills, here is what each judge returned:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Judge&lt;/th&gt;
&lt;th&gt;Avg without-skill&lt;/th&gt;
&lt;th&gt;Avg with-skill&lt;/th&gt;
&lt;th&gt;Avg lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;76.1&lt;/td&gt;
&lt;td&gt;90.3&lt;/td&gt;
&lt;td&gt;14.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus-4-7&lt;/td&gt;
&lt;td&gt;72.6&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;15.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;70.7&lt;/td&gt;
&lt;td&gt;83.4&lt;/td&gt;
&lt;td&gt;12.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sonnet grades most generously. GPT-5.5 is 6.9 points stricter on average with-skill scores and 5.4 points stricter on baseline scores. If your pipeline scores agents with Sonnet as the default judge, your numbers are probably 5 to 7 points higher than a stricter grader would return. That gap is real and it is not uniformly distributed across models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rankings Shift
&lt;/h2&gt;

&lt;p&gt;Here are the per-judge leaderboards for with-skill performance, using the same models and rubrics with only the judge swapped:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;th&gt;Opus-4-7&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;opus-4-7 (94.5)&lt;/td&gt;
&lt;td&gt;opus-4-7 (89.2)&lt;/td&gt;
&lt;td&gt;opus-4-7 (96.5)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;gpt-5.4 (92.7)&lt;/td&gt;
&lt;td&gt;gpt-5.5 (88.4)&lt;/td&gt;
&lt;td&gt;gpt-5.5 (92.3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;gpt-5.3 (91.9)&lt;/td&gt;
&lt;td&gt;composer (88.0)&lt;/td&gt;
&lt;td&gt;composer (90.3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;composer (90.5)&lt;/td&gt;
&lt;td&gt;gpt-5.4 (86.5)&lt;/td&gt;
&lt;td&gt;gpt-5.4 (88.8)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;gpt-5.5 (87.4)&lt;/td&gt;
&lt;td&gt;gpt-5.3 (75.7)&lt;/td&gt;
&lt;td&gt;gpt-5.3 (84.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;gpt-5-codex (85.1)&lt;/td&gt;
&lt;td&gt;gpt-5-codex (72.9)&lt;/td&gt;
&lt;td&gt;gpt-5-codex (78.1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The leaderboard format shows position but hides distances. Here are the raw scores:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;th&gt;Opus-4-7&lt;/th&gt;
&lt;th&gt;Avg&lt;/th&gt;
&lt;th&gt;Swing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;opus-4-7&lt;/td&gt;
&lt;td&gt;94.5&lt;/td&gt;
&lt;td&gt;89.2&lt;/td&gt;
&lt;td&gt;96.5&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;7.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;composer&lt;/td&gt;
&lt;td&gt;90.5&lt;/td&gt;
&lt;td&gt;88.0&lt;/td&gt;
&lt;td&gt;90.3&lt;/td&gt;
&lt;td&gt;89.6&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;87.4&lt;/td&gt;
&lt;td&gt;88.4&lt;/td&gt;
&lt;td&gt;92.3&lt;/td&gt;
&lt;td&gt;89.4&lt;/td&gt;
&lt;td&gt;4.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;92.7&lt;/td&gt;
&lt;td&gt;86.5&lt;/td&gt;
&lt;td&gt;88.8&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3&lt;/td&gt;
&lt;td&gt;91.9&lt;/td&gt;
&lt;td&gt;75.7&lt;/td&gt;
&lt;td&gt;84.0&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;16.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;85.1&lt;/td&gt;
&lt;td&gt;72.9&lt;/td&gt;
&lt;td&gt;78.1&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;td&gt;12.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The swing column is the gap between the highest and lowest score any judge gave a model. composer swings 2.5 points, meaning all three judges broadly agree on it. &lt;strong&gt;gpt-5.3 swings 16.2 points&lt;/strong&gt;, which is more than the gap between first and last place in the averaged rankings.&lt;/p&gt;
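
&lt;p&gt;As a minimal sketch of how the Avg and Swing columns can be derived, the Python below recomputes them from the per-judge scores in the table above. The names &lt;code&gt;SCORES&lt;/code&gt; and &lt;code&gt;summarise&lt;/code&gt; are illustrative only and are not part of Tessl.&lt;/p&gt;

```python
# With-skill scores copied from the raw scores table: (Sonnet, GPT-5.5, Opus-4-7).
SCORES = {
    "opus-4-7": (94.5, 89.2, 96.5),
    "composer": (90.5, 88.0, 90.3),
    "gpt-5.5": (87.4, 88.4, 92.3),
    "gpt-5.4": (92.7, 86.5, 88.8),
    "gpt-5.3": (91.9, 75.7, 84.0),
    "gpt-5-codex": (85.1, 72.9, 78.1),
}


def summarise(scores):
    """Return {model: (average, swing)}, each rounded to one decimal place."""
    summary = {}
    for model, judged in scores.items():
        average = sum(judged) / len(judged)
        # Swing is the gap between the most and least generous judge.
        swing = max(judged) - min(judged)
        summary[model] = (round(average, 1), round(swing, 1))
    return summary


for model, (average, swing) in summarise(SCORES).items():
    print(f"{model:12s} avg={average:5.1f} swing={swing:4.1f}")
```

&lt;p&gt;Sorting the result by swing is a quick way to spot which models your judges disagree about most.&lt;/p&gt;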

&lt;p&gt;&lt;strong&gt;opus-4-7 holds first place under every judge&lt;/strong&gt;, and gpt-5-codex holds last; those are the only stable positions in the table. Everything in between shifts. gpt-5.3 sits third under Sonnet and falls to fifth under both GPT-5.5 and Opus. gpt-5.5 sits fifth under Sonnet and climbs to second under the other two. The Sonnet-only leaderboard, which is what most default Tessl runs would have produced, gives a flattering picture of gpt-5.3 and an unflattering one of gpt-5.5.&lt;/p&gt;
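
&lt;p&gt;A rough way to check rank stability yourself, assuming you have per-judge with-skill scores like those in the leaderboard above. &lt;code&gt;BY_JUDGE&lt;/code&gt;, &lt;code&gt;rank_table&lt;/code&gt; and &lt;code&gt;stable_models&lt;/code&gt; are hypothetical names for this sketch, not Tessl APIs.&lt;/p&gt;

```python
# With-skill scores per judge, copied from the leaderboard table.
BY_JUDGE = {
    "Sonnet": {
        "opus-4-7": 94.5, "gpt-5.4": 92.7, "gpt-5.3": 91.9,
        "composer": 90.5, "gpt-5.5": 87.4, "gpt-5-codex": 85.1,
    },
    "GPT-5.5": {
        "opus-4-7": 89.2, "gpt-5.5": 88.4, "composer": 88.0,
        "gpt-5.4": 86.5, "gpt-5.3": 75.7, "gpt-5-codex": 72.9,
    },
    "Opus-4-7": {
        "opus-4-7": 96.5, "gpt-5.5": 92.3, "composer": 90.3,
        "gpt-5.4": 88.8, "gpt-5.3": 84.0, "gpt-5-codex": 78.1,
    },
}


def ranks(scores):
    """Map each model to its 1-based position, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_table(by_judge):
    """Return {model: [rank under each judge]} in the judge order of by_judge."""
    per_judge = {judge: ranks(scores) for judge, scores in by_judge.items()}
    models = next(iter(by_judge.values()))
    return {model: [per_judge[judge][model] for judge in by_judge] for model in models}


def stable_models(by_judge):
    """Models that hold the same position under every judge."""
    return [model for model, r in rank_table(by_judge).items() if len(set(r)) == 1]


for model, positions in rank_table(BY_JUDGE).items():
    print(model, positions)
print("stable:", stable_models(BY_JUDGE))
```

&lt;p&gt;For gpt-5.3 the positions come out as third, fifth and fifth, which matches the leaderboard. Any model whose three ranks differ has a position that depends on who is grading.&lt;/p&gt;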

&lt;p&gt;Judge choice also affects how much credit each model gets for using skill context. Lift scores, meaning the gap between baseline and with-skill performance, vary considerably by judge:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Sonnet lift&lt;/th&gt;
&lt;th&gt;GPT-5.5 lift&lt;/th&gt;
&lt;th&gt;Opus lift&lt;/th&gt;
&lt;th&gt;Avg lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3&lt;/td&gt;
&lt;td&gt;16.1&lt;/td&gt;
&lt;td&gt;16.2&lt;/td&gt;
&lt;td&gt;22.9&lt;/td&gt;
&lt;td&gt;18.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;composer&lt;/td&gt;
&lt;td&gt;16.9&lt;/td&gt;
&lt;td&gt;13.8&lt;/td&gt;
&lt;td&gt;15.4&lt;/td&gt;
&lt;td&gt;15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;16.8&lt;/td&gt;
&lt;td&gt;13.5&lt;/td&gt;
&lt;td&gt;15.3&lt;/td&gt;
&lt;td&gt;15.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;10.2&lt;/td&gt;
&lt;td&gt;15.1&lt;/td&gt;
&lt;td&gt;16.2&lt;/td&gt;
&lt;td&gt;13.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;opus-4-7&lt;/td&gt;
&lt;td&gt;14.0&lt;/td&gt;
&lt;td&gt;9.7&lt;/td&gt;
&lt;td&gt;14.1&lt;/td&gt;
&lt;td&gt;12.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;11.3&lt;/td&gt;
&lt;td&gt;8.3&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Opus judge gives gpt-5.3 a lift of 22.9 points&lt;/strong&gt;. Sonnet and GPT-5.5 give it 16. The rubric was identical. The disagreement is purely about whether gpt-5.3's output counted as genuine compliance or a close approximation, and a single judge cannot tell you which reading is correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Judge Bias Is Measurable
&lt;/h2&gt;

&lt;p&gt;The results split along model family lines, though not symmetrically.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Own judge score&lt;/th&gt;
&lt;th&gt;Other judges avg&lt;/th&gt;
&lt;th&gt;Boost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;opus-4-7 (Opus judge)&lt;/td&gt;
&lt;td&gt;96.5&lt;/td&gt;
&lt;td&gt;91.9&lt;/td&gt;
&lt;td&gt;+4.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5 (GPT-5.5 judge)&lt;/td&gt;
&lt;td&gt;88.4&lt;/td&gt;
&lt;td&gt;89.9&lt;/td&gt;
&lt;td&gt;-1.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Opus case is unambiguous. &lt;strong&gt;Opus gives itself 96.5&lt;/strong&gt;; Sonnet gives it 94.5; GPT-5.5 gives it 89.2. The 7.3 point gap between Opus-as-judge and GPT-5.5-as-judge for the same model on the same runs is entirely a grading artefact. The gpt-5.5 case does not follow the same pattern: GPT-5.5 gives its own model 88.4, which is above Sonnet's 87.4 but below Opus's 92.3, the highest score gpt-5.5 receives from any judge. That leaves its own score 1.5 points under the average of the other two judges. Self-favour exists but is not symmetric, and its size and direction vary by model and judge pairing.&lt;/p&gt;

&lt;p&gt;The practical consequence for the Opus case specifically: if you are using Claude models to grade Claude outputs, expect a systematic upward bias of 4 to 5 points. It does not show up as a bias in your data; it just looks like good scores.&lt;/p&gt;
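
&lt;p&gt;The boost figure is simply a model's score from its own judge minus the mean of the scores from the other judges. Here is a minimal sketch of that arithmetic using the numbers above; &lt;code&gt;WITH_SKILL&lt;/code&gt;, &lt;code&gt;OWN_JUDGE&lt;/code&gt; and &lt;code&gt;self_boost&lt;/code&gt; are names made up for this example.&lt;/p&gt;

```python
from statistics import mean

# With-skill scores for the two models that also acted as judges.
WITH_SKILL = {
    "opus-4-7": {"Sonnet": 94.5, "GPT-5.5": 89.2, "Opus-4-7": 96.5},
    "gpt-5.5": {"Sonnet": 87.4, "GPT-5.5": 88.4, "Opus-4-7": 92.3},
}

# Which judge counts as each model's "own" judge.
OWN_JUDGE = {"opus-4-7": "Opus-4-7", "gpt-5.5": "GPT-5.5"}


def self_boost(model):
    """Own-judge score minus the mean score from all other judges."""
    scores = WITH_SKILL[model]
    own = scores[OWN_JUDGE[model]]
    others = [score for judge, score in scores.items() if judge != OWN_JUDGE[model]]
    return own - mean(others)


for model in WITH_SKILL:
    print(f"{model}: {self_boost(model):+.2f}")
```

&lt;p&gt;A positive value means the judge is kinder to its own model than its peers are, and a negative value means it is harsher. The results agree with the Boost column up to rounding.&lt;/p&gt;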

&lt;h2&gt;
  
  
  The Averaged Picture
&lt;/h2&gt;

&lt;p&gt;Given all this variance, here is what you get when you average across all three judges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg baseline&lt;/th&gt;
&lt;th&gt;Avg with-skill&lt;/th&gt;
&lt;th&gt;Avg lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;opus-4-7&lt;/td&gt;
&lt;td&gt;80.8&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;12.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;composer&lt;/td&gt;
&lt;td&gt;74.2&lt;/td&gt;
&lt;td&gt;89.6&lt;/td&gt;
&lt;td&gt;15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.5&lt;/td&gt;
&lt;td&gt;75.5&lt;/td&gt;
&lt;td&gt;89.4&lt;/td&gt;
&lt;td&gt;13.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.4&lt;/td&gt;
&lt;td&gt;74.1&lt;/td&gt;
&lt;td&gt;89.3&lt;/td&gt;
&lt;td&gt;15.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.3&lt;/td&gt;
&lt;td&gt;65.5&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;18.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-codex&lt;/td&gt;
&lt;td&gt;68.7&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The lift column deserves a second look. &lt;strong&gt;gpt-5.3 shows the largest average lift at 18.4 points&lt;/strong&gt;, which sounds like a strength, but its baseline of 65.5 is also the weakest of any model, 3.2 points below gpt-5-codex and 8.6 points below the next-weakest non-codex model, gpt-5.4. It benefits most from skill context and starts furthest behind without it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the gpt-5.3 Drop Actually Tells Us
&lt;/h2&gt;

&lt;p&gt;gpt-5.3 scored 91.9 under Sonnet, 75.7 under GPT-5.5, and 84.0 under Opus. Two judges independently came back with substantially lower scores, which means the Sonnet-only number was inflated, and that inflation was specific to gpt-5.3 in a way it was not specific to gpt-5.4 or composer.&lt;/p&gt;

&lt;p&gt;The pattern in the per-skill data points to one cause: gpt-5.3 produces outputs that are in the right direction but not precisely correct. Sonnet, the most generous judge, gives partial credit. GPT-5.5, the strictest, does not. If you care about whether your agent follows a spec exactly rather than approximately, GPT-5.5's score is the more informative one. If you care about general capability, the average of three judges is probably right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do About It
&lt;/h2&gt;

&lt;p&gt;Single-judge evals are benchmarking judge preference as much as model capability. Running multiple judges and averaging the results fixes this: three independent judges will smooth out individual preferences and give you a number that is harder to game and more stable across reruns.&lt;/p&gt;

&lt;p&gt;Beyond averaging, design your rubric for binary criteria wherever the task allows it. "Is the file deleted?" is a better eval item than "how well did the agent explain the migration?" The first gives every judge the same answer. The second gives every judge a different one.&lt;/p&gt;
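
&lt;p&gt;To make the idea of binary criteria concrete, here is a small sketch of a rubric made only of yes/no checks. It is not Tessl's rubric format: the file names, the flag and &lt;code&gt;score_binary_rubric&lt;/code&gt; are all hypothetical, and the checks run against a temporary directory standing in for an agent's workspace.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def score_binary_rubric(checks):
    """Run each yes/no check and return (percentage passed, per-check results)."""
    results = {name: bool(check()) for name, check in checks.items()}
    passed = sum(results.values())
    return 100 * passed / len(results), results


with tempfile.TemporaryDirectory() as workdir:
    root = Path(workdir)
    # Simulate the state an agent left behind: flag enabled, legacy file gone.
    (root / "config.yaml").write_text("feature_flag: true\n")

    checks = {
        "legacy file deleted": lambda: not (root / "legacy.cfg").exists(),
        "flag enabled": lambda: "feature_flag: true" in (root / "config.yaml").read_text(),
    }
    score, results = score_binary_rubric(checks)

print(score, results)
```

&lt;p&gt;Each check has a single correct answer, so a strict judge and a generous judge have nothing to disagree about.&lt;/p&gt;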

&lt;p&gt;The models that perform consistently across all three judges in this benchmark, gpt-5.4 and composer, share one characteristic: their outputs are correct rather than approximately correct. A strict grader and a generous one disagree less when the answer is unambiguously right.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agentskills</category>
      <category>security</category>
    </item>
    <item>
      <title>Stop trusting your agent skills with vibes. Eliminate the context security risk.</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Fri, 15 May 2026 04:55:29 +0000</pubDate>
      <link>https://dev.to/tessl/stop-trusting-your-agent-skills-with-vibes-eliminate-the-context-security-risk-1jld</link>
      <guid>https://dev.to/tessl/stop-trusting-your-agent-skills-with-vibes-eliminate-the-context-security-risk-1jld</guid>
      <description>&lt;p&gt;When you install an npm package, you can run &lt;code&gt;npm audit&lt;/code&gt;. When you install a Python package, there's &lt;code&gt;pip-audit&lt;/code&gt;. But when you install plugins that give your AI agent new skills and rules, you know, things that directly shape how it reasons and what it does, what do you run?&lt;/p&gt;

&lt;p&gt;If your answer is "nothing", you're not alone, and that's why I built &lt;code&gt;tessl-audit&lt;/code&gt;! You can check it out on &lt;a href="https://github.com/AI-Native-Dev-Community/tessl-audit" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://www.npmjs.com/package/tessl-audit" rel="noopener noreferrer"&gt;npm&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more than you think
&lt;/h2&gt;

&lt;p&gt;Agent plugins are &lt;em&gt;instructions&lt;/em&gt; that get loaded into your AI agent's context. A plugin with a security issue doesn't just expose a server endpoint. It can influence the agent's behaviour in ways that are subtle and hard to detect, perhaps nudging it toward unsafe patterns, exposing data it shouldn't, or simply making it worse at its job.&lt;/p&gt;

&lt;p&gt;Ask yourself these three questions about your agent skills, and if the answer to any of them is no, you’re seconds away from being able to say yes, with tessl-audit.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Have all your skills been security scanned?&lt;/strong&gt; If so, what was the result?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Can you prove your skills are any good?&lt;/strong&gt; Quality scores tell you how well-written and complete a plugin is. A low score means the agent is getting poor guidance.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Do your skills and plugins actually help?&lt;/strong&gt; Uplift scores measure whether a plugin improves agent task performance compared to a vanilla agent alone.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why not try it right now?
&lt;/h2&gt;

&lt;p&gt;It’s a free open source tool that uses Tessl under the covers. If you have a Tessl project with plugins installed, just run this in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npx tessl-audit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, is that it? Absolutely, that's it. It reads your &lt;code&gt;tessl.json&lt;/code&gt;, fetches live data from the registry for every plugin, and prints a report in about 30 seconds.&lt;/p&gt;

&lt;p&gt;The script begins by looking through all the context files listed in the &lt;code&gt;tessl.json&lt;/code&gt; manifest file. This should complete pretty quickly, and you’ll soon see the table below, with a breakdown of your project context and the types of warnings that have been picked up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0rrz9ig4r2nebvw87p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0rrz9ig4r2nebvw87p3.png" alt="image1" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, the tool gives a posture summary of all of your context, giving more details of the riskiest skills in your project and what the issues are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9xwxk46mxgxqvjtqios.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9xwxk46mxgxqvjtqios.png" alt="img2" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can click through on any of these links to see the actual issues in the registry web UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsib0z1ar0osa3lfxvrau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsib0z1ar0osa3lfxvrau.png" alt="img3" width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, the tool suggests next steps: the CLI commands (which an agent can also call for you) to optimize your skills and to create and run evals on them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwtr6gssymroeyl5g4cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwtr6gssymroeyl5g4cf.png" alt="img4" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The "so what" for each finding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Advisory, Risky, or Critical security status?
&lt;/h3&gt;

&lt;p&gt;The report prints each flagged plugin with its warning codes and a direct link to the full security report on the registry. There is no need to chase them down: the security posture report shows the full summary in one listing and lets you deep dive where needed. Just open the link, read the finding, and decide if it applies to your use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality below 80%?
&lt;/h3&gt;

&lt;p&gt;The plugin you’re using is giving your agent incomplete or poorly-structured guidance. Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tessl skill review --optimize workspace/plugin-name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs a quality review and applies automatic improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  No uplift data?
&lt;/h3&gt;

&lt;p&gt;The plugin has never been evaluated against real tasks — so you have no idea if it's helping or hurting. Fix that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tessl scenario generate --count 5 workspace/plugin-name
tessl eval run workspace/plugin-name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate a set of test scenarios from the plugin, then run the eval. You'll get a concrete uplift score showing whether the plugin is worth keeping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;Every team that uses AI agents is building a dependency graph of skills, rules, and knowledge, just like they build a dependency graph of packages. The tooling for auditing that graph is still being built, but the risks are real and growing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tessl-audit&lt;/code&gt; is a small, practical step: one command, zero installation, actionable output. Run it today and find out what your agent is actually working with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npx tessl-audit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;code&gt;tessl-audit&lt;/code&gt; requires the Tessl CLI (no worries, it’s already a dependency) and an authenticated Tessl session (just create a free account if you haven’t got one). To run the &lt;code&gt;tessl-audit&lt;/code&gt; tool you’ll also need a &lt;code&gt;tessl.json&lt;/code&gt;, which is the context manifest file.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful docs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://docs.tessl.io/evaluate/evaluate-skill-quality-using-scenarios" rel="noopener noreferrer"&gt;Evaluate skill quality using scenarios&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.tessl.io/evaluate/evaluating-skills" rel="noopener noreferrer"&gt;Review a skill against best practices&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://tessl.io/registry/tessl-labs/skill-optimizer" rel="noopener noreferrer"&gt;Skill Optimizer plugin&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>security</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Tessl Admin Guide: Organizations, Workspaces, and Roles</title>
      <dc:creator>Tessl</dc:creator>
      <pubDate>Thu, 14 May 2026 06:45:55 +0000</pubDate>
      <link>https://dev.to/tessl/tessl-admin-guide-organizations-workspaces-and-roles-4m75</link>
      <guid>https://dev.to/tessl/tessl-admin-guide-organizations-workspaces-and-roles-4m75</guid>
      <description>&lt;p&gt;Just signed up to Tessl? Wondering next steps to rolling Tessl out to your team? The following article will take you through the steps of managing your top level Organization, invite your users, set policy items, then create your workspaces, assigning membership to those workspaces and defining their &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;roles&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Organizations and Workspaces work in Tessl
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Organizations&lt;/em&gt;&lt;/strong&gt; are top level entities, often representing the billing or corporate entity, with a subcategory called &lt;strong&gt;&lt;em&gt;Workspaces&lt;/em&gt;&lt;/strong&gt; that provide role-based access to the various users across the company.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8e06vahqpkgmjfc731a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8e06vahqpkgmjfc731a.png" alt="A diagram showing a top level Organization, with many workspaces below. Some with Search, Install, and Publish permissions, some with just Install and Publish, and one with no access." width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up your Tessl Organization
&lt;/h2&gt;

&lt;p&gt;Organizations are sometimes created during the presales phase of acquiring Tessl, or may be created later. If one has not been created, it will be created automatically when you create your first workspace. If prompted, click &lt;strong&gt;Create workspace&lt;/strong&gt; and name it after your team (e.g. YourCompanyName-Engineering) to start.&lt;/p&gt;

&lt;p&gt;Note that workspace names must currently be unique, and they will appear in plugin names in search results. This is most noticeable if the plugins are published publicly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festzqydqhxluufhxo384.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festzqydqhxluufhxo384.png" alt="View of the registry page where a Create workspace button is being displayed." width="800" height="1466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workspace should now be visible from the main interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitmjpzw8kyrt0ii7ag52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitmjpzw8kyrt0ii7ag52.png" alt="The workspace selector will appear, displaying the workspaces you have access to, with sub menu items like eval runs, projects, etc. dependent on your permissions." width="800" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The organization can now be viewed by clicking your account, shown with your name at the bottom left.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl7k6wrz7zz90yaf4zkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl7k6wrz7zz90yaf4zkm.png" alt="By selecting your account/profile, the organization will be displayed with sub menu of members, settings, admin keys, depending on your permissions." width="800" height="1167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the organization exists, navigate to its &lt;strong&gt;&lt;em&gt;Settings&lt;/em&gt;&lt;/strong&gt;, rename the organization to your company name, and use the toggle to specify whether users can publicly share &lt;a href="https://docs.tessl.io/create/creating-skills" rel="noopener noreferrer"&gt;skills&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdjvzjeg5zdqw9ceb5im.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdjvzjeg5zdqw9ceb5im.png" alt="Organization settings displays an organization name, the ability to save, and an option to block public tile publishing by toggling a selector." width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating and managing Users in Tessl
&lt;/h3&gt;

&lt;p&gt;Next, invite users to your organization by navigating to the Organization’s &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; menu and assigning the workspaces each user will have access to. Users are created with the &lt;strong&gt;&lt;em&gt;member&lt;/em&gt;&lt;/strong&gt; role, able to see, search and install skills from the chosen workspaces. Permissions can be promoted from the Workspace &lt;strong&gt;Members&lt;/strong&gt; menu, which is discussed below. Users will need to accept the invite they are sent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4g6r3dl7lo16v46uut0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4g6r3dl7lo16v46uut0.png" alt="The Invite member screen displays an email address field and a selection of workspaces that can be assigned to the specified user." width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once a user has been created, you can elevate them to Admin so they can create workspaces or manage users. To do so, navigate to the Organization &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; screen and click the three dots under &lt;strong&gt;&lt;em&gt;Actions&lt;/em&gt;&lt;/strong&gt;. Assign an appropriate role. Some common configurations are shown in the examples below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fsvjyckwcicbqzup30w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fsvjyckwcicbqzup30w.png" alt="Expanding the options menu (three dots) next to each name yields a submenu with Change role and Remove." width="800" height="774"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Admin keys
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrfooi6l0jsv9cxkw57k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrfooi6l0jsv9cxkw57k.png" alt="image.png" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Admin keys are for integrations and applications that require programmatic access across workspaces. They are typically used for automation, and an expiration of up to one year can be set.&lt;/p&gt;
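
&lt;p&gt;As a rough illustration of the one-year limit, the sketch below checks a requested expiry date before you ask for a key. It is not Tessl's API: the function name &lt;code&gt;validate_admin_key_expiry&lt;/code&gt; is hypothetical, and treating "one year" as 365 days is an assumption.&lt;/p&gt;

```python
from datetime import date

# Assumption: "one year" is modelled as 365 days; Tessl may count it differently.
MAX_KEY_LIFETIME_DAYS = 365


def validate_admin_key_expiry(created_on, expires_on):
    """Return True if expiry falls 1 to 365 days after the creation date."""
    lifetime_days = (expires_on - created_on).days
    # range(1, 366) covers 1..365 inclusive, so same-day or past expiries fail.
    return lifetime_days in range(1, MAX_KEY_LIFETIME_DAYS + 1)
```

&lt;p&gt;A check like this could sit in an automation script so that a rotation job fails early instead of requesting a key with an expiry the interface would not accept.&lt;/p&gt;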

&lt;h2&gt;
  
  
  Managing Workspaces and Users in Tessl
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo5e2gi1574xb0he4gaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo5e2gi1574xb0he4gaj.png" alt="On the side menu of the screen, users can select all plugins, eval runs, projects and members from a specified workspace." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the workspace drop-down to navigate workspaces. Navigate to &lt;strong&gt;&lt;em&gt;Members&lt;/em&gt;&lt;/strong&gt; at the workspace level to specify &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;Roles&lt;/a&gt; for users who require more capabilities within the workspace, such as running evaluations, publishing or managing users.&lt;/p&gt;

&lt;p&gt;To modify a user, search for their name, select their checkbox, choose a &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;role&lt;/a&gt;, and click the &lt;strong&gt;Add&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9fog1k29i17joa6vga0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9fog1k29i17joa6vga0.png" alt="The role selector allows user to select consumer, member, publisher, manager and owner when adding a user to a workspace." width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example role configurations for your team(s)
&lt;/h2&gt;

&lt;p&gt;The following users demonstrate common configurations and &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;roles&lt;/a&gt; that may be used when rolling Tessl out:&lt;/p&gt;

&lt;h3&gt;
  
  
  Samira - Org. Admin
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Samira&lt;/strong&gt;, the administrator and skills champion, needs to manage all workspaces, assign users, and create new workspaces. Make her an Organization admin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8q5xoajbwb8bfdcf8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8q5xoajbwb8bfdcf8m.png" alt="A diagram showing Samira with admin privileges at the Organization level , giving her full permissions on the workspaces below as a result" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Eddie - Lead Engineer
&lt;/h3&gt;

&lt;p&gt;Another user, &lt;strong&gt;Eddie&lt;/strong&gt;, might be a member of an engineering workspace. He needs to use published plugins (skills), and may also need to publish skills within the engineering workspace for others on his team. This could mean Eddie has the publisher &lt;a href="https://docs.tessl.io/reference/roles" rel="noopener noreferrer"&gt;role&lt;/a&gt; in certain workspaces. He may also have the member role in other workspaces where he only needs to search and install.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwgzuipack4tlngxvr26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwgzuipack4tlngxvr26.png" alt="A diagram showing an organization with several workspaces. The user has publisher permission on several, giving search, install, and publish rights. In several other workspaces the user is only a member, with more limited permissions like Search and Install. One workspace shows no access because the user was not given permissions." width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Jennifer - Manager
&lt;/h3&gt;

&lt;p&gt;Jennifer may need to add users to a workspace that she owns, publish skills, and possibly remove other managers. Typically the workspace role "owner" or "manager" is given to such a user, depending on whether she needs to remove other owners or delete the workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Joe - New hire engineer
&lt;/h3&gt;

&lt;p&gt;Finally, Joe, a new hire, can search and install skills from the engineering workspace, but cannot share or create skills until later, after he has gained a little more experience. Joe would be added to “engineering” with just the “consumer” role.&lt;/p&gt;
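
&lt;p&gt;The four examples above can be summarised as a small data model when planning a rollout. This is only a sketch, not Tessl's API: the permission labels, the workspace names other than "engineering", and the &lt;code&gt;can&lt;/code&gt; helper are hypothetical. Each role lists only the capabilities this guide mentions, higher roles are assumed to include the lower ones, and consumer and member are modelled identically because the guide does not spell out how they differ. The Roles reference remains the authoritative list.&lt;/p&gt;

```python
# Hypothetical planning model; permission labels are illustrative, not Tessl identifiers.
# Assumption: each higher role also includes the capabilities of the roles below it.
ROLE_PERMISSIONS = {
    "consumer": {"search", "install"},
    "member": {"search", "install"},
    "publisher": {"search", "install", "publish"},
    "manager": {"search", "install", "publish", "manage_users"},
    "owner": {"search", "install", "publish", "manage_users", "delete_workspace"},
}

# Organization admins manage every workspace, so they bypass per-workspace roles.
ORG_ADMINS = {"Samira"}

# Per-workspace roles for the example users; "platform" is a made-up workspace name.
ASSIGNMENTS = {
    "Eddie": {"engineering": "publisher", "platform": "member"},
    "Jennifer": {"engineering": "manager"},
    "Joe": {"engineering": "consumer"},
}


def can(user, action, workspace):
    """Return True if the user may perform the action in the given workspace."""
    if user in ORG_ADMINS:
        return True
    role = ASSIGNMENTS.get(user, {}).get(workspace)
    # No role in a workspace means no access at all.
    return role is not None and action in ROLE_PERMISSIONS[role]
```

&lt;p&gt;With this model, &lt;code&gt;can("Joe", "publish", "engineering")&lt;/code&gt; is false while &lt;code&gt;can("Eddie", "publish", "engineering")&lt;/code&gt; is true, which mirrors the difference between the consumer and publisher examples.&lt;/p&gt;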

&lt;h2&gt;
  
  
  Next steps!
&lt;/h2&gt;

&lt;p&gt;Now that your users are in and have been assigned roles in the different workspaces, you and your users can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Start creating &lt;a href="https://docs.tessl.io/create/creating-skills" rel="noopener noreferrer"&gt;new skills&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Evaluate the effectiveness of new or existing skills using &lt;a href="https://docs.tessl.io/evaluate/evaluating-skills" rel="noopener noreferrer"&gt;Reviews&lt;/a&gt; and &lt;a href="https://docs.tessl.io/evaluate/evaluate-skill-quality-using-scenarios" rel="noopener noreferrer"&gt;Evals&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Publish those skills to the Tessl registry to share them with your users and agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Let us know what you think! Tessl would love to hear from you through any one of our &lt;a href="https://docs.tessl.io/support/giving-feedback" rel="noopener noreferrer"&gt;feedback channels (Discord, Email, CLI Feedback, etc)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://tessl.io/blog/tessl-admin-guide-organizations-workspaces-and-roles/" rel="noopener noreferrer"&gt;Tessl.blogs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tutorial</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
