DEV Community

Bap
Evaluate agents skills, ship 3x better code

Today, I’m excited to announce that you can evaluate your skills and optimize them.

This means you can stop debugging agent output and start shipping quality code, faster.

Agent skills help agents use your products, build in your codebase and enforce your policies.

They're the new unit of software for devs - but most are still treated like simple Markdown files copied between repos with no versioning, no quality signal, no updates.

Without AI evaluations, you can’t tell whether a skill helps, provides minimal uplift, or even degrades functionality.

You spend your time course-correcting agents instead of shipping.

Tessl is a development platform and package manager for agent skills. With Tessl, we were able to evaluate and optimize ElevenLabs' skills, 2x'ing their agent success in using their APIs.

Whether you are building a personal project, maintaining an OSS library, or developing with AI at work, you can now evaluate your skills and optimize them so agents use them properly.

We’ve launched on Product Hunt! If you find it useful, we’d appreciate an upvote - and even more, your feedback in the comments.

Top comments (6)

Bap

What skills are you using or creating? Keen to hear your use cases - and your experience optimizing them.

Rohan Sharma

Looking to try the frontend skills.

Matthew Hou

Evaluating skills before they run is an underrated problem. Most people think about evaluating the model, but the skill layer — the instructions, context, and constraints you give the agent — is where most of the variance actually comes from.

I've seen the same codebase get wildly different results from the same model just by changing how the task was structured in the skill file. The model is a constant. The skill is the variable you actually control.

Curious how the evaluation handles the "looks correct but is subtly wrong" case. That's the hardest failure mode — code that passes surface-level checks but has a logic error that only shows up under specific conditions. That's where the evaluation criteria matter more than the scoring algorithm.

Bap

Appreciate the comment! You’ve hit the nail on the head there. We take a two-sided approach to this: 1. Is the skill structurally sound? (This is our review score.) 2. Scenario task evaluations - how does the skill perform across different real tasks? The second is what unearths the “looks correct, but is subtly wrong” cases. What skills are you currently working with? And have you tried evaluating them yet?
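To make the two-sided idea above concrete, here is a minimal sketch of what such an evaluation could look like. This is purely illustrative - the function names, the structural checks, and the scoring are hypothetical and do not reflect Tessl's actual implementation.

```python
# Hypothetical sketch of a two-sided skill evaluation:
# side 1: a structural "review score" over the skill file itself,
# side 2: a pass rate over real scenario tasks run with the skill.
# All heuristics here are made up for illustration.

def review_score(skill_text: str) -> float:
    """Toy structural review: does the skill file contain the basics?"""
    text = skill_text.lower()
    checks = [
        "when to use" in text,          # usage guidance present?
        "example" in text,              # at least one example?
        len(skill_text.split()) > 50,   # non-trivial amount of instruction?
    ]
    return sum(checks) / len(checks)

def scenario_pass_rate(results: list[bool]) -> float:
    """Fraction of scenario tasks the agent completed with the skill loaded."""
    return sum(results) / len(results) if results else 0.0

skill = "## When to use\nUse this skill when...\n## Example\n" + "word " * 60
print(review_score(skill))                             # structural side: 1.0
print(scenario_pass_rate([True, True, False, True]))   # task side: 0.75
```

The scenario side is what catches "looks correct, but is subtly wrong": a skill can score perfectly on structure while still failing real tasks, which only the task-level pass rate exposes.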

Anmol Baranwal

just saw this on rohan's post on linkedin

K Om Senapati

yup