We’re adding something new at LangWatch: Skills.
And the idea is pretty simple:
your coding agent already knows how to do a lot of the work you’re still doing manually.
You just haven’t packaged it properly yet.
The frustrating part of building AI agents
If you’ve built an LLM agent recently, you probably recognize this loop:
- you tweak something
- you run a few test conversations
- it seems better
- you ship it
- something breaks in production
Then you repeat.
We’ve been there too.
It’s not that you don’t know you need evals, testing, or simulations.
It’s that doing all of that properly is… a lot.
The real work isn’t building, it’s validating
When we started LangWatch, we thought the main challenge was:
getting agents to behave correctly
But in practice, the bigger challenge was:
proving that they behave correctly
That means:
- setting up eval datasets
- writing tests
- simulating real user behavior
- instrumenting pipelines
- understanding failures
And most of this ends up being:
- manual
- repetitive
- easy to skip
The worst part: testing agents doesn’t look like testing code
Traditional testing breaks down with LLMs.
You can’t just say:
assert output == expected
Because agents are non-deterministic. The same input can give different outputs, which makes rigid testing fragile ([LangWatch][1]).
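To make that concrete, here’s a tiny sketch (the refund policy and phrasings are invented for illustration) of why exact-match assertions break:

```python
# Two equally correct answers from a non-deterministic agent,
# and why `assert output == expected` is fragile.
run_1 = "You can request a refund within 30 days of purchase."
run_2 = "Refunds are available for 30 days after you buy."

expected = "You can request a refund within 30 days of purchase."

assert run_1 == expected  # passes, but only by luck
assert run_2 != expected  # same meaning, yet exact match rejects it

# A property-based check tolerates the paraphrase:
assert "30 days" in run_1 and "30 days" in run_2
```

The second run is just as correct as the first, but a rigid equality check calls it a failure.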
So what do people do instead?
They “vibe test”.
- try a few examples
- eyeball the results
- hope nothing breaks
It doesn’t scale.
We already solved part of this (with agents testing agents)
If you’ve seen our earlier work (Scenario), you know we took a different approach:
use an agent to test your agent
Instead of fixed inputs/outputs, you:
- simulate real user behavior
- define success criteria
- let an agent explore and evaluate
This makes testing much closer to reality.
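In sketch form, a simulation-based test looks roughly like this (all names here are illustrative stand-ins, not Scenario’s actual API):

```python
# Hypothetical sketch: a simulated user converses with the agent
# under test, and a judge checks success criteria on the transcript.

def my_agent(message: str, history: list[str]) -> str:
    # Stand-in for the agent you want to test.
    if "refund" in message.lower():
        return "You can request a refund within 30 days of purchase."
    return "How can I help you today?"

def simulated_user(turn: int) -> str:
    # Scripted stand-in for an LLM-driven user simulator.
    script = ["Hi, I bought the wrong plan.", "Can I get a refund?"]
    return script[turn]

def judge(transcript: list[tuple[str, str]]) -> bool:
    # Success criterion: the refund window was actually communicated.
    return any("30 days" in reply for _, reply in transcript)

transcript: list[tuple[str, str]] = []
history: list[str] = []
for turn in range(2):
    msg = simulated_user(turn)
    reply = my_agent(msg, history)
    history += [msg, reply]
    transcript.append((msg, reply))

assert judge(transcript)  # the conversation met the criteria
```

The key shift: you assert on criteria over a whole conversation, not on fixed input/output pairs.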
But even then…
You still had to set everything up yourself.
So we asked: why are we still doing this manually?
At this point, most developers already have a coding agent open all day.
And those agents are actually pretty good at:
- writing tests
- structuring code
- following instructions
So we started asking:
what if we let the coding agent handle the “quality work” too?
Not just writing features.
But:
- setting up evals
- creating simulations
- instrumenting systems
- analyzing behavior
That’s where Skills come in
We built LangWatch Skills as a way to give your coding agent reusable capabilities.
A Skill is basically:
a structured way to get your coding agent to do something correctly, every time
Not just:
“generate some code”
But:
“do this properly, following best practices, with full coverage”
What a Skill actually looks like
Under the hood, Skills are closer to:
- structured instructions
- workflows
- examples
- best practices
In general, agent skills are “instruction modules” that extend what an agent can do without retraining it ([philschmid.de][2]).
They tell the agent:
- when to apply something
- how to do it
- what good looks like
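As a rough illustration, a skill definition might look something like this (the structure and fields here are hypothetical; details vary by agent framework):

```markdown
---
name: setup-evals
description: Set up evaluation scaffolding for an LLM agent
---

# When to use this skill
When the user asks to add evals, tests, or simulations to an agent.

# How to do it
1. Locate the agent's entry point and its tool definitions.
2. Generate an eval dataset covering the main user intents.
3. Write simulation-based tests with explicit success criteria.

# What good looks like
- Every user-facing behavior has at least one scenario test.
- Failures produce a readable transcript, not just a red X.
```

It’s closer to a checklist the agent can follow than to a library it imports.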
What you can do with LangWatch Skills
With Skills, you can tell your coding agent to:
- instrument your agent
- generate evaluation notebooks
- create simulation-based tests
- explore production performance
- red-team your system
And instead of figuring out how to do it…
…the agent just does it.
The shift is subtle, but important
Before:
you write eval code, tests, and infrastructure
After:
you review and guide what your agent generates
You move from implementation to coordination.
And that’s actually where most of the value is.
This is part of a bigger shift: “harness engineering”
There’s a growing idea in the ecosystem:
the performance of your agent depends heavily on how you configure it
Not just the model.
But:
- tools
- context
- memory
- skills
These are all part of what some people call the agent “harness” — the system around the model that shapes its behavior ([humanlayer.dev][3]).
Skills are one of the most powerful (and underused) pieces of that.
But Skills aren’t magic
One important thing we’ve learned:
Skills don’t automatically fix everything.
In fact, a lot of skills:
- don’t improve performance
- or only help in specific contexts
Recent research shows many skills have limited impact unless they’re well-designed and properly evaluated ([arXiv][4]).
So the goal isn’t:
“add more skills”
It’s:
“add the right skills, and make them actually useful”
Why this matters now
We’re entering a phase where:
- building agents is easy
- making them reliable is not
The bottleneck has shifted.
And the teams that win won’t just be the ones who:
- build faster
But the ones who:
- validate better
- iterate faster with confidence
What we’re aiming for
With Skills, the goal is simple:
reduce the amount of manual work required to build reliable AI systems
So instead of:
- wiring pipelines
- writing eval scaffolding
- guessing what broke
You can:
- delegate
- review
- improve
Would love feedback
This is a new direction for us, and we’re still figuring out:
- What makes a “good” Skill?
- Where do Skills break down?
- What should be automated vs controlled?
If you’re working on LLM agents, I’d love to hear:
- how you’re handling evals today
- what’s still painful
- what you’ve tried that didn’t work
Try it out
If this resonates, you can check out what we’re building here:
Final thought
Your coding agent is already capable of doing much more than we typically ask of it.
Skills are just a way to unlock that.
The interesting question now is:
what else are we still doing manually that agents could handle better?
