Manouk Draisma

Your coding agent already knows how to test your AI agent (we just turned it into a Skill)

We’re adding something new at LangWatch: Skills.

And the idea is pretty simple:

your coding agent already knows how to do a lot of the work you’re still doing manually

You just haven’t packaged it properly yet.

The frustrating part of building AI agents

If you’ve built an LLM agent recently, you probably recognize this loop:

  • you tweak something
  • you run a few test conversations
  • it seems better
  • you ship it
  • something breaks in production

Then you repeat.

We’ve been there too.

It’s not that you don’t know you need evals, testing, or simulations.

It’s that doing all of that properly is… a lot.

The real work isn’t building, it’s validating

When we started LangWatch, we thought the main challenge was:

getting agents to behave correctly

But in practice, the bigger challenge was:

proving that they behave correctly

That means:

  • setting up eval datasets
  • writing tests
  • simulating real user behavior
  • instrumenting pipelines
  • understanding failures

And most of this ends up being:

  • manual
  • repetitive
  • easy to skip

The worst part: testing agents doesn’t look like testing code

Traditional testing breaks down with LLMs.

You can’t just say:

```python
assert output == expected
```

Because agents are non-deterministic. The same input can give different outputs, which makes rigid testing fragile ([LangWatch][1]).
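To make that concrete, here's a toy sketch (the `agent_reply` stub stands in for a real LLM call, and the fact-based check is just one illustrative alternative to exact matching):

```python
import random

def agent_reply(question: str) -> str:
    # Stand-in for a real LLM call: the same input yields different phrasings.
    templates = [
        "The capital of France is Paris.",
        "Paris is the capital of France.",
        "France's capital city is Paris.",
    ]
    return random.choice(templates)

def rigid_check(answer: str) -> bool:
    # Exact-match assertion: breaks whenever the phrasing varies.
    return answer == "The capital of France is Paris."

def criterion_check(answer: str) -> bool:
    # Assert on the facts the answer must contain, not the exact string.
    text = answer.lower()
    return "paris" in text and "france" in text

answers = [agent_reply("What is the capital of France?") for _ in range(10)]
print(all(rigid_check(a) for a in answers))      # usually False
print(all(criterion_check(a) for a in answers))  # True: every phrasing passes
```

All three phrasings are correct answers, yet the exact-match assert rejects two of them. That's the fragility in miniature.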

So what do people do instead?

They “vibe test”.

  • try a few examples
  • eyeball the results
  • hope nothing breaks

It doesn’t scale.


We already solved part of this (with agents testing agents)

If you’ve seen our earlier work (Scenario), you know we took a different approach:

use an agent to test your agent

Instead of fixed inputs/outputs, you:

  • simulate real user behavior
  • define success criteria
  • let an agent explore and evaluate

This makes testing much closer to reality.
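The loop above can be sketched in a few lines. Note this is an illustration of the pattern, not Scenario's actual API; in a real setup, the simulated user and the judge would each be backed by an LLM rather than the scripted stubs below:

```python
from dataclasses import dataclass, field

@dataclass
class SimulationResult:
    transcript: list = field(default_factory=list)
    success: bool = False

def simulated_user(turn: int) -> str:
    # Stand-in: an LLM would generate realistic user messages from a persona.
    script = ["Hi, I want to cancel my order.", "Order #1234.", "Thanks!"]
    return script[turn]

def agent_under_test(message: str) -> str:
    # Stand-in for the agent you're actually shipping.
    if "#1234" in message:
        return "Order #1234 has been cancelled."
    return "Sure, what's your order number?"

def judge(transcript: list) -> bool:
    # Stand-in: an LLM judge would evaluate the success criteria here,
    # e.g. "the agent cancelled the correct order and confirmed it".
    return any("cancelled" in reply for _, reply in transcript)

def run_simulation(max_turns: int = 3) -> SimulationResult:
    result = SimulationResult()
    for turn in range(max_turns):
        user_msg = simulated_user(turn)
        reply = agent_under_test(user_msg)
        result.transcript.append((user_msg, reply))
    result.success = judge(result.transcript)
    return result

print(run_simulation().success)  # True for this scripted conversation
```

The key design point: success is judged over the whole transcript against criteria, not asserted turn-by-turn against fixed strings.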

But even then…

You still had to set everything up yourself.


So we asked: why are we still doing this manually?

At this point, most developers already have a coding agent open all day.

And those agents are actually pretty good at:

  • writing tests
  • structuring code
  • following instructions

So we started asking:

what if we let the coding agent handle the “quality work” too?

Not just writing features.

But:

  • setting up evals
  • creating simulations
  • instrumenting systems
  • analyzing behavior

That’s where Skills come in

We built LangWatch Skills as a way to give your coding agent reusable capabilities.

A Skill is basically:

a structured way to get your coding agent to do something correctly, every time

Not just:

“generate some code”

But:

“do this properly, following best practices, with full coverage”


What a Skill actually looks like

Under the hood, Skills are closer to:

  • structured instructions
  • workflows
  • examples
  • best practices

In general, agent skills are “instruction modules” that extend what an agent can do without retraining it ([philschmid.de][2]).

They tell the agent:

  • when to apply something
  • how to do it
  • what good looks like
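Concretely, a skill is usually a small instruction file the coding agent loads on demand. Here's an illustrative example in the common SKILL.md style (frontmatter plus instructions); the contents are made up for illustration, not LangWatch's exact format:

```markdown
---
name: simulation-tests
description: Create simulation-based tests for a conversational agent.
---

# When to use this skill
Use when the user asks to test an LLM agent's behavior end to end.

# How to do it
1. Identify the agent's entry point and the tools it can call.
2. Write user personas and explicit success criteria per scenario.
3. Generate one simulation script per scenario; let a judge evaluate the transcript.

# What good looks like
- Every scenario has checkable success criteria, not vibes.
- A failure produces a readable transcript, not just a boolean.
```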

What you can do with LangWatch Skills

With Skills, you can tell your coding agent to:

  • instrument your agent
  • generate evaluation notebooks
  • create simulation-based tests
  • explore production performance
  • red-team your system

And instead of figuring out how to do it…

…the agent just does it.


The shift is subtle, but important

Before:

you write eval code, tests, and infrastructure

After:

you review and guide what your agent generates

You move from implementation to coordination.
And that’s actually where most of the value is.


This is part of a bigger shift: “harness engineering”

There’s a growing idea in the ecosystem:

the performance of your agent depends heavily on how you configure it

Not just the model.

But:

  • tools
  • context
  • memory
  • skills

These are all part of what some people call the agent “harness” — the system around the model that shapes its behavior ([humanlayer.dev][3]).

Skills are one of the most powerful (and underused) pieces of that.


But Skills aren’t magic

One important thing we’ve learned:

Skills don’t automatically fix everything.

In fact, a lot of skills:

  • don’t improve performance
  • or only help in specific contexts

Recent research shows many skills have limited impact unless they’re well-designed and properly evaluated ([arXiv][4]).

So the goal isn’t:

“add more skills”

It’s:

“add the right skills, and make them actually useful”


Why this matters now

We’re entering a phase where:

  • building agents is easy
  • making them reliable is not

The bottleneck has shifted.

And the teams that win won’t just be the ones who:

  • build faster

But the ones who:

  • validate better
  • iterate faster with confidence

What we’re aiming for

With Skills, the goal is simple:

reduce the amount of manual work required to build reliable AI systems

So instead of:

  • wiring pipelines
  • writing eval scaffolding
  • guessing what broke

You can:

  • delegate
  • review
  • improve

We'd love feedback

This is a new direction for us, and we’re still figuring out:

  • What makes a “good” Skill?
  • Where do Skills break down?
  • What should be automated vs controlled?

If you’re working on LLM agents, I’d love to hear:

  • how you’re handling evals today
  • what’s still painful
  • what you’ve tried that didn’t work

Try it out

If this resonates, you can check out what we’re building here:

👉 LangWatch Skills


Final thought

Your coding agent is already capable of doing much more than we typically ask of it.

Skills are just a way to unlock that.

The interesting question now is:

what else are we still doing manually that agents could handle better?
