Kamesh Sampath

Posted on Jul 3 • Originally published at blogs.kameshs.dev on Jun 22

From Intent to Evidence: Why AI Skills Need Tests

#ai #devtools #softwaretesting #agents

Moving beyond “it worked once” to build verifiable, trustworthy AI agents in real-world developer workflows.

Intent is the source. Evidence is the contract.

AI skills should not be trusted because they worked once. They should be trusted because they can reach the intended state repeatedly, with executable tests proving the outcome.

AI agents are easy to celebrate when they work once, but that is also the trap. The demo runs, the code looks right, the file appears in the right place, and everyone nods. Then you run it again. This time a file is missing, a parameter is read differently, or a command is skipped. The final result is close, but not correct. This is where AI-native development gets real: we do not only need agents that can generate; we need skills that can be trusted.

In my earlier posts on Infrastructure-as-Intent, Intent-Driven Development, Intent Compression Ratio, and token economics, I argued that developers are moving away from writing every step by hand and toward expressing outcomes. Not every command, not every click, and not every line of glue code — just the outcome. But there is a missing layer in that story. Intent needs evidence.

Without evidence, intent is only a better prompt.

“It Worked Once” Is Not Engineering

Every developer knows this. We do not trust code because it compiled once, we do not trust infrastructure because one deployment succeeded, and we do not trust an API because one request returned 200 OK. We trust systems when we can verify them again and again.

AI skills should be held to the same standard. Today, many AI workflows still feel like demo-driven development: we run an agent, inspect the output, and if it looks good, we move on. That is fine for exploration, but it is not enough for reusable skills.

A reusable skill needs a stronger question. Not: did the agent produce something useful? But rather: did the system reach the intended state, and can it do that again?

The Gap Between Output and Outcome

There is a gap between a plausible result and a proven result. I call this the Verification Gap. On one side is an answer from the agent that looks right; on the other is a system that is actually in the correct state. That gap matters because AI agents operate through probabilistic reasoning. The same instruction can lead to different paths, tools, or decisions depending on context and model behavior. Often those paths converge on the same outcome. Sometimes they do not.

The Verification Gap: The difference between an agent generating a plausible output and proving the system actually reached the correct state.

The mistake is not using AI; the mistake is assuming one successful run is proof. A better question is: how do we make non-deterministic behavior measurable enough for developers to work with? An answer that looks right is not the same as a system that is right.

CoCo as the Proving Ground

I encountered this problem firsthand while building skills for Snowflake Cortex Code, also known as CoCo. Because CoCo is an agentic coding assistant that combines reasoning, tool use, and code execution, it acts as a practical environment for testing AI skills in real-world workflows. I needed a way to prove that an agent’s answer wasn’t just plausible, but that the system was actually in the right state. That practical need led me to build inspect-coco.

A CoCo skill can package developer intent into a reusable capability. Instead of asking an agent to reason from scratch every time, we can give it a skill that captures a workflow, convention, tool pattern, or common engineering task. This is the shift I described in Infrastructure-as-Intent: we move from prescribing steps to expressing outcomes, turning a skill into a reusable unit of intent.

That is powerful, but it is also risky if we do not test it. A skill can hide many actions behind one instruction — it may create files, change configurations, call tools, or update a project. The user sees a simple request, and the skill performs the work. This is exactly what I mean by Intent Compression Ratio. A high-value skill compresses many steps into one intent, but high compression requires high confidence. If one intent represents ten steps, we need to know whether all ten steps completed correctly.
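To make the "high compression requires high confidence" point concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every hidden step succeeds independently with the same probability; real skills are rarely that tidy, and the function name is hypothetical.

```python
def all_steps_succeed(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds.

    Assumption (illustrative only): steps are independent and
    equally reliable.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Compare a single step with five and ten compressed steps.
for steps in (1, 5, 10):
    print(steps, round(all_steps_succeed(0.95, steps), 3))
```

Under that assumption, ten steps that are each 95% reliable all complete together only about 60% of the time, which is why one intent that hides ten steps needs a test that checks the end state.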

The same challenge exists whether you are building skills for CoCo, Claude Code, Codex, Gemini CLI, OpenHands, or any other agentic development environment. Once an agent can inspect repositories, modify files, execute commands, and automate parts of the software lifecycle, the question becomes the same: how do we know it did the right thing?

CoCo exposed the gap early. It became the first proving ground because it offered a practical environment where skills could interact with real projects, tools, and workflows. That made it easier to test a simple question: can intent be translated into repeatable, verifiable outcomes?

inspect-coco bridges the foundational evaluation capabilities of Inspect AI with the concrete execution environment of Snowflake CoCo.

While the pattern is not limited to CoCo, CoCo is the concrete runtime. Inspect AI is the evaluation foundation. inspect-coco connects the two.

What inspect-coco Does

inspect-coco is a developer-first test harness for the instructions and workflows that become AI skills. The flow is intentionally familiar:

It uses Inspect AI to run evaluations.
It uses CoCo to execute the skill.
It uses Docker to isolate the environment.
It uses a test script to verify the result.

The core evaluation loop: translating intent into execution, verification, and reliable measurement.

The key is the test. inspect-coco does not ask another model whether the answer looks good; it runs the skill and checks the system state. Did the file exist? Did the content match? Did the command succeed, and did the project end up in the expected shape?

If yes, the test passes; if not, it fails. That is a language developers already understand. If you prefer, the verification layer can also use familiar testing frameworks such as pytest, allowing teams to reuse existing assertions, fixtures, and testing practices instead of learning a new evaluation model. The important idea is not the framework itself, but that the outcome is verified by executable tests rather than judged by appearance.
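As a rough idea of what pytest-style verification could look like, the sketch below checks project shape and command success. The helper names, the file names, and the stand-in "skill" that writes the files are all hypothetical; in a real evaluation the agent would have produced the project before the assertions run.

```python
import subprocess
import sys
from pathlib import Path


def assert_project_shape(root: Path, expected_files: list[str]) -> None:
    """Fail if any expected file (relative to root) is missing."""
    missing = [name for name in expected_files if not (root / name).is_file()]
    assert not missing, f"missing files under {root}: {missing}"


def assert_command_succeeds(argv: list[str], cwd: Path) -> None:
    """Fail if the command exits with a non-zero status."""
    result = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
    assert result.returncode == 0, (
        f"{argv} exited with {result.returncode}: {result.stderr.strip()}"
    )


def test_generated_project(tmp_path: Path) -> None:
    # Stand-in for the skill (hypothetical): a real run would have the
    # agent create these files inside the sandbox.
    (tmp_path / "app.py").write_text("print('ok')\n", encoding="utf-8")
    (tmp_path / "README.md").write_text("# demo\n", encoding="utf-8")

    # Verify state, not appearance: the files exist and the program runs.
    assert_project_shape(tmp_path, ["app.py", "README.md"])
    assert_command_succeeds([sys.executable, "app.py"], cwd=tmp_path)
```

Nothing here asks whether the output looks plausible. Each assertion inspects the file system or a process exit code.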

Three Files

An evaluation can start with just three files. That is intentional; no heavy platform is required to understand the idea.

instruction.md (Markdown) describes the intent.
tests/test.sh (Shell script) verifies the outcome.
task.toml (TOML) tells the evaluation how to run.

A developer-first evaluation requires minimal overhead: intent, verification, and configuration.

The simplest example is the hello-world evaluation in the repository. The instruction asks the skill to create /workspace/hello.txt. The agent can choose its path, but the test objectively checks the result: does the file exist, and does the content match exactly? Exit 0 means pass; anything else means failure.
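The repository's check lives in a shell script; the sketch below restates the same idea in Python so the logic is easy to follow. The expected text and the function name are assumptions for illustration, not the contents of the actual hello-world evaluation.

```python
import sys
from pathlib import Path

# Assumed for illustration; the real evaluation defines its own expected text.
EXPECTED_TEXT = "Hello, World!\n"


def verify(path: Path, expected: str = EXPECTED_TEXT) -> int:
    """Return 0 if the file exists and matches exactly, otherwise 1."""
    if not path.is_file():
        print(f"FAIL: {path} does not exist", file=sys.stderr)
        return 1
    actual = path.read_text(encoding="utf-8")
    if actual != expected:
        print(f"FAIL: expected {expected!r}, got {actual!r}", file=sys.stderr)
        return 1
    print("PASS")
    return 0


# A wrapper script would turn the return value into the process exit code:
#   sys.exit(verify(Path("/workspace/hello.txt")))
```

The comparison is deliberately exact. A trailing space or a missing newline is a failure, which is what makes the result objective.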

That is boring in the best possible way.

Unit Tests for Skill Instructions

One design choice in inspect-coco is worth calling out: it does not require starting with a fully packaged skill and running the whole lifecycle end to end. It can start with an instruction. That instruction may later become part of a CoCo skill, a Claude Code command, a Codex workflow, or another agentic environment. But before it becomes reusable automation, the instruction itself should be tested.

A skill is not only code; it is packaged intent. If the intent is vague, the skill will be vague. If the instruction is hard to verify, the skill will be hard to trust. If the constraints are weak, the agent has too much room to improvise. Because of this, inspect-coco treats the instruction as a testable artifact, making it closer to unit testing for instructions than demo testing for agents.

The question is not just whether the complete agent workflow can succeed. The earlier question is whether the instruction is clear, constrained, and verifiable enough to become a reliable skill in the first place. Instruction-level tests tell us if the unit of intent is well-formed before packaging it, while end-to-end tests tell us if the whole workflow actually worked. Both are useful, but they answer fundamentally different questions.

Intent Needs Structure

The instruction follows a simple structure, where each section has a specific job:

Goal: says what we want.
Requirements: say what must be true.
Constraints: reduce unwanted choices.
Output: tells us what success looks like.

This structure matters because vague intent creates drift. If the instruction is loose, the agent has more room to guess. It may still produce something useful, but useful is not the same as correct. A good instruction narrows the space. It does not remove all non-determinism, but it makes the outcome easier to test.

To enforce this, inspect-coco also includes an IDD-style (Intent-Driven Development) check for instruction quality, catching weak instructions early. The framework does not only ask if the skill worked; it also asks if the intent was clear enough to test.
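A structural check of this kind can be very small. The sketch below is not inspect-coco's actual implementation; it assumes the four sections appear as Markdown headings named Goal, Requirements, Constraints, and Output, and the function name is hypothetical.

```python
import re

REQUIRED_SECTIONS = ("Goal", "Requirements", "Constraints", "Output")


def missing_sections(instruction_md: str) -> list[str]:
    """Return required sections that have no heading or no body text.

    Assumption (illustrative only): sections are marked with Markdown
    headings such as "## Goal".
    """
    # Splitting on a pattern with one capture group yields
    # [preamble, heading1, body1, heading2, body2, ...].
    parts = re.split(
        r"^#{1,6}[ \t]+(.+?)[ \t]*$", instruction_md, flags=re.MULTILINE
    )
    bodies = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        bodies[heading.strip().lower()] = body.strip()
    # A section counts as missing when absent or when its body is empty.
    return [name for name in REQUIRED_SECTIONS if not bodies.get(name.lower())]


example = """# Create a greeting file

## Goal
Create a greeting file in the workspace.

## Requirements
- The file exists.

## Constraints

## Output
The verification script exits with 0.
"""
print(missing_sections(example))
```

For this example the check reports the Constraints section, because the heading is present but nothing is written under it. A real quality check would likely go further than structure, for instance flagging requirements that cannot be verified.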

Repeated Runs Matter

One passing run is useful; it tells us the skill can work. But it does not tell us whether the skill is reliable. Because AI systems vary between runs, inspect-coco can execute the same task multiple times, measuring results across epochs and reporting aggregate metrics such as pass@k.

Measuring reliability: Because AI systems are non-deterministic, true confidence comes from repeated execution and measuring stability over time.

A skill that passes once out of three is not the same as a skill that passes ten out of ten. Both may look good in a demo, but only one is ready to trust. The goal is not to pretend AI is deterministic; rather, it is to measure how stable the outcome is. This is where evaluation becomes more than a checklist — it becomes a feedback loop.
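The difference between those two skills is easy to quantify. The sketch below computes a plain pass rate and a commonly used combinatorial estimate of pass@k (the chance that at least one of k sampled runs passes, given c passes in n runs). How inspect-coco computes its reported numbers may differ; the function names are hypothetical.

```python
from math import comb


def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled runs passes.

    n: total runs, c: passing runs, k: sample size (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k runs failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


flaky = [True, False, False]  # passes once out of three
solid = [True] * 10           # passes ten out of ten

print(round(pass_rate(flaky), 2), pass_at_k(3, 1, 3))
print(pass_rate(solid), pass_at_k(10, 10, 1))
```

For the one-in-three skill, pass@3 comes out at 1.0 even though the pass rate is about 0.33. pass@k answers "can it work at least once in k tries", while the pass rate answers "how often does it work", so it helps to look at both before trusting a skill.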

Why Not Just Use LLM-as-Judge?

LLM-as-judge has value when the output is subjective. A summary may need to be judged for clarity, a response checked for tone, or a support answer reviewed for usefulness.

But many developer tasks are not subjective. A file exists or it does not. A test passes or it does not. A command exits cleanly or it does not. A generated project runs or it does not. For those cases, we should not ask for an opinion; we should run a test.

That is the design choice behind inspect-coco: when the outcome can be verified by the system, the system should verify it.

Why Inspect AI?

I built inspect-coco on Inspect AI because agent evaluation needs more than prompt scoring. Developer skills often act in a real environment: they create files, modify projects, run commands, and use tools. So, the evaluation framework needs to support real execution.

Inspect AI provides that foundation, giving us tasks, solvers, scorers, sandboxes, repeated runs, and logs. inspect-coco adds a CoCo-specific layer on top:

CoCo execution
CoCo skill scaffolding
IDD instruction checks
Docker-based environments
Snowflake authentication via OAuth, personal access tokens, and JWT
Script-based verification

That is the important split. Inspect AI is the general evaluation foundation, CoCo is the concrete runtime where I am proving the pattern, and inspect-coco connects the two for developers who want to build reliable CoCo skills today.

Developer-First by Design

I wanted this to feel like normal development. Not like a research project, not like a dashboard-first platform, and not like another abstract AI evaluation layer. A developer should be able to look at the files and understand what is happening:

Markdown for intent
Shell for verification
TOML for configuration
Docker for isolation
CLI for execution
Pass or fail for confidence

That is the whole point. AI changes how we build, but it should not remove the habits that made software engineering work.

Evaluation Should Be Close to the Skill

inspect-coco can scaffold evaluations from an existing CoCo plugin, meaning a skill can grow a test right beside it. This is important because testing should not feel like a separate ceremony. If writing evaluations is too far away from writing skills, developers will skip them.

The local evaluation loop: Bringing testing closer to the skill development process prevents evaluation from becoming a skipped ceremony.

The closer the test is to the skill, the more likely it becomes part of the workflow. That is how software teams learned to treat tests, and AI skill development should learn the same lesson.

From Prompt Engineering to Intent Engineering

Prompt engineering taught us how to talk to models. Intent engineering asks a harder question: can we express the outcome clearly enough for a system to act on it? But there is an even more important question after that: can we prove the system reached that outcome?

This is where AI-native development has to mature. The future will not belong to teams that generate the most code; it will belong to teams that can express intent clearly, compress complexity safely, and verify outcomes continuously. CoCo skills give us a way to package intent, and inspect-coco gives us a way to test it. That is the bridge from intent to evidence.

In Intent-Driven Development, the question is not only whether the agent produced something. The better question is whether the system reached the intended state, and whether it can do it again. That is what we should measure, and that is what we should improve. That is what turns AI skills from impressive demos into reliable developer tools.

Intent is the source. Evidence is the contract. Every serious AI skill needs tests.

Demo GIF in the article showing inspect-coco in action.

About the Author

Kamesh Sampath is a Lead Developer Advocate at Snowflake, author, and long-time open-source contributor with 25+ years in enterprise software. He works across data engineering and AI with developer communities, helping practitioners turn modern data platforms into systems that hold up in production.

Through talks, writing, and hands-on demos, Kamesh makes cloud, data, and AI topics easier to understand and apply — grounded in real-world constraints. His sessions mix deep technical detail with practical patterns that developers and data teams can apply right away.

Lately, he’s been speaking about Apache NiFi (Snowflake Openflow), AI (Snowflake Cortex), and PostgreSQL.

He believes technology becomes powerful when it is shared, taught, and built together.

GitHub | LinkedIn | Blog | X
