You wouldn't ship code without tests. But most AI agents ship with nothing — a handful of manual prompts in a notebook, a screenshot of "it worked once," and a prayer that production inputs don't look too different from the test ones.
OAS 1.4 ships `oa test`, a test harness that runs eval cases against real models, asserts on output shape and content, and emits CI-friendly JSON. Your agents get tested like code, because they are code.
## What a test file looks like
Tests live alongside the spec. One YAML file per agent:
```yaml
# .agents/summariser.test.yaml
spec: ./summariser.yaml

cases:
  - name: summarises short documents
    task: summarise
    input:
      document: "The sky is blue. The grass is green. Water is wet."
    expect:
      output.summary: { type: string, min_length: 10 }

  - name: handles empty facts gracefully
    task: summarise
    input:
      document: ""
    expect:
      output.summary: { contains: "no content" }

  - name: smoke test only
    task: summarise
    input:
      document: "..."
    # no expect block — passes if the model returns anything valid
```
Three cases, one file. Each case targets a task in the spec, provides the input, and optionally asserts on the output.
## The assertion vocabulary
`oa test` supports a small, practical set of assertions, enough to catch real bugs without turning tests into a DSL.
| Assertion | Example | Checks |
|---|---|---|
| `contains` | `{ contains: "welcome" }` | Substring match (case-insensitive by default) |
| `equals` | `{ equals: "greeting" }` | Exact value equality |
| `type` | `{ type: array }` | Value type: string, number, boolean, object, array |
| `min_length` | `{ min_length: 1 }` | Minimum length for strings or arrays |
| `max_length` | `{ max_length: 500 }` | Upper bound for strings or arrays |
You can combine them:

```yaml
expect:
  output.items: { type: array, min_length: 1, max_length: 10 }
  output.items[0].id: { type: string }
  output.summary: { contains: "sky", case_sensitive: false }
```
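Combined assertions must all hold for the case to pass. As a sketch of the semantics — illustrative only, not the harness's actual implementation:

```python
# Illustrative semantics for the assertion table above.
# This mirrors the documented behaviour; it is not oa test's own code.
TYPE_MAP = {"string": str, "number": (int, float),
            "boolean": bool, "object": dict, "array": list}

def check(value, assertion):
    """Return True if `value` satisfies every key in `assertion`."""
    if "contains" in assertion:
        needle, haystack = assertion["contains"], str(value)
        if not assertion.get("case_sensitive", False):  # case-insensitive by default
            needle, haystack = needle.lower(), haystack.lower()
        if needle not in haystack:
            return False
    if "equals" in assertion and value != assertion["equals"]:
        return False
    if "type" in assertion and not isinstance(value, TYPE_MAP[assertion["type"]]):
        return False
    if "min_length" in assertion and len(value) < assertion["min_length"]:
        return False
    if "max_length" in assertion and len(value) > assertion["max_length"]:
        return False
    return True
```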
Paths support dotted access and array indexing (`output.items[0].id`). The parser is deliberately simple; if you need richer assertions, drop to a post-processing step in CI rather than extending the harness.
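The path syntax is small enough that a resolver fits in a few lines. A sketch — again illustrative, not the harness's own parser:

```python
import re

def resolve(path, payload):
    """Walk a dotted path with [N] indexing, e.g. "output.items[0].id"."""
    value = payload
    # Tokens are either bare keys or bracketed integer indices.
    for token in re.findall(r"[^.\[\]]+|\[\d+\]", path):
        if token.startswith("["):
            value = value[int(token[1:-1])]  # array index
        else:
            value = value[token]             # object key
    return value
```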
## Running the tests
From the terminal:
```shell
oa test .agents/summariser.test.yaml
```
You get human-readable output — green ticks, red crosses, which case failed and why.
For CI, flip to JSON mode:
```shell
oa test .agents/summariser.test.yaml --quiet
```

```json
{
  "spec": ".agents/summariser.yaml",
  "total": 3,
  "passed": 2,
  "failed": 1,
  "cases": [
    {
      "name": "summarises short documents",
      "passed": true,
      "duration_ms": 842
    },
    {
      "name": "handles empty facts gracefully",
      "passed": false,
      "reason": "output.summary: expected to contain 'no content', got 'The document is empty'",
      "duration_ms": 512
    },
    {
      "name": "smoke test only",
      "passed": true,
      "duration_ms": 654
    }
  ]
}
```
Pipe this into whatever CI system you use. The exit code is non-zero on any failure, so `oa test` plays nicely with standard test-runner conventions.
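If you want policies beyond pass/fail — say, a minimum number of cases per spec — a small script over this JSON does it. A sketch, with field names taken from the output above; `MIN_CASES` and the wiring are assumptions, not `oa test` features:

```python
MIN_CASES = 2  # illustrative policy, not an oa test feature

def gate(report: dict) -> int:
    """Return a CI exit code for one parsed `oa test --quiet` report."""
    for case in report["cases"]:
        if not case["passed"]:
            print(f"FAIL {case['name']}: {case.get('reason', 'no reason given')}")
    if report["total"] < MIN_CASES:
        print(f"warning: {report['spec']} has only {report['total']} case(s)")
    return 1 if report["failed"] > 0 else 0

# In CI, pipe the JSON in and exit with the returned code, e.g.
#   oa test .agents/summariser.test.yaml --quiet | python gate.py
# where gate.py ends with: sys.exit(gate(json.load(sys.stdin)))
```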
## Testing in CI
Drop it into a GitHub Actions workflow:
```yaml
# .github/workflows/test-agents.yml
name: Test agents
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pipx install open-agent-spec
      - name: Run agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          for test in .agents/*.test.yaml; do
            oa test "$test" --quiet
          done
```
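One note on the loop: GitHub Actions runs `run:` steps with bash in errexit mode by default, so the job stops at the first failing file. If you'd rather run every suite and fail once at the end, collect the status — a sketch of the same step:

```yaml
      - name: Run agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          status=0
          for test in .agents/*.test.yaml; do
            oa test "$test" --quiet || status=1
          done
          exit "$status"
```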
Agents now have the same test discipline as the rest of your codebase. Break a prompt? The test case catches it before merge. Swap models? Run the suite and see what drifted.
## What to actually test
Model outputs are non-deterministic, so your tests need to assert on shape and invariants, not exact strings.
**Do test:**

- **Output schema conformance**: required fields present, types correct
- **Structural invariants**: "the summary is always under 500 chars," "the category is always one of these enum values"
- **Refusal handling**: empty or adversarial inputs don't crash the pipeline
- **Tool interaction**: tool-using agents produce the expected tool calls for known inputs
- **Delegated spec integration**: a spec pulling `oa://prime-vector/summariser` still works after the registry updates
**Don't test:**

- **Exact phrasing** — "the response should be 'Hello, Alice!'" — brittle and wrong
- **Creative output quality** — that's a human eval problem, not a test-suite problem
- **Token counts or latency** — monitor these in production, don't gate PRs on them
Test invariants, not novelty. That's where agent tests earn their keep.
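Those structural invariants map directly onto the assertion vocabulary. A sketch — the task and field names here are hypothetical:

```yaml
cases:
  - name: summary stays under budget
    task: summarise
    input:
      document: "A long input document..."
    expect:
      output.summary: { type: string, max_length: 500 }
```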
## The bigger picture
Agents-as-code only works if the agents are actually code-like. That means:
- Version-controlled — ✅ YAML in your repo
- Reviewable — ✅ prompts and schemas in a PR diff
- Reusable — ✅ spec delegation and the OAS registry
- Testable — ✅ `oa test`
`oa test` was the last missing piece. With it, agents get the same discipline as any other component of your system: change them, test them, merge them, deploy them.
Define what your agents do. Let the runtime be someone else's problem.
## Getting started
```shell
pipx install open-agent-spec

# Add a test file next to your spec
cat > .agents/example.test.yaml <<'EOF'
spec: ./example.yaml
cases:
  - name: greets by name
    task: greet
    input: { name: "CI" }
    expect:
      output.response: { contains: "CI" }
EOF

# Run it
oa test .agents/example.test.yaml
```
One command. One YAML file. Your agents now have a test suite.
Also in this series:
- The Sidecar Agent: Add AI to Any Project Without a Framework
- Why Your AI Agent Sidecar Shouldn't Have SDK Dependencies
- We Published a Formal Spec for Our Agent Framework
- Composable Agent Specs: Spec Delegation and the OAS Registry
Open Agent Spec is MIT-licensed and maintained by Prime Vector. If you're running agents in CI, we'd love to hear what broke — issues welcome on GitHub.