You wouldn't ship code without tests. But most AI agents ship with nothing — a handful of manual prompts in a notebook, a screenshot of "it worked once," and a prayer that production inputs don't look too different from the test ones.
OAS 1.4 ships `oa test`, a test harness that runs eval cases against real models, asserts on output shape and content, and emits CI-friendly JSON. Your agents get tested like code, because they are code.
## What a test file looks like
Tests live alongside the spec. One YAML file per agent:
```yaml
# .agents/summariser.test.yaml
spec: ./summariser.yaml

cases:
  - name: summarises short documents
    task: summarise
    input:
      document: "The sky is blue. The grass is green. Water is wet."
    expect:
      output.summary: { type: string, min_length: 10 }

  - name: handles empty facts gracefully
    task: summarise
    input:
      document: ""
    expect:
      output.summary: { contains: "no content" }

  - name: smoke test only
    task: summarise
    input:
      document: "..."
    # no expect block — passes if the model returns anything valid
```
Three cases, one file. Each case targets a task in the spec, provides the input, and optionally asserts on the output.
## The assertion vocabulary
`oa test` supports a small, practical set of assertions, enough to catch real bugs without turning tests into a DSL.
| Assertion | Example | Checks |
|---|---|---|
| `contains` | `{ contains: "welcome" }` | Substring match (case-insensitive by default) |
| `equals` | `{ equals: "greeting" }` | Exact value equality |
| `type` | `{ type: array }` | Value type: string, number, boolean, object, array |
| `min_length` | `{ min_length: 1 }` | Minimum length for strings or arrays |
| `max_length` | `{ max_length: 500 }` | Upper bound for strings or arrays |
You can combine them:

```yaml
expect:
  output.items: { type: array, min_length: 1, max_length: 10 }
  output.items[0].id: { type: string }
  output.summary: { contains: "sky", case_sensitive: false }
```
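Combined assertions must all hold for the case to pass. As a sketch of the semantics — illustrative only, not the harness's actual implementation:

```python
# Illustrative semantics for the assertion table above.
# This mirrors the documented behaviour; it is not oa test's own code.
TYPE_MAP = {"string": str, "number": (int, float),
            "boolean": bool, "object": dict, "array": list}

def check(value, assertion):
    """Return True if `value` satisfies every key in `assertion`."""
    if "contains" in assertion:
        needle, haystack = assertion["contains"], str(value)
        if not assertion.get("case_sensitive", False):  # case-insensitive by default
            needle, haystack = needle.lower(), haystack.lower()
        if needle not in haystack:
            return False
    if "equals" in assertion and value != assertion["equals"]:
        return False
    if "type" in assertion and not isinstance(value, TYPE_MAP[assertion["type"]]):
        return False
    if "min_length" in assertion and len(value) < assertion["min_length"]:
        return False
    if "max_length" in assertion and len(value) > assertion["max_length"]:
        return False
    return True
```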
Paths support dotted access and array indexing (`output.items[0].id`). The parser is deliberately simple; if you need richer assertions, drop to a post-processing step in CI rather than extending the harness.
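The path syntax is small enough that a resolver fits in a few lines. A sketch — again illustrative, not the harness's own parser:

```python
import re

def resolve(path, payload):
    """Walk a dotted path with [N] indexing, e.g. "output.items[0].id"."""
    value = payload
    # Tokens are either bare keys or bracketed integer indices.
    for token in re.findall(r"[^.\[\]]+|\[\d+\]", path):
        if token.startswith("["):
            value = value[int(token[1:-1])]  # array index
        else:
            value = value[token]             # object key
    return value
```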
## Running the tests
From the terminal:
```shell
oa test .agents/summariser.test.yaml
```
You get human-readable output — green ticks, red crosses, which case failed and why.
For CI, flip to JSON mode:
```shell
oa test .agents/summariser.test.yaml --quiet
```

```json
{
  "spec": ".agents/summariser.yaml",
  "total": 3,
  "passed": 2,
  "failed": 1,
  "cases": [
    {
      "name": "summarises short documents",
      "passed": true,
      "duration_ms": 842
    },
    {
      "name": "handles empty facts gracefully",
      "passed": false,
      "reason": "output.summary: expected to contain 'no content', got 'The document is empty'",
      "duration_ms": 512
    },
    {
      "name": "smoke test only",
      "passed": true,
      "duration_ms": 654
    }
  ]
}
```
Pipe this into whatever CI system you use. The exit code is non-zero on any failure, so `oa test` plays nicely with standard test-runner conventions.
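If you want policies beyond pass/fail — say, a minimum number of cases per spec — a small script over this JSON does it. A sketch, with field names taken from the output above; `MIN_CASES` and the wiring are assumptions, not `oa test` features:

```python
MIN_CASES = 2  # illustrative policy, not an oa test feature

def gate(report: dict) -> int:
    """Return a CI exit code for one parsed `oa test --quiet` report."""
    for case in report["cases"]:
        if not case["passed"]:
            print(f"FAIL {case['name']}: {case.get('reason', 'no reason given')}")
    if report["total"] < MIN_CASES:
        print(f"warning: {report['spec']} has only {report['total']} case(s)")
    return 1 if report["failed"] > 0 else 0

# In CI, pipe the JSON in and exit with the returned code, e.g.
#   oa test .agents/summariser.test.yaml --quiet | python gate.py
# where gate.py ends with: sys.exit(gate(json.load(sys.stdin)))
```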
## Testing in CI
Drop it into a GitHub Actions workflow:
```yaml
# .github/workflows/test-agents.yml
name: Test agents
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pipx install open-agent-spec
      - name: Run agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          for test in .agents/*.test.yaml; do
            oa test "$test" --quiet
          done
```
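One note on the loop: GitHub Actions runs `run:` steps with bash in errexit mode by default, so the job stops at the first failing file. If you'd rather run every suite and fail once at the end, collect the status — a sketch of the same step:

```yaml
      - name: Run agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          status=0
          for test in .agents/*.test.yaml; do
            oa test "$test" --quiet || status=1
          done
          exit "$status"
```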
Agents now have the same test discipline as the rest of your codebase. Break a prompt? The test case catches it before merge. Swap models? Run the suite and see what drifted.
## What to actually test
Model outputs are non-deterministic, so your tests need to assert on shape and invariants, not exact strings.
**Do test:**

- **Output schema conformance**: required fields present, types correct
- **Structural invariants**: "the summary is always under 500 chars," "the category is always one of these enum values"
- **Refusal handling**: empty or adversarial inputs don't crash the pipeline
- **Tool interaction**: tool-using agents produce the expected tool calls for known inputs
- **Delegated spec integration**: a spec pulling `oa://prime-vector/summariser` still works after the registry updates
**Don't test:**

- **Exact phrasing** — "the response should be 'Hello, Alice!'" — brittle and wrong
- **Creative output quality** — that's a human eval problem, not a test-suite problem
- **Token counts or latency** — monitor these in production, don't gate PRs on them
Test invariants, not novelty. That's where agent tests earn their keep.
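Those structural invariants map directly onto the assertion vocabulary. A sketch — the task and field names here are hypothetical:

```yaml
cases:
  - name: summary stays under budget
    task: summarise
    input:
      document: "A long input document..."
    expect:
      output.summary: { type: string, max_length: 500 }
```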
## The bigger picture
Agents-as-code only works if the agents are actually code-like. That means:
- Version-controlled — ✅ YAML in your repo
- Reviewable — ✅ prompts and schemas in a PR diff
- Reusable — ✅ spec delegation and the OAS registry
- Testable — ✅ `oa test`
`oa test` was the last missing piece. With it, agents get the same discipline as any other component of your system: change them, test them, merge them, deploy them.
Define what your agents do. Let the runtime be someone else's problem.
## Getting started
```shell
pipx install open-agent-spec

# Add a test file next to your spec
cat > .agents/example.test.yaml <<'EOF'
spec: ./example.yaml
cases:
  - name: greets by name
    task: greet
    input: { name: "CI" }
    expect:
      output.response: { contains: "CI" }
EOF

# Run it
oa test .agents/example.test.yaml
```
One command. One YAML file. Your agents now have a test suite.
Also in this series:
- The Sidecar Agent: Add AI to Any Project Without a Framework
- Why Your AI Agent Sidecar Shouldn't Have SDK Dependencies
- We Published a Formal Spec for Our Agent Framework
- Composable Agent Specs: Spec Delegation and the OAS Registry
Open Agent Spec is MIT-licensed and maintained by Prime Vector. If you're running agents in CI, we'd love to hear what broke — issues welcome on GitHub.