Testing AI Agents Like Code: the `oa test` Harness

Scotty G

You wouldn't ship code without tests. But most AI agents ship with nothing — a handful of manual prompts in a notebook, a screenshot of "it worked once," and a prayer that production inputs don't look too different from the test ones.

OAS 1.4 ships `oa test`, a test harness that runs eval cases against real models, asserts on output shape and content, and emits CI-friendly JSON. Your agents get tested like code, because they are code.

What a test file looks like

Tests live alongside the spec. One YAML file per agent:

# .agents/summariser.test.yaml
spec: ./summariser.yaml

cases:
  - name: summarises short documents
    task: summarise
    input:
      document: "The sky is blue. The grass is green. Water is wet."
    expect:
      output.summary: { type: string, min_length: 10 }

  - name: handles empty facts gracefully
    task: summarise
    input:
      document: ""
    expect:
      output.summary: { contains: "no content" }

  - name: smoke test only
    task: summarise
    input:
      document: "..."
    # no expect block — passes if the model returns anything valid

Three cases, one file. Each case targets a task in the spec, provides the input, and optionally asserts on the output.

The assertion vocabulary

oa test supports a small, practical set of assertions: enough to catch real bugs without turning tests into a DSL.

| Assertion | Example | Checks |
| --- | --- | --- |
| `contains` | `{ contains: "welcome" }` | Substring match (case-insensitive by default) |
| `equals` | `{ equals: "greeting" }` | Exact value equality |
| `type` | `{ type: array }` | Value type: `string`, `number`, `boolean`, `object`, `array` |
| `min_length` | `{ min_length: 1 }` | Length lower bound for strings or arrays |
| `max_length` | `{ max_length: 500 }` | Length upper bound for strings or arrays |

You can combine them:

expect:
  output.items: { type: array, min_length: 1, max_length: 10 }
  output.items[0].id: { type: string }
  output.summary: { contains: "sky", case_sensitive: false }

Paths support dotted access and array indexing (`output.items[0].id`). The parser is deliberately simple; if you need richer assertions, drop to a post-processing step in CI rather than extending the harness.
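To make the semantics concrete, here is a minimal sketch of how path resolution and the assertion vocabulary could be evaluated. This is not the actual `oa test` implementation; the helper names (`resolve_path`, `check`) and edge-case behaviour are assumptions based on the table above.

```python
# Hypothetical sketch of assertion evaluation -- not oa test internals.
import re
from typing import Any

def resolve_path(data: Any, path: str) -> Any:
    """Walk a dotted path with optional [n] indexing, e.g. output.items[0].id."""
    for part in path.split("."):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        data = data[m.group(1)]
        if m.group(2) is not None:
            data = data[int(m.group(2))]
    return data

def check(value: Any, rules: dict) -> list[str]:
    """Return failure messages; an empty list means the assertion passed."""
    failures = []
    if "contains" in rules:
        # Case-insensitive by default, per the table above.
        haystack = value if rules.get("case_sensitive") else value.lower()
        needle = rules["contains"] if rules.get("case_sensitive") else rules["contains"].lower()
        if needle not in haystack:
            failures.append(f"expected to contain {rules['contains']!r}, got {value!r}")
    if "equals" in rules and value != rules["equals"]:
        failures.append(f"expected {rules['equals']!r}, got {value!r}")
    if "type" in rules:
        expected = {"string": str, "number": (int, float), "boolean": bool,
                    "object": dict, "array": list}[rules["type"]]
        if not isinstance(value, expected):
            failures.append(f"expected type {rules['type']}, got {type(value).__name__}")
    if "min_length" in rules and len(value) < rules["min_length"]:
        failures.append(f"length {len(value)} < min_length {rules['min_length']}")
    if "max_length" in rules and len(value) > rules["max_length"]:
        failures.append(f"length {len(value)} > max_length {rules['max_length']}")
    return failures

output = {"output": {"items": [{"id": "a1"}], "summary": "The sky is blue."}}
print(check(resolve_path(output, "output.items"), {"type": "array", "min_length": 1}))
print(check(resolve_path(output, "output.summary"), {"contains": "sky"}))
```

Combined assertions are just multiple rules checked against the same resolved value, which is why they compose freely in the `expect` block.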

Running the tests

From the terminal:

oa test .agents/summariser.test.yaml

You get human-readable output — green ticks, red crosses, which case failed and why.

For CI, flip to JSON mode:

oa test .agents/summariser.test.yaml --quiet
{
  "spec": ".agents/summariser.yaml",
  "total": 3,
  "passed": 2,
  "failed": 1,
  "cases": [
    {
      "name": "summarises short documents",
      "passed": true,
      "duration_ms": 842
    },
    {
      "name": "handles empty facts gracefully",
      "passed": false,
      "reason": "output.summary: expected to contain 'no content', got 'The document is empty'",
      "duration_ms": 512
    },
    {
      "name": "smoke test only",
      "passed": true,
      "duration_ms": 654
    }
  ]
}

Pipe this into whatever CI system you use. The exit code is non-zero on any failure, so oa test plays nicely with standard test-runner conventions.
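If your pipeline wants more than the exit code, the report is plain JSON and trivial to post-process. A small sketch, using the report shape shown above (the surrounding script and variable names are mine, not part of the tool):

```python
# Summarise an oa test JSON report; shape taken from the example above.
import json

report = json.loads("""
{
  "spec": ".agents/summariser.yaml",
  "total": 3,
  "passed": 2,
  "failed": 1,
  "cases": [
    {"name": "summarises short documents", "passed": true, "duration_ms": 842},
    {"name": "handles empty facts gracefully", "passed": false,
     "reason": "output.summary: expected to contain 'no content', got 'The document is empty'",
     "duration_ms": 512},
    {"name": "smoke test only", "passed": true, "duration_ms": 654}
  ]
}
""")

# Print one line per failing case, then a summary -- easy to grep in CI logs.
for case in report["cases"]:
    if not case["passed"]:
        print(f"FAIL {case['name']}: {case['reason']}")
print(f"{report['passed']}/{report['total']} passed for {report['spec']}")
```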

Testing in CI

Drop it into a GitHub Actions workflow:

# .github/workflows/test-agents.yml
name: Test agents

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pipx install open-agent-spec
      - name: Run agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          for test in .agents/*.test.yaml; do
            oa test "$test" --quiet
          done

Agents now have the same test discipline as the rest of your codebase. Break a prompt? The test case catches it before merge. Swap models? Run the suite and see what drifted.

What to actually test

Model outputs are non-deterministic, so your tests need to assert on shape and invariants, not exact strings.

Do test:

  • Output schema conformance: required fields present, types correct
  • Structural invariants: "the summary is always under 500 chars," "the category is always one of these enum values"
  • Refusal handling: empty or adversarial inputs don't crash the pipeline
  • Tool interaction: tool-using agents produce the expected tool calls for known inputs
  • Delegated spec integration: a spec pulling oa://prime-vector/summariser still works after the registry updates

Don't test:

  • Exact phrasing — "the response should be 'Hello, Alice!'" — brittle and wrong
  • Creative output quality — that's a human eval problem, not a test-suite problem
  • Token counts or latency — monitor these in production, don't gate PRs on them
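Using only the assertion vocabulary documented above, a shape-focused case looks like this (the `categorise` task and field names here are hypothetical, for illustration only):

```yaml
- name: categorisation stays within bounds
  task: categorise
  input:
    document: "Quarterly revenue grew 12% year over year."
  expect:
    output.category: { type: string }                  # shape, not exact wording
    output.summary: { type: string, max_length: 500 }  # structural invariant
```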

Test invariants, not novelty. That's where agent tests earn their keep.

The bigger picture

Agents-as-code only works if the agents are actually code-like. That means:

  • Version-controlled — ✅ YAML in your repo
  • Reviewable — ✅ prompts and schemas in a PR diff
  • Reusable — ✅ spec delegation and the OAS registry
  • Testable — ✅ oa test

oa test was the last piece missing. With it, agents get the same discipline as any other component of your system: change them, test them, merge them, deploy them.

Define what your agents do. Let the runtime be someone else's problem.

Getting started

pipx install open-agent-spec

# Add a test file next to your spec
cat > .agents/example.test.yaml <<'EOF'
spec: ./example.yaml
cases:
  - name: greets by name
    task: greet
    input: { name: "CI" }
    expect:
      output.response: { contains: "CI" }
EOF

# Run it
oa test .agents/example.test.yaml

One command. One YAML file. Your agents now have a test suite.

Open Agent Spec is MIT-licensed and maintained by Prime Vector. If you're running agents in CI, we'd love to hear what broke — issues welcome on GitHub.
