Kang

Posted on Mar 31

Your AI Agent Has No Tests - Here's How to Fix That in 5 Minutes

#ai #testing #typescript #opensource

You test your UI. You test your API. You write integration tests, unit tests, E2E tests.

But your AI agent? It picks tools, handles failures, processes PII, makes autonomous decisions - and you're running it in production with zero tests.

That's wild. Let's fix it.

The Problem Nobody Talks About

AI agents are not just LLMs with a nice wrapper. They:

Call tools - and sometimes call the wrong one
Make decisions - routing, retries, fallbacks
Handle errors - or silently swallow them
Process sensitive data - PII, credentials, financial info

Existing testing tools don't cover this. Promptfoo tests prompts. DeepEval tests outputs. But nothing tests agent behavior - the decisions your agent makes between receiving a request and returning a response.

What happens when your tool times out? When the LLM hallucinates a function name? When two agents in a pipeline disagree? You don't know, because you've never tested it.

AgentProbe: Playwright for AI Agents

AgentProbe brings the same test-driven discipline you use for web apps to AI agents. Define tests in YAML. Run them in CI. Get deterministic results.

Here's what a test case looks like:

name: weather-tool-selection
description: Agent should pick the weather tool for forecast queries

steps:
  - send:
      message: "What's the weather in Tokyo tomorrow?"
    assert:
      - tool_called: get_weather
      - tool_args:
          location: "Tokyo"
      - response_contains: "forecast"
      - no_pii_leaked: true

That's it. No SDK to learn, no test framework to fight. Write YAML, run tests, ship with confidence.

What Makes AgentProbe Different

Chaos Testing - Inject tool failures, slow responses, malformed outputs. See how your agent handles the real world, not just the happy path.

chaos:
  - tool: get_weather
    failure: timeout
    after: 2 calls

Contract Testing - Verify that your agent's tool calls match the expected schema. Catch breaking changes before they hit production.

Multi-Agent Testing - Test pipelines where multiple agents collaborate. Assert on handoffs, message passing, and coordination failures.

Record & Replay - Record a live agent session, then replay it as a regression test. No mocking required.

Battle-Tested

AgentProbe isn't a weekend project. The framework runs 2,907 passing tests against itself. We test the testing framework - because we actually believe in testing.

Get Started in 5 Minutes

npm install @neuzhou/agentprobe

Create a test file agent.test.yaml:

name: basic-agent-test
agent:
  entrypoint: ./my-agent

tests:
  - name: tool-selection
    send: "Search for recent news about AI"
    assert:
      - tool_called: web_search
      - response_not_empty: true

  - name: error-handling
    send: "Search for news"
    chaos:
      - tool: web_search
        failure: error
    assert:
      - graceful_fallback: true
      - no_raw_error_in_response: true

Run it:

npx agentprobe run agent.test.yaml

Done. Your agent now has tests.

Why This Matters

Every month, another story drops about an AI agent going rogue in production - leaking data, calling wrong APIs, running up bills on infinite retry loops. The fix isn't better prompts. It's tests.

You wouldn't deploy a web app without tests. Stop deploying agents without them.

? GitHub: NeuZhou/agentprobe

MIT Licensed. PRs welcome.

Top comments (1)

Ali Muwwakkil • Mar 31

Most AI agents fail not due to a lack of testing but because tests aren't aligned with business goals. In our experience with enterprise teams, integrating AI behavioral testing with business metrics can transform your approach. Instead of just functional correctness, measure how the agent's actions impact key outcomes, like customer engagement or operational efficiency. This requires a shift but ensures AI delivers real value beyond the technical box-checking. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)