TL;DR:
Traditional testing doesn't work well for AI agents (LLMs are nondeterministic, and their outputs are brittle to assert on). I built a flight booking agent from scratch using Scenario, a framework we built for running agent simulations + LLM evaluations. Writing scenario tests first (domain-driven TDD) gave me a way to discover domain rules, evolve the model, and ship an agent with confidence.
Why testing AI agents is hard
Normal unit/integration tests fall apart with AI systems:
Same input, different outputs
Hard to assert strings reliably
You don’t just care about text, you care about business outcomes
Without testing, you’re basically flying blind.
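To make the "hard to assert" point concrete, here's the kind of test a traditional setup nudges you toward (a hypothetical snippet using vitest; `agent.chat` is a stand-in client, not code from this project):

```typescript
import { expect, test } from "vitest";

// Hypothetical client for the agent under test.
declare const agent: { chat(message: string): Promise<string> };

test("greets the user", async () => {
  const reply = await agent.chat("Hi, I want to book a flight");

  // Only this exact phrasing passes. "Hello! Happy to help you book a flight."
  // is just as correct, but the test goes red anyway.
  expect(reply).toBe("Hi! I can help you book a flight.");
});
```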
That’s why I tried a scenario-driven approach: define business capabilities first, then let the failures tell me what’s missing.
Step 1 – Write a scenario test
Start with a business journey, not code.
```typescript
const result = await scenario.run({
  setId: "booking-agent-demo",
  name: "Basic greeting test",
  description: "User wants to book a flight and expects polite interaction",
  maxTurns: 5,
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter,
    scenario.judgeAgent({
      criteria: [
        "The agent should greet politely",
        "The agent should understand the user wants to book a flight",
      ],
    }),
  ],
  script: [scenario.proceed(5)],
});
```
First run = red, as expected. That failure became my to-do list.
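For context, the `agentAdapter` in that snippet is the glue between Scenario's simulated conversation and your actual agent. Here's a minimal sketch of one that just forwards the latest user message to the agent's HTTP endpoint (built in Step 2 below). The adapter shape, `AgentRole`, `input.threadId`, and `input.messages` are from my reading of the library and may not match the current API exactly, so treat this as a sketch, not a reference:

```typescript
// Sketch of an agent adapter that proxies to the agent's own HTTP service.
// Field and type names below are assumptions; check Scenario's exported types.
import { AgentRole, type AgentAdapter } from "@langwatch/scenario";

export const agentAdapter: AgentAdapter = {
  role: AgentRole.AGENT,
  call: async (input) => {
    const last = input.messages.at(-1);
    const message =
      typeof last?.content === "string" ? last.content : JSON.stringify(last?.content);

    // Forward the turn to the /chat route from Step 2 and return its reply.
    const response = await fetch("http://localhost:3000/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ conversationId: input.threadId, message }),
    });

    const { reply } = (await response.json()) as { reply: string };
    return reply;
  },
};
```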
Step 2 – Let failures guide the build
No endpoint? → Build a basic HTTP route.
Agent replies static text? → Hook up an LLM.
LLM forgets context? → Add conversation memory.
Each test failure = missing piece of the domain.
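Here's roughly what those three pieces can look like wired together. This is a sketch under my own assumptions (Express, the official OpenAI SDK, a /chat route, an in-memory Map for history, and gpt-4o-mini as a placeholder model), not the post's actual implementation:

```typescript
import express from "express";
import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

const app = express();
app.use(express.json());

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Conversation memory: one message history per conversation id (in-memory for the demo).
const conversations = new Map<string, ChatCompletionMessageParam[]>();

// The HTTP route the scenario's agentAdapter calls.
app.post("/chat", async (req, res) => {
  const { conversationId, message } = req.body as {
    conversationId: string;
    message: string;
  };

  const history = conversations.get(conversationId) ?? [
    { role: "system", content: "You are a polite flight booking assistant." },
  ];
  history.push({ role: "user", content: message });

  // Hook up the LLM: send the whole history so it keeps context between turns.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder; any chat model works here
    messages: history,
  });

  const reply = completion.choices[0].message.content ?? "";
  history.push({ role: "assistant", content: reply });
  conversations.set(conversationId, history);

  res.json({ reply });
});

app.listen(3000);
```

The "memory" here is nothing fancy: resend the whole history on every turn so the model keeps context.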
Step 3 – Scale up to a full booking journey
Once greetings + memory worked, I moved to full booking:
```typescript
const result = await scenario.run({
  setId: "booking-agent-scenario-demo",
  name: "Complete flight booking",
  description: "User books NY → London with all required details",
  maxTurns: 100,
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter,
    scenario.judgeAgent({
      criteria: [
        "Collect passenger info",
        "Collect dates + airports",
        "Create booking in system",
        "Confirm booking to user",
      ],
    }),
  ],
  script: [scenario.proceed(100)],
});
```
Ran it → failed. Checked DB → no bookings (because I hadn’t built the tools). Wrote tools → ran again → bookings created, but airport codes didn’t match. Another hidden domain rule uncovered.
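For illustration, here's the shape of booking tool that kind of failure pushes you toward. The name `createBooking`, the zod schema, and the IATA-code rule are my reconstruction of what "airport codes didn't match" implies; none of it is code from the project.

```typescript
import { randomUUID } from "node:crypto";
import { z } from "zod";

// The hidden domain rule the scenario surfaced: airports are 3-letter IATA codes,
// not city names like "New York" or "London".
const bookingInput = z.object({
  passengerName: z.string().min(1),
  departureDate: z.string().regex(/^\d{4}-\d{2}-\d{2}$/, "Use YYYY-MM-DD"),
  origin: z.string().regex(/^[A-Z]{3}$/, "Origin must be a 3-letter IATA code"),
  destination: z.string().regex(/^[A-Z]{3}$/, "Destination must be a 3-letter IATA code"),
});

type BookingInput = z.infer<typeof bookingInput>;

// Hypothetical tool the LLM calls to create a booking. Validation runs before
// anything touches the database, so bad airport codes fail loudly in the scenario.
export async function createBooking(raw: unknown) {
  const booking: BookingInput = bookingInput.parse(raw);

  // A real implementation would insert into the bookings table here.
  return { id: randomUUID(), status: "confirmed" as const, ...booking };
}
```

How you expose this as a tool call is framework-specific; the point is that the rule came out of a failing scenario, not a spec document.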
Why this worked
Scenarios = living documentation of the domain
Failing tests = backlog of missing business rules
Confidence = I can change prompts, swap models, or go multi-agent without fear of silent regressions
Takeaways
Scenario = domain-driven TDD for AI agents
You don’t just test outputs, you validate business outcomes
Each failure teaches you something new about your domain
Scenarios double as specs + tests + onboarding docs
If you're curious: Scenario is open source and works with any LLM/agent framework.