Don't Let Your Jarvis Become Ultron: A Field Guide to Testing Agentic AI system

#ai #agents #agentaichallenge

Stage 1: Component tests. Write deterministic unit tests for each layer: test_research_agent.py, test_web_search_tool.py, test_user_profile_memory.py. Use mock data that your domain expert has signed off on. These run on every commit, cost nothing, and catch the obvious breakages before any LLM call gets billed. While you're here, stub the external APIs too (GA4, Shopify, Meta, OpenSearch). If a test goes red because Shopify was down, it isn't telling you anything about your agent.

Stage 2: The prompt repository. Sit with the domain expert and collect the sharpest prompts you can, the ones that force specific tools, functions, agents, and memory to fire. Tag each prompt with what it's supposed to exercise, and group prompts by business area so a change in one area only re-runs its own set. This is the most valuable thing you'll build, so treat it that way.

Two kinds of prompts people forget. First, the failure cases: out-of-scope questions, prompt injection, ambiguous input, empty or malformed tool responses, timeouts. In banking, a prompt that checks whether the agent correctly refuses to give financial advice is a real test, not an edge case. Second, the multi-turn cases. Memory bugs show up across a conversation, not in a single call. Does it carry context forward, drop it when it should, and never leak one user's profile into another user's session? That last one matters a lot under DPDP.

Stage 3: Coverage and trajectory. Run the whole repository and confirm every agent and tool actually fired. That's the coverage check. Then go one level deeper and look at the path the agent took to get there. A tool firing isn't the same as the right tool firing, with the right arguments, in the right order, without three pointless detours, and recovering when a tool returned an error. This trajectory check is the part most teams skip, and it's the part that's specific to agents rather than plain LLM apps.

Stage 4: Versioned runs and capture. Stamp every run with a version (gpt-5.5-upgrade-20260623) and store every response against it. Now regression is something you can point to instead of something you argue about. Two additions. Run each prompt several times rather than once, because the model is stochastic and a single scored run is closer to a coin flip than a test. Track the pass rate and the variance. And capture cost, tokens, latency, and tool-call count on every run, because the upgrade decision is a trade-off. "Four percent more accurate at three times the tokens and twice the latency" is a business call, and you can't make it without those numbers in front of you.

Stage 5: Ground truth store. Keep domain-expert-verified ground truths for each prompt and tool, versioned the same way (...-20250510). One thing to decide early: who is allowed to change a ground truth, and how that approval gets recorded. When the product changes for real, old ground truths go stale, and without an update process the suite slowly starts failing things that are actually correct.

Stage 6: The evaluator. Score each candidate run against the ground truth using Ragas plus an LLM judge, on precision, recall, completeness, correctness, and whatever else the business asks for. The catch is that your judge is also an LLM, with its own biases toward longer answers, toward whatever comes first, toward its own style. Keep a small set of human-labelled examples and check how often the judge agrees with the humans. If you don't, you get metrics that are wrong and confident at the same time.

Stage 7: Dashboard and human review. Surface the low-scoring cases and let a human confirm or correct both the ground truth and the new response. The same screen does double duty: the labels people produce here are what you use to calibrate the judge in Stage 6.

Stage 8: CI/CD Decide where this runs. Component tests on every pull request, the full evaluation suite nightly and before a release, and a gate that blocks the deploy when scores fall below a threshold. A suite that nothing in your pipeline calls won't get run, and won't get maintained.

DEV Community

Don't Let Your Jarvis Become Ultron: A Field Guide to Testing Agentic AI system

Top comments (0)