Abhi Chatterjee

Posted on Apr 21 • Edited on Apr 30

Testing AI Systems in Production: From LLM Evals to Agent Reliability

#ai #llm #machinelearning #softwaretesting

Testing AI Systems in Production: From LLM Evals to Agent Reliability

Practical strategies to evaluate LLMs, RAG pipelines, and AI agents in real-world systems

Most AI systems don’t fail in development — they fail quietly in production.

Not with crashes, but with subtle errors: hallucinations, incorrect tool usage, or inconsistent outputs that slip past traditional tests.

The root problem is simple: we are still trying to test probabilistic systems using deterministic testing strategies.

This is Part 1 of a series on testing AI systems in production.
In this post, we’ll build a practical mental model and testing strategy.
In upcoming parts, I’ll go deeper into evaluation pipelines, RAG testing, and agent-level reliability.

Why Traditional Testing Breaks for AI

In traditional software, a given input maps to a predictable output.

That assumption breaks with AI systems.

Key differences:

Outputs are non-deterministic
Correctness is often subjective
Ground truth is hard to define
Behavior can shift with small prompt changes

This means unit tests alone are not enough. You need layered evaluation strategies.

The AI Testing Stack (A Practical Mental Model)

Think of AI testing as a stack rather than a single technique:

+--------------------------------------------------+
| Agent / Workflow Testing (multi-step reasoning)   |
+--------------------------------------------------+
| System Testing (RAG, tools, memory)              |
+--------------------------------------------------+
| Prompt Testing (instructions, few-shot behavior) |
+--------------------------------------------------+
| Model Evaluation (benchmarks, accuracy)          |
+--------------------------------------------------+

Each layer introduces different failure modes — and requires different testing approaches.

1. Model-Level Evaluation

This is the foundation: evaluating raw model capability.

Typical techniques:

Benchmark datasets (task-specific)
Accuracy, precision/recall (structured outputs)
BLEU / ROUGE (for text similarity)

But strong benchmark performance does not guarantee real-world reliability.

Example:
A model performing well on QA benchmarks may still hallucinate on domain-specific queries.

Takeaway: Model evals are necessary, but insufficient.

2. Prompt-Level Testing

Prompts are effectively your “programming layer” — and they are fragile.

What to test:

Consistency across paraphrased inputs
Sensitivity to prompt changes
Instruction adherence
Edge cases and adversarial phrasing

Example test case:

Input: "Summarize this document in 3 bullet points"
Variation: "Give me a short summary in bullets"
Expected: Similar structure and quality

Small wording changes shouldn’t break behavior — but often do.

Approach:

Maintain a golden dataset
Run regression tests when prompts change

3. System-Level Testing (RAG, Tools, Pipelines)

Once you introduce retrieval or external tools, complexity increases.

Typical components:

Retrieval (vector DB / search)
Context construction
Tool/API calls
Output formatting

Common failure modes:

Irrelevant retrieval results
Missing critical context
Incorrect tool selection
Hallucinated answers despite available data

Example RAG flow:

User Query
    ↓
Retriever → Context
    ↓
LLM → Response

What to evaluate:

Context relevance — Did we fetch the right data?
Faithfulness — Did the model use the context?
Answer correctness

4. Agent-Level Testing (Where Things Get Hard)

Agents introduce multi-step reasoning, planning, and state.

Example loop:

User Goal
   ↓
Plan → Tool Call → Observe → Repeat
   ↓
Final Answer

Common failures:

Infinite loops
Wrong tool usage
Partial task completion
Confident but incorrect outputs

How to test agents:

1. Scenario-based testing

Define end-to-end tasks
Measure success rate and correctness

2. Simulation environments

Mock tools and external dependencies

3. Trace inspection

Log actions, inputs, outputs
Analyze decision paths

This is essential for debugging complex failures.

Core Testing Techniques That Work

1. Golden Datasets

Curate:

Real user queries
Edge cases
Known failure scenarios

This becomes your most valuable testing asset.

2. LLM-as-a-Judge

Use a model to evaluate outputs.

Example:

"Is this answer correct and grounded in the context?"

Pros:

Scalable
Flexible

Cons:

Can be biased
Requires validation

3. Regression Testing

Every change should trigger evaluation:

Prompt updates
Model changes
Retrieval modifications

Track:

Accuracy
Hallucination rate
Task success

4. Red Teaming

Actively try to break the system:

Prompt injection
Jailbreak attempts
Malicious inputs

Critical for production readiness.

A Practical Testing Workflow

Define Metrics
     ↓
Build Eval Dataset
     ↓
Run Automated Evals
     ↓
Analyze Failures
     ↓
Fix (Prompt / System / Model)
     ↓
Repeat (CI/CD Integration)

In practice:

Version control your eval datasets
Automate evaluations in CI/CD
Track performance over time

Real-World Example: Support Chatbot

Scenario:

A chatbot answering queries from a knowledge base.

Issues:

Hallucinated responses
Ignoring retrieved context
Inconsistent tone

Solution:

Built dataset (~200 real queries)
Added evaluation metrics (correctness, grounding)
Introduced regression testing
Added adversarial test cases

Result:

Reduced hallucinations
Improved consistency
Faster iteration

Key Challenges (That Don’t Go Away)

Non-determinism
Expensive evaluations
Limited ground truth
Continuous model drift

The goal isn’t perfection — it’s controlled reliability.

What’s Next

In the next parts of this series, I’ll go deeper into:

Building automated evaluation pipelines
Testing RAG systems (metrics + pitfalls)
Agent evaluation and tracing strategies
Tooling and implementation patterns

Final Thoughts

AI testing is not a single technique — it’s a discipline.

The teams that succeed:

Test at multiple layers
Build strong evaluation datasets
Automate aggressively
Continuously learn from failures

Because in AI systems, what you don’t test is exactly where things break.

Top comments (3)

Armorer Labs • Jun 25

The layered testing stack framing is useful. The extra wrinkle I would add for agents is that the test artifact should not only be final output quality, but the state transitions that got there.

For production agent reliability I usually want each eval case to assert: allowed tools, selected tool, normalized args, retry/escalation path, verifier result, and what changed after execution. Otherwise two runs can both look like "passed" while one burned 5 retries, used the wrong tool first, or needed hidden human cleanup.

That has been the most useful lesson for us while building Armorer/Armorer Guard: evals become much more actionable when they are joined to run receipts, not just model responses.

Disclosure: I work on Armorer Labs.

Abhi Chatterjee • Jun 25

That's a great point, and I completely agree. Looking only at the final output can hide important execution issues that directly impact reliability and operational cost. Capturing state transitions, tool selection, retries, and execution traces provides much richer signals for evaluating agent behavior. Thanks for sharing your experience from building Armorer Guard—it's a valuable perspective.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.