<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abhi Chatterjee</title>
    <description>The latest articles on DEV Community by Abhi Chatterjee (@abhi_chatterjee_979801).</description>
    <link>https://dev.to/abhi_chatterjee_979801</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890932%2F829ef7da-8a3f-402c-8839-16d64b32d92e.jpg</url>
      <title>DEV Community: Abhi Chatterjee</title>
      <link>https://dev.to/abhi_chatterjee_979801</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abhi_chatterjee_979801"/>
    <language>en</language>
    <item>
      <title>Testing AI Systems in Production: From LLM Evals to Agent Reliability</title>
      <dc:creator>Abhi Chatterjee</dc:creator>
      <pubDate>Tue, 21 Apr 2026 16:34:27 +0000</pubDate>
      <link>https://dev.to/abhi_chatterjee_979801/testing-ai-systems-in-production-from-llm-evals-to-agent-reliability-4do5</link>
      <guid>https://dev.to/abhi_chatterjee_979801/testing-ai-systems-in-production-from-llm-evals-to-agent-reliability-4do5</guid>
      <description>&lt;h1&gt;
  
  
  Testing AI Systems in Production: From LLM Evals to Agent Reliability
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Practical strategies to evaluate LLMs, RAG pipelines, and AI agents in real-world systems&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Most AI systems don’t fail in development — they fail quietly in production.&lt;/p&gt;

&lt;p&gt;Not with crashes, but with subtle errors: hallucinations, incorrect tool usage, or inconsistent outputs that slip past traditional tests.&lt;/p&gt;

&lt;p&gt;The root problem is simple: we are still trying to test probabilistic systems using deterministic testing strategies.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is Part 1 of a series on testing AI systems in production.&lt;/strong&gt;&lt;br&gt;
In this post, we’ll build a practical mental model and testing strategy.&lt;br&gt;
In upcoming parts, I’ll go deeper into evaluation pipelines, RAG testing, and agent-level reliability.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Traditional Testing Breaks for AI
&lt;/h2&gt;

&lt;p&gt;In traditional software, a given input maps to a predictable output.&lt;/p&gt;

&lt;p&gt;That assumption breaks with AI systems.&lt;/p&gt;

&lt;p&gt;Key differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outputs are &lt;strong&gt;non-deterministic&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Correctness is often &lt;strong&gt;subjective&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Ground truth is &lt;strong&gt;hard to define&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Behavior can shift with &lt;strong&gt;small prompt changes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means unit tests alone are not enough. You need layered evaluation strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The AI Testing Stack (A Practical Mental Model)
&lt;/h2&gt;

&lt;p&gt;Think of AI testing as a stack rather than a single technique:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------+
| Agent / Workflow Testing (multi-step reasoning)   |
+--------------------------------------------------+
| System Testing (RAG, tools, memory)              |
+--------------------------------------------------+
| Prompt Testing (instructions, few-shot behavior) |
+--------------------------------------------------+
| Model Evaluation (benchmarks, accuracy)          |
+--------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer introduces different failure modes — and requires different testing approaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Model-Level Evaluation
&lt;/h2&gt;

&lt;p&gt;This is the foundation: evaluating raw model capability.&lt;/p&gt;

&lt;p&gt;Typical techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark datasets (task-specific)&lt;/li&gt;
&lt;li&gt;Accuracy, precision/recall (structured outputs)&lt;/li&gt;
&lt;li&gt;BLEU / ROUGE (for text similarity)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But strong benchmark performance does &lt;strong&gt;not&lt;/strong&gt; guarantee real-world reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
A model performing well on QA benchmarks may still hallucinate on domain-specific queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; Model evals are necessary, but insufficient.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Prompt-Level Testing
&lt;/h2&gt;

&lt;p&gt;Prompts are effectively your “programming layer” — and they are fragile.&lt;/p&gt;

&lt;p&gt;What to test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistency across paraphrased inputs&lt;/li&gt;
&lt;li&gt;Sensitivity to prompt changes&lt;/li&gt;
&lt;li&gt;Instruction adherence&lt;/li&gt;
&lt;li&gt;Edge cases and adversarial phrasing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example test case:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "Summarize this document in 3 bullet points"
Variation: "Give me a short summary in bullets"
Expected: Similar structure and quality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Small wording changes shouldn’t break behavior — but often do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintain a &lt;strong&gt;golden dataset&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run regression tests when prompts change&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. System-Level Testing (RAG, Tools, Pipelines)
&lt;/h2&gt;

&lt;p&gt;Once you introduce retrieval or external tools, complexity increases.&lt;/p&gt;

&lt;p&gt;Typical components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval (vector DB / search)&lt;/li&gt;
&lt;li&gt;Context construction&lt;/li&gt;
&lt;li&gt;Tool/API calls&lt;/li&gt;
&lt;li&gt;Output formatting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common failure modes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Irrelevant retrieval results&lt;/li&gt;
&lt;li&gt;Missing critical context&lt;/li&gt;
&lt;li&gt;Incorrect tool selection&lt;/li&gt;
&lt;li&gt;Hallucinated answers despite available data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example RAG flow:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    ↓
Retriever → Context
    ↓
LLM → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What to evaluate:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context relevance&lt;/strong&gt; — Did we fetch the right data?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness&lt;/strong&gt; — Did the model use the context?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Answer correctness&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Agent-Level Testing (Where Things Get Hard)
&lt;/h2&gt;

&lt;p&gt;Agents introduce multi-step reasoning, planning, and state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example loop:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Goal
   ↓
Plan → Tool Call → Observe → Repeat
   ↓
Final Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Common failures:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Infinite loops&lt;/li&gt;
&lt;li&gt;Wrong tool usage&lt;/li&gt;
&lt;li&gt;Partial task completion&lt;/li&gt;
&lt;li&gt;Confident but incorrect outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to test agents:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Scenario-based testing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define end-to-end tasks&lt;/li&gt;
&lt;li&gt;Measure success rate and correctness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Simulation environments&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mock tools and external dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Trace inspection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log actions, inputs, outputs&lt;/li&gt;
&lt;li&gt;Analyze decision paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is essential for debugging complex failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Testing Techniques That Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Golden Datasets
&lt;/h3&gt;

&lt;p&gt;Curate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real user queries&lt;/li&gt;
&lt;li&gt;Edge cases&lt;/li&gt;
&lt;li&gt;Known failure scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes your most valuable testing asset.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. LLM-as-a-Judge
&lt;/h3&gt;

&lt;p&gt;Use a model to evaluate outputs.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Is this answer correct and grounded in the context?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalable&lt;/li&gt;
&lt;li&gt;Flexible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be biased&lt;/li&gt;
&lt;li&gt;Requires validation&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Regression Testing
&lt;/h3&gt;

&lt;p&gt;Every change should trigger evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt updates&lt;/li&gt;
&lt;li&gt;Model changes&lt;/li&gt;
&lt;li&gt;Retrieval modifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accuracy&lt;/li&gt;
&lt;li&gt;Hallucination rate&lt;/li&gt;
&lt;li&gt;Task success&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. Red Teaming
&lt;/h3&gt;

&lt;p&gt;Actively try to break the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt injection&lt;/li&gt;
&lt;li&gt;Jailbreak attempts&lt;/li&gt;
&lt;li&gt;Malicious inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Critical for production readiness.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Practical Testing Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Define Metrics
     ↓
Build Eval Dataset
     ↓
Run Automated Evals
     ↓
Analyze Failures
     ↓
Fix (Prompt / System / Model)
     ↓
Repeat (CI/CD Integration)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  In practice:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Version control your eval datasets&lt;/li&gt;
&lt;li&gt;Automate evaluations in CI/CD&lt;/li&gt;
&lt;li&gt;Track performance over time&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World Example: Support Chatbot
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario:
&lt;/h3&gt;

&lt;p&gt;A chatbot answering queries from a knowledge base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinated responses&lt;/li&gt;
&lt;li&gt;Ignoring retrieved context&lt;/li&gt;
&lt;li&gt;Inconsistent tone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built dataset (~200 real queries)&lt;/li&gt;
&lt;li&gt;Added evaluation metrics (correctness, grounding)&lt;/li&gt;
&lt;li&gt;Introduced regression testing&lt;/li&gt;
&lt;li&gt;Added adversarial test cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced hallucinations&lt;/li&gt;
&lt;li&gt;Improved consistency&lt;/li&gt;
&lt;li&gt;Faster iteration&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Key Challenges (That Don’t Go Away)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-determinism&lt;/li&gt;
&lt;li&gt;Expensive evaluations&lt;/li&gt;
&lt;li&gt;Limited ground truth&lt;/li&gt;
&lt;li&gt;Continuous model drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn’t perfection — it’s &lt;strong&gt;controlled reliability&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What’s Next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the next parts of this series, I’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building automated evaluation pipelines&lt;/li&gt;
&lt;li&gt;Testing RAG systems (metrics + pitfalls)&lt;/li&gt;
&lt;li&gt;Agent evaluation and tracing strategies&lt;/li&gt;
&lt;li&gt;Tooling and implementation patterns&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI testing is not a single technique — it’s a discipline.&lt;/p&gt;

&lt;p&gt;The teams that succeed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test at multiple layers&lt;/li&gt;
&lt;li&gt;Build strong evaluation datasets&lt;/li&gt;
&lt;li&gt;Automate aggressively&lt;/li&gt;
&lt;li&gt;Continuously learn from failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in AI systems, what you don’t test is exactly where things break.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>softwaretesting</category>
    </item>
  </channel>
</rss>
