<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mr.Bong</title>
    <description>The latest articles on DEV Community by Mr.Bong (@jinbongjun).</description>
    <link>https://dev.to/jinbongjun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859775%2F465a616e-cf1e-4381-8b51-e71fd92405f1.png</url>
      <title>DEV Community: Mr.Bong</title>
      <link>https://dev.to/jinbongjun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jinbongjun"/>
    <language>en</language>
    <item>
      <title>'"Looks fine" isn't evidence: Why 1 spot check hides silent LLM regression'</title>
      <dc:creator>Mr.Bong</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:01:01 +0000</pubDate>
      <link>https://dev.to/jinbongjun/looks-fine-isnt-evidence-why-1-spot-check-hides-silent-llm-regression-4ni3</link>
      <guid>https://dev.to/jinbongjun/looks-fine-isnt-evidence-why-1-spot-check-hides-silent-llm-regression-4ni3</guid>
      <description>&lt;p&gt;LLMs don't produce a single output.&lt;br&gt;
So why do we test them like they do?&lt;/p&gt;

&lt;p&gt;If you are shipping AI features to production, chances are you've experienced this nightmare loop:&lt;br&gt;
You identify an edge case where an agent breaks. You patch the system prompt. You run a spot test in your dev environment. It looks fine. Dashboards are green. You click deploy. &lt;br&gt;
And two days later, behavior mysteriously changes in production.&lt;/p&gt;

&lt;p&gt;This drove our team crazy until we realized what we were doing wrong. We were testing what the output &lt;em&gt;looked like&lt;/em&gt; on a single run, instead of testing the &lt;strong&gt;behavior distribution&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;The Danger of Silent Behavioral Drift&lt;/h3&gt;

&lt;p&gt;When you change a system prompt, you expect the output text to change. But what you usually don't see is &lt;strong&gt;Behavioral Drift&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, your support bot might still eventually answer the user's question, but because of your new prompt instruction, it started calling your internal APIs in a different order, or it took 3 extra loop steps to get there.&lt;/p&gt;

&lt;p&gt;The final output is identical. But latency just spiked by 40%. The grounding is gone. The agent is now calling tools unpredictably. A standard "LLM-as-a-judge" usually won't catch this because the text output &lt;em&gt;looks&lt;/em&gt; correct.&lt;/p&gt;

&lt;h3&gt;The Fix: Multi-run Simulation &amp;amp; The "Flaky" Metric&lt;/h3&gt;

&lt;p&gt;We realized that testing an LLM app means treating it like a chaotic system. A single "PASS" means nothing.&lt;/p&gt;

&lt;p&gt;To stop these regressions, we built a &lt;strong&gt;Pre-Deploy Release Gate&lt;/strong&gt; workflow. Before we ship any changes to prompts, models, or orchestration, we force it to prove its stability.&lt;/p&gt;

&lt;p&gt;Here is exactly how we do it, without slow, expensive LLM judges:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Capture real cases, not synthetic data&lt;/strong&gt;&lt;br&gt;
We save a tight set of real production inputs. Synthetic data misses the weirdness of real users. We treat these saved cases as our ground truth dataset.&lt;/p&gt;
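
&lt;p&gt;A minimal sketch of that capture step in Python. The file path, field names, and &lt;code&gt;capture_case&lt;/code&gt; hook are illustrative, not a prescribed schema; the point is to persist the real input alongside the baseline tool sequence and output:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;import json
import time

GOLDEN_PATH = "golden_cases.jsonl"  # illustrative path

def capture_case(user_input, tool_calls, final_output):
    """Sample a real production interaction into the golden dataset."""
    record = {
        "ts": time.time(),
        "input": user_input,            # the real user message
        "baseline_tools": tool_calls,   # ordered tool-call names
        "baseline_output": final_output,
    }
    with open(GOLDEN_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
&lt;/code&gt;&lt;/pre&gt;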

&lt;p&gt;&lt;strong&gt;2. Multi-run Simulation (Catch the Flakiness)&lt;/strong&gt;&lt;br&gt;
A single spot check tells you nothing about variance. We replay each saved case against the new candidate prompt multiple times and compare it to the known-good baseline trace: &lt;strong&gt;10x for a quick sanity check, or 50x to 100x for a rigorous statistical threshold&lt;/strong&gt;.&lt;/p&gt;
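
&lt;p&gt;Here is roughly what that repeat loop looks like. &lt;code&gt;run_agent&lt;/code&gt; and &lt;code&gt;check&lt;/code&gt; are placeholders for your own harness and assertions, not a real API:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;def repeat_gate(case, run_agent, check, n=10, min_pass_rate=1.0):
    """Replay one saved case n times against the candidate prompt.

    run_agent(input) should return your agent's full result (output,
    tool calls, latency); check(case, result) returns True on pass.
    """
    passes = 0
    for _ in range(n):
        result = run_agent(case["input"])
        if check(case, result):
            passes += 1
    pass_rate = passes / n
    return pass_rate, pass_rate &gt;= min_pass_rate
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With &lt;code&gt;min_pass_rate=1.0&lt;/code&gt; the gate is strict: a single flaky failure out of ten runs blocks the deploy.&lt;/p&gt;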

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b1tiqtcfhug4i5u6ta1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b1tiqtcfhug4i5u6ta1.png" alt=" " width="800" height="380"&gt;&lt;/a&gt;&lt;br&gt;
As you can see in our actual regression UI: under a repeat test, some cases stay perfectly &lt;strong&gt;Healthy (10/10)&lt;/strong&gt;, but one case failed 4 out of 10 times due to latency spikes and missing required keywords.&lt;br&gt;
If we had run it only once (a spot check), there would have been a 60% chance it passed, and we would have shipped a broken sequence to production.&lt;/p&gt;
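
&lt;p&gt;The arithmetic behind that 60% is worth spelling out. If a case truly fails 40% of the time, a single run is close to a coin flip, while ten runs catch the failure almost surely:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;# Chance of seeing at least one failure in n independent runs,
# given a true per-run failure rate of 40% (as in the case above).
fail_p = 0.4
for n in (1, 5, 10):
    detect = 1 - (1 - fail_p) ** n
    print(f"{n} run(s): {detect:.1%} chance of catching the failure")
# 1 run(s): 40.0%   5 run(s): 92.2%   10 run(s): 99.4%
&lt;/code&gt;&lt;/pre&gt;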

&lt;p&gt;&lt;strong&gt;3. Tool Sequence Edit Distance&lt;/strong&gt;&lt;br&gt;
Outputs might look the same while the underlying mechanics break, so we stopped analyzing the output text and started analyzing the engine. We normalize tool calls into a canonical sequence (flattening the AST-like call tree) and compute the &lt;strong&gt;Levenshtein edit distance&lt;/strong&gt; between the baseline and candidate sequences. If the agent subtly swaps the order of operations, the regression gate blocks the deploy immediately.&lt;/p&gt;
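
&lt;p&gt;A self-contained sketch of that gate. The normalization here is reduced to canonical tool names for brevity; in practice you would also fold in argument shapes:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;def edit_distance(a, b):
    """Levenshtein distance between two normalized tool-call sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

# Illustrative sequences, already flattened to canonical tool names.
baseline  = ["search_kb", "fetch_user", "answer"]
candidate = ["fetch_user", "search_kb", "answer"]  # order swapped

dist = edit_distance(baseline, candidate)
if dist &gt; 0:
    print(f"Tool sequence drift (distance {dist}); blocking deploy.")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A distance of zero means the candidate reproduced the baseline sequence exactly; anything above zero is drift worth inspecting.&lt;/p&gt;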

&lt;h3&gt;Stop Gambling with Spot Checks&lt;/h3&gt;

&lt;p&gt;"Looks fine" isn't evidence. If you want to deploy agents safely, you need to test what actually matters: real user inputs at scale, repeated enough times to measure the true variance.&lt;/p&gt;

&lt;p&gt;We formalized this CI/CD pattern into a tool called &lt;strong&gt;&lt;a href="https://www.pluvianai.com/" rel="noopener noreferrer"&gt;PluvianAI&lt;/a&gt;&lt;/strong&gt;, which handles the traffic capture, baseline saving, and the repeat pre-deploy gating automatically.&lt;/p&gt;

&lt;p&gt;If you want to see the exact code that reproduces this silent flaky behavior, I put together a minimal repro repo here: &lt;a href="https://github.com/JinBongJun/support-bot-regression-demo" rel="noopener noreferrer"&gt;support-bot-regression-demo&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;How is your team deciding what counts as "enough evidence" before shipping an LLM change? Let me know.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
