<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rohan Rajesh Khairnar</title>
    <description>The latest articles on DEV Community by Rohan Rajesh Khairnar (@rohan_101104).</description>
    <link>https://dev.to/rohan_101104</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916442%2Fd40240f1-d23a-4a92-a028-9c4e298766a8.png</url>
      <title>DEV Community: Rohan Rajesh Khairnar</title>
      <link>https://dev.to/rohan_101104</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rohan_101104"/>
    <language>en</language>
    <item>
      <title>You can’t test prompts like code - and it’s breaking real systems</title>
      <dc:creator>Rohan Rajesh Khairnar</dc:creator>
      <pubDate>Wed, 06 May 2026 17:36:42 +0000</pubDate>
      <link>https://dev.to/rohan_101104/you-cant-test-prompts-like-code-and-its-breaking-real-systems-3o90</link>
      <guid>https://dev.to/rohan_101104/you-cant-test-prompts-like-code-and-its-breaking-real-systems-3o90</guid>
      <description>&lt;p&gt;I’ve been building with LLMs for a while now, and there’s one problem that keeps showing up — not in tutorials, not in docs, not in any framework README — but in production, at 2am.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You change one line in your prompt.&lt;/li&gt;
&lt;li&gt;You test a few inputs manually.&lt;/li&gt;
&lt;li&gt;Everything looks fine.&lt;/li&gt;
&lt;li&gt;You ship.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two days later, a user reports something broken.&lt;/p&gt;

&lt;p&gt;And now you don’t know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which prompt change caused it&lt;/li&gt;
&lt;li&gt;when it broke&lt;/li&gt;
&lt;li&gt;how many users were affected&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;We solved this for code. Not for prompts.&lt;/h2&gt;

&lt;p&gt;When we write normal software, we have a clear feedback loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;write code&lt;/li&gt;
&lt;li&gt;run tests&lt;/li&gt;
&lt;li&gt;know immediately if something broke&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools like &lt;code&gt;pytest&lt;/code&gt;, CI pipelines, and assertions made this reliable.&lt;/p&gt;
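&lt;p&gt;For deterministic code, that loop is trivial to automate. A minimal sketch (&lt;code&gt;slugify&lt;/code&gt; is just an illustrative stand-in, not from any particular project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# test_slugify.py -- deterministic code, so exact assertions are safe
def slugify(title: str) -&gt; str:
    return title.lower().strip().replace(" ", "-")

def test_slugify():
    # Same input, same output, every run: pytest gives an instant verdict
    assert slugify("Hello World") == "hello-world"
    assert slugify("  Prompt Testing  ") == "prompt-testing"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;pytest&lt;/code&gt; and you know, immediately and reliably, whether your change broke anything.&lt;/p&gt;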

&lt;p&gt;We take this for granted.&lt;/p&gt;

&lt;p&gt;But with prompts?&lt;/p&gt;

&lt;p&gt;We don’t have that loop.&lt;/p&gt;

&lt;p&gt;We test manually.&lt;br&gt;
Or worse — we don’t test at all.&lt;/p&gt;


&lt;h2&gt;Why prompt testing is fundamentally harder&lt;/h2&gt;

&lt;p&gt;The core issue is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs are not deterministic.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The same input can produce different outputs&lt;/li&gt;
&lt;li&gt;Output formats can drift&lt;/li&gt;
&lt;li&gt;Small prompt changes can have large effects&lt;/li&gt;
&lt;li&gt;Model updates can silently break behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this doesn’t work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;assert output == "Paris"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
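&lt;p&gt;To make the flakiness concrete, here’s a toy simulation. &lt;code&gt;call_llm&lt;/code&gt; is a fake stand-in for a real model client; it just mimics the phrasing drift you get with sampling enabled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def call_llm(prompt: str) -&gt; str:
    # Fake client: a real LLM with temperature &gt; 0 behaves much like this,
    # returning a different phrasing of the same answer on each call.
    return random.choice([
        "Paris",
        "Paris.",
        "The capital of France is Paris.",
    ])

outputs = {call_llm("What is the capital of France?") for _ in range(20)}
print(outputs)  # usually several distinct strings

# The exact-match assertion fails on most runs, through no bug of yours:
assert all(out == "Paris" for out in outputs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;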



&lt;p&gt;And because exact assertions don’t work…&lt;/p&gt;

&lt;p&gt;👉 most developers just stop asserting anything.&lt;/p&gt;




&lt;h2&gt;What this looks like in production&lt;/h2&gt;

&lt;p&gt;These aren’t edge cases. These are normal failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A support bot starts leaking internal information after a tone change&lt;/li&gt;
&lt;li&gt;A JSON output format shifts and silently breaks downstream systems (sketched in code below)&lt;/li&gt;
&lt;li&gt;A model upgrade changes structure and no one notices&lt;/li&gt;
&lt;li&gt;Bias creeps into outputs because no check was enforcing constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And sometimes the worst case:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;nothing crashes — but the system quietly degrades&lt;/strong&gt;&lt;/p&gt;
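&lt;p&gt;The JSON case is the easiest one to reproduce. A toy sketch (the field names are made up) of how a &lt;em&gt;still-valid&lt;/em&gt; JSON response can drift into a shape the consumer never expected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Output shape before a prompt tweak, and after a "harmless" one
before = '{"intent": "refund", "confidence": 0.92}'
after = '{"result": {"intent": "refund", "confidence": "high"}}'

def route_ticket(raw: str) -&gt; str:
    data = json.loads(raw)                 # both parse fine, so no crash
    return data.get("intent", "general")   # silently falls back on drift

print(route_ticket(before))  # "refund"  -- correct queue
print(route_ticket(after))   # "general" -- wrong queue, no error anywhere
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No exception, no log line, just wrong behavior.&lt;/p&gt;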




&lt;h2&gt;The real problem&lt;/h2&gt;

&lt;p&gt;We tried to apply traditional testing thinking to something that behaves differently.&lt;/p&gt;

&lt;p&gt;Prompt outputs shouldn’t be tested for &lt;strong&gt;exact values&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They should be tested for &lt;strong&gt;behavior&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the output contain required information?&lt;/li&gt;
&lt;li&gt;Is the format valid (JSON, structure, etc.)?&lt;/li&gt;
&lt;li&gt;Does it avoid forbidden content?&lt;/li&gt;
&lt;li&gt;Does it stay within expected bounds (length, tone, constraints)?&lt;/li&gt;
&lt;/ul&gt;
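
&lt;p&gt;Here’s a minimal sketch of those checks using plain &lt;code&gt;pytest&lt;/code&gt;. The output string is hardcoded so the example is self-contained; in a real suite it would come from your model call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Hardcoded for the sketch; a real test would call the model here.
output = '{"summary": "Order #4521 was refunded.", "sentiment": "neutral"}'

def test_output_is_valid_json():
    json.loads(output)  # raises if the format drifted away from JSON

def test_required_fields_present():
    data = json.loads(output)
    assert {"summary", "sentiment"} &lt;= data.keys()

def test_no_forbidden_content():
    # e.g. internal identifiers that must never leak to users
    assert "internal" not in output.lower()

def test_stays_within_bounds():
    data = json.loads(output)
    assert len(data["summary"]) &lt;= 280
    assert data["sentiment"] in {"positive", "neutral", "negative"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every one of these passes for many different phrasings of the same answer. That’s the point: assert the behavior, not the exact bytes.&lt;/p&gt;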




&lt;p&gt;I’ve been working on something to fix this — a way to test prompt behavior like we test code, but designed for how LLMs actually behave.&lt;/p&gt;

&lt;p&gt;Releasing this Friday.&lt;/p&gt;

&lt;p&gt;If this sounds familiar, I’d genuinely like to hear your experience — what’s the worst prompt bug you’ve shipped?&lt;/p&gt;

&lt;p&gt;This feels like a gap in how we build with LLMs.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
