<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muskan Joshi</title>
    <description>The latest articles on DEV Community by Muskan Joshi (@muskan_joshi_).</description>
    <link>https://dev.to/muskan_joshi_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907821%2Fbaa5154d-e274-4e83-9b1e-0f63b3c5f6e4.jpg</url>
      <title>DEV Community: Muskan Joshi</title>
      <link>https://dev.to/muskan_joshi_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muskan_joshi_"/>
    <language>en</language>
    <item>
      <title>I tested Claude's consistency across prompts — here's what I found</title>
      <dc:creator>Muskan Joshi</dc:creator>
      <pubDate>Tue, 05 May 2026 14:51:04 +0000</pubDate>
      <link>https://dev.to/muskan_joshi_/i-tested-claudes-consistency-across-prompts-heres-what-i-found-4p48</link>
      <guid>https://dev.to/muskan_joshi_/i-tested-claudes-consistency-across-prompts-heres-what-i-found-4p48</guid>
      <description>&lt;h2&gt;
  
  
  I tested Claude's consistency across prompts — here's what I found
&lt;/h2&gt;

&lt;p&gt;Every developer building an AI-powered app assumes their LLM gives consistent answers. I did too — until I actually measured it.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/muskanjoshi01/llm-test-kit" rel="noopener noreferrer"&gt;llm-test-kit&lt;/a&gt;, an open source test suite for LLM-powered applications. While building it, I ran hundreds of tests against Claude Sonnet and discovered something that surprised me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The finding
&lt;/h2&gt;

&lt;p&gt;Claude is &lt;strong&gt;content-consistent but format-inconsistent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Run the same factual question three times and you'll get the same answer every time. But the structure — headers, bullet points, analogies — changes with every response.&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice. I ran "What is an API?" three times:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run 1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# API (Application Programming Interface)&lt;/span&gt;
An API is a set of rules and protocols that allows different software 
applications to communicate with each other.
&lt;span class="gu"&gt;## Simple Analogy&lt;/span&gt;
Think of it like a restaurant menu...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# API (Application Programming Interface)&lt;/span&gt;
&lt;span class="gu"&gt;## Simple Definition&lt;/span&gt;
An API is a set of rules and protocols that allows different software 
applications to communicate with each other.
&lt;span class="gu"&gt;## Simple Analogy&lt;/span&gt;
Think of it like a restaurant menu...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run 3:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# API (Application Programming Interface)&lt;/span&gt;
An API is a set of rules and protocols that allows different software 
applications to communicate with each other.
&lt;span class="gu"&gt;## Simple Analogy&lt;/span&gt;
Think of an API like a restaurant menu and waiter...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core answer is identical. But Run 2 added a "## Simple Definition" subheader that didn't appear in the others. Run 3 changed the analogy slightly. My consistency scorer gave this a &lt;strong&gt;D (60/100)&lt;/strong&gt; — below the 70 threshold I consider production-safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;If your app parses or displays LLM responses, format inconsistency will break things: markdown headers that appear in some responses but not others, bullet points that come and go between calls, and section labels that change from one response to the next.&lt;/p&gt;
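
Here's a concrete illustration. `extractDefinition` below is a hypothetical downstream helper, not part of llm-test-kit, that pulls one section out of a response. It silently returns null whenever the model happens to omit the subheader it depends on:

```javascript
// Hypothetical downstream parser. It assumes a "## Simple Definition"
// subheader that, as the three runs above show, only sometimes appears.
function extractDefinition(markdown) {
  const match = markdown.match(/## Simple Definition\n([\s\S]*?)(?=\n##|$)/);
  return match ? match[1].trim() : null; // null when the header is missing
}
```

That null is exactly the kind of intermittent failure that only shows up in production, because any single manual test run might hit a response that includes the header.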

&lt;p&gt;The fix is simple — a system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reply in plain text only. No markdown, no headers, no bullet points.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that system prompt, the same test scores an &lt;strong&gt;A (94/100)&lt;/strong&gt;. Same question, same answer, consistent format every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I measured this
&lt;/h2&gt;

&lt;p&gt;I built llm-test-kit specifically to surface these kinds of issues. It runs four tests against any prompt:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt; — runs the same prompt N times and scores how much responses vary using Jaccard similarity. Score of 100 means identical every time. Below 70 is a red flag for production.&lt;/p&gt;
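
As a simplified sketch of how a Jaccard-based consistency score can work (the scorer in llm-test-kit differs in its details):

```javascript
// Word-level Jaccard similarity: overlap of word sets divided by their union.
function jaccard(a, b) {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const overlap = [...setA].filter((w) => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : overlap / union;
}

// Average pairwise similarity across N responses, scaled to 0-100.
function consistencyScore(responses) {
  let total = 0;
  let pairs = 0;
  responses.forEach((a, i) => {
    responses.slice(i + 1).forEach((b) => {
      total += jaccard(a, b);
      pairs += 1;
    });
  });
  return Math.round((pairs === 0 ? 1 : total / pairs) * 100);
}
```

Because the score is word-set overlap, extra headers and reworded analogies drag it down even when the core answer is unchanged, which is exactly the format drift described above.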

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt; — benchmarks response time with min, max, avg, and p95. The p95 number is the one that matters — it tells you what your slowest users actually experience.&lt;/p&gt;
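
For reference, a nearest-rank p95 over a list of latency samples is a one-liner. This is an illustrative sketch; benchmarking tools may interpolate between samples instead:

```javascript
// Nearest-rank p95: the sample below which 95% of measurements fall.
function p95(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```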

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt; — tracks token usage and spend per run. Detects cost spikes before they become surprise bills.&lt;/p&gt;
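
Cost tracking boils down to multiplying token counts by per-million-token rates. The rates below are placeholder examples, not a quote of any provider's pricing, so check your provider's current rate card before relying on the numbers:

```javascript
// Placeholder per-million-token rates -- example values only.
const RATES = { inputPerMTok: 3.0, outputPerMTok: 15.0 };

// Dollar cost of one run given token counts reported by the API.
function runCost(inputTokens, outputTokens, rates = RATES) {
  return (
    (inputTokens / 1e6) * rates.inputPerMTok +
    (outputTokens / 1e6) * rates.outputPerMTok
  );
}
```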

&lt;p&gt;&lt;strong&gt;Behavior&lt;/strong&gt; — lets you write assertions against the output. Does it contain a specific word? Does it stay under a length limit? Does it match a pattern?&lt;/p&gt;
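
A behavior check can be as simple as a list of named predicates run against the output. The rule shape below is illustrative, not llm-test-kit's actual assertion syntax:

```javascript
// Run each named predicate against a response and report pass/fail.
function checkBehavior(response, rules) {
  return rules.map((rule) => ({ name: rule.name, passed: rule.test(response) }));
}

// Hypothetical example rules covering the three checks mentioned above.
const rules = [
  { name: "mentions 'API'", test: (r) => r.includes("API") },
  { name: "at most 500 chars", test: (r) => !(r.length > 500) },
  { name: "no markdown headers", test: (r) => !/^#/m.test(r) },
];
```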

&lt;p&gt;One command generates a visual HTML report with all four results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real numbers from my tests
&lt;/h2&gt;

&lt;p&gt;Running against Claude Sonnet on "What is an API?":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Consistency score&lt;/td&gt;
&lt;td&gt;60/100 (D)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg latency&lt;/td&gt;
&lt;td&gt;6823ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost (3 runs)&lt;/td&gt;
&lt;td&gt;$0.014418&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Behavior assertions&lt;/td&gt;
&lt;td&gt;2/2 passed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The latency grade is F for this prompt — 6.8 seconds average. That's because "What is an API?" triggers a long detailed response. Shorter, more specific prompts benchmark much better. "Define API in one sentence" gets a B grade at under 2 seconds.&lt;/p&gt;

&lt;p&gt;This is the second finding: &lt;strong&gt;prompt specificity directly controls latency&lt;/strong&gt;. Vague prompts produce long responses. Long responses take longer. Test your prompts before you ship them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjp4hsjtv5dcv8j3i1jri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjp4hsjtv5dcv8j3i1jri.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5c288iy8rvel1af0m79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5c288iy8rvel1af0m79.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The consistency fix in action
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you add a system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Without system prompt — D (60)&lt;/span&gt;
node bin/cli.js consistency &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"What is an API?"&lt;/span&gt; &lt;span class="nt"&gt;--runs&lt;/span&gt; 3

&lt;span class="c"&gt;# With system prompt — A (94)  &lt;/span&gt;
node bin/cli.js consistency &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"What is an API?"&lt;/span&gt; &lt;span class="nt"&gt;--runs&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="s2"&gt;"Reply in plain text only. No markdown or headers."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The content is identical. The score jumps from 60 to 94. One line of system prompt, a 34-point improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm building next
&lt;/h2&gt;

&lt;p&gt;These findings are going into a research paper on LLM behavioral consistency patterns across providers. The next phase of testing will compare OpenAI and Anthropic head-to-head on the same prompts across different domains — factual questions, creative tasks, code generation, and summarization.&lt;/p&gt;

&lt;p&gt;If you want to run these tests yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/muskanjoshi01/llm-test-kit.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llm-test-kit
npm &lt;span class="nb"&gt;install
cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Add your API key&lt;/span&gt;
node bin/report.js &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Your prompt here"&lt;/span&gt; &lt;span class="nt"&gt;--runs&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTML report saves automatically. Open it in your browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  What tests would help you most?
&lt;/h2&gt;

&lt;p&gt;I'm actively adding new test modules. The ones on my roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Side-by-side provider comparison (OpenAI vs Anthropic on the same prompt)&lt;/li&gt;
&lt;li&gt;CI/CD integration — fail the build if consistency drops below a threshold&lt;/li&gt;
&lt;li&gt;Watch mode — run tests on a schedule and alert on regression&lt;/li&gt;
&lt;li&gt;JSON output for programmatic use&lt;/li&gt;
&lt;/ul&gt;
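
For the CI/CD item, the gate I have in mind looks roughly like this. It assumes a JSON result with a score field, which doesn't exist yet (JSON output is itself on the roadmap), so treat it as a sketch of intent rather than current behavior:

```javascript
// Hypothetical CI gate: read a JSON result and fail the build when the
// consistency score drops below a threshold. The { score } shape is assumed.
function ciGate(resultJson, threshold = 70) {
  const { score } = JSON.parse(resultJson);
  if (score >= threshold) return 0;
  console.error(`Consistency ${score} is below threshold ${threshold}`);
  return 1; // a nonzero exit code fails the CI job
}
```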

&lt;p&gt;If there's a test you wish existed for your LLM app, open an issue on GitHub. I'm building this in public and every piece of feedback shapes what gets built next.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;llm-test-kit is open source and MIT licensed. GitHub: &lt;a href="https://github.com/muskanjoshi01/llm-test-kit" rel="noopener noreferrer"&gt;muskanjoshi01/llm-test-kit&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If this was useful, a ⭐ on GitHub goes a long way.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
