I tested Claude's consistency across prompts — here's what I found
Every developer building an AI-powered app assumes their LLM gives consistent answers. I did too — until I actually measured it.
I built llm-test-kit, an open source test suite for LLM-powered applications. While building it, I ran hundreds of tests against Claude Sonnet and discovered something that surprised me.
The finding
Claude is content-consistent but format-inconsistent.
Run the same factual question three times and you'll get the same answer every time. But the structure — headers, bullet points, analogies — changes with every response.
Here's what that looks like in practice. I ran "What is an API?" three times:
Run 1:
# API (Application Programming Interface)
An API is a set of rules and protocols that allows different software
applications to communicate with each other.
## Simple Analogy
Think of it like a restaurant menu...
Run 2:
# API (Application Programming Interface)
## Simple Definition
An API is a set of rules and protocols that allows different software
applications to communicate with each other.
## Simple Analogy
Think of it like a restaurant menu...
Run 3:
# API (Application Programming Interface)
An API is a set of rules and protocols that allows different software
applications to communicate with each other.
## Simple Analogy
Think of an API like a restaurant menu and waiter...
The core answer is identical. But Run 2 added a "## Simple Definition" subheader that didn't appear in the others. Run 3 changed the analogy slightly. My consistency scorer gave this a D (60/100) — below the 70 threshold I consider production-safe.
Why this matters
If your app parses or displays LLM responses, format inconsistency will break things: markdown headers that appear in some responses but not others, bullet points that come and go between calls, section labels that change from run to run.
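Here's a minimal example of the kind of parsing that breaks. The function and the header it targets are illustrative, not something llm-test-kit ships:

// Naive extractor that assumes every response carries a "## Simple Definition" header.
function extractDefinition(markdown) {
  const match = markdown.match(/## Simple Definition\n([\s\S]*?)(?=\n##|$)/);
  return match ? match[1].trim() : null;
}

// Works on Run 2 above. Returns null on Runs 1 and 3, even though the answer is the same.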
The fix is simple — a system prompt:
Reply in plain text only. No markdown, no headers, no bullet points.
With that system prompt, the same test scores an A (94/100). Same question, same answer, consistent format every time.
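For reference, a system prompt is a single field on the API call. Here's roughly what it looks like with the Anthropic Node SDK (the model id is illustrative; use whichever Sonnet version you're testing):

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: "claude-3-5-sonnet-latest", // illustrative model id
  max_tokens: 1024,
  system: "Reply in plain text only. No markdown, no headers, no bullet points.",
  messages: [{ role: "user", content: "What is an API?" }],
});

console.log(response.content[0].text);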
How I measured this
I built llm-test-kit specifically to surface these kinds of issues. It runs four tests against any prompt:
Consistency — runs the same prompt N times and scores how much the responses vary, using Jaccard similarity (see the sketch after this list). A score of 100 means identical every time; below 70 is a red flag for production.
Latency — benchmarks response time with min, max, avg, and p95 (also sketched below). The p95 number is the one that matters: it tells you what your slowest users actually experience.
Cost — tracks token usage and spend per run. Detects cost spikes before they become surprise bills.
Behavior — lets you write assertions against the output. Does it contain a specific word? Does it stay under a length limit? Does it match a pattern?
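If you're curious how the scoring works, here's a simplified sketch of the two primitives: word-level Jaccard similarity and a nearest-rank p95. These are stand-ins to show the idea, not llm-test-kit's exact internals:

// Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|, scaled to 0-100.
function jaccardScore(a, b) {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const intersection = [...setA].filter((w) => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 100 : Math.round((intersection / union) * 100);
}

// Consistency = average pairwise similarity across all runs (assumes N >= 2).
function consistencyScore(responses) {
  const scores = [];
  for (let i = 0; i < responses.length; i++) {
    for (let j = i + 1; j < responses.length; j++) {
      scores.push(jaccardScore(responses[i], responses[j]));
    }
  }
  return Math.round(scores.reduce((sum, s) => sum + s, 0) / scores.length);
}

// Nearest-rank p95: the latency 95% of runs come in under.
function p95(latenciesMs) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.95) - 1)];
}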
One command generates a visual HTML report with all four results.
Real numbers from my tests
Running against Claude Sonnet on "What is an API?":
| Metric | Result |
|---|---|
| Consistency score | 60/100 (D) |
| Avg latency | 6823ms |
| Total cost (3 runs) | $0.014418 |
| Behavior assertions | 2/2 passed |
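The cost math itself is simple: token counts times the provider's per-million-token rates. A quick sketch, assuming Sonnet's $3 input / $15 output per million tokens (check current pricing before relying on these numbers):

const INPUT_PER_MTOK = 3.0;   // USD per million input tokens (assumed rate)
const OUTPUT_PER_MTOK = 15.0; // USD per million output tokens (assumed rate)

function runCost(inputTokens, outputTokens) {
  return (inputTokens * INPUT_PER_MTOK + outputTokens * OUTPUT_PER_MTOK) / 1_000_000;
}

At those rates, $0.014418 over 3 runs works out to about $0.0048 per run, consistent with a short prompt plus roughly 300 output tokens each time.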
The latency grade is F for this prompt — 6.8 seconds average. That's because "What is an API?" triggers a long detailed response. Shorter, more specific prompts benchmark much better. "Define API in one sentence" gets a B grade at under 2 seconds.
This is the second finding: prompt specificity directly controls latency. Vague prompts produce long responses. Long responses take longer. Test your prompts before you ship them.
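To reproduce that comparison yourself (I'm assuming here that the latency subcommand takes the same flags as the consistency one shown in the next section):

# Vague prompt: long response, slow
node bin/cli.js latency -p "What is an API?" --runs 3

# Specific prompt: short response, fast
node bin/cli.js latency -p "Define API in one sentence" --runs 3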
The consistency fix in action
Here's what happens when you add a system prompt:
# Without system prompt — D (60)
node bin/cli.js consistency -p "What is an API?" --runs 3
# With system prompt — A (94)
node bin/cli.js consistency -p "What is an API?" --runs 3 \
--system "Reply in plain text only. No markdown or headers."
The content is identical. The score jumps from 60 to 94. One line of system prompt, a 34-point improvement.
What I'm building next
These findings are going into a research paper on LLM behavioral consistency patterns across providers. The next phase of testing will compare OpenAI and Anthropic head-to-head on the same prompts across different domains — factual questions, creative tasks, code generation, and summarization.
If you want to run these tests yourself:
git clone https://github.com/muskanjoshi01/llm-test-kit.git
cd llm-test-kit
npm install
cp .env.example .env
# Add your API key
node bin/report.js -p "Your prompt here" --runs 3
The HTML report saves automatically. Open it in your browser.
What tests would help you most?
I'm actively adding new test modules. The ones on my roadmap:
- Side-by-side provider comparison (OpenAI vs Anthropic on the same prompt)
- CI/CD integration — fail the build if consistency drops below a threshold
- Watch mode — run tests on a schedule and alert on regression
- JSON output for programmatic use
If there's a test you wish existed for your LLM app, open an issue on GitHub. I'm building this in public and every piece of feedback shapes what gets built next.
llm-test-kit is open source and MIT licensed. GitHub: muskanjoshi01/llm-test-kit
If this was useful, a ⭐ on GitHub goes a long way.

