<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dave Graham</title>
    <description>The latest articles on DEV Community by Dave Graham (@benchwright).</description>
    <link>https://dev.to/benchwright</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3915990%2Fdfab77f3-5c5d-4061-a7ad-b41194350be5.png</url>
      <title>DEV Community: Dave Graham</title>
      <link>https://dev.to/benchwright</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/benchwright"/>
    <language>en</language>
    <item>
      <title>What 12 LLMs Actually Cost in Production — Real Data from Benchwright</title>
      <dc:creator>Dave Graham</dc:creator>
      <pubDate>Wed, 06 May 2026 13:52:10 +0000</pubDate>
      <link>https://dev.to/benchwright/what-12-llms-actually-cost-in-production-real-data-from-benchwright-4ifl</link>
      <guid>https://dev.to/benchwright/what-12-llms-actually-cost-in-production-real-data-from-benchwright-4ifl</guid>
      <description>&lt;p&gt;Real production cost data from the Benchwright /compare calculator across 12 LLMs — input/output ratios, latency tradeoffs, and 3 decisions you should make differently today.&lt;/p&gt;

&lt;p&gt;Everyone knows the sticker price. Nobody knows the bill.&lt;/p&gt;

&lt;p&gt;You see "$5 per million tokens" and do mental math: &lt;em&gt;that's cheap, this will cost almost nothing.&lt;/em&gt; Then you ship to production, context windows bloat with conversation history, your retry logic fires on 3% of calls, and the response tokens are 4× your estimates because you underestimated how verbose the model is. Three months later your AI feature is costing you $800/month instead of $80.&lt;/p&gt;

&lt;p&gt;This isn't a niche problem. It's the default outcome for teams that benchmark cost in a notebook and deploy to production without re-measuring.&lt;/p&gt;

&lt;p&gt;We built the &lt;a href="https://benchwright.polsia.app/compare" rel="noopener noreferrer"&gt;Benchwright /compare calculator&lt;/a&gt; to make the gap between sticker price and real production cost visible — and to keep it visible as models update. After running 12 models through it, here's what the data actually shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://benchwright.polsia.app/compare" rel="noopener noreferrer"&gt;/compare tool&lt;/a&gt; calculates monthly production cost from three inputs you control: API calls per day, average prompt tokens, and average completion tokens. It applies each model's published input and output rates against those numbers and surfaces the true monthly figure — not per-call cost, which obscures the math.&lt;/p&gt;
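&lt;p&gt;In code, the math is small enough to sanity-check by hand. A minimal sketch of the calculation, assuming a 30-day month — the function and the example workload are illustrative, not the calculator's internals:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Monthly cost from the three inputs you control, plus a model's published rates.
// Rates are in dollars per million tokens; a 30-day month is assumed.
function monthlyCost(
  callsPerDay: number,
  promptTokens: number,
  completionTokens: number,
  inputRatePerMillion: number,
  outputRatePerMillion: number,
): number {
  const inputTokensPerMonth = callsPerDay * 30 * promptTokens;
  const outputTokensPerMonth = callsPerDay * 30 * completionTokens;
  return (
    (inputTokensPerMonth / 1_000_000) * inputRatePerMillion +
    (outputTokensPerMonth / 1_000_000) * outputRatePerMillion
  );
}

// Example: GPT-4o at 10,000 calls/day, 600 prompt tokens, 900 completion tokens:
// 180M input tokens * $5/M + 270M output tokens * $15/M = $900 + $4,050 = $4,950/month
monthlyCost(10_000, 600, 900, 5.0, 15.0); // 4950
&lt;/code&gt;&lt;/pre&gt;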

&lt;p&gt;&lt;strong&gt;Models in this comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Models&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;GPT-4o, GPT-4o mini, GPT-4 Turbo, o1-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Other&lt;/td&gt;
&lt;td&gt;Mistral Large, Llama 3.1 70B (via Together.ai)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All pricing reflects published rates as of May 2026. Latency figures are median time to first token (TTFT) from Benchwright's continuous measurements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Pricing Picture
&lt;/h2&gt;

&lt;p&gt;Before we get to surprises, here's the complete dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/1M tokens)&lt;/th&gt;
&lt;th&gt;Output ($/1M tokens)&lt;/th&gt;
&lt;th&gt;Latency (p50 TTFT)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;1,200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o mini&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;600ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o1-mini&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$12.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;1,000ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Haiku&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;td&gt;500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3 Opus&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$75.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;$0.075&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;700ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 1.5 Pro&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Large&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$6.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B&lt;/td&gt;
&lt;td&gt;$0.90&lt;/td&gt;
&lt;td&gt;$0.90&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The raw numbers don't tell you much until you model your actual workload. That's where the surprises are.&lt;/p&gt;

&lt;h2&gt;
  
  
  3 Non-Obvious Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Claude 3.5 Haiku can beat GPT-4o mini on cost per resolved task
&lt;/h3&gt;

&lt;p&gt;At first glance GPT-4o mini looks like the budget champion: $0.15 input vs Haiku's $0.80. That framing is misleading.&lt;/p&gt;

&lt;p&gt;Output tokens are where you actually spend money at scale. GPT-4o mini charges $0.60/M on output; Haiku charges $4.00/M. On per-token pricing alone, GPT-4o mini wins at any completion length. And production completions aren't short: customer support responses, code explanations, document summaries, structured JSON outputs — these run 500–2,000 tokens routinely, so output cost dominates the bill.&lt;/p&gt;

&lt;p&gt;At 1,000 output tokens per call, 10,000 calls/day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4o mini: &lt;strong&gt;$6/day&lt;/strong&gt; in output costs alone&lt;/li&gt;
&lt;li&gt;Claude 3.5 Haiku: &lt;strong&gt;$40/day&lt;/strong&gt; in output costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So GPT-4o mini wins here. But here's what changes the math: quality per output token. Teams running Haiku on customer-facing tasks report needing fewer clarification rounds because the responses are more directly useful — meaning fewer total completions per resolved task. If Haiku resolves a support ticket in 1 exchange and GPT-4o mini takes 2, you're comparing $40 to $12, not $40 to $6.&lt;/p&gt;
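&lt;p&gt;Put numbers on that framing and the comparison flips. A quick sketch reusing the daily output-cost figures above; the exchange counts are the hypothetical ones from this example, not measured data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Daily output cost from above: 10,000 calls/day at 1,000 output tokens per call.
const dailyOutputCost = { gpt4oMini: 6, haiku: 40 };   // dollars per day of output

// Hypothetical resolution behavior: Haiku closes a ticket in 1 exchange, mini needs 2.
const exchangesPerTask = { gpt4oMini: 2, haiku: 1 };

const dailyCostToResolveSameTickets = {
  gpt4oMini: dailyOutputCost.gpt4oMini * exchangesPerTask.gpt4oMini, // $12/day
  haiku: dailyOutputCost.haiku * exchangesPerTask.haiku,             // $40/day
};
&lt;/code&gt;&lt;/pre&gt;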

&lt;p&gt;&lt;strong&gt;The decision:&lt;/strong&gt; Don't pick the cheapest model per token. Pick the cheapest model per &lt;em&gt;resolved task&lt;/em&gt;. &lt;a href="https://benchwright.polsia.app/compare" rel="noopener noreferrer"&gt;Benchwright's continuous monitoring&lt;/a&gt; measures this over time so you're not guessing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Gemini 2.0 Flash is the price-performance anomaly nobody is talking about
&lt;/h3&gt;

&lt;p&gt;$0.10 input, $0.40 output, 500ms p50 latency. That's faster than GPT-4o mini and cheaper than it on both input and output.&lt;/p&gt;

&lt;p&gt;For most production workloads — classification, summarization, extraction, light reasoning — Gemini 2.0 Flash is a legitimate default choice that teams are sleeping on. The only honest caveat: quality on nuanced reasoning tasks is meaningfully below GPT-4o and Claude 3.5 Sonnet. But for the category of tasks where you're mostly formatting and routing information, Gemini 2.0 Flash at $0.10/$0.40 per million tokens is hard to beat.&lt;/p&gt;

&lt;p&gt;Run your actual eval dataset against it before dismissing it. Most teams that do are surprised.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The real cost of Claude 3 Opus isn't $15/$75 — it's the opportunity cost of not switching
&lt;/h3&gt;

&lt;p&gt;Claude 3 Opus is $15 input, $75 output. Claude 3.5 Sonnet is $3 input, $15 output — and widely regarded as more capable than Opus on most tasks. Sonnet's release made Opus a legacy cost center.&lt;/p&gt;

&lt;p&gt;At 5,000 calls/day, 500 input tokens, 800 output tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Opus monthly:&lt;/strong&gt; ~$9,300&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet monthly:&lt;/strong&gt; ~$1,980&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a &lt;strong&gt;$7,300/month difference&lt;/strong&gt; for a model that's worse on most benchmarks. Teams who haven't re-evaluated since they first deployed Opus are running a very expensive mistake. This is exactly what &lt;a href="https://benchwright.polsia.app/blog/llm-model-updates-silently-break-production" rel="noopener noreferrer"&gt;silent regression monitoring&lt;/a&gt; is designed to catch — not just when models get worse, but when a better option emerges.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Tradeoff
&lt;/h2&gt;

&lt;p&gt;Cost is only half the equation. Latency shapes UX in ways that cost doesn't.&lt;/p&gt;

&lt;p&gt;Here's the p50 first-token picture for the models where we have consistent data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;p50 TTFT&lt;/th&gt;
&lt;th&gt;Practical implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Haiku&lt;/td&gt;
&lt;td&gt;500ms&lt;/td&gt;
&lt;td&gt;Streaming feels near-instant; fine for interactive chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;
&lt;td&gt;500ms&lt;/td&gt;
&lt;td&gt;Excellent for inline UX patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o mini&lt;/td&gt;
&lt;td&gt;600ms&lt;/td&gt;
&lt;td&gt;Acceptable for most UI contexts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;700ms&lt;/td&gt;
&lt;td&gt;Slight perceptible delay in fast interactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;td&gt;1,000ms&lt;/td&gt;
&lt;td&gt;Noticeable pause; needs streaming UX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;1,200ms&lt;/td&gt;
&lt;td&gt;Requires skeleton loading states&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What p95 reveals:&lt;/strong&gt; Median latency is misleading for customer-facing features. The 1-in-20 call that takes 4–6 seconds is the one that gets a bug report. Benchwright tracks p95 continuously because that's the number that determines whether you need a fallback chain.&lt;/p&gt;

&lt;p&gt;Practical rule: if your feature is synchronous and user-facing, you need p95 under 2 seconds. GPT-4o and Claude 3.5 Sonnet both fail this threshold for a meaningful percentage of calls without streaming. Haiku and Gemini 2.0 Flash pass it comfortably.&lt;/p&gt;
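&lt;p&gt;If you need the slower model's quality anyway, one common pattern is a latency-budgeted fallback: race the primary call against your p95 budget and cut over to a faster model on timeout. A rough sketch — the two call functions stand in for whatever SDK you actually use:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Race the primary model against the latency budget; fall back to a faster model on timeout.
// callPrimary and callFallback are placeholders, e.g. GPT-4o primary, Gemini 2.0 Flash fallback.
type ModelCall = (prompt: string) =&gt; Promise&amp;lt;string&amp;gt;;

async function generateWithBudget(
  prompt: string,
  budgetMs: number,
  callPrimary: ModelCall,
  callFallback: ModelCall,
) {
  const timeout = new Promise&amp;lt;never&amp;gt;((_, reject) =&gt;
    setTimeout(() =&gt; reject(new Error("latency budget exceeded")), budgetMs),
  );
  try {
    return await Promise.race([callPrimary(prompt), timeout]);
  } catch {
    // Production code would also clear the timer and log which path served the request.
    return callFallback(prompt);
  }
}
&lt;/code&gt;&lt;/pre&gt;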

&lt;h2&gt;
  
  
  Hidden Costs
&lt;/h2&gt;

&lt;p&gt;The three things not in any sticker price:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Retries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most production setups have retry logic for rate limits and transient failures. A 3% retry rate on 10,000 calls/day is 300 bonus calls you didn't budget for. On GPT-4o at a typical 600-token prompt + 900-token response, that's roughly $150/month of invisible overhead. Multiply by 12 months. Benchmark your retry rate, not just your happy-path cost.&lt;/p&gt;
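&lt;p&gt;The same formula from the methodology section makes that overhead concrete (30-day month assumed):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// 3% retry rate on 10,000 calls/day = 300 extra GPT-4o calls every day.
const retriesPerDay = 10_000 * 0.03;                          // 300
const costPerCall = (600 / 1e6) * 5 + (900 / 1e6) * 15;       // $0.0165 per retried call
const monthlyRetryCost = retriesPerDay * costPerCall * 30;    // about $148.50/month
&lt;/code&gt;&lt;/pre&gt;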

&lt;p&gt;&lt;strong&gt;2. Context window bloat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Conversation history accumulates. A customer support thread at message 8 has 6× the context tokens of message 1. Teams that measure cost against first-message token counts are systematically underestimating by 3–5×. &lt;a href="https://benchwright.polsia.app/blog/llm-evaluation-metrics" rel="noopener noreferrer"&gt;Evaluating this pattern over time&lt;/a&gt; is one of the 5 metrics that actually matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Fallback chains&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're running GPT-4o with a Claude 3.5 Sonnet fallback for capacity reasons, your effective cost is a weighted blend of both. At 15% fallback rate, you're paying 85% of one price and 15% of another. Model your actual fallback frequency or your budget math is wrong.&lt;/p&gt;
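&lt;p&gt;In the same notation, the blended per-call figure is a weighted average; the 600/900-token workload here is illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Effective per-call cost with a 15% fallback rate, at 600 prompt + 900 completion tokens.
const gpt4oPerCall  = (600 / 1e6) * 5 + (900 / 1e6) * 15;   // $0.0165
const sonnetPerCall = (600 / 1e6) * 3 + (900 / 1e6) * 15;   // $0.0153
const blendedPerCall = 0.85 * gpt4oPerCall + 0.15 * sonnetPerCall;
// Budget against blendedPerCall times your real call volume, not the primary model's sticker price.
&lt;/code&gt;&lt;/pre&gt;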

&lt;h2&gt;
  
  
  3 Decisions You Should Make Differently After This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Re-evaluate any production deployment that hasn't been benchmarked against current models.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you picked your model over 6 months ago, the landscape has changed. Claude 3.5 Sonnet vs Opus alone could be saving you thousands per month. Set a quarterly model review on the calendar — or better, run continuous cost monitoring so you catch the delta automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Stop using input price as your primary cost filter.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Input tokens are cheap across the board. Output tokens are where the meaningful variation is. Sort by output cost, then model your actual input-to-output ratio. Your real number is usually 2–4× the sticker you're anchoring on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Don't skip Gemini 2.0 Flash in your next eval.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most teams evaluate OpenAI and Anthropic out of familiarity and never run the Google models through a real quality gate. For a large category of production tasks, Gemini 2.0 Flash at $0.10/$0.40 is the right answer. You won't know unless you measure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It on Your Numbers
&lt;/h2&gt;

&lt;p&gt;Every workload is different. The &lt;a href="https://benchwright.polsia.app/compare" rel="noopener noreferrer"&gt;Benchwright /compare tool&lt;/a&gt; lets you plug in your actual API call volume, prompt length, and completion length to get your real monthly number across all 12 models — not a hypothetical.&lt;/p&gt;

&lt;p&gt;Once you have a baseline, continuous monitoring tells you when that number shifts because a model changed under you. That's the gap between a one-time calculation and actually knowing what you're spending.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://benchwright.polsia.app/compare" rel="noopener noreferrer"&gt;Run your numbers in /compare&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Want ongoing monitoring instead of a one-time check? Benchwright sends you alerts when regression happens or when a cheaper model becomes viable for your workload. &lt;a href="https://benchwright.polsia.app/compare" rel="noopener noreferrer"&gt;Sign up for early access&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related reading:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;• &lt;a href="https://benchwright.polsia.app/blog/llm-model-updates-silently-break-production" rel="noopener noreferrer"&gt;How LLM Model Updates Silently Break Production Features&lt;/a&gt; — why "stable" models aren't&lt;/p&gt;

&lt;p&gt;• &lt;a href="https://benchwright.polsia.app/blog/unit-tests-llm" rel="noopener noreferrer"&gt;Why Unit Tests Aren't Enough for LLM Features&lt;/a&gt; — what you're missing&lt;/p&gt;

&lt;p&gt;• &lt;a href="https://benchwright.polsia.app/blog/llm-evaluation-metrics" rel="noopener noreferrer"&gt;5 Metrics That Actually Matter When Evaluating LLM Providers&lt;/a&gt; — what to track&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchwright Calculator
&lt;/h2&gt;

&lt;p&gt;Benchwright runs continuous LLM evaluations so teams know what works before they deploy. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://benchwright.polsia.app/compare" rel="noopener noreferrer"&gt;Try the free calculator → benchwright.polsia.app/compare&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No credit card required. No infrastructure to manage.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why Unit Tests Aren't Enough for LLM Features</title>
      <dc:creator>Dave Graham</dc:creator>
      <pubDate>Wed, 06 May 2026 13:40:32 +0000</pubDate>
      <link>https://dev.to/benchwright/why-unit-tests-arent-enough-for-llm-features-18m6</link>
      <guid>https://dev.to/benchwright/why-unit-tests-arent-enough-for-llm-features-18m6</guid>
      <description>&lt;p&gt;All tests pass. The deploy goes green. But your LLM feature degrades silently in production — and your test suite never noticed. Here's the fundamental reason why, and what actually works instead.&lt;/p&gt;

&lt;p&gt;Picture this: you've built a feature that uses an LLM to classify customer support tickets. You wrote unit tests. You wrote integration tests. They all pass on every CI run. You deploy with confidence.&lt;/p&gt;

&lt;p&gt;Three weeks later, a customer flags that the routing has been wrong for days. You check your test suite — it's green. You check the model configuration — nothing changed on your end. But something changed. And your entire testing infrastructure missed it completely.&lt;/p&gt;

&lt;p&gt;This isn't a gap in your test coverage. It's a fundamental mismatch between how software testing works and how LLMs behave.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Unit Tests Are Built For
&lt;/h2&gt;

&lt;p&gt;Unit tests work because the systems they test are &lt;strong&gt;deterministic&lt;/strong&gt;. Given input X, a pure function always returns output Y. The test captures that contract. If someone breaks it, the test fails. The feedback loop is instant, local, and reliable.&lt;/p&gt;

&lt;p&gt;This model depends on one critical assumption: &lt;strong&gt;the code doesn't change unless you change it&lt;/strong&gt;. Functions don't drift. Libraries don't silently update behavior between CI runs. The math stays the same.&lt;/p&gt;

&lt;p&gt;LLMs break every part of this assumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Reasons Unit Tests Can't Catch LLM Regression
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Non-determinism is the baseline, not the exception.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Call the same LLM with the same prompt twice and you'll get two different outputs. This is by design — temperature, sampling, and model stochasticity are features. But it makes assertions fragile. You can't write &lt;code&gt;expect(output).toBe("Billing")&lt;/code&gt; and have it mean anything, because the model might return "billing", "Billing issue", or a slightly different phrasing on the next run.&lt;/p&gt;

&lt;p&gt;Teams work around this by asserting on structure (&lt;code&gt;typeof output === 'string'&lt;/code&gt;) or mocking the LLM call entirely. Both approaches miss the point. Structural tests verify your parsing code, not model quality. Mocks verify that your code calls the API — they say nothing about what the API returns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mock problem:&lt;/strong&gt; When you mock an LLM call in tests, you're testing that your code handles a specific, pre-written response correctly. You're not testing the model at all. The mock stays frozen while the actual model drifts — and your tests keep passing the whole time.&lt;/p&gt;
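&lt;p&gt;Concretely, here is what that kind of test usually looks like. The module and client names are hypothetical; the pattern is the point:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// A typical "LLM test": it exercises your parsing and routing code against a frozen response.
// classifyTicket and llmClient are hypothetical names for your own module and provider wrapper.
import { classifyTicket, llmClient } from "./classify";

test("routes billing tickets", async () =&gt; {
  // Hand-written response, frozen at the moment you wrote the test. The real model is never called.
  jest.spyOn(llmClient, "complete").mockResolvedValue('{"category": "Billing"}');

  const result = await classifyTicket("I was charged twice this month");

  // Proves the parsing works. Says nothing about whether the live model still
  // returns "Billing" for inputs like this one, so it keeps passing while the model drifts.
  expect(result.category).toBe("Billing");
});
&lt;/code&gt;&lt;/pre&gt;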

&lt;p&gt;&lt;strong&gt;2. The model is a black box that changes underneath you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenAI, Anthropic, and Google push model updates continuously. Safety fine-tunes, capability improvements, cost optimizations — they change behavior without changing the version string. &lt;code&gt;gpt-4o&lt;/code&gt; today is not the same model as &lt;code&gt;gpt-4o&lt;/code&gt; six months ago. Your test suite runs against whichever version is live at CI time. Once deployed, it runs against whatever version the provider decides to serve.&lt;/p&gt;

&lt;p&gt;Your tests passed against last week's model. This week's model is different. You never ran the tests against this week's model. The gap is invisible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Prompt sensitivity makes small changes catastrophic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs are extraordinarily sensitive to prompt wording. Adding a period. Changing "classify" to "categorize." Tweaking the system message by one sentence. These changes can shift accuracy by 5–15 percentage points — sometimes more. Your unit tests run against a fixed prompt, so they don't catch what happens when prompts evolve in production, when context windows get filled differently, or when the model's response to your exact phrasing shifts over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Distribution shift happens in production, not in your test fixtures.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your test suite has 20 labeled examples. Your production system processes thousands of inputs per day with a distribution that evolves — new product categories, new user phrasings, seasonal language patterns. A model that handles your test fixtures correctly might handle 15% of real production inputs poorly, and you'd never see it in the test results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The coverage gap:&lt;/strong&gt; Integration test suites for LLM features typically cover 20–100 hand-picked examples. Production traffic covers millions of input variations. The examples you test are not representative of the distribution that breaks things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Unit Tests Can (and Can't) Cover
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What You're Testing&lt;/th&gt;
&lt;th&gt;Unit Tests&lt;/th&gt;
&lt;th&gt;Continuous Evaluation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Your parsing code handles the response&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The API call is constructed correctly&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model output quality on your eval set&lt;/td&gt;
&lt;td&gt;✗ No (mocked)&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Behavior after provider model updates&lt;/td&gt;
&lt;td&gt;✗ No&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy drift over weeks&lt;/td&gt;
&lt;td&gt;✗ No&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Format compliance rate in production&lt;/td&gt;
&lt;td&gt;✗ No&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regression from prompt changes&lt;/td&gt;
&lt;td&gt;✗ No&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-model performance comparison&lt;/td&gt;
&lt;td&gt;✗ No&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Unit tests aren't useless for LLM features — they're just covering the wrong half of what can break. Your parsing logic, API client, and error handling should absolutely be unit tested. But the model's behavior? That requires a different approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Continuous Evaluation Actually Catches
&lt;/h2&gt;

&lt;p&gt;Continuous evaluation treats your LLM feature like a production service with measurable outputs — because that's what it is. Instead of a test suite that runs once and freezes, you run evaluations on a schedule: daily, or after every deploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavioral drift.&lt;/strong&gt; When a provider update changes how your model handles a class of inputs, continuous evaluation catches it within 24 hours. You see the accuracy chart drop. You have a timestamp. You can correlate it with provider changelogs. Without continuous evaluation, you'd find out from a user report three weeks later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality degradation over time.&lt;/strong&gt; Some regressions aren't sudden — they're gradual. Format compliance slips from 99% to 96% to 93% over six weeks. No single day is alarming. The trend is. Continuous evaluation gives you the time-series data to see it coming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-model comparison before you switch.&lt;/strong&gt; When you're considering upgrading to a newer model, you don't run a vibe check — you run your evaluation set against both models and compare accuracy, latency, format compliance, and cost. Data beats intuition every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt change impact.&lt;/strong&gt; Before you ship a prompt revision, run it against your evaluation set. If accuracy drops 8%, you know before it hits production. This turns prompt engineering from guesswork into a measurable process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The operating model shift:&lt;/strong&gt; Traditional software testing assumes your code is the variable and the dependencies are stable. LLM evaluation assumes the model is the variable and your test set is the stable ground truth. Both approaches are right — for their respective domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Set Up an Eval Pipeline
&lt;/h2&gt;

&lt;p&gt;The minimum viable eval pipeline has three components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A representative evaluation set.&lt;/strong&gt; 50–200 real inputs from production with labeled ground-truth outputs. Not synthetic examples — actual inputs your system has processed, labeled by a human or by a higher-quality model. This is your ground truth. It needs to be maintained as your product evolves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated daily runs.&lt;/strong&gt; A scheduled job that runs your evaluation set against your production model configuration and records the results: accuracy, format compliance, latency, token cost. Every run. Every day. Results stored in a queryable form so you can see trends, not just snapshots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression alerts.&lt;/strong&gt; Thresholds that trigger notifications when metrics degrade. A 5% accuracy drop. Format compliance falling below 95%. Average output length increasing by 40%. You define what "regression" means for your feature — the system tells you when it happens, before your users do.&lt;/p&gt;

&lt;p&gt;Building this yourself is straightforward in concept: a cron job, a database, some charting. The hard part is the operational overhead — keeping the evaluation set fresh, maintaining the infrastructure reliably, building alert logic that doesn't false-positive constantly. Most teams start, ship something workable, and watch it go stale over the following quarter because it's not a revenue-generating feature.&lt;/p&gt;
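&lt;p&gt;For scale, here is roughly what the "straightforward in concept" part looks like: a minimal daily run, with the provider call and alerting left as placeholders. Run it from a cron job or CI schedule and keep appending results.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// A minimal daily eval run: the three components above, sketched in one function.
// callModel stands in for your provider SDK; persistence is just a JSONL file here.
import { promises as fs } from "node:fs";

type EvalCase = { input: string; expected: string };
type ModelCall = (input: string) =&gt; Promise&amp;lt;string&amp;gt;;

const BASELINE_ACCURACY = 0.92; // set this from your first week of runs

async function runDailyEval(evalSet: EvalCase[], callModel: ModelCall) {
  let correct = 0;
  let totalLatencyMs = 0;

  for (const example of evalSet) {
    const start = Date.now();
    const output = await callModel(example.input);
    totalLatencyMs += Date.now() - start;
    if (output.trim().toLowerCase() === example.expected.toLowerCase()) correct += 1;
  }

  const run = {
    date: new Date().toISOString(),
    accuracy: correct / evalSet.length,
    avgLatencyMs: totalLatencyMs / evalSet.length,
  };

  // Append every run so you get a trend line, not a snapshot.
  await fs.appendFile("eval-history.jsonl", JSON.stringify(run) + "\n");

  // Regression alert: you define the threshold that counts as a regression for your feature.
  if (BASELINE_ACCURACY - run.accuracy &gt; 0.05) {
    console.warn("accuracy regression detected", run); // swap for Slack, PagerDuty, etc.
  }

  return run;
}
&lt;/code&gt;&lt;/pre&gt;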

&lt;p&gt;That's what Benchwright handles — continuous evaluation as infrastructure. Automated runs, regression detection, cross-model comparison, delivered as a service so the maintenance overhead isn't your problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Keep your unit tests. They're verifying real things — your parsing code, your API client, your error handling. But don't mistake a green test suite for confidence in your LLM feature's production behavior. Those tests were written against a frozen mock of a model that has since changed.&lt;/p&gt;

&lt;p&gt;The layer that's missing is continuous evaluation: real model calls, against a real evaluation set, on a real schedule, with real alerts when behavior changes. That's the layer that tells you what your test suite can't.&lt;/p&gt;

&lt;p&gt;If you're shipping LLM features and relying on CI to catch regressions, you're not monitoring a production system — you're hoping nothing changed since the last deploy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://benchwright.polsia.app/blog/unit-tests-llm" rel="noopener noreferrer"&gt;benchwright.polsia.app&lt;/a&gt; — Benchwright is an autonomous AI evaluator that continuously benchmarks production models — &lt;a href="https://benchwright.polsia.app/how-it-works" rel="noopener noreferrer"&gt;see how it works&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
