DEV Community: Dave Graham

How to Evaluate Your RAG Pipeline

Dave Graham — Sun, 24 May 2026 13:32:56 +0000

RAG has two places to fail: retrieval and generation. Most teams only catch one. Here's the complete evaluation framework.

Your RAG-powered feature returns a confident, well-formatted answer. The problem: it's wrong. Not hallucinated in an obvious way — it cites a real document, uses correct terminology, and sounds authoritative. But the document it retrieved was from six months ago, before the policy changed, and the answer is no longer valid.

This is the failure mode that makes RAG evaluation hard. Unlike a pure LLM where you're testing one system, RAG is a pipeline: a retriever that finds relevant context, and a generator that synthesizes an answer from that context. Each component can fail independently. Evaluating only the final answer misses half the problem.

This post covers the complete RAG evaluation framework: how to evaluate retrieval and generation separately, the three core metrics (context precision, faithfulness, answer relevance), the hyperparameters worth sweeping, and how to detect the silent failures that only appear in production.

Why RAG Requires Separate Evaluation of Retrieval and Generation

When a RAG system fails, the root cause is almost always one of two things: the retriever returned the wrong chunks, or the LLM generated an answer that wasn't supported by the chunks it received. These require different fixes, so you need to know which failure you're dealing with.

Consider the failure modes:

Retrieval failure: The relevant document exists in your corpus, but the retriever doesn't surface it. The LLM never sees the right context, so even a perfect generator can't produce the right answer.
Generation failure: The retriever returns excellent context, but the LLM ignores it or contradicts it — synthesizing an answer from prior training weights instead of the retrieved chunks.
Compound failure: The retriever returns partially relevant context, and the LLM extrapolates beyond what that context supports.

Evaluating only the final answer tells you that something went wrong. It doesn't tell you where to fix it. If your retrieval is broken, optimizing your prompt won't help. If your generation is hallucinating, upgrading your embedding model won't help.

The silent failure pattern: Teams often report their RAG system "works fine" because end-to-end answer quality is acceptable on their test set. But their test set only covers queries where retrieval is easy. On long-tail queries — niche topics, recent documents, ambiguous phrasing — retrieval silently degrades, and the LLM fills the gap with plausible-sounding fabrications.

The RAG Triad: The Core Evaluation Framework

The RAG Triad is the most widely used framework for RAG evaluation. It measures three relationships: between the query and retrieved context, between the context and the answer, and between the query and the answer.

Metric	What It Measures	Failure Signal
Context Relevance	Are the retrieved chunks relevant to the query?	Retriever returning noise
Faithfulness	Is the answer grounded in the retrieved context?	LLM hallucinating beyond context
Answer Relevance	Does the answer actually address the query?	Correct but non-responsive answers

All three metrics need to be high simultaneously. A high faithfulness score with low context relevance means the LLM faithfully reproduced irrelevant content. High answer relevance with low faithfulness means the LLM addressed the question but drew on something other than the retrieved context, so the answer may be fabricated even when it sounds right.

Retrieval Quality Metrics

Context Precision

Context precision measures what fraction of your retrieved chunks are actually relevant to the query. If you retrieve 5 chunks and only 2 are relevant, your precision is 0.4. Low precision means you're feeding your LLM noisy context — irrelevant information that increases the chance of confused or distracted generation.

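// Fraction of retrieved chunks that are labeled relevant.
function contextPrecision(retrievedChunks, relevantChunkIds) {
    if (retrievedChunks.length === 0) return 0; // avoid 0/0 when nothing was retrieved
    const relevantSet = new Set(relevantChunkIds);
    const relevant = retrievedChunks.filter(chunk => relevantSet.has(chunk.id));
    return relevant.length / retrievedChunks.length;
}

// LLM-as-judge labeling. evaluatorLLM.judge() stands in for your own evaluator
// client and is assumed to resolve to a string such as 'YES' or 'NO'.
async function scoreRetrieval(query, retrievedChunks, evaluatorLLM) {
    const relevanceJudgments = await Promise.all(
        retrievedChunks.map(chunk =>
            evaluatorLLM.judge({
                query,
                chunk: chunk.text,
                task: 'Is this chunk relevant to answering the query? Answer YES or NO.'
            })
        )
    );
    // Normalize the verdict so ' yes\n' or 'Yes.' still counts as relevant.
    const isRelevant = verdict => String(verdict).trim().toUpperCase().startsWith('YES');
    const relevantChunkIds = retrievedChunks
        .filter((_, i) => isRelevant(relevanceJudgments[i]))
        .map(c => c.id);
    return {
        precision: contextPrecision(retrievedChunks, relevantChunkIds),
        relevant_count: relevantChunkIds.length,
        total_retrieved: retrievedChunks.length
    };
}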

Context Recall

Context recall is the complement: of all the relevant chunks that exist in your corpus, how many did your retriever actually find? High precision with low recall means you're retrieving a clean set of relevant chunks, but missing others that could have improved the answer. You need both.

Recall requires knowing the ground truth — which chunks in your corpus are relevant for a given query. For evaluation sets, you build this ground truth manually (or with LLM-assisted annotation) on a representative sample of queries, then measure how often your retriever surfaces those chunks.
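
Once that ground truth exists, the recall calculation itself is small. Here is a minimal Python sketch, assuming each query has a labeled list of relevant chunk IDs; the function name and ID format are illustrative, not taken from any library.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of ground-truth relevant chunks that the retriever surfaced."""
    relevant = set(relevant_ids)
    if not relevant:
        # No labeled relevant chunks: recall is undefined, so say so explicitly.
        return None
    found = relevant & set(retrieved_ids)
    return len(found) / len(relevant)


# Example: three chunks are labeled relevant and the retriever found two of them.
print(context_recall(["c1", "c4", "c7", "c9"], ["c1", "c2", "c7"]))
```

Returning None for an unlabeled query keeps it from silently counting as a perfect or a zero score when you average across the eval set.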

Entity Recall

Entity recall is a cheaper proxy when full ground-truth annotation isn't feasible. Extract the named entities from the correct answer, then check what fraction of those entities appear in the retrieved context.

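// extractEntities() is assumed to be your own NER helper that returns an array of strings.
function entityRecall(answerText, retrievedChunksText) {
    const answerEntities = extractEntities(answerText);
    if (answerEntities.length === 0) {
        // Nothing to look for: report recall as null rather than dividing by zero.
        return { recall: null, found: [], missing: [] };
    }
    const contextText = retrievedChunksText.join(' ').toLowerCase();
    const foundEntities = answerEntities.filter(entity =>
        contextText.includes(entity.toLowerCase())
    );
    return {
        recall: foundEntities.length / answerEntities.length,
        found: foundEntities,
        missing: answerEntities.filter(e => !foundEntities.includes(e))
    };
}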

Retrieval evaluation shortcut: If you have ground-truth question-answer pairs, you can score retrieval without human chunk annotation. Use the gold answer to check whether the retrieved context contains the information needed to derive that answer — an LLM evaluator can judge this cheaply at scale.

Generation Quality Metrics

Faithfulness (Groundedness)

Faithfulness is the most critical generation metric. It measures whether every claim in the generated answer is supported by the retrieved context — not by the LLM's training data, not by plausible inference, but directly supported by what was retrieved.

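import json

# evaluator_llm.complete(prompt) is assumed to return the model's text response.
def score_faithfulness(answer, retrieved_context, evaluator_llm):
    # Step 1: split the answer into individual claims.
    claims_response = evaluator_llm.complete(f"""
    Break the following answer into individual factual claims.
    Return a JSON list of strings, each a single verifiable claim.
    Answer: {answer}
    """)
    # json.loads raises if the model wraps the list in extra text; handle that upstream.
    claims = json.loads(claims_response)
    if not claims:
        # No factual claims to check: return the same shape as the normal path.
        return {"faithfulness": 1.0, "claims_total": 0,
                "claims_supported": 0, "unsupported_claims": []}
    # Step 2: check each claim against the retrieved context.
    verdicts = []
    for claim in claims:
        verdict = evaluator_llm.complete(f"""
        Context: {retrieved_context}
        Claim: {claim}
        Is this claim directly supported by the context? Answer YES or NO only.
        """)
        # Tolerate casing and trailing punctuation such as "Yes." in the verdict.
        verdicts.append(verdict.strip().upper().startswith("YES"))
    faithful_count = sum(verdicts)
    return {
        "faithfulness": faithful_count / len(claims),
        "claims_total": len(claims),
        "claims_supported": faithful_count,
        "unsupported_claims": [c for c, v in zip(claims, verdicts) if not v]
    }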

Hallucination Detection

Hallucination is the inverse of faithfulness — it measures what fraction of the answer is fabricated. Pay special attention to precise factual claims: numbers, dates, names, percentages, and direct quotes. LLMs hallucinate specific details far more often than general concepts. A pipeline that scores 95% faithfulness overall might still be fabricating specific figures 20% of the time.

Answer Relevance

Answer relevance is deceptively tricky. An answer can be faithful to the context and still completely miss the question. This happens most often when the retrieved context is technically related but not directly responsive.

The evaluation approach: ask an LLM evaluator to generate the question that the answer is most directly responding to, then measure semantic similarity between that generated question and the original query.
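
A minimal Python sketch of that approach follows. It assumes you supply two callables of your own: `generate_question`, which asks an evaluator LLM for the question an answer seems to be responding to, and `embed`, which turns text into a vector. Both names are hypothetical placeholders; only the cosine similarity is concrete.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(query, answer, generate_question, embed):
    """Score how directly `answer` addresses `query`.

    generate_question(answer) -> str and embed(text) -> list[float] are
    hypothetical callables: plug in your own evaluator LLM and embedding model.
    """
    implied_question = generate_question(answer)
    return cosine_similarity(embed(query), embed(implied_question))
```

A score near 1.0 means the answer implies roughly the same question the user asked. Where to set the pass threshold depends on the embedding model, so calibrate it on a handful of answers you have judged by hand.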

The completeness trap: An answer can score high on faithfulness and answer relevance but still be dangerously incomplete. If the retrieved context only contains half the relevant information and the LLM faithfully summarizes that half, you get a confident, grounded, relevant — but incomplete — answer. This is why context recall matters: incompleteness starts at retrieval.

Hyperparameter Sweeps: What to Tune and How to Measure It

Chunk Size

Smaller chunks improve precision but hurt recall. Larger chunks improve recall but decrease precision and increase LLM context noise. The optimal chunk size varies by document type.

Sweep strategy: fix top-K, vary chunk size from 256 to 2048 tokens in steps, measure context precision and recall on your eval set. Pick the chunk size at the knee of the precision-recall curve.
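
One simple way to turn the sweep results into a decision, sketched in Python. Note the substitution: this picks the chunk size with the highest F1 (the harmonic mean of precision and recall), which is a convenient stand-in for eyeballing the knee of the curve, not the same thing. The function names and the tuple layout are assumptions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pick_chunk_size(sweep_results):
    """Pick the chunk size with the best precision/recall balance.

    sweep_results: list of (chunk_size, precision, recall) tuples measured
    on your eval set with top-K held fixed.
    """
    if not sweep_results:
        raise ValueError("sweep_results is empty")
    best = max(sweep_results, key=lambda row: f1(row[1], row[2]))
    return best[0]
```

If recall matters more than precision for your workload (for example, compliance answers that must not omit anything), a weighted F-score or a minimum-recall constraint would be a better selection rule than plain F1.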

Top-K Retrieval

Retrieving more chunks increases recall but decreases precision and can overwhelm the LLM context window. There's also a position bias effect in many LLMs — information near the beginning and end of the context is more likely to be used than information in the middle.

// Arithmetic mean of an array of numbers (0 for an empty array).
const mean = values =>
    values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;

// vectorStore.retrieve(), generateAnswer(), and the evaluator methods are
// placeholders for your own pipeline.
async function sweepTopK(evalQueries, vectorStore, evaluator) {
    const kValues = [1, 3, 5, 10, 15, 20];
    const results = [];
    for (const k of kValues) {
        const queryScores = await Promise.all(
            evalQueries.map(async ({ query }) => {
                const chunks = await vectorStore.retrieve(query, { topK: k });
                const answer = await generateAnswer(query, chunks);
                return {
                    contextPrecision: await evaluator.precision(query, chunks),
                    faithfulness: await evaluator.faithfulness(answer, chunks),
                    answerRelevance: await evaluator.relevance(query, answer)
                };
            })
        );
        results.push({
            k,
            // Equal-weight average of the three triad metrics for this k.
            composite: mean(queryScores.map(s =>
                (s.contextPrecision + s.faithfulness + s.answerRelevance) / 3
            ))
        });
    }
    return results;
}

Embedding Model

The embedding model determines how well semantic similarity maps to actual relevance for your domain. When evaluating embedding models, hold everything else constant and vary only the embedding model. Measure context recall and precision — you want to isolate the retrieval contribution.

Production Monitoring: Silent RAG Failures

Lab evaluation covers the queries you anticipated. Production covers everything else. Three signals worth monitoring:

Average context relevance score: If this drops, your corpus has likely grown stale or your query distribution has shifted.
Faithfulness rate: An uptick in low-faithfulness answers signals the LLM is increasingly ignoring context — often triggered by context window saturation or corpus contamination.
No-retrieval rate: Queries that return zero relevant chunks above your confidence threshold may receive fabricated responses instead of honest "I don't know" answers.

// mean(values) is assumed to be an averaging helper: sum of values / values.length.
class RAGMonitor {
    constructor(evaluator, alertThresholds) {
        this.evaluator = evaluator;
        this.thresholds = alertThresholds; // { faithfulness, contextRelevance }
        this.window = [];
    }
    async logQuery(query, retrievedChunks, answer) {
        const shouldEval = Math.random() < 0.05; // 5% sampling
        if (!shouldEval) return;
        const scores = {
            timestamp: Date.now(),
            context_relevance: await this.evaluator.precision(query, retrievedChunks),
            faithfulness: await this.evaluator.faithfulness(answer, retrievedChunks),
            answer_relevance: await this.evaluator.relevance(query, answer)
        };
        this.window.push(scores);
        if (this.window.length > 500) this.window.shift(); // keep the last 500 samples
        this.checkAlerts(scores);
        return scores;
    }
    checkAlerts(latestScore) {
        // Alert on the rolling average of the 50 most recent sampled queries.
        const recent = this.window.slice(-50);
        const avgFaithfulness = mean(recent.map(s => s.faithfulness));
        const avgRelevance = mean(recent.map(s => s.context_relevance));
        if (avgFaithfulness < this.thresholds.faithfulness) {
            this.alert('FAITHFULNESS_DEGRADATION', { avg: avgFaithfulness });
        }
        if (avgRelevance < this.thresholds.contextRelevance) {
            this.alert('RETRIEVAL_DEGRADATION', { avg: avgRelevance });
        }
    }
    // Placeholder sink: replace with your paging or chat integration.
    alert(type, details) {
        console.warn(`[RAGMonitor] ${type}`, details);
    }
}

The corpus drift problem: Re-run your evaluation suite monthly. Treat a 5% drop in context recall as a deployment-blocking regression, not a minor inconvenience.

How Benchwright Handles RAG Evaluation

The evaluation framework above requires infrastructure: an evaluation dataset, an LLM evaluator, a metrics pipeline, and a monitoring layer. Building this from scratch adds weeks of engineering time that isn't core to your product.

Benchwright provides this infrastructure out of the box. Connect your RAG pipeline — any vector store, any LLM generator — define your evaluation dataset, and Benchwright runs the full RAG Triad evaluation on a schedule. You get context precision, context recall, faithfulness, and answer relevance scores tracked over time, with regression alerts when any metric drops below your threshold.

When you change your chunk size, swap embedding models, or update your system prompt, Benchwright automatically re-evaluates and flags regressions before they reach users. No spreadsheets, no manual scoring runs, no finding out from support tickets.

Originally published on Benchwright

→ Evaluate your RAG pipeline on Benchwright — free, no credit card required

How to A/B Test LLM Prompts Without Breaking Production

Dave Graham — Fri, 15 May 2026 12:54:45 +0000

Prompt changes break production more than model updates. Here's how to test them safely.

Your AI customer support bot starts returning wrong refund policies. The document parser starts stripping legal disclaimers. The code reviewer starts approving things it shouldn't. None of the models changed. You changed the prompt.

Prompt changes are the #1 source of LLM regressions in production. Model updates are visible — you get a changelog, a version bump, an announcement. Prompt changes are silent. You edit a string, deploy it, and find out three days later when a customer screenshots your bot saying something it shouldn't.

The fix is not "be more careful with prompts." The fix is a testing pipeline that treats prompt changes like code changes: run them against a benchmark, measure the impact, ship only when you have evidence.

The Naive Approach (And Why It Fails)

The typical workflow looks like this: PM says "the bot should mention our SLA," engineer adds one sentence to the system prompt, deploys, checks the output on three test cases, calls it done. Three weeks later someone notices the bot now refuses to process invoices over $500.

The problem isn't the engineer. The problem is the process. Testing three cases is not testing. A prompt that works on your three test cases might behave completely differently on the other 10,000 inputs your users will send. And you won't notice until the damage is done.

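The math: A prompt that improves performance by 5% on 90% of inputs but degrades badly on the other 10% will often feel fine in a 10-sample test. A random 10-sample draw contains none of the failing inputs about 35% of the time (0.9^10 ≈ 0.35), and a single bad output in ten is easy to dismiss. In production, 10% of thousands of daily requests is hundreds of broken interactions per day.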
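
The arithmetic behind that warning, as a short Python sketch. The 2,000 requests per day is an assumed figure standing in for "thousands"; substitute your own traffic.

```python
def prob_test_misses_failures(failure_share, sample_size):
    """Probability that a random sample contains no input from the failing slice."""
    return (1 - failure_share) ** sample_size


def expected_daily_failures(failure_share, daily_requests):
    """Expected number of requests per day that land in the failing slice."""
    return failure_share * daily_requests


# A 10-sample test against a prompt that fails on 10% of inputs.
print(round(prob_test_misses_failures(0.10, 10), 3))  # 0.349

# Assumed traffic of 2,000 requests per day.
print(expected_daily_failures(0.10, 2000))
```

So roughly one in three 10-sample tests would not even touch the broken slice, while production at that assumed volume would hit it about 200 times a day.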

Shadow Testing: Run Both Prompts Before You Commit

Shadow testing is the safest way to evaluate a prompt change. You run the new prompt in the background, alongside the old one, on the same inputs, and compare outputs before switching. No users see the new prompt until you have data.

The setup:

Route a sample of production traffic (or your evaluation dataset) to both the control prompt and the treatment prompt
Score both outputs on your success criteria (accuracy, format compliance, relevance)
Compare aggregate results after N samples
If the new prompt is better (or at least not worse), switch. If it regresses, diagnose and iterate.
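
The steps above can be sketched as a small Python loop. `run_control`, `run_treatment`, and `score` are placeholders for your own model calls and validators, not functions from a library.

```python
def shadow_compare(inputs, run_control, run_treatment, score):
    """Run both prompts on the same inputs and collect per-input scores.

    run_control / run_treatment: callables mapping an input to a model output.
    score: callable mapping (input, output) to a number, higher is better.
    All three are placeholders for your own pipeline.
    """
    rows = []
    for item in inputs:
        control_out = run_control(item)
        treatment_out = run_treatment(item)  # shadow output, never shown to users
        rows.append({
            "input": item,
            "control_score": score(item, control_out),
            "treatment_score": score(item, treatment_out),
        })
    n = len(rows)
    summary = {
        "n": n,
        "control_mean": sum(r["control_score"] for r in rows) / n if n else None,
        "treatment_mean": sum(r["treatment_score"] for r in rows) / n if n else None,
    }
    return rows, summary
```

Keeping the per-input rows, not just the means, is what lets you later ask which categories of input the new prompt handles worse.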

How Many Samples Do You Need?

LLM outputs are variable. The same prompt with the same input can return different answers. So how many samples before you can trust your results?

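The answer depends on the effect size you want to detect. If you're looking for a 5% improvement, you need more samples than if you're looking for a 20% improvement. Here's a rough framework for a pass/fail metric with a baseline around 90%, comparing two independent sets of outputs at 95% confidence:

10-20 samples: Catch catastrophic regressions (new prompt returns garbage 80%+ of the time). Not enough for anything subtle.
50-100 samples: Detect large effects (roughly a 15-20 point accuracy change). Minimum viable for production decisions, but treat smaller differences at this size as inconclusive.
200-500 samples: Detect moderate effects (roughly 6-10 points). Small effects (1-3 points), the kind that matter for cost-sensitive, high-volume features, take thousands of samples per prompt, or a paired comparison on identical inputs, which reduces the variance you have to overcome.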

A practical rule: if you don't have enough samples to be statistically confident, wait. Run more evaluations. The cost of running 200 extra evaluation samples is $2-5 depending on your model. The cost of shipping a broken prompt to thousands of users is much higher.
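
A rough way to check a sample budget before running a test, sketched in Python. It uses the standard normal-approximation formula for comparing two pass rates with independent samples. The function name and the defaults (5% two-sided significance, 80% power) are assumptions for illustration.

```python
import math


def samples_per_prompt(baseline_rate, treatment_rate, z_alpha=1.96, z_power=0.84):
    """Approximate samples needed per prompt to detect a change in pass rate.

    z_alpha=1.96 corresponds to two-sided 5% significance and z_power=0.84
    to 80% power. Assumes independent samples for each prompt.
    """
    delta = abs(treatment_rate - baseline_rate)
    if delta == 0:
        raise ValueError("rates must differ")
    variance = (baseline_rate * (1 - baseline_rate)
                + treatment_rate * (1 - treatment_rate))
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


# Detecting a drop from 90% to 85% accuracy versus a drop from 90% to 70%.
print(samples_per_prompt(0.90, 0.85))
print(samples_per_prompt(0.90, 0.70))
```

Scoring both prompts on the same inputs and analyzing the paired differences usually needs fewer samples than this estimate, because per-input difficulty cancels out.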

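Statistical significance for non-deterministic outputs: LLM scores vary from run to run, so compare confidence intervals rather than raw averages. Use the standard error of the mean to build a 95% CI for the difference between the new prompt and the old one. If that interval includes zero, you don't have evidence to switch yet. (Checking whether the two prompts' individual CIs overlap is a stricter, more conservative version of the same test.)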
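
A minimal Python sketch of that comparison, using only the standard library. It assumes per-example scores (for instance 1.0 for pass, 0.0 for fail), independent samples, and a normal approximation; the function names are illustrative.

```python
import math
from statistics import mean, stdev


def diff_confidence_interval(control_scores, treatment_scores, z=1.96):
    """Approximate 95% CI for mean(treatment) - mean(control).

    Normal approximation with independent samples; each list holds
    per-example scores such as 1.0 (pass) or 0.0 (fail).
    """
    if len(control_scores) < 2 or len(treatment_scores) < 2:
        raise ValueError("need at least two scores per prompt")
    diff = mean(treatment_scores) - mean(control_scores)
    # Standard error of the difference between two independent means.
    se = math.sqrt(
        stdev(control_scores) ** 2 / len(control_scores)
        + stdev(treatment_scores) ** 2 / len(treatment_scores)
    )
    return diff - z * se, diff + z * se


def verdict(ci):
    """Turn the interval into a ship / hold decision."""
    low, high = ci
    if low > 0:
        return "treatment better"
    if high < 0:
        return "treatment worse"
    return "inconclusive"
```

An "inconclusive" result is a signal to collect more samples, not a license to ship.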

Building a Prompt A/B Pipeline

A production pipeline for prompt testing has four stages. Automate them and you can ship prompt changes with confidence instead of fingers crossed.

Stage 1: Evaluation Dataset

You need a test set that represents real production inputs. Not cherry-picked examples — real distribution. If your support bot handles 50 categories of requests, your test set should cover all 50, weighted by frequency.

Stage 2: Parallel Evaluation

Run control and treatment prompts against the full dataset. Score each output with your validators. Store results with enough metadata to reproduce — prompt version, model, timestamp, input, output, score.
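
One possible shape for those stored results, sketched in Python: one flat record per scored output, written as a JSON line. The field names mirror the metadata listed above; the class and helper names are assumptions.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvalRecord:
    prompt_version: str
    model: str
    timestamp: str  # ISO 8601, UTC
    input: str
    output: str
    score: float


def make_record(prompt_version, model, input_text, output_text, score):
    """Build a record stamped with the current UTC time."""
    return EvalRecord(
        prompt_version=prompt_version,
        model=model,
        timestamp=datetime.now(timezone.utc).isoformat(),
        input=input_text,
        output=output_text,
        score=score,
    )


def to_jsonl_line(record):
    """Serialize one record as a single JSON line for append-only storage."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Append-only JSON lines are easy to diff and re-aggregate later; a database table with the same columns works just as well.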

Stage 3: Statistical Comparison

Aggregate results and run a comparison test. The key question: is the new prompt better, worse, or inconclusive? Not "does the average go up" — does the distribution of outcomes improve?

Stage 4: Staged Rollout

Don't switch from 0 to 100% in one deploy. Roll out in stages: 5% → 25% → 50% → 100%, with monitoring at each stage. If you see error rates spike or customer satisfaction drop, roll back to 100% control before investigating.
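
A common way to implement the percentage split is deterministic bucketing, sketched here in Python. Hashing a stable identifier means a given user stays on the same prompt as the percentage grows. Bucketing by user ID is an assumption; a session or account ID works the same way.

```python
import hashlib

ROLLOUT_STAGES = [5, 25, 50, 100]  # percent of traffic on the treatment prompt


def bucket_for(user_id):
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_treatment(user_id, rollout_percent):
    """True if this user should get the treatment prompt at the current stage.

    Setting rollout_percent to 0 sends everyone back to control (rollback).
    """
    return bucket_for(user_id) < rollout_percent
```

Because the bucket never changes, everyone included at 5% is still included at 25%, so users do not flip back and forth between prompts as the rollout advances.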

Metrics That Actually Matter

Not all metrics are created equal. Here's what to track in your prompt A/B tests:

Metric	What It Tells You	Alert Threshold
Task Accuracy	Does the model do the right thing?	Drop > 2% vs control
Format Compliance	Does output parse correctly?	Drop below 95%
Latency p95	Is response time still acceptable?	Increase > 30%
Cost per Query	Token usage vs output quality	Increase without accuracy gain
Hallucination Rate	Does it make things up?	Any increase > 0.5%
Output Consistency	Same input → same output?	Drop in consistency score

What Most Teams Get Wrong

Testing on the same inputs they used to develop the prompt. If you iterated on your prompt by testing on examples A, B, and C, and those are the same examples in your test set, you're measuring memorization, not generalization. Your test set needs to be separate from your development set.

Only measuring accuracy. A prompt can score higher on accuracy but take 3x longer and use 5x more tokens. Measure cost and latency alongside quality, or you'll optimize one axis and destroy another.

Not tracking regression direction. When the new prompt loses, you need to know why. Is it worse on specific input categories? Does it handle edge cases worse but normal cases better? Without this data, the next iteration is guesswork.

The failure mode nobody talks about: Prompts that improve average case but introduce catastrophic failure modes. The new prompt might score 3% higher overall but make the model confidently wrong in ways that cause real harm (legal advice, medical guidance, financial decisions). Catch these in your test set, not in production.

Benchwright Makes This Automatic

This is the workflow Benchwright implements for you. Define your evaluation dataset, set your metrics, pick your test prompts, and Benchwright runs the shadow testing, statistical comparison, and staged rollout — tracking all six metrics in real time.

When a prompt change shows a regression, you get an alert before it hits production. When it looks good, you get a clear signal to proceed. No spreadsheet juggling, no "I tested it locally so it should be fine."

Ready to Test Prompt Changes Safely?

Benchwright runs shadow tests, measures all six key metrics, and gates deployments on statistical evidence. No more shipping prompts and hoping for the best.

Start Evaluating → Free evaluation, no credit card required

How to Detect LLM Model Regressions Before They Hit Production

Dave Graham — Tue, 12 May 2026 12:48:22 +0000

When LLM providers push model updates, output quality silently degrades. Here's how to catch regressions before they reach users.

You deploy on Tuesday. Everything works. Wednesday morning, an LLM provider pushes a model patch. Thursday your Slack channel explodes with reports that your AI features are returning nonsense.

This happens constantly. GPT-4o mini gets a stealth improvement that breaks your prompt assumptions. Claude adds better instruction-following that changes how it parses your structured output. Gemini's latency swings by 2x overnight. The updates are usually good for the provider's metrics, but they're invisible to your production system until they break something.

The fix is not to panic after the fact — it's to catch regressions before they matter. This means building a detection system that runs continuously, evaluates your features against your actual success criteria, and alerts you the moment an update hurts your performance.

Why Model Regressions Happen

LLM providers update models constantly. Most updates are invisible: a weights patch, a tokenizer tweak, a system prompt adjustment. Providers validate these changes against their own metrics, not your workload, so you won't know whether one affects your specific use case until it breaks.

The hidden cost: Every day without regression detection is a day your production system could be degraded without you knowing.

The Four Pillars of Regression Detection

1. Baseline Scoring

Before you deploy a feature, know what "good" looks like. Run your evaluation suite against the current model and capture baseline metrics: accuracy = 94.2%, latency p95 = 1.8s, output format compliance = 100%.

2. Automated Regression Tests

Run your evaluation suite on a schedule. Daily is ideal. Focus on your actual success criteria:

Accuracy metrics
Format compliance
Latency thresholds
Edge cases

3. Shadow Scoring

When a new model version is released, run both old and new models on the same test set in parallel. Shadow scoring gives you hard data before you commit to switching.

4. Alert Thresholds

Define numeric thresholds for alerts:

Accuracy drops more than 2% → investigate
Format compliance below 95% → critical alert
Latency p95 increases 50%+ → investigate
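
The thresholds above can be wired into a daily check along these lines. This is a Python sketch with illustrative key names, and it reads the 2% accuracy drop as 2 percentage points, which is an assumption about the intended meaning.

```python
def check_regression(baseline, current):
    """Compare current metrics to the baseline and return a list of alerts.

    Both arguments are dicts with keys accuracy (0-1), format_compliance (0-1),
    and latency_p95 (seconds). Key names are illustrative.
    """
    alerts = []
    # Assumption: "drops more than 2%" means 2 percentage points.
    if baseline["accuracy"] - current["accuracy"] > 0.02:
        alerts.append(("investigate", "accuracy dropped more than 2 points"))
    if current["format_compliance"] < 0.95:
        alerts.append(("critical", "format compliance below 95%"))
    if current["latency_p95"] >= 1.5 * baseline["latency_p95"]:
        alerts.append(("investigate", "latency p95 up 50% or more"))
    return alerts
```

An empty list means the run is within tolerance; anything else can be posted to your alert channel with the severity attached.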

Implementation Roadmap

Week 1: Run evaluation suite 100+ times. Capture baseline metrics.

Week 2: Set up daily cron job that runs evaluation and posts to Slack.

Week 3: Define thresholds and wire them into the daily test.

Week 4: When new model version releases, set up shadow scoring on 5% of traffic.

Benchwright Makes This Automatic

Benchwright runs your evaluation suite continuously, detects regressions automatically, and alerts you before production breaks.

When a new model version is released, shadow-score it against your current model in Benchwright's interface. Get side-by-side comparison: which model is more accurate, faster, cheaper, more consistent. Flip a switch and move to the new one.

Start Evaluating Now → Free evaluation, no credit card required

LLM API Pricing Trends Q2 2026 — Who Got Cheaper, Who Got Expensive

Dave Graham — Fri, 08 May 2026 13:59:15 +0000

The LLM market has repriced dramatically since early 2025. Frontier intelligence that cost $10/M input tokens 18 months ago now runs $1–3/M. Budget tiers have hit $0.10/M. But not every direction is down — Anthropic's budget tier got more expensive when Haiku 3 retired. Here's the full picture.

If you haven't re-evaluated your model selection in the past six months, you are almost certainly overpaying. The LLM pricing landscape has moved more in Q1–Q2 2026 than in most full calendar years before it. Multiple flagship models dropped 50–80% in price. New model generations entered with competitive pricing from day one. And a few quiet deprecations pushed some teams onto more expensive tiers without noticing.

This is a full-provider pricing audit as of May 2026 — what changed, by how much, and what it means for production workloads. All pricing reflects published API rates. Use the Benchwright /compare tool to model your specific call volume and token mix.

The Full Pricing Table — Q2 2026

Every major provider, current rates, with change indicators versus late 2025 prices.

Provider	Model	Input ($/1M)	Output ($/1M)	vs Late 2025
OpenAI	GPT-4o	$2.50	$10.00	−50% input
OpenAI	GPT-4o mini	$0.15	$0.60	Stable
OpenAI	GPT-4.1	$2.00	$8.00	NEW
OpenAI	GPT-4.1 Nano	$0.10	$0.40	NEW
OpenAI	o3	$2.00	$8.00	−80%
OpenAI	o4-mini	$1.10	$4.40	NEW
OpenAI	GPT-5	$1.25	$10.00	NEW
Anthropic	Claude Haiku 3 (retired)	$0.25	$1.25	EOL Apr 19
Anthropic	Claude Haiku 3.5	$0.80	$4.00	Stable
Anthropic	Claude Haiku 4.5	$1.00	$5.00	NEW
Anthropic	Claude Sonnet 4.6	$3.00	$15.00	NEW
Anthropic	Claude Opus 4.6	$5.00	$25.00	NEW
Google	Gemini 2.0 Flash	$0.10	$0.40	EOL Jun 1
Google	Gemini 2.5 Flash-Lite	$0.10	$0.40	NEW
Google	Gemini 2.5 Flash	$0.30	$2.50	NEW
Google	Gemini 2.5 Pro (≤200K)	$1.25	$10.00	−25% vs 1.5 Pro
Mistral	Mistral Small 3.1	$0.10	$0.30	−75%
Mistral	Mistral Large 3	$2.00	$6.00	−50%
xAI	Grok 4.3	$1.25	$2.50	−83% output
DeepSeek	DeepSeek V3.2	$0.28	$0.42	Stable
DeepSeek	DeepSeek R1	$0.55	$2.19	Stable
Meta	Llama 4 Maverick (Together)	$0.15	$0.60	NEW
Cohere	Command A	$2.50	$10.00	NEW flagship

Who Got Cheaper (and by How Much)

OpenAI — Aggressive Repricing Across the Board

OpenAI has made the most dramatic pricing moves of any major provider in 2026. GPT-4o input dropped from $5/M to $2.50/M in a cut that happened quietly in mid-2025 and held into Q2 2026. The bigger story is o3: at launch it was priced at $10 input / $40 output per million tokens. It now sits at $2/$8 — an 80% reduction in under a year.

The GPT-4.1 family is the other structural change. GPT-4.1 Nano at $0.10/$0.40 matches Gemini 2.5 Flash-Lite on price with OpenAI's ecosystem familiarity. GPT-5 launched at $1.25 input / $10.00 output — cheaper input than GPT-4o was a year ago, with better capability.

The o3 repricing signal: When a reasoning model drops 80% in price in one year, it's not a product decision — it's a statement about where compute costs are heading. Reasoning at scale is becoming economically viable for production workloads that would have been cost-prohibitive in 2024.

xAI Grok — Biggest Single-Cut Story of Q2

Grok 4.3 launched around April 30, 2026 at $1.25/$2.50 — replacing Grok 3 at $3/$15. That's an 83% reduction in output cost for the flagship model. The output price of $2.50/M puts it well below GPT-4o and Claude Sonnet on the same dimension, while the 1M context window is a meaningful differentiator for long-document workloads.

xAI still has a thin track record on production reliability compared to OpenAI and Anthropic. But at these prices, it warrants a place in your evaluation set.

Mistral — Steady Downward Drift

Mistral Large went from ~$4/$12 (Large 2) to ~$2/$6 (Large 3) — roughly a 50% reduction. Mistral Small 3.1 at $0.10/$0.30 is now one of the cheapest options from a European provider, useful for teams with data residency constraints or who want provider diversification.

Google — New Generation, Better Value

Gemini 2.5 Pro at $1.25/$10 keeps the input price of Gemini 1.5 Pro ($1.25/$5, now deprecated) while charging more for output, so its case rests on capability per dollar rather than a lower rate. Gemini 2.5 Flash at $0.30/$2.50 is the interesting one: it has 1M context, solid multimodal capabilities, and a price point that makes it viable as a default for many production workloads that previously defaulted to GPT-4o mini.

Who Got More Expensive (and Why)

Not all movement was down. Two situations quietly raised costs for teams that weren't paying attention.

Anthropic's Budget Tier Repriced Upward

Claude Haiku 3 — priced at $0.25 input / $1.25 output — retired on April 19, 2026. Teams that didn't migrate were bumped to Claude Haiku 3.5 at $0.80/$4.00 or Claude Haiku 4.5 at $1.00/$5.00.

That's a 3–4× cost increase on output tokens for anyone who didn't notice the deprecation. At 10,000 calls/day with 400 completion tokens:

Claude Haiku 3: ~$150/month in output costs
Claude Haiku 3.5: ~$480/month in output costs
Claude Haiku 4.5: ~$600/month in output costs
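
The figures above follow from a short calculation, shown here as a Python sketch. It assumes a 30-day month and counts output tokens only, using the per-million output prices from the pricing table.

```python
def monthly_output_cost(calls_per_day, output_tokens_per_call, price_per_million, days=30):
    """Output-token spend per month in dollars, assuming a 30-day month."""
    tokens_per_month = calls_per_day * output_tokens_per_call * days
    return tokens_per_month / 1_000_000 * price_per_million


# 10,000 calls/day at 400 completion tokens each, priced per 1M output tokens.
for name, price in [
    ("Claude Haiku 3", 1.25),
    ("Claude Haiku 3.5", 4.00),
    ("Claude Haiku 4.5", 5.00),
]:
    print(name, monthly_output_cost(10_000, 400, price))
```

That workload produces 120M output tokens a month, which is where the $150, $480, and $600 totals come from. Input tokens add to each figure.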

If your budget was built around Haiku 3 and you weren't monitoring costs, this was a silent 3× increase that hit on a specific date. This is exactly the kind of change Benchwright's continuous monitoring flags — not a regression in output quality, but a pricing event that changes your cost structure overnight.

Action required if you're on Haiku 3: The model retired April 19. If you haven't migrated, you're either hitting errors or being routed to a replacement. Check your API costs from the past 30 days against the prior 30 days — the jump will be visible.

The Hidden Cost of Not Re-Evaluating

Claude 3 Opus ($15/$75) is still technically accessible but has been functionally superseded by Claude Opus 4.6 ($5/$25). Teams still running Opus 3 are paying 3× the output cost for an older model. That's not a price increase from Anthropic — it's a failure to migrate that creates the same effect.

Same pattern with GPT-4 Turbo ($10/$30) vs GPT-4o ($2.50/$10): a 75% savings on input and about 67% on output is sitting there for teams that haven't updated their model string.

What This Means for Production Workloads

The Budget Tier Is Now Genuinely Capable

In 2024, "cheap" meant compromising significantly on quality. In Q2 2026, GPT-4.1 Nano at $0.10/$0.40, Gemini 2.5 Flash-Lite at $0.10/$0.40, and Mistral Small 3.1 at $0.10/$0.30 are all significantly more capable than what was considered "flagship" 18 months ago.

For classification, extraction, summarization, and light reasoning tasks, defaulting to a $0.10/M input model and validating the quality tradeoff is the right starting point — not the fallback.

Reasoning Models Are Becoming Viable at Scale

o3 at $2/$8 and o4-mini at $1.10/$4.40 are priced in the same range as non-reasoning frontier models from a year ago. For workloads that benefit from chain-of-thought — complex code generation, multi-step data extraction, decision support — the price delta versus a standard model no longer represents a major budget line item.

Provider Diversification Has Real Risk-Adjusted Value

The Haiku 3 retirement is a reminder: when you build a production workload on a single provider's specific model, that provider controls your cost structure. DeepSeek at $0.28/$0.42 and Mistral Small at $0.10/$0.30 are real alternatives for teams with high-volume, quality-tolerant workloads. The diversification is not just about price — it's about not having your budget repriced by a deprecation decision you didn't see coming.

Caching and Batching Discounts Are Now Universal

Every major provider now offers batch API discounts (typically 50%) and prompt cache discounts (typically 50–90% on cache hits). For production workloads with repeated system prompts, few-shot examples, or shared context — and that's most of them — effective rates are half to one-tenth of the published prices. If you're not using caching, your real cost is roughly double what it should be.
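
How much caching actually saves depends on what share of your input tokens hit the cache. A Python sketch of the blended rate follows, assuming a simple model with one discount on cached tokens and no cache-write surcharge; providers differ on the details, so check their pricing pages.

```python
def effective_input_price(list_price, cached_fraction, cache_discount):
    """Blended $/1M input tokens when part of each prompt is served from cache.

    cached_fraction: share of input tokens that hit the cache (0-1).
    cache_discount: discount on cached tokens (0.5 means 50% off).
    Ignores any cache-write surcharge a provider may apply.
    """
    if not 0 <= cached_fraction <= 1 or not 0 <= cache_discount <= 1:
        raise ValueError("fractions must be between 0 and 1")
    cached = cached_fraction * list_price * (1 - cache_discount)
    uncached = (1 - cached_fraction) * list_price
    return cached + uncached


# A $2.00/M model where 80% of input tokens are a cached system prompt at 90% off.
print(effective_input_price(2.00, 0.80, 0.90))
```

The effective rate only approaches one-tenth of list price when nearly all input tokens are cache hits, so measure your cached share before budgeting around the headline discount.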

The Headline Number

GPT-4 launched in March 2023 at $30 input / $60 output per million tokens. GPT-5 is available today at $1.25 input / $10 output. That's a 96% reduction in input cost in just over three years.

More practically: GPT-4o class intelligence — the quality benchmark for production AI in 2024 — is now available from multiple providers at $1–3/M input. The question is no longer "can we afford to use a capable model?" It's "which capable model fits our workload, and are we measuring it continuously enough to catch the moment that answer changes?"

Prices will keep moving. The model you benchmarked last quarter is not the best option today, and the pricing you budgeted last quarter is not the right number to plan against. The only reliable approach is to keep measuring — which is what the Benchwright /compare tool is built for.

The Three Decisions to Make Now

1. Check whether any model you're running has been deprecated or repriced. Haiku 3 retired April 19. Gemini 2.0 Flash retires June 1. GPT-4 Turbo and Claude 3 Opus are legacy cost centers. If you haven't explicitly confirmed your current model strings against provider documentation in the past 60 days, do it today.

2. Add Gemini 2.5 Flash and GPT-4.1 Nano to your next evaluation run. These two represent the best value points in the Q2 2026 market for high-volume workloads. Most teams haven't evaluated them yet. The teams that have are surprised by the quality-to-cost ratio.

3. Enable prompt caching if you haven't already. If your workload has any repeated context — system prompts, instructions, few-shot examples — you're likely paying 2× what you should be. The implementation is usually a single flag or a minor API change.

Compare these models live in the interactive calculator → benchwright.polsia.app/compare

5 Metrics That Actually Matter When Evaluating LLM Providers

Dave Graham — Thu, 07 May 2026 12:44:59 +0000

Most teams pick LLM providers based on demos and vibes. Here's the evaluation framework that separates good choices from expensive ones.

When teams evaluate LLM providers, they almost always do it wrong. They run a prompt, compare the outputs, pick the one that sounds best, and move on. Three months later they're dealing with inconsistent behavior, unexpected cost spikes, or mysterious accuracy drops they can't explain.

The problem isn't the evaluation — it's that they're measuring the wrong things. Output quality in a controlled test is not the same as output quality in production. What matters is what happens over time, at scale, under variance. Here's what to actually measure.

The 5 Metrics That Matter

Metric	What It Tells You	Target Range
Accuracy Consistency	Does the model perform the same on identical inputs over time?	CV < 5% across daily runs
Latency p95	What's your 95th percentile response time?	< 2s for most tasks
Cost per Eval	What's your evaluation cost per test run?	Track trend, not absolute
Regression Frequency	How often does behavior change unexpectedly?	Monthly or less
Format Compliance Rate	Does output match your expected structure?	> 98% for structured tasks

1. Accuracy Consistency

Accuracy on day one means nothing if it drifts on day 30. Accuracy consistency is the coefficient of variation in your evaluation scores across repeated runs over weeks. A model that scores 91% Monday and 88% Friday is less consistent than one that holds 89–90% every day.

This is different from raw accuracy. A model could be consistently mediocre — always 82% — and that's stable. But if it's 95% one week and 80% the next, you can't trust it in production even if the average looks fine.

To measure this: run your evaluation set at the same time every day for at least two weeks. Plot the daily accuracy scores. If the variance is high with no external cause (no model update, no prompt change), that's a consistency problem — not a bad model, just an unstable one for your use case.
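A minimal sketch of that measurement, assuming you already have one accuracy score per daily run. The sample data is made up, and using the sample standard deviation is a choice rather than a requirement.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores: list[float]) -> float:
    """Sample standard deviation divided by the mean, as a percentage."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure variation")
    avg = mean(scores)
    if avg == 0:
        raise ValueError("mean accuracy is zero; CV is undefined")
    return stdev(scores) / avg * 100.0


# Hypothetical daily accuracy scores for two models over one week.
steady = [0.89, 0.90, 0.89, 0.90, 0.90, 0.89, 0.90]
swingy = [0.95, 0.80, 0.94, 0.81, 0.95, 0.80, 0.94]

for name, scores in (("steady", steady), ("swingy", swingy)):
    print(f"{name}: mean={mean(scores):.3f} CV={coefficient_of_variation(scores):.1f}%")
```

The two series have similar averages, but only the first stays under a 5% CV target.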

How to use it: Run accuracy consistency alongside any model upgrade evaluation. Even if a new model scores higher on average, flag it if consistency degrades — variance is invisible until it hits a critical moment in production.

2. Latency p95

Average latency lies. A model that averages 800ms but spikes to 4 seconds during peak load is worse than one that averages 1.2s but stays within 1.5s. p95 latency — the response time at the 95th percentile — tells you what your users actually experience.

Why p95 and not p99? p99 is so dominated by cold starts and rare events that it doesn't reflect user experience. p95 is where you start seeing the tail that impacts real users, not infrastructure anomalies.

Measure this in production, not just in your evaluation environment. Your eval harness probably isn't sending concurrent requests. Production will — and that's when latency compounds.

Watch for patterns: does latency creep up over the month? Does it spike on certain time windows? Provider infrastructure changes over time, and p95 trends are the canary.
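For a quick check on your own logs, here is a small sketch of p95 using the nearest-rank method. That is one common percentile definition among several, and the latency samples are invented for illustration.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical: 20 requests, 18 fast and 2 slow.
latencies_ms = [800] * 18 + [4000] * 2

print("mean:", mean(latencies_ms))
print("p50: ", percentile(latencies_ms, 50))
print("p95: ", percentile(latencies_ms, 95))
```

In this sample the mean is 1,120ms while p95 is 4,000ms, which is the gap the average hides.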

3. Cost per Evaluation Run

Token cost is easy to track. Cost per eval run is what it actually costs you to run your full evaluation suite — all prompts, all inputs, all output processing. This compounds quickly.

If you're running 200 evaluation inputs daily at 500 tokens in and 150 out at $3/1M tokens, that's about $0.39/day. That sounds trivial. But run that across 5 different model configurations you're comparing, and you're at $2/day — $730/year before you ship a single feature. Some teams are running eval costs in the thousands monthly without realizing it.
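The arithmetic above can be sketched in a few lines. It follows the paragraph's simplification of one blended $3/1M rate for input and output tokens; the function name is an assumption.

```python
def eval_run_cost(inputs: int, tokens_in: int, tokens_out: int,
                  price_per_million: float) -> float:
    """Cost of one evaluation run at a single blended token price."""
    total_tokens = inputs * (tokens_in + tokens_out)
    return total_tokens * price_per_million / 1_000_000


daily_one_config = eval_run_cost(200, 500, 150, 3.00)
daily_five_configs = daily_one_config * 5
yearly_five_configs = daily_five_configs * 365

print(f"one config:   ${daily_one_config:.2f}/day")
print(f"five configs: ${daily_five_configs:.2f}/day")
print(f"five configs: ${yearly_five_configs:.2f}/year")
```

Unrounded, five configurations come to $1.95/day and about $712/year. The $730 figure comes from rounding to $2/day first.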

Track this metric not to minimize it but to make it visible. Once you see the real cost, you can make informed tradeoffs: do you need 200 inputs or is 50 statistically equivalent for your use case? Can you run the full suite weekly instead of daily?

Rule of thumb: If your evaluation cost per month exceeds your expected savings from switching models (e.g., cheaper per token), re-examine your eval strategy. Evaluations should inform decisions, not become a budget line item.

4. Regression Frequency

This is the hardest metric to measure but the most important. Regression frequency is how often the model changes behavior in ways that affect your production output — without notice from the provider.

Providers don't announce every fine-tune. Safety updates, cost optimizations, capability shifts — these happen continuously and silently. Regression frequency tracks how many times your evaluation metrics moved outside normal variance in a given period. If you see a 3%+ accuracy drop with no code or prompt change on your end, that counts as a regression event.
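One way to count regression events, as a rough sketch: compare each day against a baseline built from the first week of runs. The 7-day baseline and reading "3%" as 3 percentage points are assumptions; adjust both to your own normal variance.

```python
from statistics import mean


def regression_events(daily_accuracy: list[float], baseline_days: int = 7,
                      drop_threshold: float = 0.03) -> list[int]:
    """Return the indexes of days whose accuracy fell at least
    drop_threshold below the mean of the first baseline_days runs."""
    if len(daily_accuracy) <= baseline_days:
        raise ValueError("need more runs than baseline_days")
    baseline = mean(daily_accuracy[:baseline_days])
    return [
        day
        for day, score in enumerate(daily_accuracy)
        if day >= baseline_days and baseline - score >= drop_threshold
    ]


# Hypothetical scores: a stable first week, then two unexplained drops.
scores = [0.90, 0.91, 0.90, 0.89, 0.90, 0.91, 0.90, 0.90, 0.86, 0.89, 0.85]
events = regression_events(scores)
print(f"{len(events)} regression events on days {events}")
```

Each flagged day still needs a human check against your own deploy and prompt history before it counts as unexplained.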

You can't prevent regressions if you're using a provider's rolling release. What you can do is detect them faster than your users do. That's why continuous evaluation matters — you want to be the one who catches the drop, not the support ticket.

Target: at most one unexplained regression per month, and ideally zero. If you get more than one, it's either a bad model fit for your use case or a sign that your evaluation set doesn't cover your production distribution well enough.

5. Format Compliance Rate

If your LLM output is consumed by code, not just humans, then format compliance rate matters as much as output quality. A classification model that's 94% accurate but only returns valid JSON 87% of the time delivers a correct, usable answer at most 87% of the time, and closer to 82% (0.94 × 0.87) if parse failures are unrelated to correctness.

Format compliance means: does the output match your expected structure? For JSON extraction, does it parse cleanly? For bullet-point summaries, does it return a list or prose? For tool calls, does it include all required fields?

This metric is especially important for structured output tasks. If you're using JSON mode, tool calling, or any system where downstream code depends on consistent parsing, track what percentage of outputs your parser accepts without fallback. A drop from 99% to 94% means 6% of your production requests are now hitting fallback behavior, up from 1%, and you might not even know it.
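A minimal sketch of that tracking for JSON outputs. The required field names are a hypothetical schema, and a real check would likely validate types as well.

```python
import json

# Hypothetical schema: fields the downstream code expects in every response.
REQUIRED_FIELDS = {"category", "confidence"}


def is_compliant(raw: str) -> bool:
    """True if raw parses as a JSON object containing every required field."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS.issubset(parsed)


def compliance_rate(outputs: list[str]) -> float:
    """Share of outputs the parser accepts without fallback."""
    if not outputs:
        raise ValueError("no outputs to check")
    return sum(is_compliant(o) for o in outputs) / len(outputs)


outputs = [
    '{"category": "billing", "confidence": 0.92}',
    '{"category": "billing"}',                          # missing a field
    'Sure! Here is the JSON: {"category": "billing"}',  # prose around JSON
    '{"category": "shipping", "confidence": 0.71}',
]
print(f"format compliance: {compliance_rate(outputs):.0%}")
```

Running this over every response, not just the ones that raise, is what turns silent fallbacks into a number you can chart.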

The compliance gap: Most teams discover format compliance failures through downstream errors — a parse exception, a missing field in a database insert, a malformed webhook. By the time you see the error, the output is lost. Automated format checking catches every failure, not just the ones that crash.

Putting It Together

These five metrics aren't independent. Accuracy consistency and regression frequency are related — a model with high regression frequency will have low accuracy consistency. Format compliance rate and latency often trade off — enforcing strict output schemas can slow down inference. Cost per eval and latency connect through token count and batching.

The framework isn't about finding a perfect model. It's about finding a model that's predictably good for your specific use case. A model that's 88% accurate every day is more useful than one that's 95% one week and 71% the next.

The practical workflow: establish baseline metrics with your current configuration, then re-run the same evaluation against any proposed model change before switching. That way you're comparing models on your evaluation criteria, not on the provider's marketing benchmarks.

Most teams don't do this because it takes time to build a representative evaluation set and the infrastructure to run it reliably. That's the operational gap Benchwright fills — automated evaluation runs, regression detection, and provider comparison across your evaluation criteria on a continuous schedule.

Evaluation isn't a one-time decision. It's a continuous process. The teams that get the most out of LLM providers are the ones measuring them like production systems — with metrics, alerts, and baselines — not like demos.

What 12 LLMs Actually Cost in Production — Real Data from Benchwright

Dave Graham — Wed, 06 May 2026 13:52:10 +0000

Real production cost data from the Benchwright /compare calculator across 12 LLMs — input/output ratios, latency tradeoffs, and 3 decisions you should make differently today.

Everyone knows the sticker price. Nobody knows the bill.

You see "$5 per million tokens" and do mental math: that's cheap, this will cost almost nothing. Then you ship to production, context windows bloat with conversation history, your retry logic fires on 3% of calls, and the response tokens are 4× your estimates because you underestimated how verbose the model is. Three months later your AI feature is costing you $800/month instead of $80.

This isn't a niche problem. It's the default outcome for teams that benchmark cost in a notebook and deploy to production without re-measuring.

We built the Benchwright /compare calculator to make the gap between sticker price and real production cost visible — and to keep it visible as models update. After running 12 models through it, here's what the data actually shows.

Methodology

The /compare tool calculates monthly production cost from three inputs you control: API calls per day, average prompt tokens, and average completion tokens. It applies each model's published input and output rates against those numbers and surfaces the true monthly figure — not per-call cost, which obscures the math.
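The calculation described above can be sketched as follows. This is not the /compare tool's actual code; the names, the 30-day month, and the example rates are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Published price in dollars per 1M tokens."""
    input_per_million: float
    output_per_million: float


def monthly_cost(calls_per_day: int, prompt_tokens: int,
                 completion_tokens: int, rates: Rates, days: int = 30) -> float:
    """Monthly spend from daily call volume and average tokens per call."""
    per_call = (
        prompt_tokens * rates.input_per_million
        + completion_tokens * rates.output_per_million
    ) / 1_000_000
    return per_call * calls_per_day * days


# Hypothetical model priced at $1.00 input / $4.00 output per 1M tokens.
example = monthly_cost(10_000, prompt_tokens=600, completion_tokens=400,
                       rates=Rates(1.00, 4.00))
print(f"${example:,.2f}/month")
```

At those example numbers a per-call cost of about a fifth of a cent becomes $660/month, which is why the monthly figure is the one to look at.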

Models in this comparison:

Provider	Models
OpenAI	GPT-4o, GPT-4o mini, GPT-4 Turbo, o1-mini
Anthropic	Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus
Google	Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash
Other	Mistral Large, Llama 3.1 70B (via Together.ai)

All pricing reflects published rates as of May 2026. Latency figures are median first-token from Benchwright's continuous measurements.

The Full Pricing Picture

Before we get to surprises, here's the complete dataset:

Model	Input ($/1M tokens)	Output ($/1M tokens)	Latency (p50 TTFT)
GPT-4o	$5.00	$15.00	1,200ms
GPT-4o mini	$0.15	$0.60	600ms
GPT-4 Turbo	$10.00	$30.00	—
o1-mini	$3.00	$12.00	—
Claude 3.5 Sonnet	$3.00	$15.00	1,000ms
Claude 3.5 Haiku	$0.80	$4.00	500ms
Claude 3 Opus	$15.00	$75.00	—
Gemini 1.5 Flash	$0.075	$0.30	700ms
Gemini 1.5 Pro	$1.25	$5.00	—
Gemini 2.0 Flash	$0.10	$0.40	500ms
Mistral Large	$2.00	$6.00	—
Llama 3.1 70B	$0.90	$0.90	—

The raw numbers don't tell you much until you model your actual workload. That's where the surprises are.

3 Non-Obvious Findings

1. GPT-4o mini beats Claude 3.5 Haiku per token; cost per resolved task is the comparison that matters

At first glance GPT-4o mini looks like the budget champion: $0.15 input vs Haiku's $0.80. That framing is incomplete.

Output tokens are where you actually spend money at scale. GPT-4o mini charges $0.60/M on output. Haiku charges $4.00/M. On per-token price, GPT-4o mini wins at every completion length, and the absolute gap grows with output size. That matters because production AI workloads rarely generate short completions. Customer support responses, code explanations, document summaries, and structured JSON outputs routinely run 500–2,000 tokens.

At 1,000 output tokens per call, 10,000 calls/day:

GPT-4o mini: $6/day in output costs alone
Claude 3.5 Haiku: $40/day in output costs

So GPT-4o mini wins on per-call price. What changes the math is quality per output token. Teams running Haiku on customer-facing tasks report needing fewer clarification rounds because the responses are more directly useful, which means fewer total completions per resolved task. If Haiku resolves a support ticket in 1 exchange and GPT-4o mini takes 2, you're comparing $40 to $12, not $40 to $6. At these rates GPT-4o mini would need roughly 7 exchanges per ticket before Haiku came out cheaper on output cost, so measure exchanges per resolved task instead of assuming the answer.
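Here is the same comparison as a small sketch, using the output rates and volumes quoted above. The exchanges-per-task values are illustrative inputs, not measurements.

```python
def output_cost_per_day(tasks_per_day: int, output_tokens: int,
                        output_rate_per_million: float,
                        exchanges_per_task: float = 1.0) -> float:
    """Daily output-token spend when each task takes several completions."""
    completions = tasks_per_day * exchanges_per_task
    return completions * output_tokens * output_rate_per_million / 1_000_000


mini_one_exchange = output_cost_per_day(10_000, 1_000, 0.60)
mini_two_exchanges = output_cost_per_day(10_000, 1_000, 0.60, exchanges_per_task=2)
haiku_one_exchange = output_cost_per_day(10_000, 1_000, 4.00)

# Exchanges GPT-4o mini would need per task before Haiku (at one exchange)
# becomes the cheaper option on output cost alone.
break_even_exchanges = 4.00 / 0.60

print(f"GPT-4o mini, 1 exchange:  ${mini_one_exchange:.2f}/day")
print(f"GPT-4o mini, 2 exchanges: ${mini_two_exchanges:.2f}/day")
print(f"Haiku, 1 exchange:        ${haiku_one_exchange:.2f}/day")
print(f"break-even exchanges:     {break_even_exchanges:.1f}")
```

The break-even ratio of about 6.7 is the number to compare against your measured exchanges per resolved task.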

The decision: Don't pick the cheapest model per token. Pick the cheapest model per resolved task. Benchwright's continuous monitoring measures this over time so you're not guessing.

2. Gemini 2.0 Flash is the price-performance anomaly nobody is talking about

$0.10 input, $0.40 output, 500ms p50 latency. That's faster than GPT-4o mini (600ms) and cheaper on both input ($0.10 vs $0.15) and output ($0.40 vs $0.60).

For most production workloads — classification, summarization, extraction, light reasoning — Gemini 2.0 Flash is a legitimate default choice that teams are sleeping on. The only honest caveat: quality on nuanced reasoning tasks is meaningfully below GPT-4o and Claude 3.5 Sonnet. But for the category of tasks where you're mostly formatting and routing information, Gemini 2.0 Flash at $0.10/$0.40 per million tokens is hard to beat.

Run your actual eval dataset against it before dismissing it. Most teams that do are surprised.

3. The real cost of Claude 3 Opus isn't $15/$75 — it's the opportunity cost of not switching

Claude 3 Opus is $15 input, $75 output. Claude 3.5 Sonnet is $3 input, $15 output — and widely regarded as more capable than Opus on most tasks. Sonnet's release made Opus a legacy cost center.

At 5,000 calls/day, 500 input tokens, 800 output tokens:

Opus monthly: ~$10,100
Sonnet monthly: ~$2,000

That's roughly an $8,100/month difference (assuming a 30-day month) for a model that's worse on most benchmarks. Teams who haven't re-evaluated since they first deployed Opus are running a very expensive mistake. This is exactly what silent regression monitoring is designed to catch: not just when models get worse, but when a better option emerges.

The Latency Tradeoff

Cost is only half the equation. Latency shapes UX in ways that cost doesn't.

Here's the p50 first-token picture for the models where we have consistent data:

Model	p50 TTFT	Practical implication
Claude 3.5 Haiku	500ms	Streaming feels near-instant; fine for interactive chat
Gemini 2.0 Flash	500ms	Excellent for inline UX patterns
GPT-4o mini	600ms	Acceptable for most UI contexts
Gemini 1.5 Flash	700ms	Slight perceptible delay in fast interactions
Claude 3.5 Sonnet	1,000ms	Noticeable pause; needs streaming UX
GPT-4o	1,200ms	Requires skeleton loading states

What p95 reveals: Median latency is misleading for customer-facing features. The 1-in-20 call that takes 4–6 seconds is the one that gets a bug report. Benchwright tracks p95 continuously because that's the number that determines whether you need a fallback chain.

Practical rule: if your feature is synchronous and user-facing, you need p95 under 2 seconds. GPT-4o and Claude 3.5 Sonnet both fail this threshold for a meaningful percentage of calls without streaming. Haiku and Gemini 2.0 Flash pass it comfortably.

Hidden Costs

The three things not in any sticker price:

1. Retries

Most production setups have retry logic for rate limits and transient failures. A 3% retry rate on 10,000 calls/day is 300 bonus calls you didn't budget. On GPT-4o at a typical 600-token prompt + 900-token response, each call costs about $0.0165 at the rates above, so that's roughly $5/day, or about $150/month, of invisible overhead. Multiply by 12 months. Benchmark your retry rate, not just your happy-path cost.

2. Context window bloat

Conversation history accumulates. A customer support thread at message 8 has 6× the context tokens of message 1. Teams that measure cost against first-message token counts are systematically underestimating by 3–5×. Evaluating this pattern over time is one of the 5 metrics that actually matter.

3. Fallback chains

If you're running GPT-4o with a Claude 3.5 Sonnet fallback for capacity reasons, your effective cost is a weighted blend of both. At 15% fallback rate, you're paying 85% of one price and 15% of another. Model your actual fallback frequency or your budget math is wrong.
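A rough sketch of how fallback and retry rates fold into one effective per-call cost. The per-call prices are placeholders, and it assumes each retried call is billed in full exactly once more.

```python
def blended_cost_per_call(primary_cost: float, fallback_cost: float,
                          fallback_rate: float) -> float:
    """Weighted per-call cost when fallback_rate of calls go to the fallback."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return (1.0 - fallback_rate) * primary_cost + fallback_rate * fallback_cost


def with_retries(cost_per_call: float, retry_rate: float) -> float:
    """Add retry overhead, assuming one fully billed retry per retried call."""
    if retry_rate < 0:
        raise ValueError("retry_rate cannot be negative")
    return cost_per_call * (1.0 + retry_rate)


# Placeholder per-call costs: $0.020 on the primary, $0.010 on the fallback.
blended = blended_cost_per_call(0.020, 0.010, fallback_rate=0.15)
effective = with_retries(blended, retry_rate=0.03)

print(f"blended:   ${blended:.4f} per call")
print(f"effective: ${effective:.6f} per call")
```

Feed in fallback and retry rates measured from production logs; guessed rates just move the error somewhere else.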

3 Decisions You Should Make Differently After This

1. Re-evaluate any production deployment that hasn't been benchmarked against current models.

If you picked your model over 6 months ago, the landscape has changed. Claude 3.5 Sonnet vs Opus alone could be saving you thousands per month. Set a quarterly model review on the calendar — or better, run continuous cost monitoring so you catch the delta automatically.

2. Stop using input price as your primary cost filter.

Input tokens are cheap across the board. Output tokens are where the meaningful variation is. Sort by output cost, then model your actual input-to-output ratio. Your real number is usually 2–4× the sticker you're anchoring on.

3. Don't skip Gemini 2.0 Flash in your next eval.

Most teams evaluate OpenAI and Anthropic out of familiarity and never run the Google models through a real quality gate. For a large category of production tasks, Gemini 2.0 Flash at $0.10/$0.40 is the right answer. You won't know unless you measure.

Try It on Your Numbers

Every workload is different. The Benchwright /compare tool lets you plug in your actual API call volume, prompt length, and completion length to get your real monthly number across all 12 models — not a hypothetical.

Once you have a baseline, continuous monitoring tells you when that number shifts because a model changed under you. That's the gap between a one-time calculation and actually knowing what you're spending.

→ Run your numbers in /compare

Want ongoing monitoring instead of a one-time check? Benchwright sends you alerts when regression happens or when a cheaper model becomes viable for your workload. Sign up for early access.

Related reading:

• How LLM Model Updates Silently Break Production Features — why "stable" models aren't

• Why Unit Tests Aren't Enough for LLM Features — what you're missing

• 5 Metrics That Actually Matter When Evaluating LLM Providers — what to track

Benchwright Calculator

Benchwright runs continuous LLM evaluations so teams know what works before they deploy.

Try the free calculator → benchwright.polsia.app/compare

No credit card required. No infrastructure to manage.

Why Unit Tests Aren't Enough for LLM Features

Dave Graham — Wed, 06 May 2026 13:40:32 +0000

All tests pass. The deploy goes green. But your LLM feature degrades silently in production — and your test suite never noticed. Here's the fundamental reason why, and what actually works instead.

Picture this: you've built a feature that uses an LLM to classify customer support tickets. You wrote unit tests. You wrote integration tests. They all pass on every CI run. You deploy with confidence.

Three weeks later, a customer flags that the routing has been wrong for days. You check your test suite — it's green. You check the model configuration — nothing changed on your end. But something changed. And your entire testing infrastructure missed it completely.

This isn't a gap in your test coverage. It's a fundamental mismatch between how software testing works and how LLMs behave.

What Unit Tests Are Built For

Unit tests work because the systems they test are deterministic. Given input X, a pure function always returns output Y. The test captures that contract. If someone breaks it, the test fails. The feedback loop is instant, local, and reliable.

This model depends on one critical assumption: the code doesn't change unless you change it. Functions don't drift. Libraries don't silently update behavior between CI runs. The math stays the same.

LLMs break every part of this assumption.

Four Reasons Unit Tests Can't Catch LLM Regression

1. Non-determinism is the baseline, not the exception.

Call the same LLM with the same prompt twice and you'll get two different outputs. This is by design — temperature, sampling, and model stochasticity are features. But it makes assertions fragile. You can't write expect(output).toBe("Billing") and have it mean anything, because the model might return "billing", "Billing issue", or a slightly different phrasing on the next run.

Teams work around this by asserting on structure (typeof output === 'string') or mocking the LLM call entirely. Both approaches miss the point. Structural tests verify your parsing code, not model quality. Mocks verify that your code calls the API — they say nothing about what the API returns.

The mock problem: When you mock an LLM call in tests, you're testing that your code handles a specific, pre-written response correctly. You're not testing the model at all. The mock stays frozen while the actual model drifts — and your tests keep passing the whole time.

2. The model is a black box that changes underneath you.

OpenAI, Anthropic, and Google push model updates continuously. Safety fine-tunes, capability improvements, cost optimizations — they change behavior without changing the version string. gpt-4o today is not the same model as gpt-4o six months ago. Your test suite runs against whichever version is live at CI time. Once deployed, it runs against whatever version the provider decides to serve.

Your tests passed against last week's model. This week's model is different. You never ran the tests against this week's model. The gap is invisible.

3. Prompt sensitivity makes small changes catastrophic.

LLMs are extraordinarily sensitive to prompt wording. Adding a period. Changing "classify" to "categorize." Tweaking the system message by one sentence. These changes can shift accuracy by 5–15 percentage points — sometimes more. Your unit tests run against a fixed prompt, so they don't catch what happens when prompts evolve in production, when context windows get filled differently, or when the model's response to your exact phrasing shifts over time.

4. Distribution shift happens in production, not in your test fixtures.

Your test suite has 20 labeled examples. Your production system processes thousands of inputs per day with a distribution that evolves — new product categories, new user phrasings, seasonal language patterns. A model that handles your test fixtures correctly might handle 15% of real production inputs poorly, and you'd never see it in the test results.

The coverage gap: Integration test suites for LLM features typically cover 20–100 hand-picked examples. Production traffic covers millions of input variations. The examples you test are not representative of the distribution that breaks things.

What Unit Tests Can (and Can't) Cover

What You're Testing	Unit Tests	Continuous Evaluation
Your parsing code handles the response	✓ Yes	✓ Yes
The API call is constructed correctly	✓ Yes	✓ Yes
Model output quality on your eval set	✗ No (mocked)	✓ Yes
Behavior after provider model updates	✗ No	✓ Yes
Accuracy drift over weeks	✗ No	✓ Yes
Format compliance rate in production	✗ No	✓ Yes
Regression from prompt changes	✗ No	✓ Yes
Cross-model performance comparison	✗ No	✓ Yes

Unit tests aren't useless for LLM features — they're just covering the wrong half of what can break. Your parsing logic, API client, and error handling should absolutely be unit tested. But the model's behavior? That requires a different approach.

What Continuous Evaluation Actually Catches

Continuous evaluation treats your LLM feature like a production service with measurable outputs — because that's what it is. Instead of a test suite that runs once and freezes, you run evaluations on a schedule: daily, or after every deploy.

Behavioral drift. When a provider update changes how your model handles a class of inputs, continuous evaluation catches it within 24 hours. You see the accuracy chart drop. You have a timestamp. You can correlate it with provider changelogs. Without continuous evaluation, you'd find out from a user report three weeks later.

Quality degradation over time. Some regressions aren't sudden — they're gradual. Format compliance slips from 99% to 96% to 93% over six weeks. No single day is alarming. The trend is. Continuous evaluation gives you the time-series data to see it coming.

Cross-model comparison before you switch. When you're considering upgrading to a newer model, you don't run a vibe check — you run your evaluation set against both models and compare accuracy, latency, format compliance, and cost. Data beats intuition every time.

Prompt change impact. Before you ship a prompt revision, run it against your evaluation set. If accuracy drops 8%, you know before it hits production. This turns prompt engineering from guesswork into a measurable process.

The operating model shift: Traditional software testing assumes your code is the variable and the dependencies are stable. LLM evaluation assumes the model is the variable and your test set is the stable ground truth. Both approaches are right — for their respective domains.

How to Set Up an Eval Pipeline

The minimum viable eval pipeline has three components:

A representative evaluation set. 50–200 real inputs from production with labeled ground-truth outputs. Not synthetic examples — actual inputs your system has processed, labeled by a human or by a higher-quality model. This is your ground truth. It needs to be maintained as your product evolves.

Automated daily runs. A scheduled job that runs your evaluation set against your production model configuration and records the results: accuracy, format compliance, latency, token cost. Every run. Every day. Results stored in a queryable form so you can see trends, not just snapshots.

Regression alerts. Thresholds that trigger notifications when metrics degrade. A 5% accuracy drop. Format compliance falling below 95%. Average output length increasing by 40%. You define what "regression" means for your feature — the system tells you when it happens, before your users do.
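The alerting component can start as small as the sketch below, which uses the three example thresholds from the paragraph above. The `RunMetrics` shape and function name are assumptions, and the accuracy threshold is read as 5 percentage points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    accuracy: float           # 0.0 to 1.0
    format_compliance: float  # 0.0 to 1.0
    avg_output_tokens: float


def regression_alerts(baseline: RunMetrics, current: RunMetrics,
                      max_accuracy_drop: float = 0.05,
                      min_format_compliance: float = 0.95,
                      max_length_increase: float = 0.40) -> list[str]:
    """Compare the latest run to a baseline and describe any breached threshold."""
    alerts = []
    if baseline.accuracy - current.accuracy >= max_accuracy_drop:
        alerts.append(
            f"accuracy fell from {baseline.accuracy:.0%} to {current.accuracy:.0%}")
    if current.format_compliance < min_format_compliance:
        alerts.append(
            f"format compliance at {current.format_compliance:.0%}")
    if current.avg_output_tokens > baseline.avg_output_tokens * (1 + max_length_increase):
        alerts.append(
            f"average output length rose to {current.avg_output_tokens:.0f} tokens")
    return alerts


# Hypothetical runs: accuracy and output length regress, format stays in range.
baseline = RunMetrics(accuracy=0.91, format_compliance=0.99, avg_output_tokens=120)
today = RunMetrics(accuracy=0.84, format_compliance=0.96, avg_output_tokens=180)

for alert in regression_alerts(baseline, today):
    print("ALERT:", alert)
```

A fixed baseline keeps the example simple. A rolling baseline is a reasonable alternative, but it can absorb gradual drift, so it is worth comparing against a pinned reference run as well.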

Building this yourself is straightforward in concept: a cron job, a database, some charting. The hard part is the operational overhead — keeping the evaluation set fresh, maintaining the infrastructure reliably, building alert logic that doesn't false-positive constantly. Most teams start, ship something workable, and watch it go stale over the following quarter because it's not a revenue-generating feature.

That's what Benchwright handles — continuous evaluation as infrastructure. Automated runs, regression detection, cross-model comparison, delivered as a service so the maintenance overhead isn't your problem.

The Takeaway

Keep your unit tests. They're verifying real things — your parsing code, your API client, your error handling. But don't mistake a green test suite for confidence in your LLM feature's production behavior. Those tests were written against a frozen mock of a model that has since changed.

The layer that's missing is continuous evaluation: real model calls, against a real evaluation set, on a real schedule, with real alerts when behavior changes. That's the layer that tells you what your test suite can't.

If you're shipping LLM features and relying on CI to catch regressions, you're not monitoring a production system — you're hoping nothing changed since the last deploy.

Originally published on benchwright.polsia.app — Benchwright is an autonomous AI evaluator that continuously benchmarks production models — see how it works.