Stuart Palmer

Testing LLM Prompts in Production Pipelines: A Practical Approach

Over the past few months, I've been working on integrating a number of LLM-based features throughout our product: content generation, intelligent recommendations, and agentic task flows. As these features matured, one question kept coming up: how do we actually test this?

Traditional unit testing works beautifully for deterministic code. You write a test, assert an expected output, and move on. But LLMs don’t behave that way; run the same prompt twice and you'll get two different responses. So how do you test something that never gives you the same answer?

Where We Started: Mocking Our Way Forward

Our in-house LLM service already had strong test coverage, so at the application layer we treated it like any other dependency: we mocked it. Our unit tests would stub out the LLM responses and focus on testing the code around them: error handling, data transformations, business logic.

This approach did have value. It gave us confidence that our integration code was solid, let us test edge cases without eating model compute time, and kept our test suite fast. For early development, it was exactly what we needed.
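
For context, a typical mocked test looked something like the sketch below (the module names and helpers here are illustrative, not our real code). Jest replaces the LLM client, so we only ever exercise our own glue code:

// recommendations.test.ts (illustrative sketch of a mocked unit test)
import { generateRecommendations } from './recommendations';
import { llmClient } from './llmClient';

// Stub the LLM client so the test never calls the real model
jest.mock('./llmClient');

it('wraps the LLM output in our response format', async () => {
  (llmClient.complete as jest.Mock).mockResolvedValue('Use connection pooling.');

  const result = await generateRecommendations({ service: 'checkout-api' });

  // We can assert on our own transformation and error-handling logic...
  expect(result.recommendations).toContain('Use connection pooling.');
  // ...but nothing here tells us whether the real prompt produces good advice.
});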

But there was an elephant in the room: we weren't testing the prompts themselves. We had no way to catch when a prompt change degraded output quality, shifted tone, or broke RAG accuracy. We were testing everything except the thing that mattered most.

What Mocks Can't Catch

As our LLM features moved into production, the gaps became obvious. We had no way to catch things like:

  • Prompt regressions (changes that subtly degraded output quality)
  • Behavioural requirements (whether outputs met specific criteria like tone, structure, or completeness)
  • RAG faithfulness (whether our Retrieval-Augmented Generation features accurately represented the knowledge they retrieved, or were hallucinating despite having correct context)

These weren't just theoretical risks. In one case, I updated a function that generates performance recommendations for technical specialists. I added some extra metrics - a straightforward improvement, or so I thought. The feature still ran fine, but the tone had shifted. What was previously clear, actionable technical advice became overly dramatic and superficial.

Our mocked tests passed with no issues, but the actual output had regressed in a way that would frustrate users.

We needed a way to test the prompts themselves.

The Solution: Prompt Integration Testing

To fill the gap, we added a new pipeline stage specifically for testing LLM prompts, focusing on integration testing that evaluates real model outputs against defined requirements. We use PromptFoo, a framework for LLM testing, and have integrated it with our existing Jest test suite.

Our tests fall into two main categories:

1. Quality/Subjective

These test behavioural characteristics like tone, structure, completeness and safety. We're not looking for exact textual matches; instead, we define expectations such as:

  • "Response is professional and contains no profanity"
  • "Answer is structured with headings and bullet points"
  • "Response does not include personal opinions or speculation"

These tests catch regressions where functionality still works but quality silently drops.
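
Under the hood, these expectations are just natural-language rubrics evaluated by a grading model. As a minimal sketch of what a single check looks like with PromptFoo's rubric assertion (the full wrapper is shown in the implementation section below; llmOutput is a placeholder for whatever the prompt returned):

// rubricCheck.sketch.ts (illustrative)
import { assertions } from 'promptfoo';

async function isProfessional(llmOutput: string): Promise<boolean> {
  // Asks a grading model whether the output satisfies the plain-English rubric
  const result = await assertions.matchesLlmRubric(
    'Response is professional and contains no profanity',
    llmOutput
  );
  return result.pass;
}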

2. Faithfulness/Objective

For our RAG-based features, we need to verify that outputs faithfully reflect the retrieved context. These tests check whether the model is grounding its output correctly:

  • "If the knowledge base lists opening hours as 11am-3pm for the holiday, does the response reflect that?"
  • "When the documentation specifies version 2.4.1, does the answer cite the correct version?"
  • "Does the summary include all critical warnings from the source material?"

This catches hallucinations early and ensures users aren’t seeing confidently incorrect information.
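
One way to express these checks is to reuse the same rubric grading, but fold the retrieved context into the rubric so the grader can judge grounding. A minimal sketch, where retrieveContext and runPrompt are placeholders for our own RAG plumbing rather than PromptFoo APIs:

// faithfulness.sketch.ts (illustrative)
import { assertions } from 'promptfoo';

// Placeholders for our own RAG plumbing (not real PromptFoo APIs)
declare function retrieveContext(question: string): Promise<string>;
declare function runPrompt(question: string, context: string): Promise<string>;

async function answersFaithfully(question: string): Promise<boolean> {
  const context = await retrieveContext(question);
  const output = await runPrompt(question, context);

  // The grader checks that the answer only states facts supported by the retrieved context
  const { pass } = await assertions.matchesLlmRubric(
    `The response must only contain facts supported by the following context, with no invented details:\n\n${context}`,
    output
  );
  return pass;
}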

Implementation: Making It Feel Like Normal Testing

The key to adoption was making prompt tests feel like writing any other test. We built a Jest wrapper around PromptFoo that lets developers write LLM tests using familiar patterns.

First, we create a custom Jest matcher:

// addPromptMatchers.ts
import { assertions } from 'promptfoo';

const { matchesLlmRubric } = assertions;

export const addPromptMatchers = () => {
  expect.extend({
    async toSatisfyLlmRule(
      actual: string[],
      expected: string,
      passRateThreshold = 0.95
    ) {
      const results = await Promise.all(
        actual.map(
          async (llmOutput) => await matchesLlmRubric(expected, llmOutput)
        )
      );

      const passCount = results.filter(({ pass }) => pass).length;
      const passRate = passCount / actual.length;

      if (passRate >= passRateThreshold) {
        return {
          message: () => 'passed',
          pass: true
        };
      } else {
        return {
          message: () =>
            `Prompt Eval Failed: ${passRate * 100}% pass rate (needed ${
              passRateThreshold * 100
            }% to pass test)`,
          pass: false
        };
      }
    }
  });
};
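
To keep the developer experience feeling like normal testing in TypeScript, we can also teach the type system about the new matcher. A sketch, assuming @types/jest (the file name is illustrative):

// promptMatchers.d.ts (sketch: type declaration for the custom matcher)
declare global {
  namespace jest {
    interface Matchers<R> {
      // Mirrors toSatisfyLlmRule above; the matcher is async, so call it with await
      toSatisfyLlmRule(expected: string, passRateThreshold?: number): R;
    }
  }
}

export {};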

Then, we add this to the Jest startup config, and create a separate Jest config for prompt-specific test runs:

//jest.setupAfter.ts
import { addPromptMatchers } from './addPromptMatchers';

addPromptMatchers();

//jest.promptEval.config.ts
export default {
  setupFilesAfterEnv: ['./jest.setupAfter.ts'],
  // notice the glob pattern matches files with the suffix .test.prompt.ts
  testMatch: ['**/?(*.)+(spec|test).?(prompt).?([mc])[jt]s?(x)'] 
  // Other Jest config
};

Then, when you plumb it into your pipeline, remember to invoke an additional Jest stage using the new config file:

// package.json
{
  "scripts": {
    "tests:prompt": "jest --config=jest.promptEval.config.ts"
  }
}

Tests that use this matcher and run in the dedicated pipeline follow a modified file naming convention (.test.prompt.ts rather than .test.ts) to make the distinction clear. Developers who know how to write unit tests can write prompt tests.

// LLM prompt test file
// myFeature.test.prompt.ts

// runPrompt and promptUnderTest are our own helpers (illustrative import path):
// runPrompt sends the prompt under test to the real LLM and returns its raw text output
import { runPrompt, promptUnderTest } from './myFeature';

describe('recommendation prompt', () => {
  it('should provide professional, actionable advice', async () => {
    const promptResults = await Promise.all(
      Array.from({ length: 20 }).map(async () => {
        return await runPrompt(promptUnderTest);
      })
    );

    await expect(promptResults).toSatisfyLlmRule(
      'response is professional and contains clear, actionable recommendations'
    );
  });
});


Handling Non-Determinism

The biggest challenge with LLM testing is non-determinism. Run the same prompt twice and you'll get different outputs. Our approach: embrace it.

Instead of expecting identical results, we run each test multiple times and measure the success rate. By default, we require a 95% success rate; if a test passes 19 out of 20 runs, it passes overall. This threshold is configurable per test, which has sparked interesting code-review discussions about what ‘consistent enough’ means for different features.
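
Overriding the threshold is just the matcher's second argument. For example, a lower-stakes prompt might accept a little more variance (same runPrompt helper as the earlier example; 0.9 means 18 of 20 runs must satisfy the rubric):

it('tolerates more variance for this low-stakes prompt', async () => {
  const promptResults = await Promise.all(
    Array.from({ length: 20 }).map(() => runPrompt(promptUnderTest))
  );

  await expect(promptResults).toSatisfyLlmRule(
    'response is friendly and free of jargon',
    0.9
  );
});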

This is actually a question I'm still thinking about: what success rate thresholds make sense for LLM testing? Is 95% too strict? Too lenient? Does it depend on the use case? (I suspect it does, but I'm curious what others have landed on.)

CI/CD Integration

We run these tests in a dedicated pipeline stage, separate from our regular unit tests. There are a few reasons for that:

  • They're slow (Multiple runs of real LLM calls take time)
  • They're expensive (Each test execution hits the LLM multiple times)
  • They need a different rollout approach (This is new territory for most of the team)

That last point is key. Currently, the stage is non-blocking; failures don’t prevent deployment. Instead, we report results to a Slack channel where the team can monitor performance and trends. This gives us space to build confidence in the solution gradually, expand test coverage, refine our success criteria, and get comfortable with non-deterministic testing before making it a hard gate.

Trade-offs and Learnings

There are definite costs and consequences to this approach:

Time: This pipeline stage is slower than our standard unit tests. Where regular tests complete in seconds, these can take minutes depending on the number of prompts and configured run counts.

Cost: Running multiple iterations of real LLM calls adds up - token usage isn’t free.

Wider Scope: These tests blur the line between technical and product decisions. Unlike traditional tests that verify logic, prompt tests encode expectations about quality, tone, and behaviour. This means their criteria probably shouldn't be defined by engineering alone - product owners have a role to play. It reminds me of behaviour-driven development (BDD), where acceptance criteria come from the business rather than the code. "The response should be professional and helpful" is less a technical specification and more a product requirement that happens to be testable with LLMs.

Flakiness: Even with high success rate thresholds, occasional failures still happen. We're still developing our understanding of when a failure indicates a real problem versus normal variance.

Despite these trade-offs, the benefits are clear. We've caught issues that would have shipped to production, and developers have more confidence making prompt changes knowing there's a safety net.

Takeaways for Others

If you're considering adding prompt testing, here's what we’ve learned so far:

Start small: Pick one or two critical prompts to test first. Don't try to achieve 100% coverage immediately. Focus on prompts where failures would have the highest impact.

Involve Product Owners: Prompt test criteria are product decisions as much as technical ones. Bring them into conversations about how prompt output should be validated from the user's perspective. They'll help you define rubrics that reflect real user needs rather than engineering assumptions.

Make it easy to write tests: Whatever your implementation, minimise friction for developers. If writing a prompt test is significantly harder than writing a unit test, adoption will suffer.

Accept imperfection: Probabilistic success rates feel odd at first, but they’re a more honest model of how LLMs behave.

Balance cost versus confidence: You don't need to test every prompt on every commit. Be strategic about when tests run and how many iterations you perform.

Conclusion

LLMs are now core to many B2B products, but traditional testing doesn’t fit their probabilistic nature. Mocking still has its place for testing integration code, but it can't tell you if your prompts truly deliver the intended outcome.

Prompt integration testing isn't perfect - it's slower, more expensive, and fuzzier than regular testing. But it fills a critical gap. It can catch regressions before they reach production, give developers confidence to refactor prompts, and forces us to articulate what "good" actually means for our AI features.

However you’re using AI, one principle holds true: if the product’s value depends on LLM outputs, you need to test those outputs directly.
