DEV Community: Stuart Palmer

Testing LLM Prompts in Production Pipelines: A Practical Approach

Stuart Palmer — Wed, 19 Nov 2025 22:04:42 +0000

Over the past few months, I’ve been working on integrating a number of LLM-based features across our product: content generation, intelligent recommendations, and agentic task flows. As these features matured, one question kept coming up: how do we actually test this?

Traditional unit testing works beautifully for deterministic code. You write a test, assert an expected output, and move on. But LLMs don’t behave that way; run the same prompt twice and you'll get two different responses. So how do you test something that never gives you the same answer?

Where We Started: Mocking Our Way Forward

Our in-house LLM service already had strong test coverage, so at the application layer we treated it like any other dependency: we mocked it. Our unit tests would stub out the LLM responses and focus on testing the code around them: error handling, data transformations, business logic.

This approach did have value. It gave us confidence that our integration code was solid, let us test edge cases without eating model compute time, and kept our test suite fast. For early development, it was exactly what we needed.
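To make the mocking approach concrete, here is a minimal sketch of stubbing an LLM dependency and testing only the code around it. Everything named here (the `LlmClient` interface, `getRecommendations`, and both stubs) is hypothetical and invented for illustration; it is not the real service or test suite.

```typescript
// Hypothetical interface standing in for an in-house LLM service.
interface LlmClient {
  complete(prompt: string): Promise<string>;
}

// Hypothetical application code: the logic around the LLM call is what gets unit-tested.
async function getRecommendations(client: LlmClient, topic: string): Promise<string[]> {
  try {
    const raw = await client.complete(`Give recommendations about ${topic}, one per line.`);
    // Data transformation: split into lines, trim whitespace, drop blanks.
    return raw
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line.length > 0);
  } catch {
    // Error handling: degrade gracefully instead of throwing.
    return [];
  }
}

// Stubs: canned behaviour with no model call, so tests stay fast and deterministic.
const happyStub: LlmClient = {
  complete: async () => 'Cache the query\n\n  Add an index  \n',
};
const failingStub: LlmClient = {
  complete: async () => {
    throw new Error('LLM service unavailable');
  },
};
```

With `happyStub`, `getRecommendations` resolves to `['Cache the query', 'Add an index']`; with `failingStub`, it resolves to an empty array. Neither result says anything about whether a real prompt would produce good recommendations.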

But there was an elephant in the room: we weren't testing the prompts themselves. We had no way to catch when a prompt change degraded output quality, shifted tone, or broke RAG accuracy. We were testing everything except the thing that mattered most.

What Mocks Can't Catch

As our LLM features moved into production, the gaps became obvious. We had no way to catch things like:

Prompt regressions (changes that subtly degraded output quality)
Behavioural requirements (whether outputs met specific criteria like tone, structure, or completeness)
RAG faithfulness (whether our Retrieval-Augmented Generation features accurately represented the knowledge they retrieved, or were hallucinating despite having correct context)

These weren’t just theoretical risks. In one case, I updated a function that generates performance recommendations for technical specialists. I added some extra metrics; a straightforward improvement, or so I thought. The feature still ran fine, but the tone had shifted. What was previously clear, actionable technical advice became overly dramatic and superficial.

Our mocked tests passed with no issues, but the actual output had regressed in a way that would frustrate users.

We needed a way to test the prompts themselves.

The Solution: Prompt Integration Testing

To fill the gap, we added a new pipeline stage specifically for testing LLM prompts, focusing on integration testing that evaluates real model outputs against defined requirements. We use PromptFoo, a framework for LLM testing, and integrated it with our existing Jest test suite.

Our tests fall into two main categories:

1. Quality/Subjective

These test behavioural characteristics like tone, structure, completeness and safety. We’re not looking for exact textual matches. Instead, we define expectations such as:

"Response is professional and contains no profanity"
"Answer is structured with headings and bullet points"
"Response does not include personal opinions or speculation"

These tests catch regressions where functionality still works but quality silently drops.

2. Faithfulness/Objective

For our RAG-based features, we need to verify that outputs faithfully reflect the retrieved context. These tests check whether the model is grounding its output correctly:

"If the knowledge base lists opening hours as 11am-3pm for the holiday, does the response reflect that?"
"When the documentation specifies version 2.4.1, does the answer cite the correct version?"
"Does the summary include all critical warnings from the source material?"

This catches hallucinations early and ensures users aren’t seeing confidently incorrect information.
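Where a faithfulness check comes down to an exact value, such as the version number above, a plain string comparison can act as a cheap first line of defence alongside LLM-graded rubrics. The sketch below shows that deterministic complement, which is a different technique from the rubric-based tests described in this post. The `FaithfulnessCase` shape and the `missingLiterals` helper are assumptions made up for this example.

```typescript
// Hypothetical shape: a RAG answer plus values that must be carried over from the context.
interface FaithfulnessCase {
  answer: string;
  requiredLiterals: string[];
}

// Returns the required literals that do not appear in the answer (case-insensitive).
function missingLiterals({ answer, requiredLiterals }: FaithfulnessCase): string[] {
  const haystack = answer.toLowerCase();
  return requiredLiterals.filter((literal) => !haystack.includes(literal.toLowerCase()));
}

// The retrieved documentation says 2.4.1, so the answer must mention it.
const grounded = missingLiterals({
  answer: 'Upgrade to version 2.4.1 before enabling the feature.',
  requiredLiterals: ['2.4.1'],
});
const hallucinated = missingLiterals({
  answer: 'Upgrade to version 2.5.0 before enabling the feature.',
  requiredLiterals: ['2.4.1'],
});

console.log(grounded); // []
console.log(hallucinated); // [ '2.4.1' ]
```

A check like this cannot judge paraphrased or summarised content, so it only covers the narrow slice of faithfulness that involves exact values.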

Implementation: Making It Feel Like Normal Testing

The key to adoption was making prompt tests feel like writing any other test. We built a Jest wrapper around PromptFoo that lets developers write LLM tests using familiar patterns.

First, we create a custom Jest matcher:

// addPromptMatchers.ts
import { assertions } from 'promptfoo';

const { matchesLlmRubric } = assertions;

export const addPromptMatchers = () => {
  expect.extend({
    // Passes when the share of outputs satisfying the rubric meets the threshold.
    async toSatisfyLlmRule(
      actual: string[],
      expected: string,
      passRateThreshold = 0.95
    ) {
      // Grade every output against the rubric in parallel.
      const results = await Promise.all(
        actual.map((llmOutput) => matchesLlmRubric(expected, llmOutput))
      );

      const passCount = results.filter(({ pass }) => pass).length;
      // An empty list of outputs counts as a 0% pass rate instead of NaN.
      const passRate = actual.length > 0 ? passCount / actual.length : 0;

      if (passRate >= passRateThreshold) {
        return {
          message: () => 'passed',
          pass: true
        };
      } else {
        return {
          message: () =>
            `Prompt Eval Failed: ${(passRate * 100).toFixed(1)}% pass rate (needed ${(
              passRateThreshold * 100
            ).toFixed(1)}% to pass test)`,
          pass: false
        };
      }
    }
  });
};

Then, we register the matcher in a Jest setup file and create a separate Jest config for prompt-specific test runs:

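// jest.setupAfter.ts
// No .ts extension on the import path: TypeScript rejects it unless allowImportingTsExtensions is enabled.
import { addPromptMatchers } from './addPromptMatchers';

addPromptMatchers();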

// jest.promptEval.config.ts
export default {
  setupFilesAfterEnv: ['./jest.setupAfter.ts'],
  // notice the glob pattern matches files with the suffix .test.prompt.ts
  testMatch: ['**/?(*.)+(spec|test).?(prompt).?([mc])[jt]s?(x)'],
  // Other Jest config
};

Finally, when you plumb this into your pipeline, remember to invoke an additional Jest stage that uses the new config file.

// package.json
{
  "scripts": {
    "tests:prompt": "jest --config=jest.promptEval.config.ts"
  }
}

To write tests that use the matcher and run in the dedicated pipeline stage, we use a modified file naming convention (.test.prompt.ts instead of .test.ts) to make the distinction clear. Developers who know how to write unit tests can write prompt tests.

// LLM prompt test file
// myFeature.test.prompt.ts
// runPrompt and promptUnderTest are application-specific and not shown here.

describe('recommendation prompt', () => {
  it('should provide professional, actionable advice', async () => {
    // Run the same prompt 20 times in parallel to sample the range of outputs.
    // Real LLM calls may need a longer timeout than Jest's 5-second default.
    const promptResults = await Promise.all(
      Array.from({ length: 20 }, () => runPrompt(promptUnderTest))
    );

    await expect(promptResults).toSatisfyLlmRule(
      'response is professional and contains clear, actionable recommendations'
    );
  });
});

Handling Non-Determinism

The biggest challenge with LLM testing is non-determinism. Run the same prompt twice and you'll get different outputs. Our approach: embrace it.

Instead of expecting identical results, we run each test multiple times and measure the success rate. By default, we require a 95% success rate; if a test passes 19 out of 20 runs, it passes overall. This threshold is configurable per test, which has sparked interesting code-review discussions about what ‘consistent enough’ means for different features.
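The 19-out-of-20 arithmetic can be sketched as a small pure function. `meetsPassRate` is a hypothetical helper written for this example, not part of Jest or PromptFoo; it only mirrors the pass-rate comparison the matcher performs.

```typescript
// Hypothetical helper: a test passes when a large enough share of its runs pass.
function meetsPassRate(runResults: boolean[], threshold = 0.95): boolean {
  if (runResults.length === 0) return false; // no runs means no evidence
  const passCount = runResults.filter(Boolean).length;
  return passCount / runResults.length >= threshold;
}

// 20 runs with one failure: 19 / 20 = 0.95, which meets the default threshold.
const oneFailure = Array.from({ length: 20 }, (_, i) => i !== 0);
// 20 runs with two failures: 18 / 20 = 0.90, which does not.
const twoFailures = Array.from({ length: 20 }, (_, i) => i > 1);

console.log(meetsPassRate(oneFailure)); // true
console.log(meetsPassRate(twoFailures)); // false
console.log(meetsPassRate(twoFailures, 0.9)); // true, with a looser per-test threshold
```

With only 20 runs, the default threshold leaves room for exactly one failure, so the run count and the threshold are worth choosing together.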

This is actually a question I'm still thinking about: what success rate thresholds make sense for LLM testing? Is 95% too strict? Too lenient? Does it depend on the use case? (I suspect it does, but I'm curious what others have landed on.)

CI/CD Integration

We run these tests in a dedicated pipeline stage, separate from our regular unit tests. There are a few reasons for that:

They're slow (multiple runs of real LLM calls take time)
They're expensive (each test execution hits the LLM multiple times)
They need a different rollout approach (this is new territory for most of the team)

That last point is key. Currently, the stage is non-blocking; failures don’t prevent deployment. Instead, we report results to a Slack channel where the team can monitor performance and trends. This gives us space to build confidence in the solution gradually, expand test coverage, refine our success criteria, and get comfortable with non-deterministic testing before making it a hard gate.

Trade-offs and Learnings

There are definite costs and consequences to this approach:

Time: This pipeline stage is slower than our standard unit tests. Where regular tests complete in seconds, these can take minutes depending on the number of prompts and configured run counts.

Cost: Running multiple iterations of real LLM calls adds up - token usage isn’t free.

Wider Scope: These tests blur the line between technical and product decisions. Unlike traditional tests that verify logic, prompt tests encode expectations about quality, tone, and behaviour. This means their criteria probably shouldn't be defined by engineering alone; Product Owners have a role to play. It reminds me of behaviour-driven development (BDD), where acceptance criteria come from the business rather than the code. "The response should be professional and helpful" is less a technical specification and more a product requirement that happens to be testable with LLMs.

Flakiness: Even with high success rate thresholds, occasional failures still happen. We're still developing our understanding of when a failure indicates a real problem versus normal variance.
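One way to reason about normal variance is to ask how often a test would pass purely by chance. The sketch below assumes every run is independent and passes with the same probability, which is a simplification, and computes the binomial probability of reaching the threshold. The function names are hypothetical.

```typescript
// Binomial coefficient C(n, k), computed iteratively to avoid large factorials.
function choose(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability that at least `minPasses` of `runs` independent runs pass,
// when each run passes with probability `p`.
function probabilityTestPasses(p: number, runs: number, minPasses: number): number {
  let total = 0;
  for (let k = minPasses; k <= runs; k++) {
    total += choose(runs, k) * Math.pow(p, k) * Math.pow(1 - p, runs - k);
  }
  return total;
}

// 20 runs at a 95% threshold means at least 19 passes are required.
for (const p of [0.99, 0.98, 0.95, 0.9]) {
  const chance = probabilityTestPasses(p, 20, 19);
  console.log(`per-run pass rate ${p}: test passes ${(chance * 100).toFixed(1)}% of the time`);
}
```

Under those assumptions, a prompt that truly passes 98% of the time would still fail a 20-run, 95% test roughly 6% of the time, and a prompt that passes exactly 95% of the time would fail about a quarter of the time. A single red run is therefore weak evidence on its own; a trend across several pipeline runs says more.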

Despite these trade-offs, the benefits are clear. We've caught issues that would have shipped to production, and developers have more confidence making prompt changes knowing there's a safety net.

Takeaways for Others

If you're considering adding prompt testing, here's what we’ve learned so far:

Start small: Pick one or two critical prompts to test first. Don't try to achieve 100% coverage immediately. Focus on prompts where failures would have the highest impact.

Involve Product Owners: Prompt test criteria are product decisions as much as technical ones. Bring Product Owners into conversations about how we should be validating the prompt output from the user perspective. They'll help you define rubrics that reflect real user needs rather than engineering assumptions.

Make it easy to write tests: Whatever your implementation, minimise friction for developers. If writing a prompt test is significantly harder than writing a unit test, adoption will suffer.

Accept imperfection: Probabilistic success rates feel odd at first, but they’re a more honest model of how LLMs behave.

Balance cost versus confidence: You don't need to test every prompt on every commit. Be strategic about when tests run and how many iterations you perform.

Conclusion

LLMs are now core to many B2B products, but traditional testing doesn’t fit their probabilistic nature. Mocking still has its place for testing integration code, but it can't tell you if your prompts truly deliver the intended outcome.

Prompt integration testing isn't perfect - it's slower, more expensive, and fuzzier than regular testing. But it fills a critical gap. It can catch regressions before they reach production, give developers confidence to refactor prompts, and force us to articulate what "good" actually means for our AI features.

However you’re using AI, one principle holds true: if the product’s value depends on LLM outputs, you need to test those outputs directly.

Closures & Callstacks: Building a Game to Learn JavaScript Closures

Stuart Palmer — Sat, 15 Nov 2025 15:11:43 +0000

A practical exercise in learning closures by building a tiny idle game - no frameworks, just vanilla JavaScript.

Early in my development journey, I struggled with JavaScript closures. The concept felt abstract and slippery - I could read the definitions, but they didn't quite click. So I did what I often do when learning something new: I built a small project that forced me to use them extensively.

The result was Closures & Callstacks, a simple browser-based idle game where a party of adventurers battles a dragon. Built with nothing but vanilla HTML, CSS, and JavaScript - no frameworks, no libraries - it served its purpose: by structuring the entire application around factory functions and closures, I finally internalised how they work.

The Game

The premise is straightforward: you generate a party of three adventurers (fighters, wizards, and clerics, each with their own ASCII art representation), then watch them battle an ancient dragon in turn-based combat. Characters attack enemies or heal allies based on simple AI, with health bars updating in real time and a combat log narrating the action. Victory or defeat depends on whether your party can whittle down the dragon's health before it stomps and firebreathes your adventurers into oblivion.

The Core Pattern: Factory Functions

The game's architecture revolves around factory functions - functions that return objects with methods. These methods "close over" private variables, creating encapsulated state without needing classes or the new keyword.

Here's a simplified version of the health system:

function healthFunctions() {
  let maxHealth = 0;
  let currentHealth = 0;
  let isKO;

  return {
    setMaxHealth: function(maxHP) {
      maxHealth = maxHP;
      currentHealth = maxHealth;
      isKO = false;
      return maxHealth;
    },
    getCurrentHealth: function() {
      return currentHealth;
    },
    takeDamage: function(damage) {
      currentHealth = Math.max((currentHealth -= damage), 0);
      isKO = currentHealth === 0 ? true : false;
      return currentHealth;
    },
    healDamage: function(heal) {
      if (isKO) {
        isKO = false;
      }
      currentHealth = Math.min((currentHealth += heal), maxHealth);
      return currentHealth;
    },
    isKO: function() {
      return isKO;
    }
  };
}

The variables maxHealth, currentHealth, and isKO are truly private. There's no way to access them directly from outside the function - you can only interact with them through the returned methods. Each method maintains a reference to these variables through closure, even after healthFunctions() has finished executing.
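A quick way to see that privacy in action is to create two instances and poke at them. The sketch below uses a trimmed-down, hypothetical `createHealth` factory written for this example rather than the game's actual `healthFunctions`.

```typescript
// Trimmed-down health factory in the same style (only what this demo needs).
function createHealth(maxHP: number) {
  let currentHealth = maxHP; // private: reachable only through the closures below

  return {
    getCurrentHealth: () => currentHealth,
    takeDamage: (damage: number) => {
      currentHealth = Math.max(currentHealth - damage, 0);
      return currentHealth;
    },
  };
}

const fighterHealth = createHealth(30);
const wizardHealth = createHealth(12);

fighterHealth.takeDamage(8);

console.log(fighterHealth.getCurrentHealth()); // 22
console.log(wizardHealth.getCurrentHealth()); // 12, a separate closure with its own state
console.log('currentHealth' in fighterHealth); // false, the variable is not a property
```

Each call to the factory creates a fresh set of variables, so damaging the fighter never touches the wizard.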

Building Characters with Closures

The player factory follows the same pattern but at a larger scale:

function playerFunctions() {
  let _id;
  let _name;
  let _playerClass;
  let _allies = {};
  let _enemies = {};
  let _profBonus = 1;
  let playerAttack = {};
  let playerBuffs = {};

  const playerHealth = healthFunctions();

  return {
    init: function(name, playerClass, allies, enemies, id) {
      _id = id;
      _name = name;
      _allies = allies;
      _enemies = enemies;
      _playerClass = playerClass;
      playerHealth.setMaxHealth(calculateHP(playerClass));
      playerAttack = attackFunctions(_profBonus, _playerClass);
      playerBuffs = buffFunctions(_profBonus, _playerClass);
    },
    getName: function() {
      return _name;
    },
    get playerClass() {
      return _playerClass;
    },
    get health() {
      return playerHealth;
    },
    takeTurn: function() {
      if (this.health.isKO(true)) {
        return;
      }
      // ... game logic for taking actions
    }
  };
}

Each character created by playerFunctions() maintains its own private state. The playerHealth variable holds a reference to a health system (itself created by a factory function), and all the character's methods can access it through closure.
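The same idea can be shown end to end with two small factories, one calling the other. This is a hypothetical miniature written for this example (`createHealth`, `createCharacter` and `turnsTaken` are invented names), not the game's real code.

```typescript
// Hypothetical mini factories, showing one closure holding another.
function createHealth(maxHP: number) {
  let current = maxHP;
  return {
    get current() {
      return current; // reads the closed-over variable, not the property
    },
    takeDamage(damage: number) {
      current = Math.max(current - damage, 0);
    },
    isKO: () => current === 0,
  };
}

function createCharacter(name: string, maxHP: number) {
  const health = createHealth(maxHP); // each character gets its own health closure
  let turnsTaken = 0; // private per-character counter

  return {
    name,
    health,
    takeTurn(): string {
      if (health.isKO()) return `${name} is knocked out`;
      turnsTaken += 1;
      return `${name} takes turn ${turnsTaken}`;
    },
  };
}

const cleric = createCharacter('Cleric', 10);
const fighter = createCharacter('Fighter', 20);

cleric.health.takeDamage(15);
console.log(cleric.takeTurn()); // "Cleric is knocked out"
console.log(fighter.takeTurn()); // "Fighter takes turn 1"
console.log(fighter.takeTurn()); // "Fighter takes turn 2"
```

Here `takeTurn` reaches `health` through the closure rather than through `this`, so it keeps working even if the method is passed around on its own.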

What I Learned

Building this game made closures tangible for me. Instead of being an abstract concept, they became a practical tool for:

Data Privacy: No risk of external code accidentally modifying internal state
Encapsulation: Each character or system manages its own data
Composition: Factory functions can call other factory functions (like playerHealth = healthFunctions()) to build complex behaviours

Is this the most efficient way to structure a game? Probably not. The code has plenty of rough edges I'd refactor now (I’ve chosen to leave the code as-is for posterity!). But as a learning exercise, it was invaluable. By forcing myself to use closures everywhere - for health tracking, attack systems, buff management, and game state - I developed an intuitive understanding that stuck.

Sometimes the best way to learn a concept is to build something with it, even if that something is a simple dragon-fighting game with ASCII art characters.

🎮 Play the game | 💾 View source on GitHub