Yagnesh Khamar
How to test LLM prompts in CI/CD (and stop breaking production)

If you're shipping LLM-powered features, you've probably done this:

Changed "Summarize this:" to "Brief summary of:" — deployed it — and quietly broke three downstream behaviours you didn't know existed.

No test caught it. No CI step failed. It just went out, and your users found the regression before you did.

This is the prompt testing problem. And it's the same problem we solved for regular code 20 years ago with unit tests.


The problem: prompts are code, but we don't test them like code

When you change a function, you run your test suite. If something breaks, the pipeline fails and the change doesn't ship.

When you change a prompt, you... eyeball it? Run it manually a few times? Hope for the best?

Most teams are shipping prompt changes blind. The consequences are subtle and delayed — a customer support bot that stopped following tone guidelines, a summariser that now hallucinates dates, a classifier that changed its output format and broke the parsing downstream.

You don't find out until production.


What prompt testing actually looks like

Here's what a basic prompt test looks like with Phasio:

```typescript
// phasio/summariser.test.ts
import { describe, pe, contains, notContains, llmJudge } from '@phasio/sdk';

describe('Summariser prompt', () => {

  pe.test('produces a summary', {
    input: 'The 2008 financial crisis was triggered by the collapse of mortgage-backed securities.',
    expect: contains('financial'),
  });

  pe.test('does not include disclaimers', {
    input: 'Explain what a CDO is.',
    expect: notContains('I cannot provide'),
  });

  pe.test('quality: clear and concise', {
    input: 'Explain async/await in JavaScript.',
    expect: llmJudge('Clear explanation suitable for a mid-level developer. No filler. Under 100 words.'),
  });

});
```

If you've written Jest tests before, this is already familiar. That's intentional.


Setting up Phasio in 5 minutes

1. Install the SDK

```shell
npm install @phasio/sdk
```

2. Create your config file

```typescript
// phasio.config.ts
import { defineConfig } from '@phasio/sdk';

export default defineConfig({
  providers: {
    openai: {
      apiKey: process.env.OPENAI_API_KEY,
      model: 'gpt-4o-mini',
    },
    anthropic: {
      apiKey: process.env.ANTHROPIC_API_KEY,
      model: 'claude-haiku-4-5-20251001',
    },
  },
  judges: ['openai', 'anthropic'], // Multi-judge: averages scores across both
});
```

3. Write your first test file

Create a phasio/ folder at the root of your project. Any file matching *.test.ts inside it will be picked up automatically.
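With the config above and one test file, the layout looks something like this (`classifier.test.ts` is just an illustrative second file):

```
your-project/
├── phasio/
│   ├── summariser.test.ts   # picked up automatically (*.test.ts)
│   └── classifier.test.ts   # illustrative second test file
├── phasio.config.ts
└── package.json
```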

4. Run it

```shell
npx phasio
```

Phasio discovers all test files, runs them, and outputs a summary. Exit code 0 on pass, exit code 1 on failure.


The three validator types

contains(string) — checks the output includes a substring. Good for format compliance, required keywords, expected response structure.

```typescript
pe.test('includes a call to action', {
  input: userMessage,
  expect: contains('contact us'),
});
```

notContains(string) — checks the output does not include a substring. Good for preventing hallucinated phrases, blocked content, legacy prompt artifacts.

```typescript
pe.test('no apology language', {
  input: userMessage,
  expect: notContains('I apologise'),
});
```

llmJudge(criteria) — uses an LLM to score the output against a natural-language quality criterion. Returns a score between 0 and 1. Fails if the score drops below your threshold.

```typescript
pe.test('tone matches brand voice', {
  input: userMessage,
  expect: llmJudge('Professional but approachable. No corporate jargon. Reads like a senior engineer wrote it.'),
});
```

When you configure multiple judges (e.g. GPT-4o-mini + Claude Haiku), Phasio averages their scores. This reduces single-model scoring bias — one model's quirks don't determine your pass/fail.
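The averaging itself is simple to reason about. Here is a minimal sketch of the idea — the function names and threshold handling are illustrative, not Phasio internals:

```typescript
// Illustrative sketch of multi-judge averaging.
// averageScore and passes are hypothetical names, not Phasio APIs.

/** Average the 0–1 scores returned by each configured judge. */
function averageScore(judgeScores: number[]): number {
  const sum = judgeScores.reduce((acc, s) => acc + s, 0);
  return sum / judgeScores.length;
}

/** A test passes only if the averaged score clears the threshold. */
function passes(judgeScores: number[], threshold: number): boolean {
  return averageScore(judgeScores) >= threshold;
}

// One harsh judge (0.4) is offset by a lenient one (0.9): the average
// of 0.65 clears a 0.6 threshold where 0.4 alone would not.
```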


Multi-provider testing in one run

One of the real-world use cases for Phasio: you're evaluating whether to switch from GPT-4o to Claude. Run the same test suite against both providers simultaneously.

```typescript
// phasio.config.ts
import { defineConfig } from '@phasio/sdk';

export default defineConfig({
  providers: {
    openai: {
      apiKey: process.env.OPENAI_API_KEY,
      model: 'gpt-4o',
    },
    anthropic: {
      apiKey: process.env.ANTHROPIC_API_KEY,
      model: 'claude-sonnet-4-6',
    },
  },
});
```

One command, two providers, side-by-side results. No manual switching. No separate test scripts.


Adding Phasio to GitHub Actions

This is the part that turns prompt testing from a local habit into a hard gate.

```yaml
# .github/workflows/prompt-tests.yml
name: Prompt Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  prompt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install
      - run: npx phasio
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Now every PR that changes a prompt runs your eval suite. If the quality score drops or a contains check fails, the PR is blocked. Same discipline as unit tests — for prompts.


What to test first

If you're starting from zero, don't try to write comprehensive test coverage immediately. Pick three things:

1. Format compliance — Does the output follow the structure your downstream code expects? If you're parsing JSON out of an LLM response, test that it's actually valid JSON.

2. Hard exclusions — Are there things the output should never say? Test those with notContains.

3. One quality gate — Pick your most critical prompt and write one llmJudge test for it. Something like: "Answers the question asked. Does not hallucinate. Under 150 words."

Three tests is better than zero. Ship those first, then expand coverage over time as you see what actually regresses.
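For the JSON case in point 1, the check itself is framework-agnostic. A minimal sketch — `isValidJson` and `stripCodeFences` are illustrative helpers, not Phasio APIs:

```typescript
// Illustrative helper for format-compliance checks: verifies that a
// model response parses as JSON. Not part of the Phasio SDK.
function isValidJson(output: string): boolean {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}

// LLMs often wrap JSON in markdown fences; strip them before parsing.
function stripCodeFences(output: string): string {
  return output
    .replace(/^```(?:json)?\s*/i, '')
    .replace(/\s*```$/, '')
    .trim();
}
```

In practice you would run this against the captured model output, alongside the contains/notContains checks above.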


The payoff

The goal isn't to write tests for tests' sake. It's to make this workflow possible:

Engineer opens a PR to update the system prompt → CI runs npx phasio → all tests pass → PR merges with confidence.

Instead of:

Engineer updates prompt → deploys → waits → user reports that the chatbot is now giving wrong answers.

Prompt testing isn't new. Teams doing serious LLM work have been doing variants of this manually for a while. Phasio just makes it as easy as writing Jest tests.


Get started

Questions or feedback — drop them in the comments. Especially interested in what validators people feel are missing.
