If you're shipping LLM-powered features, you've probably done this:
Changed "Summarize this:" to "Brief summary of:" — deployed it — and quietly broke three downstream behaviours you didn't know existed.
No test caught it. No CI step failed. It just went out, and your users found the regression before you did.
This is the prompt testing problem. And it's the same problem we solved for regular code 20 years ago with unit tests.
The problem: prompts are code, but we don't test them like code
When you change a function, you run your test suite. If something breaks, the pipeline fails and the change doesn't ship.
When you change a prompt, you... eyeball it? Run it manually a few times? Hope for the best?
Most teams are shipping prompt changes blind. The consequences are subtle and delayed — a customer support bot that stopped following tone guidelines, a summariser that now hallucinates dates, a classifier that changed its output format and broke the parsing downstream.
You don't find out until production.
What prompt testing actually looks like
Here's what a basic prompt test looks like with Phasio:
```typescript
// phasio/summariser.test.ts
import { describe, pe, contains, notContains, llmJudge } from '@phasio/sdk';

describe('Summariser prompt', () => {
  pe.test('produces a summary', {
    input: 'The 2008 financial crisis was triggered by the collapse of mortgage-backed securities.',
    expect: contains('financial'),
  });

  pe.test('does not include disclaimers', {
    input: 'Explain what a CDO is.',
    expect: notContains('I cannot provide'),
  });

  pe.test('quality: clear and concise', {
    input: 'Explain async/await in JavaScript.',
    expect: llmJudge('Clear explanation suitable for a mid-level developer. No filler. Under 100 words.'),
  });
});
```
If you've written Jest tests before, this is already familiar. That's intentional.
Setting up Phasio in 5 minutes
1. Install the SDK
```shell
npm install @phasio/sdk
```
2. Create your config file
```typescript
// phasio.config.ts
import { defineConfig } from '@phasio/sdk';

export default defineConfig({
  providers: {
    openai: {
      apiKey: process.env.OPENAI_API_KEY,
      model: 'gpt-4o-mini',
    },
    anthropic: {
      apiKey: process.env.ANTHROPIC_API_KEY,
      model: 'claude-haiku-4-5-20251001',
    },
  },
  judges: ['openai', 'anthropic'], // Multi-judge: averages scores across both
});
```
3. Write your first test file
Create a phasio/ folder at the root of your project. Any file matching *.test.ts inside it will be picked up automatically.
4. Run it
```shell
npx phasio
```
Phasio discovers all test files, runs them, and outputs a summary. Exit code 0 on pass, exit code 1 on failure.
The three validator types
contains(string) — checks the output includes a substring. Good for format compliance, required keywords, expected response structure.
```typescript
pe.test('includes a call to action', {
  input: userMessage,
  expect: contains('contact us'),
});
```
notContains(string) — checks the output does not include a substring. Good for preventing hallucinated phrases, blocked content, legacy prompt artifacts.
```typescript
pe.test('no apology language', {
  input: userMessage,
  expect: notContains('I apologise'),
});
```
llmJudge(criteria) — uses an LLM to score the output against a natural-language quality criterion. Returns a score between 0 and 1. Fails if the score drops below your threshold.
```typescript
pe.test('tone matches brand voice', {
  input: userMessage,
  expect: llmJudge('Professional but approachable. No corporate jargon. Reads like a senior engineer wrote it.'),
});
```
When you configure multiple judges (e.g. GPT-4o-mini + Claude Haiku), Phasio averages their scores. This reduces single-model scoring bias — one model's quirks don't determine your pass/fail.
Multi-provider testing in one run
One of the real-world use cases for Phasio: you're evaluating whether to switch from GPT-4o to Claude. Run the same test suite against both providers simultaneously.
```typescript
// phasio.config.ts
export default defineConfig({
  providers: {
    openai: {
      apiKey: process.env.OPENAI_API_KEY,
      model: 'gpt-4o',
    },
    anthropic: {
      apiKey: process.env.ANTHROPIC_API_KEY,
      model: 'claude-sonnet-4-6',
    },
  },
});
```
One command, two providers, side-by-side results. No manual switching. No separate test scripts.
Adding Phasio to GitHub Actions
This is the part that turns prompt testing from a local habit into a hard gate.
```yaml
# .github/workflows/prompt-tests.yml
name: Prompt Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  prompt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install
      - run: npx phasio
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
Now every PR that changes a prompt runs your eval suite. If the quality score drops or a contains check fails, the PR is blocked. Same discipline as unit tests — for prompts.
What to test first
If you're starting from zero, don't try to write comprehensive test coverage immediately. Pick three things:
1. Format compliance — Does the output follow the structure your downstream code expects? If you're parsing JSON out of an LLM response, test that it's actually valid JSON.
2. Hard exclusions — Are there things the output should never say? Test those with notContains.
3. One quality gate — Pick your most critical prompt and write one llmJudge test for it. Something like: "Answers the question asked. Does not hallucinate. Under 150 words."
Three tests is better than zero. Ship those first, then expand coverage over time as you see what actually regresses.
The payoff
The goal isn't to write tests for tests' sake. It's to make this workflow possible:
Engineer opens a PR to update the system prompt → CI runs `npx phasio` → all tests pass → PR merges with confidence.
Instead of:
Engineer updates prompt → deploys → waits → user reports that the chatbot is now giving wrong answers.
Prompt testing isn't new. Teams doing serious LLM work have been doing variants of this manually for a while. Phasio just makes it as easy as writing Jest tests.
Get started
- SDK (MIT): `npm install @phasio/sdk`
- GitHub: github.com/YagneshKhamar/phasio
- Dashboard + docs: phasio.dev
Questions or feedback? Drop them in the comments. I'm especially interested in which validators people feel are missing.