If you're building a product with an AI chatbot, you've probably run into this:
```ts
await expect(response).toContainText('The Pro plan costs $49/month');
```
This breaks constantly. LLMs rarely return exactly the same string twice.
## The problem
Traditional matchers assume deterministic output. AI responses are:

- Semantically equivalent but textually different every run
- Sometimes helpful, sometimes hallucinating
- Hard to validate with `toEqual()` or `toContainText()`
You end up either skipping the assertion entirely, or writing brittle string checks that fail on every deploy.
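To be concrete, the usual fallback is a looser regex, which is still brittle. A hypothetical example:

```ts
// A brittle regex check: matches "$49/month" and "$49 per month", but fails
// the moment the model says "forty-nine dollars a month" or
// "the monthly price is $49."
await expect(response).toMatch(/\$49(\.00)?\s*(\/|per)\s*month/i);
```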
## What I built

`playwright-ai-matchers` is a library that uses Claude Haiku under the hood to evaluate AI responses semantically inside your Playwright tests.
```ts
import { test, expect } from '@playwright/test';
import 'playwright-ai-matchers';

test('AI chatbot responds correctly to a billing question', async ({ page }) => {
  await page.goto('https://your-app.com/chat');
  await page.locator('#user-input').fill('How much does the Pro plan cost?');
  await page.locator('#send-button').click();
  await page.locator('.ai-response').waitFor();

  const response = await page.locator('.ai-response').textContent();

  await expect(response).toMeanSomethingAbout('billing and pricing');
  await expect(response).toSatisfy('should mention a specific price or redirect to the pricing page');
  await expect(response).not.toHallucinate('Our Pro plan is $49/month. Enterprise has custom pricing.');
  await expect(response).toBeHelpful();
});
```
## The 4 matchers
### `toMeanSomethingAbout(topic)`
Checks whether the response meaningfully engages with a topic. Vague responses fail: if the chatbot says "We comply with all applicable laws" for a data privacy question, that's a fail.
```ts
await expect(response).toMeanSomethingAbout('refund policy');
await expect(response).not.toMeanSomethingAbout('competitor products');
```
### `toSatisfy(criterion)`
Evaluates the response against a plain-language requirement. The criterion is read literally, with no partial credit.
```ts
await expect(response).toSatisfy('should mention a specific price');
await expect(response).toSatisfy('must not recommend any third-party service');
```
### `toHallucinate(context)`
Detects whether the response invents facts not present in the provided context. Typically used with `.not` to guard against hallucination.
```ts
await expect(response).not.toHallucinate(
  'Our Pro plan is $49/month. Enterprise has custom pricing.'
);
```
### `toBeHelpful()`
Checks whether the response is genuinely useful: not an error message, a flat refusal, or an empathy-without-substance reply like "I understand this is frustrating. Let me know if I can help."
```ts
await expect(response).toBeHelpful();
await expect(errorFallback).not.toBeHelpful();
```
## How it works
Each matcher sends the response to Claude Haiku with a carefully crafted evaluation prompt and gets back `{ pass: boolean, reason: string }`. One API call per matcher. That's it.
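For intuition, here's a minimal sketch of how one such matcher could be wired up with Playwright's `expect.extend` and the Anthropic SDK. The model id, prompt wording, and `judge` helper are my assumptions, not the library's actual internals:

```ts
import { expect as baseExpect } from '@playwright/test';
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // picks up ANTHROPIC_API_KEY from the environment

// Ask Haiku to judge the response; the prompt instructs it to answer
// with strict JSON so the verdict can be parsed directly.
async function judge(prompt: string): Promise<{ pass: boolean; reason: string }> {
  const msg = await client.messages.create({
    model: 'claude-3-5-haiku-latest', // assumed model id
    max_tokens: 256,
    messages: [{ role: 'user', content: prompt }],
  });
  const block = msg.content[0];
  return JSON.parse(block.type === 'text' ? block.text : '{}');
}

export const expect = baseExpect.extend({
  async toMeanSomethingAbout(received: string, topic: string) {
    const { pass, reason } = await judge(
      `Does this response meaningfully engage with the topic "${topic}"? ` +
        `Answer with JSON only: {"pass": boolean, "reason": string}\n\n` +
        `Response: ${received}`,
    );
    return {
      pass,
      message: () =>
        `Expected response to mean something about "${topic}".\n` +
        `Reason: ${reason}\nReceived: ${received}`,
    };
  },
});
```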
When a test fails, you see exactly why:
```
Error: Expected response to mean something about "billing and pricing", but it didn't.
Reason: The response discusses general greetings and does not address billing or pricing topics.
Received: Hi! I'm here to help. What would you like to know today?
```
## Install
```bash
npm install playwright-ai-matchers @anthropic-ai/sdk
export ANTHROPIC_API_KEY=sk-ant-...
```
Then import once in your `playwright.config.ts`:
```ts
import './node_modules/playwright-ai-matchers/dist';
```
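For context, here's where that line might sit in a minimal config (the `testDir` value is a placeholder, not something the library requires):

```ts
// playwright.config.ts
import { defineConfig } from '@playwright/test';
import './node_modules/playwright-ai-matchers/dist'; // registers the matchers once

export default defineConfig({
  testDir: './tests', // placeholder; keep your own settings here
});
```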
Or per test file:
```ts
import 'playwright-ai-matchers';
```
Full TypeScript support is included; all 4 matchers appear in autocomplete on `expect()`.
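That autocomplete is typically achieved with a type augmentation along these lines. This is a sketch of the standard Playwright pattern, not the library's actual declaration file:

```ts
// Sketch: augmenting Playwright's matcher interface so the custom
// matchers type-check on expect(). Names mirror the library's API;
// the .d.ts shipped by playwright-ai-matchers may differ.
declare global {
  namespace PlaywrightTest {
    interface Matchers<R> {
      toMeanSomethingAbout(topic: string): Promise<R>;
      toSatisfy(criterion: string): Promise<R>;
      toHallucinate(context: string): Promise<R>;
      toBeHelpful(): Promise<R>;
    }
  }
}

export {};
```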
## What's in v2.2.0
Just shipped an internal improvement: leaner evaluation prompts (~250–350 tokens vs ~600+ per call) with tighter rules for edge cases like empty context, vague responses, and empathy-without-substance replies. Same API, better accuracy, lower cost.
## Would love feedback
Especially if you're testing AI products and hitting this problem, or if you're using a different approach entirely.