If you're building a product with an AI chatbot, you've probably run into this:
```ts
await expect(response).toContainText('The Pro plan costs $49/month');
```
This breaks constantly. LLMs rarely return exactly the same string twice.
## The problem
Traditional matchers assume deterministic output. AI responses are:

- Semantically equivalent but textually different every run
- Sometimes helpful, sometimes hallucinating
- Hard to validate with `toEqual()` or `toContainText()`
You end up either skipping the assertion entirely, or writing brittle string checks that fail on every deploy.
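To be concrete, the usual fallback is a looser regex, which is still brittle. A hypothetical example:

```ts
// A brittle regex check: matches "$49/month" and "$49 per month", but fails
// the moment the model says "forty-nine dollars a month" or
// "the monthly price is $49."
await expect(response).toMatch(/\$49(\.00)?\s*(\/|per)\s*month/i);
```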
## What I built

`playwright-ai-matchers` is a library that uses Claude Haiku under the hood to evaluate AI responses semantically inside your Playwright tests.
```ts
import { test, expect } from '@playwright/test';
import 'playwright-ai-matchers';

test('AI chatbot responds correctly to a billing question', async ({ page }) => {
  await page.goto('https://your-app.com/chat');
  await page.locator('#user-input').fill('How much does the Pro plan cost?');
  await page.locator('#send-button').click();
  await page.locator('.ai-response').waitFor();

  const response = await page.locator('.ai-response').textContent();

  await expect(response).toMeanSomethingAbout('billing and pricing');
  await expect(response).toSatisfy('should mention a specific price or redirect to the pricing page');
  await expect(response).not.toHallucinate('Our Pro plan is $49/month. Enterprise has custom pricing.');
  await expect(response).toBeHelpful();
});
```
## The 4 matchers
### `toMeanSomethingAbout(topic)`
Checks whether the response meaningfully engages with a topic. Vague responses fail: if the chatbot says "We comply with all applicable laws" for a data privacy question, that's a fail.
```ts
await expect(response).toMeanSomethingAbout('refund policy');
await expect(response).not.toMeanSomethingAbout('competitor products');
```
### `toSatisfy(criterion)`
Evaluates the response against a plain-language requirement. The criterion is read literally, with no partial credit.
```ts
await expect(response).toSatisfy('should mention a specific price');
await expect(response).toSatisfy('must not recommend any third-party service');
```
### `toHallucinate(context)`
Detects whether the response invents facts not present in the provided context. Typically used with `.not` to guard against hallucination.
```ts
await expect(response).not.toHallucinate(
  'Our Pro plan is $49/month. Enterprise has custom pricing.'
);
```
### `toBeHelpful()`
Checks whether the response is genuinely useful: not an error message, a flat refusal, or an empathy-without-substance reply like "I understand this is frustrating. Let me know if I can help."
```ts
await expect(response).toBeHelpful();
await expect(errorFallback).not.toBeHelpful();
```
## How it works
Each matcher sends the response to Claude Haiku with a carefully crafted evaluation prompt and gets back `{ pass: boolean, reason: string }`. One API call per matcher. That's it.
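For intuition, here's a minimal sketch of how one such matcher could be wired up with Playwright's `expect.extend` and the Anthropic SDK. The model id, prompt wording, and `judge` helper are my assumptions, not the library's actual internals:

```ts
import { expect as baseExpect } from '@playwright/test';
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // picks up ANTHROPIC_API_KEY from the environment

// Ask Haiku to judge the response; the prompt instructs it to answer
// with strict JSON so the verdict can be parsed directly.
async function judge(prompt: string): Promise<{ pass: boolean; reason: string }> {
  const msg = await client.messages.create({
    model: 'claude-3-5-haiku-latest', // assumed model id
    max_tokens: 256,
    messages: [{ role: 'user', content: prompt }],
  });
  const block = msg.content[0];
  return JSON.parse(block.type === 'text' ? block.text : '{}');
}

export const expect = baseExpect.extend({
  async toMeanSomethingAbout(received: string, topic: string) {
    const { pass, reason } = await judge(
      `Does this response meaningfully engage with the topic "${topic}"? ` +
        `Answer with JSON only: {"pass": boolean, "reason": string}\n\n` +
        `Response: ${received}`,
    );
    return {
      pass,
      message: () =>
        `Expected response to mean something about "${topic}".\n` +
        `Reason: ${reason}\nReceived: ${received}`,
    };
  },
});
```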
When a test fails, you see exactly why:
```
Error: Expected response to mean something about "billing and pricing", but it didn't.
Reason: The response discusses general greetings and does not address billing or pricing topics.
Received: Hi! I'm here to help. What would you like to know today?
```
## Install
```bash
npm install playwright-ai-matchers @anthropic-ai/sdk
export ANTHROPIC_API_KEY=sk-ant-...
```
Then import once in your `playwright.config.ts`:
```ts
import './node_modules/playwright-ai-matchers/dist';
```
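For context, here's where that line might sit in a minimal config (the `testDir` value is a placeholder, not something the library requires):

```ts
// playwright.config.ts
import { defineConfig } from '@playwright/test';
import './node_modules/playwright-ai-matchers/dist'; // registers the matchers once

export default defineConfig({
  testDir: './tests', // placeholder; keep your own settings here
});
```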
Or per test file:
```ts
import 'playwright-ai-matchers';
```
Full TypeScript support is included; all 4 matchers appear in autocomplete on `expect()`.
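That autocomplete is typically achieved with a type augmentation along these lines. This is a sketch of the standard Playwright pattern, not the library's actual declaration file:

```ts
// Sketch: augmenting Playwright's matcher interface so the custom
// matchers type-check on expect(). Names mirror the library's API;
// the .d.ts shipped by playwright-ai-matchers may differ.
declare global {
  namespace PlaywrightTest {
    interface Matchers<R> {
      toMeanSomethingAbout(topic: string): Promise<R>;
      toSatisfy(criterion: string): Promise<R>;
      toHallucinate(context: string): Promise<R>;
      toBeHelpful(): Promise<R>;
    }
  }
}

export {};
```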
## What's in v2.2.0
Just shipped an internal improvement: leaner evaluation prompts (~250–350 tokens vs ~600+ per call) with tighter rules for edge cases like empty context, vague responses, and empathy-without-substance replies. Same API, better accuracy, lower cost.
## Would love feedback
Especially if you're testing AI products and hitting this problem, or if you're using a different approach entirely.