DEV Community

Germán Gordón
I built an npm library to test AI chatbots with Playwright — here's why normal matchers don't work

If you're building a product with an AI chatbot, you've probably run into this:

await expect(response).toContainText('The Pro plan costs $49/month');

This breaks constantly. LLMs never return the exact same string twice.

The problem

Traditional matchers assume deterministic output. AI responses are:

  • Semantically equivalent but textually different every run
  • Sometimes helpful, sometimes hallucinating
  • Hard to validate with toEqual() or toContainText()

You end up either skipping the assertion entirely, or writing brittle string checks that fail on every deploy.
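To make the failure mode concrete, here's a minimal illustration. The two response strings are invented for the example, not captured model output:

```typescript
// Two hypothetical runs of the same chatbot prompt. Same meaning,
// different wording — illustrative strings, not real model output.
const runA = 'The Pro plan costs $49/month.';
const runB = 'Pro is priced at $49 per month.';

// The exact-substring assertion from above, as a plain check:
const expected = 'The Pro plan costs $49/month';
console.log(runA.includes(expected)); // true  — passes this run
console.log(runB.includes(expected)); // false — same meaning, test breaks
```

Both answers are correct, but only one survives an exact-substring check.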

What I built

playwright-ai-matchers — a library that uses Claude Haiku under the hood to evaluate AI responses semantically inside your Playwright tests.

import { test, expect } from '@playwright/test';
import 'playwright-ai-matchers';

test('AI chatbot responds correctly to a billing question', async ({ page }) => {
  await page.goto('https://your-app.com/chat');
  await page.locator('#user-input').fill('How much does the Pro plan cost?');
  await page.locator('#send-button').click();
  await page.locator('.ai-response').waitFor();

  const response = await page.locator('.ai-response').textContent();

  await expect(response).toMeanSomethingAbout('billing and pricing');
  await expect(response).toSatisfy('should mention a specific price or redirect to the pricing page');
  await expect(response).not.toHallucinate('Our Pro plan is $49/month. Enterprise has custom pricing.');
  await expect(response).toBeHelpful();
});

The 4 matchers

toMeanSomethingAbout(topic)

Checks whether the response meaningfully engages with a topic. Vague responses fail — if the chatbot says "We comply with all applicable laws" for a data privacy question, that's a fail.

await expect(response).toMeanSomethingAbout('refund policy');
await expect(response).not.toMeanSomethingAbout('competitor products');

toSatisfy(criterion)

Evaluates the response against a plain-language requirement. Read literally — no partial credit.

await expect(response).toSatisfy('should mention a specific price');
await expect(response).toSatisfy('must not recommend any third-party service');

toHallucinate(context)

Detects whether the response invents facts not present in the provided context. Typically used with .not to guard against hallucination.

await expect(response).not.toHallucinate(
  'Our Pro plan is $49/month. Enterprise has custom pricing.'
);

toBeHelpful()

Checks whether the response is genuinely useful — not an error message, a flat refusal, or an empathy-without-substance reply like "I understand this is frustrating. Let me know if I can help."

await expect(response).toBeHelpful();
await expect(errorFallback).not.toBeHelpful();

How it works

Each matcher sends the response to Claude Haiku with a carefully crafted evaluation prompt and gets back { pass: boolean, reason: string }. One API call per matcher. That's it.

When a test fails, you see exactly why:

Error: Expected response to mean something about "billing and pricing", but it didn't.
Reason: The response discusses general greetings and does not address billing or pricing topics.
Received: Hi! I'm here to help. What would you like to know today?
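For intuition, here's a rough sketch of how a matcher in this style could be wired up. Everything below — the `judge()` helper, the prompt wording, the matcher body — is an assumption about the shape of such an implementation, not the library's actual code; a real `judge()` would make the single Claude Haiku API call and parse the model's JSON reply instead of returning a canned string.

```typescript
type Verdict = { pass: boolean; reason: string };

// Hypothetical judge: in a real implementation this would be one call to
// Claude Haiku whose reply is parsed into { pass, reason }. Stubbed with
// a canned JSON string here so the sketch is self-contained and runnable.
async function judge(_evaluationPrompt: string): Promise<Verdict> {
  const modelReply =
    '{"pass": false, "reason": "The response discusses greetings, not billing."}';
  return JSON.parse(modelReply) as Verdict;
}

// Sketch of one matcher; the evaluation prompt text is invented.
async function toMeanSomethingAbout(received: string, topic: string) {
  const prompt = [
    `Does the response below meaningfully engage with the topic "${topic}"?`,
    'Vague or evasive answers do not count.',
    `Response: ${received}`,
    'Reply as JSON: { "pass": boolean, "reason": string }',
  ].join('\n');

  const verdict = await judge(prompt);
  return {
    pass: verdict.pass,
    message: () =>
      `Expected response to mean something about "${topic}", but it didn't.\n` +
      `Reason: ${verdict.reason}\nReceived: ${received}`,
  };
}
```

Hooking something like this into tests would typically go through `expect.extend({ toMeanSomethingAbout })` in a setup file, which is Playwright's documented mechanism for custom matchers.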

Install

npm install playwright-ai-matchers @anthropic-ai/sdk
export ANTHROPIC_API_KEY=sk-ant-...

Then import once in your playwright.config.ts:

import './node_modules/playwright-ai-matchers/dist';

Or per test file:

import 'playwright-ai-matchers';

Full TypeScript support included — all 4 matchers appear in autocomplete on expect().
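That autocomplete presumably comes from declaration merging on Playwright's `Matchers` interface. A sketch of what the type augmentation could look like, with signatures inferred from the examples above rather than copied from the library's actual `.d.ts`:

```typescript
// Assumed shape of the type augmentation — illustrative only. Playwright
// documents this `PlaywrightTest.Matchers` pattern for custom matchers.
declare global {
  namespace PlaywrightTest {
    interface Matchers<R, T> {
      toMeanSomethingAbout(topic: string): R;
      toSatisfy(criterion: string): R;
      toHallucinate(context: string): R;
      toBeHelpful(): R;
    }
  }
}

export {};
```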

What's in v2.2.0

Just shipped an internal improvement: leaner evaluation prompts (~250–350 tokens vs ~600+ per call) with tighter rules for edge cases like empty context, vague responses, and empathy-without-substance replies. Same API, better accuracy, lower cost.

Would love feedback

Especially if you're testing AI products and hitting this problem — or if you're using a different approach entirely.
