Callum Porter

Test Your LLM Like You Test Your UI

This tutorial was written for @llmassert/playwright v0.6.0.

You've built a chatbot. Your Playwright tests pass. But your users are reporting hallucinated answers — confident responses that sound right but are completely fabricated.

The problem? Your tests check that the chatbot responds, not that it responds correctly. A toContain assertion can't tell the difference between a grounded answer and a hallucination. You need assertions that actually understand the output.

@llmassert/playwright adds five LLM-powered matchers to Playwright's expect(), covering hallucinations, PII, tone, format, and semantic accuracy. Same test framework, same workflow, new superpowers.

In this tutorial, you'll go from zero to five working LLM assertions in about 10 minutes. No new framework to learn — if you know Playwright, you already know 90% of what you need.

One thing to know first: what "inconclusive" means

LLMAssert uses an LLM (GPT-5.4-mini by default) as a judge to evaluate your outputs. But LLM APIs can be slow or temporarily unavailable.

When the judge can't return a score, the result is inconclusive — and the test passes. This is by design: a provider outage should never block your CI pipeline.

Your test runs
    │
    ▼
Judge evaluates output
    │
    ├── Score ≥ threshold  →  PASS  ✓
    ├── Score < threshold  →  FAIL  ✗
    └── Judge unavailable  →  INCONCLUSIVE (passes) ≈

Every matcher returns { pass: boolean, score: number | null, reasoning: string }. The score ranges from 0.0 to 1.0 — or null if inconclusive. You get a numeric quality signal, not just pass/fail.
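
As a TypeScript sketch, that result shape looks roughly like this (the interface name is illustrative, not necessarily something the package exports):

interface LLMAssertResult {
  pass: boolean;        // did the score clear the threshold (or was the run inconclusive)?
  score: number | null; // judge score from 0.0 to 1.0, or null when inconclusive
  reasoning: string;    // plain-English explanation from the judge
}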

Running these examples costs less than a penny in API calls (GPT-5.4-mini pricing).

Setup (2 minutes)

Install the package:

pnpm add -D @llmassert/playwright
# or: npm install -D @llmassert/playwright

Create a .env file in your project root with your OpenAI API key:

OPENAI_API_KEY=your_openai_api_key_here

Make sure .env is in your .gitignore — Playwright projects usually have this already, but double-check before committing.

That's it. You're ready to write your first LLM assertion.

One import to change

Import test and expect from @llmassert/playwright instead of @playwright/test. This gives you the five LLM matchers plus the worker-scoped judge fixture. Your playwright.config.ts stays the same. The package ships both ESM and CJS, so require() works too.

// Before
import { test, expect } from "@playwright/test";

// After
import { test, expect } from "@llmassert/playwright";
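
If your project uses CommonJS, the same two imports work via require():

// CommonJS projects
const { test, expect } = require("@llmassert/playwright");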

Catching a hallucination

Here's a typical Playwright test that checks a chatbot response:

import { test, expect } from "@playwright/test";

test("chatbot answers FAQ correctly", async () => {
  const response = "Our return window is 90 days from purchase.";

  // This passes! But the response is wrong...
  expect(response).toContain("return");
});

The test passes because the word "return" appears in the response. But the actual return policy is 30 days. Your chatbot just hallucinated, and your test didn't catch it.

Now with LLMAssert:

import { test, expect } from "@llmassert/playwright";

test("chatbot answers FAQ correctly", async () => {
  const response = "Our return window is 90 days from purchase.";
  const faqDocs = "Returns accepted within 30 days. No restocking fee.";

  // This fails! The judge identifies the 90/30-day discrepancy.
  await expect(response).toBeGroundedIn(faqDocs);
});

Notice the await — LLMAssert matchers are async because they call a judge model. Standard Playwright matchers like toContain are synchronous and don't need await.
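
One pitfall worth spelling out: without the await, the assertion becomes a floating promise, so the test can finish before the judge responds and a failure may never surface.

// ✗ Floating promise: the test may end before the judge replies,
// and a failing judgment can go unreported.
// expect(response).toBeGroundedIn(faqDocs);

// ✓ Always await the LLM-powered matchers.
await expect(response).toBeGroundedIn(faqDocs);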

The toBeGroundedIn matcher sends both the response and your source context to the judge model, which checks every claim against the evidence. The "90 days" claim contradicts the "30 days" in the source docs — the test fails with a score and a plain-English explanation of what went wrong.

This is what makes LLM assertions different from regex or toContain: the judge understands meaning, not just string matching. It catches paraphrased hallucinations, subtle contradictions, and fabricated details that would sail through traditional assertions.
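
Here's that in action: a paraphrased contradiction with no keyword overlap, so toContain has nothing to latch onto. (This uses .not negation, covered at the end, to assert that the response is not grounded.)

import { test, expect } from "@llmassert/playwright";

test("paraphrased contradiction is still caught", async () => {
  const response = "Feel free to send items back any time within three months of buying.";
  const faqDocs = "Returns accepted within 30 days. No restocking fee.";

  // No shared keywords with the source, yet the judge reads "three
  // months" as contradicting the 30-day policy, so a plain
  // toBeGroundedIn assertion would fail. Flipped with .not, the
  // same judgment passes:
  await expect(response).not.toBeGroundedIn(faqDocs);
});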

The five matchers

toBeGroundedIn — catch hallucinations

Every claim in the output must be supported by the context you provide. Great for FAQ bots, RAG pipelines, and any system that should answer from source documents.

test("support answer is grounded in knowledge base", async () => {
  const response = "We offer a 30-day money-back guarantee on all plans.";
  const knowledgeBase = "All plans include a 30-day money-back guarantee. No questions asked.";

  await expect(response).toBeGroundedIn(knowledgeBase);
});

toBeFreeOfPII — detect personal information

Scans for names, emails, phone numbers, addresses, and more. A score of 1.0 means the text is clean; 0.0 means PII was definitely found.

test("support response does not leak customer PII", async () => {
  const response = "Your order #12345 has been shipped and should arrive Friday.";

  await expect(response).toBeFreeOfPII();
});

// Verify PII IS present (e.g., in a profile summary)
test("profile includes user details", async () => {
  const summary = "Account holder: Jane Smith, jane@example.com";

  await expect(summary).not.toBeFreeOfPII();
});

toMatchTone — enforce brand voice

Validates that text matches a natural-language tone descriptor. Use it to ensure your bot stays on-brand even when users are frustrated.

test("support replies stay professional under pressure", async () => {
  const response = "I understand your frustration. Let me look into this right away and find a solution for you.";

  await expect(response).toMatchTone("empathetic and solution-oriented");
});

toBeFormatCompliant — check output structure

Validates that text conforms to a described format. The schema parameter is a natural-language description, not a JSON Schema.

test("product description follows template", async () => {
  const description = "Introducing the CloudWidget Pro.\n\n- 99.9% uptime\n- Auto-scaling\n- 24/7 support\n\nStart your free trial today.";

  await expect(description).toBeFormatCompliant(
    "Three paragraphs: overview, key features as bullet list, call to action"
  );
});

toSemanticMatch — verify meaning preservation

Compares the semantic similarity between two texts. Great for testing translations, summaries, or rephrased content.

test("summary preserves key meaning", async () => {
  const original = "The quarterly revenue increased by 15% driven by strong demand in the enterprise segment.";
  const summary = "Revenue grew 15% this quarter, led by enterprise sales.";

  await expect(summary).toSemanticMatch(original);
});

Tuning thresholds

Every matcher uses a threshold (default: 0.7) to determine pass/fail. Override it inline:

// Strict grounding for medical content
await expect(response).toBeGroundedIn(context, { threshold: 0.95 });

// Relaxed matching for creative paraphrasing
await expect(summary).toSemanticMatch(reference, { threshold: 0.6 });

Why not just write a custom eval script?

You could call the OpenAI API directly from your tests and parse the response yourself. But you'd need to handle:

  • Fallback logic when the API is down (so your CI doesn't break)
  • Timeout handling that doesn't block your entire test suite
  • Score normalization across different prompt types
  • Result collection for tracking scores over time
  • Rate limiting to avoid burning through your API quota in parallel test runs

LLMAssert handles all of this out of the box, behind the same expect() interface you already use.
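
For a sense of what that takes, here's a minimal sketch of just the first two bullets, fallback and timeouts, if you hand-rolled them. Everything below is hypothetical: judgeWithOpenAI and judgeWithAnthropic stand in for the prompt building, API calls, and response parsing you'd also have to write.

// Hypothetical judge calls you'd implement yourself.
declare function judgeWithOpenAI(output: string, context: string): Promise<number>;
declare function judgeWithAnthropic(output: string, context: string): Promise<number>;

// Race a judge call against a timeout; treat errors and timeouts alike.
async function withTimeout(
  judge: (output: string, context: string) => Promise<number>,
  output: string,
  context: string,
  ms: number
): Promise<number | null> {
  try {
    return await Promise.race([
      judge(output, context),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error("judge timed out")), ms)
      ),
    ]);
  } catch {
    return null;
  }
}

// Try the primary judge, fall back to a second provider,
// and only then report inconclusive (null).
async function judgeScore(output: string, context: string): Promise<number | null> {
  const primary = await withTimeout(judgeWithOpenAI, output, context, 15_000);
  if (primary !== null) return primary;
  return withTimeout(judgeWithAnthropic, output, context, 15_000);
}

And that still leaves score normalization, result collection, and rate limiting unaccounted for.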

Tracking results across runs

The assertions work standalone — no account needed. But if you want to track scores over time, spot regressions, and share results with your team, add the optional dashboard reporter.

Add it to your playwright.config.ts:

import { defineConfig } from "@playwright/test";

export default defineConfig({
  reporter: [
    ["list"],
    [
      "@llmassert/playwright/reporter",
      {
        projectSlug: "my-project",
        apiKey: process.env.LLMASSERT_API_KEY,
      },
    ],
  ],
});

The reporter batches evaluation results and sends them to the LLMAssert dashboard after each test run. If the dashboard is unreachable, your tests still pass — the reporter defaults to onError: 'warn'.

Omit the apiKey to run in local-only mode with no network calls.

Understanding your API keys

The tutorial uses up to three environment variables. They serve different purposes:

  • OPENAI_API_KEY (from the OpenAI dashboard): powers the primary judge (GPT-5.4-mini). Required unless you use Anthropic only. If leaked: spend on your OpenAI account.
  • ANTHROPIC_API_KEY (from the Anthropic console): powers the fallback judge (Claude Haiku). Optional. If leaked: spend on your Anthropic account.
  • LLMASSERT_API_KEY (from the LLMAssert dashboard): sends results to the dashboard. Optional. If leaked: test data written to one project.

At least one of OPENAI_API_KEY or ANTHROPIC_API_KEY must be set. If neither is present, all assertions return inconclusive.

Adding the fallback judge

For resilience, you can add Claude Haiku as a fallback. If the primary model fails, the fallback takes over before marking results inconclusive.

pnpm add -D @anthropic-ai/sdk

# Add to your .env
ANTHROPIC_API_KEY=your_anthropic_api_key_here

The fallback activates automatically — no code changes needed.

OPENAI_API_KEY set? ──yes──▶ GPT-5.4-mini
                              │
                         success? ──yes──▶ return score
                              │
                              no
                              ▼
ANTHROPIC_API_KEY set? ──yes──▶ Claude Haiku
                              │
                         success? ──yes──▶ return score
                              │
                              no
                              ▼
                         inconclusive (test passes)

What to test next

You've seen how five matchers can catch issues that traditional assertions miss. Here are some ideas for your own test suite:

  • RAG pipelines: Use toBeGroundedIn with your retrieved documents as context. This is the single highest-value assertion for any retrieval-augmented generation system.
  • Customer-facing bots: Combine toBeFreeOfPII + toMatchTone for safety and brand compliance. Two matchers, one test, two failure modes caught.
  • Content generation: Use toBeFormatCompliant to enforce structured templates. Especially useful for outputs that feed downstream systems expecting specific formats.
  • Multilingual features: Use toSemanticMatch to validate translations and summaries. A back-translation pattern (translate, then translate back, then compare to the original) works surprisingly well as a quality signal; see the sketch after this list.
  • Regression testing: Run the same assertions across prompt versions to see if score distributions shift. The dashboard reporter makes this visual.
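
Here's a minimal sketch of that back-translation pattern. The translate helper is hypothetical, a stand-in for whatever translation function your app exposes; only toSemanticMatch comes from the package.

import { test, expect } from "@llmassert/playwright";

// Hypothetical app helper: swap in your real translation function.
declare function translate(text: string, targetLang: string): Promise<string>;

test("German translation survives a round trip", async () => {
  const original = "Your subscription renews on the first of each month.";

  const german = await translate(original, "de");
  const backToEnglish = await translate(german, "en");

  // If meaning survived the round trip, the judge should score the
  // back-translated text as semantically equivalent to the original.
  await expect(backToEnglish).toSemanticMatch(original);
});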

All five matchers support .not negation — useful when you want to assert that creative output is not grounded in a template, or that a response does contain specific user details.
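
A quick sketch of the first case (the template and copy strings are invented for illustration):

import { test, expect } from "@llmassert/playwright";

test("creative copy goes beyond the template", async () => {
  const template = "Product: CloudWidget Pro. Features: uptime, auto-scaling, support.";
  const copy = "Meet CloudWidget Pro: for teams who refuse to babysit their servers.";

  // Passes when the copy makes claims the template doesn't support,
  // i.e. when it is NOT fully grounded in the template.
  await expect(copy).not.toBeGroundedIn(template);
});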

The package is MIT-licensed and free to use. Check out the documentation, browse the source on GitHub, or install it now:

pnpm add -D @llmassert/playwright

Built by the LLMAssert team. Star us on GitHub if this was helpful!
