<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Callum Porter</title>
    <description>The latest articles on DEV Community by Callum Porter (@cporter97).</description>
    <link>https://dev.to/cporter97</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1132910%2F72d9655d-8478-4169-a1e3-f581ea6bc24b.png</url>
      <title>DEV Community: Callum Porter</title>
      <link>https://dev.to/cporter97</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cporter97"/>
    <language>en</language>
    <item>
      <title>Test Your LLM Like You Test Your UI</title>
      <dc:creator>Callum Porter</dc:creator>
      <pubDate>Wed, 08 Apr 2026 02:41:30 +0000</pubDate>
      <link>https://dev.to/cporter97/test-your-llm-like-you-test-your-ui-1113</link>
      <guid>https://dev.to/cporter97/test-your-llm-like-you-test-your-ui-1113</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This tutorial was written for &lt;code&gt;@llmassert/playwright&lt;/code&gt; v0.6.0.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You've built a chatbot. Your Playwright tests pass. But your users are reporting hallucinated answers — confident responses that sound right but are completely fabricated.&lt;/p&gt;

&lt;p&gt;The problem? Your tests check that the chatbot &lt;em&gt;responds&lt;/em&gt;, not that it responds &lt;em&gt;correctly&lt;/em&gt;. A &lt;code&gt;toContain&lt;/code&gt; assertion can't tell the difference between a grounded answer and a hallucination. You need assertions that actually understand the output.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@llmassert/playwright&lt;/code&gt; adds five LLM-powered matchers to Playwright's &lt;code&gt;expect()&lt;/code&gt; that check for hallucinations, PII, tone, format, and semantic accuracy. Same test framework, same workflow, new superpowers.&lt;/p&gt;

&lt;p&gt;In this tutorial, you'll go from zero to five working LLM assertions in about 10 minutes. No new framework to learn — if you know Playwright, you already know 90% of what you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  One thing to know first: what "inconclusive" means
&lt;/h2&gt;

&lt;p&gt;LLMAssert uses an LLM (GPT-5.4-mini by default) as a &lt;strong&gt;judge&lt;/strong&gt; to evaluate your outputs. But LLM APIs can be slow or temporarily unavailable.&lt;/p&gt;

&lt;p&gt;When the judge can't return a score, the result is &lt;strong&gt;inconclusive&lt;/strong&gt; — and the test &lt;strong&gt;passes&lt;/strong&gt;. This is by design: a provider outage should never block your CI pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your test runs
    │
    ▼
Judge evaluates output
    │
    ├── Score ≥ threshold  →  PASS  ✓
    ├── Score &amp;lt; threshold  →  FAIL  ✗
    └── Judge unavailable  →  INCONCLUSIVE (passes) ≈
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every matcher returns &lt;code&gt;{ pass: boolean, score: number | null, reasoning: string }&lt;/code&gt;. The score ranges from 0.0 to 1.0 — or &lt;code&gt;null&lt;/code&gt; if inconclusive. You get a numeric quality signal, not just pass/fail.&lt;/p&gt;
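&lt;p&gt;To make the three outcomes concrete, here is a minimal sketch of the decision logic in plain TypeScript. It is not LLMAssert's implementation: &lt;code&gt;classify&lt;/code&gt;, &lt;code&gt;toResult&lt;/code&gt;, and the &lt;code&gt;Outcome&lt;/code&gt; type are hypothetical names used only for illustration, and the result shape is the one quoted above.&lt;/p&gt;

```typescript
// Hypothetical model of the result shape described above.
type MatcherResult = {
  pass: boolean;
  score: number | null; // null when the judge could not return a score
  reasoning: string;
};

type Outcome = "pass" | "fail" | "inconclusive";

// Illustrative only: maps a judge score and a threshold to one of the three outcomes.
function classify(score: number | null, threshold: number): Outcome {
  if (score === null) return "inconclusive"; // judge unavailable
  return score >= threshold ? "pass" : "fail";
}

// An inconclusive result still counts as a passing assertion.
function toResult(score: number | null, threshold: number, reasoning: string): MatcherResult {
  const outcome = classify(score, threshold);
  return { pass: outcome !== "fail", score, reasoning };
}
```

&lt;p&gt;The point of the sketch is the last function: only a score below the threshold produces &lt;code&gt;pass: false&lt;/code&gt;, so a &lt;code&gt;null&lt;/code&gt; score never fails a test.&lt;/p&gt;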

&lt;p&gt;Running these examples costs less than a penny in API calls (GPT-5.4-mini pricing).&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup (2 minutes)
&lt;/h2&gt;

&lt;p&gt;Install the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm add &lt;span class="nt"&gt;-D&lt;/span&gt; @llmassert/playwright
&lt;span class="c"&gt;# or: npm install -D @llmassert/playwright&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in your project root with your OpenAI API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_openai_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure &lt;code&gt;.env&lt;/code&gt; is in your &lt;code&gt;.gitignore&lt;/code&gt; — Playwright projects usually have this already, but double-check before committing.&lt;/p&gt;

&lt;p&gt;That's it. You're ready to write your first LLM assertion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One import to change.&lt;/strong&gt; Import &lt;code&gt;test&lt;/code&gt; and &lt;code&gt;expect&lt;/code&gt; from &lt;code&gt;@llmassert/playwright&lt;/code&gt; instead of &lt;code&gt;@playwright/test&lt;/code&gt;. This gives you the five LLM matchers plus the worker-scoped judge fixture. Your &lt;code&gt;playwright.config.ts&lt;/code&gt; stays the same. The package ships both ESM and CJS — &lt;code&gt;require()&lt;/code&gt; works too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// After&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@llmassert/playwright&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Catching a hallucination
&lt;/h2&gt;

&lt;p&gt;Here's a typical Playwright test that checks a chatbot response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;chatbot answers FAQ correctly&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Our return window is 90 days from purchase.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// This passes! But the response is wrong...&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;return&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test passes because the word "return" appears in the response. But the actual return policy is 30 days. Your chatbot just hallucinated, and your test didn't catch it.&lt;/p&gt;

&lt;p&gt;Now with LLMAssert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@llmassert/playwright&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;chatbot answers FAQ correctly&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Our return window is 90 days from purchase.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;faqDocs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Returns accepted within 30 days. No restocking fee.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// This fails! The judge identifies the 90/30-day discrepancy.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeGroundedIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;faqDocs&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;await&lt;/code&gt;: LLMAssert matchers are async because they call a judge model. Playwright's generic value matchers like &lt;code&gt;toContain&lt;/code&gt; are synchronous and don't need &lt;code&gt;await&lt;/code&gt;; its web-first locator assertions such as &lt;code&gt;toBeVisible&lt;/code&gt; do.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;toBeGroundedIn&lt;/code&gt; matcher sends both the response and your source context to the judge model, which checks every claim against the evidence. The "90 days" claim contradicts the "30 days" in the source docs — the test fails with a score and a plain-English explanation of what went wrong.&lt;/p&gt;

&lt;p&gt;This is what makes LLM assertions different from regex or &lt;code&gt;toContain&lt;/code&gt;: the judge understands meaning, not just string matching. It catches paraphrased hallucinations, subtle contradictions, and fabricated details that would sail through traditional assertions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five matchers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;toBeGroundedIn&lt;/code&gt; — catch hallucinations
&lt;/h3&gt;

&lt;p&gt;Every claim in the output must be supported by the context you provide. Great for FAQ bots, RAG pipelines, and any system that should answer from source documents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support answer is grounded in knowledge base&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;We offer a 30-day money-back guarantee on all plans.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;knowledgeBase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;All plans include a 30-day money-back guarantee. No questions asked.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeGroundedIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;knowledgeBase&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;toBeFreeOfPII&lt;/code&gt; — detect personal information
&lt;/h3&gt;

&lt;p&gt;Scans for names, emails, phone numbers, addresses, and more. A score of 1.0 means the text is clean; 0.0 means PII was definitely found.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support response does not leak customer PII&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Your order #12345 has been shipped and should arrive Friday.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeFreeOfPII&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Verify PII IS present (e.g., in a profile summary)&lt;/span&gt;
&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;profile includes user details&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Account holder: Jane Smith, jane@example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toBeFreeOfPII&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;toMatchTone&lt;/code&gt; — enforce brand voice
&lt;/h3&gt;

&lt;p&gt;Validates that text matches a natural-language tone descriptor. Use it to ensure your bot stays on-brand even when users are frustrated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support replies stay professional under pressure&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I understand your frustration. Let me look into this right away and find a solution for you.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toMatchTone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;empathetic and solution-oriented&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;toBeFormatCompliant&lt;/code&gt; — check output structure
&lt;/h3&gt;

&lt;p&gt;Validates that text conforms to a described format. The schema parameter is a &lt;strong&gt;natural-language description&lt;/strong&gt;, not a JSON Schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;product description follows template&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Introducing the CloudWidget Pro.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;- 99.9% uptime&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;- Auto-scaling&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;- 24/7 support&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Start your free trial today.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeFormatCompliant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Three paragraphs: overview, key features as bullet list, call to action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;toSemanticMatch&lt;/code&gt; — verify meaning preservation
&lt;/h3&gt;

&lt;p&gt;Compares the semantic similarity between two texts. Great for testing translations, summaries, or rephrased content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summary preserves key meaning&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The quarterly revenue increased by 15% driven by strong demand in the enterprise segment.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Revenue grew 15% this quarter, led by enterprise sales.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toSemanticMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;original&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tuning thresholds
&lt;/h2&gt;

&lt;p&gt;Every matcher uses a threshold (default: 0.7) to determine pass/fail. Override it inline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strict grounding for medical content&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeGroundedIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Relaxed matching for creative paraphrasing&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toSemanticMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reference&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why not just write a custom eval script?
&lt;/h2&gt;

&lt;p&gt;You could call the OpenAI API directly from your tests and parse the response yourself. But you'd need to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fallback logic&lt;/strong&gt; when the API is down (so your CI doesn't break)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout handling&lt;/strong&gt; that doesn't block your entire test suite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score normalization&lt;/strong&gt; across different prompt types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result collection&lt;/strong&gt; for tracking scores over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt; to avoid burning through your API quota in parallel test runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLMAssert handles all of this out of the box, behind the same &lt;code&gt;expect()&lt;/code&gt; interface you already use.&lt;/p&gt;
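&lt;p&gt;To give a sense of what just one of those items involves, here is a rough sketch of timeout handling that degrades to an inconclusive result instead of throwing. It is an assumption-laden illustration, not LLMAssert's code: &lt;code&gt;fakeJudge&lt;/code&gt; stands in for a real provider call, and &lt;code&gt;scoreWithTimeout&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```typescript
// Hypothetical stand-in for a real judge call; resolves to a score after delayMs.
async function fakeJudge(output: string, delayMs: number) {
  await new Promise((resolve) => setTimeout(resolve, delayMs));
  return output.length > 0 ? 0.9 : 0.0;
}

// Races the judge against a timer. On timeout or judge error the score is
// null (inconclusive) rather than a thrown error, so the suite keeps running.
async function scoreWithTimeout(
  judge: typeof fakeJudge,
  output: string,
  judgeDelayMs: number,
  timeoutMs: number
) {
  let cancel = () => {};
  const timeout = new Promise((resolve) => {
    const timer = setTimeout(() => resolve(null), timeoutMs);
    cancel = () => clearTimeout(timer);
  });
  try {
    const winner = await Promise.race([judge(output, judgeDelayMs), timeout]);
    return typeof winner === "number" ? winner : null;
  } catch {
    return null; // the judge rejected, for example because the provider is down
  } finally {
    cancel(); // stop the timeout timer once the race is settled
  }
}
```

&lt;p&gt;Even this small helper has to decide what a timeout means, clean up its timer, and treat provider errors the same way as slowness. Multiply that by the other four items and the custom script stops being small.&lt;/p&gt;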

&lt;h2&gt;
  
  
  Tracking results across runs
&lt;/h2&gt;

&lt;p&gt;The assertions work standalone — no account needed. But if you want to track scores over time, spot regressions, and share results with your team, add the optional dashboard reporter.&lt;/p&gt;

&lt;p&gt;Add it to your &lt;code&gt;playwright.config.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineConfig&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineConfig&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;reporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;list&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@llmassert/playwright/reporter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;projectSlug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-project&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LLMASSERT_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reporter batches evaluation results and sends them to the &lt;a href="https://llmassert.com" rel="noopener noreferrer"&gt;LLMAssert dashboard&lt;/a&gt; after each test run. If the dashboard is unreachable, your tests still pass — the reporter defaults to &lt;code&gt;onError: 'warn'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Omit the &lt;code&gt;apiKey&lt;/code&gt; to run the reporter in local-only mode, with no network calls to the dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding your API keys
&lt;/h3&gt;

&lt;p&gt;The tutorial uses up to three environment variables. They serve different purposes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;If leaked&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OPENAI_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.openai.com/api-keys" rel="noopener noreferrer"&gt;OpenAI dashboard&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Powers the primary judge (GPT-5.4-mini). Required unless using Anthropic only.&lt;/td&gt;
&lt;td&gt;Spend on your OpenAI account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://console.anthropic.com/" rel="noopener noreferrer"&gt;Anthropic console&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Powers the fallback judge (Claude Haiku). Optional.&lt;/td&gt;
&lt;td&gt;Spend on your Anthropic account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LLMASSERT_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://llmassert.com" rel="noopener noreferrer"&gt;LLMAssert dashboard&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sends results to the dashboard. Optional.&lt;/td&gt;
&lt;td&gt;Test data written to one project&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Set at least one of &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; or &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;. If neither is present, every assertion returns inconclusive, which means every test passes without actually being evaluated.&lt;/p&gt;
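&lt;p&gt;Because inconclusive results pass, a missing key is easy to overlook. One possible guard, sketched here under the assumption that you want CI to fail fast in that case, is a small check you call yourself before the suite runs. &lt;code&gt;configuredJudges&lt;/code&gt; and &lt;code&gt;requireJudge&lt;/code&gt; are hypothetical helpers, not part of the package; only the two variable names come from the table above.&lt;/p&gt;

```typescript
// Illustrative helpers (not part of LLMAssert).
type Env = { [name: string]: string | undefined };

// Lists which judge providers have a key configured.
function configuredJudges(env: Env): string[] {
  const judges: string[] = [];
  if (env["OPENAI_API_KEY"]) judges.push("openai");
  if (env["ANTHROPIC_API_KEY"]) judges.push("anthropic");
  return judges;
}

// Throws when no judge key is set, instead of letting every assertion
// quietly pass as inconclusive.
function requireJudge(env: Env): void {
  if (configuredJudges(env).length === 0) {
    throw new Error("Set OPENAI_API_KEY or ANTHROPIC_API_KEY before running LLM assertions.");
  }
}
```

&lt;p&gt;A call such as &lt;code&gt;requireJudge(process.env)&lt;/code&gt; could live in a Playwright global setup file. Whether to fail or merely warn is a team decision; the library's own default is to let the run pass.&lt;/p&gt;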

&lt;h3&gt;
  
  
  Adding the fallback judge
&lt;/h3&gt;

&lt;p&gt;For resilience, you can add Claude Haiku as a fallback. If the primary model fails, the fallback is tried before the result is marked inconclusive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm add &lt;span class="nt"&gt;-D&lt;/span&gt; @anthropic-ai/sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add to your .env&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_anthropic_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fallback activates automatically — no code changes needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY set? ──yes──▶ GPT-5.4-mini
                              │
                         success? ──yes──▶ return score
                              │
                              no
                              ▼
ANTHROPIC_API_KEY set? ──yes──▶ Claude Haiku
                              │
                         success? ──yes──▶ return score
                              │
                              no
                              ▼
                         inconclusive (test passes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
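&lt;p&gt;The flow in the diagram can be modelled in a few lines. This is a simplified sketch, not the library's internals: &lt;code&gt;primaryJudge&lt;/code&gt; and &lt;code&gt;downJudge&lt;/code&gt; are hypothetical stand-ins for provider calls, and a judge whose key is missing is modelled by simply leaving it out of the list.&lt;/p&gt;

```typescript
// Hypothetical judge for illustration; a real one would call a provider API.
async function primaryJudge(output: string) {
  return output.length > 0 ? 0.8 : 0.0;
}

type Judge = typeof primaryJudge;

// Simulates a provider outage.
const downJudge: Judge = async () => {
  throw new Error("provider unavailable");
};

// Tries each configured judge in order and returns the first score.
// A null result means every judge failed, which maps to "inconclusive".
async function scoreWithFallback(judges: Judge[], output: string) {
  for (const judge of judges) {
    try {
      return await judge(output);
    } catch {
      // This judge failed; move on to the next one, if any.
    }
  }
  return null;
}
```

&lt;p&gt;With &lt;code&gt;[downJudge, primaryJudge]&lt;/code&gt; the second judge supplies the score; with an empty list or only failing judges the result is &lt;code&gt;null&lt;/code&gt;, matching the bottom branch of the diagram.&lt;/p&gt;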



&lt;h2&gt;
  
  
  What to test next
&lt;/h2&gt;

&lt;p&gt;You've seen how five matchers can catch issues that traditional assertions miss. Here are some ideas for your own test suite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG pipelines&lt;/strong&gt;: Use &lt;code&gt;toBeGroundedIn&lt;/code&gt; with your retrieved documents as context. This is the single highest-value assertion for any retrieval-augmented generation system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer-facing bots&lt;/strong&gt;: Combine &lt;code&gt;toBeFreeOfPII&lt;/code&gt; + &lt;code&gt;toMatchTone&lt;/code&gt; for safety and brand compliance. Two matchers, one test, two failure modes caught.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content generation&lt;/strong&gt;: Use &lt;code&gt;toBeFormatCompliant&lt;/code&gt; to enforce structured templates. Especially useful for outputs that feed downstream systems expecting specific formats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual features&lt;/strong&gt;: Use &lt;code&gt;toSemanticMatch&lt;/code&gt; to validate translations and summaries. A back-translation pattern (translate, then translate back, then compare to the original) works surprisingly well as a quality signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression testing&lt;/strong&gt;: Run the same assertions across prompt versions to see if score distributions shift. The dashboard reporter makes this visual.&lt;/li&gt;
&lt;/ul&gt;
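&lt;p&gt;For the regression-testing idea, a comparison between two prompt versions might look like the sketch below. It assumes you have already collected the matcher scores for each version yourself; &lt;code&gt;meanScore&lt;/code&gt;, &lt;code&gt;hasRegressed&lt;/code&gt;, and the &lt;code&gt;tolerance&lt;/code&gt; parameter are hypothetical and not something the package provides.&lt;/p&gt;

```typescript
// Mean of the conclusive scores; null if every result was inconclusive.
function meanScore(scores: (number | null)[]): number | null {
  const conclusive = scores.filter((s): s is number => s !== null);
  if (conclusive.length === 0) return null;
  return conclusive.reduce((sum, s) => sum + s, 0) / conclusive.length;
}

// Flags a regression when the candidate mean drops by more than `tolerance`.
function hasRegressed(
  baseline: (number | null)[],
  candidate: (number | null)[],
  tolerance: number
): boolean {
  const before = meanScore(baseline);
  const after = meanScore(candidate);
  if (before === null || after === null) return false; // nothing to compare
  return before - after > tolerance;
}
```

&lt;p&gt;Inconclusive results are skipped rather than counted as zero, so a provider outage during one run does not look like a quality drop.&lt;/p&gt;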

&lt;p&gt;All five matchers support &lt;code&gt;.not&lt;/code&gt; negation — useful when you want to assert that creative output is &lt;em&gt;not&lt;/em&gt; grounded in a template, or that a response &lt;em&gt;does&lt;/em&gt; contain specific user details.&lt;/p&gt;

&lt;p&gt;The package is MIT-licensed and free to use. Check out the &lt;a href="https://docs.llmassert.com" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, browse the &lt;a href="https://github.com/llm-assert/llm-assert" rel="noopener noreferrer"&gt;source on GitHub&lt;/a&gt;, or install it now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm add &lt;span class="nt"&gt;-D&lt;/span&gt; @llmassert/playwright
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Built by the LLMAssert team. &lt;a href="https://github.com/llm-assert/llm-assert" rel="noopener noreferrer"&gt;Star us on GitHub&lt;/a&gt; if this was helpful!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>playwright</category>
      <category>javascript</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
