Runtime Snapshots #9 — Semantic Regression Detection: When Deploys Break UX, Not Tests

Your E2E tests passed. CI is green. Deploy went through.

And now checkout is broken because a new banner covers the "Pay" button on mobile.

The Problem

Traditional testing catches functional breaks. Button doesn't click? Test fails.

But what about:

  • New hero section pushes products below the fold
  • Chat widget overlaps "Add to Cart" on tablet
  • CMS update breaks grid layout
  • A/B test variant hides critical CTA
  • Third-party script (analytics, ads) covers checkout form
  • Responsive breakpoint works on desktop, broken on mobile

These aren't functional bugs. Tests stay green. The page just doesn't work anymore.

Why Existing Tools Fail

Screenshot diff: Every pixel change = alert. Designer tweaks padding? 500 false positives. Team ignores alerts. Real issues slip through.

E2E tests: Check if button exists and clicks. Don't check if button is visible, accessible, not covered by promo banner.

Manual QA: Doesn't scale. Misses edge cases. "Works on my machine."

Semantic State Monitoring

Instead of comparing pixels or running click tests, compare what an LLM understands about your page:

Deploy 1: "E-commerce PDP. Product image, price, 'Add to Cart' button prominent. Checkout accessible."
Deploy 2: "E-commerce PDP. Product image, price, 'Add to Cart' button prominent. Checkout accessible."
Deploy 3: "E-commerce PDP. Large promo banner. 'Add to Cart' partially hidden. Checkout requires scroll."
          ↑
          REGRESSION: Primary action degraded

The LLM doesn't check pixels. It checks whether the page still does its job.
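A naive version just compares the two most recent descriptions. A minimal sketch, assuming the same hypothetical llm.chat client used in the snippets below:

// Naive pairwise check: compare the newest snapshot against the previous one
async function naiveCompare(previousState, currentState) {
  const reply = await llm.chat({
    prompt: `Two semantic snapshots of the same page, from consecutive deploys.

Previous: ${previousState}
Current: ${currentState}

Did the page's primary actions or content hierarchy degrade?
Reply with JSON only: {"regression": boolean, "reason": "..."}`
  });
  return JSON.parse(reply);
}

This works until harmless wording drift between two snapshots starts to look like a change, which is exactly the next problem.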

Handling LLM Non-Determinism

LLMs aren't deterministic. Same page, slightly different wording. "12 products" vs "showing 12 items."

Solution: Moving window context.

Instead of comparing current vs previous, feed the LLM recent history:

const stateWindow = [];
const WINDOW_SIZE = 4;

async function checkRegression(currentState) {
  stateWindow.push(currentState);
  if (stateWindow.length > WINDOW_SIZE) stateWindow.shift();
  if (stateWindow.length < 2) return null;

  const reply = await llm.chat({
    prompt: `You're monitoring a web page for UX regressions. 
Recent semantic snapshots (oldest to newest):

${stateWindow.map((s, i) => `[${i + 1}]: ${s}`).join('\n')}

Questions:
1. Is the latest snapshot a regression from the established baseline?
2. Are primary actions (CTAs, forms, checkout) still accessible and prominent?
3. Is any critical UI element hidden, pushed off-screen, or covered?

Reply with JSON only: {"regression": boolean, "severity": "critical" | "warning" | "none", "reason": "..."}`
  });

  // Assumes the model replies with a single parseable JSON object
  return JSON.parse(reply);
}

Now the LLM sees the pattern. Minor wording variations dissolve. Real regressions stand out.

Why Salience Changes Everything

Most "AI monitoring" solutions do this:

Page → LLM → "figure it out"

We do this:

Page → SiFR (structure + relations + salience) → LLM (interpretation, not discovery)

The model doesn't weigh every part of the DOM equally. SiFR assigns salience scores to elements before the LLM sees them. High-salience elements (CTAs, forms, primary content) dominate the semantic state. Low-salience elements (footers, decorations, cookie banners) are effectively ignored.

This is why CSS tweaks don't trigger alerts, but "button covered by banner" does.

| Element | Salience | LLM Treatment |
| --- | --- | --- |
| Checkout button | 95% | Critical — visibility change = regression |
| Product grid | 88% | Important — pushed off-screen = warning |
| Promo banner | 70% | Monitor — alert if it occludes a high-salience element |
| Footer links | 15% | Ignored |
| Cookie consent | 12% | Ignored |

We don't ask the model what matters — we tell it.
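In practice that means the LLM context is built from only the elements that clear a salience bar. A minimal sketch, assuming the SiFR payload exposes per-element fields like salience, role, and visible (illustrative names, not the exact schema):

// Project the SiFR capture down to what the LLM should reason about.
// Field names are illustrative, not the exact SiFR schema.
const SALIENCE_THRESHOLD = 0.6;

function projectForLLM(sifr) {
  return sifr.elements
    .filter((el) => el.salience >= SALIENCE_THRESHOLD)
    .map((el) => ({
      role: el.role,         // e.g. "button", "form", "product-grid"
      label: el.label,       // visible text or accessible name
      salience: el.salience,
      visible: el.visible,   // occluded / off-screen elements surface here
    }));
}

Footers and cookie banners never reach the prompt, which is also what keeps the prompt-injection surface small later on.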

What This Catches

| Issue | E2E Tests | Visual Diff | Semantic |
| --- | --- | --- | --- |
| Button covered by new banner | ❌ Pass | ⚠️ Alert (among 50 others) | ✅ "CTA occluded" |
| Products pushed below fold | ❌ Pass | ❌ Pass | ✅ "Primary content degraded" |
| Mobile layout broken | ❌ Pass (if desktop-only) | ⚠️ Noise | ✅ "Responsive regression" |
| Third-party widget overlap | ❌ Pass | ⚠️ Noise | ✅ "External element occludes checkout" |
| CMS broke grid | ❌ Pass | ⚠️ Alert flood | ✅ "Layout structure changed" |
| A/B test hides CTA | ❌ Pass | ❌ Different baseline | ✅ "Variant missing primary action" |

Bonus: Security Layer

Same approach catches malicious changes:

  • Defacement: High-salience content replaced → instant alert
  • Phishing overlay: New high-salience form over login → "Anomaly: duplicate auth form"
  • Content injection: Suspicious iframe/script in critical area → flagged

Because the LLM reads a projection of the page (pre-weighted by salience), not raw HTML:

  • Injected instructions in low-salience areas = ignored
  • Prompt injection surface = minimal

Security monitoring as a free add-on to your QA pipeline.
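The same moving-window prompt can carry the security questions, so there's no separate scan to maintain. A sketch, reusing the stateWindow and the hypothetical llm.chat client from above:

// Ask security questions against the same window of semantic snapshots
// (reuses the stateWindow and llm.chat assumptions from checkRegression)
async function checkAnomaly(snapshots) {
  if (snapshots.length < 2) return null;

  const reply = await llm.chat({
    prompt: `Recent semantic snapshots of the same page (oldest to newest):

${snapshots.map((s, i) => `[${i + 1}]: ${s}`).join('\n')}

1. Was any high-salience content replaced or defaced?
2. Did a new form appear that duplicates an existing auth or payment flow?
3. Is there unexpected third-party content in a critical area?

Reply with JSON only: {"anomaly": boolean, "type": "defacement" | "phishing" | "injection" | "none", "reason": "..."}`
  });
  return JSON.parse(reply);
}

Call it right after checkRegression, passing the same stateWindow.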

Implementation

// Playwright + Element-to-LLM
async function getSemanticState(page, viewport) {
  await page.setViewportSize(viewport); // Test multiple breakpoints

  const sifr = await page.evaluate(() => {
    return new Promise((resolve) => {
      document.addEventListener('e2llm-capture-response', (e) => {
        resolve(e.detail.data);
      }, { once: true });
      document.dispatchEvent(new CustomEvent('e2llm-capture-request', {
        detail: { selector: 'body', options: { preset: 'minimal' } }
      }));
    });
  });

  return await llm.chat({
    prompt: `Describe this page's functional state:
1. Primary actions available (buttons, forms, CTAs)
2. Content hierarchy (what's prominent vs hidden)
3. Any UI issues (overlaps, off-screen elements, broken layout)

Be consistent. Same functional state = same description.`,
    context: JSON.stringify(sifr)
  });
}

// Check critical viewports
const viewports = [
  { width: 1920, height: 1080, name: 'desktop' },
  { width: 768, height: 1024, name: 'tablet' },
  { width: 375, height: 667, name: 'mobile' }
];

for (const vp of viewports) {
  const state = await getSemanticState(page, vp);
  const result = await checkRegression(state);

  if (result?.regression) {
    console.error(`[${vp.name}] ${result.severity}: ${result.reason}`); // or route to your alerting channel
  }
}

When To Run

| Trigger | Use Case |
| --- | --- |
| Post-deploy | Catch regressions before users see them |
| Scheduled (hourly) | Third-party script changes, CMS updates |
| Pre-merge (staging) | PR review with semantic diff |
| Multi-viewport | Responsive regression detection |
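For the post-deploy trigger, the whole check fits in one small script that CI runs after the deploy job and that fails the pipeline on critical findings. A sketch, assuming the getSemanticState and checkRegression functions above, a browser context with the Element-to-LLM extension already available, and a PAGE_URL environment variable:

// post-deploy-check.mjs: exits non-zero so the CI step fails on critical regressions.
// Assumes getSemanticState / checkRegression from above and that the browser
// context can serve the e2llm capture events.
import { chromium } from 'playwright';

const viewports = [
  { width: 1920, height: 1080, name: 'desktop' },
  { width: 375, height: 667, name: 'mobile' }
];

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto(process.env.PAGE_URL);

let failed = false;
for (const vp of viewports) {
  const state = await getSemanticState(page, vp);
  const result = await checkRegression(state);
  if (result?.regression && result.severity === 'critical') {
    console.error(`[${vp.name}] ${result.severity}: ${result.reason}`);
    failed = true;
  }
}

await browser.close();
process.exit(failed ? 1 : 0);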

Try It

  1. Install Element-to-LLM extension
  2. Integrate with Playwright
  3. Add your LLM
  4. Run on critical pages post-deploy

Your tests check if code works. This checks if users can use it.


Series Index:


Running this in your pipeline? Share your experience in the comments.


Tags: #webdev #frontend #testing #qa #devops
