Runtime Snapshots #9 — Semantic Regression Detection: When Deploys Break UX, Not Tests

Your E2E tests passed. CI is green. Deploy went through.

And now checkout is broken because a new banner covers the "Pay" button on mobile.

The Problem

Traditional testing catches functional breaks. Button doesn't click? Test fails.

But what about:

  • New hero section pushes products below the fold
  • Chat widget overlaps "Add to Cart" on tablet
  • CMS update breaks grid layout
  • A/B test variant hides critical CTA
  • Third-party script (analytics, ads) covers checkout form
  • Responsive breakpoint works on desktop, broken on mobile

These aren't functional bugs. Tests stay green. The page just doesn't work anymore.

Why Existing Tools Fail

Screenshot diff: Every pixel change = alert. Designer tweaks padding? 500 false positives. Team ignores alerts. Real issues slip through.

E2E tests: Check if button exists and clicks. Don't check if button is visible, accessible, not covered by promo banner.

Manual QA: Doesn't scale. Misses edge cases. "Works on my machine."

Semantic State Monitoring

Instead of comparing pixels or running click tests, compare what an LLM understands about your page:

Deploy 1: "E-commerce PDP. Product image, price, 'Add to Cart' button prominent. Checkout accessible."
Deploy 2: "E-commerce PDP. Product image, price, 'Add to Cart' button prominent. Checkout accessible."
Deploy 3: "E-commerce PDP. Large promo banner. 'Add to Cart' partially hidden. Checkout requires scroll."
          ↑
          REGRESSION: Primary action degraded

The LLM doesn't check pixels. It checks whether the page still does its job.
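A naive version just compares the two most recent descriptions. A minimal sketch, assuming the same hypothetical llm.chat client used in the snippets below:

// Naive pairwise check: compare the newest snapshot against the previous one
async function naiveCompare(previousState, currentState) {
  const reply = await llm.chat({
    prompt: `Two semantic snapshots of the same page, from consecutive deploys.

Previous: ${previousState}
Current: ${currentState}

Did the page's primary actions or content hierarchy degrade?
Reply with JSON only: {"regression": boolean, "reason": "..."}`
  });
  return JSON.parse(reply);
}

This works until harmless wording drift between two snapshots starts to look like a change, which is exactly the next problem.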

Handling LLM Non-Determinism

LLMs aren't deterministic. Same page, slightly different wording. "12 products" vs "showing 12 items."

Solution: Moving window context.

Instead of comparing current vs previous, feed the LLM recent history:

const stateWindow = [];
const WINDOW_SIZE = 4;

async function checkRegression(currentState) {
  stateWindow.push(currentState);
  if (stateWindow.length > WINDOW_SIZE) stateWindow.shift();
  if (stateWindow.length < 2) return null;

  const reply = await llm.chat({
    prompt: `You're monitoring a web page for UX regressions. 
Recent semantic snapshots (oldest to newest):

${stateWindow.map((s, i) => `[${i + 1}]: ${s}`).join('\n')}

Questions:
1. Is the latest snapshot a regression from the established baseline?
2. Are primary actions (CTAs, forms, checkout) still accessible and prominent?
3. Is any critical UI element hidden, pushed off-screen, or covered?

Reply with JSON only: {"regression": boolean, "severity": "critical" | "warning" | "none", "reason": "..."}`
  });

  // Assumes the model replies with a single parseable JSON object
  return JSON.parse(reply);
}

Now the LLM sees the pattern. Minor wording variations dissolve. Real regressions stand out.

Why Salience Changes Everything

Most "AI monitoring" solutions do this:

Page → LLM → "figure it out"

We do this:

Page → SiFR (structure + relations + salience) → LLM (interpretation, not discovery)

The model doesn't weigh every part of the DOM equally. SiFR assigns salience scores to elements before the LLM sees them. High-salience elements (CTAs, forms, primary content) dominate the semantic state. Low-salience elements (footers, decorations, cookie banners) are effectively ignored.

This is why CSS tweaks don't trigger alerts, but "button covered by banner" does.

| Element | Salience | LLM Treatment |
| --- | --- | --- |
| Checkout button | 95% | Critical — visibility change = regression |
| Product grid | 88% | Important — pushed off-screen = warning |
| Promo banner | 70% | Monitor — alert if it occludes a high-salience element |
| Footer links | 15% | Ignored |
| Cookie consent | 12% | Ignored |

We don't ask the model what matters — we tell it.
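In practice that means the LLM context is built from only the elements that clear a salience bar. A minimal sketch, assuming the SiFR payload exposes per-element fields like salience, role, and visible (illustrative names, not the exact schema):

// Project the SiFR capture down to what the LLM should reason about.
// Field names are illustrative, not the exact SiFR schema.
const SALIENCE_THRESHOLD = 0.6;

function projectForLLM(sifr) {
  return sifr.elements
    .filter((el) => el.salience >= SALIENCE_THRESHOLD)
    .map((el) => ({
      role: el.role,         // e.g. "button", "form", "product-grid"
      label: el.label,       // visible text or accessible name
      salience: el.salience,
      visible: el.visible,   // occluded / off-screen elements surface here
    }));
}

Footers and cookie banners never reach the prompt, which is also what keeps the prompt-injection surface small later on.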

What This Catches

| Issue | E2E Tests | Visual Diff | Semantic |
| --- | --- | --- | --- |
| Button covered by new banner | ❌ Pass | ⚠️ Alert (among 50 others) | ✅ "CTA occluded" |
| Products pushed below fold | ❌ Pass | ❌ Pass | ✅ "Primary content degraded" |
| Mobile layout broken | ❌ Pass (if desktop-only) | ⚠️ Noise | ✅ "Responsive regression" |
| Third-party widget overlap | ❌ Pass | ⚠️ Noise | ✅ "External element occludes checkout" |
| CMS broke grid | ❌ Pass | ⚠️ Alert flood | ✅ "Layout structure changed" |
| A/B test hides CTA | ❌ Pass | ❌ Different baseline | ✅ "Variant missing primary action" |

Bonus: Security Layer

Same approach catches malicious changes:

  • Defacement: High-salience content replaced → instant alert
  • Phishing overlay: New high-salience form over login → "Anomaly: duplicate auth form"
  • Content injection: Suspicious iframe/script in critical area → flagged

Because the LLM reads a projection of the page (pre-weighted by salience), not raw HTML:

  • Injected instructions in low-salience areas = ignored
  • Prompt injection surface = minimal

Security monitoring as a free add-on to your QA pipeline.
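The same moving-window prompt can carry the security questions, so there's no separate scan to maintain. A sketch, reusing the stateWindow and the hypothetical llm.chat client from above:

// Ask security questions against the same window of semantic snapshots
// (reuses the stateWindow and llm.chat assumptions from checkRegression)
async function checkAnomaly(snapshots) {
  if (snapshots.length < 2) return null;

  const reply = await llm.chat({
    prompt: `Recent semantic snapshots of the same page (oldest to newest):

${snapshots.map((s, i) => `[${i + 1}]: ${s}`).join('\n')}

1. Was any high-salience content replaced or defaced?
2. Did a new form appear that duplicates an existing auth or payment flow?
3. Is there unexpected third-party content in a critical area?

Reply with JSON only: {"anomaly": boolean, "type": "defacement" | "phishing" | "injection" | "none", "reason": "..."}`
  });
  return JSON.parse(reply);
}

Call it right after checkRegression, passing the same stateWindow.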

Implementation

// Playwright + Element-to-LLM
async function getSemanticState(page, viewport) {
  await page.setViewportSize(viewport); // Test multiple breakpoints

  const sifr = await page.evaluate(() => {
    return new Promise((resolve) => {
      document.addEventListener('e2llm-capture-response', (e) => {
        resolve(e.detail.data);
      }, { once: true });
      document.dispatchEvent(new CustomEvent('e2llm-capture-request', {
        detail: { selector: 'body', options: { preset: 'minimal' } }
      }));
    });
  });

  return await llm.chat({
    prompt: `Describe this page's functional state:
1. Primary actions available (buttons, forms, CTAs)
2. Content hierarchy (what's prominent vs hidden)
3. Any UI issues (overlaps, off-screen elements, broken layout)

Be consistent. Same functional state = same description.`,
    context: JSON.stringify(sifr)
  });
}

// Check critical viewports
const viewports = [
  { width: 1920, height: 1080, name: 'desktop' },
  { width: 768, height: 1024, name: 'tablet' },
  { width: 375, height: 667, name: 'mobile' }
];

for (const vp of viewports) {
  const state = await getSemanticState(page, vp);
  const result = await checkRegression(state);

  if (result?.regression) {
    console.error(`[${vp.name}] ${result.severity}: ${result.reason}`); // or route to your alerting channel
  }
}

When To Run

| Trigger | Use Case |
| --- | --- |
| Post-deploy | Catch regressions before users see them |
| Scheduled (hourly) | Third-party script changes, CMS updates |
| Pre-merge (staging) | PR review with semantic diff |
| Multi-viewport | Responsive regression detection |
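For the post-deploy trigger, the whole check fits in one small script that CI runs after the deploy job and that fails the pipeline on critical findings. A sketch, assuming the getSemanticState and checkRegression functions above, a browser context with the Element-to-LLM extension already available, and a PAGE_URL environment variable:

// post-deploy-check.mjs: exits non-zero so the CI step fails on critical regressions.
// Assumes getSemanticState / checkRegression from above and that the browser
// context can serve the e2llm capture events.
import { chromium } from 'playwright';

const viewports = [
  { width: 1920, height: 1080, name: 'desktop' },
  { width: 375, height: 667, name: 'mobile' }
];

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto(process.env.PAGE_URL);

let failed = false;
for (const vp of viewports) {
  const state = await getSemanticState(page, vp);
  const result = await checkRegression(state);
  if (result?.regression && result.severity === 'critical') {
    console.error(`[${vp.name}] ${result.severity}: ${result.reason}`);
    failed = true;
  }
}

await browser.close();
process.exit(failed ? 1 : 0);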

Try It

  1. Install Element-to-LLM extension
  2. Integrate with Playwright
  3. Add your LLM
  4. Run on critical pages post-deploy

Your tests check if code works. This checks if users can use it.


Series Index:


Running this in your pipeline? Share your experience in the comments.


Tags: #webdev #frontend #testing #qa #devops
