Gregory Potemkin

Originally published at prufa.dev

We pointed our chaos-QA agent at our own site. It found a shipped bug.

We build an AI QA engineer, so the fair test is the obvious one: point it at
ourselves. On 15 June 2026 we ran Gremlin mode, Prufa's chaos-testing
modality, against our own marketing site, prufa.dev. It found a real,
user-facing bug that had passed a green CI run and shipped that same day. Here is
the whole run, including the parts where the tool was wrong about itself.

What Gremlin mode actually does

A normal Prufa flow checks a path you already know to check. Gremlin is for the
paths you didn't. An LLM-backed agent drives a real browser as a deliberately
difficult user — a confused newbie, an impatient double-clicker, a fat-finger
typist, a back-button masher, a hostile poker — and chooses its own next action
every step. It is the part of QA that needs a model: absorbing an unfamiliar UI
and deciding what a frustrated human would try next.

What the agent never does is decide whether anything broke. That is the same
invariant as the rest of Prufa: the LLM navigates, plain code verifies. It is
the whole reason a finding from an LLM-driven tester can be trusted: a
separate layer of deterministic detectors grades the run. A 500 response, an
uncaught exception, a form that accepts invalid input, content wider than the
viewport, two clickable elements overlapping: those are facts, read off the
live page, not opinions.
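As a rough illustration of "plain code verifies", here is a minimal sketch of a detector layer. It is not Prufa's implementation: `PageSnapshot`, its field names, and `detect` are hypothetical, and the 493px document width is simply the 390px viewport plus the 103px overflow reported later in this post.

```python
from dataclasses import dataclass, field


@dataclass
class PageSnapshot:
    """Facts read off the live page. All field names are hypothetical."""
    status_code: int
    uncaught_exceptions: list[str] = field(default_factory=list)
    document_width_px: int = 0
    viewport_width_px: int = 0


def detect(snapshot: PageSnapshot) -> list[str]:
    """Return findings from fixed rules only; no model is consulted."""
    findings = []
    if snapshot.status_code >= 500:
        findings.append(f"server error: HTTP {snapshot.status_code}")
    for message in snapshot.uncaught_exceptions:
        findings.append(f"uncaught exception: {message}")
    overflow = snapshot.document_width_px - snapshot.viewport_width_px
    if overflow > 0:
        findings.append(f"horizontal overflow: {overflow}px wider than viewport")
    return findings


# Assumed measurement: 390px viewport plus 103px of overflow.
mobile = PageSnapshot(status_code=200, document_width_px=493, viewport_width_px=390)
print(detect(mobile))  # ['horizontal overflow: 103px wider than viewport']
```

The same snapshot always yields the same findings, which is the property that makes a finding checkable after the fact.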

The bug: a mobile overflow CI had just shipped

Across three personas, every run reported the same verified finding at the
390px mobile viewport: the page was 103 pixels wider than the screen, with
the "Run a free audit" button in the header hanging off the right edge.

Here is the part that makes the case for chaos QA. Earlier that same day, a
commit titled "fix" had added exactly the rule meant to prevent this:

@media (max-width: 520px) { .header-cta { display: none; } }

It never applied. The button is styled by a.btn-primary { display: inline-block },
whose selector specificity (0,1,1) outranks the bare .header-cta (0,1,0). A media
query adds no specificity of its own, so the display: none was silently
overridden on every phone-width render. The CSS was valid. The build passed. The
linter was happy. CI was green. And the bug shipped to production, where the page
sat 103px too wide until an agent that had never seen our codebase resized the
viewport and measured the document.

The fix was to out-specify the button:

@media (max-width: 520px) { header a.header-cta { display: none; } }

header a.header-cta is specificity (0,1,2), which beats a.btn-primary
regardless of source order. After the change, a fresh build measured 0px of
horizontal overflow at 390px and the button correctly hidden. The class of bug
matters here: nothing errored. A test that asserts known selectors would have
stayed green forever, because the breakage was in a layout dimension no one had
written an assertion about. You catch that by measuring the rendered page, not by
re-running the path you already trusted.
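The specificity arithmetic in this section can be checked mechanically. The sketch below is a deliberately small calculator, assuming selectors made only of type names, .class and #id parts separated by spaces; attribute selectors, pseudo-classes and other combinators are out of scope. The `specificity` function is illustrative and is not part of Prufa.

```python
import re


def specificity(selector: str) -> tuple[int, int, int]:
    """Specificity as (ids, classes, types) for a simple selector.

    Handles only type, .class and #id parts joined by descendant spaces.
    """
    ids = len(re.findall(r"#[\w-]+", selector))
    classes = len(re.findall(r"\.[\w-]+", selector))
    # A type selector is a name at the start of the string or after a space.
    types = len(re.findall(r"(?:^|\s)[a-zA-Z][\w-]*", selector))
    return (ids, classes, types)


print(specificity(".header-cta"))          # (0, 1, 0)
print(specificity("a.btn-primary"))        # (0, 1, 1)
print(specificity("header a.header-cta"))  # (0, 1, 2)

# Tuples compare left to right, which matches how specificity is ranked.
print(specificity("a.btn-primary") > specificity(".header-cta"))          # True
print(specificity("header a.header-cta") > specificity("a.btn-primary"))  # True
```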

The safety guarantee, demonstrated on a live site

A chaos tester loose on a real site is only acceptable if it cannot change
anything. In Prufa, mutations are denied by default: every run is a dry run, and a
network-layer guard aborts every non-GET request before it leaves the browser. A
destructive click becomes a "would have mutated" finding instead of an action.

We didn't have to take that on faith — the run logged it. Across the three
personas the agent attempted between 0 and 4 mutations each; every one was
blocked, and the run recorded which control it would have submitted. Real
payment instruments are never used at all. To let Gremlin submit forms for real,
you explicitly authorise a domain you own — and even then, hard caps bound how
many submissions it can make.
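A deny-by-default guard of this kind could look roughly like the sketch below. It only models the decision, not the browser hook that would call it, and every name in it (`MutationGuard`, `allow`, the "signup form" control) is hypothetical rather than taken from Prufa.

```python
from dataclasses import dataclass, field


@dataclass
class MutationGuard:
    """Deny-by-default policy for outgoing requests. All names are hypothetical."""
    authorised_domains: set[str] = field(default_factory=set)
    max_submissions: int = 0   # hard cap; only matters once a domain is authorised
    submissions_made: int = 0
    blocked: list[str] = field(default_factory=list)

    def allow(self, method: str, host: str, control: str) -> bool:
        """Return True if the request may leave the browser."""
        if method.upper() == "GET":
            return True
        if host in self.authorised_domains and self.submissions_made < self.max_submissions:
            self.submissions_made += 1
            return True
        # Record what would have happened instead of doing it.
        self.blocked.append(f"would have mutated: {method.upper()} via {control}")
        return False


guard = MutationGuard()  # dry run: nothing is authorised
print(guard.allow("GET", "prufa.dev", "nav link"))      # True
print(guard.allow("POST", "prufa.dev", "signup form"))  # False
print(guard.blocked)  # ['would have mutated: POST via signup form']
```

With an authorised domain and `max_submissions=1`, the first POST would pass and the second would be recorded as blocked, which is how a hard cap bounds real submissions.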

Where the tool was wrong about itself

The honest part. In an earlier run, two of the gremlin's own detectors fired
on things that were not bugs:

  • A "dead-end / error page" detector matched the bare string 500 in ordinary marketing copy (think "save $500"), calling a healthy page an error page.
  • A "bad input accepted" detector treated any navigation after a form fill as a successful submission — so clicking a normal link after typing in a field looked like the app had swallowed invalid input.

A verified finding that turns out to be noise costs more trust than a missed bug
costs coverage, so we did not ship around it. We added a detector
false-positive policy: the error-page check now requires a strong error phrase
in the page's prominent text (title or heading) on an error-shaped page, not a
substring match anywhere in the body; the bad-input check now requires a real
form submission — an actual non-GET request — before it fires. Both false
positives are gone, and the genuine findings (the mobile overflow) still land.
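To make the before and after concrete, here is a minimal sketch of the two tightened checks next to the naive substring match. The phrase list and all function names are assumptions for illustration; the post does not give the real phrases, and the "error-shaped page" condition is left out.

```python
# Assumed phrase list; the real policy's phrases are not given in the post.
STRONG_ERROR_PHRASES = ("internal server error", "page not found", "something went wrong")


def naive_error_check(body: str) -> bool:
    """The old behaviour: any '500' anywhere in the body counts."""
    return "500" in body


def looks_like_error_page(title: str, headings: list[str]) -> bool:
    """Fire only on a strong phrase in the title or a heading; body text is ignored."""
    prominent = " ".join([title, *headings]).lower()
    return any(phrase in prominent for phrase in STRONG_ERROR_PHRASES)


def bad_input_accepted(methods_after_fill: list[str], input_was_invalid: bool) -> bool:
    """Fire only if invalid input was followed by a real non-GET request."""
    submitted = any(method.upper() != "GET" for method in methods_after_fill)
    return input_was_invalid and submitted


print(naive_error_check("Save $500 today"))                     # True (false positive)
print(looks_like_error_page("Pricing", ["Save $500 today"]))    # False
print(looks_like_error_page("500 Internal Server Error", []))   # True
print(bad_input_accepted(["GET"], input_was_invalid=True))      # False: just a link click
print(bad_input_accepted(["GET", "POST"], input_was_invalid=True))  # True
```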

We also measured discovery quality directly. On a seeded-bug fixture with five
planted bugs, the agent's first pass found four of five (0.80 coverage); after
we gave it an exploration frontier — a running list of same-site pages it
hasn't visited yet, fed back into each decision — it found all five (1.00),
because it stopped looping in one corner and started covering the whole app. That
number is fixture discovery quality, not a claim about your site; the point is
that "does the chaos actually find the planted bugs" is something we test, with
a number, not assert.
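An exploration frontier in this sense can be sketched in a few lines. This is an assumption about the general shape, not Prufa's code: the `Frontier` class, its methods, and the example.test URLs are all hypothetical. The `coverage` helper just reproduces the found-over-planted ratio quoted above.

```python
from urllib.parse import urljoin, urlparse


class Frontier:
    """Tracks same-site pages seen in links but not yet visited. Hypothetical sketch."""

    def __init__(self, start_url: str):
        self.host = urlparse(start_url).netloc
        self.visited: set[str] = set()
        self.pending: list[str] = []
        self.visit(start_url)

    def visit(self, url: str) -> None:
        """Mark a page as visited and drop it from the pending list."""
        self.visited.add(url)
        if url in self.pending:
            self.pending.remove(url)

    def add_links(self, page_url: str, hrefs: list[str]) -> None:
        """Queue same-site links that are neither visited nor already pending."""
        for href in hrefs:
            url = urljoin(page_url, href).split("#")[0]  # resolve, drop fragment
            same_site = urlparse(url).netloc == self.host
            if same_site and url not in self.visited and url not in self.pending:
                self.pending.append(url)


def coverage(found: int, planted: int) -> float:
    """Fraction of planted bugs that the run discovered."""
    return found / planted


frontier = Frontier("https://example.test/")
frontier.add_links(
    "https://example.test/",
    ["/pricing", "/docs#intro", "https://other.test/x", "/pricing"],
)
print(frontier.pending)  # ['https://example.test/pricing', 'https://example.test/docs']
print(coverage(4, 5), coverage(5, 5))  # 0.8 1.0
```

Feeding `frontier.pending` back into each decision is what gives the agent a reason to leave the corner it is already in.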

Why we publish the misses

A QA product that only tells flattering stories about itself is exactly the
product you shouldn't trust to test you. The mobile bug is a good demo. The
false positives are a better one: they show the failure mode that matters for an
LLM-driven tester — a confident, wrong "this is broken" — and they show the line
we hold against it. The model proposes; plain code disposes; and when plain code
gets it wrong, we fix the plain code, in the open.

Gremlin mode is available on any paid plan — read how it works on
the chaos-testing page, or
run a free audit to see the deterministic side of the same engine on your
own URL first.
