DEV Community: Gregory Potemkin

We pointed our chaos-QA agent at our own site. It found a shipped bug.

Gregory Potemkin — Mon, 22 Jun 2026 10:43:55 +0000

We build an AI QA engineer, so the fair test is the obvious one: point it at
ourselves. On 15 June 2026 we ran Gremlin mode, Prufa's chaos-testing
modality, against our own marketing site, prufa.dev. It found a real,
user-facing bug that our CI had passed and that had shipped that same day. Here is
the whole run, including the parts where the tool was wrong about itself.

What Gremlin mode actually does

A normal Prufa flow checks a path you already know to check. Gremlin is for the
paths you didn't. An LLM-backed agent drives a real browser as a deliberately
difficult user — a confused newbie, an impatient double-clicker, a fat-finger
typist, a back-button masher, a hostile poker — and chooses its own next action
every step. It is the part of QA that needs a model: absorbing an unfamiliar UI
and deciding what a frustrated human would try next.

What the agent never does is decide whether anything broke. That is the same
invariant as the rest of Prufa: the LLM navigates, plain code verifies. It is
the whole reason a finding from an LLM-driven tester can be trusted: a
separate layer of deterministic detectors grades the run. A 500 response, an
uncaught exception, a form that accepts invalid input, content wider than the
viewport, two clickable elements overlapping — those are facts, read off the
live page, not opinions.
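
To make "plain code verifies" concrete, here is a minimal sketch of what a detector layer could look like. The Evidence shape, the field names, and the detector functions are all hypothetical, chosen for illustration; this is not Prufa's implementation. The point it shows is that the graders only read captured facts and never take the agent's opinion as input.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Facts captured from the live page (hypothetical shape)."""
    viewport_width: int                      # CSS pixels
    document_width: int                      # e.g. the document's scroll width
    response_statuses: list[int] = field(default_factory=list)
    uncaught_exceptions: list[str] = field(default_factory=list)


def detect_server_errors(ev: Evidence) -> list[str]:
    # A 5xx status is a fact about the response, not an opinion.
    return [f"server error: HTTP {s}" for s in ev.response_statuses if 500 <= s <= 599]


def detect_uncaught_exceptions(ev: Evidence) -> list[str]:
    return [f"uncaught exception: {msg}" for msg in ev.uncaught_exceptions]


def detect_horizontal_overflow(ev: Evidence) -> list[str]:
    overflow = ev.document_width - ev.viewport_width
    if overflow > 0:
        return [f"horizontal overflow: {overflow}px wider than {ev.viewport_width}px viewport"]
    return []


DETECTORS = (detect_server_errors, detect_uncaught_exceptions, detect_horizontal_overflow)


def grade(ev: Evidence) -> list[str]:
    """Run every detector over the evidence; same input, same verdict."""
    findings: list[str] = []
    for detector in DETECTORS:
        findings.extend(detector(ev))
    return findings
```

Because each detector is a pure function of the evidence, the same capture always produces the same findings, whichever persona drove the browser.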

The bug: a mobile overflow CI had just shipped

Across three personas, every run reported the same verified finding at the
390px mobile viewport: the page was 103 pixels wider than the screen, with
the "Run a free audit" button in the header hanging off the right edge.

Here is the part that makes the case for chaos QA. Earlier that same day, a
commit titled "fix" had added exactly the rule meant to prevent this:

@media (max-width: 520px) { .header-cta { display: none; } }

It never applied. The button is styled by a.btn-primary { display: inline-block },
whose selector specificity (0,1,1) outranks the bare .header-cta (0,1,0), so
the display: none was silently overridden on every phone-width render. The CSS
was valid. The build passed. The linter was happy. CI was green. And the bug
shipped to production, where the page sat 103px too wide until an agent that had
never seen our codebase resized the viewport and measured the document.

The fix was to out-specify the button:

@media (max-width: 520px) { header a.header-cta { display: none; } }

header a.header-cta is specificity (0,1,2), which beats a.btn-primary
regardless of source order. After the change, a fresh build measured 0px of
horizontal overflow at 390px and the button correctly hidden. The class of bug
matters here: nothing errored. A test that asserts known selectors would have
stayed green forever, because the breakage was in a layout dimension no one had
written an assertion about. You catch that by measuring the rendered page, not by
re-running the path you already trusted.
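
The specificity arithmetic above is easy to check mechanically. The sketch below is a deliberately simplified calculator: it only understands type names, .classes and #ids separated by spaces, and it ignores attribute selectors, pseudo-classes, combinators, !important and cascade layers. The function names are ours, for illustration only.

```python
import re

# Simplified on purpose: type names, .classes and #ids joined by spaces.
_ID = re.compile(r"#[\w-]+")
_CLASS = re.compile(r"\.[\w-]+")
_TYPE = re.compile(r"(?:^|\s)([a-zA-Z][\w-]*)")


def specificity(selector: str) -> tuple[int, int, int]:
    """Return (ids, classes, types) for a simple selector."""
    selector = selector.strip()
    return (
        len(_ID.findall(selector)),
        len(_CLASS.findall(selector)),
        len(_TYPE.findall(selector)),
    )


def winner(earlier: str, later: str) -> str:
    """Selector that wins on specificity; on a tie the later rule wins."""
    # Python compares tuples left to right, which matches how
    # specificity is compared: ids first, then classes, then types.
    return earlier if specificity(earlier) > specificity(later) else later


print(specificity(".header-cta"))          # (0, 1, 0)
print(specificity("a.btn-primary"))        # (0, 1, 1)
print(specificity("header a.header-cta"))  # (0, 1, 2)
```

Note that the media query wrapper contributes nothing to specificity, which is why the original rule lost even though it was scoped to phone widths.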

The safety guarantee, demonstrated on a live site

A chaos tester loose on a real site is only acceptable if it cannot change
anything. In Prufa, mutations are denied by default: every run is a dry run, and a
network-layer guard aborts every non-GET request before it leaves the browser. A
destructive click becomes a "would have mutated" finding instead of an action.

We didn't have to take that on faith — the run logged it. Across the three
personas the agent attempted between 0 and 4 mutations each; every one was
blocked, and the run recorded which control it would have submitted. Real
payment instruments are never used at all. To let Gremlin submit forms for real,
you explicitly authorise a domain you own — and even then, hard caps bound how
many submissions it can make.
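
The deny-by-default rule can be sketched as a small decision function that a request interceptor would call. Everything here is an assumption made for illustration (the class name, the fields, the message format, the idea of counting submissions per guard); the source only states the behaviour: GET passes, everything else is blocked and logged unless the domain is explicitly authorised, and even then a hard cap applies.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class MutationGuard:
    """Deny-by-default request guard (hypothetical names and defaults)."""
    authorised_domains: frozenset[str] = frozenset()   # domains the owner opted in
    max_submissions: int = 0                            # hard cap, even when authorised
    submissions_made: int = 0
    blocked: list[str] = field(default_factory=list)    # "would have mutated" log

    def allow(self, method: str, url: str, control: str = "") -> bool:
        method = method.upper()
        if method == "GET":
            return True
        host = urlsplit(url).hostname or ""
        if host in self.authorised_domains and self.submissions_made < self.max_submissions:
            self.submissions_made += 1
            return True
        # Record what would have happened instead of letting it happen.
        self.blocked.append(
            f"would have mutated: {method} {url} via {control or 'unknown control'}"
        )
        return False
```

With the defaults, MutationGuard().allow("POST", ...) always returns False, so a dry run cannot submit anything; opting in means passing both a domain and a non-zero cap.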

Where the tool was wrong about itself

The honest part. In an earlier run, two of the gremlin's own detectors fired
on things that were not bugs:

A "dead-end / error page" detector matched the bare string 500 in ordinary marketing copy (think "save $500"), calling a healthy page an error page.
A "bad input accepted" detector treated any navigation after a form fill as a successful submission — so clicking a normal link after typing in a field looked like the app had swallowed invalid input.

A verified finding that turns out to be noise costs more trust than a missed bug
costs coverage, so we did not ship around it. We added a detector
false-positive policy: the error-page check now requires a strong error phrase
in the page's prominent text (title or heading) on an error-shaped page, not a
substring match anywhere in the body; the bad-input check now requires a real
form submission — an actual non-GET request — before it fires. Both false
positives are gone, and the genuine findings (the mobile overflow) still land.
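
A rough sketch of the before and after for both detectors, under stated assumptions: the phrase list is illustrative, "error-shaped" is passed in as a flag because this post does not define it, and treating a 2xx or 3xx response to a non-GET request as "accepted" is our guess at a reasonable rule, not the shipped one.

```python
import re

# Illustrative phrase list; a real detector would tune this.
STRONG_ERROR_PHRASES = re.compile(
    r"\b(500 internal server error|internal server error|"
    r"404 not found|page not found|something went wrong)\b",
    re.IGNORECASE,
)


def naive_error_page(body_text: str) -> bool:
    # The false-positive-prone version: any "500" anywhere in the body.
    return "500" in body_text


def strict_error_page(title: str, headings: list[str], error_shaped: bool) -> bool:
    """Fire only on an error-shaped page with a strong phrase in prominent text."""
    prominent = " ".join([title, *headings])
    return error_shaped and bool(STRONG_ERROR_PHRASES.search(prominent))


def bad_input_accepted(filled_invalid: bool, requests_after_fill: list[tuple[str, int]]) -> bool:
    """Fire only if invalid input was followed by a real non-GET submission.

    requests_after_fill holds (method, status) pairs captured after the fill.
    """
    if not filled_invalid:
        return False
    return any(
        method.upper() != "GET" and 200 <= status < 400
        for method, status in requests_after_fill
    )
```

A pricing page that says "Save $500 on annual plans" trips naive_error_page but not strict_error_page, and clicking an ordinary link after typing produces only GET requests, so bad_input_accepted stays quiet.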

We also measured discovery quality directly. On a seeded-bug fixture with five
planted bugs, the agent's first pass found four of five (0.80 coverage); after
we gave it an exploration frontier — a running list of same-site pages it
hasn't visited yet, fed back into each decision — it found all five (1.00),
because it stopped looping in one corner and started covering the whole app. That
number is fixture discovery quality, not a claim about your site; the point is
that "does the chaos actually find the planted bugs" is something we test, with
a number, not assert.
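
For readers who want the mechanics, an exploration frontier and the coverage number can be sketched in a few lines. This is a minimal illustration, not our implementation: the class and method names are hypothetical, links are assumed to arrive as absolute URLs, and normalisation is limited to dropping fragments and trailing slashes.

```python
from urllib.parse import urldefrag, urlsplit


class Frontier:
    """Same-site pages seen in links but not yet visited (hypothetical sketch)."""

    def __init__(self, start_url: str) -> None:
        self.host = urlsplit(start_url).hostname
        self.visited: set[str] = set()
        self.pending: list[str] = []      # insertion order kept for stable prompts
        self._enqueue(start_url)

    @staticmethod
    def _normalise(url: str) -> str:
        # Drop the #fragment and any trailing slash so duplicates collapse.
        return urldefrag(url).url.rstrip("/")

    def _enqueue(self, url: str) -> None:
        url = self._normalise(url)
        same_site = urlsplit(url).hostname == self.host
        if same_site and url not in self.visited and url not in self.pending:
            self.pending.append(url)

    def record_visit(self, url: str, links: list[str]) -> None:
        """Mark a page visited and queue its same-site links (absolute URLs)."""
        url = self._normalise(url)
        self.visited.add(url)
        if url in self.pending:
            self.pending.remove(url)
        for href in links:
            self._enqueue(href)


def coverage(found: set[str], planted: set[str]) -> float:
    """Share of planted bugs the run discovered."""
    return len(found & planted) / len(planted)
```

Feeding frontier.pending back into each decision is what gives the agent a reason to leave the corner it is in; coverage({"b1", "b2", "b3", "b4"}, five_planted_bugs) is the 0.80 figure, and all five found is 1.00.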

Why we publish the misses

A QA product that only tells flattering stories about itself is exactly the
product you shouldn't trust to test you. The mobile bug is a good demo. The
false positives are a better one: they show the failure mode that matters for an
LLM-driven tester — a confident, wrong "this is broken" — and they show the line
we hold against it. The model proposes; plain code disposes; and when plain code
gets it wrong, we fix the plain code, in the open.

Gremlin mode is available on any paid plan. Read how it works on the
chaos-testing page, or run a free audit to see the deterministic side of the
same engine on your own URL first.

We audited 14 side-project launches. Zero critical bugs, same quiet flaws.

Gregory Potemkin — Tue, 16 Jun 2026 12:40:10 +0000

Originally published on the Prufa blog.

Five days ago we audited 49 Show HN launches and found that 78% had a critical bug on day one. This week we pointed the same free audit at a different cohort: 14 products freshly posted to r/SideProject. We expected more of the same.

We got the opposite — and it turned out to be more interesting.

Not one of the 14 had a critical finding. No broken signup flow, no canonical pointing at the wrong domain, no analytics tag silently swallowing every event. By the measure that matters most on launch day — does the core thing work — these builders shipped clean.

And yet every single site had findings. They just all live one tier down, in a layer so consistent it reads like a shared checklist nobody handed out:

11 of 14 sent no analytics events at all.
11 of 14 shipped with no Content-Security-Policy and could be framed by any site (no X-Frame-Options).
11 of 14 had serious accessibility violations.
12 of 14 had tap targets smaller than 24px on mobile.
9 of 14 took over four seconds to paint their largest element on mobile.
8 of 14 had no canonical link on the entry page.

No site is named in this post. The point isn't to embarrass anyone — these are good builders who got a real product live. The point is that the same common side-project launch mistakes show up again and again, and if 11 of 14 strangers have them, you probably have a few too.

Methodology, briefly

We pulled 20 URLs from recent r/SideProject posts and ran each through the same audit a free Prufa run does: a real browser loads the public pages and captures network traffic, console output, response codes, headers, and the rendered DOM, then a fixed suite of deterministic checks grades the evidence. Same input, same verdict.

Of the 20: 14 completed cleanly, 4 were blocked by bot protection before our runner could load them, and 2 didn't finish inside our polling window. The numbers below are from the 14 that completed.

Two honest caveats. First, 14 is a small sample: treat these as directional, not as a census. Second, every number below is from a code-verified check; the audit also produces LLM-written UX observations (a hero that over-claims, a CTA with no clear primary action), but those are advisory and counted nowhere in this data. The LLM in our pipeline never grades results; plain code does.

What actually breaks on a side-project launch: the numbers

Sites affected (of 14)	Finding	Severity
12	Tap targets smaller than 24px (mobile)	warning
12	Slow largest-contentful-paint (9 of them over 4s)	warning
11	No analytics events detected	info
11	No Content-Security-Policy header	info
11	Page can be framed by any site (no `X-Frame-Options`)	info
11	Serious accessibility violations	warning
11	No `llms.txt`	info
10	Minor accessibility violations	info
9	No `X-Content-Type-Options: nosniff`	info
9	Text assets served without compression	info
8	No canonical link on entry page	info
7	Unknown URLs return 200 instead of 404	warning
7	No structured data on entry page	info
6	No `Strict-Transport-Security` header	warning
5	Missing Open Graph tags	info
4	Missing meta description	warning
4	`http://` does not redirect to `https://`	warning
4	Images missing alt text	info

The most common mistake: flying blind on your own launch

Eleven of the fourteen sites sent no analytics events whatsoever. The page loads, the browser records every outbound request, and nothing resembling an analytics beacon ever leaves it.

This was the single most common finding in the Show HN cohort too, and it stings more for a side project. You posted to r/SideProject for one reason — to find out if anyone wants this. The traffic from that post is the clearest signal you will get for weeks: which referrer converted, which screenshot made people click, how many visitors actually reached the signup. For 11 of these 14 builders, that data was never recorded. The launch happened; the evidence didn't.

(We can only see a recognized beacon — if you run a first-party collector we don't have a signature for, you'd show up here too. Worth a 30-second check of your own network tab either way.)

The security headers nobody adds

Eleven sites had no Content-Security-Policy and could be embedded in an iframe by any website on the internet — the setup behind clickjacking. Nine were missing X-Content-Type-Options: nosniff; six had no HSTS; four served http:// without redirecting to https://.

None of these is exploitable on its own for most side projects, and none will page you. But they're each a one-line fix in your host or framework config, and they're the difference between "looks like a weekend hack" and "looks like someone who knows what they're doing" to anyone who checks. Separately, 7 of 14 served soft 404s: they returned 200 OK for URLs that don't exist, which quietly pollutes search indexing and hides broken links from your own logs.
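
If you want to run the same kind of check on your own responses, a minimal sketch looks like this. The function names and message strings are ours, and one choice is an assumption rather than something stated in this post: we treat either X-Frame-Options or a CSP frame-ancestors directive as framing protection, since both control whether a page may be embedded.

```python
def header_findings(headers: dict[str, str]) -> list[str]:
    """Findings for one response's headers (names compared case-insensitively)."""
    h = {name.lower(): value.strip().lower() for name, value in headers.items()}
    findings = []
    csp = h.get("content-security-policy")
    if csp is None:
        findings.append("No Content-Security-Policy header")
    # Assumption: either X-Frame-Options or a CSP frame-ancestors
    # directive counts as framing protection.
    if "x-frame-options" not in h and "frame-ancestors" not in (csp or ""):
        findings.append("Page can be framed by any site")
    if h.get("x-content-type-options") != "nosniff":
        findings.append("No X-Content-Type-Options: nosniff")
    if "strict-transport-security" not in h:
        findings.append("No Strict-Transport-Security header")
    return findings


def is_soft_404(status_for_nonexistent_url: int) -> bool:
    """A URL that should not exist but answers 200 is a soft 404."""
    return status_for_nonexistent_url == 200
```

To use is_soft_404, request a path you are sure does not exist (a long random string works) and pass in the status code you get back.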

The mobile and accessibility tax

Twelve sites had tap targets under 24px and nine took over four seconds to paint on mobile — one took 18.5 seconds. Most launch traffic from a social post is mobile; a four-second hero is a meaningful chunk of visitors gone before they see the thing.

Eleven sites had serious accessibility violations (the kind axe-core flags as serious — missing form labels, insufficient contrast, controls with no accessible name). These aren't only a compliance question: a button a screen reader can't name is often a button that's confusing to everyone, and contrast failures are just hard-to-read text.
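
The two mobile thresholds quoted above are simple enough to express directly. In this sketch the Box type and function names are hypothetical, and the measurements are assumed to come from the rendered page (element bounding boxes in CSS pixels, largest-contentful-paint in milliseconds).

```python
from dataclasses import dataclass

MIN_TAP_TARGET_PX = 24   # threshold quoted in this post
SLOW_LCP_MS = 4000       # "over four seconds"


@dataclass
class Box:
    label: str
    width: float    # CSS pixels
    height: float   # CSS pixels


def small_tap_targets(boxes: list[Box]) -> list[str]:
    """Labels of interactive elements smaller than 24px in either dimension."""
    return [
        b.label for b in boxes
        if b.width < MIN_TAP_TARGET_PX or b.height < MIN_TAP_TARGET_PX
    ]


def lcp_is_slow(lcp_ms: float) -> bool:
    return lcp_ms > SLOW_LCP_MS
```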

The AEO gap: 11 of 14 have no llms.txt

Eleven sites had no llms.txt and seven had no structured data on the entry page. A year ago that was a non-issue. Now a real and growing share of "how do I…" and "what's the best tool for…" traffic resolves inside ChatGPT, Perplexity, and Google's AI overviews, and those engines lean on machine-readable signals to understand and cite you. That is the answer-engine optimization (AEO) gap: a side project with no structured data and no llms.txt gives the fastest-growing channel less to work with.

What we take from this

The Show HN cohort failed loudly — broken flows, dead analytics tags, canonical tags aimed at the wrong domain. This cohort failed quietly, and uniformly. Zero criticals is genuinely good news; it means these builders shipped working products. But "nothing is broken" and "nothing is leaking" are different claims, and all 14 were leaking in the same handful of places: reach (analytics, AEO, SEO), trust (security headers), and reach-again (mobile speed, accessibility).

None of it requires judgment to detect. Every finding above is a deterministic check against evidence a browser can capture — a request that did or didn't happen, a header that is or isn't present, a response code. Which is exactly why it should be automated instead of living on a checklist you mean to get to.

That's the audit we ran on these 14 sites, and it's free: paste a URL on the Prufa homepage and get the same machine-verified findings for your own launch in about a minute. Ideally before you post it.

We audited 49 Show HN launches. 38 had a critical bug on day one.

Gregory Potemkin — Fri, 12 Jun 2026 09:19:31 +0000

Originally published on the Prufa blog.

In June 2026 we pointed Prufa's free audit at 50 products that had just launched on Show HN — every launch from the previous 30 days that earned at least 10 points. These are products at their moment of maximum attention: front page, real traffic, founders watching the comments.

The headline numbers, from the 49 audits that completed (one site couldn't be reached by our runner):

100% of the 49 launches had at least one machine-verified finding.
78% — 38 of 49 — had at least one critical finding.
40 critical and 61 warning findings in total, every one verified by deterministic checks against captured browser evidence.

No site is named in this post. The point isn't to embarrass anyone — it's that these failures are systematic, and if these teams have them on launch day, you probably do too.

Methodology, briefly

Each site got the same audit a free Prufa run does: a real browser loads the public pages, captures network traffic, console output, cookies, and response codes, and a fixed suite of deterministic checks grades the evidence. Same input, same verdict. Every number below is from a code-verified check — no LLM opinions are counted anywhere in this data.

One honest caveat: our export keeps only the top findings per site, so the per-issue counts below are floors, not totals. The real numbers are equal or worse.

What actually breaks at website launch: the numbers

Sites affected (of 49)	Finding	Severity
38	No analytics events detected	critical
24	No canonical link on entry page	info
22	Cookies set without the `Secure` attribute	warning
14	Broken links	warning
12	No `<h1>` heading on entry page	info
11	No robots.txt	info
10	JavaScript console errors during page load	warning
10	Missing meta description	warning
8	Images missing alt text	info
7	Missing Open Graph tags	info
3	Tag container loads, but no analytics events fire	warning
2	Canonical URL pointing to a different host	critical

The most common launch bug: analytics that record nothing

The most common critical finding, by a wide margin: no analytics events detected. The page loads, the browser captures every outgoing request — and nothing resembling an analytics event leaves the page.

Think about what that means on launch day specifically. The front page of Hacker News is, for many of these products, the single largest traffic spike they will ever see. Which referrers converted, which pages people actually read, how many of those visitors signed up? For 38 of these 49 teams, that data simply doesn't exist. Not sampled, not skewed: absent.

Three more sites had a subtler version: the tag container loads (so a quick "view source" check looks fine), but no events ever fire. That one is nasty precisely because it passes the eyeball test — the only way to catch it is to watch the network traffic, which is what our check does.
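
The distinction between "no analytics", "container only" and "events detected" comes down to classifying captured request URLs. Here is a minimal sketch under stated assumptions: the signature lists are short and illustrative (a real check would carry many more vendors), and the function and label names are ours.

```python
from urllib.parse import urlsplit

# Illustrative signatures only: (host suffix, path prefix).
# Requests that carry an analytics event.
EVENT_SIGNATURES = [
    ("google-analytics.com", "/g/collect"),
    ("google-analytics.com", "/collect"),
    ("plausible.io", "/api/event"),
]
# Requests that only load a tag container or library, with no event payload.
CONTAINER_SIGNATURES = [
    ("googletagmanager.com", "/gtm.js"),
    ("googletagmanager.com", "/gtag/js"),
]


def _matches(url: str, signatures: list[tuple[str, str]]) -> bool:
    parts = urlsplit(url)
    host = parts.hostname or ""
    return any(
        (host == suffix or host.endswith("." + suffix)) and parts.path.startswith(prefix)
        for suffix, prefix in signatures
    )


def classify_analytics(request_urls: list[str]) -> str:
    """Classify a page load from the URLs of its outgoing requests."""
    if any(_matches(u, EVENT_SIGNATURES) for u in request_urls):
        return "events_detected"
    if any(_matches(u, CONTAINER_SIGNATURES) for u in request_urls):
        return "container_only"   # passes "view source", records nothing
    return "no_analytics"
```

A signature approach has the limit you would expect: a first-party collector with no entry in the list is reported as "no_analytics", so an unexpected result is worth confirming in your own network tab.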

The rest of the list is the unglamorous stuff

Broken links (14 sites). Nobody clicks every link on their own site — especially footer links, docs links, and that one pricing anchor that moved two redesigns ago. Visitors do.

Console errors at page load (10 sites). Errors at load time often mean broken features visitors never report — they just leave. These ten sites shipped them to the HN front page.

Cookies without Secure (22 sites). A one-attribute fix, sitting on nearly half the cohort.

The canonical-to-wrong-host pair (2 sites, critical). Two sites shipped a <link rel="canonical"> pointing at a different domain — almost certainly a leftover from a template or staging config. That tag tells search engines "index that other site instead of me." On launch week.
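
Two of the checks in this list are small enough to sketch with the standard library alone. These are illustrative helpers with names of our choosing, not the audit's code: one inspects a single Set-Cookie header value for the Secure attribute, the other compares the host of a page's canonical link with the host the page was served from.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def cookie_lacks_secure(set_cookie_header: str) -> bool:
    """True if a Set-Cookie header value has no Secure attribute."""
    # Everything after the first ";" is an attribute; names are case-insensitive.
    attributes = [part.strip().lower() for part in set_cookie_header.split(";")[1:]]
    return "secure" not in attributes


class _CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical">."""

    def __init__(self) -> None:
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and (a.get("rel") or "").lower() == "canonical" and self.href is None:
            self.href = a.get("href")


def canonical_host_mismatch(page_url: str, html: str) -> bool:
    """True if the page declares a canonical URL on a different host."""
    finder = _CanonicalFinder()
    finder.feed(html)
    if not finder.href:
        return False   # a missing canonical is a separate, lower-severity finding
    canonical_host = urlsplit(finder.href).hostname
    # A relative href has no host and therefore cannot point elsewhere.
    return canonical_host is not None and canonical_host != urlsplit(page_url).hostname
```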

What we take from this

These aren't careless teams. They got a product to Show HN and earned points doing it. The pattern says something else: the surface area that needs verifying grows faster than anyone's willingness to click through it — especially in the week before a launch, when everything is on fire.

None of the findings above require judgment to detect. Every one is a deterministic check against evidence a browser can capture: a response code, a network request that did or didn't happen, an attribute on a cookie. Which is exactly why this should be automated — and why the LLM in our pipeline never grades results; plain code does.

We turned this dataset into a pre-launch checklist ordered by these failure rates, if you want the actionable version.

That's the audit we ran on these 49 sites, and it's free: paste a URL on prufa.dev, get the same machine-verified findings for your own site in about a minute. Before your launch day, ideally.

How I Set Up OpenClaw: A Developer's Guide to Self-Hosted AI Assistant Infrastructure

Gregory Potemkin — Wed, 25 Mar 2026 08:49:10 +0000

I recently set up OpenClaw, the open-source AI assistant framework, and wanted to share my experience for anyone considering self-hosting vs managed options.

What is OpenClaw?

OpenClaw is an AI assistant framework that lets you run your own AI assistant with integrations for Telegram, Slack, WhatsApp, and a built-in web chat. Think of it as your own ChatGPT that you control completely.

Why Self-Host?

Data privacy: Your conversations stay on your infrastructure
Cost control: Use your own API keys, pay only for what you use
Customization: Full control over models, prompts, and integrations
Learning: Great way to understand AI infrastructure

The Setup Process

1. Install OpenClaw

macOS/Linux/WSL2:

curl -fsSL https://openclaw.ai/install.sh | bash

Windows PowerShell:

iwr -useb https://openclaw.ai/install.ps1 | iex

2. Run the Onboarding Wizard

openclaw onboard --install-daemon

This configures:

Model authentication (OpenAI, Anthropic, Gemini, etc.)
Workspace defaults
Gateway settings
Optional messaging channels

3. Verify Everything Works

openclaw gateway status
openclaw doctor
openclaw dashboard

The last command opens the Control UI at http://127.0.0.1:18789/ where you can send your first message.

Key Lessons Learned

Use localhost for dashboard: Never expose the Control UI to the public internet. Use Tailscale or SSH tunneling if you need remote access.
Run doctor after updates: Always run openclaw doctor after setup and upgrades to catch issues early.
Start with built-in chat: You don't need Telegram or Slack configured to get started. The Control UI works immediately.
Document your install method: Whether you used the script, npm, or source build - keep track of how you installed it for troubleshooting.
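
For the first lesson, a quick sanity check I find handy is confirming that the dashboard address I have bookmarked or written into notes really is a loopback address. This is a generic standard-library sketch of my own, not an OpenClaw feature, and it only inspects the URL; it does not tell you which interface the gateway is actually bound to.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url: str) -> bool:
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name: treat as not loopback.
        return False


print(is_loopback_url("http://127.0.0.1:18789/"))  # True
print(is_loopback_url("http://0.0.0.0:18789/"))    # False
```

Note that 0.0.0.0 is not loopback: a service listening there accepts connections on every interface, which is exactly the exposure the first lesson warns about.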

When to Consider Managed Hosting

Self-hosting is great for:

Developers who want full control
Teams with existing infrastructure
Privacy-sensitive use cases

But if you want zero infrastructure work, there's OpenClaw Setup - managed hosting that handles operations while you keep control of your credentials and config.

Have you tried self-hosting AI assistants? What's been your experience?