DEV Community

Muggle AI

Two kinds of AI testing shipped this month. They solve completely different problems.

Lovable shipped $100 AI pentests. Meta showed that LLM-generated tests can surface 4x more candidate bug catches. Both shipped this month. They solve completely different problems, and the confusion between AI security testing, AI test automation, and AI-generated test suites makes it harder to know which one you need. Neither touches the layer where most teams are actually losing users.

On March 24, Lovable launched integrated security pentesting in partnership with Aikido — a $1B security unicorn — for $100 per pentest. The same week, Meta published research on a system called JiTTests (arXiv: 2601.22832) showing that LLM-generated unit tests can catch bugs at scale inside a production engineering organization. Both are real advances. Both are well-executed. And both are getting lumped under "AI testing" in a way that obscures what they actually do — and what neither of them touches.

It's worth pulling these apart carefully, because the gap between them is where a lot of teams are quietly bleeding.

What the Lovable + Aikido pentest covers

The Lovable integration runs a full whitebox + blackbox + greybox pentest against your deployed application: OWASP Top 10, LLM Top 10, privilege escalation, IDOR, authentication bypasses. It delivers results in 1–4 hours for $100, against a traditional range of $5K–$50K for an equivalent manual engagement. At that price point, security testing becomes something you can do per deploy rather than per quarter.

That's a meaningful shift. But the boundaries matter: this tests Lovable-built apps only, it tests the deployed application, and it looks for security vulnerabilities. A pentest will tell you whether an attacker can access data they shouldn't. It won't tell you whether your checkout flow breaks when a user applies a coupon code on mobile. That's not a security failure — it's a behavioral failure, and it's explicitly out of scope.

What Meta's JiTTests covers

The Meta paper is the more technically interesting result. The core idea: instead of maintaining a static test suite that grows stale, generate fresh unit tests per code diff — tests specifically designed to fail on the incoming change if it introduces a bug. These are catching tests, not hardening tests.

The numbers are compelling: 22,126 tests analyzed, 4x more candidate catches compared to hardening-style tests, and 70% reduction in human review time. The pipeline used Llama 3.3-70B, Gemini 3 Pro, and Claude Sonnet 4 as assessors. Of 41 candidate catches surfaced to human reviewers, 8 were confirmed bugs — 4 of them serious.

The caveats are real, and the paper acknowledges them. Eight confirmed bugs out of 41 candidates is a small sample. The oracle problem (deciding whether a test failure signals a real bug or an intended spec change) remains unsolved and requires human judgment. JiTTests works at the unit level: individual functions and their immediate behaviors. It isn't testing sequences of actions, and it isn't testing how a user navigates through your product. And it requires a diff to exist; by definition, it can't catch bugs that live in the interaction between components rather than inside a single changed function.
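To make the distinction concrete, here is an illustrative sketch of what a "catching" test is, in spirit. This is not Meta's pipeline: the function names, the discount bug, and the hand-written test stand in for what an LLM would generate against a real diff. The point is that the test is written to pass on the pre-diff code and fail on the incoming change.

```python
# Illustrative sketch of a "catching" test (not Meta's JiTTests pipeline).
# A catching test targets an incoming diff and is designed to fail if
# that diff introduces a bug.

def apply_discount_before(price: float, pct: float) -> float:
    """Pre-diff implementation: pct is a percentage, e.g. 10 means 10% off."""
    return price * (1 - pct / 100)

def apply_discount_after(price: float, pct: float) -> float:
    """Incoming diff: a refactor accidentally treats pct as a fraction."""
    return price * (1 - pct)  # bug: pct=10 now means 1000% off

def generated_catching_test(fn) -> bool:
    """Stand-in for an LLM-generated test: returns True if fn behaves correctly."""
    return abs(fn(200.0, 10) - 180.0) < 1e-9

# The test passes on the old code and fails on the diff, flagging the
# change for review. A human still decides whether the failure is a real
# bug or an intentional spec change -- that is the oracle problem.
```

Note what this sketch can and cannot see: it exercises one function's immediate behavior, which is exactly the unit-level scope described above.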

The gap neither of them fills

There's a third category that both systems structurally ignore: user journey testing.

The checkout flow that silently dead-ends when a promo code is applied. The signup that completes on desktop but drops users on mobile Safari after the email confirmation step. The dashboard that loads correctly in isolation but throws a 403 when navigated to from a shared link. These are behavioral bugs. They only surface when a real user clicks through a sequence of actions — and they're invisible to both a security scanner and a per-diff unit test generator.
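A journey test is the kind of check that would catch these. The sketch below is illustrative: the `CheckoutFlow` class is a toy stand-in for a real app (which you would drive through a browser automation tool), and the mobile promo-code dead-end is an invented bug. What matters is the shape of the test: it executes a sequence of actions and asserts on the end state, not on any single function's return value.

```python
# Illustrative sketch of a journey test. CheckoutFlow is a toy stand-in
# for a real app; the mobile-only promo bug is invented for illustration.

class CheckoutFlow:
    def __init__(self, platform: str):
        self.platform = platform
        self.state = "cart"

    def apply_promo(self, code: str) -> None:
        # Emergent bug: promo handling silently dead-ends the flow on
        # mobile only. Each step works "in isolation"; the failure only
        # exists in the sequence.
        if self.platform == "mobile":
            self.state = "dead_end"
        else:
            self.state = "cart_discounted"

    def pay(self) -> None:
        if self.state in ("cart", "cart_discounted"):
            self.state = "confirmed"

def journey_test(platform: str) -> bool:
    """Click-through sequence: apply a promo, then pay. Passes only if the
    order actually confirms at the end of the journey."""
    flow = CheckoutFlow(platform)
    flow.apply_promo("SAVE10")
    flow.pay()
    return flow.state == "confirmed"
```

Run per platform, the same journey passes on desktop and fails on mobile. A unit test of `apply_promo` alone would not flag this, and a security scan has no reason to look at it.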

A single broken checkout flow costs more in customer lifetime value than a year of testing infrastructure — and most teams discover it only when a customer emails to say something is broken.

Security testing doesn't touch these because they're not vulnerabilities. Code-level catching doesn't touch them because they're not regressions in a single function; they're emergent failures in multi-step flows. Right now, the only reliable ways to catch them are manual QA, end-to-end test suites that someone has to write and maintain, or actual user reports.

Penligent published a taxonomy in March 2026 noting that "AI testing" now refers to at least five distinct categories, and the terminology itself is obscuring which problems are actually being addressed. Muggle AI is building specifically for this journey-testing layer: paste a URL, get journey coverage across your key flows (muggle.ai). That's a different class of problem from security scanning or unit diffing: you're not testing what code does, you're testing what a user experiences through a sequence of real steps.

Which layer you actually need right now

This isn't a "you need all three" post — that's easy to say and hard to act on. Here's a more honest framing:

If you're handling payments or sensitive user data: security testing is the non-negotiable starting point. The Lovable + Aikido model makes it accessible at a price that removes the excuse.

If you're shipping AI-generated code at speed (vibe coding, rapid prototyping, whatever you want to call it): code-level catching of the JiTTests variety addresses the specific risk that your diff introduces a regression no one reviewed. Those are different threat models.

If users are dropping off or churning from flows that "should work": neither of those tools will find the problem. That's a journey testing gap, and most small teams have none of the three layers covered.

Developer trust in AI-generated output has already slipped — Stack Overflow data shows it falling from 69% to 54% — and the pressure to ship fast hasn't changed. The testing infrastructure hasn't kept pace with the generation infrastructure. That's the actual problem statement.

What March 2026 actually shipped

Two out of three layers got serious investment this month. Security testing is now accessible to teams that previously couldn't afford it. Unit-level catching is showing real signal at Meta's scale, even if the confirmed-bug sample is small. Both are genuine progress.

The third layer — testing what users actually experience when they click through your product — is the hardest to automate, and it didn't ship this month. Testing a behavioral flow requires understanding intent, state, and sequence in a way that doesn't reduce to "does this function return the right value" or "is this endpoint vulnerable to injection." The industry knows what the gap is. The hard part is that solving it means building something that can reason about user experience, not just code behavior. That's a different class of problem — and it's the one most teams discover only after a customer emails to tell them something is broken.

Which of these three layers does your team actually have covered?
