Scarab Field Test: Repairing an AI-Generated App Without Guessing Its Intended Baseline
I’ve been building Scarab Diagnostic Suite around a problem I keep seeing in AI-assisted development: the app may look close to finished, the code may be mostly there, and some checks may even pass, but the repo still isn’t in a trustworthy state.
So I tested Scarab against a public GitHub repo whose owner was explicitly asking for help with an AI-generated web app. The app had been created through a generated/vibe-coded workflow, and the owner wanted help cleaning it up, fixing broken behavior, and making it more stable.
The interesting part wasn’t just “can the code be fixed?”
The interesting part was: what does fixed mean for this repo?
Scarab’s repair pass surfaced that there were actually two valid repair postures:
- TypeScript-intended: treat npm run typecheck as a real acceptance gate.
- Build/lint only: treat the app as a generated JavaScript React export, where build + lint are the intended acceptance boundary.
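One way to make the two postures concrete is to write each one down as an explicit list of acceptance gates. The sketch below is illustrative only and is not Scarab's implementation: the posture keys, the "build" and "lint" script names, and the shape of the results dictionary are assumptions. Only the typecheck script is named in this write-up.

```python
# Hypothetical sketch: repair postures expressed as explicit acceptance gates.
# Posture keys and the "build"/"lint" script names are assumptions for
# illustration; only "typecheck" is named in the write-up.
POSTURES = {
    "typescript-intended": ["build", "lint", "typecheck"],
    "build-lint-only": ["build", "lint"],
}


def failing_gates(posture, results):
    """Return the gates required by `posture` that did not pass.

    `results` maps an npm script name to True (passed) or False (failed).
    A gate with no recorded result counts as failing, because there is
    no evidence for it.
    """
    return [gate for gate in POSTURES[posture] if not results.get(gate, False)]


def is_fixed(posture, results):
    """A repo counts as fixed only relative to the posture it chose."""
    return not failing_gates(posture, results)


# The same check results give two different verdicts, depending on the baseline.
results = {"build": True, "lint": True, "typecheck": False}
print(is_fixed("build-lint-only", results))           # True
print(is_fixed("typescript-intended", results))       # False
print(failing_gates("typescript-intended", results))  # ['typecheck']
```

The point of the sketch is that "fixed" is a function of two inputs, the check results and the chosen posture, so the posture has to be stated before the verdict means anything.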
That distinction matters because a diagnostic suite should not blindly impose a standard the repo never chose. Sometimes the repair is not just technical. Sometimes the repair is clarifying the repo’s actual operating baseline.
Both repaired versions now:
- build successfully
- lint successfully
- run locally in the browser
- render the app correctly
- include saved runtime evidence/screenshots
- pass browser smoke checks across key routes
One of the more useful findings was that static checks were not enough. A governance/static pass could look clean while the browser runtime still revealed real problems: stray generated stub text, React not mounting meaningful app content, and missing local Base44 helper behavior outside the hosted runtime.
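Two of those runtime problems, an empty React mount and stray stub text, can be checked mechanically once a browser has rendered the page. The sketch below is a minimal illustration under stated assumptions, not Scarab's actual check: it expects a DOM snapshot that some browser tool has already serialized to HTML, it assumes the app mounts into an element with id "root", and the stub markers are made-up examples because the real stray text is not quoted here.

```python
from html.parser import HTMLParser

# Assumed stub markers; the actual stray text found in the repo is not quoted here.
STUB_MARKERS = ("TODO: implement", "placeholder content")

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MountInspector(HTMLParser):
    """Collects page text and, separately, text inside the mount element.

    Assumes a well-formed snapshot (as serialized by a browser), where every
    non-void element has a closing tag.
    """

    def __init__(self, mount_id):
        super().__init__()
        self.mount_id = mount_id
        self.depth = 0  # greater than 0 while inside the mount element
        self.mounted_text = []
        self.all_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.mount_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.all_text.append(text)
            if self.depth > 0:
                self.mounted_text.append(text)


def smoke_check(html, mount_id="root"):
    """Return a list of runtime problems found in a rendered DOM snapshot."""
    inspector = MountInspector(mount_id)
    inspector.feed(html)
    inspector.close()
    problems = []
    if not inspector.mounted_text:
        problems.append(f"#{mount_id} has no rendered text content")
    page_text = " ".join(inspector.all_text).lower()
    for marker in STUB_MARKERS:
        if marker.lower() in page_text:
            problems.append(f"stray stub text found: {marker!r}")
    return problems


empty_shell = '<html><body><div id="root"></div>TODO: implement</body></html>'
rendered = '<html><body><div id="root"><h1>Dashboard</h1><img src="a.png"></div></body></html>'
print(smoke_check(empty_shell))  # two problems: empty mount, stray stub text
print(smoke_check(rendered))     # []
```

Neither problem in the first snapshot would be caught by a build or lint step, which is why the check has to run against what the browser actually produced.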
That is exactly the kind of failure I’m interested in.
Not just “does the code pass a command?”
But:
- does the app actually render?
- does the local runtime behave?
- did the repair preserve the app’s intent?
- did the repo become more coherent afterward?
- is there evidence, not just confidence?
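The last question on that list can be made strict: a check only counts once it has a saved artifact behind it. The sketch below is a hypothetical record format, not Scarab's output; the field names and file paths are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """One verification question plus the artifacts that back the answer."""
    question: str
    passed: bool
    # Paths to saved screenshots, logs, or reports (hypothetical examples below).
    evidence: list = field(default_factory=list)


def verdict(check):
    """A claimed pass with no saved evidence is reported as unverified."""
    if not check.evidence:
        return "unverified"
    return "pass" if check.passed else "fail"


checks = [
    Check("does the app actually render?", True, ["evidence/home.png"]),
    Check("does the local runtime behave?", True),  # claimed, nothing saved
    Check("did the repair preserve intent?", False, ["evidence/routes.log"]),
]

for check in checks:
    print(f"{verdict(check):>10}  {check.question}")
```

Treating "passed but undocumented" as its own state keeps confidence from being silently counted as evidence.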
This is the distinction I keep coming back to with AI-generated code: completion is not the same as repo health.
A coding agent can produce code quickly. A repair pass can make code green. But a repo still needs a way to ask whether the result matches its actual baseline, whether the verification boundary is appropriate, and whether the system has returned to a quieter, more trustworthy state.
That is the space Scarab is being built for: repo-local diagnostics that maintain repo intent and inform repair for AI-assisted development.
Not another coding agent.
A way to help the repo prove what is true after the agent is done.
Top comments (5)
The 'two valid repair postures' distinction is exactly what we wrestle with in test automation — is the suite validating 'does it work as intended' or 'does it match what was built'? Those are completely different baselines and most frameworks conflate them.
The static-checks-pass-but-runtime-fails pattern is something we hit constantly with AI-generated UI code. Linted, compiled, deployed, still broken when a real session touches it.
Do you find the repair posture choice ends up being more of a technical decision or an organizational one — i.e., does the team even agree on what 'fixed' means before Scarab gets involved?
Excellent question!
The repair posture is partly technical and partly governance-driven. Scarab can surface the evidence that shows which baselines are actually available in the repo — for example, whether the repo is behaving like a TypeScript-intended app or more like a generated JavaScript React export with build/lint/runtime as the meaningful acceptance boundary.
But Scarab doesn’t “decide” what fixed means in a subjective way. It is deterministic. It does not invent intent, and it is not the thing applying the repair.
Its role is to surface the repo evidence, clarify the viable baseline, expose where the current state is breaking against that baseline, and give the AI coding agent a narrower, better-supported repair context.
So in a healthy workflow, the team or repo governance owns the definition of “fixed.” Scarab helps make that definition inspectable and testable, then the coding agent applies the patch against the evidence Scarab surfaced.
That’s the line I’m trying to preserve: Scarab doesn’t replace judgment or repair the repo by vibes. It gives the repair process a deterministic evidence layer.
The TypeScript vs build/lint posture finding is the part that sticks with me.
On the surface it's a technical fork. But really it's a test oracle problem — you can't evaluate a repair until you know what the repo intended to be. Most tools skip that question and just make the red turn green. What I like about your approach is you force that question into the open before anyone touches code.
As a QA person, that's the difference between "tests pass" and "the system is in a known good state." They look the same on a dashboard. They're not the same thing.
Have you run into a case where surfacing the two postures actually changed what the team decided to fix — or did they always pick the same one once they saw the evidence?
That’s the exact question I’m trying to test next.
So far, I haven’t run this in a formal team setting where a team sat down, reviewed two repair postures, and explicitly chose one as governance. Most of the current work has been field tests against public repos and open issues.
What I have seen, though, is that surfacing the posture changes the shape of the repair conversation. In a couple of cases, the patch itself wasn’t merged directly, but the maintainers adopted the same underlying repair direction and implemented a fix they could own. To me, that’s still a useful signal: the diagnostic didn’t just make something green, it surfaced the boundary clearly enough that the maintainer could recognize the actual failure and repair it on their terms.
That’s also why I’m careful not to say Scarab decides what “fixed” means. It can expose the available baselines and the evidence behind them, but in a healthy project the repo owner or team has to own the final posture.
The part I’m watching for now is exactly what you named: whether making the test oracle problem visible earlier changes the repair decision before code gets touched. My instinct is yes, but I want the field tests to prove that rather than just claim it.
"my instinct is yes, but I want the field tests to prove it" — that sentence tells me more about how you build than any product page could.
The bit about maintainers adopting the direction but not the patch is actually quite telling. In testing we call that "report the finding, don't prescribe the fix." The best bug reports don't tell the developer how to fix it — they make the failure boundary so clear that the developer can't unsee it. Then they fix it on their own terms, which means they'll actually own it long-term.
Sounds like Scarab is doing the same thing, just at the repo level instead of the test level.
Keep the field tests coming. I'm following the series now — the LangChain streaming vs structured output one was particularly good.