Scarab Systems

Posted on Jun 10

The Real AI Coding Breakthrough Is Not More Context. It Is Better Diagnostics.

#ai #programming #devops #discuss

Distinguishing repo 'drift' from failure

When I started building what became Scarab Diagnostic Suite, I was not trying to create a theory of AI-assisted software development.

I was trying to survive my own repo.

I was building an intricate frontend/backend system with Codex. Not a toy app. Not a landing page. A real system with a public site, internal content architecture, member-facing surfaces, product logic, backend APIs, frontend routes, editorial workflows, operational concerns, and a lot of moving parts that all needed to keep agreeing with each other.

Codex could move fast. Very fast.

It could generate files, wire components together, write migrations, patch errors, build routes, add tests, explain itself, and keep pushing the project forward. In the beginning, that felt like the whole promise of AI coding agents.

Then the drift started.

Not failure, exactly.

Drift.

That distinction matters.

The agent could still produce code. The build could still pass. A test could still go green. The UI could still render something. But underneath that apparent progress, the repo would start losing its shape.

A file would grow too large because the agent kept adding “just one more thing.” A helper would appear where a real module boundary should have existed. A frontend route would start owning behavior that belonged in the backend. A generated artifact would quietly become treated as source truth. A test would prove the patch instead of proving the intended behavior. A workaround would survive long enough to become architecture.

The code worked.

The repo was less coherent.

That is where Scarab began.

Not as another coding agent. Not as a replacement for tests. Not as a magic repair tool. Scarab began as a response to one very specific pressure: how do you keep an AI coding agent from slowly pulling a real software project out of alignment with itself?

The problem was not that AI could not code

AI can code.

That is no longer the interesting question.

The more important question is whether AI can keep coding inside the shape of the system you are actually trying to build.

That is much harder.

A repo is not just a pile of files. It has ownership. It has boundaries. It has rules. It has assumptions. It has places where truth is supposed to live. It has framework conventions. It has tests that are supposed to prove something meaningful. It has public behavior, internal behavior, generated behavior, runtime behavior, and maintainer expectations.

When those layers agree, the system feels coherent.

When they stop agreeing, the system starts drifting.

The uncomfortable part is that we often ask AI to preserve context that humans have not actually made stable.

We want the agent to know what matters. We want it to patch narrowly. We want it to respect architecture. We want it to understand which file owns which responsibility. We want it to know which tests are authoritative and which ones are incidental. We want it to know when a workaround is acceptable and when it is a quiet violation of the system.

But very often, the repo itself does not clearly know those things.

The docs may be stale. The tests may only prove part of the behavior. The runtime may contradict the intended architecture. The issue may describe the symptom but not the boundary that failed. The human may know the rule intuitively, but the repo may not preserve that rule anywhere the agent can reliably obey.

So we put an AI coding agent into an unstable field and act surprised when it drifts.

But if the system has not defined its own boundaries, why would we expect the AI to enforce them?

More context is not the same as stable context

A lot of the AI coding conversation is still orbiting the same set of answers: bigger context windows, smarter models, more autonomous agents, better prompts, stronger tools, more memory.

Those things can help.

But they do not solve the underlying problem by themselves.

A larger context window can hold more text. It does not automatically decide which text is authoritative. A smarter model can reason better. It can still reason toward the wrong baseline. A more autonomous agent can move faster. It can also spread a bad assumption across more surfaces before anyone notices.

The problem is not simply that the AI needs more context.

The problem is that the context needs ownership.

Some context should not live as a fragile conversation with the model. Some context should not depend on whether the agent remembered a sentence from three prompts ago. Some context should not have to be re-explained every time the AI touches the repo.

The repo should be able to say:

This surface owns this behavior.

This file is canonical.

This pattern is deprecated.

This test proves this boundary.

This generated output is not source truth.

This workaround is temporary.

This framework convention must be preserved.

This is what “done” means here.

That is a very different posture from “let the AI figure it out.”

It gives the AI a better job.

Instead of asking the agent to be the architect, the historian, the maintainer, the tester, the reviewer, and the builder all at once, the repo can hold the system rules outside the model. Then the AI can do what it is actually good at: move quickly inside a bounded structure.

That is when AI becomes more powerful, not less.

The real breakthrough is not dumping more responsibility onto AI.

It is taking the weight of unstable context off the AI.

Diagnostics are not optional in serious systems

There is a reason every imagined advanced system, from science fiction ships to humanoid machines, is always running diagnostics.

Before the system acts, it checks itself.

Are the subsystems online?

Are the readings coherent?

Is the signal real?

Is the failure isolated?

Which boundary is compromised?

That pattern exists because it reflects something fundamental: serious systems need ways to verify their own operating state.

Software is no different.

And AI-assisted software needs this even more, not less.

If an AI coding agent can touch dozens of files quickly, then the repo needs a way to determine whether those changes preserved the system’s truth. Not just whether the code compiles. Not just whether the check passed. Not just whether the agent says the task is complete.

The repo needs to ask:

Did this change respect ownership?

Did it preserve the boundary?

Did it move behavior into the right layer?

Did it make the test more truthful or more theatrical?

Did it reduce drift or simply silence the visible failure?

Did the system become more coherent after the repair?

That is the diagnostic layer Scarab is aimed at.

Scarab is not trying to be the coding agent. It is not trying to make subjective product decisions. It is not trying to replace maintainers or developers.

It is trying to hold repo truth outside the AI.

That distinction is the whole product.

The first use case was active AI-assisted development

The original Scarab use case was not public open-source repair.

The original use case was active development.

I needed a way to keep Codex working inside the boundaries of a repo that was becoming too large and too intricate to keep entirely in my head at all times.

The loop looked something like this:

The human defines the project direction.

The repo preserves the rules, boundaries, and accepted structure.

The AI coding agent writes or changes code.

Scarab checks whether the change still aligns with the repo’s accepted truth.

If a contradiction appears, Scarab surfaces the diagnostic boundary.

The human or coding agent repairs under direction.

Scarab runs again to verify that the contradiction has actually been resolved.

That is very different from “AI writes code and we hope the tests catch it.”

It is also different from “AI writes code and later someone cleans it up.”

The point is to keep the repo’s truth under watch while the repo is being changed.

That is the forward-facing use case: prevent drift from becoming normal.

Then the field tests changed the shape of the theory

Once Scarab started working as an active guardrail, a larger question appeared.

If Scarab can detect drift while it is forming, can it also detect drift after it has accumulated?

That question pushed Scarab into public field testing.

I started running diagnostic passes against real open-source issues. Not toy repos. Not staged demos. Real issues in real software projects, across different languages, frameworks, runtimes, and maintainer cultures.

The point was not to prove that every bug was caused by AI.

That would be too narrow, and honestly not true.

The point was to test whether the underlying diagnostic lens held up outside my own repo.

Could the same boundary theory apply to systems that were already drifting before Scarab arrived?

The answer is starting to look like yes.

That is the part that changed the product for me.

Scarab began as a way to supervise AI-assisted development. But the field tests started showing that the same principle can work backward. It can enter an existing failure surface, identify the boundary that stopped preserving truth, and help drive a narrow repair candidate.

That matters because drift does not care whether it was introduced by a human, a team, an AI coding agent, a migration, a stale test, a runtime change, or five years of “just patch it for now.”

Drift is drift.

The system has fallen out of alignment with itself.

The field tests are proving the theory in public

The most important thing about the field tests is not that they produce patches.

The important thing is that the same diagnostic question keeps working across very different systems:

What truth was this system supposed to preserve, and which boundary stopped preserving it?

In a pnpm field test, the issue was not really “self-upgrade is broken” in some broad sense. The boundary was narrower: a command meant to upgrade the pnpm binary was falling through into project dependency behavior when no project manifest existed. The repair was accepted and merged because the boundary was specific: skip the dependency-status auto-install path when the no-manifest condition is intentional, while preserving the existing fallback behavior where a project manifest exists.

That is not a vibes-based repair.

That is a boundary repair.

In a Rust compiler field test, the visible issue was that infinitely recursive nested structs had started compiling in release mode. That sounds huge from the outside. But the repair surface was narrower: raw pointer metadata needed to remain conservative, while the optimized/codegen path still needed to force pointee layout validation so impossible recursive layouts did not slip through.

Again, the useful diagnostic was not “Rust is broken.”

It was: this specific validation ownership boundary was lost in this specific optimized path.

In a Playwright field test, the visible symptom was a hang: response.allHeaders() could fail to settle for a worker script loaded inside an iframe. But the diagnostic boundary was smaller than “Playwright hangs.” The response metadata completion path lost ownership in a worker-frame session. Normal Chromium response metadata behavior should remain intact, but the worker-main-script case needed a bounded fallback so the public API could complete instead of waiting forever for metadata that would not arrive.

Different stacks.

Different languages.

Different symptoms.

Same pattern.

Something stopped preserving the truth another part of the system depended on.

That is the Scarab lane.

Narrow repair is not small thinking

One thing I keep emphasizing in the field reports is the phrase “narrow repair.”

That can sound minor.

It is not.

A narrow repair is not small because the bug is unimportant. It is small because the boundary is clear.

That is exactly what you want.

A broad patch often means the failure boundary is still blurry. The repair starts touching nearby code. It adds guards. It rewrites tests. It changes behavior around the symptom. It makes the red light go away, but it may leave the deeper contradiction intact.

A narrow repair says:

The break is here.

This handoff lost ownership.

This test proves the boundary.

This behavior must stay unchanged.

This is out of scope.

This is the smallest lane that can restore coherence.

That is not just useful for open-source contribution. It is useful for AI-assisted development in general.

AI coding agents are very good at generating broad plausible changes. That is both the gift and the danger. If the repair lane is not bounded before the agent begins, the agent may repair the visible failure by creating three new invisible ones.

Diagnostics make the repair smaller before the repair begins.

That is where AI becomes safer and more useful.

Scarab is not anti-AI

I want to be very clear about this because the conversation around AI coding often collapses into two extreme positions.

One side acts like AI should be trusted to build everything autonomously if the model is strong enough.

The other side acts like AI-generated code is inherently suspect and the safest answer is to avoid it entirely.

I do not think either position is serious enough.

AI coding agents are powerful.

They are also easy to misuse.

The answer is not to pretend they cannot help. The answer is also not to pretend they can hold every boundary of a complex system by themselves.

The better posture is to give AI a structure it can work inside.

That is what good human teams already do. A serious engineering team does not rely on one developer’s memory to preserve the whole system. It uses architecture, tests, review, conventions, ownership, documentation, CI, runtime checks, and operational discipline.

AI-assisted development needs the same kind of seriousness.

Actually, it needs more.

Because the agent moves quickly, sounds confident, and can produce changes faster than a human can fully inspect by intuition.

That does not mean the agent should be discarded.

It means the repo needs a diagnostic layer around the agent.

Rules and boundaries are not there to limit intelligence

This is one of the deeper shifts for me.

At first, it is easy to think of guardrails as restrictions.

Do not let the AI touch this.

Do not let the AI do that.

Do not let the AI go too far.

But I think the more interesting version is almost the opposite.

Rules and boundaries make AI more powerful because they stop the agent from wasting intelligence guessing the shape of the system.

If the AI has to infer everything every time, it spends its power reconstructing context. It has to guess what matters. It has to guess which layer owns which behavior. It has to guess whether the current implementation is intentional or accidental. It has to guess whether a test is authoritative or just historical residue.

That is a terrible use of AI.

The repo should hold that.

The human should not have to keep every implementation rule in their head.

The AI should not be trusted to reinvent the architecture from scratch.

The repo itself needs to preserve enough structure that the AI can build creatively without constantly distorting the system.

That is the deeper theory behind Scarab.

Human vision can be messy. Creative direction can be emergent. You do not always know the final shape of the thing before you start building it. That is true in software, writing, design, research, and product work.

AI can be extremely useful in that process because it can turn vague direction into working material quickly.

But if the project cannot preserve its own boundaries, AI may turn early uncertainty into permanent architecture too fast. It may solidify the wrong thing. It may preserve an accidental implementation as if it were intentional design. It may make the compromised version feel like the only version you can ship.

That is the tragedy.

AI should help builders reach the real version of the project.

It should not pressure them into accepting the version that survived the drift.

The backward-facing use case proves the forward-facing one

The field tests are useful because they show Scarab working in recovery mode.

A public issue already exists. A failure has already surfaced. The repo has already hit a contradiction. The diagnostic pass works backward from symptom to boundary.

That is valuable by itself.

But to me, the deeper value is that recovery mode proves the forward-facing use case.

If the same diagnostic lens can identify accumulated drift after the fact, then it can also be used earlier, while the repo is being changed, to keep the contradiction from becoming normal in the first place.

Those two modes reinforce each other.

Backward-facing mode says: here is where the system drifted.

Forward-facing mode says: do not let this kind of drift enter silently again.

That is the product direction I care about most.

Not “AI writes code faster.”

Not “AI fixes everything.”

Not “humans are obsolete.”

The more interesting future is:

Humans define intent.

Repos preserve rules and boundaries.

AI agents build inside those constraints.

Diagnostics prove whether the system stayed coherent.

That is the workflow I think serious AI-assisted development is moving toward.

The core principle is simple

All systems have rules and boundaries.

If those rules are not explicit, they do not disappear. They become implicit. If they are implicit, they become easy to violate. If they are easy to violate, AI will eventually violate them faster than humans can notice.

That is not because AI is malicious.

It is because AI is operating inside whatever structure we give it.

If we give it a repo with unclear ownership, stale docs, partial tests, accidental patterns, and no stable source of truth, we should not be surprised when it produces locally plausible changes that deepen the incoherence.

The fix is not just better prompts.

The fix is better system discipline.

A repo should know what it is.

It should know which rules matter.

It should know where truth lives.

It should know what must be preserved.

It should know what has changed.

It should know when a boundary has been crossed.

And when an AI agent enters that repo, the agent should not be carrying all of that alone.

That context should be held outside the model.

That is where I think the real AI coding breakthrough begins.

Not with an agent that remembers everything.

With a repo that can prove what is true.

Where this is going

Scarab is still evolving.

The field tests are part of that evolution. They are not polished marketing demos. They are pressure tests. They show where the diagnostic theory holds, where it needs sharpening, and how it behaves across real projects with real maintainers and real failure surfaces.

So far, the pattern is holding.

AI did not invent code drift.

It made it impossible to ignore.

But the answer is not to fear the AI.

The answer is to stop asking AI to carry rules that the system itself has not stabilized.

If we want AI coding agents to become serious development partners, then we need more than generation.

We need diagnostics.

We need repo-local truth.

We need boundaries that can be checked.

We need evidence before repair.

We need a way to tell the difference between a patch that silences a symptom and a repair that restores coherence.

That is what Scarab is becoming:
a deterministic diagnostic layer for AI-assisted development and accumulated software drift.

The public field tests are the proof trail.

The theory is simple:
the bug begins where the boundary stops holding.

The repair begins when the system can prove where that happened.

Top comments (9)

Mykola Kondratiuk • Jun 15

not sure the framing holds - context and diagnostics aren't competing solutions. context prevents wrong generation, diagnostics catch post-generation drift. skimping on context just creates more diagnostic failures downstream.

Scarab Systems • Jun 15

I agree with this distinction — context and diagnostics are not competing in the simple sense.

Context helps prevent bad generation. Diagnostics catches whether the change preserved the system after generation.

The part I’m pushing on is that “more context” is often treated as if it automatically creates better authority. It doesn’t. A larger context window can include stale docs, incidental tests, generated artifacts, old workarounds, and contradictory implementation history. The agent can see more and still reason toward the wrong baseline.

So I don’t think the answer is to skimp on context. I think the answer is to make context governed.

Which surfaces are authoritative?

Which claims does the repo depend on?

Which tests prove behavior versus merely bless the patch?

Which generated outputs are evidence, and which are not source truth?

That’s where diagnostics become necessary. Not as a replacement for context, but as the mechanism that checks whether the context actually preserved the repo’s truth.

Mykola Kondratiuk • Jun 15

right, the authority problem is real - freshness isn't in the token stream, so the model can't distinguish 'current spec' from 'old workaround.' diagnostics give you the signal that something drifted, even if they don't explain why.

Ethan Walker • Jun 17

Strong agree that diagnostics beat raw context. The version that bit us in eval: a failing test that just says 'score 0.4' is useless, you re-run it five times guessing. The fix was making every eval failure emit WHY, which check failed, on which input, actual vs expected, so the red is actionable instead of a mystery. Same principle as your point: the signal is worthless if it does not tell you what to do next. We treat 'this eval went red but I cannot tell why' as a bug in the eval, not a flaky result.

Scarab Systems • Jun 17

Yes — exactly this.

A red eval that only says score: 0.4 is basically the eval equivalent of a stack trace with the useful part cut off.

At that point the eval has detected pain, but not produced diagnosis.

The standard I keep coming back to is:

if the failure does not tell you where the contract broke, the failure report is part of the bug

That is especially true with agents because a bad final output can come from a lot of different places: wrong retrieval, stale context, bad tool call, bad assumption, bad test expectation, missing fixture, or the agent “repairing” around the eval instead of satisfying the real contract.

So yes, I completely agree with treating “red but unknowable” as a bug in the eval itself.

The useful eval is not just pass/fail.

It should say:

which check failed
what input triggered it
what contract was expected
what actually happened
whether the failure is in the model output, the test, the fixture, the retrieval/context, or the surrounding system

That is where diagnostics become more valuable than raw context. Context gives the agent more to read. Diagnostics tell the system what broke and where to look next.

Alex Shev • Jun 11

Diagnostics beat context size almost every time. More context can make an agent sound more confident, but diagnostics tell it what actually changed, what failed, and where to look next.

The best AI coding workflows will probably feel less like bigger prompts and more like terminal skills with sharp probes and clean failure messages.

xulingfeng • Jun 11

Drift vs failure — that distinction reframes every 'AI went rogue' scenario I've been writing. The AI never actually fails at the technical level. Tests pass, builds go green, CI gives a thumbs up. But the repo quietly forgets what shape it was supposed to hold. Diagnostics over bigger context windows, any day. Been tracking the field tests — pnpm and Rust, that's the kind of narrow repair that makes the whole thesis click. 🔥

Scarab Systems • Jun 11 • Edited

Thank you! As always the subtle reframing from this kind of feedback is extremely valuable for me....it helps me to continue to refine, mentally hold the theory and stress test it's edges.

I'm setting up a full Scarab Field Lab on Github now - Link available shortly!

Hossein Yazdi • Jun 11

Interesting point. The more I use AI for coding, the more I think the bottleneck is understanding and preserving system intent, not generating code.