DEV Community

Scarab Systems
Scarab Systems

Posted on

AI Code Quality Is Not Repo Truth

There is a pattern starting to show up everywhere now.

A team uses an AI coding agent. The agent moves fast. It writes code, rewrites tests, updates docs, creates abstractions, touches config, patches runtime behavior, and explains itself confidently.

Then the team starts noticing the uncomfortable part.

The code is not always “wrong” in the obvious way. It may compile. It may pass tests. It may satisfy a review checklist. It may even look cleaner than what was there before.

But something in the system has shifted.

A test now proves the patch instead of the behavior.

A README now describes an API the repo does not actually expose.

A generated artifact is treated like source truth.

A config file silently becomes the repair surface for a runtime problem.

A fallback path preserves uptime but loses correctness.

A frontend change compensates for a backend contract that should never have moved.

The code looks finished, but the repo no longer tells one coherent story.

That is drift.

And I think the industry is at risk of misunderstanding what kind of problem drift actually is.

The wrong reflex: solve AI drift with more AI

A lot of the current response to AI-assisted code failure is still happening inside the same mental frame that created the problem.

The agent generates code.

Then another AI reviews the code.

Then another AI summarizes the review.

Then another AI writes tests.

Then another AI explains whether the tests passed.

This can be useful. It can catch things. It can improve workflows.

But it does not solve the deeper problem.

If the source of truth is still conversational, probabilistic, and context-fragile, then you have not created a stable diagnostic layer. You have created another layer of interpretation.

That may be better than nothing, but it is not the same as proof.

The industry keeps trying to make the AI agent more self-aware, more careful, more reflective, more heavily prompted, more supervised by other AI systems.

But drift is not primarily a personality flaw in the agent.

Drift is what happens when a system loses track of which boundary owns which truth.

That means the missing layer cannot just be another AI opinion.

It has to be outside the agent loop.

Step outside the conversation

One of the easiest ways to see this is to stop imagining yourself as the developer talking to your AI coding agent.

Instead, imagine you are standing outside the workflow, watching another developer have that conversation.

The developer tells the agent to fix a bug.

The agent changes production code.

Then it changes the tests.

Then it edits the docs.

Then it updates configuration.

Then it says the work is complete.

From inside the conversation, this can feel productive.

From outside the conversation, the obvious process question appears:

Who is checking whether each layer still owns the thing it claims to own?

If a technician from another department made those same changes, a serious engineering team would not simply ask, “Did they finish?”

They would ask:

What did they touch?

Which system boundary did they cross?

Which contract changed?

Which evidence proves the behavior still belongs there?

Which test proves the original claim rather than the new patch?

Which source is authoritative now?

AI-assisted development needs that same operational distance.

Not because AI is bad.

Because AI is fast enough to cross boundaries faster than the team can notice.

AI code quality is not the same as repo truth

There are useful tools emerging around AI code quality: guards, review skills, semantic linters, test checkers, docs checkers, prompt rules, agent policies, and CI add-ons.

Some of them catch real problems.

They can flag hallucinated APIs.

They can notice mock abuse.

They can detect documentation that references missing functions.

They can warn about over-abstraction, broad error swallowing, unsafe patterns, or framework-specific mistakes.

That is valuable.

But catching known output patterns is not the same as proving that the repo’s truth model was preserved.

A guard might say:

“This test mocks too much.”

A diagnostic system has to ask:

“Did the test layer stop validating the behavior that the production system depends on?”

A docs checker might say:

“This function does not exist.”

A diagnostic system has to ask:

“Is the documentation wrong, is the code missing a public API, is generated reference output stale, or did the ownership of this claim move?”

A code-quality tool might say:

“This abstraction is premature.”

A diagnostic system has to ask:

“Did this change move responsibility out of the surface that owns it?”

Those are related questions, but they are not the same question.

The first is output review.

The second is boundary diagnosis.

Scarab’s realignment

Scarab Diagnostic Suite was built around a simple premise:

AI should not be the source of truth for AI-assisted code work.

The repo has to be the source of truth.

The diagnostic layer has to be deterministic, mechanical, evidence-first, and independent of the coding agent’s conversational state.

That is the realignment.

Instead of asking an AI agent to remember every contract, every invariant, every architectural rule, every generated artifact boundary, every test obligation, and every repo-specific convention, Scarab works from the outside.

It inspects evidence.

It compares claims.

It surfaces contradictions.

It identifies boundary failures.

It gives the agent the right context only after the system has established what needs to be preserved.

That matters because an AI coding agent can only work with the context it has. If the context is wrong, stale, incomplete, or conversationally compressed, the agent can produce a very polished wrong answer.

Scarab is not trying to make the agent “smarter” in the abstract.

It is trying to stop the agent from operating without a stable map of repo truth.

The same failure shows up in different lanes

This is why the problem is bigger than one kind of codebase.

AI drift does not only affect open-source projects.

It affects frontend teams, backend services, data systems, DevOps workflows, scientific software, internal tools, agencies, startups, and companies that never thought of themselves as software companies until AI started writing code for them.

The pressure point changes by lane.

The underlying failure is the same.

A boundary stopped preserving the truth another part of the system depended on.

Frontend systems: UI behavior becomes the patch surface

In frontend work, drift often appears when the visible interface starts compensating for a broken contract somewhere else.

A component gets a defensive fallback.

A route adds extra state handling.

A client-side workaround hides an API inconsistency.

A UI test is updated to match the new rendering path.

The screen looks fixed, but the ownership question may be wrong.

Was the frontend supposed to absorb that behavior?

Or did the backend contract, router boundary, state model, accessibility surface, or generated client drift first?

A normal review may ask whether the UI works.

A boundary diagnostic asks whether the UI became responsible for something it does not own.

That distinction matters because once the wrong layer absorbs the repair, future changes inherit the lie.

Backend and API systems: contracts drift quietly

Backend drift often shows up as contract confusion.

A handler returns a slightly different shape.

A serializer changes behavior.

A migration updates data in a way the API layer does not fully encode.

A client library keeps working because it is permissive.

The tests pass because they were updated near the patch.

But the contract may no longer be stable.

In ordinary review, the question is often: “Does this endpoint work?”

The deeper question is:

“Which layer owns this contract, and what evidence proves the contract still matches implementation, documentation, tests, and clients?”

That is where drift hides.

Not always in a crash.

Often in a silent mismatch between what the system claims and what the system now does.

Data systems: freshness, schema, and provenance are not vibes

Data systems are especially vulnerable because the code can be technically correct while the data truth has already moved.

A schema changes.

A cache survives too long.

A migration succeeds but changes meaning.

A model reads from a snapshot that is no longer valid.

A pipeline output is treated as fresh because the job completed.

For data-heavy teams, the problem is not only whether the job ran.

It is whether the job preserved the assumptions the downstream system depends on.

What schema did this result assume?

What version of the source did it read?

Which migration state was active?

Which artifact is authoritative?

Which result is generated, and which result is source truth?

A deterministic diagnostic layer matters here because AI can explain a data pipeline beautifully and still miss the fact that the pipeline is reasoning from stale or misowned evidence.

DevOps and CI/CD: availability is not correctness

Automation failures often look like infrastructure problems.

A deployment succeeds but pulls the wrong image.

A cache hit looks valid but was built under a different assumption.

A fallback keeps the system alive but bypasses the verification path.

A retry prevents downtime but repeats a side effect.

A CI job passes because the failing surface was never exercised.

The industry has spent years building tools around availability, observability, retries, alerts, and recovery.

Those tools matter.

But AI-assisted development adds a different question:

Did the workflow preserve the proof that the result is still correct?

A green pipeline does not automatically mean the repo stayed truthful.

It means the pipeline’s checks passed.

Those are not always the same thing.

Tests: the most dangerous drift can look like validation

Tests are one of the first places AI coding agents can create false confidence.

An agent writes tests.

The tests pass.

The patch looks validated.

But what did the tests prove?

Did they prove the original behavior?

Did they prove the new implementation?

Did they mock away the system boundary?

Did they assert on internals?

Did they update the expectation to match the patch?

Did they delete the failure rather than preserve the regression?

This is why test quality is not just a code smell issue.

It is a truth issue.

A test is not valuable because it exists.

A test is valuable because it preserves a claim the system depends on.

When that claim moves silently, the repo can look safer while becoming less trustworthy.

Documentation: public claims are part of the system

Documentation drift is often treated as cosmetic.

It is not.

Docs are public claims.

A README, API reference, changelog, migration guide, or docstring tells users and future agents what the system is supposed to be.

When documentation references functions that do not exist, examples that cannot run, flags that no longer work, or behaviors that changed without a claim boundary, the repo has lost one of its truth surfaces.

This matters even more with AI agents, because agents read documentation too.

Bad docs do not only mislead humans.

They feed future automation.

That means documentation drift can become agent drift.

And agent drift can write more documentation drift.

That loop is exactly why an independent diagnostic layer matters.

Scientific and applied technical systems: correctness is the product

Not every company following AI-assisted development is a devtools company.

Biotech companies, research labs, analytics teams, agencies, logistics firms, ecommerce platforms, healthcare-adjacent software teams, and internal automation groups all have code.

Many of them are now using AI to write or maintain that code.

For those teams, drift is not an abstract developer concern.

It can mean measurement pipelines become less reproducible.

Reports no longer match source data.

Internal tools encode the wrong assumption.

A generated workflow silently changes how evidence is processed.

The company may not care about “AI code quality” as a category.

But they absolutely care when their software stops preserving the truth their business depends on.

That is the market-level problem.

The repair begins before the patch

The industry often talks about AI-assisted development as if the main question is how to generate better patches.

But the harder question comes before the patch.

What is the repair surface?

Which boundary owns the failure?

What evidence proves the failure belongs there?

What should not be touched?

What tests are allowed to change?

What generated artifacts are outputs, not authority?

What documentation claims must remain aligned?

What context does the agent need before it is allowed to act?

A patch without that map can make the system worse while looking helpful.

That is why Scarab is not a patch bot.

It is not a linter.

It is not a code review personality.

It is not another AI agent watching the first AI agent.

Scarab Diagnostic Suite is a proprietary diagnostic product built around evidence-first repo analysis.

SDS finds evidence.

People make claims.

Maintainers decide.

That boundary is important.

The diagnostic layer should not pretend to be the maintainer.

It should make the system legible enough for the maintainer, developer, or AI coding agent to act without guessing.

Field Lab: public diagnostic case records

Scarab Systems has opened a public Field Lab for selected diagnostic field tests.

The Field Lab publishes public case records from real open-source issues: the issue being examined, the suspected boundary, the evidence gathered, the validation performed, and the current status of the diagnostic claim.

Some cases may end with a local repair candidate.

Some may become upstream pull requests.

Some may remain diagnostic records only.

That status is part of the record.

Scarab Diagnostic Suite is proprietary, but the larger conversation is shared. AI-assisted development is changing how all of us work with code, and the goal of the Field Lab is to make boundary failures, drift patterns, and diagnostic reasoning easier to see from more than one angle.

We welcome Field Lab candidate suggestions from developers, maintainers, companies, researchers, and anyone working close enough to code to notice when something has stopped holding together.

If you know of a public open-source issue that looks like cross-layer drift, unclear ownership, AI-assisted codebase confusion, or a boundary failure, you can suggest it as a Field Lab candidate.

Useful suggestions include the public issue link, the suspected boundary, reproduction notes if available, and why the issue may be diagnostically interesting.

GitHub logo scarab-systems / scarab-field-lab

Public case library for Scarab Diagnostic Suite field tests, recording public issues, diagnostic findings, validation summaries, and upstream PR status without publishing private work materials.

Scarab Field Lab

Scarab Systems mascot holding a circuit-board lollipop

Scarab Field Lab is the public case library for Scarab Diagnostic Suite field tests.

Scarab Diagnostic Suite is proprietary and not open source. This repository publishes public case records only: public links, specific diagnostic findings validation notes, and, when applicable, the public status of a human-reviewed patch or upstream pull request. It does not contain SDS source code, internal diagnostic rules, product internals, private run artifacts, or implementation details.

Scarab Diagnostic Suite is a mechanical diagnostic layer. It inspects repository evidence, compares expected and observed behavior, and records specific findings. It is not an AI coding agent, does not use AI to determine diagnostic truth, and does not submit unattended patches. AI assistance may help after bounded diagnostic evidence exists with patch drafting or summaries, but public submissions remain human reviewed.

What This Repo Shows

  • Public issue and pull request links.
  • Sanitized diagnostic evidence.
  • Whether a case…

The conceptual shift

The AI coding conversation has been centered on the agent.

Better prompts.

Better models.

Better context windows.

Better reviews.

Better tool calls.

Better planning.

Better self-correction.

All of that may help.

But it does not remove the kink in the road.

The kink is that we keep asking AI to be both the worker and the source of truth for the work.

That is the part that has to change.

The repo needs an independent diagnostic layer.

The agent needs bounded context.

The repair needs an owned surface.

The system needs evidence before action.

Once you see it that way, the problem becomes much clearer.

AI did not make software boundaries matter less.

It made them matter more.

Because now the thing crossing those boundaries is faster, more confident, and less naturally constrained by the tacit knowledge a human team used to carry.

The next phase of AI-assisted development will not be won by teams that generate the most code.

It will be won by teams that can still prove what their codebase means after AI has touched it.

That is the real shift.

Not more AI inside the loop.

A stable diagnostic layer outside it.

Top comments (2)

Collapse
 
lyx19951121 profile image
LYX19951121

This resonates deeply. The backend API section hit home — I've seen AI agents update both the handler response shape AND the test assertions in the same pass, making the change invisible to review. The "outside the conversation" framing is key: the question isn't "does it work?" but "did the contract survive?" That's a different diagnostic altogether.

Collapse
 
scarab-systems profile image
Scarab Systems

Yes — that exact pattern is one of the clearest examples of why this needs to be treated as a diagnostic problem, not just a review problem.

When the handler response shape and the test assertion move together in the same pass, the test can stop acting like independent evidence. It may still pass, but it is no longer proving that the original contract survived.

That’s the piece I keep coming back to: a green test is only meaningful if we know what claim it was supposed to protect.

“Did the contract survive?” is exactly the better question. Because once the contract, implementation, and validation all move at the same time, the change can look clean while the repo has quietly lost the thing the test was meant to preserve.