DEV Community

Scarab Systems
Scarab Systems

Posted on

Scarab Systems Field Lab: Same Patch, 96% Less Final Decision Context

Scarab Systems has started formally evaluating the efficiency of using Scarab Diagnostic Suite during AI-assisted software repair.

The question is simple:

Does a diagnostic governance layer actually improve the agent patching workflow in a measurable way?

Not theoretically.

Not philosophically.

Not because “AI governance” sounds responsible.

But in a concrete repair scenario, against a real public repository, with a public upstream issue and an actual patch.

Does the workflow become more efficient?

Does the agent need less context?

Does the repair become more evidence-bound?

Does the process reduce waste?

That is what this comparison looked at.

The Case

The test case was public issue emersion/xdg-desktop-portal-wlr#379.

The issue concerned SelectSources behavior in xdg-desktop-portal-wlr.

When a D-Bus caller omitted the optional types field, the code initialized type_mask to 0.

That caused target selection to fail with:

wlroots: No supported targets specified
Enter fullscreen mode Exit fullscreen mode

The expected behavior was to default omitted types to monitor capture.

The final patch was a one-line repair:

- uint32_t type_mask = 0;
+ uint32_t type_mask = MONITOR;
Enter fullscreen mode Exit fullscreen mode

The SDS-supported public PR was opened as:

emersion/xdg-desktop-portal-wlr#393

At review time, the PR was open, not draft, mergeable, review-required, and passing Alpine, Arch Linux, and FreeBSD checks.

The Comparison

Scarab compared two workflows.

The first was a cold Codex baseline.

The baseline agent received only the public issue, the target repository snapshot, and a neutral prompt. It did not receive SDS findings, private notes, local repair history, runtime artifacts, standing profiles, or previous Scarab context.

The second was an SDS-guided workflow.

That workflow used selected diagnostic evidence, target-source inspection, and verification outputs to narrow the repair boundary before final agent decision-making.

The point was not to see whether SDS could produce a more impressive diff.

The point was to measure whether SDS could reduce the amount of context required for the agent to make the same repair decision responsibly.

The Cold Baseline Result

The cold baseline succeeded.

That matters.

Given enough repository context, Codex correctly identified the one-line fix:

- uint32_t type_mask = 0;
+ uint32_t type_mask = MONITOR;
Enter fullscreen mode Exit fullscreen mode

It also correctly explained why the change preserved existing behavior when callers explicitly provide types.

So the honest finding is not:

“Codex could not solve this without SDS.”

It could.

The more interesting finding is what it cost the agent contextually to get there.

The cold baseline used:

  • Input tokens: 70,212
  • Output tokens: 1,982
  • Total tokens: 72,194

That is the standard shape of a lot of AI-assisted coding today.

Give the model a large repo payload.

Ask it to reason broadly.

Hope it finds the right boundary.

Sometimes it does.

But that does not mean the context strategy is efficient.

The SDS-Guided Result

The SDS-guided workflow reached the same public patch, but through a narrower evidence path.

The final SDS patch-decision context was:

  • Final SDS context tokens: 2,651

A broader recorded-token total for the SDS workflow was:

  • Broader SDS recorded-token total: 6,445

The patch was the same.

The difference was the repair posture.

SDS changed the agent workflow from:

Read a large repo snapshot and reason broadly.

to:

Use selected evidence.

Inspect the owning source boundary.

Verify the patch.

Keep the final decision context small.

That is the core result.

The Token Difference

Compared with the cold baseline input token count, the SDS-guided final decision context used about 96.2% fewer tokens.

Compared with the cold baseline total token count, the SDS final decision context was about 27x smaller.

Even using the broader SDS recorded-token number of 6,445, the SDS-guided workflow still used about 90.8% fewer tokens than the cold baseline input count.

That is the part I think matters commercially.

A lot of AI coding infrastructure is being built around the assumption that the agent needs the giant payload.

The giant payload then creates another problem.

Now companies need infrastructure to manage the payload.

They need context compression.

They need memory systems.

They need retrieval layers.

They need orchestration.

They need agents that can carry more and more history forward.

But what if the agent does not need the 70,000-token payload for the repair decision?

What if the more valuable layer is not a bigger context-management system, but a diagnostic governance layer that surfaces the smaller truth boundary the agent actually needs?

That is what SDS is testing.

The Real Finding

For this issue, SDS did not help Codex discover a different repair.

That would be the wrong claim.

The public issue was clear enough that the cold baseline also found the correct one-line patch.

The value of SDS was different.

SDS made the agent more disciplined, more evidence-bound, and dramatically more context-efficient.

The agent did not need more intelligence.

It needed a narrower truth surface.

That matters because real-world patching is rarely only about generating a diff.

It is also about knowing:

  • Why is this the right repair?
  • What evidence supports it?
  • What source boundary owns the behavior?
  • What scope should the patch not exceed?
  • What validators are sufficient?
  • What downstream behavior needs to remain intact?

Those are governance questions.

And they are usually where agentic software change starts to get expensive.

Ephemeral Context Matters

There is another important point here.

The SDS-guided context was ephemeral.

The agent did not need to permanently remember the whole repair.

It did not need to carry the full repo history forward.

It did not need to maintain a giant long-running internal state about this codebase in order to be useful next time.

That is central to the Scarab theory.

The agent should not have to own the repo’s truth.

The repo governance layer should own the repo’s truth.

When the next task comes in, the system should surface the relevant operational truth again, mechanically and per task.

That is a very different architecture from asking the agent to remember everything.

Scarab’s position is that long-horizon AI-assisted software change should not depend on an agent carrying an ever-growing memory of the repo.

The system should be able to surface:

  • active runtime truth
  • ownership boundaries
  • validator coverage
  • source inspection evidence
  • mutation boundaries
  • repair lanes
  • task-scoped context

Then the agent can operate inside that surfaced boundary.

The context can be used, verified, and released.

The repo remains the source of governed truth.

Not the chat thread.

Not the model memory.

Not a pile of historical prompt residue.

Why This Matters

The industry is spending a lot of money trying to make large context windows more usable.

That work has value.

But there is another path.

Instead of asking:

How do we make the agent carry more context?

Scarab is asking:

How do we make the repo surface the right context?

Those are very different questions.

The first question assumes the agent needs to ingest more.

The second assumes the system should become more mechanically legible before the agent acts.

In this case, the difference was stark:

  • Cold baseline: 72,194 total tokens
  • SDS final decision context: 2,651 tokens
  • Same final patch
  • Verified public PR

That does not mean every repair will show the same ratio.

It does not mean SDS eliminates all context cost.

It does not mean a diagnostic suite has no operating cost of its own.

This comparison is specifically about repair-time agent context efficiency, not the total cost of building or operating the surrounding SDS workflow.

But it does show something important:

A well-scoped diagnostic governance layer can change the shape of agentic repair.

Not by making the model magical.

By making the repair boundary clearer.

The Takeaway

The first Scarab Field Lab comparison produced a simple result:

Same patch.

Far less final decision context.

Better evidence discipline.

Cleaner repair boundary.

Public verification.

That is the meaningful signal.

SDS did not prove that Codex cannot patch without help.

It proved something more commercially relevant:

AI coding agents may not need massive repo payloads for every repair if the repo can mechanically surface the truth boundary they need.

That is the layer Scarab Systems is building toward.

Repo-truth infrastructure for AI-assisted software change.

Not more context for the sake of more context.

The right truth, surfaced at the right boundary, for the right mutation.

My Codex assisted in writing this article and formatting properly. However, the words and theory are mine. lol

Top comments (0)