Scarab Systems

Posted on Jul 4

Scarab Field Lab: 27x Smaller Final Decision Context, Merged Upstream

#ai #programming #devops #discuss

The Patch Was Small. The Proof Was Not.

Scarab Systems has been running public Field Lab comparisons to test a simple question:

Can AI-assisted software repair be made narrower, cheaper, more evidence-bound, and still accepted upstream?

Not in a demo repo.

Not in a toy app.

Not in a private benchmark.

In real open-source infrastructure-adjacent codebases, against public issues, with public PRs, public CI, and maintainers who do not care about your theory if the patch does not belong.

This field report now has a new result.

The SDS-supported patch for emersion/xdg-desktop-portal-wlr#379 has been merged upstream.

PR: emersion/xdg-desktop-portal-wlr#393

The patch was approved and merged into emersion:master.

That matters because this is no longer only a comparison between two AI repair workflows.

It is now an accepted upstream repair.

The issue

The issue concerned SelectSources behavior in xdg-desktop-portal-wlr.

When a D-Bus caller omitted the optional types field, the code initialized type_mask to 0.

That caused target selection to fail with:

wlroots: No supported targets specified

The expected behavior was to default omitted types to monitor capture.

The final patch was intentionally small:

uint32_t type_mask = 0;
uint32_t type_mask = MONITOR;

Small patch.

Real repo.

Maintainer approval.

Merged upstream.

The comparison

Scarab compared two workflows.

The first was a cold Codex baseline.

The cold baseline used only:

the public issue snapshot,
the target repository snapshot,
a neutral baseline prompt.

It did not use SDS findings, private context, runtime artifacts, standing profiles, Field Lab notes, or local repair history.

The second was an SDS-guided repair workflow.

That workflow used selected diagnostic evidence, target-source inspection, and verification outputs to narrow the repair boundary before final agent decision-making.

The important point:

Both workflows reached the same patch.

So the claim is not that SDS made Codex smarter.

Codex could find the repair cold when given enough repo context.

The difference was what it cost to get there.

The token result

Cold Codex baseline:

Input tokens: 70,212
Output tokens: 1,982
Total tokens: 72,194

SDS-guided repair:

Final patch-decision context: 2,651 tokens

Same repair.

But the SDS-guided final decision context was roughly 27x smaller than the cold baseline total.

Compared with the cold baseline input count, the SDS final decision context used about 96% fewer tokens.

That is the cost signal.

A lot of AI coding infrastructure is being built around the assumption that the agent needs a massive repo payload.

More context.

More memory.

More orchestration.

More retrieval.

More harnessing.

More machinery around helping the agent carry the repo.

Scarab is testing the opposite direction.

Maybe the agent does not need to carry the whole repo in its mouth.

Maybe the repo needs to surface the right truth boundary before the agent acts.

Why the merge matters

The patch being merged changes the quality of the evidence.

Before merge, the field test showed:

same patch
less context
public PR
passing checks

After merge, it shows:

same patch
less context
maintainer approval
accepted upstream repair

That is a different proof surface.

The patch did not just satisfy an internal Scarab comparison.

It passed through a real upstream review path.

The maintainer accepted it.

CI passed.

The repair landed.

That matters because AI-assisted coding cannot be judged only by whether it can produce a diff.

A diff is cheap.

The harder questions are:

Did the patch stay narrow?
Did it touch the right boundary?
Did it preserve existing behavior?
Was the evidence sufficient?
Could it pass review?
Could it land upstream?

In this case, yes.

The real finding

This field test is not about a one-line patch being impressive.

The line was not the expensive part.

The expensive part was finding the safe repair boundary, preserving explicit types behavior, verifying the change, and producing a patch maintainers could accept.

That is the hidden cost in software repair.

A patch can be visually tiny and still have a non-trivial confidence burden.

The cold agent paid that burden with a broad context path.

SDS reduced the final decision context by surfacing the relevant repair boundary before the agent acted.

That is the Scarab thesis in practice:

The agent should not own the repo’s truth.

The repo governance layer should.

For each repair, the system should surface the relevant operational truth:

what behavior is failing
what source boundary owns it
what existing behavior must be preserved
what evidence supports the patch
what validators are sufficient
what mutation boundary should not be crossed

Then the agent can act inside that boundary.

The context can be used, verified, and released.

No giant historical prompt residue.

No agent pretending to remember the repo forever.

No massive payload when a narrow truth surface is enough.

Why this matters commercially

Companies are about to spend a lot of money on AI coding infrastructure.

Much of that spend will go toward managing huge context payloads.

But if a diagnostic governance layer can repeatedly reduce repair-time context while preserving patch quality, that changes the cost architecture.

This field test showed:

Cold baseline:
72,194 total tokens
SDS-guided final decision context:
2,651 tokens
Outcome:
same patch, merged upstream

That is not just a token optimization.

It is evidence that AI-assisted repair can be made more disciplined.

More context-efficient.

More reviewable.

More maintainable.

More acceptable to real projects.

Takeaway

xdg-desktop-portal-wlr#393 is now merged upstream.

Same patch.

27x smaller final decision context.

Maintainer approval.

Checks passed.

Accepted repair.

The patch was small.

The proof was not.

Scarab Systems is building repo-truth infrastructure for AI-assisted software change.

Not more context for the sake of more context.

The right truth, surfaced at the right boundary, for the right mutation.

DEV Community

Scarab Field Lab: 27x Smaller Final Decision Context, Merged Upstream

Top comments (0)