Scarab Systems

Posted on Jun 26

From Narrow Patches to Long-Horizon AI Implementation

#ai #devops #programming #discuss

I’m starting to see the larger thread in the Scarab field work.

At first, the question was narrow:

Can repo truth help an AI coding agent make a precise patch without drifting across the codebase?

The public field tests have been suggesting that it can.

Given a baseline of repo truth, rules, ownership, boundaries, and validators, the coding agent does not need to “understand the whole repo” in some vague, giant-context way.

It needs the right implementation surface.

It needs to know what is true, what owns the area, what boundary applies, and what proves the change is safe.

That is how we have been getting narrow patches into complex repositories.
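To make "implementation surface" concrete, here is a minimal sketch of what such a record and a boundary check could look like. Everything in it is an assumption for illustration: `ImplementationSurface`, its fields, `out_of_bounds`, the paths, and the validator commands are hypothetical and do not describe Scarab's actual data model.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class ImplementationSurface:
    """Hypothetical record of what an agent needs for one bounded change."""
    facts: tuple[str, ...]          # what is true about this area of the repo
    owner: str                      # what owns the area
    allowed_paths: tuple[str, ...]  # the boundary, expressed as glob patterns
    validators: tuple[str, ...]     # commands that prove the change is safe


def out_of_bounds(surface, touched_files):
    """Return the touched files that fall outside the surface's boundary."""
    return [
        path for path in touched_files
        if not any(fnmatch(path, pattern) for pattern in surface.allowed_paths)
    ]


# Example values are made up for illustration.
surface = ImplementationSurface(
    facts=("policy panel reads from the policy read model",),
    owner="observer/policy",
    allowed_paths=("observer/policy/*", "tests/policy/*"),
    validators=("pnpm test", "pnpm exec playwright test"),
)

drift = out_of_bounds(
    surface,
    ["observer/policy/panel.tsx", "observer/runs/table.tsx"],
)
print(drift)  # ['observer/runs/table.tsx']
```

A patch that touches anything in `drift` has left the surface it was given, which is the kind of signal that keeps a narrow patch narrow.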

From patching to quieting

The next question was different.

Instead of targeting one bug, could Scarab work through a noisy repo surface and bring it down to quiet?

That became the stepwise quieting experiment.

In that mode, Scarab does not try to fix everything in one giant patch.

It surfaces the next hotspot, identifies the boundary around it, repairs only what the evidence supports, reruns the diagnostic, and steps down again.

The loop looks like this:

hotspot
boundary
bounded repair
rerun
step down
repeat until quiet
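As a rough sketch, that loop could be written like this. It assumes a diagnostic that returns a noise count per surface and a repair function that handles exactly one hotspot, with the boundary step folded into that repair. The names `quiet_stepwise`, `diagnose`, and `repair`, and the toy repo state, are hypothetical and are not Scarab's real interface.

```python
from typing import Callable, Dict, List

Diagnostic = Callable[[], Dict[str, int]]  # surface name -> noise count
Repair = Callable[[str], None]             # bounded repair for one hotspot


def quiet_stepwise(diagnose: Diagnostic, repair: Repair, max_steps: int = 20) -> List[str]:
    """Repair one hotspot per step, rerunning the diagnostic each time."""
    repaired: List[str] = []
    for _ in range(max_steps):
        # Rerun the diagnostic and keep only surfaces that are still noisy.
        noise = {name: count for name, count in diagnose().items() if count > 0}
        if not noise:
            break  # quiet: nothing left to step down from
        hotspot = max(noise, key=noise.get)  # next hotspot: the noisiest surface
        repair(hotspot)                      # bounded repair, one surface only
        repaired.append(hotspot)
    return repaired


# Toy repo state standing in for a real diagnostic.
state = {"auth": 3, "billing": 7, "search": 0}


def diagnose() -> Dict[str, int]:
    return dict(state)


def repair(name: str) -> None:
    state[name] = 0


order = quiet_stepwise(diagnose, repair)
print(order)  # ['billing', 'auth']
```

The `max_steps` cap is a design choice in this sketch: if a repair does not actually reduce the noise, the loop stops instead of circling.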

That experiment mattered because messy repo work often turns into weeks of circling.

You fix one thing.

Another surface gets noisy.

The issue moves.

The boundary was actually somewhere else.

The patch gets wider.

The repo gets harder to reason about.

Stepwise quieting tested whether a repo could be brought from noisy to quiet through a controlled diagnostic loop instead of guesswork, broad patching, or agent freewheeling.

The third question: long-horizon implementation

Now I’m testing a third question:

Can repo truth support long-horizon AI implementation?

Not one patch.

Not one repair.

Not one quick feature.

A long implementation campaign across a real, complex stack.

The current test case is Scarab’s own Observer layer: the internal operator console I’m building so I can see Scarab’s diagnostics, telemetry, workspaces, evidence, runtime state, gates, patch surfaces, and implementation visibility while the system is running.

This is not a toy app.

The Observer stack includes:

Next.js
React
TypeScript
shadcn UI
shadcn/Tailwind v4
Tailwind CSS
Radix UI
TanStack Query
TanStack Table
Zustand
React Flow
Monaco Editor
@monaco-editor/react
Apache ECharts
Playwright
pnpm
Node.js
Python
RabbitMQ
Celery
JSON Schema
Lucide React
CMDK
Class Variance Authority
clsx
tailwind-merge
Docker
Docker Compose

Codex is writing the code, but that is not the important thing.

The important thing is what keeps Codex aligned while it writes.

Scarab is not telling Codex what code to write

In this mode, Scarab provides repo-specific implementation guidance along the way.

Not exact code instructions.

Implementation guidance.

Questions like:

What owns this surface?
What boundaries apply?
What contracts already exist?
What validators must pass?
What gaps are real?
What should not be invented?
What is the next lawful implementation step toward the target architecture?

That guidance appears to be the difference.

Instead of giving Codex a giant prompt and hoping it remembers the system, Scarab keeps resurfacing the repo’s current truth as the implementation proceeds.

That matters because long-horizon agent work is where the usual problems begin.

The agent can generate code.

The agent can pass a local test.

The agent can make a screen look plausible.

But can it keep working for hours without losing the thread?

Can it preserve contracts?

Can it avoid fake data?

Can it distinguish visible progress from validated completion?

Can it leave rollback checkpoints?

Can it refuse to call the work finished before final gates pass?

That is what this Observer build is testing.

Pass 1: foundation

Pass 1 took about 8 hours.

That pass did not try to jump straight to the gold-standard mockups.

It took the Observer foundation I already had and absorbed the target direction into what was actually wired: routes, contracts, read models, tests, checkpoint summaries, and working panel surfaces.

That was important.

The goal was not to let an AI agent paint over the repo with a beautiful fake dashboard.

The goal was to preserve the under-the-hood truth while turning the Observer into a real operator console.

Pass 2: gold pass

Pass 2 is now underway.

This is the gold pass.

It is moving panel by panel toward the richer Observer console: dark dense operator shell, route workbenches, visual proof screenshots, backend/read-model contracts, Playwright coverage, and validator-gated checkpoints.

Current state from the run:

Branch ahead by 42 commits
Stages 1 through 8 checkpointed
Policy: 4 of 6 slices committed
Remaining work: Policy finish, Observability/Telemetry, Search correlation, final gold review, runtime verification, and screenshot proof

The run estimated itself at roughly 75–80% complete by checkpoint count, but closer to 65–70% complete by remaining integration risk.

That distinction is one of the most interesting parts.

It is not saying:

lots of code was written, therefore done

It is tracking remaining risk.
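The gap between the two estimates is easy to reproduce with a toy calculation. The sketch below uses made-up work items and risk weights purely for illustration. They are not the run's actual stages or numbers, and `completion_by_count` and `completion_by_risk` are hypothetical helper names.

```python
def completion_by_count(items):
    """Fraction of work items that are checkpointed."""
    done = sum(1 for item in items if item["done"])
    return done / len(items)


def completion_by_risk(items):
    """Fraction of total risk weight that has been retired."""
    total = sum(item["risk"] for item in items)
    retired = sum(item["risk"] for item in items if item["done"])
    return retired / total


# Hypothetical items: the unfinished one carries most of the integration risk.
items = [
    {"name": "stage A", "done": True, "risk": 1},
    {"name": "stage B", "done": True, "risk": 1},
    {"name": "stage C", "done": True, "risk": 2},
    {"name": "final integration", "done": False, "risk": 4},
]

print(completion_by_count(items))  # 0.75
print(completion_by_risk(items))   # 0.5
```

When the remaining items are the risky ones, the risk-weighted figure comes out lower than the checkpoint count, which matches the direction of the two estimates above.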

It knows which stages are checkpointed.

It knows what is still uncommitted.

It knows final screenshots, runtime checks, Playwright coverage, mockup comparison, and unresolved polish issues still have to pass before the work can be called complete.

That is the behavior I wanted to see.
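A minimal sketch of that gating behavior, assuming each final gate reduces to a pass/fail flag. The `Gate` type, `completion_status`, and the pass/fail values below are hypothetical. Only the gate names come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    name: str
    passed: bool


def completion_status(gates):
    """Report 'complete' only when every final gate has passed."""
    failing = [gate.name for gate in gates if not gate.passed]
    if failing:
        return "in progress", failing
    return "complete", []


# Pass/fail values here are invented for the example.
gates = [
    Gate("final screenshots", False),
    Gate("runtime checks", True),
    Gate("Playwright coverage", True),
    Gate("mockup comparison", False),
    Gate("polish issues resolved", True),
]

status, failing = completion_status(gates)
print(status, failing)  # in progress ['final screenshots', 'mockup comparison']
```

The amount of code written never enters the calculation. Only the gates decide whether the work can be called complete.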

What the screenshots show

The screenshots from Pass 2 are work-in-progress proof.

They are not the final gold state.

But they show the Observer becoming real: overview, workflow, evidence, queues, runs, workspaces, policy, and read-only operator surfaces starting to take shape inside a coherent dark console.

The point is not that the UI is finished.

The point is that the implementation campaign is still moving through checkpoints instead of collapsing into a pile of generated code.

The thread

For me, this ties the Scarab work together.

Narrow patching tests whether repo truth can guide bounded repairs.

Stepwise quieting tests whether repo truth can reduce a noisy repo surface to quiet.

The Observer build tests whether repo truth can sustain long-horizon AI implementation.

That is the part I care about.

Not “AI wrote a dashboard.”

Not “Codex generated a lot of code.”

The question is whether an AI coding agent can keep implementing inside a real repo without drifting, if the repo continuously surfaces what is true, what owns what, what boundaries apply, and what proves the next step is safe.

That is the experiment.

And so far, it is real work.

Top comments (4)

xulingfeng • Jun 26

The 'risk remaining' part is what most AI demos conveniently skip — everyone shows what worked, nobody shows what's still uncertain. Observer's looking real. Also 42 commits in one pass is insane 😂 Keep going.