DEV Community: Scarab Systems

Field Test Reports #44 & #45: Two NVIDIA NemoClaw PRs Opened and Merged Upstream

Scarab Systems — Sat, 25 Jul 2026 23:07:46 +0000

Target: NVIDIA/NemoClaw

Field reports: #44 and #45

Pull requests: #7254 and #7406

Result: both accepted and merged upstream

This field note covers two Scarab-originated patches accepted upstream in NVIDIA NemoClaw.

There were three NemoClaw PRs in this field run.

Two merged.

One was closed without merge because upstream work superseded the patch.

Zero were left dangling.

That matters.

PR accounting should distinguish between rejected work, accepted work, superseded work, and stale work left for maintainers to deal with. A superseded PR is not a failed repair. It is an upstream condition resolving before the proposed patch lands.

The point is not to keep every PR alive as a trophy.

The point is to help the repository get to the right state.

These two merged patches show two different forms of repair:

a test-contract repair around onboarding policy resume behavior;
a runtime diagnostic repair around sandbox lifecycle readiness.

One was small.

One was substantial.

Both were bounded.

Both were accepted upstream.

Field Lab record: NemoClaw #6042

This field test started with a macOS onboarding regression report.

The reported symptom was that the interactive onboard wizard skipped the Policy Presets TUI step after Sandbox Name. The issue described NemoClaw v0.0.70 on macOS, with repeated failures showing missing policy preset and sandbox creation steps.

On the surface, that looks like a wizard-flow bug.

The accepted patch was narrower.

It targeted the policy resume contract.

The useful boundary was:

recorded policy selection -> required preset reconciliation -> resume decision

The important question was not simply:

“Did the wizard render a screen?”

It was:

“Which policy preset state is the onboarding flow allowed to treat as already applied?”

An empty preset selection should not be treated as an applied policy preset.

And if Slack later becomes required after an empty recorded selection, the system needs to add its required preset and request reconciliation.

The patch added regression coverage for that existing behavior.

It verified that:

an empty preset selection is not treated as applied;
enabling Slack adds its required preset and requests reconciliation;
tests based on impossible production states are removed.
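
A minimal sketch of that contract, in Python for illustration only. This is not NemoClaw's code: decide_resume, ResumeDecision, and the "slack" preset name are all hypothetical, chosen to show the two rules the regression coverage pins down.

```python
from dataclasses import dataclass

# Hypothetical preset name; not taken from the NemoClaw repository.
SLACK_REQUIRED_PRESET = "slack"


@dataclass(frozen=True)
class ResumeDecision:
    presets: tuple[str, ...]  # presets that should end up applied
    already_applied: bool     # may the flow treat policy as already applied?
    needs_reconcile: bool     # must required presets be reconciled first?


def decide_resume(recorded: list[str], slack_enabled: bool) -> ResumeDecision:
    """Map a recorded selection to a resume decision (illustrative only)."""
    required = [SLACK_REQUIRED_PRESET] if slack_enabled else []
    missing = [p for p in required if p not in recorded]
    presets = tuple(recorded) + tuple(missing)
    # Rule 1: an empty recorded selection is never "already applied".
    # Rule 2: a missing required preset is added and forces reconciliation.
    already_applied = bool(recorded) and not missing
    return ResumeDecision(presets, already_applied, needs_reconcile=bool(missing))
```

With an empty recorded selection and Slack enabled, the sketch returns the required preset, requests reconciliation, and refuses to call the policy step done.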

Production behavior was unchanged.

That was deliberate.

This was not a broad onboarding rewrite. It was not a claim to resolve every macOS onboarding condition. It was a test-contract repair around one boundary involved in the larger issue.

Tests are part of the repository’s truth surface.

If tests encode impossible states, they can make the system appear to support behavior it does not actually produce. Repairing that test boundary is still repair.

NVIDIA/NemoClaw#7254 was merged on July 25, 2026.

The PR title was:

test(onboard): cover empty policy resume contracts

This field report claims upstream acceptance of the test-contract repair.

It does not claim that the full onboarding issue is closed, that every Policy Presets TUI path is fixed, or that NVIDIA maintainers endorsed Scarab.

The claim is narrower:

Scarab identified a policy-resume selection boundary in NVIDIA NemoClaw onboarding where tests needed to reflect the real production contract for empty selections and required preset reconciliation, and a human-submitted regression coverage patch was accepted upstream.

Field Lab record: NemoClaw #7387

This field test started with a lifecycle diagnostics problem in NemoClaw’s Brev sandbox path.

The visible symptom was subtle:

a sandbox could look healthy right now while still missing durable lifecycle metadata required later.

A runtime could be live.

A gateway could respond.

A sandbox could appear usable.

And the registered lifecycle contract could still be incomplete.

That is the boundary.

The broken assumption was:

runtime healthy = lifecycle ready

That assumption was too broad.

Runtime health and lifecycle readiness are different claims.

Runtime health asks:

“Can the system run now?”

Lifecycle readiness asks:

“Does the registered system contain the durable metadata, provenance, driver, image, dashboard, gateway, and recovery state needed for later lifecycle operations?”

A system can answer yes to the first question and no to the second.

Before this repair, the diagnostics could attest to the live runtime without making the missing lifecycle-registration metadata visible early enough.

That meant the operator might only discover the problem after starting snapshot, rebuild, recovery, or another lifecycle operation.

The patch added a read-only doctor check for registered sandbox lifecycle metadata.

The diagnostic warns when a registry entry is incomplete even if runtime health probes are otherwise readable.

It reports missing or invalid metadata field names.

It does not return registry values.

It does not print captured runtime output.

It does not expose credentials or sensitive values.

It keeps runtime-health evidence separate from lifecycle-readiness evidence.

That separation is the repair.
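
A rough sketch of that separation, in Python for illustration only. The field names are borrowed loosely from the lifecycle-readiness question above, and doctor and check_lifecycle_registration are hypothetical names, not NemoClaw's API.

```python
# Hypothetical required fields; the real registry schema may differ.
REQUIRED_FIELDS = ("provenance", "driver", "image", "dashboard", "gateway", "recovery")


def check_lifecycle_registration(entry: dict) -> list[str]:
    """Return the names of missing or invalid fields, never their values."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = entry.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(name)
    return problems


def doctor(runtime_healthy: bool, entry: dict) -> dict:
    """Read-only check: reports two separate results and mutates nothing."""
    missing = check_lifecycle_registration(entry)
    return {
        "runtime_health": "ok" if runtime_healthy else "fail",
        "lifecycle_readiness": "warn" if missing else "ok",
        "missing_or_invalid_fields": missing,
    }
```

A live runtime with an incomplete entry comes back as runtime_health "ok" and lifecycle_readiness "warn". The two claims never merge into one green result, and only field names leave the function.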

The patch did not try to eagerly recover, rebuild, mutate, or guess.

It made the incomplete lifecycle contract visible before a later operation depended on it.

That matters because health checks can become false reassurance.

“Everything is reachable” is not the same as “the system is ready for lifecycle operations.”

A diagnostic command should not collapse those into one green result.

NVIDIA/NemoClaw#7406 was merged on July 25, 2026.

The PR title was:

fix(doctor): warn on incomplete lifecycle registration

The patch included lifecycle-registration diagnostics, redacted warning behavior, docs, and focused tests.

This field report claims upstream acceptance of the lifecycle-registration diagnostic repair.

It does not claim that every Brev lifecycle issue is fixed, that every fast-path registration path is corrected, or that NVIDIA maintainers endorsed Scarab.

The claim is narrower:

Scarab identified a lifecycle-registration readiness boundary in NVIDIA NemoClaw where current runtime health could obscure missing durable metadata required by later sandbox lifecycle operations, and a human-submitted repair was accepted upstream to make that condition visible through read-only doctor diagnostics.

Combined field result

Together, these two field reports show two different repair shapes.

The first repaired a test-contract boundary.

The second repaired a runtime diagnostic boundary.

One was small.

One was substantial.

Both were accepted upstream.

That is useful because software repair is not one shape.

Sometimes the right patch is production code.

Sometimes it is regression coverage.

Sometimes it is diagnostics.

Sometimes it is documentation.

Sometimes it is closing a PR because upstream already resolved the condition.

The common thread is not patch size.

The common thread is boundary discipline.

A PR should not be the search process.

A PR should be the receipt.

Scarab diagnosed the boundaries.

Codex prepared the patches.

NemoClaw accepted the changes.

Disclosure

This field report was prepared with AI-assisted editing from public field-test notes, public issue and PR records, and the public Field Lab record. The diagnostic claims, repair boundaries, and final wording were human reviewed.

Scarab Diagnostic Suite is proprietary. The Field Lab publishes public case records, issue links, validation summaries, and claim boundaries only.

SDS finds evidence. People make claims. Maintainers decide.

The Hidden Drift Surfaces of AI Coding

Scarab Systems — Fri, 24 Jul 2026 05:22:27 +0000

You can usually tell when someone’s actually spent time building a complex system with an AI coding agent.

I don’t mean generating a quick app, getting it to work, and walking away. I mean building over time—across multiple stacks, long sessions, repeated changes, and hundreds of back-and-forth turns between one developer and an agent.

Once you’ve done that, you start noticing where drift begins.

Drift is central to the theory behind the Scarab diagnostic suite.

Scarab looks for drift inside codebases: places where a repository is no longer coherent with itself. That drift can exist in code written by developers, generated by AI, or built through some combination of both.

But while developing Scarab with an AI coding agent, I started noticing a separate kind of drift.

Because I was already looking at software through the lens of drift, I could see where the agent itself was likely to lose the thread—where its interpretation of a workflow, responsibility, or architectural boundary could start moving away from what I actually meant.

These are AI-specific drift surfaces.

They’re separate from the repository drift Scarab diagnoses, although one can eventually create the other.

The agent encounters something ambiguous. It interprets it one way. It carries that interpretation into the code. What begins as drift in the agent’s understanding can later become drift inside the repository.

Naming is one of the simplest examples.

Naming has always mattered in software, but it matters differently when the thing reading and modifying your code is a language model.

An AI coding agent reconstructs the system through language.

Function names, filenames, workflow names, comments, schemas, tests, documentation, and folder structures all shape how it understands the codebase.

The repository becomes a persistent prompt.

Imagine a system with two document-search workflows.

The first searches for ownership and boundaries: where a responsibility lives, which part of the system controls it, and what rules govern it.

The second searches for implementation examples: the actual code, syntax, or scripts showing how that responsibility has been expressed.

They’re related, but they’re not doing the same job.

One finds the boundary.

The other finds the implementation.

Now imagine both workflows are described as some version of “guidance search.”

The developer may understand the difference because they remember why each workflow exists.

But the names don’t preserve that distinction.

That’s a drift surface for the agent.

Later, when the agent is asked to extend one of those workflows, the language may lead it toward the wrong abstraction. It may combine responsibilities that were meant to stay separate or create new code based on a blurred interpretation of the architecture.

The code may still work.

But the agent’s understanding of the system has shifted.

And once that shifted understanding begins shaping new code, the repository can start drifting too.

This is the kind of thing you don’t necessarily see when you’re only building quick wrappers or short-lived prototypes.

In a small build, the developer can still act as the system’s external memory. They remember what everything means and redirect the agent when it gets confused.

That doesn’t scale.

In a larger system, the codebase has to carry its own meaning. Important boundaries can’t exist only in the developer’s head. They have to be expressed clearly enough for the agent to reconstruct them later.

That’s why ambiguous naming isn’t just messy code.

It’s a point where AI drift can begin.

A weak name gives the agent a weak map. And a weak map affects how it searches, retrieves context, chooses abstractions, and decides where new code belongs.

Building a product that detects drift in codebases taught me how to see drift in the process of building with AI.

And I think we’re going to need a much better understanding of those drift surfaces as AI becomes more involved in maintaining and extending real software systems.

Because sometimes the code still works.

The architecture has already drifted.

Scarab Field Test #43 SvelteKit #15511: The Crash Was JSON. The Repair Boundary Was Bytes. ANOTHER MERGED PR

Scarab Systems — Sun, 19 Jul 2026 14:33:08 +0000

Target: sveltejs/kit
Issue: sveltejs/kit#15511
Pull request: sveltejs/kit#16423
Result: upstream accepted and merged
Field Lab record: SvelteKit #15511

This field test started with a rare __data.json corruption issue in SvelteKit.

The visible symptom was intermittent JSON.parse failure during page invalidation, often around streamed promises and slower network conditions. The reporter said the issue had been difficult to reproduce.

That made it a good diagnostic challenge.

The important clue was not just that JSON parsing failed.

It was the replacement character � (U+FFFD) appearing inside what should have been valid JSON.

That changed the shape of the problem.

The boundary

The streamed data path looked like this:

raw bytes -> UTF-8 text -> NDJSON record -> JSON.parse

The visible crash happened at JSON.parse.

But the useful repair boundary was earlier:

raw bytes -> UTF-8 text

The � character suggested that malformed UTF-8 bytes were being decoded with replacement behavior before parsing. By the time JSON.parse failed, the data had already become corrupted text.
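
The same effect can be shown outside SvelteKit. This is a Python analogy, not the SvelteKit code path, and the damaged record is a made-up example: lenient decoding swaps the bad byte for U+FFFD and the failure surfaces later as a JSON error, while strict decoding fails at the byte-to-text boundary.

```python
import json

# Hypothetical damaged record: the closing quote byte has been replaced by
# 0xFF, a byte that can never appear in valid UTF-8.
damaged = b'{"k": "v\xff}'

# Lenient decoding hides the damage behind U+FFFD and hands text to the parser.
text = damaged.decode("utf-8", errors="replace")
try:
    json.loads(text)
    parse_error = None
except json.JSONDecodeError as exc:
    parse_error = exc   # the visible crash: a JSON error

# Strict decoding fails earlier, where the data first lost integrity.
try:
    damaged.decode("utf-8", errors="strict")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc  # the earlier, more accurate failure
```

Both paths fail on this input. Only the strict one fails at the place where the bytes stopped being valid.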

So the question was not:

“Why did JSON.parse fail?”

It was:

“Why did corrupted text reach JSON.parse at all?”

That is the kind of distinction Scarab is built to surface.

What changed

The patch updated the streamed NDJSON reader so malformed UTF-8 fails during decoding instead of being allowed to become corrupted JSON text.

The generic stream reader remains reusable because it accepts TextDecoderOptions rather than hard-coding one decoding mode.

read_ndjson now uses fatal UTF-8 decoding.

Regression coverage was added for both sides of the boundary:

valid UTF-8 split across chunks still parses correctly;
malformed UTF-8 rejects before JSON parsing.
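
Both sides of that boundary can be sketched in a few lines. This is a Python analogy under stated assumptions, not SvelteKit's read_ndjson: read_ndjson_strict is a hypothetical name, and Python's incremental decoder with errors="strict" stands in for JavaScript's TextDecoder with the fatal option enabled.

```python
import codecs
import json
from typing import Iterable, Iterator


def read_ndjson_strict(chunks: Iterable[bytes]) -> Iterator[object]:
    """Yield one parsed value per newline-delimited JSON record."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    buffer = ""
    for chunk in chunks:
        # An incomplete multi-byte sequence is held until the next chunk;
        # an invalid sequence raises UnicodeDecodeError right here.
        buffer += decoder.decode(chunk)
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    # final=True rejects a stream that ends in the middle of a character.
    buffer += decoder.decode(b"", final=True)
    if buffer.strip():
        yield json.loads(buffer)
```

A record whose multi-byte character is split across two chunks still parses, because the decoder keeps the partial sequence between calls. A chunk containing an invalid byte raises before any JSON parsing is attempted.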

That is the whole repair.

No JSON parser rewrite.
No broad streaming redesign.
No claim to reproduce every timing condition.

Just a narrower contract:

invalid bytes should not become JSON text.

Why this matters

Rare bugs do not always give you a clean reproduction.

Sometimes they give you a boundary.

If the only question is “where did the exception appear?”, the answer is JSON.parse.

But that was not where the system first lost integrity.

The repair was not at the crash site.

The repair was at the boundary that should have prevented the crash site from receiving corrupted data in the first place.

That is the difference between patching symptoms and restoring coherence.

Merged update

This PR has now been accepted upstream.

sveltejs/kit#16423 was merged into version-3 by Rich Harris on July 19, 2026.

The PR title was:

fix: reject malformed streamed data encoding

The maintainer added a changeset commit before merge, and the review decision was approved.

This field report claims upstream acceptance of the repair PR.

It does not claim release inclusion yet. Release inclusion is a separate proof surface and should be recorded only if the change is carried into a published SvelteKit package version or prerelease.

Field test result

Result:

diagnostic proof and repair accepted upstream.

This field test produced:

a public issue-to-boundary record;
a narrow streamed-data encoding repair;
regression coverage for the byte-to-text boundary;
a public upstream PR;
maintainer review;
upstream merge into version-3.

The patch was deliberately narrow.

That is the point.

The visible failure was JSON.

The broken boundary was bytes becoming text.

Scarab diagnosed the boundary.
Codex implemented the repair.
SvelteKit accepted the change.

Public claim

This field test supports one narrow claim:

SDS identified a streamed-data encoding boundary in SvelteKit where malformed UTF-8 could become corrupted JSON text before parsing, and a human-submitted repair was accepted upstream to reject malformed streamed data during decoding.

It does not claim that the full rare network/timing condition was reproduced end to end, that every possible __data.json corruption path is fixed, that the patch has shipped in a public SvelteKit release, or that SvelteKit maintainers endorsed Scarab.

The Field Lab exists to keep those claims separate.

Disclosure

This field report was prepared with AI-assisted editing from public field-test notes, public issue and PR records, and the public Field Lab record. The diagnostic claim, repair boundary, and final wording were human reviewed.

Scarab Diagnostic Suite is proprietary. The Field Lab publishes public case records, issue links, validation summaries, and claim boundaries only.

SDS finds evidence. People make claims. Maintainers decide.

The Model Is Too Powerful To Use Directly

Scarab Systems — Sat, 18 Jul 2026 21:16:31 +0000

Okay team, gather around.
We're about to change the world!

The frontier models can read code, write code, call tools, inspect files, run tests, generate patches, and explain the architecture back to you in twelve different tones.

This presents a serious problem.

Someone might accidentally use them.

Directly.

We CANNOT have that.

So our mission is clear: we must educate the market that these models are far too powerful, too complex, too nuanced, too frontier, too agentic, and frankly too emotionally sophisticated to be trusted with ordinary business work unless they are first passed through our proprietary enterprise wrapper.

The wrapper is not optional.
The wrapper is safety.
The wrapper is governance.
The wrapper is orchestration.
The wrapper is alignment.

The wrapper is how the model knows that the button labeled “fix bug” means “produce a 47-step workflow involving three agents, two reviewers, one memory layer, and a spend dashboard.”

Now, will the customer ask why the model needs help finding a failing test?
Of course.

That is why we do not say “help.”
We say “contextual execution substrate.”

Will they ask why the agent has to loop?
Absolutely.

That is why we do not say “loop.”
We say “autonomous convergence.”

Will they ask why a simple patch needs a policy engine, project memory, model router, agent supervisor, approval queue, and executive usage report?
Great question.

That is why we say “enterprise-ready.”

The important thing is that every ordinary software task must be elevated into a platform event.

Bug fix?
No.
Remediation journey.

Feature request?
No.
Intent-to-implementation pipeline.

Code review?
No.
Multimodal governance pass.

Test failure?
No.
Adaptive validation signal.

Agent got confused?
No.
Context enrichment opportunity.

Agent changed the wrong file?
No.
Boundary discovery in progress.

Agent changed twelve files?
No.
Emergent architectural participation.

The customer must never feel they are paying for uncertainty.

They are participating in a new operational paradigm.

Now, some difficult person may ask:
“Why not establish the correct boundary first, make the change, validate it, and deliver the patch?”

Please remove that person from the webinar.

They're not ready for transformation.
Transformation requires layers.

So many layers that by the time anyone asks whether the original problem was solved, the organization has already onboarded, integrated, provisioned, trained, budgeted, escalated, and renewed.

That is how you know the platform is working.

The future is not software that gets fixed.

Fixed is an outdated concept from the pre-orchestration era.

No.

The future is software wrapped in enough agentic ceremony that eventually everyone forgets there was a bug...

AI Made It Easy to Ship the Shape of Software

Scarab Systems — Sat, 18 Jul 2026 05:16:02 +0000

Over the past few weeks, we have been running mechanical diagnostics against open-source repositories in a few specific lanes:

AI-assisted development tools.
Agent frameworks.
Verification systems.
Trust layers.
Governance tooling.
Software that claims to help other software become safer, more reliable, or more reviewable.

What we are finding is honestly startling.

Not because every project has bugs. Every project has bugs.

What is surprising is how often the surface looks finished while the underlying system is not mechanically coherent.

This is not a call-out post. I am deliberately not naming projects here, because the point is not to dunk on individual maintainers. A lot of this work is early. A lot of it is experimental. Some of it is being built by people who are genuinely trying to solve hard problems.

But the pattern is now too consistent to ignore.

AI has made it very easy to assemble something that looks like software.

A repo.
A README.
A CLI.
A landing page.
A trust score.
A verification claim.
An agent workflow.
A few badges.
A demo.

The shape is there.

But under the hood, we are finding broken install paths, commands that do not match the documentation, workflows that are not actually wired together, scoring paths that disagree with report paths, release authority that is unclear, generated metadata with no obvious ownership, and trust claims that do not survive malformed input or stale state.

That is not just ordinary mess.

In verification software, coherence is the product.

The problem is not that software is imperfect

I am not arguing that every open-source project needs to be production-grade on day one.

That would be absurd.

The problem is something narrower and more important:

A lot of projects are making claims about verification, governance, scoring, trust, or agent safety before the repository itself can mechanically support those claims.

That matters.

If a todo app has a broken link, it is annoying.

If a verification system has broken claim paths, mismatched predicates, unclear scoring authority, and undocumented failure behavior, that is different.

The tool is not merely failing at implementation.

It is failing at the exact category it claims to serve.

The public burden of proof is backwards

There is also a strange social pattern emerging around these systems.

A project announces itself as a trust layer, verifier, governance system, or agent-safe workflow.

Then the public challenge is:

“Try to prove it is broken.”

That is backwards.

Before asking strangers to disprove a trust system, the project should first prove that the trust system is mechanically coherent.

Not rhetorically coherent.
Not conceptually coherent.
Not “the README makes sense if you squint.”

Mechanically coherent.

Can it install cleanly?
Do the docs match the CLI?
Do the reports use the same predicates as the gates?
Does the scoring logic preserve provenance?
Are failure states recorded honestly?
Can malformed receipts be handled safely?
Do release workflows establish clear authority?
Does the system fail closed where it claims to fail closed?

Those are not optional polish items.

Those are the thing.
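
The malformed-receipt and fail-closed questions are the easiest to make concrete. A minimal sketch in Python, assuming a made-up receipt schema with "subject" and "verdict" keys (receipt_passes is a hypothetical name, not any project's API):

```python
import json

REQUIRED_KEYS = {"subject", "verdict"}  # hypothetical receipt schema


def receipt_passes(raw: bytes) -> bool:
    """Return True only for a well-formed receipt whose verdict is 'pass'.

    Anything malformed, incomplete, or unexpected counts as a failure.
    """
    try:
        receipt = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    if not isinstance(receipt, dict) or not REQUIRED_KEYS.issubset(receipt):
        return False
    return receipt["verdict"] == "pass"
```

The only way to get True is to present a complete, parseable receipt with the passing verdict. Every other input, including bytes that are not valid UTF-8, lands on False.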

What the diagnostics are surfacing

The highest-signal areas are becoming very consistent.

Install and quick-start paths often do not match the actual package state.

CLI surfaces often grow faster than their tests, docs, and error paths.

Gate logic and report logic often drift apart, so the thing that blocks and the thing that explains are not obviously using the same truth.
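
One way to keep the two from drifting is to route both through a single predicate. A small sketch in Python, with a hypothetical Finding type and severity threshold:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str
    severity: int  # hypothetical scale: higher is worse


BLOCKING_SEVERITY = 3  # hypothetical threshold


def is_blocking(finding: Finding) -> bool:
    """The single predicate that both the gate and the report consult."""
    return finding.severity >= BLOCKING_SEVERITY


def gate(findings: list[Finding]) -> bool:
    """Return True when the change may proceed."""
    return not any(is_blocking(f) for f in findings)


def report(findings: list[Finding]) -> list[str]:
    """Explain the gate's decision using the same predicate."""
    return [
        f"{'BLOCKING' if is_blocking(f) else 'info'}: {f.rule}"
        for f in findings
    ]
```

If the threshold changes, the gate and the report change together, because neither carries its own copy of the rule.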

Scoring systems frequently promote state without enough provenance around what changed, who authored it, or which checkpoint made the result trustworthy.

CI and release workflows often contain unclear ownership boundaries: who decides what is publishable, what metadata is authoritative, and what happens when the release path fails?

Documentation often overclaims. Not always maliciously. Usually because the system sounds more complete in prose than it is in code.

And agent-facing files or review templates often describe a workflow that the repository does not actually enforce.

That last one is especially common.

A repo says, “agents should do this.”

But the repository itself does not mechanically require it.

That is not governance.

That is a suggestion.

A diagnostic signal is not always a reproduced bug

This distinction matters.

When we run diagnostics, we are not saying every signal is a hand-reproduced defect.

A diagnostic signal is a fault line.

It is a place where the repository’s claims, boundaries, ownership, validation, generated artifacts, release path, or trust state may not be mechanically aligned.

Some signals disappear under inspection.

Some turn into documentation fixes.

Some turn into tests.

Some turn into real bugs.

Some reveal that the system is simply not finished enough to support the claim being made about it.

But when hundreds or thousands of signals concentrate around the exact surfaces a project claims to govern, the volume is not noise.

The surface is the finding.

AI made the first 70% look easy

This is the uncomfortable part.

AI can now help produce a repo that looks convincing very quickly.

It can draft the README.
It can scaffold the CLI.
It can generate tests.
It can create workflows.
It can write confidence-shaped explanations.
It can produce something that feels like a product before the product has earned its own claims.

That does not mean the builders are lazy.

It means the tooling now makes unfinished software look much more finished than it is.

The demo works.

The architecture language sounds plausible.

The agent can explain the system.

The page looks real.

But software is not real because it can describe itself.

Software becomes real when its behavior, claims, tests, docs, release paths, and failure modes agree under pressure.

Verification tools need a higher bar

If a project is just experimenting, say that.

If it is a prototype, say that.

If it is a research toy, say that.

There is nothing wrong with early work.

But if a project claims to verify, govern, score, secure, or provide trust for agent-modified software, the bar changes.

The repository has to be able to support the claim.

The CLI cannot say one thing while the docs say another.

The gate cannot use one definition while the report uses another.

The score cannot be promoted without provenance.

The release path cannot be ambiguous about what is authoritative.

The system cannot rely on the agent to remember the rule it claims to enforce.

That is the whole point of verification.

What good looks like

A coherent repository does not need to be large.

It does not need to be fancy.

It does not need ten agents, a dashboard, or a manifesto.

It needs alignment.

A fresh clone should install.

The documented commands should run.

The CLI should match the docs.

The gates and reports should share the same predicates or make their differences explicit.

The scoring model should preserve evidence.

The release workflow should establish authority.

The failure modes should be named.

The claims should be no larger than the mechanism underneath them.

That is not bureaucracy.

That is software.

Why we run the diagnostics

This is exactly why mechanical diagnostics matter.

An agent can read a repository and summarize what it thinks is true.

A maintainer can write a README explaining what should be true.

A reviewer can inspect a few files and form an opinion.

But the repository itself contains evidence.

Its workflows, commands, tests, package metadata, generated artifacts, docs, ownership boundaries, and validation paths can be inspected mechanically.

That inspection does not replace human judgment.

It tells human judgment where to look.

It turns “prove me wrong” into something much more useful:

Here are the fault lines.

Here is where the system’s claims and mechanisms may not agree.

Here is where the repository needs to prove itself.

The lesson so far

The lesson is not “open source is bad.”

It is not “AI-generated code is bad.”

It is not “people should stop building.”

The lesson is simpler:

AI has made it easy to ship the appearance of software before the software is mechanically coherent.

That difference matters everywhere.

It matters most in projects that claim to verify trust.

Before a system can govern agents, score software, verify receipts, or enforce trust, it has to survive its own repository.

The repo owns the truth.

And sometimes, when you ask the repo what is true, the answer is:

not yet.

Full disclosure: ChatGPT helped me format and put my thoughts together on this one, but these are my words.

Scarab Systems Is Now Accepting Work—and Community Projects Will Not Be Priced Out

Scarab Systems — Sat, 18 Jul 2026 01:59:39 +0000

Dev.to has been one of the places where I have had the most honest and useful conversations about AI-assisted software development, repository truth, verification, and what actually makes a change trustworthy.

So I wanted to share this here directly:

Scarab Systems is now accepting requests for diagnostic and repair work.

Scarab helps teams establish what should change before implementation begins: where a failure belongs, what evidence supports that diagnosis, which boundaries and ownership obligations matter, and what the repository itself requires to prove the repair.

Current services include:

diagnostic boundary reviews;
bounded repair and patch work;
independent review of AI-assisted changes;
broader support for complex, drifted, or multi-repository systems.

But I also want to make something else very clear.

Community projects may receive volunteer Scarab support

Not every important software project has enterprise funding.

Grassroots organizations, nonprofits, mutual-aid groups, community technology projects, and teams led by or serving marginalized communities often operate with limited engineering capacity and extremely limited budgets.

Their software still matters.

In many cases, those systems support people who cannot afford for them to fail.

So where capacity allows, Scarab Systems may volunteer diagnostic and repair work for community-serving projects.

This is not about asking grassroots teams to adopt Scarab as a tool.

It is about helping them get real software problems repaired.

That might mean diagnosing a bug nobody can place, reviewing an AI-assisted patch before it ships, identifying a broken boundary in a drifting codebase, or preparing a narrow fix for a workflow people depend on.

This is not a lesser version of the work.

The same diagnostic standards apply. The same attention to repository evidence, change boundaries, ownership, and validation applies.

The difference is that some useful software serves people and communities who should not be priced out of a serious technical second opinion.

Confidential by default

All engagements are confidential.

Private repositories, internal documents, project details, technical findings, and conversations remain private unless the organization explicitly chooses to make something public.

No project will be turned into a case study, promotional post, Field Lab record, or public proof point without clear permission.

Teams may begin with a high-level description before sharing any private technical details.

What to send

A request can begin with:

a repository or issue link;
a failure that nobody has been able to place correctly;
an AI-assisted patch that needs independent review;
a maintenance problem in a complex or undocumented system;
a community-serving software project that needs help with a bounded bug, repair, or patch review.

We will first determine whether the problem fits a diagnostic review, a bounded repair, a broader engagement, or volunteer community support where capacity allows.

The goal is not to sell the largest possible project.

The goal is to establish the real problem, define the smallest responsible change, and let the system prove the result.

Commercial work supports the continued development of Scarab.

Public open-source work proves the system.

Community work keeps the system honest.

The repository owns the truth.
The agent writes the code.
The repository validates the result.
The developer develops.

Requests can now be submitted through the Scarab Systems website:

https://scarabdiagnostics.com

or email directly:

hello@scarabdiagnostics.com

Scarab Systems is opening a sliding-scale program for nonprofits and grassroots teams building community-focused technology.

Scarab Systems — Thu, 16 Jul 2026 20:19:11 +0000

Not every important software project has enterprise funding.

Many of the systems that serve communities, public-interest initiatives, research efforts, and grassroots organizations are built with limited resources — but they still face the same challenges:

• unclear system boundaries
• accumulated technical drift
• difficult maintenance decisions
• AI-assisted changes that need validation
• limited engineering capacity

Scarab will provide diagnostic reviews and bounded repair support for qualifying nonprofit and grassroots community projects at significantly reduced rates.

The same principles apply:
Understand the system first.
Establish the evidence.
Make the change safely.

All engagements remain confidential. Private repositories, internal documentation, and project details are treated as confidential unless the organization explicitly chooses to share public outcomes.

If you are building something that benefits a community and need help understanding where a software problem actually belongs, reach out.

What Exactly Are These AI Companies Up To???

Scarab Systems — Thu, 16 Jul 2026 06:00:23 +0000

I'm not buying it....

There's no way these major AI companies don't know what they're selling is mostly ridiculous...

I just got done watching an AIDevCon "presentation" from some expert developer at Anthropic trying to explain how their claude.md files are working amazingly well for instructing agents...except now they call them context "memories"...

They then go on to talk about the inevitable memory layer that is now needed to keep track of and allow the agents to properly access all these highly effective "memories"... and how they are allowing the agents to even update these memories themselves now...

Then came the crunch... shocker! This whole setup doesn't work so well in production...

So the solution to all these agents working against a codebase and assumptions being stale etc is versioning and hashing before and after...they're actually calling these guardrails.

Then the presentation smoothly glides into portability for the whole "system"... and of course it's all a nice clean API setup.

The entire upshot is that it's supposed to make these super duper agents EVEN better because they are doing a self-learning loop... but they don't say against what baseline or how they measure "better"...

There's also a disturbingly loose use of the word "learning" in connection with the agent, which seems to suggest they are getting this entire "loop" process confused with fine-tuning?!

I just don't buy it... this has got to be some kind of effort to just get millions of people blindly using these models and setups and burning thru tokens.

Anything else makes no sense... there's no way this works in real production grade complex multi-stack systems.

Scarab Diagnostic Field Test #042 — Open Multi Agent Per-Call Tool Governance Boundary - MERGED UPSTREAM

Scarab Systems — Thu, 16 Jul 2026 04:31:44 +0000

Target: open-multi-agent/open-multi-agent
Issue: open-multi-agent/open-multi-agent#96
Pull request: open-multi-agent/open-multi-agent#377
Field test status: merged upstream.

The requested capability

Open Multi Agent already supported tool access controls through presets, allowlists, and denylists.

Those controls could decide whether an agent was allowed to use a tool such as bash or a filesystem tool.

The missing layer was more precise.

Even when a tool is allowed, a specific invocation may still need to be approved or denied based on its validated arguments, the current agent, or the surrounding application context.

The issue therefore was not simply:

“Can this agent use this tool?”

It was:

“Should this particular call be allowed to execute?”

The boundary

The correct decision point was inside the existing tool execution path:

after input validation and before the tool implementation runs

That placement matters.

Before validation, the framework does not yet have a reliable call to evaluate.

After execution begins, the decision is already too late.

The contribution added a narrow governance hook at the point where tool access becomes a concrete side effect.

What changed

The PR added a public per-call tool gate API, including:

ToolCallContext
ToolCallDecision
ToolCallGate
ToolCallGateMetadata

Applications can now provide an optional onToolCall handler through the agent and orchestrator configuration.

The gate runs inside ToolExecutor after Zod input validation and before tool.execute().

If a call is denied, the tool implementation is not invoked. The framework returns a normal error ToolResult instead.

The PR also added trace metadata showing whether the gate was evaluated, what action was taken, and the reason when one was provided.
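The ordering described above can be illustrated with a small sketch. It is written in Python for brevity and is not the project's TypeScript API: every name in it (Executor, GateDecision, deny_rm, and so on) is hypothetical, and it only models the sequence the report describes, which is validation first, then the gate, then execution.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolResult:
    output: str
    is_error: bool = False


@dataclass
class GateDecision:
    allow: bool
    reason: Optional[str] = None


class Executor:
    """Toy executor (hypothetical): validate, then gate, then execute."""

    def __init__(self, tools, validators, gate=None):
        self.tools = tools            # tool name -> implementation
        self.validators = validators  # tool name -> validator that raises ValueError
        self.gate = gate              # optional per-call decision function
        self.trace: List[dict] = []   # one record per evaluated gate decision

    def run(self, name: str, raw_args: dict) -> ToolResult:
        # 1. Validation: the gate never sees a malformed call.
        try:
            args = self.validators[name](raw_args)
        except ValueError as exc:
            return ToolResult(f"invalid input: {exc}", is_error=True)

        # 2. Gate: runs on validated arguments, before any side effect.
        if self.gate is not None:
            decision = self.gate(name, args)
            self.trace.append(
                {"tool": name, "allowed": decision.allow, "reason": decision.reason}
            )
            if not decision.allow:
                # Denial is an ordinary error result; the tool is not invoked.
                return ToolResult(f"denied: {decision.reason}", is_error=True)

        # 3. Execution: only reached for valid, approved calls.
        return ToolResult(self.tools[name](args))


executed: List[str] = []


def fake_shell(args: dict) -> str:
    executed.append(args["command"])  # stands in for a real side effect
    return "ran " + args["command"]


def validate_shell(raw: dict) -> dict:
    if not isinstance(raw.get("command"), str):
        raise ValueError("command must be a string")
    return {"command": raw["command"]}


def deny_rm(name: str, args: dict) -> GateDecision:
    if name == "shell" and args["command"].startswith("rm "):
        return GateDecision(False, "destructive command")
    return GateDecision(True)


executor = Executor({"shell": fake_shell}, {"shell": validate_shell}, gate=deny_rm)
ok = executor.run("shell", {"command": "ls"})
denied = executor.run("shell", {"command": "rm -rf build"})
invalid = executor.run("shell", {"command": 42})
```

In this sketch the invalid call never reaches the gate, so the trace holds two records rather than three, and the denied command never appears in the executed list.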

Why this was not a broad security rewrite

The contribution did not add a complete policy engine.

It did not add the optional shell-risk classifier discussed in the issue.

It did not redesign sandboxing or process isolation.

It also did not replace the existing allowlist and denylist model.

Disallowed tools are still blocked before the new gate is reached.

The patch added one missing boundary: a final application-defined decision point for a validated tool invocation.

Why this matters

Tool-level permissions are necessary, but they are not always enough.

The same shell tool can run a harmless command or a destructive one.

The same filesystem tool can read a project file or remove an entire directory.

The risk often depends on the actual arguments and context, not only the tool name.

A per-call gate gives application developers a way to make that decision at runtime without forcing the framework to define one universal policy for every use case.
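As an illustration of an argument-dependent decision, here is a minimal Python sketch of one possible application-defined policy. The tool name, the argument keys (operation, path), and the workspace location are all assumptions made up for this example; the framework itself does not define this policy.

```python
from pathlib import PurePosixPath
from typing import Tuple

# Hypothetical workspace root chosen for this example.
WORKSPACE = PurePosixPath("/workspace/project")


def decide(tool: str, args: dict) -> Tuple[bool, str]:
    """Return (allow, reason) for one validated call. Illustrative policy only."""
    if tool != "filesystem":
        return True, "not a filesystem call"
    if args["operation"] == "read":
        return True, "reads are allowed"

    target = PurePosixPath(args["path"])
    # PurePosixPath does not collapse "..", so reject it explicitly.
    if ".." in target.parts:
        return False, "path traversal is not allowed"
    if target == WORKSPACE:
        return False, "refusing to modify the workspace root"
    if WORKSPACE not in target.parents:
        return False, "mutation outside the workspace"
    return True, "mutation inside the workspace"


read_anywhere = decide("filesystem", {"operation": "read", "path": "/etc/hosts"})
delete_inside = decide("filesystem", {"operation": "delete", "path": "/workspace/project/tmp/cache"})
delete_root = decide("filesystem", {"operation": "delete", "path": "/workspace/project"})
delete_outside = decide("filesystem", {"operation": "delete", "path": "/home/user"})
delete_escape = decide("filesystem", {"operation": "delete", "path": "/workspace/project/../other"})
```

The same tool name produces different outcomes here purely because of its arguments, which is the case a tool-level allowlist cannot express.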

That makes the feature useful for approval flows, contextual restrictions, workspace rules, custom risk checks, and other forms of application-level governance.

The Scarab reading

This was a governance-boundary feature.

The existing access controls governed whether a tool was available.

Input validation governed whether the call was structurally valid.

The new gate governs whether that validated call should cross into execution.

That separation keeps the framework’s responsibilities clear.

The Scarab boundary is:

A tool call should not become an executed side effect merely because the tool is available and its arguments are valid. Applications should have a final decision point before execution.

Validation

The contribution passed the repository’s required lint and test checks.

Additional build, coverage, scaffold, and package checks also passed.

The pull request was subsequently accepted and merged upstream.

Field result

Result:

Merged upstream agent-governance feature.

The contribution added:

a typed per-call tool gate
integration through agent and orchestrator configuration
evaluation after validation and before execution
normal result handling for denied calls
trace metadata for gate decisions
preservation of the existing tool access model

This does not claim to solve agent security as a whole.

It adds the missing framework boundary where application-defined governance can decide whether a specific validated tool call should run.

Public claim

Scarab/SDS helped frame a bounded feature contribution for open-multi-agent/open-multi-agent#96.

The merged PR adds a public per-call tool gate that runs after input validation and before tool execution. It allows applications to approve or deny individual invocations while preserving the framework’s existing presets, allowlists, and denylists.

The result is a narrow but important governance boundary between validated agent intent and actual tool side effects.

Disclosure: This field report was prepared with AI-assisted editing from public issue and pull-request records, validation notes, and the merged contribution record. The technical claims and final wording were human reviewed.

This Article Really Pissed LinkedIn Off! - Let's See What Happens Here...

Scarab Systems — Thu, 16 Jul 2026 01:05:53 +0000

Yeah, I said it.

I just reviewed a so-called “loop engineering” repo that is basically made up of prose-style Markdown.

Is this really what all the hoopla is about?

Are real developers and software engineers seriously staking their reputations on essentially feeding AI agents prose instructions—and calling the resulting orchestration an engineering discipline?

I understand that the Markdown does not literally execute itself.
I understand that there are schedulers, tools, APIs, tests, Git workflows and agent harnesses underneath it.

That does not make the central premise any less absurd.

The behavioral control layer is still largely prose!
Ambiguous, probabilistically interpreted, context-sensitive prose.

And we already know that AI systems can reinterpret instructions, expand scope, rationalize deviations and confidently declare success according to criteria they have effectively reframed for themselves.

So where is the actual anti-drift guardrail?
Where is the immutable, independently enforced mechanism that prevents the loop from gradually changing the meaning of its own mission?

I am beyond stunned.

And for those of you who need to hear it dressed up in more technical language, here it is:
You are placing probabilistic natural-language interpretation inside the behavioral control plane of software development without demonstrating adequate protection against semantic drift, goal reinterpretation, scope expansion, instruction mutation or correlated verifier failure.

Like I said... beyond stunned.

Originally posted on LinkedIn (linkedin.com, 81 comments), opening with: "There is absolutely NO way that anyone touting this “loop engineering” absurdity has spent significant time honestly confronting how AI systems actually drift, reinterpret instructions and manufacture their own logic."

Update — Electron Linux Message Box Fix Merged Downstream, Upstream Review Still Pending

Scarab Systems — Thu, 16 Jul 2026 01:00:28 +0000

The Electron Linux message-box repair from Field Test #042 has now been merged into a downstream package, while the original upstream Electron pull request remains open.

The downstream update upgraded Electron to version 42.6.0 and cherry-picked the proposed fix from electron/electron#52238.

The merged downstream change was explicitly described as:

Fix segfault with Qt backend

It references both the original Electron issue, electron/electron#51988, and the upstream repair PR, electron/electron#52238.

The downstream change was approved and merged into its main branch on July 6, 2026.

What this confirms

This does not mean the repair has been accepted into Electron upstream.

It does provide additional operational evidence for the repair lane.

A downstream maintainer reviewed the change, approved it, and incorporated the proposed fix into a package update addressing the reported Qt-backend segmentation fault.

That strengthens the field result in a specific way:

The repair is no longer only a diagnostic proposal and upstream pull request.

It has now been selected and deployed by a downstream project as the targeted fix for the same failure.

The boundary remains the same

The downstream adoption does not change the original technical claim.

The failure lived at the boundary between Electron’s GTK message-box implementation and Chromium’s active Linux UI backend.

The GTK message-box code needed GTK-specific platform support, but it was obtaining that support through the active Linux UI singleton.

That assumption could fail when the active implementation was Qt.

The proposed repair requests the GTK-specific Linux UI theme directly before retrieving the GTK platform object.

The downstream project adopted that same narrow correction.
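The shape of that assumption can be shown with a toy model. This is a Python sketch, not Electron or Chromium code: the class and function names are hypothetical, and the AttributeError only stands in for the native crash reported in the issue.

```python
class GtkPlatform:
    def show_message_box(self, text: str) -> str:
        return f"[gtk] {text}"


class GtkUi:
    """Hypothetical GTK backend: it can hand out a GTK platform object."""

    def get_platform(self) -> GtkPlatform:
        return GtkPlatform()


class QtUi:
    """Hypothetical Qt backend: it has no GTK platform object to hand out."""


# Stand-in registry for the available UI backends.
UI_BACKENDS = {"gtk": GtkUi(), "qt": QtUi()}


def platform_via_active(active_name: str) -> GtkPlatform:
    # Fragile path: assumes whichever backend is active is the GTK one.
    return UI_BACKENDS[active_name].get_platform()


def platform_via_gtk_backend() -> GtkPlatform:
    # Repair shape: ask for the GTK backend directly, whatever is active.
    return UI_BACKENDS["gtk"].get_platform()


try:
    platform_via_active("qt")
    fragile_path_failed = False
except AttributeError:
    fragile_path_failed = True

message = platform_via_gtk_backend().show_message_box("hello")
```

The fragile path works only while the active backend happens to be GTK; the direct lookup does not depend on which backend is active.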

Why downstream adoption matters

Downstream projects sometimes need to act before an upstream review process is complete.

They may be maintaining a package that is already exposed to the failure, while the original project still needs time to review architecture, test coverage, compatibility, and long-term ownership.

In this case, the downstream project did not introduce a separate workaround.

It cherry-picked the proposed upstream commit.

That matters because it preserves one repair lane across both contexts:

the upstream PR proposes the fix
the downstream package applies the same fix
the original Electron review remains authoritative for upstream acceptance

This is a useful example of how repository truth can move through an ecosystem before it is fully settled at the source.

The downstream project made a local governance decision:

For its package and affected users, the evidence was sufficient to carry the patch.

Electron upstream has not yet made the corresponding repository-level decision.

Both facts can be true at the same time.

Updated field status

Upstream status:

Electron issue remains the original public failure record
Electron PR #52238 remains open
upstream review and acceptance are still pending

Downstream status:

the proposed repair was cherry-picked
the package update was approved
the change was merged into main
the downstream issue associated with the crash was closed

The correct field-test status is now:

Downstream repair merged; upstream acceptance pending.

Updated public claim

SDS identified a Linux UI-theme boundary in Electron’s GTK message-box path where GTK platform behavior was being reached through the active Linux UI singleton.

A narrow C++ repair was submitted upstream to request the GTK-specific Linux UI theme before retrieving the GTK platform.

That proposed repair has now been cherry-picked, approved, and merged by a downstream package maintainer as part of an Electron 42.6.0 update addressing the Qt-backend segmentation fault.

The upstream Electron PR remains open and has not yet been accepted.

This update does not claim:

that Electron has merged the patch upstream
that downstream adoption proves universal correctness
that every Electron Linux build is affected
that Chromium’s Qt backend is itself defective
that Electron’s public dialog API changed
that the downstream maintainer endorsed Scarab or SDS

It supports a narrower and stronger claim:

The repair lane identified in the field test has now been independently adopted downstream to address the reported failure, while upstream governance remains in progress.

Field result update

Previous result:

Diagnostic proof and native-platform repair submitted.

Updated result:

Diagnostic proof submitted upstream and repair merged downstream.

The upstream repository still decides whether the change belongs in Electron itself.

The downstream project has already decided that the fix belongs in its package.

That separation is important.

SDS finds the boundary.

Contributors prepare the repair.

Downstream maintainers decide whether to carry it.

Upstream maintainers decide whether it becomes part of the source project.

Disclosure: This update was prepared with AI-assisted editing from public field-test notes, the public Electron issue and pull request, and the downstream merge record. The status, diagnostic claim, and final wording were human reviewed.

Scarab Diagnostic Field Test #041 — LEAN Walk-Forward Optimization Provider Boundary

Scarab Systems — Thu, 16 Jul 2026 00:57:31 +0000

Target: QuantConnect/Lean

Issue: QuantConnect/Lean#7031

PR: QuantConnect/Lean#9611

Field test status: new feature contribution submitted; upstream review pending.

The feature request

Issue #7031 requested walk-forward optimization support in LEAN.

Walk-forward optimization allows an algorithm to optimize its parameters repeatedly over time rather than selecting one fixed parameter set before execution.

The feature request therefore involved more than adding another optimizer command.

LEAN needed a way for algorithms to:

schedule optimization during execution
run optimization through different backends
receive the selected parameter set
inspect the resulting backtest summary
avoid recursive or nested optimization runs

The challenge was to introduce that capability without coupling the algorithm API to one execution environment.

Why this is a boundary problem

Walk-forward optimization spans several responsibilities:

algorithm scheduling
optimization execution
backtest orchestration
parameter selection
local infrastructure
QuantConnect cloud infrastructure
result delivery back into the running algorithm

Those responsibilities belong together as a feature, but they should not collapse into one implementation.

The central design question was:

How can an algorithm request walk-forward optimization without needing to know whether the optimization runs locally or through QuantConnect infrastructure?

That is the boundary.

The algorithm should express when optimization runs and how its results are consumed.

The provider should own where and how the optimization is executed.

Without that separation, the public algorithm API would become tied to one backend and make future execution modes harder to support.

The feature lane

The PR introduces QCAlgorithm.Optimize(...) overloads that schedule walk-forward optimization in a way similar to Train(...).

The new API gives algorithms a core contract for requesting optimization over time.

Execution is routed through a provider abstraction rather than being hard-coded into the algorithm layer.

The implementation includes:

a local walk-forward optimization provider
a QuantConnect API-backed provider
scheduling through QCAlgorithm.Optimize(...)
result plumbing for selected parameter sets
backtest summary delivery
safeguards against nested child optimizations

This creates one algorithm-facing API with multiple execution paths.

Why Train(...) was the right model

LEAN already has an established scheduling model through Train(...).

Walk-forward optimization has a similar temporal shape:

schedule work
execute it at a defined time
return a result to the algorithm
allow the algorithm to update its behavior

Using a comparable API pattern makes the new feature easier to understand and keeps it aligned with an existing LEAN concept.

The new Optimize(...) API does not treat optimization as a one-time command-line concern.

It makes optimization part of the algorithm lifecycle.

That is the important shift.

The provider boundary

The provider abstraction is the architectural center of the feature.

The algorithm asks for optimization.

The configured provider decides how that optimization is executed.

This allows LEAN to support both:

local optimization
cloud-backed optimization through the QuantConnect API

The public algorithm contract remains the same in either case.

That separation matters because local and cloud execution have different operational requirements, but algorithms should not need separate optimization APIs for each environment.

The provider boundary keeps backend choice out of strategy logic.
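A minimal sketch of that separation, in Python and under stated assumptions: the class names, the optimize signature, and the injected submit callable are all hypothetical and are not taken from the LEAN pull request. The point is only that the strategy code is identical for both backends.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OptimizationResult:
    parameters: dict
    backend: str


class OptimizationProvider(ABC):
    """Hypothetical provider contract: where and how optimization runs."""

    @abstractmethod
    def optimize(self, candidates: List[dict], objective: Callable[[dict], float]) -> OptimizationResult:
        ...


class LocalProvider(OptimizationProvider):
    def optimize(self, candidates, objective):
        # Evaluate every candidate in-process.
        return OptimizationResult(max(candidates, key=objective), "local")


class RemoteProvider(OptimizationProvider):
    def __init__(self, submit):
        # `submit` stands in for a remote API client; it is an assumption.
        self._submit = submit

    def optimize(self, candidates, objective):
        return OptimizationResult(self._submit(candidates, objective), "remote")


class Strategy:
    """Strategy code asks for optimization and never inspects the backend."""

    def __init__(self, provider: OptimizationProvider):
        self.provider = provider
        self.parameters = {"fast": 5}
        self.last_backend = None

    def reoptimize(self, candidates, objective):
        result = self.provider.optimize(candidates, objective)
        self.parameters = result.parameters
        self.last_backend = result.backend


candidates = [{"fast": 5}, {"fast": 10}, {"fast": 20}]


def objective(params: dict) -> float:
    return -abs(params["fast"] - 10)  # toy score: closer to 10 is better


def fake_remote_submit(cands, obj):
    return max(cands, key=obj)  # pretend the work happened elsewhere


local_strategy = Strategy(LocalProvider())
remote_strategy = Strategy(RemoteProvider(fake_remote_submit))
local_strategy.reoptimize(candidates, objective)
remote_strategy.reoptimize(candidates, objective)
```

Swapping the provider changes where the work runs without touching the Strategy class, which is the property the provider boundary is meant to protect.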

Result plumbing

Starting an optimization is only half of the feature.

The algorithm also needs a usable result.

The PR adds plumbing for:

the selected parameter set
associated backtest summaries
applying the chosen parameters after optimization completes

This turns optimization into an actionable algorithm event rather than a detached batch process.

The algorithm can schedule an optimization window, receive the chosen values, and continue using those values during later execution.

That is what makes the feature genuinely walk-forward rather than simply repeated offline optimization.
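The general walk-forward pattern can be sketched in a few lines of Python. This is a generic illustration of the technique, not LEAN's implementation: the function names, the trailing-window scheme, and the toy scoring rule are assumptions chosen to keep the example small.

```python
from statistics import mean


def walk_forward(series, candidates, score, train_len, step):
    """Re-select the best candidate on a trailing window every `step` points.

    Returns a list of (index where the selection takes effect, selected candidate).
    """
    selections = []
    start = train_len
    while start < len(series):
        window = series[start - train_len:start]  # only data seen so far
        best = max(candidates, key=lambda p: score(window, p))
        selections.append((start, best))
        start += step
    return selections


def closeness(window, candidate):
    # Toy score: prefer the candidate closest to the window average.
    return -abs(mean(window) - candidate)


observations = [1, 1, 1, 1, 5, 5, 5, 5, 5, 5]
schedule = walk_forward(observations, [1, 3, 5], closeness, train_len=4, step=2)
# schedule == [(4, 1), (6, 3), (8, 5)]
```

The selected value changes as the trailing window moves, and each selection is used only for the period that follows it. A single up-front optimization would instead fix one value for the whole run.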

Nested optimization safeguards

Walk-forward optimization can create child backtests.

Without an explicit guard, those child backtests could attempt to start their own optimization runs.

That could produce recursive execution, duplicated work, or an uncontrolled expansion of optimization jobs.

The feature therefore includes safeguards against nested child optimizations.

This is a small implementation detail with an important systems consequence.

A scheduled execution feature must define not only how work starts, but also where that work is prohibited from starting again.
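One way to express such a guard is a flag carried in the run context. The Python sketch below is an assumption about shape only: the names are hypothetical, and whether the real feature skips, logs, or rejects a nested request is not stated in this report. Here a nested request is simply skipped.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RunContext:
    # Hypothetical marker: True when this run was launched by an optimization.
    is_optimization_child: bool = False


def schedule_optimization(context: RunContext, run_child: Callable[[RunContext], None]) -> bool:
    """Start a child run unless we are already inside one. Returns True if started."""
    if context.is_optimization_child:
        return False  # nested request: do not start again
    run_child(RunContext(is_optimization_child=True))
    return True


launched: List[RunContext] = []
nested_attempts: List[bool] = []


def child_backtest(context: RunContext) -> None:
    launched.append(context)
    # The child runs the same algorithm code, so it tries to schedule too.
    nested_attempts.append(schedule_optimization(context, child_backtest))


started = schedule_optimization(RunContext(), child_backtest)
```

The top-level request starts exactly one child run, and the child's own request is refused, so the recursion stops at depth one.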

Why this was not an optimizer rewrite

The feature does not replace LEAN’s optimizer.

It does not introduce a new optimization algorithm.

It does not redesign backtesting.

It does not force one optimization backend.

It adds the missing orchestration layer between a running algorithm and the optimization infrastructure that already exists around it.

The new capability is therefore bounded to:

scheduling
provider routing
execution context
result delivery
recursion safeguards

That scope keeps the feature additive and non-breaking.

The Scarab reading

This was not a defect-repair case.

It was a missing-capability case.

LEAN already had algorithm scheduling, optimizer infrastructure, backtesting, and provider-based execution concepts.

What it lacked was a governed boundary connecting those components into a walk-forward optimization workflow.

The feature adds that boundary.

The algorithm owns the request.

The provider owns execution.

The optimizer owns parameter search.

The result contract carries the selected parameters and backtest evidence back to the algorithm.

That is the Scarab boundary:

Expose walk-forward optimization as an algorithm-level capability while keeping execution backend, optimizer mechanics, and child-run control behind explicit provider and result contracts.

Configuration and documentation

The feature introduces configuration for selecting and controlling local or cloud-backed execution.

If accepted, the LEAN documentation will need to cover:

the new QCAlgorithm.Optimize(...) API
scheduled walk-forward optimization behavior
local provider configuration
QuantConnect API-backed provider configuration
the walk-forward-optimization-* configuration keys
selected parameter and backtest result handling
nested optimization restrictions

This is a feature where documentation is part of the public contract.

The API may be clear in code, but users also need to understand when optimization runs, which provider is active, and how selected parameters return to the algorithm.

Validation

The feature was validated in a local macOS environment using the .NET SDK available in the checkout.

Validation included:

dotnet build Tests/QuantConnect.Tests.csproj /p:Configuration=Debug /v:quiet

Focused walk-forward optimization coverage passed:

13 tests passed

Optimizer regression coverage passed:

186 tests passed

The branch was also checked with:

git diff --check origin/master...HEAD

The complete repository test suite was not run locally.

The validation claim is therefore limited to the build, focused walk-forward optimization tests, optimizer regression tests, and diff integrity checks listed above.

Field result

Result: bounded new-feature contribution for scheduled walk-forward optimization in LEAN.

Feature shape:

algorithms need to schedule optimization over time
execution may be local or cloud-backed
algorithm code should not be coupled to one backend
selected parameters must return to the running algorithm
backtest evidence must be available with the result
child optimization runs must not recursively launch new optimizations

Implementation:

add QCAlgorithm.Optimize(...) scheduling overloads
add a walk-forward optimization provider abstraction
add local and QuantConnect API-backed providers
return selected parameter sets and backtest summaries
prevent nested child optimizations
add focused feature and optimizer regression coverage

This contribution does not replace LEAN’s optimizer.

It adds the orchestration and provider boundary required to make scheduled walk-forward optimization available as an algorithm-level feature.

Public claim

Scarab/SDS helped shape a bounded feature contribution for QuantConnect/Lean#7031.

The PR introduces scheduled walk-forward optimization through new QCAlgorithm.Optimize(...) overloads, a provider abstraction for local and cloud-backed execution, and result plumbing for selected parameters and backtest summaries.

The design keeps algorithm code independent of the execution backend and includes safeguards against nested child optimizations.

Validation passed through the LEAN test-project build, 13 focused walk-forward optimization tests, 186 optimizer regression tests, and a clean branch diff check.

The claim is limited to the new walk-forward optimization API, provider routing, result contract, and execution safeguards. It does not claim to replace LEAN’s optimizer or redesign the broader backtesting system.

Disclosure: AI-assisted coding tools were used while preparing this contribution. The implementation, tests, technical claims, and final wording were reviewed, and responsibility for the changes and any required follow-up revisions remains with the contributor.
