pueding

Posted on May 24 • Originally published at learnaivisually.com

OpenSCAD Pantheon Benchmark: Human-In-The-Loop vs Autonomous Coding Agents

#ai #llm #agents #productivity

What: The OpenSCAD Pantheon benchmark grades six agentic coding tools — including Antigravity 2.0, ModelRift, Codex 5.5, and Cursor Composer — on the same CAD task, surfacing the autonomous vs human-in-the-loop (HITL) contrast as two ways to drive the same agent loop.

Why: Most agent products today let teams toggle between autonomous and HITL, and the choice changes the SLOs, the cost profile, and the failure modes — but most teams pick one by gut feel, not by measurement.

vs prior: Earlier "agent benchmarks" graded a single end-to-end output and assumed a single control mode; this one grades the same task across tools running in both modes and finds that the best autonomous tool (Antigravity 2.0, 4.5/5) outscored the best HITL tool (ModelRift, 3.8/5) on quality, a result that runs against the common assumption that humans always lift quality.

Think of it as

A driver-ed lesson — a learner solo vs the same learner with an instructor who has a brake pedal.

                       SAME LEARNER, SAME PROMPT
                                  │
                    ┌─────────────┴─────────────┐
                    │                           │
            ┌───────▼────────┐         ┌────────▼───────┐
            │      Solo      │         │   Instructor   │
            │  (autonomous)  │         │     (HITL)     │
            └───────┬────────┘         └────────┬───────┘
                    │                           │
             keep going — no             pause at the
             wheel grabs                  checkpoint
                    │                           │
                    ▼                           ▼
              ✓ 4.5/5 quality            3.8/5 quality
                ~12 min                    ~10 min
              (Antigravity 2.0)           (ModelRift)

agent iteration = the learner's next driving move
tool call (OpenSCAD render) = pressing the gas, brake, or turning the wheel
autonomous mode = the learner solo: keep going, no one can pause you
HITL mode = the instructor in the passenger seat: the same drive, but they can grab the wheel
quality score = how close the final route is to the destination

Quick glossary

Agentic coding tool — A coding tool wrapped in an agent loop — it can read files, run commands, render or compile, see the output, and decide what to change next without re-prompting. Cursor Composer, Antigravity 2.0, Codex, and ModelRift are all examples; the loop is the agentic part, the editor / IDE is the surface.

Human-in-the-loop (HITL) — A control mode where the agent pauses at named checkpoints for a human to approve, edit, or redirect before the next step. The opposite of autonomous mode, where the agent runs the entire loop without stopping.

Autonomous mode — The agent executes the full loop end-to-end without pausing for human approval. Faster human time per task but no chance to course-correct mid-flight — a mistake at iteration 2 keeps propagating until the loop ends.

OpenSCAD — An open-source programmatic CAD language — you describe a 3D model in code (Boolean unions, rotations, extrusions) and render it through a CLI. Well-suited to LLM agents because the model is text, every iteration is a CLI call, and the output is a renderable mesh.

Iteration loop — One pass of: agent writes/edits the OpenSCAD source → CLI renders it → agent inspects the render → agent decides whether to refine. The benchmark publishes per-tool wall-clock totals (~12 min autonomous, ~10 min HITL) but does not release iteration counts; the worked example below uses illustrative counts that match the totals.

Benchmark quality score — The reviewer-graded score on a 1–5 scale measuring how close the final mesh matched the architectural intent of the Pantheon (proportion, radial symmetry, dome curvature, column detail). Subjective but consistent — the same human reviewer scored all six runs.

Wall-clock time — End-to-end elapsed time for the full task, human time included. HITL totals here include human pause time; autonomous totals are pure model time.

The news. On May 21, 2026, ModelRift published the OpenSCAD LLM benchmark: six agentic coding tools given the same prompt plus two reference images, asked to produce a Pantheon model in OpenSCAD. Best autonomous: Antigravity 2.0 + Gemini 3.5 Flash High at 4.5/5 in ~12 min. Best HITL: ModelRift + Gemini Flash 3.0 at 3.8/5 in ~10 min. Codex 5.5 High hit 3.0/5, Claude Sonnet ran 2-3× slower than Codex, and Cursor Composer was fastest but weakest at 1.4/5. Numbers are reviewer-scored; the post does not publish full transcripts or per-iteration traces.

Picture the metaphor. A learner in a driving lesson can either drive solo (head down, no one to grab the wheel) or drive with an instructor in the passenger seat who has a brake pedal. The conventional expectation is that the solo learner gets to the destination faster on average, but if they pick the wrong turn at minute three, that wrong turn carries through to the end. The instructor-paired learner pauses at intersections, lets the instructor weigh in, and is expected to end up with a route that's safer to defend but slower to take. The Pantheon benchmark is this lesson played out on an OpenSCAD model: same prompt, same reference images, two different control modes for the agent loop.

The mechanism is straightforward. Each tool was given the same prompt — "build a Pantheon model in OpenSCAD from these two reference images" — and ran its iteration loop. In autonomous mode, the agent generated OpenSCAD source, called the CLI to render it, looked at the render, decided what to refine, and repeated until it decided the model was done. In HITL mode, the loop paused at named checkpoints — typically after each render — for the human to approve, edit, or redirect before the next iteration. Both winners ran on the Gemini Flash family (Antigravity on Gemini 3.5 Flash High, ModelRift on Gemini Flash 3.0); the tools differed in how they wrapped that model in the loop.
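
The two control modes can be sketched as one loop with an optional checkpoint. This is a minimal illustration, not code from any of the benchmarked tools: `run_loop`, `Step`, and the stub `propose`, `render`, and `is_done` functions are hypothetical names, and the stubs only stand in for a model and a renderer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    source: str  # OpenSCAD source written at this iteration
    render: str  # whatever the render step returned, e.g. an image path


def run_loop(
    propose: Callable[[Optional[Step], Optional[str]], str],
    render: Callable[[str], str],
    is_done: Callable[[Step], bool],
    checkpoint: Optional[Callable[[Step], Optional[str]]] = None,
    max_iters: int = 10,
) -> list[Step]:
    """Run the write -> render -> inspect loop.

    checkpoint=None is autonomous mode. With a checkpoint (HITL), the human
    sees each render and returns None to approve or a string to redirect.
    """
    history: list[Step] = []
    last: Optional[Step] = None
    feedback: Optional[str] = None
    for _ in range(max_iters):
        source = propose(last, feedback)
        last = Step(source=source, render=render(source))
        history.append(last)
        feedback = checkpoint(last) if checkpoint else None
        # A redirect forces another iteration even if the agent thinks it is done.
        if feedback is None and is_done(last):
            break
    return history


# Stubs (hypothetical): a real render step might shell out to the OpenSCAD
# CLI, for example `openscad -o out.png model.scad`.
def propose(last: Optional[Step], feedback: Optional[str]) -> str:
    base = last.source if last else "cylinder(h=10, r=20);"
    note = f"\n// human: {feedback}" if feedback else ""
    return base + "\nsphere(r=20);" + note


def render(source: str) -> str:
    return f"render of {len(source)} chars"


def is_done(step: Step) -> bool:
    return step.source.count("sphere") >= 2


autonomous_run = run_loop(propose, render, is_done)
print(len(autonomous_run))  # 2
```

Passing a `checkpoint` function is the only difference between the two modes here, which matches the framing of HITL and autonomous as two ways to drive the same loop.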

The headline finding is that the best autonomous tool beat the best HITL tool on quality: 4.5/5 vs 3.8/5. The common assumption is that a human in the loop raises quality (more eyes, more course-correction). On this task it didn't, though with a single reviewer-scored run per tool and no replicates published, the gap should be read as suggestive, not statistically certain. The two runs also used different specific models (Gemini 3.5 Flash High vs Gemini Flash 3.0), so it's not a clean apples-to-apples isolation of HITL itself; what it does compare is the end-to-end deliverable that a team would actually ship using each tool. (If you want the strict HITL-vs-autonomous A/B on identical models with replicates, this benchmark doesn't give it to you; that experiment is still missing.)

There's a second finding hiding under the first: HITL was faster here, not slower. ~10 min for ModelRift vs ~12 min for Antigravity 2.0. The conventional reading of HITL is "more careful, takes longer." This run flipped that — the human pauses were short enough, and the redirects shortcutting bad branches were valuable enough, that total wall-clock dropped. The reading is not "HITL is always faster"; it's that for a task with a clear visual target (the Pantheon, with two reference images), a few well-placed human redirects can save more iteration cycles than they cost.

Where the time and quality actually go

Walking through the math with the benchmark's reported totals and illustrative per-iteration breakdowns (the benchmark publishes wall-clock totals and final scores only — iteration counts and per-step times below are stylized estimates picked to match the reported totals, not numbers from a published trace):

Antigravity 2.0 — autonomous, total ~12 min. Treat that as roughly 5 illustrative iterations × ~2.4 min/iteration: no pauses, no human input. The 4.5/5 score reflects that the autonomous loop, given enough iterations, converged on architectural correctness — the dome curve, the column count, the pediment proportion — without anyone steering it.

ModelRift — HITL, total ~10 min. Treat that as roughly 3 illustrative iterations × ~2.7 min + 2 human reviews × ~0.4 min ≈ 9.0 min of model+review time, plus a small overhead. The 3.8/5 score reflects that the human redirects can shortcut some bad branches early (saving iterations vs autonomous) but the final model still trailed on detail in this run.
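
The arithmetic above can be checked in a few lines. The iteration counts and per-step minutes are the same stylized estimates used in the text, not published figures, and `wall_clock_min` is a hypothetical helper.

```python
def wall_clock_min(iterations, min_per_iteration, reviews=0, min_per_review=0.0):
    """Minutes spent in model iterations plus human reviews."""
    return iterations * min_per_iteration + reviews * min_per_review


autonomous = wall_clock_min(5, 2.4)
hitl = wall_clock_min(3, 2.7, reviews=2, min_per_review=0.4)
hitl_overhead = 10.0 - hitl  # remainder needed to reach the reported ~10 min

print(round(autonomous, 1))     # 12.0
print(round(hitl, 1))           # 8.9
print(round(hitl_overhead, 1))  # 1.1
```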

The cost story is different from the quality story. In this run, autonomous won on quality and HITL won on wall-clock. HITL also costs human attention, and that cost pays back only when the attention catches mistakes that would have taken a full iteration to fix downstream. If your iteration is cheap (seconds, not minutes), the autonomous loop's "just iterate more" strategy dominates. If your iteration is expensive (long renders, paid API calls, slow CI), each human-saved iteration is worth more, and HITL pays back.
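
One way to make that trade-off concrete is a break-even check: how many iterations does a single human review need to save, on average, to pay for itself in wall-clock time? The sketch below assumes wall-clock minutes are the only cost. The 0.4 min review and 2.7 min iteration are the illustrative figures from the worked example, and the 5-second iteration is a made-up contrast case.

```python
def break_even_iterations(review_min: float, iteration_min: float) -> float:
    """Iterations one review must save, on average, to cover its own time."""
    return review_min / iteration_min


slow_iteration = break_even_iterations(review_min=0.4, iteration_min=2.7)
fast_iteration = break_even_iterations(review_min=0.4, iteration_min=5 / 60)

print(round(slow_iteration, 2))  # 0.15
print(round(fast_iteration, 1))  # 4.8
```

With minutes-long iterations, a review that prevents one bad iteration in seven already breaks even. With seconds-long iterations, each review would have to save nearly five, which is why "just iterate more" tends to win there.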

What changes for production agent design

The benchmark forces a decision teams usually leave implicit: which control mode is your default for which task class? The shape of the answer has been emerging across other Agent Engineering work — the Decision Rule framing weighs autonomy, controllability, and cost against each other, and the Cost Profile lens walks through where the tokens go under each mode. This benchmark gives you concrete numbers to plug in.

Decision input	Autonomous wins when	HITL wins when
Task structure	well-defined target, the model has converged on similar tasks before, mistakes are cheap to discover at the end	target is fuzzy or shifts as work progresses, mistakes are expensive to discover late, the human can state preferences faster by reacting to output than by writing them down up front
Iteration cost	iterations are cheap (seconds, low-cost API calls); "just iterate more" is viable	iterations are expensive (long renders, paid runs, slow CI); human redirects save more than they cost
Reviewability	each iteration produces something a model can self-evaluate (renders, test results, type checks)	each iteration produces something only a human can score (visual judgment, taste, domain expertise)
Human time budget	operator is unavailable or expensive (off-hours, batch jobs)	operator is present and available; the wall-clock saved is worth their pause time
Failure cost	output is reversible — re-run, regenerate, throw away	output is irreversible — sent emails, executed trades, deployed code; the Lethal Trifecta lens applies here
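
The table can be read as a checklist. Here is a rough sketch of that reading, under two loud assumptions that the table itself does not state: an irreversible output always forces HITL, and otherwise a simple majority of rows decides. `TaskProfile`, `hitl_signals`, `suggest_mode`, and both example tasks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    target_is_fuzzy: bool         # Task structure
    iteration_is_expensive: bool  # Iteration cost
    needs_human_scoring: bool     # Reviewability
    operator_is_available: bool   # Human time budget
    output_is_irreversible: bool  # Failure cost


def hitl_signals(task: TaskProfile) -> list[str]:
    """Names of the table rows that point toward HITL for this task."""
    rows = {
        "Task structure": task.target_is_fuzzy,
        "Iteration cost": task.iteration_is_expensive,
        "Reviewability": task.needs_human_scoring,
        "Human time budget": task.operator_is_available,
        "Failure cost": task.output_is_irreversible,
    }
    return [row for row, favors_hitl in rows.items() if favors_hitl]


def suggest_mode(task: TaskProfile) -> str:
    # Assumption: irreversibility overrides the other rows; otherwise a
    # majority of rows decides. The table gives no weights, so tune this.
    if task.output_is_irreversible:
        return "hitl"
    return "hitl" if len(hitl_signals(task)) >= 3 else "autonomous"


batch_render = TaskProfile(False, False, False, False, False)
outbound_email = TaskProfile(True, False, True, True, True)

print(suggest_mode(batch_render))    # autonomous
print(suggest_mode(outbound_email))  # hitl
print(hitl_signals(outbound_email))
```

The list returned by `hitl_signals` is probably more useful than the verdict, since it shows which rows are driving the choice.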

The honest take after this benchmark is that autonomous is competitive on quality for tasks with clear visual targets, which wasn't obvious a model generation ago and which this benchmark puts a number on. HITL still wins where you genuinely can't define the target up front, where iteration is expensive, or where the failure cost dominates. The default that fits most production teams is probably autonomous-first with HITL escape hatches at the failure-cost-sensitive checkpoints, not HITL throughout, and this benchmark is one data point pushing in that direction.
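
An autonomous-first agent with HITL escape hatches might look like the sketch below: reversible tool calls run without asking, and only calls on an irreversible list pause for approval. The tool names, `call_tool`, and the `approve` callback are all hypothetical and not taken from any of the benchmarked products.

```python
from typing import Callable

# Hypothetical tool names, mirroring the irreversible examples in the table.
IRREVERSIBLE_TOOLS = {"send_email", "execute_trade", "deploy_code"}


def call_tool(name: str, run: Callable[[], str], approve: Callable[[str], bool]) -> str:
    """Run reversible tools immediately; ask a human before irreversible ones."""
    if name in IRREVERSIBLE_TOOLS and not approve(name):
        return f"{name}: blocked at checkpoint"
    return run()


asked = []


def deny_all(name: str) -> bool:
    # Stand-in reviewer that records each request and rejects it.
    asked.append(name)
    return False


print(call_tool("render_scad", lambda: "rendered", deny_all))  # rendered
print(call_tool("deploy_code", lambda: "deployed", deny_all))  # deploy_code: blocked at checkpoint
print(asked)  # ['deploy_code']
```

The render call never reaches the reviewer, so human attention is spent only where a wrong action cannot be undone.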

Related explainers

Agentic CLEAR — System/Trace/Node eval granularity — companion piece on evaluating agent runs across the same three abstraction levels, complementary to picking the right control mode.
Maestro — RL orchestrator over frozen experts — another agent-design pattern that competes with the autonomous-single-agent baseline this benchmark used.

FAQ

What is the OpenSCAD Pantheon benchmark?

A hands-on agentic-coding eval ModelRift published on May 21, 2026. Six agentic coding tools — including Antigravity 2.0, ModelRift, Codex 5.5, Claude Sonnet, and Cursor Composer — were given the same prompt and two reference images, then asked to build a Pantheon model in OpenSCAD. The benchmark grades the final mesh on a 1–5 scale against architectural intent (proportion, radial symmetry, dome curvature, column count) and reports both quality and wall-clock time per run.

Did autonomous really beat human-in-the-loop on quality?

On this task, yes. Antigravity 2.0 in autonomous mode scored 4.5/5; ModelRift in HITL mode scored 3.8/5. The two runs used different specific Gemini models (Gemini 3.5 Flash High vs Gemini Flash 3.0), so it's not a strict A/B on HITL itself; it's an end-to-end comparison of the best deliverable each control mode produced, from a single reviewer-scored run per tool. The pattern still matters: autonomous loops are competitive on quality for tasks with clear visual targets, which used to be HITL's home turf.

When should I pick HITL over autonomous in production?

When the target is fuzzy or shifts as work progresses, when each iteration is expensive (long renders, paid runs, slow CI), when only a human can score the output (taste, domain judgment), or when the failure cost is high enough that one wrong action is worse than ten wasted iterations. Conversely, autonomous wins when iterations are cheap, the target is well-defined, and the model can self-evaluate each step. Many production teams end up with autonomous-first agents that escalate to HITL only at failure-cost-sensitive checkpoints.

