Saqueib Ansari

Posted on May 27 • Originally published at qcode.in

AI 3D tools need product evals, not benchmark faith


If you are building AI-generated 3D tooling, treat public benchmarks as lead signals, not product truth. A model can score well on an OpenSCAD-style benchmark and still be dangerous inside your app, because your product is not grading text against a reference file. It is asking users to trust generated geometry, measurements, layout intent, and downstream editability.

That changes the bar completely. The real question is not "which model topped the benchmark?" It is "what errors can this model make inside my workflow, and how cheaply can I catch them before the user pays for them?"

For CAD-like tools, room planners, parametric builders, scene generators, and layout systems, that question matters more than leaderboard position. Benchmarks are still useful. They help you narrow candidates and avoid obvious dead ends. But if you ship based on benchmark scores alone, you are outsourcing product judgment to someone else’s task design.

Benchmarks are useful, but only as a filter

A benchmark usually tells you something real. It can reveal whether a model follows structured prompts, emits syntactically valid code, and handles a certain family of geometry tasks better than its peers. That is valuable.

What it does not tell you is whether the model is good at your failure boundary.

A benchmark can reward the wrong thing for a production tool:

exact string or AST similarity instead of geometric intent
simple object generation instead of edit-safe output
valid code generation instead of stable dimensions
one-shot task success instead of repairability after failure
average score instead of worst-case damage

That last point matters most. In 3D tooling, the average result is often less important than the ugly 5 percent. If the model occasionally creates self-intersecting meshes, non-manifold solids, overlapping walls, impossible clearances, or silently wrong measurements, the benchmark score stops being comforting.

A practical rule: use public benchmarks to choose what to test, not what to trust.

If a model performs well on an OpenSCAD benchmark, that is a reason to include it in your eval set. It is not a reason to expose generated geometry directly to paying users.

Your evals should mirror the product contract

Most teams make the same mistake here. They evaluate the model at the prompt layer, but their product risk lives at the artifact layer.

If your product accepts a natural-language request like "make a 4x6 meter room with a centered 900mm door and a 1.2 meter window on the east wall," your eval should not stop at "did the model produce plausible code?" It should verify whether the generated result satisfies the actual contract:

are the dimensions correct within tolerance?
are named constraints respected?
is the output editable?
does regeneration preserve intent?
can downstream tools ingest it cleanly?

That means your eval dataset needs to be product-specific.

Build tasks around user intent, not benchmark trivia

A good internal eval set usually includes 30 to 100 tasks before you scale further. The point is not dataset size. The point is coverage of the decisions your product actually makes.

For a room-layout tool, that might include:

simple rectangular rooms with strict dimensions
openings with exact offsets from corners
furniture placement with clearance rules
invalid requests the system should refuse or repair
near-duplicate prompts with slightly different constraints
iterative edits like "same layout, but move the sofa 400mm away from the wall"

For a parametric CAD assistant, include:

parts with exact measurements
tolerance-sensitive cutouts
repeated features and symmetry
feature edits after initial generation
prompts that mix hard constraints with soft aesthetic intent

The key is that each case should have a machine-checkable success condition where possible.

{
  "id": "room-door-window-01",
  "prompt": "Create a 4m x 6m room with a 900mm centered door on the south wall and a 1200mm window on the east wall, 1m from the northeast corner.",
  "checks": {
    "room_width_mm": 4000,
    "room_length_mm": 6000,
    "door_width_mm": 900,
    "door_centered_on_wall": true,
    "window_width_mm": 1200,
    "window_offset_from_ne_corner_mm": 1000,
    "no_opening_overlap": true,
    "manifold_geometry": true
  },
  "severity_if_wrong": "high"
}

That structure is already more useful than a generic prompt-response benchmark, because it tells you what failure means in your product.
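One way to make a case like this machine-checkable is to compare values measured from the generated artifact against the `checks` block. This is a minimal sketch, not a definitive implementation: `run_checks` and the `measured` dictionary are hypothetical names, the measurements are assumed to come from your own geometry pipeline, and the 1 mm tolerance is an assumption to tune.

```python
def run_checks(expected: dict, measured: dict, tolerance_mm: float = 1.0) -> list[str]:
    """Return human-readable failures; an empty list means the case passed."""
    failures = []
    for name, want in expected.items():
        if name not in measured:
            failures.append(f"{name}: not measured")
            continue
        got = measured[name]
        # bool is a subclass of int in Python, so handle booleans first.
        if isinstance(want, bool):
            if got != want:
                failures.append(f"{name}: expected {want}, got {got}")
        elif abs(got - want) > tolerance_mm:
            failures.append(f"{name}: expected {want} mm, got {got} mm")
    return failures


# Hypothetical measurements for a subset of the checks in the case above.
expected = {"room_width_mm": 4000, "door_width_mm": 900, "no_opening_overlap": True}
measured = {"room_width_mm": 4000.4, "door_width_mm": 850, "no_opening_overlap": True}
print(run_checks(expected, measured))
```

Here the room width passes within tolerance, while the 850 mm door is reported as a failure.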

Score by business damage, not only pass rate

Not all errors are equal. A mislabeled material is annoying. A wrong cutout dimension can ruin fabrication. A sofa overlapping a wall is ugly. A staircase with impossible rise/run values is unsafe.

So weight your evals accordingly.

A good scoring model usually separates:

hard failures: constraint violations, invalid geometry, import failure
soft failures: ugly layout, awkward spacing, poor style match
recoverable failures: user can fix in one edit
toxic failures: result looks valid but encodes wrong measurements

That last category is where benchmark worship really breaks down. A fluent-looking result that is dimensionally wrong is much worse than an obvious failure, because users trust it longer.
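A severity-weighted score can be sketched in a few lines. The weights below are illustrative assumptions, not recommended values, and `damage_score` is a hypothetical helper; the point is only that a run with fewer failures can still carry more damage.

```python
from collections import Counter

# Assumed weights, for illustration only. Tune them to your own cost of failure.
WEIGHTS = {"soft": 1, "recoverable": 2, "hard": 10, "toxic": 25}


def damage_score(failures: list[str]) -> int:
    """Sum the weight of every failure category observed in one eval run."""
    counts = Counter(failures)
    return sum(WEIGHTS[kind] * n for kind, n in counts.items())


model_a = ["soft"] * 12              # many cosmetic misses
model_b = ["soft"] * 2 + ["toxic"]   # fewer failures, one silently wrong
print(damage_score(model_a), damage_score(model_b))
```

Under these assumed weights, the model with three failures scores worse than the model with twelve, because one of its failures is toxic.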

Geometry failure modes matter more than model polish

In 3D generation, pretty demos hide the expensive bugs. You should assume the model can produce syntactically valid output that is still operationally broken.

That is why your evals need geometry-aware checks, not just text-level scoring.

The failure classes worth catching early

For CAD-like and layout tools, these are usually the ones that matter:

dimensional drift: the part or room is close, but not correct
topological invalidity: self-intersections, open shells, non-manifold edges
constraint breakage: features overlap or violate placement rules
frame-of-reference mistakes: wrong axis, mirrored placement, swapped width/depth
edit instability: a small prompt change causes a full structural collapse
unit confusion: mm vs cm vs meters, or implicit unit shifts during refinement
downstream incompatibility: exports that render but fail in slicers, CAD importers, or scene pipelines

You do not need a perfect automated judge for all of these on day one. But you do need to stop pretending that valid text output is a sufficient proxy.
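As an example of a cheap geometry-aware check, the sketch below flags open shells and non-manifold edges in an indexed triangle mesh by counting how many faces share each edge. It assumes the mesh is available as a list of vertex-index triples; `non_manifold_edges` is a hypothetical name. It does not detect self-intersections or inconsistent face orientation, which need heavier tooling.

```python
from collections import Counter


def non_manifold_edges(triangles: list[tuple[int, int, int]]) -> list[tuple[int, int]]:
    """Return edges that are not shared by exactly two triangles.

    In a closed triangle mesh every edge belongs to exactly two faces.
    An edge used once means an open shell; three or more means a
    non-manifold junction.
    """
    edge_count = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            # Store each edge with sorted endpoints so direction does not matter.
            edge_count[(min(u, v), max(u, v))] += 1
    return sorted(edge for edge, n in edge_count.items() if n != 2)


# A tetrahedron: four vertices, four faces, closed.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(non_manifold_edges(tetra))      # closed mesh: no offending edges
print(non_manifold_edges(tetra[:3]))  # one face removed: boundary edges reported
```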

Add deterministic validators around the model

The most practical architecture is usually LLM plus verifier, not LLM alone.

If the model emits OpenSCAD, CAD parameters, or scene JSON, run deterministic checks after generation and before surfacing the result. Use the model for synthesis; use code for trust.

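from dataclasses import dataclass

# Errors that block the artifact outright instead of only lowering its score.
# Softer checks added later can append errors without appearing here.
HARD_FAILURES = {
    "room width mismatch",
    "room length mismatch",
    "non-manifold geometry",
    "opening overlap",
    "unexpected units",
}


@dataclass
class EvalResult:
    passed: bool
    errors: list[str]
    score: float


def validate_room(spec, artifact, tolerance_mm: float = 1.0) -> EvalResult:
    """Check a generated room artifact against the expected spec."""
    errors = []

    # Compare within a tolerance; exact equality is brittle for measured values.
    if abs(artifact.room.width_mm - spec["room_width_mm"]) > tolerance_mm:
        errors.append("room width mismatch")

    if abs(artifact.room.length_mm - spec["room_length_mm"]) > tolerance_mm:
        errors.append("room length mismatch")

    if not artifact.geometry.is_manifold():
        errors.append("non-manifold geometry")

    if artifact.openings.overlap():
        errors.append("opening overlap")

    if artifact.units != "mm":
        errors.append("unexpected units")

    hard_fail = any(msg in HARD_FAILURES for msg in errors)

    return EvalResult(
        passed=not hard_fail,
        errors=errors,
        # Each error costs 0.25, floored at zero.
        score=max(0.0, 1 - 0.25 * len(errors)),
    )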

This is unglamorous, and that is exactly the point. If your product depends on geometry being right, you need boring validators in front of user trust.

Code-based targets such as OpenSCAD help here, because you can often parse, render, and inspect the output deterministically. That is much safer than judging results only by how a screenshot looks.

Ship guarded workflows before you ship direct generation

The fastest way to hurt trust is to present generated geometry as if it were authoritative.

The safer rollout path is staged.

Start with proposal mode, not execution mode

In the first version, the model should propose, not decide.

Good early-product patterns:

generate a draft and require explicit user review
highlight inferred constraints versus exact constraints
show measurements as inspectable overlays
label low-confidence outputs and blocked validations
offer one-click repair suggestions instead of silent fixes

That product framing matters. Users are much more forgiving of a "generated draft" than a "done model" that later proves wrong.

This is especially important for iterative editing workflows. If a user asks, "make the countertop 300mm deeper but keep the sink centered," they are not asking for a fresh hallucination. They are asking for constraint-preserving transformation. Those are different jobs, and they should have different guardrails.
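The countertop example can be sketched as an invariant that is checked after every edit. Everything here is hypothetical: the `Countertop` fields, the `deepen` helper, and the 0.5 mm tolerance are assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Countertop:
    width_mm: float
    depth_mm: float
    sink_width_mm: float
    sink_depth_mm: float
    sink_x_mm: float  # left edge of the sink, measured from the left edge
    sink_y_mm: float  # front edge of the sink, measured from the front edge


def is_sink_centered(c: Countertop, tol_mm: float = 0.5) -> bool:
    """The invariant the user asked to keep: sink centre equals countertop centre."""
    cx = c.sink_x_mm + c.sink_width_mm / 2
    cy = c.sink_y_mm + c.sink_depth_mm / 2
    return abs(cx - c.width_mm / 2) <= tol_mm and abs(cy - c.depth_mm / 2) <= tol_mm


def deepen(c: Countertop, extra_mm: float) -> Countertop:
    """Apply the edit, then re-derive the dependent value so the invariant holds."""
    new_depth = c.depth_mm + extra_mm
    return replace(c, depth_mm=new_depth, sink_y_mm=(new_depth - c.sink_depth_mm) / 2)


before = Countertop(2000, 600, 500, 400, 750, 100)
edited = deepen(before, 300)          # constraint-preserving edit
naive = replace(before, depth_mm=900)  # changes depth but leaves the sink where it was

print(is_sink_centered(before), is_sink_centered(edited), is_sink_centered(naive))
```

Running the same invariant against a regenerated artifact is one way to tell a constraint-preserving edit from a fresh generation that merely looks similar.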

Treat repair as a first-class capability

A strong 3D tool does not only ask, "can the model generate this?" It asks, "when the model is wrong, can the system recover cheaply?"

That means storing enough structure to support repairs:

explicit constraints
semantic object labels
dimensions as typed fields, not only freeform code
provenance for which step created which feature

If you reduce everything to one final text blob, every correction becomes a full regeneration. That is fragile.

A better pattern is intermediate representation first, generated artifact second. Let the model fill a schema, validate the schema, then compile to the final representation.

type LayoutIntent = {
  room: { widthMm: number; lengthMm: number };
  openings: Array<{
    kind: "door" | "window";
    wall: "north" | "south" | "east" | "west";
    widthMm: number;
    offsetMm: number;
  }>;
  furniture: Array<{
    kind: string;
    xMm: number;
    yMm: number;
    rotationDeg: number;
  }>;
};

That schema gives you something you can validate, diff, repair, and version. The generated scene or CAD code becomes a compilation target, not the only source of truth.
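Validating the intent before compiling it might look like the sketch below. It works on a plain dictionary shaped like `LayoutIntent`; `validate_intent` is a hypothetical helper, and the mapping of walls to room dimensions (north and south run along the width, east and west along the length) is an assumption.

```python
def validate_intent(intent: dict) -> list[str]:
    """Return structural errors in a LayoutIntent-shaped dict; empty means valid."""
    errors = []
    room = intent["room"]
    if room["widthMm"] <= 0 or room["lengthMm"] <= 0:
        errors.append("room dimensions must be positive")

    # Assumption: north/south walls run along the width, east/west along the length.
    wall_length = {
        "north": room["widthMm"], "south": room["widthMm"],
        "east": room["lengthMm"], "west": room["lengthMm"],
    }

    spans_by_wall: dict[str, list[tuple[float, float, int]]] = {}
    for i, opening in enumerate(intent.get("openings", [])):
        wall = opening["wall"]
        if wall not in wall_length:
            errors.append(f"opening {i} names an unknown wall: {wall}")
            continue
        start = opening["offsetMm"]
        end = start + opening["widthMm"]
        if opening["widthMm"] <= 0 or start < 0 or end > wall_length[wall]:
            errors.append(f"opening {i} does not fit on the {wall} wall")
        spans_by_wall.setdefault(wall, []).append((start, end, i))

    # After sorting by start offset, any overlap shows up between neighbours.
    for wall, spans in spans_by_wall.items():
        spans.sort()
        for (_, end_a, a), (start_b, _, b) in zip(spans, spans[1:]):
            if start_b < end_a:
                errors.append(f"openings {a} and {b} overlap on the {wall} wall")
    return errors


good = {
    "room": {"widthMm": 4000, "lengthMm": 6000},
    "openings": [
        {"kind": "door", "wall": "south", "widthMm": 900, "offsetMm": 1550},
        {"kind": "window", "wall": "east", "widthMm": 1200, "offsetMm": 1000},
    ],
    "furniture": [],
}
print(validate_intent(good))
```

Only an intent that returns no errors would be handed to the compiler that emits scene or CAD code.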

Production evals should continue after launch

Offline evals are necessary, but they are not enough. Once real users start pushing the tool, they will discover edge cases your synthetic set missed.

The correct move is to build a feedback loop that turns production failures back into eval cases.

Log failure evidence, not just prompts

When a generation fails, capture more than the prompt:

prompt text
model version
system prompt or planner version
intermediate structured intent
validator outputs
user edits after generation
whether the artifact was accepted, repaired, or discarded

That gives you a real source of truth for future evals. Otherwise you end up debugging vibes instead of failures.
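A record along these lines is enough to carry that evidence. This is a sketch under assumptions: the `GenerationRecord` class, its field names, and the version strings in the example are hypothetical placeholders, not part of any real logging API.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GenerationRecord:
    prompt: str
    model_version: str
    planner_version: str
    intent: dict                      # intermediate structured intent
    validator_errors: list[str]
    user_edits: list[str] = field(default_factory=list)
    outcome: str = "pending"          # accepted, repaired, or discarded

    def to_eval_case(self) -> dict:
        """Turn a logged failure into a draft regression case for human review."""
        return {
            "prompt": self.prompt,
            "intent": self.intent,
            "observed_errors": self.validator_errors,
            "source": f"production:{self.model_version}",
        }


record = GenerationRecord(
    prompt="Create a 4m x 6m room with a centered 900mm door",
    model_version="model-a",          # placeholder value
    planner_version="planner-v3",     # placeholder value
    intent={"room": {"widthMm": 4000, "lengthMm": 6000}},
    validator_errors=["opening overlap"],
    outcome="discarded",
)
print(json.dumps(asdict(record), indent=2))
```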

A useful internal taxonomy is simple:

gen_valid_user_accepted
gen_valid_user_repaired
gen_invalid_blocked_by_validator
gen_invalid_escaped_to_user
gen_refused_correctly

Now you can measure whether the system is improving in ways that matter.

Optimize for escape rate, not just benchmark rank

The metric I would care about most is not public benchmark position. It is failure escape rate: how often a materially wrong artifact reaches the user as if it were usable.

That metric aligns with product trust.
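With outcomes labeled using the taxonomy above, escape rate is a simple ratio. The sketch assumes the denominator is all generations; you may prefer to divide by invalid generations only, which is what the hypothetical `validator_catch_rate` helper reports from the other side. The sample counts are made up for illustration.

```python
from collections import Counter


def escape_rate(outcomes: list[str]) -> float:
    """Share of all generations where an invalid artifact reached the user."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return counts["gen_invalid_escaped_to_user"] / total if total else 0.0


def validator_catch_rate(outcomes: list[str]) -> float:
    """Share of invalid generations that the validators blocked."""
    counts = Counter(outcomes)
    blocked = counts["gen_invalid_blocked_by_validator"]
    invalid = blocked + counts["gen_invalid_escaped_to_user"]
    return blocked / invalid if invalid else 1.0


# Made-up sample of 100 labeled generations.
outcomes = (
    ["gen_valid_user_accepted"] * 70
    + ["gen_valid_user_repaired"] * 15
    + ["gen_invalid_blocked_by_validator"] * 10
    + ["gen_invalid_escaped_to_user"] * 3
    + ["gen_refused_correctly"] * 2
)
print(f"escape rate: {escape_rate(outcomes):.2%}")
print(f"validator catch rate: {validator_catch_rate(outcomes):.2%}")
```

Tracking both numbers per model version shows whether a model swap changed what reaches users or only what the validators have to block.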

If benchmark score improves by 8 percent but escape rate barely moves, you probably improved syntax, not safety. If benchmark score stays flat but invalid geometry reaching users drops sharply, that is real progress.

This is the contrarian part builders need to accept: the best model for your product may not be the benchmark winner. It may be the one that works best with your validators, preserves constraints more reliably, degrades more honestly, or produces artifacts your pipeline can safely repair.

What I would actually do

If I were building an AI-powered 3D or CAD-adjacent tool today, I would use public benchmarks only to shortlist candidate models. Then I would build a product eval set with strict constraint checks, geometry validation, and severity-weighted scoring. I would ship proposal mode first, keep structured intermediate representations, and block any artifact that fails deterministic validation.

I would also assume that some failures will still escape, so I would log enough evidence to turn production mistakes into new eval cases every week.

That is slower than posting a benchmark chart and declaring victory. It is also how you avoid shipping a tool that looks intelligent in demos and becomes expensive in real use.

The practical decision rule is simple: never trust a 3D generation model more than your validators trust the artifact it produced. In this category, benchmarks help you start. They should not decide when you are safe to ship.

Read the full post on QCode: https://qcode.in/how-to-build-ai-generated-3d-tools-without-trusting-benchmarks/

Top comments (1)

Harjot Singh • May 31

"Benchmarks as lead signals, not product truth" should be tattooed on every AI PM. The disconnect you name is the core one: a benchmark grades text against a reference file, but your product is asking a human to trust geometry, measurements, and editability, and a model that nails the benchmark can still produce a part that's dimensionally plausible and physically wrong. The reframe from "which model won" to "what errors can it make in MY workflow, and how cheaply can I catch them before the user pays" is the entire job. That's product evals: failure-mode-driven, not leaderboard-driven. For CAD specifically, the brutal part is that the dangerous errors are the confident-looking ones (a wall that's 10mm off, an unmanufacturable join) that pass a visual glance, so the eval has to check the properties that matter (does it close, does it fit, is it manufacturable), not just "looks like a chair." Build the eval around your specific downstream failures and run it as a gate, not a vibe. That product-evals-as-a-shipping-gate discipline is exactly how I think about quality in Moonshift. What's the cheapest automated check that catches the most 3D failures for you: geometric validity or constraint satisfaction?