William

Giving Robotic Dental Surgery a Memory: Building HELPIT on Qwen Cloud

The problem that started this

Robotic dental implant surgery isn't science fiction anymore — implant-placement robots are already in clinical use, drilling and seating implants with sub-millimeter precision. But precision was never the missing piece. Memory was.

Every one of these systems starts every procedure the same way: cold. The robot has no idea this patient had a failed implant on the other side of their jaw three years ago. It has no idea their bone density scan from six months ago looked different than it does today. It has no idea that the last patient with this exact bone-density class and this exact implant brand had a 30% higher rejection rate once torque went past 45 Ncm. That knowledge either lives in a dentist's head, in a paper chart nobody cross-references in real time, or nowhere at all.

So I built HELPIT — a memory layer that sits between the dentist, the patient record, and the robot, and refuses to let a procedure start without a full picture: this patient's own history, what's worked across thousands of similar patients, and a hard gate that can say pause or stop before anything happens.

What HELPIT actually does

Before any robotic procedure begins, HELPIT runs a memory-and-safety pipeline and hands the dentist a cited, evidence-backed brief — never a robot's own unsupervised call:

  1. Retrieve memory — episodic history for this patient (prior procedures, complications, healing trajectory) and semantic memory (population-level outcome patterns for this bone-density class), pulled in parallel.
  2. Evaluate the gate — a deterministic PROCEED / PAUSE / ESCALATE decision. A single hard contraindication (a material allergy, an anticoagulant flag, a scan contradiction) always outranks an otherwise-good composite score. This part is plain Python, not a model call — more on why below.
  3. Approve & monitor — the dentist reviews a cited brief and approves. Every step is tracked live against plan, with automatic pause on deviation (and a way to correct a mis-keyed value afterward without losing the audit trail).
  4. Consolidate outcomes — post-procedure results get written back into memory automatically, so the next patient with similar anatomy benefits from what was just learned.
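
The pause-on-deviation and correct-without-losing-history behavior in step 3 can be sketched with an append-only event log. This is a minimal illustration under stated assumptions, not HELPIT's actual code: the `StepEvent` and `AuditLog` names, the tolerance, and the torque figures are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class StepEvent:
    step: str
    planned: float
    measured: float
    corrects: Optional[int] = None  # index of the event this one supersedes


@dataclass
class AuditLog:
    tolerance: float  # hypothetical allowed |measured - planned|
    events: list = field(default_factory=list)

    def record(self, step, planned, measured, corrects=None):
        """Append an event; return True if the procedure should pause."""
        self.events.append(StepEvent(step, planned, measured, corrects))
        return abs(measured - planned) > self.tolerance

    def correct(self, index, measured):
        """Fix a mis-keyed value by appending a correction, never overwriting."""
        original = self.events[index]
        return self.record(original.step, original.planned, measured, corrects=index)

    def effective(self):
        """Events that have not been superseded by a later correction."""
        superseded = {e.corrects for e in self.events if e.corrects is not None}
        return [e for i, e in enumerate(self.events) if i not in superseded]


log = AuditLog(tolerance=5.0)
# A planned 35.0 is mis-keyed as 53.0, which trips the pause.
paused = log.record("seat_implant", planned=35.0, measured=53.0)
# The correction is a new event; the original entry stays in the log.
still_paused = log.correct(0, measured=35.0)
```

Because corrections are appended rather than applied in place, `log.events` keeps both the mis-keyed entry and its fix, while `log.effective()` gives the view a dashboard would show.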

Why the gate itself isn't an LLM call

This is the design decision I'd defend hardest. It would've been easy to have Qwen just... decide. Read the patient's history, read the population data, output PROCEED/PAUSE/ESCALATE. But a safety-critical branch that a model can silently get wrong, or that changes its answer on a re-roll, isn't something I was willing to ship.

Instead, gate_evaluator.py is deterministic:

if has_allergy_or_systemic_flag or has_anatomy_contradiction or risk_score >= settings.gate_risk_escalate_threshold:
    gate_result = GateResult.ESCALATE
elif (
    memory_completeness < settings.gate_memory_completeness_threshold
    or anatomy_match_score < settings.gate_anatomy_match_threshold
    or risk_score >= settings.gate_risk_pause_threshold
):
    gate_result = GateResult.PAUSE
else:
    gate_result = GateResult.PROCEED

Qwen's job is everything around that decision: writing the narrative that explains it, citing the specific records it's grounded in, generating the anatomy-drift summary. The decision itself is reproducible and testable — I have a whole pytest file whose entire point is proving a hard contraindication can never be masked by an otherwise-good score, no matter what a model outputs.
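
To make the "can never be masked" property concrete, here is a self-contained sketch of the same branch wrapped in a function, plus a test in the spirit described above. The `GateSettings` class, the `evaluate_gate` name, and every threshold value are assumptions for illustration; only the branch structure comes from the snippet above.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class GateResult(Enum):
    PROCEED = "PROCEED"
    PAUSE = "PAUSE"
    ESCALATE = "ESCALATE"


@dataclass(frozen=True)
class GateSettings:
    # Hypothetical thresholds, chosen only so the example runs.
    gate_risk_escalate_threshold: float = 0.8
    gate_risk_pause_threshold: float = 0.5
    gate_memory_completeness_threshold: float = 0.7
    gate_anatomy_match_threshold: float = 0.85


def evaluate_gate(*, has_allergy_or_systemic_flag, has_anatomy_contradiction,
                  risk_score, memory_completeness, anatomy_match_score,
                  settings=GateSettings()):
    # Hard contraindications are checked first, so no score can override them.
    if (has_allergy_or_systemic_flag or has_anatomy_contradiction
            or risk_score >= settings.gate_risk_escalate_threshold):
        return GateResult.ESCALATE
    if (memory_completeness < settings.gate_memory_completeness_threshold
            or anatomy_match_score < settings.gate_anatomy_match_threshold
            or risk_score >= settings.gate_risk_pause_threshold):
        return GateResult.PAUSE
    return GateResult.PROCEED


def test_hard_contraindication_is_never_masked():
    # Sweep every score combination, including the best possible ones.
    scores = [0.0, 0.5, 1.0]
    flags = [(True, False), (False, True), (True, True)]
    for risk, completeness, match in product(scores, repeat=3):
        for allergy, contradiction in flags:
            result = evaluate_gate(
                has_allergy_or_systemic_flag=allergy,
                has_anatomy_contradiction=contradiction,
                risk_score=risk,
                memory_completeness=completeness,
                anatomy_match_score=match,
            )
            assert result is GateResult.ESCALATE


clean_case = evaluate_gate(
    has_allergy_or_systemic_flag=False,
    has_anatomy_contradiction=False,
    risk_score=0.1,
    memory_completeness=0.95,
    anatomy_match_score=0.95,
)
```

Since the function takes no model output as input, the test exercises the whole decision surface; nothing an LLM writes afterward can change what it returns.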

How Qwen Cloud actually gets used, model by model

I didn't want one model doing everything by default — a hackathon "just call qwen-plus for whatever" version of this would've been both slower and less honest about what each call actually needs. So the routing is deliberate:

| Model | Used for | Why this one |
| --- | --- | --- |
| qwen3.7-max | The surgery-ready brief | The single highest-stakes generation call in the pipeline, worth the deepest reasoning model available |
| qwen3.6-flash | The anatomy-drift narrative | A short, grounded summary that doesn't need the flagship model. This call used to (wastefully) go to the vision model with zero images attached, before I caught and fixed it |
| qwen-vl-plus | X-ray/CBCT analysis, post-op video-frame review | The only calls that actually carry images |
| wan2.6-t2v | AI-generated procedure preview video | Not a chat-completion model at all: it goes through DashScope's native async task API (submit task, poll by task_id) instead of the OpenAI-compatible surface everything else uses |

The anatomy narrative and the surgery brief run concurrently via asyncio.gather since neither depends on the other's output — that alone keeps end-to-end gate evaluation to about 9 seconds instead of stacking two model calls sequentially.
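
The effect of running the two independent calls together can be shown with stand-ins. In this sketch `fake_model_call` is a hypothetical placeholder that sleeps instead of calling any API, and the durations are arbitrary; the point is only that `asyncio.gather` makes total wall time track the slower call rather than the sum.

```python
import asyncio
import time


async def fake_model_call(name: str, seconds: float) -> str:
    """Stand-in for a network-bound model call; sleeps instead of calling an API."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def run_concurrently():
    start = time.perf_counter()
    # Neither coroutine needs the other's output, so both can be in flight at once.
    narrative, brief = await asyncio.gather(
        fake_model_call("anatomy_narrative", 0.2),
        fake_model_call("surgery_brief", 0.3),
    )
    return narrative, brief, time.perf_counter() - start


narrative, brief, elapsed = asyncio.run(run_concurrently())
```

`asyncio.gather` returns results in the order the awaitables were passed, so unpacking into `narrative, brief` is safe regardless of which call finishes first.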

The bug that taught me the most about Qwen3.x specifically

When I switched the brief generator to qwen3.7-max (to satisfy the hackathon's mandated-model requirement), gate evaluation quietly went from ~9 seconds to ~50 seconds. Same prompt, same output quality, more than 5x slower.

The cause: the qwen3.x family defaults to an extended chain-of-thought "thinking" pass before the actual answer. Great for open-ended reasoning. Completely wasted here, because the decision is deterministic Python — there's nothing for extended thinking to improve when the model is only ever writing narrative around an already-computed answer.

response = await _client.chat.completions.create(
    model=model or settings.model_reasoning,
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens,
    extra_body={"enable_thinking": False},
)

One flag, verified with direct before/after timing against the live API: back to ~9 seconds, same eval-harness accuracy (100% gate accuracy, 0% false-PROCEED rate, both before and after). I would not have caught this without actually timing real API calls instead of trusting that "the mandated model" meant "just point at it and move on."
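
A before/after comparison like this needs very little tooling. The sketch below is one way to structure it, assuming a hypothetical `timed` helper and a `stub_call` coroutine that sleeps rather than hitting the live API; the sleep durations are placeholders, not measurements.

```python
import asyncio
import statistics
import time


async def timed(make_call, runs=3):
    """Return the median wall-clock seconds over `runs` awaits of make_call()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        await make_call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


async def stub_call(thinking: bool):
    # Placeholder latencies only; swap in the real client call to measure for real.
    await asyncio.sleep(0.05 if thinking else 0.01)


async def compare():
    with_thinking = await timed(lambda: stub_call(True))
    without_thinking = await timed(lambda: stub_call(False))
    return with_thinking, without_thinking


with_thinking, without_thinking = asyncio.run(compare())
```

Taking the median over a few runs keeps one slow network round-trip from skewing the comparison.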

Making MCP actually real, not just decorative

HELPIT has 7 FastMCP tool servers (patient memory, imaging, procedure tracking, semantic knowledge base, risk scoring, dentist preferences, outcome tracking). To be honest, the first version only ever called them as plain Python functions, in-process: @mcp.tool() was doing nothing but organizing code. It worked, but it wasn't really "using MCP."

So two of the servers now also carry a real standalone entrypoint:

if __name__ == "__main__":
    mcp.run(transport="stdio")

And I wrote a client script that proves it — a genuine MCP initialize() handshake, list_tools(), call_tool(), over the actual protocol, against the same live SQLite database the running app uses:

async def call_server(module, tool_name, arguments):
    params = StdioServerParameters(command=sys.executable, args=["-m", module], cwd=str(REPO_ROOT))
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(tool_name, arguments)
            return result  # hand the tool result back to the caller

Running it against patient_memory_server returns the same five seeded patients the web app shows — a standalone MCP client and the FastAPI app reading identical memory through two completely different transports.

Real anatomy, and the decimation algorithm nobody asked for

I wanted the 3D tooth/implant previews to be actual anatomy, not procedural boxes-and-spheres. Real scanned models solved the "does this look like a tooth" problem instantly — but the full maxillary arch model I sourced (public domain, NIH 3D) was 951,316 vertices at 34MB, straight off a real scan. Fine for a CAD viewer. Not fine for a browser tab already auto-rotating four other 3D canvases.

three.js's standard edge-collapse simplifier was still running after 20+ minutes on this mesh. So I wrote a grid-based vertex-clustering decimator instead — snap every vertex to a 3D grid cell sized as a fraction of the mesh's bounding-box diagonal, average every vertex landing in the same cell into one, remap triangle indices through the old→new mapping, drop anything that degenerated:

cell_size = diagonal * 0.0026
cell = tuple(int((v[axis] - bbox_min[axis]) / cell_size) for axis in range(3))
# accumulate all vertices sharing a cell, average, remap triangle indices

951,316 → 128,720 vertices (13.5%), 1.9M → 264K triangles, 34MB → 13.2MB, runtime a few seconds instead of tens of minutes. It's not a subtle algorithm — it's the simplest possible spatial hash — but "simplest thing that actually finishes in a reasonable time" beat "the textbook-correct simplifier" for what this needed to do.
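
For readers who want the whole idea in one place, here is a pure-Python sketch of grid-based vertex clustering. It is a reconstruction from the description above, not the script used on the real mesh: the `decimate` name and signature are assumptions, and only the `0.0026` default comes from the snippet.

```python
import math


def decimate(vertices, triangles, cell_fraction=0.0026):
    """Grid-based vertex clustering.

    vertices: list of (x, y, z) tuples; triangles: list of (i, j, k) index tuples.
    Returns (new_vertices, new_triangles).
    """
    bbox_min = [min(v[a] for v in vertices) for a in range(3)]
    bbox_max = [max(v[a] for v in vertices) for a in range(3)]
    diagonal = math.dist(bbox_min, bbox_max)
    cell_size = diagonal * cell_fraction or 1.0  # guard against a zero-size mesh

    cell_to_new = {}  # grid cell -> new vertex index
    sums = []         # per new vertex: [sum_x, sum_y, sum_z, count]
    old_to_new = []   # old vertex index -> new vertex index
    for v in vertices:
        cell = tuple(int((v[a] - bbox_min[a]) / cell_size) for a in range(3))
        new_index = cell_to_new.setdefault(cell, len(sums))
        if new_index == len(sums):  # first vertex seen in this cell
            sums.append([0.0, 0.0, 0.0, 0])
        acc = sums[new_index]
        for a in range(3):
            acc[a] += v[a]
        acc[3] += 1
        old_to_new.append(new_index)

    # Each new vertex is the average of every old vertex that shared its cell.
    new_vertices = [(s[0] / s[3], s[1] / s[3], s[2] / s[3]) for s in sums]

    new_triangles = []
    for i, j, k in triangles:
        a, b, c = old_to_new[i], old_to_new[j], old_to_new[k]
        if a != b and b != c and a != c:  # drop triangles that collapsed
            new_triangles.append((a, b, c))
    return new_vertices, new_triangles


# Tiny demo: vertices 0 and 1 are close enough to land in the same cell.
verts = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (1.0, 0.0, 0.0),
         (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
tris = [(0, 1, 2), (1, 2, 3), (1, 3, 4)]
new_verts, new_tris = decimate(verts, tris, cell_fraction=0.25)
```

The trade-off is that a fixed grid ignores curvature, so fine features get the same treatment as flat regions; an edge-collapse simplifier preserves shape better but, as noted above, did not finish in a usable time on this mesh.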

Where it stands right now

  • 100% gate accuracy, 0% false-PROCEED rate on a labeled 5-patient eval harness (the single most dangerous failure mode — a false PROCEED on a patient who should PAUSE or ESCALATE — hard-fails the eval run if it's ever non-zero)
  • 26 pytest tests, gating every deploy in CI, including a permanent regression test for a genuinely nasty SQLAlchemy bug (mutating a JSON column's cached list in place and reassigning the same object reference is a silent no-op for the ORM's dirty-tracking — the API response looked correct, the database write just never happened)
  • Live on Alibaba Cloud ECS, redeployed automatically on every push to main

What's next

  • Real EHR integration instead of synthetic patient data
  • A second reference implementation of the same episodic + semantic + gate pattern in a non-dental domain, to make the "this architecture generalizes" claim concrete instead of just argued
  • CosyVoice narration for the brief (deprioritized this round — it needs a WebSocket protocol rather than a simple REST call)

I'll keep updating this post as the build continues. If you're working with Qwen Cloud's model family and have run into the enable_thinking latency trap yourself, or have opinions on where the PROCEED/PAUSE/ESCALATE line should actually sit for a safety-critical gate — I'd genuinely like to hear about it in the comments.
