William

Posted on Jul 3 • Edited on Jul 6

Building HELPIT, a memory layer for robotic dental surgery on Qwen Cloud — and the bugs that taught me more than the features did.

#ai #alibabacloud #qwencloud #python

The problem that started this

Picture the second right before a robotic arm lowers a drill bit toward a patient's jaw. It has sub-millimeter precision. It has a plan, loaded and ready. What it doesn't have is any idea that this same patient's last implant failed on the other side of their mouth three years ago.

That's not hypothetical. Robotic dental implant surgery isn't science fiction anymore — implant-placement robots are already in clinical use, drilling and seating implants with precision no human hand can match. But precision was never the missing piece. Memory was.

Every one of these systems starts every procedure the exact same way: cold. The robot has no idea this patient's bone density scan looked different six months ago than it does today. It has no idea that past patients with this exact bone-density class and this exact implant brand had a 30% higher rejection rate once torque crossed 45 Ncm. That knowledge lives in one dentist's head, in a paper chart nobody cross-references mid-procedure, or nowhere at all. The result is a robot with world-class precision and the situational awareness of a stranger.

So I built HELPIT — a memory layer that sits between the dentist, the patient record, and the robot, and refuses to let a procedure start without the full picture: this patient's own history, what's worked across thousands of similar patients, and a hard gate that can say pause or stop before anything happens.

What HELPIT actually does

Before any robotic procedure begins, HELPIT runs a memory-and-safety pipeline and hands the dentist a cited, evidence-backed brief — never a robot's own unsupervised call:

Retrieve memory — episodic history for this patient (prior procedures, complications, healing trajectory) and semantic memory (population-level outcome patterns for this bone-density class), pulled in parallel.
Evaluate the gate — a deterministic PROCEED / PAUSE / ESCALATE decision. One hard contraindication (a material allergy, an anticoagulant flag, a scan contradiction) always outranks an otherwise-good composite score. This part is plain Python, not a model call — more on why below, because it's the decision I'd defend hardest.
Approve & monitor — the dentist reviews a cited brief and approves. Every step is tracked live against plan, with automatic pause on deviation — and, mid-procedure, a live co-pilot that checks whether anything like this deviation has happened before and speaks back what actually worked last time.
Consolidate outcomes — post-procedure results get written back into memory automatically, so the next patient with similar anatomy benefits from what was just learned.

Why the gate itself isn't an LLM call

This is the design decision I'd defend hardest, so let me actually defend it.

It would have been trivial to have Qwen just... decide. Feed it the patient's history and the population data, ask for PROCEED, PAUSE, or ESCALATE, done. It would have demoed fine. It also would have been the kind of decision that quietly changes on a re-roll, or that a model gets subtly wrong in a way nobody notices until it matters. A safety-critical branch that can silently drift isn't something I was willing to ship, hackathon or not.

So gate_evaluator.py decides nothing with a model. It's deterministic Python, full stop:

if has_allergy_or_systemic_flag or has_anatomy_contradiction or risk_score >= settings.gate_risk_escalate_threshold:
    gate_result = GateResult.ESCALATE
elif (
    memory_completeness < settings.gate_memory_completeness_threshold
    or anatomy_match_score < settings.gate_anatomy_match_threshold
    or risk_score >= settings.gate_risk_pause_threshold
):
    gate_result = GateResult.PAUSE
else:
    gate_result = GateResult.PROCEED

Qwen's job is everything around that decision — writing the narrative that explains it, citing the specific records it's grounded in, generating the anatomy-drift summary a dentist actually reads. The decision itself is reproducible, testable, and immune to a bad generation. I have an entire pytest file whose only job is proving a hard contraindication can never be masked by an otherwise-good composite score, no matter what any model outputs on any given run.
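
For anyone who wants to run that branch in isolation, here is a minimal self-contained sketch. Only the branch structure and the setting names come from the excerpt above. The `GateSettings` dataclass, its threshold values, and the `evaluate_gate` function name are assumptions made up for illustration, not HELPIT's real configuration.

```python
from dataclasses import dataclass
from enum import Enum


class GateResult(Enum):
    PROCEED = "PROCEED"
    PAUSE = "PAUSE"
    ESCALATE = "ESCALATE"


@dataclass(frozen=True)
class GateSettings:
    # Placeholder thresholds, chosen only so the example runs.
    gate_risk_escalate_threshold: float = 0.8
    gate_risk_pause_threshold: float = 0.5
    gate_memory_completeness_threshold: float = 0.7
    gate_anatomy_match_threshold: float = 0.75


def evaluate_gate(
    *,
    has_allergy_or_systemic_flag: bool,
    has_anatomy_contradiction: bool,
    risk_score: float,
    memory_completeness: float,
    anatomy_match_score: float,
    settings: GateSettings = GateSettings(),
) -> GateResult:
    # Hard contraindications are checked first, so no score can mask them.
    if (
        has_allergy_or_systemic_flag
        or has_anatomy_contradiction
        or risk_score >= settings.gate_risk_escalate_threshold
    ):
        return GateResult.ESCALATE
    # Incomplete memory, weak anatomy match, or moderate risk all pause.
    if (
        memory_completeness < settings.gate_memory_completeness_threshold
        or anatomy_match_score < settings.gate_anatomy_match_threshold
        or risk_score >= settings.gate_risk_pause_threshold
    ):
        return GateResult.PAUSE
    return GateResult.PROCEED
```

A test in the spirit of the one described above would pass a perfect composite (zero risk, full completeness, full anatomy match) together with an allergy flag and assert that the result is still `GateResult.ESCALATE`.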

How Qwen Cloud actually gets used, model by model

I didn't want one model doing everything by default. A hackathon build where every call just goes to "whichever model is handy" would have been both slower and dishonest about what each call actually needs. So the routing here is deliberate, not default:

| Model | Used for | Why this one |
| --- | --- | --- |
| `qwen3.7-max` | The surgery-ready brief | The single highest-stakes generation call in the pipeline, worth the deepest reasoning model available |
| `qwen3.6-flash` | The anatomy-drift narrative, and the live co-pilot's mid-procedure advisory | Short, grounded output that doesn't need the flagship model. For the co-pilot, latency is the feature: a surgeon mid-procedure needs one sentence in under two seconds, not a paragraph in ten |
| `qwen-vl-plus` | X-ray/CBCT analysis, post-op video-frame review | The only calls that actually carry images |
| `text-embedding-v3` | Cross-patient case recall and deviation-event recall | Ranks by cosine similarity against every other patient's history: a practice-wide semantic index, not a per-patient linear scan |
| `wan2.6-t2v` | AI-generated, narrated procedure preview video | Not a chat-completion model at all. It goes through DashScope's native async task API (submit, poll by `task_id`) instead of the OpenAI-compatible surface everything else uses |
| `cosyvoice-v3-plus` | Spoken co-pilot advisories and video-analysis Q&A | The one model in this stack that talks back, over a WebSocket rather than a REST call |
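
The cosine-similarity ranking mentioned for `text-embedding-v3` can be sketched in a few lines of plain Python. This is an illustration of the ranking step only: the function names and the tiny two-dimensional vectors are assumptions, and in the real system the vectors would come from the embedding model rather than being typed in by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_cases(query_vec, case_index, top_k=3):
    """Return the top_k (case_id, score) pairs, most similar first.

    case_index maps a hypothetical case id to its embedding vector.
    """
    scored = [
        (case_id, cosine_similarity(query_vec, vec))
        for case_id, vec in case_index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Calling `rank_cases(query_vec, case_index)` with the current patient's embedding returns the most similar prior cases across the whole index, which is the "practice-wide" recall the table describes.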

The anatomy narrative and the surgery brief run concurrently via asyncio.gather, since neither depends on the other's output — that alone keeps end-to-end gate evaluation to about 9 seconds instead of stacking two model calls sequentially.
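
A minimal sketch of that concurrency pattern, with `asyncio.sleep` standing in for the two model calls. The coroutine names and the delay are hypothetical; the point is only that two independent awaits under `asyncio.gather` take roughly as long as the slower one, not the sum of both.

```python
import asyncio
import time


async def generate_anatomy_narrative(delay):
    await asyncio.sleep(delay)  # stand-in for the anatomy-drift model call
    return "anatomy narrative"


async def generate_surgery_brief(delay):
    await asyncio.sleep(delay)  # stand-in for the surgery-brief model call
    return "surgery brief"


async def build_gate_narratives(delay=0.3):
    start = time.perf_counter()
    # Neither call needs the other's output, so run them concurrently.
    narrative, brief = await asyncio.gather(
        generate_anatomy_narrative(delay),
        generate_surgery_brief(delay),
    )
    elapsed = time.perf_counter() - start
    return narrative, brief, elapsed
```

Running `asyncio.run(build_gate_narratives())` should report an elapsed time close to one delay rather than two.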

The bug that taught me the most about Qwen3.x specifically

When I switched the brief generator to qwen3.7-max, to use the strongest reasoning model for the highest-stakes call, gate evaluation quietly went from ~9 seconds to ~50 seconds. Same prompt. Same output quality. More than five times slower, for nothing.

The cause: the Qwen3.x family defaults to an extended chain-of-thought "thinking" pass before the actual answer. Genuinely useful for open-ended reasoning. Completely wasted here, because the decision is already deterministic Python by the time the model is called — there's nothing left for extended thinking to improve when the model's only job is writing narrative around an answer that's already been computed.

response = await _client.chat.completions.create(
    model=model or settings.model_reasoning,
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens,
    extra_body={"enable_thinking": False},
)

One flag. Verified with direct before/after timing against the live API: straight back to ~9 seconds, identical eval-harness accuracy — 100% gate accuracy, 0% false-PROCEED rate, both before and after. I would not have caught this by trusting that "point at the flagship model" meant the job was done. I caught it by actually timing real calls against a real API and refusing to believe a 5x slowdown was just "how it is."
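
The before/after timing can be done with a very small helper. The sketch below is an assumption about how such a check might look, not the script that was actually used: `time_call` and `fake_model_call` are hypothetical names, and `asyncio.sleep` stands in for the real API request.

```python
import asyncio
import statistics
import time


async def time_call(make_call, runs=3):
    """Await make_call() several times and return the median duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        await make_call()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


async def fake_model_call(latency):
    # Stand-in for a chat-completion request with a given latency.
    await asyncio.sleep(latency)


async def compare(slow_latency=0.05, fast_latency=0.01):
    with_thinking = await time_call(lambda: fake_model_call(slow_latency))
    without_thinking = await time_call(lambda: fake_model_call(fast_latency))
    return with_thinking, without_thinking
```

Swapping `fake_model_call` for the real request, once with thinking enabled and once with `enable_thinking` set to `False`, gives the two numbers to compare. Taking the median over a few runs keeps one slow network round trip from skewing the result.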

Making MCP actually real, not just decorative

HELPIT runs on 7 FastMCP tool servers: patient memory, imaging, procedure tracking, a semantic knowledge base, risk scoring, dentist preferences, outcome tracking. To be honest about it, the first version only ever called them as plain Python functions, in-process. @mcp.tool() was doing nothing but organizing code into folders with an official-looking decorator on top. Functionally fine. Not actually "using MCP" in any way a skeptical reader should accept.

So two of the servers now also carry a real standalone entrypoint:

if __name__ == "__main__":
    mcp.run(transport="stdio")

And I wrote a client script that proves it isn't decorative — a genuine MCP initialize() handshake, list_tools(), call_tool(), over the actual protocol, against the same live SQLite database the running app uses:

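import sys
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def call_server(module, tool_name, arguments):
    # Launch the server module as a subprocess and talk to it over stdio.
    params = StdioServerParameters(command=sys.executable, args=["-m", module], cwd=str(REPO_ROOT))
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            return await session.call_tool(tool_name, arguments)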
Running it against patient_memory_server returns the same five seeded patients the web app shows. A standalone MCP client and the FastAPI app, reading identical memory through two completely different transports. That's the difference between "we used MCP" as a slide bullet and "we used MCP" as something you can actually go run.

Real anatomy, and the decimation algorithm nobody asked for

I wanted the 3D tooth and implant previews to be actual anatomy, not procedural boxes-and-spheres wearing a tooth costume. Three separate attempts at hand-built geometry all read as "blocky toy" or, memorably, "light bulb" — a fully round, radially-symmetric shape structurally can't represent a tooth's flat mesial, distal, buccal, and lingual walls. No amount of clever code fixes that. Only a real scan does.

So I sourced one — a full maxillary arch, public domain, from NIH 3D. 951,316 vertices at 34MB, straight off a real scan. Beautiful. Also completely unusable: a browser tab auto-rotating that mesh next to several other live 3D canvases ground to a crawl. Three.js's standard edge-collapse simplifier was still churning after twenty minutes and counting.

So I wrote a decimator instead — the simplest one that could possibly work. Snap every vertex to a 3D grid cell sized as a fraction of the mesh's bounding-box diagonal. Average every vertex that lands in the same cell into one point. Remap every triangle's indices through that old-to-new mapping. Drop anything that degenerated into a zero-area sliver in the process.

cell_size = diagonal * 0.0026
cell = tuple(int((v[axis] - bbox_min[axis]) / cell_size) for axis in range(3))
# accumulate all vertices sharing a cell, average, remap triangle indices

951,316 vertices became 128,720 (13.5%). 1.9 million triangles became 264,000. 34MB became 13.2MB. Twenty-plus minutes became a few seconds. It is not a clever algorithm — it's arguably the crudest spatial hash you could write — and that's exactly the point. The textbook-correct simplifier was strictly better geometry and strictly useless for what I actually needed, which was "finishes before I lose interest in waiting for it."
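
A complete, dependency-free sketch of that grid-clustering idea follows. It is written from the description above, not copied from the actual script, and the function name and signature are assumptions. One caveat: it drops triangles whose corners collapsed into the same cell, which is the common degenerate case, but it does not test for distinct collinear points.

```python
import math


def decimate(vertices, triangles, cell_fraction=0.0026):
    """Cluster vertices on a uniform grid and remap triangles.

    vertices: list of (x, y, z) tuples; triangles: list of index triples.
    """
    bbox_min = [min(v[axis] for v in vertices) for axis in range(3)]
    bbox_max = [max(v[axis] for v in vertices) for axis in range(3)]
    cell_size = math.dist(bbox_min, bbox_max) * cell_fraction
    if cell_size == 0:
        return list(vertices), list(triangles)  # all points coincide

    cell_to_new = {}  # grid cell -> index of its merged vertex
    sums, counts, old_to_new = [], [], []
    for v in vertices:
        cell = tuple(int((v[axis] - bbox_min[axis]) / cell_size) for axis in range(3))
        new_index = cell_to_new.get(cell)
        if new_index is None:
            new_index = len(sums)
            cell_to_new[cell] = new_index
            sums.append([0.0, 0.0, 0.0])
            counts.append(0)
        for axis in range(3):
            sums[new_index][axis] += v[axis]
        counts[new_index] += 1
        old_to_new.append(new_index)

    # Each merged vertex is the average of everything that fell in its cell.
    new_vertices = [tuple(s / n for s in total) for total, n in zip(sums, counts)]

    new_triangles = []
    for a, b, c in triangles:
        i, j, k = old_to_new[a], old_to_new[b], old_to_new[c]
        if i != j and j != k and i != k:  # skip collapsed triangles
            new_triangles.append((i, j, k))
    return new_vertices, new_triangles
```

On a mesh of this size a NumPy version would be considerably faster, but the pure-Python form makes each of the four steps easy to see.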

Where it stands right now

100% gate accuracy, 0% false-PROCEED rate on a labeled eval harness — the single most dangerous failure mode a system like this can have (a false PROCEED on a patient who should PAUSE or ESCALATE) is checked explicitly, and hard-fails the eval run the moment it's ever non-zero.
42 pytest tests, gating every deploy in CI — including a permanent regression test for a genuinely nasty SQLAlchemy bug: mutating a JSON column's cached list in place and reassigning the same object reference is a silent no-op for the ORM's dirty-tracking. The API response looked completely correct. The database write just never happened, and nothing said otherwise until a second, independent read proved it.
A live memory co-pilot that speaks up the instant a step deviates mid-procedure, grounded in what actually happened the last time something similar occurred — and honest enough to say "nothing similar on file" instead of fabricating confidence when it has nothing to go on.
Deployed on Alibaba Cloud ECS, redeployed automatically on every push to main, gated by that same test suite.
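
The eval-harness check described in the first item above can be sketched as follows. Everything here is an assumption for illustration: the names are hypothetical, labels are plain strings, and the false-PROCEED rate is defined as the share of cases that should have been PAUSE or ESCALATE but were predicted PROCEED, which may differ from how HELPIT's harness computes it.

```python
from dataclasses import dataclass


class FalseProceedError(RuntimeError):
    """Raised when any case that should not proceed was given PROCEED."""


@dataclass(frozen=True)
class EvalReport:
    accuracy: float
    false_proceed_rate: float


def score_gate_eval(cases):
    """cases: iterable of (expected_label, predicted_label) pairs."""
    cases = list(cases)
    if not cases:
        raise ValueError("eval set is empty")
    correct = sum(1 for expected, predicted in cases if expected == predicted)
    # Only cases that should NOT proceed can produce a false PROCEED.
    must_not_proceed = [pair for pair in cases if pair[0] != "PROCEED"]
    false_proceeds = sum(1 for _, predicted in must_not_proceed if predicted == "PROCEED")
    rate = false_proceeds / len(must_not_proceed) if must_not_proceed else 0.0
    return EvalReport(accuracy=correct / len(cases), false_proceed_rate=rate)


def enforce_no_false_proceed(report):
    # Hard-fail the run: a single false PROCEED is unacceptable.
    if report.false_proceed_rate > 0:
        raise FalseProceedError(
            f"false-PROCEED rate is {report.false_proceed_rate:.1%}, expected 0%"
        )
```

Keeping the scoring and the hard failure in separate functions means the report can still be logged before the run is aborted.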

What's next

Real EHR integration instead of synthetic patient data.
A second reference implementation of the same episodic + semantic + gate + live-co-pilot pattern in a non-dental domain, to make the "this architecture generalizes" claim concrete instead of just argued.
Feeding the live co-pilot a real practice's historical deviation/outcome records on day one, instead of starting its memory from zero.

If you're working with Qwen Cloud's model family and have run into the enable_thinking latency trap yourself, or have opinions on where the PROCEED/PAUSE/ESCALATE line should actually sit for a safety-critical gate — I'd genuinely like to hear about it in the comments.

DEV Community