Three small models for healthcare intake — and what shipping all three taught me

Two months ago I started a portfolio project: build three small specialized language models for healthcare practice intake, benchmark each one honestly against frontier APIs, and write about what I learned. The goal was to build the case that small specialized models still have a place in the 2026 toolkit alongside frontier LLMs — not as replacements, but as the first stage of a hybrid pipeline.

This is the post about the third model. It's also the post about the suite — what worked across all three, what didn't, and the pattern that emerged.

The three models, all on Hugging Face:

  1. clarioscope-intent-deberta-v1 — 184M DeBERTa-v3-base, 7-class intent classification. Within 4 pp of Claude Haiku 4.5, 22× faster on CPU. methodology post →
  2. clarioscope-phi-deberta-v1 — 125M RoBERTa-base, 18-category PHI span detection (HIPAA Safe Harbor). Loses on aggregate but triples frontier F1 on geographic locations. methodology post →
  3. clarioscope-insurance-v1 — 125M RoBERTa-base, 12-field insurance / billing extraction. This post.

What the third model does

A 125M-parameter RoBERTa fine-tune that extracts twelve insurance/billing fields from patient text: carrier name, plan type, member ID, group number, policy number, subscriber name, relationship, claim ID, prior-auth number, copay, deductible, and billed amount. Output is BIO-tagged token spans, which downstream code converts into a JSON object a billing system can ingest directly.
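For a sense of what that conversion step looks like, here's a minimal sketch using the transformers token-classification pipeline. The repo id and the label names in the comments are illustrative; the model card lists the exact tag set.

```python
from transformers import pipeline

# Minimal sketch (illustrative repo id): load the extractor and let the
# pipeline merge B-/I- tokens into whole spans.
ner = pipeline(
    "token-classification",
    model="your-username/clarioscope-insurance-v1",  # hypothetical HF repo id
    aggregation_strategy="simple",
)

def extract_fields(text: str) -> dict:
    """Convert predicted spans into a flat, billing-system-friendly dict."""
    record = {}
    for span in ner(text):
        field = span["entity_group"].lower()            # e.g. "member_id", "copay"
        record.setdefault(field, span["word"].strip())  # keep first occurrence per field
    return record

print(extract_fields("Aetna PPO, member ID AET-998-2210, copay $35 per visit."))
```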

The benchmark, on a 672-example held-out test set:

| Model | Macro F1 | Weighted F1 | Latency / example | Cost / 1K inferences |
|---|---|---|---|---|
| clarioscope-insurance-v1 (CPU) | 0.7882 | 0.8202 | 45.4 ms | $0.00 |
| gpt-4o-2024-11-20 | 0.9562 | 0.9572 | 1202 ms | $1.90 |

Anthropic API credit ran out before the Claude Haiku 4.5 / Sonnet 4.6 comparisons could run; those land in v1.1.

Same speed/cost shape as the other two models in the suite: ~26× faster than GPT-4o, $0 marginal cost. The accuracy gap is concentrated in a small number of low-frequency fields.

Per-entity F1: insurance extractor vs GPT-4o

The fine-tune is competitive on the high-volume fields: CLAIM_ID (0.95 vs 1.00), MEMBER_ID (0.91 vs 0.99), CARRIER (0.91 vs 0.96), and SUBSCRIBER_NAME (0.89 vs 0.91, essentially a tie). These four fields collectively cover ~70% of the test entities.

The gap is concentrated in a few low-volume fields. AUTH_NUMBER is the standout weakness: 0.30 vs 0.99. The training set has only 770 AUTH_NUMBER spans and the format space is wide (PA-4421, auth #998-2210, AUTH998212, etc.). Same structured-ID problem as the PHI detector had with MRN. PLAN_TYPE is similar: short strings like "PPO", "HMO" with overloaded surface forms.
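I'm not reproducing the full evaluation harness here, but per-entity span F1 of this kind is typically computed with seqeval; a minimal sketch with made-up tag sequences (the real evaluation runs over the full 672-example test set):

```python
from seqeval.metrics import classification_report

# Illustrative BIO sequences only; real gold/predicted tags come from the test set.
y_true = [["B-MEMBER_ID", "I-MEMBER_ID", "O", "B-COPAY"]]
y_pred = [["B-MEMBER_ID", "I-MEMBER_ID", "O", "O"]]

# Prints span-level precision/recall/F1 per entity type plus macro/weighted averages.
print(classification_report(y_true, y_pred, digits=4))
```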

Three patterns that repeated across all three models

1. Synthetic data is fast and noisy, and the noise is systematic

In all three models, gpt-4o-mini introduced systematic quality problems I had to discover and fix:

  • Intent classifier (Model 1): the generator skewed heavily toward a "ChatGPT-polite" message style on first attempts. Fixed by adding a mandatory realism mix (40/40/20 polished / casual / messy) to the generation prompt.
  • PHI detector (Model 2): the LLM included cue words in entity spans — "MRN 8472301" annotated as the MRN span instead of "8472301". About 8.6% of training spans had this contamination. Fixed by clean_data.py (cue-word stripping + re-locating spans).
  • Insurance extractor (Model 3): same cue-word noise pattern as PHI — "member ID AET-998-2210" instead of "AET-998-2210", "copay $35" instead of "$35". 7.4% of spans needed cleanup. Same fix.

Lesson: when synthetic data is the input, label QA is part of the pipeline. The LLM that generates the annotations does not produce ground truth; it produces a draft that humans (or scripts) need to validate. The clean_data.py I shipped for Models 2 and 3 will now be part of every synthetic NER project I build; a sketch of its core idea is below.
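A minimal sketch of the cue-word cleanup; the cue lists, field names, and offsets here are illustrative, not the shipped script:

```python
import re

# Illustrative cue lists; the shipped clean_data.py covers more fields and phrasings.
CUE_WORDS = {
    "MEMBER_ID": ["member id", "member #", "id"],
    "COPAY": ["copay", "co-pay"],
    "AUTH_NUMBER": ["prior auth", "authorization", "auth"],
}

def strip_cue(text: str, start: int, end: int, label: str) -> tuple[int, int]:
    """Drop a leading cue phrase from an annotated span and shift the start offset."""
    span = text[start:end]
    for cue in CUE_WORDS.get(label, []):
        m = re.match(rf"{re.escape(cue)}\s*[:#]?\s*", span, flags=re.IGNORECASE)
        if m:
            return start + m.end(), end
    return start, end

text = "Please bill member ID AET-998-2210 with a copay $35."
start, end = strip_cue(text, 12, 34, "MEMBER_ID")  # noisy span: "member ID AET-998-2210"
print(text[start:end])  # "AET-998-2210"
```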

2. Cross-generator test sets are not optional — and val numbers lie

Comparing the same-generator val split to the cross-generator test set, val macro F1 ran 17–23 percentage points above test macro F1 for the two span-extraction models:

| Model | Val macro F1 | Test macro F1 | Gap |
|---|---|---|---|
| Intent classifier | 0.886 | 0.911 | -0.025 (test actually higher) |
| PHI detector | 0.863 | 0.630 | +0.233 |
| Insurance extractor | 0.957 | 0.788 | +0.169 |

The intent classifier was the exception — classification with 7 categories is more robust than span extraction with 12+ categories. For both span-extraction models, the val numbers from a same-generator split would have produced overconfident model cards.

Lesson: same-generator val splits are useful for early development feedback, but the headline number that goes on a model card should be from a held-out set generated by a different model with a different prompt style. Otherwise the benchmark inflates and you'll be surprised in production.
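In practice this just means carrying a generator tag on every record and never letting the headline eval touch the same generator as the training data. A minimal sketch; the file name and field names are assumptions, not the project's actual data layout:

```python
import json
import random

# Assumed layout: one JSON object per line, each tagged with the model that generated it.
with open("all_examples.jsonl") as f:
    records = [json.loads(line) for line in f]

train_pool = [r for r in records if r["generator"] == "gpt-4o-mini"]   # dev-time data
holdout    = [r for r in records if r["generator"] != "gpt-4o-mini"]   # headline test set

random.seed(13)
random.shuffle(train_pool)
n_val = int(0.1 * len(train_pool))
val, train = train_pool[:n_val], train_pool[n_val:]

print(f"{len(train)} train / {len(val)} val (same generator, dev feedback only)")
print(f"{len(holdout)} test (different generator, goes on the model card)")
```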

3. Small models beat frontier on linguistic entities, lose on structured-ID memorization

This pattern showed up clearest in the PHI detector and was the central observation in that model's writeup. The insurance extractor repeats it:

  • Linguistic + bounded vocabulary fields (CARRIER from a short list of insurance companies, CLAIM_ID with predictable claim patterns, SUBSCRIBER_NAME using ordinary names): fine-tune is competitive or tied with GPT-4o.
  • Structured-ID fields with high format variance (AUTH_NUMBER, PLAN_TYPE token boundaries, GROUP_NUMBER formats that vary widely): frontier wins because they've seen far more format variance during pretraining.

For both Model 2 and Model 3, the production recommendation is the same: a hybrid pipeline. Fine-tuned model first, regex for highly structured patterns, frontier API as the fallback for the long tail. The fine-tune absorbs the bulk of the traffic at near-zero cost and latency; the frontier API only runs on the small fraction of cases it can't handle confidently.
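A minimal sketch of that routing, with made-up field names, regex patterns, and required-fields rule; actual thresholds and fallback criteria would be tuned per deployment:

```python
import re

# Illustrative prior-auth pattern; a real deployment would maintain a vetted pattern set.
AUTH_PATTERN = re.compile(r"\b(?:PA-?|AUTH\s*#?\s*)[\d-]{4,}\b", re.IGNORECASE)

def extract(text, slm_extract, frontier_extract, required=("member_id", "carrier")):
    record = slm_extract(text)                      # stage 1: cheap, fast fine-tune

    if "auth_number" not in record:                 # stage 2: regex backfill for rigid IDs
        match = AUTH_PATTERN.search(text)
        if match:
            record["auth_number"] = match.group(0)

    if any(field not in record for field in required):
        # stage 3: frontier fallback, only for the small fraction the SLM can't cover
        record = {**frontier_extract(text), **record}
    return record
```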

What each model cost to build, total

| Model | OpenAI (data gen) | RunPod (train) | Benchmark APIs | Total |
|---|---|---|---|---|
| Intent classifier | $1.20 | $1.20 | $1.78 | $4.18 |
| PHI detector | $1.40 | $1.50 | $5.20 | $8.10 |
| Insurance extractor | $1.50 | $0.80 | $1.10 (no Anthropic) | $3.40 |
| Suite total | $4.10 | $3.50 | $8.08 | ~$15.70 |

Three published models with benchmark-grade write-ups for under sixteen dollars. The Anthropic credit gap in the insurance extractor benchmark is the only thing that prevents a clean head-to-head across all three, and that's just a "buy more credit" problem.

Hugging Face hosting: $0. Total infrastructure cost beyond the line items above: $0.

Why I built this

A career pivot. I'm targeting senior AI engineer roles at $100K–180K USD, coming from a full-stack engineering and founder background. The pivot story has three components: (1) I can train transformers from scratch on consumer hardware (the ORCH series), (2) I can fine-tune larger base models with QLoRA (ORCH-7B), and (3) I can ship benchmark-grade specialized SLMs with rigorous, transparent evaluation against frontier APIs.

The ClarioScope SLM Suite is the third leg. Three months of work, three published models, three dev.to write-ups, full transparency on synthetic-data limitations and where frontier still beats us. If you're hiring for AI engineering roles where the candidate needs to understand both training-from-scratch AND production benchmarking AND honest model-limitations communication, my LinkedIn is in my GitHub profile.

What's next

A v1.1 of the insurance extractor with Anthropic benchmarks once credit is restored, and a v2 of all three models trained on real (de-identified) patient text from a partner practice — which moves the project into HIPAA-eligible infrastructure (AWS SageMaker / Azure ML with a BAA) and out of the "synthetic-data v1" phase.

If you've shipped a small specialized model that has the inverse story — beats frontier on aggregate but loses on a specific axis — I'd love to hear about it. The interesting trade-offs in 2026 aren't "should I use a frontier API" but "what's the right hybrid architecture for this task." This three-model suite was the project that taught me that lesson, three times in a row.

Follow along on Hugging Face or GitHub.
