Healthcare practices drown in inbound patient text. Email, contact forms, live chat, SMS, voicemail transcripts — every channel sends messages that need to be routed: to scheduling, to billing, to clinical, to the front desk. It's a high-volume, deterministic, latency-sensitive task.
The obvious answer in 2026 is to throw a frontier LLM at it. Claude Haiku 4.5 will give you 95% accuracy on this kind of classification. GPT-4o will too. But every call costs real money, adds about a second of network round-trip, and sends patient text to a third party that doesn't have a BAA with you.
I built a small alternative — a 184M-parameter DeBERTa-v3-base fine-tune — and benchmarked it against Claude Haiku 4.5, Claude Sonnet 4.6, and GPT-4o on a 1,154-example test set. The fine-tuned model lands about 4 percentage points below the best frontier model on accuracy, runs 22× faster on a CPU, and costs effectively $0 per inference after training. Total cost to build it: under $3.
Model on Hugging Face: raihan-js/clarioscope-intent-deberta-v1.
This is model 1 of three I'm building for the ClarioScope SLM Suite — a healthcare intake intelligence pipeline. The other two are a PHI detector and an insurance extractor; they're in development. This post is the methodology and the benchmark for the first one.
The task
Seven intent labels, designed for production routing at a healthcare practice:
| Label | What it captures |
|---|---|
| `new_patient_inquiry` | A prospective patient asking about becoming a patient |
| `existing_patient_question` | An existing patient with a non-urgent question |
| `appointment_request` | Scheduling, rescheduling, or cancellation |
| `billing_inquiry` | Questions about bills or pricing of services already received |
| `clinical_concern` | An active medical concern requiring clinical attention |
| `complaint` | Dissatisfaction with service, staff, communication, or outcome |
| `price_shopper` | Pricing-only inquiry, no commitment signals |
These categories are opinionated and they have real ambiguity at the edges. A new patient asking for their first appointment is both new-patient and appointment-request. A frustrated patient describing a medical concern is both clinical and complaint. The data-generation prompt encodes explicit disambiguation rules (complaint dominates when both signals are present; pre-commitment pricing questions are price_shopper even if they mention insurance), but the boundary cases are where every model — fine-tuned or frontier — gives up F1 points.
Why not just use the API
Three reasons:
Latency. Frontier API calls from my Bangladesh ISP run 1,000–1,600 ms. For routing, that's the difference between an inbox that updates instantly and one that lags noticeably. The fine-tuned model on a CPU runs in 48 ms. On a GPU it would be another 5–10× faster. Either way, the wall-clock floor for a hosted API call is in the hundreds of milliseconds even before the model processes anything.
Cost. Claude Sonnet 4.6 costs $0.76 per 1,000 inferences on this task. Haiku is $0.25 per 1K. GPT-4o is $0.53 per 1K. For a single practice receiving 10,000 inbound messages per day across all channels (not unrealistic for a multi-location dental or dermatology group), that's $912 to $2,774 per practice per year — a hard line item on the SaaS economics. The fine-tuned model has a one-time training cost and approximately zero marginal per-inference cost.
Privacy. Frontier APIs are great, and they're also a third-party data path. For protected health information you'd want a BAA, and not every API provider offers one at every tier. A self-hosted classifier never sends patient text anywhere.
The accuracy gap versus frontier is real but small enough that for production routing, the speed/cost/privacy wins dominate.
The model
Standard DeBERTa-v3-base with a sequence classification head: a single linear layer over the pooled [CLS] representation producing 7 logits. All 184M parameters fine-tuned. No LoRA — at this dataset size, full fine-tuning beats parameter-efficient methods without much overhead. Training was 5 epochs of 8,099 examples on a single RTX 4090 (rented on RunPod), batch size 32, max sequence length 256 tokens, learning rate 2e-5 with cosine schedule and 10% warmup, fp16 mixed precision. Total training wall time: about five minutes.
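For readers who want to reproduce the setup, here is a minimal sketch of that configuration with the Hugging Face Trainer. The label list matches the table above; the JSONL path and other script details are assumptions rather than the exact training code.

```python
# Minimal fine-tuning sketch under the hyperparameters described above.
# Assumption: training data sits in a local JSONL file with "text" and "label" fields.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

LABELS = ["new_patient_inquiry", "existing_patient_question", "appointment_request",
          "billing_inquiry", "clinical_concern", "complaint", "price_shopper"]
label2id = {l: i for i, l in enumerate(LABELS)}
id2label = {i: l for l, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=len(LABELS),
    label2id=label2id, id2label=id2label)

# Hypothetical file name for the 9,000-example synthetic training set.
ds = load_dataset("json", data_files={"train": "intent_train.jsonl"})

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=256)
    enc["labels"] = [label2id[l] for l in batch["label"]]
    return enc

ds = ds.map(preprocess, batched=True, remove_columns=["text", "label"])

args = TrainingArguments(
    output_dir="clarioscope-intent-deberta-v1",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
)
Trainer(model=model, args=args, train_dataset=ds["train"], tokenizer=tokenizer).train()
```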
The training data — synthetic and transparent about it
This is the most important section of the post for anyone considering similar work. All training and test data is synthetic. There is no real patient data anywhere in the pipeline. This is a deliberate choice — using synthetic data for v1 sidesteps HIPAA constraints entirely and lets the model ship fast. A v2 trained on real PHI would need HIPAA-eligible training infrastructure (AWS SageMaker or Azure ML with a BAA), and that's a separate, more careful project.
But "synthetic" is doing a lot of work in that sentence. The naïve approach — ask an LLM for 1,000 example patient inquiries per intent — produces what I'll call ChatGPT-polite text: every message opens with "Hi!", ends with "Thanks!", uses correct grammar and punctuation, and reads nothing like a real SMS message that an actual frustrated parent sends at 2 AM.
A model trained on ChatGPT-polite text will overfit to the politeness markers and degrade badly on real production text. So the generation prompt forces a mandatory realism mix per batch:
- ~40% polished (full sentences, correct grammar, proper punctuation, formal or neutral)
- ~40% casual (lowercase starts, contractions, fragments, missing terminal punctuation, conversational)
- ~20% messy (typos, autocorrect mistakes, abbreviations like `u`/`appt`/`tmrw`, ALL CAPS for urgency, run-on phrasing)
Plus channel-conditional scaling: SMS is the messiest, voicemail transcripts second messiest, email and web forms more polished. The prompt also includes about 20 lines of style anchors — concrete patterns the LLM should reproduce. Stuff like:
Abbreviations: `u`/`ur`, `appt`, `tmrw`, `yr`, `pls`, `thx`, `rx`, `ins` (insurance)
Fragment phrases: "billing question call me back", "need to reschedule thursday", "kid has fever 102", "still no answer about my x-ray"
Run-on voicemail: "uh hi yeah this is calling about that thing you mentioned last week i think it was a follow up or something can you call me back"
Conversational starts (no greeting): "Quick question —", "So I got this bill...", "Need to cancel —"
Two dry runs before the full 9,000-example generation: the first one without the realism mix produced very polite, very clean output (82% of messages opened with "Hi!", 0% had ALL CAPS, almost nothing was a fragment); the second one with the mix landed at 18% greetingless openers, 22% abbreviations, 21% no terminal punctuation, 6% ALL CAPS urgency. The shape of the distribution actually moved when the prompt told it to move.
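Measuring those marker rates is cheap. Here is a rough sketch of the kind of regex audit involved; the patterns are illustrative, not the exact script used for the dry runs.

```python
import re

# Illustrative style-marker audit over a list of generated messages.
ABBREVS = {"u", "ur", "appt", "tmrw", "yr", "pls", "thx", "rx", "ins"}
GREETING = re.compile(r"^\s*(hi|hello|hey|good (morning|afternoon|evening))\b", re.I)

def style_stats(messages):
    n = len(messages)
    return {
        # share of messages that don't open with a greeting
        "greetingless": sum(not GREETING.match(m) for m in messages) / n,
        # share containing at least one SMS-style abbreviation
        "abbreviations": sum(any(w.lower().strip(".,!?") in ABBREVS
                                 for w in m.split()) for m in messages) / n,
        # share with no terminal punctuation
        "no_terminal_punct": sum(not m.rstrip().endswith((".", "!", "?"))
                                 for m in messages) / n,
        # share with an ALL CAPS word of 3+ letters (urgency marker)
        "all_caps_word": sum(any(w.isupper() and len(w) >= 3 for w in m.split())
                             for m in messages) / n,
    }
```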
Costs: the 9,000 training examples cost about $1.20 of OpenAI credit (via gpt-4o-mini-2024-07-18, JSON-object response format, temperature 1.0, 8-worker parallel generation).
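For context, a stripped-down sketch of what one generation call with those settings could look like. The system prompt below is heavily abbreviated and illustrative; the real prompt carries the full realism mix, channel conditioning, style anchors, and disambiguation rules.

```python
# Illustrative single-batch generation call; the production prompt is much longer.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Generate patient intake messages for a healthcare practice as JSON: "
    '{"examples": [{"text": ..., "intent": ..., "channel": ...}]}. '
    "Per batch: ~40% polished, ~40% casual, ~20% messy (typos, abbreviations "
    "like u/appt/tmrw, ALL CAPS urgency). SMS is messiest, email most polished."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    temperature=1.0,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Generate 25 examples for intent: billing_inquiry"},
    ],
)
batch = json.loads(resp.choices[0].message.content)["examples"]
```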
Preventing benchmark leakage
The naive failure mode here is generating both train and test with the same model. The fine-tuned model would learn the generator's style, and the benchmark would inflate.
So train and test come from different generators:
- Train (9,000 examples) — generated by `gpt-4o-mini-2024-07-18` with the prompt above.
- Test (1,154 examples) — generated by Claude with a deliberately different prompt style and a different abbreviation set (`w/`, `&`, `hrs`, `BTW`, `IDK`, `plz` versus the train prompt's `u`, `tmrw`, `appt`). The test set leans into more medically specific content (real conditions, real procedure names) and longer rambling messages.
A side effect of this split: when I benchmark against Claude Haiku 4.5 and Claude Sonnet 4.6 below, those models are from the same family as the test-set generator. If anything, they should get a small style-familiarity advantage. The benchmark numbers below are with that caveat in mind. (Spoiler: they don't visibly benefit.)
The benchmark
Evaluated on 1,154 held-out test examples:
| Model | Accuracy | Macro F1 | Latency / example | Cost / 1K inferences |
|---|---|---|---|---|
| `raihan-js/clarioscope-intent-deberta-v1` (CPU) | 91.16% | 91.07% | 48.5 ms | $0.00 |
| `claude-haiku-4-5-20251001` | 95.32% | 95.28% | 1,064 ms | $0.252 |
| `claude-sonnet-4-6` | 93.59% | 93.53% | 1,566 ms | $0.759 |
| `gpt-4o-2024-11-20` | 95.23% | 95.17% | 1,036 ms | $0.527 |
Latency is wall-clock single-example latency through each provider's chat completions API, measured from a Bangladesh ISP. The fine-tuned model number is on a CPU (no GPU acceleration). Cost is the actual API spend per 1,000 calls based on token counts from the run.
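For concreteness, a single frontier-model timing looks roughly like the sketch below. The classification prompt and the parsing are illustrative, not the exact benchmark harness.

```python
# Illustrative single-example timing against the Anthropic API;
# the actual benchmark prompt and parsing may differ.
import time
import anthropic

LABELS = ["new_patient_inquiry", "existing_patient_question", "appointment_request",
          "billing_inquiry", "clinical_concern", "complaint", "price_shopper"]

client = anthropic.Anthropic()

def classify(text: str) -> tuple[str, float]:
    prompt = (f"Classify this patient message into exactly one of {LABELS}. "
              f"Reply with only the label.\n\nMessage: {text}")
    start = time.perf_counter()
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=16,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    return msg.content[0].text.strip(), elapsed_ms
```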
Three things in this table are interesting
Sonnet 4.6 is worse than Haiku 4.5. A bigger, slower, more expensive frontier model produces lower accuracy on this task. This isn't an artifact of one run — I've seen it consistently. My take: for narrow, well-structured classification with short prompts, more reasoning capacity sometimes second-guesses the correct intuition. The first thought is often right, and a smaller model that doesn't have the option to deliberate just commits to it. The right tool for this kind of job is small and specific.
The latency advantage holds even on CPU. The 48 ms number is without any GPU acceleration; a modest GPU would drop it to ~5–10 ms. The frontier API numbers are network-bound — the model itself processes the request in tens of milliseconds, but the wall-clock floor for a hosted API call from a non-US-East ISP is in the hundreds of milliseconds before the model has even started. Adding a GPU at the API side does nothing for that floor.
Cost gap doesn't shrink at scale. API cost scales linearly with call volume. The fine-tuned model has a one-time training cost (about $2.40 of OpenAI plus RunPod compute together) and approximately zero marginal cost. For 10K daily inferences over a year, the dollar swing is between zero and roughly $2,800.
Per-class F1 and where the errors live
The model's per-class F1 on the val set, ranked best to worst:
| Intent | F1 |
|---|---|
| `price_shopper` | 0.957 |
| `complaint` | 0.929 |
| `billing_inquiry` | 0.908 |
| `appointment_request` | 0.881 |
| `clinical_concern` | 0.874 |
| `existing_patient_question` | 0.834 |
| `new_patient_inquiry` | 0.819 |
The hardest pairs to disambiguate are exactly the pairs you'd expect to be hard:
- `new_patient_inquiry` ↔ `appointment_request` — a new patient asking to schedule their first visit fits both labels. The data-gen prompt resolves toward `new_patient_inquiry` for messages that lead with the becoming-a-patient signal, but the model lands on `appointment_request` more often than the label intends.
- `existing_patient_question` ↔ `clinical_concern` — medical questions from established patients read as low-grade concerns to the model, because at the lexical level they are.
- `clinical_concern` ↔ `complaint` — frustrated medical concerns combine both signals; the prompt's tie-breaker says complaint dominates, but the model occasionally goes the other way.
These same pairs gave Claude Haiku 4.5 trouble too when I ran the benchmark by hand on a sample. They're real ambiguity in the task, not classifier weakness. Useful production move: have the model emit confidence (max softmax) alongside the label, and route low-confidence predictions to a human reviewer.
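A minimal version of that confidence gate could look like this; the 0.7 threshold is arbitrary and would need calibrating against real traffic.

```python
import torch
import torch.nn.functional as F

# Route low-confidence predictions to a human queue instead of auto-routing.
# The 0.7 cutoff is illustrative; pick it from a calibration pass on real messages.
CONFIDENCE_THRESHOLD = 0.7

def route(text, model, tokenizer):
    inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)[0]
    conf, idx = probs.max(dim=-1)
    label = model.config.id2label[idx.item()]
    destination = label if conf.item() >= CONFIDENCE_THRESHOLD else "human_review"
    return {"label": label, "confidence": conf.item(), "route": destination}
```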
The cost ledger
Full breakdown of what it cost to ship this model:
| Item | Cost |
|---|---|
| 9,000 synthetic training examples via OpenAI (gpt-4o-mini) | $1.20 |
| RunPod RTX 4090 pod (about 50 minutes including iteration) | $1.20 |
| Benchmark API calls (Haiku + Sonnet + GPT-4o, 1,154 examples each) | $1.78 |
| Hugging Face hosting | $0 |
| **Total** | **$4.18** |
That's it. End-to-end, from empty repo to published model + reproducible benchmark, for less than the price of lunch.
How to use it
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the published checkpoint from the Hugging Face Hub.
model_id = "raihan-js/clarioscope-intent-deberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = [
    "Hi, I'm new to the area and looking for a dermatologist. Are you accepting new patients?",
    "got a bill for $382 for my visit on 4/12 but my copay should only be $35 — what's the rest?",
    "my kid's fever is 103.2 and not coming down with tylenol. need advice now",
]

# Batch-tokenize, run a forward pass, and map argmax logits back to label names.
inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]
print(labels)
# ['new_patient_inquiry', 'billing_inquiry', 'clinical_concern']
```
Limitations
I've put a full Limitations section in the model card, but the highlights:
- All training and test data is synthetic. No real production validation yet. A real-world calibration pass is a prerequisite for production deployment.
- English only.
- Healthcare practice domain only. Routes messages within a practice — does not generalize to other industries.
- Seven categories, not exhaustive. Messages that don't fit get the closest available label rather than an "unknown" bucket.
- No PHI redaction is performed by this model. PHI detection is a separate model in the suite (in development), and HIPAA compliance is a regulatory determination that no model can make.
What's next
This is model 1 of three. The other two:
- `clarioscope-phi-deberta-v1` — a token-classification model (BIO tagging) for detecting PHI spans in patient text. Same DeBERTa base, different head, different training data (synthetic PHI-annotated text). Goal: redact before routing.
- `clarioscope-insurance-v1` — structured JSON extraction of insurance- and billing-relevant fields from inbound text. Probably a small encoder-decoder or constrained-decoding setup.
When all three are published, they'll go up as a Hugging Face collection and the master writeup will be a single longer post tying the suite together. Follow along on Hugging Face or GitHub.
If you've shipped a small specialized model that beats — sorry, matches — frontier APIs on a narrow task, I'd love to hear about it. The pattern works.


