<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Raihan</title>
    <description>The latest articles on DEV Community by Raihan (@raihan-js).</description>
    <link>https://dev.to/raihan-js</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F254365%2Fb923a1bc-0077-4742-bafd-fff5c5d9b555.png</url>
      <title>DEV Community: Raihan</title>
      <link>https://dev.to/raihan-js</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/raihan-js"/>
    <language>en</language>
    <item>
      <title>Matching frontier LLMs at 22× lower latency: a 184M-parameter intent classifier for healthcare text</title>
      <dc:creator>Raihan</dc:creator>
      <pubDate>Mon, 11 May 2026 17:43:38 +0000</pubDate>
      <link>https://dev.to/raihan-js/matching-frontier-llms-at-22x-lower-latency-a-184m-parameter-intent-classifier-for-healthcare-text-5ec2</link>
      <guid>https://dev.to/raihan-js/matching-frontier-llms-at-22x-lower-latency-a-184m-parameter-intent-classifier-for-healthcare-text-5ec2</guid>
      <description>&lt;p&gt;Healthcare practices drown in inbound patient text. Email, contact forms, live chat, SMS, voicemail transcripts — every channel sends messages that need to be routed: to scheduling, to billing, to clinical, to the front desk. It's a high-volume, deterministic, latency-sensitive task.&lt;/p&gt;

&lt;p&gt;The obvious answer in 2026 is to throw a frontier LLM at it. Claude Haiku 4.5 will give you 95% accuracy on this kind of classification. GPT-4o will too. But every call costs real money, adds about a second of network round-trip, and sends patient text to a third party that doesn't have a BAA with you.&lt;/p&gt;

&lt;p&gt;I built a small alternative — a 184M-parameter DeBERTa-v3-base fine-tune — and benchmarked it against Claude Haiku 4.5, Claude Sonnet 4.6, and GPT-4o on a 1,154-example test set. &lt;strong&gt;The fine-tuned model lands within about 4 percentage points&lt;/strong&gt; of the best frontier model's accuracy, &lt;strong&gt;runs 22× faster&lt;/strong&gt; on a CPU, and costs &lt;strong&gt;effectively $0 per inference&lt;/strong&gt; after training. Total cost to build it: under $3.&lt;/p&gt;

&lt;p&gt;Model on Hugging Face: &lt;a href="https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1" rel="noopener noreferrer"&gt;&lt;code&gt;raihan-js/clarioscope-intent-deberta-v1&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fraihan-js%2Fclarioscope-intent-deberta-v1%2Fresolve%2Fmain%2Faccuracy_vs_latency.png%3Fv%3D2" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fraihan-js%2Fclarioscope-intent-deberta-v1%2Fresolve%2Fmain%2Faccuracy_vs_latency.png%3Fv%3D2" alt="Accuracy vs latency" width="1691" height="967"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is model 1 of three I'm building for the ClarioScope SLM Suite — a healthcare intake intelligence pipeline. The other two are a PHI detector and an insurance extractor; they're in development. This post is the methodology and the benchmark for the first one.&lt;/p&gt;

&lt;h2&gt;The task&lt;/h2&gt;

&lt;p&gt;Seven intent labels, designed for production routing at a healthcare practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;What it captures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;new_patient_inquiry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A prospective patient asking about becoming a patient&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;existing_patient_question&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;An existing patient with a non-urgent question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;appointment_request&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scheduling, rescheduling, or cancellation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;billing_inquiry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Questions about bills or pricing of services already received&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;clinical_concern&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;An active medical concern requiring clinical attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;complaint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dissatisfaction with service, staff, communication, or outcome&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;price_shopper&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pricing-only inquiry, no commitment signals&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These categories are opinionated and they have real ambiguity at the edges. A new patient asking for their first appointment is both new-patient and appointment-request. A frustrated patient describing a medical concern is both clinical and complaint. The data-generation prompt encodes explicit disambiguation rules (complaint dominates when both signals are present; pre-commitment pricing questions are &lt;code&gt;price_shopper&lt;/code&gt; even if they mention insurance), but the boundary cases are where every model — fine-tuned or frontier — gives up F1 points.&lt;/p&gt;

&lt;h2&gt;Why not just use the API&lt;/h2&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency.&lt;/strong&gt; Frontier API calls from my Bangladesh ISP run 1,000–1,600 ms. For routing, that's the difference between an inbox that updates instantly and one that lags noticeably. The fine-tuned model on a CPU runs in 48 ms. On a GPU it would be another 5–10× faster. Either way, the wall-clock floor for a hosted API call is in the hundreds of milliseconds even before the model processes anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Claude Sonnet 4.6 costs $0.76 per 1,000 inferences on this task. Haiku is $0.25 per 1K. GPT-4o is $0.53 per 1K. For a single practice receiving 10,000 inbound messages per day across all channels (not unrealistic for a multi-location dental or dermatology group), that's $912 to $2,774 per practice per year — a hard line item on the SaaS economics. The fine-tuned model has a one-time training cost and approximately zero marginal per-inference cost.&lt;/p&gt;
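
&lt;p&gt;The arithmetic behind those annual figures, as a quick sanity check (using the rounded per-1K prices; the benchmark table below has the measured ones):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;DAILY_MESSAGES = 10_000
CALLS_PER_YEAR = DAILY_MESSAGES * 365          # 3.65M inferences/year

PRICE_PER_1K = {"haiku-4.5": 0.25, "gpt-4o": 0.53, "sonnet-4.6": 0.76}

for name, price in PRICE_PER_1K.items():
    print(f"{name}: ${CALLS_PER_YEAR / 1000 * price:,.0f}/year")
# haiku-4.5: $912/year
# gpt-4o: $1,934/year
# sonnet-4.6: $2,774/year
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;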

&lt;p&gt;&lt;strong&gt;Privacy.&lt;/strong&gt; Frontier APIs are great, and they're also a third-party data path. For protected health information you'd want a BAA, and not every API provider offers one at every tier. A self-hosted classifier never sends patient text anywhere.&lt;/p&gt;

&lt;p&gt;The accuracy gap versus frontier is real but small enough that for production routing, the speed/cost/privacy wins dominate.&lt;/p&gt;

&lt;h2&gt;The model&lt;/h2&gt;

&lt;p&gt;Standard &lt;a href="https://huggingface.co/microsoft/deberta-v3-base" rel="noopener noreferrer"&gt;DeBERTa-v3-base&lt;/a&gt; with a sequence classification head: a single linear layer over the pooled &lt;code&gt;[CLS]&lt;/code&gt; representation producing 7 logits. All 184M parameters fine-tuned. No LoRA — at this dataset size, full fine-tuning beats parameter-efficient methods without much overhead. Training was 5 epochs over 8,099 examples on a single RTX 4090 (rented on RunPod), batch size 32, max sequence length 256 tokens, learning rate 2e-5 with cosine schedule and 10% warmup, fp16 mixed precision. Total training wall time: about five minutes.&lt;/p&gt;
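
&lt;p&gt;For reference, that configuration is expressible as a plain Hugging Face &lt;code&gt;Trainer&lt;/code&gt; setup. A minimal sketch, assuming the synthetic examples are already tokenized into &lt;code&gt;train_ds&lt;/code&gt; and &lt;code&gt;val_ds&lt;/code&gt; (a reconstruction of the settings above, not the exact script):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["new_patient_inquiry", "existing_patient_question", "appointment_request",
          "billing_inquiry", "clinical_concern", "complaint", "price_shopper"]
id2label = dict(enumerate(LABELS))

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(LABELS),
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

args = TrainingArguments(
    output_dir="clarioscope-intent-deberta-v1",
    num_train_epochs=5,                  # 5 epochs over the 8,099 train examples
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                    # 10% warmup
    fp16=True,                           # mixed precision on the RTX 4090
)

# train_ds / val_ds: tokenized to max_length=256 with input_ids, attention_mask, labels
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;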

&lt;h2&gt;The training data — synthetic and transparent about it&lt;/h2&gt;

&lt;p&gt;This is the most important section of the post for anyone considering similar work. &lt;strong&gt;All training and test data is synthetic.&lt;/strong&gt; There is no real patient data anywhere in the pipeline. This is a deliberate choice — using synthetic data for v1 sidesteps HIPAA constraints entirely and lets the model ship fast. A v2 trained on real PHI would need HIPAA-eligible training infrastructure (AWS SageMaker or Azure ML with a BAA), and that's a separate, more careful project.&lt;/p&gt;

&lt;p&gt;But "synthetic" is doing a lot of work in that sentence. The naïve approach — ask an LLM for 1,000 example patient inquiries per intent — produces what I'll call &lt;strong&gt;ChatGPT-polite text&lt;/strong&gt;: every message opens with "Hi!", ends with "Thanks!", uses correct grammar and punctuation, and reads nothing like a real SMS message that an actual frustrated parent sends at 2 AM.&lt;/p&gt;

&lt;p&gt;A model trained on ChatGPT-polite text will overfit to the politeness markers and degrade badly on real production text. So the generation prompt forces a &lt;strong&gt;mandatory realism mix per batch&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~40% polished&lt;/strong&gt; (full sentences, correct grammar, proper punctuation, formal or neutral)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~40% casual&lt;/strong&gt; (lowercase starts, contractions, fragments, missing terminal punctuation, conversational)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~20% messy&lt;/strong&gt; (typos, autocorrect mistakes, abbreviations like &lt;code&gt;u&lt;/code&gt;/&lt;code&gt;appt&lt;/code&gt;/&lt;code&gt;tmrw&lt;/code&gt;, ALL CAPS for urgency, run-on phrasing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus channel-conditional scaling: SMS is the messiest, voicemail transcripts second messiest, email and web forms more polished. The prompt also includes about 20 lines of &lt;strong&gt;style anchors&lt;/strong&gt; — concrete patterns the LLM should reproduce. Stuff like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Abbreviations: &lt;code&gt;u&lt;/code&gt;/&lt;code&gt;ur&lt;/code&gt;, &lt;code&gt;appt&lt;/code&gt;, &lt;code&gt;tmrw&lt;/code&gt;, &lt;code&gt;yr&lt;/code&gt;, &lt;code&gt;pls&lt;/code&gt;, &lt;code&gt;thx&lt;/code&gt;, &lt;code&gt;rx&lt;/code&gt;, &lt;code&gt;ins&lt;/code&gt; (insurance)&lt;/p&gt;

&lt;p&gt;Fragment phrases: "billing question call me back", "need to reschedule thursday", "kid has fever 102", "still no answer about my x-ray"&lt;/p&gt;

&lt;p&gt;Run-on voicemail: "uh hi yeah this is calling about that thing you mentioned last week i think it was a follow up or something can you call me back"&lt;/p&gt;

&lt;p&gt;Conversational starts (no greeting): "Quick question —", "So I got this bill...", "Need to cancel —"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two dry runs before the full 9,000-example generation: the first one without the realism mix produced very polite, very clean output (82% of messages opened with "Hi!", 0% had ALL CAPS, almost nothing was a fragment); the second one with the mix landed at 18% greetingless openers, 22% abbreviations, 21% no terminal punctuation, 6% ALL CAPS urgency. The shape of the distribution actually moved when the prompt told it to move.&lt;/p&gt;
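
&lt;p&gt;The audit behind those percentages is just surface heuristics over each generated batch. A rough sketch of the idea (illustrative regexes, not the exact script behind the numbers above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

ABBREVS = re.compile(r"\b(u|ur|appt|tmrw|yr|pls|thx|rx|ins)\b", re.IGNORECASE)

def realism_stats(messages):
    """Fraction of messages hitting each surface-style marker (heuristic)."""
    n = len(messages)
    return {
        "greeting_opener": sum(m.lstrip().lower().startswith(("hi", "hello", "hey")) for m in messages) / n,
        "abbreviations": sum(bool(ABBREVS.search(m)) for m in messages) / n,
        "no_terminal_punct": sum(not m.rstrip().endswith((".", "!", "?")) for m in messages) / n,
        "all_caps_urgency": sum(bool(re.search(r"\b[A-Z]{3,}\b", m)) for m in messages) / n,
    }
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;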

&lt;p&gt;Costs: the 9,000 training examples cost about $1.20 of OpenAI credit (via &lt;code&gt;gpt-4o-mini-2024-07-18&lt;/code&gt;, JSON-object response format, temperature 1.0, 8-worker parallel generation).&lt;/p&gt;
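
&lt;p&gt;The generation loop itself is short. A sketch under those settings (&lt;code&gt;SYSTEM_PROMPT&lt;/code&gt; stands in for the full data-generation prompt described above, and the &lt;code&gt;examples&lt;/code&gt; JSON key is an assumed batch schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "..."  # full prompt: label rules, realism mix, style anchors

def generate_batch(intent, channel, n=20):
    """One batch of labeled synthetic messages, parsed from JSON output."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        temperature=1.0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Generate {n} {channel} messages with intent {intent}."},
        ],
    )
    return json.loads(resp.choices[0].message.content)["examples"]

intents = ["new_patient_inquiry", "existing_patient_question", "appointment_request",
           "billing_inquiry", "clinical_concern", "complaint", "price_shopper"]
jobs = [(i, c) for i in intents for c in ("sms", "email", "web_form", "voicemail")]
with ThreadPoolExecutor(max_workers=8) as pool:  # 8-worker parallel generation
    batches = list(pool.map(lambda job: generate_batch(*job), jobs))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;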

&lt;h2&gt;Preventing benchmark leakage&lt;/h2&gt;

&lt;p&gt;The naive failure mode here is generating both train and test with the same model. The fine-tuned model would learn the generator's style, and the benchmark would inflate.&lt;/p&gt;

&lt;p&gt;So train and test come from different generators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Train (9,000 examples)&lt;/strong&gt; — generated by &lt;code&gt;gpt-4o-mini-2024-07-18&lt;/code&gt; with the prompt above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test (1,154 examples)&lt;/strong&gt; — generated by Claude with a &lt;strong&gt;deliberately different prompt style&lt;/strong&gt; and a &lt;strong&gt;different abbreviation set&lt;/strong&gt; (&lt;code&gt;w/&lt;/code&gt;, &lt;code&gt;&amp;amp;&lt;/code&gt;, &lt;code&gt;hrs&lt;/code&gt;, &lt;code&gt;BTW&lt;/code&gt;, &lt;code&gt;IDK&lt;/code&gt;, &lt;code&gt;plz&lt;/code&gt; versus the train prompt's &lt;code&gt;u&lt;/code&gt;, &lt;code&gt;tmrw&lt;/code&gt;, &lt;code&gt;appt&lt;/code&gt;). The test set leans into more medically specific content (real conditions, real procedure names) and longer rambling messages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A side effect of this split: when I benchmark against Claude Haiku 4.5 and Claude Sonnet 4.6 below, those models are from the same family as the test-set generator. If anything, they should get a small style-familiarity advantage. Read the benchmark numbers with that caveat in mind. (Spoiler: they don't visibly benefit.)&lt;/p&gt;

&lt;h2&gt;The benchmark&lt;/h2&gt;

&lt;p&gt;Evaluated on 1,154 held-out test examples:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Macro F1&lt;/th&gt;
&lt;th&gt;Latency / example&lt;/th&gt;
&lt;th&gt;Cost / 1K inferences&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;raihan-js/clarioscope-intent-deberta-v1&lt;/code&gt; (CPU)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.16%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.07%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48.5 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-haiku-4-5-20251001&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;95.32%&lt;/td&gt;
&lt;td&gt;95.28%&lt;/td&gt;
&lt;td&gt;1064 ms&lt;/td&gt;
&lt;td&gt;$0.252&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-sonnet-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;93.59%&lt;/td&gt;
&lt;td&gt;93.53%&lt;/td&gt;
&lt;td&gt;1566 ms&lt;/td&gt;
&lt;td&gt;$0.759&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-4o-2024-11-20&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;95.23%&lt;/td&gt;
&lt;td&gt;95.17%&lt;/td&gt;
&lt;td&gt;1036 ms&lt;/td&gt;
&lt;td&gt;$0.527&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Latency is wall-clock single-example latency through each provider's chat completions API, measured from a Bangladesh ISP. The fine-tuned model number is on a CPU (no GPU acceleration). Cost is the actual API spend per 1,000 calls based on token counts from the run.&lt;/p&gt;
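
&lt;p&gt;The local number comes from plain wall-clock timing. A minimal sketch of how to reproduce it, reusing &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;tokenizer&lt;/code&gt; from the usage snippet later in the post (single-example calls, median after a warm-up pass):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics
import time
import torch

def median_latency_ms(texts):
    """Median single-example CPU latency in milliseconds."""
    with torch.no_grad():
        # warm-up: the first call pays one-time tokenizer/kernel init costs
        model(**tokenizer(texts[0], return_tensors="pt"))
        samples = []
        for t in texts:
            start = time.perf_counter()
            model(**tokenizer(t, truncation=True, max_length=256, return_tensors="pt"))
            samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;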

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp4f5dg6rjgb3hv02jo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp4f5dg6rjgb3hv02jo2.png" alt="Accuracy and cost comparison" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Three things in this table are interesting&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sonnet 4.6 is worse than Haiku 4.5.&lt;/strong&gt; A bigger, slower, more expensive frontier model produces lower accuracy on this task. This isn't an artifact of one run — I've seen it consistently. My take: for narrow, well-structured classification with short prompts, more reasoning capacity sometimes second-guesses the correct intuition. The first thought is often right, and a smaller model that doesn't have the option to deliberate just commits to it. The right tool for this kind of job is small and specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The latency advantage holds even on CPU.&lt;/strong&gt; The 48 ms number comes from CPU inference; a modest GPU would drop it to ~5–10 ms. The frontier API numbers are network-bound — the model itself processes the request in tens of milliseconds, but the wall-clock floor for a hosted API call from a non-US-East ISP is in the hundreds of milliseconds before the model has even started. Adding a GPU on the API side does nothing for that floor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost gap doesn't shrink at scale.&lt;/strong&gt; API cost scales linearly with call volume. The fine-tuned model has a one-time training cost (about $2.40 of OpenAI plus RunPod compute together) and approximately zero marginal cost. For 10K daily inferences over a year, the dollar swing is between zero and roughly $2,800.&lt;/p&gt;

&lt;h2&gt;Per-class F1 and where the errors live&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3kfcrkif2emfrtqppag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3kfcrkif2emfrtqppag.png" alt="Confusion matrix on val set" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model's per-class F1 on the val set, ranked best to worst:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;price_shopper&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.957&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;complaint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.929&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;billing_inquiry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.908&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;appointment_request&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.881&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;clinical_concern&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.874&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;existing_patient_question&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.834&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;new_patient_inquiry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.819&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hardest pairs to disambiguate are exactly the pairs you'd expect to be hard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;new_patient_inquiry&lt;/code&gt; ↔ &lt;code&gt;appointment_request&lt;/code&gt;&lt;/strong&gt; — a new patient asking to schedule their first visit fits both labels. The data-gen prompt resolves toward &lt;code&gt;new_patient_inquiry&lt;/code&gt; for messages that lead with the becoming-a-patient signal, but the model lands on &lt;code&gt;appointment_request&lt;/code&gt; more often than the label intends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;existing_patient_question&lt;/code&gt; ↔ &lt;code&gt;clinical_concern&lt;/code&gt;&lt;/strong&gt; — medical questions from established patients read as low-grade concerns to the model, because at the lexical level they are.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;clinical_concern&lt;/code&gt; ↔ &lt;code&gt;complaint&lt;/code&gt;&lt;/strong&gt; — frustrated medical concerns combine both signals; the prompt's tie-breaker says complaint dominates, but the model occasionally goes the other way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These same pairs gave Claude Haiku 4.5 trouble too when I ran the benchmark by hand on a sample. That's real ambiguity in the task, not classifier weakness. Useful production move: have the model emit confidence (max softmax) alongside the label, and route low-confidence predictions to a human reviewer.&lt;/p&gt;
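
&lt;p&gt;A minimal sketch of that gate, reusing &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;texts&lt;/code&gt;, and &lt;code&gt;inputs&lt;/code&gt; from the usage snippet below; the threshold is illustrative and the two handlers are hypothetical stubs to wire into your inbox:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

CONFIDENCE_FLOOR = 0.80  # illustrative; calibrate on real traffic

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
confidence, pred = probs.max(dim=-1)

for text, conf, idx in zip(texts, confidence.tolist(), pred.tolist()):
    label = model.config.id2label[idx]
    if conf &amp;lt; CONFIDENCE_FLOOR:
        send_to_review_queue(text, label, conf)  # hypothetical human-review handler
    else:
        route_to_inbox(text, label)              # hypothetical router
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;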

&lt;h2&gt;The cost ledger&lt;/h2&gt;

&lt;p&gt;Full breakdown of what it cost to ship this model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;9,000 synthetic training examples via OpenAI (&lt;code&gt;gpt-4o-mini&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;$1.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RunPod RTX 4090 pod (about 50 minutes including iteration)&lt;/td&gt;
&lt;td&gt;$1.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark API calls (Haiku + Sonnet + GPT-4o, 1,154 examples each)&lt;/td&gt;
&lt;td&gt;$1.78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hugging Face hosting&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$4.18&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's it. End-to-end, from empty repo to published model + reproducible benchmark, for less than the price of lunch.&lt;/p&gt;

&lt;h2&gt;How to use it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raihan-js/clarioscope-intent-deberta-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hi, I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m new to the area and looking for a dermatologist. Are you accepting new patients?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;got a bill for $382 for my visit on 4/12 but my copay should only be $35 — what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the rest?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my kid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s fever is 103.2 and not coming down with tylenol. need advice now&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id2label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ['new_patient_inquiry', 'billing_inquiry', 'clinical_concern']
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
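
&lt;p&gt;If you just want labels and scores, the &lt;code&gt;pipeline&lt;/code&gt; wrapper does the same thing in three lines, and the score it returns is the max-softmax confidence from the routing idea above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import pipeline

clf = pipeline("text-classification", model="raihan-js/clarioscope-intent-deberta-v1")
print(clf("need to reschedule thursday"))
# e.g. [{'label': 'appointment_request', 'score': ...}]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;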



&lt;h2&gt;Limitations&lt;/h2&gt;

&lt;p&gt;I've put a full Limitations section in the &lt;a href="https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1" rel="noopener noreferrer"&gt;model card&lt;/a&gt;, but the highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All training and test data is synthetic.&lt;/strong&gt; No real production validation yet. A real-world calibration pass is a prerequisite for production deployment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;English only.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare practice domain only.&lt;/strong&gt; Routes messages within a practice — does not generalize to other industries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seven categories, not exhaustive.&lt;/strong&gt; Messages that don't fit get the closest available label rather than an "unknown" bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No PHI redaction is performed by this model.&lt;/strong&gt; PHI detection is a separate model in the suite (in development), and HIPAA compliance is a regulatory determination that no model can make.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;This is model 1 of three. The other two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;clarioscope-phi-deberta-v1&lt;/code&gt;&lt;/strong&gt; — a token-classification model (BIO tagging) for detecting PHI spans in patient text. Same DeBERTa base, different head, different training data (synthetic PHI-annotated text). Goal: redact before routing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;clarioscope-insurance-v1&lt;/code&gt;&lt;/strong&gt; — structured JSON extraction of insurance- and billing-relevant fields from inbound text. Probably a small encoder-decoder or constrained-decoding setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When all three are published, they'll go up as a Hugging Face collection and the master writeup will be a single longer post tying the suite together. Follow along on &lt;a href="https://huggingface.co/raihan-js" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; or &lt;a href="https://github.com/raihan-js" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you've shipped a small specialized model that beats — sorry, &lt;strong&gt;matches&lt;/strong&gt; — frontier APIs on a narrow task, I'd love to hear about it. The pattern works.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>huggingface</category>
    </item>
  </channel>
</rss>
