<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: John Bolognino</title>
    <description>The latest articles on DEV Community by John Bolognino (@jcbolo72012).</description>
    <link>https://dev.to/jcbolo72012</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3970299%2F4e7a5236-47f8-4d50-86c7-773401173f71.jpeg</url>
      <title>DEV Community: John Bolognino</title>
      <link>https://dev.to/jcbolo72012</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jcbolo72012"/>
    <language>en</language>
    <item>
      <title>Domain-specific intent classification for e-commerce: fine-tuning DistilBERT to outperform GPT-4o mini at 1/15th the cost</title>
      <dc:creator>John Bolognino</dc:creator>
      <pubDate>Fri, 05 Jun 2026 18:39:23 +0000</pubDate>
      <link>https://dev.to/jcbolo72012/domain-specific-intent-classification-for-e-commerce-fine-tuning-distilbert-to-outperform-gpt-4o-2on3</link>
      <guid>https://dev.to/jcbolo72012/domain-specific-intent-classification-for-e-commerce-fine-tuning-distilbert-to-outperform-gpt-4o-2on3</guid>
      <description>&lt;h1&gt;
  
  
  Domain-specific intent classification for e-commerce: fine-tuning DistilBERT to outperform GPT-4o mini at 1/15th the cost
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;A practical case study showing why fine-tuned encoders can still win for fixed-label classification tasks.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Hello world! I am John, first-time poster. I fine-tuned this model in an afternoon with help from Claude Code. I believe fine-tuned encoder models should not be counted out for well-defined tasks like intent or sentiment classification, and that it is our job as ML developers to solve problems with the most efficient tools available, rather than wielding the blowtorch of a generative model just because we can. I also just enjoy working with BERTs. I hope you find this useful as both a case study and a practical tool.&lt;/p&gt;

&lt;p&gt;I compared DistilBERT with GPT-4o mini for this case study because cursory research suggested it is the model that best fits the use case and price point of intent classification, and that it is used for this task in real-world scenarios. If you are aware of other models or methodologies that might be more appropriate for comparison, please let me know so I can benchmark this work more comprehensively.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Generic language models handle intent classification the same way they handle everything else: with general reasoning over general knowledge. For a fixed, well-defined taxonomy, such as the nine intent categories that cover roughly 95% of all e-commerce customer support volume, that generality is waste. You pay for reasoning you don't need, and you get latency you can't afford.&lt;/p&gt;

&lt;p&gt;Most e-commerce helpdesks route tickets manually, or use keyword rules that break on spelling errors and informal phrasing. The LLM alternative (prompt GPT-4o mini with your intent list) works, but at $0.015 per 1,000 calls and 450ms P95 latency, it's expensive and slow for a task that a smaller, purpose-built model can handle better. This post documents building that smaller model: a fine-tuned DistilBERT that classifies e-commerce support tickets into nine intent categories.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;Training data came from two Bitext LLM chatbot training datasets (retail + customer support, CDLA-Sharing 1.0 license) totaling 71,756 examples. After deduplication and label normalization, 61,445 examples remained across nine canonical intent classes. The source datasets use 68 fine-grained labels, which were mapped to the nine-class taxonomy below. Class imbalance was capped at 4:1 (largest class to smallest) to prevent the dominant classes (OTHER, WISMO, ACCOUNT_ISSUE) from overwhelming the smaller ones. Final split: 49,156 train / 6,144 val / 6,145 test (80/10/10), stratified by class.&lt;/p&gt;
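
&lt;p&gt;As a rough illustration of the capping and stratified split described above, here is a minimal sketch in plain Python. The function name &lt;code&gt;cap_and_split&lt;/code&gt;, the reading of 4:1 as "largest class to smallest", the 80/10/10 fractions (inferred from the split sizes), and the seed are all assumptions for illustration, not the pipeline's actual code.&lt;/p&gt;

```python
import random
from collections import defaultdict


def cap_and_split(examples, max_ratio=4, fractions=(0.8, 0.1, 0.1), seed=0):
    """Cap class imbalance, then split each class into train/val/test.

    examples: list of (text, label) pairs.
    Hypothetical helper; names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # No class may exceed max_ratio times the size of the smallest class.
    cap = max_ratio * min(len(rows) for rows in by_label.values())

    train, val, test = [], [], []
    for rows in by_label.values():
        rng.shuffle(rows)
        rows = rows[:cap]
        n_train = int(len(rows) * fractions[0])
        n_val = int(len(rows) * fractions[1])
        # Splitting per class keeps the class mix the same in every split.
        train += rows[:n_train]
        val += rows[n_train:n_train + n_val]
        test += rows[n_train + n_val:]
    return train, val, test


# Toy data: one dominant class and two small ones.
toy = (
    [("a%d" % i, "OTHER") for i in range(1000)]
    + [("b%d" % i, "WISMO") for i in range(100)]
    + [("c%d" % i, "CANCEL_ORDER") for i in range(50)]
)
train, val, test = cap_and_split(toy)
print(len(train), len(val), len(test))
```

&lt;p&gt;On the toy data, OTHER is cut from 1,000 to 200 examples (4 times the 50 CANCEL_ORDER examples) before splitting, so the smallest class still shows up in all three splits.&lt;/p&gt;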

&lt;p&gt;&lt;strong&gt;Important caveat on the evaluation numbers:&lt;/strong&gt; The Bitext datasets are themselves synthetically generated from a fixed set of templates per intent. Training and test examples share the same template distribution, which produces artificially high held-out metrics. Real-world accuracy on production customer tickets — with typos, multi-intent messages, and domain-specific jargon — is estimated at 87–93%. The benchmark numbers below are valid for this data distribution and useful for comparing model architectures, but should not be taken as predictions of production accuracy without validation on your own ticket data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why DistilBERT Over GPT-4o Mini
&lt;/h2&gt;

&lt;p&gt;For fixed-label classification with a well-defined taxonomy, encoder-only models have three structural advantages over generative LLMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; No autoregressive decoding. A single forward pass through six transformer layers produces the classification. P95 on Modal A10G: 4ms warm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost.&lt;/strong&gt; Inference cost scales with GPU time per call, not per token. At $1.10/hr for an A10G and ~4ms per call, the cost is roughly $0.001 per 1,000 calls vs. $0.015 for GPT-4o mini zero-shot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy on narrow domains.&lt;/strong&gt; Fine-tuning on domain-specific data consistently outperforms zero-shot prompting for fixed-label tasks. The model learns the specific language patterns of your taxonomy rather than reasoning about it from scratch on every call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff: a fine-tuned encoder is locked to its training taxonomy. Adding a new intent class requires retraining. For e-commerce support — where WISMO, returns, and exchanges account for 60%+ of ticket volume and the label set is stable — this is an acceptable constraint.&lt;/p&gt;




&lt;h2&gt;
  
  
  Training Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base model:&lt;/strong&gt; &lt;code&gt;distilbert-base-uncased&lt;/code&gt; (66M parameters, 6 layers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task:&lt;/strong&gt; 9-class sequence classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt; NVIDIA RTX 4080 Laptop GPU (12.9GB VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training time:&lt;/strong&gt; 10.9 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Epochs:&lt;/strong&gt; 8 with early stopping (patience=3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch size:&lt;/strong&gt; 32&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning rate:&lt;/strong&gt; 2e-5 with cosine schedule and 10% warmup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight decay:&lt;/strong&gt; 0.01&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max token length:&lt;/strong&gt; 128 (P99 of training data is 24 tokens — these are short messages)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed precision:&lt;/strong&gt; fp16&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework:&lt;/strong&gt; HuggingFace Transformers 4.47.0&lt;/li&gt;
&lt;/ul&gt;
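
&lt;p&gt;To make the learning-rate bullet concrete, here is a small sketch of a cosine schedule with 10% linear warmup, written in plain Python. The function &lt;code&gt;lr_at_step&lt;/code&gt; is hypothetical, and the step count assumes all 8 epochs run on the 49,156 training examples; early stopping would end the run sooner, so treat the total as an upper bound.&lt;/p&gt;

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_fraction=0.10):
    """Learning rate at a given optimizer step (illustrative sketch)."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step in range(warmup_steps):
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


steps_per_epoch = math.ceil(49_156 / 32)  # batch size 32
total_steps = steps_per_epoch * 8         # 8 epochs, no early stop assumed
warmup_steps = int(total_steps * 0.10)

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at_step(step, total_steps))
```

&lt;p&gt;The rate climbs to the 2e-5 peak at the end of warmup and then decays smoothly toward zero at the final step.&lt;/p&gt;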

&lt;p&gt;The model is fine-tuned end-to-end with no frozen layers. DistilBERT's classification head (a small feed-forward head over the [CLS] token representation: a hidden linear layer with ReLU and dropout, followed by a linear output layer) learns the mapping from the distilled BERT representation to the nine intent classes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Set Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Weighted F1&lt;/th&gt;
&lt;th&gt;P95 Latency&lt;/th&gt;
&lt;th&gt;Cost / 1k calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EcomIntent DistilBERT (ours)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.92%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9992&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.001&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o mini (zero-shot)&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;td&gt;0.840&lt;/td&gt;
&lt;td&gt;450ms&lt;/td&gt;
&lt;td&gt;$0.015&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o mini (5-shot)&lt;/td&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;0.875&lt;/td&gt;
&lt;td&gt;700ms&lt;/td&gt;
&lt;td&gt;$0.045&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forethought Triage&lt;/td&gt;
&lt;td&gt;~88.5%&lt;/td&gt;
&lt;td&gt;~0.880&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;$30k+/yr flat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;GPT-4o mini baselines are taken from published benchmarks on intent classification tasks rather than measured on this test split, so the comparison is indicative, not head-to-head. EcomIntent numbers are on the held-out Bitext test split. See the caveat above regarding real-world generalization.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-Class F1
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Test examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;WISMO&lt;/td&gt;
&lt;td&gt;0.9989&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;0.9979&lt;/td&gt;
&lt;td&gt;947&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RETURN_REQUEST&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EXCHANGE_REQUEST&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;378&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CANCEL_ORDER&lt;/td&gt;
&lt;td&gt;0.9979&lt;/td&gt;
&lt;td&gt;0.9958&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;236&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DAMAGED_ITEM&lt;/td&gt;
&lt;td&gt;0.9989&lt;/td&gt;
&lt;td&gt;0.9979&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;469&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BILLING_DISPUTE&lt;/td&gt;
&lt;td&gt;0.9985&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;0.9970&lt;/td&gt;
&lt;td&gt;677&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PRODUCT_QUESTION&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;664&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCOUNT_ISSUE&lt;/td&gt;
&lt;td&gt;0.9995&lt;/td&gt;
&lt;td&gt;0.9989&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;947&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTHER&lt;/td&gt;
&lt;td&gt;0.9984&lt;/td&gt;
&lt;td&gt;0.9979&lt;/td&gt;
&lt;td&gt;0.9989&lt;/td&gt;
&lt;td&gt;947&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;EXCHANGE_REQUEST and RETURN_REQUEST — the historically confused pair — are cleanly separated. The model learned that exchange intent requires explicit mention of a different variant (size, color) rather than just dissatisfaction with the received item.&lt;/p&gt;
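
&lt;p&gt;The per-class table can be cross-checked against the headline numbers. The sketch below uses only figures copied from that table: it recovers approximate miss counts from recall and support, then recomputes accuracy and weighted F1. Rounding the recovered counts to the nearest integer is an assumption, since the table reports four decimals.&lt;/p&gt;

```python
# (intent, f1, recall, test_examples), copied from the per-class table
rows = [
    ("WISMO", 0.9989, 0.9979, 947),
    ("RETURN_REQUEST", 1.0000, 1.0000, 880),
    ("EXCHANGE_REQUEST", 1.0000, 1.0000, 378),
    ("CANCEL_ORDER", 0.9979, 1.0000, 236),
    ("DAMAGED_ITEM", 0.9989, 1.0000, 469),
    ("BILLING_DISPUTE", 0.9985, 0.9970, 677),
    ("PRODUCT_QUESTION", 1.0000, 1.0000, 664),
    ("ACCOUNT_ISSUE", 0.9995, 1.0000, 947),
    ("OTHER", 0.9984, 0.9989, 947),
]

total = sum(n for _, _, _, n in rows)

# Misses per class: examples of that class given some other label.
misses = {name: round((1 - recall) * n) for name, _, recall, n in rows}
errors = sum(misses.values())

accuracy = 1 - errors / total
# Weighted F1: per-class F1 averaged with test-set support as the weight.
weighted_f1 = sum(f1 * n for _, f1, _, n in rows) / total

print(total, errors)
print(round(accuracy * 100, 2), round(weighted_f1, 4))
```

&lt;p&gt;This prints 6145 test examples with 5 recovered misses, then 99.92 and 0.9992, which agrees with the test-set table.&lt;/p&gt;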

&lt;h3&gt;
  
  
  Confusion Matrix
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2FJohnBolognino%2Fecomintent-distilbert%2Fresolve%2Fmain%2Fconfusion_matrix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2FJohnBolognino%2Fecomintent-distilbert%2Fresolve%2Fmain%2Fconfusion_matrix.png" alt="Confusion Matrix" width="800" height="708"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagonal is nearly solid. The only meaningful off-diagonal mass is a handful of CANCEL_ORDER examples predicted as OTHER (2 out of 236), which on inspection were ambiguous messages that could reasonably be either class.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost and Latency Deep Dive
&lt;/h2&gt;

&lt;p&gt;The cost calculation is straightforward. Modal's A10G GPU costs $1.10/hr. At 4ms P95 latency per call with scale-to-zero, the math is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$1.10 / 3600 seconds = $0.000306 per GPU-second
4ms per call = 0.004 seconds per call
Cost per call = $0.000306 × 0.004 = $0.0000012
Cost per 1,000 calls = $0.0012
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 100,000 calls/month (a mid-size helpdesk), that's about $0.12/month of GPU time vs. $1.50/month for GPT-4o mini zero-shot, and the fine-tuned model is more accurate on this specific task.&lt;/p&gt;
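
&lt;p&gt;The arithmetic above can be reproduced in a few lines of Python. This is a sketch under a loud assumption: the GPU is billed only for the 4ms each call takes, so idle time while a container stays warm and cold starts are not counted. The helper name &lt;code&gt;cost_per_1k_calls&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```python
def cost_per_1k_calls(gpu_usd_per_hour, seconds_per_call):
    """GPU cost of 1,000 calls, counting only time spent on the calls."""
    usd_per_gpu_second = gpu_usd_per_hour / 3600
    return usd_per_gpu_second * seconds_per_call * 1000


calls_per_month = 100_000

# Figures from the post: A10G at $1.10/hr, 4ms per call,
# GPT-4o mini zero-shot at $0.015 per 1,000 calls.
distilbert_per_1k = cost_per_1k_calls(1.10, 0.004)
distilbert_monthly = distilbert_per_1k * calls_per_month / 1000
gpt4o_mini_monthly = 0.015 * calls_per_month / 1000

print(round(distilbert_per_1k, 4))
print(round(distilbert_monthly, 2), round(gpt4o_mini_monthly, 2))
```

&lt;p&gt;Under that assumption the result is roughly $0.0012 per 1,000 calls, or about $0.12 for 100,000 calls, against $1.50 for GPT-4o mini zero-shot. Real bills will be higher once idle warm time is included.&lt;/p&gt;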

&lt;p&gt;Cold start latency (container spin-up from idle) is approximately 1.7 seconds. For latency-sensitive applications, set &lt;code&gt;scaledown_window&lt;/code&gt; higher to keep the container warm.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English only.&lt;/strong&gt; The model was trained exclusively on English-language examples. Performance on Spanish, French, or other languages is untested and likely poor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single intent per message.&lt;/strong&gt; V1 assigns the highest-probability class. Messages containing multiple intents (e.g., "my order arrived damaged and I want a refund") get one label — the dominant signal wins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template distribution.&lt;/strong&gt; As noted, training data is synthetic. A model trained purely on synthetic data may underperform on edge cases that don't appear in the template inventory: highly informal phrasing, non-standard spelling, or industry-specific jargon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static taxonomy.&lt;/strong&gt; Adding or modifying intent classes requires retraining on new data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Reproduce It
&lt;/h2&gt;

&lt;p&gt;The full training pipeline — data download, preprocessing, fine-tuning, evaluation, and Modal deployment — is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/jcbolo72012/ecomintent-api" rel="noopener noreferrer"&gt;https://github.com/jcbolo72012/ecomintent-api&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model weights:&lt;/strong&gt; &lt;a href="https://huggingface.co/JohnBolognino/ecomintent-distilbert" rel="noopener noreferrer"&gt;https://huggingface.co/JohnBolognino/ecomintent-distilbert&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live API (RapidAPI):&lt;/strong&gt; &lt;a href="https://rapidapi.com/john-UG9kfZiW5/api/ecomintent-e-commerce-intent-classifie" rel="noopener noreferrer"&gt;https://rapidapi.com/john-UG9kfZiW5/api/ecomintent-e-commerce-intent-classifie&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;

&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JohnBolognino/ecomintent-distilbert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;where is my order, it has been 5 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# [{'label': 'WISMO', 'score': 0.9998}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training takes under 15 minutes on any 12GB+ GPU. The pipeline handles data download, label normalization, tokenization analysis, training with early stopping, test set evaluation, and Modal deployment end to end.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>api</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
