DEV Community: Mansi Somayajula

What Production ML Systems Taught Me About AI Hallucinations

Mansi Somayajula — Mon, 18 May 2026 18:16:15 +0000

Most discussions about AI hallucinations stay at the chatbot level.

“ChatGPT made up a legal case.”
“The AI invented a research paper.”
“The model confidently gave the wrong answer.”

Interesting? Sure.

But after working on production ML systems for several years, I think the bigger problem is this:

Hallucinations become far more dangerous once AI leaves the chat window and enters real systems.

Because in production environments, the issue usually isn’t obvious failure.

It’s believable failure.

And believable failure is much harder to detect.

First: Hallucinations Are Not Really Bugs

One thing that took me a while to fully appreciate while studying NLP and working with production ML systems is this:

LLMs are not designed to understand truth.

They’re designed to predict likely token sequences.

That distinction sounds subtle.
It changes everything.

At their core, Large Language Models work by predicting what word or token is statistically likely to come next based on patterns learned during training.

That means the model optimizes for:

coherence
fluency
pattern completion
conversational consistency

Not factual verification.

That’s why hallucinations are not random glitches.

They’re an emergent property of probabilistic language generation.

*Why This Gets Worse in Production
*
In experimental environments, hallucinations are annoying.

In production systems, they become operational risks.

Because production systems introduce:

stale data
schema drift
incomplete retrieval
broken pipelines
delayed synchronization
conflicting documents
partial context windows

And the model still tries to produce a coherent answer anyway.

That’s where things get dangerous.

During my experience building and monitoring ML systems, one lesson became painfully clear:

Most failures in production are silent.

The pipeline still runs.
The API still responds.
The dashboard still loads.

But the assumptions underneath the system have already drifted.

That’s exactly why hallucinations are difficult to catch at scale.

*Drift and Hallucinations Are More Connected Than People Realize
*
A lot of people think hallucinations are only an LLM problem.

I don’t think that’s true.

Hallucinations are often downstream symptoms of broader system drift.

While working on drift detection and monitoring systems, I noticed something interesting:

Model degradation rarely happens in isolation.

Usually something upstream changed first:

feature distributions shifted
missing values increased
source systems changed behavior
retrieval quality degraded
embeddings became stale
business logic evolved

The model output is often the last signal, not the first.

That changes how you should think about observability.

Monitoring token outputs alone is not enough.

You need visibility into:

data quality
retrieval pipelines
embedding freshness
vector search relevance
feature consistency
upstream dependencies

In modern AI systems, hallucinations are frequently systems problems disguised as model problems.

RAG Helps — But Introduces New Failure Modes

Retrieval-Augmented Generation (RAG) became the industry’s favorite solution to hallucinations.

And honestly, it does help.

Grounding LLMs against enterprise knowledge bases significantly improves factual consistency.

But RAG systems are not magic.

They introduce entirely new operational problems.

1. Stale Embeddings

Your documents change.
Your vector store doesn’t update correctly.
Now retrieval quality quietly degrades over time.

The model still answers confidently.
The context is simply outdated.

2. Poor Chunking

Bad chunking destroys retrieval relevance.

If semantic boundaries are weak:

context becomes fragmented
important relationships disappear
retrieval becomes noisy

The model starts filling gaps probabilistically.

That’s hallucination territory again.

3. Retrieval Ranking Problems

Sometimes the correct document exists.
The retriever simply fails to surface it.

Now the LLM generates the “most likely” answer from incomplete context.

Again:
coherent,
confident,
wrong.

4. Context Window Constraints

As systems scale, context management becomes difficult.

Too much retrieval creates noise.
Too little retrieval removes critical information.

Finding the balance is harder than most demos make it look.

The Real Problem Is Human Psychology

Honestly, I think the most overlooked part of hallucinations is not technical.

It’s psychological.

Humans are extremely vulnerable to polished language.

If something:

sounds structured
explains itself clearly
uses confident wording
resembles expert communication

…people naturally trust it more.

That’s why hallucinations are fundamentally different from traditional software failures.

Traditional systems fail loudly.

LLMs often fail persuasively.

And persuasion scales extremely well.

*AI Agents Make This Even More Serious
*
The industry is rapidly moving toward agentic systems:

autonomous workflows
tool-using agents
multi-agent orchestration
AI-triggered infrastructure actions

Now hallucinations stop being “wrong text.”

They become:

incorrect actions
flawed recommendations
workflow corruption
operational instability

That changes the engineering challenge completely.

You no longer just need good prompts.

You need:

approval gates
bounded autonomy
rollback mechanisms
audit trails
confidence scoring
human escalation paths
observability layers

In other words:
AI engineering starts looking a lot like distributed systems engineering.

What Developers Should Focus On Instead of Just Better Prompts

A lot of AI conversations today are overly prompt-centric.

Prompts matter.
But production reliability matters more.

The teams building reliable AI systems are investing heavily in:

observability
evaluation pipelines
retrieval monitoring
feedback loops
guardrails
lineage tracking
confidence calibration
human-in-the-loop systems

That’s where the real engineering work is happening now.

Not in viral demo threads.

Final Thought

I don’t think hallucinations will ever fully disappear.

Because probabilistic systems generating human language will always contain uncertainty.

The real challenge is learning how to engineer systems that:

expose uncertainty clearly
fail safely
remain observable
degrade gracefully
preserve human oversight

That’s a much harder problem than building impressive demos.

And honestly, I think it’s where the future of AI engineering is headed.

Your ML Model Is Lying to You — Here's How to Catch It.

Mansi Somayajula — Tue, 14 Apr 2026 23:40:13 +0000

The car that pulls to the right

Have you ever driven a car that slowly starts pulling to the right?

At first you don't even notice. You unconsciously make tiny steering corrections. Days pass. Weeks pass. You adapt so naturally that it feels normal.

Then someone else gets in your car, drives for thirty seconds, and immediately says — "something's wrong with this car."

You hadn't noticed. Not because you weren't paying attention. But because the change was so gradual, so slow, that your brain adjusted to it without raising an alarm.

This is exactly what happens to ML models in production.

They don't suddenly break. They drift. Slowly, quietly, consistently — until one day a user complains, a metric tanks, or someone from outside your team says "this thing isn't working right."

By then? The drift has been happening for weeks.

In this post I want to show you what drift actually looks like, why it's so easy to miss, and — most importantly — how to catch it before it catches you.

What is drift, really?

Before we get into code, let me make this concrete with something you experience every day.

The coffee shop analogy

Imagine a coffee shop near a big office building. For three years, they've had a morning rush of office workers every weekday at 8 AM. They've trained their staff, stocked their inventory, and scheduled their deliveries perfectly around this pattern.

Then the company in that office building goes fully remote. Overnight, the morning rush disappears.

The coffee shop's "model" of the world — built on three years of data — is now wrong. They're still ordering the same supplies. Still scheduling the same staff. Still expecting the same rush.

Nothing broke. The espresso machine still works. The staff still show up. But the predictions are wrong because the world changed and the model didn't.

This is data drift. The inputs to your model — the patterns, the distributions, the behaviors — have shifted away from what your model was trained on.

The weather app analogy

Now imagine a weather app trained entirely on summer data from California. Sunshine, warm temperatures, low humidity. It gets really good at predicting California summers.

Then it gets deployed to predict weather in... Minnesota. In January.

The model isn't broken. It's confidently predicting sunshine and 75°F. It's just completely wrong — because the world it's being asked to predict looks nothing like the world it learned from.

This is concept drift. The relationship between inputs and outputs has changed. What used to be true no longer is.

The gym scale analogy

One more. Imagine a scale at your gym that slowly starts reading 5 pounds lighter than it should.

You weigh yourself every Monday. The numbers look good! You're making progress! You feel great!

But the numbers are quietly, consistently wrong. Not dramatically wrong. Just... off. By a small enough margin that nothing looks suspicious — until you step on a different scale and suddenly wonder what happened.

This is measurement drift. The data coming into your system is corrupted in a subtle way — and because it's subtle, it's the hardest kind to catch.

Why drift is the sneakiest problem in ML

Here's what makes drift so dangerous compared to regular bugs.

A regular bug is obvious. The system crashes. An error gets thrown. Something visibly breaks. You know immediately that something is wrong.

Drift gives you no such signal. Everything keeps running. Predictions keep coming out. The API returns 200. Your dashboards show green. And somewhere underneath all of that apparent health, your model is getting progressively worse at its job.

I think of it like carbon monoxide. Unlike smoke, you can't see it. You can't smell it. The first sign that something's wrong might be when it's already too late. That's exactly why carbon monoxide detectors exist — not to tell you there's a problem after you're sick, but to catch the invisible signal early.

ML monitoring is your carbon monoxide detector.

The three types of drift you need to watch

Not all drift is the same. Here are the three types I've found most important to monitor:

Type 1: Input drift (the coffee shop problem)

The distribution of your input data changes over time.

Everyday example: A recommendation system trained when most users were in their 20s suddenly has to serve an older demographic. The inputs look different — different browsing patterns, different purchase histories, different time-of-day behaviors.

What to watch: Average values, variance, and distribution of your input features over time. If they start shifting away from your training distribution, your model is operating in territory it wasn't trained for.

import numpy as np
from scipy import stats
import json
from datetime import datetime

class InputDriftDetector:
    def __init__(self, reference_data: np.ndarray, feature_name: str):
        """
        reference_data: your training data distribution (sample)
        feature_name: name of the feature you're monitoring
        """
        self.reference = reference_data
        self.feature_name = feature_name
        self.alerts = []

    def check(self, current_data: np.ndarray, threshold: float = 0.05) -> bool:
        """
        Uses KS test to compare current data distribution
        against training distribution.

        p_value < threshold = distributions are significantly different = drift!
        Think of it like: "are these two groups of numbers
        drawn from the same population?"
        """
        ks_stat, p_value = stats.ks_2samp(self.reference, current_data)

        drift_detected = p_value < threshold

        result = {
            "timestamp": datetime.utcnow().isoformat(),
            "feature": self.feature_name,
            "ks_statistic": round(ks_stat, 4),
            "p_value": round(p_value, 4),
            "drift_detected": drift_detected
        }

        if drift_detected:
            self.alerts.append(result)
            print(f"⚠️  Input drift detected in '{self.feature_name}'!")
            print(f"    KS stat: {ks_stat:.4f} | p-value: {p_value:.4f}")
            print(f"    Your input distribution has shifted from training.")

        return drift_detected

# Example usage
# Simulate training distribution: mostly young users (20-35)
training_ages = np.random.normal(loc=28, scale=5, size=1000)

# Simulate current distribution: older users (40-55)
current_ages = np.random.normal(loc=47, scale=6, size=200)

detector = InputDriftDetector(training_ages, feature_name="user_age")
detector.check(current_ages)
# ⚠️  Input drift detected in 'user_age'!

Type 2: Prediction drift (the weather app problem)

The distribution of your model's outputs changes over time.

Everyday example: A customer churn classifier that used to predict "will churn" for 15% of users now predicts it for 40%. Did behavior actually change? Or is the model going off the rails?

What to watch: The distribution of predicted classes or predicted probabilities. Significant shifts — without a corresponding shift in actual outcomes — are a red flag.

class PredictionDriftDetector:
    def __init__(self, baseline_predictions: list):
        """
        baseline_predictions: list of predictions from a known-good period
        """
        self.baseline = baseline_predictions
        self.baseline_pos_rate = sum(baseline_predictions) / len(baseline_predictions)

    def check(self, current_predictions: list, tolerance: float = 0.1) -> bool:
        """
        Checks if the positive prediction rate has shifted
        significantly from the baseline.

        Like checking: "are we predicting churn for WAY more
        people than we used to? If so — something changed."
        """
        current_pos_rate = sum(current_predictions) / len(current_predictions)
        shift = abs(current_pos_rate - self.baseline_pos_rate)

        drift_detected = shift > tolerance

        print(f"📊 Prediction distribution check:")
        print(f"   Baseline positive rate: {self.baseline_pos_rate:.2%}")
        print(f"   Current positive rate:  {current_pos_rate:.2%}")
        print(f"   Shift: {shift:.2%}")

        if drift_detected:
            print(f"⚠️  Prediction drift detected! Shift exceeds {tolerance:.0%} tolerance.")
            if current_pos_rate > self.baseline_pos_rate:
                print(f"   Model is predicting positive MORE often than before.")
            else:
                print(f"   Model is predicting positive LESS often than before.")
        else:
            print(f"✅ Prediction distribution looks stable.")

        return drift_detected

# Example
baseline = [1 if x > 0.85 else 0 for x in np.random.random(1000)]  # ~15% positive rate
current  = [1 if x > 0.60 else 0 for x in np.random.random(200)]   # ~40% positive rate

detector = PredictionDriftDetector(baseline)
detector.check(current)

Type 3: Data quality drift (the broken scale problem)

The data coming into your system starts degrading in quality — missing values increase, formats change, upstream systems start sending garbage.

Everyday example: An e-commerce recommendation engine that relies on product view data. Silently, a front-end change stops logging certain types of product views. The model still runs. But it's now working with incomplete information — like a chef trying to cook a recipe when someone quietly removed half the ingredients from the kitchen without telling them.

class DataQualityMonitor:
    def __init__(self, 
                 baseline_null_rate: float,
                 baseline_value_range: tuple):
        """
        baseline_null_rate: expected % of null values (from training data)
        baseline_value_range: expected (min, max) of values
        """
        self.baseline_null_rate = baseline_null_rate
        self.value_min, self.value_max = baseline_value_range
        self.issues = []

    def check(self, data: list, feature_name: str) -> dict:
        issues_found = []

        # Check null rate
        null_rate = sum(1 for x in data if x is None) / len(data)
        if null_rate > self.baseline_null_rate * 2:  # alert if 2x baseline
            issues_found.append(
                f"Null rate: {null_rate:.1%} (baseline: {self.baseline_null_rate:.1%})"
            )

        # Check value range
        valid_values = [x for x in data if x is not None]
        if valid_values:
            actual_min = min(valid_values)
            actual_max = max(valid_values)

            if actual_min < self.value_min or actual_max > self.value_max:
                issues_found.append(
                    f"Values out of range: [{actual_min}, {actual_max}] "
                    f"(expected: [{self.value_min}, {self.value_max}])"
                )

        if issues_found:
            print(f"⚠️  Data quality issues in '{feature_name}':")
            for issue in issues_found:
                print(f"   - {issue}")
        else:
            print(f"✅ '{feature_name}' data quality looks good.")

        return {"feature": feature_name, "issues": issues_found}

# Example
monitor = DataQualityMonitor(
    baseline_null_rate=0.02,    # 2% nulls in training data
    baseline_value_range=(0, 100)  # values expected between 0 and 100
)

# Simulate degraded data: more nulls, some out-of-range values
bad_data = [None, None, None, None, None, 45, 23, 150, 67, 89, None, 34]
monitor.check(bad_data, feature_name="purchase_amount")

Putting it all together: a simple drift monitoring pipeline

Here's a minimal but real drift monitoring setup you can drop into any ML project:

from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime
import numpy as np

@dataclass
class DriftAlert:
    timestamp: str
    drift_type: str
    feature: str
    severity: str  # "low", "medium", "high"
    message: str

class MLDriftMonitor:
    """
    A simple, practical drift monitor for production ML systems.

    Think of this as your smoke detector — runs quietly in the
    background, only makes noise when something actually needs
    your attention.
    """

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.alerts: List[DriftAlert] = []
        self.check_count = 0

    def check_input_drift(self, 
                          feature_name: str,
                          reference: np.ndarray, 
                          current: np.ndarray,
                          threshold: float = 0.05):
        from scipy import stats
        _, p_value = stats.ks_2samp(reference, current)

        if p_value < threshold:
            severity = "high" if p_value < 0.01 else "medium"
            self.alerts.append(DriftAlert(
                timestamp=datetime.utcnow().isoformat(),
                drift_type="input_drift",
                feature=feature_name,
                severity=severity,
                message=f"Input distribution shifted (p={p_value:.4f})"
            ))

    def check_prediction_drift(self,
                               baseline_preds: list,
                               current_preds: list,
                               tolerance: float = 0.1):
        baseline_rate = sum(baseline_preds) / len(baseline_preds)
        current_rate = sum(current_preds) / len(current_preds)
        shift = abs(current_rate - baseline_rate)

        if shift > tolerance:
            severity = "high" if shift > 0.2 else "medium"
            self.alerts.append(DriftAlert(
                timestamp=datetime.utcnow().isoformat(),
                drift_type="prediction_drift",
                feature="model_output",
                severity=severity,
                message=f"Prediction rate shifted by {shift:.1%}"
            ))

    def check_data_quality(self,
                           feature_name: str,
                           data: list,
                           max_null_rate: float = 0.05):
        null_rate = sum(1 for x in data if x is None) / len(data)

        if null_rate > max_null_rate:
            severity = "high" if null_rate > 0.2 else "low"
            self.alerts.append(DriftAlert(
                timestamp=datetime.utcnow().isoformat(),
                drift_type="data_quality",
                feature=feature_name,
                severity=severity,
                message=f"Null rate: {null_rate:.1%} (max allowed: {max_null_rate:.1%})"
            ))

    def report(self):
        print(f"\n{'='*50}")
        print(f"Drift Monitor Report — {self.model_name}")
        print(f"{'='*50}")

        if not self.alerts:
            print("✅ No drift detected. Model looks healthy.")
            return

        high   = [a for a in self.alerts if a.severity == "high"]
        medium = [a for a in self.alerts if a.severity == "medium"]
        low    = [a for a in self.alerts if a.severity == "low"]

        print(f"Total alerts: {len(self.alerts)}")
        print(f"  🔴 High:   {len(high)}")
        print(f"  🟡 Medium: {len(medium)}")
        print(f"  🟢 Low:    {len(low)}")
        print()

        for alert in self.alerts:
            icon = {"high": "🔴", "medium": "🟡", "low": "🟢"}[alert.severity]
            print(f"{icon} [{alert.drift_type}] {alert.feature}")
            print(f"   {alert.message}")
            print(f"   Detected at: {alert.timestamp}")
            print()

# Usage example
monitor = MLDriftMonitor("churn_classifier_v2")

# Simulate some drift
training_data = np.random.normal(28, 5, 1000)
current_data  = np.random.normal(47, 6, 200)
monitor.check_input_drift("user_age", training_data, current_data)

baseline_preds = [1 if x > 0.85 else 0 for x in np.random.random(1000)]
current_preds  = [1 if x > 0.60 else 0 for x in np.random.random(200)]
monitor.check_prediction_drift(baseline_preds, current_preds)

messy_data = [None, None, 45, None, 23, None, None, 67, None, 89]
monitor.check_data_quality("purchase_amount", messy_data)

monitor.report()

When to actually trigger alerts

Not every drift signal needs to wake someone up at 2 AM. Here's how I think about severity:

🔴 High — act immediately

Input distribution shifted significantly (p < 0.01)
Prediction positive rate changed by more than 20%
Null rate above 20%

These are "pull the car over" signals. Something is seriously wrong.

🟡 Medium — investigate soon

Input drift detected but mild (0.01 < p < 0.05)
Prediction rate shifted 10–20%
Null rate 5–20%

These are "the car is pulling to the right" signals. Not an emergency, but don't ignore them.

🟢 Low — log and watch

Small shifts within acceptable ranges
Null rate slightly elevated
Minor value range violations

These are "keep an eye on it" signals. Log them. Trend them. If they keep moving in the wrong direction, escalate.

The one thing I'd suggest you do today

If you have a model in production right now — even a simple one — add this:

import json
from datetime import datetime

def log_prediction(input_data: dict, prediction, confidence: float):
    """
    The simplest possible monitoring setup.
    Just log everything. You'll thank yourself later.
    """
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "inputs": input_data,
        "prediction": prediction,
        "confidence": round(confidence, 4)
    }

    # Write to a log file (or your logging system of choice)
    with open("predictions.log", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

That's it. Just start logging. Every prediction, every input, every confidence score.

You can build sophisticated drift detection on top of logs. You cannot build it on top of nothing.

The gym scale was wrong for weeks before anyone noticed — because nobody was keeping a record to compare against.

Start the record today.

Quick recap

Three types of drift to watch:

Input drift — your data looks different from training data. Like a coffee shop where the customers completely changed.

Prediction drift — your model's outputs are shifting. Like a weather app confidently predicting sunshine in a Minnesota winter.

Data quality drift — your incoming data is degrading silently. Like a scale that slowly reads wrong.

The common thread: none of these announce themselves. You have to build the detector. The model will not tell you it's struggling. It'll just quietly get worse — until someone notices.

Build the smoke detector before the fire. 🚨

What's next

Now that we know how to catch drift — the next question is: what do LLMs break differently?

Traditional ML models drift in predictable ways. LLMs have entirely new failure modes — hallucinations that increase over time, prompt brittleness, retrieval degradation in RAG systems.

Next post: "LLMs Break Differently Than Traditional ML — Here's What Nobody Warns You About"

Found this useful? Follow for Part 3 — the unique failure modes of LLMs in production that you won't find in any tutorial. 🚀

Tried implementing drift detection? I'd love to hear what you ran into — drop it in the comments.

Nobody Tells You This About Slow Transformer Models — I Fixed Mine in 3 Steps

Mansi Somayajula — Sat, 11 Apr 2026 06:23:34 +0000

Hot take: most "my model is slow" problems are not model problems.
They're inference problems. And the ML community almost never talks about that gap.
Everyone's obsessed with architecture choices, parameter counts, quantization-aware training, distillation strategies... while the actual bottleneck is sitting right there in the inference code, completely ignored.
I know because I was doing it wrong for longer than I'd like to admit.
I had a DistilBERT classifier running at ~750ms per request in production. My first instinct was "I need a better machine." Turns out, I needed to stop processing one input at a time like it was 2015.
Here's exactly what I did — three steps, same CPU, same model — and I got it down to 280ms.

What I was building
A support ticket classifier I'm calling SupportBot. Fine-tuned DistilBERT, three classes: billing, technical, general. Great accuracy. Terrible latency.
Here's the embarrassing baseline:
python# baseline.py
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch
import time

MODEL_PATH = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_PATH)
model = DistilBertForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

def predict(text: str):
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
outputs = model(**inputs)
return outputs.logits.argmax(dim=-1).item()

texts = ["My payment didn't go through"] * 10
start = time.time()
for text in texts:
predict(text)
elapsed = (time.time() - start) * 1000

print(f"Avg: {elapsed/10:.0f}ms per request")

750ms. I wish I was joking.

750ms. Let's fix this.

Step 1: I was running a dishwasher for a single fork
Processing one text at a time means one full forward pass per request. All the overhead — loading weights into cache, spinning up computation — for one input. Over and over.
The fix is almost insulting in how simple it is: just send multiple texts through at once.

python# step1_batching.py
def predict_batch(texts: list, batch_size: int = 16) -> list:
all_predictions = []

for i in range(0, len(texts), batch_size):
    batch = texts[i : i + batch_size]

    inputs = tokenizer(
        batch,
        return_tensors="pt",
        truncation=True,
        max_length=128,
        padding=True  # pads all texts to the same length within the batch
    )

    with torch.no_grad():
        outputs = model(**inputs)

    all_predictions.extend(outputs.logits.argmax(dim=-1).tolist())

return all_predictions

texts = ["My payment didn't go through"] * 10
start = time.time()
predict_batch(texts)
elapsed = (time.time() - start) * 1000

print(f"Avg: {elapsed/10:.0f}ms per request")

480ms. Same hardware, one change.

750ms → 480ms. 36% faster. One change. I genuinely stared at the screen for a moment.

💡 What batch size? I'd start with 16. Drop to 8 if you're hitting memory limits, try 32 if you have room. Dynamic batching (queuing requests for ~30–50ms before processing) is the next level if you're building a real API.

Step 2: PyTorch was carrying bags it didn't need to
Here's something that surprised me: PyTorch is not built for inference. It's built for training.
Every inference call drags along autograd, gradient tracking, training hooks... overhead I was paying for every single request but never using. ONNX Runtime strips all of that out and applies graph-level optimizations automatically. For CPU inference, the difference is real.
First, I exported the model:

python# export_onnx.py
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

dummy_input = tokenizer(
"sample text for export",
return_tensors="pt",
max_length=128,
padding="max_length",
truncation=True
)

torch.onnx.export(
model,
(dummy_input["input_ids"], dummy_input["attention_mask"]),
"supportbot.onnx",
input_names=["input_ids", "attention_mask"],
output_names=["logits"],
dynamic_axes={
# Critical — without this, ONNX bakes in fixed shapes from the dummy input
# and rejects any request that doesn't match exactly
"input_ids": {0: "batch_size", 1: "seq_length"},
"attention_mask": {0: "batch_size", 1: "seq_length"},
"logits": {0: "batch_size"}
},
opset_version=13
)
print("Exported.")

⚠️ Don't skip dynamic_axes. I learned this the hard way. Without it you'll get cryptic shape errors in production and spend an hour wondering why requests work in your test script but fail in your API.

Then switched to running inference with ONNX Runtime:

python# step2_onnx.py
import onnxruntime as ort
import numpy as np
from transformers import DistilBertTokenizer
import time

class SupportBotClassifier:
def init(self, model_path: str, tokenizer_path: str):
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 4 # match your CPU core count

    self.session = ort.InferenceSession(
        model_path,
        sess_options=opts,
        providers=["CPUExecutionProvider"]
    )
    self.tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_path)

def predict(self, texts: list) -> list:
    inputs = self.tokenizer(
        texts,
        return_tensors="np",  # numpy directly — no PyTorch tensors needed anymore
        max_length=128,
        padding=True,
        truncation=True
    )

    logits = self.session.run(
        ["logits"],
        {
            "input_ids":      inputs["input_ids"].astype(np.int64),
            "attention_mask": inputs["attention_mask"].astype(np.int64)
        }
    )[0]

    return np.argmax(logits, axis=-1).tolist()

classifier = SupportBotClassifier("supportbot.onnx", MODEL_PATH)
texts = ["My payment didn't go through"] * 10
start = time.time()
classifier.predict(texts)
elapsed = (time.time() - start) * 1000

print(f"Avg: {elapsed/10:.0f}ms per request")

350ms.

480ms → 350ms. Another 27% faster, zero model changes.

Step 3: One function call. I'm not exaggerating.
FP32 weights for a ticket classifier is overkill. Dynamic INT8 quantization compresses those weights to 8-bit integers — smaller memory, faster CPU math, almost no accuracy loss.
python# step3_quantize.py
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
model_input="supportbot.onnx",
model_output="supportbot_quantized.onnx",
weight_type=QuantType.QInt8
)
print("Done.")
That's it. That's the whole step.
Always verify predictions didn't shift though:
pythonoriginal = SupportBotClassifier("supportbot.onnx", MODEL_PATH)
quantized = SupportBotClassifier("supportbot_quantized.onnx", MODEL_PATH)

tests = [
"My payment failed twice",
"App keeps crashing on iOS",
"Question about my plan",
"Can't log in",
"Where's my invoice?",
]

matches = sum(
o == q for o, q in
zip(original.predict(tests), quantized.predict(tests))
)
print(f"Match rate: {matches/len(tests):.0%}")

Match rate: 100%

Final benchmark:
pythonclassifier_v2 = SupportBotClassifier("supportbot_quantized.onnx", MODEL_PATH)
start = time.time()
classifier_v2.predict(["My payment didn't go through"] * 10)
elapsed = (time.time() - start) * 1000
print(f"Avg: {elapsed/10:.0f}ms per request")

280ms 🎉

350ms → 280ms.

The full picture
ChangeLatencyvs. startBaseline (PyTorch, single)~750ms—+ Batch processing~480ms36% faster+ ONNX Runtime~350ms53% faster+ INT8 quantization~280ms63% faster
Same model. Same CPU. Same hardware.

The FastAPI wrapper I'm using
python# api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import time

app = FastAPI(title="SupportBot API")
classifier = SupportBotClassifier("supportbot_quantized.onnx", MODEL_PATH)
LABELS = ["billing", "technical", "general"]

class ClassifyRequest(BaseModel):
texts: List[str]

class ClassifyResponse(BaseModel):
predictions: List[str]
latency_ms: float

@app.post("/classify", response_model=ClassifyResponse)
async def classify(req: ClassifyRequest):
if not req.texts:
raise HTTPException(400, "texts can't be empty")
if len(req.texts) > 32:
raise HTTPException(400, "max 32 per request")

start = time.time()
raw = classifier.predict(req.texts)
ms = round((time.time() - start) * 1000, 2)

return ClassifyResponse(
    predictions=[LABELS[p] for p in raw],
    latency_ms=ms
)

@app.get("/health")
async def health():
return {"status": "ok"}
bashuvicorn api:app --host 0.0.0.0 --port 8000

curl -X POST http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"texts": ["payment failed", "app crashed"]}'

{"predictions":["billing","technical"],"latency_ms":42.8}

What I'd do differently
Export to ONNX from day one. I burned time micro-optimizing PyTorch before I made the switch. Should've started there.
Check actual input length distribution early. I defaulted to max_length=128. Turns out 80% of my inputs were under 64 tokens. Dropping max_length for short inputs gave me another ~15% I left on the table.
Add confidence logging from the start, not after. Track average confidence scores over time. When they start drifting downward — your input distribution is shifting and it's time to retrain. I bolted this on late and missed some early signals I shouldn't have.

The real takeaway
Three things — in order of impact:

Batch your inputs. Never process one at a time. Biggest win, easiest fix, most commonly ignored.
Use ONNX Runtime on CPU. PyTorch is brilliant for training. For CPU serving? It's carrying too much.
Quantize before you deploy. One function call. Almost never hurts accuracy on classification tasks.

None of this is magic. It's just doing inference the right way.
The model gets the credit. The inference pipeline does the actual work.

What's next
Now that the model is fast, the next thing I ran into was keeping it honest. Models degrade silently in production — input distributions shift, confidence drops, and you usually don't notice until something's already broken.
Next up: building a 3-level drift detection system to catch model degradation before it hits production. Follow along if you don't want to miss it 🚀

Tried this? Got different numbers? I'd genuinely love to know — drop it in the comments.