luffyguy

Posted on • Originally published at Medium

Evaluation, Monitoring, and Model Degradation in Production AI Systems

Last post covered the implementation layer — how speech-to-text, audio emotion, and facial analysis actually run in real-time systems. This one covers what happens after deployment. How you evaluate, monitor, and catch degradation before your users do.

The Evaluation Problem

Training metrics tell you how a model performed on a static dataset. Production metrics tell you how it performs on real, messy, constantly changing inputs.

These are not the same thing. A model with 94% accuracy on your test set can drop to 78% in production within weeks — and if you’re not measuring production performance, you won’t know until someone complains.

Offline Evaluation — Before Deployment

This is your baseline. Run these before any model touches production traffic.

Held-out test sets — standard practice, but the quality of your test set matters more than its size. If your test set doesn’t represent production traffic, your metrics are fiction. A speech emotion model tested on acted datasets (RAVDESS) will report great numbers that collapse on real spontaneous speech.

Cross-validation with stratification — for clinical models, stratify by demographics. A model that works well on average but fails for specific age groups, accents, or skin tones is a liability, and a source of bias. You need to know per-group performance before deployment.

Behavioral testing (CheckList framework) — beyond aggregate metrics, test specific capabilities. Does your NER model catch medication names when they’re misspelled? Does your emotion model handle whispering? Does your face model work when the patient is wearing glasses? These targeted tests catch failure modes that aggregate accuracy hides.
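A behavioral test suite can be as simple as a table of targeted cases run in CI. A minimal sketch — the `predict_emotion` stub and its feature dict are hypothetical placeholders for your real inference call:

```python
# Behavioral tests target specific capabilities rather than aggregate accuracy.
# `predict_emotion` is a hypothetical stand-in for your model's inference call.

def predict_emotion(audio_features):
    return "neutral"  # placeholder: real code runs model inference here

BEHAVIORAL_CASES = [
    # (capability under test, input features, acceptable predictions)
    ("whispered speech", {"energy": 0.05, "pitch_var": 0.2}, {"neutral", "sad"}),
    ("loud but calm speech", {"energy": 0.9, "pitch_var": 0.1}, {"neutral", "happy"}),
]

def run_behavioral_suite():
    """Return the cases where the model's prediction falls outside the accepted set."""
    failures = []
    for name, features, acceptable in BEHAVIORAL_CASES:
        prediction = predict_emotion(features)
        if prediction not in acceptable:
            failures.append((name, prediction))
    return failures

print(run_behavioral_suite())  # [] -> every capability test passed
```

Each case documents one capability, so a failure tells you exactly which behavior regressed instead of burying it in an aggregate number.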

Adversarial testing — deliberately try to break your model. Feed it edge cases where the system breaks, ambiguous inputs, and contradictory signals. If guardrails (covered in the next post) are your safety net, adversarial testing is how you find the holes in that net before production does.

Online Evaluation — After Deployment

Once the model is live, you need a different set of metrics running continuously.

Prediction Quality Monitoring

Ground truth comparison — in systems with human-in-the-loop, every human correction is a data point. If a clinician reviews a generated SOAP note and changes the assessment, that’s a signal your model got it wrong. Track correction rates over time. If they trend upward, your model is degrading.

Confidence calibration — a model that says 0.92 confidence should be right about 92% of the time. If your model says 0.92 and is only right 70% of the time, it’s overconfident. Overconfident models are dangerous in production because downstream systems trust those scores. Plot reliability diagrams weekly. If calibration drifts, you have a problem.
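Calibration drift can be summarized in one number with expected calibration error (ECE): bin predictions by confidence and take the weighted gap between each bin's average confidence and its accuracy. A minimal NumPy sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted gap between confidence and accuracy across confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return float(ece)

# The overconfident case from the text: model says 0.92, right only 70% of the time.
scores = [0.92] * 10
hits = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(round(expected_calibration_error(scores, hits), 2))  # 0.22
```

An ECE near zero means the scores can be trusted; track it on the same cadence as your reliability diagrams.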

Inter-annotator agreement as a ceiling — if two human clinicians agree 85% of the time on a task, your model’s ceiling is roughly 85%. Don’t chase 95% accuracy on a task where humans themselves disagree at 85%. Knowing this ceiling prevents wasted optimization effort.
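Raw percent agreement overstates the ceiling when labels are imbalanced, because annotators agree by chance. Cohen's kappa corrects for that — a small sketch with illustrative labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random.
    chance = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - chance) / (1 - chance)

# Two annotators who agree on 85 of 100 labels with balanced classes:
ann_a = ["distress"] * 50 + ["calm"] * 50
ann_b = ["distress"] * 42 + ["calm"] * 8 + ["distress"] * 7 + ["calm"] * 43
print(round(cohens_kappa(ann_a, ann_b), 2))  # 0.7
```

Here 85% raw agreement is only kappa 0.7 once chance is removed — a more honest ceiling to benchmark the model against.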

Data Drift vs Concept Drift

Most teams monitor model accuracy but miss the distinction between these two. They require different fixes.

Data drift — your input distribution changed. The patients are now younger than your training set. A new clinic joined and their microphones have different audio characteristics. Accents shifted because you expanded to a new region. The model hasn't changed; the world has. This is common: real-world data is messy and unpredictable, and it almost always changes after you deploy. (A closely related term, model drift, refers to the resulting decline in the model's performance over time.)

Detection: monitor input feature distributions. Track statistical distances (KL divergence, PSI — Population Stability Index) between your training data distribution and the rolling production distribution. When these exceed a threshold, flag it.
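PSI can be computed by binning the baseline (training) distribution into deciles and comparing bin proportions against production. A sketch; the thresholds in the comment are a common convention, not a standard:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a baseline and a production sample."""
    # Decile edges come from the baseline (training) distribution.
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))

    def proportions(x):
        idx = np.searchsorted(edges, x, side="right") - 1
        idx = np.clip(idx, 0, n_bins - 1)  # out-of-range values land in edge bins
        return np.bincount(idx, minlength=n_bins) / len(x)

    e = np.clip(proportions(expected), 1e-6, None)  # floor avoids log(0)
    a = np.clip(proportions(actual), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
prod = rng.normal(0.5, 1, 10_000)  # simulated drift: mean shifted by 0.5 sigma
print(round(psi(train, prod), 3))
```

Run this per input feature on a rolling production window and alert when the value crosses your chosen threshold.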

Fix: retrain on recent data that includes the new distribution. Your model’s architecture and labeling are fine — it just hasn’t seen these inputs before. Also, build your evaluation datasets to include edge cases and to resemble real production data.

Concept drift — the relationship between inputs and outputs changed. What “clinical distress” sounds like in your patient population has shifted. New therapy techniques changed how patients express themselves. The labeling criteria evolved because clinical guidelines updated.

Detection: this is harder. Your input distribution might look stable, but accuracy drops anyway. Monitor prediction-outcome correlations over time. If the model’s predictions are becoming less predictive of actual outcomes, concept drift is likely.
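One way to operationalize this: compute the correlation between model scores and realized outcomes over sliding windows, and watch for it to fall while the input distribution stays stable. A sketch with synthetic data:

```python
import numpy as np

def rolling_outcome_correlation(scores, outcomes, window=200):
    """Correlation between model scores and realized outcomes per sliding window."""
    corrs = []
    for start in range(0, len(scores) - window + 1, window):
        s = np.asarray(scores[start:start + window], dtype=float)
        o = np.asarray(outcomes[start:start + window], dtype=float)
        corrs.append(float(np.corrcoef(s, o)[0, 1]))
    return corrs

# Synthetic illustration: the score-outcome relationship holds, then breaks down.
rng = np.random.default_rng(1)
scores = rng.random(400)
early = (scores[:200] > 0.5).astype(float)    # outcomes track the scores
late = rng.integers(0, 2, 200).astype(float)  # relationship has broken down
corrs = rolling_outcome_correlation(scores, np.concatenate([early, late]))
print(corrs)  # first window strongly correlated, second near zero
```

A strong first-window correlation followed by one near zero, with unchanged inputs, is the concept-drift signature.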

Fix: relabeling, not just retraining. You need fresh annotations under the new conceptual definitions. Retraining on old labels that reflect outdated concepts just reinforces the wrong mapping.

The critical difference: data drift means the model needs to see more. Concept drift means the model needs to learn differently. Treating concept drift as data drift — just throwing more data at it — won’t fix the problem.

Alert Design

Not every metric fluctuation is an incident. Your monitoring system needs to distinguish noise from signal.

Sliding window baselines — compare current performance against a rolling 7-day or 30-day window, not a fixed threshold. Production performance naturally fluctuates. A fixed threshold of “accuracy must stay above 90%” will either fire too often or not often enough depending on the period.

Severity tiers — not all degradation is equal. A 2% accuracy drop on a general transcription model is a watch item. A 2% drop on a safety-critical classifier that gates medication recommendations is an immediate incident.

Design your alerts in tiers. Info (log it, review weekly), Warning (investigate within 24 hours), Critical (page someone now). Map each model and metric to the appropriate tier based on what breaks if that model fails.
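Putting the rolling baseline and severity tiers together, a monitor might look like the sketch below. The tier thresholds are illustrative, not recommendations:

```python
from collections import deque
from statistics import mean

# Illustrative tier thresholds: accuracy drop vs. the rolling baseline.
TIERS = [("critical", 0.05), ("warning", 0.02), ("info", 0.005)]

class AccuracyMonitor:
    """Flag degradation against a rolling 7-day baseline instead of a fixed floor."""

    def __init__(self, window_days=7):
        self.history = deque(maxlen=window_days)

    def observe(self, daily_accuracy):
        alert = None
        if len(self.history) == self.history.maxlen:
            drop = mean(self.history) - daily_accuracy
            for tier, threshold in TIERS:
                if drop >= threshold:
                    alert = tier
                    break
        self.history.append(daily_accuracy)
        return alert

monitor = AccuracyMonitor()
for acc in [0.94, 0.93, 0.94, 0.95, 0.94, 0.93, 0.94]:
    monitor.observe(acc)          # filling the 7-day window, no alerts yet
print(monitor.observe(0.90))      # ~0.04 drop vs. baseline -> "warning"
```

Because the baseline rolls with production, the same absolute accuracy can be fine one month and an incident the next — which is exactly the behavior a fixed threshold cannot express.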

Alert fatigue is a real failure mode — if your team gets 50 alerts a day, they’ll start ignoring all of them. Tune your thresholds aggressively. Fewer, meaningful alerts beat comprehensive but noisy ones every time.

Shadow Deployments and Canary Rollouts

When you retrain a model and want to push it to production, you don’t swap it in directly. One bad deployment can degrade the experience for every user simultaneously.

Shadow mode — run the new model alongside the old one in production. Both models process the same inputs. Only the old model’s outputs are served to users. The new model’s outputs are logged and compared against the old model’s outputs and ground truth.

This tells you exactly how the new model would perform on real production traffic without any risk. Run shadow mode for a minimum of one week — ideally two — to capture enough variation in traffic patterns.

Canary rollout — after shadow mode validates the new model, route 5% of production traffic to it. Monitor all metrics on that 5% slice. If everything holds, increase to 10%, 25%, 50%, 100%. Each step gets a minimum soak period — usually 24–48 hours — before advancing.

Automatic rollback — set rollback triggers. If the canary model’s error rate exceeds the baseline model by more than a defined threshold, automatically route all traffic back to the old model. This should happen without human intervention. At 3am, you want the system to protect itself.
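The trigger itself is a simple comparison; the work is wiring it to your traffic router. A sketch, where `route_all_traffic_to` in the comment stands in for whatever deployment hook your platform provides:

```python
# Sketch of an automatic rollback check for a canary deployment.
ROLLBACK_MARGIN = 0.02  # canary may exceed baseline error rate by at most 2 points

def check_canary(baseline_errors, baseline_total, canary_errors, canary_total):
    """Return 'rollback' if the canary's error rate breaches the margin."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    if canary_rate > baseline_rate + ROLLBACK_MARGIN:
        return "rollback"  # in production: route_all_traffic_to("baseline")
    return "hold"

print(check_canary(50, 1000, 4, 50))  # canary 8% vs baseline 5% -> "rollback"
print(check_canary(50, 1000, 3, 50))  # canary 6% vs baseline 5% -> "hold"
```

In a real system you would also require a minimum canary sample size before trusting the comparison, so a single early error does not trigger a spurious rollback.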

The combination of shadow + canary + auto-rollback is how you ship model updates without shipping regressions.

Logging and Observability

When something breaks in production — and it will — you need a full trace of what happened.

Log every decision point. For a multimodal system, that means: what the VAD detected, what Whisper transcribed, what confidence the emotion model assigned, what the face model predicted, how the fusion layer resolved conflicts, what the LLM generated, and whether guardrails modified or blocked the output.

Structured logging — not print statements. Every log entry should be a structured object with a session ID, timestamp, model version, input hash, output, confidence scores, and latency. This lets you query logs programmatically. “Show me all sessions where the emotion model predicted distress with >0.8 confidence but the LLM output was positive” — you need structured data to answer this.
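A minimal JSON-lines version of such a log entry might look like this — field names are illustrative, and hashing the input has the side benefit of keeping raw PHI out of the logs:

```python
import hashlib
import json
import time

def log_decision(session_id, model_version, input_text, output, confidence, latency_ms):
    """Emit one structured log entry per decision point (JSON lines)."""
    entry = {
        "session_id": session_id,
        "timestamp": time.time(),
        "model_version": model_version,
        # Hash the input: a queryable identity without storing raw content in logs.
        "input_hash": hashlib.sha256(input_text.encode()).hexdigest()[:16],
        "output": output,
        "confidence": confidence,
        "latency_ms": latency_ms,
    }
    print(json.dumps(entry))
    return entry

entry = log_decision("sess-42", "emotion-v3.1", "patient audio chunk",
                     "distress", 0.87, 112)
```

With one such line per decision point, the cross-model query in the text becomes a filter over two fields instead of a grep through free-form print output.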

Tracing tools — LangSmith if you’re in the LangChain ecosystem. Arize Phoenix for model-level observability. OpenTelemetry for general distributed tracing. Custom logging pipelines for anything these tools don’t cover. The point is full reconstructability — given a session ID, you should be able to replay the entire decision chain.

Retention policy — in healthcare, log retention is governed by regulation (HIPAA requires 6 years minimum). Design your logging pipeline with compliance in mind from the start, not as an afterthought. This includes encryption at rest, access controls on log data, and audit trails for who accessed what.

Retraining Strategy

Models degrade. The question isn’t whether you’ll retrain — it’s when and how.

Scheduled retraining — retrain on a fixed cadence (weekly, monthly) using accumulated production data. Simple and predictable. Works well when drift is gradual.

Triggered retraining — retrain when monitoring detects a performance threshold breach. More responsive than scheduled, but requires reliable drift detection to avoid false triggers.

Continuous learning — the model incrementally learns from new data as it arrives. Most complex to implement safely. Risk of catastrophic forgetting — the model improves on recent patterns but forgets older ones. Requires careful validation before each update goes live.

For most production systems, start with scheduled retraining on a monthly cadence. Add triggered retraining once your monitoring is mature enough to detect real drift reliably. Continuous learning is an optimization for later — and many teams never need it.

Always retrain on the full dataset plus new data, not just new data. Training only on recent data causes the model to forget everything it learned before. This is the most common retraining mistake teams make.

The Feedback Loop

The most valuable signal in your entire system is what happens after the model’s output is used.

Did the clinician accept the generated note or rewrite it? Did the patient outcome improve after the system flagged distress? Did the human reviewer override the model’s assessment?

Every one of these is a labeled data point you get for free. Build the pipeline to capture these signals, feed them back into your evaluation and retraining processes, and your system gets better over time instead of slowly degrading.

The teams that build this feedback loop early end up with models that improve with scale. The teams that don’t end up retraining on the same stale dataset every month and wondering why production performance isn’t getting better.

This covers evaluation, monitoring, drift detection, deployment strategy, and retraining. Next post goes into LLM guardrails and safety — input filtering, output validation, hallucination prevention, and what the layered defense architecture looks like in regulated systems. See you there.
