
luffyguy

Posted on • Originally published at Medium

Real-Time Speech, Audio, and Facial Analysis in Production AI Systems

Last post covered multimodal fusion, temporal alignment, and conflict resolution at the architecture level. This one goes into the actual modality processing — how you handle speech, audio emotion, and facial analysis in real-time production systems.

Voice Activity Detection — Before Everything Else

Most teams jump straight to Whisper for speech-to-text. In production, you need VAD first.

Voice Activity Detection determines when someone is actually speaking versus silence versus background noise. Without it, you’re sending silent audio chunks to Whisper, wasting compute, and getting hallucinated transcriptions. Whisper is notorious for this — feed it silence and it will confidently transcribe words that were never spoken.

Silero VAD is the go-to lightweight option. Runs on CPU, sub-millisecond inference, and handles the segmentation you need — when speech starts, when it ends, and everything in between to ignore.

The pipeline order matters: raw audio → VAD → only speech segments hit the transcription model. This alone can cut your Whisper compute by 30–60% depending on how much silence and dead air exists in your audio streams. In telehealth or call center scenarios, that’s a lot of dead air.
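
To make the pipeline order concrete, here’s a minimal sketch of VAD segmentation. This uses a naive energy threshold purely as an illustrative stand-in — in production you’d use Silero VAD instead — and the frame size and threshold are arbitrary assumptions:

```python
import numpy as np

def simple_vad(audio: np.ndarray, sr: int = 16000,
               frame_ms: int = 30, energy_thresh: float = 0.01):
    """Return (start_sample, end_sample) spans with speech-like energy.

    Illustrative energy-based stand-in for a real VAD (use Silero VAD in
    production); frame size and threshold are arbitrary assumptions.
    """
    frame_len = sr * frame_ms // 1000
    n_frames = len(audio) // frame_len
    speech_frames = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))   # per-frame energy
        speech_frames.append(rms > energy_thresh)
    # Merge consecutive speech frames into contiguous segments
    segments, start = [], None
    for i, is_speech in enumerate(speech_frames):
        if is_speech and start is None:
            start = i * frame_len
        elif not is_speech and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```

Only the returned segments get forwarded to the transcription model — silence never touches the GPU.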

Speech-to-Text in Production

Whisper is the default. But which Whisper matters.

Whisper large-v3 — highest accuracy, roughly 1.5 billion parameters, too slow for real-time on a single GPU if you’re processing multiple concurrent streams.

Distil-Whisper — distilled version, 49% fewer parameters, 6x faster inference, minimal accuracy loss for English. This is what most production systems should start with.

Faster-Whisper — CTranslate2 backend, up to 4x faster than OpenAI’s implementation with the same accuracy. Uses int8 quantization by default. If you’re self-hosting Whisper, use this, not the original repo.

For real-time streaming, you can’t wait for the full utterance to finish before transcribing. You need chunked processing — typically 2–5 second windows with overlap — so words appear on screen while the user is still speaking, like live captions.

The tradeoff here: shorter chunks give faster response times but worse accuracy on word boundaries. Longer chunks improve accuracy but add latency.

The practical setup: 3-second chunks with 0.5-second overlap, running through Faster-Whisper with VAD pre-filtering. This hits the 300–500ms latency target from the previous post’s budget.
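
The chunking scheme itself is simple. Here’s a sketch of the 3-second / 0.5-second-overlap windowing described above — the defaults are the assumed values from this post, not a library API:

```python
import numpy as np

def chunk_audio(audio, sr: int = 16000,
                chunk_s: float = 3.0, overlap_s: float = 0.5):
    """Split audio into fixed windows with overlap for streaming STT.

    Sketch of the 3 s window / 0.5 s overlap setup; each window starts
    0.5 s before the previous one ended so word boundaries are covered.
    """
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)   # advance 2.5 s per window
    chunks, start = [], 0
    while start < len(audio):
        chunks.append((start, min(start + chunk, len(audio))))
        if start + chunk >= len(audio):
            break
        start += step
    return chunks
```

Each `(start, end)` span is what you’d hand to Faster-Whisper; the 0.5 s overlap means a word cut off at one window’s edge appears whole in the next.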

Handling Disfluencies

Real speech is messy. “I feel, um, like, you know, pretty good I guess.” Production systems need to decide — do you keep the disfluencies or strip them?

For clinical applications, keep them. Hesitation patterns, filler words, and self-corrections carry diagnostic signal. Increased disfluency can indicate cognitive load, anxiety, or neurological changes. Most professional settings won’t need this, but sensitive domains do.

For general applications, strip them in a post-processing step. A lightweight text cleanup model or even regex-based rules can remove fillers without losing meaning.
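
A regex-based version can be this small. The filler inventory below is an assumption you’d extend per domain, and note that a cleanup model would also handle self-corrections, which this sketch does not:

```python
import re

# Assumed filler inventory -- extend per domain. "like" is removed
# aggressively here and will also hit legitimate uses of the word.
FILLERS = r"\b(um+|uh+|er+m*|you know|i mean|like)\b"

def strip_disfluencies(text: str) -> str:
    """Remove filler words, then repair spacing and punctuation."""
    cleaned = re.sub(FILLERS, "", text, flags=re.IGNORECASE)
    cleaned = re.sub(r"(\s*,){2,}", ",", cleaned)      # collapse comma runs
    cleaned = re.sub(r"\s{2,}", " ", cleaned)          # collapse spaces
    cleaned = re.sub(r"\s+([,.!?])", r"\1", cleaned)   # no space before punct
    return cleaned.strip(" ,")
```

Running it on the example from earlier, `"I feel, um, like, you know, pretty good I guess."` comes back as `"I feel, pretty good I guess."`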

Audio Emotion Analysis

This runs on the raw audio signal, separate from transcription. You’re not analyzing what someone said — you’re analyzing how they said it.

Feature Extraction

The core features that carry emotional signal in audio:

Prosodic features — pitch (F0), pitch variability, speaking rate, rhythm patterns. Flat pitch with slow rate often maps to sadness or fatigue. High pitch variability with fast rate maps to excitement or agitation.

Spectral features — MFCCs (Mel-frequency cepstral coefficients), spectral centroid, spectral flux. These capture the timbre and tonal quality of the voice. A trembling voice has distinct spectral characteristics that differ from a steady one.

Voice quality features — jitter (pitch perturbation), shimmer (amplitude perturbation), harmonics-to-noise ratio. These capture physiological tension in the vocal cords. Stress and anxiety measurably increase jitter and shimmer.
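
To ground the feature list, here’s a toy extraction of a few prosodic features in plain NumPy. This is illustrative only — production systems use librosa or openSMILE, and a real pitch tracker (e.g. pYIN) is far more robust than this naive autocorrelation:

```python
import numpy as np

def prosodic_features(audio: np.ndarray, sr: int = 16000) -> dict:
    """Toy prosodic feature extraction: loudness, noisiness, pitch.

    Illustrative sketch; the F0 estimate is a naive autocorrelation
    peak, not a production pitch tracker.
    """
    # RMS energy: overall loudness of the segment
    rms = float(np.sqrt(np.mean(audio ** 2)))
    # Zero-crossing rate: correlates with noisiness / brightness
    zcr = float(np.mean(np.abs(np.diff(np.sign(audio))) > 0))
    # Naive F0: autocorrelation peak within a 50-400 Hz search range
    ac = np.correlate(audio, audio, mode="full")[len(audio) - 1:]
    lo, hi = sr // 400, sr // 50
    lag = lo + int(np.argmax(ac[lo:hi]))
    return {"rms": rms, "zcr": zcr, "f0_hz": float(sr / lag)}
```

Tracking these per-chunk and watching their variance over time is what gives you pitch variability and speaking-rate signals, not the single-frame values themselves.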

Model Options

wav2vec 2.0 — self-supervised speech representation model. Fine-tune on emotion-labeled audio datasets (IEMOCAP, RAVDESS, MSP-IMPROV). Strong baseline for production emotion detection.

HuBERT — similar architecture to wav2vec 2.0, often slightly better on downstream emotion tasks. Facebook/Meta research origin.

SpeechBrain — open-source toolkit that wraps these models with pre-built emotion recognition recipes. Fastest path from zero to a working emotion classifier.

Custom CNN on spectrograms — convert audio to mel-spectrograms and treat emotion detection as an image classification problem. Simpler to train and debug. Lower ceiling than transformer-based approaches but surprisingly effective for binary classifications like distress vs. no-distress.
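
The spectrogram conversion that feeds such a CNN looks roughly like this. It’s a minimal NumPy sketch of the standard recipe (librosa’s `melspectrogram` is the production version); the FFT size, hop, and mel count are common defaults, not requirements:

```python
import numpy as np

def mel_spectrogram(audio, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Raw audio -> log-mel spectrogram "image" of shape (n_mels, frames)."""
    # 1. Frame the signal and apply a Hann window
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft]
                       for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # 2. Power spectrum per frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Triangular mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    # 4. Apply filterbank, log-compress -> 2D array a CNN can treat as an image
    return np.log(power @ fbank.T + 1e-10).T
```

From here the emotion task really is image classification: stack these arrays as single-channel inputs and train any small CNN on them.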

Practical Consideration

Emotion models trained on acted datasets (RAVDESS, most of IEMOCAP) perform worse on real-world spontaneous speech. The gap is significant. Acted anger sounds different from real anger. If you’re deploying in a clinical or customer service context, you need fine-tuning on naturalistic data or your precision will be poor.

Facial Analysis

Three levels of facial analysis, each with different compute costs and signal value.

Face Detection

Before you analyze anything, you need to find the face in the frame. MTCNN and RetinaFace are the standards. RetinaFace is more accurate, especially with partially occluded faces (masks, hands covering face). For real-time, run detection every 5–10 frames, not every frame — faces don’t teleport between frames. Track between detections using a lightweight tracker like SORT or ByteTrack.
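
The detect-then-track cadence is just a modulo check around the expensive call. In this sketch, `detect_face` and `track_face` are placeholder callables standing in for RetinaFace and SORT/ByteTrack:

```python
def process_video(frames, detect_face, track_face, detect_every=5):
    """Run the heavy detector every Nth frame; track in between.

    `detect_face` and `track_face` are placeholders for a real
    detector (RetinaFace) and tracker (SORT/ByteTrack).
    """
    boxes, last_box = [], None
    for i, frame in enumerate(frames):
        if i % detect_every == 0 or last_box is None:
            last_box = detect_face(frame)           # expensive, every Nth frame
        else:
            last_box = track_face(frame, last_box)  # cheap per-frame update
        boxes.append(last_box)
    return boxes
```

At `detect_every=5` the detector runs on 20% of frames, which is where most of the face-pipeline compute savings come from.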

Facial Landmark Detection

68-point or 478-point (MediaPipe) landmark detection. Maps the geometry of the face — eyebrow position, mouth corners, eye openness, jaw tension. This is what downstream expression analysis uses.

MediaPipe Face Mesh — 478 3D landmarks, runs on CPU, real-time capable even on mobile. This is the production default for most teams. Google-maintained, well-documented, and surprisingly robust.

dlib — 68 landmarks, older but battle-tested. Slightly less accurate than MediaPipe but more predictable failure modes.
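
As an example of the geometry these landmarks buy you, here’s the classic eye aspect ratio (EAR) — a standard eye-openness measure computed from the six eye landmarks in the dlib 68-point scheme:

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """Eye openness from six eye landmarks (p1..p6, dlib 68-point order).

    EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|): two vertical distances
    over the horizontal distance. Drops toward 0 as the eye closes.
    """
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return float((v1 + v2) / (2.0 * h))
```

The same pattern — simple ratios and angles over landmark coordinates — gives you eyebrow raise, mouth openness, and jaw tension features for downstream expression models.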

Facial Expression Recognition

Action Unit (AU) detection — the Facial Action Coding System (FACS) decomposes expressions into individual muscle movements. AU4 (brow lowerer) + AU15 (lip corner depressor) = sadness pattern. This is more granular and clinically useful than categorical emotion labels. Models: OpenFace 2.0, JAA-Net, or fine-tuned ResNets on AU-labeled datasets (BP4D, DISFA).

Categorical emotion classification — maps faces directly to emotion labels (happy, sad, angry, fearful, surprised, disgusted, neutral). Simpler to implement but loses nuance. A forced smile and a genuine smile both classify as “happy” — AU detection distinguishes them (genuine smiles include AU6, cheek raiser; forced smiles don’t).

For clinical applications, use AU detection. The muscle-level granularity is where the diagnostic value lives.
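
A toy version of AU-pattern matching makes the genuine-vs-forced-smile point concrete. These hard rules are a simplified illustration drawn from the FACS patterns mentioned above — real systems run learned classifiers over AU intensities, not lookup tables:

```python
# Simplified AU-combination patterns (illustrative, not exhaustive)
AU_PATTERNS = {
    "happiness_genuine": {6, 12},   # cheek raiser + lip corner puller
    "happiness_posed":   {12},      # smile without AU6
    "sadness":           {4, 15},   # brow lowerer + lip corner depressor
}

def match_expression(active_aus: set) -> str:
    """Return the most specific pattern fully contained in the active AUs."""
    hits = [name for name, aus in AU_PATTERNS.items() if aus <= active_aus]
    # Prefer the pattern requiring the most AUs (e.g. genuine over posed)
    return max(hits, key=lambda n: len(AU_PATTERNS[n]), default=None)
```

With AU6 and AU12 both active you get the genuine smile; AU12 alone falls back to the posed one — exactly the distinction a categorical happy/sad classifier throws away.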

Frame Rate and Processing

You don’t need to process every frame. Facial expressions change slowly relative to video frame rates. Processing every 3rd or 5th frame at 30fps gives you 6–10 analyses per second — more than enough to capture expression transitions.

This is a major cost optimization. At 30fps you’d process 1,800 frames per minute per patient. At every 5th frame, that drops to 360. Same clinical signal, 80% less compute.

Model Serving Strategy

Running Whisper, an emotion model, and a face model simultaneously raises a practical question: where does each model live?

GPU allocation — Whisper (especially large-v3) needs GPU. Audio emotion models are small enough for CPU if you’re using feature extraction + lightweight classifier. Face detection and landmark extraction (MediaPipe) run fine on CPU. Expression recognition models benefit from GPU but can run on CPU with acceptable latency if quantized.

The practical split for most teams: Whisper on GPU, audio emotion on CPU, face analysis on CPU (MediaPipe + quantized expression model). This lets you serve all three modalities on a single GPU instance instead of three.

Quantization — INT8 quantization through ONNX Runtime cuts inference time by 2–3x with negligible accuracy loss for most emotion and expression models. Whisper benefits from this too — Faster-Whisper uses CTranslate2 which applies quantization by default.

Batch size tuning — if you’re processing multiple concurrent sessions, batch inference requests to your GPU-resident models. A batch of 4–8 Whisper chunks processed together is significantly more efficient than 4–8 sequential single inferences. This is the difference between supporting 10 concurrent sessions and 50 on the same hardware.
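
The batching pattern is: block until one request arrives, then greedily drain whatever else is already queued so the GPU sees a batch instead of single items. A minimal sketch — names and limits here are illustrative, not a serving-framework API:

```python
from queue import Queue, Empty

def collect_batch(q: Queue, max_batch: int = 8, timeout_s: float = 0.05):
    """Drain up to max_batch pending requests from the queue.

    Blocks briefly for the first item, then grabs anything else
    already waiting without blocking again.
    """
    batch = [q.get(timeout=timeout_s)]    # wait for at least one request
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())  # take whatever is already queued
        except Empty:
            break
    return batch
```

Under light load this degrades gracefully to batch size 1; under heavy load it naturally fills to `max_batch`, which is where the 10-vs-50-sessions difference comes from.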

When to use ONNX Runtime vs native PyTorch — ONNX for any model in production inference. PyTorch for training and experimentation. ONNX Runtime with TensorRT execution provider on NVIDIA GPUs gives the best inference performance. The conversion step adds initial complexity but pays for itself immediately in latency and throughput.

Putting It Together

The full per-modality pipeline for a single audio-video input:

Raw audio → VAD (CPU, <1ms) → speech segments → Whisper (GPU, 300–500ms) → transcript + timestamps

Raw audio → feature extraction (CPU, 50ms) → emotion model (CPU, 100–200ms) → emotion label + confidence

Video frames → face detection every 5th frame (CPU, 20ms) → landmark extraction (CPU, 10ms) → expression/AU model (CPU/GPU, 50–100ms) → expression labels + confidence

All three run in parallel. Results feed into the fusion layer from the previous post. Total wall-clock time stays within the 2-second budget because nothing is waiting on anything else.
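
The fan-out itself can be as simple as a thread pool; the three analyzer arguments below are placeholder callables standing in for the real models:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(audio, frames, transcribe, audio_emotion, face_expression):
    """Run the three modality pipelines concurrently.

    Placeholder callables stand in for the real models; wall-clock
    latency is max() of the three branches, not their sum.
    """
    with ThreadPoolExecutor(max_workers=3) as pool:
        f_text = pool.submit(transcribe, audio)
        f_emo = pool.submit(audio_emotion, audio)
        f_face = pool.submit(face_expression, frames)
        return {
            "transcript": f_text.result(),
            "audio_emotion": f_emo.result(),
            "expression": f_face.result(),
        }
```

Threads work here because the heavy lifting happens inside model runtimes that release the GIL; for CPU-bound pure-Python stages you’d reach for processes instead.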

This is the implementation layer.

Next post covers evaluation, monitoring, and what happens when these models degrade in production. See you there.
