Anupam Kumar

Posted on May 20 • Edited on Jun 4

Inside Hoovik: Building a Real-Time Multimodal Emotion AI Pipeline

#ai #webdev #python #showdev

When I started building Hoovik — a distributed video conferencing platform — I expected WebRTC signaling and transcription pipelines to be the hardest problems.

They weren’t.

The real engineering challenge was building a production-ready real-time multimodal emotion inference engine capable of processing live video meetings under strict latency constraints.

Unlike offline ML systems, live meeting environments are unstable by default:

microphones get muted
webcams disappear
packet loss spikes randomly
lighting changes constantly
CPU pressure builds under burst traffic

And unlike research notebooks, production systems need to survive all of it without freezing the application loop.

This article breaks down how I designed Hoovik’s emotion recognition backend using:

FastAPI
PyTorch
MediaPipe
XGBoost
Isolation Forests
Socket.IO
Azure compute nodes

The production configuration currently operates with:

seq_len = 10
audio_dim = 1024
face_dim = 326
d_model = 256
nhead = 8
encoder_layers = 3

🏗️ Multi-Cloud System Architecture

To separate lightweight websocket orchestration from heavy ML workloads, Hoovik runs on a distributed multi-cloud topology.

Render

Handles:

React frontend
Express backend
Socket.IO signaling
room state coordination

Azure VMs

Handles:

emotion inference
Whisper transcription
feature extraction
model execution

This separation keeps signaling latency stable while allowing compute services to scale independently.

⚡ Solving Python GIL Bottlenecks

The emotion_service runs as an asynchronous FastAPI + Socket.IO server.

A naive implementation quickly breaks under load because:

JPEG decoding is CPU-heavy
MediaPipe blocks aggressively
transformer forward passes stall the event loop
concurrent feature extraction creates backpressure

Instead of running everything inside one async path, I isolated the workload into dedicated executor pools.

                 +---------------------------------------+
                 |        Per-Participant Sockets        |
                 +---------------------------------------+
                    /                                 \
  [audio_chunk (Float32 PCM)]                      [emotion.frame (JPEG)]
                  /                                     \
                  v                                     v

        +----------------------+      +-----------------------+
        | _audio_executor (2T) |      | _face_executor (2T)   |
        | - Wav2Vec2 Features  |      | - MediaPipe Tracking  |
        +----------------------+      +-----------------------+
                    \                     /
                     \                   /
                      ---> Normalization --->
                                   |
                                   v
                     +---------------------------+
                     | _inference_executor (1T)  |
                     | - Transformer             |
                     | - XGBoost                 |
                     | - Isolation Forest        |
                     +---------------------------+
                                   |
                                   v
                           emotion.result

Worker Pools

`_audio_executor`

Handles:

PCM deserialization
audio normalization
Wav2Vec2 embedding extraction

`_face_executor`

Handles:

JPEG decoding
OpenCV preprocessing
MediaPipe landmark tracking

`_inference_executor`

Runs:

PyTorch inference
XGBoost evaluation
anomaly detection

Inference is intentionally serialized on a single worker thread, so model state is never accessed concurrently and no explicit locking is needed around it.

🔄 Graceful Modality Degradation

Real meetings are noisy and inconsistent.

Users:

disable webcams
mute microphones
lose packets
reconnect mid-session

If one stream disappears for more than 0.4 seconds, the backend automatically switches between three execution profiles:

both
audio_only
video_only

This prevents runtime crashes and preserves stable predictions during degraded sessions.
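The profile switch can be sketched as a pure function over "last seen" timestamps. The 0.4-second limit is from the text above; the function name `select_profile` and the choice to return `None` when both streams are stale are assumptions made for illustration.

```python
from typing import Optional

STALE_AFTER_S = 0.4  # a stream silent for longer than this counts as missing


def select_profile(
    now: float,
    last_audio: Optional[float],
    last_video: Optional[float],
) -> Optional[str]:
    """Return 'both', 'audio_only', 'video_only', or None if nothing is fresh."""
    audio_ok = last_audio is not None and (now - last_audio) <= STALE_AFTER_S
    video_ok = last_video is not None and (now - last_video) <= STALE_AFTER_S
    if audio_ok and video_ok:
        return "both"
    if audio_ok:
        return "audio_only"
    if video_ok:
        return "video_only"
    return None  # assumption: skip inference entirely when both streams are stale
```

In a live service `now` would typically come from `time.monotonic()`, so that wall-clock adjustments cannot trigger a false switch.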

🎛️ Feature Engineering Pipeline

The engine combines synchronized audio and facial features into aligned temporal embeddings.

🎤 Audio Pipeline

Incoming audio chunks are:

resampled to 16 kHz
normalized
processed using:

audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim

The system extracts:

last hidden state embeddings
mean pooled temporal vectors
1024-dimensional acoustic features

using centered 0.6-second windows.
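Two of the steps above, windowing and mean pooling, are simple enough to show without loading the model. This is a plain-Python sketch: the helper names are hypothetical, and zero-padding at the clip edges is an assumption, since the article does not say how edges are handled.

```python
SAMPLE_RATE = 16_000  # Hz, after resampling
WINDOW_S = 0.6        # window length in seconds


def centered_window(samples: list[float], center: int) -> list[float]:
    """Return WINDOW_S seconds of audio centered on sample index `center`."""
    half = round(SAMPLE_RATE * WINDOW_S) // 2  # 4800 samples on each side
    out = []
    for i in range(center - half, center + half):
        # Assumption: positions outside the clip are zero-padded.
        out.append(samples[i] if 0 <= i < len(samples) else 0.0)
    return out


def mean_pool(hidden_states: list[list[float]]) -> list[float]:
    """Average a (time, dim) sequence over time into a single dim-sized vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]
```

With the real model, `hidden_states` would be the last hidden state for one window, and `mean_pool` would return the 1024-dimensional acoustic vector.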

🙂 Video Pipeline

MediaPipe extracts 326 facial features per frame:

Spatial Landmarks

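136 normalized facial landmarks (stored as (x, y) pairs, this would give 272 values; with the 51 blendshapes and 3 head-pose angles below, that accounts for the 326 total)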

Blendshapes

51 facial muscle activations

Head Pose

pitch
yaw
roll

All landmarks are normalized relative to inter-ocular distance so the model remains invariant to camera proximity.
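A minimal sketch of inter-ocular normalization on 2D points. The eye landmark indices are passed in as parameters because the article does not name them, and centering on the midpoint between the eyes is an added assumption (the text only mentions scaling).

```python
import math

Point = tuple[float, float]


def normalize_landmarks(points: list[Point], left_eye: int, right_eye: int) -> list[Point]:
    """Center on the eye midpoint and divide by the inter-ocular distance."""
    lx, ly = points[left_eye]
    rx, ry = points[right_eye]
    iod = math.hypot(rx - lx, ry - ly)
    if iod == 0.0:
        raise ValueError("degenerate face: eye landmarks coincide")
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    return [((x - cx) / iod, (y - cy) / iod) for x, y in points]
```

The same face seen at half the size and at a different position in the frame produces the same normalized coordinates, which is the invariance the paragraph above describes.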

The backend stores rolling temporal sequences of:

seq_len = 10

allowing the model to track facial motion across time.
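A rolling window like this maps naturally onto `collections.deque` with a `maxlen`. The class name `RollingSequence` and its methods are hypothetical; only `seq_len = 10` and `face_dim = 326` come from the article.

```python
from collections import deque

SEQ_LEN = 10
FACE_DIM = 326


class RollingSequence:
    """Keeps the most recent `seq_len` feature vectors for one participant."""

    def __init__(self, seq_len: int = SEQ_LEN, dim: int = FACE_DIM):
        self.dim = dim
        self.frames = deque(maxlen=seq_len)  # the oldest frame drops off automatically

    def push(self, features: list[float]) -> None:
        if len(features) != self.dim:
            raise ValueError(f"expected {self.dim} features, got {len(features)}")
        self.frames.append(list(features))

    def ready(self) -> bool:
        # True once a full window of frames is available.
        return len(self.frames) == self.frames.maxlen

    def as_sequence(self) -> list[list[float]]:
        # Oldest frame first, shape (seq_len, dim) once ready.
        return list(self.frames)
```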

🧠 Brain A — EmotionTransformer

The primary deep learning model is a custom multimodal transformer architecture using:

projection layers
temporal convolutions
bidirectional cross-attention
gated modality fusion

Model Configuration

d_model = 256
nhead = 8
encoder_layers = 3

The network learns bidirectional attention:

visual queries attend to acoustic features
acoustic queries attend to visual features

A learned cross_gate dynamically balances voice tone against facial motion.
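The article does not give the formula behind `cross_gate`. Purely as an illustration of what gated fusion generally looks like, here is one common form: a sigmoid gate computed from both modality vectors, used to blend them. The scalar gate, the parameter names, and the function `gated_fusion` are all assumptions, not the model's actual layer.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    audio: list[float],
    video: list[float],
    gate_w: list[float],
    gate_b: float,
) -> tuple[list[float], float]:
    """g = sigmoid(w . [audio; video] + b); fused = g * audio + (1 - g) * video."""
    concat = audio + video
    g = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)
    fused = [g * a + (1.0 - g) * v for a, v in zip(audio, video)]
    return fused, g
```

With zero weights and bias the gate sits at 0.5 and the output is a plain average; training would move the gate toward whichever modality is more informative for a given input.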

📚 Curriculum Training Strategy

Training multimodal systems becomes unstable when modalities disappear randomly.

To solve this, I trained the network in 3 phases.

Phase A — Full Bimodal Learning

Epochs: 1–15

The model trains only on complete:

audio
video

pairs.

This forces cross-attention layers to learn joint representations first.

Phase B — Missing Modality Robustness

Epochs: 16–55

The training pipeline introduces:

audio-only samples
video-only samples
mixup augmentation
modality dropout

This teaches the model to keep producing stable embeddings even when one stream disappears entirely.

Phase C — Calibration & Stabilization

Epochs: 56–90

Final training uses:

reduced learning rate
SWA (Stochastic Weight Averaging)

to smooth final weights and improve generalization.
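The three phases can be written down as a small epoch-to-phase lookup. The epoch boundaries are the ones listed above; the function names and the flag names in the returned dict are hypothetical. The article does not say whether the Phase B augmentations continue into Phase C, so this sketch leaves them off there.

```python
def phase_for_epoch(epoch: int) -> str:
    """Map a 1-based epoch number to curriculum phase 'A', 'B', or 'C'."""
    if not 1 <= epoch <= 90:
        raise ValueError("epoch must be in 1..90")
    if epoch <= 15:
        return "A"
    if epoch <= 55:
        return "B"
    return "C"


def phase_settings(epoch: int) -> dict:
    phase = phase_for_epoch(epoch)
    return {
        "phase": phase,
        "bimodal_only": phase == "A",        # complete audio+video pairs only
        "missing_modality_aug": phase == "B",  # unimodal samples, mixup, modality dropout
        "reduced_lr_and_swa": phase == "C",  # lower learning rate + weight averaging
    }
```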

Training currently uses:

batch_size = 64
mixup_alpha = 0.166
modality_drop_prob = 0.068
label_smoothing = 0.077
grad_clip = 1.0
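To make two of these hyperparameters concrete, here is a standard-library sketch of mixup and modality dropout. Mixup is shown in its usual form (a Beta-distributed blend of two samples and their labels); dropping audio or video with equal probability is an assumption, as the article only gives the overall drop probability.

```python
import random

MIXUP_ALPHA = 0.166
MODALITY_DROP_PROB = 0.068


def mixup(x1, x2, y1, y2, rng: random.Random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(MIXUP_ALPHA, MIXUP_ALPHA)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


def drop_modality(audio, video, rng: random.Random):
    """With probability MODALITY_DROP_PROB, replace one modality with None."""
    if rng.random() >= MODALITY_DROP_PROB:
        return audio, video
    # Assumption: audio and video are equally likely to be dropped.
    if rng.random() < 0.5:
        return None, video
    return audio, None
```

A small alpha such as 0.166 concentrates the Beta distribution near 0 and 1, so most mixed samples stay close to one of the two originals.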

🌲 Brain B — XGBoost Ensemble

Deep networks are excellent at latent representation learning.

But tree ensembles remain extremely effective at learning explicit statistical boundaries.

To complement the transformer, the backend engineers an 8,149-dimensional feature vector every inference cycle.

Engineered Features

Statistical Windows

rolling mean
std deviation
min/max windows

Temporal Metrics

frame deltas
OLS trend slopes
trajectory shifts

Motion Energy Features

facial asymmetry
landmark MSE
voice variance
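Several of these features are one-liners over a short window of values. The sketch below covers the window statistics, frame deltas, and the OLS trend slope for a single feature channel; the function names are hypothetical, and the actual pipeline applies such statistics across many channels to reach its full dimensionality.

```python
import statistics


def window_stats(values: list[float]) -> dict:
    """Rolling-window summary for one feature channel."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }


def frame_deltas(values: list[float]) -> list[float]:
    """Differences between consecutive frames."""
    return [b - a for a, b in zip(values, values[1:])]


def ols_slope(values: list[float]) -> float:
    """Least-squares slope of values against frame index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames for a slope")
    t_mean = (n - 1) / 2
    x_mean = sum(values) / n
    num = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den
```

For a channel that rises by 2 per frame, such as `[1.0, 3.0, 5.0, 7.0]`, every delta is 2.0 and the slope is 2.0.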

Before training, features are compressed using:

PCA → 512 dimensions

The XGBoost model handles missing modalities naturally using:

missing=np.nan

which makes degraded stream inference extremely resilient.

Current XGBoost configuration:

n_estimators = 3150
max_depth = 5
learning_rate = 0.0308
tree_method = hist
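Written as keyword arguments, the configuration above looks like the dict below; `n_estimators`, `max_depth`, `learning_rate`, `tree_method`, and `missing` are all parameters accepted by `xgboost.XGBClassifier`. The helper `with_missing` is hypothetical and only illustrates NaN-filling an absent modality. Where exactly NaNs enter relative to the PCA step is not described in the article.

```python
import math

XGB_PARAMS = {
    "n_estimators": 3150,
    "max_depth": 5,
    "learning_rate": 0.0308,
    "tree_method": "hist",
    "missing": float("nan"),  # same value as np.nan
}


def with_missing(audio_feats, video_feats, audio_dim: int, video_dim: int) -> list[float]:
    """Concatenate both feature groups, NaN-filling whichever one is absent."""
    nan = float("nan")
    audio = list(audio_feats) if audio_feats is not None else [nan] * audio_dim
    video = list(video_feats) if video_feats is not None else [nan] * video_dim
    return audio + video
```

Usage would be along the lines of `xgboost.XGBClassifier(**XGB_PARAMS)`. At each split XGBoost learns a default direction for missing values, which is why NaN-filled rows can be scored without imputation.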

📊 Feature Importance

Top predictive markers extracted from the engineered XGBoost feature space.

While the strongest features currently remain PCA-compressed latent representations, the distribution clearly shows the ensemble learning stable statistical boundaries across temporal motion patterns.

📉 Confusion Matrix

Normalized confusion matrix for the calibrated ensemble classifier.

The model performs best on:

angry
happy
neutral/calm

while softer affective states such as:

fearful
sad
disgust

show heavier overlap due to lower facial motion intensity and acoustic ambiguity.

The strongest normalized recalls were:

angry → angry = 0.81
happy → happy = 0.77
neutral/calm → neutral/calm = 0.73

🧪 Ensemble Calibration

Instead of averaging probabilities directly, Hoovik uses weighted ensemble calibration optimized using Optuna.

Final probability output:

P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost

Both models pass through:

temperature scaling
Platt calibration

before fusion.

This significantly reduced overconfident predictions and improved calibration stability across degraded modalities.
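A minimal sketch of the fusion step, assuming each model yields a probability vector over the same classes. Temperature scaling is shown in its standard form (dividing logits by T before the softmax); Platt calibration is omitted, and the function names are hypothetical. The article does not say which model's logits the temperature applies to, so it is left as a parameter.

```python
import math

W_TRANSFORMER = 0.455
W_XGBOOST = 0.545


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / T. T > 1 flattens the distribution, T < 1 sharpens it."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(p_transformer: list[float], p_xgboost: list[float]) -> list[float]:
    """Weighted average of two calibrated probability vectors."""
    return [W_TRANSFORMER * a + W_XGBOOST * b for a, b in zip(p_transformer, p_xgboost)]
```

Because the two weights sum to 1, the fused vector is still a valid probability distribution whenever both inputs are.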

🚨 Modality-Specific Anomaly Detection

Real-world meeting environments are chaotic:

poor lighting
motion blur
audio clipping
reverb
background interference

Without safeguards, models generate confident garbage predictions.

To solve this, every feature vector passes through dedicated Isolation Forest pipelines.

Active Isolation Models

iso_both
iso_audio_only
iso_video_only
iso_global_fallback

Each model is calibrated against:

modality-specific variance
explicit 10% FPR targets

The deployed thresholds currently operate at:

both = 0.0525
audio_only = 0.0884
video_only = -0.0264

The negative threshold for video_only reflects the inherently noisier variance distribution of isolated visual streams.
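One simple way to turn a false-positive-rate target into a threshold is to take a quantile of anomaly scores computed on known-good data. The sketch below assumes that approach and a score where lower means more abnormal; the article does not describe its calibration procedure, and `threshold_for_fpr` is a hypothetical helper.

```python
def threshold_for_fpr(clean_scores: list[float], target_fpr: float = 0.10) -> float:
    """Pick a threshold so that about target_fpr of clean scores fall strictly below it."""
    if not clean_scores:
        raise ValueError("need at least one score")
    ordered = sorted(clean_scores)
    k = min(int(len(ordered) * target_fpr), len(ordered) - 1)
    return ordered[k]
```

Running this separately on bimodal, audio-only, and video-only validation scores would give one threshold per profile, which is consistent with the thresholds differing in sign and size.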

📦 Anomaly Distribution

Variance spreads across bimodal and unimodal execution paths.

Samples falling below calibrated thresholds are flagged as anomalous.

If facial landmarks collapse due to:

backlighting
packet corruption
unstable camera frames

the backend emits:

{
  "anomaly": true
}

allowing the frontend to suppress unreliable emotional predictions.
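The threshold check itself is small. This sketch assumes the anomaly score follows the convention of scikit-learn's `IsolationForest.decision_function`, where lower means more abnormal; which score the service actually uses is not stated. `anomaly_payload` is a hypothetical name, and no threshold is listed for `iso_global_fallback`, so it is left out.

```python
import json

# Deployed thresholds as listed in the article.
THRESHOLDS = {
    "both": 0.0525,
    "audio_only": 0.0884,
    "video_only": -0.0264,
}


def anomaly_payload(profile: str, score: float) -> dict:
    """Flag the sample when its score falls below the profile's threshold."""
    threshold = THRESHOLDS[profile]  # raises KeyError for unknown profiles
    return {"anomaly": score < threshold}
```

For example, `json.dumps(anomaly_payload("video_only", -0.05))` produces `{"anomaly": true}`, the same shape as the event shown above.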

🛑 Real-Time Backpressure Protection

One hidden production problem was frame burst overload.

When browsers uploaded frames continuously at high FPS, the executor queues grew without bound.

To prevent memory collapse, the backend continuously monitors queue depth.

If _face_executor exceeds:

queue_depth >= 3

the backend emits a websocket backpressure event.

The frontend immediately reduces:

capture FPS
upload rate
processing pressure

until queues recover.

This dramatically stabilized long-running sessions under burst traffic.
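`ThreadPoolExecutor` does not expose its queue depth publicly, so one way to monitor it is a thin wrapper that counts unfinished tasks. This is a sketch under assumptions: `TrackedExecutor` is a hypothetical class, and "depth" here counts running plus queued tasks, which may differ from how the service defines `queue_depth`.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

MAX_QUEUE_DEPTH = 3  # threshold from the article


class TrackedExecutor:
    """ThreadPoolExecutor wrapper that counts submitted-but-unfinished tasks."""

    def __init__(self, max_workers: int):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._pending = 0

    @property
    def depth(self) -> int:
        with self._lock:
            return self._pending

    def submit(self, fn, *args) -> Future:
        with self._lock:
            self._pending += 1
        return self._pool.submit(self._run, fn, *args)

    def _run(self, fn, *args):
        try:
            return fn(*args)
        finally:
            # Decrement before the future resolves, so depth is accurate
            # by the time a caller sees the result.
            with self._lock:
                self._pending -= 1

    def should_throttle(self) -> bool:
        return self.depth >= MAX_QUEUE_DEPTH
```

A frame handler would check `should_throttle()` after each submit and, when it returns True, emit the backpressure event to that participant's socket.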

📈 Runtime Telemetry

The service exposes live telemetry endpoints:

GET /stats
GET /stats/json

tracking:

P50 latency
P90 latency
P95 latency
participant counts
queue depth
active rooms

without blocking inference workers.
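Latency percentiles like these can be computed from a bounded buffer of recent samples. The sketch below uses the nearest-rank method; the class `LatencyTracker`, the buffer size, and the snapshot keys are assumptions, not the service's actual `/stats/json` schema.

```python
import math
from collections import deque


class LatencyTracker:
    """Keeps the most recent latency samples and reports percentiles."""

    def __init__(self, max_samples: int = 1000):
        self._samples = deque(maxlen=max_samples)

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def percentile(self, p: int):
        """Nearest-rank percentile, or None when no samples exist."""
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self) -> dict:
        return {
            "p50": self.percentile(50),
            "p90": self.percentile(90),
            "p95": self.percentile(95),
        }
```

Sorting a bounded buffer on request keeps the hot path down to a single `append`, so recording a latency adds almost nothing to an inference call.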

⚙️ Runtime Performance

Early inference profiling and websocket load testing were performed locally on Apple Silicon hardware using Locust-based stress testing and concurrent Socket.IO session simulation.

Observed runtime characteristics included:

Stable low-latency multimodal inference
Sustained concurrent websocket sessions
Controlled executor queue growth under burst traffic
Successful websocket backpressure throttling
Reliable degraded-mode fallback handling (audio_only / video_only)

The calibrated ensemble currently achieves:

Ensemble test accuracy: 74.34%
Balanced test accuracy: 73.84%
Calibrated validation accuracy: 78.68%
Transformer-only accuracy: 74.25%
XGBoost-only accuracy: 66.03%

These figures use the fusion weights given under Ensemble Calibration (0.455 for the Transformer, 0.545 for XGBoost), with temperature calibration fixed at:

T = 0.3

🧩 Key Lessons Learned

1. Curriculum Learning Matters

Training only on perfect multimodal samples causes catastrophic failure when streams disappear.

Progressive modality degradation training was essential.

2. Serialized Inference Is Simpler

Trying to parallelize model execution aggressively created:

race conditions
queue instability
unpredictable latency spikes

Thread-isolated extraction + serialized inference proved significantly more stable.

3. Production Safeguards Matter as Much as Accuracy

Backpressure protection, anomaly detection, and graceful degradation ended up being just as important as raw model accuracy.

🚀 Open Source & Contributors

Hoovik is fully open-source and actively looking for contributors around:

Dockerization
Redis backplanes
horizontal scaling
telemetry infrastructure
WebRTC optimization

GitHub: https://github.com/AnupamKumar-1/Hoovik

If you enjoy systems engineering, real-time ML infrastructure, or multimodal AI pipelines, contributions are welcome.