When I started building Hoovik, a distributed video conferencing platform, I expected WebRTC signaling and transcription pipelines to be the hardest problems.
They weren't.
The real engineering challenge was building a production-ready real-time multimodal emotion inference engine capable of processing live video meetings under strict latency constraints.
Unlike offline ML systems, live meeting environments are unstable by default:
- microphones get muted
- webcams disappear
- packet loss spikes randomly
- lighting changes constantly
- CPU pressure builds under burst traffic
And unlike research notebooks, production systems need to survive all of it without freezing the application loop.
This article breaks down how I designed Hoovik's emotion recognition backend using:
- FastAPI
- PyTorch
- MediaPipe
- XGBoost
- Isolation Forests
- Socket.IO
- Azure compute nodes
The production configuration currently operates with:
- seq_len = 10
- audio_dim = 1024
- face_dim = 326
- d_model = 256
- nhead = 8
- 3 encoder layers
Multi-Cloud System Architecture
To separate lightweight websocket orchestration from heavy ML workloads, Hoovik runs on a distributed multi-cloud topology.
Render
Handles:
- React frontend
- Express backend
- Socket.IO signaling
- room state coordination
Azure VMs
Handles:
- emotion inference
- Whisper transcription
- feature extraction
- model execution
This separation keeps signaling latency stable while allowing compute services to scale independently.
Solving Python GIL Bottlenecks
The emotion_service runs as an asynchronous FastAPI + Socket.IO server.
A naive implementation quickly breaks under load because:
- JPEG decoding is CPU-heavy
- MediaPipe blocks aggressively
- transformer forward passes stall the event loop
- concurrent feature extraction creates backpressure
Instead of running everything inside one async path, I isolated the workload into dedicated executor pools.
            +---------------------------------------+
            |        Per-Participant Sockets        |
            +---------------------------------------+
                 /                             \
[audio_chunk (Float32 PCM)]          [emotion.frame (JPEG)]
                 /                             \
                v                               v
+----------------------+          +-----------------------+
| _audio_executor (2T) |          | _face_executor (2T)   |
| - Wav2Vec2 Features  |          | - MediaPipe Tracking  |
+----------------------+          +-----------------------+
            \                               /
             \                             /
              ---> Normalization --->
                        |
                        v
            +---------------------------+
            | _inference_executor (1T)  |
            | - Transformer             |
            | - XGBoost                 |
            | - Isolation Forest        |
            +---------------------------+
                        |
                        v
                  emotion.result
Worker Pools
_audio_executor
Handles:
- PCM deserialization
- audio normalization
- Wav2Vec2 embedding extraction
_face_executor
Handles:
- JPEG decoding
- OpenCV preprocessing
- MediaPipe landmark tracking
_inference_executor
Runs:
- PyTorch inference
- XGBoost evaluation
- anomaly detection
Inference is intentionally serialized to guarantee thread safety without adding explicit locking overhead around model state.
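A minimal sketch of this layout using only the standard library. The pool sizes follow the diagram above (2, 2, 1); the three worker functions (`extract_audio_features`, `extract_face_features`, `run_models`) and `handle_participant` are hypothetical placeholders standing in for the real Wav2Vec2, MediaPipe, and model calls, not Hoovik's actual code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Pool sizes follow the diagram: 2 audio threads, 2 face threads, 1 inference thread.
_audio_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="audio")
_face_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="face")
_inference_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="infer")


def extract_audio_features(pcm):
    # Placeholder for PCM deserialization + Wav2Vec2 embedding extraction.
    return [sum(pcm) / max(len(pcm), 1)]


def extract_face_features(jpeg_bytes):
    # Placeholder for JPEG decoding + MediaPipe landmark tracking.
    return [float(len(jpeg_bytes))]


def run_models(audio_feats, face_feats):
    # Placeholder for transformer + XGBoost + Isolation Forest.
    return {"n_audio": len(audio_feats), "n_face": len(face_feats)}


async def handle_participant(pcm, jpeg_bytes):
    loop = asyncio.get_running_loop()
    # Both extractions run concurrently, off the event loop.
    audio_feats, face_feats = await asyncio.gather(
        loop.run_in_executor(_audio_executor, extract_audio_features, pcm),
        loop.run_in_executor(_face_executor, extract_face_features, jpeg_bytes),
    )
    # A single worker thread means model calls never overlap.
    return await loop.run_in_executor(
        _inference_executor, run_models, audio_feats, face_feats
    )
```

Because `_inference_executor` has exactly one worker, model state is only ever touched from one thread, which is what removes the need for locks around it.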
Graceful Modality Degradation
Real meetings are noisy and inconsistent.
Users:
- disable webcams
- mute microphones
- lose packets
- reconnect mid-session
If one stream disappears for more than 0.4 seconds, the backend automatically switches execution profiles:
both
audio_only
video_only
This prevents runtime crashes and preserves stable predictions during degraded sessions.
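The switching rule can be sketched as a small pure function. The 0.4-second timeout and the three profile names come from the text above; the function name, the timestamp arguments, and returning `None` when no stream is live are assumptions made for illustration.

```python
STREAM_TIMEOUT_S = 0.4  # from the article: a stream counts as gone after 0.4 s


def select_profile(now, last_audio_ts, last_video_ts, timeout=STREAM_TIMEOUT_S):
    """Pick an execution profile from the last time each stream was seen.

    Timestamps are seconds on a monotonic clock; None means "never received".
    """
    audio_ok = last_audio_ts is not None and (now - last_audio_ts) <= timeout
    video_ok = last_video_ts is not None and (now - last_video_ts) <= timeout
    if audio_ok and video_ok:
        return "both"
    if audio_ok:
        return "audio_only"
    if video_ok:
        return "video_only"
    return None  # assumption: with no live stream, skip inference entirely
```

Keeping this decision separate from the model code means the same check can run on every inference cycle without touching any model state.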
Feature Engineering Pipeline
The engine combines synchronized audio and facial features into aligned temporal embeddings.
Audio Pipeline
Incoming audio chunks are:
- resampled to 16 kHz
- normalized
- processed using:
audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
The system extracts:
- last hidden state embeddings
- mean pooled temporal vectors
- 1024-dimensional acoustic features
using centered 0.6-second windows.
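To make the windowing and pooling concrete, here is a rough sketch in plain Python. The 16 kHz rate and 0.6-second window are from the text; the function names, the clamping behaviour at buffer edges, and the use of nested lists instead of tensors are assumptions. In the real pipeline the pooled vectors would be 1024-dimensional.

```python
SAMPLE_RATE = 16_000   # Hz, from the article
WINDOW_S = 0.6         # seconds, from the article


def centered_window(center_s, n_samples, sample_rate=SAMPLE_RATE, window_s=WINDOW_S):
    """Return (start, end) sample indices of a window centered on center_s,
    clamped to the available buffer."""
    half = int(round(window_s * sample_rate / 2))
    center = int(round(center_s * sample_rate))
    return max(0, center - half), min(n_samples, center + half)


def mean_pool(hidden_states):
    """Average a (time, dim) list of frame vectors into one dim-sized vector."""
    n = len(hidden_states)
    return [sum(col) / n for col in zip(*hidden_states)]
```

For example, a window centered at 1.0 s in a 2-second buffer covers samples 11200 to 20800, which is 9600 samples, or 0.6 s at 16 kHz.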
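Video Pipeline
MediaPipe extracts 326 facial features per frame. The groups below account for that total if each landmark contributes an (x, y) pair (136 × 2 + 51 + 3 = 326):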
Spatial Landmarks
- 136 normalized facial landmarks
Blendshapes
- 51 facial muscle activations
Head Pose
- pitch
- yaw
- roll
All landmarks are normalized relative to inter-ocular distance so the model remains invariant to camera proximity.
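A small sketch of inter-ocular normalization, assuming 2D landmarks as (x, y) tuples. The eye indices are passed in because the text does not say which landmarks are used, and centering on the midpoint between the eyes is an added assumption; only the division by inter-ocular distance is taken from the description above.

```python
import math


def normalize_landmarks(points, left_eye_idx, right_eye_idx):
    """Center landmarks on the midpoint between the eyes and divide by the
    inter-ocular distance, so the result does not depend on face size in frame."""
    lx, ly = points[left_eye_idx]
    rx, ry = points[right_eye_idx]
    iod = math.hypot(rx - lx, ry - ly)
    if iod == 0.0:
        raise ValueError("degenerate landmarks: eyes coincide")
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    return [((x - cx) / iod, (y - cy) / iod) for x, y in points]
```

Scaling or shifting the whole face in the frame leaves the output unchanged, which is the invariance to camera proximity the text refers to.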
The backend stores rolling temporal sequences of:
seq_len = 10
allowing the model to track facial motion across time.
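One simple way to hold such a rolling sequence is a bounded deque per participant. This is a sketch under that assumption; the `FrameSequence` class and its method names are hypothetical.

```python
from collections import deque

SEQ_LEN = 10  # from the article


class FrameSequence:
    """Keeps the most recent SEQ_LEN per-frame feature vectors for one participant."""

    def __init__(self, seq_len=SEQ_LEN):
        self._frames = deque(maxlen=seq_len)

    def push(self, features):
        self._frames.append(features)  # oldest frame is dropped automatically

    def ready(self):
        return len(self._frames) == self._frames.maxlen

    def as_sequence(self):
        return list(self._frames)  # oldest first
```

A `ready()` check like this lets the caller hold off on inference until a full window of frames has arrived.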
Brain A: EmotionTransformer
The primary deep learning model is a custom multimodal transformer architecture using:
- projection layers
- temporal convolutions
- bidirectional cross-attention
- gated modality fusion
Model Configuration
d_model = 256
nhead = 8
encoder_layers = 3
The network learns bidirectional attention:
- visual queries attend to acoustic features
- acoustic queries attend to visual features
A learned cross_gate dynamically balances voice tone against facial motion.
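The text does not spell out the form of `cross_gate`. One common pattern for gated fusion is an element-wise sigmoid gate that blends the two modality vectors, sketched below in plain Python. Treat this purely as an illustration of the idea: the function names are hypothetical, and in a trained model the gate logits would be produced by a learned layer rather than passed in.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio_vec, visual_vec, gate_logits):
    """Blend two same-sized vectors element-wise.

    gate = sigmoid(gate_logits); output = gate * audio + (1 - gate) * visual.
    """
    fused = []
    for a, v, z in zip(audio_vec, visual_vec, gate_logits):
        g = sigmoid(z)
        fused.append(g * a + (1.0 - g) * v)
    return fused
```

A gate logit of 0 weights both modalities equally, while a large positive logit lets the audio value pass through almost unchanged.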
Curriculum Training Strategy
Training multimodal systems becomes unstable when modalities disappear randomly.
To solve this, I trained the network in 3 phases.
Phase A: Full Bimodal Learning
Epochs: 1-15
The model trains only on complete audio-video pairs.
This forces cross-attention layers to learn joint representations first.
Phase B: Missing Modality Robustness
Epochs: 16-55
The training pipeline introduces:
- audio-only samples
- video-only samples
- mixup augmentation
- modality dropout
This teaches the model to preserve embeddings even when streams disappear entirely.
Phase C: Calibration & Stabilization
Epochs: 56-90
Final training uses:
- reduced learning rate
- SWA (Stochastic Weight Averaging)
to smooth final weights and improve generalization.
Training currently uses:
- batch_size = 64
- mixup_alpha = 0.166
- modality_drop_prob = 0.068
- label_smoothing = 0.077
- grad_clip = 1.0
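The phase boundaries and the modality dropout probability can be expressed as a small scheduling helper. The epoch ranges and `modality_drop_prob` value are from the text; the function names, the 50/50 choice of which side to drop, and limiting dropout to Phase B are assumptions, since the article does not say whether dropout continues into Phase C.

```python
import random

MODALITY_DROP_PROB = 0.068  # from the article


def phase_for_epoch(epoch):
    """Map a 1-based epoch to the curriculum phase described above."""
    if 1 <= epoch <= 15:
        return "A"
    if 16 <= epoch <= 55:
        return "B"
    if 56 <= epoch <= 90:
        return "C"
    raise ValueError(f"epoch {epoch} is outside the 1-90 schedule")


def maybe_drop_modality(audio, video, epoch, rng=random):
    """In Phase B only (an assumption), occasionally hide one modality."""
    if phase_for_epoch(epoch) == "B" and rng.random() < MODALITY_DROP_PROB:
        if rng.random() < 0.5:  # assumption: drop either side with equal odds
            return None, video
        return audio, None
    return audio, video
```

Passing the random source in as `rng` keeps the helper easy to test and to seed for reproducible runs.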
Brain B: XGBoost Ensemble
Deep networks are excellent at latent representation learning.
But tree ensembles remain extremely effective at learning explicit statistical boundaries.
To complement the transformer, the backend engineers an 8,149-dimensional feature vector every inference cycle.
Engineered Features
Statistical Windows
- rolling mean
- std deviation
- min/max windows
Temporal Metrics
- frame deltas
- OLS trend slopes
- trajectory shifts
Motion Energy Features
- facial asymmetry
- landmark MSE
- voice variance
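As an illustration of the statistical and temporal features listed above, the sketch below computes them for a single feature channel over one window. The function names and the exact set of outputs are assumptions, as is the use of population standard deviation; the real vector is built from many channels and is far larger.

```python
import statistics


def ols_slope(values):
    """Least-squares slope of values against their index 0..n-1 (needs n >= 2)."""
    n = len(values)
    x_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def window_features(values):
    """Summary statistics for one feature channel over a rolling window."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "mean_abs_delta": statistics.fmean([abs(d) for d in deltas]),
        "slope": ols_slope(values),
    }
```

Applied to every channel of a 10-frame window, a handful of statistics like these multiplies out quickly, which is how an engineered vector reaches thousands of dimensions.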
Before training, features are compressed using:
PCA → 512 dimensions
The XGBoost model handles missing modalities naturally using:
missing=np.nan
which makes degraded stream inference extremely resilient.
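The idea behind `missing=np.nan` is that a missing modality is encoded as NaN columns, and XGBoost learns a default branch direction for missing values at each split. A minimal sketch of the encoding step is shown below. It uses the raw `audio_dim` and `face_dim` sizes to stay short, not the full 8,149-dimensional engineered vector, and `assemble_features` is a hypothetical helper; where exactly this happens in Hoovik's pipeline is not stated.

```python
import math

AUDIO_DIM = 1024  # from the article
FACE_DIM = 326    # from the article


def assemble_features(audio_feats, face_feats):
    """Concatenate per-modality features, filling an absent modality with NaN
    so a NaN-aware model can treat those columns as missing."""
    audio = list(audio_feats) if audio_feats is not None else [math.nan] * AUDIO_DIM
    face = list(face_feats) if face_feats is not None else [math.nan] * FACE_DIM
    if len(audio) != AUDIO_DIM or len(face) != FACE_DIM:
        raise ValueError("unexpected feature length")
    return audio + face
```

The vector keeps a fixed length in every execution profile, so the same model can serve `both`, `audio_only`, and `video_only` requests.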
Current XGBoost configuration:
- n_estimators = 3150
- max_depth = 5
- learning_rate = 0.0308
- tree_method = hist
Feature Importance
Top predictive markers extracted from the engineered XGBoost feature space.
While the strongest features currently remain PCA-compressed latent representations, the distribution clearly shows the ensemble learning stable statistical boundaries across temporal motion patterns.
Confusion Matrix
Normalized confusion matrix for the calibrated ensemble classifier.
The model performs strongest on:
- angry
- happy
- neutral/calm
while softer affective states such as:
- fearful
- sad
- disgust
show heavier overlap due to lower facial motion intensity and acoustic ambiguity.
The strongest normalized recalls were:
angry → angry = 0.81
happy → happy = 0.77
neutral/calm → neutral/calm = 0.73
Ensemble Calibration
Instead of averaging probabilities directly, Hoovik uses weighted ensemble calibration optimized using Optuna.
Final probability output:
P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost
Both models pass through:
- temperature scaling
- Platt calibration
before fusion.
This significantly reduced overconfident predictions and improved calibration stability across degraded modalities.
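A short sketch of the fusion step, covering temperature scaling and the weighted average only (Platt calibration is left out). The weights are the ones given above; the function names are hypothetical and the temperature is left as a parameter.

```python
import math

W_TRANSFORMER = 0.455  # from the article
W_XGBOOST = 0.545      # from the article


def softmax_with_temperature(logits, temperature):
    """Temperature scaling: divide logits by T before the softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(p_transformer, p_xgboost):
    """Weighted average of two calibrated probability vectors."""
    return [
        W_TRANSFORMER * pt + W_XGBOOST * px
        for pt, px in zip(p_transformer, p_xgboost)
    ]
```

Because the two weights sum to 1, the fused output is still a valid probability distribution whenever both inputs are.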
Modality-Specific Anomaly Detection
Real-world meeting environments are chaotic:
- poor lighting
- motion blur
- audio clipping
- reverb
- background interference
Without safeguards, models generate confident garbage predictions.
To solve this, every feature vector passes through dedicated Isolation Forest pipelines.
Active Isolation Models
iso_both
iso_audio_only
iso_video_only
iso_global_fallback
Each model is calibrated against:
- modality-specific variance
- explicit 10% FPR targets
The deployed thresholds currently operate at:
both = 0.0525
audio_only = 0.0884
video_only = -0.0264
The negative threshold for video_only reflects the inherently noisier variance distribution of isolated visual streams.
Anomaly Distribution
Variance spreads across bimodal and unimodal execution paths.
Samples falling below calibrated thresholds are flagged as anomalous.
If facial landmarks collapse due to:
- backlighting
- packet corruption
- unstable camera frames
the backend emits:
{
"anomaly": true
}
allowing the frontend to suppress unreliable emotional predictions.
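The threshold check itself is small. The sketch below uses the thresholds listed earlier and the "below threshold means anomalous" rule from the text. It assumes the score follows the scikit-learn convention where lower means more abnormal (as with `IsolationForest.decision_function`); the article does not say which scoring call is used. The `profile` and `probabilities` payload fields are hypothetical, and only `anomaly` appears in the original payload.

```python
# Thresholds from the article; scores below the threshold count as anomalous.
ANOMALY_THRESHOLDS = {
    "both": 0.0525,
    "audio_only": 0.0884,
    "video_only": -0.0264,
}


def build_result(profile, anomaly_score, probabilities):
    """Attach an anomaly flag to a prediction payload.

    anomaly_score is assumed to be "lower = more abnormal".
    """
    is_anomaly = anomaly_score < ANOMALY_THRESHOLDS[profile]
    return {"profile": profile, "anomaly": is_anomaly, "probabilities": probabilities}
```

Note that the same score can be flagged in one profile and accepted in another, which is the point of keeping per-modality thresholds.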
Real-Time Backpressure Protection
One hidden production problem was frame burst overload.
If browsers uploaded frames continuously at high FPS, executor queues eventually exploded.
To prevent memory collapse, the backend continuously monitors queue depth.
If _face_executor exceeds:
queue_depth >= 3
the backend emits a websocket backpressure event.
The frontend immediately reduces:
- capture FPS
- upload rate
- processing pressure
until queues recover.
This dramatically stabilized long-running sessions under burst traffic.
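`ThreadPoolExecutor` does not expose its queue depth through a public API, so one way to approximate it is to count tasks that have been submitted but not yet finished. The sketch below does that; the `QueueDepthMonitor` class is hypothetical, the count includes tasks currently running as well as queued ones, and emitting the actual websocket event is left out.

```python
import threading

MAX_FACE_QUEUE_DEPTH = 3  # from the article


class QueueDepthMonitor:
    """Counts tasks submitted to an executor but not yet finished."""

    def __init__(self, limit=MAX_FACE_QUEUE_DEPTH):
        self._limit = limit
        self._pending = 0
        self._lock = threading.Lock()

    def submit(self, executor, fn, *args):
        with self._lock:
            self._pending += 1
        try:
            future = executor.submit(fn, *args)
        except Exception:
            with self._lock:
                self._pending -= 1  # submission failed, undo the count
            raise
        future.add_done_callback(self._on_done)
        return future

    def _on_done(self, _future):
        with self._lock:
            self._pending -= 1

    def overloaded(self):
        with self._lock:
            return self._pending >= self._limit
```

A frame handler could check `overloaded()` before submitting new work and notify the client when it returns true.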
Runtime Telemetry
The service exposes live telemetry endpoints:
GET /stats
GET /stats/json
tracking:
- P50 latency
- P90 latency
- P95 latency
- participant counts
- queue depth
- active rooms
without blocking inference workers.
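Latency percentiles like these can be computed from a bounded window of recent samples. This is a sketch under assumed details: the `LatencyTracker` class, the 1000-sample window, and the nearest-rank percentile method are illustrative choices, not taken from the service.

```python
import math
from collections import deque


class LatencyTracker:
    """Keeps recent latencies and reports nearest-rank percentiles."""

    def __init__(self, max_samples=1000):  # window size is an assumption
        self._samples = deque(maxlen=max_samples)

    def record(self, latency_ms):
        self._samples.append(latency_ms)

    def percentile(self, pct):
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(pct * len(ordered) / 100))  # nearest-rank method
        return ordered[rank - 1]

    def snapshot(self):
        return {f"p{p}": self.percentile(p) for p in (50, 90, 95)}
```

Sorting a bounded window on demand keeps the hot path to a single `append`, so recording a latency adds almost no work to the inference loop.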
Runtime Performance
Early inference profiling and websocket load testing were performed locally on Apple Silicon hardware using Locust-based stress testing and concurrent Socket.IO session simulation.
Observed runtime characteristics included:
- Stable low-latency multimodal inference
- Sustained concurrent websocket sessions
- Controlled executor queue growth under burst traffic
- Successful websocket backpressure throttling
- Reliable degraded-mode fallback handling (audio_only/video_only)
The calibrated ensemble currently achieves:
- Ensemble test accuracy: 74.34%
- Balanced test accuracy: 73.84%
- Calibrated validation accuracy: 78.68%
- Transformer-only accuracy: 74.25%
- XGBoost-only accuracy: 66.03%
Final probability output:
P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost
with temperature calibration fixed at:
T = 0.3
Key Lessons Learned
1. Curriculum Learning Matters
Training only on perfect multimodal samples causes catastrophic failure when streams disappear.
Progressive modality degradation training was essential.
2. Serialized Inference Is Simpler
Trying to parallelize model execution aggressively created:
- race conditions
- queue instability
- unpredictable latency spikes
Thread-isolated extraction + serialized inference proved significantly more stable.
3. Production Safeguards Matter More Than Accuracy
Backpressure protection, anomaly detection, and graceful degradation ended up being just as important as raw model accuracy.
Open Source & Contributors
Hoovik is fully open-source and actively looking for contributors around:
- Dockerization
- Redis backplanes
- horizontal scaling
- telemetry infrastructure
- WebRTC optimization
GitHub: https://github.com/AnupamKumar-1/Hoovik
If you enjoy systems engineering, real-time ML infrastructure, or multimodal AI pipelines, contributions are welcome.




