DEV Community

Anupam Kumar

Inside Hoovik: Building a Real-Time Multimodal Emotion AI Pipeline

👉 GitHub 🌐 Live Demo

When I started building Hoovik, a distributed video conferencing platform, I expected WebRTC signaling and transcription pipelines to be the hardest problems.

They weren’t.

The real engineering challenge was building a production-ready real-time multimodal emotion inference engine capable of processing live video meetings under strict latency constraints.

Unlike offline ML systems, live meeting environments are unstable by default:

  • microphones get muted
  • webcams disappear
  • packet loss spikes randomly
  • lighting changes constantly
  • CPU pressure builds under burst traffic

And unlike research notebooks, production systems need to survive all of it without freezing the application loop.

This article breaks down how I designed Hoovik’s emotion recognition backend using:

  • FastAPI
  • PyTorch
  • MediaPipe
  • XGBoost
  • Isolation Forests
  • Socket.IO
  • Azure compute nodes

The production configuration currently operates with:

  • seq_len = 10
  • audio_dim = 1024
  • face_dim = 326
  • d_model = 256
  • nhead = 8
  • 3 encoder layers

🏗️ Multi-Cloud System Architecture

To separate lightweight websocket orchestration from heavy ML workloads, Hoovik runs on a distributed multi-cloud topology.

Render

Handles:

  • React frontend
  • Express backend
  • Socket.IO signaling
  • room state coordination

Azure VMs

Handles:

  • emotion inference
  • Whisper transcription
  • feature extraction
  • model execution

This separation keeps signaling latency stable while allowing compute services to scale independently.


⚑ Solving Python GIL Bottlenecks

The emotion_service runs as an asynchronous FastAPI + Socket.IO server.

A naive implementation quickly breaks under load because:

  • JPEG decoding is CPU-heavy
  • MediaPipe blocks aggressively
  • transformer forward passes stall the event loop
  • concurrent feature extraction creates backpressure

Instead of running everything inside one async path, I isolated the workload into dedicated executor pools, so the event loop thread only schedules work and never blocks on it.

                 +---------------------------------------+
                 |        Per-Participant Sockets        |
                 +---------------------------------------+
                    /                                 \
  [audio_chunk (Float32 PCM)]                      [emotion.frame (JPEG)]
                  /                                     \
                  v                                     v

        +----------------------+      +-----------------------+
        | _audio_executor (2T) |      | _face_executor (2T)   |
        | - Wav2Vec2 Features  |      | - MediaPipe Tracking  |
        +----------------------+      +-----------------------+
                    \                     /
                     \                   /
                      ---> Normalization --->
                                   |
                                   v
                     +---------------------------+
                     | _inference_executor (1T)  |
                     | - Transformer             |
                     | - XGBoost                 |
                     | - Isolation Forest        |
                     +---------------------------+
                                   |
                                   v
                           emotion.result

Worker Pools

_audio_executor

Handles:

  • PCM deserialization
  • audio normalization
  • Wav2Vec2 embedding extraction

_face_executor

Handles:

  • JPEG decoding
  • OpenCV preprocessing
  • MediaPipe landmark tracking

_inference_executor

Runs:

  • PyTorch inference
  • XGBoost evaluation
  • anomaly detection

Inference is intentionally serialized to guarantee thread safety without adding explicit locking overhead around model state.
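As a rough illustration of the layout above, here is a minimal sketch using only the standard library. The pool names and thread counts come from the diagram; `extract_audio`, `extract_face`, `run_models`, and `handle_tick` are hypothetical placeholders, not functions from the Hoovik codebase.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Pool sizes follow the diagram: 2 threads per extractor, 1 for inference.
_audio_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="audio")
_face_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="face")
_inference_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="infer")


def extract_audio(pcm: bytes) -> list:
    """Hypothetical stand-in for PCM decoding + Wav2Vec2 feature extraction."""
    return [float(len(pcm))]


def extract_face(jpeg: bytes) -> list:
    """Hypothetical stand-in for JPEG decoding + MediaPipe landmark tracking."""
    return [float(len(jpeg))]


def run_models(audio_feat: list, face_feat: list) -> dict:
    """Hypothetical stand-in for the transformer / XGBoost / Isolation Forest step."""
    return {"n_features": len(audio_feat) + len(face_feat)}


async def handle_tick(pcm: bytes, jpeg: bytes) -> dict:
    loop = asyncio.get_running_loop()
    # Both extractions run concurrently, off the event loop thread.
    audio_feat, face_feat = await asyncio.gather(
        loop.run_in_executor(_audio_executor, extract_audio, pcm),
        loop.run_in_executor(_face_executor, extract_face, jpeg),
    )
    # A single worker thread means model calls never overlap,
    # so model state needs no explicit lock.
    return await loop.run_in_executor(
        _inference_executor, run_models, audio_feat, face_feat
    )
```

Because `_inference_executor` has exactly one worker, queued inference calls run strictly one after another, which is the serialization property described above.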


🔄 Graceful Modality Degradation

Real meetings are noisy and inconsistent.

Users:

  • disable webcams
  • mute microphones
  • lose packets
  • reconnect mid-session

If one stream disappears for more than 0.4 seconds, the backend automatically switches execution profiles:

  • both
  • audio_only
  • video_only

This prevents runtime crashes and preserves stable predictions during degraded sessions.
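The switching rule can be sketched as a small pure function. This is an assumption about how the check might look: the 0.4-second timeout and the three profile names come from the text, while `select_profile` and its timestamp arguments are hypothetical.

```python
import time
from typing import Optional

STREAM_TIMEOUT_S = 0.4  # a stream silent for longer than this counts as missing


def select_profile(
    last_audio_ts: Optional[float],
    last_video_ts: Optional[float],
    now: Optional[float] = None,
) -> Optional[str]:
    """Pick an execution profile from the last-seen time of each stream.

    Timestamps are expected to come from time.monotonic().
    Returns None when neither stream is live.
    """
    now = time.monotonic() if now is None else now
    audio_live = last_audio_ts is not None and now - last_audio_ts <= STREAM_TIMEOUT_S
    video_live = last_video_ts is not None and now - last_video_ts <= STREAM_TIMEOUT_S

    if audio_live and video_live:
        return "both"
    if audio_live:
        return "audio_only"
    if video_live:
        return "video_only"
    return None
```

Using a monotonic clock rather than wall-clock time keeps the timeout stable if the system clock is adjusted mid-session.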


🎛️ Feature Engineering Pipeline

The engine combines synchronized audio and facial features into aligned temporal embeddings.


🎀 Audio Pipeline

Incoming audio chunks are:

  1. resampled to 16 kHz
  2. normalized
  3. processed using:
audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim

The system extracts:

  • last hidden state embeddings
  • mean pooled temporal vectors
  • 1024-dimensional acoustic features

using centered 0.6-second windows.
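To make the windowing and pooling steps concrete, here is a dependency-free sketch. It assumes the hidden states arrive as a list of per-timestep vectors; `centered_window` and `mean_pool` are hypothetical helper names, and the actual model call is omitted.

```python
SAMPLE_RATE = 16_000   # Hz, after resampling
WINDOW_S = 0.6         # window length in seconds


def centered_window(samples: list, center_s: float) -> list:
    """Return the samples within a 0.6 s window centered on center_s.

    The window is clipped at the edges of the buffer, so it can be
    shorter than 9600 samples near the start or end.
    """
    half = round(SAMPLE_RATE * WINDOW_S / 2)
    center = round(center_s * SAMPLE_RATE)
    start = max(0, center - half)
    end = min(len(samples), center + half)
    return samples[start:end]


def mean_pool(hidden_states: list) -> list:
    """Average a (time, dim) matrix over the time axis, giving one dim-sized vector."""
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one timestep")
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]
```

With a real Wav2Vec2 large model, each timestep vector would have 1024 entries, so the pooled result matches `audio_dim = 1024`.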


🙂 Video Pipeline

MediaPipe extracts 326 facial features per frame:

Spatial Landmarks

  • 136 normalized facial landmarks

Blendshapes

  • 51 facial muscle activations

Head Pose

  • pitch
  • yaw
  • roll

All landmarks are normalized relative to inter-ocular distance so the model remains invariant to camera proximity.
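A minimal sketch of this kind of normalization follows. Dividing by inter-ocular distance comes from the text; centering on the midpoint between the eyes is an added assumption, and `normalize_landmarks` is a hypothetical name.

```python
import math


def normalize_landmarks(points, left_eye, right_eye):
    """Scale 2D landmarks by inter-ocular distance.

    points, left_eye, right_eye are (x, y) tuples in image coordinates.
    Assumption: coordinates are also re-centered on the eye midpoint.
    """
    iod = math.dist(left_eye, right_eye)
    if iod == 0:
        raise ValueError("eye landmarks coincide; cannot normalize")
    cx = (left_eye[0] + right_eye[0]) / 2
    cy = (left_eye[1] + right_eye[1]) / 2
    return [((x - cx) / iod, (y - cy) / iod) for x, y in points]
```

The same face captured at twice the size yields identical normalized coordinates, which is the invariance to camera proximity described above.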

The backend stores rolling temporal sequences of:

seq_len = 10

allowing the model to track facial motion across time.
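One simple way to keep such a rolling sequence is a bounded deque per participant. This is a sketch, not the project's code: `FaceSequenceBuffer` is a hypothetical class, while `seq_len = 10` and `face_dim = 326` come from the configuration listed earlier.

```python
from collections import deque

SEQ_LEN = 10
FACE_DIM = 326


class FaceSequenceBuffer:
    """Keeps the most recent SEQ_LEN feature vectors for one participant."""

    def __init__(self, seq_len: int = SEQ_LEN, dim: int = FACE_DIM):
        self._dim = dim
        # maxlen makes the deque drop the oldest frame automatically.
        self._frames = deque(maxlen=seq_len)

    def push(self, features) -> None:
        features = list(features)
        if len(features) != self._dim:
            raise ValueError(f"expected {self._dim} features, got {len(features)}")
        self._frames.append(features)

    def ready(self) -> bool:
        """True once a full sequence is available for the model."""
        return len(self._frames) == self._frames.maxlen

    def snapshot(self) -> list:
        """Copy of the current sequence, oldest frame first."""
        return list(self._frames)
```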


🧠 Brain A β€” EmotionTransformer

The primary deep learning model is a custom multimodal transformer architecture using:

  • projection layers
  • temporal convolutions
  • bidirectional cross-attention
  • gated modality fusion

Model Configuration

  • d_model = 256
  • nhead = 8
  • encoder_layers = 3

The network learns bidirectional attention:

  • visual queries attend to acoustic features
  • acoustic queries attend to visual features

A learned cross_gate dynamically balances voice tone against facial motion.
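The text only names the gate, so the following is an assumption: one common formulation of gated fusion, where $h_a$ and $h_v$ are the audio and visual representations after cross-attention, and $W_g$, $b_g$ are hypothetical learned parameters.

```latex
g = \sigma\!\left(W_g\,[\,h_a \,;\, h_v\,] + b_g\right), \qquad
h_{\text{fused}} = g \odot h_a + (1 - g) \odot h_v
```

Here $\sigma$ is the elementwise sigmoid, so each entry of $g$ lies in $(0, 1)$ and acts as a per-dimension mixing weight between voice tone and facial motion.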


📚 Curriculum Training Strategy

Training multimodal systems becomes unstable when modalities disappear randomly.

To solve this, I trained the network in 3 phases.


Phase A: Full Bimodal Learning

Epochs: 1–15

The model trains only on complete:

  • audio
  • video

pairs.

This forces cross-attention layers to learn joint representations first.


Phase B: Missing Modality Robustness

Epochs: 16–55

The training pipeline introduces:

  • audio-only samples
  • video-only samples
  • mixup augmentation
  • modality dropout

This teaches the model to preserve embeddings even when streams disappear entirely.


Phase C: Calibration & Stabilization

Epochs: 56–90

Final training uses:

  • reduced learning rate
  • SWA (Stochastic Weight Averaging)

to smooth final weights and improve generalization.

Training currently uses:

  • batch_size = 64
  • mixup_alpha = 0.166
  • modality_drop_prob = 0.068
  • label_smoothing = 0.077
  • grad_clip = 1.0
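The three-phase schedule can be expressed as a lookup from epoch to phase settings. The epoch boundaries and hyperparameter values come from the text; the `Phase` dataclass and `phase_for_epoch` are hypothetical, and whether Phase C keeps the Phase B augmentations is an assumption the article does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    unimodal_samples: bool     # audio-only / video-only samples allowed
    mixup_alpha: float         # 0.0 disables mixup
    modality_drop_prob: float  # 0.0 disables modality dropout
    use_swa: bool


PHASE_A = Phase("A", unimodal_samples=False, mixup_alpha=0.0,
                modality_drop_prob=0.0, use_swa=False)
PHASE_B = Phase("B", unimodal_samples=True, mixup_alpha=0.166,
                modality_drop_prob=0.068, use_swa=False)
# Assumption: Phase C keeps Phase B's augmentation and adds SWA.
PHASE_C = Phase("C", unimodal_samples=True, mixup_alpha=0.166,
                modality_drop_prob=0.068, use_swa=True)


def phase_for_epoch(epoch: int) -> Phase:
    """Map a 1-based epoch number to its curriculum phase."""
    if not 1 <= epoch <= 90:
        raise ValueError("epoch must be in 1..90")
    if epoch <= 15:
        return PHASE_A
    if epoch <= 55:
        return PHASE_B
    return PHASE_C
```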

🌲 Brain B β€” XGBoost Ensemble

Deep networks are excellent at latent representation learning.

But tree ensembles remain extremely effective at learning explicit statistical boundaries.

To complement the transformer, the backend engineers an 8,149-dimensional feature vector every inference cycle.

Engineered Features

Statistical Windows

  • rolling mean
  • std deviation
  • min/max windows

Temporal Metrics

  • frame deltas
  • OLS trend slopes
  • trajectory shifts

Motion Energy Features

  • facial asymmetry
  • landmark MSE
  • voice variance
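As an illustration of the statistical and temporal features listed above, here is a sketch that summarizes one feature channel over a window of frames. `window_features` is a hypothetical helper; the exact features and window sizes used in Hoovik are not specified here.

```python
import statistics


def window_features(values: list) -> dict:
    """Summarize one feature channel over a rolling window of frames."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames")
    mean = statistics.fmean(values)

    # OLS slope of the values against frame index 0..n-1.
    t_mean = (n - 1) / 2
    num = sum((t - t_mean) * (v - mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))

    # Frame-to-frame deltas capture short-term motion.
    deltas = [b - a for a, b in zip(values, values[1:])]

    return {
        "mean": mean,
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "slope": num / den,
        "mean_abs_delta": statistics.fmean([abs(d) for d in deltas]),
    }
```

Applying a function like this to every channel of the audio and face sequences is one way a feature vector can grow into the thousands of dimensions.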

Before training, features are compressed using:

PCA → 512 dimensions

The XGBoost model handles missing modalities naturally using:

missing=np.nan

which makes degraded stream inference extremely resilient.
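XGBoost treats NaN as a missing value and learns a default branch direction for it at each split. A sketch of how a feature row might be assembled when one modality is absent is shown below. `build_row` and the tiny block sizes are illustrative assumptions, and the sketch covers row assembly only, not how missing blocks interact with the PCA step.

```python
import math

# Illustrative block sizes only; the real feature blocks are far larger.
AUDIO_DIM = 4
FACE_DIM = 3


def build_row(audio_feats, face_feats) -> list:
    """Concatenate modality blocks, filling an absent modality with NaN."""
    nan = float("nan")
    audio = [nan] * AUDIO_DIM if audio_feats is None else list(audio_feats)
    face = [nan] * FACE_DIM if face_feats is None else list(face_feats)
    if len(audio) != AUDIO_DIM or len(face) != FACE_DIM:
        raise ValueError("feature block has the wrong length")
    return audio + face
```

Keeping the row length fixed means the same trained model can score `both`, `audio_only`, and `video_only` inputs without a separate code path.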

Current XGBoost configuration:

  • n_estimators = 3150
  • max_depth = 5
  • learning_rate = 0.0308
  • tree_method = hist

📊 Feature Importance

Top predictive markers extracted from the engineered XGBoost feature space.

While the strongest features currently remain PCA-compressed latent representations, the distribution clearly shows the ensemble learning stable statistical boundaries across temporal motion patterns.


📉 Confusion Matrix

Normalized confusion matrix for the calibrated ensemble classifier.

The model performs best on:

  • angry
  • happy
  • neutral/calm

while softer affective states such as:

  • fearful
  • sad
  • disgust

show heavier overlap due to lower facial motion intensity and acoustic ambiguity.

The strongest normalized recalls were:

  • angry → angry = 0.81
  • happy → happy = 0.77
  • neutral/calm → neutral/calm = 0.73

🧪 Ensemble Calibration

Instead of averaging probabilities directly, Hoovik uses weighted ensemble calibration optimized using Optuna.

Final probability output:

P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost

Both models pass through:

  • temperature scaling
  • Platt calibration

before fusion.

This significantly reduced overconfident predictions and improved calibration stability across degraded modalities.
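The fusion step can be sketched in a few lines. The weights come from the formula above; `softmax_with_temperature` and `fuse` are hypothetical names, Platt calibration is not shown, and the sketch assumes each model exposes class logits or probabilities in the same class order.

```python
import math

W_TRANSFORMER = 0.455
W_XGBOOST = 0.545


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Temperature-scaled softmax: T < 1 sharpens, T > 1 flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(p_transformer: list, p_xgboost: list) -> list:
    """Weighted average of two calibrated probability vectors."""
    if len(p_transformer) != len(p_xgboost):
        raise ValueError("probability vectors must have the same length")
    return [
        W_TRANSFORMER * a + W_XGBOOST * b
        for a, b in zip(p_transformer, p_xgboost)
    ]
```

Because the two weights sum to 1, the fused vector is still a valid probability distribution whenever both inputs are.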


🚨 Modality-Specific Anomaly Detection

Real-world meeting environments are chaotic:

  • poor lighting
  • motion blur
  • audio clipping
  • reverb
  • background interference

Without safeguards, models generate confident garbage predictions.

To solve this, every feature vector passes through dedicated Isolation Forest pipelines.

Active Isolation Models

  • iso_both
  • iso_audio_only
  • iso_video_only
  • iso_global_fallback

Each model is calibrated against:

  • modality-specific variance
  • explicit 10% FPR targets

The deployed thresholds currently operate at:

  • both = 0.0525
  • audio_only = 0.0884
  • video_only = -0.0264

The negative threshold for video_only reflects the inherently noisier variance distribution of isolated visual streams.
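One way to derive a threshold from a false-positive-rate target is to take a low quantile of the anomaly scores of known-normal samples. The sketch below assumes lower scores mean more anomalous (as with scikit-learn's `IsolationForest.decision_function`); `threshold_for_fpr` is a hypothetical helper, not the project's calibration code.

```python
def threshold_for_fpr(normal_scores: list, target_fpr: float = 0.10) -> float:
    """Pick a cutoff so that at most target_fpr of normal samples score below it.

    Samples with score < threshold would be flagged as anomalous.
    """
    if not normal_scores:
        raise ValueError("need at least one score")
    if not 0 < target_fpr < 1:
        raise ValueError("target_fpr must be between 0 and 1")
    ordered = sorted(normal_scores)
    k = int(target_fpr * len(ordered))  # number of normal samples allowed below
    return ordered[k]
```

Calibrating each profile on its own score distribution would explain why the three thresholds differ, including a negative cutoff for a noisier stream.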


📦 Anomaly Distribution

Variance spreads across bimodal and unimodal execution paths.

Samples falling below calibrated thresholds are flagged as anomalous.

If facial landmarks collapse due to:

  • backlighting
  • packet corruption
  • unstable camera frames

the backend emits:

{
  "anomaly": true
}

allowing the frontend to suppress unreliable emotional predictions.
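Putting the thresholds and the payload together, the check might look like the sketch below. The threshold values are the ones listed above; `anomaly_payload` is a hypothetical function, and no threshold is assumed for `iso_global_fallback` because none is given.

```python
import json

# Deployed thresholds as listed in the article.
THRESHOLDS = {
    "both": 0.0525,
    "audio_only": 0.0884,
    "video_only": -0.0264,
}


def anomaly_payload(profile: str, score: float) -> dict:
    """Flag a sample whose anomaly score falls below its profile's threshold."""
    try:
        threshold = THRESHOLDS[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile!r}") from None
    return {"anomaly": score < threshold}


# Example: serialize the payload for the websocket event.
message = json.dumps(anomaly_payload("video_only", -0.05))
```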


🛑 Real-Time Backpressure Protection

One hidden production problem was frame burst overload.

If browsers uploaded frames continuously at high FPS, the executor queues grew without bound.

To prevent memory exhaustion, the backend continuously monitors queue depth.

If _face_executor exceeds:

queue_depth >= 3

the backend emits a websocket backpressure event.

The frontend immediately reduces:

  • capture FPS
  • upload rate
  • processing pressure

until queues recover.

This dramatically stabilized long-running sessions under burst traffic.
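`ThreadPoolExecutor` does not expose its queue depth through a public API, so one option is to count tasks yourself. The sketch below is an assumption about how that could work: `TrackedExecutor` is a hypothetical wrapper, and "depth" here counts every submitted task that has not finished yet, including the ones currently running.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_QUEUE_DEPTH = 3  # threshold from the article


class TrackedExecutor:
    """ThreadPoolExecutor wrapper that counts submitted-but-unfinished tasks."""

    def __init__(self, max_workers: int):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._pending = 0

    @property
    def depth(self) -> int:
        with self._lock:
            return self._pending

    def submit(self, fn, *args):
        with self._lock:
            self._pending += 1
        try:
            future = self._pool.submit(fn, *args)
        except Exception:
            with self._lock:
                self._pending -= 1
            raise
        # The callback runs when the task finishes, fails, or is cancelled.
        future.add_done_callback(self._on_done)
        return future

    def _on_done(self, _future) -> None:
        with self._lock:
            self._pending -= 1

    def overloaded(self) -> bool:
        """True when the server should emit a backpressure event."""
        return self.depth >= MAX_QUEUE_DEPTH

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

A frame handler would check `overloaded()` before submitting new work and emit the backpressure event instead of queueing another frame.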


📈 Runtime Telemetry

The service exposes live telemetry endpoints:

GET /stats
GET /stats/json

tracking:

  • P50 latency
  • P90 latency
  • P95 latency
  • participant counts
  • queue depth
  • active rooms

without blocking inference workers.
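Latency percentiles like these can be computed from a bounded window of recent samples. The sketch below uses the nearest-rank method; `LatencyTracker` and the window size of 1000 are hypothetical choices, not details taken from the service.

```python
import math
from collections import deque


class LatencyTracker:
    """Rolling window of recent latencies with nearest-rank percentiles."""

    def __init__(self, window: int = 1000):
        self._samples = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def percentile(self, p: int):
        """Nearest-rank percentile for an integer p in 1..100, or None if empty."""
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self) -> dict:
        """Values a /stats/json style endpoint could return."""
        return {
            "p50": self.percentile(50),
            "p90": self.percentile(90),
            "p95": self.percentile(95),
        }
```

Recording is a cheap append on a bounded deque, and the sorting cost is paid only when the stats endpoint is queried, which keeps the overhead off the inference path.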


⚙️ Runtime Performance

Early inference profiling and websocket load testing were performed locally on Apple Silicon hardware using Locust-based stress testing and concurrent Socket.IO session simulation.

Observed runtime characteristics included:

  • Stable low-latency multimodal inference
  • Sustained concurrent websocket sessions
  • Controlled executor queue growth under burst traffic
  • Successful websocket backpressure throttling
  • Reliable degraded-mode fallback handling (audio_only / video_only)

The calibrated ensemble currently achieves:

  • Ensemble test accuracy: 74.34%
  • Balanced test accuracy: 73.84%
  • Calibrated validation accuracy: 78.68%
  • Transformer-only accuracy: 74.25%
  • XGBoost-only accuracy: 66.03%

Final probability output:

P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost

with temperature calibration fixed at:

T = 0.3


🧩 Key Lessons Learned

1. Curriculum Learning Matters

Training only on perfect multimodal samples causes catastrophic failure when streams disappear.

Progressive modality degradation training was essential.


2. Serialized Inference Is Simpler

Trying to parallelize model execution aggressively created:

  • race conditions
  • queue instability
  • unpredictable latency spikes

Thread-isolated extraction + serialized inference proved significantly more stable.


3. Production Safeguards Matter as Much as Accuracy

Backpressure protection, anomaly detection, and graceful degradation ended up being just as important as raw model accuracy.


🚀 Open Source & Contributors

Hoovik is fully open-source and actively looking for contributors around:

  • Dockerization
  • Redis backplanes
  • horizontal scaling
  • telemetry infrastructure
  • WebRTC optimization

GitHub: https://github.com/AnupamKumar-1/Hoovik

If you enjoy systems engineering, real-time ML infrastructure, or multimodal AI pipelines, contributions are welcome.