A complete breakdown of Hoovik: WebRTC signaling, distributed Node.js with Redis, real-time emotion AI, RAG on meeting transcripts, and a Python transcription pipeline — all wired together.
👉 GitHub: https://github.com/AnupamKumar-1/Hoovik
🌐 Live Demo: https://hoovik.onrender.com
🎮 Interactive Demo: https://app.supademo.com/demo/cmpy5ggyv95b0qmy7ccrkd3ms?utm_source=link
I've previously written about individual parts of Hoovik, including its emotion analysis system and WebRTC signaling architecture.
Those articles focused on specific subsystems. This one focuses on the complete platform.
Hoovik is not a single application. It is a collection of services working together: a React/WebRTC frontend, a distributed Node.js backend, a transcription pipeline, a real-time emotion recognition service, and a retrieval-augmented search system built on meeting transcripts.
This article walks through how those systems interact, the architectural decisions behind them, and the tradeoffs encountered while building each component.
What Hoovik Actually Is
Hoovik is a multi-party video meeting platform that combines real-time communication, AI-assisted analysis, and transcript intelligence.
The platform includes:
- Real-time WebRTC video meetings with Socket.IO signaling
- Live facial and vocal emotion analysis for meeting participants
- Multi-speaker transcription with segment-level NLP emotion tagging
- AI-generated meeting summaries enriched with live emotion data
- Retrieval-Augmented Generation (RAG) over meeting transcripts
- Transcript access requests and approval workflows
- Distributed room management backed by Redis and MongoDB
The system is composed of four primary services.
The Four Services
- React Frontend (Vite)
- Node.js Backend (Express + Socket.IO)
- Python Transcript Service (FastAPI)
- Python Emotion Service (FastAPI + Socket.IO)
The remainder of this article follows the lifecycle of a meeting and explains how each service participates.
1. The Node.js Backend
The backend is responsible for:
- Authentication
- Meeting creation and management
- Socket.IO signaling
- Transcript storage
- Transcript access requests
- AI summary generation
- RAG indexing and querying
The deployment runs as multiple PM2 processes connected through:
- MongoDB for persistence
- Redis for shared state
- Socket.IO Redis Adapter for cross-process event delivery
Shared Room State
Room state cannot safely live in process memory when multiple Node.js instances are handling requests.
Instead, mutable meeting state is stored in Redis.
Participants are stored in a Redis Hash:
meeting:participants:
Each field contains a serialized participant object.
This design allows:
- Targeted HSET updates during joins
- Targeted HDEL updates during leaves
- Shared state across all backend processes
- Reduced serialization overhead, because a join or leave rewrites one field instead of the whole participant list
Join order is stored separately and is used for WebRTC role assignment.
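As a rough illustration of the hash layout described above, the sketch below wraps the three operations in small helpers. It assumes a client that exposes node-redis v4 style `hSet`, `hDel`, and `hGetAll` methods. The helper names, the `socketId` field, and the use of the meeting code as the key suffix are assumptions made for the example, not details taken from Hoovik's source.

```javascript
// Hypothetical helpers. `redis` is any client with node-redis v4 style
// hSet / hDel / hGetAll methods. The key suffix is an assumption.
const participantsKey = (meetingCode) => `meeting:participants:${meetingCode}`;

async function addParticipant(redis, meetingCode, participant) {
  // One HSET writes only this participant's field.
  await redis.hSet(
    participantsKey(meetingCode),
    participant.socketId, // assumed field name
    JSON.stringify(participant)
  );
}

async function removeParticipant(redis, meetingCode, socketId) {
  // One HDEL removes only this participant's field.
  await redis.hDel(participantsKey(meetingCode), socketId);
}

async function listParticipants(redis, meetingCode) {
  // HGETALL returns { field: serializedParticipant, ... }.
  const raw = await redis.hGetAll(participantsKey(meetingCode));
  return Object.values(raw).map((json) => JSON.parse(json));
}
```

Because every backend process reads and writes the same hash, none of them needs a local copy of the room.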
Distributed Join Locking
Joining a room modifies shared state.
To prevent race conditions, room joins are serialized using a Redis-backed distributed lock.
await withRoomLock(meetingCode, async () => {
  // join logic
});
The lock uses:
- SET NX PX acquisition
- Token-based ownership
- Lua-script compare-and-delete release
This ensures that only one join operation mutates a room's state at a time, provided the join finishes before the lock's PX expiry.
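The three ingredients listed above can be sketched in a few lines. This is a minimal sketch, not Hoovik's actual `withRoomLock`. It assumes a client with node-redis v4 style `set(key, value, { NX, PX })` and `eval(script, { keys, arguments })` methods. The function name, key name, and TTL are hypothetical. The caller supplies a unique random `token`, and waiting or retrying for a busy lock is left out.

```javascript
// Compare-and-delete: remove the lock only if the stored token is still ours.
const RELEASE_SCRIPT = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0
`;

// Hypothetical sketch; the key name and default TTL are assumptions.
async function withRoomLockSketch(redis, meetingCode, token, fn, ttlMs = 5000) {
  const key = `lock:room:${meetingCode}`;

  // SET NX PX: succeeds only if nobody holds the lock, and expires on its own
  // so a crashed process cannot hold the room forever.
  const acquired = await redis.set(key, token, { NX: true, PX: ttlMs });
  if (acquired !== "OK") throw new Error("room is locked");

  try {
    return await fn();
  } finally {
    // Token check prevents deleting a lock that expired and was re-acquired
    // by another process in the meantime.
    await redis.eval(RELEASE_SCRIPT, { keys: [key], arguments: [token] });
  }
}
```

Running the check and the delete inside one Lua script makes the release atomic, which a separate GET followed by DEL would not be.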
Authentication
Authentication uses JWT access tokens and refresh token rotation.
Login issues:
- A short-lived JWT access token
- An opaque refresh token stored only in an HttpOnly cookie
Refresh tokens are rotated on every refresh request, reducing replay risk while preserving user sessions.
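To make the rotation step concrete, here is a minimal in-memory sketch. The store, its method names, and the idea that the caller supplies freshly generated token strings are all assumptions for illustration. A real implementation would persist tokens in a database, usually as hashes rather than raw values.

```javascript
// Hypothetical in-memory refresh-token store.
function createRefreshStore() {
  const active = new Map(); // refresh token -> userId

  return {
    issue(userId, token) {
      active.set(token, userId);
      return token;
    },

    // Rotation: the presented token is consumed and replaced by a new one.
    rotate(presentedToken, nextToken) {
      const userId = active.get(presentedToken);
      if (userId === undefined) return null; // unknown or already used
      active.delete(presentedToken);
      active.set(nextToken, userId);
      return { userId, refreshToken: nextToken };
    },
  };
}
```

Because each token is valid for a single refresh, a stolen token that has already been used is rejected on replay.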
2. The Frontend
The frontend is a React application built around specialized hooks that manage independent subsystems.
Major responsibilities include:
- WebRTC peer connection management
- Socket.IO signaling
- Chat
- Active speaker detection
- Emotion capture
- Recording
- Transcript viewing
- RAG interaction
WebRTC
Peer connections are managed through dedicated React hooks and implement the perfect negotiation pattern.
The application supports:
- Multi-party video
- ICE restarts
- Screen sharing
- Remote participant management
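The core of perfect negotiation is a small collision rule, which can be written as a pure function. The sketch below follows the commonly documented form of the pattern. The function names are hypothetical, and the rule that the later joiner is the polite peer is an assumption, since the article only says that join order drives role assignment.

```javascript
// Collision rule from the perfect negotiation pattern, as a pure function.
function shouldIgnoreOffer({ polite, makingOffer, signalingState, descriptionType }) {
  // A collision happens when an offer arrives while we are creating our own
  // offer or are otherwise not in the "stable" signaling state.
  const offerCollision =
    descriptionType === "offer" && (makingOffer || signalingState !== "stable");

  // The impolite peer ignores the colliding offer; the polite peer accepts it
  // and lets the browser roll back its own pending offer.
  return !polite && offerCollision;
}

// Assumed role rule: the peer that joined later is the polite one.
function isPolite(localJoinIndex, remoteJoinIndex) {
  return localJoinIndex > remoteJoinIndex;
}
```

Deriving roles from shared join order means both peers reach the same answer without exchanging an extra message.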
Active Speaker Detection
Two independent detection paths exist.
SSRC Path
When available:
RTCRtpReceiver.getSynchronizationSources()
is used to obtain RTP audio levels directly.
RMS Fallback
Browsers that do not expose audio levels through getSynchronizationSources() fall back to:
- Web Audio API
- AnalyserNode
- RMS energy calculations
The application selects the appropriate method dynamically.
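The RMS fallback boils down to a short calculation over the samples an AnalyserNode writes into a Float32Array via `getFloatTimeDomainData`. The sketch below shows that calculation plus a simple loudest-above-threshold selection. The function names and the 0.02 threshold are illustrative assumptions, not values from the project.

```javascript
// Root-mean-square energy of one block of time-domain samples in [-1, 1].
function rmsLevel(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Picks the loudest participant at or above the threshold, or null if
// everyone is quiet. `levels` maps participantId -> level in [0, 1].
function pickActiveSpeaker(levels, threshold = 0.02) {
  let best = null;
  for (const [participantId, level] of Object.entries(levels)) {
    if (level >= threshold && (best === null || level > levels[best])) {
      best = participantId;
    }
  }
  return best;
}
```

The same selection function works for both paths, since the SSRC path reports an `audioLevel` on the same 0 to 1 scale.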
Emotion Capture
The host captures:
- Video frames from remote participants
- Audio chunks from remote participant streams
Captured media is sent directly to the emotion service using dedicated Socket.IO connections.
Each participant receives an independent emotion-service connection, allowing participant-level media state tracking and backpressure control.
The emotion service can instruct the frontend to adjust capture rates through server status and backpressure events.
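One way a client might react to such events is to widen or narrow its capture interval within fixed bounds. The sketch below is only an illustration of that idea. The event payload shape (`{ level: "high" | "normal" }`), the doubling and halving policy, and the bounds are all assumptions rather than Hoovik's actual protocol.

```javascript
// Hypothetical policy: back off quickly under load, recover when it clears.
function nextCaptureIntervalMs(currentMs, event, { minMs = 200, maxMs = 4000 } = {}) {
  // Assumed payload: { level: "high" } under load, { level: "normal" } otherwise.
  const next = event.level === "high" ? currentMs * 2 : currentMs / 2;
  // Clamp so capture never stops entirely and never floods the service.
  return Math.min(maxMs, Math.max(minMs, Math.round(next)));
}
```

Keeping the policy in a pure function makes it easy to apply per participant connection and to unit test without any sockets.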
Emotion-Aware Summaries
Emotion events collected during a meeting are stored locally and later submitted when generating an AI summary.
The backend combines:
- Transcript-derived emotion information
- Live captured emotion history
This enables AI summaries to highlight notable discrepancies between spoken content and observed participant emotions.
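A possible way to surface such discrepancies before prompting the summarizer is to compare each transcript segment's text-derived emotion with the dominant live emotion observed during the same time window. The sketch below assumes segments shaped like `{ start, end, text, emotion }` and live events shaped like `{ t, emotion }`, with times in seconds. These shapes and function names are hypothetical, and the real backend may combine the two sources differently.

```javascript
// Most frequent emotion label among the given events, or null if none.
function dominantEmotion(events) {
  const counts = new Map();
  for (const e of events) counts.set(e.emotion, (counts.get(e.emotion) || 0) + 1);
  let best = null;
  for (const [emotion, n] of counts) {
    if (best === null || n > counts.get(best)) best = emotion;
  }
  return best;
}

// Lists segments whose spoken emotion differs from the observed one.
function findDiscrepancies(segments, liveEvents) {
  const out = [];
  for (const seg of segments) {
    const inWindow = liveEvents.filter((e) => e.t >= seg.start && e.t < seg.end);
    const observed = dominantEmotion(inWindow);
    if (observed !== null && observed !== seg.emotion) {
      out.push({ start: seg.start, text: seg.text, spoken: seg.emotion, observed });
    }
  }
  return out;
}
```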
3. The Transcript Service
The transcript service is implemented in FastAPI.
Its responsibilities include:
- Audio processing
- Speech recognition
- Speaker segmentation
- Segment-level NLP emotion classification
The service uses:
- Whisper
- DistilRoBERTa
for transcription and emotion tagging.
Asynchronous Processing
Meeting recordings are uploaded after a meeting ends.
The service immediately returns:
202 Accepted
and performs processing in a background task.
The processing pipeline is:
Audio Upload
↓
FFmpeg Conversion
↓
Whisper Transcription
↓
Segment Merging
↓
NLP Emotion Classification (DistilRoBERTa)
↓
Transcript Callback To Node Backend
Transcript Delivery
After processing completes, the transcript service sends structured transcript data back to the Node.js backend.
Retry logic is used to improve reliability during temporary backend failures.
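The retry behaviour can be pictured as a small exponential-backoff wrapper around the callback request. The transcript service itself is written in Python, so the JavaScript below is only a language-neutral sketch kept consistent with the other examples. The function name, retry count, and delays are assumptions.

```javascript
// Retries `fn` with exponential backoff. `sleep` is injectable for testing.
async function retryWithBackoff(fn, { retries = 3, baseDelayMs = 500, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      // Give up after the last allowed retry and surface the original error.
      if (attempt >= retries) throw err;
      await wait(baseDelayMs * 2 ** attempt); // 500, 1000, 2000, ...
    }
  }
}
```

In practice `fn` would perform the HTTP request that delivers the transcript, so a brief backend restart delays delivery instead of dropping it.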
4. The Emotion Service
The emotion service performs real-time inference on participant media streams.
The frontend sends:
- emotion.frame events
- audio_chunk events
directly to the service.
The service performs inference using:
- Wav2Vec2
- MediaPipe
- XGBoost ensemble models
and emits:
emotion.result
events back to the frontend.
Modality-Aware Processing
Inference continues even when a participant disables one modality.
Examples:
- Camera enabled, microphone disabled → video-only mode
- Microphone enabled, camera disabled → audio-only mode
This allows emotion tracking to continue without requiring both media streams.
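The mode selection amounts to a small decision table. The sketch below writes it out. The "video-only" and "audio-only" labels come from the examples above, while "multimodal", "idle", and the function name are assumed for illustration.

```javascript
// Maps a participant's media state to an inference mode.
function inferenceMode({ cameraOn, micOn }) {
  if (cameraOn && micOn) return "multimodal"; // assumed label
  if (cameraOn) return "video-only";
  if (micOn) return "audio-only";
  return "idle"; // assumed label: nothing to analyze
}
```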
Backpressure Support
The service also emits:
- server.status
- backpressure
events that allow the frontend to dynamically adjust capture rates and reduce load.
5. The RAG Pipeline
After transcripts are stored, they can be indexed for semantic retrieval.
The indexing pipeline consists of:
- Chunking
- Embedding generation
- Background indexing
- Vector retrieval
- LLM answer generation
Chunking
When speaker segments are available, chunks preserve:
- Speaker attribution
- Timestamps
- Transcript structure
Otherwise, a sliding-window chunking strategy is used.
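Both chunking strategies can be sketched briefly. The window size, overlap, word-based splitting, and segment shape (`{ speaker, text, start, end }`) below are assumptions chosen for illustration, not the values Hoovik uses.

```javascript
// Sliding window over words; consecutive chunks share `overlap` words.
function slidingWindowChunks(text, size = 200, overlap = 50) {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window reached the end
  }
  return chunks;
}

// Speaker-aware variant: one chunk per segment, keeping attribution and timestamps.
function segmentChunks(segments) {
  return segments.map((s) => ({
    text: `${s.speaker}: ${s.text}`,
    speaker: s.speaker,
    start: s.start,
    end: s.end,
  }));
}
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk.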
Embeddings
Embeddings are generated using:
nomic-embed-text-v1.5
Embedding results are cached in Redis to avoid redundant computation.
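A cache-aside lookup is one straightforward way to get this behaviour. In the sketch below, `cache` is assumed to expose async `get` and `set` methods as Redis clients do, while `embed` and `hash` are injected stand-ins for the embedding call and a content hash (for example a SHA-256 of the text). The key format is an assumption.

```javascript
const EMBEDDING_MODEL = "nomic-embed-text-v1.5";

// Hypothetical cache-aside helper: compute an embedding only on a cache miss.
async function cachedEmbedding({ cache, embed, hash }, text) {
  // Including the model name keeps vectors from different models apart.
  const key = `embedding:${EMBEDDING_MODEL}:${hash(text)}`; // assumed key format
  const hit = await cache.get(key);
  if (hit !== null && hit !== undefined) return JSON.parse(hit);

  const vector = await embed(text);
  await cache.set(key, JSON.stringify(vector));
  return vector;
}
```

Keying on a hash of the text means re-indexing an unchanged transcript costs only cache reads.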
Indexing
Transcript indexing runs asynchronously through BullMQ workers.
This prevents long-running embedding operations from blocking API requests.
Retrieval
Retrieval combines:
- MongoDB Vector Search
- Maximum Marginal Relevance (MMR)
to balance relevance and diversity.
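MMR itself is a short greedy loop. The sketch below re-ranks candidates that a vector search has already returned, using the standard score `lambda * sim(query, doc) - (1 - lambda) * max sim(doc, selected)`. The candidate shape (`{ id, vector }`), function names, and default `lambda` are assumptions for illustration.

```javascript
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Greedy MMR: repeatedly pick the candidate that is relevant to the query
// but least similar to what has already been selected.
function mmrSelect(queryVector, candidates, k, lambda = 0.5) {
  const selected = [];
  const remaining = candidates.slice();
  while (selected.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    remaining.forEach((cand, i) => {
      const relevance = cosine(queryVector, cand.vector);
      const redundancy = selected.length === 0
        ? 0
        : Math.max(...selected.map((s) => cosine(cand.vector, s.vector)));
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    });
    selected.push(remaining.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

With `lambda = 1` this degenerates to plain relevance ranking. Lower values trade relevance for variety, which helps when several transcript chunks say nearly the same thing.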
Answer Generation
Retrieved context is passed to Groq-hosted language models to generate answers.
Session history is maintained to support multi-turn conversations over meeting data.
Access control follows the same authorization model as transcript access:
- Transcript owner
- Approved transcript request
- Legacy transcripts without ownership metadata
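The three rules above translate into a short predicate. The field names used here (`ownerId`, `transcriptId`, `requesterId`, `status`) and the function name are hypothetical and stand in for whatever the real schema uses.

```javascript
// Returns true if the user may read or query the transcript.
function canAccessTranscript(userId, transcript, requests = []) {
  // Legacy transcripts carry no ownership metadata and stay accessible.
  if (transcript.ownerId === undefined || transcript.ownerId === null) return true;

  // The owner always has access.
  if (transcript.ownerId === userId) return true;

  // Otherwise an approved access request from this user is required.
  return requests.some(
    (r) =>
      r.transcriptId === transcript.id &&
      r.requesterId === userId &&
      r.status === "approved"
  );
}
```

Sharing one predicate between transcript viewing and RAG queries keeps the two access paths from drifting apart.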
Tradeoffs And Future Improvements
Several known tradeoffs remain in the current architecture.
- Meeting cleanup jobs execute independently in each backend process.
- BullMQ workers currently run alongside the application server rather than in dedicated worker processes.
- The transcript service does not yet use a centralized job queue.
- Some browser-specific handling remains necessary, including Safari media preview workarounds.
These decisions were acceptable for the current scale of the platform, but dedicated workers and queue-based processing would be natural next steps.
After Putting It All Together
Hoovik evolved from a simple video meeting application into a distributed platform that combines WebRTC, real-time machine learning, transcript intelligence, and retrieval-augmented search.
The most interesting part of the project was not any single technology. It was designing the boundaries between services and making them work reliably together under real-world constraints.
If you'd like to explore the implementation, try the interactive demo or browse the source code on GitHub.
