Anupam Kumar

Posted on Jun 3

I Built an AI-Powered Meeting Platform From Scratch — Here’s How It Actually Works

#webdev #javascript #ai #python

A complete breakdown of Hoovik: WebRTC signaling, distributed Node.js with Redis, real-time emotion AI, RAG on meeting transcripts, and a Python transcription pipeline — all wired together.

👉 GitHub: https://github.com/AnupamKumar-1/Hoovik

🌐 Live Demo: https://hoovik.onrender.com

🎮 Interactive Demo: https://app.supademo.com/demo/cmpy5ggyv95b0qmy7ccrkd3ms?utm_source=link

I've previously written about individual parts of Hoovik, including its emotion analysis system and WebRTC signaling architecture.

Those articles focused on specific subsystems. This one focuses on the complete platform.

Hoovik is not a single application. It is a collection of services working together: a React/WebRTC frontend, a distributed Node.js backend, a transcription pipeline, a real-time emotion recognition service, and a retrieval-augmented search system built on meeting transcripts.

This article walks through how those systems interact, the architectural decisions behind them, and the tradeoffs encountered while building each component.

What Hoovik Actually Is

Hoovik is a multi-party video meeting platform that combines real-time communication, AI-assisted analysis, and transcript intelligence.

The platform includes:

Real-time WebRTC video meetings with Socket.IO signaling
Live facial and vocal emotion analysis for meeting participants
Multi-speaker transcription with segment-level NLP emotion tagging
AI-generated meeting summaries enriched with live emotion data
Retrieval-Augmented Generation (RAG) over meeting transcripts
Transcript access requests and approval workflows
Distributed room management backed by Redis and MongoDB

The system is composed of four primary services.

The Four Services

React Frontend (Vite)
Node.js Backend (Express + Socket.IO)
Python Transcript Service (FastAPI)
Python Emotion Service (FastAPI + Socket.IO)

The remainder of this article follows the lifecycle of a meeting and explains how each service participates.

1. The Node.js Backend

The backend is responsible for:

Authentication
Meeting creation and management
Socket.IO signaling
Transcript storage
Transcript access requests
AI summary generation
RAG indexing and querying

The deployment runs as multiple PM2 processes connected through:

MongoDB for persistence
Redis for shared state
Socket.IO Redis Adapter for cross-process event delivery

Shared Room State

Room state cannot safely live in process memory when multiple Node.js instances are handling requests.

Instead, mutable meeting state is stored in Redis.

Participants are stored in a Redis Hash:

meeting:participants:<roomCode>

Each field contains a serialized participant object.

This design allows:

Targeted HSET updates during joins
Targeted HDEL updates during leaves
Shared state across all backend processes
Reduced serialization overhead, because a join or leave rewrites one field instead of the whole participant list

Join order is stored separately and is used for WebRTC role assignment.
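A minimal sketch of how the participant hash could be read and written follows. The key format comes from the section above. Everything else is an assumption for illustration: the helper names, keying each field by participant.id, and a client that exposes ioredis-style hset, hdel, and hgetall methods.

```javascript
// Key format taken from the article; all helper names are hypothetical.
const participantsKey = (roomCode) => `meeting:participants:${roomCode}`;

// `redis` is any client with ioredis-style hset/hdel/hgetall methods (assumed).
async function addParticipant(redis, roomCode, participant) {
  // One field per participant: only this entry is serialized and written.
  await redis.hset(
    participantsKey(roomCode),
    participant.id, // assumed field name
    JSON.stringify(participant)
  );
}

async function removeParticipant(redis, roomCode, participantId) {
  // Removes a single field without touching the other participants.
  await redis.hdel(participantsKey(roomCode), participantId);
}

async function listParticipants(redis, roomCode) {
  const fields = await redis.hgetall(participantsKey(roomCode));
  return Object.values(fields).map((json) => JSON.parse(json));
}
```

Because every backend process reads and writes the same key, any process can answer "who is in this room" without holding local state.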

Distributed Join Locking

Joining a room modifies shared state.

To prevent race conditions, room joins are serialized using a Redis-backed distributed lock.

await withRoomLock(meetingCode, async () => {
  // join logic
});

The lock uses:

SET NX PX acquisition
Token-based ownership
Lua-script compare-and-delete release

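As long as a join completes before the lock's expiry (the PX timeout), only one join operation mutates a room's state at a time. The token check on release ensures a process never deletes a lock that another process acquired after its own expired.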
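The three ingredients listed above can be sketched roughly as follows. This is not Hoovik's actual implementation: the key format, timeouts, and retry loop are assumptions, and the Redis client is passed in explicitly (the snippet above takes only the meeting code and callback). The client is assumed to accept ioredis-style set(key, value, "PX", ttl, "NX") and eval(script, numKeys, key, arg) calls.

```javascript
const { randomUUID } = require("node:crypto");

// Lua script: delete the key only if it still holds our token.
const RELEASE_SCRIPT = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0
`;

// Hypothetical helper; option names and defaults are illustrative.
async function withRoomLock(
  redis,
  roomCode,
  fn,
  { ttlMs = 5000, retryMs = 50, maxWaitMs = 2000 } = {}
) {
  const key = `lock:room:${roomCode}`; // assumed key format
  const token = randomUUID(); // proves ownership on release
  const deadline = Date.now() + maxWaitMs;

  // SET ... NX PX returns "OK" only when the key did not already exist.
  while ((await redis.set(key, token, "PX", ttlMs, "NX")) !== "OK") {
    if (Date.now() >= deadline) {
      throw new Error(`Timed out waiting for lock on room ${roomCode}`);
    }
    await new Promise((resolve) => setTimeout(resolve, retryMs));
  }

  try {
    return await fn();
  } finally {
    // Compare-and-delete: a lock re-acquired by someone else is left alone.
    await redis.eval(RELEASE_SCRIPT, 1, key, token);
  }
}
```

Running the comparison and the delete inside one Lua script matters because a separate GET followed by DEL could remove a lock that changed hands between the two commands.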

Authentication

Authentication uses JWT access tokens and refresh token rotation. Each session relies on two tokens:

A short-lived JWT access token
An opaque refresh token stored only in an HttpOnly cookie

Refresh tokens are rotated on every refresh request, reducing replay risk while preserving user sessions.
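To make the rotation step concrete, here is a small sketch under stated assumptions. The in-memory Map stands in for whatever server-side store the backend really uses, the function names and the seven-day lifetime are hypothetical, and storing only a hash of the token is a common practice rather than something the article confirms.

```javascript
const { randomBytes, createHash } = require("node:crypto");

// In-memory stand-in for the server-side token store (an assumption).
const refreshTokens = new Map(); // sha256(token) -> { userId, expiresAt }

const hashToken = (token) =>
  createHash("sha256").update(token).digest("hex");

function issueRefreshToken(userId, ttlMs = 7 * 24 * 60 * 60 * 1000) {
  const token = randomBytes(32).toString("hex"); // opaque, carries no claims
  refreshTokens.set(hashToken(token), {
    userId,
    expiresAt: Date.now() + ttlMs,
  });
  return token;
}

// Rotation: the presented token is consumed and a new one is issued.
// A token that is unknown, expired, or already used yields null.
function rotateRefreshToken(presentedToken) {
  const key = hashToken(presentedToken);
  const record = refreshTokens.get(key);
  refreshTokens.delete(key); // single use, even when expired
  if (!record || record.expiresAt <= Date.now()) return null;
  return {
    userId: record.userId,
    refreshToken: issueRefreshToken(record.userId),
  };
}
```

Since each token is valid for exactly one refresh, a stolen token that has already been used by the legitimate client is rejected on replay.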

2. The Frontend

The frontend is a React application built around specialized hooks that manage independent subsystems.

Major responsibilities include:

WebRTC peer connection management
Socket.IO signaling
Chat
Active speaker detection
Emotion capture
Recording
Transcript viewing
RAG interaction

WebRTC

Peer connections are managed through dedicated React hooks and implement the perfect negotiation pattern.

The application supports:

Multi-party video
ICE restarts
Screen sharing
Remote participant management
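The heart of the perfect negotiation pattern mentioned above is the rule for handling an offer that arrives while the local peer is also negotiating. The sketch below isolates that rule as a pure function so it can be read without a browser. The function names and return labels are made up for illustration, and the "later joiner is polite" rule is only one plausible way to derive roles from join order, not necessarily the one Hoovik uses.

```javascript
// Decides how to react to an incoming session description.
// `polite` is the local peer's role; `makingOffer` is true while the local
// peer is in the middle of creating and setting its own offer.
function classifyIncomingDescription({ type, polite, makingOffer, signalingState }) {
  const offerCollision =
    type === "offer" && (makingOffer || signalingState !== "stable");
  if (!offerCollision) return "apply";
  // The impolite peer keeps its own offer; the polite peer yields,
  // rolling back its offer and accepting the remote one.
  return polite ? "rollback-and-apply" : "ignore";
}

// Hypothetical role assignment from join order: the later joiner is polite.
function isPolite(localJoinIndex, remoteJoinIndex) {
  return localJoinIndex > remoteJoinIndex;
}
```

Because exactly one side of every pair is polite, a simultaneous offer from both peers always resolves to a single surviving offer.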

Active Speaker Detection

Two independent detection paths exist.

SSRC Path

When available:

RTCRtpReceiver.getSynchronizationSources()

is used to obtain RTP audio levels directly.

RMS Fallback

Browsers that do not expose audio levels through getSynchronizationSources() fall back to:

Web Audio API
AnalyserNode
RMS energy calculations

The application selects the appropriate method dynamically.
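The RMS fallback boils down to a short calculation over one frame of audio samples. The sketch below assumes samples in the range [-1, 1], such as the buffer filled by AnalyserNode.getFloatTimeDomainData(). The pickActiveSpeaker helper and its 0.02 threshold are hypothetical and only show how per-participant levels might be compared.

```javascript
// Root-mean-square energy of one frame of time-domain samples.
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Picks the loudest participant at or above a threshold, or null if
// everyone is quiet. The threshold is an assumed value.
function pickActiveSpeaker(levelsById, threshold = 0.02) {
  let best = null;
  for (const [id, level] of Object.entries(levelsById)) {
    if (level >= threshold && (best === null || level > levelsById[best])) {
      best = id;
    }
  }
  return best;
}
```

The same pickActiveSpeaker comparison would work for either path, since both paths reduce each participant to a single loudness number.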

Emotion Capture

The host captures:

Video frames from remote participants
Audio chunks from remote participant streams

Captured media is sent directly to the emotion service using dedicated Socket.IO connections.

Each participant receives an independent emotion-service connection, allowing participant-level media state tracking and backpressure control.

The emotion service can instruct the frontend to adjust capture rates through server status and backpressure events.

Emotion-Aware Summaries

Emotion events collected during a meeting are stored locally and later submitted when generating an AI summary.

The backend combines:

Transcript-derived emotion information
Live captured emotion history

This enables AI summaries to highlight notable discrepancies between spoken content and observed participant emotions.

3. The Transcript Service

The transcript service is implemented in FastAPI.

Its responsibilities include:

Audio processing
Speech recognition
Speaker segmentation
Segment-level NLP emotion classification

The service uses:

Whisper
DistilRoBERTa

for transcription and emotion tagging.

Asynchronous Processing

Meeting recordings are uploaded after a meeting ends.

The service immediately returns:

HTTP 202 Accepted

and performs processing in a background task.

The processing pipeline is:

Audio Upload
↓
FFmpeg Conversion
↓
Whisper Transcription
↓
Segment Merging
↓
NLP Emotion Classification (DistilRoBERTa)
↓
Transcript Callback To Node Backend

Transcript Delivery

After processing completes, the transcript service sends structured transcript data back to the Node.js backend.

Retry logic is used to improve reliability during temporary backend failures.

4. The Emotion Service

The emotion service performs real-time inference on participant media streams.

The frontend sends:

emotion.frame events
audio_chunk events

directly to the service.

The service performs inference using:

Wav2Vec2
MediaPipe
XGBoost ensemble models

and emits:

emotion.result

events back to the frontend.

Modality-Aware Processing

Inference continues even when a participant disables one modality.

Examples:

Camera enabled, microphone disabled → video-only mode
Microphone enabled, camera disabled → audio-only mode

This allows emotion tracking to continue without requiring both media streams.

Backpressure Support

The service also emits:

server.status
backpressure

events that allow the frontend to dynamically adjust capture rates and reduce load.
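One way the frontend might react to these events is to lengthen its capture interval while the service reports overload and shorten it again once the load clears. The sketch below is an assumption throughout: the interval bounds, the multipliers, and the single `overloaded` flag are illustrative, and the real event payloads are not described in this article.

```javascript
const MIN_INTERVAL_MS = 500; // assumed fastest capture rate
const MAX_INTERVAL_MS = 5000; // assumed slowest capture rate

// Returns the next capture interval given the current one and a load signal.
// `overloaded` stands in for whatever the backpressure or server.status
// payload actually carries.
function nextCaptureInterval(currentMs, overloaded) {
  // Back off quickly under load, recover more gradually afterwards.
  const next = overloaded ? currentMs * 2 : currentMs * 0.75;
  return Math.min(MAX_INTERVAL_MS, Math.max(MIN_INTERVAL_MS, Math.round(next)));
}
```

Clamping to fixed bounds keeps a burst of backpressure events from stalling capture entirely, and keeps recovery from exceeding the intended maximum rate.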

5. The RAG Pipeline

After transcripts are stored, they can be indexed for semantic retrieval.

The pipeline, from indexing through to answering a question, consists of:

Chunking
Embedding generation
Background indexing
Vector retrieval
LLM answer generation

Chunking

When speaker segments are available, chunks preserve:

Speaker attribution
Timestamps
Transcript structure

Otherwise, a sliding-window chunking strategy is used.
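Both chunking paths can be sketched in a few lines. The segment shape ({ speaker, start, end, text }), the word-based window, and the default sizes are assumptions for illustration; the real pipeline may chunk by characters or tokens and may group segments differently.

```javascript
// Sliding-window fallback: fixed-size windows of words with overlap.
function slidingWindowChunks(text, windowSize = 200, overlap = 40) {
  if (overlap >= windowSize) {
    throw new RangeError("overlap must be smaller than windowSize");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += windowSize - overlap) {
    chunks.push(words.slice(start, start + windowSize).join(" "));
    if (start + windowSize >= words.length) break; // window reached the end
  }
  return chunks;
}

// Speaker-aware path: merge consecutive segments from the same speaker,
// keeping the speaker label and the combined time range.
function speakerChunks(segments) {
  const chunks = [];
  for (const { speaker, start, end, text } of segments) {
    const last = chunks[chunks.length - 1];
    if (last && last.speaker === speaker) {
      last.end = end;
      last.text += " " + text;
    } else {
      chunks.push({ speaker, start, end, text });
    }
  }
  return chunks;
}
```

The overlap in the sliding-window path means a sentence split across a window boundary still appears whole in at least one chunk, provided it is shorter than the overlap.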

Embeddings

Embeddings are generated using:

nomic-embed-text-v1.5

Embedding results are cached in Redis to avoid redundant computation.
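A cache-aside lookup for embeddings might look like the sketch below. The key scheme (model name plus a SHA-256 of the chunk text) is an assumption, and both the cache client and the embed function are injected, hypothetical dependencies. The cache only needs async get and set methods, which Redis clients such as ioredis provide.

```javascript
const { createHash } = require("node:crypto");

const MODEL = "nomic-embed-text-v1.5";

// Cache key derived from the model name and the exact chunk text (assumed).
const cacheKey = (text) =>
  `embed:${MODEL}:${createHash("sha256").update(text).digest("hex")}`;

// `cache` needs async get/set; `embed` is whatever function actually
// calls the embedding model.
async function getEmbedding(cache, embed, text) {
  const key = cacheKey(text);
  const hit = await cache.get(key);
  if (hit !== null && hit !== undefined) return JSON.parse(hit);

  const vector = await embed(text); // cache miss: compute once
  await cache.set(key, JSON.stringify(vector));
  return vector;
}
```

Including the model name in the key means that switching embedding models later cannot return vectors produced by the old one.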

Indexing

Transcript indexing runs asynchronously through BullMQ workers.

This prevents long-running embedding operations from blocking API requests.

Retrieval

Retrieval combines:

MongoDB Vector Search
Maximal Marginal Relevance (MMR)

to balance relevance and diversity.
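MMR re-ranks the candidates returned by vector search so that each additional result is relevant to the query but not a near-duplicate of one already chosen. The sketch below is a generic greedy implementation over plain arrays, not Hoovik's code. The candidate shape ({ id, embedding }) and the default lambda of 0.5 are assumptions.

```javascript
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Greedy MMR: at each step pick the candidate that maximizes
//   lambda * sim(query, c) - (1 - lambda) * max sim(c, alreadySelected)
function mmr(queryVec, candidates, k, lambda = 0.5) {
  const remaining = [...candidates];
  const selected = [];
  while (selected.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    remaining.forEach((c, i) => {
      const relevance = cosine(queryVec, c.embedding);
      const redundancy = selected.length
        ? Math.max(...selected.map((s) => cosine(c.embedding, s.embedding)))
        : 0;
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    });
    selected.push(remaining.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

A lambda of 1 reduces this to plain relevance ranking, while lower values push the selection toward chunks that cover different parts of the meeting.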

Answer Generation

Retrieved context is passed to Groq-hosted language models to generate answers.

Session history is maintained to support multi-turn conversations over meeting data.

Access control follows the same authorization model as transcript access. A user may query a transcript in any of these cases:

Transcript owner
Approved transcript request
Legacy transcripts without ownership metadata
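Expressed as a single check, the three cases could look like the sketch below. The field names (ownerId, approvedUserIds) are hypothetical, and treating legacy transcripts as open to any authenticated user is an assumption about what the third case means.

```javascript
// Returns true when `userId` may read or query the transcript.
// Field names are hypothetical.
function canAccessTranscript(transcript, userId) {
  // Legacy transcript: no ownership metadata was ever recorded (assumed open).
  if (transcript.ownerId == null) return true;
  // The owner always has access.
  if (transcript.ownerId === userId) return true;
  // Otherwise access requires an approved transcript request.
  return (transcript.approvedUserIds ?? []).includes(userId);
}
```

Keeping this in one function lets the transcript endpoints and the RAG endpoints share exactly the same rule.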

Tradeoffs And Future Improvements

Several known tradeoffs remain in the current architecture.

Meeting cleanup jobs execute independently in each backend process.
BullMQ workers currently run alongside the application server rather than in dedicated worker processes.
The transcript service does not yet use a centralized job queue.
Some browser-specific handling remains necessary, including Safari media preview workarounds.

These decisions were acceptable for the current scale of the platform, but dedicated workers and queue-based processing would be natural next steps.

Putting It All Together

Hoovik evolved from a simple video meeting application into a distributed platform that combines WebRTC, real-time machine learning, transcript intelligence, and retrieval-augmented search.

The most interesting part of the project was not any single technology. It was designing the boundaries between services and making them work reliably together under real-world constraints.

If you'd like to explore the implementation, try the interactive demo or browse the source code on GitHub.
