Anupam Kumar

I Built an AI-Powered Meeting Platform From Scratch — Here’s How It Actually Works

A complete breakdown of Hoovik: WebRTC signaling, distributed Node.js with Redis, real-time emotion AI, RAG on meeting transcripts, and a Python transcription pipeline — all wired together.


👉 GitHub: https://github.com/AnupamKumar-1/Hoovik

🌐 Live Demo: https://hoovik.onrender.com

🎮 Interactive Demo: https://app.supademo.com/demo/cmpy5ggyv95b0qmy7ccrkd3ms?utm_source=link

I've previously written about individual parts of Hoovik, including its emotion analysis system and WebRTC signaling architecture.

Those articles focused on specific subsystems. This one focuses on the complete platform.

Hoovik is not a single application. It is a collection of services working together: a React/WebRTC frontend, a distributed Node.js backend, a transcription pipeline, a real-time emotion recognition service, and a retrieval-augmented search system built on meeting transcripts.

This article walks through how those systems interact, the architectural decisions behind them, and the tradeoffs encountered while building each component.


What Hoovik Actually Is

Hoovik is a multi-party video meeting platform that combines real-time communication, AI-assisted analysis, and transcript intelligence.

The platform includes:

  • Real-time WebRTC video meetings with Socket.IO signaling
  • Live facial and vocal emotion analysis for meeting participants
  • Multi-speaker transcription with segment-level NLP emotion tagging
  • AI-generated meeting summaries enriched with live emotion data
  • Retrieval-Augmented Generation (RAG) over meeting transcripts
  • Transcript access requests and approval workflows
  • Distributed room management backed by Redis and MongoDB

The system is composed of four primary services.

The Four Services

Hoovik Services

  1. React Frontend (Vite)
  2. Node.js Backend (Express + Socket.IO)
  3. Python Transcript Service (FastAPI)
  4. Python Emotion Service (FastAPI + Socket.IO)

The remainder of this article follows the lifecycle of a meeting and explains how each service participates.

1. The Node.js Backend

The backend is responsible for:

  • Authentication
  • Meeting creation and management
  • Socket.IO signaling
  • Transcript storage
  • Transcript access requests
  • AI summary generation
  • RAG indexing and querying

The deployment runs as multiple PM2 processes connected through:

  • MongoDB for persistence
  • Redis for shared state
  • Socket.IO Redis Adapter for cross-process event delivery

Shared Room State

Room state cannot safely live in process memory when multiple Node.js instances are handling requests.

Instead, mutable meeting state is stored in Redis.

Participants are stored in a Redis Hash:

```text
meeting:participants:
```

Each field contains a serialized participant object.

This design allows:

  • Targeted HSET updates during joins
  • Targeted HDEL updates during leaves
  • Shared state across all backend processes
  • Reduced serialization overhead, because a join or leave rewrites one field instead of the whole participant list

Join order is stored separately and is used for WebRTC role assignment.
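
The hash operations described above can be sketched roughly as follows. This is a minimal illustration, not Hoovik's actual code: it assumes an ioredis-style client (`hset`, `hdel`, `hgetall`) passed in as `redis`, and the function names, the participant shape, and the use of the meeting code as the key suffix are all assumptions made for the example.

```javascript
// Hypothetical helper: assumes the hash key is the documented prefix plus the meeting code.
const participantsKey = (meetingCode) => `meeting:participants:${meetingCode}`;

// `redis` is an injected ioredis-style client, so the sketch has no hard dependency.
async function addParticipant(redis, meetingCode, participant) {
  // One HSET touches only this participant's field, not the whole room.
  await redis.hset(
    participantsKey(meetingCode),
    participant.id,
    JSON.stringify(participant)
  );
}

async function removeParticipant(redis, meetingCode, participantId) {
  // One HDEL removes only the leaving participant's field.
  await redis.hdel(participantsKey(meetingCode), participantId);
}

async function listParticipants(redis, meetingCode) {
  // HGETALL returns { field: serializedParticipant }; any backend process can read it.
  const raw = await redis.hgetall(participantsKey(meetingCode));
  return Object.values(raw).map((json) => JSON.parse(json));
}
```

Because every backend process reads and writes the same hash, none of them needs its own in-memory copy of the room.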

Distributed Join Locking

Joining a room modifies shared state.

To prevent race conditions, room joins are serialized using a Redis-backed distributed lock.

```js
await withRoomLock(meetingCode, async () => {
  // join logic
});
```

The lock uses:

  • SET NX PX acquisition
  • Token-based ownership
  • Lua-script compare-and-delete release

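As long as a join finishes before the lock's expiry, only one join operation mutates room state at a time. The token check on release ensures a process never deletes a lock that has already expired and been acquired by someone else.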
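
A lock built from those three pieces might look like the sketch below. It is an assumption-heavy illustration rather than the project's implementation: it takes an ioredis-style client as an extra first argument (the call shown earlier does not), and the `lock:room:` key prefix, the TTL, and the retry settings are hypothetical values.

```javascript
const { randomUUID } = require('node:crypto');

// Compare-and-delete: remove the key only if it still holds our token.
const RELEASE_SCRIPT = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`;

// Assumes an ioredis-style client. Key prefix, TTL, and retry values are illustrative.
async function withRoomLock(
  redis,
  meetingCode,
  fn,
  { ttlMs = 5000, retryMs = 50, maxWaitMs = 3000 } = {}
) {
  const key = `lock:room:${meetingCode}`;
  const token = randomUUID(); // proves ownership at release time
  const deadline = Date.now() + maxWaitMs;

  // SET key token PX ttl NX succeeds only when nobody currently holds the lock.
  while ((await redis.set(key, token, 'PX', ttlMs, 'NX')) !== 'OK') {
    if (Date.now() >= deadline) {
      throw new Error(`Timed out waiting for room lock ${meetingCode}`);
    }
    await new Promise((resolve) => setTimeout(resolve, retryMs));
  }

  try {
    return await fn();
  } finally {
    // No-op if our lock expired and another process now owns the key.
    await redis.eval(RELEASE_SCRIPT, 1, key, token);
  }
}
```

The TTL needs to be comfortably longer than the slowest expected join. If the lock expires mid-join, a second join can start while the first is still running.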

Authentication

Authentication uses JWT access tokens and refresh token rotation.

Login issues:

  • A short-lived JWT access token
  • An opaque refresh token stored only in an HttpOnly cookie

Refresh tokens are rotated on every refresh request, reducing replay risk while preserving user sessions.
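
To make the rotation step concrete, here is a small in-memory sketch. Everything about the storage is an assumption: the `RefreshTokenStore` class, the choice to store only a SHA-256 hash of each token, and the seven-day lifetime are illustrative, and a real deployment would keep these records in a database rather than a `Map`.

```javascript
const { randomBytes, createHash } = require('node:crypto');

const hashToken = (token) => createHash('sha256').update(token).digest('hex');

// In-memory stand-in for a persistent store: token hash -> { userId, expiresAt }.
class RefreshTokenStore {
  constructor(ttlMs = 7 * 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.records = new Map();
  }

  // Issues an opaque token (random bytes, no embedded claims).
  issue(userId, now = Date.now()) {
    const token = randomBytes(32).toString('hex');
    this.records.set(hashToken(token), { userId, expiresAt: now + this.ttlMs });
    return token;
  }

  // Consumes the presented token and returns a replacement.
  // Returns null if the token is unknown, already used, or expired.
  rotate(presented, now = Date.now()) {
    const key = hashToken(presented);
    const record = this.records.get(key);
    if (!record) return null;
    this.records.delete(key); // single use: the old token dies here
    if (record.expiresAt <= now) return null;
    return { userId: record.userId, refreshToken: this.issue(record.userId, now) };
  }
}
```

On a successful rotation the server would set the new refresh token in the HttpOnly cookie and sign a fresh access token for `userId`. A replayed old token finds no record and is rejected.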

2. The Frontend

The frontend is a React application built around specialized hooks that manage independent subsystems.

Major responsibilities include:

  • WebRTC peer connection management
  • Socket.IO signaling
  • Chat
  • Active speaker detection
  • Emotion capture
  • Recording
  • Transcript viewing
  • RAG interaction

WebRTC

Peer connections are managed through dedicated React hooks and implement the perfect negotiation pattern.

The application supports:

  • Multi-party video
  • ICE restarts
  • Screen sharing
  • Remote participant management

Active Speaker Detection

Two independent detection paths exist.

SSRC Path

When available:

```js
RTCRtpReceiver.getSynchronizationSources()
```

is used to obtain RTP audio levels directly.

RMS Fallback

Browsers that do not report audio levels through getSynchronizationSources() fall back to:

  • Web Audio API
  • AnalyserNode
  • RMS energy calculations

The application selects the appropriate method dynamically.
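
The two paths and the dynamic selection between them can be sketched as below. This is an illustration under assumptions: the function names and the `0.02` threshold are hypothetical, `receiver` is any object shaped like an `RTCRtpReceiver`, and `samples` is a `Float32Array` such as the one filled by `AnalyserNode.getFloatTimeDomainData()`.

```javascript
// Root-mean-square energy of time-domain samples in the range [-1, 1].
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Prefers the RTP-reported level when the browser exposes it, otherwise uses RMS.
function audioLevel(receiver, samples) {
  const sources =
    typeof receiver?.getSynchronizationSources === 'function'
      ? receiver.getSynchronizationSources()
      : [];
  const level = sources[0]?.audioLevel; // optional field; undefined on some browsers
  return typeof level === 'number' ? level : rms(samples);
}

// Returns the id of the loudest participant above the threshold, or null if all are quiet.
function pickActiveSpeaker(levels, threshold = 0.02) {
  let best = null;
  for (const [id, level] of Object.entries(levels)) {
    if (level >= threshold && (best === null || level > levels[best])) best = id;
  }
  return best;
}
```

The RTP level and the RMS value are not guaranteed to be on the same scale, so a production version would likely calibrate thresholds per path and smooth levels over time to avoid flicker.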

Emotion Capture

The host captures:

  • Video frames from remote participants
  • Audio chunks from remote participant streams

Captured media is sent directly to the emotion service using dedicated Socket.IO connections.

Each participant receives an independent emotion-service connection, allowing participant-level media state tracking and backpressure control.

The emotion service can instruct the frontend to adjust capture rates through server status and backpressure events.

Emotion-Aware Summaries

Emotion events collected during a meeting are stored locally and later submitted when generating an AI summary.

The backend combines:

  • Transcript-derived emotion information
  • Live captured emotion history

This enables AI summaries to highlight notable discrepancies between spoken content and observed participant emotions.

3. The Transcript Service

The transcript service is implemented in FastAPI.

Its responsibilities include:

  • Audio processing
  • Speech recognition
  • Speaker segmentation
  • Segment-level NLP emotion classification

The service uses:

  • Whisper for transcription
  • DistilRoBERTa for segment-level emotion tagging

Asynchronous Processing

Meeting recordings are uploaded after a meeting ends.

The service immediately returns:

```http
202 Accepted
```

and performs processing in a background task.

The processing pipeline is:

  1. Audio Upload
  2. FFmpeg Conversion
  3. Whisper Transcription
  4. Segment Merging
  5. NLP Emotion Classification (DistilRoBERTa)
  6. Transcript Callback To Node Backend

Transcript Delivery

After processing completes, the transcript service sends structured transcript data back to the Node.js backend.

Retry logic is used to improve reliability during temporary backend failures.
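
The article does not describe the retry policy, so the following is only an illustration of one common approach, exponential backoff. The transcript service itself is written in Python; JavaScript is used here purely to stay consistent with the other examples, and `withRetry`, the attempt count, and the base delay are all hypothetical.

```javascript
// Retries an async operation with exponential backoff.
// Rethrows the last error once all attempts are exhausted.
async function withRetry(operation, { attempts = 5, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      // Wait 1x, 2x, 4x, ... the base delay, but not after the final attempt.
      if (attempt < attempts - 1) {
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Wrapping the callback request in a helper like this lets a short backend restart pass without losing the finished transcript.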

4. The Emotion Service

The emotion service performs real-time inference on participant media streams.

The frontend sends:

  • emotion.frame events
  • audio_chunk events

directly to the service.

The service performs inference using:

  • Wav2Vec2
  • MediaPipe
  • XGBoost ensemble models

and emits:

```text
emotion.result
```

events back to the frontend.

Modality-Aware Processing

Inference continues even when a participant disables one modality.

Examples:

  • Camera enabled, microphone disabled → video-only mode
  • Microphone enabled, camera disabled → audio-only mode

This allows emotion tracking to continue without requiring both media streams.
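
The mode decision reduces to a small lookup on the two media flags. The sketch below is illustrative only: the function name and the `'multimodal'` and `'idle'` labels are assumptions, while `'video-only'` and `'audio-only'` come from the examples above.

```javascript
// Maps a participant's media state to an inference mode.
function selectInferenceMode({ cameraEnabled, microphoneEnabled }) {
  if (cameraEnabled && microphoneEnabled) return 'multimodal'; // hypothetical label
  if (cameraEnabled) return 'video-only';
  if (microphoneEnabled) return 'audio-only';
  return 'idle'; // hypothetical label: nothing to analyze
}
```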

Backpressure Support

The service also emits:

  • server.status
  • backpressure

events that allow the frontend to dynamically adjust capture rates and reduce load.
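
On the frontend, reacting to these events could be as simple as adjusting the interval between captures. The payload shape is not documented in this article, so the `{ level: 'high' | 'normal' }` signal, the function name, and all the numbers below are assumptions chosen for illustration.

```javascript
// Computes the next capture interval from the current one and a load signal.
// Backs off quickly under load and recovers gradually once load is normal.
function nextCaptureIntervalMs(currentMs, signal, { minMs = 500, maxMs = 5000 } = {}) {
  const proposed =
    signal.level === 'high'
      ? currentMs * 2 // halve the capture rate
      : currentMs - 250; // speed back up in small steps
  return Math.min(maxMs, Math.max(minMs, proposed));
}
```

Clamping to `minMs` and `maxMs` keeps a burst of events from stopping capture entirely or flooding the service once load drops.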

5. The RAG Pipeline

After transcripts are stored, they can be indexed for semantic retrieval.

The full pipeline, from indexing through answering, consists of:

  1. Chunking
  2. Embedding generation
  3. Background indexing
  4. Vector retrieval
  5. LLM answer generation

Chunking

When speaker segments are available, chunks preserve:

  • Speaker attribution
  • Timestamps
  • Transcript structure

Otherwise, a sliding-window chunking strategy is used.
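
Both strategies can be sketched in a few lines. The segment shape (`speaker`, `start`, `end`, `text`), the character budget, and the window and overlap sizes are assumptions for the example, not values taken from Hoovik.

```javascript
// Speaker-aware chunking: packs consecutive segments into chunks up to a character budget,
// keeping speaker labels in the text and the time range on the chunk.
function chunkBySegments(segments, maxChars = 800) {
  const chunks = [];
  let current = null;
  for (const seg of segments) {
    const line = `${seg.speaker}: ${seg.text}`;
    if (current && current.text.length + line.length + 1 > maxChars) {
      chunks.push(current);
      current = null;
    }
    if (!current) {
      current = { text: line, start: seg.start, end: seg.end, speakers: [seg.speaker] };
    } else {
      current.text += `\n${line}`;
      current.end = seg.end;
      if (!current.speakers.includes(seg.speaker)) current.speakers.push(seg.speaker);
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Fallback: fixed-size word windows that overlap so context spans chunk boundaries.
function chunkBySlidingWindow(text, windowWords = 200, overlapWords = 40) {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, windowWords - overlapWords);
  const chunks = [];
  for (let i = 0; i < words.length; i += step) {
    chunks.push({ text: words.slice(i, i + windowWords).join(' ') });
    if (i + windowWords >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

Keeping `start`, `end`, and `speakers` on each chunk lets an answer cite who said something and when.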

Embeddings

Embeddings are generated using:

```text
nomic-embed-text-v1.5
```

Embedding results are cached in Redis to avoid redundant computation.
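
A cache-aside wrapper is one plausible way to do this. In the sketch below, the key format, the SHA-256 content hash, and the function names are assumptions. `cache` is any object with async `get` and `set` (an ioredis client fits), and `embed` is whatever function actually calls the embedding model.

```javascript
const { createHash } = require('node:crypto');

// Key derived from model + text, so identical chunks share one cache entry.
function embeddingCacheKey(model, text) {
  const digest = createHash('sha256').update(text).digest('hex');
  return `embed:${model}:${digest}`;
}

// Returns a cached vector when present; otherwise computes, stores, and returns it.
async function getEmbedding(cache, embed, text, model = 'nomic-embed-text-v1.5') {
  const key = embeddingCacheKey(model, text);
  const cached = await cache.get(key);
  if (cached !== null && cached !== undefined) return JSON.parse(cached);

  const vector = await embed(text);
  await cache.set(key, JSON.stringify(vector));
  return vector;
}
```

Including the model name in the key means switching models never returns vectors from the old one.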

Indexing

Transcript indexing runs asynchronously through BullMQ workers.

This prevents long-running embedding operations from blocking API requests.

Retrieval

Retrieval combines:

  • MongoDB Vector Search
  • Maximal Marginal Relevance (MMR)

to balance relevance and diversity.
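
MMR re-ranks the vector search results greedily: each step picks the candidate that is most relevant to the query while being least similar to what has already been selected. The sketch below is a generic implementation of that idea, not Hoovik's code. The candidate shape (`id`, `embedding`) and the default `lambda` of 0.5 are assumptions.

```javascript
function cosine(a, b) {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Greedy MMR. Each step maximizes:
//   lambda * sim(query, c) - (1 - lambda) * max over selected s of sim(c, s)
function mmr(queryVec, candidates, k, lambda = 0.5) {
  const remaining = candidates.map((c) => ({
    ...c,
    relevance: cosine(queryVec, c.embedding),
  }));
  const selected = [];
  while (selected.length < k && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    remaining.forEach((c, i) => {
      const redundancy =
        selected.length === 0
          ? 0
          : Math.max(...selected.map((s) => cosine(c.embedding, s.embedding)));
      const score = lambda * c.relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    });
    selected.push(remaining.splice(bestIndex, 1)[0]);
  }
  return selected;
}
```

With `lambda = 1` this degenerates to plain relevance ranking. Lower values push near-duplicate transcript chunks out of the context window in favor of chunks that cover other parts of the meeting.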

Answer Generation

Retrieved context is passed to Groq-hosted language models to generate answers.

Session history is maintained to support multi-turn conversations over meeting data.

Access control follows the same authorization model as transcript access:

  • Transcript owner
  • Approved transcript request
  • Legacy transcripts without ownership metadata
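
Expressed as a single predicate, the three cases might look like this. The field names (`ownerId`, `approvedUserIds`) are hypothetical, and the sketch assumes that a legacy transcript without ownership metadata is readable by any authenticated user, which the article does not spell out.

```javascript
// Returns true if the user may read or query the transcript.
// Assumed shapes: transcript.ownerId is missing on legacy records;
// approvedUserIds lists users whose access request was approved.
function canAccessTranscript(userId, transcript, approvedUserIds = []) {
  // Legacy transcript: no ownership metadata to check (assumed to be open).
  if (transcript.ownerId === undefined || transcript.ownerId === null) return true;

  // Transcript owner.
  if (String(transcript.ownerId) === String(userId)) return true;

  // Approved transcript request.
  return approvedUserIds.map(String).includes(String(userId));
}
```

Sharing one predicate between the transcript endpoints and the RAG endpoints keeps the two authorization paths from drifting apart.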

Tradeoffs And Future Improvements

Several known tradeoffs remain in the current architecture.

  • Meeting cleanup jobs execute independently in each backend process.
  • BullMQ workers currently run alongside the application server rather than in dedicated worker processes.
  • The transcript service does not yet use a centralized job queue.
  • Some browser-specific handling remains necessary, including Safari media preview workarounds.

These decisions were acceptable for the current scale of the platform, but dedicated workers and queue-based processing would be natural next steps.

Putting It All Together

Hoovik evolved from a simple video meeting application into a distributed platform that combines WebRTC, real-time machine learning, transcript intelligence, and retrieval-augmented search.

The most interesting part of the project was not any single technology. It was designing the boundaries between services and making them work reliably together under real-world constraints.

If you'd like to explore the implementation, try the interactive demo or browse the source code on GitHub.
