Adeniji Olajide

Supportly – Real-Time Voice & Video Agent for Customer Support

AssemblyAI Voice Agents Challenge: Business Automation

This is a submission for the AssemblyAI Voice Agents Challenge

What I Built

Supportly is a plug-and-play real-time voice & video support module that developers can integrate into any web application. It falls under the following challenge categories:

Business Automation – The voice agent records interactions between support agents and customers and saves them to a database. After each session, it generates a summary of the conversation, which is automatically emailed to the customer (a rough sketch of this flow appears just below).
Real-Time Performance – Provides live transcription during support calls.

The project empowers support teams to offer on-demand human assistance while using AssemblyAI's streaming API to transcribe conversations live.
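
I won't paste the whole summary-and-email pipeline here, but the shape of it is roughly this. This is a sketch only: the helper name, the Gemini model id, and the SMTP settings are illustrative placeholders, not the exact Supportly code.

import { GoogleGenerativeAI } from "@google/generative-ai";
import nodemailer from "nodemailer";

// Hypothetical helper: summarize a finished session with Gemini and email the recap.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const transporter = nodemailer.createTransport({
    host: "smtp.example.com", // placeholder SMTP settings
    auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
});

async function sendSessionSummary(customerEmail, transcriptText) {
    // Ask Gemini for a short recap of the saved transcript
    const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
    const result = await model.generateContent(
        `Summarize this support conversation for the customer:\n\n${transcriptText}`
    );

    // Email the recap to the customer
    await transporter.sendMail({
        from: "support@example.com",
        to: customerEmail,
        subject: "Your Supportly session summary",
        text: result.response.text(),
    });
}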

Demo

https://supportly-zzsu.onrender.com




GitHub Repository

https://github.com/GoldenThrust/Supportly

Supportly - Video Support Call Scheduling Platform

A modern video call customer support application built with React Router v7, TypeScript, and Tailwind CSS. This platform allows customers to easily schedule video calls with support teams to resolve issues and get product assistance.

🚀 Features

Customer Features

  • Easy Session Booking: Schedule video support sessions with a simple form
  • Real-time Video Calls: High-quality video calls with screen sharing capabilities
  • Session Management: View upcoming and completed sessions
  • Profile Management: Update personal information and preferences
  • Session History: Track all past sessions with ratings and feedback

Admin/Support Team Features

  • Admin Dashboard: Comprehensive overview of all support sessions
  • Team Management: Manage support team members and their availability
  • Schedule Management: Set available time slots and manage bookings
  • Session Analytics: Track performance metrics and customer satisfaction

Technical Features

  • 🎥 Video Call Integration: Browser-based video calls (no additional software…

Technical Implementation & AssemblyAI Integration

The Supportly application uses AssemblyAI's streaming transcription service to provide real-time speech-to-text functionality during video support sessions. The integration involves:

  1. Audio Processing: Capturing audio from the user's microphone using the Web Audio API
  2. Real-time Streaming: Sending audio chunks to AssemblyAI via WebSocket
  3. Live Transcription: Receiving and displaying transcripts in real-time
  4. Multi-user Support: Managing separate transcription sessions for each user

Architecture Components

1. AssemblyAI Configuration (config/assembyai.js)

The main configuration class that handles the AssemblyAI streaming connection:

class AssemblyAIConfig {
    constructor() {
        try {
            this.client = new AssemblyAI({
                apiKey: process.env.ASSEMBLYAI_API_KEY,
            });
            this.transcriber = null;
            this.isConnected = false;
            this.isConnecting = false;
        } catch (error) {
            console.error(error);
        }
    }

    async run() {
        try {
            // Prevent multiple concurrent connection attempts
            if (this.isConnecting || this.isConnected) {
                console.log('Connection already in progress or established...');
                return;
            }

            this.isConnecting = true;

            this.transcriber = this.client.streaming.transcriber({
                sampleRate: 16_000,
                formatTurns: true
            });

            // Set up event handlers
            this.transcriber.on("open", ({ id }) => {
                console.log(`Session opened with ID: ${id}`);
                this.isConnected = true;
                this.isConnecting = false;
            });

            this.transcriber.on("error", (error) => {
                console.error("Transcriber error:", error);
                this.isConnected = false;
                this.isConnecting = false;
            });

            await this.transcriber.connect();
            console.log("Starting streaming...");
        } catch (error) {
            console.error('Error in run():', error);
            this.isConnected = false;
            this.isConnecting = false;
        }
    }

    // Register a callback that receives the transcript text from each turn event
    transcribe(callBack) {
        this.transcriber.on("turn", (turn) => {
            if (!turn.transcript) {
                return;
            }
            callBack(turn.transcript);
        });
    }
}
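
The WebSocket manager in the next section calls a safeClose() method on this class that isn't shown above. A minimal sketch of it, assuming the streaming transcriber's close() method, could look like this:

    // Sketch of the cleanup method used by the WebSocket manager below
    async safeClose() {
        try {
            if (this.transcriber && this.isConnected) {
                await this.transcriber.close();
            }
        } catch (error) {
            console.error('Error closing transcriber:', error);
        } finally {
            this.transcriber = null;
            this.isConnected = false;
            this.isConnecting = false;
        }
    }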

2. WebSocket Manager (config/websocket.js)

Manages the connection between clients and handles AssemblyAI instances for each user:

class WebSocketManager {
    constructor() {
        this.io = null;
        this.userTranscribers = new Map(); // Store AssemblyAI instance per user
    }

    async connect(io) {
        io.on("connection", async (socket) => {
            // Create a new AssemblyAI instance for this user
            const assemblyai = new AssemblyAIConfigClass();
            this.userTranscribers.set(socket.id, assemblyai);

            socket.on("start-transcription", async () => {
                console.log(`Starting transcription for ${socket.user.email}`);
                const assemblyai = this.userTranscribers.get(socket.id);
                if (assemblyai) {
                    // Check if already running to prevent duplicate starts
                    if (assemblyai.isConnected || assemblyai.isConnecting) {
                        console.log('Transcription already running or starting...');
                        return;
                    }

                    try {
                        await assemblyai.run();
                        assemblyai.transcribe((transcript) => {
                            console.log(`Transcription for ${socket.user.email}:`, transcript);
                            // Emit transcription to all users in the session
                            // (sessionId is set when the user joins the session room; that handler is omitted here)
                            socket.to(sessionId).emit("transcription", transcript);
                        });
                        console.log('Transcription started successfully');
                    } catch (error) {
                        console.error('Error starting transcription:', error);
                    }
                }
            });

            socket.on('audio-chunk', async (audioBlob) => {
                const assemblyai = this.userTranscribers.get(socket.id);
                if (assemblyai) {
                    try {
                        assemblyai.transcriber.sendAudio(Buffer.from(audioBlob));
                    } catch (error) {
                        console.error('Error processing audio chunk:', error);
                    }
                }
            });

            socket.on("disconnect", async () => {
                // Clean up transcription when user disconnects
                const assemblyai = this.userTranscribers.get(socket.id);
                if (assemblyai) {
                    await assemblyai.safeClose();
                    this.userTranscribers.delete(socket.id);
                }
            });
        });
    }
}
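
For context, the manager is attached to the HTTP server with the standard Socket.IO setup; something along these lines (the Express/server bootstrap below is a sketch, not the exact Supportly entry point):

import { createServer } from "http";
import { Server } from "socket.io";
import express from "express";

const app = express();
const httpServer = createServer(app);
const io = new Server(httpServer, { cors: { origin: "*" } });

// Register the connection handlers from WebSocketManager
const websocketManager = new WebSocketManager();
websocketManager.connect(io);

httpServer.listen(3000, () => console.log("Supportly server listening on port 3000"));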

3. Audio Processing (public/audio-processor.js)

Web Audio API worklet for processing audio in real-time:

const MAX_16BIT_INT = 32767

class AudioProcessor extends AudioWorkletProcessor {
  process(inputs) {
    try {
      const input = inputs[0]
      if (!input) throw new Error('No input')

      const channelData = input[0]
      if (!channelData) throw new Error('No channelData')

      // Convert Float32 audio data to Int16 for AssemblyAI
      const float32Array = Float32Array.from(channelData)
      const int16Array = Int16Array.from(
        float32Array.map((n) => n * MAX_16BIT_INT)
      )
      const buffer = int16Array.buffer

      // Send processed audio to main thread
      this.port.postMessage({ audio_data: buffer })

      // Returning true keeps the processor alive for subsequent audio blocks
      return true
    } catch (error) {
      console.error(error)
      // Returning false allows the browser to shut this processor down
      return false
    }
  }
}

registerProcessor('audio-processor', AudioProcessor)

4. Frontend Integration (app/routes/video-call.$sessionId.tsx)

The React component that handles the UI and audio processing:

export default function VideoCall() {
  // localStreamRef, audioContextRef, and socketRef are declared elsewhere in this component (omitted from the excerpt)
  const audioWorkletNodeRef = useRef<AudioWorkletNode | null>(null);
  const audioBufferQueueRef = useRef<Int16Array>(new Int16Array(0));
  const [transcripts, setTranscripts] = useState<Array<{
    id: number;
    text: string;
    timestamp: Date;
    speaker: string;
  }>>([]);
  const [currentTranscript, setCurrentTranscript] = useState("");

  // Setup audio processor for real-time transcription
  const setupAudioProcessor = async () => {
    try {
      if (!localStreamRef.current) return;

      // Create audio context with 16kHz sample rate (required by AssemblyAI)
      audioContextRef.current = new AudioContext({
        sampleRate: 16000,
        latencyHint: "balanced",
      });

      // Load audio processor worklet
      await audioContextRef.current.audioWorklet.addModule(
        "/audio-processor.js"
      );

      // Create audio worklet node
      audioWorkletNodeRef.current = new AudioWorkletNode(
        audioContextRef.current,
        "audio-processor"
      );

      // Handle processed audio data
      audioWorkletNodeRef.current.port.onmessage = (event) => {
        const { audio_data } = event.data;

        // Merge with previous buffer
        const newBuffer = new Int16Array(audio_data);
        audioBufferQueueRef.current = mergeBuffers(
          audioBufferQueueRef.current, 
          newBuffer
        );

        // Send audio chunks when buffer reaches sufficient size
        const CHUNK_SIZE = 1600; // 100ms at 16kHz
        while (audioBufferQueueRef.current.length >= CHUNK_SIZE) {
          const chunk = audioBufferQueueRef.current.slice(0, CHUNK_SIZE);
          audioBufferQueueRef.current = audioBufferQueueRef.current.slice(CHUNK_SIZE);

          // Send to server via WebSocket
          socketRef.current?.emit('audio-chunk', chunk.buffer);
        }
      };

      // Connect audio source to processor
      const source = audioContextRef.current.createMediaStreamSource(
        localStreamRef.current
      );
      source.connect(audioWorkletNodeRef.current);
      audioWorkletNodeRef.current.connect(audioContextRef.current.destination);

      // Start transcription
      socketRef.current?.emit("start-transcription");

      console.log("Audio processor setup completed");
    } catch (error) {
      console.error("Error setting up audio processor:", error);
    }
  };

  // Handle incoming transcriptions
  useEffect(() => {
    if (socketRef.current) {
      socketRef.current.on("transcription", (transcript: string) => {
        console.log("Received transcription:", transcript);

        // Update current live transcript
        setCurrentTranscript(transcript);

        // Add to transcript history if it's a complete sentence
        if (transcript.trim().endsWith('.') || 
            transcript.trim().endsWith('?') || 
            transcript.trim().endsWith('!')) {
          setTranscripts(prev => [...prev, {
            id: Date.now(),
            text: transcript,
            timestamp: new Date(),
            speaker: "Speaker" // Could be enhanced to identify speakers
          }]);
          setCurrentTranscript(""); // Clear current transcript
        }
      });
    }
  }, []);

  function mergeBuffers(lhs: Int16Array, rhs: Int16Array) {
    const merged = new Int16Array(lhs.length + rhs.length);
    merged.set(lhs, 0);
    merged.set(rhs, lhs.length);
    return merged;
  }
}

Data Flow

  1. Audio Capture: User's microphone audio is captured via getUserMedia()
  2. Audio Processing: Raw audio is processed through Web Audio API worklet
  3. Format Conversion: Float32 audio is converted to Int16 format at 16kHz sample rate
  4. Chunking: Audio is buffered and sent in chunks via WebSocket
  5. Server Processing: Node.js server receives audio chunks and forwards to AssemblyAI
  6. Transcription: AssemblyAI processes audio and returns transcripts
  7. Broadcasting: Transcripts are broadcast to all participants in the session
  8. UI Update: Frontend displays live and completed transcripts

Key Features

Real-time Transcription

  • Live Updates: Transcripts appear as users speak
  • Turn-based: Uses AssemblyAI's formatTurns: true for better sentence structure
  • Low Latency: Optimized audio processing for minimal delay

Multi-user Support

  • Isolated Sessions: Each user gets their own AssemblyAI transcriber instance
  • Concurrent Processing: Multiple users can speak simultaneously
  • Session Management: Proper cleanup when users disconnect

Audio Optimization

  • 16kHz Sample Rate: Optimized for speech recognition
  • Chunk-based Processing: Efficient real-time streaming
  • Buffer Management: Prevents audio loss during processing

Configuration

Environment Variables

ASSEMBLYAI_API_KEY=your_assemblyai_api_key_here
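
The config class reads this key from process.env, so it has to be loaded before the AssemblyAI client is constructed. Assuming the project uses dotenv (my assumption, not confirmed above), that's a one-liner at the server entry point:

// Load .env before anything constructs AssemblyAIConfig (assumes dotenv is installed)
import 'dotenv/config';

if (!process.env.ASSEMBLYAI_API_KEY) {
    throw new Error('ASSEMBLYAI_API_KEY is not set');
}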

AssemblyAI Settings

this.transcriber = this.client.streaming.transcriber({
    sampleRate: 16_000,     // 16kHz for optimal speech recognition
    formatTurns: true       // Better sentence formatting
});

Error Handling

The integration includes comprehensive error handling:

  • Connection Management: Prevents duplicate connections
  • Graceful Cleanup: Proper resource disposal on disconnect
  • Error Recovery: Automatic reconnection attempts (a rough sketch follows this list)
  • State Tracking: Connection status monitoring
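
The reconnection logic itself isn't in the snippets above; one simple shape it could take, reusing the state flags from AssemblyAIConfig (the function name, retry count, and backoff below are placeholders of mine), is:

// Sketch: retry the streaming connection a few times with a short backoff
async function runWithRetry(assemblyai, maxRetries = 3) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
        await assemblyai.run();
        if (assemblyai.isConnected) return true;

        console.warn(`Transcriber connection attempt ${attempt} failed, retrying...`);
        await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
    return false;
}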

Usage in Video Calls

  1. Start Call: User joins video session
  2. Enable Transcription: Audio processor automatically starts
  3. Live Transcripts: Real-time transcripts appear in the UI
  4. Session History: Completed transcripts are stored during the session
  5. End Call: Resources are cleaned up when the call ends (see the sketch after this list)
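
The cleanup in step 5 isn't shown in the component excerpt above; assuming the same refs, a minimal end-of-call handler (the name endCallCleanup is mine) could look like this:

// Sketch: tear down the audio pipeline when the call ends. Disconnecting the
// socket triggers the server's "disconnect" handler, which closes that user's
// transcriber (see the WebSocket manager above).
const endCallCleanup = () => {
  audioWorkletNodeRef.current?.disconnect();
  audioWorkletNodeRef.current = null;

  audioContextRef.current?.close();
  audioContextRef.current = null;

  // Stop camera and microphone tracks
  localStreamRef.current?.getTracks().forEach((track) => track.stop());

  socketRef.current?.disconnect();
};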

This integration provides a seamless real-time transcription experience that enhances accessibility and documentation for support sessions.

🔐 Tech Stack

  • Frontend: React + TailwindCSS
  • Video Calls: Socket.io and Simple Peer JS
  • Voice Streaming: AssemblyAI + mic stream
  • Backend: Node.js + WebSocket + Mongoose
  • AI/NLP: AssemblyAI + Gemini
