Adeniji Olajide

Supportly – Real-Time Voice & Video Agent for Customer Support

AssemblyAI Voice Agents Challenge: Business Automation

This is a submission for the AssemblyAI Voice Agents Challenge

What I Built

Supportly is a plug-and-play real-time voice & video support module that developers can integrate into any web application. It falls under the following challenge categories:

Business Automation – The voice agent records interactions between support agents and customers and saves them to a database. After each session, it generates a summary of the conversation, which is automatically emailed to the customer (a rough sketch of this flow appears just below).
Real-Time Performance – Provides live transcription during support calls.

The project empowers support teams to offer on-demand human assistance while using AssemblyAI's streaming API to transcribe conversations live.
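
I won't paste the whole summary-and-email pipeline here, but the shape of it is roughly this. This is a sketch only: the helper name, the Gemini model id, and the SMTP settings are illustrative placeholders, not the exact Supportly code.

import { GoogleGenerativeAI } from "@google/generative-ai";
import nodemailer from "nodemailer";

// Hypothetical helper: summarize a finished session with Gemini and email the recap.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const transporter = nodemailer.createTransport({
    host: "smtp.example.com", // placeholder SMTP settings
    auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
});

async function sendSessionSummary(customerEmail, transcriptText) {
    // Ask Gemini for a short recap of the saved transcript
    const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
    const result = await model.generateContent(
        `Summarize this support conversation for the customer:\n\n${transcriptText}`
    );

    // Email the recap to the customer
    await transporter.sendMail({
        from: "support@example.com",
        to: customerEmail,
        subject: "Your Supportly session summary",
        text: result.response.text(),
    });
}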

Demo

https://supportly-zzsu.onrender.com




GitHub Repository

https://github.com/GoldenThrust/Supportly

Supportly - Video Support Call Scheduling Platform

A modern video call customer support application built with React Router v7, TypeScript, and Tailwind CSS. This platform allows customers to easily schedule video calls with support teams to resolve issues and get product assistance.

🚀 Features

Customer Features

  • Easy Session Booking: Schedule video support sessions with a simple form
  • Real-time Video Calls: High-quality video calls with screen sharing capabilities
  • Session Management: View upcoming and completed sessions
  • Profile Management: Update personal information and preferences
  • Session History: Track all past sessions with ratings and feedback

Admin/Support Team Features

  • Admin Dashboard: Comprehensive overview of all support sessions
  • Team Management: Manage support team members and their availability
  • Schedule Management: Set available time slots and manage bookings
  • Session Analytics: Track performance metrics and customer satisfaction

Technical Features

  • 🎥 Video Call Integration: Browser-based video calls (no additional software…

Technical Implementation & AssemblyAI Integration

The Supportly application uses AssemblyAI's streaming transcription service to provide real-time speech-to-text functionality during video support sessions. The integration involves:

  1. Audio Processing: Capturing audio from the user's microphone using the Web Audio API
  2. Real-time Streaming: Sending audio chunks to AssemblyAI via WebSocket
  3. Live Transcription: Receiving and displaying transcripts in real-time
  4. Multi-user Support: Managing separate transcription sessions for each user

Architecture Components

1. AssemblyAI Configuration (config/assembyai.js)

The main configuration class that handles the AssemblyAI streaming connection:

class AssemblyAIConfig {
    constructor() {
        try {
            this.client = new AssemblyAI({
                apiKey: process.env.ASSEMBLYAI_API_KEY,
            });
            this.transcriber = null;
            this.isConnected = false;
            this.isConnecting = false;
        } catch (error) {
            console.error(error);
        }
    }

    async run() {
        try {
            // Prevent multiple concurrent connection attempts
            if (this.isConnecting || this.isConnected) {
                console.log('Connection already in progress or established...');
                return;
            }

            this.isConnecting = true;

            this.transcriber = this.client.streaming.transcriber({
                sampleRate: 16_000,
                formatTurns: true
            });

            // Set up event handlers
            this.transcriber.on("open", ({ id }) => {
                console.log(`Session opened with ID: ${id}`);
                this.isConnected = true;
                this.isConnecting = false;
            });

            this.transcriber.on("error", (error) => {
                console.error("Transcriber error:", error);
                this.isConnected = false;
                this.isConnecting = false;
            });

            await this.transcriber.connect();
            console.log("Starting streaming...");
        } catch (error) {
            console.error('Error in run():', error);
            this.isConnected = false;
            this.isConnecting = false;
        }
    }

    // Register a callback that receives the transcript text from each turn event
    transcribe(callBack) {
        this.transcriber.on("turn", (turn) => {
            if (!turn.transcript) {
                return;
            }
            callBack(turn.transcript);
        });
    }
}
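
The WebSocket manager in the next section calls a safeClose() method on this class that isn't shown above. A minimal sketch of it, assuming the streaming transcriber's close() method, could look like this:

    // Sketch of the cleanup method used by the WebSocket manager below
    async safeClose() {
        try {
            if (this.transcriber && this.isConnected) {
                await this.transcriber.close();
            }
        } catch (error) {
            console.error('Error closing transcriber:', error);
        } finally {
            this.transcriber = null;
            this.isConnected = false;
            this.isConnecting = false;
        }
    }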

2. WebSocket Manager (config/websocket.js)

Manages the connection between clients and handles AssemblyAI instances for each user:

class WebSocketManager {
    constructor() {
        this.io = null;
        this.userTranscribers = new Map(); // Store AssemblyAI instance per user
    }

    async connect(io) {
        io.on("connection", async (socket) => {
            // Create a new AssemblyAI instance for this user
            const assemblyai = new AssemblyAIConfigClass();
            this.userTranscribers.set(socket.id, assemblyai);

            socket.on("start-transcription", async () => {
                console.log(`Starting transcription for ${socket.user.email}`);
                const assemblyai = this.userTranscribers.get(socket.id);
                if (assemblyai) {
                    // Check if already running to prevent duplicate starts
                    if (assemblyai.isConnected || assemblyai.isConnecting) {
                        console.log('Transcription already running or starting...');
                        return;
                    }

                    try {
                        await assemblyai.run();
                        assemblyai.transcribe((transcript) => {
                            console.log(`Transcription for ${socket.user.email}:`, transcript);
                            // Emit transcription to all users in the session
                            // (sessionId is set when the user joins the session room; that handler is omitted here)
                            socket.to(sessionId).emit("transcription", transcript);
                        });
                        console.log('Transcription started successfully');
                    } catch (error) {
                        console.error('Error starting transcription:', error);
                    }
                }
            });

            socket.on('audio-chunk', async (audioBlob) => {
                const assemblyai = this.userTranscribers.get(socket.id);
                if (assemblyai) {
                    try {
                        assemblyai.transcriber.sendAudio(Buffer.from(audioBlob));
                    } catch (error) {
                        console.error('Error processing audio chunk:', error);
                    }
                }
            });

            socket.on("disconnect", async () => {
                // Clean up transcription when user disconnects
                const assemblyai = this.userTranscribers.get(socket.id);
                if (assemblyai) {
                    await assemblyai.safeClose();
                    this.userTranscribers.delete(socket.id);
                }
            });
        });
    }
}
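
For context, the manager is attached to the HTTP server with the standard Socket.IO setup; something along these lines (the Express/server bootstrap below is a sketch, not the exact Supportly entry point):

import { createServer } from "http";
import { Server } from "socket.io";
import express from "express";

const app = express();
const httpServer = createServer(app);
const io = new Server(httpServer, { cors: { origin: "*" } });

// Register the connection handlers from WebSocketManager
const websocketManager = new WebSocketManager();
websocketManager.connect(io);

httpServer.listen(3000, () => console.log("Supportly server listening on port 3000"));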

3. Audio Processing (public/audio-processor.js)

Web Audio API worklet for processing audio in real-time:

const MAX_16BIT_INT = 32767

class AudioProcessor extends AudioWorkletProcessor {
  process(inputs) {
    try {
      const input = inputs[0]
      if (!input) throw new Error('No input')

      const channelData = input[0]
      if (!channelData) throw new Error('No channelData')

      // Convert Float32 audio data to Int16 for AssemblyAI
      const float32Array = Float32Array.from(channelData)
      const int16Array = Int16Array.from(
        float32Array.map((n) => n * MAX_16BIT_INT)
      )
      const buffer = int16Array.buffer

      // Send processed audio to main thread
      this.port.postMessage({ audio_data: buffer })

      // Returning true keeps the processor alive for subsequent audio blocks
      return true
    } catch (error) {
      console.error(error)
      // Returning false allows the browser to shut this processor down
      return false
    }
  }
}

registerProcessor('audio-processor', AudioProcessor)

4. Frontend Integration (app/routes/video-call.$sessionId.tsx)

The React component that handles the UI and audio processing:

export default function VideoCall() {
  // localStreamRef, audioContextRef, and socketRef are declared elsewhere in this component (omitted from the excerpt)
  const audioWorkletNodeRef = useRef<AudioWorkletNode | null>(null);
  const audioBufferQueueRef = useRef<Int16Array>(new Int16Array(0));
  const [transcripts, setTranscripts] = useState<Array<{
    id: number;
    text: string;
    timestamp: Date;
    speaker: string;
  }>>([]);
  const [currentTranscript, setCurrentTranscript] = useState("");

  // Setup audio processor for real-time transcription
  const setupAudioProcessor = async () => {
    try {
      if (!localStreamRef.current) return;

      // Create audio context with 16kHz sample rate (required by AssemblyAI)
      audioContextRef.current = new AudioContext({
        sampleRate: 16000,
        latencyHint: "balanced",
      });

      // Load audio processor worklet
      await audioContextRef.current.audioWorklet.addModule(
        "/audio-processor.js"
      );

      // Create audio worklet node
      audioWorkletNodeRef.current = new AudioWorkletNode(
        audioContextRef.current,
        "audio-processor"
      );

      // Handle processed audio data
      audioWorkletNodeRef.current.port.onmessage = (event) => {
        const { audio_data } = event.data;

        // Merge with previous buffer
        const newBuffer = new Int16Array(audio_data);
        audioBufferQueueRef.current = mergeBuffers(
          audioBufferQueueRef.current, 
          newBuffer
        );

        // Send audio chunks when buffer reaches sufficient size
        const CHUNK_SIZE = 1600; // 100ms at 16kHz
        while (audioBufferQueueRef.current.length >= CHUNK_SIZE) {
          const chunk = audioBufferQueueRef.current.slice(0, CHUNK_SIZE);
          audioBufferQueueRef.current = audioBufferQueueRef.current.slice(CHUNK_SIZE);

          // Send to server via WebSocket
          socketRef.current?.emit('audio-chunk', chunk.buffer);
        }
      };

      // Connect audio source to processor
      const source = audioContextRef.current.createMediaStreamSource(
        localStreamRef.current
      );
      source.connect(audioWorkletNodeRef.current);
      audioWorkletNodeRef.current.connect(audioContextRef.current.destination);

      // Start transcription
      socketRef.current?.emit("start-transcription");

      console.log("Audio processor setup completed");
    } catch (error) {
      console.error("Error setting up audio processor:", error);
    }
  };

  // Handle incoming transcriptions
  useEffect(() => {
    if (socketRef.current) {
      socketRef.current.on("transcription", (transcript: string) => {
        console.log("Received transcription:", transcript);

        // Update current live transcript
        setCurrentTranscript(transcript);

        // Add to transcript history if it's a complete sentence
        if (transcript.trim().endsWith('.') || 
            transcript.trim().endsWith('?') || 
            transcript.trim().endsWith('!')) {
          setTranscripts(prev => [...prev, {
            id: Date.now(),
            text: transcript,
            timestamp: new Date(),
            speaker: "Speaker" // Could be enhanced to identify speakers
          }]);
          setCurrentTranscript(""); // Clear current transcript
        }
      });
    }
  }, []);

  function mergeBuffers(lhs: Int16Array, rhs: Int16Array) {
    const merged = new Int16Array(lhs.length + rhs.length);
    merged.set(lhs, 0);
    merged.set(rhs, lhs.length);
    return merged;
  }
}

Data Flow

  1. Audio Capture: User's microphone audio is captured via getUserMedia()
  2. Audio Processing: Raw audio is processed through Web Audio API worklet
  3. Format Conversion: Float32 audio is converted to Int16 format at 16kHz sample rate
  4. Chunking: Audio is buffered and sent in chunks via WebSocket
  5. Server Processing: Node.js server receives audio chunks and forwards to AssemblyAI
  6. Transcription: AssemblyAI processes audio and returns transcripts
  7. Broadcasting: Transcripts are broadcast to all participants in the session
  8. UI Update: Frontend displays live and completed transcripts

Key Features

Real-time Transcription

  • Live Updates: Transcripts appear as users speak
  • Turn-based: Uses AssemblyAI's formatTurns: true for better sentence structure
  • Low Latency: Optimized audio processing for minimal delay

Multi-user Support

  • Isolated Sessions: Each user gets their own AssemblyAI transcriber instance
  • Concurrent Processing: Multiple users can speak simultaneously
  • Session Management: Proper cleanup when users disconnect

Audio Optimization

  • 16kHz Sample Rate: Optimized for speech recognition
  • Chunk-based Processing: Efficient real-time streaming
  • Buffer Management: Prevents audio loss during processing

Configuration

Environment Variables

ASSEMBLYAI_API_KEY=your_assemblyai_api_key_here
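
The config class reads this key from process.env, so it has to be loaded before the AssemblyAI client is constructed. Assuming the project uses dotenv (my assumption, not confirmed above), that's a one-liner at the server entry point:

// Load .env before anything constructs AssemblyAIConfig (assumes dotenv is installed)
import 'dotenv/config';

if (!process.env.ASSEMBLYAI_API_KEY) {
    throw new Error('ASSEMBLYAI_API_KEY is not set');
}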

AssemblyAI Settings

this.transcriber = this.client.streaming.transcriber({
    sampleRate: 16_000,     // 16kHz for optimal speech recognition
    formatTurns: true       // Better sentence formatting
});

Error Handling

The integration includes comprehensive error handling:

  • Connection Management: Prevents duplicate connections
  • Graceful Cleanup: Proper resource disposal on disconnect
  • Error Recovery: Automatic reconnection attempts (a rough sketch follows this list)
  • State Tracking: Connection status monitoring
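
The reconnection logic itself isn't in the snippets above; one simple shape it could take, reusing the state flags from AssemblyAIConfig (the function name, retry count, and backoff below are placeholders of mine), is:

// Sketch: retry the streaming connection a few times with a short backoff
async function runWithRetry(assemblyai, maxRetries = 3) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
        await assemblyai.run();
        if (assemblyai.isConnected) return true;

        console.warn(`Transcriber connection attempt ${attempt} failed, retrying...`);
        await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
    return false;
}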

Usage in Video Calls

  1. Start Call: User joins video session
  2. Enable Transcription: Audio processor automatically starts
  3. Live Transcripts: Real-time transcripts appear in the UI
  4. Session History: Completed transcripts are stored during the session
  5. End Call: Resources are cleaned up when the call ends (see the sketch after this list)
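
The cleanup in step 5 isn't shown in the component excerpt above; assuming the same refs, a minimal end-of-call handler (the name endCallCleanup is mine) could look like this:

// Sketch: tear down the audio pipeline when the call ends. Disconnecting the
// socket triggers the server's "disconnect" handler, which closes that user's
// transcriber (see the WebSocket manager above).
const endCallCleanup = () => {
  audioWorkletNodeRef.current?.disconnect();
  audioWorkletNodeRef.current = null;

  audioContextRef.current?.close();
  audioContextRef.current = null;

  // Stop camera and microphone tracks
  localStreamRef.current?.getTracks().forEach((track) => track.stop());

  socketRef.current?.disconnect();
};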

This integration provides a seamless real-time transcription experience that enhances accessibility and documentation for support sessions.

🔐 Tech Stack

  • Frontend: React + TailwindCSS
  • Video Calls: Socket.io and Simple Peer JS
  • Voice Streaming: AssemblyAI + mic stream
  • Backend: Node.js + WebSocket + Mongoose
  • AI/NLP: AssemblyAI + Gemini
