This is a submission for the AssemblyAI Voice Agents Challenge
What I Built
Veew is a real-time video communication platform that connects users through video calls and enhances the experience with live captioning, automatic minutes generation, and speaker diarization at sub-300ms latency. The project prioritizes a fast, responsive voice experience: captions for every spoken word are delivered to all participants in real time, offering an inclusive solution for individuals with auditory impairments and enabling them to participate fully in video calls.
Demo
Live Site
GitHub Repository
Veew - Simplifying Communication
Veew is a video communication platform that uses AssemblyAI's Universal-Streaming API to automatically generate live video captions with speaker diarization.
Features
- Create room: Allows users to start a new video channel.
- Join room: Lets users join an already created room to connect with other participants.
- Live Captioning: Lets users enable live captions during a video call (see the sketch below for how this might be wired up).
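As a quick illustration of the captioning feature, here is a minimal sketch of how the live-caption toggle could sit on the call screen. It assumes the transcription logic shown in the next section is exposed as a custom hook (called useLiveCaptions here purely for illustration) that returns startTranscription, stopTranscription, isListening, partialTranscript, and transcripts; the hook name, module path, and return shape are my assumptions, not code from the repository.

// Hypothetical wiring of the caption toggle; hook name and return shape are assumed.
import React from 'react';
import { useLiveCaptions } from './useLiveCaptions'; // assumed module path

export function CaptionToggle() {
  // Assumed return shape, based on the state used in the snippet further down
  const { startTranscription, stopTranscription, isListening, partialTranscript, transcripts } =
    useLiveCaptions();

  return (
    <div>
      <button onClick={isListening ? stopTranscription : startTranscription}>
        {isListening ? 'Disable captions' : 'Enable captions'}
      </button>

      {/* Finalized captions, keyed by transcript id */}
      {Object.values(transcripts).map((t) => (
        <p key={t.id}>
          <strong>{t.speaker}:</strong> {t.text}
        </p>
      ))}

      {/* Live partial caption while a sentence is still being spoken */}
      {partialTranscript && <p className="partial">{partialTranscript.text}</p>}
    </div>
  );
}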
Technical Implementation & AssemblyAI Integration
AssemblyAI's Universal Streaming played a pivotal role in turning the vision for this application into reality. By providing real-time, speaker-diarized transcription capabilities, it enabled the seamless generation of live video captions with high accuracy. This technology also made it possible to automatically produce well-structured meeting minutes, enhancing both accessibility and post-call productivity.
Below is a snippet of how I integrated AssemblyAI into the application to generate the live captions as well as the meeting minutes:
const startTranscription = useCallback(async () => {
  try {
    // Reset any previous error and set connection status
    setError(null);
    setConnectionStatus('connecting');

    // Fetch authentication token
    const token = await getToken();
    if (!token) return;

    // Create WebSocket connection with AssemblyAI using token and transcription parameters
    const wsUrl = `wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speaker_diarization=true&formatted_finals=true&token=${token}`;
    socket.current = new WebSocket(wsUrl);

    // When WebSocket connection is successfully opened
    socket.current.onopen = async () => {
      console.log('🔰🔰🔰AssemblyAI WebSocket connected');
      setIsConnected(true);
      setConnectionStatus('connected');
      setIsListening(true);

      // Access user's microphone
      mediaStream.current = await navigator.mediaDevices.getUserMedia({ audio: true });

      // Create audio context with sample rate matching AssemblyAI
      audioContext.current = new AudioContext({ sampleRate: 16000 });

      // Create a media stream source and script processor node
      const source = audioContext.current.createMediaStreamSource(mediaStream.current);
      scriptProcessor.current = audioContext.current.createScriptProcessor(4096, 1, 1);

      // Connect the audio nodes
      source.connect(scriptProcessor.current);
      scriptProcessor.current.connect(audioContext.current.destination);

      // Process and send audio data on each audio processing event
      scriptProcessor.current.onaudioprocess = (event) => {
        if (!socket.current || socket.current.readyState !== WebSocket.OPEN) return;

        const input = event.inputBuffer.getChannelData(0);
        const buffer = new ArrayBuffer(input.length * 2);
        const view = new DataView(buffer);

        // Convert audio float samples to 16-bit PCM
        for (let i = 0; i < input.length; i++) {
          const s = Math.max(-1, Math.min(1, input[i]));
          view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
        }

        // Send the audio buffer to the WebSocket
        socket.current.send(buffer);
      };
    };

    // Handle incoming messages from AssemblyAI WebSocket
    socket.current.onmessage = (event) => {
      console.log("⬅️⬅️⬅️ AssemblyAI says:", event.data);
      try {
        const message = JSON.parse(event.data);
        console.log("🟢🟢🟢Parsed message:", message);

        // Handle live partial transcript (for real-time display only)
        if (message.type === 'PartialTranscript') {
          const { text, speaker, created } = message;
          const timestamp = new Date(created || Date.now()).toLocaleTimeString();
          setPartialTranscript({
            text: text || '',
            speaker: speaker || 'Unknown',
            timestamp,
            type: 'partial'
          });
          return;
        }

        // Handle final transcript (Turn or FinalTranscript)
        if (message.type === 'Turn' || message.message_type === 'FinalTranscript') {
          const transcriptText = message.transcript || message.text || '';
          const speakerLabel = `${currentSpeakerRef.current}`;
          const timestamp = new Date(message.created || Date.now()).toLocaleTimeString();
          const transcriptId = message.id || Date.now().toString();

          const finalTranscript: Transcript = {
            text: transcriptText,
            speaker: speakerLabel,
            timestamp,
            id: transcriptId,
            type: 'final'
          };
          // Save the final transcript, keeping previously finalized entries
          setTranscripts(prev => ({
            ...prev,
            [transcriptId]: finalTranscript
          }));
          // Clear the partial transcript display
          setPartialTranscript(null);

          // Update speaker statistics
          setSpeakers(prev => ({
            ...prev,
            [speakerLabel]: {
              name: speakerLabel,
              lastSeen: timestamp,
              totalMessages: (prev[speakerLabel]?.totalMessages || 0) + 1
            }
          }));

          // Add final transcript to minutes buffer if session is active
          if (minutesInSessionRef.current) {
            setMinutesBuffer(prev => [...prev, finalTranscript]);
          }
        }
      } catch (e) {
        console.error('Error parsing message:', e);
      }
    };

    // Handle WebSocket errors
    socket.current.onerror = (e) => {
      console.error('WebSocket error:', e);
      setError('WebSocket error');
      stopTranscription(); // Gracefully stop transcription on error
    };

    // Handle WebSocket close
    socket.current.onclose = () => {
      console.log('WebSocket closed');
      setIsConnected(false);
      setConnectionStatus('disconnected');
    };
  } catch (err) {
    console.error('startTranscription error:', err);
    setError('Failed to start transcription');
  }
}, []);
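One piece the snippet doesn't show is getToken(). Its job is to fetch a short-lived streaming token from the server so the permanent AssemblyAI API key never reaches the browser. Below is a minimal sketch of how that could look, assuming a Next.js route handler, an ASSEMBLYAI_API_KEY environment variable, and AssemblyAI's temporary-token endpoint for Universal-Streaming; the route name, query parameter, and response shape are assumptions on my part, so check the linked repository and the AssemblyAI docs for the actual implementation.

// app/api/assemblyai-token/route.ts — hypothetical route name.
// Mints a short-lived token so the permanent API key stays server-side.
export async function GET() {
  // Assumption: AssemblyAI's v3 temporary-token endpoint and its expiry parameter;
  // verify against the current Universal-Streaming documentation.
  const res = await fetch('https://streaming.assemblyai.com/v3/token?expires_in_seconds=60', {
    headers: { Authorization: process.env.ASSEMBLYAI_API_KEY as string },
  });

  if (!res.ok) {
    return Response.json({ error: 'Failed to create streaming token' }, { status: 500 });
  }

  const { token } = await res.json();
  return Response.json({ token });
}

// Client-side helper matching the getToken() call in the hook (assumed shape),
// which would live alongside the transcription code rather than in the route file.
async function getToken(): Promise<string | null> {
  const res = await fetch('/api/assemblyai-token'); // hypothetical route
  if (!res.ok) return null;
  const { token } = await res.json();
  return token;
}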
The complete code for this project can be found in the linked repository.
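For completeness, here is a sketch of the other piece mentioned above but not shown: turning the minutesBuffer of finalized transcripts into meeting minutes. One straightforward approach is to flatten the buffer into a speaker-labelled transcript and hand it to a server route for summarization. The /api/minutes endpoint and its response shape below are hypothetical, used only to illustrate the flow.

// Hypothetical client-side helper: turn the buffered final transcripts into meeting minutes.
// The /api/minutes route and its response shape are assumptions for illustration.
interface Transcript {
  id: string;
  speaker: string;
  text: string;
  timestamp: string;
  type: 'partial' | 'final';
}

async function generateMinutes(minutesBuffer: Transcript[]): Promise<string | null> {
  if (minutesBuffer.length === 0) return null;

  // Flatten the buffer into a speaker-labelled transcript, e.g. "Speaker 1 [10:02:15]: ..."
  const fullTranscript = minutesBuffer
    .map((t) => `${t.speaker} [${t.timestamp}]: ${t.text}`)
    .join('\n');

  // Hand the transcript to a server route that performs the summarization
  const res = await fetch('/api/minutes', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ transcript: fullTranscript }),
  });
  if (!res.ok) return null;

  const { minutes } = await res.json();
  return minutes; // formatted meeting minutes returned by the server
}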
This was an amazing challenge to participate in, and I'd like to thank AssemblyAI, as well as the DEV team, for putting it together.