This is a submission for the AssemblyAI Voice Agents Challenge.
What I Built
I built GetAutoCue, a professional, browser-based teleprompter designed to eliminate the most common frustration for presenters: unnatural pacing. It uses AssemblyAI's Universal-Streaming model to listen to your voice in real-time, automatically scrolling the script to match your natural speaking speed.
I'm submitting this project under the Real-Time Performance prompt: it delivers a highly responsive, low-latency voice experience that makes presenting feel seamless and intuitive. Whether you're recording a video, giving a speech, or practicing a presentation, GetAutoCue ensures the script is always exactly where you need it to be.
Demo
- Live App: https://getautocue.vercel.app
- Video Walkthrough:
Key Features in Action
1. Flawless Voice-Activated Scrolling
The app actively listens, highlighting the current word and dimming spoken words, while scrolling the viewport to keep you perfectly on track.
2. Seamless Script Editing
Paste your script and make edits without ever leaving the teleprompter view.
3. Professional Display Controls
Full control over font size, colors, and horizontal/vertical mirroring for professional beam-splitter glass setups.
GitHub Repository
Technical Implementation & AssemblyAI Integration
The core of GetAutoCue is the `useVoiceMode` hook, which seamlessly integrates AssemblyAI's real-time streaming capabilities into the React/Next.js front-end.
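For context, here's roughly how a component might consume the hook. The return shape is an assumption based on the handlers that appear in the snippets below (`isConnected`, `startListening`, `stopListening`), not the hook's exact public API:

```tsx
// Illustrative only: the hook's real signature isn't shown in this post,
// so the import path and return shape here are assumptions.
import { useVoiceMode } from '@/hooks/useVoiceMode'; // path is hypothetical

function TeleprompterControls({ script }: { script: string }) {
  const { isConnected, startListening, stopListening } = useVoiceMode(script);

  return (
    <button onClick={isConnected ? stopListening : startListening}>
      {isConnected ? 'Stop Listening' : 'Start Listening'}
    </button>
  );
}
```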
1. Establishing the Real-Time Connection
When the user clicks "Start Listening" in the app's voice mode, the client first fetches a temporary authentication token from a Next.js API route, which calls AssemblyAI's token endpoint server-side (keeping the secret API key off the client). The short-lived token is then used to establish a secure WebSocket connection to AssemblyAI.
```javascript
// Now proceed with the actual connection setup
const response = await fetch('/api/assemblyai/token');
const data = await response.json();
if (!data.token) throw new Error('Failed to get AssemblyAI token');

const wsUrl = `wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&token=${data.token}`;
socketRef.current = new WebSocket(wsUrl);

socketRef.current.onopen = () => {
  setIsConnected(true);
};
```
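The API route itself isn't shown in the hook, but a minimal sketch looks something like this. The exact streaming token endpoint URL and `expires_in_seconds` parameter are assumptions based on AssemblyAI's v3 streaming docs, so double-check them before reusing this:

```typescript
// app/api/assemblyai/token/route.ts — minimal sketch of the token-minting route.
// The token endpoint and its query parameter are assumptions; verify against
// AssemblyAI's current documentation.
import { NextResponse } from 'next/server';

export async function GET() {
  const res = await fetch(
    'https://streaming.assemblyai.com/v3/token?expires_in_seconds=60',
    { headers: { Authorization: process.env.ASSEMBLYAI_API_KEY ?? '' } }
  );

  if (!res.ok) {
    return NextResponse.json({ error: 'Failed to mint token' }, { status: 500 });
  }

  const { token } = await res.json();
  // The client only ever sees this short-lived token, never the API key.
  return NextResponse.json({ token });
}
```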
2. Streaming Audio from the Browser
I used the `navigator.mediaDevices.getUserMedia` API to capture microphone input at a 16 kHz sample rate to match AssemblyAI's requirements. An `AudioContext` with a `ScriptProcessorNode` then converts each buffer of raw Float32 audio into 16-bit PCM and sends it through the WebSocket. This ensures a continuous, low-latency flow of data.
```javascript
// Set up audio processing for PCM data
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 16000,
    channelCount: 1,
    echoCancellation: true,
    noiseSuppression: true
  }
});

// Create audio context for processing
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (event) => {
  if (socketRef.current?.readyState === WebSocket.OPEN) {
    const inputData = event.inputBuffer.getChannelData(0);

    // Convert float32 audio data to int16 PCM
    const pcmData = new Int16Array(inputData.length);
    for (let i = 0; i < inputData.length; i++) {
      // Clamp the value to [-1, 1] and convert to 16-bit integer
      const sample = Math.max(-1, Math.min(1, inputData[i]));
      pcmData[i] = sample * 0x7FFF;
    }

    socketRef.current.send(pcmData.buffer);
  }
};

// Wire up the processing graph so onaudioprocess starts firing
source.connect(processor);
processor.connect(audioContext.destination);
```
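At 16 kHz, each 4096-sample buffer works out to roughly 256 ms of audio per WebSocket message (4096 / 16000 ≈ 0.256 s), which keeps the transcript feeling immediate without flooding the connection with tiny frames.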
3. Processing Transcripts and Syncing the UI
The magic happens when the WebSocket sends back transcript data. The `onmessage` handler listens for both partial and final transcripts returned from AssemblyAI's Universal-Streaming model.
To achieve the dynamic highlighting and scrolling, I implemented a fuzzy matching algorithm using `fuse.js` (a simplified sketch follows the handler code below). As transcripts arrive, the app:
- Identifies the last spoken word in the transcript.
- Finds its corresponding position in the full script.
- Updates the UI state to highlight the next word as "current" and dim all previous words as "spoken."
- Calculates the progress through the script and smoothly scrolls the container to keep the current word in view.
This approach creates a robust and natural-feeling experience, as the teleprompter doesn't jump but flows with the speaker.
```javascript
socketRef.current.onmessage = (event) => {
  const message = JSON.parse(event.data);

  // Handle different message types from AssemblyAI v3 API
  if (message.message_type === 'PartialTranscript') {
    handlePartialTranscript(message.text);
  } else if (message.message_type === 'FinalTranscript') {
    handleFinalTranscript(message.text);
  } else if (message.transcript) {
    // Handle Turn events with transcript data
    if (message.end_of_turn) {
      handleFinalTranscript(message.transcript);
    } else {
      handlePartialTranscript(message.transcript);
    }
  }
};

socketRef.current.onerror = (error) => {
  stopListening();
};

socketRef.current.onclose = (event) => {
  setIsConnected(false);
};
```
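The matching and scrolling helpers aren't shown above, so here's a simplified sketch of what a handler like `handleFinalTranscript` could do with fuse.js. The Fuse options, helper names, and word-element bookkeeping are illustrative assumptions, not the exact implementation:

```typescript
import Fuse from 'fuse.js';

// Hypothetical state for illustration: the script split into words, one
// rendered DOM node per word, and a setter that marks words spoken/current.
declare const scriptWords: string[];
declare const wordElements: HTMLElement[];
declare function setSpokenUpTo(index: number): void;

// Index the script once so each incoming transcript can be matched cheaply.
const fuse = new Fuse(
  scriptWords.map((word, index) => ({ word, index })),
  { keys: ['word'], threshold: 0.4 }
);

function handleFinalTranscript(transcript: string) {
  const spoken = transcript.trim().split(/\s+/).filter(Boolean);
  if (spoken.length === 0) return;

  // 1. Take the last spoken word and fuzzy-find its position in the script.
  const lastWord = spoken[spoken.length - 1];
  const match = fuse.search(lastWord)[0];
  if (!match) return;
  const currentIndex = match.item.index;

  // 2. Dim everything up to it as "spoken" and highlight the next word.
  setSpokenUpTo(currentIndex);

  // 3. Smoothly scroll so the current word stays in view (no jumping).
  wordElements[currentIndex]?.scrollIntoView({ behavior: 'smooth', block: 'center' });
}
```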
By leveraging AssemblyAI's Universal-Streaming model, I was able to build a teleprompter that feels less like a rigid machine you have to keep pace with and more like a natural, self-pacing reading companion.