Adarsh Kant

Building Real-Time Voice Forms with Google Gemini API: Architecture & Learnings

When you want to build voice-input forms that feel responsive and intuitive, the key challenge isn't transcription—modern APIs handle that well. It's latency. Transcription that takes 2 seconds to return feels broken. Transcription that streams back in real-time (200-400ms for first token) feels magical.

This post walks through the architecture we built at Anve Voice Forms to make real-time voice transcription feel fast and seamless in the browser.

The Challenge: Why Basic Transcription APIs Feel Slow

Most voice API approaches work like this:

  1. User speaks for N seconds
  2. Collect all audio
  3. Send entire audio file to API
  4. Wait for transcription response
  5. Display result

Round-trip latency: 2-5 seconds. That's dead time where the user is waiting and nothing is happening.

The better approach is streaming: send audio chunks as they arrive, start processing immediately, and stream back results in real-time.

The Architecture

Here's the high-level flow:

Browser (Frontend)
  Microphone API → WebAudio Processor → WebSocket Client
                                              │ Chunks
                                              ▼
Backend (Node.js/Python)
  WebSocket Server → Audio Processor → Gemini API (Streaming)
                          │
                          ▼
                    Transcript Builder → Browser updates UI

1. Browser-Side Audio Capture

// Capture audio from microphone
const audioContext = new (window.AudioContext || window.webkitAudioContext)();
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = audioContext.createMediaStreamSource(mediaStream);

const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (event) => {
  const audioData = event.inputBuffer.getChannelData(0);
  const pcmData = new Float32Array(audioData);
  const int16Data = float32ToInt16(pcmData);
  socket.emit('audio_chunk', int16Data);
};

source.connect(processor);
processor.connect(audioContext.destination);

function float32ToInt16(float32Array) {
  const int16Array = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp to [-1, 1] before scaling, or loud samples overflow the int16 range
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    int16Array[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16Array;
}

Key decisions:

  • 4096 sample chunk size: 93ms at 44.1kHz (good balance between latency and overhead)
  • Int16 encoding: most APIs expect 16-bit PCM audio
  • Send immediately: don't buffer, start streaming as chunks arrive
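The 93ms figure falls straight out of chunk size divided by sample rate. A tiny helper (hypothetical, not from our codebase) makes the trade-off easy to explore:

```javascript
// Latency contributed by buffering one chunk before it can be sent
function chunkLatencyMs(chunkSamples, sampleRate) {
  return (chunkSamples / sampleRate) * 1000;
}

// 4096 samples at 44.1kHz ≈ 92.9ms; halving the chunk halves this delay
// but doubles the per-chunk network and processing overhead
console.log(chunkLatencyMs(4096, 44100).toFixed(1)); // "92.9"
console.log(chunkLatencyMs(2048, 44100).toFixed(1)); // "46.4"
```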

2. Streaming to Gemini API

This is where real-time transcription happens:

const { GoogleGenerativeAI } = require("@google/generative-ai");
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

async function transcribeAudioStream(ws, audioChunks) {
  const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

  // inlineData must be base64-encoded; concatenate the buffered chunks
  const audioBase64 = Buffer.concat(audioChunks).toString("base64");

  const response = await model.generateContentStream({
    contents: [{
      role: "user",
      parts: [
        { inlineData: { mimeType: "audio/mp3", data: audioBase64 } },
        { text: "Transcribe this audio. Return ONLY the transcription." }
      ]
    }]
  });

  for await (const chunk of response.stream) {
    const text = chunk.text();
    if (text) {
      ws.send(JSON.stringify({
        type: 'partial_transcript',
        text: text,
        timestamp: Date.now()
      }));
    }
  }
}
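Before `transcribeAudioStream` runs, the server needs to accumulate enough incoming chunks to be worth a request. A minimal sketch of that buffering (a hypothetical helper, not part of any library — the flush size here assumes ~1s of 16-bit mono at 16kHz):

```javascript
// Accumulates raw PCM chunks and signals when enough audio has been
// buffered to hand off to the transcription call.
class ChunkBuffer {
  constructor(flushBytes) {
    this.flushBytes = flushBytes;
    this.chunks = [];
    this.size = 0;
  }

  push(chunk) {
    this.chunks.push(chunk);
    this.size += chunk.length;
    return this.size >= this.flushBytes; // caller flushes when true
  }

  flush() {
    const out = Buffer.concat(this.chunks);
    this.chunks = [];
    this.size = 0;
    return out;
  }
}

const buf = new ChunkBuffer(32000); // ~1s of 16kHz 16-bit mono
// In the ws message handler:
//   if (buf.push(chunk)) transcribeAudioStream(ws, [buf.flush()]);
```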

3. Handling Codec Mismatches

This was our biggest surprise issue. Browsers hand you raw PCM (Float32 at the device sample rate, typically 44.1kHz, which we convert to 16-bit mono). But APIs have different requirements — some want WAV, some MP3, some raw PCM at a specific sample rate.

const ffmpeg = require('fluent-ffmpeg');
const { Readable } = require('stream');

function convertAudioCodec(inputBuffer, outputFormat) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    // fluent-ffmpeg takes a readable stream as input; .pipe() with no
    // argument returns a PassThrough stream we can collect from
    const output = ffmpeg(Readable.from(inputBuffer))
      .format(outputFormat)
      .audioFrequency(16000)
      .audioChannels(1)
      .on('error', reject)
      .pipe();

    output.on('data', (chunk) => chunks.push(chunk));
    output.on('end', () => resolve(Buffer.concat(chunks)));
  });
}
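For raw PCM specifically, a full ffmpeg transcode can be overkill: prepending a standard 44-byte RIFF/WAVE header is enough for APIs that accept WAV. A hedged sketch, assuming 16-bit mono little-endian PCM (this helper is ours for illustration, not a library function):

```javascript
// Wrap raw 16-bit mono PCM in a minimal 44-byte WAV header
function pcmToWav(pcmBuffer, sampleRate) {
  const header = Buffer.alloc(44);
  const byteRate = sampleRate * 2;          // mono * 2 bytes per sample
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcmBuffer.length, 4);
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // audio format: PCM
  header.writeUInt16LE(1, 22);              // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(2, 32);              // block align
  header.writeUInt16LE(16, 34);             // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcmBuffer.length, 40);
  return Buffer.concat([header, pcmBuffer]);
}
```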

4. Latency Optimization

Real-time means keeping perceived delay under ~500ms. Our latency breakdown:

  • Browser capture: 93ms (chunk size)
  • Network round-trip: 50ms
  • Gemini processing: 150ms
  • Response streaming: 20ms
  • Total: ~310ms before transcription appears
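Numbers like these come from timestamping each stage. A minimal sketch of measuring first-token latency, i.e. the gap between sending the first chunk and the first `partial_transcript` arriving (this tracker is our own illustration, not an API):

```javascript
// Records when audio was first sent and computes the delay
// until the first partial transcript comes back.
class LatencyTracker {
  start(now = Date.now()) {
    this.sentAt = now;
    this.firstTokenMs = null;
  }
  onPartial(now = Date.now()) {
    // Only the first partial counts toward first-token latency
    if (this.firstTokenMs === null) this.firstTokenMs = now - this.sentAt;
    return this.firstTokenMs;
  }
}

const t = new LatencyTracker();
t.start(1000);              // first audio_chunk sent
t.onPartial(1310);          // first partial_transcript received
console.log(t.firstTokenMs); // 310
```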

5. Cost Optimization

// Don't send silence
function shouldSendChunk(audioData, threshold = 0.01) {
  const rms = Math.sqrt(
    audioData.reduce((sum, s) => sum + s ** 2, 0) / audioData.length
  );
  return rms > threshold;
}
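A quick sanity check of the threshold, comparing silence against a speech-level tone (synthetic buffers here, not real mic data; `shouldSendChunk` is repeated so the snippet runs standalone):

```javascript
function shouldSendChunk(audioData, threshold = 0.01) {
  const rms = Math.sqrt(
    audioData.reduce((sum, s) => sum + s ** 2, 0) / audioData.length
  );
  return rms > threshold;
}

const silence = new Float32Array(4096); // all zeros, RMS = 0
const speech = Float32Array.from(
  { length: 4096 },
  (_, i) => 0.3 * Math.sin(2 * Math.PI * 440 * i / 44100) // 440Hz tone
);

console.log(shouldSendChunk(silence)); // false — skipped, costs nothing
console.log(shouldSendChunk(speech));  // true — sent to the API
```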

We estimate $0.0005 per form submission at scale.

Lessons Learned

  1. Streaming changes everything. 500ms feels slow. 200ms feels responsive.
  2. Test with real audio. Background noise, accents, quiet voices — test aggressively.
  3. Browser audio APIs are still janky. ScriptProcessorNode is deprecated but most compatible.
  4. Don't ignore codec issues. We lost 2 weeks to garbage transcription from wrong formats.
  5. Frontend UX matters. Debounce updates, show partial results clearly.
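On that last point, the simplest form of debouncing we mean is a render gate that drops partials arriving too close together — a throttle, really. A sketch (our own helper, not a framework API):

```javascript
// Drops UI updates that arrive within minIntervalMs of the last render,
// so a burst of partial transcripts doesn't cause render thrash.
function makeUpdateGate(minIntervalMs) {
  let lastEmit = -Infinity;
  return function shouldRender(now) {
    if (now - lastEmit >= minIntervalMs) {
      lastEmit = now;
      return true;
    }
    return false;
  };
}

const gate = makeUpdateGate(100);
console.log(gate(0));   // true: first update always renders
console.log(gate(50));  // false: too soon, skip this partial
console.log(gate(120)); // true: 120ms since last render
```

In practice you'd still render the final transcript unconditionally; the gate only applies to intermediate partials.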

Production Stack

  • Frontend: React + WebSocket client
  • Backend: Node.js with ws library
  • API: Google Gemini 2.0 Flash
  • Codec: ffmpeg-wasm (browser) + ffmpeg (backend)
  • Hosting: Render + Cloudflare CDN

Building something with voice? We'd love to hear about it. Drop a comment or check out Anve Voice Forms if you want to see this architecture in action.

—Adarsh, Founder @ Anve Voice Forms
