When you want to build voice-input forms that feel responsive and intuitive, the key challenge isn't transcription—modern APIs handle that well. It's latency. Transcription that takes 2 seconds to return feels broken. Transcription that streams back in real-time (200-400ms for first token) feels magical.
This post walks through the architecture we built at Anve Voice Forms to make real-time voice transcription feel fast and seamless in the browser.
The Challenge: Why Basic Transcription APIs Feel Slow
Most voice API approaches work like this:
- User speaks for N seconds
- Collect all audio
- Send entire audio file to API
- Wait for transcription response
- Display result
Round-trip latency: 2-5 seconds. That's dead time where the user is waiting and nothing is happening.
The better approach is streaming: send audio chunks as they arrive, start processing immediately, and stream back results in real-time.
The Architecture
Here's the high-level flow:
Browser (Frontend)
Microphone API → WebAudio Processor → WebSocket Client
│ Chunks
▼
Backend (Node.js/Python)
WebSocket Server → Audio Processor → Gemini API (Streaming)
│
▼
Transcript Builder → Browser updates UI
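The "Transcript Builder" box in the diagram can be sketched as a small accumulator that merges streamed partial chunks into a running transcript. This is a minimal sketch — the class name and shape are ours, not a library API:

```javascript
// Minimal transcript builder: appends streamed partial text chunks
// and exposes the running transcript for UI updates.
class TranscriptBuilder {
  constructor() {
    this.parts = [];
  }

  // Called for each 'partial_transcript' message from the server.
  // Returns the current full transcript, ready for rendering.
  addPartial(text) {
    this.parts.push(text);
    return this.text();
  }

  text() {
    return this.parts.join('');
  }
}

const builder = new TranscriptBuilder();
builder.addPartial('Hello, ');
builder.addPartial('world.');
console.log(builder.text()); // "Hello, world."
```

Because Gemini's streaming chunks are incremental (each chunk is new text, not a revision), simple concatenation is enough here; a provider that re-sends corrected partials would need replace-last semantics instead.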
1. Browser-Side Audio Capture
// Capture audio from the microphone
const audioContext = new (window.AudioContext || window.webkitAudioContext)();
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = audioContext.createMediaStreamSource(mediaStream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (event) => {
  // getChannelData already returns a Float32Array; convert and send immediately
  const audioData = event.inputBuffer.getChannelData(0);
  const int16Data = float32ToInt16(audioData);
  socket.emit('audio_chunk', int16Data);
};

source.connect(processor);
processor.connect(audioContext.destination);

function float32ToInt16(float32Array) {
  const int16Array = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp to [-1, 1] before scaling to avoid integer overflow on loud input
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    int16Array[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16Array;
}
Key decisions:
- 4096 sample chunk size: 93ms at 44.1kHz (good balance between latency and overhead)
- Int16 encoding: most APIs expect 16-bit PCM audio
- Send immediately: don't buffer, start streaming as chunks arrive
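The 93ms figure above falls straight out of chunk size divided by sample rate. A tiny helper (the function name is ours, purely for illustration) makes it easy to check the latency cost of other chunk sizes:

```javascript
// Chunk duration in milliseconds for a given chunk size and sample rate.
// 4096 samples at 44.1 kHz ≈ 93 ms, matching the figure above.
function chunkDurationMs(samples, sampleRateHz) {
  return (samples / sampleRateHz) * 1000;
}

console.log(chunkDurationMs(4096, 44100).toFixed(1)); // "92.9"
console.log(chunkDurationMs(2048, 44100).toFixed(1)); // "46.4"
```

Halving the chunk size halves capture latency but doubles message overhead, which is why 4096 is a reasonable middle ground.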
2. Streaming to Gemini API
This is where real-time transcription happens:
const { GoogleGenerativeAI } = require("@google/generative-ai");
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

async function transcribeAudioStream(ws, audioChunks) {
  const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

  // Concatenate the buffered audio chunks (converted to WAV — see section 3)
  // and base64-encode them, as inlineData requires.
  const audioBase64 = Buffer.concat(audioChunks).toString("base64");

  const response = await model.generateContentStream({
    contents: [{
      role: "user",
      parts: [
        { inlineData: { mimeType: "audio/wav", data: audioBase64 } },
        { text: "Transcribe this audio. Return ONLY the transcription." }
      ]
    }]
  });

  for await (const chunk of response.stream) {
    const text = chunk.text();
    if (text) {
      ws.send(JSON.stringify({
        type: 'partial_transcript',
        text: text,
        timestamp: Date.now()
      }));
    }
  }
}
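The `inlineData` part above expects base64-encoded bytes, which is easy to get wrong when you're holding raw buffers. A small helper (the name `toInlineData` is ours, not part of the SDK) keeps that encoding in one place:

```javascript
// Gemini's inlineData part expects base64-encoded bytes plus a MIME type.
// Wrap a raw audio Buffer into the part shape the SDK accepts.
function toInlineData(audioBuffer, mimeType) {
  return {
    inlineData: {
      mimeType,
      data: audioBuffer.toString('base64'),
    },
  };
}

const part = toInlineData(Buffer.from([0, 1, 2, 3]), 'audio/wav');
console.log(part.inlineData.data); // "AAECAw=="
```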
3. Handling Codec Mismatches
This was our biggest surprise issue. Browsers capture raw float PCM (typically 44.1 or 48kHz), which we down-convert to 16-bit mono. But APIs have different requirements — some want WAV, some MP3, some raw PCM.
const ffmpeg = require('fluent-ffmpeg');
const { Readable, PassThrough } = require('stream');

async function convertAudioCodec(inputBuffer, outputFormat) {
  return new Promise((resolve, reject) => {
    // Collect the converted audio from a pass-through stream
    const output = new PassThrough();
    const chunks = [];
    output.on('data', (chunk) => chunks.push(chunk));
    output.on('end', () => resolve(Buffer.concat(chunks)));

    ffmpeg(Readable.from(inputBuffer))
      .format(outputFormat)
      .audioFrequency(16000)
      .audioChannels(1)
      .on('error', reject)
      .pipe(output);
  });
}
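When the target format is WAV and the input is already raw 16-bit PCM, you can skip ffmpeg entirely and prepend a 44-byte WAV header yourself. A minimal sketch (byte offsets follow the RIFF/WAVE layout; the function name is ours):

```javascript
// Wrap raw 16-bit mono PCM in a minimal 44-byte WAV header so APIs
// that expect "audio/wav" can consume browser-captured PCM directly.
function pcmToWav(pcmBuffer, sampleRate = 16000, channels = 1, bitsPerSample = 16) {
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const header = Buffer.alloc(44);

  header.write('RIFF', 0);                        // ChunkID
  header.writeUInt32LE(36 + pcmBuffer.length, 4); // ChunkSize
  header.write('WAVE', 8);                        // Format
  header.write('fmt ', 12);                       // Subchunk1ID
  header.writeUInt32LE(16, 16);                   // Subchunk1Size (PCM)
  header.writeUInt16LE(1, 20);                    // AudioFormat: 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36);                       // Subchunk2ID
  header.writeUInt32LE(pcmBuffer.length, 40);     // Subchunk2Size

  return Buffer.concat([header, pcmBuffer]);
}
```

This avoids a process spawn per request, at the cost of only handling the PCM-to-WAV case.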
4. Latency Optimization
Real-time means <500ms perception. Our latency breakdown:
- Browser capture: 93ms (chunk size)
- Network round-trip: 50ms
- Gemini processing: 150ms
- Response streaming: 20ms
- Total: ~310ms before transcription appears
5. Cost Optimization
// Don't send silence
function shouldSendChunk(audioData, threshold = 0.01) {
  const rms = Math.sqrt(
    audioData.reduce((sum, s) => sum + s ** 2, 0) / audioData.length
  );
  return rms > threshold;
}
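A quick way to sanity-check the threshold is to run the gate against synthetic chunks — all zeros for silence, a quiet sine wave standing in for voiced audio. The sketch below restates the same RMS check so it runs standalone:

```javascript
// RMS-based silence gate (same logic as the snippet above), exercised
// on synthetic chunks: all-zero "silence" vs. a quiet 440 Hz sine.
function shouldSendChunk(audioData, threshold = 0.01) {
  const rms = Math.sqrt(
    audioData.reduce((sum, s) => sum + s ** 2, 0) / audioData.length
  );
  return rms > threshold;
}

const silence = new Float32Array(4096); // all zeros → RMS of 0
const tone = Float32Array.from(
  { length: 4096 },
  (_, i) => 0.1 * Math.sin((2 * Math.PI * 440 * i) / 44100) // RMS ≈ 0.07
);

console.log(shouldSendChunk(silence)); // false — chunk is dropped
console.log(shouldSendChunk(tone));    // true — chunk is sent
```

In practice the threshold needs tuning against real microphone noise floors, which sit well above zero.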
We estimate $0.0005 per form submission at scale.
Lessons Learned
- Streaming changes everything. 500ms feels slow. 200ms feels responsive.
- Test with real audio. Background noise, accents, quiet voices — test aggressively.
- Browser audio APIs are still janky. ScriptProcessorNode is deprecated but most compatible.
- Don't ignore codec issues. We lost 2 weeks to garbage transcription from wrong formats.
- Frontend UX matters. Debounce updates, show partial results clearly.
Production Stack
- Frontend: React + WebSocket client
- Backend: Node.js with ws library
- API: Google Gemini 2.0 Flash
- Codec: ffmpeg-wasm (browser) + ffmpeg (backend)
- Hosting: Render + Cloudflare CDN
Building something with voice? We'd love to hear about it. Drop a comment or check out Anve Voice Forms if you want to see this architecture in action.
—Adarsh, Founder @ Anve Voice Forms