Voice input is becoming table stakes for modern web apps. Users expect to tap a mic button and speak instead of typing — especially on mobile. Here is how I added real-time voice transcription to a React form using OpenAI Whisper.
The Architecture
The setup is straightforward:
- Browser captures audio via MediaRecorder API
- WebSocket streams audio chunks to a Whisper server
- Whisper server transcribes and returns text in real time
- React updates the input field as transcription arrives
[Browser Mic] → [MediaRecorder] → [WebSocket] → [Whisper Server] → [Text]
The Whisper WebSocket Server
I am running whisper-streaming, a streaming Whisper server that accepts audio over WebSocket and returns transcriptions in real time. It runs on a small VPS at wss://whisper-ws.byldr.co.
The server handles (message format sketched after the list):
- Audio chunk buffering
- VAD (voice activity detection)
- Streaming transcription with partial results
- Final transcription when speech ends
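The wire format below is what my setup uses, not an official spec; if you run a different server, map the message names accordingly. The client sends one JSON config message, then binary audio frames, then an end message; the server replies with JSON partial and final results.

// Client → server
ws.send(JSON.stringify({ type: "config", sampleRate: 16000 })); // once, after the socket opens
ws.send(audioChunk);                                            // ArrayBuffer of encoded audio, repeated
ws.send(JSON.stringify({ type: "end" }));                       // when the user stops speaking

// Server → client (JSON text frames)
// { "type": "partial", "text": "hello wor" }    (may still change)
// { "type": "final",   "text": "Hello world." } (committed after silence)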
The React Implementation
Here is the core hook pattern:
import { useState, useRef } from 'react';

const WHISPER_WS_URL = "wss://whisper-ws.byldr.co";

// Inside the component:
const [isRecording, setIsRecording] = useState(false);
const [isConnecting, setIsConnecting] = useState(false);
const [inputValue, setInputValue] = useState('');
const wsRef = useRef(null);
const mediaRecorderRef = useRef(null);
const streamRef = useRef(null);
Starting Recording
const startRecording = async () => {
  // Get microphone access
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      sampleRate: 16000,
      channelCount: 1,
      echoCancellation: true,
      noiseSuppression: true
    }
  });
  streamRef.current = stream;

  // Connect to Whisper server
  setIsConnecting(true);
  const ws = new WebSocket(WHISPER_WS_URL);
  wsRef.current = ws;

  ws.onopen = () => {
    setIsConnecting(false);

    // Send config first
    ws.send(JSON.stringify({ type: "config", sampleRate: 16000 }));

    // Start recording
    const mediaRecorder = new MediaRecorder(stream, {
      mimeType: 'audio/webm;codecs=opus'
    });
    mediaRecorderRef.current = mediaRecorder;

    // Stream audio chunks every 500ms
    mediaRecorder.ondataavailable = async (e) => {
      if (e.data.size > 0 && ws.readyState === WebSocket.OPEN) {
        const buffer = await e.data.arrayBuffer();
        ws.send(buffer);
      }
    };

    mediaRecorder.start(500); // Chunk every 500ms
    setIsRecording(true);
  };

  // Handle transcription results
  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === "partial" || data.type === "final") {
      setInputValue(data.text);
    }
  };
};
Stopping Recording
const stopRecording = () => {
  // Stop the MediaRecorder (guard against it never having started)
  if (mediaRecorderRef.current && mediaRecorderRef.current.state !== 'inactive') {
    mediaRecorderRef.current.stop();
  }

  // Stop all audio tracks so the browser releases the mic
  streamRef.current?.getTracks().forEach(track => track.stop());

  // Signal end to the server, then give it a moment to send the final transcript
  if (wsRef.current?.readyState === WebSocket.OPEN) {
    wsRef.current.send(JSON.stringify({ type: "end" }));
    setTimeout(() => wsRef.current?.close(), 2000);
  }

  setIsRecording(false);
};
The UI
The button shows one of three states:
<button onClick={isRecording ? stopRecording : startRecording}>
  {isConnecting ? (
    <span>Connecting...</span>
  ) : isRecording ? (
    <span>Stop Recording</span>
  ) : (
    <span>🎤 Speak</span>
  )}
</button>

{isRecording && (
  <p className="text-red-400 animate-pulse">Listening...</p>
)}

<input
  value={inputValue}
  readOnly={isRecording}
  placeholder="What's on your mind?"
/>
Key Implementation Details
Audio Format
Whisper works best with 16kHz mono audio. The MediaRecorder uses audio/webm;codecs=opus which the server can decode. Some Whisper setups prefer raw PCM — check your server's requirements.
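If your server does want raw PCM, MediaRecorder will not give it to you directly. One option is to pull samples from the Web Audio API and convert them yourself. A rough sketch, assuming the server accepts 16kHz, 16-bit mono frames, and reusing the stream and ws variables from startRecording above (ScriptProcessorNode is deprecated in favor of AudioWorklet, but it keeps the sketch short):

// Sketch: send raw 16-bit PCM instead of WebM/Opus (only if your server expects it)
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp to [-1, 1] and scale to 16-bit signed integers
    const s = Math.max(-1, Math.min(1, float32[i]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(int16.buffer);
  }
};

source.connect(processor);
processor.connect(audioContext.destination);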
Chunk Timing
500ms chunks provide a good balance between latency and efficiency. Smaller chunks (100-250ms) give faster partial results but increase overhead. Larger chunks (1s+) feel laggy.
Partial vs Final Results
The server sends two types of transcriptions:
- partial: Best guess so far (may change)
- final: Committed transcription after silence detection
I update the input on both — users see text appearing as they speak, and it stabilizes when they pause.
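Note that the onmessage handler above simply replaces the input with data.text, which assumes the server sends the full running transcript each time. If your server sends one segment at a time instead, a small variation keeps committed finals and appends the live partial (committedRef here is a hypothetical extra ref, not part of the code above):

const committedRef = useRef('');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "partial") {
    setInputValue(committedRef.current + data.text); // live text, may still change
  } else if (data.type === "final") {
    committedRef.current += data.text + ' ';         // lock the segment in
    setInputValue(committedRef.current);
  }
};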
Error Handling
Always have a fallback:
ws.onerror = () => {
cleanupRecording();
alert("Voice service unavailable. Please type instead.");
};
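cleanupRecording is not shown earlier; it is essentially stopRecording without the polite end message, since the socket is already failing. Something like:

const cleanupRecording = () => {
  // Tear everything down without waiting on the server
  if (mediaRecorderRef.current && mediaRecorderRef.current.state !== 'inactive') {
    mediaRecorderRef.current.stop();
  }
  streamRef.current?.getTracks().forEach(track => track.stop());
  wsRef.current?.close();
  setIsRecording(false);
  setIsConnecting(false);
};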
Microphone Permissions
Browsers require HTTPS (or localhost) for getUserMedia. Handle permission denials gracefully:
try {
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (error) {
alert("Please allow microphone access to use voice input.");
}
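If you want friendlier messages, the DOMException name tells you why it failed; NotAllowedError and NotFoundError are the two you will see most often. A small extension of the snippet above:

try {
  streamRef.current = await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (error) {
  if (error.name === 'NotAllowedError') {
    alert("Please allow microphone access to use voice input.");
  } else if (error.name === 'NotFoundError') {
    alert("No microphone was found on this device.");
  } else {
    alert("Could not start the microphone. Please type instead.");
  }
}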
Self-Hosting Whisper
If you want to run your own Whisper server:
- whisper-streaming (Python): Good for real-time streaming
- faster-whisper (Python): Optimized inference with CTranslate2
- whisper.cpp (C++): Low resource usage, good for edge deployment
I run whisper-streaming on a 2 vCPU VPS with 4GB RAM. It handles several concurrent connections fine with the base model. For production, consider small or medium models for better accuracy.
Why Not Web Speech API?
The browser's built-in webkitSpeechRecognition is free and takes only a few lines to wire up (sketched after this list), but:
- Only works in Chrome/Edge
- Requires internet (uses Google's servers)
- No control over the model
- Privacy concerns for sensitive data
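For reference, the built-in API is only a few lines, which is most of its appeal:

// Works in Chrome/Edge only; audio is processed on Google's servers
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;

recognition.onresult = (event) => {
  const transcript = Array.from(event.results)
    .map(result => result[0].transcript)
    .join('');
  setInputValue(transcript);
};

recognition.start();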
Self-hosted Whisper, by contrast:
- Works in every browser
- Runs on your own infrastructure
- Is more accurate (especially for technical terms)
- Keeps audio and transcripts private
Wrapping Up
Voice input is surprisingly simple to add once you have the pieces in place. The key is streaming: do not wait for the user to finish speaking before transcribing. Show them text appearing in real time and it feels magical.
The full implementation is live at ryancwynar.com — try the voice button on the hero section.
Building AI-powered features? I help companies integrate voice, automation, and AI into their products. Get in touch.