Voice input is becoming table stakes for modern web apps. Users expect to tap a mic button and speak instead of typing — especially on mobile. Here is how I added real-time voice transcription to a React form using OpenAI Whisper.
The Architecture
The setup is straightforward:
- Browser captures audio via MediaRecorder API
- WebSocket streams audio chunks to a Whisper server
- Whisper server transcribes and returns text in real time
- React updates the input field as transcription arrives
[Browser Mic] → [MediaRecorder] → [WebSocket] → [Whisper Server] → [Text]
The Whisper WebSocket Server
I am running whisper-streaming, a streaming Whisper server that accepts audio over WebSocket and returns transcriptions in real time. It runs on a small VPS at wss://whisper-ws.byldr.co.
The server handles (message format sketched after the list):
- Audio chunk buffering
- VAD (voice activity detection)
- Streaming transcription with partial results
- Final transcription when speech ends
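The wire format below is what my setup uses, not an official spec; if you run a different server, map the message names accordingly. The client sends one JSON config message, then binary audio frames, then an end message; the server replies with JSON partial and final results.

// Client → server
ws.send(JSON.stringify({ type: "config", sampleRate: 16000 })); // once, after the socket opens
ws.send(audioChunk);                                            // ArrayBuffer of encoded audio, repeated
ws.send(JSON.stringify({ type: "end" }));                       // when the user stops speaking

// Server → client (JSON text frames)
// { "type": "partial", "text": "hello wor" }    (may still change)
// { "type": "final",   "text": "Hello world." } (committed after silence)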
The React Implementation
Here is the core hook pattern:
import { useState, useRef } from 'react';

const WHISPER_WS_URL = "wss://whisper-ws.byldr.co";

// Inside the component:
const [isRecording, setIsRecording] = useState(false);
const [isConnecting, setIsConnecting] = useState(false);
const [inputValue, setInputValue] = useState('');
const wsRef = useRef(null);
const mediaRecorderRef = useRef(null);
const streamRef = useRef(null);
Starting Recording
const startRecording = async () => {
  // Get microphone access
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      sampleRate: 16000,
      channelCount: 1,
      echoCancellation: true,
      noiseSuppression: true
    }
  });
  streamRef.current = stream;

  // Connect to Whisper server
  setIsConnecting(true);
  const ws = new WebSocket(WHISPER_WS_URL);
  wsRef.current = ws;

  ws.onopen = () => {
    setIsConnecting(false);

    // Send config first
    ws.send(JSON.stringify({ type: "config", sampleRate: 16000 }));

    // Start recording
    const mediaRecorder = new MediaRecorder(stream, {
      mimeType: 'audio/webm;codecs=opus'
    });
    mediaRecorderRef.current = mediaRecorder;

    // Stream audio chunks every 500ms
    mediaRecorder.ondataavailable = async (e) => {
      if (e.data.size > 0 && ws.readyState === WebSocket.OPEN) {
        const buffer = await e.data.arrayBuffer();
        ws.send(buffer);
      }
    };

    mediaRecorder.start(500); // Chunk every 500ms
    setIsRecording(true);
  };

  // Handle transcription results
  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === "partial" || data.type === "final") {
      setInputValue(data.text);
    }
  };
};
Stopping Recording
const stopRecording = () => {
  // Stop the MediaRecorder (guard against it never having started)
  if (mediaRecorderRef.current && mediaRecorderRef.current.state !== 'inactive') {
    mediaRecorderRef.current.stop();
  }

  // Stop all audio tracks so the browser releases the mic
  streamRef.current?.getTracks().forEach(track => track.stop());

  // Signal end to the server, then give it a moment to send the final transcript
  if (wsRef.current?.readyState === WebSocket.OPEN) {
    wsRef.current.send(JSON.stringify({ type: "end" }));
    setTimeout(() => wsRef.current?.close(), 2000);
  }

  setIsRecording(false);
};
The UI
The button shows one of three states:
<button onClick={isRecording ? stopRecording : startRecording}>
  {isConnecting ? (
    <span>Connecting...</span>
  ) : isRecording ? (
    <span>Stop Recording</span>
  ) : (
    <span>🎤 Speak</span>
  )}
</button>

{isRecording && (
  <p className="text-red-400 animate-pulse">Listening...</p>
)}

<input
  value={inputValue}
  readOnly={isRecording}
  placeholder="What's on your mind?"
/>
Key Implementation Details
Audio Format
Whisper works best with 16kHz mono audio. The MediaRecorder uses audio/webm;codecs=opus which the server can decode. Some Whisper setups prefer raw PCM — check your server's requirements.
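If your server does want raw PCM, MediaRecorder will not give it to you directly. One option is to pull samples from the Web Audio API and convert them yourself. A rough sketch, assuming the server accepts 16kHz, 16-bit mono frames, and reusing the stream and ws variables from startRecording above (ScriptProcessorNode is deprecated in favor of AudioWorklet, but it keeps the sketch short):

// Sketch: send raw 16-bit PCM instead of WebM/Opus (only if your server expects it)
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp to [-1, 1] and scale to 16-bit signed integers
    const s = Math.max(-1, Math.min(1, float32[i]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(int16.buffer);
  }
};

source.connect(processor);
processor.connect(audioContext.destination);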
Chunk Timing
500ms chunks provide a good balance between latency and efficiency. Smaller chunks (100-250ms) give faster partial results but increase overhead. Larger chunks (1s+) feel laggy.
Partial vs Final Results
The server sends two types of transcriptions:
- partial: Best guess so far (may change)
- final: Committed transcription after silence detection
I update the input on both — users see text appearing as they speak, and it stabilizes when they pause.
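Note that the onmessage handler above simply replaces the input with data.text, which assumes the server sends the full running transcript each time. If your server sends one segment at a time instead, a small variation keeps committed finals and appends the live partial (committedRef here is a hypothetical extra ref, not part of the code above):

const committedRef = useRef('');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "partial") {
    setInputValue(committedRef.current + data.text); // live text, may still change
  } else if (data.type === "final") {
    committedRef.current += data.text + ' ';         // lock the segment in
    setInputValue(committedRef.current);
  }
};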
Error Handling
Always have a fallback:
ws.onerror = () => {
cleanupRecording();
alert("Voice service unavailable. Please type instead.");
};
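cleanupRecording is not shown earlier; it is essentially stopRecording without the polite end message, since the socket is already failing. Something like:

const cleanupRecording = () => {
  // Tear everything down without waiting on the server
  if (mediaRecorderRef.current && mediaRecorderRef.current.state !== 'inactive') {
    mediaRecorderRef.current.stop();
  }
  streamRef.current?.getTracks().forEach(track => track.stop());
  wsRef.current?.close();
  setIsRecording(false);
  setIsConnecting(false);
};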
Microphone Permissions
Browsers require HTTPS (or localhost) for getUserMedia. Handle permission denials gracefully:
try {
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (error) {
alert("Please allow microphone access to use voice input.");
}
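If you want friendlier messages, the DOMException name tells you why it failed; NotAllowedError and NotFoundError are the two you will see most often. A small extension of the snippet above:

try {
  streamRef.current = await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (error) {
  if (error.name === 'NotAllowedError') {
    alert("Please allow microphone access to use voice input.");
  } else if (error.name === 'NotFoundError') {
    alert("No microphone was found on this device.");
  } else {
    alert("Could not start the microphone. Please type instead.");
  }
}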
Self-Hosting Whisper
If you want to run your own Whisper server:
- whisper-streaming (Python): Good for real-time streaming
- faster-whisper (Python): Optimized inference with CTranslate2
- whisper.cpp (C++): Low resource usage, good for edge deployment
I run whisper-streaming on a 2 vCPU VPS with 4GB RAM. It handles several concurrent connections fine with the base model. For production, consider small or medium models for better accuracy.
Why Not Web Speech API?
The browser's built-in webkitSpeechRecognition is free and takes only a few lines to wire up (sketched after this list), but:
- Only works in Chrome/Edge
- Requires internet (uses Google's servers)
- No control over the model
- Privacy concerns for sensitive data
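For reference, the built-in API is only a few lines, which is most of its appeal:

// Works in Chrome/Edge only; audio is processed on Google's servers
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;

recognition.onresult = (event) => {
  const transcript = Array.from(event.results)
    .map(result => result[0].transcript)
    .join('');
  setInputValue(transcript);
};

recognition.start();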
Self-hosted Whisper, by contrast:
- Works in every browser
- Runs on your own infrastructure
- Is more accurate (especially for technical terms)
- Keeps audio and transcripts private
Wrapping Up
Voice input is surprisingly simple to add once you have the pieces in place. The key is streaming: do not wait for the user to finish speaking before transcribing. Show them text appearing in real time and it feels magical.
The full implementation is live at ryancwynar.com — try the voice button on the hero section.
Building AI-powered features? I help companies integrate voice, automation, and AI into their products. Get in touch.