DEV Community

RyanCwynar

Posted on • Originally published at ryancwynar.com

Adding Voice Input to Web Forms with Whisper

Voice input is becoming table stakes for modern web apps. Users expect to tap a mic button and speak instead of typing — especially on mobile. Here is how I added real-time voice transcription to a React form using OpenAI Whisper.

The Architecture

The setup is straightforward:

  1. Browser captures audio via MediaRecorder API
  2. WebSocket streams audio chunks to a Whisper server
  3. Whisper server transcribes and returns text in real-time
  4. React updates the input field as transcription arrives
[Browser Mic] → [MediaRecorder] → [WebSocket] → [Whisper Server] → [Text]

The Whisper WebSocket Server

I am running whisper-streaming — a streaming Whisper server that accepts audio over WebSocket and returns transcriptions in real-time. It runs on a small VPS at wss://whisper-ws.byldr.co.

The server handles:

  • Audio chunk buffering
  • VAD (voice activity detection)
  • Streaming transcription with partial results
  • Final transcription when speech ends

The React Implementation

Here is the core hook pattern:

import { useState, useRef } from 'react';

const WHISPER_WS_URL = "wss://whisper-ws.byldr.co";

const [isRecording, setIsRecording] = useState(false);
const [isConnecting, setIsConnecting] = useState(false); // drives the button UI below
const [inputValue, setInputValue] = useState('');
const wsRef = useRef(null);
const mediaRecorderRef = useRef(null);
const streamRef = useRef(null);

Starting Recording

const startRecording = async () => {
  setIsConnecting(true);

  // Get microphone access
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      sampleRate: 16000,
      channelCount: 1,
      echoCancellation: true,
      noiseSuppression: true
    }
  });
  streamRef.current = stream;

  // Connect to Whisper server
  const ws = new WebSocket(WHISPER_WS_URL);
  wsRef.current = ws;

  ws.onopen = () => {
    setIsConnecting(false);

    // Send config first
    ws.send(JSON.stringify({ type: "config", sampleRate: 16000 }));

    // Start recording
    const mediaRecorder = new MediaRecorder(stream, {
      mimeType: 'audio/webm;codecs=opus'
    });
    mediaRecorderRef.current = mediaRecorder;

    // Stream audio chunks every 500ms
    mediaRecorder.ondataavailable = async (e) => {
      if (e.data.size > 0 && ws.readyState === WebSocket.OPEN) {
        const buffer = await e.data.arrayBuffer();
        ws.send(buffer);
      }
    };

    mediaRecorder.start(500); // Chunk every 500ms
    setIsRecording(true);
  };

  // Handle transcription results
  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === "partial" || data.type === "final") {
      setInputValue(data.text);
    }
  };
};

Stopping Recording

const stopRecording = () => {
  // Stop the MediaRecorder
  if (mediaRecorderRef.current && mediaRecorderRef.current.state !== 'inactive') {
    mediaRecorderRef.current.stop();
  }

  // Stop all audio tracks
  streamRef.current?.getTracks().forEach(track => track.stop());

  // Signal end to server and close connection
  if (wsRef.current?.readyState === WebSocket.OPEN) {
    wsRef.current.send(JSON.stringify({ type: "end" }));
    setTimeout(() => wsRef.current?.close(), 2000);
  }

  setIsRecording(false);
};

The UI

The button renders one of three states: connecting, recording, or idle.

<button onClick={isRecording ? stopRecording : startRecording}>
  {isConnecting ? (
    <span>Connecting...</span>
  ) : isRecording ? (
    <span>Stop Recording</span>
  ) : (
    <span>🎤 Speak</span>
  )}
</button>

{isRecording && (
  <p className="text-red-400 animate-pulse">Listening...</p>
)}

<input
  value={inputValue}
  readOnly={isRecording}
  placeholder="What's on your mind?"
/>

Key Implementation Details

Audio Format

Whisper works best with 16kHz mono audio. The MediaRecorder uses audio/webm;codecs=opus which the server can decode. Some Whisper setups prefer raw PCM — check your server's requirements.
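If your server does want raw 16 kHz PCM instead of Opus, you would capture samples with an AudioWorklet (or the older ScriptProcessor) and convert before sending. The conversion itself is a small pure function; this is a sketch, not part of my production setup:

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit signed PCM,
// the format many Whisper servers expect for raw audio.
function float32ToInt16(float32) {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, float32[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

You would then `ws.send(float32ToInt16(samples).buffer)` instead of sending the webm chunks.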

Chunk Timing

500ms chunks provide a good balance between latency and efficiency. Smaller chunks (100-250ms) give faster partial results but increase overhead. Larger chunks (1s+) feel laggy.
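The back-of-envelope math on overhead is simple: chunk size determines how many WebSocket messages you send per minute of speech.

```javascript
// Messages per minute of continuous recording for a given chunk size.
const messagesPerMinute = (chunkMs) => Math.round(60000 / chunkMs);

messagesPerMinute(100);  // 600 messages/min
messagesPerMinute(500);  // 120 messages/min
messagesPerMinute(1000); // 60 messages/min
```

At 500ms you send two messages per second, which is negligible for the socket but still frequent enough for responsive partials.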

Partial vs Final Results

The server sends two types of transcriptions:

  • partial: Best guess so far (may change)
  • final: Committed transcription after silence detection

I update the input on both — users see text appearing as they speak, and it stabilizes when they pause.
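A refinement worth considering (a sketch, not what the snippet above does): accumulate final segments and let partials overwrite only the uncommitted tail, so earlier sentences never flicker as new partials arrive.

```javascript
// Transcript buffer: finals are appended permanently, partials
// replace only the pending tail.
function makeTranscriptBuffer() {
  let committed = "";
  let pending = "";
  return {
    update(text, isFinal) {
      if (isFinal) {
        committed += (committed ? " " : "") + text;
        pending = "";
      } else {
        pending = text;
      }
    },
    get value() {
      return pending ? (committed ? committed + " " : "") + pending : committed;
    },
  };
}
```

With this in place, `ws.onmessage` would call `buffer.update(data.text, data.type === "final")` and then `setInputValue(buffer.value)`.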

Error Handling

Always have a fallback:

ws.onerror = () => {
  cleanupRecording();
  alert("Voice service unavailable. Please type instead.");
};
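`cleanupRecording` is the same teardown as `stopRecording`, minus the polite `end` message, since the socket is already unusable. Here is one way to factor it, with the refs passed in so the logic can be unit-tested in isolation (this exact signature is my sketch, not the code running on my site):

```javascript
// Shared teardown for both the normal stop path and ws.onerror.
function cleanupRecording({ mediaRecorderRef, streamRef, wsRef }, setIsRecording) {
  // Stop the recorder if it is still running.
  if (mediaRecorderRef.current && mediaRecorderRef.current.state !== "inactive") {
    mediaRecorderRef.current.stop();
  }
  // Release the microphone.
  streamRef.current?.getTracks().forEach((track) => track.stop());
  streamRef.current = null;
  // Drop the socket without sending an "end" message.
  wsRef.current?.close();
  setIsRecording(false);
}
```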

Microphone Permissions

Browsers require HTTPS for getUserMedia. Handle permission denials gracefully:

try {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (error) {
  alert("Please allow microphone access to use voice input.");
}

Self-Hosting Whisper

If you want to run your own Whisper server:

  1. whisper-streaming (Python): Good for real-time streaming
  2. faster-whisper (Python): Optimized inference with CTranslate2
  3. whisper.cpp (C++): Low resource usage, good for edge deployment

I run whisper-streaming on a 2 vCPU VPS with 4GB RAM. It handles several concurrent connections fine with the base model. For production, consider small or medium models for better accuracy.
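For reference, launching the ufal/whisper_streaming server looks roughly like the following. Note that the bundled server speaks raw TCP, not WebSocket, so you may need a small WebSocket wrapper in front of it; the flag names here are from memory, so check the repo's README for the exact options on your version.

```shell
# Approximate invocation of whisper_online_server.py from
# ufal/whisper_streaming -- verify flags against the repo README.
python whisper_online_server.py --model base --host 0.0.0.0 --port 43007
```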

Why Not Web Speech API?

The browser's built-in webkitSpeechRecognition is free but:

  • Only works in Chrome/Edge
  • Requires internet (uses Google's servers)
  • No control over the model
  • Privacy concerns for sensitive data

Self-hosted Whisper gives you:

  • Works in all browsers
  • Runs on your infrastructure
  • Better accuracy (especially for technical terms)
  • Full privacy
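If you want graceful degradation rather than an either/or choice, a feature-detect can prefer your Whisper endpoint and fall back to the native recognizer where it exists. This is a sketch; `pickRecognizer` is my name for it, not a standard API, and the window object is injected so the decision logic is testable outside a browser:

```javascript
// Pick a speech engine: self-hosted Whisper first, native second.
function pickRecognizer(win, whisperAvailable) {
  if (whisperAvailable) return "whisper";
  if (win.SpeechRecognition || win.webkitSpeechRecognition) return "native";
  return "none";
}
```

In the component you would call it as `pickRecognizer(window, whisperHealthy)` after a health check against your WebSocket endpoint.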

Wrapping Up

Voice input is surprisingly simple to add once you have the pieces in place. The key is streaming — do not wait for the user to finish speaking before transcribing. Show them text appearing in real-time and it feels magical.

The full implementation is live at ryancwynar.com — try the voice button on the hero section.


Building AI-powered features? I help companies integrate voice, automation, and AI into their products. Get in touch.
