monkeymore studio
Building a Browser-Based Speech-to-Text System with Whisper AI

Have you ever needed to transcribe audio without sending your files to the cloud? Whether you're a journalist interviewing sources, a student recording lectures, or a developer building voice interfaces, privacy matters. In this guide, I'll show you how we built a speech-to-text system that runs entirely in your browser using OpenAI's Whisper model, the same technology behind our free online speech-to-text tool.

Why Browser-Based Speech Recognition?

Before diving into the code, let's talk about why you'd want speech-to-text processing to happen locally.

Your Audio Stays Private

When you upload audio to cloud transcription services, you're trusting third parties with potentially sensitive content. Browser-based processing means your recordings never leave your device. This is crucial for journalists, therapists, lawyers, or anyone handling confidential information.

No API Costs or Rate Limits

Cloud STT services charge by the minute and impose rate limits. With local processing, you can transcribe hours of audio without worrying about bills or throttling.

Works Offline

Once the page and AI model are loaded, you can transcribe audio even without an internet connection. Perfect for field work, travel, or areas with poor connectivity.

Real-Time and Batch Processing

Our implementation supports both real-time transcription via the Web Speech API and high-quality batch processing using Whisper AI.

The Architecture Overview

Our STT system uses a dual approach: the Web Speech API for real-time recording and Whisper AI (via Transformers.js) for high-quality file transcription. The sections below show how the pieces fit together.

Understanding the Dual Approach

Our implementation offers two transcription modes:

1. Web Speech API (Real-Time)

Built into modern browsers, this provides instant transcription as you speak:

  • Pros: Instant results, no model download required
  • Cons: Accuracy varies by browser, and most implementations (Chrome included) stream audio to a server, so an internet connection is usually required

2. Whisper AI (High-Quality)

OpenAI's Whisper model running locally via Transformers.js:

  • Pros: State-of-the-art accuracy, supports 99 languages, generates timestamps
  • Cons: Requires ~75MB model download, slower than real-time API
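Choosing between the two modes can come down to simple feature detection. A minimal sketch (the `pickMode` helper is illustrative, not part of the original component; the detection result is passed in so the logic stays testable outside a browser):

```typescript
type TranscriptionMode = "realtime" | "whisper";

// Offer real-time mode only when SpeechRecognition is available;
// Whisper works anywhere WebAssembly does, so it is the fallback
function pickMode(hasSpeechRecognition: boolean): TranscriptionMode {
  return hasSpeechRecognition ? "realtime" : "whisper";
}

// In the browser, detection would look like:
// const supported =
//   "SpeechRecognition" in window || "webkitSpeechRecognition" in window;
// const mode = pickMode(supported);
```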

Core Data Structures

Let's examine the key data structures in our STT implementation:

Language Configuration

const LANGUAGE_OPTIONS = [
  { code: "en-US", name: "English (US)" },
  { code: "en-GB", name: "English (UK)" },
  { code: "zh-CN", name: "Chinese (Simplified)" },
  { code: "zh-TW", name: "Chinese (Traditional)" },
  { code: "ja-JP", name: "Japanese" },
  { code: "ko-KR", name: "Korean" },
  { code: "es-ES", name: "Spanish" },
  { code: "fr-FR", name: "French" },
  { code: "de-DE", name: "German" },
  { code: "pt-BR", name: "Portuguese" },
  { code: "ru-RU", name: "Russian" },
];

We support 11 languages, mapping browser language codes to Whisper's language identifiers.
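A small helper makes the list easy to consume in the UI, falling back to the raw code for unknown locales (`getLanguageName` is a hypothetical convenience, not from the original component):

```typescript
const LANGUAGE_OPTIONS = [
  { code: "en-US", name: "English (US)" },
  { code: "ja-JP", name: "Japanese" },
  // ...remaining entries as above
];

// Resolve a display name for a language code, falling back to the code itself
function getLanguageName(code: string): string {
  return LANGUAGE_OPTIONS.find((opt) => opt.code === code)?.name ?? code;
}
```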

React State Management

const [text, setText] = useState("");
const [isRecording, setIsRecording] = useState(false);
const [recognitionLanguage, setRecognitionLanguage] = useState("en-US");
const [audioFile, setAudioFile] = useState<File | null>(null);
const [audioUrl, setAudioUrl] = useState<string | null>(null);
const [isProcessing, setIsProcessing] = useState(false);
const [modelLoaded, setModelLoaded] = useState(false);
const [modelLoading, setModelLoading] = useState(false);
const [modelProgress, setModelProgress] = useState(0);
const [transcriptionChunks, setTranscriptionChunks] = useState<Array<{timestamp: [number, number], text: string}>>([]);
const [interimText, setInterimText] = useState("");
const [recognitionError, setRecognitionError] = useState("");
const finalTextRef = useRef("");
const recognitionRef = useRef<any>(null);
const pipelineRef = useRef<any>(null);

We track recording state, audio files, model loading progress, and transcription chunks with timestamps.
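Since several of these flags are mutually exclusive in practice, it can help to derive a single status value for rendering (a sketch; `deriveStatus` and its status strings are made up for illustration):

```typescript
type Status = "idle" | "recording" | "loading-model" | "processing";

// Collapse independent boolean flags into one UI status.
// Order matters: active recording or processing outranks model loading.
function deriveStatus(flags: {
  isRecording: boolean;
  isProcessing: boolean;
  modelLoading: boolean;
}): Status {
  if (flags.isRecording) return "recording";
  if (flags.isProcessing) return "processing";
  if (flags.modelLoading) return "loading-model";
  return "idle";
}
```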

The Complete Processing Flow

Let's walk through each stage, from audio input to transcribed text.

Real-Time Speech Recognition

The Web Speech API provides instant transcription:

const startRecording = () => {
  if (!isSpeechRecognitionSupported()) {
    setRecognitionError(t.sstNotSupported || "Speech recognition is not supported in your browser.");
    return;
  }

  setRecognitionError("");
  setText("");
  setInterimText("");
  finalTextRef.current = "";

  const SpeechRecognition = (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
  const recognition = new SpeechRecognition();

  recognition.continuous = true;
  recognition.interimResults = true;
  recognition.lang = recognitionLanguage;

  recognition.onstart = () => {
    setIsRecording(true);
  };

  recognition.onresult = (event: any) => {
    let finalTranscript = "";
    let interimTranscript = "";

    for (let i = event.resultIndex; i < event.results.length; i++) {
      const transcript = event.results[i][0].transcript;
      if (event.results[i].isFinal) {
        finalTranscript += transcript + " ";
      } else {
        interimTranscript += transcript;
      }
    }

    // Only append final results to the accumulated text
    if (finalTranscript) {
      finalTextRef.current += finalTranscript;
      setText(finalTextRef.current);
      setInterimText("");
    } else if (interimTranscript) {
      // Show interim results temporarily without saving them
      setInterimText(interimTranscript);
      setText(finalTextRef.current + interimTranscript);
    }
  };

  recognition.onerror = (event: any) => {
    console.error("Speech recognition error:", event.error);
    setRecognitionError(t.sstError || "Recognition error. Please try again.");
    setIsRecording(false);
  };

  recognition.onend = () => {
    setIsRecording(false);
  };

  recognitionRef.current = recognition;
  recognition.start();
};

Key features:

  • Continuous mode: Keeps listening until manually stopped
  • Interim results: Shows live transcription as you speak
  • Final results: Only saves confirmed transcriptions
  • Language support: Uses the selected language code
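The final/interim bookkeeping inside `onresult` can be factored into a pure function, which makes it easy to test with mock results (the shapes below mimic `SpeechRecognitionResultList`; the helper itself is an illustrative refactor, not the original code):

```typescript
interface MockResult {
  isFinal: boolean;
  transcript: string;
}

// Split a batch of recognition results into confirmed text and a live
// preview, mirroring the loop inside the onresult handler
function splitResults(results: MockResult[], startIndex = 0) {
  let finalTranscript = "";
  let interimTranscript = "";
  for (let i = startIndex; i < results.length; i++) {
    if (results[i].isFinal) {
      finalTranscript += results[i].transcript + " ";
    } else {
      interimTranscript += results[i].transcript;
    }
  }
  return { finalTranscript, interimTranscript };
}
```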

Loading Whisper AI Model

For high-quality transcription, we load OpenAI's Whisper model:

const loadWhisper = async () => {
  if (pipelineRef.current) return;

  setModelLoading(true);
  setModelProgress(0);

  try {
    // Dynamic import to avoid SSR issues
    const { pipeline } = await import('@xenova/transformers');

    // Create automatic speech recognition pipeline
    // Using tiny model (~75MB) for faster download
    const transcriber = await pipeline(
      'automatic-speech-recognition',
      'Xenova/whisper-tiny',
      {
        progress_callback: (progress: any) => {
          if (progress && typeof progress.loaded === 'number' && typeof progress.total === 'number') {
            const percent = Math.round((progress.loaded / progress.total) * 100);
            setModelProgress(percent);
          }
        }
      }
    );

    pipelineRef.current = transcriber;
    setModelLoaded(true);
  } catch (error) {
    console.error("Failed to load Whisper:", error);
    setRecognitionError("Failed to load AI model. Please check your connection and try again.");
  } finally {
    setModelLoading(false);
  }
};

We use the Xenova/whisper-tiny model (~75MB) for a balance of accuracy and download size. The progress_callback provides real-time download progress.

Audio Preprocessing

Before sending audio to Whisper, we need to convert it to the right format:

const audioFileToFloat32 = async (file: File): Promise<Float32Array> => {
  const audioContext = new (window.AudioContext || (window as any).webkitAudioContext)();
  const arrayBuffer = await file.arrayBuffer();
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);

  // Get audio data from first channel
  const channelData = audioBuffer.getChannelData(0);
  const sampleRate = audioBuffer.sampleRate;
  const targetSampleRate = 16000;

  // Resample to 16kHz if needed
  if (sampleRate === targetSampleRate) {
    return channelData;
  }

  // Simple resampling
  const ratio = sampleRate / targetSampleRate;
  const newLength = Math.floor(channelData.length / ratio);
  const resampled = new Float32Array(newLength);

  for (let i = 0; i < newLength; i++) {
    const index = Math.floor(i * ratio);
    resampled[i] = channelData[index];
  }

  return resampled;
};

Whisper expects 16kHz mono audio, so we:

  1. Decode the audio file using Web Audio API
  2. Extract the first channel (mono)
  3. Resample to 16kHz using simple decimation
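Nearest-sample decimation is fast, but it can alias on high-frequency content. A slightly smoother variant interpolates linearly between neighboring samples (a sketch under the same 16kHz assumption; for best quality you could instead resample with an `OfflineAudioContext`):

```typescript
// Linear-interpolation resampler operating on a plain Float32Array
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  if (fromRate === toRate) return input;
  const ratio = fromRate / toRate;
  const newLength = Math.floor(input.length / ratio);
  const out = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Blend the two nearest input samples instead of dropping one
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```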

Transcribing Audio Files

Here's the complete transcription process:

const extractText = async () => {
  if (!audioFile || !audioUrl) return;

  setIsProcessing(true);
  setRecognitionError("");
  setText("");
  setTranscriptionChunks([]);

  try {
    // Load whisper if not loaded
    if (!pipelineRef.current) {
      await loadWhisper();
    }

    if (!pipelineRef.current) {
      throw new Error("Failed to initialize speech recognition");
    }

    // Map language codes
    const langMap: Record<string, string> = {
      'en-US': 'en', 'en-GB': 'en', 'zh-CN': 'zh', 'zh-TW': 'zh',
      'ja-JP': 'ja', 'ko-KR': 'ko', 'es-ES': 'es', 'fr-FR': 'fr',
      'de-DE': 'de', 'pt-BR': 'pt', 'ru-RU': 'ru',
    };

    const langCode = langMap[recognitionLanguage] || 'en';

    // Transcribe with timestamps
    const result = await pipelineRef.current(audioUrl, {
      language: langCode,
      task: 'transcribe',
      return_timestamps: true,
    });

    if (result.chunks && result.chunks.length > 0) {
      setTranscriptionChunks(result.chunks);
      setText(result.text || result.chunks.map((c: any) => c.text).join(' '));
    } else {
      setText(result.text || '');
    }
  } catch (error: any) {
    console.error("Audio processing error:", error);
    setRecognitionError(error.message || t.sstProcessingError || "Error processing audio file.");
  } finally {
    setIsProcessing(false);
  }
};

Key features:

  • Language mapping: Converts browser locale codes to Whisper language codes
  • Timestamps: Returns start/end times for each segment
  • Chunks: Provides granular transcription segments for SRT export
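The locale-to-Whisper mapping can be pulled out into a helper with a generic fallback for locales not in the table (`toWhisperLang` is an illustrative extraction of the inline `langMap`):

```typescript
const LANG_MAP: Record<string, string> = {
  "en-US": "en", "en-GB": "en", "zh-CN": "zh", "zh-TW": "zh",
  "ja-JP": "ja", "ko-KR": "ko", "es-ES": "es", "fr-FR": "fr",
  "de-DE": "de", "pt-BR": "pt", "ru-RU": "ru",
};

// Map a BCP-47 locale to a Whisper language code; for unknown locales,
// fall back to the primary language subtag, then to English
function toWhisperLang(locale: string): string {
  return LANG_MAP[locale] ?? (locale.split("-")[0].toLowerCase() || "en");
}
```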

Generating SRT Subtitles

One powerful feature is exporting transcriptions as SRT subtitle files:

// Format seconds to SRT time format: HH:MM:SS,mmm
const formatSrtTime = (seconds: number): string => {
  const hrs = Math.floor(seconds / 3600);
  const mins = Math.floor((seconds % 3600) / 60);
  const secs = Math.floor(seconds % 60);
  const ms = Math.floor((seconds % 1) * 1000);

  return `${hrs.toString().padStart(2, '0')}:${mins.toString().padStart(2, '0')}:${secs.toString().padStart(2, '0')},${ms.toString().padStart(3, '0')}`;
};

// Generate SRT content from chunks
const generateSrtContent = (chunks: Array<{timestamp: [number, number], text: string}>): string => {
  return chunks.map((chunk, index) => {
    const [start, end] = chunk.timestamp;
    return `${index + 1}\n${formatSrtTime(start)} --> ${formatSrtTime(end)}\n${chunk.text.trim()}\n`;
  }).join('\n');
};

// Download SRT file
const downloadSrt = () => {
  if (transcriptionChunks.length === 0) return;

  const srtContent = generateSrtContent(transcriptionChunks);
  const blob = new Blob([srtContent], { type: 'text/plain;charset=utf-8' });
  const url = URL.createObjectURL(blob);

  const a = document.createElement('a');
  a.href = url;
  a.download = 'transcription.srt';
  document.body.appendChild(a);
  a.click();
  document.body.removeChild(a);
  URL.revokeObjectURL(url);
};

This converts timestamped chunks into standard SRT format for use in video players.
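As a quick sanity check, here is what a single chunk turns into (`formatSrtTime` is reproduced so the snippet runs standalone):

```typescript
// Format seconds to SRT time format: HH:MM:SS,mmm (same as above)
const formatSrtTime = (seconds: number): string => {
  const hrs = Math.floor(seconds / 3600);
  const mins = Math.floor((seconds % 3600) / 60);
  const secs = Math.floor(seconds % 60);
  const ms = Math.floor((seconds % 1) * 1000);
  return `${hrs.toString().padStart(2, "0")}:${mins.toString().padStart(2, "0")}:${secs.toString().padStart(2, "0")},${ms.toString().padStart(3, "0")}`;
};

// One Whisper chunk becomes one numbered SRT cue
const chunk = { timestamp: [0, 2.5] as [number, number], text: " Hello world" };
const entry = `1\n${formatSrtTime(chunk.timestamp[0])} --> ${formatSrtTime(chunk.timestamp[1])}\n${chunk.text.trim()}\n`;
// entry === "1\n00:00:00,000 --> 00:00:02,500\nHello world\n"
```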

Transformers.js Configuration

We configure Transformers.js to use remote models:

// Configure transformers.js via its exported env object
import { env } from '@xenova/transformers';

// Load models from the Hugging Face Hub rather than from local/bundled files
env.allowLocalModels = false;
env.allowRemoteModels = true;

This ensures models are loaded from Hugging Face's CDN rather than being bundled.

Performance Considerations

Model Size Trade-offs

We use whisper-tiny (~75MB) for faster downloads, but you can use larger models for better accuracy:

  • tiny: ~75MB, fastest, good accuracy
  • base: ~150MB, balanced
  • small: ~500MB, better accuracy
  • medium: ~1.5GB, best accuracy
  • large: ~3GB, state-of-the-art

File Size Limits

We limit uploads to 10MB for browser processing:

if (file.size > 10 * 1024 * 1024) {
  setRecognitionError("File too large. Maximum size is 10MB for browser processing.");
  return;
}

This prevents memory issues and ensures reasonable processing times.
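Wrapping the check in a helper keeps the limit in one place and gives a friendlier message (a sketch; the 10MB constant matches the snippet above, while `validateAudioSize` itself is hypothetical):

```typescript
const MAX_AUDIO_BYTES = 10 * 1024 * 1024; // 10MB browser-processing limit

// Return an error message for oversized files, or null if the size is fine
function validateAudioSize(bytes: number): string | null {
  if (bytes > MAX_AUDIO_BYTES) {
    const mb = (bytes / (1024 * 1024)).toFixed(1);
    return `File is ${mb}MB; maximum size is 10MB for browser processing.`;
  }
  return null;
}
```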

Memory Management

We clean up object URLs to prevent memory leaks:

const clearText = () => {
  setText("");
  setInterimText("");
  finalTextRef.current = "";
  if (audioUrl) {
    URL.revokeObjectURL(audioUrl);
  }
  setAudioFile(null);
  setAudioUrl(null);
  setRecognitionError("");
  setTranscriptionChunks([]);
};

Browser Compatibility

Our STT system works in modern browsers:

  • Chrome/Edge: Full support (Web Speech API + Whisper)
  • Firefox: Partial support (Whisper only; the SpeechRecognition API is not enabled by default)
  • Safari: Partial support (Whisper only, limited Web Speech API)

Required APIs:

  • SpeechRecognition or webkitSpeechRecognition: For real-time mode
  • AudioContext: For audio decoding
  • WebAssembly: For ONNX Runtime

Try It Yourself

Ready to transcribe your audio? Visit our free online speech-to-text tool and give it a try. All processing happens locally - your audio files never leave your device.

Conclusion

Building a browser-based speech-to-text system demonstrates the power of modern AI in the browser:

  1. Dual approach flexibility: Web Speech API for speed, Whisper for accuracy
  2. Privacy by design: Local processing keeps sensitive audio private
  3. Multi-language support: 11 languages with Whisper's universal model
  4. Export options: Plain text and SRT subtitles for versatility

The complete source is available in our repository. Whether you're building accessibility tools, transcription services, or voice interfaces, I hope this guide helps you add speech recognition to your projects.

Happy transcribing! šŸŽ¤šŸ“
