This is the first post in a series documenting the technical implementation of a browser-based English learning application with real-time speech processing capabilities.
The Story Behind This Project
When I started building my language learning app, I wanted to create something more engaging than traditional flashcards. The idea was simple: let users practice real conversations through AI-powered role-play scenarios that mirror real-life English and are genuinely useful.
What seemed like a straightforward feature turned into one of the most technically challenging parts of my entire application.
The final result? A pseudo-real-time but cost-friendly conversation system where users can speak naturally with AI characters, get instant feedback, and practice scenarios like ordering in a restaurant or customer service calls. But getting there required solving problems I never expected as a student developer.
Overview: The Complete Audio-to-Conversation Workflow
Our English learning platform implements a sophisticated role-play conversation system that transforms user speech into intelligent AI responses within 2-5 seconds.
Now let me walk through what happens when a user starts a role-play conversation:
- User Interaction → User presses and holds the microphone button
- Audio Capture → Browser records audio using MediaRecorder API
- Audio Processing → Create audio blob and prepare FormData for upload
- Speech Recognition → OpenAI Whisper-1 transcribes audio to text
- Character Response → GPT-4 generates contextual response using character system
- Text-to-Speech → Browser Web Speech API converts response to audio
- Progress Tracking → User progress and usage limits updated in database
Total response time: 2-5 seconds (optimized for conversational feel)
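To make the flow concrete, here's a simplified sketch of how these steps chain together on the client. The endpoint paths match the project's API routes, but the request and response shapes (message, reply, transcript) are illustrative assumptions, not the exact production contract.

// Simplified sketch of the end-to-end client flow (not the production code).
// The /api/stt and /api/enhanced-chat payload shapes here are assumptions.
async function handleRecordingComplete(audioBlob: Blob): Promise<void> {
  // 1. Speech recognition: upload the recording to the STT endpoint
  const formData = new FormData();
  formData.append('audio', audioBlob, 'recording.webm');
  const sttRes = await fetch('/api/stt', { method: 'POST', body: formData });
  const { transcript } = await sttRes.json();

  // 2. Character response: ask the conversation API for a contextual reply
  const chatRes = await fetch('/api/enhanced-chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: transcript }),
  });
  const { reply } = await chatRes.json();

  // 3. Text-to-speech: speak the reply with the browser's Web Speech API
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(reply));
}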
This series takes a closer look at each component of the workflow, starting with the foundation—frontend audio recording and processing. The first post focuses on the technical details of browser-based audio capture, while the following posts will explore the rest of the pipeline that enables real-time AI conversations.
The Challenges We Discovered
What seemed straightforward in our project planning phase revealed layer after layer of unexpected complexity:
Cost Management Reality
API Cost Explosion: Our initial implementation relied on detailed Azure pronunciation analysis, which quickly became prohibitively expensive at scale
Usage Control Strategy: After analyzing token-based and conversation-based limits, we settled on recording-count limits (5 free recordings per day) - simple, predictable, and abuse-resistant.
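A recording-count limit is easy to reason about because it only needs a per-user, per-day counter. The snippet below is a minimal in-memory sketch of that idea; the real usageLimits.js persists the counts in the database and also enforces per-session caps.

// Illustrative recording-count limit (not the exact usageLimits.js logic).
// In production the counts live in the database; a Map stands in for it here.
const FREE_RECORDINGS_PER_DAY = 5;
const dailyCounts = new Map<string, number>(); // key: `${userId}:${date}`

function canRecordToday(userId: string): boolean {
  const today = new Date().toISOString().slice(0, 10); // e.g. "2024-05-01"
  const used = dailyCounts.get(`${userId}:${today}`) ?? 0;
  return used < FREE_RECORDINGS_PER_DAY;
}

function registerRecording(userId: string): void {
  const key = `${userId}:${new Date().toISOString().slice(0, 10)}`;
  dailyCounts.set(key, (dailyCounts.get(key) ?? 0) + 1);
}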
Performance vs. Quality Trade-offs
Response Time Optimization: Detailed Azure analysis took ~15 seconds; we optimized to ~6 seconds while maintaining meaningful feedback
Platform Constraints: Vercel's 10-second serverless function limits forced architectural decisions
TTS Evolution: Switched from OpenAI TTS to the browser's Web Speech API, removing network latency and improving the user experience at the same time.
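Browser TTS is effectively free and instant because nothing leaves the device. The sketch below shows the core of that approach; the exact voice, language, and rate settings used in browserTTS.js may differ from the illustrative values here.

// Minimal browser-side TTS with the Web Speech API
function speakWithBrowserTTS(text: string): void {
  if (!('speechSynthesis' in window)) {
    console.warn('Web Speech API not supported; skipping TTS');
    return;
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'en-US';
  utterance.rate = 0.95; // slightly slower for learners (illustrative value)
  window.speechSynthesis.cancel(); // stop any speech still playing
  window.speechSynthesis.speak(utterance);
}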
Real-World Production Constraints
Browser Compatibility: Built detection and fallback systems for MediaRecorder and Web Audio API support (see the detection sketch after this list)
Abuse Prevention: Implemented session limits and daily caps with real-time usage tracking
Graceful Error Handling: Fallback systems for failed TTS, audio processing errors, and network issues
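As a rough illustration of that detection layer, a capability check like the one below can run before the record button is enabled, so unsupported browsers get a fallback message instead of a runtime error. The exact checks in the real app may differ.

// Illustrative capability check before enabling the record button
function getAudioSupport() {
  const canRecord =
    typeof MediaRecorder !== 'undefined' &&
    !!navigator.mediaDevices?.getUserMedia;
  const canSpeak = 'speechSynthesis' in window;
  return { canRecord, canSpeak };
}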
File Structure & Architecture
components/
├── ConversationInterface.tsx # Main conversation UI & state management
├── AudioRecorder.jsx # Audio recording component
└── SlangLoader.tsx # Loading animations
pages/
├── scenario/[id]/roleplay.js # Role-play page container
└── api/
├── enhanced-chat.ts # GPT conversation API
├── stt.ts # Speech-to-text API
└── azure-analytics-client.ts # Audio analysis API
utils/
├── clientAudioProcessor.ts # Client-side audio conversion
├── browserTTS.js # Text-to-speech utility
└── usageLimits.js # Usage tracking system
lib/
└── characterService.mjs # Dynamic character generation
Technical Implementation
The full workflow is too complex to explain clearly in a single post, so I've split it into separate posts, each covering a different stage. The sequence diagram below shows the slice covered here: capturing audio in the browser and sending it to the speech-to-text API.
sequenceDiagram
    participant Client as Frontend (Browser)
    participant Server as Backend (Server)

    Client->>Client: User records audio
    Client->>Client: Create audioBlob
    Client->>Client: FormData
    Client->>+Server: HTTP POST /api/stt
    Server->>Server: Parse audio file
    Server->>Server: Call OpenAI Whisper
    Server->>-Client: HTTP Response (transcript)
1. Audio Recording Architecture
Our implementation uses the modern Web Audio API with React state management:
import { useState, useRef } from 'react';

// Core state management for audio recording (inside the component)
const [isRecording, setIsRecording] = useState(false);   // mic currently capturing?
const [isProcessing, setIsProcessing] = useState(false); // waiting on transcription / AI reply?
const mediaRecorderRef = useRef(null);                    // active MediaRecorder instance
const audioChunksRef = useRef([]);                        // collected audio chunks
const recordingStartTimeRef = useRef(null);               // timestamp for duration checks
2. MediaRecorder Configuration
We chose WebM with Opus codec as our recording format for optimal compression and quality:
const startRecording = async () => {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    mediaRecorderRef.current = new MediaRecorder(stream, {
      mimeType: 'audio/webm;codecs=opus',
    });

    audioChunksRef.current = [];
    recordingStartTimeRef.current = Date.now();

    // Event handlers for data collection
    mediaRecorderRef.current.ondataavailable = (event) => {
      audioChunksRef.current.push(event.data);
    };

    mediaRecorderRef.current.onstop = () => {
      handleAudioSubmission();
    };

    mediaRecorderRef.current.start();
    setIsRecording(true);
  } catch (error) {
    console.error('Error starting recording:', error);
  }
};
3. Resource Management & Cleanup
const stopRecording = () => {
  if (mediaRecorderRef.current && isRecording) {
    // Stop the MediaRecorder
    mediaRecorderRef.current.stop();

    // Critical: Release microphone permissions
    mediaRecorderRef.current.stream.getTracks().forEach(track => track.stop());

    setIsRecording(false);
  }
};
Explicitly stopping the tracks:
- Stops the “microphone active” light from staying on
- Releases system resources
- Addresses user privacy concerns
- Allows other applications to use the microphone
4. Audio Data Processing
The MediaRecorder API delivers audio in chunks, which we collect and process:
const handleAudioSubmission = async () => {
  if (audioChunksRef.current.length === 0) return;

  // Quality check: Filter out accidental short recordings
  const recordingDuration = Date.now() - recordingStartTimeRef.current;
  if (recordingDuration < 300) {
    // Handle too-short recordings gracefully
    return;
  }

  // Create blob from audio chunks
  const audioBlob = new Blob(audioChunksRef.current, { type: 'audio/webm' });

  // Prepare for backend processing
  const formData = new FormData();
  formData.append('audio', audioBlob, 'recording.webm');

  // Send to backend API
  const response = await fetch('/api/stt', {
    method: 'POST',
    body: formData,
  });
};
Technical Decisions & Trade-offs
When building a speech recognition app that integrates with multiple AI services, one of the biggest decisions is choosing the right audio format. Each API has different requirements, and the wrong choice can impact performance, user experience, and development complexity.
After testing various formats, we chose WebM + Opus as our frontend recording format. Here's why:
Browser Compatibility
- Works on all modern browsers (except some iOS Safari versions)
- Native support in Chrome, Firefox, Edge
- Graceful fallback for older browsers
File Size Optimization
- 10:1 compression ratio compared to WAV
- 30-second recording: ~50KB vs ~500KB
- Faster uploads, especially on mobile networks
Real-time Performance
- Designed for streaming applications
- Low latency encoding
- Minimal CPU usage during recording
Audio Quality
- High quality at low bitrates
- Opus codec optimized for speech
- Maintains clarity and quality for AI processing
Instead of forcing one format for all APIs, we implemented intelligent routing:
// Frontend: Always record as WebM
const mediaRecorder = new MediaRecorder(stream, {
  mimeType: 'audio/webm;codecs=opus',
});

// Backend: Route based on API requirements
if (apiType === 'openai') {
  // Direct WebM support - zero conversion
  return await openai.audio.transcriptions.create({
    file: audioFile, // the uploaded audio file parsed from the request
    model: "whisper-1"
  });
} else if (apiType === 'azure') {
  // Convert to WAV for Azure
  const wavBuffer = await convertWebMToWav(audioData);
  return await azureSTT(wavBuffer);
}
Real-world measurements:
- Recording latency: <50ms (WebM) vs <100ms (WAV)
- File upload time: 3-4x faster with WebM
- API response time: OpenAI 300ms, Azure 800ms
Frontend Recording
// Smart format detection
const getSupportedMimeType = () => {
  const types = [
    'audio/webm;codecs=opus',
    'audio/webm',
    'audio/mp4',
    'audio/wav',
  ];

  for (const type of types) {
    if (MediaRecorder.isTypeSupported(type)) {
      return type;
    }
  }

  return 'audio/webm'; // fallback
};
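Wiring the detected type into the recorder is then a one-line change to the startRecording snippet shown earlier:

// Use the detected MIME type instead of hard-coding WebM/Opus
mediaRecorderRef.current = new MediaRecorder(stream, {
  mimeType: getSupportedMimeType(),
});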
Next Steps
In the next post, I'll cover the backend processing pipeline, including:
- Audio format conversion and file handling
- Integration with OpenAI Whisper for speech-to-text
- The complete /api/stt implementation
- Error handling and performance optimization strategies
The complete audio workflow achieves our target latency of 2-5 seconds from speech capture to AI-generated response, fast enough to preserve a natural conversation flow for language learners.
Coming Up Next: Part 2 - Backend Speech-to-Text Processing with OpenAI Whisper