This is the first post in a series documenting the technical implementation of a browser-based English learning application with real-time speech processing capabilities.
The Story Behind This Project
When I started building my language learning app, I wanted to create something more engaging than traditional flashcards. The idea was simple: let users practice real conversations through AI-powered role-play scenarios that mirror real-life English and are genuinely useful.
What seemed like a straightforward feature turned into one of the most technically challenging parts of my entire application.
The final result? A pseudo-real-time but cost-friendly conversation system where users can speak naturally with AI characters, get instant feedback, and practice scenarios like ordering in a restaurant or customer service calls. But getting there required solving problems I never expected as a student developer.
Overview: The Complete Audio-to-Conversation Workflow
Our English learning platform implements a sophisticated role-play conversation system that transforms user speech into intelligent AI responses within 2-5 seconds.
Now let me walk through what happens when a user starts a role-play conversation:
- User Interaction → User presses and holds the microphone button
- Audio Capture → Browser records audio using MediaRecorder API
- Audio Processing → Create audio blob and prepare FormData for upload
- Speech Recognition → OpenAI Whisper-1 transcribes audio to text
- Character Response → GPT-4 generates contextual response using character system
- Text-to-Speech → Browser Web Speech API converts response to audio
- Progress Tracking → User progress and usage limits updated in database
Total response time: 2-5 seconds (optimized for conversational feel)
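To make the flow concrete, here's a simplified sketch of how these steps chain together on the client. The endpoint paths match the project's API routes, but the request and response shapes (message, reply, transcript) are illustrative assumptions, not the exact production contract.

// Simplified sketch of the end-to-end client flow (not the production code).
// The /api/stt and /api/enhanced-chat payload shapes here are assumptions.
async function handleRecordingComplete(audioBlob: Blob): Promise<void> {
  // 1. Speech recognition: upload the recording to the STT endpoint
  const formData = new FormData();
  formData.append('audio', audioBlob, 'recording.webm');
  const sttRes = await fetch('/api/stt', { method: 'POST', body: formData });
  const { transcript } = await sttRes.json();

  // 2. Character response: ask the conversation API for a contextual reply
  const chatRes = await fetch('/api/enhanced-chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: transcript }),
  });
  const { reply } = await chatRes.json();

  // 3. Text-to-speech: speak the reply with the browser's Web Speech API
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(reply));
}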
This series takes a closer look at each component of the workflow, starting with the foundation—frontend audio recording and processing. The first post focuses on the technical details of browser-based audio capture, while the following posts will explore the rest of the pipeline that enables real-time AI conversations.
The Challenges We Discovered
What seemed straightforward in our project planning phase revealed layer after layer of unexpected complexity:
Cost Management Reality
API Cost Explosion: Our initial implementation relied on detailed Azure pronunciation analysis, which quickly became prohibitively expensive at scale
Usage Control Strategy: After analyzing token-based and conversation-based limits, we settled on recording-count limits (5 free recordings per day) - simple, predictable, and abuse-resistant.
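A recording-count limit is easy to reason about because it only needs a per-user, per-day counter. The snippet below is a minimal in-memory sketch of that idea; the real usageLimits.js persists the counts in the database and also enforces per-session caps.

// Illustrative recording-count limit (not the exact usageLimits.js logic).
// In production the counts live in the database; a Map stands in for it here.
const FREE_RECORDINGS_PER_DAY = 5;
const dailyCounts = new Map<string, number>(); // key: `${userId}:${date}`

function canRecordToday(userId: string): boolean {
  const today = new Date().toISOString().slice(0, 10); // e.g. "2024-05-01"
  const used = dailyCounts.get(`${userId}:${today}`) ?? 0;
  return used < FREE_RECORDINGS_PER_DAY;
}

function registerRecording(userId: string): void {
  const key = `${userId}:${new Date().toISOString().slice(0, 10)}`;
  dailyCounts.set(key, (dailyCounts.get(key) ?? 0) + 1);
}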
Performance vs. Quality Trade-offs
Response Time Optimization: Detailed Azure analysis took ~15 seconds; we optimized to ~6 seconds while maintaining meaningful feedback
Platform Constraints: Vercel's 10-second serverless function limits forced architectural decisions
TTS Evolution: Switched from OpenAI TTS to the browser's Web Speech API, removing network latency and improving the user experience at the same time.
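Browser TTS is effectively free and instant because nothing leaves the device. The sketch below shows the core of that approach; the exact voice, language, and rate settings used in browserTTS.js may differ from the illustrative values here.

// Minimal browser-side TTS with the Web Speech API
function speakWithBrowserTTS(text: string): void {
  if (!('speechSynthesis' in window)) {
    console.warn('Web Speech API not supported; skipping TTS');
    return;
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'en-US';
  utterance.rate = 0.95; // slightly slower for learners (illustrative value)
  window.speechSynthesis.cancel(); // stop any speech still playing
  window.speechSynthesis.speak(utterance);
}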
Real-World Production Constraints
Browser Compatibility: Built detection and fallback systems for MediaRecorder and Web Audio API support (see the detection sketch after this list)
Abuse Prevention: Implemented session limits and daily caps with real-time usage tracking
Graceful Error Handling: Fallback systems for failed TTS, audio processing errors, and network issues
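As a rough illustration of that detection layer, a capability check like the one below can run before the record button is enabled, so unsupported browsers get a fallback message instead of a runtime error. The exact checks in the real app may differ.

// Illustrative capability check before enabling the record button
function getAudioSupport() {
  const canRecord =
    typeof MediaRecorder !== 'undefined' &&
    !!navigator.mediaDevices?.getUserMedia;
  const canSpeak = 'speechSynthesis' in window;
  return { canRecord, canSpeak };
}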
File Structure & Architecture
components/
├── ConversationInterface.tsx # Main conversation UI & state management
├── AudioRecorder.jsx # Audio recording component
└── SlangLoader.tsx # Loading animations
pages/
├── scenario/[id]/roleplay.js # Role-play page container
└── api/
├── enhanced-chat.ts # GPT conversation API
├── stt.ts # Speech-to-text API
└── azure-analytics-client.ts # Audio analysis API
utils/
├── clientAudioProcessor.ts # Client-side audio conversion
├── browserTTS.js # Text-to-speech utility
└── usageLimits.js # Usage tracking system
lib/
└── characterService.mjs # Dynamic character generation
Technical Implementation
The full workflow is too complex to explain clearly in a single post, so I've split it into separate posts, each covering a different stage. The sequence diagram below shows the slice covered here: capturing audio in the browser and sending it to the speech-to-text API.
sequenceDiagram
    participant Client as Frontend (Browser)
    participant Server as Backend (Server)

    Client->>Client: User records audio
    Client->>Client: Create audioBlob
    Client->>Client: FormData
    Client->>+Server: HTTP POST /api/stt
    Server->>Server: Parse audio file
    Server->>Server: Call OpenAI Whisper
    Server->>-Client: HTTP Response (transcript)
1. Audio Recording Architecture
Our implementation uses the modern Web Audio API with React state management:
import { useState, useRef } from 'react';

// Core state management for audio recording (inside the component)
const [isRecording, setIsRecording] = useState(false);   // mic currently capturing?
const [isProcessing, setIsProcessing] = useState(false); // waiting on transcription / AI reply?
const mediaRecorderRef = useRef(null);                    // active MediaRecorder instance
const audioChunksRef = useRef([]);                        // collected audio chunks
const recordingStartTimeRef = useRef(null);               // timestamp for duration checks
2. MediaRecorder Configuration
We chose WebM with Opus codec as our recording format for optimal compression and quality:
const startRecording = async () => {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    mediaRecorderRef.current = new MediaRecorder(stream, {
      mimeType: 'audio/webm;codecs=opus',
    });

    audioChunksRef.current = [];
    recordingStartTimeRef.current = Date.now();

    // Event handlers for data collection
    mediaRecorderRef.current.ondataavailable = (event) => {
      audioChunksRef.current.push(event.data);
    };

    mediaRecorderRef.current.onstop = () => {
      handleAudioSubmission();
    };

    mediaRecorderRef.current.start();
    setIsRecording(true);
  } catch (error) {
    console.error('Error starting recording:', error);
  }
};
3. Resource Management & Cleanup
const stopRecording = () => {
  if (mediaRecorderRef.current && isRecording) {
    // Stop the MediaRecorder
    mediaRecorderRef.current.stop();

    // Critical: Release microphone permissions
    mediaRecorderRef.current.stream.getTracks().forEach(track => track.stop());

    setIsRecording(false);
  }
};
Explicitly stopping the tracks:
- Stops the “microphone active” light from staying on
- Releases system resources
- Addresses user privacy concerns
- Allows other applications to use the microphone
4. Audio Data Processing
The MediaRecorder API delivers audio in chunks, which we collect and process:
const handleAudioSubmission = async () => {
  if (audioChunksRef.current.length === 0) return;

  // Quality check: Filter out accidental short recordings
  const recordingDuration = Date.now() - recordingStartTimeRef.current;
  if (recordingDuration < 300) {
    // Handle too-short recordings gracefully
    return;
  }

  // Create blob from audio chunks
  const audioBlob = new Blob(audioChunksRef.current, { type: 'audio/webm' });

  // Prepare for backend processing
  const formData = new FormData();
  formData.append('audio', audioBlob, 'recording.webm');

  // Send to backend API
  const response = await fetch('/api/stt', {
    method: 'POST',
    body: formData,
  });
};
Technical Decisions & Trade-offs
When building a speech recognition app that integrates with multiple AI services, one of the biggest decisions is choosing the right audio format. Each API has different requirements, and the wrong choice can impact performance, user experience, and development complexity.
After testing various formats, we chose WebM + Opus as our frontend recording format. Here's why:
Browser Compatibility
- Works on all modern browsers (except some iOS Safari versions)
- Native support in Chrome, Firefox, Edge
- Graceful fallback for older browsers
File Size Optimization
- 10:1 compression ratio compared to WAV
- 30-second recording: ~50KB vs ~500KB
- Faster uploads, especially on mobile networks
Real-time Performance
- Designed for streaming applications
- Low latency encoding
- Minimal CPU usage during recording
Audio Quality
- High quality at low bitrates
- Opus codec optimized for speech
- Maintains clarity and quality for AI processing
Instead of forcing one format for all APIs, we implemented intelligent routing:
// Frontend: Always record as WebM
const mediaRecorder = new MediaRecorder(stream, {
  mimeType: 'audio/webm;codecs=opus',
});

// Backend: Route based on API requirements
if (apiType === 'openai') {
  // Direct WebM support - zero conversion
  return await openai.audio.transcriptions.create({
    file: audioFile, // the uploaded audio file parsed from the request
    model: "whisper-1"
  });
} else if (apiType === 'azure') {
  // Convert to WAV for Azure
  const wavBuffer = await convertWebMToWav(audioData);
  return await azureSTT(wavBuffer);
}
Real-world measurements:
- Recording latency: <50ms (WebM) vs <100ms (WAV)
- File upload time: 3-4x faster with WebM
- API response time: OpenAI 300ms, Azure 800ms
Frontend Recording
// Smart format detection
const getSupportedMimeType = () => {
  const types = [
    'audio/webm;codecs=opus',
    'audio/webm',
    'audio/mp4',
    'audio/wav',
  ];

  for (const type of types) {
    if (MediaRecorder.isTypeSupported(type)) {
      return type;
    }
  }

  return 'audio/webm'; // fallback
};
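Wiring the detected type into the recorder is then a one-line change to the startRecording snippet shown earlier:

// Use the detected MIME type instead of hard-coding WebM/Opus
mediaRecorderRef.current = new MediaRecorder(stream, {
  mimeType: getSupportedMimeType(),
});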
Next Steps
In the next post, I'll cover the backend processing pipeline, including:
- Audio format conversion and file handling
- Integration with OpenAI Whisper for speech-to-text
- The complete /api/stt implementation
- Error handling and performance optimization strategies
The complete audio workflow achieves our target latency of 2-5 seconds from speech capture to AI-generated response, fast enough to preserve a natural conversation flow for language learners.
Coming Up Next: Part 2 - Backend Speech-to-Text Processing with OpenAI Whisper