This is the second post in a series documenting the technical implementation of a browser-based English learning application with real-time speech processing capabilities.
Overview: The STT Pipeline
The complete STT workflow involves:
- Audio Reception → FormData parsing with formidable
- File Validation → WebM format verification and size checks
- Stream Processing → Direct file stream to OpenAI API
- Transcription → Whisper-1 model with Canadian English optimization
- Response Handling → Error management and cleanup
- Integration → Seamless handoff to conversation system
Total processing time: 200-500ms
Technical Stack Summary:
- Primary STT: OpenAI Whisper-1
- File Processing: Formidable + Node.js streams
- Language: TypeScript with Next.js API routes
- Error Handling: Basic try-catch with error logging
- Performance: Stream processing, Node.js runtime
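The handler code later in this post references an ApiResponse type that the original never shows. Here is a minimal sketch of what it likely looks like, so the snippets below type-check; this shape is an assumption, not the actual definition:

// types/api.ts — assumed response shape for /api/stt
export interface ApiResponse {
  success: boolean;
  transcript?: string; // present on success
  error?: string;      // present on failure
}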
The Challenges I Solved
1. File Upload Complexity in Next.js
Problem: Next.js API routes have strict limitations on file uploads, especially with multipart form data.
Solution: A custom formidable-based parser:
import { IncomingForm } from 'formidable';
import type { Fields, Files } from 'formidable';

// Disable Next.js body parsing so formidable can read the raw request stream
export const config = { api: { bodyParser: false } };

// Custom form parsing with formidable, wrapped in a promise
const form = new IncomingForm({
  keepExtensions: true,
});
const formData: [Fields, Files] = await new Promise((resolve, reject) => {
  form.parse(req, (err, fields, files) => {
    if (err) return reject(err);
    resolve([fields, files]);
  });
});
Why this works:
- Bypasses the Next.js 1MB body-size limit
- Handles WebM files up to 25MB (enforceable at parse time; see the sketch below)
- Maintains file metadata and extensions
- Surfaces parse errors through the promise rejection
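Formidable can also enforce the size ceiling during parsing, so oversized uploads fail before any disk or API work. A sketch using its maxFileSize option (available in formidable v2+):

import { IncomingForm } from 'formidable';

// Reject uploads over Whisper's 25MB cap while the form is still being parsed
const form = new IncomingForm({
  keepExtensions: true,
  maxFileSize: 25 * 1024 * 1024, // 25MB — matches OpenAI's Whisper limit
});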
2. Stream Processing for Large Files
Problem: Loading entire audio files into memory causes memory pressure and can crash the server once deployed.
Solution: Direct stream processing to OpenAI API:
import fs, { createReadStream } from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Create a readable stream from the uploaded temp file
const audioPath = audioFile.filepath;
const audioStream = createReadStream(audioPath);

// Stream directly to OpenAI (no memory buffering)
const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "This is a conversation in Canadian English.",
});
Performance Benefits:
- Significantly reduced memory usage through streaming
- Faster processing for large files
- Better reliability and no memory overflow crashes
3. Frontend Audio Validation
Problem: Short audio recordings (< 300ms) are often accidental and waste API calls.
Solution: Early validation on the frontend before anything is sent to the backend:
// Frontend validation before the API call
const recordingDuration = Date.now() - recordingStartTimeRef.current;
if (recordingDuration < 300) {
  // Too short to be intentional: respond locally instead of calling STT
  const clarificationText = getRandomClarification();
  const assistantMessage: Message = { // Message: the app's chat message type
    role: 'assistant',
    content: '',
    isStreaming: true
  };
  setMessages(prevMessages => [...prevMessages, assistantMessage]);
  streamText(clarificationText, messageIndex); // messageIndex: position of the new message
  return; // Don't call the STT API
}

// Only send to the backend if the recording is long enough
const sttResponse = await fetch('/api/stt', {
  method: 'POST',
  body: formData,
});
Results:
- API call reduction: ~15% fewer unnecessary calls
- User experience: Immediate feedback for accidental recordings
- Cost savings: Reduced unwanted OpenAI API usage
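For context, the recordingStartTimeRef used in the validation above is assumed to be a React ref stamped when recording begins; a minimal sketch of that wiring (hypothetical, not the original component):

import { useRef } from 'react';

// Inside the recording component
const recordingStartTimeRef = useRef<number>(0);

const startRecording = async () => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
  recordingStartTimeRef.current = Date.now(); // the 300ms check measures from here
  recorder.start();
};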
4. Canadian English Optimization
Problem: Default Whisper models aren't optimized for Canadian English expressions and pronunciation patterns.
Solution: Custom prompt engineering:
const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "This is a conversation in Canadian English.",
});
Results:
- Better recognition of Canadian expressions
- Improved handling of slang and culture-related expressions
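The prompt can be pushed further by seeding vocabulary Whisper should expect. A hypothetical variant (the terms here are illustrative, not from the original implementation):

// Hypothetical: bias recognition toward Canadian vocabulary via the prompt
const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "A casual Canadian English conversation. Expect words like toque, loonie, toonie, double-double, and eh.",
});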
Core Technical Implementation
1. API Endpoint Architecture
Our main STT endpoint (/api/stt) follows a robust error-handling pattern:
import type { NextApiRequest, NextApiResponse } from 'next';
import { IncomingForm } from 'formidable';
import type { Fields, Files, File } from 'formidable';
import fs, { createReadStream } from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse<ApiResponse>
) {
  if (req.method !== 'POST') {
    return res.status(405).json({ success: false, error: 'Method not allowed' });
  }

  try {
    // Parse form data
    const form = new IncomingForm({ keepExtensions: true });
    const formData: [Fields, Files] = await new Promise((resolve, reject) => {
      form.parse(req, (err, fields, files) => {
        if (err) return reject(err);
        resolve([fields, files]);
      });
    });
    const [fields, files] = formData;

    // Validate audio file
    const audioFiles = files.audio;
    if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
      return res.status(400).json({ success: false, error: 'No audio file provided' });
    }
    const audioFile = audioFiles[0] as File;

    // Process with OpenAI
    const audioPath = audioFile.filepath;
    const audioStream = createReadStream(audioPath);
    const transcription = await openai.audio.transcriptions.create({
      file: audioStream,
      model: "whisper-1",
      language: "en",
      prompt: "This is a conversation in Canadian English.",
    });

    // Cleanup and respond
    fs.unlinkSync(audioPath);
    return res.status(200).json({
      success: true,
      transcript: transcription.text
    });
  } catch (error) {
    console.error('STT Error:', error);
    return res.status(500).json({
      success: false,
      error: error instanceof Error ? error.message : 'Failed to transcribe audio'
    });
  }
}
2. File Validation & Security
// Access the audio file with proper type checking
const audioFiles = files.audio;
if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
  return res.status(400).json({
    success: false,
    error: 'No audio file provided'
  });
}
const audioFile = audioFiles[0] as File;

// Additional validation
if (!audioFile.filepath || audioFile.size === 0) {
  return res.status(400).json({
    success: false,
    error: 'Invalid audio file'
  });
}
Security Measures:
- File type validation (WebM only; sketched below)
- Size limits (25MB max)
- Temporary file cleanup
- No persistent storage
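The presence check above doesn't cover the first two items on that list. A sketch of the type and size validation they imply, assuming formidable v2+, where the parsed file exposes mimetype and size:

// Enforce WebM-only uploads and the 25MB ceiling before any API work
const MAX_SIZE = 25 * 1024 * 1024;
if (audioFile.mimetype !== 'audio/webm') {
  return res.status(400).json({ success: false, error: 'Only WebM audio is accepted' });
}
if (audioFile.size > MAX_SIZE) {
  return res.status(400).json({ success: false, error: 'Audio file exceeds 25MB limit' });
}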
3. Resource Management
// Critical: clean up temporary files
const audioPath = audioFile.filepath;
const audioStream = createReadStream(audioPath);

// Process audio...

// Guard the unlink so a failed cleanup never crashes the request
try {
  fs.unlinkSync(audioPath);
} catch (cleanupError) {
  console.warn('Failed to cleanup temp file:', cleanupError);
}
Resource Management Benefits:
- Disk space: Prevents temp file accumulation
- Security: No persistent audio storage
- Performance: Clean server state
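To make the cleanup truly unconditional, the unlink can live in a finally block so it runs whether transcription succeeds or throws; a sketch:

const audioPath = audioFile.filepath;
try {
  const audioStream = createReadStream(audioPath);
  const transcription = await openai.audio.transcriptions.create({
    file: audioStream,
    model: "whisper-1",
    language: "en",
    prompt: "This is a conversation in Canadian English.",
  });
  return res.status(200).json({ success: true, transcript: transcription.text });
} finally {
  // Runs on success and on failure alike
  try {
    fs.unlinkSync(audioPath);
  } catch (cleanupError) {
    console.warn('Failed to cleanup temp file:', cleanupError);
  }
}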
Performance Optimizations
1. Streaming vs Buffering
Before (Buffering):
import OpenAI, { toFile } from 'openai';

// Load the entire file into memory (anti-pattern for large uploads)
const audioBuffer = fs.readFileSync(audioPath);
const transcription = await openai.audio.transcriptions.create({
  file: await toFile(audioBuffer, 'audio.webm'), // raw Buffers must be wrapped; whole file held in memory
  model: "whisper-1",
});
After (Streaming):
// Stream the file directly
const audioStream = createReadStream(audioPath);
const transcription = await openai.audio.transcriptions.create({
  file: audioStream, // Minimal memory usage
  model: "whisper-1",
});
Results:
- Significantly reduced memory usage through streaming
- Faster processing for large files
- Better support for concurrent requests
Integration with Frontend
The STT API seamlessly integrates with our frontend conversation system:
// Frontend STT call
const sttResponse = await fetch('/api/stt', {
  method: 'POST',
  body: formData,
});
const sttData = await sttResponse.json();

if (!sttData.success) {
  // Handle the error gracefully
  const clarificationText = getRandomClarification();
  // Show clarification message to the user
} else {
  // Continue with the conversation
  const transcript = sttData.transcript;
  // Send to GPT for response generation
}
Error Handling & User Experience
1. Graceful Degradation
// If STT fails, don't break the conversation
if (!sttData.success) {
  const clarificationPhrases = [
    "Sorry, can you repeat that?",
    "Could you say that again please?",
    "I didn't quite get that. Could you repeat?",
  ];
  const randomClarification = clarificationPhrases[
    Math.floor(Math.random() * clarificationPhrases.length)
  ];
  // Continue the conversation with the clarification
}
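The getRandomClarification() helper referenced earlier presumably wraps this same selection; a minimal sketch:

// Assumed implementation of the helper used in the validation and error paths
function getRandomClarification(): string {
  return clarificationPhrases[
    Math.floor(Math.random() * clarificationPhrases.length)
  ];
}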
2. Debugging & Monitoring
// Comprehensive logging for debugging (startTime is captured just before the fetch)
console.log('STT Response:', {
  success: sttData.success,
  transcript: sttData.transcript?.substring(0, 50) + '...',
  processingTime: Date.now() - startTime,
  fileSize: audioFile.size
});
Production Considerations
Rate Limiting
// Implement rate limiting for production (10 requests per minute)
if (requestCount > 10) {
  return res.status(429).json({
    success: false,
    error: 'You\'re speaking too fast! Please wait a moment before trying again.'
  });
}
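The requestCount above is left abstract. A minimal in-memory sliding-window sketch, keyed by IP (illustrative only; a production setup would typically use Redis or gateway-level limiting instead):

// Naive per-IP sliding window: 10 requests per minute
const WINDOW_MS = 60_000;
const LIMIT = 10;
const hits = new Map<string, number[]>();

function isRateLimited(ip: string): boolean {
  const now = Date.now();
  const recent = (hits.get(ip) ?? []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(ip, recent);
  return recent.length > LIMIT;
}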
And on the frontend:

if (response.status === 429) {
  showError("Please wait a moment before recording again");
}
What's Next
In the next post, we'll see how the transcribed text powers the AI conversation system: selecting character personas, crafting prompts for Canadian English, integrating with GPT-4, and keeping conversations flowing naturally.