Evie Wang

Building an AI Conversation Practice App: Part 2 - Backend Speech-to-Text Processing with OpenAI Whisper

This is the second post in a series documenting the technical implementation of a browser-based English learning application with real-time speech processing capabilities.

Overview: The STT Pipeline

The complete STT workflow involves:

  1. Audio Reception → FormData parsing with formidable
  2. File Validation → WebM format verification and size checks
  3. Stream Processing → Direct file stream to OpenAI API
  4. Transcription → Whisper-1 model with Canadian English optimization
  5. Response Handling → Error management and cleanup
  6. Integration → Seamless handoff to conversation system

Total processing time: 200-500ms

Technical Stack Summary:

  • Primary STT: OpenAI Whisper-1
  • File Processing: Formidable + Node.js streams
  • Language: TypeScript with Next.js API routes
  • Error Handling: Basic try-catch with error logging
  • Performance: Stream processing, Node.js runtime

The Challenges I Solved

1. File Upload Complexity in Next.js

Problem: Next.js API routes have strict limitations on file uploads, especially with form-data.

Solution: Used a custom formidable-based parser:

// Disable Next.js body parsing
export const config = { api: { bodyParser: false } };

// Custom form parsing with formidable
const form = new IncomingForm({
  keepExtensions: true,
});

const formData: [Fields, Files] = await new Promise((resolve, reject) => {
  form.parse(req, (err, fields, files) => {
    if (err) return reject(err);
    resolve([fields, files]);
  });
});

Why this approach:

  • Bypasses Next.js's default 1MB body size limit
  • Handles WebM files up to 25MB (size cap sketched below)
  • Maintains file metadata and extensions
  • Provides proper error handling
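
The 25MB figure matches Whisper's own per-file upload limit, but formidable won't enforce it unless asked (its default cap is far higher). A sketch of the stricter configuration:

// Reject uploads over 25MB (Whisper's per-file limit) before they reach the API
const form = new IncomingForm({
  keepExtensions: true,
  maxFileSize: 25 * 1024 * 1024, // form.parse() errors out past this size
});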

2. Stream Processing for Large Files

Problem: Loading entire audio files into memory can exhaust memory and cause crashes, especially on memory-constrained deployments.

Solution: Direct stream processing to OpenAI API:

// Create readable stream from uploaded file
const audioPath = audioFile.filepath;
const audioStream = createReadStream(audioPath);

// Stream directly to OpenAI (no memory buffering)
const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "This is a conversation in Canadian English.",
});

Performance Benefits:

  • Significantly reduced memory usage through streaming
  • Faster processing for large files
  • Better reliability and no memory overflow crashes

3. Frontend Audio Validation

Problem: Short audio recordings (< 300 ms) are usually accidental and waste API calls.

Solution: Validate the recording duration on the frontend before sending anything to the backend:

// Frontend validation before API call
const recordingDuration = Date.now() - recordingStartTimeRef.current;

if (recordingDuration < 300) {
  const clarificationText = getRandomClarification();

  const assistantMessage: Message = {
    role: 'assistant',
    content: '',
    isStreaming: true
  };

  const messageIndex = messages.length; // index of the assistant message we just appended
  setMessages(prevMessages => [...prevMessages, assistantMessage]);
  streamText(clarificationText, messageIndex);
  return; // Don't call STT API
}

// Only send to backend if recording is long enough
const sttResponse = await fetch('/api/stt', {
  method: 'POST',
  body: formData,
});

Results:

  • API call reduction: ~15% fewer unnecessary calls
  • User experience: Immediate feedback for accidental recordings
  • Cost savings: Reduced unwanted OpenAI API usage
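
The getRandomClarification() helper used in the snippet above isn't shown; a minimal sketch, reusing the clarification phrases from the error-handling section later in this post:

// Hypothetical helper: pick a random clarification prompt for the assistant
const clarificationPhrases = [
  "Sorry, can you repeat that?",
  "Could you say that again please?",
  "I didn't quite get that. Could you repeat?",
];

const getRandomClarification = (): string =>
  clarificationPhrases[Math.floor(Math.random() * clarificationPhrases.length)];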

4. Canadian English Optimization

Problem: Out of the box, Whisper isn't tuned for Canadian English expressions and pronunciation patterns.

Solution: Custom prompt engineering:

const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "This is a conversation in Canadian English.",
});

Results:

  • Better recognition of Canadian expressions
  • Improved handling of slang and culture-related expressions

Core Technical Implementation

1. API Endpoint Architecture

Our main STT endpoint (/api/stt) follows a robust error-handling pattern:

import type { NextApiRequest, NextApiResponse } from 'next';
import fs, { createReadStream } from 'fs';
import { IncomingForm, type Fields, type Files, type File } from 'formidable';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Disable Next.js body parsing so formidable can read the multipart stream
export const config = { api: { bodyParser: false } };

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse<ApiResponse>
) {
  if (req.method !== 'POST') {
    return res.status(405).json({ success: false, error: 'Method not allowed' });
  }

  try {
    // Parse form data
    const form = new IncomingForm({ keepExtensions: true });
    const formData: [Fields, Files] = await new Promise((resolve, reject) => {
      form.parse(req, (err, fields, files) => {
        if (err) return reject(err);
        resolve([fields, files]);
      });
    });

    const [fields, files] = formData;

    // Validate audio file
    const audioFiles = files.audio;
    if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
      return res.status(400).json({ success: false, error: 'No audio file provided' });
    }

    const audioFile = audioFiles[0] as File;

    // Process with OpenAI
    const audioPath = audioFile.filepath;
    const audioStream = createReadStream(audioPath);

    const transcription = await openai.audio.transcriptions.create({
      file: audioStream,
      model: "whisper-1",
      language: "en",
      prompt: "This is a conversation in Canadian English.",
    });

    // Cleanup and respond
    fs.unlinkSync(audioPath);
    return res.status(200).json({
      success: true,
      transcript: transcription.text
    });

  } catch (error) {
    console.error('STT Error:', error);
    return res.status(500).json({
      success: false,
      error: error instanceof Error ? error.message : 'Failed to transcribe audio'
    });
  }
}
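
The ApiResponse type in the handler signature isn't shown above; a minimal sketch that matches the JSON bodies the handler returns:

// Response contract for /api/stt, mirroring the res.json() calls above
type ApiResponse =
  | { success: true; transcript: string }
  | { success: false; error: string };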

2. File Validation & Security

// Access the audio file with proper type checking
const audioFiles = files.audio;
if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
  return res.status(400).json({ 
    success: false, 
    error: 'No audio file provided' 
  });
}

const audioFile = audioFiles[0] as File;

// Additional validation
if (!audioFile.filepath || audioFile.size === 0) {
  return res.status(400).json({ 
    success: false, 
    error: 'Invalid audio file' 
  });
}

Security Measures:

  • File type validation (WebM only; type check sketched below)
  • Size limits (25MB max)
  • Temporary file cleanup
  • No persistent storage
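
The validation snippet above checks presence and non-zero size; the WebM-only check from the list can lean on formidable's mimetype field. A sketch (note that browsers often send a codec-qualified type such as audio/webm;codecs=opus, hence the prefix match):

// Reject anything that isn't WebM audio before spending an API call on it
if (!audioFile.mimetype || !audioFile.mimetype.startsWith('audio/webm')) {
  return res.status(400).json({
    success: false,
    error: 'Unsupported audio format (expected WebM)'
  });
}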

3. Resource Management

// Critical: Clean up temporary files
const audioPath = audioFile.filepath;
const audioStream = createReadStream(audioPath);

// Process audio...

// Always cleanup, even on error
try {
  fs.unlinkSync(audioPath);
} catch (cleanupError) {
  console.warn('Failed to cleanup temp file:', cleanupError);
}

Resource Management Benefits:

  • Disk space: Prevents temp file accumulation
  • Security: No persistent audio storage
  • Performance: Clean server state
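
In the main handler the unlink only runs on the success path; one way to guarantee cleanup on every path is to wrap the transcription in try/finally. A sketch of the reshaped flow:

// Guarantee temp-file cleanup whether transcription succeeds or throws
const audioPath = audioFile.filepath;
try {
  const transcription = await openai.audio.transcriptions.create({
    file: createReadStream(audioPath),
    model: "whisper-1",
    language: "en",
    prompt: "This is a conversation in Canadian English.",
  });
  return res.status(200).json({ success: true, transcript: transcription.text });
} finally {
  try {
    fs.unlinkSync(audioPath);
  } catch (cleanupError) {
    console.warn('Failed to cleanup temp file:', cleanupError);
  }
}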

Performance Optimizations

1. Streaming vs Buffering

Before (Buffering):

// Load entire file into memory
const audioBuffer = fs.readFileSync(audioPath);
const transcription = await openai.audio.transcriptions.create({
  file: audioBuffer, // Large memory usage
});

After (Streaming):

// Stream file directly
const audioStream = createReadStream(audioPath);
const transcription = await openai.audio.transcriptions.create({
  file: audioStream, // Minimal memory usage
});

Results:

  • Significantly reduced memory usage through streaming
  • Faster processing for large files
  • Better support for concurrent requests

Integration with Frontend

The STT API seamlessly integrates with our frontend conversation system:

// Frontend STT call
const sttResponse = await fetch('/api/stt', {
  method: 'POST',
  body: formData,
});

const sttData = await sttResponse.json();

if (!sttData.success) {
  // Handle error gracefully
  const clarificationText = getRandomClarification();
  // Show clarification message to user
} else {
  // Continue with conversation
  const transcript = sttData.transcript;
  // Send to GPT for response generation
}
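
The formData object sent above is constructed from the recorder's output; a minimal sketch, assuming a MediaRecorder session that produced a WebM audioBlob:

// Package the recording for /api/stt;
// the 'audio' field name must match files.audio on the backend
const formData = new FormData();
formData.append('audio', audioBlob, 'recording.webm');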

Error Handling & User Experience

1. Graceful Degradation

// If STT fails, don't break the conversation
if (!sttData.success) {
  const clarificationPhrases = [
    "Sorry, can you repeat that?",
    "Could you say that again please?",
    "I didn't quite get that. Could you repeat?",
  ];

  const randomClarification = clarificationPhrases[
    Math.floor(Math.random() * clarificationPhrases.length)
  ];

  // Continue conversation with clarification
}

2. Debugging & Monitoring

// Comprehensive logging for debugging
// (startTime is captured just before the transcription request is sent)
console.log('STT Response:', {
  success: sttData.success,
  transcript: sttData.transcript?.substring(0, 50) + '...',
  processingTime: Date.now() - startTime,
  fileSize: audioFile.size
});

Production Considerations

Rate Limiting

// Implement rate limiting for production
// (requestCount is illustrative: requests per client over the last minute)
if (requestCount > 10) { // 10 requests per minute
  return res.status(429).json({
    success: false,
    error: 'You\'re speaking too fast! Please wait a moment before trying again.'
  });
}
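
A minimal in-memory version of that counter might look like the sketch below. It's enough for a single server instance; a shared store such as Redis would be needed once the app scales horizontally. The isRateLimited helper and keying by IP address are assumptions, not part of the production code:

// Hypothetical sliding-window limiter: 10 requests per client per minute
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 10;
const requestLog = new Map<string, number[]>();

function isRateLimited(clientId: string): boolean {
  const now = Date.now();
  // Keep only the timestamps that fall inside the current window
  const recent = (requestLog.get(clientId) ?? []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  requestLog.set(clientId, recent);
  return recent.length > MAX_REQUESTS;
}

// Usage inside the handler, keyed by IP as a rough client identity
const clientId = req.socket.remoteAddress ?? 'unknown';
if (isRateLimited(clientId)) {
  return res.status(429).json({
    success: false,
    error: "You're speaking too fast! Please wait a moment before trying again."
  });
}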

On the frontend, surface the 429 to the user:

if (response.status === 429) {
  // showError: the app's toast/notification helper
  showError("Please wait a moment before recording again");
}

What's Next

In the next post, we'll see how the transcribed text powers our AI conversation system: selecting specific characters, crafting prompts for Canadian English, integrating with GPT-4, and keeping conversations flowing naturally.
