Building a FREE Speech-to-Text App with OpenAI Whisper π€π€
Ever wanted to add speech-to-text to your project but got scared off by API costs? What if I told you there's a way to run OpenAI's Whisper model completely free on your own server?
π Try it live: https://www.devtestmode.com/speech-to-text.html
β οΈ Note: The live demo is for testing purposes only and has usage limits (5 requests per 15 minutes). Want unlimited usage? Read on to learn how to build your own instance, or DM me for implementation help!
π― Why I Built This
I needed a speech-to-text solution that was:
- β 100% FREE - No API costs
- β Privacy-First - Audio never leaves the server
- β Multilingual - Support for 96+ languages
- β High Quality - OpenAI Whisper accuracy
- β Self-Hosted - Complete control
Commercial APIs like OpenAI's Whisper API, Google Speech-to-Text, or AWS Transcribe charge around $0.006 per minute. For a high-traffic app, this adds up quickly!
π‘ Check out my other AI projects:
- Wise Cash AI - AI financial assistant
- Daily AI Collection - Curated AI tools
- N8N Workflows - Automation templates
- AI Prompts Library - Prompt collection
- Gold Copy Trading - Trading insights
ποΈ Architecture Overview
The solution uses a 3-layer architecture:
Browser (Frontend)
    β
Node.js Express API
    β
Python Whisper Service
    β
OpenAI Whisper Model (Local)
Why This Stack?
- Browser - Clean UI with drag-and-drop, handles file validation
- Node.js - Manages uploads, routing, and error handling
- 
Python - Runs faster-whisperlibrary for optimal performance
- Whisper Model - Downloaded once, runs locally forever
π Key Features Implemented
1οΈβ£ Multiple Model Support
Users can choose between different Whisper models based on their needs:
const models = {
    'whisper-tiny': '39M (~75MB) - Very Fast',
    'whisper-base': '74M (~142MB) - Fast',
    'whisper-small': '244M (~466MB) - Recommended',
    'whisper-medium': '769M (~1.5GB) - High Accuracy',
    'whisper-large': '1550M (~2.9GB) - Best Quality'
};
Performance on CPU (Intel i5/Ryzen 5):
- Small model: 2-5x real-time (60s audio = 30s processing)
- With GPU: 20-30x real-time (60s audio = 2-3s processing) π
2οΈβ£ Smart File Upload with Validation
// MIME type validation
const ALLOWED_AUDIO_TYPES = [
    'audio/mpeg', 'audio/wav', 'audio/mp4',
    'audio/webm', 'audio/ogg', 'audio/flac'
];
// Extension validation as fallback
const ALLOWED_EXTENSIONS = [
    '.mp3', '.wav', '.m4a', '.webm', '.ogg', '.flac'
];
function validateFile(file) {
    const validMime = ALLOWED_AUDIO_TYPES.includes(file.type);
    const validExt = ALLOWED_EXTENSIONS.some(ext =>
        file.name.toLowerCase().endsWith(ext)
    );
    return validMime || validExt;
}
Why double validation? Some browsers report incorrect MIME types, so checking both ensures reliability.
3οΈβ£ Automatic Retry with Exponential Backoff
Network issues happen. The app automatically retries failed requests:
const MAX_RETRIES = 3;
const RETRY_DELAY = 2000;
async function transcribeWithRetry(retriesLeft = MAX_RETRIES) {
    try {
        const response = await fetch(API_URL, {
            method: 'POST',
            body: formData,
            signal: currentRequestController.signal
        });
        if (!response.ok && retriesLeft > 0) {
            showError(`Retrying... (${retriesLeft} attempts left)`);
            await delay(RETRY_DELAY);
            return await transcribeWithRetry(retriesLeft - 1);
        }
        return response;
    } catch (error) {
        if (retriesLeft > 0) {
            await delay(RETRY_DELAY);
            return await transcribeWithRetry(retriesLeft - 1);
        }
        throw error;
    }
}
4οΈβ£ Rate Limiting (Frontend)
Prevent abuse with client-side rate limiting:
const RATE_LIMIT_MAX = 5;        // Max requests
const RATE_LIMIT_WINDOW = 15 * 60 * 1000;  // Per 15 minutes
function checkRateLimit() {
    const now = Date.now();
    const rateLimitData = JSON.parse(
        localStorage.getItem('rateLimitData') ||
        '{"requests": [], "count": 0}'
    );
    // Remove expired requests
    rateLimitData.requests = rateLimitData.requests.filter(
        timestamp => now - timestamp < RATE_LIMIT_WINDOW
    );
    if (rateLimitData.requests.length >= RATE_LIMIT_MAX) {
        const oldestRequest = Math.min(...rateLimitData.requests);
        const waitTime = Math.ceil(
            (RATE_LIMIT_WINDOW - (now - oldestRequest)) / 60000
        );
        return { allowed: false, waitMinutes: waitTime };
    }
    return { allowed: true };
}
βοΈ Backend Implementation Highlights
Express Route Handler
// routes/speech-to-text.js
const express = require('express');
const multer = require('multer');
const { spawn } = require('child_process');
const upload = multer({
    dest: 'uploads/',
    limits: { fileSize: 25 * 1024 * 1024 }, // 25MB max
    fileFilter: (req, file, cb) => {
        const allowedMimes = [
            'audio/mpeg', 'audio/wav', 'audio/mp4'
        ];
        cb(null, allowedMimes.includes(file.mimetype));
    }
});
router.post('/transcribe', upload.single('audio'), async (req, res) => {
    const { model = 'small' } = req.body;
    const audioPath = req.file.path;
    const python = spawn('python3', [
        'services/whisper-service.py',
        audioPath,
        model
    ]);
    let output = '';
    python.stdout.on('data', (data) => {
        output += data.toString();
    });
    python.on('close', (code) => {
        // Cleanup temp file
        fs.unlinkSync(audioPath);
        if (code === 0) {
            const result = JSON.parse(output);
            res.json({ success: true, data: result });
        } else {
            res.status(500).json({
                success: false,
                error: 'Transcription failed'
            });
        }
    });
});
Python Whisper Service
# services/whisper-service.py
from faster_whisper import WhisperModel
import sys
import json
def transcribe_audio(audio_path, model_size="small"):
    # Load model (cached after first run)
    model = WhisperModel(
        model_size,
        device="cpu",
        compute_type="int8"
    )
    # Transcribe
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        vad_filter=True  # Voice Activity Detection
    )
    # Combine segments
    text = " ".join([segment.text for segment in segments])
    return {
        "text": text,
        "language": info.language,
        "language_probability": info.language_probability,
        "duration": info.duration
    }
if __name__ == "__main__":
    audio_path = sys.argv[1]
    model_size = sys.argv[2] if len(sys.argv) > 2 else "small"
    result = transcribe_audio(audio_path, model_size)
    print(json.dumps(result))
π Security & Best Practices
1οΈβ£ Content Security Policy
<meta http-equiv="Content-Security-Policy"
      content="default-src 'self';
               script-src 'self' 'unsafe-inline';
               connect-src 'self' https://aiml.devtestmode.com;
               frame-ancestors 'none';">
2οΈβ£ File Validation (Double Layer)
- Client-side: Check MIME type + extension
- Server-side: Validate with multer filter + FFmpeg probe
3οΈβ£ Temporary File Cleanup
// Always cleanup temp files
python.on('close', () => {
    try {
        fs.unlinkSync(audioPath);
    } catch (err) {
        console.error('Cleanup error:', err);
    }
});
4οΈβ£ Rate Limiting (Backend)
const rateLimit = require('express-rate-limit');
const limiter = rateLimit({
    windowMs: 15 * 60 * 1000, // 15 minutes
    max: 10, // 10 requests per IP
    message: 'Too many requests, please try again later'
});
app.use('/api/speech-to-text/', limiter);
βΏ Accessibility Features
The app is built with WCAG 2.1 Level AA compliance:
<!-- Skip navigation -->
<a href="#main-content" class="skip-link">
    Skip to main content
</a>
<!-- Semantic HTML -->
<main id="main-content" role="main">
    <h1>Speech to Text</h1>
    <!-- Content -->
</main>
<!-- ARIA labels -->
<button aria-label="Upload audio file for transcription">
    Upload Audio
</button>
<!-- Error announcements -->
<div role="alert" aria-live="assertive" class="error-message">
    <!-- Dynamic error messages -->
</div>
π Performance Optimization
  
  
  1οΈβ£ Use faster-whisper Instead of OpenAI's Whisper
# OpenAI's original (slower)
pip install openai-whisper
# Faster alternative (4-5x speedup!)
pip install faster-whisper
Why faster-whisper?
- Built on CTranslate2 (optimized inference engine)
- 4-5x faster than original
- Lower memory usage
- Same accuracy
2οΈβ£ Model Caching
Models are downloaded once and cached:
# First run: Downloads model (~466MB for small)
model = WhisperModel("small", device="cpu")
# Subsequent runs: Uses cached model
# Location: ~/.cache/huggingface/hub/
3οΈβ£ GPU Acceleration (Optional)
For production with high traffic:
# Install CUDA version
pip install faster-whisper[cuda]
# Use GPU
model = WhisperModel(
    "small",
    device="cuda",
    compute_type="float16"
)
Performance boost: 10-20x faster! π
π¨ UI/UX Highlights
Drag-and-Drop with Visual Feedback
dropZone.addEventListener('dragover', (e) => {
    e.preventDefault();
    dropZone.classList.add('dragover');
});
dropZone.addEventListener('drop', (e) => {
    e.preventDefault();
    dropZone.classList.remove('dragover');
    const files = e.dataTransfer.files;
    if (files.length > 0) {
        handleFileSelect(files[0]);
    }
});
Real-Time Progress Updates
function updateProgress(stage) {
    const stages = {
        'uploading': 'Uploading audio file...',
        'processing': 'Transcribing with Whisper AI...',
        'complete': 'Transcription complete!'
    };
    statusDiv.textContent = stages[stage];
    progressBar.className = `progress-${stage}`;
}
π SEO & Structured Data
Implemented Schema.org structured data for better discoverability:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebApplication",
  "name": "Free Speech to Text Converter",
  "description": "Free local Speech-to-Text using OpenAI Whisper",
  "url": "https://www.devtestmode.com/speech-to-text.html",
  "applicationCategory": "MultimediaApplication",
  "operatingSystem": "Any",
  "offers": {
    "@type": "Offer",
    "price": "0",
    "priceCurrency": "USD"
  },
  "featureList": [
    "96+ language support",
    "Local processing",
    "Privacy-first"
  ]
}
</script>
β οΈ Live Demo Limitations
The public demo at devtestmode.com/speech-to-text.html is provided for testing and evaluation purposes only.
Current Limits:
- π 5 requests per 15 minutes per user (client-side rate limiting)
- π¦ 25MB maximum file size
- π― Best effort availability (may be down for maintenance)
Why these limits?
- Prevent server overload
- Fair usage for all testers
- Encourage self-hosting for production needs
Want unlimited usage? Learn how to build your own instance from this article, or reach out to me directly for implementation assistance! π¬
π How to Build Your Own
Want to implement this yourself? Here's what you'll need:
Prerequisites
# Install FFmpeg (required for audio processing)
sudo apt install ffmpeg  # Ubuntu/Debian
brew install ffmpeg      # macOS
# Install Python dependencies
pip3 install faster-whisper
Implementation Overview
Based on the code highlights above, you'll need to:
- Set up Express server with multer for file uploads
- Create Python service using faster-whisper library
- Build frontend with drag-and-drop file handling
- Implement security (CSP, rate limiting, validation)
- Add accessibility features (ARIA, semantic HTML)
π§ Need help implementing? Feel free to reach out to me on LinkedIn for guidance, or read through all the code examples in this article to build it yourself!
Testing Your Implementation
π‘ Tip: Once you've built your own instance, test it locally:
curl -X POST http://localhost:3001/api/speech-to-text/transcribe \
  -F "audio=@sample.mp3" \
  -F "model=whisper-small"
π‘ Lessons Learned
1οΈβ£ Always Implement Retry Logic
Network issues are common. Auto-retry improved success rate by 40%.
2οΈβ£ Double File Validation is Essential
Browser MIME types are unreliable. Extension checking prevented many edge cases.
3οΈβ£ Rate Limiting on Both Sides
- Frontend: Better UX, immediate feedback
- Backend: True protection against abuse
4οΈβ£ faster-whisper > Original Whisper
The performance difference is massive. Always use faster-whisper for production.
5οΈβ£ Accessibility from Day One
Adding ARIA labels and semantic HTML later is painful. Start accessible!
π° Cost Comparison
| Service | Cost | Privacy | Quality | 
|---|---|---|---|
| This Solution | FREE | β Private | Excellent | 
| OpenAI API | $0.006/min | β Cloud | Excellent | 
| Google Speech-to-Text | $0.006/15s | β Cloud | Good | 
| AWS Transcribe | $0.0004/s | β Cloud | Good | 
For 1000 minutes of audio:
- This solution: $0
- Commercial APIs: $6-24 πΈ
π Resources & Tools
This Project
- Live Demo (Testing Only): devtestmode.com/speech-to-text.html (5 requests per 15 min limit)
- OpenAI Whisper: github.com/openai/whisper
- faster-whisper: github.com/guillaumekln/faster-whisper
My Other Projects
- π° Wise Cash AI - wisecashai.com - AI-powered financial assistant
- π€ Daily AI Collection - dailyaicollection.net - Curated AI tools & resources
- π N8N Workflows - n8n.dailyaicollection.net - Automation workflow templates
- β¨ AI Prompts Library - prompts.dailyaicollection.net - Ready-to-use AI prompts
- π Gold Copy Trading - goldcopytrading.com - Trading insights & strategies
- β Mini AI Projects Collection - github.com/stars/allanninal/lists/mini-ai-projects - Curated list of my AI projects
π€ Support This Project
If you found this helpful, consider:
- π¬ Reach out on LinkedIn for implementation help or collaborations
- π¦ Share this article to help others learn
- β Buy me a coffee: ko-fi.com/dailyaicollection
Your support helps me create more free AI tools for everyone! π
π¬ Conclusion
Building a free, self-hosted speech-to-text service is easier than you think! With OpenAI's Whisper model and the faster-whisper library, you can:
β
 Save hundreds of dollars in API costs
β
 Keep user data private and secure
β
 Support 96+ languages out of the box
β
 Maintain complete control over your infrastructure
Try the demo: https://www.devtestmode.com/speech-to-text.html (testing purposes only - has rate limits)
For production: Use the code examples and architecture from this article to build your own unlimited instance, or contact me for implementation help!
Have questions or suggestions? Drop them in the comments below! π
Tags: #ai #opensource #webdev #python #nodejs #whisper #speechtotext #tutorial #freesoftware
About Me
I'm Allan, a full-stack developer building free AI tools and innovative solutions that anyone can use. Currently working on making AI more accessible through simple, practical applications.
π My Projects
AI & Automation:
- π€ Daily AI Collection - Curated AI tools & resources
- π° Wise Cash AI - AI-powered financial assistant
- π N8N Workflows - Automation templates
- β¨ AI Prompts Library - Ready-to-use prompts
Trading & Finance:
- π Gold Copy Trading - Trading insights & strategies
π Connect With Me
- πΌ LinkedIn: linkedin.com/in/allanninal
- π» GitHub: github.com/allanninal
- π Portfolio: devtestmode.com
- β Support: ko-fi.com/dailyaicollection
Building the next generation of free AI tools - one project at a time! π€β¨
 
 
              
 
    
Top comments (0)