My first Dev.to post about building ToolsOnFire covered the broad overview. This time I want to go deep on one specific tool: the Voice Separator.
**The Problem**
I kept seeing the same requests in podcasting and journalism communities: "I recorded an interview and need to edit just one speaker's audio" or "I need a transcript that shows who said what."

The existing options were either expensive (Descript at $24/month), required desktop software, or didn't actually separate the audio - they just labelled who spoke when.
I wanted to build something that:
- Identifies each speaker in a recording
- Creates separate downloadable audio files per speaker
- Splits background music into its own file
- Produces a timestamped transcript with speaker labels
- Is free to try without creating an account
**The Challenges**
Transcript accuracy is never perfect. This was the biggest reality check. No matter which AI model you use, transcripts will have errors - especially with accents, technical jargon, mumbling, or background noise. I spent a long time chasing 100% accuracy before accepting that even professional human transcribers don't achieve that. The goal became "accurate enough to be useful" rather than perfect.
Speaker misidentification. The AI sometimes assigns the wrong speaker label to short utterances, especially when speakers have similar voices or one person only says a few words. I had to build post-processing logic to smooth out these errors - grouping nearby utterances and correcting obvious misattributions.
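That smoothing pass can be sketched roughly like this. The `Utterance` shape, the duration threshold, and the function name are illustrative assumptions, not the production code:

```typescript
interface Utterance {
  start: number;   // seconds
  end: number;     // seconds
  speaker: string; // diarization label, e.g. "A" or "B"
  text: string;
}

// Relabel very short utterances sandwiched between two utterances from
// the same other speaker - a common diarization glitch. The 1.5s
// threshold is an assumed value for illustration.
function smoothSpeakerLabels(utts: Utterance[], maxShortSec = 1.5): Utterance[] {
  const out = utts.map(u => ({ ...u }));
  for (let i = 1; i < out.length - 1; i++) {
    const cur = out[i];
    const prev = out[i - 1];
    const next = out[i + 1];
    const isShort = cur.end - cur.start <= maxShortSec;
    if (isShort && prev.speaker === next.speaker && cur.speaker !== prev.speaker) {
      cur.speaker = prev.speaker; // obvious misattribution: fold it into the neighbours
    }
  }
  return out;
}
```

Grouping-then-correcting like this trades a little recall (a genuine one-word interjection may get relabelled) for far fewer jarring speaker flips in the transcript.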
Overlapping speech is the hardest problem. When two people talk over each other, basic diarization falls apart. My free tier handles this reasonably well for brief interruptions, but I built a premium tier with a more advanced pipeline specifically for recordings with heavy crosstalk - panel discussions, heated interviews, group meetings.
Audio quality varies wildly. A studio-recorded podcast processes beautifully. A phone call recorded on speakerphone in a noisy cafe is a completely different challenge. I had to set expectations clearly in the UI and add guidance about what makes a good recording for separation.
Processing time and user feedback. Some recordings take 30-60 seconds to process. My initial spinning loader felt broken for anything longer than 10 seconds. I replaced it with a simulated progress bar that moves through phases (uploading... processing... generating results...). Users need to see movement even when I have no real progress data from the AI.
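A phase-based simulated progress bar can be driven purely by elapsed time, since the AI backend reports nothing. This is a minimal sketch; the phase labels mirror the UI copy above, but the durations and percentage caps are made-up values:

```typescript
// Each phase owns a slice of the bar and an assumed typical duration.
// Progress eases toward each phase's cap and holds at 95% until the
// real response arrives - the bar only hits 100% on actual completion.
type Phase = { label: string; upTo: number; durationMs: number };

const PHASES: Phase[] = [
  { label: "uploading...",          upTo: 30, durationMs: 3_000 },
  { label: "processing...",         upTo: 80, durationMs: 30_000 },
  { label: "generating results...", upTo: 95, durationMs: 15_000 },
];

function progressAt(elapsedMs: number): { label: string; percent: number } {
  let base = 0;
  let t = elapsedMs;
  for (const phase of PHASES) {
    if (t < phase.durationMs) {
      const frac = t / phase.durationMs;
      return { label: phase.label, percent: base + frac * (phase.upTo - base) };
    }
    t -= phase.durationMs;
    base = phase.upTo;
  }
  // Slow recordings overshoot the assumed durations: hold at the cap.
  return { label: PHASES[PHASES.length - 1].label, percent: 95 };
}
```

In the UI this would be polled on a timer (e.g. every 250ms) and snapped to 100% when the backend response lands, so long jobs never look frozen.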
**Cost Control**
The AI APIs cost real money per minute of audio processed. Without proper limits, someone could process hours of audio for free. I built a tiered system with minute-based quotas, prepaid credit packs, and usage tracking.
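The core of such a system is a check that runs before any audio is sent to the paid APIs. A rough sketch, where the tier limits, the `User` shape, and `checkQuota` are all illustrative assumptions rather than the real schema:

```typescript
interface User {
  tier: "free" | "pro";
  minutesUsed: number;   // minutes processed this billing period
  creditMinutes: number; // prepaid credit-pack balance
}

// Assumed per-tier monthly allowances, for illustration only.
const TIER_MINUTES: Record<User["tier"], number> = { free: 5, pro: 300 };

// Decide how a request would be paid for: first from the tier quota,
// then from prepaid credits. Returns null if neither covers it, so the
// job is rejected before it costs real API money.
function checkQuota(
  user: User,
  requestedMinutes: number,
): { fromQuota: number; fromCredits: number } | null {
  const quotaLeft = Math.max(0, TIER_MINUTES[user.tier] - user.minutesUsed);
  const fromQuota = Math.min(quotaLeft, requestedMinutes);
  const fromCredits = requestedMinutes - fromQuota;
  if (fromCredits > user.creditMinutes) return null;
  return { fromQuota, fromCredits };
}
```

The important property is that the check is denominated in the same unit the API bills in (minutes of audio), so usage tracking and cost tracking never drift apart.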
**What I Learned**
Free tiers drive conversions. Letting people try with no account was the right call. Most users try it once and leave, but the ones who find it useful come back and upgrade. If I'd required sign-up from the start, most people would never have tried it.
Podcasters are the sweet spot. I built this for a broad audience but podcasters are by far the most engaged users. They record regularly, always need to edit individual speakers, and the time savings are immediate. If I were starting over, I'd market specifically to podcasters from day one.
Managing expectations matters more than improving accuracy. Users who understand the limitations upfront are happy with 90% accuracy. Users who expect perfection are frustrated at 95%. Clear communication about what the tool can and can't do made a bigger difference to satisfaction than any technical improvement.
**Try It**
The Voice Separator is free to try - upload any recording with 2 speakers, up to 5 minutes, no account needed.
I also built a Meeting Recorder, Transcriber and Summarizer for live recording and transcription, and Talkbuoy for AI speech coaching.
Have you built anything with audio or speech processing? I'd love to hear what challenges you ran into.