I do a lot of interviews with subject-matter experts for work. Usually it’s over Teams or Zoom. Sometimes the built-in transcript is missing, locked, or just unusable.
For a while I tried the usual options. Some required subscriptions I didn’t want. Others had weird formatting that meant I spent as much time cleaning up the output as I would have just typing it myself. A few couldn’t handle multiple speakers without turning it into a mess.
At some point I was messing around with Claude Code and thought: why not just build something myself?
That turned into a lot of hours, a bunch of blind alleys, and more tweaking than I expected.
**What actually worked**
Speech recognition has gotten surprisingly good in the last few years. The open source options are solid now. Getting accurate text from clear audio isn’t the hard part anymore.
Speaker diarization was trickier. Figuring out who said what in a conversation is a different problem than just converting speech to text. Getting those two pieces to work together cleanly took more debugging than I’d like to admit.
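The glue between the two pieces is mostly interval arithmetic: the ASR model gives you timestamped text segments, the diarization model gives you timestamped speaker turns, and you label each segment with whichever speaker overlaps it the most. Here's a minimal sketch of that alignment step — the dict shapes and field names are illustrative, not from any particular library:

```python
def overlap(a, b):
    """Length (seconds) of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_speakers(asr_segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    asr_segments: [{"start": float, "end": float, "text": str}, ...]
    turns:        [{"start": float, "end": float, "speaker": str}, ...]
    """
    labeled = []
    for seg in asr_segments:
        best, best_ov = None, 0.0
        for t in turns:
            ov = overlap((seg["start"], seg["end"]), (t["start"], t["end"]))
            if ov > best_ov:
                best, best_ov = t, ov
        labeled.append({**seg, "speaker": best["speaker"] if best else "UNKNOWN"})
    return labeled
```

Real pipelines do this with fuzzier rules (word-level timestamps, splitting segments that straddle a turn boundary), but maximum-overlap assignment is the core idea.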
**The stuff I underestimated**
**Audio quality variation.** A clean studio recording and a laptop mic in a conference room are completely different problems. I spent a lot of time on preprocessing that I didn't plan for.
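One of the simplest preprocessing steps that helps with quiet laptop-mic recordings is peak normalization: remove any DC offset, then scale the signal so its loudest sample sits at a known level before it hits the model. A toy version over raw float samples, purely to show the idea (real code would operate on numpy arrays and add resampling and filtering):

```python
def peak_normalize(samples, target=0.9):
    """Center a float-sample signal around zero, then scale its peak to `target`.

    samples: iterable of floats in roughly [-1.0, 1.0].
    Returns a new list; silent input is returned unchanged.
    """
    samples = list(samples)
    if not samples:
        return samples
    # Remove DC offset (a constant bias some cheap mics introduce).
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    peak = max(abs(s) for s in centered)
    if peak == 0.0:
        return centered  # pure silence, nothing to scale
    scale = target / peak
    return [s * scale for s in centered]
```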
**Output formats.** I originally just wanted plain text. Then I needed SRT for a video project. Then JSON for piping into other tools. Scope creep is real, even on your own projects.
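The nice thing is that once segments carry start/end times and text, every output format is just a serializer over the same list. SRT is the fiddliest of the three because of its `HH:MM:SS,mmm` timestamps and blank-line-separated blocks; here's a sketch (segment fields are illustrative):

```python
import json

def srt_timestamp(seconds):
    """Format seconds as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    """Render [{'start', 'end', 'text', ...}, ...] as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

def to_json(segments):
    """JSON output is just the segment list, verbatim."""
    return json.dumps(segments, indent=2)
```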
**Edge cases with speaker detection.** Two people with similar voices. Someone who talks over someone else. Long pauses where the model isn't sure if it's a new speaker or the same person thinking. These are harder than they sound.
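The long-pause case in particular is often patched in post-processing rather than in the model: if two consecutive turns have the same speaker label and the silence between them is short, treat them as one turn. A toy version of that merge heuristic (the gap threshold and data shape are assumptions, not anyone's canonical values):

```python
def merge_turns(turns, max_gap=1.0):
    """Merge consecutive same-speaker turns separated by a pause <= max_gap seconds.

    turns: [{"start": float, "end": float, "speaker": str}, ...],
           sorted by start time.
    """
    merged = []
    for t in turns:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["speaker"] == t["speaker"]
                and t["start"] - prev["end"] <= max_gap):
            prev["end"] = t["end"]  # extend the previous turn through the pause
        else:
            merged.append(dict(t))
    return merged
```

Overlapping speech and similar voices are genuinely harder; there's no three-line fix for those.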
**Where it’s at now**
Eventually I had something that worked well enough for my own use, so I turned it into a small platform:
It takes audio or video and produces transcripts that are actually usable. Mostly I cared about speaker separation and output that didn’t need a lot of cleanup afterward.
No subscription — just pay per file. I built it that way because that’s what I wanted as a user. I transcribe maybe five to ten recordings a month. Paying $20/month for that felt wrong.
I’m sure there are edge cases I haven’t hit yet. I’m still adjusting things as I run into them. The diarization in particular is something I keep tweaking.
Posting here mostly in case it’s useful to anyone else who runs into the same problem. Or if you’ve built something similar and have thoughts on approaches I should try.