Today was a pretty long day, worked on the real-time transcription app and tested different implementations. Ended up going for a stable-ts + faster-whisper combo, did some testing and it looked good.
Since the goal is to demonstrate the capabilities of Whisper, I made the transcription timeout (effectively the batch size) and the beam size configurable. The performance of faster-whisper in a Google Colab environment was quite surprising: I was able to use a beam size of 5 and still get a near-immediate response with a batch size of 3 seconds.
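To make the batching idea concrete, here's a minimal sketch of how a configurable batch size (in seconds) and beam size could flow through the pipeline. The `batch_audio` and `transcribe_stream` names are mine, and the f-string stands in for a real faster-whisper call like `model.transcribe(chunk, beam_size=beam_size)` — this is an illustration of the batching logic, not the app's actual code.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio

def batch_audio(audio: np.ndarray, batch_seconds: float = 3.0):
    """Split a mono audio buffer into fixed-length batches for near-real-time transcription."""
    step = int(batch_seconds * SAMPLE_RATE)
    return [audio[i:i + step] for i in range(0, len(audio), step)]

def transcribe_stream(audio: np.ndarray, batch_seconds: float = 3.0, beam_size: int = 5):
    # Placeholder for a real faster-whisper call on each chunk.
    results = []
    for chunk in batch_audio(audio, batch_seconds):
        results.append(f"[{len(chunk) / SAMPLE_RATE:.1f}s chunk, beam={beam_size}]")
    return results

# 10 seconds of silence splits into four batches (3 + 3 + 3 + 1 seconds)
print(transcribe_stream(np.zeros(10 * SAMPLE_RATE)))
```

The nice property is that latency and accuracy become two independent knobs: shorter batches mean faster feedback, a larger beam means a more thorough search per batch.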
I still have some accuracy problems with the diarization. The application itself is working fine, so no problems there. I'm planning on exploring the diarization stage further, specifically the speaker-embedding model, to improve the accuracy.
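For context on why the speaker-embedding stage matters: diarization typically compares embedding vectors with cosine similarity and groups close ones as the same speaker, so the embedding quality and the similarity threshold directly drive accuracy. Here's a toy sketch of that idea; `assign_speaker`, the 2-D vectors, and the 0.7 threshold are all made up for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_speaker(embedding: np.ndarray, centroids: list, threshold: float = 0.7) -> int:
    """Return the index of the closest known speaker, or register a new one if none is similar enough."""
    if centroids:
        sims = [cosine_similarity(embedding, c) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return best
    centroids.append(embedding)
    return len(centroids) - 1

bank = []  # running list of known-speaker embeddings
print(assign_speaker(np.array([1.0, 0.0]), bank))  # 0: first speaker
print(assign_speaker(np.array([0.9, 0.1]), bank))  # 0: close to speaker 0
print(assign_speaker(np.array([0.0, 1.0]), bank))  # 1: new speaker
```

With embeddings this crude, a borderline threshold misattributes segments, which is exactly the kind of accuracy problem I'm seeing.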
So far, the focus has been on the transcription part. There is still room for improvement: adding support for other Whisper implementations and alignment models could allow for more accurate timestamps, which would also benefit the diarization part of the app.
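Why better timestamps help diarization: speaker labels are usually attached by overlapping word timestamps with speaker turns, so timestamp drift pushes words into the wrong turn. A small sketch of that overlap-based assignment, with invented dict shapes (`word`/`start`/`end` and `speaker`/`start`/`end`) just for illustration:

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, in seconds (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_words(words: list, speaker_turns: list) -> list:
    """Assign each timestamped word to the speaker turn it overlaps the most."""
    labelled = []
    for w in words:
        best = max(speaker_turns,
                   key=lambda t: overlap(w["start"], w["end"], t["start"], t["end"]))
        labelled.append((w["word"], best["speaker"]))
    return labelled

words = [{"word": "hello", "start": 0.1, "end": 0.4},
         {"word": "there", "start": 2.6, "end": 3.0}]
turns = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5},
         {"speaker": "SPEAKER_01", "start": 2.5, "end": 5.0}]
print(label_words(words, turns))  # [('hello', 'SPEAKER_00'), ('there', 'SPEAKER_01')]
```

If the alignment model shifts "there" half a second earlier, it lands in SPEAKER_00's turn instead, so tighter timestamps translate directly into better speaker attribution.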
Once the diarization pipeline's config parameters are made configurable, the user will also bear some of the responsibility for diarization quality, since performance depends heavily on those values.
The focus will shift to the diarization part of the app for now :)
Overall, the demo is looking quite good for this stage. Still can't believe how long it took me to get here and how many different approaches I took. But super happy with what I learned and will learn, hoping that the next projects will be coming out like hotcakes :)
Planning to make improving this project a weekend side-quest, adding support for more implementations and approaches. The bugs and diarization, of course, are a priority. Tomorrow, if there's no review, I'm changing my focus a bit so that I get a breath of some fresh project air :)
That's it for today, happy coding everyone!