Hey guys!
Worked some more on the real-time transcription app. I decided to drop multi-client support at this stage, since the transcription calls are not thread-safe and multi-client isn't a priority right now.
What was a priority was understanding why the diarization identifies speakers as "unknown" so frequently.
I tested Diart and found that the diarization is pretty much on point; the problem is the transcriber. It overshoots the start and end times of segments, sometimes exceeding the duration of the stream itself.
This seems to happen more often with the small model than with the large-v2 model, but it happens with both. I started by testing Diart on its own, without any transcription, and then printed out everything that would help me pin down the source of the problem:
- Diarization results
- Time shift: the offset that has to be added to the start time the transcriber reports for a segment, since the diarization timeline covers the entire stream while each transcription call only sees the latest chunk (the stream may already be 120 seconds in, for example); there's a small sketch of this right after the list
- Transcription start and end times
- Transcription start and end times after the time shift
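For reference, here's a minimal sketch of what that time-shift bookkeeping means in code. The `Segment` dataclass, `shift_segment`, and `chunk_offset` names are made up for illustration, not the app's actual code:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds, relative to the transcribed chunk
    end: float
    text: str

def shift_segment(seg: Segment, chunk_offset: float) -> Segment:
    """Convert chunk-relative times to stream-absolute times.

    `chunk_offset` is how far into the stream the current chunk begins,
    e.g. 120.0 if two minutes of audio have already been processed.
    """
    return Segment(seg.start + chunk_offset, seg.end + chunk_offset, seg.text)

# A segment reported at 0.4-1.8 s inside a chunk that starts at 120 s
# should land at 120.4-121.8 s on the diarization timeline.
print(shift_segment(Segment(0.4, 1.8, "hello"), chunk_offset=120.0))
```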
That's when I found the start times can sometimes be 20 seconds, which doesn't make sense since the audio data being transcribed is 2 seconds long.
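I haven't settled on a fix yet, but one direction I'm considering is simply sanity-checking the chunk-relative times before applying the time shift. A rough sketch (the function name and the clamping strategy are just an illustration, not what's in the app yet):

```python
def clamp_times(start: float, end: float, chunk_duration: float) -> tuple[float, float]:
    """Clip chunk-relative times to [0, chunk_duration] before the time shift.

    A reported start of 20 s inside a 2-second chunk is clearly bogus, so this
    keeps it from pushing the shifted segment far past the end of the stream.
    """
    start = min(max(start, 0.0), chunk_duration)
    end = min(max(end, start), chunk_duration)
    return start, end

print(clamp_times(20.0, 22.5, chunk_duration=2.0))  # (2.0, 2.0)
```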
I'm currently exploring ways to fix this, hoping to solve it tomorrow and finally release a working version! If that works out, I'll move on to re-formatting the transcription live and trying out faster Whisper implementations that promise a performance boost of up to 80%.
Happy coding everyone!