Today I spent some more time debugging the real-time transcription application I've been working on. I ran into time-alignment problems and I'm still exploring options to fix them. There are also decisions to make around the speaker diarization feature, since misalignments (which usually happen with the smaller models) mean I need to decide what happens when the transcriber can't identify which segments belong to which speaker. My plan: mark the speaker as an unknown speaker, but only if other speakers were detected during the transcription. If no other speakers were detected, it's safe to assume the only one detected is the one who spoke.
Otherwise, if there's even a second of another person speaking, I'll mark the unattributed segment as an Unknown Speaker. I think that's fair enough for now.
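To make the rule concrete, here's a minimal sketch of that fallback logic. The segment shape, the `resolve_speaker` helper, and the `"Unknown Speaker"` label are all my own illustration, not the app's actual data model:

```python
def resolve_speaker(segment, detected_speakers):
    """Assign a speaker to a segment the diarizer couldn't attribute.

    Hypothetical helper: segments are dicts with an optional "speaker" key,
    and detected_speakers is the set of speakers seen anywhere in the audio.
    """
    if segment.get("speaker") is not None:
        return segment["speaker"]
    if len(detected_speakers) == 1:
        # Only one speaker was detected in the whole recording, so it's
        # safe to assume the unattributed segment belongs to them.
        return next(iter(detected_speakers))
    # More than one speaker was detected, so attribution is genuinely
    # ambiguous; fall back to a placeholder label.
    return "Unknown Speaker"


segments = [
    {"text": "Hello there.", "speaker": "SPEAKER_00"},
    {"text": "Hi!", "speaker": None},  # diarizer couldn't attribute this one
]
detected = {s["speaker"] for s in segments if s["speaker"] is not None}
for seg in segments:
    seg["speaker"] = resolve_speaker(seg, detected)
```

With only `SPEAKER_00` detected, the unattributed segment gets assigned to them; add a second speaker to `segments` and it becomes an Unknown Speaker instead.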
I'll also explore faster Whisper implementations; the problem, again, is alignment. WhisperX does perform alignment, but it doesn't easily fit within the project. I'll see if I can perform the alignment step separately, and if not, I'll stick with my current implementation and keep debugging it to see where it excels and where it doesn't.
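For what it's worth, WhisperX does expose its forced-alignment step separately from transcription, so in principle it can run over segments produced by another transcriber. A rough sketch based on the usage shown in the WhisperX README; the file name, language code, and segment contents are placeholders:

```python
import whisperx

device = "cpu"  # or "cuda" if available
audio = whisperx.load_audio("recording.wav")  # placeholder file name

# Segments from any transcriber, as [{"text": ..., "start": ..., "end": ...}]
segments = [{"text": "Hello there.", "start": 0.0, "end": 1.2}]

# Load the alignment model for the language, then realign the segment
# timestamps at word level.
model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned = whisperx.align(segments, model_a, metadata, audio, device,
                         return_char_alignments=False)
print(aligned["segments"])  # word-level timings nested under each segment
```

If that works as a standalone step, it could pair with faster-whisper for transcription while keeping alignment quality; that's what I want to test.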
Hoping to finalize this part of the app and release it.
Happy coding everyone!!