Today, I worked on improving the accuracy of the speaker diarization in my real-time transcription app. The biggest challenge was duplicates: the same speaker would be identified multiple times as different speakers.
First, I made each speaker's embedding refine itself over time. Every time a speaker is identified, I replace their stored embedding with one whose values are the element-wise averages of the existing embedding and the newly extracted one.
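Here's roughly what that update looks like, as a minimal sketch; the numpy representation and the re-normalization step are my own assumptions, not necessarily how the app stores things:

```python
import numpy as np

def refine_embedding(existing: np.ndarray, new: np.ndarray) -> np.ndarray:
    """Average the stored embedding with a newly extracted one."""
    averaged = (existing + new) / 2.0
    # Re-normalizing keeps cosine comparisons well-behaved (an optional design choice).
    return averaged / np.linalg.norm(averaged)
```

One side effect worth knowing: averaging pairwise on every update weights recent audio exponentially more than old audio (each older sample's influence halves with every update), which is arguably fine, maybe even desirable, for live transcription.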
Then, since each speaker's embedding accumulates more data as time goes on, I was able to detect duplicates.
Every 10 seconds, the app looks for duplicates by comparing every pair of embeddings. When it finds a duplicate, it folds the duplicate's embedding into the existing one (averaging them, as above) and removes the duplicate. It then rewrites the transcription, changing every label that belonged to the duplicate to the label of the actual speaker.
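Here's a sketch of what that sweep could look like. The 0.75 threshold, the dict/list data structures, and the cosine-similarity comparison are assumptions for illustration, not the app's actual values:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.75  # assumed value, not the app's actual threshold

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedupe_speakers(embeddings: dict[str, np.ndarray],
                    transcript: list[dict]) -> None:
    """Compare every pair of speaker embeddings and merge any pair
    whose similarity crosses the threshold."""
    labels = list(embeddings)
    for i, keep in enumerate(labels):
        for dup in labels[i + 1:]:
            # Skip labels that were already merged away earlier in this sweep.
            if keep not in embeddings or dup not in embeddings:
                continue
            if cosine_similarity(embeddings[keep], embeddings[dup]) > SIMILARITY_THRESHOLD:
                # Fold the duplicate into the surviving speaker, then drop it.
                merged = (embeddings[keep] + embeddings[dup]) / 2.0
                embeddings[keep] = merged / np.linalg.norm(merged)
                del embeddings[dup]
                # Relabel past transcript segments to the surviving speaker.
                for segment in transcript:
                    if segment['speaker'] == dup:
                        segment['speaker'] = keep
```

The 10-second cadence is just a timer around a function like this.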
Some challenges still remain: the accuracy could be improved, but I'm wondering whether that's worth pursuing right now, since the app already has a strong base.
To improve the accuracy without putting in too much effort, I think (though I'm not sure) that I could find a large speaker dataset online and train my model on it. Since I'm using deep-speaker, that doesn't seem difficult to do.
The goal would be to extract more general speaker embeddings, so that duplicate checks don't fail because of small nuances in a speaker's voice.
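For reference, this is roughly what embedding extraction looks like with deep-speaker's documented inference usage; the checkpoint filename comes from the project's README and the audio path is a placeholder, and a fine-tuned checkpoint would simply be swapped in here:

```python
import numpy as np
from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel

# Load the pretrained ResCNN; a checkpoint fine-tuned on a larger
# dataset would be loaded the same way.
model = DeepSpeakerModel()
model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

# Turn a chunk of audio into a fixed-size MFCC window, then into an embedding.
mfcc = sample_from_mfcc(read_mfcc('speaker_chunk.wav', SAMPLE_RATE), NUM_FRAMES)
embedding = model.m.predict(np.expand_dims(mfcc, axis=0))[0]
```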
I'm also thinking about where this app could go from here. It could become a fully-fledged product that, in my opinion, would help podcasters and could be used for many transcription tasks, just because of the real-time diarization.
That's about it for today,
Happy coding everyone!