Hey everyone!
Today, I did more research on real-time Speaker Diarization. My app had awful accuracy.
I found Diart, an open-source library that focuses on exactly this and can be paired with Whisper transcriptions.
This is exactly what I'm looking for!
I explored this article: Color Your Captions: Streamlining Live Transcriptions With “diart” and OpenAI’s Whisper
I tried to understand the logic as much as I could by going over it and asking ChatGPT questions.
The code from the article would probably work great as-is, I just wanted to learn more about it and understand what's happening behind the scenes.
Here are some of its advantages:

- Real-time processing out of the box.
- It uses a sliding-window approach, which lets it continuously analyze the audio stream (see the sketch after this list).
- It can handle overlapping speech segments (amazing!!!).
- It's super efficient.
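To make sure I actually understand the moving parts, here's a minimal sketch of what a live diarization pipeline looks like with diart, based on the usage I saw in its README. Treat the exact class names (`SpeakerDiarization`, `MicrophoneAudioSource`, `StreamingInference`, `RTTMWriter`) as my best recollection, they may differ between versions:

```python
# Minimal live diarization sketch with diart (API as I recall it from the README).
# pip install diart  -- it also needs pyannote.audio models and a Hugging Face token.
from diart import SpeakerDiarization
from diart.sources import MicrophoneAudioSource
from diart.inference import StreamingInference
from diart.sinks import RTTMWriter

# The pipeline runs sliding-window diarization over the incoming stream.
pipeline = SpeakerDiarization()

# Audio source: the microphone, read in small chunks.
mic = MicrophoneAudioSource()

# StreamingInference glues the source to the pipeline and processes
# each window as it arrives.
inference = StreamingInference(pipeline, mic, do_plot=False)

# Optionally write the speaker segments to an RTTM file as they come in.
inference.attach_observers(RTTMWriter(mic.uri, "live_session.rttm"))

prediction = inference()  # blocks until the stream ends
```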
I didn't get to experiment with it today; for some reason it didn't work when I tried to stream audio from my microphone. I do want to get some hands-on experience with it before I start connecting it to the client and streaming over WebSockets (which it supports out of the box!!!).
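For when I do get to the client part: if I'm reading the docs right, diart can consume audio over a WebSocket instead of the microphone. A rough sketch, where the `WebSocketAudioSource` name and its arguments are my assumption and should be double-checked against the docs:

```python
# Sketch: swap the microphone for a WebSocket audio source (assumed API, verify against diart docs).
from diart import SpeakerDiarization
from diart.sources import WebSocketAudioSource
from diart.inference import StreamingInference

pipeline = SpeakerDiarization()

# A client (e.g. the browser) streams raw audio to this host/port.
source = WebSocketAudioSource(sample_rate=16000, host="0.0.0.0", port=7007)

inference = StreamingInference(pipeline, source)
prediction = inference()  # runs until the client closes the connection
```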
My previous approach was clearly way too simple, so I have high hopes that this one will solve the speaker diarization accuracy problem I had. I also had the problem of some partial transcriptions being attributed to the wrong speaker.
And probably the most amazing find is how it conditions the model on context. Apparently, when making a transcription call with Whisper, you can provide a buffer containing the previous transcriptions to give it context. Almost like it was made for this use case!
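To make that concrete: with the open-source `whisper` package, `transcribe()` takes an `initial_prompt` argument, so you can feed it the text you've already transcribed and it decodes the new chunk with that context. A small sketch; the rolling-buffer bookkeeping around the call is my own illustration, not from the article:

```python
# Sketch: conditioning Whisper on previously transcribed text.
# initial_prompt is a real parameter of whisper's transcribe();
# the buffer handling here is just illustrative.
import whisper

model = whisper.load_model("base")

context_buffer = ""  # accumulated text from earlier chunks

def transcribe_chunk(audio_chunk_path: str) -> str:
    global context_buffer
    result = model.transcribe(
        audio_chunk_path,
        initial_prompt=context_buffer,  # condition decoding on what came before
    )
    text = result["text"].strip()
    # Keep the buffer from growing without bound (rough cap, my assumption).
    context_buffer = (context_buffer + " " + text)[-1000:]
    return text
```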
I've never felt less efficient than I did today when I found this out. I used ChatGPT for refinement and went so deep into something that was right in front of my eyes!
I gotta say, this project is teaching me so much about doing your research, and I'm glad it does, as Whisper is a relatively new game-changer and I get to experiment a lot with it.
Hoping for amazing developments tomorrow!
Happy coding everyone :)