Today was a full day of coding: 10 hours of sessions.
Following up on yesterday, I tried to implement the buffering approach I had in mind.
Some context about the app: it aims to improve an existing real-time transcription project that incrementally sends audio to the server in 5-second chunks over plain HTTP.
The only thing the WebSocket approach changes is that it allows constant streaming of audio. But then, what does it really change? We still need to accumulate audio chunks until they add up to some amount of time. We still need to buffer; it's just that this time, the buffering happens on the server.
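To make that concrete, here's a minimal sketch of what server-side buffering over a WebSocket could look like, using the `websockets` library. The `transcribe()` call is a placeholder for the Whisper step, and the 16 kHz, 16-bit mono PCM format is an assumption on my part:

```python
# Sketch: accumulate streamed audio server-side until ~5 seconds
# have arrived, then hand the chunk off for transcription.
import asyncio
import websockets

SAMPLE_RATE = 16_000               # assumed: 16 kHz mono PCM
BYTES_PER_SECOND = SAMPLE_RATE * 2  # 16-bit samples = 2 bytes each
CHUNK_BYTES = 5 * BYTES_PER_SECOND  # ~5 seconds of audio

async def handle_client(websocket):
    buffer = bytearray()
    async for message in websocket:
        buffer.extend(message)
        # Once ~5 seconds of audio has accumulated, process it.
        while len(buffer) >= CHUNK_BYTES:
            audio = bytes(buffer[:CHUNK_BYTES])
            del buffer[:CHUNK_BYTES]
            text = transcribe(audio)  # placeholder for the Whisper call
            await websocket.send(text)

async def main():
    async with websockets.serve(handle_client, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```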
So I decided to ask some great programmers I know for advice, and once I've clarified the goal, I'll finish the implementation. For now, it just felt goal-less, and why work hard without a goal?
After all the frustration leading up to that point, I chose to move on to a more exciting project: adding Speaker Diarization to that existing project, which does real-time transcription with Whisper.
Speaker Diarization means recognizing who is speaking. For example, if you had an audio file of two people talking and ran Speaker Diarization on it, the output would look like this:
Speaker 1: Hey, how are you?
Speaker 2: Great! What about you?
Speaker 2: The weather is quite good today.
Speaker 1: I'm great! And yes, you're right, the San Francisco fog is amazing.
One way to implement this is with the open-source library pyannote, which is what I chose to use.
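For reference, basic usage of pyannote's pretrained pipeline looks roughly like this. The exact model name and token handling are assumptions on my part; the pipeline is gated, so you have to accept its terms on Hugging Face first:

```python
# Rough sketch of running pyannote's pretrained diarization pipeline
# on a single audio file (model name assumed).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # gated model, requires a Hugging Face token
)

diarization = pipeline("conversation.wav")
# Print each speech segment with its speaker label.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}")
```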
The goal was to build a "Diarizator" that does what I described above. However, since a new recording comes in every 5 seconds, I need to remember which speakers have already appeared in the app so that I don't keep creating new speakers.
For that, I implemented speaker embeddings. An embedding is what makes a voice unique: like an ID, but for your voice.
In each recording, I extract the embedding of each speaker, check whether I've already seen an identical or very similar embedding before, and build a dictionary that labels each speech segment with its corresponding speaker. I then return this dictionary, which should be enough to implement Speaker Diarization in a real-time transcription app.
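Here's a minimal sketch of that matching logic. It assumes the embeddings come from somewhere like pyannote's embedding model, and uses cosine similarity with an illustrative threshold that would need tuning:

```python
# Sketch: match a new speaker embedding against known speakers,
# registering a new speaker only when nothing is similar enough.
import numpy as np

SIMILARITY_THRESHOLD = 0.75  # illustrative value, needs tuning
known_speakers: dict[str, np.ndarray] = {}  # label -> embedding

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(embedding: np.ndarray) -> str:
    # Reuse an existing label if this voice is close enough to a known one.
    for label, known in known_speakers.items():
        if cosine_similarity(embedding, known) >= SIMILARITY_THRESHOLD:
            return label
    # Otherwise register a brand new speaker.
    label = f"Speaker {len(known_speakers) + 1}"
    known_speakers[label] = embedding
    return label
```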
I'm not going into detail here at all, but I plan to write a separate post explaining how it works, and it will soon be available on GitHub. I believe it can prove useful in countless cases.
There are a lot of things to improve, and I'll improve them as the cases where it fails present themselves. It should be good enough to work in the app, so that'll be the starting point. I haven't had time to actually modify the app to use it; I'll do that tomorrow.
That's about it for today!
Happy coding everyone!