Today, I took the day a bit easier. I tested all the different approaches I'd used to implement speaker diarization in my real-time transcription app and decided on the one built on the deep-speaker library, since it was the most accurate. There's surely more I could do to push accuracy further, but there's a working demo now that improves on itself as it runs.
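In case it helps anyone, here's a rough sketch of the embedding-matching idea behind that kind of diarization. This isn't deep-speaker's actual API; `assign_speaker` and the `0.65` threshold are hypothetical, and the embedding itself would come from the deep-speaker model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_speaker(embedding: np.ndarray,
                   known: dict[str, np.ndarray],
                   threshold: float = 0.65) -> str:
    """Match an embedding against known speakers, or register a new one.

    `known` maps speaker labels to reference embeddings. The threshold
    is an assumption and would need tuning against real audio.
    """
    best_name, best_score = None, -1.0
    for name, ref in known.items():
        score = cosine_similarity(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        # Running average so the reference keeps improving as more
        # audio from the same speaker arrives.
        known[best_name] = (known[best_name] + embedding) / 2.0
        return best_name
    new_name = f"Speaker {len(known) + 1}"
    known[new_name] = embedding
    return new_name
```

The running-average update is one simple way to get the "improves on itself" behavior: each matched segment nudges the stored reference embedding toward that speaker's true voice.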
I also made it possible for the user to request real-time transcription without diarization, which I put to use right away. I reduced the minimum buffer time (the audio chunks accumulated before transcription kicks in) to one second, and it looks good. A rough sketch of that buffering logic is below.
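This is a simplified version of the idea, assuming 16 kHz mono 16-bit PCM; the `transcribe` callback in the usage note is hypothetical:

```python
SAMPLE_RATE = 16_000                 # assumed: 16 kHz mono
BYTES_PER_SECOND = SAMPLE_RATE * 2   # 16-bit PCM = 2 bytes per sample
MIN_BUFFER_SECONDS = 1.0             # the new, lower minimum buffer time

class ChunkBuffer:
    """Accumulates incoming audio chunks until enough is buffered."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []
        self._size = 0

    def add(self, chunk: bytes) -> bytes | None:
        """Add a chunk; return the buffered audio once >= 1 s is ready."""
        self._chunks.append(chunk)
        self._size += len(chunk)
        if self._size >= MIN_BUFFER_SECONDS * BYTES_PER_SECOND:
            audio = b"".join(self._chunks)
            self._chunks.clear()
            self._size = 0
            return audio
        return None
```

Usage would be something like: `audio = buffer.add(chunk)`, then `transcribe(audio)` whenever `audio` is not `None`.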
By "put to use", I mean that I wired it into my project that makes restaurant reservations over the phone. It still doesn't have the actual calling functionality, but it will connect to the server, stream the audio, and receive real-time transcription while the waiter taking the reservation is speaking. I believe this will reduce latency, but only actual testing will tell. I still have to detect when the waiter starts and stops speaking so that I can finalize the transcription and move on; there's a sketch of that idea after this paragraph too. For now, it looks promising.
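For the start/stop detection, I'm considering something along these lines with the webrtcvad package. The library and its `Vad` / `is_speech` calls are real, but the surrounding glue is just a sketch, and the silence cutoff is an assumption:

```python
import webrtcvad

SAMPLE_RATE = 16_000
FRAME_MS = 30  # webrtcvad only accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)

def speech_boundaries(frames, silence_frames_to_stop=10):
    """Yield ('start', i) / ('stop', i) events from 30 ms PCM frames.

    A 'stop' fires after ~300 ms of continuous silence (10 frames),
    which is a guess and would need tuning for real phone audio.
    """
    speaking = False
    silence = 0
    for i, frame in enumerate(frames):
        if vad.is_speech(frame, SAMPLE_RATE):
            silence = 0
            if not speaking:
                speaking = True
                yield ("start", i)
        elif speaking:
            silence += 1
            if silence >= silence_frames_to_stop:
                speaking = False
                yield ("stop", i)
```

The "stop" event is where I'd finalize the pending transcription before moving on to the next utterance.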
Tomorrow, I'll upload a final demo along with the code for the restaurant project and for the real-time transcription app itself, which needs a lot of refactoring.
That's it for today, a day when I didn't time myself with Pomodoro sessions and just worked at my own pace. That's what taking it easy means to me.
Happy coding everyone!
See y'all tomorrow!