Today was mostly a day of research. I thought that by today, I would already have my real-time transcription app on GitHub. But then another question popped into my mind: what's it worth without accuracy? In its current state, it just spits out output. A real-time transcription application using Google's speech-to-text API by Sahar Mor showcased how Google generates interim results which it then finalizes. This is super powerful, as it allows you to distinguish uncertainty from certainty. When showing real-time transcriptions to clients, this marks the difference between an inaccurate transcription and an accurate transcription in the making.
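For reference, this is roughly what that interim/final flow looks like with Google's streaming API. A minimal sketch, not the article's code; the config values and the audio chunks are placeholder assumptions:

```python
from google.cloud import speech

client = speech.SpeechClient()
streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,  # placeholder sample rate
        language_code="en-US",
    ),
    interim_results=True,  # emit provisional hypotheses before finalizing
)

audio_chunks = [b"..."]  # placeholder: raw PCM chunks from a mic stream
requests = (
    speech.StreamingRecognizeRequest(audio_content=chunk) for chunk in audio_chunks
)

for response in client.streaming_recognize(streaming_config, requests):
    for result in response.results:
        label = "FINAL" if result.is_final else "interim"
        print(label, result.alternatives[0].transcript)
```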
This is something I lacked and just had to implement, so I started researching.
How would I do this? First, confidence scores from the ASR.
However, Whisper does not provide confidence scores natively. It does provide a metric called avg_logprob that one could use to derive a confidence score for an entire segment, but that's not very useful at the word/token level.
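To illustrate, here's a minimal sketch of pulling avg_logprob out of vanilla Whisper and exponentiating it into a rough 0-1 segment confidence ("audio.wav" and the model size are placeholders):

```python
import math

import whisper  # the openai-whisper package

model = whisper.load_model("base")
result = model.transcribe("audio.wav")

for segment in result["segments"]:
    # avg_logprob is the mean log-probability over the segment's tokens,
    # so exp() of it gives a rough 0-1 confidence for the whole segment.
    approx_confidence = math.exp(segment["avg_logprob"])
    print(f"{segment['text'].strip()!r} -> ~{approx_confidence:.2f}")
```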
My focus was getting confidence scores at the word level, so that I could apply more techniques I have yet to research, such as using language models and context to check whether words below a certain confidence threshold actually fit. If they don't, I'll probably go for the best alternative I can find (big big big ?).
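One direction I'm considering, sketched only to make the idea concrete (the model choice and helper function are assumptions, not something I've validated): use a masked language model to ask which words actually fit in a low-confidence slot.

```python
from transformers import pipeline

# Hypothetical helper; distilroberta-base is an arbitrary masked-LM choice.
fill_mask = pipeline("fill-mask", model="distilroberta-base")

def suggest_alternatives(context_before: str, context_after: str, top_k: int = 5):
    """Ask the masked LM which words fit where a low-confidence word sits."""
    masked = f"{context_before} {fill_mask.tokenizer.mask_token} {context_after}"
    return [c["token_str"].strip() for c in fill_mask(masked, top_k=top_k)]

print(suggest_alternatives("See you", "tonight"))  # e.g. ['all', 'again', ...]
```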
After researching the issues and ideas in the Whisper repo, I managed to find some solutions that involved changing Whisper's code, but I didn't trust myself enough to do that.
I then found something else that worked: it did generate confidence scores for each word, but the overall confidence score it produced for each segment had me thinking, and I didn't feel confident using that method.
I came across an implementation of Whisper called Whisper JAX, which is super-duper fast in their tests, and I asked myself, "what if there is some other implementation that offers this natively?" I knew I would have a lot more trust in an implementation that knows its way around the Whisper library. So I found one called faster-whisper, which does exactly what I want! It also runs faster than the original, which never hurts :)
Example of how a word looks with faster-whisper:

```
Word(start=4.44, end=4.66, word=' later', probability=0.6848104596138)
```
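Something like this produces that output; a minimal sketch where "audio.wav" and the model/compute settings are placeholders:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(word)  # e.g. Word(start=4.44, end=4.66, word=' later', probability=0.68)
```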
This is also amazing for highlighting words on playback if I choose to implement it at the end of the real-time transcription process!
So I have the confidence score to work with, and I will start by filtering out words with a low confidence score, probably those scoring less than 0.8.
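A tiny sketch of that first filtering pass (the 0.8 cutoff and the function name are mine, just to make the idea concrete):

```python
LOW_CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff, to be tuned later

def split_by_confidence(words, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Separate a segment's words into confident and uncertain ones."""
    confident = [w for w in words if w.probability >= threshold]
    uncertain = [w for w in words if w.probability < threshold]
    return confident, uncertain
```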
Then, I believe I'll create a separate thread to analyze the interim results and improve them. Each segment will be sent over, and any words with a low confidence score will be analyzed. If there are none, the segment should be good to go! I doubt I'll need to analyze the segment itself if every word passes with a good confidence score.
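A sketch of how that analysis thread could look; the names here are hypothetical and the actual improvement step is stubbed out:

```python
import queue
import threading

segment_queue: queue.Queue = queue.Queue()

def analysis_worker():
    # Runs on its own thread so transcription itself is never blocked.
    while True:
        segment = segment_queue.get()
        if segment is None:  # sentinel to shut the worker down
            break
        uncertain = [w for w in segment.words if w.probability < 0.8]
        if not uncertain:
            continue  # every word passed; the segment is good to go
        for word in uncertain:
            # Placeholder: language-model / context analysis would go here.
            print(f"needs a second look: {word.word!r} (p={word.probability:.2f})")

threading.Thread(target=analysis_worker, daemon=True).start()
```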
I'm going to try not to obsess too much over accuracy and just hope to get something working and get a demo out already.
Tomorrow, I'll do more research on the next approaches. Right now, I'm pretty sure that language models are what's to come.
As for implementing this in the project that makes restaurant reservations in real-time, the waiter taking the reservation is likely to be saying sentences no longer than 4 seconds, so this isn't much more effective than just transcribing everything he said at once. However, it will be quite useful if I want to "be in a meeting" and reply as usual (Apple has a nice new feature in iOS 17 that shows us the way there :)
So tomorrow, the goal is to get interim results in my real-time transcription app!
That's about it for today,
Happy coding everyone!