Hey everyone!
Today, while working on the real-time transcription application I've been building for the longest time now (but which is finally very close to being finished), I added VAD to mitigate hallucinations by skipping silent batches in the stream, and I explored every implementation I could use for the transcription to make sure I end up with the best alignment model.
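For anyone curious what the VAD gating looks like, here's a minimal sketch of the idea, assuming Silero VAD loaded via torch.hub (this isn't my exact code; the chunk handling is simplified):

```python
# Minimal sketch of VAD gating, assuming Silero VAD loaded via torch.hub.
import torch

# Load the Silero VAD model and its helper functions
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, *_) = utils

SAMPLE_RATE = 16000

def is_silent(chunk: torch.Tensor) -> bool:
    """Return True if the VAD finds no speech in this audio chunk."""
    speech = get_speech_timestamps(chunk, model, sampling_rate=SAMPLE_RATE)
    return len(speech) == 0

# In the streaming loop: skip silent batches so Whisper never sees them
# and therefore can't hallucinate text for them.
# if is_silent(batch):
#     continue
```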
The implementation I've been using is the only one that seems to have an "out-of-bounds" issue reported: it hallucinates start/end times and falsely marks the speaker as unknown for a given segment. I couldn't find even a mention of this issue in the other libraries.
I tried using faster-whisper in conjunction with stable-ts for the alignment, to benefit from its speed-up and reduced memory usage (the most important thing), but I ran into a problem loading faster-whisper on my machine.
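For reference, the faster-whisper half of that combo looks roughly like this; this is just a sketch, with the stable-ts refinement step left out and the parameter values being examples:

```python
# Rough sketch of the faster-whisper side of the combo.
from faster_whisper import WhisperModel

# float16 on GPU is where the speed/memory savings over vanilla Whisper come from
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe("batch.wav", word_timestamps=True)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```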
I then tried WhisperX, which uses faster-whisper out of the box, to see how well its alignment model works, but the same loading problem reoccurred. It did work on Google Colab, though, and it looked quite promising. I've also created a script that simulates real-time streaming with pre-defined audio data, which will serve me well when comparing the different transcription models. They may work great when given a generous amount of audio to work with, but the batches here can get as short as 1 second. For now, I'll be testing them with 2-second batches, which I believe puts quite a strain on the alignment (the single most important thing after good transcription quality).
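The simulation script itself is nothing fancy; roughly speaking, it does something like this (a simplified sketch, with `transcribe_batch` standing in for whichever backend is being tested):

```python
# Rough sketch of simulating a real-time stream from a pre-recorded file.
import time
import soundfile as sf

SAMPLE_RATE = 16000
BATCH_SECONDS = 2  # the batch length I'm stress-testing alignment with

def simulate_stream(path, transcribe_batch):
    audio, sr = sf.read(path, dtype="float32")
    assert sr == SAMPLE_RATE, "resample beforehand for this sketch"
    batch_size = BATCH_SECONDS * SAMPLE_RATE
    for start in range(0, len(audio), batch_size):
        batch = audio[start:start + batch_size]
        t0 = time.perf_counter()
        result = transcribe_batch(batch)
        elapsed = time.perf_counter() - t0
        # a batch has to be processed faster than it is "recorded"
        print(f"{elapsed:.2f}s for a {BATCH_SECONDS}s batch -> {result}")
```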
The testing will focus on the "large-v2" model first and then move on to the smaller models at the end. I still haven't tried the stable-ts/faster-whisper combo, but my focus right now is on WhisperX. I had avoided it before because Diart anchors to Python 3.8 while WhisperX anchors to Python 3.10, but if no issues come up during testing, anchoring the project to 3.10 shouldn't be a problem.
I also took a look at whisper-jax, but it'd be better to make it an option rather than the primary transcription mechanism. Its performance is astonishing (judging by the benchmarks) when running on a TPU, for example in a Google Colab environment, and that's where I like to run my server when testing, so it would be amazing to host this on a TPU-based platform. When running locally, though, it's usually either the CPU or the GPU doing the heavy lifting, and on a GPU faster-whisper already provides transcriptions way faster than the original implementation.
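Based on its README, using whisper-jax as an optional backend would look something like this; I haven't wired it in yet, so treat it as an assumption rather than working project code:

```python
# Sketch of what the whisper-jax option could look like, based on its README.
import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline

# compiles on the first call; subsequent calls are where the TPU speed-up shows
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16)

outputs = pipeline("batch.wav", return_timestamps=True)
print(outputs["text"])
```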
I also plan to allow custom speaker embedding "brains" for this project. Not too hard, as the diarization is done using Diart, which allows this out of the box. I'm mentioning it because I wanna test it myself with a SpeechBrain embedding model, later on of course.
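The SpeechBrain side of that would look roughly like the sketch below; how the embedding actually gets plugged into Diart's custom-model hook is left out here, since I haven't tried it yet:

```python
# Sketch of the SpeechBrain side of a custom embedding "brain".
import torch
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

def embed(waveform: torch.Tensor) -> torch.Tensor:
    """Compute a speaker embedding for a (batch, samples) waveform tensor."""
    return classifier.encode_batch(waveform)
```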
And then there are the smaller things, such as being more flexible with different sample rates and audio data of different types.
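By that I mean something along these lines: normalizing whatever comes in to mono float32 at 16 kHz before it touches the models (the library choice below is just an example):

```python
# Sketch of normalizing incoming audio to what the models expect.
import numpy as np
import librosa

TARGET_SR = 16000

def normalize_audio(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    # convert integer PCM to float32 in [-1, 1]
    if np.issubdtype(audio.dtype, np.integer):
        audio = audio.astype(np.float32) / np.iinfo(audio.dtype).max
    # collapse stereo (or more channels) to mono
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    # resample if the source rate differs
    if sample_rate != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=TARGET_SR)
    return audio.astype(np.float32)
```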
Since this project is about demonstrating what's possible with Whisper, the possibilities are endless. I might add the option of using whisper.cpp for generating transcriptions as well, following the same approach I mentioned for whisper-jax, as that would allow for a true demonstration of the speed-ups these implementations provide.
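As a sketch, the whisper.cpp option could be as simple as shelling out to its CLI; the binary path and flags below are assumptions and would need to match the local build:

```python
# Sketch of treating whisper.cpp as one selectable backend via its CLI.
import subprocess

def transcribe_with_whisper_cpp(wav_path: str, model_path: str) -> str:
    # whisper.cpp expects 16 kHz mono WAV input
    result = subprocess.run(
        ["./main", "-m", model_path, "-f", wav_path, "--no-timestamps"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```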
That's it, the project is starting to look promising! I can't believe I started out trying to make my own diarization algorithm and ended up here, after researching so much on the subject and learning so much about Whisper and about how things work in general when running a server with CPU-bound tasks like this. I've been abusing the threading library and I can't say I'm not liking it! I've also gotten quite used to the idea of audio streaming, which I think will serve me well in my next endeavors :)