Hey everyone!
Today, I wrote out some of the steps for the personal assistant project I'm working on (preparation is important!) and kept improving my real-time transcription app.
I'm adding VAD (voice activity detection) so that I don't process batches that contain no speech, as these can cause false new-speaker identifications and hallucinations on Whisper's side. We don't want that :)
Using Silero VAD, the results don't look too promising so far, but I'll try working chunk by chunk rather than batch by batch, as it was trained on chunks of 30 ms.
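Here's a rough sketch of what chunk-level filtering could look like. It is not my actual app code, and it does not call Silero: `rms_speech_prob` is a hypothetical energy-based stand-in that you would swap for the VAD model's per-chunk speech probability. The 16 kHz sample rate, the 512-sample chunk size (about 32 ms), and both thresholds are assumptions picked for illustration.

```python
import math
from typing import Callable, List, Sequence

SAMPLE_RATE = 16_000  # assumed sample rate
CHUNK_SAMPLES = 512   # assumed chunk size: ~32 ms at 16 kHz


def split_into_chunks(
    samples: Sequence[float], chunk_samples: int = CHUNK_SAMPLES
) -> List[Sequence[float]]:
    """Cut a batch into fixed-size chunks, dropping the incomplete tail."""
    full = len(samples) // chunk_samples
    return [samples[i * chunk_samples:(i + 1) * chunk_samples] for i in range(full)]


def rms_speech_prob(chunk: Sequence[float]) -> float:
    """Hypothetical stand-in scorer (NOT Silero): RMS energy mapped to [0, 1]."""
    rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
    return min(1.0, rms / 0.1)


def batch_has_speech(
    samples: Sequence[float],
    speech_prob: Callable[[Sequence[float]], float] = rms_speech_prob,
    threshold: float = 0.5,         # assumed per-chunk decision threshold
    min_speech_ratio: float = 0.1,  # assumed share of chunks that must be speech
) -> bool:
    """Score each chunk, then keep the batch only if enough chunks look like speech."""
    chunks = split_into_chunks(samples)
    if not chunks:
        return False
    speech_chunks = sum(1 for c in chunks if speech_prob(c) >= threshold)
    return speech_chunks / len(chunks) >= min_speech_ratio


# One second of silence vs. one second of a loud 220 Hz tone.
silence = [0.0] * SAMPLE_RATE
tone = [0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
```

The idea is that a batch only goes to Whisper when `batch_has_speech(batch)` is true, so silent batches never reach the transcription or speaker-identification steps. Scoring per chunk and then aggregating also leaves room to tune how much speech a batch needs before it counts.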
Happy coding everyone!