Hey everyone!
Today, I finalized the first working version of the real-time transcription app I've been working on, and I opened a pull request. It's looking great! I mostly worked on the installation script and made sure the onboarding experience would be smooth for new users. There are still more tests to run in different environments, but it's promising overall.
I'm lucky to be working with the repo's owner, who is fully invested in the project and left me a very detailed code review that I'll start working through tomorrow.
I also worked a bit on the bot that automatically makes reservations using ChatGPT and Twilio. I got synthesized audio generated with Google's TTS service to play during the call. The difficult part was getting the media payload to fit Twilio's requirements, but some quick inspiration from Stack Overflow solved it.
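For context, Twilio Media Streams expects audio as 8 kHz mono G.711 mu-law, base64-encoded inside a JSON "media" message, while TTS services usually hand back linear PCM (with Google's API you can request LINEAR16 at a given sample rate). A minimal sketch of that conversion, assuming the TTS audio is already 8 kHz mono PCM16; the function names are mine, not from the actual project:

```python
import base64
import json
import struct

BIAS = 0x84   # G.711 mu-law bias
CLIP = 32635  # max magnitude before the bias is added

def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, 7..14.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def pcm16_to_twilio_payload(pcm16: bytes) -> str:
    """Base64-encode 8 kHz mono little-endian PCM16 audio as mu-law."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    ulaw = bytes(linear_to_ulaw(s) for s in samples)
    return base64.b64encode(ulaw).decode("ascii")

def twilio_media_message(pcm16: bytes, stream_sid: str) -> str:
    """JSON 'media' message to send back over the Media Streams websocket."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": pcm16_to_twilio_payload(pcm16)},
    })
```

If the TTS output comes back at a different sample rate, it has to be resampled to 8 kHz before this step.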
Now, the more difficult part is implementing VAD to tell when recording of the other party should start. For some reason, RMS- and energy-based approaches didn't work well; I suspect the audio chunks undergo some normalization before Twilio sends them, which could explain it. webrtcvad wasn't too promising either, and the chunks are only 20 ms long. Next I'll try Silero VAD, which has been shown to work quite well with chunks as short as 30 ms at its base level.
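For anyone curious what the energy-based approach looks like, here's a minimal sketch of the idea: compute the RMS of each 20 ms frame and declare speech only after a few consecutive loud frames, to avoid triggering on clicks. This assumes the mu-law chunks have already been decoded to 16-bit PCM, and the threshold and frame counts are made-up illustrative values, not tuned ones (per the post, this approach wasn't reliable on Twilio's normalized audio):

```python
import math
import struct

def frame_rms(pcm16: bytes) -> float:
    """RMS amplitude of a little-endian 16-bit PCM frame."""
    n = len(pcm16) // 2
    samples = struct.unpack(f"<{n}h", pcm16)
    return math.sqrt(sum(s * s for s in samples) / n)

class EnergyVAD:
    """Declare speech after `onset_frames` consecutive frames above threshold."""

    def __init__(self, threshold: float = 500.0, onset_frames: int = 3):
        self.threshold = threshold      # illustrative value, needs tuning
        self.onset_frames = onset_frames
        self._run = 0
        self.speaking = False

    def update(self, pcm16: bytes) -> bool:
        """Feed one 20 ms frame; returns True while speech is detected."""
        if frame_rms(pcm16) > self.threshold:
            self._run += 1
            if self._run >= self.onset_frames:
                self.speaking = True
        else:
            # Simplification: a real VAD would add hangover frames here
            # so brief pauses don't immediately end the segment.
            self._run = 0
            self.speaking = False
        return self.speaking
```

A fixed threshold is exactly what breaks down if the stream is level-normalized upstream, which is why a learned model like Silero VAD is the more promising route.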
That's it for today, hoping for a productive day tomorrow!
Happy coding everyone :)