Today was another vacation day, but I managed to get in 3.5 hours of coding. Today was mostly about figuring out how to enable real-time transcription in my project, which is focused on creating an API that makes real-time transcription with Whisper possible.
Currently, I'm working with WebM blobs as audio chunks, and they're super small. I chose a buffering strategy for the transcription: collect a certain amount of audio chunks, transcribe them, reset the buffer, and repeat. This is how I currently plan to make real-time transcription possible in my API. However, my buffer was getting unexpectedly modified by new chunks arriving while I was processing it, and sometimes while resetting it. In short, a race condition.
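The buffering strategy can be sketched roughly like this. This is a minimal, hypothetical sketch, not my actual code; the class name and batch size are assumptions:

```python
class ChunkBuffer:
    """Collects incoming WebM audio chunks and flushes them in batches.

    Hypothetical sketch: the name and batch_size are assumptions,
    not the project's actual implementation.
    """

    def __init__(self, batch_size=20):
        self.batch_size = batch_size
        self._chunks = []

    def add(self, chunk: bytes):
        # Called whenever a new audio chunk arrives.
        self._chunks.append(chunk)

    def ready(self) -> bool:
        # Enough chunks collected to be worth transcribing?
        return len(self._chunks) >= self.batch_size

    def flush(self) -> bytes:
        # Concatenate everything collected so far and reset the buffer.
        data = b"".join(self._chunks)
        self._chunks = []
        return data
```

The `flush` step is exactly where the trouble starts: if another thread calls `add` between the join and the reset, that chunk is silently lost.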
This is where threading comes into play: we can't let other threads (the ones handling incoming chunks) modify the buffer while it's being processed. At times, we need to keep it exclusive to a single thread.
To solve this, I first created a second buffer to store the audio chunks received while the first buffer is being processed.
Then, following ChatGPT's advice, I created two threads: the main thread and a processing thread in charge of processing the buffer. Granted, it should have worked fine. However, there was unexpected behavior when processing the second buffer, and I couldn't figure out what was wrong; everything was getting a bit too complex. It felt like a complicated approach to a simpler goal, which is something I prefer to avoid.
So I chose to go back to where I started and redo it, but simpler. Currently, the only use of threading in the application is locking the buffer variable to make sure it doesn't get modified mid-processing. I stopped there and did some more research with ChatGPT on how real-time transcription services actually do it. I came to the conclusion that this approach should be fine, but I'll definitely be doing more research and optimizing the strategies and approaches I take.
Tomorrow, the goal is to achieve real-time transcription with the approach I wanted to take today. Should it work, I'll start testing it as an "interested developer" who wants to make use of the API, since I'm currently working on this in conjunction with my React app.
Then, I'll start working on optimizing it.
Following yesterday's "list", I'm in the first stage: figuring out how to approach this in terms of buffering. The idea is clear; the implementation is to be continued. Tomorrow I'll implement the initial approach discussed above, though I don't believe it will be the final one. Right now, I just need what works. So: two buffers, where one is the current one being processed, and the second stores the audio chunks received in the meantime. Should be easy-peasy!
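The two-buffer idea boils down to a swap: incoming chunks always land in the "receiving" list, and the processing loop atomically trades it for an empty one. Again a hypothetical sketch, with made-up names:

```python
import threading

class DoubleBuffer:
    """Two-buffer scheme: new chunks accumulate in the receiving list
    while the swapped-out list is being transcribed.

    Hypothetical sketch; names are assumptions, not the project's code.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._receiving = []

    def add(self, chunk: bytes):
        # Incoming chunks always go to the receiving buffer.
        with self._lock:
            self._receiving.append(chunk)

    def swap(self) -> list:
        # Hand the collected chunks to the processing side and start
        # a fresh receiving buffer in one atomic step.
        with self._lock:
            current, self._receiving = self._receiving, []
        return current
```

A processing loop would then call `swap()`, transcribe the joined bytes, and repeat, while chunks keep flowing in untouched.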
Excited about this project and excited to be learning more and more about handling audio. It's something I've never gotten my hands dirty with before, and I love diving into completely new things like this. Big thanks to ChatGPT for saving me years of googling!
Happy coding everyone!