Today was a day full of experimenting with ChatGPT prompts and thinking up approaches for refining the transcription in my real-time transcription app, in real time.
The biggest challenge was "when?"
After experimenting with a lot of different prompts, I realized that I might have been going too far. The goal is to get something basic working that I can improve later on; what's important right now is getting a demo out.
There are a lot of complications with refining in real time, since there can be a lack of context and words the ASR missed. There are a thousand scenarios where things can go wrong, and my head probably only went through 1% of them. Even that was a lot. So I chose the simplest approach: "When the speaker is silent, that's the end of a sentence. That's when you give it to ChatGPT to refine, and it can place a full stop at the end of the transcription with confidence."
So, when working with raw transcriptions, since we're working with one-second buffers, I just check whether there's no voice activity in the current buffer. That's when the refinement happens (it could be smarter, but we're not greedy).
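To make that concrete, here's a rough sketch of the trigger (not my exact code; `has_voice_activity` and `refine_with_chatgpt` are just placeholder names):

```python
# Sketch of the silence-based trigger; has_voice_activity and
# refine_with_chatgpt are placeholder names for illustration.
pending_words = []  # raw words accumulated since the last refinement

def on_buffer(buffer, new_words):
    """Called once per one-second audio buffer with its transcribed words."""
    pending_words.extend(new_words)
    # A silent buffer marks the end of a sentence: hand everything
    # collected so far to ChatGPT and start accumulating again.
    if not has_voice_activity(buffer) and pending_words:
        refined = refine_with_chatgpt(" ".join(pending_words))
        pending_words.clear()
        return refined
    return None
```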
I should note that I purposefully removed all punctuation and capitalization from the transcribed data so as to not confuse our dear language model.
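The stripping itself is nothing fancy; something along these lines does the job (again, just a sketch):

```python
import re

def normalize(text: str) -> str:
    # Drop punctuation and capitalization so the language model
    # gets a clean slate and adds them back itself.
    return re.sub(r"[^\w\s]", "", text).lower()
```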
But we're also working with Speaker Diarization.
So, for that, I refine the transcription when the speaker changes. A change of speaker means the previous speaker is done talking.
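In rough sketch form, the diarization-side trigger looks something like this (placeholder names again):

```python
# Sketch of the speaker-change trigger; refine_with_chatgpt is a placeholder.
current_speaker = None
current_words = []

def on_diarized_segment(speaker_id, words):
    """Called for each diarized segment with its speaker label and words."""
    global current_speaker
    if current_speaker is not None and speaker_id != current_speaker:
        # The previous speaker is done talking: refine everything they said.
        refine_with_chatgpt(current_speaker, " ".join(current_words))
        current_words.clear()
    current_speaker = speaker_id
    current_words.extend(words)
```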
Later on, VAD is an option. But I first need to learn how everything works and recognize the patterns, so it's better to have a starting point that's not the most effective, but functional.
Then another problem arises. False identifications of new speakers happen a lot at the beginning, when the embeddings are fresh and haven't yet collected enough data to improve themselves.
I addressed this before, and I also said that I check for duplicates and remove them. So here's the approach (not token-efficient at all; token efficiency dropped off my list of concerns the moment I said "just functional"):
Keep a transcription for every speaker. Whenever a duplicate is found, add all of its contents to the transcription of the actual speaker, then refine that entire transcription. Now there's more context and probably some clashing words that need to be sorted out. ChatGPT handles this quite well, though my prompt is quite extensive, which I don't know whether to regret or not; the uncertainty of working with it can be killer. The prompt is something I'll probably keep improving and iterating on.
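Roughly, the fold-and-refine step looks like this, assuming each speaker's running transcription lives in a plain dict (the ordering details come from the segment log I describe below):

```python
# Sketch: per-speaker running transcriptions, e.g.
# {"speaker_1": "hey whats up ...", "speaker_2": "man heard you ..."}
transcripts = {}

def merge_duplicate(duplicate_id, actual_id):
    # Fold the duplicate's text into the real speaker's transcription,
    # then refine the whole thing so ChatGPT sees the full context
    # and can sort out any clashing words.
    merged = (transcripts.get(actual_id, "") + " " +
              transcripts.pop(duplicate_id, "")).strip()
    transcripts[actual_id] = refine_with_chatgpt(merged)
```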
So far, my app has kept a log of all the different segments generated by the ASR. I don't know if I'll change this approach; it's quite useful for maintaining order, and here, maintaining order is quite important. Here's an example scenario:
Speaker 1: hey whats up
Speaker 2: man, heard you were looking for
Speaker 1: cinema to go to
Now, when we find out that Speaker 1 and Speaker 2 are the same, we must maintain the order and properly join their dialogues (which can get way longer with more segments). Since their order is already preserved in the segment log we kept, we change the speaker label in every log entry where Speaker 2 is registered and rebuild the transcription.
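As a sketch, assuming the segment log is just an ordered list of (speaker, text) entries, the relabel-and-rebuild step could look like this:

```python
# Sketch, assuming the segment log is an ordered list of (speaker, text) pairs.
def relabel_and_rebuild(segment_log, duplicate_id, actual_id):
    # Relabel the wrongly attributed entries in place; the log's order
    # is the dialogue order, so nothing else has to move.
    for i, (speaker, text) in enumerate(segment_log):
        if speaker == duplicate_id:
            segment_log[i] = (actual_id, text)
    # Rebuild that speaker's full transcription in order and refine it.
    merged = " ".join(text for speaker, text in segment_log if speaker == actual_id)
    return refine_with_chatgpt(merged)
```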
This is a prime example of why I removed the punctuation: it avoids any contextual confusion that might arise for the language model when refining the transcription.
And as you may have noticed, there's a grammatical error: a missing article before "cinema", a big no-no in the language of Shakespeare! This is where the language model comes in.
Just as a side-note, the language model receives data like this:
hey (0.97) man (0.84) whats (0.78)...
The numbers in parentheses are the ASR's confidence scores: when the ASR transcribed each word, it did so with a certain level of confidence, and that's what the number represents.
It's just some more data for the language model; the prompt explicitly states that it shouldn't guess new words, but it can remove or modify words using that metric.
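Formatting that input is trivial, but for completeness, here's roughly what it looks like (assuming the ASR hands me (word, confidence) pairs):

```python
def format_for_prompt(words):
    # words: list of (word, confidence) pairs from the ASR,
    # e.g. [("hey", 0.97), ("man", 0.84), ("whats", 0.78)]
    return " ".join(f"{word} ({conf:.2f})" for word, conf in words)
```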
So, back to the grammatical error: that's also something the language model handles quite well, and it will be great when combining the transcriptions.
But now I lose one thing: I cannot work in segments when combining the transcriptions. I combine the needed segments into one transcription, and in a case like the one above, all of the dialogue turns into one big segment. So we basically lose the timestamp functionality we're provided with. However, that doesn't matter. We don't need the start and end times of individual words; we've already processed the speaker embeddings and everything's fine. If we ever implement timestamp functionality that highlights words during playback of the recording, that will happen after the recording ends.
When the recording ends, I plan to re-process everything and refine it all, comfortably.
Some sacrifices can be made in real-time, but the final one has to be perfect. However, I'm not worrying about this, implementing that is the cherry on top after a demo is out :)
And since we're refining the transcription on every change of speaker, we end up refining the same thing multiple times in scenarios like the one above.
Just a note :)
I mentioned that when transcriptions from different speakers are combined, the individual segments are lost, but this actually happens with every refined transcription. The individual segments are gone; it all becomes one big segment.
But this shouldn't be a problem, since the order is still maintained.
It actually makes for, in my opinion, a better data structure, because when the recording ends, the segment log will look exactly like the transcription itself: everything each speaker says before a change is one big segment.
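To illustrate with made-up data, the log goes from many short ASR segments per turn to one refined entry per speaker turn:

```python
# Made-up illustration: many short ASR segments per speaker turn...
before = [
    ("speaker_1", "so how was"),
    ("speaker_1", "the movie"),
    ("speaker_2", "honestly it was"),
    ("speaker_2", "better than expected"),
]

# ...collapse into one refined entry per turn, so the log
# reads like the transcription itself.
after = [
    ("speaker_1", "So, how was the movie?"),
    ("speaker_2", "Honestly, it was better than expected."),
]
```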
Those are things I realized as I wrote this; my head is clear and the approach is set for working on this tomorrow.
I must admit, I do have one weakness: asynchronous programming. I have a rough idea, but not a complete one, of how to let other code in the transcription thread keep running while ChatGPT works on refining transcriptions. I could also just start a separate thread. What I'm lacking is the theory; I need to learn it to understand how this works, and I'm hoping to learn more about it soon so I'm confident enough to implement it. What I do know is that async propagates up the call chain: the function at the end of the chain, and everything that calls it all the way up, becomes a series of coroutines, if I'm not wrong. So that would mean changing quite a few things.
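From what I understand so far, the asyncio version would look something like this (very much a sketch with placeholder names; pushing the blocking ChatGPT call onto a worker thread with `asyncio.to_thread` is one option):

```python
import asyncio

async def refine_in_background(raw_text):
    # If the ChatGPT call is a regular blocking function, asyncio can
    # run it on a worker thread so the transcription loop keeps going.
    refined = await asyncio.to_thread(refine_with_chatgpt, raw_text)
    handle_refined(refined)  # placeholder for whatever updates the log/UI

async def transcription_loop():
    while recording:                          # placeholder flag
        buffer = await get_next_buffer()      # placeholder coroutine
        words = transcribe(buffer)            # placeholder ASR call
        if should_refine(buffer, words):      # silence or speaker change
            # Fire and forget: the refinement runs concurrently while
            # the loop moves on to the next one-second buffer.
            asyncio.create_task(refine_in_background(" ".join(words)))
```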
Worth looking into some more! Super interested to find a video that dives into the theory rather than the output. The theory is what matters in the more complex uses, in my opinion.
Quite a long one today, but that's it for today!
My appreciation for ChatGPT is immeasurable; doing these things a few years ago would have seemed impossible. ChatGPT opens the door to a lot of beautiful projects.
That's it for today,
Happy coding everyone!