When I was building intervu.dev, an AI interviewing app, I found that the latency between the AI finishing speaking (text-to-speech, TTS) and the mic activating to capture the user’s response was too high.
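This was because the TTS and speech-to-text (STT) WebSocket connections were opened sequentially: the STT connection was only opened after the TTS finished, so its connection setup added directly to the delay.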
I cut this latency in half by opening the STT WebSocket connection as soon as the TTS WebSocket connection was established and the first audio chunk was received.
This “pre-warming” means the STT connection is ready to go the instant the TTS finishes.
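The pre-warming idea can be sketched as a small simulation. This is not the intervu.dev code. It is a minimal illustration that assumes Python `asyncio`, with `asyncio.sleep` standing in for real network I/O. `open_stt_connection`, `tts_audio_chunks` and the timing constants are hypothetical stand-ins. The sketch measures the gap between TTS finishing and the STT connection being ready, once for the sequential flow and once for the pre-warmed flow.

```python
import asyncio
import time

# Hypothetical timings, chosen only to make the effect visible.
STT_CONNECT_SECONDS = 0.2  # simulated STT WebSocket handshake + session setup
TTS_CHUNK_SECONDS = 0.1    # simulated time per TTS audio chunk
TTS_CHUNKS = 5


async def open_stt_connection() -> str:
    """Hypothetical stand-in for opening the STT WebSocket."""
    await asyncio.sleep(STT_CONNECT_SECONDS)
    return "stt-connection"


async def tts_audio_chunks():
    """Hypothetical stand-in for audio chunks arriving over the TTS WebSocket."""
    for i in range(TTS_CHUNKS):
        await asyncio.sleep(TTS_CHUNK_SECONDS)
        yield f"chunk-{i}"


async def sequential_turn() -> float:
    """Sequential flow: open STT only after TTS playback has finished."""
    async for _chunk in tts_audio_chunks():
        pass  # play the chunk
    tts_done = time.perf_counter()
    await open_stt_connection()  # handshake sits on the critical path
    return time.perf_counter() - tts_done


async def prewarmed_turn() -> float:
    """Pre-warmed flow: start opening STT when the first TTS chunk arrives."""
    stt_task = None
    async for _chunk in tts_audio_chunks():
        if stt_task is None:
            # Start the STT handshake in the background while playback continues.
            stt_task = asyncio.create_task(open_stt_connection())
    if stt_task is None:
        # No TTS audio at all: fall back to opening STT now.
        stt_task = asyncio.create_task(open_stt_connection())
    tts_done = time.perf_counter()
    await stt_task  # typically already finished, since it overlapped playback
    return time.perf_counter() - tts_done


async def main() -> tuple[float, float]:
    return await sequential_turn(), await prewarmed_turn()


if __name__ == "__main__":
    seq, pre = asyncio.run(main())
    print(f"sequential gap: {seq * 1000:.0f} ms, pre-warmed gap: {pre * 1000:.0f} ms")
```

With these made-up timings, the sequential gap should come out close to `STT_CONNECT_SECONDS`. The pre-warmed gap should be near zero, because the handshake overlapped with playback. One trade-off is that the STT socket sits open but idle while the AI speaks. If a provider drops idle connections, it may need a keep-alive.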
I wrote about the full build here.