The 800ms Barrier: Architecting Interruptible Voice Agents (Lessons from Sarvam AI x Swiggy)

#agents #automation #ai #infrastructure

The Signal: The 800ms Latency Barrier
In a research lab, a 3-second delay is an "optimization ticket." In a live call with a hungry customer on the Swiggy app, 3 seconds is a churn event.

The partnership between Sarvam AI and Swiggy moves the "Boss Level" of agentic AI to real-time voice. Most developers build voice agents as a cascaded pipeline: speech-to-text (STT) -> LLM -> text-to-speech (TTS). Each stage waits for the previous one to finish, so the delays add up and the agent feels like a slow walkie-talkie. To build for the next billion users, you have to architect for native audio streaming and sub-second response times.
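To make the cumulative lag concrete, here is a minimal sketch that adds up per-stage latencies and compares the total with the 800ms budget. The stage timings are hypothetical placeholders chosen for illustration, not measurements from Sarvam AI, Swiggy, or any real system.

```javascript
// Illustrative latency budget check. All stage timings are hypothetical
// placeholders, not measurements from any real system.
const BUDGET_MS = 800;

function totalLatencyMs(stages) {
    // In a cascaded pipeline each stage waits for the previous one,
    // so the per-stage latencies add up.
    return stages.reduce((sum, stage) => sum + stage.ms, 0);
}

function checkBudget(stages, budgetMs = BUDGET_MS) {
    const total = totalLatencyMs(stages);
    return {
        total,
        withinBudget: total <= budgetMs,
        overBy: Math.max(0, total - budgetMs),
    };
}

const cascaded = [
    { name: "STT (waits for end of utterance)", ms: 500 },
    { name: "LLM (full completion)", ms: 900 },
    { name: "TTS (full synthesis)", ms: 400 },
];

console.log(checkBudget(cascaded));
// { total: 1800, withinBudget: false, overBy: 1000 }
```

Swap in your own measured timings per stage; the arithmetic is the same, and it shows quickly which stage has to start streaming before the previous one finishes.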

Phase 1: The Architectural Bet
We are moving from Request-Response to Streaming State Machines.

The Vendor Trap is relying on general-purpose, text-centric models for a multilingual, audio-first market. If you have to translate "Hinglish" to English just to understand an order, you’ve already lost the latency battle.

The Ownership Path is the Indic-Native Stack. Using Sarvam’s natively trained audio models allows us to go from speech to intent directly, without a translation hop. More importantly, we must implement a bi-directional WebSocket architecture so the agent can keep "listening" while it "speaks". That is the only way to handle the most difficult part of human conversation: the barge-in, where the user starts talking over the agent.
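One way to picture a "streaming state machine" is a small transition table driven by audio events. The sketch below is only an illustration: the state names, event names, and the ConversationState class are assumptions made up for this example, not part of any Sarvam or Swiggy API.

```javascript
// Minimal conversation state machine (a sketch; state and event names are
// assumptions for illustration only).
const TRANSITIONS = {
    LISTENING: { USER_SPEECH_END: "THINKING" },
    THINKING: { FIRST_AUDIO_READY: "SPEAKING", USER_SPEECH_START: "LISTENING" },
    SPEAKING: { PLAYBACK_DONE: "LISTENING", USER_SPEECH_START: "LISTENING" },
};

class ConversationState {
    constructor() {
        this.state = "LISTENING";
    }

    // Apply an event. Events that are not valid in the current state are ignored.
    // Returns true only for a barge-in (the user spoke while the agent was speaking).
    dispatch(event) {
        const next = TRANSITIONS[this.state][event];
        if (next === undefined) return false;
        const bargeIn = this.state === "SPEAKING" && event === "USER_SPEECH_START";
        this.state = next;
        return bargeIn;
    }
}

const convo = new ConversationState();
convo.dispatch("USER_SPEECH_END");   // LISTENING -> THINKING
convo.dispatch("FIRST_AUDIO_READY"); // THINKING -> SPEAKING
console.log(convo.dispatch("USER_SPEECH_START"), convo.state); // true LISTENING
```

Because the inbound audio events keep arriving while the agent is in SPEAKING, the same dispatch call that handles normal turn-taking also surfaces the barge-in, which is the point of keeping the socket bi-directional.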

Phase 2: Implementation (The Interruptible Voice Handler)
In a high-stakes environment like Swiggy, the agent must be able to stop mid-sentence and roll back its logic if the user changes their mind.

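// High-level logic for an interruptible voice kernel (illustrative sketch).
// Assumes `wsConnection` is a wrapper exposing processAudio(chunk); the raw
// WebSocket API has no such method.
class VoiceAgentKernel {
    constructor(wsConnection) {
        this.ws = wsConnection;
        this.isSpeaking = false;
        this.llmAbort = null;        // AbortController for the in-flight LLM call, if any
        this.transactionLock = null; // Staged, uncommitted order (tool-use safety)
    }

    // Detecting the "barge-in" (interruption)
    onUserSpeechDetected() {
        if (this.isSpeaking) {
            console.warn("SIGNAL: Interruption detected. Executing state rollback.");
            this.killAudioPlayback();
            this.abortCurrentLLMGeneration();
            this.clearPendingTransactions();
        }
    }

    async handleAudioStream(chunk) {
        // Stream raw audio to Sarvam's native Indic pipeline
        const response = await this.ws.processAudio(chunk);

        if (response.intent_confidence > 0.9) {
            // Pre-warm tools before the user even stops talking
            this.prepareOrderTransaction(response.entities);
        }
    }

    killAudioPlayback() {
        // Placeholder: flush the TTS playback buffer here
        this.isSpeaking = false;
    }

    abortCurrentLLMGeneration() {
        if (this.llmAbort) {
            this.llmAbort.abort();
            this.llmAbort = null;
        }
    }

    prepareOrderTransaction(entities) {
        // Stage the order only; replace any earlier staged order
        this.clearPendingTransactions();
        this.transactionLock = {
            entities,
            cancelled: false,
            cancel() { this.cancelled = true; },
        };
    }

    clearPendingTransactions() {
        // Essential: prevents the "Ghost Order" bug
        if (this.transactionLock) {
            this.transactionLock.cancel();
            this.transactionLock = null;
        }
    }
}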

Phase 3: The Senior Security & Testing Audit
I put this Swiggy-scale blueprint through a professional Senior QA & Security Audit. Here is why your "standard" voice agent will fail in the wild.

The "Ghost Order" Race Condition (Logic Fault)
The Fault: The agent says "Ordering your Paneer Tikka..." The user interrupts: "No, wait! Make it a Chicken Roll!"
The Audit: In naive implementations, the "Order Tool" is triggered the moment the LLM starts talking. If the user interrupts, the audio stops, but the backend API has already committed the Paneer Tikka. You now have a frustrated customer and a wasted order.
The Fix: Implement Deferred Commits. The tool call must remain in a PENDING state until audio playback reaches a "Commit Threshold" (e.g., 90% completion) or the user gives a final verbal confirmation.
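A deferred commit can be sketched as a tiny state holder that refuses to call the backend until one of the two conditions is met. The DeferredOrder class, the commitFn callback, and the default 0.9 threshold below are hypothetical names and values chosen for illustration.

```javascript
// Deferred-commit sketch. DeferredOrder, commitFn, and the 0.9 threshold are
// hypothetical names and values chosen for illustration.
class DeferredOrder {
    constructor(items, commitFn, commitThreshold = 0.9) {
        this.items = items;
        this.commitFn = commitFn; // called at most once, only on commit
        this.commitThreshold = commitThreshold;
        this.status = "PENDING";
    }

    // Called as TTS playback advances; progress is a fraction in [0, 1].
    onPlaybackProgress(progress) {
        if (progress >= this.commitThreshold) this.commit();
    }

    // Called when the user explicitly confirms the order.
    onVerbalConfirmation() {
        this.commit();
    }

    // Called on barge-in; has no effect once the order is committed.
    cancel() {
        if (this.status === "PENDING") this.status = "CANCELLED";
    }

    commit() {
        if (this.status !== "PENDING") return; // never commit twice or after a cancel
        this.status = "COMMITTED";
        this.commitFn(this.items);
    }
}

const committed = [];
const order = new DeferredOrder(["Paneer Tikka"], (items) => committed.push(...items));
order.onPlaybackProgress(0.4); // agent is mid-sentence
order.cancel();                // user barges in: "Make it a Chicken Roll!"
order.onPlaybackProgress(1.0); // late progress event is ignored
console.log(order.status, committed); // CANCELLED []
```

The key design choice is that cancel and commit are both one-way transitions out of PENDING, so a late playback event cannot resurrect an order the user already interrupted.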
The "Ambient Audio Injection" (Security Breach)
The Fault: The user is ordering food while walking past a loud TV. The TV says "Cancel all orders."
The Audit: Without Speaker Diarization, the agent cannot distinguish between the primary user and background noise. A malicious or accidental "audio injection" can trigger unauthorized actions.
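The Fix: Use Sarvam’s front-end audio processing to enforce Voice Activity Detection (VAD) with a noise-floor gate. If the audio signal doesn't match the primary speaker’s decibel profile or spatial characteristics, the kernel must ignore the intent. A level gate only rejects audio that is quieter than the user, so destructive actions such as "cancel all orders" should still require the same confirmation step as a new order.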
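The noise-floor part of that fix can be approximated with a plain energy gate: measure each frame's level and drop frames that do not rise far enough above the measured floor. This is a generic sketch, not Sarvam's front-end; the 12 dB margin and the frame format (Float32Array samples in the range -1 to 1) are assumptions.

```javascript
// Energy-based noise gate sketch. The 12 dB margin and the frame format
// (Float32Array samples in [-1, 1]) are assumptions for illustration.
function rmsDbfs(frame) {
    let sumSquares = 0;
    for (const sample of frame) sumSquares += sample * sample;
    const rms = Math.sqrt(sumSquares / frame.length);
    // 20 * log10(rms) is the level relative to full scale; silence is floored at -100 dBFS.
    return rms > 0 ? Math.max(-100, 20 * Math.log10(rms)) : -100;
}

function passesNoiseGate(frame, noiseFloorDbfs, marginDb = 12) {
    return rmsDbfs(frame) >= noiseFloorDbfs + marginDb;
}

const distantTv = new Float32Array(160).fill(0.01); // roughly -40 dBFS
const nearSpeech = new Float32Array(160).fill(0.3); // roughly -10.5 dBFS
console.log(passesNoiseGate(distantTv, -45));  // false
console.log(passesNoiseGate(nearSpeech, -45)); // true
```

An energy gate cannot tell two equally loud voices apart, so treat it as a first filter in front of whatever speaker-level checks your stack provides, not as a replacement for them.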
The "Colloquial Logic Bypass" (Semantic Security)
The Fault: Your security prompts are in English, but the user is speaking a dialect-heavy mix of Hindi and regional slang.
The Audit: Traditional English-centric guardrails often miss the nuance of regional insults or "Hinglish" social engineering attempts used to trick the agent into granting a 100% discount.
The Fix: Security filters must be Indic-Native. By using Sarvam’s regional guardrails, we ensure that semantic boundaries are enforced on the original utterance, not on a lossy English translation of it.
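Language-aware filters can be paired with a language-independent backstop: validate whatever action the agent proposes against a hard policy in code before it reaches the backend. The sketch below is an assumption-heavy illustration; MAX_DISCOUNT_PERCENT, the action shape, and enforceDiscountPolicy are all hypothetical and do not come from Sarvam or Swiggy.

```javascript
// Server-side policy backstop sketch. MAX_DISCOUNT_PERCENT and the action
// shape are hypothetical; the cap is enforced in code regardless of the
// language the request arrived in.
const MAX_DISCOUNT_PERCENT = 10;

function enforceDiscountPolicy(action) {
    // Actions that are not discounts pass through untouched.
    if (action.type !== "APPLY_DISCOUNT") return { allowed: true, action };

    const percent = Number(action.percent);
    const valid = Number.isFinite(percent) && percent >= 0 && percent <= MAX_DISCOUNT_PERCENT;
    if (!valid) {
        return { allowed: false, action: null, reason: "discount outside policy" };
    }
    return { allowed: true, action };
}

console.log(enforceDiscountPolicy({ type: "APPLY_DISCOUNT", percent: 100 }).allowed); // false
console.log(enforceDiscountPolicy({ type: "APPLY_DISCOUNT", percent: 5 }).allowed);   // true
```

However persuasive the prompt is in any dialect, the model can only propose the discount; the check that grants it never sees the conversation.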

Phase 4: Checklist (The Architect’s Standard)
[ ] Native Audio or Bust: If you are still converting audio to text before processing intent, your latency will never hit the 800ms gold standard.

[ ] Transactional Barge-in: Verify that every interruption triggers a State Rollback for any pending API calls.

[ ] Acoustic Hardening: Test your agent against 60dB of background "street noise" to ensure VAD stability.

[ ] Regional Edge-Cases: Audit your "Hinglish" logic. Does your agent understand the difference between a user "asking for a discount" and a user "threatening to cancel"?

The Bottom Line: Building for the next billion users requires an infrastructure that respects the speed of human thought. Sarvam AI provides the native Indic engine; your job is to build the Deterministic House that keeps the order safe.
