DEV Community

Teja

Killing Latency: Wiring a Voice-Activated Cyber Triage Bot with Gemini and Deepgram

Typing out incident reports while a server is actively under attack is a massive waste of time. I wanted a way to triage network threats using just my microphone, so I built Aegis-Twin.

Aegis-Twin is a voice-activated AI digital twin specifically engineered for cybersecurity triage. The concept is straightforward: I describe the anomaly out loud, the AI processes the threat context, and it speaks back immediate mitigation steps.

To make this work without lag, I had to completely ditch standard REST APIs for the audio ingestion and wire together Deepgram, Google Gemini, and Murf AI. Here is how the actual pipeline looks under the hood.
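Before the individual phases, here is a rough sketch of how one spoken report could flow through the three stages. It is an illustration, not the project's actual glue code: `handle_utterance`, `fake_reason`, and the `spoken` list are hypothetical names, and the stand-ins replace the real Gemini and Murf calls so the snippet runs without API keys.

```python
import asyncio
from typing import Awaitable, Callable


async def handle_utterance(
    transcript: str,
    reason: Callable[[str], Awaitable[str]],
    speak: Callable[[str], object],
) -> str:
    """Run one spoken report through reasoning and then speech output."""
    # Stage 2: ask the LLM for mitigation steps
    reply = await reason(transcript)
    # Stage 3: a blocking TTS call runs in a worker thread so the
    # event loop stays free to keep receiving audio and transcripts
    await asyncio.to_thread(speak, reply)
    return reply


# Stand-ins so the sketch runs without any API keys (hypothetical)
async def fake_reason(text: str) -> str:
    return f"Severity: high. Steps for: {text}"


spoken: list[str] = []
reply = asyncio.run(
    handle_utterance("port scan on 10.0.0.5", fake_reason, spoken.append)
)
```

Passing the reasoning and speech functions in as arguments keeps the orchestration testable on its own; in a real run they would be the Gemini and Murf wrappers.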


## The Latency Problem (And Why REST Fails)

If you record a WAV file, save it locally, POST it to a transcription API, wait for the response, and only then send the text to an LLM, every stage waits on the one before it, and the bot can easily take around 10 seconds to reply. In a security operations center (SOC), that is useless.
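To make the arithmetic concrete, here is an idealized back-of-the-envelope model. The stage names and timings are hypothetical placeholders, not measurements from Aegis-Twin; the point is only that a sequential pipeline pays the sum of every stage, while stages that overlap with speech drop out of the wait.

```python
def batch_latency(stage_seconds: dict[str, float]) -> float:
    """Sequential pipeline: every stage waits for the previous one."""
    return sum(stage_seconds.values())


def streaming_latency(stage_seconds: dict[str, float], overlapped: set[str]) -> float:
    """Stages in `overlapped` run while the analyst is still speaking,
    so this simplified model counts them as zero after the last word."""
    return sum(t for name, t in stage_seconds.items() if name not in overlapped)


# Hypothetical timings in seconds, counted from the moment speech ends
stages = {
    "save_and_upload_wav": 1.5,
    "batch_transcription": 3.0,
    "llm_reasoning": 3.5,
    "text_to_speech": 2.0,
}

total_batch = batch_latency(stages)  # 10.0
total_streaming = streaming_latency(
    stages, {"save_and_upload_wav", "batch_transcription"}
)  # 5.5
```

In practice a streaming transcriber still needs a short moment to finalize the last words, so the real saving is somewhat smaller than this model suggests.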

The practical way to make conversational AI feel responsive is to stream raw PCM audio chunks over a WebSocket in real time, so transcription happens while you are still talking instead of after the recording ends.
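As a sketch of what "raw PCM chunks" means in bytes, the snippet below slices a buffer into fixed-duration pieces. The capture format (16 kHz, 16-bit, mono) and the 100 ms chunk length are assumptions for illustration, and `chunk_size_bytes` and `iter_chunks` are hypothetical helper names.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed capture rate
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM
CHANNELS = 1              # mono microphone


def chunk_size_bytes(chunk_ms: int) -> int:
    """Bytes of raw PCM that cover `chunk_ms` milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)  # 1 s of silence
chunks = list(iter_chunks(one_second, chunk_ms=100))
```

Each chunk would be sent over the socket as soon as it is captured. Smaller chunks lower the delay before the first transcript but add more messages per second.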

## Phase 1: Streaming Audio to Deepgram

I used Deepgram for the speech-to-text layer because its streaming WebSocket API is fast: instead of waiting for me to stop speaking, it transcribes the audio stream on the fly.
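When interim results are enabled, Deepgram's streaming messages carry `is_final` and `speech_final` flags, and one common pattern is to buffer finalized segments until a pause is detected. The sketch below shows that pattern in isolation; `UtteranceAssembler` is a hypothetical helper, and the events are hand-written stand-ins rather than real API output.

```python
from typing import Optional


class UtteranceAssembler:
    """Collect finalized transcript segments until the speaker pauses."""

    def __init__(self) -> None:
        self._segments: list[str] = []

    def feed(self, transcript: str, is_final: bool, speech_final: bool) -> Optional[str]:
        # Interim results may still be revised, so they are not stored
        if is_final and transcript:
            self._segments.append(transcript)
        # A detected pause closes the utterance; flush what was collected
        if speech_final and self._segments:
            utterance = " ".join(self._segments)
            self._segments.clear()
            return utterance
        return None


assembler = UtteranceAssembler()
events = [
    ("we are seeing", False, False),             # interim guess, ignored
    ("we are seeing repeated", True, False),     # finalized segment
    ("ssh failures", False, False),              # interim guess, ignored
    ("ssh failures from one host", True, True),  # finalized, speaker paused
]
results = [assembler.feed(*event) for event in events]
```

Only the last event returns a full utterance, which is the natural point to hand the text to the LLM instead of triggering it on every partial phrase.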

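Here is the core async connection setup in Python:

```python
import os

from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents


async def stream_mic_audio():
    # Read the key from the environment (e.g. loaded from your .env); never hardcode it
    client = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])
    # Async live client (deepgram-sdk v3; newer releases call it `asyncwebsocket`).
    # The sync `listen.live` client cannot be awaited.
    dg_connection = client.listen.asynclive.v("1")

    # Fires for each transcript result Deepgram sends back over the WebSocket
    async def on_transcript(self, result, **kwargs):
        sentence = result.channel.alternatives[0].transcript
        if sentence:
            print(f"Analyst Input: {sentence}")
            # Already inside the running event loop, so await instead of asyncio.run()
            await trigger_gemini_reasoning(sentence)

    dg_connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    # nova-2 is currently their fastest model for tech jargon.
    # Raw PCM has no header, so the format is declared (16 kHz linear16 assumed).
    options = LiveOptions(
        model="nova-2", language="en-US", encoding="linear16", sample_rate=16000
    )
    if not await dg_connection.start(options):
        raise RuntimeError("Could not open the Deepgram WebSocket")

    # Microphone capture is not shown: forward each PCM chunk with
    # `await dg_connection.send(chunk)` and close with `await dg_connection.finish()`
```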

## Phase 2: Constraining Google Gemini

```python
import os

import google.generativeai as genai

# Configure once at import time; read the key from the environment, never hardcode it
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Force the LLM to act like a Tier 3 SOC Engineer
SYSTEM_RULES = """
You are a cybersecurity triage AI.
Do not use conversational filler.
Analyze the threat, state the severity, and output exactly two immediate mitigation steps.
"""

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    system_instruction=SYSTEM_RULES,
)


async def trigger_gemini_reasoning(threat_text):
    # Use the async variant so the request does not block the event loop
    response = await model.generate_content_async(threat_text)
    print(f"Aegis-Twin: {response.text}")
    return response.text
```

## Phase 3: Giving It a Voice with Murf AI

```python
import os

import requests


def speak_mitigation(triage_text):
    url = "https://api.murf.ai/v1/speech/generate"
    headers = {
        # Murf's API reference lists an `api-key` header; confirm for your account
        "api-key": os.environ["MURF_API_KEY"],
        "Content-Type": "application/json",
    }
    # Using a crisp, professional voice model
    payload = {
        "voiceId": "en-US-marcus",
        "text": triage_text,
    }

    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
    # The field holding the audio URL is assumed to be `audioFile`; check the response schema
    audio_url = response.json().get("audioFile")
    # Pipe this URL to your system's audio output device
    return audio_url
```
